baseados em gpu: Topics by Science.gov

Sample records for baseados em gpu

Massive parallelization of a 3D finite difference electromagnetic forward solution using domain decomposition methods on multiple CUDA enabled GPUs

NASA Astrophysics Data System (ADS)

Schultz, A.

2010-12-01

3D forward solvers lie at the core of inverse formulations used to image the variation of electrical conductivity within the Earth's interior. This property is associated with variations in temperature, composition, phase, presence of volatiles, and in specific settings, the presence of groundwater, geothermal resources, oil/gas or minerals. The high cost of 3D solutions has been a stumbling block to wider adoption of 3D methods. Parallel algorithms for modeling frequency domain 3D EM problems have not achieved wide scale adoption, with emphasis on fairly coarse grained parallelism using MPI and similar approaches. The communications bandwidth as well as the latency required to send and receive network communication packets is a limiting factor in implementing fine grained parallel strategies, inhibiting wide adoption of these algorithms. Leading Graphics Processor Unit (GPU) companies now produce GPUs with hundreds of GPU processor cores per die. The footprint, in silicon, of the GPU's restricted instruction set is much smaller than the general purpose instruction set required of a CPU. Consequently, the density of processor cores on a GPU can be much greater than on a CPU. GPUs also have local memory, registers and high speed communication with host CPUs, usually through PCIe type interconnects. The extremely low cost and high computational power of GPUs provides the EM geophysics community with an opportunity to achieve fine grained (i.e. massive) parallelization of codes on low cost hardware. The current generation of GPUs (e.g. NVidia Fermi) provides 3 billion transistors per chip die, with nearly 500 processor cores and up to 6 GB of fast (DDR5) GPU memory. This latest generation of GPU supports fast hardware double precision (64 bit) floating point operations of the type required for frequency domain EM forward solutions. Each Fermi GPU board can sustain nearly 1 TFLOP in double precision, and multiple boards can be installed in the host computer system. We describe our ongoing efforts to achieve massive parallelization on a novel hybrid GPU testbed machine currently configured with 12 Intel Westmere Xeon CPU cores (or 24 parallel computational threads) with 96 GB DDR3 system memory, 4 GPU subsystems which in aggregate contain 960 NVidia Tesla GPU cores with 16 GB dedicated DDR3 GPU memory, and a second interleved bank of 4 GPU subsystems containing in aggregate 1792 NVidia Fermi GPU cores with 12 GB dedicated DDR5 GPU memory. We are applying domain decomposition methods to a modified version of Weiss' (2001) 3D frequency domain full physics EM finite difference code, an open source GPL licensed f90 code available for download from www.OpenEM.org. This will be the core of a new hybrid 3D inversion that parallelizes frequencies across CPUs and individual forward solutions across GPUs. We describe progress made in modifying the code to use direct solvers in GPU cores dedicated to each small subdomain, iteratively improving the solution by matching adjacent subdomain boundary solutions, rather than iterative Krylov space sparse solvers as currently applied to the whole domain.
AKDNR - DNR Business Reporting System (DBRS)

Science.gov Websites

Skip to content State of Alaska myAlaska My Government Resident Business in Alaska Visiting Alaska Resources > IRM GPU > Main Menu DNR Business Reporting System (DBRS) The DNR Business Reporting System (DBRS) allows users to generate reports from the DNR Business databases and maps. The reports offered
Peregrine Software Toolchains | High-Performance Computing | NREL

Science.gov Websites

toolchain is an open-source alternative against which many technical applications are natively developed and tested. The Portland Group compilers are not fully supported, but are available to the HPC community. Use Group (PGI) C/C++ and Fortran (partially supported) The PGI Accelerator compilers include NVIDIA GPU
Estudo comparativo entre estrelas centrais de nebulosas planetárias deficientes em hidrogênio

NASA Astrophysics Data System (ADS)

Marcolino, W. L. F.; de Araújo, F. X.

2003-08-01

Apresentamos neste trabalho o resultado de um estudo das principais características espectrais das estrelas centrais de nebulosas planetárias (ECNP) deficientes em hidrogênio. A origem e a evolução dessas estrelas ainda constitui um problema em aberto na evolução estelar. Geralmente esses objetos são divididos em [WCE], [WCL] e [WELS]. Os tipos [WCE] e [WCL] apresentam um espectro típico de uma estrela Wolf-Rayet carbonada de população I e as [WELS] apresentam linhas fracas de carbono e oxigênio em emissão. Existem evidências que apontam a seguinte sequência evolutiva : [WCL] = > [WCE] = > [WELS] = > PG 1159 (pré anã-branca). No entanto, tal cenário apresenta falhas como por exemplo a falta de ECNP entre os tipos [WCL] e [WCE]. Baseados em uma amostra de 24 objetos obtida no telescópio de 1.52m em La Silla, Chile (acordo ESO/ON), ao longo do ano 2000, apresentamos os resultados da comparação das larguras equivalentes de diversas linhas relevantes entre os tipos [WCL], [WCE] e [WELS]. Verificamos que nossos dados estão de acordo com a sequência evolutiva. Baseado nas linhas de C IV, conseguimos dividir pela primeira vez as [WELS] em dois grupos principais. Além disso, os dados reforçam a afirmação de que as [WCE] são as estrelas que possuem a maior temperatura entre as ECNP deficientes em hidrogênio. Discutimos ainda, a escassez de dados disponíveis na literatura e a necessidade da obtenção de parametros físicos para estes objetos.
Ultrasonic location system =

NASA Astrophysics Data System (ADS)

Albuquerque, Daniel Filipe

Esta tese apresenta um sistema de localizacao baseado exclusivamente em ultrassons, nao necessitando de recorrer a qualquer outra tecnologia. Este sistema de localizacao foi concebido para poder operar em ambientes onde qualquer outra tecnologia nao pode ser utilizada ou o seu uso esta condicionado, como sao exemplo aplicacoes subaquaticas ou ambientes hospitalares. O sistema de localizacao proposto faz uso de uma rede de farois fixos permitindo que estacoes moveis se localizem. Devido a necessidade de transmissao de dados e medicao de distancias foi desenvolvido um pulso de ultrassons robusto a ecos que permite realizar ambas as tarefas com sucesso. O sistema de localizacao permite que as estacoes moveis se localizem escutando apenas a informacao em pulsos de ultrassons enviados pelos farois usando para tal um algoritmo baseado em diferencas de tempo de chegada. Desta forma a privacidade dos utilizadores e garantida e o sistema torna-se completamente independente do numero de utilizadores. Por forma a facilitar a implementacao da rede de farois apenas sera necessario determinar manualmente a posicao de alguns dos farois, designados por farois ancora. Estes irao permitir que os restantes farois, completamente autonomos, se possam localizar atraves de um algoritmo iterativo de localizacao baseado na minimizacao de uma funcao de custo. Para que este sistema possa funcionar como previsto sera necessario que os farois possam sincronizar os seus relogios e medir a distancia entre eles. Para tal, esta tese propoe um protocolo de sincronizacao de relogio que permite tambem obter as medidas de distancia entre os farois trocando somente tres mensagens de ultrassons. Adicionalmente, o sistema de localizacao permite que farois danificados possam ser substituidos sem comprometer a operabilidade da rede reduzindo a complexidade na manutencao. Para alem do mencionado, foi igualmente implementado um simulador de ultrassons para ambientes fechados, o qual provou ser bastante preciso e uma ferramenta de elevado valor para simular o comportamento do sistema de localizacao sobre condicoes controladas.
Fiber-optic components for optical communicatios and sensing =

NASA Astrophysics Data System (ADS)

Marques, Carlos Alberto Ferreira

Nos ultimos anos, a Optoelectronica tem sido estabelecida como um campo de investigacao capaz de conduzir a novas solucoes tecnologicas. As conquistas abundantes no campo da optica e lasers, bem como em comunicacoes opticas tem sido de grande importancia e desencadearam uma serie de inovacoes. Entre o grande numero de componentes opticos existentes, os componentes baseados em fibra optica sao principalmente relevantes devido a sua simplicidade e a elevada de transporte de dados da fibra optica. Neste trabalho foi focado um destes componentes opticos: as redes de difraccao em fibra optica, as quais tem propriedades opticas de processamento unicas. Esta classe de componentes opticos e extremamente atraente para o desenvolvimento de dispositivos de comunicacoes opticas e sensores. O trabalho comecou com uma analise teorica aplicada a redes em fibra e foram focados os metodos de fabricacao de redes em fibra mais utilizados. A inscricao de redes em fibra tambem foi abordado neste trabalho, onde um sistema de inscricao automatizada foi implementada para a fibra optica de silica, e os resultados experimentais mostraram uma boa aproximacao ao estudo de simulacao. Tambem foi desenvolvido um sistema de inscricao de redes de Bragg em fibra optica de plastico. Foi apresentado um estudo detalhado da modulacao acustico-optica em redes em fibra optica de silica e de plastico. Por meio de uma analise detalhada dos modos de excitacao mecanica aplicadas ao modulador acustico-optico, destacou-se que dois modos predominantes de excitacao acustica pode ser estabelecidos na fibra optica, dependendo da frequencia acustica aplicada. Atraves dessa caracterizacao, foi possivel desenvolver novas aplicacoes para comunicacoes opticas. Estudos e implementacao de diferentes dispositivos baseados em redes em fibra foram realizados, usando o efeito acustico-optico e o processo de regeneracao em fibra optica para varias aplicacoes tais como rapido multiplexador optico add-drop, atraso de grupo sintonizavel de redes de Bragg, redes de Bragg com descolamento de fase sintonizaveis, metodo para a inscricao de redes de Bragg com perfis complexos, filtro sintonizavel para equalizacao de ganho e filtros opticos notch ajustaveis.
Accelerating Computation of DCM for ERP in MATLAB by External Function Calls to the GPU.

PubMed

Wang, Wei-Jen; Hsieh, I-Fan; Chen, Chun-Chuan

2013-01-01

This study aims to improve the performance of Dynamic Causal Modelling for Event Related Potentials (DCM for ERP) in MATLAB by using external function calls to a graphics processing unit (GPU). DCM for ERP is an advanced method for studying neuronal effective connectivity. DCM utilizes an iterative procedure, the expectation maximization (EM) algorithm, to find the optimal parameters given a set of observations and the underlying probability model. As the EM algorithm is computationally demanding and the analysis faces possible combinatorial explosion of models to be tested, we propose a parallel computing scheme using the GPU to achieve a fast estimation of DCM for ERP. The computation of DCM for ERP is dynamically partitioned and distributed to threads for parallel processing, according to the DCM model complexity and the hardware constraints. The performance efficiency of this hardware-dependent thread arrangement strategy was evaluated using the synthetic data. The experimental data were used to validate the accuracy of the proposed computing scheme and quantify the time saving in practice. The simulation results show that the proposed scheme can accelerate the computation by a factor of 155 for the parallel part. For experimental data, the speedup factor is about 7 per model on average, depending on the model complexity and the data. This GPU-based implementation of DCM for ERP gives qualitatively the same results as the original MATLAB implementation does at the group level analysis. In conclusion, we believe that the proposed GPU-based implementation is very useful for users as a fast screen tool to select the most likely model and may provide implementation guidance for possible future clinical applications such as online diagnosis.
Accelerating Computation of DCM for ERP in MATLAB by External Function Calls to the GPU

PubMed Central

Wang, Wei-Jen; Hsieh, I-Fan; Chen, Chun-Chuan

2013-01-01

This study aims to improve the performance of Dynamic Causal Modelling for Event Related Potentials (DCM for ERP) in MATLAB by using external function calls to a graphics processing unit (GPU). DCM for ERP is an advanced method for studying neuronal effective connectivity. DCM utilizes an iterative procedure, the expectation maximization (EM) algorithm, to find the optimal parameters given a set of observations and the underlying probability model. As the EM algorithm is computationally demanding and the analysis faces possible combinatorial explosion of models to be tested, we propose a parallel computing scheme using the GPU to achieve a fast estimation of DCM for ERP. The computation of DCM for ERP is dynamically partitioned and distributed to threads for parallel processing, according to the DCM model complexity and the hardware constraints. The performance efficiency of this hardware-dependent thread arrangement strategy was evaluated using the synthetic data. The experimental data were used to validate the accuracy of the proposed computing scheme and quantify the time saving in practice. The simulation results show that the proposed scheme can accelerate the computation by a factor of 155 for the parallel part. For experimental data, the speedup factor is about 7 per model on average, depending on the model complexity and the data. This GPU-based implementation of DCM for ERP gives qualitatively the same results as the original MATLAB implementation does at the group level analysis. In conclusion, we believe that the proposed GPU-based implementation is very useful for users as a fast screen tool to select the most likely model and may provide implementation guidance for possible future clinical applications such as online diagnosis. PMID:23840507
Observações no infravermelho médio de objetos estelares jovens em NGC 3576

NASA Astrophysics Data System (ADS)

Barbosa, C.; Damineli, A.; Blum, R.; Conti, P.

2003-08-01

Apresentamos os resultados de observações no infravermelho médio de candidatos a objetos estelares jovens e massivos em NGC 3576. As imagens de alta resolução foram obtidas no observatório Gemini Sul com o uso dos filtros em 10,8, 7,9, 9,8, 12,5 e 18,2 mm. Nossas imagens mostram a fonte IRS 1 resolvida em 4 objetos pela primeira vez em 10 mm. Para cada objeto obtivemos a distribuição espectral de energia de 1.2 até 18 mm, bem como a temperatura de cor, a distribuição espacial e a profundidade óptica em 9,8 mm da poeira circunstelar. Apresentamos uma estimativa das massas dos objetos estudados, baseados na luminosidade emitida no infravermelho médio, bem como um modelo para explicar as diferentes características observadas de cada objeto. Finalmente discutimos a possível localização da(s) fonte(s) de ionização de NGC 3576.
Silver segregation in Ag/a-C nanocomposite coatings for potential application as antibacterial surfaces

NASA Astrophysics Data System (ADS)

Manninen, Noora Kristiina Alves de Sousa

O desenvolvimento de superficies antibacterianas representa um desafio atual em diferentes aplicacoes industriais, nomeadamente, dispositivos medicos, embalagens alimentares, texteis e sistemas de tratamento de agua. A maioria das bacterias existe em biofilmes que aderem fortemente a diferentes tipos de superficies uma vez que esta adesao representa um mecanismo estrategico de sobrevivencia. O fenomeno da adesao e colonizacao microbiana resulta na falha de diferentes dispositivos e componentes utilizados nas aplicacoes acima mencionadas, tendo como consequencia perdas economicas elevadas e representando tambem um problema de saude publica quando se tratam de aplicacoes como dispositivos medicos ou embalagens alimentares. Neste sentido, ao longo das ultimas decadas o desenvolvimento de superficies antibacterianas tem sido considerada uma estrategia emergente no desenvolvimento de materiais mais eficientes a serem aplicados em diferentes sectores. O objetivo da presente tese consiste no desenvolvimento e caracterizacao de revestimentos nanocompositos multifuncionais baseados em revestimentos de carbono amorfo dopado com nanoparticulas de prata (Ag/a-C) para potencial aplicacao em superficies antibacterianas. A Ag e atualmente considerada como o agente bactericida mais promissor e eficiente, sendo que as nanoparticulas de prata representam o material mais comercializado na area da nanotecnologia. A estrategia de modificacao superficial com revestimentos baseados em carbono amorfo (a-C) tem-se tornado popular do ponto de vista industrial essencialmente, devido entre outras propriedades, a sua resistencia ao desgaste tribologico excecional, que permite combinar uma elevada dureza com um baixo coeficiente de atrito, elevada estabilidade quimica, resistencia a corrosao e biocompatibilidade em diferentes aplicacoes biomedicas. Na atualidade os revestimentos de a-C sao utilizados em diferentes aplicacoes industriais nomeadamente dispositivos medicos, laminas de barbear e diferentes componentes mecanicos sujeitos a elevado desgaste tribologico. Neste sentido, a combinacao das propriedades intrinsecas destes materiais pode ser considerada uma abordagem promissora para o desenvolvimento de revestimentos multifuncionais, os quais podem ser aplicados em diferentes produtos, nomeadamente, dispositivos medicos. Na presente tese os revestimentos nanocompositos de Ag/a-C sao depositados por dois metodos distintos: (i) pulverizacao catodica em magnetrao e (ii) combinacao da pulverizacao catodica em magnetrao para deposicao da camada de a-C e condensacao em atmosfera inerte para a incorporacao simultanea de nanoparticulas de Ag na matrix de carbono. Os metodos acima mencionados sao comparados em relacao a uniformidade dos revestimentos depositados, permitindo efetuar a escolha do metodo de deposicao mais eficaz (pulverizacao catodica em magnetrao). Os revestimentos nanocompositos de Ag/a-C sao caraterizados relativamente a sua estrutura, estabilidade termodinamica em condicoes ambientais e propriedades funcionais (comportamento tribologico e atividade antibacteriana). O trabalho central da tese e focado na caraterizacao de revestimentos Ag/a-C contendo 20% at. de Ag, com diferentes espessuras e diferentes estruturas em multicamada. Os resultados sugerem que os revestimentos Ag/a-C sao instaveis mesmo em condicoes ambientais, sendo observado que a Ag e forma nanofibras entre as fronteiras das colunas, as quais recobrem a superficie do revestimento poucas semanas apos a producao. O processo de formacao de nanofibras e promovido pela humidade, sendo que, as particulas crescem atraves de um processo de coalescencia. As propriedades funcionais sugerem que os revestimentos Ag/a-C sao promissores do ponto de vista de actividade antibacteriana, a qual esta relacionada com a sua ionizacao. Os testes tribologicos revelam que em ambiente nao lubrificado a presenca da Ag promove a degradacao dos revestimentos a-C, contudo, em meios biologicos que simulam o liquido sinovial, presente nas articulacoes da anca, o comportamento tribologico e semelhante aos revestimentos a-C.
GASPACHO: a generic automatic solver using proximal algorithms for convex huge optimization problems

NASA Astrophysics Data System (ADS)

Goossens, Bart; Luong, Hiêp; Philips, Wilfried

2017-08-01

Many inverse problems (e.g., demosaicking, deblurring, denoising, image fusion, HDR synthesis) share various similarities: degradation operators are often modeled by a specific data fitting function while image prior knowledge (e.g., sparsity) is incorporated by additional regularization terms. In this paper, we investigate automatic algorithmic techniques for evaluating proximal operators. These algorithmic techniques also enable efficient calculation of adjoints from linear operators in a general matrix-free setting. In particular, we study the simultaneous-direction method of multipliers (SDMM) and the parallel proximal algorithm (PPXA) solvers and show that the automatically derived implementations are well suited for both single-GPU and multi-GPU processing. We demonstrate this approach for an Electron Microscopy (EM) deconvolution problem.
Gctf: Real-time CTF determination and correction

PubMed Central

Zhang, Kai

2016-01-01

Accurate estimation of the contrast transfer function (CTF) is critical for a near-atomic resolution cryo electron microscopy (cryoEM) reconstruction. Here, a GPU-accelerated computer program, Gctf, for accurate and robust, real-time CTF determination is presented. The main target of Gctf is to maximize the cross-correlation of a simulated CTF with the logarithmic amplitude spectra (LAS) of observed micrographs after background subtraction. Novel approaches in Gctf improve both speed and accuracy. In addition to GPU acceleration (e.g. 10–50×), a fast ‘1-dimensional search plus 2-dimensional refinement (1S2R)’ procedure further speeds up Gctf. Based on the global CTF determination, the local defocus for each particle and for single frames of movies is accurately refined, which improves CTF parameters of all particles for subsequent image processing. Novel diagnosis method using equiphase averaging (EPA) and self-consistency verification procedures have also been implemented in the program for practical use, especially for aims of near-atomic reconstruction. Gctf is an independent program and the outputs can be easily imported into other cryoEM software such as Relion (Scheres, 2012) and Frealign (Grigorieff, 2007). The results from several representative datasets are shown and discussed in this paper. PMID:26592709
Elastography using multi-stream GPU: an application to online tracked ultrasound elastography, in-vivo and the da Vinci Surgical System.

PubMed

Deshmukh, Nishikant P; Kang, Hyun Jae; Billings, Seth D; Taylor, Russell H; Hager, Gregory D; Boctor, Emad M

2014-01-01

A system for real-time ultrasound (US) elastography will advance interventions for the diagnosis and treatment of cancer by advancing methods such as thermal monitoring of tissue ablation. A multi-stream graphics processing unit (GPU) based accelerated normalized cross-correlation (NCC) elastography, with a maximum frame rate of 78 frames per second, is presented in this paper. A study of NCC window size is undertaken to determine the effect on frame rate and the quality of output elastography images. This paper also presents a novel system for Online Tracked Ultrasound Elastography (O-TRuE), which extends prior work on an offline method. By tracking the US probe with an electromagnetic (EM) tracker, the system selects in-plane radio frequency (RF) data frames for generating high quality elastograms. A novel method for evaluating the quality of an elastography output stream is presented, suggesting that O-TRuE generates more stable elastograms than generated by untracked, free-hand palpation. Since EM tracking cannot be used in all systems, an integration of real-time elastography and the da Vinci Surgical System is presented and evaluated for elastography stream quality based on our metric. The da Vinci surgical robot is outfitted with a laparoscopic US probe, and palpation motions are autonomously generated by customized software. It is found that a stable output stream can be achieved, which is affected by both the frequency and amplitude of palpation. The GPU framework is validated using data from in-vivo pig liver ablation; the generated elastography images identify the ablated region, outlined more clearly than in the corresponding B-mode US images.
Elastography Using Multi-Stream GPU: An Application to Online Tracked Ultrasound Elastography, In-Vivo and the da Vinci Surgical System

PubMed Central

Deshmukh, Nishikant P.; Kang, Hyun Jae; Billings, Seth D.; Taylor, Russell H.; Hager, Gregory D.; Boctor, Emad M.

2014-01-01

A system for real-time ultrasound (US) elastography will advance interventions for the diagnosis and treatment of cancer by advancing methods such as thermal monitoring of tissue ablation. A multi-stream graphics processing unit (GPU) based accelerated normalized cross-correlation (NCC) elastography, with a maximum frame rate of 78 frames per second, is presented in this paper. A study of NCC window size is undertaken to determine the effect on frame rate and the quality of output elastography images. This paper also presents a novel system for Online Tracked Ultrasound Elastography (O-TRuE), which extends prior work on an offline method. By tracking the US probe with an electromagnetic (EM) tracker, the system selects in-plane radio frequency (RF) data frames for generating high quality elastograms. A novel method for evaluating the quality of an elastography output stream is presented, suggesting that O-TRuE generates more stable elastograms than generated by untracked, free-hand palpation. Since EM tracking cannot be used in all systems, an integration of real-time elastography and the da Vinci Surgical System is presented and evaluated for elastography stream quality based on our metric. The da Vinci surgical robot is outfitted with a laparoscopic US probe, and palpation motions are autonomously generated by customized software. It is found that a stable output stream can be achieved, which is affected by both the frequency and amplitude of palpation. The GPU framework is validated using data from in-vivo pig liver ablation; the generated elastography images identify the ablated region, outlined more clearly than in the corresponding B-mode US images. PMID:25541954
Software Accelerates Computing Time for Complex Math

NASA Technical Reports Server (NTRS)

2014-01-01

Ames Research Center awarded Newark, Delaware-based EM Photonics Inc. SBIR funding to utilize graphic processing unit (GPU) technology- traditionally used for computer video games-to develop high-computing software called CULA. The software gives users the ability to run complex algorithms on personal computers with greater speed. As a result of the NASA collaboration, the number of employees at the company has increased 10 percent.
Implementation of collisions on GPU architecture in the Vorpal code

NASA Astrophysics Data System (ADS)

Leddy, Jarrod; Averkin, Sergey; Cowan, Ben; Sides, Scott; Werner, Greg; Cary, John

2017-10-01

The Vorpal code contains a variety of collision operators allowing for the simulation of plasmas containing multiple charge species interacting with neutrals, background gas, and EM fields. These existing algorithms have been improved and reimplemented to take advantage of the massive parallelization allowed by GPU architecture. The use of GPUs is most effective when algorithms are single-instruction multiple-data, so particle collisions are an ideal candidate for this parallelization technique due to their nature as a series of independent processes with the same underlying operation. This refactoring required data memory reorganization and careful consideration of device/host data allocation to minimize memory access and data communication per operation. Successful implementation has resulted in an order of magnitude increase in simulation speed for a test-case involving multiple binary collisions using the null collision method. Work supported by DARPA under contract W31P4Q-16-C-0009.
Scalable and Interactive Segmentation and Visualization of Neural Processes in EM Datasets

PubMed Central

Jeong, Won-Ki; Beyer, Johanna; Hadwiger, Markus; Vazquez, Amelio; Pfister, Hanspeter; Whitaker, Ross T.

2011-01-01

Recent advances in scanning technology provide high resolution EM (Electron Microscopy) datasets that allow neuroscientists to reconstruct complex neural connections in a nervous system. However, due to the enormous size and complexity of the resulting data, segmentation and visualization of neural processes in EM data is usually a difficult and very time-consuming task. In this paper, we present NeuroTrace, a novel EM volume segmentation and visualization system that consists of two parts: a semi-automatic multiphase level set segmentation with 3D tracking for reconstruction of neural processes, and a specialized volume rendering approach for visualization of EM volumes. It employs view-dependent on-demand filtering and evaluation of a local histogram edge metric, as well as on-the-fly interpolation and ray-casting of implicit surfaces for segmented neural structures. Both methods are implemented on the GPU for interactive performance. NeuroTrace is designed to be scalable to large datasets and data-parallel hardware architectures. A comparison of NeuroTrace with a commonly used manual EM segmentation tool shows that our interactive workflow is faster and easier to use for the reconstruction of complex neural processes. PMID:19834227
Properties of self-assembled diluted magnetic semiconductor nanostructures =

NASA Astrophysics Data System (ADS)

Ankiewicz, Amelia Olga Goncalves

Este trabalho centra-se na investigacao da possibilidade de se conseguir um semicondutor magnetico diluido (SMD) baseado em ZnO. Foi levado a cabo um estudo detalhado das propriedades magneticas e estruturais de estruturas de ZnO, nomeadamente nanofios (NFs), nanocristais (NCs) e filmes finos, dopadas com metais de transicao (MTs). Foram usadas varias tecnicas experimentais para caracterizar estas estruturas, designadamente difraccao de raios-X, microscopia electronica de varrimento, ressonancia magnetica, SQUID, e medidas de transporte. Foram incorporados substitucionalmente nos sitios do Zn ioes de Mn2+ e Co2+ em ambos os NFs e NCs de ZnO. Revelou-se para ambos os ioes dopantes, que a incorporacao e heterogenea, uma vez que parte do sinal de ressonancia paramagnetica electronica (RPE) vem de ioes de MTs em ambientes distorcidos ou enriquecidos com MTs. A partir das intensidades relativas dos espectros de RPE e de modificacoes da superficie, demonstra-se ainda que os NCs exibem uma estrutura core-shell. Os resultados, evidenciam que, com o aumento da concentracao de MTs, a dimensao dos NCs diminui e aumentam as distorcoes da rede. Finalmente, no caso dos NCs dopados com Mn, obteve-se o resultado singular de que a espessura da shell e da ordem de 0.3 nm e de que existe uma acumulacao de Mn na mesma. Com o objectivo de esclarecer o papel dos portadores de carga na medicao das interaccoes ferromagneticas, foram co-dopados filmes de ZnO com Mn e Al ou com Co e Al. Os filmes dopados com Mn, revelaram-se simplesmente paramagneticos, com os ioes de Mn substitucionais nos sitios do Zn. Por outro lado, os filmes dopados com Co exibem ferromagnetismo fraco nao intrinseco, provavelmente devido a decomposicao spinodal. Foram ainda efectuados estudos comparativos com filmes de ligas de Zn1-xFexO. Como era de esperar, detectaram-se segundas fases de espinela e de oxido de ferro nestas ligas; todas as amostras exibiam curvas de histerese a 300 K. Estes resultados suportam a hipotese de que as segundas fases sao responsaveis pelo comportamento magnetico observado em muitos sistemas baseados em ZnO. Nao se observou nenhuma evidencia de ferromagnetismo mediado por portadores de carga. As experiencias mostram que a analise de RPE permite demonstrar directamente se e onde estao incorporados os ioes de MTs e evidenciam a importancia dos efeitos de superficie para dimensoes menores que 15 nm, para as quais se formam estruturas core-shell. As investigacoes realizadas no ambito desta tese demonstram que nenhuma das amostras de ZnO estudadas exibiram propriedades de um SMD intrinseco e que, no futuro, sao necessarios estudos teoricos e experimentais detalhados das interaccoes de troca entre os ioes de MTs e os atomos do ZnO para determinar a origem das propriedades magneticas observadas.
Spiritual Coping: A Focus of New Nursing Diagnoses.

PubMed

Cabaço, Sandra Rosado; Caldeira, Sílvia; Vieira, Margarida; Rodgers, Beth

2017-03-01

To define the antecedents, consequents, and attributes of spiritual coping. Rodgers' evolutionary model for concept analysis was used to guide an integrative literature review of qualitative research. Six qualitative articles were included and elements that define and contextualize the concept were identified. Three new nursing diagnoses are proposed, based on qualitative findings. These new diagnoses should be submitted to clinical validation in different cultural and religious backgrounds, but the inclusion in the taxonomy highlights a holistic perspective concerning the spiritual dimension of patients' responses in life and health transitions, and so, bringing the approach to spirituality into nursing practice. Definir os antecedentes, os consequentes e os atributos de coping espiritual. MÉTODOS: Modelo evolucionário de análise de conceitos de Beth Rodgers baseado numa revisão integrativa de literatura de pesquisa qualitativa. Seis pesquisas qualitativas foram incluídas e os elementos que definem e contextualizam o conceito foram identificados. CONCLUSÕES: São propostos três novos diagnósticos de enfermagem, baseados na evidência de estudos qualitativos. IMPLICAÇÕES PARA A PRÁTICA: Estes novos diagnósticos devem ser submetidos a estudos de validação clínica em diferentes contextos culturais e religiosos, e quando incluídos na taxonomia estarão a enfatizar uma perspectiva holística das respostas dos pacientes relacionada à dimensão espiritual e, assim, promovendo a inclusão da espiritualidade na prática clínica. © 2017 NANDA International, Inc.
Accelerating three-dimensional FDTD calculations on GPU clusters for electromagnetic field simulation.

PubMed

Nagaoka, Tomoaki; Watanabe, Soichi

2012-01-01

Electromagnetic simulation with anatomically realistic computational human model using the finite-difference time domain (FDTD) method has recently been performed in a number of fields in biomedical engineering. To improve the method's calculation speed and realize large-scale computing with the computational human model, we adapt three-dimensional FDTD code to a multi-GPU cluster environment with Compute Unified Device Architecture and Message Passing Interface. Our multi-GPU cluster system consists of three nodes. The seven GPU boards (NVIDIA Tesla C2070) are mounted on each node. We examined the performance of the FDTD calculation on multi-GPU cluster environment. We confirmed that the FDTD calculation on the multi-GPU clusters is faster than that on a multi-GPU (a single workstation), and we also found that the GPU cluster system calculate faster than a vector supercomputer. In addition, our GPU cluster system allowed us to perform the large-scale FDTD calculation because were able to use GPU memory of over 100 GB.

gpuPOM: a GPU-based Princeton Ocean Model

NASA Astrophysics Data System (ADS)

Xu, S.; Huang, X.; Zhang, Y.; Fu, H.; Oey, L.-Y.; Xu, F.; Yang, G.

2014-11-01

Rapid advances in the performance of the graphics processing unit (GPU) have made the GPU a compelling solution for a series of scientific applications. However, most existing GPU acceleration works for climate models are doing partial code porting for certain hot spots, and can only achieve limited speedup for the entire model. In this work, we take the mpiPOM (a parallel version of the Princeton Ocean Model) as our starting point, design and implement a GPU-based Princeton Ocean Model. By carefully considering the architectural features of the state-of-the-art GPU devices, we rewrite the full mpiPOM model from the original Fortran version into a new Compute Unified Device Architecture C (CUDA-C) version. We take several accelerating methods to further improve the performance of gpuPOM, including optimizing memory access in a single GPU, overlapping communication and boundary operations among multiple GPUs, and overlapping input/output (I/O) between the hybrid Central Processing Unit (CPU) and the GPU. Our experimental results indicate that the performance of the gpuPOM on a workstation containing 4 GPUs is comparable to a powerful cluster with 408 CPU cores and it reduces the energy consumption by 6.8 times.
Active corrosion protection of AA2024 by sol-gel coatings with corrosion inhibitors =

NASA Astrophysics Data System (ADS)

Yasakau, Kiryl

A industria aeronautica utiliza ligas de aluminio de alta resistencia para o fabrico dos elementos estruturais dos avioes. As ligas usadas possuem excelentes propriedades mecanicas mas apresentam simultaneamente uma grande tendencia para a corrosao. Por esta razao essas ligas necessitam de proteccao anticorrosiva eficaz para poderem ser utilizadas com seguranca. Ate a data, os sistemas anticorrosivos mais eficazes para ligas de aluminio contem cromio hexavalente na sua composicao, sejam pre-tratamentos, camadas de conversao ou pigmentos anticorrosivos. O reconhecimento dos efeitos carcinogenicos do cromio hexavalente levou ao aparecimento de legislacao banindo o uso desta forma de cromio pela industria. Esta decisao trouxe a necessidade de encontrar alternativas ambientalmente inocuas mas igualmente eficazes. O principal objectivo do presente trabalho e o desenvolvimento de pretratamentos anticorrosivos activos para a liga de aluminio 2024, baseados em revestimentos hibridos produzidos pelo metodo sol-gel. Estes revestimentos deverao possuir boa aderencia ao substrato metalico, boas propriedades barreira e capacidade anticorrosiva activa. A proteccao activa pode ser alcancada atraves da incorporacao de inibidores anticorrosivos no pretratamento. O objectivo foi atingido atraves de uma sucessao de etapas. Primeiro investigou-se em detalhe a corrosao localizada (por picada) da liga de aluminio 2024. Os resultados obtidos permitiram uma melhor compreensao da susceptibilidade desta liga a processos de corrosao localizada. Estudaram-se tambem varios possiveis inibidores de corrosao usando tecnicas electroquimicas e microestruturais. Numa segunda etapa desenvolveram-se revestimentos anticorrosivos hibridos organico-inorganico baseados no metodo sol-gel. Compostos derivados de titania e zirconia foram combinados com siloxanos organofuncionais a fim de obter-se boa aderencia entre o revestimento e o substrato metalico assim como boas propriedades barreira. Testes industriais mostraram que estes novos revestimentos sao compativeis com os esquemas de pintura convencionais actualmente em uso. A estabilidade e o prazo de validade das formulacoes foram optimizados modificando a temperatura de armazenamento e a quantidade de agua usada durante a sintese. As formulacoes sol-gel foram dopadas com os inibidores seleccionados durante a primeira etapa e as propriedades anticorrosivas passivas e activas dos revestimentos obtidos foram estudadas numa terceira etapa do trabalho. Os resultados comprovam a influencia dos inibidores nas propriedades anticorrosivas dos revestimentos sol-gel. Em alguns casos a accao activa dos inibidores combinou-se com a proteccao passiva dada pelo revestimento mas noutros casos tera ocorrido interaccao quimica entre o inibidor e a matriz de sol-gel, de onde resultou a perda de propriedades protectoras do sistema combinado. Atendendo aos problemas provocados pela adicao directa dos inibidores na formulacao sol-gel procurou-se, numa quarta etapa, formas alternativas de incorporacao. Na primeira, produziu-se uma camada de titania nanoporosa na superficie da liga metalica que serviu de reservatorio para os inibidores. O revestimento sol-gel foi aplicado por cima da camada nanoporosa. Os inibidores armazenados nos poros actuam quando o substrato fica exposto ao ambiente agressivo. Numa segunda, os inibidores foram armazenados em nano-reservatorios de silica ou em nanoargilas (halloysite), os quais foram revestidos por polielectrolitos montados camada a camada. A terceira alternativa consistiu no uso de nano-fios de molibdato de cerio amorfo como inibidores anticorrosivos nanoparticulados. Os nano-reservatorios foram incorporados durante a sintese do sol-gel. Qualquer das abordagens permitiu eliminar o efeito negativo do inibidor sobre a estabilidade da matriz do sol-gel. Os revestimentos sol-gel desenvolvidos neste trabalho apresentaram proteccao anticorrosiva activa e capacidade de auto-reparacao. Os resultados obtidos mostraram o elevado potencial destes revestimentos para a proteccao anticorrosiva da liga de aluminio 2024.
GPU COMPUTING FOR PARTICLE TRACKING

DOE Office of Scientific and Technical Information (OSTI.GOV)

Nishimura, Hiroshi; Song, Kai; Muriki, Krishna

2011-03-25

This is a feasibility study of using a modern Graphics Processing Unit (GPU) to parallelize the accelerator particle tracking code. To demonstrate the massive parallelization features provided by GPU computing, a simplified TracyGPU program is developed for dynamic aperture calculation. Performances, issues, and challenges from introducing GPU are also discussed. General purpose Computation on Graphics Processing Units (GPGPU) bring massive parallel computing capabilities to numerical calculation. However, the unique architecture of GPU requires a comprehensive understanding of the hardware and programming model to be able to well optimize existing applications. In the field of accelerator physics, the dynamic aperture calculationmore » of a storage ring, which is often the most time consuming part of the accelerator modeling and simulation, can benefit from GPU due to its embarrassingly parallel feature, which fits well with the GPU programming model. In this paper, we use the Tesla C2050 GPU which consists of 14 multi-processois (MP) with 32 cores on each MP, therefore a total of 448 cores, to host thousands ot threads dynamically. Thread is a logical execution unit of the program on GPU. In the GPU programming model, threads are grouped into a collection of blocks Within each block, multiple threads share the same code, and up to 48 KB of shared memory. Multiple thread blocks form a grid, which is executed as a GPU kernel. A simplified code that is a subset of Tracy++ [2] is developed to demonstrate the possibility of using GPU to speed up the dynamic aperture calculation by having each thread track a particle.« less
High-Speed GPU-Based Fully Three-Dimensional Diffuse Optical Tomographic System

PubMed Central

Saikia, Manob Jyoti; Kanhirodan, Rajan; Mohan Vasu, Ram

2014-01-01

We have developed a graphics processor unit (GPU-) based high-speed fully 3D system for diffuse optical tomography (DOT). The reduction in execution time of 3D DOT algorithm, a severely ill-posed problem, is made possible through the use of (1) an algorithmic improvement that uses Broyden approach for updating the Jacobian matrix and thereby updating the parameter matrix and (2) the multinode multithreaded GPU and CUDA (Compute Unified Device Architecture) software architecture. Two different GPU implementations of DOT programs are developed in this study: (1) conventional C language program augmented by GPU CUDA and CULA routines (C GPU), (2) MATLAB program supported by MATLAB parallel computing toolkit for GPU (MATLAB GPU). The computation time of the algorithm on host CPU and the GPU system is presented for C and Matlab implementations. The forward computation uses finite element method (FEM) and the problem domain is discretized into 14610, 30823, and 66514 tetrahedral elements. The reconstruction time, so achieved for one iteration of the DOT reconstruction for 14610 elements, is 0.52 seconds for a C based GPU program for 2-plane measurements. The corresponding MATLAB based GPU program took 0.86 seconds. The maximum number of reconstructed frames so achieved is 2 frames per second. PMID:24891848
High-Speed GPU-Based Fully Three-Dimensional Diffuse Optical Tomographic System.

PubMed

Saikia, Manob Jyoti; Kanhirodan, Rajan; Mohan Vasu, Ram

2014-01-01

We have developed a graphics processor unit (GPU-) based high-speed fully 3D system for diffuse optical tomography (DOT). The reduction in execution time of 3D DOT algorithm, a severely ill-posed problem, is made possible through the use of (1) an algorithmic improvement that uses Broyden approach for updating the Jacobian matrix and thereby updating the parameter matrix and (2) the multinode multithreaded GPU and CUDA (Compute Unified Device Architecture) software architecture. Two different GPU implementations of DOT programs are developed in this study: (1) conventional C language program augmented by GPU CUDA and CULA routines (C GPU), (2) MATLAB program supported by MATLAB parallel computing toolkit for GPU (MATLAB GPU). The computation time of the algorithm on host CPU and the GPU system is presented for C and Matlab implementations. The forward computation uses finite element method (FEM) and the problem domain is discretized into 14610, 30823, and 66514 tetrahedral elements. The reconstruction time, so achieved for one iteration of the DOT reconstruction for 14610 elements, is 0.52 seconds for a C based GPU program for 2-plane measurements. The corresponding MATLAB based GPU program took 0.86 seconds. The maximum number of reconstructed frames so achieved is 2 frames per second.
MIGS-GPU: Microarray Image Gridding and Segmentation on the GPU.

PubMed

Katsigiannis, Stamos; Zacharia, Eleni; Maroulis, Dimitris

2017-05-01

Complementary DNA (cDNA) microarray is a powerful tool for simultaneously studying the expression level of thousands of genes. Nevertheless, the analysis of microarray images remains an arduous and challenging task due to the poor quality of the images that often suffer from noise, artifacts, and uneven background. In this study, the MIGS-GPU [Microarray Image Gridding and Segmentation on Graphics Processing Unit (GPU)] software for gridding and segmenting microarray images is presented. MIGS-GPU's computations are performed on the GPU by means of the compute unified device architecture (CUDA) in order to achieve fast performance and increase the utilization of available system resources. Evaluation on both real and synthetic cDNA microarray images showed that MIGS-GPU provides better performance than state-of-the-art alternatives, while the proposed GPU implementation achieves significantly lower computational times compared to the respective CPU approaches. Consequently, MIGS-GPU can be an advantageous and useful tool for biomedical laboratories, offering a user-friendly interface that requires minimum input in order to run.
SU-D-BRD-03: A Gateway for GPU Computing in Cancer Radiotherapy Research

DOE Office of Scientific and Technical Information (OSTI.GOV)

Jia, X; Folkerts, M; Shi, F

Purpose: Graphics Processing Unit (GPU) has become increasingly important in radiotherapy. However, it is still difficult for general clinical researchers to access GPU codes developed by other researchers, and for developers to objectively benchmark their codes. Moreover, it is quite often to see repeated efforts spent on developing low-quality GPU codes. The goal of this project is to establish an infrastructure for testing GPU codes, cross comparing them, and facilitating code distributions in radiotherapy community. Methods: We developed a system called Gateway for GPU Computing in Cancer Radiotherapy Research (GCR2). A number of GPU codes developed by our group andmore » other developers can be accessed via a web interface. To use the services, researchers first upload their test data or use the standard data provided by our system. Then they can select the GPU device on which the code will be executed. Our system offers all mainstream GPU hardware for code benchmarking purpose. After the code running is complete, the system automatically summarizes and displays the computing results. We also released a SDK to allow the developers to build their own algorithm implementation and submit their binary codes to the system. The submitted code is then systematically benchmarked using a variety of GPU hardware and representative data provided by our system. The developers can also compare their codes with others and generate benchmarking reports. Results: It is found that the developed system is fully functioning. Through a user-friendly web interface, researchers are able to test various GPU codes. Developers also benefit from this platform by comprehensively benchmarking their codes on various GPU platforms and representative clinical data sets. Conclusion: We have developed an open platform allowing the clinical researchers and developers to access the GPUs and GPU codes. This development will facilitate the utilization of GPU in radiation therapy field.« less
GPU-Acceleration of Sequence Homology Searches with Database Subsequence Clustering.

PubMed

Suzuki, Shuji; Kakuta, Masanori; Ishida, Takashi; Akiyama, Yutaka

2016-01-01

Sequence homology searches are used in various fields and require large amounts of computation time, especially for metagenomic analysis, owing to the large number of queries and the database size. To accelerate computing analyses, graphics processing units (GPUs) are widely used as a low-cost, high-performance computing platform. Therefore, we mapped the time-consuming steps involved in GHOSTZ, which is a state-of-the-art homology search algorithm for protein sequences, onto a GPU and implemented it as GHOSTZ-GPU. In addition, we optimized memory access for GPU calculations and for communication between the CPU and GPU. As per results of the evaluation test involving metagenomic data, GHOSTZ-GPU with 12 CPU threads and 1 GPU was approximately 3.0- to 4.1-fold faster than GHOSTZ with 12 CPU threads. Moreover, GHOSTZ-GPU with 12 CPU threads and 3 GPUs was approximately 5.8- to 7.7-fold faster than GHOSTZ with 12 CPU threads.
GPU Implementation of High Rayleigh Number Three-Dimensional Mantle Convection

NASA Astrophysics Data System (ADS)

Sanchez, D. A.; Yuen, D. A.; Wright, G. B.; Barnett, G. A.

2010-12-01

Although we have entered the age of petascale computing, many factors are still prohibiting high-performance computing (HPC) from infiltrating all suitable scientific disciplines. For this reason and others, application of GPU to HPC is gaining traction in the scientific world. With its low price point, high performance potential, and competitive scalability, GPU has been an option well worth considering for the last few years. Moreover with the advent of NVIDIA's Fermi architecture, which brings ECC memory, better double-precision performance, and more RAM to GPU, there is a strong message of corporate support for GPU in HPC. However many doubts linger concerning the practicality of using GPU for scientific computing. In particular, GPU has a reputation for being difficult to program and suitable for only a small subset of problems. Although inroads have been made in addressing these concerns, for many scientists GPU still has hurdles to clear before becoming an acceptable choice. We explore the applicability of GPU to geophysics by implementing a three-dimensional, second-order finite-difference model of Rayleigh-Benard thermal convection on an NVIDIA GPU using C for CUDA. Our code reaches sufficient resolution, on the order of 500x500x250 evenly-spaced finite-difference gridpoints, on a single GPU. We make extensive use of highly optimized CUBLAS routines, allowing us to achieve performance on the order of O( 0.1 ) µs per timestep*gridpoint at this resolution. This performance has allowed us to study high Rayleigh number simulations, on the order of 2x10^7, on a single GPU.
GPU: the biggest key processor for AI and parallel processing

NASA Astrophysics Data System (ADS)

Baji, Toru

2017-07-01

Two types of processors exist in the market. One is the conventional CPU and the other is Graphic Processor Unit (GPU). Typical CPU is composed of 1 to 8 cores while GPU has thousands of cores. CPU is good for sequential processing, while GPU is good to accelerate software with heavy parallel executions. GPU was initially dedicated for 3D graphics. However from 2006, when GPU started to apply general-purpose cores, it was noticed that this architecture can be used as a general purpose massive-parallel processor. NVIDIA developed a software framework Compute Unified Device Architecture (CUDA) that make it possible to easily program the GPU for these application. With CUDA, GPU started to be used in workstations and supercomputers widely. Recently two key technologies are highlighted in the industry. The Artificial Intelligence (AI) and Autonomous Driving Cars. AI requires a massive parallel operation to train many-layers of neural networks. With CPU alone, it was impossible to finish the training in a practical time. The latest multi-GPU system with P100 makes it possible to finish the training in a few hours. For the autonomous driving cars, TOPS class of performance is required to implement perception, localization, path planning processing and again SoC with integrated GPU will play a key role there. In this paper, the evolution of the GPU which is one of the biggest commercial devices requiring state-of-the-art fabrication technology will be introduced. Also overview of the GPU demanding key application like the ones described above will be introduced.
Clinical implementation of a GPU-based simplified Monte Carlo method for a treatment planning system of proton beam therapy.

PubMed

Kohno, R; Hotta, K; Nishioka, S; Matsubara, K; Tansho, R; Suzuki, T

2011-11-21

We implemented the simplified Monte Carlo (SMC) method on graphics processing unit (GPU) architecture under the computer-unified device architecture platform developed by NVIDIA. The GPU-based SMC was clinically applied for four patients with head and neck, lung, or prostate cancer. The results were compared to those obtained by a traditional CPU-based SMC with respect to the computation time and discrepancy. In the CPU- and GPU-based SMC calculations, the estimated mean statistical errors of the calculated doses in the planning target volume region were within 0.5% rms. The dose distributions calculated by the GPU- and CPU-based SMCs were similar, within statistical errors. The GPU-based SMC showed 12.30-16.00 times faster performance than the CPU-based SMC. The computation time per beam arrangement using the GPU-based SMC for the clinical cases ranged 9-67 s. The results demonstrate the successful application of the GPU-based SMC to a clinical proton treatment planning.
GPU Optimizations for a Production Molecular Docking Code*

PubMed Central

Landaverde, Raphael; Herbordt, Martin C.

2015-01-01

Modeling molecular docking is critical to both understanding life processes and designing new drugs. In previous work we created the first published GPU-accelerated docking code (PIPER) which achieved a roughly 5× speed-up over a contemporaneous 4 core CPU. Advances in GPU architecture and in the CPU code, however, have since reduced this relalative performance by a factor of 10. In this paper we describe the upgrade of GPU PIPER. This required an entire rewrite, including algorithm changes and moving most remaining non-accelerated CPU code onto the GPU. The result is a 7× improvement in GPU performance and a 3.3× speedup over the CPU-only code. We find that this difference in time is almost entirely due to the difference in run times of the 3D FFT library functions on CPU (MKL) and GPU (cuFFT), respectively. The GPU code has been integrated into the ClusPro docking server which has over 4000 active users. PMID:26594667
GPU-Acceleration of Sequence Homology Searches with Database Subsequence Clustering

PubMed Central

Suzuki, Shuji; Kakuta, Masanori; Ishida, Takashi; Akiyama, Yutaka

2016-01-01

Sequence homology searches are used in various fields and require large amounts of computation time, especially for metagenomic analysis, owing to the large number of queries and the database size. To accelerate computing analyses, graphics processing units (GPUs) are widely used as a low-cost, high-performance computing platform. Therefore, we mapped the time-consuming steps involved in GHOSTZ, which is a state-of-the-art homology search algorithm for protein sequences, onto a GPU and implemented it as GHOSTZ-GPU. In addition, we optimized memory access for GPU calculations and for communication between the CPU and GPU. As per results of the evaluation test involving metagenomic data, GHOSTZ-GPU with 12 CPU threads and 1 GPU was approximately 3.0- to 4.1-fold faster than GHOSTZ with 12 CPU threads. Moreover, GHOSTZ-GPU with 12 CPU threads and 3 GPUs was approximately 5.8- to 7.7-fold faster than GHOSTZ with 12 CPU threads. PMID:27482905
GPU Optimizations for a Production Molecular Docking Code.

PubMed

Landaverde, Raphael; Herbordt, Martin C

2014-09-01

Modeling molecular docking is critical to both understanding life processes and designing new drugs. In previous work we created the first published GPU-accelerated docking code (PIPER) which achieved a roughly 5× speed-up over a contemporaneous 4 core CPU. Advances in GPU architecture and in the CPU code, however, have since reduced this relalative performance by a factor of 10. In this paper we describe the upgrade of GPU PIPER. This required an entire rewrite, including algorithm changes and moving most remaining non-accelerated CPU code onto the GPU. The result is a 7× improvement in GPU performance and a 3.3× speedup over the CPU-only code. We find that this difference in time is almost entirely due to the difference in run times of the 3D FFT library functions on CPU (MKL) and GPU (cuFFT), respectively. The GPU code has been integrated into the ClusPro docking server which has over 4000 active users.
Employing multi-GPU power for molecular dynamics simulation: an extension of GALAMOST

NASA Astrophysics Data System (ADS)

Zhu, You-Liang; Pan, Deng; Li, Zhan-Wei; Liu, Hong; Qian, Hu-Jun; Zhao, Yang; Lu, Zhong-Yuan; Sun, Zhao-Yan

2018-04-01

We describe the algorithm of employing multi-GPU power on the basis of Message Passing Interface (MPI) domain decomposition in a molecular dynamics code, GALAMOST, which is designed for the coarse-grained simulation of soft matters. The code of multi-GPU version is developed based on our previous single-GPU version. In multi-GPU runs, one GPU takes charge of one domain and runs single-GPU code path. The communication between neighbouring domains takes a similar algorithm of CPU-based code of LAMMPS, but is optimised specifically for GPUs. We employ a memory-saving design which can enlarge maximum system size at the same device condition. An optimisation algorithm is employed to prolong the update period of neighbour list. We demonstrate good performance of multi-GPU runs on the simulation of Lennard-Jones liquid, dissipative particle dynamics liquid, polymer and nanoparticle composite, and two-patch particles on workstation. A good scaling of many nodes on cluster for two-patch particles is presented.
NMF-mGPU: non-negative matrix factorization on multi-GPU systems.

PubMed

Mejía-Roa, Edgardo; Tabas-Madrid, Daniel; Setoain, Javier; García, Carlos; Tirado, Francisco; Pascual-Montano, Alberto

2015-02-13

In the last few years, the Non-negative Matrix Factorization ( NMF ) technique has gained a great interest among the Bioinformatics community, since it is able to extract interpretable parts from high-dimensional datasets. However, the computing time required to process large data matrices may become impractical, even for a parallel application running on a multiprocessors cluster. In this paper, we present NMF-mGPU, an efficient and easy-to-use implementation of the NMF algorithm that takes advantage of the high computing performance delivered by Graphics-Processing Units ( GPUs ). Driven by the ever-growing demands from the video-games industry, graphics cards usually provided in PCs and laptops have evolved from simple graphics-drawing platforms into high-performance programmable systems that can be used as coprocessors for linear-algebra operations. However, these devices may have a limited amount of on-board memory, which is not considered by other NMF implementations on GPU. NMF-mGPU is based on CUDA ( Compute Unified Device Architecture ), the NVIDIA's framework for GPU computing. On devices with low memory available, large input matrices are blockwise transferred from the system's main memory to the GPU's memory, and processed accordingly. In addition, NMF-mGPU has been explicitly optimized for the different CUDA architectures. Finally, platforms with multiple GPUs can be synchronized through MPI ( Message Passing Interface ). In a four-GPU system, this implementation is about 120 times faster than a single conventional processor, and more than four times faster than a single GPU device (i.e., a super-linear speedup). Applications of GPUs in Bioinformatics are getting more and more attention due to their outstanding performance when compared to traditional processors. In addition, their relatively low price represents a highly cost-effective alternative to conventional clusters. In life sciences, this results in an excellent opportunity to facilitate the daily work of bioinformaticians that are trying to extract biological meaning out of hundreds of gigabytes of experimental information. NMF-mGPU can be used "out of the box" by researchers with little or no expertise in GPU programming in a variety of platforms, such as PCs, laptops, or high-end GPU clusters. NMF-mGPU is freely available at https://github.com/bioinfo-cnb/bionmf-gpu .
Multi-GPU hybrid programming accelerated three-dimensional phase-field model in binary alloy

NASA Astrophysics Data System (ADS)

Zhu, Changsheng; Liu, Jieqiong; Zhu, Mingfang; Feng, Li

2018-03-01

In the process of dendritic growth simulation, the computational efficiency and the problem scales have extremely important influence on simulation efficiency of three-dimensional phase-field model. Thus, seeking for high performance calculation method to improve the computational efficiency and to expand the problem scales has a great significance to the research of microstructure of the material. A high performance calculation method based on MPI+CUDA hybrid programming model is introduced. Multi-GPU is used to implement quantitative numerical simulations of three-dimensional phase-field model in binary alloy under the condition of multi-physical processes coupling. The acceleration effect of different GPU nodes on different calculation scales is explored. On the foundation of multi-GPU calculation model that has been introduced, two optimization schemes, Non-blocking communication optimization and overlap of MPI and GPU computing optimization, are proposed. The results of two optimization schemes and basic multi-GPU model are compared. The calculation results show that the use of multi-GPU calculation model can improve the computational efficiency of three-dimensional phase-field obviously, which is 13 times to single GPU, and the problem scales have been expanded to 8193. The feasibility of two optimization schemes is shown, and the overlap of MPI and GPU computing optimization has better performance, which is 1.7 times to basic multi-GPU model, when 21 GPUs are used.
Parallel computing in experimental mechanics and optical measurement: A review (II)

NASA Astrophysics Data System (ADS)

Wang, Tianyi; Kemao, Qian

2018-05-01

With advantages such as non-destructiveness, high sensitivity and high accuracy, optical techniques have successfully integrated into various important physical quantities in experimental mechanics (EM) and optical measurement (OM). However, in pursuit of higher image resolutions for higher accuracy, the computation burden of optical techniques has become much heavier. Therefore, in recent years, heterogeneous platforms composing of hardware such as CPUs and GPUs, have been widely employed to accelerate these techniques due to their cost-effectiveness, short development cycle, easy portability, and high scalability. In this paper, we analyze various works by first illustrating their different architectures, followed by introducing their various parallel patterns for high speed computation. Next, we review the effects of CPU and GPU parallel computing specifically in EM & OM applications in a broad scope, which include digital image/volume correlation, fringe pattern analysis, tomography, hyperspectral imaging, computer-generated holograms, and integral imaging. In our survey, we have found that high parallelism can always be exploited in such applications for the development of high-performance systems.
GPU Accelerated Chemical Similarity Calculation for Compound Library Comparison

PubMed Central

Ma, Chao; Wang, Lirong; Xie, Xiang-Qun

2012-01-01

Chemical similarity calculation plays an important role in compound library design, virtual screening, and “lead” optimization. In this manuscript, we present a novel GPU-accelerated algorithm for all-vs-all Tanimoto matrix calculation and nearest neighbor search. By taking advantage of multi-core GPU architecture and CUDA parallel programming technology, the algorithm is up to 39 times superior to the existing commercial software that runs on CPUs. Because of the utilization of intrinsic GPU instructions, this approach is nearly 10 times faster than existing GPU-accelerated sparse vector algorithm, when Unity fingerprints are used for Tanimoto calculation. The GPU program that implements this new method takes about 20 minutes to complete the calculation of Tanimoto coefficients between 32M PubChem compounds and 10K Active Probes compounds, i.e., 324G Tanimoto coefficients, on a 128-CUDA-core GPU. PMID:21692447
Improving the performance of heterogeneous multi-core processors by modifying the cache coherence protocol

NASA Astrophysics Data System (ADS)

Fang, Juan; Hao, Xiaoting; Fan, Qingwen; Chang, Zeqing; Song, Shuying

2017-05-01

In the Heterogeneous multi-core architecture, CPU and GPU processor are integrated on the same chip, which poses a new challenge to the last-level cache management. In this architecture, the CPU application and the GPU application execute concurrently, accessing the last-level cache. CPU and GPU have different memory access characteristics, so that they have differences in the sensitivity of last-level cache (LLC) capacity. For many CPU applications, a reduced share of the LLC could lead to significant performance degradation. On the contrary, GPU applications can tolerate increase in memory access latency when there is sufficient thread-level parallelism. Taking into account the GPU program memory latency tolerance characteristics, this paper presents a method that let GPU applications can access to memory directly, leaving lots of LLC space for CPU applications, in improving the performance of CPU applications and does not affect the performance of GPU applications. When the CPU application is cache sensitive, and the GPU application is insensitive to the cache, the overall performance of the system is improved significantly.

Parallelization and checkpointing of GPU applications through program transformation

DOE Office of Scientific and Technical Information (OSTI.GOV)

Solano-Quinde, Lizandro Damian

2012-01-01

GPUs have emerged as a powerful tool for accelerating general-purpose applications. The availability of programming languages that makes writing general-purpose applications for running on GPUs tractable have consolidated GPUs as an alternative for accelerating general purpose applications. Among the areas that have benefited from GPU acceleration are: signal and image processing, computational fluid dynamics, quantum chemistry, and, in general, the High Performance Computing (HPC) Industry. In order to continue to exploit higher levels of parallelism with GPUs, multi-GPU systems are gaining popularity. In this context, single-GPU applications are parallelized for running in multi-GPU systems. Furthermore, multi-GPU systems help to solvemore » the GPU memory limitation for applications with large application memory footprint. Parallelizing single-GPU applications has been approached by libraries that distribute the workload at runtime, however, they impose execution overhead and are not portable. On the other hand, on traditional CPU systems, parallelization has been approached through application transformation at pre-compile time, which enhances the application to distribute the workload at application level and does not have the issues of library-based approaches. Hence, a parallelization scheme for GPU systems based on application transformation is needed. Like any computing engine of today, reliability is also a concern in GPUs. GPUs are vulnerable to transient and permanent failures. Current checkpoint/restart techniques are not suitable for systems with GPUs. Checkpointing for GPU systems present new and interesting challenges, primarily due to the natural differences imposed by the hardware design, the memory subsystem architecture, the massive number of threads, and the limited amount of synchronization among threads. Therefore, a checkpoint/restart technique suitable for GPU systems is needed. The goal of this work is to exploit higher levels of parallelism and to develop support for application-level fault tolerance in applications using multiple GPUs. Our techniques reduce the burden of enhancing single-GPU applications to support these features. To achieve our goal, this work designs and implements a framework for enhancing a single-GPU OpenCL application through application transformation.« less
Costs of topical treatment of pressure ulcer patients.

PubMed

Andrade, Cynthia Carolina Duarte; Almeida, Cláudia Fernanda Dos Santos Calixto de; Pereira, Walkíria Euzébio; Alemão, Márcia Mascarenhas; Brandão, Cristina Mariano Ruas; Borges, Eline Lima

2016-04-01

To evaluate the costs of a topical treatment of pressure ulcer (PU) patients in a hospital unit for treatment of chronic patients in 2014. This is an activity-based costing study. This method encompasses the identification, measurement and pricing of physical and human resources consumed for dressings. Procedure costs varied between BRL 16.41 and BRL 260.18. For PUs of the same category, of near areas and with the same type of barrier/adjuvant, the cost varied between 3.5% and 614.6%. For most dressings, the cost increased proportionally to the increase of the area and to the development of PU category. The primary barrier accounted for a high percentage of costs among all items required to the application of dressings (human and material resources). Dressings applied in sacral PUs had longer application times. This study allowed us to understand the costs involved in the treatment of PUs, and it may support decision-makers and other cost-effectiveness studies. Realizar uma avaliação do custo do tratamento tópico de pacientes com úlceras por pressão (UP), em uma unidade hospitalar de atendimento a pacientes crônicos no ano de 2014. Trata-se de um estudo de custos baseado no Sistema de custeio Baseado em Atividades. Este método contempla a identificação, mensuração e precificação dos recursos físicos e humanos consumidos para a realização de curativos. Os custos dos procedimentos variaram de R$16,41 a R$260,18. Para UP de mesma categoria, de áreas aproximadas e mesmo tipo de cobertura/adjuvante, a variação entre os custos foi de 3,5% a 614,6%. Para a maioria dos curativos, o custo aumentou proporcionalmente ao aumento da área e à progressão da categoria das UP. A cobertura primária representou elevado percentual nos custos entre todos os itens necessários para realizar os curativos (recursos humanos e materiais). Os curativos realizados nas UP sacrais foram os que apresentaram maiores tempos para execução. Este estudo permitiu conhecer os custos envolvidos no tratamento das UP e pode fornecer subsídios para os tomadores de decisão, assim como para a realização de estudos de custo-efetividade.
Research on large area VUV-sensitive gaseous photomultipliers for cryogenic applications

NASA Astrophysics Data System (ADS)

Coimbra, Artur Emanuel Cardoso

Desde cedo que a comunidade cientifica compreendeu que gases nobres em liquido sao excelentes meios de deteccao de radiacao, combinando a sua elevada densidade, elevado grau de homogeneidade e de elevado rendimento de cintilacao. Para alem destas caracteristicas inerentes, estes tem a potencialidade de fornecer ambos sinais de ionizacao - criando electroes livres - e cintilacao em resposta a interaccao com radiacao ionizante e, tendo em vista a sua aplicacao em experiencias de eventos raros relacionados com fisica de neutrinos ou materia-escura, a capacidade de autoblindagem garante a exclusao de eventos induzidos por radiacao de fundo. O facto de nao absorverem a sua propria luz, emergente dos eventos de cintilacao, permite a expansao deste tipo de detectores ate grandes volumes, sendo que as colaboracoes mais recentes propoem detectores com dezenas de toneladas de xenon em estado liquido. As experiencias actuais que usam gases nobres em estado liquido empregam xenon ou argon numa so fase (estado liquido) ou em dupla-fase (estado liquido + gasoso) e as suas aplicacoes abrangem desde as ja referidas experiencias de procura de eventos raros, passando por imagiologia medica tais como detectores de radiacao gama para PET ou câmaras Compton "3-gamma" em combinacao com PET, passando tambem por aplicacoes de seguranca como sistemas de inspeccao para deteccao de material fissil e, finalmente, em câmaras Compton para aplicacoes de astrofisica. Em ambas as configuracoes a leitura dos sinais de cintilacao e geralmente feita atraves de um grande numero de dispendiosos fotomultiplicadores de vacuo agrupados. A presente tese de doutoramento e dedicada aos fotomultiplicadores gasosos de grande area para aplicacoes criogenicas desenvolvidos no contexto do programa doutoral, tendo em vista a sua eventual aplicacao como um dispositivo complementar aos metodos existentes de deteccao de cintilacao, para aplicacao em futuras experiencias de grande escala. Esta pesquisa foi direccionada tendo em vista o desenvolvimento de eficientes fotomultiplicadores gasosos de grande area, potencialmente mais economicos por unidade de area, baseados em "Thick Gas-Electron Multipliers" (THGEMs). Combinando fotocatodos de alta eficiencia com multiplicadores gasosos de electroes capazes de atingir elevado ganho em carga obteve-se assim um dispositivo com elevada sensibilidade para a deteccao de fotoes unicos, com a possibilidade de discriminacao em posicao com resolucao espacial inferior a um milimetro e com resolucao temporal da ordem de poucos nano segundos. Contrariamente ao que sucede com a tecnologia de vacuo actualmente, com este dispositivo a localizacao em posicao de fotoes em grandes areas e feita num unico dispositivo integrando electronica habitualmente utilizada em experiencias de rastreamento de particulas. Neste trabalho o fotomultiplicador gasoso desenvolvido consiste numa cadeia de THGEMs combinados com um fotocatodo de iodeto de cesio (CsI) sensivel ao ultravioleta enquanto que os testes criogenicos foram realizados na Time Projection Chamber (TPC) de dupla fase de xenon liquido recentemente desenvolvida no Weizmann Institute of Science (WILiX). (Abstract shortened by ProQuest.).
SU-E-T-493: Accelerated Monte Carlo Methods for Photon Dosimetry Using a Dual-GPU System and CUDA.

PubMed

Liu, T; Ding, A; Xu, X

2012-06-01

To develop a Graphics Processing Unit (GPU) based Monte Carlo (MC) code that accelerates dose calculations on a dual-GPU system. We simulated a clinical case of prostate cancer treatment. A voxelized abdomen phantom derived from 120 CT slices was used containing 218×126×60 voxels, and a GE LightSpeed 16-MDCT scanner was modeled. A CPU version of the MC code was first developed in C++ and tested on Intel Xeon X5660 2.8GHz CPU, then it was translated into GPU version using CUDA C 4.1 and run on a dual Tesla m 2 090 GPU system. The code was featured with automatic assignment of simulation task to multiple GPUs, as well as accurate calculation of energy- and material- dependent cross-sections. Double-precision floating point format was used for accuracy. Doses to the rectum, prostate, bladder and femoral heads were calculated. When running on a single GPU, the MC GPU code was found to be ×19 times faster than the CPU code and ×42 times faster than MCNPX. These speedup factors were doubled on the dual-GPU system. The dose Result was benchmarked against MCNPX and a maximum difference of 1% was observed when the relative error is kept below 0.1%. A GPU-based MC code was developed for dose calculations using detailed patient and CT scanner models. Efficiency and accuracy were both guaranteed in this code. Scalability of the code was confirmed on the dual-GPU system. © 2012 American Association of Physicists in Medicine.
Accelerated rescaling of single Monte Carlo simulation runs with the Graphics Processing Unit (GPU).

PubMed

Yang, Owen; Choi, Bernard

2013-01-01

To interpret fiber-based and camera-based measurements of remitted light from biological tissues, researchers typically use analytical models, such as the diffusion approximation to light transport theory, or stochastic models, such as Monte Carlo modeling. To achieve rapid (ideally real-time) measurement of tissue optical properties, especially in clinical situations, there is a critical need to accelerate Monte Carlo simulation runs. In this manuscript, we report on our approach using the Graphics Processing Unit (GPU) to accelerate rescaling of single Monte Carlo runs to calculate rapidly diffuse reflectance values for different sets of tissue optical properties. We selected MATLAB to enable non-specialists in C and CUDA-based programming to use the generated open-source code. We developed a software package with four abstraction layers. To calculate a set of diffuse reflectance values from a simulated tissue with homogeneous optical properties, our rescaling GPU-based approach achieves a reduction in computation time of several orders of magnitude as compared to other GPU-based approaches. Specifically, our GPU-based approach generated a diffuse reflectance value in 0.08ms. The transfer time from CPU to GPU memory currently is a limiting factor with GPU-based calculations. However, for calculation of multiple diffuse reflectance values, our GPU-based approach still can lead to processing that is ~3400 times faster than other GPU-based approaches.
Acceleration for 2D time-domain elastic full waveform inversion using a single GPU card

NASA Astrophysics Data System (ADS)

Jiang, Jinpeng; Zhu, Peimin

2018-05-01

Full waveform inversion (FWI) is a challenging procedure due to the high computational cost related to the modeling, especially for the elastic case. The graphics processing unit (GPU) has become a popular device for the high-performance computing (HPC). To reduce the long computation time, we design and implement the GPU-based 2D elastic FWI (EFWI) in time domain using a single GPU card. We parallelize the forward modeling and gradient calculations using the CUDA programming language. To overcome the limitation of relatively small global memory on GPU, the boundary saving strategy is exploited to reconstruct the forward wavefield. Moreover, the L-BFGS optimization method used in the inversion increases the convergence of the misfit function. A multiscale inversion strategy is performed in the workflow to obtain the accurate inversion results. In our tests, the GPU-based implementations using a single GPU device achieve >15 times speedup in forward modeling, and about 12 times speedup in gradient calculation, compared with the eight-core CPU implementations optimized by OpenMP. The test results from the GPU implementations are verified to have enough accuracy by comparing the results obtained from the CPU implementations.
Simulation of reaction diffusion processes over biologically relevant size and time scales using multi-GPU workstations

PubMed Central

Hallock, Michael J.; Stone, John E.; Roberts, Elijah; Fry, Corey; Luthey-Schulten, Zaida

2014-01-01

Simulation of in vivo cellular processes with the reaction-diffusion master equation (RDME) is a computationally expensive task. Our previous software enabled simulation of inhomogeneous biochemical systems for small bacteria over long time scales using the MPD-RDME method on a single GPU. Simulations of larger eukaryotic systems exceed the on-board memory capacity of individual GPUs, and long time simulations of modest-sized cells such as yeast are impractical on a single GPU. We present a new multi-GPU parallel implementation of the MPD-RDME method based on a spatial decomposition approach that supports dynamic load balancing for workstations containing GPUs of varying performance and memory capacity. We take advantage of high-performance features of CUDA for peer-to-peer GPU memory transfers and evaluate the performance of our algorithms on state-of-the-art GPU devices. We present parallel e ciency and performance results for simulations using multiple GPUs as system size, particle counts, and number of reactions grow. We also demonstrate multi-GPU performance in simulations of the Min protein system in E. coli. Moreover, our multi-GPU decomposition and load balancing approach can be generalized to other lattice-based problems. PMID:24882911
Simulation of reaction diffusion processes over biologically relevant size and time scales using multi-GPU workstations.

PubMed

Hallock, Michael J; Stone, John E; Roberts, Elijah; Fry, Corey; Luthey-Schulten, Zaida

2014-05-01

Simulation of in vivo cellular processes with the reaction-diffusion master equation (RDME) is a computationally expensive task. Our previous software enabled simulation of inhomogeneous biochemical systems for small bacteria over long time scales using the MPD-RDME method on a single GPU. Simulations of larger eukaryotic systems exceed the on-board memory capacity of individual GPUs, and long time simulations of modest-sized cells such as yeast are impractical on a single GPU. We present a new multi-GPU parallel implementation of the MPD-RDME method based on a spatial decomposition approach that supports dynamic load balancing for workstations containing GPUs of varying performance and memory capacity. We take advantage of high-performance features of CUDA for peer-to-peer GPU memory transfers and evaluate the performance of our algorithms on state-of-the-art GPU devices. We present parallel e ciency and performance results for simulations using multiple GPUs as system size, particle counts, and number of reactions grow. We also demonstrate multi-GPU performance in simulations of the Min protein system in E. coli . Moreover, our multi-GPU decomposition and load balancing approach can be generalized to other lattice-based problems.
Impact of memory bottleneck on the performance of graphics processing units

NASA Astrophysics Data System (ADS)

Son, Dong Oh; Choi, Hong Jun; Kim, Jong Myon; Kim, Cheol Hong

2015-12-01

Recent graphics processing units (GPUs) can process general-purpose applications as well as graphics applications with the help of various user-friendly application programming interfaces (APIs) supported by GPU vendors. Unfortunately, utilizing the hardware resource in the GPU efficiently is a challenging problem, since the GPU architecture is totally different to the traditional CPU architecture. To solve this problem, many studies have focused on the techniques for improving the system performance using GPUs. In this work, we analyze the GPU performance varying GPU parameters such as the number of cores and clock frequency. According to our simulations, the GPU performance can be improved by 125.8% and 16.2% on average as the number of cores and clock frequency increase, respectively. However, the performance is saturated when memory bottleneck problems incur due to huge data requests to the memory. The performance of GPUs can be improved as the memory bottleneck is reduced by changing GPU parameters dynamically.
GPU-Accelerated Forward and Back-Projections with Spatially Varying Kernels for 3D DIRECT TOF PET Reconstruction.

PubMed

Ha, S; Matej, S; Ispiryan, M; Mueller, K

2013-02-01

We describe a GPU-accelerated framework that efficiently models spatially (shift) variant system response kernels and performs forward- and back-projection operations with these kernels for the DIRECT (Direct Image Reconstruction for TOF) iterative reconstruction approach. Inherent challenges arise from the poor memory cache performance at non-axis aligned TOF directions. Focusing on the GPU memory access patterns, we utilize different kinds of GPU memory according to these patterns in order to maximize the memory cache performance. We also exploit the GPU instruction-level parallelism to efficiently hide long latencies from the memory operations. Our experiments indicate that our GPU implementation of the projection operators has slightly faster or approximately comparable time performance than FFT-based approaches using state-of-the-art FFTW routines. However, most importantly, our GPU framework can also efficiently handle any generic system response kernels, such as spatially symmetric and shift-variant as well as spatially asymmetric and shift-variant, both of which an FFT-based approach cannot cope with.
GPU-accelerated phase-field simulation of dendritic solidification in a binary alloy

NASA Astrophysics Data System (ADS)

Yamanaka, Akinori; Aoki, Takayuki; Ogawa, Satoi; Takaki, Tomohiro

2011-03-01

The phase-field simulation for dendritic solidification of a binary alloy has been accelerated by using a graphic processing unit (GPU). To perform the phase-field simulation of the alloy solidification on GPU, a program code was developed with computer unified device architecture (CUDA). In this paper, the implementation technique of the phase-field model on GPU is presented. Also, we evaluated the acceleration performance of the three-dimensional solidification simulation by using a single NVIDIA TESLA C1060 GPU and the developed program code. The results showed that the GPU calculation for 5763 computational grids achieved the performance of 170 GFLOPS by utilizing the shared memory as a software-managed cache. Furthermore, it can be demonstrated that the computation with the GPU is 100 times faster than that with a single CPU core. From the obtained results, we confirmed the feasibility of realizing a real-time full three-dimensional phase-field simulation of microstructure evolution on a personal desktop computer.
GPU-Accelerated Forward and Back-Projections With Spatially Varying Kernels for 3D DIRECT TOF PET Reconstruction

NASA Astrophysics Data System (ADS)

Ha, S.; Matej, S.; Ispiryan, M.; Mueller, K.

2013-02-01

We describe a GPU-accelerated framework that efficiently models spatially (shift) variant system response kernels and performs forward- and back-projection operations with these kernels for the DIRECT (Direct Image Reconstruction for TOF) iterative reconstruction approach. Inherent challenges arise from the poor memory cache performance at non-axis aligned TOF directions. Focusing on the GPU memory access patterns, we utilize different kinds of GPU memory according to these patterns in order to maximize the memory cache performance. We also exploit the GPU instruction-level parallelism to efficiently hide long latencies from the memory operations. Our experiments indicate that our GPU implementation of the projection operators has slightly faster or approximately comparable time performance than FFT-based approaches using state-of-the-art FFTW routines. However, most importantly, our GPU framework can also efficiently handle any generic system response kernels, such as spatially symmetric and shift-variant as well as spatially asymmetric and shift-variant, both of which an FFT-based approach cannot cope with.
On the effective implementation of a boundary element code on graphics processing units unsing an out-of-core LU algorithm

DOE Office of Scientific and Technical Information (OSTI.GOV)

D'Azevedo, Ed F; Nintcheu Fata, Sylvain

2012-01-01

A collocation boundary element code for solving the three-dimensional Laplace equation, publicly available from \\url{http://www.intetec.org}, has been adapted to run on an Nvidia Tesla general purpose graphics processing unit (GPU). Global matrix assembly and LU factorization of the resulting dense matrix were performed on the GPU. Out-of-core techniques were used to solve problems larger than available GPU memory. The code achieved over eight times speedup in matrix assembly and about 56~Gflops/sec in the LU factorization using only 512~Mbytes of GPU memory. Details of the GPU implementation and comparisons with the standard sequential algorithm are included to illustrate the performance ofmore » the GPU code.« less
Combating the Reliability Challenge of GPU Register File at Low Supply Voltage

DOE Office of Scientific and Technical Information (OSTI.GOV)

Tan, Jingweijia; Song, Shuaiwen; Yan, Kaige

Supply voltage reduction is an effective approach to significantly reduce GPU energy consumption. As the largest on-chip storage structure, the GPU register file becomes the reliability hotspot that prevents further supply voltage reduction below the safe limit (Vmin) due to process variation effects. This work addresses the reliability challenge of the GPU register file at low supply voltages, which is an essential first step for aggressive supply voltage reduction of the entire GPU chip. We propose GR-Guard, an architectural solution that leverages long register dead time to enable reliable operations from unreliable register file at low voltages.
Astronomia para/com crianças carentes em Limeira

NASA Astrophysics Data System (ADS)

Bretones, P. S.; Oliveira, V. C.

2003-08-01

Em 2001, o Instituto Superior de Ciências Aplicadas (ISCA Faculdades de Limeira) iniciou um projeto pelo qual o Observatório do Morro Azul empreendeu uma parceria com o Centro de Promoção Social Municipal (CEPROSOM), instituição mantida pela Prefeitura Municipal de Limeira para atender crianças e adolescentes carentes. O CEPROSOM contava com dois projetos: Projeto Centro de Convivência Infantil (CCI) e Programa Criança e Adolescente (PCA), que atendiam crianças e adolescentes em Centros Comunitários de diversas áreas da cidade. Esses projetos têm como prioridades estabelecer atividades prazerosas para as crianças no sentido de retirá-las das ruas. Assim sendo, as crianças passaram a ter mais um tipo de atividade - as visitas ao observatório. Este painel descreve as várias fases do projeto, que envolveu: reuniões de planejamento, curso de Astronomia para as orientadoras dos CCIs e PCAs, atividades relacionadas a visitas das crianças ao Observatório, proposta de construção de gnômons e relógios de Sol nos diversos Centros Comunitários de Limeira e divulgação do projeto na imprensa. O painel inclui discussões sobre a aprendizagem de crianças carentes, relatos que mostram a postura das orientadoras sobre a pertinência do ensino de Astronomia, relatos do monitor que fez o atendimento no Observatório e o que o número de crianças atendidas representou para as atividades da instituição desde o início de suas atividades e, em particular, em 2001. Os resultados são baseados na análise de relatos das orientadoras e do monitor do Observatório, registros de visitas e matérias da imprensa local. Conclui com uma avaliação do que tal projeto representou para as Instituições participantes. Para o Observatório, em particular, foi feita uma análise com relação às outras modalidades de atendimentos que envolvem alunos de escolas e público em geral. Também é abordada a questão do compromisso social do Observatório na educação do público em questão.
Estudo de propriedades estruturais e opticas de multicamadas epitaxiais emissoras de luz baseadas em InGaN/GaN

NASA Astrophysics Data System (ADS)

Pereira, Sergio Manuel de Sousa

Esta tese apresenta os resultados de uma investigacao experimental em filmes epitaxiais emissores de luz baseados em InxGa1-xN. O InxGa1-xN e uma liga semicondutora ternaria do grupo III-N muito utilizada como camada activa numa gama de dispositivos optoelectronicos em desenvolvimento, incluindo diodos emissores de luz (LEDs) e diodos laser (LDs), para operacao na regiao do visivel e ultravioleta do espectro electromagnetico. Neste estudo, caracterizam-se as propriedade opticas e estruturais de camadas simples e pocos quânticos multiplos (Multiple Quantum Wells, MQWs) de InxGa1-xN/GaN, com enfase nas suas propriedades fisicas fundamentais. O objectivo central do trabalho prende-se com a compreensao mais profunda dos processos fisicos que estao por tras das suas propriedades opticas, preenchendo o fosso existente entre aplicacoes tecnologicas e o conhecimento cientifico. Nomeadamente, a tese aborda os problemas da medicao da fraccao de InN (x) em multicamadas ultrafinas sujeitas a tensoes, a influencia da composicao e das tensoes microscopicas nas propriedades opticas e estruturais. A questao relativa a segregacao de fases em multicamadas de InxGa1-xN/GaN e tambem discutida a luz dos resultados obtidos. A metodologia seguida assenta na integracao de resultados obtidos por tecnicas complementares atraves de uma analise sistematica e multidisciplinar. Esta abordagem passa pela combinacao de: 1) Crescimento de amostras por deposicao epitaxial em fase de vapor organometalico (MOVPE) com caracteristicas especificas de forma a tentar isolar parâmetros estruturais, tais como espessura e composicao; 2) Caracterizacao nanoestrutural por microscopia de forca atomica (AFM), microscopica electronica de varrimento (SEM), difraccao de raios-X e retro-dispersao de Rutherford (RBS); 3) Caracterizacao optica a escalas complementares por: espectroscopia de absorcao optica (OA), fotoluminescencia (PL), catodoluminescencia (CL) e microscopia confocal (CM) com analise espectral. Com base nos resultados obtidos, a tese propoe modelos de interpretacao para as propriedades estruturais e opticas, dando enfase as suas correlacoes. Em particular, estabelece-se a necessidade de considerar fenomenos relacionados com tensoes microscopicas na interpretacao dos resultados experimentais. Com este trabalho fica clara a necessidade de um conhecimento detalhado das caracteristicas nanoestruturais para interpretar as propriedades opticas das ligas de InxGa1-xN. None
The Performance Improvement of the Lagrangian Particle Dispersion Model (LPDM) Using Graphics Processing Unit (GPU) Computing

DTIC Science & Technology

2017-08-01

access to the GPU for general purpose processing .5 CUDA is designed to work easily with multiple programming languages , including Fortran. CUDA is a...Using Graphics Processing Unit (GPU) Computing by Leelinda P Dawson Approved for public release; distribution unlimited...The Performance Improvement of the Lagrangian Particle Dispersion Model (LPDM) Using Graphics Processing Unit (GPU) Computing by Leelinda
SU-E-T-395: Multi-GPU-Based VMAT Treatment Plan Optimization Using a Column-Generation Approach

DOE Office of Scientific and Technical Information (OSTI.GOV)

Tian, Z; Shi, F; Jia, X

Purpose: GPU has been employed to speed up VMAT optimizations from hours to minutes. However, its limited memory capacity makes it difficult to handle cases with a huge dose-deposition-coefficient (DDC) matrix, e.g. those with a large target size, multiple arcs, small beam angle intervals and/or small beamlet size. We propose multi-GPU-based VMAT optimization to solve this memory issue to make GPU-based VMAT more practical for clinical use. Methods: Our column-generation-based method generates apertures sequentially by iteratively searching for an optimal feasible aperture (referred as pricing problem, PP) and optimizing aperture intensities (referred as master problem, MP). The PP requires accessmore » to the large DDC matrix, which is implemented on a multi-GPU system. Each GPU stores a DDC sub-matrix corresponding to one fraction of beam angles and is only responsible for calculation related to those angles. Broadcast and parallel reduction schemes are adopted for inter-GPU data transfer. MP is a relatively small-scale problem and is implemented on one GPU. One headand- neck cancer case was used for test. Three different strategies for VMAT optimization on single GPU were also implemented for comparison: (S1) truncating DDC matrix to ignore its small value entries for optimization; (S2) transferring DDC matrix part by part to GPU during optimizations whenever needed; (S3) moving DDC matrix related calculation onto CPU. Results: Our multi-GPU-based implementation reaches a good plan within 1 minute. Although S1 was 10 seconds faster than our method, the obtained plan quality is worse. Both S2 and S3 handle the full DDC matrix and hence yield the same plan as in our method. However, the computation time is longer, namely 4 minutes and 30 minutes, respectively. Conclusion: Our multi-GPU-based VMAT optimization can effectively solve the limited memory issue with good plan quality and high efficiency, making GPUbased ultra-fast VMAT planning practical for real clinical use.« less
Validation of GPU based TomoTherapy dose calculation engine.

PubMed

Chen, Quan; Lu, Weiguo; Chen, Yu; Chen, Mingli; Henderson, Douglas; Sterpin, Edmond

2012-04-01

The graphic processing unit (GPU) based TomoTherapy convolution/superposition(C/S) dose engine (GPU dose engine) achieves a dramatic performance improvement over the traditional CPU-cluster based TomoTherapy dose engine (CPU dose engine). Besides the architecture difference between the GPU and CPU, there are several algorithm changes from the CPU dose engine to the GPU dose engine. These changes made the GPU dose slightly different from the CPU-cluster dose. In order for the commercial release of the GPU dose engine, its accuracy has to be validated. Thirty eight TomoTherapy phantom plans and 19 patient plans were calculated with both dose engines to evaluate the equivalency between the two dose engines. Gamma indices (Γ) were used for the equivalency evaluation. The GPU dose was further verified with the absolute point dose measurement with ion chamber and film measurements for phantom plans. Monte Carlo calculation was used as a reference for both dose engines in the accuracy evaluation in heterogeneous phantom and actual patients. The GPU dose engine showed excellent agreement with the current CPU dose engine. The majority of cases had over 99.99% of voxels with Γ(1%, 1 mm) < 1. The worst case observed in the phantom had 0.22% voxels violating the criterion. In patient cases, the worst percentage of voxels violating the criterion was 0.57%. For absolute point dose verification, all cases agreed with measurement to within ±3% with average error magnitude within 1%. All cases passed the acceptance criterion that more than 95% of the pixels have Γ(3%, 3 mm) < 1 in film measurement, and the average passing pixel percentage is 98.5%-99%. The GPU dose engine also showed similar degree of accuracy in heterogeneous media as the current TomoTherapy dose engine. It is verified and validated that the ultrafast TomoTherapy GPU dose engine can safely replace the existing TomoTherapy cluster based dose engine without degradation in dose accuracy.
SU-E-J-91: FFT Based Medical Image Registration Using a Graphics Processing Unit (GPU).

PubMed

Luce, J; Hoggarth, M; Lin, J; Block, A; Roeske, J

2012-06-01

To evaluate the efficiency gains obtained from using a Graphics Processing Unit (GPU) to perform a Fourier Transform (FT) based image registration. Fourier-based image registration involves obtaining the FT of the component images, and analyzing them in Fourier space to determine the translations and rotations of one image set relative to another. An important property of FT registration is that by enlarging the images (adding additional pixels), one can obtain translations and rotations with sub-pixel resolution. The expense, however, is an increased computational time. GPUs may decrease the computational time associated with FT image registration by taking advantage of their parallel architecture to perform matrix computations much more efficiently than a Central Processor Unit (CPU). In order to evaluate the computational gains produced by a GPU, images with known translational shifts were utilized. A program was written in the Interactive Data Language (IDL; Exelis, Boulder, CO) to performCPU-based calculations. Subsequently, the program was modified using GPU bindings (Tech-X, Boulder, CO) to perform GPU-based computation on the same system. Multiple image sizes were used, ranging from 256×256 to 2304×2304. The time required to complete the full algorithm by the CPU and GPU were benchmarked and the speed increase was defined as the ratio of the CPU-to-GPU computational time. The ratio of the CPU-to- GPU time was greater than 1.0 for all images, which indicates the GPU is performing the algorithm faster than the CPU. The smallest improvement, a 1.21 ratio, was found with the smallest image size of 256×256, and the largest speedup, a 4.25 ratio, was observed with the largest image size of 2304×2304. GPU programming resulted in a significant decrease in computational time associated with a FT image registration algorithm. The inclusion of the GPU may provide near real-time, sub-pixel registration capability. © 2012 American Association of Physicists in Medicine.

Next-generation acceleration and code optimization for light transport in turbid media using GPUs

PubMed Central

Alerstam, Erik; Lo, William Chun Yip; Han, Tianyi David; Rose, Jonathan; Andersson-Engels, Stefan; Lilge, Lothar

2010-01-01

A highly optimized Monte Carlo (MC) code package for simulating light transport is developed on the latest graphics processing unit (GPU) built for general-purpose computing from NVIDIA - the Fermi GPU. In biomedical optics, the MC method is the gold standard approach for simulating light transport in biological tissue, both due to its accuracy and its flexibility in modelling realistic, heterogeneous tissue geometry in 3-D. However, the widespread use of MC simulations in inverse problems, such as treatment planning for PDT, is limited by their long computation time. Despite its parallel nature, optimizing MC code on the GPU has been shown to be a challenge, particularly when the sharing of simulation result matrices among many parallel threads demands the frequent use of atomic instructions to access the slow GPU global memory. This paper proposes an optimization scheme that utilizes the fast shared memory to resolve the performance bottleneck caused by atomic access, and discusses numerous other optimization techniques needed to harness the full potential of the GPU. Using these techniques, a widely accepted MC code package in biophotonics, called MCML, was successfully accelerated on a Fermi GPU by approximately 600x compared to a state-of-the-art Intel Core i7 CPU. A skin model consisting of 7 layers was used as the standard simulation geometry. To demonstrate the possibility of GPU cluster computing, the same GPU code was executed on four GPUs, showing a linear improvement in performance with an increasing number of GPUs. The GPU-based MCML code package, named GPU-MCML, is compatible with a wide range of graphics cards and is released as an open-source software in two versions: an optimized version tuned for high performance and a simplified version for beginners (http://code.google.com/p/gpumcml). PMID:21258498
Memory-Scalable GPU Spatial Hierarchy Construction.

PubMed

Qiming Hou; Xin Sun; Kun Zhou; Lauterbach, C; Manocha, D

2011-04-01

Recent GPU algorithms for constructing spatial hierarchies have achieved promising performance for moderately complex models by using the breadth-first search (BFS) construction order. While being able to exploit the massive parallelism on the GPU, the BFS order also consumes excessive GPU memory, which becomes a serious issue for interactive applications involving very complex models with more than a few million triangles. In this paper, we propose to use the partial breadth-first search (PBFS) construction order to control memory consumption while maximizing performance. We apply the PBFS order to two hierarchy construction algorithms. The first algorithm is for kd-trees that automatically balances between the level of parallelism and intermediate memory usage. With PBFS, peak memory consumption during construction can be efficiently controlled without costly CPU-GPU data transfer. We also develop memory allocation strategies to effectively limit memory fragmentation. The resulting algorithm scales well with GPU memory and constructs kd-trees of models with millions of triangles at interactive rates on GPUs with 1 GB memory. Compared with existing algorithms, our algorithm is an order of magnitude more scalable for a given GPU memory bound. The second algorithm is for out-of-core bounding volume hierarchy (BVH) construction for very large scenes based on the PBFS construction order. At each iteration, all constructed nodes are dumped to the CPU memory, and the GPU memory is freed for the next iteration's use. In this way, the algorithm is able to build trees that are too large to be stored in the GPU memory. Experiments show that our algorithm can construct BVHs for scenes with up to 20 M triangles, several times larger than previous GPU algorithms.
The Metropolis Monte Carlo method with CUDA enabled Graphic Processing Units

DOE Office of Scientific and Technical Information (OSTI.GOV)

Hall, Clifford; School of Physics, Astronomy, and Computational Sciences, George Mason University, 4400 University Dr., Fairfax, VA 22030; Ji, Weixiao

2014-02-01

We present a CPU–GPU system for runtime acceleration of large molecular simulations using GPU computation and memory swaps. The memory architecture of the GPU can be used both as container for simulation data stored on the graphics card and as floating-point code target, providing an effective means for the manipulation of atomistic or molecular data on the GPU. To fully take advantage of this mechanism, efficient GPU realizations of algorithms used to perform atomistic and molecular simulations are essential. Our system implements a versatile molecular engine, including inter-molecule interactions and orientational variables for performing the Metropolis Monte Carlo (MMC) algorithm,more » which is one type of Markov chain Monte Carlo. By combining memory objects with floating-point code fragments we have implemented an MMC parallel engine that entirely avoids the communication time of molecular data at runtime. Our runtime acceleration system is a forerunner of a new class of CPU–GPU algorithms exploiting memory concepts combined with threading for avoiding bus bandwidth and communication. The testbed molecular system used here is a condensed phase system of oligopyrrole chains. A benchmark shows a size scaling speedup of 60 for systems with 210,000 pyrrole monomers. Our implementation can easily be combined with MPI to connect in parallel several CPU–GPU duets. -- Highlights: •We parallelize the Metropolis Monte Carlo (MMC) algorithm on one CPU—GPU duet. •The Adaptive Tempering Monte Carlo employs MMC and profits from this CPU—GPU implementation. •Our benchmark shows a size scaling-up speedup of 62 for systems with 225,000 particles. •The testbed involves a polymeric system of oligopyrroles in the condensed phase. •The CPU—GPU parallelization includes dipole—dipole and Mie—Jones classic potentials.« less
In vivo retinal and choroidal hypoxia imaging using a novel activatable hypoxia-selective near-infrared fluorescent probe.

PubMed

Fukuda, Shinichi; Okuda, Kensuke; Kishino, Genichiro; Hoshi, Sujin; Kawano, Itsuki; Fukuda, Masahiro; Yamashita, Toshiharu; Beheregaray, Simone; Nagano, Masumi; Ohneda, Osamu; Nagasawa, Hideko; Oshika, Tetsuro

2016-12-01

Retinal hypoxia plays a crucial role in ocular neovascular diseases, such as diabetic retinopathy, retinopathy of prematurity, and retinal vascular occlusion. Fluorescein angiography is useful for identifying the hypoxia extent by detecting non-perfusion areas or neovascularization, but its ability to detect early stages of hypoxia is limited. Recently, in vivo fluorescent probes for detecting hypoxia have been developed; however, these have not been extensively applied in ophthalmology. We evaluated whether a novel donor-excited photo-induced electron transfer (d-PeT) system based on an activatable hypoxia-selective near-infrared fluorescent (NIRF) probe (GPU-327) responds to both mild and severe hypoxia in various ocular ischemic diseases animal models. The ocular fundus examination offers unique opportunities for direct observation of the retina through the transparent cornea and lens. After injection of GPU-327 in various ocular hypoxic diseases of mouse and rabbit models, NIRF imaging in the ocular fundus can be performed noninvasively and easily by using commercially available fundus cameras. To investigate the safety of GPU-327, electroretinograms were also recorded after GPU-327 and PBS injection. Fluorescence of GPU-327 increased under mild hypoxic conditions in vitro. GPU-327 also yielded excellent signal-to-noise ratio without washing out in vivo experiments. By using near-infrared region, GPU-327 enables imaging of deeper ischemia, such as choroidal circulation. Additionally, from an electroretinogram, GPU-327 did not cause neurotoxicity. GPU-327 identified hypoxic area both in vivo and in vitro.
Large-scale neural circuit mapping data analysis accelerated with the graphical processing unit (GPU).

PubMed

Shi, Yulin; Veidenbaum, Alexander V; Nicolau, Alex; Xu, Xiangmin

2015-01-15

Modern neuroscience research demands computing power. Neural circuit mapping studies such as those using laser scanning photostimulation (LSPS) produce large amounts of data and require intensive computation for post hoc processing and analysis. Here we report on the design and implementation of a cost-effective desktop computer system for accelerated experimental data processing with recent GPU computing technology. A new version of Matlab software with GPU enabled functions is used to develop programs that run on Nvidia GPUs to harness their parallel computing power. We evaluated both the central processing unit (CPU) and GPU-enabled computational performance of our system in benchmark testing and practical applications. The experimental results show that the GPU-CPU co-processing of simulated data and actual LSPS experimental data clearly outperformed the multi-core CPU with up to a 22× speedup, depending on computational tasks. Further, we present a comparison of numerical accuracy between GPU and CPU computation to verify the precision of GPU computation. In addition, we show how GPUs can be effectively adapted to improve the performance of commercial image processing software such as Adobe Photoshop. To our best knowledge, this is the first demonstration of GPU application in neural circuit mapping and electrophysiology-based data processing. Together, GPU enabled computation enhances our ability to process large-scale data sets derived from neural circuit mapping studies, allowing for increased processing speeds while retaining data precision. Copyright © 2014 Elsevier B.V. All rights reserved.
Large scale neural circuit mapping data analysis accelerated with the graphical processing unit (GPU)

PubMed Central

Shi, Yulin; Veidenbaum, Alexander V.; Nicolau, Alex; Xu, Xiangmin

2014-01-01

Background Modern neuroscience research demands computing power. Neural circuit mapping studies such as those using laser scanning photostimulation (LSPS) produce large amounts of data and require intensive computation for post-hoc processing and analysis. New Method Here we report on the design and implementation of a cost-effective desktop computer system for accelerated experimental data processing with recent GPU computing technology. A new version of Matlab software with GPU enabled functions is used to develop programs that run on Nvidia GPUs to harness their parallel computing power. Results We evaluated both the central processing unit (CPU) and GPU-enabled computational performance of our system in benchmark testing and practical applications. The experimental results show that the GPU-CPU co-processing of simulated data and actual LSPS experimental data clearly outperformed the multi-core CPU with up to a 22x speedup, depending on computational tasks. Further, we present a comparison of numerical accuracy between GPU and CPU computation to verify the precision of GPU computation. In addition, we show how GPUs can be effectively adapted to improve the performance of commercial image processing software such as Adobe Photoshop. Comparison with Existing Method(s) To our best knowledge, this is the first demonstration of GPU application in neural circuit mapping and electrophysiology-based data processing. Conclusions Together, GPU enabled computation enhances our ability to process large-scale data sets derived from neural circuit mapping studies, allowing for increased processing speeds while retaining data precision. PMID:25277633
Evolução química de galáxias HII anãs

NASA Astrophysics Data System (ADS)

Ferraresi, M., Jr.; Cuisinier, F.; Telles, E.

2003-08-01

Galáxias HII anãs são galáxias de baixa massa, com alto conteúdo de gás, e se encontram em uma fase intensa de formação estelar. A taxa de formação estelar está tão alta nestas galáxias que não pode ter se mantido durante sua vida inteira. O tempo máximo de duração do episódio atual de formação estelar deve ser no máximo de algumas dezenas de milhões de anos, bem inferior à idade destas galáxias. Isto leva naturalmente a idéia de que já aconteceram surtos anteriores. Abundâncias químicas oferecem uma ferramenta poderosa para investigar a história evolutiva destas galáxias, porque aumentam de geração em geração estelar. O hidrogênio, o oxigênio, o nitrogênio produzem algumas das linhas mais importantes em um gás foto-ionizado, permitindo a determinação das abundâncias destes elementos facilmente. A dispersão das abundâncias em oxigênio e nitrogênio é significativa, sendo maior que os erros observacionais. O oxigênio é produzido em estrelas massivas, que explodem quase instâneamente, enquanto o nitrogênio é produzido em estrelas de massa intermediária, que só o liberam depois de um atraso de @ 500 mihões de anos. Construímos um modelo de evolução química semi-analítico, utilizando rendimentos empíricos baseados nas abundâncias observadas destes dois elementos. Conseguimos através deste modelo rudimentar explicar nas galáxias de mais baixas metalicidades as abundâncias de oxigênio e de nitrogênio, assim como a dispersão dos dados observacionais devida a formação estelar descontínua, e isto com um número baixo de surtos (1 ou 2, no máximo 3).
GPU accelerated cell-based adaptive mesh refinement on unstructured quadrilateral grid

NASA Astrophysics Data System (ADS)

Luo, Xisheng; Wang, Luying; Ran, Wei; Qin, Fenghua

2016-10-01

A GPU accelerated inviscid flow solver is developed on an unstructured quadrilateral grid in the present work. For the first time, the cell-based adaptive mesh refinement (AMR) is fully implemented on GPU for the unstructured quadrilateral grid, which greatly reduces the frequency of data exchange between GPU and CPU. Specifically, the AMR is processed with atomic operations to parallelize list operations, and null memory recycling is realized to improve the efficiency of memory utilization. It is found that results obtained by GPUs agree very well with the exact or experimental results in literature. An acceleration ratio of 4 is obtained between the parallel code running on the old GPU GT9800 and the serial code running on E3-1230 V2. With the optimization of configuring a larger L1 cache and adopting Shared Memory based atomic operations on the newer GPU C2050, an acceleration ratio of 20 is achieved. The parallelized cell-based AMR processes have achieved 2x speedup on GT9800 and 18x on Tesla C2050, which demonstrates that parallel running of the cell-based AMR method on GPU is feasible and efficient. Our results also indicate that the new development of GPU architecture benefits the fluid dynamics computing significantly.
Multi-GPU implementation of a VMAT treatment plan optimization algorithm.

PubMed

Tian, Zhen; Peng, Fei; Folkerts, Michael; Tan, Jun; Jia, Xun; Jiang, Steve B

2015-06-01

Volumetric modulated arc therapy (VMAT) optimization is a computationally challenging problem due to its large data size, high degrees of freedom, and many hardware constraints. High-performance graphics processing units (GPUs) have been used to speed up the computations. However, GPU's relatively small memory size cannot handle cases with a large dose-deposition coefficient (DDC) matrix in cases of, e.g., those with a large target size, multiple targets, multiple arcs, and/or small beamlet size. The main purpose of this paper is to report an implementation of a column-generation-based VMAT algorithm, previously developed in the authors' group, on a multi-GPU platform to solve the memory limitation problem. While the column-generation-based VMAT algorithm has been previously developed, the GPU implementation details have not been reported. Hence, another purpose is to present detailed techniques employed for GPU implementation. The authors also would like to utilize this particular problem as an example problem to study the feasibility of using a multi-GPU platform to solve large-scale problems in medical physics. The column-generation approach generates VMAT apertures sequentially by solving a pricing problem (PP) and a master problem (MP) iteratively. In the authors' method, the sparse DDC matrix is first stored on a CPU in coordinate list format (COO). On the GPU side, this matrix is split into four submatrices according to beam angles, which are stored on four GPUs in compressed sparse row format. Computation of beamlet price, the first step in PP, is accomplished using multi-GPUs. A fast inter-GPU data transfer scheme is accomplished using peer-to-peer access. The remaining steps of PP and MP problems are implemented on CPU or a single GPU due to their modest problem scale and computational loads. Barzilai and Borwein algorithm with a subspace step scheme is adopted here to solve the MP problem. A head and neck (H&N) cancer case is then used to validate the authors' method. The authors also compare their multi-GPU implementation with three different single GPU implementation strategies, i.e., truncating DDC matrix (S1), repeatedly transferring DDC matrix between CPU and GPU (S2), and porting computations involving DDC matrix to CPU (S3), in terms of both plan quality and computational efficiency. Two more H&N patient cases and three prostate cases are used to demonstrate the advantages of the authors' method. The authors' multi-GPU implementation can finish the optimization process within ∼ 1 min for the H&N patient case. S1 leads to an inferior plan quality although its total time was 10 s shorter than the multi-GPU implementation due to the reduced matrix size. S2 and S3 yield the same plan quality as the multi-GPU implementation but take ∼4 and ∼6 min, respectively. High computational efficiency was consistently achieved for the other five patient cases tested, with VMAT plans of clinically acceptable quality obtained within 23-46 s. Conversely, to obtain clinically comparable or acceptable plans for all six of these VMAT cases that the authors have tested in this paper, the optimization time needed in a commercial TPS system on CPU was found to be in an order of several minutes. The results demonstrate that the multi-GPU implementation of the authors' column-generation-based VMAT optimization can handle the large-scale VMAT optimization problem efficiently without sacrificing plan quality. The authors' study may serve as an example to shed some light on other large-scale medical physics problems that require multi-GPU techniques.
GPU Accelerated Clustering for Arbitrary Shapes in Geoscience Data

NASA Astrophysics Data System (ADS)

Pankratius, V.; Gowanlock, M.; Rude, C. M.; Li, J. D.

2016-12-01

Clustering algorithms have become a vital component in intelligent systems for geoscience that helps scientists discover and track phenomena of various kinds. Here, we outline advances in Density-Based Spatial Clustering of Applications with Noise (DBSCAN) which detects clusters of arbitrary shape that are common in geospatial data. In particular, we propose a hybrid CPU-GPU implementation of DBSCAN and highlight new optimization approaches on the GPU that allows clustering detection in parallel while optimizing data transport during CPU-GPU interactions. We employ an efficient batching scheme between the host and GPU such that limited GPU memory is not prohibitive when processing large and/or dense datasets. To minimize data transfer overhead, we estimate the total workload size and employ an execution that generates optimized batches that will not overflow the GPU buffer. This work is demonstrated on space weather Total Electron Content (TEC) datasets containing over 5 million measurements from instruments worldwide, and allows scientists to spot spatially coherent phenomena with ease. Our approach is up to 30 times faster than a sequential implementation and therefore accelerates discoveries in large datasets. We acknowledge support from NSF ACI-1442997.
A real-time spike sorting method based on the embedded GPU.

PubMed

Zelan Yang; Kedi Xu; Xiang Tian; Shaomin Zhang; Xiaoxiang Zheng

2017-07-01

Microelectrode arrays with hundreds of channels have been widely used to acquire neuron population signals in neuroscience studies. Online spike sorting is becoming one of the most important challenges for high-throughput neural signal acquisition systems. Graphic processing unit (GPU) with high parallel computing capability might provide an alternative solution for increasing real-time computational demands on spike sorting. This study reported a method of real-time spike sorting through computing unified device architecture (CUDA) which was implemented on an embedded GPU (NVIDIA JETSON Tegra K1, TK1). The sorting approach is based on the principal component analysis (PCA) and K-means. By analyzing the parallelism of each process, the method was further optimized in the thread memory model of GPU. Our results showed that the GPU-based classifier on TK1 is 37.92 times faster than the MATLAB-based classifier on PC while their accuracies were the same with each other. The high-performance computing features of embedded GPU demonstrated in our studies suggested that the embedded GPU provide a promising platform for the real-time neural signal processing.
A sample implementation for parallelizing Divide-and-Conquer algorithms on the GPU.

PubMed

Mei, Gang; Zhang, Jiayin; Xu, Nengxiong; Zhao, Kunyang

2018-01-01

The strategy of Divide-and-Conquer (D&C) is one of the frequently used programming patterns to design efficient algorithms in computer science, which has been parallelized on shared memory systems and distributed memory systems. Tzeng and Owens specifically developed a generic paradigm for parallelizing D&C algorithms on modern Graphics Processing Units (GPUs). In this paper, by following the generic paradigm proposed by Tzeng and Owens, we provide a new and publicly available GPU implementation of the famous D&C algorithm, QuickHull, to give a sample and guide for parallelizing D&C algorithms on the GPU. The experimental results demonstrate the practicality of our sample GPU implementation. Our research objective in this paper is to present a sample GPU implementation of a classical D&C algorithm to help interested readers to develop their own efficient GPU implementations with fewer efforts.
GPU MrBayes V3.1: MrBayes on Graphics Processing Units for Protein Sequence Data.

PubMed

Pang, Shuai; Stones, Rebecca J; Ren, Ming-Ming; Liu, Xiao-Guang; Wang, Gang; Xia, Hong-ju; Wu, Hao-Yang; Liu, Yang; Xie, Qiang

2015-09-01

We present a modified GPU (graphics processing unit) version of MrBayes, called ta(MC)(3) (GPU MrBayes V3.1), for Bayesian phylogenetic inference on protein data sets. Our main contributions are 1) utilizing 64-bit variables, thereby enabling ta(MC)(3) to process larger data sets than MrBayes; and 2) to use Kahan summation to improve accuracy, convergence rates, and consequently runtime. Versus the current fastest software, we achieve a speedup of up to around 2.5 (and up to around 90 vs. serial MrBayes), and more on multi-GPU hardware. GPU MrBayes V3.1 is available from http://sourceforge.net/projects/mrbayes-gpu/. © The Author 2015. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.
GPU-based High-Performance Computing for Radiation Therapy

PubMed Central

Jia, Xun; Ziegenhein, Peter; Jiang, Steve B.

2014-01-01

Recent developments in radiotherapy therapy demand high computation powers to solve challenging problems in a timely fashion in a clinical environment. Graphics processing unit (GPU), as an emerging high-performance computing platform, has been introduced to radiotherapy. It is particularly attractive due to its high computational power, small size, and low cost for facility deployment and maintenance. Over the past a few years, GPU-based high-performance computing in radiotherapy has experienced rapid developments. A tremendous amount of studies have been conducted, in which large acceleration factors compared with the conventional CPU platform have been observed. In this article, we will first give a brief introduction to the GPU hardware structure and programming model. We will then review the current applications of GPU in major imaging-related and therapy-related problems encountered in radiotherapy. A comparison of GPU with other platforms will also be presented. PMID:24486639
GPU accelerated implementation of NCI calculations using promolecular density.

PubMed

Rubez, Gaëtan; Etancelin, Jean-Matthieu; Vigouroux, Xavier; Krajecki, Michael; Boisson, Jean-Charles; Hénon, Eric

2017-05-30

The NCI approach is a modern tool to reveal chemical noncovalent interactions. It is particularly attractive to describe ligand-protein binding. A custom implementation for NCI using promolecular density is presented. It is designed to leverage the computational power of NVIDIA graphics processing unit (GPU) accelerators through the CUDA programming model. The code performances of three versions are examined on a test set of 144 systems. NCI calculations are particularly well suited to the GPU architecture, which reduces drastically the computational time. On a single compute node, the dual-GPU version leads to a 39-fold improvement for the biggest instance compared to the optimal OpenMP parallel run (C code, icc compiler) with 16 CPU cores. Energy consumption measurements carried out on both CPU and GPU NCI tests show that the GPU approach provides substantial energy savings. © 2017 Wiley Periodicals, Inc. © 2017 Wiley Periodicals, Inc.
A GPU-based calculation using the three-dimensional FDTD method for electromagnetic field analysis.

PubMed

Nagaoka, Tomoaki; Watanabe, Soichi

2010-01-01

Numerical simulations with the numerical human model using the finite-difference time domain (FDTD) method have recently been performed frequently in a number of fields in biomedical engineering. However, the FDTD calculation runs too slowly. We focus, therefore, on general purpose programming on the graphics processing unit (GPGPU). The three-dimensional FDTD method was implemented on the GPU using Compute Unified Device Architecture (CUDA). In this study, we used the NVIDIA Tesla C1060 as a GPGPU board. The performance of the GPU is evaluated in comparison with the performance of a conventional CPU and a vector supercomputer. The results indicate that three-dimensional FDTD calculations using a GPU can significantly reduce run time in comparison with that using a conventional CPU, even a native GPU implementation of the three-dimensional FDTD method, while the GPU/CPU speed ratio varies with the calculation domain and thread block size.
Development of High-speed Visualization System of Hypocenter Data Using CUDA-based GPU computing

NASA Astrophysics Data System (ADS)

Kumagai, T.; Okubo, K.; Uchida, N.; Matsuzawa, T.; Kawada, N.; Takeuchi, N.

2014-12-01

After the Great East Japan Earthquake on March 11, 2011, intelligent visualization of seismic information is becoming important to understand the earthquake phenomena. On the other hand, to date, the quantity of seismic data becomes enormous as a progress of high accuracy observation network; we need to treat many parameters (e.g., positional information, origin time, magnitude, etc.) to efficiently display the seismic information. Therefore, high-speed processing of data and image information is necessary to handle enormous amounts of seismic data. Recently, GPU (Graphic Processing Unit) is used as an acceleration tool for data processing and calculation in various study fields. This movement is called GPGPU (General Purpose computing on GPUs). In the last few years the performance of GPU keeps on improving rapidly. GPU computing gives us the high-performance computing environment at a lower cost than before. Moreover, use of GPU has an advantage of visualization of processed data, because GPU is originally architecture for graphics processing. In the GPU computing, the processed data is always stored in the video memory. Therefore, we can directly write drawing information to the VRAM on the video card by combining CUDA and the graphics API. In this study, we employ CUDA and OpenGL and/or DirectX to realize full-GPU implementation. This method makes it possible to write drawing information to the VRAM on the video card without PCIe bus data transfer: It enables the high-speed processing of seismic data. The present study examines the GPU computing-based high-speed visualization and the feasibility for high-speed visualization system of hypocenter data.
A fast three-dimensional gamma evaluation using a GPU utilizing texture memory for on-the-fly interpolations.

PubMed

Persoon, Lucas C G G; Podesta, Mark; van Elmpt, Wouter J C; Nijsten, Sebastiaan M J J G; Verhaegen, Frank

2011-07-01

A widely accepted method to quantify differences in dose distributions is the gamma (gamma) evaluation. Currently, almost all gamma implementations utilize the central processing unit (CPU). Recently, the graphics processing unit (GPU) has become a powerful platform for specific computing tasks. In this study, we describe the implementation of a 3D gamma evaluation using a GPU to improve calculation time. The gamma evaluation algorithm was implemented on an NVIDIA Tesla C2050 GPU using the compute unified device architecture (CUDA). First, several cubic virtual phantoms were simulated. These phantoms were tested with varying dose cube sizes and set-ups, introducing artificial dose differences. Second, to show applicability in clinical practice, five patient cases have been evaluated using the 3D dose distribution from a treatment planning system as the reference and the delivered dose determined during treatment as the comparison. A calculation time comparison between the CPU and GPU was made with varying thread-block sizes including the option of using texture or global memory. A GPU over CPU speed-up of 66 +/- 12 was achieved for the virtual phantoms. For the patient cases, a speed-up of 57 +/- 15 using the GPU was obtained. A thread-block size of 16 x 16 performed best in all cases. The use of texture memory improved the total calculation time, especially when interpolation was applied. Differences between the CPU and GPU gammas were negligible. The GPU and its features, such as texture memory, decreased the calculation time for gamma evaluations considerably without loss of accuracy.
GPU computing in medical physics: a review.

PubMed

Pratx, Guillem; Xing, Lei

2011-05-01

The graphics processing unit (GPU) has emerged as a competitive platform for computing massively parallel problems. Many computing applications in medical physics can be formulated as data-parallel tasks that exploit the capabilities of the GPU for reducing processing times. The authors review the basic principles of GPU computing as well as the main performance optimization techniques, and survey existing applications in three areas of medical physics, namely image reconstruction, dose calculation and treatment plan optimization, and image processing.
CPU-GPU hybrid accelerating the Zuker algorithm for RNA secondary structure prediction applications.

PubMed

Lei, Guoqing; Dou, Yong; Wan, Wen; Xia, Fei; Li, Rongchun; Ma, Meng; Zou, Dan

2012-01-01

Prediction of ribonucleic acid (RNA) secondary structure remains one of the most important research areas in bioinformatics. The Zuker algorithm is one of the most popular methods of free energy minimization for RNA secondary structure prediction. Thus far, few studies have been reported on the acceleration of the Zuker algorithm on general-purpose processors or on extra accelerators such as Field Programmable Gate-Array (FPGA) and Graphics Processing Units (GPU). To the best of our knowledge, no implementation combines both CPU and extra accelerators, such as GPUs, to accelerate the Zuker algorithm applications. In this paper, a CPU-GPU hybrid computing system that accelerates Zuker algorithm applications for RNA secondary structure prediction is proposed. The computing tasks are allocated between CPU and GPU for parallel cooperate execution. Performance differences between the CPU and the GPU in the task-allocation scheme are considered to obtain workload balance. To improve the hybrid system performance, the Zuker algorithm is optimally implemented with special methods for CPU and GPU architecture. Speedup of 15.93× over optimized multi-core SIMD CPU implementation and performance advantage of 16% over optimized GPU implementation are shown in the experimental results. More than 14% of the sequences are executed on CPU in the hybrid system. The system combining CPU and GPU to accelerate the Zuker algorithm is proven to be promising and can be applied to other bioinformatics applications.

Efficient Implementation of MrBayes on Multi-GPU

PubMed Central

Zhou, Jianfu; Liu, Xiaoguang; Wang, Gang

2013-01-01

MrBayes, using Metropolis-coupled Markov chain Monte Carlo (MCMCMC or (MC)3), is a popular program for Bayesian inference. As a leading method of using DNA data to infer phylogeny, the (MC)3 Bayesian algorithm and its improved and parallel versions are now not fast enough for biologists to analyze massive real-world DNA data. Recently, graphics processor unit (GPU) has shown its power as a coprocessor (or rather, an accelerator) in many fields. This article describes an efficient implementation a(MC)3 (aMCMCMC) for MrBayes (MC)3 on compute unified device architecture. By dynamically adjusting the task granularity to adapt to input data size and hardware configuration, it makes full use of GPU cores with different data sets. An adaptive method is also developed to split and combine DNA sequences to make full use of a large number of GPU cards. Furthermore, a new “node-by-node” task scheduling strategy is developed to improve concurrency, and several optimizing methods are used to reduce extra overhead. Experimental results show that a(MC)3 achieves up to 63× speedup over serial MrBayes on a single machine with one GPU card, and up to 170× speedup with four GPU cards, and up to 478× speedup with a 32-node GPU cluster. a(MC)3 is dramatically faster than all the previous (MC)3 algorithms and scales well to large GPU clusters. PMID:23493260
Accelerating Spaceborne SAR Imaging Using Multiple CPU/GPU Deep Collaborative Computing

PubMed Central

Zhang, Fan; Li, Guojun; Li, Wei; Hu, Wei; Hu, Yuxin

2016-01-01

With the development of synthetic aperture radar (SAR) technologies in recent years, the huge amount of remote sensing data brings challenges for real-time imaging processing. Therefore, high performance computing (HPC) methods have been presented to accelerate SAR imaging, especially the GPU based methods. In the classical GPU based imaging algorithm, GPU is employed to accelerate image processing by massive parallel computing, and CPU is only used to perform the auxiliary work such as data input/output (IO). However, the computing capability of CPU is ignored and underestimated. In this work, a new deep collaborative SAR imaging method based on multiple CPU/GPU is proposed to achieve real-time SAR imaging. Through the proposed tasks partitioning and scheduling strategy, the whole image can be generated with deep collaborative multiple CPU/GPU computing. In the part of CPU parallel imaging, the advanced vector extension (AVX) method is firstly introduced into the multi-core CPU parallel method for higher efficiency. As for the GPU parallel imaging, not only the bottlenecks of memory limitation and frequent data transferring are broken, but also kinds of optimized strategies are applied, such as streaming, parallel pipeline and so on. Experimental results demonstrate that the deep CPU/GPU collaborative imaging method enhances the efficiency of SAR imaging on single-core CPU by 270 times and realizes the real-time imaging in that the imaging rate outperforms the raw data generation rate. PMID:27070606
Efficient implementation of MrBayes on multi-GPU.

PubMed

Bao, Jie; Xia, Hongju; Zhou, Jianfu; Liu, Xiaoguang; Wang, Gang

2013-06-01

MrBayes, using Metropolis-coupled Markov chain Monte Carlo (MCMCMC or (MC)(3)), is a popular program for Bayesian inference. As a leading method of using DNA data to infer phylogeny, the (MC)(3) Bayesian algorithm and its improved and parallel versions are now not fast enough for biologists to analyze massive real-world DNA data. Recently, graphics processor unit (GPU) has shown its power as a coprocessor (or rather, an accelerator) in many fields. This article describes an efficient implementation a(MC)(3) (aMCMCMC) for MrBayes (MC)(3) on compute unified device architecture. By dynamically adjusting the task granularity to adapt to input data size and hardware configuration, it makes full use of GPU cores with different data sets. An adaptive method is also developed to split and combine DNA sequences to make full use of a large number of GPU cards. Furthermore, a new "node-by-node" task scheduling strategy is developed to improve concurrency, and several optimizing methods are used to reduce extra overhead. Experimental results show that a(MC)(3) achieves up to 63× speedup over serial MrBayes on a single machine with one GPU card, and up to 170× speedup with four GPU cards, and up to 478× speedup with a 32-node GPU cluster. a(MC)(3) is dramatically faster than all the previous (MC)(3) algorithms and scales well to large GPU clusters.
Architecting the Finite Element Method Pipeline for the GPU.

PubMed

Fu, Zhisong; Lewis, T James; Kirby, Robert M; Whitaker, Ross T

2014-02-01

The finite element method (FEM) is a widely employed numerical technique for approximating the solution of partial differential equations (PDEs) in various science and engineering applications. Many of these applications benefit from fast execution of the FEM pipeline. One way to accelerate the FEM pipeline is by exploiting advances in modern computational hardware, such as the many-core streaming processors like the graphical processing unit (GPU). In this paper, we present the algorithms and data-structures necessary to move the entire FEM pipeline to the GPU. First we propose an efficient GPU-based algorithm to generate local element information and to assemble the global linear system associated with the FEM discretization of an elliptic PDE. To solve the corresponding linear system efficiently on the GPU, we implement a conjugate gradient method preconditioned with a geometry-informed algebraic multi-grid (AMG) method preconditioner. We propose a new fine-grained parallelism strategy, a corresponding multigrid cycling stage and efficient data mapping to the many-core architecture of GPU. Comparison of our on-GPU assembly versus a traditional serial implementation on the CPU achieves up to an 87 × speedup. Focusing on the linear system solver alone, we achieve a speedup of up to 51 × versus use of a comparable state-of-the-art serial CPU linear system solver. Furthermore, the method compares favorably with other GPU-based, sparse, linear solvers.
Accelerating Spaceborne SAR Imaging Using Multiple CPU/GPU Deep Collaborative Computing.

PubMed

Zhang, Fan; Li, Guojun; Li, Wei; Hu, Wei; Hu, Yuxin

2016-04-07

With the development of synthetic aperture radar (SAR) technologies in recent years, the huge amount of remote sensing data brings challenges for real-time imaging processing. Therefore, high performance computing (HPC) methods have been presented to accelerate SAR imaging, especially the GPU based methods. In the classical GPU based imaging algorithm, GPU is employed to accelerate image processing by massive parallel computing, and CPU is only used to perform the auxiliary work such as data input/output (IO). However, the computing capability of CPU is ignored and underestimated. In this work, a new deep collaborative SAR imaging method based on multiple CPU/GPU is proposed to achieve real-time SAR imaging. Through the proposed tasks partitioning and scheduling strategy, the whole image can be generated with deep collaborative multiple CPU/GPU computing. In the part of CPU parallel imaging, the advanced vector extension (AVX) method is firstly introduced into the multi-core CPU parallel method for higher efficiency. As for the GPU parallel imaging, not only the bottlenecks of memory limitation and frequent data transferring are broken, but also kinds of optimized strategies are applied, such as streaming, parallel pipeline and so on. Experimental results demonstrate that the deep CPU/GPU collaborative imaging method enhances the efficiency of SAR imaging on single-core CPU by 270 times and realizes the real-time imaging in that the imaging rate outperforms the raw data generation rate.
SU-E-J-60: Efficient Monte Carlo Dose Calculation On CPU-GPU Heterogeneous Systems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Xiao, K; Chen, D. Z; Hu, X. S

Purpose: It is well-known that the performance of GPU-based Monte Carlo dose calculation implementations is bounded by memory bandwidth. One major cause of this bottleneck is the random memory writing patterns in dose deposition, which leads to several memory efficiency issues on GPU such as un-coalesced writing and atomic operations. We propose a new method to alleviate such issues on CPU-GPU heterogeneous systems, which achieves overall performance improvement for Monte Carlo dose calculation. Methods: Dose deposition is to accumulate dose into the voxels of a dose volume along the trajectories of radiation rays. Our idea is to partition this proceduremore » into the following three steps, which are fine-tuned for CPU or GPU: (1) each GPU thread writes dose results with location information to a buffer on GPU memory, which achieves fully-coalesced and atomic-free memory transactions; (2) the dose results in the buffer are transferred to CPU memory; (3) the dose volume is constructed from the dose buffer on CPU. We organize the processing of all radiation rays into streams. Since the steps within a stream use different hardware resources (i.e., GPU, DMA, CPU), we can overlap the execution of these steps for different streams by pipelining. Results: We evaluated our method using a Monte Carlo Convolution Superposition (MCCS) program and tested our implementation for various clinical cases on a heterogeneous system containing an Intel i7 quad-core CPU and an NVIDIA TITAN GPU. Comparing with a straightforward MCCS implementation on the same system (using both CPU and GPU for radiation ray tracing), our method gained 2-5X speedup without losing dose calculation accuracy. Conclusion: The results show that our new method improves the effective memory bandwidth and overall performance for MCCS on the CPU-GPU systems. Our proposed method can also be applied to accelerate other Monte Carlo dose calculation approaches. This research was supported in part by NSF under Grants CCF-1217906, and also in part by a research contract from the Sandia National Laboratories.« less
GPU-accelerated non-uniform fast Fourier transform-based compressive sensing spectral domain optical coherence tomography.

PubMed

Xu, Daguang; Huang, Yong; Kang, Jin U

2014-06-16

We implemented the graphics processing unit (GPU) accelerated compressive sensing (CS) non-uniform in k-space spectral domain optical coherence tomography (SD OCT). Kaiser-Bessel (KB) function and Gaussian function are used independently as the convolution kernel in the gridding-based non-uniform fast Fourier transform (NUFFT) algorithm with different oversampling ratios and kernel widths. Our implementation is compared with the GPU-accelerated modified non-uniform discrete Fourier transform (MNUDFT) matrix-based CS SD OCT and the GPU-accelerated fast Fourier transform (FFT)-based CS SD OCT. It was found that our implementation has comparable performance to the GPU-accelerated MNUDFT-based CS SD OCT in terms of image quality while providing more than 5 times speed enhancement. When compared to the GPU-accelerated FFT based-CS SD OCT, it shows smaller background noise and less side lobes while eliminating the need for the cumbersome k-space grid filling and the k-linear calibration procedure. Finally, we demonstrated that by using a conventional desktop computer architecture having three GPUs, real-time B-mode imaging can be obtained in excess of 30 fps for the GPU-accelerated NUFFT based CS SD OCT with frame size 2048(axial) × 1,000(lateral).
GPU-Meta-Storms: computing the structure similarities among massive amount of microbial community samples using GPU.

PubMed

Su, Xiaoquan; Wang, Xuetao; Jing, Gongchao; Ning, Kang

2014-04-01

The number of microbial community samples is increasing with exponential speed. Data-mining among microbial community samples could facilitate the discovery of valuable biological information that is still hidden in the massive data. However, current methods for the comparison among microbial communities are limited by their ability to process large amount of samples each with complex community structure. We have developed an optimized GPU-based software, GPU-Meta-Storms, to efficiently measure the quantitative phylogenetic similarity among massive amount of microbial community samples. Our results have shown that GPU-Meta-Storms would be able to compute the pair-wise similarity scores for 10 240 samples within 20 min, which gained a speed-up of >17 000 times compared with single-core CPU, and >2600 times compared with 16-core CPU. Therefore, the high-performance of GPU-Meta-Storms could facilitate in-depth data mining among massive microbial community samples, and make the real-time analysis and monitoring of temporal or conditional changes for microbial communities possible. GPU-Meta-Storms is implemented by CUDA (Compute Unified Device Architecture) and C++. Source code is available at http://www.computationalbioenergy.org/meta-storms.html.
The National Geoelectromagnetic Facility - an open access resource for ultra wideband electromagnetic geophysics (Invited)

NASA Astrophysics Data System (ADS)

Schultz, A.; Urquhart, S.; Slater, M.

2010-12-01

At present, the US academic community has access to two national electromagnetic (EM) instrument pools that support long-period magnetotelluric (MT) equipment suitable for crust-mantle scale studies. The requirements of near surface geophysics, hydrology, glaciology, as well as the full range of crust and mantle investigations require development of new capabilities in data acquisition with broader frequency bandwidth than these existing units, increased instrument numbers, and concomitant developments in 3D/4D data interpretation. NSF Major Research Instrumentation support has been obtained to meet these requirements by developing an initial set of next-generation instruments as a National Geoelectromagnetic Facility (NGF), available to all PIs on a cost recovery basis, and operated by Oregon State University (OSU). In contrast to existing instruments with data acquisition systems specialized to operate within specific frequency bands and for specific electromagnetic methods, the NGF model "Zen/5" instruments being co-developed by OSU and Zonge Research and Engineering Organization are based on modular receivers with a flexible number of digital and analog input channels, designed to acquire EM data at dc, and from frequencies ranging from micro-Hz to MHz. These systems can be deployed in a compact, low power configuration for extended deployments (e.g. for crust-mantle scale experiments), or in a high frequency sampling mode for near surface work. The NGF is also acquiring controlled source EM transmitters, so that investigators may carry out magnetotelluric, audio-MT, radiofrequency-MT, as well as time-domain/transient EM and DC resistivity studies. The instruments are designed to simultaneously accommodate multiple electric field dipole sensors, magnetic fluxgates and induction coil sensors. Sample rates as high as 2.5 MHz with resolution between 24 and 32 bits, depending on sample rate, are specified to allow for high fidelity recording of waveforms. The NGF is accepting instrument use requests from investigators planning electromagnetic surveys via webform submission on its web site ngf.coas.oregonstate.edu. The site is also a port of entry to request access to the 46 long period magnetotelluric instruments also operated by OSU as national instrument pools. Cyberinfrastructure support is available to investigators, including field computers, EM data processing software, and access to a hybrid CPU-GPU parallel computing environment, currently configured with dual Intel Westmere hexacore CPUs and 960 NVidia Tesla and 1792 Nvidia Fermi GPU cores. The capabilities of the Zen/5 receivers will be presented, with examples of data acquired from a recent shallow water marine controlled source experiment conducted in coastal Oregon as part of an effort to locate a buried submarine pipeline, using a 1.1 KW 256 Hz signal source imposed on the pipeline from shore. A Zen/5 prototype instrument, modified for marine use through support by the Oregon Wave Energy Trust, demonstrated the marine capabilities of the NGF instrument design.
Compute-unified device architecture implementation of a block-matching algorithm for multiple graphical processing unit cards

PubMed Central

Massanes, Francesc; Cadennes, Marie; Brankov, Jovan G.

2012-01-01

In this paper we describe and evaluate a fast implementation of a classical block matching motion estimation algorithm for multiple Graphical Processing Units (GPUs) using the Compute Unified Device Architecture (CUDA) computing engine. The implemented block matching algorithm (BMA) uses summed absolute difference (SAD) error criterion and full grid search (FS) for finding optimal block displacement. In this evaluation we compared the execution time of a GPU and CPU implementation for images of various sizes, using integer and non-integer search grids. The results show that use of a GPU card can shorten computation time by a factor of 200 times for integer and 1000 times for a non-integer search grid. The additional speedup for non-integer search grid comes from the fact that GPU has built-in hardware for image interpolation. Further, when using multiple GPU cards, the presented evaluation shows the importance of the data splitting method across multiple cards, but an almost linear speedup with a number of cards is achievable. In addition we compared execution time of the proposed FS GPU implementation with two existing, highly optimized non-full grid search CPU based motion estimations methods, namely implementation of the Pyramidal Lucas Kanade Optical flow algorithm in OpenCV and Simplified Unsymmetrical multi-Hexagon search in H.264/AVC standard. In these comparisons, FS GPU implementation still showed modest improvement even though the computational complexity of FS GPU implementation is substantially higher than non-FS CPU implementation. We also demonstrated that for an image sequence of 720×480 pixels in resolution, commonly used in video surveillance, the proposed GPU implementation is sufficiently fast for real-time motion estimation at 30 frames-per-second using two NVIDIA C1060 Tesla GPU cards. PMID:22347787
Compute-unified device architecture implementation of a block-matching algorithm for multiple graphical processing unit cards.

PubMed

Massanes, Francesc; Cadennes, Marie; Brankov, Jovan G

2011-07-01

In this paper we describe and evaluate a fast implementation of a classical block matching motion estimation algorithm for multiple Graphical Processing Units (GPUs) using the Compute Unified Device Architecture (CUDA) computing engine. The implemented block matching algorithm (BMA) uses summed absolute difference (SAD) error criterion and full grid search (FS) for finding optimal block displacement. In this evaluation we compared the execution time of a GPU and CPU implementation for images of various sizes, using integer and non-integer search grids.The results show that use of a GPU card can shorten computation time by a factor of 200 times for integer and 1000 times for a non-integer search grid. The additional speedup for non-integer search grid comes from the fact that GPU has built-in hardware for image interpolation. Further, when using multiple GPU cards, the presented evaluation shows the importance of the data splitting method across multiple cards, but an almost linear speedup with a number of cards is achievable.In addition we compared execution time of the proposed FS GPU implementation with two existing, highly optimized non-full grid search CPU based motion estimations methods, namely implementation of the Pyramidal Lucas Kanade Optical flow algorithm in OpenCV and Simplified Unsymmetrical multi-Hexagon search in H.264/AVC standard. In these comparisons, FS GPU implementation still showed modest improvement even though the computational complexity of FS GPU implementation is substantially higher than non-FS CPU implementation.We also demonstrated that for an image sequence of 720×480 pixels in resolution, commonly used in video surveillance, the proposed GPU implementation is sufficiently fast for real-time motion estimation at 30 frames-per-second using two NVIDIA C1060 Tesla GPU cards.
Study on efficiency of time computation in x-ray imaging simulation base on Monte Carlo algorithm using graphics processing unit

DOE Office of Scientific and Technical Information (OSTI.GOV)

Setiani, Tia Dwi, E-mail: tiadwisetiani@gmail.com; Suprijadi; Nuclear Physics and Biophysics Reaserch Division, Faculty of Mathematics and Natural Sciences, Institut Teknologi Bandung Jalan Ganesha 10 Bandung, 40132

Monte Carlo (MC) is one of the powerful techniques for simulation in x-ray imaging. MC method can simulate the radiation transport within matter with high accuracy and provides a natural way to simulate radiation transport in complex systems. One of the codes based on MC algorithm that are widely used for radiographic images simulation is MC-GPU, a codes developed by Andrea Basal. This study was aimed to investigate the time computation of x-ray imaging simulation in GPU (Graphics Processing Unit) compared to a standard CPU (Central Processing Unit). Furthermore, the effect of physical parameters to the quality of radiographic imagesmore » and the comparison of image quality resulted from simulation in the GPU and CPU are evaluated in this paper. The simulations were run in CPU which was simulated in serial condition, and in two GPU with 384 cores and 2304 cores. In simulation using GPU, each cores calculates one photon, so, a large number of photon were calculated simultaneously. Results show that the time simulations on GPU were significantly accelerated compared to CPU. The simulations on the 2304 core of GPU were performed about 64 -114 times faster than on CPU, while the simulation on the 384 core of GPU were performed about 20 – 31 times faster than in a single core of CPU. Another result shows that optimum quality of images from the simulation was gained at the history start from 10{sup 8} and the energy from 60 Kev to 90 Kev. Analyzed by statistical approach, the quality of GPU and CPU images are relatively the same.« less
Hadoop-MCC: Efficient Multiple Compound Comparison Algorithm Using Hadoop.

PubMed

Hua, Guan-Jie; Hung, Che-Lun; Tang, Chuan Yi

2018-01-01

In the past decade, the drug design technologies have been improved enormously. The computer-aided drug design (CADD) has played an important role in analysis and prediction in drug development, which makes the procedure more economical and efficient. However, computation with big data, such as ZINC containing more than 60 million compounds data and GDB-13 with more than 930 million small molecules, is a noticeable issue of time-consuming problem. Therefore, we propose a novel heterogeneous high performance computing method, named as Hadoop-MCC, integrating Hadoop and GPU, to copy with big chemical structure data efficiently. Hadoop-MCC gains the high availability and fault tolerance from Hadoop, as Hadoop is used to scatter input data to GPU devices and gather the results from GPU devices. Hadoop framework adopts mapper/reducer computation model. In the proposed method, mappers response for fetching SMILES data segments and perform LINGO method on GPU, then reducers collect all comparison results produced by mappers. Due to the high availability of Hadoop, all of LINGO computational jobs on mappers can be completed, even if some of the mappers encounter problems. A comparison of LINGO is performed on each the GPU device in parallel. According to the experimental results, the proposed method on multiple GPU devices can achieve better computational performance than the CUDA-MCC on a single GPU device. Hadoop-MCC is able to achieve scalability, high availability, and fault tolerance granted by Hadoop, and high performance as well by integrating computational power of both of Hadoop and GPU. It has been shown that using the heterogeneous architecture as Hadoop-MCC effectively can enhance better computational performance than on a single GPU device. Copyright© Bentham Science Publishers; For any queries, please email at epub@benthamscience.org.
GPU Computing in Bayesian Inference of Realized Stochastic Volatility Model

NASA Astrophysics Data System (ADS)

Takaishi, Tetsuya

2015-01-01

The realized stochastic volatility (RSV) model that utilizes the realized volatility as additional information has been proposed to infer volatility of financial time series. We consider the Bayesian inference of the RSV model by the Hybrid Monte Carlo (HMC) algorithm. The HMC algorithm can be parallelized and thus performed on the GPU for speedup. The GPU code is developed with CUDA Fortran. We compare the computational time in performing the HMC algorithm on GPU (GTX 760) and CPU (Intel i7-4770 3.4GHz) and find that the GPU can be up to 17 times faster than the CPU. We also code the program with OpenACC and find that appropriate coding can achieve the similar speedup with CUDA Fortran.
HASEonGPU-An adaptive, load-balanced MPI/GPU-code for calculating the amplified spontaneous emission in high power laser media

NASA Astrophysics Data System (ADS)

Eckert, C. H. J.; Zenker, E.; Bussmann, M.; Albach, D.

2016-10-01

We present an adaptive Monte Carlo algorithm for computing the amplified spontaneous emission (ASE) flux in laser gain media pumped by pulsed lasers. With the design of high power lasers in mind, which require large size gain media, we have developed the open source code HASEonGPU that is capable of utilizing multiple graphic processing units (GPUs). With HASEonGPU, time to solution is reduced to minutes on a medium size GPU cluster of 64 NVIDIA Tesla K20m GPUs and excellent speedup is achieved when scaling to multiple GPUs. Comparison of simulation results to measurements of ASE in Y b 3 + : Y AG ceramics show perfect agreement.
Local Alignment Tool Based on Hadoop Framework and GPU Architecture

PubMed Central

Hung, Che-Lun; Hua, Guan-Jie

2014-01-01

With the rapid growth of next generation sequencing technologies, such as Slex, more and more data have been discovered and published. To analyze such huge data the computational performance is an important issue. Recently, many tools, such as SOAP, have been implemented on Hadoop and GPU parallel computing architectures. BLASTP is an important tool, implemented on GPU architectures, for biologists to compare protein sequences. To deal with the big biology data, it is hard to rely on single GPU. Therefore, we implement a distributed BLASTP by combining Hadoop and multi-GPUs. The experimental results present that the proposed method can improve the performance of BLASTP on single GPU, and also it can achieve high availability and fault tolerance. PMID:24955362
Local alignment tool based on Hadoop framework and GPU architecture.

PubMed

Hung, Che-Lun; Hua, Guan-Jie

2014-01-01

With the rapid growth of next generation sequencing technologies, such as Slex, more and more data have been discovered and published. To analyze such huge data the computational performance is an important issue. Recently, many tools, such as SOAP, have been implemented on Hadoop and GPU parallel computing architectures. BLASTP is an important tool, implemented on GPU architectures, for biologists to compare protein sequences. To deal with the big biology data, it is hard to rely on single GPU. Therefore, we implement a distributed BLASTP by combining Hadoop and multi-GPUs. The experimental results present that the proposed method can improve the performance of BLASTP on single GPU, and also it can achieve high availability and fault tolerance.
A survey of CPU-GPU heterogeneous computing techniques

DOE PAGES

Mittal, Sparsh; Vetter, Jeffrey S.

2015-07-04

As both CPU and GPU become employed in a wide range of applications, it has been acknowledged that both of these processing units (PUs) have their unique features and strengths and hence, CPU-GPU collaboration is inevitable to achieve high-performance computing. This has motivated significant amount of research on heterogeneous computing techniques, along with the design of CPU-GPU fused chips and petascale heterogeneous supercomputers. In this paper, we survey heterogeneous computing techniques (HCTs) such as workload-partitioning which enable utilizing both CPU and GPU to improve performance and/or energy efficiency. We review heterogeneous computing approaches at runtime, algorithm, programming, compiler and applicationmore » level. Further, we review both discrete and fused CPU-GPU systems; and discuss benchmark suites designed for evaluating heterogeneous computing systems (HCSs). Furthermore, we believe that this paper will provide insights into working and scope of applications of HCTs to researchers and motivate them to further harness the computational powers of CPUs and GPUs to achieve the goal of exascale performance.« less
GPU accelerated manifold correction method for spinning compact binaries

NASA Astrophysics Data System (ADS)

Ran, Chong-xi; Liu, Song; Zhong, Shuang-ying

2018-04-01

The graphics processing unit (GPU) acceleration of the manifold correction algorithm based on the compute unified device architecture (CUDA) technology is designed to simulate the dynamic evolution of the Post-Newtonian (PN) Hamiltonian formulation of spinning compact binaries. The feasibility and the efficiency of parallel computation on GPU have been confirmed by various numerical experiments. The numerical comparisons show that the accuracy on GPU execution of manifold corrections method has a good agreement with the execution of codes on merely central processing unit (CPU-based) method. The acceleration ability when the codes are implemented on GPU can increase enormously through the use of shared memory and register optimization techniques without additional hardware costs, implying that the speedup is nearly 13 times as compared with the codes executed on CPU for phase space scan (including 314 × 314 orbits). In addition, GPU-accelerated manifold correction method is used to numerically study how dynamics are affected by the spin-induced quadrupole-monopole interaction for black hole binary system.
A survey of CPU-GPU heterogeneous computing techniques

DOE Office of Scientific and Technical Information (OSTI.GOV)

Mittal, Sparsh; Vetter, Jeffrey S.

As both CPU and GPU become employed in a wide range of applications, it has been acknowledged that both of these processing units (PUs) have their unique features and strengths and hence, CPU-GPU collaboration is inevitable to achieve high-performance computing. This has motivated significant amount of research on heterogeneous computing techniques, along with the design of CPU-GPU fused chips and petascale heterogeneous supercomputers. In this paper, we survey heterogeneous computing techniques (HCTs) such as workload-partitioning which enable utilizing both CPU and GPU to improve performance and/or energy efficiency. We review heterogeneous computing approaches at runtime, algorithm, programming, compiler and applicationmore » level. Further, we review both discrete and fused CPU-GPU systems; and discuss benchmark suites designed for evaluating heterogeneous computing systems (HCSs). Furthermore, we believe that this paper will provide insights into working and scope of applications of HCTs to researchers and motivate them to further harness the computational powers of CPUs and GPUs to achieve the goal of exascale performance.« less

CPU-GPU hybrid accelerating the Zuker algorithm for RNA secondary structure prediction applications

PubMed Central

2012-01-01

Background Prediction of ribonucleic acid (RNA) secondary structure remains one of the most important research areas in bioinformatics. The Zuker algorithm is one of the most popular methods of free energy minimization for RNA secondary structure prediction. Thus far, few studies have been reported on the acceleration of the Zuker algorithm on general-purpose processors or on extra accelerators such as Field Programmable Gate-Array (FPGA) and Graphics Processing Units (GPU). To the best of our knowledge, no implementation combines both CPU and extra accelerators, such as GPUs, to accelerate the Zuker algorithm applications. Results In this paper, a CPU-GPU hybrid computing system that accelerates Zuker algorithm applications for RNA secondary structure prediction is proposed. The computing tasks are allocated between CPU and GPU for parallel cooperate execution. Performance differences between the CPU and the GPU in the task-allocation scheme are considered to obtain workload balance. To improve the hybrid system performance, the Zuker algorithm is optimally implemented with special methods for CPU and GPU architecture. Conclusions Speedup of 15.93× over optimized multi-core SIMD CPU implementation and performance advantage of 16% over optimized GPU implementation are shown in the experimental results. More than 14% of the sequences are executed on CPU in the hybrid system. The system combining CPU and GPU to accelerate the Zuker algorithm is proven to be promising and can be applied to other bioinformatics applications. PMID:22369626
Dense GPU-enhanced surface reconstruction from stereo endoscopic images for intraoperative registration.

PubMed

Rohl, Sebastian; Bodenstedt, Sebastian; Suwelack, Stefan; Dillmann, Rudiger; Speidel, Stefanie; Kenngott, Hannes; Muller-Stich, Beat P

2012-03-01

In laparoscopic surgery, soft tissue deformations substantially change the surgical site, thus impeding the use of preoperative planning during intraoperative navigation. Extracting depth information from endoscopic images and building a surface model of the surgical field-of-view is one way to represent this constantly deforming environment. The information can then be used for intraoperative registration. Stereo reconstruction is a typical problem within computer vision. However, most of the available methods do not fulfill the specific requirements in a minimally invasive setting such as the need of real-time performance, the problem of view-dependent specular reflections and large curved areas with partly homogeneous or periodic textures and occlusions. In this paper, the authors present an approach toward intraoperative surface reconstruction based on stereo endoscopic images. The authors describe our answer to this problem through correspondence analysis, disparity correction and refinement, 3D reconstruction, point cloud smoothing and meshing. Real-time performance is achieved by implementing the algorithms on the gpu. The authors also present a new hybrid cpu-gpu algorithm that unifies the advantages of the cpu and the gpu version. In a comprehensive evaluation using in vivo data, in silico data from the literature and virtual data from a newly developed simulation environment, the cpu, the gpu, and the hybrid cpu-gpu versions of the surface reconstruction are compared to a cpu and a gpu algorithm from the literature. The recommended approach toward intraoperative surface reconstruction can be conducted in real-time depending on the image resolution (20 fps for the gpu and 14fps for the hybrid cpu-gpu version on resolution of 640 × 480). It is robust to homogeneous regions without texture, large image changes, noise or errors from camera calibration, and it reconstructs the surface down to sub millimeter accuracy. In all the experiments within the simulation environment, the mean distance to ground truth data is between 0.05 and 0.6 mm for the hybrid cpu-gpu version. The hybrid cpu-gpu algorithm shows a much more superior performance than its cpu and gpu counterpart (mean distance reduction 26% and 45%, respectively, for the experiments in the simulation environment). The recommended approach for surface reconstruction is fast, robust, and accurate. It can represent changes in the intraoperative environment and can be used to adapt a preoperative model within the surgical site by registration of these two models.
GPU real-time processing in NA62 trigger system

NASA Astrophysics Data System (ADS)

Ammendola, R.; Biagioni, A.; Chiozzi, S.; Cretaro, P.; Di Lorenzo, S.; Fantechi, R.; Fiorini, M.; Frezza, O.; Lamanna, G.; Lo Cicero, F.; Lonardo, A.; Martinelli, M.; Neri, I.; Paolucci, P. S.; Pastorelli, E.; Piandani, R.; Piccini, M.; Pontisso, L.; Rossetti, D.; Simula, F.; Sozzi, M.; Vicini, P.

2017-01-01

A commercial Graphics Processing Unit (GPU) is used to build a fast Level 0 (L0) trigger system tested parasitically with the TDAQ (Trigger and Data Acquisition systems) of the NA62 experiment at CERN. In particular, the parallel computing power of the GPU is exploited to perform real-time fitting in the Ring Imaging CHerenkov (RICH) detector. Direct GPU communication using a FPGA-based board has been used to reduce the data transmission latency. The performance of the system for multi-ring reconstrunction obtained during the NA62 physics run will be presented.
gWEGA: GPU-accelerated WEGA for molecular superposition and shape comparison.

PubMed

Yan, Xin; Li, Jiabo; Gu, Qiong; Xu, Jun

2014-06-05

Virtual screening of a large chemical library for drug lead identification requires searching/superimposing a large number of three-dimensional (3D) chemical structures. This article reports a graphic processing unit (GPU)-accelerated weighted Gaussian algorithm (gWEGA) that expedites shape or shape-feature similarity score-based virtual screening. With 86 GPU nodes (each node has one GPU card), gWEGA can screen 110 million conformations derived from an entire ZINC drug-like database with diverse antidiabetic agents as query structures within 2 s (i.e., screening more than 55 million conformations per second). The rapid screening speed was accomplished through the massive parallelization on multiple GPU nodes and rapid prescreening of 3D structures (based on their shape descriptors and pharmacophore feature compositions). Copyright © 2014 Wiley Periodicals, Inc.
Graphics Processing Unit Acceleration of Gyrokinetic Turbulence Simulations

NASA Astrophysics Data System (ADS)

Hause, Benjamin; Parker, Scott

2012-10-01

We find a substantial increase in on-node performance using Graphics Processing Unit (GPU) acceleration in gyrokinetic delta-f particle-in-cell simulation. Optimization is performed on a two-dimensional slab gyrokinetic particle simulation using the Portland Group Fortran compiler with the GPU accelerator compiler directives. We have implemented the GPU acceleration on a Core I7 gaming PC with a NVIDIA GTX 580 GPU. We find comparable, or better, acceleration relative to the NERSC DIRAC cluster with the NVIDIA Tesla C2050 computing processor. The Tesla C 2050 is about 2.6 times more expensive than the GTX 580 gaming GPU. Optimization strategies and comparisons between DIRAC and the gaming PC will be presented. We will also discuss progress on optimizing the comprehensive three dimensional general geometry GEM code.
Understanding GPU Power. A Survey of Profiling, Modeling, and Simulation Methods

DOE PAGES

Bridges, Robert A.; Imam, Neena; Mintz, Tiffany M.

2016-09-01

Modern graphics processing units (GPUs) have complex architectures that admit exceptional performance and energy efficiency for high throughput applications.Though GPUs consume large amounts of power, their use for high throughput applications facilitate state-of-the-art energy efficiency and performance. Consequently, continued development relies on understanding their power consumption. Our work is a survey of GPU power modeling and profiling methods with increased detail on noteworthy efforts. Moreover, as direct measurement of GPU power is necessary for model evaluation and parameter initiation, internal and external power sensors are discussed. Hardware counters, which are low-level tallies of hardware events, share strong correlation to powermore » use and performance. Statistical correlation between power and performance counters has yielded worthwhile GPU power models, yet the complexity inherent to GPU architectures presents new hurdles for power modeling. Developments and challenges of counter-based GPU power modeling is discussed. Often building on the counter-based models, research efforts for GPU power simulation, which make power predictions from input code and hardware knowledge, provide opportunities for optimization in programming or architectural design. Noteworthy strides in power simulations for GPUs are included along with their performance or functional simulator counterparts when appropriate. Lastly, possible directions for future research are discussed.« less
Acoustic reverse-time migration using GPU card and POSIX thread based on the adaptive optimal finite-difference scheme and the hybrid absorbing boundary condition

NASA Astrophysics Data System (ADS)

Cai, Xiaohui; Liu, Yang; Ren, Zhiming

2018-06-01

Reverse-time migration (RTM) is a powerful tool for imaging geologically complex structures such as steep-dip and subsalt. However, its implementation is quite computationally expensive. Recently, as a low-cost solution, the graphic processing unit (GPU) was introduced to improve the efficiency of RTM. In the paper, we develop three ameliorative strategies to implement RTM on GPU card. First, given the high accuracy and efficiency of the adaptive optimal finite-difference (FD) method based on least squares (LS) on central processing unit (CPU), we study the optimal LS-based FD method on GPU. Second, we develop the CPU-based hybrid absorbing boundary condition (ABC) to the GPU-based one by addressing two issues of the former when introduced to GPU card: time-consuming and chaotic threads. Third, for large-scale data, the combinatorial strategy for optimal checkpointing and efficient boundary storage is introduced for the trade-off between memory and recomputation. To save the time of communication between host and disk, the portable operating system interface (POSIX) thread is utilized to create the other CPU core at the checkpoints. Applications of the three strategies on GPU with the compute unified device architecture (CUDA) programming language in RTM demonstrate their efficiency and validity.
GPU-based prompt gamma ray imaging from boron neutron capture therapy.

PubMed

Yoon, Do-Kun; Jung, Joo-Young; Jo Hong, Key; Sil Lee, Keum; Suk Suh, Tae

2015-01-01

The purpose of this research is to perform the fast reconstruction of a prompt gamma ray image using a graphics processing unit (GPU) computation from boron neutron capture therapy (BNCT) simulations. To evaluate the accuracy of the reconstructed image, a phantom including four boron uptake regions (BURs) was used in the simulation. After the Monte Carlo simulation of the BNCT, the modified ordered subset expectation maximization reconstruction algorithm using the GPU computation was used to reconstruct the images with fewer projections. The computation times for image reconstruction were compared between the GPU and the central processing unit (CPU). Also, the accuracy of the reconstructed image was evaluated by a receiver operating characteristic (ROC) curve analysis. The image reconstruction time using the GPU was 196 times faster than the conventional reconstruction time using the CPU. For the four BURs, the area under curve values from the ROC curve were 0.6726 (A-region), 0.6890 (B-region), 0.7384 (C-region), and 0.8009 (D-region). The tomographic image using the prompt gamma ray event from the BNCT simulation was acquired using the GPU computation in order to perform a fast reconstruction during treatment. The authors verified the feasibility of the prompt gamma ray image reconstruction using the GPU computation for BNCT simulations.
Fast distributed large-pixel-count hologram computation using a GPU cluster.

PubMed

Pan, Yuechao; Xu, Xuewu; Liang, Xinan

2013-09-10

Large-pixel-count holograms are one essential part for big size holographic three-dimensional (3D) display, but the generation of such holograms is computationally demanding. In order to address this issue, we have built a graphics processing unit (GPU) cluster with 32.5 Tflop/s computing power and implemented distributed hologram computation on it with speed improvement techniques, such as shared memory on GPU, GPU level adaptive load balancing, and node level load distribution. Using these speed improvement techniques on the GPU cluster, we have achieved 71.4 times computation speed increase for 186M-pixel holograms. Furthermore, we have used the approaches of diffraction limits and subdivision of holograms to overcome the GPU memory limit in computing large-pixel-count holograms. 745M-pixel and 1.80G-pixel holograms were computed in 343 and 3326 s, respectively, for more than 2 million object points with RGB colors. Color 3D objects with 1.02M points were successfully reconstructed from 186M-pixel hologram computed in 8.82 s with all the above three speed improvement techniques. It is shown that distributed hologram computation using a GPU cluster is a promising approach to increase the computation speed of large-pixel-count holograms for large size holographic display.
Understanding GPU Power. A Survey of Profiling, Modeling, and Simulation Methods

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bridges, Robert A.; Imam, Neena; Mintz, Tiffany M.

Modern graphics processing units (GPUs) have complex architectures that admit exceptional performance and energy efficiency for high throughput applications.Though GPUs consume large amounts of power, their use for high throughput applications facilitate state-of-the-art energy efficiency and performance. Consequently, continued development relies on understanding their power consumption. Our work is a survey of GPU power modeling and profiling methods with increased detail on noteworthy efforts. Moreover, as direct measurement of GPU power is necessary for model evaluation and parameter initiation, internal and external power sensors are discussed. Hardware counters, which are low-level tallies of hardware events, share strong correlation to powermore » use and performance. Statistical correlation between power and performance counters has yielded worthwhile GPU power models, yet the complexity inherent to GPU architectures presents new hurdles for power modeling. Developments and challenges of counter-based GPU power modeling is discussed. Often building on the counter-based models, research efforts for GPU power simulation, which make power predictions from input code and hardware knowledge, provide opportunities for optimization in programming or architectural design. Noteworthy strides in power simulations for GPUs are included along with their performance or functional simulator counterparts when appropriate. Lastly, possible directions for future research are discussed.« less
Parallelized computation for computer simulation of electrocardiograms using personal computers with multi-core CPU and general-purpose GPU.

PubMed

Shen, Wenfeng; Wei, Daming; Xu, Weimin; Zhu, Xin; Yuan, Shizhong

2010-10-01

Biological computations like electrocardiological modelling and simulation usually require high-performance computing environments. This paper introduces an implementation of parallel computation for computer simulation of electrocardiograms (ECGs) in a personal computer environment with an Intel CPU of Core (TM) 2 Quad Q6600 and a GPU of Geforce 8800GT, with software support by OpenMP and CUDA. It was tested in three parallelization device setups: (a) a four-core CPU without a general-purpose GPU, (b) a general-purpose GPU plus 1 core of CPU, and (c) a four-core CPU plus a general-purpose GPU. To effectively take advantage of a multi-core CPU and a general-purpose GPU, an algorithm based on load-prediction dynamic scheduling was developed and applied to setting (c). In the simulation with 1600 time steps, the speedup of the parallel computation as compared to the serial computation was 3.9 in setting (a), 16.8 in setting (b), and 20.0 in setting (c). This study demonstrates that a current PC with a multi-core CPU and a general-purpose GPU provides a good environment for parallel computations in biological modelling and simulation studies. Copyright 2010 Elsevier Ireland Ltd. All rights reserved.
Efficient methods for implementation of multi-level nonrigid mass-preserving image registration on GPUs and multi-threaded CPUs.

PubMed

Ellingwood, Nathan D; Yin, Youbing; Smith, Matthew; Lin, Ching-Long

2016-04-01

Faster and more accurate methods for registration of images are important for research involved in conducting population-based studies that utilize medical imaging, as well as improvements for use in clinical applications. We present a novel computation- and memory-efficient multi-level method on graphics processing units (GPU) for performing registration of two computed tomography (CT) volumetric lung images. We developed a computation- and memory-efficient Diffeomorphic Multi-level B-Spline Transform Composite (DMTC) method to implement nonrigid mass-preserving registration of two CT lung images on GPU. The framework consists of a hierarchy of B-Spline control grids of increasing resolution. A similarity criterion known as the sum of squared tissue volume difference (SSTVD) was adopted to preserve lung tissue mass. The use of SSTVD consists of the calculation of the tissue volume, the Jacobian, and their derivatives, which makes its implementation on GPU challenging due to memory constraints. The use of the DMTC method enabled reduced computation and memory storage of variables with minimal communication between GPU and Central Processing Unit (CPU) due to ability to pre-compute values. The method was assessed on six healthy human subjects. Resultant GPU-generated displacement fields were compared against the previously validated CPU counterpart fields, showing good agreement with an average normalized root mean square error (nRMS) of 0.044±0.015. Runtime and performance speedup are compared between single-threaded CPU, multi-threaded CPU, and GPU algorithms. Best performance speedup occurs at the highest resolution in the GPU implementation for the SSTVD cost and cost gradient computations, with a speedup of 112 times that of the single-threaded CPU version and 11 times over the twelve-threaded version when considering average time per iteration using a Nvidia Tesla K20X GPU. The proposed GPU-based DMTC method outperforms its multi-threaded CPU version in terms of runtime. Total registration time reduced runtime to 2.9min on the GPU version, compared to 12.8min on twelve-threaded CPU version and 112.5min on a single-threaded CPU. Furthermore, the GPU implementation discussed in this work can be adapted for use of other cost functions that require calculation of the first derivatives. Copyright © 2015 Elsevier Ireland Ltd. All rights reserved.
Collaborating CPU and GPU for large-scale high-order CFD simulations with complex grids on the TianHe-1A supercomputer

NASA Astrophysics Data System (ADS)

Xu, Chuanfu; Deng, Xiaogang; Zhang, Lilun; Fang, Jianbin; Wang, Guangxue; Jiang, Yi; Cao, Wei; Che, Yonggang; Wang, Yongxian; Wang, Zhenghua; Liu, Wei; Cheng, Xinghua

2014-12-01

Programming and optimizing complex, real-world CFD codes on current many-core accelerated HPC systems is very challenging, especially when collaborating CPUs and accelerators to fully tap the potential of heterogeneous systems. In this paper, with a tri-level hybrid and heterogeneous programming model using MPI + OpenMP + CUDA, we port and optimize our high-order multi-block structured CFD software HOSTA on the GPU-accelerated TianHe-1A supercomputer. HOSTA adopts two self-developed high-order compact definite difference schemes WCNS and HDCS that can simulate flows with complex geometries. We present a dual-level parallelization scheme for efficient multi-block computation on GPUs and perform particular kernel optimizations for high-order CFD schemes. The GPU-only approach achieves a speedup of about 1.3 when comparing one Tesla M2050 GPU with two Xeon X5670 CPUs. To achieve a greater speedup, we collaborate CPU and GPU for HOSTA instead of using a naive GPU-only approach. We present a novel scheme to balance the loads between the store-poor GPU and the store-rich CPU. Taking CPU and GPU load balance into account, we improve the maximum simulation problem size per TianHe-1A node for HOSTA by 2.3×, meanwhile the collaborative approach can improve the performance by around 45% compared to the GPU-only approach. Further, to scale HOSTA on TianHe-1A, we propose a gather/scatter optimization to minimize PCI-e data transfer times for ghost and singularity data of 3D grid blocks, and overlap the collaborative computation and communication as far as possible using some advanced CUDA and MPI features. Scalability tests show that HOSTA can achieve a parallel efficiency of above 60% on 1024 TianHe-1A nodes. With our method, we have successfully simulated an EET high-lift airfoil configuration containing 800M cells and China's large civil airplane configuration containing 150M cells. To our best knowledge, those are the largest-scale CPU-GPU collaborative simulations that solve realistic CFD problems with both complex configurations and high-order schemes.
Collaborating CPU and GPU for large-scale high-order CFD simulations with complex grids on the TianHe-1A supercomputer

DOE Office of Scientific and Technical Information (OSTI.GOV)

Xu, Chuanfu, E-mail: xuchuanfu@nudt.edu.cn; Deng, Xiaogang; Zhang, Lilun

Programming and optimizing complex, real-world CFD codes on current many-core accelerated HPC systems is very challenging, especially when collaborating CPUs and accelerators to fully tap the potential of heterogeneous systems. In this paper, with a tri-level hybrid and heterogeneous programming model using MPI + OpenMP + CUDA, we port and optimize our high-order multi-block structured CFD software HOSTA on the GPU-accelerated TianHe-1A supercomputer. HOSTA adopts two self-developed high-order compact definite difference schemes WCNS and HDCS that can simulate flows with complex geometries. We present a dual-level parallelization scheme for efficient multi-block computation on GPUs and perform particular kernel optimizations formore » high-order CFD schemes. The GPU-only approach achieves a speedup of about 1.3 when comparing one Tesla M2050 GPU with two Xeon X5670 CPUs. To achieve a greater speedup, we collaborate CPU and GPU for HOSTA instead of using a naive GPU-only approach. We present a novel scheme to balance the loads between the store-poor GPU and the store-rich CPU. Taking CPU and GPU load balance into account, we improve the maximum simulation problem size per TianHe-1A node for HOSTA by 2.3×, meanwhile the collaborative approach can improve the performance by around 45% compared to the GPU-only approach. Further, to scale HOSTA on TianHe-1A, we propose a gather/scatter optimization to minimize PCI-e data transfer times for ghost and singularity data of 3D grid blocks, and overlap the collaborative computation and communication as far as possible using some advanced CUDA and MPI features. Scalability tests show that HOSTA can achieve a parallel efficiency of above 60% on 1024 TianHe-1A nodes. With our method, we have successfully simulated an EET high-lift airfoil configuration containing 800M cells and China's large civil airplane configuration containing 150M cells. To our best knowledge, those are the largest-scale CPU–GPU collaborative simulations that solve realistic CFD problems with both complex configurations and high-order schemes.« less
cellGPU: Massively parallel simulations of dynamic vertex models

NASA Astrophysics Data System (ADS)

Sussman, Daniel M.

2017-10-01

Vertex models represent confluent tissue by polygonal or polyhedral tilings of space, with the individual cells interacting via force laws that depend on both the geometry of the cells and the topology of the tessellation. This dependence on the connectivity of the cellular network introduces several complications to performing molecular-dynamics-like simulations of vertex models, and in particular makes parallelizing the simulations difficult. cellGPU addresses this difficulty and lays the foundation for massively parallelized, GPU-based simulations of these models. This article discusses its implementation for a pair of two-dimensional models, and compares the typical performance that can be expected between running cellGPU entirely on the CPU versus its performance when running on a range of commercial and server-grade graphics cards. By implementing the calculation of topological changes and forces on cells in a highly parallelizable fashion, cellGPU enables researchers to simulate time- and length-scales previously inaccessible via existing single-threaded CPU implementations. Program Files doi:http://dx.doi.org/10.17632/6j2cj29t3r.1 Licensing provisions: MIT Programming language: CUDA/C++ Nature of problem: Simulations of off-lattice "vertex models" of cells, in which the interaction forces depend on both the geometry and the topology of the cellular aggregate. Solution method: Highly parallelized GPU-accelerated dynamical simulations in which the force calculations and the topological features can be handled on either the CPU or GPU. Additional comments: The code is hosted at https://gitlab.com/dmsussman/cellGPU, with documentation additionally maintained at http://dmsussman.gitlab.io/cellGPUdocumentation
Incompressible SPH (ISPH) with fast Poisson solver on a GPU

NASA Astrophysics Data System (ADS)

Chow, Alex D.; Rogers, Benedict D.; Lind, Steven J.; Stansby, Peter K.

2018-05-01

This paper presents a fast incompressible SPH (ISPH) solver implemented to run entirely on a graphics processing unit (GPU) capable of simulating several millions of particles in three dimensions on a single GPU. The ISPH algorithm is implemented by converting the highly optimised open-source weakly-compressible SPH (WCSPH) code DualSPHysics to run ISPH on the GPU, combining it with the open-source linear algebra library ViennaCL for fast solutions of the pressure Poisson equation (PPE). Several challenges are addressed with this research: constructing a PPE matrix every timestep on the GPU for moving particles, optimising the limited GPU memory, and exploiting fast matrix solvers. The ISPH pressure projection algorithm is implemented as 4 separate stages, each with a particle sweep, including an algorithm for the population of the PPE matrix suitable for the GPU, and mixed precision storage methods. An accurate and robust ISPH boundary condition ideal for parallel processing is also established by adapting an existing WCSPH boundary condition for ISPH. A variety of validation cases are presented: an impulsively started plate, incompressible flow around a moving square in a box, and dambreaks (2-D and 3-D) which demonstrate the accuracy, flexibility, and speed of the methodology. Fragmentation of the free surface is shown to influence the performance of matrix preconditioners and therefore the PPE matrix solution time. The Jacobi preconditioner demonstrates robustness and reliability in the presence of fragmented flows. For a dambreak simulation, GPU speed ups demonstrate up to 10-18 times and 1.1-4.5 times compared to single-threaded and 16-threaded CPU run times respectively.
Testing and Validating Gadget2 for GPUs

NASA Astrophysics Data System (ADS)

Wibking, Benjamin; Holley-Bockelmann, K.; Berlind, A. A.

2013-01-01

We are currently upgrading a version of Gadget2 (Springel et al., 2005) that is optimized for NVIDIA's CUDA GPU architecture (Frigaard, unpublished) to work with the latest libraries and graphics cards. Preliminary tests of its performance indicate a ~40x speedup in the particle force tree approximation calculation, with overall speedup of 5-10x for cosmological simulations run with GPUs compared to running on the same CPU cores without GPU acceleration. We believe this speedup can be reasonably increased by an additional factor of two with futher optimization, including overlap of computation on CPU and GPU. Tests of single-precision GPU numerical fidelity currently indicate accuracy of the mass function and the spectral power density to within a few percent of extended-precision CPU results with the unmodified form of Gadget. Additionally, we plan to test and optimize the GPU code for Millenium-scale "grand challenge" simulations of >10^9 particles, a scale that has been previously untested with this code, with the aid of the NSF XSEDE flagship GPU-based supercomputing cluster codenamed "Keeneland." Current work involves additional validation of numerical results, extending the numerical precision of the GPU calculations to double precision, and evaluating performance/accuracy tradeoffs. We believe that this project, if successful, will yield substantial computational performance benefits to the N-body research community as the next generation of GPU supercomputing resources becomes available, both increasing the electrical power efficiency of ever-larger computations (making simulations possible a decade from now at scales and resolutions unavailable today) and accelerating the pace of research in the field.
Lossless data compression for improving the performance of a GPU-based beamformer.

PubMed

Lok, U-Wai; Fan, Gang-Wei; Li, Pai-Chi

2015-04-01

The powerful parallel computation ability of a graphics processing unit (GPU) makes it feasible to perform dynamic receive beamforming However, a real time GPU-based beamformer requires high data rate to transfer radio-frequency (RF) data from hardware to software memory, as well as from central processing unit (CPU) to GPU memory. There are data compression methods (e.g. Joint Photographic Experts Group (JPEG)) available for the hardware front end to reduce data size, alleviating the data transfer requirement of the hardware interface. Nevertheless, the required decoding time may even be larger than the transmission time of its original data, in turn degrading the overall performance of the GPU-based beamformer. This article proposes and implements a lossless compression-decompression algorithm, which enables in parallel compression and decompression of data. By this means, the data transfer requirement of hardware interface and the transmission time of CPU to GPU data transfers are reduced, without sacrificing image quality. In simulation results, the compression ratio reached around 1.7. The encoder design of our lossless compression approach requires low hardware resources and reasonable latency in a field programmable gate array. In addition, the transmission time of transferring data from CPU to GPU with the parallel decoding process improved by threefold, as compared with transferring original uncompressed data. These results show that our proposed lossless compression plus parallel decoder approach not only mitigate the transmission bandwidth requirement to transfer data from hardware front end to software system but also reduce the transmission time for CPU to GPU data transfer. © The Author(s) 2014.
GPU-accelerated automatic identification of robust beam setups for proton and carbon-ion radiotherapy

NASA Astrophysics Data System (ADS)

Ammazzalorso, F.; Bednarz, T.; Jelen, U.

2014-03-01

We demonstrate acceleration on graphic processing units (GPU) of automatic identification of robust particle therapy beam setups, minimizing negative dosimetric effects of Bragg peak displacement caused by treatment-time patient positioning errors. Our particle therapy research toolkit, RobuR, was extended with OpenCL support and used to implement calculation on GPU of the Port Homogeneity Index, a metric scoring irradiation port robustness through analysis of tissue density patterns prior to dose optimization and computation. Results were benchmarked against an independent native CPU implementation. Numerical results were in agreement between the GPU implementation and native CPU implementation. For 10 skull base cases, the GPU-accelerated implementation was employed to select beam setups for proton and carbon ion treatment plans, which proved to be dosimetrically robust, when recomputed in presence of various simulated positioning errors. From the point of view of performance, average running time on the GPU decreased by at least one order of magnitude compared to the CPU, rendering the GPU-accelerated analysis a feasible step in a clinical treatment planning interactive session. In conclusion, selection of robust particle therapy beam setups can be effectively accelerated on a GPU and become an unintrusive part of the particle therapy treatment planning workflow. Additionally, the speed gain opens new usage scenarios, like interactive analysis manipulation (e.g. constraining of some setup) and re-execution. Finally, through OpenCL portable parallelism, the new implementation is suitable also for CPU-only use, taking advantage of multiple cores, and can potentially exploit types of accelerators other than GPUs.
GPU-Based Point Cloud Superpositioning for Structural Comparisons of Protein Binding Sites.

PubMed

Leinweber, Matthias; Fober, Thomas; Freisleben, Bernd

2018-01-01

In this paper, we present a novel approach to solve the labeled point cloud superpositioning problem for performing structural comparisons of protein binding sites. The solution is based on a parallel evolution strategy that operates on large populations and runs on GPU hardware. The proposed evolution strategy reduces the likelihood of getting stuck in a local optimum of the multimodal real-valued optimization problem represented by labeled point cloud superpositioning. The performance of the GPU-based parallel evolution strategy is compared to a previously proposed CPU-based sequential approach for labeled point cloud superpositioning, indicating that the GPU-based parallel evolution strategy leads to qualitatively better results and significantly shorter runtimes, with speed improvements of up to a factor of 1,500 for large populations. Binary classification tests based on the ATP, NADH, and FAD protein subsets of CavBase, a database containing putative binding sites, show average classification rate improvements from about 92 percent (CPU) to 96 percent (GPU). Further experiments indicate that the proposed GPU-based labeled point cloud superpositioning approach can be superior to traditional protein comparison approaches based on sequence alignments.

irGPU.proton.Net: Irregular strong charge interaction networks of protonatable groups in protein molecules--a GPU solver using the fast multipole method and statistical thermodynamics.

PubMed

Kantardjiev, Alexander A

2015-04-05

A cluster of strongly interacting ionization groups in protein molecules with irregular ionization behavior is suggestive for specific structure-function relationship. However, their computational treatment is unconventional (e.g., lack of convergence in naive self-consistent iterative algorithm). The stringent evaluation requires evaluation of Boltzmann averaged statistical mechanics sums and electrostatic energy estimation for each microstate. irGPU: Irregular strong interactions in proteins--a GPU solver is novel solution to a versatile problem in protein biophysics--atypical protonation behavior of coupled groups. The computational severity of the problem is alleviated by parallelization (via GPU kernels) which is applied for the electrostatic interaction evaluation (including explicit electrostatics via the fast multipole method) as well as statistical mechanics sums (partition function) estimation. Special attention is given to the ease of the service and encapsulation of theoretical details without sacrificing rigor of computational procedures. irGPU is not just a solution-in-principle but a promising practical application with potential to entice community into deeper understanding of principles governing biomolecule mechanisms. © 2015 Wiley Periodicals, Inc.
Optimizing Tensor Contraction Expressions for Hybrid CPU-GPU Execution

DOE Office of Scientific and Technical Information (OSTI.GOV)

Ma, Wenjing; Krishnamoorthy, Sriram; Villa, Oreste

2013-03-01

Tensor contractions are generalized multidimensional matrix multiplication operations that widely occur in quantum chemistry. Efficient execution of tensor contractions on Graphics Processing Units (GPUs) requires several challenges to be addressed, including index permutation and small dimension-sizes reducing thread block utilization. Moreover, to apply the same optimizations to various expressions, we need a code generation tool. In this paper, we present our approach to automatically generate CUDA code to execute tensor contractions on GPUs, including management of data movement between CPU and GPU. To evaluate our tool, GPU-enabled code is generated for the most expensive contractions in CCSD(T), a key coupledmore » cluster method, and incorporated into NWChem, a popular computational chemistry suite. For this method, we demonstrate speedup over a factor of 8.4 using one GPU (instead of one core per node) and over 2.6 when utilizing the entire system using hybrid CPU+GPU solution with 2 GPUs and 5 cores (instead of 7 cores per node). Finally, we analyze the implementation behavior on future GPU systems.« less
A Hybrid CPU/GPU Pattern-Matching Algorithm for Deep Packet Inspection

PubMed Central

Chen, Yaw-Chung

2015-01-01

The large quantities of data now being transferred via high-speed networks have made deep packet inspection indispensable for security purposes. Scalable and low-cost signature-based network intrusion detection systems have been developed for deep packet inspection for various software platforms. Traditional approaches that only involve central processing units (CPUs) are now considered inadequate in terms of inspection speed. Graphic processing units (GPUs) have superior parallel processing power, but transmission bottlenecks can reduce optimal GPU efficiency. In this paper we describe our proposal for a hybrid CPU/GPU pattern-matching algorithm (HPMA) that divides and distributes the packet-inspecting workload between a CPU and GPU. All packets are initially inspected by the CPU and filtered using a simple pre-filtering algorithm, and packets that might contain malicious content are sent to the GPU for further inspection. Test results indicate that in terms of random payload traffic, the matching speed of our proposed algorithm was 3.4 times and 2.7 times faster than those of the AC-CPU and AC-GPU algorithms, respectively. Further, HPMA achieved higher energy efficiency than the other tested algorithms. PMID:26437335
A Hybrid CPU/GPU Pattern-Matching Algorithm for Deep Packet Inspection.

PubMed

Lee, Chun-Liang; Lin, Yi-Shan; Chen, Yaw-Chung

2015-01-01

The large quantities of data now being transferred via high-speed networks have made deep packet inspection indispensable for security purposes. Scalable and low-cost signature-based network intrusion detection systems have been developed for deep packet inspection for various software platforms. Traditional approaches that only involve central processing units (CPUs) are now considered inadequate in terms of inspection speed. Graphic processing units (GPUs) have superior parallel processing power, but transmission bottlenecks can reduce optimal GPU efficiency. In this paper we describe our proposal for a hybrid CPU/GPU pattern-matching algorithm (HPMA) that divides and distributes the packet-inspecting workload between a CPU and GPU. All packets are initially inspected by the CPU and filtered using a simple pre-filtering algorithm, and packets that might contain malicious content are sent to the GPU for further inspection. Test results indicate that in terms of random payload traffic, the matching speed of our proposed algorithm was 3.4 times and 2.7 times faster than those of the AC-CPU and AC-GPU algorithms, respectively. Further, HPMA achieved higher energy efficiency than the other tested algorithms.
Work stealing for GPU-accelerated parallel programs in a global address space framework: WORK STEALING ON GPU-ACCELERATED SYSTEMS

DOE Office of Scientific and Technical Information (OSTI.GOV)

Arafat, Humayun; Dinan, James; Krishnamoorthy, Sriram

Task parallelism is an attractive approach to automatically load balance the computation in a parallel system and adapt to dynamism exhibited by parallel systems. Exploiting task parallelism through work stealing has been extensively studied in shared and distributed-memory contexts. In this paper, we study the design of a system that uses work stealing for dynamic load balancing of task-parallel programs executed on hybrid distributed-memory CPU-graphics processing unit (GPU) systems in a global-address space framework. We take into account the unique nature of the accelerator model employed by GPUs, the significant performance difference between GPU and CPU execution as a functionmore » of problem size, and the distinct CPU and GPU memory domains. We consider various alternatives in designing a distributed work stealing algorithm for CPU-GPU systems, while taking into account the impact of task distribution and data movement overheads. These strategies are evaluated using microbenchmarks that capture various execution configurations as well as the state-of-the-art CCSD(T) application module from the computational chemistry domain.« less
Fast 3D elastic micro-seismic source location using new GPU features

NASA Astrophysics Data System (ADS)

Xue, Qingfeng; Wang, Yibo; Chang, Xu

2016-12-01

In this paper, we describe new GPU features and their applications in passive seismic - micro-seismic location. Locating micro-seismic events is quite important in seismic exploration, especially when searching for unconventional oil and gas resources. Different from the traditional ray-based methods, the wave equation method, such as the method we use in our paper, has a remarkable advantage in adapting to low signal-to-noise ratio conditions and does not need a person to select the data. However, because it has a conspicuous deficiency due to its computation cost, these methods are not widely used in industrial fields. To make the method useful, we implement imaging-like wave equation micro-seismic location in a 3D elastic media and use GPU to accelerate our algorithm. We also introduce some new GPU features into the implementation to solve the data transfer and GPU utilization problems. Numerical and field data experiments show that our method can achieve a more than 30% performance improvement in GPU implementation just by using these new features.
GPU-accelerated Tersoff potentials for massively parallel Molecular Dynamics simulations

NASA Astrophysics Data System (ADS)

Nguyen, Trung Dac

2017-03-01

The Tersoff potential is one of the empirical many-body potentials that has been widely used in simulation studies at atomic scales. Unlike pair-wise potentials, the Tersoff potential involves three-body terms, which require much more arithmetic operations and data dependency. In this contribution, we have implemented the GPU-accelerated version of several variants of the Tersoff potential for LAMMPS, an open-source massively parallel Molecular Dynamics code. Compared to the existing MPI implementation in LAMMPS, the GPU implementation exhibits a better scalability and offers a speedup of 2.2X when run on 1000 compute nodes on the Titan supercomputer. On a single node, the speedup ranges from 2.0 to 8.0 times, depending on the number of atoms per GPU and hardware configurations. The most notable features of our GPU-accelerated version include its design for MPI/accelerator heterogeneous parallelism, its compatibility with other functionalities in LAMMPS, its ability to give deterministic results and to support both NVIDIA CUDA- and OpenCL-enabled accelerators. Our implementation is now part of the GPU package in LAMMPS and accessible for public use.
Multi-GPU implementation of a VMAT treatment plan optimization algorithm

DOE Office of Scientific and Technical Information (OSTI.GOV)

Tian, Zhen, E-mail: Zhen.Tian@UTSouthwestern.edu, E-mail: Xun.Jia@UTSouthwestern.edu, E-mail: Steve.Jiang@UTSouthwestern.edu; Folkerts, Michael; Tan, Jun

Purpose: Volumetric modulated arc therapy (VMAT) optimization is a computationally challenging problem due to its large data size, high degrees of freedom, and many hardware constraints. High-performance graphics processing units (GPUs) have been used to speed up the computations. However, GPU’s relatively small memory size cannot handle cases with a large dose-deposition coefficient (DDC) matrix in cases of, e.g., those with a large target size, multiple targets, multiple arcs, and/or small beamlet size. The main purpose of this paper is to report an implementation of a column-generation-based VMAT algorithm, previously developed in the authors’ group, on a multi-GPU platform tomore » solve the memory limitation problem. While the column-generation-based VMAT algorithm has been previously developed, the GPU implementation details have not been reported. Hence, another purpose is to present detailed techniques employed for GPU implementation. The authors also would like to utilize this particular problem as an example problem to study the feasibility of using a multi-GPU platform to solve large-scale problems in medical physics. Methods: The column-generation approach generates VMAT apertures sequentially by solving a pricing problem (PP) and a master problem (MP) iteratively. In the authors’ method, the sparse DDC matrix is first stored on a CPU in coordinate list format (COO). On the GPU side, this matrix is split into four submatrices according to beam angles, which are stored on four GPUs in compressed sparse row format. Computation of beamlet price, the first step in PP, is accomplished using multi-GPUs. A fast inter-GPU data transfer scheme is accomplished using peer-to-peer access. The remaining steps of PP and MP problems are implemented on CPU or a single GPU due to their modest problem scale and computational loads. Barzilai and Borwein algorithm with a subspace step scheme is adopted here to solve the MP problem. A head and neck (H and N) cancer case is then used to validate the authors’ method. The authors also compare their multi-GPU implementation with three different single GPU implementation strategies, i.e., truncating DDC matrix (S1), repeatedly transferring DDC matrix between CPU and GPU (S2), and porting computations involving DDC matrix to CPU (S3), in terms of both plan quality and computational efficiency. Two more H and N patient cases and three prostate cases are used to demonstrate the advantages of the authors’ method. Results: The authors’ multi-GPU implementation can finish the optimization process within ∼1 min for the H and N patient case. S1 leads to an inferior plan quality although its total time was 10 s shorter than the multi-GPU implementation due to the reduced matrix size. S2 and S3 yield the same plan quality as the multi-GPU implementation but take ∼4 and ∼6 min, respectively. High computational efficiency was consistently achieved for the other five patient cases tested, with VMAT plans of clinically acceptable quality obtained within 23–46 s. Conversely, to obtain clinically comparable or acceptable plans for all six of these VMAT cases that the authors have tested in this paper, the optimization time needed in a commercial TPS system on CPU was found to be in an order of several minutes. Conclusions: The results demonstrate that the multi-GPU implementation of the authors’ column-generation-based VMAT optimization can handle the large-scale VMAT optimization problem efficiently without sacrificing plan quality. The authors’ study may serve as an example to shed some light on other large-scale medical physics problems that require multi-GPU techniques.« less
Implementation of Multipattern String Matching Accelerated with GPU for Intrusion Detection System

NASA Astrophysics Data System (ADS)

Nehemia, Rangga; Lim, Charles; Galinium, Maulahikmah; Rinaldi Widianto, Ahmad

2017-04-01

As Internet-related security threats continue to increase in terms of volume and sophistication, existing Intrusion Detection System is also being challenged to cope with the current Internet development. Multi Pattern String Matching algorithm accelerated with Graphical Processing Unit is being utilized to improve the packet scanning performance of the IDS. This paper implements a Multi Pattern String Matching algorithm, also called Parallel Failureless Aho Corasick accelerated with GPU to improve the performance of IDS. OpenCL library is used to allow the IDS to support various GPU, including popular GPU such as NVIDIA and AMD, used in our research. The experiment result shows that the application of Multi Pattern String Matching using GPU accelerated platform provides a speed up, by up to 141% in term of throughput compared to the previous research.
An efficient spectral crystal plasticity solver for GPU architectures

NASA Astrophysics Data System (ADS)

Malahe, Michael

2018-03-01

We present a spectral crystal plasticity (CP) solver for graphics processing unit (GPU) architectures that achieves a tenfold increase in efficiency over prior GPU solvers. The approach makes use of a database containing a spectral decomposition of CP simulations performed using a conventional iterative solver over a parameter space of crystal orientations and applied velocity gradients. The key improvements in efficiency come from reducing global memory transactions, exposing more instruction-level parallelism, reducing integer instructions and performing fast range reductions on trigonometric arguments. The scheme also makes more efficient use of memory than prior work, allowing for larger problems to be solved on a single GPU. We illustrate these improvements with a simulation of 390 million crystal grains on a consumer-grade GPU, which executes at a rate of 2.72 s per strain step.
Multi-GPU accelerated three-dimensional FDTD method for electromagnetic simulation.

PubMed

Nagaoka, Tomoaki; Watanabe, Soichi

2011-01-01

Numerical simulation with a numerical human model using the finite-difference time domain (FDTD) method has recently been performed in a number of fields in biomedical engineering. To improve the method's calculation speed and realize large-scale computing with the numerical human model, we adapt three-dimensional FDTD code to a multi-GPU environment using Compute Unified Device Architecture (CUDA). In this study, we used NVIDIA Tesla C2070 as GPGPU boards. The performance of multi-GPU is evaluated in comparison with that of a single GPU and vector supercomputer. The calculation speed with four GPUs was approximately 3.5 times faster than with a single GPU, and was slightly (approx. 1.3 times) slower than with the supercomputer. Calculation speed of the three-dimensional FDTD method using GPUs can significantly improve with an expanding number of GPUs.
ScipionCloud: An integrative and interactive gateway for large scale cryo electron microscopy image processing on commercial and academic clouds.

PubMed

Cuenca-Alba, Jesús; Del Cano, Laura; Gómez Blanco, Josué; de la Rosa Trevín, José Miguel; Conesa Mingo, Pablo; Marabini, Roberto; S Sorzano, Carlos Oscar; Carazo, Jose María

2017-10-01

New instrumentation for cryo electron microscopy (cryoEM) has significantly increased data collection rate as well as data quality, creating bottlenecks at the image processing level. Current image processing model of moving the acquired images from the data source (electron microscope) to desktops or local clusters for processing is encountering many practical limitations. However, computing may also take place in distributed and decentralized environments. In this way, cloud is a new form of accessing computing and storage resources on demand. Here, we evaluate on how this new computational paradigm can be effectively used by extending our current integrative framework for image processing, creating ScipionCloud. This new development has resulted in a full installation of Scipion both in public and private clouds, accessible as public "images", with all the required preinstalled cryoEM software, just requiring a Web browser to access all Graphical User Interfaces. We have profiled the performance of different configurations on Amazon Web Services and the European Federated Cloud, always on architectures incorporating GPU's, and compared them with a local facility. We have also analyzed the economical convenience of different scenarios, so cryoEM scientists have a clearer picture of the setup that is best suited for their needs and budgets. Copyright © 2017 The Authors. Published by Elsevier Inc. All rights reserved.
Efficient implementation of the 3D-DDA ray traversal algorithm on GPU and its application in radiation dose calculation.

PubMed

Xiao, Kai; Chen, Danny Z; Hu, X Sharon; Zhou, Bo

2012-12-01

The three-dimensional digital differential analyzer (3D-DDA) algorithm is a widely used ray traversal method, which is also at the core of many convolution∕superposition (C∕S) dose calculation approaches. However, porting existing C∕S dose calculation methods onto graphics processing unit (GPU) has brought challenges to retaining the efficiency of this algorithm. In particular, straightforward implementation of the original 3D-DDA algorithm inflicts a lot of branch divergence which conflicts with the GPU programming model and leads to suboptimal performance. In this paper, an efficient GPU implementation of the 3D-DDA algorithm is proposed, which effectively reduces such branch divergence and improves performance of the C∕S dose calculation programs running on GPU. The main idea of the proposed method is to convert a number of conditional statements in the original 3D-DDA algorithm into a set of simple operations (e.g., arithmetic, comparison, and logic) which are better supported by the GPU architecture. To verify and demonstrate the performance improvement, this ray traversal method was integrated into a GPU-based collapsed cone convolution∕superposition (CCCS) dose calculation program. The proposed method has been tested using a water phantom and various clinical cases on an NVIDIA GTX570 GPU. The CCCS dose calculation program based on the efficient 3D-DDA ray traversal implementation runs 1.42 ∼ 2.67× faster than the one based on the original 3D-DDA implementation, without losing any accuracy. The results show that the proposed method can effectively reduce branch divergence in the original 3D-DDA ray traversal algorithm and improve the performance of the CCCS program running on GPU. Considering the wide utilization of the 3D-DDA algorithm, various applications can benefit from this implementation method.
GPU-accelerated Monte Carlo convolution/superposition implementation for dose calculation.

PubMed

Zhou, Bo; Yu, Cedric X; Chen, Danny Z; Hu, X Sharon

2010-11-01

Dose calculation is a key component in radiation treatment planning systems. Its performance and accuracy are crucial to the quality of treatment plans as emerging advanced radiation therapy technologies are exerting ever tighter constraints on dose calculation. A common practice is to choose either a deterministic method such as the convolution/superposition (CS) method for speed or a Monte Carlo (MC) method for accuracy. The goal of this work is to boost the performance of a hybrid Monte Carlo convolution/superposition (MCCS) method by devising a graphics processing unit (GPU) implementation so as to make the method practical for day-to-day usage. Although the MCCS algorithm combines the merits of MC fluence generation and CS fluence transport, it is still not fast enough to be used as a day-to-day planning tool. To alleviate the speed issue of MC algorithms, the authors adopted MCCS as their target method and implemented a GPU-based version. In order to fully utilize the GPU computing power, the MCCS algorithm is modified to match the GPU hardware architecture. The performance of the authors' GPU-based implementation on an Nvidia GTX260 card is compared to a multithreaded software implementation on a quad-core system. A speedup in the range of 6.7-11.4x is observed for the clinical cases used. The less than 2% statistical fluctuation also indicates that the accuracy of the authors' GPU-based implementation is in good agreement with the results from the quad-core CPU implementation. This work shows that GPU is a feasible and cost-efficient solution compared to other alternatives such as using cluster machines or field-programmable gate arrays for satisfying the increasing demands on computation speed and accuracy of dose calculation. But there are also inherent limitations of using GPU for accelerating MC-type applications, which are also analyzed in detail in this article.
SU-E-T-423: Fast Photon Convolution Calculation with a 3D-Ideal Kernel On the GPU

DOE Office of Scientific and Technical Information (OSTI.GOV)

Moriya, S; Sato, M; Tachibana, H

Purpose: The calculation time is a trade-off for improving the accuracy of convolution dose calculation with fine calculation spacing of the KERMA kernel. We investigated to accelerate the convolution calculation using an ideal kernel on the Graphic Processing Units (GPU). Methods: The calculation was performed on the AMD graphics hardware of Dual FirePro D700 and our algorithm was implemented using the Aparapi that convert Java bytecode to OpenCL. The process of dose calculation was separated with the TERMA and KERMA steps. The dose deposited at the coordinate (x, y, z) was determined in the process. In the dose calculation runningmore » on the central processing unit (CPU) of Intel Xeon E5, the calculation loops were performed for all calculation points. On the GPU computation, all of the calculation processes for the points were sent to the GPU and the multi-thread computation was done. In this study, the dose calculation was performed in a water equivalent homogeneous phantom with 150{sup 3} voxels (2 mm calculation grid) and the calculation speed on the GPU to that on the CPU and the accuracy of PDD were compared. Results: The calculation time for the GPU and the CPU were 3.3 sec and 4.4 hour, respectively. The calculation speed for the GPU was 4800 times faster than that for the CPU. The PDD curve for the GPU was perfectly matched to that for the CPU. Conclusion: The convolution calculation with the ideal kernel on the GPU was clinically acceptable for time and may be more accurate in an inhomogeneous region. Intensity modulated arc therapy needs dose calculations for different gantry angles at many control points. Thus, it would be more practical that the kernel uses a coarse spacing technique if the calculation is faster while keeping the similar accuracy to a current treatment planning system.« less
Integrative multicellular biological modeling: a case study of 3D epidermal development using GPU algorithms

PubMed Central

2010-01-01

Background Simulation of sophisticated biological models requires considerable computational power. These models typically integrate together numerous biological phenomena such as spatially-explicit heterogeneous cells, cell-cell interactions, cell-environment interactions and intracellular gene networks. The recent advent of programming for graphical processing units (GPU) opens up the possibility of developing more integrative, detailed and predictive biological models while at the same time decreasing the computational cost to simulate those models. Results We construct a 3D model of epidermal development and provide a set of GPU algorithms that executes significantly faster than sequential central processing unit (CPU) code. We provide a parallel implementation of the subcellular element method for individual cells residing in a lattice-free spatial environment. Each cell in our epidermal model includes an internal gene network, which integrates cellular interaction of Notch signaling together with environmental interaction of basement membrane adhesion, to specify cellular state and behaviors such as growth and division. We take a pedagogical approach to describing how modeling methods are efficiently implemented on the GPU including memory layout of data structures and functional decomposition. We discuss various programmatic issues and provide a set of design guidelines for GPU programming that are instructive to avoid common pitfalls as well as to extract performance from the GPU architecture. Conclusions We demonstrate that GPU algorithms represent a significant technological advance for the simulation of complex biological models. We further demonstrate with our epidermal model that the integration of multiple complex modeling methods for heterogeneous multicellular biological processes is both feasible and computationally tractable using this new technology. We hope that the provided algorithms and source code will be a starting point for modelers to develop their own GPU implementations, and encourage others to implement their modeling methods on the GPU and to make that code available to the wider community. PMID:20696053
GPU-based optimal control for RWM feedback in tokamaks

DOE PAGES

Clement, Mitchell; Hanson, Jeremy; Bialek, Jim; ...

2017-08-23

The design and implementation of a Graphics Processing Unit (GPU) based Resistive Wall Mode (RWM) controller to perform feedback control on the RWM using Linear Quadratic Gaussian (LQG) control is reported herein. Also, the control algorithm is based on a simplified DIII-D VALEN model. By using NVIDIA’s GPUDirect RDMA framework, the digitizer and output module are able to write and read directly to and from GPU memory, eliminating memory transfers between host and GPU. In conclusion, the system and algorithm was able to reduce plasma response excited by externally applied fields by 32% during development experiments.
GPU-based optimal control for RWM feedback in tokamaks

DOE Office of Scientific and Technical Information (OSTI.GOV)

Clement, Mitchell; Hanson, Jeremy; Bialek, Jim

The design and implementation of a Graphics Processing Unit (GPU) based Resistive Wall Mode (RWM) controller to perform feedback control on the RWM using Linear Quadratic Gaussian (LQG) control is reported herein. Also, the control algorithm is based on a simplified DIII-D VALEN model. By using NVIDIA’s GPUDirect RDMA framework, the digitizer and output module are able to write and read directly to and from GPU memory, eliminating memory transfers between host and GPU. In conclusion, the system and algorithm was able to reduce plasma response excited by externally applied fields by 32% during development experiments.
NLSEmagic: Nonlinear Schrödinger equation multi-dimensional Matlab-based GPU-accelerated integrators using compact high-order schemes

NASA Astrophysics Data System (ADS)

Caplan, R. M.

2013-04-01

We present a simple to use, yet powerful code package called NLSEmagic to numerically integrate the nonlinear Schrödinger equation in one, two, and three dimensions. NLSEmagic is a high-order finite-difference code package which utilizes graphic processing unit (GPU) parallel architectures. The codes running on the GPU are many times faster than their serial counterparts, and are much cheaper to run than on standard parallel clusters. The codes are developed with usability and portability in mind, and therefore are written to interface with MATLAB utilizing custom GPU-enabled C codes with the MEX-compiler interface. The packages are freely distributed, including user manuals and set-up files. Catalogue identifier: AEOJ_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEOJ_v1_0.html Program obtainable from: CPC Program Library, Queen’s University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 124453 No. of bytes in distributed program, including test data, etc.: 4728604 Distribution format: tar.gz Programming language: C, CUDA, MATLAB. Computer: PC, MAC. Operating system: Windows, MacOS, Linux. Has the code been vectorized or parallelized?: Yes. Number of processors used: Single CPU, number of GPU processors dependent on chosen GPU card (max is currently 3072 cores on GeForce GTX 690). Supplementary material: Setup guide, Installation guide. RAM: Highly dependent on dimensionality and grid size. For typical medium-large problem size in three dimensions, 4GB is sufficient. Keywords: Nonlinear Schröodinger Equation, GPU, high-order finite difference, Bose-Einstien condensates. Classification: 4.3, 7.7. Nature of problem: Integrate solutions of the time-dependent one-, two-, and three-dimensional cubic nonlinear Schrödinger equation. Solution method: The integrators utilize a fully-explicit fourth-order Runge-Kutta scheme in time and both second- and fourth-order differencing in space. The integrators are written to run on NVIDIA GPUs and are interfaced with MATLAB including built-in visualization and analysis tools. Restrictions: The main restriction for the GPU integrators is the amount of RAM on the GPU as the code is currently only designed for running on a single GPU. Unusual features: Ability to visualize real-time simulations through the interaction of MATLAB and the compiled GPU integrators. Additional comments: Setup guide and Installation guide provided. Program has a dedicated web site at www.nlsemagic.com. Running time: A three-dimensional run with a grid dimension of 87×87×203 for 3360 time steps (100 non-dimensional time units) takes about one and a half minutes on a GeForce GTX 580 GPU card.
GPU-based prompt gamma ray imaging from boron neutron capture therapy

DOE Office of Scientific and Technical Information (OSTI.GOV)

Yoon, Do-Kun; Jung, Joo-Young; Suk Suh, Tae, E-mail: suhsanta@catholic.ac.kr

Purpose: The purpose of this research is to perform the fast reconstruction of a prompt gamma ray image using a graphics processing unit (GPU) computation from boron neutron capture therapy (BNCT) simulations. Methods: To evaluate the accuracy of the reconstructed image, a phantom including four boron uptake regions (BURs) was used in the simulation. After the Monte Carlo simulation of the BNCT, the modified ordered subset expectation maximization reconstruction algorithm using the GPU computation was used to reconstruct the images with fewer projections. The computation times for image reconstruction were compared between the GPU and the central processing unit (CPU).more » Also, the accuracy of the reconstructed image was evaluated by a receiver operating characteristic (ROC) curve analysis. Results: The image reconstruction time using the GPU was 196 times faster than the conventional reconstruction time using the CPU. For the four BURs, the area under curve values from the ROC curve were 0.6726 (A-region), 0.6890 (B-region), 0.7384 (C-region), and 0.8009 (D-region). Conclusions: The tomographic image using the prompt gamma ray event from the BNCT simulation was acquired using the GPU computation in order to perform a fast reconstruction during treatment. The authors verified the feasibility of the prompt gamma ray image reconstruction using the GPU computation for BNCT simulations.« less

TU-FG-BRB-07: GPU-Based Prompt Gamma Ray Imaging From Boron Neutron Capture Therapy

DOE Office of Scientific and Technical Information (OSTI.GOV)

Kim, S; Suh, T; Yoon, D

Purpose: The purpose of this research is to perform the fast reconstruction of a prompt gamma ray image using a graphics processing unit (GPU) computation from boron neutron capture therapy (BNCT) simulations. Methods: To evaluate the accuracy of the reconstructed image, a phantom including four boron uptake regions (BURs) was used in the simulation. After the Monte Carlo simulation of the BNCT, the modified ordered subset expectation maximization reconstruction algorithm using the GPU computation was used to reconstruct the images with fewer projections. The computation times for image reconstruction were compared between the GPU and the central processing unit (CPU).more » Also, the accuracy of the reconstructed image was evaluated by a receiver operating characteristic (ROC) curve analysis. Results: The image reconstruction time using the GPU was 196 times faster than the conventional reconstruction time using the CPU. For the four BURs, the area under curve values from the ROC curve were 0.6726 (A-region), 0.6890 (B-region), 0.7384 (C-region), and 0.8009 (D-region). Conclusion: The tomographic image using the prompt gamma ray event from the BNCT simulation was acquired using the GPU computation in order to perform a fast reconstruction during treatment. The authors verified the feasibility of the prompt gamma ray reconstruction using the GPU computation for BNCT simulations.« less
The CUBLAS and CULA based GPU acceleration of adaptive finite element framework for bioluminescence tomography.

PubMed

Zhang, Bo; Yang, Xiang; Yang, Fei; Yang, Xin; Qin, Chenghu; Han, Dong; Ma, Xibo; Liu, Kai; Tian, Jie

2010-09-13

In molecular imaging (MI), especially the optical molecular imaging, bioluminescence tomography (BLT) emerges as an effective imaging modality for small animal imaging. The finite element methods (FEMs), especially the adaptive finite element (AFE) framework, play an important role in BLT. The processing speed of the FEMs and the AFE framework still needs to be improved, although the multi-thread CPU technology and the multi CPU technology have already been applied. In this paper, we for the first time introduce a new kind of acceleration technology to accelerate the AFE framework for BLT, using the graphics processing unit (GPU). Besides the processing speed, the GPU technology can get a balance between the cost and performance. The CUBLAS and CULA are two main important and powerful libraries for programming on NVIDIA GPUs. With the help of CUBLAS and CULA, it is easy to code on NVIDIA GPU and there is no need to worry about the details about the hardware environment of a specific GPU. The numerical experiments are designed to show the necessity, effect and application of the proposed CUBLAS and CULA based GPU acceleration. From the results of the experiments, we can reach the conclusion that the proposed CUBLAS and CULA based GPU acceleration method can improve the processing speed of the AFE framework very much while getting a balance between cost and performance.
GPULife

DOE Office of Scientific and Technical Information (OSTI.GOV)

Kelly, Priscilla N.

2016-08-12

The code runs the Game of Life among several processors. Each processor uses CUDA to set up the grid's buffer on the GPU, and that buffer is fed to other GPU languages to apply the rules of the game of life. Only the halo is copied off the buffer and exchanged using MPI. This code looks at the interoperability of GPU languages among current platforms.
Multi-phase SPH modelling of violent hydrodynamics on GPUs

NASA Astrophysics Data System (ADS)

Mokos, Athanasios; Rogers, Benedict D.; Stansby, Peter K.; Domínguez, José M.

2015-11-01

This paper presents the acceleration of multi-phase smoothed particle hydrodynamics (SPH) using a graphics processing unit (GPU) enabling large numbers of particles (10-20 million) to be simulated on just a single GPU card. With novel hardware architectures such as a GPU, the optimum approach to implement a multi-phase scheme presents some new challenges. Many more particles must be included in the calculation and there are very different speeds of sound in each phase with the largest speed of sound determining the time step. This requires efficient computation. To take full advantage of the hardware acceleration provided by a single GPU for a multi-phase simulation, four different algorithms are investigated: conditional statements, binary operators, separate particle lists and an intermediate global function. Runtime results show that the optimum approach needs to employ separate cell and neighbour lists for each phase. The profiler shows that this approach leads to a reduction in both memory transactions and arithmetic operations giving significant runtime gains. The four different algorithms are compared to the efficiency of the optimised single-phase GPU code, DualSPHysics, for 2-D and 3-D simulations which indicate that the multi-phase functionality has a significant computational overhead. A comparison with an optimised CPU code shows a speed up of an order of magnitude over an OpenMP simulation with 8 threads and two orders of magnitude over a single thread simulation. A demonstration of the multi-phase SPH GPU code is provided by a 3-D dam break case impacting an obstacle. This shows better agreement with experimental results than an equivalent single-phase code. The multi-phase GPU code enables a convergence study to be undertaken on a single GPU with a large number of particles that otherwise would have required large high performance computing resources.
Democratic population decisions result in robust policy-gradient learning: a parametric study with GPU simulations.

PubMed

Richmond, Paul; Buesing, Lars; Giugliano, Michele; Vasilaki, Eleni

2011-05-04

High performance computing on the Graphics Processing Unit (GPU) is an emerging field driven by the promise of high computational power at a low cost. However, GPU programming is a non-trivial task and moreover architectural limitations raise the question of whether investing effort in this direction may be worthwhile. In this work, we use GPU programming to simulate a two-layer network of Integrate-and-Fire neurons with varying degrees of recurrent connectivity and investigate its ability to learn a simplified navigation task using a policy-gradient learning rule stemming from Reinforcement Learning. The purpose of this paper is twofold. First, we want to support the use of GPUs in the field of Computational Neuroscience. Second, using GPU computing power, we investigate the conditions under which the said architecture and learning rule demonstrate best performance. Our work indicates that networks featuring strong Mexican-Hat-shaped recurrent connections in the top layer, where decision making is governed by the formation of a stable activity bump in the neural population (a "non-democratic" mechanism), achieve mediocre learning results at best. In absence of recurrent connections, where all neurons "vote" independently ("democratic") for a decision via population vector readout, the task is generally learned better and more robustly. Our study would have been extremely difficult on a desktop computer without the use of GPU programming. We present the routines developed for this purpose and show that a speed improvement of 5x up to 42x is provided versus optimised Python code. The higher speed is achieved when we exploit the parallelism of the GPU in the search of learning parameters. This suggests that efficient GPU programming can significantly reduce the time needed for simulating networks of spiking neurons, particularly when multiple parameter configurations are investigated.
Adaptive multi-GPU Exchange Monte Carlo for the 3D Random Field Ising Model

NASA Astrophysics Data System (ADS)

Navarro, Cristóbal A.; Huang, Wei; Deng, Youjin

2016-08-01

This work presents an adaptive multi-GPU Exchange Monte Carlo approach for the simulation of the 3D Random Field Ising Model (RFIM). The design is based on a two-level parallelization. The first level, spin-level parallelism, maps the parallel computation as optimal 3D thread-blocks that simulate blocks of spins in shared memory with minimal halo surface, assuming a constant block volume. The second level, replica-level parallelism, uses multi-GPU computation to handle the simulation of an ensemble of replicas. CUDA's concurrent kernel execution feature is used in order to fill the occupancy of each GPU with many replicas, providing a performance boost that is more notorious at the smallest values of L. In addition to the two-level parallel design, the work proposes an adaptive multi-GPU approach that dynamically builds a proper temperature set free of exchange bottlenecks. The strategy is based on mid-point insertions at the temperature gaps where the exchange rate is most compromised. The extra work generated by the insertions is balanced across the GPUs independently of where the mid-point insertions were performed. Performance results show that spin-level performance is approximately two orders of magnitude faster than a single-core CPU version and one order of magnitude faster than a parallel multi-core CPU version running on 16-cores. Multi-GPU performance is highly convenient under a weak scaling setting, reaching up to 99 % efficiency as long as the number of GPUs and L increase together. The combination of the adaptive approach with the parallel multi-GPU design has extended our possibilities of simulation to sizes of L = 32 , 64 for a workstation with two GPUs. Sizes beyond L = 64 can eventually be studied using larger multi-GPU systems.
CUDASW++ 3.0: accelerating Smith-Waterman protein database search by coupling CPU and GPU SIMD instructions.

PubMed

Liu, Yongchao; Wirawan, Adrianto; Schmidt, Bertil

2013-04-04

The maximal sensitivity for local alignments makes the Smith-Waterman algorithm a popular choice for protein sequence database search based on pairwise alignment. However, the algorithm is compute-intensive due to a quadratic time complexity. Corresponding runtimes are further compounded by the rapid growth of sequence databases. We present CUDASW++ 3.0, a fast Smith-Waterman protein database search algorithm, which couples CPU and GPU SIMD instructions and carries out concurrent CPU and GPU computations. For the CPU computation, this algorithm employs SSE-based vector execution units as accelerators. For the GPU computation, we have investigated for the first time a GPU SIMD parallelization, which employs CUDA PTX SIMD video instructions to gain more data parallelism beyond the SIMT execution model. Moreover, sequence alignment workloads are automatically distributed over CPUs and GPUs based on their respective compute capabilities. Evaluation on the Swiss-Prot database shows that CUDASW++ 3.0 gains a performance improvement over CUDASW++ 2.0 up to 2.9 and 3.2, with a maximum performance of 119.0 and 185.6 GCUPS, on a single-GPU GeForce GTX 680 and a dual-GPU GeForce GTX 690 graphics card, respectively. In addition, our algorithm has demonstrated significant speedups over other top-performing tools: SWIPE and BLAST+. CUDASW++ 3.0 is written in CUDA C++ and PTX assembly languages, targeting GPUs based on the Kepler architecture. This algorithm obtains significant speedups over its predecessor: CUDASW++ 2.0, by benefiting from the use of CPU and GPU SIMD instructions as well as the concurrent execution on CPUs and GPUs. The source code and the simulated data are available at http://cudasw.sourceforge.net.
Revisiting Molecular Dynamics on a CPU/GPU system: Water Kernel and SHAKE Parallelization.

PubMed

Ruymgaart, A Peter; Elber, Ron

2012-11-13

We report Graphics Processing Unit (GPU) and Open-MP parallel implementations of water-specific force calculations and of bond constraints for use in Molecular Dynamics simulations. We focus on a typical laboratory computing-environment in which a CPU with a few cores is attached to a GPU. We discuss in detail the design of the code and we illustrate performance comparable to highly optimized codes such as GROMACS. Beside speed our code shows excellent energy conservation. Utilization of water-specific lists allows the efficient calculations of non-bonded interactions that include water molecules and results in a speed-up factor of more than 40 on the GPU compared to code optimized on a single CPU core for systems larger than 20,000 atoms. This is up four-fold from a factor of 10 reported in our initial GPU implementation that did not include a water-specific code. Another optimization is the implementation of constrained dynamics entirely on the GPU. The routine, which enforces constraints of all bonds, runs in parallel on multiple Open-MP cores or entirely on the GPU. It is based on Conjugate Gradient solution of the Lagrange multipliers (CG SHAKE). The GPU implementation is partially in double precision and requires no communication with the CPU during the execution of the SHAKE algorithm. The (parallel) implementation of SHAKE allows an increase of the time step to 2.0fs while maintaining excellent energy conservation. Interestingly, CG SHAKE is faster than the usual bond relaxation algorithm even on a single core if high accuracy is expected. The significant speedup of the optimized components transfers the computational bottleneck of the MD calculation to the reciprocal part of Particle Mesh Ewald (PME).
GPU-accelerated track reconstruction in the ALICE High Level Trigger

NASA Astrophysics Data System (ADS)

Rohr, David; Gorbunov, Sergey; Lindenstruth, Volker; ALICE Collaboration

2017-10-01

ALICE (A Large Heavy Ion Experiment) is one of the four major experiments at the Large Hadron Collider (LHC) at CERN. The High Level Trigger (HLT) is an online compute farm which reconstructs events measured by the ALICE detector in real-time. The most compute-intensive part is the reconstruction of particle trajectories called tracking and the most important detector for tracking is the Time Projection Chamber (TPC). The HLT uses a GPU-accelerated algorithm for TPC tracking that is based on the Cellular Automaton principle and on the Kalman filter. The GPU tracking has been running in 24/7 operation since 2012 in LHC Run 1 and 2. In order to better leverage the potential of the GPUs, and speed up the overall HLT reconstruction, we plan to bring more reconstruction steps (e.g. the tracking for other detectors) onto the GPUs. There are several tasks running so far on the CPU that could benefit from cooperation with the tracking, which is hardly feasible at the moment due to the delay of the PCI Express transfers. Moving more steps onto the GPU, and processing them on the GPU at once, will reduce PCI Express transfers and free up CPU resources. On top of that, modern GPUs and GPU programming APIs provide new features which are not yet exploited by the TPC tracking. We present our new developments for GPU reconstruction, both with a focus on the online reconstruction on GPU for the online offline computing upgrade in ALICE during LHC Run 3, and also taking into account how the current HLT in Run 2 can profit from these improvements.
Resultados de médio e longo prazo do tratamento endovenoso de varizes com laser de diodo em 1940 nm: análise crítica e considerações técnicas

PubMed Central

Viarengo, Luiz Marcelo Aiello; Viarengo, Gabriel; Martins, Aline Meira; Mancini, Marília Wechellian; Lopes, Luciana Almeida

2017-01-01

Resumo Contexto Desde a introdução do laser endovenoso para tratamento das varizes, há uma busca pelo comprimento de onda ideal, capaz de produzir o maior dano seletivo possível com maior segurança e menor incidência de efeitos adversos. Objetivos Avaliar os resultados de médio e longo prazo do laser de diodo de 1940 nm no tratamento de varizes, correlacionando os parâmetros utilizados com a durabilidade do desfecho anatômico. Métodos Revisão retrospectiva de pacientes diagnosticados com insuficiência venosa crônica em estágio clínico baseado em clínica, etiologia, anatomia e patofisiologia (CEAP) C2 a C6, submetidos ao tratamento termoablativo endovenoso de varizes tronculares, com laser com comprimento de onda em 1940 nm com fibra óptica de emissão radial, no período de abril de 2012 a julho de 2015. Uma revisão sistemática dos registros médicos eletrônicos foi realizada para obter dados demográficos e dados clínicos, incluindo dados de ultrassom dúplex, durante o período de seguimento pós-operatório. Resultados A média de idade dos pacientes foi de 53,3 anos; 37 eram mulheres (90,2%). O tempo médio de seguimento foi de 803 dias. O calibre médio das veias tratadas foi de 7,8 mm. A taxa de sucesso imediato foi de 100%, com densidade de energia endovenosa linear (linear endovenous energy density, LEED) média de 45,3 J/cm. A taxa de sucesso tardio foi de 95,1%, com duas recanalizações por volta de 12 meses pós-ablação. Não houve nenhuma recanalização nas veias tratadas com LEED superior a 30 J/cm. Conclusões O laser 1940 nm mostrou-se seguro e efetivo, em médio e longo prazo, para os parâmetros propostos, em segmentos venosos com até 10 mm de diâmetro. PMID:29930619
Storage strategies of eddy-current FE-BI model for GPU implementation

NASA Astrophysics Data System (ADS)

Bardel, Charles; Lei, Naiguang; Udpa, Lalita

2013-01-01

In the past few years graphical processing units (GPUs) have shown tremendous improvements in computational throughput over standard CPU architecture. However, this comes at the cost of restructuring the algorithms to meet the strengths and drawbacks of this GPU architecture. A major drawback is the state of limited memory, and hence storage of FE stiffness matrices on the GPU is important. In contrast to storage on CPU the GPU storage format has significant influence on the overall performance. This paper presents an investigation of a storage strategy in the implementation of a two-dimensional finite element-boundary integral (FE-BI) model for Eddy current NDE applications, on GPU architecture. Specifically, the high dimensional matrices are manipulated by examining the matrix structure and optimally splitting into structurally independent component matrices for efficient storage and retrieval of each component. Results obtained using the proposed approach are compared to those of conventional CPU implementation for validating the method.
Semiempirical Quantum Chemical Calculations Accelerated on a Hybrid Multicore CPU-GPU Computing Platform.

PubMed

Wu, Xin; Koslowski, Axel; Thiel, Walter

2012-07-10

In this work, we demonstrate that semiempirical quantum chemical calculations can be accelerated significantly by leveraging the graphics processing unit (GPU) as a coprocessor on a hybrid multicore CPU-GPU computing platform. Semiempirical calculations using the MNDO, AM1, PM3, OM1, OM2, and OM3 model Hamiltonians were systematically profiled for three types of test systems (fullerenes, water clusters, and solvated crambin) to identify the most time-consuming sections of the code. The corresponding routines were ported to the GPU and optimized employing both existing library functions and a GPU kernel that carries out a sequence of noniterative Jacobi transformations during pseudodiagonalization. The overall computation times for single-point energy calculations and geometry optimizations of large molecules were reduced by one order of magnitude for all methods, as compared to runs on a single CPU core.
GPU-Powered Coherent Beamforming

NASA Astrophysics Data System (ADS)

Magro, A.; Adami, K. Zarb; Hickish, J.

2015-03-01

Graphics processing units (GPU)-based beamforming is a relatively unexplored area in radio astronomy, possibly due to the assumption that any such system will be severely limited by the PCIe bandwidth required to transfer data to the GPU. We have developed a CUDA-based GPU implementation of a coherent beamformer, specifically designed and optimized for deployment at the BEST-2 array which can generate an arbitrary number of synthesized beams for a wide range of parameters. It achieves ˜1.3 TFLOPs on an NVIDIA Tesla K20, approximately 10x faster than an optimized, multithreaded CPU implementation. This kernel has been integrated into two real-time, GPU-based time-domain software pipelines deployed at the BEST-2 array in Medicina: a standalone beamforming pipeline and a transient detection pipeline. We present performance benchmarks for the beamforming kernel as well as the transient detection pipeline with beamforming capabilities as well as results of test observation.
Particle-in-cell simulations on graphic processing units

NASA Astrophysics Data System (ADS)

Ren, C.; Zhou, X.; Li, J.; Huang, M. C.; Zhao, Y.

2014-10-01

We will show our recent progress in using GPU's to accelerate the PIC code OSIRIS [Fonseca et al. LNCS 2331, 342 (2002)]. The OISRIS parallel structure is retained and the computation-intensive kernels are shipped to GPU's. Algorithms for the kernels are adapted for the GPU, including high-order charge-conserving current deposition schemes with few branching and parallel particle sorting [Kong et al., JCP 230, 1676 (2011)]. These algorithms make efficient use of the GPU shared memory. This work was supported by U.S. Department of Energy under Grant No. DE-FC02-04ER54789 and by NSF under Grant No. PHY-1314734.
Improving GPU-accelerated adaptive IDW interpolation algorithm using fast kNN search.

PubMed

Mei, Gang; Xu, Nengxiong; Xu, Liangliang

2016-01-01

This paper presents an efficient parallel Adaptive Inverse Distance Weighting (AIDW) interpolation algorithm on modern Graphics Processing Unit (GPU). The presented algorithm is an improvement of our previous GPU-accelerated AIDW algorithm by adopting fast k-nearest neighbors (kNN) search. In AIDW, it needs to find several nearest neighboring data points for each interpolated point to adaptively determine the power parameter; and then the desired prediction value of the interpolated point is obtained by weighted interpolating using the power parameter. In this work, we develop a fast kNN search approach based on the space-partitioning data structure, even grid, to improve the previous GPU-accelerated AIDW algorithm. The improved algorithm is composed of the stages of kNN search and weighted interpolating. To evaluate the performance of the improved algorithm, we perform five groups of experimental tests. The experimental results indicate: (1) the improved algorithm can achieve a speedup of up to 1017 over the corresponding serial algorithm; (2) the improved algorithm is at least two times faster than our previous GPU-accelerated AIDW algorithm; and (3) the utilization of fast kNN search can significantly improve the computational efficiency of the entire GPU-accelerated AIDW algorithm.
Compiler-based code generation and autotuning for geometric multigrid on GPU-accelerated supercomputers

DOE PAGES

Basu, Protonu; Williams, Samuel; Van Straalen, Brian; ...

2017-04-05

GPUs, with their high bandwidths and computational capabilities are an increasingly popular target for scientific computing. Unfortunately, to date, harnessing the power of the GPU has required use of a GPU-specific programming model like CUDA, OpenCL, or OpenACC. Thus, in order to deliver portability across CPU-based and GPU-accelerated supercomputers, programmers are forced to write and maintain two versions of their applications or frameworks. In this paper, we explore the use of a compiler-based autotuning framework based on CUDA-CHiLL to deliver not only portability, but also performance portability across CPU- and GPU-accelerated platforms for the geometric multigrid linear solvers found inmore » many scientific applications. We also show that with autotuning we can attain near Roofline (a performance bound for a computation and target architecture) performance across the key operations in the miniGMG benchmark for both CPU- and GPU-based architectures as well as for a multiple stencil discretizations and smoothers. We show that our technology is readily interoperable with MPI resulting in performance at scale equal to that obtained via hand-optimized MPI+CUDA implementation.« less
Thread scheduling for GPU-based OPC simulation on multi-thread

NASA Astrophysics Data System (ADS)

Lee, Heejun; Kim, Sangwook; Hong, Jisuk; Lee, Sooryong; Han, Hwansoo

2018-03-01

As semiconductor product development based on shrinkage continues, the accuracy and difficulty required for the model based optical proximity correction (MBOPC) is increasing. OPC simulation time, which is the most timeconsuming part of MBOPC, is rapidly increasing due to high pattern density in a layout and complex OPC model. To reduce OPC simulation time, we attempt to apply graphic processing unit (GPU) to MBOPC because OPC process is good to be programmed in parallel. We address some issues that may typically happen during GPU-based OPC simulation in multi thread system, such as "out of memory" and "GPU idle time". To overcome these problems, we propose a thread scheduling method, which manages OPC jobs in multiple threads in such a way that simulations jobs from multiple threads are alternatively executed on GPU while correction jobs are executed at the same time in each CPU cores. It was observed that the amount of GPU peak memory usage decreases by up to 35%, and MBOPC runtime also decreases by 4%. In cases where out of memory issues occur in a multi-threaded environment, the thread scheduler was used to improve MBOPC runtime up to 23%.
GPU accelerated simulations of 3D deterministic particle transport using discrete ordinates method

NASA Astrophysics Data System (ADS)

Gong, Chunye; Liu, Jie; Chi, Lihua; Huang, Haowei; Fang, Jingyue; Gong, Zhenghu

2011-07-01

Graphics Processing Unit (GPU), originally developed for real-time, high-definition 3D graphics in computer games, now provides great faculty in solving scientific applications. The basis of particle transport simulation is the time-dependent, multi-group, inhomogeneous Boltzmann transport equation. The numerical solution to the Boltzmann equation involves the discrete ordinates ( Sn) method and the procedure of source iteration. In this paper, we present a GPU accelerated simulation of one energy group time-independent deterministic discrete ordinates particle transport in 3D Cartesian geometry (Sweep3D). The performance of the GPU simulations are reported with the simulations of vacuum boundary condition. The discussion of the relative advantages and disadvantages of the GPU implementation, the simulation on multi GPUs, the programming effort and code portability are also reported. The results show that the overall performance speedup of one NVIDIA Tesla M2050 GPU ranges from 2.56 compared with one Intel Xeon X5670 chip to 8.14 compared with one Intel Core Q6600 chip for no flux fixup. The simulation with flux fixup on one M2050 is 1.23 times faster than on one X5670.
Parallel Optimization of 3D Cardiac Electrophysiological Model Using GPU

PubMed Central

Xia, Yong; Zhang, Henggui

2015-01-01

Large-scale 3D virtual heart model simulations are highly demanding in computational resources. This imposes a big challenge to the traditional computation resources based on CPU environment, which already cannot meet the requirement of the whole computation demands or are not easily available due to expensive costs. GPU as a parallel computing environment therefore provides an alternative to solve the large-scale computational problems of whole heart modeling. In this study, using a 3D sheep atrial model as a test bed, we developed a GPU-based simulation algorithm to simulate the conduction of electrical excitation waves in the 3D atria. In the GPU algorithm, a multicellular tissue model was split into two components: one is the single cell model (ordinary differential equation) and the other is the diffusion term of the monodomain model (partial differential equation). Such a decoupling enabled realization of the GPU parallel algorithm. Furthermore, several optimization strategies were proposed based on the features of the virtual heart model, which enabled a 200-fold speedup as compared to a CPU implementation. In conclusion, an optimized GPU algorithm has been developed that provides an economic and powerful platform for 3D whole heart simulations. PMID:26581957
Fast MPEG-CDVS Encoder With GPU-CPU Hybrid Computing

NASA Astrophysics Data System (ADS)

Duan, Ling-Yu; Sun, Wei; Zhang, Xinfeng; Wang, Shiqi; Chen, Jie; Yin, Jianxiong; See, Simon; Huang, Tiejun; Kot, Alex C.; Gao, Wen

2018-05-01

The compact descriptors for visual search (CDVS) standard from ISO/IEC moving pictures experts group (MPEG) has succeeded in enabling the interoperability for efficient and effective image retrieval by standardizing the bitstream syntax of compact feature descriptors. However, the intensive computation of CDVS encoder unfortunately hinders its widely deployment in industry for large-scale visual search. In this paper, we revisit the merits of low complexity design of CDVS core techniques and present a very fast CDVS encoder by leveraging the massive parallel execution resources of GPU. We elegantly shift the computation-intensive and parallel-friendly modules to the state-of-the-arts GPU platforms, in which the thread block allocation and the memory access are jointly optimized to eliminate performance loss. In addition, those operations with heavy data dependence are allocated to CPU to resolve the extra but non-necessary computation burden for GPU. Furthermore, we have demonstrated the proposed fast CDVS encoder can work well with those convolution neural network approaches which has harmoniously leveraged the advantages of GPU platforms, and yielded significant performance improvements. Comprehensive experimental results over benchmarks are evaluated, which has shown that the fast CDVS encoder using GPU-CPU hybrid computing is promising for scalable visual search.

High Performance GPU-Based Fourier Volume Rendering.

PubMed

Abdellah, Marwan; Eldeib, Ayman; Sharawi, Amr

2015-01-01

Fourier volume rendering (FVR) is a significant visualization technique that has been used widely in digital radiography. As a result of its (N (2)log⁡N) time complexity, it provides a faster alternative to spatial domain volume rendering algorithms that are (N (3)) computationally complex. Relying on the Fourier projection-slice theorem, this technique operates on the spectral representation of a 3D volume instead of processing its spatial representation to generate attenuation-only projections that look like X-ray radiographs. Due to the rapid evolution of its underlying architecture, the graphics processing unit (GPU) became an attractive competent platform that can deliver giant computational raw power compared to the central processing unit (CPU) on a per-dollar-basis. The introduction of the compute unified device architecture (CUDA) technology enables embarrassingly-parallel algorithms to run efficiently on CUDA-capable GPU architectures. In this work, a high performance GPU-accelerated implementation of the FVR pipeline on CUDA-enabled GPUs is presented. This proposed implementation can achieve a speed-up of 117x compared to a single-threaded hybrid implementation that uses the CPU and GPU together by taking advantage of executing the rendering pipeline entirely on recent GPU architectures.
Parallel Optimization of 3D Cardiac Electrophysiological Model Using GPU.

PubMed

Xia, Yong; Wang, Kuanquan; Zhang, Henggui

2015-01-01

Large-scale 3D virtual heart model simulations are highly demanding in computational resources. This imposes a big challenge to the traditional computation resources based on CPU environment, which already cannot meet the requirement of the whole computation demands or are not easily available due to expensive costs. GPU as a parallel computing environment therefore provides an alternative to solve the large-scale computational problems of whole heart modeling. In this study, using a 3D sheep atrial model as a test bed, we developed a GPU-based simulation algorithm to simulate the conduction of electrical excitation waves in the 3D atria. In the GPU algorithm, a multicellular tissue model was split into two components: one is the single cell model (ordinary differential equation) and the other is the diffusion term of the monodomain model (partial differential equation). Such a decoupling enabled realization of the GPU parallel algorithm. Furthermore, several optimization strategies were proposed based on the features of the virtual heart model, which enabled a 200-fold speedup as compared to a CPU implementation. In conclusion, an optimized GPU algorithm has been developed that provides an economic and powerful platform for 3D whole heart simulations.
Compiler-based code generation and autotuning for geometric multigrid on GPU-accelerated supercomputers

DOE Office of Scientific and Technical Information (OSTI.GOV)

Basu, Protonu; Williams, Samuel; Van Straalen, Brian

GPUs, with their high bandwidths and computational capabilities are an increasingly popular target for scientific computing. Unfortunately, to date, harnessing the power of the GPU has required use of a GPU-specific programming model like CUDA, OpenCL, or OpenACC. Thus, in order to deliver portability across CPU-based and GPU-accelerated supercomputers, programmers are forced to write and maintain two versions of their applications or frameworks. In this paper, we explore the use of a compiler-based autotuning framework based on CUDA-CHiLL to deliver not only portability, but also performance portability across CPU- and GPU-accelerated platforms for the geometric multigrid linear solvers found inmore » many scientific applications. We also show that with autotuning we can attain near Roofline (a performance bound for a computation and target architecture) performance across the key operations in the miniGMG benchmark for both CPU- and GPU-based architectures as well as for a multiple stencil discretizations and smoothers. We show that our technology is readily interoperable with MPI resulting in performance at scale equal to that obtained via hand-optimized MPI+CUDA implementation.« less
The development of GPU-based parallel PRNG for Monte Carlo applications in CUDA Fortran

DOE Office of Scientific and Technical Information (OSTI.GOV)

Kargaran, Hamed, E-mail: h-kargaran@sbu.ac.ir; Minuchehr, Abdolhamid; Zolfaghari, Ahmad

The implementation of Monte Carlo simulation on the CUDA Fortran requires a fast random number generation with good statistical properties on GPU. In this study, a GPU-based parallel pseudo random number generator (GPPRNG) have been proposed to use in high performance computing systems. According to the type of GPU memory usage, GPU scheme is divided into two work modes including GLOBAL-MODE and SHARED-MODE. To generate parallel random numbers based on the independent sequence method, the combination of middle-square method and chaotic map along with the Xorshift PRNG have been employed. Implementation of our developed PPRNG on a single GPU showedmore » a speedup of 150x and 470x (with respect to the speed of PRNG on a single CPU core) for GLOBAL-MODE and SHARED-MODE, respectively. To evaluate the accuracy of our developed GPPRNG, its performance was compared to that of some other commercially available PPRNGs such as MATLAB, FORTRAN and Miller-Park algorithm through employing the specific standard tests. The results of this comparison showed that the developed GPPRNG in this study can be used as a fast and accurate tool for computational science applications.« less
A novel heterogeneous algorithm to simulate multiphase flow in porous media on multicore CPU-GPU systems

NASA Astrophysics Data System (ADS)

McClure, J. E.; Prins, J. F.; Miller, C. T.

2014-07-01

Multiphase flow implementations of the lattice Boltzmann method (LBM) are widely applied to the study of porous medium systems. In this work, we construct a new variant of the popular "color" LBM for two-phase flow in which a three-dimensional, 19-velocity (D3Q19) lattice is used to compute the momentum transport solution while a three-dimensional, seven velocity (D3Q7) lattice is used to compute the mass transport solution. Based on this formulation, we implement a novel heterogeneous GPU-accelerated algorithm in which the mass transport solution is computed by multiple shared memory CPU cores programmed using OpenMP while a concurrent solution of the momentum transport is performed using a GPU. The heterogeneous solution is demonstrated to provide speedup of 2.6 × as compared to multi-core CPU solution and 1.8 × compared to GPU solution due to concurrent utilization of both CPU and GPU bandwidths. Furthermore, we verify that the proposed formulation provides an accurate physical representation of multiphase flow processes and demonstrate that the approach can be applied to perform heterogeneous simulations of two-phase flow in porous media using a typical GPU-accelerated workstation.
GPU-based parallel algorithm for blind image restoration using midfrequency-based methods

NASA Astrophysics Data System (ADS)

Xie, Lang; Luo, Yi-han; Bao, Qi-liang

2013-08-01

GPU-based general-purpose computing is a new branch of modern parallel computing, so the study of parallel algorithms specially designed for GPU hardware architecture is of great significance. In order to solve the problem of high computational complexity and poor real-time performance in blind image restoration, the midfrequency-based algorithm for blind image restoration was analyzed and improved in this paper. Furthermore, a midfrequency-based filtering method is also used to restore the image hardly with any recursion or iteration. Combining the algorithm with data intensiveness, data parallel computing and GPU execution model of single instruction and multiple threads, a new parallel midfrequency-based algorithm for blind image restoration is proposed in this paper, which is suitable for stream computing of GPU. In this algorithm, the GPU is utilized to accelerate the estimation of class-G point spread functions and midfrequency-based filtering. Aiming at better management of the GPU threads, the threads in a grid are scheduled according to the decomposition of the filtering data in frequency domain after the optimization of data access and the communication between the host and the device. The kernel parallelism structure is determined by the decomposition of the filtering data to ensure the transmission rate to get around the memory bandwidth limitation. The results show that, with the new algorithm, the operational speed is significantly increased and the real-time performance of image restoration is effectively improved, especially for high-resolution images.
Large Scale Document Inversion using a Multi-threaded Computing System

PubMed Central

Jung, Sungbo; Chang, Dar-Jen; Park, Juw Won

2018-01-01

Current microprocessor architecture is moving towards multi-core/multi-threaded systems. This trend has led to a surge of interest in using multi-threaded computing devices, such as the Graphics Processing Unit (GPU), for general purpose computing. We can utilize the GPU in computation as a massive parallel coprocessor because the GPU consists of multiple cores. The GPU is also an affordable, attractive, and user-programmable commodity. Nowadays a lot of information has been flooded into the digital domain around the world. Huge volume of data, such as digital libraries, social networking services, e-commerce product data, and reviews, etc., is produced or collected every moment with dramatic growth in size. Although the inverted index is a useful data structure that can be used for full text searches or document retrieval, a large number of documents will require a tremendous amount of time to create the index. The performance of document inversion can be improved by multi-thread or multi-core GPU. Our approach is to implement a linear-time, hash-based, single program multiple data (SPMD), document inversion algorithm on the NVIDIA GPU/CUDA programming platform utilizing the huge computational power of the GPU, to develop high performance solutions for document indexing. Our proposed parallel document inversion system shows 2-3 times faster performance than a sequential system on two different test datasets from PubMed abstract and e-commerce product reviews. CCS Concepts •Information systems➝Information retrieval • Computing methodologies➝Massively parallel and high-performance simulations. PMID:29861701
Novel hybrid GPU-CPU implementation of parallelized Monte Carlo parametric expectation maximization estimation method for population pharmacokinetic data analysis.

PubMed

Ng, C M

2013-10-01

The development of a population PK/PD model, an essential component for model-based drug development, is both time- and labor-intensive. A graphical-processing unit (GPU) computing technology has been proposed and used to accelerate many scientific computations. The objective of this study was to develop a hybrid GPU-CPU implementation of parallelized Monte Carlo parametric expectation maximization (MCPEM) estimation algorithm for population PK data analysis. A hybrid GPU-CPU implementation of the MCPEM algorithm (MCPEMGPU) and identical algorithm that is designed for the single CPU (MCPEMCPU) were developed using MATLAB in a single computer equipped with dual Xeon 6-Core E5690 CPU and a NVIDIA Tesla C2070 GPU parallel computing card that contained 448 stream processors. Two different PK models with rich/sparse sampling design schemes were used to simulate population data in assessing the performance of MCPEMCPU and MCPEMGPU. Results were analyzed by comparing the parameter estimation and model computation times. Speedup factor was used to assess the relative benefit of parallelized MCPEMGPU over MCPEMCPU in shortening model computation time. The MCPEMGPU consistently achieved shorter computation time than the MCPEMCPU and can offer more than 48-fold speedup using a single GPU card. The novel hybrid GPU-CPU implementation of parallelized MCPEM algorithm developed in this study holds a great promise in serving as the core for the next-generation of modeling software for population PK/PD analysis.
Large Scale Document Inversion using a Multi-threaded Computing System.

PubMed

Jung, Sungbo; Chang, Dar-Jen; Park, Juw Won

2017-06-01

Current microprocessor architecture is moving towards multi-core/multi-threaded systems. This trend has led to a surge of interest in using multi-threaded computing devices, such as the Graphics Processing Unit (GPU), for general purpose computing. We can utilize the GPU in computation as a massive parallel coprocessor because the GPU consists of multiple cores. The GPU is also an affordable, attractive, and user-programmable commodity. Nowadays a lot of information has been flooded into the digital domain around the world. Huge volume of data, such as digital libraries, social networking services, e-commerce product data, and reviews, etc., is produced or collected every moment with dramatic growth in size. Although the inverted index is a useful data structure that can be used for full text searches or document retrieval, a large number of documents will require a tremendous amount of time to create the index. The performance of document inversion can be improved by multi-thread or multi-core GPU. Our approach is to implement a linear-time, hash-based, single program multiple data (SPMD), document inversion algorithm on the NVIDIA GPU/CUDA programming platform utilizing the huge computational power of the GPU, to develop high performance solutions for document indexing. Our proposed parallel document inversion system shows 2-3 times faster performance than a sequential system on two different test datasets from PubMed abstract and e-commerce product reviews. •Information systems➝Information retrieval • Computing methodologies➝Massively parallel and high-performance simulations.
GPU applications for data processing

DOE Office of Scientific and Technical Information (OSTI.GOV)

Vladymyrov, Mykhailo, E-mail: mykhailo.vladymyrov@cern.ch; Aleksandrov, Andrey; INFN sezione di Napoli, I-80125 Napoli

2015-12-31

Modern experiments that use nuclear photoemulsion imply fast and efficient data acquisition from the emulsion can be performed. The new approaches in developing scanning systems require real-time processing of large amount of data. Methods that use Graphical Processing Unit (GPU) computing power for emulsion data processing are presented here. It is shown how the GPU-accelerated emulsion processing helped us to rise the scanning speed by factor of nine.
TH-A-18C-09: Ultra-Fast Monte Carlo Simulation for Cone Beam CT Imaging of Brain Trauma

DOE Office of Scientific and Technical Information (OSTI.GOV)

Sisniega, A; Zbijewski, W; Stayman, J

Purpose: Application of cone-beam CT (CBCT) to low-contrast soft tissue imaging, such as in detection of traumatic brain injury, is challenged by high levels of scatter. A fast, accurate scatter correction method based on Monte Carlo (MC) estimation is developed for application in high-quality CBCT imaging of acute brain injury. Methods: The correction involves MC scatter estimation executed on an NVIDIA GTX 780 GPU (MC-GPU), with baseline simulation speed of ~1e7 photons/sec. MC-GPU is accelerated by a novel, GPU-optimized implementation of variance reduction (VR) techniques (forced detection and photon splitting). The number of simulated tracks and projections is reduced formore » additional speed-up. Residual noise is removed and the missing scatter projections are estimated via kernel smoothing (KS) in projection plane and across gantry angles. The method is assessed using CBCT images of a head phantom presenting a realistic simulation of fresh intracranial hemorrhage (100 kVp, 180 mAs, 720 projections, source-detector distance 700 mm, source-axis distance 480 mm). Results: For a fixed run-time of ~1 sec/projection, GPU-optimized VR reduces the noise in MC-GPU scatter estimates by a factor of 4. For scatter correction, MC-GPU with VR is executed with 4-fold angular downsampling and 1e5 photons/projection, yielding 3.5 minute run-time per scan, and de-noised with optimized KS. Corrected CBCT images demonstrate uniformity improvement of 18 HU and contrast improvement of 26 HU compared to no correction, and a 52% increase in contrast-tonoise ratio in simulated hemorrhage compared to “oracle” constant fraction correction. Conclusion: Acceleration of MC-GPU achieved through GPU-optimized variance reduction and kernel smoothing yields an efficient (<5 min/scan) and accurate scatter correction that does not rely on additional hardware or simplifying assumptions about the scatter distribution. The method is undergoing implementation in a novel CBCT dedicated to brain trauma imaging at the point of care in sports and military applications. Research grant from Carestream Health. JY is an employee of Carestream Health.« less
Distributed GPU Computing in GIScience

NASA Astrophysics Data System (ADS)

Jiang, Y.; Yang, C.; Huang, Q.; Li, J.; Sun, M.

2013-12-01

Geoscientists strived to discover potential principles and patterns hidden inside ever-growing Big Data for scientific discoveries. To better achieve this objective, more capable computing resources are required to process, analyze and visualize Big Data (Ferreira et al., 2003; Li et al., 2013). Current CPU-based computing techniques cannot promptly meet the computing challenges caused by increasing amount of datasets from different domains, such as social media, earth observation, environmental sensing (Li et al., 2013). Meanwhile CPU-based computing resources structured as cluster or supercomputer is costly. In the past several years with GPU-based technology matured in both the capability and performance, GPU-based computing has emerged as a new computing paradigm. Compare to traditional computing microprocessor, the modern GPU, as a compelling alternative microprocessor, has outstanding high parallel processing capability with cost-effectiveness and efficiency(Owens et al., 2008), although it is initially designed for graphical rendering in visualization pipe. This presentation reports a distributed GPU computing framework for integrating GPU-based computing within distributed environment. Within this framework, 1) for each single computer, computing resources of both GPU-based and CPU-based can be fully utilized to improve the performance of visualizing and processing Big Data; 2) within a network environment, a variety of computers can be used to build up a virtual super computer to support CPU-based and GPU-based computing in distributed computing environment; 3) GPUs, as a specific graphic targeted device, are used to greatly improve the rendering efficiency in distributed geo-visualization, especially for 3D/4D visualization. Key words: Geovisualization, GIScience, Spatiotemporal Studies Reference : 1. Ferreira de Oliveira, M. C., & Levkowitz, H. (2003). From visual data exploration to visual data mining: A survey. Visualization and Computer Graphics, IEEE Transactions on, 9(3), 378-394. 2. Li, J., Jiang, Y., Yang, C., Huang, Q., & Rice, M. (2013). Visualizing 3D/4D Environmental Data Using Many-core Graphics Processing Units (GPUs) and Multi-core Central Processing Units (CPUs). Computers & Geosciences, 59(9), 78-89. 3. Owens, J. D., Houston, M., Luebke, D., Green, S., Stone, J. E., & Phillips, J. C. (2008). GPU computing. Proceedings of the IEEE, 96(5), 879-899.
The Research and Test of Fast Radio Burst Real-time Search Algorithm Based on GPU Acceleration

NASA Astrophysics Data System (ADS)

Wang, J.; Chen, M. Z.; Pei, X.; Wang, Z. Q.

2017-03-01

In order to satisfy the research needs of Nanshan 25 m radio telescope of Xinjiang Astronomical Observatory (XAO) and study the key technology of the planned QiTai radio Telescope (QTT), the receiver group of XAO studied the GPU (Graphics Processing Unit) based real-time FRB searching algorithm which developed from the original FRB searching algorithm based on CPU (Central Processing Unit), and built the FRB real-time searching system. The comparison of the GPU system and the CPU system shows that: on the basis of ensuring the accuracy of the search, the speed of the GPU accelerated algorithm is improved by 35-45 times compared with the CPU algorithm.
GraphReduce: Processing Large-Scale Graphs on Accelerator-Based Systems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Sengupta, Dipanjan; Song, Shuaiwen; Agarwal, Kapil

2015-11-15

Recent work on real-world graph analytics has sought to leverage the massive amount of parallelism offered by GPU devices, but challenges remain due to the inherent irregularity of graph algorithms and limitations in GPU-resident memory for storing large graphs. We present GraphReduce, a highly efficient and scalable GPU-based framework that operates on graphs that exceed the device’s internal memory capacity. GraphReduce adopts a combination of edge- and vertex-centric implementations of the Gather-Apply-Scatter programming model and operates on multiple asynchronous GPU streams to fully exploit the high degrees of parallelism in GPUs with efficient graph data movement between the host andmore » device.« less
GPU-accelerated computation of electron transfer.

PubMed

Höfinger, Siegfried; Acocella, Angela; Pop, Sergiu C; Narumi, Tetsu; Yasuoka, Kenji; Beu, Titus; Zerbetto, Francesco

2012-11-05

Electron transfer is a fundamental process that can be studied with the help of computer simulation. The underlying quantum mechanical description renders the problem a computationally intensive application. In this study, we probe the graphics processing unit (GPU) for suitability to this type of problem. Time-critical components are identified via profiling of an existing implementation and several different variants are tested involving the GPU at increasing levels of abstraction. A publicly available library supporting basic linear algebra operations on the GPU turns out to accelerate the computation approximately 50-fold with minor dependence on actual problem size. The performance gain does not compromise numerical accuracy and is of significant value for practical purposes. Copyright © 2012 Wiley Periodicals, Inc.
GPU Accelerated Prognostics

NASA Technical Reports Server (NTRS)

Gorospe, George E., Jr.; Daigle, Matthew J.; Sankararaman, Shankar; Kulkarni, Chetan S.; Ng, Eley

2017-01-01

Prognostic methods enable operators and maintainers to predict the future performance for critical systems. However, these methods can be computationally expensive and may need to be performed each time new information about the system becomes available. In light of these computational requirements, we have investigated the application of graphics processing units (GPUs) as a computational platform for real-time prognostics. Recent advances in GPU technology have reduced cost and increased the computational capability of these highly parallel processing units, making them more attractive for the deployment of prognostic software. We present a survey of model-based prognostic algorithms with considerations for leveraging the parallel architecture of the GPU and a case study of GPU-accelerated battery prognostics with computational performance results.
Optimizing a mobile robot control system using GPU acceleration

NASA Astrophysics Data System (ADS)

Tuck, Nat; McGuinness, Michael; Martin, Fred

2012-01-01

This paper describes our attempt to optimize a robot control program for the Intelligent Ground Vehicle Competition (IGVC) by running computationally intensive portions of the system on a commodity graphics processing unit (GPU). The IGVC Autonomous Challenge requires a control program that performs a number of different computationally intensive tasks ranging from computer vision to path planning. For the 2011 competition our Robot Operating System (ROS) based control system would not run comfortably on the multicore CPU on our custom robot platform. The process of profiling the ROS control program and selecting appropriate modules for porting to run on a GPU is described. A GPU-targeting compiler, Bacon, is used to speed up development and help optimize the ported modules. The impact of the ported modules on overall performance is discussed. We conclude that GPU optimization can free a significant amount of CPU resources with minimal effort for expensive user-written code, but that replacing heavily-optimized library functions is more difficult, and a much less efficient use of time.
Transportable GPU (General Processor Units) chip set technology for standard computer architectures

NASA Astrophysics Data System (ADS)

Fosdick, R. E.; Denison, H. C.

1982-11-01

The USAFR-developed GPU Chip Set has been utilized by Tracor to implement both USAF and Navy Standard 16-Bit Airborne Computer Architectures. Both configurations are currently being delivered into DOD full-scale development programs. Leadless Hermetic Chip Carrier packaging has facilitated implementation of both architectures on single 41/2 x 5 substrates. The CMOS and CMOS/SOS implementations of the GPU Chip Set have allowed both CPU implementations to use less than 3 watts of power each. Recent efforts by Tracor for USAF have included the definition of a next-generation GPU Chip Set that will retain the application-proven architecture of the current chip set while offering the added cost advantages of transportability across ISO-CMOS and CMOS/SOS processes and across numerous semiconductor manufacturers using a newly-defined set of common design rules. The Enhanced GPU Chip Set will increase speed by an approximate factor of 3 while significantly reducing chip counts and costs of standard CPU implementations.
A survey of GPU-based medical image computing techniques

PubMed Central

Shi, Lin; Liu, Wen; Zhang, Heye; Xie, Yongming

2012-01-01

Medical imaging currently plays a crucial role throughout the entire clinical applications from medical scientific research to diagnostics and treatment planning. However, medical imaging procedures are often computationally demanding due to the large three-dimensional (3D) medical datasets to process in practical clinical applications. With the rapidly enhancing performances of graphics processors, improved programming support, and excellent price-to-performance ratio, the graphics processing unit (GPU) has emerged as a competitive parallel computing platform for computationally expensive and demanding tasks in a wide range of medical image applications. The major purpose of this survey is to provide a comprehensive reference source for the starters or researchers involved in GPU-based medical image processing. Within this survey, the continuous advancement of GPU computing is reviewed and the existing traditional applications in three areas of medical image processing, namely, segmentation, registration and visualization, are surveyed. The potential advantages and associated challenges of current GPU-based medical imaging are also discussed to inspire future applications in medicine. PMID:23256080
Accelerating image reconstruction in dual-head PET system by GPU and symmetry properties.

PubMed

Chou, Cheng-Ying; Dong, Yun; Hung, Yukai; Kao, Yu-Jiun; Wang, Weichung; Kao, Chien-Min; Chen, Chin-Tu

2012-01-01

Positron emission tomography (PET) is an important imaging modality in both clinical usage and research studies. We have developed a compact high-sensitivity PET system that consisted of two large-area panel PET detector heads, which produce more than 224 million lines of response and thus request dramatic computational demands. In this work, we employed a state-of-the-art graphics processing unit (GPU), NVIDIA Tesla C2070, to yield an efficient reconstruction process. Our approaches ingeniously integrate the distinguished features of the symmetry properties of the imaging system and GPU architectures, including block/warp/thread assignments and effective memory usage, to accelerate the computations for ordered subset expectation maximization (OSEM) image reconstruction. The OSEM reconstruction algorithms were implemented employing both CPU-based and GPU-based codes, and their computational performance was quantitatively analyzed and compared. The results showed that the GPU-accelerated scheme can drastically reduce the reconstruction time and thus can largely expand the applicability of the dual-head PET system.

The GPU implementation of micro - Doppler period estimation

NASA Astrophysics Data System (ADS)

Yang, Liyuan; Wang, Junling; Bi, Ran

2018-03-01

Aiming at the problem that the computational complexity and the deficiency of real-time of the wideband radar echo signal, a program is designed to improve the performance of real-time extraction of micro-motion feature in this paper based on the CPU-GPU heterogeneous parallel structure. Firstly, we discuss the principle of the micro-Doppler effect generated by the rolling of the scattering points on the orbiting satellite, analyses how to use Kalman filter to compensate the translational motion of tumbling satellite and how to use the joint time-frequency analysis and inverse Radon transform to extract the micro-motion features from the echo after compensation. Secondly, the advantages of GPU in terms of real-time processing and the working principle of CPU-GPU heterogeneous parallelism are analysed, and a program flow based on GPU to extract the micro-motion feature from the radar echo signal of rolling satellite is designed. At the end of the article the results of extraction are given to verify the correctness of the program and algorithm.
GPU computing with Kaczmarz’s and other iterative algorithms for linear systems

PubMed Central

Elble, Joseph M.; Sahinidis, Nikolaos V.; Vouzis, Panagiotis

2009-01-01

The graphics processing unit (GPU) is used to solve large linear systems derived from partial differential equations. The differential equations studied are strongly convection-dominated, of various sizes, and common to many fields, including computational fluid dynamics, heat transfer, and structural mechanics. The paper presents comparisons between GPU and CPU implementations of several well-known iterative methods, including Kaczmarz’s, Cimmino’s, component averaging, conjugate gradient normal residual (CGNR), symmetric successive overrelaxation-preconditioned conjugate gradient, and conjugate-gradient-accelerated component-averaged row projections (CARP-CG). Computations are preformed with dense as well as general banded systems. The results demonstrate that our GPU implementation outperforms CPU implementations of these algorithms, as well as previously studied parallel implementations on Linux clusters and shared memory systems. While the CGNR method had begun to fall out of favor for solving such problems, for the problems studied in this paper, the CGNR method implemented on the GPU performed better than the other methods, including a cluster implementation of the CARP-CG method. PMID:20526446
Production Level CFD Code Acceleration for Hybrid Many-Core Architectures

NASA Technical Reports Server (NTRS)

Duffy, Austen C.; Hammond, Dana P.; Nielsen, Eric J.

2012-01-01

In this work, a novel graphics processing unit (GPU) distributed sharing model for hybrid many-core architectures is introduced and employed in the acceleration of a production-level computational fluid dynamics (CFD) code. The latest generation graphics hardware allows multiple processor cores to simultaneously share a single GPU through concurrent kernel execution. This feature has allowed the NASA FUN3D code to be accelerated in parallel with up to four processor cores sharing a single GPU. For codes to scale and fully use resources on these and the next generation machines, codes will need to employ some type of GPU sharing model, as presented in this work. Findings include the effects of GPU sharing on overall performance. A discussion of the inherent challenges that parallel unstructured CFD codes face in accelerator-based computing environments is included, with considerations for future generation architectures. This work was completed by the author in August 2010, and reflects the analysis and results of the time.
Fast MPEG-CDVS Encoder With GPU-CPU Hybrid Computing.

PubMed

Duan, Ling-Yu; Sun, Wei; Zhang, Xinfeng; Wang, Shiqi; Chen, Jie; Yin, Jianxiong; See, Simon; Huang, Tiejun; Kot, Alex C; Gao, Wen

2018-05-01

The compact descriptors for visual search (CDVS) standard from ISO/IEC moving pictures experts group has succeeded in enabling the interoperability for efficient and effective image retrieval by standardizing the bitstream syntax of compact feature descriptors. However, the intensive computation of a CDVS encoder unfortunately hinders its widely deployment in industry for large-scale visual search. In this paper, we revisit the merits of low complexity design of CDVS core techniques and present a very fast CDVS encoder by leveraging the massive parallel execution resources of graphics processing unit (GPU). We elegantly shift the computation-intensive and parallel-friendly modules to the state-of-the-arts GPU platforms, in which the thread block allocation as well as the memory access mechanism are jointly optimized to eliminate performance loss. In addition, those operations with heavy data dependence are allocated to CPU for resolving the extra but non-necessary computation burden for GPU. Furthermore, we have demonstrated the proposed fast CDVS encoder can work well with those convolution neural network approaches which enables to leverage the advantages of GPU platforms harmoniously, and yield significant performance improvements. Comprehensive experimental results over benchmarks are evaluated, which has shown that the fast CDVS encoder using GPU-CPU hybrid computing is promising for scalable visual search.
A GPU-Accelerated Approach for Feature Tracking in Time-Varying Imagery Datasets.

PubMed

Peng, Chao; Sahani, Sandip; Rushing, John

2017-10-01

We propose a novel parallel connected component labeling (CCL) algorithm along with efficient out-of-core data management to detect and track feature regions of large time-varying imagery datasets. Our approach contributes to the big data field with parallel algorithms tailored for GPU architectures. We remove the data dependency between frames and achieve pixel-level parallelism. Due to the large size, the entire dataset cannot fit into cached memory. Frames have to be streamed through the memory hierarchy (disk to CPU main memory and then to GPU memory), partitioned, and processed as batches, where each batch is small enough to fit into the GPU. To reconnect the feature regions that are separated due to data partitioning, we present a novel batch merging algorithm to extract the region connection information across multiple batches in a parallel fashion. The information is organized in a memory-efficient structure and supports fast indexing on the GPU. Our experiment uses a commodity workstation equipped with a single GPU. The results show that our approach can efficiently process a weather dataset composed of terabytes of time-varying radar images. The advantages of our approach are demonstrated by comparing to the performance of an efficient CPU cluster implementation which is being used by the weather scientists.
A CFD Heterogeneous Parallel Solver Based on Collaborating CPU and GPU

NASA Astrophysics Data System (ADS)

Lai, Jianqi; Tian, Zhengyu; Li, Hua; Pan, Sha

2018-03-01

Since Graphic Processing Unit (GPU) has a strong ability of floating-point computation and memory bandwidth for data parallelism, it has been widely used in the areas of common computing such as molecular dynamics (MD), computational fluid dynamics (CFD) and so on. The emergence of compute unified device architecture (CUDA), which reduces the complexity of compiling program, brings the great opportunities to CFD. There are three different modes for parallel solution of NS equations: parallel solver based on CPU, parallel solver based on GPU and heterogeneous parallel solver based on collaborating CPU and GPU. As we can see, GPUs are relatively rich in compute capacity but poor in memory capacity and the CPUs do the opposite. We need to make full use of the GPUs and CPUs, so a CFD heterogeneous parallel solver based on collaborating CPU and GPU has been established. Three cases are presented to analyse the solver’s computational accuracy and heterogeneous parallel efficiency. The numerical results agree well with experiment results, which demonstrate that the heterogeneous parallel solver has high computational precision. The speedup on a single GPU is more than 40 for laminar flow, it decreases for turbulent flow, but it still can reach more than 20. What’s more, the speedup increases as the grid size becomes larger.
Accelerating large-scale simulation of seismic wave propagation by multi-GPUs and three-dimensional domain decomposition

NASA Astrophysics Data System (ADS)

Okamoto, Taro; Takenaka, Hiroshi; Nakamura, Takeshi; Aoki, Takayuki

2010-12-01

We adopted the GPU (graphics processing unit) to accelerate the large-scale finite-difference simulation of seismic wave propagation. The simulation can benefit from the high-memory bandwidth of GPU because it is a "memory intensive" problem. In a single-GPU case we achieved a performance of about 56 GFlops, which was about 45-fold faster than that achieved by a single core of the host central processing unit (CPU). We confirmed that the optimized use of fast shared memory and registers were essential for performance. In the multi-GPU case with three-dimensional domain decomposition, the non-contiguous memory alignment in the ghost zones was found to impose quite long time in data transfer between GPU and the host node. This problem was solved by using contiguous memory buffers for ghost zones. We achieved a performance of about 2.2 TFlops by using 120 GPUs and 330 GB of total memory: nearly (or more than) 2200 cores of host CPUs would be required to achieve the same performance. The weak scaling was nearly proportional to the number of GPUs. We therefore conclude that GPU computing for large-scale simulation of seismic wave propagation is a promising approach as a faster simulation is possible with reduced computational resources compared to CPUs.
Toward GPGPU accelerated human electromechanical cardiac simulations

PubMed Central

Vigueras, Guillermo; Roy, Ishani; Cookson, Andrew; Lee, Jack; Smith, Nicolas; Nordsletten, David

2014-01-01

In this paper, we look at the acceleration of weakly coupled electromechanics using the graphics processing unit (GPU). Specifically, we port to the GPU a number of components of Heart—a CPU-based finite element code developed for simulating multi-physics problems. On the basis of a criterion of computational cost, we implemented on the GPU the ODE and PDE solution steps for the electrophysiology problem and the Jacobian and residual evaluation for the mechanics problem. Performance of the GPU implementation is then compared with single core CPU (SC) execution as well as multi-core CPU (MC) computations with equivalent theoretical performance. Results show that for a human scale left ventricle mesh, GPU acceleration of the electrophysiology problem provided speedups of 164 × compared with SC and 5.5 times compared with MC for the solution of the ODE model. Speedup of up to 72 × compared with SC and 2.6 × compared with MC was also observed for the PDE solve. Using the same human geometry, the GPU implementation of mechanics residual/Jacobian computation provided speedups of up to 44 × compared with SC and 2.0 × compared with MC. © 2013 The Authors. International Journal for Numerical Methods in Biomedical Engineering published by John Wiley & Sons, Ltd. PMID:24115492
Accelerating EPI distortion correction by utilizing a modern GPU-based parallel computation.

PubMed

Yang, Yao-Hao; Huang, Teng-Yi; Wang, Fu-Nien; Chuang, Tzu-Chao; Chen, Nan-Kuei

2013-04-01

The combination of phase demodulation and field mapping is a practical method to correct echo planar imaging (EPI) geometric distortion. However, since phase dispersion accumulates in each phase-encoding step, the calculation complexity of phase modulation is Ny-fold higher than conventional image reconstructions. Thus, correcting EPI images via phase demodulation is generally a time-consuming task. Parallel computing by employing general-purpose calculations on graphics processing units (GPU) can accelerate scientific computing if the algorithm is parallelized. This study proposes a method that incorporates the GPU-based technique into phase demodulation calculations to reduce computation time. The proposed parallel algorithm was applied to a PROPELLER-EPI diffusion tensor data set. The GPU-based phase demodulation method reduced the EPI distortion correctly, and accelerated the computation. The total reconstruction time of the 16-slice PROPELLER-EPI diffusion tensor images with matrix size of 128 × 128 was reduced from 1,754 seconds to 101 seconds by utilizing the parallelized 4-GPU program. GPU computing is a promising method to accelerate EPI geometric correction. The resulting reduction in computation time of phase demodulation should accelerate postprocessing for studies performed with EPI, and should effectuate the PROPELLER-EPI technique for clinical practice. Copyright © 2011 by the American Society of Neuroimaging.
FastGCN: A GPU Accelerated Tool for Fast Gene Co-Expression Networks

PubMed Central

Liang, Meimei; Zhang, Futao; Jin, Gulei; Zhu, Jun

2015-01-01

Gene co-expression networks comprise one type of valuable biological networks. Many methods and tools have been published to construct gene co-expression networks; however, most of these tools and methods are inconvenient and time consuming for large datasets. We have developed a user-friendly, accelerated and optimized tool for constructing gene co-expression networks that can fully harness the parallel nature of GPU (Graphic Processing Unit) architectures. Genetic entropies were exploited to filter out genes with no or small expression changes in the raw data preprocessing step. Pearson correlation coefficients were then calculated. After that, we normalized these coefficients and employed the False Discovery Rate to control the multiple tests. At last, modules identification was conducted to construct the co-expression networks. All of these calculations were implemented on a GPU. We also compressed the coefficient matrix to save space. We compared the performance of the GPU implementation with those of multi-core CPU implementations with 16 CPU threads, single-thread C/C++ implementation and single-thread R implementation. Our results show that GPU implementation largely outperforms single-thread C/C++ implementation and single-thread R implementation, and GPU implementation outperforms multi-core CPU implementation when the number of genes increases. With the test dataset containing 16,000 genes and 590 individuals, we can achieve greater than 63 times the speed using a GPU implementation compared with a single-thread R implementation when 50 percent of genes were filtered out and about 80 times the speed when no genes were filtered out. PMID:25602758
FastGCN: a GPU accelerated tool for fast gene co-expression networks.

PubMed

Liang, Meimei; Zhang, Futao; Jin, Gulei; Zhu, Jun

2015-01-01

Gene co-expression networks comprise one type of valuable biological networks. Many methods and tools have been published to construct gene co-expression networks; however, most of these tools and methods are inconvenient and time consuming for large datasets. We have developed a user-friendly, accelerated and optimized tool for constructing gene co-expression networks that can fully harness the parallel nature of GPU (Graphic Processing Unit) architectures. Genetic entropies were exploited to filter out genes with no or small expression changes in the raw data preprocessing step. Pearson correlation coefficients were then calculated. After that, we normalized these coefficients and employed the False Discovery Rate to control the multiple tests. At last, modules identification was conducted to construct the co-expression networks. All of these calculations were implemented on a GPU. We also compressed the coefficient matrix to save space. We compared the performance of the GPU implementation with those of multi-core CPU implementations with 16 CPU threads, single-thread C/C++ implementation and single-thread R implementation. Our results show that GPU implementation largely outperforms single-thread C/C++ implementation and single-thread R implementation, and GPU implementation outperforms multi-core CPU implementation when the number of genes increases. With the test dataset containing 16,000 genes and 590 individuals, we can achieve greater than 63 times the speed using a GPU implementation compared with a single-thread R implementation when 50 percent of genes were filtered out and about 80 times the speed when no genes were filtered out.
Advantages of GPU technology in DFT calculations of intercalated graphene

NASA Astrophysics Data System (ADS)

Pešić, J.; Gajić, R.

2014-09-01

Over the past few years, the expansion of general-purpose graphic-processing unit (GPGPU) technology has had a great impact on computational science. GPGPU is the utilization of a graphics-processing unit (GPU) to perform calculations in applications usually handled by the central processing unit (CPU). Use of GPGPUs as a way to increase computational power in the material sciences has significantly decreased computational costs in already highly demanding calculations. A level of the acceleration and parallelization depends on the problem itself. Some problems can benefit from GPU acceleration and parallelization, such as the finite-difference time-domain algorithm (FTDT) and density-functional theory (DFT), while others cannot take advantage of these modern technologies. A number of GPU-supported applications had emerged in the past several years (www.nvidia.com/object/gpu-applications.html). Quantum Espresso (QE) is reported as an integrated suite of open source computer codes for electronic-structure calculations and materials modeling at the nano-scale. It is based on DFT, the use of a plane-waves basis and a pseudopotential approach. Since the QE 5.0 version, it has been implemented as a plug-in component for standard QE packages that allows exploiting the capabilities of Nvidia GPU graphic cards (www.qe-forge.org/gf/proj). In this study, we have examined the impact of the usage of GPU acceleration and parallelization on the numerical performance of DFT calculations. Graphene has been attracting attention worldwide and has already shown some remarkable properties. We have studied an intercalated graphene, using the QE package PHonon, which employs GPU. The term ‘intercalation’ refers to a process whereby foreign adatoms are inserted onto a graphene lattice. In addition, by intercalating different atoms between graphene layers, it is possible to tune their physical properties. Our experiments have shown there are benefits from using GPUs, and we reached an acceleration of several times compared to standard CPU calculations.
Evaluating the Usefulness of a Novel 10B-Carrier Conjugated With Cyclic RGD Peptide in Boron Neutron Capture Therapy

PubMed Central

Masunaga, Shin-ichiro; Kimura, Sadaaki; Harada, Tomohiro; Okuda, Kensuke; Sakurai, Yoshinori; Tanaka, Hiroki; Suzuki, Minoru; Kondo, Natsuko; Maruhashi, Akira; Nagasawa, Hideko; Ono, Koji

2012-01-01

Background To evaluate the usefulness of a novel 10B-carrier conjugated with an integrin-binding cyclic RGD peptide (GPU-201) in boron neutron capture therapy (BNCT). Methods GPU-201 was synthesized from integrin-binding Arg-Gly-Asp (RGD) consensus sequence of matrix proteins and a 10B cluster 1, 2-dicarba-closo-dodecaborane-10B. Mercaptododecaborate-10B (BSH) dissolved in physiological saline and BSH and GPU-201 dissolved with cyclodextrin (CD) as a solubilizing and dispersing agent were intraperitoneally administered to SCC VII tumor-bearing mice. Then, the 10B concentrations in the tumors and normal tissues were measured by γ-ray spectrometry. Meanwhile, tumor-bearing mice were continuously given 5-bromo-2’-deoxyuridine (BrdU) to label all proliferating (P) cells in the tumors, then treated with GPU-201, BSH-CD, or BSH. Immediately after reactor neutron beam or γ-ray irradiation, during which intratumor 10B concentrations were kept at levels similar to each other, cells from some tumors were isolated and incubated with a cytokinesis blocker. The responses of the Q and total (= P + Q) cell populations were assessed based on the frequency of micronuclei using immunofluorescence staining for BrdU. Results The 10B from BSH was washed away rapidly in all these tissues and the retention of 10B from BSH-CD and GPU-201 was similar except in blood where the 10B concentration from GPU-201 was higher for longer. GPU-201 showed a significantly stronger radio-sensitizing effect under neutron beam irradiation on both total and Q cell populations than any other 10B-carrier. Conclusion A novel 10B-carrier conjugated with an integrin-binding RGD peptide (GPU-201) that sensitized tumor cells more markedly than conventional 10B-carriers may be a promising candidate for use in BNCT. However, its toxicity needs to be tested further. PMID:29147290
High performance MRI simulations of motion on multi-GPU systems.

PubMed

Xanthis, Christos G; Venetis, Ioannis E; Aletras, Anthony H

2014-07-04

MRI physics simulators have been developed in the past for optimizing imaging protocols and for training purposes. However, these simulators have only addressed motion within a limited scope. The purpose of this study was the incorporation of realistic motion, such as cardiac motion, respiratory motion and flow, within MRI simulations in a high performance multi-GPU environment. Three different motion models were introduced in the Magnetic Resonance Imaging SIMULator (MRISIMUL) of this study: cardiac motion, respiratory motion and flow. Simulation of a simple Gradient Echo pulse sequence and a CINE pulse sequence on the corresponding anatomical model was performed. Myocardial tagging was also investigated. In pulse sequence design, software crushers were introduced to accommodate the long execution times in order to avoid spurious echoes formation. The displacement of the anatomical model isochromats was calculated within the Graphics Processing Unit (GPU) kernel for every timestep of the pulse sequence. Experiments that would allow simulation of custom anatomical and motion models were also performed. Last, simulations of motion with MRISIMUL on single-node and multi-node multi-GPU systems were examined. Gradient Echo and CINE images of the three motion models were produced and motion-related artifacts were demonstrated. The temporal evolution of the contractility of the heart was presented through the application of myocardial tagging. Better simulation performance and image quality were presented through the introduction of software crushers without the need to further increase the computational load and GPU resources. Last, MRISIMUL demonstrated an almost linear scalable performance with the increasing number of available GPU cards, in both single-node and multi-node multi-GPU computer systems. MRISIMUL is the first MR physics simulator to have implemented motion with a 3D large computational load on a single computer multi-GPU configuration. The incorporation of realistic motion models, such as cardiac motion, respiratory motion and flow may benefit the design and optimization of existing or new MR pulse sequences, protocols and algorithms, which examine motion related MR applications.
Fast GPU-based computation of spatial multigrid multiframe LMEM for PET.

PubMed

Nassiri, Moulay Ali; Carrier, Jean-François; Després, Philippe

2015-09-01

Significant efforts were invested during the last decade to accelerate PET list-mode reconstructions, notably with GPU devices. However, the computation time per event is still relatively long, and the list-mode efficiency on the GPU is well below the histogram-mode efficiency. Since list-mode data are not arranged in any regular pattern, costly accesses to the GPU global memory can hardly be optimized and geometrical symmetries cannot be used. To overcome obstacles that limit the acceleration of reconstruction from list-mode on the GPU, a multigrid and multiframe approach of an expectation-maximization algorithm was developed. The reconstruction process is started during data acquisition, and calculations are executed concurrently on the GPU and the CPU, while the system matrix is computed on-the-fly. A new convergence criterion also was introduced, which is computationally more efficient on the GPU. The implementation was tested on a Tesla C2050 GPU device for a Gemini GXL PET system geometry. The results show that the proposed algorithm (multigrid and multiframe list-mode expectation-maximization, MGMF-LMEM) converges to the same solution as the LMEM algorithm more than three times faster. The execution time of the MGMF-LMEM algorithm was 1.1 s per million of events on the Tesla C2050 hardware used, for a reconstructed space of 188 x 188 x 57 voxels of 2 x 2 x 3.15 mm3. For 17- and 22-mm simulated hot lesions, the MGMF-LMEM algorithm led on the first iteration to contrast recovery coefficients (CRC) of more than 75 % of the maximum CRC while achieving a minimum in the relative mean square error. Therefore, the MGMF-LMEM algorithm can be used as a one-pass method to perform real-time reconstructions for low-count acquisitions, as in list-mode gated studies. The computation time for one iteration and 60 millions of events was approximately 66 s.
GPU-based Parallel Application Design for Emerging Mobile Devices

NASA Astrophysics Data System (ADS)

Gupta, Kshitij

A revolution is underway in the computing world that is causing a fundamental paradigm shift in device capabilities and form-factor, with a move from well-established legacy desktop/laptop computers to mobile devices in varying sizes and shapes. Amongst all the tasks these devices must support, graphics has emerged as the 'killer app' for providing a fluid user interface and high-fidelity game rendering, effectively making the graphics processor (GPU) one of the key components in (present and future) mobile systems. By utilizing the GPU as a general-purpose parallel processor, this dissertation explores the GPU computing design space from an applications standpoint, in the mobile context, by focusing on key challenges presented by these devices---limited compute, memory bandwidth, and stringent power consumption requirements---while improving the overall application efficiency of the increasingly important speech recognition workload for mobile user interaction. We broadly partition trends in GPU computing into four major categories. We analyze hardware and programming model limitations in current-generation GPUs and detail an alternate programming style called Persistent Threads, identify four use case patterns, and propose minimal modifications that would be required for extending native support. We show how by manually extracting data locality and altering the speech recognition pipeline, we are able to achieve significant savings in memory bandwidth while simultaneously reducing the compute burden on GPU-like parallel processors. As we foresee GPU computing to evolve from its current 'co-processor' model into an independent 'applications processor' that is capable of executing complex work independently, we create an alternate application framework that enables the GPU to handle all control-flow dependencies autonomously at run-time while minimizing host involvement to just issuing commands, that facilitates an efficient application implementation. Finally, as compute and communication capabilities of mobile devices improve, we analyze energy implications of processing speech recognition locally (on-chip) and offloading it to servers (in-cloud).
A Framework for Batched and GPU-Resident Factorization Algorithms Applied to Block Householder Transformations

DOE Office of Scientific and Technical Information (OSTI.GOV)

Dong, Tingzing Tim; Tomov, Stanimire Z; Luszczek, Piotr R

As modern hardware keeps evolving, an increasingly effective approach to developing energy efficient and high-performance solvers is to design them to work on many small size and independent problems. Many applications already need this functionality, especially for GPUs, which are currently known to be about four to five times more energy efficient than multicore CPUs. We describe the development of one-sided factorizations that work for a set of small dense matrices in parallel, and we illustrate our techniques on the QR factorization based on Householder transformations. We refer to this mode of operation as a batched factorization. Our approach ismore » based on representing the algorithms as a sequence of batched BLAS routines for GPU-only execution. This is in contrast to the hybrid CPU-GPU algorithms that rely heavily on using the multicore CPU for specific parts of the workload. But for a system to benefit fully from the GPU's significantly higher energy efficiency, avoiding the use of the multicore CPU must be a primary design goal, so the system can rely more heavily on the more efficient GPU. Additionally, this will result in the removal of the costly CPU-to-GPU communication. Furthermore, we do not use a single symmetric multiprocessor(on the GPU) to factorize a single problem at a time. We illustrate how our performance analysis, and the use of profiling and tracing tools, guided the development and optimization of our batched factorization to achieve up to a 2-fold speedup and a 3-fold energy efficiency improvement compared to our highly optimized batched CPU implementations based on the MKL library(when using two sockets of Intel Sandy Bridge CPUs). Compared to a batched QR factorization featured in the CUBLAS library for GPUs, we achieved up to 5x speedup on the K40 GPU.« less
Democratic Population Decisions Result in Robust Policy-Gradient Learning: A Parametric Study with GPU Simulations

PubMed Central

Richmond, Paul; Buesing, Lars; Giugliano, Michele; Vasilaki, Eleni

2011-01-01

High performance computing on the Graphics Processing Unit (GPU) is an emerging field driven by the promise of high computational power at a low cost. However, GPU programming is a non-trivial task and moreover architectural limitations raise the question of whether investing effort in this direction may be worthwhile. In this work, we use GPU programming to simulate a two-layer network of Integrate-and-Fire neurons with varying degrees of recurrent connectivity and investigate its ability to learn a simplified navigation task using a policy-gradient learning rule stemming from Reinforcement Learning. The purpose of this paper is twofold. First, we want to support the use of GPUs in the field of Computational Neuroscience. Second, using GPU computing power, we investigate the conditions under which the said architecture and learning rule demonstrate best performance. Our work indicates that networks featuring strong Mexican-Hat-shaped recurrent connections in the top layer, where decision making is governed by the formation of a stable activity bump in the neural population (a “non-democratic” mechanism), achieve mediocre learning results at best. In absence of recurrent connections, where all neurons “vote” independently (“democratic”) for a decision via population vector readout, the task is generally learned better and more robustly. Our study would have been extremely difficult on a desktop computer without the use of GPU programming. We present the routines developed for this purpose and show that a speed improvement of 5x up to 42x is provided versus optimised Python code. The higher speed is achieved when we exploit the parallelism of the GPU in the search of learning parameters. This suggests that efficient GPU programming can significantly reduce the time needed for simulating networks of spiking neurons, particularly when multiple parameter configurations are investigated. PMID:21572529
GPU accelerated generation of digitally reconstructed radiographs for 2-D/3-D image registration.

PubMed

Dorgham, Osama M; Laycock, Stephen D; Fisher, Mark H

2012-09-01

Recent advances in programming languages for graphics processing units (GPUs) provide developers with a convenient way of implementing applications which can be executed on the CPU and GPU interchangeably. GPUs are becoming relatively cheap, powerful, and widely available hardware components, which can be used to perform intensive calculations. The last decade of hardware performance developments shows that GPU-based computation is progressing significantly faster than CPU-based computation, particularly if one considers the execution of highly parallelisable algorithms. Future predictions illustrate that this trend is likely to continue. In this paper, we introduce a way of accelerating 2-D/3-D image registration by developing a hybrid system which executes on the CPU and utilizes the GPU for parallelizing the generation of digitally reconstructed radiographs (DRRs). Based on the advancements of the GPU over the CPU, it is timely to exploit the benefits of many-core GPU technology by developing algorithms for DRR generation. Although some previous work has investigated the rendering of DRRs using the GPU, this paper investigates approximations which reduce the computational overhead while still maintaining a quality consistent with that needed for 2-D/3-D registration with sufficient accuracy to be clinically acceptable in certain applications of radiation oncology. Furthermore, by comparing implementations of 2-D/3-D registration on the CPU and GPU, we investigate current performance and propose an optimal framework for PC implementations addressing the rigid registration problem. Using this framework, we are able to render DRR images from a 256×256×133 CT volume in ~24 ms using an NVidia GeForce 8800 GTX and in ~2 ms using NVidia GeForce GTX 580. In addition to applications requiring fast automatic patient setup, these levels of performance suggest image-guided radiation therapy at video frame rates is technically feasible using relatively low cost PC architecture.
High performance computing for deformable image registration: towards a new paradigm in adaptive radiotherapy.

PubMed

Samant, Sanjiv S; Xia, Junyi; Muyan-Ozcelik, Pinar; Owens, John D

2008-08-01

The advent of readily available temporal imaging or time series volumetric (4D) imaging has become an indispensable component of treatment planning and adaptive radiotherapy (ART) at many radiotherapy centers. Deformable image registration (DIR) is also used in other areas of medical imaging, including motion corrected image reconstruction. Due to long computation time, clinical applications of DIR in radiation therapy and elsewhere have been limited and consequently relegated to offline analysis. With the recent advances in hardware and software, graphics processing unit (GPU) based computing is an emerging technology for general purpose computation, including DIR, and is suitable for highly parallelized computing. However, traditional general purpose computation on the GPU is limited because the constraints of the available programming platforms. As well, compared to CPU programming, the GPU currently has reduced dedicated processor memory, which can limit the useful working data set for parallelized processing. We present an implementation of the demons algorithm using the NVIDIA 8800 GTX GPU and the new CUDA programming language. The GPU performance will be compared with single threading and multithreading CPU implementations on an Intel dual core 2.4 GHz CPU using the C programming language. CUDA provides a C-like language programming interface, and allows for direct access to the highly parallel compute units in the GPU. Comparisons for volumetric clinical lung images acquired using 4DCT were carried out. Computation time for 100 iterations in the range of 1.8-13.5 s was observed for the GPU with image size ranging from 2.0 x 10(6) to 14.2 x 10(6) pixels. The GPU registration was 55-61 times faster than the CPU for the single threading implementation, and 34-39 times faster for the multithreading implementation. For CPU based computing, the computational time generally has a linear dependence on image size for medical imaging data. Computational efficiency is characterized in terms of time per megapixels per iteration (TPMI) with units of seconds per megapixels per iteration (or spmi). For the demons algorithm, our CPU implementation yielded largely invariant values of TPMI. The mean TPMIs were 0.527 spmi and 0.335 spmi for the single threading and multithreading cases, respectively, with <2% variation over the considered image data range. For GPU computing, we achieved TPMI =0.00916 spmi with 3.7% variation, indicating optimized memory handling under CUDA. The paradigm of GPU based real-time DIR opens up a host of clinical applications for medical imaging.

Dataflow-Based Implementation of Layered Sensing Applications on High-Performance Embedded Processors

DTIC Science & Technology

2013-03-01

time (milliseconds) GFlops Comparison to GPU peak performance (%) Cascade Gaussian Filtering 13 45.19 6.3 Difference of Gaussian 0.512 152...values for the GPU-targeted actor implementations in terms of Giga Floating Point Operations Per Second ( GFLOPS ). Our GFLOPS calculation for an actor...kernels. The results for GFLOPS are provided in Table . The actors were implemented on an NVIDIA GTX260 GPU, which provides 715 GFLOPS as peak
GraphReduce: Large-Scale Graph Analytics on Accelerator-Based HPC Systems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Sengupta, Dipanjan; Agarwal, Kapil; Song, Shuaiwen

2015-09-30

Recent work on real-world graph analytics has sought to leverage the massive amount of parallelism offered by GPU devices, but challenges remain due to the inherent irregularity of graph algorithms and limitations in GPU-resident memory for storing large graphs. We present GraphReduce, a highly efficient and scalable GPU-based framework that operates on graphs that exceed the device’s internal memory capacity. GraphReduce adopts a combination of both edge- and vertex-centric implementations of the Gather-Apply-Scatter programming model and operates on multiple asynchronous GPU streams to fully exploit the high degrees of parallelism in GPUs with efficient graph data movement between the hostmore » and the device.« less
CoMD Implementation Suite in Emerging Programming Models

DOE Office of Scientific and Technical Information (OSTI.GOV)

Haque, Riyaz; Reeve, Sam; Juallmes, Luc

CoMD-Em is a software implementation suite of the CoMD [4] proxy app using different emerging programming models. It is intended to analyze the features and capabilities of novel programming models that could help ensure code and performance portability and scalability across heterogeneous platforms while improving programmer productivity. Another goal is to provide the authors and venders with some meaningful feedback regarding the capabilities and limitations of their models. The actual application is a classical molecular dynamics (MD) simulation using either the Lennard-Jones method (LJ) or the embedded atom method (EAM) for primary particle interaction. The code can be extended tomore » support alternate interaction models. The code is expected ro run on a wide class of heterogeneous hardware configurations like shard/distributed/hybrid memory, GPU's and any other platform supported by the underlying programming model.« less
GPU based framework for geospatial analyses

NASA Astrophysics Data System (ADS)

Cosmin Sandric, Ionut; Ionita, Cristian; Dardala, Marian; Furtuna, Titus

2017-04-01

Parallel processing on multiple CPU cores is already used at large scale in geocomputing, but parallel processing on graphics cards is just at the beginning. Being able to use an simple laptop with a dedicated graphics card for advanced and very fast geocomputation is an advantage that each scientist wants to have. The necessity to have high speed computation in geosciences has increased in the last 10 years, mostly due to the increase in the available datasets. These datasets are becoming more and more detailed and hence they require more space to store and more time to process. Distributed computation on multicore CPU's and GPU's plays an important role by processing one by one small parts from these big datasets. These way of computations allows to speed up the process, because instead of using just one process for each dataset, the user can use all the cores from a CPU or up to hundreds of cores from GPU The framework provide to the end user a standalone tools for morphometry analyses at multiscale level. An important part of the framework is dedicated to uncertainty propagation in geospatial analyses. The uncertainty may come from the data collection or may be induced by the model or may have an infinite sources. These uncertainties plays important roles when a spatial delineation of the phenomena is modelled. Uncertainty propagation is implemented inside the GPU framework using Monte Carlo simulations. The GPU framework with the standalone tools proved to be a reliable tool for modelling complex natural phenomena The framework is based on NVidia Cuda technology and is written in C++ programming language. The code source will be available on github at https://github.com/sandricionut/GeoRsGPU Acknowledgement: GPU framework for geospatial analysis, Young Researchers Grant (ICUB-University of Bucharest) 2016, director Ionut Sandric
Implementation and optimization of ultrasound signal processing algorithms on mobile GPU

NASA Astrophysics Data System (ADS)

Kong, Woo Kyu; Lee, Wooyoul; Kim, Kyu Cheol; Yoo, Yangmo; Song, Tai-Kyong

2014-03-01

A general-purpose graphics processing unit (GPGPU) has been used for improving computing power in medical ultrasound imaging systems. Recently, a mobile GPU becomes powerful to deal with 3D games and videos at high frame rates on Full HD or HD resolution displays. This paper proposes the method to implement ultrasound signal processing on a mobile GPU available in the high-end smartphone (Galaxy S4, Samsung Electronics, Seoul, Korea) with programmable shaders on the OpenGL ES 2.0 platform. To maximize the performance of the mobile GPU, the optimization of shader design and load sharing between vertex and fragment shader was performed. The beamformed data were captured from a tissue mimicking phantom (Model 539 Multipurpose Phantom, ATS Laboratories, Inc., Bridgeport, CT, USA) by using a commercial ultrasound imaging system equipped with a research package (Ultrasonix Touch, Ultrasonix, Richmond, BC, Canada). The real-time performance is evaluated by frame rates while varying the range of signal processing blocks. The implementation method of ultrasound signal processing on OpenGL ES 2.0 was verified by analyzing PSNR with MATLAB gold standard that has the same signal path. CNR was also analyzed to verify the method. From the evaluations, the proposed mobile GPU-based processing method has no significant difference with the processing using MATLAB (i.e., PSNR<52.51 dB). The comparable results of CNR were obtained from both processing methods (i.e., 11.31). From the mobile GPU implementation, the frame rates of 57.6 Hz were achieved. The total execution time was 17.4 ms that was faster than the acquisition time (i.e., 34.4 ms). These results indicate that the mobile GPU-based processing method can support real-time ultrasound B-mode processing on the smartphone.
GPU-Accelerated Voxelwise Hepatic Perfusion Quantification

PubMed Central

Wang, H; Cao, Y

2012-01-01

Voxelwise quantification of hepatic perfusion parameters from dynamic contrast enhanced (DCE) imaging greatly contributes to assessment of liver function in response to radiation therapy. However, the efficiency of the estimation of hepatic perfusion parameters voxel-by-voxel in the whole liver using a dual-input single-compartment model requires substantial improvement for routine clinical applications. In this paper, we utilize the parallel computation power of a graphics processing unit (GPU) to accelerate the computation, while maintaining the same accuracy as the conventional method. Using CUDA-GPU, the hepatic perfusion computations over multiple voxels are run across the GPU blocks concurrently but independently. At each voxel, non-linear least squares fitting the time series of the liver DCE data to the compartmental model is distributed to multiple threads in a block, and the computations of different time points are performed simultaneously and synchronically. An efficient fast Fourier transform in a block is also developed for the convolution computation in the model. The GPU computations of the voxel-by-voxel hepatic perfusion images are compared with ones by the CPU using the simulated DCE data and the experimental DCE MR images from patients. The computation speed is improved by 30 times using a NVIDIA Tesla C2050 GPU compared to a 2.67 GHz Intel Xeon CPU processor. To obtain liver perfusion maps with 626400 voxels in a patient’s liver, it takes 0.9 min with the GPU-accelerated voxelwise computation, compared to 110 min with the CPU, while both methods result in perfusion parameters differences less than 10−6. The method will be useful for generating liver perfusion images in clinical settings. PMID:22892645
Fast-GPU-PCC: A GPU-Based Technique to Compute Pairwise Pearson's Correlation Coefficients for Time Series Data-fMRI Study.

PubMed

Eslami, Taban; Saeed, Fahad

2018-04-20

Functional magnetic resonance imaging (fMRI) is a non-invasive brain imaging technique, which has been regularly used for studying brain’s functional activities in the past few years. A very well-used measure for capturing functional associations in brain is Pearson’s correlation coefficient. Pearson’s correlation is widely used for constructing functional network and studying dynamic functional connectivity of the brain. These are useful measures for understanding the effects of brain disorders on connectivities among brain regions. The fMRI scanners produce huge number of voxels and using traditional central processing unit (CPU)-based techniques for computing pairwise correlations is very time consuming especially when large number of subjects are being studied. In this paper, we propose a graphics processing unit (GPU)-based algorithm called Fast-GPU-PCC for computing pairwise Pearson’s correlation coefficient. Based on the symmetric property of Pearson’s correlation, this approach returns N ( N − 1 ) / 2 correlation coefficients located at strictly upper triangle part of the correlation matrix. Storing correlations in a one-dimensional array with the order as proposed in this paper is useful for further usage. Our experiments on real and synthetic fMRI data for different number of voxels and varying length of time series show that the proposed approach outperformed state of the art GPU-based techniques as well as the sequential CPU-based versions. We show that Fast-GPU-PCC runs 62 times faster than CPU-based version and about 2 to 3 times faster than two other state of the art GPU-based methods.
SU-D-206-01: Employing a Novel Consensus Optimization Strategy to Achieve Iterative Cone Beam CT Reconstruction On a Multi-GPU Platform

DOE Office of Scientific and Technical Information (OSTI.GOV)

Li, B; Southern Medical University, Guangzhou, Guangdong; Tian, Z

Purpose: While compressed sensing-based cone-beam CT (CBCT) iterative reconstruction techniques have demonstrated tremendous capability of reconstructing high-quality images from undersampled noisy data, its long computation time still hinders wide application in routine clinic. The purpose of this study is to develop a reconstruction framework that employs modern consensus optimization techniques to achieve CBCT reconstruction on a multi-GPU platform for improved computational efficiency. Methods: Total projection data were evenly distributed to multiple GPUs. Each GPU performed reconstruction using its own projection data with a conventional total variation regularization approach to ensure image quality. In addition, the solutions from GPUs were subjectmore » to a consistency constraint that they should be identical. We solved the optimization problem with all the constraints considered rigorously using an alternating direction method of multipliers (ADMM) algorithm. The reconstruction framework was implemented using OpenCL on a platform with two Nvidia GTX590 GPU cards, each with two GPUs. We studied the performance of our method and demonstrated its advantages through a simulation case with a NCAT phantom and an experimental case with a Catphan phantom. Result: Compared with the CBCT images reconstructed using conventional FDK method with full projection datasets, our proposed method achieved comparable image quality with about one third projection numbers. The computation time on the multi-GPU platform was ∼55 s and ∼ 35 s in the two cases respectively, achieving a speedup factor of ∼ 3.0 compared with single GPU reconstruction. Conclusion: We have developed a consensus ADMM-based CBCT reconstruction method which enabled performing reconstruction on a multi-GPU platform. The achieved efficiency made this method clinically attractive.« less
Efficient Parallel Levenberg-Marquardt Model Fitting towards Real-Time Automated Parametric Imaging Microscopy

PubMed Central

Zhu, Xiang; Zhang, Dianwen

2013-01-01

We present a fast, accurate and robust parallel Levenberg-Marquardt minimization optimizer, GPU-LMFit, which is implemented on graphics processing unit for high performance scalable parallel model fitting processing. GPU-LMFit can provide a dramatic speed-up in massive model fitting analyses to enable real-time automated pixel-wise parametric imaging microscopy. We demonstrate the performance of GPU-LMFit for the applications in superresolution localization microscopy and fluorescence lifetime imaging microscopy. PMID:24130785
Problems Related to Parallelization of CFD Algorithms on GPU, Multi-GPU and Hybrid Architectures

NASA Astrophysics Data System (ADS)

Biazewicz, Marek; Kurowski, Krzysztof; Ludwiczak, Bogdan; Napieraia, Krystyna

2010-09-01

Computational Fluid Dynamics (CFD) is one of the branches of fluid mechanics, which uses numerical methods and algorithms to solve and analyze fluid flows. CFD is used in various domains, such as oil and gas reservoir uncertainty analysis, aerodynamic body shapes optimization (e.g. planes, cars, ships, sport helmets, skis), natural phenomena analysis, numerical simulation for weather forecasting or realistic visualizations. CFD problem is very complex and needs a lot of computational power to obtain the results in a reasonable time. We have implemented a parallel application for two-dimensional CFD simulation with a free surface approximation (MAC method) using new hardware architectures, in particular multi-GPU and hybrid computing environments. For this purpose we decided to use NVIDIA graphic cards with CUDA environment due to its simplicity of programming and good computations performance. We used finite difference discretization of Navier-Stokes equations, where fluid is propagated over an Eulerian Grid. In this model, the behavior of the fluid inside the cell depends only on the properties of local, surrounding cells, therefore it is well suited for the GPU-based architecture. In this paper we demonstrate how to use efficiently the computing power of GPUs for CFD. Additionally, we present some best practices to help users analyze and improve the performance of CFD applications executed on GPU. Finally, we discuss various challenges around the multi-GPU implementation on the example of matrix multiplication.
Pyglidein - A Simple HTCondor Glidein Service

NASA Astrophysics Data System (ADS)

Schultz, D.; Riedel, B.; Merino, G.

2017-10-01

A major challenge for data processing and analysis at the IceCube Neutrino Observatory presents itself in connecting a large set of individual clusters together to form a computing grid. Most of these clusters do not provide a “standard” grid interface. Using a local account on each submit machine, HTCondor glideins can be submitted to virtually any type of scheduler. The glideins then connect back to a main HTCondor pool, where jobs can run normally with no special syntax. To respond to dynamic load, a simple server advertises the number of idle jobs in the queue and the resources they request. The submit script can query this server to optimize glideins to what is needed, or not submit if there is no demand. Configuring HTCondor dynamic slots in the glideins allows us to efficiently handle varying memory requirements as well as whole-node jobs. One step of the IceCube simulation chain, photon propagation in the ice, heavily relies on GPUs for faster execution. Therefore, one important requirement for any workload management system in IceCube is to handle GPU resources properly. Within the pyglidein system, we have successfully configured HTCondor glideins to use any GPU allocated to it, with jobs using the standard HTCondor GPU syntax to request and use a GPU. This mechanism allows us to seamlessly integrate our local GPU cluster with remote non-Grid GPU clusters, including specially allocated resources at XSEDE supercomputers.
Cpu/gpu Computing for AN Implicit Multi-Block Compressible Navier-Stokes Solver on Heterogeneous Platform

NASA Astrophysics Data System (ADS)

Deng, Liang; Bai, Hanli; Wang, Fang; Xu, Qingxin

2016-06-01

CPU/GPU computing allows scientists to tremendously accelerate their numerical codes. In this paper, we port and optimize a double precision alternating direction implicit (ADI) solver for three-dimensional compressible Navier-Stokes equations from our in-house Computational Fluid Dynamics (CFD) software on heterogeneous platform. First, we implement a full GPU version of the ADI solver to remove a lot of redundant data transfers between CPU and GPU, and then design two fine-grain schemes, namely “one-thread-one-point” and “one-thread-one-line”, to maximize the performance. Second, we present a dual-level parallelization scheme using the CPU/GPU collaborative model to exploit the computational resources of both multi-core CPUs and many-core GPUs within the heterogeneous platform. Finally, considering the fact that memory on a single node becomes inadequate when the simulation size grows, we present a tri-level hybrid programming pattern MPI-OpenMP-CUDA that merges fine-grain parallelism using OpenMP and CUDA threads with coarse-grain parallelism using MPI for inter-node communication. We also propose a strategy to overlap the computation with communication using the advanced features of CUDA and MPI programming. We obtain speedups of 6.0 for the ADI solver on one Tesla M2050 GPU in contrast to two Xeon X5670 CPUs. Scalability tests show that our implementation can offer significant performance improvement on heterogeneous platform.
OpenACC acceleration of an unstructured CFD solver based on a reconstructed discontinuous Galerkin method for compressible flows

DOE PAGES

Xia, Yidong; Lou, Jialin; Luo, Hong; ...

2015-02-09

Here, an OpenACC directive-based graphics processing unit (GPU) parallel scheme is presented for solving the compressible Navier–Stokes equations on 3D hybrid unstructured grids with a third-order reconstructed discontinuous Galerkin method. The developed scheme requires the minimum code intrusion and algorithm alteration for upgrading a legacy solver with the GPU computing capability at very little extra effort in programming, which leads to a unified and portable code development strategy. A face coloring algorithm is adopted to eliminate the memory contention because of the threading of internal and boundary face integrals. A number of flow problems are presented to verify the implementationmore » of the developed scheme. Timing measurements were obtained by running the resulting GPU code on one Nvidia Tesla K20c GPU card (Nvidia Corporation, Santa Clara, CA, USA) and compared with those obtained by running the equivalent Message Passing Interface (MPI) parallel CPU code on a compute node (consisting of two AMD Opteron 6128 eight-core CPUs (Advanced Micro Devices, Inc., Sunnyvale, CA, USA)). Speedup factors of up to 24× and 1.6× for the GPU code were achieved with respect to one and 16 CPU cores, respectively. The numerical results indicate that this OpenACC-based parallel scheme is an effective and extensible approach to port unstructured high-order CFD solvers to GPU computing.« less
Improving metabolic stability with deuterium: The discovery of GPU-028, a potent free fatty acid receptor 4 agonists.

PubMed

Li, Zheng; Xu, Xue; Li, Gang; Fu, Xiaoting; Liu, Yanzhi; Feng, Yufeng; Wang, Mingyan; Ouyang, Yunting; Han, Jing

2017-12-15

The free fatty acid receptor 4 (FFA4) has emerged as a promising anti-diabetic target due to its function in improvement of insulin secretion and insulin resistance. The FFA4 agonist TUG-891 revealed great potential as a widely used pharmacological tool, but it has been suffered from high plasma clearance probably because the phenylpropanoic acid is vulnerable to β-oxidation. To identify metabolically stable analog without influence on physiological mechanism of TUG-891, we tried to incorporate deuterium at the α-position of phenylpropionic acid to afford compound 4 (GPU-028). As expected, GPU-028 revealed a longer half-life (T 1/2  = 1.66 h), lower clearance (CL = 0.97 L/h/kg) and higher maximum plasma concentration (C max  = 2035.23 μg/L), resulting in a 4-fold higher exposure than TUG-891. Although GPU-028 exhibited a similar agonistic activity in comparison to TUG-891, the hypoglycemic effect of GPU-028 was better than that of TUG-891 after treatment over four weeks in diet-induced obese mice. These positive results indicated that GPU-028 might be a better pharmacological tool than TUG-891 to explore physiological function of FFA4, especially on the in vivo study. Copyright © 2017 Elsevier Ltd. All rights reserved.
Higher-order ice-sheet modelling accelerated by multigrid on graphics cards

NASA Astrophysics Data System (ADS)

Brædstrup, Christian; Egholm, David

2013-04-01

Higher-order ice flow modelling is a very computer intensive process owing primarily to the nonlinear influence of the horizontal stress coupling. When applied for simulating long-term glacial landscape evolution, the ice-sheet models must consider very long time series, while both high temporal and spatial resolution is needed to resolve small effects. The use of higher-order and full stokes models have therefore seen very limited usage in this field. However, recent advances in graphics card (GPU) technology for high performance computing have proven extremely efficient in accelerating many large-scale scientific computations. The general purpose GPU (GPGPU) technology is cheap, has a low power consumption and fits into a normal desktop computer. It could therefore provide a powerful tool for many glaciologists working on ice flow models. Our current research focuses on utilising the GPU as a tool in ice-sheet and glacier modelling. To this extent we have implemented the Integrated Second-Order Shallow Ice Approximation (iSOSIA) equations on the device using the finite difference method. To accelerate the computations, the GPU solver uses a non-linear Red-Black Gauss-Seidel iterator coupled with a Full Approximation Scheme (FAS) multigrid setup to further aid convergence. The GPU finite difference implementation provides the inherent parallelization that scales from hundreds to several thousands of cores on newer cards. We demonstrate the efficiency of the GPU multigrid solver using benchmark experiments.
The novel implicit LU-SGS parallel iterative method based on the diffusion equation of a nuclear reactor on a GPU cluster

NASA Astrophysics Data System (ADS)

Zhang, Jilin; Sha, Chaoqun; Wu, Yusen; Wan, Jian; Zhou, Li; Ren, Yongjian; Si, Huayou; Yin, Yuyu; Jing, Ya

2017-02-01

GPU not only is used in the field of graphic technology but also has been widely used in areas needing a large number of numerical calculations. In the energy industry, because of low carbon, high energy density, high duration and other characteristics, the development of nuclear energy cannot easily be replaced by other energy sources. Management of core fuel is one of the major areas of concern in a nuclear power plant, and it is directly related to the economic benefits and cost of nuclear power. The large-scale reactor core expansion equation is large and complicated, so the calculation of the diffusion equation is crucial in the core fuel management process. In this paper, we use CUDA programming technology on a GPU cluster to run the LU-SGS parallel iterative calculation against the background of the diffusion equation of the reactor. We divide one-dimensional and two-dimensional mesh into a plurality of domains, with each domain evenly distributed on the GPU blocks. A parallel collision scheme is put forward that defines the virtual boundary of the grid exchange information and data transmission by non-stop collision. Compared with the serial program, the experiment shows that GPU greatly improves the efficiency of program execution and verifies that GPU is playing a much more important role in the field of numerical calculations.
permGPU: Using graphics processing units in RNA microarray association studies.

PubMed

Shterev, Ivo D; Jung, Sin-Ho; George, Stephen L; Owzar, Kouros

2010-06-16

Many analyses of microarray association studies involve permutation, bootstrap resampling and cross-validation, that are ideally formulated as embarrassingly parallel computing problems. Given that these analyses are computationally intensive, scalable approaches that can take advantage of multi-core processor systems need to be developed. We have developed a CUDA based implementation, permGPU, that employs graphics processing units in microarray association studies. We illustrate the performance and applicability of permGPU within the context of permutation resampling for a number of test statistics. An extensive simulation study demonstrates a dramatic increase in performance when using permGPU on an NVIDIA GTX 280 card compared to an optimized C/C++ solution running on a conventional Linux server. permGPU is available as an open-source stand-alone application and as an extension package for the R statistical environment. It provides a dramatic increase in performance for permutation resampling analysis in the context of microarray association studies. The current version offers six test statistics for carrying out permutation resampling analyses for binary, quantitative and censored time-to-event traits.
Parallel Computer System for 3D Visualization Stereo on GPU

NASA Astrophysics Data System (ADS)

Al-Oraiqat, Anas M.; Zori, Sergii A.

2018-03-01

This paper proposes the organization of a parallel computer system based on Graphic Processors Unit (GPU) for 3D stereo image synthesis. The development is based on the modified ray tracing method developed by the authors for fast search of tracing rays intersections with scene objects. The system allows significant increase in the productivity for the 3D stereo synthesis of photorealistic quality. The generalized procedure of 3D stereo image synthesis on the Graphics Processing Unit/Graphics Processing Clusters (GPU/GPC) is proposed. The efficiency of the proposed solutions by GPU implementation is compared with single-threaded and multithreaded implementations on the CPU. The achieved average acceleration in multi-thread implementation on the test GPU and CPU is about 7.5 and 1.6 times, respectively. Studying the influence of choosing the size and configuration of the computational Compute Unified Device Archi-tecture (CUDA) network on the computational speed shows the importance of their correct selection. The obtained experimental estimations can be significantly improved by new GPUs with a large number of processing cores and multiprocessors, as well as optimized configuration of the computing CUDA network.
Multi-GPU parallel algorithm design and analysis for improved inversion of probability tomography with gravity gradiometry data

NASA Astrophysics Data System (ADS)

Hou, Zhenlong; Huang, Danian

2017-09-01

In this paper, we make a study on the inversion of probability tomography (IPT) with gravity gradiometry data at first. The space resolution of the results is improved by multi-tensor joint inversion, depth weighting matrix and the other methods. Aiming at solving the problems brought by the big data in the exploration, we present the parallel algorithm and the performance analysis combining Compute Unified Device Architecture (CUDA) with Open Multi-Processing (OpenMP) based on Graphics Processing Unit (GPU) accelerating. In the test of the synthetic model and real data from Vinton Dome, we get the improved results. It is also proved that the improved inversion algorithm is effective and feasible. The performance of parallel algorithm we designed is better than the other ones with CUDA. The maximum speedup could be more than 200. In the performance analysis, multi-GPU speedup and multi-GPU efficiency are applied to analyze the scalability of the multi-GPU programs. The designed parallel algorithm is demonstrated to be able to process larger scale of data and the new analysis method is practical.
Adaptation of a Multi-Block Structured Solver for Effective Use in a Hybrid CPU/GPU Massively Parallel Environment

NASA Astrophysics Data System (ADS)

Gutzwiller, David; Gontier, Mathieu; Demeulenaere, Alain

2014-11-01

Multi-Block structured solvers hold many advantages over their unstructured counterparts, such as a smaller memory footprint and efficient serial performance. Historically, multi-block structured solvers have not been easily adapted for use in a High Performance Computing (HPC) environment, and the recent trend towards hybrid GPU/CPU architectures has further complicated the situation. This paper will elaborate on developments and innovations applied to the NUMECA FINE/Turbo solver that have allowed near-linear scalability with real-world problems on over 250 hybrid GPU/GPU cluster nodes. Discussion will focus on the implementation of virtual partitioning and load balancing algorithms using a novel meta-block concept. This implementation is transparent to the user, allowing all pre- and post-processing steps to be performed using a simple, unpartitioned grid topology. Additional discussion will elaborate on developments that have improved parallel performance, including fully parallel I/O with the ADIOS API and the GPU porting of the computationally heavy CPUBooster convergence acceleration module. Head of HPC and Release Management, Numeca International.

Work stealing for GPU-accelerated parallel programs in a global address space framework

DOE Office of Scientific and Technical Information (OSTI.GOV)

Arafat, Humayun; Dinan, James; Krishnamoorthy, Sriram

Task parallelism is an attractive approach to automatically load balance the computation in a parallel system and adapt to dynamism exhibited by parallel systems. Exploiting task parallelism through work stealing has been extensively studied in shared and distributed-memory contexts. In this paper, we study the design of a system that uses work stealing for dynamic load balancing of task-parallel programs executed on hybrid distributed-memory CPU-graphics processing unit (GPU) systems in a global-address space framework. We take into account the unique nature of the accelerator model employed by GPUs, the significant performance difference between GPU and CPU execution as a functionmore » of problem size, and the distinct CPU and GPU memory domains. We consider various alternatives in designing a distributed work stealing algorithm for CPU-GPU systems, while taking into account the impact of task distribution and data movement overheads. These strategies are evaluated using microbenchmarks that capture various execution configurations as well as the state-of-the-art CCSD(T) application module from the computational chemistry domain« less
Implementation of EAM and FS potentials in HOOMD-blue

NASA Astrophysics Data System (ADS)

Yang, Lin; Zhang, Feng; Travesset, Alex; Wang, Caizhuang; Ho, Kaiming

HOOMD-blue is a general-purpose software to perform classical molecular dynamics simulations entirely on GPUs. We provide full support for EAM and FS type potentials in HOOMD-blue, and report accuracy and efficiency benchmarks, including comparisons with the LAMMPS GPU package. Two problems were selected to test the accuracy: the determination of the glass transition temperature of Cu64.5Zr35.5 alloy using an FS potential and the calculation of pair distribution functions of Ni3Al using an EAM potential. In both cases, the results using HOOMD-blue are indistinguishable from those obtained by the GPU package in LAMMPS within statistical uncertainties. As tests for time efficiency, we benchmark time-steps per second using LAMMPS GPU and HOOMD-blue on one NVIDIA Tesla GPU. Compared to our typical LAMMPS simulations on one CPU cluster node which has 16 CPUs, LAMMPS GPU can be 3-3.5 times faster, and HOOMD-blue can be 4-5.5 times faster. We acknowledge the support from Laboratory Directed Research and Development (LDRD) of Ames Laboratory.
GPU-accelerated FDTD modeling of radio-frequency field-tissue interactions in high-field MRI.

PubMed

Chi, Jieru; Liu, Feng; Weber, Ewald; Li, Yu; Crozier, Stuart

2011-06-01

The analysis of high-field RF field-tissue interactions requires high-performance finite-difference time-domain (FDTD) computing. Conventional CPU-based FDTD calculations offer limited computing performance in a PC environment. This study presents a graphics processing unit (GPU)-based parallel-computing framework, producing substantially boosted computing efficiency (with a two-order speedup factor) at a PC-level cost. Specific details of implementing the FDTD method on a GPU architecture have been presented and the new computational strategy has been successfully applied to the design of a novel 8-element transceive RF coil system at 9.4 T. Facilitated by the powerful GPU-FDTD computing, the new RF coil array offers optimized fields (averaging 25% improvement in sensitivity, and 20% reduction in loop coupling compared with conventional array structures of the same size) for small animal imaging with a robust RF configuration. The GPU-enabled acceleration paves the way for FDTD to be applied for both detailed forward modeling and inverse design of MRI coils, which were previously impractical.
Ramses-GPU: Second order MUSCL-Handcock finite volume fluid solver

NASA Astrophysics Data System (ADS)

Kestener, Pierre

2017-10-01

RamsesGPU is a reimplementation of RAMSES (ascl:1011.007) which drops the adaptive mesh refinement (AMR) features to optimize 3D uniform grid algorithms for modern graphics processor units (GPU) to provide an efficient software package for astrophysics applications that do not need AMR features but do require a very large number of integration time steps. RamsesGPU provides an very efficient C++/CUDA/MPI software implementation of a second order MUSCL-Handcock finite volume fluid solver for compressible hydrodynamics as a magnetohydrodynamics solver based on the constraint transport technique. Other useful modules includes static gravity, dissipative terms (viscosity, resistivity), and forcing source term for turbulence studies, and special care was taken to enhance parallel input/output performance by using state-of-the-art libraries such as HDF5 and parallel-netcdf.
Simulation of ring polymer melts with GPU acceleration

NASA Astrophysics Data System (ADS)

Schram, R. D.; Barkema, G. T.

2018-06-01

We implemented the elastic lattice polymer model on the GPU (Graphics Processing Unit), and show that the GPU is very efficient for polymer simulations of dense polymer melts. The implementation is able to perform up to 4.1 ṡ109 Monte Carlo moves per second. Compared to our standard CPU implementation, we find an effective speed-up of a factor 92. Using this GPU implementation we studied the equilibrium properties and the dynamics of non-concatenated ring polymers in a melt of such polymers, using Rouse modes. With increasing polymer length, we found a very slow transition to compactness with a growth exponent ν ≈ 1 / 3. Numerically we find that the longest internal time scale of the polymer scales as N3.1, with N the molecular weight of the ring polymer.
Tensor Algebra Library for NVidia Graphics Processing Units

DOE Office of Scientific and Technical Information (OSTI.GOV)

Liakh, Dmitry

This is a general purpose math library implementing basic tensor algebra operations on NVidia GPU accelerators. This software is a tensor algebra library that can perform basic tensor algebra operations, including tensor contractions, tensor products, tensor additions, etc., on NVidia GPU accelerators, asynchronously with respect to the CPU host. It supports a simultaneous use of multiple NVidia GPUs. Each asynchronous API function returns a handle which can later be used for querying the completion of the corresponding tensor algebra operation on a specific GPU. The tensors participating in a particular tensor operation are assumed to be stored in local RAMmore » of a node or GPU RAM. The main research area where this library can be utilized is the quantum many-body theory (e.g., in electronic structure theory).« less
Overview of implementation of DARPA GPU program in SAIC

NASA Astrophysics Data System (ADS)

Braunreiter, Dennis; Furtek, Jeremy; Chen, Hai-Wen; Healy, Dennis

2008-04-01

This paper reviews the implementation of DARPA MTO STAP-BOY program for both Phase I and II conducted at Science Applications International Corporation (SAIC). The STAP-BOY program conducts fast covariance factorization and tuning techniques for space-time adaptive process (STAP) Algorithm Implementation on Graphics Processor unit (GPU) Architectures for Embedded Systems. The first part of our presentation on the DARPA STAP-BOY program will focus on GPU implementation and algorithm innovations for a prototype radar STAP algorithm. The STAP algorithm will be implemented on the GPU, using stream programming (from companies such as PeakStream, ATI Technologies' CTM, and NVIDIA) and traditional graphics APIs. This algorithm will include fast range adaptive STAP weight updates and beamforming applications, each of which has been modified to exploit the parallel nature of graphics architectures.
Multi-core and GPU accelerated simulation of a radial star target imaged with equivalent t-number circular and Gaussian pupils

NASA Astrophysics Data System (ADS)

Greynolds, Alan W.

2013-09-01

Results from the GelOE optical engineering software are presented for the through-focus, monochromatic coherent and polychromatic incoherent imaging of a radial "star" target for equivalent t-number circular and Gaussian pupils. The FFT-based simulations are carried out using OpenMP threading on a multi-core desktop computer, with and without the aid of a many-core NVIDIA GPU accessing its cuFFT library. It is found that a custom FFT optimized for the 12-core host has similar performance to a simply implemented 256-core GPU FFT. A more sophisticated version of the latter but tuned to reduce overhead on a 448-core GPU is 20 to 28 times faster than a basic FFT implementation running on one CPU core.
A GPU-paralleled implementation of an enhanced face recognition algorithm

NASA Astrophysics Data System (ADS)

Chen, Hao; Liu, Xiyang; Shao, Shuai; Zan, Jiguo

2013-03-01

Face recognition algorithm based on compressed sensing and sparse representation is hotly argued in these years. The scheme of this algorithm increases recognition rate as well as anti-noise capability. However, the computational cost is expensive and has become a main restricting factor for real world applications. In this paper, we introduce a GPU-accelerated hybrid variant of face recognition algorithm named parallel face recognition algorithm (pFRA). We describe here how to carry out parallel optimization design to take full advantage of many-core structure of a GPU. The pFRA is tested and compared with several other implementations under different data sample size. Finally, Our pFRA, implemented with NVIDIA GPU and Computer Unified Device Architecture (CUDA) programming model, achieves a significant speedup over the traditional CPU implementations.
Elder-friendly emergency services in Brazil: necessary conditions for care.

PubMed

Santos, Mariana Timmers Dos; Lima, Maria Alice Dias da Silva; Zucatti, Paula Buchs

2016-01-01

To identify and analyze the aspects necessary to provide an elder-friendly emergency service (ES) from the perspective of nurses. This is a descriptive, quantitative study using the Delphi technique in three rounds. Nurses with professional experience in the ES and/or researchers with publications and/or conducting research in the study area were selected. The first round of the Delphi panel had 72 participants, the second 49, and the third 44. An online questionnaire was used based on a review of the scientific literature with questions organized into the central dimensions of elder-friendly hospitals. A five-point Likert scale was used for each question and a 70% consensus level was established. There were 38 aspects identified as necessary for elderly care that were organized into central dimensions. The study's results are consistent with the findings in scientific literature and suggest indicators for quality of care and training for an elder-friendly ES. Identificar e analisar aspectos necessários para um atendimento amigo do idoso nos serviços de emergência (SE), na perspectiva de enfermeiros. Estudo descritivo, quantitativo, com utilização da Técnica Delphi, em três rodadas. Foram selecionados enfermeiros com experiência profissional em SE e/ou pesquisadores com publicações e/ou desenvolvendo pesquisas na área de estudo. A primeira rodada do painel Delphi contou com 72 participantes, a segunda com 49 e a terceira com 44. Foi utilizado questionário on-line, baseado na revisão da literatura científica, com questões organizadas em dimensões centrais de hospitais amigos do idoso. Foi utilizada uma escala de Likert de 5 pontos para cada questão e estabelecido nível de consenso de 70%. Foram identificados 38 aspectos necessários para o atendimento ao idoso, organizados em dimensões centrais. Os resultados do estudo são consistentes com os achados na literatura científica e sugerem indicadores para qualidade do cuidado e para formação de SE amigos do idoso.
Correlação de longo alcance em sistemas binários de raios-x usando remoção de flutuações

NASA Astrophysics Data System (ADS)

Pereira, M. G.; Moret, M. A.; Zebende, G. F.; Nogueira, E., Jr.

2003-08-01

Neste trabalho é proposta uma metodologia de analise de series temporais de fontes astrofísicas, baseada no método proposto por Peng et al. (1994) e Liu et al. (1999), o qual consiste na idéia de que uma série temporal correlacionada pode ser mapeada por um processo de busca de auto-similaridades em diversas escalas de tempo n. Removendo as eventuais tendências e integrando o sinal observado, é obtida uma medida do desvio médio quadrático das flutuações do sinal integrado F(n)~na, onde a representa o fator de escala associado com a auto-similaridade da correlação de longo alcance do sinal. Baseado nos valores obtidos de a, é possível distinguir entre os casos de sinais não-correlacionados, tipo ruído branco (a = 0,5), sinal anti-persistentes (a < 0,5) e sinal persistente (a > 0,5). Usando esta metodologia, foram analisadas 129 curvas de luz de sistemas binários de raios-X, provenientes do banco de dados públicos de observações feitas pelo instrumento All Sky Monitor, a bordo do satélite Rossi X-Ray Timing Explorer (ASM-RXTE). Foram identificadas a presença de a'0,5 em mais de 90% dos sistemas estudados, implicando em dizer que as flutuações de intensidade observadas apresentam correlação de auto-similaridade, sem entretanto, indícios de apresentarem uma escala de tempo característica das flutuações de intensidade. Sistemas onde são observadas erupções (flares), apresentam sistematicamente a > 0,5, característica esta, possivelmente associada com persistência das flutuações de densidade de disco ou taxa de acréscimo de massa. Os sistemas com curvas de luz onde nao são observadas as erupções apresentam uma distribuição normal centrada em a~0,62+/-0,10. Referências ¾ Peng, C.-K., Buldyrev, S.V., Havlin, S., Simons, M., Stanley, H.E., e Goldberg, A.L., Phys. Rev. E, (49), 1685 (1994). ¾ Liu, Y., Gopikrishnan, P., Cizeau, P., Meyer, M., Peng,C.-K., e Stanley, H.E., Phys. Rev. E, (60), 1390 (1999).
Graphics processing unit (GPU) real-time infrared scene generation

NASA Astrophysics Data System (ADS)

Christie, Chad L.; Gouthas, Efthimios (Themie); Williams, Owen M.

2007-04-01

VIRSuite, the GPU-based suite of software tools developed at DSTO for real-time infrared scene generation, is described. The tools include the painting of scene objects with radiometrically-associated colours, translucent object generation, polar plot validation and versatile scene generation. Special features include radiometric scaling within the GPU and the presence of zoom anti-aliasing at the core of VIRSuite. Extension of the zoom anti-aliasing construct to cover target embedding and the treatment of translucent objects is described.
High performance MRI simulations of motion on multi-GPU systems

PubMed Central

2014-01-01

Background MRI physics simulators have been developed in the past for optimizing imaging protocols and for training purposes. However, these simulators have only addressed motion within a limited scope. The purpose of this study was the incorporation of realistic motion, such as cardiac motion, respiratory motion and flow, within MRI simulations in a high performance multi-GPU environment. Methods Three different motion models were introduced in the Magnetic Resonance Imaging SIMULator (MRISIMUL) of this study: cardiac motion, respiratory motion and flow. Simulation of a simple Gradient Echo pulse sequence and a CINE pulse sequence on the corresponding anatomical model was performed. Myocardial tagging was also investigated. In pulse sequence design, software crushers were introduced to accommodate the long execution times in order to avoid spurious echoes formation. The displacement of the anatomical model isochromats was calculated within the Graphics Processing Unit (GPU) kernel for every timestep of the pulse sequence. Experiments that would allow simulation of custom anatomical and motion models were also performed. Last, simulations of motion with MRISIMUL on single-node and multi-node multi-GPU systems were examined. Results Gradient Echo and CINE images of the three motion models were produced and motion-related artifacts were demonstrated. The temporal evolution of the contractility of the heart was presented through the application of myocardial tagging. Better simulation performance and image quality were presented through the introduction of software crushers without the need to further increase the computational load and GPU resources. Last, MRISIMUL demonstrated an almost linear scalable performance with the increasing number of available GPU cards, in both single-node and multi-node multi-GPU computer systems. Conclusions MRISIMUL is the first MR physics simulator to have implemented motion with a 3D large computational load on a single computer multi-GPU configuration. The incorporation of realistic motion models, such as cardiac motion, respiratory motion and flow may benefit the design and optimization of existing or new MR pulse sequences, protocols and algorithms, which examine motion related MR applications. PMID:24996972
A survey of techniques for architecting and managing GPU register file

DOE PAGES

Mittal, Sparsh

2016-04-07

To support their massively-multithreaded architecture, GPUs use very large register file (RF) which has a capacity higher than even L1 and L2 caches. In total contrast, traditional CPUs use tiny RF and much larger caches to optimize latency. Due to these differences, along with the crucial impact of RF in determining GPU performance, novel and intelligent techniques are required for managing GPU RF. In this paper, we survey the techniques for designing and managing GPU RF. We discuss techniques related to performance, energy and reliability aspects of RF. To emphasize the similarities and differences between the techniques, we classify themmore » along several parameters. Lastly, the aim of this paper is to synthesize the state-of-art developments in RF management and also stimulate further research in this area.« less
A GPU-Based Implementation of the Firefly Algorithm for Variable Selection in Multivariate Calibration Problems

PubMed Central

de Paula, Lauro C. M.; Soares, Anderson S.; de Lima, Telma W.; Delbem, Alexandre C. B.; Coelho, Clarimar J.; Filho, Arlindo R. G.

2014-01-01

Several variable selection algorithms in multivariate calibration can be accelerated using Graphics Processing Units (GPU). Among these algorithms, the Firefly Algorithm (FA) is a recent proposed metaheuristic that may be used for variable selection. This paper presents a GPU-based FA (FA-MLR) with multiobjective formulation for variable selection in multivariate calibration problems and compares it with some traditional sequential algorithms in the literature. The advantage of the proposed implementation is demonstrated in an example involving a relatively large number of variables. The results showed that the FA-MLR, in comparison with the traditional algorithms is a more suitable choice and a relevant contribution for the variable selection problem. Additionally, the results also demonstrated that the FA-MLR performed in a GPU can be five times faster than its sequential implementation. PMID:25493625
A GPU-Based Implementation of the Firefly Algorithm for Variable Selection in Multivariate Calibration Problems.

PubMed

de Paula, Lauro C M; Soares, Anderson S; de Lima, Telma W; Delbem, Alexandre C B; Coelho, Clarimar J; Filho, Arlindo R G

2014-01-01

Several variable selection algorithms in multivariate calibration can be accelerated using Graphics Processing Units (GPU). Among these algorithms, the Firefly Algorithm (FA) is a recent proposed metaheuristic that may be used for variable selection. This paper presents a GPU-based FA (FA-MLR) with multiobjective formulation for variable selection in multivariate calibration problems and compares it with some traditional sequential algorithms in the literature. The advantage of the proposed implementation is demonstrated in an example involving a relatively large number of variables. The results showed that the FA-MLR, in comparison with the traditional algorithms is a more suitable choice and a relevant contribution for the variable selection problem. Additionally, the results also demonstrated that the FA-MLR performed in a GPU can be five times faster than its sequential implementation.
Accelerating electron tomography reconstruction algorithm ICON with GPU.

PubMed

Chen, Yu; Wang, Zihao; Zhang, Jingrong; Li, Lun; Wan, Xiaohua; Sun, Fei; Zhang, Fa

2017-01-01

Electron tomography (ET) plays an important role in studying in situ cell ultrastructure in three-dimensional space. Due to limited tilt angles, ET reconstruction always suffers from the "missing wedge" problem. With a validation procedure, iterative compressed-sensing optimized NUFFT reconstruction (ICON) demonstrates its power in the restoration of validated missing information for low SNR biological ET dataset. However, the huge computational demand has become a major problem for the application of ICON. In this work, we analyzed the framework of ICON and classified the operations of major steps of ICON reconstruction into three types. Accordingly, we designed parallel strategies and implemented them on graphics processing units (GPU) to generate a parallel program ICON-GPU. With high accuracy, ICON-GPU has a great acceleration compared to its CPU version, up to 83.7×, greatly relieving ICON's dependence on computing resource.
Protein-protein docking on hardware accelerators: comparison of GPU and MIC architectures

PubMed Central

2015-01-01

Background The hardware accelerators will provide solutions to computationally complex problems in bioinformatics fields. However, the effect of acceleration depends on the nature of the application, thus selection of an appropriate accelerator requires some consideration. Results In the present study, we compared the effects of acceleration using graphics processing unit (GPU) and many integrated core (MIC) on the speed of fast Fourier transform (FFT)-based protein-protein docking calculation. The GPU implementation performed the protein-protein docking calculations approximately five times faster than the MIC offload mode implementation. The MIC native mode implementation has the advantage in the implementation costs. However, the performance was worse with larger protein pairs because of memory limitations. Conclusion The results suggest that GPU is more suitable than MIC for accelerating FFT-based protein-protein docking applications. PMID:25707855
Rapid earthquake detection through GPU-Based template matching

NASA Astrophysics Data System (ADS)

Mu, Dawei; Lee, En-Jui; Chen, Po

2017-12-01

The template-matching algorithm (TMA) has been widely adopted for improving the reliability of earthquake detection. The TMA is based on calculating the normalized cross-correlation coefficient (NCC) between a collection of selected template waveforms and the continuous waveform recordings of seismic instruments. In realistic applications, the computational cost of the TMA is much higher than that of traditional techniques. In this study, we provide an analysis of the TMA and show how the GPU architecture provides an almost ideal environment for accelerating the TMA and NCC-based pattern recognition algorithms in general. So far, our best-performing GPU code has achieved a speedup factor of more than 800 with respect to a common sequential CPU code. We demonstrate the performance of our GPU code using seismic waveform recordings from the ML 6.6 Meinong earthquake sequence in Taiwan.
A survey of techniques for architecting and managing GPU register file

DOE Office of Scientific and Technical Information (OSTI.GOV)

Mittal, Sparsh

To support their massively-multithreaded architecture, GPUs use very large register file (RF) which has a capacity higher than even L1 and L2 caches. In total contrast, traditional CPUs use tiny RF and much larger caches to optimize latency. Due to these differences, along with the crucial impact of RF in determining GPU performance, novel and intelligent techniques are required for managing GPU RF. In this paper, we survey the techniques for designing and managing GPU RF. We discuss techniques related to performance, energy and reliability aspects of RF. To emphasize the similarities and differences between the techniques, we classify themmore » along several parameters. Lastly, the aim of this paper is to synthesize the state-of-art developments in RF management and also stimulate further research in this area.« less

A fully parallel in time and space algorithm for simulating the electrical activity of a neural tissue.

PubMed

Bedez, Mathieu; Belhachmi, Zakaria; Haeberlé, Olivier; Greget, Renaud; Moussaoui, Saliha; Bouteiller, Jean-Marie; Bischoff, Serge

2016-01-15

The resolution of a model describing the electrical activity of neural tissue and its propagation within this tissue is highly consuming in term of computing time and requires strong computing power to achieve good results. In this study, we present a method to solve a model describing the electrical propagation in neuronal tissue, using parareal algorithm, coupling with parallelization space using CUDA in graphical processing unit (GPU). We applied the method of resolution to different dimensions of the geometry of our model (1-D, 2-D and 3-D). The GPU results are compared with simulations from a multi-core processor cluster, using message-passing interface (MPI), where the spatial scale was parallelized in order to reach a comparable calculation time than that of the presented method using GPU. A gain of a factor 100 in term of computational time between sequential results and those obtained using the GPU has been obtained, in the case of 3-D geometry. Given the structure of the GPU, this factor increases according to the fineness of the geometry used in the computation. To the best of our knowledge, it is the first time such a method is used, even in the case of neuroscience. Parallelization time coupled with GPU parallelization space allows for drastically reducing computational time with a fine resolution of the model describing the propagation of the electrical signal in a neuronal tissue. Copyright © 2015 Elsevier B.V. All rights reserved.
Multi-GPU and multi-CPU accelerated FDTD scheme for vibroacoustic applications

NASA Astrophysics Data System (ADS)

Francés, J.; Otero, B.; Bleda, S.; Gallego, S.; Neipp, C.; Márquez, A.; Beléndez, A.

2015-06-01

The Finite-Difference Time-Domain (FDTD) method is applied to the analysis of vibroacoustic problems and to study the propagation of longitudinal and transversal waves in a stratified media. The potential of the scheme and the relevance of each acceleration strategy for massively computations in FDTD are demonstrated in this work. In this paper, we propose two new specific implementations of the bi-dimensional scheme of the FDTD method using multi-CPU and multi-GPU, respectively. In the first implementation, an open source message passing interface (OMPI) has been included in order to massively exploit the resources of a biprocessor station with two Intel Xeon processors. Moreover, regarding CPU code version, the streaming SIMD extensions (SSE) and also the advanced vectorial extensions (AVX) have been included with shared memory approaches that take advantage of the multi-core platforms. On the other hand, the second implementation called the multi-GPU code version is based on Peer-to-Peer communications available in CUDA on two GPUs (NVIDIA GTX 670). Subsequently, this paper presents an accurate analysis of the influence of the different code versions including shared memory approaches, vector instructions and multi-processors (both CPU and GPU) and compares them in order to delimit the degree of improvement of using distributed solutions based on multi-CPU and multi-GPU. The performance of both approaches was analysed and it has been demonstrated that the addition of shared memory schemes to CPU computing improves substantially the performance of vector instructions enlarging the simulation sizes that use efficiently the cache memory of CPUs. In this case GPU computing is slightly twice times faster than the fine tuned CPU version in both cases one and two nodes. However, for massively computations explicit vector instructions do not worth it since the memory bandwidth is the limiting factor and the performance tends to be the same than the sequential version with auto-vectorisation and also shared memory approach. In this scenario GPU computing is the best option since it provides a homogeneous behaviour. More specifically, the speedup of GPU computing achieves an upper limit of 12 for both one and two GPUs, whereas the performance reaches peak values of 80 GFlops and 146 GFlops for the performance for one GPU and two GPUs respectively. Finally, the method is applied to an earth crust profile in order to demonstrate the potential of our approach and the necessity of applying acceleration strategies in these type of applications.
Supermassive Black Hole Binaries in High Performance Massively Parallel Direct N-body Simulations on Large GPU Clusters

NASA Astrophysics Data System (ADS)

Spurzem, R.; Berczik, P.; Zhong, S.; Nitadori, K.; Hamada, T.; Berentzen, I.; Veles, A.

2012-07-01

Astrophysical Computer Simulations of Dense Star Clusters in Galactic Nuclei with Supermassive Black Holes are presented using new cost-efficient supercomputers in China accelerated by graphical processing cards (GPU). We use large high-accuracy direct N-body simulations with Hermite scheme and block-time steps, parallelised across a large number of nodes on the large scale and across many GPU thread processors on each node on the small scale. A sustained performance of more than 350 Tflop/s for a science run on using simultaneously 1600 Fermi C2050 GPUs is reached; a detailed performance model is presented and studies for the largest GPU clusters in China with up to Petaflop/s performance and 7000 Fermi GPU cards. In our case study we look at two supermassive black holes with equal and unequal masses embedded in a dense stellar cluster in a galactic nucleus. The hardening processes due to interactions between black holes and stars, effects of rotation in the stellar system and relativistic forces between the black holes are simultaneously taken into account. The simulation stops at the complete relativistic merger of the black holes.
Exploiting graphics processing units for computational biology and bioinformatics.

PubMed

Payne, Joshua L; Sinnott-Armstrong, Nicholas A; Moore, Jason H

2010-09-01

Advances in the video gaming industry have led to the production of low-cost, high-performance graphics processing units (GPUs) that possess more memory bandwidth and computational capability than central processing units (CPUs), the standard workhorses of scientific computing. With the recent release of generalpurpose GPUs and NVIDIA's GPU programming language, CUDA, graphics engines are being adopted widely in scientific computing applications, particularly in the fields of computational biology and bioinformatics. The goal of this article is to concisely present an introduction to GPU hardware and programming, aimed at the computational biologist or bioinformaticist. To this end, we discuss the primary differences between GPU and CPU architecture, introduce the basics of the CUDA programming language, and discuss important CUDA programming practices, such as the proper use of coalesced reads, data types, and memory hierarchies. We highlight each of these topics in the context of computing the all-pairs distance between instances in a dataset, a common procedure in numerous disciplines of scientific computing. We conclude with a runtime analysis of the GPU and CPU implementations of the all-pairs distance calculation. We show our final GPU implementation to outperform the CPU implementation by a factor of 1700.
GPU-based stochastic-gradient optimization for non-rigid medical image registration in time-critical applications

NASA Astrophysics Data System (ADS)

Bhosale, Parag; Staring, Marius; Al-Ars, Zaid; Berendsen, Floris F.

2018-03-01

Currently, non-rigid image registration algorithms are too computationally intensive to use in time-critical applications. Existing implementations that focus on speed typically address this by either parallelization on GPU-hardware, or by introducing methodically novel techniques into CPU-oriented algorithms. Stochastic gradient descent (SGD) optimization and variations thereof have proven to drastically reduce the computational burden for CPU-based image registration, but have not been successfully applied in GPU hardware due to its stochastic nature. This paper proposes 1) NiftyRegSGD, a SGD optimization for the GPU-based image registration tool NiftyReg, 2) random chunk sampler, a new random sampling strategy that better utilizes the memory bandwidth of GPU hardware. Experiments have been performed on 3D lung CT data of 19 patients, which compared NiftyRegSGD (with and without random chunk sampler) with CPU-based elastix Fast Adaptive SGD (FASGD) and NiftyReg. The registration runtime was 21.5s, 4.4s and 2.8s for elastix-FASGD, NiftyRegSGD without, and NiftyRegSGD with random chunk sampling, respectively, while similar accuracy was obtained. Our method is publicly available at https://github.com/SuperElastix/NiftyRegSGD.
Implementation of GPU accelerated SPECT reconstruction with Monte Carlo-based scatter correction.

PubMed

Bexelius, Tobias; Sohlberg, Antti

2018-06-01

Statistical SPECT reconstruction can be very time-consuming especially when compensations for collimator and detector response, attenuation, and scatter are included in the reconstruction. This work proposes an accelerated SPECT reconstruction algorithm based on graphics processing unit (GPU) processing. Ordered subset expectation maximization (OSEM) algorithm with CT-based attenuation modelling, depth-dependent Gaussian convolution-based collimator-detector response modelling, and Monte Carlo-based scatter compensation was implemented using OpenCL. The OpenCL implementation was compared against the existing multi-threaded OSEM implementation running on a central processing unit (CPU) in terms of scatter-to-primary ratios, standardized uptake values (SUVs), and processing speed using mathematical phantoms and clinical multi-bed bone SPECT/CT studies. The difference in scatter-to-primary ratios, visual appearance, and SUVs between GPU and CPU implementations was minor. On the other hand, at its best, the GPU implementation was noticed to be 24 times faster than the multi-threaded CPU version on a normal 128 × 128 matrix size 3 bed bone SPECT/CT data set when compensations for collimator and detector response, attenuation, and scatter were included. GPU SPECT reconstructions show great promise as an every day clinical reconstruction tool.
Accelerated event-by-event Monte Carlo microdosimetric calculations of electrons and protons tracks on a multi-core CPU and a CUDA-enabled GPU.

PubMed

Kalantzis, Georgios; Tachibana, Hidenobu

2014-01-01

For microdosimetric calculations event-by-event Monte Carlo (MC) methods are considered the most accurate. The main shortcoming of those methods is the extensive requirement for computational time. In this work we present an event-by-event MC code of low projectile energy electron and proton tracks for accelerated microdosimetric MC simulations on a graphic processing unit (GPU). Additionally, a hybrid implementation scheme was realized by employing OpenMP and CUDA in such a way that both GPU and multi-core CPU were utilized simultaneously. The two implementation schemes have been tested and compared with the sequential single threaded MC code on the CPU. Performance comparison was established on the speed-up for a set of benchmarking cases of electron and proton tracks. A maximum speedup of 67.2 was achieved for the GPU-based MC code, while a further improvement of the speedup up to 20% was achieved for the hybrid approach. The results indicate the capability of our CPU-GPU implementation for accelerated MC microdosimetric calculations of both electron and proton tracks without loss of accuracy. Copyright © 2013 Elsevier Ireland Ltd. All rights reserved.
A GPU-Accelerated 3-D Coupled Subsample Estimation Algorithm for Volumetric Breast Strain Elastography.

PubMed

Peng, Bo; Wang, Yuqi; Hall, Timothy J; Jiang, Jingfeng

2017-04-01

Our primary objective of this paper was to extend a previously published 2-D coupled subsample tracking algorithm for 3-D speckle tracking in the framework of ultrasound breast strain elastography. In order to overcome heavy computational cost, we investigated the use of a graphic processing unit (GPU) to accelerate the 3-D coupled subsample speckle tracking method. The performance of the proposed GPU implementation was tested using a tissue-mimicking phantom and in vivo breast ultrasound data. The performance of this 3-D subsample tracking algorithm was compared with the conventional 3-D quadratic subsample estimation algorithm. On the basis of these evaluations, we concluded that the GPU implementation of this 3-D subsample estimation algorithm can provide high-quality strain data (i.e., high correlation between the predeformation and the motion-compensated postdeformation radio frequency echo data and high contrast-to-noise ratio strain images), as compared with the conventional 3-D quadratic subsample algorithm. Using the GPU implementation of the 3-D speckle tracking algorithm, volumetric strain data can be achieved relatively fast (approximately 20 s per volume [2.5 cm ×2.5 cm ×2.5 cm]).
PuReMD-GPU: A reactive molecular dynamics simulation package for GPUs

DOE Office of Scientific and Technical Information (OSTI.GOV)

Kylasa, S.B., E-mail: skylasa@purdue.edu; Aktulga, H.M., E-mail: hmaktulga@lbl.gov; Grama, A.Y., E-mail: ayg@cs.purdue.edu

2014-09-01

We present an efficient and highly accurate GP-GPU implementation of our community code, PuReMD, for reactive molecular dynamics simulations using the ReaxFF force field. PuReMD and its incorporation into LAMMPS (Reax/C) is used by a large number of research groups worldwide for simulating diverse systems ranging from biomembranes to explosives (RDX) at atomistic level of detail. The sub-femtosecond time-steps associated with ReaxFF strongly motivate significant improvements to per-timestep simulation time through effective use of GPUs. This paper presents, in detail, the design and implementation of PuReMD-GPU, which enables ReaxFF simulations on GPUs, as well as various performance optimization techniques wemore » developed to obtain high performance on state-of-the-art hardware. Comprehensive experiments on model systems (bulk water and amorphous silica) are presented to quantify the performance improvements achieved by PuReMD-GPU and to verify its accuracy. In particular, our experiments show up to 16× improvement in runtime compared to our highly optimized CPU-only single-core ReaxFF implementation. PuReMD-GPU is a unique production code, and is currently available on request from the authors.« less
Streaming parallel GPU acceleration of large-scale filter-based spiking neural networks.

PubMed

Slażyński, Leszek; Bohte, Sander

2012-01-01

The arrival of graphics processing (GPU) cards suitable for massively parallel computing promises affordable large-scale neural network simulation previously only available at supercomputing facilities. While the raw numbers suggest that GPUs may outperform CPUs by at least an order of magnitude, the challenge is to develop fine-grained parallel algorithms to fully exploit the particulars of GPUs. Computation in a neural network is inherently parallel and thus a natural match for GPU architectures: given inputs, the internal state for each neuron can be updated in parallel. We show that for filter-based spiking neurons, like the Spike Response Model, the additive nature of membrane potential dynamics enables additional update parallelism. This also reduces the accumulation of numerical errors when using single precision computation, the native precision of GPUs. We further show that optimizing simulation algorithms and data structures to the GPU's architecture has a large pay-off: for example, matching iterative neural updating to the memory architecture of the GPU speeds up this simulation step by a factor of three to five. With such optimizations, we can simulate in better-than-realtime plausible spiking neural networks of up to 50 000 neurons, processing over 35 million spiking events per second.
Multi-GPU maximum entropy image synthesis for radio astronomy

NASA Astrophysics Data System (ADS)

Cárcamo, M.; Román, P. E.; Casassus, S.; Moral, V.; Rannou, F. R.

2018-01-01

The maximum entropy method (MEM) is a well known deconvolution technique in radio-interferometry. This method solves a non-linear optimization problem with an entropy regularization term. Other heuristics such as CLEAN are faster but highly user dependent. Nevertheless, MEM has the following advantages: it is unsupervised, it has a statistical basis, it has a better resolution and better image quality under certain conditions. This work presents a high performance GPU version of non-gridding MEM, which is tested using real and simulated data. We propose a single-GPU and a multi-GPU implementation for single and multi-spectral data, respectively. We also make use of the Peer-to-Peer and Unified Virtual Addressing features of newer GPUs which allows to exploit transparently and efficiently multiple GPUs. Several ALMA data sets are used to demonstrate the effectiveness in imaging and to evaluate GPU performance. The results show that a speedup from 1000 to 5000 times faster than a sequential version can be achieved, depending on data and image size. This allows to reconstruct the HD142527 CO(6-5) short baseline data set in 2.1 min, instead of 2.5 days that takes a sequential version on CPU.
Real-time time-division color electroholography using a single GPU and a USB module for synchronizing reference light.

PubMed

Araki, Hiromitsu; Takada, Naoki; Niwase, Hiroaki; Ikawa, Shohei; Fujiwara, Masato; Nakayama, Hirotaka; Kakue, Takashi; Shimobaba, Tomoyoshi; Ito, Tomoyoshi

2015-12-01

We propose real-time time-division color electroholography using a single graphics processing unit (GPU) and a simple synchronization system of reference light. To facilitate real-time time-division color electroholography, we developed a light emitting diode (LED) controller with a universal serial bus (USB) module and the drive circuit for reference light. A one-chip RGB LED connected to a personal computer via an LED controller was used as the reference light. A single GPU calculates three computer-generated holograms (CGHs) suitable for red, green, and blue colors in each frame of a three-dimensional (3D) movie. After CGH calculation using a single GPU, the CPU can synchronize the CGH display with the color switching of the one-chip RGB LED via the LED controller. Consequently, we succeeded in real-time time-division color electroholography for a 3D object consisting of around 1000 points per color when an NVIDIA GeForce GTX TITAN was used as the GPU. Furthermore, we implemented the proposed method in various GPUs. The experimental results showed that the proposed method was effective for various GPUs.
Benchmarking GPU and CPU codes for Heisenberg spin glass over-relaxation

NASA Astrophysics Data System (ADS)

Bernaschi, M.; Parisi, G.; Parisi, L.

2011-06-01

We present a set of possible implementations for Graphics Processing Units (GPU) of the Over-relaxation technique applied to the 3D Heisenberg spin glass model. The results show that a carefully tuned code can achieve more than 100 GFlops/s of sustained performance and update a single spin in about 0.6 nanoseconds. A multi-hit technique that exploits the GPU shared memory further reduces this time. Such results are compared with those obtained by means of a highly-tuned vector-parallel code on latest generation multi-core CPUs.
Simultaneous Range-Velocity Processing and SNR Analysis of AFIT’s Random Noise Radar

DTIC Science & Technology

2012-03-22

reducing the overall processing time. Two computers, equipped with NVIDIA ® GPUs, were used to process the col- 45 lected data. The specifications for each...gather the results back to the CPU. Another company , AccelerEyes®, has developed a product called Jacket® that claims to be better than the parallel...Number of Processing Cores 4 8 Processor Speed 3.33 GHz 3.07 GHz Installed Memory 48 GB 48 GB GPU Make NVIDIA NVIDIA GPU Model Tesla 1060 Tesla C2070 GPU
GPU Acceleration of DSP for Communication Receivers.

PubMed

Gunther, Jake; Gunther, Hyrum; Moon, Todd

2017-09-01

Graphics processing unit (GPU) implementations of signal processing algorithms can outperform CPU-based implementations. This paper describes the GPU implementation of several algorithms encountered in a wide range of high-data rate communication receivers including filters, multirate filters, numerically controlled oscillators, and multi-stage digital down converters. These structures are tested by processing the 20 MHz wide FM radio band (88-108 MHz). Two receiver structures are explored: a single channel receiver and a filter bank channelizer. Both run in real time on NVIDIA GeForce GTX 1080 graphics card.
High performance transcription factor-DNA docking with GPU computing

PubMed Central

2012-01-01

Background Protein-DNA docking is a very challenging problem in structural bioinformatics and has important implications in a number of applications, such as structure-based prediction of transcription factor binding sites and rational drug design. Protein-DNA docking is very computational demanding due to the high cost of energy calculation and the statistical nature of conformational sampling algorithms. More importantly, experiments show that the docking quality depends on the coverage of the conformational sampling space. It is therefore desirable to accelerate the computation of the docking algorithm, not only to reduce computing time, but also to improve docking quality. Methods In an attempt to accelerate the sampling process and to improve the docking performance, we developed a graphics processing unit (GPU)-based protein-DNA docking algorithm. The algorithm employs a potential-based energy function to describe the binding affinity of a protein-DNA pair, and integrates Monte-Carlo simulation and a simulated annealing method to search through the conformational space. Algorithmic techniques were developed to improve the computation efficiency and scalability on GPU-based high performance computing systems. Results The effectiveness of our approach is tested on a non-redundant set of 75 TF-DNA complexes and a newly developed TF-DNA docking benchmark. We demonstrated that the GPU-based docking algorithm can significantly accelerate the simulation process and thereby improving the chance of finding near-native TF-DNA complex structures. This study also suggests that further improvement in protein-DNA docking research would require efforts from two integral aspects: improvement in computation efficiency and energy function design. Conclusions We present a high performance computing approach for improving the prediction accuracy of protein-DNA docking. The GPU-based docking algorithm accelerates the search of the conformational space and thus increases the chance of finding more near-native structures. To the best of our knowledge, this is the first ad hoc effort of applying GPU or GPU clusters to the protein-DNA docking problem. PMID:22759575
SU-E-T-29: A Web Application for GPU-Based Monte Carlo IMRT/VMAT QA with Delivered Dose Verification

DOE Office of Scientific and Technical Information (OSTI.GOV)

Folkerts, M; University of California, San Diego, La Jolla, CA; Graves, Y

Purpose: To enable an existing web application for GPU-based Monte Carlo (MC) 3D dosimetry quality assurance (QA) to compute “delivered dose” from linac logfile data. Methods: We added significant features to an IMRT/VMAT QA web application which is based on existing technologies (HTML5, Python, and Django). This tool interfaces with python, c-code libraries, and command line-based GPU applications to perform a MC-based IMRT/VMAT QA. The web app automates many complicated aspects of interfacing clinical DICOM and logfile data with cutting-edge GPU software to run a MC dose calculation. The resultant web app is powerful, easy to use, and is ablemore » to re-compute both plan dose (from DICOM data) and delivered dose (from logfile data). Both dynalog and trajectorylog file formats are supported. Users upload zipped DICOM RP, CT, and RD data and set the expected statistic uncertainty for the MC dose calculation. A 3D gamma index map, 3D dose distribution, gamma histogram, dosimetric statistics, and DVH curves are displayed to the user. Additional the user may upload the delivery logfile data from the linac to compute a 'delivered dose' calculation and corresponding gamma tests. A comprehensive PDF QA report summarizing the results can also be downloaded. Results: We successfully improved a web app for a GPU-based QA tool that consists of logfile parcing, fluence map generation, CT image processing, GPU based MC dose calculation, gamma index calculation, and DVH calculation. The result is an IMRT and VMAT QA tool that conducts an independent dose calculation for a given treatment plan and delivery log file. The system takes both DICOM data and logfile data to compute plan dose and delivered dose respectively. Conclusion: We sucessfully improved a GPU-based MC QA tool to allow for logfile dose calculation. The high efficiency and accessibility will greatly facilitate IMRT and VMAT QA.« less
Parallelizing flow-accumulation calculations on graphics processing units—From iterative DEM preprocessing algorithm to recursive multiple-flow-direction algorithm

NASA Astrophysics Data System (ADS)

Qin, Cheng-Zhi; Zhan, Lijun

2012-06-01

As one of the important tasks in digital terrain analysis, the calculation of flow accumulations from gridded digital elevation models (DEMs) usually involves two steps in a real application: (1) using an iterative DEM preprocessing algorithm to remove the depressions and flat areas commonly contained in real DEMs, and (2) using a recursive flow-direction algorithm to calculate the flow accumulation for every cell in the DEM. Because both algorithms are computationally intensive, quick calculation of the flow accumulations from a DEM (especially for a large area) presents a practical challenge to personal computer (PC) users. In recent years, rapid increases in hardware capacity of the graphics processing units (GPUs) provided in modern PCs have made it possible to meet this challenge in a PC environment. Parallel computing on GPUs using a compute-unified-device-architecture (CUDA) programming model has been explored to speed up the execution of the single-flow-direction algorithm (SFD). However, the parallel implementation on a GPU of the multiple-flow-direction (MFD) algorithm, which generally performs better than the SFD algorithm, has not been reported. Moreover, GPU-based parallelization of the DEM preprocessing step in the flow-accumulation calculations has not been addressed. This paper proposes a parallel approach to calculate flow accumulations (including both iterative DEM preprocessing and a recursive MFD algorithm) on a CUDA-compatible GPU. For the parallelization of an MFD algorithm (MFD-md), two different parallelization strategies using a GPU are explored. The first parallelization strategy, which has been used in the existing parallel SFD algorithm on GPU, has the problem of computing redundancy. Therefore, we designed a parallelization strategy based on graph theory. The application results show that the proposed parallel approach to calculate flow accumulations on a GPU performs much faster than either sequential algorithms or other parallel GPU-based algorithms based on existing parallelization strategies.
Ice-sheet modelling accelerated by graphics cards

NASA Astrophysics Data System (ADS)

Brædstrup, Christian Fredborg; Damsgaard, Anders; Egholm, David Lundbek

2014-11-01

Studies of glaciers and ice sheets have increased the demand for high performance numerical ice flow models over the past decades. When exploring the highly non-linear dynamics of fast flowing glaciers and ice streams, or when coupling multiple flow processes for ice, water, and sediment, researchers are often forced to use super-computing clusters. As an alternative to conventional high-performance computing hardware, the Graphical Processing Unit (GPU) is capable of massively parallel computing while retaining a compact design and low cost. In this study, we present a strategy for accelerating a higher-order ice flow model using a GPU. By applying the newest GPU hardware, we achieve up to 180× speedup compared to a similar but serial CPU implementation. Our results suggest that GPU acceleration is a competitive option for ice-flow modelling when compared to CPU-optimised algorithms parallelised by the OpenMP or Message Passing Interface (MPI) protocols.
Gpu Implementation of a Viscous Flow Solver on Unstructured Grids

NASA Astrophysics Data System (ADS)

Xu, Tianhao; Chen, Long

2016-06-01

Graphics processing units have gained popularities in scientific computing over past several years due to their outstanding parallel computing capability. Computational fluid dynamics applications involve large amounts of calculations, therefore a latest GPU card is preferable of which the peak computing performance and memory bandwidth are much better than a contemporary high-end CPU. We herein focus on the detailed implementation of our GPU targeting Reynolds-averaged Navier-Stokes equations solver based on finite-volume method. The solver employs a vertex-centered scheme on unstructured grids for the sake of being capable of handling complex topologies. Multiple optimizations are carried out to improve the memory accessing performance and kernel utilization. Both steady and unsteady flow simulation cases are carried out using explicit Runge-Kutta scheme. The solver with GPU acceleration in this paper is demonstrated to have competitive advantages over the CPU targeting one.

Accelerating Large Scale Image Analyses on Parallel, CPU-GPU Equipped Systems

PubMed Central

Teodoro, George; Kurc, Tahsin M.; Pan, Tony; Cooper, Lee A.D.; Kong, Jun; Widener, Patrick; Saltz, Joel H.

2014-01-01

The past decade has witnessed a major paradigm shift in high performance computing with the introduction of accelerators as general purpose processors. These computing devices make available very high parallel computing power at low cost and power consumption, transforming current high performance platforms into heterogeneous CPU-GPU equipped systems. Although the theoretical performance achieved by these hybrid systems is impressive, taking practical advantage of this computing power remains a very challenging problem. Most applications are still deployed to either GPU or CPU, leaving the other resource under- or un-utilized. In this paper, we propose, implement, and evaluate a performance aware scheduling technique along with optimizations to make efficient collaborative use of CPUs and GPUs on a parallel system. In the context of feature computations in large scale image analysis applications, our evaluations show that intelligently co-scheduling CPUs and GPUs can significantly improve performance over GPU-only or multi-core CPU-only approaches. PMID:25419545
Medical image processing on the GPU - past, present and future.

PubMed

Eklund, Anders; Dufort, Paul; Forsberg, Daniel; LaConte, Stephen M

2013-12-01

Graphics processing units (GPUs) are used today in a wide range of applications, mainly because they can dramatically accelerate parallel computing, are affordable and energy efficient. In the field of medical imaging, GPUs are in some cases crucial for enabling practical use of computationally demanding algorithms. This review presents the past and present work on GPU accelerated medical image processing, and is meant to serve as an overview and introduction to existing GPU implementations. The review covers GPU acceleration of basic image processing operations (filtering, interpolation, histogram estimation and distance transforms), the most commonly used algorithms in medical imaging (image registration, image segmentation and image denoising) and algorithms that are specific to individual modalities (CT, PET, SPECT, MRI, fMRI, DTI, ultrasound, optical imaging and microscopy). The review ends by highlighting some future possibilities and challenges. Copyright © 2013 Elsevier B.V. All rights reserved.
3D Data Denoising via Nonlocal Means Filter by Using Parallel GPU Strategies

PubMed Central

Cuomo, Salvatore; De Michele, Pasquale; Piccialli, Francesco

2014-01-01

Nonlocal Means (NLM) algorithm is widely considered as a state-of-the-art denoising filter in many research fields. Its high computational complexity leads researchers to the development of parallel programming approaches and the use of massively parallel architectures such as the GPUs. In the recent years, the GPU devices had led to achieving reasonable running times by filtering, slice-by-slice, and 3D datasets with a 2D NLM algorithm. In our approach we design and implement a fully 3D NonLocal Means parallel approach, adopting different algorithm mapping strategies on GPU architecture and multi-GPU framework, in order to demonstrate its high applicability and scalability. The experimental results we obtained encourage the usability of our approach in a large spectrum of applicative scenarios such as magnetic resonance imaging (MRI) or video sequence denoising. PMID:25045397
Real-time generation of infrared ocean scene based on GPU

NASA Astrophysics Data System (ADS)

Jiang, Zhaoyi; Wang, Xun; Lin, Yun; Jin, Jianqiu

2007-12-01

Infrared (IR) image synthesis for ocean scene has become more and more important nowadays, especially for remote sensing and military application. Although a number of works present ready-to-use simulations, those techniques cover only a few possible ways of water interacting with the environment. And the detail calculation of ocean temperature is rarely considered by previous investigators. With the advance of programmable features of graphic card, many algorithms previously limited to offline processing have become feasible for real-time usage. In this paper, we propose an efficient algorithm for real-time rendering of infrared ocean scene using the newest features of programmable graphics processors (GPU). It differs from previous works in three aspects: adaptive GPU-based ocean surface tessellation, sophisticated balance equation of thermal balance for ocean surface, and GPU-based rendering for infrared ocean scene. Finally some results of infrared image are shown, which are in good accordance with real images.
A GPU accelerated and error-controlled solver for the unbounded Poisson equation in three dimensions

NASA Astrophysics Data System (ADS)

Exl, Lukas

2017-12-01

An efficient solver for the three dimensional free-space Poisson equation is presented. The underlying numerical method is based on finite Fourier series approximation. While the error of all involved approximations can be fully controlled, the overall computation error is driven by the convergence of the finite Fourier series of the density. For smooth and fast-decaying densities the proposed method will be spectrally accurate. The method scales with O(N log N) operations, where N is the total number of discretization points in the Cartesian grid. The majority of the computational costs come from fast Fourier transforms (FFT), which makes it ideal for GPU computation. Several numerical computations on CPU and GPU validate the method and show efficiency and convergence behavior. Tests are performed using the Vienna Scientific Cluster 3 (VSC3). A free MATLAB implementation for CPU and GPU is provided to the interested community.
Strong scaling of general-purpose molecular dynamics simulations on GPUs

NASA Astrophysics Data System (ADS)

Glaser, Jens; Nguyen, Trung Dac; Anderson, Joshua A.; Lui, Pak; Spiga, Filippo; Millan, Jaime A.; Morse, David C.; Glotzer, Sharon C.

2015-07-01

We describe a highly optimized implementation of MPI domain decomposition in a GPU-enabled, general-purpose molecular dynamics code, HOOMD-blue (Anderson and Glotzer, 2013). Our approach is inspired by a traditional CPU-based code, LAMMPS (Plimpton, 1995), but is implemented within a code that was designed for execution on GPUs from the start (Anderson et al., 2008). The software supports short-ranged pair force and bond force fields and achieves optimal GPU performance using an autotuning algorithm. We are able to demonstrate equivalent or superior scaling on up to 3375 GPUs in Lennard-Jones and dissipative particle dynamics (DPD) simulations of up to 108 million particles. GPUDirect RDMA capabilities in recent GPU generations provide better performance in full double precision calculations. For a representative polymer physics application, HOOMD-blue 1.0 provides an effective GPU vs. CPU node speed-up of 12.5 ×.
GPU Based N-Gram String Matching Algorithm with Score Table Approach for String Searching in Many Documents

NASA Astrophysics Data System (ADS)

Srinivasa, K. G.; Shree Devi, B. N.

2017-10-01

String searching in documents has become a tedious task with the evolution of Big Data. Generation of large data sets demand for a high performance search algorithm in areas such as text mining, information retrieval and many others. The popularity of GPU's for general purpose computing has been increasing for various applications. Therefore it is of great interest to exploit the thread feature of a GPU to provide a high performance search algorithm. This paper proposes an optimized new approach to N-gram model for string search in a number of lengthy documents and its GPU implementation. The algorithm exploits GPGPUs for searching strings in many documents employing character level N-gram matching with parallel Score Table approach and search using CUDA API. The new approach of Score table used for frequency storage of N-grams in a document, makes the search independent of the document's length and allows faster access to the frequency values, thus decreasing the search complexity. The extensive thread feature in a GPU has been exploited to enable parallel pre-processing of trigrams in a document for Score Table creation and parallel search in huge number of documents, thus speeding up the whole search process even for a large pattern size. Experiments were carried out for many documents of varied length and search strings from the standard Lorem Ipsum text on NVIDIA's GeForce GT 540M GPU with 96 cores. Results prove that the parallel approach for Score Table creation and searching gives a good speed up than the same approach executed serially.
Cost-effective GPU-grid for genome-wide epistasis calculations.

PubMed

Pütz, B; Kam-Thong, T; Karbalai, N; Altmann, A; Müller-Myhsok, B

2013-01-01

Until recently, genotype studies were limited to the investigation of single SNP effects due to the computational burden incurred when studying pairwise interactions of SNPs. However, some genetic effects as simple as coloring (in plants and animals) cannot be ascribed to a single locus but only understood when epistasis is taken into account [1]. It is expected that such effects are also found in complex diseases where many genes contribute to the clinical outcome of affected individuals. Only recently have such problems become feasible computationally. The inherently parallel structure of the problem makes it a perfect candidate for massive parallelization on either grid or cloud architectures. Since we are also dealing with confidential patient data, we were not able to consider a cloud-based solution but had to find a way to process the data in-house and aimed to build a local GPU-based grid structure. Sequential epistatsis calculations were ported to GPU using CUDA at various levels. Parallelization on the CPU was compared to corresponding GPU counterparts with regards to performance and cost. A cost-effective solution was created by combining custom-built nodes equipped with relatively inexpensive consumer-level graphics cards with highly parallel GPUs in a local grid. The GPU method outperforms current cluster-based systems on a price/performance criterion, as a single GPU shows speed performance comparable up to 200 CPU cores. The outlined approach will work for problems that easily lend themselves to massive parallelization. Code for various tasks has been made available and ongoing development of tools will further ease the transition from sequential to parallel algorithms.
Fast 3D dosimetric verifications based on an electronic portal imaging device using a GPU calculation engine.

PubMed

Zhu, Jinhan; Chen, Lixin; Chen, Along; Luo, Guangwen; Deng, Xiaowu; Liu, Xiaowei

2015-04-11

To use a graphic processing unit (GPU) calculation engine to implement a fast 3D pre-treatment dosimetric verification procedure based on an electronic portal imaging device (EPID). The GPU algorithm includes the deconvolution and convolution method for the fluence-map calculations, the collapsed-cone convolution/superposition (CCCS) algorithm for the 3D dose calculations and the 3D gamma evaluation calculations. The results of the GPU-based CCCS algorithm were compared to those of Monte Carlo simulations. The planned and EPID-based reconstructed dose distributions in overridden-to-water phantoms and the original patients were compared for 6 MV and 10 MV photon beams in intensity-modulated radiation therapy (IMRT) treatment plans based on dose differences and gamma analysis. The total single-field dose computation time was less than 8 s, and the gamma evaluation for a 0.1-cm grid resolution was completed in approximately 1 s. The results of the GPU-based CCCS algorithm exhibited good agreement with those of the Monte Carlo simulations. The gamma analysis indicated good agreement between the planned and reconstructed dose distributions for the treatment plans. For the target volume, the differences in the mean dose were less than 1.8%, and the differences in the maximum dose were less than 2.5%. For the critical organs, minor differences were observed between the reconstructed and planned doses. The GPU calculation engine was used to boost the speed of 3D dose and gamma evaluation calculations, thus offering the possibility of true real-time 3D dosimetric verification.
DeF-GPU: Efficient and effective deletions finding in hepatitis B viral genomic DNA using a GPU architecture.

PubMed

Cheng, Chun-Pei; Lan, Kuo-Lun; Liu, Wen-Chun; Chang, Ting-Tsung; Tseng, Vincent S

2016-12-01

Hepatitis B viral (HBV) infection is strongly associated with an increased risk of liver diseases like cirrhosis or hepatocellular carcinoma (HCC). Many lines of evidence suggest that deletions occurring in HBV genomic DNA are highly associated with the activity of HBV via the interplay between aberrant viral proteins release and human immune system. Deletions finding on the HBV whole genome sequences is thus a very important issue though there exist underlying the challenges in mining such big and complex biological data. Although some next generation sequencing (NGS) tools are recently designed for identifying structural variations such as insertions or deletions, their validity is generally committed to human sequences study. This design may not be suitable for viruses due to different species. We propose a graphics processing unit (GPU)-based data mining method called DeF-GPU to efficiently and precisely identify HBV deletions from large NGS data, which generally contain millions of reads. To fit the single instruction multiple data instructions, sequencing reads are referred to as multiple data and the deletion finding procedure is referred to as a single instruction. We use Compute Unified Device Architecture (CUDA) to parallelize the procedures, and further validate DeF-GPU on 5 synthetic and 1 real datasets. Our results suggest that DeF-GPU outperforms the existing commonly-used method Pindel and is able to exactly identify the deletions of our ground truth in few seconds. The source code and other related materials are available at https://sourceforge.net/projects/defgpu/. Copyright Â© 2016 Elsevier Inc. All rights reserved.
Fast generation of computer-generated hologram by graphics processing unit

NASA Astrophysics Data System (ADS)

Matsuda, Sho; Fujii, Tomohiko; Yamaguchi, Takeshi; Yoshikawa, Hiroshi

2009-02-01

A cylindrical hologram is well known to be viewable in 360 deg. This hologram depends high pixel resolution.Therefore, Computer-Generated Cylindrical Hologram (CGCH) requires huge calculation amount.In our previous research, we used look-up table method for fast calculation with Intel Pentium4 2.8 GHz.It took 480 hours to calculate high resolution CGCH (504,000 x 63,000 pixels and the average number of object points are 27,000).To improve quality of CGCH reconstructed image, fringe pattern requires higher spatial frequency and resolution.Therefore, to increase the calculation speed, we have to change the calculation method. In this paper, to reduce the calculation time of CGCH (912,000 x 108,000 pixels), we employ Graphics Processing Unit (GPU).It took 4,406 hours to calculate high resolution CGCH on Xeon 3.4 GHz.Since GPU has many streaming processors and a parallel processing structure, GPU works as the high performance parallel processor.In addition, GPU gives max performance to 2 dimensional data and streaming data.Recently, GPU can be utilized for the general purpose (GPGPU).For example, NVIDIA's GeForce7 series became a programmable processor with Cg programming language.Next GeForce8 series have CUDA as software development kit made by NVIDIA.Theoretically, calculation ability of GPU is announced as 500 GFLOPS. From the experimental result, we have achieved that 47 times faster calculation compared with our previous work which used CPU.Therefore, CGCH can be generated in 95 hours.So, total time is 110 hours to calculate and print the CGCH.
High-throughput sequence alignment using Graphics Processing Units

PubMed Central

Schatz, Michael C; Trapnell, Cole; Delcher, Arthur L; Varshney, Amitabh

2007-01-01

Background The recent availability of new, less expensive high-throughput DNA sequencing technologies has yielded a dramatic increase in the volume of sequence data that must be analyzed. These data are being generated for several purposes, including genotyping, genome resequencing, metagenomics, and de novo genome assembly projects. Sequence alignment programs such as MUMmer have proven essential for analysis of these data, but researchers will need ever faster, high-throughput alignment tools running on inexpensive hardware to keep up with new sequence technologies. Results This paper describes MUMmerGPU, an open-source high-throughput parallel pairwise local sequence alignment program that runs on commodity Graphics Processing Units (GPUs) in common workstations. MUMmerGPU uses the new Compute Unified Device Architecture (CUDA) from nVidia to align multiple query sequences against a single reference sequence stored as a suffix tree. By processing the queries in parallel on the highly parallel graphics card, MUMmerGPU achieves more than a 10-fold speedup over a serial CPU version of the sequence alignment kernel, and outperforms the exact alignment component of MUMmer on a high end CPU by 3.5-fold in total application time when aligning reads from recent sequencing projects using Solexa/Illumina, 454, and Sanger sequencing technologies. Conclusion MUMmerGPU is a low cost, ultra-fast sequence alignment program designed to handle the increasing volume of data produced by new, high-throughput sequencing technologies. MUMmerGPU demonstrates that even memory-intensive applications can run significantly faster on the relatively low-cost GPU than on the CPU. PMID:18070356
A GPU OpenCL based cross-platform Monte Carlo dose calculation engine (goMC)

NASA Astrophysics Data System (ADS)

Tian, Zhen; Shi, Feng; Folkerts, Michael; Qin, Nan; Jiang, Steve B.; Jia, Xun

2015-09-01

Monte Carlo (MC) simulation has been recognized as the most accurate dose calculation method for radiotherapy. However, the extremely long computation time impedes its clinical application. Recently, a lot of effort has been made to realize fast MC dose calculation on graphic processing units (GPUs). However, most of the GPU-based MC dose engines have been developed under NVidia’s CUDA environment. This limits the code portability to other platforms, hindering the introduction of GPU-based MC simulations to clinical practice. The objective of this paper is to develop a GPU OpenCL based cross-platform MC dose engine named goMC with coupled photon-electron simulation for external photon and electron radiotherapy in the MeV energy range. Compared to our previously developed GPU-based MC code named gDPM (Jia et al 2012 Phys. Med. Biol. 57 7783-97), goMC has two major differences. First, it was developed under the OpenCL environment for high code portability and hence could be run not only on different GPU cards but also on CPU platforms. Second, we adopted the electron transport model used in EGSnrc MC package and PENELOPE’s random hinge method in our new dose engine, instead of the dose planning method employed in gDPM. Dose distributions were calculated for a 15 MeV electron beam and a 6 MV photon beam in a homogenous water phantom, a water-bone-lung-water slab phantom and a half-slab phantom. Satisfactory agreement between the two MC dose engines goMC and gDPM was observed in all cases. The average dose differences in the regions that received a dose higher than 10% of the maximum dose were 0.48-0.53% for the electron beam cases and 0.15-0.17% for the photon beam cases. In terms of efficiency, goMC was ~4-16% slower than gDPM when running on the same NVidia TITAN card for all the cases we tested, due to both the different electron transport models and the different development environments. The code portability of our new dose engine goMC was validated by successfully running it on a variety of different computing devices including an NVidia GPU card, two AMD GPU cards and an Intel CPU processor. Computational efficiency among these platforms was compared.
CUDA programs for the GPU computing of the Swendsen-Wang multi-cluster spin flip algorithm: 2D and 3D Ising, Potts, and XY models

NASA Astrophysics Data System (ADS)

Komura, Yukihiro; Okabe, Yutaka

2014-03-01

We present sample CUDA programs for the GPU computing of the Swendsen-Wang multi-cluster spin flip algorithm. We deal with the classical spin models; the Ising model, the q-state Potts model, and the classical XY model. As for the lattice, both the 2D (square) lattice and the 3D (simple cubic) lattice are treated. We already reported the idea of the GPU implementation for 2D models (Komura and Okabe, 2012). We here explain the details of sample programs, and discuss the performance of the present GPU implementation for the 3D Ising and XY models. We also show the calculated results of the moment ratio for these models, and discuss phase transitions. Catalogue identifier: AERM_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AERM_v1_0.html Program obtainable from: CPC Program Library, Queen’s University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 5632 No. of bytes in distributed program, including test data, etc.: 14688 Distribution format: tar.gz Programming language: C, CUDA. Computer: System with an NVIDIA CUDA enabled GPU. Operating system: System with an NVIDIA CUDA enabled GPU. Classification: 23. External routines: NVIDIA CUDA Toolkit 3.0 or newer Nature of problem: Monte Carlo simulation of classical spin systems. Ising, q-state Potts model, and the classical XY model are treated for both two-dimensional and three-dimensional lattices. Solution method: GPU-based Swendsen-Wang multi-cluster spin flip Monte Carlo method. The CUDA implementation for the cluster-labeling is based on the work by Hawick et al. [1] and that by Kalentev et al. [2]. Restrictions: The system size is limited depending on the memory of a GPU. Running time: For the parameters used in the sample programs, it takes about a minute for each program. Of course, it depends on the system size, the number of Monte Carlo steps, etc. References: [1] K.A. Hawick, A. Leist, and D. P. Playne, Parallel Computing 36 (2010) 655-678 [2] O. Kalentev, A. Rai, S. Kemnitzb, and R. Schneider, J. Parallel Distrib. Comput. 71 (2011) 615-620
Viscoelastic Finite Difference Modeling Using Graphics Processing Units

NASA Astrophysics Data System (ADS)

Fabien-Ouellet, G.; Gloaguen, E.; Giroux, B.

2014-12-01

Full waveform seismic modeling requires a huge amount of computing power that still challenges today's technology. This limits the applicability of powerful processing approaches in seismic exploration like full-waveform inversion. This paper explores the use of Graphics Processing Units (GPU) to compute a time based finite-difference solution to the viscoelastic wave equation. The aim is to investigate whether the adoption of the GPU technology is susceptible to reduce significantly the computing time of simulations. The code presented herein is based on the freely accessible software of Bohlen (2002) in 2D provided under a General Public License (GNU) licence. This implementation is based on a second order centred differences scheme to approximate time differences and staggered grid schemes with centred difference of order 2, 4, 6, 8, and 12 for spatial derivatives. The code is fully parallel and is written using the Message Passing Interface (MPI), and it thus supports simulations of vast seismic models on a cluster of CPUs. To port the code from Bohlen (2002) on GPUs, the OpenCl framework was chosen for its ability to work on both CPUs and GPUs and its adoption by most of GPU manufacturers. In our implementation, OpenCL works in conjunction with MPI, which allows computations on a cluster of GPU for large-scale model simulations. We tested our code for model sizes between 1002 and 60002 elements. Comparison shows a decrease in computation time of more than two orders of magnitude between the GPU implementation run on a AMD Radeon HD 7950 and the CPU implementation run on a 2.26 GHz Intel Xeon Quad-Core. The speed-up varies depending on the order of the finite difference approximation and generally increases for higher orders. Increasing speed-ups are also obtained for increasing model size, which can be explained by kernel overheads and delays introduced by memory transfers to and from the GPU through the PCI-E bus. Those tests indicate that the GPU memory size and the slow memory transfers are the limiting factors of our GPU implementation. Those results show the benefits of using GPUs instead of CPUs for time based finite-difference seismic simulations. The reductions in computation time and in hardware costs are significant and open the door for new approaches in seismic inversion.
Musrfit-Real Time Parameter Fitting Using GPUs

NASA Astrophysics Data System (ADS)

Locans, Uldis; Suter, Andreas

High transverse field μSR (HTF-μSR) experiments typically lead to a rather large data sets, since it is necessary to follow the high frequencies present in the positron decay histograms. The analysis of these data sets can be very time consuming, usually due to the limited computational power of the hardware. To overcome the limited computing resources rotating reference frame transformation (RRF) is often used to reduce the data sets that need to be handled. This comes at a price typically the μSR community is not aware of: (i) due to the RRF transformation the fitting parameter estimate is of poorer precision, i.e., more extended expensive beamtime is needed. (ii) RRF introduces systematic errors which hampers the statistical interpretation of χ2 or the maximum log-likelihood. We will briefly discuss these issues in a non-exhaustive practical way. The only and single purpose of the RRF transformation is the sluggish computer power. Therefore during this work GPU (Graphical Processing Units) based fitting was developed which allows to perform real-time full data analysis without RRF. GPUs have become increasingly popular in scientific computing in recent years. Due to their highly parallel architecture they provide the opportunity to accelerate many applications with considerably less costs than upgrading the CPU computational power. With the emergence of frameworks such as CUDA and OpenCL these devices have become more easily programmable. During this work GPU support was added to Musrfit- a data analysis framework for μSR experiments. The new fitting algorithm uses CUDA or OpenCL to offload the most time consuming parts of the calculations to Nvidia or AMD GPUs. Using the current CPU implementation in Musrfit parameter fitting can take hours for certain data sets while the GPU version can allow to perform real-time data analysis on the same data sets. This work describes the challenges that arise in adding the GPU support to t as well as results obtained using the GPU version. The speedups using the GPU were measured comparing to the CPU implementation. Two different GPUs were used for the comparison — high end Nvidia Tesla K40c GPU designed for HPC applications and AMD Radeon R9 390× GPU designed for gaming industry.
A GPU OpenCL based cross-platform Monte Carlo dose calculation engine (goMC).

PubMed

Tian, Zhen; Shi, Feng; Folkerts, Michael; Qin, Nan; Jiang, Steve B; Jia, Xun

2015-10-07

Monte Carlo (MC) simulation has been recognized as the most accurate dose calculation method for radiotherapy. However, the extremely long computation time impedes its clinical application. Recently, a lot of effort has been made to realize fast MC dose calculation on graphic processing units (GPUs). However, most of the GPU-based MC dose engines have been developed under NVidia's CUDA environment. This limits the code portability to other platforms, hindering the introduction of GPU-based MC simulations to clinical practice. The objective of this paper is to develop a GPU OpenCL based cross-platform MC dose engine named goMC with coupled photon-electron simulation for external photon and electron radiotherapy in the MeV energy range. Compared to our previously developed GPU-based MC code named gDPM (Jia et al 2012 Phys. Med. Biol. 57 7783-97), goMC has two major differences. First, it was developed under the OpenCL environment for high code portability and hence could be run not only on different GPU cards but also on CPU platforms. Second, we adopted the electron transport model used in EGSnrc MC package and PENELOPE's random hinge method in our new dose engine, instead of the dose planning method employed in gDPM. Dose distributions were calculated for a 15 MeV electron beam and a 6 MV photon beam in a homogenous water phantom, a water-bone-lung-water slab phantom and a half-slab phantom. Satisfactory agreement between the two MC dose engines goMC and gDPM was observed in all cases. The average dose differences in the regions that received a dose higher than 10% of the maximum dose were 0.48-0.53% for the electron beam cases and 0.15-0.17% for the photon beam cases. In terms of efficiency, goMC was ~4-16% slower than gDPM when running on the same NVidia TITAN card for all the cases we tested, due to both the different electron transport models and the different development environments. The code portability of our new dose engine goMC was validated by successfully running it on a variety of different computing devices including an NVidia GPU card, two AMD GPU cards and an Intel CPU processor. Computational efficiency among these platforms was compared.
Type 2 Diabetes Mellitus remission eighteen months after Roux-en-Y gastric bypass.

PubMed

Girundi, Marcelo Gomes

2016-01-01

to evaluate the effectiveness of Roux-en-Y gastric bypass in improving the glycemic profile of obese patients with type 2 Diabetes Mellitus (DM2) after 18 months of follow-up. four hundred sixty-eight pacients with DM2 and BMI ≥35 were submitted to Roux-en-Y gastric bypass, from 1998 to 2010. All patients were submitted to glycemic control analysis in the 3rd, 6th, 9th, 12th and 18th postoperative months. We considered: type 2 diabetic patients, the ones with fasting glucose ≥126mg/dl and HbA1C ≥6.5 in two dosages; high risk patients for diabetes, those who presented fasting glucose ≥ 100 to 125 mg/dl and HbA1C between 5.7%-6.4%; and normal patients, those presenting glucose <100mg/dl and HbA1C <5.7%. Such diagnostic criteria were based on the official position of Sociedade Brasileira de Diabetes, published in July, 2011. The remission of DM2 was seen in 410 (87.6%) out of 468 patients 18 months after the surgery, that being a meaningful difference, with p<0.001. Fourty-eight (10.3%) patients sustained criteria for the disease and ten (2.1%) continued at high risk for DM2. Roux-en-Y gastric bypass was effective in the promotion and maintaince of long-term glycemic control. There are evidences showing that the remission of DM2 is not only related to weight loss and that other enteroinsular axis mechanisms must be involved. avaliar a eficácia da gastroplastia com derivação em Y-de-Roux, em pacientes obesos e portadores de Diabetes Mellitus tipo 2 (DM2), na melhoria do perfil glicêmico após 18 meses de seguimento. foram submetidos à derivação gástrica em Y-de-Roux 468 pacientes com IMC ≥35 e portadores de DM2, no período de 1998 a 2010. Todos os pacientes tiveram a análise do controle glicêmico realizadas no terceiro, sexto, nono, 12o e 18o meses de pós-operatório. Os critérios diagnósticos de diabetes foram baseados no Posicionamento Oficial da Sociedade Brasileira de Diabetes, publicado em julho de 2011. observou-se a remissão do DM2 em 410 pacientes (87,6%) após 18 meses da cirurgia, sendo essa diferença significativa com p-valor <0,001. A doença se manteve inalterada em 48 pacientes (10,3%), e dez pacientes (2,1%) permaneceram com o risco aumentado para DM2. a gastroplastia com derivação em Y-de-Roux foi efetiva na promoção e manutenção do controle glicêmico em longo prazo.
Accelerating Pseudo-Random Number Generator for MCNP on GPU

NASA Astrophysics Data System (ADS)

Gong, Chunye; Liu, Jie; Chi, Lihua; Hu, Qingfeng; Deng, Li; Gong, Zhenghu

2010-09-01

Pseudo-random number generators (PRNG) are intensively used in many stochastic algorithms in particle simulations, artificial neural networks and other scientific computation. The PRNG in Monte Carlo N-Particle Transport Code (MCNP) requires long period, high quality, flexible jump and fast enough. In this paper, we implement such a PRNG for MCNP on NVIDIA's GTX200 Graphics Processor Units (GPU) using CUDA programming model. Results shows that 3.80 to 8.10 times speedup are achieved compared with 4 to 6 cores CPUs and more than 679.18 million double precision random numbers can be generated per second on GPU.
Digital image processing using parallel computing based on CUDA technology

NASA Astrophysics Data System (ADS)

Skirnevskiy, I. P.; Pustovit, A. V.; Abdrashitova, M. O.

2017-01-01

This article describes expediency of using a graphics processing unit (GPU) in big data processing in the context of digital images processing. It provides a short description of a parallel computing technology and its usage in different areas, definition of the image noise and a brief overview of some noise removal algorithms. It also describes some basic requirements that should be met by certain noise removal algorithm in the projection to computer tomography. It provides comparison of the performance with and without using GPU as well as with different percentage of using CPU and GPU.

GAPD: a GPU-accelerated atom-based polychromatic diffraction simulation code.

PubMed

E, J C; Wang, L; Chen, S; Zhang, Y Y; Luo, S N

2018-03-01

GAPD, a graphics-processing-unit (GPU)-accelerated atom-based polychromatic diffraction simulation code for direct, kinematics-based, simulations of X-ray/electron diffraction of large-scale atomic systems with mono-/polychromatic beams and arbitrary plane detector geometries, is presented. This code implements GPU parallel computation via both real- and reciprocal-space decompositions. With GAPD, direct simulations are performed of the reciprocal lattice node of ultralarge systems (∼5 billion atoms) and diffraction patterns of single-crystal and polycrystalline configurations with mono- and polychromatic X-ray beams (including synchrotron undulator sources), and validation, benchmark and application cases are presented.
A numerical code for the simulation of non-equilibrium chemically reacting flows on hybrid CPU-GPU clusters

NASA Astrophysics Data System (ADS)

Kudryavtsev, Alexey N.; Kashkovsky, Alexander V.; Borisov, Semyon P.; Shershnev, Anton A.

2017-10-01

In the present work a computer code RCFS for numerical simulation of chemically reacting compressible flows on hybrid CPU/GPU supercomputers is developed. It solves 3D unsteady Euler equations for multispecies chemically reacting flows in general curvilinear coordinates using shock-capturing TVD schemes. Time advancement is carried out using the explicit Runge-Kutta TVD schemes. Program implementation uses CUDA application programming interface to perform GPU computations. Data between GPUs is distributed via domain decomposition technique. The developed code is verified on the number of test cases including supersonic flow over a cylinder.
Implementing a GPU-based numerical algorithm for modelling dynamics of a high-speed train

NASA Astrophysics Data System (ADS)

Sytov, E. S.; Bratus, A. S.; Yurchenko, D.

2018-04-01

This paper discusses the initiative of implementing a GPU-based numerical algorithm for studying various phenomena associated with dynamics of a high-speed railway transport. The proposed numerical algorithm for calculating a critical speed of the bogie is based on the first Lyapunov number. Numerical algorithm is validated by analytical results, derived for a simple model. A dynamic model of a carriage connected to a new dual-wheelset flexible bogie is studied for linear and dry friction damping. Numerical results obtained by CPU, MPU and GPU approaches are compared and appropriateness of these methods is discussed.
A 3D front tracking method on a CPU/GPU system

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bo, Wurigen; Grove, John

2011-01-21

We describe the method to port a sequential 3D interface tracking code to a GPU with CUDA. The interface is represented as a triangular mesh. Interface geometry properties and point propagation are performed on a GPU. Interface mesh adaptation is performed on a CPU. The convergence of the method is assessed from the test problems with given velocity fields. Performance results show overall speedups from 11 to 14 for the test problems under mesh refinement. We also briefly describe our ongoing work to couple the interface tracking method with a hydro solver.
Improving Quantum Gate Simulation using a GPU

NASA Astrophysics Data System (ADS)

Gutierrez, Eladio; Romero, Sergio; Trenas, Maria A.; Zapata, Emilio L.

2008-11-01

Due to the increasing computing power of the graphics processing units (GPU), they are becoming more and more popular when solving general purpose algorithms. As the simulation of quantum computers results on a problem with exponential complexity, it is advisable to perform a parallel computation, such as the one provided by the SIMD multiprocessors present in recent GPUs. In this paper, we focus on an important quantum algorithm, the quantum Fourier transform (QTF), in order to evaluate different parallelization strategies on a novel GPU architecture. Our implementation makes use of the new CUDA software/hardware architecture developed recently by NVIDIA.
Interactive Light Stimulus Generation with High Performance Real-Time Image Processing and Simple Scripting.

PubMed

Szécsi, László; Kacsó, Ágota; Zeck, Günther; Hantz, Péter

2017-01-01

Light stimulation with precise and complex spatial and temporal modulation is demanded by a series of research fields like visual neuroscience, optogenetics, ophthalmology, and visual psychophysics. We developed a user-friendly and flexible stimulus generating framework (GEARS GPU-based Eye And Retina Stimulation Software), which offers access to GPU computing power, and allows interactive modification of stimulus parameters during experiments. Furthermore, it has built-in support for driving external equipment, as well as for synchronization tasks, via USB ports. The use of GEARS does not require elaborate programming skills. The necessary scripting is visually aided by an intuitive interface, while the details of the underlying software and hardware components remain hidden. Internally, the software is a C++/Python hybrid using OpenGL graphics. Computations are performed on the GPU, and are defined in the GLSL shading language. However, all GPU settings, including the GPU shader programs, are automatically generated by GEARS. This is configured through a method encountered in game programming, which allows high flexibility: stimuli are straightforwardly composed using a broad library of basic components. Stimulus rendering is implemented solely in C++, therefore intermediary libraries for interfacing could be omitted. This enables the program to perform computationally demanding tasks like en-masse random number generation or real-time image processing by local and global operations.
ESPRIT-Like Two-Dimensional DOA Estimation for Monostatic MIMO Radar with Electromagnetic Vector Received Sensors under the Condition of Gain and Phase Uncertainties and Mutual Coupling

PubMed Central

Zhang, Yongshun; Zheng, Guimei; Feng, Cunqian; Tang, Jun

2017-01-01

In this paper, we focus on the problem of two-dimensional direction of arrival (2D-DOA) estimation for monostatic MIMO Radar with electromagnetic vector received sensors (MIMO-EMVSs) under the condition of gain and phase uncertainties (GPU) and mutual coupling (MC). GPU would spoil the invariance property of the EMVSs in MIMO-EMVSs, thus the effective ESPRIT algorithm unable to be used directly. Then we put forward a C-SPD ESPRIT-like algorithm. It estimates the 2D-DOA and polarization station angle (PSA) based on the instrumental sensors method (ISM). The C-SPD ESPRIT-like algorithm can obtain good angle estimation accuracy without knowing the GPU. Furthermore, it can be applied to arbitrary array configuration and has low complexity for avoiding the angle searching procedure. When MC and GPU exist together between the elements of EMVSs, in order to make our algorithm feasible, we derive a class of separated electromagnetic vector receiver and give the S-SPD ESPRIT-like algorithm. It can solve the problem of GPU and MC efficiently. And the array configuration can be arbitrary. The effectiveness of our proposed algorithms is verified by the simulation result. PMID:29072588
ESPRIT-Like Two-Dimensional DOA Estimation for Monostatic MIMO Radar with Electromagnetic Vector Received Sensors under the Condition of Gain and Phase Uncertainties and Mutual Coupling.

PubMed

Zhang, Dong; Zhang, Yongshun; Zheng, Guimei; Feng, Cunqian; Tang, Jun

2017-10-26

In this paper, we focus on the problem of two-dimensional direction of arrival (2D-DOA) estimation for monostatic MIMO Radar with electromagnetic vector received sensors (MIMO-EMVSs) under the condition of gain and phase uncertainties (GPU) and mutual coupling (MC). GPU would spoil the invariance property of the EMVSs in MIMO-EMVSs, thus the effective ESPRIT algorithm unable to be used directly. Then we put forward a C-SPD ESPRIT-like algorithm. It estimates the 2D-DOA and polarization station angle (PSA) based on the instrumental sensors method (ISM). The C-SPD ESPRIT-like algorithm can obtain good angle estimation accuracy without knowing the GPU. Furthermore, it can be applied to arbitrary array configuration and has low complexity for avoiding the angle searching procedure. When MC and GPU exist together between the elements of EMVSs, in order to make our algorithm feasible, we derive a class of separated electromagnetic vector receiver and give the S-SPD ESPRIT-like algorithm. It can solve the problem of GPU and MC efficiently. And the array configuration can be arbitrary. The effectiveness of our proposed algorithms is verified by the simulation result.
A GPU-accelerated 3D Coupled Sub-sample Estimation Algorithm for Volumetric Breast Strain Elastography

PubMed Central

Peng, Bo; Wang, Yuqi; Hall, Timothy J; Jiang, Jingfeng

2017-01-01

Our primary objective of this work was to extend a previously published 2D coupled sub-sample tracking algorithm for 3D speckle tracking in the framework of ultrasound breast strain elastography. In order to overcome heavy computational cost, we investigated the use of a graphic processing unit (GPU) to accelerate the 3D coupled sub-sample speckle tracking method. The performance of the proposed GPU implementation was tested using a tissue-mimicking (TM) phantom and in vivo breast ultrasound data. The performance of this 3D sub-sample tracking algorithm was compared with the conventional 3D quadratic sub-sample estimation algorithm. On the basis of these evaluations, we concluded that the GPU implementation of this 3D sub-sample estimation algorithm can provide high-quality strain data (i.e. high correlation between the pre- and the motion-compensated post-deformation RF echo data and high contrast-to-noise ratio strain images), as compared to the conventional 3D quadratic sub-sample algorithm. Using the GPU implementation of the 3D speckle tracking algorithm, volumetric strain data can be achieved relatively fast (approximately 20 seconds per volume [2.5 cm × 2.5 cm × 2.5 cm]). PMID:28166493
GPU-BSM: A GPU-Based Tool to Map Bisulfite-Treated Reads

PubMed Central

Manconi, Andrea; Orro, Alessandro; Manca, Emanuele; Armano, Giuliano; Milanesi, Luciano

2014-01-01

Cytosine DNA methylation is an epigenetic mark implicated in several biological processes. Bisulfite treatment of DNA is acknowledged as the gold standard technique to study methylation. This technique introduces changes in the genomic DNA by converting cytosines to uracils while 5-methylcytosines remain nonreactive. During PCR amplification 5-methylcytosines are amplified as cytosine, whereas uracils and thymines as thymine. To detect the methylation levels, reads treated with the bisulfite must be aligned against a reference genome. Mapping these reads to a reference genome represents a significant computational challenge mainly due to the increased search space and the loss of information introduced by the treatment. To deal with this computational challenge we devised GPU-BSM, a tool based on modern Graphics Processing Units. Graphics Processing Units are hardware accelerators that are increasingly being used successfully to accelerate general-purpose scientific applications. GPU-BSM is a tool able to map bisulfite-treated reads from whole genome bisulfite sequencing and reduced representation bisulfite sequencing, and to estimate methylation levels, with the goal of detecting methylation. Due to the massive parallelization obtained by exploiting graphics cards, GPU-BSM aligns bisulfite-treated reads faster than other cutting-edge solutions, while outperforming most of them in terms of unique mapped reads. PMID:24842718
cuBLASTP: Fine-Grained Parallelization of Protein Sequence Search on CPU+GPU.

PubMed

Zhang, Jing; Wang, Hao; Feng, Wu-Chun

2017-01-01

BLAST, short for Basic Local Alignment Search Tool, is a ubiquitous tool used in the life sciences for pairwise sequence search. However, with the advent of next-generation sequencing (NGS), whether at the outset or downstream from NGS, the exponential growth of sequence databases is outstripping our ability to analyze the data. While recent studies have utilized the graphics processing unit (GPU) to speedup the BLAST algorithm for searching protein sequences (i.e., BLASTP), these studies use coarse-grained parallelism, where one sequence alignment is mapped to only one thread. Such an approach does not efficiently utilize the capabilities of a GPU, particularly due to the irregularity of BLASTP in both execution paths and memory-access patterns. To address the above shortcomings, we present a fine-grained approach to parallelize BLASTP, where each individual phase of sequence search is mapped to many threads on a GPU. This approach, which we refer to as cuBLASTP, reorders data-access patterns and reduces divergent branches of the most time-consuming phases (i.e., hit detection and ungapped extension). In addition, cuBLASTP optimizes the remaining phases (i.e., gapped extension and alignment with trace back) on a multicore CPU and overlaps their execution with the phases running on the GPU.
MrBayes tgMC3++: A High Performance and Resource-Efficient GPU-Oriented Phylogenetic Analysis Method.

PubMed

Ling, Cheng; Hamada, Tsuyoshi; Gao, Jingyang; Zhao, Guoguang; Sun, Donghong; Shi, Weifeng

2016-01-01

MrBayes is a widespread phylogenetic inference tool harnessing empirical evolutionary models and Bayesian statistics. However, the computational cost on the likelihood estimation is very expensive, resulting in undesirably long execution time. Although a number of multi-threaded optimizations have been proposed to speed up MrBayes, there are bottlenecks that severely limit the GPU thread-level parallelism of likelihood estimations. This study proposes a high performance and resource-efficient method for GPU-oriented parallelization of likelihood estimations. Instead of having to rely on empirical programming, the proposed novel decomposition storage model implements high performance data transfers implicitly. In terms of performance improvement, a speedup factor of up to 178 can be achieved on the analysis of simulated datasets by four Tesla K40 cards. In comparison to the other publicly available GPU-oriented MrBayes, the tgMC 3 ++ method (proposed herein) outperforms the tgMC 3 (v1.0), nMC 3 (v2.1.1) and oMC 3 (v1.00) methods by speedup factors of up to 1.6, 1.9 and 2.9, respectively. Moreover, tgMC 3 ++ supports more evolutionary models and gamma categories, which previous GPU-oriented methods fail to take into analysis.
GPU based contouring method on grid DEM data

NASA Astrophysics Data System (ADS)

Tan, Liheng; Wan, Gang; Li, Feng; Chen, Xiaohui; Du, Wenlong

2017-08-01

This paper presents a novel method to generate contour lines from grid DEM data based on the programmable GPU pipeline. The previous contouring approaches often use CPU to construct a finite element mesh from the raw DEM data, and then extract contour segments from the elements. They also need a tracing or sorting strategy to generate the final continuous contours. These approaches can be heavily CPU-costing and time-consuming. Meanwhile the generated contours would be unsmooth if the raw data is sparsely distributed. Unlike the CPU approaches, we employ the GPU's vertex shader to generate a triangular mesh with arbitrary user-defined density, in which the height of each vertex is calculated through a third-order Cardinal spline function. Then in the same frame, segments are extracted from the triangles by the geometry shader, and translated to the CPU-side with an internal order in the GPU's transform feedback stage. Finally we propose a "Grid Sorting" algorithm to achieve the continuous contour lines by travelling the segments only once. Our method makes use of multiple stages of GPU pipeline for computation, which can generate smooth contour lines, and is significantly faster than the previous CPU approaches. The algorithm can be easily implemented with OpenGL 3.3 API or higher on consumer-level PCs.
Parallelized multi–graphics processing unit framework for high-speed Gabor-domain optical coherence microscopy

PubMed Central

Tankam, Patrice; Santhanam, Anand P.; Lee, Kye-Sung; Won, Jungeun; Canavesi, Cristina; Rolland, Jannick P.

2014-01-01

Abstract. Gabor-domain optical coherence microscopy (GD-OCM) is a volumetric high-resolution technique capable of acquiring three-dimensional (3-D) skin images with histological resolution. Real-time image processing is needed to enable GD-OCM imaging in a clinical setting. We present a parallelized and scalable multi-graphics processing unit (GPU) computing framework for real-time GD-OCM image processing. A parallelized control mechanism was developed to individually assign computation tasks to each of the GPUs. For each GPU, the optimal number of amplitude-scans (A-scans) to be processed in parallel was selected to maximize GPU memory usage and core throughput. We investigated five computing architectures for computational speed-up in processing 1000×1000 A-scans. The proposed parallelized multi-GPU computing framework enables processing at a computational speed faster than the GD-OCM image acquisition, thereby facilitating high-speed GD-OCM imaging in a clinical setting. Using two parallelized GPUs, the image processing of a 1×1×0.6 mm3 skin sample was performed in about 13 s, and the performance was benchmarked at 6.5 s with four GPUs. This work thus demonstrates that 3-D GD-OCM data may be displayed in real-time to the examiner using parallelized GPU processing. PMID:24695868
Parallelized multi-graphics processing unit framework for high-speed Gabor-domain optical coherence microscopy.

PubMed

Tankam, Patrice; Santhanam, Anand P; Lee, Kye-Sung; Won, Jungeun; Canavesi, Cristina; Rolland, Jannick P

2014-07-01

Gabor-domain optical coherence microscopy (GD-OCM) is a volumetric high-resolution technique capable of acquiring three-dimensional (3-D) skin images with histological resolution. Real-time image processing is needed to enable GD-OCM imaging in a clinical setting. We present a parallelized and scalable multi-graphics processing unit (GPU) computing framework for real-time GD-OCM image processing. A parallelized control mechanism was developed to individually assign computation tasks to each of the GPUs. For each GPU, the optimal number of amplitude-scans (A-scans) to be processed in parallel was selected to maximize GPU memory usage and core throughput. We investigated five computing architectures for computational speed-up in processing 1000×1000 A-scans. The proposed parallelized multi-GPU computing framework enables processing at a computational speed faster than the GD-OCM image acquisition, thereby facilitating high-speed GD-OCM imaging in a clinical setting. Using two parallelized GPUs, the image processing of a 1×1×0.6 mm3 skin sample was performed in about 13 s, and the performance was benchmarked at 6.5 s with four GPUs. This work thus demonstrates that 3-D GD-OCM data may be displayed in real-time to the examiner using parallelized GPU processing.
GPU.proton.DOCK: Genuine Protein Ultrafast proton equilibria consistent DOCKing.

PubMed

Kantardjiev, Alexander A

2011-07-01

GPU.proton.DOCK (Genuine Protein Ultrafast proton equilibria consistent DOCKing) is a state of the art service for in silico prediction of protein-protein interactions via rigorous and ultrafast docking code. It is unique in providing stringent account of electrostatic interactions self-consistency and proton equilibria mutual effects of docking partners. GPU.proton.DOCK is the first server offering such a crucial supplement to protein docking algorithms--a step toward more reliable and high accuracy docking results. The code (especially the Fast Fourier Transform bottleneck and electrostatic fields computation) is parallelized to run on a GPU supercomputer. The high performance will be of use for large-scale structural bioinformatics and systems biology projects, thus bridging physics of the interactions with analysis of molecular networks. We propose workflows for exploring in silico charge mutagenesis effects. Special emphasis is given to the interface-intuitive and user-friendly. The input is comprised of the atomic coordinate files in PDB format. The advanced user is provided with a special input section for addition of non-polypeptide charges, extra ionogenic groups with intrinsic pK(a) values or fixed ions. The output is comprised of docked complexes in PDB format as well as interactive visualization in a molecular viewer. GPU.proton.DOCK server can be accessed at http://gpudock.orgchm.bas.bg/.
GPU-accelerated Kernel Regression Reconstruction for Freehand 3D Ultrasound Imaging.

PubMed

Wen, Tiexiang; Li, Ling; Zhu, Qingsong; Qin, Wenjian; Gu, Jia; Yang, Feng; Xie, Yaoqin

2017-07-01

Volume reconstruction method plays an important role in improving reconstructed volumetric image quality for freehand three-dimensional (3D) ultrasound imaging. By utilizing the capability of programmable graphics processing unit (GPU), we can achieve a real-time incremental volume reconstruction at a speed of 25-50 frames per second (fps). After incremental reconstruction and visualization, hole-filling is performed on GPU to fill remaining empty voxels. However, traditional pixel nearest neighbor-based hole-filling fails to reconstruct volume with high image quality. On the contrary, the kernel regression provides an accurate volume reconstruction method for 3D ultrasound imaging but with the cost of heavy computational complexity. In this paper, a GPU-based fast kernel regression method is proposed for high-quality volume after the incremental reconstruction of freehand ultrasound. The experimental results show that improved image quality for speckle reduction and details preservation can be obtained with the parameter setting of kernel window size of [Formula: see text] and kernel bandwidth of 1.0. The computational performance of the proposed GPU-based method can be over 200 times faster than that on central processing unit (CPU), and the volume with size of 50 million voxels in our experiment can be reconstructed within 10 seconds.
Hierarchical Petascale Simulation Framework For Stress Corrosion Cracking

DOE Office of Scientific and Technical Information (OSTI.GOV)

Grama, Ananth

2013-12-18

A number of major accomplishments resulted from the project. These include: • Data Structures, Algorithms, and Numerical Methods for Reactive Molecular Dynamics. We have developed a range of novel data structures, algorithms, and solvers (amortized ILU, Spike) for use with ReaxFF and charge equilibration. • Parallel Formulations of ReactiveMD (Purdue ReactiveMolecular Dynamics Package, PuReMD, PuReMD-GPU, and PG-PuReMD) for Messaging, GPU, and GPU Cluster Platforms. We have developed efficient serial, parallel (MPI), GPU (Cuda), and GPU Cluster (MPI/Cuda) implementations. Our implementations have been demonstrated to be significantly better than the state of the art, both in terms of performance and scalability.more » • Comprehensive Validation in the Context of Diverse Applications. We have demonstrated the use of our software in diverse systems, including silica-water, silicon-germanium nanorods, and as part of other projects, extended it to applications ranging from explosives (RDX) to lipid bilayers (biomembranes under oxidative stress). • Open Source Software Packages for Reactive Molecular Dynamics. All versions of our soft- ware have been released over the public domain. There are over 100 major research groups worldwide using our software. • Implementation into the Department of Energy LAMMPS Software Package. We have also integrated our software into the Department of Energy LAMMPS software package.« less
Quantum Chemical Calculations Using Accelerators: Migrating Matrix Operations to the NVIDIA Kepler GPU and the Intel Xeon Phi.

PubMed

Leang, Sarom S; Rendell, Alistair P; Gordon, Mark S

2014-03-11

Increasingly, modern computer systems comprise a multicore general-purpose processor augmented with a number of special purpose devices or accelerators connected via an external interface such as a PCI bus. The NVIDIA Kepler Graphical Processing Unit (GPU) and the Intel Phi are two examples of such accelerators. Accelerators offer peak performances that can be well above those of the host processor. How to exploit this heterogeneous environment for legacy application codes is not, however, straightforward. This paper considers how matrix operations in typical quantum chemical calculations can be migrated to the GPU and Phi systems. Double precision general matrix multiply operations are endemic in electronic structure calculations, especially methods that include electron correlation, such as density functional theory, second order perturbation theory, and coupled cluster theory. The use of approaches that automatically determine whether to use the host or an accelerator, based on problem size, is explored, with computations that are occurring on the accelerator and/or the host. For data-transfers over PCI-e, the GPU provides the best overall performance for data sizes up to 4096 MB with consistent upload and download rates between 5-5.6 GB/s and 5.4-6.3 GB/s, respectively. The GPU outperforms the Phi for both square and nonsquare matrix multiplications.
An analytic linear accelerator source model for GPU-based Monte Carlo dose calculations.

PubMed

Tian, Zhen; Li, Yongbao; Folkerts, Michael; Shi, Feng; Jiang, Steve B; Jia, Xun

2015-10-21

Recently, there has been a lot of research interest in developing fast Monte Carlo (MC) dose calculation methods on graphics processing unit (GPU) platforms. A good linear accelerator (linac) source model is critical for both accuracy and efficiency considerations. In principle, an analytical source model should be more preferred for GPU-based MC dose engines than a phase-space file-based model, in that data loading and CPU-GPU data transfer can be avoided. In this paper, we presented an analytical field-independent source model specifically developed for GPU-based MC dose calculations, associated with a GPU-friendly sampling scheme. A key concept called phase-space-ring (PSR) was proposed. Each PSR contained a group of particles that were of the same type, close in energy and reside in a narrow ring on the phase-space plane located just above the upper jaws. The model parameterized the probability densities of particle location, direction and energy for each primary photon PSR, scattered photon PSR and electron PSR. Models of one 2D Gaussian distribution or multiple Gaussian components were employed to represent the particle direction distributions of these PSRs. A method was developed to analyze a reference phase-space file and derive corresponding model parameters. To efficiently use our model in MC dose calculations on GPU, we proposed a GPU-friendly sampling strategy, which ensured that the particles sampled and transported simultaneously are of the same type and close in energy to alleviate GPU thread divergences. To test the accuracy of our model, dose distributions of a set of open fields in a water phantom were calculated using our source model and compared to those calculated using the reference phase-space files. For the high dose gradient regions, the average distance-to-agreement (DTA) was within 1 mm and the maximum DTA within 2 mm. For relatively low dose gradient regions, the root-mean-square (RMS) dose difference was within 1.1% and the maximum dose difference within 1.7%. The maximum relative difference of output factors was within 0.5%. Over 98.5% passing rate was achieved in 3D gamma-index tests with 2%/2 mm criteria in both an IMRT prostate patient case and a head-and-neck case. These results demonstrated the efficacy of our model in terms of accurately representing a reference phase-space file. We have also tested the efficiency gain of our source model over our previously developed phase-space-let file source model. The overall efficiency of dose calculation was found to be improved by ~1.3-2.2 times in water and patient cases using our analytical model.

GPU-Accelerated Optical Coherence Tomography Signal Processing and Visualization

NASA Astrophysics Data System (ADS)

Darbrazi, Seyed Hamid Hosseiny

As piroxenas sao um vasto grupo de silicatos minerais encontrados em muitas rochas igneas e metamorficas. Na sua forma mais simples, estes silicatos sao constituidas por cadeias de SiO3 ligando grupos tetrahedricos de SiO4. A formula quimica geral das piroxenas e M2M1T2O6, onde M2 se refere a catioes geralmente em uma coordenacao octaedrica distorcida (Mg2+, Fe2+, Mn2+, Li+, Ca2+, Na+), M1 refere-se a catioes numa coordenacao octaedrica regular (Al3+, Fe3+, Ti4+, Cr3+, V3+, Ti3+, Zr4+, Sc3+, Zn2+, Mg2+, Fe2+, Mn2+), e T a catioes em coordenacao tetrahedrica (Si4+, Al3+, Fe3+). As piroxenas com estrutura monoclinica sao designadas de clinopiroxenes. A estabilidade das clinopyroxenes num espectro de composicoes quimicas amplo, em conjugacao com a possibilidade de ajustar as suas propriedades fisicas e quimicas e a durabilidade quimica, tem gerado um interesse mundial devido a suas aplicacoes em ciencia e tecnologia de materiais. Este trabalho trata do desenvolvimento de vidros e de vitro-cerâmicos baseadas de clinopiroxenas para aplicacoes funcionais. O estudo teve objectivos cientificos e tecnologicos; nomeadamente, adquirir conhecimentos fundamentais sobre a formacao de fases cristalinas e solucoes solidas em determinados sistemas vitro-cerâmicos, e avaliar a viabilidade de aplicacao dos novos materiais em diferentes areas tecnologicas, com especial enfase sobre a selagem em celulas de combustivel de oxido solido (SOFC). Com este intuito, prepararam-se varios vidros e materiais vitro-cerâmicos ao longo das juntas Enstatite (MgSiO3) - diopsidio (CaMgSi2O6) e diopsidio (CaMgSi2O6) - Ca - Tschermak (CaAlSi2O6), os quais foram caracterizados atraves de um vasto leque de tecnicas. Todos os vidros foram preparados por fusao-arrefecimento enquanto os vitro-cerâmicos foram obtidos quer por sinterizacao e cristalizacao de fritas, quer por nucleacao e cristalizacao de vidros monoliticos. Estudaram-se ainda os efeitos de varias substituicoes ionicas em composicoes de diopsidio contendo Al na estrutura, sinterizacao e no comportamento durante a cristalizacao de vidros e nas propriedades dos materiais vitro-cerâmicos, com relevância para a sua aplicacao como selantes em SOFC. Verificou-se que Foi observado que os vidros/vitro-cerâmicos a base de enstatite nao apresentavam as caracteristicas necessarias para serem usados como materiais selantes em SOFC, enquanto as melhores propriedades apresentadas pelos vitro-cerâmicos a base de diopsidio qualificaram-nos para futuros estudos neste tipo de aplicacoes. Para alem de investigar a adequacao dos vitro-cerâmicos a base de clinopyroxene como selantes, esta tese tem tambem como objetivo estudar a influencia dos agentes de nucleacao na nucleacao em volume dos vitro-cerâmicos resultantes a base de diopsidio, de modo a qualifica-los como potenciais materiais hopedeiros de residuos nucleares radioactivos.
Parallel hyperbolic PDE simulation on clusters: Cell versus GPU

NASA Astrophysics Data System (ADS)

Rostrup, Scott; De Sterck, Hans

2010-12-01

Increasingly, high-performance computing is looking towards data-parallel computational devices to enhance computational performance. Two technologies that have received significant attention are IBM's Cell Processor and NVIDIA's CUDA programming model for graphics processing unit (GPU) computing. In this paper we investigate the acceleration of parallel hyperbolic partial differential equation simulation on structured grids with explicit time integration on clusters with Cell and GPU backends. The message passing interface (MPI) is used for communication between nodes at the coarsest level of parallelism. Optimizations of the simulation code at the several finer levels of parallelism that the data-parallel devices provide are described in terms of data layout, data flow and data-parallel instructions. Optimized Cell and GPU performance are compared with reference code performance on a single x86 central processing unit (CPU) core in single and double precision. We further compare the CPU, Cell and GPU platforms on a chip-to-chip basis, and compare performance on single cluster nodes with two CPUs, two Cell processors or two GPUs in a shared memory configuration (without MPI). We finally compare performance on clusters with 32 CPUs, 32 Cell processors, and 32 GPUs using MPI. Our GPU cluster results use NVIDIA Tesla GPUs with GT200 architecture, but some preliminary results on recently introduced NVIDIA GPUs with the next-generation Fermi architecture are also included. This paper provides computational scientists and engineers who are considering porting their codes to accelerator environments with insight into how structured grid based explicit algorithms can be optimized for clusters with Cell and GPU accelerators. It also provides insight into the speed-up that may be gained on current and future accelerator architectures for this class of applications. Program summaryProgram title: SWsolver Catalogue identifier: AEGY_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEGY_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: GPL v3 No. of lines in distributed program, including test data, etc.: 59 168 No. of bytes in distributed program, including test data, etc.: 453 409 Distribution format: tar.gz Programming language: C, CUDA Computer: Parallel Computing Clusters. Individual compute nodes may consist of x86 CPU, Cell processor, or x86 CPU with attached NVIDIA GPU accelerator. Operating system: Linux Has the code been vectorised or parallelized?: Yes. Tested on 1-128 x86 CPU cores, 1-32 Cell Processors, and 1-32 NVIDIA GPUs. RAM: Tested on Problems requiring up to 4 GB per compute node. Classification: 12 External routines: MPI, CUDA, IBM Cell SDK Nature of problem: MPI-parallel simulation of Shallow Water equations using high-resolution 2D hyperbolic equation solver on regular Cartesian grids for x86 CPU, Cell Processor, and NVIDIA GPU using CUDA. Solution method: SWsolver provides 3 implementations of a high-resolution 2D Shallow Water equation solver on regular Cartesian grids, for CPU, Cell Processor, and NVIDIA GPU. Each implementation uses MPI to divide work across a parallel computing cluster. Additional comments: Sub-program numdiff is used for the test run.
Use of a graphics processing unit (GPU) to facilitate real-time 3D graphic presentation of the patient skin-dose distribution during fluoroscopic interventional procedures

PubMed Central

Rana, Vijay; Rudin, Stephen; Bednarek, Daniel R.

2012-01-01

We have developed a dose-tracking system (DTS) that calculates the radiation dose to the patient’s skin in real-time by acquiring exposure parameters and imaging-system-geometry from the digital bus on a Toshiba Infinix C-arm unit. The cumulative dose values are then displayed as a color map on an OpenGL-based 3D graphic of the patient for immediate feedback to the interventionalist. Determination of those elements on the surface of the patient 3D-graphic that intersect the beam and calculation of the dose for these elements in real time demands fast computation. Reducing the size of the elements results in more computation load on the computer processor and therefore a tradeoff occurs between the resolution of the patient graphic and the real-time performance of the DTS. The speed of the DTS for calculating dose to the skin is limited by the central processing unit (CPU) and can be improved by using the parallel processing power of a graphics processing unit (GPU). Here, we compare the performance speed of GPU-based DTS software to that of the current CPU-based software as a function of the resolution of the patient graphics. Results show a tremendous improvement in speed using the GPU. While an increase in the spatial resolution of the patient graphics resulted in slowing down the computational speed of the DTS on the CPU, the speed of the GPU-based DTS was hardly affected. This GPU-based DTS can be a powerful tool for providing accurate, real-time feedback about patient skin-dose to physicians while performing interventional procedures. PMID:24027616
Protein alignment algorithms with an efficient backtracking routine on multiple GPUs.

PubMed

Blazewicz, Jacek; Frohmberg, Wojciech; Kierzynka, Michal; Pesch, Erwin; Wojciechowski, Pawel

2011-05-20

Pairwise sequence alignment methods are widely used in biological research. The increasing number of sequences is perceived as one of the upcoming challenges for sequence alignment methods in the nearest future. To overcome this challenge several GPU (Graphics Processing Unit) computing approaches have been proposed lately. These solutions show a great potential of a GPU platform but in most cases address the problem of sequence database scanning and computing only the alignment score whereas the alignment itself is omitted. Thus, the need arose to implement the global and semiglobal Needleman-Wunsch, and Smith-Waterman algorithms with a backtracking procedure which is needed to construct the alignment. In this paper we present the solution that performs the alignment of every given sequence pair, which is a required step for progressive multiple sequence alignment methods, as well as for DNA recognition at the DNA assembly stage. Performed tests show that the implementation, with performance up to 6.3 GCUPS on a single GPU for affine gap penalties, is very efficient in comparison to other CPU and GPU-based solutions. Moreover, multiple GPUs support with load balancing makes the application very scalable. The article shows that the backtracking procedure of the sequence alignment algorithms may be designed to fit in with the GPU architecture. Therefore, our algorithm, apart from scores, is able to compute pairwise alignments. This opens a wide range of new possibilities, allowing other methods from the area of molecular biology to take advantage of the new computational architecture. Performed tests show that the efficiency of the implementation is excellent. Moreover, the speed of our GPU-based algorithms can be almost linearly increased when using more than one graphics card.
The normal vaginal and uterine bacterial microbiome in giant pandas (Ailuropoda melanoleuca).

PubMed

Yang, Xin; Cheng, Guangyang; Li, Caiwu; Yang, Jiang; Li, Jianan; Chen, Danyu; Zou, Wencheng; Jin, SenYan; Zhang, Hemin; Li, Desheng; He, Yongguo; Wang, Chengdong; Wang, Min; Wang, Hongning

2017-06-01

While the health effects of the colonization of the reproductive tracts of mammals by bacterial communities are widely known, there is a dearth of knowledge specifically in relation to giant panda microbiomes. In order to investigate the vaginal and uterine bacterial diversity of healthy giant pandas, we used high-throughput sequence analysis of portions of the 16S rRNA gene, based on samples taken from the vaginas (GPV group) and uteri (GPU group) of these animals. Results showed that the four most abundant phyla, which contained in excess of 98% of the total sequences, were Proteobacteria (59.2% for GPV and 51.4% for GPU), Firmicutes (34.4% for GPV and 23.3% for GPU), Actinobacteria (5.2% for GPV and 14.0% for GPU) and Bacteroidetes (0.3% for GPV and 10.3% for GPU). At the genus level, Escherichia was most abundant (11.0%) in the GPV, followed by Leuconostoc (8.7%), Pseudomonas (8.0%), Acinetobacter (7.3%), Streptococcus (6.3%) and Lactococcus (6.0%). In relation to the uterine samples, Janthinobacterium had the highest prevalence rate (20.2%), followed by Corynebacterium (13.2%), Streptococcus (19.6%), Psychrobacter (9.3%), Escherichia (7.5%) and Bacteroides (6.2%). Moreover, both Chao1 and abundance-based coverage estimator (ACE) species richness indices, which were operating at the same sequencing depth for each sample, demonstrated that GPV had more species richness than GPU, while Simpson and Shannon indices of diversity indicated that GPV had the higher bacterial diversity. These findings contribute to our understanding of the potential influence abnormal reproductive tract microbial communities have on negative pregnancy outcomes in giant pandas. Copyright © 2017 Elsevier GmbH. All rights reserved.
CUDA-based high-performance computing of the S-BPF algorithm with no-waiting pipelining

NASA Astrophysics Data System (ADS)

Deng, Lin; Yan, Bin; Chang, Qingmei; Han, Yu; Zhang, Xiang; Xi, Xiaoqi; Li, Lei

2015-10-01

The backprojection-filtration (BPF) algorithm has become a good solution for local reconstruction in cone-beam computed tomography (CBCT). However, the reconstruction speed of BPF is a severe limitation for clinical applications. The selective-backprojection filtration (S-BPF) algorithm is developed to improve the parallel performance of BPF by selective backprojection. Furthermore, the general-purpose graphics processing unit (GP-GPU) is a popular tool for accelerating the reconstruction. Much work has been performed aiming for the optimization of the cone-beam back-projection. As the cone-beam back-projection process becomes faster, the data transportation holds a much bigger time proportion in the reconstruction than before. This paper focuses on minimizing the total time in the reconstruction with the S-BPF algorithm by hiding the data transportation among hard disk, CPU and GPU. And based on the analysis of the S-BPF algorithm, some strategies are implemented: (1) the asynchronous calls are used to overlap the implemention of CPU and GPU, (2) an innovative strategy is applied to obtain the DBP image to hide the transport time effectively, (3) two streams for data transportation and calculation are synchronized by the cudaEvent in the inverse of finite Hilbert transform on GPU. Our main contribution is a smart reconstruction of the S-BPF algorithm with GPU's continuous calculation and no data transportation time cost. a 5123 volume is reconstructed in less than 0.7 second on a single Tesla-based K20 GPU from 182 views projection with 5122 pixel per projection. The time cost of our implementation is about a half of that without the overlap behavior.
Discovering epistasis in large scale genetic association studies by exploiting graphics cards.

PubMed

Chen, Gary K; Guo, Yunfei

2013-12-03

Despite the enormous investments made in collecting DNA samples and generating germline variation data across thousands of individuals in modern genome-wide association studies (GWAS), progress has been frustratingly slow in explaining much of the heritability in common disease. Today's paradigm of testing independent hypotheses on each single nucleotide polymorphism (SNP) marker is unlikely to adequately reflect the complex biological processes in disease risk. Alternatively, modeling risk as an ensemble of SNPs that act in concert in a pathway, and/or interact non-additively on log risk for example, may be a more sensible way to approach gene mapping in modern studies. Implementing such analyzes genome-wide can quickly become intractable due to the fact that even modest size SNP panels on modern genotype arrays (500k markers) pose a combinatorial nightmare, require tens of billions of models to be tested for evidence of interaction. In this article, we provide an in-depth analysis of programs that have been developed to explicitly overcome these enormous computational barriers through the use of processors on graphics cards known as Graphics Processing Units (GPU). We include tutorials on GPU technology, which will convey why they are growing in appeal with today's numerical scientists. One obvious advantage is the impressive density of microprocessor cores that are available on only a single GPU. Whereas high end servers feature up to 24 Intel or AMD CPU cores, the latest GPU offerings from nVidia feature over 2600 cores. Each compute node may be outfitted with up to 4 GPU devices. Success on GPUs varies across problems. However, epistasis screens fare well due to the high degree of parallelism exposed in these problems. Papers that we review routinely report GPU speedups of over two orders of magnitude (>100x) over standard CPU implementations.
Use of a graphics processing unit (GPU) to facilitate real-time 3D graphic presentation of the patient skin-dose distribution during fluoroscopic interventional procedures.

PubMed

Rana, Vijay; Rudin, Stephen; Bednarek, Daniel R

2012-02-23

We have developed a dose-tracking system (DTS) that calculates the radiation dose to the patient's skin in real-time by acquiring exposure parameters and imaging-system-geometry from the digital bus on a Toshiba Infinix C-arm unit. The cumulative dose values are then displayed as a color map on an OpenGL-based 3D graphic of the patient for immediate feedback to the interventionalist. Determination of those elements on the surface of the patient 3D-graphic that intersect the beam and calculation of the dose for these elements in real time demands fast computation. Reducing the size of the elements results in more computation load on the computer processor and therefore a tradeoff occurs between the resolution of the patient graphic and the real-time performance of the DTS. The speed of the DTS for calculating dose to the skin is limited by the central processing unit (CPU) and can be improved by using the parallel processing power of a graphics processing unit (GPU). Here, we compare the performance speed of GPU-based DTS software to that of the current CPU-based software as a function of the resolution of the patient graphics. Results show a tremendous improvement in speed using the GPU. While an increase in the spatial resolution of the patient graphics resulted in slowing down the computational speed of the DTS on the CPU, the speed of the GPU-based DTS was hardly affected. This GPU-based DTS can be a powerful tool for providing accurate, real-time feedback about patient skin-dose to physicians while performing interventional procedures.
A GPU-Parallelized Eigen-Based Clutter Filter Framework for Ultrasound Color Flow Imaging.

PubMed

Chee, Adrian J Y; Yiu, Billy Y S; Yu, Alfred C H

2017-01-01

Eigen-filters with attenuation response adapted to clutter statistics in color flow imaging (CFI) have shown improved flow detection sensitivity in the presence of tissue motion. Nevertheless, its practical adoption in clinical use is not straightforward due to the high computational cost for solving eigendecompositions. Here, we provide a pedagogical description of how a real-time computing framework for eigen-based clutter filtering can be developed through a single-instruction, multiple data (SIMD) computing approach that can be implemented on a graphical processing unit (GPU). Emphasis is placed on the single-ensemble-based eigen-filtering approach (Hankel singular value decomposition), since it is algorithmically compatible with GPU-based SIMD computing. The key algebraic principles and the corresponding SIMD algorithm are explained, and annotations on how such algorithm can be rationally implemented on the GPU are presented. Real-time efficacy of our framework was experimentally investigated on a single GPU device (GTX Titan X), and the computing throughput for varying scan depths and slow-time ensemble lengths was studied. Using our eigen-processing framework, real-time video-range throughput (24 frames/s) can be attained for CFI frames with full view in azimuth direction (128 scanlines), up to a scan depth of 5 cm ( λ pixel axial spacing) for slow-time ensemble length of 16 samples. The corresponding CFI image frames, with respect to the ones derived from non-adaptive polynomial regression clutter filtering, yielded enhanced flow detection sensitivity in vivo, as demonstrated in a carotid imaging case example. These findings indicate that the GPU-enabled eigen-based clutter filtering can improve CFI flow detection performance in real time.
Accelerated GPU based SPECT Monte Carlo simulations.

PubMed

Garcia, Marie-Paule; Bert, Julien; Benoit, Didier; Bardiès, Manuel; Visvikis, Dimitris

2016-06-07

Monte Carlo (MC) modelling is widely used in the field of single photon emission computed tomography (SPECT) as it is a reliable technique to simulate very high quality scans. This technique provides very accurate modelling of the radiation transport and particle interactions in a heterogeneous medium. Various MC codes exist for nuclear medicine imaging simulations. Recently, new strategies exploiting the computing capabilities of graphical processing units (GPU) have been proposed. This work aims at evaluating the accuracy of such GPU implementation strategies in comparison to standard MC codes in the context of SPECT imaging. GATE was considered the reference MC toolkit and used to evaluate the performance of newly developed GPU Geant4-based Monte Carlo simulation (GGEMS) modules for SPECT imaging. Radioisotopes with different photon energies were used with these various CPU and GPU Geant4-based MC codes in order to assess the best strategy for each configuration. Three different isotopes were considered: (99m) Tc, (111)In and (131)I, using a low energy high resolution (LEHR) collimator, a medium energy general purpose (MEGP) collimator and a high energy general purpose (HEGP) collimator respectively. Point source, uniform source, cylindrical phantom and anthropomorphic phantom acquisitions were simulated using a model of the GE infinia II 3/8" gamma camera. Both simulation platforms yielded a similar system sensitivity and image statistical quality for the various combinations. The overall acceleration factor between GATE and GGEMS platform derived from the same cylindrical phantom acquisition was between 18 and 27 for the different radioisotopes. Besides, a full MC simulation using an anthropomorphic phantom showed the full potential of the GGEMS platform, with a resulting acceleration factor up to 71. The good agreement with reference codes and the acceleration factors obtained support the use of GPU implementation strategies for improving computational efficiency of SPECT imaging simulations.
A comparison of native GPU computing versus OpenACC for implementing flow-routing algorithms in hydrological applications

NASA Astrophysics Data System (ADS)

Rueda, Antonio J.; Noguera, José M.; Luque, Adrián

2016-02-01

In recent years GPU computing has gained wide acceptance as a simple low-cost solution for speeding up computationally expensive processing in many scientific and engineering applications. However, in most cases accelerating a traditional CPU implementation for a GPU is a non-trivial task that requires a thorough refactorization of the code and specific optimizations that depend on the architecture of the device. OpenACC is a promising technology that aims at reducing the effort required to accelerate C/C++/Fortran code on an attached multicore device. Virtually with this technology the CPU code only has to be augmented with a few compiler directives to identify the areas to be accelerated and the way in which data has to be moved between the CPU and GPU. Its potential benefits are multiple: better code readability, less development time, lower risk of errors and less dependency on the underlying architecture and future evolution of the GPU technology. Our aim with this work is to evaluate the pros and cons of using OpenACC against native GPU implementations in computationally expensive hydrological applications, using the classic D8 algorithm of O'Callaghan and Mark for river network extraction as case-study. We implemented the flow accumulation step of this algorithm in CPU, using OpenACC and two different CUDA versions, comparing the length and complexity of the code and its performance with different datasets. We advance that although OpenACC can not match the performance of a CUDA optimized implementation (×3.5 slower in average), it provides a significant performance improvement against a CPU implementation (×2-6) with by far a simpler code and less implementation effort.
Computing the Density Matrix in Electronic Structure Theory on Graphics Processing Units.

PubMed

Cawkwell, M J; Sanville, E J; Mniszewski, S M; Niklasson, Anders M N

2012-11-13

The self-consistent solution of a Schrödinger-like equation for the density matrix is a critical and computationally demanding step in quantum-based models of interatomic bonding. This step was tackled historically via the diagonalization of the Hamiltonian. We have investigated the performance and accuracy of the second-order spectral projection (SP2) algorithm for the computation of the density matrix via a recursive expansion of the Fermi operator in a series of generalized matrix-matrix multiplications. We demonstrate that owing to its simplicity, the SP2 algorithm [Niklasson, A. M. N. Phys. Rev. B2002, 66, 155115] is exceptionally well suited to implementation on graphics processing units (GPUs). The performance in double and single precision arithmetic of a hybrid GPU/central processing unit (CPU) and full GPU implementation of the SP2 algorithm exceed those of a CPU-only implementation of the SP2 algorithm and traditional matrix diagonalization when the dimensions of the matrices exceed about 2000 × 2000. Padding schemes for arrays allocated in the GPU memory that optimize the performance of the CUBLAS implementations of the level 3 BLAS DGEMM and SGEMM subroutines for generalized matrix-matrix multiplications are described in detail. The analysis of the relative performance of the hybrid CPU/GPU and full GPU implementations indicate that the transfer of arrays between the GPU and CPU constitutes only a small fraction of the total computation time. The errors measured in the self-consistent density matrices computed using the SP2 algorithm are generally smaller than those measured in matrices computed via diagonalization. Furthermore, the errors in the density matrices computed using the SP2 algorithm do not exhibit any dependence of system size, whereas the errors increase linearly with the number of orbitals when diagonalization is employed.
Discovering epistasis in large scale genetic association studies by exploiting graphics cards

PubMed Central

Chen, Gary K.; Guo, Yunfei

2013-01-01

Despite the enormous investments made in collecting DNA samples and generating germline variation data across thousands of individuals in modern genome-wide association studies (GWAS), progress has been frustratingly slow in explaining much of the heritability in common disease. Today's paradigm of testing independent hypotheses on each single nucleotide polymorphism (SNP) marker is unlikely to adequately reflect the complex biological processes in disease risk. Alternatively, modeling risk as an ensemble of SNPs that act in concert in a pathway, and/or interact non-additively on log risk for example, may be a more sensible way to approach gene mapping in modern studies. Implementing such analyzes genome-wide can quickly become intractable due to the fact that even modest size SNP panels on modern genotype arrays (500k markers) pose a combinatorial nightmare, require tens of billions of models to be tested for evidence of interaction. In this article, we provide an in-depth analysis of programs that have been developed to explicitly overcome these enormous computational barriers through the use of processors on graphics cards known as Graphics Processing Units (GPU). We include tutorials on GPU technology, which will convey why they are growing in appeal with today's numerical scientists. One obvious advantage is the impressive density of microprocessor cores that are available on only a single GPU. Whereas high end servers feature up to 24 Intel or AMD CPU cores, the latest GPU offerings from nVidia feature over 2600 cores. Each compute node may be outfitted with up to 4 GPU devices. Success on GPUs varies across problems. However, epistasis screens fare well due to the high degree of parallelism exposed in these problems. Papers that we review routinely report GPU speedups of over two orders of magnitude (>100x) over standard CPU implementations. PMID:24348518
Fast Simulation of Dynamic Ultrasound Images Using the GPU.

PubMed

Storve, Sigurd; Torp, Hans

2017-10-01

Simulated ultrasound data is a valuable tool for development and validation of quantitative image analysis methods in echocardiography. Unfortunately, simulation time can become prohibitive for phantoms consisting of a large number of point scatterers. The COLE algorithm by Gao et al. is a fast convolution-based simulator that trades simulation accuracy for improved speed. We present highly efficient parallelized CPU and GPU implementations of the COLE algorithm with an emphasis on dynamic simulations involving moving point scatterers. We argue that it is crucial to minimize the amount of data transfers from the CPU to achieve good performance on the GPU. We achieve this by storing the complete trajectories of the dynamic point scatterers as spline curves in the GPU memory. This leads to good efficiency when simulating sequences consisting of a large number of frames, such as B-mode and tissue Doppler data for a full cardiac cycle. In addition, we propose a phase-based subsample delay technique that efficiently eliminates flickering artifacts seen in B-mode sequences when COLE is used without enough temporal oversampling. To assess the performance, we used a laptop computer and a desktop computer, each equipped with a multicore Intel CPU and an NVIDIA GPU. Running the simulator on a high-end TITAN X GPU, we observed two orders of magnitude speedup compared to the parallel CPU version, three orders of magnitude speedup compared to simulation times reported by Gao et al. in their paper on COLE, and a speedup of 27000 times compared to the multithreaded version of Field II, using numbers reported in a paper by Jensen. We hope that by releasing the simulator as an open-source project we will encourage its use and further development.
Accelerating Smith-Waterman Alignment for Protein Database Search Using Frequency Distance Filtration Scheme Based on CPU-GPU Collaborative System.

PubMed

Liu, Yu; Hong, Yang; Lin, Chun-Yuan; Hung, Che-Lun

2015-01-01

The Smith-Waterman (SW) algorithm has been widely utilized for searching biological sequence databases in bioinformatics. Recently, several works have adopted the graphic card with Graphic Processing Units (GPUs) and their associated CUDA model to enhance the performance of SW computations. However, these works mainly focused on the protein database search by using the intertask parallelization technique, and only using the GPU capability to do the SW computations one by one. Hence, in this paper, we will propose an efficient SW alignment method, called CUDA-SWfr, for the protein database search by using the intratask parallelization technique based on a CPU-GPU collaborative system. Before doing the SW computations on GPU, a procedure is applied on CPU by using the frequency distance filtration scheme (FDFS) to eliminate the unnecessary alignments. The experimental results indicate that CUDA-SWfr runs 9.6 times and 96 times faster than the CPU-based SW method without and with FDFS, respectively.
Symplectic multi-particle tracking on GPUs

NASA Astrophysics Data System (ADS)

Liu, Zhicong; Qiang, Ji

2018-05-01

A symplectic multi-particle tracking model is implemented on the Graphic Processing Units (GPUs) using the Compute Unified Device Architecture (CUDA) language. The symplectic tracking model can preserve phase space structure and reduce non-physical effects in long term simulation, which is important for beam property evaluation in particle accelerators. Though this model is computationally expensive, it is very suitable for parallelization and can be accelerated significantly by using GPUs. In this paper, we optimized the implementation of the symplectic tracking model on both single GPU and multiple GPUs. Using a single GPU processor, the code achieves a factor of 2-10 speedup for a range of problem sizes compared with the time on a single state-of-the-art Central Processing Unit (CPU) node with similar power consumption and semiconductor technology. It also shows good scalability on a multi-GPU cluster at Oak Ridge Leadership Computing Facility. In an application to beam dynamics simulation, the GPU implementation helps save more than a factor of two total computing time in comparison to the CPU implementation.
Particle-in-cell simulations with charge-conserving current deposition on graphic processing units

NASA Astrophysics Data System (ADS)

Ren, Chuang; Kong, Xianglong; Huang, Michael; Decyk, Viktor; Mori, Warren

2011-10-01

Recently using CUDA, we have developed an electromagnetic Particle-in-Cell (PIC) code with charge-conserving current deposition for Nvidia graphic processing units (GPU's) (Kong et al., Journal of Computational Physics 230, 1676 (2011). On a Tesla M2050 (Fermi) card, the GPU PIC code can achieve a one-particle-step process time of 1.2 - 3.2 ns in 2D and 2.3 - 7.2 ns in 3D, depending on plasma temperatures. In this talk we will discuss novel algorithms for GPU-PIC including charge-conserving current deposition scheme with few branching and parallel particle sorting. These algorithms have made efficient use of the GPU shared memory. We will also discuss how to replace the computation kernels of existing parallel CPU codes while keeping their parallel structures. This work was supported by U.S. Department of Energy under Grant Nos. DE-FG02-06ER54879 and DE-FC02-04ER54789 and by NSF under Grant Nos. PHY-0903797 and CCF-0747324.
GPU computing of compressible flow problems by a meshless method with space-filling curves

NASA Astrophysics Data System (ADS)

Ma, Z. H.; Wang, H.; Pu, S. H.

2014-04-01

A graphic processing unit (GPU) implementation of a meshless method for solving compressible flow problems is presented in this paper. Least-square fit is used to discretize the spatial derivatives of Euler equations and an upwind scheme is applied to estimate the flux terms. The compute unified device architecture (CUDA) C programming model is employed to efficiently and flexibly port the meshless solver from CPU to GPU. Considering the data locality of randomly distributed points, space-filling curves are adopted to re-number the points in order to improve the memory performance. Detailed evaluations are firstly carried out to assess the accuracy and conservation property of the underlying numerical method. Then the GPU accelerated flow solver is used to solve external steady flows over aerodynamic configurations. Representative results are validated through extensive comparisons with the experimental, finite volume or other available reference solutions. Performance analysis reveals that the running time cost of simulations is significantly reduced while impressive (more than an order of magnitude) speedups are achieved.
GPU-based streaming architectures for fast cone-beam CT image reconstruction and demons deformable registration.

PubMed

Sharp, G C; Kandasamy, N; Singh, H; Folkert, M

2007-10-07

This paper shows how to significantly accelerate cone-beam CT reconstruction and 3D deformable image registration using the stream-processing model. We describe data-parallel designs for the Feldkamp, Davis and Kress (FDK) reconstruction algorithm, and the demons deformable registration algorithm, suitable for use on a commodity graphics processing unit. The streaming versions of these algorithms are implemented using the Brook programming environment and executed on an NVidia 8800 GPU. Performance results using CT data of a preserved swine lung indicate that the GPU-based implementations of the FDK and demons algorithms achieve a substantial speedup--up to 80 times for FDK and 70 times for demons when compared to an optimized reference implementation on a 2.8 GHz Intel processor. In addition, the accuracy of the GPU-based implementations was found to be excellent. Compared with CPU-based implementations, the RMS differences were less than 0.1 Hounsfield unit for reconstruction and less than 0.1 mm for deformable registration.
Power and Performance Trade-offs for Space Time Adaptive Processing

DOE Office of Scientific and Technical Information (OSTI.GOV)

Gawande, Nitin A.; Manzano Franco, Joseph B.; Tumeo, Antonino

Computational efficiency – performance relative to power or energy – is one of the most important concerns when designing RADAR processing systems. This paper analyzes power and performance trade-offs for a typical Space Time Adaptive Processing (STAP) application. We study STAP implementations for CUDA and OpenMP on two computationally efficient architectures, Intel Haswell Core I7-4770TE and NVIDIA Kayla with a GK208 GPU. We analyze the power and performance of STAP’s computationally intensive kernels across the two hardware testbeds. We also show the impact and trade-offs of GPU optimization techniques. We show that data parallelism can be exploited for efficient implementationmore » on the Haswell CPU architecture. The GPU architecture is able to process large size data sets without increase in power requirement. The use of shared memory has a significant impact on the power requirement for the GPU. A balance between the use of shared memory and main memory access leads to an improved performance in a typical STAP application.« less

The density matrix renormalization group algorithm on kilo-processor architectures: Implementation and trade-offs

NASA Astrophysics Data System (ADS)

Nemes, Csaba; Barcza, Gergely; Nagy, Zoltán; Legeza, Örs; Szolgay, Péter

2014-06-01

In the numerical analysis of strongly correlated quantum lattice models one of the leading algorithms developed to balance the size of the effective Hilbert space and the accuracy of the simulation is the density matrix renormalization group (DMRG) algorithm, in which the run-time is dominated by the iterative diagonalization of the Hamilton operator. As the most time-dominant step of the diagonalization can be expressed as a list of dense matrix operations, the DMRG is an appealing candidate to fully utilize the computing power residing in novel kilo-processor architectures. In the paper a smart hybrid CPU-GPU implementation is presented, which exploits the power of both CPU and GPU and tolerates problems exceeding the GPU memory size. Furthermore, a new CUDA kernel has been designed for asymmetric matrix-vector multiplication to accelerate the rest of the diagonalization. Besides the evaluation of the GPU implementation, the practical limits of an FPGA implementation are also discussed.
GPU-Based Real-Time Volumetric Ultrasound Image Reconstruction for a Ring Array

PubMed Central

Choe, Jung Woo; Nikoozadeh, Amin; Oralkan, Ömer; Khuri-Yakub, Butrus T.

2014-01-01

Synthetic phased array (SPA) beamforming with Hadamard coding and aperture weighting is an optimal option for real-time volumetric imaging with a ring array, a particularly attractive geometry in intracardiac and intravascular applications. However, the imaging frame rate of this method is limited by the immense computational load required in synthetic beamforming. For fast imaging with a ring array, we developed graphics processing unit (GPU)-based, real-time image reconstruction software that exploits massive data-level parallelism in beamforming operations. The GPU-based software reconstructs and displays three cross-sectional images at 45 frames per second (fps). This frame rate is 4.5 times higher than that for our previously-developed multi-core CPU-based software. In an alternative imaging mode, it shows one B-mode image rotating about the axis and its maximum intensity projection (MIP), processed at a rate of 104 fps. This paper describes the image reconstruction procedure on the GPU platform and presents the experimental images obtained using this software. PMID:23529080
Accelerating a three-dimensional eco-hydrological cellular automaton on GPGPU with OpenCL

NASA Astrophysics Data System (ADS)

Senatore, Alfonso; D'Ambrosio, Donato; De Rango, Alessio; Rongo, Rocco; Spataro, William; Straface, Salvatore; Mendicino, Giuseppe

2016-10-01

This work presents an effective implementation of a numerical model for complete eco-hydrological Cellular Automata modeling on Graphical Processing Units (GPU) with OpenCL (Open Computing Language) for heterogeneous computation (i.e., on CPUs and/or GPUs). Different types of parallel implementations were carried out (e.g., use of fast local memory, loop unrolling, etc), showing increasing performance improvements in terms of speedup, adopting also some original optimizations strategies. Moreover, numerical analysis of results (i.e., comparison of CPU and GPU outcomes in terms of rounding errors) have proven to be satisfactory. Experiments were carried out on a workstation with two CPUs (Intel Xeon E5440 at 2.83GHz), one GPU AMD R9 280X and one GPU nVIDIA Tesla K20c. Results have been extremely positive, but further testing should be performed to assess the functionality of the adopted strategies on other complete models and their ability to fruitfully exploit parallel systems resources.
High Performance Computing of Meshless Time Domain Method on Multi-GPU Cluster

NASA Astrophysics Data System (ADS)

Ikuno, Soichiro; Nakata, Susumu; Hirokawa, Yuta; Itoh, Taku

2015-01-01

High performance computing of Meshless Time Domain Method (MTDM) on multi-GPU using the supercomputer HA-PACS (Highly Accelerated Parallel Advanced system for Computational Sciences) at University of Tsukuba is investigated. Generally, the finite difference time domain (FDTD) method is adopted for the numerical simulation of the electromagnetic wave propagation phenomena. However, the numerical domain must be divided into rectangle meshes, and it is difficult to adopt the problem in a complexed domain to the method. On the other hand, MTDM can be easily adept to the problem because MTDM does not requires meshes. In the present study, we implement MTDM on multi-GPU cluster to speedup the method, and numerically investigate the performance of the method on multi-GPU cluster. To reduce the computation time, the communication time between the decomposed domain is hided below the perfect matched layer (PML) calculation procedure. The results of computation show that speedup of MTDM on 128 GPUs is 173 times faster than that of single CPU calculation.
DEM GPU studies of industrial scale particle simulations for granular flow civil engineering applications

NASA Astrophysics Data System (ADS)

Pizette, Patrick; Govender, Nicolin; Wilke, Daniel N.; Abriak, Nor-Edine

2017-06-01

The use of the Discrete Element Method (DEM) for industrial civil engineering industrial applications is currently limited due to the computational demands when large numbers of particles are considered. The graphics processing unit (GPU) with its highly parallelized hardware architecture shows potential to enable solution of civil engineering problems using discrete granular approaches. We demonstrate in this study the pratical utility of a validated GPU-enabled DEM modeling environment to simulate industrial scale granular problems. As illustration, the flow discharge of storage silos using 8 and 17 million particles is considered. DEM simulations have been performed to investigate the influence of particle size (equivalent size for the 20/40-mesh gravel) and induced shear stress for two hopper shapes. The preliminary results indicate that the shape of the hopper significantly influences the discharge rates for the same material. Specifically, this work shows that GPU-enabled DEM modeling environments can model industrial scale problems on a single portable computer within a day for 30 seconds of process time.
Real-Time Compressive Sensing MRI Reconstruction Using GPU Computing and Split Bregman Methods

PubMed Central

Smith, David S.; Gore, John C.; Yankeelov, Thomas E.; Welch, E. Brian

2012-01-01

Compressive sensing (CS) has been shown to enable dramatic acceleration of MRI acquisition in some applications. Being an iterative reconstruction technique, CS MRI reconstructions can be more time-consuming than traditional inverse Fourier reconstruction. We have accelerated our CS MRI reconstruction by factors of up to 27 by using a split Bregman solver combined with a graphics processing unit (GPU) computing platform. The increases in speed we find are similar to those we measure for matrix multiplication on this platform, suggesting that the split Bregman methods parallelize efficiently. We demonstrate that the combination of the rapid convergence of the split Bregman algorithm and the massively parallel strategy of GPU computing can enable real-time CS reconstruction of even acquisition data matrices of dimension 40962 or more, depending on available GPU VRAM. Reconstruction of two-dimensional data matrices of dimension 10242 and smaller took ~0.3 s or less, showing that this platform also provides very fast iterative reconstruction for small-to-moderate size images. PMID:22481908
Stochastic DT-MRI connectivity mapping on the GPU.

PubMed

McGraw, Tim; Nadar, Mariappan

2007-01-01

We present a method for stochastic fiber tract mapping from diffusion tensor MRI (DT-MRI) implemented on graphics hardware. From the simulated fibers we compute a connectivity map that gives an indication of the probability that two points in the dataset are connected by a neuronal fiber path. A Bayesian formulation of the fiber model is given and it is shown that the inversion method can be used to construct plausible connectivity. An implementation of this fiber model on the graphics processing unit (GPU) is presented. Since the fiber paths can be stochastically generated independently of one another, the algorithm is highly parallelizable. This allows us to exploit the data-parallel nature of the GPU fragment processors. We also present a framework for the connectivity computation on the GPU. Our implementation allows the user to interactively select regions of interest and observe the evolving connectivity results during computation. Results are presented from the stochastic generation of over 250,000 fiber steps per iteration at interactive frame rates on consumer-grade graphics hardware.
Real-Time Compressive Sensing MRI Reconstruction Using GPU Computing and Split Bregman Methods.

PubMed

Smith, David S; Gore, John C; Yankeelov, Thomas E; Welch, E Brian

2012-01-01

Compressive sensing (CS) has been shown to enable dramatic acceleration of MRI acquisition in some applications. Being an iterative reconstruction technique, CS MRI reconstructions can be more time-consuming than traditional inverse Fourier reconstruction. We have accelerated our CS MRI reconstruction by factors of up to 27 by using a split Bregman solver combined with a graphics processing unit (GPU) computing platform. The increases in speed we find are similar to those we measure for matrix multiplication on this platform, suggesting that the split Bregman methods parallelize efficiently. We demonstrate that the combination of the rapid convergence of the split Bregman algorithm and the massively parallel strategy of GPU computing can enable real-time CS reconstruction of even acquisition data matrices of dimension 4096(2) or more, depending on available GPU VRAM. Reconstruction of two-dimensional data matrices of dimension 1024(2) and smaller took ~0.3 s or less, showing that this platform also provides very fast iterative reconstruction for small-to-moderate size images.
Efficient Acceleration of the Pair-HMMs Forward Algorithm for GATK HaplotypeCaller on Graphics Processing Units.

PubMed

Ren, Shanshan; Bertels, Koen; Al-Ars, Zaid

2018-01-01

GATK HaplotypeCaller (HC) is a popular variant caller, which is widely used to identify variants in complex genomes. However, due to its high variants detection accuracy, it suffers from long execution time. In GATK HC, the pair-HMMs forward algorithm accounts for a large percentage of the total execution time. This article proposes to accelerate the pair-HMMs forward algorithm on graphics processing units (GPUs) to improve the performance of GATK HC. This article presents several GPU-based implementations of the pair-HMMs forward algorithm. It also analyzes the performance bottlenecks of the implementations on an NVIDIA Tesla K40 card with various data sets. Based on these results and the characteristics of GATK HC, we are able to identify the GPU-based implementations with the highest performance for the various analyzed data sets. Experimental results show that the GPU-based implementations of the pair-HMMs forward algorithm achieve a speedup of up to 5.47× over existing GPU-based implementations.
Multicore and GPU algorithms for Nussinov RNA folding

PubMed Central

2014-01-01

Background One segment of a RNA sequence might be paired with another segment of the same RNA sequence due to the force of hydrogen bonds. This two-dimensional structure is called the RNA sequence's secondary structure. Several algorithms have been proposed to predict an RNA sequence's secondary structure. These algorithms are referred to as RNA folding algorithms. Results We develop cache efficient, multicore, and GPU algorithms for RNA folding using Nussinov's algorithm. Conclusions Our cache efficient algorithm provides a speedup between 1.6 and 3.0 relative to a naive straightforward single core code. The multicore version of the cache efficient single core algorithm provides a speedup, relative to the naive single core algorithm, between 7.5 and 14.0 on a 6 core hyperthreaded CPU. Our GPU algorithm for the NVIDIA C2050 is up to 1582 times as fast as the naive single core algorithm and between 5.1 and 11.2 times as fast as the fastest previously known GPU algorithm for Nussinov RNA folding. PMID:25082539
Methods for compressible fluid simulation on GPUs using high-order finite differences

NASA Astrophysics Data System (ADS)

Pekkilä, Johannes; Väisälä, Miikka S.; Käpylä, Maarit J.; Käpylä, Petri J.; Anjum, Omer

2017-08-01

We focus on implementing and optimizing a sixth-order finite-difference solver for simulating compressible fluids on a GPU using third-order Runge-Kutta integration. Since graphics processing units perform well in data-parallel tasks, this makes them an attractive platform for fluid simulation. However, high-order stencil computation is memory-intensive with respect to both main memory and the caches of the GPU. We present two approaches for simulating compressible fluids using 55-point and 19-point stencils. We seek to reduce the requirements for memory bandwidth and cache size in our methods by using cache blocking and decomposing a latency-bound kernel into several bandwidth-bound kernels. Our fastest implementation is bandwidth-bound and integrates 343 million grid points per second on a Tesla K40t GPU, achieving a 3 . 6 × speedup over a comparable hydrodynamics solver benchmarked on two Intel Xeon E5-2690v3 processors. Our alternative GPU implementation is latency-bound and achieves the rate of 168 million updates per second.
A non-voxel-based broad-beam (NVBB) framework for IMRT treatment planning.

PubMed

Lu, Weiguo

2010-12-07

We present a novel framework that enables very large scale intensity-modulated radiation therapy (IMRT) planning in limited computation resources with improvements in cost, plan quality and planning throughput. Current IMRT optimization uses a voxel-based beamlet superposition (VBS) framework that requires pre-calculation and storage of a large amount of beamlet data, resulting in large temporal and spatial complexity. We developed a non-voxel-based broad-beam (NVBB) framework for IMRT capable of direct treatment parameter optimization (DTPO). In this framework, both objective function and derivative are evaluated based on the continuous viewpoint, abandoning 'voxel' and 'beamlet' representations. Thus pre-calculation and storage of beamlets are no longer needed. The NVBB framework has linear complexities (O(N(3))) in both space and time. The low memory, full computation and data parallelization nature of the framework render its efficient implementation on the graphic processing unit (GPU). We implemented the NVBB framework and incorporated it with the TomoTherapy treatment planning system (TPS). The new TPS runs on a single workstation with one GPU card (NVBB-GPU). Extensive verification/validation tests were performed in house and via third parties. Benchmarks on dose accuracy, plan quality and throughput were compared with the commercial TomoTherapy TPS that is based on the VBS framework and uses a computer cluster with 14 nodes (VBS-cluster). For all tests, the dose accuracy of these two TPSs is comparable (within 1%). Plan qualities were comparable with no clinically significant difference for most cases except that superior target uniformity was seen in the NVBB-GPU for some cases. However, the planning time using the NVBB-GPU was reduced many folds over the VBS-cluster. In conclusion, we developed a novel NVBB framework for IMRT optimization. The continuous viewpoint and DTPO nature of the algorithm eliminate the need for beamlets and lead to better plan quality. The computation parallelization on a GPU instead of a computer cluster significantly reduces hardware and service costs. Compared with using the current VBS framework on a computer cluster, the planning time is significantly reduced using the NVBB framework on a single workstation with a GPU card.
Exploiting on-node heterogeneity for in-situ analytics of climate simulations via a functional partitioning framework

NASA Astrophysics Data System (ADS)

Sapra, Karan; Gupta, Saurabh; Atchley, Scott; Anantharaj, Valentine; Miller, Ross; Vazhkudai, Sudharshan

2016-04-01

Efficient resource utilization is critical for improved end-to-end computing and workflow of scientific applications. Heterogeneous node architectures, such as the GPU-enabled Titan supercomputer at the Oak Ridge Leadership Computing Facility (OLCF), present us with further challenges. In many HPC applications on Titan, the accelerators are the primary compute engines while the CPUs orchestrate the offloading of work onto the accelerators, and moving the output back to the main memory. On the other hand, applications that do not exploit GPUs, the CPU usage is dominant while the GPUs idle. We utilized Heterogenous Functional Partitioning (HFP) runtime framework that can optimize usage of resources on a compute node to expedite an application's end-to-end workflow. This approach is different from existing techniques for in-situ analyses in that it provides a framework for on-the-fly analysis on-node by dynamically exploiting under-utilized resources therein. We have implemented in the Community Earth System Model (CESM) a new concurrent diagnostic processing capability enabled by the HFP framework. Various single variate statistics, such as means and distributions, are computed in-situ by launching HFP tasks on the GPU via the node local HFP daemon. Since our current configuration of CESM does not use GPU resources heavily, we can move these tasks to GPU using the HFP framework. Each rank running the atmospheric model in CESM pushes the variables of of interest via HFP function calls to the HFP daemon. This node local daemon is responsible for receiving the data from main program and launching the designated analytics tasks on the GPU. We have implemented these analytics tasks in C and use OpenACC directives to enable GPU acceleration. This methodology is also advantageous while executing GPU-enabled configurations of CESM when the CPUs will be idle during portions of the runtime. In our implementation results, we demonstrate that it is more efficient to use HFP framework to offload the tasks to GPUs instead of doing it in the main application. We observe increased resource utilization and overall productivity in this approach by using HFP framework for end-to-end workflow.
A Fully GPU-Based Ray-Driven Backprojector via a Ray-Culling Scheme with Voxel-Level Parallelization for Cone-Beam CT Reconstruction.

PubMed

Park, Hyeong-Gyu; Shin, Yeong-Gil; Lee, Ho

2015-12-01

A ray-driven backprojector is based on ray-tracing, which computes the length of the intersection between the ray paths and each voxel to be reconstructed. To reduce the computational burden caused by these exhaustive intersection tests, we propose a fully graphics processing unit (GPU)-based ray-driven backprojector in conjunction with a ray-culling scheme that enables straightforward parallelization without compromising the high computing performance of a GPU. The purpose of the ray-culling scheme is to reduce the number of ray-voxel intersection tests by excluding rays irrelevant to a specific voxel computation. This rejection step is based on an axis-aligned bounding box (AABB) enclosing a region of voxel projection, where eight vertices of each voxel are projected onto the detector plane. The range of the rectangular-shaped AABB is determined by min/max operations on the coordinates in the region. Using the indices of pixels inside the AABB, the rays passing through the voxel can be identified and the voxel is weighted as the length of intersection between the voxel and the ray. This procedure makes it possible to reflect voxel-level parallelization, allowing an independent calculation at each voxel, which is feasible for a GPU implementation. To eliminate redundant calculations during ray-culling, a shared-memory optimization is applied to exploit the GPU memory hierarchy. In experimental results using real measurement data with phantoms, the proposed GPU-based ray-culling scheme reconstructed a volume of resolution 28032803176 in 77 seconds from 680 projections of resolution 10243768 , which is 26 times and 7.5 times faster than standard CPU-based and GPU-based ray-driven backprojectors, respectively. Qualitative and quantitative analyses showed that the ray-driven backprojector provides high-quality reconstruction images when compared with those generated by the Feldkamp-Davis-Kress algorithm using a pixel-driven backprojector, with an average of 2.5 times higher contrast-to-noise ratio, 1.04 times higher universal quality index, and 1.39 times higher normalized mutual information. © The Author(s) 2014.
Graphics Processing Unit-Accelerated Nonrigid Registration of MR Images to CT Images During CT-Guided Percutaneous Liver Tumor Ablations.

PubMed

Tokuda, Junichi; Plishker, William; Torabi, Meysam; Olubiyi, Olutayo I; Zaki, George; Tatli, Servet; Silverman, Stuart G; Shekher, Raj; Hata, Nobuhiko

2015-06-01

Accuracy and speed are essential for the intraprocedural nonrigid magnetic resonance (MR) to computed tomography (CT) image registration in the assessment of tumor margins during CT-guided liver tumor ablations. Although both accuracy and speed can be improved by limiting the registration to a region of interest (ROI), manual contouring of the ROI prolongs the registration process substantially. To achieve accurate and fast registration without the use of an ROI, we combined a nonrigid registration technique on the basis of volume subdivision with hardware acceleration using a graphics processing unit (GPU). We compared the registration accuracy and processing time of GPU-accelerated volume subdivision-based nonrigid registration technique to the conventional nonrigid B-spline registration technique. Fourteen image data sets of preprocedural MR and intraprocedural CT images for percutaneous CT-guided liver tumor ablations were obtained. Each set of images was registered using the GPU-accelerated volume subdivision technique and the B-spline technique. Manual contouring of ROI was used only for the B-spline technique. Registration accuracies (Dice similarity coefficient [DSC] and 95% Hausdorff distance [HD]) and total processing time including contouring of ROIs and computation were compared using a paired Student t test. Accuracies of the GPU-accelerated registrations and B-spline registrations, respectively, were 88.3 ± 3.7% versus 89.3 ± 4.9% (P = .41) for DSC and 13.1 ± 5.2 versus 11.4 ± 6.3 mm (P = .15) for HD. Total processing time of the GPU-accelerated registration and B-spline registration techniques was 88 ± 14 versus 557 ± 116 seconds (P < .000000002), respectively; there was no significant difference in computation time despite the difference in the complexity of the algorithms (P = .71). The GPU-accelerated volume subdivision technique was as accurate as the B-spline technique and required significantly less processing time. The GPU-accelerated volume subdivision technique may enable the implementation of nonrigid registration into routine clinical practice. Copyright © 2015 AUR. Published by Elsevier Inc. All rights reserved.
CMSA: a heterogeneous CPU/GPU computing system for multiple similar RNA/DNA sequence alignment.

PubMed

Chen, Xi; Wang, Chen; Tang, Shanjiang; Yu, Ce; Zou, Quan

2017-06-24

The multiple sequence alignment (MSA) is a classic and powerful technique for sequence analysis in bioinformatics. With the rapid growth of biological datasets, MSA parallelization becomes necessary to keep its running time in an acceptable level. Although there are a lot of work on MSA problems, their approaches are either insufficient or contain some implicit assumptions that limit the generality of usage. First, the information of users' sequences, including the sizes of datasets and the lengths of sequences, can be of arbitrary values and are generally unknown before submitted, which are unfortunately ignored by previous work. Second, the center star strategy is suited for aligning similar sequences. But its first stage, center sequence selection, is highly time-consuming and requires further optimization. Moreover, given the heterogeneous CPU/GPU platform, prior studies consider the MSA parallelization on GPU devices only, making the CPUs idle during the computation. Co-run computation, however, can maximize the utilization of the computing resources by enabling the workload computation on both CPU and GPU simultaneously. This paper presents CMSA, a robust and efficient MSA system for large-scale datasets on the heterogeneous CPU/GPU platform. It performs and optimizes multiple sequence alignment automatically for users' submitted sequences without any assumptions. CMSA adopts the co-run computation model so that both CPU and GPU devices are fully utilized. Moreover, CMSA proposes an improved center star strategy that reduces the time complexity of its center sequence selection process from O(mn 2 ) to O(mn). The experimental results show that CMSA achieves an up to 11× speedup and outperforms the state-of-the-art software. CMSA focuses on the multiple similar RNA/DNA sequence alignment and proposes a novel bitmap based algorithm to improve the center star strategy. We can conclude that harvesting the high performance of modern GPU is a promising approach to accelerate multiple sequence alignment. Besides, adopting the co-run computation model can maximize the entire system utilization significantly. The source code is available at https://github.com/wangvsa/CMSA .
CPU-GPU mixed implementation of virtual node method for real-time interactive cutting of deformable objects using OpenCL.

PubMed

Jia, Shiyu; Zhang, Weizhong; Yu, Xiaokang; Pan, Zhenkuan

2015-09-01

Surgical simulators need to simulate interactive cutting of deformable objects in real time. The goal of this work was to design an interactive cutting algorithm that eliminates traditional cutting state classification and can work simultaneously with real-time GPU-accelerated deformation without affecting its numerical stability. A modified virtual node method for cutting is proposed. Deformable object is modeled as a real tetrahedral mesh embedded in a virtual tetrahedral mesh, and the former is used for graphics rendering and collision, while the latter is used for deformation. Cutting algorithm first subdivides real tetrahedrons to eliminate all face and edge intersections, then splits faces, edges and vertices along cutting tool trajectory to form cut surfaces. Next virtual tetrahedrons containing more than one connected real tetrahedral fragments are duplicated, and connectivity between virtual tetrahedrons is updated. Finally, embedding relationship between real and virtual tetrahedral meshes is updated. Co-rotational linear finite element method is used for deformation. Cutting and collision are processed by CPU, while deformation is carried out by GPU using OpenCL. Efficiency of GPU-accelerated deformation algorithm was tested using block models with varying numbers of tetrahedrons. Effectiveness of our cutting algorithm under multiple cuts and self-intersecting cuts was tested using a block model and a cylinder model. Cutting of a more complex liver model was performed, and detailed performance characteristics of cutting, deformation and collision were measured and analyzed. Our cutting algorithm can produce continuous cut surfaces when traditional minimal element creation algorithm fails. Our GPU-accelerated deformation algorithm remains stable with constant time step under multiple arbitrary cuts and works on both NVIDIA and AMD GPUs. GPU-CPU speed ratio can be as high as 10 for models with 80,000 tetrahedrons. Forty to sixty percent real-time performance and 100-200 Hz simulation rate are achieved for the liver model with 3,101 tetrahedrons. Major bottlenecks for simulation efficiency are cutting, collision processing and CPU-GPU data transfer. Future work needs to improve on these areas.
Fast quantum Monte Carlo on a GPU

NASA Astrophysics Data System (ADS)

Lutsyshyn, Y.

2015-02-01

We present a scheme for the parallelization of quantum Monte Carlo method on graphical processing units, focusing on variational Monte Carlo simulation of bosonic systems. We use asynchronous execution schemes with shared memory persistence, and obtain an excellent utilization of the accelerator. The CUDA code is provided along with a package that simulates liquid helium-4. The program was benchmarked on several models of Nvidia GPU, including Fermi GTX560 and M2090, and the Kepler architecture K20 GPU. Special optimization was developed for the Kepler cards, including placement of data structures in the register space of the Kepler GPUs. Kepler-specific optimization is discussed.
GPU based cloud system for high-performance arrhythmia detection with parallel k-NN algorithm.

PubMed

Tae Joon Jun; Hyun Ji Park; Hyuk Yoo; Young-Hak Kim; Daeyoung Kim

2016-08-01

In this paper, we propose an GPU based Cloud system for high-performance arrhythmia detection. Pan-Tompkins algorithm is used for QRS detection and we optimized beat classification algorithm with K-Nearest Neighbor (K-NN). To support high performance beat classification on the system, we parallelized beat classification algorithm with CUDA to execute the algorithm on virtualized GPU devices on the Cloud system. MIT-BIH Arrhythmia database is used for validation of the algorithm. The system achieved about 93.5% of detection rate which is comparable to previous researches while our algorithm shows 2.5 times faster execution time compared to CPU only detection algorithm.
Determinant Computation on the GPU using the Condensation Method

NASA Astrophysics Data System (ADS)

Anisul Haque, Sardar; Moreno Maza, Marc

2012-02-01

We report on a GPU implementation of the condensation method designed by Abdelmalek Salem and Kouachi Said for computing the determinant of a matrix. We consider two types of coefficients: modular integers and floating point numbers. We evaluate the performance of our code by measuring its effective bandwidth and argue that it is numerical stable in the floating point number case. In addition, we compare our code with serial implementation of determinant computation from well-known mathematical packages. Our results suggest that a GPU implementation of the condensation method has a large potential for improving those packages in terms of running time and numerical stability.

Graphics Processing Units (GPU) and the Goddard Earth Observing System atmospheric model (GEOS-5): Implementation and Potential Applications

NASA Technical Reports Server (NTRS)

Putnam, William M.

2011-01-01

Earth system models like the Goddard Earth Observing System model (GEOS-5) have been pushing the limits of large clusters of multi-core microprocessors, producing breath-taking fidelity in resolving cloud systems at a global scale. GPU computing presents an opportunity for improving the efficiency of these leading edge models. A GPU implementation of GEOS-5 will facilitate the use of cloud-system resolving resolutions in data assimilation and weather prediction, at resolutions near 3.5 km, improving our ability to extract detailed information from high-resolution satellite observations and ultimately produce better weather and climate predictions
A survey of GPU-based acceleration techniques in MRI reconstructions

PubMed Central

Wang, Haifeng; Peng, Hanchuan; Chang, Yuchou

2018-01-01

Image reconstruction in magnetic resonance imaging (MRI) clinical applications has become increasingly more complicated. However, diagnostic and treatment require very fast computational procedure. Modern competitive platforms of graphics processing unit (GPU) have been used to make high-performance parallel computations available, and attractive to common consumers for computing massively parallel reconstruction problems at commodity price. GPUs have also become more and more important for reconstruction computations, especially when deep learning starts to be applied into MRI reconstruction. The motivation of this survey is to review the image reconstruction schemes of GPU computing for MRI applications and provide a summary reference for researchers in MRI community. PMID:29675361
A survey of GPU-based acceleration techniques in MRI reconstructions.

PubMed

Wang, Haifeng; Peng, Hanchuan; Chang, Yuchou; Liang, Dong

2018-03-01

Image reconstruction in magnetic resonance imaging (MRI) clinical applications has become increasingly more complicated. However, diagnostic and treatment require very fast computational procedure. Modern competitive platforms of graphics processing unit (GPU) have been used to make high-performance parallel computations available, and attractive to common consumers for computing massively parallel reconstruction problems at commodity price. GPUs have also become more and more important for reconstruction computations, especially when deep learning starts to be applied into MRI reconstruction. The motivation of this survey is to review the image reconstruction schemes of GPU computing for MRI applications and provide a summary reference for researchers in MRI community.
Graphics Processing Units for HEP trigger systems

NASA Astrophysics Data System (ADS)

Ammendola, R.; Bauce, M.; Biagioni, A.; Chiozzi, S.; Cotta Ramusino, A.; Fantechi, R.; Fiorini, M.; Giagu, S.; Gianoli, A.; Lamanna, G.; Lonardo, A.; Messina, A.; Neri, I.; Paolucci, P. S.; Piandani, R.; Pontisso, L.; Rescigno, M.; Simula, F.; Sozzi, M.; Vicini, P.

2016-07-01

General-purpose computing on GPUs (Graphics Processing Units) is emerging as a new paradigm in several fields of science, although so far applications have been tailored to the specific strengths of such devices as accelerator in offline computation. With the steady reduction of GPU latencies, and the increase in link and memory throughput, the use of such devices for real-time applications in high-energy physics data acquisition and trigger systems is becoming ripe. We will discuss the use of online parallel computing on GPU for synchronous low level trigger, focusing on CERN NA62 experiment trigger system. The use of GPU in higher level trigger system is also briefly considered.
Inversion of Heavy Current Electroheat Problems on a Graphics Processing Unit (GPU)

DTIC Science & Technology

2013-11-10

Unit (GPU) Victor U. Karthik 1 , Sivamayam Sivasuthan 1, Arunasalam Rahunanthan 2 , Ravi S. Thyagarajan 3 and Paramsothy Jayakumar 3 , Lalita...Victor Karthik; Sivamayam Sivasuthan; Arunasalam Rahunanthan; Ravi Thyagarajan; Paramsothy Jayakumar 5d. PROJECT NUMBER 5e. TASK NUMBER 5f. WORK
Interactive Light Stimulus Generation with High Performance Real-Time Image Processing and Simple Scripting

PubMed Central

Szécsi, László; Kacsó, Ágota; Zeck, Günther; Hantz, Péter

2017-01-01

Light stimulation with precise and complex spatial and temporal modulation is demanded by a series of research fields like visual neuroscience, optogenetics, ophthalmology, and visual psychophysics. We developed a user-friendly and flexible stimulus generating framework (GEARS GPU-based Eye And Retina Stimulation Software), which offers access to GPU computing power, and allows interactive modification of stimulus parameters during experiments. Furthermore, it has built-in support for driving external equipment, as well as for synchronization tasks, via USB ports. The use of GEARS does not require elaborate programming skills. The necessary scripting is visually aided by an intuitive interface, while the details of the underlying software and hardware components remain hidden. Internally, the software is a C++/Python hybrid using OpenGL graphics. Computations are performed on the GPU, and are defined in the GLSL shading language. However, all GPU settings, including the GPU shader programs, are automatically generated by GEARS. This is configured through a method encountered in game programming, which allows high flexibility: stimuli are straightforwardly composed using a broad library of basic components. Stimulus rendering is implemented solely in C++, therefore intermediary libraries for interfacing could be omitted. This enables the program to perform computationally demanding tasks like en-masse random number generation or real-time image processing by local and global operations. PMID:29326579
Graphics Processing Unit (GPU) implementation of image processing algorithms to improve system performance of the Control, Acquisition, Processing, and Image Display System (CAPIDS) of the Micro-Angiographic Fluoroscope (MAF).

PubMed

Vasan, S N Swetadri; Ionita, Ciprian N; Titus, A H; Cartwright, A N; Bednarek, D R; Rudin, S

2012-02-23

We present the image processing upgrades implemented on a Graphics Processing Unit (GPU) in the Control, Acquisition, Processing, and Image Display System (CAPIDS) for the custom Micro-Angiographic Fluoroscope (MAF) detector. Most of the image processing currently implemented in the CAPIDS system is pixel independent; that is, the operation on each pixel is the same and the operation on one does not depend upon the result from the operation on the other, allowing the entire image to be processed in parallel. GPU hardware was developed for this kind of massive parallel processing implementation. Thus for an algorithm which has a high amount of parallelism, a GPU implementation is much faster than a CPU implementation. The image processing algorithm upgrades implemented on the CAPIDS system include flat field correction, temporal filtering, image subtraction, roadmap mask generation and display window and leveling. A comparison between the previous and the upgraded version of CAPIDS has been presented, to demonstrate how the improvement is achieved. By performing the image processing on a GPU, significant improvements (with respect to timing or frame rate) have been achieved, including stable operation of the system at 30 fps during a fluoroscopy run, a DSA run, a roadmap procedure and automatic image windowing and leveling during each frame.
Graphics Processing Unit (GPU) Acceleration of the Goddard Earth Observing System Atmospheric Model

NASA Technical Reports Server (NTRS)

Putnam, Williama

2011-01-01

The Goddard Earth Observing System 5 (GEOS-5) is the atmospheric model used by the Global Modeling and Assimilation Office (GMAO) for a variety of applications, from long-term climate prediction at relatively coarse resolution, to data assimilation and numerical weather prediction, to very high-resolution cloud-resolving simulations. GEOS-5 is being ported to a graphics processing unit (GPU) cluster at the NASA Center for Climate Simulation (NCCS). By utilizing GPU co-processor technology, we expect to increase the throughput of GEOS-5 by at least an order of magnitude, and accelerate the process of scientific exploration across all scales of global modeling, including: The large-scale, high-end application of non-hydrostatic, global, cloud-resolving modeling at 10- to I-kilometer (km) global resolutions Intermediate-resolution seasonal climate and weather prediction at 50- to 25-km on small clusters of GPUs Long-range, coarse-resolution climate modeling, enabled on a small box of GPUs for the individual researcher After being ported to the GPU cluster, the primary physics components and the dynamical core of GEOS-5 have demonstrated a potential speedup of 15-40 times over conventional processor cores. Performance improvements of this magnitude reduce the required scalability of 1-km, global, cloud-resolving models from an unfathomable 6 million cores to an attainable 200,000 GPU-enabled cores.
Fast 2D flood modelling using GPU technology - recent applications and new developments

NASA Astrophysics Data System (ADS)

Crossley, Amanda; Lamb, Rob; Waller, Simon; Dunning, Paul

2010-05-01

In recent years there has been considerable interest amongst scientists and engineers in exploiting the potential of commodity graphics hardware for desktop parallel computing. The Graphics Processing Units (GPUs) that are used in PC graphics cards have now evolved into powerful parallel co-processors that can be used to accelerate the numerical codes used for floodplain inundation modelling. We report in this paper on experience over the past two years in developing and applying two dimensional (2D) flood inundation models using GPUs to achieve significant practical performance benefits. Starting with a solution scheme for the 2D diffusion wave approximation to the 2D Shallow Water Equations (SWEs), we have demonstrated the capability to reduce model run times in ‘real-world' applications using GPU hardware and programming techniques. We then present results from a GPU-based 2D finite volume SWE solver. A series of numerical test cases demonstrate that the model produces outputs that are accurate and consistent with reference results published elsewhere. In comparisons conducted for a real world test case, the GPU-based SWE model was over 100 times faster than the CPU version. We conclude with some discussion of practical experience in using the GPU technology for flood mapping applications, and for research projects investigating use of Monte Carlo simulation methods for the analysis of uncertainty in 2D flood modelling.
Benchmarking and Evaluating Unified Memory for OpenMP GPU Offloading

DOE Office of Scientific and Technical Information (OSTI.GOV)

Mishra, Alok; Li, Lingda; Kong, Martin

Here, the latest OpenMP standard offers automatic device offloading capabilities which facilitate GPU programming. Despite this, there remain many challenges. One of these is the unified memory feature introduced in recent GPUs. GPUs in current and future HPC systems have enhanced support for unified memory space. In such systems, CPU and GPU can access each other's memory transparently, that is, the data movement is managed automatically by the underlying system software and hardware. Memory over subscription is also possible in these systems. However, there is a significant lack of knowledge about how this mechanism will perform, and how programmers shouldmore » use it. We have modified several benchmarks codes, in the Rodinia benchmark suite, to study the behavior of OpenMP accelerator extensions and have used them to explore the impact of unified memory in an OpenMP context. We moreover modified the open source LLVM compiler to allow OpenMP programs to exploit unified memory. The results of our evaluation reveal that, while the performance of unified memory is comparable with that of normal GPU offloading for benchmarks with little data reuse, it suffers from significant overhead when GPU memory is over subcribed for benchmarks with large amount of data reuse. Based on these results, we provide several guidelines for programmers to achieve better performance with unified memory.« less
Using Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data

NASA Astrophysics Data System (ADS)

O'Connor, A. S.; Justice, B.; Harris, A. T.

2013-12-01

Graphics Processing Units (GPUs) are high-performance multiple-core processors capable of very high computational speeds and large data throughput. Modern GPUs are inexpensive and widely available commercially. These are general-purpose parallel processors with support for a variety of programming interfaces, including industry standard languages such as C. GPU implementations of algorithms that are well suited for parallel processing can often achieve speedups of several orders of magnitude over optimized CPU codes. Significant improvements in speeds for imagery orthorectification, atmospheric correction, target detection and image transformations like Independent Components Analsyis (ICA) have been achieved using GPU-based implementations. Additional optimizations, when factored in with GPU processing capabilities, can provide 50x - 100x reduction in the time required to process large imagery. Exelis Visual Information Solutions (VIS) has implemented a CUDA based GPU processing frame work for accelerating ENVI and IDL processes that can best take advantage of parallelization. Testing Exelis VIS has performed shows that orthorectification can take as long as two hours with a WorldView1 35,0000 x 35,000 pixel image. With GPU orthorecification, the same orthorectification process takes three minutes. By speeding up image processing, imagery can successfully be used by first responders, scientists making rapid discoveries with near real time data, and provides an operational component to data centers needing to quickly process and disseminate data.
GPU Linear Algebra Libraries and GPGPU Programming for Accelerating MOPAC Semiempirical Quantum Chemistry Calculations.

PubMed

Maia, Julio Daniel Carvalho; Urquiza Carvalho, Gabriel Aires; Mangueira, Carlos Peixoto; Santana, Sidney Ramos; Cabral, Lucidio Anjos Formiga; Rocha, Gerd B

2012-09-11

In this study, we present some modifications in the semiempirical quantum chemistry MOPAC2009 code that accelerate single-point energy calculations (1SCF) of medium-size (up to 2500 atoms) molecular systems using GPU coprocessors and multithreaded shared-memory CPUs. Our modifications consisted of using a combination of highly optimized linear algebra libraries for both CPU (LAPACK and BLAS from Intel MKL) and GPU (MAGMA and CUBLAS) to hasten time-consuming parts of MOPAC such as the pseudodiagonalization, full diagonalization, and density matrix assembling. We have shown that it is possible to obtain large speedups just by using CPU serial linear algebra libraries in the MOPAC code. As a special case, we show a speedup of up to 14 times for a methanol simulation box containing 2400 atoms and 4800 basis functions, with even greater gains in performance when using multithreaded CPUs (2.1 times in relation to the single-threaded CPU code using linear algebra libraries) and GPUs (3.8 times). This degree of acceleration opens new perspectives for modeling larger structures which appear in inorganic chemistry (such as zeolites and MOFs), biochemistry (such as polysaccharides, small proteins, and DNA fragments), and materials science (such as nanotubes and fullerenes). In addition, we believe that this parallel (GPU-GPU) MOPAC code will make it feasible to use semiempirical methods in lengthy molecular simulations using both hybrid QM/MM and QM/QM potentials.
Performance Testing of GPU-Based Approximate Matching Algorithm on Network Traffic

DTIC Science & Technology

2015-03-01

Defense Department’s use. vi THIS PAGE INTENTIONALLY LEFT BLANK vii TABLE OF CONTENTS I. INTRODUCTION...22 D. GENERATING DIGESTS ............................................................................23 1. Reference...the-shelf GPU Graphical Processing Unit GPGPU General -Purpose Graphic Processing Unit HBSS Host-Based Security System HIPS Host Intrusion
Smooth Particle Hydrodynamics for Surf Zone Waves

DTIC Science & Technology

2009-01-01

2010.) The GPU-SPHysics code, initiated by Dr. Alexis Hérault at the Istituto Nazionale di Geofisica e Vulcanologia in Sicily, has been applied to... Geofisica e Vulcanologia, sezione di Catania, for the development of GPU-SPHysics. Drs. Hérault and Bilotta were in residence at JHU during January of
Considerations for GPU SEE Testing

NASA Technical Reports Server (NTRS)

Wyrwas, Edward J.

2017-01-01

This presentation will discuss the considerations an engineer should take to perform Single Event Effects (SEE) testing on GPU devices. Notable topics will include setup complexity, architecture insight which permits cross platform normalization, acquiring a reasonable detail of information from the test suite, and a few lessons learned from preliminary testing.
Intraoperative near-infrared fluorescent imaging during robotic operations.

PubMed

Macedo, Antonio Luiz de Vasconcellos; Schraibman, Vladimir

2016-01-01

The intraoperative identification of certain anatomical structures because they are small or visually occult may be challenging. The development of minimally invasive surgery brought additional difficulties to identify these structures due to the lack of complete tactile sensitivity. A number of different forms of intraoperative mapping have been tried. Recently, the near-infrared fluorescence imaging technology with indocyanine green has been added to robotic platforms. In addition, this technology has been tested in several types of operations, and has advantages such as safety, low cost and good results. Disadvantages are linked to contrast distribution in certain clinical scenarios. The intraoperative near-infrared fluorescent imaging is new and promising addition to robotic surgery. Several reports show the utility of this technology in several different procedures. The ideal dose, time and site for dye injection are not well defined. No high quality evidence-based comparative studies and long-term follow-up outcomes have been published so far. Initial results, however, are good and safe. RESUMO A identificação intraoperatória de certas estruturas anatômicas, por seu tamanho ou por elas serem ocultas à visão, pode ser desafiadora. O desenvolvimento da cirurgia minimamente invasiva trouxe dificuldades adicionais, pela falta da sensibilidade tátil completa. Diversas formas de detecção intraoperatória destas estruturas têm sido tentadas. Recentemente, a tecnologia de fluorescência infravermelha com verde de indocianina foi associada às plataformas robóticas. Além disso, essa tecnologia tem sido testada em uma variedade de cirurgias, e suas vantagens parecem estar ligadas a baixo custo, segurança e bons resultados. As desvantagens estão associadas à má distribuição do contraste em determinados cenários. A imagem intraoperatória por fluorescência infravermelha é uma nova e promissora adição à cirurgia robótica. Diversas séries mostram a utilidade da tecnologia em diferentes procedimentos. Dose ideal, local e tempo da injeção do corante ainda não estão bem estabelecidos. Estudos comparativos de alta qualidade epidemiológica baseados em evidência ainda não estão disponíveis. No entanto, os resultados iniciais são bons e seguros.
GPU accelerated real-time confocal fluorescence lifetime imaging microscopy (FLIM) based on the analog mean-delay (AMD) method

PubMed Central

Kim, Byungyeon; Park, Byungjun; Lee, Seungrag; Won, Youngjae

2016-01-01

We demonstrated GPU accelerated real-time confocal fluorescence lifetime imaging microscopy (FLIM) based on the analog mean-delay (AMD) method. Our algorithm was verified for various fluorescence lifetimes and photon numbers. The GPU processing time was faster than the physical scanning time for images up to 800 × 800, and more than 149 times faster than a single core CPU. The frame rate of our system was demonstrated to be 13 fps for a 200 × 200 pixel image when observing maize vascular tissue. This system can be utilized for observing dynamic biological reactions, medical diagnosis, and real-time industrial inspection. PMID:28018724
Blind detection of giant pulses: GPU implementation

NASA Astrophysics Data System (ADS)

Ait-Allal, Dalal; Weber, Rodolphe; Dumez-Viou, Cédric; Cognard, Ismael; Theureau, Gilles

2012-01-01

Radio astronomical pulsar observations require specific instrumentation and dedicated signal processing to cope with the dispersion caused by the interstellar medium. Moreover, the quality of observations can be limited by radio frequency interference (RFI) generated by Telecommunications activity. This article presents the innovative pulsar instrumentation based on graphical processing units (GPU) which has been designed at the Nançay Radio Astronomical Observatory. In addition, for giant pulsar search, we propose a new approach which combines a hardware-efficient search method and some RFI mitigation capabilities. Although this approach is less sensitive than the classical approach, its advantage is that no a priori information on the pulsar parameters is required. The validation of a GPU implementation is under way.
NaNet-10: a 10GbE network interface card for the GPU-based low-level trigger of the NA62 RICH detector.

NASA Astrophysics Data System (ADS)

Ammendola, R.; Biagioni, A.; Fiorini, M.; Frezza, O.; Lonardo, A.; Lamanna, G.; Lo Cicero, F.; Martinelli, M.; Neri, I.; Paolucci, P. S.; Pastorelli, E.; Piandani, R.; Pontisso, L.; Rossetti, D.; Simula, F.; Sozzi, M.; Tosoratto, L.; Vicini, P.

2016-03-01

A GPU-based low level (L0) trigger is currently integrated in the experimental setup of the RICH detector of the NA62 experiment to assess the feasibility of building more refined physics-related trigger primitives and thus improve the trigger discriminating power. To ensure the real-time operation of the system, a dedicated data transport mechanism has been implemented: an FPGA-based Network Interface Card (NaNet-10) receives data from detectors and forwards them with low, predictable latency to the memory of the GPU performing the trigger algorithms. Results of the ring-shaped hit patterns reconstruction will be reported and discussed.
Convolution of large 3D images on GPU and its decomposition

NASA Astrophysics Data System (ADS)

Karas, Pavel; Svoboda, David

2011-12-01

In this article, we propose a method for computing convolution of large 3D images. The convolution is performed in a frequency domain using a convolution theorem. The algorithm is accelerated on a graphic card by means of the CUDA parallel computing model. Convolution is decomposed in a frequency domain using the decimation in frequency algorithm. We pay attention to keeping our approach efficient in terms of both time and memory consumption and also in terms of memory transfers between CPU and GPU which have a significant inuence on overall computational time. We also study the implementation on multiple GPUs and compare the results between the multi-GPU and multi-CPU implementations.

The gputools package enables GPU computing in R.

PubMed

Buckner, Joshua; Wilson, Justin; Seligman, Mark; Athey, Brian; Watson, Stanley; Meng, Fan

2010-01-01

By default, the R statistical environment does not make use of parallelism. Researchers may resort to expensive solutions such as cluster hardware for large analysis tasks. Graphics processing units (GPUs) provide an inexpensive and computationally powerful alternative. Using R and the CUDA toolkit from Nvidia, we have implemented several functions commonly used in microarray gene expression analysis for GPU-equipped computers. R users can take advantage of the better performance provided by an Nvidia GPU. The package is available from CRAN, the R project's repository of packages, at http://cran.r-project.org/web/packages/gputools More information about our gputools R package is available at http://brainarray.mbni.med.umich.edu/brainarray/Rgpgpu
Training Evaluation as an Integral Component of Training for Performance.

ERIC Educational Resources Information Center

Lapp, H. J., Jr.

A training evaluation system should address four major areas: reaction, learning, behavior, and results. The training evaluation system at GPU Nuclear Corporation addresses each of these areas through practical approaches such as course and program evaluation. GPU's program evaluation instrument uses a Likert-type scale to assess task development,…
Performance Analysis of GFDL's GCM Line-By-Line Radiative Transfer Model on GPU and MIC Architectures

NASA Astrophysics Data System (ADS)

Menzel, R.; Paynter, D.; Jones, A. L.

2017-12-01

Due to their relatively low computational cost, radiative transfer models in global climate models (GCMs) run on traditional CPU architectures generally consist of shortwave and longwave parameterizations over a small number of wavelength bands. With the rise of newer GPU and MIC architectures, however, the performance of high resolution line-by-line radiative transfer models may soon approach those of the physical parameterizations currently employed in GCMs. Here we present an analysis of the current performance of a new line-by-line radiative transfer model currently under development at GFDL. Although originally designed to specifically exploit GPU architectures through the use of CUDA, the radiative transfer model has recently been extended to include OpenMP in an effort to also effectively target MIC architectures such as Intel's Xeon Phi. Using input data provided by the upcoming Radiative Forcing Model Intercomparison Project (RFMIP, as part of CMIP 6), we compare model results and performance data for various model configurations and spectral resolutions run on both GPU and Intel Knights Landing architectures to analogous runs of the standard Oxford Reference Forward Model on traditional CPUs.
GPU-accelerated low-latency real-time searches for gravitational waves from compact binary coalescence

NASA Astrophysics Data System (ADS)

Liu, Yuan; Du, Zhihui; Chung, Shin Kee; Hooper, Shaun; Blair, David; Wen, Linqing

2012-12-01

We present a graphics processing unit (GPU)-accelerated time-domain low-latency algorithm to search for gravitational waves (GWs) from coalescing binaries of compact objects based on the summed parallel infinite impulse response (SPIIR) filtering technique. The aim is to facilitate fast detection of GWs with a minimum delay to allow prompt electromagnetic follow-up observations. To maximize the GPU acceleration, we apply an efficient batched parallel computing model that significantly reduces the number of synchronizations in SPIIR and optimizes the usage of the memory and hardware resource. Our code is tested on the CUDA ‘Fermi’ architecture in a GTX 480 graphics card and its performance is compared with a single core of Intel Core i7 920 (2.67 GHz). A 58-fold speedup is achieved while giving results in close agreement with the CPU implementation. Our result indicates that it is possible to conduct a full search for GWs from compact binary coalescence in real time with only one desktop computer equipped with a Fermi GPU card for the initial LIGO detectors which in the past required more than 100 CPUs.
Accelerating Monte Carlo simulations of photon transport in a voxelized geometry using a massively parallel graphics processing unit.

PubMed

Badal, Andreu; Badano, Aldo

2009-11-01

It is a known fact that Monte Carlo simulations of radiation transport are computationally intensive and may require long computing times. The authors introduce a new paradigm for the acceleration of Monte Carlo simulations: The use of a graphics processing unit (GPU) as the main computing device instead of a central processing unit (CPU). A GPU-based Monte Carlo code that simulates photon transport in a voxelized geometry with the accurate physics models from PENELOPE has been developed using the CUDATM programming model (NVIDIA Corporation, Santa Clara, CA). An outline of the new code and a sample x-ray imaging simulation with an anthropomorphic phantom are presented. A remarkable 27-fold speed up factor was obtained using a GPU compared to a single core CPU. The reported results show that GPUs are currently a good alternative to CPUs for the simulation of radiation transport. Since the performance of GPUs is currently increasing at a faster pace than that of CPUs, the advantages of GPU-based software are likely to be more pronounced in the future.
Massively parallel algorithm and implementation of RI-MP2 energy calculation for peta-scale many-core supercomputers.

PubMed

Katouda, Michio; Naruse, Akira; Hirano, Yukihiko; Nakajima, Takahito

2016-11-15

A new parallel algorithm and its implementation for the RI-MP2 energy calculation utilizing peta-flop-class many-core supercomputers are presented. Some improvements from the previous algorithm (J. Chem. Theory Comput. 2013, 9, 5373) have been performed: (1) a dual-level hierarchical parallelization scheme that enables the use of more than 10,000 Message Passing Interface (MPI) processes and (2) a new data communication scheme that reduces network communication overhead. A multi-node and multi-GPU implementation of the present algorithm is presented for calculations on a central processing unit (CPU)/graphics processing unit (GPU) hybrid supercomputer. Benchmark results of the new algorithm and its implementation using the K computer (CPU clustering system) and TSUBAME 2.5 (CPU/GPU hybrid system) demonstrate high efficiency. The peak performance of 3.1 PFLOPS is attained using 80,199 nodes of the K computer. The peak performance of the multi-node and multi-GPU implementation is 514 TFLOPS using 1349 nodes and 4047 GPUs of TSUBAME 2.5. © 2016 Wiley Periodicals, Inc. © 2016 Wiley Periodicals, Inc.
3D brain tumor localization and parameter estimation using thermographic approach on GPU.

PubMed

Bousselham, Abdelmajid; Bouattane, Omar; Youssfi, Mohamed; Raihani, Abdelhadi

2018-01-01

The aim of this paper is to present a GPU parallel algorithm for brain tumor detection to estimate its size and location from surface temperature distribution obtained by thermography. The normal brain tissue is modeled as a rectangular cube including spherical tumor. The temperature distribution is calculated using forward three dimensional Pennes bioheat transfer equation, it's solved using massively parallel Finite Difference Method (FDM) and implemented on Graphics Processing Unit (GPU). Genetic Algorithm (GA) was used to solve the inverse problem and estimate the tumor size and location by minimizing an objective function involving measured temperature on the surface to those obtained by numerical simulation. The parallel implementation of Finite Difference Method reduces significantly the time of bioheat transfer and greatly accelerates the inverse identification of brain tumor thermophysical and geometrical properties. Experimental results show significant gains in the computational speed on GPU and achieve a speedup of around 41 compared to the CPU. The analysis performance of the estimation based on tumor size inside brain tissue also presented. Copyright © 2017 Elsevier Ltd. All rights reserved.
GPU-accelerated Red Blood Cells Simulations with Transport Dissipative Particle Dynamics.

PubMed

Blumers, Ansel L; Tang, Yu-Hang; Li, Zhen; Li, Xuejin; Karniadakis, George E

2017-08-01

Mesoscopic numerical simulations provide a unique approach for the quantification of the chemical influences on red blood cell functionalities. The transport Dissipative Particles Dynamics (tDPD) method can lead to such effective multiscale simulations due to its ability to simultaneously capture mesoscopic advection, diffusion, and reaction. In this paper, we present a GPU-accelerated red blood cell simulation package based on a tDPD adaptation of our red blood cell model, which can correctly recover the cell membrane viscosity, elasticity, bending stiffness, and cross-membrane chemical transport. The package essentially processes all computational workloads in parallel by GPU, and it incorporates multi-stream scheduling and non-blocking MPI communications to improve inter-node scalability. Our code is validated for accuracy and compared against the CPU counterpart for speed. Strong scaling and weak scaling are also presented to characterizes scalability. We observe a speedup of 10.1 on one GPU over all 16 cores within a single node, and a weak scaling efficiency of 91% across 256 nodes. The program enables quick-turnaround and high-throughput numerical simulations for investigating chemical-driven red blood cell phenomena and disorders.
GPURFSCREEN: a GPU based virtual screening tool using random forest classifier.

PubMed

Jayaraj, P B; Ajay, Mathias K; Nufail, M; Gopakumar, G; Jaleel, U C A

2016-01-01

In-silico methods are an integral part of modern drug discovery paradigm. Virtual screening, an in-silico method, is used to refine data models and reduce the chemical space on which wet lab experiments need to be performed. Virtual screening of a ligand data model requires large scale computations, making it a highly time consuming task. This process can be speeded up by implementing parallelized algorithms on a Graphical Processing Unit (GPU). Random Forest is a robust classification algorithm that can be employed in the virtual screening. A ligand based virtual screening tool (GPURFSCREEN) that uses random forests on GPU systems has been proposed and evaluated in this paper. This tool produces optimized results at a lower execution time for large bioassay data sets. The quality of results produced by our tool on GPU is same as that on a regular serial environment. Considering the magnitude of data to be screened, the parallelized virtual screening has a significantly lower running time at high throughput. The proposed parallel tool outperforms its serial counterpart by successfully screening billions of molecules in training and prediction phases.
Analysis of performance improvements for host and GPU interface of the APENet+ 3D Torus network

NASA Astrophysics Data System (ADS)

Ammendola A, R.; Biagioni, A.; Frezza, O.; Lo Cicero, F.; Lonardo, A.; Paolucci, P. S.; Rossetti, D.; Simula, F.; Tosoratto, L.; Vicini, P.

2014-06-01

APEnet+ is an INFN (Italian Institute for Nuclear Physics) project aiming to develop a custom 3-Dimensional torus interconnect network optimized for hybrid clusters CPU-GPU dedicated to High Performance scientific Computing. The APEnet+ interconnect fabric is built on a FPGA-based PCI-express board with 6 bi-directional off-board links showing 34 Gbps of raw bandwidth per direction, and leverages upon peer-to-peer capabilities of Fermi and Kepler-class NVIDIA GPUs to obtain real zero-copy, GPU-to-GPU low latency transfers. The minimization of APEnet+ transfer latency is achieved through the adoption of RDMA protocol implemented in FPGA with specialized hardware blocks tightly coupled with embedded microprocessor. This architecture provides a high performance low latency offload engine for both trasmit and receive side of data transactions: preliminary results are encouraging, showing 50% of bandwidth increase for large packet size transfers. In this paper we describe the APEnet+ architecture, detailing the hardware implementation and discuss the impact of such RDMA specialized hardware on host interface latency and bandwidth.
GPU-based low-level trigger system for the standalone reconstruction of the ring-shaped hit patterns in the RICH Cherenkov detector of NA62 experiment

NASA Astrophysics Data System (ADS)

Ammendola, R.; Biagioni, A.; Chiozzi, S.; Cretaro, P.; Cotta Ramusino, A.; Di Lorenzo, S.; Fantechi, R.; Fiorini, M.; Frezza, O.; Gianoli, A.; Lamanna, G.; Lo Cicero, F.; Lonardo, A.; Martinelli, M.; Neri, I.; Paolucci, P. S.; Pastorelli, E.; Piandani, R.; Piccini, M.; Pontisso, L.; Rossetti, D.; Simula, F.; Sozzi, M.; Vicini, P.

2017-03-01

This project aims to exploit the parallel computing power of a commercial Graphics Processing Unit (GPU) to implement fast pattern matching in the Ring Imaging Cherenkov (RICH) detector for the level 0 (L0) trigger of the NA62 experiment. In this approach, the ring-fitting algorithm is seedless, being fed with raw RICH data, with no previous information on the ring position from other detectors. Moreover, since the L0 trigger is provided with a more elaborated information than a simple multiplicity number, it results in a higher selection power. Two methods have been studied in order to reduce the data transfer latency from the readout boards of the detector to the GPU, i.e., the use of a dedicated NIC device driver with very low latency and a direct data transfer protocol from a custom FPGA-based NIC to the GPU. The performance of the system, developed through the FPGA approach, for multi-ring Cherenkov online reconstruction obtained during the NA62 physics runs is presented.
Practical Implementation of Prestack Kirchhoff Time Migration on a General Purpose Graphics Processing Unit

NASA Astrophysics Data System (ADS)

Liu, Guofeng; Li, Chun

2016-08-01

In this study, we present a practical implementation of prestack Kirchhoff time migration (PSTM) on a general purpose graphic processing unit. First, we consider the three main optimizations of the PSTM GPU code, i.e., designing a configuration based on a reasonable execution, using the texture memory for velocity interpolation, and the application of an intrinsic function in device code. This approach can achieve a speedup of nearly 45 times on a NVIDIA GTX 680 GPU compared with CPU code when a larger imaging space is used, where the PSTM output is a common reflection point that is gathered as I[ nx][ ny][ nh][ nt] in matrix format. However, this method requires more memory space so the limited imaging space cannot fully exploit the GPU sources. To overcome this problem, we designed a PSTM scheme with multi-GPUs for imaging different seismic data on different GPUs using an offset value. This process can achieve the peak speedup of GPU PSTM code and it greatly increases the efficiency of the calculations, but without changing the imaging result.
Performance and scalability of Fourier domain optical coherence tomography acceleration using graphics processing units.

PubMed

Li, Jian; Bloch, Pavel; Xu, Jing; Sarunic, Marinko V; Shannon, Lesley

2011-05-01

Fourier domain optical coherence tomography (FD-OCT) provides faster line rates, better resolution, and higher sensitivity for noninvasive, in vivo biomedical imaging compared to traditional time domain OCT (TD-OCT). However, because the signal processing for FD-OCT is computationally intensive, real-time FD-OCT applications demand powerful computing platforms to deliver acceptable performance. Graphics processing units (GPUs) have been used as coprocessors to accelerate FD-OCT by leveraging their relatively simple programming model to exploit thread-level parallelism. Unfortunately, GPUs do not "share" memory with their host processors, requiring additional data transfers between the GPU and CPU. In this paper, we implement a complete FD-OCT accelerator on a consumer grade GPU/CPU platform. Our data acquisition system uses spectrometer-based detection and a dual-arm interferometer topology with numerical dispersion compensation for retinal imaging. We demonstrate that the maximum line rate is dictated by the memory transfer time and not the processing time due to the GPU platform's memory model. Finally, we discuss how the performance trends of GPU-based accelerators compare to the expected future requirements of FD-OCT data rates.
Bin recycling strategy for improving the histogram precision on GPU

NASA Astrophysics Data System (ADS)

Cárdenas-Montes, Miguel; Rodríguez-Vázquez, Juan José; Vega-Rodríguez, Miguel A.

2016-07-01

Histogram is an easily comprehensible way to present data and analyses. In the current scientific context with access to large volumes of data, the processing time for building histogram has dramatically increased. For this reason, parallel construction is necessary to alleviate the impact of the processing time in the analysis activities. In this scenario, GPU computing is becoming widely used for reducing until affordable levels the processing time of histogram construction. Associated to the increment of the processing time, the implementations are stressed on the bin-count accuracy. Accuracy aspects due to the particularities of the implementations are not usually taken into consideration when building histogram with very large data sets. In this work, a bin recycling strategy to create an accuracy-aware implementation for building histogram on GPU is presented. In order to evaluate the approach, this strategy was applied to the computation of the three-point angular correlation function, which is a relevant function in Cosmology for the study of the Large Scale Structure of Universe. As a consequence of the study a high-accuracy implementation for histogram construction on GPU is proposed.
GPU Accelerated Vector Median Filter

NASA Technical Reports Server (NTRS)

Aras, Rifat; Shen, Yuzhong

2011-01-01

Noise reduction is an important step for most image processing tasks. For three channel color images, a widely used technique is vector median filter in which color values of pixels are treated as 3-component vectors. Vector median filters are computationally expensive; for a window size of n x n, each of the n(sup 2) vectors has to be compared with other n(sup 2) - 1 vectors in distances. General purpose computation on graphics processing units (GPUs) is the paradigm of utilizing high-performance many-core GPU architectures for computation tasks that are normally handled by CPUs. In this work. NVIDIA's Compute Unified Device Architecture (CUDA) paradigm is used to accelerate vector median filtering. which has to the best of our knowledge never been done before. The performance of GPU accelerated vector median filter is compared to that of the CPU and MPI-based versions for different image and window sizes, Initial findings of the study showed 100x improvement of performance of vector median filter implementation on GPUs over CPU implementations and further speed-up is expected after more extensive optimizations of the GPU algorithm .
GPU-based Green's function simulations of shear waves generated by an applied acoustic radiation force in elastic and viscoelastic models.

PubMed

Yang, Yiqun; Urban, Matthew W; McGough, Robert J

2018-05-15

Shear wave calculations induced by an acoustic radiation force are very time-consuming on desktop computers, and high-performance graphics processing units (GPUs) achieve dramatic reductions in the computation time for these simulations. The acoustic radiation force is calculated using the fast near field method and the angular spectrum approach, and then the shear waves are calculated in parallel with Green's functions on a GPU. This combination enables rapid evaluation of shear waves for push beams with different spatial samplings and for apertures with different f/#. Relative to shear wave simulations that evaluate the same algorithm on an Intel i7 desktop computer, a high performance nVidia GPU reduces the time required for these calculations by a factor of 45 and 700 when applied to elastic and viscoelastic shear wave simulation models, respectively. These GPU-accelerated simulations also compared to measurements in different viscoelastic phantoms, and the results are similar. For parametric evaluations and for comparisons with measured shear wave data, shear wave simulations with the Green's function approach are ideally suited for high-performance GPUs.
An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU

NASA Astrophysics Data System (ADS)

Lyakh, Dmitry I.

2015-04-01

An efficient parallel tensor transpose algorithm is suggested for shared-memory computing units, namely, multicore CPU, Intel Xeon Phi, and NVidia GPU. The algorithm operates on dense tensors (multidimensional arrays) and is based on the optimization of cache utilization on x86 CPU and the use of shared memory on NVidia GPU. From the applied side, the ultimate goal is to minimize the overhead encountered in the transformation of tensor contractions into matrix multiplications in computer implementations of advanced methods of quantum many-body theory (e.g., in electronic structure theory and nuclear physics). A particular accent is made on higher-dimensional tensors that typically appear in the so-called multireference correlated methods of electronic structure theory. Depending on tensor dimensionality, the presented optimized algorithms can achieve an order of magnitude speedup on x86 CPUs and 2-3 times speedup on NVidia Tesla K20X GPU with respect to the naïve scattering algorithm (no memory access optimization). The tensor transpose routines developed in this work have been incorporated into a general-purpose tensor algebra library (TAL-SH).
NiftySim: A GPU-based nonlinear finite element package for simulation of soft tissue biomechanics.

PubMed

Johnsen, Stian F; Taylor, Zeike A; Clarkson, Matthew J; Hipwell, John; Modat, Marc; Eiben, Bjoern; Han, Lianghao; Hu, Yipeng; Mertzanidou, Thomy; Hawkes, David J; Ourselin, Sebastien

2015-07-01

NiftySim, an open-source finite element toolkit, has been designed to allow incorporation of high-performance soft tissue simulation capabilities into biomedical applications. The toolkit provides the option of execution on fast graphics processing unit (GPU) hardware, numerous constitutive models and solid-element options, membrane and shell elements, and contact modelling facilities, in a simple to use library. The toolkit is founded on the total Lagrangian explicit dynamics (TLEDs) algorithm, which has been shown to be efficient and accurate for simulation of soft tissues. The base code is written in C[Formula: see text], and GPU execution is achieved using the nVidia CUDA framework. In most cases, interaction with the underlying solvers can be achieved through a single Simulator class, which may be embedded directly in third-party applications such as, surgical guidance systems. Advanced capabilities such as contact modelling and nonlinear constitutive models are also provided, as are more experimental technologies like reduced order modelling. A consistent description of the underlying solution algorithm, its implementation with a focus on GPU execution, and examples of the toolkit's usage in biomedical applications are provided. Efficient mapping of the TLED algorithm to parallel hardware results in very high computational performance, far exceeding that available in commercial packages. The NiftySim toolkit provides high-performance soft tissue simulation capabilities using GPU technology for biomechanical simulation research applications in medical image computing, surgical simulation, and surgical guidance applications.
Graphics processing unit accelerated phase field dislocation dynamics: Application to bi-metallic interfaces

DOE PAGES

Eghtesad, Adnan; Germaschewski, Kai; Beyerlein, Irene J.; ...

2017-10-14

We present the first high-performance computing implementation of the meso-scale phase field dislocation dynamics (PFDD) model on a graphics processing unit (GPU)-based platform. The implementation takes advantage of the portable OpenACC standard directive pragmas along with Nvidia's compute unified device architecture (CUDA) fast Fourier transform (FFT) library called CUFFT to execute the FFT computations within the PFDD formulation on the same GPU platform. The overall implementation is termed ACCPFDD-CUFFT. The package is entirely performance portable due to the use of OPENACC-CUDA inter-operability, in which calls to CUDA functions are replaced with the OPENACC data regions for a host central processingmore » unit (CPU) and device (GPU). A comprehensive benchmark study has been conducted, which compares a number of FFT routines, the Numerical Recipes FFT (FOURN), Fastest Fourier Transform in the West (FFTW), and the CUFFT. The last one exploits the advantages of the GPU hardware for FFT calculations. The novel ACCPFDD-CUFFT implementation is verified using the analytical solutions for the stress field around an infinite edge dislocation and subsequently applied to simulate the interaction and motion of dislocations through a bi-phase copper-nickel (Cu–Ni) interface. It is demonstrated that the ACCPFDD-CUFFT implementation on a single TESLA K80 GPU offers a 27.6X speedup relative to the serial version and a 5X speedup relative to the 22-multicore Intel Xeon CPU E5-2699 v4 @ 2.20 GHz version of the code.« less
Migrating EO/IR sensors to cloud-based infrastructure as service architectures

NASA Astrophysics Data System (ADS)

Berglie, Stephen T.; Webster, Steven; May, Christopher M.

2014-06-01

The Night Vision Image Generator (NVIG), a product of US Army RDECOM CERDEC NVESD, is a visualization tool used widely throughout Army simulation environments to provide fully attributed synthesized, full motion video using physics-based sensor and environmental effects. The NVIG relies heavily on contemporary hardware-based acceleration and GPU processing techniques, which push the envelope of both enterprise and commodity-level hypervisor support for providing virtual machines with direct access to hardware resources. The NVIG has successfully been integrated into fully virtual environments where system architectures leverage cloudbased technologies to various extents in order to streamline infrastructure and service management. This paper details the challenges presented to engineers seeking to migrate GPU-bound processes, such as the NVIG, to virtual machines and, ultimately, Cloud-Based IAS architectures. In addition, it presents the path that led to success for the NVIG. A brief overview of Cloud-Based infrastructure management tool sets is provided, and several virtual desktop solutions are outlined. A discrimination is made between general purpose virtual desktop technologies compared to technologies that expose GPU-specific capabilities, including direct rendering and hard ware-based video encoding. Candidate hypervisor/virtual machine configurations that nominally satisfy the virtualized hardware-level GPU requirements of the NVIG are presented , and each is subsequently reviewed in light of its implications on higher-level Cloud management techniques. Implementation details are included from the hardware level, through the operating system, to the 3D graphics APls required by the NVIG and similar GPU-bound tools.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Lindstrom, P; Cohen, J D

We present a streaming geometry compression codec for multiresolution, uniformly-gridded, triangular terrain patches that supports very fast decompression. Our method is based on linear prediction and residual coding for lossless compression of the full-resolution data. As simplified patches on coarser levels in the hierarchy already incur some data loss, we optionally allow further quantization for more lossy compression. The quantization levels are adaptive on a per-patch basis, while still permitting seamless, adaptive tessellations of the terrain. Our geometry compression on such a hierarchy achieves compression ratios of 3:1 to 12:1. Our scheme is not only suitable for fast decompression onmore » the CPU, but also for parallel decoding on the GPU with peak throughput over 2 billion triangles per second. Each terrain patch is independently decompressed on the fly from a variable-rate bitstream by a GPU geometry program with no branches or conditionals. Thus we can store the geometry compressed on the GPU, reducing storage and bandwidth requirements throughout the system. In our rendering approach, only compressed bitstreams and the decoded height values in the view-dependent 'cut' are explicitly stored on the GPU. Normal vectors are computed in a streaming fashion, and remaining geometry and texture coordinates, as well as mesh connectivity, are shared and re-used for all patches. We demonstrate and evaluate our algorithms on a small prototype system in which all compressed geometry fits in the GPU memory and decompression occurs on the fly every rendering frame without any cache maintenance.« less
Graphics processing unit accelerated phase field dislocation dynamics: Application to bi-metallic interfaces

DOE Office of Scientific and Technical Information (OSTI.GOV)

Eghtesad, Adnan; Germaschewski, Kai; Beyerlein, Irene J.

We present the first high-performance computing implementation of the meso-scale phase field dislocation dynamics (PFDD) model on a graphics processing unit (GPU)-based platform. The implementation takes advantage of the portable OpenACC standard directive pragmas along with Nvidia's compute unified device architecture (CUDA) fast Fourier transform (FFT) library called CUFFT to execute the FFT computations within the PFDD formulation on the same GPU platform. The overall implementation is termed ACCPFDD-CUFFT. The package is entirely performance portable due to the use of OPENACC-CUDA inter-operability, in which calls to CUDA functions are replaced with the OPENACC data regions for a host central processingmore » unit (CPU) and device (GPU). A comprehensive benchmark study has been conducted, which compares a number of FFT routines, the Numerical Recipes FFT (FOURN), Fastest Fourier Transform in the West (FFTW), and the CUFFT. The last one exploits the advantages of the GPU hardware for FFT calculations. The novel ACCPFDD-CUFFT implementation is verified using the analytical solutions for the stress field around an infinite edge dislocation and subsequently applied to simulate the interaction and motion of dislocations through a bi-phase copper-nickel (Cu–Ni) interface. It is demonstrated that the ACCPFDD-CUFFT implementation on a single TESLA K80 GPU offers a 27.6X speedup relative to the serial version and a 5X speedup relative to the 22-multicore Intel Xeon CPU E5-2699 v4 @ 2.20 GHz version of the code.« less
GPU-accelerated depth map generation for X-ray simulations of complex CAD geometries

NASA Astrophysics Data System (ADS)

Grandin, Robert J.; Young, Gavin; Holland, Stephen D.; Krishnamurthy, Adarsh

2018-04-01

Interactive x-ray simulations of complex computer-aided design (CAD) models can provide valuable insights for better interpretation of the defect signatures such as porosity from x-ray CT images. Generating the depth map along a particular direction for the given CAD geometry is the most compute-intensive step in x-ray simulations. We have developed a GPU-accelerated method for real-time generation of depth maps of complex CAD geometries. We preprocess complex components designed using commercial CAD systems using a custom CAD module and convert them into a fine user-defined surface tessellation. Our CAD module can be used by different simulators as well as handle complex geometries, including those that arise from complex castings and composite structures. We then make use of a parallel algorithm that runs on a graphics processing unit (GPU) to convert the finely-tessellated CAD model to a voxelized representation. The voxelized representation can enable heterogeneous modeling of the volume enclosed by the CAD model by assigning heterogeneous material properties in specific regions. The depth maps are generated from this voxelized representation with the help of a GPU-accelerated ray-casting algorithm. The GPU-accelerated ray-casting method enables interactive (> 60 frames-per-second) generation of the depth maps of complex CAD geometries. This enables arbitrarily rotation and slicing of the CAD model, leading to better interpretation of the x-ray images by the user. In addition, the depth maps can be used to aid directly in CT reconstruction algorithms.
TU-AB-BRC-09: Fast Dose-Averaged LET and Biological Dose Calculations for Proton Therapy Using Graphics Cards

DOE Office of Scientific and Technical Information (OSTI.GOV)

Wan, H; Tseung, Chan; Beltran, C

Purpose: To demonstrate fast and accurate Monte Carlo (MC) calculations of proton dose-averaged linear energy transfer (LETd) and biological dose (BD) on a Graphics Processing Unit (GPU) card. Methods: A previously validated GPU-based MC simulation of proton transport was used to rapidly generate LETd distributions for proton treatment plans. Since this MC handles proton-nuclei interactions on an event-by-event using a Bertini intranuclear cascade-evaporation model, secondary protons were taken into account. The smaller contributions of secondary neutrons and recoil nuclei were ignored. Recent work has shown that LETd values are sensitive to the scoring method. The GPU-based LETd calculations were verifiedmore » by comparing with a TOPAS custom scorer that uses tabulated stopping powers, following recommendations by other authors. Comparisons were made for prostate and head-and-neck patients. A python script is used to convert the MC-generated LETd distributions to BD using a variety of published linear quadratic models, and to export the BD in DICOM format for subsequent evaluation. Results: Very good agreement is obtained between TOPAS and our GPU MC. Given a complex head-and-neck plan with 1 mm voxel spacing, the physical dose, LETd and BD calculations for 10{sup 8} proton histories can be completed in ∼5 minutes using a NVIDIA Titan X card. The rapid turnover means that MC feedback can be obtained on dosimetric plan accuracy as well as BD hotspot locations, particularly in regards to their proximity to critical structures. In our institution the GPU MC-generated dose, LETd and BD maps are used to assess plan quality for all patients undergoing treatment. Conclusion: Fast and accurate MC-based LETd calculations can be performed on the GPU. The resulting BD maps provide valuable feedback during treatment plan review. Partially funded by Varian Medical Systems.« less
TU-EF-304-07: Monte Carlo-Based Inverse Treatment Plan Optimization for Intensity Modulated Proton Therapy

DOE Office of Scientific and Technical Information (OSTI.GOV)

Li, Y; UT Southwestern Medical Center, Dallas, TX; Tian, Z

2015-06-15

Purpose: Intensity-modulated proton therapy (IMPT) is increasingly used in proton therapy. For IMPT optimization, Monte Carlo (MC) is desired for spots dose calculations because of its high accuracy, especially in cases with a high level of heterogeneity. It is also preferred in biological optimization problems due to the capability of computing quantities related to biological effects. However, MC simulation is typically too slow to be used for this purpose. Although GPU-based MC engines have become available, the achieved efficiency is still not ideal. The purpose of this work is to develop a new optimization scheme to include GPU-based MC intomore » IMPT. Methods: A conventional approach using MC in IMPT simply calls the MC dose engine repeatedly for each spot dose calculations. However, this is not the optimal approach, because of the unnecessary computations on some spots that turned out to have very small weights after solving the optimization problem. GPU-memory writing conflict occurring at a small beam size also reduces computational efficiency. To solve these problems, we developed a new framework that iteratively performs MC dose calculations and plan optimizations. At each dose calculation step, the particles were sampled from different spots altogether with Metropolis algorithm, such that the particle number is proportional to the latest optimized spot intensity. Simultaneously transporting particles from multiple spots also mitigated the memory writing conflict problem. Results: We have validated the proposed MC-based optimization schemes in one prostate case. The total computation time of our method was ∼5–6 min on one NVIDIA GPU card, including both spot dose calculation and plan optimization, whereas a conventional method naively using the same GPU-based MC engine were ∼3 times slower. Conclusion: A fast GPU-based MC dose calculation method along with a novel optimization workflow is developed. The high efficiency makes it attractive for clinical usages.« less
Modeling parameterized geometry in GPU-based Monte Carlo particle transport simulation for radiotherapy.

PubMed

Chi, Yujie; Tian, Zhen; Jia, Xun

2016-08-07

Monte Carlo (MC) particle transport simulation on a graphics-processing unit (GPU) platform has been extensively studied recently due to the efficiency advantage achieved via massive parallelization. Almost all of the existing GPU-based MC packages were developed for voxelized geometry. This limited application scope of these packages. The purpose of this paper is to develop a module to model parametric geometry and integrate it in GPU-based MC simulations. In our module, each continuous region was defined by its bounding surfaces that were parameterized by quadratic functions. Particle navigation functions in this geometry were developed. The module was incorporated to two previously developed GPU-based MC packages and was tested in two example problems: (1) low energy photon transport simulation in a brachytherapy case with a shielded cylinder applicator and (2) MeV coupled photon/electron transport simulation in a phantom containing several inserts of different shapes. In both cases, the calculated dose distributions agreed well with those calculated in the corresponding voxelized geometry. The averaged dose differences were 1.03% and 0.29%, respectively. We also used the developed package to perform simulations of a Varian VS 2000 brachytherapy source and generated a phase-space file. The computation time under the parameterized geometry depended on the memory location storing the geometry data. When the data was stored in GPU's shared memory, the highest computational speed was achieved. Incorporation of parameterized geometry yielded a computation time that was ~3 times of that in the corresponding voxelized geometry. We also developed a strategy to use an auxiliary index array to reduce frequency of geometry calculations and hence improve efficiency. With this strategy, the computational time ranged in 1.75-2.03 times of the voxelized geometry for coupled photon/electron transport depending on the voxel dimension of the auxiliary index array, and in 0.69-1.23 times for photon only transport.
SU-E-T-558: Monte Carlo Photon Transport Simulations On GPU with Quadric Geometry

DOE Office of Scientific and Technical Information (OSTI.GOV)

Chi, Y; Tian, Z; Jiang, S

Purpose: Monte Carlo simulation on GPU has experienced rapid advancements over the past a few years and tremendous accelerations have been achieved. Yet existing packages were developed only in voxelized geometry. In some applications, e.g. radioactive seed modeling, simulations in more complicated geometry are needed. This abstract reports our initial efforts towards developing a quadric geometry module aiming at expanding the application scope of GPU-based MC simulations. Methods: We defined the simulation geometry consisting of a number of homogeneous bodies, each specified by its material composition and limiting surfaces characterized by quadric functions. A tree data structure was utilized tomore » define geometric relationship between different bodies. We modified our GPU-based photon MC transport package to incorporate this geometry. Specifically, geometry parameters were loaded into GPU’s shared memory for fast access. Geometry functions were rewritten to enable the identification of the body that contains the current particle location via a fast searching algorithm based on the tree data structure. Results: We tested our package in an example problem of HDR-brachytherapy dose calculation for shielded cylinder. The dose under the quadric geometry and that under the voxelized geometry agreed in 94.2% of total voxels within 20% isodose line based on a statistical t-test (95% confidence level), where the reference dose was defined to be the one at 0.5cm away from the cylinder surface. It took 243sec to transport 100million source photons under this quadric geometry on an NVidia Titan GPU card. Compared with simulation time of 99.6sec in the voxelized geometry, including quadric geometry reduced efficiency due to the complicated geometry-related computations. Conclusion: Our GPU-based MC package has been extended to support photon transport simulation in quadric geometry. Satisfactory accuracy was observed with a reduced efficiency. Developments for charged particle transport in this geometry are currently in progress.« less
A New GPU-Enabled MODTRAN Thermal Model for the PLUME TRACKER Volcanic Emission Analysis Toolkit

NASA Astrophysics Data System (ADS)

Acharya, P. K.; Berk, A.; Guiang, C.; Kennett, R.; Perkins, T.; Realmuto, V. J.

2013-12-01

Real-time quantification of volcanic gaseous and particulate releases is important for (1) recognizing rapid increases in SO2 gaseous emissions which may signal an impending eruption; (2) characterizing ash clouds to enable safe and efficient commercial aviation; and (3) quantifying the impact of volcanic aerosols on climate forcing. The Jet Propulsion Laboratory (JPL) has developed state-of-the-art algorithms, embedded in their analyst-driven Plume Tracker toolkit, for performing SO2, NH3, and CH4 retrievals from remotely sensed multi-spectral Thermal InfraRed spectral imagery. While Plume Tracker provides accurate results, it typically requires extensive analyst time. A major bottleneck in this processing is the relatively slow but accurate FORTRAN-based MODTRAN atmospheric and plume radiance model, developed by Spectral Sciences, Inc. (SSI). To overcome this bottleneck, SSI in collaboration with JPL, is porting these slow thermal radiance algorithms onto massively parallel, relatively inexpensive and commercially-available GPUs. This paper discusses SSI's efforts to accelerate the MODTRAN thermal emission algorithms used by Plume Tracker. Specifically, we are developing a GPU implementation of the Curtis-Godson averaging and the Voigt in-band transmittances from near line center molecular absorption, which comprise the major computational bottleneck. The transmittance calculations were decomposed into separate functions, individually implemented as GPU kernels, and tested for accuracy and performance relative to the original CPU code. Speedup factors of 14 to 30× were realized for individual processing components on an NVIDIA GeForce GTX 295 graphics card with no loss of accuracy. Due to the separate host (CPU) and device (GPU) memory spaces, a redesign of the MODTRAN architecture was required to ensure efficient data transfer between host and device, and to facilitate high parallel throughput. Currently, we are incorporating the separate GPU kernels into a single function for calculating the Voigt in-band transmittance, and subsequently for integration into the re-architectured MODTRAN6 code. Our overall objective is that by combining the GPU processing with more efficient Plume Tracker retrieval algorithms, a 100-fold increase in the computational speed will be realized. Since the Plume Tracker runs on Windows-based platforms, the GPU-enhanced MODTRAN6 will be packaged as a DLL. We do however anticipate that the accelerated option will be made available to the general MODTRAN community through an application programming interface (API).
SU-G-TeP1-15: Toward a Novel GPU Accelerated Deterministic Solution to the Linear Boltzmann Transport Equation

DOE Office of Scientific and Technical Information (OSTI.GOV)

Yang, R; Fallone, B; Cross Cancer Institute, Edmonton, AB

Purpose: To develop a Graphic Processor Unit (GPU) accelerated deterministic solution to the Linear Boltzmann Transport Equation (LBTE) for accurate dose calculations in radiotherapy (RT). A deterministic solution yields the potential for major speed improvements due to the sparse matrix-vector and vector-vector multiplications and would thus be of benefit to RT. Methods: In order to leverage the massively parallel architecture of GPUs, the first order LBTE was reformulated as a second order self-adjoint equation using the Least Squares Finite Element Method (LSFEM). This produces a symmetric positive-definite matrix which is efficiently solved using a parallelized conjugate gradient (CG) solver. Themore » LSFEM formalism is applied in space, discrete ordinates is applied in angle, and the Multigroup method is applied in energy. The final linear system of equations produced is tightly coupled in space and angle. Our code written in CUDA-C was benchmarked on an Nvidia GeForce TITAN-X GPU against an Intel i7-6700K CPU. A spatial mesh of 30,950 tetrahedral elements was used with an S4 angular approximation. Results: To avoid repeating a full computationally intensive finite element matrix assembly at each Multigroup energy, a novel mapping algorithm was developed which minimized the operations required at each energy. Additionally, a parallelized memory mapping for the kronecker product between the sparse spatial and angular matrices, including Dirichlet boundary conditions, was created. Atomicity is preserved by graph-coloring overlapping nodes into separate kernel launches. The one-time mapping calculations for matrix assembly, kronecker product, and boundary condition application took 452±1ms on GPU. Matrix assembly for 16 energy groups took 556±3s on CPU, and 358±2ms on GPU using the mappings developed. The CG solver took 93±1s on CPU, and 468±2ms on GPU. Conclusion: Three computationally intensive subroutines in deterministically solving the LBTE have been formulated on GPU, resulting in two orders of magnitude speedup. Funding support from Natural Sciences and Engineering Research Council and Alberta Innovates Health Solutions. Dr. Fallone is a co-founder and CEO of MagnetTx Oncology Solutions (under discussions to license Alberta bi-planar linac MR for commercialization).« less
GPU-based multi-volume ray casting within VTK for medical applications.

PubMed

Bozorgi, Mohammadmehdi; Lindseth, Frank

2015-03-01

Multi-volume visualization is important for displaying relevant information in multimodal or multitemporal medical imaging studies. The main objective with the current study was to develop an efficient GPU-based multi-volume ray caster (MVRC) and validate the proposed visualization system in the context of image-guided surgical navigation. Ray casting can produce high-quality 2D images from 3D volume data but the method is computationally demanding, especially when multiple volumes are involved, so a parallel GPU version has been implemented. In the proposed MVRC, imaginary rays are sent through the volumes (one ray for each pixel in the view), and at equal and short intervals along the rays, samples are collected from each volume. Samples from all the volumes are composited using front to back α-blending. Since all the rays can be processed simultaneously, the MVRC was implemented in parallel on the GPU to achieve acceptable interactive frame rates. The method is fully integrated within the visualization toolkit (VTK) pipeline with the ability to apply different operations (e.g., transformations, clipping, and cropping) on each volume separately. The implemented method is cross-platform (Windows, Linux and Mac OSX) and runs on different graphics card (NVidia and AMD). The speed of the MVRC was tested with one to five volumes of varying sizes: 128(3), 256(3), and 512(3). A Tesla C2070 GPU was used, and the output image size was 600 × 600 pixels. The original VTK single-volume ray caster and the MVRC were compared when rendering only one volume. The multi-volume rendering system achieved an interactive frame rate (> 15 fps) when rendering five small volumes (128 (3) voxels), four medium-sized volumes (256(3) voxels), and two large volumes (512(3) voxels). When rendering single volumes, the frame rate of the MVRC was comparable to the original VTK ray caster for small and medium-sized datasets but was approximately 3 frames per second slower for large datasets. The MVRC was successfully integrated in an existing surgical navigation system and was shown to be clinically useful during an ultrasound-guided neurosurgical tumor resection. A GPU-based MVRC for VTK is a useful tool in medical visualization. The proposed multi-volume GPU-based ray caster for VTK provided high-quality images at reasonable frame rates. The MVRC was effective when used in a neurosurgical navigation application.
78 FR 22347 - GPU Nuclear Inc., Three Mile Island Nuclear Power Station, Unit 2, Exemption From Certain...

Federal Register 2010, 2011, 2012, 2013, 2014

2013-04-15

... NUCLEAR REGULATORY COMMISSION [Docket No. 50-320; NRC-2013-0065] GPU Nuclear Inc., Three Mile Island Nuclear Power Station, Unit 2, Exemption From Certain Security Requirements AGENCY: Nuclear Regulatory Commission. ACTION: Exemption. FOR FURTHER INFORMATION CONTACT: John B. Hickman, Office of Federal and State Materials and Environmental...
Initial Comparison of Single Cylinder Stirling Engine Computer Model Predictions with Test Results

NASA Technical Reports Server (NTRS)

Tew, R. C., Jr.; Thieme, L. G.; Miao, D.

1979-01-01

A Stirling engine digital computer model developed at NASA Lewis Research Center was configured to predict the performance of the GPU-3 single-cylinder rhombic drive engine. Revisions to the basic equations and assumptions are discussed. Model predictions with the early results of the Lewis Research Center GPU-3 tests are compared.
A GPU-based large-scale Monte Carlo simulation method for systems with long-range interactions

NASA Astrophysics Data System (ADS)

Liang, Yihao; Xing, Xiangjun; Li, Yaohang

2017-06-01

In this work we present an efficient implementation of Canonical Monte Carlo simulation for Coulomb many body systems on graphics processing units (GPU). Our method takes advantage of the GPU Single Instruction, Multiple Data (SIMD) architectures, and adopts the sequential updating scheme of Metropolis algorithm. It makes no approximation in the computation of energy, and reaches a remarkable 440-fold speedup, compared with the serial implementation on CPU. We further use this method to simulate primitive model electrolytes, and measure very precisely all ion-ion pair correlation functions at high concentrations. From these data, we extract the renormalized Debye length, renormalized valences of constituent ions, and renormalized dielectric constants. These results demonstrate unequivocally physics beyond the classical Poisson-Boltzmann theory.
Simulating spin models on GPU

NASA Astrophysics Data System (ADS)

Weigel, Martin

2011-09-01

Over the last couple of years it has been realized that the vast computational power of graphics processing units (GPUs) could be harvested for purposes other than the video game industry. This power, which at least nominally exceeds that of current CPUs by large factors, results from the relative simplicity of the GPU architectures as compared to CPUs, combined with a large number of parallel processing units on a single chip. To benefit from this setup for general computing purposes, the problems at hand need to be prepared in a way to profit from the inherent parallelism and hierarchical structure of memory accesses. In this contribution I discuss the performance potential for simulating spin models, such as the Ising model, on GPU as compared to conventional simulations on CPU.
Interactive brain shift compensation using GPU based programming

NASA Astrophysics Data System (ADS)

van der Steen, Sander; Noordmans, Herke Jan; Verdaasdonk, Rudolf

2009-02-01

Processing large images files or real-time video streams requires intense computational power. Driven by the gaming industry, the processing power of graphic process units (GPUs) has increased significantly. With the pixel shader model 4.0 the GPU can be used for image processing 10x faster than the CPU. Dedicated software was developed to deform 3D MR and CT image sets for real-time brain shift correction during navigated neurosurgery using landmarks or cortical surface traces defined by the navigation pointer. Feedback was given using orthogonal slices and an interactively raytraced 3D brain image. GPU based programming enables real-time processing of high definition image datasets and various applications can be developed in medicine, optics and image sciences.
GPU-accelerated computational tool for studying the effectiveness of asteroid disruption techniques

NASA Astrophysics Data System (ADS)

Zimmerman, Ben J.; Wie, Bong

2016-10-01

This paper presents the development of a new Graphics Processing Unit (GPU) accelerated computational tool for asteroid disruption techniques. Numerical simulations are completed using the high-order spectral difference (SD) method. Due to the compact nature of the SD method, it is well suited for implementation with the GPU architecture, hence solutions are generated at orders of magnitude faster than the Central Processing Unit (CPU) counterpart. A multiphase model integrated with the SD method is introduced, and several asteroid disruption simulations are conducted, including kinetic-energy impactors, multi-kinetic energy impactor systems, and nuclear options. Results illustrate the benefits of using multi-kinetic energy impactor systems when compared to a single impactor system. In addition, the effectiveness of nuclear options is observed.
GPU accelerated FDTD solver and its application in MRI.

PubMed

Chi, J; Liu, F; Jin, J; Mason, D G; Crozier, S

2010-01-01

The finite difference time domain (FDTD) method is a popular technique for computational electromagnetics (CEM). The large computational power often required, however, has been a limiting factor for its applications. In this paper, we will present a graphics processing unit (GPU)-based parallel FDTD solver and its successful application to the investigation of a novel B1 shimming scheme for high-field magnetic resonance imaging (MRI). The optimized shimming scheme exhibits considerably improved transmit B(1) profiles. The GPU implementation dramatically shortened the runtime of FDTD simulation of electromagnetic field compared with its CPU counterpart. The acceleration in runtime has made such investigation possible, and will pave the way for other studies of large-scale computational electromagnetic problems in modern MRI which were previously impractical.
Data assimilation using a GPU accelerated path integral Monte Carlo approach

NASA Astrophysics Data System (ADS)

Quinn, John C.; Abarbanel, Henry D. I.

2011-09-01

The answers to data assimilation questions can be expressed as path integrals over all possible state and parameter histories. We show how these path integrals can be evaluated numerically using a Markov Chain Monte Carlo method designed to run in parallel on a graphics processing unit (GPU). We demonstrate the application of the method to an example with a transmembrane voltage time series of a simulated neuron as an input, and using a Hodgkin-Huxley neuron model. By taking advantage of GPU computing, we gain a parallel speedup factor of up to about 300, compared to an equivalent serial computation on a CPU, with performance increasing as the length of the observation time used for data assimilation increases.
GRAVIDY, a GPU modular, parallel direct-summation N-body integrator: dynamics with softening

NASA Astrophysics Data System (ADS)

Maureira-Fredes, Cristián; Amaro-Seoane, Pau

2018-01-01

A wide variety of outstanding problems in astrophysics involve the motion of a large number of particles under the force of gravity. These include the global evolution of globular clusters, tidal disruptions of stars by a massive black hole, the formation of protoplanets and sources of gravitational radiation. The direct-summation of N gravitational forces is a complex problem with no analytical solution and can only be tackled with approximations and numerical methods. To this end, the Hermite scheme is a widely used integration method. With different numerical techniques and special-purpose hardware, it can be used to speed up the calculations. But these methods tend to be computationally slow and cumbersome to work with. We present a new graphics processing unit (GPU), direct-summation N-body integrator written from scratch and based on this scheme, which includes relativistic corrections for sources of gravitational radiation. GRAVIDY has high modularity, allowing users to readily introduce new physics, it exploits available computational resources and will be maintained by regular updates. GRAVIDY can be used in parallel on multiple CPUs and GPUs, with a considerable speed-up benefit. The single-GPU version is between one and two orders of magnitude faster than the single-CPU version. A test run using four GPUs in parallel shows a speed-up factor of about 3 as compared to the single-GPU version. The conception and design of this first release is aimed at users with access to traditional parallel CPU clusters or computational nodes with one or a few GPU cards.
A GPU-accelerated semi-implicit fractional-step method for numerical solutions of incompressible Navier-Stokes equations

NASA Astrophysics Data System (ADS)

Ha, Sanghyun; Park, Junshin; You, Donghyun

2018-01-01

Utility of the computational power of Graphics Processing Units (GPUs) is elaborated for solutions of incompressible Navier-Stokes equations which are integrated using a semi-implicit fractional-step method. The Alternating Direction Implicit (ADI) and the Fourier-transform-based direct solution methods used in the semi-implicit fractional-step method take advantage of multiple tridiagonal matrices whose inversion is known as the major bottleneck for acceleration on a typical multi-core machine. A novel implementation of the semi-implicit fractional-step method designed for GPU acceleration of the incompressible Navier-Stokes equations is presented. Aspects of the programing model of Compute Unified Device Architecture (CUDA), which are critical to the bandwidth-bound nature of the present method are discussed in detail. A data layout for efficient use of CUDA libraries is proposed for acceleration of tridiagonal matrix inversion and fast Fourier transform. OpenMP is employed for concurrent collection of turbulence statistics on a CPU while the Navier-Stokes equations are computed on a GPU. Performance of the present method using CUDA is assessed by comparing the speed of solving three tridiagonal matrices using ADI with the speed of solving one heptadiagonal matrix using a conjugate gradient method. An overall speedup of 20 times is achieved using a Tesla K40 GPU in comparison with a single-core Xeon E5-2660 v3 CPU in simulations of turbulent boundary-layer flow over a flat plate conducted on over 134 million grids. Enhanced performance of 48 times speedup is reached for the same problem using a Tesla P100 GPU.

An efficient implementation of 3D high-resolution imaging for large-scale seismic data with GPU/CPU heterogeneous parallel computing

NASA Astrophysics Data System (ADS)

Xu, Jincheng; Liu, Wei; Wang, Jin; Liu, Linong; Zhang, Jianfeng

2018-02-01

De-absorption pre-stack time migration (QPSTM) compensates for the absorption and dispersion of seismic waves by introducing an effective Q parameter, thereby making it an effective tool for 3D, high-resolution imaging of seismic data. Although the optimal aperture obtained via stationary-phase migration reduces the computational cost of 3D QPSTM and yields 3D stationary-phase QPSTM, the associated computational efficiency is still the main problem in the processing of 3D, high-resolution images for real large-scale seismic data. In the current paper, we proposed a division method for large-scale, 3D seismic data to optimize the performance of stationary-phase QPSTM on clusters of graphics processing units (GPU). Then, we designed an imaging point parallel strategy to achieve an optimal parallel computing performance. Afterward, we adopted an asynchronous double buffering scheme for multi-stream to perform the GPU/CPU parallel computing. Moreover, several key optimization strategies of computation and storage based on the compute unified device architecture (CUDA) were adopted to accelerate the 3D stationary-phase QPSTM algorithm. Compared with the initial GPU code, the implementation of the key optimization steps, including thread optimization, shared memory optimization, register optimization and special function units (SFU), greatly improved the efficiency. A numerical example employing real large-scale, 3D seismic data showed that our scheme is nearly 80 times faster than the CPU-QPSTM algorithm. Our GPU/CPU heterogeneous parallel computing framework significant reduces the computational cost and facilitates 3D high-resolution imaging for large-scale seismic data.
GPU-Q-J, a fast method for calculating root mean square deviation (RMSD) after optimal superposition

PubMed Central

2011-01-01

Background Calculation of the root mean square deviation (RMSD) between the atomic coordinates of two optimally superposed structures is a basic component of structural comparison techniques. We describe a quaternion based method, GPU-Q-J, that is stable with single precision calculations and suitable for graphics processor units (GPUs). The application was implemented on an ATI 4770 graphics card in C/C++ and Brook+ in Linux where it was 260 to 760 times faster than existing unoptimized CPU methods. Source code is available from the Compbio website http://software.compbio.washington.edu/misc/downloads/st_gpu_fit/ or from the author LHH. Findings The Nutritious Rice for the World Project (NRW) on World Community Grid predicted de novo, the structures of over 62,000 small proteins and protein domains returning a total of 10 billion candidate structures. Clustering ensembles of structures on this scale requires calculation of large similarity matrices consisting of RMSDs between each pair of structures in the set. As a real-world test, we calculated the matrices for 6 different ensembles from NRW. The GPU method was 260 times faster that the fastest existing CPU based method and over 500 times faster than the method that had been previously used. Conclusions GPU-Q-J is a significant advance over previous CPU methods. It relieves a major bottleneck in the clustering of large numbers of structures for NRW. It also has applications in structure comparison methods that involve multiple superposition and RMSD determination steps, particularly when such methods are applied on a proteome and genome wide scale. PMID:21453553
Sop-GPU: accelerating biomolecular simulations in the centisecond timescale using graphics processors.

PubMed

Zhmurov, A; Dima, R I; Kholodov, Y; Barsegov, V

2010-11-01

Theoretical exploration of fundamental biological processes involving the forced unraveling of multimeric proteins, the sliding motion in protein fibers and the mechanical deformation of biomolecular assemblies under physiological force loads is challenging even for distributed computing systems. Using a C(α)-based coarse-grained self organized polymer (SOP) model, we implemented the Langevin simulations of proteins on graphics processing units (SOP-GPU program). We assessed the computational performance of an end-to-end application of the program, where all the steps of the algorithm are running on a GPU, by profiling the simulation time and memory usage for a number of test systems. The ∼90-fold computational speedup on a GPU, compared with an optimized central processing unit program, enabled us to follow the dynamics in the centisecond timescale, and to obtain the force-extension profiles using experimental pulling speeds (v(f) = 1-10 μm/s) employed in atomic force microscopy and in optical tweezers-based dynamic force spectroscopy. We found that the mechanical molecular response critically depends on the conditions of force application and that the kinetics and pathways for unfolding change drastically even upon a modest 10-fold increase in v(f). This implies that, to resolve accurately the free energy landscape and to relate the results of single-molecule experiments in vitro and in silico, molecular simulations should be carried out under the experimentally relevant force loads. This can be accomplished in reasonable wall-clock time for biomolecules of size as large as 10(5) residues using the SOP-GPU package. © 2010 Wiley-Liss, Inc.
GPU acceleration towards real-time image reconstruction in 3D tomographic diffractive microscopy

NASA Astrophysics Data System (ADS)

Bailleul, J.; Simon, B.; Debailleul, M.; Liu, H.; Haeberlé, O.

2012-06-01

Phase microscopy techniques regained interest in allowing for the observation of unprepared specimens with excellent temporal resolution. Tomographic diffractive microscopy is an extension of holographic microscopy which permits 3D observations with a finer resolution than incoherent light microscopes. Specimens are imaged by a series of 2D holograms: their accumulation progressively fills the range of frequencies of the specimen in Fourier space. A 3D inverse FFT eventually provides a spatial image of the specimen. Consequently, acquisition then reconstruction are mandatory to produce an image that could prelude real-time control of the observed specimen. The MIPS Laboratory has built a tomographic diffractive microscope with an unsurpassed 130nm resolution but a low imaging speed - no less than one minute. Afterwards, a high-end PC reconstructs the 3D image in 20 seconds. We now expect an interactive system providing preview images during the acquisition for monitoring purposes. We first present a prototype implementing this solution on CPU: acquisition and reconstruction are tied in a producer-consumer scheme, sharing common data into CPU memory. Then we present a prototype dispatching some reconstruction tasks to GPU in order to take advantage of SIMDparallelization for FFT and higher bandwidth for filtering operations. The CPU scheme takes 6 seconds for a 3D image update while the GPU scheme can go down to 2 or > 1 seconds depending on the GPU class. This opens opportunities for 4D imaging of living organisms or crystallization processes. We also consider the relevance of GPU for 3D image interaction in our specific conditions.
GPU-Accelerated Molecular Dynamics Simulation to Study Liquid Crystal Phase Transition Using Coarse-Grained Gay-Berne Anisotropic Potential.

PubMed

Chen, Wenduo; Zhu, Youliang; Cui, Fengchao; Liu, Lunyang; Sun, Zhaoyan; Chen, Jizhong; Li, Yunqi

2016-01-01

Gay-Berne (GB) potential is regarded as an accurate model in the simulation of anisotropic particles, especially for liquid crystal (LC) mesogens. However, its computational complexity leads to an extremely time-consuming process for large systems. Here, we developed a GPU-accelerated molecular dynamics (MD) simulation with coarse-grained GB potential implemented in GALAMOST package to investigate the LC phase transitions for mesogens in small molecules, main-chain or side-chain polymers. For identical mesogens in three different molecules, on cooling from fully isotropic melts, the small molecules form a single-domain smectic-B phase, while the main-chain LC polymers prefer a single-domain nematic phase as a result of connective restraints in neighboring mesogens. The phase transition of side-chain LC polymers undergoes a two-step process: nucleation of nematic islands and formation of multi-domain nematic texture. The particular behavior originates in the fact that the rotational orientation of the mesogenes is hindered by the polymer backbones. Both the global distribution and the local orientation of mesogens are critical for the phase transition of anisotropic particles. Furthermore, compared with the MD simulation in LAMMPS, our GPU-accelerated code is about 4 times faster than the GPU version of LAMMPS and at least 200 times faster than the CPU version of LAMMPS. This study clearly shows that GPU-accelerated MD simulation with GB potential in GALAMOST can efficiently handle systems with anisotropic particles and interactions, and accurately explore phase differences originated from molecular structures.
GPU-Accelerated Molecular Dynamics Simulation to Study Liquid Crystal Phase Transition Using Coarse-Grained Gay-Berne Anisotropic Potential

PubMed Central

Cui, Fengchao; Liu, Lunyang; Sun, Zhaoyan; Chen, Jizhong; Li, Yunqi

2016-01-01

Gay-Berne (GB) potential is regarded as an accurate model in the simulation of anisotropic particles, especially for liquid crystal (LC) mesogens. However, its computational complexity leads to an extremely time-consuming process for large systems. Here, we developed a GPU-accelerated molecular dynamics (MD) simulation with coarse-grained GB potential implemented in GALAMOST package to investigate the LC phase transitions for mesogens in small molecules, main-chain or side-chain polymers. For identical mesogens in three different molecules, on cooling from fully isotropic melts, the small molecules form a single-domain smectic-B phase, while the main-chain LC polymers prefer a single-domain nematic phase as a result of connective restraints in neighboring mesogens. The phase transition of side-chain LC polymers undergoes a two-step process: nucleation of nematic islands and formation of multi-domain nematic texture. The particular behavior originates in the fact that the rotational orientation of the mesogenes is hindered by the polymer backbones. Both the global distribution and the local orientation of mesogens are critical for the phase transition of anisotropic particles. Furthermore, compared with the MD simulation in LAMMPS, our GPU-accelerated code is about 4 times faster than the GPU version of LAMMPS and at least 200 times faster than the CPU version of LAMMPS. This study clearly shows that GPU-accelerated MD simulation with GB potential in GALAMOST can efficiently handle systems with anisotropic particles and interactions, and accurately explore phase differences originated from molecular structures. PMID:26986851
Massively Parallel Signal Processing using the Graphics Processing Unit for Real-Time Brain-Computer Interface Feature Extraction.

PubMed

Wilson, J Adam; Williams, Justin C

2009-01-01

The clock speeds of modern computer processors have nearly plateaued in the past 5 years. Consequently, neural prosthetic systems that rely on processing large quantities of data in a short period of time face a bottleneck, in that it may not be possible to process all of the data recorded from an electrode array with high channel counts and bandwidth, such as electrocorticographic grids or other implantable systems. Therefore, in this study a method of using the processing capabilities of a graphics card [graphics processing unit (GPU)] was developed for real-time neural signal processing of a brain-computer interface (BCI). The NVIDIA CUDA system was used to offload processing to the GPU, which is capable of running many operations in parallel, potentially greatly increasing the speed of existing algorithms. The BCI system records many channels of data, which are processed and translated into a control signal, such as the movement of a computer cursor. This signal processing chain involves computing a matrix-matrix multiplication (i.e., a spatial filter), followed by calculating the power spectral density on every channel using an auto-regressive method, and finally classifying appropriate features for control. In this study, the first two computationally intensive steps were implemented on the GPU, and the speed was compared to both the current implementation and a central processing unit-based implementation that uses multi-threading. Significant performance gains were obtained with GPU processing: the current implementation processed 1000 channels of 250 ms in 933 ms, while the new GPU method took only 27 ms, an improvement of nearly 35 times.
An Analytical Overview of Spirituality in NANDA-I Taxonomies.

PubMed

Mesquita, Ana Cláudia; Caldeira, Sílvia; Chaves, Erika; Carvalho, Emilia Campos de

2017-03-01

To discuss the approach of spirituality in NANDA-I taxonomies, based on the elements that characterize this phenomenon. This study was based on concepts that are usually adopted in the literature for defining spirituality and on the analysis of the NANDA-I taxonomies from I to III. Spirituality is included in all taxonomies but all three are missing some attributes to guarantee the completeness of this dimension for nursing diagnosis. Taxonomy III makes different approaches to spirituality and some inconsistencies. Contribute to the development and review of the new proposal for taxonomy. Discutir a abordagem à espiritualidade nas taxonomias da NANDA-I, baseada nos elementos que caracterizam este fenômeno. MÉTODOS: Este estudo foi baseado em conceitos usualmente adotados na literatura de enfermagem para definir espiritualidade e na análise das taxonomias da NANDA-I, desde a I à III. A espiritualidade está incluída nas taxonomias, porém estas carecem de atributos do seu conceito. CONCLUSÕES: A taxonomia III faz diferentes abordagens à espiritualidade, porém com algumas inconsistências identificadas. IMPLICAÇÕES PARA A ENFERMAGEM: Esta análise pode contribuir para o desenvolvimento e revisão da taxonomia III. © 2017 NANDA International, Inc.
GPU-based Green’s function simulations of shear waves generated by an applied acoustic radiation force in elastic and viscoelastic models

NASA Astrophysics Data System (ADS)

Yang, Yiqun; Urban, Matthew W.; McGough, Robert J.

2018-05-01

Shear wave calculations induced by an acoustic radiation force are very time-consuming on desktop computers, and high-performance graphics processing units (GPUs) achieve dramatic reductions in the computation time for these simulations. The acoustic radiation force is calculated using the fast near field method and the angular spectrum approach, and then the shear waves are calculated in parallel with Green’s functions on a GPU. This combination enables rapid evaluation of shear waves for push beams with different spatial samplings and for apertures with different f/#. Relative to shear wave simulations that evaluate the same algorithm on an Intel i7 desktop computer, a high performance nVidia GPU reduces the time required for these calculations by a factor of 45 and 700 when applied to elastic and viscoelastic shear wave simulation models, respectively. These GPU-accelerated simulations also compared to measurements in different viscoelastic phantoms, and the results are similar. For parametric evaluations and for comparisons with measured shear wave data, shear wave simulations with the Green’s function approach are ideally suited for high-performance GPUs.
Efficient parallel implementation of active appearance model fitting algorithm on GPU.

PubMed

Wang, Jinwei; Ma, Xirong; Zhu, Yuanping; Sun, Jizhou

2014-01-01

The active appearance model (AAM) is one of the most powerful model-based object detecting and tracking methods which has been widely used in various situations. However, the high-dimensional texture representation causes very time-consuming computations, which makes the AAM difficult to apply to real-time systems. The emergence of modern graphics processing units (GPUs) that feature a many-core, fine-grained parallel architecture provides new and promising solutions to overcome the computational challenge. In this paper, we propose an efficient parallel implementation of the AAM fitting algorithm on GPUs. Our design idea is fine grain parallelism in which we distribute the texture data of the AAM, in pixels, to thousands of parallel GPU threads for processing, which makes the algorithm fit better into the GPU architecture. We implement our algorithm using the compute unified device architecture (CUDA) on the Nvidia's GTX 650 GPU, which has the latest Kepler architecture. To compare the performance of our algorithm with different data sizes, we built sixteen face AAM models of different dimensional textures. The experiment results show that our parallel AAM fitting algorithm can achieve real-time performance for videos even on very high-dimensional textures.
Efficient Parallel Implementation of Active Appearance Model Fitting Algorithm on GPU

PubMed Central

Wang, Jinwei; Ma, Xirong; Zhu, Yuanping; Sun, Jizhou

2014-01-01

The active appearance model (AAM) is one of the most powerful model-based object detecting and tracking methods which has been widely used in various situations. However, the high-dimensional texture representation causes very time-consuming computations, which makes the AAM difficult to apply to real-time systems. The emergence of modern graphics processing units (GPUs) that feature a many-core, fine-grained parallel architecture provides new and promising solutions to overcome the computational challenge. In this paper, we propose an efficient parallel implementation of the AAM fitting algorithm on GPUs. Our design idea is fine grain parallelism in which we distribute the texture data of the AAM, in pixels, to thousands of parallel GPU threads for processing, which makes the algorithm fit better into the GPU architecture. We implement our algorithm using the compute unified device architecture (CUDA) on the Nvidia's GTX 650 GPU, which has the latest Kepler architecture. To compare the performance of our algorithm with different data sizes, we built sixteen face AAM models of different dimensional textures. The experiment results show that our parallel AAM fitting algorithm can achieve real-time performance for videos even on very high-dimensional textures. PMID:24723812
An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU

DOE PAGES

Lyakh, Dmitry I.

2015-01-05

An efficient parallel tensor transpose algorithm is suggested for shared-memory computing units, namely, multicore CPU, Intel Xeon Phi, and NVidia GPU. The algorithm operates on dense tensors (multidimensional arrays) and is based on the optimization of cache utilization on x86 CPU and the use of shared memory on NVidia GPU. From the applied side, the ultimate goal is to minimize the overhead encountered in the transformation of tensor contractions into matrix multiplications in computer implementations of advanced methods of quantum many-body theory (e.g., in electronic structure theory and nuclear physics). A particular accent is made on higher-dimensional tensors that typicallymore » appear in the so-called multireference correlated methods of electronic structure theory. Depending on tensor dimensionality, the presented optimized algorithms can achieve an order of magnitude speedup on x86 CPUs and 2-3 times speedup on NVidia Tesla K20X GPU with respect to the na ve scattering algorithm (no memory access optimization). Furthermore, the tensor transpose routines developed in this work have been incorporated into a general-purpose tensor algebra library (TAL-SH).« less
Fast, Accurate and Shift-Varying Line Projections for Iterative Reconstruction Using the GPU

PubMed Central

Pratx, Guillem; Chinn, Garry; Olcott, Peter D.; Levin, Craig S.

2013-01-01

List-mode processing provides an efficient way to deal with sparse projections in iterative image reconstruction for emission tomography. An issue often reported is the tremendous amount of computation required by such algorithm. Each recorded event requires several back- and forward line projections. We investigated the use of the programmable graphics processing unit (GPU) to accelerate the line-projection operations and implement fully-3D list-mode ordered-subsets expectation-maximization for positron emission tomography (PET). We designed a reconstruction approach that incorporates resolution kernels, which model the spatially-varying physical processes associated with photon emission, transport and detection. Our development is particularly suitable for applications where the projection data is sparse, such as high-resolution, dynamic, and time-of-flight PET reconstruction. The GPU approach runs more than 50 times faster than an equivalent CPU implementation while image quality and accuracy are virtually identical. This paper describes in details how the GPU can be used to accelerate the line projection operations, even when the lines-of-response have arbitrary endpoint locations and shift-varying resolution kernels are used. A quantitative evaluation is included to validate the correctness of this new approach. PMID:19244015
An Approach in Radiation Therapy Treatment Planning: A Fast, GPU-Based Monte Carlo Method.

PubMed

Karbalaee, Mojtaba; Shahbazi-Gahrouei, Daryoush; Tavakoli, Mohammad B

2017-01-01

An accurate and fast radiation dose calculation is essential for successful radiation radiotherapy. The aim of this study was to implement a new graphic processing unit (GPU) based radiation therapy treatment planning for accurate and fast dose calculation in radiotherapy centers. A program was written for parallel running based on GPU. The code validation was performed by EGSnrc/DOSXYZnrc. Moreover, a semi-automatic, rotary, asymmetric phantom was designed and produced using a bone, the lung, and the soft tissue equivalent materials. All measurements were performed using a Mapcheck dosimeter. The accuracy of the code was validated using the experimental data, which was obtained from the anthropomorphic phantom as the gold standard. The findings showed that, compared with those of DOSXYZnrc in the virtual phantom and for most of the voxels (>95%), <3% dose-difference or 3 mm distance-to-agreement (DTA) was found. Moreover, considering the anthropomorphic phantom, compared to the Mapcheck dose measurements, <5% dose-difference or 5 mm DTA was observed. Fast calculation speed and high accuracy of GPU-based Monte Carlo method in dose calculation may be useful in routine radiation therapy centers as the core and main component of a treatment planning verification system.
Molecular Monte Carlo Simulations Using Graphics Processing Units: To Waste Recycle or Not?

PubMed

Kim, Jihan; Rodgers, Jocelyn M; Athènes, Manuel; Smit, Berend

2011-10-11

In the waste recycling Monte Carlo (WRMC) algorithm, (1) multiple trial states may be simultaneously generated and utilized during Monte Carlo moves to improve the statistical accuracy of the simulations, suggesting that such an algorithm may be well posed for implementation in parallel on graphics processing units (GPUs). In this paper, we implement two waste recycling Monte Carlo algorithms in CUDA (Compute Unified Device Architecture) using uniformly distributed random trial states and trial states based on displacement random-walk steps, and we test the methods on a methane-zeolite MFI framework system to evaluate their utility. We discuss the specific implementation details of the waste recycling GPU algorithm and compare the methods to other parallel algorithms optimized for the framework system. We analyze the relationship between the statistical accuracy of our simulations and the CUDA block size to determine the efficient allocation of the GPU hardware resources. We make comparisons between the GPU and the serial CPU Monte Carlo implementations to assess speedup over conventional microprocessors. Finally, we apply our optimized GPU algorithms to the important problem of determining free energy landscapes, in this case for molecular motion through the zeolite LTA.
Accelerating Monte Carlo simulations of photon transport in a voxelized geometry using a massively parallel graphics processing unit

DOE Office of Scientific and Technical Information (OSTI.GOV)

Badal, Andreu; Badano, Aldo

Purpose: It is a known fact that Monte Carlo simulations of radiation transport are computationally intensive and may require long computing times. The authors introduce a new paradigm for the acceleration of Monte Carlo simulations: The use of a graphics processing unit (GPU) as the main computing device instead of a central processing unit (CPU). Methods: A GPU-based Monte Carlo code that simulates photon transport in a voxelized geometry with the accurate physics models from PENELOPE has been developed using the CUDA programming model (NVIDIA Corporation, Santa Clara, CA). Results: An outline of the new code and a sample x-raymore » imaging simulation with an anthropomorphic phantom are presented. A remarkable 27-fold speed up factor was obtained using a GPU compared to a single core CPU. Conclusions: The reported results show that GPUs are currently a good alternative to CPUs for the simulation of radiation transport. Since the performance of GPUs is currently increasing at a faster pace than that of CPUs, the advantages of GPU-based software are likely to be more pronounced in the future.« less
Multilevel Summation of Electrostatic Potentials Using Graphics Processing Units*

PubMed Central

Hardy, David J.; Stone, John E.; Schulten, Klaus

2009-01-01

Physical and engineering practicalities involved in microprocessor design have resulted in flat performance growth for traditional single-core microprocessors. The urgent need for continuing increases in the performance of scientific applications requires the use of many-core processors and accelerators such as graphics processing units (GPUs). This paper discusses GPU acceleration of the multilevel summation method for computing electrostatic potentials and forces for a system of charged atoms, which is a problem of paramount importance in biomolecular modeling applications. We present and test a new GPU algorithm for the long-range part of the potentials that computes a cutoff pair potential between lattice points, essentially convolving a fixed 3-D lattice of “weights” over all sub-cubes of a much larger lattice. The implementation exploits the different memory subsystems provided on the GPU to stream optimally sized data sets through the multiprocessors. We demonstrate for the full multilevel summation calculation speedups of up to 26 using a single GPU and 46 using multiple GPUs, enabling the computation of a high-resolution map of the electrostatic potential for a system of 1.5 million atoms in under 12 seconds. PMID:20161132
GPU color space conversion

NASA Astrophysics Data System (ADS)

Chase, Patrick; Vondran, Gary

2011-01-01

Tetrahedral interpolation is commonly used to implement continuous color space conversions from sparse 3D and 4D lookup tables. We investigate the implementation and optimization of tetrahedral interpolation algorithms for GPUs, and compare to the best known CPU implementations as well as to a well known GPU-based trilinear implementation. We show that a 500 NVIDIA GTX-580 GPU is 3x faster than a 1000 Intel Core i7 980X CPU for 3D interpolation, and 9x faster for 4D interpolation. Performance-relevant GPU attributes are explored including thread scheduling, local memory characteristics, global memory hierarchy, and cache behaviors. We consider existing tetrahedral interpolation algorithms and tune based on the structure and branching capabilities of current GPUs. Global memory performance is improved by reordering and expanding the lookup table to ensure optimal access behaviors. Per multiprocessor local memory is exploited to implement optimally coalesced global memory accesses, and local memory addressing is optimized to minimize bank conflicts. We explore the impacts of lookup table density upon computation and memory access costs. Also presented are CPU-based 3D and 4D interpolators, using SSE vector operations that are faster than any previously published solution.
Parallel design of JPEG-LS encoder on graphics processing units

NASA Astrophysics Data System (ADS)

Duan, Hao; Fang, Yong; Huang, Bormin

2012-01-01

With recent technical advances in graphic processing units (GPUs), GPUs have outperformed CPUs in terms of compute capability and memory bandwidth. Many successful GPU applications to high performance computing have been reported. JPEG-LS is an ISO/IEC standard for lossless image compression which utilizes adaptive context modeling and run-length coding to improve compression ratio. However, adaptive context modeling causes data dependency among adjacent pixels and the run-length coding has to be performed in a sequential way. Hence, using JPEG-LS to compress large-volume hyperspectral image data is quite time-consuming. We implement an efficient parallel JPEG-LS encoder for lossless hyperspectral compression on a NVIDIA GPU using the computer unified device architecture (CUDA) programming technology. We use the block parallel strategy, as well as such CUDA techniques as coalesced global memory access, parallel prefix sum, and asynchronous data transfer. We also show the relation between GPU speedup and AVIRIS block size, as well as the relation between compression ratio and AVIRIS block size. When AVIRIS images are divided into blocks, each with 64×64 pixels, we gain the best GPU performance with 26.3x speedup over its original CPU code.
GPUbased, Microsecond Latency, HectoChannel MIMO Feedback Control of Magnetically Confined Plasmas

NASA Astrophysics Data System (ADS)

Rath, Nikolaus

Feedback control has become a crucial tool in the research on magnetic confinement of plasmas for achieving controlled nuclear fusion. This thesis presents a novel plasma feedback control system that, for the first time, employs a Graphics Processing Unit (GPU) for microsecond-latency, real-time control computations. This novel application area for GPU computing is opened up by a new system architecture that is optimized for low-latency computations on less than kilobyte sized data samples as they occur in typical plasma control algorithms. In contrast to traditional GPU computing approaches that target complex, high-throughput computations with massive amounts of data, the architecture presented in this thesis uses the GPU as the primary processing unit rather than as an auxiliary of the CPU, and data is transferred from A-D/D-A converters directly into GPU memory using peer-to-peer PCI Express transfers. The described design has been implemented in a new, GPU-based control system for the High-Beta Tokamak - Extended Pulse (HBT-EP) device. The system is built from commodity hardware and uses an NVIDIA GeForce GPU and D-TACQ A-D/D-A converters providing a total of 96 input and 64 output channels. The system is able to run with sampling periods down to 4 μs and latencies down to 8 μs. The GPU provides a total processing power of 1.5 x 1012 floating point operations per second. To illustrate the performance and versatility of both the general architecture and concrete implementation, a new control algorithm has been developed. The algorithm is designed for the control of multiple rotating magnetic perturbations in situations where the plasma equilibrium is not known exactly and features an adaptive system model: instead of requiring the rotation frequencies and growth rates embedded in the system model to be set a priori, the adaptive algorithm derives these parameters from the evolution of the perturbation amplitudes themselves. This results in non-linear control computations with high computational demands, but is handled easily by the GPU based system. Both digital processing latency and an arbitrary multi-pole response of amplifiers and control coils is fully taken into account for the generation of control signals. To separate sensor signals into perturbed and equilibrium components without knowledge of the equilibrium fields, a new separation method based on biorthogonal decomposition is introduced and used to derive a filter that performs the separation in real-time. The control algorithm has been implemented and tested on the new, GPU-based feedback control system of the HBT-EP tokamak. In this instance, the algorithm was set up to control four rotating n = 1 perturbations at different poloidal angles. The perturbations were treated as coupled in frequency but independent in amplitude and phase, so that the system effectively controls a helical n = 1 perturbation with unknown poloidal spectrum. Depending on the plasma's edge safety factor and rotation frequency, the control system is shown to be able to suppress the amplitude of the dominant 8 kHz mode by up to 60% or amplify the saturated amplitude by a factor of up to two. Intermediate feedback phases combine suppression and amplification with a speed up or slow down of the mode rotation frequency. Increasing feedback gain results in the excitation of an additional, slowly rotating 1.4 kHz mode without further effects on the 8 kHz mode. The feedback performance is found to exceed previous results obtained with an FPGA- and Kalman-filter based control system without requiring any tuning of system model parameters. Experimental results are compared with simulations based on a combination of the Boozer surface current model and the Fitzpatrick-Aydemir model. Within the subset of phenomena that can be represented by the model as well as determined experimentally, qualitative agreement is found.

GPU and APU computations of Finite Time Lyapunov Exponent fields

NASA Astrophysics Data System (ADS)

Conti, Christian; Rossinelli, Diego; Koumoutsakos, Petros

2012-03-01

We present GPU and APU accelerated computations of Finite-Time Lyapunov Exponent (FTLE) fields. The calculation of FTLEs is a computationally intensive process, as in order to obtain the sharp ridges associated with the Lagrangian Coherent Structures an extensive resampling of the flow field is required. The computational performance of this resampling is limited by the memory bandwidth of the underlying computer architecture. The present technique harnesses data-parallel execution of many-core architectures and relies on fast and accurate evaluations of moment conserving functions for the mesh to particle interpolations. We demonstrate how the computation of FTLEs can be efficiently performed on a GPU and on an APU through OpenCL and we report over one order of magnitude improvements over multi-threaded executions in FTLE computations of bluff body flows.
DOE Office of Scientific and Technical Information (OSTI.GOV)

Gallarno, George; Rogers, James H; Maxwell, Don E

The high computational capability of graphics processing units (GPUs) is enabling and driving the scientific discovery process at large-scale. The world s second fastest supercomputer for open science, Titan, has more than 18,000 GPUs that computational scientists use to perform scientific simu- lations and data analysis. Understanding of GPU reliability characteristics, however, is still in its nascent stage since GPUs have only recently been deployed at large-scale. This paper presents a detailed study of GPU errors and their impact on system operations and applications, describing experiences with the 18,688 GPUs on the Titan supercom- puter as well as lessons learnedmore » in the process of efficient operation of GPUs at scale. These experiences are helpful to HPC sites which already have large-scale GPU clusters or plan to deploy GPUs in the future.« less
EvoGraph: On-The-Fly Efficient Mining of Evolving Graphs on GPU

DOE Office of Scientific and Technical Information (OSTI.GOV)

Sengupta, Dipanjan; Song, Shuaiwen

With the prevalence of the World Wide Web and social networks, there has been a growing interest in high performance analytics for constantly-evolving dynamic graphs. Modern GPUs provide massive AQ1 amount of parallelism for efficient graph processing, but the challenges remain due to their lack of support for the near real-time streaming nature of dynamic graphs. Specifically, due to the current high volume and velocity of graph data combined with the complexity of user queries, traditional processing methods by first storing the updates and then repeatedly running static graph analytics on a sequence of versions or snapshots are deemed undesirablemore » and computational infeasible on GPU. We present EvoGraph, a highly efficient and scalable GPU- based dynamic graph analytics framework.« less
Noniterative Multireference Coupled Cluster Methods on Heterogeneous CPU-GPU Systems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bhaskaran-Nair, Kiran; Ma, Wenjing; Krishnamoorthy, Sriram

2013-04-09

A novel parallel algorithm for non-iterative multireference coupled cluster (MRCC) theories, which merges recently introduced reference-level parallelism (RLP) [K. Bhaskaran-Nair, J.Brabec, E. Aprà, H.J.J. van Dam, J. Pittner, K. Kowalski, J. Chem. Phys. 137, 094112 (2012)] with the possibility of accelerating numerical calculations using graphics processing unit (GPU) is presented. We discuss the performance of this algorithm on the example of the MRCCSD(T) method (iterative singles and doubles and perturbative triples), where the corrections due to triples are added to the diagonal elements of the MRCCSD (iterative singles and doubles) effective Hamiltonian matrix. The performance of the combined RLP/GPU algorithmmore » is illustrated on the example of the Brillouin-Wigner (BW) and Mukherjee (Mk) state-specific MRCCSD(T) formulations.« less
Mendel-GPU: haplotyping and genotype imputation on graphics processing units

PubMed Central

Chen, Gary K.; Wang, Kai; Stram, Alex H.; Sobel, Eric M.; Lange, Kenneth

2012-01-01

Motivation: In modern sequencing studies, one can improve the confidence of genotype calls by phasing haplotypes using information from an external reference panel of fully typed unrelated individuals. However, the computational demands are so high that they prohibit researchers with limited computational resources from haplotyping large-scale sequence data. Results: Our graphics processing unit based software delivers haplotyping and imputation accuracies comparable to competing programs at a fraction of the computational cost and peak memory demand. Availability: Mendel-GPU, our OpenCL software, runs on Linux platforms and is portable across AMD and nVidia GPUs. Users can download both code and documentation at http://code.google.com/p/mendel-gpu/. Contact: gary.k.chen@usc.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:22954633
A simple GPU-accelerated two-dimensional MUSCL-Hancock solver for ideal magnetohydrodynamics

NASA Astrophysics Data System (ADS)

Bard, Christopher M.; Dorelli, John C.

2014-02-01

We describe our experience using NVIDIA's CUDA (Compute Unified Device Architecture) C programming environment to implement a two-dimensional second-order MUSCL-Hancock ideal magnetohydrodynamics (MHD) solver on a GTX 480 Graphics Processing Unit (GPU). Taking a simple approach in which the MHD variables are stored exclusively in the global memory of the GTX 480 and accessed in a cache-friendly manner (without further optimizing memory access by, for example, staging data in the GPU's faster shared memory), we achieved a maximum speed-up of ≈126 for a 10242 grid relative to the sequential C code running on a single Intel Nehalem (2.8 GHz) core. This speedup is consistent with simple estimates based on the known floating point performance, memory throughput and parallel processing capacity of the GTX 480.
GPU-Accelerated Stony-Brook University 5-class Microphysics Scheme in WRF

NASA Astrophysics Data System (ADS)

Mielikainen, J.; Huang, B.; Huang, A.

2011-12-01

The Weather Research and Forecasting (WRF) model is a next-generation mesoscale numerical weather prediction system. Microphysics plays an important role in weather and climate prediction. Several bulk water microphysics schemes are available within the WRF, with different numbers of simulated hydrometeor classes and methods for estimating their size fall speeds, distributions and densities. Stony-Brook University scheme (SBU-YLIN) is a 5-class scheme with riming intensity predicted to account for mixed-phase processes. In the past few years, co-processing on Graphics Processing Units (GPUs) has been a disruptive technology in High Performance Computing (HPC). GPUs use the ever increasing transistor count for adding more processor cores. Therefore, GPUs are well suited for massively data parallel processing with high floating point arithmetic intensity. Thus, it is imperative to update legacy scientific applications to take advantage of this unprecedented increase in computing power. CUDA is an extension to the C programming language offering programming GPU's directly. It is designed so that its constructs allow for natural expression of data-level parallelism. A CUDA program is organized into two parts: a serial program running on the CPU and a CUDA kernel running on the GPU. The CUDA code consists of three computational phases: transmission of data into the global memory of the GPU, execution of the CUDA kernel, and transmission of results from the GPU into the memory of CPU. CUDA takes a bottom-up point of view of parallelism is which thread is an atomic unit of parallelism. Individual threads are part of groups called warps, within which every thread executes exactly the same sequence of instructions. To test SBU-YLIN, we used a CONtinental United States (CONUS) benchmark data set for 12 km resolution domain for October 24, 2001. A WRF domain is a geographic region of interest discretized into a 2-dimensional grid parallel to the ground. Each grid point has multiple levels, which correspond to various vertical heights in the atmosphere. The size of the CONUS 12 km domain is 433 x 308 horizontal grid points with 35 vertical levels. First, the entire SBU-YLIN Fortran code was rewritten in C in preparation of GPU accelerated version. After that, C code was verified against Fortran code for identical outputs. Default compiler options from WRF were used for gfortran and gcc compilers. The processing time for the original Fortran code is 12274 ms and 12893 ms for C version. The processing times for GPU implementation of SBU-YLIN microphysics scheme with I/O are 57.7 ms and 37.2 ms for 1 and 2 GPUs, respectively. The corresponding speedups are 213x and 330x compared to a Fortran implementation. Without I/O the speedup is 896x on 1 GPU. Obviously, ignoring I/O time speedup scales linearly with GPUs. Thus, 2 GPUs have a speedup of 1788x without I/O. Microphysics computation is just a small part of the whole WRF model. After having completely implemented WRF on GPU, the inputs for SBU-YLIN do not have to be transferred from CPU. Instead they are results of previous WRF modules. Therefore, the role of I/O is greatly diminished once all of WRF have been converted to run on GPUs. In the near future, we expect to have a WRF running completely on GPUs for a superior performance.
Development of a GPU Compatible Version of the Fast Radiation Code RRTMG

NASA Astrophysics Data System (ADS)

Iacono, M. J.; Mlawer, E. J.; Berthiaume, D.; Cady-Pereira, K. E.; Suarez, M.; Oreopoulos, L.; Lee, D.

2012-12-01

The absorption of solar radiation and emission/absorption of thermal radiation are crucial components of the physics that drive Earth's climate and weather. Therefore, accurate radiative transfer calculations are necessary for realistic climate and weather simulations. Efficient radiation codes have been developed for this purpose, but their accuracy requirements still necessitate that as much as 30% of the computational time of a GCM is spent computing radiative fluxes and heating rates. The overall computational expense constitutes a limitation on a GCM's predictive ability if it becomes an impediment to adding new physics to or increasing the spatial and/or vertical resolution of the model. The emergence of Graphics Processing Unit (GPU) technology, which will allow the parallel computation of multiple independent radiative calculations in a GCM, will lead to a fundamental change in the competition between accuracy and speed. Processing time previously consumed by radiative transfer will now be available for the modeling of other processes, such as physics parameterizations, without any sacrifice in the accuracy of the radiative transfer. Furthermore, fast radiation calculations can be performed much more frequently and will allow the modeling of radiative effects of rapid changes in the atmosphere. The fast radiation code RRTMG, developed at Atmospheric and Environmental Research (AER), is utilized operationally in many dynamical models throughout the world. We will present the results from the first stage of an effort to create a version of the RRTMG radiation code designed to run efficiently in a GPU environment. This effort will focus on the RRTMG implementation in GEOS-5. RRTMG has an internal pseudo-spectral vector of length of order 100 that, when combined with the much greater length of the global horizontal grid vector from which the radiation code is called in GEOS-5, makes RRTMG/GEOS-5 particularly suited to achieving a significant speed improvement through GPU technology. This large number of independent cases will allow us to take full advantage of the computational power of the latest GPUs, ensuring that all thread cores in the GPU remain active, a key criterion for obtaining significant speedup. The CUDA (Compute Unified Device Architecture) Fortran compiler developed by PGI and Nvidia will allow us to construct this parallel implementation on the GPU while remaining in the Fortran language. This implementation will scale very well across various CUDA-supported GPUs such as the recently released Fermi Nvidia cards. We will present the computational speed improvements of the GPU-compatible code relative to the standard CPU-based RRTMG with respect to a very large and diverse suite of atmospheric profiles. This suite will also be utilized to demonstrate the minimal impact of the code restructuring on the accuracy of radiation calculations. The GPU-compatible version of RRTMG will be directly applicable to future versions of GEOS-5, but it is also likely to provide significant associated benefits for other GCMs that employ RRTMG.
LU Factorization with Partial Pivoting for a Multi-CPU, Multi-GPU Shared Memory System

DOE Office of Scientific and Technical Information (OSTI.GOV)

Kurzak, Jakub; Luszczek, Pitior; Faverge, Mathieu

2012-03-01

LU factorization with partial pivoting is a canonical numerical procedure and the main component of the High Performance LINPACK benchmark. This article presents an implementation of the algorithm for a hybrid, shared memory, system with standard CPU cores and GPU accelerators. Performance in excess of one TeraFLOPS is achieved using four AMD Magny Cours CPUs and four NVIDIA Fermi GPUs.
Massively parallel simulator of optical coherence tomography of inhomogeneous turbid media.

PubMed

Malektaji, Siavash; Lima, Ivan T; Escobar I, Mauricio R; Sherif, Sherif S

2017-10-01

An accurate and practical simulator for Optical Coherence Tomography (OCT) could be an important tool to study the underlying physical phenomena in OCT such as multiple light scattering. Recently, many researchers have investigated simulation of OCT of turbid media, e.g., tissue, using Monte Carlo methods. The main drawback of these earlier simulators is the long computational time required to produce accurate results. We developed a massively parallel simulator of OCT of inhomogeneous turbid media that obtains both Class I diffusive reflectivity, due to ballistic and quasi-ballistic scattered photons, and Class II diffusive reflectivity due to multiply scattered photons. This Monte Carlo-based simulator is implemented on graphic processing units (GPUs), using the Compute Unified Device Architecture (CUDA) platform and programming model, to exploit the parallel nature of propagation of photons in tissue. It models an arbitrary shaped sample medium as a tetrahedron-based mesh and uses an advanced importance sampling scheme. This new simulator speeds up simulations of OCT of inhomogeneous turbid media by about two orders of magnitude. To demonstrate this result, we have compared the computation times of our new parallel simulator and its serial counterpart using two samples of inhomogeneous turbid media. We have shown that our parallel implementation reduced simulation time of OCT of the first sample medium from 407 min to 92 min by using a single GPU card, to 12 min by using 8 GPU cards and to 7 min by using 16 GPU cards. For the second sample medium, the OCT simulation time was reduced from 209 h to 35.6 h by using a single GPU card, and to 4.65 h by using 8 GPU cards, and to only 2 h by using 16 GPU cards. Therefore our new parallel simulator is considerably more practical to use than its central processing unit (CPU)-based counterpart. Our new parallel OCT simulator could be a practical tool to study the different physical phenomena underlying OCT, or to design OCT systems with improved performance. Copyright © 2017 Elsevier B.V. All rights reserved.
SU-F-BRD-02: Application of ARCHERRT-- A GPU-Based Monte Carlo Dose Engine for Radiation Therapy -- to Tomotherapy and Patient-Independent IMRT

DOE Office of Scientific and Technical Information (OSTI.GOV)

Su, L; Du, X; Liu, T

Purpose: As a module of ARCHER -- Accelerated Radiation-transport Computations in Heterogeneous EnviRonments, ARCHER{sub RT} is designed for RadioTherapy (RT) dose calculation. This paper describes the application of ARCHERRT on patient-dependent TomoTherapy and patient-independent IMRT. It also conducts a 'fair' comparison of different GPUs and multicore CPU. Methods: The source input used for patient-dependent TomoTherapy is phase space file (PSF) generated from optimized plan. For patient-independent IMRT, the open filed PSF is used for different cases. The intensity modulation is simulated by fluence map. The GEANT4 code is used as benchmark. DVH and gamma index test are employed to evaluatemore » the accuracy of ARCHER{sub RT} code. Some previous studies reported misleading speedups by comparing GPU code with serial CPU code. To perform a fairer comparison, we write multi-thread code with OpenMP to fully exploit computing potential of CPU. The hardware involved in this study are a 6-core Intel E5-2620 CPU and 6 NVIDIA M2090 GPUs, a K20 GPU and a K40 GPU. Results: Dosimetric results from ARCHER{sub RT} and GEANT4 show good agreement. The 2%/2mm gamma test pass rates for different clinical cases are 97.2% to 99.7%. A single M2090 GPU needs 50~79 seconds for the simulation to achieve a statistical error of 1% in the PTV. The K40 card is about 1.7∼1.8 times faster than M2090 card. Using 6 M2090 card, the simulation can be finished in about 10 seconds. For comparison, Intel E5-2620 needs 507∼879 seconds for the same simulation. Conclusion: We successfully applied ARCHER{sub RT} to Tomotherapy and patient-independent IMRT, and conducted a fair comparison between GPU and CPU performance. The ARCHER{sub RT} code is both accurate and efficient and may be used towards clinical applications.« less
SU-F-T-256: 4D IMRT Planning Using An Early Prototype GPU-Enabled Eclipse Workstation

DOE Office of Scientific and Technical Information (OSTI.GOV)

Hagan, A; Modiri, A; Sawant, A

Purpose: True 4D IMRT planning, based on simultaneous spatiotemporal optimization has been shown to significantly improve plan quality in lung radiotherapy. However, the high computational complexity associated with such planning represents a significant barrier to widespread clinical deployment. We introduce an early prototype GPU-enabled Eclipse workstation for inverse planning. To our knowledge, this is the first GPUintegrated Eclipse system demonstrating the potential for clinical translation of GPU computing on a major commercially-available TPS. Methods: The prototype system comprised of four NVIDIA Tesla K80 GPUs, with a maximum processing capability of 8.5 Tflops per K80 card. The system architecture consisted ofmore » three key modules: (i) a GPU-based inverse planning module using a highly-parallelizable, swarm intelligence-based global optimization algorithm, (ii) a GPU-based open-source b-spline deformable image registration module, Elastix, and (iii) a CUDA-based data management module. For evaluation, aperture fluence weights in an IMRT plan were optimized over 9 beams,166 apertures and 10 respiratory phases (14940 variables) for a lung cancer case (GTV = 95 cc, right lower lobe, 15 mm cranio-caudal motion). Sensitivity of the planning time and memory expense to parameter variations was quantified. Results: GPU-based inverse planning was significantly accelerated compared to its CPU counterpart (36 vs 488 min, for 10 phases, 10 search agents and 10 iterations). The optimized IMRT plan significantly improved OAR sparing compared to the original internal target volume (ITV)-based clinical plan, while maintaining prescribed tumor coverage. The dose-sparing improvements were: Esophagus Dmax 50%, Heart Dmax 42% and Spinal cord Dmax 25%. Conclusion: Our early prototype system demonstrates that through massive parallelization, computationally intense tasks such as 4D treatment planning can be accomplished in clinically feasible timeframes. With further optimization, such systems are expected to enable the eventual clinical translation of higher-dimensional and complex treatment planning strategies to significantly improve plan quality. This work was partially supported through research funding from National Institutes of Health (R01CA169102) and Varian Medical Systems, Palo Alto, CA, USA.« less
Comparison and Computational Performance of Tsunami-HySEA and MOST Models for the LANTEX 2013 scenario

NASA Astrophysics Data System (ADS)

González-Vida, Jose M.; Macías, Jorge; Mercado, Aurelio; Ortega, Sergio; Castro, Manuel J.

2017-04-01

Tsunami-HySEA model is used to simulate the Caribbean LANTEX 2013 scenario (LANTEX is the acronym for Large AtlaNtic Tsunami EXercise, which is carried out annually). The numerical simulation of the propagation and inundation phases, is performed with both models but using different mesh resolutions and nested meshes. Some comparisons with the MOST tsunami model available at the University of Puerto Rico (UPR) are made. Both models compare well for propagating tsunami waves in open sea, producing very similar results. In near-shore shallow waters, Tsunami-HySEA should be compared with the inundation version of MOST, since the propagation version of MOST is limited to deeper waters. Regarding the inundation phase, a 1 arc-sec (approximately 30 m) resolution mesh covering all of Puerto Rico, is used, and a three-level nested meshes technique implemented. In the inundation phase, larger differences between model results are observed. Nevertheless, the most striking difference resides in computational time; Tsunami-HySEA is coded using the advantages of GPU architecture, and can produce a 4 h simulation in a 60 arcsec resolution grid for the whole Caribbean Sea in less than 4 min with a single general-purpose GPU and as fast as 11 s with 32 general-purpose GPUs. In the inundation stage with nested meshes, approximately 8 hours of wall clock time is needed for a 2-h simulation in a single GPU (versus more than 2 days for the MOST inundation, running three different parts of the island—West, Center, East—at the same time due to memory limitations in MOST). When domain decomposition techniques are finally implemented by breaking up the computational domain into sub-domains and assigning a GPU to each sub-domain (multi-GPU Tsunami-HySEA version), we show that the wall clock time significantly decreases, allowing high-resolution inundation modelling in very short computational times, reducing, for example, if eight GPUs are used, the wall clock time to around 1 hour. Besides, these computational times are obtained using general-purpose GPU hardware.
TH-A-19A-04: Latent Uncertainties and Performance of a GPU-Implemented Pre-Calculated Track Monte Carlo Method

DOE Office of Scientific and Technical Information (OSTI.GOV)

Renaud, M; Seuntjens, J; Roberge, D

Purpose: Assessing the performance and uncertainty of a pre-calculated Monte Carlo (PMC) algorithm for proton and electron transport running on graphics processing units (GPU). While PMC methods have been described in the past, an explicit quantification of the latent uncertainty arising from recycling a limited number of tracks in the pre-generated track bank is missing from the literature. With a proper uncertainty analysis, an optimal pre-generated track bank size can be selected for a desired dose calculation uncertainty. Methods: Particle tracks were pre-generated for electrons and protons using EGSnrc and GEANT4, respectively. The PMC algorithm for track transport was implementedmore » on the CUDA programming framework. GPU-PMC dose distributions were compared to benchmark dose distributions simulated using general-purpose MC codes in the same conditions. A latent uncertainty analysis was performed by comparing GPUPMC dose values to a “ground truth” benchmark while varying the track bank size and primary particle histories. Results: GPU-PMC dose distributions and benchmark doses were within 1% of each other in voxels with dose greater than 50% of Dmax. In proton calculations, a submillimeter distance-to-agreement error was observed at the Bragg Peak. Latent uncertainty followed a Poisson distribution with the number of tracks per energy (TPE) and a track bank of 20,000 TPE produced a latent uncertainty of approximately 1%. Efficiency analysis showed a 937× and 508× gain over a single processor core running DOSXYZnrc for 16 MeV electrons in water and bone, respectively. Conclusion: The GPU-PMC method can calculate dose distributions for electrons and protons to a statistical uncertainty below 1%. The track bank size necessary to achieve an optimal efficiency can be tuned based on the desired uncertainty. Coupled with a model to calculate dose contributions from uncharged particles, GPU-PMC is a candidate for inverse planning of modulated electron radiotherapy and scanned proton beams. This work was supported in part by FRSQ-MSSS (Grant No. 22090), NSERC RG (Grant No. 432290) and CIHR MOP (Grant No. MOP-211360)« less
A fast GPU-based Monte Carlo simulation of proton transport with detailed modeling of nonelastic interactions.

PubMed

Wan Chan Tseung, H; Ma, J; Beltran, C

2015-06-01

Very fast Monte Carlo (MC) simulations of proton transport have been implemented recently on graphics processing units (GPUs). However, these MCs usually use simplified models for nonelastic proton-nucleus interactions. Our primary goal is to build a GPU-based proton transport MC with detailed modeling of elastic and nonelastic proton-nucleus collisions. Using the cuda framework, the authors implemented GPU kernels for the following tasks: (1) simulation of beam spots from our possible scanning nozzle configurations, (2) proton propagation through CT geometry, taking into account nuclear elastic scattering, multiple scattering, and energy loss straggling, (3) modeling of the intranuclear cascade stage of nonelastic interactions when they occur, (4) simulation of nuclear evaporation, and (5) statistical error estimates on the dose. To validate our MC, the authors performed (1) secondary particle yield calculations in proton collisions with therapeutically relevant nuclei, (2) dose calculations in homogeneous phantoms, (3) recalculations of complex head and neck treatment plans from a commercially available treatment planning system, and compared with (GEANT)4.9.6p2/TOPAS. Yields, energy, and angular distributions of secondaries from nonelastic collisions on various nuclei are in good agreement with the (GEANT)4.9.6p2 Bertini and Binary cascade models. The 3D-gamma pass rate at 2%-2 mm for treatment plan simulations is typically 98%. The net computational time on a NVIDIA GTX680 card, including all CPU-GPU data transfers, is ∼ 20 s for 1 × 10(7) proton histories. Our GPU-based MC is the first of its kind to include a detailed nuclear model to handle nonelastic interactions of protons with any nucleus. Dosimetric calculations are in very good agreement with (GEANT)4.9.6p2/TOPAS. Our MC is being integrated into a framework to perform fast routine clinical QA of pencil-beam based treatment plans, and is being used as the dose calculation engine in a clinically applicable MC-based IMPT treatment planning system. The detailed nuclear modeling will allow us to perform very fast linear energy transfer and neutron dose estimates on the GPU.
Novel materials based on chitosan, its derivatives and cellulose fibres

NASA Astrophysics Data System (ADS)

Fernandes, Susana Cristina de Matos

O presente trabalho tem como principal objectivo o desenvolvimento de novos materiais baseados em quitosano, seus derivados e celulose, na forma de nanofibras ou de papel. Em primeiro lugar procedeu-se a purificacao das amostras comerciais de quitosano e a sua caracterizacao exaustiva em termos morfologicos e fisicoquimicos. Devido a valores contraditorios encontrados na literatura relativamente a energia de superficie do quitosano, e tendo em conta a sua utilizacao como precursor de modificacoes quimicas e a sua aplicacao em misturas com outros materiais, realizou-se tambem um estudo sistematico da determinacao da energia de superficie do quitosano, da quitina e seus respectivos homologos monomericos, por medicao de ângulos de contacto Em todas as amostras comerciais destes polimeros identificaram-se impurezas nao polares que estao associadas a erros na determinacao da componente polar da energia de superficie. Apos a remocao destas impurezas, o valor da energia total de superficie (gs), e em particular da sua componente polar, aumentou consideravelmente. Depois de purificadas e caracterizadas, algumas das amostras de quitosano foram entao usadas na preparacao de filmes nanocompositos, nomeadamente dois quitosanos com diferentes graus de polimerizacao, correspondentes derivados soluveis em agua (cloreto de N-(3-(N,N,N-trimetilamonio)-2- hidroxipropilo) de quitosano) e nanofibras de celulose como reforco (celulose nanofibrilada (NFC) e celulose bacteriana (BC). Estes filmes transparentes foram preparados atraves de um processo simples e com conotacao 'verde' pela dispersao homogenea de diferentes teores de NFC (ate 60%) e BC (ate 40%) nas solucoes de quitosano (1.5% w/v) seguida da evaporacao do solvente. Os filmes obtidos foram depois caracterizados por diversas tecnicas, tais como SEM, AFM, difraccao de raio-X, TGA, DMA, ensaios de traccao e espectroscopia no visivel. Estes filmes sao altamente transparentes e apresentam melhores propriedades mecânicas e maior estabilidade termica do que os correspondentes filmes sem reforco. Outra abordagem deste trabalho envolveu o revestimento de folhas de papel de E. globulus com quitosano e dois derivados, um derivado fluorescente e um derivado soluvel em agua, numa maquina de revestimentos ('maquina de colagem') a escala piloto. Este estudo envolveu inicialmente a deposicao de 1 a 5 camadas do derivado de quitosano fluorescente sobre as folhas de papel de forma a estudar a sua distribuicao nas folhas em termos de espalhamento e penetracao, atraves de medicoes de reflectância e luminescencia. Os resultados mostraram que, por um lado, a distribuicao do quitosano na superficie era homogenea e que, por outro lado, a sua penetracao atraves dos poros do papel cessou apos tres deposicoes. Depois da terceira camada verificou-se a formacao de um filme continuo de quitosano sobre a superficie do papel. Estes resultados mostram que este derivado de quitosano fluorescente pode ser utilizado como marcador na optimizacao e compreensao de mecanismos de deposicao de quitosano em papel e outros substratos. Depois de conhecida a distribuicao do quitosano nas folhas de papel, estudou-se o efeito do revestimento de quitosano e do seu derivado soluvel em agua nas propriedades finais do papel. As propriedades morfologicas, mecânicas, superficiais, opticas, assim como a permeabilidade ao ar e ao vapor de agua, a aptidao a impressao e o envelhecimento do papel, foram exaustivamente avaliadas. De uma forma geral, os revestimentos com quitosano e com o seu derivado soluvel em agua tiveram um impacto positivo nas propriedades finais do papel, que se mostrou ser dependente do numero de camadas depositadas. Os resultados tambem mostraram que os papeis revestidos com o derivado soluvel em agua apresentaram melhores propriedades opticas, aptidao a impressao e melhores resultados em relacao ao envelhecimento do que os papeis revestidos com quitosano. Assim, o uso de derivados de quitosano soluveis em agua em processos de revestimento de papel representa uma estrategia bastante interessante e sustentavel para o desenvolvimento de novos materiais funcionais ou na melhoria das propriedades finais dos papeis. Por fim, tendo como objectivo valorizar os residuos e fraccoes menos nobres da quitina e do quitosano provenientes da industria transformadora, estes polimeros foram convertidos em poliois viscosos atraves de uma reaccao simples de oxipropilacao. Este processo tem tambem conotacao "verde" uma vez que nao requer solvente, nao origina subprodutos e nao exige nenhuma operacao especifica (separacao, purificacao, etc) para isolar o produto da reaccao. As amostras de quitina e quitosano foram pre-activadas com KOH e depois modificadas com um excesso de oxido de propileno (PO) num reactor apropriado. Em todos os casos, o produto da reaccao foi um liquido viscoso composto por quitina ou quitosano oxipropilados e homopolimero de PO. Estas duas fraccoes foram separadas e caracterizadas.
Performance analysis of the FDTD method applied to holographic volume gratings: Multi-core CPU versus GPU computing

NASA Astrophysics Data System (ADS)

Francés, J.; Bleda, S.; Neipp, C.; Márquez, A.; Pascual, I.; Beléndez, A.

2013-03-01

The finite-difference time-domain method (FDTD) allows electromagnetic field distribution analysis as a function of time and space. The method is applied to analyze holographic volume gratings (HVGs) for the near-field distribution at optical wavelengths. Usually, this application requires the simulation of wide areas, which implies more memory and time processing. In this work, we propose a specific implementation of the FDTD method including several add-ons for a precise simulation of optical diffractive elements. Values in the near-field region are computed considering the illumination of the grating by means of a plane wave for different angles of incidence and including absorbing boundaries as well. We compare the results obtained by FDTD with those obtained using a matrix method (MM) applied to diffraction gratings. In addition, we have developed two optimized versions of the algorithm, for both CPU and GPU, in order to analyze the improvement of using the new NVIDIA Fermi GPU architecture versus highly tuned multi-core CPU as a function of the size simulation. In particular, the optimized CPU implementation takes advantage of the arithmetic and data transfer streaming SIMD (single instruction multiple data) extensions (SSE) included explicitly in the code and also of multi-threading by means of OpenMP directives. A good agreement between the results obtained using both FDTD and MM methods is obtained, thus validating our methodology. Moreover, the performance of the GPU is compared to the SSE+OpenMP CPU implementation, and it is quantitatively determined that a highly optimized CPU program can be competitive for a wider range of simulation sizes, whereas GPU computing becomes more powerful for large-scale simulations.
Accelerating moderately stiff chemical kinetics in reactive-flow simulations using GPUs

NASA Astrophysics Data System (ADS)

Niemeyer, Kyle E.; Sung, Chih-Jen

2014-01-01

The chemical kinetics ODEs arising from operator-split reactive-flow simulations were solved on GPUs using explicit integration algorithms. Nonstiff chemical kinetics of a hydrogen oxidation mechanism (9 species and 38 irreversible reactions) were computed using the explicit fifth-order Runge-Kutta-Cash-Karp method, and the GPU-accelerated version performed faster than single- and six-core CPU versions by factors of 126 and 25, respectively, for 524,288 ODEs. Moderately stiff kinetics, represented with mechanisms for hydrogen/carbon-monoxide (13 species and 54 irreversible reactions) and methane (53 species and 634 irreversible reactions) oxidation, were computed using the stabilized explicit second-order Runge-Kutta-Chebyshev (RKC) algorithm. The GPU-based RKC implementation demonstrated an increase in performance of nearly 59 and 10 times, for problem sizes consisting of 262,144 ODEs and larger, than the single- and six-core CPU-based RKC algorithms using the hydrogen/carbon-monoxide mechanism. With the methane mechanism, RKC-GPU performed more than 65 and 11 times faster, for problem sizes consisting of 131,072 ODEs and larger, than the single- and six-core RKC-CPU versions, and up to 57 times faster than the six-core CPU-based implicit VODE algorithm on 65,536 ODEs. In the presence of more severe stiffness, such as ethylene oxidation (111 species and 1566 irreversible reactions), RKC-GPU performed more than 17 times faster than RKC-CPU on six cores for 32,768 ODEs and larger, and at best 4.5 times faster than VODE on six CPU cores for 65,536 ODEs. With a larger time step size, RKC-GPU performed at best 2.5 times slower than six-core VODE for 8192 ODEs and larger. Therefore, the need for developing new strategies for integrating stiff chemistry on GPUs was discussed.
GPU-based RFA simulation for minimally invasive cancer treatment of liver tumours.

PubMed

Mariappan, Panchatcharam; Weir, Phil; Flanagan, Ronan; Voglreiter, Philip; Alhonnoro, Tuomas; Pollari, Mika; Moche, Michael; Busse, Harald; Futterer, Jurgen; Portugaller, Horst Rupert; Sequeiros, Roberto Blanco; Kolesnik, Marina

2017-01-01

Radiofrequency ablation (RFA) is one of the most popular and well-standardized minimally invasive cancer treatments (MICT) for liver tumours, employed where surgical resection has been contraindicated. Less-experienced interventional radiologists (IRs) require an appropriate planning tool for the treatment to help avoid incomplete treatment and so reduce the tumour recurrence risk. Although a few tools are available to predict the ablation lesion geometry, the process is computationally expensive. Also, in our implementation, a few patient-specific parameters are used to improve the accuracy of the lesion prediction. Advanced heterogeneous computing using personal computers, incorporating the graphics processing unit (GPU) and the central processing unit (CPU), is proposed to predict the ablation lesion geometry. The most recent GPU technology is used to accelerate the finite element approximation of Penne's bioheat equation and a three state cell model. Patient-specific input parameters are used in the bioheat model to improve accuracy of the predicted lesion. A fast GPU-based RFA solver is developed to predict the lesion by doing most of the computational tasks in the GPU, while reserving the CPU for concurrent tasks such as lesion extraction based on the heat deposition at each finite element node. The solver takes less than 3 min for a treatment duration of 26 min. When the model receives patient-specific input parameters, the deviation between real and predicted lesion is below 3 mm. A multi-centre retrospective study indicates that the fast RFA solver is capable of providing the IR with the predicted lesion in the short time period before the intervention begins when the patient has been clinically prepared for the treatment.
The X-ray Crystal Structure of the Phage Tail Terminator Protein Reveals the Biologically Relevant Hexameric Rang Structure and Demonstrates a Conserved mechanism of Tail Termination among Divrse Long Tailed Phages

DOE Office of Scientific and Technical Information (OSTI.GOV)

Pell, L.; Liu, A; Edmonds, L

The tail terminator protein (TrP) plays an essential role in phage tail assembly by capping the rapidly polymerizing tail once it has reached its requisite length and serving as the interaction surface for phage heads. Here, we present the 2.7-A crystal structure of a hexameric ring of gpU, the TrP of phage ?. Using sequence alignment analysis and site-directed mutagenesis, we have shown that this multimeric structure is biologically relevant and we have delineated its functional surfaces. Comparison of the hexameric crystal structure with the solution structure of gpU that we previously solved using NMR spectroscopy shows large structural changesmore » occurring upon multimerization and suggests a mechanism that allows gpU to remain monomeric at high concentrations on its own, yet polymerize readily upon contact with an assembled tail tube. The gpU hexamer displays several flexible loops that play key roles in head and tail binding, implying a role for disorder-to-order transitions in controlling assembly as has been observed with other ? morphogenetic proteins. Finally, we have found that the hexameric structure of gpU is very similar to the structure of a putative TrP from a contractile phage tail even though it displays no detectable sequence similarity. This finding coupled with further bioinformatic investigations has led us to conclude that the TrPs of non-contractile-tailed phages, such as ?, are evolutionarily related to those of contractile-tailed phages, such as P2 and Mu, and that all long-tailed phages may utilize a conserved mechanism for tail termination.« less

Real-time computation of parameter fitting and image reconstruction using graphical processing units

NASA Astrophysics Data System (ADS)

Locans, Uldis; Adelmann, Andreas; Suter, Andreas; Fischer, Jannis; Lustermann, Werner; Dissertori, Günther; Wang, Qiulin

2017-06-01

In recent years graphical processing units (GPUs) have become a powerful tool in scientific computing. Their potential to speed up highly parallel applications brings the power of high performance computing to a wider range of users. However, programming these devices and integrating their use in existing applications is still a challenging task. In this paper we examined the potential of GPUs for two different applications. The first application, created at Paul Scherrer Institut (PSI), is used for parameter fitting during data analysis of μSR (muon spin rotation, relaxation and resonance) experiments. The second application, developed at ETH, is used for PET (Positron Emission Tomography) image reconstruction and analysis. Applications currently in use were examined to identify parts of the algorithms in need of optimization. Efficient GPU kernels were created in order to allow applications to use a GPU, to speed up the previously identified parts. Benchmarking tests were performed in order to measure the achieved speedup. During this work, we focused on single GPU systems to show that real time data analysis of these problems can be achieved without the need for large computing clusters. The results show that the currently used application for parameter fitting, which uses OpenMP to parallelize calculations over multiple CPU cores, can be accelerated around 40 times through the use of a GPU. The speedup may vary depending on the size and complexity of the problem. For PET image analysis, the obtained speedups of the GPU version were more than × 40 larger compared to a single core CPU implementation. The achieved results show that it is possible to improve the execution time by orders of magnitude.
Multi-GPU Acceleration of Branchless Distance Driven Projection and Backprojection for Clinical Helical CT.

PubMed

Mitra, Ayan; Politte, David G; Whiting, Bruce R; Williamson, Jeffrey F; O'Sullivan, Joseph A

2017-01-01

Model-based image reconstruction (MBIR) techniques have the potential to generate high quality images from noisy measurements and a small number of projections which can reduce the x-ray dose in patients. These MBIR techniques rely on projection and backprojection to refine an image estimate. One of the widely used projectors for these modern MBIR based technique is called branchless distance driven (DD) projection and backprojection. While this method produces superior quality images, the computational cost of iterative updates keeps it from being ubiquitous in clinical applications. In this paper, we provide several new parallelization ideas for concurrent execution of the DD projectors in multi-GPU systems using CUDA programming tools. We have introduced some novel schemes for dividing the projection data and image voxels over multiple GPUs to avoid runtime overhead and inter-device synchronization issues. We have also reduced the complexity of overlap calculation of the algorithm by eliminating the common projection plane and directly projecting the detector boundaries onto image voxel boundaries. To reduce the time required for calculating the overlap between the detector edges and image voxel boundaries, we have proposed a pre-accumulation technique to accumulate image intensities in perpendicular 2D image slabs (from a 3D image) before projection and after backprojection to ensure our DD kernels run faster in parallel GPU threads. For the implementation of our iterative MBIR technique we use a parallel multi-GPU version of the alternating minimization (AM) algorithm with penalized likelihood update. The time performance using our proposed reconstruction method with Siemens Sensation 16 patient scan data shows an average of 24 times speedup using a single TITAN X GPU and 74 times speedup using 3 TITAN X GPUs in parallel for combined projection and backprojection.
Performance evaluation of GPU parallelization, space-time adaptive algorithms, and their combination for simulating cardiac electrophysiology.

PubMed

Sachetto Oliveira, Rafael; Martins Rocha, Bernardo; Burgarelli, Denise; Meira, Wagner; Constantinides, Christakis; Weber Dos Santos, Rodrigo

2018-02-01

The use of computer models as a tool for the study and understanding of the complex phenomena of cardiac electrophysiology has attained increased importance nowadays. At the same time, the increased complexity of the biophysical processes translates into complex computational and mathematical models. To speed up cardiac simulations and to allow more precise and realistic uses, 2 different techniques have been traditionally exploited: parallel computing and sophisticated numerical methods. In this work, we combine a modern parallel computing technique based on multicore and graphics processing units (GPUs) and a sophisticated numerical method based on a new space-time adaptive algorithm. We evaluate each technique alone and in different combinations: multicore and GPU, multicore and GPU and space adaptivity, multicore and GPU and space adaptivity and time adaptivity. All the techniques and combinations were evaluated under different scenarios: 3D simulations on slabs, 3D simulations on a ventricular mouse mesh, ie, complex geometry, sinus-rhythm, and arrhythmic conditions. Our results suggest that multicore and GPU accelerate the simulations by an approximate factor of 33×, whereas the speedups attained by the space-time adaptive algorithms were approximately 48. Nevertheless, by combining all the techniques, we obtained speedups that ranged between 165 and 498. The tested methods were able to reduce the execution time of a simulation by more than 498× for a complex cellular model in a slab geometry and by 165× in a realistic heart geometry simulating spiral waves. The proposed methods will allow faster and more realistic simulations in a feasible time with no significant loss of accuracy. Copyright © 2017 John Wiley & Sons, Ltd.
Specialized Computer Systems for Environment Visualization

NASA Astrophysics Data System (ADS)

Al-Oraiqat, Anas M.; Bashkov, Evgeniy A.; Zori, Sergii A.

2018-06-01

The need for real time image generation of landscapes arises in various fields as part of tasks solved by virtual and augmented reality systems, as well as geographic information systems. Such systems provide opportunities for collecting, storing, analyzing and graphically visualizing geographic data. Algorithmic and hardware software tools for increasing the realism and efficiency of the environment visualization in 3D visualization systems are proposed. This paper discusses a modified path tracing algorithm with a two-level hierarchy of bounding volumes and finding intersections with Axis-Aligned Bounding Box. The proposed algorithm eliminates the branching and hence makes the algorithm more suitable to be implemented on the multi-threaded CPU and GPU. A modified ROAM algorithm is used to solve the qualitative visualization of reliefs' problems and landscapes. The algorithm is implemented on parallel systems—cluster and Compute Unified Device Architecture-networks. Results show that the implementation on MPI clusters is more efficient than Graphics Processing Unit/Graphics Processing Clusters and allows real-time synthesis. The organization and algorithms of the parallel GPU system for the 3D pseudo stereo image/video synthesis are proposed. With realizing possibility analysis on a parallel GPU-architecture of each stage, 3D pseudo stereo synthesis is performed. An experimental prototype of a specialized hardware-software system 3D pseudo stereo imaging and video was developed on the CPU/GPU. The experimental results show that the proposed adaptation of 3D pseudo stereo imaging to the architecture of GPU-systems is efficient. Also it accelerates the computational procedures of 3D pseudo-stereo synthesis for the anaglyph and anamorphic formats of the 3D stereo frame without performing optimization procedures. The acceleration is on average 11 and 54 times for test GPUs.
GPU acceleration of Runge Kutta-Fehlberg and its comparison with Dormand-Prince method

NASA Astrophysics Data System (ADS)

Seen, Wo Mei; Gobithaasan, R. U.; Miura, Kenjiro T.

2014-07-01

There is a significant reduction of processing time and speedup of performance in computer graphics with the emergence of Graphic Processing Units (GPUs). GPUs have been developed to surpass Central Processing Unit (CPU) in terms of performance and processing speed. This evolution has opened up a new area in computing and researches where highly parallel GPU has been used for non-graphical algorithms. Physical or phenomenal simulations and modelling can be accelerated through General Purpose Graphic Processing Units (GPGPU) and Compute Unified Device Architecture (CUDA) implementations. These phenomena can be represented with mathematical models in the form of Ordinary Differential Equations (ODEs) which encompasses the gist of change rate between independent and dependent variables. ODEs are numerically integrated over time in order to simulate these behaviours. The classical Runge-Kutta (RK) scheme is the common method used to numerically solve ODEs. The Runge Kutta Fehlberg (RKF) scheme has been specially developed to provide an estimate of the principal local truncation error at each step, known as embedding estimate technique. This paper delves into the implementation of RKF scheme for GPU devices and compares its result with Dorman Prince method. A pseudo code is developed to show the implementation in detail. Hence, practitioners will be able to understand the data allocation in GPU, formation of RKF kernels and the flow of data to/from GPU-CPU upon RKF kernel evaluation. The pseudo code is then written in C Language and two ODE models are executed to show the achievable speedup as compared to CPU implementation. The accuracy and efficiency of the proposed implementation method is discussed in the final section of this paper.
GPU-based simulation of optical propagation through turbulence for active and passive imaging

NASA Astrophysics Data System (ADS)

Monnier, Goulven; Duval, François-Régis; Amram, Solène

2014-10-01

IMOTEP is a GPU-based (Graphical Processing Units) software relying on a fast parallel implementation of Fresnel diffraction through successive phase screens. Its applications include active imaging, laser telemetry and passive imaging through turbulence with anisoplanatic spatial and temporal fluctuations. Thanks to parallel implementation on GPU, speedups ranging from 40X to 70X are achieved. The present paper gives a brief overview of IMOTEP models, algorithms, implementation and user interface. It then focuses on major improvements recently brought to the anisoplanatic imaging simulation method. Previously, we took advantage of the computational power offered by the GPU to develop a simulation method based on large series of deterministic realisations of the PSF distorted by turbulence. The phase screen propagation algorithm, by reproducing higher moments of the incident wavefront distortion, provides realistic PSFs. However, we first used a coarse gaussian model to fit the numerical PSFs and characterise there spatial statistics through only 3 parameters (two-dimensional displacements of centroid and width). Meanwhile, this approach was unable to reproduce the effects related to the details of the PSF structure, especially the "speckles" leading to prominent high-frequency content in short-exposure images. To overcome this limitation, we recently implemented a new empirical model of the PSF, based on Principal Components Analysis (PCA), ought to catch most of the PSF complexity. The GPU implementation allows estimating and handling efficiently the numerous (up to several hundreds) principal components typically required under the strong turbulence regime. A first demanding computational step involves PCA, phase screen propagation and covariance estimates. In a second step, realistic instantaneous images, fully accounting for anisoplanatic effects, are quickly generated. Preliminary results are presented.
SU-E-T-36: A GPU-Accelerated Monte-Carlo Dose Calculation Platform and Its Application Toward Validating a ViewRay Beam Model

DOE Office of Scientific and Technical Information (OSTI.GOV)

Wang, Y; Mazur, T; Green, O

Purpose: To build a fast, accurate and easily-deployable research platform for Monte-Carlo dose calculations. We port the dose calculation engine PENELOPE to C++, and accelerate calculations using GPU acceleration. Simulations of a Co-60 beam model provided by ViewRay demonstrate the capabilities of the platform. Methods: We built software that incorporates a beam model interface, CT-phantom model, GPU-accelerated PENELOPE engine, and GUI front-end. We rewrote the PENELOPE kernel in C++ (from Fortran) and accelerated the code on a GPU. We seamlessly integrated a Co-60 beam model (obtained from ViewRay) into our platform. Simulations of various field sizes and SSDs using amore » homogeneous water phantom generated PDDs, dose profiles, and output factors that were compared to experiment data. Results: With GPU acceleration using a dated graphics card (Nvidia Tesla C2050), a highly accurate simulation – including 100*100*100 grid, 3×3×3 mm3 voxels, <1% uncertainty, and 4.2×4.2 cm2 field size – runs 24 times faster (20 minutes versus 8 hours) than when parallelizing on 8 threads across a new CPU (Intel i7-4770). Simulated PDDs, profiles and output ratios for the commercial system agree well with experiment data measured using radiographic film or ionization chamber. Based on our analysis, this beam model is precise enough for general applications. Conclusions: Using a beam model for a Co-60 system provided by ViewRay, we evaluate a dose calculation platform that we developed. Comparison to measurements demonstrates the promise of our software for use as a research platform for dose calculations, with applications including quality assurance and treatment plan verification.« less
Algorithms of GPU-enabled reactive force field (ReaxFF) molecular dynamics.

PubMed

Zheng, Mo; Li, Xiaoxia; Guo, Li

2013-04-01

Reactive force field (ReaxFF), a recent and novel bond order potential, allows for reactive molecular dynamics (ReaxFF MD) simulations for modeling larger and more complex molecular systems involving chemical reactions when compared with computation intensive quantum mechanical methods. However, ReaxFF MD can be approximately 10-50 times slower than classical MD due to its explicit modeling of bond forming and breaking, the dynamic charge equilibration at each time-step, and its one order smaller time-step than the classical MD, all of which pose significant computational challenges in simulation capability to reach spatio-temporal scales of nanometers and nanoseconds. The very recent advances of graphics processing unit (GPU) provide not only highly favorable performance for GPU enabled MD programs compared with CPU implementations but also an opportunity to manage with the computing power and memory demanding nature imposed on computer hardware by ReaxFF MD. In this paper, we present the algorithms of GMD-Reax, the first GPU enabled ReaxFF MD program with significantly improved performance surpassing CPU implementations on desktop workstations. The performance of GMD-Reax has been benchmarked on a PC equipped with a NVIDIA C2050 GPU for coal pyrolysis simulation systems with atoms ranging from 1378 to 27,283. GMD-Reax achieved speedups as high as 12 times faster than Duin et al.'s FORTRAN codes in Lammps on 8 CPU cores and 6 times faster than the Lammps' C codes based on PuReMD in terms of the simulation time per time-step averaged over 100 steps. GMD-Reax could be used as a new and efficient computational tool for exploiting very complex molecular reactions via ReaxFF MD simulation on desktop workstations. Copyright © 2013 Elsevier Inc. All rights reserved.
GPU-based fast cone beam CT reconstruction from undersampled and noisy projection data via total variation.

PubMed

Jia, Xun; Lou, Yifei; Li, Ruijiang; Song, William Y; Jiang, Steve B

2010-04-01

Cone-beam CT (CBCT) plays an important role in image guided radiation therapy (IGRT). However, the large radiation dose from serial CBCT scans in most IGRT procedures raises a clinical concern, especially for pediatric patients who are essentially excluded from receiving IGRT for this reason. The goal of this work is to develop a fast GPU-based algorithm to reconstruct CBCT from undersampled and noisy projection data so as to lower the imaging dose. The CBCT is reconstructed by minimizing an energy functional consisting of a data fidelity term and a total variation regularization term. The authors developed a GPU-friendly version of the forward-backward splitting algorithm to solve this model. A multigrid technique is also employed. It is found that 20-40 x-ray projections are sufficient to reconstruct images with satisfactory quality for IGRT. The reconstruction time ranges from 77 to 130 s on an NVIDIA Tesla C1060 (NVIDIA, Santa Clara, CA) GPU card, depending on the number of projections used, which is estimated about 100 times faster than similar iterative reconstruction approaches. Moreover, phantom studies indicate that the algorithm enables the CBCT to be reconstructed under a scanning protocol with as low as 0.1 mA s/projection. Comparing with currently widely used full-fan head and neck scanning protocol of approximately 360 projections with 0.4 mA s/projection, it is estimated that an overall 36-72 times dose reduction has been achieved in our fast CBCT reconstruction algorithm. This work indicates that the developed GPU-based CBCT reconstruction algorithm is capable of lowering imaging dose considerably. The high computation efficiency in this algorithm makes the iterative CBCT reconstruction approach applicable in real clinical environments.
Evaluating Multi-core Architectures through Accelerating the Three-Dimensional Lax–Wendroff Correction

DOE Office of Scientific and Technical Information (OSTI.GOV)

You, Yang; Fu, Haohuan; Song, Shuaiwen

2014-07-18

Wave propagation forward modeling is a widely used computational method in oil and gas exploration. The iterative stencil loops in such problems have broad applications in scientific computing. However, executing such loops can be highly time time-consuming, which greatly limits application’s performance and power efficiency. In this paper, we accelerate the forward modeling technique on the latest multi-core and many-core architectures such as Intel Sandy Bridge CPUs, NVIDIA Fermi C2070 GPU, NVIDIA Kepler K20x GPU, and the Intel Xeon Phi Co-processor. For the GPU platforms, we propose two parallel strategies to explore the performance optimization opportunities for our stencil kernels.more » For Sandy Bridge CPUs and MIC, we also employ various optimization techniques in order to achieve the best.« less
A parallelization scheme of the periodic signals tracking algorithm for isochronous mass spectrometry on GPUs

NASA Astrophysics Data System (ADS)

Chen, R. J.; Wang, M.; Yan, X. L.; Yang, Q.; Lam, Y. H.; Yang, L.; Zhang, Y. H.

2017-12-01

The periodic signals tracking algorithm has been used to determine the revolution times of ions stored in storage rings in isochronous mass spectrometry (IMS) experiments. It has been a challenge to perform real-time data analysis by using the periodic signals tracking algorithm in the IMS experiments. In this paper, a parallelization scheme of the periodic signals tracking algorithm is introduced and a new program is developed. The computing time of data analysis can be reduced by a factor of ∼71 and of ∼346 by using our new program on Tesla C1060 GPU and Tesla K20c GPU, compared to using old program on Xeon E5540 CPU. We succeed in performing real-time data analysis for the IMS experiments by using the new program on Tesla K20c GPU.
A Simple GPU-Accelerated Two-Dimensional MUSCL-Hancock Solver for Ideal Magnetohydrodynamics

NASA Technical Reports Server (NTRS)

Bard, Christopher; Dorelli, John C.

2013-01-01

We describe our experience using NVIDIA's CUDA (Compute Unified Device Architecture) C programming environment to implement a two-dimensional second-order MUSCL-Hancock ideal magnetohydrodynamics (MHD) solver on a GTX 480 Graphics Processing Unit (GPU). Taking a simple approach in which the MHD variables are stored exclusively in the global memory of the GTX 480 and accessed in a cache-friendly manner (without further optimizing memory access by, for example, staging data in the GPU's faster shared memory), we achieved a maximum speed-up of approx. = 126 for a sq 1024 grid relative to the sequential C code running on a single Intel Nehalem (2.8 GHz) core. This speedup is consistent with simple estimates based on the known floating point performance, memory throughput and parallel processing capacity of the GTX 480.
Irregular large-scale computed tomography on multiple graphics processors improves energy-efficiency metrics for industrial applications

NASA Astrophysics Data System (ADS)

Jimenez, Edward S.; Goodman, Eric L.; Park, Ryeojin; Orr, Laurel J.; Thompson, Kyle R.

2014-09-01

This paper will investigate energy-efficiency for various real-world industrial computed-tomography reconstruction algorithms, both CPU- and GPU-based implementations. This work shows that the energy required for a given reconstruction is based on performance and problem size. There are many ways to describe performance and energy efficiency, thus this work will investigate multiple metrics including performance-per-watt, energy-delay product, and energy consumption. This work found that irregular GPU-based approaches1 realized tremendous savings in energy consumption when compared to CPU implementations while also significantly improving the performance-per- watt and energy-delay product metrics. Additional energy savings and other metric improvement was realized on the GPU-based reconstructions by improving storage I/O by implementing a parallel MIMD-like modularization of the compute and I/O tasks.
GPU-accelerated algorithms for compressed signals recovery with application to astronomical imagery deblurring

NASA Astrophysics Data System (ADS)

Fiandrotti, Attilio; Fosson, Sophie M.; Ravazzi, Chiara; Magli, Enrico

2018-04-01

Compressive sensing promises to enable bandwidth-efficient on-board compression of astronomical data by lifting the encoding complexity from the source to the receiver. The signal is recovered off-line, exploiting GPUs parallel computation capabilities to speedup the reconstruction process. However, inherent GPU hardware constraints limit the size of the recoverable signal and the speedup practically achievable. In this work, we design parallel algorithms that exploit the properties of circulant matrices for efficient GPU-accelerated sparse signals recovery. Our approach reduces the memory requirements, allowing us to recover very large signals with limited memory. In addition, it achieves a tenfold signal recovery speedup thanks to ad-hoc parallelization of matrix-vector multiplications and matrix inversions. Finally, we practically demonstrate our algorithms in a typical application of circulant matrices: deblurring a sparse astronomical image in the compressed domain.
Hypergraph partitioning implementation for parallelizing matrix-vector multiplication using CUDA GPU-based parallel computing

NASA Astrophysics Data System (ADS)

Murni, Bustamam, A.; Ernastuti, Handhika, T.; Kerami, D.

2017-07-01

Calculation of the matrix-vector multiplication in the real-world problems often involves large matrix with arbitrary size. Therefore, parallelization is needed to speed up the calculation process that usually takes a long time. Graph partitioning techniques that have been discussed in the previous studies cannot be used to complete the parallelized calculation of matrix-vector multiplication with arbitrary size. This is due to the assumption of graph partitioning techniques that can only solve the square and symmetric matrix. Hypergraph partitioning techniques will overcome the shortcomings of the graph partitioning technique. This paper addresses the efficient parallelization of matrix-vector multiplication through hypergraph partitioning techniques using CUDA GPU-based parallel computing. CUDA (compute unified device architecture) is a parallel computing platform and programming model that was created by NVIDIA and implemented by the GPU (graphics processing unit).
High-performance computing on GPUs for resistivity logging of oil and gas wells

NASA Astrophysics Data System (ADS)

Glinskikh, V.; Dudaev, A.; Nechaev, O.; Surodina, I.

2017-10-01

We developed and implemented into software an algorithm for high-performance simulation of electrical logs from oil and gas wells using high-performance heterogeneous computing. The numerical solution of the 2D forward problem is based on the finite-element method and the Cholesky decomposition for solving a system of linear algebraic equations (SLAE). Software implementations of the algorithm used the NVIDIA CUDA technology and computing libraries are made, allowing us to perform decomposition of SLAE and find its solution on central processor unit (CPU) and graphics processor unit (GPU). The calculation time is analyzed depending on the matrix size and number of its non-zero elements. We estimated the computing speed on CPU and GPU, including high-performance heterogeneous CPU-GPU computing. Using the developed algorithm, we simulated resistivity data in realistic models.
An Investigation of Unified Memory Access Performance in CUDA

PubMed Central

Landaverde, Raphael; Zhang, Tiansheng; Coskun, Ayse K.; Herbordt, Martin

2015-01-01

Managing memory between the CPU and GPU is a major challenge in GPU computing. A programming model, Unified Memory Access (UMA), has been recently introduced by Nvidia to simplify the complexities of memory management while claiming good overall performance. In this paper, we investigate this programming model and evaluate its performance and programming model simplifications based on our experimental results. We find that beyond on-demand data transfers to the CPU, the GPU is also able to request subsets of data it requires on demand. This feature allows UMA to outperform full data transfer methods for certain parallel applications and small data sizes. We also find, however, that for the majority of applications and memory access patterns, the performance overheads associated with UMA are significant, while the simplifications to the programming model restrict flexibility for adding future optimizations. PMID:26594668
Um supressor de fundo térmico para a câmara infravermelha CamIV

NASA Astrophysics Data System (ADS)

Jablonski, F.; Laporte, R.

2003-08-01

O ângulo sólido subtendido pelos pixels na câmara infravermelha do NexGal (CamIV) que operamos no OPD/LNA contém contribuições provenientes do sistema de coleta de fluxo propriamente dito - sendo esta a parte que interessa para as medidas astronômicas - e contribuições da obstrução central, sistema de suporte do espelho secundário e região exterior à pupila de entrada do telescópio. Estas últimas contribuições são devi-das à emissão de corpo negro à temperatura ambiente e aumentam exponencialmente para comprimentos de onda maiores que 2 micra (banda K, no infravermelho próximo). Embora a resultante pode ser quantificada e subtraída dos sinais relevantes, sua variância se adiciona à variância do sinal, e pode ser facilmente a contribuição domi-nante para a incerteza final das medidas, tornando ineficiente o processo de extração de informação e degradando a sensibilidade da câmara. A maneira clássica de resolver esse problema em sistemas ópticos que operam no infravermelho, onde os efeitos da emissão térmica do ambiente são importantes, é restringir o ângulo sólido subtendido pelos pixels individuais exclusivamente aos raios provenientes do sistema óptico. Para tanto, projeta-se uma imagem real, bastante reduzida, da pupila de entrada do sistema óptico num anteparo que transmita para o sistema de imageamento só o que interessa, bloqueando as contribuições das bordas externas à pupila de entrada, obstrução central do telescópio e sistema de suporte. Como a projeção é realizada em ambiente criogênico, a contribuição térmica espúria é efetivamente eliminada. Nós optamos por um sistema do tipo Offner para implementar na prática esta função. Trata-se de um sistema baseado em espelhos esféricos, bastante compacto e ajustado por construção. A opção por espelhos do mesmo material que o sistema de suporte (Alumínio) minimiza a dilatação diferencial, crítica nesse tipo de aplicação. Apresentamos as soluções detalhadas do projeto óptico-mecânico, bem como uma análise de flexões e desempenho em termos de qualidade de imagem.
Visual Media Reasoning - Terrain-based Geolocation

DTIC Science & Technology

2015-06-01

the drawings, specifications, or other data does not license the holder or any other person or corporation ; or convey any rights or permission to...3.4 Alternative Metric Investigation This section describes a graphics processor unit (GPU) based implementation in the NVIDIA CUDA programming...utilizing 2 concurrent CPU cores, each controlling a single Nvidia C2075 Tesla Fermi CUDA card. Figure 22 shows a comparison of the CPU and the GPU powered
Multi-Kepler GPU vs. multi-Intel MIC for spin systems simulations

NASA Astrophysics Data System (ADS)

Bernaschi, M.; Bisson, M.; Salvadore, F.

2014-10-01

We present and compare the performances of two many-core architectures: the Nvidia Kepler and the Intel MIC both in a single system and in cluster configuration for the simulation of spin systems. As a benchmark we consider the time required to update a single spin of the 3D Heisenberg spin glass model by using the Over-relaxation algorithm. We present data also for a traditional high-end multi-core architecture: the Intel Sandy Bridge. The results show that although on the two Intel architectures it is possible to use basically the same code, the performances of a Intel MIC change dramatically depending on (apparently) minor details. Another issue is that to obtain a reasonable scalability with the Intel Phi coprocessor (Phi is the coprocessor that implements the MIC architecture) in a cluster configuration it is necessary to use the so-called offload mode which reduces the performances of the single system. As to the GPU, the Kepler architecture offers a clear advantage with respect to the previous Fermi architecture maintaining exactly the same source code. Scalability of the multi-GPU implementation remains very good by using the CPU as a communication co-processor of the GPU. All source codes are provided for inspection and for double-checking the results.

Accelerating Smith-Waterman Algorithm for Biological Database Search on CUDA-Compatible GPUs

NASA Astrophysics Data System (ADS)

Munekawa, Yuma; Ino, Fumihiko; Hagihara, Kenichi

This paper presents a fast method capable of accelerating the Smith-Waterman algorithm for biological database search on a cluster of graphics processing units (GPUs). Our method is implemented using compute unified device architecture (CUDA), which is available on the nVIDIA GPU. As compared with previous methods, our method has four major contributions. (1) The method efficiently uses on-chip shared memory to reduce the data amount being transferred between off-chip video memory and processing elements in the GPU. (2) It also reduces the number of data fetches by applying a data reuse technique to query and database sequences. (3) A pipelined method is also implemented to overlap GPU execution with database access. (4) Finally, a master/worker paradigm is employed to accelerate hundreds of database searches on a cluster system. In experiments, the peak performance on a GeForce GTX 280 card reaches 8.32 giga cell updates per second (GCUPS). We also find that our method reduces the amount of data fetches to 1/140, achieving approximately three times higher performance than a previous CUDA-based method. Our 32-node cluster version is approximately 28 times faster than a single GPU version. Furthermore, the effective performance reaches 75.6 giga instructions per second (GIPS) using 32 GeForce 8800 GTX cards.
Towards robust algorithms for current deposition and dynamic load-balancing in a GPU particle in cell code

NASA Astrophysics Data System (ADS)

Rossi, Francesco; Londrillo, Pasquale; Sgattoni, Andrea; Sinigardi, Stefano; Turchetti, Giorgio

2012-12-01

We present `jasmine', an implementation of a fully relativistic, 3D, electromagnetic Particle-In-Cell (PIC) code, capable of running simulations in various laser plasma acceleration regimes on Graphics-Processing-Units (GPUs) HPC clusters. Standard energy/charge preserving FDTD-based algorithms have been implemented using double precision and quadratic (or arbitrary sized) shape functions for the particle weighting. When porting a PIC scheme to the GPU architecture (or, in general, a shared memory environment), the particle-to-grid operations (e.g. the evaluation of the current density) require special care to avoid memory inconsistencies and conflicts. Here we present a robust implementation of this operation that is efficient for any number of particles per cell and particle shape function order. Our algorithm exploits the exposed GPU memory hierarchy and avoids the use of atomic operations, which can hurt performance especially when many particles lay on the same cell. We show the code multi-GPU scalability results and present a dynamic load-balancing algorithm. The code is written using a python-based C++ meta-programming technique which translates in a high level of modularity and allows for easy performance tuning and simple extension of the core algorithms to various simulation schemes.
DOE Office of Scientific and Technical Information (OSTI.GOV)

Allada, Veerendra, Benjegerdes, Troy; Bode, Brett

Commodity clusters augmented with application accelerators are evolving as competitive high performance computing systems. The Graphical Processing Unit (GPU) with a very high arithmetic density and performance per price ratio is a good platform for the scientific application acceleration. In addition to the interconnect bottlenecks among the cluster compute nodes, the cost of memory copies between the host and the GPU device have to be carefully amortized to improve the overall efficiency of the application. Scientific applications also rely on efficient implementation of the BAsic Linear Algebra Subroutines (BLAS), among which the General Matrix Multiply (GEMM) is considered as themore » workhorse subroutine. In this paper, they study the performance of the memory copies and GEMM subroutines that are critical to port the computational chemistry algorithms to the GPU clusters. To that end, a benchmark based on the NetPIPE framework is developed to evaluate the latency and bandwidth of the memory copies between the host and the GPU device. The performance of the single and double precision GEMM subroutines from the NVIDIA CUBLAS 2.0 library are studied. The results have been compared with that of the BLAS routines from the Intel Math Kernel Library (MKL) to understand the computational trade-offs. The test bed is a Intel Xeon cluster equipped with NVIDIA Tesla GPUs.« less
CUDA Optimization Strategies for Compute- and Memory-Bound Neuroimaging Algorithms

PubMed Central

Lee, Daren; Dinov, Ivo; Dong, Bin; Gutman, Boris; Yanovsky, Igor; Toga, Arthur W.

2011-01-01

As neuroimaging algorithms and technology continue to grow faster than CPU performance in complexity and image resolution, data-parallel computing methods will be increasingly important. The high performance, data-parallel architecture of modern graphical processing units (GPUs) can reduce computational times by orders of magnitude. However, its massively threaded architecture introduces challenges when GPU resources are exceeded. This paper presents optimization strategies for compute- and memory-bound algorithms for the CUDA architecture. For compute-bound algorithms, the registers are reduced through variable reuse via shared memory and the data throughput is increased through heavier thread workloads and maximizing the thread configuration for a single thread block per multiprocessor. For memory-bound algorithms, fitting the data into the fast but limited GPU resources is achieved through reorganizing the data into self-contained structures and employing a multi-pass approach. Memory latencies are reduced by selecting memory resources whose cache performance are optimized for the algorithm's access patterns. We demonstrate the strategies on two computationally expensive algorithms and achieve optimized GPU implementations that perform up to 6× faster than unoptimized ones. Compared to CPU implementations, we achieve peak GPU speedups of 129× for the 3D unbiased nonlinear image registration technique and 93× for the non-local means surface denoising algorithm. PMID:21159404
Rapid motion compensation for prostate biopsy using GPU.

PubMed

Shen, Feimo; Narayanan, Ramkrishnan; Suri, Jasjit S

2008-01-01

Image-guided procedures have become routine in medicine. Due to the nature of three-dimensional (3-D) structure of the target organs, two-dimensional (2-D) image acquisition is gradually being replaced by 3-D imaging. Specifically in the diagnosis of prostate cancer, biopsy can be performed using 3-D transrectal ultrasound (TRUS) image guidance. Because prostatic cancers are multifocal, it is crucial to accurately guide biopsy needles towards planned targets. Further the gland tends to move due to external physical disturbances, discomfort introduced by the procedure or intrinsic peristalsis. As a result the exact position of the gland must be rapidly updated so as to correspond with the originally acquired 3-D TRUS volume prior to biopsy planning. A graphics processing unit (GPU) is used in this study to compute rapid updates performing 3-D motion compensation via registration of the live 2-D image and the acquired 3-D TRUS volume. The parallel computational framework on the GPU is exploited resulting in mean compute times of 0.46 seconds for updating the position of a live 2-D buffer image containing 91,000 pixels. A 2x sub-sampling resulted in a further improvement to 0.19 seconds. With the increase in GPU multiprocessors and sub-sampling, we observe that real time motion compensation can be achieved.
BALSA: integrated secondary analysis for whole-genome and whole-exome sequencing, accelerated by GPU.

PubMed

Luo, Ruibang; Wong, Yiu-Lun; Law, Wai-Chun; Lee, Lap-Kei; Cheung, Jeanno; Liu, Chi-Man; Lam, Tak-Wah

2014-01-01

This paper reports an integrated solution, called BALSA, for the secondary analysis of next generation sequencing data; it exploits the computational power of GPU and an intricate memory management to give a fast and accurate analysis. From raw reads to variants (including SNPs and Indels), BALSA, using just a single computing node with a commodity GPU board, takes 5.5 h to process 50-fold whole genome sequencing (∼750 million 100 bp paired-end reads), or just 25 min for 210-fold whole exome sequencing. BALSA's speed is rooted at its parallel algorithms to effectively exploit a GPU to speed up processes like alignment, realignment and statistical testing. BALSA incorporates a 16-genotype model to support the calling of SNPs and Indels and achieves competitive variant calling accuracy and sensitivity when compared to the ensemble of six popular variant callers. BALSA also supports efficient identification of somatic SNVs and CNVs; experiments showed that BALSA recovers all the previously validated somatic SNVs and CNVs, and it is more sensitive for somatic Indel detection. BALSA outputs variants in VCF format. A pileup-like SNAPSHOT format, while maintaining the same fidelity as BAM in variant calling, enables efficient storage and indexing, and facilitates the App development of downstream analyses. BALSA is available at: http://sourceforge.net/p/balsa.
Efficient Parallel Video Processing Techniques on GPU: From Framework to Implementation

PubMed Central

Su, Huayou; Wen, Mei; Wu, Nan; Ren, Ju; Zhang, Chunyuan

2014-01-01

Through reorganizing the execution order and optimizing the data structure, we proposed an efficient parallel framework for H.264/AVC encoder based on massively parallel architecture. We implemented the proposed framework by CUDA on NVIDIA's GPU. Not only the compute intensive components of the H.264 encoder are parallelized but also the control intensive components are realized effectively, such as CAVLC and deblocking filter. In addition, we proposed serial optimization methods, including the multiresolution multiwindow for motion estimation, multilevel parallel strategy to enhance the parallelism of intracoding as much as possible, component-based parallel CAVLC, and direction-priority deblocking filter. More than 96% of workload of H.264 encoder is offloaded to GPU. Experimental results show that the parallel implementation outperforms the serial program by 20 times of speedup ratio and satisfies the requirement of the real-time HD encoding of 30 fps. The loss of PSNR is from 0.14 dB to 0.77 dB, when keeping the same bitrate. Through the analysis to the kernels, we found that speedup ratios of the compute intensive algorithms are proportional with the computation power of the GPU. However, the performance of the control intensive parts (CAVLC) is much related to the memory bandwidth, which gives an insight for new architecture design. PMID:24757432
Rapid simulation of X-ray transmission imaging for baggage inspection via GPU-based ray-tracing

NASA Astrophysics Data System (ADS)

Gong, Qian; Stoian, Razvan-Ionut; Coccarelli, David S.; Greenberg, Joel A.; Vera, Esteban; Gehm, Michael E.

2018-01-01

We present a pipeline that rapidly simulates X-ray transmission imaging for arbitrary system architectures using GPU-based ray-tracing techniques. The purpose of the pipeline is to enable statistical analysis of threat detection in the context of airline baggage inspection. As a faster alternative to Monte Carlo methods, we adopt a deterministic approach for simulating photoelectric absorption-based imaging. The highly-optimized NVIDIA OptiX API is used to implement ray-tracing, greatly speeding code execution. In addition, we implement the first hierarchical representation structure to determine the interaction path length of rays traversing heterogeneous media described by layered polygons. The accuracy of the pipeline has been validated by comparing simulated data with experimental data collected using a heterogenous phantom and a laboratory X-ray imaging system. On a single computer, our approach allows us to generate over 400 2D transmission projections (125 × 125 pixels per frame) per hour for a bag packed with hundreds of everyday objects. By implementing our approach on cloud-based GPU computing platforms, we find that the same 2D projections of approximately 3.9 million bags can be obtained in a single day using 400 GPU instances, at a cost of only 0.001 per bag.
Acceleration of Linear Finite-Difference Poisson-Boltzmann Methods on Graphics Processing Units.

PubMed

Qi, Ruxi; Botello-Smith, Wesley M; Luo, Ray

2017-07-11

Electrostatic interactions play crucial roles in biophysical processes such as protein folding and molecular recognition. Poisson-Boltzmann equation (PBE)-based models have emerged as widely used in modeling these important processes. Though great efforts have been put into developing efficient PBE numerical models, challenges still remain due to the high dimensionality of typical biomolecular systems. In this study, we implemented and analyzed commonly used linear PBE solvers for the ever-improving graphics processing units (GPU) for biomolecular simulations, including both standard and preconditioned conjugate gradient (CG) solvers with several alternative preconditioners. Our implementation utilizes the standard Nvidia CUDA libraries cuSPARSE, cuBLAS, and CUSP. Extensive tests show that good numerical accuracy can be achieved given that the single precision is often used for numerical applications on GPU platforms. The optimal GPU performance was observed with the Jacobi-preconditioned CG solver, with a significant speedup over standard CG solver on CPU in our diversified test cases. Our analysis further shows that different matrix storage formats also considerably affect the efficiency of different linear PBE solvers on GPU, with the diagonal format best suited for our standard finite-difference linear systems. Further efficiency may be possible with matrix-free operations and integrated grid stencil setup specifically tailored for the banded matrices in PBE-specific linear systems.
Sub-second pencil beam dose calculation on GPU for adaptive proton therapy.

PubMed

da Silva, Joakim; Ansorge, Richard; Jena, Rajesh

2015-06-21

Although proton therapy delivered using scanned pencil beams has the potential to produce better dose conformity than conventional radiotherapy, the created dose distributions are more sensitive to anatomical changes and patient motion. Therefore, the introduction of adaptive treatment techniques where the dose can be monitored as it is being delivered is highly desirable. We present a GPU-based dose calculation engine relying on the widely used pencil beam algorithm, developed for on-line dose calculation. The calculation engine was implemented from scratch, with each step of the algorithm parallelized and adapted to run efficiently on the GPU architecture. To ensure fast calculation, it employs several application-specific modifications and simplifications, and a fast scatter-based implementation of the computationally expensive kernel superposition step. The calculation time for a skull base treatment plan using two beam directions was 0.22 s on an Nvidia Tesla K40 GPU, whereas a test case of a cubic target in water from the literature took 0.14 s to calculate. The accuracy of the patient dose distributions was assessed by calculating the γ-index with respect to a gold standard Monte Carlo simulation. The passing rates were 99.2% and 96.7%, respectively, for the 3%/3 mm and 2%/2 mm criteria, matching those produced by a clinical treatment planning system.
A parallel finite element procedure for contact-impact problems using edge-based smooth triangular element and GPU

NASA Astrophysics Data System (ADS)

Cai, Yong; Cui, Xiangyang; Li, Guangyao; Liu, Wenyang

2018-04-01

The edge-smooth finite element method (ES-FEM) can improve the computational accuracy of triangular shell elements and the mesh partition efficiency of complex models. In this paper, an approach is developed to perform explicit finite element simulations of contact-impact problems with a graphical processing unit (GPU) using a special edge-smooth triangular shell element based on ES-FEM. Of critical importance for this problem is achieving finer-grained parallelism to enable efficient data loading and to minimize communication between the device and host. Four kinds of parallel strategies are then developed to efficiently solve these ES-FEM based shell element formulas, and various optimization methods are adopted to ensure aligned memory access. Special focus is dedicated to developing an approach for the parallel construction of edge systems. A parallel hierarchy-territory contact-searching algorithm (HITA) and a parallel penalty function calculation method are embedded in this parallel explicit algorithm. Finally, the program flow is well designed, and a GPU-based simulation system is developed, using Nvidia's CUDA. Several numerical examples are presented to illustrate the high quality of the results obtained with the proposed methods. In addition, the GPU-based parallel computation is shown to significantly reduce the computing time.
GPU Lossless Hyperspectral Data Compression System for Space Applications

NASA Technical Reports Server (NTRS)

Keymeulen, Didier; Aranki, Nazeeh; Hopson, Ben; Kiely, Aaron; Klimesh, Matthew; Benkrid, Khaled

2012-01-01

On-board lossless hyperspectral data compression reduces data volume in order to meet NASA and DoD limited downlink capabilities. At JPL, a novel, adaptive and predictive technique for lossless compression of hyperspectral data, named the Fast Lossless (FL) algorithm, was recently developed. This technique uses an adaptive filtering method and achieves state-of-the-art performance in both compression effectiveness and low complexity. Because of its outstanding performance and suitability for real-time onboard hardware implementation, the FL compressor is being formalized as the emerging CCSDS Standard for Lossless Multispectral & Hyperspectral image compression. The FL compressor is well-suited for parallel hardware implementation. A GPU hardware implementation was developed for FL targeting the current state-of-the-art GPUs from NVIDIA(Trademark). The GPU implementation on a NVIDIA(Trademark) GeForce(Trademark) GTX 580 achieves a throughput performance of 583.08 Mbits/sec (44.85 MSamples/sec) and an acceleration of at least 6 times a software implementation running on a 3.47 GHz single core Intel(Trademark) Xeon(Trademark) processor. This paper describes the design and implementation of the FL algorithm on the GPU. The massively parallel implementation will provide in the future a fast and practical real-time solution for airborne and space applications.
CUDA optimization strategies for compute- and memory-bound neuroimaging algorithms.

PubMed

Lee, Daren; Dinov, Ivo; Dong, Bin; Gutman, Boris; Yanovsky, Igor; Toga, Arthur W

2012-06-01

As neuroimaging algorithms and technology continue to grow faster than CPU performance in complexity and image resolution, data-parallel computing methods will be increasingly important. The high performance, data-parallel architecture of modern graphical processing units (GPUs) can reduce computational times by orders of magnitude. However, its massively threaded architecture introduces challenges when GPU resources are exceeded. This paper presents optimization strategies for compute- and memory-bound algorithms for the CUDA architecture. For compute-bound algorithms, the registers are reduced through variable reuse via shared memory and the data throughput is increased through heavier thread workloads and maximizing the thread configuration for a single thread block per multiprocessor. For memory-bound algorithms, fitting the data into the fast but limited GPU resources is achieved through reorganizing the data into self-contained structures and employing a multi-pass approach. Memory latencies are reduced by selecting memory resources whose cache performance are optimized for the algorithm's access patterns. We demonstrate the strategies on two computationally expensive algorithms and achieve optimized GPU implementations that perform up to 6× faster than unoptimized ones. Compared to CPU implementations, we achieve peak GPU speedups of 129× for the 3D unbiased nonlinear image registration technique and 93× for the non-local means surface denoising algorithm. Copyright © 2010 Elsevier Ireland Ltd. All rights reserved.
MHD code using multi graphical processing units: SMAUG+

NASA Astrophysics Data System (ADS)

Gyenge, N.; Griffiths, M. K.; Erdélyi, R.

2018-01-01

This paper introduces the Sheffield Magnetohydrodynamics Algorithm Using GPUs (SMAUG+), an advanced numerical code for solving magnetohydrodynamic (MHD) problems, using multi-GPU systems. Multi-GPU systems facilitate the development of accelerated codes and enable us to investigate larger model sizes and/or more detailed computational domain resolutions. This is a significant advancement over the parent single-GPU MHD code, SMAUG (Griffiths et al., 2015). Here, we demonstrate the validity of the SMAUG + code, describe the parallelisation techniques and investigate performance benchmarks. The initial configuration of the Orszag-Tang vortex simulations are distributed among 4, 16, 64 and 100 GPUs. Furthermore, different simulation box resolutions are applied: 1000 × 1000, 2044 × 2044, 4000 × 4000 and 8000 × 8000 . We also tested the code with the Brio-Wu shock tube simulations with model size of 800 employing up to 10 GPUs. Based on the test results, we observed speed ups and slow downs, depending on the granularity and the communication overhead of certain parallel tasks. The main aim of the code development is to provide massively parallel code without the memory limitation of a single GPU. By using our code, the applied model size could be significantly increased. We demonstrate that we are able to successfully compute numerically valid and large 2D MHD problems.
GPU Accelerated Ultrasonic Tomography Using Propagation and Back Propagation Method

DTIC Science & Technology

2015-09-28

the medical imaging field using GPUs has been done for many years. In [1], Copeland et al. used 2D images , obtained by X - ray projections, to...Index Terms— Medical Imaging , Ultrasonic Tomography, GPU, CUDA, Parallel Computing I. INTRODUCTION GRAPHIC Processing Units (GPUs) are computation... Imaging Algorithm The process of reconstructing images from ultrasonic infor- mation starts with the following acoustical wave equation: ∂2 ∂t2 u ( x
A GPU Parallelization of the Absolute Nodal Coordinate Formulation for Applications in Flexible Multibody Dynamics

DTIC Science & Technology

2012-02-17

to be solved. Disclaimer: Reference herein to any specific commercial company , product, process, or service by trade name, trademark...data processing rather than data caching and control flow. To make use of this computational power, NVIDIA introduced a general purpose parallel...GPU implementations were run on an Intel Nehalem Xeon E5520 2.26GHz processor with an NVIDIA Tesla C2070 graphics card for varying numbers of
Accelerating Molecular Dynamic Simulation on Graphics Processing Units

PubMed Central

Friedrichs, Mark S.; Eastman, Peter; Vaidyanathan, Vishal; Houston, Mike; Legrand, Scott; Beberg, Adam L.; Ensign, Daniel L.; Bruns, Christopher M.; Pande, Vijay S.

2009-01-01

We describe a complete implementation of all-atom protein molecular dynamics running entirely on a graphics processing unit (GPU), including all standard force field terms, integration, constraints, and implicit solvent. We discuss the design of our algorithms and important optimizations needed to fully take advantage of a GPU. We evaluate its performance, and show that it can be more than 700 times faster than a conventional implementation running on a single CPU core. PMID:19191337
Online Tracking Algorithms on GPUs for the P̅ANDA Experiment at FAIR

NASA Astrophysics Data System (ADS)

Bianchi, L.; Herten, A.; Ritman, J.; Stockmanns, T.; Adinetz, A.; Kraus, J.; Pleiter, D.

2015-12-01

P̅ANDA is a future hadron and nuclear physics experiment at the FAIR facility in construction in Darmstadt, Germany. In contrast to the majority of current experiments, PANDA's strategy for data acquisition is based on event reconstruction from free-streaming data, performed in real time entirely by software algorithms using global detector information. This paper reports the status of the development of algorithms for the reconstruction of charged particle tracks, optimized online data processing applications, using General-Purpose Graphic Processing Units (GPU). Two algorithms for trackfinding, the Triplet Finder and the Circle Hough, are described, and details of their GPU implementations are highlighted. Average track reconstruction times of less than 100 ns are obtained running the Triplet Finder on state-of- the-art GPU cards. In addition, a proof-of-concept system for the dispatch of data to tracking algorithms using Message Queues is presented.
GPU-Accelerated Molecular Modeling Coming Of Age

PubMed Central

Stone, John E.; Hardy, David J.; Ufimtsev, Ivan S.

2010-01-01

Graphics processing units (GPUs) have traditionally been used in molecular modeling solely for visualization of molecular structures and animation of trajectories resulting from molecular dynamics simulations. Modern GPUs have evolved into fully programmable, massively parallel co-processors that can now be exploited to accelerate many scientific computations, typically providing about one order of magnitude speedup over CPU code and in special cases providing speedups of two orders of magnitude. This paper surveys the development of molecular modeling algorithms that leverage GPU computing, the advances already made and remaining issues to be resolved, and the continuing evolution of GPU technology that promises to become even more useful to molecular modeling. Hardware acceleration with commodity GPUs is expected to benefit the overall computational biology community by bringing teraflops performance to desktop workstations and in some cases potentially changing what were formerly batch-mode computational jobs into interactive tasks. PMID:20675161
Early Experiences Porting the NAMD and VMD Molecular Simulation and Analysis Software to GPU-Accelerated OpenPOWER Platforms

PubMed Central

Stone, John E.; Hynninen, Antti-Pekka; Phillips, James C.; Schulten, Klaus

2017-01-01

All-atom molecular dynamics simulations of biomolecules provide a powerful tool for exploring the structure and dynamics of large protein complexes within realistic cellular environments. Unfortunately, such simulations are extremely demanding in terms of their computational requirements, and they present many challenges in terms of preparation, simulation methodology, and analysis and visualization of results. We describe our early experiences porting the popular molecular dynamics simulation program NAMD and the simulation preparation, analysis, and visualization tool VMD to GPU-accelerated OpenPOWER hardware platforms. We report our experiences with compiler-provided autovectorization and compare with hand-coded vector intrinsics for the POWER8 CPU. We explore the performance benefits obtained from unique POWER8 architectural features such as 8-way SMT and its value for particular molecular modeling tasks. Finally, we evaluate the performance of several GPU-accelerated molecular modeling kernels and relate them to other hardware platforms. PMID:29202130

POM.gpu-v1.0: a GPU-based Princeton Ocean Model

NASA Astrophysics Data System (ADS)

Xu, S.; Huang, X.; Oey, L.-Y.; Xu, F.; Fu, H.; Zhang, Y.; Yang, G.

2015-09-01

Graphics processing units (GPUs) are an attractive solution in many scientific applications due to their high performance. However, most existing GPU conversions of climate models use GPUs for only a few computationally intensive regions. In the present study, we redesign the mpiPOM (a parallel version of the Princeton Ocean Model) with GPUs. Specifically, we first convert the model from its original Fortran form to a new Compute Unified Device Architecture C (CUDA-C) code, then we optimize the code on each of the GPUs, the communications between the GPUs, and the I / O between the GPUs and the central processing units (CPUs). We show that the performance of the new model on a workstation containing four GPUs is comparable to that on a powerful cluster with 408 standard CPU cores, and it reduces the energy consumption by a factor of 6.8.
Near-realtime simulations of biolelectric activity in small mammalian hearts using graphical processing units

PubMed Central

Vigmond, Edward J.; Boyle, Patrick M.; Leon, L. Joshua; Plank, Gernot

2014-01-01

Simulations of cardiac bioelectric phenomena remain a significant challenge despite continual advancements in computational machinery. Spanning large temporal and spatial ranges demands millions of nodes to accurately depict geometry, and a comparable number of timesteps to capture dynamics. This study explores a new hardware computing paradigm, the graphics processing unit (GPU), to accelerate cardiac models, and analyzes results in the context of simulating a small mammalian heart in real time. The ODEs associated with membrane ionic flow were computed on traditional CPU and compared to GPU performance, for one to four parallel processing units. The scalability of solving the PDE responsible for tissue coupling was examined on a cluster using up to 128 cores. Results indicate that the GPU implementation was between 9 and 17 times faster than the CPU implementation and scaled similarly. Solving the PDE was still 160 times slower than real time. PMID:19964295
GPU implementation of the linear scaling three dimensional fragment method for large scale electronic structure calculations

NASA Astrophysics Data System (ADS)

Jia, Weile; Wang, Jue; Chi, Xuebin; Wang, Lin-Wang

2017-02-01

LS3DF, namely linear scaling three-dimensional fragment method, is an efficient linear scaling ab initio total energy electronic structure calculation code based on a divide-and-conquer strategy. In this paper, we present our GPU implementation of the LS3DF code. Our test results show that the GPU code can calculate systems with about ten thousand atoms fully self-consistently in the order of 10 min using thousands of computing nodes. This makes the electronic structure calculations of 10,000-atom nanosystems routine work. This speed is 4.5-6 times faster than the CPU calculations using the same number of nodes on the Titan machine in the Oak Ridge leadership computing facility (OLCF). Such speedup is achieved by (a) carefully re-designing of the computationally heavy kernels; (b) redesign of the communication pattern for heterogeneous supercomputers.
GPU-accelerated simulations of isolated black holes

NASA Astrophysics Data System (ADS)

Lewis, Adam G. M.; Pfeiffer, Harald P.

2018-05-01

We present a port of the numerical relativity code SpEC which is capable of running on NVIDIA GPUs. Since this code must be maintained in parallel with SpEC itself, a primary design consideration is to perform as few explicit code changes as possible. We therefore rely on a hierarchy of automated porting strategies. At the highest level we use TLoops, a C++ library of our design, to automatically emit CUDA code equivalent to tensorial expressions written into C++ source using a syntax similar to analytic calculation. Next, we trace out and cache explicit matrix representations of the numerous linear transformations in the SpEC code, which allows these to be performed on the GPU using pre-existing matrix-multiplication libraries. We port the few remaining important modules by hand. In this paper we detail the specifics of our port, and present benchmarks of it simulating isolated black hole spacetimes on several generations of NVIDIA GPU.
GPU Based Software Correlators - Perspectives for VLBI2010

NASA Technical Reports Server (NTRS)

Hobiger, Thomas; Kimura, Moritaka; Takefuji, Kazuhiro; Oyama, Tomoaki; Koyama, Yasuhiro; Kondo, Tetsuro; Gotoh, Tadahiro; Amagai, Jun

2010-01-01

Caused by historical separation and driven by the requirements of the PC gaming industry, Graphics Processing Units (GPUs) have evolved to massive parallel processing systems which entered the area of non-graphic related applications. Although a single processing core on the GPU is much slower and provides less functionality than its counterpart on the CPU, the huge number of these small processing entities outperforms the classical processors when the application can be parallelized. Thus, in recent years various radio astronomical projects have started to make use of this technology either to realize the correlator on this platform or to establish the post-processing pipeline with GPUs. Therefore, the feasibility of GPUs as a choice for a VLBI correlator is being investigated, including pros and cons of this technology. Additionally, a GPU based software correlator will be reviewed with respect to energy consumption/GFlop/sec and cost/GFlop/sec.
High-throughput GPU-based LDPC decoding

NASA Astrophysics Data System (ADS)

Chang, Yang-Lang; Chang, Cheng-Chun; Huang, Min-Yu; Huang, Bormin

2010-08-01

Low-density parity-check (LDPC) code is a linear block code known to approach the Shannon limit via the iterative sum-product algorithm. LDPC codes have been adopted in most current communication systems such as DVB-S2, WiMAX, WI-FI and 10GBASE-T. LDPC for the needs of reliable and flexible communication links for a wide variety of communication standards and configurations have inspired the demand for high-performance and flexibility computing. Accordingly, finding a fast and reconfigurable developing platform for designing the high-throughput LDPC decoder has become important especially for rapidly changing communication standards and configurations. In this paper, a new graphic-processing-unit (GPU) LDPC decoding platform with the asynchronous data transfer is proposed to realize this practical implementation. Experimental results showed that the proposed GPU-based decoder achieved 271x speedup compared to its CPU-based counterpart. It can serve as a high-throughput LDPC decoder.
Accelerating Biomedical Signal Processing Using GPU: A Case Study of Snore Sound Feature Extraction.

PubMed

Guo, Jian; Qian, Kun; Zhang, Gongxuan; Xu, Huijie; Schuller, Björn

2017-12-01

The advent of 'Big Data' and 'Deep Learning' offers both, a great challenge and a huge opportunity for personalised health-care. In machine learning-based biomedical data analysis, feature extraction is a key step for 'feeding' the subsequent classifiers. With increasing numbers of biomedical data, extracting features from these 'big' data is an intensive and time-consuming task. In this case study, we employ a Graphics Processing Unit (GPU) via Python to extract features from a large corpus of snore sound data. Those features can subsequently be imported into many well-known deep learning training frameworks without any format processing. The snore sound data were collected from several hospitals (20 subjects, with 770-990 MB per subject - in total 17.20 GB). Experimental results show that our GPU-based processing significantly speeds up the feature extraction phase, by up to seven times, as compared to the previous CPU system.
GPU Particle Tracking and MHD Simulations with Greatly Enhanced Computational Speed

NASA Astrophysics Data System (ADS)

Ziemba, T.; O'Donnell, D.; Carscadden, J.; Cash, M.; Winglee, R.; Harnett, E.

2008-12-01

GPUs are intrinsically highly parallelized systems that provide more than an order of magnitude computing speed over a CPU based systems, for less cost than a high end-workstation. Recent advancements in GPU technologies allow for full IEEE float specifications with performance up to several hundred GFLOPs per GPU, and new software architectures have recently become available to ease the transition from graphics based to scientific applications. This allows for a cheap alternative to standard supercomputing methods and should increase the time to discovery. 3-D particle tracking and MHD codes have been developed using NVIDIA's CUDA and have demonstrated speed up of nearly a factor of 20 over equivalent CPU versions of the codes. Such a speed up enables new applications to develop, including real time running of radiation belt simulations and real time running of global magnetospheric simulations, both of which could provide important space weather prediction tools.
More IMPATIENT: A Gridding-Accelerated Toeplitz-based Strategy for Non-Cartesian High-Resolution 3D MRI on GPUs

PubMed Central

Gai, Jiading; Obeid, Nady; Holtrop, Joseph L.; Wu, Xiao-Long; Lam, Fan; Fu, Maojing; Haldar, Justin P.; Hwu, Wen-mei W.; Liang, Zhi-Pei; Sutton, Bradley P.

2013-01-01

Several recent methods have been proposed to obtain significant speed-ups in MRI image reconstruction by leveraging the computational power of GPUs. Previously, we implemented a GPU-based image reconstruction technique called the Illinois Massively Parallel Acquisition Toolkit for Image reconstruction with ENhanced Throughput in MRI (IMPATIENT MRI) for reconstructing data collected along arbitrary 3D trajectories. In this paper, we improve IMPATIENT by removing computational bottlenecks by using a gridding approach to accelerate the computation of various data structures needed by the previous routine. Further, we enhance the routine with capabilities for off-resonance correction and multi-sensor parallel imaging reconstruction. Through implementation of optimized gridding into our iterative reconstruction scheme, speed-ups of more than a factor of 200 are provided in the improved GPU implementation compared to the previous accelerated GPU code. PMID:23682203
GPU-accelerated molecular modeling coming of age.

PubMed

Stone, John E; Hardy, David J; Ufimtsev, Ivan S; Schulten, Klaus

2010-09-01

Graphics processing units (GPUs) have traditionally been used in molecular modeling solely for visualization of molecular structures and animation of trajectories resulting from molecular dynamics simulations. Modern GPUs have evolved into fully programmable, massively parallel co-processors that can now be exploited to accelerate many scientific computations, typically providing about one order of magnitude speedup over CPU code and in special cases providing speedups of two orders of magnitude. This paper surveys the development of molecular modeling algorithms that leverage GPU computing, the advances already made and remaining issues to be resolved, and the continuing evolution of GPU technology that promises to become even more useful to molecular modeling. Hardware acceleration with commodity GPUs is expected to benefit the overall computational biology community by bringing teraflops performance to desktop workstations and in some cases potentially changing what were formerly batch-mode computational jobs into interactive tasks. (c) 2010 Elsevier Inc. All rights reserved.
A Survey Of Techniques for Managing and Leveraging Caches in GPUs

DOE Office of Scientific and Technical Information (OSTI.GOV)

Mittal, Sparsh

2014-09-01

Initially introduced as special-purpose accelerators for graphics applications, graphics processing units (GPUs) have now emerged as general purpose computing platforms for a wide range of applications. To address the requirements of these applications, modern GPUs include sizable hardware-managed caches. However, several factors, such as unique architecture of GPU, rise of CPU–GPU heterogeneous computing, etc., demand effective management of caches to achieve high performance and energy efficiency. Recently, several techniques have been proposed for this purpose. In this paper, we survey several architectural and system-level techniques proposed for managing and leveraging GPU caches. We also discuss the importance and challenges ofmore » cache management in GPUs. The aim of this paper is to provide the readers insights into cache management techniques for GPUs and motivate them to propose even better techniques for leveraging the full potential of caches in the GPUs of tomorrow.« less
Classification of hyperspectral imagery using MapReduce on a NVIDIA graphics processing unit (Conference Presentation)

NASA Astrophysics Data System (ADS)

Ramirez, Andres; Rahnemoonfar, Maryam

2017-04-01

A hyperspectral image provides multidimensional figure rich in data consisting of hundreds of spectral dimensions. Analyzing the spectral and spatial information of such image with linear and non-linear algorithms will result in high computational time. In order to overcome this problem, this research presents a system using a MapReduce-Graphics Processing Unit (GPU) model that can help analyzing a hyperspectral image through the usage of parallel hardware and a parallel programming model, which will be simpler to handle compared to other low-level parallel programming models. Additionally, Hadoop was used as an open-source version of the MapReduce parallel programming model. This research compared classification accuracy results and timing results between the Hadoop and GPU system and tested it against the following test cases: the CPU and GPU test case, a CPU test case and a test case where no dimensional reduction was applied.
Rapid Parallel Calculation of shell Element Based On GPU

NASA Astrophysics Data System (ADS)

Wanga, Jian Hua; Lia, Guang Yao; Lib, Sheng; Li, Guang Yao

2010-06-01

Long computing time bottlenecked the application of finite element. In this paper, an effective method to speed up the FEM calculation by using the existing modern graphic processing unit and programmable colored rendering tool was put forward, which devised the representation of unit information in accordance with the features of GPU, converted all the unit calculation into film rendering process, solved the simulation work of all the unit calculation of the internal force, and overcame the shortcomings of lowly parallel level appeared ever before when it run in a single computer. Studies shown that this method could improve efficiency and shorten calculating hours greatly. The results of emulation calculation about the elasticity problem of large number cells in the sheet metal proved that using the GPU parallel simulation calculation was faster than using the CPU's. It is useful and efficient to solve the project problems in this way.
CUDA Fortran acceleration for the finite-difference time-domain method

NASA Astrophysics Data System (ADS)

Hadi, Mohammed F.; Esmaeili, Seyed A.

2013-05-01

A detailed description of programming the three-dimensional finite-difference time-domain (FDTD) method to run on graphical processing units (GPUs) using CUDA Fortran is presented. Two FDTD-to-CUDA thread-block mapping designs are investigated and their performances compared. Comparative assessment of trade-offs between GPU's shared memory and L1 cache is also discussed. This presentation is for the benefit of FDTD programmers who work exclusively with Fortran and are reluctant to port their codes to C in order to utilize GPU computing. The derived CUDA Fortran code is compared with an optimized CPU version that runs on a workstation-class CPU to present a realistic GPU to CPU run time comparison and thus help in making better informed investment decisions on FDTD code redesigns and equipment upgrades. All analyses are mirrored with CUDA C simulations to put in perspective the present state of CUDA Fortran development.
Sailfish: A flexible multi-GPU implementation of the lattice Boltzmann method

NASA Astrophysics Data System (ADS)

Januszewski, M.; Kostur, M.

2014-09-01

We present Sailfish, an open source fluid simulation package implementing the lattice Boltzmann method (LBM) on modern Graphics Processing Units (GPUs) using CUDA/OpenCL. We take a novel approach to GPU code implementation and use run-time code generation techniques and a high level programming language (Python) to achieve state of the art performance, while allowing easy experimentation with different LBM models and tuning for various types of hardware. We discuss the general design principles of the code, scaling to multiple GPUs in a distributed environment, as well as the GPU implementation and optimization of many different LBM models, both single component (BGK, MRT, ELBM) and multicomponent (Shan-Chen, free energy). The paper also presents results of performance benchmarks spanning the last three NVIDIA GPU generations (Tesla, Fermi, Kepler), which we hope will be useful for researchers working with this type of hardware and similar codes. Catalogue identifier: AETA_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AETA_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: GNU Lesser General Public License, version 3 No. of lines in distributed program, including test data, etc.: 225864 No. of bytes in distributed program, including test data, etc.: 46861049 Distribution format: tar.gz Programming language: Python, CUDA C, OpenCL. Computer: Any with an OpenCL or CUDA-compliant GPU. Operating system: No limits (tested on Linux and Mac OS X). RAM: Hundreds of megabytes to tens of gigabytes for typical cases. Classification: 12, 6.5. External routines: PyCUDA/PyOpenCL, Numpy, Mako, ZeroMQ (for multi-GPU simulations), scipy, sympy Nature of problem: GPU-accelerated simulation of single- and multi-component fluid flows. Solution method: A wide range of relaxation models (LBGK, MRT, regularized LB, ELBM, Shan-Chen, free energy, free surface) and boundary conditions within the lattice Boltzmann method framework. Simulations can be run in single or double precision using one or more GPUs. Restrictions: The lattice Boltzmann method works for low Mach number flows only. Unusual features: The actual numerical calculations run exclusively on GPUs. The numerical code is built dynamically at run-time in CUDA C or OpenCL, using templates and symbolic formulas. The high-level control of the simulation is maintained by a Python process. Additional comments: !!!!! The distribution file for this program is over 45 Mbytes and therefore is not delivered directly when Download or Email is requested. Instead a html file giving details of how the program can be obtained is sent. !!!!! Running time: Problem-dependent, typically minutes (for small cases or short simulations) to hours (large cases or long simulations).
Dosimetric comparison of helical tomotherapy treatment plans for total marrow irradiation created using GPU and CPU dose calculation engines.

PubMed

Nalichowski, Adrian; Burmeister, Jay

2013-07-01

To compare optimization characteristics, plan quality, and treatment delivery efficiency between total marrow irradiation (TMI) plans using the new TomoTherapy graphic processing unit (GPU) based dose engine and CPU/cluster based dose engine. Five TMI plans created on an anthropomorphic phantom were optimized and calculated with both dose engines. The planning treatment volume (PTV) included all the bones from head to mid femur except for upper extremities. Evaluated organs at risk (OAR) consisted of lung, liver, heart, kidneys, and brain. The following treatment parameters were used to generate the TMI plans: field widths of 2.5 and 5 cm, modulation factors of 2 and 2.5, and pitch of either 0.287 or 0.43. The optimization parameters were chosen based on the PTV and OAR priorities and the plans were optimized with a fixed number of iterations. The PTV constraint was selected to ensure that at least 95% of the PTV received the prescription dose. The plans were evaluated based on D80 and D50 (dose to 80% and 50% of the OAR volume, respectively) and hotspot volumes within the PTVs. Gamma indices (Γ) were also used to compare planar dose distributions between the two modalities. The optimization and dose calculation times were compared between the two systems. The treatment delivery times were also evaluated. The results showed very good dosimetric agreement between the GPU and CPU calculated plans for any of the evaluated planning parameters indicating that both systems converge on nearly identical plans. All D80 and D50 parameters varied by less than 3% of the prescription dose with an average difference of 0.8%. A gamma analysis Γ(3%, 3 mm) < 1 of the GPU plan resulted in over 90% of calculated voxels satisfying Γ < 1 criterion as compared to baseline CPU plan. The average number of voxels meeting the Γ < 1 criterion for all the plans was 97%. In terms of dose optimization/calculation efficiency, there was a 20-fold reduction in planning time with the new GPU system. The average optimization/dose calculation time utilizing the traditional CPU/cluster based system was 579 vs 26.8 min for the GPU based system. There was no difference in the calculated treatment delivery time per fraction. Beam-on time varied based on field width and pitch and ranged between 15 and 28 min. The TomoTherapy GPU based dose engine is capable of calculating TMI treatment plans with plan quality nearly identical to plans calculated using the traditional CPU/cluster based system, while significantly reducing the time required for optimization and dose calculation.
SU-E-T-500: Initial Implementation of GPU-Based Particle Swarm Optimization for 4D IMRT Planning in Lung SBRT

DOE Office of Scientific and Technical Information (OSTI.GOV)

Modiri, A; Hagan, A; Gu, X

Purpose 4D-IMRT planning, combined with dynamic MLC tracking delivery, utilizes the temporal dimension as an additional degree of freedom to achieve improved OAR-sparing. The computational complexity for such optimization increases exponentially with increase in dimensionality. In order to accomplish this task in a clinically-feasible time frame, we present an initial implementation of GPU-based 4D-IMRT planning based on particle swarm optimization (PSO). Methods The target and normal structures were manually contoured on ten phases of a 4DCT scan of a NSCLC patient with a 54cm3 right-lower-lobe tumor (1.5cm motion). Corresponding ten 3D-IMRT plans were created in the Eclipse treatment planning systemmore » (Ver-13.6). A vendor-provided scripting interface was used to export 3D-dose matrices corresponding to each control point (10 phases × 9 beams × 166 control points = 14,940), which served as input to PSO. The optimization task was to iteratively adjust the weights of each control point and scale the corresponding dose matrices. In order to handle the large amount of data in GPU memory, dose matrices were sparsified and placed in contiguous memory blocks with the 14,940 weight-variables. PSO was implemented on CPU (dual-Xeon, 3.1GHz) and GPU (dual-K20 Tesla, 2496 cores, 3.52Tflops, each) platforms. NiftyReg, an open-source deformable image registration package, was used to calculate the summed dose. Results The 4D-PSO plan yielded PTV coverage comparable to the clinical ITV-based plan and significantly higher OAR-sparing, as follows: lung Dmean=33%; lung V20=27%; spinal cord Dmax=26%; esophagus Dmax=42%; heart Dmax=0%; heart Dmean=47%. The GPU-PSO processing time for 14940 variables and 7 PSO-particles was 41% that of CPU-PSO (199 vs. 488 minutes). Conclusion Truly 4D-IMRT planning can yield significant OAR dose-sparing while preserving PTV coverage. The corresponding optimization problem is large-scale, non-convex and computationally rigorous. Our initial results indicate that GPU-based PSO with further software optimization can make such planning clinically feasible. This work was supported through funding from the National Institutes of Health and Varian Medical Systems.« less
Proton Testing of nVidia GTX 1050 GPU

NASA Technical Reports Server (NTRS)

Wyrwas, E. J.

2017-01-01

Single-Event Effects (SEE) testing was conducted on the nVidia GTX 1050 Graphics Processor Unit (GPU); herein referred to as device under test (DUT). Testing was conducted at Massachusetts General Hospitals (MGH) Francis H. Burr Proton Therapy Center on April 9th, 2017 using 200-MeV protons. This testing trip was purposed to provide a baseline assessment of the radiation susceptibility of the DUT as no previous testing had been conducted on this component.
Parallel tempering simulation of the three-dimensional Edwards-Anderson model with compact asynchronous multispin coding on GPU

NASA Astrophysics Data System (ADS)

Fang, Ye; Feng, Sheng; Tam, Ka-Ming; Yun, Zhifeng; Moreno, Juana; Ramanujam, J.; Jarrell, Mark

2014-10-01

Monte Carlo simulations of the Ising model play an important role in the field of computational statistical physics, and they have revealed many properties of the model over the past few decades. However, the effect of frustration due to random disorder, in particular the possible spin glass phase, remains a crucial but poorly understood problem. One of the obstacles in the Monte Carlo simulation of random frustrated systems is their long relaxation time making an efficient parallel implementation on state-of-the-art computation platforms highly desirable. The Graphics Processing Unit (GPU) is such a platform that provides an opportunity to significantly enhance the computational performance and thus gain new insight into this problem. In this paper, we present optimization and tuning approaches for the CUDA implementation of the spin glass simulation on GPUs. We discuss the integration of various design alternatives, such as GPU kernel construction with minimal communication, memory tiling, and look-up tables. We present a binary data format, Compact Asynchronous Multispin Coding (CAMSC), which provides an additional 28.4% speedup compared with the traditionally used Asynchronous Multispin Coding (AMSC). Our overall design sustains a performance of 33.5 ps per spin flip attempt for simulating the three-dimensional Edwards-Anderson model with parallel tempering, which significantly improves the performance over existing GPU implementations.
Sub-second pencil beam dose calculation on GPU for adaptive proton therapy

NASA Astrophysics Data System (ADS)

da Silva, Joakim; Ansorge, Richard; Jena, Rajesh

2015-06-01

Although proton therapy delivered using scanned pencil beams has the potential to produce better dose conformity than conventional radiotherapy, the created dose distributions are more sensitive to anatomical changes and patient motion. Therefore, the introduction of adaptive treatment techniques where the dose can be monitored as it is being delivered is highly desirable. We present a GPU-based dose calculation engine relying on the widely used pencil beam algorithm, developed for on-line dose calculation. The calculation engine was implemented from scratch, with each step of the algorithm parallelized and adapted to run efficiently on the GPU architecture. To ensure fast calculation, it employs several application-specific modifications and simplifications, and a fast scatter-based implementation of the computationally expensive kernel superposition step. The calculation time for a skull base treatment plan using two beam directions was 0.22 s on an Nvidia Tesla K40 GPU, whereas a test case of a cubic target in water from the literature took 0.14 s to calculate. The accuracy of the patient dose distributions was assessed by calculating the γ-index with respect to a gold standard Monte Carlo simulation. The passing rates were 99.2% and 96.7%, respectively, for the 3%/3 mm and 2%/2 mm criteria, matching those produced by a clinical treatment planning system.

Some links on this page may take you to non-federal websites. Their policies may differ from this site.