Sample records for nvidia graphics processing

  1. A Large Scale, High Resolution Agent-Based Insurgency Model

    DTIC Science & Technology

    2013-09-30

    CUDA) is NVIDIA Corporation’s software development model for General Purpose Programming on Graphics Processing Units (GPGPU) ( NVIDIA Corporation ...Conference. Argonne National Laboratory, Argonne, IL, October, 2005. NVIDIA Corporation . NVIDIA CUDA Programming Guide 2.0 [Online]. NVIDIA Corporation

  2. Grace: A cross-platform micromagnetic simulator on graphics processing units

    NASA Astrophysics Data System (ADS)

    Zhu, Ru

    2015-12-01

    A micromagnetic simulator running on graphics processing units (GPUs) is presented. Different from GPU implementations of other research groups which are predominantly running on NVidia's CUDA platform, this simulator is developed with C++ Accelerated Massive Parallelism (C++ AMP) and is hardware platform independent. It runs on GPUs from venders including NVidia, AMD and Intel, and achieves significant performance boost as compared to previous central processing unit (CPU) simulators, up to two orders of magnitude. The simulator paved the way for running large size micromagnetic simulations on both high-end workstations with dedicated graphics cards and low-end personal computers with integrated graphics cards, and is freely available to download.

  3. Numerical Integration with Graphical Processing Unit for QKD Simulation

    DTIC Science & Technology

    2014-03-27

    Windows system application programming interface (API) timer. The problem sizes studied produce speedups greater than 60x on the NVIDIA Tesla C2075...13 2.3.3 CUDA API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.3.4 CUDA and NVIDIA GPU Hardware...Theoretical Floating-Point Operations per Second for Intel CPUs and NVIDIA GPUs [3

  4. Real-time radar signal processing using GPGPU (general-purpose graphic processing unit)

    NASA Astrophysics Data System (ADS)

    Kong, Fanxing; Zhang, Yan Rockee; Cai, Jingxiao; Palmer, Robert D.

    2016-05-01

    This study introduces a practical approach to develop real-time signal processing chain for general phased array radar on NVIDIA GPUs(Graphical Processing Units) using CUDA (Compute Unified Device Architecture) libraries such as cuBlas and cuFFT, which are adopted from open source libraries and optimized for the NVIDIA GPUs. The processed results are rigorously verified against those from the CPUs. Performance benchmarked in computation time with various input data cube sizes are compared across GPUs and CPUs. Through the analysis, it will be demonstrated that GPGPUs (General Purpose GPU) real-time processing of the array radar data is possible with relatively low-cost commercial GPUs.

  5. Employing OpenCL to Accelerate Ab Initio Calculations on Graphics Processing Units.

    PubMed

    Kussmann, Jörg; Ochsenfeld, Christian

    2017-06-13

    We present an extension of our graphics processing units (GPU)-accelerated quantum chemistry package to employ OpenCL compute kernels, which can be executed on a wide range of computing devices like CPUs, Intel Xeon Phi, and AMD GPUs. Here, we focus on the use of AMD GPUs and discuss differences as compared to CUDA-based calculations on NVIDIA GPUs. First illustrative timings are presented for hybrid density functional theory calculations using serial as well as parallel compute environments. The results show that AMD GPUs are as fast or faster than comparable NVIDIA GPUs and provide a viable alternative for quantum chemical applications.

  6. Advanced mathematical on-line analysis in nuclear experiments. Usage of parallel computing CUDA routines in standard root analysis

    NASA Astrophysics Data System (ADS)

    Grzeszczuk, A.; Kowalski, S.

    2015-04-01

    Compute Unified Device Architecture (CUDA) is a parallel computing platform developed by Nvidia for increase speed of graphics by usage of parallel mode for processes calculation. The success of this solution has opened technology General-Purpose Graphic Processor Units (GPGPUs) for applications not coupled with graphics. The GPGPUs system can be applying as effective tool for reducing huge number of data for pulse shape analysis measures, by on-line recalculation or by very quick system of compression. The simplified structure of CUDA system and model of programming based on example Nvidia GForce GTX580 card are presented by our poster contribution in stand-alone version and as ROOT application.

  7. A GPU Parallelization of the Absolute Nodal Coordinate Formulation for Applications in Flexible Multibody Dynamics

    DTIC Science & Technology

    2012-02-17

    to be solved. Disclaimer: Reference herein to any specific commercial company , product, process, or service by trade name, trademark...data processing rather than data caching and control flow. To make use of this computational power, NVIDIA introduced a general purpose parallel...GPU implementations were run on an Intel Nehalem Xeon E5520 2.26GHz processor with an NVIDIA Tesla C2070 graphics card for varying numbers of

  8. Accelerated Adaptive MGS Phase Retrieval

    NASA Technical Reports Server (NTRS)

    Lam, Raymond K.; Ohara, Catherine M.; Green, Joseph J.; Bikkannavar, Siddarayappa A.; Basinger, Scott A.; Redding, David C.; Shi, Fang

    2011-01-01

    The Modified Gerchberg-Saxton (MGS) algorithm is an image-based wavefront-sensing method that can turn any science instrument focal plane into a wavefront sensor. MGS characterizes optical systems by estimating the wavefront errors in the exit pupil using only intensity images of a star or other point source of light. This innovative implementation of MGS significantly accelerates the MGS phase retrieval algorithm by using stream-processing hardware on conventional graphics cards. Stream processing is a relatively new, yet powerful, paradigm to allow parallel processing of certain applications that apply single instructions to multiple data (SIMD). These stream processors are designed specifically to support large-scale parallel computing on a single graphics chip. Computationally intensive algorithms, such as the Fast Fourier Transform (FFT), are particularly well suited for this computing environment. This high-speed version of MGS exploits commercially available hardware to accomplish the same objective in a fraction of the original time. The exploit involves performing matrix calculations in nVidia graphic cards. The graphical processor unit (GPU) is hardware that is specialized for computationally intensive, highly parallel computation. From the software perspective, a parallel programming model is used, called CUDA, to transparently scale multicore parallelism in hardware. This technology gives computationally intensive applications access to the processing power of the nVidia GPUs through a C/C++ programming interface. The AAMGS (Accelerated Adaptive MGS) software takes advantage of these advanced technologies, to accelerate the optical phase error characterization. With a single PC that contains four nVidia GTX-280 graphic cards, the new implementation can process four images simultaneously to produce a JWST (James Webb Space Telescope) wavefront measurement 60 times faster than the previous code.

  9. Finite Element Optimization for Nondestructive Evaluation on a Graphics Processing Unit for Ground Vehicle Hull Inspection

    DTIC Science & Technology

    2013-08-22

    4 cores, where the code may simultaneously run on the multiple cores or the graphics processing unit (or GPU – to be more specific on an NVIDIA ...allowed to get accurate crack shapes. DISCLAIMER Reference herein to any specific commercial company , product, process, or service by trade name

  10. General purpose graphic processing unit implementation of adaptive pulse compression algorithms

    NASA Astrophysics Data System (ADS)

    Cai, Jingxiao; Zhang, Yan

    2017-07-01

    This study introduces a practical approach to implement real-time signal processing algorithms for general surveillance radar based on NVIDIA graphical processing units (GPUs). The pulse compression algorithms are implemented using compute unified device architecture (CUDA) libraries such as CUDA basic linear algebra subroutines and CUDA fast Fourier transform library, which are adopted from open source libraries and optimized for the NVIDIA GPUs. For more advanced, adaptive processing algorithms such as adaptive pulse compression, customized kernel optimization is needed and investigated. A statistical optimization approach is developed for this purpose without needing much knowledge of the physical configurations of the kernels. It was found that the kernel optimization approach can significantly improve the performance. Benchmark performance is compared with the CPU performance in terms of processing accelerations. The proposed implementation framework can be used in various radar systems including ground-based phased array radar, airborne sense and avoid radar, and aerospace surveillance radar.

  11. Graphics Processing Unit Acceleration of Gyrokinetic Turbulence Simulations

    NASA Astrophysics Data System (ADS)

    Hause, Benjamin; Parker, Scott

    2012-10-01

    We find a substantial increase in on-node performance using Graphics Processing Unit (GPU) acceleration in gyrokinetic delta-f particle-in-cell simulation. Optimization is performed on a two-dimensional slab gyrokinetic particle simulation using the Portland Group Fortran compiler with the GPU accelerator compiler directives. We have implemented the GPU acceleration on a Core I7 gaming PC with a NVIDIA GTX 580 GPU. We find comparable, or better, acceleration relative to the NERSC DIRAC cluster with the NVIDIA Tesla C2050 computing processor. The Tesla C 2050 is about 2.6 times more expensive than the GTX 580 gaming GPU. Optimization strategies and comparisons between DIRAC and the gaming PC will be presented. We will also discuss progress on optimizing the comprehensive three dimensional general geometry GEM code.

  12. Tensor Algebra Library for NVidia Graphics Processing Units

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Liakh, Dmitry

    This is a general purpose math library implementing basic tensor algebra operations on NVidia GPU accelerators. This software is a tensor algebra library that can perform basic tensor algebra operations, including tensor contractions, tensor products, tensor additions, etc., on NVidia GPU accelerators, asynchronously with respect to the CPU host. It supports a simultaneous use of multiple NVidia GPUs. Each asynchronous API function returns a handle which can later be used for querying the completion of the corresponding tensor algebra operation on a specific GPU. The tensors participating in a particular tensor operation are assumed to be stored in local RAMmore » of a node or GPU RAM. The main research area where this library can be utilized is the quantum many-body theory (e.g., in electronic structure theory).« less

  13. GPU acceleration for digitally reconstructed radiographs using bindless texture objects and CUDA/OpenGL interoperability.

    PubMed

    Abdellah, Marwan; Eldeib, Ayman; Owis, Mohamed I

    2015-01-01

    This paper features an advanced implementation of the X-ray rendering algorithm that harnesses the giant computing power of the current commodity graphics processors to accelerate the generation of high resolution digitally reconstructed radiographs (DRRs). The presented pipeline exploits the latest features of NVIDIA Graphics Processing Unit (GPU) architectures, mainly bindless texture objects and dynamic parallelism. The rendering throughput is substantially improved by exploiting the interoperability mechanisms between CUDA and OpenGL. The benchmarks of our optimized rendering pipeline reflect its capability of generating DRRs with resolutions of 2048(2) and 4096(2) at interactive and semi interactive frame-rates using an NVIDIA GeForce 970 GTX device.

  14. Fast analytical scatter estimation using graphics processing units.

    PubMed

    Ingleby, Harry; Lippuner, Jonas; Rickey, Daniel W; Li, Yue; Elbakri, Idris

    2015-01-01

    To develop a fast patient-specific analytical estimator of first-order Compton and Rayleigh scatter in cone-beam computed tomography, implemented using graphics processing units. The authors developed an analytical estimator for first-order Compton and Rayleigh scatter in a cone-beam computed tomography geometry. The estimator was coded using NVIDIA's CUDA environment for execution on an NVIDIA graphics processing unit. Performance of the analytical estimator was validated by comparison with high-count Monte Carlo simulations for two different numerical phantoms. Monoenergetic analytical simulations were compared with monoenergetic and polyenergetic Monte Carlo simulations. Analytical and Monte Carlo scatter estimates were compared both qualitatively, from visual inspection of images and profiles, and quantitatively, using a scaled root-mean-square difference metric. Reconstruction of simulated cone-beam projection data of an anthropomorphic breast phantom illustrated the potential of this method as a component of a scatter correction algorithm. The monoenergetic analytical and Monte Carlo scatter estimates showed very good agreement. The monoenergetic analytical estimates showed good agreement for Compton single scatter and reasonable agreement for Rayleigh single scatter when compared with polyenergetic Monte Carlo estimates. For a voxelized phantom with dimensions 128 × 128 × 128 voxels and a detector with 256 × 256 pixels, the analytical estimator required 669 seconds for a single projection, using a single NVIDIA 9800 GX2 video card. Accounting for first order scatter in cone-beam image reconstruction improves the contrast to noise ratio of the reconstructed images. The analytical scatter estimator, implemented using graphics processing units, provides rapid and accurate estimates of single scatter and with further acceleration and a method to account for multiple scatter may be useful for practical scatter correction schemes.

  15. Optimization of Selected Remote Sensing Algorithms for Embedded NVIDIA Kepler GPU Architecture

    NASA Technical Reports Server (NTRS)

    Riha, Lubomir; Le Moigne, Jacqueline; El-Ghazawi, Tarek

    2015-01-01

    This paper evaluates the potential of embedded Graphic Processing Units in the Nvidias Tegra K1 for onboard processing. The performance is compared to a general purpose multi-core CPU and full fledge GPU accelerator. This study uses two algorithms: Wavelet Spectral Dimension Reduction of Hyperspectral Imagery and Automated Cloud-Cover Assessment (ACCA) Algorithm. Tegra K1 achieved 51 for ACCA algorithm and 20 for the dimension reduction algorithm, as compared to the performance of the high-end 8-core server Intel Xeon CPU with 13.5 times higher power consumption.

  16. Micromagnetics on high-performance workstation and mobile computational platforms

    NASA Astrophysics Data System (ADS)

    Fu, S.; Chang, R.; Couture, S.; Menarini, M.; Escobar, M. A.; Kuteifan, M.; Lubarda, M.; Gabay, D.; Lomakin, V.

    2015-05-01

    The feasibility of using high-performance desktop and embedded mobile computational platforms is presented, including multi-core Intel central processing unit, Nvidia desktop graphics processing units, and Nvidia Jetson TK1 Platform. FastMag finite element method-based micromagnetic simulator is used as a testbed, showing high efficiency on all the platforms. Optimization aspects of improving the performance of the mobile systems are discussed. The high performance, low cost, low power consumption, and rapid performance increase of the embedded mobile systems make them a promising candidate for micromagnetic simulations. Such architectures can be used as standalone systems or can be built as low-power computing clusters.

  17. Analysis and Implementation of Particle-to-Particle (P2P) Graphics Processor Unit (GPU) Kernel for Black-Box Adaptive Fast Multipole Method

    DTIC Science & Technology

    2015-06-01

    5110P and 16 dx360M4 nodes each with one NVIDIA Kepler K20M/K40M GPU. Each node contained dual Intel Xeon E5-2670 (Sandy Bridge) central processing...kernel and as such does not employ multiple processors. This work makes use of a single processing core and a single NVIDIA Kepler K40 GK110...bandwidth (2 × 16 slot), 7.877 GFloat/s; Kepler K40 peak, 4,290 × 1 billion floating-point operations (GFLOPs), and 288 GB/s Kepler K40 memory

  18. Ultraviolet Communication for Medical Applications

    DTIC Science & Technology

    2015-06-01

    In the previous Phase I effort, Directed Energy Inc.’s (DEI) parent company Imaging Systems Technology (IST) demonstrated feasibility of several key...accurately model high path loss. Custom photon scatter code was rewritten for parallel execution on a graphics processing unit (GPU). The NVidia CUDA

  19. GPU Acceleration of DSP for Communication Receivers.

    PubMed

    Gunther, Jake; Gunther, Hyrum; Moon, Todd

    2017-09-01

    Graphics processing unit (GPU) implementations of signal processing algorithms can outperform CPU-based implementations. This paper describes the GPU implementation of several algorithms encountered in a wide range of high-data rate communication receivers including filters, multirate filters, numerically controlled oscillators, and multi-stage digital down converters. These structures are tested by processing the 20 MHz wide FM radio band (88-108 MHz). Two receiver structures are explored: a single channel receiver and a filter bank channelizer. Both run in real time on NVIDIA GeForce GTX 1080 graphics card.

  20. Poster: Building a Large Tiled-Display Cluster

    DTIC Science & Technology

    2012-10-01

    graphics cards ( Nvidia Quadro FX 5800), and each graphics ∗e-mail: mark.livingston@nrl.navy.mil †e-mail: jonathan.decker@nrl.navy.mil card in a display...such as DisplayPort and HDMI (see: Nvidia Quadro 6000). We recommend these formats because they are much easier to plug-and-play. 3.4 Leverage Open...will find yourself with all the issues related to owning a server room. Today, there are a number of companies offering turn-key so- lutions for tiled

  1. Optimizing ion channel models using a parallel genetic algorithm on graphical processors.

    PubMed

    Ben-Shalom, Roy; Aviv, Amit; Razon, Benjamin; Korngreen, Alon

    2012-01-01

    We have recently shown that we can semi-automatically constrain models of voltage-gated ion channels by combining a stochastic search algorithm with ionic currents measured using multiple voltage-clamp protocols. Although numerically successful, this approach is highly demanding computationally, with optimization on a high performance Linux cluster typically lasting several days. To solve this computational bottleneck we converted our optimization algorithm for work on a graphical processing unit (GPU) using NVIDIA's CUDA. Parallelizing the process on a Fermi graphic computing engine from NVIDIA increased the speed ∼180 times over an application running on an 80 node Linux cluster, considerably reducing simulation times. This application allows users to optimize models for ion channel kinetics on a single, inexpensive, desktop "super computer," greatly reducing the time and cost of building models relevant to neuronal physiology. We also demonstrate that the point of algorithm parallelization is crucial to its performance. We substantially reduced computing time by solving the ODEs (Ordinary Differential Equations) so as to massively reduce memory transfers to and from the GPU. This approach may be applied to speed up other data intensive applications requiring iterative solutions of ODEs. Copyright © 2012 Elsevier B.V. All rights reserved.

  2. GPUmotif: An Ultra-Fast and Energy-Efficient Motif Analysis Program Using Graphics Processing Units

    PubMed Central

    Zandevakili, Pooya; Hu, Ming; Qin, Zhaohui

    2012-01-01

    Computational detection of TF binding patterns has become an indispensable tool in functional genomics research. With the rapid advance of new sequencing technologies, large amounts of protein-DNA interaction data have been produced. Analyzing this data can provide substantial insight into the mechanisms of transcriptional regulation. However, the massive amount of sequence data presents daunting challenges. In our previous work, we have developed a novel algorithm called Hybrid Motif Sampler (HMS) that enables more scalable and accurate motif analysis. Despite much improvement, HMS is still time-consuming due to the requirement to calculate matching probabilities position-by-position. Using the NVIDIA CUDA toolkit, we developed a graphics processing unit (GPU)-accelerated motif analysis program named GPUmotif. We proposed a “fragmentation" technique to hide data transfer time between memories. Performance comparison studies showed that commonly-used model-based motif scan and de novo motif finding procedures such as HMS can be dramatically accelerated when running GPUmotif on NVIDIA graphics cards. As a result, energy consumption can also be greatly reduced when running motif analysis using GPUmotif. The GPUmotif program is freely available at http://sourceforge.net/projects/gpumotif/ PMID:22662128

  3. GPUmotif: an ultra-fast and energy-efficient motif analysis program using graphics processing units.

    PubMed

    Zandevakili, Pooya; Hu, Ming; Qin, Zhaohui

    2012-01-01

    Computational detection of TF binding patterns has become an indispensable tool in functional genomics research. With the rapid advance of new sequencing technologies, large amounts of protein-DNA interaction data have been produced. Analyzing this data can provide substantial insight into the mechanisms of transcriptional regulation. However, the massive amount of sequence data presents daunting challenges. In our previous work, we have developed a novel algorithm called Hybrid Motif Sampler (HMS) that enables more scalable and accurate motif analysis. Despite much improvement, HMS is still time-consuming due to the requirement to calculate matching probabilities position-by-position. Using the NVIDIA CUDA toolkit, we developed a graphics processing unit (GPU)-accelerated motif analysis program named GPUmotif. We proposed a "fragmentation" technique to hide data transfer time between memories. Performance comparison studies showed that commonly-used model-based motif scan and de novo motif finding procedures such as HMS can be dramatically accelerated when running GPUmotif on NVIDIA graphics cards. As a result, energy consumption can also be greatly reduced when running motif analysis using GPUmotif. The GPUmotif program is freely available at http://sourceforge.net/projects/gpumotif/

  4. Accelerating Monte Carlo simulations with an NVIDIA ® graphics processor

    NASA Astrophysics Data System (ADS)

    Martinsen, Paul; Blaschke, Johannes; Künnemeyer, Rainer; Jordan, Robert

    2009-10-01

    Modern graphics cards, commonly used in desktop computers, have evolved beyond a simple interface between processor and display to incorporate sophisticated calculation engines that can be applied to general purpose computing. The Monte Carlo algorithm for modelling photon transport in turbid media has been implemented on an NVIDIA ® 8800 GT graphics card using the CUDA toolkit. The Monte Carlo method relies on following the trajectory of millions of photons through the sample, often taking hours or days to complete. The graphics-processor implementation, processing roughly 110 million scattering events per second, was found to run more than 70 times faster than a similar, single-threaded implementation on a 2.67 GHz desktop computer. Program summaryProgram title: Phoogle-C/Phoogle-G Catalogue identifier: AEEB_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEEB_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 51 264 No. of bytes in distributed program, including test data, etc.: 2 238 805 Distribution format: tar.gz Programming language: C++ Computer: Designed for Intel PCs. Phoogle-G requires a NVIDIA graphics card with support for CUDA 1.1 Operating system: Windows XP Has the code been vectorised or parallelized?: Phoogle-G is written for SIMD architectures RAM: 1 GB Classification: 21.1 External routines: Charles Karney Random number library. Microsoft Foundation Class library. NVIDA CUDA library [1]. Nature of problem: The Monte Carlo technique is an effective algorithm for exploring the propagation of light in turbid media. However, accurate results require tracing the path of many photons within the media. The independence of photons naturally lends the Monte Carlo technique to implementation on parallel architectures. Generally, parallel computing can be expensive, but recent advances in consumer grade graphics cards have opened the possibility of high-performance desktop parallel-computing. Solution method: In this pair of programmes we have implemented the Monte Carlo algorithm described by Prahl et al. [2] for photon transport in infinite scattering media to compare the performance of two readily accessible architectures: a standard desktop PC and a consumer grade graphics card from NVIDIA. Restrictions: The graphics card implementation uses single precision floating point numbers for all calculations. Only photon transport from an isotropic point-source is supported. The graphics-card version has no user interface. The simulation parameters must be set in the source code. The desktop version has a simple user interface; however some properties can only be accessed through an ActiveX client (such as Matlab). Additional comments: The random number library used has a LGPL ( http://www.gnu.org/copyleft/lesser.html) licence. Running time: Runtime can range from minutes to months depending on the number of photons simulated and the optical properties of the medium. References:http://www.nvidia.com/object/cuda_home.html. S. Prahl, M. Keijzer, Sl. Jacques, A. Welch, SPIE Institute Series 5 (1989) 102.

  5. Application of graphics processing units to search pipelines for gravitational waves from coalescing binaries of compact objects

    NASA Astrophysics Data System (ADS)

    Chung, Shin Kee; Wen, Linqing; Blair, David; Cannon, Kipp; Datta, Amitava

    2010-07-01

    We report a novel application of a graphics processing unit (GPU) for the purpose of accelerating the search pipelines for gravitational waves from coalescing binaries of compact objects. A speed-up of 16-fold in total has been achieved with an NVIDIA GeForce 8800 Ultra GPU card compared with one core of a 2.5 GHz Intel Q9300 central processing unit (CPU). We show that substantial improvements are possible and discuss the reduction in CPU count required for the detection of inspiral sources afforded by the use of GPUs.

  6. Investigating the Importance of Stereo Displays for Helicopter Landing Simulation

    DTIC Science & Technology

    2016-08-11

    visualization. The two instances of X Plane® were implemented using two separate PCs, each incorporating Intel i7 processors and Nvidia Quadro K4200... Nvidia GeForce GTX 680 graphics card was used to administer the stereo acuity and fusion range tests. The tests were displayed on an Asus VG278HE 3D...monitor with 1920x1080 pixels that was compatible with Nvidia 3D Vision2 and that used active shutter glasses. At a 1-m viewing distance, the

  7. Graphics processing unit based computation for NDE applications

    NASA Astrophysics Data System (ADS)

    Nahas, C. A.; Rajagopal, Prabhu; Balasubramaniam, Krishnan; Krishnamurthy, C. V.

    2012-05-01

    Advances in parallel processing in recent years are helping to improve the cost of numerical simulation. Breakthroughs in Graphical Processing Unit (GPU) based computation now offer the prospect of further drastic improvements. The introduction of 'compute unified device architecture' (CUDA) by NVIDIA (the global technology company based in Santa Clara, California, USA) has made programming GPUs for general purpose computing accessible to the average programmer. Here we use CUDA to develop parallel finite difference schemes as applicable to two problems of interest to NDE community, namely heat diffusion and elastic wave propagation. The implementations are for two-dimensions. Performance improvement of the GPU implementation against serial CPU implementation is then discussed.

  8. The gputools package enables GPU computing in R.

    PubMed

    Buckner, Joshua; Wilson, Justin; Seligman, Mark; Athey, Brian; Watson, Stanley; Meng, Fan

    2010-01-01

    By default, the R statistical environment does not make use of parallelism. Researchers may resort to expensive solutions such as cluster hardware for large analysis tasks. Graphics processing units (GPUs) provide an inexpensive and computationally powerful alternative. Using R and the CUDA toolkit from Nvidia, we have implemented several functions commonly used in microarray gene expression analysis for GPU-equipped computers. R users can take advantage of the better performance provided by an Nvidia GPU. The package is available from CRAN, the R project's repository of packages, at http://cran.r-project.org/web/packages/gputools More information about our gputools R package is available at http://brainarray.mbni.med.umich.edu/brainarray/Rgpgpu

  9. Graphics processing unit (GPU)-based computation of heat conduction in thermally anisotropic solids

    NASA Astrophysics Data System (ADS)

    Nahas, C. A.; Balasubramaniam, Krishnan; Rajagopal, Prabhu

    2013-01-01

    Numerical modeling of anisotropic media is a computationally intensive task since it brings additional complexity to the field problem in such a way that the physical properties are different in different directions. Largely used in the aerospace industry because of their lightweight nature, composite materials are a very good example of thermally anisotropic media. With advancements in video gaming technology, parallel processors are much cheaper today and accessibility to higher-end graphical processing devices has increased dramatically over the past couple of years. Since these massively parallel GPUs are very good in handling floating point arithmetic, they provide a new platform for engineers and scientists to accelerate their numerical models using commodity hardware. In this paper we implement a parallel finite difference model of thermal diffusion through anisotropic media using the NVIDIA CUDA (Compute Unified device Architecture). We use the NVIDIA GeForce GTX 560 Ti as our primary computing device which consists of 384 CUDA cores clocked at 1645 MHz with a standard desktop pc as the host platform. We compare the results from standard CPU implementation for its accuracy and speed and draw implications for simulation using the GPU paradigm.

  10. Graphics Processing Unit Acceleration of Gyrokinetic Turbulence Simulations

    NASA Astrophysics Data System (ADS)

    Hause, Benjamin; Parker, Scott; Chen, Yang

    2013-10-01

    We find a substantial increase in on-node performance using Graphics Processing Unit (GPU) acceleration in gyrokinetic delta-f particle-in-cell simulation. Optimization is performed on a two-dimensional slab gyrokinetic particle simulation using the Portland Group Fortran compiler with the OpenACC compiler directives and Fortran CUDA. Mixed implementation of both Open-ACC and CUDA is demonstrated. CUDA is required for optimizing the particle deposition algorithm. We have implemented the GPU acceleration on a third generation Core I7 gaming PC with two NVIDIA GTX 680 GPUs. We find comparable, or better, acceleration relative to the NERSC DIRAC cluster with the NVIDIA Tesla C2050 computing processor. The Tesla C 2050 is about 2.6 times more expensive than the GTX 580 gaming GPU. We also see enormous speedups (10 or more) on the Titan supercomputer at Oak Ridge with Kepler K20 GPUs. Results show speed-ups comparable or better than that of OpenMP models utilizing multiple cores. The use of hybrid OpenACC, CUDA Fortran, and MPI models across many nodes will also be discussed. Optimization strategies will be presented. We will discuss progress on optimizing the comprehensive three dimensional general geometry GEM code.

  11. Visual Media Reasoning - Terrain-based Geolocation

    DTIC Science & Technology

    2015-06-01

    the drawings, specifications, or other data does not license the holder or any other person or corporation ; or convey any rights or permission to...3.4 Alternative Metric Investigation This section describes a graphics processor unit (GPU) based implementation in the NVIDIA CUDA programming...utilizing 2 concurrent CPU cores, each controlling a single Nvidia C2075 Tesla Fermi CUDA card. Figure 22 shows a comparison of the CPU and the GPU powered

  12. Acceleration of spiking neural network based pattern recognition on NVIDIA graphics processors.

    PubMed

    Han, Bing; Taha, Tarek M

    2010-04-01

    There is currently a strong push in the research community to develop biological scale implementations of neuron based vision models. Systems at this scale are computationally demanding and generally utilize more accurate neuron models, such as the Izhikevich and the Hodgkin-Huxley models, in favor of the more popular integrate and fire model. We examine the feasibility of using graphics processing units (GPUs) to accelerate a spiking neural network based character recognition network to enable such large scale systems. Two versions of the network utilizing the Izhikevich and Hodgkin-Huxley models are implemented. Three NVIDIA general-purpose (GP) GPU platforms are examined, including the GeForce 9800 GX2, the Tesla C1060, and the Tesla S1070. Our results show that the GPGPUs can provide significant speedup over conventional processors. In particular, the fastest GPGPU utilized, the Tesla S1070, provided a speedup of 5.6 and 84.4 over highly optimized implementations on the fastest central processing unit (CPU) tested, a quadcore 2.67 GHz Xeon processor, for the Izhikevich and the Hodgkin-Huxley models, respectively. The CPU implementation utilized all four cores and the vector data parallelism offered by the processor. The results indicate that GPUs are well suited for this application domain.

  13. Proton Testing of nVidia GTX 1050 GPU

    NASA Technical Reports Server (NTRS)

    Wyrwas, E. J.

    2017-01-01

    Single-Event Effects (SEE) testing was conducted on the nVidia GTX 1050 Graphics Processor Unit (GPU); herein referred to as device under test (DUT). Testing was conducted at Massachusetts General Hospitals (MGH) Francis H. Burr Proton Therapy Center on April 9th, 2017 using 200-MeV protons. This testing trip was purposed to provide a baseline assessment of the radiation susceptibility of the DUT as no previous testing had been conducted on this component.

  14. 75 FR 48338 - Intel Corporation; Analysis of Proposed Consent Order to Aid Public Comment

    Federal Register 2010, 2011, 2012, 2013, 2014

    2010-08-10

    ... integrated into chipsets as well as discrete graphics cards. NVIDIA has been at the forefront of developing... to connect peripheral products such as discrete GPUs to the CPU. A bus is a connection point between... platform. Intel's commitment to maintain an open PCIe bus will provide discrete graphics manufacturers...

  15. Gravitational tree-code on graphics processing units: implementation in CUDA

    NASA Astrophysics Data System (ADS)

    Gaburov, Evghenii; Bédorf, Jeroen; Portegies Zwart, Simon

    2010-05-01

    We present a new very fast tree-code which runs on massively parallel Graphical Processing Units (GPU) with NVIDIA CUDA architecture. The tree-construction and calculation of multipole moments is carried out on the host CPU, while the force calculation which consists of tree walks and evaluation of interaction list is carried out on the GPU. In this way we achieve a sustained performance of about 100GFLOP/s and data transfer rates of about 50GB/s. It takes about a second to compute forces on a million particles with an opening angle of θ ≈ 0.5. The code has a convenient user interface and is freely available for use. http://castle.strw.leidenuniv.nl/software/octgrav.html

  16. Investigating the Mobility of Light Autonoumous Tracked Vehicles Using a High Performance Computing Simulation Capability

    DTIC Science & Technology

    2012-08-01

    UNCLASSIFIED: Distribution Statement A. Approved for public release. DISCLAIMER: Reference herein to any specific commercial company , product...FunctionBay, S. Korea – NVIDIA – Caterpillar – MSC.Software – Advanced Micro Devices (AMD) 14-16 AUG 2012  Aaron Bartholomew  Makarand Datar...16GB DDR2 Graphics: 4x NVIDIA Tesla C1060 Power supply 1: 1000W Power supply 2: 750W Assembled Quad GPU Machine 14-16 AUG 2012 30

  17. Overview of implementation of DARPA GPU program in SAIC

    NASA Astrophysics Data System (ADS)

    Braunreiter, Dennis; Furtek, Jeremy; Chen, Hai-Wen; Healy, Dennis

    2008-04-01

    This paper reviews the implementation of DARPA MTO STAP-BOY program for both Phase I and II conducted at Science Applications International Corporation (SAIC). The STAP-BOY program conducts fast covariance factorization and tuning techniques for space-time adaptive process (STAP) Algorithm Implementation on Graphics Processor unit (GPU) Architectures for Embedded Systems. The first part of our presentation on the DARPA STAP-BOY program will focus on GPU implementation and algorithm innovations for a prototype radar STAP algorithm. The STAP algorithm will be implemented on the GPU, using stream programming (from companies such as PeakStream, ATI Technologies' CTM, and NVIDIA) and traditional graphics APIs. This algorithm will include fast range adaptive STAP weight updates and beamforming applications, each of which has been modified to exploit the parallel nature of graphics architectures.

  18. Accelerating gravitational microlensing simulations using the Xeon Phi coprocessor

    NASA Astrophysics Data System (ADS)

    Chen, B.; Kantowski, R.; Dai, X.; Baron, E.; Van der Mark, P.

    2017-04-01

    Recently Graphics Processing Units (GPUs) have been used to speed up very CPU-intensive gravitational microlensing simulations. In this work, we use the Xeon Phi coprocessor to accelerate such simulations and compare its performance on a microlensing code with that of NVIDIA's GPUs. For the selected set of parameters evaluated in our experiment, we find that the speedup by Intel's Knights Corner coprocessor is comparable to that by NVIDIA's Fermi family of GPUs with compute capability 2.0, but less significant than GPUs with higher compute capabilities such as the Kepler. However, the very recently released second generation Xeon Phi, Knights Landing, is about 5.8 times faster than the Knights Corner, and about 2.9 times faster than the Kepler GPU used in our simulations. We conclude that the Xeon Phi is a very promising alternative to GPUs for modern high performance microlensing simulations.

  19. Using 3D Computer Graphics Multimedia to Motivate Preservice Teachers' Learning of Geometry and Pedagogy

    ERIC Educational Resources Information Center

    Goodson-Espy, Tracy; Lynch-Davis, Kathleen; Schram, Pamela; Quickenton, Art

    2010-01-01

    This paper describes the genesis and purpose of our geometry methods course, focusing on a geometry-teaching technology we created using NVIDIA[R] Chameleon demonstration. This article presents examples from a sequence of lessons centered about a 3D computer graphics demonstration of the chameleon and its geometry. In addition, we present data…

  20. On the effective implementation of a boundary element code on graphics processing units unsing an out-of-core LU algorithm

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    D'Azevedo, Ed F; Nintcheu Fata, Sylvain

    2012-01-01

    A collocation boundary element code for solving the three-dimensional Laplace equation, publicly available from \\url{http://www.intetec.org}, has been adapted to run on an Nvidia Tesla general purpose graphics processing unit (GPU). Global matrix assembly and LU factorization of the resulting dense matrix were performed on the GPU. Out-of-core techniques were used to solve problems larger than available GPU memory. The code achieved over eight times speedup in matrix assembly and about 56~Gflops/sec in the LU factorization using only 512~Mbytes of GPU memory. Details of the GPU implementation and comparisons with the standard sequential algorithm are included to illustrate the performance ofmore » the GPU code.« less

  1. Mendel-GPU: haplotyping and genotype imputation on graphics processing units

    PubMed Central

    Chen, Gary K.; Wang, Kai; Stram, Alex H.; Sobel, Eric M.; Lange, Kenneth

    2012-01-01

    Motivation: In modern sequencing studies, one can improve the confidence of genotype calls by phasing haplotypes using information from an external reference panel of fully typed unrelated individuals. However, the computational demands are so high that they prohibit researchers with limited computational resources from haplotyping large-scale sequence data. Results: Our graphics processing unit based software delivers haplotyping and imputation accuracies comparable to competing programs at a fraction of the computational cost and peak memory demand. Availability: Mendel-GPU, our OpenCL software, runs on Linux platforms and is portable across AMD and nVidia GPUs. Users can download both code and documentation at http://code.google.com/p/mendel-gpu/. Contact: gary.k.chen@usc.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:22954633

  2. Multidisciplinary Simulation Acceleration using Multiple Shared-Memory Graphical Processing Units

    NASA Astrophysics Data System (ADS)

    Kemal, Jonathan Yashar

    For purposes of optimizing and analyzing turbomachinery and other designs, the unsteady Favre-averaged flow-field differential equations for an ideal compressible gas can be solved in conjunction with the heat conduction equation. We solve all equations using the finite-volume multiple-grid numerical technique, with the dual time-step scheme used for unsteady simulations. Our numerical solver code targets CUDA-capable Graphical Processing Units (GPUs) produced by NVIDIA. Making use of MPI, our solver can run across networked compute notes, where each MPI process can use either a GPU or a Central Processing Unit (CPU) core for primary solver calculations. We use NVIDIA Tesla C2050/C2070 GPUs based on the Fermi architecture, and compare our resulting performance against Intel Zeon X5690 CPUs. Solver routines converted to CUDA typically run about 10 times faster on a GPU for sufficiently dense computational grids. We used a conjugate cylinder computational grid and ran a turbulent steady flow simulation using 4 increasingly dense computational grids. Our densest computational grid is divided into 13 blocks each containing 1033x1033 grid points, for a total of 13.87 million grid points or 1.07 million grid points per domain block. To obtain overall speedups, we compare the execution time of the solver's iteration loop, including all resource intensive GPU-related memory copies. Comparing the performance of 8 GPUs to that of 8 CPUs, we obtain an overall speedup of about 6.0 when using our densest computational grid. This amounts to an 8-GPU simulation running about 39.5 times faster than running than a single-CPU simulation.

  3. FANTOM: Algorithm-Architecture Codesign for High-Performance Embedded Signal and Image Processing Systems

    DTIC Science & Technology

    2013-05-25

    graphics processors by IBM, AMD, and nVIDIA . They are between general-purpose pro- cessors and special-purpose processors. In Phase II. 3.10 Measure of...particular, Dr. Kevin Irick started a company Silicon Scapes and he has been the CEO. 5 Implications for Related/Future Research We speculate that...final project report in Jan. 2011. At the test and validation stage of the project. FANTOM’s partner at Raytheon quit from his company and hence from

  4. Advantages of GPU technology in DFT calculations of intercalated graphene

    NASA Astrophysics Data System (ADS)

    Pešić, J.; Gajić, R.

    2014-09-01

    Over the past few years, the expansion of general-purpose graphic-processing unit (GPGPU) technology has had a great impact on computational science. GPGPU is the utilization of a graphics-processing unit (GPU) to perform calculations in applications usually handled by the central processing unit (CPU). Use of GPGPUs as a way to increase computational power in the material sciences has significantly decreased computational costs in already highly demanding calculations. A level of the acceleration and parallelization depends on the problem itself. Some problems can benefit from GPU acceleration and parallelization, such as the finite-difference time-domain algorithm (FTDT) and density-functional theory (DFT), while others cannot take advantage of these modern technologies. A number of GPU-supported applications had emerged in the past several years (www.nvidia.com/object/gpu-applications.html). Quantum Espresso (QE) is reported as an integrated suite of open source computer codes for electronic-structure calculations and materials modeling at the nano-scale. It is based on DFT, the use of a plane-waves basis and a pseudopotential approach. Since the QE 5.0 version, it has been implemented as a plug-in component for standard QE packages that allows exploiting the capabilities of Nvidia GPU graphic cards (www.qe-forge.org/gf/proj). In this study, we have examined the impact of the usage of GPU acceleration and parallelization on the numerical performance of DFT calculations. Graphene has been attracting attention worldwide and has already shown some remarkable properties. We have studied an intercalated graphene, using the QE package PHonon, which employs GPU. The term ‘intercalation’ refers to a process whereby foreign adatoms are inserted onto a graphene lattice. In addition, by intercalating different atoms between graphene layers, it is possible to tune their physical properties. Our experiments have shown there are benefits from using GPUs, and we reached an acceleration of several times compared to standard CPU calculations.

  5. Performance analysis of a parallel Monte Carlo code for simulating solar radiative transfer in cloudy atmospheres using CUDA-enabled NVIDIA GPU

    NASA Astrophysics Data System (ADS)

    Russkova, Tatiana V.

    2017-11-01

    One tool to improve the performance of Monte Carlo methods for numerical simulation of light transport in the Earth's atmosphere is the parallel technology. A new algorithm oriented to parallel execution on the CUDA-enabled NVIDIA graphics processor is discussed. The efficiency of parallelization is analyzed on the basis of calculating the upward and downward fluxes of solar radiation in both a vertically homogeneous and inhomogeneous models of the atmosphere. The results of testing the new code under various atmospheric conditions including continuous singlelayered and multilayered clouds, and selective molecular absorption are presented. The results of testing the code using video cards with different compute capability are analyzed. It is shown that the changeover of computing from conventional PCs to the architecture of graphics processors gives more than a hundredfold increase in performance and fully reveals the capabilities of the technology used.

  6. Exploiting current-generation graphics hardware for synthetic-scene generation

    NASA Astrophysics Data System (ADS)

    Tanner, Michael A.; Keen, Wayne A.

    2010-04-01

    Increasing seeker frame rate and pixel count, as well as the demand for higher levels of scene fidelity, have driven scene generation software for hardware-in-the-loop (HWIL) and software-in-the-loop (SWIL) testing to higher levels of parallelization. Because modern PC graphics cards provide multiple computational cores (240 shader cores for a current NVIDIA Corporation GeForce and Quadro cards), implementation of phenomenology codes on graphics processing units (GPUs) offers significant potential for simultaneous enhancement of simulation frame rate and fidelity. To take advantage of this potential requires algorithm implementation that is structured to minimize data transfers between the central processing unit (CPU) and the GPU. In this paper, preliminary methodologies developed at the Kinetic Hardware In-The-Loop Simulator (KHILS) will be presented. Included in this paper will be various language tradeoffs between conventional shader programming, Compute Unified Device Architecture (CUDA) and Open Computing Language (OpenCL), including performance trades and possible pathways for future tool development.

  7. MGUPGMA: A Fast UPGMA Algorithm With Multiple Graphics Processing Units Using NCCL

    PubMed Central

    Hua, Guan-Jie; Hung, Che-Lun; Lin, Chun-Yuan; Wu, Fu-Che; Chan, Yu-Wei; Tang, Chuan Yi

    2017-01-01

    A phylogenetic tree is a visual diagram of the relationship between a set of biological species. The scientists usually use it to analyze many characteristics of the species. The distance-matrix methods, such as Unweighted Pair Group Method with Arithmetic Mean and Neighbor Joining, construct a phylogenetic tree by calculating pairwise genetic distances between taxa. These methods have the computational performance issue. Although several new methods with high-performance hardware and frameworks have been proposed, the issue still exists. In this work, a novel parallel Unweighted Pair Group Method with Arithmetic Mean approach on multiple Graphics Processing Units is proposed to construct a phylogenetic tree from extremely large set of sequences. The experimental results present that the proposed approach on a DGX-1 server with 8 NVIDIA P100 graphic cards achieves approximately 3-fold to 7-fold speedup over the implementation of Unweighted Pair Group Method with Arithmetic Mean on a modern CPU and a single GPU, respectively. PMID:29051701

  8. MGUPGMA: A Fast UPGMA Algorithm With Multiple Graphics Processing Units Using NCCL.

    PubMed

    Hua, Guan-Jie; Hung, Che-Lun; Lin, Chun-Yuan; Wu, Fu-Che; Chan, Yu-Wei; Tang, Chuan Yi

    2017-01-01

    A phylogenetic tree is a visual diagram of the relationship between a set of biological species. The scientists usually use it to analyze many characteristics of the species. The distance-matrix methods, such as Unweighted Pair Group Method with Arithmetic Mean and Neighbor Joining, construct a phylogenetic tree by calculating pairwise genetic distances between taxa. These methods have the computational performance issue. Although several new methods with high-performance hardware and frameworks have been proposed, the issue still exists. In this work, a novel parallel Unweighted Pair Group Method with Arithmetic Mean approach on multiple Graphics Processing Units is proposed to construct a phylogenetic tree from extremely large set of sequences. The experimental results present that the proposed approach on a DGX-1 server with 8 NVIDIA P100 graphic cards achieves approximately 3-fold to 7-fold speedup over the implementation of Unweighted Pair Group Method with Arithmetic Mean on a modern CPU and a single GPU, respectively.

  9. Swan: A tool for porting CUDA programs to OpenCL

    NASA Astrophysics Data System (ADS)

    Harvey, M. J.; De Fabritiis, G.

    2011-04-01

    The use of modern, high-performance graphical processing units (GPUs) for acceleration of scientific computation has been widely reported. The majority of this work has used the CUDA programming model supported exclusively by GPUs manufactured by NVIDIA. An industry standardisation effort has recently produced the OpenCL specification for GPU programming. This offers the benefits of hardware-independence and reduced dependence on proprietary tool-chains. Here we describe a source-to-source translation tool, "Swan" for facilitating the conversion of an existing CUDA code to use the OpenCL model, as a means to aid programmers experienced with CUDA in evaluating OpenCL and alternative hardware. While the performance of equivalent OpenCL and CUDA code on fixed hardware should be comparable, we find that a real-world CUDA application ported to OpenCL exhibits an overall 50% increase in runtime, a reduction in performance attributable to the immaturity of contemporary compilers. The ported application is shown to have platform independence, running on both NVIDIA and AMD GPUs without modification. We conclude that OpenCL is a viable platform for developing portable GPU applications but that the more mature CUDA tools continue to provide best performance. Program summaryProgram title: Swan Catalogue identifier: AEIH_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEIH_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: GNU Public License version 2 No. of lines in distributed program, including test data, etc.: 17 736 No. of bytes in distributed program, including test data, etc.: 131 177 Distribution format: tar.gz Programming language: C Computer: PC Operating system: Linux RAM: 256 Mbytes Classification: 6.5 External routines: NVIDIA CUDA, OpenCL Nature of problem: Graphical Processing Units (GPUs) from NVIDIA are preferentially programed with the proprietary CUDA programming toolkit. An alternative programming model promoted as an industry standard, OpenCL, provides similar capabilities to CUDA and is also supported on non-NVIDIA hardware (including multicore ×86 CPUs, AMD GPUs and IBM Cell processors). The adaptation of a program from CUDA to OpenCL is relatively straightforward but laborious. The Swan tool facilitates this conversion. Solution method:Swan performs a translation of CUDA kernel source code into an OpenCL equivalent. It also generates the C source code for entry point functions, simplifying kernel invocation from the host program. A concise host-side API abstracts the CUDA and OpenCL APIs. A program adapted to use Swan has no dependency on the CUDA compiler for the host-side program. The converted program may be built for either CUDA or OpenCL, with the selection made at compile time. Restrictions: No support for CUDA C++ features Running time: Nominal

  10. Real-Space Density Functional Theory on Graphical Processing Units: Computational Approach and Comparison to Gaussian Basis Set Methods.

    PubMed

    Andrade, Xavier; Aspuru-Guzik, Alán

    2013-10-08

    We discuss the application of graphical processing units (GPUs) to accelerate real-space density functional theory (DFT) calculations. To make our implementation efficient, we have developed a scheme to expose the data parallelism available in the DFT approach; this is applied to the different procedures required for a real-space DFT calculation. We present results for current-generation GPUs from AMD and Nvidia, which show that our scheme, implemented in the free code Octopus, can reach a sustained performance of up to 90 GFlops for a single GPU, representing a significant speed-up when compared to the CPU version of the code. Moreover, for some systems, our implementation can outperform a GPU Gaussian basis set code, showing that the real-space approach is a competitive alternative for DFT simulations on GPUs.

  11. Efficient particle-in-cell simulation of auroral plasma phenomena using a CUDA enabled graphics processing unit

    NASA Astrophysics Data System (ADS)

    Sewell, Stephen

    This thesis introduces a software framework that effectively utilizes low-cost commercially available Graphic Processing Units (GPUs) to simulate complex scientific plasma phenomena that are modeled using the Particle-In-Cell (PIC) paradigm. The software framework that was developed conforms to the Compute Unified Device Architecture (CUDA), a standard for general purpose graphic processing that was introduced by NVIDIA Corporation. This framework has been verified for correctness and applied to advance the state of understanding of the electromagnetic aspects of the development of the Aurora Borealis and Aurora Australis. For each phase of the PIC methodology, this research has identified one or more methods to exploit the problem's natural parallelism and effectively map it for execution on the graphic processing unit and its host processor. The sources of overhead that can reduce the effectiveness of parallelization for each of these methods have also been identified. One of the novel aspects of this research was the utilization of particle sorting during the grid interpolation phase. The final representation resulted in simulations that executed about 38 times faster than simulations that were run on a single-core general-purpose processing system. The scalability of this framework to larger problem sizes and future generation systems has also been investigated.

  12. Fast generation of computer-generated hologram by graphics processing unit

    NASA Astrophysics Data System (ADS)

    Matsuda, Sho; Fujii, Tomohiko; Yamaguchi, Takeshi; Yoshikawa, Hiroshi

    2009-02-01

    A cylindrical hologram is well known to be viewable in 360 deg. This hologram depends high pixel resolution.Therefore, Computer-Generated Cylindrical Hologram (CGCH) requires huge calculation amount.In our previous research, we used look-up table method for fast calculation with Intel Pentium4 2.8 GHz.It took 480 hours to calculate high resolution CGCH (504,000 x 63,000 pixels and the average number of object points are 27,000).To improve quality of CGCH reconstructed image, fringe pattern requires higher spatial frequency and resolution.Therefore, to increase the calculation speed, we have to change the calculation method. In this paper, to reduce the calculation time of CGCH (912,000 x 108,000 pixels), we employ Graphics Processing Unit (GPU).It took 4,406 hours to calculate high resolution CGCH on Xeon 3.4 GHz.Since GPU has many streaming processors and a parallel processing structure, GPU works as the high performance parallel processor.In addition, GPU gives max performance to 2 dimensional data and streaming data.Recently, GPU can be utilized for the general purpose (GPGPU).For example, NVIDIA's GeForce7 series became a programmable processor with Cg programming language.Next GeForce8 series have CUDA as software development kit made by NVIDIA.Theoretically, calculation ability of GPU is announced as 500 GFLOPS. From the experimental result, we have achieved that 47 times faster calculation compared with our previous work which used CPU.Therefore, CGCH can be generated in 95 hours.So, total time is 110 hours to calculate and print the CGCH.

  13. A comparative study of history-based versus vectorized Monte Carlo methods in the GPU/CUDA environment for a simple neutron eigenvalue problem

    NASA Astrophysics Data System (ADS)

    Liu, Tianyu; Du, Xining; Ji, Wei; Xu, X. George; Brown, Forrest B.

    2014-06-01

    For nuclear reactor analysis such as the neutron eigenvalue calculations, the time consuming Monte Carlo (MC) simulations can be accelerated by using graphics processing units (GPUs). However, traditional MC methods are often history-based, and their performance on GPUs is affected significantly by the thread divergence problem. In this paper we describe the development of a newly designed event-based vectorized MC algorithm for solving the neutron eigenvalue problem. The code was implemented using NVIDIA's Compute Unified Device Architecture (CUDA), and tested on a NVIDIA Tesla M2090 GPU card. We found that although the vectorized MC algorithm greatly reduces the occurrence of thread divergence thus enhancing the warp execution efficiency, the overall simulation speed is roughly ten times slower than the history-based MC code on GPUs. Profiling results suggest that the slow speed is probably due to the memory access latency caused by the large amount of global memory transactions. Possible solutions to improve the code efficiency are discussed.

  14. Computer simulations and real-time control of ELT AO systems using graphical processing units

    NASA Astrophysics Data System (ADS)

    Wang, Lianqi; Ellerbroek, Brent

    2012-07-01

    The adaptive optics (AO) simulations at the Thirty Meter Telescope (TMT) have been carried out using the efficient, C based multi-threaded adaptive optics simulator (MAOS, http://github.com/lianqiw/maos). By porting time-critical parts of MAOS to graphical processing units (GPU) using NVIDIA CUDA technology, we achieved a 10 fold speed up for each GTX 580 GPU used compared to a modern quad core CPU. Each time step of full scale end to end simulation for the TMT narrow field infrared AO system (NFIRAOS) takes only 0.11 second in a desktop with two GTX 580s. We also demonstrate that the TMT minimum variance reconstructor can be assembled in matrix vector multiply (MVM) format in 8 seconds with 8 GTX 580 GPUs, meeting the TMT requirement for updating the reconstructor. Analysis show that it is also possible to apply the MVM using 8 GTX 580s within the required latency.

  15. Nanoscale multireference quantum chemistry: full configuration interaction on graphical processing units.

    PubMed

    Fales, B Scott; Levine, Benjamin G

    2015-10-13

    Methods based on a full configuration interaction (FCI) expansion in an active space of orbitals are widely used for modeling chemical phenomena such as bond breaking, multiply excited states, and conical intersections in small-to-medium-sized molecules, but these phenomena occur in systems of all sizes. To scale such calculations up to the nanoscale, we have developed an implementation of FCI in which electron repulsion integral transformation and several of the more expensive steps in σ vector formation are performed on graphical processing unit (GPU) hardware. When applied to a 1.7 × 1.4 × 1.4 nm silicon nanoparticle (Si72H64) described with the polarized, all-electron 6-31G** basis set, our implementation can solve for the ground state of the 16-active-electron/16-active-orbital CASCI Hamiltonian (more than 100,000,000 configurations) in 39 min on a single NVidia K40 GPU.

  16. Classification of hyperspectral imagery using MapReduce on a NVIDIA graphics processing unit (Conference Presentation)

    NASA Astrophysics Data System (ADS)

    Ramirez, Andres; Rahnemoonfar, Maryam

    2017-04-01

    A hyperspectral image provides multidimensional figure rich in data consisting of hundreds of spectral dimensions. Analyzing the spectral and spatial information of such image with linear and non-linear algorithms will result in high computational time. In order to overcome this problem, this research presents a system using a MapReduce-Graphics Processing Unit (GPU) model that can help analyzing a hyperspectral image through the usage of parallel hardware and a parallel programming model, which will be simpler to handle compared to other low-level parallel programming models. Additionally, Hadoop was used as an open-source version of the MapReduce parallel programming model. This research compared classification accuracy results and timing results between the Hadoop and GPU system and tested it against the following test cases: the CPU and GPU test case, a CPU test case and a test case where no dimensional reduction was applied.

  17. Fast parallel tandem mass spectral library searching using GPU hardware acceleration.

    PubMed

    Baumgardner, Lydia Ashleigh; Shanmugam, Avinash Kumar; Lam, Henry; Eng, Jimmy K; Martin, Daniel B

    2011-06-03

    Mass spectrometry-based proteomics is a maturing discipline of biologic research that is experiencing substantial growth. Instrumentation has steadily improved over time with the advent of faster and more sensitive instruments collecting ever larger data files. Consequently, the computational process of matching a peptide fragmentation pattern to its sequence, traditionally accomplished by sequence database searching and more recently also by spectral library searching, has become a bottleneck in many mass spectrometry experiments. In both of these methods, the main rate-limiting step is the comparison of an acquired spectrum with all potential matches from a spectral library or sequence database. This is a highly parallelizable process because the core computational element can be represented as a simple but arithmetically intense multiplication of two vectors. In this paper, we present a proof of concept project taking advantage of the massively parallel computing available on graphics processing units (GPUs) to distribute and accelerate the process of spectral assignment using spectral library searching. This program, which we have named FastPaSS (for Fast Parallelized Spectral Searching), is implemented in CUDA (Compute Unified Device Architecture) from NVIDIA, which allows direct access to the processors in an NVIDIA GPU. Our efforts demonstrate the feasibility of GPU computing for spectral assignment, through implementation of the validated spectral searching algorithm SpectraST in the CUDA environment.

  18. Fast, multi-channel real-time processing of signals with microsecond latency using graphics processing units.

    PubMed

    Rath, N; Kato, S; Levesque, J P; Mauel, M E; Navratil, G A; Peng, Q

    2014-04-01

    Fast, digital signal processing (DSP) has many applications. Typical hardware options for performing DSP are field-programmable gate arrays (FPGAs), application-specific integrated DSP chips, or general purpose personal computer systems. This paper presents a novel DSP platform that has been developed for feedback control on the HBT-EP tokamak device. The system runs all signal processing exclusively on a Graphics Processing Unit (GPU) to achieve real-time performance with latencies below 8 μs. Signals are transferred into and out of the GPU using PCI Express peer-to-peer direct-memory-access transfers without involvement of the central processing unit or host memory. Tests were performed on the feedback control system of the HBT-EP tokamak using forty 16-bit floating point inputs and outputs each and a sampling rate of up to 250 kHz. Signals were digitized by a D-TACQ ACQ196 module, processing done on an NVIDIA GTX 580 GPU programmed in CUDA, and analog output was generated by D-TACQ AO32CPCI modules.

  19. Particle-in-cell simulations with charge-conserving current deposition on graphic processing units

    NASA Astrophysics Data System (ADS)

    Ren, Chuang; Kong, Xianglong; Huang, Michael; Decyk, Viktor; Mori, Warren

    2011-10-01

    Recently using CUDA, we have developed an electromagnetic Particle-in-Cell (PIC) code with charge-conserving current deposition for Nvidia graphic processing units (GPU's) (Kong et al., Journal of Computational Physics 230, 1676 (2011). On a Tesla M2050 (Fermi) card, the GPU PIC code can achieve a one-particle-step process time of 1.2 - 3.2 ns in 2D and 2.3 - 7.2 ns in 3D, depending on plasma temperatures. In this talk we will discuss novel algorithms for GPU-PIC including charge-conserving current deposition scheme with few branching and parallel particle sorting. These algorithms have made efficient use of the GPU shared memory. We will also discuss how to replace the computation kernels of existing parallel CPU codes while keeping their parallel structures. This work was supported by U.S. Department of Energy under Grant Nos. DE-FG02-06ER54879 and DE-FC02-04ER54789 and by NSF under Grant Nos. PHY-0903797 and CCF-0747324.

  20. Exploiting graphics processing units for computational biology and bioinformatics.

    PubMed

    Payne, Joshua L; Sinnott-Armstrong, Nicholas A; Moore, Jason H

    2010-09-01

    Advances in the video gaming industry have led to the production of low-cost, high-performance graphics processing units (GPUs) that possess more memory bandwidth and computational capability than central processing units (CPUs), the standard workhorses of scientific computing. With the recent release of generalpurpose GPUs and NVIDIA's GPU programming language, CUDA, graphics engines are being adopted widely in scientific computing applications, particularly in the fields of computational biology and bioinformatics. The goal of this article is to concisely present an introduction to GPU hardware and programming, aimed at the computational biologist or bioinformaticist. To this end, we discuss the primary differences between GPU and CPU architecture, introduce the basics of the CUDA programming language, and discuss important CUDA programming practices, such as the proper use of coalesced reads, data types, and memory hierarchies. We highlight each of these topics in the context of computing the all-pairs distance between instances in a dataset, a common procedure in numerous disciplines of scientific computing. We conclude with a runtime analysis of the GPU and CPU implementations of the all-pairs distance calculation. We show our final GPU implementation to outperform the CPU implementation by a factor of 1700.

  1. Real-time blood flow visualization using the graphics processing unit

    NASA Astrophysics Data System (ADS)

    Yang, Owen; Cuccia, David; Choi, Bernard

    2011-01-01

    Laser speckle imaging (LSI) is a technique in which coherent light incident on a surface produces a reflected speckle pattern that is related to the underlying movement of optical scatterers, such as red blood cells, indicating blood flow. Image-processing algorithms can be applied to produce speckle flow index (SFI) maps of relative blood flow. We present a novel algorithm that employs the NVIDIA Compute Unified Device Architecture (CUDA) platform to perform laser speckle image processing on the graphics processing unit. Software written in C was integrated with CUDA and integrated into a LabVIEW Virtual Instrument (VI) that is interfaced with a monochrome CCD camera able to acquire high-resolution raw speckle images at nearly 10 fps. With the CUDA code integrated into the LabVIEW VI, the processing and display of SFI images were performed also at ~10 fps. We present three video examples depicting real-time flow imaging during a reactive hyperemia maneuver, with fluid flow through an in vitro phantom, and a demonstration of real-time LSI during laser surgery of a port wine stain birthmark.

  2. Real-time blood flow visualization using the graphics processing unit

    PubMed Central

    Yang, Owen; Cuccia, David; Choi, Bernard

    2011-01-01

    Laser speckle imaging (LSI) is a technique in which coherent light incident on a surface produces a reflected speckle pattern that is related to the underlying movement of optical scatterers, such as red blood cells, indicating blood flow. Image-processing algorithms can be applied to produce speckle flow index (SFI) maps of relative blood flow. We present a novel algorithm that employs the NVIDIA Compute Unified Device Architecture (CUDA) platform to perform laser speckle image processing on the graphics processing unit. Software written in C was integrated with CUDA and integrated into a LabVIEW Virtual Instrument (VI) that is interfaced with a monochrome CCD camera able to acquire high-resolution raw speckle images at nearly 10 fps. With the CUDA code integrated into the LabVIEW VI, the processing and display of SFI images were performed also at ∼10 fps. We present three video examples depicting real-time flow imaging during a reactive hyperemia maneuver, with fluid flow through an in vitro phantom, and a demonstration of real-time LSI during laser surgery of a port wine stain birthmark. PMID:21280915

  3. Improving Quantum Gate Simulation using a GPU

    NASA Astrophysics Data System (ADS)

    Gutierrez, Eladio; Romero, Sergio; Trenas, Maria A.; Zapata, Emilio L.

    2008-11-01

    Due to the increasing computing power of the graphics processing units (GPU), they are becoming more and more popular when solving general purpose algorithms. As the simulation of quantum computers results on a problem with exponential complexity, it is advisable to perform a parallel computation, such as the one provided by the SIMD multiprocessors present in recent GPUs. In this paper, we focus on an important quantum algorithm, the quantum Fourier transform (QTF), in order to evaluate different parallelization strategies on a novel GPU architecture. Our implementation makes use of the new CUDA software/hardware architecture developed recently by NVIDIA.

  4. Evaluation of an Adaptive Automation Trigger Based on Task Performance, Priority, and Frequency

    DTIC Science & Technology

    2013-06-01

    with dual Intel ® Xeon ® CPU x5550 processors @ 2.67 GHz each, 12.0 GB RAM, and a 1.5 GB PCIe nVidia Quadro FX 4800 graphics card (Microsoft...Cole Publishing Company . Miller, C. A., & Parasuraman, R. (2007). Designing for flexible interaction between humans and automation: Delegation

  5. GPU-based cone beam computed tomography.

    PubMed

    Noël, Peter B; Walczak, Alan M; Xu, Jinhui; Corso, Jason J; Hoffmann, Kenneth R; Schafer, Sebastian

    2010-06-01

    The use of cone beam computed tomography (CBCT) is growing in the clinical arena due to its ability to provide 3D information during interventions, its high diagnostic quality (sub-millimeter resolution), and its short scanning times (60 s). In many situations, the short scanning time of CBCT is followed by a time-consuming 3D reconstruction. The standard reconstruction algorithm for CBCT data is the filtered backprojection, which for a volume of size 256(3) takes up to 25 min on a standard system. Recent developments in the area of Graphic Processing Units (GPUs) make it possible to have access to high-performance computing solutions at a low cost, allowing their use in many scientific problems. We have implemented an algorithm for 3D reconstruction of CBCT data using the Compute Unified Device Architecture (CUDA) provided by NVIDIA (NVIDIA Corporation, Santa Clara, California), which was executed on a NVIDIA GeForce GTX 280. Our implementation results in improved reconstruction times from minutes, and perhaps hours, to a matter of seconds, while also giving the clinician the ability to view 3D volumetric data at higher resolutions. We evaluated our implementation on ten clinical data sets and one phantom data set to observe if differences occur between CPU and GPU-based reconstructions. By using our approach, the computation time for 256(3) is reduced from 25 min on the CPU to 3.2 s on the GPU. The GPU reconstruction time for 512(3) volumes is 8.5 s. Copyright 2009 Elsevier Ireland Ltd. All rights reserved.

  6. Parallel fuzzy connected image segmentation on GPU.

    PubMed

    Zhuge, Ying; Cao, Yong; Udupa, Jayaram K; Miller, Robert W

    2011-07-01

    Image segmentation techniques using fuzzy connectedness (FC) principles have shown their effectiveness in segmenting a variety of objects in several large applications. However, one challenge in these algorithms has been their excessive computational requirements when processing large image datasets. Nowadays, commodity graphics hardware provides a highly parallel computing environment. In this paper, the authors present a parallel fuzzy connected image segmentation algorithm implementation on NVIDIA's compute unified device Architecture (CUDA) platform for segmenting medical image data sets. In the FC algorithm, there are two major computational tasks: (i) computing the fuzzy affinity relations and (ii) computing the fuzzy connectedness relations. These two tasks are implemented as CUDA kernels and executed on GPU. A dramatic improvement in speed for both tasks is achieved as a result. Our experiments based on three data sets of small, medium, and large data size demonstrate the efficiency of the parallel algorithm, which achieves a speed-up factor of 24.4x, 18.1x, and 10.3x, correspondingly, for the three data sets on the NVIDIA Tesla C1060 over the implementation of the algorithm on CPU, and takes 0.25, 0.72, and 15.04 s, correspondingly, for the three data sets. The authors developed a parallel algorithm of the widely used fuzzy connected image segmentation method on the NVIDIA GPUs, which are far more cost- and speed-effective than both cluster of workstations and multiprocessing systems. A near-interactive speed of segmentation has been achieved, even for the large data set.

  7. GPU accelerated Monte Carlo simulation of Brownian motors dynamics with CUDA

    NASA Astrophysics Data System (ADS)

    Spiechowicz, J.; Kostur, M.; Machura, L.

    2015-06-01

    This work presents an updated and extended guide on methods of a proper acceleration of the Monte Carlo integration of stochastic differential equations with the commonly available NVIDIA Graphics Processing Units using the CUDA programming environment. We outline the general aspects of the scientific computing on graphics cards and demonstrate them with two models of a well known phenomenon of the noise induced transport of Brownian motors in periodic structures. As a source of fluctuations in the considered systems we selected the three most commonly occurring noises: the Gaussian white noise, the white Poissonian noise and the dichotomous process also known as a random telegraph signal. The detailed discussion on various aspects of the applied numerical schemes is also presented. The measured speedup can be of the astonishing order of about 3000 when compared to a typical CPU. This number significantly expands the range of problems solvable by use of stochastic simulations, allowing even an interactive research in some cases.

  8. FPGA Implementation of the Coupled Filtering Method and the Affine Warping Method.

    PubMed

    Zhang, Chen; Liang, Tianzhu; Mok, Philip K T; Yu, Weichuan

    2017-07-01

    In ultrasound image analysis, the speckle tracking methods are widely applied to study the elasticity of body tissue. However, "feature-motion decorrelation" still remains as a challenge for the speckle tracking methods. Recently, a coupled filtering method and an affine warping method were proposed to accurately estimate strain values, when the tissue deformation is large. The major drawback of these methods is the high computational complexity. Even the graphics processing unit (GPU)-based program requires a long time to finish the analysis. In this paper, we propose field-programmable gate array (FPGA)-based implementations of both methods for further acceleration. The capability of FPGAs on handling different image processing components in these methods is discussed. A fast and memory-saving image warping approach is proposed. The algorithms are reformulated to build a highly efficient pipeline on FPGA. The final implementations on a Xilinx Virtex-7 FPGA are at least 13 times faster than the GPU implementation on the NVIDIA graphic card (GeForce GTX 580).

  9. Efficient Acceleration of the Pair-HMMs Forward Algorithm for GATK HaplotypeCaller on Graphics Processing Units.

    PubMed

    Ren, Shanshan; Bertels, Koen; Al-Ars, Zaid

    2018-01-01

    GATK HaplotypeCaller (HC) is a popular variant caller, which is widely used to identify variants in complex genomes. However, due to its high variants detection accuracy, it suffers from long execution time. In GATK HC, the pair-HMMs forward algorithm accounts for a large percentage of the total execution time. This article proposes to accelerate the pair-HMMs forward algorithm on graphics processing units (GPUs) to improve the performance of GATK HC. This article presents several GPU-based implementations of the pair-HMMs forward algorithm. It also analyzes the performance bottlenecks of the implementations on an NVIDIA Tesla K40 card with various data sets. Based on these results and the characteristics of GATK HC, we are able to identify the GPU-based implementations with the highest performance for the various analyzed data sets. Experimental results show that the GPU-based implementations of the pair-HMMs forward algorithm achieve a speedup of up to 5.47× over existing GPU-based implementations.

  10. Fast parallel tandem mass spectral library searching using GPU hardware acceleration

    PubMed Central

    Baumgardner, Lydia Ashleigh; Shanmugam, Avinash Kumar; Lam, Henry; Eng, Jimmy K.; Martin, Daniel B.

    2011-01-01

    Mass spectrometry-based proteomics is a maturing discipline of biologic research that is experiencing substantial growth. Instrumentation has steadily improved over time with the advent of faster and more sensitive instruments collecting ever larger data files. Consequently, the computational process of matching a peptide fragmentation pattern to its sequence, traditionally accomplished by sequence database searching and more recently also by spectral library searching, has become a bottleneck in many mass spectrometry experiments. In both of these methods, the main rate limiting step is the comparison of an acquired spectrum with all potential matches from a spectral library or sequence database. This is a highly parallelizable process because the core computational element can be represented as a simple but arithmetically intense multiplication of two vectors. In this paper we present a proof of concept project taking advantage of the massively parallel computing available on graphics processing units (GPUs) to distribute and accelerate the process of spectral assignment using spectral library searching. This program, which we have named FastPaSS (for Fast Parallelized Spectral Searching) is implemented in CUDA (Compute Unified Device Architecture) from NVIDIA which allows direct access to the processors in an NVIDIA GPU. Our efforts demonstrate the feasibility of GPU computing for spectral assignment, through implementation of the validated spectral searching algorithm SpectraST in the CUDA environment. PMID:21545112

  11. The CUBLAS and CULA based GPU acceleration of adaptive finite element framework for bioluminescence tomography.

    PubMed

    Zhang, Bo; Yang, Xiang; Yang, Fei; Yang, Xin; Qin, Chenghu; Han, Dong; Ma, Xibo; Liu, Kai; Tian, Jie

    2010-09-13

    In molecular imaging (MI), especially the optical molecular imaging, bioluminescence tomography (BLT) emerges as an effective imaging modality for small animal imaging. The finite element methods (FEMs), especially the adaptive finite element (AFE) framework, play an important role in BLT. The processing speed of the FEMs and the AFE framework still needs to be improved, although the multi-thread CPU technology and the multi CPU technology have already been applied. In this paper, we for the first time introduce a new kind of acceleration technology to accelerate the AFE framework for BLT, using the graphics processing unit (GPU). Besides the processing speed, the GPU technology can get a balance between the cost and performance. The CUBLAS and CULA are two main important and powerful libraries for programming on NVIDIA GPUs. With the help of CUBLAS and CULA, it is easy to code on NVIDIA GPU and there is no need to worry about the details about the hardware environment of a specific GPU. The numerical experiments are designed to show the necessity, effect and application of the proposed CUBLAS and CULA based GPU acceleration. From the results of the experiments, we can reach the conclusion that the proposed CUBLAS and CULA based GPU acceleration method can improve the processing speed of the AFE framework very much while getting a balance between cost and performance.

  12. Embedded-Based Graphics Processing Unit Cluster Platform for Multiple Sequence Alignments

    PubMed Central

    Wei, Jyh-Da; Cheng, Hui-Jun; Lin, Chun-Yuan; Ye, Jin; Yeh, Kuan-Yu

    2017-01-01

    High-end graphics processing units (GPUs), such as NVIDIA Tesla/Fermi/Kepler series cards with thousands of cores per chip, are widely applied to high-performance computing fields in a decade. These desktop GPU cards should be installed in personal computers/servers with desktop CPUs, and the cost and power consumption of constructing a GPU cluster platform are very high. In recent years, NVIDIA releases an embedded board, called Jetson Tegra K1 (TK1), which contains 4 ARM Cortex-A15 CPUs and 192 Compute Unified Device Architecture cores (belong to Kepler GPUs). Jetson Tegra K1 has several advantages, such as the low cost, low power consumption, and high applicability, and it has been applied into several specific applications. In our previous work, a bioinformatics platform with a single TK1 (STK platform) was constructed, and this previous work is also used to prove that the Web and mobile services can be implemented in the STK platform with a good cost-performance ratio by comparing a STK platform with the desktop CPU and GPU. In this work, an embedded-based GPU cluster platform will be constructed with multiple TK1s (MTK platform). Complex system installation and setup are necessary procedures at first. Then, 2 job assignment modes are designed for the MTK platform to provide services for users. Finally, ClustalW v2.0.11 and ClustalWtk will be ported to the MTK platform. The experimental results showed that the speedup ratios achieved 5.5 and 4.8 times for ClustalW v2.0.11 and ClustalWtk, respectively, by comparing 6 TK1s with a single TK1. The MTK platform is proven to be useful for multiple sequence alignments. PMID:28835734

  13. Embedded-Based Graphics Processing Unit Cluster Platform for Multiple Sequence Alignments.

    PubMed

    Wei, Jyh-Da; Cheng, Hui-Jun; Lin, Chun-Yuan; Ye, Jin; Yeh, Kuan-Yu

    2017-01-01

    High-end graphics processing units (GPUs), such as NVIDIA Tesla/Fermi/Kepler series cards with thousands of cores per chip, are widely applied to high-performance computing fields in a decade. These desktop GPU cards should be installed in personal computers/servers with desktop CPUs, and the cost and power consumption of constructing a GPU cluster platform are very high. In recent years, NVIDIA releases an embedded board, called Jetson Tegra K1 (TK1), which contains 4 ARM Cortex-A15 CPUs and 192 Compute Unified Device Architecture cores (belong to Kepler GPUs). Jetson Tegra K1 has several advantages, such as the low cost, low power consumption, and high applicability, and it has been applied into several specific applications. In our previous work, a bioinformatics platform with a single TK1 (STK platform) was constructed, and this previous work is also used to prove that the Web and mobile services can be implemented in the STK platform with a good cost-performance ratio by comparing a STK platform with the desktop CPU and GPU. In this work, an embedded-based GPU cluster platform will be constructed with multiple TK1s (MTK platform). Complex system installation and setup are necessary procedures at first. Then, 2 job assignment modes are designed for the MTK platform to provide services for users. Finally, ClustalW v2.0.11 and ClustalWtk will be ported to the MTK platform. The experimental results showed that the speedup ratios achieved 5.5 and 4.8 times for ClustalW v2.0.11 and ClustalWtk, respectively, by comparing 6 TK1s with a single TK1. The MTK platform is proven to be useful for multiple sequence alignments.

  14. Computational algorithms for simulations in atmospheric optics.

    PubMed

    Konyaev, P A; Lukin, V P

    2016-04-20

    A computer simulation technique for atmospheric and adaptive optics based on parallel programing is discussed. A parallel propagation algorithm is designed and a modified spectral-phase method for computer generation of 2D time-variant random fields is developed. Temporal power spectra of Laguerre-Gaussian beam fluctuations are considered as an example to illustrate the applications discussed. Implementation of the proposed algorithms using Intel MKL and IPP libraries and NVIDIA CUDA technology is shown to be very fast and accurate. The hardware system for the computer simulation is an off-the-shelf desktop with an Intel Core i7-4790K CPU operating at a turbo-speed frequency up to 5 GHz and an NVIDIA GeForce GTX-960 graphics accelerator with 1024 1.5 GHz processors.

  15. MULTI-CORE AND OPTICAL PROCESSOR RELATED APPLICATIONS RESEARCH AT OAK RIDGE NATIONAL LABORATORY

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Barhen, Jacob; Kerekes, Ryan A; ST Charles, Jesse Lee

    2008-01-01

    High-speed parallelization of common tasks holds great promise as a low-risk approach to achieving the significant increases in signal processing and computational performance required for next generation innovations in reconfigurable radio systems. Researchers at the Oak Ridge National Laboratory have been working on exploiting the parallelization offered by this emerging technology and applying it to a variety of problems. This paper will highlight recent experience with four different parallel processors applied to signal processing tasks that are directly relevant to signal processing required for SDR/CR waveforms. The first is the EnLight Optical Core Processor applied to matched filter (MF) correlationmore » processing via fast Fourier transform (FFT) of broadband Dopplersensitive waveforms (DSW) using active sonar arrays for target tracking. The second is the IBM CELL Broadband Engine applied to 2-D discrete Fourier transform (DFT) kernel for image processing and frequency domain processing. And the third is the NVIDIA graphical processor applied to document feature clustering. EnLight Optical Core Processor. Optical processing is inherently capable of high-parallelism that can be translated to very high performance, low power dissipation computing. The EnLight 256 is a small form factor signal processing chip (5x5 cm2) with a digital optical core that is being developed by an Israeli startup company. As part of its evaluation of foreign technology, ORNL's Center for Engineering Science Advanced Research (CESAR) had access to a precursor EnLight 64 Alpha hardware for a preliminary assessment of capabilities in terms of large Fourier transforms for matched filter banks and on applications related to Doppler-sensitive waveforms. This processor is optimized for array operations, which it performs in fixed-point arithmetic at the rate of 16 TeraOPS at 8-bit precision. This is approximately 1000 times faster than the fastest DSP available today. The optical core performs the matrix-vector multiplications, where the nominal matrix size is 256x256. The system clock is 125MHz. At each clock cycle, 128K multiply-and-add operations per second (OPS) are carried out, which yields a peak performance of 16 TeraOPS. IBM Cell Broadband Engine. The Cell processor is the extraordinary resulting product of 5 years of sustained, intensive R&D collaboration (involving over $400M investment) between IBM, Sony, and Toshiba. Its architecture comprises one multithreaded 64-bit PowerPC processor element (PPE) with VMX capabilities and two levels of globally coherent cache, and 8 synergistic processor elements (SPEs). Each SPE consists of a processor (SPU) designed for streaming workloads, local memory, and a globally coherent direct memory access (DMA) engine. Computations are performed in 128-bit wide single instruction multiple data streams (SIMD). An integrated high-bandwidth element interconnect bus (EIB) connects the nine processors and their ports to external memory and to system I/O. The Applied Software Engineering Research (ASER) Group at the ORNL is applying the Cell to a variety of text and image analysis applications. Research on Cell-equipped PlayStation3 (PS3) consoles has led to the development of a correlation-based image recognition engine that enables a single PS3 to process images at more than 10X the speed of state-of-the-art single-core processors. NVIDIA Graphics Processing Units. The ASER group is also employing the latest NVIDIA graphical processing units (GPUs) to accelerate clustering of thousands of text documents using recently developed clustering algorithms such as document flocking and affinity propagation.« less

  16. Hybrid parallel computing architecture for multiview phase shifting

    NASA Astrophysics Data System (ADS)

    Zhong, Kai; Li, Zhongwei; Zhou, Xiaohui; Shi, Yusheng; Wang, Congjun

    2014-11-01

    The multiview phase-shifting method shows its powerful capability in achieving high resolution three-dimensional (3-D) shape measurement. Unfortunately, this ability results in very high computation costs and 3-D computations have to be processed offline. To realize real-time 3-D shape measurement, a hybrid parallel computing architecture is proposed for multiview phase shifting. In this architecture, the central processing unit can co-operate with the graphic processing unit (GPU) to achieve hybrid parallel computing. The high computation cost procedures, including lens distortion rectification, phase computation, correspondence, and 3-D reconstruction, are implemented in GPU, and a three-layer kernel function model is designed to simultaneously realize coarse-grained and fine-grained paralleling computing. Experimental results verify that the developed system can perform 50 fps (frame per second) real-time 3-D measurement with 260 K 3-D points per frame. A speedup of up to 180 times is obtained for the performance of the proposed technique using a NVIDIA GT560Ti graphics card rather than a sequential C in a 3.4 GHZ Inter Core i7 3770.

  17. Accelerating Monte Carlo simulations of photon transport in a voxelized geometry using a massively parallel graphics processing unit.

    PubMed

    Badal, Andreu; Badano, Aldo

    2009-11-01

    It is a known fact that Monte Carlo simulations of radiation transport are computationally intensive and may require long computing times. The authors introduce a new paradigm for the acceleration of Monte Carlo simulations: The use of a graphics processing unit (GPU) as the main computing device instead of a central processing unit (CPU). A GPU-based Monte Carlo code that simulates photon transport in a voxelized geometry with the accurate physics models from PENELOPE has been developed using the CUDATM programming model (NVIDIA Corporation, Santa Clara, CA). An outline of the new code and a sample x-ray imaging simulation with an anthropomorphic phantom are presented. A remarkable 27-fold speed up factor was obtained using a GPU compared to a single core CPU. The reported results show that GPUs are currently a good alternative to CPUs for the simulation of radiation transport. Since the performance of GPUs is currently increasing at a faster pace than that of CPUs, the advantages of GPU-based software are likely to be more pronounced in the future.

  18. Practical Implementation of Prestack Kirchhoff Time Migration on a General Purpose Graphics Processing Unit

    NASA Astrophysics Data System (ADS)

    Liu, Guofeng; Li, Chun

    2016-08-01

    In this study, we present a practical implementation of prestack Kirchhoff time migration (PSTM) on a general purpose graphic processing unit. First, we consider the three main optimizations of the PSTM GPU code, i.e., designing a configuration based on a reasonable execution, using the texture memory for velocity interpolation, and the application of an intrinsic function in device code. This approach can achieve a speedup of nearly 45 times on a NVIDIA GTX 680 GPU compared with CPU code when a larger imaging space is used, where the PSTM output is a common reflection point that is gathered as I[ nx][ ny][ nh][ nt] in matrix format. However, this method requires more memory space so the limited imaging space cannot fully exploit the GPU sources. To overcome this problem, we designed a PSTM scheme with multi-GPUs for imaging different seismic data on different GPUs using an offset value. This process can achieve the peak speedup of GPU PSTM code and it greatly increases the efficiency of the calculations, but without changing the imaging result.

  19. Simultaneous Range-Velocity Processing and SNR Analysis of AFIT’s Random Noise Radar

    DTIC Science & Technology

    2012-03-22

    reducing the overall processing time. Two computers, equipped with NVIDIA ® GPUs, were used to process the col- 45 lected data. The specifications for each...gather the results back to the CPU. Another company , AccelerEyes®, has developed a product called Jacket® that claims to be better than the parallel...Number of Processing Cores 4 8 Processor Speed 3.33 GHz 3.07 GHz Installed Memory 48 GB 48 GB GPU Make NVIDIA NVIDIA GPU Model Tesla 1060 Tesla C2070 GPU

  20. Fast quantum Monte Carlo on a GPU

    NASA Astrophysics Data System (ADS)

    Lutsyshyn, Y.

    2015-02-01

    We present a scheme for the parallelization of quantum Monte Carlo method on graphical processing units, focusing on variational Monte Carlo simulation of bosonic systems. We use asynchronous execution schemes with shared memory persistence, and obtain an excellent utilization of the accelerator. The CUDA code is provided along with a package that simulates liquid helium-4. The program was benchmarked on several models of Nvidia GPU, including Fermi GTX560 and M2090, and the Kepler architecture K20 GPU. Special optimization was developed for the Kepler cards, including placement of data structures in the register space of the Kepler GPUs. Kepler-specific optimization is discussed.

  1. Gibraltar v 1.0

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    CURRY, MATTHEW LEON; WARD, H. LEE; & SKJELLUM, ANTHONY

    Gibraltar is a library and associated test suite which performs Reed-Solomon coding and decoding of data buffers using graphics processing units which support NVIDIA's CUDA technology. This library is used to generate redundant data allowing for recovery of lost information. For example, a user can generate m new blocks of data from n original blocks, distributing those pieces over n+m devices. If any m devices fail, the contents of those devices can be recovered from the contents of the other n devices, even if some of the original blocks are lost. This is a generalized description of RAID, a techniquemore » for increasing data storage speed and size.« less

  2. Accelerating Advanced MRI Reconstructions on GPUs.

    PubMed

    Stone, S S; Haldar, J P; Tsao, S C; Hwu, W-M W; Sutton, B P; Liang, Z-P

    2008-10-01

    Computational acceleration on graphics processing units (GPUs) can make advanced magnetic resonance imaging (MRI) reconstruction algorithms attractive in clinical settings, thereby improving the quality of MR images across a broad spectrum of applications. This paper describes the acceleration of such an algorithm on NVIDIA's Quadro FX 5600. The reconstruction of a 3D image with 128(3) voxels achieves up to 180 GFLOPS and requires just over one minute on the Quadro, while reconstruction on a quad-core CPU is twenty-one times slower. Furthermore, relative to the true image, the error exhibited by the advanced reconstruction is only 12%, while conventional reconstruction techniques incur error of 42%.

  3. CELES: CUDA-accelerated simulation of electromagnetic scattering by large ensembles of spheres

    NASA Astrophysics Data System (ADS)

    Egel, Amos; Pattelli, Lorenzo; Mazzamuto, Giacomo; Wiersma, Diederik S.; Lemmer, Uli

    2017-09-01

    CELES is a freely available MATLAB toolbox to simulate light scattering by many spherical particles. Aiming at high computational performance, CELES leverages block-diagonal preconditioning, a lookup-table approach to evaluate costly functions and massively parallel execution on NVIDIA graphics processing units using the CUDA computing platform. The combination of these techniques allows to efficiently address large electrodynamic problems (>104 scatterers) on inexpensive consumer hardware. In this paper, we validate near- and far-field distributions against the well-established multi-sphere T-matrix (MSTM) code and discuss the convergence behavior for ensembles of different sizes, including an exemplary system comprising 105 particles.

  4. NVIDIA OptiX ray-tracing engine as a new tool for modelling medical imaging systems

    NASA Astrophysics Data System (ADS)

    Pietrzak, Jakub; Kacperski, Krzysztof; Cieślar, Marek

    2015-03-01

    The most accurate technique to model the X- and gamma radiation path through a numerically defined object is the Monte Carlo simulation which follows single photons according to their interaction probabilities. A simplified and much faster approach, which just integrates total interaction probabilities along selected paths, is known as ray tracing. Both techniques are used in medical imaging for simulating real imaging systems and as projectors required in iterative tomographic reconstruction algorithms. These approaches are ready for massive parallel implementation e.g. on Graphics Processing Units (GPU), which can greatly accelerate the computation time at a relatively low cost. In this paper we describe the application of the NVIDIA OptiX ray-tracing engine, popular in professional graphics and rendering applications, as a new powerful tool for X- and gamma ray-tracing in medical imaging. It allows the implementation of a variety of physical interactions of rays with pixel-, mesh- or nurbs-based objects, and recording any required quantities, like path integrals, interaction sites, deposited energies, and others. Using the OptiX engine we have implemented a code for rapid Monte Carlo simulations of Single Photon Emission Computed Tomography (SPECT) imaging, as well as the ray-tracing projector, which can be used in reconstruction algorithms. The engine generates efficient, scalable and optimized GPU code, ready to run on multi GPU heterogeneous systems. We have compared the results our simulations with the GATE package. With the OptiX engine the computation time of a Monte Carlo simulation can be reduced from days to minutes.

  5. Parallel hyperbolic PDE simulation on clusters: Cell versus GPU

    NASA Astrophysics Data System (ADS)

    Rostrup, Scott; De Sterck, Hans

    2010-12-01

    Increasingly, high-performance computing is looking towards data-parallel computational devices to enhance computational performance. Two technologies that have received significant attention are IBM's Cell Processor and NVIDIA's CUDA programming model for graphics processing unit (GPU) computing. In this paper we investigate the acceleration of parallel hyperbolic partial differential equation simulation on structured grids with explicit time integration on clusters with Cell and GPU backends. The message passing interface (MPI) is used for communication between nodes at the coarsest level of parallelism. Optimizations of the simulation code at the several finer levels of parallelism that the data-parallel devices provide are described in terms of data layout, data flow and data-parallel instructions. Optimized Cell and GPU performance are compared with reference code performance on a single x86 central processing unit (CPU) core in single and double precision. We further compare the CPU, Cell and GPU platforms on a chip-to-chip basis, and compare performance on single cluster nodes with two CPUs, two Cell processors or two GPUs in a shared memory configuration (without MPI). We finally compare performance on clusters with 32 CPUs, 32 Cell processors, and 32 GPUs using MPI. Our GPU cluster results use NVIDIA Tesla GPUs with GT200 architecture, but some preliminary results on recently introduced NVIDIA GPUs with the next-generation Fermi architecture are also included. This paper provides computational scientists and engineers who are considering porting their codes to accelerator environments with insight into how structured grid based explicit algorithms can be optimized for clusters with Cell and GPU accelerators. It also provides insight into the speed-up that may be gained on current and future accelerator architectures for this class of applications. Program summaryProgram title: SWsolver Catalogue identifier: AEGY_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEGY_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: GPL v3 No. of lines in distributed program, including test data, etc.: 59 168 No. of bytes in distributed program, including test data, etc.: 453 409 Distribution format: tar.gz Programming language: C, CUDA Computer: Parallel Computing Clusters. Individual compute nodes may consist of x86 CPU, Cell processor, or x86 CPU with attached NVIDIA GPU accelerator. Operating system: Linux Has the code been vectorised or parallelized?: Yes. Tested on 1-128 x86 CPU cores, 1-32 Cell Processors, and 1-32 NVIDIA GPUs. RAM: Tested on Problems requiring up to 4 GB per compute node. Classification: 12 External routines: MPI, CUDA, IBM Cell SDK Nature of problem: MPI-parallel simulation of Shallow Water equations using high-resolution 2D hyperbolic equation solver on regular Cartesian grids for x86 CPU, Cell Processor, and NVIDIA GPU using CUDA. Solution method: SWsolver provides 3 implementations of a high-resolution 2D Shallow Water equation solver on regular Cartesian grids, for CPU, Cell Processor, and NVIDIA GPU. Each implementation uses MPI to divide work across a parallel computing cluster. Additional comments: Sub-program numdiff is used for the test run.

  6. Universal Batch Steganalysis

    DTIC Science & Technology

    2014-06-01

    in large-scale datasets such as might be obtained by monitoring a corporate network or social network. Identifying guilty actors, rather than payload...by monitoring a corporate network or social network. Identifying guilty actors, rather than payload-carrying objects, is entirely novel in steganalysis...implementation using Compute Unified Device Architecture (CUDA) on NVIDIA graphics cards. The key to good performance is to combine computations so that

  7. Massively Parallel Signal Processing using the Graphics Processing Unit for Real-Time Brain-Computer Interface Feature Extraction.

    PubMed

    Wilson, J Adam; Williams, Justin C

    2009-01-01

    The clock speeds of modern computer processors have nearly plateaued in the past 5 years. Consequently, neural prosthetic systems that rely on processing large quantities of data in a short period of time face a bottleneck, in that it may not be possible to process all of the data recorded from an electrode array with high channel counts and bandwidth, such as electrocorticographic grids or other implantable systems. Therefore, in this study a method of using the processing capabilities of a graphics card [graphics processing unit (GPU)] was developed for real-time neural signal processing of a brain-computer interface (BCI). The NVIDIA CUDA system was used to offload processing to the GPU, which is capable of running many operations in parallel, potentially greatly increasing the speed of existing algorithms. The BCI system records many channels of data, which are processed and translated into a control signal, such as the movement of a computer cursor. This signal processing chain involves computing a matrix-matrix multiplication (i.e., a spatial filter), followed by calculating the power spectral density on every channel using an auto-regressive method, and finally classifying appropriate features for control. In this study, the first two computationally intensive steps were implemented on the GPU, and the speed was compared to both the current implementation and a central processing unit-based implementation that uses multi-threading. Significant performance gains were obtained with GPU processing: the current implementation processed 1000 channels of 250 ms in 933 ms, while the new GPU method took only 27 ms, an improvement of nearly 35 times.

  8. High-Speed Particle-in-Cell Simulation Parallelized with Graphic Processing Units for Low Temperature Plasmas for Material Processing

    NASA Astrophysics Data System (ADS)

    Hur, Min Young; Verboncoeur, John; Lee, Hae June

    2014-10-01

    Particle-in-cell (PIC) simulations have high fidelity in the plasma device requiring transient kinetic modeling compared with fluid simulations. It uses less approximation on the plasma kinetics but requires many particles and grids to observe the semantic results. It means that the simulation spends lots of simulation time in proportion to the number of particles. Therefore, PIC simulation needs high performance computing. In this research, a graphic processing unit (GPU) is adopted for high performance computing of PIC simulation for low temperature discharge plasmas. GPUs have many-core processors and high memory bandwidth compared with a central processing unit (CPU). NVIDIA GeForce GPUs were used for the test with hundreds of cores which show cost-effective performance. PIC code algorithm is divided into two modules which are a field solver and a particle mover. The particle mover module is divided into four routines which are named move, boundary, Monte Carlo collision (MCC), and deposit. Overall, the GPU code solves particle motions as well as electrostatic potential in two-dimensional geometry almost 30 times faster than a single CPU code. This work was supported by the Korea Institute of Science Technology Information.

  9. Exact diagonalization of quantum lattice models on coprocessors

    NASA Astrophysics Data System (ADS)

    Siro, T.; Harju, A.

    2016-10-01

    We implement the Lanczos algorithm on an Intel Xeon Phi coprocessor and compare its performance to a multi-core Intel Xeon CPU and an NVIDIA graphics processor. The Xeon and the Xeon Phi are parallelized with OpenMP and the graphics processor is programmed with CUDA. The performance is evaluated by measuring the execution time of a single step in the Lanczos algorithm. We study two quantum lattice models with different particle numbers, and conclude that for small systems, the multi-core CPU is the fastest platform, while for large systems, the graphics processor is the clear winner, reaching speedups of up to 7.6 compared to the CPU. The Xeon Phi outperforms the CPU with sufficiently large particle number, reaching a speedup of 2.5.

  10. permGPU: Using graphics processing units in RNA microarray association studies.

    PubMed

    Shterev, Ivo D; Jung, Sin-Ho; George, Stephen L; Owzar, Kouros

    2010-06-16

    Many analyses of microarray association studies involve permutation, bootstrap resampling and cross-validation, that are ideally formulated as embarrassingly parallel computing problems. Given that these analyses are computationally intensive, scalable approaches that can take advantage of multi-core processor systems need to be developed. We have developed a CUDA based implementation, permGPU, that employs graphics processing units in microarray association studies. We illustrate the performance and applicability of permGPU within the context of permutation resampling for a number of test statistics. An extensive simulation study demonstrates a dramatic increase in performance when using permGPU on an NVIDIA GTX 280 card compared to an optimized C/C++ solution running on a conventional Linux server. permGPU is available as an open-source stand-alone application and as an extension package for the R statistical environment. It provides a dramatic increase in performance for permutation resampling analysis in the context of microarray association studies. The current version offers six test statistics for carrying out permutation resampling analyses for binary, quantitative and censored time-to-event traits.

  11. Real-time electroholography using a multiple-graphics processing unit cluster system with a single spatial light modulator and the InfiniBand network

    NASA Astrophysics Data System (ADS)

    Niwase, Hiroaki; Takada, Naoki; Araki, Hiromitsu; Maeda, Yuki; Fujiwara, Masato; Nakayama, Hirotaka; Kakue, Takashi; Shimobaba, Tomoyoshi; Ito, Tomoyoshi

    2016-09-01

    Parallel calculations of large-pixel-count computer-generated holograms (CGHs) are suitable for multiple-graphics processing unit (multi-GPU) cluster systems. However, it is not easy for a multi-GPU cluster system to accomplish fast CGH calculations when CGH transfers between PCs are required. In these cases, the CGH transfer between the PCs becomes a bottleneck. Usually, this problem occurs only in multi-GPU cluster systems with a single spatial light modulator. To overcome this problem, we propose a simple method using the InfiniBand network. The computational speed of the proposed method using 13 GPUs (NVIDIA GeForce GTX TITAN X) was more than 3000 times faster than that of a CPU (Intel Core i7 4770) when the number of three-dimensional (3-D) object points exceeded 20,480. In practice, we achieved ˜40 tera floating point operations per second (TFLOPS) when the number of 3-D object points exceeded 40,960. Our proposed method was able to reconstruct a real-time movie of a 3-D object comprising 95,949 points.

  12. Global magnetohydrodynamic simulations on multiple GPUs

    NASA Astrophysics Data System (ADS)

    Wong, Un-Hong; Wong, Hon-Cheng; Ma, Yonghui

    2014-01-01

    Global magnetohydrodynamic (MHD) models play the major role in investigating the solar wind-magnetosphere interaction. However, the huge computation requirement in global MHD simulations is also the main problem that needs to be solved. With the recent development of modern graphics processing units (GPUs) and the Compute Unified Device Architecture (CUDA), it is possible to perform global MHD simulations in a more efficient manner. In this paper, we present a global magnetohydrodynamic (MHD) simulator on multiple GPUs using CUDA 4.0 with GPUDirect 2.0. Our implementation is based on the modified leapfrog scheme, which is a combination of the leapfrog scheme and the two-step Lax-Wendroff scheme. GPUDirect 2.0 is used in our implementation to drive multiple GPUs. All data transferring and kernel processing are managed with CUDA 4.0 API instead of using MPI or OpenMP. Performance measurements are made on a multi-GPU system with eight NVIDIA Tesla M2050 (Fermi architecture) graphics cards. These measurements show that our multi-GPU implementation achieves a peak performance of 97.36 GFLOPS in double precision.

  13. Three-directional motion-compensation mask-based novel look-up table on graphics processing units for video-rate generation of digital holographic videos of three-dimensional scenes.

    PubMed

    Kwon, Min-Woo; Kim, Seung-Cheol; Kim, Eun-Soo

    2016-01-20

    A three-directional motion-compensation mask-based novel look-up table method is proposed and implemented on graphics processing units (GPUs) for video-rate generation of digital holographic videos of three-dimensional (3D) scenes. Since the proposed method is designed to be well matched with the software and memory structures of GPUs, the number of compute-unified-device-architecture kernel function calls can be significantly reduced. This results in a great increase of the computational speed of the proposed method, allowing video-rate generation of the computer-generated hologram (CGH) patterns of 3D scenes. Experimental results reveal that the proposed method can generate 39.8 frames of Fresnel CGH patterns with 1920×1080 pixels per second for the test 3D video scenario with 12,088 object points on dual GPU boards of NVIDIA GTX TITANs, and they confirm the feasibility of the proposed method in the practical application fields of electroholographic 3D displays.

  14. Acceleration of High Angular Momentum Electron Repulsion Integrals and Integral Derivatives on Graphics Processing Units.

    PubMed

    Miao, Yipu; Merz, Kenneth M

    2015-04-14

    We present an efficient implementation of ab initio self-consistent field (SCF) energy and gradient calculations that run on Compute Unified Device Architecture (CUDA) enabled graphical processing units (GPUs) using recurrence relations. We first discuss the machine-generated code that calculates the electron-repulsion integrals (ERIs) for different ERI types. Next we describe the porting of the SCF gradient calculation to GPUs, which results in an acceleration of the computation of the first-order derivative of the ERIs. However, only s, p, and d ERIs and s and p derivatives could be executed simultaneously on GPUs using the current version of CUDA and generation of NVidia GPUs using a previously described algorithm [Miao and Merz J. Chem. Theory Comput. 2013, 9, 965-976.]. Hence, we developed an algorithm to compute f type ERIs and d type ERI derivatives on GPUs. Our benchmarks shows the performance GPU enable ERI and ERI derivative computation yielded speedups of 10-18 times relative to traditional CPU execution. An accuracy analysis using double-precision calculations demonstrates that the overall accuracy is satisfactory for most applications.

  15. MATCHED FILTER COMPUTATION ON FPGA, CELL, AND GPU

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    BAKER, ZACHARY K.; GOKHALE, MAYA B.; TRIPP, JUSTIN L.

    2007-01-08

    The matched filter is an important kernel in the processing of hyperspectral data. The filter enables researchers to sift useful data from instruments that span large frequency bands. In this work, they evaluate the performance of a matched filter algorithm implementation on accelerated co-processor (XD1000), the IBM Cell microprocessor, and the NVIDIA GeForce 6900 GTX GPU graphics card. They provide extensive discussion of the challenges and opportunities afforded by each platform. In particular, they explore the problems of partitioning the filter most efficiently between the host CPU and the co-processor. Using their results, they derive several performance metrics that providemore » the optimal solution for a variety of application situations.« less

  16. The Process of Parallelizing the Conjunction Prediction Algorithm of ESA's SSA Conjunction Prediction Service Using GPGPU

    NASA Astrophysics Data System (ADS)

    Fehr, M.; Navarro, V.; Martin, L.; Fletcher, E.

    2013-08-01

    Space Situational Awareness[8] (SSA) is defined as the comprehensive knowledge, understanding and maintained awareness of the population of space objects, the space environment and existing threats and risks. As ESA's SSA Conjunction Prediction Service (CPS) requires the repetitive application of a processing algorithm against a data set of man-made space objects, it is crucial to exploit the highly parallelizable nature of this problem. Currently the CPS system makes use of OpenMP[7] for parallelization purposes using CPU threads, but only a GPU with its hundreds of cores can fully benefit from such high levels of parallelism. This paper presents the adaptation of several core algorithms[5] of the CPS for general-purpose computing on graphics processing units (GPGPU) using NVIDIAs Compute Unified Device Architecture (CUDA).

  17. A simple GPU-accelerated two-dimensional MUSCL-Hancock solver for ideal magnetohydrodynamics

    NASA Astrophysics Data System (ADS)

    Bard, Christopher M.; Dorelli, John C.

    2014-02-01

    We describe our experience using NVIDIA's CUDA (Compute Unified Device Architecture) C programming environment to implement a two-dimensional second-order MUSCL-Hancock ideal magnetohydrodynamics (MHD) solver on a GTX 480 Graphics Processing Unit (GPU). Taking a simple approach in which the MHD variables are stored exclusively in the global memory of the GTX 480 and accessed in a cache-friendly manner (without further optimizing memory access by, for example, staging data in the GPU's faster shared memory), we achieved a maximum speed-up of ≈126 for a 10242 grid relative to the sequential C code running on a single Intel Nehalem (2.8 GHz) core. This speedup is consistent with simple estimates based on the known floating point performance, memory throughput and parallel processing capacity of the GTX 480.

  18. Accelerating Monte Carlo simulations of photon transport in a voxelized geometry using a massively parallel graphics processing unit

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Badal, Andreu; Badano, Aldo

    Purpose: It is a known fact that Monte Carlo simulations of radiation transport are computationally intensive and may require long computing times. The authors introduce a new paradigm for the acceleration of Monte Carlo simulations: The use of a graphics processing unit (GPU) as the main computing device instead of a central processing unit (CPU). Methods: A GPU-based Monte Carlo code that simulates photon transport in a voxelized geometry with the accurate physics models from PENELOPE has been developed using the CUDA programming model (NVIDIA Corporation, Santa Clara, CA). Results: An outline of the new code and a sample x-raymore » imaging simulation with an anthropomorphic phantom are presented. A remarkable 27-fold speed up factor was obtained using a GPU compared to a single core CPU. Conclusions: The reported results show that GPUs are currently a good alternative to CPUs for the simulation of radiation transport. Since the performance of GPUs is currently increasing at a faster pace than that of CPUs, the advantages of GPU-based software are likely to be more pronounced in the future.« less

  19. Flexible, fast and accurate sequence alignment profiling on GPGPU with PaSWAS.

    PubMed

    Warris, Sven; Yalcin, Feyruz; Jackson, Katherine J L; Nap, Jan Peter

    2015-01-01

    To obtain large-scale sequence alignments in a fast and flexible way is an important step in the analyses of next generation sequencing data. Applications based on the Smith-Waterman (SW) algorithm are often either not fast enough, limited to dedicated tasks or not sufficiently accurate due to statistical issues. Current SW implementations that run on graphics hardware do not report the alignment details necessary for further analysis. With the Parallel SW Alignment Software (PaSWAS) it is possible (a) to have easy access to the computational power of NVIDIA-based general purpose graphics processing units (GPGPUs) to perform high-speed sequence alignments, and (b) retrieve relevant information such as score, number of gaps and mismatches. The software reports multiple hits per alignment. The added value of the new SW implementation is demonstrated with two test cases: (1) tag recovery in next generation sequence data and (2) isotype assignment within an immunoglobulin 454 sequence data set. Both cases show the usability and versatility of the new parallel Smith-Waterman implementation.

  20. Advanced computer graphic techniques for laser range finder (LRF) simulation

    NASA Astrophysics Data System (ADS)

    Bedkowski, Janusz; Jankowski, Stanislaw

    2008-11-01

    This paper show an advanced computer graphic techniques for laser range finder (LRF) simulation. The LRF is the common sensor for unmanned ground vehicle, autonomous mobile robot and security applications. The cost of the measurement system is extremely high, therefore the simulation tool is designed. The simulation gives an opportunity to execute algorithm such as the obstacle avoidance[1], slam for robot localization[2], detection of vegetation and water obstacles in surroundings of the robot chassis[3], LRF measurement in crowd of people[1]. The Axis Aligned Bounding Box (AABB) and alternative technique based on CUDA (NVIDIA Compute Unified Device Architecture) is presented.

  1. A parallel algorithm for the initial screening of space debris collisions prediction using the SGP4/SDP4 models and GPU acceleration

    NASA Astrophysics Data System (ADS)

    Lin, Mingpei; Xu, Ming; Fu, Xiaoyu

    2017-05-01

    Currently, a tremendous amount of space debris in Earth's orbit imperils operational spacecraft. It is essential to undertake risk assessments of collisions and predict dangerous encounters in space. However, collision predictions for an enormous amount of space debris give rise to large-scale computations. In this paper, a parallel algorithm is established on the Compute Unified Device Architecture (CUDA) platform of NVIDIA Corporation for collision prediction. According to the parallel structure of NVIDIA graphics processors, a block decomposition strategy is adopted in the algorithm. Space debris is divided into batches, and the computation and data transfer operations of adjacent batches overlap. As a consequence, the latency to access shared memory during the entire computing process is significantly reduced, and a higher computing speed is reached. Theoretically, a simulation of collision prediction for space debris of any amount and for any time span can be executed. To verify this algorithm, a simulation example including 1382 pieces of debris, whose operational time scales vary from 1 min to 3 days, is conducted on Tesla C2075 of NVIDIA. The simulation results demonstrate that with the same computational accuracy as that of a CPU, the computing speed of the parallel algorithm on a GPU is 30 times that on a CPU. Based on this algorithm, collision prediction of over 150 Chinese spacecraft for a time span of 3 days can be completed in less than 3 h on a single computer, which meets the timeliness requirement of the initial screening task. Furthermore, the algorithm can be adapted for multiple tasks, including particle filtration, constellation design, and Monte-Carlo simulation of an orbital computation.

  2. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Allada, Veerendra, Benjegerdes, Troy; Bode, Brett

    Commodity clusters augmented with application accelerators are evolving as competitive high performance computing systems. The Graphical Processing Unit (GPU) with a very high arithmetic density and performance per price ratio is a good platform for the scientific application acceleration. In addition to the interconnect bottlenecks among the cluster compute nodes, the cost of memory copies between the host and the GPU device have to be carefully amortized to improve the overall efficiency of the application. Scientific applications also rely on efficient implementation of the BAsic Linear Algebra Subroutines (BLAS), among which the General Matrix Multiply (GEMM) is considered as themore » workhorse subroutine. In this paper, they study the performance of the memory copies and GEMM subroutines that are critical to port the computational chemistry algorithms to the GPU clusters. To that end, a benchmark based on the NetPIPE framework is developed to evaluate the latency and bandwidth of the memory copies between the host and the GPU device. The performance of the single and double precision GEMM subroutines from the NVIDIA CUBLAS 2.0 library are studied. The results have been compared with that of the BLAS routines from the Intel Math Kernel Library (MKL) to understand the computational trade-offs. The test bed is a Intel Xeon cluster equipped with NVIDIA Tesla GPUs.« less

  3. Quantum Chemical Calculations Using Accelerators: Migrating Matrix Operations to the NVIDIA Kepler GPU and the Intel Xeon Phi.

    PubMed

    Leang, Sarom S; Rendell, Alistair P; Gordon, Mark S

    2014-03-11

    Increasingly, modern computer systems comprise a multicore general-purpose processor augmented with a number of special purpose devices or accelerators connected via an external interface such as a PCI bus. The NVIDIA Kepler Graphical Processing Unit (GPU) and the Intel Phi are two examples of such accelerators. Accelerators offer peak performances that can be well above those of the host processor. How to exploit this heterogeneous environment for legacy application codes is not, however, straightforward. This paper considers how matrix operations in typical quantum chemical calculations can be migrated to the GPU and Phi systems. Double precision general matrix multiply operations are endemic in electronic structure calculations, especially methods that include electron correlation, such as density functional theory, second order perturbation theory, and coupled cluster theory. The use of approaches that automatically determine whether to use the host or an accelerator, based on problem size, is explored, with computations that are occurring on the accelerator and/or the host. For data-transfers over PCI-e, the GPU provides the best overall performance for data sizes up to 4096 MB with consistent upload and download rates between 5-5.6 GB/s and 5.4-6.3 GB/s, respectively. The GPU outperforms the Phi for both square and nonsquare matrix multiplications.

  4. GPU-based real-time trinocular stereo vision

    NASA Astrophysics Data System (ADS)

    Yao, Yuanbin; Linton, R. J.; Padir, Taskin

    2013-01-01

    Most stereovision applications are binocular which uses information from a 2-camera array to perform stereo matching and compute the depth image. Trinocular stereovision with a 3-camera array has been proved to provide higher accuracy in stereo matching which could benefit applications like distance finding, object recognition, and detection. This paper presents a real-time stereovision algorithm implemented on a GPGPU (General-purpose graphics processing unit) using a trinocular stereovision camera array. Algorithm employs a winner-take-all method applied to perform fusion of disparities in different directions following various image processing techniques to obtain the depth information. The goal of the algorithm is to achieve real-time processing speed with the help of a GPGPU involving the use of Open Source Computer Vision Library (OpenCV) in C++ and NVidia CUDA GPGPU Solution. The results are compared in accuracy and speed to verify the improvement.

  5. Rapid automated classification of anesthetic depth levels using GPU based parallelization of neural networks.

    PubMed

    Peker, Musa; Şen, Baha; Gürüler, Hüseyin

    2015-02-01

    The effect of anesthesia on the patient is referred to as depth of anesthesia. Rapid classification of appropriate depth level of anesthesia is a matter of great importance in surgical operations. Similarly, accelerating classification algorithms is important for the rapid solution of problems in the field of biomedical signal processing. However numerous, time-consuming mathematical operations are required when training and testing stages of the classification algorithms, especially in neural networks. In this study, to accelerate the process, parallel programming and computing platform (Nvidia CUDA) facilitates dramatic increases in computing performance by harnessing the power of the graphics processing unit (GPU) was utilized. The system was employed to detect anesthetic depth level on related electroencephalogram (EEG) data set. This dataset is rather complex and large. Moreover, the achieving more anesthetic levels with rapid response is critical in anesthesia. The proposed parallelization method yielded high accurate classification results in a faster time.

  6. Acceleration of Linear Finite-Difference Poisson-Boltzmann Methods on Graphics Processing Units.

    PubMed

    Qi, Ruxi; Botello-Smith, Wesley M; Luo, Ray

    2017-07-11

    Electrostatic interactions play crucial roles in biophysical processes such as protein folding and molecular recognition. Poisson-Boltzmann equation (PBE)-based models have emerged as widely used in modeling these important processes. Though great efforts have been put into developing efficient PBE numerical models, challenges still remain due to the high dimensionality of typical biomolecular systems. In this study, we implemented and analyzed commonly used linear PBE solvers for the ever-improving graphics processing units (GPU) for biomolecular simulations, including both standard and preconditioned conjugate gradient (CG) solvers with several alternative preconditioners. Our implementation utilizes the standard Nvidia CUDA libraries cuSPARSE, cuBLAS, and CUSP. Extensive tests show that good numerical accuracy can be achieved given that the single precision is often used for numerical applications on GPU platforms. The optimal GPU performance was observed with the Jacobi-preconditioned CG solver, with a significant speedup over standard CG solver on CPU in our diversified test cases. Our analysis further shows that different matrix storage formats also considerably affect the efficiency of different linear PBE solvers on GPU, with the diagonal format best suited for our standard finite-difference linear systems. Further efficiency may be possible with matrix-free operations and integrated grid stencil setup specifically tailored for the banded matrices in PBE-specific linear systems.

  7. A Simple GPU-Accelerated Two-Dimensional MUSCL-Hancock Solver for Ideal Magnetohydrodynamics

    NASA Technical Reports Server (NTRS)

    Bard, Christopher; Dorelli, John C.

    2013-01-01

    We describe our experience using NVIDIA's CUDA (Compute Unified Device Architecture) C programming environment to implement a two-dimensional second-order MUSCL-Hancock ideal magnetohydrodynamics (MHD) solver on a GTX 480 Graphics Processing Unit (GPU). Taking a simple approach in which the MHD variables are stored exclusively in the global memory of the GTX 480 and accessed in a cache-friendly manner (without further optimizing memory access by, for example, staging data in the GPU's faster shared memory), we achieved a maximum speed-up of approx. = 126 for a sq 1024 grid relative to the sequential C code running on a single Intel Nehalem (2.8 GHz) core. This speedup is consistent with simple estimates based on the known floating point performance, memory throughput and parallel processing capacity of the GTX 480.

  8. Hypergraph partitioning implementation for parallelizing matrix-vector multiplication using CUDA GPU-based parallel computing

    NASA Astrophysics Data System (ADS)

    Murni, Bustamam, A.; Ernastuti, Handhika, T.; Kerami, D.

    2017-07-01

    Calculation of the matrix-vector multiplication in the real-world problems often involves large matrix with arbitrary size. Therefore, parallelization is needed to speed up the calculation process that usually takes a long time. Graph partitioning techniques that have been discussed in the previous studies cannot be used to complete the parallelized calculation of matrix-vector multiplication with arbitrary size. This is due to the assumption of graph partitioning techniques that can only solve the square and symmetric matrix. Hypergraph partitioning techniques will overcome the shortcomings of the graph partitioning technique. This paper addresses the efficient parallelization of matrix-vector multiplication through hypergraph partitioning techniques using CUDA GPU-based parallel computing. CUDA (compute unified device architecture) is a parallel computing platform and programming model that was created by NVIDIA and implemented by the GPU (graphics processing unit).

  9. HASEonGPU-An adaptive, load-balanced MPI/GPU-code for calculating the amplified spontaneous emission in high power laser media

    NASA Astrophysics Data System (ADS)

    Eckert, C. H. J.; Zenker, E.; Bussmann, M.; Albach, D.

    2016-10-01

    We present an adaptive Monte Carlo algorithm for computing the amplified spontaneous emission (ASE) flux in laser gain media pumped by pulsed lasers. With the design of high power lasers in mind, which require large size gain media, we have developed the open source code HASEonGPU that is capable of utilizing multiple graphic processing units (GPUs). With HASEonGPU, time to solution is reduced to minutes on a medium size GPU cluster of 64 NVIDIA Tesla K20m GPUs and excellent speedup is achieved when scaling to multiple GPUs. Comparison of simulation results to measurements of ASE in Y b 3 + : Y AG ceramics show perfect agreement.

  10. BarraCUDA - a fast short read sequence aligner using graphics processing units

    PubMed Central

    2012-01-01

    Background With the maturation of next-generation DNA sequencing (NGS) technologies, the throughput of DNA sequencing reads has soared to over 600 gigabases from a single instrument run. General purpose computing on graphics processing units (GPGPU), extracts the computing power from hundreds of parallel stream processors within graphics processing cores and provides a cost-effective and energy efficient alternative to traditional high-performance computing (HPC) clusters. In this article, we describe the implementation of BarraCUDA, a GPGPU sequence alignment software that is based on BWA, to accelerate the alignment of sequencing reads generated by these instruments to a reference DNA sequence. Findings Using the NVIDIA Compute Unified Device Architecture (CUDA) software development environment, we ported the most computational-intensive alignment component of BWA to GPU to take advantage of the massive parallelism. As a result, BarraCUDA offers a magnitude of performance boost in alignment throughput when compared to a CPU core while delivering the same level of alignment fidelity. The software is also capable of supporting multiple CUDA devices in parallel to further accelerate the alignment throughput. Conclusions BarraCUDA is designed to take advantage of the parallelism of GPU to accelerate the alignment of millions of sequencing reads generated by NGS instruments. By doing this, we could, at least in part streamline the current bioinformatics pipeline such that the wider scientific community could benefit from the sequencing technology. BarraCUDA is currently available from http://seqbarracuda.sf.net PMID:22244497

  11. Parallel design of JPEG-LS encoder on graphics processing units

    NASA Astrophysics Data System (ADS)

    Duan, Hao; Fang, Yong; Huang, Bormin

    2012-01-01

    With recent technical advances in graphic processing units (GPUs), GPUs have outperformed CPUs in terms of compute capability and memory bandwidth. Many successful GPU applications to high performance computing have been reported. JPEG-LS is an ISO/IEC standard for lossless image compression which utilizes adaptive context modeling and run-length coding to improve compression ratio. However, adaptive context modeling causes data dependency among adjacent pixels and the run-length coding has to be performed in a sequential way. Hence, using JPEG-LS to compress large-volume hyperspectral image data is quite time-consuming. We implement an efficient parallel JPEG-LS encoder for lossless hyperspectral compression on a NVIDIA GPU using the computer unified device architecture (CUDA) programming technology. We use the block parallel strategy, as well as such CUDA techniques as coalesced global memory access, parallel prefix sum, and asynchronous data transfer. We also show the relation between GPU speedup and AVIRIS block size, as well as the relation between compression ratio and AVIRIS block size. When AVIRIS images are divided into blocks, each with 64×64 pixels, we gain the best GPU performance with 26.3x speedup over its original CPU code.

  12. OCTGRAV: Sparse Octree Gravitational N-body Code on Graphics Processing Units

    NASA Astrophysics Data System (ADS)

    Gaburov, Evghenii; Bédorf, Jeroen; Portegies Zwart, Simon

    2010-10-01

    Octgrav is a very fast tree-code which runs on massively parallel Graphical Processing Units (GPU) with NVIDIA CUDA architecture. The algorithms are based on parallel-scan and sort methods. The tree-construction and calculation of multipole moments is carried out on the host CPU, while the force calculation which consists of tree walks and evaluation of interaction list is carried out on the GPU. In this way, a sustained performance of about 100GFLOP/s and data transfer rates of about 50GB/s is achieved. It takes about a second to compute forces on a million particles with an opening angle of heta approx 0.5. To test the performance and feasibility, we implemented the algorithms in CUDA in the form of a gravitational tree-code which completely runs on the GPU. The tree construction and traverse algorithms are portable to many-core devices which have support for CUDA or OpenCL programming languages. The gravitational tree-code outperforms tuned CPU code during the tree-construction and shows a performance improvement of more than a factor 20 overall, resulting in a processing rate of more than 2.8 million particles per second. The code has a convenient user interface and is freely available for use.

  13. GRay: A Massively Parallel GPU-based Code for Ray Tracing in Relativistic Spacetimes

    NASA Astrophysics Data System (ADS)

    Chan, Chi-kwan; Psaltis, Dimitrios; Özel, Feryal

    2013-11-01

    We introduce GRay, a massively parallel integrator designed to trace the trajectories of billions of photons in a curved spacetime. This graphics-processing-unit (GPU)-based integrator employs the stream processing paradigm, is implemented in CUDA C/C++, and runs on nVidia graphics cards. The peak performance of GRay using single-precision floating-point arithmetic on a single GPU exceeds 300 GFLOP (or 1 ns per photon per time step). For a realistic problem, where the peak performance cannot be reached, GRay is two orders of magnitude faster than existing central-processing-unit-based ray-tracing codes. This performance enhancement allows more effective searches of large parameter spaces when comparing theoretical predictions of images, spectra, and light curves from the vicinities of compact objects to observations. GRay can also perform on-the-fly ray tracing within general relativistic magnetohydrodynamic algorithms that simulate accretion flows around compact objects. Making use of this algorithm, we calculate the properties of the shadows of Kerr black holes and the photon rings that surround them. We also provide accurate fitting formulae of their dependencies on black hole spin and observer inclination, which can be used to interpret upcoming observations of the black holes at the center of the Milky Way, as well as M87, with the Event Horizon Telescope.

  14. cudaMap: a GPU accelerated program for gene expression connectivity mapping.

    PubMed

    McArt, Darragh G; Bankhead, Peter; Dunne, Philip D; Salto-Tellez, Manuel; Hamilton, Peter; Zhang, Shu-Dong

    2013-10-11

    Modern cancer research often involves large datasets and the use of sophisticated statistical techniques. Together these add a heavy computational load to the analysis, which is often coupled with issues surrounding data accessibility. Connectivity mapping is an advanced bioinformatic and computational technique dedicated to therapeutics discovery and drug re-purposing around differential gene expression analysis. On a normal desktop PC, it is common for the connectivity mapping task with a single gene signature to take > 2h to complete using sscMap, a popular Java application that runs on standard CPUs (Central Processing Units). Here, we describe new software, cudaMap, which has been implemented using CUDA C/C++ to harness the computational power of NVIDIA GPUs (Graphics Processing Units) to greatly reduce processing times for connectivity mapping. cudaMap can identify candidate therapeutics from the same signature in just over thirty seconds when using an NVIDIA Tesla C2050 GPU. Results from the analysis of multiple gene signatures, which would previously have taken several days, can now be obtained in as little as 10 minutes, greatly facilitating candidate therapeutics discovery with high throughput. We are able to demonstrate dramatic speed differentials between GPU assisted performance and CPU executions as the computational load increases for high accuracy evaluation of statistical significance. Emerging 'omics' technologies are constantly increasing the volume of data and information to be processed in all areas of biomedical research. Embracing the multicore functionality of GPUs represents a major avenue of local accelerated computing. cudaMap will make a strong contribution in the discovery of candidate therapeutics by enabling speedy execution of heavy duty connectivity mapping tasks, which are increasingly required in modern cancer research. cudaMap is open source and can be freely downloaded from http://purl.oclc.org/NET/cudaMap.

  15. Parallel Implementation of MAFFT on CUDA-Enabled Graphics Hardware.

    PubMed

    Zhu, Xiangyuan; Li, Kenli; Salah, Ahmad; Shi, Lin; Li, Keqin

    2015-01-01

    Multiple sequence alignment (MSA) constitutes an extremely powerful tool for many biological applications including phylogenetic tree estimation, secondary structure prediction, and critical residue identification. However, aligning large biological sequences with popular tools such as MAFFT requires long runtimes on sequential architectures. Due to the ever increasing sizes of sequence databases, there is increasing demand to accelerate this task. In this paper, we demonstrate how graphic processing units (GPUs), powered by the compute unified device architecture (CUDA), can be used as an efficient computational platform to accelerate the MAFFT algorithm. To fully exploit the GPU's capabilities for accelerating MAFFT, we have optimized the sequence data organization to eliminate the bandwidth bottleneck of memory access, designed a memory allocation and reuse strategy to make full use of limited memory of GPUs, proposed a new modified-run-length encoding (MRLE) scheme to reduce memory consumption, and used high-performance shared memory to speed up I/O operations. Our implementation tested in three NVIDIA GPUs achieves speedup up to 11.28 on a Tesla K20m GPU compared to the sequential MAFFT 7.015.

  16. Adaptive mesh fluid simulations on GPU

    NASA Astrophysics Data System (ADS)

    Wang, Peng; Abel, Tom; Kaehler, Ralf

    2010-10-01

    We describe an implementation of compressible inviscid fluid solvers with block-structured adaptive mesh refinement on Graphics Processing Units using NVIDIA's CUDA. We show that a class of high resolution shock capturing schemes can be mapped naturally on this architecture. Using the method of lines approach with the second order total variation diminishing Runge-Kutta time integration scheme, piecewise linear reconstruction, and a Harten-Lax-van Leer Riemann solver, we achieve an overall speedup of approximately 10 times faster execution on one graphics card as compared to a single core on the host computer. We attain this speedup in uniform grid runs as well as in problems with deep AMR hierarchies. Our framework can readily be applied to more general systems of conservation laws and extended to higher order shock capturing schemes. This is shown directly by an implementation of a magneto-hydrodynamic solver and comparing its performance to the pure hydrodynamic case. Finally, we also combined our CUDA parallel scheme with MPI to make the code run on GPU clusters. Close to ideal speedup is observed on up to four GPUs.

  17. GPU accelerated fuzzy connected image segmentation by using CUDA.

    PubMed

    Zhuge, Ying; Cao, Yong; Miller, Robert W

    2009-01-01

    Image segmentation techniques using fuzzy connectedness principles have shown their effectiveness in segmenting a variety of objects in several large applications in recent years. However, one problem of these algorithms has been their excessive computational requirements when processing large image datasets. Nowadays commodity graphics hardware provides high parallel computing power. In this paper, we present a parallel fuzzy connected image segmentation algorithm on Nvidia's Compute Unified Device Architecture (CUDA) platform for segmenting large medical image data sets. Our experiments based on three data sets with small, medium, and large data size demonstrate the efficiency of the parallel algorithm, which achieves a speed-up factor of 7.2x, 7.3x, and 14.4x, correspondingly, for the three data sets over the sequential implementation of fuzzy connected image segmentation algorithm on CPU.

  18. Solving lattice QCD systems of equations using mixed precision solvers on GPUs

    NASA Astrophysics Data System (ADS)

    Clark, M. A.; Babich, R.; Barros, K.; Brower, R. C.; Rebbi, C.

    2010-09-01

    Modern graphics hardware is designed for highly parallel numerical tasks and promises significant cost and performance benefits for many scientific applications. One such application is lattice quantum chromodynamics (lattice QCD), where the main computational challenge is to efficiently solve the discretized Dirac equation in the presence of an SU(3) gauge field. Using NVIDIA's CUDA platform we have implemented a Wilson-Dirac sparse matrix-vector product that performs at up to 40, 135 and 212 Gflops for double, single and half precision respectively on NVIDIA's GeForce GTX 280 GPU. We have developed a new mixed precision approach for Krylov solvers using reliable updates which allows for full double precision accuracy while using only single or half precision arithmetic for the bulk of the computation. The resulting BiCGstab and CG solvers run in excess of 100 Gflops and, in terms of iterations until convergence, perform better than the usual defect-correction approach for mixed precision.

  19. Very high frame rate volumetric integration of depth images on mobile devices.

    PubMed

    Kähler, Olaf; Adrian Prisacariu, Victor; Yuheng Ren, Carl; Sun, Xin; Torr, Philip; Murray, David

    2015-11-01

    Volumetric methods provide efficient, flexible and simple ways of integrating multiple depth images into a full 3D model. They provide dense and photorealistic 3D reconstructions, and parallelised implementations on GPUs achieve real-time performance on modern graphics hardware. To run such methods on mobile devices, providing users with freedom of movement and instantaneous reconstruction feedback, remains challenging however. In this paper we present a range of modifications to existing volumetric integration methods based on voxel block hashing, considerably improving their performance and making them applicable to tablet computer applications. We present (i) optimisations for the basic data structure, and its allocation and integration; (ii) a highly optimised raycasting pipeline; and (iii) extensions to the camera tracker to incorporate IMU data. In total, our system thus achieves frame rates up 47 Hz on a Nvidia Shield Tablet and 910 Hz on a Nvidia GTX Titan XGPU, or even beyond 1.1 kHz without visualisation.

  20. OpenACC acceleration of an unstructured CFD solver based on a reconstructed discontinuous Galerkin method for compressible flows

    DOE PAGES

    Xia, Yidong; Lou, Jialin; Luo, Hong; ...

    2015-02-09

    Here, an OpenACC directive-based graphics processing unit (GPU) parallel scheme is presented for solving the compressible Navier–Stokes equations on 3D hybrid unstructured grids with a third-order reconstructed discontinuous Galerkin method. The developed scheme requires the minimum code intrusion and algorithm alteration for upgrading a legacy solver with the GPU computing capability at very little extra effort in programming, which leads to a unified and portable code development strategy. A face coloring algorithm is adopted to eliminate the memory contention because of the threading of internal and boundary face integrals. A number of flow problems are presented to verify the implementationmore » of the developed scheme. Timing measurements were obtained by running the resulting GPU code on one Nvidia Tesla K20c GPU card (Nvidia Corporation, Santa Clara, CA, USA) and compared with those obtained by running the equivalent Message Passing Interface (MPI) parallel CPU code on a compute node (consisting of two AMD Opteron 6128 eight-core CPUs (Advanced Micro Devices, Inc., Sunnyvale, CA, USA)). Speedup factors of up to 24× and 1.6× for the GPU code were achieved with respect to one and 16 CPU cores, respectively. The numerical results indicate that this OpenACC-based parallel scheme is an effective and extensible approach to port unstructured high-order CFD solvers to GPU computing.« less

  1. High-throughput sequence alignment using Graphics Processing Units

    PubMed Central

    Schatz, Michael C; Trapnell, Cole; Delcher, Arthur L; Varshney, Amitabh

    2007-01-01

    Background The recent availability of new, less expensive high-throughput DNA sequencing technologies has yielded a dramatic increase in the volume of sequence data that must be analyzed. These data are being generated for several purposes, including genotyping, genome resequencing, metagenomics, and de novo genome assembly projects. Sequence alignment programs such as MUMmer have proven essential for analysis of these data, but researchers will need ever faster, high-throughput alignment tools running on inexpensive hardware to keep up with new sequence technologies. Results This paper describes MUMmerGPU, an open-source high-throughput parallel pairwise local sequence alignment program that runs on commodity Graphics Processing Units (GPUs) in common workstations. MUMmerGPU uses the new Compute Unified Device Architecture (CUDA) from nVidia to align multiple query sequences against a single reference sequence stored as a suffix tree. By processing the queries in parallel on the highly parallel graphics card, MUMmerGPU achieves more than a 10-fold speedup over a serial CPU version of the sequence alignment kernel, and outperforms the exact alignment component of MUMmer on a high end CPU by 3.5-fold in total application time when aligning reads from recent sequencing projects using Solexa/Illumina, 454, and Sanger sequencing technologies. Conclusion MUMmerGPU is a low cost, ultra-fast sequence alignment program designed to handle the increasing volume of data produced by new, high-throughput sequencing technologies. MUMmerGPU demonstrates that even memory-intensive applications can run significantly faster on the relatively low-cost GPU than on the CPU. PMID:18070356

  2. Accelerating large-scale protein structure alignments with graphics processing units

    PubMed Central

    2012-01-01

    Background Large-scale protein structure alignment, an indispensable tool to structural bioinformatics, poses a tremendous challenge on computational resources. To ensure structure alignment accuracy and efficiency, efforts have been made to parallelize traditional alignment algorithms in grid environments. However, these solutions are costly and of limited accessibility. Others trade alignment quality for speedup by using high-level characteristics of structure fragments for structure comparisons. Findings We present ppsAlign, a parallel protein structure Alignment framework designed and optimized to exploit the parallelism of Graphics Processing Units (GPUs). As a general-purpose GPU platform, ppsAlign could take many concurrent methods, such as TM-align and Fr-TM-align, into the parallelized algorithm design. We evaluated ppsAlign on an NVIDIA Tesla C2050 GPU card, and compared it with existing software solutions running on an AMD dual-core CPU. We observed a 36-fold speedup over TM-align, a 65-fold speedup over Fr-TM-align, and a 40-fold speedup over MAMMOTH. Conclusions ppsAlign is a high-performance protein structure alignment tool designed to tackle the computational complexity issues from protein structural data. The solution presented in this paper allows large-scale structure comparisons to be performed using massive parallel computing power of GPU. PMID:22357132

  3. Techniques for Mapping Synthetic Aperture Radar Processing Algorithms to Multi-GPU Clusters

    DTIC Science & Technology

    2012-12-01

    Experimental results were generated with 10 nVidia Tesla C2050 GPUs having maximum throughput of 972 Gflop /s. Our approach scales well for output...Experimental results were generated with 10 nVidia Tesla C2050 GPUs having maximum throughput of 972 Gflop /s. Our approach scales well for output

  4. GRay: A MASSIVELY PARALLEL GPU-BASED CODE FOR RAY TRACING IN RELATIVISTIC SPACETIMES

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Chan, Chi-kwan; Psaltis, Dimitrios; Özel, Feryal

    We introduce GRay, a massively parallel integrator designed to trace the trajectories of billions of photons in a curved spacetime. This graphics-processing-unit (GPU)-based integrator employs the stream processing paradigm, is implemented in CUDA C/C++, and runs on nVidia graphics cards. The peak performance of GRay using single-precision floating-point arithmetic on a single GPU exceeds 300 GFLOP (or 1 ns per photon per time step). For a realistic problem, where the peak performance cannot be reached, GRay is two orders of magnitude faster than existing central-processing-unit-based ray-tracing codes. This performance enhancement allows more effective searches of large parameter spaces when comparingmore » theoretical predictions of images, spectra, and light curves from the vicinities of compact objects to observations. GRay can also perform on-the-fly ray tracing within general relativistic magnetohydrodynamic algorithms that simulate accretion flows around compact objects. Making use of this algorithm, we calculate the properties of the shadows of Kerr black holes and the photon rings that surround them. We also provide accurate fitting formulae of their dependencies on black hole spin and observer inclination, which can be used to interpret upcoming observations of the black holes at the center of the Milky Way, as well as M87, with the Event Horizon Telescope.« less

  5. GPU-accelerated phase extraction algorithm for interferograms: a real-time application

    NASA Astrophysics Data System (ADS)

    Zhu, Xiaoqiang; Wu, Yongqian; Liu, Fengwei

    2016-11-01

    Optical testing, having the merits of non-destruction and high sensitivity, provides a vital guideline for optical manufacturing. But the testing process is often computationally intensive and expensive, usually up to a few seconds, which is sufferable for dynamic testing. In this paper, a GPU-accelerated phase extraction algorithm is proposed, which is based on the advanced iterative algorithm. The accelerated algorithm can extract the right phase-distribution from thirteen 1024x1024 fringe patterns with arbitrary phase shifts in 233 milliseconds on average using NVIDIA Quadro 4000 graphic card, which achieved a 12.7x speedup ratio than the same algorithm executed on CPU and 6.6x speedup ratio than that on Matlab using DWANING W5801 workstation. The performance improvement can fulfill the demand of computational accuracy and real-time application.

  6. CUDA-Accelerated Geodesic Ray-Tracing for Fiber Tracking

    PubMed Central

    van Aart, Evert; Sepasian, Neda; Jalba, Andrei; Vilanova, Anna

    2011-01-01

    Diffusion Tensor Imaging (DTI) allows to noninvasively measure the diffusion of water in fibrous tissue. By reconstructing the fibers from DTI data using a fiber-tracking algorithm, we can deduce the structure of the tissue. In this paper, we outline an approach to accelerating such a fiber-tracking algorithm using a Graphics Processing Unit (GPU). This algorithm, which is based on the calculation of geodesics, has shown promising results for both synthetic and real data, but is limited in its applicability by its high computational requirements. We present a solution which uses the parallelism offered by modern GPUs, in combination with the CUDA platform by NVIDIA, to significantly reduce the execution time of the fiber-tracking algorithm. Compared to a multithreaded CPU implementation of the same algorithm, our GPU mapping achieves a speedup factor of up to 40 times. PMID:21941525

  7. Implementation of Multipattern String Matching Accelerated with GPU for Intrusion Detection System

    NASA Astrophysics Data System (ADS)

    Nehemia, Rangga; Lim, Charles; Galinium, Maulahikmah; Rinaldi Widianto, Ahmad

    2017-04-01

    As Internet-related security threats continue to increase in terms of volume and sophistication, existing Intrusion Detection System is also being challenged to cope with the current Internet development. Multi Pattern String Matching algorithm accelerated with Graphical Processing Unit is being utilized to improve the packet scanning performance of the IDS. This paper implements a Multi Pattern String Matching algorithm, also called Parallel Failureless Aho Corasick accelerated with GPU to improve the performance of IDS. OpenCL library is used to allow the IDS to support various GPU, including popular GPU such as NVIDIA and AMD, used in our research. The experiment result shows that the application of Multi Pattern String Matching using GPU accelerated platform provides a speed up, by up to 141% in term of throughput compared to the previous research.

  8. GPU accelerated simulations of 3D deterministic particle transport using discrete ordinates method

    NASA Astrophysics Data System (ADS)

    Gong, Chunye; Liu, Jie; Chi, Lihua; Huang, Haowei; Fang, Jingyue; Gong, Zhenghu

    2011-07-01

    Graphics Processing Unit (GPU), originally developed for real-time, high-definition 3D graphics in computer games, now provides great faculty in solving scientific applications. The basis of particle transport simulation is the time-dependent, multi-group, inhomogeneous Boltzmann transport equation. The numerical solution to the Boltzmann equation involves the discrete ordinates ( Sn) method and the procedure of source iteration. In this paper, we present a GPU accelerated simulation of one energy group time-independent deterministic discrete ordinates particle transport in 3D Cartesian geometry (Sweep3D). The performance of the GPU simulations are reported with the simulations of vacuum boundary condition. The discussion of the relative advantages and disadvantages of the GPU implementation, the simulation on multi GPUs, the programming effort and code portability are also reported. The results show that the overall performance speedup of one NVIDIA Tesla M2050 GPU ranges from 2.56 compared with one Intel Xeon X5670 chip to 8.14 compared with one Intel Core Q6600 chip for no flux fixup. The simulation with flux fixup on one M2050 is 1.23 times faster than on one X5670.

  9. The application of the large particles method of numerical modeling of the process of carbonic nanostructures synthesis in plasma

    NASA Astrophysics Data System (ADS)

    Abramov, G. V.; Gavrilov, A. N.

    2018-03-01

    The article deals with the numerical solution of the mathematical model of the particles motion and interaction in multicomponent plasma by the example of electric arc synthesis of carbon nanostructures. The high order of the particles and the number of their interactions requires a significant input of machine resources and time for calculations. Application of the large particles method makes it possible to reduce the amount of computation and the requirements for hardware resources without affecting the accuracy of numerical calculations. The use of technology of GPGPU parallel computing using the Nvidia CUDA technology allows organizing all General purpose computation on the basis of the graphical processor graphics card. The comparative analysis of different approaches to parallelization of computations to speed up calculations with the choice of the algorithm in which to calculate the accuracy of the solution shared memory is used. Numerical study of the influence of particles density in the macro particle on the motion parameters and the total number of particle collisions in the plasma for different modes of synthesis has been carried out. The rational range of the coherence coefficient of particle in the macro particle is computed.

  10. Forward and adjoint spectral-element simulations of seismic wave propagation using hardware accelerators

    NASA Astrophysics Data System (ADS)

    Peter, Daniel; Videau, Brice; Pouget, Kevin; Komatitsch, Dimitri

    2015-04-01

    Improving the resolution of tomographic images is crucial to answer important questions on the nature of Earth's subsurface structure and internal processes. Seismic tomography is the most prominent approach where seismic signals from ground-motion records are used to infer physical properties of internal structures such as compressional- and shear-wave speeds, anisotropy and attenuation. Recent advances in regional- and global-scale seismic inversions move towards full-waveform inversions which require accurate simulations of seismic wave propagation in complex 3D media, providing access to the full 3D seismic wavefields. However, these numerical simulations are computationally very expensive and need high-performance computing (HPC) facilities for further improving the current state of knowledge. During recent years, many-core architectures such as graphics processing units (GPUs) have been added to available large HPC systems. Such GPU-accelerated computing together with advances in multi-core central processing units (CPUs) can greatly accelerate scientific applications. There are mainly two possible choices of language support for GPU cards, the CUDA programming environment and OpenCL language standard. CUDA software development targets NVIDIA graphic cards while OpenCL was adopted mainly by AMD graphic cards. In order to employ such hardware accelerators for seismic wave propagation simulations, we incorporated a code generation tool BOAST into an existing spectral-element code package SPECFEM3D_GLOBE. This allows us to use meta-programming of computational kernels and generate optimized source code for both CUDA and OpenCL languages, running simulations on either CUDA or OpenCL hardware accelerators. We show here applications of forward and adjoint seismic wave propagation on CUDA/OpenCL GPUs, validating results and comparing performances for different simulations and hardware usages.

  11. Real-time simulation of a spiking neural network model of the basal ganglia circuitry using general purpose computing on graphics processing units.

    PubMed

    Igarashi, Jun; Shouno, Osamu; Fukai, Tomoki; Tsujino, Hiroshi

    2011-11-01

    Real-time simulation of a biologically realistic spiking neural network is necessary for evaluation of its capacity to interact with real environments. However, the real-time simulation of such a neural network is difficult due to its high computational costs that arise from two factors: (1) vast network size and (2) the complicated dynamics of biologically realistic neurons. In order to address these problems, mainly the latter, we chose to use general purpose computing on graphics processing units (GPGPUs) for simulation of such a neural network, taking advantage of the powerful computational capability of a graphics processing unit (GPU). As a target for real-time simulation, we used a model of the basal ganglia that has been developed according to electrophysiological and anatomical knowledge. The model consists of heterogeneous populations of 370 spiking model neurons, including computationally heavy conductance-based models, connected by 11,002 synapses. Simulation of the model has not yet been performed in real-time using a general computing server. By parallelization of the model on the NVIDIA Geforce GTX 280 GPU in data-parallel and task-parallel fashion, faster-than-real-time simulation was robustly realized with only one-third of the GPU's total computational resources. Furthermore, we used the GPU's full computational resources to perform faster-than-real-time simulation of three instances of the basal ganglia model; these instances consisted of 1100 neurons and 33,006 synapses and were synchronized at each calculation step. Finally, we developed software for simultaneous visualization of faster-than-real-time simulation output. These results suggest the potential power of GPGPU techniques in real-time simulation of realistic neural networks. Copyright © 2011 Elsevier Ltd. All rights reserved.

  12. Multi-GPU Accelerated Admittance Method for High-Resolution Human Exposure Evaluation.

    PubMed

    Xiong, Zubiao; Feng, Shi; Kautz, Richard; Chandra, Sandeep; Altunyurt, Nevin; Chen, Ji

    2015-12-01

    A multi-graphics processing unit (GPU) accelerated admittance method solver is presented for solving the induced electric field in high-resolution anatomical models of human body when exposed to external low-frequency magnetic fields. In the solver, the anatomical model is discretized as a three-dimensional network of admittances. The conjugate orthogonal conjugate gradient (COCG) iterative algorithm is employed to take advantage of the symmetric property of the complex-valued linear system of equations. Compared against the widely used biconjugate gradient stabilized method, the COCG algorithm can reduce the solving time by 3.5 times and reduce the storage requirement by about 40%. The iterative algorithm is then accelerated further by using multiple NVIDIA GPUs. The computations and data transfers between GPUs are overlapped in time by using asynchronous concurrent execution design. The communication overhead is well hidden so that the acceleration is nearly linear with the number of GPU cards. Numerical examples show that our GPU implementation running on four NVIDIA Tesla K20c cards can reach 90 times faster than the CPU implementation running on eight CPU cores (two Intel Xeon E5-2603 processors). The implemented solver is able to solve large dimensional problems efficiently. A whole adult body discretized in 1-mm resolution can be solved in just several minutes. The high efficiency achieved makes it practical to investigate human exposure involving a large number of cases with a high resolution that meets the requirements of international dosimetry guidelines.

  13. Parallel fuzzy connected image segmentation on GPU

    PubMed Central

    Zhuge, Ying; Cao, Yong; Udupa, Jayaram K.; Miller, Robert W.

    2011-01-01

    Purpose: Image segmentation techniques using fuzzy connectedness (FC) principles have shown their effectiveness in segmenting a variety of objects in several large applications. However, one challenge in these algorithms has been their excessive computational requirements when processing large image datasets. Nowadays, commodity graphics hardware provides a highly parallel computing environment. In this paper, the authors present a parallel fuzzy connected image segmentation algorithm implementation on NVIDIA’s compute unified device Architecture (cuda) platform for segmenting medical image data sets. Methods: In the FC algorithm, there are two major computational tasks: (i) computing the fuzzy affinity relations and (ii) computing the fuzzy connectedness relations. These two tasks are implemented as cuda kernels and executed on GPU. A dramatic improvement in speed for both tasks is achieved as a result. Results: Our experiments based on three data sets of small, medium, and large data size demonstrate the efficiency of the parallel algorithm, which achieves a speed-up factor of 24.4x, 18.1x, and 10.3x, correspondingly, for the three data sets on the NVIDIA Tesla C1060 over the implementation of the algorithm on CPU, and takes 0.25, 0.72, and 15.04 s, correspondingly, for the three data sets. Conclusions: The authors developed a parallel algorithm of the widely used fuzzy connected image segmentation method on the NVIDIA GPUs, which are far more cost- and speed-effective than both cluster of workstations and multiprocessing systems. A near-interactive speed of segmentation has been achieved, even for the large data set. PMID:21859037

  14. GENIE: a software package for gene-gene interaction analysis in genetic association studies using multiple GPU or CPU cores.

    PubMed

    Chikkagoudar, Satish; Wang, Kai; Li, Mingyao

    2011-05-26

    Gene-gene interaction in genetic association studies is computationally intensive when a large number of SNPs are involved. Most of the latest Central Processing Units (CPUs) have multiple cores, whereas Graphics Processing Units (GPUs) also have hundreds of cores and have been recently used to implement faster scientific software. However, currently there are no genetic analysis software packages that allow users to fully utilize the computing power of these multi-core devices for genetic interaction analysis for binary traits. Here we present a novel software package GENIE, which utilizes the power of multiple GPU or CPU processor cores to parallelize the interaction analysis. GENIE reads an entire genetic association study dataset into memory and partitions the dataset into fragments with non-overlapping sets of SNPs. For each fragment, GENIE analyzes: 1) the interaction of SNPs within it in parallel, and 2) the interaction between the SNPs of the current fragment and other fragments in parallel. We tested GENIE on a large-scale candidate gene study on high-density lipoprotein cholesterol. Using an NVIDIA Tesla C1060 graphics card, the GPU mode of GENIE achieves a speedup of 27 times over its single-core CPU mode run. GENIE is open-source, economical, user-friendly, and scalable. Since the computing power and memory capacity of graphics cards are increasing rapidly while their cost is going down, we anticipate that GENIE will achieve greater speedups with faster GPU cards. Documentation, source code, and precompiled binaries can be downloaded from http://www.cceb.upenn.edu/~mli/software/GENIE/.

  15. GENIE: a software package for gene-gene interaction analysis in genetic association studies using multiple GPU or CPU cores

    PubMed Central

    2011-01-01

    Background Gene-gene interaction in genetic association studies is computationally intensive when a large number of SNPs are involved. Most of the latest Central Processing Units (CPUs) have multiple cores, whereas Graphics Processing Units (GPUs) also have hundreds of cores and have been recently used to implement faster scientific software. However, currently there are no genetic analysis software packages that allow users to fully utilize the computing power of these multi-core devices for genetic interaction analysis for binary traits. Findings Here we present a novel software package GENIE, which utilizes the power of multiple GPU or CPU processor cores to parallelize the interaction analysis. GENIE reads an entire genetic association study dataset into memory and partitions the dataset into fragments with non-overlapping sets of SNPs. For each fragment, GENIE analyzes: 1) the interaction of SNPs within it in parallel, and 2) the interaction between the SNPs of the current fragment and other fragments in parallel. We tested GENIE on a large-scale candidate gene study on high-density lipoprotein cholesterol. Using an NVIDIA Tesla C1060 graphics card, the GPU mode of GENIE achieves a speedup of 27 times over its single-core CPU mode run. Conclusions GENIE is open-source, economical, user-friendly, and scalable. Since the computing power and memory capacity of graphics cards are increasing rapidly while their cost is going down, we anticipate that GENIE will achieve greater speedups with faster GPU cards. Documentation, source code, and precompiled binaries can be downloaded from http://www.cceb.upenn.edu/~mli/software/GENIE/. PMID:21615923

  16. 77 FR 26789 - Certain Semiconductor Chips Having Synchronous Dynamic Random Access Memory Controllers and...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2012-05-07

    ... patents. 73 FR 75131. The principal respondent was NVIDIA Corporation of Santa Clara, California (``NVIDIA''). Joining NVIDIA as respondents were approximately twenty of NVIDIA's customers. The Commission found a... accused products in the United States: NVIDIA; Hewlett-Packard Co. of Palo Alto, California; ASUS Computer...

  17. GPU-based streaming architectures for fast cone-beam CT image reconstruction and demons deformable registration.

    PubMed

    Sharp, G C; Kandasamy, N; Singh, H; Folkert, M

    2007-10-07

    This paper shows how to significantly accelerate cone-beam CT reconstruction and 3D deformable image registration using the stream-processing model. We describe data-parallel designs for the Feldkamp, Davis and Kress (FDK) reconstruction algorithm, and the demons deformable registration algorithm, suitable for use on a commodity graphics processing unit. The streaming versions of these algorithms are implemented using the Brook programming environment and executed on an NVidia 8800 GPU. Performance results using CT data of a preserved swine lung indicate that the GPU-based implementations of the FDK and demons algorithms achieve a substantial speedup--up to 80 times for FDK and 70 times for demons when compared to an optimized reference implementation on a 2.8 GHz Intel processor. In addition, the accuracy of the GPU-based implementations was found to be excellent. Compared with CPU-based implementations, the RMS differences were less than 0.1 Hounsfield unit for reconstruction and less than 0.1 mm for deformable registration.

  18. Accelerating image reconstruction in dual-head PET system by GPU and symmetry properties.

    PubMed

    Chou, Cheng-Ying; Dong, Yun; Hung, Yukai; Kao, Yu-Jiun; Wang, Weichung; Kao, Chien-Min; Chen, Chin-Tu

    2012-01-01

    Positron emission tomography (PET) is an important imaging modality in both clinical usage and research studies. We have developed a compact high-sensitivity PET system that consisted of two large-area panel PET detector heads, which produce more than 224 million lines of response and thus request dramatic computational demands. In this work, we employed a state-of-the-art graphics processing unit (GPU), NVIDIA Tesla C2070, to yield an efficient reconstruction process. Our approaches ingeniously integrate the distinguished features of the symmetry properties of the imaging system and GPU architectures, including block/warp/thread assignments and effective memory usage, to accelerate the computations for ordered subset expectation maximization (OSEM) image reconstruction. The OSEM reconstruction algorithms were implemented employing both CPU-based and GPU-based codes, and their computational performance was quantitatively analyzed and compared. The results showed that the GPU-accelerated scheme can drastically reduce the reconstruction time and thus can largely expand the applicability of the dual-head PET system.

  19. Automatic detection and classification of obstacles with applications in autonomous mobile robots

    NASA Astrophysics Data System (ADS)

    Ponomaryov, Volodymyr I.; Rosas-Miranda, Dario I.

    2016-04-01

    Hardware implementation of an automatic detection and classification of objects that can represent an obstacle for an autonomous mobile robot using stereo vision algorithms is presented. We propose and evaluate a new method to detect and classify objects for a mobile robot in outdoor conditions. This method is divided in two parts, the first one is the object detection step based on the distance from the objects to the camera and a BLOB analysis. The second part is the classification step that is based on visuals primitives and a SVM classifier. The proposed method is performed in GPU in order to reduce the processing time values. This is performed with help of hardware based on multi-core processors and GPU platform, using a NVIDIA R GeForce R GT640 graphic card and Matlab over a PC with Windows 10.

  20. CUDAEASY - a GPU accelerated cosmological lattice program

    NASA Astrophysics Data System (ADS)

    Sainio, J.

    2010-05-01

    This paper presents, to the author's knowledge, the first graphics processing unit (GPU) accelerated program that solves the evolution of interacting scalar fields in an expanding universe. We present the implementation in NVIDIA's Compute Unified Device Architecture (CUDA) and compare the performance to other similar programs in chaotic inflation models. We report speedups between one and two orders of magnitude depending on the used hardware and software while achieving small errors in single precision. Simulations that used to last roughly one day to compute can now be done in hours and this difference is expected to increase in the future. The program has been written in the spirit of LATTICEEASY and users of the aforementioned program should find it relatively easy to start using CUDAEASY in lattice simulations. The program is available at http://www.physics.utu.fi/theory/particlecosmology/cudaeasy/ under the GNU General Public License.

  1. Fast multipole methods on a cluster of GPUs for the meshless simulation of turbulence

    NASA Astrophysics Data System (ADS)

    Yokota, R.; Narumi, T.; Sakamaki, R.; Kameoka, S.; Obi, S.; Yasuoka, K.

    2009-11-01

    Recent advances in the parallelizability of fast N-body algorithms, and the programmability of graphics processing units (GPUs) have opened a new path for particle based simulations. For the simulation of turbulence, vortex methods can now be considered as an interesting alternative to finite difference and spectral methods. The present study focuses on the efficient implementation of the fast multipole method and pseudo-particle method on a cluster of NVIDIA GeForce 8800 GT GPUs, and applies this to a vortex method calculation of homogeneous isotropic turbulence. The results of the present vortex method agree quantitatively with that of the reference calculation using a spectral method. We achieved a maximum speed of 7.48 TFlops using 64 GPUs, and the cost performance was near 9.4/GFlops. The calculation of the present vortex method on 64 GPUs took 4120 s, while the spectral method on 32 CPUs took 4910 s.

  2. GPU accelerated implementation of NCI calculations using promolecular density.

    PubMed

    Rubez, Gaëtan; Etancelin, Jean-Matthieu; Vigouroux, Xavier; Krajecki, Michael; Boisson, Jean-Charles; Hénon, Eric

    2017-05-30

    The NCI approach is a modern tool to reveal chemical noncovalent interactions. It is particularly attractive to describe ligand-protein binding. A custom implementation for NCI using promolecular density is presented. It is designed to leverage the computational power of NVIDIA graphics processing unit (GPU) accelerators through the CUDA programming model. The code performances of three versions are examined on a test set of 144 systems. NCI calculations are particularly well suited to the GPU architecture, which reduces drastically the computational time. On a single compute node, the dual-GPU version leads to a 39-fold improvement for the biggest instance compared to the optimal OpenMP parallel run (C code, icc compiler) with 16 CPU cores. Energy consumption measurements carried out on both CPU and GPU NCI tests show that the GPU approach provides substantial energy savings. © 2017 Wiley Periodicals, Inc. © 2017 Wiley Periodicals, Inc.

  3. GPU-Powered Coherent Beamforming

    NASA Astrophysics Data System (ADS)

    Magro, A.; Adami, K. Zarb; Hickish, J.

    2015-03-01

    Graphics processing units (GPU)-based beamforming is a relatively unexplored area in radio astronomy, possibly due to the assumption that any such system will be severely limited by the PCIe bandwidth required to transfer data to the GPU. We have developed a CUDA-based GPU implementation of a coherent beamformer, specifically designed and optimized for deployment at the BEST-2 array which can generate an arbitrary number of synthesized beams for a wide range of parameters. It achieves ˜1.3 TFLOPs on an NVIDIA Tesla K20, approximately 10x faster than an optimized, multithreaded CPU implementation. This kernel has been integrated into two real-time, GPU-based time-domain software pipelines deployed at the BEST-2 array in Medicina: a standalone beamforming pipeline and a transient detection pipeline. We present performance benchmarks for the beamforming kernel as well as the transient detection pipeline with beamforming capabilities as well as results of test observation.

  4. A GPU-based calculation using the three-dimensional FDTD method for electromagnetic field analysis.

    PubMed

    Nagaoka, Tomoaki; Watanabe, Soichi

    2010-01-01

    Numerical simulations with the numerical human model using the finite-difference time domain (FDTD) method have recently been performed frequently in a number of fields in biomedical engineering. However, the FDTD calculation runs too slowly. We focus, therefore, on general purpose programming on the graphics processing unit (GPGPU). The three-dimensional FDTD method was implemented on the GPU using Compute Unified Device Architecture (CUDA). In this study, we used the NVIDIA Tesla C1060 as a GPGPU board. The performance of the GPU is evaluated in comparison with the performance of a conventional CPU and a vector supercomputer. The results indicate that three-dimensional FDTD calculations using a GPU can significantly reduce run time in comparison with that using a conventional CPU, even a native GPU implementation of the three-dimensional FDTD method, while the GPU/CPU speed ratio varies with the calculation domain and thread block size.

  5. cudaMap: a GPU accelerated program for gene expression connectivity mapping

    PubMed Central

    2013-01-01

    Background Modern cancer research often involves large datasets and the use of sophisticated statistical techniques. Together these add a heavy computational load to the analysis, which is often coupled with issues surrounding data accessibility. Connectivity mapping is an advanced bioinformatic and computational technique dedicated to therapeutics discovery and drug re-purposing around differential gene expression analysis. On a normal desktop PC, it is common for the connectivity mapping task with a single gene signature to take > 2h to complete using sscMap, a popular Java application that runs on standard CPUs (Central Processing Units). Here, we describe new software, cudaMap, which has been implemented using CUDA C/C++ to harness the computational power of NVIDIA GPUs (Graphics Processing Units) to greatly reduce processing times for connectivity mapping. Results cudaMap can identify candidate therapeutics from the same signature in just over thirty seconds when using an NVIDIA Tesla C2050 GPU. Results from the analysis of multiple gene signatures, which would previously have taken several days, can now be obtained in as little as 10 minutes, greatly facilitating candidate therapeutics discovery with high throughput. We are able to demonstrate dramatic speed differentials between GPU assisted performance and CPU executions as the computational load increases for high accuracy evaluation of statistical significance. Conclusion Emerging ‘omics’ technologies are constantly increasing the volume of data and information to be processed in all areas of biomedical research. Embracing the multicore functionality of GPUs represents a major avenue of local accelerated computing. cudaMap will make a strong contribution in the discovery of candidate therapeutics by enabling speedy execution of heavy duty connectivity mapping tasks, which are increasingly required in modern cancer research. cudaMap is open source and can be freely downloaded from http://purl.oclc.org/NET/cudaMap. PMID:24112435

  6. A real-time spike sorting method based on the embedded GPU.

    PubMed

    Zelan Yang; Kedi Xu; Xiang Tian; Shaomin Zhang; Xiaoxiang Zheng

    2017-07-01

    Microelectrode arrays with hundreds of channels have been widely used to acquire neuron population signals in neuroscience studies. Online spike sorting is becoming one of the most important challenges for high-throughput neural signal acquisition systems. Graphic processing unit (GPU) with high parallel computing capability might provide an alternative solution for increasing real-time computational demands on spike sorting. This study reported a method of real-time spike sorting through computing unified device architecture (CUDA) which was implemented on an embedded GPU (NVIDIA JETSON Tegra K1, TK1). The sorting approach is based on the principal component analysis (PCA) and K-means. By analyzing the parallelism of each process, the method was further optimized in the thread memory model of GPU. Our results showed that the GPU-based classifier on TK1 is 37.92 times faster than the MATLAB-based classifier on PC while their accuracies were the same with each other. The high-performance computing features of embedded GPU demonstrated in our studies suggested that the embedded GPU provide a promising platform for the real-time neural signal processing.

  7. Performance evaluation for volumetric segmentation of multiple sclerosis lesions using MATLAB and computing engine in the graphical processing unit (GPU)

    NASA Astrophysics Data System (ADS)

    Le, Anh H.; Park, Young W.; Ma, Kevin; Jacobs, Colin; Liu, Brent J.

    2010-03-01

    Multiple Sclerosis (MS) is a progressive neurological disease affecting myelin pathways in the brain. Multiple lesions in the white matter can cause paralysis and severe motor disabilities of the affected patient. To solve the issue of inconsistency and user-dependency in manual lesion measurement of MRI, we have proposed a 3-D automated lesion quantification algorithm to enable objective and efficient lesion volume tracking. The computer-aided detection (CAD) of MS, written in MATLAB, utilizes K-Nearest Neighbors (KNN) method to compute the probability of lesions on a per-voxel basis. Despite the highly optimized algorithm of imaging processing that is used in CAD development, MS CAD integration and evaluation in clinical workflow is technically challenging due to the requirement of high computation rates and memory bandwidth in the recursive nature of the algorithm. In this paper, we present the development and evaluation of using a computing engine in the graphical processing unit (GPU) with MATLAB for segmentation of MS lesions. The paper investigates the utilization of a high-end GPU for parallel computing of KNN in the MATLAB environment to improve algorithm performance. The integration is accomplished using NVIDIA's CUDA developmental toolkit for MATLAB. The results of this study will validate the practicality and effectiveness of the prototype MS CAD in a clinical setting. The GPU method may allow MS CAD to rapidly integrate in an electronic patient record or any disease-centric health care system.

  8. Experiences modeling ocean circulation problems on a 30 node commodity cluster with 3840 GPU processor cores.

    NASA Astrophysics Data System (ADS)

    Hill, C.

    2008-12-01

    Low cost graphic cards today use many, relatively simple, compute cores to deliver support for memory bandwidth of more than 100GB/s and theoretical floating point performance of more than 500 GFlop/s. Right now this performance is, however, only accessible to highly parallel algorithm implementations that, (i) can use a hundred or more, 32-bit floating point, concurrently executing cores, (ii) can work with graphics memory that resides on the graphics card side of the graphics bus and (iii) can be partially expressed in a language that can be compiled by a graphics programming tool. In this talk we describe our experiences implementing a complete, but relatively simple, time dependent shallow-water equations simulation targeting a cluster of 30 computers each hosting one graphics card. The implementation takes into account the considerations (i), (ii) and (iii) listed previously. We code our algorithm as a series of numerical kernels. Each kernel is designed to be executed by multiple threads of a single process. Kernels are passed memory blocks to compute over which can be persistent blocks of memory on a graphics card. Each kernel is individually implemented using the NVidia CUDA language but driven from a higher level supervisory code that is almost identical to a standard model driver. The supervisory code controls the overall simulation timestepping, but is written to minimize data transfer between main memory and graphics memory (a massive performance bottle-neck on current systems). Using the recipe outlined we can boost the performance of our cluster by nearly an order of magnitude, relative to the same algorithm executing only on the cluster CPU's. Achieving this performance boost requires that many threads are available to each graphics processor for execution within each numerical kernel and that the simulations working set of data can fit into the graphics card memory. As we describe, this puts interesting upper and lower bounds on the problem sizes for which this technology is currently most useful. However, many interesting problems fit within this envelope. Looking forward, we extrapolate our experience to estimate full-scale ocean model performance and applicability. Finally we describe preliminary hybrid mixed 32-bit and 64-bit experiments with graphics cards that support 64-bit arithmetic, albeit at a lower performance.

  9. CUDASW++: optimizing Smith-Waterman sequence database searches for CUDA-enabled graphics processing units

    PubMed Central

    Liu, Yongchao; Maskell, Douglas L; Schmidt, Bertil

    2009-01-01

    Background The Smith-Waterman algorithm is one of the most widely used tools for searching biological sequence databases due to its high sensitivity. Unfortunately, the Smith-Waterman algorithm is computationally demanding, which is further compounded by the exponential growth of sequence databases. The recent emergence of many-core architectures, and their associated programming interfaces, provides an opportunity to accelerate sequence database searches using commonly available and inexpensive hardware. Findings Our CUDASW++ implementation (benchmarked on a single-GPU NVIDIA GeForce GTX 280 graphics card and a dual-GPU GeForce GTX 295 graphics card) provides a significant performance improvement compared to other publicly available implementations, such as SWPS3, CBESW, SW-CUDA, and NCBI-BLAST. CUDASW++ supports query sequences of length up to 59K and for query sequences ranging in length from 144 to 5,478 in Swiss-Prot release 56.6, the single-GPU version achieves an average performance of 9.509 GCUPS with a lowest performance of 9.039 GCUPS and a highest performance of 9.660 GCUPS, and the dual-GPU version achieves an average performance of 14.484 GCUPS with a lowest performance of 10.660 GCUPS and a highest performance of 16.087 GCUPS. Conclusion CUDASW++ is publicly available open-source software. It provides a significant performance improvement for Smith-Waterman-based protein sequence database searches by fully exploiting the compute capability of commonly used CUDA-enabled low-cost GPUs. PMID:19416548

  10. Accelerating Pseudo-Random Number Generator for MCNP on GPU

    NASA Astrophysics Data System (ADS)

    Gong, Chunye; Liu, Jie; Chi, Lihua; Hu, Qingfeng; Deng, Li; Gong, Zhenghu

    2010-09-01

    Pseudo-random number generators (PRNG) are intensively used in many stochastic algorithms in particle simulations, artificial neural networks and other scientific computation. The PRNG in Monte Carlo N-Particle Transport Code (MCNP) requires long period, high quality, flexible jump and fast enough. In this paper, we implement such a PRNG for MCNP on NVIDIA's GTX200 Graphics Processor Units (GPU) using CUDA programming model. Results shows that 3.80 to 8.10 times speedup are achieved compared with 4 to 6 cores CPUs and more than 679.18 million double precision random numbers can be generated per second on GPU.

  11. Spectral-element Seismic Wave Propagation on CUDA/OpenCL Hardware Accelerators

    NASA Astrophysics Data System (ADS)

    Peter, D. B.; Videau, B.; Pouget, K.; Komatitsch, D.

    2015-12-01

    Seismic wave propagation codes are essential tools to investigate a variety of wave phenomena in the Earth. Furthermore, they can now be used for seismic full-waveform inversions in regional- and global-scale adjoint tomography. Although these seismic wave propagation solvers are crucial ingredients to improve the resolution of tomographic images to answer important questions about the nature of Earth's internal processes and subsurface structure, their practical application is often limited due to high computational costs. They thus need high-performance computing (HPC) facilities to improving the current state of knowledge. At present, numerous large HPC systems embed many-core architectures such as graphics processing units (GPUs) to enhance numerical performance. Such hardware accelerators can be programmed using either the CUDA programming environment or the OpenCL language standard. CUDA software development targets NVIDIA graphic cards while OpenCL was adopted by additional hardware accelerators, like e.g. AMD graphic cards, ARM-based processors as well as Intel Xeon Phi coprocessors. For seismic wave propagation simulations using the open-source spectral-element code package SPECFEM3D_GLOBE, we incorporated an automatic source-to-source code generation tool (BOAST) which allows us to use meta-programming of all computational kernels for forward and adjoint runs. Using our BOAST kernels, we generate optimized source code for both CUDA and OpenCL languages within the source code package. Thus, seismic wave simulations are able now to fully utilize CUDA and OpenCL hardware accelerators. We show benchmarks of forward seismic wave propagation simulations using SPECFEM3D_GLOBE on CUDA/OpenCL GPUs, validating results and comparing performances for different simulations and hardware usages.

  12. Discovering epistasis in large scale genetic association studies by exploiting graphics cards.

    PubMed

    Chen, Gary K; Guo, Yunfei

    2013-12-03

    Despite the enormous investments made in collecting DNA samples and generating germline variation data across thousands of individuals in modern genome-wide association studies (GWAS), progress has been frustratingly slow in explaining much of the heritability in common disease. Today's paradigm of testing independent hypotheses on each single nucleotide polymorphism (SNP) marker is unlikely to adequately reflect the complex biological processes in disease risk. Alternatively, modeling risk as an ensemble of SNPs that act in concert in a pathway, and/or interact non-additively on log risk for example, may be a more sensible way to approach gene mapping in modern studies. Implementing such analyzes genome-wide can quickly become intractable due to the fact that even modest size SNP panels on modern genotype arrays (500k markers) pose a combinatorial nightmare, require tens of billions of models to be tested for evidence of interaction. In this article, we provide an in-depth analysis of programs that have been developed to explicitly overcome these enormous computational barriers through the use of processors on graphics cards known as Graphics Processing Units (GPU). We include tutorials on GPU technology, which will convey why they are growing in appeal with today's numerical scientists. One obvious advantage is the impressive density of microprocessor cores that are available on only a single GPU. Whereas high end servers feature up to 24 Intel or AMD CPU cores, the latest GPU offerings from nVidia feature over 2600 cores. Each compute node may be outfitted with up to 4 GPU devices. Success on GPUs varies across problems. However, epistasis screens fare well due to the high degree of parallelism exposed in these problems. Papers that we review routinely report GPU speedups of over two orders of magnitude (>100x) over standard CPU implementations.

  13. Discovering epistasis in large scale genetic association studies by exploiting graphics cards

    PubMed Central

    Chen, Gary K.; Guo, Yunfei

    2013-01-01

    Despite the enormous investments made in collecting DNA samples and generating germline variation data across thousands of individuals in modern genome-wide association studies (GWAS), progress has been frustratingly slow in explaining much of the heritability in common disease. Today's paradigm of testing independent hypotheses on each single nucleotide polymorphism (SNP) marker is unlikely to adequately reflect the complex biological processes in disease risk. Alternatively, modeling risk as an ensemble of SNPs that act in concert in a pathway, and/or interact non-additively on log risk for example, may be a more sensible way to approach gene mapping in modern studies. Implementing such analyzes genome-wide can quickly become intractable due to the fact that even modest size SNP panels on modern genotype arrays (500k markers) pose a combinatorial nightmare, require tens of billions of models to be tested for evidence of interaction. In this article, we provide an in-depth analysis of programs that have been developed to explicitly overcome these enormous computational barriers through the use of processors on graphics cards known as Graphics Processing Units (GPU). We include tutorials on GPU technology, which will convey why they are growing in appeal with today's numerical scientists. One obvious advantage is the impressive density of microprocessor cores that are available on only a single GPU. Whereas high end servers feature up to 24 Intel or AMD CPU cores, the latest GPU offerings from nVidia feature over 2600 cores. Each compute node may be outfitted with up to 4 GPU devices. Success on GPUs varies across problems. However, epistasis screens fare well due to the high degree of parallelism exposed in these problems. Papers that we review routinely report GPU speedups of over two orders of magnitude (>100x) over standard CPU implementations. PMID:24348518

  14. Multistage Analysis of Cyber Threats for Quick Mission Impact Assessment (CyberIA)

    DTIC Science & Technology

    2015-09-01

    Corporation. NVIDIA ® is a registered trademark of the NVIDIA Corporation. CUDA™ is a trademark of the NVIDIA Corporation. Released by J. Lee...for developing and integrating different high-performance C/C++ algorithms. This capability is significant because NVIDIA ® CUDA™ architecture

  15. GPU based framework for geospatial analyses

    NASA Astrophysics Data System (ADS)

    Cosmin Sandric, Ionut; Ionita, Cristian; Dardala, Marian; Furtuna, Titus

    2017-04-01

    Parallel processing on multiple CPU cores is already used at large scale in geocomputing, but parallel processing on graphics cards is just at the beginning. Being able to use an simple laptop with a dedicated graphics card for advanced and very fast geocomputation is an advantage that each scientist wants to have. The necessity to have high speed computation in geosciences has increased in the last 10 years, mostly due to the increase in the available datasets. These datasets are becoming more and more detailed and hence they require more space to store and more time to process. Distributed computation on multicore CPU's and GPU's plays an important role by processing one by one small parts from these big datasets. These way of computations allows to speed up the process, because instead of using just one process for each dataset, the user can use all the cores from a CPU or up to hundreds of cores from GPU The framework provide to the end user a standalone tools for morphometry analyses at multiscale level. An important part of the framework is dedicated to uncertainty propagation in geospatial analyses. The uncertainty may come from the data collection or may be induced by the model or may have an infinite sources. These uncertainties plays important roles when a spatial delineation of the phenomena is modelled. Uncertainty propagation is implemented inside the GPU framework using Monte Carlo simulations. The GPU framework with the standalone tools proved to be a reliable tool for modelling complex natural phenomena The framework is based on NVidia Cuda technology and is written in C++ programming language. The code source will be available on github at https://github.com/sandricionut/GeoRsGPU Acknowledgement: GPU framework for geospatial analysis, Young Researchers Grant (ICUB-University of Bucharest) 2016, director Ionut Sandric

  16. Data Acquisition with GPUs: The DAQ for the Muon $g$-$2$ Experiment at Fermilab

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Gohn, W.

    Graphical Processing Units (GPUs) have recently become a valuable computing tool for the acquisition of data at high rates and for a relatively low cost. The devices work by parallelizing the code into thousands of threads, each executing a simple process, such as identifying pulses from a waveform digitizer. The CUDA programming library can be used to effectively write code to parallelize such tasks on Nvidia GPUs, providing a significant upgrade in performance over CPU based acquisition systems. The muonmore » $g$-$2$ experiment at Fermilab is heavily relying on GPUs to process its data. The data acquisition system for this experiment must have the ability to create deadtime-free records from 700 $$\\mu$$s muon spills at a raw data rate 18 GB per second. Data will be collected using 1296 channels of $$\\mu$$TCA-based 800 MSPS, 12 bit waveform digitizers and processed in a layered array of networked commodity processors with 24 GPUs working in parallel to perform a fast recording of the muon decays during the spill. The described data acquisition system is currently being constructed, and will be fully operational before the start of the experiment in 2017.« less

  17. GBOOST: a GPU-based tool for detecting gene-gene interactions in genome-wide case control studies.

    PubMed

    Yung, Ling Sing; Yang, Can; Wan, Xiang; Yu, Weichuan

    2011-05-01

    Collecting millions of genetic variations is feasible with the advanced genotyping technology. With a huge amount of genetic variations data in hand, developing efficient algorithms to carry out the gene-gene interaction analysis in a timely manner has become one of the key problems in genome-wide association studies (GWAS). Boolean operation-based screening and testing (BOOST), a recent work in GWAS, completes gene-gene interaction analysis in 2.5 days on a desktop computer. Compared with central processing units (CPUs), graphic processing units (GPUs) are highly parallel hardware and provide massive computing resources. We are, therefore, motivated to use GPUs to further speed up the analysis of gene-gene interactions. We implement the BOOST method based on a GPU framework and name it GBOOST. GBOOST achieves a 40-fold speedup compared with BOOST. It completes the analysis of Wellcome Trust Case Control Consortium Type 2 Diabetes (WTCCC T2D) genome data within 1.34 h on a desktop computer equipped with Nvidia GeForce GTX 285 display card. GBOOST code is available at http://bioinformatics.ust.hk/BOOST.html#GBOOST.

  18. Fast simulation of Proton Induced X-Ray Emission Tomography using CUDA

    NASA Astrophysics Data System (ADS)

    Beasley, D. G.; Marques, A. C.; Alves, L. C.; da Silva, R. C.

    2013-07-01

    A new 3D Proton Induced X-Ray Emission Tomography (PIXE-T) and Scanning Transmission Ion Microscopy Tomography (STIM-T) simulation software has been developed in Java and uses NVIDIA™ Common Unified Device Architecture (CUDA) to calculate the X-ray attenuation for large detector areas. A challenge with PIXE-T is to get sufficient counts while retaining a small beam spot size. Therefore a high geometric efficiency is required. However, as the detector solid angle increases the calculations required for accurate reconstruction of the data increase substantially. To overcome this limitation, the CUDA parallel computing platform was used which enables general purpose programming of NVIDIA graphics processing units (GPUs) to perform computations traditionally handled by the central processing unit (CPU). For simulation performance evaluation, the results of a CPU- and a CUDA-based simulation of a phantom are presented. Furthermore, a comparison with the simulation code in the PIXE-Tomography reconstruction software DISRA (A. Sakellariou, D.N. Jamieson, G.J.F. Legge, 2001) is also shown. Compared to a CPU implementation, the CUDA based simulation is approximately 30× faster.

  19. GPU Accelerated Vector Median Filter

    NASA Technical Reports Server (NTRS)

    Aras, Rifat; Shen, Yuzhong

    2011-01-01

    Noise reduction is an important step for most image processing tasks. For three channel color images, a widely used technique is vector median filter in which color values of pixels are treated as 3-component vectors. Vector median filters are computationally expensive; for a window size of n x n, each of the n(sup 2) vectors has to be compared with other n(sup 2) - 1 vectors in distances. General purpose computation on graphics processing units (GPUs) is the paradigm of utilizing high-performance many-core GPU architectures for computation tasks that are normally handled by CPUs. In this work. NVIDIA's Compute Unified Device Architecture (CUDA) paradigm is used to accelerate vector median filtering. which has to the best of our knowledge never been done before. The performance of GPU accelerated vector median filter is compared to that of the CPU and MPI-based versions for different image and window sizes, Initial findings of the study showed 100x improvement of performance of vector median filter implementation on GPUs over CPU implementations and further speed-up is expected after more extensive optimizations of the GPU algorithm .

  20. Graphics processing unit accelerated phase field dislocation dynamics: Application to bi-metallic interfaces

    DOE PAGES

    Eghtesad, Adnan; Germaschewski, Kai; Beyerlein, Irene J.; ...

    2017-10-14

    We present the first high-performance computing implementation of the meso-scale phase field dislocation dynamics (PFDD) model on a graphics processing unit (GPU)-based platform. The implementation takes advantage of the portable OpenACC standard directive pragmas along with Nvidia's compute unified device architecture (CUDA) fast Fourier transform (FFT) library called CUFFT to execute the FFT computations within the PFDD formulation on the same GPU platform. The overall implementation is termed ACCPFDD-CUFFT. The package is entirely performance portable due to the use of OPENACC-CUDA inter-operability, in which calls to CUDA functions are replaced with the OPENACC data regions for a host central processingmore » unit (CPU) and device (GPU). A comprehensive benchmark study has been conducted, which compares a number of FFT routines, the Numerical Recipes FFT (FOURN), Fastest Fourier Transform in the West (FFTW), and the CUFFT. The last one exploits the advantages of the GPU hardware for FFT calculations. The novel ACCPFDD-CUFFT implementation is verified using the analytical solutions for the stress field around an infinite edge dislocation and subsequently applied to simulate the interaction and motion of dislocations through a bi-phase copper-nickel (Cu–Ni) interface. It is demonstrated that the ACCPFDD-CUFFT implementation on a single TESLA K80 GPU offers a 27.6X speedup relative to the serial version and a 5X speedup relative to the 22-multicore Intel Xeon CPU E5-2699 v4 @ 2.20 GHz version of the code.« less

  1. Graphics processing unit accelerated phase field dislocation dynamics: Application to bi-metallic interfaces

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Eghtesad, Adnan; Germaschewski, Kai; Beyerlein, Irene J.

    We present the first high-performance computing implementation of the meso-scale phase field dislocation dynamics (PFDD) model on a graphics processing unit (GPU)-based platform. The implementation takes advantage of the portable OpenACC standard directive pragmas along with Nvidia's compute unified device architecture (CUDA) fast Fourier transform (FFT) library called CUFFT to execute the FFT computations within the PFDD formulation on the same GPU platform. The overall implementation is termed ACCPFDD-CUFFT. The package is entirely performance portable due to the use of OPENACC-CUDA inter-operability, in which calls to CUDA functions are replaced with the OPENACC data regions for a host central processingmore » unit (CPU) and device (GPU). A comprehensive benchmark study has been conducted, which compares a number of FFT routines, the Numerical Recipes FFT (FOURN), Fastest Fourier Transform in the West (FFTW), and the CUFFT. The last one exploits the advantages of the GPU hardware for FFT calculations. The novel ACCPFDD-CUFFT implementation is verified using the analytical solutions for the stress field around an infinite edge dislocation and subsequently applied to simulate the interaction and motion of dislocations through a bi-phase copper-nickel (Cu–Ni) interface. It is demonstrated that the ACCPFDD-CUFFT implementation on a single TESLA K80 GPU offers a 27.6X speedup relative to the serial version and a 5X speedup relative to the 22-multicore Intel Xeon CPU E5-2699 v4 @ 2.20 GHz version of the code.« less

  2. GPU: the biggest key processor for AI and parallel processing

    NASA Astrophysics Data System (ADS)

    Baji, Toru

    2017-07-01

    Two types of processors exist in the market. One is the conventional CPU and the other is Graphic Processor Unit (GPU). Typical CPU is composed of 1 to 8 cores while GPU has thousands of cores. CPU is good for sequential processing, while GPU is good to accelerate software with heavy parallel executions. GPU was initially dedicated for 3D graphics. However from 2006, when GPU started to apply general-purpose cores, it was noticed that this architecture can be used as a general purpose massive-parallel processor. NVIDIA developed a software framework Compute Unified Device Architecture (CUDA) that make it possible to easily program the GPU for these application. With CUDA, GPU started to be used in workstations and supercomputers widely. Recently two key technologies are highlighted in the industry. The Artificial Intelligence (AI) and Autonomous Driving Cars. AI requires a massive parallel operation to train many-layers of neural networks. With CPU alone, it was impossible to finish the training in a practical time. The latest multi-GPU system with P100 makes it possible to finish the training in a few hours. For the autonomous driving cars, TOPS class of performance is required to implement perception, localization, path planning processing and again SoC with integrated GPU will play a key role there. In this paper, the evolution of the GPU which is one of the biggest commercial devices requiring state-of-the-art fabrication technology will be introduced. Also overview of the GPU demanding key application like the ones described above will be introduced.

  3. Multi-Core Programming Design Patterns: Stream Processing Algorithms for Dynamic Scene Perceptions

    DTIC Science & Technology

    2014-05-01

    processor developed by IBM and other companies , incorpo- rates the verb—POWER5— processor as the Power Processor Element (PPE), one of the early general...deliver an power efficient single-precision peak performance of more than 256 GFlops. Substantially more raw power became available later, when nVIDIA ...algorithms, including IBM’s Cell/B.E., GPUs from NVidia and AMD and many-core CPUs from Intel.27 The vast growth of digital video content has been a

  4. Implementation of Headtracking and 3D Stereo with Unity and VRPN for Computer Simulations

    NASA Technical Reports Server (NTRS)

    Noyes, Matthew A.

    2013-01-01

    This paper explores low-cost hardware and software methods to provide depth cues traditionally absent in monocular displays. The use of a VRPN server in conjunction with a Microsoft Kinect and/or Nintendo Wiimote to provide head tracking information to a Unity application, and NVIDIA 3D Vision for retinal disparity support, is discussed. Methods are suggested to implement this technology with NASA's EDGE simulation graphics package, along with potential caveats. Finally, future applications of this technology to astronaut crew training, particularly when combined with an omnidirectional treadmill for virtual locomotion and NASA's ARGOS system for reduced gravity simulation, are discussed.

  5. Charge order-superfluidity transition in a two-dimensional system of hard-core bosons and emerging domain structures

    NASA Astrophysics Data System (ADS)

    Moskvin, A. S.; Panov, Yu. D.; Rybakov, F. N.; Borisov, A. B.

    2017-11-01

    We have used high-performance parallel computations by NVIDIA graphics cards applying the method of nonlinear conjugate gradients and Monte Carlo method to observe directly the developing ground state configuration of a two-dimensional hard-core boson system with decrease in temperature, and its evolution with deviation from a half-filling. This has allowed us to explore unconventional features of a charge order—superfluidity phase transition, specifically, formation of an irregular domain structure, emergence of a filamentary superfluid structure that condenses within of the charge-ordered phase domain antiphase boundaries, and formation and evolution of various topological structures.

  6. Line-by-line spectroscopic simulations on graphics processing units

    NASA Astrophysics Data System (ADS)

    Collange, Sylvain; Daumas, Marc; Defour, David

    2008-01-01

    We report here on software that performs line-by-line spectroscopic simulations on gases. Elaborate models (such as narrow band and correlated-K) are accurate and efficient for bands where various components are not simultaneously and significantly active. Line-by-line is probably the most accurate model in the infrared for blends of gases that contain high proportions of H 2O and CO 2 as this was the case for our prototype simulation. Our implementation on graphics processing units sustains a speedup close to 330 on computation-intensive tasks and 12 on memory intensive tasks compared to implementations on one core of high-end processors. This speedup is due to data parallelism, efficient memory access for specific patterns and some dedicated hardware operators only available in graphics processing units. It is obtained leaving most of processor resources available and it would scale linearly with the number of graphics processing units in parallel machines. Line-by-line simulation coupled with simulation of fluid dynamics was long believed to be economically intractable but our work shows that it could be done with some affordable additional resources compared to what is necessary to perform simulations on fluid dynamics alone. Program summaryProgram title: GPU4RE Catalogue identifier: ADZY_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/ADZY_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 62 776 No. of bytes in distributed program, including test data, etc.: 1 513 247 Distribution format: tar.gz Programming language: C++ Computer: x86 PC Operating system: Linux, Microsoft Windows. Compilation requires either gcc/g++ under Linux or Visual C++ 2003/2005 and Cygwin under Windows. It has been tested using gcc 4.1.2 under Ubuntu Linux 7.04 and using Visual C++ 2005 with Cygwin 1.5.24 under Windows XP. RAM: 1 gigabyte Classification: 21.2 External routines: OpenGL ( http://www.opengl.org) Nature of problem: Simulating radiative transfer on high-temperature high-pressure gases. Solution method: Line-by-line Monte-Carlo ray-tracing. Unusual features: Parallel computations are moved to the GPU. Additional comments: nVidia GeForce 7000 or ATI Radeon X1000 series graphics processing unit is required. Running time: A few minutes.

  7. A portable platform for accelerated PIC codes and its application to GPUs using OpenACC

    NASA Astrophysics Data System (ADS)

    Hariri, F.; Tran, T. M.; Jocksch, A.; Lanti, E.; Progsch, J.; Messmer, P.; Brunner, S.; Gheller, C.; Villard, L.

    2016-10-01

    We present a portable platform, called PIC_ENGINE, for accelerating Particle-In-Cell (PIC) codes on heterogeneous many-core architectures such as Graphic Processing Units (GPUs). The aim of this development is efficient simulations on future exascale systems by allowing different parallelization strategies depending on the application problem and the specific architecture. To this end, this platform contains the basic steps of the PIC algorithm and has been designed as a test bed for different algorithmic options and data structures. Among the architectures that this engine can explore, particular attention is given here to systems equipped with GPUs. The study demonstrates that our portable PIC implementation based on the OpenACC programming model can achieve performance closely matching theoretical predictions. Using the Cray XC30 system, Piz Daint, at the Swiss National Supercomputing Centre (CSCS), we show that PIC_ENGINE running on an NVIDIA Kepler K20X GPU can outperform the one on an Intel Sandy bridge 8-core CPU by a factor of 3.4.

  8. Accelerating a three-dimensional eco-hydrological cellular automaton on GPGPU with OpenCL

    NASA Astrophysics Data System (ADS)

    Senatore, Alfonso; D'Ambrosio, Donato; De Rango, Alessio; Rongo, Rocco; Spataro, William; Straface, Salvatore; Mendicino, Giuseppe

    2016-10-01

    This work presents an effective implementation of a numerical model for complete eco-hydrological Cellular Automata modeling on Graphical Processing Units (GPU) with OpenCL (Open Computing Language) for heterogeneous computation (i.e., on CPUs and/or GPUs). Different types of parallel implementations were carried out (e.g., use of fast local memory, loop unrolling, etc), showing increasing performance improvements in terms of speedup, adopting also some original optimizations strategies. Moreover, numerical analysis of results (i.e., comparison of CPU and GPU outcomes in terms of rounding errors) have proven to be satisfactory. Experiments were carried out on a workstation with two CPUs (Intel Xeon E5440 at 2.83GHz), one GPU AMD R9 280X and one GPU nVIDIA Tesla K20c. Results have been extremely positive, but further testing should be performed to assess the functionality of the adopted strategies on other complete models and their ability to fruitfully exploit parallel systems resources.

  9. Clinical implementation of a GPU-based simplified Monte Carlo method for a treatment planning system of proton beam therapy.

    PubMed

    Kohno, R; Hotta, K; Nishioka, S; Matsubara, K; Tansho, R; Suzuki, T

    2011-11-21

    We implemented the simplified Monte Carlo (SMC) method on graphics processing unit (GPU) architecture under the computer-unified device architecture platform developed by NVIDIA. The GPU-based SMC was clinically applied for four patients with head and neck, lung, or prostate cancer. The results were compared to those obtained by a traditional CPU-based SMC with respect to the computation time and discrepancy. In the CPU- and GPU-based SMC calculations, the estimated mean statistical errors of the calculated doses in the planning target volume region were within 0.5% rms. The dose distributions calculated by the GPU- and CPU-based SMCs were similar, within statistical errors. The GPU-based SMC showed 12.30-16.00 times faster performance than the CPU-based SMC. The computation time per beam arrangement using the GPU-based SMC for the clinical cases ranged 9-67 s. The results demonstrate the successful application of the GPU-based SMC to a clinical proton treatment planning.

  10. MAGI: a Node.js web service for fast microRNA-Seq analysis in a GPU infrastructure.

    PubMed

    Kim, Jihoon; Levy, Eric; Ferbrache, Alex; Stepanowsky, Petra; Farcas, Claudiu; Wang, Shuang; Brunner, Stefan; Bath, Tyler; Wu, Yuan; Ohno-Machado, Lucila

    2014-10-01

    MAGI is a web service for fast MicroRNA-Seq data analysis in a graphics processing unit (GPU) infrastructure. Using just a browser, users have access to results as web reports in just a few hours->600% end-to-end performance improvement over state of the art. MAGI's salient features are (i) transfer of large input files in native FASTA with Qualities (FASTQ) format through drag-and-drop operations, (ii) rapid prediction of microRNA target genes leveraging parallel computing with GPU devices, (iii) all-in-one analytics with novel feature extraction, statistical test for differential expression and diagnostic plot generation for quality control and (iv) interactive visualization and exploration of results in web reports that are readily available for publication. MAGI relies on the Node.js JavaScript framework, along with NVIDIA CUDA C, PHP: Hypertext Preprocessor (PHP), Perl and R. It is freely available at http://magi.ucsd.edu. © The Author 2014. Published by Oxford University Press.

  11. GPU-accelerated phase-field simulation of dendritic solidification in a binary alloy

    NASA Astrophysics Data System (ADS)

    Yamanaka, Akinori; Aoki, Takayuki; Ogawa, Satoi; Takaki, Tomohiro

    2011-03-01

    The phase-field simulation for dendritic solidification of a binary alloy has been accelerated by using a graphic processing unit (GPU). To perform the phase-field simulation of the alloy solidification on GPU, a program code was developed with computer unified device architecture (CUDA). In this paper, the implementation technique of the phase-field model on GPU is presented. Also, we evaluated the acceleration performance of the three-dimensional solidification simulation by using a single NVIDIA TESLA C1060 GPU and the developed program code. The results showed that the GPU calculation for 5763 computational grids achieved the performance of 170 GFLOPS by utilizing the shared memory as a software-managed cache. Furthermore, it can be demonstrated that the computation with the GPU is 100 times faster than that with a single CPU core. From the obtained results, we confirmed the feasibility of realizing a real-time full three-dimensional phase-field simulation of microstructure evolution on a personal desktop computer.

  12. Algorithms for GPU-based molecular dynamics simulations of complex fluids: Applications to water, mixtures, and liquid crystals.

    PubMed

    Kazachenko, Sergey; Giovinazzo, Mark; Hall, Kyle Wm; Cann, Natalie M

    2015-09-15

    A custom code for molecular dynamics simulations has been designed to run on CUDA-enabled NVIDIA graphics processing units (GPUs). The double-precision code simulates multicomponent fluids, with intramolecular and intermolecular forces, coarse-grained and atomistic models, holonomic constraints, Nosé-Hoover thermostats, and the generation of distribution functions. Algorithms to compute Lennard-Jones and Gay-Berne interactions, and the electrostatic force using Ewald summations, are discussed. A neighbor list is introduced to improve scaling with respect to system size. Three test systems are examined: SPC/E water; an n-hexane/2-propanol mixture; and a liquid crystal mesogen, 2-(4-butyloxyphenyl)-5-octyloxypyrimidine. Code performance is analyzed for each system. With one GPU, a 33-119 fold increase in performance is achieved compared with the serial code while the use of two GPUs leads to a 69-287 fold improvement and three GPUs yield a 101-377 fold speedup. © 2015 Wiley Periodicals, Inc.

  13. A versatile model for soft patchy particles with various patch arrangements.

    PubMed

    Li, Zhan-Wei; Zhu, You-Liang; Lu, Zhong-Yuan; Sun, Zhao-Yan

    2016-01-21

    We propose a simple and general mesoscale soft patchy particle model, which can felicitously describe the deformable and surface-anisotropic characteristics of soft patchy particles. This model can be used in dynamics simulations to investigate the aggregation behavior and mechanism of various types of soft patchy particles with tunable number, size, direction, and geometrical arrangement of the patches. To improve the computational efficiency of this mesoscale model in dynamics simulations, we give the simulation algorithm that fits the compute unified device architecture (CUDA) framework of NVIDIA graphics processing units (GPUs). The validation of the model and the performance of the simulations using GPUs are demonstrated by simulating several benchmark systems of soft patchy particles with 1 to 4 patches in a regular geometrical arrangement. Because of its simplicity and computational efficiency, the soft patchy particle model will provide a powerful tool to investigate the aggregation behavior of soft patchy particles, such as patchy micelles, patchy microgels, and patchy dendrimers, over larger spatial and temporal scales.

  14. Design and implementation of a hybrid MPI-CUDA model for the Smith-Waterman algorithm.

    PubMed

    Khaled, Heba; Faheem, Hossam El Deen Mostafa; El Gohary, Rania

    2015-01-01

    This paper provides a novel hybrid model for solving the multiple pair-wise sequence alignment problem combining message passing interface and CUDA, the parallel computing platform and programming model invented by NVIDIA. The proposed model targets homogeneous cluster nodes equipped with similar Graphical Processing Unit (GPU) cards. The model consists of the Master Node Dispatcher (MND) and the Worker GPU Nodes (WGN). The MND distributes the workload among the cluster working nodes and then aggregates the results. The WGN performs the multiple pair-wise sequence alignments using the Smith-Waterman algorithm. We also propose a modified implementation to the Smith-Waterman algorithm based on computing the alignment matrices row-wise. The experimental results demonstrate a considerable reduction in the running time by increasing the number of the working GPU nodes. The proposed model achieved a performance of about 12 Giga cell updates per second when we tested against the SWISS-PROT protein knowledge base running on four nodes.

  15. Enhanced Graphics for Extended Scale Range

    NASA Technical Reports Server (NTRS)

    Hanson, Andrew J.; Chi-Wing Fu, Philip

    2012-01-01

    Enhanced Graphics for Extended Scale Range is a computer program for rendering fly-through views of scene models that include visible objects differing in size by large orders of magnitude. An example would be a scene showing a person in a park at night with the moon, stars, and galaxies in the background sky. Prior graphical computer programs exhibit arithmetic and other anomalies when rendering scenes containing objects that differ enormously in scale and distance from the viewer. The present program dynamically repartitions distance scales of objects in a scene during rendering to eliminate almost all such anomalies in a way compatible with implementation in other software and in hardware accelerators. By assigning depth ranges correspond ing to rendering precision requirements, either automatically or under program control, this program spaces out object scales to match the precision requirements of the rendering arithmetic. This action includes an intelligent partition of the depth buffer ranges to avoid known anomalies from this source. The program is written in C++, using OpenGL, GLUT, and GLUI standard libraries, and nVidia GEForce Vertex Shader extensions. The program has been shown to work on several computers running UNIX and Windows operating systems.

  16. A GPU OpenCL based cross-platform Monte Carlo dose calculation engine (goMC).

    PubMed

    Tian, Zhen; Shi, Feng; Folkerts, Michael; Qin, Nan; Jiang, Steve B; Jia, Xun

    2015-10-07

    Monte Carlo (MC) simulation has been recognized as the most accurate dose calculation method for radiotherapy. However, the extremely long computation time impedes its clinical application. Recently, a lot of effort has been made to realize fast MC dose calculation on graphic processing units (GPUs). However, most of the GPU-based MC dose engines have been developed under NVidia's CUDA environment. This limits the code portability to other platforms, hindering the introduction of GPU-based MC simulations to clinical practice. The objective of this paper is to develop a GPU OpenCL based cross-platform MC dose engine named goMC with coupled photon-electron simulation for external photon and electron radiotherapy in the MeV energy range. Compared to our previously developed GPU-based MC code named gDPM (Jia et al 2012 Phys. Med. Biol. 57 7783-97), goMC has two major differences. First, it was developed under the OpenCL environment for high code portability and hence could be run not only on different GPU cards but also on CPU platforms. Second, we adopted the electron transport model used in EGSnrc MC package and PENELOPE's random hinge method in our new dose engine, instead of the dose planning method employed in gDPM. Dose distributions were calculated for a 15 MeV electron beam and a 6 MV photon beam in a homogenous water phantom, a water-bone-lung-water slab phantom and a half-slab phantom. Satisfactory agreement between the two MC dose engines goMC and gDPM was observed in all cases. The average dose differences in the regions that received a dose higher than 10% of the maximum dose were 0.48-0.53% for the electron beam cases and 0.15-0.17% for the photon beam cases. In terms of efficiency, goMC was ~4-16% slower than gDPM when running on the same NVidia TITAN card for all the cases we tested, due to both the different electron transport models and the different development environments. The code portability of our new dose engine goMC was validated by successfully running it on a variety of different computing devices including an NVidia GPU card, two AMD GPU cards and an Intel CPU processor. Computational efficiency among these platforms was compared.

  17. Supporting Real-Time Computer Vision Workloads using OpenVX on Multicore+GPU Platforms

    DTIC Science & Technology

    2015-05-01

    a registered trademark of the NVIDIA Corporation . Report Documentation Page Form ApprovedOMB No. 0704-0188 Public reporting burden for the collection...from NVIDIA , we adapted an alpha- version of an NVIDIA OpenVX implementation called VisionWorks® [3] to run atop PGMRT (a graph-based mid- dleware...time support to an OpenVX implementation by NVIDIA called VisionWorks. Our modifications were applied to an alpha-version of VisionWorks. This alpha

  18. Improved preconditioned conjugate gradient algorithm and application in 3D inversion of gravity-gradiometry data

    NASA Astrophysics Data System (ADS)

    Wang, Tai-Han; Huang, Da-Nian; Ma, Guo-Qing; Meng, Zhao-Hai; Li, Ye

    2017-06-01

    With the continuous development of full tensor gradiometer (FTG) measurement techniques, three-dimensional (3D) inversion of FTG data is becoming increasingly used in oil and gas exploration. In the fast processing and interpretation of large-scale high-precision data, the use of the graphics processing unit process unit (GPU) and preconditioning methods are very important in the data inversion. In this paper, an improved preconditioned conjugate gradient algorithm is proposed by combining the symmetric successive over-relaxation (SSOR) technique and the incomplete Choleksy decomposition conjugate gradient algorithm (ICCG). Since preparing the preconditioner requires extra time, a parallel implement based on GPU is proposed. The improved method is then applied in the inversion of noisecontaminated synthetic data to prove its adaptability in the inversion of 3D FTG data. Results show that the parallel SSOR-ICCG algorithm based on NVIDIA Tesla C2050 GPU achieves a speedup of approximately 25 times that of a serial program using a 2.0 GHz Central Processing Unit (CPU). Real airborne gravity-gradiometry data from Vinton salt dome (southwest Louisiana, USA) are also considered. Good results are obtained, which verifies the efficiency and feasibility of the proposed parallel method in fast inversion of 3D FTG data.

  19. Unusual Domain Structure and Filamentary Superfluidity for 2D Hard-Core Bosons in Insulating Charge-Ordered Phase

    NASA Astrophysics Data System (ADS)

    Panov, Yu. D.; Moskvin, A. S.; Rybakov, F. N.; Borisov, A. B.

    2016-12-01

    We made use of a special algorithm for compute unified device architecture for NVIDIA graphics cards, a nonlinear conjugate-gradient method to minimize energy functional, and Monte-Carlo technique to directly observe the forming of the ground state configuration for the 2D hard-core bosons by lowering the temperature and its evolution with deviation away from half-filling. The novel technique allowed us to examine earlier implications and uncover novel features of the phase transitions, in particular, look upon the nucleation of the odd domain structure, emergence of filamentary superfluidity nucleated at the antiphase domain walls of the charge-ordered phase, and nucleation and evolution of different topological structures.

  20. Spectral-element simulation of two-dimensional elastic wave propagation in fully heterogeneous media on a GPU cluster

    NASA Astrophysics Data System (ADS)

    Rudianto, Indra; Sudarmaji

    2018-04-01

    We present an implementation of the spectral-element method for simulation of two-dimensional elastic wave propagation in fully heterogeneous media. We have incorporated most of realistic geological features in the model, including surface topography, curved layer interfaces, and 2-D wave-speed heterogeneity. To accommodate such complexity, we use an unstructured quadrilateral meshing technique. Simulation was performed on a GPU cluster, which consists of 24 core processors Intel Xeon CPU and 4 NVIDIA Quadro graphics cards using CUDA and MPI implementation. We speed up the computation by a factor of about 5 compared to MPI only, and by a factor of about 40 compared to Serial implementation.

  1. Fast analysis of molecular dynamics trajectories with graphics processing units-Radial distribution function histogramming

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Levine, Benjamin G., E-mail: ben.levine@temple.ed; Stone, John E., E-mail: johns@ks.uiuc.ed; Kohlmeyer, Axel, E-mail: akohlmey@temple.ed

    2011-05-01

    The calculation of radial distribution functions (RDFs) from molecular dynamics trajectory data is a common and computationally expensive analysis task. The rate limiting step in the calculation of the RDF is building a histogram of the distance between atom pairs in each trajectory frame. Here we present an implementation of this histogramming scheme for multiple graphics processing units (GPUs). The algorithm features a tiling scheme to maximize the reuse of data at the fastest levels of the GPU's memory hierarchy and dynamic load balancing to allow high performance on heterogeneous configurations of GPUs. Several versions of the RDF algorithm aremore » presented, utilizing the specific hardware features found on different generations of GPUs. We take advantage of larger shared memory and atomic memory operations available on state-of-the-art GPUs to accelerate the code significantly. The use of atomic memory operations allows the fast, limited-capacity on-chip memory to be used much more efficiently, resulting in a fivefold increase in performance compared to the version of the algorithm without atomic operations. The ultimate version of the algorithm running in parallel on four NVIDIA GeForce GTX 480 (Fermi) GPUs was found to be 92 times faster than a multithreaded implementation running on an Intel Xeon 5550 CPU. On this multi-GPU hardware, the RDF between two selections of 1,000,000 atoms each can be calculated in 26.9 s per frame. The multi-GPU RDF algorithms described here are implemented in VMD, a widely used and freely available software package for molecular dynamics visualization and analysis.« less

  2. Fast Analysis of Molecular Dynamics Trajectories with Graphics Processing Units—Radial Distribution Function Histogramming

    PubMed Central

    Stone, John E.; Kohlmeyer, Axel

    2011-01-01

    The calculation of radial distribution functions (RDFs) from molecular dynamics trajectory data is a common and computationally expensive analysis task. The rate limiting step in the calculation of the RDF is building a histogram of the distance between atom pairs in each trajectory frame. Here we present an implementation of this histogramming scheme for multiple graphics processing units (GPUs). The algorithm features a tiling scheme to maximize the reuse of data at the fastest levels of the GPU’s memory hierarchy and dynamic load balancing to allow high performance on heterogeneous configurations of GPUs. Several versions of the RDF algorithm are presented, utilizing the specific hardware features found on different generations of GPUs. We take advantage of larger shared memory and atomic memory operations available on state-of-the-art GPUs to accelerate the code significantly. The use of atomic memory operations allows the fast, limited-capacity on-chip memory to be used much more efficiently, resulting in a fivefold increase in performance compared to the version of the algorithm without atomic operations. The ultimate version of the algorithm running in parallel on four NVIDIA GeForce GTX 480 (Fermi) GPUs was found to be 92 times faster than a multithreaded implementation running on an Intel Xeon 5550 CPU. On this multi-GPU hardware, the RDF between two selections of 1,000,000 atoms each can be calculated in 26.9 seconds per frame. The multi-GPU RDF algorithms described here are implemented in VMD, a widely used and freely available software package for molecular dynamics visualization and analysis. PMID:21547007

  3. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Beltran, C; Kamal, H

    Purpose: To provide a multicriteria optimization algorithm for intensity modulated radiation therapy using pencil proton beam scanning. Methods: Intensity modulated radiation therapy using pencil proton beam scanning requires efficient optimization algorithms to overcome the uncertainties in the Bragg peaks locations. This work is focused on optimization algorithms that are based on Monte Carlo simulation of the treatment planning and use the weights and the dose volume histogram (DVH) control points to steer toward desired plans. The proton beam treatment planning process based on single objective optimization (representing a weighted sum of multiple objectives) usually leads to time-consuming iterations involving treatmentmore » planning team members. We proved a time efficient multicriteria optimization algorithm that is developed to run on NVIDIA GPU (Graphical Processing Units) cluster. The multicriteria optimization algorithm running time benefits from up-sampling of the CT voxel size of the calculations without loss of fidelity. Results: We will present preliminary results of Multicriteria optimization for intensity modulated proton therapy based on DVH control points. The results will show optimization results of a phantom case and a brain tumor case. Conclusion: The multicriteria optimization of the intensity modulated radiation therapy using pencil proton beam scanning provides a novel tool for treatment planning. Work support by a grant from Varian Inc.« less

  4. Efficient parallel implementation of active appearance model fitting algorithm on GPU.

    PubMed

    Wang, Jinwei; Ma, Xirong; Zhu, Yuanping; Sun, Jizhou

    2014-01-01

    The active appearance model (AAM) is one of the most powerful model-based object detecting and tracking methods which has been widely used in various situations. However, the high-dimensional texture representation causes very time-consuming computations, which makes the AAM difficult to apply to real-time systems. The emergence of modern graphics processing units (GPUs) that feature a many-core, fine-grained parallel architecture provides new and promising solutions to overcome the computational challenge. In this paper, we propose an efficient parallel implementation of the AAM fitting algorithm on GPUs. Our design idea is fine grain parallelism in which we distribute the texture data of the AAM, in pixels, to thousands of parallel GPU threads for processing, which makes the algorithm fit better into the GPU architecture. We implement our algorithm using the compute unified device architecture (CUDA) on the Nvidia's GTX 650 GPU, which has the latest Kepler architecture. To compare the performance of our algorithm with different data sizes, we built sixteen face AAM models of different dimensional textures. The experiment results show that our parallel AAM fitting algorithm can achieve real-time performance for videos even on very high-dimensional textures.

  5. Efficient Parallel Implementation of Active Appearance Model Fitting Algorithm on GPU

    PubMed Central

    Wang, Jinwei; Ma, Xirong; Zhu, Yuanping; Sun, Jizhou

    2014-01-01

    The active appearance model (AAM) is one of the most powerful model-based object detecting and tracking methods which has been widely used in various situations. However, the high-dimensional texture representation causes very time-consuming computations, which makes the AAM difficult to apply to real-time systems. The emergence of modern graphics processing units (GPUs) that feature a many-core, fine-grained parallel architecture provides new and promising solutions to overcome the computational challenge. In this paper, we propose an efficient parallel implementation of the AAM fitting algorithm on GPUs. Our design idea is fine grain parallelism in which we distribute the texture data of the AAM, in pixels, to thousands of parallel GPU threads for processing, which makes the algorithm fit better into the GPU architecture. We implement our algorithm using the compute unified device architecture (CUDA) on the Nvidia's GTX 650 GPU, which has the latest Kepler architecture. To compare the performance of our algorithm with different data sizes, we built sixteen face AAM models of different dimensional textures. The experiment results show that our parallel AAM fitting algorithm can achieve real-time performance for videos even on very high-dimensional textures. PMID:24723812

  6. A Distributed GPU-Based Framework for Real-Time 3D Volume Rendering of Large Astronomical Data Cubes

    NASA Astrophysics Data System (ADS)

    Hassan, A. H.; Fluke, C. J.; Barnes, D. G.

    2012-05-01

    We present a framework to volume-render three-dimensional data cubes interactively using distributed ray-casting and volume-bricking over a cluster of workstations powered by one or more graphics processing units (GPUs) and a multi-core central processing unit (CPU). The main design target for this framework is to provide an in-core visualization solution able to provide three-dimensional interactive views of terabyte-sized data cubes. We tested the presented framework using a computing cluster comprising 64 nodes with a total of 128GPUs. The framework proved to be scalable to render a 204GB data cube with an average of 30 frames per second. Our performance analyses also compare the use of NVIDIA Tesla 1060 and 2050GPU architectures and the effect of increasing the visualization output resolution on the rendering performance. Although our initial focus, as shown in the examples presented in this work, is volume rendering of spectral data cubes from radio astronomy, we contend that our approach has applicability to other disciplines where close to real-time volume rendering of terabyte-order three-dimensional data sets is a requirement.

  7. Parallel hyperspectral compressive sensing method on GPU

    NASA Astrophysics Data System (ADS)

    Bernabé, Sergio; Martín, Gabriel; Nascimento, José M. P.

    2015-10-01

    Remote hyperspectral sensors collect large amounts of data per flight usually with low spatial resolution. It is known that the bandwidth connection between the satellite/airborne platform and the ground station is reduced, thus a compression onboard method is desirable to reduce the amount of data to be transmitted. This paper presents a parallel implementation of an compressive sensing method, called parallel hyperspectral coded aperture (P-HYCA), for graphics processing units (GPU) using the compute unified device architecture (CUDA). This method takes into account two main properties of hyperspectral dataset, namely the high correlation existing among the spectral bands and the generally low number of endmembers needed to explain the data, which largely reduces the number of measurements necessary to correctly reconstruct the original data. Experimental results conducted using synthetic and real hyperspectral datasets on two different GPU architectures by NVIDIA: GeForce GTX 590 and GeForce GTX TITAN, reveal that the use of GPUs can provide real-time compressive sensing performance. The achieved speedup is up to 20 times when compared with the processing time of HYCA running on one core of the Intel i7-2600 CPU (3.4GHz), with 16 Gbyte memory.

  8. Parallel hyperspectral image reconstruction using random projections

    NASA Astrophysics Data System (ADS)

    Sevilla, Jorge; Martín, Gabriel; Nascimento, José M. P.

    2016-10-01

    Spaceborne sensors systems are characterized by scarce onboard computing and storage resources and by communication links with reduced bandwidth. Random projections techniques have been demonstrated as an effective and very light way to reduce the number of measurements in hyperspectral data, thus, the data to be transmitted to the Earth station is reduced. However, the reconstruction of the original data from the random projections may be computationally expensive. SpeCA is a blind hyperspectral reconstruction technique that exploits the fact that hyperspectral vectors often belong to a low dimensional subspace. SpeCA has shown promising results in the task of recovering hyperspectral data from a reduced number of random measurements. In this manuscript we focus on the implementation of the SpeCA algorithm for graphics processing units (GPU) using the compute unified device architecture (CUDA). Experimental results conducted using synthetic and real hyperspectral datasets on the GPU architecture by NVIDIA: GeForce GTX 980, reveal that the use of GPUs can provide real-time reconstruction. The achieved speedup is up to 22 times when compared with the processing time of SpeCA running on one core of the Intel i7-4790K CPU (3.4GHz), with 32 Gbyte memory.

  9. Real-time colour hologram generation based on ray-sampling plane with multi-GPU acceleration.

    PubMed

    Sato, Hirochika; Kakue, Takashi; Ichihashi, Yasuyuki; Endo, Yutaka; Wakunami, Koki; Oi, Ryutaro; Yamamoto, Kenji; Nakayama, Hirotaka; Shimobaba, Tomoyoshi; Ito, Tomoyoshi

    2018-01-24

    Although electro-holography can reconstruct three-dimensional (3D) motion pictures, its computational cost is too heavy to allow for real-time reconstruction of 3D motion pictures. This study explores accelerating colour hologram generation using light-ray information on a ray-sampling (RS) plane with a graphics processing unit (GPU) to realise a real-time holographic display system. We refer to an image corresponding to light-ray information as an RS image. Colour holograms were generated from three RS images with resolutions of 2,048 × 2,048; 3,072 × 3,072 and 4,096 × 4,096 pixels. The computational results indicate that the generation of the colour holograms using multiple GPUs (NVIDIA Geforce GTX 1080) was approximately 300-500 times faster than those generated using a central processing unit. In addition, the results demonstrate that 3D motion pictures were successfully reconstructed from RS images of 3,072 × 3,072 pixels at approximately 15 frames per second using an electro-holographic reconstruction system in which colour holograms were generated from RS images in real time.

  10. Big Data GPU-Driven Parallel Processing Spatial and Spatio-Temporal Clustering Algorithms

    NASA Astrophysics Data System (ADS)

    Konstantaras, Antonios; Skounakis, Emmanouil; Kilty, James-Alexander; Frantzeskakis, Theofanis; Maravelakis, Emmanuel

    2016-04-01

    Advances in graphics processing units' technology towards encompassing parallel architectures [1], comprised of thousands of cores and multiples of parallel threads, provide the foundation in terms of hardware for the rapid processing of various parallel applications regarding seismic big data analysis. Seismic data are normally stored as collections of vectors in massive matrices, growing rapidly in size as wider areas are covered, denser recording networks are being established and decades of data are being compiled together [2]. Yet, many processes regarding seismic data analysis are performed on each seismic event independently or as distinct tiles [3] of specific grouped seismic events within a much larger data set. Such processes, independent of one another can be performed in parallel narrowing down processing times drastically [1,3]. This research work presents the development and implementation of three parallel processing algorithms using Cuda C [4] for the investigation of potentially distinct seismic regions [5,6] present in the vicinity of the southern Hellenic seismic arc. The algorithms, programmed and executed in parallel comparatively, are the: fuzzy k-means clustering with expert knowledge [7] in assigning overall clusters' number; density-based clustering [8]; and a selves-developed spatio-temporal clustering algorithm encompassing expert [9] and empirical knowledge [10] for the specific area under investigation. Indexing terms: GPU parallel programming, Cuda C, heterogeneous processing, distinct seismic regions, parallel clustering algorithms, spatio-temporal clustering References [1] Kirk, D. and Hwu, W.: 'Programming massively parallel processors - A hands-on approach', 2nd Edition, Morgan Kaufman Publisher, 2013 [2] Konstantaras, A., Valianatos, F., Varley, M.R. and Makris, J.P.: 'Soft-Computing Modelling of Seismicity in the Southern Hellenic Arc', Geoscience and Remote Sensing Letters, vol. 5 (3), pp. 323-327, 2008 [3] Papadakis, S. and Diamantaras, K.: 'Programming and architecture of parallel processing systems', 1st Edition, Eds. Kleidarithmos, 2011 [4] NVIDIA.: 'NVidia CUDA C Programming Guide', version 5.0, NVidia (reference book) [5] Konstantaras, A.: 'Classification of Distinct Seismic Regions and Regional Temporal Modelling of Seismicity in the Vicinity of the Hellenic Seismic Arc', IEEE Selected Topics in Applied Earth Observations and Remote Sensing, vol. 6 (4), pp. 1857-1863, 2013 [6] Konstantaras, A. Varley, M.R.,. Valianatos, F., Collins, G. and Holifield, P.: 'Recognition of electric earthquake precursors using neuro-fuzzy models: methodology and simulation results', Proc. IASTED International Conference on Signal Processing Pattern Recognition and Applications (SPPRA 2002), Crete, Greece, 2002, pp 303-308, 2002 [7] Konstantaras, A., Katsifarakis, E., Maravelakis, E., Skounakis, E., Kokkinos, E. and Karapidakis, E.: 'Intelligent Spatial-Clustering of Seismicity in the Vicinity of the Hellenic Seismic Arc', Earth Science Research, vol. 1 (2), pp. 1-10, 2012 [8] Georgoulas, G., Konstantaras, A., Katsifarakis, E., Stylios, C.D., Maravelakis, E. and Vachtsevanos, G.: '"Seismic-Mass" Density-based Algorithm for Spatio-Temporal Clustering', Expert Systems with Applications, vol. 40 (10), pp. 4183-4189, 2013 [9] Konstantaras, A. J.: 'Expert knowledge-based algorithm for the dynamic discrimination of interactive natural clusters', Earth Science Informatics, 2015 (In Press, see: www.scopus.com) [10] Drakatos, G. and Latoussakis, J.: 'A catalog of aftershock sequences in Greece (1971-1997): Their spatial and temporal characteristics', Journal of Seismology, vol. 5, pp. 137-145, 2001

  11. Accelerated speckle imaging with the ATST visible broadband imager

    NASA Astrophysics Data System (ADS)

    Wöger, Friedrich; Ferayorni, Andrew

    2012-09-01

    The Advanced Technology Solar Telescope (ATST), a 4 meter class telescope for observations of the solar atmosphere currently in construction phase, will generate data at rates of the order of 10 TB/day with its state of the art instrumentation. The high-priority ATST Visible Broadband Imager (VBI) instrument alone will create two data streams with a bandwidth of 960 MB/s each. Because of the related data handling issues, these data will be post-processed with speckle interferometry algorithms in near-real time at the telescope using the cost-effective Graphics Processing Unit (GPU) technology that is supported by the ATST Data Handling System. In this contribution, we lay out the VBI-specific approach to its image processing pipeline, put this into the context of the underlying ATST Data Handling System infrastructure, and finally describe the details of how the algorithms were redesigned to exploit data parallelism in the speckle image reconstruction algorithms. An algorithm re-design is often required to efficiently speed up an application using GPU technology; we have chosen NVIDIA's CUDA language as basis for our implementation. We present our preliminary results of the algorithm performance using our test facilities, and base a conservative estimate on the requirements of a full system that could achieve near real-time performance at ATST on these results.

  12. High definition live 3D-OCT in vivo: design and evaluation of a 4D OCT engine with 1 GVoxel/s.

    PubMed

    Wieser, Wolfgang; Draxinger, Wolfgang; Klein, Thomas; Karpf, Sebastian; Pfeiffer, Tom; Huber, Robert

    2014-09-01

    We present a 1300 nm OCT system for volumetric real-time live OCT acquisition and visualization at 1 billion volume elements per second. All technological challenges and problems associated with such high scanning speed are discussed in detail as well as the solutions. In one configuration, the system acquires, processes and visualizes 26 volumes per second where each volume consists of 320 x 320 depth scans and each depth scan has 400 usable pixels. This is the fastest real-time OCT to date in terms of voxel rate. A 51 Hz volume rate is realized with half the frame number. In both configurations the speed can be sustained indefinitely. The OCT system uses a 1310 nm Fourier domain mode locked (FDML) laser operated at 3.2 MHz sweep rate. Data acquisition is performed with two dedicated digitizer cards, each running at 2.5 GS/s, hosted in a single desktop computer. Live real-time data processing and visualization are realized with custom developed software on an NVidia GTX 690 dual graphics processing unit (GPU) card. To evaluate potential future applications of such a system, we present volumetric videos captured at 26 and 51 Hz of planktonic crustaceans and skin.

  13. High definition live 3D-OCT in vivo: design and evaluation of a 4D OCT engine with 1 GVoxel/s

    PubMed Central

    Wieser, Wolfgang; Draxinger, Wolfgang; Klein, Thomas; Karpf, Sebastian; Pfeiffer, Tom; Huber, Robert

    2014-01-01

    We present a 1300 nm OCT system for volumetric real-time live OCT acquisition and visualization at 1 billion volume elements per second. All technological challenges and problems associated with such high scanning speed are discussed in detail as well as the solutions. In one configuration, the system acquires, processes and visualizes 26 volumes per second where each volume consists of 320 x 320 depth scans and each depth scan has 400 usable pixels. This is the fastest real-time OCT to date in terms of voxel rate. A 51 Hz volume rate is realized with half the frame number. In both configurations the speed can be sustained indefinitely. The OCT system uses a 1310 nm Fourier domain mode locked (FDML) laser operated at 3.2 MHz sweep rate. Data acquisition is performed with two dedicated digitizer cards, each running at 2.5 GS/s, hosted in a single desktop computer. Live real-time data processing and visualization are realized with custom developed software on an NVidia GTX 690 dual graphics processing unit (GPU) card. To evaluate potential future applications of such a system, we present volumetric videos captured at 26 and 51 Hz of planktonic crustaceans and skin. PMID:25401010

  14. Real time mitigation of atmospheric turbulence in long distance imaging using the lucky region fusion algorithm with FPGA and GPU hardware acceleration

    NASA Astrophysics Data System (ADS)

    Jackson, Christopher Robert

    "Lucky-region" fusion (LRF) is a synthetic imaging technique that has proven successful in enhancing the quality of images distorted by atmospheric turbulence. The LRF algorithm selects sharp regions of an image obtained from a series of short exposure frames, and fuses the sharp regions into a final, improved image. In previous research, the LRF algorithm had been implemented on a PC using the C programming language. However, the PC did not have sufficient sequential processing power to handle real-time extraction, processing and reduction required when the LRF algorithm was applied to real-time video from fast, high-resolution image sensors. This thesis describes two hardware implementations of the LRF algorithm to achieve real-time image processing. The first was created with a VIRTEX-7 field programmable gate array (FPGA). The other developed using the graphics processing unit (GPU) of a NVIDIA GeForce GTX 690 video card. The novelty in the FPGA approach is the creation of a "black box" LRF video processing system with a general camera link input, a user controller interface, and a camera link video output. We also describe a custom hardware simulation environment we have built to test the FPGA LRF implementation. The advantage of the GPU approach is significantly improved development time, integration of image stabilization into the system, and comparable atmospheric turbulence mitigation.

  15. gCUP: rapid GPU-based HIV-1 co-receptor usage prediction for next-generation sequencing.

    PubMed

    Olejnik, Michael; Steuwer, Michel; Gorlatch, Sergei; Heider, Dominik

    2014-11-15

    Next-generation sequencing (NGS) has a large potential in HIV diagnostics, and genotypic prediction models have been developed and successfully tested in the recent years. However, albeit being highly accurate, these computational models lack computational efficiency to reach their full potential. In this study, we demonstrate the use of graphics processing units (GPUs) in combination with a computational prediction model for HIV tropism. Our new model named gCUP, parallelized and optimized for GPU, is highly accurate and can classify >175 000 sequences per second on an NVIDIA GeForce GTX 460. The computational efficiency of our new model is the next step to enable NGS technologies to reach clinical significance in HIV diagnostics. Moreover, our approach is not limited to HIV tropism prediction, but can also be easily adapted to other settings, e.g. drug resistance prediction. The source code can be downloaded at http://www.heiderlab.de d.heider@wz-straubing.de. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

  16. Large calculation of the flow over a hypersonic vehicle using a GPU

    NASA Astrophysics Data System (ADS)

    Elsen, Erich; LeGresley, Patrick; Darve, Eric

    2008-12-01

    Graphics processing units are capable of impressive computing performance up to 518 Gflops peak performance. Various groups have been using these processors for general purpose computing; most efforts have focussed on demonstrating relatively basic calculations, e.g. numerical linear algebra, or physical simulations for visualization purposes with limited accuracy. This paper describes the simulation of a hypersonic vehicle configuration with detailed geometry and accurate boundary conditions using the compressible Euler equations. To the authors' knowledge, this is the most sophisticated calculation of this kind in terms of complexity of the geometry, the physical model, the numerical methods employed, and the accuracy of the solution. The Navier-Stokes Stanford University Solver (NSSUS) was used for this purpose. NSSUS is a multi-block structured code with a provably stable and accurate numerical discretization which uses a vertex-based finite-difference method. A multi-grid scheme is used to accelerate the solution of the system. Based on a comparison of the Intel Core 2 Duo and NVIDIA 8800GTX, speed-ups of over 40× were demonstrated for simple test geometries and 20× for complex geometries.

  17. Optimizing Likelihood Models for Particle Trajectory Segmentation in Multi-State Systems.

    PubMed

    Young, Dylan Christopher; Scrimgeour, Jan

    2018-06-19

    Particle tracking offers significant insight into the molecular mechanics that govern the behav- ior of living cells. The analysis of molecular trajectories that transition between different motive states, such as diffusive, driven and tethered modes, is of considerable importance, with even single trajectories containing significant amounts of information about a molecule's environment and its interactions with cellular structures. Hidden Markov models (HMM) have been widely adopted to perform the segmentation of such complex tracks. In this paper, we show that extensive analysis of hidden Markov model outputs using data derived from multi-state Brownian dynamics simulations can be used both for the optimization of the likelihood models used to describe the states of the system and for characterization of the technique's failure mechanisms. This analysis was made pos- sible by the implementation of parallelized adaptive direct search algorithm on a Nvidia graphics processing unit. This approach provides critical information for the visualization of HMM failure and successful design of particle tracking experiments where trajectories contain multiple mobile states. © 2018 IOP Publishing Ltd.

  18. Development of an embedded atmospheric turbulence mitigation engine

    NASA Astrophysics Data System (ADS)

    Paolini, Aaron; Bonnett, James; Kozacik, Stephen; Kelmelis, Eric

    2017-05-01

    Methods to reconstruct pictures from imagery degraded by atmospheric turbulence have been under development for decades. The techniques were initially developed for observing astronomical phenomena from the Earth's surface, but have more recently been modified for ground and air surveillance scenarios. Such applications can impose significant constraints on deployment options because they both increase the computational complexity of the algorithms themselves and often dictate a requirement for low size, weight, and power (SWaP) form factors. Consequently, embedded implementations must be developed that can perform the necessary computations on low-SWaP platforms. Fortunately, there is an emerging class of embedded processors driven by the mobile and ubiquitous computing industries. We have leveraged these processors to develop embedded versions of the core atmospheric correction engine found in our ATCOM software. In this paper, we will present our experience adapting our algorithms for embedded systems on a chip (SoCs), namely the NVIDIA Tegra that couples general-purpose ARM cores with their graphics processing unit (GPU) technology and the Xilinx Zynq which pairs similar ARM cores with their field-programmable gate array (FPGA) fabric.

  19. Object tracking mask-based NLUT on GPUs for real-time generation of holographic videos of three-dimensional scenes.

    PubMed

    Kwon, M-W; Kim, S-C; Yoon, S-E; Ho, Y-S; Kim, E-S

    2015-02-09

    A new object tracking mask-based novel-look-up-table (OTM-NLUT) method is proposed and implemented on graphics-processing-units (GPUs) for real-time generation of holographic videos of three-dimensional (3-D) scenes. Since the proposed method is designed to be matched with software and memory structures of the GPU, the number of compute-unified-device-architecture (CUDA) kernel function calls and the computer-generated hologram (CGH) buffer size of the proposed method have been significantly reduced. It therefore results in a great increase of the computational speed of the proposed method and enables real-time generation of CGH patterns of 3-D scenes. Experimental results show that the proposed method can generate 31.1 frames of Fresnel CGH patterns with 1,920 × 1,080 pixels per second, on average, for three test 3-D video scenarios with 12,666 object points on three GPU boards of NVIDIA GTX TITAN, and confirm the feasibility of the proposed method in the practical application of electro-holographic 3-D displays.

  20. GPU-based Green's function simulations of shear waves generated by an applied acoustic radiation force in elastic and viscoelastic models.

    PubMed

    Yang, Yiqun; Urban, Matthew W; McGough, Robert J

    2018-05-15

    Shear wave calculations induced by an acoustic radiation force are very time-consuming on desktop computers, and high-performance graphics processing units (GPUs) achieve dramatic reductions in the computation time for these simulations. The acoustic radiation force is calculated using the fast near field method and the angular spectrum approach, and then the shear waves are calculated in parallel with Green's functions on a GPU. This combination enables rapid evaluation of shear waves for push beams with different spatial samplings and for apertures with different f/#. Relative to shear wave simulations that evaluate the same algorithm on an Intel i7 desktop computer, a high performance nVidia GPU reduces the time required for these calculations by a factor of 45 and 700 when applied to elastic and viscoelastic shear wave simulation models, respectively. These GPU-accelerated simulations also compared to measurements in different viscoelastic phantoms, and the results are similar. For parametric evaluations and for comparisons with measured shear wave data, shear wave simulations with the Green's function approach are ideally suited for high-performance GPUs.

  1. Statistical tools for analysis and modeling of cosmic populations and astronomical time series: CUDAHM and TSE

    NASA Astrophysics Data System (ADS)

    Loredo, Thomas; Budavari, Tamas; Scargle, Jeffrey D.

    2018-01-01

    This presentation provides an overview of open-source software packages addressing two challenging classes of astrostatistics problems. (1) CUDAHM is a C++ framework for hierarchical Bayesian modeling of cosmic populations, leveraging graphics processing units (GPUs) to enable applying this computationally challenging paradigm to large datasets. CUDAHM is motivated by measurement error problems in astronomy, where density estimation and linear and nonlinear regression must be addressed for populations of thousands to millions of objects whose features are measured with possibly complex uncertainties, potentially including selection effects. An example calculation demonstrates accurate GPU-accelerated luminosity function estimation for simulated populations of $10^6$ objects in about two hours using a single NVIDIA Tesla K40c GPU. (2) Time Series Explorer (TSE) is a collection of software in Python and MATLAB for exploratory analysis and statistical modeling of astronomical time series. It comprises a library of stand-alone functions and classes, as well as an application environment for interactive exploration of times series data. The presentation will summarize key capabilities of this emerging project, including new algorithms for analysis of irregularly-sampled time series.

  2. GPUs, a New Tool of Acceleration in CFD: Efficiency and Reliability on Smoothed Particle Hydrodynamics Methods

    PubMed Central

    Crespo, Alejandro C.; Dominguez, Jose M.; Barreiro, Anxo; Gómez-Gesteira, Moncho; Rogers, Benedict D.

    2011-01-01

    Smoothed Particle Hydrodynamics (SPH) is a numerical method commonly used in Computational Fluid Dynamics (CFD) to simulate complex free-surface flows. Simulations with this mesh-free particle method far exceed the capacity of a single processor. In this paper, as part of a dual-functioning code for either central processing units (CPUs) or Graphics Processor Units (GPUs), a parallelisation using GPUs is presented. The GPU parallelisation technique uses the Compute Unified Device Architecture (CUDA) of nVidia devices. Simulations with more than one million particles on a single GPU card exhibit speedups of up to two orders of magnitude over using a single-core CPU. It is demonstrated that the code achieves different speedups with different CUDA-enabled GPUs. The numerical behaviour of the SPH code is validated with a standard benchmark test case of dam break flow impacting on an obstacle where good agreement with the experimental results is observed. Both the achieved speed-ups and the quantitative agreement with experiments suggest that CUDA-based GPU programming can be used in SPH methods with efficiency and reliability. PMID:21695185

  3. Large-scale neural circuit mapping data analysis accelerated with the graphical processing unit (GPU).

    PubMed

    Shi, Yulin; Veidenbaum, Alexander V; Nicolau, Alex; Xu, Xiangmin

    2015-01-15

    Modern neuroscience research demands computing power. Neural circuit mapping studies such as those using laser scanning photostimulation (LSPS) produce large amounts of data and require intensive computation for post hoc processing and analysis. Here we report on the design and implementation of a cost-effective desktop computer system for accelerated experimental data processing with recent GPU computing technology. A new version of Matlab software with GPU enabled functions is used to develop programs that run on Nvidia GPUs to harness their parallel computing power. We evaluated both the central processing unit (CPU) and GPU-enabled computational performance of our system in benchmark testing and practical applications. The experimental results show that the GPU-CPU co-processing of simulated data and actual LSPS experimental data clearly outperformed the multi-core CPU with up to a 22× speedup, depending on computational tasks. Further, we present a comparison of numerical accuracy between GPU and CPU computation to verify the precision of GPU computation. In addition, we show how GPUs can be effectively adapted to improve the performance of commercial image processing software such as Adobe Photoshop. To our best knowledge, this is the first demonstration of GPU application in neural circuit mapping and electrophysiology-based data processing. Together, GPU enabled computation enhances our ability to process large-scale data sets derived from neural circuit mapping studies, allowing for increased processing speeds while retaining data precision. Copyright © 2014 Elsevier B.V. All rights reserved.

  4. Large scale neural circuit mapping data analysis accelerated with the graphical processing unit (GPU)

    PubMed Central

    Shi, Yulin; Veidenbaum, Alexander V.; Nicolau, Alex; Xu, Xiangmin

    2014-01-01

    Background Modern neuroscience research demands computing power. Neural circuit mapping studies such as those using laser scanning photostimulation (LSPS) produce large amounts of data and require intensive computation for post-hoc processing and analysis. New Method Here we report on the design and implementation of a cost-effective desktop computer system for accelerated experimental data processing with recent GPU computing technology. A new version of Matlab software with GPU enabled functions is used to develop programs that run on Nvidia GPUs to harness their parallel computing power. Results We evaluated both the central processing unit (CPU) and GPU-enabled computational performance of our system in benchmark testing and practical applications. The experimental results show that the GPU-CPU co-processing of simulated data and actual LSPS experimental data clearly outperformed the multi-core CPU with up to a 22x speedup, depending on computational tasks. Further, we present a comparison of numerical accuracy between GPU and CPU computation to verify the precision of GPU computation. In addition, we show how GPUs can be effectively adapted to improve the performance of commercial image processing software such as Adobe Photoshop. Comparison with Existing Method(s) To our best knowledge, this is the first demonstration of GPU application in neural circuit mapping and electrophysiology-based data processing. Conclusions Together, GPU enabled computation enhances our ability to process large-scale data sets derived from neural circuit mapping studies, allowing for increased processing speeds while retaining data precision. PMID:25277633

  5. Spatial 3D infrastructure: display-independent software framework, high-speed rendering electronics, and several new displays

    NASA Astrophysics Data System (ADS)

    Chun, Won-Suk; Napoli, Joshua; Cossairt, Oliver S.; Dorval, Rick K.; Hall, Deirdre M.; Purtell, Thomas J., II; Schooler, James F.; Banker, Yigal; Favalora, Gregg E.

    2005-03-01

    We present a software and hardware foundation to enable the rapid adoption of 3-D displays. Different 3-D displays - such as multiplanar, multiview, and electroholographic displays - naturally require different rendering methods. The adoption of these displays in the marketplace will be accelerated by a common software framework. The authors designed the SpatialGL API, a new rendering framework that unifies these display methods under one interface. SpatialGL enables complementary visualization assets to coexist through a uniform infrastructure. Also, SpatialGL supports legacy interfaces such as the OpenGL API. The authors" first implementation of SpatialGL uses multiview and multislice rendering algorithms to exploit the performance of modern graphics processing units (GPUs) to enable real-time visualization of 3-D graphics from medical imaging, oil & gas exploration, and homeland security. At the time of writing, SpatialGL runs on COTS workstations (both Windows and Linux) and on Actuality"s high-performance embedded computational engine that couples an NVIDIA GeForce 6800 Ultra GPU, an AMD Athlon 64 processor, and a proprietary, high-speed, programmable volumetric frame buffer that interfaces to a 1024 x 768 x 3 digital projector. Progress is illustrated using an off-the-shelf multiview display, Actuality"s multiplanar Perspecta Spatial 3D System, and an experimental multiview display. The experimental display is a quasi-holographic view-sequential system that generates aerial imagery measuring 30 mm x 25 mm x 25 mm, providing 198 horizontal views.

  6. Parallel algorithm for solving Kepler’s equation on Graphics Processing Units: Application to analysis of Doppler exoplanet searches

    NASA Astrophysics Data System (ADS)

    Ford, Eric B.

    2009-05-01

    We present the results of a highly parallel Kepler equation solver using the Graphics Processing Unit (GPU) on a commercial nVidia GeForce 280GTX and the "Compute Unified Device Architecture" (CUDA) programming environment. We apply this to evaluate a goodness-of-fit statistic (e.g., χ2) for Doppler observations of stars potentially harboring multiple planetary companions (assuming negligible planet-planet interactions). Given the high-dimensionality of the model parameter space (at least five dimensions per planet), a global search is extremely computationally demanding. We expect that the underlying Kepler solver and model evaluator will be combined with a wide variety of more sophisticated algorithms to provide efficient global search, parameter estimation, model comparison, and adaptive experimental design for radial velocity and/or astrometric planet searches. We tested multiple implementations using single precision, double precision, pairs of single precision, and mixed precision arithmetic. We find that the vast majority of computations can be performed using single precision arithmetic, with selective use of compensated summation for increased precision. However, standard single precision is not adequate for calculating the mean anomaly from the time of observation and orbital period when evaluating the goodness-of-fit for real planetary systems and observational data sets. Using all double precision, our GPU code outperforms a similar code using a modern CPU by a factor of over 60. Using mixed precision, our GPU code provides a speed-up factor of over 600, when evaluating nsys > 1024 models planetary systems each containing npl = 4 planets and assuming nobs = 256 observations of each system. We conclude that modern GPUs also offer a powerful tool for repeatedly evaluating Kepler's equation and a goodness-of-fit statistic for orbital models when presented with a large parameter space.

  7. Compute-unified device architecture implementation of a block-matching algorithm for multiple graphical processing unit cards

    PubMed Central

    Massanes, Francesc; Cadennes, Marie; Brankov, Jovan G.

    2012-01-01

    In this paper we describe and evaluate a fast implementation of a classical block matching motion estimation algorithm for multiple Graphical Processing Units (GPUs) using the Compute Unified Device Architecture (CUDA) computing engine. The implemented block matching algorithm (BMA) uses summed absolute difference (SAD) error criterion and full grid search (FS) for finding optimal block displacement. In this evaluation we compared the execution time of a GPU and CPU implementation for images of various sizes, using integer and non-integer search grids. The results show that use of a GPU card can shorten computation time by a factor of 200 times for integer and 1000 times for a non-integer search grid. The additional speedup for non-integer search grid comes from the fact that GPU has built-in hardware for image interpolation. Further, when using multiple GPU cards, the presented evaluation shows the importance of the data splitting method across multiple cards, but an almost linear speedup with a number of cards is achievable. In addition we compared execution time of the proposed FS GPU implementation with two existing, highly optimized non-full grid search CPU based motion estimations methods, namely implementation of the Pyramidal Lucas Kanade Optical flow algorithm in OpenCV and Simplified Unsymmetrical multi-Hexagon search in H.264/AVC standard. In these comparisons, FS GPU implementation still showed modest improvement even though the computational complexity of FS GPU implementation is substantially higher than non-FS CPU implementation. We also demonstrated that for an image sequence of 720×480 pixels in resolution, commonly used in video surveillance, the proposed GPU implementation is sufficiently fast for real-time motion estimation at 30 frames-per-second using two NVIDIA C1060 Tesla GPU cards. PMID:22347787

  8. Compute-unified device architecture implementation of a block-matching algorithm for multiple graphical processing unit cards.

    PubMed

    Massanes, Francesc; Cadennes, Marie; Brankov, Jovan G

    2011-07-01

    In this paper we describe and evaluate a fast implementation of a classical block matching motion estimation algorithm for multiple Graphical Processing Units (GPUs) using the Compute Unified Device Architecture (CUDA) computing engine. The implemented block matching algorithm (BMA) uses summed absolute difference (SAD) error criterion and full grid search (FS) for finding optimal block displacement. In this evaluation we compared the execution time of a GPU and CPU implementation for images of various sizes, using integer and non-integer search grids.The results show that use of a GPU card can shorten computation time by a factor of 200 times for integer and 1000 times for a non-integer search grid. The additional speedup for non-integer search grid comes from the fact that GPU has built-in hardware for image interpolation. Further, when using multiple GPU cards, the presented evaluation shows the importance of the data splitting method across multiple cards, but an almost linear speedup with a number of cards is achievable.In addition we compared execution time of the proposed FS GPU implementation with two existing, highly optimized non-full grid search CPU based motion estimations methods, namely implementation of the Pyramidal Lucas Kanade Optical flow algorithm in OpenCV and Simplified Unsymmetrical multi-Hexagon search in H.264/AVC standard. In these comparisons, FS GPU implementation still showed modest improvement even though the computational complexity of FS GPU implementation is substantially higher than non-FS CPU implementation.We also demonstrated that for an image sequence of 720×480 pixels in resolution, commonly used in video surveillance, the proposed GPU implementation is sufficiently fast for real-time motion estimation at 30 frames-per-second using two NVIDIA C1060 Tesla GPU cards.

  9. A new tool for supervised classification of satellite images available on web servers: Google Maps as a case study

    NASA Astrophysics Data System (ADS)

    García-Flores, Agustín.; Paz-Gallardo, Abel; Plaza, Antonio; Li, Jun

    2016-10-01

    This paper describes a new web platform dedicated to the classification of satellite images called Hypergim. The current implementation of this platform enables users to perform classification of satellite images from any part of the world thanks to the worldwide maps provided by Google Maps. To perform this classification, Hypergim uses unsupervised algorithms like Isodata and K-means. Here, we present an extension of the original platform in which we adapt Hypergim in order to use supervised algorithms to improve the classification results. This involves a significant modification of the user interface, providing the user with a way to obtain samples of classes present in the images to use in the training phase of the classification process. Another main goal of this development is to improve the runtime of the image classification process. To achieve this goal, we use a parallel implementation of the Random Forest classification algorithm. This implementation is a modification of the well-known CURFIL software package. The use of this type of algorithms to perform image classification is widespread today thanks to its precision and ease of training. The actual implementation of Random Forest was developed using CUDA platform, which enables us to exploit the potential of several models of NVIDIA graphics processing units using them to execute general purpose computing tasks as image classification algorithms. As well as CUDA, we use other parallel libraries as Intel Boost, taking advantage of the multithreading capabilities of modern CPUs. To ensure the best possible results, the platform is deployed in a cluster of commodity graphics processing units (GPUs), so that multiple users can use the tool in a concurrent way. The experimental results indicate that this new algorithm widely outperform the previous unsupervised algorithms implemented in Hypergim, both in runtime as well as precision of the actual classification of the images.

  10. GPU accelerated generation of digitally reconstructed radiographs for 2-D/3-D image registration.

    PubMed

    Dorgham, Osama M; Laycock, Stephen D; Fisher, Mark H

    2012-09-01

    Recent advances in programming languages for graphics processing units (GPUs) provide developers with a convenient way of implementing applications which can be executed on the CPU and GPU interchangeably. GPUs are becoming relatively cheap, powerful, and widely available hardware components, which can be used to perform intensive calculations. The last decade of hardware performance developments shows that GPU-based computation is progressing significantly faster than CPU-based computation, particularly if one considers the execution of highly parallelisable algorithms. Future predictions illustrate that this trend is likely to continue. In this paper, we introduce a way of accelerating 2-D/3-D image registration by developing a hybrid system which executes on the CPU and utilizes the GPU for parallelizing the generation of digitally reconstructed radiographs (DRRs). Based on the advancements of the GPU over the CPU, it is timely to exploit the benefits of many-core GPU technology by developing algorithms for DRR generation. Although some previous work has investigated the rendering of DRRs using the GPU, this paper investigates approximations which reduce the computational overhead while still maintaining a quality consistent with that needed for 2-D/3-D registration with sufficient accuracy to be clinically acceptable in certain applications of radiation oncology. Furthermore, by comparing implementations of 2-D/3-D registration on the CPU and GPU, we investigate current performance and propose an optimal framework for PC implementations addressing the rigid registration problem. Using this framework, we are able to render DRR images from a 256×256×133 CT volume in ~24 ms using an NVidia GeForce 8800 GTX and in ~2 ms using NVidia GeForce GTX 580. In addition to applications requiring fast automatic patient setup, these levels of performance suggest image-guided radiation therapy at video frame rates is technically feasible using relatively low cost PC architecture.

  11. Development of a GPU-Accelerated 3-D Full-Wave Code for Electromagnetic Wave Propagation in a Cold Plasma

    NASA Astrophysics Data System (ADS)

    Woodbury, D.; Kubota, S.; Johnson, I.

    2014-10-01

    Computer simulations of electromagnetic wave propagation in magnetized plasmas are an important tool for both plasma heating and diagnostics. For active millimeter-wave and microwave diagnostics, accurately modeling the evolution of the beam parameters for launched, reflected or scattered waves in a toroidal plasma requires that calculations be done using the full 3-D geometry. Previously, we reported on the application of GPGPU (General-Purpose computing on Graphics Processing Units) to a 3-D vacuum Maxwell code using the FDTD (Finite-Difference Time-Domain) method. Tests were done for Gaussian beam propagation with a hard source antenna, utilizing the parallel processing capabilities of the NVIDIA K20M. In the current study, we have modified the 3-D code to include a soft source antenna and an induced current density based on the cold plasma approximation. Results from Gaussian beam propagation in an inhomogeneous anisotropic plasma, along with comparisons to ray- and beam-tracing calculations will be presented. Additional enhancements, such as advanced coding techniques for improved speedup, will also be investigated. Supported by U.S. DoE Grant DE-FG02-99-ER54527 and in part by the U.S. DoE, Office of Science, WDTS under the Science Undergraduate Laboratory Internship program.

  12. Accelerating Smith-Waterman Algorithm for Biological Database Search on CUDA-Compatible GPUs

    NASA Astrophysics Data System (ADS)

    Munekawa, Yuma; Ino, Fumihiko; Hagihara, Kenichi

    This paper presents a fast method capable of accelerating the Smith-Waterman algorithm for biological database search on a cluster of graphics processing units (GPUs). Our method is implemented using compute unified device architecture (CUDA), which is available on the nVIDIA GPU. As compared with previous methods, our method has four major contributions. (1) The method efficiently uses on-chip shared memory to reduce the data amount being transferred between off-chip video memory and processing elements in the GPU. (2) It also reduces the number of data fetches by applying a data reuse technique to query and database sequences. (3) A pipelined method is also implemented to overlap GPU execution with database access. (4) Finally, a master/worker paradigm is employed to accelerate hundreds of database searches on a cluster system. In experiments, the peak performance on a GeForce GTX 280 card reaches 8.32 giga cell updates per second (GCUPS). We also find that our method reduces the amount of data fetches to 1/140, achieving approximately three times higher performance than a previous CUDA-based method. Our 32-node cluster version is approximately 28 times faster than a single GPU version. Furthermore, the effective performance reaches 75.6 giga instructions per second (GIPS) using 32 GeForce 8800 GTX cards.

  13. GPU-based ultra-fast dose calculation using a finite size pencil beam model.

    PubMed

    Gu, Xuejun; Choi, Dongju; Men, Chunhua; Pan, Hubert; Majumdar, Amitava; Jiang, Steve B

    2009-10-21

    Online adaptive radiation therapy (ART) is an attractive concept that promises the ability to deliver an optimal treatment in response to the inter-fraction variability in patient anatomy. However, it has yet to be realized due to technical limitations. Fast dose deposit coefficient calculation is a critical component of the online planning process that is required for plan optimization of intensity-modulated radiation therapy (IMRT). Computer graphics processing units (GPUs) are well suited to provide the requisite fast performance for the data-parallel nature of dose calculation. In this work, we develop a dose calculation engine based on a finite-size pencil beam (FSPB) algorithm and a GPU parallel computing framework. The developed framework can accommodate any FSPB model. We test our implementation in the case of a water phantom and the case of a prostate cancer patient with varying beamlet and voxel sizes. All testing scenarios achieved speedup ranging from 200 to 400 times when using a NVIDIA Tesla C1060 card in comparison with a 2.27 GHz Intel Xeon CPU. The computational time for calculating dose deposition coefficients for a nine-field prostate IMRT plan with this new framework is less than 1 s. This indicates that the GPU-based FSPB algorithm is well suited for online re-planning for adaptive radiotherapy.

  14. GPU Lossless Hyperspectral Data Compression System

    NASA Technical Reports Server (NTRS)

    Aranki, Nazeeh I.; Keymeulen, Didier; Kiely, Aaron B.; Klimesh, Matthew A.

    2014-01-01

    Hyperspectral imaging systems onboard aircraft or spacecraft can acquire large amounts of data, putting a strain on limited downlink and storage resources. Onboard data compression can mitigate this problem but may require a system capable of a high throughput. In order to achieve a high throughput with a software compressor, a graphics processing unit (GPU) implementation of a compressor was developed targeting the current state-of-the-art GPUs from NVIDIA(R). The implementation is based on the fast lossless (FL) compression algorithm reported in "Fast Lossless Compression of Multispectral-Image Data" (NPO- 42517), NASA Tech Briefs, Vol. 30, No. 8 (August 2006), page 26, which operates on hyperspectral data and achieves excellent compression performance while having low complexity. The FL compressor uses an adaptive filtering method and achieves state-of-the-art performance in both compression effectiveness and low complexity. The new Consultative Committee for Space Data Systems (CCSDS) Standard for Lossless Multispectral & Hyperspectral image compression (CCSDS 123) is based on the FL compressor. The software makes use of the highly-parallel processing capability of GPUs to achieve a throughput at least six times higher than that of a software implementation running on a single-core CPU. This implementation provides a practical real-time solution for compression of data from airborne hyperspectral instruments.

  15. Challenges and Opportunities in Propulsion Simulations

    DTIC Science & Technology

    2015-09-24

    leverage Nvidia GPU accelerators •  Release common computational infrastructure as Distro A for collaboration •  Add physics modules as either...Gemini (6.4 GB/s) Dual Rail EDR-IB (23 GB/s) Interconnect Topology 3D Torus Non-blocking Fat Tree Processors AMD Opteron™ NVIDIA Kepler™ IBM...POWER9 NVIDIA Volta™ File System 32 PB, 1 TB/s, Lustre® 120 PB, 1 TB/s, GPFS™ Peak power consumption 9 MW 10 MW Titan vs. Summit Source: R

  16. Singular value decomposition utilizing parallel algorithms on graphical processors

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kotas, Charlotte W; Barhen, Jacob

    2011-01-01

    One of the current challenges in underwater acoustic array signal processing is the detection of quiet targets in the presence of noise. In order to enable robust detection, one of the key processing steps requires data and replica whitening. This, in turn, involves the eigen-decomposition of the sample spectral matrix, Cx = 1/K xKX(k)XH(k) where X(k) denotes a single frequency snapshot with an element for each element of the array. By employing the singular value decomposition (SVD) method, the eigenvectors and eigenvalues can be determined directly from the data without computing the sample covariance matrix, reducing the computational requirements formore » a given level of accuracy (van Trees, Optimum Array Processing). (Recall that the SVD of a complex matrix A involves determining V, , and U such that A = U VH where U and V are orthonormal and is a positive, real, diagonal matrix containing the singular values of A. U and V are the eigenvectors of AAH and AHA, respectively, while the singular values are the square roots of the eigenvalues of AAH.) Because it is desirable to be able to compute these quantities in real time, an efficient technique for computing the SVD is vital. In addition, emerging multicore processors like graphical processing units (GPUs) are bringing parallel processing capabilities to an ever increasing number of users. Since the computational tasks involved in array signal processing are well suited for parallelization, it is expected that these computations will be implemented using GPUs as soon as users have the necessary computational tools available to them. Thus, it is important to have an SVD algorithm that is suitable for these processors. This work explores the effectiveness of two different parallel SVD implementations on an NVIDIA Tesla C2050 GPU (14 multiprocessors, 32 cores per multiprocessor, 1.15 GHz clock - peed). The first algorithm is based on a two-step algorithm which bidiagonalizes the matrix using Householder transformations, and then diagonalizes the intermediate bidiagonal matrix through implicit QR shifts. This is similar to that implemented for real matrices by Lahabar and Narayanan ("Singular Value Decomposition on GPU using CUDA", IEEE International Parallel Distributed Processing Symposium 2009). The implementation is done in a hybrid manner, with the bidiagonalization stage done using the GPU while the diagonalization stage is done using the CPU, with the GPU used to update the U and V matrices. The second algorithm is based on a one-sided Jacobi scheme utilizing a sequence of pair-wise column orthogonalizations such that A is replaced by AV until the resulting matrix is sufficiently orthogonal (that is, equal to U ). V is obtained from the sequence of orthogonalizations, while can be found from the square root of the diagonal elements of AH A and, once is known, U can be found from column scaling the resulting matrix. These implementations utilize CUDA Fortran and NVIDIA's CUB LAS library. The primary goal of this study is to quantify the comparative performance of these two techniques against themselves and other standard implementations (for example, MATLAB). Considering that there is significant overhead associated with transferring data to the GPU and with synchronization between the GPU and the host CPU, it is also important to understand when it is worthwhile to use the GPU in terms of the matrix size and number of concurrent SVDs to be calculated.« less

  17. AMITIS: A 3D GPU-Based Hybrid-PIC Model for Space and Plasma Physics

    NASA Astrophysics Data System (ADS)

    Fatemi, Shahab; Poppe, Andrew R.; Delory, Gregory T.; Farrell, William M.

    2017-05-01

    We have developed, for the first time, an advanced modeling infrastructure in space simulations (AMITIS) with an embedded three-dimensional self-consistent grid-based hybrid model of plasma (kinetic ions and fluid electrons) that runs entirely on graphics processing units (GPUs). The model uses NVIDIA GPUs and their associated parallel computing platform, CUDA, developed for general purpose processing on GPUs. The model uses a single CPU-GPU pair, where the CPU transfers data between the system and GPU memory, executes CUDA kernels, and writes simulation outputs on the disk. All computations, including moving particles, calculating macroscopic properties of particles on a grid, and solving hybrid model equations are processed on a single GPU. We explain various computing kernels within AMITIS and compare their performance with an already existing well-tested hybrid model of plasma that runs in parallel using multi-CPU platforms. We show that AMITIS runs ∼10 times faster than the parallel CPU-based hybrid model. We also introduce an implicit solver for computation of Faraday’s Equation, resulting in an explicit-implicit scheme for the hybrid model equation. We show that the proposed scheme is stable and accurate. We examine the AMITIS energy conservation and show that the energy is conserved with an error < 0.2% after 500,000 timesteps, even when a very low number of particles per cell is used.

  18. Full Monte Carlo-Based Biologic Treatment Plan Optimization System for Intensity Modulated Carbon Ion Therapy on Graphics Processing Unit.

    PubMed

    Qin, Nan; Shen, Chenyang; Tsai, Min-Yu; Pinto, Marco; Tian, Zhen; Dedes, Georgios; Pompos, Arnold; Jiang, Steve B; Parodi, Katia; Jia, Xun

    2018-01-01

    One of the major benefits of carbon ion therapy is enhanced biological effectiveness at the Bragg peak region. For intensity modulated carbon ion therapy (IMCT), it is desirable to use Monte Carlo (MC) methods to compute the properties of each pencil beam spot for treatment planning, because of their accuracy in modeling physics processes and estimating biological effects. We previously developed goCMC, a graphics processing unit (GPU)-oriented MC engine for carbon ion therapy. The purpose of the present study was to build a biological treatment plan optimization system using goCMC. The repair-misrepair-fixation model was implemented to compute the spatial distribution of linear-quadratic model parameters for each spot. A treatment plan optimization module was developed to minimize the difference between the prescribed and actual biological effect. We used a gradient-based algorithm to solve the optimization problem. The system was embedded in the Varian Eclipse treatment planning system under a client-server architecture to achieve a user-friendly planning environment. We tested the system with a 1-dimensional homogeneous water case and 3 3-dimensional patient cases. Our system generated treatment plans with biological spread-out Bragg peaks covering the targeted regions and sparing critical structures. Using 4 NVidia GTX 1080 GPUs, the total computation time, including spot simulation, optimization, and final dose calculation, was 0.6 hour for the prostate case (8282 spots), 0.2 hour for the pancreas case (3795 spots), and 0.3 hour for the brain case (6724 spots). The computation time was dominated by MC spot simulation. We built a biological treatment plan optimization system for IMCT that performs simulations using a fast MC engine, goCMC. To the best of our knowledge, this is the first time that full MC-based IMCT inverse planning has been achieved in a clinically viable time frame. Copyright © 2017 Elsevier Inc. All rights reserved.

  19. GASPRNG: GPU accelerated scalable parallel random number generator library

    NASA Astrophysics Data System (ADS)

    Gao, Shuang; Peterson, Gregory D.

    2013-04-01

    Graphics processors represent a promising technology for accelerating computational science applications. Many computational science applications require fast and scalable random number generation with good statistical properties, so they use the Scalable Parallel Random Number Generators library (SPRNG). We present the GPU Accelerated SPRNG library (GASPRNG) to accelerate SPRNG in GPU-based high performance computing systems. GASPRNG includes code for a host CPU and CUDA code for execution on NVIDIA graphics processing units (GPUs) along with a programming interface to support various usage models for pseudorandom numbers and computational science applications executing on the CPU, GPU, or both. This paper describes the implementation approach used to produce high performance and also describes how to use the programming interface. The programming interface allows a user to be able to use GASPRNG the same way as SPRNG on traditional serial or parallel computers as well as to develop tightly coupled programs executing primarily on the GPU. We also describe how to install GASPRNG and use it. To help illustrate linking with GASPRNG, various demonstration codes are included for the different usage models. GASPRNG on a single GPU shows up to 280x speedup over SPRNG on a single CPU core and is able to scale for larger systems in the same manner as SPRNG. Because GASPRNG generates identical streams of pseudorandom numbers as SPRNG, users can be confident about the quality of GASPRNG for scalable computational science applications. Catalogue identifier: AEOI_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEOI_v1_0.html Program obtainable from: CPC Program Library, Queen’s University, Belfast, N. Ireland Licensing provisions: UTK license. No. of lines in distributed program, including test data, etc.: 167900 No. of bytes in distributed program, including test data, etc.: 1422058 Distribution format: tar.gz Programming language: C and CUDA. Computer: Any PC or workstation with NVIDIA GPU (Tested on Fermi GTX480, Tesla C1060, Tesla M2070). Operating system: Linux with CUDA version 4.0 or later. Should also run on MacOS, Windows, or UNIX. Has the code been vectorized or parallelized?: Yes. Parallelized using MPI directives. RAM: 512 MB˜ 732 MB (main memory on host CPU, depending on the data type of random numbers.) / 512 MB (GPU global memory) Classification: 4.13, 6.5. Nature of problem: Many computational science applications are able to consume large numbers of random numbers. For example, Monte Carlo simulations are able to consume limitless random numbers for the computation as long as resources for the computing are supported. Moreover, parallel computational science applications require independent streams of random numbers to attain statistically significant results. The SPRNG library provides this capability, but at a significant computational cost. The GASPRNG library presented here accelerates the generators of independent streams of random numbers using graphical processing units (GPUs). Solution method: Multiple copies of random number generators in GPUs allow a computational science application to consume large numbers of random numbers from independent, parallel streams. GASPRNG is a random number generators library to allow a computational science application to employ multiple copies of random number generators to boost performance. Users can interface GASPRNG with software code executing on microprocessors and/or GPUs. Running time: The tests provided take a few minutes to run.

  20. GPU-based Green’s function simulations of shear waves generated by an applied acoustic radiation force in elastic and viscoelastic models

    NASA Astrophysics Data System (ADS)

    Yang, Yiqun; Urban, Matthew W.; McGough, Robert J.

    2018-05-01

    Shear wave calculations induced by an acoustic radiation force are very time-consuming on desktop computers, and high-performance graphics processing units (GPUs) achieve dramatic reductions in the computation time for these simulations. The acoustic radiation force is calculated using the fast near field method and the angular spectrum approach, and then the shear waves are calculated in parallel with Green’s functions on a GPU. This combination enables rapid evaluation of shear waves for push beams with different spatial samplings and for apertures with different f/#. Relative to shear wave simulations that evaluate the same algorithm on an Intel i7 desktop computer, a high performance nVidia GPU reduces the time required for these calculations by a factor of 45 and 700 when applied to elastic and viscoelastic shear wave simulation models, respectively. These GPU-accelerated simulations also compared to measurements in different viscoelastic phantoms, and the results are similar. For parametric evaluations and for comparisons with measured shear wave data, shear wave simulations with the Green’s function approach are ideally suited for high-performance GPUs.

  1. Computational Modeling and Numerical Methods for Spatiotemporal Calcium Cycling in Ventricular Myocytes

    PubMed Central

    Nivala, Michael; de Lange, Enno; Rovetti, Robert; Qu, Zhilin

    2012-01-01

    Intracellular calcium (Ca) cycling dynamics in cardiac myocytes is regulated by a complex network of spatially distributed organelles, such as sarcoplasmic reticulum (SR), mitochondria, and myofibrils. In this study, we present a mathematical model of intracellular Ca cycling and numerical and computational methods for computer simulations. The model consists of a coupled Ca release unit (CRU) network, which includes a SR domain and a myoplasm domain. Each CRU contains 10 L-type Ca channels and 100 ryanodine receptor channels, with individual channels simulated stochastically using a variant of Gillespie’s method, modified here to handle time-dependent transition rates. Both the SR domain and the myoplasm domain in each CRU are modeled by 5 × 5 × 5 voxels to maintain proper Ca diffusion. Advanced numerical algorithms implemented on graphical processing units were used for fast computational simulations. For a myocyte containing 100 × 20 × 10 CRUs, a 1-s heart time simulation takes about 10 min of machine time on a single NVIDIA Tesla C2050. Examples of simulated Ca cycling dynamics, such as Ca sparks, Ca waves, and Ca alternans, are shown. PMID:22586402

  2. fast_protein_cluster: parallel and optimized clustering of large-scale protein modeling data.

    PubMed

    Hung, Ling-Hong; Samudrala, Ram

    2014-06-15

    fast_protein_cluster is a fast, parallel and memory efficient package used to cluster 60 000 sets of protein models (with up to 550 000 models per set) generated by the Nutritious Rice for the World project. fast_protein_cluster is an optimized and extensible toolkit that supports Root Mean Square Deviation after optimal superposition (RMSD) and Template Modeling score (TM-score) as metrics. RMSD calculations using a laptop CPU are 60× faster than qcprot and 3× faster than current graphics processing unit (GPU) implementations. New GPU code further increases the speed of RMSD and TM-score calculations. fast_protein_cluster provides novel k-means and hierarchical clustering methods that are up to 250× and 2000× faster, respectively, than Clusco, and identify significantly more accurate models than Spicker and Clusco. fast_protein_cluster is written in C++ using OpenMP for multi-threading support. Custom streaming Single Instruction Multiple Data (SIMD) extensions and advanced vector extension intrinsics code accelerate CPU calculations, and OpenCL kernels support AMD and Nvidia GPUs. fast_protein_cluster is available under the M.I.T. license. (http://software.compbio.washington.edu/fast_protein_cluster) © The Author 2014. Published by Oxford University Press.

  3. Numerical Simulation of Transit-Time Ultrasonic Flowmeters by a Direct Approach.

    PubMed

    Luca, Adrian; Marchiano, Regis; Chassaing, Jean-Camille

    2016-06-01

    This paper deals with the development of a computational code for the numerical simulation of wave propagation through domains with a complex geometry consisting in both solids and moving fluids. The emphasis is on the numerical simulation of ultrasonic flowmeters (UFMs) by modeling the wave propagation in solids with the equations of linear elasticity (ELE) and in fluids with the linearized Euler equations (LEEs). This approach requires high performance computing because of the high number of degrees of freedom and the long propagation distances. Therefore, the numerical method should be chosen with care. In order to minimize the numerical dissipation which may occur in this kind of configuration, the numerical method employed here is the nodal discontinuous Galerkin (DG) method. Also, this method is well suited for parallel computing. To speed up the code, almost all the computational stages have been implemented to run on graphical processing unit (GPU) by using the compute unified device architecture (CUDA) programming model from NVIDIA. This approach has been validated and then used for the two-dimensional simulation of gas UFMs. The large contrast of acoustic impedance characteristic to gas UFMs makes their simulation a real challenge.

  4. Real-time time-division color electroholography using a single GPU and a USB module for synchronizing reference light.

    PubMed

    Araki, Hiromitsu; Takada, Naoki; Niwase, Hiroaki; Ikawa, Shohei; Fujiwara, Masato; Nakayama, Hirotaka; Kakue, Takashi; Shimobaba, Tomoyoshi; Ito, Tomoyoshi

    2015-12-01

    We propose real-time time-division color electroholography using a single graphics processing unit (GPU) and a simple synchronization system of reference light. To facilitate real-time time-division color electroholography, we developed a light emitting diode (LED) controller with a universal serial bus (USB) module and the drive circuit for reference light. A one-chip RGB LED connected to a personal computer via an LED controller was used as the reference light. A single GPU calculates three computer-generated holograms (CGHs) suitable for red, green, and blue colors in each frame of a three-dimensional (3D) movie. After CGH calculation using a single GPU, the CPU can synchronize the CGH display with the color switching of the one-chip RGB LED via the LED controller. Consequently, we succeeded in real-time time-division color electroholography for a 3D object consisting of around 1000 points per color when an NVIDIA GeForce GTX TITAN was used as the GPU. Furthermore, we implemented the proposed method in various GPUs. The experimental results showed that the proposed method was effective for various GPUs.

  5. Prefiltering Model for Homology Detection Algorithms on GPU.

    PubMed

    Retamosa, Germán; de Pedro, Luis; González, Ivan; Tamames, Javier

    2016-01-01

    Homology detection has evolved over the time from heavy algorithms based on dynamic programming approaches to lightweight alternatives based on different heuristic models. However, the main problem with these algorithms is that they use complex statistical models, which makes it difficult to achieve a relevant speedup and find exact matches with the original results. Thus, their acceleration is essential. The aim of this article was to prefilter a sequence database. To make this work, we have implemented a groundbreaking heuristic model based on NVIDIA's graphics processing units (GPUs) and multicore processors. Depending on the sensitivity settings, this makes it possible to quickly reduce the sequence database by factors between 50% and 95%, while rejecting no significant sequences. Furthermore, this prefiltering application can be used together with multiple homology detection algorithms as a part of a next-generation sequencing system. Extensive performance and accuracy tests have been carried out in the Spanish National Centre for Biotechnology (NCB). The results show that GPU hardware can accelerate the execution times of former homology detection applications, such as National Centre for Biotechnology Information (NCBI), Basic Local Alignment Search Tool for Proteins (BLASTP), up to a factor of 4.

  6. Accelerating the reconstruction of magnetic resonance imaging by three-dimensional dual-dictionary learning using CUDA.

    PubMed

    Jiansen Li; Jianqi Sun; Ying Song; Yanran Xu; Jun Zhao

    2014-01-01

    An effective way to improve the data acquisition speed of magnetic resonance imaging (MRI) is using under-sampled k-space data, and dictionary learning method can be used to maintain the reconstruction quality. Three-dimensional dictionary trains the atoms in dictionary in the form of blocks, which can utilize the spatial correlation among slices. Dual-dictionary learning method includes a low-resolution dictionary and a high-resolution dictionary, for sparse coding and image updating respectively. However, the amount of data is huge for three-dimensional reconstruction, especially when the number of slices is large. Thus, the procedure is time-consuming. In this paper, we first utilize the NVIDIA Corporation's compute unified device architecture (CUDA) programming model to design the parallel algorithms on graphics processing unit (GPU) to accelerate the reconstruction procedure. The main optimizations operate in the dictionary learning algorithm and the image updating part, such as the orthogonal matching pursuit (OMP) algorithm and the k-singular value decomposition (K-SVD) algorithm. Then we develop another version of CUDA code with algorithmic optimization. Experimental results show that more than 324 times of speedup is achieved compared with the CPU-only codes when the number of MRI slices is 24.

  7. Utilizing GPUs to Accelerate Turbomachinery CFD Codes

    NASA Technical Reports Server (NTRS)

    MacCalla, Weylin; Kulkarni, Sameer

    2016-01-01

    GPU computing has established itself as a way to accelerate parallel codes in the high performance computing world. This work focuses on speeding up APNASA, a legacy CFD code used at NASA Glenn Research Center, while also drawing conclusions about the nature of GPU computing and the requirements to make GPGPU worthwhile on legacy codes. Rewriting and restructuring of the source code was avoided to limit the introduction of new bugs. The code was profiled and investigated for parallelization potential, then OpenACC directives were used to indicate parallel parts of the code. The use of OpenACC directives was not able to reduce the runtime of APNASA on either the NVIDIA Tesla discrete graphics card, or the AMD accelerated processing unit. Additionally, it was found that in order to justify the use of GPGPU, the amount of parallel work being done within a kernel would have to greatly exceed the work being done by any one portion of the APNASA code. It was determined that in order for an application like APNASA to be accelerated on the GPU, it should not be modular in nature, and the parallel portions of the code must contain a large portion of the code's computation time.

  8. Exploring DeepMedic for the purpose of segmenting white matter hyperintensity lesions

    NASA Astrophysics Data System (ADS)

    Lippert, Fiona; Cheng, Bastian; Golsari, Amir; Weiler, Florian; Gregori, Johannes; Thomalla, Götz; Klein, Jan

    2018-02-01

    DeepMedic, an open source software library based on a multi-channel multi-resolution 3D convolutional neural network, has recently been made publicly available for brain lesion segmentations. It has already been shown that segmentation tasks on MRI data of patients having traumatic brain injuries, brain tumors, and ischemic stroke lesions can be performed very well. In this paper we describe how it can efficiently be used for the purpose of detecting and segmenting white matter hyperintensity lesions. We examined if it can be applied to single-channel routine 2D FLAIR data. For evaluation, we annotated 197 datasets with different numbers and sizes of white matter hyperintensity lesions. Our experiments have shown that substantial results with respect to the segmentation quality can be achieved. Compared to the original parametrization of the DeepMedic neural network, the timings for training can be drastically reduced if adjusting corresponding training parameters, while at the same time the Dice coefficients remain nearly unchanged. This enables for performing a whole training process within a single day utilizing a NVIDIA GeForce GTX 580 graphics board which makes this library also very interesting for research purposes on low-end GPU hardware.

  9. Computational Particle Dynamic Simulations on Multicore Processors (CPDMu) Final Report Phase I

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Schmalz, Mark S

    2011-07-24

    Statement of Problem - Department of Energy has many legacy codes for simulation of computational particle dynamics and computational fluid dynamics applications that are designed to run on sequential processors and are not easily parallelized. Emerging high-performance computing architectures employ massively parallel multicore architectures (e.g., graphics processing units) to increase throughput. Parallelization of legacy simulation codes is a high priority, to achieve compatibility, efficiency, accuracy, and extensibility. General Statement of Solution - A legacy simulation application designed for implementation on mainly-sequential processors has been represented as a graph G. Mathematical transformations, applied to G, produce a graph representation {und G}more » for a high-performance architecture. Key computational and data movement kernels of the application were analyzed/optimized for parallel execution using the mapping G {yields} {und G}, which can be performed semi-automatically. This approach is widely applicable to many types of high-performance computing systems, such as graphics processing units or clusters comprised of nodes that contain one or more such units. Phase I Accomplishments - Phase I research decomposed/profiled computational particle dynamics simulation code for rocket fuel combustion into low and high computational cost regions (respectively, mainly sequential and mainly parallel kernels), with analysis of space and time complexity. Using the research team's expertise in algorithm-to-architecture mappings, the high-cost kernels were transformed, parallelized, and implemented on Nvidia Fermi GPUs. Measured speedups (GPU with respect to single-core CPU) were approximately 20-32X for realistic model parameters, without final optimization. Error analysis showed no loss of computational accuracy. Commercial Applications and Other Benefits - The proposed research will constitute a breakthrough in solution of problems related to efficient parallel computation of particle and fluid dynamics simulations. These problems occur throughout DOE, military and commercial sectors: the potential payoff is high. We plan to license or sell the solution to contractors for military and domestic applications such as disaster simulation (aerodynamic and hydrodynamic), Government agencies (hydrological and environmental simulations), and medical applications (e.g., in tomographic image reconstruction). Keywords - High-performance Computing, Graphic Processing Unit, Fluid/Particle Simulation. Summary for Members of Congress - Department of Energy has many simulation codes that must compute faster, to be effective. The Phase I research parallelized particle/fluid simulations for rocket combustion, for high-performance computing systems.« less

  10. NMF-mGPU: non-negative matrix factorization on multi-GPU systems.

    PubMed

    Mejía-Roa, Edgardo; Tabas-Madrid, Daniel; Setoain, Javier; García, Carlos; Tirado, Francisco; Pascual-Montano, Alberto

    2015-02-13

    In the last few years, the Non-negative Matrix Factorization ( NMF ) technique has gained a great interest among the Bioinformatics community, since it is able to extract interpretable parts from high-dimensional datasets. However, the computing time required to process large data matrices may become impractical, even for a parallel application running on a multiprocessors cluster. In this paper, we present NMF-mGPU, an efficient and easy-to-use implementation of the NMF algorithm that takes advantage of the high computing performance delivered by Graphics-Processing Units ( GPUs ). Driven by the ever-growing demands from the video-games industry, graphics cards usually provided in PCs and laptops have evolved from simple graphics-drawing platforms into high-performance programmable systems that can be used as coprocessors for linear-algebra operations. However, these devices may have a limited amount of on-board memory, which is not considered by other NMF implementations on GPU. NMF-mGPU is based on CUDA ( Compute Unified Device Architecture ), the NVIDIA's framework for GPU computing. On devices with low memory available, large input matrices are blockwise transferred from the system's main memory to the GPU's memory, and processed accordingly. In addition, NMF-mGPU has been explicitly optimized for the different CUDA architectures. Finally, platforms with multiple GPUs can be synchronized through MPI ( Message Passing Interface ). In a four-GPU system, this implementation is about 120 times faster than a single conventional processor, and more than four times faster than a single GPU device (i.e., a super-linear speedup). Applications of GPUs in Bioinformatics are getting more and more attention due to their outstanding performance when compared to traditional processors. In addition, their relatively low price represents a highly cost-effective alternative to conventional clusters. In life sciences, this results in an excellent opportunity to facilitate the daily work of bioinformaticians that are trying to extract biological meaning out of hundreds of gigabytes of experimental information. NMF-mGPU can be used "out of the box" by researchers with little or no expertise in GPU programming in a variety of platforms, such as PCs, laptops, or high-end GPU clusters. NMF-mGPU is freely available at https://github.com/bioinfo-cnb/bionmf-gpu .

  11. High-performance computing on GPUs for resistivity logging of oil and gas wells

    NASA Astrophysics Data System (ADS)

    Glinskikh, V.; Dudaev, A.; Nechaev, O.; Surodina, I.

    2017-10-01

    We developed and implemented into software an algorithm for high-performance simulation of electrical logs from oil and gas wells using high-performance heterogeneous computing. The numerical solution of the 2D forward problem is based on the finite-element method and the Cholesky decomposition for solving a system of linear algebraic equations (SLAE). Software implementations of the algorithm used the NVIDIA CUDA technology and computing libraries are made, allowing us to perform decomposition of SLAE and find its solution on central processor unit (CPU) and graphics processor unit (GPU). The calculation time is analyzed depending on the matrix size and number of its non-zero elements. We estimated the computing speed on CPU and GPU, including high-performance heterogeneous CPU-GPU computing. Using the developed algorithm, we simulated resistivity data in realistic models.

  12. 75 FR 44989 - In the Matter of Certain Semiconductor Chips Having Synchronous Dynamic Random Access Memory...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2010-07-30

    ... following respondents: NVIDIA Corporation of Santa Clara, California; Asustek Computer, Inc. of Taipei... exclusion order and cease- and-desist orders against respondents NVIDIA Corp.; Hewlett-Packard Co.; ASUS...

  13. ARCHERRT – A GPU-based and photon-electron coupled Monte Carlo dose computing engine for radiation therapy: Software development and application to helical tomotherapy

    PubMed Central

    Su, Lin; Yang, Youming; Bednarz, Bryan; Sterpin, Edmond; Du, Xining; Liu, Tianyu; Ji, Wei; Xu, X. George

    2014-01-01

    Purpose: Using the graphical processing units (GPU) hardware technology, an extremely fast Monte Carlo (MC) code ARCHERRT is developed for radiation dose calculations in radiation therapy. This paper describes the detailed software development and testing for three clinical TomoTherapy® cases: the prostate, lung, and head & neck. Methods: To obtain clinically relevant dose distributions, phase space files (PSFs) created from optimized radiation therapy treatment plan fluence maps were used as the input to ARCHERRT. Patient-specific phantoms were constructed from patient CT images. Batch simulations were employed to facilitate the time-consuming task of loading large PSFs, and to improve the estimation of statistical uncertainty. Furthermore, two different Woodcock tracking algorithms were implemented and their relative performance was compared. The dose curves of an Elekta accelerator PSF incident on a homogeneous water phantom were benchmarked against DOSXYZnrc. For each of the treatment cases, dose volume histograms and isodose maps were produced from ARCHERRT and the general-purpose code, GEANT4. The gamma index analysis was performed to evaluate the similarity of voxel doses obtained from these two codes. The hardware accelerators used in this study are one NVIDIA K20 GPU, one NVIDIA K40 GPU, and six NVIDIA M2090 GPUs. In addition, to make a fairer comparison of the CPU and GPU performance, a multithreaded CPU code was developed using OpenMP and tested on an Intel E5-2620 CPU. Results: For the water phantom, the depth dose curve and dose profiles from ARCHERRT agree well with DOSXYZnrc. For clinical cases, results from ARCHERRT are compared with those from GEANT4 and good agreement is observed. Gamma index test is performed for voxels whose dose is greater than 10% of maximum dose. For 2%/2mm criteria, the passing rates for the prostate, lung case, and head & neck cases are 99.7%, 98.5%, and 97.2%, respectively. Due to specific architecture of GPU, modified Woodcock tracking algorithm performed inferior to the original one. ARCHERRT achieves a fast speed for PSF-based dose calculations. With a single M2090 card, the simulations cost about 60, 50, 80 s for three cases, respectively, with the 1% statistical error in the PTV. Using the latest K40 card, the simulations are 1.7–1.8 times faster. More impressively, six M2090 cards could finish the simulations in 8.9–13.4 s. For comparison, the same simulations on Intel E5-2620 (12 hyperthreading) cost about 500–800 s. Conclusions: ARCHERRT was developed successfully to perform fast and accurate MC dose calculation for radiotherapy using PSFs and patient CT phantoms. PMID:24989378

  14. ARCHERRT - a GPU-based and photon-electron coupled Monte Carlo dose computing engine for radiation therapy: software development and application to helical tomotherapy.

    PubMed

    Su, Lin; Yang, Youming; Bednarz, Bryan; Sterpin, Edmond; Du, Xining; Liu, Tianyu; Ji, Wei; Xu, X George

    2014-07-01

    Using the graphical processing units (GPU) hardware technology, an extremely fast Monte Carlo (MC) code ARCHERRT is developed for radiation dose calculations in radiation therapy. This paper describes the detailed software development and testing for three clinical TomoTherapy® cases: the prostate, lung, and head & neck. To obtain clinically relevant dose distributions, phase space files (PSFs) created from optimized radiation therapy treatment plan fluence maps were used as the input to ARCHERRT. Patient-specific phantoms were constructed from patient CT images. Batch simulations were employed to facilitate the time-consuming task of loading large PSFs, and to improve the estimation of statistical uncertainty. Furthermore, two different Woodcock tracking algorithms were implemented and their relative performance was compared. The dose curves of an Elekta accelerator PSF incident on a homogeneous water phantom were benchmarked against DOSXYZnrc. For each of the treatment cases, dose volume histograms and isodose maps were produced from ARCHERRT and the general-purpose code, GEANT4. The gamma index analysis was performed to evaluate the similarity of voxel doses obtained from these two codes. The hardware accelerators used in this study are one NVIDIA K20 GPU, one NVIDIA K40 GPU, and six NVIDIA M2090 GPUs. In addition, to make a fairer comparison of the CPU and GPU performance, a multithreaded CPU code was developed using OpenMP and tested on an Intel E5-2620 CPU. For the water phantom, the depth dose curve and dose profiles from ARCHERRT agree well with DOSXYZnrc. For clinical cases, results from ARCHERRT are compared with those from GEANT4 and good agreement is observed. Gamma index test is performed for voxels whose dose is greater than 10% of maximum dose. For 2%/2mm criteria, the passing rates for the prostate, lung case, and head & neck cases are 99.7%, 98.5%, and 97.2%, respectively. Due to specific architecture of GPU, modified Woodcock tracking algorithm performed inferior to the original one. ARCHERRT achieves a fast speed for PSF-based dose calculations. With a single M2090 card, the simulations cost about 60, 50, 80 s for three cases, respectively, with the 1% statistical error in the PTV. Using the latest K40 card, the simulations are 1.7-1.8 times faster. More impressively, six M2090 cards could finish the simulations in 8.9-13.4 s. For comparison, the same simulations on Intel E5-2620 (12 hyperthreading) cost about 500-800 s. ARCHERRT was developed successfully to perform fast and accurate MC dose calculation for radiotherapy using PSFs and patient CT phantoms.

  15. Real-time implementation of optimized maximum noise fraction transform for feature extraction of hyperspectral images

    NASA Astrophysics Data System (ADS)

    Wu, Yuanfeng; Gao, Lianru; Zhang, Bing; Zhao, Haina; Li, Jun

    2014-01-01

    We present a parallel implementation of the optimized maximum noise fraction (G-OMNF) transform algorithm for feature extraction of hyperspectral images on commodity graphics processing units (GPUs). The proposed approach explored the algorithm data-level concurrency and optimized the computing flow. We first defined a three-dimensional grid, in which each thread calculates a sub-block data to easily facilitate the spatial and spectral neighborhood data searches in noise estimation, which is one of the most important steps involved in OMNF. Then, we optimized the processing flow and computed the noise covariance matrix before computing the image covariance matrix to reduce the original hyperspectral image data transmission. These optimization strategies can greatly improve the computing efficiency and can be applied to other feature extraction algorithms. The proposed parallel feature extraction algorithm was implemented on an Nvidia Tesla GPU using the compute unified device architecture and basic linear algebra subroutines library. Through the experiments on several real hyperspectral images, our GPU parallel implementation provides a significant speedup of the algorithm compared with the CPU implementation, especially for highly data parallelizable and arithmetically intensive algorithm parts, such as noise estimation. In order to further evaluate the effectiveness of G-OMNF, we used two different applications: spectral unmixing and classification for evaluation. Considering the sensor scanning rate and the data acquisition time, the proposed parallel implementation met the on-board real-time feature extraction.

  16. Applying GIS and high performance agent-based simulation for managing an Old World Screwworm fly invasion of Australia.

    PubMed

    Welch, M C; Kwan, P W; Sajeev, A S M

    2014-10-01

    Agent-based modelling has proven to be a promising approach for developing rich simulations for complex phenomena that provide decision support functions across a broad range of areas including biological, social and agricultural sciences. This paper demonstrates how high performance computing technologies, namely General-Purpose Computing on Graphics Processing Units (GPGPU), and commercial Geographic Information Systems (GIS) can be applied to develop a national scale, agent-based simulation of an incursion of Old World Screwworm fly (OWS fly) into the Australian mainland. The development of this simulation model leverages the combination of massively data-parallel processing capabilities supported by NVidia's Compute Unified Device Architecture (CUDA) and the advanced spatial visualisation capabilities of GIS. These technologies have enabled the implementation of an individual-based, stochastic lifecycle and dispersal algorithm for the OWS fly invasion. The simulation model draws upon a wide range of biological data as input to stochastically determine the reproduction and survival of the OWS fly through the different stages of its lifecycle and dispersal of gravid females. Through this model, a highly efficient computational platform has been developed for studying the effectiveness of control and mitigation strategies and their associated economic impact on livestock industries can be materialised. Copyright © 2014 International Atomic Energy Agency 2014. Published by Elsevier B.V. All rights reserved.

  17. A smooth particle hydrodynamics code to model collisions between solid, self-gravitating objects

    NASA Astrophysics Data System (ADS)

    Schäfer, C.; Riecker, S.; Maindl, T. I.; Speith, R.; Scherrer, S.; Kley, W.

    2016-05-01

    Context. Modern graphics processing units (GPUs) lead to a major increase in the performance of the computation of astrophysical simulations. Owing to the different nature of GPU architecture compared to traditional central processing units (CPUs) such as x86 architecture, existing numerical codes cannot be easily migrated to run on GPU. Here, we present a new implementation of the numerical method smooth particle hydrodynamics (SPH) using CUDA and the first astrophysical application of the new code: the collision between Ceres-sized objects. Aims: The new code allows for a tremendous increase in speed of astrophysical simulations with SPH and self-gravity at low costs for new hardware. Methods: We have implemented the SPH equations to model gas, liquids and elastic, and plastic solid bodies and added a fragmentation model for brittle materials. Self-gravity may be optionally included in the simulations and is treated by the use of a Barnes-Hut tree. Results: We find an impressive performance gain using NVIDIA consumer devices compared to our existing OpenMP code. The new code is freely available to the community upon request. If you are interested in our CUDA SPH code miluphCUDA, please write an email to Christoph Schäfer. miluphCUDA is the CUDA port of miluph. miluph is pronounced [maßl2v]. We do not support the use of the code for military purposes.

  18. Comparison of multihardware parallel implementations for a phase unwrapping algorithm

    NASA Astrophysics Data System (ADS)

    Hernandez-Lopez, Francisco Javier; Rivera, Mariano; Salazar-Garibay, Adan; Legarda-Sáenz, Ricardo

    2018-04-01

    Phase unwrapping is an important problem in the areas of optical metrology, synthetic aperture radar (SAR) image analysis, and magnetic resonance imaging (MRI) analysis. These images are becoming larger in size and, particularly, the availability and need for processing of SAR and MRI data have increased significantly with the acquisition of remote sensing data and the popularization of magnetic resonators in clinical diagnosis. Therefore, it is important to develop faster and accurate phase unwrapping algorithms. We propose a parallel multigrid algorithm of a phase unwrapping method named accumulation of residual maps, which builds on a serial algorithm that consists of the minimization of a cost function; minimization achieved by means of a serial Gauss-Seidel kind algorithm. Our algorithm also optimizes the original cost function, but unlike the original work, our algorithm is a parallel Jacobi class with alternated minimizations. This strategy is known as the chessboard type, where red pixels can be updated in parallel at same iteration since they are independent. Similarly, black pixels can be updated in parallel in an alternating iteration. We present parallel implementations of our algorithm for different parallel multicore architecture such as CPU-multicore, Xeon Phi coprocessor, and Nvidia graphics processing unit. In all the cases, we obtain a superior performance of our parallel algorithm when compared with the original serial version. In addition, we present a detailed comparative performance of the developed parallel versions.

  19. GAMUT: GPU accelerated microRNA analysis to uncover target genes through CUDA-miRanda

    PubMed Central

    2014-01-01

    Background Non-coding sequences such as microRNAs have important roles in disease processes. Computational microRNA target identification (CMTI) is becoming increasingly important since traditional experimental methods for target identification pose many difficulties. These methods are time-consuming, costly, and often need guidance from computational methods to narrow down candidate genes anyway. However, most CMTI methods are computationally demanding, since they need to handle not only several million query microRNA and reference RNA pairs, but also several million nucleotide comparisons within each given pair. Thus, the need to perform microRNA identification at such large scale has increased the demand for parallel computing. Methods Although most CMTI programs (e.g., the miRanda algorithm) are based on a modified Smith-Waterman (SW) algorithm, the existing parallel SW implementations (e.g., CUDASW++ 2.0/3.0, SWIPE) are unable to meet this demand in CMTI tasks. We present CUDA-miRanda, a fast microRNA target identification algorithm that takes advantage of massively parallel computing on Graphics Processing Units (GPU) using NVIDIA's Compute Unified Device Architecture (CUDA). CUDA-miRanda specifically focuses on the local alignment of short (i.e., ≤ 32 nucleotides) sequences against longer reference sequences (e.g., 20K nucleotides). Moreover, the proposed algorithm is able to report multiple alignments (up to 191 top scores) and the corresponding traceback sequences for any given (query sequence, reference sequence) pair. Results Speeds over 5.36 Giga Cell Updates Per Second (GCUPs) are achieved on a server with 4 NVIDIA Tesla M2090 GPUs. Compared to the original miRanda algorithm, which is evaluated on an Intel Xeon E5620@2.4 GHz CPU, the experimental results show up to 166 times performance gains in terms of execution time. In addition, we have verified that the exact same targets were predicted in both CUDA-miRanda and the original miRanda implementations through multiple test datasets. Conclusions We offer a GPU-based alternative to high performance compute (HPC) that can be developed locally at a relatively small cost. The community of GPU developers in the biomedical research community, particularly for genome analysis, is still growing. With increasing shared resources, this community will be able to advance CMTI in a very significant manner. Our source code is available at https://sourceforge.net/projects/cudamiranda/. PMID:25077821

  20. Next-generation acceleration and code optimization for light transport in turbid media using GPUs

    PubMed Central

    Alerstam, Erik; Lo, William Chun Yip; Han, Tianyi David; Rose, Jonathan; Andersson-Engels, Stefan; Lilge, Lothar

    2010-01-01

    A highly optimized Monte Carlo (MC) code package for simulating light transport is developed on the latest graphics processing unit (GPU) built for general-purpose computing from NVIDIA - the Fermi GPU. In biomedical optics, the MC method is the gold standard approach for simulating light transport in biological tissue, both due to its accuracy and its flexibility in modelling realistic, heterogeneous tissue geometry in 3-D. However, the widespread use of MC simulations in inverse problems, such as treatment planning for PDT, is limited by their long computation time. Despite its parallel nature, optimizing MC code on the GPU has been shown to be a challenge, particularly when the sharing of simulation result matrices among many parallel threads demands the frequent use of atomic instructions to access the slow GPU global memory. This paper proposes an optimization scheme that utilizes the fast shared memory to resolve the performance bottleneck caused by atomic access, and discusses numerous other optimization techniques needed to harness the full potential of the GPU. Using these techniques, a widely accepted MC code package in biophotonics, called MCML, was successfully accelerated on a Fermi GPU by approximately 600x compared to a state-of-the-art Intel Core i7 CPU. A skin model consisting of 7 layers was used as the standard simulation geometry. To demonstrate the possibility of GPU cluster computing, the same GPU code was executed on four GPUs, showing a linear improvement in performance with an increasing number of GPUs. The GPU-based MCML code package, named GPU-MCML, is compatible with a wide range of graphics cards and is released as an open-source software in two versions: an optimized version tuned for high performance and a simplified version for beginners (http://code.google.com/p/gpumcml). PMID:21258498

  1. Web-based, GPU-accelerated, Monte Carlo simulation and visualization of indirect radiation imaging detector performance.

    PubMed

    Dong, Han; Sharma, Diksha; Badano, Aldo

    2014-12-01

    Monte Carlo simulations play a vital role in the understanding of the fundamental limitations, design, and optimization of existing and emerging medical imaging systems. Efforts in this area have resulted in the development of a wide variety of open-source software packages. One such package, hybridmantis, uses a novel hybrid concept to model indirect scintillator detectors by balancing the computational load using dual CPU and graphics processing unit (GPU) processors, obtaining computational efficiency with reasonable accuracy. In this work, the authors describe two open-source visualization interfaces, webmantis and visualmantis to facilitate the setup of computational experiments via hybridmantis. The visualization tools visualmantis and webmantis enable the user to control simulation properties through a user interface. In the case of webmantis, control via a web browser allows access through mobile devices such as smartphones or tablets. webmantis acts as a server back-end and communicates with an NVIDIA GPU computing cluster that can support multiuser environments where users can execute different experiments in parallel. The output consists of point response and pulse-height spectrum, and optical transport statistics generated by hybridmantis. The users can download the output images and statistics through a zip file for future reference. In addition, webmantis provides a visualization window that displays a few selected optical photon path as they get transported through the detector columns and allows the user to trace the history of the optical photons. The visualization tools visualmantis and webmantis provide features such as on the fly generation of pulse-height spectra and response functions for microcolumnar x-ray imagers while allowing users to save simulation parameters and results from prior experiments. The graphical interfaces simplify the simulation setup and allow the user to go directly from specifying input parameters to receiving visual feedback for the model predictions.

  2. GPU Particle Tracking and MHD Simulations with Greatly Enhanced Computational Speed

    NASA Astrophysics Data System (ADS)

    Ziemba, T.; O'Donnell, D.; Carscadden, J.; Cash, M.; Winglee, R.; Harnett, E.

    2008-12-01

    GPUs are intrinsically highly parallelized systems that provide more than an order of magnitude computing speed over a CPU based systems, for less cost than a high end-workstation. Recent advancements in GPU technologies allow for full IEEE float specifications with performance up to several hundred GFLOPs per GPU, and new software architectures have recently become available to ease the transition from graphics based to scientific applications. This allows for a cheap alternative to standard supercomputing methods and should increase the time to discovery. 3-D particle tracking and MHD codes have been developed using NVIDIA's CUDA and have demonstrated speed up of nearly a factor of 20 over equivalent CPU versions of the codes. Such a speed up enables new applications to develop, including real time running of radiation belt simulations and real time running of global magnetospheric simulations, both of which could provide important space weather prediction tools.

  3. Designing stereoscopic information visualization for 3D-TV: What can we can learn from S3D gaming?

    NASA Astrophysics Data System (ADS)

    Schild, Jonas; Masuch, Maic

    2012-03-01

    This paper explores graphical design and spatial alignment of visual information and graphical elements into stereoscopically filmed content, e.g. captions, subtitles, and especially more complex elements in 3D-TV productions. The method used is a descriptive analysis of existing computer- and video games that have been adapted for stereoscopic display using semi-automatic rendering techniques (e.g. Nvidia 3D Vision) or games which have been specifically designed for stereoscopic vision. Digital games often feature compelling visual interfaces that combine high usability with creative visual design. We explore selected examples of game interfaces in stereoscopic vision regarding their stereoscopic characteristics, how they draw attention, how we judge effect and comfort and where the interfaces fail. As a result, we propose a list of five aspects which should be considered when designing stereoscopic visual information: explicit information, implicit information, spatial reference, drawing attention, and vertical alignment. We discuss possible consequences, opportunities and challenges for integrating visual information elements into 3D-TV content. This work shall further help to improve current editing systems and identifies a need for future editing systems for 3DTV, e.g., live editing and real-time alignment of visual information into 3D footage.

  4. Bayesian Methods and Confidence Intervals for Automatic Target Recognition of SAR Canonical Shapes

    DTIC Science & Technology

    2014-03-27

    and DirectX [22]. The CUDA platform was developed by the NVIDIA Corporation to allow programmers access to the computational capabilities of the...were used for the intense repetitive computations. Developing CUDA software requires writing code for specialized compilers provided by NVIDIA and

  5. GPU accelerated edge-region based level set evolution constrained by 2D gray-scale histogram.

    PubMed

    Balla-Arabé, Souleymane; Gao, Xinbo; Wang, Bin

    2013-07-01

    Due to its intrinsic nature which allows to easily handle complex shapes and topological changes, the level set method (LSM) has been widely used in image segmentation. Nevertheless, LSM is computationally expensive, which limits its applications in real-time systems. For this purpose, we propose a new level set algorithm, which uses simultaneously edge, region, and 2D histogram information in order to efficiently segment objects of interest in a given scene. The computational complexity of the proposed LSM is greatly reduced by using the highly parallelizable lattice Boltzmann method (LBM) with a body force to solve the level set equation (LSE). The body force is the link with image data and is defined from the proposed LSE. The proposed LSM is then implemented using an NVIDIA graphics processing units to fully take advantage of the LBM local nature. The new algorithm is effective, robust against noise, independent to the initial contour, fast, and highly parallelizable. The edge and region information enable to detect objects with and without edges, and the 2D histogram information enable the effectiveness of the method in a noisy environment. Experimental results on synthetic and real images demonstrate subjectively and objectively the performance of the proposed method.

  6. GALARIO: a GPU accelerated library for analysing radio interferometer observations

    NASA Astrophysics Data System (ADS)

    Tazzari, Marco; Beaujean, Frederik; Testi, Leonardo

    2018-06-01

    We present GALARIO, a computational library that exploits the power of modern graphical processing units (GPUs) to accelerate the analysis of observations from radio interferometers like Atacama Large Millimeter and sub-millimeter Array or the Karl G. Jansky Very Large Array. GALARIO speeds up the computation of synthetic visibilities from a generic 2D model image or a radial brightness profile (for axisymmetric sources). On a GPU, GALARIO is 150 faster than standard PYTHON and 10 times faster than serial C++ code on a CPU. Highly modular, easy to use, and to adopt in existing code, GALARIO comes as two compiled libraries, one for Nvidia GPUs and one for multicore CPUs, where both have the same functions with identical interfaces. GALARIO comes with PYTHON bindings but can also be directly used in C or C++. The versatility and the speed of GALARIO open new analysis pathways that otherwise would be prohibitively time consuming, e.g. fitting high-resolution observations of large number of objects, or entire spectral cubes of molecular gas emission. It is a general tool that can be applied to any field that uses radio interferometer observations. The source code is available online at http://github.com/mtazzari/galario under the open source GNU Lesser General Public License v3.

  7. Smooth Particle Hydrodynamics GPU-Acceleration Tool for Asteroid Fragmentation Simulation

    NASA Astrophysics Data System (ADS)

    Buruchenko, Sergey K.; Schäfer, Christoph M.; Maindl, Thomas I.

    2017-10-01

    The impact threat of near-Earth objects (NEOs) is a concern to the global community, as evidenced by the Chelyabinsk event (caused by a 17-m meteorite) in Russia on February 15, 2013 and a near miss by asteroid 2012 DA14 ( 30 m diameter), on the same day. The expected energy, from either a low-altitude air burst or direct impact, would have severe consequences, especially in populated regions. To mitigate this threat one of the methods is employment of large kinetic-energy impactors (KEIs). The simulation of asteroid target fragmentation is a challenging task which demands efficient and accurate numerical methods with large computational power. Modern graphics processing units (GPUs) lead to a major increase 10 times and more in the performance of the computation of astrophysical and high velocity impacts. The paper presents a new implementation of the numerical method smooth particle hydrodynamics (SPH) using NVIDIA-GPU and the first astrophysical and high velocity application of the new code. The code allows for a tremendous increase in speed of astrophysical simulations with SPH and self-gravity at low costs for new hardware. We have implemented the SPH equations to model gas, liquids and elastic, and plastic solid bodies and added a fragmentation model for brittle materials. Self-gravity may be optionally included in the simulations.

  8. A parallel finite element procedure for contact-impact problems using edge-based smooth triangular element and GPU

    NASA Astrophysics Data System (ADS)

    Cai, Yong; Cui, Xiangyang; Li, Guangyao; Liu, Wenyang

    2018-04-01

    The edge-smooth finite element method (ES-FEM) can improve the computational accuracy of triangular shell elements and the mesh partition efficiency of complex models. In this paper, an approach is developed to perform explicit finite element simulations of contact-impact problems with a graphical processing unit (GPU) using a special edge-smooth triangular shell element based on ES-FEM. Of critical importance for this problem is achieving finer-grained parallelism to enable efficient data loading and to minimize communication between the device and host. Four kinds of parallel strategies are then developed to efficiently solve these ES-FEM based shell element formulas, and various optimization methods are adopted to ensure aligned memory access. Special focus is dedicated to developing an approach for the parallel construction of edge systems. A parallel hierarchy-territory contact-searching algorithm (HITA) and a parallel penalty function calculation method are embedded in this parallel explicit algorithm. Finally, the program flow is well designed, and a GPU-based simulation system is developed, using Nvidia's CUDA. Several numerical examples are presented to illustrate the high quality of the results obtained with the proposed methods. In addition, the GPU-based parallel computation is shown to significantly reduce the computing time.

  9. An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU

    NASA Astrophysics Data System (ADS)

    Lyakh, Dmitry I.

    2015-04-01

    An efficient parallel tensor transpose algorithm is suggested for shared-memory computing units, namely, multicore CPU, Intel Xeon Phi, and NVidia GPU. The algorithm operates on dense tensors (multidimensional arrays) and is based on the optimization of cache utilization on x86 CPU and the use of shared memory on NVidia GPU. From the applied side, the ultimate goal is to minimize the overhead encountered in the transformation of tensor contractions into matrix multiplications in computer implementations of advanced methods of quantum many-body theory (e.g., in electronic structure theory and nuclear physics). A particular accent is made on higher-dimensional tensors that typically appear in the so-called multireference correlated methods of electronic structure theory. Depending on tensor dimensionality, the presented optimized algorithms can achieve an order of magnitude speedup on x86 CPUs and 2-3 times speedup on NVidia Tesla K20X GPU with respect to the naïve scattering algorithm (no memory access optimization). The tensor transpose routines developed in this work have been incorporated into a general-purpose tensor algebra library (TAL-SH).

  10. CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment

    PubMed Central

    Manavski, Svetlin A; Valle, Giorgio

    2008-01-01

    Background Searching for similarities in protein and DNA databases has become a routine procedure in Molecular Biology. The Smith-Waterman algorithm has been available for more than 25 years. It is based on a dynamic programming approach that explores all the possible alignments between two sequences; as a result it returns the optimal local alignment. Unfortunately, the computational cost is very high, requiring a number of operations proportional to the product of the length of two sequences. Furthermore, the exponential growth of protein and DNA databases makes the Smith-Waterman algorithm unrealistic for searching similarities in large sets of sequences. For these reasons heuristic approaches such as those implemented in FASTA and BLAST tend to be preferred, allowing faster execution times at the cost of reduced sensitivity. The main motivation of our work is to exploit the huge computational power of commonly available graphic cards, to develop high performance solutions for sequence alignment. Results In this paper we present what we believe is the fastest solution of the exact Smith-Waterman algorithm running on commodity hardware. It is implemented in the recently released CUDA programming environment by NVidia. CUDA allows direct access to the hardware primitives of the last-generation Graphics Processing Units (GPU) G80. Speeds of more than 3.5 GCUPS (Giga Cell Updates Per Second) are achieved on a workstation running two GeForce 8800 GTX. Exhaustive tests have been done to compare our implementation to SSEARCH and BLAST, running on a 3 GHz Intel Pentium IV processor. Our solution was also compared to a recently published GPU implementation and to a Single Instruction Multiple Data (SIMD) solution. These tests show that our implementation performs from 2 to 30 times faster than any other previous attempt available on commodity hardware. Conclusions The results show that graphic cards are now sufficiently advanced to be used as efficient hardware accelerators for sequence alignment. Their performance is better than any alternative available on commodity hardware platforms. The solution presented in this paper allows large scale alignments to be performed at low cost, using the exact Smith-Waterman algorithm instead of the largely adopted heuristic approaches. PMID:18387198

  11. ARCHER{sub RT} – A GPU-based and photon-electron coupled Monte Carlo dose computing engine for radiation therapy: Software development and application to helical tomotherapy

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Su, Lin; Du, Xining; Liu, Tianyu

    Purpose: Using the graphical processing units (GPU) hardware technology, an extremely fast Monte Carlo (MC) code ARCHER{sub RT} is developed for radiation dose calculations in radiation therapy. This paper describes the detailed software development and testing for three clinical TomoTherapy® cases: the prostate, lung, and head and neck. Methods: To obtain clinically relevant dose distributions, phase space files (PSFs) created from optimized radiation therapy treatment plan fluence maps were used as the input to ARCHER{sub RT}. Patient-specific phantoms were constructed from patient CT images. Batch simulations were employed to facilitate the time-consuming task of loading large PSFs, and to improvemore » the estimation of statistical uncertainty. Furthermore, two different Woodcock tracking algorithms were implemented and their relative performance was compared. The dose curves of an Elekta accelerator PSF incident on a homogeneous water phantom were benchmarked against DOSXYZnrc. For each of the treatment cases, dose volume histograms and isodose maps were produced from ARCHER{sub RT} and the general-purpose code, GEANT4. The gamma index analysis was performed to evaluate the similarity of voxel doses obtained from these two codes. The hardware accelerators used in this study are one NVIDIA K20 GPU, one NVIDIA K40 GPU, and six NVIDIA M2090 GPUs. In addition, to make a fairer comparison of the CPU and GPU performance, a multithreaded CPU code was developed using OpenMP and tested on an Intel E5-2620 CPU. Results: For the water phantom, the depth dose curve and dose profiles from ARCHER{sub RT} agree well with DOSXYZnrc. For clinical cases, results from ARCHER{sub RT} are compared with those from GEANT4 and good agreement is observed. Gamma index test is performed for voxels whose dose is greater than 10% of maximum dose. For 2%/2mm criteria, the passing rates for the prostate, lung case, and head and neck cases are 99.7%, 98.5%, and 97.2%, respectively. Due to specific architecture of GPU, modified Woodcock tracking algorithm performed inferior to the original one. ARCHER{sub RT} achieves a fast speed for PSF-based dose calculations. With a single M2090 card, the simulations cost about 60, 50, 80 s for three cases, respectively, with the 1% statistical error in the PTV. Using the latest K40 card, the simulations are 1.7–1.8 times faster. More impressively, six M2090 cards could finish the simulations in 8.9–13.4 s. For comparison, the same simulations on Intel E5-2620 (12 hyperthreading) cost about 500–800 s. Conclusions: ARCHER{sub RT} was developed successfully to perform fast and accurate MC dose calculation for radiotherapy using PSFs and patient CT phantoms.« less

  12. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kartsaklis, Christos; Civario, G

    This paper discusses an ongoing progress regarding the development of a Java-based library for rapid kernel prototyping in NVIDIA PTX and PTX instruction scheduling. It is aimed at developers seeking total control of emitted PTX, highly parametric emission of, and tunable instruction reordering. It is primarily used for code development at ICHEC but is also hoped that NVIDIA GPU community will also find it beneficial.

  13. Convex Clustering: An Attractive Alternative to Hierarchical Clustering

    PubMed Central

    Chen, Gary K.; Chi, Eric C.; Ranola, John Michael O.; Lange, Kenneth

    2015-01-01

    The primary goal in cluster analysis is to discover natural groupings of objects. The field of cluster analysis is crowded with diverse methods that make special assumptions about data and address different scientific aims. Despite its shortcomings in accuracy, hierarchical clustering is the dominant clustering method in bioinformatics. Biologists find the trees constructed by hierarchical clustering visually appealing and in tune with their evolutionary perspective. Hierarchical clustering operates on multiple scales simultaneously. This is essential, for instance, in transcriptome data, where one may be interested in making qualitative inferences about how lower-order relationships like gene modules lead to higher-order relationships like pathways or biological processes. The recently developed method of convex clustering preserves the visual appeal of hierarchical clustering while ameliorating its propensity to make false inferences in the presence of outliers and noise. The solution paths generated by convex clustering reveal relationships between clusters that are hidden by static methods such as k-means clustering. The current paper derives and tests a novel proximal distance algorithm for minimizing the objective function of convex clustering. The algorithm separates parameters, accommodates missing data, and supports prior information on relationships. Our program CONVEXCLUSTER incorporating the algorithm is implemented on ATI and nVidia graphics processing units (GPUs) for maximal speed. Several biological examples illustrate the strengths of convex clustering and the ability of the proximal distance algorithm to handle high-dimensional problems. CONVEXCLUSTER can be freely downloaded from the UCLA Human Genetics web site at http://www.genetics.ucla.edu/software/ PMID:25965340

  14. ROI-Based On-Board Compression for Hyperspectral Remote Sensing Images on GPU.

    PubMed

    Giordano, Rossella; Guccione, Pietro

    2017-05-19

    In recent years, hyperspectral sensors for Earth remote sensing have become very popular. Such systems are able to provide the user with images having both spectral and spatial information. The current hyperspectral spaceborne sensors are able to capture large areas with increased spatial and spectral resolution. For this reason, the volume of acquired data needs to be reduced on board in order to avoid a low orbital duty cycle due to limited storage space. Recently, literature has focused the attention on efficient ways for on-board data compression. This topic is a challenging task due to the difficult environment (outer space) and due to the limited time, power and computing resources. Often, the hardware properties of Graphic Processing Units (GPU) have been adopted to reduce the processing time using parallel computing. The current work proposes a framework for on-board operation on a GPU, using NVIDIA's CUDA (Compute Unified Device Architecture) architecture. The algorithm aims at performing on-board compression using the target's related strategy. In detail, the main operations are: the automatic recognition of land cover types or detection of events in near real time in regions of interest (this is a user related choice) with an unsupervised classifier; the compression of specific regions with space-variant different bit rates including Principal Component Analysis (PCA), wavelet and arithmetic coding; and data volume management to the Ground Station. Experiments are provided using a real dataset taken from an AVIRIS (Airborne Visible/Infrared Imaging Spectrometer) airborne sensor in a harbor area.

  15. Realisation of a holographic microlaser scalpel using a digital micromirror device

    NASA Astrophysics Data System (ADS)

    Zwick, Susanne; Warber, Michael; Haist, Tobias; Osten, Wolfgang

    2007-06-01

    Modern spatial light modulators (SLM) enable the generation of more or less arbitrary light fields in three dimensions. Such light fields can be used for different future applications in the field of biomedical optics. One example is the processing/cutting of biological material on a microscopic scale. By displaying computer generated holograms by suitable SLMs it is possible to ablate complex structures into three-dimensional objects without scanning with very high accuracy on a microscopic scale. To effectively cut biological materials by light, pulsed ultraviolet light is preferable. We will present a combined setup of a holographic laser scalpel using a digital micromirror device (DMD) and holographic optical tweezers using a liquid crystal display (LCD). The setup enables to move and cut or process micro-scaled objects like biological cells or tissue in three dimensions with high accuracy and without any mechanical movements just by changing the hologram displayed by the SLMs. We will show that holograms can be used to compensate aberrations implemented by the DMD or other optical components of the setup. Also we can generate arbitrary light fields like stripes, circles or arbitrary curves. Additionally we will present results for the fast optimization of holograms for the system. In particular we will show results obtained by implementing iterative Fourier transform based algorithms on a standard consumer graphics board (Nvidia 8800GLX). By this approach we are able to compute more than 360 complex 2D FFTs (512 × 512 pixels) per second with floating point precision.

  16. GPU/MIC Acceleration of the LHC High Level Trigger to Extend the Physics Reach at the LHC

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Halyo, Valerie; Tully, Christopher

    The quest for rare new physics phenomena leads the PI [3] to propose evaluation of coprocessors based on Graphics Processing Units (GPUs) and the Intel Many Integrated Core (MIC) architecture for integration into the trigger system at LHC. This will require development of a new massively parallel implementation of the well known Combinatorial Track Finder which uses the Kalman Filter to accelerate processing of data from the silicon pixel and microstrip detectors and reconstruct the trajectory of all charged particles down to momentums of 100 MeV. It is expected to run at least one order of magnitude faster than anmore » equivalent algorithm on a quad core CPU for extreme pileup scenarios of 100 interactions per bunch crossing. The new tracking algorithms will be developed and optimized separately on the GPU and Intel MIC and then evaluated against each other for performance and power efficiency. The results will be used to project the cost of the proposed hardware architectures for the HLT server farm, taking into account the long term projections of the main vendors in the market (AMD, Intel, and NVIDIA) over the next 10 years. Extensive experience and familiarity of the PI with the LHC tracker and trigger requirements led to the development of a complementary tracking algorithm that is described in [arxiv: 1305.4855], [arxiv: 1309.6275] and preliminary results accepted to JINST.« less

  17. Convex clustering: an attractive alternative to hierarchical clustering.

    PubMed

    Chen, Gary K; Chi, Eric C; Ranola, John Michael O; Lange, Kenneth

    2015-05-01

    The primary goal in cluster analysis is to discover natural groupings of objects. The field of cluster analysis is crowded with diverse methods that make special assumptions about data and address different scientific aims. Despite its shortcomings in accuracy, hierarchical clustering is the dominant clustering method in bioinformatics. Biologists find the trees constructed by hierarchical clustering visually appealing and in tune with their evolutionary perspective. Hierarchical clustering operates on multiple scales simultaneously. This is essential, for instance, in transcriptome data, where one may be interested in making qualitative inferences about how lower-order relationships like gene modules lead to higher-order relationships like pathways or biological processes. The recently developed method of convex clustering preserves the visual appeal of hierarchical clustering while ameliorating its propensity to make false inferences in the presence of outliers and noise. The solution paths generated by convex clustering reveal relationships between clusters that are hidden by static methods such as k-means clustering. The current paper derives and tests a novel proximal distance algorithm for minimizing the objective function of convex clustering. The algorithm separates parameters, accommodates missing data, and supports prior information on relationships. Our program CONVEXCLUSTER incorporating the algorithm is implemented on ATI and nVidia graphics processing units (GPUs) for maximal speed. Several biological examples illustrate the strengths of convex clustering and the ability of the proximal distance algorithm to handle high-dimensional problems. CONVEXCLUSTER can be freely downloaded from the UCLA Human Genetics web site at http://www.genetics.ucla.edu/software/.

  18. A practical implementation of 3D TTI reverse time migration with multi-GPUs

    NASA Astrophysics Data System (ADS)

    Li, Chun; Liu, Guofeng; Li, Yihang

    2017-05-01

    Tilted transversely isotropic (TTI) media are typical earth anisotropy media from practical observational studies. Accurate anisotropic imaging is recognized as a breakthrough in areas with complex anisotropic structures. TTI reverse time migration (RTM) is an important method for these areas. However, P and SV waves are coupled together in the pseudo-acoustic wave equation. The SV wave is regarded as an artifact for RTM of the P wave. We adopt matching of the anisotropy parameters to suppress the SV artifacts. Another problem in the implementation of TTI RTM is instability of the numerical solution for a variably oriented axis of symmetry. We adopt Fletcher's equation by setting a small amount of SV velocity without an acoustic approximation to stabilize the wavefield propagation. To improve calculation efficiency, we use NVIDIA graphic processing unit (GPU) with compute unified device architecture instead of traditional CPU architecture. To accomplish this, we introduced a random velocity boundary and an extended homogeneous anisotropic boundary for the remaining four anisotropic parameters in the source propagation. This process avoids large storage memory and IO requirements, which is important when using a GPU with limited bandwidth of PCI-E. Furthermore, we extend the single GPU code to multi-GPUs and present a corresponding high concurrent strategy with multiple asynchronous streams, which closely achieved an ideal speedup ratio of 2:1 when compared with a single GPU. Synthetic tests validate the correctness and effectiveness of our multi-GPUs-based TTI RTM method.

  19. An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU

    DOE PAGES

    Lyakh, Dmitry I.

    2015-01-05

    An efficient parallel tensor transpose algorithm is suggested for shared-memory computing units, namely, multicore CPU, Intel Xeon Phi, and NVidia GPU. The algorithm operates on dense tensors (multidimensional arrays) and is based on the optimization of cache utilization on x86 CPU and the use of shared memory on NVidia GPU. From the applied side, the ultimate goal is to minimize the overhead encountered in the transformation of tensor contractions into matrix multiplications in computer implementations of advanced methods of quantum many-body theory (e.g., in electronic structure theory and nuclear physics). A particular accent is made on higher-dimensional tensors that typicallymore » appear in the so-called multireference correlated methods of electronic structure theory. Depending on tensor dimensionality, the presented optimized algorithms can achieve an order of magnitude speedup on x86 CPUs and 2-3 times speedup on NVidia Tesla K20X GPU with respect to the na ve scattering algorithm (no memory access optimization). Furthermore, the tensor transpose routines developed in this work have been incorporated into a general-purpose tensor algebra library (TAL-SH).« less

  20. Using the PhysX engine for physics-based virtual surgery with force feedback.

    PubMed

    Maciel, Anderson; Halic, Tansel; Lu, Zhonghua; Nedel, Luciana P; De, Suvranu

    2009-09-01

    The development of modern surgical simulators is highly challenging, as they must support complex simulation environments. The demand for higher realism in such simulators has driven researchers to adopt physics-based models, which are computationally very demanding. This poses a major problem, since real-time interactions must permit graphical updates of 30 Hz and a much higher rate of 1 kHz for force feedback (haptics). Recently several physics engines have been developed which offer multi-physics simulation capabilities, including rigid and deformable bodies, cloth and fluids. While such physics engines provide unique opportunities for the development of surgical simulators, their higher latencies, compared to what is necessary for real-time graphics and haptics, offer significant barriers to their use in interactive simulation environments. In this work, we propose solutions to this problem and demonstrate how a multimodal surgical simulation environment may be developed based on NVIDIA's PhysX physics library. Hence, models that are undergoing relatively low-frequency updates in PhysX can exist in an environment that demands much higher frequency updates for haptics. We use a collision handling layer to interface between the physical response provided by PhysX and the haptic rendering device to provide both real-time tissue response and force feedback. Our simulator integrates a bimanual haptic interface for force feedback and per-pixel shaders for graphics realism in real time. To demonstrate the effectiveness of our approach, we present the simulation of the laparoscopic adjustable gastric banding (LAGB) procedure as a case study. To develop complex and realistic surgical trainers with realistic organ geometries and tissue properties demands stable physics-based deformation methods, which are not always compatible with the interaction level required for such trainers. We have shown that combining different modelling strategies for behaviour, collision and graphics is possible and desirable. Such multimodal environments enable suitable rates to simulate the major steps of the LAGB procedure.

  1. Enabling Computational Dynamics in Distributed Computing Environments Using a Heterogeneous Computing Template

    DTIC Science & Technology

    2011-08-09

    fastest 10 supercomputers in the world. Both systems rely on GPU co-processing, one using AMD cards, the second, called Nebulae , using NVIDIA Tesla...Page 9 of 10 UNCLASSIFIED capability of almost 3 petaflop/s, the highest in TOP500, Nebulae only holds the No. 2 position on the TOP500 list of the

  2. Operational Based Vision Assessment

    DTIC Science & Technology

    2014-02-01

    formulated or supplied the drawings, specifications, or other data does not license the holder or any other person or corporation or convey any...expensive than other developers’ software. The sources for the GPUs ( Nvidia ) and the host computer (Concurrent’s iHawk) were identified. The...boundaries, which is a distracting artifact when performing visual tests. The problem has been isolated by the OBVA team to the Nvidia GPUs. The OBVA system

  3. Decryption-decompression of AES protected ZIP files on GPUs

    NASA Astrophysics Data System (ADS)

    Duong, Tan Nhat; Pham, Phong Hong; Nguyen, Duc Huu; Nguyen, Thuy Thanh; Le, Hung Duc

    2011-10-01

    AES is a strong encryption system, so decryption-decompression of AES encrypted ZIP files requires very large computing power and techniques of reducing the password space. This makes implementations of techniques on common computing system not practical. In [1], we reduced the original very large password search space to a much smaller one which surely containing the correct password. Based on reduced set of passwords, in this paper, we parallel decryption, decompression and plain text recognition for encrypted ZIP files by using CUDA computing technology on graphics cards GeForce GTX295 of NVIDIA, to find out the correct password. The experimental results have shown that the speed of decrypting, decompressing, recognizing plain text and finding out the original password increases about from 45 to 180 times (depends on the number of GPUs) compared to sequential execution on the Intel Core 2 Quad Q8400 2.66 GHz. These results have demonstrated the potential applicability of GPUs in this cryptanalysis field.

  4. A High Performance Computing Framework for Physics-based Modeling and Simulation of Military Ground Vehicles

    DTIC Science & Technology

    2011-03-25

    number one and Nebulae at number three. Both systems rely on GPU co-processing and use Intel Xeon processors cards and NVIDIA Tesla C2050 GPUs. In...spite of a theoretical peak capability of almost 3 Petaflop/s, Nebulae clocked at 1.271 PFlop/s when running the Linpack benchmark, which puts it

  5. High Resolution Imaging Testbed Utilizing Sodium Laser Guide Star Adaptive Optics: The Real Time Wavefront Reconstructor Computer

    DTIC Science & Technology

    2008-07-31

    Unlike the Lyrtech, each DSP on a Bittware board offers 3 MB of on-chip memory and 3 GFLOPs of 32-bit peak processing power. Based on the performance...Each NVIDIA 8800 Ultra features 576 GFLOPS on 128 612-MHz single-precision floating-point SIMD processors, arranged in 16 clusters of eight. Each

  6. Supercomputing with toys: harnessing the power of NVIDIA 8800GTX and playstation 3 for bioinformatics problem.

    PubMed

    Wilson, Justin; Dai, Manhong; Jakupovic, Elvis; Watson, Stanley; Meng, Fan

    2007-01-01

    Modern video cards and game consoles typically have much better performance to price ratios than that of general purpose CPUs. The parallel processing capabilities of game hardware are well-suited for high throughput biomedical data analysis. Our initial results suggest that game hardware is a cost-effective platform for some computationally demanding bioinformatics problems.

  7. Integrating the Nqueens Algorithm into a Parameterized Benchmark Suite

    DTIC Science & Technology

    2016-02-01

    FOB is a 64-node heterogeneous cluster consisting of 16-IBM dx360M4 nodes, each with one NVIDIA Kepler K20M GPUs and 48-IBM dx360M4 nodes, and each...nodes have 256-GB of memory and an NVIDIA Tesla K40 GPU. More details on Excalibur can be found on the US Army DSRC website.19 Figures 3 and 4 show the

  8. a method of gravity and seismic sequential inversion and its GPU implementation

    NASA Astrophysics Data System (ADS)

    Liu, G.; Meng, X.

    2011-12-01

    In this abstract, we introduce a gravity and seismic sequential inversion method to invert for density and velocity together. For the gravity inversion, we use an iterative method based on correlation imaging algorithm; for the seismic inversion, we use the full waveform inversion. The link between the density and velocity is an empirical formula called Gardner equation, for large volumes of data, we use the GPU to accelerate the computation. For the gravity inversion method , we introduce a method based on correlation imaging algorithm,it is also a interative method, first we calculate the correlation imaging of the observed gravity anomaly, it is some value between -1 and +1, then we multiply this value with a little density ,this value become the initial density model. We get a forward reuslt with this initial model and also calculate the correaltion imaging of the misfit of observed data and the forward data, also multiply the correaltion imaging result a little density and add it to the initial model, then do the same procedure above , at last ,we can get a inversion density model. For the seismic inveron method ,we use a mothod base on the linearity of acoustic wave equation written in the frequency domain,with a intial velociy model, we can get a good velocity result. In the sequential inversion of gravity and seismic , we need a link formula to convert between density and velocity ,in our method , we use the Gardner equation. Driven by the insatiable market demand for real time, high-definition 3D images, the programmable NVIDIA Graphic Processing Unit (GPU) as co-processor of CPU has been developed for high performance computing. Compute Unified Device Architecture (CUDA) is a parallel programming model and software environment provided by NVIDIA designed to overcome the challenge of using traditional general purpose GPU while maintaining a low learn curve for programmers familiar with standard programming languages such as C. In our inversion processing, we use the GPU to accelerate our gravity and seismic inversion. Taking the gravity inversion as an example, its kernels are gravity forward simulation and correlation imaging, after the parallelization in GPU, in 3D case,the inversion module, the original five CPU loops are reduced to three,the forward module the original five CPU loops are reduced to two. Acknowledgments We acknowledge the financial support of Sinoprobe project (201011039 and 201011049-03), the Fundamental Research Funds for the Central Universities (2010ZY26 and 2011PY0183), the National Natural Science Foundation of China (41074095) and the Open Project of State Key Laboratory of Geological Processes and Mineral Resources (GPMR0945).

  9. CUDA-based acceleration of collateral filtering in brain MR images

    NASA Astrophysics Data System (ADS)

    Li, Cheng-Yuan; Chang, Herng-Hua

    2017-02-01

    Image denoising is one of the fundamental and essential tasks within image processing. In medical imaging, finding an effective algorithm that can remove random noise in MR images is important. This paper proposes an effective noise reduction method for brain magnetic resonance (MR) images. Our approach is based on the collateral filter which is a more powerful method than the bilateral filter in many cases. However, the computation of the collateral filter algorithm is quite time-consuming. To solve this problem, we improved the collateral filter algorithm with parallel computing using GPU. We adopted CUDA, an application programming interface for GPU by NVIDIA, to accelerate the computation. Our experimental evaluation on an Intel Xeon CPU E5-2620 v3 2.40GHz with a NVIDIA Tesla K40c GPU indicated that the proposed implementation runs dramatically faster than the traditional collateral filter. We believe that the proposed framework has established a general blueprint for achieving fast and robust filtering in a wide variety of medical image denoising applications.

  10. Novel 3D/VR interactive environment for MD simulations, visualization and analysis.

    PubMed

    Doblack, Benjamin N; Allis, Tim; Dávila, Lilian P

    2014-12-18

    The increasing development of computing (hardware and software) in the last decades has impacted scientific research in many fields including materials science, biology, chemistry and physics among many others. A new computational system for the accurate and fast simulation and 3D/VR visualization of nanostructures is presented here, using the open-source molecular dynamics (MD) computer program LAMMPS. This alternative computational method uses modern graphics processors, NVIDIA CUDA technology and specialized scientific codes to overcome processing speed barriers common to traditional computing methods. In conjunction with a virtual reality system used to model materials, this enhancement allows the addition of accelerated MD simulation capability. The motivation is to provide a novel research environment which simultaneously allows visualization, simulation, modeling and analysis. The research goal is to investigate the structure and properties of inorganic nanostructures (e.g., silica glass nanosprings) under different conditions using this innovative computational system. The work presented outlines a description of the 3D/VR Visualization System and basic components, an overview of important considerations such as the physical environment, details on the setup and use of the novel system, a general procedure for the accelerated MD enhancement, technical information, and relevant remarks. The impact of this work is the creation of a unique computational system combining nanoscale materials simulation, visualization and interactivity in a virtual environment, which is both a research and teaching instrument at UC Merced.

  11. A GPU-accelerated semi-implicit fractional step method for numerical solutions of incompressible Navier-Stokes equations

    NASA Astrophysics Data System (ADS)

    Ha, Sanghyun; Park, Junshin; You, Donghyun

    2017-11-01

    Utility of the computational power of modern Graphics Processing Units (GPUs) is elaborated for solutions of incompressible Navier-Stokes equations which are integrated using a semi-implicit fractional-step method. Due to its serial and bandwidth-bound nature, the present choice of numerical methods is considered to be a good candidate for evaluating the potential of GPUs for solving Navier-Stokes equations using non-explicit time integration. An efficient algorithm is presented for GPU acceleration of the Alternating Direction Implicit (ADI) and the Fourier-transform-based direct solution method used in the semi-implicit fractional-step method. OpenMP is employed for concurrent collection of turbulence statistics on a CPU while Navier-Stokes equations are computed on a GPU. Extension to multiple NVIDIA GPUs is implemented using NVLink supported by the Pascal architecture. Performance of the present method is experimented on multiple Tesla P100 GPUs compared with a single-core Xeon E5-2650 v4 CPU in simulations of boundary-layer flow over a flat plate. Supported by the National Research Foundation of Korea (NRF) Grant funded by the Korea government (Ministry of Science, ICT and Future Planning NRF-2016R1E1A2A01939553, NRF-2014R1A2A1A11049599, and Ministry of Trade, Industry and Energy 201611101000230).

  12. Efficient Scalable Median Filtering Using Histogram-Based Operations.

    PubMed

    Green, Oded

    2018-05-01

    Median filtering is a smoothing technique for noise removal in images. While there are various implementations of median filtering for a single-core CPU, there are few implementations for accelerators and multi-core systems. Many parallel implementations of median filtering use a sorting algorithm for rearranging the values within a filtering window and taking the median of the sorted value. While using sorting algorithms allows for simple parallel implementations, the cost of the sorting becomes prohibitive as the filtering windows grow. This makes such algorithms, sequential and parallel alike, inefficient. In this work, we introduce the first software parallel median filtering that is non-sorting-based. The new algorithm uses efficient histogram-based operations. These reduce the computational requirements of the new algorithm while also accessing the image fewer times. We show an implementation of our algorithm for both the CPU and NVIDIA's CUDA supported graphics processing unit (GPU). The new algorithm is compared with several other leading CPU and GPU implementations. The CPU implementation has near perfect linear scaling with a speedup on a quad-core system. The GPU implementation is several orders of magnitude faster than the other GPU implementations for mid-size median filters. For small kernels, and , comparison-based approaches are preferable as fewer operations are required. Lastly, the new algorithm is open-source and can be found in the OpenCV library.

  13. Novel 3D/VR Interactive Environment for MD Simulations, Visualization and Analysis

    PubMed Central

    Doblack, Benjamin N.; Allis, Tim; Dávila, Lilian P.

    2014-01-01

    The increasing development of computing (hardware and software) in the last decades has impacted scientific research in many fields including materials science, biology, chemistry and physics among many others. A new computational system for the accurate and fast simulation and 3D/VR visualization of nanostructures is presented here, using the open-source molecular dynamics (MD) computer program LAMMPS. This alternative computational method uses modern graphics processors, NVIDIA CUDA technology and specialized scientific codes to overcome processing speed barriers common to traditional computing methods. In conjunction with a virtual reality system used to model materials, this enhancement allows the addition of accelerated MD simulation capability. The motivation is to provide a novel research environment which simultaneously allows visualization, simulation, modeling and analysis. The research goal is to investigate the structure and properties of inorganic nanostructures (e.g., silica glass nanosprings) under different conditions using this innovative computational system. The work presented outlines a description of the 3D/VR Visualization System and basic components, an overview of important considerations such as the physical environment, details on the setup and use of the novel system, a general procedure for the accelerated MD enhancement, technical information, and relevant remarks. The impact of this work is the creation of a unique computational system combining nanoscale materials simulation, visualization and interactivity in a virtual environment, which is both a research and teaching instrument at UC Merced. PMID:25549300

  14. Fully accelerating quantum Monte Carlo simulations of real materials on GPU clusters

    NASA Astrophysics Data System (ADS)

    Esler, Kenneth

    2011-03-01

    Quantum Monte Carlo (QMC) has proved to be an invaluable tool for predicting the properties of matter from fundamental principles, combining very high accuracy with extreme parallel scalability. By solving the many-body Schrödinger equation through a stochastic projection, it achieves greater accuracy than mean-field methods and better scaling with system size than quantum chemical methods, enabling scientific discovery across a broad spectrum of disciplines. In recent years, graphics processing units (GPUs) have provided a high-performance and low-cost new approach to scientific computing, and GPU-based supercomputers are now among the fastest in the world. The multiple forms of parallelism afforded by QMC algorithms make the method an ideal candidate for acceleration in the many-core paradigm. We present the results of porting the QMCPACK code to run on GPU clusters using the NVIDIA CUDA platform. Using mixed precision on GPUs and MPI for intercommunication, we observe typical full-application speedups of approximately 10x to 15x relative to quad-core CPUs alone, while reproducing the double-precision CPU results within statistical error. We discuss the algorithm modifications necessary to achieve good performance on this heterogeneous architecture and present the results of applying our code to molecules and bulk materials. Supported by the U.S. DOE under Contract No. DOE-DE-FG05-08OR23336 and by the NSF under No. 0904572.

  15. WARP

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Bergmann, Ryan M.; Rowland, Kelly L.

    2017-04-12

    WARP, which can stand for ``Weaving All the Random Particles,'' is a three-dimensional (3D) continuous energy Monte Carlo neutron transport code developed at UC Berkeley to efficiently execute on NVIDIA graphics processing unit (GPU) platforms. WARP accelerates Monte Carlo simulations while preserving the benefits of using the Monte Carlo method, namely, that very few physical and geometrical simplifications are applied. WARP is able to calculate multiplication factors, neutron flux distributions (in both space and energy), and fission source distributions for time-independent neutron transport problems. It can run in both criticality or fixed source modes, but fixed source mode is currentlymore » not robust, optimized, or maintained in the newest version. WARP can transport neutrons in unrestricted arrangements of parallelepipeds, hexagonal prisms, cylinders, and spheres. The goal of developing WARP is to investigate algorithms that can grow into a full-featured, continuous energy, Monte Carlo neutron transport code that is accelerated by running on GPUs. The crux of the effort is to make Monte Carlo calculations faster while producing accurate results. Modern supercomputers are commonly being built with GPU coprocessor cards in their nodes to increase their computational efficiency and performance. GPUs execute efficiently on data-parallel problems, but most CPU codes, including those for Monte Carlo neutral particle transport, are predominantly task-parallel. WARP uses a data-parallel neutron transport algorithm to take advantage of the computing power GPUs offer.« less

  16. homeSound: Real-Time Audio Event Detection Based on High Performance Computing for Behaviour and Surveillance Remote Monitoring.

    PubMed

    Alsina-Pagès, Rosa Ma; Navarro, Joan; Alías, Francesc; Hervás, Marcos

    2017-04-13

    The consistent growth in human life expectancy during the recent years has driven governments and private organizations to increase the efforts in caring for the eldest segment of the population. These institutions have built hospitals and retirement homes that have been rapidly overfilled, making their associated maintenance and operating costs prohibitive. The latest advances in technology and communications envisage new ways to monitor those people with special needs at their own home, increasing their quality of life in a cost-affordable way. The purpose of this paper is to present an Ambient Assisted Living (AAL) platform able to analyze, identify, and detect specific acoustic events happening in daily life environments, which enables the medic staff to remotely track the status of every patient in real-time. Additionally, this tele-care proposal is validated through a proof-of-concept experiment that takes benefit of the capabilities of the NVIDIA Graphical Processing Unit running on a Jetson TK1 board to locally detect acoustic events. Conducted experiments demonstrate the feasibility of this approach by reaching an overall accuracy of 82% when identifying a set of 14 indoor environment events related to the domestic surveillance and patients' behaviour monitoring field. Obtained results encourage practitioners to keep working in this direction, and enable health care providers to remotely track the status of their patients in real-time with non-invasive methods.

  17. homeSound: Real-Time Audio Event Detection Based on High Performance Computing for Behaviour and Surveillance Remote Monitoring

    PubMed Central

    Alsina-Pagès, Rosa Ma; Navarro, Joan; Alías, Francesc; Hervás, Marcos

    2017-01-01

    The consistent growth in human life expectancy during the recent years has driven governments and private organizations to increase the efforts in caring for the eldest segment of the population. These institutions have built hospitals and retirement homes that have been rapidly overfilled, making their associated maintenance and operating costs prohibitive. The latest advances in technology and communications envisage new ways to monitor those people with special needs at their own home, increasing their quality of life in a cost-affordable way. The purpose of this paper is to present an Ambient Assisted Living (AAL) platform able to analyze, identify, and detect specific acoustic events happening in daily life environments, which enables the medic staff to remotely track the status of every patient in real-time. Additionally, this tele-care proposal is validated through a proof-of-concept experiment that takes benefit of the capabilities of the NVIDIA Graphical Processing Unit running on a Jetson TK1 board to locally detect acoustic events. Conducted experiments demonstrate the feasibility of this approach by reaching an overall accuracy of 82% when identifying a set of 14 indoor environment events related to the domestic surveillance and patients’ behaviour monitoring field. Obtained results encourage practitioners to keep working in this direction, and enable health care providers to remotely track the status of their patients in real-time with non-invasive methods. PMID:28406459

  18. A GPU OpenCL based cross-platform Monte Carlo dose calculation engine (goMC)

    NASA Astrophysics Data System (ADS)

    Tian, Zhen; Shi, Feng; Folkerts, Michael; Qin, Nan; Jiang, Steve B.; Jia, Xun

    2015-09-01

    Monte Carlo (MC) simulation has been recognized as the most accurate dose calculation method for radiotherapy. However, the extremely long computation time impedes its clinical application. Recently, a lot of effort has been made to realize fast MC dose calculation on graphic processing units (GPUs). However, most of the GPU-based MC dose engines have been developed under NVidia’s CUDA environment. This limits the code portability to other platforms, hindering the introduction of GPU-based MC simulations to clinical practice. The objective of this paper is to develop a GPU OpenCL based cross-platform MC dose engine named goMC with coupled photon-electron simulation for external photon and electron radiotherapy in the MeV energy range. Compared to our previously developed GPU-based MC code named gDPM (Jia et al 2012 Phys. Med. Biol. 57 7783-97), goMC has two major differences. First, it was developed under the OpenCL environment for high code portability and hence could be run not only on different GPU cards but also on CPU platforms. Second, we adopted the electron transport model used in EGSnrc MC package and PENELOPE’s random hinge method in our new dose engine, instead of the dose planning method employed in gDPM. Dose distributions were calculated for a 15 MeV electron beam and a 6 MV photon beam in a homogenous water phantom, a water-bone-lung-water slab phantom and a half-slab phantom. Satisfactory agreement between the two MC dose engines goMC and gDPM was observed in all cases. The average dose differences in the regions that received a dose higher than 10% of the maximum dose were 0.48-0.53% for the electron beam cases and 0.15-0.17% for the photon beam cases. In terms of efficiency, goMC was ~4-16% slower than gDPM when running on the same NVidia TITAN card for all the cases we tested, due to both the different electron transport models and the different development environments. The code portability of our new dose engine goMC was validated by successfully running it on a variety of different computing devices including an NVidia GPU card, two AMD GPU cards and an Intel CPU processor. Computational efficiency among these platforms was compared.

  19. Accelerating numerical solution of stochastic differential equations with CUDA

    NASA Astrophysics Data System (ADS)

    Januszewski, M.; Kostur, M.

    2010-01-01

    Numerical integration of stochastic differential equations is commonly used in many branches of science. In this paper we present how to accelerate this kind of numerical calculations with popular NVIDIA Graphics Processing Units using the CUDA programming environment. We address general aspects of numerical programming on stream processors and illustrate them by two examples: the noisy phase dynamics in a Josephson junction and the noisy Kuramoto model. In presented cases the measured speedup can be as high as 675× compared to a typical CPU, which corresponds to several billion integration steps per second. This means that calculations which took weeks can now be completed in less than one hour. This brings stochastic simulation to a completely new level, opening for research a whole new range of problems which can now be solved interactively. Program summaryProgram title: SDE Catalogue identifier: AEFG_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEFG_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Gnu GPL v3 No. of lines in distributed program, including test data, etc.: 978 No. of bytes in distributed program, including test data, etc.: 5905 Distribution format: tar.gz Programming language: CUDA C Computer: any system with a CUDA-compatible GPU Operating system: Linux RAM: 64 MB of GPU memory Classification: 4.3 External routines: The program requires the NVIDIA CUDA Toolkit Version 2.0 or newer and the GNU Scientific Library v1.0 or newer. Optionally gnuplot is recommended for quick visualization of the results. Nature of problem: Direct numerical integration of stochastic differential equations is a computationally intensive problem, due to the necessity of calculating multiple independent realizations of the system. We exploit the inherent parallelism of this problem and perform the calculations on GPUs using the CUDA programming environment. The GPU's ability to execute hundreds of threads simultaneously makes it possible to speed up the computation by over two orders of magnitude, compared to a typical modern CPU. Solution method: The stochastic Runge-Kutta method of the second order is applied to integrate the equation of motion. Ensemble-averaged quantities of interest are obtained through averaging over multiple independent realizations of the system. Unusual features: The numerical solution of the stochastic differential equations in question is performed on a GPU using the CUDA environment. Running time: < 1 minute

  20. Musrfit-Real Time Parameter Fitting Using GPUs

    NASA Astrophysics Data System (ADS)

    Locans, Uldis; Suter, Andreas

    High transverse field μSR (HTF-μSR) experiments typically lead to a rather large data sets, since it is necessary to follow the high frequencies present in the positron decay histograms. The analysis of these data sets can be very time consuming, usually due to the limited computational power of the hardware. To overcome the limited computing resources rotating reference frame transformation (RRF) is often used to reduce the data sets that need to be handled. This comes at a price typically the μSR community is not aware of: (i) due to the RRF transformation the fitting parameter estimate is of poorer precision, i.e., more extended expensive beamtime is needed. (ii) RRF introduces systematic errors which hampers the statistical interpretation of χ2 or the maximum log-likelihood. We will briefly discuss these issues in a non-exhaustive practical way. The only and single purpose of the RRF transformation is the sluggish computer power. Therefore during this work GPU (Graphical Processing Units) based fitting was developed which allows to perform real-time full data analysis without RRF. GPUs have become increasingly popular in scientific computing in recent years. Due to their highly parallel architecture they provide the opportunity to accelerate many applications with considerably less costs than upgrading the CPU computational power. With the emergence of frameworks such as CUDA and OpenCL these devices have become more easily programmable. During this work GPU support was added to Musrfit- a data analysis framework for μSR experiments. The new fitting algorithm uses CUDA or OpenCL to offload the most time consuming parts of the calculations to Nvidia or AMD GPUs. Using the current CPU implementation in Musrfit parameter fitting can take hours for certain data sets while the GPU version can allow to perform real-time data analysis on the same data sets. This work describes the challenges that arise in adding the GPU support to t as well as results obtained using the GPU version. The speedups using the GPU were measured comparing to the CPU implementation. Two different GPUs were used for the comparison — high end Nvidia Tesla K40c GPU designed for HPC applications and AMD Radeon R9 390× GPU designed for gaming industry.

  1. Web-based, GPU-accelerated, Monte Carlo simulation and visualization of indirect radiation imaging detector performance

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Dong, Han; Sharma, Diksha; Badano, Aldo, E-mail: aldo.badano@fda.hhs.gov

    2014-12-15

    Purpose: Monte Carlo simulations play a vital role in the understanding of the fundamental limitations, design, and optimization of existing and emerging medical imaging systems. Efforts in this area have resulted in the development of a wide variety of open-source software packages. One such package, hybridMANTIS, uses a novel hybrid concept to model indirect scintillator detectors by balancing the computational load using dual CPU and graphics processing unit (GPU) processors, obtaining computational efficiency with reasonable accuracy. In this work, the authors describe two open-source visualization interfaces, webMANTIS and visualMANTIS to facilitate the setup of computational experiments via hybridMANTIS. Methods: Themore » visualization tools visualMANTIS and webMANTIS enable the user to control simulation properties through a user interface. In the case of webMANTIS, control via a web browser allows access through mobile devices such as smartphones or tablets. webMANTIS acts as a server back-end and communicates with an NVIDIA GPU computing cluster that can support multiuser environments where users can execute different experiments in parallel. Results: The output consists of point response and pulse-height spectrum, and optical transport statistics generated by hybridMANTIS. The users can download the output images and statistics through a zip file for future reference. In addition, webMANTIS provides a visualization window that displays a few selected optical photon path as they get transported through the detector columns and allows the user to trace the history of the optical photons. Conclusions: The visualization tools visualMANTIS and webMANTIS provide features such as on the fly generation of pulse-height spectra and response functions for microcolumnar x-ray imagers while allowing users to save simulation parameters and results from prior experiments. The graphical interfaces simplify the simulation setup and allow the user to go directly from specifying input parameters to receiving visual feedback for the model predictions.« less

  2. TU-AB-BRC-09: Fast Dose-Averaged LET and Biological Dose Calculations for Proton Therapy Using Graphics Cards

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Wan, H; Tseung, Chan; Beltran, C

    Purpose: To demonstrate fast and accurate Monte Carlo (MC) calculations of proton dose-averaged linear energy transfer (LETd) and biological dose (BD) on a Graphics Processing Unit (GPU) card. Methods: A previously validated GPU-based MC simulation of proton transport was used to rapidly generate LETd distributions for proton treatment plans. Since this MC handles proton-nuclei interactions on an event-by-event using a Bertini intranuclear cascade-evaporation model, secondary protons were taken into account. The smaller contributions of secondary neutrons and recoil nuclei were ignored. Recent work has shown that LETd values are sensitive to the scoring method. The GPU-based LETd calculations were verifiedmore » by comparing with a TOPAS custom scorer that uses tabulated stopping powers, following recommendations by other authors. Comparisons were made for prostate and head-and-neck patients. A python script is used to convert the MC-generated LETd distributions to BD using a variety of published linear quadratic models, and to export the BD in DICOM format for subsequent evaluation. Results: Very good agreement is obtained between TOPAS and our GPU MC. Given a complex head-and-neck plan with 1 mm voxel spacing, the physical dose, LETd and BD calculations for 10{sup 8} proton histories can be completed in ∼5 minutes using a NVIDIA Titan X card. The rapid turnover means that MC feedback can be obtained on dosimetric plan accuracy as well as BD hotspot locations, particularly in regards to their proximity to critical structures. In our institution the GPU MC-generated dose, LETd and BD maps are used to assess plan quality for all patients undergoing treatment. Conclusion: Fast and accurate MC-based LETd calculations can be performed on the GPU. The resulting BD maps provide valuable feedback during treatment plan review. Partially funded by Varian Medical Systems.« less

  3. Design Tools for Accelerating Development and Usage of Multi-Core Computing Platforms

    DTIC Science & Technology

    2014-04-01

    Government formulated or supplied the drawings, specifications, or other data does not license the holder or any other person or corporation ; or convey...multicore PDSP platforms. The GPU- based capabilities of TDIF are currently oriented towards NVIDIA GPUs, based on the Compute Unified Device Architecture...CUDA) programming language [ NVIDIA 2007], which can be viewed as an extension of C. The multicore PDSP capabilities currently in TDIF are oriented

  4. The RISC-V Instruction Set Manual Volume 2: Privileged Architecture Version 1.7

    DTIC Science & Technology

    2015-05-09

    DIG07-10227). Additional support came from Par Lab affiliates Nokia, NVIDIA , Oracle, and Samsung. • Project Isis: DoE Award DE-SC0003624. • ASPIRE...STARnet center funded by the Semiconductor Research Corporation . Additional sup- port from ASPIRE industrial sponsor, Intel, and ASPIRE affiliates...Google, Huawei, Nokia, NVIDIA , Oracle, and Samsung. The content of this paper does not necessarily reflect the position or the policy of the US

  5. Modeling & Analysis of Multicore Architectures for Embedded SIGINT Applications

    DTIC Science & Technology

    2015-03-01

    NVIDIA Kepler K20 [7][8] 2496e 706 225 3520 15.6 Intel Xeon Phi 5110P [9] 60 1050 225 1010 4.5 Adapteva Epiphany [10] 16 – 4K 800 0.270 19 70.4...Cortex A15 and a Kepler GPU with 192 “CUDA” cores, and is more comparable as an HPEEC platform than Tesla series GPUs, such as the NVIDIA C2075 and K20

  6. 2D to 3D conversion implemented in different hardware

    NASA Astrophysics Data System (ADS)

    Ramos-Diaz, Eduardo; Gonzalez-Huitron, Victor; Ponomaryov, Volodymyr I.; Hernandez-Fragoso, Araceli

    2015-02-01

    Conversion of available 2D data for release in 3D content is a hot topic for providers and for success of the 3D applications, in general. It naturally completely relies on virtual view synthesis of a second view given by original 2D video. Disparity map (DM) estimation is a central task in 3D generation but still follows a very difficult problem for rendering novel images precisely. There exist different approaches in DM reconstruction, among them manually and semiautomatic methods that can produce high quality DMs but they demonstrate hard time consuming and are computationally expensive. In this paper, several hardware implementations of designed frameworks for an automatic 3D color video generation based on 2D real video sequence are proposed. The novel framework includes simultaneous processing of stereo pairs using the following blocks: CIE L*a*b* color space conversions, stereo matching via pyramidal scheme, color segmentation by k-means on an a*b* color plane, and adaptive post-filtering, DM estimation using stereo matching between left and right images (or neighboring frames in a video), adaptive post-filtering, and finally, the anaglyph 3D scene generation. Novel technique has been implemented on DSP TMS320DM648, Matlab's Simulink module over a PC with Windows 7, and using graphic card (NVIDIA Quadro K2000) demonstrating that the proposed approach can be applied in real-time processing mode. The time values needed, mean Similarity Structural Index Measure (SSIM) and Bad Matching Pixels (B) values for different hardware implementations (GPU, Single CPU, and DSP) are exposed in this paper.

  7. A fast three-dimensional gamma evaluation using a GPU utilizing texture memory for on-the-fly interpolations.

    PubMed

    Persoon, Lucas C G G; Podesta, Mark; van Elmpt, Wouter J C; Nijsten, Sebastiaan M J J G; Verhaegen, Frank

    2011-07-01

    A widely accepted method to quantify differences in dose distributions is the gamma (gamma) evaluation. Currently, almost all gamma implementations utilize the central processing unit (CPU). Recently, the graphics processing unit (GPU) has become a powerful platform for specific computing tasks. In this study, we describe the implementation of a 3D gamma evaluation using a GPU to improve calculation time. The gamma evaluation algorithm was implemented on an NVIDIA Tesla C2050 GPU using the compute unified device architecture (CUDA). First, several cubic virtual phantoms were simulated. These phantoms were tested with varying dose cube sizes and set-ups, introducing artificial dose differences. Second, to show applicability in clinical practice, five patient cases have been evaluated using the 3D dose distribution from a treatment planning system as the reference and the delivered dose determined during treatment as the comparison. A calculation time comparison between the CPU and GPU was made with varying thread-block sizes including the option of using texture or global memory. A GPU over CPU speed-up of 66 +/- 12 was achieved for the virtual phantoms. For the patient cases, a speed-up of 57 +/- 15 using the GPU was obtained. A thread-block size of 16 x 16 performed best in all cases. The use of texture memory improved the total calculation time, especially when interpolation was applied. Differences between the CPU and GPU gammas were negligible. The GPU and its features, such as texture memory, decreased the calculation time for gamma evaluations considerably without loss of accuracy.

  8. A fast image registration approach of neural activities in light-sheet fluorescence microscopy images

    NASA Astrophysics Data System (ADS)

    Meng, Hui; Hui, Hui; Hu, Chaoen; Yang, Xin; Tian, Jie

    2017-03-01

    The ability of fast and single-neuron resolution imaging of neural activities enables light-sheet fluorescence microscopy (LSFM) as a powerful imaging technique in functional neural connection applications. The state-of-art LSFM imaging system can record the neuronal activities of entire brain for small animal, such as zebrafish or C. elegans at single-neuron resolution. However, the stimulated and spontaneous movements in animal brain result in inconsistent neuron positions during recording process. It is time consuming to register the acquired large-scale images with conventional method. In this work, we address the problem of fast registration of neural positions in stacks of LSFM images. This is necessary to register brain structures and activities. To achieve fast registration of neural activities, we present a rigid registration architecture by implementation of Graphics Processing Unit (GPU). In this approach, the image stacks were preprocessed on GPU by mean stretching to reduce the computation effort. The present image was registered to the previous image stack that considered as reference. A fast Fourier transform (FFT) algorithm was used for calculating the shift of the image stack. The calculations for image registration were performed in different threads while the preparation functionality was refactored and called only once by the master thread. We implemented our registration algorithm on NVIDIA Quadro K4200 GPU under Compute Unified Device Architecture (CUDA) programming environment. The experimental results showed that the registration computation can speed-up to 550ms for a full high-resolution brain image. Our approach also has potential to be used for other dynamic image registrations in biomedical applications.

  9. Routine Microsecond Molecular Dynamics Simulations with AMBER on GPUs. 1. Generalized Born

    PubMed Central

    2012-01-01

    We present an implementation of generalized Born implicit solvent all-atom classical molecular dynamics (MD) within the AMBER program package that runs entirely on CUDA enabled NVIDIA graphics processing units (GPUs). We discuss the algorithms that are used to exploit the processing power of the GPUs and show the performance that can be achieved in comparison to simulations on conventional CPU clusters. The implementation supports three different precision models in which the contributions to the forces are calculated in single precision floating point arithmetic but accumulated in double precision (SPDP), or everything is computed in single precision (SPSP) or double precision (DPDP). In addition to performance, we have focused on understanding the implications of the different precision models on the outcome of implicit solvent MD simulations. We show results for a range of tests including the accuracy of single point force evaluations and energy conservation as well as structural properties pertainining to protein dynamics. The numerical noise due to rounding errors within the SPSP precision model is sufficiently large to lead to an accumulation of errors which can result in unphysical trajectories for long time scale simulations. We recommend the use of the mixed-precision SPDP model since the numerical results obtained are comparable with those of the full double precision DPDP model and the reference double precision CPU implementation but at significantly reduced computational cost. Our implementation provides performance for GB simulations on a single desktop that is on par with, and in some cases exceeds, that of traditional supercomputers. PMID:22582031

  10. Communication Efficient Gaussian Elimination with Partial Pivoting using a Shape Morphing Data Layout

    DTIC Science & Technology

    2013-02-21

    support comes from ParLab affiliates National Instruments, Nokia, NVIDIA , Oracle and Samsung, as well as MathWorks. Research is also supported by DOE...affiliates National Instruments, Nokia, NVIDIA , Oracle and Samsung, as well as MathWorks. Research is also supported by DOE grants DE-SC0004938, DE-SC0005136...International Business Machines Company , 1966. [17] S. Toledo. Locality of reference in LU decomposition with partial pivoting. SIAM J. Matrix Anal. Appl., 18

  11. Proton Testing of nVidia Jetson TX1

    NASA Technical Reports Server (NTRS)

    Wyrwas, Edward J.

    2017-01-01

    Single-Event Effects (SEE) testing was conducted on the nVidia Jetson TX1 System on Chip (SOC); herein referred to as device under test (DUT). Testing was conducted at Massachusetts General Hospitals (MGH) Francis H. Burr Proton Therapy Center on October 16th, 2016 using 200MeV protons. This testing trip was purposed to provide a baseline assessment of the radiation susceptibility of the DUT as no previous testing had been conducted on this component.

  12. Particle In Cell Codes on Highly Parallel Architectures

    NASA Astrophysics Data System (ADS)

    Tableman, Adam

    2014-10-01

    We describe strategies and examples of Particle-In-Cell Codes running on Nvidia GPU and Intel Phi architectures. This includes basic implementations in skeletons codes and full-scale development versions (encompassing 1D, 2D, and 3D codes) in Osiris. Both the similarities and differences between Intel's and Nvidia's hardware will be examined. Work supported by grants NSF ACI 1339893, DOE DE SC 000849, DOE DE SC 0008316, DOE DE NA 0001833, and DOE DE FC02 04ER 54780.

  13. AESS: Accelerated Exact Stochastic Simulation

    NASA Astrophysics Data System (ADS)

    Jenkins, David D.; Peterson, Gregory D.

    2011-12-01

    The Stochastic Simulation Algorithm (SSA) developed by Gillespie provides a powerful mechanism for exploring the behavior of chemical systems with small species populations or with important noise contributions. Gene circuit simulations for systems biology commonly employ the SSA method, as do ecological applications. This algorithm tends to be computationally expensive, so researchers seek an efficient implementation of SSA. In this program package, the Accelerated Exact Stochastic Simulation Algorithm (AESS) contains optimized implementations of Gillespie's SSA that improve the performance of individual simulation runs or ensembles of simulations used for sweeping parameters or to provide statistically significant results. Program summaryProgram title: AESS Catalogue identifier: AEJW_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEJW_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: University of Tennessee copyright agreement No. of lines in distributed program, including test data, etc.: 10 861 No. of bytes in distributed program, including test data, etc.: 394 631 Distribution format: tar.gz Programming language: C for processors, CUDA for NVIDIA GPUs Computer: Developed and tested on various x86 computers and NVIDIA C1060 Tesla and GTX 480 Fermi GPUs. The system targets x86 workstations, optionally with multicore processors or NVIDIA GPUs as accelerators. Operating system: Tested under Ubuntu Linux OS and CentOS 5.5 Linux OS Classification: 3, 16.12 Nature of problem: Simulation of chemical systems, particularly with low species populations, can be accurately performed using Gillespie's method of stochastic simulation. Numerous variations on the original stochastic simulation algorithm have been developed, including approaches that produce results with statistics that exactly match the chemical master equation (CME) as well as other approaches that approximate the CME. Solution method: The Accelerated Exact Stochastic Simulation (AESS) tool provides implementations of a wide variety of popular variations on the Gillespie method. Users can select the specific algorithm considered most appropriate. Comparisons between the methods and with other available implementations indicate that AESS provides the fastest known implementation of Gillespie's method for a variety of test models. Users may wish to execute ensembles of simulations to sweep parameters or to obtain better statistical results, so AESS supports acceleration of ensembles of simulation using parallel processing with MPI, SSE vector units on x86 processors, and/or using NVIDIA GPUs with CUDA.

  14. NiftySim: A GPU-based nonlinear finite element package for simulation of soft tissue biomechanics.

    PubMed

    Johnsen, Stian F; Taylor, Zeike A; Clarkson, Matthew J; Hipwell, John; Modat, Marc; Eiben, Bjoern; Han, Lianghao; Hu, Yipeng; Mertzanidou, Thomy; Hawkes, David J; Ourselin, Sebastien

    2015-07-01

    NiftySim, an open-source finite element toolkit, has been designed to allow incorporation of high-performance soft tissue simulation capabilities into biomedical applications. The toolkit provides the option of execution on fast graphics processing unit (GPU) hardware, numerous constitutive models and solid-element options, membrane and shell elements, and contact modelling facilities, in a simple to use library. The toolkit is founded on the total Lagrangian explicit dynamics (TLEDs) algorithm, which has been shown to be efficient and accurate for simulation of soft tissues. The base code is written in C[Formula: see text], and GPU execution is achieved using the nVidia CUDA framework. In most cases, interaction with the underlying solvers can be achieved through a single Simulator class, which may be embedded directly in third-party applications such as, surgical guidance systems. Advanced capabilities such as contact modelling and nonlinear constitutive models are also provided, as are more experimental technologies like reduced order modelling. A consistent description of the underlying solution algorithm, its implementation with a focus on GPU execution, and examples of the toolkit's usage in biomedical applications are provided. Efficient mapping of the TLED algorithm to parallel hardware results in very high computational performance, far exceeding that available in commercial packages. The NiftySim toolkit provides high-performance soft tissue simulation capabilities using GPU technology for biomechanical simulation research applications in medical image computing, surgical simulation, and surgical guidance applications.

  15. Large Scale Document Inversion using a Multi-threaded Computing System

    PubMed Central

    Jung, Sungbo; Chang, Dar-Jen; Park, Juw Won

    2018-01-01

    Current microprocessor architecture is moving towards multi-core/multi-threaded systems. This trend has led to a surge of interest in using multi-threaded computing devices, such as the Graphics Processing Unit (GPU), for general purpose computing. We can utilize the GPU in computation as a massive parallel coprocessor because the GPU consists of multiple cores. The GPU is also an affordable, attractive, and user-programmable commodity. Nowadays a lot of information has been flooded into the digital domain around the world. Huge volume of data, such as digital libraries, social networking services, e-commerce product data, and reviews, etc., is produced or collected every moment with dramatic growth in size. Although the inverted index is a useful data structure that can be used for full text searches or document retrieval, a large number of documents will require a tremendous amount of time to create the index. The performance of document inversion can be improved by multi-thread or multi-core GPU. Our approach is to implement a linear-time, hash-based, single program multiple data (SPMD), document inversion algorithm on the NVIDIA GPU/CUDA programming platform utilizing the huge computational power of the GPU, to develop high performance solutions for document indexing. Our proposed parallel document inversion system shows 2-3 times faster performance than a sequential system on two different test datasets from PubMed abstract and e-commerce product reviews. CCS Concepts •Information systems➝Information retrieval • Computing methodologies➝Massively parallel and high-performance simulations. PMID:29861701

  16. Novel hybrid GPU-CPU implementation of parallelized Monte Carlo parametric expectation maximization estimation method for population pharmacokinetic data analysis.

    PubMed

    Ng, C M

    2013-10-01

    The development of a population PK/PD model, an essential component for model-based drug development, is both time- and labor-intensive. A graphical-processing unit (GPU) computing technology has been proposed and used to accelerate many scientific computations. The objective of this study was to develop a hybrid GPU-CPU implementation of parallelized Monte Carlo parametric expectation maximization (MCPEM) estimation algorithm for population PK data analysis. A hybrid GPU-CPU implementation of the MCPEM algorithm (MCPEMGPU) and identical algorithm that is designed for the single CPU (MCPEMCPU) were developed using MATLAB in a single computer equipped with dual Xeon 6-Core E5690 CPU and a NVIDIA Tesla C2070 GPU parallel computing card that contained 448 stream processors. Two different PK models with rich/sparse sampling design schemes were used to simulate population data in assessing the performance of MCPEMCPU and MCPEMGPU. Results were analyzed by comparing the parameter estimation and model computation times. Speedup factor was used to assess the relative benefit of parallelized MCPEMGPU over MCPEMCPU in shortening model computation time. The MCPEMGPU consistently achieved shorter computation time than the MCPEMCPU and can offer more than 48-fold speedup using a single GPU card. The novel hybrid GPU-CPU implementation of parallelized MCPEM algorithm developed in this study holds a great promise in serving as the core for the next-generation of modeling software for population PK/PD analysis.

  17. Parallelized Kalman-Filter-Based Reconstruction of Particle Tracks on Many-Core Processors and GPUs

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Cerati, Giuseppe; Elmer, Peter; Krutelyov, Slava

    2017-01-01

    For over a decade now, physical and energy constraints have limited clock speed improvements in commodity microprocessors. Instead, chipmakers have been pushed into producing lower-power, multi-core processors such as Graphical Processing Units (GPU), ARM CPUs, and Intel MICs. Broad-based efforts from manufacturers and developers have been devoted to making these processors user-friendly enough to perform general computations. However, extracting performance from a larger number of cores, as well as specialized vector or SIMD units, requires special care in algorithm design and code optimization. One of the most computationally challenging problems in high-energy particle experiments is finding and fitting the charged-particlemore » tracks during event reconstruction. This is expected to become by far the dominant problem at the High-Luminosity Large Hadron Collider (HL-LHC), for example. Today the most common track finding methods are those based on the Kalman filter. Experience with Kalman techniques on real tracking detector systems has shown that they are robust and provide high physics performance. This is why they are currently in use at the LHC, both in the trigger and offine. Previously we reported on the significant parallel speedups that resulted from our investigations to adapt Kalman filters to track fitting and track building on Intel Xeon and Xeon Phi. Here, we discuss our progresses toward the understanding of these processors and the new developments to port the Kalman filter to NVIDIA GPUs.« less

  18. Routine Microsecond Molecular Dynamics Simulations with AMBER on GPUs. 2. Explicit Solvent Particle Mesh Ewald.

    PubMed

    Salomon-Ferrer, Romelia; Götz, Andreas W; Poole, Duncan; Le Grand, Scott; Walker, Ross C

    2013-09-10

    We present an implementation of explicit solvent all atom classical molecular dynamics (MD) within the AMBER program package that runs entirely on CUDA-enabled GPUs. First released publicly in April 2010 as part of version 11 of the AMBER MD package and further improved and optimized over the last two years, this implementation supports the three most widely used statistical mechanical ensembles (NVE, NVT, and NPT), uses particle mesh Ewald (PME) for the long-range electrostatics, and runs entirely on CUDA-enabled NVIDIA graphics processing units (GPUs), providing results that are statistically indistinguishable from the traditional CPU version of the software and with performance that exceeds that achievable by the CPU version of AMBER software running on all conventional CPU-based clusters and supercomputers. We briefly discuss three different precision models developed specifically for this work (SPDP, SPFP, and DPDP) and highlight the technical details of the approach as it extends beyond previously reported work [Götz et al., J. Chem. Theory Comput. 2012, DOI: 10.1021/ct200909j; Le Grand et al., Comp. Phys. Comm. 2013, DOI: 10.1016/j.cpc.2012.09.022].We highlight the substantial improvements in performance that are seen over traditional CPU-only machines and provide validation of our implementation and precision models. We also provide evidence supporting our decision to deprecate the previously described fully single precision (SPSP) model from the latest release of the AMBER software package.

  19. Parallelized Kalman-Filter-Based Reconstruction of Particle Tracks on Many-Core Processors and GPUs

    NASA Astrophysics Data System (ADS)

    Cerati, Giuseppe; Elmer, Peter; Krutelyov, Slava; Lantz, Steven; Lefebvre, Matthieu; Masciovecchio, Mario; McDermott, Kevin; Riley, Daniel; Tadel, Matevž; Wittich, Peter; Würthwein, Frank; Yagil, Avi

    2017-08-01

    For over a decade now, physical and energy constraints have limited clock speed improvements in commodity microprocessors. Instead, chipmakers have been pushed into producing lower-power, multi-core processors such as Graphical Processing Units (GPU), ARM CPUs, and Intel MICs. Broad-based efforts from manufacturers and developers have been devoted to making these processors user-friendly enough to perform general computations. However, extracting performance from a larger number of cores, as well as specialized vector or SIMD units, requires special care in algorithm design and code optimization. One of the most computationally challenging problems in high-energy particle experiments is finding and fitting the charged-particle tracks during event reconstruction. This is expected to become by far the dominant problem at the High-Luminosity Large Hadron Collider (HL-LHC), for example. Today the most common track finding methods are those based on the Kalman filter. Experience with Kalman techniques on real tracking detector systems has shown that they are robust and provide high physics performance. This is why they are currently in use at the LHC, both in the trigger and offine. Previously we reported on the significant parallel speedups that resulted from our investigations to adapt Kalman filters to track fitting and track building on Intel Xeon and Xeon Phi. Here, we discuss our progresses toward the understanding of these processors and the new developments to port the Kalman filter to NVIDIA GPUs.

  20. Large Scale Document Inversion using a Multi-threaded Computing System.

    PubMed

    Jung, Sungbo; Chang, Dar-Jen; Park, Juw Won

    2017-06-01

    Current microprocessor architecture is moving towards multi-core/multi-threaded systems. This trend has led to a surge of interest in using multi-threaded computing devices, such as the Graphics Processing Unit (GPU), for general purpose computing. We can utilize the GPU in computation as a massive parallel coprocessor because the GPU consists of multiple cores. The GPU is also an affordable, attractive, and user-programmable commodity. Nowadays a lot of information has been flooded into the digital domain around the world. Huge volume of data, such as digital libraries, social networking services, e-commerce product data, and reviews, etc., is produced or collected every moment with dramatic growth in size. Although the inverted index is a useful data structure that can be used for full text searches or document retrieval, a large number of documents will require a tremendous amount of time to create the index. The performance of document inversion can be improved by multi-thread or multi-core GPU. Our approach is to implement a linear-time, hash-based, single program multiple data (SPMD), document inversion algorithm on the NVIDIA GPU/CUDA programming platform utilizing the huge computational power of the GPU, to develop high performance solutions for document indexing. Our proposed parallel document inversion system shows 2-3 times faster performance than a sequential system on two different test datasets from PubMed abstract and e-commerce product reviews. •Information systems➝Information retrieval • Computing methodologies➝Massively parallel and high-performance simulations.

  1. GeNN: a code generation framework for accelerated brain simulations

    NASA Astrophysics Data System (ADS)

    Yavuz, Esin; Turner, James; Nowotny, Thomas

    2016-01-01

    Large-scale numerical simulations of detailed brain circuit models are important for identifying hypotheses on brain functions and testing their consistency and plausibility. An ongoing challenge for simulating realistic models is, however, computational speed. In this paper, we present the GeNN (GPU-enhanced Neuronal Networks) framework, which aims to facilitate the use of graphics accelerators for computational models of large-scale neuronal networks to address this challenge. GeNN is an open source library that generates code to accelerate the execution of network simulations on NVIDIA GPUs, through a flexible and extensible interface, which does not require in-depth technical knowledge from the users. We present performance benchmarks showing that 200-fold speedup compared to a single core of a CPU can be achieved for a network of one million conductance based Hodgkin-Huxley neurons but that for other models the speedup can differ. GeNN is available for Linux, Mac OS X and Windows platforms. The source code, user manual, tutorials, Wiki, in-depth example projects and all other related information can be found on the project website http://genn-team.github.io/genn/.

  2. The Spherical Tokamak MEDUSA for Mexico

    NASA Astrophysics Data System (ADS)

    Ribeiro, C.; Salvador, M.; Gonzalez, J.; Munoz, O.; Tapia, A.; Arredondo, V.; Chavez, R.; Nieto, A.; Gonzalez, J.; Garza, A.; Estrada, I.; Jasso, E.; Acosta, C.; Briones, C.; Cavazos, G.; Martinez, J.; Morones, J.; Almaguer, J.; Fonck, R.

    2011-10-01

    The former spherical tokamak MEDUSA (Madison EDUcation Small Aspect.ratio tokamak, R < 0.14m, a < 0.10m, BT < 0.5T, Ip < 40kA, 3ms pulse) is currently being recomissioned at the Universidad Autónoma de Nuevo León, Mexico, as part of an agreement between the Faculties of Mech.-Elect. Eng. and Phy. Sci.-Maths. The main objective for having MEDUSA is to train students in plasma physics & technical related issues, aiming a full design of a medium size device (e.g. Tokamak-T). Details of technical modifications and a preliminary scientific programme will be presented. MEDUSA-MX will also benefit any developments in the existing Mexican Fusion Network. Strong liaison within national and international plasma physics communities is expected. New activities on plasma & engineering modeling are expected to be developed in parallel by using the existing facilities such as a multi-platform computer (Silicon Graphics Altix XE250, 128G RAM, 3.7TB HD, 2.7GHz, quad-core processor), ancillary graph system (NVIDIA Quadro FE 2000/1GB GDDR-5 PCI X16 128, 3.2GHz), and COMSOL Multiphysics-Solid Works programs.

  3. GeNN: a code generation framework for accelerated brain simulations.

    PubMed

    Yavuz, Esin; Turner, James; Nowotny, Thomas

    2016-01-07

    Large-scale numerical simulations of detailed brain circuit models are important for identifying hypotheses on brain functions and testing their consistency and plausibility. An ongoing challenge for simulating realistic models is, however, computational speed. In this paper, we present the GeNN (GPU-enhanced Neuronal Networks) framework, which aims to facilitate the use of graphics accelerators for computational models of large-scale neuronal networks to address this challenge. GeNN is an open source library that generates code to accelerate the execution of network simulations on NVIDIA GPUs, through a flexible and extensible interface, which does not require in-depth technical knowledge from the users. We present performance benchmarks showing that 200-fold speedup compared to a single core of a CPU can be achieved for a network of one million conductance based Hodgkin-Huxley neurons but that for other models the speedup can differ. GeNN is available for Linux, Mac OS X and Windows platforms. The source code, user manual, tutorials, Wiki, in-depth example projects and all other related information can be found on the project website http://genn-team.github.io/genn/.

  4. GeNN: a code generation framework for accelerated brain simulations

    PubMed Central

    Yavuz, Esin; Turner, James; Nowotny, Thomas

    2016-01-01

    Large-scale numerical simulations of detailed brain circuit models are important for identifying hypotheses on brain functions and testing their consistency and plausibility. An ongoing challenge for simulating realistic models is, however, computational speed. In this paper, we present the GeNN (GPU-enhanced Neuronal Networks) framework, which aims to facilitate the use of graphics accelerators for computational models of large-scale neuronal networks to address this challenge. GeNN is an open source library that generates code to accelerate the execution of network simulations on NVIDIA GPUs, through a flexible and extensible interface, which does not require in-depth technical knowledge from the users. We present performance benchmarks showing that 200-fold speedup compared to a single core of a CPU can be achieved for a network of one million conductance based Hodgkin-Huxley neurons but that for other models the speedup can differ. GeNN is available for Linux, Mac OS X and Windows platforms. The source code, user manual, tutorials, Wiki, in-depth example projects and all other related information can be found on the project website http://genn-team.github.io/genn/. PMID:26740369

  5. Speeding-up Bioinformatics Algorithms with Heterogeneous Architectures: Highly Heterogeneous Smith-Waterman (HHeterSW).

    PubMed

    Gálvez, Sergio; Ferusic, Adis; Esteban, Francisco J; Hernández, Pilar; Caballero, Juan A; Dorado, Gabriel

    2016-10-01

    The Smith-Waterman algorithm has a great sensitivity when used for biological sequence-database searches, but at the expense of high computing-power requirements. To overcome this problem, there are implementations in literature that exploit the different hardware-architectures available in a standard PC, such as GPU, CPU, and coprocessors. We introduce an application that splits the original database-search problem into smaller parts, resolves each of them by executing the most efficient implementations of the Smith-Waterman algorithms in different hardware architectures, and finally unifies the generated results. Using non-overlapping hardware allows simultaneous execution, and up to 2.58-fold performance gain, when compared with any other algorithm to search sequence databases. Even the performance of the popular BLAST heuristic is exceeded in 78% of the tests. The application has been tested with standard hardware: Intel i7-4820K CPU, Intel Xeon Phi 31S1P coprocessors, and nVidia GeForce GTX 960 graphics cards. An important increase in performance has been obtained in a wide range of situations, effectively exploiting the available hardware.

  6. DeepSite: protein-binding site predictor using 3D-convolutional neural networks.

    PubMed

    Jiménez, J; Doerr, S; Martínez-Rosell, G; Rose, A S; De Fabritiis, G

    2017-10-01

    An important step in structure-based drug design consists in the prediction of druggable binding sites. Several algorithms for detecting binding cavities, those likely to bind to a small drug compound, have been developed over the years by clever exploitation of geometric, chemical and evolutionary features of the protein. Here we present a novel knowledge-based approach that uses state-of-the-art convolutional neural networks, where the algorithm is learned by examples. In total, 7622 proteins from the scPDB database of binding sites have been evaluated using both a distance and a volumetric overlap approach. Our machine-learning based method demonstrates superior performance to two other competitive algorithmic strategies. DeepSite is freely available at www.playmolecule.org. Users can submit either a PDB ID or PDB file for pocket detection to our NVIDIA GPU-equipped servers through a WebGL graphical interface. gianni.defabritiis@upf.edu. Supplementary data are available at Bioinformatics online. © The Author (2017). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com

  7. Development of a GPU Compatible Version of the Fast Radiation Code RRTMG

    NASA Astrophysics Data System (ADS)

    Iacono, M. J.; Mlawer, E. J.; Berthiaume, D.; Cady-Pereira, K. E.; Suarez, M.; Oreopoulos, L.; Lee, D.

    2012-12-01

    The absorption of solar radiation and emission/absorption of thermal radiation are crucial components of the physics that drive Earth's climate and weather. Therefore, accurate radiative transfer calculations are necessary for realistic climate and weather simulations. Efficient radiation codes have been developed for this purpose, but their accuracy requirements still necessitate that as much as 30% of the computational time of a GCM is spent computing radiative fluxes and heating rates. The overall computational expense constitutes a limitation on a GCM's predictive ability if it becomes an impediment to adding new physics to or increasing the spatial and/or vertical resolution of the model. The emergence of Graphics Processing Unit (GPU) technology, which will allow the parallel computation of multiple independent radiative calculations in a GCM, will lead to a fundamental change in the competition between accuracy and speed. Processing time previously consumed by radiative transfer will now be available for the modeling of other processes, such as physics parameterizations, without any sacrifice in the accuracy of the radiative transfer. Furthermore, fast radiation calculations can be performed much more frequently and will allow the modeling of radiative effects of rapid changes in the atmosphere. The fast radiation code RRTMG, developed at Atmospheric and Environmental Research (AER), is utilized operationally in many dynamical models throughout the world. We will present the results from the first stage of an effort to create a version of the RRTMG radiation code designed to run efficiently in a GPU environment. This effort will focus on the RRTMG implementation in GEOS-5. RRTMG has an internal pseudo-spectral vector of length of order 100 that, when combined with the much greater length of the global horizontal grid vector from which the radiation code is called in GEOS-5, makes RRTMG/GEOS-5 particularly suited to achieving a significant speed improvement through GPU technology. This large number of independent cases will allow us to take full advantage of the computational power of the latest GPUs, ensuring that all thread cores in the GPU remain active, a key criterion for obtaining significant speedup. The CUDA (Compute Unified Device Architecture) Fortran compiler developed by PGI and Nvidia will allow us to construct this parallel implementation on the GPU while remaining in the Fortran language. This implementation will scale very well across various CUDA-supported GPUs such as the recently released Fermi Nvidia cards. We will present the computational speed improvements of the GPU-compatible code relative to the standard CPU-based RRTMG with respect to a very large and diverse suite of atmospheric profiles. This suite will also be utilized to demonstrate the minimal impact of the code restructuring on the accuracy of radiation calculations. The GPU-compatible version of RRTMG will be directly applicable to future versions of GEOS-5, but it is also likely to provide significant associated benefits for other GCMs that employ RRTMG.

  8. GeauxDock: Accelerating Structure-Based Virtual Screening with Heterogeneous Computing

    PubMed Central

    Fang, Ye; Ding, Yun; Feinstein, Wei P.; Koppelman, David M.; Moreno, Juana; Jarrell, Mark; Ramanujam, J.; Brylinski, Michal

    2016-01-01

    Computational modeling of drug binding to proteins is an integral component of direct drug design. Particularly, structure-based virtual screening is often used to perform large-scale modeling of putative associations between small organic molecules and their pharmacologically relevant protein targets. Because of a large number of drug candidates to be evaluated, an accurate and fast docking engine is a critical element of virtual screening. Consequently, highly optimized docking codes are of paramount importance for the effectiveness of virtual screening methods. In this communication, we describe the implementation, tuning and performance characteristics of GeauxDock, a recently developed molecular docking program. GeauxDock is built upon the Monte Carlo algorithm and features a novel scoring function combining physics-based energy terms with statistical and knowledge-based potentials. Developed specifically for heterogeneous computing platforms, the current version of GeauxDock can be deployed on modern, multi-core Central Processing Units (CPUs) as well as massively parallel accelerators, Intel Xeon Phi and NVIDIA Graphics Processing Unit (GPU). First, we carried out a thorough performance tuning of the high-level framework and the docking kernel to produce a fast serial code, which was then ported to shared-memory multi-core CPUs yielding a near-ideal scaling. Further, using Xeon Phi gives 1.9× performance improvement over a dual 10-core Xeon CPU, whereas the best GPU accelerator, GeForce GTX 980, achieves a speedup as high as 3.5×. On that account, GeauxDock can take advantage of modern heterogeneous architectures to considerably accelerate structure-based virtual screening applications. GeauxDock is open-sourced and publicly available at www.brylinski.org/geauxdock and https://figshare.com/articles/geauxdock_tar_gz/3205249. PMID:27420300

  9. Development of a Cloud Resolving Model for Heterogeneous Supercomputers

    NASA Astrophysics Data System (ADS)

    Sreepathi, S.; Norman, M. R.; Pal, A.; Hannah, W.; Ponder, C.

    2017-12-01

    A cloud resolving climate model is needed to reduce major systematic errors in climate simulations due to structural uncertainty in numerical treatments of convection - such as convective storm systems. This research describes the porting effort to enable SAM (System for Atmosphere Modeling) cloud resolving model on heterogeneous supercomputers using GPUs (Graphical Processing Units). We have isolated a standalone configuration of SAM that is targeted to be integrated into the DOE ACME (Accelerated Climate Modeling for Energy) Earth System model. We have identified key computational kernels from the model and offloaded them to a GPU using the OpenACC programming model. Furthermore, we are investigating various optimization strategies intended to enhance GPU utilization including loop fusion/fission, coalesced data access and loop refactoring to a higher abstraction level. We will present early performance results, lessons learned as well as optimization strategies. The computational platform used in this study is the Summitdev system, an early testbed that is one generation removed from Summit, the next leadership class supercomputer at Oak Ridge National Laboratory. The system contains 54 nodes wherein each node has 2 IBM POWER8 CPUs and 4 NVIDIA Tesla P100 GPUs. This work is part of a larger project, ACME-MMF component of the U.S. Department of Energy(DOE) Exascale Computing Project. The ACME-MMF approach addresses structural uncertainty in cloud processes by replacing traditional parameterizations with cloud resolving "superparameterization" within each grid cell of global climate model. Super-parameterization dramatically increases arithmetic intensity, making the MMF approach an ideal strategy to achieve good performance on emerging exascale computing architectures. The goal of the project is to integrate superparameterization into ACME, and explore its full potential to scientifically and computationally advance climate simulation and prediction.

  10. BEAGLE: an application programming interface and high-performance computing library for statistical phylogenetics.

    PubMed

    Ayres, Daniel L; Darling, Aaron; Zwickl, Derrick J; Beerli, Peter; Holder, Mark T; Lewis, Paul O; Huelsenbeck, John P; Ronquist, Fredrik; Swofford, David L; Cummings, Michael P; Rambaut, Andrew; Suchard, Marc A

    2012-01-01

    Phylogenetic inference is fundamental to our understanding of most aspects of the origin and evolution of life, and in recent years, there has been a concentration of interest in statistical approaches such as Bayesian inference and maximum likelihood estimation. Yet, for large data sets and realistic or interesting models of evolution, these approaches remain computationally demanding. High-throughput sequencing can yield data for thousands of taxa, but scaling to such problems using serial computing often necessitates the use of nonstatistical or approximate approaches. The recent emergence of graphics processing units (GPUs) provides an opportunity to leverage their excellent floating-point computational performance to accelerate statistical phylogenetic inference. A specialized library for phylogenetic calculation would allow existing software packages to make more effective use of available computer hardware, including GPUs. Adoption of a common library would also make it easier for other emerging computing architectures, such as field programmable gate arrays, to be used in the future. We present BEAGLE, an application programming interface (API) and library for high-performance statistical phylogenetic inference. The API provides a uniform interface for performing phylogenetic likelihood calculations on a variety of compute hardware platforms. The library includes a set of efficient implementations and can currently exploit hardware including GPUs using NVIDIA CUDA, central processing units (CPUs) with Streaming SIMD Extensions and related processor supplementary instruction sets, and multicore CPUs via OpenMP. To demonstrate the advantages of a common API, we have incorporated the library into several popular phylogenetic software packages. The BEAGLE library is free open source software licensed under the Lesser GPL and available from http://beagle-lib.googlecode.com. An example client program is available as public domain software.

  11. GeauxDock: Accelerating Structure-Based Virtual Screening with Heterogeneous Computing.

    PubMed

    Fang, Ye; Ding, Yun; Feinstein, Wei P; Koppelman, David M; Moreno, Juana; Jarrell, Mark; Ramanujam, J; Brylinski, Michal

    2016-01-01

    Computational modeling of drug binding to proteins is an integral component of direct drug design. Particularly, structure-based virtual screening is often used to perform large-scale modeling of putative associations between small organic molecules and their pharmacologically relevant protein targets. Because of a large number of drug candidates to be evaluated, an accurate and fast docking engine is a critical element of virtual screening. Consequently, highly optimized docking codes are of paramount importance for the effectiveness of virtual screening methods. In this communication, we describe the implementation, tuning and performance characteristics of GeauxDock, a recently developed molecular docking program. GeauxDock is built upon the Monte Carlo algorithm and features a novel scoring function combining physics-based energy terms with statistical and knowledge-based potentials. Developed specifically for heterogeneous computing platforms, the current version of GeauxDock can be deployed on modern, multi-core Central Processing Units (CPUs) as well as massively parallel accelerators, Intel Xeon Phi and NVIDIA Graphics Processing Unit (GPU). First, we carried out a thorough performance tuning of the high-level framework and the docking kernel to produce a fast serial code, which was then ported to shared-memory multi-core CPUs yielding a near-ideal scaling. Further, using Xeon Phi gives 1.9× performance improvement over a dual 10-core Xeon CPU, whereas the best GPU accelerator, GeForce GTX 980, achieves a speedup as high as 3.5×. On that account, GeauxDock can take advantage of modern heterogeneous architectures to considerably accelerate structure-based virtual screening applications. GeauxDock is open-sourced and publicly available at www.brylinski.org/geauxdock and https://figshare.com/articles/geauxdock_tar_gz/3205249.

  12. BROCCOLI: Software for fast fMRI analysis on many-core CPUs and GPUs

    PubMed Central

    Eklund, Anders; Dufort, Paul; Villani, Mattias; LaConte, Stephen

    2014-01-01

    Analysis of functional magnetic resonance imaging (fMRI) data is becoming ever more computationally demanding as temporal and spatial resolutions improve, and large, publicly available data sets proliferate. Moreover, methodological improvements in the neuroimaging pipeline, such as non-linear spatial normalization, non-parametric permutation tests and Bayesian Markov Chain Monte Carlo approaches, can dramatically increase the computational burden. Despite these challenges, there do not yet exist any fMRI software packages which leverage inexpensive and powerful graphics processing units (GPUs) to perform these analyses. Here, we therefore present BROCCOLI, a free software package written in OpenCL (Open Computing Language) that can be used for parallel analysis of fMRI data on a large variety of hardware configurations. BROCCOLI has, for example, been tested with an Intel CPU, an Nvidia GPU, and an AMD GPU. These tests show that parallel processing of fMRI data can lead to significantly faster analysis pipelines. This speedup can be achieved on relatively standard hardware, but further, dramatic speed improvements require only a modest investment in GPU hardware. BROCCOLI (running on a GPU) can perform non-linear spatial normalization to a 1 mm3 brain template in 4–6 s, and run a second level permutation test with 10,000 permutations in about a minute. These non-parametric tests are generally more robust than their parametric counterparts, and can also enable more sophisticated analyses by estimating complicated null distributions. Additionally, BROCCOLI includes support for Bayesian first-level fMRI analysis using a Gibbs sampler. The new software is freely available under GNU GPL3 and can be downloaded from github (https://github.com/wanderine/BROCCOLI/). PMID:24672471

  13. Left ventricle segmentation in cardiac MRI images using fully convolutional neural networks

    NASA Astrophysics Data System (ADS)

    Vázquez Romaguera, Liset; Costa, Marly Guimarães Fernandes; Romero, Francisco Perdigón; Costa Filho, Cicero Ferreira Fernandes

    2017-03-01

    According to the World Health Organization, cardiovascular diseases are the leading cause of death worldwide, accounting for 17.3 million deaths per year, a number that is expected to grow to more than 23.6 million by 2030. Most cardiac pathologies involve the left ventricle; therefore, estimation of several functional parameters from a previous segmentation of this structure can be helpful in diagnosis. Manual delineation is a time consuming and tedious task that is also prone to high intra and inter-observer variability. Thus, there exists a need for automated cardiac segmentation method to help facilitate the diagnosis of cardiovascular diseases. In this work we propose a deep fully convolutional neural network architecture to address this issue and assess its performance. The model was trained end to end in a supervised learning stage from whole cardiac MRI images input and ground truth to make a per pixel classification. For its design, development and experimentation was used Caffe deep learning framework over an NVidia Quadro K4200 Graphics Processing Unit. The net architecture is: Conv64-ReLU (2x) - MaxPooling - Conv128-ReLU (2x) - MaxPooling - Conv256-ReLU (2x) - MaxPooling - Conv512-ReLu-Dropout (2x) - Conv2-ReLU - Deconv - Crop - Softmax. Training and testing processes were carried out using 5-fold cross validation with short axis cardiac magnetic resonance images from Sunnybrook Database. We obtained a Dice score of 0.92 and 0.90, Hausdorff distance of 4.48 and 5.43, Jaccard index of 0.97 and 0.97, sensitivity of 0.92 and 0.90 and specificity of 0.99 and 0.99, overall mean values with SGD and RMSProp, respectively.

  14. BEAGLE: An Application Programming Interface and High-Performance Computing Library for Statistical Phylogenetics

    PubMed Central

    Ayres, Daniel L.; Darling, Aaron; Zwickl, Derrick J.; Beerli, Peter; Holder, Mark T.; Lewis, Paul O.; Huelsenbeck, John P.; Ronquist, Fredrik; Swofford, David L.; Cummings, Michael P.; Rambaut, Andrew; Suchard, Marc A.

    2012-01-01

    Abstract Phylogenetic inference is fundamental to our understanding of most aspects of the origin and evolution of life, and in recent years, there has been a concentration of interest in statistical approaches such as Bayesian inference and maximum likelihood estimation. Yet, for large data sets and realistic or interesting models of evolution, these approaches remain computationally demanding. High-throughput sequencing can yield data for thousands of taxa, but scaling to such problems using serial computing often necessitates the use of nonstatistical or approximate approaches. The recent emergence of graphics processing units (GPUs) provides an opportunity to leverage their excellent floating-point computational performance to accelerate statistical phylogenetic inference. A specialized library for phylogenetic calculation would allow existing software packages to make more effective use of available computer hardware, including GPUs. Adoption of a common library would also make it easier for other emerging computing architectures, such as field programmable gate arrays, to be used in the future. We present BEAGLE, an application programming interface (API) and library for high-performance statistical phylogenetic inference. The API provides a uniform interface for performing phylogenetic likelihood calculations on a variety of compute hardware platforms. The library includes a set of efficient implementations and can currently exploit hardware including GPUs using NVIDIA CUDA, central processing units (CPUs) with Streaming SIMD Extensions and related processor supplementary instruction sets, and multicore CPUs via OpenMP. To demonstrate the advantages of a common API, we have incorporated the library into several popular phylogenetic software packages. The BEAGLE library is free open source software licensed under the Lesser GPL and available from http://beagle-lib.googlecode.com. An example client program is available as public domain software. PMID:21963610

  15. Scaling Deep Learning Workloads: NVIDIA DGX-1/Pascal and Intel Knights Landing

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Gawande, Nitin A.; Landwehr, Joshua B.; Daily, Jeffrey A.

    Deep Learning (DL) algorithms have become ubiquitous in data analytics. As a result, major computing vendors --- including NVIDIA, Intel, AMD and IBM --- have architectural road-maps influenced by DL workloads. Furthermore, several vendors have recently advertised new computing products as accelerating DL workloads. Unfortunately, it is difficult for data scientists to quantify the potential of these different products. This paper provides a performance and power analysis of important DL workloads on two major parallel architectures: NVIDIA DGX-1 (eight Pascal P100 GPUs interconnected with NVLink) and Intel Knights Landing (KNL) CPUs interconnected with Intel Omni-Path. Our evaluation consists of amore » cross section of convolutional neural net workloads: CifarNet, CaffeNet, AlexNet and GoogleNet topologies using the Cifar10 and ImageNet datasets. The workloads are vendor optimized for each architecture. GPUs provide the highest overall raw performance. Our analysis indicates that although GPUs provide the highest overall performance, the gap can close for some convolutional networks; and KNL can be competitive when considering performance/watt. Furthermore, NVLink is critical to GPU scaling.« less

  16. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Lyakh, Dmitry I.

    An efficient parallel tensor transpose algorithm is suggested for shared-memory computing units, namely, multicore CPU, Intel Xeon Phi, and NVidia GPU. The algorithm operates on dense tensors (multidimensional arrays) and is based on the optimization of cache utilization on x86 CPU and the use of shared memory on NVidia GPU. From the applied side, the ultimate goal is to minimize the overhead encountered in the transformation of tensor contractions into matrix multiplications in computer implementations of advanced methods of quantum many-body theory (e.g., in electronic structure theory and nuclear physics). A particular accent is made on higher-dimensional tensors that typicallymore » appear in the so-called multireference correlated methods of electronic structure theory. Depending on tensor dimensionality, the presented optimized algorithms can achieve an order of magnitude speedup on x86 CPUs and 2-3 times speedup on NVidia Tesla K20X GPU with respect to the na ve scattering algorithm (no memory access optimization). Furthermore, the tensor transpose routines developed in this work have been incorporated into a general-purpose tensor algebra library (TAL-SH).« less

  17. Multi-GPU configuration of 4D intensity modulated radiation therapy inverse planning using global optimization

    NASA Astrophysics Data System (ADS)

    Hagan, Aaron; Sawant, Amit; Folkerts, Michael; Modiri, Arezoo

    2018-01-01

    We report on the design, implementation and characterization of a multi-graphic processing unit (GPU) computational platform for higher-order optimization in radiotherapy treatment planning. In collaboration with a commercial vendor (Varian Medical Systems, Palo Alto, CA), a research prototype GPU-enabled Eclipse (V13.6) workstation was configured. The hardware consisted of dual 8-core Xeon processors, 256 GB RAM and four NVIDIA Tesla K80 general purpose GPUs. We demonstrate the utility of this platform for large radiotherapy optimization problems through the development and characterization of a parallelized particle swarm optimization (PSO) four dimensional (4D) intensity modulated radiation therapy (IMRT) technique. The PSO engine was coupled to the Eclipse treatment planning system via a vendor-provided scripting interface. Specific challenges addressed in this implementation were (i) data management and (ii) non-uniform memory access (NUMA). For the former, we alternated between parameters over which the computation process was parallelized. For the latter, we reduced the amount of data required to be transferred over the NUMA bridge. The datasets examined in this study were approximately 300 GB in size, including 4D computed tomography images, anatomical structure contours and dose deposition matrices. For evaluation, we created a 4D-IMRT treatment plan for one lung cancer patient and analyzed computation speed while varying several parameters (number of respiratory phases, GPUs, PSO particles, and data matrix sizes). The optimized 4D-IMRT plan enhanced sparing of organs at risk by an average reduction of 26% in maximum dose, compared to the clinical optimized IMRT plan, where the internal target volume was used. We validated our computation time analyses in two additional cases. The computation speed in our implementation did not monotonically increase with the number of GPUs. The optimal number of GPUs (five, in our study) is directly related to the hardware specifications. The optimization process took 35 min using 50 PSO particles, 25 iterations and 5 GPUs.

  18. Multi-GPU configuration of 4D intensity modulated radiation therapy inverse planning using global optimization.

    PubMed

    Hagan, Aaron; Sawant, Amit; Folkerts, Michael; Modiri, Arezoo

    2018-01-16

    We report on the design, implementation and characterization of a multi-graphic processing unit (GPU) computational platform for higher-order optimization in radiotherapy treatment planning. In collaboration with a commercial vendor (Varian Medical Systems, Palo Alto, CA), a research prototype GPU-enabled Eclipse (V13.6) workstation was configured. The hardware consisted of dual 8-core Xeon processors, 256 GB RAM and four NVIDIA Tesla K80 general purpose GPUs. We demonstrate the utility of this platform for large radiotherapy optimization problems through the development and characterization of a parallelized particle swarm optimization (PSO) four dimensional (4D) intensity modulated radiation therapy (IMRT) technique. The PSO engine was coupled to the Eclipse treatment planning system via a vendor-provided scripting interface. Specific challenges addressed in this implementation were (i) data management and (ii) non-uniform memory access (NUMA). For the former, we alternated between parameters over which the computation process was parallelized. For the latter, we reduced the amount of data required to be transferred over the NUMA bridge. The datasets examined in this study were approximately 300 GB in size, including 4D computed tomography images, anatomical structure contours and dose deposition matrices. For evaluation, we created a 4D-IMRT treatment plan for one lung cancer patient and analyzed computation speed while varying several parameters (number of respiratory phases, GPUs, PSO particles, and data matrix sizes). The optimized 4D-IMRT plan enhanced sparing of organs at risk by an average reduction of [Formula: see text] in maximum dose, compared to the clinical optimized IMRT plan, where the internal target volume was used. We validated our computation time analyses in two additional cases. The computation speed in our implementation did not monotonically increase with the number of GPUs. The optimal number of GPUs (five, in our study) is directly related to the hardware specifications. The optimization process took 35 min using 50 PSO particles, 25 iterations and 5 GPUs.

  19. High Performance Real-Time Visualization of Voluminous Scientific Data Through the NOAA Earth Information System (NEIS).

    NASA Astrophysics Data System (ADS)

    Stewart, J.; Hackathorn, E. J.; Joyce, J.; Smith, J. S.

    2014-12-01

    Within our community data volume is rapidly expanding. These data have limited value if one cannot interact or visualize the data in a timely manner. The scientific community needs the ability to dynamically visualize, analyze, and interact with these data along with other environmental data in real-time regardless of the physical location or data format. Within the National Oceanic Atmospheric Administration's (NOAA's), the Earth System Research Laboratory (ESRL) is actively developing the NOAA Earth Information System (NEIS). Previously, the NEIS team investigated methods of data discovery and interoperability. The recent focus shifted to high performance real-time visualization allowing NEIS to bring massive amounts of 4-D data, including output from weather forecast models as well as data from different observations (surface obs, upper air, etc...) in one place. Our server side architecture provides a real-time stream processing system which utilizes server based NVIDIA Graphical Processing Units (GPU's) for data processing, wavelet based compression, and other preparation techniques for visualization, allows NEIS to minimize the bandwidth and latency for data delivery to end-users. Client side, users interact with NEIS services through the visualization application developed at ESRL called TerraViz. Terraviz is developed using the Unity game engine and takes advantage of the GPU's allowing a user to interact with large data sets in real time that might not have been possible before. Through these technologies, the NEIS team has improved accessibility to 'Big Data' along with providing tools allowing novel visualization and seamless integration of data across time and space regardless of data size, physical location, or data format. These capabilities provide the ability to see the global interactions and their importance for weather prediction. Additionally, they allow greater access than currently exists helping to foster scientific collaboration and new ideas. This presentation will provide an update of the recent enhancements of the NEIS architecture and visualization capabilities, challenges faced, as well as ongoing research activities related to this project.

  20. Fast skin dose estimation system for interventional radiology

    PubMed Central

    Takata, Takeshi; Kotoku, Jun’ichi; Maejima, Hideyuki; Kumagai, Shinobu; Arai, Norikazu; Kobayashi, Takenori; Shiraishi, Kenshiro; Yamamoto, Masayoshi; Kondo, Hiroshi; Furui, Shigeru

    2018-01-01

    Abstract To minimise the radiation dermatitis related to interventional radiology (IR), rapid and accurate dose estimation has been sought for all procedures. We propose a technique for estimating the patient skin dose rapidly and accurately using Monte Carlo (MC) simulation with a graphical processing unit (GPU, GTX 1080; Nvidia Corp.). The skin dose distribution is simulated based on an individual patient’s computed tomography (CT) dataset for fluoroscopic conditions after the CT dataset has been segmented into air, water and bone based on pixel values. The skin is assumed to be one layer at the outer surface of the body. Fluoroscopic conditions are obtained from a log file of a fluoroscopic examination. Estimating the absorbed skin dose distribution requires calibration of the dose simulated by our system. For this purpose, a linear function was used to approximate the relation between the simulated dose and the measured dose using radiophotoluminescence (RPL) glass dosimeters in a water-equivalent phantom. Differences of maximum skin dose between our system and the Particle and Heavy Ion Transport code System (PHITS) were as high as 6.1%. The relative statistical error (2 σ) for the simulated dose obtained using our system was ≤3.5%. Using a GPU, the simulation on the chest CT dataset aiming at the heart was within 3.49 s on average: the GPU is 122 times faster than a CPU (Core i7–7700K; Intel Corp.). Our system (using the GPU, the log file, and the CT dataset) estimated the skin dose more rapidly and more accurately than conventional methods. PMID:29136194

  1. Fast skin dose estimation system for interventional radiology.

    PubMed

    Takata, Takeshi; Kotoku, Jun'ichi; Maejima, Hideyuki; Kumagai, Shinobu; Arai, Norikazu; Kobayashi, Takenori; Shiraishi, Kenshiro; Yamamoto, Masayoshi; Kondo, Hiroshi; Furui, Shigeru

    2018-03-01

    To minimise the radiation dermatitis related to interventional radiology (IR), rapid and accurate dose estimation has been sought for all procedures. We propose a technique for estimating the patient skin dose rapidly and accurately using Monte Carlo (MC) simulation with a graphical processing unit (GPU, GTX 1080; Nvidia Corp.). The skin dose distribution is simulated based on an individual patient's computed tomography (CT) dataset for fluoroscopic conditions after the CT dataset has been segmented into air, water and bone based on pixel values. The skin is assumed to be one layer at the outer surface of the body. Fluoroscopic conditions are obtained from a log file of a fluoroscopic examination. Estimating the absorbed skin dose distribution requires calibration of the dose simulated by our system. For this purpose, a linear function was used to approximate the relation between the simulated dose and the measured dose using radiophotoluminescence (RPL) glass dosimeters in a water-equivalent phantom. Differences of maximum skin dose between our system and the Particle and Heavy Ion Transport code System (PHITS) were as high as 6.1%. The relative statistical error (2 σ) for the simulated dose obtained using our system was ≤3.5%. Using a GPU, the simulation on the chest CT dataset aiming at the heart was within 3.49 s on average: the GPU is 122 times faster than a CPU (Core i7-7700K; Intel Corp.). Our system (using the GPU, the log file, and the CT dataset) estimated the skin dose more rapidly and more accurately than conventional methods.

  2. Heterogeneous computing architecture for fast detection of SNP-SNP interactions.

    PubMed

    Sluga, Davor; Curk, Tomaz; Zupan, Blaz; Lotric, Uros

    2014-06-25

    The extent of data in a typical genome-wide association study (GWAS) poses considerable computational challenges to software tools for gene-gene interaction discovery. Exhaustive evaluation of all interactions among hundreds of thousands to millions of single nucleotide polymorphisms (SNPs) may require weeks or even months of computation. Massively parallel hardware within a modern Graphic Processing Unit (GPU) and Many Integrated Core (MIC) coprocessors can shorten the run time considerably. While the utility of GPU-based implementations in bioinformatics has been well studied, MIC architecture has been introduced only recently and may provide a number of comparative advantages that have yet to be explored and tested. We have developed a heterogeneous, GPU and Intel MIC-accelerated software module for SNP-SNP interaction discovery to replace the previously single-threaded computational core in the interactive web-based data exploration program SNPsyn. We report on differences between these two modern massively parallel architectures and their software environments. Their utility resulted in an order of magnitude shorter execution times when compared to the single-threaded CPU implementation. GPU implementation on a single Nvidia Tesla K20 runs twice as fast as that for the MIC architecture-based Xeon Phi P5110 coprocessor, but also requires considerably more programming effort. General purpose GPUs are a mature platform with large amounts of computing power capable of tackling inherently parallel problems, but can prove demanding for the programmer. On the other hand the new MIC architecture, albeit lacking in performance reduces the programming effort and makes it up with a more general architecture suitable for a wider range of problems.

  3. Heterogeneous computing architecture for fast detection of SNP-SNP interactions

    PubMed Central

    2014-01-01

    Background The extent of data in a typical genome-wide association study (GWAS) poses considerable computational challenges to software tools for gene-gene interaction discovery. Exhaustive evaluation of all interactions among hundreds of thousands to millions of single nucleotide polymorphisms (SNPs) may require weeks or even months of computation. Massively parallel hardware within a modern Graphic Processing Unit (GPU) and Many Integrated Core (MIC) coprocessors can shorten the run time considerably. While the utility of GPU-based implementations in bioinformatics has been well studied, MIC architecture has been introduced only recently and may provide a number of comparative advantages that have yet to be explored and tested. Results We have developed a heterogeneous, GPU and Intel MIC-accelerated software module for SNP-SNP interaction discovery to replace the previously single-threaded computational core in the interactive web-based data exploration program SNPsyn. We report on differences between these two modern massively parallel architectures and their software environments. Their utility resulted in an order of magnitude shorter execution times when compared to the single-threaded CPU implementation. GPU implementation on a single Nvidia Tesla K20 runs twice as fast as that for the MIC architecture-based Xeon Phi P5110 coprocessor, but also requires considerably more programming effort. Conclusions General purpose GPUs are a mature platform with large amounts of computing power capable of tackling inherently parallel problems, but can prove demanding for the programmer. On the other hand the new MIC architecture, albeit lacking in performance reduces the programming effort and makes it up with a more general architecture suitable for a wider range of problems. PMID:24964802

  4. Evaluation of the Intel Xeon Phi 7120 and NVIDIA K80 as accelerators for two-dimensional panel codes

    PubMed Central

    2017-01-01

    To optimize the geometry of airfoils for a specific application is an important engineering problem. In this context genetic algorithms have enjoyed some success as they are able to explore the search space without getting stuck in local optima. However, these algorithms require the computation of aerodynamic properties for a significant number of airfoil geometries. Consequently, for low-speed aerodynamics, panel methods are most often used as the inner solver. In this paper we evaluate the performance of such an optimization algorithm on modern accelerators (more specifically, the Intel Xeon Phi 7120 and the NVIDIA K80). For that purpose, we have implemented an optimized version of the algorithm on the CPU and Xeon Phi (based on OpenMP, vectorization, and the Intel MKL library) and on the GPU (based on CUDA and the MAGMA library). We present timing results for all codes and discuss the similarities and differences between the three implementations. Overall, we observe a speedup of approximately 2.5 for adding an Intel Xeon Phi 7120 to a dual socket workstation and a speedup between 3.4 and 3.8 for adding a NVIDIA K80 to a dual socket workstation. PMID:28582389

  5. Evaluation of the Intel Xeon Phi 7120 and NVIDIA K80 as accelerators for two-dimensional panel codes.

    PubMed

    Einkemmer, Lukas

    2017-01-01

    To optimize the geometry of airfoils for a specific application is an important engineering problem. In this context genetic algorithms have enjoyed some success as they are able to explore the search space without getting stuck in local optima. However, these algorithms require the computation of aerodynamic properties for a significant number of airfoil geometries. Consequently, for low-speed aerodynamics, panel methods are most often used as the inner solver. In this paper we evaluate the performance of such an optimization algorithm on modern accelerators (more specifically, the Intel Xeon Phi 7120 and the NVIDIA K80). For that purpose, we have implemented an optimized version of the algorithm on the CPU and Xeon Phi (based on OpenMP, vectorization, and the Intel MKL library) and on the GPU (based on CUDA and the MAGMA library). We present timing results for all codes and discuss the similarities and differences between the three implementations. Overall, we observe a speedup of approximately 2.5 for adding an Intel Xeon Phi 7120 to a dual socket workstation and a speedup between 3.4 and 3.8 for adding a NVIDIA K80 to a dual socket workstation.

  6. StarSmasher: Smoothed Particle Hydrodynamics code for smashing stars and planets

    NASA Astrophysics Data System (ADS)

    Gaburov, Evghenii; Lombardi, James C., Jr.; Portegies Zwart, Simon; Rasio, F. A.

    2018-05-01

    Smoothed Particle Hydrodynamics (SPH) is a Lagrangian particle method that approximates a continuous fluid as discrete nodes, each carrying various parameters such as mass, position, velocity, pressure, and temperature. In an SPH simulation the resolution scales with the particle density; StarSmasher is able to handle both equal-mass and equal number-density particle models. StarSmasher solves for hydro forces by calculating the pressure for each particle as a function of the particle's properties - density, internal energy, and internal properties (e.g. temperature and mean molecular weight). The code implements variational equations of motion and libraries to calculate the gravitational forces between particles using direct summation on NVIDIA graphics cards. Using a direct summation instead of a tree-based algorithm for gravity increases the accuracy of the gravity calculations at the cost of speed. The code uses a cubic spline for the smoothing kernel and an artificial viscosity prescription coupled with a Balsara Switch to prevent unphysical interparticle penetration. The code also implements an artificial relaxation force to the equations of motion to add a drag term to the calculated accelerations during relaxation integrations. Initially called StarCrash, StarSmasher was developed originally by Rasio.

  7. Redefining the Data Pipeline Using GPUs

    NASA Astrophysics Data System (ADS)

    Warner, C.; Eikenberry, S. S.; Gonzalez, A. H.; Packham, C.

    2013-10-01

    There are two major challenges facing the next generation of data processing pipelines: 1) handling an ever increasing volume of data as array sizes continue to increase and 2) the desire to process data in near real-time to maximize observing efficiency by providing rapid feedback on data quality. Combining the power of modern graphics processing units (GPUs), relational database management systems (RDBMSs), and extensible markup language (XML) to re-imagine traditional data pipelines will allow us to meet these challenges. Modern GPUs contain hundreds of processing cores, each of which can process hundreds of threads concurrently. Technologies such as Nvidia's Compute Unified Device Architecture (CUDA) platform and the PyCUDA (http://mathema.tician.de/software/pycuda) module for Python allow us to write parallel algorithms and easily link GPU-optimized code into existing data pipeline frameworks. This approach has produced speed gains of over a factor of 100 compared to CPU implementations for individual algorithms and overall pipeline speed gains of a factor of 10-25 compared to traditionally built data pipelines for both imaging and spectroscopy (Warner et al., 2011). However, there are still many bottlenecks inherent in the design of traditional data pipelines. For instance, file input/output of intermediate steps is now a significant portion of the overall processing time. In addition, most traditional pipelines are not designed to be able to process data on-the-fly in real time. We present a model for a next-generation data pipeline that has the flexibility to process data in near real-time at the observatory as well as to automatically process huge archives of past data by using a simple XML configuration file. XML is ideal for describing both the dataset and the processes that will be applied to the data. Meta-data for the datasets would be stored using an RDBMS (such as mysql or PostgreSQL) which could be easily and rapidly queried and file I/O would be kept at a minimum. We believe this redefined data pipeline will be able to process data at the telescope, concurrent with continuing observations, thus maximizing precious observing time and optimizing the observational process in general. We also believe that using this design, it is possible to obtain a speed gain of a factor of 30-40 over traditional data pipelines when processing large archives of data.

  8. Spiking neural networks on high performance computer clusters

    NASA Astrophysics Data System (ADS)

    Chen, Chong; Taha, Tarek M.

    2011-09-01

    In this paper we examine the acceleration of two spiking neural network models on three clusters of multicore processors representing three categories of processors: x86, STI Cell, and NVIDIA GPGPUs. The x86 cluster utilized consists of 352 dualcore AMD Opterons, the Cell cluster consists of 320 Sony Playstation 3s, while the GPGPU cluster contains 32 NVIDIA Tesla S1070 systems. The results indicate that the GPGPU platform can dominate in performance compared to the Cell and x86 platforms examined. From a cost perspective, the GPGPU is more expensive in terms of neuron/s throughput. If the cost of GPGPUs go down in the future, this platform will become very cost effective for these models.

  9. MILC Code Performance on High End CPU and GPU Supercomputer Clusters

    NASA Astrophysics Data System (ADS)

    DeTar, Carleton; Gottlieb, Steven; Li, Ruizi; Toussaint, Doug

    2018-03-01

    With recent developments in parallel supercomputing architecture, many core, multi-core, and GPU processors are now commonplace, resulting in more levels of parallelism, memory hierarchy, and programming complexity. It has been necessary to adapt the MILC code to these new processors starting with NVIDIA GPUs, and more recently, the Intel Xeon Phi processors. We report on our efforts to port and optimize our code for the Intel Knights Landing architecture. We consider performance of the MILC code with MPI and OpenMP, and optimizations with QOPQDP and QPhiX. For the latter approach, we concentrate on the staggered conjugate gradient and gauge force. We also consider performance on recent NVIDIA GPUs using the QUDA library.

  10. Color graphics, interactive processing, and the supercomputer

    NASA Technical Reports Server (NTRS)

    Smith-Taylor, Rudeen

    1987-01-01

    The development of a common graphics environment for the NASA Langley Research Center user community and the integration of a supercomputer into this environment is examined. The initial computer hardware, the software graphics packages, and their configurations are described. The addition of improved computer graphics capability to the supercomputer, and the utilization of the graphic software and hardware are discussed. Consideration is given to the interactive processing system which supports the computer in an interactive debugging, processing, and graphics environment.

  11. Scaling deep learning workloads: NVIDIA DGX-1/Pascal and Intel Knights Landing

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Gawande, Nitin A.; Landwehr, Joshua B.; Daily, Jeffrey A.

    Deep Learning (DL) algorithms have become ubiquitous in data analytics. As a result, major computing vendors --- including NVIDIA, Intel, AMD, and IBM --- have architectural road-maps influenced by DL workloads. Furthermore, several vendors have recently advertised new computing products as accelerating large DL workloads. Unfortunately, it is difficult for data scientists to quantify the potential of these different products. This paper provides a performance and power analysis of important DL workloads on two major parallel architectures: NVIDIA DGX-1 (eight Pascal P100 GPUs interconnected with NVLink) and Intel Knights Landing (KNL) CPUs interconnected with Intel Omni-Path or Cray Aries. Ourmore » evaluation consists of a cross section of convolutional neural net workloads: CifarNet, AlexNet, GoogLeNet, and ResNet50 topologies using the Cifar10 and ImageNet datasets. The workloads are vendor-optimized for each architecture. Our analysis indicates that although GPUs provide the highest overall performance, the gap can close for some convolutional networks; and the KNL can be competitive in performance/watt. We find that NVLink facilitates scaling efficiency on GPUs. However, its importance is heavily dependent on neural network architecture. Furthermore, for weak-scaling --- sometimes encouraged by restricted GPU memory --- NVLink is less important.« less

  12. 3D tumor localization through real-time volumetric x-ray imaging for lung cancer radiotherapy.

    PubMed

    Li, Ruijiang; Lewis, John H; Jia, Xun; Gu, Xuejun; Folkerts, Michael; Men, Chunhua; Song, William Y; Jiang, Steve B

    2011-05-01

    To evaluate an algorithm for real-time 3D tumor localization from a single x-ray projection image for lung cancer radiotherapy. Recently, we have developed an algorithm for reconstructing volumetric images and extracting 3D tumor motion information from a single x-ray projection [Li et al., Med. Phys. 37, 2822-2826 (2010)]. We have demonstrated its feasibility using a digital respiratory phantom with regular breathing patterns. In this work, we present a detailed description and a comprehensive evaluation of the improved algorithm. The algorithm was improved by incorporating respiratory motion prediction. The accuracy and efficiency of using this algorithm for 3D tumor localization were then evaluated on (1) a digital respiratory phantom, (2) a physical respiratory phantom, and (3) five lung cancer patients. These evaluation cases include both regular and irregular breathing patterns that are different from the training dataset. For the digital respiratory phantom with regular and irregular breathing, the average 3D tumor localization error is less than 1 mm which does not seem to be affected by amplitude change, period change, or baseline shift. On an NVIDIA Tesla C1060 graphic processing unit (GPU) card, the average computation time for 3D tumor localization from each projection ranges between 0.19 and 0.26 s, for both regular and irregular breathing, which is about a 10% improvement over previously reported results. For the physical respiratory phantom, an average tumor localization error below 1 mm was achieved with an average computation time of 0.13 and 0.16 s on the same graphic processing unit (GPU) card, for regular and irregular breathing, respectively. For the five lung cancer patients, the average tumor localization error is below 2 mm in both the axial and tangential directions. The average computation time on the same GPU card ranges between 0.26 and 0.34 s. Through a comprehensive evaluation of our algorithm, we have established its accuracy in 3D tumor localization to be on the order of 1 mm on average and 2 mm at 95 percentile for both digital and physical phantoms, and within 2 mm on average and 4 mm at 95 percentile for lung cancer patients. The results also indicate that the accuracy is not affected by the breathing pattern, be it regular or irregular. High computational efficiency can be achieved on GPU, requiring 0.1-0.3 s for each x-ray projection.

  13. A high-throughput screening approach to discovering good forms of biologically inspired visual representation.

    PubMed

    Pinto, Nicolas; Doukhan, David; DiCarlo, James J; Cox, David D

    2009-11-01

    While many models of biological object recognition share a common set of "broad-stroke" properties, the performance of any one model depends strongly on the choice of parameters in a particular instantiation of that model--e.g., the number of units per layer, the size of pooling kernels, exponents in normalization operations, etc. Since the number of such parameters (explicit or implicit) is typically large and the computational cost of evaluating one particular parameter set is high, the space of possible model instantiations goes largely unexplored. Thus, when a model fails to approach the abilities of biological visual systems, we are left uncertain whether this failure is because we are missing a fundamental idea or because the correct "parts" have not been tuned correctly, assembled at sufficient scale, or provided with enough training. Here, we present a high-throughput approach to the exploration of such parameter sets, leveraging recent advances in stream processing hardware (high-end NVIDIA graphic cards and the PlayStation 3's IBM Cell Processor). In analogy to high-throughput screening approaches in molecular biology and genetics, we explored thousands of potential network architectures and parameter instantiations, screening those that show promising object recognition performance for further analysis. We show that this approach can yield significant, reproducible gains in performance across an array of basic object recognition tasks, consistently outperforming a variety of state-of-the-art purpose-built vision systems from the literature. As the scale of available computational power continues to expand, we argue that this approach has the potential to greatly accelerate progress in both artificial vision and our understanding of the computational underpinning of biological vision.

  14. cuTauLeaping: A GPU-Powered Tau-Leaping Stochastic Simulator for Massive Parallel Analyses of Biological Systems

    PubMed Central

    Besozzi, Daniela; Pescini, Dario; Mauri, Giancarlo

    2014-01-01

    Tau-leaping is a stochastic simulation algorithm that efficiently reconstructs the temporal evolution of biological systems, modeled according to the stochastic formulation of chemical kinetics. The analysis of dynamical properties of these systems in physiological and perturbed conditions usually requires the execution of a large number of simulations, leading to high computational costs. Since each simulation can be executed independently from the others, a massive parallelization of tau-leaping can bring to relevant reductions of the overall running time. The emerging field of General Purpose Graphic Processing Units (GPGPU) provides power-efficient high-performance computing at a relatively low cost. In this work we introduce cuTauLeaping, a stochastic simulator of biological systems that makes use of GPGPU computing to execute multiple parallel tau-leaping simulations, by fully exploiting the Nvidia's Fermi GPU architecture. We show how a considerable computational speedup is achieved on GPU by partitioning the execution of tau-leaping into multiple separated phases, and we describe how to avoid some implementation pitfalls related to the scarcity of memory resources on the GPU streaming multiprocessors. Our results show that cuTauLeaping largely outperforms the CPU-based tau-leaping implementation when the number of parallel simulations increases, with a break-even directly depending on the size of the biological system and on the complexity of its emergent dynamics. In particular, cuTauLeaping is exploited to investigate the probability distribution of bistable states in the Schlögl model, and to carry out a bidimensional parameter sweep analysis to study the oscillatory regimes in the Ras/cAMP/PKA pathway in S. cerevisiae. PMID:24663957

  15. CLAST: CUDA implemented large-scale alignment search tool.

    PubMed

    Yano, Masahiro; Mori, Hiroshi; Akiyama, Yutaka; Yamada, Takuji; Kurokawa, Ken

    2014-12-11

    Metagenomics is a powerful methodology to study microbial communities, but it is highly dependent on nucleotide sequence similarity searching against sequence databases. Metagenomic analyses with next-generation sequencing technologies produce enormous numbers of reads from microbial communities, and many reads are derived from microbes whose genomes have not yet been sequenced, limiting the usefulness of existing sequence similarity search tools. Therefore, there is a clear need for a sequence similarity search tool that can rapidly detect weak similarity in large datasets. We developed a tool, which we named CLAST (CUDA implemented large-scale alignment search tool), that enables analyses of millions of reads and thousands of reference genome sequences, and runs on NVIDIA Fermi architecture graphics processing units. CLAST has four main advantages over existing alignment tools. First, CLAST was capable of identifying sequence similarities ~80.8 times faster than BLAST and 9.6 times faster than BLAT. Second, CLAST executes global alignment as the default (local alignment is also an option), enabling CLAST to assign reads to taxonomic and functional groups based on evolutionarily distant nucleotide sequences with high accuracy. Third, CLAST does not need a preprocessed sequence database like Burrows-Wheeler Transform-based tools, and this enables CLAST to incorporate large, frequently updated sequence databases. Fourth, CLAST requires <2 GB of main memory, making it possible to run CLAST on a standard desktop computer or server node. CLAST achieved very high speed (similar to the Burrows-Wheeler Transform-based Bowtie 2 for long reads) and sensitivity (equal to BLAST, BLAT, and FR-HIT) without the need for extensive database preprocessing or a specialized computing platform. Our results demonstrate that CLAST has the potential to be one of the most powerful and realistic approaches to analyze the massive amount of sequence data from next-generation sequencing technologies.

  16. GPU-Accelerated Voxelwise Hepatic Perfusion Quantification

    PubMed Central

    Wang, H; Cao, Y

    2012-01-01

    Voxelwise quantification of hepatic perfusion parameters from dynamic contrast enhanced (DCE) imaging greatly contributes to assessment of liver function in response to radiation therapy. However, the efficiency of the estimation of hepatic perfusion parameters voxel-by-voxel in the whole liver using a dual-input single-compartment model requires substantial improvement for routine clinical applications. In this paper, we utilize the parallel computation power of a graphics processing unit (GPU) to accelerate the computation, while maintaining the same accuracy as the conventional method. Using CUDA-GPU, the hepatic perfusion computations over multiple voxels are run across the GPU blocks concurrently but independently. At each voxel, non-linear least squares fitting the time series of the liver DCE data to the compartmental model is distributed to multiple threads in a block, and the computations of different time points are performed simultaneously and synchronically. An efficient fast Fourier transform in a block is also developed for the convolution computation in the model. The GPU computations of the voxel-by-voxel hepatic perfusion images are compared with ones by the CPU using the simulated DCE data and the experimental DCE MR images from patients. The computation speed is improved by 30 times using a NVIDIA Tesla C2050 GPU compared to a 2.67 GHz Intel Xeon CPU processor. To obtain liver perfusion maps with 626400 voxels in a patient’s liver, it takes 0.9 min with the GPU-accelerated voxelwise computation, compared to 110 min with the CPU, while both methods result in perfusion parameters differences less than 10−6. The method will be useful for generating liver perfusion images in clinical settings. PMID:22892645

  17. Patient-specific non-linear finite element modelling for predicting soft organ deformation in real-time: application to non-rigid neuroimage registration.

    PubMed

    Wittek, Adam; Joldes, Grand; Couton, Mathieu; Warfield, Simon K; Miller, Karol

    2010-12-01

    Long computation times of non-linear (i.e. accounting for geometric and material non-linearity) biomechanical models have been regarded as one of the key factors preventing application of such models in predicting organ deformation for image-guided surgery. This contribution presents real-time patient-specific computation of the deformation field within the brain for six cases of brain shift induced by craniotomy (i.e. surgical opening of the skull) using specialised non-linear finite element procedures implemented on a graphics processing unit (GPU). In contrast to commercial finite element codes that rely on an updated Lagrangian formulation and implicit integration in time domain for steady state solutions, our procedures utilise the total Lagrangian formulation with explicit time stepping and dynamic relaxation. We used patient-specific finite element meshes consisting of hexahedral and non-locking tetrahedral elements, together with realistic material properties for the brain tissue and appropriate contact conditions at the boundaries. The loading was defined by prescribing deformations on the brain surface under the craniotomy. Application of the computed deformation fields to register (i.e. align) the preoperative and intraoperative images indicated that the models very accurately predict the intraoperative deformations within the brain. For each case, computing the brain deformation field took less than 4 s using an NVIDIA Tesla C870 GPU, which is two orders of magnitude reduction in computation time in comparison to our previous study in which the brain deformation was predicted using a commercial finite element solver executed on a personal computer. Copyright © 2010 Elsevier Ltd. All rights reserved.

  18. Algorithms of GPU-enabled reactive force field (ReaxFF) molecular dynamics.

    PubMed

    Zheng, Mo; Li, Xiaoxia; Guo, Li

    2013-04-01

    Reactive force field (ReaxFF), a recent and novel bond order potential, allows for reactive molecular dynamics (ReaxFF MD) simulations for modeling larger and more complex molecular systems involving chemical reactions when compared with computation intensive quantum mechanical methods. However, ReaxFF MD can be approximately 10-50 times slower than classical MD due to its explicit modeling of bond forming and breaking, the dynamic charge equilibration at each time-step, and its one order smaller time-step than the classical MD, all of which pose significant computational challenges in simulation capability to reach spatio-temporal scales of nanometers and nanoseconds. The very recent advances of graphics processing unit (GPU) provide not only highly favorable performance for GPU enabled MD programs compared with CPU implementations but also an opportunity to manage with the computing power and memory demanding nature imposed on computer hardware by ReaxFF MD. In this paper, we present the algorithms of GMD-Reax, the first GPU enabled ReaxFF MD program with significantly improved performance surpassing CPU implementations on desktop workstations. The performance of GMD-Reax has been benchmarked on a PC equipped with a NVIDIA C2050 GPU for coal pyrolysis simulation systems with atoms ranging from 1378 to 27,283. GMD-Reax achieved speedups as high as 12 times faster than Duin et al.'s FORTRAN codes in Lammps on 8 CPU cores and 6 times faster than the Lammps' C codes based on PuReMD in terms of the simulation time per time-step averaged over 100 steps. GMD-Reax could be used as a new and efficient computational tool for exploiting very complex molecular reactions via ReaxFF MD simulation on desktop workstations. Copyright © 2013 Elsevier Inc. All rights reserved.

  19. Multi-GPGPU Tsunami simulation at Toyama-bay

    NASA Astrophysics Data System (ADS)

    Furuyama, Shoichi; Ueda, Yuki

    2017-07-01

    Accelerated multi General Purpose Graphics Processing Unit (GPGPU) calculation for Tsunami run-up simulation was achieved at the wide area (whole Toyama-bay in Japan) by faster computation technique. Toyama-bay has active-faults at the sea-bed. It has a high possibility to occur earthquakes and Tsunami waves in the case of the huge earthquake, that's why to predict the area of Tsunami run-up is important for decreasing damages to residents by the disaster. However it is very hard task to achieve the simulation by the computer resources problem. A several meter's order of the high resolution calculation is required for the running-up Tsunami simulation because artificial structures on the ground such as roads, buildings, and houses are very small. On the other hand the huge area simulation is also required. In the Toyama-bay case the area is 42 [km] × 15 [km]. When 5 [m] × 5 [m] size computational cells are used for the simulation, over 26,000,000 computational cells are generated. To calculate the simulation, a normal CPU desktop computer took about 10 hours for the calculation. An improvement of calculation time is important problem for the immediate prediction system of Tsunami running-up, as a result it will contribute to protect a lot of residents around the coastal region. The study tried to decrease this calculation time by using multi GPGPU system which is equipped with six NVIDIA TESLA K20xs, InfiniBand network connection between computer nodes by MVAPICH library. As a result 5.16 times faster calculation was achieved on six GPUs than one GPU case and it was 86% parallel efficiency to the linear speed up.

  20. A High-Throughput Screening Approach to Discovering Good Forms of Biologically Inspired Visual Representation

    PubMed Central

    Pinto, Nicolas; Doukhan, David; DiCarlo, James J.; Cox, David D.

    2009-01-01

    While many models of biological object recognition share a common set of “broad-stroke” properties, the performance of any one model depends strongly on the choice of parameters in a particular instantiation of that model—e.g., the number of units per layer, the size of pooling kernels, exponents in normalization operations, etc. Since the number of such parameters (explicit or implicit) is typically large and the computational cost of evaluating one particular parameter set is high, the space of possible model instantiations goes largely unexplored. Thus, when a model fails to approach the abilities of biological visual systems, we are left uncertain whether this failure is because we are missing a fundamental idea or because the correct “parts” have not been tuned correctly, assembled at sufficient scale, or provided with enough training. Here, we present a high-throughput approach to the exploration of such parameter sets, leveraging recent advances in stream processing hardware (high-end NVIDIA graphic cards and the PlayStation 3's IBM Cell Processor). In analogy to high-throughput screening approaches in molecular biology and genetics, we explored thousands of potential network architectures and parameter instantiations, screening those that show promising object recognition performance for further analysis. We show that this approach can yield significant, reproducible gains in performance across an array of basic object recognition tasks, consistently outperforming a variety of state-of-the-art purpose-built vision systems from the literature. As the scale of available computational power continues to expand, we argue that this approach has the potential to greatly accelerate progress in both artificial vision and our understanding of the computational underpinning of biological vision. PMID:19956750

  1. Massive parallelization of a 3D finite difference electromagnetic forward solution using domain decomposition methods on multiple CUDA enabled GPUs

    NASA Astrophysics Data System (ADS)

    Schultz, A.

    2010-12-01

    3D forward solvers lie at the core of inverse formulations used to image the variation of electrical conductivity within the Earth's interior. This property is associated with variations in temperature, composition, phase, presence of volatiles, and in specific settings, the presence of groundwater, geothermal resources, oil/gas or minerals. The high cost of 3D solutions has been a stumbling block to wider adoption of 3D methods. Parallel algorithms for modeling frequency domain 3D EM problems have not achieved wide scale adoption, with emphasis on fairly coarse grained parallelism using MPI and similar approaches. The communications bandwidth as well as the latency required to send and receive network communication packets is a limiting factor in implementing fine grained parallel strategies, inhibiting wide adoption of these algorithms. Leading Graphics Processor Unit (GPU) companies now produce GPUs with hundreds of GPU processor cores per die. The footprint, in silicon, of the GPU's restricted instruction set is much smaller than the general purpose instruction set required of a CPU. Consequently, the density of processor cores on a GPU can be much greater than on a CPU. GPUs also have local memory, registers and high speed communication with host CPUs, usually through PCIe type interconnects. The extremely low cost and high computational power of GPUs provides the EM geophysics community with an opportunity to achieve fine grained (i.e. massive) parallelization of codes on low cost hardware. The current generation of GPUs (e.g. NVidia Fermi) provides 3 billion transistors per chip die, with nearly 500 processor cores and up to 6 GB of fast (DDR5) GPU memory. This latest generation of GPU supports fast hardware double precision (64 bit) floating point operations of the type required for frequency domain EM forward solutions. Each Fermi GPU board can sustain nearly 1 TFLOP in double precision, and multiple boards can be installed in the host computer system. We describe our ongoing efforts to achieve massive parallelization on a novel hybrid GPU testbed machine currently configured with 12 Intel Westmere Xeon CPU cores (or 24 parallel computational threads) with 96 GB DDR3 system memory, 4 GPU subsystems which in aggregate contain 960 NVidia Tesla GPU cores with 16 GB dedicated DDR3 GPU memory, and a second interleved bank of 4 GPU subsystems containing in aggregate 1792 NVidia Fermi GPU cores with 12 GB dedicated DDR5 GPU memory. We are applying domain decomposition methods to a modified version of Weiss' (2001) 3D frequency domain full physics EM finite difference code, an open source GPL licensed f90 code available for download from www.OpenEM.org. This will be the core of a new hybrid 3D inversion that parallelizes frequencies across CPUs and individual forward solutions across GPUs. We describe progress made in modifying the code to use direct solvers in GPU cores dedicated to each small subdomain, iteratively improving the solution by matching adjacent subdomain boundary solutions, rather than iterative Krylov space sparse solvers as currently applied to the whole domain.

  2. A Theoretical Analysis of Learning with Graphics--Implications for Computer Graphics Design.

    ERIC Educational Resources Information Center

    ChanLin, Lih-Juan

    This paper reviews the literature pertinent to learning with graphics. The dual coding theory provides explanation about how graphics are stored and precessed in semantic memory. The level of processing theory suggests how graphics can be employed in learning to encourage deeper processing. In addition to dual coding theory and level of processing…

  3. Scaling Deep Learning workloads: NVIDIA DGX-1/Pascal and Intel Knights Landing

    DOE PAGES

    Gawande, Nitin A.; Daily, Jeff A.; Siegel, Charles; ...

    2018-05-05

    Deep Learning (DL) algorithms have become ubiquitous in data analytics. As a result, major computing vendors—including NVIDIA, Intel, AMD, and IBM—have architectural road maps influenced by DL workloads. Furthermore, several vendors have recently advertised new computing products as accelerating large DL workloads. Unfortunately, it is difficult for data scientists to quantify the potential of these different products. Here, this article provides a performance and power analysis of important DL workloads on two major parallel architectures: NVIDIA DGX-1 (eight Pascal P100 GPUs interconnected with NVLink) and Intel Knights Landing (KNL) CPUs interconnected with Intel Omni-Path or Cray Aries. Our evaluation consistsmore » of a cross section of convolutional neural net workloads: CifarNet, AlexNet, GoogLeNet, and ResNet50 topologies using the Cifar10 and ImageNet datasets. The workloads are vendor-optimized for each architecture. We use sequentially equivalent implementations to maintain iso-accuracy between parallel and sequential DL models. Our analysis indicates that although GPUs provide the highest overall performance, the gap can close for some convolutional networks; and the KNL can be competitive in performance/watt. We find that NVLink facilitates scaling efficiency on GPUs. However, its importance is heavily dependent on neural network architecture. Furthermore, for weak-scaling—sometimes encouraged by restricted GPU memory—NVLink is less important.« less

  4. Scaling Deep Learning workloads: NVIDIA DGX-1/Pascal and Intel Knights Landing

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Gawande, Nitin A.; Daily, Jeff A.; Siegel, Charles

    Deep Learning (DL) algorithms have become ubiquitous in data analytics. As a result, major computing vendors—including NVIDIA, Intel, AMD, and IBM—have architectural road maps influenced by DL workloads. Furthermore, several vendors have recently advertised new computing products as accelerating large DL workloads. Unfortunately, it is difficult for data scientists to quantify the potential of these different products. Here, this article provides a performance and power analysis of important DL workloads on two major parallel architectures: NVIDIA DGX-1 (eight Pascal P100 GPUs interconnected with NVLink) and Intel Knights Landing (KNL) CPUs interconnected with Intel Omni-Path or Cray Aries. Our evaluation consistsmore » of a cross section of convolutional neural net workloads: CifarNet, AlexNet, GoogLeNet, and ResNet50 topologies using the Cifar10 and ImageNet datasets. The workloads are vendor-optimized for each architecture. We use sequentially equivalent implementations to maintain iso-accuracy between parallel and sequential DL models. Our analysis indicates that although GPUs provide the highest overall performance, the gap can close for some convolutional networks; and the KNL can be competitive in performance/watt. We find that NVLink facilitates scaling efficiency on GPUs. However, its importance is heavily dependent on neural network architecture. Furthermore, for weak-scaling—sometimes encouraged by restricted GPU memory—NVLink is less important.« less

  5. GPU acceleration of Dock6's Amber scoring computation.

    PubMed

    Yang, Hailong; Zhou, Qiongqiong; Li, Bo; Wang, Yongjian; Luan, Zhongzhi; Qian, Depei; Li, Hanlu

    2010-01-01

    Dressing the problem of virtual screening is a long-term goal in the drug discovery field, which if properly solved, can significantly shorten new drugs' R&D cycle. The scoring functionality that evaluates the fitness of the docking result is one of the major challenges in virtual screening. In general, scoring functionality in docking requires a large amount of floating-point calculations, which usually takes several weeks or even months to be finished. This time-consuming procedure is unacceptable, especially when highly fatal and infectious virus arises such as SARS and H1N1, which forces the scoring task to be done in a limited time. This paper presents how to leverage the computational power of GPU to accelerate Dock6's (http://dock.compbio.ucsf.edu/DOCK_6/) Amber (J. Comput. Chem. 25: 1157-1174, 2004) scoring with NVIDIA CUDA (NVIDIA Corporation Technical Staff, Compute Unified Device Architecture - Programming Guide, NVIDIA Corporation, 2008) (Compute Unified Device Architecture) platform. We also discuss many factors that will greatly influence the performance after porting the Amber scoring to GPU, including thread management, data transfer, and divergence hidden. Our experiments show that the GPU-accelerated Amber scoring achieves a 6.5× speedup with respect to the original version running on AMD dual-core CPU for the same problem size. This acceleration makes the Amber scoring more competitive and efficient for large-scale virtual screening problems.

  6. Spectral turning bands for efficient Gaussian random fields generation on GPUs and accelerators

    NASA Astrophysics Data System (ADS)

    Hunger, L.; Cosenza, B.; Kimeswenger, S.; Fahringer, T.

    2015-11-01

    A random field (RF) is a set of correlated random variables associated with different spatial locations. RF generation algorithms are of crucial importance for many scientific areas, such as astrophysics, geostatistics, computer graphics, and many others. Current approaches commonly make use of 3D fast Fourier transform (FFT), which does not scale well for RF bigger than the available memory; they are also limited to regular rectilinear meshes. We introduce random field generation with the turning band method (RAFT), an RF generation algorithm based on the turning band method that is optimized for massively parallel hardware such as GPUs and accelerators. Our algorithm replaces the 3D FFT with a lower-order, one-dimensional FFT followed by a projection step and is further optimized with loop unrolling and blocking. RAFT can easily generate RF on non-regular (non-uniform) meshes and efficiently produce fields with mesh sizes bigger than the available device memory by using a streaming, out-of-core approach. Our algorithm generates RF with the correct statistical behavior and is tested on a variety of modern hardware, such as NVIDIA Tesla, AMD FirePro and Intel Phi. RAFT is faster than the traditional methods on regular meshes and has been successfully applied to two real case scenarios: planetary nebulae and cosmological simulations.

  7. 3D gaze tracking system for NVidia 3D Vision®.

    PubMed

    Wibirama, Sunu; Hamamoto, Kazuhiko

    2013-01-01

    Inappropriate parallax setting in stereoscopic content generally causes visual fatigue and visual discomfort. To optimize three dimensional (3D) effects in stereoscopic content by taking into account health issue, understanding how user gazes at 3D direction in virtual space is currently an important research topic. In this paper, we report the study of developing a novel 3D gaze tracking system for Nvidia 3D Vision(®) to be used in desktop stereoscopic display. We suggest an optimized geometric method to accurately measure the position of virtual 3D object. Our experimental result shows that the proposed system achieved better accuracy compared to conventional geometric method by average errors 0.83 cm, 0.87 cm, and 1.06 cm in X, Y, and Z dimensions, respectively.

  8. Student Thinking Processes While Constructing Graphic Representations of Textbook Content: What Insights Do Think-Alouds Provide?

    ERIC Educational Resources Information Center

    Scott, D. Beth; Dreher, Mariam Jean

    2016-01-01

    This study examined the thinking processes students engage in while constructing graphic representations of textbook content. Twenty-eight students who either used graphic representations in a routine manner during social studies instruction or learned to construct graphic representations based on the rhetorical patterns used to organize textbook…

  9. Graphic Design in Libraries: A Conceptual Process

    ERIC Educational Resources Information Center

    Ruiz, Miguel

    2014-01-01

    Providing successful library services requires efficient and effective communication with users; therefore, it is important that content creators who develop visual materials understand key components of design and, specifically, develop a holistic graphic design process. Graphic design, as a form of visual communication, is the process of…

  10. Adaptive-optics optical coherence tomography processing using a graphics processing unit.

    PubMed

    Shafer, Brandon A; Kriske, Jeffery E; Kocaoglu, Omer P; Turner, Timothy L; Liu, Zhuolin; Lee, John Jaehwan; Miller, Donald T

    2014-01-01

    Graphics processing units are increasingly being used for scientific computing for their powerful parallel processing abilities, and moderate price compared to super computers and computing grids. In this paper we have used a general purpose graphics processing unit to process adaptive-optics optical coherence tomography (AOOCT) images in real time. Increasing the processing speed of AOOCT is an essential step in moving the super high resolution technology closer to clinical viability.

  11. Evaluating Multi-core Architectures through Accelerating the Three-Dimensional Lax–Wendroff Correction

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    You, Yang; Fu, Haohuan; Song, Shuaiwen

    2014-07-18

    Wave propagation forward modeling is a widely used computational method in oil and gas exploration. The iterative stencil loops in such problems have broad applications in scientific computing. However, executing such loops can be highly time time-consuming, which greatly limits application’s performance and power efficiency. In this paper, we accelerate the forward modeling technique on the latest multi-core and many-core architectures such as Intel Sandy Bridge CPUs, NVIDIA Fermi C2070 GPU, NVIDIA Kepler K20x GPU, and the Intel Xeon Phi Co-processor. For the GPU platforms, we propose two parallel strategies to explore the performance optimization opportunities for our stencil kernels.more » For Sandy Bridge CPUs and MIC, we also employ various optimization techniques in order to achieve the best.« less

  12. Multiprocessor graphics computation and display using transputers

    NASA Technical Reports Server (NTRS)

    Ellis, Graham K.

    1988-01-01

    A package of two-dimensional graphics routines was developed to run on a transputer-based parallel processing system. These routines were designed to enable applications programmers to easily generate and display results from the transputer network in a graphic format. The graphics procedures were designed for the lowest possible network communication overhead for increased performance. The routines were designed for ease of use and to present an intuitive approach to generating graphics on the transputer parallel processing system.

  13. Commercial Off-The-Shelf (COTS) Graphics Processing Board (GPB) Radiation Test Evaluation Report

    NASA Technical Reports Server (NTRS)

    Salazar, George A.; Steele, Glen F.

    2013-01-01

    Large round trip communications latency for deep space missions will require more onboard computational capabilities to enable the space vehicle to undertake many tasks that have traditionally been ground-based, mission control responsibilities. As a result, visual display graphics will be required to provide simpler vehicle situational awareness through graphical representations, as well as provide capabilities never before done in a space mission, such as augmented reality for in-flight maintenance or Telepresence activities. These capabilities will require graphics processors and associated support electronic components for high computational graphics processing. In an effort to understand the performance of commercial graphics card electronics operating in the expected radiation environment, a preliminary test was performed on five commercial offthe- shelf (COTS) graphics cards. This paper discusses the preliminary evaluation test results of five COTS graphics processing cards tested to the International Space Station (ISS) low earth orbit radiation environment. Three of the five graphics cards were tested to a total dose of 6000 rads (Si). The test articles, test configuration, preliminary results, and recommendations are discussed.

  14. The Performance Improvement of the Lagrangian Particle Dispersion Model (LPDM) Using Graphics Processing Unit (GPU) Computing

    DTIC Science & Technology

    2017-08-01

    access to the GPU for general purpose processing .5 CUDA is designed to work easily with multiple programming languages , including Fortran. CUDA is a...Using Graphics Processing Unit (GPU) Computing by Leelinda P Dawson Approved for public release; distribution unlimited...The Performance Improvement of the Lagrangian Particle Dispersion Model (LPDM) Using Graphics Processing Unit (GPU) Computing by Leelinda

  15. A Relational Reasoning Approach to Text-Graphic Processing

    ERIC Educational Resources Information Center

    Danielson, Robert W.; Sinatra, Gale M.

    2017-01-01

    We propose that research on text-graphic processing could be strengthened by the inclusion of relational reasoning perspectives. We briefly outline four aspects of relational reasoning: "analogies," "anomalies," "antinomies", and "antitheses". Next, we illustrate how text-graphic researchers have been…

  16. Printing--Graphic Arts--Graphic Communications

    ERIC Educational Resources Information Center

    Hauenstein, A. Dean

    1975-01-01

    Recently, "graphic arts" has shifted from printing skills to a conceptual approach of production processes. "Graphic communications" must embrace the total system of communication through graphic media, to serve broad career education purposes; students taught concepts and principles can be flexible and adaptive. The author…

  17. Application Modernization at LLNL and the Sierra Center of Excellence

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Neely, J. Robert; de Supinski, Bronis R.

    We repport that in 2014, Lawrence Livermore National Laboratory began acquisition of Sierra, a pre-exascale system from IBM and Nvidia. It marks a significant shift in direction for LLNL by introducing the concept of heterogeneous computing via GPUs. LLNL’s mission requires application teams to prepare for this paradigm shift. Thus, the Sierra procurement required a proposed Center of Excellence that would align the expertise of the chosen vendors with laboratory personnel that represent the application developers, system software, and tool providers in a concentrated effort to prepare the laboratory’s codes in advance of the system transitioning to production in 2018.more » Finally, this article presents LLNL’s overall application strategy, with a focus on how LLNL is collaborating with IBM and Nvidia to ensure a successful transition of its mission-oriented applications into the exascale era.« less

  18. GPU-accelerated simulations of isolated black holes

    NASA Astrophysics Data System (ADS)

    Lewis, Adam G. M.; Pfeiffer, Harald P.

    2018-05-01

    We present a port of the numerical relativity code SpEC which is capable of running on NVIDIA GPUs. Since this code must be maintained in parallel with SpEC itself, a primary design consideration is to perform as few explicit code changes as possible. We therefore rely on a hierarchy of automated porting strategies. At the highest level we use TLoops, a C++ library of our design, to automatically emit CUDA code equivalent to tensorial expressions written into C++ source using a syntax similar to analytic calculation. Next, we trace out and cache explicit matrix representations of the numerous linear transformations in the SpEC code, which allows these to be performed on the GPU using pre-existing matrix-multiplication libraries. We port the few remaining important modules by hand. In this paper we detail the specifics of our port, and present benchmarks of it simulating isolated black hole spacetimes on several generations of NVIDIA GPU.

  19. Application Modernization at LLNL and the Sierra Center of Excellence

    DOE PAGES

    Neely, J. Robert; de Supinski, Bronis R.

    2017-09-01

    We repport that in 2014, Lawrence Livermore National Laboratory began acquisition of Sierra, a pre-exascale system from IBM and Nvidia. It marks a significant shift in direction for LLNL by introducing the concept of heterogeneous computing via GPUs. LLNL’s mission requires application teams to prepare for this paradigm shift. Thus, the Sierra procurement required a proposed Center of Excellence that would align the expertise of the chosen vendors with laboratory personnel that represent the application developers, system software, and tool providers in a concentrated effort to prepare the laboratory’s codes in advance of the system transitioning to production in 2018.more » Finally, this article presents LLNL’s overall application strategy, with a focus on how LLNL is collaborating with IBM and Nvidia to ensure a successful transition of its mission-oriented applications into the exascale era.« less

  20. Measuring Cognitive Load in Test Items: Static Graphics versus Animated Graphics

    ERIC Educational Resources Information Center

    Dindar, M.; Kabakçi Yurdakul, I.; Inan Dönmez, F.

    2015-01-01

    The majority of multimedia learning studies focus on the use of graphics in learning process but very few of them examine the role of graphics in testing students' knowledge. This study investigates the use of static graphics versus animated graphics in a computer-based English achievement test from a cognitive load theory perspective. Three…

  1. Process and representation in graphical displays

    NASA Technical Reports Server (NTRS)

    Gillan, Douglas J.; Lewis, Robert; Rudisill, Marianne

    1990-01-01

    How people comprehend graphics is examined. Graphical comprehension involves the cognitive representation of information from a graphic display and the processing strategies that people apply to answer questions about graphics. Research on representation has examined both the features present in a graphic display and the cognitive representation of the graphic. The key features include the physical components of a graph, the relation between the figure and its axes, and the information in the graph. Tests of people's memory for graphs indicate that both the physical and informational aspect of a graph are important in the cognitive representation of a graph. However, the physical (or perceptual) features overshadow the information to a large degree. Processing strategies also involve a perception-information distinction. In order to answer simple questions (e.g., determining the value of a variable, comparing several variables, and determining the mean of a set of variables), people switch between two information processing strategies: (1) an arithmetic, look-up strategy in which they use a graph much like a table, looking up values and performing arithmetic calculations; and (2) a perceptual strategy in which they use the spatial characteristics of the graph to make comparisons and estimations. The user's choice of strategies depends on the task and the characteristics of the graph. A theory of graphic comprehension is presented.

  2. Software Graphics Processing Unit (sGPU) for Deep Space Applications

    NASA Technical Reports Server (NTRS)

    McCabe, Mary; Salazar, George; Steele, Glen

    2015-01-01

    A graphics processing capability will be required for deep space missions and must include a range of applications, from safety-critical vehicle health status to telemedicine for crew health. However, preliminary radiation testing of commercial graphics processing cards suggest they cannot operate in the deep space radiation environment. Investigation into an Software Graphics Processing Unit (sGPU)comprised of commercial-equivalent radiation hardened/tolerant single board computers, field programmable gate arrays, and safety-critical display software shows promising results. Preliminary performance of approximately 30 frames per second (FPS) has been achieved. Use of multi-core processors may provide a significant increase in performance.

  3. Problems Related to Parallelization of CFD Algorithms on GPU, Multi-GPU and Hybrid Architectures

    NASA Astrophysics Data System (ADS)

    Biazewicz, Marek; Kurowski, Krzysztof; Ludwiczak, Bogdan; Napieraia, Krystyna

    2010-09-01

    Computational Fluid Dynamics (CFD) is one of the branches of fluid mechanics, which uses numerical methods and algorithms to solve and analyze fluid flows. CFD is used in various domains, such as oil and gas reservoir uncertainty analysis, aerodynamic body shapes optimization (e.g. planes, cars, ships, sport helmets, skis), natural phenomena analysis, numerical simulation for weather forecasting or realistic visualizations. CFD problem is very complex and needs a lot of computational power to obtain the results in a reasonable time. We have implemented a parallel application for two-dimensional CFD simulation with a free surface approximation (MAC method) using new hardware architectures, in particular multi-GPU and hybrid computing environments. For this purpose we decided to use NVIDIA graphic cards with CUDA environment due to its simplicity of programming and good computations performance. We used finite difference discretization of Navier-Stokes equations, where fluid is propagated over an Eulerian Grid. In this model, the behavior of the fluid inside the cell depends only on the properties of local, surrounding cells, therefore it is well suited for the GPU-based architecture. In this paper we demonstrate how to use efficiently the computing power of GPUs for CFD. Additionally, we present some best practices to help users analyze and improve the performance of CFD applications executed on GPU. Finally, we discuss various challenges around the multi-GPU implementation on the example of matrix multiplication.

  4. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Madduri, Kamesh; Im, Eun-Jin; Ibrahim, Khaled Z.

    The next decade of high-performance computing (HPC) systems will see a rapid evolution and divergence of multi- and manycore architectures as power and cooling constraints limit increases in microprocessor clock speeds. Understanding efficient optimization methodologies on diverse multicore designs in the context of demanding numerical methods is one of the greatest challenges faced today by the HPC community. In this paper, we examine the efficient multicore optimization of GTC, a petascale gyrokinetic toroidal fusion code for studying plasma microturbulence in tokamak devices. For GTC’s key computational components (charge deposition and particle push), we explore efficient parallelization strategies across a broadmore » range of emerging multicore designs, including the recently-released Intel Nehalem-EX, the AMD Opteron Istanbul, and the highly multithreaded Sun UltraSparc T2+. We also present the first study on tuning gyrokinetic particle-in-cell (PIC) algorithms for graphics processors, using the NVIDIA C2050 (Fermi). Our work discusses several novel optimization approaches for gyrokinetic PIC, including mixed-precision computation, particle binning and decomposition strategies, grid replication, SIMDized atomic floating-point operations, and effective GPU texture memory utilization. Overall, we achieve significant performance improvements of 1.3–4.7× on these complex PIC kernels, despite the inherent challenges of data dependency and locality. Finally, our work also points to several architectural and programming features that could significantly enhance PIC performance and productivity on next-generation architectures.« less

  5. Rapid data processing for ultrafast X-ray computed tomography using scalable and modular CUDA based pipelines

    NASA Astrophysics Data System (ADS)

    Frust, Tobias; Wagner, Michael; Stephan, Jan; Juckeland, Guido; Bieberle, André

    2017-10-01

    Ultrafast X-ray tomography is an advanced imaging technique for the study of dynamic processes basing on the principles of electron beam scanning. A typical application case for this technique is e.g. the study of multiphase flows, that is, flows of mixtures of substances such as gas-liquidflows in pipelines or chemical reactors. At Helmholtz-Zentrum Dresden-Rossendorf (HZDR) a number of such tomography scanners are operated. Currently, there are two main points limiting their application in some fields. First, after each CT scan sequence the data of the radiation detector must be downloaded from the scanner to a data processing machine. Second, the current data processing is comparably time-consuming compared to the CT scan sequence interval. To enable online observations or use this technique to control actuators in real-time, a modular and scalable data processing tool has been developed, consisting of user-definable stages working independently together in a so called data processing pipeline, that keeps up with the CT scanner's maximal frame rate of up to 8 kHz. The newly developed data processing stages are freely programmable and combinable. In order to achieve the highest processing performance all relevant data processing steps, which are required for a standard slice image reconstruction, were individually implemented in separate stages using Graphics Processing Units (GPUs) and NVIDIA's CUDA programming language. Data processing performance tests on different high-end GPUs (Tesla K20c, GeForce GTX 1080, Tesla P100) showed excellent performance. Program Files doi:http://dx.doi.org/10.17632/65sx747rvm.1 Licensing provisions: LGPLv3 Programming language: C++/CUDA Supplementary material: Test data set, used for the performance analysis. Nature of problem: Ultrafast computed tomography is performed with a scan rate of up to 8 kHz. To obtain cross-sectional images from projection data computer-based image reconstruction algorithms must be applied. The objective of the presented program is to reconstruct a data stream of around 1.3 GB s-1 in a minimum time period. Thus, the program allows to go into new fields of application and to use in the future even more compute-intensive algorithms, especially for data post-processing, to improve the quality of data analysis. Solution method: The program solves the given problem using a two-step process: first, by a generic, expandable and widely applicable template library implementing the streaming paradigm (GLADOS); second, by optimized processing stages for ultrafast computed tomography implementing the required algorithms in a performance-oriented way using CUDA (RISA). Thereby, task-parallelism between the processing stages as well as data parallelism within one processing stage is realized.

  6. Process and representation in graphical displays

    NASA Technical Reports Server (NTRS)

    Gillan, Douglas J.; Lewis, Robert; Rudisill, Marianne

    1993-01-01

    Our initial model of graphic comprehension has focused on statistical graphs. Like other models of human-computer interaction, models of graphical comprehension can be used by human-computer interface designers and developers to create interfaces that present information in an efficient and usable manner. Our investigation of graph comprehension addresses two primary questions: how do people represent the information contained in a data graph?; and how do they process information from the graph? The topics of focus for graphic representation concern the features into which people decompose a graph and the representations of the graph in memory. The issue of processing can be further analyzed as two questions: what overall processing strategies do people use?; and what are the specific processing skills required?

  7. Integration of rocket turbine design and analysis through computer graphics

    NASA Technical Reports Server (NTRS)

    Hsu, Wayne; Boynton, Jim

    1988-01-01

    An interactive approach with engineering computer graphics is used to integrate the design and analysis processes of a rocket engine turbine into a progressive and iterative design procedure. The processes are interconnected through pre- and postprocessors. The graphics are used to generate the blade profiles, their stacking, finite element generation, and analysis presentation through color graphics. Steps of the design process discussed include pitch-line design, axisymmetric hub-to-tip meridional design, and quasi-three-dimensional analysis. The viscous two- and three-dimensional analysis codes are executed after acceptable designs are achieved and estimates of initial losses are confirmed.

  8. 76 FR 384 - Certain Semiconductor Chips and Products Containing Same; Notice of Investigation

    Federal Register 2010, 2011, 2012, 2013, 2014

    2011-01-04

    ..., Dusing Road 1, Hsinchu Science Park, Hsin-Chu, Taiwan 30078. nVidia Corporation, 2701 San Tomas... respondent, to find the facts to be as alleged in the complaint and this notice and to enter an initial...

  9. 75 FR 25294 - Notice Pursuant to the National Cooperative Research and Production Act of 1993-DVD Copy Control...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2010-05-07

    ..., Baarlo Noord Limburg, THE NETHERLANDS; MIT Technology Co., Ltd., Dongguan, Guangdong, PEOPLE'S REPUBLIC... media b.v., Tilburg, THE NETHERLANDS; Mattel Inc., El Segundo, CA; nVidia Corporation, Santa Clara, CA...

  10. Graphic Design Is Not a Medium.

    ERIC Educational Resources Information Center

    Gruber, John Edward, Jr.

    2001-01-01

    Discusses graphic design and reviews its development from analog processes to a digital tool with the use of computers. Topics include graphical user interfaces; the need for visual communication concepts; transmedia as opposed to repurposing; and graphic design instruction in higher education. (LRW)

  11. CPU-GPU mixed implementation of virtual node method for real-time interactive cutting of deformable objects using OpenCL.

    PubMed

    Jia, Shiyu; Zhang, Weizhong; Yu, Xiaokang; Pan, Zhenkuan

    2015-09-01

    Surgical simulators need to simulate interactive cutting of deformable objects in real time. The goal of this work was to design an interactive cutting algorithm that eliminates traditional cutting state classification and can work simultaneously with real-time GPU-accelerated deformation without affecting its numerical stability. A modified virtual node method for cutting is proposed. Deformable object is modeled as a real tetrahedral mesh embedded in a virtual tetrahedral mesh, and the former is used for graphics rendering and collision, while the latter is used for deformation. Cutting algorithm first subdivides real tetrahedrons to eliminate all face and edge intersections, then splits faces, edges and vertices along cutting tool trajectory to form cut surfaces. Next virtual tetrahedrons containing more than one connected real tetrahedral fragments are duplicated, and connectivity between virtual tetrahedrons is updated. Finally, embedding relationship between real and virtual tetrahedral meshes is updated. Co-rotational linear finite element method is used for deformation. Cutting and collision are processed by CPU, while deformation is carried out by GPU using OpenCL. Efficiency of GPU-accelerated deformation algorithm was tested using block models with varying numbers of tetrahedrons. Effectiveness of our cutting algorithm under multiple cuts and self-intersecting cuts was tested using a block model and a cylinder model. Cutting of a more complex liver model was performed, and detailed performance characteristics of cutting, deformation and collision were measured and analyzed. Our cutting algorithm can produce continuous cut surfaces when traditional minimal element creation algorithm fails. Our GPU-accelerated deformation algorithm remains stable with constant time step under multiple arbitrary cuts and works on both NVIDIA and AMD GPUs. GPU-CPU speed ratio can be as high as 10 for models with 80,000 tetrahedrons. Forty to sixty percent real-time performance and 100-200 Hz simulation rate are achieved for the liver model with 3,101 tetrahedrons. Major bottlenecks for simulation efficiency are cutting, collision processing and CPU-GPU data transfer. Future work needs to improve on these areas.

  12. Rapid indirect trajectory optimization on highly parallel computing architectures

    NASA Astrophysics Data System (ADS)

    Antony, Thomas

    Trajectory optimization is a field which can benefit greatly from the advantages offered by parallel computing. The current state-of-the-art in trajectory optimization focuses on the use of direct optimization methods, such as the pseudo-spectral method. These methods are favored due to their ease of implementation and large convergence regions while indirect methods have largely been ignored in the literature in the past decade except for specific applications in astrodynamics. It has been shown that the shortcomings conventionally associated with indirect methods can be overcome by the use of a continuation method in which complex trajectory solutions are obtained by solving a sequence of progressively difficult optimization problems. High performance computing hardware is trending towards more parallel architectures as opposed to powerful single-core processors. Graphics Processing Units (GPU), which were originally developed for 3D graphics rendering have gained popularity in the past decade as high-performance, programmable parallel processors. The Compute Unified Device Architecture (CUDA) framework, a parallel computing architecture and programming model developed by NVIDIA, is one of the most widely used platforms in GPU computing. GPUs have been applied to a wide range of fields that require the solution of complex, computationally demanding problems. A GPU-accelerated indirect trajectory optimization methodology which uses the multiple shooting method and continuation is developed using the CUDA platform. The various algorithmic optimizations used to exploit the parallelism inherent in the indirect shooting method are described. The resulting rapid optimal control framework enables the construction of high quality optimal trajectories that satisfy problem-specific constraints and fully satisfy the necessary conditions of optimality. The benefits of the framework are highlighted by construction of maximum terminal velocity trajectories for a hypothetical long range weapon system. The techniques used to construct an initial guess from an analytic near-ballistic trajectory and the methods used to formulate the necessary conditions of optimality in a manner that is transparent to the designer are discussed. Various hypothetical mission scenarios that enforce different combinations of initial, terminal, interior point and path constraints demonstrate the rapid construction of complex trajectories without requiring any a-priori insight into the structure of the solutions. Trajectory problems of this kind were previously considered impractical to solve using indirect methods. The performance of the GPU-accelerated solver is found to be 2x--4x faster than MATLAB's bvp4c, even while running on GPU hardware that is five years behind the state-of-the-art.

  13. Analysis of the Finite Precision s-Step Biconjugate Gradient Method

    DTIC Science & Technology

    2014-03-13

    Center for Future Architecture Research, a member of STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA, and ASPIRE Lab...industrial sponsors and affiliates Intel, Google, Nokia, NVIDIA , Oracle, and Samsung. Any opinions, findings, conclusions, or recommendations in this

  14. Power and Performance Trade-offs for Space Time Adaptive Processing

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Gawande, Nitin A.; Manzano Franco, Joseph B.; Tumeo, Antonino

    Computational efficiency – performance relative to power or energy – is one of the most important concerns when designing RADAR processing systems. This paper analyzes power and performance trade-offs for a typical Space Time Adaptive Processing (STAP) application. We study STAP implementations for CUDA and OpenMP on two computationally efficient architectures, Intel Haswell Core I7-4770TE and NVIDIA Kayla with a GK208 GPU. We analyze the power and performance of STAP’s computationally intensive kernels across the two hardware testbeds. We also show the impact and trade-offs of GPU optimization techniques. We show that data parallelism can be exploited for efficient implementationmore » on the Haswell CPU architecture. The GPU architecture is able to process large size data sets without increase in power requirement. The use of shared memory has a significant impact on the power requirement for the GPU. A balance between the use of shared memory and main memory access leads to an improved performance in a typical STAP application.« less

  15. The Effects of Interactive Graphics Analogies on Recall of Concepts in Science

    DTIC Science & Technology

    1976-08-01

    processing , in the Craik and Lockhart sense, were induced by this postlesson condition. 3. The fact that students were able to deal with both...higher scores on a graphics posttest in Experiment III. These results suggest that both shallow and deep processing , in the Craik and Lockhart ...graphics posttest in Experiment III. These results suggest that both shallow and deep processing , in the Cralk and Lockhart sense, were induced by

  16. Graphic Arts: Process Camera, Stripping, and Platemaking. Fourth Edition. Teacher Edition [and] Student Edition.

    ERIC Educational Resources Information Center

    Multistate Academic and Vocational Curriculum Consortium, Stillwater, OK.

    This publication contains both a teacher edition and a student edition of materials for a course in graphic arts that covers the process camera, stripping, and platemaking. The course introduces basic concepts and skills necessary for entry-level employment in a graphic communication occupation. The contents of the materials are tied to measurable…

  17. Real-time volumetric image reconstruction and 3D tumor localization based on a single x-ray projection image for lung cancer radiotherapy.

    PubMed

    Li, Ruijiang; Jia, Xun; Lewis, John H; Gu, Xuejun; Folkerts, Michael; Men, Chunhua; Jiang, Steve B

    2010-06-01

    To develop an algorithm for real-time volumetric image reconstruction and 3D tumor localization based on a single x-ray projection image for lung cancer radiotherapy. Given a set of volumetric images of a patient at N breathing phases as the training data, deformable image registration was performed between a reference phase and the other N-1 phases, resulting in N-1 deformation vector fields (DVFs). These DVFs can be represented efficiently by a few eigenvectors and coefficients obtained from principal component analysis (PCA). By varying the PCA coefficients, new DVFs can be generated, which, when applied on the reference image, lead to new volumetric images. A volumetric image can then be reconstructed from a single projection image by optimizing the PCA coefficients such that its computed projection matches the measured one. The 3D location of the tumor can be derived by applying the inverted DVF on its position in the reference image. The algorithm was implemented on graphics processing units (GPUs) to achieve real-time efficiency. The training data were generated using a realistic and dynamic mathematical phantom with ten breathing phases. The testing data were 360 cone beam projections corresponding to one gantry rotation, simulated using the same phantom with a 50% increase in breathing amplitude. The average relative image intensity error of the reconstructed volumetric images is 6.9% +/- 2.4%. The average 3D tumor localization error is 0.8 +/- 0.5 mm. On an NVIDIA Tesla C1060 GPU card, the average computation time for reconstructing a volumetric image from each projection is 0.24 s (range: 0.17 and 0.35 s). The authors have shown the feasibility of reconstructing volumetric images and localizing tumor positions in 3D in near real-time from a single x-ray image.

  18. Visual Analytics of integrated Data Systems for Space Weather Purposes

    NASA Astrophysics Data System (ADS)

    Rosa, Reinaldo; Veronese, Thalita; Giovani, Paulo

    Analysis of information from multiple data sources obtained through high resolution instrumental measurements has become a fundamental task in all scientific areas. The development of expert methods able to treat such multi-source data systems, with both large variability and measurement extension, is a key for studying complex scientific phenomena, especially those related to systemic analysis in space and environmental sciences. In this talk, we present a time series generalization introducing the concept of generalized numerical lattice, which represents a discrete sequence of temporal measures for a given variable. In this novel representation approach each generalized numerical lattice brings post-analytical data information. We define a generalized numerical lattice as a set of three parameters representing the following data properties: dimensionality, size and post-analytical measure (e.g., the autocorrelation, Hurst exponent, etc)[1]. From this representation generalization, any multi-source database can be reduced to a closed set of classified time series in spatiotemporal generalized dimensions. As a case study, we show a preliminary application in space science data, highlighting the possibility of a real time analysis expert system. In this particular application, we have selected and analyzed, using detrended fluctuation analysis (DFA), several decimetric solar bursts associated to X flare-classes. The association with geomagnetic activity is also reported. DFA method is performed in the framework of a radio burst automatic monitoring system. Our results may characterize the variability pattern evolution, computing the DFA scaling exponent, scanning the time series by a short windowing before the extreme event [2]. For the first time, the application of systematic fluctuation analysis for space weather purposes is presented. The prototype for visual analytics is implemented in a Compute Unified Device Architecture (CUDA) by using the K20 Nvidia graphics processing units (GPUs) to reduce the integrated analysis runtime. [1] Veronese et al. doi: 10.6062/jcis.2009.01.02.0021, 2010. [2] Veronese et al. doi:http://dx.doi.org/10.1016/j.jastp.2010.09.030, 2011.

  19. A GPU-accelerated and Monte Carlo-based intensity modulated proton therapy optimization system.

    PubMed

    Ma, Jiasen; Beltran, Chris; Seum Wan Chan Tseung, Hok; Herman, Michael G

    2014-12-01

    Conventional spot scanning intensity modulated proton therapy (IMPT) treatment planning systems (TPSs) optimize proton spot weights based on analytical dose calculations. These analytical dose calculations have been shown to have severe limitations in heterogeneous materials. Monte Carlo (MC) methods do not have these limitations; however, MC-based systems have been of limited clinical use due to the large number of beam spots in IMPT and the extremely long calculation time of traditional MC techniques. In this work, the authors present a clinically applicable IMPT TPS that utilizes a very fast MC calculation. An in-house graphics processing unit (GPU)-based MC dose calculation engine was employed to generate the dose influence map for each proton spot. With the MC generated influence map, a modified least-squares optimization method was used to achieve the desired dose volume histograms (DVHs). The intrinsic CT image resolution was adopted for voxelization in simulation and optimization to preserve spatial resolution. The optimizations were computed on a multi-GPU framework to mitigate the memory limitation issues for the large dose influence maps that resulted from maintaining the intrinsic CT resolution. The effects of tail cutoff and starting condition were studied and minimized in this work. For relatively large and complex three-field head and neck cases, i.e., >100,000 spots with a target volume of ∼ 1000 cm(3) and multiple surrounding critical structures, the optimization together with the initial MC dose influence map calculation was done in a clinically viable time frame (less than 30 min) on a GPU cluster consisting of 24 Nvidia GeForce GTX Titan cards. The in-house MC TPS plans were comparable to a commercial TPS plans based on DVH comparisons. A MC-based treatment planning system was developed. The treatment planning can be performed in a clinically viable time frame on a hardware system costing around 45,000 dollars. The fast calculation and optimization make the system easily expandable to robust and multicriteria optimization.

  20. An efficient mixed-precision, hybrid CPU-GPU implementation of a nonlinearly implicit one-dimensional particle-in-cell algorithm

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Chen, Guangye; Chacon, Luis; Barnes, Daniel C

    2012-01-01

    Recently, a fully implicit, energy- and charge-conserving particle-in-cell method has been developed for multi-scale, full-f kinetic simulations [G. Chen, et al., J. Comput. Phys. 230, 18 (2011)]. The method employs a Jacobian-free Newton-Krylov (JFNK) solver and is capable of using very large timesteps without loss of numerical stability or accuracy. A fundamental feature of the method is the segregation of particle orbit integrations from the field solver, while remaining fully self-consistent. This provides great flexibility, and dramatically improves the solver efficiency by reducing the degrees of freedom of the associated nonlinear system. However, it requires a particle push per nonlinearmore » residual evaluation, which makes the particle push the most time-consuming operation in the algorithm. This paper describes a very efficient mixed-precision, hybrid CPU-GPU implementation of the implicit PIC algorithm. The JFNK solver is kept on the CPU (in double precision), while the inherent data parallelism of the particle mover is exploited by implementing it in single-precision on a graphics processing unit (GPU) using CUDA. Performance-oriented optimizations, with the aid of an analytical performance model, the roofline model, are employed. Despite being highly dynamic, the adaptive, charge-conserving particle mover algorithm achieves up to 300 400 GOp/s (including single-precision floating-point, integer, and logic operations) on a Nvidia GeForce GTX580, corresponding to 20 25% absolute GPU efficiency (against the peak theoretical performance) and 50-70% intrinsic efficiency (against the algorithm s maximum operational throughput, which neglects all latencies). This is about 200-300 times faster than an equivalent serial CPU implementation. When the single-precision GPU particle mover is combined with a double-precision CPU JFNK field solver, overall performance gains 100 vs. the double-precision CPU-only serial version are obtained, with no apparent loss of robustness or accuracy when applied to a challenging long-time scale ion acoustic wave simulation.« less

  1. Efficient implementation of the 3D-DDA ray traversal algorithm on GPU and its application in radiation dose calculation.

    PubMed

    Xiao, Kai; Chen, Danny Z; Hu, X Sharon; Zhou, Bo

    2012-12-01

    The three-dimensional digital differential analyzer (3D-DDA) algorithm is a widely used ray traversal method, which is also at the core of many convolution∕superposition (C∕S) dose calculation approaches. However, porting existing C∕S dose calculation methods onto graphics processing unit (GPU) has brought challenges to retaining the efficiency of this algorithm. In particular, straightforward implementation of the original 3D-DDA algorithm inflicts a lot of branch divergence which conflicts with the GPU programming model and leads to suboptimal performance. In this paper, an efficient GPU implementation of the 3D-DDA algorithm is proposed, which effectively reduces such branch divergence and improves performance of the C∕S dose calculation programs running on GPU. The main idea of the proposed method is to convert a number of conditional statements in the original 3D-DDA algorithm into a set of simple operations (e.g., arithmetic, comparison, and logic) which are better supported by the GPU architecture. To verify and demonstrate the performance improvement, this ray traversal method was integrated into a GPU-based collapsed cone convolution∕superposition (CCCS) dose calculation program. The proposed method has been tested using a water phantom and various clinical cases on an NVIDIA GTX570 GPU. The CCCS dose calculation program based on the efficient 3D-DDA ray traversal implementation runs 1.42 ∼ 2.67× faster than the one based on the original 3D-DDA implementation, without losing any accuracy. The results show that the proposed method can effectively reduce branch divergence in the original 3D-DDA ray traversal algorithm and improve the performance of the CCCS program running on GPU. Considering the wide utilization of the 3D-DDA algorithm, various applications can benefit from this implementation method.

  2. GPU-accelerated Monte Carlo convolution/superposition implementation for dose calculation.

    PubMed

    Zhou, Bo; Yu, Cedric X; Chen, Danny Z; Hu, X Sharon

    2010-11-01

    Dose calculation is a key component in radiation treatment planning systems. Its performance and accuracy are crucial to the quality of treatment plans as emerging advanced radiation therapy technologies are exerting ever tighter constraints on dose calculation. A common practice is to choose either a deterministic method such as the convolution/superposition (CS) method for speed or a Monte Carlo (MC) method for accuracy. The goal of this work is to boost the performance of a hybrid Monte Carlo convolution/superposition (MCCS) method by devising a graphics processing unit (GPU) implementation so as to make the method practical for day-to-day usage. Although the MCCS algorithm combines the merits of MC fluence generation and CS fluence transport, it is still not fast enough to be used as a day-to-day planning tool. To alleviate the speed issue of MC algorithms, the authors adopted MCCS as their target method and implemented a GPU-based version. In order to fully utilize the GPU computing power, the MCCS algorithm is modified to match the GPU hardware architecture. The performance of the authors' GPU-based implementation on an Nvidia GTX260 card is compared to a multithreaded software implementation on a quad-core system. A speedup in the range of 6.7-11.4x is observed for the clinical cases used. The less than 2% statistical fluctuation also indicates that the accuracy of the authors' GPU-based implementation is in good agreement with the results from the quad-core CPU implementation. This work shows that GPU is a feasible and cost-efficient solution compared to other alternatives such as using cluster machines or field-programmable gate arrays for satisfying the increasing demands on computation speed and accuracy of dose calculation. But there are also inherent limitations of using GPU for accelerating MC-type applications, which are also analyzed in detail in this article.

  3. Memory transfer optimization for a lattice Boltzmann solver on Kepler architecture nVidia GPUs

    NASA Astrophysics Data System (ADS)

    Mawson, Mark J.; Revell, Alistair J.

    2014-10-01

    The Lattice Boltzmann method (LBM) for solving fluid flow is naturally well suited to an efficient implementation for massively parallel computing, due to the prevalence of local operations in the algorithm. This paper presents and analyses the performance of a 3D lattice Boltzmann solver, optimized for third generation nVidia GPU hardware, also known as 'Kepler'. We provide a review of previous optimization strategies and analyse data read/write times for different memory types. In LBM, the time propagation step (known as streaming), involves shifting data to adjacent locations and is central to parallel performance; here we examine three approaches which make use of different hardware options. Two of which make use of 'performance enhancing' features of the GPU; shared memory and the new shuffle instruction found in Kepler based GPUs. These are compared to a standard transfer of data which relies instead on optimized storage to increase coalesced access. It is shown that the more simple approach is most efficient; since the need for large numbers of registers per thread in LBM limits the block size and thus the efficiency of these special features is reduced. Detailed results are obtained for a D3Q19 LBM solver, which is benchmarked on nVidia K5000M and K20C GPUs. In the latter case the use of a read-only data cache is explored, and peak performance of over 1036 Million Lattice Updates Per Second (MLUPS) is achieved. The appearance of a periodic bottleneck in the solver performance is also reported, believed to be hardware related; spikes in iteration-time occur with a frequency of around 11 Hz for both GPUs, independent of the size of the problem.

  4. GPU-based relative fuzzy connectedness image segmentation.

    PubMed

    Zhuge, Ying; Ciesielski, Krzysztof C; Udupa, Jayaram K; Miller, Robert W

    2013-01-01

    Recently, clinical radiological research and practice are becoming increasingly quantitative. Further, images continue to increase in size and volume. For quantitative radiology to become practical, it is crucial that image segmentation algorithms and their implementations are rapid and yield practical run time on very large data sets. The purpose of this paper is to present a parallel version of an algorithm that belongs to the family of fuzzy connectedness (FC) algorithms, to achieve an interactive speed for segmenting large medical image data sets. The most common FC segmentations, optimizing an [script-l](∞)-based energy, are known as relative fuzzy connectedness (RFC) and iterative relative fuzzy connectedness (IRFC). Both RFC and IRFC objects (of which IRFC contains RFC) can be found via linear time algorithms, linear with respect to the image size. The new algorithm, P-ORFC (for parallel optimal RFC), which is implemented by using NVIDIA's Compute Unified Device Architecture (CUDA) platform, considerably improves the computational speed of the above mentioned CPU based IRFC algorithm. Experiments based on four data sets of small, medium, large, and super data size, achieved speedup factors of 32.8×, 22.9×, 20.9×, and 17.5×, correspondingly, on the NVIDIA Tesla C1060 platform. Although the output of P-ORFC need not precisely match that of IRFC output, it is very close to it and, as the authors prove, always lies between the RFC and IRFC objects. A parallel version of a top-of-the-line algorithm in the family of FC has been developed on the NVIDIA GPUs. An interactive speed of segmentation has been achieved, even for the largest medical image data set. Such GPU implementations may play a crucial role in automatic anatomy recognition in clinical radiology.

  5. HMI conventions for process control graphics.

    PubMed

    Pikaar, Ruud N

    2012-01-01

    Process operators supervise and control complex processes. To enable the operator to do an adequate job, instrumentation and process control engineers need to address several related topics, such as console design, information design, navigation, and alarm management. In process control upgrade projects, usually a 1:1 conversion of existing graphics is proposed. This paper suggests another approach, efficiently leading to a reduced number of new powerful process graphics, supported by a permanent process overview displays. In addition a road map for structuring content (process information) and conventions for the presentation of objects, symbols, and so on, has been developed. The impact of the human factors engineering approach on process control upgrade projects is illustrated by several cases.

  6. Acceleration of integral imaging based incoherent Fourier hologram capture using graphic processing unit.

    PubMed

    Jeong, Kyeong-Min; Kim, Hee-Seung; Hong, Sung-In; Lee, Sung-Keun; Jo, Na-Young; Kim, Yong-Soo; Lim, Hong-Gi; Park, Jae-Hyeung

    2012-10-08

    Speed enhancement of integral imaging based incoherent Fourier hologram capture using a graphic processing unit is reported. Integral imaging based method enables exact hologram capture of real-existing three-dimensional objects under regular incoherent illumination. In our implementation, we apply parallel computation scheme using the graphic processing unit, accelerating the processing speed. Using enhanced speed of hologram capture, we also implement a pseudo real-time hologram capture and optical reconstruction system. The overall operation speed is measured to be 1 frame per second.

  7. Write Is Right: Using Graphic Organizers to Improve Student Mathematical Problem Solving

    ERIC Educational Resources Information Center

    Zollman, Alan

    2012-01-01

    Teachers have used graphic organizers successfully in teaching the writing process. This paper describes graphic organizers and their potential mathematics benefits for both students and teachers, elucidates a specific graphic organizer adaptation for mathematical problem solving, and discusses results using the "four-corners-and-a-diamond"…

  8. Building Regression Models: The Importance of Graphics.

    ERIC Educational Resources Information Center

    Dunn, Richard

    1989-01-01

    Points out reasons for using graphical methods to teach simple and multiple regression analysis. Argues that a graphically oriented approach has considerable pedagogic advantages in the exposition of simple and multiple regression. Shows that graphical methods may play a central role in the process of building regression models. (Author/LS)

  9. Mathematical Creative Activity and the Graphic Calculator

    ERIC Educational Resources Information Center

    Duda, Janina

    2011-01-01

    Teaching mathematics using graphic calculators has been an issue of didactic discussions for years. Finding ways in which graphic calculators can enrich the development process of creative activity in mathematically gifted students between the ages of 16-17 is the focus of this article. Research was conducted using graphic calculators with…

  10. Defining Identities through Multiliteracies: EL Teens Narrate Their Immigration Experiences as Graphic Stories

    ERIC Educational Resources Information Center

    Danzak, Robin L.

    2011-01-01

    Based on a framework of identity-as-narrative and multiliteracies, this article describes "Graphic Journeys," a multimedia literacy project in which English learners (ELs) in middle school created graphic stories that expressed their families' immigration experiences. The process involved reading graphic novels, journaling, interviewing, and…

  11. Pushing Memory Bandwidth Limitations Through Efficient Implementations of Block-Krylov Space Solvers on GPUs

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Clark, M. A.; Strelchenko, Alexei; Vaquero, Alejandro

    Lattice quantum chromodynamics simulations in nuclear physics have benefited from a tremendous number of algorithmic advances such as multigrid and eigenvector deflation. These improve the time to solution but do not alleviate the intrinsic memory-bandwidth constraints of the matrix-vector operation dominating iterative solvers. Batching this operation for multiple vectors and exploiting cache and register blocking can yield a super-linear speed up. Block-Krylov solvers can naturally take advantage of such batched matrix-vector operations, further reducing the iterations to solution by sharing the Krylov space between solves. However, practical implementations typically suffer from the quadratic scaling in the number of vector-vector operations.more » Using the QUDA library, we present an implementation of a block-CG solver on NVIDIA GPUs which reduces the memory-bandwidth complexity of vector-vector operations from quadratic to linear. We present results for the HISQ discretization, showing a 5x speedup compared to highly-optimized independent Krylov solves on NVIDIA's SaturnV cluster.« less

  12. High performance in silico virtual drug screening on many-core processors.

    PubMed

    McIntosh-Smith, Simon; Price, James; Sessions, Richard B; Ibarra, Amaurys A

    2015-05-01

    Drug screening is an important part of the drug development pipeline for the pharmaceutical industry. Traditional, lab-based methods are increasingly being augmented with computational methods, ranging from simple molecular similarity searches through more complex pharmacophore matching to more computationally intensive approaches, such as molecular docking. The latter simulates the binding of drug molecules to their targets, typically protein molecules. In this work, we describe BUDE, the Bristol University Docking Engine, which has been ported to the OpenCL industry standard parallel programming language in order to exploit the performance of modern many-core processors. Our highly optimized OpenCL implementation of BUDE sustains 1.43 TFLOP/s on a single Nvidia GTX 680 GPU, or 46% of peak performance. BUDE also exploits OpenCL to deliver effective performance portability across a broad spectrum of different computer architectures from different vendors, including GPUs from Nvidia and AMD, Intel's Xeon Phi and multi-core CPUs with SIMD instruction sets.

  13. High performance in silico virtual drug screening on many-core processors

    PubMed Central

    Price, James; Sessions, Richard B; Ibarra, Amaurys A

    2015-01-01

    Drug screening is an important part of the drug development pipeline for the pharmaceutical industry. Traditional, lab-based methods are increasingly being augmented with computational methods, ranging from simple molecular similarity searches through more complex pharmacophore matching to more computationally intensive approaches, such as molecular docking. The latter simulates the binding of drug molecules to their targets, typically protein molecules. In this work, we describe BUDE, the Bristol University Docking Engine, which has been ported to the OpenCL industry standard parallel programming language in order to exploit the performance of modern many-core processors. Our highly optimized OpenCL implementation of BUDE sustains 1.43 TFLOP/s on a single Nvidia GTX 680 GPU, or 46% of peak performance. BUDE also exploits OpenCL to deliver effective performance portability across a broad spectrum of different computer architectures from different vendors, including GPUs from Nvidia and AMD, Intel’s Xeon Phi and multi-core CPUs with SIMD instruction sets. PMID:25972727

  14. Robot graphic simulation testbed

    NASA Technical Reports Server (NTRS)

    Cook, George E.; Sztipanovits, Janos; Biegl, Csaba; Karsai, Gabor; Springfield, James F.

    1991-01-01

    The objective of this research was twofold. First, the basic capabilities of ROBOSIM (graphical simulation system) were improved and extended by taking advantage of advanced graphic workstation technology and artificial intelligence programming techniques. Second, the scope of the graphic simulation testbed was extended to include general problems of Space Station automation. Hardware support for 3-D graphics and high processing performance make high resolution solid modeling, collision detection, and simulation of structural dynamics computationally feasible. The Space Station is a complex system with many interacting subsystems. Design and testing of automation concepts demand modeling of the affected processes, their interactions, and that of the proposed control systems. The automation testbed was designed to facilitate studies in Space Station automation concepts.

  15. Parallel processor-based raster graphics system architecture

    DOEpatents

    Littlefield, Richard J.

    1990-01-01

    An apparatus for generating raster graphics images from the graphics command stream includes a plurality of graphics processors connected in parallel, each adapted to receive any part of the graphics command stream for processing the command stream part into pixel data. The apparatus also includes a frame buffer for mapping the pixel data to pixel locations and an interconnection network for interconnecting the graphics processors to the frame buffer. Through the interconnection network, each graphics processor may access any part of the frame buffer concurrently with another graphics processor accessing any other part of the frame buffer. The plurality of graphics processors can thereby transmit concurrently pixel data to pixel locations in the frame buffer.

  16. A New GPU-Enabled MODTRAN Thermal Model for the PLUME TRACKER Volcanic Emission Analysis Toolkit

    NASA Astrophysics Data System (ADS)

    Acharya, P. K.; Berk, A.; Guiang, C.; Kennett, R.; Perkins, T.; Realmuto, V. J.

    2013-12-01

    Real-time quantification of volcanic gaseous and particulate releases is important for (1) recognizing rapid increases in SO2 gaseous emissions which may signal an impending eruption; (2) characterizing ash clouds to enable safe and efficient commercial aviation; and (3) quantifying the impact of volcanic aerosols on climate forcing. The Jet Propulsion Laboratory (JPL) has developed state-of-the-art algorithms, embedded in their analyst-driven Plume Tracker toolkit, for performing SO2, NH3, and CH4 retrievals from remotely sensed multi-spectral Thermal InfraRed spectral imagery. While Plume Tracker provides accurate results, it typically requires extensive analyst time. A major bottleneck in this processing is the relatively slow but accurate FORTRAN-based MODTRAN atmospheric and plume radiance model, developed by Spectral Sciences, Inc. (SSI). To overcome this bottleneck, SSI in collaboration with JPL, is porting these slow thermal radiance algorithms onto massively parallel, relatively inexpensive and commercially-available GPUs. This paper discusses SSI's efforts to accelerate the MODTRAN thermal emission algorithms used by Plume Tracker. Specifically, we are developing a GPU implementation of the Curtis-Godson averaging and the Voigt in-band transmittances from near line center molecular absorption, which comprise the major computational bottleneck. The transmittance calculations were decomposed into separate functions, individually implemented as GPU kernels, and tested for accuracy and performance relative to the original CPU code. Speedup factors of 14 to 30× were realized for individual processing components on an NVIDIA GeForce GTX 295 graphics card with no loss of accuracy. Due to the separate host (CPU) and device (GPU) memory spaces, a redesign of the MODTRAN architecture was required to ensure efficient data transfer between host and device, and to facilitate high parallel throughput. Currently, we are incorporating the separate GPU kernels into a single function for calculating the Voigt in-band transmittance, and subsequently for integration into the re-architectured MODTRAN6 code. Our overall objective is that by combining the GPU processing with more efficient Plume Tracker retrieval algorithms, a 100-fold increase in the computational speed will be realized. Since the Plume Tracker runs on Windows-based platforms, the GPU-enhanced MODTRAN6 will be packaged as a DLL. We do however anticipate that the accelerated option will be made available to the general MODTRAN community through an application programming interface (API).

  17. 75 FR 32826 - Self-Regulatory Organizations; NASDAQ OMX PHLX, Inc.; Notice of Filing and Immediate...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2010-06-09

    ...''), American Express Company (``AXP''), Ciena Corp. (``CIEN''), Star Scientific, Inc. (``CIGX''), Dendreon Corp. (``DNDN''), eBay Inc. (``EBAY''), Corning Inc. (``GLW''), Halliburton Company (``HAL''), iShares Dow Jones US Real Estate (``IYR''), Motorola, Inc., (``MOT''), NVIDIA Corporation (``NVDA''), ON Semiconductor...

  18. Accelerated Application Development: The ORNL Titan Experience

    DOE PAGES

    Joubert, Wayne; Archibald, Richard K.; Berrill, Mark A.; ...

    2015-05-09

    The use of computational accelerators such as NVIDIA GPUs and Intel Xeon Phi processors is now widespread in the high performance computing community, with many applications delivering impressive performance gains. However, programming these systems for high performance, performance portability and software maintainability has been a challenge. In this paper we discuss experiences porting applications to the Titan system. Titan, which began planning in 2009 and was deployed for general use in 2013, was the first multi-petaflop system based on accelerator hardware. To ready applications for accelerated computing, a preparedness effort was undertaken prior to delivery of Titan. In this papermore » we report experiences and lessons learned from this process and describe how users are currently making use of computational accelerators on Titan.« less

  19. Accelerated application development: The ORNL Titan experience

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Joubert, Wayne; Archibald, Rick; Berrill, Mark

    2015-08-01

    The use of computational accelerators such as NVIDIA GPUs and Intel Xeon Phi processors is now widespread in the high performance computing community, with many applications delivering impressive performance gains. However, programming these systems for high performance, performance portability and software maintainability has been a challenge. In this paper we discuss experiences porting applications to the Titan system. Titan, which began planning in 2009 and was deployed for general use in 2013, was the first multi-petaflop system based on accelerator hardware. To ready applications for accelerated computing, a preparedness effort was undertaken prior to delivery of Titan. In this papermore » we report experiences and lessons learned from this process and describe how users are currently making use of computational accelerators on Titan.« less

  20. Real-time dedispersion for fast radio transient surveys, using auto tuning on many-core accelerators

    NASA Astrophysics Data System (ADS)

    Sclocco, A.; van Leeuwen, J.; Bal, H. E.; van Nieuwpoort, R. V.

    2016-01-01

    Dedispersion, the removal of deleterious smearing of impulsive signals by the interstellar matter, is one of the most intensive processing steps in any radio survey for pulsars and fast transients. We here present a study of the parallelization of this algorithm on many-core accelerators, including GPUs from AMD and NVIDIA, and the Intel Xeon Phi. We find that dedispersion is inherently memory-bound. Even in a perfect scenario, hardware limitations keep the arithmetic intensity low, thus limiting performance. We next exploit auto-tuning to adapt dedispersion to different accelerators, observations, and even telescopes. We demonstrate that the optimal settings differ between observational setups, and that auto-tuning significantly improves performance. This impacts time-domain surveys from Apertif to SKA.

  1. Autofocus method for automated microscopy using embedded GPUs.

    PubMed

    Castillo-Secilla, J M; Saval-Calvo, M; Medina-Valdès, L; Cuenca-Asensi, S; Martínez-Álvarez, A; Sánchez, C; Cristóbal, G

    2017-03-01

    In this paper we present a method for autofocusing images of sputum smears taken from a microscope which combines the finding of the optimal focus distance with an algorithm for extending the depth of field (EDoF). Our multifocus fusion method produces an unique image where all the relevant objects of the analyzed scene are well focused, independently to their distance to the sensor. This process is computationally expensive which makes unfeasible its automation using traditional embedded processors. For this purpose a low-cost optimized implementation is proposed using limited resources embedded GPU integrated on cutting-edge NVIDIA system on chip. The extensive tests performed on different sputum smear image sets show the real-time capabilities of our implementation maintaining the quality of the output image.

  2. Engineering visualization utilizing advanced animation

    NASA Technical Reports Server (NTRS)

    Sabionski, Gunter R.; Robinson, Thomas L., Jr.

    1989-01-01

    Engineering visualization is the use of computer graphics to depict engineering analysis and simulation in visual form from project planning through documentation. Graphics displays let engineers see data represented dynamically which permits the quick evaluation of results. The current state of graphics hardware and software generally allows the creation of two types of 3D graphics. The use of animated video as an engineering visualization tool is presented. The engineering, animation, and videography aspects of animated video production are each discussed. Specific issues include the integration of staffing expertise, hardware, software, and the various production processes. A detailed explanation of the animation process reveals the capabilities of this unique engineering visualization method. Automation of animation and video production processes are covered and future directions are proposed.

  3. Target Information Processing: A Joint Decision and Estimation Approach

    DTIC Science & Technology

    2012-03-29

    ground targets ( track - before - detect ) using computer cluster and graphics processing unit. Estimation and filtering theory is one of the most important...targets ( track - before - detect ) using computer cluster and graphics processing unit. Estimation and filtering theory is one of the most important

  4. Orthorectification by Using Gpgpu Method

    NASA Astrophysics Data System (ADS)

    Sahin, H.; Kulur, S.

    2012-07-01

    Thanks to the nature of the graphics processing, the newly released products offer highly parallel processing units with high-memory bandwidth and computational power of more than teraflops per second. The modern GPUs are not only powerful graphic engines but also they are high level parallel programmable processors with very fast computing capabilities and high-memory bandwidth speed compared to central processing units (CPU). Data-parallel computations can be shortly described as mapping data elements to parallel processing threads. The rapid development of GPUs programmability and capabilities attracted the attentions of researchers dealing with complex problems which need high level calculations. This interest has revealed the concepts of "General Purpose Computation on Graphics Processing Units (GPGPU)" and "stream processing". The graphic processors are powerful hardware which is really cheap and affordable. So the graphic processors became an alternative to computer processors. The graphic chips which were standard application hardware have been transformed into modern, powerful and programmable processors to meet the overall needs. Especially in recent years, the phenomenon of the usage of graphics processing units in general purpose computation has led the researchers and developers to this point. The biggest problem is that the graphics processing units use different programming models unlike current programming methods. Therefore, an efficient GPU programming requires re-coding of the current program algorithm by considering the limitations and the structure of the graphics hardware. Currently, multi-core processors can not be programmed by using traditional programming methods. Event procedure programming method can not be used for programming the multi-core processors. GPUs are especially effective in finding solution for repetition of the computing steps for many data elements when high accuracy is needed. Thus, it provides the computing process more quickly and accurately. Compared to the GPUs, CPUs which perform just one computing in a time according to the flow control are slower in performance. This structure can be evaluated for various applications of computer technology. In this study covers how general purpose parallel programming and computational power of the GPUs can be used in photogrammetric applications especially direct georeferencing. The direct georeferencing algorithm is coded by using GPGPU method and CUDA (Compute Unified Device Architecture) programming language. Results provided by this method were compared with the traditional CPU programming. In the other application the projective rectification is coded by using GPGPU method and CUDA programming language. Sample images of various sizes, as compared to the results of the program were evaluated. GPGPU method can be used especially in repetition of same computations on highly dense data, thus finding the solution quickly.

  5. Gyrokinetic particle-in-cell optimization on emerging multi- and manycore platforms

    DOE PAGES

    Madduri, Kamesh; Im, Eun-Jin; Ibrahim, Khaled Z.; ...

    2011-03-02

    The next decade of high-performance computing (HPC) systems will see a rapid evolution and divergence of multi- and manycore architectures as power and cooling constraints limit increases in microprocessor clock speeds. Understanding efficient optimization methodologies on diverse multicore designs in the context of demanding numerical methods is one of the greatest challenges faced today by the HPC community. In this paper, we examine the efficient multicore optimization of GTC, a petascale gyrokinetic toroidal fusion code for studying plasma microturbulence in tokamak devices. For GTC’s key computational components (charge deposition and particle push), we explore efficient parallelization strategies across a broadmore » range of emerging multicore designs, including the recently-released Intel Nehalem-EX, the AMD Opteron Istanbul, and the highly multithreaded Sun UltraSparc T2+. We also present the first study on tuning gyrokinetic particle-in-cell (PIC) algorithms for graphics processors, using the NVIDIA C2050 (Fermi). Our work discusses several novel optimization approaches for gyrokinetic PIC, including mixed-precision computation, particle binning and decomposition strategies, grid replication, SIMDized atomic floating-point operations, and effective GPU texture memory utilization. Overall, we achieve significant performance improvements of 1.3–4.7× on these complex PIC kernels, despite the inherent challenges of data dependency and locality. Finally, our work also points to several architectural and programming features that could significantly enhance PIC performance and productivity on next-generation architectures.« less

  6. Graphical Man/Machine Communications

    DTIC Science & Technology

    Progress is reported concerning the use of computer controlled graphical displays in the areas of radiaton diffusion and hydrodynamics, general...ventricular dynamics. Progress is continuing on the use of computer graphics in architecture. Some progress in halftone graphics is reported with no basic...developments presented. Colored halftone perspective pictures are being used to represent multivariable situations. Nonlinear waveform processing is

  7. 75 FR 30887 - Self-Regulatory Organizations; The NASDAQ Stock Market LLC; Notice of Filing and Immediate...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2010-06-02

    ...''), American Express Company (``AXP''), Ciena Corp. (``CIEN''), Star Scientific, Inc. (``CIGX''), Dendreon Corp. (``DNDN''), eBay Inc. (``EBAY''), Corning Inc. (``GLW''), Halliburton Company (``HAL''), iShares Dow Jones US Real Estate (``IYR''), Motorola, Inc., (``MOT''), NVIDIA Corporation (``NVDA''), ON Semiconductor...

  8. Weather information network including graphical display

    NASA Technical Reports Server (NTRS)

    Leger, Daniel R. (Inventor); Burdon, David (Inventor); Son, Robert S. (Inventor); Martin, Kevin D. (Inventor); Harrison, John (Inventor); Hughes, Keith R. (Inventor)

    2006-01-01

    An apparatus for providing weather information onboard an aircraft includes a processor unit and a graphical user interface. The processor unit processes weather information after it is received onboard the aircraft from a ground-based source, and the graphical user interface provides a graphical presentation of the weather information to a user onboard the aircraft. Preferably, the graphical user interface includes one or more user-selectable options for graphically displaying at least one of convection information, turbulence information, icing information, weather satellite information, SIGMET information, significant weather prognosis information, and winds aloft information.

  9. Model for mapping settlements

    DOEpatents

    Vatsavai, Ranga Raju; Graesser, Jordan B.; Bhaduri, Budhendra L.

    2016-07-05

    A programmable media includes a graphical processing unit in communication with a memory element. The graphical processing unit is configured to detect one or more settlement regions from a high resolution remote sensed image based on the execution of programming code. The graphical processing unit identifies one or more settlements through the execution of the programming code that executes a multi-instance learning algorithm that models portions of the high resolution remote sensed image. The identification is based on spectral bands transmitted by a satellite and on selected designations of the image patches.

  10. Use of a graphics processing unit (GPU) to facilitate real-time 3D graphic presentation of the patient skin-dose distribution during fluoroscopic interventional procedures

    PubMed Central

    Rana, Vijay; Rudin, Stephen; Bednarek, Daniel R.

    2012-01-01

    We have developed a dose-tracking system (DTS) that calculates the radiation dose to the patient’s skin in real-time by acquiring exposure parameters and imaging-system-geometry from the digital bus on a Toshiba Infinix C-arm unit. The cumulative dose values are then displayed as a color map on an OpenGL-based 3D graphic of the patient for immediate feedback to the interventionalist. Determination of those elements on the surface of the patient 3D-graphic that intersect the beam and calculation of the dose for these elements in real time demands fast computation. Reducing the size of the elements results in more computation load on the computer processor and therefore a tradeoff occurs between the resolution of the patient graphic and the real-time performance of the DTS. The speed of the DTS for calculating dose to the skin is limited by the central processing unit (CPU) and can be improved by using the parallel processing power of a graphics processing unit (GPU). Here, we compare the performance speed of GPU-based DTS software to that of the current CPU-based software as a function of the resolution of the patient graphics. Results show a tremendous improvement in speed using the GPU. While an increase in the spatial resolution of the patient graphics resulted in slowing down the computational speed of the DTS on the CPU, the speed of the GPU-based DTS was hardly affected. This GPU-based DTS can be a powerful tool for providing accurate, real-time feedback about patient skin-dose to physicians while performing interventional procedures. PMID:24027616

  11. Use of a graphics processing unit (GPU) to facilitate real-time 3D graphic presentation of the patient skin-dose distribution during fluoroscopic interventional procedures.

    PubMed

    Rana, Vijay; Rudin, Stephen; Bednarek, Daniel R

    2012-02-23

    We have developed a dose-tracking system (DTS) that calculates the radiation dose to the patient's skin in real-time by acquiring exposure parameters and imaging-system-geometry from the digital bus on a Toshiba Infinix C-arm unit. The cumulative dose values are then displayed as a color map on an OpenGL-based 3D graphic of the patient for immediate feedback to the interventionalist. Determination of those elements on the surface of the patient 3D-graphic that intersect the beam and calculation of the dose for these elements in real time demands fast computation. Reducing the size of the elements results in more computation load on the computer processor and therefore a tradeoff occurs between the resolution of the patient graphic and the real-time performance of the DTS. The speed of the DTS for calculating dose to the skin is limited by the central processing unit (CPU) and can be improved by using the parallel processing power of a graphics processing unit (GPU). Here, we compare the performance speed of GPU-based DTS software to that of the current CPU-based software as a function of the resolution of the patient graphics. Results show a tremendous improvement in speed using the GPU. While an increase in the spatial resolution of the patient graphics resulted in slowing down the computational speed of the DTS on the CPU, the speed of the GPU-based DTS was hardly affected. This GPU-based DTS can be a powerful tool for providing accurate, real-time feedback about patient skin-dose to physicians while performing interventional procedures.

  12. Montblanc1: GPU accelerated radio interferometer measurement equations in support of Bayesian inference for radio observations

    NASA Astrophysics Data System (ADS)

    Perkins, S. J.; Marais, P. C.; Zwart, J. T. L.; Natarajan, I.; Tasse, C.; Smirnov, O.

    2015-09-01

    We present Montblanc, a GPU implementation of the Radio interferometer measurement equation (RIME) in support of the Bayesian inference for radio observations (BIRO) technique. BIRO uses Bayesian inference to select sky models that best match the visibilities observed by a radio interferometer. To accomplish this, BIRO evaluates the RIME multiple times, varying sky model parameters to produce multiple model visibilities. χ2 values computed from the model and observed visibilities are used as likelihood values to drive the Bayesian sampling process and select the best sky model. As most of the elements of the RIME and χ2 calculation are independent of one another, they are highly amenable to parallel computation. Additionally, Montblanc caters for iterative RIME evaluation to produce multiple χ2 values. Modified model parameters are transferred to the GPU between each iteration. We implemented Montblanc as a Python package based upon NVIDIA's CUDA architecture. As such, it is easy to extend and implement different pipelines. At present, Montblanc supports point and Gaussian morphologies, but is designed for easy addition of new source profiles. Montblanc's RIME implementation is performant: On an NVIDIA K40, it is approximately 250 times faster than MEQTREES on a dual hexacore Intel E5-2620v2 CPU. Compared to the OSKAR simulator's GPU-implemented RIME components it is 7.7 and 12 times faster on the same K40 for single and double-precision floating point respectively. However, OSKAR's RIME implementation is more general than Montblanc's BIRO-tailored RIME. Theoretical analysis of Montblanc's dominant CUDA kernel suggests that it is memory bound. In practice, profiling shows that is balanced between compute and memory, as much of the data required by the problem is retained in L1 and L2 caches.

  13. [Influence of the recording interval and a graphic organizer on the writing process/product and on other psychological variables].

    PubMed

    García Sánchez, Jesús N; Rodríguez Pérez, Celestino

    2007-05-01

    An experimental study of the influence of the recording interval and a graphic organizer on the processes of writing composition and on the final product is presented. We studied 326 participants, age 10 to 16 years old, by means of a nested design. Two groups were compared: one group was aided in the writing process with a graphic organizer and the other was not. Each group was subdivided into two further groups: one with a mean recording interval of 45 seconds and the other with approximately 90 seconds recording interval in a writing log. The results showed that the group aided by a graphic organizer obtained better results both in processes and writing product, and that the groups assessed with an average interval of 45 seconds obtained worse results. Implications for educational practice are discussed, and limitations and future perspectives are commented on.

  14. Chromium: A Stress-Processing Framework for Interactive Rendering on Clusters

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Humphreys, G,; Houston, M.; Ng, Y.-R.

    2002-01-11

    We describe Chromium, a system for manipulating streams of graphics API commands on clusters of workstations. Chromium's stream filters can be arranged to create sort-first and sort-last parallel graphics architectures that, in many cases, support the same applications while using only commodity graphics accelerators. In addition, these stream filters can be extended programmatically, allowing the user to customize the stream transformations performed by nodes in a cluster. Because our stream processing mechanism is completely general, any cluster-parallel rendering algorithm can be either implemented on top of or embedded in Chromium. In this paper, we give examples of real-world applications thatmore » use Chromium to achieve good scalability on clusters of workstations, and describe other potential uses of this stream processing technology. By completely abstracting the underlying graphics architecture, network topology, and API command processing semantics, we allow a variety of applications to run in different environments.« less

  15. Task-Analytic Design of Graphic Presentations

    DTIC Science & Technology

    1990-05-18

    important premise of Larkin and Simon’s work is that, when comparing alternative presentations, it is fruitful to characterize graphic-based problem solving...using the same information-processing models used to help understand problem solving using other representations [Newell and Simon, 19721...luring execution of graphic presentation- 4 based problem -solving procedures. Chapter 2 reviews other work related to the problem of designing graphic

  16. Interactive computer graphics system for structural sizing and analysis of aircraft structures

    NASA Technical Reports Server (NTRS)

    Bendavid, D.; Pipano, A.; Raibstein, A.; Somekh, E.

    1975-01-01

    A computerized system for preliminary sizing and analysis of aircraft wing and fuselage structures was described. The system is based upon repeated application of analytical program modules, which are interactively interfaced and sequence-controlled during the iterative design process with the aid of design-oriented graphics software modules. The entire process is initiated and controlled via low-cost interactive graphics terminals driven by a remote computer in a time-sharing mode.

  17. CrossTalk. The Journal of Defense Software Engineering. Volume 25, Number 3

    DTIC Science & Technology

    2012-06-01

    OMG) standard Business Process Modeling and Nota- tion ( BPMN ) [6] graphical notation. I will address each of these: identify and document steps...to a value stream map using BPMN and textual process narratives. The resulting process narratives or process metadata includes key information...objectives. Once the processes are identified we can graphically document them capturing the process using BPMN (see Figure 1). The BPMN models

  18. Parallel Geospatial Data Management for Multi-Scale Environmental Data Analysis on GPUs

    NASA Astrophysics Data System (ADS)

    Wang, D.; Zhang, J.; Wei, Y.

    2013-12-01

    As the spatial and temporal resolutions of Earth observatory data and Earth system simulation outputs are getting higher, in-situ and/or post- processing such large amount of geospatial data increasingly becomes a bottleneck in scientific inquires of Earth systems and their human impacts. Existing geospatial techniques that are based on outdated computing models (e.g., serial algorithms and disk-resident systems), as have been implemented in many commercial and open source packages, are incapable of processing large-scale geospatial data and achieve desired level of performance. In this study, we have developed a set of parallel data structures and algorithms that are capable of utilizing massively data parallel computing power available on commodity Graphics Processing Units (GPUs) for a popular geospatial technique called Zonal Statistics. Given two input datasets with one representing measurements (e.g., temperature or precipitation) and the other one represent polygonal zones (e.g., ecological or administrative zones), Zonal Statistics computes major statistics (or complete distribution histograms) of the measurements in all regions. Our technique has four steps and each step can be mapped to GPU hardware by identifying its inherent data parallelisms. First, a raster is divided into blocks and per-block histograms are derived. Second, the Minimum Bounding Boxes (MBRs) of polygons are computed and are spatially matched with raster blocks; matched polygon-block pairs are tested and blocks that are either inside or intersect with polygons are identified. Third, per-block histograms are aggregated to polygons for blocks that are completely within polygons. Finally, for blocks that intersect with polygon boundaries, all the raster cells within the blocks are examined using point-in-polygon-test and cells that are within polygons are used to update corresponding histograms. As the task becomes I/O bound after applying spatial indexing and GPU hardware acceleration, we have developed a GPU-based data compression technique by reusing our previous work on Bitplane Quadtree (or BPQ-Tree) based indexing of binary bitmaps. Results have shown that our GPU-based parallel Zonal Statistic technique on 3000+ US counties over 20+ billion NASA SRTM 30 meter resolution Digital Elevation (DEM) raster cells has achieved impressive end-to-end runtimes: 101 seconds and 46 seconds a low-end workstation equipped with a Nvidia GTX Titan GPU using cold and hot cache, respectively; and, 60-70 seconds using a single OLCF TITAN computing node and 10-15 seconds using 8 nodes. Our experiment results clearly show the potentials of using high-end computing facilities for large-scale geospatial processing.

  19. Peregrine Software Toolchains | High-Performance Computing | NREL

    Science.gov Websites

    toolchain is an open-source alternative against which many technical applications are natively developed and tested. The Portland Group compilers are not fully supported, but are available to the HPC community. Use Group (PGI) C/C++ and Fortran (partially supported) The PGI Accelerator compilers include NVIDIA GPU

  20. Graphic Arts: Book Two. Process Camera, Stripping, and Platemaking.

    ERIC Educational Resources Information Center

    Farajollahi, Karim; And Others

    The second of a three-volume set of instructional materials for a course in graphic arts, this manual consists of 10 instructional units dealing with the process camera, stripping, and platemaking. Covered in the individual units are the process camera and darkroom photography, line photography, half-tone photography, other darkroom techniques,…

  1. Graphic Arts: Process Camera, Stripping, and Platemaking. Third Edition.

    ERIC Educational Resources Information Center

    Crummett, Dan

    This document contains teacher and student materials for a course in graphic arts concentrating on camera work, stripping, and plate making in the printing process. Eight units of instruction cover the following topics: (1) the process camera and darkroom equipment; (2) line photography; (3) halftone photography; (4) other darkroom techniques; (5)…

  2. Mechanical properties of bovine cortical bone based on the automated ball indentation technique and graphics processing method.

    PubMed

    Zhang, Airong; Zhang, Song; Bian, Cuirong

    2018-02-01

    Cortical bone provides the main form of support in humans and other vertebrates against various forces. Thus, capturing its mechanical properties is important. In this study, the mechanical properties of cortical bone were investigated by using automated ball indentation and graphics processing at both the macroscopic and microstructural levels under dry conditions. First, all polished samples were photographed under a metallographic microscope, and the area ratio of the circumferential lamellae and osteons was calculated through the graphics processing method. Second, fully-computer-controlled automated ball indentation (ABI) tests were performed to explore the micro-mechanical properties of the cortical bone at room temperature and a constant indenter speed. The indentation defects were examined with a scanning electron microscope. Finally, the macroscopic mechanical properties of the cortical bone were estimated with the graphics processing method and mixture rule. Combining ABI and graphics processing proved to be an effective tool to obtaining the mechanical properties of the cortical bone, and the indenter size had a significant effect on the measurement. The methods presented in this paper provide an innovative approach to acquiring the macroscopic mechanical properties of cortical bone in a nondestructive manner. Copyright © 2017 Elsevier Ltd. All rights reserved.

  3. Pre and post processing using the IBM 3277 display station graphics attachment (RPQ7H0284)

    NASA Technical Reports Server (NTRS)

    Burroughs, S. H.; Lawlor, M. B.; Miller, I. M.

    1978-01-01

    A graphical interactive procedure operating under TSO and utilizing two CRT display terminals is shown to be an effective means of accomplishing mesh generation, establishing boundary conditions, and reviewing graphic output for finite element analysis activity.

  4. Getting Graphic at the School Library.

    ERIC Educational Resources Information Center

    Kan, Kat

    2003-01-01

    Provides information for school libraries interested in acquiring graphic novels. Discusses theft prevention; processing and cataloging; maintaining the collection; what to choose, with two Web sites for more information on graphic novels for libraries; collection development decisions; and Japanese comics called Manga. Includes an annotated list…

  5. A graphically oriented specification language for automatic code generation. GRASP/Ada: A Graphical Representation of Algorithms, Structure, and Processes for Ada, phase 1

    NASA Technical Reports Server (NTRS)

    Cross, James H., II; Morrison, Kelly I.; May, Charles H., Jr.; Waddel, Kathryn C.

    1989-01-01

    The first phase of a three-phase effort to develop a new graphically oriented specification language which will facilitate the reverse engineering of Ada source code into graphical representations (GRs) as well as the automatic generation of Ada source code is described. A simplified view of the three phases of Graphical Representations for Algorithms, Structure, and Processes for Ada (GRASP/Ada) with respect to three basic classes of GRs is presented. Phase 1 concentrated on the derivation of an algorithmic diagram, the control structure diagram (CSD) (CRO88a) from Ada source code or Ada PDL. Phase 2 includes the generation of architectural and system level diagrams such as structure charts and data flow diagrams and should result in a requirements specification for a graphically oriented language able to support automatic code generation. Phase 3 will concentrate on the development of a prototype to demonstrate the feasibility of this new specification language.

  6. Guide to making time-lapse graphics using the facilities of the National Magnetic Fusion Energy Computing Center

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Munro, J.K. Jr.

    1980-05-01

    The advent of large, fast computers has opened the way to modeling more complex physical processes and to handling very large quantities of experimental data. The amount of information that can be processed in a short period of time is so great that use of graphical displays assumes greater importance as a means of displaying this information. Information from dynamical processes can be displayed conveniently by use of animated graphics. This guide presents the basic techniques for generating black and white animated graphics, with consideration of aesthetic, mechanical, and computational problems. The guide is intended for use by someone whomore » wants to make movies on the National Magnetic Fusion Energy Computing Center (NMFECC) CDC-7600. Problems encountered by a geographically remote user are given particular attention. Detailed information is given that will allow a remote user to do some file checking and diagnosis before giving graphics files to the system for processing into film in order to spot problems without having to wait for film to be delivered. Source listings of some useful software are given in appendices along with descriptions of how to use it. 3 figures, 5 tables.« less

  7. Graphic Warning Labels Elicit Affective and Thoughtful Responses from Smokers: Results of a Randomized Clinical Trial.

    PubMed

    Evans, Abigail T; Peters, Ellen; Strasser, Andrew A; Emery, Lydia F; Sheerin, Kaitlin M; Romer, Daniel

    2015-01-01

    Observational research suggests that placing graphic images on cigarette warning labels can reduce smoking rates, but field studies lack experimental control. Our primary objective was to determine the psychological processes set in motion by naturalistic exposure to graphic vs. text-only warnings in a randomized clinical trial involving exposure to modified cigarette packs over a 4-week period. Theories of graphic-warning impact were tested by examining affect toward smoking, credibility of warning information, risk perceptions, quit intentions, warning label memory, and smoking risk knowledge. Adults who smoked between 5 and 40 cigarettes daily (N = 293; mean age = 33.7), did not have a contra-indicated medical condition, and did not intend to quit were recruited from Philadelphia, PA and Columbus, OH. Smokers were randomly assigned to receive their own brand of cigarettes for four weeks in one of three warning conditions: text only, graphic images plus text, or graphic images with elaborated text. Data from 244 participants who completed the trial were analyzed in structural-equation models. The presence of graphic images (compared to text-only) caused more negative affect toward smoking, a process that indirectly influenced risk perceptions and quit intentions (e.g., image->negative affect->risk perception->quit intention). Negative affect from graphic images also enhanced warning credibility including through increased scrutiny of the warnings, a process that also indirectly affected risk perceptions and quit intentions (e.g., image->negative affect->risk scrutiny->warning credibility->risk perception->quit intention). Unexpectedly, elaborated text reduced warning credibility. Finally, graphic warnings increased warning-information recall and indirectly increased smoking-risk knowledge at the end of the trial and one month later. In the first naturalistic clinical trial conducted, graphic warning labels are more effective than text-only warnings in encouraging smokers to consider quitting and in educating them about smoking's risks. Negative affective reactions to smoking, thinking about risks, and perceptions of credibility are mediators of their impact. Clinicaltrials.gov NCT01782053.

  8. A GPU-Based Wide-Band Radio Spectrometer

    NASA Astrophysics Data System (ADS)

    Chennamangalam, Jayanth; Scott, Simon; Jones, Glenn; Chen, Hong; Ford, John; Kepley, Amanda; Lorimer, D. R.; Nie, Jun; Prestage, Richard; Roshi, D. Anish; Wagner, Mark; Werthimer, Dan

    2014-12-01

    The graphics processing unit has become an integral part of astronomical instrumentation, enabling high-performance online data reduction and accelerated online signal processing. In this paper, we describe a wide-band reconfigurable spectrometer built using an off-the-shelf graphics processing unit card. This spectrometer, when configured as a polyphase filter bank, supports a dual-polarisation bandwidth of up to 1.1 GHz (or a single-polarisation bandwidth of up to 2.2 GHz) on the latest generation of graphics processing units. On the other hand, when configured as a direct fast Fourier transform, the spectrometer supports a dual-polarisation bandwidth of up to 1.4 GHz (or a single-polarisation bandwidth of up to 2.8 GHz).

  9. Vital Signs for Instructional Design

    ERIC Educational Resources Information Center

    Ley, Kathryn; Gannon-Cook, Ruth

    2014-01-01

    The purpose of this study was to investigate the relationship between a collaborative design process for selecting instructional graphics and online learner perceptions of graphic appropriateness. At the end of their online graduate course, 9 students ranked how appropriately each of 25 graphics represented 1 of 8 human performance technology…

  10. GPU-based relative fuzzy connectedness image segmentation

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Zhuge Ying; Ciesielski, Krzysztof C.; Udupa, Jayaram K.

    2013-01-15

    Purpose:Recently, clinical radiological research and practice are becoming increasingly quantitative. Further, images continue to increase in size and volume. For quantitative radiology to become practical, it is crucial that image segmentation algorithms and their implementations are rapid and yield practical run time on very large data sets. The purpose of this paper is to present a parallel version of an algorithm that belongs to the family of fuzzy connectedness (FC) algorithms, to achieve an interactive speed for segmenting large medical image data sets. Methods: The most common FC segmentations, optimizing an Script-Small-L {sub {infinity}}-based energy, are known as relative fuzzymore » connectedness (RFC) and iterative relative fuzzy connectedness (IRFC). Both RFC and IRFC objects (of which IRFC contains RFC) can be found via linear time algorithms, linear with respect to the image size. The new algorithm, P-ORFC (for parallel optimal RFC), which is implemented by using NVIDIA's Compute Unified Device Architecture (CUDA) platform, considerably improves the computational speed of the above mentioned CPU based IRFC algorithm. Results: Experiments based on four data sets of small, medium, large, and super data size, achieved speedup factors of 32.8 Multiplication-Sign , 22.9 Multiplication-Sign , 20.9 Multiplication-Sign , and 17.5 Multiplication-Sign , correspondingly, on the NVIDIA Tesla C1060 platform. Although the output of P-ORFC need not precisely match that of IRFC output, it is very close to it and, as the authors prove, always lies between the RFC and IRFC objects. Conclusions: A parallel version of a top-of-the-line algorithm in the family of FC has been developed on the NVIDIA GPUs. An interactive speed of segmentation has been achieved, even for the largest medical image data set. Such GPU implementations may play a crucial role in automatic anatomy recognition in clinical radiology.« less

  11. GPU-computing in econophysics and statistical physics

    NASA Astrophysics Data System (ADS)

    Preis, T.

    2011-03-01

    A recent trend in computer science and related fields is general purpose computing on graphics processing units (GPUs), which can yield impressive performance. With multiple cores connected by high memory bandwidth, today's GPUs offer resources for non-graphics parallel processing. This article provides a brief introduction into the field of GPU computing and includes examples. In particular computationally expensive analyses employed in financial market context are coded on a graphics card architecture which leads to a significant reduction of computing time. In order to demonstrate the wide range of possible applications, a standard model in statistical physics - the Ising model - is ported to a graphics card architecture as well, resulting in large speedup values.

  12. Graphical Representation of Parallel Algorithmic Processes

    DTIC Science & Technology

    1990-12-01

    interface with the AAARF main process . The source code for the AAARF class-common library is in the common subdi- rectory and consists of the following files... for public release; distribution unlimited AFIT/GCE/ENG/90D-07 Graphical Representation of Parallel Algorithmic Processes THESIS Presented to the...goal of this study is to develop an algorithm animation facility for parallel processes executing on different architectures, from multiprocessor

  13. On the Road to Graphicacy: The Learning of Graphical Representation Systems

    ERIC Educational Resources Information Center

    Postigo, Yolanda; Pozo, Juan Ignacio

    2004-01-01

    This article examines the learning of different types of graphic information by subjects with different levels of education and knowledge of the content represented. Three levels of graphic information learning were distinguished (explicit, implicit, and conceptual information processing) and two experiments were conducted, looking at graph and…

  14. Conceptual Learning with Multiple Graphical Representations: Intelligent Tutoring Systems Support for Sense-Making and Fluency-Building Processes

    ERIC Educational Resources Information Center

    Rau, Martina A.

    2013-01-01

    Most learning environments in the STEM disciplines use multiple graphical representations along with textual descriptions and symbolic representations. Multiple graphical representations are powerful learning tools because they can emphasize complementary aspects of complex learning contents. However, to benefit from multiple graphical…

  15. GPU-powered Shotgun Stochastic Search for Dirichlet process mixtures of Gaussian Graphical Models

    PubMed Central

    Mukherjee, Chiranjit; Rodriguez, Abel

    2016-01-01

    Gaussian graphical models are popular for modeling high-dimensional multivariate data with sparse conditional dependencies. A mixture of Gaussian graphical models extends this model to the more realistic scenario where observations come from a heterogenous population composed of a small number of homogeneous sub-groups. In this paper we present a novel stochastic search algorithm for finding the posterior mode of high-dimensional Dirichlet process mixtures of decomposable Gaussian graphical models. Further, we investigate how to harness the massive thread-parallelization capabilities of graphical processing units to accelerate computation. The computational advantages of our algorithms are demonstrated with various simulated data examples in which we compare our stochastic search with a Markov chain Monte Carlo algorithm in moderate dimensional data examples. These experiments show that our stochastic search largely outperforms the Markov chain Monte Carlo algorithm in terms of computing-times and in terms of the quality of the posterior mode discovered. Finally, we analyze a gene expression dataset in which Markov chain Monte Carlo algorithms are too slow to be practically useful. PMID:28626348

  16. GPU-powered Shotgun Stochastic Search for Dirichlet process mixtures of Gaussian Graphical Models.

    PubMed

    Mukherjee, Chiranjit; Rodriguez, Abel

    2016-01-01

    Gaussian graphical models are popular for modeling high-dimensional multivariate data with sparse conditional dependencies. A mixture of Gaussian graphical models extends this model to the more realistic scenario where observations come from a heterogenous population composed of a small number of homogeneous sub-groups. In this paper we present a novel stochastic search algorithm for finding the posterior mode of high-dimensional Dirichlet process mixtures of decomposable Gaussian graphical models. Further, we investigate how to harness the massive thread-parallelization capabilities of graphical processing units to accelerate computation. The computational advantages of our algorithms are demonstrated with various simulated data examples in which we compare our stochastic search with a Markov chain Monte Carlo algorithm in moderate dimensional data examples. These experiments show that our stochastic search largely outperforms the Markov chain Monte Carlo algorithm in terms of computing-times and in terms of the quality of the posterior mode discovered. Finally, we analyze a gene expression dataset in which Markov chain Monte Carlo algorithms are too slow to be practically useful.

  17. The value of animations in biology teaching: a study of long-term memory retention.

    PubMed

    O'Day, Danton H

    2007-01-01

    Previous work has established that a narrated animation is more effective at communicating a complex biological process (signal transduction) than the equivalent graphic with figure legend. To my knowledge, no study has been done in any subject area on the effectiveness of animations versus graphics in the long-term retention of information, a primary and critical issue in studies of teaching and learning. In this study, involving 393 student responses, three different animations and two graphics-one with and one lacking a legend-were used to determine the long-term retention of information. The results show that students retain more information 21 d after viewing an animation without narration compared with an equivalent graphic whether or not that graphic had a legend. Students' comments provide additional insight into the value of animations in the pedagogical process, and suggestions for future work are proposed.

  18. Surface and deep structures in graphics comprehension.

    PubMed

    Schnotz, Wolfgang; Baadte, Christiane

    2015-05-01

    Comprehension of graphics can be considered as a process of schema-mediated structure mapping from external graphics on internal mental models. Two experiments were conducted to test the hypothesis that graphics possess a perceptible surface structure as well as a semantic deep structure both of which affect mental model construction. The same content was presented to different groups of learners by graphics from different perspectives with different surface structures but the same deep structure. Deep structures were complementary: major features of the learning content in one experiment became minor features in the other experiment, and vice versa. Text was held constant. Participants were asked to read, understand, and memorize the learning material. Furthermore, they were either instructed to process the material from the perspective supported by the graphic or from an alternative perspective, or they received no further instruction. After learning, they were asked to recall the learning content from different perspectives by completing graphs of different formats as accurately as possible. Learners' recall was more accurate if the format of recall was the same as the learning format which indicates surface structure influences. However, participants also showed more accurate recall when they remembered the content from a perspective emphasizing the deep structure, regardless of the graphics format presented before. This included better recall of what they had not seen than of what they really had seen before. That is, deep structure effects overrode surface effects. Depending on context conditions, stimulation of additional cognitive processing by instruction had partially positive and partially negative effects.

  19. A State Articulated Instructional Objectives Guide for Occupational Education Programs. State Pilot Model for Drafting (Graphic Communications). Part I--Basic. Part II--Specialty Programs. Section A (Mechanical Drafting and Design). Section B (Architectural Drafting and Design).

    ERIC Educational Resources Information Center

    North Carolina State Dept. of Community Colleges, Raleigh.

    A two-part articulation instructional objective guide for drafting (graphic communications) is provided. Part I contains summary information on seven blocks (courses) of instruction. They are as follow: introduction; basic technical drafting; problem solving in graphics; reproduction processes; freehand drawing and sketching; graphics composition;…

  20. Processing Information in Graphical Form.

    ERIC Educational Resources Information Center

    Curcio, Frances R.; Smith-Burke, M. Trika

    The purpose of this exploratory, descriptive study was to examine how children process different tasks of comprehension presented in graphical form. During the Spring 1981, 8 fourth graders and 9 seventh graders were interviewed. The children were presented with graphs accompanied by six questions reflecting three levels of comprehension:…

  1. Concept Learning through Image Processing.

    ERIC Educational Resources Information Center

    Cifuentes, Lauren; Yi-Chuan, Jane Hsieh

    This study explored computer-based image processing as a study strategy for middle school students' science concept learning. Specifically, the research examined the effects of computer graphics generation on science concept learning and the impact of using computer graphics to show interrelationships among concepts during study time. The 87…

  2. Industrial Technology Modernization Program. Project 32. Factory Vision. Phase 2

    DTIC Science & Technology

    1988-04-01

    instructions for the PWA’s, generating the numerical control (NC) program instructions for factory assembly equipment, controlling the process... generating the numerical control (NC) program instructions for factory assembly equipment, controlling the production process instructions and NC... Assembly Operations the "Create Production Process Program" will automatically generate a sequence of graphics pages (in paper mode), or graphics screens

  3. Use of graphics in the design office at the Military Aircraft Division of the British Aircraft Corporation

    NASA Technical Reports Server (NTRS)

    Coles, W. A.

    1975-01-01

    The CAD/CAM interactive computer graphics system was described; uses to which it has been put were shown, and current developments of the system were outlined. The system supports batch, time sharing, and fully interactive graphic processing. Engineers using the system may switch between these methods of data processing and problem solving to make the best use of the available resources. It is concluded that the introduction of on-line computing in the form of teletypes, storage tubes, and fully interactive graphics has resulted in large increases in productivity and reduced timescales in the geometric computing, numerical lofting and part programming areas, together with a greater utilization of the system in the technical departments.

  4. Distributed computation of graphics primitives on a transputer network

    NASA Technical Reports Server (NTRS)

    Ellis, Graham K.

    1988-01-01

    A method is developed for distributing the computation of graphics primitives on a parallel processing network. Off-the-shelf transputer boards are used to perform the graphics transformations and scan-conversion tasks that would normally be assigned to a single transputer based display processor. Each node in the network performs a single graphics primitive computation. Frequently requested tasks can be duplicated on several nodes. The results indicate that the current distribution of commands on the graphics network shows a performance degradation when compared to the graphics display board alone. A change to more computation per node for every communication (perform more complex tasks on each node) may cause the desired increase in throughput.

  5. A Strategy for Making Content Reading Successful: Grades 4-6.

    ERIC Educational Resources Information Center

    Alvermann, Donna E.; Boothby, Paula R.

    A graphic organizer is a tree diagram that consists of vocabulary related to one particular concept. A modified version of a graphic organizer contains empty slots that represent missing information and actively involves students during the reading process as opposed to before or after. This modified graphic organizer can provide both the…

  6. Role of Graphics Tools in the Learning Design Process

    ERIC Educational Resources Information Center

    Laisney, Patrice; Brandt-Pomares, Pascale

    2015-01-01

    This paper discusses the design activities of students in secondary school in France. Graphics tools are now part of the capacity of design professionals. It is therefore apt to reflect on their integration into the technological education. Has the use of intermediate graphical tools changed students' performance, and if so in what direction, in…

  7. Graphic Arts: Process Camera, Stripping, and Platemaking. Teacher Guide.

    ERIC Educational Resources Information Center

    Feasley, Sue C., Ed.

    This curriculum guide is the second in a three-volume series of instructional materials for competency-based graphic arts instruction. Each publication is designed to include the technical content and tasks necessary for a student to be employed in an entry-level graphic arts occupation. Introductory materials include an instructional/task…

  8. Critique and Process: Signature Pedagogies in the Graphic Design Classroom

    ERIC Educational Resources Information Center

    Motley, Phillip

    2017-01-01

    Like many disciplines in design and the visual fine arts, critique is a signature pedagogy in the graphic design classroom. It serves as both a formative and summative assessment while also giving students the opportunity to practice the habits of graphic design. Critiques help students become keen observers of relevant disciplinary criteria;…

  9. A heterogeneous and parallel computing framework for high-resolution hydrodynamic simulations

    NASA Astrophysics Data System (ADS)

    Smith, Luke; Liang, Qiuhua

    2015-04-01

    Shock-capturing hydrodynamic models are now widely applied in the context of flood risk assessment and forecasting, accurately capturing the behaviour of surface water over ground and within rivers. Such models are generally explicit in their numerical basis, and can be computationally expensive; this has prohibited full use of high-resolution topographic data for complex urban environments, now easily obtainable through airborne altimetric surveys (LiDAR). As processor clock speed advances have stagnated in recent years, further computational performance gains are largely dependent on the use of parallel processing. Heterogeneous computing architectures (e.g. graphics processing units or compute accelerator cards) provide a cost-effective means of achieving high throughput in cases where the same calculation is performed with a large input dataset. In recent years this technique has been applied successfully for flood risk mapping, such as within the national surface water flood risk assessment for the United Kingdom. We present a flexible software framework for hydrodynamic simulations across multiple processors of different architectures, within multiple computer systems, enabled using OpenCL and Message Passing Interface (MPI) libraries. A finite-volume Godunov-type scheme is implemented using the HLLC approach to solving the Riemann problem, with optional extension to second-order accuracy in space and time using the MUSCL-Hancock approach. The framework is successfully applied on personal computers and a small cluster to provide considerable improvements in performance. The most significant performance gains were achieved across two servers, each containing four NVIDIA GPUs, with a mix of K20, M2075 and C2050 devices. Advantages are found with respect to decreased parametric sensitivity, and thus in reducing uncertainty, for a major fluvial flood within a large catchment during 2005 in Carlisle, England. Simulations for the three-day event could be performed on a 2m grid within a few hours. In the context of a rapid pluvial flood event in Newcastle upon Tyne during 2012, the technique allows simulation of inundation for a 31km2 of the city centre in less than an hour on a 2m grid; however, further grid refinement is required to fully capture important smaller flow pathways. Good agreement between the model and observed inundation is achieved for a variety of dam failure, slow fluvial inundation, rapid pluvial inundation, and defence breach scenarios in the UK.

  10. Efficient methods for implementation of multi-level nonrigid mass-preserving image registration on GPUs and multi-threaded CPUs.

    PubMed

    Ellingwood, Nathan D; Yin, Youbing; Smith, Matthew; Lin, Ching-Long

    2016-04-01

    Faster and more accurate methods for registration of images are important for research involved in conducting population-based studies that utilize medical imaging, as well as improvements for use in clinical applications. We present a novel computation- and memory-efficient multi-level method on graphics processing units (GPU) for performing registration of two computed tomography (CT) volumetric lung images. We developed a computation- and memory-efficient Diffeomorphic Multi-level B-Spline Transform Composite (DMTC) method to implement nonrigid mass-preserving registration of two CT lung images on GPU. The framework consists of a hierarchy of B-Spline control grids of increasing resolution. A similarity criterion known as the sum of squared tissue volume difference (SSTVD) was adopted to preserve lung tissue mass. The use of SSTVD consists of the calculation of the tissue volume, the Jacobian, and their derivatives, which makes its implementation on GPU challenging due to memory constraints. The use of the DMTC method enabled reduced computation and memory storage of variables with minimal communication between GPU and Central Processing Unit (CPU) due to ability to pre-compute values. The method was assessed on six healthy human subjects. Resultant GPU-generated displacement fields were compared against the previously validated CPU counterpart fields, showing good agreement with an average normalized root mean square error (nRMS) of 0.044±0.015. Runtime and performance speedup are compared between single-threaded CPU, multi-threaded CPU, and GPU algorithms. Best performance speedup occurs at the highest resolution in the GPU implementation for the SSTVD cost and cost gradient computations, with a speedup of 112 times that of the single-threaded CPU version and 11 times over the twelve-threaded version when considering average time per iteration using a Nvidia Tesla K20X GPU. The proposed GPU-based DMTC method outperforms its multi-threaded CPU version in terms of runtime. Total registration time reduced runtime to 2.9min on the GPU version, compared to 12.8min on twelve-threaded CPU version and 112.5min on a single-threaded CPU. Furthermore, the GPU implementation discussed in this work can be adapted for use of other cost functions that require calculation of the first derivatives. Copyright © 2015 Elsevier Ireland Ltd. All rights reserved.

  11. A DDC Bibliography on Optical or Graphic Information Processing (Information Sciences Series). Volume I.

    ERIC Educational Resources Information Center

    Defense Documentation Center, Alexandria, VA.

    This unclassified-unlimited bibliography contains 183 references, with abstracts, dealing specifically with optical or graphic information processing. Citations are grouped under three headings: display devices and theory, character recognition, and pattern recognition. Within each group, they are arranged in accession number (AD-number) sequence.…

  12. An Interactive Graphics Program for Investigating Digital Signal Processing.

    ERIC Educational Resources Information Center

    Miller, Billy K.; And Others

    1983-01-01

    Describes development of an interactive computer graphics program for use in teaching digital signal processing. The program allows students to interactively configure digital systems on a monitor display and observe their system's performance by means of digital plots on the system's outputs. A sample program run is included. (JN)

  13. Performance Testing of GPU-Based Approximate Matching Algorithm on Network Traffic

    DTIC Science & Technology

    2015-03-01

    Defense Department’s use. vi THIS PAGE INTENTIONALLY LEFT BLANK vii TABLE OF CONTENTS I.  INTRODUCTION...22  D.  GENERATING DIGESTS ............................................................................23  1.  Reference...the-shelf GPU Graphical Processing Unit GPGPU General -Purpose Graphic Processing Unit HBSS Host-Based Security System HIPS Host Intrusion

  14. Parallelized CCHE2D flow model with CUDA Fortran on Graphics Process Units

    USDA-ARS?s Scientific Manuscript database

    This paper presents the CCHE2D implicit flow model parallelized using CUDA Fortran programming technique on Graphics Processing Units (GPUs). A parallelized implicit Alternating Direction Implicit (ADI) solver using Parallel Cyclic Reduction (PCR) algorithm on GPU is developed and tested. This solve...

  15. Graphic Arts: Book Three. The Press and Related Processes.

    ERIC Educational Resources Information Center

    Farajollahi, Karim; And Others

    The third of a three-volume set of instructional materials for a graphic arts course, this manual consists of nine instructional units dealing with presses and related processes. Covered in the units are basic press fundamentals, offset press systems, offset press operating procedures, offset inks and dampening chemistry, preventive maintenance…

  16. The Use of Computer Graphics in the Design Process.

    ERIC Educational Resources Information Center

    Palazzi, Maria

    This master's thesis examines applications of computer technology to the field of industrial design and ways in which technology can transform the traditional process. Following a statement of the problem, the history and applications of the fields of computer graphics and industrial design are reviewed. The traditional industrial design process…

  17. A Prototype Lisp-Based Soft Real-Time Object-Oriented Graphical User Interface for Control System Development

    NASA Technical Reports Server (NTRS)

    Litt, Jonathan; Wong, Edmond; Simon, Donald L.

    1994-01-01

    A prototype Lisp-based soft real-time object-oriented Graphical User Interface for control system development is presented. The Graphical User Interface executes alongside a test system in laboratory conditions to permit observation of the closed loop operation through animation, graphics, and text. Since it must perform interactive graphics while updating the screen in real time, techniques are discussed which allow quick, efficient data processing and animation. Examples from an implementation are included to demonstrate some typical functionalities which allow the user to follow the control system's operation.

  18. A graphical language for reliability model generation

    NASA Technical Reports Server (NTRS)

    Howell, Sandra V.; Bavuso, Salvatore J.; Haley, Pamela J.

    1990-01-01

    A graphical interface capability of the hybrid automated reliability predictor (HARP) is described. The graphics-oriented (GO) module provides the user with a graphical language for modeling system failure modes through the selection of various fault tree gates, including sequence dependency gates, or by a Markov chain. With this graphical input language, a fault tree becomes a convenient notation for describing a system. In accounting for any sequence dependencies, HARP converts the fault-tree notation to a complex stochastic process that is reduced to a Markov chain which it can then solve for system reliability. The graphics capability is available for use on an IBM-compatible PC, a Sun, and a VAX workstation. The GO module is written in the C programming language and uses the Graphical Kernel System (GKS) standard for graphics implementation. The PC, VAX, and Sun versions of the HARP GO module are currently in beta-testing.

  19. GRASP/Ada: Graphical Representations of Algorithms, Structures, and Processes for Ada. The development of a program analysis environment for Ada: Reverse engineering tools for Ada, task 2, phase 3

    NASA Technical Reports Server (NTRS)

    Cross, James H., II

    1991-01-01

    The main objective is the investigation, formulation, and generation of graphical representations of algorithms, structures, and processes for Ada (GRASP/Ada). The presented task, in which various graphical representations that can be extracted or generated from source code are described and categorized, is focused on reverse engineering. The following subject areas are covered: the system model; control structure diagram generator; object oriented design diagram generator; user interface; and the GRASP library.

  20. Graphical Language for Data Processing

    NASA Technical Reports Server (NTRS)

    Alphonso, Keith

    2011-01-01

    A graphical language for processing data allows processing elements to be connected with virtual wires that represent data flows between processing modules. The processing of complex data, such as lidar data, requires many different algorithms to be applied. The purpose of this innovation is to automate the processing of complex data, such as LIDAR, without the need for complex scripting and programming languages. The system consists of a set of user-interface components that allow the user to drag and drop various algorithmic and processing components onto a process graph. By working graphically, the user can completely visualize the process flow and create complex diagrams. This innovation supports the nesting of graphs, such that a graph can be included in another graph as a single step for processing. In addition to the user interface components, the system includes a set of .NET classes that represent the graph internally. These classes provide the internal system representation of the graphical user interface. The system includes a graph execution component that reads the internal representation of the graph (as described above) and executes that graph. The execution of the graph follows the interpreted model of execution in that each node is traversed and executed from the original internal representation. In addition, there are components that allow external code elements, such as algorithms, to be easily integrated into the system, thus making the system infinitely expandable.

  1. Testing and Validating Gadget2 for GPUs

    NASA Astrophysics Data System (ADS)

    Wibking, Benjamin; Holley-Bockelmann, K.; Berlind, A. A.

    2013-01-01

    We are currently upgrading a version of Gadget2 (Springel et al., 2005) that is optimized for NVIDIA's CUDA GPU architecture (Frigaard, unpublished) to work with the latest libraries and graphics cards. Preliminary tests of its performance indicate a ~40x speedup in the particle force tree approximation calculation, with overall speedup of 5-10x for cosmological simulations run with GPUs compared to running on the same CPU cores without GPU acceleration. We believe this speedup can be reasonably increased by an additional factor of two with futher optimization, including overlap of computation on CPU and GPU. Tests of single-precision GPU numerical fidelity currently indicate accuracy of the mass function and the spectral power density to within a few percent of extended-precision CPU results with the unmodified form of Gadget. Additionally, we plan to test and optimize the GPU code for Millenium-scale "grand challenge" simulations of >10^9 particles, a scale that has been previously untested with this code, with the aid of the NSF XSEDE flagship GPU-based supercomputing cluster codenamed "Keeneland." Current work involves additional validation of numerical results, extending the numerical precision of the GPU calculations to double precision, and evaluating performance/accuracy tradeoffs. We believe that this project, if successful, will yield substantial computational performance benefits to the N-body research community as the next generation of GPU supercomputing resources becomes available, both increasing the electrical power efficiency of ever-larger computations (making simulations possible a decade from now at scales and resolutions unavailable today) and accelerating the pace of research in the field.

  2. Parallel k-means++

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    A parallelization of the k-means++ seed selection algorithm on three distinct hardware platforms: GPU, multicore CPU, and multithreaded architecture. K-means++ was developed by David Arthur and Sergei Vassilvitskii in 2007 as an extension of the k-means data clustering technique. These algorithms allow people to cluster multidimensional data, by attempting to minimize the mean distance of data points within a cluster. K-means++ improved upon traditional k-means by using a more intelligent approach to selecting the initial seeds for the clustering process. While k-means++ has become a popular alternative to traditional k-means clustering, little work has been done to parallelize this technique.more » We have developed original C++ code for parallelizing the algorithm on three unique hardware architectures: GPU using NVidia's CUDA/Thrust framework, multicore CPU using OpenMP, and the Cray XMT multithreaded architecture. By parallelizing the process for these platforms, we are able to perform k-means++ clustering much more quickly than it could be done before.« less

  3. Evaluating Mobile Graphics Processing Units (GPUs) for Real-Time Resource Constrained Applications

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Meredith, J; Conger, J; Liu, Y

    2005-11-11

    Modern graphics processing units (GPUs) can provide tremendous performance boosts for some applications beyond what a single CPU can accomplish, and their performance is growing at a rate faster than CPUs as well. Mobile GPUs available for laptops have the small form factor and low power requirements suitable for use in embedded processing. We evaluated several desktop and mobile GPUs and CPUs on traditional and non-traditional graphics tasks, as well as on the most time consuming pieces of a full hyperspectral imaging application. Accuracy remained high despite small differences in arithmetic operations like rounding. Performance improvements are summarized here relativemore » to a desktop Pentium 4 CPU.« less

  4. TRIIG - Time-lapse reproduction of images through interactive graphics. [digital processing of quality hard copy

    NASA Technical Reports Server (NTRS)

    Buckner, J. D.; Council, H. W.; Edwards, T. R.

    1974-01-01

    Description of the hardware and software implementing the system of time-lapse reproduction of images through interactive graphics (TRIIG). The system produces a quality hard copy of processed images in a fast and inexpensive manner. This capability allows for optimal development of processing software through the rapid viewing of many image frames in an interactive mode. Three critical optical devices are used to reproduce an image: an Optronics photo reader/writer, the Adage Graphics Terminal, and Polaroid Type 57 high speed film. Typical sources of digitized images are observation satellites, such as ERTS or Mariner, computer coupled electron microscopes for high-magnification studies, or computer coupled X-ray devices for medical research.

  5. Graphic Warning Labels Elicit Affective and Thoughtful Responses from Smokers: Results of a Randomized Clinical Trial

    PubMed Central

    Evans, Abigail T.; Peters, Ellen; Strasser, Andrew A.; Emery, Lydia F.; Sheerin, Kaitlin M.; Romer, Daniel

    2015-01-01

    Objective Observational research suggests that placing graphic images on cigarette warning labels can reduce smoking rates, but field studies lack experimental control. Our primary objective was to determine the psychological processes set in motion by naturalistic exposure to graphic vs. text-only warnings in a randomized clinical trial involving exposure to modified cigarette packs over a 4-week period. Theories of graphic-warning impact were tested by examining affect toward smoking, credibility of warning information, risk perceptions, quit intentions, warning label memory, and smoking risk knowledge. Methods Adults who smoked between 5 and 40 cigarettes daily (N = 293; mean age = 33.7), did not have a contra-indicated medical condition, and did not intend to quit were recruited from Philadelphia, PA and Columbus, OH. Smokers were randomly assigned to receive their own brand of cigarettes for four weeks in one of three warning conditions: text only, graphic images plus text, or graphic images with elaborated text. Results Data from 244 participants who completed the trial were analyzed in structural-equation models. The presence of graphic images (compared to text-only) caused more negative affect toward smoking, a process that indirectly influenced risk perceptions and quit intentions (e.g., image->negative affect->risk perception->quit intention). Negative affect from graphic images also enhanced warning credibility including through increased scrutiny of the warnings, a process that also indirectly affected risk perceptions and quit intentions (e.g., image->negative affect->risk scrutiny->warning credibility->risk perception->quit intention). Unexpectedly, elaborated text reduced warning credibility. Finally, graphic warnings increased warning-information recall and indirectly increased smoking-risk knowledge at the end of the trial and one month later. Conclusions In the first naturalistic clinical trial conducted, graphic warning labels are more effective than text-only warnings in encouraging smokers to consider quitting and in educating them about smoking’s risks. Negative affective reactions to smoking, thinking about risks, and perceptions of credibility are mediators of their impact. Trial Registration Clinicaltrials.gov NCT01782053 PMID:26672982

  6. The Study of Graphic Sense and Its Effects on the Acquisition of Literacy. Final Report.

    ERIC Educational Resources Information Center

    Hernandez-Chavez, Eduardo; Curtis, Jan

    This report describes a study on the development of children's conceptualizations of written language, that is, their graphic sense. The study investigated three issues: (1) whether acquisition of literacy is a developmental process common to all normal children, (2) whether the levels of graphic sense tend to be associated with particular…

  7. Groundwater Resources Assessment under the Pressures of Humanity and Climate Changes

    Treesearch

    Bret Bruce; Diana Allen; Henrique Chaves; Gordon Grant; Gualbert Oude Essink; Henk Kooi; Ian White; Jason Gurdak; Jay Famiglietti; Jose Luis Martin-Bordes; Kevin Hiscock; Matthew Rodell; Neno Kukuric; Peter B. McMahon; Richard Taylor; Timothy Green; Yoseph Yechieli

    2008-01-01

    Given the vision and mission statements for GRAPHIC above, this document provides an updated framework for the GRAPHIC program. The approach to addressing global issues under the GRAPHIC umbrella involves case studies designed to cover a broad range of the identified Subjects, Methods, and Regions. Interdependencies of factors and processes affecting subsurface water...

  8. Interplay of Computer and Paper-Based Sketching in Graphic Design

    ERIC Educational Resources Information Center

    Pan, Rui; Kuo, Shih-Ping; Strobel, Johannes

    2013-01-01

    The purpose of this study is to investigate student designers' attitude and choices towards the use of computers and paper sketches when involved in a graphic design process. 65 computer graphic technology undergraduates participated in this research. A mixed method study with survey and in-depth interviews was applied to answer the research…

  9. Pre-Service Science Teachers' Construction and Interpretation of Graphs

    ERIC Educational Resources Information Center

    Ergül, N. Remziye

    2018-01-01

    Data and graphic analysis and interpretation are important parts of science process skills and science curriculum. So it refers to visual display of data using relevant graphical representations. One of the tools used in science courses is graphics for explain the relationship among each of the concepts and therefore it is important to know data…

  10. Improving aircraft conceptual design - A PHIGS interactive graphics interface for ACSYNT

    NASA Technical Reports Server (NTRS)

    Wampler, S. G.; Myklebust, A.; Jayaram, S.; Gelhausen, P.

    1988-01-01

    A CAD interface has been created for the 'ACSYNT' aircraft conceptual design code that permits the execution and control of the design process via interactive graphics menus. This CAD interface was coded entirely with the new three-dimensional graphics standard, the Programmer's Hierarchical Interactive Graphics System. The CAD/ACSYNT system is designed for use by state-of-the-art high-speed imaging work stations. Attention is given to the approaches employed in modeling, data storage, and rendering.

  11. A graphical weather system design for the NASA transport systems research vehicle B-737

    NASA Technical Reports Server (NTRS)

    Scanlon, Charles H.

    1992-01-01

    A graphical weather system was designed for testing in the NASA Transport Systems Research Vehicle B-737 airplane and simulator. The purpose of these tests was to measure the impact of graphical weather products on aircrew decision processes, weather situation awareness, reroute clearances, workload, and weather monitoring. The flight crew graphical weather interface is described along with integration of the weather system with the flight navigation system, and data link transmission methods for sending weather data to the airplane.

  12. GRASP - A Prototype Interactive Graphic Sawing Program - (Forest Products Journal)

    Treesearch

    Luis G. Occeña; Daniel L. Schmoldt

    1996-01-01

    A versatile microcomputer-based interactive graphics sawing program has been developed as a tool for modeling various hardwood processes, from bucking and topping to log sawing, lumber edging, secondary processing, and even veneering. The microcomputer platform makes the tool affordable and accessible. A solid modeling basis provides the tool with a sound geometrical...

  13. Digital-Computer Processing of Graphical Data. Final Report.

    ERIC Educational Resources Information Center

    Freeman, Herbert

    The final report of a two-year study concerned with the digital-computer processing of graphical data. Five separate investigations carried out under this study are described briefly, and a detailed bibliography, complete with abstracts, is included in which are listed the technical papers and reports published during the period of this program.…

  14. GRASP - A Prototype Interactive Graphic Sawing Program - (MU-IE Technical Report)

    Treesearch

    Luis G. Occeña; Daniel L. Schmoldt

    1995-01-01

    A versatile microcomputer-based interactive graphics program has been developed as a tool for modeling various hardwood processes, from bucking and topping to log sawing, lumber edging, secondary processing, even veneering. The microcomputer platform makes the tool affordable and accessible.A solid modeling basis provides the tool with a sound geometrical and...

  15. How to Apply for Protection Time Graphic

    EPA Pesticide Factsheets

    We will review insect repellent products that voluntarily apply to use the repellency awareness graphic to ensure that their scientific data meet current testing protocols and standard evaluation processes.

  16. Graphical representations of data improve student understanding of measurement and uncertainty: An eye-tracking study

    NASA Astrophysics Data System (ADS)

    Susac, Ana; Bubic, Andreja; Martinjak, Petra; Planinic, Maja; Palmovic, Marijan

    2017-12-01

    Developing a better understanding of the measurement process and measurement uncertainty is one of the main goals of university physics laboratory courses. This study investigated the influence of graphical representation of data on student understanding and interpreting of measurement results. A sample of 101 undergraduate students (48 first year students and 53 third and fifth year students) from the Department of Physics, University of Zagreb were tested with a paper-and-pencil test consisting of eight multiple-choice test items about measurement uncertainties. One version of the test items included graphical representations of the measurement data. About half of the students solved that version of the test while the remaining students solved the same test without graphical representations. The results have shown that the students who had the graphical representation of data scored higher than their colleagues without graphical representation. In the second part of the study, measurements of eye movements were carried out on a sample of thirty undergraduate students from the Department of Physics, University of Zagreb while students were solving the same test on a computer screen. The results revealed that students who had the graphical representation of data spent considerably less time viewing the numerical data than the other group of students. These results indicate that graphical representation may be beneficial for data processing and data comparison. Graphical representation helps with visualization of data and therefore reduces the cognitive load on students while performing measurement data analysis, so students should be encouraged to use it.

  17. Analyzing women's roles through graphic representation of narratives.

    PubMed

    Hall, Joanne M

    2003-08-01

    A 1992 triangulated international nursing study of women's health was reported. The researchers used the perspectives of feminism and symbolic interactionism, specifically role theory. A narrative analysis was done to clarify the concept of role integration. The narrative analysis was reported in 1992, but graphic/visual techniques used in the team dialogue process of narrative analysis were not reported due to space limitations. These techniques have not been reported elsewhere and thus remain innovative. Specific steps in the method are outlined here in detail as an audit trail. The process would be useful to other qualitative researchers as an exemplar of one novel way that verbal data can be abstracted visually/graphically. Suggestions are included for aspects of narrative, in addition to roles, that could be depicted graphically in qualitative research.

  18. Examination of Multi-Core Architectures

    DTIC Science & Technology

    2010-11-01

    NOVEMBER 2010 2. REPORT TYPE Interim Technical Report 3. DATES COVERED (From - To) February 2010 – July 2010 4 . TITLE AND SUBTITLE EXAMINATION OF...STATEMENT 1 2.0 BACKGROUND 1 3.0 ARCHITECTURE CHARACTERISTICS 3 3.1 NVIDIA Tesla 3 3.2 TILE64 4 ...1 Tesla Architecture 3 2 TILE64 Architecture 4 3 Single Tile Architecture 4 4 STI Cell Broadband Engine

  19. Computational Omics Funding Opportunity | Office of Cancer Clinical Proteomics Research

    Cancer.gov

    The National Cancer Institute's Clinical Proteomic Tumor Analysis Consortium (CPTAC) and the NVIDIA Foundation are pleased to announce funding opportunities in the fight against cancer. Each organization has launched a request for proposals (RFP) that will collectively fund up to $2 million to help to develop a new generation of data-intensive scientific tools to find new ways to treat cancer.

  20. Universal Batch Steganalysis

    DTIC Science & Technology

    2014-06-30

    steganalysis) in large-scale datasets such as might be obtained by monitoring a corporate network or social network. Identifying guilty actors...guilty’ user (of steganalysis) in large-scale datasets such as might be obtained by monitoring a corporate network or social network. Identifying guilty...floating point operations (1 TFLOPs) for a 1 megapixel image. We designed a new implementation using Compute Unified Device Architecture (CUDA) on NVIDIA

  1. SU (2) lattice gauge theory simulations on Fermi GPUs

    NASA Astrophysics Data System (ADS)

    Cardoso, Nuno; Bicudo, Pedro

    2011-05-01

    In this work we explore the performance of CUDA in quenched lattice SU (2) simulations. CUDA, NVIDIA Compute Unified Device Architecture, is a hardware and software architecture developed by NVIDIA for computing on the GPU. We present an analysis and performance comparison between the GPU and CPU in single and double precision. Analyses with multiple GPUs and two different architectures (G200 and Fermi architectures) are also presented. In order to obtain a high performance, the code must be optimized for the GPU architecture, i.e., an implementation that exploits the memory hierarchy of the CUDA programming model. We produce codes for the Monte Carlo generation of SU (2) lattice gauge configurations, for the mean plaquette, for the Polyakov Loop at finite T and for the Wilson loop. We also present results for the potential using many configurations (50,000) without smearing and almost 2000 configurations with APE smearing. With two Fermi GPUs we have achieved an excellent performance of 200× the speed over one CPU, in single precision, around 110 Gflops/s. We also find that, using the Fermi architecture, double precision computations for the static quark-antiquark potential are not much slower (less than 2× slower) than single precision computations.

  2. A multi-port 10GbE PCIe NIC featuring UDP offload and GPUDirect capabilities.

    NASA Astrophysics Data System (ADS)

    Ammendola, Roberto; Biagioni, Andrea; Frezza, Ottorino; Lamanna, Gianluca; Lo Cicero, Francesca; Lonardo, Alessandro; Martinelli, Michele; Stanislao Paolucci, Pier; Pastorelli, Elena; Pontisso, Luca; Rossetti, Davide; Simula, Francesco; Sozzi, Marco; Tosoratto, Laura; Vicini, Piero

    2015-12-01

    NaNet-10 is a four-ports 10GbE PCIe Network Interface Card designed for low-latency real-time operations with GPU systems. To this purpose the design includes an UDP offload module, for fast and clock-cycle deterministic handling of the transport layer protocol, plus a GPUDirect P2P/RDMA engine for low-latency communication with NVIDIA Tesla GPU devices. A dedicated module (Multi-Stream) can optionally process input UDP streams before data is delivered through PCIe DMA to their destination devices, re-organizing data from different streams guaranteeing computational optimization. NaNet-10 is going to be integrated in the NA62 CERN experiment in order to assess the suitability of GPGPU systems as real-time triggers; results and lessons learned while performing this activity will be reported herein.

  3. A hybrid reconstruction algorithm for fast and accurate 4D cone-beam CT imaging.

    PubMed

    Yan, Hao; Zhen, Xin; Folkerts, Michael; Li, Yongbao; Pan, Tinsu; Cervino, Laura; Jiang, Steve B; Jia, Xun

    2014-07-01

    4D cone beam CT (4D-CBCT) has been utilized in radiation therapy to provide 4D image guidance in lung and upper abdomen area. However, clinical application of 4D-CBCT is currently limited due to the long scan time and low image quality. The purpose of this paper is to develop a new 4D-CBCT reconstruction method that restores volumetric images based on the 1-min scan data acquired with a standard 3D-CBCT protocol. The model optimizes a deformation vector field that deforms a patient-specific planning CT (p-CT), so that the calculated 4D-CBCT projections match measurements. A forward-backward splitting (FBS) method is invented to solve the optimization problem. It splits the original problem into two well-studied subproblems, i.e., image reconstruction and deformable image registration. By iteratively solving the two subproblems, FBS gradually yields correct deformation information, while maintaining high image quality. The whole workflow is implemented on a graphic-processing-unit to improve efficiency. Comprehensive evaluations have been conducted on a moving phantom and three real patient cases regarding the accuracy and quality of the reconstructed images, as well as the algorithm robustness and efficiency. The proposed algorithm reconstructs 4D-CBCT images from highly under-sampled projection data acquired with 1-min scans. Regarding the anatomical structure location accuracy, 0.204 mm average differences and 0.484 mm maximum difference are found for the phantom case, and the maximum differences of 0.3-0.5 mm for patients 1-3 are observed. As for the image quality, intensity errors below 5 and 20 HU compared to the planning CT are achieved for the phantom and the patient cases, respectively. Signal-noise-ratio values are improved by 12.74 and 5.12 times compared to results from FDK algorithm using the 1-min data and 4-min data, respectively. The computation time of the algorithm on a NVIDIA GTX590 card is 1-1.5 min per phase. High-quality 4D-CBCT imaging based on the clinically standard 1-min 3D CBCT scanning protocol is feasible via the proposed hybrid reconstruction algorithm.

  4. Real-time dose computation: GPU-accelerated source modeling and superposition/convolution

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Jacques, Robert; Wong, John; Taylor, Russell

    Purpose: To accelerate dose calculation to interactive rates using highly parallel graphics processing units (GPUs). Methods: The authors have extended their prior work in GPU-accelerated superposition/convolution with a modern dual-source model and have enhanced performance. The primary source algorithm supports both focused leaf ends and asymmetric rounded leaf ends. The extra-focal algorithm uses a discretized, isotropic area source and models multileaf collimator leaf height effects. The spectral and attenuation effects of static beam modifiers were integrated into each source's spectral function. The authors introduce the concepts of arc superposition and delta superposition. Arc superposition utilizes separate angular sampling for themore » total energy released per unit mass (TERMA) and superposition computations to increase accuracy and performance. Delta superposition allows single beamlet changes to be computed efficiently. The authors extended their concept of multi-resolution superposition to include kernel tilting. Multi-resolution superposition approximates solid angle ray-tracing, improving performance and scalability with a minor loss in accuracy. Superposition/convolution was implemented using the inverse cumulative-cumulative kernel and exact radiological path ray-tracing. The accuracy analyses were performed using multiple kernel ray samplings, both with and without kernel tilting and multi-resolution superposition. Results: Source model performance was <9 ms (data dependent) for a high resolution (400{sup 2}) field using an NVIDIA (Santa Clara, CA) GeForce GTX 280. Computation of the physically correct multispectral TERMA attenuation was improved by a material centric approach, which increased performance by over 80%. Superposition performance was improved by {approx}24% to 0.058 and 0.94 s for 64{sup 3} and 128{sup 3} water phantoms; a speed-up of 101-144x over the highly optimized Pinnacle{sup 3} (Philips, Madison, WI) implementation. Pinnacle{sup 3} times were 8.3 and 94 s, respectively, on an AMD (Sunnyvale, CA) Opteron 254 (two cores, 2.8 GHz). Conclusions: The authors have completed a comprehensive, GPU-accelerated dose engine in order to provide a substantial performance gain over CPU based implementations. Real-time dose computation is feasible with the accuracy levels of the superposition/convolution algorithm.« less

  5. GPU accelerated Monte-Carlo simulation of SEM images for metrology

    NASA Astrophysics Data System (ADS)

    Verduin, T.; Lokhorst, S. R.; Hagen, C. W.

    2016-03-01

    In this work we address the computation times of numerical studies in dimensional metrology. In particular, full Monte-Carlo simulation programs for scanning electron microscopy (SEM) image acquisition are known to be notoriously slow. Our quest in reducing the computation time of SEM image simulation has led us to investigate the use of graphics processing units (GPUs) for metrology. We have succeeded in creating a full Monte-Carlo simulation program for SEM images, which runs entirely on a GPU. The physical scattering models of this GPU simulator are identical to a previous CPU-based simulator, which includes the dielectric function model for inelastic scattering and also refinements for low-voltage SEM applications. As a case study for the performance, we considered the simulated exposure of a complex feature: an isolated silicon line with rough sidewalls located on a at silicon substrate. The surface of the rough feature is decomposed into 408 012 triangles. We have used an exposure dose of 6 mC/cm2, which corresponds to 6 553 600 primary electrons on average (Poisson distributed). We repeat the simulation for various primary electron energies, 300 eV, 500 eV, 800 eV, 1 keV, 3 keV and 5 keV. At first we run the simulation on a GeForce GTX480 from NVIDIA. The very same simulation is duplicated on our CPU-based program, for which we have used an Intel Xeon X5650. Apart from statistics in the simulation, no difference is found between the CPU and GPU simulated results. The GTX480 generates the images (depending on the primary electron energy) 350 to 425 times faster than a single threaded Intel X5650 CPU. Although this is a tremendous speedup, we actually have not reached the maximum throughput because of the limited amount of available memory on the GTX480. Nevertheless, the speedup enables the fast acquisition of simulated SEM images for metrology. We now have the potential to investigate case studies in CD-SEM metrology, which otherwise would take unreasonable amounts of computation time.

  6. The Structure and Properties of Silica Glass Nanostructures using Novel Computational Systems

    NASA Astrophysics Data System (ADS)

    Doblack, Benjamin N.

    The structure and properties of silica glass nanostructures are examined using computational methods in this work. Standard synthesis methods of silica and its associated material properties are first discussed in brief. A review of prior experiments on this amorphous material is also presented. Background and methodology for the simulation of mechanical tests on amorphous bulk silica and nanostructures are later presented. A new computational system for the accurate and fast simulation of silica glass is also presented, using an appropriate interatomic potential for this material within the open-source molecular dynamics computer program LAMMPS. This alternative computational method uses modern graphics processors, Nvidia CUDA technology and specialized scientific codes to overcome processing speed barriers common to traditional computing methods. In conjunction with a virtual reality system used to model select materials, this enhancement allows the addition of accelerated molecular dynamics simulation capability. The motivation is to provide a novel research environment which simultaneously allows visualization, simulation, modeling and analysis. The research goal of this project is to investigate the structure and size dependent mechanical properties of silica glass nanohelical structures under tensile MD conditions using the innovative computational system. Specifically, silica nanoribbons and nanosprings are evaluated which revealed unique size dependent elastic moduli when compared to the bulk material. For the nanoribbons, the tensile behavior differed widely between the models simulated, with distinct characteristic extended elastic regions. In the case of the nanosprings simulated, more clear trends are observed. In particular, larger nanospring wire cross-sectional radii (r) lead to larger Young's moduli, while larger helical diameters (2R) resulted in smaller Young's moduli. Structural transformations and theoretical models are also analyzed to identify possible factors which might affect the mechanical response of silica nanostructures under tension. The work presented outlines an innovative simulation methodology, and discusses how results can be validated against prior experimental and simulation findings. The ultimate goal is to develop new computational methods for the study of nanostructures which will make the field of materials science more accessible, cost effective and efficient.

  7. A fast GPU-based Monte Carlo simulation of proton transport with detailed modeling of nonelastic interactions.

    PubMed

    Wan Chan Tseung, H; Ma, J; Beltran, C

    2015-06-01

    Very fast Monte Carlo (MC) simulations of proton transport have been implemented recently on graphics processing units (GPUs). However, these MCs usually use simplified models for nonelastic proton-nucleus interactions. Our primary goal is to build a GPU-based proton transport MC with detailed modeling of elastic and nonelastic proton-nucleus collisions. Using the cuda framework, the authors implemented GPU kernels for the following tasks: (1) simulation of beam spots from our possible scanning nozzle configurations, (2) proton propagation through CT geometry, taking into account nuclear elastic scattering, multiple scattering, and energy loss straggling, (3) modeling of the intranuclear cascade stage of nonelastic interactions when they occur, (4) simulation of nuclear evaporation, and (5) statistical error estimates on the dose. To validate our MC, the authors performed (1) secondary particle yield calculations in proton collisions with therapeutically relevant nuclei, (2) dose calculations in homogeneous phantoms, (3) recalculations of complex head and neck treatment plans from a commercially available treatment planning system, and compared with (GEANT)4.9.6p2/TOPAS. Yields, energy, and angular distributions of secondaries from nonelastic collisions on various nuclei are in good agreement with the (GEANT)4.9.6p2 Bertini and Binary cascade models. The 3D-gamma pass rate at 2%-2 mm for treatment plan simulations is typically 98%. The net computational time on a NVIDIA GTX680 card, including all CPU-GPU data transfers, is ∼ 20 s for 1 × 10(7) proton histories. Our GPU-based MC is the first of its kind to include a detailed nuclear model to handle nonelastic interactions of protons with any nucleus. Dosimetric calculations are in very good agreement with (GEANT)4.9.6p2/TOPAS. Our MC is being integrated into a framework to perform fast routine clinical QA of pencil-beam based treatment plans, and is being used as the dose calculation engine in a clinically applicable MC-based IMPT treatment planning system. The detailed nuclear modeling will allow us to perform very fast linear energy transfer and neutron dose estimates on the GPU.

  8. DspaceOgre 3D Graphics Visualization Tool

    NASA Technical Reports Server (NTRS)

    Jain, Abhinandan; Myin, Steven; Pomerantz, Marc I.

    2011-01-01

    This general-purpose 3D graphics visualization C++ tool is designed for visualization of simulation and analysis data for articulated mechanisms. Examples of such systems are vehicles, robotic arms, biomechanics models, and biomolecular structures. DspaceOgre builds upon the open-source Ogre3D graphics visualization library. It provides additional classes to support the management of complex scenes involving multiple viewpoints and different scene groups, and can be used as a remote graphics server. This software provides improved support for adding programs at the graphics processing unit (GPU) level for improved performance. It also improves upon the messaging interface it exposes for use as a visualization server.

  9. Design Application Translates 2-D Graphics to 3-D Surfaces

    NASA Technical Reports Server (NTRS)

    2007-01-01

    Fabric Images Inc., specializing in the printing and manufacturing of fabric tension architecture for the retail, museum, and exhibit/tradeshow communities, designed software to translate 2-D graphics for 3-D surfaces prior to print production. Fabric Images' fabric-flattening design process models a 3-D surface based on computer-aided design (CAD) specifications. The surface geometry of the model is used to form a 2-D template, similar to a flattening process developed by NASA's Glenn Research Center. This template or pattern is then applied in the development of a 2-D graphic layout. Benefits of this process include 11.5 percent time savings per project, less material wasted, and the ability to improve upon graphic techniques and offer new design services. Partners include Exhibitgroup/Giltspur (end-user client: TAC Air, a division of Truman Arnold Companies Inc.), Jack Morton Worldwide (end-user client: Nickelodeon), as well as 3D Exhibits Inc., and MG Design Associates Corp.

  10. Curriculum Design of Computer Graphics Programs: A Survey of Art/Design Programs at the University Level.

    ERIC Educational Resources Information Center

    McKee, Richard Lee

    This master's thesis reports the results of a survey submitted to over 30 colleges and universities that currently offer computer graphics courses or are in the planning stage of curriculum design. Intended to provide a profile of the computer graphics programs and insight into the process of curriculum design, the survey gathered data on program…

  11. Graphic Arts Technology: Industrial Arts Curriculum Guide. Grades 9-12. Bulletin No. 1334 (Tentative).

    ERIC Educational Resources Information Center

    Louisiana State Dept. of Education, Baton Rouge.

    The tentative guide in graphic arts technology for senior high schools is part of a series of industrial arts curriculum materials developed by the State of Louisiana. The course is designed to provide "hands-on" experience with tools and materials along with a study of the industrial processes in graphic arts technology. In addition,…

  12. Performance Modeling in CUDA Streams - A Means for High-Throughput Data Processing.

    PubMed

    Li, Hao; Yu, Di; Kumar, Anand; Tu, Yi-Cheng

    2014-10-01

    Push-based database management system (DBMS) is a new type of data processing software that streams large volume of data to concurrent query operators. The high data rate of such systems requires large computing power provided by the query engine. In our previous work, we built a push-based DBMS named G-SDMS to harness the unrivaled computational capabilities of modern GPUs. A major design goal of G-SDMS is to support concurrent processing of heterogenous query processing operations and enable resource allocation among such operations. Understanding the performance of operations as a result of resource consumption is thus a premise in the design of G-SDMS. With NVIDIA's CUDA framework as the system implementation platform, we present our recent work on performance modeling of CUDA kernels running concurrently under a runtime mechanism named CUDA stream . Specifically, we explore the connection between performance and resource occupancy of compute-bound kernels and develop a model that can predict the performance of such kernels. Furthermore, we provide an in-depth anatomy of the CUDA stream mechanism and summarize the main kernel scheduling disciplines in it. Our models and derived scheduling disciplines are verified by extensive experiments using synthetic and real-world CUDA kernels.

  13. Computer graphics application in the engineering design integration system

    NASA Technical Reports Server (NTRS)

    Glatt, C. R.; Abel, R. W.; Hirsch, G. N.; Alford, G. E.; Colquitt, W. N.; Stewart, W. A.

    1975-01-01

    The computer graphics aspect of the Engineering Design Integration (EDIN) system and its application to design problems were discussed. Three basic types of computer graphics may be used with the EDIN system for the evaluation of aerospace vehicles preliminary designs: offline graphics systems using vellum-inking or photographic processes, online graphics systems characterized by direct coupled low cost storage tube terminals with limited interactive capabilities, and a minicomputer based refresh terminal offering highly interactive capabilities. The offline line systems are characterized by high quality (resolution better than 0.254 mm) and slow turnaround (one to four days). The online systems are characterized by low cost, instant visualization of the computer results, slow line speed (300 BAUD), poor hard copy, and the early limitations on vector graphic input capabilities. The recent acquisition of the Adage 330 Graphic Display system has greatly enhanced the potential for interactive computer aided design.

  14. Interactive Computing and Graphics in Undergraduate Digital Signal Processing. Microcomputing Working Paper Series F 84-9.

    ERIC Educational Resources Information Center

    Onaral, Banu; And Others

    This report describes the development of a Drexel University electrical and computer engineering course on digital filter design that used interactive computing and graphics, and was one of three courses in a senior-level sequence on digital signal processing (DSP). Interactive and digital analysis/design routines and the interconnection of these…

  15. Malleable Thought: The Role of Craft Thinking in Practice-Led Graphic Design

    ERIC Educational Resources Information Center

    Ings, Welby

    2015-01-01

    This article considers the potential of craft processes as creative engagements in graphic design research. It initially discusses the uneasy history of craft within the discipline, then draws upon case studies undertaken by three established designers who, in their postgraduate theses, engaged with craft as a process of thinking. In doing so, the…

  16. Graphic model of the processes involved in the production of casegood furniture

    Treesearch

    Kristen G. Hoff; Subhash C. Sarin; R. Bruce Anderson; R. Bruce Anderson

    1992-01-01

    Imports from foreign furniture manufacturers are on ,the rise, and American manufacturers must take advantage of recent technological advances to regain their lost market share. To facilitate the implementation of these technologies for improving productivity and quality, a graphic model of the wood furniture production process is presented using the IDEF modeling...

  17. iMOSFLM: a new graphical interface for diffraction-image processing with MOSFLM

    PubMed Central

    Battye, T. Geoff G.; Kontogiannis, Luke; Johnson, Owen; Powell, Harold R.; Leslie, Andrew G. W.

    2011-01-01

    iMOSFLM is a graphical user interface to the diffraction data-integration program MOSFLM. It is designed to simplify data processing by dividing the process into a series of steps, which are normally carried out sequentially. Each step has its own display pane, allowing control over parameters that influence that step and providing graphical feedback to the user. Suitable values for integration parameters are set automatically, but additional menus provide a detailed level of control for experienced users. The image display and the interfaces to the different tasks (indexing, strategy calculation, cell refinement, integration and history) are described. The most important parameters for each step and the best way of assessing success or failure are discussed. PMID:21460445

  18. Geometric Processing and Its Relational Graphics

    DTIC Science & Technology

    1976-10-01

    20, If different from Report) f3. SUPPLEMENTARY NOTES 9. KEY WORDS (Cbnttnue on reverse aide if neceaaary .mdldentlfy by bfock number) Graphics GIFT ...are typified by defining an object as a series of adjacent triangular or rectangular patches or surfaces (ruled surfaces may also be used). The GIFT ...code embodies the Patch code concept in one of its solids, the ARS; however, processing of a many-faceted GIFT solid takes longer to process than its

  19. Common Graphics Library (CGL). Volume 1: LEZ user's guide

    NASA Technical Reports Server (NTRS)

    Taylor, Nancy L.; Hammond, Dana P.; Hofler, Alicia S.; Miner, David L.

    1988-01-01

    Users are introduced to and instructed in the use of the Langley Easy (LEZ) routines of the Common Graphics Library (CGL). The LEZ routines form an application independent graphics package which enables the user community to view data quickly and easily, while providing a means of generating scientific charts conforming to the publication and/or viewgraph process. A distinct advantage for using the LEZ routines is that the underlying graphics package may be replaced or modified without requiring the users to change their application programs. The library is written in ANSI FORTRAN 77, and currently uses a CORE-based underlying graphics package, and is therefore machine independent, providing support for centralized and/or distributed computer systems.

  20. Network, system, and status software enhancements for the autonomously managed electrical power system breadboard. Volume 4: Graphical status display

    NASA Technical Reports Server (NTRS)

    Mckee, James W.

    1990-01-01

    This volume (4 of 4) contains the description, structured flow charts, prints of the graphical displays, and source code to generate the displays for the AMPS graphical status system. The function of these displays is to present to the manager of the AMPS system a graphical status display with the hot boxes that allow the manager to get more detailed status on selected portions of the AMPS system. The development of the graphical displays is divided into two processes; the creation of the screen images and storage of them in files on the computer, and the running of the status program which uses the screen images.

Top