nvidia geforce gtx: Topics by Science.gov

Sample records for nvidia geforce gtx

Proton Testing of nVidia GTX 1050 GPU

NASA Technical Reports Server (NTRS)

Wyrwas, E. J.

2017-01-01

Single-Event Effects (SEE) testing was conducted on the nVidia GTX 1050 Graphics Processor Unit (GPU), herein referred to as the device under test (DUT). Testing was conducted at Massachusetts General Hospital's (MGH) Francis H. Burr Proton Therapy Center on April 9th, 2017 using 200-MeV protons. The purpose of this testing trip was to provide a baseline assessment of the radiation susceptibility of the DUT, as no previous testing had been conducted on this component.
Investigating the Importance of Stereo Displays for Helicopter Landing Simulation

DTIC Science & Technology

2016-08-11

visualization. The two instances of X Plane® were implemented using two separate PCs, each incorporating Intel i7 processors and Nvidia Quadro K4200... Nvidia GeForce GTX 680 graphics card was used to administer the stereo acuity and fusion range tests. The tests were displayed on an Asus VG278HE 3D...monitor with 1920x1080 pixels that was compatible with Nvidia 3D Vision2 and that used active shutter glasses. At a 1-m viewing distance, the
GPU acceleration for digitally reconstructed radiographs using bindless texture objects and CUDA/OpenGL interoperability.

PubMed

Abdellah, Marwan; Eldeib, Ayman; Owis, Mohamed I

2015-01-01

This paper features an advanced implementation of the X-ray rendering algorithm that harnesses the giant computing power of the current commodity graphics processors to accelerate the generation of high resolution digitally reconstructed radiographs (DRRs). The presented pipeline exploits the latest features of NVIDIA Graphics Processing Unit (GPU) architectures, mainly bindless texture objects and dynamic parallelism. The rendering throughput is substantially improved by exploiting the interoperability mechanisms between CUDA and OpenGL. The benchmarks of our optimized rendering pipeline reflect its capability of generating DRRs with resolutions of 2048^2 and 4096^2 at interactive and semi-interactive frame rates using an NVIDIA GeForce 970 GTX device.
Building a Terabyte Memory Bandwidth Compute Node with Four Consumer Electronics GPUs

NASA Astrophysics Data System (ADS)

Omlin, Samuel; Räss, Ludovic; Podladchikov, Yuri

2014-05-01

GPUs released for consumer electronics are generally built with the same chip architectures as the GPUs released for professional usage. With regard to scientific computing, there are no obvious important differences in functionality or performance between the two types of releases, yet the price can differ by up to one order of magnitude. For example, the consumer electronics release of the most recent NVIDIA Kepler architecture (GK110), named GeForce GTX TITAN, performed equally well in conducted memory bandwidth tests as the professional release, named Tesla K20; the consumer electronics release costs about one third of the professional release. We explain how to design and assemble a well adjusted computer with four high-end consumer electronics GPUs (GeForce GTX TITAN) combining more than 1 terabyte/s memory bandwidth. We compare the system's performance and precision with those of hardware released for professional usage. The system can be used as a powerful workstation for scientific computing or as a compute node in a home-built GPU cluster.
Design and Implementation of the PALM-3000 Real-Time Control System

NASA Technical Reports Server (NTRS)

Truong, Tuan N.; Bouchez, Antonin H.; Burruss, Rick S.; Dekany, Richard G.; Guiwits, Stephen R.; Roberts, Jennifer E.; Shelton, Jean C.; Troy, Mitchell

2012-01-01

This paper reflects, from a computational perspective, on the experience gathered in designing and implementing real-time control of the PALM-3000 adaptive optics system currently in operation at the Palomar Observatory. We review the algorithms that serve as functional requirements driving the architecture developed, and describe key design issues and solutions that contributed to the system's low compute latency. Additionally, we describe an implementation of dense matrix-vector multiplication for wavefront reconstruction that exceeds 95% of the maximum sustained achievable bandwidth on an NVIDIA GeForce 8800 GTX GPU.
GPU Acceleration of DSP for Communication Receivers.

PubMed

Gunther, Jake; Gunther, Hyrum; Moon, Todd

2017-09-01

Graphics processing unit (GPU) implementations of signal processing algorithms can outperform CPU-based implementations. This paper describes the GPU implementation of several algorithms encountered in a wide range of high-data-rate communication receivers, including filters, multirate filters, numerically controlled oscillators, and multi-stage digital down converters. These structures are tested by processing the 20 MHz wide FM radio band (88-108 MHz). Two receiver structures are explored: a single channel receiver and a filter bank channelizer. Both run in real time on an NVIDIA GeForce GTX 1080 graphics card.
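To make the receiver building blocks named in this abstract concrete, here is a minimal CPU sketch of a numerically controlled oscillator (NCO) driving a single-stage digital down converter. It is an illustration only, not the authors' GPU code: the function names `nco` and `down_convert` are hypothetical, and the low-pass filter is assumed to be a plain block average, which is far cruder than the multirate filters the paper describes.

```python
import cmath
import math


def nco(freq_hz, sample_rate_hz, n):
    """Numerically controlled oscillator: n samples of exp(-j*2*pi*f*t)."""
    step = freq_hz / sample_rate_hz  # phase increment in cycles per sample
    phase = 0.0
    out = []
    for _ in range(n):
        out.append(cmath.exp(-2j * math.pi * phase))
        phase = (phase + step) % 1.0  # phase accumulator wraps at one cycle
    return out


def down_convert(samples, freq_hz, sample_rate_hz, decim):
    """Mix to baseband, then average each block of `decim` samples.

    The block average acts as a crude low-pass filter and decimator
    (an assumption made for brevity, not the paper's filter design).
    """
    lo = nco(freq_hz, sample_rate_hz, len(samples))
    mixed = [s * l for s, l in zip(samples, lo)]
    return [sum(mixed[i:i + decim]) / decim
            for i in range(0, len(mixed) - decim + 1, decim)]
```

For a real cosine at the mixing frequency, the baseband output settles at half the input amplitude, because the mixer splits the tone into a DC term and a double-frequency term that the averaging removes.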
Computational algorithms for simulations in atmospheric optics.

PubMed

Konyaev, P A; Lukin, V P

2016-04-20

A computer simulation technique for atmospheric and adaptive optics based on parallel programming is discussed. A parallel propagation algorithm is designed and a modified spectral-phase method for computer generation of 2D time-variant random fields is developed. Temporal power spectra of Laguerre-Gaussian beam fluctuations are considered as an example to illustrate the applications discussed. Implementation of the proposed algorithms using Intel MKL and IPP libraries and NVIDIA CUDA technology is shown to be very fast and accurate. The hardware system for the computer simulation is an off-the-shelf desktop with an Intel Core i7-4790K CPU operating at a turbo-speed frequency up to 5 GHz and an NVIDIA GeForce GTX-960 graphics accelerator with 1024 1.5 GHz processors.
Parallel hyperspectral compressive sensing method on GPU

NASA Astrophysics Data System (ADS)

Bernabé, Sergio; Martín, Gabriel; Nascimento, José M. P.

2015-10-01

Remote hyperspectral sensors collect large amounts of data per flight, usually with low spatial resolution. It is known that the bandwidth of the connection between the satellite/airborne platform and the ground station is limited, thus an onboard compression method is desirable to reduce the amount of data to be transmitted. This paper presents a parallel implementation of a compressive sensing method, called parallel hyperspectral coded aperture (P-HYCA), for graphics processing units (GPU) using the compute unified device architecture (CUDA). This method takes into account two main properties of hyperspectral datasets, namely the high correlation existing among the spectral bands and the generally low number of endmembers needed to explain the data, which largely reduces the number of measurements necessary to correctly reconstruct the original data. Experimental results obtained using synthetic and real hyperspectral datasets on two different GPU architectures by NVIDIA, GeForce GTX 590 and GeForce GTX TITAN, reveal that the use of GPUs can provide real-time compressive sensing performance. The achieved speedup is up to 20 times when compared with the processing time of HYCA running on one core of the Intel i7-2600 CPU (3.4 GHz), with 16 Gbyte memory.
Accelerating Smith-Waterman Algorithm for Biological Database Search on CUDA-Compatible GPUs

NASA Astrophysics Data System (ADS)

Munekawa, Yuma; Ino, Fumihiko; Hagihara, Kenichi

This paper presents a fast method capable of accelerating the Smith-Waterman algorithm for biological database search on a cluster of graphics processing units (GPUs). Our method is implemented using compute unified device architecture (CUDA), which is available on the nVIDIA GPU. As compared with previous methods, our method has four major contributions. (1) The method efficiently uses on-chip shared memory to reduce the data amount being transferred between off-chip video memory and processing elements in the GPU. (2) It also reduces the number of data fetches by applying a data reuse technique to query and database sequences. (3) A pipelined method is also implemented to overlap GPU execution with database access. (4) Finally, a master/worker paradigm is employed to accelerate hundreds of database searches on a cluster system. In experiments, the peak performance on a GeForce GTX 280 card reaches 8.32 giga cell updates per second (GCUPS). We also find that our method reduces the amount of data fetches to 1/140, achieving approximately three times higher performance than a previous CUDA-based method. Our 32-node cluster version is approximately 28 times faster than a single GPU version. Furthermore, the effective performance reaches 75.6 giga instructions per second (GIPS) using 32 GeForce 8800 GTX cards.
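For readers unfamiliar with the kernel being accelerated here, the following is a minimal CPU sketch of Smith-Waterman local alignment scoring. It assumes a linear gap penalty and simple match/mismatch scores chosen purely for illustration; the function name and parameters are hypothetical and do not reproduce the paper's CUDA implementation or its scoring scheme. Each inner-loop iteration fills one cell of the dynamic programming matrix, which is the "cell update" counted by the GCUPS metric.

```python
def smith_waterman_score(query, target, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score (linear gap penalty).

    The scoring parameters are illustrative assumptions.
    """
    prev = [0] * (len(target) + 1)  # previous row of the DP matrix
    best = 0
    for q in query:
        curr = [0]  # the first column is always zero
        for j, t in enumerate(target, start=1):
            diag = prev[j - 1] + (match if q == t else mismatch)
            up = prev[j] + gap        # gap inserted in the target
            left = curr[j - 1] + gap  # gap inserted in the query
            cell = max(0, diag, up, left)  # scores never drop below zero
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best
```

Only two rows are kept in memory at a time, which is the same data-reuse idea, in miniature, that makes the algorithm a good fit for fast on-chip memory.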
CUDA-based real time surgery simulation.

PubMed

Liu, Youquan; De, Suvranu

2008-01-01

In this paper we present a general software platform that enables real time surgery simulation on the newly available compute unified device architecture (CUDA) from NVIDIA. CUDA-enabled GPUs harness the power of 128 processors which allow data parallel computations. Compared to the previous GPGPU, it is significantly more flexible with a C language interface. We report implementation of both collision detection and consequent deformation computation algorithms. Our test results indicate that CUDA enables a twenty times speedup for collision detection and about fifteen times speedup for deformation computation on an Intel Core 2 Quad 2.66 GHz machine with GeForce 8800 GTX.
GPU accelerated generation of digitally reconstructed radiographs for 2-D/3-D image registration.

PubMed

Dorgham, Osama M; Laycock, Stephen D; Fisher, Mark H

2012-09-01

Recent advances in programming languages for graphics processing units (GPUs) provide developers with a convenient way of implementing applications which can be executed on the CPU and GPU interchangeably. GPUs are becoming relatively cheap, powerful, and widely available hardware components, which can be used to perform intensive calculations. The last decade of hardware performance developments shows that GPU-based computation is progressing significantly faster than CPU-based computation, particularly if one considers the execution of highly parallelisable algorithms. Future predictions illustrate that this trend is likely to continue. In this paper, we introduce a way of accelerating 2-D/3-D image registration by developing a hybrid system which executes on the CPU and utilizes the GPU for parallelizing the generation of digitally reconstructed radiographs (DRRs). Based on the advancements of the GPU over the CPU, it is timely to exploit the benefits of many-core GPU technology by developing algorithms for DRR generation. Although some previous work has investigated the rendering of DRRs using the GPU, this paper investigates approximations which reduce the computational overhead while still maintaining a quality consistent with that needed for 2-D/3-D registration with sufficient accuracy to be clinically acceptable in certain applications of radiation oncology. Furthermore, by comparing implementations of 2-D/3-D registration on the CPU and GPU, we investigate current performance and propose an optimal framework for PC implementations addressing the rigid registration problem. Using this framework, we are able to render DRR images from a 256×256×133 CT volume in ~24 ms using an NVidia GeForce 8800 GTX and in ~2 ms using NVidia GeForce GTX 580. In addition to applications requiring fast automatic patient setup, these levels of performance suggest image-guided radiation therapy at video frame rates is technically feasible using relatively low cost PC
MATCHED FILTER COMPUTATION ON FPGA, CELL, AND GPU

DOE Office of Scientific and Technical Information (OSTI.GOV)

BAKER, ZACHARY K.; GOKHALE, MAYA B.; TRIPP, JUSTIN L.

2007-01-08

The matched filter is an important kernel in the processing of hyperspectral data. The filter enables researchers to sift useful data from instruments that span large frequency bands. In this work, they evaluate the performance of a matched filter algorithm implementation on an accelerated co-processor (XD1000), the IBM Cell microprocessor, and the NVIDIA GeForce 6900 GTX GPU graphics card. They provide extensive discussion of the challenges and opportunities afforded by each platform. In particular, they explore the problems of partitioning the filter most efficiently between the host CPU and the co-processor. Using their results, they derive several performance metrics that provide the optimal solution for a variety of application situations.
Solving lattice QCD systems of equations using mixed precision solvers on GPUs

NASA Astrophysics Data System (ADS)

Clark, M. A.; Babich, R.; Barros, K.; Brower, R. C.; Rebbi, C.

2010-09-01

Modern graphics hardware is designed for highly parallel numerical tasks and promises significant cost and performance benefits for many scientific applications. One such application is lattice quantum chromodynamics (lattice QCD), where the main computational challenge is to efficiently solve the discretized Dirac equation in the presence of an SU(3) gauge field. Using NVIDIA's CUDA platform we have implemented a Wilson-Dirac sparse matrix-vector product that performs at up to 40, 135 and 212 Gflops for double, single and half precision respectively on NVIDIA's GeForce GTX 280 GPU. We have developed a new mixed precision approach for Krylov solvers using reliable updates which allows for full double precision accuracy while using only single or half precision arithmetic for the bulk of the computation. The resulting BiCGstab and CG solvers run in excess of 100 Gflops and, in terms of iterations until convergence, perform better than the usual defect-correction approach for mixed precision.
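As background for the comparison made in this abstract, the sketch below illustrates the usual defect-correction baseline (also known as iterative refinement), not the authors' reliable-update scheme. It rests on several loudly stated assumptions: single precision is emulated by rounding through `struct`, the inner solver is a plain Jacobi iteration rather than BiCGstab or CG, and the test matrix is a small diagonally dominant example rather than a Dirac operator. All function names are hypothetical.

```python
import struct


def to_f32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def matvec(a, x):
    """Dense matrix-vector product in double precision."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def jacobi_f32(a, r, sweeps=20):
    """Approximately solve a*e = r, rounding each update to single precision."""
    n = len(r)
    e = [0.0] * n
    for _ in range(sweeps):
        e = [to_f32((r[i] - sum(a[i][j] * e[j] for j in range(n) if j != i))
                    / a[i][i])
             for i in range(n)]
    return e


def defect_correction(a, b, tol=1e-12, max_outer=50):
    """Outer loop in double precision, inner solves in emulated single precision."""
    x = [0.0] * len(b)
    for _ in range(max_outer):
        # The residual is evaluated in full double precision.
        r = [bi - axi for bi, axi in zip(b, matvec(a, x))]
        if max(abs(ri) for ri in r) < tol:
            break
        e = jacobi_f32(a, r)  # low-precision correction
        x = [xi + ei for xi, ei in zip(x, e)]
    return x
```

The point of the structure is that the expensive inner work runs at low precision, while the double-precision residual keeps pulling the accumulated solution toward full accuracy. Each outer iteration restarts the inner solver from scratch, which is the overhead that schemes like the one in the abstract aim to avoid.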
CUDASW++: optimizing Smith-Waterman sequence database searches for CUDA-enabled graphics processing units

PubMed Central

Liu, Yongchao; Maskell, Douglas L; Schmidt, Bertil

2009-01-01

Background: The Smith-Waterman algorithm is one of the most widely used tools for searching biological sequence databases due to its high sensitivity. Unfortunately, the Smith-Waterman algorithm is computationally demanding, which is further compounded by the exponential growth of sequence databases. The recent emergence of many-core architectures, and their associated programming interfaces, provides an opportunity to accelerate sequence database searches using commonly available and inexpensive hardware. Findings: Our CUDASW++ implementation (benchmarked on a single-GPU NVIDIA GeForce GTX 280 graphics card and a dual-GPU GeForce GTX 295 graphics card) provides a significant performance improvement compared to other publicly available implementations, such as SWPS3, CBESW, SW-CUDA, and NCBI-BLAST. CUDASW++ supports query sequences of length up to 59K and for query sequences ranging in length from 144 to 5,478 in Swiss-Prot release 56.6, the single-GPU version achieves an average performance of 9.509 GCUPS with a lowest performance of 9.039 GCUPS and a highest performance of 9.660 GCUPS, and the dual-GPU version achieves an average performance of 14.484 GCUPS with a lowest performance of 10.660 GCUPS and a highest performance of 16.087 GCUPS. Conclusion: CUDASW++ is publicly available open-source software. It provides a significant performance improvement for Smith-Waterman-based protein sequence database searches by fully exploiting the compute capability of commonly used CUDA-enabled low-cost GPUs. PMID:19416548
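The GCUPS figures quoted in this record are conventionally computed as the number of dynamic programming cells filled (query length times total database residues) divided by the run time in seconds and by 10^9. A small helper makes the arithmetic explicit; the function name and the numbers in the usage note are hypothetical and are not taken from the paper.

```python
def gcups(query_length, database_residues, seconds):
    """Giga cell updates per second for one database search."""
    cells = query_length * database_residues  # one DP cell per residue pair
    return cells / (seconds * 1e9)
```

As a made-up example, a 1,000-residue query searched against a database of 100 million residues in 10 seconds corresponds to `gcups(1000, 100_000_000, 10.0)`, that is, 10 GCUPS.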
Graphics processing unit (GPU)-based computation of heat conduction in thermally anisotropic solids

NASA Astrophysics Data System (ADS)

Nahas, C. A.; Balasubramaniam, Krishnan; Rajagopal, Prabhu

2013-01-01

Numerical modeling of anisotropic media is a computationally intensive task since it brings additional complexity to the field problem in such a way that the physical properties are different in different directions. Largely used in the aerospace industry because of their lightweight nature, composite materials are a very good example of thermally anisotropic media. With advancements in video gaming technology, parallel processors are much cheaper today and accessibility to higher-end graphical processing devices has increased dramatically over the past couple of years. Since these massively parallel GPUs are very good at handling floating point arithmetic, they provide a new platform for engineers and scientists to accelerate their numerical models using commodity hardware. In this paper we implement a parallel finite difference model of thermal diffusion through anisotropic media using NVIDIA CUDA (Compute Unified Device Architecture). We use the NVIDIA GeForce GTX 560 Ti as our primary computing device, which consists of 384 CUDA cores clocked at 1645 MHz, with a standard desktop PC as the host platform. We compare the results with those from a standard CPU implementation for accuracy and speed and draw implications for simulation using the GPU paradigm.
Test Report for NG Sensors GTX-1000

DOE Office of Scientific and Technical Information (OSTI.GOV)

Manginell, Ronald P.

2014-12-01

This report describes initial testing of the NG Sensor GTX-1000 natural gas monitoring system. This testing showed that the retention time, peak area stability and heating value repeatability of the GTX-1000 were promising for natural gas measurements in the field or at the well head. The repeatability can be less than 0.25% for LHV and HHV for the Airgas standard tested in this report, which is very promising for a first generation prototype. Ultimately this system should be capable of 0.1% repeatability in heating value at significant size and power reductions compared with competing systems.
SRM-Assisted Trajectory for the GTX Reference Vehicle

NASA Technical Reports Server (NTRS)

Riehl, John; Trefny, Charles; Kosareo, Daniel

2002-01-01

A goal of the GTX effort has been to demonstrate the feasibility of a single stage-to-orbit (SSTO) vehicle that delivers a small payload to low earth orbit. The small payload class was chosen in order to minimize the risk and cost of development of this revolutionary system. A preliminary design study by the GTX team has resulted in the current configuration that offers considerable promise for meeting the stated goal. The size and gross lift-off weight resulting from scaling the current design to closure however may be considered impractical for the small payload. In lieu of evolving the project's reference vehicle to a large-payload class, this paper offers the alternative of using solid-rocket motors in order to close the vehicle at a practical scale. This approach offers a near-term, quasi-reusable system that easily evolves to reusable SSTO following subsequent development and optimization. This paper presents an overview of the impact of the addition of SRM's to the GTX reference vehicle's performance and trajectory. The overall methods of vehicle modeling and trajectory optimization will also be presented. A key element in the trajectory optimization is the use of the program OTIS 3.10 that provides rapid convergence and a great deal of flexibility to the user. This paper will also present the methods used to implement GTX requirements into OTIS modeling.
SRM-Assisted Trajectory for the GTX Reference Vehicle

NASA Technical Reports Server (NTRS)

Riehl, John; Trefny, Charles; Kosareo, Daniel (Technical Monitor)

2002-01-01

A goal of the GTX effort has been to demonstrate the feasibility of a single stage-to-orbit (SSTO) vehicle that delivers a small payload to low earth orbit. The small payload class was chosen in order to minimize the risk and cost of development of this revolutionary system. A preliminary design study by the GTX team has resulted in the current configuration that offers considerable promise for meeting the stated goal. The size and gross lift-off weight resulting from scaling the current design to closure however may be considered impractical for the small payload. In lieu of evolving the project's reference vehicle to a large-payload class, this paper offers the alternative of using solid-rocket motors in order to close the vehicle at a practical scale. This approach offers a near-term, quasi-reusable system that easily evolves to reusable SSTO following subsequent development and optimization. This paper presents an overview of the impact of the addition of SRM's to the GTX reference vehicle's performance and trajectory. The overall methods of vehicle modeling and trajectory optimization will also be presented. A key element in the trajectory optimization is the use of the program OTIS 3.10 that provides rapid convergence and a great deal of flexibility to the user. This paper will also present the methods used to implement GTX requirements into OTIS modeling.
Acceleration of spiking neural network based pattern recognition on NVIDIA graphics processors.

PubMed

Han, Bing; Taha, Tarek M

2010-04-01

There is currently a strong push in the research community to develop biological scale implementations of neuron based vision models. Systems at this scale are computationally demanding and generally utilize more accurate neuron models, such as the Izhikevich and the Hodgkin-Huxley models, in favor of the more popular integrate and fire model. We examine the feasibility of using graphics processing units (GPUs) to accelerate a spiking neural network based character recognition network to enable such large scale systems. Two versions of the network utilizing the Izhikevich and Hodgkin-Huxley models are implemented. Three NVIDIA general-purpose (GP) GPU platforms are examined, including the GeForce 9800 GX2, the Tesla C1060, and the Tesla S1070. Our results show that the GPGPUs can provide significant speedup over conventional processors. In particular, the fastest GPGPU utilized, the Tesla S1070, provided a speedup of 5.6 and 84.4 over highly optimized implementations on the fastest central processing unit (CPU) tested, a quadcore 2.67 GHz Xeon processor, for the Izhikevich and the Hodgkin-Huxley models, respectively. The CPU implementation utilized all four cores and the vector data parallelism offered by the processor. The results indicate that GPUs are well suited for this application domain.
Decryption-decompression of AES protected ZIP files on GPUs

NASA Astrophysics Data System (ADS)

Duong, Tan Nhat; Pham, Phong Hong; Nguyen, Duc Huu; Nguyen, Thuy Thanh; Le, Hung Duc

2011-10-01

AES is a strong encryption system, so decryption-decompression of AES-encrypted ZIP files requires very large computing power and techniques for reducing the password space. This makes implementing such techniques on common computing systems impractical. In [1], we reduced the original, very large password search space to a much smaller one that is guaranteed to contain the correct password. Based on the reduced set of passwords, in this paper we parallelize decryption, decompression, and plain text recognition for encrypted ZIP files using CUDA computing technology on NVIDIA GeForce GTX295 graphics cards in order to find the correct password. The experimental results show that the speed of decrypting, decompressing, recognizing plain text, and finding the original password increases by about 45 to 180 times (depending on the number of GPUs) compared to sequential execution on the Intel Core 2 Quad Q8400 2.66 GHz. These results demonstrate the potential applicability of GPUs in this cryptanalysis field.

FPGA Implementation of the Coupled Filtering Method and the Affine Warping Method.

PubMed

Zhang, Chen; Liang, Tianzhu; Mok, Philip K T; Yu, Weichuan

2017-07-01

In ultrasound image analysis, speckle tracking methods are widely applied to study the elasticity of body tissue. However, "feature-motion decorrelation" remains a challenge for speckle tracking methods. Recently, a coupled filtering method and an affine warping method were proposed to accurately estimate strain values when the tissue deformation is large. The major drawback of these methods is their high computational complexity. Even the graphics processing unit (GPU)-based program requires a long time to finish the analysis. In this paper, we propose field-programmable gate array (FPGA)-based implementations of both methods for further acceleration. The capability of FPGAs in handling the different image processing components of these methods is discussed. A fast and memory-saving image warping approach is proposed. The algorithms are reformulated to build a highly efficient pipeline on FPGA. The final implementations on a Xilinx Virtex-7 FPGA are at least 13 times faster than the GPU implementation on an NVIDIA graphics card (GeForce GTX 580).
GOTHIC: Gravitational oct-tree code accelerated by hierarchical time step controlling

NASA Astrophysics Data System (ADS)

Miki, Yohei; Umemura, Masayuki

2017-04-01

The tree method is a widely implemented algorithm for collisionless N-body simulations in astrophysics well suited for GPU(s). Adopting hierarchical time stepping can accelerate N-body simulations; however, it is infrequently implemented and its potential remains untested in GPU implementations. We have developed a Gravitational Oct-Tree code accelerated by HIerarchical time step Controlling named GOTHIC, which adopts both the tree method and the hierarchical time step. The code adopts some adaptive optimizations by monitoring the execution time of each function on-the-fly and minimizes the time-to-solution by balancing the measured time of multiple functions. Results of performance measurements with realistic particle distribution performed on NVIDIA Tesla M2090, K20X, and GeForce GTX TITAN X, which are representative GPUs of the Fermi, Kepler, and Maxwell generation of GPUs, show that the hierarchical time step achieves a speedup by a factor of around 3-5 times compared to the shared time step. The measured elapsed time per step of GOTHIC is 0.30 s or 0.44 s on GTX TITAN X when the particle distribution represents the Andromeda galaxy or the NFW sphere, respectively, with 2^24 = 16,777,216 particles. The averaged performance of the code corresponds to 10-30% of the theoretical single precision peak performance of the GPU.
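The benefit of hierarchical over shared time stepping can be illustrated with a simple count of particle updates. The sketch below assumes the common power-of-two block time step scheme, in which particle i advances with step dt_max / 2^k; whether GOTHIC uses exactly this quantization is an assumption here, and the function name and inputs are hypothetical. It counts updates only and does not compute any forces.

```python
def count_updates(levels, n_base_steps=1):
    """Count particle updates over n_base_steps of the largest step.

    levels[i] = k means particle i uses a step of dt_max / 2**k.
    Returns (hierarchical_updates, shared_updates).
    """
    k_max = max(levels)
    ticks = n_base_steps * 2 ** k_max  # number of finest-level substeps
    hierarchical = 0
    for tick in range(1, ticks + 1):
        for k in levels:
            period = 2 ** (k_max - k)  # substeps between updates of this particle
            if tick % period == 0:
                hierarchical += 1
    # With a shared step, every particle moves at the finest step.
    shared = ticks * len(levels)
    return hierarchical, shared
```

With three slow particles at level 0 and one fast particle at level 3, `count_updates([0, 0, 0, 3])` gives 11 hierarchical updates against 32 shared ones. The actual speedup in a tree code is smaller than this ratio suggests, since tree construction and other per-step costs do not shrink in the same way.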
GPU-based cone beam computed tomography.

PubMed

Noël, Peter B; Walczak, Alan M; Xu, Jinhui; Corso, Jason J; Hoffmann, Kenneth R; Schafer, Sebastian

2010-06-01

The use of cone beam computed tomography (CBCT) is growing in the clinical arena due to its ability to provide 3D information during interventions, its high diagnostic quality (sub-millimeter resolution), and its short scanning times (60 s). In many situations, the short scanning time of CBCT is followed by a time-consuming 3D reconstruction. The standard reconstruction algorithm for CBCT data is the filtered backprojection, which for a volume of size 256^3 takes up to 25 min on a standard system. Recent developments in the area of Graphic Processing Units (GPUs) make it possible to have access to high-performance computing solutions at a low cost, allowing their use in many scientific problems. We have implemented an algorithm for 3D reconstruction of CBCT data using the Compute Unified Device Architecture (CUDA) provided by NVIDIA (NVIDIA Corporation, Santa Clara, California), which was executed on an NVIDIA GeForce GTX 280. Our implementation results in improved reconstruction times from minutes, and perhaps hours, to a matter of seconds, while also giving the clinician the ability to view 3D volumetric data at higher resolutions. We evaluated our implementation on ten clinical data sets and one phantom data set to observe if differences occur between CPU and GPU-based reconstructions. By using our approach, the computation time for 256^3 is reduced from 25 min on the CPU to 3.2 s on the GPU. The GPU reconstruction time for 512^3 volumes is 8.5 s.
GBOOST: a GPU-based tool for detecting gene-gene interactions in genome-wide case control studies.

PubMed

Yung, Ling Sing; Yang, Can; Wan, Xiang; Yu, Weichuan

2011-05-01

Collecting millions of genetic variations is feasible with advanced genotyping technology. With a huge amount of genetic variation data in hand, developing efficient algorithms to carry out gene-gene interaction analysis in a timely manner has become one of the key problems in genome-wide association studies (GWAS). Boolean operation-based screening and testing (BOOST), a recent work in GWAS, completes gene-gene interaction analysis in 2.5 days on a desktop computer. Compared with central processing units (CPUs), graphic processing units (GPUs) are highly parallel hardware and provide massive computing resources. We are, therefore, motivated to use GPUs to further speed up the analysis of gene-gene interactions. We implement the BOOST method based on a GPU framework and name it GBOOST. GBOOST achieves a 40-fold speedup compared with BOOST. It completes the analysis of Wellcome Trust Case Control Consortium Type 2 Diabetes (WTCCC T2D) genome data within 1.34 h on a desktop computer equipped with an Nvidia GeForce GTX 285 display card. GBOOST code is available at http://bioinformatics.ust.hk/BOOST.html#GBOOST.
gCUP: rapid GPU-based HIV-1 co-receptor usage prediction for next-generation sequencing.

PubMed

Olejnik, Michael; Steuwer, Michel; Gorlatch, Sergei; Heider, Dominik

2014-11-15

Next-generation sequencing (NGS) has a large potential in HIV diagnostics, and genotypic prediction models have been developed and successfully tested in recent years. However, albeit being highly accurate, these computational models lack the computational efficiency to reach their full potential. In this study, we demonstrate the use of graphics processing units (GPUs) in combination with a computational prediction model for HIV tropism. Our new model named gCUP, parallelized and optimized for GPU, is highly accurate and can classify >175 000 sequences per second on an NVIDIA GeForce GTX 460. The computational efficiency of our new model is the next step to enable NGS technologies to reach clinical significance in HIV diagnostics. Moreover, our approach is not limited to HIV tropism prediction, but can also be easily adapted to other settings, e.g. drug resistance prediction. The source code can be downloaded at http://www.heiderlab.de. Contact: d.heider@wz-straubing.de.
Real-time electroholography using a multiple-graphics processing unit cluster system with a single spatial light modulator and the InfiniBand network

NASA Astrophysics Data System (ADS)

Niwase, Hiroaki; Takada, Naoki; Araki, Hiromitsu; Maeda, Yuki; Fujiwara, Masato; Nakayama, Hirotaka; Kakue, Takashi; Shimobaba, Tomoyoshi; Ito, Tomoyoshi

2016-09-01

Parallel calculations of large-pixel-count computer-generated holograms (CGHs) are suitable for multiple-graphics processing unit (multi-GPU) cluster systems. However, it is not easy for a multi-GPU cluster system to accomplish fast CGH calculations when CGH transfers between PCs are required. In these cases, the CGH transfer between the PCs becomes a bottleneck. Usually, this problem occurs only in multi-GPU cluster systems with a single spatial light modulator. To overcome this problem, we propose a simple method using the InfiniBand network. The computational speed of the proposed method using 13 GPUs (NVIDIA GeForce GTX TITAN X) was more than 3000 times faster than that of a CPU (Intel Core i7 4770) when the number of three-dimensional (3-D) object points exceeded 20,480. In practice, we achieved ~40 tera floating point operations per second (TFLOPS) when the number of 3-D object points exceeded 40,960. Our proposed method was able to reconstruct a real-time movie of a 3-D object comprising 95,949 points.
Parallel halftoning technique using dot diffusion optimization

NASA Astrophysics Data System (ADS)

Molina-Garcia, Javier; Ponomaryov, Volodymyr I.; Reyes-Reyes, Rogelio; Cruz-Ramos, Clara

2017-05-01

In this paper, a novel approach to halftoning is proposed and implemented for images obtained by the Dot Diffusion (DD) method. The technique is based on optimizing the so-called class matrix used in the DD algorithm: new versions of the class matrix are generated that contain no barons or near-barons, in order to minimize inconsistencies during the distribution of the error. Each proposed class matrix has different properties and is designed for one of two applications: those where inverse halftoning is necessary, and those where it is not required. The proposed method has been implemented on a GPU (NVIDIA GeForce GTX 750 Ti) and on multicore processors (AMD FX(tm)-6300 Six-Core Processor and Intel Core i5-4200U), using CUDA and OpenCV on a PC running Linux. Experimental results show that the novel framework generates good-quality halftone images and inverse-halftone images. The simulation results using parallel architectures demonstrate the efficiency of the novel technique when it is implemented for real-time processing.
Real-time time-division color electroholography using a single GPU and a USB module for synchronizing reference light.

PubMed

Araki, Hiromitsu; Takada, Naoki; Niwase, Hiroaki; Ikawa, Shohei; Fujiwara, Masato; Nakayama, Hirotaka; Kakue, Takashi; Shimobaba, Tomoyoshi; Ito, Tomoyoshi

2015-12-01

We propose real-time time-division color electroholography using a single graphics processing unit (GPU) and a simple synchronization system of reference light. To facilitate real-time time-division color electroholography, we developed a light emitting diode (LED) controller with a universal serial bus (USB) module and the drive circuit for reference light. A one-chip RGB LED connected to a personal computer via an LED controller was used as the reference light. A single GPU calculates three computer-generated holograms (CGHs) suitable for red, green, and blue colors in each frame of a three-dimensional (3D) movie. After CGH calculation using a single GPU, the CPU can synchronize the CGH display with the color switching of the one-chip RGB LED via the LED controller. Consequently, we succeeded in real-time time-division color electroholography for a 3D object consisting of around 1000 points per color when an NVIDIA GeForce GTX TITAN was used as the GPU. Furthermore, we implemented the proposed method in various GPUs. The experimental results showed that the proposed method was effective for various GPUs.
Speeding-up Bioinformatics Algorithms with Heterogeneous Architectures: Highly Heterogeneous Smith-Waterman (HHeterSW).

PubMed

Gálvez, Sergio; Ferusic, Adis; Esteban, Francisco J; Hernández, Pilar; Caballero, Juan A; Dorado, Gabriel

2016-10-01

The Smith-Waterman algorithm has a great sensitivity when used for biological sequence-database searches, but at the expense of high computing-power requirements. To overcome this problem, there are implementations in literature that exploit the different hardware-architectures available in a standard PC, such as GPU, CPU, and coprocessors. We introduce an application that splits the original database-search problem into smaller parts, resolves each of them by executing the most efficient implementations of the Smith-Waterman algorithms in different hardware architectures, and finally unifies the generated results. Using non-overlapping hardware allows simultaneous execution, and up to 2.58-fold performance gain, when compared with any other algorithm to search sequence databases. Even the performance of the popular BLAST heuristic is exceeded in 78% of the tests. The application has been tested with standard hardware: Intel i7-4820K CPU, Intel Xeon Phi 31S1P coprocessors, and nVidia GeForce GTX 960 graphics cards. An important increase in performance has been obtained in a wide range of situations, effectively exploiting the available hardware.
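As an illustration of the split-and-merge idea described above, the following sketch scores a query against a database with a plain Smith-Waterman recurrence, then repeats the search over separate database parts and merges the hits. It is a simplified stand-in, not the HHeterSW implementation: the scoring values, the linear gap penalty, and the function names are assumptions, and each part is processed sequentially here where the application would dispatch it to a different device.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (linear gap penalty, assumed values)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

def search(query, database):
    """Score the query against every (name, sequence) entry."""
    return [(name, smith_waterman_score(query, seq)) for name, seq in database]

def search_in_parts(query, database, parts):
    """Split the database, search each part, and merge the ranked hits."""
    chunk = max(1, -(-len(database) // parts))  # ceiling division
    hits = []
    for start in range(0, len(database), chunk):
        # Each slice stands in for the share handed to one device.
        hits.extend(search(query, database[start:start + chunk]))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

Because each database entry is scored independently, the merged ranking is the same as that of a single pass over the whole database; only the wall-clock time changes when the parts run concurrently.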
Extension of the validation of AOAC Official Method 2005.06 for dc-GTX2,3: interlaboratory study.

PubMed

Ben-Gigirey, Begoña; Rodríguez-Velasco, María L; Gago-Martínez, Ana

2012-01-01

AOAC Official Method(SM) 2005.06 for the determination of saxitoxin (STX)-group toxins in shellfish by LC with fluorescence detection with precolumn oxidation was previously validated and adopted First Action following a collaborative study. However, the method was not validated for all key STX-group toxins, and procedures to quantify some of them were not provided. With more STX-group toxin standards commercially available and modifications to procedures, it was possible to overcome some of these difficulties. The European Union Reference Laboratory for Marine Biotoxins conducted an interlaboratory exercise to extend AOAC Official Method 2005.06 validation for dc-GTX2,3 and to compile precision data for several STX-group toxins. This paper reports the study design and the results obtained. The performance characteristics for dc-GTX2,3 (intralaboratory and interlaboratory precision, recovery, and theoretical quantification limit) were evaluated. The mean recoveries obtained for dc-GTX2,3 were, in general, low (53.1-58.6%). The RSD for reproducibility (RSD(R)%) for dc-GTX2,3 in all samples ranged from 28.2 to 45.7%, and HorRat values ranged from 1.5 to 2.8. The article also describes a hydrolysis protocol to convert GTX6 to NEO, which has been proven to be useful for the quantification of GTX6 while the GTX6 standard is not available. The performance of the participant laboratories in the application of this method was compared with that obtained from the original collaborative study of the method. Intralaboratory and interlaboratory precision data for several STX-group toxins, including dc-NEO and GTX6, are reported here. This study can be useful for those laboratories determining STX-group toxins to fully implement AOAC Official Method 2005.06 for official paralytic shellfish poisoning control. However, the overall quantitative performance obtained with the method was poor for certain toxins.
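For readers unfamiliar with the HorRat values quoted above, the sketch below shows how such a ratio is conventionally computed: the observed reproducibility RSD is divided by the value predicted by the Horwitz equation, PRSD(R) = 2 · C^(−0.1505), where C is the analyte concentration expressed as a dimensionless mass fraction. The function names and the example concentration are illustrative assumptions; the abstract does not give the concentrations used in the study.

```python
def predicted_rsd_r(mass_fraction):
    """Horwitz-predicted reproducibility RSD in percent.

    mass_fraction: analyte concentration as a dimensionless mass fraction
                   (for example, 1 mg/kg corresponds to 1e-6).
    """
    return 2.0 * mass_fraction ** -0.1505

def horrat(observed_rsd_r, mass_fraction):
    """Ratio of the observed reproducibility RSD (%) to the Horwitz prediction."""
    return observed_rsd_r / predicted_rsd_r(mass_fraction)

# Hypothetical example: at 1 mg/kg the predicted RSD is about 16%,
# so an observed RSD of 32% would give a HorRat of about 2.
example = horrat(32.0, 1e-6)
```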
Performance Validation Approach for the GTX Air-Breathing Launch Vehicle

NASA Technical Reports Server (NTRS)

Trefny, Charles J.; Roche, Joseph M.

2002-01-01

The primary objective of the GTX effort is to determine whether or not air-breathing propulsion can enable a launch vehicle to achieve orbit in a single stage. Structural weight, vehicle aerodynamics, and propulsion performance must be accurately known over the entire flight trajectory in order to make a credible assessment. Structural, aerodynamic, and propulsion parameters are strongly interdependent, which necessitates a system approach to design, evaluation, and optimization of a single-stage-to-orbit concept. The GTX reference vehicle serves this purpose, by allowing design, development, and validation of components and subsystems in a system context. The reference vehicle configuration (including propulsion) was carefully chosen so as to provide high potential for structural and volumetric efficiency, and to allow the high specific impulse of air-breathing propulsion cycles to be exploited. Minor evolution of the configuration has occurred as analytical and experimental results have become available. With this development process comes increasing validation of the weight and performance levels used in system performance determination. This paper presents an overview of the GTX reference vehicle and the approach to its performance validation. Subscale test rigs and numerical studies used to develop and validate component performance levels and unit structural weights are outlined. The sensitivity of the equivalent, effective specific impulse to key propulsion component efficiencies is presented. The role of flight demonstration in development and validation is discussed.
Tensor Algebra Library for NVidia Graphics Processing Units

DOE Office of Scientific and Technical Information (OSTI.GOV)

Liakh, Dmitry

This is a general purpose math library implementing basic tensor algebra operations on NVidia GPU accelerators. This software is a tensor algebra library that can perform basic tensor algebra operations, including tensor contractions, tensor products, tensor additions, etc., on NVidia GPU accelerators, asynchronously with respect to the CPU host. It supports a simultaneous use of multiple NVidia GPUs. Each asynchronous API function returns a handle which can later be used for querying the completion of the corresponding tensor algebra operation on a specific GPU. The tensors participating in a particular tensor operation are assumed to be stored in local RAM of a node or GPU RAM. The main research area where this library can be utilized is the quantum many-body theory (e.g., in electronic structure theory).
Immunogenic and protective efficacy of recombinant protein GtxA-N against Gallibacterium anatis challenge in chickens.

PubMed

Pedersen, Ida J; Pors, Susanne E; Bager Skjerning, Ragnhild J; Nielsen, Søren S; Bojesen, Anders M

2015-10-01

Gallibacterium anatis is a major cause of reproductive tract infections in chickens. Here, we aimed to evaluate the efficacy of the recombinant protein GtxA-N at protecting hens, by addressing three objectives: (i) evaluating the antibody response following immunization, (ii) scoring and comparing lesions, following challenge with G. anatis, in immunized and non-immunized hens, and (iii) investigating whether the anti-GtxA-N antibody titre in individual hens correlated with the observed lesions. Two consecutive experiments were performed in hens. In the first experiment hens were immunized with GtxA-N on day 0 and day 14, infected with G. anatis on day 28 and euthanized on day 56. The GtxA-N antibody response was assessed in pooled serum samples throughout the experiment, using an indirect enzyme-linked immunosorbent assay (ELISA). In the second experiment the GtxA-N antibody titres were assessed in individual hens before and after immunization. Subsequently, the hens were inoculated with G. anatis and finally all hens were euthanized and submitted for post mortem examination 48 h after inoculation. Immunization elicited strong antibody responses that lasted at least 8 weeks (P < .0001). The individual antibody titres observed in response to immunization varied considerably among hens (range: 174,100-281,500). Lesion scores following G. anatis infection were significantly lower in immunized hens compared to non-immunized hens (P = .004). Within the immunized group, no correlation was found between the individual antibody titres and the lesion scores. This study clearly demonstrated GtxA-N as a vaccine antigen capable of inducing protective immunity against G. anatis.
Real-time colour hologram generation based on ray-sampling plane with multi-GPU acceleration.

PubMed

Sato, Hirochika; Kakue, Takashi; Ichihashi, Yasuyuki; Endo, Yutaka; Wakunami, Koki; Oi, Ryutaro; Yamamoto, Kenji; Nakayama, Hirotaka; Shimobaba, Tomoyoshi; Ito, Tomoyoshi

2018-01-24

Although electro-holography can reconstruct three-dimensional (3D) motion pictures, its computational cost is too heavy to allow for real-time reconstruction of 3D motion pictures. This study explores accelerating colour hologram generation using light-ray information on a ray-sampling (RS) plane with a graphics processing unit (GPU) to realise a real-time holographic display system. We refer to an image corresponding to light-ray information as an RS image. Colour holograms were generated from three RS images with resolutions of 2,048 × 2,048; 3,072 × 3,072 and 4,096 × 4,096 pixels. The computational results indicate that the generation of the colour holograms using multiple GPUs (NVIDIA GeForce GTX 1080) was approximately 300-500 times faster than generation using a central processing unit. In addition, the results demonstrate that 3D motion pictures were successfully reconstructed from RS images of 3,072 × 3,072 pixels at approximately 15 frames per second using an electro-holographic reconstruction system in which colour holograms were generated from RS images in real time.
Exploring DeepMedic for the purpose of segmenting white matter hyperintensity lesions

NASA Astrophysics Data System (ADS)

Lippert, Fiona; Cheng, Bastian; Golsari, Amir; Weiler, Florian; Gregori, Johannes; Thomalla, Götz; Klein, Jan

2018-02-01

DeepMedic, an open source software library based on a multi-channel multi-resolution 3D convolutional neural network, has recently been made publicly available for brain lesion segmentations. It has already been shown that segmentation tasks on MRI data of patients having traumatic brain injuries, brain tumors, and ischemic stroke lesions can be performed very well. In this paper we describe how it can efficiently be used for the purpose of detecting and segmenting white matter hyperintensity lesions. We examined whether it can be applied to single-channel routine 2D FLAIR data. For evaluation, we annotated 197 datasets with different numbers and sizes of white matter hyperintensity lesions. Our experiments have shown that substantial results with respect to the segmentation quality can be achieved. Compared to the original parametrization of the DeepMedic neural network, the training time can be drastically reduced by adjusting the corresponding training parameters, while at the same time the Dice coefficients remain nearly unchanged. This makes it possible to perform a whole training process within a single day on an NVIDIA GeForce GTX 580 graphics board, which makes this library also very interesting for research purposes on low-end GPU hardware.
Enobosarm (GTx-024, S-22): a potential treatment for cachexia.

PubMed

Srinath, Reshmi; Dobs, Adrian

2014-02-01

Muscle loss and wasting occurs with aging and in multiple disease states including cancer, heart failure, chronic obstructive pulmonary disease, end-stage liver disease, end-stage renal disease and HIV. Cachexia is defined as a multifactorial syndrome that is associated with anorexia, weight loss and increased catabolism, with increased morbidity and mortality. Currently no therapy is approved for the treatment or prevention of cachexia. Different treatment options have been suggested but many have proven to be ineffective or associated with adverse events. Nonsteroidal selective androgen receptor modulators (SARMs) are a new class of anabolic agents that bind the androgen receptor and exhibit tissue selectivity. Enobosarm (GTx-024, S-22) is a recently developed SARM, developed by GTx, Inc. (TN, USA), which has been tested in Phase I, II and III trials with promising results in terms of improving lean body mass and measurements of physical function and power. Enobosarm has received fast track designation by the US FDA and results from the Phase III trials POWER1 and POWER2 will help determine approval for use in the prevention and treatment of muscle wasting in patients with non-small-cell lung cancer. This article provides an introduction to enobosarm as a new therapeutic strategy for the prevention and treatment of cachexia. A review of the literature was performed using search terms 'cachexia', 'sarcopenia', 'SARM', 'enobosarm' and 'GTx-024' in September 2013 using multiple databases as well as online resources.
GTX Reference Vehicle Structural Verification Methods and Weight Summary

NASA Technical Reports Server (NTRS)

Hunter, J. E.; McCurdy, D. R.; Dunn, P. W.

2002-01-01

The design of a single-stage-to-orbit air-breathing propulsion system requires the simultaneous development of a reference launch vehicle in order to achieve the optimal mission performance. Accordingly, for the GTX study a 300-lb payload reference vehicle was preliminarily sized to a gross liftoff weight (GLOW) of 238,000 lb. A finite element model of the integrated vehicle/propulsion system was subjected to the trajectory environment and subsequently optimized for structural efficiency. This study involved the development of aerodynamic loads mapped to finite element models of the integrated system in order to assess vehicle margins of safety. Commercially available analysis codes were used in the process along with some internally developed spreadsheets and FORTRAN codes specific to the GTX geometry for mapping of thermal and pressure loads. A mass fraction of 0.20 for the integrated system dry weight has been the driver for a vehicle design consisting of state-of-the-art composite materials in order to meet the rigid weight requirements. This paper summarizes the methodology used for preliminary analyses and presents the current status of the weight optimization for the structural components of the integrated system.
GTX Reference Vehicle Structural Verification Methods and Weight Summary

NASA Technical Reports Server (NTRS)

Hunter, J. E.; McCurdy, D. R.; Dunn, P. W.

2002-01-01

The design of a single-stage-to-orbit air breathing propulsion system requires the simultaneous development of a reference launch vehicle in order to achieve the optimal mission performance. Accordingly, for the GTX study a 300-lb payload reference vehicle was preliminarily sized to a gross liftoff weight (GLOW) of 238,000 lb. A finite element model of the integrated vehicle/propulsion system was subjected to the trajectory environment and subsequently optimized for structural efficiency. This study involved the development of aerodynamic loads mapped to finite element models of the integrated system in order to assess vehicle margins of safety. Commercially available analysis codes were used in the process along with some internally developed spreadsheets and FORTRAN codes specific to the GTX geometry for mapping of thermal and pressure loads. A mass fraction of 0.20 for the integrated system dry weight has been the driver for a vehicle design consisting of state-of-the-art composite materials in order to meet the rigid weight requirements. This paper summarizes the methodology used for preliminary analyses and presents the current status of the weight optimization for the structural components of the integrated system.
GPU accelerated Monte-Carlo simulation of SEM images for metrology

NASA Astrophysics Data System (ADS)

Verduin, T.; Lokhorst, S. R.; Hagen, C. W.

2016-03-01

In this work we address the computation times of numerical studies in dimensional metrology. In particular, full Monte-Carlo simulation programs for scanning electron microscopy (SEM) image acquisition are known to be notoriously slow. Our quest in reducing the computation time of SEM image simulation has led us to investigate the use of graphics processing units (GPUs) for metrology. We have succeeded in creating a full Monte-Carlo simulation program for SEM images, which runs entirely on a GPU. The physical scattering models of this GPU simulator are identical to a previous CPU-based simulator, which includes the dielectric function model for inelastic scattering and also refinements for low-voltage SEM applications. As a case study for the performance, we considered the simulated exposure of a complex feature: an isolated silicon line with rough sidewalls located on a flat silicon substrate. The surface of the rough feature is decomposed into 408 012 triangles. We have used an exposure dose of 6 mC/cm², which corresponds to 6 553 600 primary electrons on average (Poisson distributed). We repeat the simulation for various primary electron energies, 300 eV, 500 eV, 800 eV, 1 keV, 3 keV and 5 keV. At first we run the simulation on a GeForce GTX480 from NVIDIA. The very same simulation is duplicated on our CPU-based program, for which we have used an Intel Xeon X5650. Apart from statistics in the simulation, no difference is found between the CPU and GPU simulated results. The GTX480 generates the images (depending on the primary electron energy) 350 to 425 times faster than a single-threaded Intel X5650 CPU. Although this is a tremendous speedup, we actually have not reached the maximum throughput because of the limited amount of available memory on the GTX480. Nevertheless, the speedup enables the fast acquisition of simulated SEM images for metrology. We now have the potential to investigate case studies in CD-SEM metrology, which otherwise would take unreasonable
Parallel hyperspectral image reconstruction using random projections

NASA Astrophysics Data System (ADS)

Sevilla, Jorge; Martín, Gabriel; Nascimento, José M. P.

2016-10-01

Spaceborne sensor systems are characterized by scarce onboard computing and storage resources and by communication links with reduced bandwidth. Random projection techniques have been demonstrated to be an effective and very lightweight way to reduce the number of measurements in hyperspectral data, thereby reducing the data to be transmitted to the Earth station. However, the reconstruction of the original data from the random projections may be computationally expensive. SpeCA is a blind hyperspectral reconstruction technique that exploits the fact that hyperspectral vectors often belong to a low dimensional subspace. SpeCA has shown promising results in the task of recovering hyperspectral data from a reduced number of random measurements. In this manuscript we focus on the implementation of the SpeCA algorithm for graphics processing units (GPU) using the compute unified device architecture (CUDA). Experimental results obtained with synthetic and real hyperspectral datasets on an NVIDIA GeForce GTX 980 GPU reveal that the use of GPUs can provide real-time reconstruction. The achieved speedup is up to 22 times when compared with the processing time of SpeCA running on one core of the Intel i7-4790K CPU (3.4 GHz), with 32 GB of memory.

Performance Evaluation of the NASA GTX RBCC Flowpath

NASA Technical Reports Server (NTRS)

Thomas, Scott R.; Palac, Donald T.; Trefny, Charles J.; Roche, Joseph M.

2001-01-01

The NASA Glenn Research Center serves as NASA's lead center for aeropropulsion. Several programs are underway to explore revolutionary airbreathing propulsion systems in response to the challenge of reducing the cost of space transportation. Concepts being investigated include rocket-based combined cycle (RBCC), pulse detonation wave, and turbine-based combined cycle (TBCC) engines. The GTX concept is a vertical-launch, horizontal-landing, single-stage-to-orbit (SSTO) vehicle utilizing RBCC engines. The propulsion pod has a nearly half-axisymmetric flowpath that incorporates a rocket and ram-scramjet. The engine system operates from lift-off up to above Mach 10, at which point the airbreathing engine flowpath is closed off, and the rocket alone powers the vehicle to orbit. The paper presents an overview of the research efforts supporting the development of this RBCC propulsion system. The experimental efforts of this program consist of a series of test rigs. Each rig is focused on development and optimization of the flowpath over a specific operating mode of the engine. These rigs collectively establish propulsion system performance over all modes of operation, thereby covering the entire speed range. Computational Fluid Dynamics (CFD) analysis is an important element of the GTX propulsion system development and validation. These efforts guide experiments and flowpath design, provide insight into experimental data, and extend results to conditions and scales not achievable in ground test facilities. Some examples of important CFD results are presented.
Graphics Processing Unit Acceleration of Gyrokinetic Turbulence Simulations

NASA Astrophysics Data System (ADS)

Hause, Benjamin; Parker, Scott

2012-10-01

We find a substantial increase in on-node performance using Graphics Processing Unit (GPU) acceleration in gyrokinetic delta-f particle-in-cell simulation. Optimization is performed on a two-dimensional slab gyrokinetic particle simulation using the Portland Group Fortran compiler with the GPU accelerator compiler directives. We have implemented the GPU acceleration on a Core i7 gaming PC with an NVIDIA GTX 580 GPU. We find comparable, or better, acceleration relative to the NERSC DIRAC cluster with the NVIDIA Tesla C2050 computing processor. The Tesla C2050 is about 2.6 times more expensive than the GTX 580 gaming GPU. Optimization strategies and comparisons between DIRAC and the gaming PC will be presented. We will also discuss progress on optimizing the comprehensive three-dimensional general geometry GEM code.
Real time mitigation of atmospheric turbulence in long distance imaging using the lucky region fusion algorithm with FPGA and GPU hardware acceleration

NASA Astrophysics Data System (ADS)

Jackson, Christopher Robert

"Lucky-region" fusion (LRF) is a synthetic imaging technique that has proven successful in enhancing the quality of images distorted by atmospheric turbulence. The LRF algorithm selects sharp regions of an image obtained from a series of short exposure frames, and fuses the sharp regions into a final, improved image. In previous research, the LRF algorithm had been implemented on a PC using the C programming language. However, the PC did not have sufficient sequential processing power to handle the real-time extraction, processing and reduction required when the LRF algorithm was applied to real-time video from fast, high-resolution image sensors. This thesis describes two hardware implementations of the LRF algorithm to achieve real-time image processing. The first was created with a VIRTEX-7 field programmable gate array (FPGA). The other was developed using the graphics processing unit (GPU) of an NVIDIA GeForce GTX 690 video card. The novelty in the FPGA approach is the creation of a "black box" LRF video processing system with a general camera link input, a user controller interface, and a camera link video output. We also describe a custom hardware simulation environment we have built to test the FPGA LRF implementation. The advantages of the GPU approach are significantly improved development time, integration of image stabilization into the system, and comparable atmospheric turbulence mitigation.
Computer simulations and real-time control of ELT AO systems using graphical processing units

NASA Astrophysics Data System (ADS)

Wang, Lianqi; Ellerbroek, Brent

2012-07-01

The adaptive optics (AO) simulations at the Thirty Meter Telescope (TMT) have been carried out using the efficient, C-based multi-threaded adaptive optics simulator (MAOS, http://github.com/lianqiw/maos). By porting time-critical parts of MAOS to graphical processing units (GPU) using NVIDIA CUDA technology, we achieved a 10-fold speedup for each GTX 580 GPU used compared to a modern quad-core CPU. Each time step of a full-scale end-to-end simulation for the TMT narrow field infrared AO system (NFIRAOS) takes only 0.11 second on a desktop with two GTX 580s. We also demonstrate that the TMT minimum variance reconstructor can be assembled in matrix vector multiply (MVM) format in 8 seconds with 8 GTX 580 GPUs, meeting the TMT requirement for updating the reconstructor. Analysis shows that it is also possible to apply the MVM using 8 GTX 580s within the required latency.
A Simple GPU-Accelerated Two-Dimensional MUSCL-Hancock Solver for Ideal Magnetohydrodynamics

NASA Technical Reports Server (NTRS)

Bard, Christopher; Dorelli, John C.

2013-01-01

We describe our experience using NVIDIA's CUDA (Compute Unified Device Architecture) C programming environment to implement a two-dimensional second-order MUSCL-Hancock ideal magnetohydrodynamics (MHD) solver on a GTX 480 Graphics Processing Unit (GPU). Taking a simple approach in which the MHD variables are stored exclusively in the global memory of the GTX 480 and accessed in a cache-friendly manner (without further optimizing memory access by, for example, staging data in the GPU's faster shared memory), we achieved a maximum speed-up of ≈126 for a 1024² grid relative to the sequential C code running on a single Intel Nehalem (2.8 GHz) core. This speedup is consistent with simple estimates based on the known floating point performance, memory throughput and parallel processing capacity of the GTX 480.
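The structure of a MUSCL-Hancock update can be illustrated on a much simpler problem than ideal MHD. The sketch below applies the scheme to one-dimensional linear advection, u_t + a u_x = 0 with a > 0, on a periodic grid; it is a teaching-sized assumption, not the authors' solver, and the function names and the choice of the minmod limiter are hypothetical. Each cell reconstructs a limited slope, advances its face value by half a time step, and then updates its average from upwind fluxes. Every cell reads only itself and its neighbours, which is the access pattern that maps naturally onto one GPU thread per cell.

```python
def minmod(a, b):
    """Slope limiter: zero at extrema, otherwise the smaller-magnitude slope."""
    if a * b <= 0.0:
        return 0.0
    return a if abs(a) < abs(b) else b

def muscl_hancock_step(u, courant):
    """Advance cell averages u by one time step on a periodic grid.

    Solves u_t + a u_x = 0 with a > 0; courant = a * dt / dx, in (0, 1].
    """
    n = len(u)
    # 1. Limited slope in every cell (u[i - 1] wraps around for i == 0).
    slopes = [minmod(u[i] - u[i - 1], u[(i + 1) % n] - u[i]) for i in range(n)]
    # 2. Right-face value after the half-step predictor. For linear advection
    #    the predictor shifts the face value by -0.5 * courant * slope.
    face = [u[i] + 0.5 * (1.0 - courant) * slopes[i] for i in range(n)]
    # 3. Upwind flux difference (a > 0, so each face takes its left state).
    return [u[i] - courant * (face[i] - face[i - 1]) for i in range(n)]
```

At a Courant number of exactly 1 the update reduces to shifting the profile by one cell, which gives a quick sanity check of the implementation.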
A simple GPU-accelerated two-dimensional MUSCL-Hancock solver for ideal magnetohydrodynamics

NASA Astrophysics Data System (ADS)

Bard, Christopher M.; Dorelli, John C.

2014-02-01

We describe our experience using NVIDIA's CUDA (Compute Unified Device Architecture) C programming environment to implement a two-dimensional second-order MUSCL-Hancock ideal magnetohydrodynamics (MHD) solver on a GTX 480 Graphics Processing Unit (GPU). Taking a simple approach in which the MHD variables are stored exclusively in the global memory of the GTX 480 and accessed in a cache-friendly manner (without further optimizing memory access by, for example, staging data in the GPU's faster shared memory), we achieved a maximum speed-up of ≈126 for a 1024² grid relative to the sequential C code running on a single Intel Nehalem (2.8 GHz) core. This speedup is consistent with simple estimates based on the known floating point performance, memory throughput and parallel processing capacity of the GTX 480.
Proton Testing of nVidia Jetson TX1

NASA Technical Reports Server (NTRS)

Wyrwas, Edward J.

2017-01-01

Single-Event Effects (SEE) testing was conducted on the nVidia Jetson TX1 System on Chip (SOC); herein referred to as device under test (DUT). Testing was conducted at Massachusetts General Hospital's (MGH) Francis H. Burr Proton Therapy Center on October 16th, 2016 using 200-MeV protons. This testing trip was intended to provide a baseline assessment of the radiation susceptibility of the DUT, as no previous testing had been conducted on this component.
An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU

NASA Astrophysics Data System (ADS)

Lyakh, Dmitry I.

2015-04-01

An efficient parallel tensor transpose algorithm is suggested for shared-memory computing units, namely, multicore CPU, Intel Xeon Phi, and NVidia GPU. The algorithm operates on dense tensors (multidimensional arrays) and is based on the optimization of cache utilization on x86 CPU and the use of shared memory on NVidia GPU. From the applied side, the ultimate goal is to minimize the overhead encountered in the transformation of tensor contractions into matrix multiplications in computer implementations of advanced methods of quantum many-body theory (e.g., in electronic structure theory and nuclear physics). Particular emphasis is placed on higher-dimensional tensors that typically appear in the so-called multireference correlated methods of electronic structure theory. Depending on tensor dimensionality, the presented optimized algorithms can achieve an order of magnitude speedup on x86 CPUs and 2-3 times speedup on NVidia Tesla K20X GPU with respect to the naïve scattering algorithm (no memory access optimization). The tensor transpose routines developed in this work have been incorporated into a general-purpose tensor algebra library (TAL-SH).
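To show what the baseline in that comparison looks like, here is a minimal sketch of a naive scattering transpose for a dense row-major tensor: every source element is read in storage order and written to its permuted position, with no attempt to make the writes contiguous. The function names and the row-major layout are assumptions for illustration; this is not the TAL-SH code and it omits the cache and shared-memory optimizations that the paper contributes.

```python
def row_major_strides(shape):
    """Strides (in elements) of a dense row-major tensor with the given shape."""
    result = [1] * len(shape)
    for axis in range(len(shape) - 2, -1, -1):
        result[axis] = result[axis + 1] * shape[axis + 1]
    return result

def naive_transpose(data, shape, perm):
    """Permute the axes of a flat row-major tensor by scattering.

    Output axis k corresponds to input axis perm[k].
    Returns (flat output data, output shape).
    """
    out_shape = [shape[p] for p in perm]
    out_strides = row_major_strides(out_shape)
    out = [None] * len(data)
    index = [0] * len(shape)  # multi-index of the current source element
    for value in data:
        # Reads are sequential; writes jump around in memory (the scatter).
        dest = sum(index[perm[k]] * out_strides[k] for k in range(len(perm)))
        out[dest] = value
        # Advance the source multi-index, last axis fastest.
        for axis in range(len(shape) - 1, -1, -1):
            index[axis] += 1
            if index[axis] < shape[axis]:
                break
            index[axis] = 0
    return out, out_shape
```

The strided writes in the inner statement are what hurt performance on real hardware, and the gap grows with tensor dimensionality, which is consistent with the emphasis on higher-dimensional tensors in the abstract.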
Advanced mathematical on-line analysis in nuclear experiments. Usage of parallel computing CUDA routines in standard root analysis

NASA Astrophysics Data System (ADS)

Grzeszczuk, A.; Kowalski, S.

2015-04-01

Compute Unified Device Architecture (CUDA) is a parallel computing platform developed by Nvidia to increase graphics speed by performing calculations in parallel. The success of this solution has opened General-Purpose Graphics Processing Unit (GPGPU) technology to applications not related to graphics. A GPGPU system can be applied as an effective tool for reducing huge amounts of data in pulse shape analysis measurements, by on-line recalculation or by very fast compression. The simplified structure of the CUDA system and the programming model, based on the example of an Nvidia GeForce GTX 580 card, are presented in our poster contribution, both in a stand-alone version and as a ROOT application.
NLSEmagic: Nonlinear Schrödinger equation multi-dimensional Matlab-based GPU-accelerated integrators using compact high-order schemes

NASA Astrophysics Data System (ADS)

Caplan, R. M.

2013-04-01

and both second- and fourth-order differencing in space. The integrators are written to run on NVIDIA GPUs and are interfaced with MATLAB including built-in visualization and analysis tools. Restrictions: The main restriction for the GPU integrators is the amount of RAM on the GPU as the code is currently only designed for running on a single GPU. Unusual features: Ability to visualize real-time simulations through the interaction of MATLAB and the compiled GPU integrators. Additional comments: Setup guide and Installation guide provided. Program has a dedicated web site at www.nlsemagic.com. Running time: A three-dimensional run with a grid dimension of 87×87×203 for 3360 time steps (100 non-dimensional time units) takes about one and a half minutes on a GeForce GTX 580 GPU card.
Graphics Processing Unit Acceleration of Gyrokinetic Turbulence Simulations

NASA Astrophysics Data System (ADS)

Hause, Benjamin; Parker, Scott; Chen, Yang

2013-10-01

We find a substantial increase in on-node performance using Graphics Processing Unit (GPU) acceleration in gyrokinetic delta-f particle-in-cell simulation. Optimization is performed on a two-dimensional slab gyrokinetic particle simulation using the Portland Group Fortran compiler with the OpenACC compiler directives and CUDA Fortran. Mixed implementation of both OpenACC and CUDA is demonstrated. CUDA is required for optimizing the particle deposition algorithm. We have implemented the GPU acceleration on a third-generation Core i7 gaming PC with two NVIDIA GTX 680 GPUs. We find comparable, or better, acceleration relative to the NERSC DIRAC cluster with the NVIDIA Tesla C2050 computing processor. The Tesla C2050 is about 2.6 times more expensive than the GTX 580 gaming GPU. We also see enormous speedups (10 or more) on the Titan supercomputer at Oak Ridge with Kepler K20 GPUs. Results show speed-ups comparable to or better than those of OpenMP models utilizing multiple cores. The use of hybrid OpenACC, CUDA Fortran, and MPI models across many nodes will also be discussed. Optimization strategies will be presented. We will discuss progress on optimizing the comprehensive three-dimensional general geometry GEM code.
Fast analysis of molecular dynamics trajectories with graphics processing units-Radial distribution function histogramming

DOE Office of Scientific and Technical Information (OSTI.GOV)

Levine, Benjamin G., E-mail: ben.levine@temple.ed; Stone, John E., E-mail: johns@ks.uiuc.ed; Kohlmeyer, Axel, E-mail: akohlmey@temple.ed

2011-05-01

The calculation of radial distribution functions (RDFs) from molecular dynamics trajectory data is a common and computationally expensive analysis task. The rate limiting step in the calculation of the RDF is building a histogram of the distance between atom pairs in each trajectory frame. Here we present an implementation of this histogramming scheme for multiple graphics processing units (GPUs). The algorithm features a tiling scheme to maximize the reuse of data at the fastest levels of the GPU's memory hierarchy and dynamic load balancing to allow high performance on heterogeneous configurations of GPUs. Several versions of the RDF algorithm are presented, utilizing the specific hardware features found on different generations of GPUs. We take advantage of larger shared memory and atomic memory operations available on state-of-the-art GPUs to accelerate the code significantly. The use of atomic memory operations allows the fast, limited-capacity on-chip memory to be used much more efficiently, resulting in a fivefold increase in performance compared to the version of the algorithm without atomic operations. The ultimate version of the algorithm running in parallel on four NVIDIA GeForce GTX 480 (Fermi) GPUs was found to be 92 times faster than a multithreaded implementation running on an Intel Xeon 5550 CPU. On this multi-GPU hardware, the RDF between two selections of 1,000,000 atoms each can be calculated in 26.9 s per frame. The multi-GPU RDF algorithms described here are implemented in VMD, a widely used and freely available software package for molecular dynamics visualization and analysis.
Fast Analysis of Molecular Dynamics Trajectories with Graphics Processing Units—Radial Distribution Function Histogramming

PubMed Central

Stone, John E.; Kohlmeyer, Axel

2011-01-01

The calculation of radial distribution functions (RDFs) from molecular dynamics trajectory data is a common and computationally expensive analysis task. The rate limiting step in the calculation of the RDF is building a histogram of the distance between atom pairs in each trajectory frame. Here we present an implementation of this histogramming scheme for multiple graphics processing units (GPUs). The algorithm features a tiling scheme to maximize the reuse of data at the fastest levels of the GPU’s memory hierarchy and dynamic load balancing to allow high performance on heterogeneous configurations of GPUs. Several versions of the RDF algorithm are presented, utilizing the specific hardware features found on different generations of GPUs. We take advantage of larger shared memory and atomic memory operations available on state-of-the-art GPUs to accelerate the code significantly. The use of atomic memory operations allows the fast, limited-capacity on-chip memory to be used much more efficiently, resulting in a fivefold increase in performance compared to the version of the algorithm without atomic operations. The ultimate version of the algorithm running in parallel on four NVIDIA GeForce GTX 480 (Fermi) GPUs was found to be 92 times faster than a multithreaded implementation running on an Intel Xeon 5550 CPU. On this multi-GPU hardware, the RDF between two selections of 1,000,000 atoms each can be calculated in 26.9 seconds per frame. The multi-GPU RDF algorithms described here are implemented in VMD, a widely used and freely available software package for molecular dynamics visualization and analysis. PMID:21547007
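The histogramming step described in this abstract can be illustrated with a minimal serial sketch. It is not the VMD implementation: the function names `rdf_histogram` and `normalize_rdf`, the cubic periodic box, and the minimum-image convention are assumptions made for illustration, and the tiling, atomic operations, and load balancing from the abstract are omitted.

```python
import math

def rdf_histogram(sel_a, sel_b, box, r_max, n_bins):
    """Count pair distances between two selections in a cubic periodic box.

    Hypothetical helper: assumes a cubic box of edge `box` and r_max <= box / 2
    so that the minimum-image convention is valid.
    """
    counts = [0] * n_bins
    bin_width = r_max / n_bins
    for ax, ay, az in sel_a:
        for bx, by, bz in sel_b:
            # Minimum-image displacement along each axis.
            dx = ax - bx
            dy = ay - by
            dz = az - bz
            dx -= box * round(dx / box)
            dy -= box * round(dy / box)
            dz -= box * round(dz / box)
            r = math.sqrt(dx * dx + dy * dy + dz * dz)
            if r < r_max:
                # Clamp guards against rounding at the upper edge.
                counts[min(int(r / bin_width), n_bins - 1)] += 1
    return counts

def normalize_rdf(counts, n_a, n_b, box, r_max):
    """Divide raw counts by the ideal-gas expectation for each spherical shell."""
    n_bins = len(counts)
    bin_width = r_max / n_bins
    pair_density = n_a * n_b / box ** 3
    g = []
    for i, c in enumerate(counts):
        r_lo = i * bin_width
        r_hi = (i + 1) * bin_width
        shell_volume = 4.0 / 3.0 * math.pi * (r_hi ** 3 - r_lo ** 3)
        g.append(c / (pair_density * shell_volume))
    return g

if __name__ == "__main__":
    a = [(0.0, 0.0, 0.0)]
    b = [(1.0, 0.0, 0.0), (9.5, 0.0, 0.0)]
    print(rdf_histogram(a, b, box=10.0, r_max=2.0, n_bins=4))
```

The double loop over atom pairs is the quadratic cost the abstract refers to. On a GPU each thread would handle a subset of pairs and update the shared histogram bins, which is where atomic memory operations become relevant.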
An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU

DOE PAGES

Lyakh, Dmitry I.

2015-01-05

An efficient parallel tensor transpose algorithm is suggested for shared-memory computing units, namely, multicore CPU, Intel Xeon Phi, and NVidia GPU. The algorithm operates on dense tensors (multidimensional arrays) and is based on the optimization of cache utilization on x86 CPU and the use of shared memory on NVidia GPU. From the applied side, the ultimate goal is to minimize the overhead encountered in the transformation of tensor contractions into matrix multiplications in computer implementations of advanced methods of quantum many-body theory (e.g., in electronic structure theory and nuclear physics). A particular accent is made on higher-dimensional tensors that typically appear in the so-called multireference correlated methods of electronic structure theory. Depending on tensor dimensionality, the presented optimized algorithms can achieve an order of magnitude speedup on x86 CPUs and 2-3 times speedup on NVidia Tesla K20X GPU with respect to the naïve scattering algorithm (no memory access optimization). Furthermore, the tensor transpose routines developed in this work have been incorporated into a general-purpose tensor algebra library (TAL-SH).
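The contrast between a naïve scattering transpose and a cache-aware one can be sketched in two dimensions. This is only an illustration of the blocking idea, not the TAL-SH algorithm, which targets higher-dimensional tensors. The names `transpose_naive` and `transpose_tiled` and the tile size are assumptions, and pure Python will not show the speedups reported above.

```python
def transpose_naive(src, rows, cols):
    """Scatter each element of a row-major rows x cols matrix to its transposed slot."""
    dst = [0] * (rows * cols)
    for i in range(rows):
        for j in range(cols):
            # Reads are contiguous, writes stride by `rows`: poor cache reuse.
            dst[j * rows + i] = src[i * cols + j]
    return dst

def transpose_tiled(src, rows, cols, tile=32):
    """Same result, but visit the matrix in tile x tile blocks.

    Within one block both the reads and the writes stay inside a small
    memory footprint, which is the idea behind cache blocking on a CPU
    and shared-memory staging on a GPU.
    """
    dst = [0] * (rows * cols)
    for i0 in range(0, rows, tile):
        for j0 in range(0, cols, tile):
            for i in range(i0, min(i0 + tile, rows)):
                for j in range(j0, min(j0 + tile, cols)):
                    dst[j * rows + i] = src[i * cols + j]
    return dst

if __name__ == "__main__":
    rows, cols = 5, 7
    m = list(range(rows * cols))
    print(transpose_tiled(m, rows, cols, tile=3) == transpose_naive(m, rows, cols))
```

Both functions produce identical output. Only the traversal order differs, which is why the optimization affects memory traffic rather than arithmetic.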
CUDASW++ 3.0: accelerating Smith-Waterman protein database search by coupling CPU and GPU SIMD instructions.

PubMed

Liu, Yongchao; Wirawan, Adrianto; Schmidt, Bertil

2013-04-04

The maximal sensitivity for local alignments makes the Smith-Waterman algorithm a popular choice for protein sequence database search based on pairwise alignment. However, the algorithm is compute-intensive due to a quadratic time complexity. Corresponding runtimes are further compounded by the rapid growth of sequence databases. We present CUDASW++ 3.0, a fast Smith-Waterman protein database search algorithm, which couples CPU and GPU SIMD instructions and carries out concurrent CPU and GPU computations. For the CPU computation, this algorithm employs SSE-based vector execution units as accelerators. For the GPU computation, we have investigated for the first time a GPU SIMD parallelization, which employs CUDA PTX SIMD video instructions to gain more data parallelism beyond the SIMT execution model. Moreover, sequence alignment workloads are automatically distributed over CPUs and GPUs based on their respective compute capabilities. Evaluation on the Swiss-Prot database shows that CUDASW++ 3.0 gains a performance improvement over CUDASW++ 2.0 up to 2.9 and 3.2, with a maximum performance of 119.0 and 185.6 GCUPS, on a single-GPU GeForce GTX 680 and a dual-GPU GeForce GTX 690 graphics card, respectively. In addition, our algorithm has demonstrated significant speedups over other top-performing tools: SWIPE and BLAST+. CUDASW++ 3.0 is written in CUDA C++ and PTX assembly languages, targeting GPUs based on the Kepler architecture. This algorithm obtains significant speedups over its predecessor: CUDASW++ 2.0, by benefiting from the use of CPU and GPU SIMD instructions as well as the concurrent execution on CPUs and GPUs. The source code and the simulated data are available at http://cudasw.sourceforge.net.
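The quadratic-time recurrence behind Smith-Waterman, and the GCUPS (giga cell updates per second) metric quoted above, can be sketched as follows. This is a simplified scalar version with a linear gap penalty and a match/mismatch score; CUDASW++ itself is not reproduced here, and the scoring parameters and function names are assumptions chosen for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of strings a and b (linear gap penalty).

    Fills the dynamic-programming matrix row by row, keeping only two rows.
    Each inner-loop iteration is one "cell update".
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(0,                # start a new local alignment
                         prev[j - 1] + s,  # align a[i-1] with b[j-1]
                         prev[j] + gap,    # gap in b
                         cur[j - 1] + gap) # gap in a
            if cur[j] > best:
                best = cur[j]
        prev = cur
    return best

def gcups(query_len, db_residues, seconds):
    """Giga cell updates per second for one query against a whole database."""
    return query_len * db_residues / seconds / 1e9

if __name__ == "__main__":
    print(smith_waterman_score("GATTACA", "ATTAC"))
    print(gcups(1000, 10**9, 10.0))
```

The cells along one anti-diagonal of the matrix do not depend on each other, and separate database sequences are fully independent, which is what SIMD and GPU implementations exploit.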
Scaling Deep Learning Workloads: NVIDIA DGX-1/Pascal and Intel Knights Landing

DOE Office of Scientific and Technical Information (OSTI.GOV)

Gawande, Nitin A.; Landwehr, Joshua B.; Daily, Jeffrey A.

Deep Learning (DL) algorithms have become ubiquitous in data analytics. As a result, major computing vendors --- including NVIDIA, Intel, AMD and IBM --- have architectural road-maps influenced by DL workloads. Furthermore, several vendors have recently advertised new computing products as accelerating DL workloads. Unfortunately, it is difficult for data scientists to quantify the potential of these different products. This paper provides a performance and power analysis of important DL workloads on two major parallel architectures: NVIDIA DGX-1 (eight Pascal P100 GPUs interconnected with NVLink) and Intel Knights Landing (KNL) CPUs interconnected with Intel Omni-Path. Our evaluation consists of a cross section of convolutional neural net workloads: CifarNet, CaffeNet, AlexNet and GoogleNet topologies using the Cifar10 and ImageNet datasets. The workloads are vendor optimized for each architecture. GPUs provide the highest overall raw performance. Our analysis indicates that although GPUs provide the highest overall performance, the gap can close for some convolutional networks; and KNL can be competitive when considering performance/watt. Furthermore, NVLink is critical to GPU scaling.
Dataflow-Based Implementation of Layered Sensing Applications on High-Performance Embedded Processors

DTIC Science & Technology

2013-03-01

time (milliseconds) GFlops Comparison to GPU peak performance (%) Cascade Gaussian Filtering 13 45.19 6.3 Difference of Gaussian 0.512 152...values for the GPU-targeted actor implementations in terms of Giga Floating Point Operations Per Second ( GFLOPS ). Our GFLOPS calculation for an actor...kernels. The results for GFLOPS are provided in Table . The actors were implemented on an NVIDIA GTX260 GPU, which provides 715 GFLOPS as peak
A Thermal Management Systems Model for the NASA GTX RBCC Concept

NASA Technical Reports Server (NTRS)

Traci, Richard M.; Farr, John L., Jr.; Laganelli, Tony; Walker, James (Technical Monitor)

2002-01-01

The Vehicle Integrated Thermal Management Analysis Code (VITMAC) was further developed to aid the analysis, design, and optimization of propellant and thermal management concepts for advanced propulsion systems. The computational tool is based on engineering level principles and models. A graphical user interface (GUI) provides a simple and straightforward method to assess and evaluate multiple concepts before undertaking more rigorous analysis of candidate systems. The tool incorporates the Chemical Equilibrium and Applications (CEA) program and the RJPA code to permit heat transfer analysis of both rocket and air breathing propulsion systems. Key parts of the code have been validated with experimental data. The tool was specifically tailored to analyze rocket-based combined-cycle (RBCC) propulsion systems being considered for space transportation applications. This report describes the computational tool and its development and verification for NASA GTX RBCC propulsion system applications.
Accelerating Monte Carlo simulations with an NVIDIA ® graphics processor

NASA Astrophysics Data System (ADS)

Martinsen, Paul; Blaschke, Johannes; Künnemeyer, Rainer; Jordan, Robert

2009-10-01

Modern graphics cards, commonly used in desktop computers, have evolved beyond a simple interface between processor and display to incorporate sophisticated calculation engines that can be applied to general purpose computing. The Monte Carlo algorithm for modelling photon transport in turbid media has been implemented on an NVIDIA® 8800 GT graphics card using the CUDA toolkit. The Monte Carlo method relies on following the trajectory of millions of photons through the sample, often taking hours or days to complete. The graphics-processor implementation, processing roughly 110 million scattering events per second, was found to run more than 70 times faster than a similar, single-threaded implementation on a 2.67 GHz desktop computer. Program summary. Program title: Phoogle-C/Phoogle-G Catalogue identifier: AEEB_v1_0 Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEEB_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 51 264 No. of bytes in distributed program, including test data, etc.: 2 238 805 Distribution format: tar.gz Programming language: C++ Computer: Designed for Intel PCs. Phoogle-G requires a NVIDIA graphics card with support for CUDA 1.1 Operating system: Windows XP Has the code been vectorised or parallelized?: Phoogle-G is written for SIMD architectures RAM: 1 GB Classification: 21.1 External routines: Charles Karney Random number library. Microsoft Foundation Class library. NVIDIA CUDA library [1]. Nature of problem: The Monte Carlo technique is an effective algorithm for exploring the propagation of light in turbid media. However, accurate results require tracing the path of many photons within the media. The independence of photons naturally lends the Monte Carlo technique to implementation on parallel architectures. Generally, parallel computing
Application of graphics processing units to search pipelines for gravitational waves from coalescing binaries of compact objects

NASA Astrophysics Data System (ADS)

Chung, Shin Kee; Wen, Linqing; Blair, David; Cannon, Kipp; Datta, Amitava

2010-07-01

We report a novel application of a graphics processing unit (GPU) for the purpose of accelerating the search pipelines for gravitational waves from coalescing binaries of compact objects. A speed-up of 16-fold in total has been achieved with an NVIDIA GeForce 8800 Ultra GPU card compared with one core of a 2.5 GHz Intel Q9300 central processing unit (CPU). We show that substantial improvements are possible and discuss the reduction in CPU count required for the detection of inspiral sources afforded by the use of GPUs.

Software beamforming: comparison between a phased array and synthetic transmit aperture.

PubMed

Li, Yen-Feng; Li, Pai-Chi

2011-04-01

The data-transfer and computation requirements are compared between software-based beamforming using a phased array (PA) and a synthetic transmit aperture (STA). The advantages of a software-based architecture are reduced system complexity and lower hardware cost. Although this architecture can be implemented using commercial CPUs or GPUs, the high computation and data-transfer requirements limit its real-time beamforming performance. In particular, transferring the raw RF data from the front-end subsystem to the software back end remains challenging with current state-of-the-art electronics technologies, which offsets the cost advantage of the software back end. This study investigated the tradeoff between the data-transfer and computation requirements. Two beamforming methods based on a PA and STA, respectively, were used: the former requires a higher data transfer rate and the latter requires more memory operations. The beamformers were implemented on an NVIDIA GeForce GTX 260 GPU and an Intel Core i7 920 CPU. The frame rate of PA beamforming was 42 fps with a 128-element array transducer, with 2048 samples per firing and 189 beams per image (with a 95 MB/frame data-transfer requirement). The frame rate of STA beamforming was 40 fps with 16 firings per image (with an 8 MB/frame data-transfer requirement). Both approaches achieved real-time beamforming performance but each had its own bottleneck. The required data-transfer speed was considerably reduced in STA beamforming, but STA required more memory operations, which limited the overall computation time. The advantages of the GPU approach over the CPU approach were clearly demonstrated.
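The per-frame data-transfer figures quoted in this abstract can be approximately reproduced with simple arithmetic. The sketch below assumes 16-bit (2-byte) samples, one firing per beam in the PA case, and all 128 channels recorded on every firing. None of these assumptions is stated in the abstract, and the function name `frame_bytes` is hypothetical.

```python
MIB = 1024 * 1024

def frame_bytes(elements, samples_per_firing, firings, bytes_per_sample=2):
    """Raw RF data volume for one image frame (bytes_per_sample is an assumption)."""
    return elements * samples_per_firing * firings * bytes_per_sample

# Phased array: 189 beams per image, assumed to mean 189 firings.
pa_bytes = frame_bytes(128, 2048, 189)
# Synthetic transmit aperture: 16 firings per image.
sta_bytes = frame_bytes(128, 2048, 16)

if __name__ == "__main__":
    print("PA  MiB/frame:", pa_bytes / MIB)
    print("STA MiB/frame:", sta_bytes / MIB)
    # Sustained transfer rate at the reported frame rates.
    print("PA  MiB/s at 42 fps:", pa_bytes / MIB * 42)
    print("STA MiB/s at 40 fps:", sta_bytes / MIB * 40)
```

Under these assumptions the PA case comes to 94.5 MiB per frame and the STA case to exactly 8 MiB per frame, consistent with the 95 MB and 8 MB figures in the abstract.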
Real-time simulation of a spiking neural network model of the basal ganglia circuitry using general purpose computing on graphics processing units.

PubMed

Igarashi, Jun; Shouno, Osamu; Fukai, Tomoki; Tsujino, Hiroshi

2011-11-01

Real-time simulation of a biologically realistic spiking neural network is necessary for evaluation of its capacity to interact with real environments. However, the real-time simulation of such a neural network is difficult due to its high computational costs that arise from two factors: (1) vast network size and (2) the complicated dynamics of biologically realistic neurons. In order to address these problems, mainly the latter, we chose to use general purpose computing on graphics processing units (GPGPUs) for simulation of such a neural network, taking advantage of the powerful computational capability of a graphics processing unit (GPU). As a target for real-time simulation, we used a model of the basal ganglia that has been developed according to electrophysiological and anatomical knowledge. The model consists of heterogeneous populations of 370 spiking model neurons, including computationally heavy conductance-based models, connected by 11,002 synapses. Simulation of the model has not yet been performed in real-time using a general computing server. By parallelization of the model on the NVIDIA GeForce GTX 280 GPU in data-parallel and task-parallel fashion, faster-than-real-time simulation was robustly realized with only one-third of the GPU's total computational resources. Furthermore, we used the GPU's full computational resources to perform faster-than-real-time simulation of three instances of the basal ganglia model; these instances consisted of 1100 neurons and 33,006 synapses and were synchronized at each calculation step. Finally, we developed software for simultaneous visualization of faster-than-real-time simulation output. These results suggest the potential power of GPGPU techniques in real-time simulation of realistic neural networks. Copyright © 2011 Elsevier Ltd. All rights reserved.
Scaling deep learning workloads: NVIDIA DGX-1/Pascal and Intel Knights Landing

DOE Office of Scientific and Technical Information (OSTI.GOV)

Gawande, Nitin A.; Landwehr, Joshua B.; Daily, Jeffrey A.

Deep Learning (DL) algorithms have become ubiquitous in data analytics. As a result, major computing vendors --- including NVIDIA, Intel, AMD, and IBM --- have architectural road-maps influenced by DL workloads. Furthermore, several vendors have recently advertised new computing products as accelerating large DL workloads. Unfortunately, it is difficult for data scientists to quantify the potential of these different products. This paper provides a performance and power analysis of important DL workloads on two major parallel architectures: NVIDIA DGX-1 (eight Pascal P100 GPUs interconnected with NVLink) and Intel Knights Landing (KNL) CPUs interconnected with Intel Omni-Path or Cray Aries. Our evaluation consists of a cross section of convolutional neural net workloads: CifarNet, AlexNet, GoogLeNet, and ResNet50 topologies using the Cifar10 and ImageNet datasets. The workloads are vendor-optimized for each architecture. Our analysis indicates that although GPUs provide the highest overall performance, the gap can close for some convolutional networks; and the KNL can be competitive in performance/watt. We find that NVLink facilitates scaling efficiency on GPUs. However, its importance is heavily dependent on neural network architecture. Furthermore, for weak-scaling --- sometimes encouraged by restricted GPU memory --- NVLink is less important.
Fast generation of computer-generated hologram by graphics processing unit

NASA Astrophysics Data System (ADS)

Matsuda, Sho; Fujii, Tomohiko; Yamaguchi, Takeshi; Yoshikawa, Hiroshi

2009-02-01

A cylindrical hologram is well known to be viewable from 360 degrees. Such a hologram depends on high pixel resolution; therefore, a Computer-Generated Cylindrical Hologram (CGCH) requires a huge amount of calculation. In our previous research, we used a look-up table method for fast calculation on an Intel Pentium 4 at 2.8 GHz. It took 480 hours to calculate a high-resolution CGCH (504,000 x 63,000 pixels, with an average of 27,000 object points). To improve the quality of the CGCH reconstructed image, the fringe pattern requires higher spatial frequency and resolution. Therefore, to increase the calculation speed, we have to change the calculation method. In this paper, to reduce the calculation time of a CGCH (912,000 x 108,000 pixels), we employ a Graphics Processing Unit (GPU). It took 4,406 hours to calculate the high-resolution CGCH on a Xeon at 3.4 GHz. Since a GPU has many streaming processors and a parallel processing structure, it works as a high-performance parallel processor. In addition, a GPU gives maximum performance on two-dimensional data and streaming data. Recently, GPUs can be utilized for general-purpose computation (GPGPU). For example, NVIDIA's GeForce 7 series became a programmable processor with the Cg programming language. The subsequent GeForce 8 series has CUDA, a software development kit made by NVIDIA. The theoretical calculation ability of the GPU is announced as 500 GFLOPS. From the experimental results, we achieved a calculation 47 times faster than our previous work, which used a CPU. Therefore, the CGCH can be generated in 95 hours, and the total time to calculate and print the CGCH is 110 hours.
Scaling Deep Learning workloads: NVIDIA DGX-1/Pascal and Intel Knights Landing

DOE PAGES

Gawande, Nitin A.; Daily, Jeff A.; Siegel, Charles; ...

2018-05-05

Deep Learning (DL) algorithms have become ubiquitous in data analytics. As a result, major computing vendors, including NVIDIA, Intel, AMD, and IBM, have architectural road maps influenced by DL workloads. Furthermore, several vendors have recently advertised new computing products as accelerating large DL workloads. Unfortunately, it is difficult for data scientists to quantify the potential of these different products. Here, this article provides a performance and power analysis of important DL workloads on two major parallel architectures: NVIDIA DGX-1 (eight Pascal P100 GPUs interconnected with NVLink) and Intel Knights Landing (KNL) CPUs interconnected with Intel Omni-Path or Cray Aries. Our evaluation consists of a cross section of convolutional neural net workloads: CifarNet, AlexNet, GoogLeNet, and ResNet50 topologies using the Cifar10 and ImageNet datasets. The workloads are vendor-optimized for each architecture. We use sequentially equivalent implementations to maintain iso-accuracy between parallel and sequential DL models. Our analysis indicates that although GPUs provide the highest overall performance, the gap can close for some convolutional networks; and the KNL can be competitive in performance/watt. We find that NVLink facilitates scaling efficiency on GPUs. However, its importance is heavily dependent on neural network architecture. Furthermore, for weak-scaling, sometimes encouraged by restricted GPU memory, NVLink is less important.
Scaling Deep Learning workloads: NVIDIA DGX-1/Pascal and Intel Knights Landing

DOE Office of Scientific and Technical Information (OSTI.GOV)

Gawande, Nitin A.; Daily, Jeff A.; Siegel, Charles

Deep Learning (DL) algorithms have become ubiquitous in data analytics. As a result, major computing vendors, including NVIDIA, Intel, AMD, and IBM, have architectural road maps influenced by DL workloads. Furthermore, several vendors have recently advertised new computing products as accelerating large DL workloads. Unfortunately, it is difficult for data scientists to quantify the potential of these different products. Here, this article provides a performance and power analysis of important DL workloads on two major parallel architectures: NVIDIA DGX-1 (eight Pascal P100 GPUs interconnected with NVLink) and Intel Knights Landing (KNL) CPUs interconnected with Intel Omni-Path or Cray Aries. Our evaluation consists of a cross section of convolutional neural net workloads: CifarNet, AlexNet, GoogLeNet, and ResNet50 topologies using the Cifar10 and ImageNet datasets. The workloads are vendor-optimized for each architecture. We use sequentially equivalent implementations to maintain iso-accuracy between parallel and sequential DL models. Our analysis indicates that although GPUs provide the highest overall performance, the gap can close for some convolutional networks; and the KNL can be competitive in performance/watt. We find that NVLink facilitates scaling efficiency on GPUs. However, its importance is heavily dependent on neural network architecture. Furthermore, for weak-scaling, sometimes encouraged by restricted GPU memory, NVLink is less important.
GeauxDock: Accelerating Structure-Based Virtual Screening with Heterogeneous Computing

PubMed Central

Fang, Ye; Ding, Yun; Feinstein, Wei P.; Koppelman, David M.; Moreno, Juana; Jarrell, Mark; Ramanujam, J.; Brylinski, Michal

2016-01-01

Computational modeling of drug binding to proteins is an integral component of direct drug design. Particularly, structure-based virtual screening is often used to perform large-scale modeling of putative associations between small organic molecules and their pharmacologically relevant protein targets. Because of a large number of drug candidates to be evaluated, an accurate and fast docking engine is a critical element of virtual screening. Consequently, highly optimized docking codes are of paramount importance for the effectiveness of virtual screening methods. In this communication, we describe the implementation, tuning and performance characteristics of GeauxDock, a recently developed molecular docking program. GeauxDock is built upon the Monte Carlo algorithm and features a novel scoring function combining physics-based energy terms with statistical and knowledge-based potentials. Developed specifically for heterogeneous computing platforms, the current version of GeauxDock can be deployed on modern, multi-core Central Processing Units (CPUs) as well as massively parallel accelerators, Intel Xeon Phi and NVIDIA Graphics Processing Unit (GPU). First, we carried out a thorough performance tuning of the high-level framework and the docking kernel to produce a fast serial code, which was then ported to shared-memory multi-core CPUs yielding a near-ideal scaling. Further, using Xeon Phi gives 1.9× performance improvement over a dual 10-core Xeon CPU, whereas the best GPU accelerator, GeForce GTX 980, achieves a speedup as high as 3.5×. On that account, GeauxDock can take advantage of modern heterogeneous architectures to considerably accelerate structure-based virtual screening applications. GeauxDock is open-sourced and publicly available at www.brylinski.org/geauxdock and https://figshare.com/articles/geauxdock_tar_gz/3205249. PMID:27420300
Parallel algorithm for solving Kepler’s equation on Graphics Processing Units: Application to analysis of Doppler exoplanet searches

NASA Astrophysics Data System (ADS)

Ford, Eric B.

2009-05-01

We present the results of a highly parallel Kepler equation solver using the Graphics Processing Unit (GPU) on a commercial nVidia GeForce 280GTX and the "Compute Unified Device Architecture" (CUDA) programming environment. We apply this to evaluate a goodness-of-fit statistic (e.g., χ²) for Doppler observations of stars potentially harboring multiple planetary companions (assuming negligible planet-planet interactions). Given the high dimensionality of the model parameter space (at least five dimensions per planet), a global search is extremely computationally demanding. We expect that the underlying Kepler solver and model evaluator will be combined with a wide variety of more sophisticated algorithms to provide efficient global search, parameter estimation, model comparison, and adaptive experimental design for radial velocity and/or astrometric planet searches. We tested multiple implementations using single precision, double precision, pairs of single precision, and mixed precision arithmetic. We find that the vast majority of computations can be performed using single precision arithmetic, with selective use of compensated summation for increased precision. However, standard single precision is not adequate for calculating the mean anomaly from the time of observation and orbital period when evaluating the goodness-of-fit for real planetary systems and observational data sets. Using all double precision, our GPU code outperforms a similar code using a modern CPU by a factor of over 60. Using mixed precision, our GPU code provides a speed-up factor of over 600, when evaluating n_sys > 1024 model planetary systems each containing n_pl = 4 planets and assuming n_obs = 256 observations of each system. We conclude that modern GPUs also offer a powerful tool for repeatedly evaluating Kepler's equation and a goodness-of-fit statistic for orbital models when presented with a large parameter space.
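For readers unfamiliar with the underlying computation, Kepler's equation E - e sin E = M can be solved for the eccentric anomaly E with Newton-Raphson iteration. The sketch below is a generic double-precision scalar solver, not the GPU code described in the abstract. The function names, the starting guess, and the tolerance are assumptions made for illustration.

```python
import math

def mean_anomaly(t, period, t_peri=0.0):
    """Mean anomaly M = 2*pi*(t - t_peri)/period, reduced to [0, 2*pi)."""
    return (2.0 * math.pi * (t - t_peri) / period) % (2.0 * math.pi)

def solve_kepler(M, ecc, tol=1e-12, max_iter=50):
    """Solve E - ecc*sin(E) = M for E by Newton-Raphson (0 <= ecc < 1)."""
    M = M % (2.0 * math.pi)
    # Common heuristic: start at M for modest eccentricity, at pi otherwise.
    E = M if ecc < 0.8 else math.pi
    for _ in range(max_iter):
        f = E - ecc * math.sin(E) - M
        f_prime = 1.0 - ecc * math.cos(E)
        dE = f / f_prime
        E -= dE
        if abs(dE) < tol:
            break
    return E

if __name__ == "__main__":
    M = mean_anomaly(t=12.5, period=100.0)
    E = solve_kepler(M, ecc=0.3)
    # Residual of Kepler's equation at the returned solution.
    print(E, E - 0.3 * math.sin(E) - M)
```

Each (system, planet, observation) combination requires an independent solve of this kind, which is why the problem maps naturally onto thousands of GPU threads.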
SU-C-BRC-07: Parametrized GPU Accelerated Electron Monte Carlo Second Check

DOE Office of Scientific and Technical Information (OSTI.GOV)

Haywood, J

Purpose: I am presenting a parameterized 3D GPU accelerated electron Monte Carlo second check program. Method: I wrote the 3D grid dose calculation algorithm in CUDA and utilized an NVIDIA GeForce GTX 780 Ti to run all of the calculations. The electron path beyond the distal end of the cone is governed by four parameters: the amplitude of scattering (AMP), the mean and width of a Gaussian energy distribution (E and α), and the percentage of photons. In my code, I adjusted all parameters until the calculated PDD and profile fit the measured 10×10 open beam data within 1%/1mm. I then wrote a user interface for reading the DICOM treatment plan and images in Python. In order to verify the algorithm, I calculated 3D dose distributions on a variety of phantoms and geometries, and compared them with the Eclipse eMC calculations. I also calculated several patient specific dose distributions, including a nose and an ear. Finally, I compared my algorithm’s computation times to Eclipse’s. Results: The calculated MU for all of the investigated geometries agree with the TPS within the TG-114 action level of 5%. The MU for the nose was < 0.5 % different while the MU for the ear at 105 SSD was ∼2 %. Calculation times for a 12MeV 10×10 open beam ranged from 1 second for a 2.5 mm grid resolution with ∼15 million particles to 33 seconds on a 1 mm grid with ∼460 million particles. Eclipse calculation runtimes distributed over 10 FAS workers were 9 seconds to 15 minutes respectively. Conclusion: The GPU accelerated second check allows quick MU verification while accounting for patient specific geometry and heterogeneity.
GeauxDock: Accelerating Structure-Based Virtual Screening with Heterogeneous Computing.

PubMed

Fang, Ye; Ding, Yun; Feinstein, Wei P; Koppelman, David M; Moreno, Juana; Jarrell, Mark; Ramanujam, J; Brylinski, Michal

2016-01-01

Computational modeling of drug binding to proteins is an integral component of direct drug design. Particularly, structure-based virtual screening is often used to perform large-scale modeling of putative associations between small organic molecules and their pharmacologically relevant protein targets. Because of a large number of drug candidates to be evaluated, an accurate and fast docking engine is a critical element of virtual screening. Consequently, highly optimized docking codes are of paramount importance for the effectiveness of virtual screening methods. In this communication, we describe the implementation, tuning and performance characteristics of GeauxDock, a recently developed molecular docking program. GeauxDock is built upon the Monte Carlo algorithm and features a novel scoring function combining physics-based energy terms with statistical and knowledge-based potentials. Developed specifically for heterogeneous computing platforms, the current version of GeauxDock can be deployed on modern, multi-core Central Processing Units (CPUs) as well as massively parallel accelerators, Intel Xeon Phi and NVIDIA Graphics Processing Unit (GPU). First, we carried out a thorough performance tuning of the high-level framework and the docking kernel to produce a fast serial code, which was then ported to shared-memory multi-core CPUs yielding a near-ideal scaling. Further, using Xeon Phi gives 1.9× performance improvement over a dual 10-core Xeon CPU, whereas the best GPU accelerator, GeForce GTX 980, achieves a speedup as high as 3.5×. On that account, GeauxDock can take advantage of modern heterogeneous architectures to considerably accelerate structure-based virtual screening applications. GeauxDock is open-sourced and publicly available at www.brylinski.org/geauxdock and https://figshare.com/articles/geauxdock_tar_gz/3205249.
Optimization of Selected Remote Sensing Algorithms for Embedded NVIDIA Kepler GPU Architecture

NASA Technical Reports Server (NTRS)

Riha, Lubomir; Le Moigne, Jacqueline; El-Ghazawi, Tarek

2015-01-01

This paper evaluates the potential of the embedded Graphics Processing Unit in Nvidia's Tegra K1 for onboard processing. The performance is compared to a general purpose multi-core CPU and a full-fledged GPU accelerator. This study uses two algorithms: Wavelet Spectral Dimension Reduction of Hyperspectral Imagery and the Automated Cloud-Cover Assessment (ACCA) Algorithm. Tegra K1 achieved 51% for the ACCA algorithm and 20% for the dimension reduction algorithm, as compared to the performance of the high-end 8-core server Intel Xeon CPU with 13.5 times higher power consumption.
Supercomputing with toys: harnessing the power of NVIDIA 8800GTX and playstation 3 for bioinformatics problem.

PubMed

Wilson, Justin; Dai, Manhong; Jakupovic, Elvis; Watson, Stanley; Meng, Fan

2007-01-01

Modern video cards and game consoles typically have much better performance to price ratios than that of general purpose CPUs. The parallel processing capabilities of game hardware are well-suited for high throughput biomedical data analysis. Our initial results suggest that game hardware is a cost-effective platform for some computationally demanding bioinformatics problems.
A performance model for GPUs with caches

DOE PAGES

Dao, Thanh Tuan; Kim, Jungwon; Seo, Sangmin; ...

2014-06-24

To exploit the abundant computational power of the world's fastest supercomputers, an even workload distribution to the typically heterogeneous compute devices is necessary. While relatively accurate performance models exist for conventional CPUs, accurate performance estimation models for modern GPUs do not exist. This paper presents two accurate models for modern GPUs: a sampling-based linear model, and a model based on machine-learning (ML) techniques which improves the accuracy of the linear model and is applicable to modern GPUs with and without caches. We first construct the sampling-based linear model to predict the runtime of an arbitrary OpenCL kernel. Based on an analysis of NVIDIA GPUs' scheduling policies we determine the earliest sampling points that allow an accurate estimation. The linear model cannot capture well the significant effects that memory coalescing or caching as implemented in modern GPUs have on performance. We therefore propose a model based on ML techniques that takes several compiler-generated statistics about the kernel as well as the GPU's hardware performance counters as additional inputs to obtain a more accurate runtime performance estimation for modern GPUs. We demonstrate the effectiveness and broad applicability of the model by applying it to three different NVIDIA GPU architectures and one AMD GPU architecture. On an extensive set of OpenCL benchmarks, on average, the proposed model estimates the runtime performance with less than 7 percent error for a second-generation GTX 280 with no on-chip caches and less than 5 percent for the Fermi-based GTX 580 with hardware caches. On the Kepler-based GTX 680, the linear model has an error of less than 10 percent. On an AMD GPU architecture, Radeon HD 6970, the model estimates with an error rate of 8 percent. As a result, the proposed technique outperforms existing models by a factor of 5 to 6 in terms of accuracy.
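The idea of a sampling-based linear model can be illustrated with a small Python sketch. This is an assumption about the general shape of such a model, not the authors' method: it supposes that the time to complete the first few work-groups has been measured, fits a least-squares line through those samples, and extrapolates to the full launch. The names `fit_line` and `predict_runtime` and the sample data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predict_runtime(samples, total_groups):
    """Extrapolate kernel runtime from (work_groups_done, elapsed_time) samples."""
    xs = [groups for groups, _ in samples]
    ys = [elapsed for _, elapsed in samples]
    slope, intercept = fit_line(xs, ys)
    return intercept + slope * total_groups

# Hypothetical samples: 0.5 ms launch overhead plus 0.1 ms per work-group.
samples = [(1, 0.6), (2, 0.7), (4, 0.9)]
estimate = predict_runtime(samples, 100)
```

A straight line of this kind has no term for coalescing or cache behaviour, which is the limitation the abstract gives as the motivation for the ML-based model.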
3D gaze tracking system for NVidia 3D Vision®.

PubMed

Wibirama, Sunu; Hamamoto, Kazuhiko

2013-01-01

Inappropriate parallax settings in stereoscopic content generally cause visual fatigue and visual discomfort. To optimize three-dimensional (3D) effects in stereoscopic content while taking health issues into account, understanding how a user gazes in 3D in virtual space is currently an important research topic. In this paper, we report the development of a novel 3D gaze tracking system for Nvidia 3D Vision® to be used with desktop stereoscopic displays. We suggest an optimized geometric method to accurately measure the position of a virtual 3D object. Our experimental results show that the proposed system achieved better accuracy than the conventional geometric method, with average errors of 0.83 cm, 0.87 cm, and 1.06 cm in the X, Y, and Z dimensions, respectively.
Design Evolution and Performance Characterization of the GTX Air-Breathing Launch Vehicle Inlet

NASA Technical Reports Server (NTRS)

DeBonis, J. R.; Steffen, C. J., Jr.; Rice, T.; Trefny, C. J.

2002-01-01

The design and analysis of a second version of the inlet for the GTX rocket-based combined-cycle launch vehicle is discussed. The previous design did not achieve its predicted performance levels due to excessive turning of low-momentum corner flows and local over-contraction due to asymmetric end-walls. This design attempts to remove these problems by reducing the spike half-angle from 12 degrees to 10 degrees and by implementing true plane of symmetry end-walls. Axisymmetric Reynolds-Averaged Navier-Stokes simulations using both perfect gas and real gas, finite rate chemistry, assumptions were performed to aid in the design process and to create a comprehensive database of inlet performance. The inlet design, which operates over the entire air-breathing Mach number range from 0 to 12, and the performance database are presented. The performance database, for use in cycle analysis, includes predictions of mass capture, pressure recovery, throat Mach number, drag force, and heat load, for the entire Mach range. Results of the computations are compared with experimental data to validate the performance database.
High performance in silico virtual drug screening on many-core processors.

PubMed

McIntosh-Smith, Simon; Price, James; Sessions, Richard B; Ibarra, Amaurys A

2015-05-01

Drug screening is an important part of the drug development pipeline for the pharmaceutical industry. Traditional, lab-based methods are increasingly being augmented with computational methods, ranging from simple molecular similarity searches through more complex pharmacophore matching to more computationally intensive approaches, such as molecular docking. The latter simulates the binding of drug molecules to their targets, typically protein molecules. In this work, we describe BUDE, the Bristol University Docking Engine, which has been ported to the OpenCL industry standard parallel programming language in order to exploit the performance of modern many-core processors. Our highly optimized OpenCL implementation of BUDE sustains 1.43 TFLOP/s on a single Nvidia GTX 680 GPU, or 46% of peak performance. BUDE also exploits OpenCL to deliver effective performance portability across a broad spectrum of different computer architectures from different vendors, including GPUs from Nvidia and AMD, Intel's Xeon Phi and multi-core CPUs with SIMD instruction sets.
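The two figures quoted above (1.43 TFLOP/s sustained, 46% of peak) fix the peak throughput the authors measured against. A short Python check of that arithmetic follows; the helper name `implied_peak` is illustrative and the result is only as precise as the rounded percentages in the abstract.

```python
def implied_peak(sustained, fraction_of_peak):
    """Peak throughput implied by a sustained rate and its share of peak."""
    return sustained / fraction_of_peak

# Values taken from the abstract: 1.43 TFLOP/s at 46% of peak.
peak_tflops = implied_peak(1.43, 0.46)
print(f"implied peak: {peak_tflops:.2f} TFLOP/s")
```

The result, a little over 3.1 TFLOP/s, is in line with the single-precision peak usually quoted for a GTX 680.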
Very high frame rate volumetric integration of depth images on mobile devices.

PubMed

Kähler, Olaf; Adrian Prisacariu, Victor; Yuheng Ren, Carl; Sun, Xin; Torr, Philip; Murray, David

2015-11-01

Volumetric methods provide efficient, flexible and simple ways of integrating multiple depth images into a full 3D model. They provide dense and photorealistic 3D reconstructions, and parallelised implementations on GPUs achieve real-time performance on modern graphics hardware. However, running such methods on mobile devices, providing users with freedom of movement and instantaneous reconstruction feedback, remains challenging. In this paper we present a range of modifications to existing volumetric integration methods based on voxel block hashing, considerably improving their performance and making them applicable to tablet computer applications. We present (i) optimisations for the basic data structure, and its allocation and integration; (ii) a highly optimised raycasting pipeline; and (iii) extensions to the camera tracker to incorporate IMU data. In total, our system thus achieves frame rates up to 47 Hz on a Nvidia Shield Tablet and 910 Hz on a Nvidia GTX Titan X GPU, or even beyond 1.1 kHz without visualisation.
High performance in silico virtual drug screening on many-core processors

PubMed Central

Price, James; Sessions, Richard B; Ibarra, Amaurys A

2015-01-01

Drug screening is an important part of the drug development pipeline for the pharmaceutical industry. Traditional, lab-based methods are increasingly being augmented with computational methods, ranging from simple molecular similarity searches through more complex pharmacophore matching to more computationally intensive approaches, such as molecular docking. The latter simulates the binding of drug molecules to their targets, typically protein molecules. In this work, we describe BUDE, the Bristol University Docking Engine, which has been ported to the OpenCL industry standard parallel programming language in order to exploit the performance of modern many-core processors. Our highly optimized OpenCL implementation of BUDE sustains 1.43 TFLOP/s on a single Nvidia GTX 680 GPU, or 46% of peak performance. BUDE also exploits OpenCL to deliver effective performance portability across a broad spectrum of different computer architectures from different vendors, including GPUs from Nvidia and AMD, Intel’s Xeon Phi and multi-core CPUs with SIMD instruction sets. PMID:25972727
Accelerating Pseudo-Random Number Generator for MCNP on GPU

NASA Astrophysics Data System (ADS)

Gong, Chunye; Liu, Jie; Chi, Lihua; Hu, Qingfeng; Deng, Li; Gong, Zhenghu

2010-09-01

Pseudo-random number generators (PRNG) are intensively used in many stochastic algorithms in particle simulations, artificial neural networks and other scientific computation. The PRNG in the Monte Carlo N-Particle Transport Code (MCNP) must have a long period, high quality and flexible jump-ahead, and must be fast enough. In this paper, we implement such a PRNG for MCNP on NVIDIA's GTX200 Graphics Processor Units (GPU) using the CUDA programming model. Results show that speedups of 3.80 to 8.10 times are achieved compared with 4- to 6-core CPUs, and more than 679.18 million double precision random numbers can be generated per second on the GPU.
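The "flexible jump" requirement is what lets each GPU thread start at its own offset in a single random sequence. The Python sketch below shows the standard logarithmic-time skip-ahead for a linear congruential generator x -> (g*x + c) mod m. It is an illustration, not the authors' code; the parameters g = 5^19, c = 0, m = 2^48 are assumed here as the traditional MCNP generator, and the function names are hypothetical.

```python
def lcg_next(x, g, c, m):
    """One step of the linear congruential generator."""
    return (g * x + c) % m

def lcg_skip(seed, k, g, c, m):
    """State k steps ahead of seed, in O(log k) multiplications.

    Repeatedly squares the affine map x -> h*x + f and composes it into
    the accumulated map x -> G*x + C for every set bit of k.
    """
    G, C = 1, 0
    h, f = g % m, c % m
    while k > 0:
        if k & 1:
            G = (G * h) % m
            C = (C * h + f) % m
        f = (f * (h + 1)) % m  # must use the old h, so update f first
        h = (h * h) % m
        k >>= 1
    return (G * seed + C) % m

# Assumed MCNP-style parameters (illustrative).
G_MULT, C_ADD, MODULUS = 5 ** 19, 0, 2 ** 48

def thread_seed(base_seed, thread_id, stride):
    """Starting state for one thread: thread_id * stride steps past base_seed."""
    return lcg_skip(base_seed, thread_id * stride, G_MULT, C_ADD, MODULUS)
```

With this, thread i can draw from positions i*stride, i*stride + 1, ... of the sequence without any thread having to generate the numbers that precede its block.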
Affordable Flight Demonstration of the GTX Air-Breathing SSTO Vehicle Concept

NASA Technical Reports Server (NTRS)

Krivanek, Thomas M.; Roche, Joseph M.; Riehl, John P.; Kosareo, Daniel N.

2002-01-01

The rocket based combined cycle (RBCC) powered single-stage-to-orbit (SSTO) reusable launch vehicle has the potential to significantly reduce the total cost per pound for orbital payload missions. To validate overall system performance, a flight demonstration must be performed. This paper presents an overview of the first phase of a flight demonstration program for the GTX SSTO vehicle concept. Phase 1 will validate the propulsion performance of the vehicle configuration over the supersonic and hypersonic airbreathing portions of the trajectory. The focus and goal of Phase 1 is to demonstrate the integration and performance of the propulsion system flowpath with the vehicle aerodynamics over the air-breathing trajectory. This demonstrator vehicle will have dual mode ramjet/scramjets, which include the inlet, combustor, and nozzle with geometrically scaled aerodynamic surface outer mold lines (OML) defining the forebody, boundary layer diverter, wings, and tail. The primary objective of this study is to demonstrate propulsion system performance and operability including the ram to scram transition, as well as to validate vehicle aerodynamics and propulsion airframe integration. To minimize overall risk and development cost the effort will incorporate proven materials, use existing turbomachinery in the propellant delivery systems, launch from an existing unmanned remote launch facility, and use basic vehicle recovery techniques to minimize control and landing requirements. A second phase would demonstrate propulsion performance across all critical portions of a space launch trajectory (lift off through transition to all-rocket) integrated with flight-like vehicle systems.

Affordable Flight Demonstration of the GTX Air-Breathing SSTO Vehicle Concept

NASA Technical Reports Server (NTRS)

Krivanek, Thomas M.; Roche, Joseph M.; Riehl, John P.; Kosareo, Daniel N.

2003-01-01

The rocket based combined cycle (RBCC) powered single-stage-to-orbit (SSTO) reusable launch vehicle has the potential to significantly reduce the total cost per pound for orbital payload missions. To validate overall system performance, a flight demonstration must be performed. This paper presents an overview of the first phase of a flight demonstration program for the GTX SSTO vehicle concept. Phase 1 will validate the propulsion performance of the vehicle configuration over the supersonic and hypersonic air-breathing portions of the trajectory. The focus and goal of Phase 1 is to demonstrate the integration and performance of the propulsion system flowpath with the vehicle aerodynamics over the air-breathing trajectory. This demonstrator vehicle will have dual mode ramjet/scramjets, which include the inlet, combustor, and nozzle with geometrically scaled aerodynamic surface outer mold lines (OML) defining the forebody, boundary layer diverter, wings, and tail. The primary objective of this study is to demonstrate propulsion system performance and operability including the ram to scram transition, as well as to validate vehicle aerodynamics and propulsion airframe integration. To minimize overall risk and development cost the effort will incorporate proven materials, use existing turbomachinery in the propellant delivery systems, launch from an existing unmanned remote launch facility, and use basic vehicle recovery techniques to minimize control and landing requirements. A second phase would demonstrate propulsion performance across all critical portions of a space launch trajectory (lift off through transition to all-rocket) integrated with flight-like vehicle systems.
Evaluation of the Intel Xeon Phi 7120 and NVIDIA K80 as accelerators for two-dimensional panel codes

PubMed Central

2017-01-01

To optimize the geometry of airfoils for a specific application is an important engineering problem. In this context genetic algorithms have enjoyed some success as they are able to explore the search space without getting stuck in local optima. However, these algorithms require the computation of aerodynamic properties for a significant number of airfoil geometries. Consequently, for low-speed aerodynamics, panel methods are most often used as the inner solver. In this paper we evaluate the performance of such an optimization algorithm on modern accelerators (more specifically, the Intel Xeon Phi 7120 and the NVIDIA K80). For that purpose, we have implemented an optimized version of the algorithm on the CPU and Xeon Phi (based on OpenMP, vectorization, and the Intel MKL library) and on the GPU (based on CUDA and the MAGMA library). We present timing results for all codes and discuss the similarities and differences between the three implementations. Overall, we observe a speedup of approximately 2.5 for adding an Intel Xeon Phi 7120 to a dual socket workstation and a speedup between 3.4 and 3.8 for adding a NVIDIA K80 to a dual socket workstation. PMID:28582389
Evaluation of the Intel Xeon Phi 7120 and NVIDIA K80 as accelerators for two-dimensional panel codes.

PubMed

Einkemmer, Lukas

2017-01-01

To optimize the geometry of airfoils for a specific application is an important engineering problem. In this context genetic algorithms have enjoyed some success as they are able to explore the search space without getting stuck in local optima. However, these algorithms require the computation of aerodynamic properties for a significant number of airfoil geometries. Consequently, for low-speed aerodynamics, panel methods are most often used as the inner solver. In this paper we evaluate the performance of such an optimization algorithm on modern accelerators (more specifically, the Intel Xeon Phi 7120 and the NVIDIA K80). For that purpose, we have implemented an optimized version of the algorithm on the CPU and Xeon Phi (based on OpenMP, vectorization, and the Intel MKL library) and on the GPU (based on CUDA and the MAGMA library). We present timing results for all codes and discuss the similarities and differences between the three implementations. Overall, we observe a speedup of approximately 2.5 for adding an Intel Xeon Phi 7120 to a dual socket workstation and a speedup between 3.4 and 3.8 for adding a NVIDIA K80 to a dual socket workstation.
A GPU-accelerated and Monte Carlo-based intensity modulated proton therapy optimization system.

PubMed

Ma, Jiasen; Beltran, Chris; Seum Wan Chan Tseung, Hok; Herman, Michael G

2014-12-01

Conventional spot scanning intensity modulated proton therapy (IMPT) treatment planning systems (TPSs) optimize proton spot weights based on analytical dose calculations. These analytical dose calculations have been shown to have severe limitations in heterogeneous materials. Monte Carlo (MC) methods do not have these limitations; however, MC-based systems have been of limited clinical use due to the large number of beam spots in IMPT and the extremely long calculation time of traditional MC techniques. In this work, the authors present a clinically applicable IMPT TPS that utilizes a very fast MC calculation. An in-house graphics processing unit (GPU)-based MC dose calculation engine was employed to generate the dose influence map for each proton spot. With the MC generated influence map, a modified least-squares optimization method was used to achieve the desired dose volume histograms (DVHs). The intrinsic CT image resolution was adopted for voxelization in simulation and optimization to preserve spatial resolution. The optimizations were computed on a multi-GPU framework to mitigate the memory limitation issues for the large dose influence maps that resulted from maintaining the intrinsic CT resolution. The effects of tail cutoff and starting condition were studied and minimized in this work. For relatively large and complex three-field head and neck cases, i.e., >100,000 spots with a target volume of ∼1000 cm³ and multiple surrounding critical structures, the optimization together with the initial MC dose influence map calculation was done in a clinically viable time frame (less than 30 min) on a GPU cluster consisting of 24 Nvidia GeForce GTX Titan cards. The in-house MC TPS plans were comparable to commercial TPS plans based on DVH comparisons. An MC-based treatment planning system was developed. The treatment planning can be performed in a clinically viable time frame on a hardware system costing around 45,000 dollars. The fast calculation and
Interactive collision detection for deformable models using streaming AABBs.

PubMed

Zhang, Xinyu; Kim, Young J

2007-01-01

We present an interactive and accurate collision detection algorithm for deformable, polygonal objects based on the streaming computational model. Our algorithm can detect all possible pairwise primitive-level intersections between two severely deforming models at highly interactive rates. In our streaming computational model, we consider a set of axis aligned bounding boxes (AABBs) that bound each of the given deformable objects as an input stream and perform massively-parallel pairwise overlap tests on the incoming streams. As a result, we are able to prevent performance stalls in the streaming pipeline that can be caused by the expensive indexing mechanism required by bounding volume hierarchy-based streaming algorithms. At runtime, as the underlying models deform over time, we employ a novel streaming algorithm to update the geometric changes in the AABB streams. Moreover, in order to get only the computed result (i.e., collision results between AABBs) without reading back the entire output streams, we propose a streaming en/decoding strategy that can be performed in a hierarchical fashion. After determining overlapped AABBs, we perform primitive-level (e.g., triangle) intersection checking on a serial computational model such as CPUs. We implemented the entire pipeline of our algorithm using off-the-shelf graphics processors (GPUs), such as nVIDIA GeForce 7800 GTX, for streaming computations, and Intel Dual Core 3.4G processors for serial computations. We benchmarked our algorithm with different models of varying complexities, ranging from 15K up to 50K triangles, under various deformation motions, and the timings were obtained as 30 to 100 FPS depending on the complexity of models and their relative configurations. Finally, we made comparisons with a well-known GPU-based collision detection algorithm, CULLIDE [4] and observed about three times performance improvement over the earlier approach. We also made comparisons with a SW-based AABB
An efficient mixed-precision, hybrid CPU-GPU implementation of a nonlinearly implicit one-dimensional particle-in-cell algorithm

DOE Office of Scientific and Technical Information (OSTI.GOV)

Chen, Guangye; Chacon, Luis; Barnes, Daniel C

2012-01-01

Recently, a fully implicit, energy- and charge-conserving particle-in-cell method has been developed for multi-scale, full-f kinetic simulations [G. Chen, et al., J. Comput. Phys. 230, 18 (2011)]. The method employs a Jacobian-free Newton-Krylov (JFNK) solver and is capable of using very large timesteps without loss of numerical stability or accuracy. A fundamental feature of the method is the segregation of particle orbit integrations from the field solver, while remaining fully self-consistent. This provides great flexibility, and dramatically improves the solver efficiency by reducing the degrees of freedom of the associated nonlinear system. However, it requires a particle push per nonlinear residual evaluation, which makes the particle push the most time-consuming operation in the algorithm. This paper describes a very efficient mixed-precision, hybrid CPU-GPU implementation of the implicit PIC algorithm. The JFNK solver is kept on the CPU (in double precision), while the inherent data parallelism of the particle mover is exploited by implementing it in single-precision on a graphics processing unit (GPU) using CUDA. Performance-oriented optimizations, with the aid of an analytical performance model, the roofline model, are employed. Despite being highly dynamic, the adaptive, charge-conserving particle mover algorithm achieves up to 300-400 GOp/s (including single-precision floating-point, integer, and logic operations) on a Nvidia GeForce GTX580, corresponding to 20-25% absolute GPU efficiency (against the peak theoretical performance) and 50-70% intrinsic efficiency (against the algorithm's maximum operational throughput, which neglects all latencies). This is about 200-300 times faster than an equivalent serial CPU implementation. When the single-precision GPU particle mover is combined with a double-precision CPU JFNK field solver, overall performance gains of about 100 vs. the double-precision CPU-only serial version are obtained, with no apparent loss of
Fast quantum Monte Carlo on a GPU

NASA Astrophysics Data System (ADS)

Lutsyshyn, Y.

2015-02-01

We present a scheme for the parallelization of quantum Monte Carlo method on graphical processing units, focusing on variational Monte Carlo simulation of bosonic systems. We use asynchronous execution schemes with shared memory persistence, and obtain an excellent utilization of the accelerator. The CUDA code is provided along with a package that simulates liquid helium-4. The program was benchmarked on several models of Nvidia GPU, including Fermi GTX560 and M2090, and the Kepler architecture K20 GPU. Special optimization was developed for the Kepler cards, including placement of data structures in the register space of the Kepler GPUs. Kepler-specific optimization is discussed.
Memory transfer optimization for a lattice Boltzmann solver on Kepler architecture nVidia GPUs

NASA Astrophysics Data System (ADS)

Mawson, Mark J.; Revell, Alistair J.

2014-10-01

The Lattice Boltzmann method (LBM) for solving fluid flow is naturally well suited to an efficient implementation for massively parallel computing, due to the prevalence of local operations in the algorithm. This paper presents and analyses the performance of a 3D lattice Boltzmann solver, optimized for third generation nVidia GPU hardware, also known as 'Kepler'. We provide a review of previous optimization strategies and analyse data read/write times for different memory types. In LBM, the time propagation step (known as streaming) involves shifting data to adjacent locations and is central to parallel performance; here we examine three approaches which make use of different hardware options. Two of these make use of 'performance enhancing' features of the GPU: shared memory and the new shuffle instruction found in Kepler based GPUs. These are compared to a standard transfer of data which relies instead on optimized storage to increase coalesced access. It is shown that the simpler approach is the most efficient, since the need for large numbers of registers per thread in LBM limits the block size and thus reduces the efficiency of these special features. Detailed results are obtained for a D3Q19 LBM solver, which is benchmarked on nVidia K5000M and K20C GPUs. In the latter case the use of a read-only data cache is explored, and peak performance of over 1036 Million Lattice Updates Per Second (MLUPS) is achieved. The appearance of a periodic bottleneck in the solver performance is also reported, believed to be hardware related; spikes in iteration-time occur with a frequency of around 11 Hz for both GPUs, independent of the size of the problem.
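The streaming step described above can be sketched on the CPU in a few lines of Python. This is an illustrative reduction, not the authors' solver: it uses the smaller two-dimensional D2Q9 lattice instead of D3Q19, assumes periodic boundaries, and stores the populations as nested lists indexed f[q][x][y]. The function name `stream` is hypothetical.

```python
# D2Q9 lattice velocities: rest, four axis-aligned, four diagonal.
VELOCITIES = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1),
              (1, 1), (-1, 1), (-1, -1), (1, -1)]

def stream(f, nx, ny):
    """Move each population f[q][x][y] one cell along velocity q (periodic)."""
    out = [[[0.0] * ny for _ in range(nx)] for _ in VELOCITIES]
    for q, (cx, cy) in enumerate(VELOCITIES):
        for x in range(nx):
            for y in range(ny):
                out[q][(x + cx) % nx][(y + cy) % ny] = f[q][x][y]
    return out

# Usage: a single unit of mass travelling in the +x direction.
nx, ny = 3, 3
f = [[[0.0] * ny for _ in range(nx)] for _ in VELOCITIES]
f[1][0][0] = 1.0
f = stream(f, nx, ny)
```

Every write goes to a neighbouring cell's slot, so on a GPU the memory layout decides whether these accesses coalesce; that is the trade-off the three approaches in the abstract address.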
Optimizing Approximate Weighted Matching on Nvidia Kepler K40

DOE Office of Scientific and Technical Information (OSTI.GOV)

Naim, Md; Manne, Fredrik; Halappanavar, Mahantesh

Matching is a fundamental graph problem with numerous applications in science and engineering. While algorithms for computing optimal matchings are difficult to parallelize, approximation algorithms on the other hand generally compute high quality solutions and are amenable to parallelization. In this paper, we present efficient implementations of the current best algorithm for half-approximate weighted matching, the Suitor algorithm, on Nvidia Kepler K-40 platform. We develop four variants of the algorithm that exploit hardware features to address key challenges for a GPU implementation. We also experiment with different combinations of work assigned to a warp. Using an exhaustive set of 269 inputs, we demonstrate that the new implementation outperforms the previous best GPU algorithm by 10 to 100× for over 100 instances, and from 100 to 1000× for 15 instances. We also demonstrate up to 20× speedup relative to 2 threads, and up to 5× relative to 16 threads on Intel Xeon platform with 16 cores for the same algorithm. The new algorithms and implementations provided in this paper will have a direct impact on several applications that repeatedly use matching as a key compute kernel. Further, algorithm designs and insights provided in this paper will benefit other researchers implementing graph algorithms on modern GPU architectures.
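For orientation, here is a minimal sequential sketch of the Suitor idea in Python. It is a simplified reading of the algorithm, not the GPU variants described in the paper: each vertex proposes to its heaviest neighbour that does not already hold a better offer, and a displaced suitor immediately proposes again. The sketch assumes positive, distinct edge weights; the function name and the example graph are hypothetical.

```python
def suitor_matching(n, edges):
    """Half-approximate weighted matching via proposals (sequential sketch).

    n     -- number of vertices, labelled 0..n-1
    edges -- iterable of (u, v, weight) with positive, distinct weights
    """
    adj = [[] for _ in range(n)]
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))

    suitor = [None] * n  # suitor[v]: vertex currently proposing to v
    ws = [0.0] * n       # ws[v]: weight of that proposal

    for start in range(n):
        current = start
        while current is not None:
            partner, heaviest = None, 0.0
            for v, w in adj[current]:
                # Only neighbours whose current offer is worse are eligible.
                if w > heaviest and w > ws[v]:
                    partner, heaviest = v, w
            if partner is None:
                break
            displaced = suitor[partner]
            suitor[partner] = current
            ws[partner] = heaviest
            current = displaced  # a displaced vertex must propose again

    # Keep only mutual proposals; each pair is reported once.
    return sorted((u, v) for u, v in enumerate(suitor)
                  if v is not None and u < v and suitor[v] == u)

# Usage on a small hypothetical graph.
example_edges = [(0, 1, 4.0), (1, 2, 5.0), (2, 3, 3.0), (0, 3, 1.0), (3, 4, 2.0)]
matching = suitor_matching(5, example_edges)
```

On this example the result coincides with the matching a greedy heaviest-edge-first pass would pick, without ever sorting the edge list.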
Automatic detection and classification of obstacles with applications in autonomous mobile robots

NASA Astrophysics Data System (ADS)

Ponomaryov, Volodymyr I.; Rosas-Miranda, Dario I.

2016-04-01

A hardware implementation of automatic detection and classification of objects that can represent an obstacle for an autonomous mobile robot, using stereo vision algorithms, is presented. We propose and evaluate a new method to detect and classify objects for a mobile robot in outdoor conditions. This method is divided into two parts: the first is the object detection step, based on the distance from the objects to the camera and a BLOB analysis. The second part is the classification step, which is based on visual primitives and an SVM classifier. The proposed method runs on a GPU in order to reduce the processing time. This is done with the help of hardware based on multi-core processors and a GPU platform, using an NVIDIA® GeForce® GT640 graphics card and Matlab on a PC with Windows 10.
Real-time dose computation: GPU-accelerated source modeling and superposition/convolution

DOE Office of Scientific and Technical Information (OSTI.GOV)

Jacques, Robert; Wong, John; Taylor, Russell

Purpose: To accelerate dose calculation to interactive rates using highly parallel graphics processing units (GPUs). Methods: The authors have extended their prior work in GPU-accelerated superposition/convolution with a modern dual-source model and have enhanced performance. The primary source algorithm supports both focused leaf ends and asymmetric rounded leaf ends. The extra-focal algorithm uses a discretized, isotropic area source and models multileaf collimator leaf height effects. The spectral and attenuation effects of static beam modifiers were integrated into each source's spectral function. The authors introduce the concepts of arc superposition and delta superposition. Arc superposition utilizes separate angular sampling for the total energy released per unit mass (TERMA) and superposition computations to increase accuracy and performance. Delta superposition allows single beamlet changes to be computed efficiently. The authors extended their concept of multi-resolution superposition to include kernel tilting. Multi-resolution superposition approximates solid angle ray-tracing, improving performance and scalability with a minor loss in accuracy. Superposition/convolution was implemented using the inverse cumulative-cumulative kernel and exact radiological path ray-tracing. The accuracy analyses were performed using multiple kernel ray samplings, both with and without kernel tilting and multi-resolution superposition. Results: Source model performance was <9 ms (data dependent) for a high resolution (400²) field using an NVIDIA (Santa Clara, CA) GeForce GTX 280. Computation of the physically correct multispectral TERMA attenuation was improved by a material centric approach, which increased performance by over 80%. Superposition performance was improved by ~24% to 0.058 and 0.94 s for 64³ and 128³ water phantoms; a speed-up of 101-144x over the highly optimized Pinnacle³ (Philips, Madison, WI) implementation
Robust 3D-2D image registration: application to spine interventions and vertebral labeling in the presence of anatomical deformation

NASA Astrophysics Data System (ADS)

Otake, Yoshito; Wang, Adam S.; Webster Stayman, J.; Uneri, Ali; Kleinszig, Gerhard; Vogt, Sebastian; Khanna, A. Jay; Gokaslan, Ziya L.; Siewerdsen, Jeffrey H.

2013-12-01

We present a framework for robustly estimating registration between a 3D volume image and a 2D projection image and evaluate its precision and robustness in spine interventions for vertebral localization in the presence of anatomical deformation. The framework employs a normalized gradient information similarity metric and multi-start covariance matrix adaptation evolution strategy optimization with local-restarts, which provided improved robustness against deformation and content mismatch. The parallelized implementation allowed orders-of-magnitude acceleration in computation time and improved the robustness of registration via multi-start global optimization. Experiments involved a cadaver specimen and two CT datasets (supine and prone) and 36 C-arm fluoroscopy images acquired with the specimen in four positions (supine, prone, supine with lordosis, prone with kyphosis), three regions (thoracic, abdominal, and lumbar), and three levels of geometric magnification (1.7, 2.0, 2.4). Registration accuracy was evaluated in terms of projection distance error (PDE) between the estimated and true target points in the projection image, including 14 400 random trials (200 trials on the 72 registration scenarios) with initialization error up to ±200 mm and ±10°. The resulting median PDE was better than 0.1 mm in all cases, depending somewhat on the resolution of input CT and fluoroscopy images. The cadaver experiments illustrated the tradeoff between robustness and computation time, yielding a success rate of 99.993% in vertebral labeling (with ‘success’ defined as PDE <5 mm) using 1 718 664 ± 96 582 function evaluations computed in 54.0 ± 3.5 s on a mid-range GPU (nVidia, GeForce GTX690). Parameters yielding a faster search (e.g., fewer multi-starts) reduced robustness under conditions of large deformation and poor initialization (99.535% success for the same data registered in 13.1 s), but given good initialization (e.g., ±5 mm, assuming a robust initial
Fast GPU-based Monte Carlo code for SPECT/CT reconstructions generates improved 177Lu images.

PubMed

Rydén, T; Heydorn Lagerlöf, J; Hemmingsson, J; Marin, I; Svensson, J; Båth, M; Gjertsson, P; Bernhardt, P

2018-01-04

Full Monte Carlo (MC)-based SPECT reconstructions have a strong potential for correcting for image degrading factors, but the reconstruction times are long. The objective of this study was to develop a highly parallel Monte Carlo code for fast, ordered subset expectation maximization (OSEM) reconstructions of SPECT/CT images. The MC code was written in the Compute Unified Device Architecture language for a computer with four graphics processing units (GPUs) (GeForce GTX Titan X, Nvidia, USA). This enabled simulations of parallel photon emissions from the voxel matrix (128³ or 256³). Each computed tomography (CT) number was converted to attenuation coefficients for photo absorption, coherent scattering, and incoherent scattering. For photon scattering, the deflection angle was determined by the differential scattering cross sections. An angular response function was developed and used to model the accepted angles for photon interaction with the crystal, and a detector scattering kernel was used for modeling the photon scattering in the detector. Predefined energy and spatial resolution kernels for the crystal were used. The MC code was implemented in the OSEM reconstruction of clinical and phantom ¹⁷⁷Lu SPECT/CT images. The Jaszczak image quality phantom was used to evaluate the performance of the MC reconstruction in comparison with attenuation-corrected (AC) OSEM reconstructions and attenuation-corrected OSEM reconstructions with resolution recovery corrections (RRC). The performance of the MC code was 3200 million photons/s. The required number of photons emitted per voxel to obtain a sufficiently low noise level in the simulated image was 200 for a 128³ voxel matrix. With this number of emitted photons/voxel, the MC-based OSEM reconstruction with ten subsets was performed within 20 s/iteration. The images converged after around six iterations. Therefore, the reconstruction time was around 3 min. The activity recovery for the spheres in the Jaszczak phantom was
CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment

PubMed Central

Manavski, Svetlin A; Valle, Giorgio

2008-01-01

Background: Searching for similarities in protein and DNA databases has become a routine procedure in Molecular Biology. The Smith-Waterman algorithm has been available for more than 25 years. It is based on a dynamic programming approach that explores all the possible alignments between two sequences; as a result it returns the optimal local alignment. Unfortunately, the computational cost is very high, requiring a number of operations proportional to the product of the length of two sequences. Furthermore, the exponential growth of protein and DNA databases makes the Smith-Waterman algorithm unrealistic for searching similarities in large sets of sequences. For these reasons heuristic approaches such as those implemented in FASTA and BLAST tend to be preferred, allowing faster execution times at the cost of reduced sensitivity. The main motivation of our work is to exploit the huge computational power of commonly available graphic cards, to develop high performance solutions for sequence alignment. Results: In this paper we present what we believe is the fastest solution of the exact Smith-Waterman algorithm running on commodity hardware. It is implemented in the recently released CUDA programming environment by NVidia. CUDA allows direct access to the hardware primitives of the last-generation Graphics Processing Units (GPU) G80. Speeds of more than 3.5 GCUPS (Giga Cell Updates Per Second) are achieved on a workstation running two GeForce 8800 GTX. Exhaustive tests have been done to compare our implementation to SSEARCH and BLAST, running on a 3 GHz Intel Pentium IV processor. Our solution was also compared to a recently published GPU implementation and to a Single Instruction Multiple Data (SIMD) solution. These tests show that our implementation performs from 2 to 30 times faster than any other previous attempt available on commodity hardware. Conclusions: The results show that graphic cards are now sufficiently advanced to be used as efficient hardware
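The dynamic programming recurrence that this record describes can be sketched in a few lines of plain Python. This is only an illustration of the Smith-Waterman scoring pass, not the CUDA implementation from the paper: the function name, the match/mismatch scores, and the linear gap penalty are all assumptions chosen for brevity (production tools typically use substitution matrices and affine gaps). The nested loops make the cost proportional to the product of the two sequence lengths, as the abstract states.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score of strings a and b.

    Hypothetical scoring scheme: fixed match/mismatch scores and a
    linear gap penalty. Only the score is computed, not the alignment.
    """
    prev = [0] * (len(b) + 1)  # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the diagonal with a match or mismatch.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero: that is what makes the
            # alignment local rather than global.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Each cell depends only on its left, upper, and upper-left neighbours, so cells on the same anti-diagonal are independent of one another; that independence is one common source of parallelism in GPU versions, although the record does not say which strategy the authors used.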
Fast multipole methods on a cluster of GPUs for the meshless simulation of turbulence

NASA Astrophysics Data System (ADS)

Yokota, R.; Narumi, T.; Sakamaki, R.; Kameoka, S.; Obi, S.; Yasuoka, K.

2009-11-01

Recent advances in the parallelizability of fast N-body algorithms, and the programmability of graphics processing units (GPUs) have opened a new path for particle based simulations. For the simulation of turbulence, vortex methods can now be considered as an interesting alternative to finite difference and spectral methods. The present study focuses on the efficient implementation of the fast multipole method and pseudo-particle method on a cluster of NVIDIA GeForce 8800 GT GPUs, and applies this to a vortex method calculation of homogeneous isotropic turbulence. The results of the present vortex method agree quantitatively with that of the reference calculation using a spectral method. We achieved a maximum speed of 7.48 TFlops using 64 GPUs, and the cost performance was near 9.4/GFlops. The calculation of the present vortex method on 64 GPUs took 4120 s, while the spectral method on 32 CPUs took 4910 s.
Fast precalculated triangular mesh algorithm for 3D binary computer-generated holograms.

PubMed

Yang, Fan; Kaczorowski, Andrzej; Wilkinson, Tim D

2014-12-10

A new method for constructing computer-generated holograms using a precalculated triangular mesh is presented. The speed of calculation can be increased dramatically by exploiting both the precalculated base triangle and GPU parallel computing. Unlike algorithms using point-based sources, this method can reconstruct a more vivid 3D object instead of a "hollow image." In addition, there is no need to do a fast Fourier transform for each 3D element every time. A ferroelectric liquid crystal spatial light modulator is used to display the binary hologram within our experiment and the hologram of a base right triangle is produced by utilizing just a one-step Fourier transform in the 2D case, which can be expanded to the 3D case by multiplying by a suitable Fresnel phase plane. All 3D holograms generated in this paper are based on Fresnel propagation; thus, the Fresnel plane is treated as a vital element in producing the hologram. A GeForce GTX 770 graphics card with 2 GB memory is used to achieve parallel computing.
GPU Lossless Hyperspectral Data Compression System for Space Applications

NASA Technical Reports Server (NTRS)

Keymeulen, Didier; Aranki, Nazeeh; Hopson, Ben; Kiely, Aaron; Klimesh, Matthew; Benkrid, Khaled

2012-01-01

On-board lossless hyperspectral data compression reduces data volume in order to meet NASA and DoD limited downlink capabilities. At JPL, a novel, adaptive and predictive technique for lossless compression of hyperspectral data, named the Fast Lossless (FL) algorithm, was recently developed. This technique uses an adaptive filtering method and achieves state-of-the-art performance in both compression effectiveness and low complexity. Because of its outstanding performance and suitability for real-time onboard hardware implementation, the FL compressor is being formalized as the emerging CCSDS Standard for Lossless Multispectral & Hyperspectral image compression. The FL compressor is well-suited for parallel hardware implementation. A GPU hardware implementation was developed for FL targeting the current state-of-the-art GPUs from NVIDIA. The GPU implementation on an NVIDIA GeForce GTX 580 achieves a throughput performance of 583.08 Mbits/sec (44.85 MSamples/sec) and an acceleration of at least 6 times a software implementation running on a 3.47 GHz single core Intel Xeon processor. This paper describes the design and implementation of the FL algorithm on the GPU. The massively parallel implementation will provide in the future a fast and practical real-time solution for airborne and space applications.
Exploiting current-generation graphics hardware for synthetic-scene generation

NASA Astrophysics Data System (ADS)

Tanner, Michael A.; Keen, Wayne A.

2010-04-01

Increasing seeker frame rate and pixel count, as well as the demand for higher levels of scene fidelity, have driven scene generation software for hardware-in-the-loop (HWIL) and software-in-the-loop (SWIL) testing to higher levels of parallelization. Because modern PC graphics cards provide multiple computational cores (240 shader cores for current NVIDIA Corporation GeForce and Quadro cards), implementation of phenomenology codes on graphics processing units (GPUs) offers significant potential for simultaneous enhancement of simulation frame rate and fidelity. Taking advantage of this potential requires an algorithm implementation that is structured to minimize data transfers between the central processing unit (CPU) and the GPU. In this paper, preliminary methodologies developed at the Kinetic Hardware In-The-Loop Simulator (KHILS) will be presented. Included in this paper will be various language tradeoffs between conventional shader programming, Compute Unified Device Architecture (CUDA) and Open Computing Language (OpenCL), including performance trades and possible pathways for future tool development.
Accelerated Adaptive MGS Phase Retrieval

NASA Technical Reports Server (NTRS)

Lam, Raymond K.; Ohara, Catherine M.; Green, Joseph J.; Bikkannavar, Siddarayappa A.; Basinger, Scott A.; Redding, David C.; Shi, Fang

2011-01-01

The Modified Gerchberg-Saxton (MGS) algorithm is an image-based wavefront-sensing method that can turn any science instrument focal plane into a wavefront sensor. MGS characterizes optical systems by estimating the wavefront errors in the exit pupil using only intensity images of a star or other point source of light. This innovative implementation of MGS significantly accelerates the MGS phase retrieval algorithm by using stream-processing hardware on conventional graphics cards. Stream processing is a relatively new, yet powerful, paradigm to allow parallel processing of certain applications that apply single instructions to multiple data (SIMD). These stream processors are designed specifically to support large-scale parallel computing on a single graphics chip. Computationally intensive algorithms, such as the Fast Fourier Transform (FFT), are particularly well suited for this computing environment. This high-speed version of MGS exploits commercially available hardware to accomplish the same objective in a fraction of the original time. The exploit involves performing matrix calculations in nVidia graphic cards. The graphical processor unit (GPU) is hardware that is specialized for computationally intensive, highly parallel computation. From the software perspective, a parallel programming model is used, called CUDA, to transparently scale multicore parallelism in hardware. This technology gives computationally intensive applications access to the processing power of the nVidia GPUs through a C/C++ programming interface. The AAMGS (Accelerated Adaptive MGS) software takes advantage of these advanced technologies, to accelerate the optical phase error characterization. With a single PC that contains four nVidia GTX-280 graphic cards, the new implementation can process four images simultaneously to produce a JWST (James Webb Space Telescope) wavefront measurement 60 times faster than the previous code.
Multimodality imaging and state-of-art GPU technology in discriminating benign from malignant breast lesions on real time decision support system

NASA Astrophysics Data System (ADS)

Kostopoulos, S.; Sidiropoulos, K.; Glotsos, D.; Dimitropoulos, N.; Kalatzis, I.; Asvestas, P.; Cavouras, D.

2014-03-01

The aim of this study was to design a pattern recognition system for assisting the diagnosis of breast lesions, using image information from Ultrasound (US) and Digital Mammography (DM) imaging modalities. State-of-art computer technology was employed based on commercial Graphics Processing Unit (GPU) cards and parallel programming. An experienced radiologist outlined breast lesions on both US and DM images from 59 patients employing a custom designed computer software application. Textural features were extracted from each lesion and were used to design the pattern recognition system. Several classifiers were tested for highest performance in discriminating benign from malignant lesions. Classifiers were also combined into ensemble schemes for further improvement of the system's classification accuracy. Following the pattern recognition system optimization, the final system was designed employing the Probabilistic Neural Network classifier (PNN) on the GPU card (GeForce 580GTX) using CUDA programming framework and C++ programming language. The use of such state-of-art technology renders the system capable of redesigning itself on site once additional verified US and DM data are collected. Mixture of US and DM features optimized performance with over 90% accuracy in correctly classifying the lesions.

A New GPU-Enabled MODTRAN Thermal Model for the PLUME TRACKER Volcanic Emission Analysis Toolkit

NASA Astrophysics Data System (ADS)

Acharya, P. K.; Berk, A.; Guiang, C.; Kennett, R.; Perkins, T.; Realmuto, V. J.

2013-12-01

Real-time quantification of volcanic gaseous and particulate releases is important for (1) recognizing rapid increases in SO2 gaseous emissions which may signal an impending eruption; (2) characterizing ash clouds to enable safe and efficient commercial aviation; and (3) quantifying the impact of volcanic aerosols on climate forcing. The Jet Propulsion Laboratory (JPL) has developed state-of-the-art algorithms, embedded in their analyst-driven Plume Tracker toolkit, for performing SO2, NH3, and CH4 retrievals from remotely sensed multi-spectral Thermal InfraRed spectral imagery. While Plume Tracker provides accurate results, it typically requires extensive analyst time. A major bottleneck in this processing is the relatively slow but accurate FORTRAN-based MODTRAN atmospheric and plume radiance model, developed by Spectral Sciences, Inc. (SSI). To overcome this bottleneck, SSI in collaboration with JPL, is porting these slow thermal radiance algorithms onto massively parallel, relatively inexpensive and commercially-available GPUs. This paper discusses SSI's efforts to accelerate the MODTRAN thermal emission algorithms used by Plume Tracker. Specifically, we are developing a GPU implementation of the Curtis-Godson averaging and the Voigt in-band transmittances from near line center molecular absorption, which comprise the major computational bottleneck. The transmittance calculations were decomposed into separate functions, individually implemented as GPU kernels, and tested for accuracy and performance relative to the original CPU code. Speedup factors of 14 to 30× were realized for individual processing components on an NVIDIA GeForce GTX 295 graphics card with no loss of accuracy. Due to the separate host (CPU) and device (GPU) memory spaces, a redesign of the MODTRAN architecture was required to ensure efficient data transfer between host and device, and to facilitate high parallel throughput. Currently, we are incorporating the separate GPU kernels into a
Initial development of goCMC: a GPU-oriented fast cross-platform Monte Carlo engine for carbon ion therapy

PubMed Central

Qin, Nan; Pinto, Marco; Tian, Zhen; Dedes, Georgios; Pompos, Arnold; Jiang, Steve B.; Parodi, Katia; Jia, Xun

2017-01-01

voxel size, the computation time to simulate 10⁷ carbons was 9.9–125 sec, 2.5–50 sec and 60–612 sec on an AMD Radeon GPU card, an NVidia GeForce GTX 1080 GPU card and an Intel Xeon E5-2640 CPU, respectively. The combined accuracy, efficiency and portability make goCMC attractive for research and clinical applications in carbon ion therapy. PMID:28140352
Initial development of goCMC: a GPU-oriented fast cross-platform Monte Carlo engine for carbon ion therapy

NASA Astrophysics Data System (ADS)

Qin, Nan; Pinto, Marco; Tian, Zhen; Dedes, Georgios; Pompos, Arnold; Jiang, Steve B.; Parodi, Katia; Jia, Xun

2017-05-01

energy and voxel size, the computation time to simulate 10⁷ carbons was 9.9-125 s, 2.5-50 s and 60-612 s on an AMD Radeon GPU card, an NVidia GeForce GTX 1080 GPU card and an Intel Xeon E5-2640 CPU, respectively. The combined accuracy, efficiency and portability make goCMC attractive for research and clinical applications in carbon ion therapy.
Initial development of goCMC: a GPU-oriented fast cross-platform Monte Carlo engine for carbon ion therapy.

PubMed

Qin, Nan; Pinto, Marco; Tian, Zhen; Dedes, Georgios; Pompos, Arnold; Jiang, Steve B; Parodi, Katia; Jia, Xun

2017-05-07

beam energy and voxel size, the computation time to simulate 10⁷ carbons was 9.9-125 s, 2.5-50 s and 60-612 s on an AMD Radeon GPU card, an NVidia GeForce GTX 1080 GPU card and an Intel Xeon E5-2640 CPU, respectively. The combined accuracy, efficiency and portability make goCMC attractive for research and clinical applications in carbon ion therapy.
Three-directional motion-compensation mask-based novel look-up table on graphics processing units for video-rate generation of digital holographic videos of three-dimensional scenes.

PubMed

Kwon, Min-Woo; Kim, Seung-Cheol; Kim, Eun-Soo

2016-01-20

A three-directional motion-compensation mask-based novel look-up table method is proposed and implemented on graphics processing units (GPUs) for video-rate generation of digital holographic videos of three-dimensional (3D) scenes. Since the proposed method is designed to be well matched with the software and memory structures of GPUs, the number of compute-unified-device-architecture kernel function calls can be significantly reduced. This results in a great increase of the computational speed of the proposed method, allowing video-rate generation of the computer-generated hologram (CGH) patterns of 3D scenes. Experimental results reveal that the proposed method can generate 39.8 frames of Fresnel CGH patterns with 1920×1080 pixels per second for the test 3D video scenario with 12,088 object points on dual GPU boards of NVIDIA GTX TITANs, and they confirm the feasibility of the proposed method in the practical application fields of electroholographic 3D displays.
Performance analysis of a parallel Monte Carlo code for simulating solar radiative transfer in cloudy atmospheres using CUDA-enabled NVIDIA GPU

NASA Astrophysics Data System (ADS)

Russkova, Tatiana V.

2017-11-01

One tool to improve the performance of Monte Carlo methods for numerical simulation of light transport in the Earth's atmosphere is parallel computing technology. A new algorithm oriented to parallel execution on CUDA-enabled NVIDIA graphics processors is discussed. The efficiency of parallelization is analyzed on the basis of calculating the upward and downward fluxes of solar radiation in both vertically homogeneous and inhomogeneous models of the atmosphere. The results of testing the new code under various atmospheric conditions, including continuous single-layered and multilayered clouds and selective molecular absorption, are presented. The results of testing the code using video cards with different compute capability are analyzed. It is shown that the changeover of computing from conventional PCs to the architecture of graphics processors gives more than a hundredfold increase in performance and fully reveals the capabilities of the technology used.
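To make the kind of computation in this record concrete, the following is a minimal serial sketch of Monte Carlo photon transport through a single homogeneous plane-parallel layer. It is not the author's algorithm: the function name, normal incidence, isotropic scattering, and the use of a single-scattering albedo to decide absorption are all simplifying assumptions. The reflected fraction plays the role of the upward flux at the top of the layer and the transmitted fraction the downward flux at its base.

```python
import math
import random


def slab_fluxes(tau, albedo, n_photons, rng):
    """Return (reflected, transmitted, absorbed) photon fractions.

    tau    : optical thickness of the layer (assumed homogeneous)
    albedo : single-scattering albedo, probability of surviving a collision
    rng    : a random.Random instance, passed in for reproducibility
    """
    reflected = transmitted = absorbed = 0
    for _ in range(n_photons):
        depth = 0.0  # optical depth measured downward from the layer top
        mu = 1.0     # direction cosine; 1.0 means straight down
        while True:
            # Free path in optical-depth units, exponentially distributed.
            depth += mu * -math.log(1.0 - rng.random())
            if depth < 0.0:
                reflected += 1      # escaped through the top
                break
            if depth >= tau:
                transmitted += 1    # escaped through the bottom
                break
            if rng.random() >= albedo:
                absorbed += 1       # collision ended in absorption
                break
            # Isotropic scattering: new direction cosine uniform in [-1, 1).
            mu = 2.0 * rng.random() - 1.0
    n = float(n_photons)
    return reflected / n, transmitted / n, absorbed / n
```

Each photon history is independent of every other, which is why this class of method maps naturally onto one-thread-per-photon GPU execution; a call such as `slab_fluxes(1.0, 0.9, 2000, random.Random(1))` returns three fractions that sum to one.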
High-Speed Particle-in-Cell Simulation Parallelized with Graphic Processing Units for Low Temperature Plasmas for Material Processing

NASA Astrophysics Data System (ADS)

Hur, Min Young; Verboncoeur, John; Lee, Hae June

2014-10-01

Particle-in-cell (PIC) simulations have high fidelity in the plasma device requiring transient kinetic modeling compared with fluid simulations. It uses less approximation on the plasma kinetics but requires many particles and grids to observe the semantic results. It means that the simulation spends lots of simulation time in proportion to the number of particles. Therefore, PIC simulation needs high performance computing. In this research, a graphic processing unit (GPU) is adopted for high performance computing of PIC simulation for low temperature discharge plasmas. GPUs have many-core processors and high memory bandwidth compared with a central processing unit (CPU). NVIDIA GeForce GPUs were used for the test with hundreds of cores which show cost-effective performance. PIC code algorithm is divided into two modules which are a field solver and a particle mover. The particle mover module is divided into four routines which are named move, boundary, Monte Carlo collision (MCC), and deposit. Overall, the GPU code solves particle motions as well as electrostatic potential in two-dimensional geometry almost 30 times faster than a single CPU code. This work was supported by the Korea Institute of Science Technology Information.
Enhanced Graphics for Extended Scale Range

NASA Technical Reports Server (NTRS)

Hanson, Andrew J.; Chi-Wing Fu, Philip

2012-01-01

Enhanced Graphics for Extended Scale Range is a computer program for rendering fly-through views of scene models that include visible objects differing in size by large orders of magnitude. An example would be a scene showing a person in a park at night with the moon, stars, and galaxies in the background sky. Prior graphical computer programs exhibit arithmetic and other anomalies when rendering scenes containing objects that differ enormously in scale and distance from the viewer. The present program dynamically repartitions distance scales of objects in a scene during rendering to eliminate almost all such anomalies in a way compatible with implementation in other software and in hardware accelerators. By assigning depth ranges corresponding to rendering precision requirements, either automatically or under program control, this program spaces out object scales to match the precision requirements of the rendering arithmetic. This action includes an intelligent partition of the depth buffer ranges to avoid known anomalies from this source. The program is written in C++, using OpenGL, GLUT, and GLUI standard libraries, and nVidia GeForce Vertex Shader extensions. The program has been shown to work on several computers running UNIX and Windows operating systems.
Rapid data processing for ultrafast X-ray computed tomography using scalable and modular CUDA based pipelines

NASA Astrophysics Data System (ADS)

Frust, Tobias; Wagner, Michael; Stephan, Jan; Juckeland, Guido; Bieberle, André

2017-10-01

Ultrafast X-ray tomography is an advanced imaging technique for the study of dynamic processes based on the principles of electron beam scanning. A typical application case for this technique is the study of multiphase flows, that is, flows of mixtures of substances such as gas-liquid flows in pipelines or chemical reactors. At Helmholtz-Zentrum Dresden-Rossendorf (HZDR) a number of such tomography scanners are operated. Currently, there are two main points limiting their application in some fields. First, after each CT scan sequence the data of the radiation detector must be downloaded from the scanner to a data processing machine. Second, the current data processing is time-consuming compared to the CT scan sequence interval. To enable online observations or use this technique to control actuators in real-time, a modular and scalable data processing tool has been developed, consisting of user-definable stages working independently together in a so-called data processing pipeline, that keeps up with the CT scanner's maximal frame rate of up to 8 kHz. The newly developed data processing stages are freely programmable and combinable. In order to achieve the highest processing performance all relevant data processing steps, which are required for a standard slice image reconstruction, were individually implemented in separate stages using Graphics Processing Units (GPUs) and NVIDIA's CUDA programming language. Data processing performance tests on different high-end GPUs (Tesla K20c, GeForce GTX 1080, Tesla P100) showed excellent performance. Program Files doi:http://dx.doi.org/10.17632/65sx747rvm.1 Licensing provisions: LGPLv3 Programming language: C++/CUDA Supplementary material: Test data set, used for the performance analysis. Nature of problem: Ultrafast computed tomography is performed with a scan rate of up to 8 kHz. To obtain cross-sectional images from projection data computer-based image reconstruction algorithms must be applied. The
NVIDIA OptiX ray-tracing engine as a new tool for modelling medical imaging systems

NASA Astrophysics Data System (ADS)

Pietrzak, Jakub; Kacperski, Krzysztof; Cieślar, Marek

2015-03-01

The most accurate technique to model the X- and gamma radiation path through a numerically defined object is the Monte Carlo simulation which follows single photons according to their interaction probabilities. A simplified and much faster approach, which just integrates total interaction probabilities along selected paths, is known as ray tracing. Both techniques are used in medical imaging for simulating real imaging systems and as projectors required in iterative tomographic reconstruction algorithms. These approaches are ready for massive parallel implementation e.g. on Graphics Processing Units (GPU), which can greatly accelerate the computation time at a relatively low cost. In this paper we describe the application of the NVIDIA OptiX ray-tracing engine, popular in professional graphics and rendering applications, as a new powerful tool for X- and gamma ray-tracing in medical imaging. It allows the implementation of a variety of physical interactions of rays with pixel-, mesh- or nurbs-based objects, and recording any required quantities, like path integrals, interaction sites, deposited energies, and others. Using the OptiX engine we have implemented a code for rapid Monte Carlo simulations of Single Photon Emission Computed Tomography (SPECT) imaging, as well as the ray-tracing projector, which can be used in reconstruction algorithms. The engine generates efficient, scalable and optimized GPU code, ready to run on multi GPU heterogeneous systems. We have compared the results of our simulations with the GATE package. With the OptiX engine the computation time of a Monte Carlo simulation can be reduced from days to minutes.
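The ray-tracing idea in this record, integrating interaction probability along a chosen path, can be illustrated without any GPU library. The sketch below does not use the OptiX API; it is plain Python with hypothetical function names, a 2D grid of unit-size pixels, and fixed-step midpoint sampling instead of exact voxel traversal. It accumulates the attenuation coefficient along a straight segment and converts the path integral into a transmitted fraction with the Beer-Lambert law.

```python
import math


def path_integral(mu, start, end, steps=1000):
    """Approximate the line integral of mu along the segment start -> end.

    mu         : 2D list indexed [row][col]; pixel (row, col) covers
                 col <= x < col + 1 and row <= y < row + 1
    start, end : (x, y) points; samples outside the grid contribute zero
    """
    (x0, y0), (x1, y1) = start, end
    ds = math.hypot(x1 - x0, y1 - y0) / steps  # length of one sub-step
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) / steps  # midpoint of the k-th sub-step
        col = math.floor(x0 + t * (x1 - x0))
        row = math.floor(y0 + t * (y1 - y0))
        if 0 <= row < len(mu) and 0 <= col < len(mu[0]):
            total += mu[row][col] * ds
    return total


def transmitted_fraction(mu, start, end, steps=1000):
    """Beer-Lambert law: fraction of photons that cross without interacting."""
    return math.exp(-path_integral(mu, start, end, steps))
```

For a uniform 4 x 4 grid with coefficient 0.1 per unit length, a horizontal ray crossing the full width has a path integral close to 0.4. A projector for tomographic reconstruction would evaluate many such rays, one per detector element and view angle, which is the part that parallelizes well.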
AESS: Accelerated Exact Stochastic Simulation

NASA Astrophysics Data System (ADS)

Jenkins, David D.; Peterson, Gregory D.

2011-12-01

The Stochastic Simulation Algorithm (SSA) developed by Gillespie provides a powerful mechanism for exploring the behavior of chemical systems with small species populations or with important noise contributions. Gene circuit simulations for systems biology commonly employ the SSA method, as do ecological applications. This algorithm tends to be computationally expensive, so researchers seek an efficient implementation of SSA. In this program package, the Accelerated Exact Stochastic Simulation Algorithm (AESS) contains optimized implementations of Gillespie's SSA that improve the performance of individual simulation runs or ensembles of simulations used for sweeping parameters or to provide statistically significant results. Program summary Program title: AESS Catalogue identifier: AEJW_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEJW_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: University of Tennessee copyright agreement No. of lines in distributed program, including test data, etc.: 10 861 No. of bytes in distributed program, including test data, etc.: 394 631 Distribution format: tar.gz Programming language: C for processors, CUDA for NVIDIA GPUs Computer: Developed and tested on various x86 computers and NVIDIA C1060 Tesla and GTX 480 Fermi GPUs. The system targets x86 workstations, optionally with multicore processors or NVIDIA GPUs as accelerators. Operating system: Tested under Ubuntu Linux OS and CentOS 5.5 Linux OS Classification: 3, 16.12 Nature of problem: Simulation of chemical systems, particularly with low species populations, can be accurately performed using Gillespie's method of stochastic simulation. Numerous variations on the original stochastic simulation algorithm have been developed, including approaches that produce results with statistics that exactly match the chemical master equation (CME) as well as other approaches that approximate the CME. Solution
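For readers unfamiliar with Gillespie's method, a minimal sketch of the direct-method variant follows. It is not taken from the AESS package: the function name, the dictionary-based state, and the representation of a reaction as a (propensity function, state change) pair are assumptions made for illustration. Each step draws an exponentially distributed waiting time from the total propensity, then selects one reaction with probability proportional to its own propensity.

```python
import math
import random


def gillespie_direct(x, reactions, t_end, rng):
    """Simulate one trajectory with Gillespie's direct method.

    x         : dict mapping species name -> molecule count
    reactions : list of (propensity_fn, change) pairs, where propensity_fn
                takes the state dict and change maps species -> count delta
    rng       : a random.Random instance
    Returns a list of (time, state) snapshots, one per fired reaction.
    """
    t = 0.0
    x = dict(x)
    trajectory = [(t, dict(x))]
    while True:
        props = [fn(x) for fn, _ in reactions]
        total = sum(props)
        if total <= 0.0:
            break  # no reaction can fire any more
        # Waiting time to the next reaction is exponential with rate total.
        t += -math.log(1.0 - rng.random()) / total
        if t > t_end:
            break
        # Pick a reaction with probability proportional to its propensity.
        threshold = rng.random() * total
        acc = 0.0
        for p, (_, change) in zip(props, reactions):
            acc += p
            if threshold < acc:
                break
        # If rounding prevented a break, change is the last reaction's update.
        for species, delta in change.items():
            x[species] += delta
        trajectory.append((t, dict(x)))
    return trajectory
```

As a usage example, the single conversion reaction A -> B with rate constant 0.5 can be written as `[(lambda s: 0.5 * s["A"], {"A": -1, "B": 1})]`. Ensembles of such runs are independent of one another, which is the kind of workload the record describes accelerating.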
Quantum Chemical Calculations Using Accelerators: Migrating Matrix Operations to the NVIDIA Kepler GPU and the Intel Xeon Phi.

PubMed

Leang, Sarom S; Rendell, Alistair P; Gordon, Mark S

2014-03-11

Increasingly, modern computer systems comprise a multicore general-purpose processor augmented with a number of special purpose devices or accelerators connected via an external interface such as a PCI bus. The NVIDIA Kepler Graphical Processing Unit (GPU) and the Intel Phi are two examples of such accelerators. Accelerators offer peak performances that can be well above those of the host processor. How to exploit this heterogeneous environment for legacy application codes is not, however, straightforward. This paper considers how matrix operations in typical quantum chemical calculations can be migrated to the GPU and Phi systems. Double precision general matrix multiply operations are endemic in electronic structure calculations, especially methods that include electron correlation, such as density functional theory, second order perturbation theory, and coupled cluster theory. The use of approaches that automatically determine whether to use the host or an accelerator, based on problem size, is explored, with computations that are occurring on the accelerator and/or the host. For data-transfers over PCI-e, the GPU provides the best overall performance for data sizes up to 4096 MB with consistent upload and download rates between 5-5.6 GB/s and 5.4-6.3 GB/s, respectively. The GPU outperforms the Phi for both square and nonsquare matrix multiplications.
permGPU: Using graphics processing units in RNA microarray association studies.

PubMed

Shterev, Ivo D; Jung, Sin-Ho; George, Stephen L; Owzar, Kouros

2010-06-16

Many analyses of microarray association studies involve permutation, bootstrap resampling and cross-validation, which are ideally formulated as embarrassingly parallel computing problems. Given that these analyses are computationally intensive, scalable approaches that can take advantage of multi-core processor systems need to be developed. We have developed a CUDA based implementation, permGPU, that employs graphics processing units in microarray association studies. We illustrate the performance and applicability of permGPU within the context of permutation resampling for a number of test statistics. An extensive simulation study demonstrates a dramatic increase in performance when using permGPU on an NVIDIA GTX 280 card compared to an optimized C/C++ solution running on a conventional Linux server. permGPU is available as an open-source stand-alone application and as an extension package for the R statistical environment. It provides a dramatic increase in performance for permutation resampling analysis in the context of microarray association studies. The current version offers six test statistics for carrying out permutation resampling analyses for binary, quantitative and censored time-to-event traits.
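As a rough illustration of permutation resampling for a binary trait, the Python sketch below computes a two-sided permutation p-value. It is not permGPU code: the difference-of-means statistic and the names `mean_difference` and `permutation_p_value` are assumptions made for this example.

```python
import random


def mean_difference(values, labels):
    """Difference in mean expression between label 1 and label 0 samples."""
    group1 = [v for v, lab in zip(values, labels) if lab == 1]
    group0 = [v for v, lab in zip(values, labels) if lab == 0]
    return sum(group1) / len(group1) - sum(group0) / len(group0)


def permutation_p_value(values, labels, n_perm, rng):
    """Two-sided permutation p-value for one gene (hypothetical helper)."""
    observed = abs(mean_difference(values, labels))
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        # Each permutation is independent of the others, so the loop body
        # could be assigned to separate threads.
        rng.shuffle(shuffled)
        if abs(mean_difference(values, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Assumed toy data: eight control and eight case samples for a single gene.
expression = [0.1, 0.3, 0.2, 0.0, 0.4, 0.2, 0.1, 0.3,
              5.1, 5.3, 4.9, 5.0, 5.2, 5.4, 4.8, 5.1]
trait = [0] * 8 + [1] * 8
print(permutation_p_value(expression, trait, 2000, random.Random(0)))
```

Because no permutation depends on any other, and each gene can be tested separately, the work splits cleanly across GPU threads, which is the embarrassingly parallel structure the abstract refers to.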
Practical Implementation of Prestack Kirchhoff Time Migration on a General Purpose Graphics Processing Unit

NASA Astrophysics Data System (ADS)

Liu, Guofeng; Li, Chun

2016-08-01

In this study, we present a practical implementation of prestack Kirchhoff time migration (PSTM) on a general purpose graphics processing unit. First, we consider the three main optimizations of the PSTM GPU code, i.e., designing a reasonable execution configuration, using the texture memory for velocity interpolation, and applying an intrinsic function in device code. This approach can achieve a speedup of nearly 45 times on an NVIDIA GTX 680 GPU compared with CPU code when a larger imaging space is used, where the PSTM output is a common reflection point that is gathered as I[nx][ny][nh][nt] in matrix format. However, this method requires more memory space, so a limited imaging space cannot fully exploit the GPU resources. To overcome this problem, we designed a PSTM scheme with multi-GPUs for imaging different seismic data on different GPUs using an offset value. This process can achieve the peak speedup of GPU PSTM code and it greatly increases the efficiency of the calculations, but without changing the imaging result.
Fast, multi-channel real-time processing of signals with microsecond latency using graphics processing units.

PubMed

Rath, N; Kato, S; Levesque, J P; Mauel, M E; Navratil, G A; Peng, Q

2014-04-01

Fast, digital signal processing (DSP) has many applications. Typical hardware options for performing DSP are field-programmable gate arrays (FPGAs), application-specific integrated DSP chips, or general purpose personal computer systems. This paper presents a novel DSP platform that has been developed for feedback control on the HBT-EP tokamak device. The system runs all signal processing exclusively on a Graphics Processing Unit (GPU) to achieve real-time performance with latencies below 8 μs. Signals are transferred into and out of the GPU using PCI Express peer-to-peer direct-memory-access transfers without involvement of the central processing unit or host memory. Tests were performed on the feedback control system of the HBT-EP tokamak using forty 16-bit floating point inputs and outputs each and a sampling rate of up to 250 kHz. Signals were digitized by a D-TACQ ACQ196 module, processing done on an NVIDIA GTX 580 GPU programmed in CUDA, and analog output was generated by D-TACQ AO32CPCI modules.
Object tracking mask-based NLUT on GPUs for real-time generation of holographic videos of three-dimensional scenes.

PubMed

Kwon, M-W; Kim, S-C; Yoon, S-E; Ho, Y-S; Kim, E-S

2015-02-09

A new object tracking mask-based novel-look-up-table (OTM-NLUT) method is proposed and implemented on graphics-processing-units (GPUs) for real-time generation of holographic videos of three-dimensional (3-D) scenes. Since the proposed method is designed to be matched with software and memory structures of the GPU, the number of compute-unified-device-architecture (CUDA) kernel function calls and the computer-generated hologram (CGH) buffer size of the proposed method have been significantly reduced. It therefore results in a great increase of the computational speed of the proposed method and enables real-time generation of CGH patterns of 3-D scenes. Experimental results show that the proposed method can generate 31.1 frames of Fresnel CGH patterns with 1,920 × 1,080 pixels per second, on average, for three test 3-D video scenarios with 12,666 object points on three GPU boards of NVIDIA GTX TITAN, and confirm the feasibility of the proposed method in the practical application of electro-holographic 3-D displays.
Design of a decision support system, trained on GPU, for assisting melanoma diagnosis in dermatoscopy images

NASA Astrophysics Data System (ADS)

Glotsos, Dimitris; Kostopoulos, Spiros; Lalissidou, Stella; Sidiropoulos, Konstantinos; Asvestas, Pantelis; Konstandinou, Christos; Xenogiannopoulos, George; Konstantina Nikolatou, Eirini; Perakis, Konstantinos; Bouras, Thanassis; Cavouras, Dionisis

2015-09-01

The purpose of this study was to design a decision support system for assisting the diagnosis of melanoma in dermatoscopy images. Clinical material comprised images of 44 dysplastic (Clark's nevi) and 44 malignant melanoma lesions, obtained from the dermatology database Dermnet. Initially, images were processed for hair removal and background correction using the Dull Razor algorithm. Processed images were segmented to isolate moles from surrounding background, using a combination of level sets and an automated thresholding approach. Morphological (area, size, shape) and textural features (first and second order) were calculated from each one of the segmented moles. Extracted features were fed to a pattern recognition system assembled with the Probabilistic Neural Network Classifier, which was trained to distinguish between benign and malignant cases, using the exhaustive search and the leave-one-out method. The system was designed on the GPU card (GeForce 580GTX) using the CUDA programming framework and the C++ programming language. Results showed that the designed system discriminated benign from malignant moles with 88.6% accuracy employing morphological and textural features. The proposed system could be used for analysing moles depicted on smartphone images after appropriate training with smartphone image cases. This could assist towards early detection of melanoma cases, if suspicious moles were to be captured on smartphone by patients and be transferred to the physician together with an assessment of the mole's nature.
Efficient parallel implementation of active appearance model fitting algorithm on GPU.

PubMed

Wang, Jinwei; Ma, Xirong; Zhu, Yuanping; Sun, Jizhou

2014-01-01

The active appearance model (AAM) is one of the most powerful model-based object detecting and tracking methods which has been widely used in various situations. However, the high-dimensional texture representation causes very time-consuming computations, which makes the AAM difficult to apply to real-time systems. The emergence of modern graphics processing units (GPUs) that feature a many-core, fine-grained parallel architecture provides new and promising solutions to overcome the computational challenge. In this paper, we propose an efficient parallel implementation of the AAM fitting algorithm on GPUs. Our design idea is fine grain parallelism in which we distribute the texture data of the AAM, in pixels, to thousands of parallel GPU threads for processing, which makes the algorithm fit better into the GPU architecture. We implement our algorithm using the compute unified device architecture (CUDA) on Nvidia's GTX 650 GPU, which has the latest Kepler architecture. To compare the performance of our algorithm with different data sizes, we built sixteen face AAM models of different dimensional textures. The experiment results show that our parallel AAM fitting algorithm can achieve real-time performance for videos even on very high-dimensional textures.
Efficient Parallel Implementation of Active Appearance Model Fitting Algorithm on GPU

PubMed Central

Wang, Jinwei; Ma, Xirong; Zhu, Yuanping; Sun, Jizhou

2014-01-01

The active appearance model (AAM) is one of the most powerful model-based object detecting and tracking methods which has been widely used in various situations. However, the high-dimensional texture representation causes very time-consuming computations, which makes the AAM difficult to apply to real-time systems. The emergence of modern graphics processing units (GPUs) that feature a many-core, fine-grained parallel architecture provides new and promising solutions to overcome the computational challenge. In this paper, we propose an efficient parallel implementation of the AAM fitting algorithm on GPUs. Our design idea is fine grain parallelism in which we distribute the texture data of the AAM, in pixels, to thousands of parallel GPU threads for processing, which makes the algorithm fit better into the GPU architecture. We implement our algorithm using the compute unified device architecture (CUDA) on Nvidia's GTX 650 GPU, which has the latest Kepler architecture. To compare the performance of our algorithm with different data sizes, we built sixteen face AAM models of different dimensional textures. The experiment results show that our parallel AAM fitting algorithm can achieve real-time performance for videos even on very high-dimensional textures. PMID:24723812
High definition live 3D-OCT in vivo: design and evaluation of a 4D OCT engine with 1 GVoxel/s.

PubMed

Wieser, Wolfgang; Draxinger, Wolfgang; Klein, Thomas; Karpf, Sebastian; Pfeiffer, Tom; Huber, Robert

2014-09-01

We present a 1300 nm OCT system for volumetric real-time live OCT acquisition and visualization at 1 billion volume elements per second. All technological challenges and problems associated with such high scanning speed are discussed in detail as well as the solutions. In one configuration, the system acquires, processes and visualizes 26 volumes per second where each volume consists of 320 x 320 depth scans and each depth scan has 400 usable pixels. This is the fastest real-time OCT to date in terms of voxel rate. A 51 Hz volume rate is realized with half the frame number. In both configurations the speed can be sustained indefinitely. The OCT system uses a 1310 nm Fourier domain mode locked (FDML) laser operated at 3.2 MHz sweep rate. Data acquisition is performed with two dedicated digitizer cards, each running at 2.5 GS/s, hosted in a single desktop computer. Live real-time data processing and visualization are realized with custom developed software on an NVidia GTX 690 dual graphics processing unit (GPU) card. To evaluate potential future applications of such a system, we present volumetric videos captured at 26 and 51 Hz of planktonic crustaceans and skin.
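The headline figure of 1 GVoxel/s follows directly from the numbers in the abstract, as the short Python check below shows. The function name `voxel_rate` is an assumption, and reading "half the frame number" as 160 frames of 320 depth scans is an interpretation rather than something the abstract states.

```python
def voxel_rate(scans_per_frame, frames_per_volume, pixels_per_scan, volumes_per_second):
    """Voxels processed per second for a given volume geometry and volume rate."""
    return scans_per_frame * frames_per_volume * pixels_per_scan * volumes_per_second


# 26 volumes/s, each with 320 x 320 depth scans of 400 usable pixels.
full_volume = voxel_rate(320, 320, 400, 26)

# Assumed reading of the 51 Hz mode: half as many frames per volume.
half_volume = voxel_rate(320, 160, 400, 51)

# Depth scans needed per second in the 26 Hz mode, for comparison with the
# 3.2 MHz laser sweep rate quoted in the abstract.
depth_scans_per_second = 320 * 320 * 26

print(full_volume)             # 1064960000
print(half_volume)             # 1044480000
print(depth_scans_per_second)  # 2662400
```

Both configurations come out slightly above 10^9 voxels per second, and the 26 Hz mode needs about 2.66 million depth scans per second, which is below the 3.2 MHz sweep rate.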

Large calculation of the flow over a hypersonic vehicle using a GPU

NASA Astrophysics Data System (ADS)

Elsen, Erich; LeGresley, Patrick; Darve, Eric

2008-12-01

Graphics processing units are capable of impressive computing performance up to 518 Gflops peak performance. Various groups have been using these processors for general purpose computing; most efforts have focussed on demonstrating relatively basic calculations, e.g. numerical linear algebra, or physical simulations for visualization purposes with limited accuracy. This paper describes the simulation of a hypersonic vehicle configuration with detailed geometry and accurate boundary conditions using the compressible Euler equations. To the authors' knowledge, this is the most sophisticated calculation of this kind in terms of complexity of the geometry, the physical model, the numerical methods employed, and the accuracy of the solution. The Navier-Stokes Stanford University Solver (NSSUS) was used for this purpose. NSSUS is a multi-block structured code with a provably stable and accurate numerical discretization which uses a vertex-based finite-difference method. A multi-grid scheme is used to accelerate the solution of the system. Based on a comparison of the Intel Core 2 Duo and NVIDIA 8800GTX, speed-ups of over 40× were demonstrated for simple test geometries and 20× for complex geometries.
GPU color space conversion

NASA Astrophysics Data System (ADS)

Chase, Patrick; Vondran, Gary

2011-01-01

Tetrahedral interpolation is commonly used to implement continuous color space conversions from sparse 3D and 4D lookup tables. We investigate the implementation and optimization of tetrahedral interpolation algorithms for GPUs, and compare to the best known CPU implementations as well as to a well known GPU-based trilinear implementation. We show that a $500 NVIDIA GTX-580 GPU is 3x faster than a $1000 Intel Core i7 980X CPU for 3D interpolation, and 9x faster for 4D interpolation. Performance-relevant GPU attributes are explored including thread scheduling, local memory characteristics, global memory hierarchy, and cache behaviors. We consider existing tetrahedral interpolation algorithms and tune based on the structure and branching capabilities of current GPUs. Global memory performance is improved by reordering and expanding the lookup table to ensure optimal access behaviors. Per multiprocessor local memory is exploited to implement optimally coalesced global memory accesses, and local memory addressing is optimized to minimize bank conflicts. We explore the impacts of lookup table density upon computation and memory access costs. Also presented are CPU-based 3D and 4D interpolators, using SSE vector operations that are faster than any previously published solution.
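To make the technique concrete, here is a minimal Python sketch of 3D tetrahedral interpolation on a cubic lookup table. It is not the authors' GPU implementation: the nested-list layout `lut[i][j][k]`, the unit grid spacing, the single scalar output channel, and the name `tetrahedral_interpolate` are all assumptions for illustration.

```python
def tetrahedral_interpolate(lut, x, y, z):
    """Interpolate a scalar n x n x n table at grid coordinates in [0, n - 1]."""
    n = len(lut)
    base, frac = [], []
    for c in (x, y, z):
        i = min(int(c), n - 2)  # clamp so that i + 1 is still a valid index
        base.append(i)
        frac.append(c - i)
    # Sorting the fractional parts selects one of the six tetrahedra that
    # tile the enclosing cube; only four of its eight corners are read.
    order = sorted(range(3), key=lambda axis: frac[axis], reverse=True)
    corner = list(base)
    previous = lut[corner[0]][corner[1]][corner[2]]
    result = previous
    for axis in order:
        corner[axis] += 1  # walk along a cube edge to the next tetrahedron vertex
        current = lut[corner[0]][corner[1]][corner[2]]
        result += frac[axis] * (current - previous)
        previous = current
    return result


# Assumed toy table sampling the linear function f = 2i + 3j - k + 1.
table = [[[2 * i + 3 * j - k + 1 for k in range(3)] for j in range(3)]
         for i in range(3)]
print(tetrahedral_interpolate(table, 0.25, 0.5, 0.75))  # 2.25
```

Compared with trilinear interpolation, which reads all eight cube corners, this scheme reads four table entries per lookup. The data-dependent ordering of the fractions is the kind of branching the abstract mentions tuning for. A real color conversion would apply the same weights to each output channel.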
Automatic Railway Traffic Object Detection System Using Feature Fusion Refine Neural Network under Shunting Mode.

PubMed

Ye, Tao; Wang, Baocheng; Song, Ping; Li, Juan

2018-06-12

Many accidents happen under shunting mode when the speed of a train is below 45 km/h. In this mode, train attendants observe the railway condition ahead using the traditional manual method and tell the observation results to the driver in order to avoid danger. To address this problem, an automatic object detection system based on a convolutional neural network (CNN), called Feature Fusion Refine neural network (FR-Net), is proposed to detect objects ahead in shunting mode. It consists of three connected modules, i.e., the depthwise-pointwise convolution, the coarse detection module, and the object detection module. Depthwise-pointwise convolutions are used to improve the detection in real time. The coarse detection module coarsely refines the locations and sizes of prior anchors to provide better initialization for the subsequent module and also reduces the search space for the classification, whereas the object detection module aims to regress accurate object locations and predict the class labels for the prior anchors. The experimental results on the railway traffic dataset show that FR-Net achieves 0.8953 mAP with 72.3 FPS performance on a machine with a GeForce GTX1080Ti with the input size of 320 × 320 pixels. The results imply that FR-Net achieves a good trade-off between effectiveness and real-time performance. The proposed method can meet the needs of practical application in shunting mode.
GPU accelerated study of heat transfer and fluid flow by lattice Boltzmann method on CUDA

NASA Astrophysics Data System (ADS)

Ren, Qinlong

The lattice Boltzmann method (LBM) has been developed during the past two decades as a powerful numerical approach to simulate complex fluid flow and heat transfer phenomena. As a mesoscale method based on kinetic theory, LBM has several advantages compared with traditional numerical methods, such as physical representation of microscopic interactions, handling of complex geometries, and a highly parallel nature. The lattice Boltzmann method has been applied to various fluid behaviors and heat transfer processes such as conjugate heat transfer, magnetic and electric fields, diffusion and mixing processes, chemical reactions, multiphase flow, phase change processes, non-isothermal flow in porous media, microfluidics, fluid-structure interactions in biological systems and so on. In addition, as a non-body-conformal grid method, the immersed boundary method (IBM) can be applied to handle complex or moving geometries in the domain. The immersed boundary method can be coupled with the lattice Boltzmann method to study heat transfer and fluid flow problems. Heat transfer and fluid flow are solved on Euler nodes by LBM while the complex solid geometries are captured by Lagrangian nodes using the immersed boundary method. Parallel computing has been a popular topic for many decades as a way to accelerate computation in engineering and scientific fields. Today, almost all laptops and desktops have central processing units (CPUs) with multiple cores which can be used for parallel computing. However, the cost of CPUs with hundreds of cores is still high, which limits their use for high performance computing on personal computers. Graphics processing units (GPUs), originally used for computer video cards, have emerged in recent years as the most powerful high-performance workstation hardware. Unlike CPUs, GPUs with thousands of cores are cheap. For example, the GPU (GeForce GTX TITAN) which is used in the current work has 2688 cores and the price is only 1
Image-guided thoracic surgery in the hybrid operation room.

PubMed

Ujiie, Hideki; Effat, Andrew; Yasufuku, Kazuhiro

2017-01-01

There has been an increase in the use of image-guided technology to facilitate minimally invasive therapy. The next generation of minimally invasive therapy is focused on advancement and translation of novel image-guided technologies in therapeutic interventions, including surgery, interventional pulmonology, radiation therapy, and interventional laser therapy. To establish the efficacy of different minimally invasive therapies, we have developed a hybrid operating room, known as the guided therapeutics operating room (GTx OR) at the Toronto General Hospital. The GTx OR is equipped with multi-modality image-guidance systems, which feature a dual source-dual energy computed tomography (CT) scanner, a robotic cone-beam CT (CBCT)/fluoroscopy, high-performance endobronchial ultrasound system, endoscopic surgery system, near-infrared (NIR) fluorescence imaging system, and navigation tracking systems. The novel multimodality image-guidance systems allow physicians to quickly and accurately image patients while they are on the operating table. This yields improved outcomes since physicians are able to use image guidance during their procedures, and carry out innovative multi-modality therapeutics. Multiple preclinical translational studies pertaining to innovative minimally invasive technology are being developed in our guided therapeutics laboratory (GTx Lab). The GTx Lab is equipped with similar technology and multimodality image-guidance systems as the GTx OR, and acts as an appropriate platform for translation of research into human clinical trials. Through the GTx Lab, we are able to perform basic research, such as the development of image-guided technologies, preclinical model testing, as well as preclinical imaging, and then translate that research into the GTx OR. This OR allows for the utilization of new technologies in cancer therapy, including molecular imaging, and other innovative imaging modalities, and therefore enables a better quality of life for
Evaluation of the root canal shaping ability of two rotary nickel-titanium systems.

PubMed

Al-Manei, K K; Al-Hadlaq, S M S

2014-10-01

The aim was to investigate the canal shaping abilities of the twisted file (TF) and GT series X file (GTX) systems. Sixty mesial root canals of mandibular molars with curvatures of 15-50° were divided randomly into two groups of 30 canals each. The teeth were sectioned horizontally at 3, 6 and 9 mm from the apex. Root canals were prepared with TF and GTX files, respectively, and the shaping abilities of the systems were evaluated at three levels (coronal, middle and apical) based on the comparison of pre- and post-instrumentation photographs using AutoCAD software. Preparation time was also assessed. Data from the two groups were compared statistically using the Student's t-test. There was no significant difference between the rotary systems in terms of change in root canal cross-sectional area, root canal transportation, centring ability or minimum dentine thickness. Remaining dentine thickness at the coronal and middle levels was similar in the TF and GTX groups, but GTX instruments left significantly less dentine than TF instruments on the mesial aspects of root canals at the apical level. Root canal preparation with TF instruments required significantly less time than with GTX instruments. The TF and GTX NiTi rotary instruments showed similar shaping abilities, but root canal preparation was more rapid with the TF than with the GTX system. © 2014 International Endodontic Journal. Published by John Wiley & Sons Ltd.
Selective isolation of gonyautoxins 1,4 from the dinoflagellate Alexandrium minutum based on molecularly imprinted solid-phase extraction.

PubMed

Lian, Ziru; Wang, Jiangtao

2017-09-15

Gonyautoxins 1,4 (GTX1,4) from Alexandrium minutum samples were isolated selectively and recognized specifically by an innovative and effective extraction procedure based on molecular imprinting technology. Novel molecularly imprinted polymer microspheres (MIPMs) were prepared by a double-templated imprinting strategy using caffeine and pentoxifylline as dummy templates. The synthesized polymers displayed good affinity to GTX1,4 and were applied as sorbents. Further, an off-line molecularly imprinted solid-phase extraction (MISPE) protocol was optimized and an effective approach based on the MISPE coupled with HPLC-FLD was developed for selective isolation of GTX1,4 from the cultured A. minutum samples. The separation method showed good extraction efficiency (73.2-81.5%) for GTX1,4, and efficient removal of interfering matrices was also achieved after the MISPE process for the microalgal samples. The outcome demonstrated the superiority and great potential of the MISPE procedure for direct separation of GTX1,4 from marine microalgal extracts. Copyright © 2017. Published by Elsevier Ltd.
The visible ear simulator: a public PC application for GPU-accelerated haptic 3D simulation of ear surgery based on the visible ear data.

PubMed

Sorensen, Mads Solvsten; Mosegaard, Jesper; Trier, Peter

2009-06-01

Existing virtual simulators for middle ear surgery are based on 3-dimensional (3D) models from computed tomographic or magnetic resonance imaging data in which image quality is limited by the lack of detail (maximum, approximately 50 voxels/mm³), natural color, and texture of the source material. Virtual training often requires the purchase of a program, a customized computer, and expensive peripherals dedicated exclusively to this purpose. The Visible Ear freeware library of digital images from a fresh-frozen human temporal bone was segmented, and real-time volume rendered as a 3D model of high-fidelity, true color, and great anatomic detail and realism of the surgically relevant structures. A haptic drilling model was developed for surgical interaction with the 3D model. Realistic visualization in high-fidelity (approximately 125 voxels/mm³) and true color, 2D, or optional anaglyph stereoscopic 3D was achieved on a standard Core 2 Duo personal computer with a GeForce 8800 GTX graphics card, and surgical interaction was provided through a relatively inexpensive (approximately $2,500) Phantom Omni haptic 3D pointing device. This prototype is published for download (approximately 120 MB) as freeware at http://www.alexandra.dk/ves/index.htm. With increasing personal computer performance, future versions may include enhanced resolution (up to 8,000 voxels/mm³) and realistic interaction with deformable soft tissue components such as skin, tympanic membrane, dura, and cholesteatomas, features some of which are not possible with computed tomographic/magnetic resonance imaging-based systems.
Scalable streaming tools for analyzing N-body simulations: Finding halos and investigating excursion sets in one pass

NASA Astrophysics Data System (ADS)

Ivkin, N.; Liu, Z.; Yang, L. F.; Kumar, S. S.; Lemson, G.; Neyrinck, M.; Szalay, A. S.; Braverman, V.; Budavari, T.

2018-04-01

Cosmological N-body simulations play a vital role in studying models for the evolution of the Universe. To compare to observations and make a scientific inference, statistical analysis on large simulation datasets, e.g., finding halos, obtaining multi-point correlation functions, is crucial. However, traditional in-memory methods for these tasks do not scale to the datasets that are forbiddingly large in modern simulations. Our prior paper (Liu et al., 2015) proposes memory-efficient streaming algorithms that can find the largest halos in a simulation with up to 10^9 particles on a small server or desktop. However, this approach fails when directly scaling to larger datasets. This paper presents a robust streaming tool that leverages state-of-the-art techniques on GPU boosting, sampling, and parallel I/O, to significantly improve performance and scalability. Our rigorous analysis of the sketch parameters improves the previous results from finding the centers of the 10^3 largest halos (Liu et al., 2015) to ~10^4-10^5, and reveals the trade-offs between memory, running time and number of halos. Our experiments show that our tool can scale to datasets with up to ~10^12 particles while using less than an hour of running time on a single Nvidia GTX 1080 GPU.
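The one-pass idea can be illustrated with a small Count-Min sketch that tracks the densest grid cells in a stream of particle cell indices. This is a swapped-in textbook technique for illustration only, not necessarily the sketch used by the authors; the class `CountMinSketch`, the helper `heaviest_cells`, the table sizes, and the toy stream are all assumptions.

```python
import hashlib
import random


class CountMinSketch:
    """Fixed-size counter table whose estimates never undercount."""

    def __init__(self, width, depth):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # One deterministic hash per row, derived from the row number.
        digest = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def add(self, item):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += 1

    def estimate(self, item):
        return min(self.table[row][self._index(row, item)]
                   for row in range(self.depth))


def heaviest_cells(stream, k, width=256, depth=4):
    """Return up to k cell ids with the largest estimated particle counts."""
    sketch = CountMinSketch(width, depth)
    candidates = {}
    for cell in stream:  # single pass; memory does not grow with stream length
        sketch.add(cell)
        candidates[cell] = sketch.estimate(cell)
        if len(candidates) > k:
            del candidates[min(candidates, key=candidates.get)]
    return sorted(candidates, key=candidates.get, reverse=True)


# Assumed toy stream: two dense cells (ids 7 and 3) among 200 sparse cells.
cells = [7] * 500 + [3] * 300 + list(range(1000, 1200))
random.Random(0).shuffle(cells)
print(heaviest_cells(cells, k=2))
```

Memory use is set by `width`, `depth` and `k` rather than by the number of particles, which is the property that lets streaming approaches of this kind handle datasets too large to hold in memory. Wider tables reduce overcounting from hash collisions at the cost of more memory.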
A real-time standard parts inspection based on deep learning

NASA Astrophysics Data System (ADS)

Xu, Kuan; Li, XuDong; Jiang, Hongzhi; Zhao, Huijie

2017-10-01

Standard parts are necessary components in mechanical structures such as bogies and connectors. These mechanical structures will shatter or loosen if standard parts are lost, so real-time standard parts inspection systems are essential to guarantee their safety. Researchers favor inspection systems based on deep learning because it works well on images with complex backgrounds, which are common in standard parts inspection. A typical inspection detection system contains two basic components: feature extractors and object classifiers. For the object classifier, the Region Proposal Network (RPN) is one of the most essential architectures in most state-of-the-art object detection systems. However, in the basic RPN architecture, the Region of Interest (ROI) proposals have fixed sizes (9 anchors for each pixel); they are effective, but they waste considerable computing resources and time. In standard parts detection, the parts have known sizes, so we can manually choose the sizes of anchors based on the ground truths through machine learning. The experiments show that we can use 2 anchors to achieve almost the same accuracy and recall rate. Our standard parts detection system reaches 15 fps on an NVIDIA GTX1080 GPU while achieving a detection accuracy of 90.01% mAP.
High definition live 3D-OCT in vivo: design and evaluation of a 4D OCT engine with 1 GVoxel/s

PubMed Central

Wieser, Wolfgang; Draxinger, Wolfgang; Klein, Thomas; Karpf, Sebastian; Pfeiffer, Tom; Huber, Robert

2014-01-01

We present a 1300 nm OCT system for volumetric real-time live OCT acquisition and visualization at 1 billion volume elements per second. All technological challenges and problems associated with such high scanning speed are discussed in detail as well as the solutions. In one configuration, the system acquires, processes and visualizes 26 volumes per second where each volume consists of 320 x 320 depth scans and each depth scan has 400 usable pixels. This is the fastest real-time OCT to date in terms of voxel rate. A 51 Hz volume rate is realized with half the frame number. In both configurations the speed can be sustained indefinitely. The OCT system uses a 1310 nm Fourier domain mode locked (FDML) laser operated at 3.2 MHz sweep rate. Data acquisition is performed with two dedicated digitizer cards, each running at 2.5 GS/s, hosted in a single desktop computer. Live real-time data processing and visualization are realized with custom developed software on an NVidia GTX 690 dual graphics processing unit (GPU) card. To evaluate potential future applications of such a system, we present volumetric videos captured at 26 and 51 Hz of planktonic crustaceans and skin. PMID:25401010
A Large Scale, High Resolution Agent-Based Insurgency Model

DTIC Science & Technology

2013-09-30

CUDA) is NVIDIA Corporation’s software development model for General Purpose Programming on Graphics Processing Units (GPGPU) (NVIDIA Corporation ...Conference. Argonne National Laboratory, Argonne, IL, October, 2005. NVIDIA Corporation. NVIDIA CUDA Programming Guide 2.0 [Online]. NVIDIA Corporation
Optimization of hydrophilic interaction liquid chromatography/mass spectrometry and development of solid-phase extraction for the determination of paralytic shellfish poisoning toxins.

PubMed

Turrell, Elizabeth; Stobo, Lesley; Lacaze, Jean-Pierre; Piletsky, Sergey; Piletska, Elena

2008-01-01

The combination of hydrophilic interaction liquid chromatography (HILIC) and liquid chromatography/mass spectrometry (LC/MS) for the determination of paralytic shellfish poisoning (PSP) toxins has been proposed for use in routine monitoring of shellfish. In this study, methods for the detection of multiple PSP toxins [saxitoxin (STX), neosaxitoxin (NEO), decarbamoyl saxitoxin (dcSTX), decarbamoyl neosaxitoxin (dcNEO), gonyautoxins 1-5 (GTX1, GTX2, GTX3, GTX4, GTX5), decarbamoyl gonyautoxins (dcGTX2 and dcGTX3), and the N-sulfocarbamoyl C toxins (C1 and C2)] were optimized using single (MS) and triple quadrupole (MS/MS) instruments. Chromatographic separation of the toxins was achieved by using a TSK-gel Amide-80 analytical column, although superior chromatography was observed through application of a ZIC-HILIC column. Preparative procedures used to clean up shellfish extracts and concentrate PSP toxins prior to analysis were investigated. The capacity of computationally designed polymeric (CDP) materials and HILIC solid-phase extraction (SPE) cartridges to retain highly polar PSP toxins was explored. Three CDP materials and 2 HILIC cartridges were assessed for the extraction of PSP toxins from aqueous solution. Screening of the CDPs showed that all tested polymers adsorbed PSP toxins. A variety of elution procedures were examined, with dilute 0.01% acetic acid providing optimum recovery from a CDP based on 2-(trifluoromethyl)acrylic acid as the monomer. ZIC-HILIC SPE cartridges were superior to the PolyLC equivalent, with recoveries ranging from 70 to 112% (ZIC-HILIC) and 0 to 90% (PolyLC) depending on the PSP toxin. It is proposed that optimized SPE and HILIC-MS methods can be applied for the quantitative determination of PSP toxins in shellfish.
Experimental and computational studies on molecularly imprinted solid-phase extraction for gonyautoxins 2,3 from dinoflagellate Alexandrium minutum.

PubMed

Lian, Ziru; Li, Hai-Bei; Wang, Jiangtao

2016-08-01

An innovative and effective extraction procedure based on molecularly imprinted solid-phase extraction (MISPE) was developed for the isolation of gonyautoxins 2,3 (GTX2,3) from Alexandrium minutum samples. Molecularly imprinted polymer microspheres were prepared by suspension polymerization and were employed as sorbents for the solid-phase extraction of GTX2,3. An off-line MISPE protocol was optimized. Subsequently, the extract samples from A. minutum were analyzed. The results showed that the interference matrices in the extract were clearly cleaned up by the MISPE procedure. This outcome enabled the direct extraction of GTX2,3 in A. minutum samples with extraction efficiency as high as 83%, rather significantly, without any need for a cleanup step prior to the extraction. Furthermore, a computational approach also provided direct evidence of the highly selective isolation of GTX2,3 from the microalgal extracts.
77 FR 26789 - Certain Semiconductor Chips Having Synchronous Dynamic Random Access Memory Controllers and...

Federal Register 2010, 2011, 2012, 2013, 2014

2012-05-07

... patents. 73 FR 75131. The principal respondent was NVIDIA Corporation of Santa Clara, California (``NVIDIA''). Joining NVIDIA as respondents were approximately twenty of NVIDIA's customers. The Commission found a... accused products in the United States: NVIDIA; Hewlett-Packard Co. of Palo Alto, California; ASUS Computer...
Spatial 3D infrastructure: display-independent software framework, high-speed rendering electronics, and several new displays

NASA Astrophysics Data System (ADS)

Chun, Won-Suk; Napoli, Joshua; Cossairt, Oliver S.; Dorval, Rick K.; Hall, Deirdre M.; Purtell, Thomas J., II; Schooler, James F.; Banker, Yigal; Favalora, Gregg E.

2005-03-01

We present a software and hardware foundation to enable the rapid adoption of 3-D displays. Different 3-D displays - such as multiplanar, multiview, and electroholographic displays - naturally require different rendering methods. The adoption of these displays in the marketplace will be accelerated by a common software framework. The authors designed the SpatialGL API, a new rendering framework that unifies these display methods under one interface. SpatialGL enables complementary visualization assets to coexist through a uniform infrastructure. Also, SpatialGL supports legacy interfaces such as the OpenGL API. The authors' first implementation of SpatialGL uses multiview and multislice rendering algorithms to exploit the performance of modern graphics processing units (GPUs) to enable real-time visualization of 3-D graphics from medical imaging, oil & gas exploration, and homeland security. At the time of writing, SpatialGL runs on COTS workstations (both Windows and Linux) and on Actuality's high-performance embedded computational engine that couples an NVIDIA GeForce 6800 Ultra GPU, an AMD Athlon 64 processor, and a proprietary, high-speed, programmable volumetric frame buffer that interfaces to a 1024 x 768 x 3 digital projector. Progress is illustrated using an off-the-shelf multiview display, Actuality's multiplanar Perspecta Spatial 3D System, and an experimental multiview display. The experimental display is a quasi-holographic view-sequential system that generates aerial imagery measuring 30 mm x 25 mm x 25 mm, providing 198 horizontal views.
GPGPU-based explicit finite element computations for applications in biomechanics: the performance of material models, element technologies, and hardware generations.

PubMed

Strbac, V; Pierce, D M; Vander Sloten, J; Famaey, N

2017-12-01

Finite element (FE) simulations are increasingly valuable in assessing and improving the performance of biomedical devices and procedures. Due to high computational demands such simulations may become difficult or even infeasible, especially when considering nearly incompressible and anisotropic material models prevalent in analyses of soft tissues. Implementations of GPGPU-based explicit FEs predominantly cover isotropic materials, e.g. the neo-Hookean model. To elucidate the computational expense of anisotropic materials, we implement the Gasser-Ogden-Holzapfel dispersed, fiber-reinforced model and compare solution times against the neo-Hookean model. Implementations of GPGPU-based explicit FEs conventionally rely on single-point (under) integration. To elucidate the expense of full and selective-reduced integration (more reliable) we implement both and compare corresponding solution times against those generated using underintegration. To better understand the advancement of hardware, we compare results generated using representative Nvidia GPGPUs from three recent generations: Fermi (C2075), Kepler (K20c), and Maxwell (GTX980). We explore scaling by solving the same boundary value problem (an extension-inflation test on a segment of human aorta) with progressively larger FE meshes. Our results demonstrate substantial improvements in simulation speeds relative to two benchmark FE codes (up to 300× while maintaining accuracy), and thus open many avenues to novel applications in biomechanics and medicine.
Real-time stereo vision-based lane detection system

NASA Astrophysics Data System (ADS)

Fan, Rui; Dahnoun, Naim

2018-07-01

The detection of multiple curved lane markings on a non-flat road surface is still a challenging task for vehicular systems. To make an improvement, depth information can be used to enhance the robustness of the lane detection systems. In this paper, a proposed lane detection system is developed from our previous work where the estimation of the dense vanishing point is further improved using the disparity information. However, the outliers in the least squares fitting severely affect the accuracy when estimating the vanishing point. Therefore, in this paper we use random sample consensus to update the parameters of the road model iteratively until the percentage of the inliers exceeds our pre-set threshold. This significantly helps the system to overcome some suddenly changing conditions. Furthermore, we propose a novel lane position validation approach which computes the energy of each possible solution and selects all satisfying lane positions for visualisation. The proposed system is implemented on a heterogeneous system which consists of an Intel Core i7-4720HQ CPU and an NVIDIA GTX 970M GPU. A processing speed of 143 fps has been achieved, which is over 38 times faster than our previous work. Moreover, in order to evaluate the detection precision, we tested 2495 frames including 5361 lanes. It is shown that the overall successful detection rate is increased from 98.7% to 99.5%.
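The iterative random sample consensus (RANSAC) step described in this abstract can be sketched as follows. This is a minimal Python sketch, not the authors' implementation: it assumes a straight-line model y = a*x + b in place of the road model (which the abstract does not specify), and the names `fit_line`, `ransac_line`, `inlier_tol` and `min_inlier_ratio` are hypothetical.

```python
import random


def fit_line(points):
    """Least-squares fit of y = a*x + b to a list of (x, y) points."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("degenerate sample: all x values are equal")
    a = (n * sxy - sx * sy) / denom
    b = (sy - a * sx) / n
    return a, b


def ransac_line(points, inlier_tol=1.0, min_inlier_ratio=0.8,
                max_iters=200, seed=0):
    """Refit the model from random samples until enough points are inliers.

    Assumed stopping rule: stop once the inlier percentage reaches
    min_inlier_ratio (the pre-set threshold) or after max_iters samples.
    """
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(max_iters):
        p, q = rng.sample(points, 2)
        if p[0] == q[0]:
            continue  # a vertical pair cannot define y = a*x + b
        a, b = fit_line([p, q])
        inliers = [(x, y) for x, y in points
                   if abs(y - (a * x + b)) <= inlier_tol]
        if len(inliers) > len(best_inliers):
            # Update the model parameters using all current inliers.
            best_model, best_inliers = fit_line(inliers), inliers
            if len(best_inliers) / len(points) >= min_inlier_ratio:
                break
    return best_model, best_inliers
```

Refitting on the whole inlier set, rather than keeping the two-point model, is a common design choice that reduces the influence of the particular sample drawn.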
Multistage Analysis of Cyber Threats for Quick Mission Impact Assessment (CyberIA)

DTIC Science & Technology

2015-09-01

Corporation. NVIDIA ® is a registered trademark of the NVIDIA Corporation. CUDA™ is a trademark of the NVIDIA Corporation. Released by J. Lee...for developing and integrating different high-performance C/C++ algorithms. This capability is significant because NVIDIA ® CUDA™ architecture
Oxidative Stress Mechanisms Do Not Discriminate between Genotoxic and Nongenotoxic Liver Carcinogens.

PubMed

Deferme, Lize; Wolters, Jarno; Claessen, Sandra; Briedé, Jacco; Kleinjans, Jos

2015-08-17

It is widely accepted that in chemical carcinogenesis different modes-of-action exist, e.g., genotoxic (GTX) versus nongenotoxic (NGTX) carcinogenesis. In this context, it has been suggested that oxidative stress response pathways are typical for NGTX carcinogenesis. To evaluate this, we examined oxidative stress-related changes in gene expression, cell cycle distribution, and (oxidative) DNA damage in human hepatoma cells (HepG2) exposed to GTX-, NGTX-, and noncarcinogens, at multiple time points (4-8-24-48-72 h). Two GTX (azathioprine (AZA) and furan) and two NGTX (tetradecanoyl-phorbol-acetate (TPA) and tetrachloroethylene (TCE)) carcinogens as well as two noncarcinogens (diazinon (DZN) and d-mannitol (Dman)) were selected, while per class one compound was deemed to induce oxidative stress and the other not. Oxidative stressors AZA, TPA, and DZN induced a 10-fold higher number of gene expression changes over time compared to those of furan, TCE, or Dman treatment. Genes commonly expressed among AZA, TPA, and DZN were specifically involved in oxidative stress, DNA damage, and immune responses. However, differences in gene expression between GTX and NGTX carcinogens did not correlate to oxidative stress or DNA damage but could instead be assigned to compound-specific characteristics. This conclusion was underlined by results from functional readouts on ROS formation and (oxidative) DNA damage. Therefore, oxidative stress may represent the underlying cause for increased risk of liver toxicity and even carcinogenesis; however, it does not discriminate between GTX and NGTX carcinogens.

GPU Based N-Gram String Matching Algorithm with Score Table Approach for String Searching in Many Documents

NASA Astrophysics Data System (ADS)

Srinivasa, K. G.; Shree Devi, B. N.

2017-10-01

String searching in documents has become a tedious task with the evolution of Big Data. The generation of large data sets demands a high-performance search algorithm in areas such as text mining, information retrieval and many others. The popularity of GPUs for general-purpose computing has been increasing for various applications. It is therefore of great interest to exploit the thread feature of a GPU to provide a high-performance search algorithm. This paper proposes an optimized new approach to the N-gram model for string search in a number of lengthy documents, together with its GPU implementation. The algorithm exploits GPGPUs for searching strings in many documents, employing character-level N-gram matching with a parallel Score Table approach and search using the CUDA API. The Score Table, used to store the frequencies of N-grams in a document, makes the search independent of the document's length and allows faster access to the frequency values, thus decreasing the search complexity. The extensive thread feature of a GPU has been exploited to enable parallel pre-processing of trigrams in a document for Score Table creation and parallel search in a huge number of documents, thus speeding up the whole search process even for a large pattern size. Experiments were carried out for many documents of varied length and search strings from the standard Lorem Ipsum text on NVIDIA's GeForce GT 540M GPU with 96 cores. Results show that the parallel approach to Score Table creation and searching gives a good speedup over the same approach executed serially.
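The Score Table idea in this abstract can be illustrated with a serial sketch. This is a minimal Python sketch, not the authors' CUDA implementation: the table is assumed to be a per-document count of character-level trigrams, and the scoring rule (the fraction of the pattern's trigrams that occur in the document) is an assumption chosen for illustration. The names `build_score_table`, `pattern_score` and `search` are hypothetical.

```python
from collections import Counter


def build_score_table(text, n=3):
    """Count every character-level n-gram in text (one table per document)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def pattern_score(table, pattern, n=3):
    """Fraction of the pattern's n-grams that occur in the document's table.

    Each lookup is a hash access, so the cost depends on the pattern
    length and not on the document length.
    """
    grams = [pattern[i:i + n] for i in range(len(pattern) - n + 1)]
    if not grams:
        return 0.0
    return sum(1 for g in grams if table[g] > 0) / len(grams)


def search(documents, pattern, n=3):
    """Rank documents (a dict of name -> text) by pattern score, best first."""
    tables = {name: build_score_table(text, n)
              for name, text in documents.items()}
    scores = {name: pattern_score(t, pattern, n)
              for name, t in tables.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Table creation and scoring are independent per document, which is the property a GPU version would exploit by assigning documents or trigrams to separate threads.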
The effects of video game experience and active stereoscopy on performance in combat identification tasks.

PubMed

Keebler, Joseph R; Jentsch, Florian; Schuster, David

2014-12-01

We investigated the effects of active stereoscopic simulation-based training and individual differences in video game experience on multiple indices of combat identification (CID) performance. Fratricide is a major problem in combat operations involving military vehicles. In this research, we aimed to evaluate the effects of training on CID performance in order to reduce fratricide errors. Individuals were trained on 12 combat vehicles in a simulation, which were presented via either a non-stereoscopic or active stereoscopic display using NVIDIA's GeForce shutter glass technology. Self-report was used to assess video game experience, leading to four between-subjects groups: high video game experience with stereoscopy, low video game experience with stereoscopy, high video game experience without stereoscopy, and low video game experience without stereoscopy. We then tested participants on their memory of each vehicle's alliance and name across multiple measures, including photographs and videos. There was a main effect for both video game experience and stereoscopy across many of the dependent measures. Further, we found interactions between video game experience and stereoscopic training, such that those individuals with high video game experience in the non-stereoscopic group had the highest performance outcomes in the sample on multiple dependent measures. This study suggests that individual differences in video game experience may be predictive of enhanced performance in CID tasks. Selection based on video game experience in CID tasks may be a useful strategy for future military training. Future research should investigate the generalizability of these effects, such as identification through unmanned vehicle sensors.
High-performance 3D compressive sensing MRI reconstruction.

PubMed

Kim, Daehyun; Trzasko, Joshua D; Smelyanskiy, Mikhail; Haider, Clifton R; Manduca, Armando; Dubey, Pradeep

2010-01-01

Compressive Sensing (CS) is a nascent sampling and reconstruction paradigm that describes how sparse or compressible signals can be accurately approximated using many fewer samples than traditionally believed. In magnetic resonance imaging (MRI), where scan duration is directly proportional to the number of acquired samples, CS has the potential to dramatically decrease scan time. However, the computationally expensive nature of CS reconstructions has so far precluded their use in routine clinical practice - instead, more-easily generated but lower-quality images continue to be used. We investigate the development and optimization of a proven inexact quasi-Newton CS reconstruction algorithm on several modern parallel architectures, including CPUs, GPUs, and Intel's Many Integrated Core (MIC) architecture. Our (optimized) baseline implementation on a quad-core Core i7 is able to reconstruct a 256 × 160 × 80 volume of the neurovasculature from an 8-channel, 10× undersampled data set within 56 seconds, which is already a significant improvement over existing implementations. The latest six-core Core i7 reduces the reconstruction time further to 32 seconds. Moreover, we show that the CS algorithm benefits from modern throughput-oriented architectures. Specifically, our CUDA-based implementation on an NVIDIA GTX480 reconstructs the same dataset in 16 seconds, while Intel's Knights Ferry (KNF) of the MIC architecture reduces the time even further, to 12 seconds. This level of performance allows the neurovascular dataset to be reconstructed within a clinically viable time.
Physiological roles of Kv2 channels in entorhinal cortex layer II stellate cells revealed by Guangxitoxin‐1E

PubMed Central

Hönigsperger, Christoph; Nigro, Maximiliano J.

2016-01-01

Key points: Kv2 channels underlie delayed‐rectifier potassium currents in various neurons, although their physiological roles often remain elusive. Almost nothing is known about Kv2 channel functions in medial entorhinal cortex (mEC) neurons, which are involved in representing space, memory formation, epilepsy and dementia. Stellate cells in layer II of the mEC project to the hippocampus and are considered to be space‐representing grid cells. We used the new Kv2 blocker Guangxitoxin‐1E (GTx) to study Kv2 functions in these neurons. Voltage clamp recordings from mEC stellate cells in rat brain slices showed that GTx inhibited delayed‐rectifier K+ current but not transient A‐type current. In current clamp, GTx had multiple effects: (i) increasing excitability and bursting at moderate spike rates but reducing firing at high rates; (ii) enhancing after‐depolarizations; (iii) reducing the fast and medium after‐hyperpolarizations; (iv) broadening action potentials; and (v) reducing spike clustering. GTx is a useful tool for studying Kv2 channels and their functions in neurons. Abstract: The medial entorhinal cortex (mEC) is strongly involved in spatial navigation, memory, dementia and epilepsy. Although potassium channels shape neuronal activity, their roles in mEC are largely unknown. We used the new Kv2 blocker Guangxitoxin‐1E (GTx; 10–100 nm) in rat brain slices to investigate Kv2 channel functions in mEC layer II stellate cells (SCs). These neurons project to the hippocampus and are considered to be grid cells representing space. Voltage clamp recordings from SCs nucleated patches showed that GTx inhibited a delayed rectifier K+ current activating beyond –30 mV but not transient A‐type current. In current clamp, GTx (i) had almost no effect on the first action potential but markedly slowed repolarization of late spikes during repetitive firing; (ii) enhanced the after‐depolarization (ADP); (iii) reduced fast and medium after‐...
Supporting Real-Time Computer Vision Workloads using OpenVX on Multicore+GPU Platforms

DTIC Science & Technology

2015-05-01

a registered trademark of the NVIDIA Corporation. Report Documentation Page Form Approved OMB No. 0704-0188 Public reporting burden for the collection...from NVIDIA, we adapted an alpha-version of an NVIDIA OpenVX implementation called VisionWorks® [3] to run atop PGMRT (a graph-based middleware...time support to an OpenVX implementation by NVIDIA called VisionWorks. Our modifications were applied to an alpha-version of VisionWorks. This alpha
Analysis of grayanatoxin in Rhododendron honey and effect on antioxidant parameters in rats.

PubMed

Sibel, Silici; Enis, Yonar M; Hüseyin, Sahin; Timucin, Atayoğlu A; Duran, Ozkok

2014-10-28

Rhododendron honey, locally known as "mad honey", contains grayanotoxin (GTX) and thus induces toxic effects when consumed in large amounts. Nevertheless, it is still popularly used for treating medical conditions such as high blood pressure or gastro-intestinal disorders. The aim of this study was to evaluate the effect of GTX on antioxidant parameters measured in rats fed with Rhododendron honey. A total of sixty Sprague-Dawley female rats were divided into five groups of 12 rats each, one being the control group (Group 1) and the others being the experimental groups (Groups 2 to 5). Group 2 was treated with 0.015 mg/kg/bw of Grayanotoxin-III (GTX-III) standard preparation via intraperitoneal injection. Groups 3, 4 and 5 were respectively given Rhododendron honey (RH) at doses of 0.1, 0.5, and 2.5 g/kg/bw via oral gavage. After one hour, blood samples were collected from the rats. Glutathione peroxidase (GSH-Px), superoxide dismutase (SOD), catalase (CAT) activities and malondialdehyde (MDA) contents were examined in blood, heart, lungs, liver, kidney, testicles, epididymis, spleen and brain specimens. The data from the rats in Groups 2 (GTX) and 5 (RH at 2.5 g/kg/bw) showed a negative effect on the antioxidant parameters in blood and all tissue samples examined at the specified doses and time period. Administration of GTX to rats at a dose of 0.015 mg/kg/bw resulted in lipid peroxidation. Both Grayanotoxin and high-dose Rhododendron honey treatments showed an oxidant effect on the blood plasma and organ tissues investigated. Copyright © 2014 Elsevier Ireland Ltd. All rights reserved.
Removal of Paralytic Shellfish Toxins by Probiotic Lactic Acid Bacteria

PubMed Central

Vasama, Mari; Kumar, Himanshu; Salminen, Seppo; Haskard, Carolyn A.

2014-01-01

Paralytic shellfish toxins (PSTs) are non-protein neurotoxins produced by saltwater dinoflagellates and freshwater cyanobacteria. The ability of Lactobacillus rhamnosus strains GG and LC-705 (in viable and non-viable forms) to remove PSTs (saxitoxin (STX), neosaxitoxin (neoSTX), gonyautoxins 2 and 3 (GTX2/3), C-toxins 1 and 2 (C1/2)) from neutral and acidic solution (pH 7.3 and 2) was examined using HPLC. Binding decreased in the order of STX ~ neoSTX > C2 > GTX3 > GTX2 > C1. Removal of STX and neoSTX (77%–97.2%) was significantly greater than removal of GTX3 and C2 (33.3%–49.7%). There were no significant differences in toxin removal capacity between viable and non-viable forms of lactobacilli, which suggested that binding rather than metabolism is the mechanism of the removal of toxins. In general, binding was not affected by the presence of other organic molecules in solution. Importantly, this is the first study to demonstrate the ability of specific probiotic lactic bacteria to remove PSTs, particularly the most toxic PST-STX, from solution. Further, these results warrant thorough screening and assessment of safe and beneficial microbes for their usefulness in the seafood and water industries and their effectiveness in vivo. PMID:25046082
A Frequency Agile, Self-Adaptive Serial Link on Xilinx FPGAs

NASA Astrophysics Data System (ADS)

Aloisio, A.; Giordano, R.; Izzo, V.; Perrella, S.

2015-06-01

In this paper, we focused on the GTX transceiver modules of Xilinx Kintex 7 field-programmable gate arrays (FPGAs), which provide high bandwidth, low jitter on the recovered clock, and an equalization system on the transmitter and the receiver. We present a frequency agile, auto-adaptive serial link. The link is able to take care of the reconfiguration of the GTX parameters in order to fully benefit from the available link bandwidth, by setting the highest line rate. It is designed around an FPGA-embedded microprocessor, which drives the programmable ports of the GTX in order to control the quality of the received data and to easily calculate the bit-error rate in each sampling point of the eye diagram. We present the self-adaptive link project, the description of the test system, and the main results.
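The per-point bit-error rate mentioned in this abstract reduces to a ratio of error counts to observed bits at each sampling point of the eye diagram. The following is a minimal Python sketch of that arithmetic only: the data layout (dictionaries keyed by a hypothetical (horizontal offset, vertical offset) pair) and the names `ber_map`, `best_sampling_point` and `eye_opening` are assumptions for illustration and do not reflect the GTX register interface or the authors' firmware.

```python
def ber_map(error_counts, sample_counts):
    """Bit-error rate per sampling point: errors observed / bits observed.

    Both arguments are dicts keyed by an (h_offset, v_offset) tuple, which
    is an assumed layout. Points with no observed bits are skipped.
    """
    return {
        point: error_counts.get(point, 0) / bits
        for point, bits in sample_counts.items()
        if bits > 0
    }


def best_sampling_point(ber):
    """Sampling point with the lowest measured bit-error rate."""
    return min(ber, key=ber.get)


def eye_opening(ber, threshold):
    """Number of sampling points whose BER does not exceed threshold."""
    return sum(1 for value in ber.values() if value <= threshold)
```

A link controller could compare `eye_opening` across candidate line rates and keep the highest rate whose eye remains open, although the abstract does not state the authors' selection rule.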
WE-E-213CD-08: A Novel Level Set Active Contour Algorithm Using the Jensen-Renyi Divergence for Tumor Segmentation in PET.

PubMed

Markel, D; Naqa, I El

2012-06-01

Positron emission tomography (PET) presents a valuable resource for delineating the biological tumor volume (BTV) for image-guided radiotherapy. However, accurate and consistent image segmentation is a significant challenge within the context of PET, owing to its low spatial resolution and high levels of noise. Active contour methods based on level sets can be sensitive to noise and susceptible to failing in low-contrast regions. Therefore, this work evaluates a novel active contour algorithm applied to the task of PET tumor segmentation. A novel active contour segmentation algorithm based on maximizing the Jensen-Renyi Divergence between regions of interest was applied to the task of segmenting lesions in 7 patients with T3-T4 pharyngolaryngeal squamous cell carcinoma. The algorithm was implemented on an NVIDIA GeForce GTX 560M GPU. The cases were taken from the Louvain database, which includes contours of the macroscopically defined BTV drawn using histology of resected tissue. The images were pre-processed using denoising/deconvolution. The segmented volumes agreed well with the macroscopic contours, with an average concordance index and classification error of 0.6 ± 0.09 and 55 ± 16.5%, respectively. The algorithm in its present implementation requires approximately 0.5-1.3 sec per iteration and can reach convergence within 10-30 iterations. The Jensen-Renyi active contour method was shown to come close to, and in terms of concordance to outperform, a variety of PET segmentation methods that have been previously evaluated using the same data. Further evaluation on a larger dataset along with performance optimization is necessary before clinical deployment. © 2012 American Association of Physicists in Medicine.
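The quantity maximized by this method, the Jensen-Renyi divergence, is the Renyi entropy of a weighted mixture of distributions minus the weighted sum of their individual Renyi entropies. The sketch below computes it for intensity histograms of candidate regions. It is a minimal Python sketch under stated assumptions: the abstract gives neither the entropy order nor the weights, so `alpha=0.5` and equal weights are illustrative defaults, and the function names are hypothetical. The level set evolution itself is not shown.

```python
import math


def normalize(hist):
    """Turn a histogram of non-negative counts into a probability vector."""
    total = float(sum(hist))
    if total <= 0:
        raise ValueError("histogram must contain at least one count")
    return [h / total for h in hist]


def renyi_entropy(p, alpha=0.5):
    """Renyi entropy of order alpha (alpha > 0, alpha != 1), in nats."""
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must be positive and different from 1")
    return math.log(sum(pi ** alpha for pi in p if pi > 0)) / (1.0 - alpha)


def jensen_renyi_divergence(hists, weights=None, alpha=0.5):
    """Entropy of the mixture minus the weighted mean of the entropies.

    hists: equal-length histograms, e.g. intensities inside and outside
    the contour (an assumed choice of regions).
    """
    dists = [normalize(h) for h in hists]
    if weights is None:
        weights = [1.0 / len(dists)] * len(dists)
    mixture = [sum(w * d[i] for w, d in zip(weights, dists))
               for i in range(len(dists[0]))]
    mean_entropy = sum(w * renyi_entropy(d, alpha)
                       for w, d in zip(weights, dists))
    return renyi_entropy(mixture, alpha) - mean_entropy
```

For orders between 0 and 1 the Renyi entropy is concave, so the divergence is non-negative and equals zero when the regions have identical histograms.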
Numerical Integration with Graphical Processing Unit for QKD Simulation

DTIC Science & Technology

2014-03-27

Windows system application programming interface (API) timer. The problem sizes studied produce speedups greater than 60x on the NVIDIA Tesla C2075...CUDA API...CUDA and NVIDIA GPU Hardware...Theoretical Floating-Point Operations per Second for Intel CPUs and NVIDIA GPUs [3
Androgen receptor agonists increase lean mass, improve cardiopulmonary functions and extend survival in preclinical models of Duchenne muscular dystrophy.

PubMed

Ponnusamy, Suriyan; Sullivan, Ryan D; You, Dahui; Zafar, Nadeem; He Yang, Chuan; Thiyagarajan, Thirumagal; Johnson, Daniel L; Barrett, Maron L; Koehler, Nikki J; Star, Mayra; Stephenson, Erin J; Bridges, Dave; Cormier, Stephania A; Pfeffer, Lawrence M; Narayanan, Ramesh

2017-07-01

Duchenne muscular dystrophy (DMD) is a neuromuscular disease that predominantly affects boys as a result of mutation(s) in the dystrophin gene. DMD is characterized by musculoskeletal and cardiopulmonary complications, resulting in shorter life-span. Boys afflicted by DMD typically exhibit symptoms within 3-5 years of age and declining physical functions before attaining puberty. We hypothesized that rapidly deteriorating health of pre-pubertal boys with DMD could be due to diminished anabolic actions of androgens in muscle, and that intervention with an androgen receptor (AR) agonist will reverse musculoskeletal complications and extend survival. While castration of dystrophin and utrophin double mutant (mdx-dm) mice to mimic pre-pubertal nadir androgen condition resulted in premature death, maintenance of androgen levels extended the survival. Non-steroidal selective-AR modulator, GTx-026, which selectively builds muscle and bone was tested in X-linked muscular dystrophy mice (mdx). GTx-026 significantly increased body weight, lean mass and grip strength by 60-80% over vehicle-treated mdx mice. While vehicle-treated castrated mdx mice exhibited cardiopulmonary impairment and fibrosis of heart and lungs, GTx-026 returned cardiopulmonary function and intensity of fibrosis to healthy control levels. GTx-026 elicits its musculoskeletal effects through pathways that are distinct from dystrophin-regulated pathways, making AR agonists ideal candidates for combination approaches. While castration of mdx-dm mice resulted in weaker muscle and shorter survival, GTx-026 treatment increased the muscle mass, function and survival, indicating that androgens are important for extended survival. These preclinical results support the importance of androgens and the need for intervention with AR agonists to treat DMD-affected boys. © The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.
Differential Scanning Calorimetric (DSC) Analysis of Rotary Nickel-Titanium (NiTi) Endodontic File (RNEF)

NASA Astrophysics Data System (ADS)

Wu, Ray Chun Tung; Chung, C. Y.

2012-12-01

To determine the variation of the austenite finish temperature (Af) along the axial length of rotary nickel-titanium endodontic files (RNEF). Three commercial brands of 4% taper RNEF: GTX (#20, 25 mm, Dentsply Tulsa Dental Specialties, Tulsa, OK, USA), K3 (#25, 25 mm) and TF (Twisted File #25, 27 mm) (Sybron Kerr, Orange, CA, USA) were cut into segments at 4 mm increments from the working tip. Regional specimens were measured for differential heat-flow over thermal cycling, generally with continuous heating or cooling (5 °C/min) and a 5 min hold at set temperatures (start, finish temperatures): GTX: -55, 90 °C; K3: -55, 45 °C; TF: -55, 60 °C; using a differential scanning calorimeter. This experiment demonstrated regional differences in Af along the axial length of GTX and K3 files. Similar variation was not obvious in the TF samples. A contributory effect of regional difference in strain-hardening due to grinding and machining during manufacturing is proposed.
Fast skin dose estimation system for interventional radiology

PubMed Central

Takata, Takeshi; Kotoku, Jun’ichi; Maejima, Hideyuki; Kumagai, Shinobu; Arai, Norikazu; Kobayashi, Takenori; Shiraishi, Kenshiro; Yamamoto, Masayoshi; Kondo, Hiroshi; Furui, Shigeru

2018-01-01

To minimise the radiation dermatitis related to interventional radiology (IR), rapid and accurate dose estimation has been sought for all procedures. We propose a technique for estimating the patient skin dose rapidly and accurately using Monte Carlo (MC) simulation with a graphical processing unit (GPU, GTX 1080; Nvidia Corp.). The skin dose distribution is simulated based on an individual patient’s computed tomography (CT) dataset for fluoroscopic conditions after the CT dataset has been segmented into air, water and bone based on pixel values. The skin is assumed to be one layer at the outer surface of the body. Fluoroscopic conditions are obtained from a log file of a fluoroscopic examination. Estimating the absorbed skin dose distribution requires calibration of the dose simulated by our system. For this purpose, a linear function was used to approximate the relation between the simulated dose and the measured dose using radiophotoluminescence (RPL) glass dosimeters in a water-equivalent phantom. Differences of maximum skin dose between our system and the Particle and Heavy Ion Transport code System (PHITS) were as high as 6.1%. The relative statistical error (2 σ) for the simulated dose obtained using our system was ≤3.5%. Using a GPU, the simulation on the chest CT dataset aiming at the heart was within 3.49 s on average: the GPU is 122 times faster than a CPU (Core i7–7700K; Intel Corp.). Our system (using the GPU, the log file, and the CT dataset) estimated the skin dose more rapidly and more accurately than conventional methods. PMID:29136194
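The calibration step in this abstract, a linear function relating simulated dose to measured dose, can be sketched as an ordinary least-squares line fit. This is a minimal Python sketch, not the authors' code: the abstract does not say whether an intercept was used, so a slope-plus-intercept model is assumed, and the names `fit_calibration` and `calibrate` are hypothetical.

```python
def fit_calibration(simulated, measured):
    """Least-squares line: measured ~= slope * simulated + intercept.

    simulated: doses from the Monte Carlo simulation (arbitrary units).
    measured:  doses read from dosimeters at the same positions.
    """
    n = len(simulated)
    if n < 2 or n != len(measured):
        raise ValueError("need two or more paired dose values")
    mean_s = sum(simulated) / n
    mean_m = sum(measured) / n
    var_s = sum((s - mean_s) ** 2 for s in simulated)
    if var_s == 0:
        raise ValueError("simulated doses must not all be equal")
    cov = sum((s - mean_s) * (m - mean_m)
              for s, m in zip(simulated, measured))
    slope = cov / var_s
    intercept = mean_m - slope * mean_s
    return slope, intercept


def calibrate(simulated_dose, slope, intercept):
    """Convert a simulated dose into an estimated absorbed dose."""
    return slope * simulated_dose + intercept
```

Once fitted on phantom measurements, the same slope and intercept would be applied to every simulated skin dose value for a patient.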
3D Hydrodynamic Simulation of Classical Novae Explosions

NASA Astrophysics Data System (ADS)

Kendrick, Coleman J.

2015-01-01

This project investigates the formation and lifecycle of classical novae and determines how parameters such as white dwarf mass, star mass and separation affect the evolution of the rotating binary system. These parameters affect the accretion rate, the frequency of the nova explosions and the light curves. Each particle in the simulation represents a volume of hydrogen gas and is initialized randomly in the outer shell of the companion star. The forces on each particle include gravity, centrifugal, Coriolis, friction, and Langevin forces. The friction and Langevin forces are used to model the viscosity and internal pressure of the gas. A velocity Verlet method with a one-second time step is used to compute the velocities and positions of the particles. A new particle recycling method was developed, which was critical for computing an accurate and stable accretion rate and keeping the particle count reasonable. I used C++ and OpenCL to create my simulations and ran them on two Nvidia GTX580s. My simulations used up to 1 million particles and required up to 10 hours to complete. My simulation results for novae U Scorpii and DD Circinus are consistent with professional hydrodynamic simulations and observed experimental data (light curves and outburst frequencies). When the white dwarf mass is increased, the time between explosions decreases dramatically. My model was used to make the first prediction for the next outburst of nova DD Circinus. My simulations also show that the companion star blocks the expanding gas shell, leading to an asymmetrical expanding shell.
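The velocity Verlet update named in this abstract can be sketched for a single particle. This is a minimal Python sketch under simplifying assumptions, not the author's C++/OpenCL code: it works in two dimensions in a non-rotating frame and includes only the gravity of one point mass at the origin, so the centrifugal, Coriolis, friction and Langevin forces are omitted. The names `gravity_accel` and `velocity_verlet_step` are hypothetical.

```python
import math

G = 6.674e-11  # gravitational constant, m^3 kg^-1 s^-2


def gravity_accel(pos, mass):
    """Acceleration at pos (metres) due to a point mass (kg) at the origin."""
    x, y = pos
    r = math.hypot(x, y)
    k = -G * mass / r ** 3
    return (k * x, k * y)


def velocity_verlet_step(pos, vel, mass, dt=1.0):
    """Advance one particle by dt seconds with the velocity Verlet scheme.

    1. Move the position using the current velocity and acceleration.
    2. Recompute the acceleration at the new position.
    3. Update the velocity with the average of old and new accelerations.
    """
    ax, ay = gravity_accel(pos, mass)
    new_pos = (pos[0] + vel[0] * dt + 0.5 * ax * dt * dt,
               pos[1] + vel[1] * dt + 0.5 * ay * dt * dt)
    nax, nay = gravity_accel(new_pos, mass)
    new_vel = (vel[0] + 0.5 * (ax + nax) * dt,
               vel[1] + 0.5 * (ay + nay) * dt)
    return new_pos, new_vel
```

Velocity-dependent forces such as friction do not fit this form directly, because the new acceleration would depend on the velocity still being computed; how the original simulation handles that is not stated in the abstract.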
SU-E-T-422: Fast Analytical Beamlet Optimization for Volumetric Intensity-Modulated Arc Therapy

DOE Office of Scientific and Technical Information (OSTI.GOV)

Chan, Kenny S K; Lee, Louis K Y; Xing, L

2015-06-15

Purpose: To implement a fast optimization algorithm on a CPU/GPU heterogeneous computing platform and to obtain an optimal fluence for a given target dose distribution from the pre-calculated beamlets using an analytical approach. Methods: The 2D target dose distribution was modeled as an n-dimensional vector and estimated by a linear combination of independent basis vectors. The basis set was composed of the pre-calculated beamlet dose distributions at every 6 degrees of gantry angle, and the cost function was set as the squared magnitude of the vector difference between the target and the estimated dose distribution. The optimal weighting of the basis, which corresponds to the optimal fluence, was obtained analytically by the least-squares method. Those basis vectors with a positive weighting were selected for entering into the next level of optimization. In total, 7 levels of optimization were implemented in the study. Ten head-and-neck and ten prostate carcinoma cases were selected for the study and mapped to a round water phantom with a diameter of 20 cm. The Matlab computation was performed in a heterogeneous programming environment with an Intel i7 CPU and an NVIDIA GeForce 840M GPU. Results: In all selected cases, the estimated dose distribution was in good agreement with the given target dose distribution, and the correlation coefficients were found to be in the range of 0.9992 to 0.9997. The root-mean-square error decreased monotonically and converged after 7 cycles of optimization. The computation took only about 10 seconds, and the optimal fluence maps at each gantry angle throughout an arc were quickly obtained. Conclusion: An analytical approach is derived for finding the optimal fluence for a given target dose distribution, and a fast optimization algorithm implemented on the CPU/GPU heterogeneous computing environment greatly reduces the optimization time.
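The multi-level procedure described in the Methods can be sketched as repeated least-squares fits that keep only the positively weighted basis vectors. This is a minimal Python sketch, not the authors' Matlab/GPU code: it solves the normal equations with plain Gaussian elimination, assumes linearly independent basis vectors, and returns only the positive weights from the last level, which is an assumption about how the final fluence is read out. All function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("basis vectors are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def least_squares_weights(basis, target):
    """Weights w minimizing |sum_i w_i * basis_i - target|^2."""
    gram = [[dot(bi, bj) for bj in basis] for bi in basis]
    rhs = [dot(bi, target) for bi in basis]
    return solve_linear(gram, rhs)


def optimize_fluence(basis, target, levels=7):
    """Refit up to `levels` times, keeping positively weighted vectors."""
    active = list(range(len(basis)))
    weights = {}
    for _ in range(levels):
        w = least_squares_weights([basis[i] for i in active], target)
        weights = dict(zip(active, w))
        positive = [i for i in active if weights[i] > 0]
        if len(positive) == len(active) or not positive:
            break  # nothing left to discard, or nothing left to keep
        active = positive
    return {i: w for i, w in weights.items() if w > 0}
```

Discarding negative weights mirrors the physical constraint that a beamlet cannot deliver negative fluence.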
75 FR 44989 - In the Matter of Certain Semiconductor Chips Having Synchronous Dynamic Random Access Memory...

Federal Register 2010, 2011, 2012, 2013, 2014

2010-07-30

... following respondents: NVIDIA Corporation of Santa Clara, California; Asustek Computer, Inc. of Taipei... exclusion order and cease- and-desist orders against respondents NVIDIA Corp.; Hewlett-Packard Co.; ASUS...
Apical extrusion of Enterococcus faecalis using three different rotary instrumentation techniques: an in vitro study.

PubMed

Taneja, Sonali; Kumari, Manju; Barua, Madhumita; Dudeja, Chetna; Malik, Meeta

2015-01-01

To compare the apical extrusion of Enterococcus faecalis after instrumentation with three different Ni-Ti rotary instruments in an in vitro study. Methods and Material: Forty freshly extracted mandibular premolars were mounted in a bacteria collection apparatus and the root canals were contaminated with a suspension of Enterococcus faecalis. The contaminated teeth were divided into 4 groups of 10 teeth each according to the rotary system used for instrumentation: Group 1: Hyflex files, Group 2: GTX files, Group 3: Protaper files and Group 4: control group (no instrumentation). Bacteria extruded after preparation were collected into vials and microbiological samples were incubated in BHI broth for 24 hrs. The colony forming units were determined for each sample. Statistical analysis was done using one-way ANOVA followed by a post hoc independent t-test. GTX files extruded the least amount of bacteria, followed by Hyflex files. Maximum extrusion of E. faecalis was seen in the rotary Protaper group. The least extrusion was seen with GTX files, followed by Hyflex files and then the rotary Protaper system.
Prevalence, Variability and Bioconcentration of Saxitoxin-Group in Different Marine Species Present in the Food Chain

PubMed Central

Oyaneder Terrazas, Javiera; Contreras, Héctor R.; García, Carlos

2017-01-01

The saxitoxin-group (STX-group) corresponds to toxic metabolites produced by cyanobacteria and dinoflagellates of the genera Alexandrium, Gymnodinium, and Pyrodinium. Over the last decade, it has been possible to extrapolate the areas contaminated with the STX-group worldwide, including Chile, a phenomenon that has affected ≈35% of the Southern Pacific coast territory, generating a high economic impact. The objective of this research was to study the toxicity of the STX-group in all aquatic organisms (bivalves, algae, echinoderms, crustaceans, tunicates, cephalopods, gastropods, and fish) present in areas with a variable presence of harmful algal blooms (HABs). Then, the toxic profiles of each species and dose of STX equivalents ingested by a 60 kg person from 400 g of shellfish were determined to establish the health risk assessment. The toxins with the highest prevalence detected were gonyautoxin-4/1 (GTX4/GTX1), gonyautoxin-3/2 (GTX3/GTX2), neosaxitoxin (neoSTX), decarbamoylsaxitoxin (dcSTX), and saxitoxin (STX), with average concentrations of 400, 2800, 280, 200, and 2000 µg kg−1 respectively, a species-specific variability, dependent on the evaluated tissue, which demonstrates the biotransformation of the analogues in the trophic transfer with a predominance of α-epimers in all toxic profiles. The identification in multiple vectors, as well as in unregulated species, suggests that a risk assessment and risk management update are required; also, chemical and specific analyses for the detection of all analogues associated with the STX-group need to be established. PMID:28604648
Challenges and Opportunities in Propulsion Simulations

DTIC Science & Technology

2015-09-24

leverage Nvidia GPU accelerators • Release common computational infrastructure as Distro A for collaboration • Add physics modules as either...Gemini (6.4 GB/s) Dual Rail EDR-IB (23 GB/s) Interconnect Topology 3D Torus Non-blocking Fat Tree Processors AMD Opteron™ NVIDIA Kepler™ IBM...POWER9 NVIDIA Volta™ File System 32 PB, 1 TB/s, Lustre® 120 PB, 1 TB/s, GPFS™ Peak power consumption 9 MW 10 MW Titan vs. Summit Source: R

SU-D-206-01: Employing a Novel Consensus Optimization Strategy to Achieve Iterative Cone Beam CT Reconstruction On a Multi-GPU Platform

DOE Office of Scientific and Technical Information (OSTI.GOV)

Li, B; Southern Medical University, Guangzhou, Guangdong; Tian, Z

Purpose: While compressed sensing-based cone-beam CT (CBCT) iterative reconstruction techniques have demonstrated tremendous capability of reconstructing high-quality images from undersampled noisy data, their long computation time still hinders wide application in routine clinical practice. The purpose of this study is to develop a reconstruction framework that employs modern consensus optimization techniques to achieve CBCT reconstruction on a multi-GPU platform for improved computational efficiency. Methods: Total projection data were evenly distributed to multiple GPUs. Each GPU performed reconstruction using its own projection data with a conventional total variation regularization approach to ensure image quality. In addition, the solutions from GPUs were subject to a consistency constraint that they should be identical. We solved the optimization problem with all the constraints considered rigorously using an alternating direction method of multipliers (ADMM) algorithm. The reconstruction framework was implemented using OpenCL on a platform with two Nvidia GTX590 GPU cards, each with two GPUs. We studied the performance of our method and demonstrated its advantages through a simulation case with an NCAT phantom and an experimental case with a Catphan phantom. Result: Compared with the CBCT images reconstructed using the conventional FDK method with full projection datasets, our proposed method achieved comparable image quality with about one third of the projections. The computation time on the multi-GPU platform was ∼55 s and ∼35 s in the two cases respectively, achieving a speedup factor of ∼3.0 compared with single GPU reconstruction. Conclusion: We have developed a consensus ADMM-based CBCT reconstruction method which enabled performing reconstruction on a multi-GPU platform. The achieved efficiency made this method clinically attractive.
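The consensus idea in this abstract (each worker solves a local problem on its own share of the data while all local solutions are forced to agree) can be sketched with a deliberately tiny example. This is not the authors' OpenCL framework: it uses a scalar unknown, a plain least-squares term per worker with no total variation regularization, and a hypothetical function name `consensus_admm`. It follows the standard global-variable consensus form of ADMM.

```python
def consensus_admm(a, b, rho=1.0, iterations=200):
    """Minimize sum_i 0.5 * (a[i] * x - b[i])**2 with consensus ADMM.

    Worker i keeps a local copy x[i] and a scaled dual variable u[i];
    z is the shared consensus value that all copies must match.
    """
    n = len(a)
    z = 0.0
    u = [0.0] * n
    x = [0.0] * n
    for _ in range(iterations):
        # Local step: each worker minimizes its own term plus the
        # penalty (rho / 2) * (x - z + u[i])**2; closed form for a quadratic.
        x = [(a[i] * b[i] + rho * (z - u[i])) / (a[i] ** 2 + rho) for i in range(n)]
        # Consensus step: average the local copies (plus duals).
        z = sum(x[i] + u[i] for i in range(n)) / n
        # Dual step: accumulate each worker's disagreement with the consensus.
        u = [u[i] + x[i] - z for i in range(n)]
    return z, x


if __name__ == "__main__":
    z, local = consensus_admm([1.0, 2.0], [1.0, 4.0])
    print(z, local)
```

Only the consensus step needs data from every worker, which is what makes this structure attractive when the local steps run on separate GPUs. For the toy data the iterates approach the pooled least-squares solution, (1·1 + 2·4) / (1² + 2²) = 1.8.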
Simultaneous Range-Velocity Processing and SNR Analysis of AFIT’s Random Noise Radar

DTIC Science & Technology

2012-03-22

reducing the overall processing time. Two computers, equipped with NVIDIA® GPUs, were used to process the collected data. The specifications for each...gather the results back to the CPU. Another company, AccelerEyes®, has developed a product called Jacket® that claims to be better than the parallel...Number of Processing Cores 4 8 Processor Speed 3.33 GHz 3.07 GHz Installed Memory 48 GB 48 GB GPU Make NVIDIA NVIDIA GPU Model Tesla 1060 Tesla C2070 GPU
Three Dimensional CFD Analysis of the GTX Combustor

NASA Technical Reports Server (NTRS)

Steffen, C. J., Jr.; Bond, R. B.; Edwards, J. R.

2002-01-01

The annular combustor geometry of a combined-cycle engine has been analyzed with three-dimensional computational fluid dynamics. Both subsonic combustion and supersonic combustion flowfields have been simulated. The subsonic combustion analysis was executed in conjunction with a direct-connect test rig. Two cold-flow and one hot-flow results are presented. The simulations compare favorably with the test data for the two cold flow calculations; the hot-flow data was not yet available. The hot-flow simulation indicates that the conventional ejector-ramjet cycle would not provide adequate mixing at the conditions tested. The supersonic combustion ramjet flowfield was simulated with frozen chemistry model. A five-parameter test matrix was specified, according to statistical design-of-experiments theory. Twenty-seven separate simulations were used to assemble surrogate models for combustor mixing efficiency and total pressure recovery. ScramJet injector design parameters (injector angle, location, and fuel split) as well as mission variables (total fuel massflow and freestream Mach number) were included in the analysis. A promising injector design has been identified that provides good mixing characteristics with low total pressure losses. The surrogate models can be used to develop performance maps of different injector designs. Several complex three-way variable interactions appear within the dataset that are not adequately resolved with the current statistical analysis.
Bayesian Methods and Confidence Intervals for Automatic Target Recognition of SAR Canonical Shapes

DTIC Science & Technology

2014-03-27

and DirectX [22]. The CUDA platform was developed by the NVIDIA Corporation to allow programmers access to the computational capabilities of the...were used for the intense repetitive computations. Developing CUDA software requires writing code for specialized compilers provided by NVIDIA and
Three Dimensional Numerical Simulation of Rocket-based Combined-cycle Engine Response During Mode Transition Events

NASA Technical Reports Server (NTRS)

Edwards, Jack R.; McRae, D. Scott; Bond, Ryan B.; Steffan, Christopher (Technical Monitor)

2003-01-01

The GTX program at NASA Glenn Research Center is designed to develop a launch vehicle concept based on rocket-based combined-cycle (RBCC) propulsion. Experimental testing, cycle analysis, and computational fluid dynamics modeling have all demonstrated the viability of the GTX concept, yet significant technical issues and challenges still remain. Our research effort develops a unique capability for dynamic CFD simulation of complete high-speed propulsion devices and focuses this technology toward analysis of the GTX response during critical mode transition events. Our principal attention is focused on Mode 1/Mode 2 operation, in which initial rocket propulsion is transitioned into thermal-throat ramjet propulsion. A critical element of the GTX concept is the use of an Independent Ramjet Stream (IRS) cycle to provide propulsion at Mach numbers less than 3. In the IRS cycle, rocket thrust is initially used for primary power, and the hot rocket plume is used as a flame-holding mechanism for hydrogen fuel injected into the secondary air stream. A critical aspect is the establishment of a thermal throat in the secondary stream through the combination of area reduction effects and combustion-induced heat release. This is a necessity to enable the power-down of the rocket and the eventual shift to ramjet mode. Our focus in this first year of the grant has been in three areas, each progressing directly toward the key initial goal of simulating thermal throat formation during the IRS cycle: CFD algorithm development; simulation of Mode 1 experiments conducted at Glenn's Rig 1 facility; and IRS cycle simulations. The remainder of this report discusses each of these efforts in detail and presents a plan of work for the next year.
Therapeutic transfusions of granulocytes collected by simple bag method for children with cancer and neutropenic infections: results of a single-centre pilot study.

PubMed

Kikuta, A; Ohto, H; Nemoto, K; Mochizuki, K; Sano, H; Ito, M; Suzuki, H

2006-07-01

Granulocyte transfusion therapy (GTX) can be effective for life-threatening infections unresponsive to conventional antimicrobial therapies in severely neutropenic children with cancer. We developed a new granulocyte collection method, named the 'bag method', in which apheresis, hydroxyethyl starch (HES) or dexamethasone are not used. We undertook a pilot study to determine the feasibility and the safety of GTX collected by the bag method for children with cancer and life-threatening infections. A total of 25 GTX were administered to 13 patients (median age 3 years, range: 0.3-17; median weight 10.6 kg, range: 4.5-49.8) with neutropenia-related infections. Thirteen blood-relative donors received granulocyte colony-stimulating factor (G-CSF) (5-10 μg/kg), subcutaneously, 14 h before collection. Major end-points were granulocyte yields, post-transfusion absolute neutrophil counts (ANC) in patients, donor and patient safety, and clinical outcome on day 30. The median yield of ANC per 400 ml of processed whole blood was 6.2 × 10^9 (range: 2.5-15.0 × 10^9). Patients received a mean of 6.4 ± 0.8 × 10^8 granulocytes per kg of body weight per transfusion. The 1-h and 24-h post-transfusion ANC rose to 607 ± 124/μl and 704 ± 300/μl, respectively, from the baseline of 21/μl before the first GTX. Adverse reactions were observed in five of 13 donors (bone pain, headache, vasovagal reaction; all ≤ grade 2) and in two of 25 transfusions of 13 patients (transient hypoxia; grade 3). Ten patients had favourable responses, and infection resolved in nine patients. The bag method without apheresis relieves the physical load of donors and enables patients with a low body weight to provide an adequate dose of granulocytes.
Concentrations of gatifloxacin in plasma and urine and penetration into prostatic and seminal fluid, ejaculate, and sperm cells after single oral administrations of 400 milligrams to volunteers.

PubMed

Naber, C K; Steghafner, M; Kinzig-Schippers, M; Sauber, C; Sörgel, F; Stahlberg, H J; Naber, K G

2001-01-01

Gatifloxacin (GTX), a new fluoroquinolone with extended antibacterial activity, is an interesting candidate for the treatment of chronic bacterial prostatitis (CBP). Besides the antibacterial spectrum, the concentrations in the target tissues and fluids are crucial for the treatment of CBP. Thus, it was of interest to investigate its penetration into prostatic and seminal fluid. GTX concentrations in plasma, urine, ejaculate, prostatic and seminal fluid, and sperm cells were determined by a high-performance liquid chromatography method after oral intake of a single 400-mg dose in 10 male Caucasian volunteers in the fasting state. Simultaneous application of the renal contrast agent iohexol was used to estimate the maximal possible contamination of ejaculate and prostatic and seminal fluid by urine. GTX was well tolerated. The means (standard deviations) for the following parameters were as indicated: time to maximum concentration of drug in serum, 1.66 (0.91) h; maximum concentration of drug in serum, 2.90 (0.39) μg/ml; area under the concentration-time curve from 0 to 24 h, 25.65 μg · h/ml; and half life, 7.2 (0.90) h. Within 12 h about 50% of the drug was excreted unchanged into the urine. The mean renal clearance was 169 ml/min. The gatifloxacin concentrations in ejaculate, seminal fluid, and prostatic fluid were in the range of the corresponding plasma concentrations which were 1.92 (0.27) μg/ml at approximately the same time point (4 h after drug intake). The concentrations in sperm cells (0.195, 0.076, and 0.011 μg/ml) could be determined in three subjects. The good penetration into prostatic and seminal fluid, the good tolerance, and the previously reported broad antibacterial spectrum suggest that GTX may be a good alternative for the treatment of chronic bacterial prostatitis. Clinical studies should be performed to confirm this assumption.
Development of an Experiment High Performance Nozzle Research Program

NASA Technical Reports Server (NTRS)

2004-01-01

As proposed in the above OAI/NASA Glenn Research Center (GRC) Co-Operative Agreement the objective of the work was to provide consultation and assistance to the NASA GRC GTX Rocket Based Combined Cycle (RBCC) Program Team in planning and developing requirements, scale model concepts, and plans for an experimental nozzle research program. The GTX was one of the launch vehicle concepts being studied as a possible future replacement for the aging NASA Space Shuttle, and was one RBCC element in the ongoing NASA Access to Space R&D Program (Reference 1). The ultimate program objective was the development of an appropriate experimental research program to evaluate and validate proposed nozzle concepts, and thereby result in the optimization of a high performance nozzle for the GTX launch vehicle. Included in this task were the identification of appropriate existing test facilities, development of requirements for new non-existent test rigs and fixtures, develop scale nozzle model concepts, and propose corresponding test plans. Also included were the evaluation of originally proposed and alternate nozzle designs (in-house and contractor), evaluation of Computational Fluid Dynamics (CFD) study results, and make recommendations for geometric changes to result in improved nozzle thrust coefficient performance (Cfg).
Efficient implementation of the 3D-DDA ray traversal algorithm on GPU and its application in radiation dose calculation.

PubMed

Xiao, Kai; Chen, Danny Z; Hu, X Sharon; Zhou, Bo

2012-12-01

The three-dimensional digital differential analyzer (3D-DDA) algorithm is a widely used ray traversal method, which is also at the core of many convolution∕superposition (C∕S) dose calculation approaches. However, porting existing C∕S dose calculation methods onto graphics processing unit (GPU) has brought challenges to retaining the efficiency of this algorithm. In particular, straightforward implementation of the original 3D-DDA algorithm inflicts a lot of branch divergence which conflicts with the GPU programming model and leads to suboptimal performance. In this paper, an efficient GPU implementation of the 3D-DDA algorithm is proposed, which effectively reduces such branch divergence and improves performance of the C∕S dose calculation programs running on GPU. The main idea of the proposed method is to convert a number of conditional statements in the original 3D-DDA algorithm into a set of simple operations (e.g., arithmetic, comparison, and logic) which are better supported by the GPU architecture. To verify and demonstrate the performance improvement, this ray traversal method was integrated into a GPU-based collapsed cone convolution∕superposition (CCCS) dose calculation program. The proposed method has been tested using a water phantom and various clinical cases on an NVIDIA GTX570 GPU. The CCCS dose calculation program based on the efficient 3D-DDA ray traversal implementation runs 1.42 ∼ 2.67× faster than the one based on the original 3D-DDA implementation, without losing any accuracy. The results show that the proposed method can effectively reduce branch divergence in the original 3D-DDA ray traversal algorithm and improve the performance of the CCCS program running on GPU. Considering the wide utilization of the 3D-DDA algorithm, various applications can benefit from this implementation method.
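The general idea of replacing per-axis conditionals with arithmetic, comparison, and logic operations can be sketched as follows. This is an illustrative Python sketch of a unit-grid 3D-DDA step, not the authors' GPU implementation: the function name `dda_voxels`, the unit voxel size, and the requirement that the ray start inside the grid are assumptions. The point of interest is the axis selection, where comparison results are used as 0/1 masks so that every axis executes the same update instructions.

```python
import math


def dda_voxels(origin, direction, shape):
    """List the voxels of a unit grid visited by a ray that starts inside it."""
    if not any(direction):
        raise ValueError("direction must be non-zero")
    idx = [int(math.floor(c)) for c in origin]
    step = [1 if d > 0 else -1 for d in direction]
    t_max, t_delta = [], []  # ray parameter at next boundary / per-voxel increment
    for c, d, i in zip(origin, direction, idx):
        if d == 0:
            # This axis is never crossed: infinite distance, zero increment
            # (zero rather than inf so that mask * increment stays finite).
            t_max.append(math.inf)
            t_delta.append(0.0)
        else:
            boundary = i + (1 if d > 0 else 0)
            t_max.append((boundary - c) / d)
            t_delta.append(1.0 / abs(d))
    visited = []
    while all(0 <= i < n for i, n in zip(idx, shape)):
        visited.append(tuple(idx))
        # Comparisons become 0/1 masks; exactly one mask equals 1.
        mx = int(t_max[0] <= t_max[1]) & int(t_max[0] <= t_max[2])
        my = (1 - mx) & int(t_max[1] <= t_max[2])
        mz = 1 - mx - my
        for axis, mask in enumerate((mx, my, mz)):
            idx[axis] += mask * step[axis]
            t_max[axis] += mask * t_delta[axis]
    return visited


if __name__ == "__main__":
    print(dda_voxels((0.5, 0.25, 0.5), (1.0, 1.0, 0.0), (2, 2, 1)))
```

On a GPU the same masked updates would let all threads in a warp follow one instruction path regardless of which axis each ray crosses next; the Python loop here only demonstrates that the masked form visits the expected voxels.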
Techniques for Mapping Synthetic Aperture Radar Processing Algorithms to Multi-GPU Clusters

DTIC Science & Technology

2012-12-01

Experimental results were generated with 10 nVidia Tesla C2050 GPUs having maximum throughput of 972 Gflop/s. Our approach scales well for output
Synthesis of the Paralytic Shellfish Poisons (+)-Gonyautoxin 2, (+)-Gonyautoxin 3, and (+)-11,11-Dihydroxysaxitoxin.

PubMed

Mulcahy, John V; Walker, James R; Merit, Jeffrey E; Whitehead, Alan; Du Bois, J

2016-05-11

The paralytic shellfish poisons are a collection of guanidine-containing natural products that are biosynthesized by prokaryote and eukaryote marine organisms. These compounds bind and inhibit isoforms of the mammalian voltage-gated Na+ ion channel at concentrations ranging from 10^-11 to 10^-5 M. Here, we describe the de novo synthesis of three paralytic shellfish poisons, gonyautoxin 2, gonyautoxin 3, and 11,11-dihydroxysaxitoxin. Key steps include a diastereoselective Pictet-Spengler reaction and an intramolecular amination of an N-guanidyl pyrrole by a sulfonyl guanidine. The IC50 values of GTX 2, GTX 3, and 11,11-dhSTX have been measured against rat NaV1.4, and are found to be 22 nM, 15 nM, and 2.2 μM, respectively.
Comparison of the shaping ability of GT® Series X, Twisted Files and AlphaKite rotary nickel-titanium systems in simulated canals

PubMed Central

2013-01-01

Background: Efforts to improve the performance of rotary NiTi instruments by enhancing the properties of NiTi alloy, or their manufacturing processes rather than changes in instrument geometries have been reported. The aim of this study was to compare in-vitro the shaping ability of three different rotary nickel-titanium instruments produced by different manufacturing methods. Methods: Thirty simulated root canals with a curvature of 35° in resin blocks were prepared with three different rotary NiTi systems: AK- AlphaKite (Gebr. Brasseler, Germany), GTX- GT® Series X (Dentsply, Germany) and TF- Twisted Files (SybronEndo, USA). The canals were prepared according to the manufacturers’ instructions. Pre- and post-instrumentation images were recorded and assessment of canal curvature modifications was carried out with an image analysis program (GSA, Germany). The preparation time and incidence of procedural errors were recorded. Instruments were evaluated under a microscope at 15× magnification (Carl Zeiss OPMI Pro Ergo, Germany) for signs of deformation. The data were statistically analyzed using SPSS (Wilcoxon and Mann–Whitney U-tests, at a confidence interval of 95%). Results: Less canal transportation was produced by TF apically, although the difference among the groups was not statistically significant. GTX removed the greatest amount of resin from the middle and coronal parts of the canal and the difference among the groups was statistically significant (p < 0.05). The shortest preparation time was registered with TF (444 s) and the longest with GTX (714 s); the difference among the groups was statistically significant (p < 0.05). During the preparation of the canals no instrument fractured. Eleven instruments of TF and one of AK were deformed. Conclusion: Under the conditions of this study, all rotary NiTi instruments maintained the working length and prepared a well-shaped root canal. The least canal transportation was produced by AK. GTX
Separation of paralytic shellfish poisoning toxins on Chromarods-SIII by thin-layer chromatography with the Iatroscan (mark 5) and flame thermionic detection.

PubMed

Indrasena, W M; Ackman, R G; Gill, T A

1999-09-10

Thin-layer chromatography (TLC) on Chromarods-SIII with the Iatroscan (Mark-5) and a flame thermionic detector (FTID) was used to develop a rapid method for the detection of paralytic shellfish poisoning (PSP) toxins. The effects of variation in hydrogen (H2) flow, air flow, scan time and detector current on the FTID peak response for both phosphatidylcholine (PC) and PSP were studied in order to define optimum detection conditions. A combination of hydrogen and air flow-rates of 50 ml/min and 1.5-2.0 l/min respectively, along with a scan time of 40 s/rod and detector current of 3.0 A (ampere) or above were found to yield the best results for the detection of PSP compounds. Increasing the detector current level to as high as 3.3 A gave about 130 times more FTID response than did flame ionization detection (FID), for PSP components. Quantities of standards as small as 1 ng neosaxitoxin (NEO), 5 ng saxitoxin (STX), 5 ng B1-toxins (B1), 2 ng gonyautoxin (GTX) 2/3, 6 ng GTX 1/4 and 6 ng C-toxins (C1/C2) could be detected with the FTID. The method detection limits for toxic shellfish tissues using the FTID were 0.4, 2.1, 0.8 and 2.5 micrograms per g tissue for GTX 2/3, STX, NEO and C toxins, respectively. The FTID response increased with increasing detector current and with increasing scan time. Increasing hydrogen and air flow-rates resulted in decreasing sensitivity within defined limits. Numerous solvent systems were tested, and a solvent consisting of chloroform-methanol-water-acetic acid (30:50:8:2) could separate C toxins from GTX, which eluted ahead of NEO and STX. Accordingly, TLC/FTID with the Iatroscan (Mark-5) seems to be a promising, relatively inexpensive and rapid method of screening plant and animal tissues for PSP toxins.
Concentrations of Gatifloxacin in Plasma and Urine and Penetration into Prostatic and Seminal Fluid, Ejaculate, and Sperm Cells after Single Oral Administrations of 400 Milligrams to Volunteers

PubMed Central

Naber, Christoph K.; Steghafner, Michaela; Kinzig-Schippers, Martina; Sauber, Christian; Sörgel, Fritz; Stahlberg, Hans-Jürgen; Naber, Kurt G.

2001-01-01

Gatifloxacin (GTX), a new fluoroquinolone with extended antibacterial activity, is an interesting candidate for the treatment of chronic bacterial prostatitis (CBP). Besides the antibacterial spectrum, the concentrations in the target tissues and fluids are crucial for the treatment of CBP. Thus, it was of interest to investigate its penetration into prostatic and seminal fluid. GTX concentrations in plasma, urine, ejaculate, prostatic and seminal fluid, and sperm cells were determined by a high-performance liquid chromatography method after oral intake of a single 400-mg dose in 10 male Caucasian volunteers in the fasting state. Simultaneous application of the renal contrast agent iohexol was used to estimate the maximal possible contamination of ejaculate and prostatic and seminal fluid by urine. GTX was well tolerated. The means (standard deviations) for the following parameters were as indicated: time to maximum concentration of drug in serum, 1.66 (0.91) h; maximum concentration of drug in serum, 2.90 (0.39) μg/ml; area under the concentration-time curve from 0 to 24 h, 25.65 μg · h/ml; and half life, 7.2 (0.90) h. Within 12 h about 50% of the drug was excreted unchanged into the urine. The mean renal clearance was 169 ml/min. The gatifloxacin concentrations in ejaculate, seminal fluid, and prostatic fluid were in the range of the corresponding plasma concentrations which were 1.92 (0.27) μg/ml at approximately the same time point (4 h after drug intake). The concentrations in sperm cells (0.195, 0.076, and 0.011 μg/ml) could be determined in three subjects. The good penetration into prostatic and seminal fluid, the good tolerance, and the previously reported broad antibacterial spectrum suggest that GTX may be a good alternative for the treatment of chronic bacterial prostatitis. Clinical studies should be performed to confirm this assumption. PMID:11120980
DOE Office of Scientific and Technical Information (OSTI.GOV)

Kartsaklis, Christos; Civario, G

This paper discusses ongoing progress in the development of a Java-based library for rapid kernel prototyping in NVIDIA PTX and PTX instruction scheduling. It is aimed at developers seeking total control of emitted PTX, highly parametric emission, and tunable instruction reordering. It is primarily used for code development at ICHEC, but it is hoped that the NVIDIA GPU community will also find it beneficial.
A GPU Parallelization of the Absolute Nodal Coordinate Formulation for Applications in Flexible Multibody Dynamics

DTIC Science & Technology

2012-02-17

to be solved. Disclaimer: Reference herein to any specific commercial company, product, process, or service by trade name, trademark...data processing rather than data caching and control flow. To make use of this computational power, NVIDIA introduced a general purpose parallel...GPU implementations were run on an Intel Nehalem Xeon E5520 2.26GHz processor with an NVIDIA Tesla C2070 graphics card for varying numbers of
Operational Based Vision Assessment

DTIC Science & Technology

2014-02-01

formulated or supplied the drawings, specifications, or other data does not license the holder or any other person or corporation or convey any...expensive than other developers’ software. The sources for the GPUs (Nvidia) and the host computer (Concurrent’s iHawk) were identified. The...boundaries, which is a distracting artifact when performing visual tests. The problem has been isolated by the OBVA team to the Nvidia GPUs. The OBVA system
Design Tools for Accelerating Development and Usage of Multi-Core Computing Platforms

DTIC Science & Technology

2014-04-01

Government formulated or supplied the drawings, specifications, or other data does not license the holder or any other person or corporation; or convey...multicore PDSP platforms. The GPU-based capabilities of TDIF are currently oriented towards NVIDIA GPUs, based on the Compute Unified Device Architecture...CUDA) programming language [NVIDIA 2007], which can be viewed as an extension of C. The multicore PDSP capabilities currently in TDIF are oriented
The RISC-V Instruction Set Manual Volume 2: Privileged Architecture Version 1.7

DTIC Science & Technology

2015-05-09

DIG07-10227). Additional support came from Par Lab affiliates Nokia, NVIDIA, Oracle, and Samsung. • Project Isis: DoE Award DE-SC0003624. • ASPIRE...STARnet center funded by the Semiconductor Research Corporation. Additional support from ASPIRE industrial sponsor, Intel, and ASPIRE affiliates...Google, Huawei, Nokia, NVIDIA, Oracle, and Samsung. The content of this paper does not necessarily reflect the position or the policy of the US
Modeling & Analysis of Multicore Architectures for Embedded SIGINT Applications

DTIC Science & Technology

2015-03-01

NVIDIA Kepler K20 [7][8] 2496e 706 225 3520 15.6 Intel Xeon Phi 5110P [9] 60 1050 225 1010 4.5 Adapteva Epiphany [10] 16 – 4K 800 0.270 19 70.4...Cortex A15 and a Kepler GPU with 192 “CUDA” cores, and is more comparable as an HPEEC platform than Tesla series GPUs, such as the NVIDIA C2075 and K20

Integrating the Nqueens Algorithm into a Parameterized Benchmark Suite

DTIC Science & Technology

2016-02-01

FOB is a 64-node heterogeneous cluster consisting of 16-IBM dx360M4 nodes, each with one NVIDIA Kepler K20M GPUs and 48-IBM dx360M4 nodes, and each...nodes have 256-GB of memory and an NVIDIA Tesla K40 GPU. More details on Excalibur can be found on the US Army DSRC website.19 Figures 3 and 4 show the
Investigating the Mobility of Light Autonomous Tracked Vehicles Using a High Performance Computing Simulation Capability

DTIC Science & Technology

2012-08-01

UNCLASSIFIED: Distribution Statement A. Approved for public release. DISCLAIMER: Reference herein to any specific commercial company, product...FunctionBay, S. Korea – NVIDIA – Caterpillar – MSC.Software – Advanced Micro Devices (AMD) 14-16 AUG 2012 • Aaron Bartholomew • Makarand Datar...16GB DDR2 Graphics: 4x NVIDIA Tesla C1060 Power supply 1: 1000W Power supply 2: 750W Assembled Quad GPU Machine
Communication Efficient Gaussian Elimination with Partial Pivoting using a Shape Morphing Data Layout

DTIC Science & Technology

2013-02-21

support comes from ParLab affiliates National Instruments, Nokia, NVIDIA, Oracle and Samsung, as well as MathWorks. Research is also supported by DOE...affiliates National Instruments, Nokia, NVIDIA, Oracle and Samsung, as well as MathWorks. Research is also supported by DOE grants DE-SC0004938, DE-SC0005136...International Business Machines Company, 1966. [17] S. Toledo. Locality of reference in LU decomposition with partial pivoting. SIAM J. Matrix Anal. Appl., 18
Visual Media Reasoning - Terrain-based Geolocation

DTIC Science & Technology

2015-06-01

the drawings, specifications, or other data does not license the holder or any other person or corporation; or convey any rights or permission to...3.4 Alternative Metric Investigation This section describes a graphics processor unit (GPU) based implementation in the NVIDIA CUDA programming...utilizing 2 concurrent CPU cores, each controlling a single Nvidia C2075 Tesla Fermi CUDA card. Figure 22 shows a comparison of the CPU and the GPU powered
Multi-Core Programming Design Patterns: Stream Processing Algorithms for Dynamic Scene Perceptions

DTIC Science & Technology

2014-05-01

processor developed by IBM and other companies, incorporates the POWER5 processor as the Power Processor Element (PPE), one of the early general...deliver a power-efficient single-precision peak performance of more than 256 GFlops. Substantially more raw power became available later, when nVIDIA...algorithms, including IBM’s Cell/B.E., GPUs from NVidia and AMD and many-core CPUs from Intel.27 The vast growth of digital video content has been a
A low-complexity 2-point step size gradient projection method with selective function evaluations for smoothed total variation based CBCT reconstructions

NASA Astrophysics Data System (ADS)

Song, Bongyong; Park, Justin C.; Song, William Y.

2014-11-01

The Barzilai-Borwein (BB) 2-point step size gradient method is receiving attention for accelerating Total Variation (TV) based CBCT reconstructions. In order to become truly viable for clinical applications, however, its convergence property needs to be properly addressed. We propose a novel fast converging gradient projection BB method that requires ‘at most one function evaluation’ in each iterative step. This Selective Function Evaluation method, referred to as GPBB-SFE in this paper, exhibits the desired convergence property when it is combined with a ‘smoothed TV’ or any other differentiable prior. This way, the proposed GPBB-SFE algorithm offers fast and guaranteed convergence to the desired 3DCBCT image with minimal computational complexity. We first applied this algorithm to a Shepp-Logan numerical phantom. We then applied it to a CatPhan 600 physical phantom (The Phantom Laboratory, Salem, NY) and a clinically-treated head-and-neck patient, both acquired from the TrueBeam™ system (Varian Medical Systems, Palo Alto, CA). Furthermore, we accelerated the reconstruction by implementing the algorithm on an NVIDIA GTX 480 GPU card. We first compared GPBB-SFE with three recently proposed BB-based CBCT reconstruction methods available in the literature using a Shepp-Logan numerical phantom with 40 projections. It is found that GPBB-SFE shows either faster convergence speed/time or superior convergence property compared to existing BB-based algorithms. With the CatPhan 600 physical phantom, the GPBB-SFE algorithm requires only 3 function evaluations in 30 iterations and reconstructs the standard, 364-projection FDK reconstruction quality image using only 60 projections. We then applied the algorithm to a clinically-treated head-and-neck patient. It was observed that the GPBB-SFE algorithm requires only 18 function evaluations in 30 iterations. Compared with the FDK algorithm with 364 projections, the GPBB-SFE algorithm produces visibly equivalent quality CBCT
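The 2-point step size that the abstract above builds on can be illustrated with a minimal Python sketch. It shows only the plain Barzilai-Borwein step (the ratio s·s / s·y formed from the last two iterates and gradients); the gradient projection and the selective function evaluation of GPBB-SFE are not reproduced here. The function name `bb_gradient_descent`, the fallback rule, and the initial step are illustrative assumptions, not details taken from the paper.

```python
def bb_gradient_descent(grad, x0, iterations=30, initial_step=1e-3):
    """Minimize a smooth function using Barzilai-Borwein (BB) step sizes.

    `grad` maps a list of floats to its gradient (also a list of floats).
    All names and defaults here are illustrative assumptions.
    """
    x = list(x0)
    g = grad(x)
    step = initial_step
    for _ in range(iterations):
        x_new = [xi - step * gi for xi, gi in zip(x, g)]
        g_new = grad(x_new)
        s = [a - b for a, b in zip(x_new, x)]    # change in the iterate
        y = [a - b for a, b in zip(g_new, g)]    # change in the gradient
        sy = sum(a * b for a, b in zip(s, y))
        if sy > 0.0:
            # BB step: uses only the two most recent points, no line search
            step = sum(a * a for a in s) / sy
        else:
            # Assumed safeguard: fall back to the initial step
            step = initial_step
        x, g = x_new, g_new
    return x


# Usage on a small convex quadratic f(x) = 0.5 * (x0^2 + 10 * x1^2)
solution = bb_gradient_descent(lambda v: [v[0], 10.0 * v[1]], [1.0, 1.0])
```

The BB iteration is not monotone: the objective can rise briefly between steps, which is presumably why the abstract stresses that the convergence property has to be addressed separately.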
A low-complexity 2-point step size gradient projection method with selective function evaluations for smoothed total variation based CBCT reconstructions.

PubMed

Song, Bongyong; Park, Justin C; Song, William Y

2014-11-07

The Barzilai-Borwein (BB) 2-point step size gradient method is receiving attention for accelerating Total Variation (TV) based CBCT reconstructions. In order to become truly viable for clinical applications, however, its convergence property needs to be properly addressed. We propose a novel fast converging gradient projection BB method that requires 'at most one function evaluation' in each iterative step. This Selective Function Evaluation method, referred to as GPBB-SFE in this paper, exhibits the desired convergence property when it is combined with a 'smoothed TV' or any other differentiable prior. This way, the proposed GPBB-SFE algorithm offers fast and guaranteed convergence to the desired 3DCBCT image with minimal computational complexity. We first applied this algorithm to a Shepp-Logan numerical phantom. We then applied it to a CatPhan 600 physical phantom (The Phantom Laboratory, Salem, NY) and a clinically-treated head-and-neck patient, both acquired from the TrueBeam™ system (Varian Medical Systems, Palo Alto, CA). Furthermore, we accelerated the reconstruction by implementing the algorithm on an NVIDIA GTX 480 GPU card. We first compared GPBB-SFE with three recently proposed BB-based CBCT reconstruction methods available in the literature using a Shepp-Logan numerical phantom with 40 projections. It is found that GPBB-SFE shows either faster convergence speed/time or superior convergence property compared to existing BB-based algorithms. With the CatPhan 600 physical phantom, the GPBB-SFE algorithm requires only 3 function evaluations in 30 iterations and reconstructs the standard, 364-projection FDK reconstruction quality image using only 60 projections. We then applied the algorithm to a clinically-treated head-and-neck patient. It was observed that the GPBB-SFE algorithm requires only 18 function evaluations in 30 iterations. Compared with the FDK algorithm with 364 projections, the GPBB-SFE algorithm produces visibly equivalent quality CBCT image for
GPU-accelerated Monte Carlo convolution/superposition implementation for dose calculation.

PubMed

Zhou, Bo; Yu, Cedric X; Chen, Danny Z; Hu, X Sharon

2010-11-01

Dose calculation is a key component in radiation treatment planning systems. Its performance and accuracy are crucial to the quality of treatment plans as emerging advanced radiation therapy technologies are exerting ever tighter constraints on dose calculation. A common practice is to choose either a deterministic method such as the convolution/superposition (CS) method for speed or a Monte Carlo (MC) method for accuracy. The goal of this work is to boost the performance of a hybrid Monte Carlo convolution/superposition (MCCS) method by devising a graphics processing unit (GPU) implementation so as to make the method practical for day-to-day usage. Although the MCCS algorithm combines the merits of MC fluence generation and CS fluence transport, it is still not fast enough to be used as a day-to-day planning tool. To alleviate the speed issue of MC algorithms, the authors adopted MCCS as their target method and implemented a GPU-based version. In order to fully utilize the GPU computing power, the MCCS algorithm is modified to match the GPU hardware architecture. The performance of the authors' GPU-based implementation on an Nvidia GTX260 card is compared to a multithreaded software implementation on a quad-core system. A speedup in the range of 6.7-11.4x is observed for the clinical cases used. The less than 2% statistical fluctuation also indicates that the accuracy of the authors' GPU-based implementation is in good agreement with the results from the quad-core CPU implementation. This work shows that GPU is a feasible and cost-efficient solution compared to other alternatives such as using cluster machines or field-programmable gate arrays for satisfying the increasing demands on computation speed and accuracy of dose calculation. But there are also inherent limitations of using GPU for accelerating MC-type applications, which are also analyzed in detail in this article.
Axillary Lymph Node Evaluation Utilizing Convolutional Neural Networks Using MRI Dataset.

PubMed

Ha, Richard; Chang, Peter; Karcich, Jenika; Mutasa, Simukayi; Fardanesh, Reza; Wynn, Ralph T; Liu, Michael Z; Jambawalikar, Sachin

2018-04-25

The aim of this study is to evaluate the role of a convolutional neural network (CNN) in predicting axillary lymph node metastasis, using a breast MRI dataset. An institutional review board (IRB)-approved retrospective review of our database from 1/2013 to 6/2016 identified 275 axillary lymph nodes for this study. Biopsy-proven 133 metastatic axillary lymph nodes and 142 negative control lymph nodes were identified based on benign biopsies (100) and from healthy MRI screening patients (42) with at least 3 years of negative follow-up. For each breast MRI, the axillary lymph node was identified on the first T1 post-contrast dynamic images and underwent 3D segmentation using the open source software platform 3D Slicer. A 32 × 32 patch was then extracted from the center slice of the segmented tumor data. A CNN was designed for lymph node prediction based on each of these cropped images. The CNN consisted of seven convolutional layers and max-pooling layers with 50% dropout applied in the linear layer. In addition, data augmentation and L2 regularization were performed to limit overfitting. Training was implemented using the Adam optimizer, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. Code for this study was written in Python using the TensorFlow module (1.0.0). Experiments and CNN training were done on a Linux workstation with an NVIDIA GTX 1070 Pascal GPU. Two-class axillary lymph node metastasis prediction models were evaluated. For each lymph node, a final softmax score threshold of 0.5 was used for classification. Based on this, the CNN achieved a mean five-fold cross-validation accuracy of 84.3%. It is feasible for current deep CNN architectures to be trained to predict the likelihood of axillary lymph node metastasis. A larger dataset will likely improve our prediction model and can potentially be a non-invasive alternative to core needle biopsy and even sentinel lymph node
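The final classification step described above (a two-class softmax score compared against a threshold of 0.5) can be sketched in a few lines of plain Python. This is only an illustration of the thresholding rule: the function names, the class ordering (index 1 taken as "metastatic"), and the choice of `>=` at exactly 0.5 are assumptions, and the study's actual TensorFlow code is not shown in the abstract.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw network outputs."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def is_metastatic(logits, threshold=0.5):
    """Apply the score threshold to a two-class output.

    Assumes logits = [negative_class, metastatic_class]; this ordering
    is hypothetical and chosen only for the example.
    """
    return softmax(logits)[1] >= threshold


# Usage: a node whose metastatic logit dominates is labelled positive
label = is_metastatic([0.0, 2.0])
```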
Full Monte Carlo-Based Biologic Treatment Plan Optimization System for Intensity Modulated Carbon Ion Therapy on Graphics Processing Unit.

PubMed

Qin, Nan; Shen, Chenyang; Tsai, Min-Yu; Pinto, Marco; Tian, Zhen; Dedes, Georgios; Pompos, Arnold; Jiang, Steve B; Parodi, Katia; Jia, Xun

2018-01-01

One of the major benefits of carbon ion therapy is enhanced biological effectiveness at the Bragg peak region. For intensity modulated carbon ion therapy (IMCT), it is desirable to use Monte Carlo (MC) methods to compute the properties of each pencil beam spot for treatment planning, because of their accuracy in modeling physics processes and estimating biological effects. We previously developed goCMC, a graphics processing unit (GPU)-oriented MC engine for carbon ion therapy. The purpose of the present study was to build a biological treatment plan optimization system using goCMC. The repair-misrepair-fixation model was implemented to compute the spatial distribution of linear-quadratic model parameters for each spot. A treatment plan optimization module was developed to minimize the difference between the prescribed and actual biological effect. We used a gradient-based algorithm to solve the optimization problem. The system was embedded in the Varian Eclipse treatment planning system under a client-server architecture to achieve a user-friendly planning environment. We tested the system with a 1-dimensional homogeneous water case and 3 3-dimensional patient cases. Our system generated treatment plans with biological spread-out Bragg peaks covering the targeted regions and sparing critical structures. Using 4 NVidia GTX 1080 GPUs, the total computation time, including spot simulation, optimization, and final dose calculation, was 0.6 hour for the prostate case (8282 spots), 0.2 hour for the pancreas case (3795 spots), and 0.3 hour for the brain case (6724 spots). The computation time was dominated by MC spot simulation. We built a biological treatment plan optimization system for IMCT that performs simulations using a fast MC engine, goCMC. To the best of our knowledge, this is the first time that full MC-based IMCT inverse planning has been achieved in a clinically viable time frame. Copyright © 2017 Elsevier Inc. All rights reserved.
Particle In Cell Codes on Highly Parallel Architectures

NASA Astrophysics Data System (ADS)

Tableman, Adam

2014-10-01

We describe strategies and examples of Particle-In-Cell Codes running on Nvidia GPU and Intel Phi architectures. This includes basic implementations in skeleton codes and full-scale development versions (encompassing 1D, 2D, and 3D codes) in Osiris. Both the similarities and differences between Intel's and Nvidia's hardware will be examined. Work supported by grants NSF ACI 1339893, DOE DE SC 000849, DOE DE SC 0008316, DOE DE NA 0001833, and DOE DE FC02 04ER 54780.
Poster: Building a Large Tiled-Display Cluster

DTIC Science & Technology

2012-10-01

graphics cards (Nvidia Quadro FX 5800), and each graphics ∗e-mail: mark.livingston@nrl.navy.mil †e-mail: jonathan.decker@nrl.navy.mil card in a display...such as DisplayPort and HDMI (see: Nvidia Quadro 6000). We recommend these formats because they are much easier to plug-and-play. 3.4 Leverage Open...will find yourself with all the issues related to owning a server room. Today, there are a number of companies offering turn-key solutions for tiled
Paralytic shellfish toxins in phytoplankton and shellfish samples collected from the Bohai Sea, China.

PubMed

Liu, Yang; Yu, Ren-Cheng; Kong, Fan-Zhou; Chen, Zhen-Fan; Dai, Li; Gao, Yan; Zhang, Qing-Chun; Wang, Yun-Feng; Yan, Tian; Zhou, Ming-Jiang

2017-02-15

Phytoplankton and shellfish samples collected periodically from 5 representative mariculture zones around the Bohai Sea, Laishan (LS), Laizhou (LZ), Hangu (HG), Qinhuangdao (QHD) and Huludao (HLD), were analysed for paralytic shellfish toxins (PSTs) using a high-performance liquid chromatography (HPLC) method. Toxins were detected in 13 out of 20 phytoplankton samples, and N-sulfocarbamoyl toxins (C1/2) were predominant components of PSTs in phytoplankton samples with relatively low toxin content. However, two phytoplankton samples with high PST content collected from QHD and LS had unique toxin profiles characterized by high-potency carbamoyl toxins (GTX1/4) and decarbamoyl toxins (dcGTX2/3 and dcSTX), respectively. PSTs were commonly found in shellfish samples, and toxin content ranged from 0 to 27.6 nmol/g. High levels of PSTs were often found in scallops and clams. Shellfish from QHD in spring, and LZ and LS in autumn exhibited high risks of PST contamination. Copyright © 2016 Elsevier Ltd. All rights reserved.
SU-G-TeP1-15: Toward a Novel GPU Accelerated Deterministic Solution to the Linear Boltzmann Transport Equation

DOE Office of Scientific and Technical Information (OSTI.GOV)

Yang, R; Fallone, B; Cross Cancer Institute, Edmonton, AB

Purpose: To develop a Graphic Processor Unit (GPU) accelerated deterministic solution to the Linear Boltzmann Transport Equation (LBTE) for accurate dose calculations in radiotherapy (RT). A deterministic solution yields the potential for major speed improvements due to the sparse matrix-vector and vector-vector multiplications and would thus be of benefit to RT. Methods: In order to leverage the massively parallel architecture of GPUs, the first order LBTE was reformulated as a second order self-adjoint equation using the Least Squares Finite Element Method (LSFEM). This produces a symmetric positive-definite matrix which is efficiently solved using a parallelized conjugate gradient (CG) solver. The LSFEM formalism is applied in space, discrete ordinates is applied in angle, and the Multigroup method is applied in energy. The final linear system of equations produced is tightly coupled in space and angle. Our code written in CUDA-C was benchmarked on an Nvidia GeForce TITAN-X GPU against an Intel i7-6700K CPU. A spatial mesh of 30,950 tetrahedral elements was used with an S4 angular approximation. Results: To avoid repeating a full computationally intensive finite element matrix assembly at each Multigroup energy, a novel mapping algorithm was developed which minimized the operations required at each energy. Additionally, a parallelized memory mapping for the Kronecker product between the sparse spatial and angular matrices, including Dirichlet boundary conditions, was created. Atomicity is preserved by graph-coloring overlapping nodes into separate kernel launches. The one-time mapping calculations for matrix assembly, Kronecker product, and boundary condition application took 452±1ms on GPU. Matrix assembly for 16 energy groups took 556±3s on CPU, and 358±2ms on GPU using the mappings developed. The CG solver took 93±1s on CPU, and 468±2ms on GPU. Conclusion: Three computationally intensive subroutines in deterministically solving the LBTE have been
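The abstract above notes that the LSFEM reformulation yields a symmetric positive-definite system solved with a conjugate gradient (CG) solver. A minimal serial sketch of textbook CG in Python follows. It assumes a small dense matrix stored as nested lists; the function name `conjugate_gradient`, the tolerance, and the storage format are illustrative assumptions and do not represent the authors' CUDA-C code, which operates on sparse matrices on the GPU.

```python
def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive-definite matrix A.

    A is a list of rows, b a list of floats. Dense storage is an
    assumption made to keep the sketch short.
    """
    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    def matvec(M, v):
        return [dot(row, v) for row in M]

    x = [0.0] * len(b)
    r = list(b)            # residual b - A x for the zero initial guess
    p = list(r)            # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)                  # step along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        # New direction is conjugate to the previous ones with respect to A
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Usage on a 2x2 symmetric positive-definite system
x = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

Each iteration is dominated by one matrix-vector product and a few vector-vector products, the operations the abstract identifies as the source of the potential speed improvement.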
Analysis and Implementation of Particle-to-Particle (P2P) Graphics Processor Unit (GPU) Kernel for Black-Box Adaptive Fast Multipole Method

DTIC Science & Technology

2015-06-01

5110P and 16 dx360M4 nodes each with one NVIDIA Kepler K20M/K40M GPU. Each node contained dual Intel Xeon E5-2670 (Sandy Bridge) central processing...kernel and as such does not employ multiple processors. This work makes use of a single processing core and a single NVIDIA Kepler K40 GK110...bandwidth (2 × 16 slot), 7.877 GFloat/s; Kepler K40 peak, 4,290 × 1 billion floating-point operations (GFLOPs), and 288 GB/s Kepler K40 memory
Toxin Profile of Gymnodinium catenatum (Dinophyceae) from the Portuguese Coast, as Determined by Liquid Chromatography Tandem Mass Spectrometry

PubMed Central

Costa, Pedro R.; Robertson, Alison; Quilliam, Michael A.

2015-01-01

The marine dinoflagellate Gymnodinium catenatum has been associated with paralytic shellfish poisoning (PSP) outbreaks in Portuguese waters for many years. PSP syndrome is caused by consumption of seafood contaminated with paralytic shellfish toxins (PSTs), a suite of potent neurotoxins. Gymnodinium catenatum was frequently reported along the Portuguese coast throughout the late 1980s and early 1990s, but was absent between 1995 and 2005. Since this time, G. catenatum blooms have been recurrent, causing contamination of fishery resources along the Atlantic coast of Portugal. The aim of this study was to evaluate the toxin profile of G. catenatum isolated from the Portuguese coast before and after the 10-year hiatus to determine changes and potential impacts for the region. Hydrophilic interaction liquid chromatography tandem mass spectrometry (HILIC-MS/MS) was utilized to determine the presence of any known and emerging PSTs in sample extracts. Several PST derivatives were identified, including the N-sulfocarbamoyl analogues (C1–4), gonyautoxin 5 (GTX5), gonyautoxin 6 (GTX6), and decarbamoyl derivatives, decarbamoyl saxitoxin (dcSTX), decarbamoyl neosaxitoxin (dcNeo) and decarbamoyl gonyautoxin 3 (dcGTX3). In addition, three known hydroxy benzoate derivatives, G. catenatum toxin 1 (GC1), GC2 and GC3, were confirmed in cultured and wild strains of G. catenatum. Moreover, two presumed N-hydroxylated analogues of GC2 and GC3, designated GC5 and GC6, are reported. This work contributes to our understanding of the toxigenicity of G. catenatum in the coastal waters of Portugal and provides valuable information on emerging PST classes that may be relevant for routine monitoring programs tasked with the prevention and control of marine toxins in fish and shellfish. PMID:25871287
Geometric analysis of root canals prepared by four rotary NiTi shaping systems.

PubMed

Hashem, Ahmed Abdel Rahman; Ghoneim, Angie Galal; Lutfy, Reem Ahmed; Foda, Manar Yehia; Omar, Gihan Abdel Fatah

2012-07-01

A great number of nickel-titanium (NiTi) rotary systems with noncutting tips, different cross-sections, superior resistance to torsional fracture, varying tapers, and manufacturing method have been introduced to the market. The purpose of this study was to evaluate and compare the effect of 4 rotary NiTi preparation systems, Revo-S (RS; Micro-Mega, Besancon Cedex, France), Twisted file (TF; SybronEndo, Amersfoort, The Netherlands), ProFile GT Series X (GTX; Dentsply, Tulsa Dental Specialties, Tulsa, OK), and ProTaper (PT; Dentsply Maillefer, Ballaigues, Switzerland), on volumetric changes and transportation of curved root canals. Forty mesiobuccal canals of mandibular molars with an angle of curvature ranging from 25° to 40° were divided according to the instrument used in canal preparation into 4 groups of 10 samples each: group RS, group TF, group GTX, and group PT. Canals were scanned using an i-CAT CBCT scanner (Imaging Science International, Hatfield, PA) before and after preparation to evaluate the volumetric changes. Root canal transportation and centering ratio were evaluated at 1.3, 2.6, 5.2, and 7.8 mm from the apex. The significance level was set at P ≤ .05. The PT system removed a significantly higher amount of dentin than the other systems (P = .025). At the 1.3-mm level, there was no significant difference in canal transportation and centering ratio among the groups. However, at the other levels, TF maintained the original canal curvature recording significantly the least degree of canal transportation as well as the highest mean centering ratio. The TF system showed superior shaping ability in curved canals. Revo-S and GTX were better than ProTaper regarding both canal transportation and centering ability. Copyright © 2012 American Association of Endodontists. Published by Elsevier Inc. All rights reserved.
Dexamethasone promotes granulocyte mobilization by prolonging the half-life of granulocyte-colony-stimulating factor in healthy donors for granulocyte transfusions.

PubMed

Hiemstra, Ida H; van Hamme, John L; Janssen, Machiel H; van den Berg, Timo K; Kuijpers, Taco W

2017-03-01

Granulocyte transfusion (GTX) is a potential approach to correcting neutropenia and relieving the increased risk of infection in patients who are refractory to antibiotics. To mobilize enough granulocytes for transfusion, healthy donors are premedicated with granulocyte-colony-stimulating factor (G-CSF) and dexamethasone. Granulocytes have a short circulatory half-life. Consequently, patients need to receive GTX every other day to keep circulating granulocyte counts at an acceptable level. We investigated whether plasma from premedicated donors was capable of prolonging neutrophil survival and, if so, which factor could be held responsible. The effects of plasma from G-CSF/dexamethasone-treated donors on neutrophil survival were assessed by annexin-V, CD16, and CXCR4 staining and nuclear morphology. We isolated an albumin-bound protein using α-chymotrypsin and albumin-depletion and further characterized it using protein analysis. The effects of dexamethasone and G-CSF were assessed using mifepristone and G-CSF-neutralizing antibody. G-CSF plasma concentrations were determined by Western blot and Luminex analyses. G-CSF/dexamethasone plasma contained a survival-promoting factor for at least 2 days. This factor was recognized as an albumin-associated protein and was identified as G-CSF itself, which was surprising considering its reported half-life of only 4.5 hours. Compared with coadministration of dexamethasone, administration of G-CSF alone to the same GTX donors led to a faster decline in circulating G-CSF levels, whereas dexamethasone itself did not induce any G-CSF, demonstrating a role for dexamethasone in increasing G-CSF half-life. Dexamethasone increases granulocyte yield upon coadministration with G-CSF by extending G-CSF half-life. This observation might also be exploited in the coadministration of dexamethasone with other recombinant proteins to modulate their half-life. © 2016 AABB.
Employing OpenCL to Accelerate Ab Initio Calculations on Graphics Processing Units.

PubMed

Kussmann, Jörg; Ochsenfeld, Christian

2017-06-13

We present an extension of our graphics processing units (GPU)-accelerated quantum chemistry package to employ OpenCL compute kernels, which can be executed on a wide range of computing devices like CPUs, Intel Xeon Phi, and AMD GPUs. Here, we focus on the use of AMD GPUs and discuss differences as compared to CUDA-based calculations on NVIDIA GPUs. First illustrative timings are presented for hybrid density functional theory calculations using serial as well as parallel compute environments. The results show that AMD GPUs are as fast as or faster than comparable NVIDIA GPUs and provide a viable alternative for quantum chemical applications.
Parallel hyperbolic PDE simulation on clusters: Cell versus GPU

NASA Astrophysics Data System (ADS)

Rostrup, Scott; De Sterck, Hans

2010-12-01

Increasingly, high-performance computing is looking towards data-parallel computational devices to enhance computational performance. Two technologies that have received significant attention are IBM's Cell Processor and NVIDIA's CUDA programming model for graphics processing unit (GPU) computing. In this paper we investigate the acceleration of parallel hyperbolic partial differential equation simulation on structured grids with explicit time integration on clusters with Cell and GPU backends. The message passing interface (MPI) is used for communication between nodes at the coarsest level of parallelism. Optimizations of the simulation code at the several finer levels of parallelism that the data-parallel devices provide are described in terms of data layout, data flow and data-parallel instructions. Optimized Cell and GPU performance are compared with reference code performance on a single x86 central processing unit (CPU) core in single and double precision. We further compare the CPU, Cell and GPU platforms on a chip-to-chip basis, and compare performance on single cluster nodes with two CPUs, two Cell processors or two GPUs in a shared memory configuration (without MPI). We finally compare performance on clusters with 32 CPUs, 32 Cell processors, and 32 GPUs using MPI. Our GPU cluster results use NVIDIA Tesla GPUs with GT200 architecture, but some preliminary results on recently introduced NVIDIA GPUs with the next-generation Fermi architecture are also included. This paper provides computational scientists and engineers who are considering porting their codes to accelerator environments with insight into how structured grid based explicit algorithms can be optimized for clusters with Cell and GPU accelerators. It also provides insight into the speed-up that may be gained on current and future accelerator architectures for this class of applications. Program summary. Program title: SWsolver. Catalogue identifier: AEGY_v1_0. Program summary URL

DOE Office of Scientific and Technical Information (OSTI.GOV)

Lyakh, Dmitry I.

An efficient parallel tensor transpose algorithm is suggested for shared-memory computing units, namely, multicore CPU, Intel Xeon Phi, and NVidia GPU. The algorithm operates on dense tensors (multidimensional arrays) and is based on the optimization of cache utilization on x86 CPU and the use of shared memory on NVidia GPU. From the applied side, the ultimate goal is to minimize the overhead encountered in the transformation of tensor contractions into matrix multiplications in computer implementations of advanced methods of quantum many-body theory (e.g., in electronic structure theory and nuclear physics). Particular emphasis is placed on higher-dimensional tensors that typically appear in the so-called multireference correlated methods of electronic structure theory. Depending on tensor dimensionality, the presented optimized algorithms can achieve an order of magnitude speedup on x86 CPUs and 2-3 times speedup on NVidia Tesla K20X GPU with respect to the naive scattering algorithm (no memory access optimization). Furthermore, the tensor transpose routines developed in this work have been incorporated into a general-purpose tensor algebra library (TAL-SH).
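For orientation, the baseline that the abstract above calls the naive scattering algorithm can be sketched as follows: each element of a dense tensor is read in input order and written to its permuted position, with no attention to cache or shared-memory behaviour. This Python sketch is one reading of that baseline, not code from TAL-SH; the function names, the row-major flat storage, and the `perm` convention (output dimension k takes input dimension `perm[k]`) are assumptions made for illustration.

```python
def row_major_strides(shape):
    """Strides (in elements) of a dense row-major tensor."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides


def transpose_naive(data, shape, perm):
    """Permute the dimensions of a flat row-major tensor.

    Reads are sequential, writes are scattered across the output,
    which is the memory-access pattern optimized algorithms avoid.
    """
    out_shape = [shape[p] for p in perm]
    in_strides = row_major_strides(shape)
    out_strides = row_major_strides(out_shape)
    out = [None] * len(data)
    for flat_in, value in enumerate(data):
        # Decode the flat input offset into a multi-index
        remainder = flat_in
        index = []
        for stride in in_strides:
            index.append(remainder // stride)
            remainder %= stride
        # Encode the permuted multi-index as a flat output offset
        flat_out = sum(index[perm[k]] * out_strides[k]
                       for k in range(len(perm)))
        out[flat_out] = value
    return out, out_shape


# Usage: transposing a 2 x 3 matrix stored as [0, 1, 2, 3, 4, 5]
result, result_shape = transpose_naive([0, 1, 2, 3, 4, 5], [2, 3], [1, 0])
```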
Grace: A cross-platform micromagnetic simulator on graphics processing units

NASA Astrophysics Data System (ADS)

Zhu, Ru

2015-12-01

A micromagnetic simulator running on graphics processing units (GPUs) is presented. Different from GPU implementations of other research groups which are predominantly running on NVidia's CUDA platform, this simulator is developed with C++ Accelerated Massive Parallelism (C++ AMP) and is hardware platform independent. It runs on GPUs from vendors including NVidia, AMD and Intel, and achieves a significant performance boost as compared to previous central processing unit (CPU) simulators, up to two orders of magnitude. The simulator paved the way for running large size micromagnetic simulations on both high-end workstations with dedicated graphics cards and low-end personal computers with integrated graphics cards, and is freely available to download.
Real-time radar signal processing using GPGPU (general-purpose graphic processing unit)

NASA Astrophysics Data System (ADS)

Kong, Fanxing; Zhang, Yan Rockee; Cai, Jingxiao; Palmer, Robert D.

2016-05-01

This study introduces a practical approach to developing a real-time signal processing chain for general phased array radar on NVIDIA GPUs (Graphical Processing Units) using CUDA (Compute Unified Device Architecture) libraries such as cuBLAS and cuFFT, which are adopted from open source libraries and optimized for the NVIDIA GPUs. The processed results are rigorously verified against those from the CPUs. Performance benchmarks of computation time with various input data cube sizes are compared across GPUs and CPUs. Through the analysis, it will be demonstrated that GPGPU (General Purpose GPU) real-time processing of the array radar data is possible with relatively low-cost commercial GPUs.
Micromagnetics on high-performance workstation and mobile computational platforms

NASA Astrophysics Data System (ADS)

Fu, S.; Chang, R.; Couture, S.; Menarini, M.; Escobar, M. A.; Kuteifan, M.; Lubarda, M.; Gabay, D.; Lomakin, V.

2015-05-01

The feasibility of using high-performance desktop and embedded mobile computational platforms is presented, including multi-core Intel central processing unit, Nvidia desktop graphics processing units, and Nvidia Jetson TK1 Platform. FastMag finite element method-based micromagnetic simulator is used as a testbed, showing high efficiency on all the platforms. Optimization aspects of improving the performance of the mobile systems are discussed. The high performance, low cost, low power consumption, and rapid performance increase of the embedded mobile systems make them a promising candidate for micromagnetic simulations. Such architectures can be used as standalone systems or can be built as low-power computing clusters.
A hybrid reconstruction algorithm for fast and accurate 4D cone-beam CT imaging.

PubMed

Yan, Hao; Zhen, Xin; Folkerts, Michael; Li, Yongbao; Pan, Tinsu; Cervino, Laura; Jiang, Steve B; Jia, Xun

2014-07-01

4D cone beam CT (4D-CBCT) has been utilized in radiation therapy to provide 4D image guidance in lung and upper abdomen area. However, clinical application of 4D-CBCT is currently limited due to the long scan time and low image quality. The purpose of this paper is to develop a new 4D-CBCT reconstruction method that restores volumetric images based on the 1-min scan data acquired with a standard 3D-CBCT protocol. The model optimizes a deformation vector field that deforms a patient-specific planning CT (p-CT), so that the calculated 4D-CBCT projections match measurements. A forward-backward splitting (FBS) method is invented to solve the optimization problem. It splits the original problem into two well-studied subproblems, i.e., image reconstruction and deformable image registration. By iteratively solving the two subproblems, FBS gradually yields correct deformation information, while maintaining high image quality. The whole workflow is implemented on a graphic-processing-unit to improve efficiency. Comprehensive evaluations have been conducted on a moving phantom and three real patient cases regarding the accuracy and quality of the reconstructed images, as well as the algorithm robustness and efficiency. The proposed algorithm reconstructs 4D-CBCT images from highly under-sampled projection data acquired with 1-min scans. Regarding the anatomical structure location accuracy, 0.204 mm average differences and 0.484 mm maximum difference are found for the phantom case, and the maximum differences of 0.3-0.5 mm for patients 1-3 are observed. As for the image quality, intensity errors below 5 and 20 HU compared to the planning CT are achieved for the phantom and the patient cases, respectively. Signal-noise-ratio values are improved by 12.74 and 5.12 times compared to results from FDK algorithm using the 1-min data and 4-min data, respectively. The computation time of the algorithm on an NVIDIA GTX590 card is 1-1.5 min per phase. High-quality 4D-CBCT imaging based
TH-A-18C-09: Ultra-Fast Monte Carlo Simulation for Cone Beam CT Imaging of Brain Trauma

DOE Office of Scientific and Technical Information (OSTI.GOV)

Sisniega, A; Zbijewski, W; Stayman, J

Purpose: Application of cone-beam CT (CBCT) to low-contrast soft tissue imaging, such as in detection of traumatic brain injury, is challenged by high levels of scatter. A fast, accurate scatter correction method based on Monte Carlo (MC) estimation is developed for application in high-quality CBCT imaging of acute brain injury. Methods: The correction involves MC scatter estimation executed on an NVIDIA GTX 780 GPU (MC-GPU), with baseline simulation speed of ~1e7 photons/sec. MC-GPU is accelerated by a novel, GPU-optimized implementation of variance reduction (VR) techniques (forced detection and photon splitting). The number of simulated tracks and projections is reduced for additional speed-up. Residual noise is removed and the missing scatter projections are estimated via kernel smoothing (KS) in projection plane and across gantry angles. The method is assessed using CBCT images of a head phantom presenting a realistic simulation of fresh intracranial hemorrhage (100 kVp, 180 mAs, 720 projections, source-detector distance 700 mm, source-axis distance 480 mm). Results: For a fixed run-time of ~1 sec/projection, GPU-optimized VR reduces the noise in MC-GPU scatter estimates by a factor of 4. For scatter correction, MC-GPU with VR is executed with 4-fold angular downsampling and 1e5 photons/projection, yielding 3.5 minute run-time per scan, and de-noised with optimized KS. Corrected CBCT images demonstrate uniformity improvement of 18 HU and contrast improvement of 26 HU compared to no correction, and a 52% increase in contrast-to-noise ratio in simulated hemorrhage compared to “oracle” constant fraction correction. Conclusion: Acceleration of MC-GPU achieved through GPU-optimized variance reduction and kernel smoothing yields an efficient (<5 min/scan) and accurate scatter correction that does not rely on additional hardware or simplifying assumptions about the scatter distribution. The method is undergoing implementation in a novel CBCT dedicated to
TH-A-18C-04: Ultrafast Cone-Beam CT Scatter Correction with GPU-Based Monte Carlo Simulation

DOE Office of Scientific and Technical Information (OSTI.GOV)

Xu, Y; Southern Medical University, Guangzhou; Bai, T

2014-06-15

Purpose: Scatter artifacts severely degrade image quality of cone-beam CT (CBCT). We present an ultrafast scatter correction framework using GPU-based Monte Carlo (MC) simulation and a prior patient CT image, aiming to automatically finish the whole process, including both scatter correction and reconstruction, within 30 seconds. Methods: The method consists of six steps: 1) FDK reconstruction using raw projection data; 2) Rigid Registration of planning CT to the FDK results; 3) MC scatter calculation at sparse view angles using the planning CT; 4) Interpolation of the calculated scatter signals to other angles; 5) Removal of scatter from the raw projections; 6) FDK reconstruction using the scatter-corrected projections. In addition to using GPU to accelerate MC photon simulations, we also use a small number of photons and a down-sampled CT image in simulation to further reduce computation time. A novel denoising algorithm is used to eliminate MC scatter noise caused by low photon numbers. The method is validated on head-and-neck cases with simulated and clinical data. Results: We have studied the impact of photon histories and volume down-sampling factors on the accuracy of scatter estimation. Fourier analysis was conducted to show that scatter images calculated at 31 angles are sufficient to restore those at all angles with <0.1% error. For the simulated case with a resolution of 512×512×100, we simulated 10M photons per angle. The total computation time is 23.77 seconds on an Nvidia GTX Titan GPU. The scatter-induced shading/cupping artifacts are substantially reduced, and the average HU error of a region-of-interest is reduced from 75.9 to 19.0 HU. Similar results were found for a real patient case. Conclusion: A practical ultrafast MC-based CBCT scatter correction scheme is developed. The whole process of scatter correction and reconstruction is accomplished within 30 seconds. This study is supported in part by NIH (1R01CA154747-01), The Core Technology
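Illustrative note: steps 4 and 5 of the record above (interpolating sparse-angle scatter to all angles, then removing it from the raw projections) might look like the sketch below. It assumes piecewise-linear interpolation, periodic angles, and one scalar signal per angle; the abstract does not state the interpolation scheme, and `interpolate_scatter` and `remove_scatter` are hypothetical names.

```python
from bisect import bisect_right


def interpolate_scatter(sparse_angles, sparse_scatter, query_angles, period=360.0):
    """Linearly interpolate scatter values from sparse angles to query angles.

    sparse_angles must be sorted ascending within one period. The first
    sample is repeated one period later so the interpolation wraps around.
    """
    xs = list(sparse_angles) + [sparse_angles[0] + period]
    ys = list(sparse_scatter) + [sparse_scatter[0]]
    result = []
    for q in query_angles:
        # Map the query into [xs[0], xs[0] + period).
        q = (q - xs[0]) % period + xs[0]
        i = min(bisect_right(xs, q) - 1, len(xs) - 2)
        t = (q - xs[i]) / (xs[i + 1] - xs[i])
        result.append((1.0 - t) * ys[i] + t * ys[i + 1])
    return result


def remove_scatter(raw, scatter, floor=0.0):
    """Subtract the scatter estimate, clamping so the primary signal stays >= floor."""
    return [max(r - s, floor) for r, s in zip(raw, scatter)]
```

The clamp in `remove_scatter` is a design choice of this sketch: an over-estimated scatter value would otherwise produce a negative primary signal, which breaks the logarithm taken before reconstruction.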
A hybrid reconstruction algorithm for fast and accurate 4D cone-beam CT imaging

DOE Office of Scientific and Technical Information (OSTI.GOV)

Yan, Hao; Folkerts, Michael; Jiang, Steve B., E-mail: xun.jia@utsouthwestern.edu, E-mail: steve.jiang@UTSouthwestern.edu

2014-07-15

Purpose: 4D cone beam CT (4D-CBCT) has been utilized in radiation therapy to provide 4D image guidance in lung and upper abdomen area. However, clinical application of 4D-CBCT is currently limited due to the long scan time and low image quality. The purpose of this paper is to develop a new 4D-CBCT reconstruction method that restores volumetric images based on the 1-min scan data acquired with a standard 3D-CBCT protocol. Methods: The model optimizes a deformation vector field that deforms a patient-specific planning CT (p-CT), so that the calculated 4D-CBCT projections match measurements. A forward-backward splitting (FBS) method is invented to solve the optimization problem. It splits the original problem into two well-studied subproblems, i.e., image reconstruction and deformable image registration. By iteratively solving the two subproblems, FBS gradually yields correct deformation information, while maintaining high image quality. The whole workflow is implemented on a graphics processing unit to improve efficiency. Comprehensive evaluations have been conducted on a moving phantom and three real patient cases regarding the accuracy and quality of the reconstructed images, as well as the algorithm robustness and efficiency. Results: The proposed algorithm reconstructs 4D-CBCT images from highly under-sampled projection data acquired with 1-min scans. Regarding the anatomical structure location accuracy, 0.204 mm average differences and 0.484 mm maximum difference are found for the phantom case, and the maximum differences of 0.3–0.5 mm for patients 1–3 are observed. As for the image quality, intensity errors below 5 and 20 HU compared to the planning CT are achieved for the phantom and the patient cases, respectively. Signal-noise-ratio values are improved by 12.74 and 5.12 times compared to results from FDK algorithm using the 1-min data and 4-min data, respectively. The computation time of the algorithm on an NVIDIA GTX590 card is 1–1.5 min per
A fast GPU-based Monte Carlo simulation of proton transport with detailed modeling of nonelastic interactions.

PubMed

Wan Chan Tseung, H; Ma, J; Beltran, C

2015-06-01

Very fast Monte Carlo (MC) simulations of proton transport have been implemented recently on graphics processing units (GPUs). However, these MCs usually use simplified models for nonelastic proton-nucleus interactions. Our primary goal is to build a GPU-based proton transport MC with detailed modeling of elastic and nonelastic proton-nucleus collisions. Using the CUDA framework, the authors implemented GPU kernels for the following tasks: (1) simulation of beam spots from our possible scanning nozzle configurations, (2) proton propagation through CT geometry, taking into account nuclear elastic scattering, multiple scattering, and energy loss straggling, (3) modeling of the intranuclear cascade stage of nonelastic interactions when they occur, (4) simulation of nuclear evaporation, and (5) statistical error estimates on the dose. To validate our MC, the authors performed (1) secondary particle yield calculations in proton collisions with therapeutically relevant nuclei, (2) dose calculations in homogeneous phantoms, (3) recalculations of complex head and neck treatment plans from a commercially available treatment planning system, and compared with GEANT4.9.6p2/TOPAS. Yields, energy, and angular distributions of secondaries from nonelastic collisions on various nuclei are in good agreement with the GEANT4.9.6p2 Bertini and Binary cascade models. The 3D-gamma pass rate at 2%-2 mm for treatment plan simulations is typically 98%. The net computational time on an NVIDIA GTX680 card, including all CPU-GPU data transfers, is ∼20 s for 1 × 10^7 proton histories. Our GPU-based MC is the first of its kind to include a detailed nuclear model to handle nonelastic interactions of protons with any nucleus. Dosimetric calculations are in very good agreement with GEANT4.9.6p2/TOPAS. Our MC is being integrated into a framework to perform fast routine clinical QA of pencil-beam based treatment plans, and is being used as the dose calculation engine in a clinically
Spiking neural networks on high performance computer clusters

NASA Astrophysics Data System (ADS)

Chen, Chong; Taha, Tarek M.

2011-09-01

In this paper we examine the acceleration of two spiking neural network models on three clusters of multicore processors representing three categories of processors: x86, STI Cell, and NVIDIA GPGPUs. The x86 cluster utilized consists of 352 dual-core AMD Opterons, the Cell cluster consists of 320 Sony Playstation 3s, while the GPGPU cluster contains 32 NVIDIA Tesla S1070 systems. The results indicate that the GPGPU platform can dominate in performance compared to the Cell and x86 platforms examined. From a cost perspective, the GPGPU is more expensive in terms of neuron/s throughput. If the cost of GPGPUs goes down in the future, this platform will become very cost effective for these models.
MILC Code Performance on High End CPU and GPU Supercomputer Clusters

NASA Astrophysics Data System (ADS)

DeTar, Carleton; Gottlieb, Steven; Li, Ruizi; Toussaint, Doug

2018-03-01

With recent developments in parallel supercomputing architecture, many-core, multi-core, and GPU processors are now commonplace, resulting in more levels of parallelism, memory hierarchy, and programming complexity. It has been necessary to adapt the MILC code to these new processors starting with NVIDIA GPUs, and more recently, the Intel Xeon Phi processors. We report on our efforts to port and optimize our code for the Intel Knights Landing architecture. We consider performance of the MILC code with MPI and OpenMP, and optimizations with QOPQDP and QPhiX. For the latter approach, we concentrate on the staggered conjugate gradient and gauge force. We also consider performance on recent NVIDIA GPUs using the QUDA library.
The gputools package enables GPU computing in R.

PubMed

Buckner, Joshua; Wilson, Justin; Seligman, Mark; Athey, Brian; Watson, Stanley; Meng, Fan

2010-01-01

By default, the R statistical environment does not make use of parallelism. Researchers may resort to expensive solutions such as cluster hardware for large analysis tasks. Graphics processing units (GPUs) provide an inexpensive and computationally powerful alternative. Using R and the CUDA toolkit from Nvidia, we have implemented several functions commonly used in microarray gene expression analysis for GPU-equipped computers. R users can take advantage of the better performance provided by an Nvidia GPU. The package is available from CRAN, the R project's repository of packages, at http://cran.r-project.org/web/packages/gputools. More information about our gputools R package is available at http://brainarray.mbni.med.umich.edu/brainarray/Rgpgpu
Special Topics

DTIC Science & Technology

2012-01-01

training encompasses several concepts, including cognitive knowledge, a performance assessment or pretest, training, a repeat assessment or posttest...significantly decreased mortality. For the lessons learned in casualty care to be passed on to the next group of surgeons, the training for deployed...unpaid consultant to Athena GTX, Blackhawk Products Group, CHI Systems, Combat Medical Systems, Composite Resources, Compression Works, Creative
Line-by-line spectroscopic simulations on graphics processing units

NASA Astrophysics Data System (ADS)

Collange, Sylvain; Daumas, Marc; Defour, David

2008-01-01

++ 2005 with Cygwin 1.5.24 under Windows XP. RAM: 1 gigabyte Classification: 21.2 External routines: OpenGL ( http://www.opengl.org) Nature of problem: Simulating radiative transfer on high-temperature high-pressure gases. Solution method: Line-by-line Monte-Carlo ray-tracing. Unusual features: Parallel computations are moved to the GPU. Additional comments: nVidia GeForce 7000 or ATI Radeon X1000 series graphics processing unit is required. Running time: A few minutes.
Liquid Chromatography with a Fluorimetric Detection Method for Analysis of Paralytic Shellfish Toxins and Tetrodotoxin Based on a Porous Graphitic Carbon Column

PubMed Central

Rey, Veronica; Botana, Ana M.; Alvarez, Mercedes; Antelo, Alvaro; Botana, Luis M.

2016-01-01

Paralytic shellfish toxins (PST) traditionally have been analyzed by liquid chromatography with either pre- or post-column derivatization and always with a silica-based stationary phase. This technique resulted in different methods that need more than one run to analyze the toxins. Furthermore, tetrodotoxin (TTX) was recently found in bivalves of northward locations in Europe due to climate change, so it is important to analyze it along with PST because their signs of toxicity are similar in the bioassay. The methods described here detail a new approach to eliminate different runs, by using a new porous graphitic carbon stationary phase. Firstly we describe the separation of 13 PST that belong to different groups, taking into account the side-chains of substituents, in one single run of less than 30 min with good reproducibility. The method was assayed in four shellfish matrices: mussel (Mytillus galloprovincialis), clam (Pecten maximus), scallop (Ruditapes decussatus) and oyster (Ostrea edulis). The results for all of the parameters studied are provided, and the detection limits for the majority of toxins were improved with regard to previous liquid chromatography methods: the lowest values were those for decarbamoyl-gonyautoxin 2 (dcGTX2) and gonyautoxin 2 (GTX2) in mussel (0.0001 mg saxitoxin (STX)·diHCl kg−1 for each toxin), decarbamoyl-saxitoxin (dcSTX) in clam (0.0003 mg STX·diHCl kg−1), N-sulfocarbamoyl-gonyautoxins 2 and 3 (C1 and C2) in scallop (0.0001 mg STX·diHCl kg−1 for each toxin) and dcSTX (0.0003 mg STX·diHCl kg−1 ) in oyster; gonyautoxin 2 (GTX2) showed the highest limit of detection in oyster (0.0366 mg STX·diHCl kg−1). Secondly, we propose a modification of the method for the simultaneous analysis of PST and TTX, with some minor changes in the solvent gradient, although the detection limit for TTX does not allow its use nowadays for regulatory purposes. PMID:27367728
75 FR 25294 - Notice Pursuant to the National Cooperative Research and Production Act of 1993-DVD Copy Control...

Federal Register 2010, 2011, 2012, 2013, 2014

2010-05-07

..., Baarlo Noord Limburg, THE NETHERLANDS; MIT Technology Co., Ltd., Dongguan, Guangdong, PEOPLE'S REPUBLIC... media b.v., Tilburg, THE NETHERLANDS; Mattel Inc., El Segundo, CA; nVidia Corporation, Santa Clara, CA...
Fatigue resistance of engine-driven rotary nickel-titanium instruments produced by new manufacturing methods.

PubMed

Gambarini, Gianluca; Grande, Nicola Maria; Plotino, Gianluca; Somma, Francesco; Garala, Manish; De Luca, Massimo; Testarelli, Luca

2008-08-01

The aim of the present study was to investigate whether cyclic fatigue resistance is increased for nickel-titanium instruments manufactured by using new processes. This was evaluated by comparing instruments produced by using the twisted method (TF; SybronEndo, Orange, CA) and those using the M-wire alloy (GTX; Dentsply Tulsa-Dental Specialties, Tulsa, OK) with instruments produced by a traditional NiTi grinding process (K3, SybronEndo). Tests were performed with a specific cyclic fatigue device that evaluated cycles to failure of rotary instruments inside curved artificial canals. Results indicated that size 06-25 TF instruments showed a significant increase (p < 0.05) in the mean number of cycles to failure when compared with size 06-25 K3 files. Size 06-20 K3 instruments showed no significant increase (p > 0.05) in the mean number of cycles to failure when compared with size 06-20 GT series X instruments. The new manufacturing process produced nickel-titanium rotary files (TF) significantly more resistant to fatigue than instruments produced with the traditional NiTi grinding process. Instruments produced with M-wire (GTX) were not found to be more resistant to fatigue than instruments produced with the traditional NiTi grinding process.
Potentiometric chemical sensors for the detection of paralytic shellfish toxins.

PubMed

Ferreira, Nádia S; Cruz, Marco G N; Gomes, Maria Teresa S R; Rudnitskaya, Alisa

2018-05-01

Potentiometric chemical sensors for the detection of paralytic shellfish toxins have been developed. Four toxins typically encountered in Portuguese waters, namely saxitoxin, decarbamoyl saxitoxin, gonyautoxin GTX5 and C1&C2, were selected for the study. A series of miniaturized sensors with solid inner contact and plasticized polyvinylchloride membranes containing ionophores, nine compositions in total, were prepared and their characteristics evaluated. Sensors displayed cross-sensitivity to four studied toxins, i.e. response to several toxins together with low selectivity. High selectivity towards paralytic shellfish toxins was observed in the presence of inorganic cations with selectivity coefficients ranging from 0.04 to 0.001 for Na+ and K+ and 3.6×10^-4 to 3.4×10^-5 for Ca2+. Detection limits were in the range from 0.25 to 0.9 μmol L^-1 for saxitoxin and decarbamoyl saxitoxin, and from 0.08 to 1.8 μmol L^-1 for GTX5 and C1&C2, which allows toxin detection at the concentration levels corresponding to the legal limits. Characteristics of the developed sensors allow their use in the electronic tongue multisensor system for simultaneous quantification of paralytic shellfish toxins. Copyright © 2018 Elsevier B.V. All rights reserved.
GASPRNG: GPU accelerated scalable parallel random number generator library

NASA Astrophysics Data System (ADS)

Gao, Shuang; Peterson, Gregory D.

2013-04-01

workstation with NVIDIA GPU (Tested on Fermi GTX480, Tesla C1060, Tesla M2070). Operating system: Linux with CUDA version 4.0 or later. Should also run on MacOS, Windows, or UNIX. Has the code been vectorized or parallelized?: Yes. Parallelized using MPI directives. RAM: 512 MB ~ 732 MB (main memory on host CPU, depending on the data type of random numbers.) / 512 MB (GPU global memory) Classification: 4.13, 6.5. Nature of problem: Many computational science applications are able to consume large numbers of random numbers. For example, Monte Carlo simulations are able to consume limitless random numbers for the computation as long as resources for the computing are supported. Moreover, parallel computational science applications require independent streams of random numbers to attain statistically significant results. The SPRNG library provides this capability, but at a significant computational cost. The GASPRNG library presented here accelerates the generators of independent streams of random numbers using graphical processing units (GPUs). Solution method: Multiple copies of random number generators in GPUs allow a computational science application to consume large numbers of random numbers from independent, parallel streams. GASPRNG is a random number generator library that allows a computational science application to employ multiple copies of random number generators to boost performance. Users can interface GASPRNG with software code executing on microprocessors and/or GPUs. Running time: The tests provided take a few minutes to run.
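Illustrative note: the idea of giving each parallel worker its own random number stream, as described in the record above, can be sketched on the CPU with Python's standard library. This is not the GASPRNG or SPRNG API. The names `make_streams` and `estimate_pi` are hypothetical, and seeding separate Mersenne Twister instances with distinct strings is an assumption made for brevity; unlike SPRNG's parameterized generators, it does not formally guarantee statistically independent streams.

```python
import random


def make_streams(n_streams, base_seed=12345):
    """Create one generator object per worker, each with its own seed string."""
    return [random.Random(f"{base_seed}-{i}") for i in range(n_streams)]


def estimate_pi(rng, n_samples):
    """Monte Carlo estimate of pi that draws only from the given stream."""
    hits = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return 4.0 * hits / n_samples


def combined_estimate(n_streams=4, n_samples=20000):
    """Average the per-stream estimates, as a parallel reduction would."""
    streams = make_streams(n_streams)
    return sum(estimate_pi(rng, n_samples) for rng in streams) / n_streams
```

Because every stream owns its state, the per-stream loops could run on separate processes or devices without sharing a generator, and a fixed `base_seed` keeps the run reproducible.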
Air-Breathing Launch Vehicle Technology Being Developed

NASA Technical Reports Server (NTRS)

Trefny, Charles J.

2003-01-01

Of the technical factors that would contribute to lowering the cost of space access, reusability has high potential. The primary objective of the GTX program is to determine whether or not air-breathing propulsion can enable reusable single-stage-to-orbit (SSTO) operations. The approach is based on maturation of a reference vehicle design with focus on the integration and flight-weight construction of its air-breathing rocket-based combined-cycle (RBCC) propulsion system.

Evaluating Multi-core Architectures through Accelerating the Three-Dimensional Lax–Wendroff Correction

DOE Office of Scientific and Technical Information (OSTI.GOV)

You, Yang; Fu, Haohuan; Song, Shuaiwen

2014-07-18

Wave propagation forward modeling is a widely used computational method in oil and gas exploration. The iterative stencil loops in such problems have broad applications in scientific computing. However, executing such loops can be highly time-consuming, which greatly limits application performance and power efficiency. In this paper, we accelerate the forward modeling technique on the latest multi-core and many-core architectures such as Intel Sandy Bridge CPUs, NVIDIA Fermi C2070 GPU, NVIDIA Kepler K20x GPU, and the Intel Xeon Phi Co-processor. For the GPU platforms, we propose two parallel strategies to explore the performance optimization opportunities for our stencil kernels. For Sandy Bridge CPUs and MIC, we also employ various optimization techniques in order to achieve the best.
76 FR 384 - Certain Semiconductor Chips and Products Containing Same; Notice of Investigation

Federal Register 2010, 2011, 2012, 2013, 2014

2011-01-04

..., Dusing Road 1, Hsinchu Science Park, Hsin-Chu, Taiwan 30078. nVidia Corporation, 2701 San Tomas... respondent, to find the facts to be as alleged in the complaint and this notice and to enter an initial...
GPU acceleration of Dock6's Amber scoring computation.

PubMed

Yang, Hailong; Zhou, Qiongqiong; Li, Bo; Wang, Yongjian; Luan, Zhongzhi; Qian, Depei; Li, Hanlu

2010-01-01

Addressing the problem of virtual screening is a long-term goal in the drug discovery field, which, if properly solved, can significantly shorten the R&D cycle of new drugs. The scoring functionality that evaluates the fitness of the docking result is one of the major challenges in virtual screening. In general, scoring functionality in docking requires a large amount of floating-point calculations, which usually take several weeks or even months to finish. This time-consuming procedure is unacceptable, especially when highly fatal and infectious viruses such as SARS and H1N1 arise, which forces the scoring task to be done in a limited time. This paper presents how to leverage the computational power of GPU to accelerate Dock6's (http://dock.compbio.ucsf.edu/DOCK_6/) Amber (J. Comput. Chem. 25: 1157-1174, 2004) scoring with the NVIDIA CUDA (Compute Unified Device Architecture) platform (NVIDIA Corporation Technical Staff, Compute Unified Device Architecture - Programming Guide, NVIDIA Corporation, 2008). We also discuss many factors that will greatly influence the performance after porting the Amber scoring to GPU, including thread management, data transfer, and divergence hiding. Our experiments show that the GPU-accelerated Amber scoring achieves a 6.5× speedup with respect to the original version running on AMD dual-core CPU for the same problem size. This acceleration makes the Amber scoring more competitive and efficient for large-scale virtual screening problems.
Ecological and Physiological Studies of Gymnodinium catenatum in the Mexican Pacific: A Review

PubMed Central

Band-Schmidt, Christine J.; Bustillos-Guzmán, José J.; López-Cortés, David J.; Gárate-Lizárraga, Ismael; Núñez-Vázquez, Erick J.; Hernández-Sandoval, Francisco E.

2010-01-01

This review presents a detailed analysis of the state of knowledge of studies done in Mexico related to the dinoflagellate Gymnodinium catenatum, a paralytic toxin producer. This species was first reported in the Gulf of California in 1939; since then most studies in Mexico have focused on local blooms and seasonal variations. G. catenatum is most abundant during March and April, usually associated with water temperatures between 18 and 25 °C and an increase in nutrients. In vitro studies of G. catenatum strains from different bays along the Pacific coast of Mexico show that this species can grow in wide ranges of salinities, temperatures, and N:P ratios. Latitudinal differences are observed in the toxicity and toxin profile, but the presence of dcSTX, dcGTX2-3, C1, and C2 are usual components. A common characteristic of the toxin profile found in shellfish, when G. catenatum is present in the coastal environment, is the detection of dcGTX2-3, dcSTX, C1, and C2. Few bioassay studies have reported effects in mollusks and lethal effects in mice, and shrimp; however no adverse effects have been observed in the copepod Acartia clausi. Interestingly, genetic sequencing of D1-D2 LSU rDNA revealed that it differs only in one base pair, compared with strains from other regions. PMID:20631876
Ecological and physiological studies of Gymnodinium catenatum in the Mexican Pacific: a review.

PubMed

Band-Schmidt, Christine J; Bustillos-Guzmán, José J; López-Cortés, David J; Gárate-Lizárraga, Ismael; Núñez-Vázquez, Erick J; Hernández-Sandoval, Francisco E

2010-06-23

This review presents a detailed analysis of the state of knowledge of studies done in Mexico related to the dinoflagellate Gymnodinium catenatum, a paralytic toxin producer. This species was first reported in the Gulf of California in 1939; since then most studies in Mexico have focused on local blooms and seasonal variations. G. catenatum is most abundant during March and April, usually associated with water temperatures between 18 and 25 °C and an increase in nutrients. In vitro studies of G. catenatum strains from different bays along the Pacific coast of Mexico show that this species can grow in wide ranges of salinities, temperatures, and N:P ratios. Latitudinal differences are observed in the toxicity and toxin profile, but the presence of dcSTX, dcGTX2-3, C1, and C2 are usual components. A common characteristic of the toxin profile found in shellfish, when G. catenatum is present in the coastal environment, is the detection of dcGTX2-3, dcSTX, C1, and C2. Few bioassay studies have reported effects in mollusks and lethal effects in mice, and shrimp; however no adverse effects have been observed in the copepod Acartia clausi. Interestingly, genetic sequencing of D1-D2 LSU rDNA revealed that it differs only in one base pair, compared with strains from other regions.
75 FR 32826 - Self-Regulatory Organizations; NASDAQ OMX PHLX, Inc.; Notice of Filing and Immediate...

Federal Register 2010, 2011, 2012, 2013, 2014

2010-06-09

...''), American Express Company (``AXP''), Ciena Corp. (``CIEN''), Star Scientific, Inc. (``CIGX''), Dendreon Corp. (``DNDN''), eBay Inc. (``EBAY''), Corning Inc. (``GLW''), Halliburton Company (``HAL''), iShares Dow Jones US Real Estate (``IYR''), Motorola, Inc., (``MOT''), NVIDIA Corporation (``NVDA''), ON Semiconductor...
75 FR 30887 - Self-Regulatory Organizations; The NASDAQ Stock Market LLC; Notice of Filing and Immediate...

Federal Register 2010, 2011, 2012, 2013, 2014

2010-06-02

...''), American Express Company (``AXP''), Ciena Corp. (``CIEN''), Star Scientific, Inc. (``CIGX''), Dendreon Corp. (``DNDN''), eBay Inc. (``EBAY''), Corning Inc. (``GLW''), Halliburton Company (``HAL''), iShares Dow Jones US Real Estate (``IYR''), Motorola, Inc., (``MOT''), NVIDIA Corporation (``NVDA''), ON Semiconductor...
Analysis of the Finite Precision s-Step Biconjugate Gradient Method

DTIC Science & Technology

2014-03-13

Center for Future Architecture Research, a member of STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA, and ASPIRE Lab...industrial sponsors and affiliates Intel, Google, Nokia, NVIDIA, Oracle, and Samsung. Any opinions, findings, conclusions, or recommendations in this
Multi-GPU and multi-CPU accelerated FDTD scheme for vibroacoustic applications

NASA Astrophysics Data System (ADS)

Francés, J.; Otero, B.; Bleda, S.; Gallego, S.; Neipp, C.; Márquez, A.; Beléndez, A.

2015-06-01

The Finite-Difference Time-Domain (FDTD) method is applied to the analysis of vibroacoustic problems and to study the propagation of longitudinal and transversal waves in stratified media. The potential of the scheme and the relevance of each acceleration strategy for massive computations in FDTD are demonstrated in this work. In this paper, we propose two new specific implementations of the bi-dimensional scheme of the FDTD method using multi-CPU and multi-GPU, respectively. In the first implementation, an open source message passing interface (OMPI) has been included in order to massively exploit the resources of a biprocessor station with two Intel Xeon processors. Moreover, regarding the CPU code version, the streaming SIMD extensions (SSE) and also the advanced vector extensions (AVX) have been included with shared memory approaches that take advantage of the multi-core platforms. On the other hand, the second implementation, called the multi-GPU code version, is based on Peer-to-Peer communications available in CUDA on two GPUs (NVIDIA GTX 670). Subsequently, this paper presents an accurate analysis of the influence of the different code versions including shared memory approaches, vector instructions and multi-processors (both CPU and GPU) and compares them in order to delimit the degree of improvement of using distributed solutions based on multi-CPU and multi-GPU. The performance of both approaches was analysed and it has been demonstrated that the addition of shared memory schemes to CPU computing substantially improves the performance of vector instructions, enlarging the simulation sizes that use the cache memory of CPUs efficiently. In this case GPU computing is roughly twice as fast as the fine-tuned CPU version in both cases, one and two nodes. However, for massive computations explicit vector instructions are not worthwhile, since the memory bandwidth is the limiting factor and the performance tends to be the same as that of the sequential version
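Illustrative note: the leapfrog update at the core of an FDTD scheme like the one in the record above can be shown in one dimension. The paper treats a two-dimensional vibroacoustic problem on multi-CPU and multi-GPU hardware; the sketch below is a sequential 1D velocity-stress version under assumed parameters (uniform medium, rigid ends, Gaussian initial stress pulse), and `fdtd_1d` and its arguments are hypothetical names.

```python
import math


def fdtd_1d(n_cells=200, n_steps=150, courant=0.5, rho=1.0, modulus=1.0):
    """Propagate a stress pulse with a staggered-grid leapfrog FDTD scheme.

    Velocities live on the n_cells + 1 nodes, stresses on the n_cells
    cell centres between them. Returns the final stress and velocity lists.
    """
    dx = 1.0
    wave_speed = math.sqrt(modulus / rho)
    dt = courant * dx / wave_speed  # stable for courant <= 1 in 1D
    velocity = [0.0] * (n_cells + 1)
    centre = n_cells // 2
    stress = [math.exp(-0.5 * ((i - centre) / 5.0) ** 2) for i in range(n_cells)]

    for _ in range(n_steps):
        # Update interior velocities from the stress gradient (rigid ends stay 0).
        for i in range(1, n_cells):
            velocity[i] += dt / (rho * dx) * (stress[i] - stress[i - 1])
        # Update stresses from the velocity gradient, half a time step later.
        for i in range(n_cells):
            stress[i] += dt * modulus / dx * (velocity[i + 1] - velocity[i])
    return stress, velocity
```

Each inner loop touches only nearest neighbours, which is why the scheme maps well to vector instructions and GPU threads, and also why memory bandwidth rather than arithmetic tends to limit large runs.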
Swan: A tool for porting CUDA programs to OpenCL

NASA Astrophysics Data System (ADS)

Harvey, M. J.; De Fabritiis, G.

2011-04-01

The use of modern, high-performance graphical processing units (GPUs) for acceleration of scientific computation has been widely reported. The majority of this work has used the CUDA programming model supported exclusively by GPUs manufactured by NVIDIA. An industry standardisation effort has recently produced the OpenCL specification for GPU programming. This offers the benefits of hardware-independence and reduced dependence on proprietary tool-chains. Here we describe a source-to-source translation tool, "Swan", for facilitating the conversion of an existing CUDA code to use the OpenCL model, as a means to aid programmers experienced with CUDA in evaluating OpenCL and alternative hardware. While the performance of equivalent OpenCL and CUDA code on fixed hardware should be comparable, we find that a real-world CUDA application ported to OpenCL exhibits an overall 50% increase in runtime, a reduction in performance attributable to the immaturity of contemporary compilers. The ported application is shown to have platform independence, running on both NVIDIA and AMD GPUs without modification. We conclude that OpenCL is a viable platform for developing portable GPU applications but that the more mature CUDA tools continue to provide best performance. Program summary. Program title: Swan Catalogue identifier: AEIH_v1_0 Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEIH_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: GNU Public License version 2 No. of lines in distributed program, including test data, etc.: 17 736 No. of bytes in distributed program, including test data, etc.: 131 177 Distribution format: tar.gz Programming language: C Computer: PC Operating system: Linux RAM: 256 Mbytes Classification: 6.5 External routines: NVIDIA CUDA, OpenCL Nature of problem: Graphical Processing Units (GPUs) from NVIDIA are preferentially programmed with the proprietary CUDA programming toolkit. An
Application Modernization at LLNL and the Sierra Center of Excellence

DOE Office of Scientific and Technical Information (OSTI.GOV)

Neely, J. Robert; de Supinski, Bronis R.

We report that in 2014, Lawrence Livermore National Laboratory began acquisition of Sierra, a pre-exascale system from IBM and Nvidia. It marks a significant shift in direction for LLNL by introducing the concept of heterogeneous computing via GPUs. LLNL’s mission requires application teams to prepare for this paradigm shift. Thus, the Sierra procurement required a proposed Center of Excellence that would align the expertise of the chosen vendors with laboratory personnel that represent the application developers, system software, and tool providers in a concentrated effort to prepare the laboratory’s codes in advance of the system transitioning to production in 2018. Finally, this article presents LLNL’s overall application strategy, with a focus on how LLNL is collaborating with IBM and Nvidia to ensure a successful transition of its mission-oriented applications into the exascale era.
Application Modernization at LLNL and the Sierra Center of Excellence

DOE PAGES

Neely, J. Robert; de Supinski, Bronis R.

2017-09-01

We report that in 2014, Lawrence Livermore National Laboratory began acquisition of Sierra, a pre-exascale system from IBM and Nvidia. It marks a significant shift in direction for LLNL by introducing the concept of heterogeneous computing via GPUs. LLNL’s mission requires application teams to prepare for this paradigm shift. Thus, the Sierra procurement required a proposed Center of Excellence that would align the expertise of the chosen vendors with laboratory personnel that represent the application developers, system software, and tool providers in a concentrated effort to prepare the laboratory’s codes in advance of the system transitioning to production in 2018. Finally, this article presents LLNL’s overall application strategy, with a focus on how LLNL is collaborating with IBM and Nvidia to ensure a successful transition of its mission-oriented applications into the exascale era.
Accelerating gravitational microlensing simulations using the Xeon Phi coprocessor

NASA Astrophysics Data System (ADS)

Chen, B.; Kantowski, R.; Dai, X.; Baron, E.; Van der Mark, P.

2017-04-01

Recently Graphics Processing Units (GPUs) have been used to speed up very CPU-intensive gravitational microlensing simulations. In this work, we use the Xeon Phi coprocessor to accelerate such simulations and compare its performance on a microlensing code with that of NVIDIA's GPUs. For the selected set of parameters evaluated in our experiment, we find that the speedup by Intel's Knights Corner coprocessor is comparable to that by NVIDIA's Fermi family of GPUs with compute capability 2.0, but less significant than GPUs with higher compute capabilities such as the Kepler. However, the very recently released second generation Xeon Phi, Knights Landing, is about 5.8 times faster than the Knights Corner, and about 2.9 times faster than the Kepler GPU used in our simulations. We conclude that the Xeon Phi is a very promising alternative to GPUs for modern high performance microlensing simulations.
Parallel fuzzy connected image segmentation on GPU.

PubMed

Zhuge, Ying; Cao, Yong; Udupa, Jayaram K; Miller, Robert W

2011-07-01

Image segmentation techniques using fuzzy connectedness (FC) principles have shown their effectiveness in segmenting a variety of objects in several large applications. However, one challenge in these algorithms has been their excessive computational requirements when processing large image datasets. Nowadays, commodity graphics hardware provides a highly parallel computing environment. In this paper, the authors present a parallel fuzzy connected image segmentation algorithm implementation on NVIDIA's Compute Unified Device Architecture (CUDA) platform for segmenting medical image data sets. In the FC algorithm, there are two major computational tasks: (i) computing the fuzzy affinity relations and (ii) computing the fuzzy connectedness relations. These two tasks are implemented as CUDA kernels and executed on the GPU. A dramatic improvement in speed for both tasks is achieved as a result. Our experiments based on three data sets of small, medium, and large data size demonstrate the efficiency of the parallel algorithm, which achieves a speed-up factor of 24.4x, 18.1x, and 10.3x, correspondingly, for the three data sets on the NVIDIA Tesla C1060 over the implementation of the algorithm on CPU, and takes 0.25, 0.72, and 15.04 s, correspondingly, for the three data sets. The authors developed a parallel algorithm of the widely used fuzzy connected image segmentation method on the NVIDIA GPUs, which are far more cost- and speed-effective than both cluster of workstations and multiprocessing systems. A near-interactive speed of segmentation has been achieved, even for the large data set.
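The two tasks named in this abstract, affinity and connectedness, can be illustrated with a small serial sketch. This is not the authors' CUDA implementation: the Gaussian affinity function, the `sigma` parameter, the 4-neighbour adjacency, and the Dijkstra-style propagation are all assumptions chosen for illustration. The connectedness of a pixel to the seed is taken as the strongest path, where a path is as strong as its weakest affinity.

```python
import heapq
import math

def affinity(a, b, sigma=10.0):
    """Hypothetical affinity: 1.0 for equal intensities, falling towards 0 as they differ."""
    return math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2))

def fuzzy_connectedness(image, seed, sigma=10.0):
    """For every pixel, return the max over paths from seed of the min affinity along the path."""
    rows, cols = len(image), len(image[0])
    conn = [[0.0] * cols for _ in range(rows)]
    conn[seed[0]][seed[1]] = 1.0
    heap = [(-1.0, seed[0], seed[1])]  # max-heap emulated with negated strengths
    while heap:
        neg, r, c = heapq.heappop(heap)
        strength = -neg
        if strength < conn[r][c]:
            continue  # stale heap entry; a stronger path was already found
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                # A path is only as strong as its weakest link.
                cand = min(strength, affinity(image[r][c], image[nr][nc], sigma))
                if cand > conn[nr][nc]:
                    conn[nr][nc] = cand
                    heapq.heappush(heap, (-cand, nr, nc))
    return conn
```

Thresholding the returned map would give a segmented object; a GPU version would replace the priority queue with some form of parallel relaxation, which this sketch does not attempt.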
Ultraviolet Communication for Medical Applications

DTIC Science & Technology

2015-06-01

In the previous Phase I effort, Directed Energy Inc.’s (DEI) parent company Imaging Systems Technology (IST) demonstrated feasibility of several key...accurately model high path loss. Custom photon scatter code was rewritten for parallel execution on a graphics processing unit (GPU). The NVidia CUDA
General purpose graphic processing unit implementation of adaptive pulse compression algorithms

NASA Astrophysics Data System (ADS)

Cai, Jingxiao; Zhang, Yan

2017-07-01

This study introduces a practical approach to implement real-time signal processing algorithms for general surveillance radar based on NVIDIA graphical processing units (GPUs). The pulse compression algorithms are implemented using compute unified device architecture (CUDA) libraries such as CUDA basic linear algebra subroutines and CUDA fast Fourier transform library, which are adopted from open source libraries and optimized for the NVIDIA GPUs. For more advanced, adaptive processing algorithms such as adaptive pulse compression, customized kernel optimization is needed and investigated. A statistical optimization approach is developed for this purpose without needing much knowledge of the physical configurations of the kernels. It was found that the kernel optimization approach can significantly improve the performance. Benchmark performance is compared with the CPU performance in terms of processing accelerations. The proposed implementation framework can be used in various radar systems including ground-based phased array radar, airborne sense and avoid radar, and aerospace surveillance radar.
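As a rough illustration of the basic (non-adaptive) pulse compression step mentioned here, the sketch below correlates received samples with a reference waveform in the time domain. The abstract describes FFT-based CUDA libraries; this direct correlation is a simpler stand-in, and the function name and real-valued samples are assumptions.

```python
def pulse_compress(received, reference):
    """Slide the reference over the received samples and correlate (a time-domain matched filter)."""
    n, m = len(received), len(reference)
    return [
        sum(received[k + i] * reference[i] for i in range(m))
        for k in range(n - m + 1)
    ]
```

With a 13-element Barker code embedded in zeros as the received signal, the output peaks at the delay where the code starts, which is the range-compression effect the matched filter is meant to produce.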
GPU-accelerated simulations of isolated black holes

NASA Astrophysics Data System (ADS)

Lewis, Adam G. M.; Pfeiffer, Harald P.

2018-05-01

We present a port of the numerical relativity code SpEC which is capable of running on NVIDIA GPUs. Since this code must be maintained in parallel with SpEC itself, a primary design consideration is to perform as few explicit code changes as possible. We therefore rely on a hierarchy of automated porting strategies. At the highest level we use TLoops, a C++ library of our design, to automatically emit CUDA code equivalent to tensorial expressions written into C++ source using a syntax similar to analytic calculation. Next, we trace out and cache explicit matrix representations of the numerous linear transformations in the SpEC code, which allows these to be performed on the GPU using pre-existing matrix-multiplication libraries. We port the few remaining important modules by hand. In this paper we detail the specifics of our port, and present benchmarks of it simulating isolated black hole spacetimes on several generations of NVIDIA GPU.
Pushing Memory Bandwidth Limitations Through Efficient Implementations of Block-Krylov Space Solvers on GPUs

DOE Office of Scientific and Technical Information (OSTI.GOV)

Clark, M. A.; Strelchenko, Alexei; Vaquero, Alejandro

Lattice quantum chromodynamics simulations in nuclear physics have benefited from a tremendous number of algorithmic advances such as multigrid and eigenvector deflation. These improve the time to solution but do not alleviate the intrinsic memory-bandwidth constraints of the matrix-vector operation dominating iterative solvers. Batching this operation for multiple vectors and exploiting cache and register blocking can yield a super-linear speed up. Block-Krylov solvers can naturally take advantage of such batched matrix-vector operations, further reducing the iterations to solution by sharing the Krylov space between solves. However, practical implementations typically suffer from the quadratic scaling in the number of vector-vector operations. Using the QUDA library, we present an implementation of a block-CG solver on NVIDIA GPUs which reduces the memory-bandwidth complexity of vector-vector operations from quadratic to linear. We present results for the HISQ discretization, showing a 5x speedup compared to highly-optimized independent Krylov solves on NVIDIA's SaturnV cluster.
A comparative study of history-based versus vectorized Monte Carlo methods in the GPU/CUDA environment for a simple neutron eigenvalue problem

NASA Astrophysics Data System (ADS)

Liu, Tianyu; Du, Xining; Ji, Wei; Xu, X. George; Brown, Forrest B.

2014-06-01

For nuclear reactor analysis such as neutron eigenvalue calculations, time-consuming Monte Carlo (MC) simulations can be accelerated by using graphics processing units (GPUs). However, traditional MC methods are often history-based, and their performance on GPUs is affected significantly by the thread divergence problem. In this paper we describe the development of a newly designed event-based vectorized MC algorithm for solving the neutron eigenvalue problem. The code was implemented using NVIDIA's Compute Unified Device Architecture (CUDA), and tested on an NVIDIA Tesla M2090 GPU card. We found that although the vectorized MC algorithm greatly reduces the occurrence of thread divergence, thus enhancing the warp execution efficiency, the overall simulation speed is roughly ten times slower than the history-based MC code on GPUs. Profiling results suggest that the slow speed is probably due to the memory access latency caused by the large amount of global memory transactions. Possible solutions to improve the code efficiency are discussed.
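The difference between the two loop structures can be seen in a toy example. The sketch below is not the authors' code: the physics is reduced to a single hypothetical absorption probability `P_ABSORB`, and each particle gets its own seeded random stream so that both orderings give identical tallies.

```python
import random

P_ABSORB = 0.3  # hypothetical absorption probability per collision

def history_based(n, seed=0):
    """Follow each particle from birth to absorption before starting the next one."""
    counts = []
    for i in range(n):
        rng = random.Random(seed + i)
        collisions = 1
        while rng.random() >= P_ABSORB:  # particle scattered, so it collides again
            collisions += 1
        counts.append(collisions)
    return counts

def event_based(n, seed=0):
    """Advance every live particle by one collision per sweep, then compact the particle bank."""
    rngs = [random.Random(seed + i) for i in range(n)]
    counts = [0] * n
    alive = list(range(n))
    while alive:
        survivors = []
        for i in alive:  # every particle in the sweep performs the same kind of event
            counts[i] += 1
            if rngs[i].random() >= P_ABSORB:
                survivors.append(i)
        alive = survivors
    return counts
```

In the history-based loop, particles with long histories keep their threads busy while others sit idle, which is the divergence the abstract refers to. The event-based sweep keeps the work uniform but rereads and rewrites the particle bank on every pass, which is consistent with the memory-traffic cost the authors report.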
75 FR 48338 - Intel Corporation; Analysis of Proposed Consent Order to Aid Public Comment

Federal Register 2010, 2011, 2012, 2013, 2014

2010-08-10

... integrated into chipsets as well as discrete graphics cards. NVIDIA has been at the forefront of developing... to connect peripheral products such as discrete GPUs to the CPU. A bus is a connection point between... platform. Intel's commitment to maintain an open PCIe bus will provide discrete graphics manufacturers...

Finite Element Optimization for Nondestructive Evaluation on a Graphics Processing Unit for Ground Vehicle Hull Inspection

DTIC Science & Technology

2013-08-22

4 cores, where the code may simultaneously run on the multiple cores or the graphics processing unit (or GPU – to be more specific on an NVIDIA ...allowed to get accurate crack shapes. DISCLAIMER Reference herein to any specific commercial company , product, process, or service by trade name
Fast analytical scatter estimation using graphics processing units.

PubMed

Ingleby, Harry; Lippuner, Jonas; Rickey, Daniel W; Li, Yue; Elbakri, Idris

2015-01-01

To develop a fast patient-specific analytical estimator of first-order Compton and Rayleigh scatter in cone-beam computed tomography, implemented using graphics processing units. The authors developed an analytical estimator for first-order Compton and Rayleigh scatter in a cone-beam computed tomography geometry. The estimator was coded using NVIDIA's CUDA environment for execution on an NVIDIA graphics processing unit. Performance of the analytical estimator was validated by comparison with high-count Monte Carlo simulations for two different numerical phantoms. Monoenergetic analytical simulations were compared with monoenergetic and polyenergetic Monte Carlo simulations. Analytical and Monte Carlo scatter estimates were compared both qualitatively, from visual inspection of images and profiles, and quantitatively, using a scaled root-mean-square difference metric. Reconstruction of simulated cone-beam projection data of an anthropomorphic breast phantom illustrated the potential of this method as a component of a scatter correction algorithm. The monoenergetic analytical and Monte Carlo scatter estimates showed very good agreement. The monoenergetic analytical estimates showed good agreement for Compton single scatter and reasonable agreement for Rayleigh single scatter when compared with polyenergetic Monte Carlo estimates. For a voxelized phantom with dimensions 128 × 128 × 128 voxels and a detector with 256 × 256 pixels, the analytical estimator required 669 seconds for a single projection, using a single NVIDIA 9800 GX2 video card. Accounting for first order scatter in cone-beam image reconstruction improves the contrast to noise ratio of the reconstructed images. The analytical scatter estimator, implemented using graphics processing units, provides rapid and accurate estimates of single scatter and with further acceleration and a method to account for multiple scatter may be useful for practical scatter correction schemes.
A parallel algorithm for the initial screening of space debris collisions prediction using the SGP4/SDP4 models and GPU acceleration

NASA Astrophysics Data System (ADS)

Lin, Mingpei; Xu, Ming; Fu, Xiaoyu

2017-05-01

Currently, a tremendous amount of space debris in Earth's orbit imperils operational spacecraft. It is essential to undertake risk assessments of collisions and predict dangerous encounters in space. However, collision predictions for an enormous amount of space debris give rise to large-scale computations. In this paper, a parallel algorithm is established on the Compute Unified Device Architecture (CUDA) platform of NVIDIA Corporation for collision prediction. According to the parallel structure of NVIDIA graphics processors, a block decomposition strategy is adopted in the algorithm. Space debris is divided into batches, and the computation and data transfer operations of adjacent batches overlap. As a consequence, the latency to access shared memory during the entire computing process is significantly reduced, and a higher computing speed is reached. Theoretically, a simulation of collision prediction for space debris of any amount and for any time span can be executed. To verify this algorithm, a simulation example including 1382 pieces of debris, whose operational time scales vary from 1 min to 3 days, is conducted on Tesla C2075 of NVIDIA. The simulation results demonstrate that with the same computational accuracy as that of a CPU, the computing speed of the parallel algorithm on a GPU is 30 times that on a CPU. Based on this algorithm, collision prediction of over 150 Chinese spacecraft for a time span of 3 days can be completed in less than 3 h on a single computer, which meets the timeliness requirement of the initial screening task. Furthermore, the algorithm can be adapted for multiple tasks, including particle filtration, constellation design, and Monte-Carlo simulation of an orbital computation.
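The batch decomposition can be sketched for a single time step as follows. This is only an illustration under stated assumptions: positions are given directly as hypothetical 3D tuples rather than propagated with SGP4/SDP4, and the overlap of computation with data transfer described in the abstract is not modelled.

```python
import math

def close_pairs_batched(positions, threshold, batch_size):
    """Screen all object pairs batch by batch; return index pairs closer than threshold."""
    n = len(positions)
    starts = range(0, n, batch_size)
    hits = []
    for a in starts:
        for b in starts:
            if b < a:
                continue  # each unordered pair of batches is visited once
            for i in range(a, min(a + batch_size, n)):
                # Within the same batch, start at i + 1 so no pair is counted twice.
                for j in range(max(b, i + 1), min(b + batch_size, n)):
                    if math.dist(positions[i], positions[j]) < threshold:
                        hits.append((i, j))
    return sorted(hits)
```

On a GPU, each pair of batches would map naturally to a block of threads, and the next batch could be transferred while the current one is being screened.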
Evaluation of an Adaptive Automation Trigger Based on Task Performance, Priority, and Frequency

DTIC Science & Technology

2013-06-01

with dual Intel ® Xeon ® CPU x5550 processors @ 2.67 GHz each, 12.0 GB RAM, and a 1.5 GB PCIe nVidia Quadro FX 4800 graphics card (Microsoft...Cole Publishing Company . Miller, C. A., & Parasuraman, R. (2007). Designing for flexible interaction between humans and automation: Delegation
A High Performance Computing Framework for Physics-based Modeling and Simulation of Military Ground Vehicles

DTIC Science & Technology

2011-03-25

number one and Nebulae at number three. Both systems rely on GPU co-processing and use Intel Xeon processors cards and NVIDIA Tesla C2050 GPUs. In...spite of a theoretical peak capability of almost 3 Petaflop/s, Nebulae clocked at 1.271 PFlop/s when running the Linpack benchmark, which puts it
High Resolution Imaging Testbed Utilizing Sodium Laser Guide Star Adaptive Optics: The Real Time Wavefront Reconstructor Computer

DTIC Science & Technology

2008-07-31

Unlike the Lyrtech, each DSP on a Bittware board offers 3 MB of on-chip memory and 3 GFLOPs of 32-bit peak processing power. Based on the performance...Each NVIDIA 8800 Ultra features 576 GFLOPS on 128 612-MHz single-precision floating-point SIMD processors, arranged in 16 clusters of eight. Each
Peregrine Software Toolchains | High-Performance Computing | NREL

Science.gov Websites

toolchain is an open-source alternative against which many technical applications are natively developed and tested. The Portland Group compilers are not fully supported, but are available to the HPC community. Use Group (PGI) C/C++ and Fortran (partially supported) The PGI Accelerator compilers include NVIDIA GPU
Using 3D Computer Graphics Multimedia to Motivate Preservice Teachers' Learning of Geometry and Pedagogy

ERIC Educational Resources Information Center

Goodson-Espy, Tracy; Lynch-Davis, Kathleen; Schram, Pamela; Quickenton, Art

2010-01-01

This paper describes the genesis and purpose of our geometry methods course, focusing on a geometry-teaching technology we created using NVIDIA[R] Chameleon demonstration. This article presents examples from a sequence of lessons centered about a 3D computer graphics demonstration of the chameleon and its geometry. In addition, we present data…
GPU-based relative fuzzy connectedness image segmentation

DOE Office of Scientific and Technical Information (OSTI.GOV)

Zhuge Ying; Ciesielski, Krzysztof C.; Udupa, Jayaram K.

2013-01-15

Purpose: Recently, clinical radiological research and practice are becoming increasingly quantitative. Further, images continue to increase in size and volume. For quantitative radiology to become practical, it is crucial that image segmentation algorithms and their implementations are rapid and yield practical run time on very large data sets. The purpose of this paper is to present a parallel version of an algorithm that belongs to the family of fuzzy connectedness (FC) algorithms, to achieve an interactive speed for segmenting large medical image data sets. Methods: The most common FC segmentations, optimizing an ℓ∞-based energy, are known as relative fuzzy connectedness (RFC) and iterative relative fuzzy connectedness (IRFC). Both RFC and IRFC objects (of which IRFC contains RFC) can be found via linear time algorithms, linear with respect to the image size. The new algorithm, P-ORFC (for parallel optimal RFC), which is implemented by using NVIDIA's Compute Unified Device Architecture (CUDA) platform, considerably improves the computational speed of the above mentioned CPU based IRFC algorithm. Results: Experiments based on four data sets of small, medium, large, and super data size, achieved speedup factors of 32.8×, 22.9×, 20.9×, and 17.5×, correspondingly, on the NVIDIA Tesla C1060 platform. Although the output of P-ORFC need not precisely match that of IRFC output, it is very close to it and, as the authors prove, always lies between the RFC and IRFC objects. Conclusions: A parallel version of a top-of-the-line algorithm in the family of FC has been developed on the NVIDIA GPUs. An interactive speed of segmentation has been achieved, even for the largest medical image data set. Such GPU implementations may play a crucial role in automatic anatomy recognition in clinical radiology.
GPU-based relative fuzzy connectedness image segmentation.

PubMed

Zhuge, Ying; Ciesielski, Krzysztof C; Udupa, Jayaram K; Miller, Robert W

2013-01-01

Recently, clinical radiological research and practice are becoming increasingly quantitative. Further, images continue to increase in size and volume. For quantitative radiology to become practical, it is crucial that image segmentation algorithms and their implementations are rapid and yield practical run time on very large data sets. The purpose of this paper is to present a parallel version of an algorithm that belongs to the family of fuzzy connectedness (FC) algorithms, to achieve an interactive speed for segmenting large medical image data sets. The most common FC segmentations, optimizing an ℓ∞-based energy, are known as relative fuzzy connectedness (RFC) and iterative relative fuzzy connectedness (IRFC). Both RFC and IRFC objects (of which IRFC contains RFC) can be found via linear time algorithms, linear with respect to the image size. The new algorithm, P-ORFC (for parallel optimal RFC), which is implemented by using NVIDIA's Compute Unified Device Architecture (CUDA) platform, considerably improves the computational speed of the above mentioned CPU based IRFC algorithm. Experiments based on four data sets of small, medium, large, and super data size, achieved speedup factors of 32.8×, 22.9×, 20.9×, and 17.5×, correspondingly, on the NVIDIA Tesla C1060 platform. Although the output of P-ORFC need not precisely match that of IRFC output, it is very close to it and, as the authors prove, always lies between the RFC and IRFC objects. A parallel version of a top-of-the-line algorithm in the family of FC has been developed on the NVIDIA GPUs. An interactive speed of segmentation has been achieved, even for the largest medical image data set. Such GPU implementations may play a crucial role in automatic anatomy recognition in clinical radiology.
Enabling Computational Dynamics in Distributed Computing Environments Using a Heterogeneous Computing Template

DTIC Science & Technology

2011-08-09

fastest 10 supercomputers in the world. Both systems rely on GPU co-processing, one using AMD cards, the second, called Nebulae, using NVIDIA Tesla...capability of almost 3 petaflop/s, the highest in TOP500, Nebulae only holds the No. 2 position on the TOP500 list of the
CUDA-based acceleration of collateral filtering in brain MR images

NASA Astrophysics Data System (ADS)

Li, Cheng-Yuan; Chang, Herng-Hua

2017-02-01

Image denoising is one of the fundamental and essential tasks within image processing. In medical imaging, finding an effective algorithm that can remove random noise in MR images is important. This paper proposes an effective noise reduction method for brain magnetic resonance (MR) images. Our approach is based on the collateral filter, which is a more powerful method than the bilateral filter in many cases. However, the computation of the collateral filter algorithm is quite time-consuming. To solve this problem, we improved the collateral filter algorithm with parallel computing using a GPU. We adopted CUDA, an application programming interface for GPUs by NVIDIA, to accelerate the computation. Our experimental evaluation on an Intel Xeon CPU E5-2620 v3 2.40 GHz with an NVIDIA Tesla K40c GPU indicated that the proposed implementation runs dramatically faster than the traditional collateral filter. We believe that the proposed framework has established a general blueprint for achieving fast and robust filtering in a wide variety of medical image denoising applications.
Toxicity and paralytic shellfish toxin profiles of the xanthid crabs, Lophozozymus pictor and Zosimus aeneus, collected from some Australian coral reefs.

PubMed

Llewellyn, L E; Endean, R

1989-01-01

Purification of toxic aqueous extracts from the xanthid crabs Zosimus aeneus and Lophozozymus pictor, collected from Australian waters, yielded paralytic shellfish toxins, including saxitoxin (STX), neosaxitoxin (neoSTX) and gonyautoxins 1, 2 and 4 (GTX1,2,4). No more than two paralytic shellfish toxins were found in any of the purified extracts from any specimen. Four specimens of Z. aeneus and one specimen of L. pictor each contained more toxic material than the suggested human oral lethal dose. The moult of a specimen of L. pictor was toxic, which may indicate a route in crabs for toxin removal.
High performance computing for deformable image registration: towards a new paradigm in adaptive radiotherapy.

PubMed

Samant, Sanjiv S; Xia, Junyi; Muyan-Ozcelik, Pinar; Owens, John D

2008-08-01

The advent of readily available temporal imaging or time series volumetric (4D) imaging has become an indispensable component of treatment planning and adaptive radiotherapy (ART) at many radiotherapy centers. Deformable image registration (DIR) is also used in other areas of medical imaging, including motion corrected image reconstruction. Due to long computation time, clinical applications of DIR in radiation therapy and elsewhere have been limited and consequently relegated to offline analysis. With the recent advances in hardware and software, graphics processing unit (GPU) based computing is an emerging technology for general purpose computation, including DIR, and is suitable for highly parallelized computing. However, traditional general purpose computation on the GPU is limited because of the constraints of the available programming platforms. As well, compared to CPU programming, the GPU currently has reduced dedicated processor memory, which can limit the useful working data set for parallelized processing. We present an implementation of the demons algorithm using the NVIDIA 8800 GTX GPU and the new CUDA programming language. The GPU performance will be compared with single threading and multithreading CPU implementations on an Intel dual core 2.4 GHz CPU using the C programming language. CUDA provides a C-like language programming interface, and allows for direct access to the highly parallel compute units in the GPU. Comparisons for volumetric clinical lung images acquired using 4DCT were carried out. Computation time for 100 iterations in the range of 1.8-13.5 s was observed for the GPU with image size ranging from 2.0 × 10^6 to 14.2 × 10^6 pixels. The GPU registration was 55-61 times faster than the CPU for the single threading implementation, and 34-39 times faster for the multithreading implementation. For CPU based computing, the computational time generally has a linear dependence on image size for medical imaging data. Computational efficiency is
Parallel Geospatial Data Management for Multi-Scale Environmental Data Analysis on GPUs

NASA Astrophysics Data System (ADS)

Wang, D.; Zhang, J.; Wei, Y.

2013-12-01

, we have developed a GPU-based data compression technique by reusing our previous work on Bitplane Quadtree (or BPQ-Tree) based indexing of binary bitmaps. Results have shown that our GPU-based parallel Zonal Statistic technique on 3000+ US counties over 20+ billion NASA SRTM 30 meter resolution Digital Elevation (DEM) raster cells has achieved impressive end-to-end runtimes: 101 seconds and 46 seconds on a low-end workstation equipped with a Nvidia GTX Titan GPU using cold and hot cache, respectively; and, 60-70 seconds using a single OLCF TITAN computing node and 10-15 seconds using 8 nodes. Our experiment results clearly show the potential of using high-end computing facilities for large-scale geospatial processing.
Generating Billion-Edge Scale-Free Networks in Seconds: Performance Study of a Novel GPU-based Preferential Attachment Model

DOE Office of Scientific and Technical Information (OSTI.GOV)

Perumalla, Kalyan S.; Alam, Maksudul

A novel parallel algorithm is presented for generating random scale-free networks using the preferential-attachment model. The algorithm, named cuPPA, is custom-designed for single instruction multiple data (SIMD) style of parallel processing supported by modern processors such as graphical processing units (GPUs). To the best of our knowledge, our algorithm is the first to exploit GPUs, and also the fastest implementation available today, to generate scale free networks using the preferential attachment model. A detailed performance study is presented to understand the scalability and runtime characteristics of the cuPPA algorithm. In one of the best cases, when executed on an NVidia GeForce 1080 GPU, cuPPA generates a scale free network of a billion edges in less than 2 seconds.
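For readers unfamiliar with the preferential-attachment model itself, a minimal serial sketch follows. It is not cuPPA: the one-edge-per-node rule, the function name, and the endpoint-list sampling trick are assumptions used only to show how attachment probability proportional to degree can be realised.

```python
import random

def preferential_attachment(n, seed=0):
    """Return n - 1 edges; each new node attaches to an earlier node chosen in proportion to its degree."""
    if n < 2:
        return []
    rng = random.Random(seed)
    edges = [(1, 0)]
    endpoints = [1, 0]  # each node appears once per unit of degree
    for t in range(2, n):
        # Picking a uniform entry of the endpoint list is a degree-proportional draw.
        target = rng.choice(endpoints)
        edges.append((t, target))
        endpoints.extend((t, target))
    return edges
```

Each step depends on the degrees produced by all earlier steps, which is the sequential dependency a parallel generator has to work around.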
75 FR 32803 - Notice of Issuance of Final Determination Concerning a GTX Mobile+ Hand Held Computer

Federal Register 2010, 2011, 2012, 2013, 2014

2010-06-09

... Programmable Read-Only Memory (``PROM'') chip, substantially transformed the PROM into a U.S. article. The... parts (such as various connectors and an Electronically Erasable Programmable Read Only Memory, or...
Computational Omics Funding Opportunity | Office of Cancer Clinical Proteomics Research

Cancer.gov

The National Cancer Institute's Clinical Proteomic Tumor Analysis Consortium (CPTAC) and the NVIDIA Foundation are pleased to announce funding opportunities in the fight against cancer. Each organization has launched a request for proposals (RFP) that will collectively fund up to $2 million to help to develop a new generation of data-intensive scientific tools to find new ways to treat cancer.
LU Factorization with Partial Pivoting for a Multi-CPU, Multi-GPU Shared Memory System

DOE Office of Scientific and Technical Information (OSTI.GOV)

Kurzak, Jakub; Luszczek, Piotr; Faverge, Mathieu

2012-03-01

LU factorization with partial pivoting is a canonical numerical procedure and the main component of the High Performance LINPACK benchmark. This article presents an implementation of the algorithm for a hybrid, shared memory, system with standard CPU cores and GPU accelerators. Performance in excess of one TeraFLOPS is achieved using four AMD Magny Cours CPUs and four NVIDIA Fermi GPUs.
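The numerical procedure named in this abstract can be written compactly for a small dense matrix. The sketch below is a plain serial version on lists of floats, not the hybrid CPU/GPU implementation the article describes; the function name and the packed return format are assumptions.

```python
def lu_partial_pivot(a):
    """Factor a square matrix so that P*A = L*U; returns (perm, lu) with L and U packed into lu."""
    n = len(a)
    lu = [row[:] for row in a]
    perm = list(range(n))  # perm[i] is the original index of the row now in position i
    for k in range(n):
        # Partial pivoting: pick the largest magnitude entry in column k at or below the diagonal.
        pivot = max(range(k, n), key=lambda r: abs(lu[r][k]))
        if lu[pivot][k] == 0.0:
            raise ValueError("matrix is singular")
        if pivot != k:
            lu[k], lu[pivot] = lu[pivot], lu[k]
            perm[k], perm[pivot] = perm[pivot], perm[k]
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]  # multiplier, stored in the strictly lower triangle (L)
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]  # update of the trailing submatrix
    return perm, lu
```

The unit diagonal of L is implicit. The trailing-submatrix update in the innermost loops is the part that dominates the cost and is the natural candidate for offloading to accelerators.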
Supersonic Wind Tunnel Tests of a Half-axisymmetric 12 Deg-spike Inlet to a Rocket-based Combined-cycle Propulsion System

NASA Technical Reports Server (NTRS)

DeBonis, J. R.; Trefny, C. J.

2001-01-01

Results of an isolated inlet test for NASA's GTX air-breathing launch vehicle concept are presented. The GTX is a Vertical Take-off/ Horizontal Landing reusable single-stage-to-orbit system powered by a rocket-based combined-cycle propulsion system. Tests were conducted in the NASA Glenn 1- by 1-Foot Supersonic Wind Tunnel during two entries in October 1998 and February 1999. Tests were run from Mach 2.8 to 6. Integrated performance parameters and static pressure distributions are reported. The maximum contraction ratios achieved in the tests were lower than predicted by axisymmetric Reynolds-averaged Navier-Stokes computational fluid dynamics (CFD). At Mach 6, the maximum contraction ratio was roughly one-half of the CFD value of 16. The addition of either boundary-layer trip strips or vortex generators had a negligible effect on the maximum contraction ratio. A shock boundary-layer interaction was also evident on the end-walls that terminate the annular flowpath cross section. Cut-back end-walls, designed to reduce the boundary-layer growth upstream of the shock and minimize the interaction, also had negligible effect on the maximum contraction ratio. Both the excessive turning of low-momentum corner flows and local over-contraction due to asymmetric end-walls were identified as possible reasons for the discrepancy between the CFD predictions and the experiment. It is recommended that the centerbody spike and throat angles be reduced in order to lessen the induced pressure rise. The addition of a step on the cowl surface, and planar end-walls more closely approximating a plane of symmetry are also recommended. Provisions for end-wall boundary-layer bleed should be incorporated.

Optimizing ion channel models using a parallel genetic algorithm on graphical processors.

PubMed

Ben-Shalom, Roy; Aviv, Amit; Razon, Benjamin; Korngreen, Alon

2012-01-01

We have recently shown that we can semi-automatically constrain models of voltage-gated ion channels by combining a stochastic search algorithm with ionic currents measured using multiple voltage-clamp protocols. Although numerically successful, this approach is highly demanding computationally, with optimization on a high performance Linux cluster typically lasting several days. To solve this computational bottleneck we converted our optimization algorithm for work on a graphical processing unit (GPU) using NVIDIA's CUDA. Parallelizing the process on a Fermi graphic computing engine from NVIDIA increased the speed ∼180 times over an application running on an 80 node Linux cluster, considerably reducing simulation times. This application allows users to optimize models for ion channel kinetics on a single, inexpensive, desktop "super computer," greatly reducing the time and cost of building models relevant to neuronal physiology. We also demonstrate that the point of algorithm parallelization is crucial to its performance. We substantially reduced computing time by solving the ODEs (Ordinary Differential Equations) so as to massively reduce memory transfers to and from the GPU. This approach may be applied to speed up other data intensive applications requiring iterative solutions of ODEs. Copyright © 2012 Elsevier B.V. All rights reserved.
GPUmotif: An Ultra-Fast and Energy-Efficient Motif Analysis Program Using Graphics Processing Units

PubMed Central

Zandevakili, Pooya; Hu, Ming; Qin, Zhaohui

2012-01-01

Computational detection of TF binding patterns has become an indispensable tool in functional genomics research. With the rapid advance of new sequencing technologies, large amounts of protein-DNA interaction data have been produced. Analyzing this data can provide substantial insight into the mechanisms of transcriptional regulation. However, the massive amount of sequence data presents daunting challenges. In our previous work, we have developed a novel algorithm called Hybrid Motif Sampler (HMS) that enables more scalable and accurate motif analysis. Despite much improvement, HMS is still time-consuming due to the requirement to calculate matching probabilities position-by-position. Using the NVIDIA CUDA toolkit, we developed a graphics processing unit (GPU)-accelerated motif analysis program named GPUmotif. We proposed a "fragmentation" technique to hide data transfer time between memories. Performance comparison studies showed that commonly-used model-based motif scan and de novo motif finding procedures such as HMS can be dramatically accelerated when running GPUmotif on NVIDIA graphics cards. As a result, energy consumption can also be greatly reduced when running motif analysis using GPUmotif. The GPUmotif program is freely available at http://sourceforge.net/projects/gpumotif/ PMID:22662128
GPUmotif: an ultra-fast and energy-efficient motif analysis program using graphics processing units.

PubMed

Zandevakili, Pooya; Hu, Ming; Qin, Zhaohui

2012-01-01

Computational detection of TF binding patterns has become an indispensable tool in functional genomics research. With the rapid advance of new sequencing technologies, large amounts of protein-DNA interaction data have been produced. Analyzing this data can provide substantial insight into the mechanisms of transcriptional regulation. However, the massive amount of sequence data presents daunting challenges. In our previous work, we have developed a novel algorithm called Hybrid Motif Sampler (HMS) that enables more scalable and accurate motif analysis. Despite much improvement, HMS is still time-consuming due to the requirement to calculate matching probabilities position-by-position. Using the NVIDIA CUDA toolkit, we developed a graphics processing unit (GPU)-accelerated motif analysis program named GPUmotif. We proposed a "fragmentation" technique to hide data transfer time between memories. Performance comparison studies showed that commonly-used model-based motif scan and de novo motif finding procedures such as HMS can be dramatically accelerated when running GPUmotif on NVIDIA graphics cards. As a result, energy consumption can also be greatly reduced when running motif analysis using GPUmotif. The GPUmotif program is freely available at http://sourceforge.net/projects/gpumotif/
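The position-by-position matching step mentioned in the abstract can be illustrated with a generic position weight matrix scan. This is only a minimal sketch, not HMS or GPUmotif: the motif probabilities, the uniform background, and the function names are all hypothetical. It shows why the work parallelizes well: every window is scored independently of the others.

```python
import math

# Hypothetical 3-position motif: per-position probabilities for A, C, G, T.
MOTIF = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
BACKGROUND = 0.25  # assumed uniform background frequency for each base


def log_odds(window):
    """Log-odds score of one window against the motif model."""
    return sum(math.log(col[base] / BACKGROUND) for col, base in zip(MOTIF, window))


def scan(sequence):
    """Score every window of the sequence; each score is independent of the rest."""
    width = len(MOTIF)
    return [log_odds(sequence[i:i + width]) for i in range(len(sequence) - width + 1)]


def best_site(sequence):
    """Return the start index and text of the highest-scoring window."""
    scores = scan(sequence)
    start = max(range(len(scores)), key=scores.__getitem__)
    return start, sequence[start:start + len(MOTIF)]
```

On a GPU, one natural mapping would assign each window (or each sequence) to its own thread; whether GPUmotif does exactly this is not stated in the abstract.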
Examination of Multi-Core Architectures

DTIC Science & Technology

2010-11-01

NOVEMBER 2010 2. REPORT TYPE Interim Technical Report 3. DATES COVERED (From - To) February 2010 – July 2010 4 . TITLE AND SUBTITLE EXAMINATION OF...STATEMENT 1 2.0 BACKGROUND 1 3.0 ARCHITECTURE CHARACTERISTICS 3 3.1 NVIDIA Tesla 3 3.2 TILE64 4 ...1 Tesla Architecture 3 2 TILE64 Architecture 4 3 Single Tile Architecture 4 4 STI Cell Broadband Engine
Computational Omics Pre-Awardees | Office of Cancer Clinical Proteomics Research

Cancer.gov

The National Cancer Institute's Clinical Proteomic Tumor Analysis Consortium (CPTAC) is pleased to announce the pre-awardees of the Computational Omics solicitation. Working with NVIDIA Foundation's Compute the Cure initiative and Leidos Biomedical Research Inc., the NCI, through this solicitation, seeks to leverage computational efforts to provide tools for the mining and interpretation of large-scale publicly available ‘omics’ datasets.
Universal Batch Steganalysis

DTIC Science & Technology

2014-06-01

in large-scale datasets such as might be obtained by monitoring a corporate network or social network. Identifying guilty actors, rather than payload...by monitoring a corporate network or social network. Identifying guilty actors, rather than payload-carrying objects, is entirely novel in steganalysis...implementation using Compute Unified Device Architecture (CUDA) on NVIDIA graphics cards. The key to good performance is to combine computations so that
Universal Batch Steganalysis

DTIC Science & Technology

2014-06-30

steganalysis) in large-scale datasets such as might be obtained by monitoring a corporate network or social network. Identifying guilty actors...guilty’ user (of steganalysis) in large-scale datasets such as might be obtained by monitoring a corporate network or social network. Identifying guilty...floating point operations (1 TFLOPs) for a 1 megapixel image. We designed a new implementation using Compute Unified Device Architecture (CUDA) on NVIDIA
Contention Bounds for Combinations of Computation Graphs and Network Topologies

DTIC Science & Technology

2014-08-08

member of STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA, and ASPIRE Lab industrial sponsors and affiliates Intel...Google, Nokia, NVIDIA, Oracle, MathWorks and Samsung. Also funded by U.S. DOE Office of Science, Office of Advanced Scientific Computing Research...DARPA Award Number HR0011-12-2-0016, the Center for Future Architecture Research, a member of STARnet, a Semiconductor Research Corporation
Development of an Integrated Nozzle for a Symmetric, RBCC Launch Vehicle Configuration

NASA Technical Reports Server (NTRS)

Smith, Timothy D.; Canabal, Francisco, III; Rice, Tharen; Blaha, Bernard

2000-01-01

The development of rocket based combined cycle (RBCC) engines is highly dependent upon integrating several different modes of operation into a single system. One of the key components to develop acceptable performance levels through each mode of operation is the nozzle. It must be highly integrated to serve the expansion processes of both rocket and air-breathing modes without undue weight, drag, or complexity. The NASA GTX configuration requires a fixed geometry, altitude-compensating nozzle configuration. The initial configuration, used mainly to estimate weight and cooling requirements, was a 15° half-angle cone, which cuts a concave surface from a point within the flowpath to the vehicle trailing edge. Results of 3-D CFD calculations on this geometry are presented. To address the critical issues associated with integrated, fixed geometry, multimode nozzle development, the GTX team has initiated a series of tasks to evolve the nozzle design, and validate performance levels. An overview of these tasks is given. The first element is a design activity to develop tools for integration of efficient expansion surfaces with the existing flowpath and vehicle aft-body, and to develop a second-generation nozzle design. A preliminary result using a "streamline-tracing" technique is presented. As the nozzle design evolves, a combination of 3-D CFD analysis and experimental evaluation will be used to validate the design procedure and determine the installed performance for propulsion cycle modeling. The initial experimental effort will consist of cold-flow experiments designed to validate the general trends of the streamline-tracing methodology and anchor the CFD analysis. Experiments will also be conducted to simulate nozzle performance during each mode of operation. As the design matures, hot-fire tests will be conducted to refine performance estimates and anchor more sophisticated reacting-flow analysis.
NDetermin: Inferring Nondeterministic Sequential Specifications for Parallelism Correctness

DTIC Science & Technology

2011-12-16

other provision of law, no person shall be subject to a penalty for failing to comply with a collection of information if it does not display a...Lab affiliates National Instruments, NEC, Nokia, NVIDIA, and Samsung. NDetermin: Inferring Nondeterministic Sequential Specifications for Parallelism...concurrently update x, some of these CAS’s will fail and those parallel loop iterations will recompute their updates to x and try again. Consider the parallel
Finite difference numerical method for the superlattice Boltzmann transport equation and case comparison of CPU(C) and GPU(CUDA) implementations

DOE Office of Scientific and Technical Information (OSTI.GOV)

Priimak, Dmitri

2014-12-01

We present a finite difference numerical algorithm for solving the two-dimensional, spatially homogeneous Boltzmann transport equation, which describes electron transport in a semiconductor superlattice subject to crossed time-dependent electric and constant magnetic fields. The algorithm is implemented both in C, targeting the CPU, and in CUDA C, targeting commodity NVIDIA GPUs. We compare the performance and merits of the two implementations and discuss various software optimisation techniques.
Ultraviolet Communication for Medical Applications

DTIC Science & Technology

2014-05-01

parent company Imaging Systems Technology (IST) demonstrated feasibility of several key concepts are being developed into a working prototype in the...program using multiple high-end GPUs (NVIDIA Tesla K20). Finally, the Monte Carlo simulation task will be resumed after the Milestone 2 demonstration...is acceptable for automated printing and handling. Next, the option of having our shells electroded by an external company was investigated and DEI
Classification of hyperspectral imagery using MapReduce on a NVIDIA graphics processing unit (Conference Presentation)

NASA Astrophysics Data System (ADS)

Ramirez, Andres; Rahnemoonfar, Maryam

2017-04-01

A hyperspectral image provides a multidimensional, data-rich figure consisting of hundreds of spectral dimensions. Analyzing the spectral and spatial information of such an image with linear and non-linear algorithms results in high computational time. To overcome this problem, this research presents a system using a MapReduce-Graphics Processing Unit (GPU) model that can help analyze a hyperspectral image through parallel hardware and a parallel programming model that is simpler to handle than other low-level parallel programming models. Additionally, Hadoop was used as an open-source version of the MapReduce parallel programming model. This research compared classification accuracy results and timing results between the Hadoop and GPU system and tested it against the following test cases: the CPU and GPU test case, a CPU test case and a test case where no dimensional reduction was applied.
Real-time liquid-crystal atmosphere turbulence simulator with graphic processing unit.

PubMed

Hu, Lifa; Xuan, Li; Li, Dayu; Cao, Zhaoliang; Mu, Quanquan; Liu, Yonggang; Peng, Zenghui; Lu, Xinghai

2009-04-27

To generate time-evolving atmosphere turbulence in real time, a phase-generating method for our liquid-crystal (LC) atmosphere turbulence simulator (ATS) is derived based on the Fourier series (FS) method. A real matrix expression for generating turbulence phases is given and calculated with a graphic processing unit (GPU), the GeForce 8800 Ultra. A liquid crystal on silicon (LCOS) with 256x256 pixels is used as the turbulence simulator. The total time to generate a turbulence phase is about 7.8 ms for calculation and readout with the GPU. A parallel processing method of calculating and sending a picture to the LCOS is used to improve the simulating speed of our LC ATS. Therefore, the real-time turbulence phase-generation frequency of our LC ATS is up to 128 Hz. To our knowledge, it is the highest speed used to generate a turbulence phase in real time.
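As a rough consistency check on the two figures quoted in this abstract, and assuming the pipelined simulator is limited only by the roughly 7.8 ms needed to compute and read out one phase screen (an assumption; the abstract does not state the bottleneck explicitly), the frame rate follows from the reciprocal of that time:

```latex
f_{\max} \approx \frac{1}{t_{\text{phase}}}
        = \frac{1}{7.8 \times 10^{-3}\,\text{s}}
        \approx 128\ \text{Hz}
```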
A mitral annulus tracking approach for navigation of off-pump beating heart mitral valve repair.

PubMed

Li, Feng P; Rajchl, Martin; Moore, John; Peters, Terry M

2015-01-01

porcine data, the authors compared the tracked MVA to a manually segmented MVA. The overall accuracy is 2.37 ± 1.67 mm for single plane images and 2.35 ± 1.55 mm for biplane images. The interoperator variation in manual segmentation was 2.32 ± 1.24 mm for single plane images and 1.73 ± 1.18 mm for biplane images. The computational efficiency of the algorithm on a desktop computer with an Intel® Xeon® CPU @3.47 GHz and an NVIDIA GeForce 690 graphic card is such that the time required for registering four MVA points was about 60 ms. The authors developed a rapid MVA tracking algorithm for use in the guidance of off-pump beating heart transapical mitral valve repair. This approach uses 2D biplane TEE images and was tested on a dynamic heart phantom and interventional porcine image data. Results regarding the accuracy and efficiency of the authors' MVA tracking algorithm are promising, and fulfill the requirements for surgical navigation.
SU (2) lattice gauge theory simulations on Fermi GPUs

DOE Office of Scientific and Technical Information (OSTI.GOV)

Cardoso, Nuno; Bicudo, Pedro

2011-05-10

In this work we explore the performance of CUDA in quenched lattice SU (2) simulations. CUDA, NVIDIA Compute Unified Device Architecture, is a hardware and software architecture developed by NVIDIA for computing on the GPU. We present an analysis and performance comparison between the GPU and CPU in single and double precision. Analyses with multiple GPUs and two different architectures (G200 and Fermi architectures) are also presented. In order to obtain a high performance, the code must be optimized for the GPU architecture, i.e., an implementation that exploits the memory hierarchy of the CUDA programming model. We produce codes for the Monte Carlo generation of SU (2) lattice gauge configurations, for the mean plaquette, for the Polyakov Loop at finite T and for the Wilson loop. We also present results for the potential using many configurations (50,000) without smearing and almost 2000 configurations with APE smearing. With two Fermi GPUs we have achieved an excellent performance of 200x the speed over one CPU, in single precision, around 110 Gflops/s. We also find that, using the Fermi architecture, double precision computations for the static quark-antiquark potential are not much slower (less than 2x slower) than single precision computations.
SU (2) lattice gauge theory simulations on Fermi GPUs

NASA Astrophysics Data System (ADS)

Cardoso, Nuno; Bicudo, Pedro

2011-05-01

In this work we explore the performance of CUDA in quenched lattice SU (2) simulations. CUDA, NVIDIA Compute Unified Device Architecture, is a hardware and software architecture developed by NVIDIA for computing on the GPU. We present an analysis and performance comparison between the GPU and CPU in single and double precision. Analyses with multiple GPUs and two different architectures (G200 and Fermi architectures) are also presented. In order to obtain a high performance, the code must be optimized for the GPU architecture, i.e., an implementation that exploits the memory hierarchy of the CUDA programming model. We produce codes for the Monte Carlo generation of SU (2) lattice gauge configurations, for the mean plaquette, for the Polyakov Loop at finite T and for the Wilson loop. We also present results for the potential using many configurations (50,000) without smearing and almost 2000 configurations with APE smearing. With two Fermi GPUs we have achieved an excellent performance of 200× the speed over one CPU, in single precision, around 110 Gflops/s. We also find that, using the Fermi architecture, double precision computations for the static quark-antiquark potential are not much slower (less than 2× slower) than single precision computations.
Wavelet-based multicomponent denoising on GPU to improve the classification of hyperspectral images

NASA Astrophysics Data System (ADS)

Quesada-Barriuso, Pablo; Heras, Dora B.; Argüello, Francisco; Mouriño, J. C.

2017-10-01

Supervised classification allows handling a wide range of remote sensing hyperspectral applications. Enhancing the spatial organization of the pixels over the image has proven to be beneficial for the interpretation of the image content, thus increasing the classification accuracy. Denoising in the spatial domain of the image has been shown to be a technique that enhances the structures in the image. This paper proposes a multi-component denoising approach in order to increase the classification accuracy when a classification method is applied. It is computed on multicore CPUs and NVIDIA GPUs. The method combines feature extraction based on a 1D discrete wavelet transform (DWT) applied in the spectral dimension followed by an Extended Morphological Profile (EMP) and a classifier (SVM or ELM). The multi-component noise reduction is applied to the EMP just before the classification. The denoising recursively applies a separable 2D DWT, after which the number of wavelet coefficients is reduced by using a threshold. Finally, inverse 2D DWT filters are applied to reconstruct the noise-free original component. The computational cost of the classifiers, as well as the cost of the whole classification chain, is high, but it is reduced, achieving real-time behavior for some applications, through their computation on NVIDIA multi-GPU platforms.
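The transform, threshold, and inverse-transform sequence described in this abstract can be sketched in one dimension. The following is a simplified illustration only: it uses a single-level 1D Haar transform with hard thresholding, whereas the paper applies a separable 2D DWT recursively. The wavelet family, the threshold rule, and all function names here are assumptions.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_forward(x):
    """One level of the orthonormal Haar DWT; len(x) must be even."""
    approx = [(x[i] + x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Exact inverse of haar_forward."""
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / SQRT2, (a - d) / SQRT2])
    return out


def hard_threshold(coeffs, t):
    """Zero every coefficient whose magnitude is below t."""
    return [c if abs(c) >= t else 0.0 for c in coeffs]


def denoise(x, t):
    """Transform, suppress small detail coefficients, and reconstruct."""
    approx, detail = haar_forward(x)
    return haar_inverse(approx, hard_threshold(detail, t))
```

With a threshold of 1.0, the small fluctuation in `[4.0, 4.2, 9.0, 1.0]` is smoothed to its pair average while the large jump between 9.0 and 1.0 is preserved.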
Accelerating Multiple Compound Comparison Using LINGO-Based Load-Balancing Strategies on Multi-GPUs.

PubMed

Lin, Chun-Yuan; Wang, Chung-Hung; Hung, Che-Lun; Lin, Yu-Shiang

2015-01-01

Compound comparison is an important task in computational chemistry. From the comparison results, potential inhibitors can be found and then used in pharmaceutical experiments. The time complexity of a pairwise compound comparison is O(n^2), where n is the maximal length of the compounds. In general, the length of compounds is tens to hundreds, and the computation time is small. However, more and more compounds have now been synthesized and extracted, even more than tens of millions. Therefore, comparison against a large number of compounds (seen as a multiple compound comparison problem, abbreviated to MCC) is still time-consuming. The intrinsic time complexity of the MCC problem is O(k^2 n^2) with k compounds of maximal length n. In this paper, we propose a GPU-based algorithm for the MCC problem, called CUDA-MCC, on single- and multi-GPUs. Four LINGO-based load-balancing strategies are considered in CUDA-MCC in order to accelerate the computation speed among thread blocks on GPUs. CUDA-MCC was implemented in C+OpenMP+CUDA. In the experimental results, CUDA-MCC ran 45 times and 391 times faster than its CPU version on a single NVIDIA Tesla K20m GPU card and a dual-NVIDIA Tesla K20m GPU card, respectively.
Best Performers Announced for the NCI-CPTAC DREAM Proteogenomics Computational Challenge | Office of Cancer Clinical Proteomics Research

Cancer.gov

The National Cancer Institute (NCI) Clinical Proteomic Tumor Analysis Consortium (CPTAC) is pleased to announce that teams led by Jaewoo Kang (Korea University), and Yuanfang Guan with Hongyang Li (University of Michigan) as the best performers of the NCI-CPTAC DREAM Proteogenomics Computational Challenge. Over 500 participants from 20 countries registered for the Challenge, which offered $25,000 in cash awards contributed by the NVIDIA Foundation through its Compute the Cure initiative.

Using Advanced Computing in Applied Dynamics: From the Dynamics of Granular Material to the Motion of the Mars Rover

DTIC Science & Technology

2013-08-26

USING ADVANCED COMPUTING IN APPLIED DYNAMICS : FROM THE DYNAMICS OF GRANULAR MATERIAL TO THE MOTION OF THE MARS ROVER Dan Negrut NVIDIA CUDA...USING ADVANCED COMPUTING IN APPLIED DYNAMICS : FROM THE DYNAMICS OF GRANULAR MATERIAL TO THE MOTION OF THE MARS ROVER 5a. CONTRACT NUMBER W911NF-11-F...University of Parma, Italy • Drs. Paramsothy Jayakumar & David Lamb, US Army TARDEC • Mihai Anitescu, University of Chicago & Argonne National Lab
Performance of parallel computation using CUDA for solving the one-dimensional elasticity equations

NASA Astrophysics Data System (ADS)

Darmawan, J. B. B.; Mungkasi, S.

2017-01-01

In this paper, we investigate the performance of parallel computation in solving the one-dimensional elasticity equations. Elasticity equations are commonly used in engineering science, and solving them quickly and efficiently is desirable. Therefore, we propose the use of parallel computation. Our parallel computation uses NVIDIA's CUDA. Our research results show that parallel computation using CUDA has a great advantage and is powerful when the computation is of large scale.
CUDA programs for the GPU computing of the Swendsen-Wang multi-cluster spin flip algorithm: 2D and 3D Ising, Potts, and XY models

NASA Astrophysics Data System (ADS)

Komura, Yukihiro; Okabe, Yutaka

2014-03-01

We present sample CUDA programs for the GPU computing of the Swendsen-Wang multi-cluster spin flip algorithm. We deal with the classical spin models; the Ising model, the q-state Potts model, and the classical XY model. As for the lattice, both the 2D (square) lattice and the 3D (simple cubic) lattice are treated. We already reported the idea of the GPU implementation for 2D models (Komura and Okabe, 2012). We here explain the details of sample programs, and discuss the performance of the present GPU implementation for the 3D Ising and XY models. We also show the calculated results of the moment ratio for these models, and discuss phase transitions.
Catalogue identifier: AERM_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AERM_v1_0.html
Program obtainable from: CPC Program Library, Queen’s University, Belfast, N. Ireland
Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html
No. of lines in distributed program, including test data, etc.: 5632
No. of bytes in distributed program, including test data, etc.: 14688
Distribution format: tar.gz
Programming language: C, CUDA.
Computer: System with an NVIDIA CUDA enabled GPU.
Operating system: System with an NVIDIA CUDA enabled GPU.
Classification: 23.
External routines: NVIDIA CUDA Toolkit 3.0 or newer
Nature of problem: Monte Carlo simulation of classical spin systems. Ising, q-state Potts model, and the classical XY model are treated for both two-dimensional and three-dimensional lattices.
Solution method: GPU-based Swendsen-Wang multi-cluster spin flip Monte Carlo method. The CUDA implementation for the cluster-labeling is based on the work by Hawick et al. [1] and that by Kalentev et al. [2].
Restrictions: The system size is limited depending on the memory of a GPU.
Running time: For the parameters used in the sample programs, it takes about a minute for each program. Of course, it depends on the system size, the number of Monte Carlo steps, etc.
References: [1] K
26th JANNAF Airbreathing Propulsion Subcommittee Meeting. Volume 1

NASA Technical Reports Server (NTRS)

Fry, Ronald S. (Editor); Gannaway, Mary T. (Editor)

2002-01-01

This volume, the first of four volumes, is a collection of 28 unclassified/unlimited-distribution papers which were presented at the Joint Army-Navy-NASA-Air Force (JANNAF) 26th Airbreathing Propulsion Subcommittee (APS) meeting, which was held jointly with the 38th Combustion Subcommittee (CS), 20th Propulsion Systems Hazards Subcommittee (PSHS), and 2nd Modeling and Simulation Subcommittee. The meeting was held 8-12 April 2002 at the Bayside Inn at The Sandestin Golf & Beach Resort and Eglin Air Force Base, Destin, Florida. Topics covered include: scramjet and ramjet R&D program overviews; tactical propulsion; space access; NASA GTX status; PDE technology; actively cooled engine structures; modeling and simulation of complex hydrocarbon fuels and unsteady processes; and component modeling and simulation.
GPU Computing in Bayesian Inference of Realized Stochastic Volatility Model

NASA Astrophysics Data System (ADS)

Takaishi, Tetsuya

2015-01-01

The realized stochastic volatility (RSV) model, which utilizes the realized volatility as additional information, has been proposed to infer the volatility of financial time series. We consider the Bayesian inference of the RSV model by the Hybrid Monte Carlo (HMC) algorithm. The HMC algorithm can be parallelized and thus performed on the GPU for speedup. The GPU code is developed with CUDA Fortran. We compare the computational time in performing the HMC algorithm on GPU (GTX 760) and CPU (Intel i7-4770 3.4GHz) and find that the GPU can be up to 17 times faster than the CPU. We also code the program with OpenACC and find that appropriate coding can achieve a speedup similar to that of CUDA Fortran.
High-Performance Analysis of Filtered Semantic Graphs

DTIC Science & Technology

2012-05-06

any other provision of law, no person shall be subject to a penalty for failing to comply with a collection of information if it does not display a...observation that explains why SEJITS+KDT performance is so close to CombBLAS performance in practice (as shown later in Section 7) even though its in-core...NEC, Nokia, NVIDIA, Oracle, and Samsung. This research used resources of the National Energy Research Scientific Computing Center, which is
FANTOM: Algorithm-Architecture Codesign for High-Performance Embedded Signal and Image Processing Systems

DTIC Science & Technology

2013-05-25

graphics processors by IBM, AMD, and nVIDIA. They are between general-purpose processors and special-purpose processors. In Phase II. 3.10 Measure of...particular, Dr. Kevin Irick started a company Silicon Scapes and he has been the CEO. 5 Implications for Related/Future Research We speculate that...final project report in Jan. 2011. At the test and validation stage of the project. FANTOM’s partner at Raytheon quit from his company and hence from
Hardware and Software Design of FPGA-based PCIe Gen3 interface for APEnet+ network interconnect system

NASA Astrophysics Data System (ADS)

Ammendola, R.; Biagioni, A.; Frezza, O.; Lo Cicero, F.; Lonardo, A.; Martinelli, M.; Paolucci, P. S.; Pastorelli, E.; Rossetti, D.; Simula, F.; Tosoratto, L.; Vicini, P.

2015-12-01

In the attempt to develop an interconnection architecture optimized for hybrid HPC systems dedicated to scientific computing, we designed APEnet+, a point-to-point, low-latency and high-performance network controller supporting 6 fully bidirectional off-board links over a 3D torus topology. The first release of APEnet+ (named V4) was a board based on a 40 nm Altera FPGA, integrating 6 channels at 34 Gbps of raw bandwidth per direction and a PCIe Gen2 x8 host interface. It has been the first-of-its-kind device to implement an RDMA protocol to directly read/write data from/to Fermi and Kepler NVIDIA GPUs using NVIDIA peer-to-peer and GPUDirect RDMA protocols, obtaining real zero-copy GPU-to-GPU transfers over the network. The latest generation of APEnet+ systems (now named V5) implements a PCIe Gen3 x8 host interface on a 28 nm Altera Stratix V FPGA, with multi-standard fast transceivers (up to 14.4 Gbps) and an increased amount of configurable internal resources and hardware IP cores to support main interconnection standard protocols. Herein we present the APEnet+ V5 architecture, the status of its hardware and its system software design. Both its Linux Device Driver and the low-level libraries have been redeveloped to support the PCIe Gen3 protocol, introducing optimizations and solutions based on hardware/software co-design.
Fast parallel tandem mass spectral library searching using GPU hardware acceleration

PubMed Central

Baumgardner, Lydia Ashleigh; Shanmugam, Avinash Kumar; Lam, Henry; Eng, Jimmy K.; Martin, Daniel B.

2011-01-01

Mass spectrometry-based proteomics is a maturing discipline of biologic research that is experiencing substantial growth. Instrumentation has steadily improved over time with the advent of faster and more sensitive instruments collecting ever larger data files. Consequently, the computational process of matching a peptide fragmentation pattern to its sequence, traditionally accomplished by sequence database searching and more recently also by spectral library searching, has become a bottleneck in many mass spectrometry experiments. In both of these methods, the main rate limiting step is the comparison of an acquired spectrum with all potential matches from a spectral library or sequence database. This is a highly parallelizable process because the core computational element can be represented as a simple but arithmetically intense multiplication of two vectors. In this paper we present a proof of concept project taking advantage of the massively parallel computing available on graphics processing units (GPUs) to distribute and accelerate the process of spectral assignment using spectral library searching. This program, which we have named FastPaSS (for Fast Parallelized Spectral Searching) is implemented in CUDA (Compute Unified Device Architecture) from NVIDIA which allows direct access to the processors in an NVIDIA GPU. Our efforts demonstrate the feasibility of GPU computing for spectral assignment, through implementation of the validated spectral searching algorithm SpectraST in the CUDA environment. PMID:21545112
OpenACC acceleration of an unstructured CFD solver based on a reconstructed discontinuous Galerkin method for compressible flows

DOE PAGES

Xia, Yidong; Lou, Jialin; Luo, Hong; ...

2015-02-09

Here, an OpenACC directive-based graphics processing unit (GPU) parallel scheme is presented for solving the compressible Navier–Stokes equations on 3D hybrid unstructured grids with a third-order reconstructed discontinuous Galerkin method. The developed scheme requires the minimum code intrusion and algorithm alteration for upgrading a legacy solver with the GPU computing capability at very little extra effort in programming, which leads to a unified and portable code development strategy. A face coloring algorithm is adopted to eliminate the memory contention because of the threading of internal and boundary face integrals. A number of flow problems are presented to verify the implementation of the developed scheme. Timing measurements were obtained by running the resulting GPU code on one Nvidia Tesla K20c GPU card (Nvidia Corporation, Santa Clara, CA, USA) and compared with those obtained by running the equivalent Message Passing Interface (MPI) parallel CPU code on a compute node (consisting of two AMD Opteron 6128 eight-core CPUs (Advanced Micro Devices, Inc., Sunnyvale, CA, USA)). Speedup factors of up to 24× and 1.6× for the GPU code were achieved with respect to one and 16 CPU cores, respectively. The numerical results indicate that this OpenACC-based parallel scheme is an effective and extensible approach to port unstructured high-order CFD solvers to GPU computing.
DOE Office of Scientific and Technical Information (OSTI.GOV)

Allada, Veerendra; Benjegerdes, Troy; Bode, Brett

Commodity clusters augmented with application accelerators are evolving as competitive high performance computing systems. The Graphical Processing Unit (GPU), with a very high arithmetic density and performance per price ratio, is a good platform for scientific application acceleration. In addition to the interconnect bottlenecks among the cluster compute nodes, the cost of memory copies between the host and the GPU device has to be carefully amortized to improve the overall efficiency of the application. Scientific applications also rely on efficient implementation of the BAsic Linear Algebra Subroutines (BLAS), among which the General Matrix Multiply (GEMM) is considered the workhorse subroutine. In this paper, the authors study the performance of the memory copies and GEMM subroutines that are critical to porting computational chemistry algorithms to GPU clusters. To that end, a benchmark based on the NetPIPE framework is developed to evaluate the latency and bandwidth of the memory copies between the host and the GPU device. The performance of the single and double precision GEMM subroutines from the NVIDIA CUBLAS 2.0 library is studied. The results have been compared with those of the BLAS routines from the Intel Math Kernel Library (MKL) to understand the computational trade-offs. The test bed is an Intel Xeon cluster equipped with NVIDIA Tesla GPUs.
Accelerating Multiple Compound Comparison Using LINGO-Based Load-Balancing Strategies on Multi-GPUs

PubMed Central

Lin, Chun-Yuan; Wang, Chung-Hung; Hung, Che-Lun; Lin, Yu-Shiang

2015-01-01

Compound comparison is an important task in computational chemistry. From the comparison results, potential inhibitors can be found and then used in pharmaceutical experiments. The time complexity of a pairwise compound comparison is O(n^2), where n is the maximal length of the compounds. In general, the length of compounds is tens to hundreds, and the computation time is small. However, more and more compounds have now been synthesized and extracted, even more than tens of millions. Therefore, comparison against a large number of compounds (seen as a multiple compound comparison problem, abbreviated to MCC) is still time-consuming. The intrinsic time complexity of the MCC problem is O(k^2 n^2) with k compounds of maximal length n. In this paper, we propose a GPU-based algorithm for the MCC problem, called CUDA-MCC, on single- and multi-GPUs. Four LINGO-based load-balancing strategies are considered in CUDA-MCC in order to accelerate the computation speed among thread blocks on GPUs. CUDA-MCC was implemented in C+OpenMP+CUDA. In the experimental results, CUDA-MCC ran 45 times and 391 times faster than its CPU version on a single NVIDIA Tesla K20m GPU card and a dual-NVIDIA Tesla K20m GPU card, respectively. PMID:26491652
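To make the all-pairs structure of the MCC problem concrete, the sketch below compares every pair of compounds using LINGO-style profiles, taken here to be overlapping q-character substrings of a SMILES string. It is an illustration under stated assumptions, not CUDA-MCC itself: q = 4, the multiset Tanimoto coefficient, and the function names are choices made for this example, and the paper's exact scoring may differ.

```python
from collections import Counter
from itertools import combinations


def lingos(smiles, q=4):
    """Multiset of overlapping q-character substrings of a SMILES string."""
    return Counter(smiles[i:i + q] for i in range(len(smiles) - q + 1))


def tanimoto(a, b):
    """Multiset Tanimoto coefficient: shared substrings over total substrings."""
    shared = sum((a & b).values())
    total = sum((a | b).values())
    return shared / total if total else 0.0


def compare_all(compounds, q=4):
    """Score all k*(k-1)/2 unordered pairs; this is the k^2 factor in the cost."""
    profiles = [lingos(s, q) for s in compounds]
    return {
        (i, j): tanimoto(profiles[i], profiles[j])
        for i, j in combinations(range(len(profiles)), 2)
    }
```

Because each pair is scored independently, the pairs can be distributed across thread blocks; uneven compound lengths then make the per-pair work uneven, which is presumably where load-balancing strategies become important.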
Fast parallel tandem mass spectral library searching using GPU hardware acceleration.

PubMed

Baumgardner, Lydia Ashleigh; Shanmugam, Avinash Kumar; Lam, Henry; Eng, Jimmy K; Martin, Daniel B

2011-06-03

Mass spectrometry-based proteomics is a maturing discipline of biologic research that is experiencing substantial growth. Instrumentation has steadily improved over time with the advent of faster and more sensitive instruments collecting ever larger data files. Consequently, the computational process of matching a peptide fragmentation pattern to its sequence, traditionally accomplished by sequence database searching and more recently also by spectral library searching, has become a bottleneck in many mass spectrometry experiments. In both of these methods, the main rate-limiting step is the comparison of an acquired spectrum with all potential matches from a spectral library or sequence database. This is a highly parallelizable process because the core computational element can be represented as a simple but arithmetically intense multiplication of two vectors. In this paper, we present a proof of concept project taking advantage of the massively parallel computing available on graphics processing units (GPUs) to distribute and accelerate the process of spectral assignment using spectral library searching. This program, which we have named FastPaSS (for Fast Parallelized Spectral Searching), is implemented in CUDA (Compute Unified Device Architecture) from NVIDIA, which allows direct access to the processors in an NVIDIA GPU. Our efforts demonstrate the feasibility of GPU computing for spectral assignment, through implementation of the validated spectral searching algorithm SpectraST in the CUDA environment.
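The abstract above describes the core comparison as a multiplication of two vectors. A minimal sketch of that idea in plain Python follows, assuming each spectrum has already been binned into an equal-length intensity vector. The function names, the normalization, and the toy library are illustrative assumptions; this is not the FastPaSS or SpectraST scoring code.

```python
import math

def normalized_dot(a, b):
    """Cosine-style similarity between two binned intensity vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def best_library_match(query, library):
    """Return (name, score) of the library spectrum most similar to the query."""
    scored = ((name, normalized_dot(query, spec)) for name, spec in library.items())
    return max(scored, key=lambda item: item[1])

# Hypothetical toy library: two spectra binned into four m/z bins.
library = {
    "peptide_A": [0.0, 5.0, 1.0, 0.0],
    "peptide_B": [3.0, 0.0, 0.0, 4.0],
}
query = [0.0, 4.0, 1.0, 0.0]
name, score = best_library_match(query, library)
print(name, round(score, 4))
```

Every query-versus-library score is independent of the others, which is the property that makes this kind of search a natural fit for one-comparison-per-thread GPU execution.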
GPU-powered model analysis with PySB/cupSODA.

PubMed

Harris, Leonard A; Nobile, Marco S; Pino, James C; Lubbock, Alexander L R; Besozzi, Daniela; Mauri, Giancarlo; Cazzaniga, Paolo; Lopez, Carlos F

2017-11-01

A major barrier to the practical utilization of large, complex models of biochemical systems is the lack of open-source computational tools to evaluate model behaviors over high-dimensional parameter spaces. This is due to the high computational expense of performing thousands to millions of model simulations required for statistical analysis. To address this need, we have implemented a user-friendly interface between cupSODA, a GPU-powered kinetic simulator, and PySB, a Python-based modeling and simulation framework. For three example models of varying size, we show that for large numbers of simulations PySB/cupSODA achieves order-of-magnitude speedups relative to a CPU-based ordinary differential equation integrator. The PySB/cupSODA interface has been integrated into the PySB modeling framework (version 1.4.0), which can be installed from the Python Package Index (PyPI) using a Python package manager such as pip. cupSODA source code and precompiled binaries (Linux, Mac OS/X, Windows) are available at github.com/aresio/cupSODA (requires an Nvidia GPU; developer.nvidia.com/cuda-gpus). Additional information about PySB is available at pysb.org. Contact: paolo.cazzaniga@unibg.it or c.lopez@vanderbilt.edu. Supplementary data are available at Bioinformatics online. © The Author(s) 2017. Published by Oxford University Press.
The CUBLAS and CULA based GPU acceleration of adaptive finite element framework for bioluminescence tomography.

PubMed

Zhang, Bo; Yang, Xiang; Yang, Fei; Yang, Xin; Qin, Chenghu; Han, Dong; Ma, Xibo; Liu, Kai; Tian, Jie

2010-09-13

In molecular imaging (MI), especially optical molecular imaging, bioluminescence tomography (BLT) has emerged as an effective imaging modality for small animal imaging. Finite element methods (FEMs), especially the adaptive finite element (AFE) framework, play an important role in BLT. The processing speed of the FEMs and the AFE framework still needs to be improved, although multi-thread CPU technology and multi-CPU technology have already been applied. In this paper, we introduce, for the first time, graphics processing unit (GPU) acceleration of the AFE framework for BLT. Besides improving processing speed, GPU technology offers a balance between cost and performance. CUBLAS and CULA are two important and powerful libraries for programming on NVIDIA GPUs. With the help of CUBLAS and CULA, it is easy to code on an NVIDIA GPU without worrying about the hardware details of a specific GPU. Numerical experiments are designed to show the necessity, effect and application of the proposed CUBLAS and CULA based GPU acceleration. From the results of the experiments, we conclude that the proposed CUBLAS and CULA based GPU acceleration method can substantially improve the processing speed of the AFE framework while maintaining a balance between cost and performance.
Communication Avoiding Rank Revealing QR Factorization with Column Pivoting

DTIC Science & Technology

2013-05-03

person shall be subject to a penalty for failing to comply with a collection of information if it does not display a currently valid OMB control number...ParLab affiliates National Instruments, Nokia, NVIDIA, Oracle, and Samsung, and support from MathWorks. We also acknowledge the support of the US...bounds from equation (1.3). In practice the QR factorization with column pivoting often works well, and it is widely used even if it is known to fail, for
Preliminary Sizing Completed for Single-Stage-To-Orbit Launch Vehicles Powered By Rocket-Based Combined Cycle Technology

NASA Technical Reports Server (NTRS)

Roche, Joseph M.

2002-01-01

Single-stage-to-orbit (SSTO) propulsion remains an elusive goal for launch vehicles. The physics of the problem is leading developers to a search for higher propulsion performance than is available with all-rocket power. Rocket-based combined cycle (RBCC) technology provides additional propulsion performance that may enable SSTO flight. Structural efficiency is also a major driving force in enabling SSTO flight. Increases in performance with RBCC propulsion are offset with the added size of the propulsion system. Geometrical considerations must be exploited to minimize the weight. Integration of the propulsion system with the vehicle must be carefully planned such that aeroperformance is not degraded and the air-breathing performance is enhanced. Consequently, the vehicle's structural architecture becomes one with the propulsion system architecture. Geometrical considerations applied to the integrated vehicle lead to low drag and high structural and volumetric efficiency. Sizing of the SSTO launch vehicle (GTX) is itself an elusive task. The weight of the vehicle depends strongly on the propellant required to meet the mission requirements. Changes in propellant requirements result in changes in the size of the vehicle, which in turn, affect the weight of the vehicle and change the propellant requirements. An iterative approach is necessary to size the vehicle to meet the flight requirements. GTX Sizer was developed to do exactly this. The governing geometry was built into a spreadsheet model along with scaling relationships. The scaling laws attempt to maintain structural integrity as the vehicle size is changed. Key aerodynamic relationships are maintained as the vehicle size is changed. The closed weight and center of gravity are displayed graphically on a plot of the synthesized vehicle. In addition, comprehensive tabular data of the subsystem weights and centers of gravity are generated. The model has been verified for accuracy with finite element analysis. The
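The circular dependence described in this abstract (vehicle weight sets propellant need, which in turn sets vehicle weight) can be illustrated with a fixed-point iteration. The sketch below is a toy model only: the constant propellant and structure mass fractions, the function name, and all numbers are hypothetical and are not taken from GTX Sizer, which the abstract describes as a spreadsheet model with its own geometry and scaling relationships.

```python
def size_vehicle(payload_mass, propellant_fraction, structure_fraction,
                 tol=1e-10, max_iter=10_000):
    """Iterate gross mass until it stops changing (the vehicle "closes").

    Toy assumption: propellant and structure masses are fixed fractions
    of gross mass, so each pass re-derives them from the latest estimate.
    """
    gross = payload_mass  # initial guess: payload only
    for iteration in range(1, max_iter + 1):
        propellant = propellant_fraction * gross
        structure = structure_fraction * gross
        new_gross = payload_mass + propellant + structure
        if abs(new_gross - gross) <= tol * new_gross:
            return new_gross, iteration
        gross = new_gross
    raise RuntimeError("sizing did not close; fractions may sum to 1 or more")

# Hypothetical inputs: 1000 kg payload, 80% propellant, 15% structure.
gross_mass, iterations = size_vehicle(1000.0, 0.80, 0.15)
print(f"closed gross mass ~ {gross_mass:.1f} kg after {iterations} iterations")
```

For this toy model the closed answer is also available analytically as payload / (1 - propellant fraction - structure fraction), which gives a convenient check on the iteration. A realistic sizing model has no such closed form, which is why an iterative approach is needed.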
PSP toxin levels and plankton community composition and abundance in size-fractionated vertical profiles during spring/summer blooms of the toxic dinoflagellate Alexandrium fundyense in the Gulf of Maine and on Georges Bank, 2007, 2008, and 2010: 1. Toxin levels.

PubMed

Deeds, Jonathan R; Petitpas, Christian M; Shue, Vangie; White, Kevin D; Keafer, Bruce A; McGillicuddy, Dennis J; Milligan, Peter J; Anderson, Donald M; Turner, Jefferson T

2014-05-01

As part of the NOAA ECOHAB funded Gulf of Maine Toxicity (GOMTOX) project, we determined Alexandrium fundyense abundance, paralytic shellfish poisoning (PSP) toxin composition, and concentration in quantitatively-sampled size-fractionated (20-64, 64-100, 100-200, 200-500, and > 500 μm) particulate water samples, and the community composition of potential grazers of A. fundyense in these size fractions, at multiple depths (typically 1, 10, 20 m, and near-bottom) during 10 large-scale sampling cruises during the A. fundyense bloom season (May-August) in the coastal Gulf of Maine and on Georges Bank in 2007, 2008, and 2010. Our findings were as follows: (1) when all sampling stations and all depths were summed by year, the majority (94% ± 4%) of total PSP toxicity was contained in the 20-64 μm size fraction; (2) when further analyzed by depth, the 20-64 μm size fraction was the primary source of toxin for 97% of the stations and depths sampled over three years; (3) overall PSP toxin profiles were fairly consistent during the three seasons of sampling with gonyautoxins (1, 2, 3, and 4) dominating (90.7% ± 5.5%), followed by the carbamate toxins saxitoxin (STX) and neosaxitoxin (NEO) (7.7% ± 4.5%), followed by n-sulfocarbamoyl toxins (C1 and 2, GTX5) (1.3% ± 0.6%), followed by all decarbamoyl toxins (dcSTX, dcNEO, dcGTX2&3) (< 1%), although differences were noted between PSP toxin compositions for nearshore coastal Gulf of Maine sampling stations compared to offshore Georges Bank sampling stations for 2 out of 3 years; (4) surface cell counts of A. fundyense were a fairly reliable predictor of the presence of toxins throughout the water column; and (5) nearshore surface cell counts of A. fundyense in the coastal Gulf of Maine were not a reliable predictor of A. fundyense populations offshore on Georges Bank for 2 out of the 3 years sampled.
Rocket-Based Combined Cycle Flowpath Testing for Modes 1 and 4

NASA Technical Reports Server (NTRS)

Rice, Tharen

2002-01-01

Under sponsorship of the NASA Glenn Research Center (NASA GRC), the Johns Hopkins University Applied Physics Laboratory (JHU/APL) designed and built a five-inch diameter, Rocket-Based Combined Cycle (RBCC) engine to investigate mode 1 and mode 4 engine performance as well as Mach 4 inlet performance. This engine was designed so that engine area and length ratios were similar to those of the NASA GRC GTX engine. Unlike the GTX semi-circular engine design, the APL engine is completely axisymmetric. For this design, a traditional rocket thruster was installed inside of the scramjet flowpath, along the engine centerline. A three-part test series was conducted to determine Mode 1 and Mode 4 engine performance. In part one, testing of the rocket thruster alone was accomplished and its performance determined (average Isp efficiency = 90%). In part two, Mode 1 (air-augmented rocket) testing was conducted at a nominal chamber pressure-to-ambient pressure ratio of 100 with the engine inlet fully open. Results showed that there was neither a thrust increment nor decrement over rocket-only thrust during Mode 1 operation. In part three, Mode 4 testing was conducted with chamber pressure-to-ambient pressure ratios lower than desired (80 instead of 600) with the inlet fully closed. Results for this testing showed a performance decrease of 20% as compared to the rocket-only testing. It is felt that these results are directly related to the low pressure ratio tested and not the engine design. During this program, Mach 4 inlet testing was also conducted. For these tests, a moveable centerbody was tested to determine the maximum contraction ratio for the engine design. The experimental results agreed with CFD results obtained by NASA GRC, showing a maximum geometric contraction ratio of approximately 10.5. This report details the hardware design, test setup, experimental results and data analysis associated with the aforementioned tests.
Zebrafish neurotoxicity from aphantoxins--cyanobacterial paralytic shellfish poisons (PSPs) from Aphanizomenon flos-aquae DC-1.

PubMed

Zhang, Delu; Hu, Chunxiang; Wang, Gaohong; Li, Dunhai; Li, Genbao; Liu, Yongding

2013-05-01

Aphanizomenon flos-aquae (A. flos-aquae), a cyanobacterium frequently encountered in water blooms worldwide, is a source of neurotoxins known as PSPs or aphantoxins that present a major threat to the environment and to human health. Although the molecular mechanism of PSP action is well known, many unresolved questions remain concerning its mechanisms of toxicity. Aphantoxins purified from a natural isolate of A. flos-aquae DC-1 were analyzed by high-performance liquid chromatography (HPLC); the major component toxins were gonyautoxins 1 and 5 (GTX1 and GTX5, 34.04% and 21.28%, respectively) and neosaxitoxin (neoSTX, 12.77%). The LD50 of the aphantoxin preparation was determined to be 11.33 μg/kg (7.75 μg saxitoxin equivalents (STXeq) per kg) following intraperitoneal injection of zebrafish (Danio rerio). To address the neurotoxicology of the aphantoxin preparation, zebrafish were injected with low and high sublethal doses of A. flos-aquae DC-1 toxins, 7.73 and 9.28 μg/kg (5.3 and 6.4 μg STXeq/kg, respectively), and brain tissues were analyzed by electron microscopy and RT-PCR at different timepoints postinjection. Low-dose aphantoxin exposure was associated with chromatin condensation, cell-membrane blebbing, and the appearance of apoptotic bodies. High-dose exposure was associated with cytoplasmic vacuolization, mitochondrial swelling, and expansion of the endoplasmic reticulum. At early timepoints (3 h) many cells exhibited characteristic features of both apoptosis and necrosis. At later timepoints apoptosis appeared to predominate in the low-dose group, whereas necrosis predominated in the high-dose group. RT-PCR revealed that mRNA levels of the apoptosis-related genes encoding p53, Bax, caspase-3, and c-Jun were upregulated after aphantoxin exposure, but there was no evidence of DNA laddering; apoptosis could take place by pathways independent of DNA fragmentation. These results demonstrate that aphantoxin exposure can cause cell death in zebrafish

PSP toxin levels and plankton community composition and abundance in size-fractionated vertical profiles during spring/summer blooms of the toxic dinoflagellate Alexandrium fundyense in the Gulf of Maine and on Georges Bank, 2007, 2008, and 2010: 1. Toxin levels

PubMed Central

Deeds, Jonathan R.; Petitpas, Christian M.; Shue, Vangie; White, Kevin D.; Keafer, Bruce A.; McGillicuddy, Dennis J.; Milligan, Peter J.; Anderson, Donald M.; Turner, Jefferson T.

2014-01-01

As part of the NOAA ECOHAB funded Gulf of Maine Toxicity (GOMTOX) project, we determined Alexandrium fundyense abundance, paralytic shellfish poisoning (PSP) toxin composition, and concentration in quantitatively-sampled size-fractionated (20–64, 64–100, 100–200, 200–500, and > 500 μm) particulate water samples, and the community composition of potential grazers of A. fundyense in these size fractions, at multiple depths (typically 1, 10, 20 m, and near-bottom) during 10 large-scale sampling cruises during the A. fundyense bloom season (May–August) in the coastal Gulf of Maine and on Georges Bank in 2007, 2008, and 2010. Our findings were as follows: (1) when all sampling stations and all depths were summed by year, the majority (94% ± 4%) of total PSP toxicity was contained in the 20–64 μm size fraction; (2) when further analyzed by depth, the 20–64 μm size fraction was the primary source of toxin for 97% of the stations and depths sampled over three years; (3) overall PSP toxin profiles were fairly consistent during the three seasons of sampling with gonyautoxins (1, 2, 3, and 4) dominating (90.7% ± 5.5%), followed by the carbamate toxins saxitoxin (STX) and neosaxitoxin (NEO) (7.7% ± 4.5%), followed by n-sulfocarbamoyl toxins (C1 and 2, GTX5) (1.3% ± 0.6%), followed by all decarbamoyl toxins (dcSTX, dcNEO, dcGTX2&3) (< 1%), although differences were noted between PSP toxin compositions for nearshore coastal Gulf of Maine sampling stations compared to offshore Georges Bank sampling stations for 2 out of 3 years; (4) surface cell counts of A. fundyense were a fairly reliable predictor of the presence of toxins throughout the water column; and (5) nearshore surface cell counts of A. fundyense in the coastal Gulf of Maine were not a reliable predictor of A. fundyense populations offshore on Georges Bank for 2 out of the 3 years sampled. PMID:25076816
Stochastic first passage time accelerated with CUDA

NASA Astrophysics Data System (ADS)

Pierro, Vincenzo; Troiano, Luigi; Mejuto, Elena; Filatrella, Giovanni

2018-05-01

The time for a stochastic trajectory to pass a threshold, estimated by numerical integration, is an interesting physical quantity, for instance in Josephson junctions and atomic force microscopy, where the full trajectory is not accessible. We propose an algorithm suitable for efficient implementation on a graphical processing unit in the CUDA environment. For well-balanced loads, the proposed approach achieves almost perfect scaling with the number of available threads and processors, and allows an acceleration of about 400× with a GTX980 GPU with respect to a standard multicore CPU. With an off-the-shelf GPU, this method makes it possible to tackle problems that are otherwise prohibitive, such as thermal activation in slowly tilted potentials. In particular, we demonstrate that it is possible to simulate the switching current distributions of Josephson junctions on the timescale of actual experiments.
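As a rough illustration of the quantity being estimated, the sketch below integrates many independent stochastic trajectories with the Euler-Maruyama scheme and records when each first crosses a threshold. It is a sequential, plain-Python toy under stated assumptions: the model (constant drift plus white noise), the function names, and the parameter values are all hypothetical, and the code is not the authors' CUDA algorithm.

```python
import math
import random

def first_passage_time(drift, noise, threshold, dt, rng, max_steps=1_000_000):
    """Integrate dx = drift*dt + sqrt(2*noise*dt)*xi from x = 0 and return the
    time at which x first reaches `threshold` (None if the step limit is hit)."""
    x = 0.0
    sigma = math.sqrt(2.0 * noise * dt)
    for step in range(1, max_steps + 1):
        x += drift * dt + sigma * rng.gauss(0.0, 1.0)
        if x >= threshold:
            return step * dt
    return None

def mean_first_passage_time(n_trajectories, seed=0, **kwargs):
    """Average the first passage time over independent trajectories."""
    rng = random.Random(seed)
    times = [first_passage_time(rng=rng, **kwargs) for _ in range(n_trajectories)]
    finished = [t for t in times if t is not None]
    return sum(finished) / len(finished)

# Hypothetical parameters: unit drift toward a threshold at x = 1.
mfpt = mean_first_passage_time(200, drift=1.0, noise=0.05, threshold=1.0, dt=0.01)
print(f"estimated mean first passage time: {mfpt:.3f}")
```

Each trajectory depends only on its own random numbers, so trajectories can be assigned to separate threads without communication. One caveat is that trajectories finish at different times, which is presumably why the abstract stresses well-balanced loads.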
Latent uncertainties of the precalculated track Monte Carlo method.

PubMed

Renaud, Marc-André; Roberge, David; Seuntjens, Jan

2015-01-01

calculations, a small (≤ 1 mm) distance-to-agreement error was observed at the Bragg peak. Latent uncertainty was characterized for electrons and found to follow a Poisson distribution with the number of unique tracks per energy. A track bank of 12 energies and 60000 unique tracks per pregenerated energy in water had a size of 2.4 GB and achieved a latent uncertainty of approximately 1% at an optimal efficiency gain over DOSXYZnrc. Larger track banks produced a lower latent uncertainty at the cost of increased memory consumption. Using an NVIDIA GTX 590, efficiency analysis showed an 807× efficiency increase over DOSXYZnrc for 16 MeV electrons in water and 508× for 16 MeV electrons in bone. The PMC method can calculate dose distributions for electrons and protons to a statistical uncertainty of 1% with a large efficiency gain over conventional MC codes. Before performing clinical dose calculations, models to calculate dose contributions from uncharged particles must be implemented. Following the successful implementation of these models, the PMC method will be evaluated as a candidate for inverse planning of modulated electron radiation therapy and scanned proton beams.
A GPU OpenCL based cross-platform Monte Carlo dose calculation engine (goMC).

PubMed

Tian, Zhen; Shi, Feng; Folkerts, Michael; Qin, Nan; Jiang, Steve B; Jia, Xun

2015-10-07

Monte Carlo (MC) simulation has been recognized as the most accurate dose calculation method for radiotherapy. However, the extremely long computation time impedes its clinical application. Recently, a lot of effort has been made to realize fast MC dose calculation on graphic processing units (GPUs). However, most of the GPU-based MC dose engines have been developed under NVidia's CUDA environment. This limits the code portability to other platforms, hindering the introduction of GPU-based MC simulations to clinical practice. The objective of this paper is to develop a GPU OpenCL based cross-platform MC dose engine named goMC with coupled photon-electron simulation for external photon and electron radiotherapy in the MeV energy range. Compared to our previously developed GPU-based MC code named gDPM (Jia et al 2012 Phys. Med. Biol. 57 7783-97), goMC has two major differences. First, it was developed under the OpenCL environment for high code portability and hence could be run not only on different GPU cards but also on CPU platforms. Second, we adopted the electron transport model used in EGSnrc MC package and PENELOPE's random hinge method in our new dose engine, instead of the dose planning method employed in gDPM. Dose distributions were calculated for a 15 MeV electron beam and a 6 MV photon beam in a homogenous water phantom, a water-bone-lung-water slab phantom and a half-slab phantom. Satisfactory agreement between the two MC dose engines goMC and gDPM was observed in all cases. The average dose differences in the regions that received a dose higher than 10% of the maximum dose were 0.48-0.53% for the electron beam cases and 0.15-0.17% for the photon beam cases. In terms of efficiency, goMC was ~4-16% slower than gDPM when running on the same NVidia TITAN card for all the cases we tested, due to both the different electron transport models and the different development environments. The code portability of our new dose engine goMC was validated by
Multi-GPU Accelerated Admittance Method for High-Resolution Human Exposure Evaluation.

PubMed

Xiong, Zubiao; Feng, Shi; Kautz, Richard; Chandra, Sandeep; Altunyurt, Nevin; Chen, Ji

2015-12-01

A multi-graphics processing unit (GPU) accelerated admittance method solver is presented for solving the induced electric field in high-resolution anatomical models of the human body when exposed to external low-frequency magnetic fields. In the solver, the anatomical model is discretized as a three-dimensional network of admittances. The conjugate orthogonal conjugate gradient (COCG) iterative algorithm is employed to take advantage of the symmetric property of the complex-valued linear system of equations. Compared against the widely used biconjugate gradient stabilized method, the COCG algorithm can reduce the solving time by a factor of 3.5 and reduce the storage requirement by about 40%. The iterative algorithm is then accelerated further by using multiple NVIDIA GPUs. The computations and data transfers between GPUs are overlapped in time by using an asynchronous concurrent execution design. The communication overhead is well hidden so that the acceleration is nearly linear with the number of GPU cards. Numerical examples show that our GPU implementation running on four NVIDIA Tesla K20c cards can run 90 times faster than the CPU implementation running on eight CPU cores (two Intel Xeon E5-2603 processors). The implemented solver is able to solve large dimensional problems efficiently. A whole adult body discretized in 1-mm resolution can be solved in just several minutes. The high efficiency achieved makes it practical to investigate human exposure involving a large number of cases with a high resolution that meets the requirements of international dosimetry guidelines.
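The COCG iteration mentioned above has the same shape as the ordinary conjugate gradient method, except that inner products are taken without complex conjugation, which is what exploits a complex symmetric (not Hermitian) matrix. The following is a minimal, unpreconditioned, single-threaded sketch in plain Python on a dense 2x2 example. The function names and test matrix are illustrative assumptions, and nothing here reflects the paper's multi-GPU implementation or its admittance network.

```python
import math

def matvec(A, v):
    """Dense matrix-vector product for a list-of-lists matrix."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def cocg(A, b, tol=1e-12, max_iter=100):
    """Solve A x = b for complex symmetric A (A equals its transpose)."""
    n = len(b)
    x = [0j] * n
    r = [complex(v) for v in b]          # residual for the zero initial guess
    b_norm = math.sqrt(sum(abs(v) ** 2 for v in r))
    if b_norm == 0.0:
        return x
    p = list(r)
    rho = sum(ri * ri for ri in r)       # unconjugated product r^T r
    for _ in range(max_iter):
        q = matvec(A, p)
        alpha = rho / sum(pi * qi for pi, qi in zip(p, q))   # p^T A p, unconjugated
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        if math.sqrt(sum(abs(ri) ** 2 for ri in r)) <= tol * b_norm:
            break
        rho_new = sum(ri * ri for ri in r)
        p = [ri + (rho_new / rho) * pi for ri, pi in zip(r, p)]
        rho = rho_new
    return x

# Hypothetical complex symmetric system (off-diagonal entries are equal).
A = [[4 + 1j, 1 + 0j],
     [1 + 0j, 3 + 2j]]
b = [1 + 0j, 2 + 0j]
x = cocg(A, b)
residual = max(abs(ax - bi) for ax, bi in zip(matvec(A, x), b))
print("max residual:", residual)
```

Because only one matrix-vector product and a handful of vectors are needed per iteration, the method stores less than solvers that track a second (shadow) residual sequence, which is consistent with the storage saving the abstract reports. The unconjugated products can vanish for unlucky inputs, so a production solver would guard against that breakdown.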
PC Scene Generation

NASA Astrophysics Data System (ADS)

Buford, James A., Jr.; Cosby, David; Bunfield, Dennis H.; Mayhall, Anthony J.; Trimble, Darian E.

2007-04-01

AMRDEC has successfully tested hardware and software for Real-Time Scene Generation for IR and SAL Sensors on COTS PC based hardware and video cards. AMRDEC personnel worked with nVidia and Concurrent Computer Corporation to develop a Scene Generation system capable of frame rates of at least 120Hz while frame locked to an external source (such as a missile seeker) with no dropped frames. Latency measurements and image validation were performed using COTS and in-house developed hardware and software. Software for the Scene Generation system was developed using OpenSceneGraph.
Lattice QCD based on OpenCL

NASA Astrophysics Data System (ADS)

Bach, Matthias; Lindenstruth, Volker; Philipsen, Owe; Pinke, Christopher

2013-09-01

We present an OpenCL-based Lattice QCD application using a heatbath algorithm for the pure gauge case and Wilson fermions in the twisted mass formulation. The implementation is platform independent and can be used on AMD or NVIDIA GPUs, as well as on classical CPUs. On the AMD Radeon HD 5870 our double precision D-slash (Dirac operator) implementation performs at 60 GFLOPS over a wide range of lattice sizes. The hybrid Monte Carlo presented reaches a speedup of four over the reference code running on a server CPU.
Analysis and optimization of gyrokinetic toroidal simulations on homogenous and heterogenous platforms

DOE PAGES

Ibrahim, Khaled Z.; Madduri, Kamesh; Williams, Samuel; ...

2013-07-18

The Gyrokinetic Toroidal Code (GTC) uses the particle-in-cell method to efficiently simulate plasma microturbulence. This paper presents novel analysis and optimization techniques to enhance the performance of GTC on large-scale machines. We introduce cell access analysis to better manage locality vs. synchronization tradeoffs on CPU and GPU-based architectures. Finally, our optimized hybrid parallel implementation of GTC uses MPI, OpenMP, and NVIDIA CUDA, achieves up to a 2× speedup over the reference Fortran version on multiple parallel systems, and scales efficiently to tens of thousands of cores.
Construction of the Fock Matrix on a Grid-Based Molecular Orbital Basis Using GPGPUs.

PubMed

Losilla, Sergio A; Watson, Mark A; Aspuru-Guzik, Alán; Sundholm, Dage

2015-05-12

We present a GPGPU implementation of the construction of the Fock matrix in the molecular orbital basis using the fully numerical, grid-based bubbles representation. For a test set of molecules containing up to 90 electrons, the total Hartree-Fock energies obtained from reference GTO-based calculations are reproduced within 10^-4 Eh to 10^-8 Eh for most of the molecules studied. Despite the very large number of arithmetic operations involved, the high performance obtained made the calculations possible on a single Nvidia Tesla K40 GPGPU card.
Parallel Implementation of Numerical Solution of Few-Body Problem Using Feynman's Continual Integrals

NASA Astrophysics Data System (ADS)

Naumenko, Mikhail; Samarin, Viacheslav

2018-02-01

A modern parallel computing algorithm has been applied to the solution of the few-body problem. The approach is based on Feynman's continual integrals method implemented in the C++ programming language using NVIDIA CUDA technology. A wide range of 3-body and 4-body bound systems has been considered, including nuclei described as consisting of protons and neutrons (e.g., ³He and ⁴He) and nuclei described as consisting of clusters and nucleons (e.g., ⁶He). The correctness of the results was checked by comparison with the exactly solvable 4-body oscillatory system and experimental data.
Three-dimensional photoacoustic tomography based on graphics-processing-unit-accelerated finite element method.

PubMed

Peng, Kuan; He, Ling; Zhu, Ziqiang; Tang, Jingtian; Xiao, Jiaying

2013-12-01

Compared with commonly used analytical reconstruction methods, the frequency-domain finite element method (FEM) based approach has proven to be an accurate and flexible algorithm for photoacoustic tomography. However, the FEM-based algorithm is computationally demanding, especially for three-dimensional cases. To enhance the algorithm's efficiency, in this work a parallel computational strategy is implemented in the framework of the FEM-based reconstruction algorithm using a graphic-processing-unit parallel frame named the "compute unified device architecture." A series of simulation experiments is carried out to test the accuracy and accelerating effect of the improved method. The results obtained indicate that the parallel calculation does not change the accuracy of the reconstruction algorithm, while its computational cost is significantly reduced by a factor of 38.9 with a GTX 580 graphics card using the improved method.
JANNAF 25th Airbreathing Propulsion Subcommittee, 37th Combustion Subcommittee and 1st Modeling and Simulation Subcommittee Joint Meeting. Volume 1

NASA Technical Reports Server (NTRS)

Fry, Ronald S.; Becker, Dorothy L.

2000-01-01

Volume I, the first of three volumes, is a compilation of 24 unclassified/unlimited-distribution technical papers presented at the Joint Army-Navy-NASA-Air Force (JANNAF) 25th Airbreathing Propulsion Subcommittee, 37th Combustion Subcommittee and 1st Modeling and Simulation Subcommittee (MSS) meeting held jointly with the 19th Propulsion Systems Hazards Subcommittee. The meeting was held 13-17 November 2000 at the Naval Postgraduate School and Hyatt Regency Hotel, Monterey, California. Topics covered include: a Keynote Address on Future Combat Systems, a review of the new JANNAF Modeling and Simulation Subcommittee, and technical papers on Hyper-X propulsion development and verification; GTX airbreathing launch vehicles; Hypersonic technology development, including program overviews, fuels for advanced propulsion, ramjet and scramjet research, hypersonic test medium effects; and RBCC engine design and performance, and PDE and UCAV advanced and combined cycle engine technologies.
GPU Implementation of High Rayleigh Number Three-Dimensional Mantle Convection

NASA Astrophysics Data System (ADS)

Sanchez, D. A.; Yuen, D. A.; Wright, G. B.; Barnett, G. A.

2010-12-01

Although we have entered the age of petascale computing, many factors are still prohibiting high-performance computing (HPC) from infiltrating all suitable scientific disciplines. For this reason and others, application of GPU to HPC is gaining traction in the scientific world. With its low price point, high performance potential, and competitive scalability, GPU has been an option well worth considering for the last few years. Moreover, with the advent of NVIDIA's Fermi architecture, which brings ECC memory, better double-precision performance, and more RAM to GPU, there is a strong message of corporate support for GPU in HPC. However, many doubts linger concerning the practicality of using GPU for scientific computing. In particular, GPU has a reputation for being difficult to program and suitable for only a small subset of problems. Although inroads have been made in addressing these concerns, for many scientists GPU still has hurdles to clear before becoming an acceptable choice. We explore the applicability of GPU to geophysics by implementing a three-dimensional, second-order finite-difference model of Rayleigh-Benard thermal convection on an NVIDIA GPU using C for CUDA. Our code reaches sufficient resolution, on the order of 500x500x250 evenly-spaced finite-difference gridpoints, on a single GPU. We make extensive use of highly optimized CUBLAS routines, allowing us to achieve performance on the order of 0.1 µs per timestep per gridpoint at this resolution. This performance has allowed us to study high Rayleigh number simulations, on the order of 2x10^7, on a single GPU.
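Two points in this abstract lend themselves to a small worked example: what a second-order finite-difference stencil looks like on evenly spaced gridpoints, and what the quoted throughput implies per timestep. The sketch below is a one-dimensional plain-Python illustration under stated assumptions. The function name is hypothetical, the code is not the authors' CUDA/CUBLAS implementation, and both quoted figures are "on the order of" values, so the derived time is only a rough estimate.

```python
def second_derivative(values, h):
    """Second-order central difference for d2f/dx2 at interior gridpoints
    of an evenly spaced grid with spacing h."""
    return [(values[i - 1] - 2.0 * values[i] + values[i + 1]) / (h * h)
            for i in range(1, len(values) - 1)]

# Check on f(x) = x^2, whose second derivative is exactly 2 everywhere.
h = 0.5
grid = [i * h for i in range(5)]
d2f = second_derivative([x * x for x in grid], h)
print(d2f)

# Rough cost estimate from the figures quoted in the abstract.
gridpoints = 500 * 500 * 250            # about 62.5 million gridpoints
seconds_per_point = 0.1e-6              # about 0.1 microseconds per timestep per gridpoint
seconds_per_timestep = gridpoints * seconds_per_point
print(f"{gridpoints} gridpoints -> roughly {seconds_per_timestep:.2f} s per timestep")
```

The stencil touches only a gridpoint and its immediate neighbors, so every interior point can be updated independently within a timestep, which is the data-parallel structure that suits a GPU.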
Parallel fuzzy connected image segmentation on GPU

PubMed Central

Zhuge, Ying; Cao, Yong; Udupa, Jayaram K.; Miller, Robert W.

2011-01-01

Purpose: Image segmentation techniques using fuzzy connectedness (FC) principles have shown their effectiveness in segmenting a variety of objects in several large applications. However, one challenge in these algorithms has been their excessive computational requirements when processing large image datasets. Nowadays, commodity graphics hardware provides a highly parallel computing environment. In this paper, the authors present a parallel fuzzy connected image segmentation algorithm implementation on NVIDIA’s Compute Unified Device Architecture (CUDA) platform for segmenting medical image data sets. Methods: In the FC algorithm, there are two major computational tasks: (i) computing the fuzzy affinity relations and (ii) computing the fuzzy connectedness relations. These two tasks are implemented as CUDA kernels and executed on GPU. A dramatic improvement in speed for both tasks is achieved as a result. Results: Our experiments based on three data sets of small, medium, and large data size demonstrate the efficiency of the parallel algorithm, which achieves a speed-up factor of 24.4x, 18.1x, and 10.3x, correspondingly, for the three data sets on the NVIDIA Tesla C1060 over the implementation of the algorithm on CPU, and takes 0.25, 0.72, and 15.04 s, correspondingly, for the three data sets. Conclusions: The authors developed a parallel algorithm of the widely used fuzzy connected image segmentation method on NVIDIA GPUs, which are far more cost- and speed-effective than both clusters of workstations and multiprocessing systems. A near-interactive speed of segmentation has been achieved, even for the large data set. PMID:21859037
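The two tasks named in the Methods can be sketched on a tiny image. In the usual fuzzy connectedness formulation, the strength of a path is the weakest affinity along it, and the connectedness of a pixel to a seed is the strength of the strongest such path. The plain-Python sketch below computes this sequentially with a priority queue. The intensity-difference affinity, the function names, and the toy image are illustrative assumptions, and this is not the authors' parallel CUDA kernels.

```python
import heapq

def affinity(a, b, max_diff=255.0):
    """Task (i): hypothetical affinity between adjacent pixels, high when
    their intensities are similar."""
    return 1.0 - abs(a - b) / max_diff

def fuzzy_connectedness(image, seed):
    """Task (ii): connectedness of every pixel to the seed, defined as the
    maximum over 4-connected paths of the minimum affinity along the path."""
    rows, cols = len(image), len(image[0])
    conn = [[0.0] * cols for _ in range(rows)]
    conn[seed[0]][seed[1]] = 1.0
    heap = [(-1.0, seed[0], seed[1])]        # max-heap via negated strength
    while heap:
        neg_strength, r, c = heapq.heappop(heap)
        strength = -neg_strength
        if strength < conn[r][c]:
            continue                          # stale queue entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                candidate = min(strength, affinity(image[r][c], image[nr][nc]))
                if candidate > conn[nr][nc]:
                    conn[nr][nc] = candidate
                    heapq.heappush(heap, (-candidate, nr, nc))
    return conn

# Toy image: a dark object (left two columns) next to a bright one.
image = [[10, 10, 200],
         [10, 12, 200]]
conn = fuzzy_connectedness(image, seed=(0, 0))
for row in conn:
    print([round(v, 3) for v in row])
```

Thresholding the resulting connectedness map separates the object containing the seed from the rest of the image; here the two bright pixels end up with a much lower value than the dark ones.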
Atmospheric aerosol composition and source apportionments to aerosol in southern Taiwan

NASA Astrophysics Data System (ADS)

Tsai, Ying I.; Chen, Chien-Lung

In this study, the chemical characteristics of winter aerosol at four sites in southern Taiwan were determined and the Gaussian Trajectory transfer coefficient model (GTx) was then used to identify the major air pollutant sources affecting the study sites. Aerosols were found to be acidic at all four sites. The most important constituents of the particulate matter (PM) by mass were SO4^2-, organic carbon (OC), NO3^-, elemental carbon (EC) and NH4^+, with SO4^2-, NO3^-, and NH4^+ together constituting 86.0-87.9% of the total PM2.5 soluble inorganic salts and 68.9-78.3% of the total PM2.5-10 soluble inorganic salts, showing that secondary photochemical solution components such as these were the major contributors to the aerosol water-soluble ions. The coastal site, Linyuan (LY), had the highest PM mass percentage of sea salts, higher in the coarse fraction, and higher sea salts during daytime than during nighttime, indicating that the prevailing daytime sea breeze brought with it more sea-salt aerosol. Other than sea salts, crustal matter, and EC in PM2.5 at Jenwu (JW) and in PM2.5-10 at LY, all aerosol components were higher during nighttime, due to relatively low nighttime mixing heights limiting vertical and horizontal dispersion. At JW, a site with heavy traffic loadings, the OC/EC ratio in the nighttime fine and coarse fractions of approximately 2.2 was higher than during daytime, indicating that in addition to primary organic aerosol (POA), secondary organic aerosol (SOA) also contributed to the nighttime PM2.5. This was also true of the nighttime coarse fraction at LY. The GTx produced correlation coefficients (r) for simulated and observed daily concentrations of PM10 at the four sites (receptors) in the range 0.45-0.59 and biases from -6% to -20%. Source apportionment indicated that point sources were the largest PM10 source at JW, LY and Daliao (DL), while at Meinung (MN), a suburban site with less local PM10, SOx and NOx emissions, upwind
Cretin Memory Flow on Sierra

DOE Office of Scientific and Technical Information (OSTI.GOV)

Langer, S. H.; Scott, H. A.

2016-08-05

The Cretin iCOE project has a goal of enabling the efficient generation of Non-LTE opacities for use in radiation-hydrodynamic simulation codes using the Nvidia boards on LLNL’s upcoming Sierra system. Achieving the desired level of accuracy for some simulations requires the use of a very large number of atomic configurations (a configuration includes the atomic level for all electrons and how they are coupled together). The NLTE rate matrix needs to be solved separately in each zone. Calculating NLTE opacities can consume more time than all other physics packages used in a simulation.
Implementation of Headtracking and 3D Stereo with Unity and VRPN for Computer Simulations

NASA Technical Reports Server (NTRS)

Noyes, Matthew A.

2013-01-01

This paper explores low-cost hardware and software methods to provide depth cues traditionally absent in monocular displays. The use of a VRPN server in conjunction with a Microsoft Kinect and/or Nintendo Wiimote to provide head tracking information to a Unity application, and NVIDIA 3D Vision for retinal disparity support, is discussed. Methods are suggested to implement this technology with NASA's EDGE simulation graphics package, along with potential caveats. Finally, future applications of this technology to astronaut crew training, particularly when combined with an omnidirectional treadmill for virtual locomotion and NASA's ARGOS system for reduced gravity simulation, are discussed.
Charge order-superfluidity transition in a two-dimensional system of hard-core bosons and emerging domain structures

NASA Astrophysics Data System (ADS)

Moskvin, A. S.; Panov, Yu. D.; Rybakov, F. N.; Borisov, A. B.

2017-11-01

We have used high-performance parallel computations on NVIDIA graphics cards, applying the method of nonlinear conjugate gradients and the Monte Carlo method, to observe directly the developing ground state configuration of a two-dimensional hard-core boson system with decrease in temperature, and its evolution with deviation from half-filling. This has allowed us to explore unconventional features of a charge order-superfluidity phase transition, specifically, formation of an irregular domain structure, emergence of a filamentary superfluid structure that condenses within the antiphase domain boundaries of the charge-ordered phase, and formation and evolution of various topological structures.
CELES: CUDA-accelerated simulation of electromagnetic scattering by large ensembles of spheres

NASA Astrophysics Data System (ADS)

Egel, Amos; Pattelli, Lorenzo; Mazzamuto, Giacomo; Wiersma, Diederik S.; Lemmer, Uli

2017-09-01

CELES is a freely available MATLAB toolbox to simulate light scattering by many spherical particles. Aiming at high computational performance, CELES leverages block-diagonal preconditioning, a lookup-table approach to evaluate costly functions and massively parallel execution on NVIDIA graphics processing units using the CUDA computing platform. The combination of these techniques makes it possible to efficiently address large electrodynamic problems (>10^4 scatterers) on inexpensive consumer hardware. In this paper, we validate near- and far-field distributions against the well-established multi-sphere T-matrix (MSTM) code and discuss the convergence behavior for ensembles of different sizes, including an exemplary system comprising 10^5 particles.
End-to-end plasma bubble PIC simulations on GPUs

NASA Astrophysics Data System (ADS)

Germaschewski, Kai; Fox, William; Matteucci, Jackson; Bhattacharjee, Amitava

2017-10-01

Accelerator technologies play a crucial role in eventually achieving exascale computing capabilities. The current and upcoming leadership machines at ORNL (Titan and Summit) employ Nvidia GPUs, which provide vast computational power but also need specifically adapted computational kernels to fully exploit them. In this work, we will show end-to-end particle-in-cell simulations of the formation, evolution and coalescence of laser-generated plasma bubbles. This work showcases the GPU capabilities of the PSC particle-in-cell code, which has been adapted for this problem to support particle injection, a heating operator and a collision operator on GPUs.

Improving Quantum Gate Simulation using a GPU

NASA Astrophysics Data System (ADS)

Gutierrez, Eladio; Romero, Sergio; Trenas, Maria A.; Zapata, Emilio L.

2008-11-01

Due to their increasing computing power, graphics processing units (GPUs) are becoming more and more popular for solving general-purpose algorithms. As the simulation of quantum computers results in a problem with exponential complexity, it is advisable to perform a parallel computation, such as the one provided by the SIMD multiprocessors present in recent GPUs. In this paper, we focus on an important quantum algorithm, the quantum Fourier transform (QFT), in order to evaluate different parallelization strategies on a novel GPU architecture. Our implementation makes use of the new CUDA software/hardware architecture developed recently by NVIDIA.
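As a rough illustration of why state-vector simulation of the quantum Fourier transform parallelizes well, the sketch below applies the textbook circuit to an n-qubit state on the CPU in plain Python. It is not the authors' CUDA implementation, and the function names (`apply_hadamard`, `apply_controlled_phase`, `reverse_bits`, `qft`) are hypothetical. Each gate updates amplitudes (or pairs of amplitudes) independently of the rest, which is the data-parallel structure a GPU kernel could exploit.

```python
import cmath
import math

def apply_hadamard(state, target):
    """Apply H to qubit `target` (bit 0 is the least significant bit)."""
    out = list(state)
    stride = 1 << target
    s = 1.0 / math.sqrt(2.0)
    for i in range(len(state)):
        if i & stride == 0:              # i has target bit 0, j has target bit 1
            j = i | stride
            a, b = state[i], state[j]
            out[i] = s * (a + b)         # each (i, j) pair is independent
            out[j] = s * (a - b)
    return out

def apply_controlled_phase(state, control, target, theta):
    """Multiply amplitudes whose control and target bits are both 1 by exp(i*theta)."""
    phase = cmath.exp(1j * theta)
    both = (1 << control) | (1 << target)
    return [amp * phase if (i & both) == both else amp
            for i, amp in enumerate(state)]

def reverse_bits(i, n):
    """Return the n-bit integer i with its bit order reversed."""
    r = 0
    for _ in range(n):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def qft(state, n):
    """Textbook QFT circuit applied to a state vector of length 2**n."""
    for p in range(n - 1, -1, -1):
        state = apply_hadamard(state, p)
        for q in range(p - 1, -1, -1):
            state = apply_controlled_phase(state, q, p, math.pi / (1 << (p - q)))
    # Final qubit-order reversal (the swap network at the end of the circuit).
    return [state[reverse_bits(i, n)] for i in range(len(state))]
```

The state vector has 2**n entries, so memory and work grow exponentially with the number of qubits; this is the exponential complexity the abstract refers to.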
Advanced computer graphic techniques for laser range finder (LRF) simulation

NASA Astrophysics Data System (ADS)

Bedkowski, Janusz; Jankowski, Stanislaw

2008-11-01

This paper presents advanced computer graphics techniques for laser range finder (LRF) simulation. The LRF is a common sensor for unmanned ground vehicles, autonomous mobile robots and security applications. Because the cost of the measurement system is extremely high, a simulation tool was designed. The simulation makes it possible to execute algorithms such as obstacle avoidance[1], SLAM for robot localization[2], detection of vegetation and water obstacles in the surroundings of the robot chassis[3], and LRF measurement in a crowd of people[1]. The Axis Aligned Bounding Box (AABB) technique and an alternative technique based on CUDA (NVIDIA Compute Unified Device Architecture) are presented.
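One common way to simulate a single range-finder beam against axis-aligned bounding boxes is the slab test sketched below. This is a generic illustration, not the method from the paper; the function name `ray_aabb_distance` and its argument layout are assumptions made for the example.

```python
def ray_aabb_distance(origin, direction, box_min, box_max):
    """Return the distance along the ray to the box, or None if the beam misses.

    The ray is origin + t * direction for t >= 0. The box is the set of points p
    with box_min[k] <= p[k] <= box_max[k] on every axis k (slab method).
    """
    t_near, t_far = 0.0, float("inf")
    for o, d, lo, hi in zip(origin, direction, box_min, box_max):
        if d == 0.0:
            # Ray is parallel to this slab: it must already lie inside it.
            if o < lo or o > hi:
                return None
            continue
        t0, t1 = (lo - o) / d, (hi - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near = max(t_near, t0)   # latest entry over all slabs
        t_far = min(t_far, t1)     # earliest exit over all slabs
        if t_near > t_far:
            return None
    return t_near

# A beam along +x hits a box whose near face is at x = 2.
hit = ray_aabb_distance((0.0, 0.0, 0.0), (1.0, 0.0, 0.0),
                        (2.0, -1.0, -1.0), (3.0, 1.0, 1.0))
```

Because every beam is tested independently, a scan with many beams maps naturally onto one GPU thread per beam.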
GPU accelerated population annealing algorithm

NASA Astrophysics Data System (ADS)

Barash, Lev Yu.; Weigel, Martin; Borovský, Michal; Janke, Wolfhard; Shchur, Lev N.

2017-11-01

Population annealing is a promising recent approach for Monte Carlo simulations in statistical physics, in particular for the simulation of systems with complex free-energy landscapes. It is a hybrid method, combining importance sampling through Markov chains with elements of sequential Monte Carlo in the form of population control. While it appears to provide algorithmic capabilities for the simulation of such systems that are roughly comparable to those of more established approaches such as parallel tempering, it is intrinsically much more suitable for massively parallel computing. Here, we tap into this structural advantage and present a highly optimized implementation of the population annealing algorithm on GPUs that promises speed-ups of several orders of magnitude as compared to a serial implementation on CPUs. While the sample code is for simulations of the 2D ferromagnetic Ising model, it should be easily adapted for simulations of other spin models, including disordered systems. Our code includes implementations of some advanced algorithmic features that have only recently been suggested, namely the automatic adaptation of temperature steps and a multi-histogram analysis of the data at different temperatures. Program Files doi:http://dx.doi.org/10.17632/sgzt4b7b3m.1 Licensing provisions: Creative Commons Attribution license (CC BY 4.0) Programming language: C, CUDA External routines/libraries: NVIDIA CUDA Toolkit 6.5 or newer Nature of problem: The program calculates the internal energy, specific heat, several magnetization moments, entropy and free energy of the 2D Ising model on square lattices of edge length L with periodic boundary conditions as a function of inverse temperature β. Solution method: The code uses population annealing, a hybrid method combining Markov chain updates with population control. The code is implemented for NVIDIA GPUs using the CUDA language and employs advanced techniques such as multi-spin coding, adaptive temperature
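The two ingredients named in the solution method, Markov chain updates and population control, can be sketched in a few lines of serial Python. This is a minimal illustration under assumed parameters (a 4x4 lattice, 50 replicas, fixed temperature steps of 0.1), not the published CUDA code; it omits multi-spin coding, adaptive temperature steps and the multi-histogram analysis, and all function names are hypothetical.

```python
import math
import random

def energy(spins, L):
    """Energy of an L x L periodic Ising lattice (J = 1), counting each bond once."""
    return -sum(
        spins[x * L + y] * (spins[((x + 1) % L) * L + y] + spins[x * L + (y + 1) % L])
        for x in range(L) for y in range(L)
    )

def metropolis_sweep(spins, L, beta, rng):
    """One sweep of single-spin-flip Metropolis updates (the Markov chain part)."""
    for _ in range(L * L):
        x, y = rng.randrange(L), rng.randrange(L)
        s = spins[x * L + y]
        neighbours = (spins[((x + 1) % L) * L + y] + spins[((x - 1) % L) * L + y]
                      + spins[x * L + (y + 1) % L] + spins[x * L + (y - 1) % L])
        d_e = 2 * s * neighbours
        if d_e <= 0 or rng.random() < math.exp(-beta * d_e):
            spins[x * L + y] = -s

def resample(population, d_beta, target_size, L, rng):
    """Population control: copy replica i about target_size * w_i / sum(w) times."""
    energies = [energy(r, L) for r in population]
    e_min = min(energies)
    weights = [math.exp(-d_beta * (e - e_min)) for e in energies]  # shifted for stability
    total = sum(weights)
    new_population = []
    for replica, w in zip(population, weights):
        expected = target_size * w / total
        copies = int(expected) + (1 if rng.random() < expected - int(expected) else 0)
        new_population.extend(list(replica) for _ in range(copies))
    return new_population

def population_annealing(L=4, target_size=50, betas=None, sweeps=5, seed=0):
    """Anneal a population of replicas from beta = 0 towards larger beta."""
    rng = random.Random(seed)
    betas = betas or [0.1 * k for k in range(11)]   # beta = 0.0 ... 1.0
    population = [[rng.choice((-1, 1)) for _ in range(L * L)]
                  for _ in range(target_size)]
    for beta_old, beta_new in zip(betas, betas[1:]):
        population = resample(population, beta_new - beta_old, target_size, L, rng)
        for replica in population:          # replicas are independent of each other
            for _ in range(sweeps):
                metropolis_sweep(replica, L, beta_new, rng)
    return population
```

The inner loop over replicas has no data dependencies between replicas, which is why the method suits massively parallel hardware.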
ARCHERRT – A GPU-based and photon-electron coupled Monte Carlo dose computing engine for radiation therapy: Software development and application to helical tomotherapy

PubMed Central

Su, Lin; Yang, Youming; Bednarz, Bryan; Sterpin, Edmond; Du, Xining; Liu, Tianyu; Ji, Wei; Xu, X. George

2014-01-01

Purpose: Using the graphical processing units (GPU) hardware technology, an extremely fast Monte Carlo (MC) code ARCHERRT is developed for radiation dose calculations in radiation therapy. This paper describes the detailed software development and testing for three clinical TomoTherapy® cases: the prostate, lung, and head & neck. Methods: To obtain clinically relevant dose distributions, phase space files (PSFs) created from optimized radiation therapy treatment plan fluence maps were used as the input to ARCHERRT. Patient-specific phantoms were constructed from patient CT images. Batch simulations were employed to facilitate the time-consuming task of loading large PSFs, and to improve the estimation of statistical uncertainty. Furthermore, two different Woodcock tracking algorithms were implemented and their relative performance was compared. The dose curves of an Elekta accelerator PSF incident on a homogeneous water phantom were benchmarked against DOSXYZnrc. For each of the treatment cases, dose volume histograms and isodose maps were produced from ARCHERRT and the general-purpose code, GEANT4. The gamma index analysis was performed to evaluate the similarity of voxel doses obtained from these two codes. The hardware accelerators used in this study are one NVIDIA K20 GPU, one NVIDIA K40 GPU, and six NVIDIA M2090 GPUs. In addition, to make a fairer comparison of the CPU and GPU performance, a multithreaded CPU code was developed using OpenMP and tested on an Intel E5-2620 CPU. Results: For the water phantom, the depth dose curve and dose profiles from ARCHERRT agree well with DOSXYZnrc. For clinical cases, results from ARCHERRT are compared with those from GEANT4 and good agreement is observed. Gamma index test is performed for voxels whose dose is greater than 10% of maximum dose. For 2%/2mm criteria, the passing rates for the prostate, lung case, and head & neck cases are 99.7%, 98.5%, and 97.2%, respectively. Due to specific architecture of GPU, modified
ARCHERRT - a GPU-based and photon-electron coupled Monte Carlo dose computing engine for radiation therapy: software development and application to helical tomotherapy.

PubMed

Su, Lin; Yang, Youming; Bednarz, Bryan; Sterpin, Edmond; Du, Xining; Liu, Tianyu; Ji, Wei; Xu, X George

2014-07-01

Using the graphical processing units (GPU) hardware technology, an extremely fast Monte Carlo (MC) code ARCHERRT is developed for radiation dose calculations in radiation therapy. This paper describes the detailed software development and testing for three clinical TomoTherapy® cases: the prostate, lung, and head & neck. To obtain clinically relevant dose distributions, phase space files (PSFs) created from optimized radiation therapy treatment plan fluence maps were used as the input to ARCHERRT. Patient-specific phantoms were constructed from patient CT images. Batch simulations were employed to facilitate the time-consuming task of loading large PSFs, and to improve the estimation of statistical uncertainty. Furthermore, two different Woodcock tracking algorithms were implemented and their relative performance was compared. The dose curves of an Elekta accelerator PSF incident on a homogeneous water phantom were benchmarked against DOSXYZnrc. For each of the treatment cases, dose volume histograms and isodose maps were produced from ARCHERRT and the general-purpose code, GEANT4. The gamma index analysis was performed to evaluate the similarity of voxel doses obtained from these two codes. The hardware accelerators used in this study are one NVIDIA K20 GPU, one NVIDIA K40 GPU, and six NVIDIA M2090 GPUs. In addition, to make a fairer comparison of the CPU and GPU performance, a multithreaded CPU code was developed using OpenMP and tested on an Intel E5-2620 CPU. For the water phantom, the depth dose curve and dose profiles from ARCHERRT agree well with DOSXYZnrc. For clinical cases, results from ARCHERRT are compared with those from GEANT4 and good agreement is observed. Gamma index test is performed for voxels whose dose is greater than 10% of maximum dose. For 2%/2mm criteria, the passing rates for the prostate, lung case, and head & neck cases are 99.7%, 98.5%, and 97.2%, respectively. Due to specific architecture of GPU, modified Woodcock tracking algorithm
Ice crystals classification using airborne measurements in mixing phase

NASA Astrophysics Data System (ADS)

Sorin Vajaiac, Nicolae; Boscornea, Andreea

2017-04-01

This paper presents a case study of ice crystal classification from airborne measurements in mixed-phase clouds. Ice crystal shadows are recorded with the CIP (Cloud Imaging Probe) component of the CAPS (Cloud, Aerosol, and Precipitation Spectrometer) system. The analyzed flight was performed in the south-western part of Romania (between Pietrosani, Ramnicu Valcea, Craiova and Targu Jiu), with a Beechcraft C90 GTX which was specially equipped with a CAPS system. During the flight, the temperature reached a minimum of -35 °C. These low temperatures allow the formation of ice crystals and influence their form. For the ice crystal classification presented here, special software, OASIS (Optical Array Shadow Imaging Software), developed by DMT (Droplet Measurement Technologies), was used. The results obtained are, as expected, influenced by the atmospheric and microphysical parameters. The recorded particles were classified into four groups: edge, irregular, round and small.
Latent uncertainties of the precalculated track Monte Carlo method

DOE Office of Scientific and Technical Information (OSTI.GOV)

Renaud, Marc-André; Seuntjens, Jan; Roberge, David

20% of the maximum dose. In proton calculations, a small (≤1 mm) distance-to-agreement error was observed at the Bragg peak. Latent uncertainty was characterized for electrons and found to follow a Poisson distribution with the number of unique tracks per energy. A track bank of 12 energies and 60000 unique tracks per pregenerated energy in water had a size of 2.4 GB and achieved a latent uncertainty of approximately 1% at an optimal efficiency gain over DOSXYZnrc. Larger track banks produced a lower latent uncertainty at the cost of increased memory consumption. Using an NVIDIA GTX 590, efficiency analysis showed an 807× efficiency increase over DOSXYZnrc for 16 MeV electrons in water and 508× for 16 MeV electrons in bone. Conclusions: The PMC method can calculate dose distributions for electrons and protons to a statistical uncertainty of 1% with a large efficiency gain over conventional MC codes. Before performing clinical dose calculations, models to calculate dose contributions from uncharged particles must be implemented. Following the successful implementation of these models, the PMC method will be evaluated as a candidate for inverse planning of modulated electron radiation therapy and scanned proton beams.
Multidisciplinary Simulation Acceleration using Multiple Shared-Memory Graphical Processing Units

NASA Astrophysics Data System (ADS)

Kemal, Jonathan Yashar

For purposes of optimizing and analyzing turbomachinery and other designs, the unsteady Favre-averaged flow-field differential equations for an ideal compressible gas can be solved in conjunction with the heat conduction equation. We solve all equations using the finite-volume multiple-grid numerical technique, with the dual time-step scheme used for unsteady simulations. Our numerical solver code targets CUDA-capable Graphical Processing Units (GPUs) produced by NVIDIA. Making use of MPI, our solver can run across networked compute nodes, where each MPI process can use either a GPU or a Central Processing Unit (CPU) core for primary solver calculations. We use NVIDIA Tesla C2050/C2070 GPUs based on the Fermi architecture, and compare our resulting performance against Intel Xeon X5690 CPUs. Solver routines converted to CUDA typically run about 10 times faster on a GPU for sufficiently dense computational grids. We used a conjugate cylinder computational grid and ran a turbulent steady flow simulation using 4 increasingly dense computational grids. Our densest computational grid is divided into 13 blocks each containing 1033x1033 grid points, for a total of 13.87 million grid points or 1.07 million grid points per domain block. To obtain overall speedups, we compare the execution time of the solver's iteration loop, including all resource-intensive GPU-related memory copies. Comparing the performance of 8 GPUs to that of 8 CPUs, we obtain an overall speedup of about 6.0 when using our densest computational grid. This amounts to an 8-GPU simulation running about 39.5 times faster than a single-CPU simulation.
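The grid sizes and speedups quoted above can be cross-checked with a few lines of arithmetic. The last two quantities are derived here, not stated in the abstract: they assume that the 8-CPU and single-CPU runs use the same timing baseline, so treat them as an inference rather than a reported result. Variable names are illustrative.

```python
blocks = 13
points_per_block = 1033 * 1033              # about 1.07 million per block
total_points = blocks * points_per_block    # about 13.87 million in total

speedup_8gpu_vs_8cpu = 6.0                  # quoted in the abstract
speedup_8gpu_vs_1cpu = 39.5                 # quoted in the abstract

# Derived under the stated assumption: how much faster 8 CPUs are than 1 CPU.
implied_8cpu_vs_1cpu = speedup_8gpu_vs_1cpu / speedup_8gpu_vs_8cpu
implied_cpu_parallel_efficiency = implied_8cpu_vs_1cpu / 8

print(points_per_block, total_points)
print(round(implied_8cpu_vs_1cpu, 2), round(implied_cpu_parallel_efficiency, 2))
```

Under that assumption the CPU-only run scales by a factor of roughly 6.6 on 8 CPUs, which corresponds to a parallel efficiency of roughly 82%.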
Kalman filter tracking on parallel architectures

NASA Astrophysics Data System (ADS)

Cerati, G.; Elmer, P.; Krutelyov, S.; Lantz, S.; Lefebvre, M.; McDermott, K.; Riley, D.; Tadel, M.; Wittich, P.; Wurthwein, F.; Yagil, A.

2017-10-01

We report on the progress of our studies towards a Kalman filter track reconstruction algorithm with optimal performance on manycore architectures. The combinatorial structure of these algorithms is not immediately compatible with an efficient SIMD (or SIMT) implementation; the challenge for us is to recast the existing software so it can readily generate hundreds of shared-memory threads that exploit the underlying instruction set of modern processors. We show how the data and associated tasks can be organized in a way that is conducive to both multithreading and vectorization. We demonstrate very good performance on Intel Xeon and Xeon Phi architectures, as well as promising first results on Nvidia GPUs.
MEGADOCK 4.0: an ultra-high-performance protein-protein docking software for heterogeneous supercomputers.

PubMed

Ohue, Masahito; Shimoda, Takehiro; Suzuki, Shuji; Matsuzaki, Yuri; Ishida, Takashi; Akiyama, Yutaka

2014-11-15

The application of protein-protein docking in large-scale interactome analysis is a major challenge in structural bioinformatics and requires huge computing resources. In this work, we present MEGADOCK 4.0, an FFT-based docking software that makes extensive use of recent heterogeneous supercomputers and shows powerful, scalable performance of >97% strong scaling. MEGADOCK 4.0 is written in C++ with OpenMPI and NVIDIA CUDA 5.0 (or later) and is freely available to all academic and non-profit users at: http://www.bi.cs.titech.ac.jp/megadock. Contact: akiyama@cs.titech.ac.jp. Supplementary data are available at Bioinformatics online. © The Author 2014. Published by Oxford University Press.
Graphics processing unit based computation for NDE applications

NASA Astrophysics Data System (ADS)

Nahas, C. A.; Rajagopal, Prabhu; Balasubramaniam, Krishnan; Krishnamurthy, C. V.

2012-05-01

Advances in parallel processing in recent years are helping to reduce the cost of numerical simulation. Breakthroughs in Graphical Processing Unit (GPU) based computation now offer the prospect of further drastic improvements. The introduction of 'compute unified device architecture' (CUDA) by NVIDIA (the global technology company based in Santa Clara, California, USA) has made programming GPUs for general purpose computing accessible to the average programmer. Here we use CUDA to develop parallel finite difference schemes as applicable to two problems of interest to the NDE community, namely heat diffusion and elastic wave propagation. The implementations are two-dimensional. Performance improvement of the GPU implementation against serial CPU implementation is then discussed.
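For readers unfamiliar with such schemes, the sketch below shows one explicit (forward-time, centred-space) finite-difference step for two-dimensional heat diffusion in serial Python. It is an assumed textbook scheme, not the authors' CUDA code; the function name `heat_step` and the fixed-value boundary treatment are choices made for this example.

```python
def heat_step(u, alpha, dt, dx):
    """One explicit FTCS step of u_t = alpha * (u_xx + u_yy) on a uniform grid.

    `u` is a list of rows; boundary values are held fixed.
    """
    r = alpha * dt / (dx * dx)
    if r > 0.25:
        raise ValueError("explicit 2D scheme is unstable unless alpha*dt/dx^2 <= 0.25")
    ny, nx = len(u), len(u[0])
    new = [row[:] for row in u]
    for j in range(1, ny - 1):
        for i in range(1, nx - 1):
            # Five-point stencil: every interior point depends only on old values,
            # so all points could be updated in parallel (one GPU thread per point).
            new[j][i] = u[j][i] + r * (
                u[j][i + 1] + u[j][i - 1] + u[j + 1][i] + u[j - 1][i] - 4.0 * u[j][i]
            )
    return new

# A unit hot spot in the middle of a cold 5 x 5 plate.
grid = [[0.0] * 5 for _ in range(5)]
grid[2][2] = 1.0
after = heat_step(grid, alpha=1.0, dt=0.2, dx=1.0)
```

After one step the heat of the hot spot is shared equally between the centre point and its four neighbours, and the total over the grid is unchanged.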
Accelerating Advanced MRI Reconstructions on GPUs.

PubMed

Stone, S S; Haldar, J P; Tsao, S C; Hwu, W-M W; Sutton, B P; Liang, Z-P

2008-10-01

Computational acceleration on graphics processing units (GPUs) can make advanced magnetic resonance imaging (MRI) reconstruction algorithms attractive in clinical settings, thereby improving the quality of MR images across a broad spectrum of applications. This paper describes the acceleration of such an algorithm on NVIDIA's Quadro FX 5600. The reconstruction of a 3D image with 128^3 voxels achieves up to 180 GFLOPS and requires just over one minute on the Quadro, while reconstruction on a quad-core CPU is twenty-one times slower. Furthermore, relative to the true image, the error exhibited by the advanced reconstruction is only 12%, while conventional reconstruction techniques incur an error of 42%.
Three-dimensional scene reconstruction from a two-dimensional image

NASA Astrophysics Data System (ADS)

Parkins, Franz; Jacobs, Eddie

2017-05-01

We propose and simulate a method of reconstructing a three-dimensional scene from a two-dimensional image for developing and augmenting world models for autonomous navigation. This is an extension of the Perspective-n-Point (PnP) method which uses a sampling of the 3D scene, 2D image point pairings, and Random Sample Consensus (RANSAC) to infer the pose of the object and produce a 3D mesh of the original scene. Using object recognition and segmentation, we simulate the implementation on a scene of 3D objects with an eye to implementation on embeddable hardware. The final solution will be deployed on the NVIDIA Tegra platform.
HASEonGPU-An adaptive, load-balanced MPI/GPU-code for calculating the amplified spontaneous emission in high power laser media

NASA Astrophysics Data System (ADS)

Eckert, C. H. J.; Zenker, E.; Bussmann, M.; Albach, D.

2016-10-01

We present an adaptive Monte Carlo algorithm for computing the amplified spontaneous emission (ASE) flux in laser gain media pumped by pulsed lasers. With the design of high power lasers in mind, which require large size gain media, we have developed the open source code HASEonGPU that is capable of utilizing multiple graphic processing units (GPUs). With HASEonGPU, time to solution is reduced to minutes on a medium size GPU cluster of 64 NVIDIA Tesla K20m GPUs and excellent speedup is achieved when scaling to multiple GPUs. Comparison of simulation results to measurements of ASE in Yb3+:YAG ceramics shows perfect agreement.
XaNSoNS: GPU-accelerated simulator of diffraction patterns of nanoparticles

NASA Astrophysics Data System (ADS)

Neverov, V. S.

XaNSoNS is an open source software with GPU support, which simulates X-ray and neutron 1D (or 2D) diffraction patterns and pair-distribution functions (PDF) for amorphous or crystalline nanoparticles (up to ∼10^7 atoms) of heterogeneous structural content. Among the multiple parameters of the structure the user may specify atomic displacements, site occupancies, molecular displacements and molecular rotations. The software uses general equations nonspecific to crystalline structures to calculate the scattering intensity. It supports four major standards of parallel computing: MPI, OpenMP, Nvidia CUDA and OpenCL, enabling it to run on various architectures, from CPU-based HPCs to consumer-level GPUs.
Unusual Domain Structure and Filamentary Superfluidity for 2D Hard-Core Bosons in Insulating Charge-Ordered Phase

NASA Astrophysics Data System (ADS)

Panov, Yu. D.; Moskvin, A. S.; Rybakov, F. N.; Borisov, A. B.

2016-12-01

We used a special algorithm written for the Compute Unified Device Architecture (CUDA) on NVIDIA graphics cards, a nonlinear conjugate-gradient method to minimize the energy functional, and a Monte Carlo technique to directly observe the formation of the ground state configuration of 2D hard-core bosons as the temperature is lowered, and its evolution with deviation from half-filling. The novel technique allowed us to examine earlier implications and uncover novel features of the phase transitions, in particular the nucleation of an odd domain structure, the emergence of filamentary superfluidity nucleated at the antiphase domain walls of the charge-ordered phase, and the nucleation and evolution of different topological structures.
Spectral-element simulation of two-dimensional elastic wave propagation in fully heterogeneous media on a GPU cluster

NASA Astrophysics Data System (ADS)

Rudianto, Indra; Sudarmaji

2018-04-01

We present an implementation of the spectral-element method for simulation of two-dimensional elastic wave propagation in fully heterogeneous media. We have incorporated most realistic geological features in the model, including surface topography, curved layer interfaces, and 2-D wave-speed heterogeneity. To accommodate such complexity, we use an unstructured quadrilateral meshing technique. The simulation was performed on a GPU cluster consisting of 24 Intel Xeon CPU cores and 4 NVIDIA Quadro graphics cards, using a CUDA and MPI implementation. We speed up the computation by a factor of about 5 compared to MPI only, and by a factor of about 40 compared to the serial implementation.
Aspects of GPU performance in algorithms with random memory access

NASA Astrophysics Data System (ADS)

Kashkovsky, Alexander V.; Shershnev, Anton A.; Vashchenkov, Pavel V.

2017-10-01

Tests of a numerical code that solves the Boltzmann equation on a hybrid computational cluster using the Direct Simulation Monte Carlo (DSMC) method showed that, on Tesla K40 accelerators, computational performance drops dramatically as the percentage of occupied GPU memory increases. Testing revealed that memory access time increases by tens of times once a certain critical percentage of memory is occupied. Moreover, this seems to be a common problem of all NVidia GPUs, arising from their architecture. A few modifications of the numerical algorithm were suggested to overcome this problem. One of them, based on splitting the memory into "virtual" blocks, resulted in a 2.5 times speed-up.
High-speed railway real-time localization auxiliary method based on deep neural network

NASA Astrophysics Data System (ADS)

Chen, Dongjie; Zhang, Wensheng; Yang, Yang

2017-11-01

The high-speed railway intelligent monitoring and management system is composed of schedule integration, geographic information, location services, and data mining technology for the integration of time and space data. Assistant localization is a significant submodule of the intelligent monitoring system. In practical applications, the general approach is to capture image sequences of the components using a high-definition camera, digital image processing techniques, and target detection, tracking and even behavior analysis methods. In this paper, we present an end-to-end character recognition method based on a deep CNN called YOLO-toc for high-speed railway pillar plate numbers. Unlike other deep CNNs, YOLO-toc is an end-to-end multi-target detection framework; furthermore, it exhibits state-of-the-art performance on real-time detection, achieving nearly 50 fps on a GPU (GTX960). Finally, we realize a real-time but high-accuracy pillar plate number recognition system and integrate natural scene OCR into a dedicated classification YOLO-toc model.
ARCHERRT – A GPU-based and photon-electron coupled Monte Carlo dose computing engine for radiation therapy: Software development and application to helical tomotherapy

DOE Office of Scientific and Technical Information (OSTI.GOV)

Su, Lin; Du, Xining; Liu, Tianyu

Purpose: Using the graphical processing units (GPU) hardware technology, an extremely fast Monte Carlo (MC) code ARCHERRT is developed for radiation dose calculations in radiation therapy. This paper describes the detailed software development and testing for three clinical TomoTherapy® cases: the prostate, lung, and head and neck. Methods: To obtain clinically relevant dose distributions, phase space files (PSFs) created from optimized radiation therapy treatment plan fluence maps were used as the input to ARCHERRT. Patient-specific phantoms were constructed from patient CT images. Batch simulations were employed to facilitate the time-consuming task of loading large PSFs, and to improve the estimation of statistical uncertainty. Furthermore, two different Woodcock tracking algorithms were implemented and their relative performance was compared. The dose curves of an Elekta accelerator PSF incident on a homogeneous water phantom were benchmarked against DOSXYZnrc. For each of the treatment cases, dose volume histograms and isodose maps were produced from ARCHERRT and the general-purpose code, GEANT4. The gamma index analysis was performed to evaluate the similarity of voxel doses obtained from these two codes. The hardware accelerators used in this study are one NVIDIA K20 GPU, one NVIDIA K40 GPU, and six NVIDIA M2090 GPUs. In addition, to make a fairer comparison of the CPU and GPU performance, a multithreaded CPU code was developed using OpenMP and tested on an Intel E5-2620 CPU. Results: For the water phantom, the depth dose curve and dose profiles from ARCHERRT agree well with DOSXYZnrc. For clinical cases, results from ARCHERRT are compared with those from GEANT4 and good agreement is observed. Gamma index test is performed for voxels whose dose is greater than 10% of maximum dose. For 2%/2mm criteria, the passing rates for the prostate, lung case, and head and neck cases are 99.7%, 98.5%, and 97.2%, respectively. Due to

Glutamine prevents oxidative stress in a model of portal hypertension.

PubMed

Zabot, Gilmara Pandolfo; Carvalhal, Gustavo Franco; Marroni, Norma Possa; Licks, Francielli; Hartmann, Renata Minuzzo; da Silva, Vinícius Duval; Fillmann, Henrique Sarubbi

2017-07-07

To evaluate the protective effects of glutamine in a model of portal hypertension (PH) induced by partial portal vein ligation (PPVL). Male Wistar rats were housed in a controlled environment and were allowed access to food and water ad libitum. Twenty-four male Wistar rats were divided into four experimental groups: (1) control group (SO) - rats underwent exploratory laparotomy; (2) control + glutamine group (SO + G) - rats were subjected to laparotomy and were treated intraperitoneally with glutamine; (3) portal hypertension group (PPVL) - rats were subjected to PPVL; and (4) PPVL + glutamine group (PPVL + G) - rats were treated intraperitoneally with glutamine for seven days. Local injuries were determined by evaluating intestinal segments for oxidative stress using lipid peroxidation and the activities of glutathione peroxidase (GPx), endothelial nitric oxide synthase (eNOS) and inducible nitric oxide synthase (iNOS) after PPVL. Lipid peroxidation of the membrane was increased in the animals subjected to PH (P < 0.01). However, the group that received glutamine for seven days after the PPVL procedure showed levels of lipid peroxidation similar to those of the control groups (P > 0.05). The activity of the antioxidant enzyme GTx was decreased in the gut of animals subjected to PH compared with that in the control group of animals not subjected to PH (P < 0.01). However, the group that received glutamine for seven days after the PPVL showed similar GTx activity to both the control groups not subjected to PH (P > 0.05). At least 10 random, non-overlapping images of each histological slide with 200× magnification (44 pixel = 1 μm) were captured. The sum means of all areas of each group were calculated. The mean areas of eNOS staining for both of the control groups were similar. The PPVL group showed the largest area of staining for eNOS. The PPVL + G group had the second highest amount of staining, but the mean value was much lower than that of the PPVL
Montblanc: GPU accelerated radio interferometer measurement equations in support of Bayesian inference for radio observations

NASA Astrophysics Data System (ADS)

Perkins, S. J.; Marais, P. C.; Zwart, J. T. L.; Natarajan, I.; Tasse, C.; Smirnov, O.

2015-09-01

We present Montblanc, a GPU implementation of the Radio interferometer measurement equation (RIME) in support of the Bayesian inference for radio observations (BIRO) technique. BIRO uses Bayesian inference to select sky models that best match the visibilities observed by a radio interferometer. To accomplish this, BIRO evaluates the RIME multiple times, varying sky model parameters to produce multiple model visibilities. χ² values computed from the model and observed visibilities are used as likelihood values to drive the Bayesian sampling process and select the best sky model. As most of the elements of the RIME and χ² calculation are independent of one another, they are highly amenable to parallel computation. Additionally, Montblanc caters for iterative RIME evaluation to produce multiple χ² values. Modified model parameters are transferred to the GPU between each iteration. We implemented Montblanc as a Python package based upon NVIDIA's CUDA architecture. As such, it is easy to extend and implement different pipelines. At present, Montblanc supports point and Gaussian morphologies, but is designed for easy addition of new source profiles. Montblanc's RIME implementation is performant: On an NVIDIA K40, it is approximately 250 times faster than MEQTREES on a dual hexacore Intel E5-2620v2 CPU. Compared to the OSKAR simulator's GPU-implemented RIME components it is 7.7 and 12 times faster on the same K40 for single and double-precision floating point respectively. However, OSKAR's RIME implementation is more general than Montblanc's BIRO-tailored RIME. Theoretical analysis of Montblanc's dominant CUDA kernel suggests that it is memory bound. In practice, profiling shows that it is balanced between compute and memory, as much of the data required by the problem is retained in L1 and L2 caches.
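Illustrative sketch (not taken from the Montblanc package): the abstract describes χ² values computed from model and observed visibilities and used as likelihood values. Assuming independent Gaussian noise with a known per-visibility standard deviation, that calculation could look like the following. The names `chi_squared`, `log_likelihood`, and the sample data are hypothetical, and plain Python stands in for the GPU kernel.

```python
def chi_squared(observed, model, sigma):
    """Noise-weighted sum of squared residuals between complex visibilities."""
    # Each term is independent of the others, which is why the sum
    # parallelizes well (here it is simply evaluated serially).
    return sum(abs(o - m) ** 2 / s ** 2 for o, m, s in zip(observed, model, sigma))


def log_likelihood(observed, model, sigma):
    """Gaussian log-likelihood up to an additive constant (assumed noise model)."""
    return -0.5 * chi_squared(observed, model, sigma)


# Hypothetical data: three observed visibilities and two candidate sky models.
observed = [1.0 + 1.0j, 0.5 - 0.5j, -0.2 + 0.1j]
model_a = [1.0 + 1.0j, 0.5 - 0.5j, -0.2 + 0.1j]  # matches the observation
model_b = [0.0 + 1.0j, 0.5 - 0.5j, -0.2 + 0.1j]  # first visibility is off
sigma = [0.5, 0.5, 0.5]

chi2_a = chi_squared(observed, model_a, sigma)
chi2_b = chi_squared(observed, model_b, sigma)
print(chi2_a, chi2_b)
```

A sampler would prefer `model_a` here because its χ² is lower, so its log-likelihood is higher.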
cudaMap: a GPU accelerated program for gene expression connectivity mapping

PubMed Central

2013-01-01

Background Modern cancer research often involves large datasets and the use of sophisticated statistical techniques. Together these add a heavy computational load to the analysis, which is often coupled with issues surrounding data accessibility. Connectivity mapping is an advanced bioinformatic and computational technique dedicated to therapeutics discovery and drug re-purposing around differential gene expression analysis. On a normal desktop PC, it is common for the connectivity mapping task with a single gene signature to take > 2h to complete using sscMap, a popular Java application that runs on standard CPUs (Central Processing Units). Here, we describe new software, cudaMap, which has been implemented using CUDA C/C++ to harness the computational power of NVIDIA GPUs (Graphics Processing Units) to greatly reduce processing times for connectivity mapping. Results cudaMap can identify candidate therapeutics from the same signature in just over thirty seconds when using an NVIDIA Tesla C2050 GPU. Results from the analysis of multiple gene signatures, which would previously have taken several days, can now be obtained in as little as 10 minutes, greatly facilitating candidate therapeutics discovery with high throughput. We are able to demonstrate dramatic speed differentials between GPU assisted performance and CPU executions as the computational load increases for high accuracy evaluation of statistical significance. Conclusion Emerging ‘omics’ technologies are constantly increasing the volume of data and information to be processed in all areas of biomedical research. Embracing the multicore functionality of GPUs represents a major avenue of local accelerated computing. cudaMap will make a strong contribution in the discovery of candidate therapeutics by enabling speedy execution of heavy duty connectivity mapping tasks, which are increasingly required in modern cancer research. cudaMap is open source and can be freely downloaded from http
Embedded-Based Graphics Processing Unit Cluster Platform for Multiple Sequence Alignments

PubMed Central

Wei, Jyh-Da; Cheng, Hui-Jun; Lin, Chun-Yuan; Ye, Jin; Yeh, Kuan-Yu

2017-01-01

High-end graphics processing units (GPUs), such as NVIDIA Tesla/Fermi/Kepler series cards with thousands of cores per chip, have been widely applied in high-performance computing over the past decade. These desktop GPU cards must be installed in personal computers/servers with desktop CPUs, and the cost and power consumption of constructing a GPU cluster platform are very high. In recent years, NVIDIA released an embedded board, called Jetson Tegra K1 (TK1), which contains 4 ARM Cortex-A15 CPUs and 192 Compute Unified Device Architecture cores (belonging to the Kepler GPU family). Jetson Tegra K1 has several advantages, such as low cost, low power consumption, and high applicability, and it has been used in several specific applications. In our previous work, a bioinformatics platform with a single TK1 (STK platform) was constructed; by comparing the STK platform with a desktop CPU and GPU, that work also showed that Web and mobile services can be implemented on the STK platform with a good cost-performance ratio. In this work, an embedded-based GPU cluster platform is constructed with multiple TK1s (MTK platform). Complex system installation and setup are necessary first steps. Then, 2 job assignment modes are designed for the MTK platform to provide services for users. Finally, ClustalW v2.0.11 and ClustalWtk are ported to the MTK platform. The experimental results showed speedup ratios of 5.5 and 4.8 times for ClustalW v2.0.11 and ClustalWtk, respectively, when comparing 6 TK1s with a single TK1. The MTK platform is shown to be useful for multiple sequence alignments. PMID:28835734
GPU-based fast cone beam CT reconstruction from undersampled and noisy projection data via total variation.

PubMed

Jia, Xun; Lou, Yifei; Li, Ruijiang; Song, William Y; Jiang, Steve B

2010-04-01

Cone-beam CT (CBCT) plays an important role in image guided radiation therapy (IGRT). However, the large radiation dose from serial CBCT scans in most IGRT procedures raises a clinical concern, especially for pediatric patients who are essentially excluded from receiving IGRT for this reason. The goal of this work is to develop a fast GPU-based algorithm to reconstruct CBCT from undersampled and noisy projection data so as to lower the imaging dose. The CBCT is reconstructed by minimizing an energy functional consisting of a data fidelity term and a total variation regularization term. The authors developed a GPU-friendly version of the forward-backward splitting algorithm to solve this model. A multigrid technique is also employed. It is found that 20-40 x-ray projections are sufficient to reconstruct images with satisfactory quality for IGRT. The reconstruction time ranges from 77 to 130 s on an NVIDIA Tesla C1060 (NVIDIA, Santa Clara, CA) GPU card, depending on the number of projections used, which is estimated about 100 times faster than similar iterative reconstruction approaches. Moreover, phantom studies indicate that the algorithm enables the CBCT to be reconstructed under a scanning protocol with as low as 0.1 mA s/projection. Comparing with currently widely used full-fan head and neck scanning protocol of approximately 360 projections with 0.4 mA s/projection, it is estimated that an overall 36-72 times dose reduction has been achieved in our fast CBCT reconstruction algorithm. This work indicates that the developed GPU-based CBCT reconstruction algorithm is capable of lowering imaging dose considerably. The high computation efficiency in this algorithm makes the iterative CBCT reconstruction approach applicable in real clinical environments.
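Worked arithmetic for the dose-reduction estimate above, as a minimal sketch. It assumes that imaging dose scales linearly with total tube exposure (number of projections times mA s per projection); that assumption and the helper name `total_exposure_mas` are illustrative and not stated in the abstract.

```python
def total_exposure_mas(projections, mas_per_projection):
    """Total tube exposure of a scan in mA s (assumed proportional to dose)."""
    return projections * mas_per_projection


# Conventional full-fan head and neck protocol quoted in the abstract.
conventional = total_exposure_mas(360, 0.4)

# Low-dose protocol: 20 to 40 projections at 0.1 mA s per projection.
low_dose_fewest = total_exposure_mas(20, 0.1)
low_dose_most = total_exposure_mas(40, 0.1)

reduction_min = conventional / low_dose_most
reduction_max = conventional / low_dose_fewest
print(round(reduction_min), round(reduction_max))
```

Under this assumption the ratio ranges from about 36 (40 projections) to about 72 (20 projections), which matches the range quoted in the abstract.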
cudaMap: a GPU accelerated program for gene expression connectivity mapping.

PubMed

McArt, Darragh G; Bankhead, Peter; Dunne, Philip D; Salto-Tellez, Manuel; Hamilton, Peter; Zhang, Shu-Dong

2013-10-11

Modern cancer research often involves large datasets and the use of sophisticated statistical techniques. Together these add a heavy computational load to the analysis, which is often coupled with issues surrounding data accessibility. Connectivity mapping is an advanced bioinformatic and computational technique dedicated to therapeutics discovery and drug re-purposing around differential gene expression analysis. On a normal desktop PC, it is common for the connectivity mapping task with a single gene signature to take > 2h to complete using sscMap, a popular Java application that runs on standard CPUs (Central Processing Units). Here, we describe new software, cudaMap, which has been implemented using CUDA C/C++ to harness the computational power of NVIDIA GPUs (Graphics Processing Units) to greatly reduce processing times for connectivity mapping. cudaMap can identify candidate therapeutics from the same signature in just over thirty seconds when using an NVIDIA Tesla C2050 GPU. Results from the analysis of multiple gene signatures, which would previously have taken several days, can now be obtained in as little as 10 minutes, greatly facilitating candidate therapeutics discovery with high throughput. We are able to demonstrate dramatic speed differentials between GPU assisted performance and CPU executions as the computational load increases for high accuracy evaluation of statistical significance. Emerging 'omics' technologies are constantly increasing the volume of data and information to be processed in all areas of biomedical research. Embracing the multicore functionality of GPUs represents a major avenue of local accelerated computing. cudaMap will make a strong contribution in the discovery of candidate therapeutics by enabling speedy execution of heavy duty connectivity mapping tasks, which are increasingly required in modern cancer research. cudaMap is open source and can be freely downloaded from http://purl.oclc.org/NET/cudaMap.
Embedded-Based Graphics Processing Unit Cluster Platform for Multiple Sequence Alignments.

PubMed

Wei, Jyh-Da; Cheng, Hui-Jun; Lin, Chun-Yuan; Ye, Jin; Yeh, Kuan-Yu

2017-01-01

High-end graphics processing units (GPUs), such as NVIDIA Tesla/Fermi/Kepler series cards with thousands of cores per chip, have been widely applied in high-performance computing over the past decade. These desktop GPU cards must be installed in personal computers/servers with desktop CPUs, and the cost and power consumption of constructing a GPU cluster platform are very high. In recent years, NVIDIA released an embedded board, called Jetson Tegra K1 (TK1), which contains 4 ARM Cortex-A15 CPUs and 192 Compute Unified Device Architecture cores (belonging to the Kepler GPU family). Jetson Tegra K1 has several advantages, such as low cost, low power consumption, and high applicability, and it has been used in several specific applications. In our previous work, a bioinformatics platform with a single TK1 (STK platform) was constructed; by comparing the STK platform with a desktop CPU and GPU, that work also showed that Web and mobile services can be implemented on the STK platform with a good cost-performance ratio. In this work, an embedded-based GPU cluster platform is constructed with multiple TK1s (MTK platform). Complex system installation and setup are necessary first steps. Then, 2 job assignment modes are designed for the MTK platform to provide services for users. Finally, ClustalW v2.0.11 and ClustalWtk are ported to the MTK platform. The experimental results showed speedup ratios of 5.5 and 4.8 times for ClustalW v2.0.11 and ClustalWtk, respectively, when comparing 6 TK1s with a single TK1. The MTK platform is shown to be useful for multiple sequence alignments.
GPU-based relative fuzzy connectedness image segmentation

PubMed Central

Zhuge, Ying; Ciesielski, Krzysztof C.; Udupa, Jayaram K.; Miller, Robert W.

2013-01-01

Purpose: Recently, clinical radiological research and practice are becoming increasingly quantitative. Further, images continue to increase in size and volume. For quantitative radiology to become practical, it is crucial that image segmentation algorithms and their implementations are rapid and yield practical run time on very large data sets. The purpose of this paper is to present a parallel version of an algorithm that belongs to the family of fuzzy connectedness (FC) algorithms, to achieve an interactive speed for segmenting large medical image data sets. Methods: The most common FC segmentations, optimizing an ℓ∞-based energy, are known as relative fuzzy connectedness (RFC) and iterative relative fuzzy connectedness (IRFC). Both RFC and IRFC objects (of which IRFC contains RFC) can be found via linear time algorithms, linear with respect to the image size. The new algorithm, P-ORFC (for parallel optimal RFC), which is implemented by using NVIDIA’s Compute Unified Device Architecture (CUDA) platform, considerably improves the computational speed of the above mentioned CPU based IRFC algorithm. Results: Experiments based on four data sets of small, medium, large, and super data size, achieved speedup factors of 32.8×, 22.9×, 20.9×, and 17.5×, correspondingly, on the NVIDIA Tesla C1060 platform. Although the output of P-ORFC need not precisely match that of IRFC output, it is very close to it and, as the authors prove, always lies between the RFC and IRFC objects. Conclusions: A parallel version of a top-of-the-line algorithm in the family of FC has been developed on the NVIDIA GPUs. An interactive speed of segmentation has been achieved, even for the largest medical image data set. Such GPU implementations may play a crucial role in automatic anatomy recognition in clinical radiology. PMID:23298094
Gibraltar v 1.0

DOE Office of Scientific and Technical Information (OSTI.GOV)

CURRY, MATTHEW LEON; WARD, H. LEE; & SKJELLUM, ANTHONY

Gibraltar is a library and associated test suite which performs Reed-Solomon coding and decoding of data buffers using graphics processing units which support NVIDIA's CUDA technology. This library is used to generate redundant data allowing for recovery of lost information. For example, a user can generate m new blocks of data from n original blocks, distributing those pieces over n+m devices. If any m devices fail, the contents of those devices can be recovered from the contents of the other n devices, even if some of the original blocks are lost. This is a generalized description of RAID, a technique for increasing data storage speed and size.
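Illustrative sketch of the n+m idea in its simplest case, m = 1, where a single XOR parity block is enough to rebuild any one lost block. This is not Gibraltar's API and not general Reed-Solomon coding (which uses finite-field arithmetic to tolerate m > 1 failures); the function names and sample blocks are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized byte blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_parity(data_blocks):
    """Generate the single redundant block (m = 1) from n data blocks."""
    return xor_blocks(data_blocks)


def recover_missing(surviving_blocks, parity):
    """Rebuild the one lost data block from the n - 1 survivors plus parity."""
    return xor_blocks(surviving_blocks + [parity])


# n = 3 original blocks spread over n + m = 4 devices.
data = [b"GPU0", b"GPU1", b"GPU2"]
parity = make_parity(data)

# Simulate losing the device that held data[1].
recovered = recover_missing([data[0], data[2]], parity)
print(recovered)
```

The recovery works because XOR-ing the parity with every surviving block cancels their contributions and leaves only the missing block.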
DOE Office of Scientific and Technical Information (OSTI.GOV)

Messer, Bronson; Harris, James A; Parete-Koon, Suzanne T

We describe recent development work on the core-collapse supernova code CHIMERA. CHIMERA has consumed more than 100 million CPU-hours on Oak Ridge Leadership Computing Facility (OLCF) platforms in the past 3 years, ranking it among the most important applications at the OLCF. Most of the work described has been focused on exploiting the multicore nature of the current platform (Jaguar) via, e.g., multithreading using OpenMP. In addition, we have begun a major effort to marshal the computational power of GPUs with CHIMERA. The impending upgrade of Jaguar to Titan, a 20+ PF machine with an NVIDIA GPU on many nodes, makes this work essential.
On the effective implementation of a boundary element code on graphics processing units using an out-of-core LU algorithm

DOE Office of Scientific and Technical Information (OSTI.GOV)

D'Azevedo, Ed F; Nintcheu Fata, Sylvain

2012-01-01

A collocation boundary element code for solving the three-dimensional Laplace equation, publicly available from http://www.intetec.org, has been adapted to run on an Nvidia Tesla general purpose graphics processing unit (GPU). Global matrix assembly and LU factorization of the resulting dense matrix were performed on the GPU. Out-of-core techniques were used to solve problems larger than available GPU memory. The code achieved over eight times speedup in matrix assembly and about 56 Gflops/sec in the LU factorization using only 512 Mbytes of GPU memory. Details of the GPU implementation and comparisons with the standard sequential algorithm are included to illustrate the performance of the GPU code.
Performance of GeantV EM Physics Models

NASA Astrophysics Data System (ADS)

Amadio, G.; Ananya, A.; Apostolakis, J.; Aurora, A.; Bandieramonte, M.; Bhattacharyya, A.; Bianchini, C.; Brun, R.; Canal, P.; Carminati, F.; Cosmo, G.; Duhem, L.; Elvira, D.; Folger, G.; Gheata, A.; Gheata, M.; Goulas, I.; Iope, R.; Jun, S. Y.; Lima, G.; Mohanty, A.; Nikitina, T.; Novak, M.; Pokorski, W.; Ribon, A.; Seghal, R.; Shadura, O.; Vallecorsa, S.; Wenzel, S.; Zhang, Y.

2017-10-01

The recent progress in parallel hardware architectures with deeper vector pipelines or many-cores technologies brings opportunities for HEP experiments to take advantage of SIMD and SIMT computing models. Launched in 2013, the GeantV project studies performance gains in propagating multiple particles in parallel, improving instruction throughput and data locality in HEP event simulation on modern parallel hardware architecture. Due to the complexity of geometry description and physics algorithms of a typical HEP application, performance analysis is indispensable in identifying factors limiting parallel execution. In this report, we will present design considerations and preliminary computing performance of GeantV physics models on coprocessors (Intel Xeon Phi and NVidia GPUs) as well as on mainstream CPUs.
Open source acceleration of wave optics simulations on energy efficient high-performance computing platforms

NASA Astrophysics Data System (ADS)

Beck, Jeffrey; Bos, Jeremy P.

2017-05-01

We compare several modifications to the open-source wave optics package, WavePy, intended to improve execution time. Specifically, we compare the relative performance of the Intel MKL, a CPU based OpenCV distribution, and GPU-based version. Performance is compared between distributions both on the same compute platform and between a fully-featured computing workstation and the NVIDIA Jetson TX1 platform. Comparisons are drawn in terms of both execution time and power consumption. We have found that substituting the Fast Fourier Transform operation from OpenCV provides a marked improvement on all platforms. In addition, we show that embedded platforms offer some possibility for extensive improvement in terms of efficiency compared to a fully featured workstation.
NASA's Hybrid Reality Lab: One Giant Leap for Full Dive

NASA Technical Reports Server (NTRS)

Delgado, Francisco J.; Noyes, Matthew

2017-01-01

This presentation demonstrates how NASA is using consumer VR headsets, game engine technology and NVIDIA's GPUs to create highly immersive future training systems augmented with extremely realistic haptic feedback, sound, and additional sensory information, and how these can be used to improve the engineering workflow. Included in this presentation are an environment simulation of the ISS, where users can interact with virtual objects, handrails, and tracked physical objects while inside VR; integration of consumer VR headsets with the Active Response Gravity Offload System; and a space habitat architectural evaluation tool. Attendees will learn how the best elements of real and virtual worlds can be combined into a hybrid reality environment with tangible engineering and scientific applications.
Electromagnetic Physics Models for Parallel Computing Architectures

NASA Astrophysics Data System (ADS)

Amadio, G.; Ananya, A.; Apostolakis, J.; Aurora, A.; Bandieramonte, M.; Bhattacharyya, A.; Bianchini, C.; Brun, R.; Canal, P.; Carminati, F.; Duhem, L.; Elvira, D.; Gheata, A.; Gheata, M.; Goulas, I.; Iope, R.; Jun, S. Y.; Lima, G.; Mohanty, A.; Nikitina, T.; Novak, M.; Pokorski, W.; Ribon, A.; Seghal, R.; Shadura, O.; Vallecorsa, S.; Wenzel, S.; Zhang, Y.

2016-10-01

The recent emergence of hardware architectures characterized by many-core or accelerated processors has opened new opportunities for concurrent programming models taking advantage of both SIMD and SIMT architectures. GeantV, a next generation detector simulation, has been designed to exploit both the vector capability of mainstream CPUs and multi-threading capabilities of coprocessors including NVidia GPUs and Intel Xeon Phi. The characteristics of these architectures are very different in terms of the vectorization depth and type of parallelization needed to achieve optimal performance. In this paper we describe implementation of electromagnetic physics models developed for parallel computing architectures as a part of the GeantV project. Results of preliminary performance evaluation and physics validation are presented as well.
The approximation of anomalous magnetic field by array of magnetized rods

NASA Astrophysics Data System (ADS)

Denis, Byzov; Lev, Muravyev; Natalia, Fedorova

2017-07-01

A method for calculating the vertical component of an anomalous magnetic field from its absolute value is presented. The conversion is based on approximating the anomalies of the magnetic induction modulus by a set of singular sources and then calculating the vertical component of the field for the chosen distribution. Rods uniformly magnetized along their axes were used as the set of singular sources. The applicability of different nonlinear optimization methods to this task was analyzed. The algorithm is implemented using parallel computing technology on an NVidia GPU. The approximation and the calculation of the vertical component are demonstrated for the regional magnetic field of North Eurasia territories.
Parallelized computation for computer simulation of electrocardiograms using personal computers with multi-core CPU and general-purpose GPU.

PubMed

Shen, Wenfeng; Wei, Daming; Xu, Weimin; Zhu, Xin; Yuan, Shizhong

2010-10-01

Biological computations like electrocardiological modelling and simulation usually require high-performance computing environments. This paper introduces an implementation of parallel computation for computer simulation of electrocardiograms (ECGs) in a personal computer environment with an Intel Core(TM) 2 Quad Q6600 CPU and a GeForce 8800GT GPU, with software support by OpenMP and CUDA. It was tested in three parallelization device setups: (a) a four-core CPU without a general-purpose GPU, (b) a general-purpose GPU plus 1 core of CPU, and (c) a four-core CPU plus a general-purpose GPU. To effectively take advantage of a multi-core CPU and a general-purpose GPU, an algorithm based on load-prediction dynamic scheduling was developed and applied to setting (c). In the simulation with 1600 time steps, the speedup of the parallel computation as compared to the serial computation was 3.9 in setting (a), 16.8 in setting (b), and 20.0 in setting (c). This study demonstrates that a current PC with a multi-core CPU and a general-purpose GPU provides a good environment for parallel computations in biological modelling and simulation studies. Copyright 2010 Elsevier Ireland Ltd. All rights reserved.
BlochSolver: A GPU-optimized fast 3D MRI simulator for experimentally compatible pulse sequences

NASA Astrophysics Data System (ADS)

Kose, Ryoichi; Kose, Katsumi

2017-08-01

A magnetic resonance imaging (MRI) simulator, which reproduces MRI experiments using computers, has been developed using two graphic-processor-unit (GPU) boards (GTX 1080). The MRI simulator was developed to run according to pulse sequences used in experiments. Experiments and simulations were performed to demonstrate the usefulness of the MRI simulator for three types of pulse sequences, namely, three-dimensional (3D) gradient-echo, 3D radio-frequency spoiled gradient-echo, and gradient-echo multislice with practical matrix sizes. The results demonstrated that the calculation speed using two GPU boards was typically about 7 TFLOPS and about 14 times faster than the calculation speed using CPUs (two 18-core Xeons). We also found that MR images acquired by experiment could be reproduced using an appropriate number of subvoxels, and that 3D isotropic and two-dimensional multislice imaging experiments for practical matrix sizes could be simulated using the MRI simulator. Therefore, we concluded that such powerful MRI simulators are expected to become an indispensable tool for MRI research and development.
Gravitational tree-code on graphics processing units: implementation in CUDA

NASA Astrophysics Data System (ADS)

Gaburov, Evghenii; Bédorf, Jeroen; Portegies Zwart, Simon

2010-05-01

We present a new, very fast tree-code which runs on massively parallel Graphical Processing Units (GPU) with NVIDIA CUDA architecture. The tree construction and calculation of multipole moments are carried out on the host CPU, while the force calculation, which consists of tree walks and evaluation of interaction lists, is carried out on the GPU. In this way we achieve a sustained performance of about 100 GFLOP/s and data transfer rates of about 50 GB/s. It takes about a second to compute forces on a million particles with an opening angle of θ ≈ 0.5. The code has a convenient user interface and is freely available for use. http://castle.strw.leidenuniv.nl/software/octgrav.html
Locality-Aware CTA Clustering For Modern GPUs

DOE Office of Scientific and Technical Information (OSTI.GOV)

Li, Ang; Song, Shuaiwen; Liu, Weifeng

2017-04-08

In this paper, we proposed a novel clustering technique for tapping into the performance potential of a largely ignored type of locality: inter-CTA locality. We first demonstrated the capability of the existing GPU hardware to exploit such locality, both spatially and temporally, on L1 or L1/Tex unified cache. To verify the potential of this locality, we quantified its existence in a broad spectrum of applications and discussed its sources of origin. Based on these insights, we proposed the concept of CTA-Clustering and its associated software techniques. Finally, we evaluated these techniques on all modern generations of NVIDIA GPU architectures. The experimental results showed that our proposed clustering techniques could significantly improve on-chip cache performance.

First experience of vectorizing electromagnetic physics models for detector simulation

NASA Astrophysics Data System (ADS)

Amadio, G.; Apostolakis, J.; Bandieramonte, M.; Bianchini, C.; Bitzes, G.; Brun, R.; Canal, P.; Carminati, F.; de Fine Licht, J.; Duhem, L.; Elvira, D.; Gheata, A.; Jun, S. Y.; Lima, G.; Novak, M.; Presbyterian, M.; Shadura, O.; Seghal, R.; Wenzel, S.

2015-12-01

The recent emergence of hardware architectures characterized by many-core or accelerated processors has opened new opportunities for concurrent programming models taking advantage of both SIMD and SIMT architectures. The GeantV vector prototype for detector simulations has been designed to exploit both the vector capability of mainstream CPUs and multi-threading capabilities of coprocessors including NVidia GPUs and Intel Xeon Phi. The characteristics of these architectures are very different in terms of the vectorization depth, parallelization needed to achieve optimal performance or memory access latency and speed. An additional challenge is to avoid the code duplication often inherent to supporting heterogeneous platforms. In this paper we present the first experience of vectorizing electromagnetic physics models developed for the GeantV project.
Real-time dedispersion for fast radio transient surveys, using auto tuning on many-core accelerators

NASA Astrophysics Data System (ADS)

Sclocco, A.; van Leeuwen, J.; Bal, H. E.; van Nieuwpoort, R. V.

2016-01-01

Dedispersion, the removal of deleterious smearing of impulsive signals by the interstellar matter, is one of the most intensive processing steps in any radio survey for pulsars and fast transients. We here present a study of the parallelization of this algorithm on many-core accelerators, including GPUs from AMD and NVIDIA, and the Intel Xeon Phi. We find that dedispersion is inherently memory-bound. Even in a perfect scenario, hardware limitations keep the arithmetic intensity low, thus limiting performance. We next exploit auto-tuning to adapt dedispersion to different accelerators, observations, and even telescopes. We demonstrate that the optimal settings differ between observational setups, and that auto-tuning significantly improves performance. This impacts time-domain surveys from Apertif to SKA.
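Illustrative sketch of brute-force incoherent dedispersion, assuming the standard cold-plasma delay law Δt = K_DM · DM · (ν⁻² − ν_ref⁻²) with K_DM ≈ 4.148808 × 10³ s MHz² pc⁻¹ cm³. It is not the authors' auto-tuned accelerator code; the function names, channel frequencies, and synthetic pulse are hypothetical. Each output sample needs one read per channel and a single addition, which illustrates the low arithmetic intensity the abstract mentions.

```python
K_DM = 4.148808e3  # dispersion constant in s MHz^2 pc^-1 cm^3


def delay_seconds(dm, freq_mhz, ref_freq_mhz):
    """Arrival delay of frequency `freq_mhz` relative to the reference frequency."""
    return K_DM * dm * (freq_mhz ** -2 - ref_freq_mhz ** -2)


def dedisperse(data, freqs_mhz, dm, sample_time):
    """Shift each channel by its delay and sum; data[c][t] is channel c, sample t."""
    ref = max(freqs_mhz)
    shifts = [round(delay_seconds(dm, f, ref) / sample_time) for f in freqs_mhz]
    n_out = len(data[0]) - max(shifts)
    return [
        sum(channel[t + shift] for channel, shift in zip(data, shifts))
        for t in range(n_out)
    ]


# Synthetic test signal: one pulse, delayed per channel as a DM = 10 source would be.
freqs = [1500.0, 1400.0, 1300.0]
dm, dt, n_samples, t0 = 10.0, 1e-3, 20, 5
data = []
for f in freqs:
    shift = round(delay_seconds(dm, f, max(freqs)) / dt)
    channel = [0.0] * n_samples
    channel[t0 + shift] = 1.0
    data.append(channel)

aligned = dedisperse(data, freqs, dm, dt)      # correct trial DM
smeared = dedisperse(data, freqs, 0.0, dt)     # wrong trial DM
print(max(aligned), max(smeared))
```

With the correct trial DM the three channels line up and the pulse adds coherently; with a wrong trial DM the pulse stays smeared across samples. A survey repeats this for many trial DMs, which is why the step is so costly.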
Exact diagonalization of quantum lattice models on coprocessors

NASA Astrophysics Data System (ADS)

Siro, T.; Harju, A.

2016-10-01

We implement the Lanczos algorithm on an Intel Xeon Phi coprocessor and compare its performance to a multi-core Intel Xeon CPU and an NVIDIA graphics processor. The Xeon and the Xeon Phi are parallelized with OpenMP and the graphics processor is programmed with CUDA. The performance is evaluated by measuring the execution time of a single step in the Lanczos algorithm. We study two quantum lattice models with different particle numbers, and conclude that for small systems, the multi-core CPU is the fastest platform, while for large systems, the graphics processor is the clear winner, reaching speedups of up to 7.6 compared to the CPU. The Xeon Phi outperforms the CPU with sufficiently large particle number, reaching a speedup of 2.5.
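Illustrative sketch of the Lanczos iteration whose single step the abstract uses as its timing unit. This is a plain-Python version for a small symmetric matrix, not the authors' OpenMP or CUDA code; the `lanczos` function and the diagonal test matrix are hypothetical. In a lattice-model calculation the `matvec` call would be the sparse Hamiltonian-times-vector product, which dominates the cost of each step.

```python
import math


def lanczos(matvec, v0, steps):
    """Run up to `steps` Lanczos iterations; return diagonal and off-diagonal terms."""
    norm = math.sqrt(sum(x * x for x in v0))
    v = [x / norm for x in v0]
    v_prev = [0.0] * len(v)
    beta = 0.0
    alphas, betas = [], []
    for _ in range(steps):
        w = matvec(v)  # the expensive part of one Lanczos step
        alpha = sum(wi * vi for wi, vi in zip(w, v))
        # Orthogonalize against the two most recent Lanczos vectors.
        w = [wi - alpha * vi - beta * pi for wi, vi, pi in zip(w, v, v_prev)]
        beta = math.sqrt(sum(x * x for x in w))
        alphas.append(alpha)
        if beta < 1e-12:  # invariant subspace found, stop early
            break
        betas.append(beta)
        v_prev, v = v, [x / beta for x in w]
    return alphas, betas


# Hypothetical 3x3 symmetric "Hamiltonian": diag(1, 2, 3).
diagonal = (1.0, 2.0, 3.0)
alphas, betas = lanczos(lambda vec: [d * x for d, x in zip(diagonal, vec)],
                        [1.0, 1.0, 1.0], steps=2)
print(alphas, betas)
```

The returned `alphas` and `betas` are the diagonal and off-diagonal entries of the small tridiagonal matrix whose extreme eigenvalues approximate those of the full matrix.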
Multi-core and GPU accelerated simulation of a radial star target imaged with equivalent t-number circular and Gaussian pupils

NASA Astrophysics Data System (ADS)

Greynolds, Alan W.

2013-09-01

Results from the GelOE optical engineering software are presented for the through-focus, monochromatic coherent and polychromatic incoherent imaging of a radial "star" target for equivalent t-number circular and Gaussian pupils. The FFT-based simulations are carried out using OpenMP threading on a multi-core desktop computer, with and without the aid of a many-core NVIDIA GPU accessing its cuFFT library. It is found that a custom FFT optimized for the 12-core host has similar performance to a simply implemented 256-core GPU FFT. A more sophisticated version of the latter but tuned to reduce overhead on a 448-core GPU is 20 to 28 times faster than a basic FFT implementation running on one CPU core.
Lattice QCD at finite temperature and density from Taylor expansion

NASA Astrophysics Data System (ADS)

Steinbrecher, Patrick

2017-01-01

In the first part, I present an overview of recent Lattice QCD simulations at finite temperature and density. In particular, we discuss fluctuations of conserved charges: baryon number, electric charge and strangeness. These can be obtained from Taylor expanding the QCD pressure as a function of corresponding chemical potentials. Our simulations were performed using quark masses corresponding to physical pion mass of about 140 MeV and allow a direct comparison to experimental data from ultra-relativistic heavy ion beams at hadron colliders such as the Relativistic Heavy Ion Collider at Brookhaven National Laboratory and the Large Hadron Collider at CERN. In the second part, we discuss computational challenges for current and future exascale Lattice simulations with a focus on new silicon developments from Intel and NVIDIA.
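The Taylor expansion mentioned above is commonly written in the following standard form. The notation is the conventional one and is assumed here, not quoted from the talk: B, Q and S denote baryon number, electric charge and strangeness.

```latex
\[
\frac{P}{T^4}
  = \sum_{i,j,k \ge 0} \frac{\chi^{BQS}_{ijk}}{i!\,j!\,k!}
    \left(\frac{\mu_B}{T}\right)^{i}
    \left(\frac{\mu_Q}{T}\right)^{j}
    \left(\frac{\mu_S}{T}\right)^{k},
\qquad
\chi^{BQS}_{ijk}
  = \left.
    \frac{\partial^{\,i+j+k}\,(P/T^4)}
         {\partial(\mu_B/T)^{i}\,\partial(\mu_Q/T)^{j}\,\partial(\mu_S/T)^{k}}
    \right|_{\mu_B=\mu_Q=\mu_S=0}
\]
```

The expansion coefficients are the generalized susceptibilities evaluated at vanishing chemical potential; they are the fluctuations and correlations of the conserved charges.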
Modeling of Radiotherapy Linac Source Terms Using ARCHER Monte Carlo Code: Performance Comparison for GPU and MIC Parallel Computing Devices

NASA Astrophysics Data System (ADS)

Lin, Hui; Liu, Tianyu; Su, Lin; Bednarz, Bryan; Caracappa, Peter; Xu, X. George

2017-09-01

Monte Carlo (MC) simulation is well recognized as the most accurate method for radiation dose calculations. For radiotherapy applications, accurate modelling of the source term, i.e. the clinical linear accelerator, is critical to the simulation. The purpose of this paper is to perform source modelling and examine the accuracy and performance of the models on Intel Many Integrated Core coprocessors (aka Xeon Phi) and Nvidia GPUs using ARCHER, and to explore potential optimization methods. Phase space-based source modelling has been implemented. Good agreement was found in a tomotherapy prostate patient case and a TrueBeam breast case. In terms of performance, the whole simulation for the prostate plan and the breast plan took about 173 s and 73 s, respectively, with 1% statistical error.
Autofocus method for automated microscopy using embedded GPUs.

PubMed

Castillo-Secilla, J M; Saval-Calvo, M; Medina-Valdès, L; Cuenca-Asensi, S; Martínez-Álvarez, A; Sánchez, C; Cristóbal, G

2017-03-01

In this paper we present a method for autofocusing images of sputum smears taken from a microscope, which combines finding the optimal focus distance with an algorithm for extending the depth of field (EDoF). Our multifocus fusion method produces a unique image in which all the relevant objects of the analyzed scene are well focused, independently of their distance to the sensor. This process is computationally expensive, which makes its automation infeasible on traditional embedded processors. For this purpose, a low-cost optimized implementation is proposed using a resource-limited embedded GPU integrated on a cutting-edge NVIDIA system on chip. Extensive tests performed on different sputum smear image sets show the real-time capabilities of our implementation while maintaining the quality of the output image.
Electromagnetic physics models for parallel computing architectures

DOE PAGES

Amadio, G.; Ananya, A.; Apostolakis, J.; ...

2016-11-21

The recent emergence of hardware architectures characterized by many-core or accelerated processors has opened new opportunities for concurrent programming models taking advantage of both SIMD and SIMT architectures. GeantV, a next generation detector simulation, has been designed to exploit both the vector capability of mainstream CPUs and multi-threading capabilities of coprocessors including NVidia GPUs and Intel Xeon Phi. The characteristics of these architectures are very different in terms of the vectorization depth and type of parallelization needed to achieve optimal performance. In this paper we describe implementation of electromagnetic physics models developed for parallel computing architectures as a part of the GeantV project. Finally, the results of preliminary performance evaluation and physics validation are presented as well.
Advantages of GPU technology in DFT calculations of intercalated graphene

NASA Astrophysics Data System (ADS)

Pešić, J.; Gajić, R.

2014-09-01

Over the past few years, the expansion of general-purpose graphics-processing unit (GPGPU) technology has had a great impact on computational science. GPGPU is the utilization of a graphics-processing unit (GPU) to perform calculations in applications usually handled by the central processing unit (CPU). Use of GPGPUs as a way to increase computational power in the material sciences has significantly decreased computational costs in already highly demanding calculations. The level of acceleration and parallelization depends on the problem itself. Some problems can benefit from GPU acceleration and parallelization, such as the finite-difference time-domain algorithm (FDTD) and density-functional theory (DFT), while others cannot take advantage of these modern technologies. A number of GPU-supported applications have emerged in the past several years (www.nvidia.com/object/gpu-applications.html). Quantum Espresso (QE) is reported as an integrated suite of open source computer codes for electronic-structure calculations and materials modeling at the nano-scale. It is based on DFT, the use of a plane-wave basis and a pseudopotential approach. Since QE version 5.0, a plug-in component for the standard QE packages has allowed users to exploit the capabilities of Nvidia GPU graphics cards (www.qe-forge.org/gf/proj). In this study, we have examined the impact of GPU acceleration and parallelization on the numerical performance of DFT calculations. Graphene has been attracting attention worldwide and has already shown some remarkable properties. We have studied intercalated graphene using the QE package PHonon, which employs the GPU. The term ‘intercalation’ refers to a process whereby foreign adatoms are inserted onto a graphene lattice. In addition, by intercalating different atoms between graphene layers, it is possible to tune their physical properties. Our experiments have shown there are benefits from using GPUs, and we reached an
GPU-based, Microsecond Latency, Hecto-Channel MIMO Feedback Control of Magnetically Confined Plasmas

NASA Astrophysics Data System (ADS)

Rath, Nikolaus

Feedback control has become a crucial tool in the research on magnetic confinement of plasmas for achieving controlled nuclear fusion. This thesis presents a novel plasma feedback control system that, for the first time, employs a Graphics Processing Unit (GPU) for microsecond-latency, real-time control computations. This novel application area for GPU computing is opened up by a new system architecture that is optimized for low-latency computations on data samples of less than a kilobyte, as they occur in typical plasma control algorithms. In contrast to traditional GPU computing approaches that target complex, high-throughput computations with massive amounts of data, the architecture presented in this thesis uses the GPU as the primary processing unit rather than as an auxiliary of the CPU, and data is transferred from A-D/D-A converters directly into GPU memory using peer-to-peer PCI Express transfers. The described design has been implemented in a new, GPU-based control system for the High-Beta Tokamak - Extended Pulse (HBT-EP) device. The system is built from commodity hardware and uses an NVIDIA GeForce GPU and D-TACQ A-D/D-A converters providing a total of 96 input and 64 output channels. The system is able to run with sampling periods down to 4 μs and latencies down to 8 μs. The GPU provides a total processing power of 1.5 × 10^12 floating point operations per second. To illustrate the performance and versatility of both the general architecture and concrete implementation, a new control algorithm has been developed. The algorithm is designed for the control of multiple rotating magnetic perturbations in situations where the plasma equilibrium is not known exactly and features an adaptive system model: instead of requiring the rotation frequencies and growth rates embedded in the system model to be set a priori, the adaptive algorithm derives these parameters from the evolution of the perturbation amplitudes themselves. This results in non-linear control
GPU accelerated fuzzy connected image segmentation by using CUDA.

PubMed

Zhuge, Ying; Cao, Yong; Miller, Robert W

2009-01-01

Image segmentation techniques using fuzzy connectedness principles have shown their effectiveness in segmenting a variety of objects in several large applications in recent years. However, one problem of these algorithms has been their excessive computational requirements when processing large image datasets. Nowadays commodity graphics hardware provides high parallel computing power. In this paper, we present a parallel fuzzy connected image segmentation algorithm on Nvidia's Compute Unified Device Architecture (CUDA) platform for segmenting large medical image data sets. Our experiments based on three data sets with small, medium, and large data sizes demonstrate the efficiency of the parallel algorithm, which achieves speed-up factors of 7.2x, 7.3x, and 14.4x, respectively, for the three data sets over the sequential implementation of the fuzzy connected image segmentation algorithm on a CPU.
A multi-port 10GbE PCIe NIC featuring UDP offload and GPUDirect capabilities.

NASA Astrophysics Data System (ADS)

Ammendola, Roberto; Biagioni, Andrea; Frezza, Ottorino; Lamanna, Gianluca; Lo Cicero, Francesca; Lonardo, Alessandro; Martinelli, Michele; Stanislao Paolucci, Pier; Pastorelli, Elena; Pontisso, Luca; Rossetti, Davide; Simula, Francesco; Sozzi, Marco; Tosoratto, Laura; Vicini, Piero

2015-12-01

NaNet-10 is a four-port 10GbE PCIe Network Interface Card designed for low-latency real-time operations with GPU systems. To this purpose the design includes a UDP offload module, for fast and clock-cycle deterministic handling of the transport layer protocol, plus a GPUDirect P2P/RDMA engine for low-latency communication with NVIDIA Tesla GPU devices. A dedicated module (Multi-Stream) can optionally process input UDP streams before data is delivered through PCIe DMA to their destination devices, re-organizing data from different streams to guarantee computational optimization. NaNet-10 is going to be integrated in the NA62 CERN experiment in order to assess the suitability of GPGPU systems as real-time triggers; results and lessons learned while performing this activity will be reported herein.
The Process of Parallelizing the Conjunction Prediction Algorithm of ESA's SSA Conjunction Prediction Service Using GPGPU

NASA Astrophysics Data System (ADS)

Fehr, M.; Navarro, V.; Martin, L.; Fletcher, E.

2013-08-01

Space Situational Awareness[8] (SSA) is defined as the comprehensive knowledge, understanding and maintained awareness of the population of space objects, the space environment and existing threats and risks. As ESA's SSA Conjunction Prediction Service (CPS) requires the repetitive application of a processing algorithm against a data set of man-made space objects, it is crucial to exploit the highly parallelizable nature of this problem. Currently the CPS system makes use of OpenMP[7] for parallelization purposes using CPU threads, but only a GPU with its hundreds of cores can fully benefit from such high levels of parallelism. This paper presents the adaptation of several core algorithms[5] of the CPS for general-purpose computing on graphics processing units (GPGPU) using NVIDIA's Compute Unified Device Architecture (CUDA).
Accelerated Application Development: The ORNL Titan Experience

DOE PAGES

Joubert, Wayne; Archibald, Richard K.; Berrill, Mark A.; ...

2015-05-09

The use of computational accelerators such as NVIDIA GPUs and Intel Xeon Phi processors is now widespread in the high performance computing community, with many applications delivering impressive performance gains. However, programming these systems for high performance, performance portability and software maintainability has been a challenge. In this paper we discuss experiences porting applications to the Titan system. Titan, which began planning in 2009 and was deployed for general use in 2013, was the first multi-petaflop system based on accelerator hardware. To ready applications for accelerated computing, a preparedness effort was undertaken prior to delivery of Titan. In this paper we report experiences and lessons learned from this process and describe how users are currently making use of computational accelerators on Titan.
Accelerated application development: The ORNL Titan experience

DOE Office of Scientific and Technical Information (OSTI.GOV)

Joubert, Wayne; Archibald, Rick; Berrill, Mark

2015-08-01

The use of computational accelerators such as NVIDIA GPUs and Intel Xeon Phi processors is now widespread in the high performance computing community, with many applications delivering impressive performance gains. However, programming these systems for high performance, performance portability and software maintainability has been a challenge. In this paper we discuss experiences porting applications to the Titan system. Titan, which began planning in 2009 and was deployed for general use in 2013, was the first multi-petaflop system based on accelerator hardware. To ready applications for accelerated computing, a preparedness effort was undertaken prior to delivery of Titan. In this paper we report experiences and lessons learned from this process and describe how users are currently making use of computational accelerators on Titan.
Overview of implementation of DARPA GPU program in SAIC

NASA Astrophysics Data System (ADS)

Braunreiter, Dennis; Furtek, Jeremy; Chen, Hai-Wen; Healy, Dennis

2008-04-01

This paper reviews the implementation of DARPA MTO STAP-BOY program for both Phase I and II conducted at Science Applications International Corporation (SAIC). The STAP-BOY program conducts fast covariance factorization and tuning techniques for space-time adaptive process (STAP) Algorithm Implementation on Graphics Processor unit (GPU) Architectures for Embedded Systems. The first part of our presentation on the DARPA STAP-BOY program will focus on GPU implementation and algorithm innovations for a prototype radar STAP algorithm. The STAP algorithm will be implemented on the GPU, using stream programming (from companies such as PeakStream, ATI Technologies' CTM, and NVIDIA) and traditional graphics APIs. This algorithm will include fast range adaptive STAP weight updates and beamforming applications, each of which has been modified to exploit the parallel nature of graphics architectures.
A rapid parallelization of cone-beam projection and back-projection operator based on texture fetching interpolation

NASA Astrophysics Data System (ADS)

Xie, Lizhe; Hu, Yining; Chen, Yang; Shi, Luyao

2015-03-01

Projection and back-projection are the most computationally demanding parts of Computed Tomography (CT) reconstruction. Parallelization strategies using GPU computing techniques have been introduced. In this paper we present a new parallelization scheme for both projection and back-projection. The proposed method is based on the CUDA technology developed by NVIDIA Corporation. Instead of building a complex model, we aimed at optimizing the existing algorithm and making it suitable for CUDA implementation so as to gain fast computation speed. Besides making use of the texture fetching operation, which helps gain faster interpolation speed, we fixed the number of samples in the computation of each projection to ensure the synchronization of blocks and threads, thus preventing the latency caused by inconsistent computational complexity. Experimental results have demonstrated the computational efficiency and imaging quality of the proposed method.
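The two ideas named above, interpolated fetches and a fixed number of samples per ray, can be illustrated with a simplified 2D CPU sketch. It is a hedged illustration, not the authors' CUDA cone-beam code: `bilinear` stands in for what the texture unit does in hardware, and all names and the 2D geometry are assumptions for this example.

```python
import math


def bilinear(img, x, y):
    """Bilinearly interpolate img[row][col] at (x=col, y=row); zero outside.

    Assumes img has at least 2 rows and 2 columns.
    """
    rows, cols = len(img), len(img[0])
    if x < 0 or y < 0 or x > cols - 1 or y > rows - 1:
        return 0.0
    x0, y0 = min(int(x), cols - 2), min(int(y), rows - 2)
    fx, fy = x - x0, y - y0
    top = img[y0][x0] * (1 - fx) + img[y0][x0 + 1] * fx
    bot = img[y0 + 1][x0] * (1 - fx) + img[y0 + 1][x0 + 1] * fx
    return top * (1 - fy) + bot * fy


def project_ray(img, src, dst, n_samples):
    """Approximate the line integral of img from src to dst.

    Every ray takes exactly n_samples midpoint samples, so all rays do the
    same amount of work regardless of their length.
    """
    (x0, y0), (x1, y1) = src, dst
    step = math.hypot(x1 - x0, y1 - y0) / n_samples
    total = 0.0
    for i in range(n_samples):
        t = (i + 0.5) / n_samples
        total += bilinear(img, x0 + t * (x1 - x0), y0 + t * (y1 - y0))
    return total * step
```

Because the loop count does not depend on the ray, threads mapped to different rays would finish together, which is the load-balancing property the abstract attributes to fixing the sampling number.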
Real-Space Density Functional Theory on Graphical Processing Units: Computational Approach and Comparison to Gaussian Basis Set Methods.

PubMed

Andrade, Xavier; Aspuru-Guzik, Alán

2013-10-08

We discuss the application of graphical processing units (GPUs) to accelerate real-space density functional theory (DFT) calculations. To make our implementation efficient, we have developed a scheme to expose the data parallelism available in the DFT approach; this is applied to the different procedures required for a real-space DFT calculation. We present results for current-generation GPUs from AMD and Nvidia, which show that our scheme, implemented in the free code Octopus, can reach a sustained performance of up to 90 GFlops for a single GPU, representing a significant speed-up when compared to the CPU version of the code. Moreover, for some systems, our implementation can outperform a GPU Gaussian basis set code, showing that the real-space approach is a competitive alternative for DFT simulations on GPUs.
A GPU-paralleled implementation of an enhanced face recognition algorithm

NASA Astrophysics Data System (ADS)

Chen, Hao; Liu, Xiyang; Shao, Shuai; Zan, Jiguo

2013-03-01

Face recognition based on compressed sensing and sparse representation has been hotly debated in recent years. This scheme increases the recognition rate as well as the anti-noise capability. However, the computational cost is high and has become a main restricting factor for real-world applications. In this paper, we introduce a GPU-accelerated hybrid variant of the face recognition algorithm, named the parallel face recognition algorithm (pFRA). We describe how to carry out a parallel optimization design that takes full advantage of the many-core structure of a GPU. The pFRA is tested and compared with several other implementations under different data sample sizes. Finally, our pFRA, implemented with an NVIDIA GPU and the Compute Unified Device Architecture (CUDA) programming model, achieves a significant speedup over traditional CPU implementations.
Genetically improved BarraCUDA.

PubMed

Langdon, W B; Lam, Brian Yee Hong

2017-01-01

BarraCUDA is an open source C program which uses the BWA algorithm in parallel with nVidia CUDA to align short next generation DNA sequences against a reference genome. Recently its source code was optimised using "Genetic Improvement". The genetically improved (GI) code is up to three times faster on short paired end reads from The 1000 Genomes Project and 60% more accurate on a short BioPlanet.com GCAT alignment benchmark. GPGPU BarraCUDA running on a single K80 Tesla GPU can align short paired end nextGen sequences up to ten times faster than bwa on a 12 core server. The speed up was such that the GI version was adopted and has been regularly downloaded from SourceForge for more than 12 months.

GPU-accelerated phase extraction algorithm for interferograms: a real-time application

NASA Astrophysics Data System (ADS)

Zhu, Xiaoqiang; Wu, Yongqian; Liu, Fengwei

2016-11-01

Optical testing, having the merits of non-destruction and high sensitivity, provides a vital guideline for optical manufacturing. But the testing process is often computationally intensive and expensive, usually taking up to a few seconds, which is unacceptable for dynamic testing. In this paper, a GPU-accelerated phase extraction algorithm is proposed, which is based on the advanced iterative algorithm. The accelerated algorithm can extract the correct phase distribution from thirteen 1024x1024 fringe patterns with arbitrary phase shifts in 233 milliseconds on average using an NVIDIA Quadro 4000 graphics card, a 12.7x speedup over the same algorithm executed on the CPU and a 6.6x speedup over that in Matlab on a DWANING W5801 workstation. The performance improvement can fulfill the demands of computational accuracy and real-time application.
High-performance computing on GPUs for resistivity logging of oil and gas wells

NASA Astrophysics Data System (ADS)

Glinskikh, V.; Dudaev, A.; Nechaev, O.; Surodina, I.

2017-10-01

We developed and implemented in software an algorithm for high-performance simulation of electrical logs from oil and gas wells using heterogeneous computing. The numerical solution of the 2D forward problem is based on the finite-element method and the Cholesky decomposition for solving a system of linear algebraic equations (SLAE). Software implementations of the algorithm were built with NVIDIA CUDA technology and computing libraries, allowing us to decompose the SLAE and find its solution on the central processing unit (CPU) and the graphics processing unit (GPU). The calculation time is analyzed as a function of the matrix size and the number of its non-zero elements. We estimated the computing speed on the CPU and GPU, including high-performance heterogeneous CPU-GPU computing. Using the developed algorithm, we simulated resistivity data in realistic models.
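As a reminder of what a Cholesky-based SLAE solve involves, here is a small dense sketch in plain Python. It is an assumption-laden illustration only: the paper works with sparse finite-element matrices and CUDA libraries, whereas this example uses dense lists and hypothetical function names.

```python
import math


def cholesky(a):
    """Return lower-triangular L with A = L L^T for a symmetric positive definite A."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                l[i][j] = math.sqrt(s)
            else:
                l[i][j] = s / l[j][j]
    return l


def solve(a, b):
    """Solve A x = b by factorizing A, then forward and backward substitution."""
    l = cholesky(a)
    n = len(b)
    y = [0.0] * n
    for i in range(n):  # forward substitution: L y = b
        y[i] = (b[i] - sum(l[i][k] * y[k] for k in range(i))) / l[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):  # backward substitution: L^T x = y
        x[i] = (y[i] - sum(l[k][i] * x[k] for k in range(i + 1, n))) / l[i][i]
    return x
```

The factorization is the expensive part and can be reused for many right-hand sides, which is one reason the approach suits forward modelling with many source positions.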
A high-speed DAQ framework for future high-level trigger and event building clusters

NASA Astrophysics Data System (ADS)

Caselle, M.; Ardila Perez, L. E.; Balzer, M.; Dritschler, T.; Kopmann, A.; Mohr, H.; Rota, L.; Vogelgesang, M.; Weber, M.

2017-03-01

Modern data acquisition and trigger systems require a throughput of several GB/s and latencies of the order of microseconds. To satisfy such requirements, a heterogeneous readout system based on FPGA readout cards and GPU-based computing nodes coupled by InfiniBand has been developed. The incoming data from the back-end electronics is delivered directly into the internal memory of GPUs through a dedicated peer-to-peer PCIe communication. High performance DMA engines have been developed for direct communication between FPGAs and GPUs using "DirectGMA (AMD)" and "GPUDirect (NVIDIA)" technologies. The proposed infrastructure is a candidate for future generations of event building clusters, high-level trigger filter farms and low-level trigger systems. In this paper the heterogeneous FPGA-GPU architecture is presented and its performance discussed.
Implementation of Multipattern String Matching Accelerated with GPU for Intrusion Detection System

NASA Astrophysics Data System (ADS)

Nehemia, Rangga; Lim, Charles; Galinium, Maulahikmah; Rinaldi Widianto, Ahmad

2017-04-01

As Internet-related security threats continue to increase in volume and sophistication, existing Intrusion Detection Systems (IDS) are being challenged to cope with current Internet development. A Multi Pattern String Matching algorithm accelerated with a Graphical Processing Unit is utilized to improve the packet scanning performance of the IDS. This paper implements a Multi Pattern String Matching algorithm, also called Parallel Failureless Aho Corasick, accelerated with a GPU to improve the performance of the IDS. The OpenCL library is used to allow the IDS to support various GPUs, including the popular NVIDIA and AMD GPUs used in our research. The experimental results show that Multi Pattern String Matching on a GPU-accelerated platform provides a speed-up of up to 141% in terms of throughput compared to previous research.
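The idea behind Parallel Failureless Aho Corasick can be sketched sequentially as follows. This is a generic illustration under stated assumptions, not the paper's OpenCL kernel: the trie has no failure links, and each loop iteration over `start` plays the role of one GPU thread. All names are hypothetical.

```python
def build_trie(patterns):
    """Build a trie without failure links. Node 0 is the root."""
    children = [{}]   # children[node] maps a character to the next node
    output = [None]   # output[node] is the pattern ending at node, if any
    for pat in patterns:
        node = 0
        for ch in pat:
            if ch not in children[node]:
                children.append({})
                output.append(None)
                children[node][ch] = len(children) - 1
            node = children[node][ch]
        output[node] = pat
    return children, output


def pfac_scan(text, patterns):
    """Return (start, pattern) for every occurrence of any pattern in text."""
    children, output = build_trie(patterns)
    matches = []
    for start in range(len(text)):  # on a GPU: one thread per start position
        node = 0
        for pos in range(start, len(text)):
            node = children[node].get(text[pos])
            if node is None:  # no transition: this "thread" terminates
                break
            if output[node] is not None:
                matches.append((start, output[node]))
    return matches
```

Since every thread only reports matches that begin at its own position, no failure transitions are needed and most threads terminate after one or two characters.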
Mendel-GPU: haplotyping and genotype imputation on graphics processing units

PubMed Central

Chen, Gary K.; Wang, Kai; Stram, Alex H.; Sobel, Eric M.; Lange, Kenneth

2012-01-01

Motivation: In modern sequencing studies, one can improve the confidence of genotype calls by phasing haplotypes using information from an external reference panel of fully typed unrelated individuals. However, the computational demands are so high that they prohibit researchers with limited computational resources from haplotyping large-scale sequence data. Results: Our graphics processing unit based software delivers haplotyping and imputation accuracies comparable to competing programs at a fraction of the computational cost and peak memory demand. Availability: Mendel-GPU, our OpenCL software, runs on Linux platforms and is portable across AMD and nVidia GPUs. Users can download both code and documentation at http://code.google.com/p/mendel-gpu/. Contact: gary.k.chen@usc.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:22954633
Development of an Implicit, Charge and Energy Conserving 2D Electromagnetic PIC Code on Advanced Architectures

NASA Astrophysics Data System (ADS)

Payne, Joshua; Taitano, William; Knoll, Dana; Liebs, Chris; Murthy, Karthik; Feltman, Nicolas; Wang, Yijie; McCarthy, Colleen; Cieren, Emanuel

2012-10-01

In order to solve problems such as the ion coalescence and slow MHD shocks fully kinetically we developed a fully implicit 2D energy and charge conserving electromagnetic PIC code, PlasmaApp2D. PlasmaApp2D differs from previous implicit PIC implementations in that it will utilize advanced architectures such as GPUs and shared memory CPU systems, with problems too large to fit into cache. PlasmaApp2D will be a hybrid CPU-GPU code developed primarily to run on the DARWIN cluster at LANL utilizing four 12-core AMD Opteron CPUs and two NVIDIA Tesla GPUs per node. MPI will be used for cross-node communication, OpenMP will be used for on-node parallelism, and CUDA will be used for the GPUs. Development progress and initial results will be presented.
HONEI: A collection of libraries for numerical computations targeting multiple processor architectures

NASA Astrophysics Data System (ADS)

van Dyk, Danny; Geveler, Markus; Mallach, Sven; Ribbrock, Dirk; Göddeke, Dominik; Gutwenger, Carsten

2009-12-01

We present HONEI, an open-source collection of libraries offering a hardware oriented approach to numerical calculations. HONEI abstracts the hardware, and applications written on top of HONEI can be executed on a wide range of computer architectures such as CPUs, GPUs and the Cell processor. We demonstrate the flexibility and performance of our approach with two test applications, a Finite Element multigrid solver for the Poisson problem and a robust and fast simulation of shallow water waves. By linking against HONEI's libraries, we achieve a two-fold speedup over straightforward C++ code using HONEI's SSE backend, and additional 3-4 and 4-16 times faster execution on the Cell and a GPU. A second important aspect of our approach is that the full performance capabilities of the hardware under consideration can be exploited by adding optimised application-specific operations to the HONEI libraries. HONEI provides all necessary infrastructure for development and evaluation of such kernels, significantly simplifying their development. Program summary. Program title: HONEI. Catalogue identifier: AEDW_v1_0. Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEDW_v1_0.html. Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland. Licensing provisions: GPLv2. No. of lines in distributed program, including test data, etc.: 216 180. No. of bytes in distributed program, including test data, etc.: 1 270 140. Distribution format: tar.gz. Programming language: C++. Computer: x86, x86_64, NVIDIA CUDA GPUs, Cell blades and PlayStation 3. Operating system: Linux. RAM: at least 500 MB free. Classification: 4.8, 4.3, 6.1. External routines: SSE: none; [1] for GPU, [2] for Cell backend. Nature of problem: Computational science in general and numerical simulation in particular have reached a turning point. The revolution developers are facing is not primarily driven by a change in (problem-specific) methodology, but rather by the fundamental paradigm shift of the
Big Data GPU-Driven Parallel Processing Spatial and Spatio-Temporal Clustering Algorithms

NASA Astrophysics Data System (ADS)

Konstantaras, Antonios; Skounakis, Emmanouil; Kilty, James-Alexander; Frantzeskakis, Theofanis; Maravelakis, Emmanuel

2016-04-01

Massive parallelization of a 3D finite difference electromagnetic forward solution using domain decomposition methods on multiple CUDA enabled GPUs

NASA Astrophysics Data System (ADS)

Schultz, A.

2010-12-01

3D forward solvers lie at the core of inverse formulations used to image the variation of electrical conductivity within the Earth's interior. This property is associated with variations in temperature, composition, phase, presence of volatiles, and in specific settings, the presence of groundwater, geothermal resources, oil/gas or minerals. The high cost of 3D solutions has been a stumbling block to wider adoption of 3D methods. Parallel algorithms for modeling frequency domain 3D EM problems have not achieved wide scale adoption, with emphasis on fairly coarse grained parallelism using MPI and similar approaches. The communications bandwidth as well as the latency required to send and receive network communication packets is a limiting factor in implementing fine grained parallel strategies, inhibiting wide adoption of these algorithms. Leading Graphics Processor Unit (GPU) companies now produce GPUs with hundreds of GPU processor cores per die. The footprint, in silicon, of the GPU's restricted instruction set is much smaller than the general purpose instruction set required of a CPU. Consequently, the density of processor cores on a GPU can be much greater than on a CPU. GPUs also have local memory, registers and high speed communication with host CPUs, usually through PCIe type interconnects. The extremely low cost and high computational power of GPUs provide the EM geophysics community with an opportunity to achieve fine grained (i.e. massive) parallelization of codes on low cost hardware. The current generation of GPUs (e.g. NVidia Fermi) provides 3 billion transistors per chip die, with nearly 500 processor cores and up to 6 GB of fast (GDDR5) GPU memory. This latest generation of GPU supports fast hardware double precision (64 bit) floating point operations of the type required for frequency domain EM forward solutions. Each Fermi GPU board can sustain nearly 1 TFLOP in double precision, and multiple boards can be installed in the host computer system. We
Accumulation, biotransformation, histopathology and paralysis in the Pacific calico scallop Argopecten ventricosus by the paralyzing toxins of the dinoflagellate Gymnodinium catenatum.

PubMed

Escobedo-Lozano, Amada Y; Estrada, Norma; Ascencio, Felipe; Contreras, Gerardo; Alonso-Rodriguez, Rosalba

2012-05-01

The dinoflagellate Gymnodinium catenatum produces paralyzing shellfish poisons that are consumed and accumulated by bivalves. We performed short-term feeding experiments to examine ingestion, accumulation, biotransformation, histopathology, and paralysis in the juvenile Pacific calico scallop Argopecten ventricosus that consume this dinoflagellate. Depletion of algal cells was measured in closed systems. Histopathological preparations were microscopically analyzed. Paralysis was observed and the time of recovery recorded. Accumulation and possible biotransformation of toxins were measured by HPLC analysis. Feeding activity in treated scallops showed that scallops produced pseudofeces, ingestion rates decreased at 8 h; approximately 60% of the scallops were paralyzed and melanin production and hemocyte aggregation were observed in several tissues at 15 h. HPLC analysis showed that the only toxins present in the dinoflagellates and scallops were the N-sulfo-carbamoyl toxins (C1, C2); after hydrolysis, the carbamate toxins (epimers GTX2/3) were present. C1 and C2 toxins were most common in the mantle, followed by the digestive gland and stomach-complex, adductor muscle, kidney and rectum group, and finally, gills. Toxin profiles in scallop tissue were similar to the dinoflagellate; biotransformations were not present in the scallops in this short-term feeding experiment.
GPU-accelerated low-latency real-time searches for gravitational waves from compact binary coalescence

NASA Astrophysics Data System (ADS)

Liu, Yuan; Du, Zhihui; Chung, Shin Kee; Hooper, Shaun; Blair, David; Wen, Linqing

2012-12-01

We present a graphics processing unit (GPU)-accelerated time-domain low-latency algorithm to search for gravitational waves (GWs) from coalescing binaries of compact objects based on the summed parallel infinite impulse response (SPIIR) filtering technique. The aim is to facilitate fast detection of GWs with a minimum delay to allow prompt electromagnetic follow-up observations. To maximize the GPU acceleration, we apply an efficient batched parallel computing model that significantly reduces the number of synchronizations in SPIIR and optimizes the usage of the memory and hardware resource. Our code is tested on the CUDA ‘Fermi’ architecture in a GTX 480 graphics card and its performance is compared with a single core of Intel Core i7 920 (2.67 GHz). A 58-fold speedup is achieved while giving results in close agreement with the CPU implementation. Our result indicates that it is possible to conduct a full search for GWs from compact binary coalescence in real time with only one desktop computer equipped with a Fermi GPU card for the initial LIGO detectors which in the past required more than 100 CPUs.
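As a rough illustration of the summed parallel IIR idea named in this abstract, the sketch below runs a bank of first-order IIR filters over a time series and sums their outputs. The recurrence form, the function name `spiir_output`, and the coefficient values are illustrative assumptions, not the authors' implementation. Each filter in the bank is independent of the others, which is the property that makes such a bank a natural fit for GPU threads.

```python
def spiir_output(x, filters):
    """Sum the outputs of a bank of first-order IIR filters.

    x       -- list of input samples (real or complex)
    filters -- list of (a, b, delay) tuples, one per filter (hypothetical values)
    Each filter obeys y[n] = a * y[n-1] + b * x[n-delay].
    """
    total = [0j] * len(x)
    for a, b, delay in filters:          # filters are independent, so this loop parallelizes
        y_prev = 0j
        for n in range(len(x)):
            x_delayed = x[n - delay] if n >= delay else 0.0
            y_prev = a * y_prev + b * x_delayed
            total[n] += y_prev
    return total


# Impulse response of a toy two-filter bank.
impulse = [1.0, 0.0, 0.0, 0.0]
bank = [(0.5, 1.0, 0), (0.25, 2.0, 1)]
response = spiir_output(impulse, bank)
```

In a batched GPU version of this sketch, the outer loop over filters would be distributed across threads and the per-sample sum would become a reduction.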
mm_par2.0: An object-oriented molecular dynamics simulation program parallelized using a hierarchical scheme with MPI and OPENMP

NASA Astrophysics Data System (ADS)

Oh, Kwang Jin; Kang, Ji Hoon; Myung, Hun Joo

2012-02-01

decomposition is not popular due to its poor scalability. On the other hand, domain decomposition scheme is better for scalability. It still has a limitation in utilizing a large number of cores on recent petascale computers due to the requirement that the domain size is larger than the potential cutoff distance. To go beyond such a limitation, a hierarchical parallelization scheme has been adopted in this new version and implemented using MPI [7] and OPENMP [8]. Summary of revisions: (1) Object-oriented programming has been used. (2) A hierarchical parallelization scheme has been adopted. (3) SPME routine has been fully parallelized with parallel 3D FFT using volumetric decomposition scheme [9]. K.J.O. thanks Mr. Seung Min Lee for useful discussion on programming and debugging. Running time: Running time depends on system size and methods used. For test system containing a protein (PDB id: 5DHFR) with CHARMM22 force field [10] and 7023 TIP3P [11] waters in simulation box having dimension 62.23 Å×62.23 Å×62.23 Å, the benchmark results are given in Fig. 1. Here the potential cutoff distance was set to 12 Å and the switching function was applied from 10 Å for the force calculation in real space. For the SPME [12] calculation, K1, K2, and K3 were set to 64 and the interpolation order was set to 4. To do the fast Fourier transform, we used Intel MKL library. All bonds including hydrogen atoms were constrained using SHAKE/RATTLE algorithms [13,14]. The code was compiled using Intel compiler version 11.1 and mvapich2 version 1.5. Fig. 2 shows performance gains from using CUDA-enabled version [15] of mm_par for 5DHFR simulation in water on Intel Core2Quad 2.83 GHz and GeForce GTX 580. Even though mm_par2.0 is not ported yet for GPU, these performance data should be useful for estimating mm_par2.0 performance on GPU. Timing results for 1000 MD steps. 1, 2, 4, and 8 in the figure mean the number of OPENMP threads. Timing results for 1000 MD steps from double precision simulation on CPU
Hypergraph partitioning implementation for parallelizing matrix-vector multiplication using CUDA GPU-based parallel computing

NASA Astrophysics Data System (ADS)

Murni, Bustamam, A.; Ernastuti, Handhika, T.; Kerami, D.

2017-07-01

Calculation of the matrix-vector multiplication in the real-world problems often involves large matrix with arbitrary size. Therefore, parallelization is needed to speed up the calculation process that usually takes a long time. Graph partitioning techniques that have been discussed in the previous studies cannot be used to complete the parallelized calculation of matrix-vector multiplication with arbitrary size. This is due to the assumption of graph partitioning techniques that can only solve the square and symmetric matrix. Hypergraph partitioning techniques will overcome the shortcomings of the graph partitioning technique. This paper addresses the efficient parallelization of matrix-vector multiplication through hypergraph partitioning techniques using CUDA GPU-based parallel computing. CUDA (compute unified device architecture) is a parallel computing platform and programming model that was created by NVIDIA and implemented by the GPU (graphics processing unit).
An Investigation of Unified Memory Access Performance in CUDA

PubMed Central

Landaverde, Raphael; Zhang, Tiansheng; Coskun, Ayse K.; Herbordt, Martin

2015-01-01

Managing memory between the CPU and GPU is a major challenge in GPU computing. A programming model, Unified Memory Access (UMA), has been recently introduced by Nvidia to simplify the complexities of memory management while claiming good overall performance. In this paper, we investigate this programming model and evaluate its performance and programming model simplifications based on our experimental results. We find that beyond on-demand data transfers to the CPU, the GPU is also able to request subsets of data it requires on demand. This feature allows UMA to outperform full data transfer methods for certain parallel applications and small data sizes. We also find, however, that for the majority of applications and memory access patterns, the performance overheads associated with UMA are significant, while the simplifications to the programming model restrict flexibility for adding future optimizations. PMID:26594668
CUDA-Accelerated Geodesic Ray-Tracing for Fiber Tracking

PubMed Central

van Aart, Evert; Sepasian, Neda; Jalba, Andrei; Vilanova, Anna

2011-01-01

Diffusion Tensor Imaging (DTI) allows to noninvasively measure the diffusion of water in fibrous tissue. By reconstructing the fibers from DTI data using a fiber-tracking algorithm, we can deduce the structure of the tissue. In this paper, we outline an approach to accelerating such a fiber-tracking algorithm using a Graphics Processing Unit (GPU). This algorithm, which is based on the calculation of geodesics, has shown promising results for both synthetic and real data, but is limited in its applicability by its high computational requirements. We present a solution which uses the parallelism offered by modern GPUs, in combination with the CUDA platform by NVIDIA, to significantly reduce the execution time of the fiber-tracking algorithm. Compared to a multithreaded CPU implementation of the same algorithm, our GPU mapping achieves a speedup factor of up to 40 times. PMID:21941525
Multi-GPU accelerated three-dimensional FDTD method for electromagnetic simulation.

PubMed

Nagaoka, Tomoaki; Watanabe, Soichi

2011-01-01

Numerical simulation with a numerical human model using the finite-difference time domain (FDTD) method has recently been performed in a number of fields in biomedical engineering. To improve the method's calculation speed and realize large-scale computing with the numerical human model, we adapt three-dimensional FDTD code to a multi-GPU environment using Compute Unified Device Architecture (CUDA). In this study, we used NVIDIA Tesla C2070 as GPGPU boards. The performance of multi-GPU is evaluated in comparison with that of a single GPU and vector supercomputer. The calculation speed with four GPUs was approximately 3.5 times faster than with a single GPU, and was slightly (approx. 1.3 times) slower than with the supercomputer. Calculation speed of the three-dimensional FDTD method using GPUs can significantly improve with an expanding number of GPUs.
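The scaling figure quoted in this abstract can be turned into a parallel efficiency with one line of arithmetic. The helper name `parallel_efficiency` is an assumption introduced for illustration; the inputs are the values stated in the abstract (about 3.5 times faster on four GPUs).

```python
def parallel_efficiency(speedup, n_devices):
    """Fraction of ideal linear scaling achieved by n_devices."""
    return speedup / n_devices


# Figures quoted in the abstract: roughly 3.5x speedup on four GPUs.
efficiency = parallel_efficiency(3.5, 4)
print(f"parallel efficiency: {efficiency:.1%}")
```

An efficiency below 100% is expected here, since neighboring subdomains in a multi-GPU FDTD run have to exchange boundary field values every time step.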
MULTI-CORE AND OPTICAL PROCESSOR RELATED APPLICATIONS RESEARCH AT OAK RIDGE NATIONAL LABORATORY

DOE Office of Scientific and Technical Information (OSTI.GOV)

Barhen, Jacob; Kerekes, Ryan A; ST Charles, Jesse Lee

2008-01-01

High-speed parallelization of common tasks holds great promise as a low-risk approach to achieving the significant increases in signal processing and computational performance required for next generation innovations in reconfigurable radio systems. Researchers at the Oak Ridge National Laboratory have been working on exploiting the parallelization offered by this emerging technology and applying it to a variety of problems. This paper will highlight recent experience with four different parallel processors applied to signal processing tasks that are directly relevant to signal processing required for SDR/CR waveforms. The first is the EnLight Optical Core Processor applied to matched filter (MF) correlation processing via fast Fourier transform (FFT) of broadband Doppler-sensitive waveforms (DSW) using active sonar arrays for target tracking. The second is the IBM CELL Broadband Engine applied to 2-D discrete Fourier transform (DFT) kernel for image processing and frequency domain processing. And the third is the NVIDIA graphical processor applied to document feature clustering. EnLight Optical Core Processor. Optical processing is inherently capable of high-parallelism that can be translated to very high performance, low power dissipation computing. The EnLight 256 is a small form factor signal processing chip (5 × 5 cm²) with a digital optical core that is being developed by an Israeli startup company. As part of its evaluation of foreign technology, ORNL's Center for Engineering Science Advanced Research (CESAR) had access to a precursor EnLight 64 Alpha hardware for a preliminary assessment of capabilities in terms of large Fourier transforms for matched filter banks and on applications related to Doppler-sensitive waveforms. This processor is optimized for array operations, which it performs in fixed-point arithmetic at the rate of 16 TeraOPS at 8-bit precision. This is approximately 1000 times faster than the fastest DSP available today. The optical
Musrfit-Real Time Parameter Fitting Using GPUs

NASA Astrophysics Data System (ADS)

Locans, Uldis; Suter, Andreas

High transverse field μSR (HTF-μSR) experiments typically lead to rather large data sets, since it is necessary to follow the high frequencies present in the positron decay histograms. The analysis of these data sets can be very time consuming, usually due to the limited computational power of the hardware. To overcome the limited computing resources, the rotating reference frame transformation (RRF) is often used to reduce the data sets that need to be handled. This comes at a price that the μSR community is typically not aware of: (i) due to the RRF transformation the fitting parameter estimate is of poorer precision, i.e., more extended expensive beamtime is needed. (ii) RRF introduces systematic errors which hamper the statistical interpretation of χ² or the maximum log-likelihood. We will briefly discuss these issues in a non-exhaustive practical way. The sole reason for using the RRF transformation is limited computing power. Therefore, during this work GPU (Graphical Processing Unit) based fitting was developed, which makes it possible to perform real-time full data analysis without RRF. GPUs have become increasingly popular in scientific computing in recent years. Due to their highly parallel architecture they provide the opportunity to accelerate many applications at considerably lower cost than upgrading the CPU computational power. With the emergence of frameworks such as CUDA and OpenCL these devices have become more easily programmable. During this work GPU support was added to Musrfit, a data analysis framework for μSR experiments. The new fitting algorithm uses CUDA or OpenCL to offload the most time consuming parts of the calculations to Nvidia or AMD GPUs. Using the current CPU implementation in Musrfit, parameter fitting can take hours for certain data sets, while the GPU version can perform real-time data analysis on the same data sets. This work describes the challenges that arise in adding the GPU support to Musrfit as well as results obtained
DOE Office of Scientific and Technical Information (OSTI.GOV)

Edwards, Harold C.; Ibanez, Daniel Alejandro

This report documents the ASC/ATDM Kokkos deliverable "Production Portable Dynamic Task DAG Capability." This capability enables applications to create and execute a dynamic task DAG: a collection of heterogeneous computational tasks with a directed acyclic graph (DAG) of "execute after" dependencies where tasks and their dependencies are dynamically created and destroyed as tasks execute. The Kokkos task scheduler executes the dynamic task DAG on the target execution resource; e.g. a multicore CPU, a manycore CPU such as Intel's Knights Landing (KNL), or an NVIDIA GPU. Several major technical challenges had to be addressed during development of Kokkos' Task DAG capability: (1) portability to a GPU with its simplified hardware and micro-runtime, (2) thread-scalable memory allocation and deallocation from a bounded pool of memory, (3) thread-scalable scheduler for dynamic task DAG, (4) usability by applications.
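To make the "execute after" notion concrete, here is a minimal single-threaded sketch of a dynamic task DAG in which a running task may spawn further tasks. It is not the Kokkos API: the names `run_task_dag` and `make_task`, and the dict-based task layout, are hypothetical and chosen only for illustration.

```python
from collections import deque


def run_task_dag(initial_tasks):
    """Run tasks in an order that respects their 'after' dependencies.

    Each task is a dict with keys:
      name  -- unique string
      after -- names of tasks that must finish first
      run   -- callable returning a list of newly spawned tasks
    Returns the list of task names in execution order.
    """
    pending = {t["name"]: t for t in initial_tasks}   # tasks not yet runnable
    done, order, ready = set(), [], deque()

    def promote():
        # Move every task whose dependencies are all finished to the ready queue.
        for name, task in list(pending.items()):
            if all(dep in done for dep in task["after"]):
                ready.append(pending.pop(name))

    promote()
    while ready:
        task = ready.popleft()
        spawned = task["run"]()            # a task may create new tasks while running
        done.add(task["name"])
        order.append(task["name"])
        for new_task in spawned:
            pending[new_task["name"]] = new_task
        promote()
    if pending:
        raise ValueError("unsatisfiable dependencies: %s" % sorted(pending))
    return order


def make_task(name, after, spawn=()):
    """Build a task that does no work except spawning the given tasks."""
    return {"name": name, "after": list(after), "run": lambda: list(spawn)}


child = make_task("c", ["a", "b"])                    # created dynamically by task "a"
tasks = [make_task("a", [], spawn=[child]), make_task("b", ["a"])]
order = run_task_dag(tasks)
```

A production scheduler would run ready tasks concurrently and allocate them from a bounded memory pool, which is where the challenges listed in the report arise; this sketch only shows the dependency bookkeeping.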
GPU-based real-time trinocular stereo vision

NASA Astrophysics Data System (ADS)

Yao, Yuanbin; Linton, R. J.; Padir, Taskin

2013-01-01

Most stereovision applications are binocular which uses information from a 2-camera array to perform stereo matching and compute the depth image. Trinocular stereovision with a 3-camera array has been proved to provide higher accuracy in stereo matching which could benefit applications like distance finding, object recognition, and detection. This paper presents a real-time stereovision algorithm implemented on a GPGPU (General-purpose graphics processing unit) using a trinocular stereovision camera array. Algorithm employs a winner-take-all method applied to perform fusion of disparities in different directions following various image processing techniques to obtain the depth information. The goal of the algorithm is to achieve real-time processing speed with the help of a GPGPU involving the use of Open Source Computer Vision Library (OpenCV) in C++ and NVidia CUDA GPGPU Solution. The results are compared in accuracy and speed to verify the improvement.
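The winner-take-all step mentioned in this abstract can be sketched for a single scanline of one rectified image pair. The function name `wta_disparity` and the absolute-difference cost are assumptions made for illustration; the paper's own cost function and its fusion of disparities across camera directions are not reproduced here.

```python
def wta_disparity(left, right, max_disp):
    """Winner-take-all disparity for one scanline.

    left, right -- lists of pixel intensities from a rectified pair
    For each left pixel x, try shifts d = 0..max_disp and keep the d
    with the smallest cost |left[x] - right[x - d]|.
    """
    disparities = []
    for x in range(len(left)):            # each pixel is independent of the others
        best_d, best_cost = 0, float("inf")
        for d in range(min(max_disp, x) + 1):
            cost = abs(left[x] - right[x - d])
            if cost < best_cost:
                best_d, best_cost = d, cost
        disparities.append(best_d)
    return disparities


right = [10, 20, 30, 40, 50, 60]
left = [0, 0, 10, 20, 30, 40]             # same ramp shifted by two pixels
disp = wta_disparity(left, right, max_disp=3)
```

One plausible way to extend this to three cameras, stated here as an assumption, is to add the costs from a second camera pair for each candidate disparity before taking the minimum.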

CUDAEASY - a GPU accelerated cosmological lattice program

NASA Astrophysics Data System (ADS)

Sainio, J.

2010-05-01

This paper presents, to the author's knowledge, the first graphics processing unit (GPU) accelerated program that solves the evolution of interacting scalar fields in an expanding universe. We present the implementation in NVIDIA's Compute Unified Device Architecture (CUDA) and compare the performance to other similar programs in chaotic inflation models. We report speedups between one and two orders of magnitude depending on the used hardware and software while achieving small errors in single precision. Simulations that used to last roughly one day to compute can now be done in hours and this difference is expected to increase in the future. The program has been written in the spirit of LATTICEEASY and users of the aforementioned program should find it relatively easy to start using CUDAEASY in lattice simulations. The program is available at http://www.physics.utu.fi/theory/particlecosmology/cudaeasy/ under the GNU General Public License.
Toward performance portability of the Albany finite element analysis code using the Kokkos library

DOE Office of Scientific and Technical Information (OSTI.GOV)

Demeshko, Irina; Watkins, Jerry; Tezaur, Irina K.

Performance portability on heterogeneous high-performance computing (HPC) systems is a major challenge faced today by code developers: parallel code needs to be executed correctly as well as with high performance on machines with different architectures, operating systems, and software libraries. The finite element method (FEM) is a popular and flexible method for discretizing partial differential equations arising in a wide variety of scientific, engineering, and industrial applications that require HPC. This paper presents some preliminary results pertaining to our development of a performance portable implementation of the FEM-based Albany code. Performance portability is achieved using the Kokkos library. We present performance results for the Aeras global atmosphere dynamical core module in Albany. Finally, numerical experiments show that our single code implementation gives reasonable performance across three multicore/many-core architectures: NVIDIA Graphics Processing Units (GPUs), Intel Xeon Phis, and multicore CPUs.
Toward performance portability of the Albany finite element analysis code using the Kokkos library

DOE PAGES

Demeshko, Irina; Watkins, Jerry; Tezaur, Irina K.; ...

2018-02-05

Performance portability on heterogeneous high-performance computing (HPC) systems is a major challenge faced today by code developers: parallel code needs to be executed correctly as well as with high performance on machines with different architectures, operating systems, and software libraries. The finite element method (FEM) is a popular and flexible method for discretizing partial differential equations arising in a wide variety of scientific, engineering, and industrial applications that require HPC. This paper presents some preliminary results pertaining to our development of a performance portable implementation of the FEM-based Albany code. Performance portability is achieved using the Kokkos library. We present performance results for the Aeras global atmosphere dynamical core module in Albany. Finally, numerical experiments show that our single code implementation gives reasonable performance across three multicore/many-core architectures: NVIDIA Graphics Processing Units (GPUs), Intel Xeon Phis, and multicore CPUs.
GPU Particle Tracking and MHD Simulations with Greatly Enhanced Computational Speed

NASA Astrophysics Data System (ADS)

Ziemba, T.; O'Donnell, D.; Carscadden, J.; Cash, M.; Winglee, R.; Harnett, E.

2008-12-01

GPUs are intrinsically highly parallelized systems that provide more than an order of magnitude greater computing speed than CPU-based systems, for less cost than a high-end workstation. Recent advancements in GPU technologies allow for full IEEE float specifications with performance up to several hundred GFLOPs per GPU, and new software architectures have recently become available to ease the transition from graphics based to scientific applications. This allows for a cheap alternative to standard supercomputing methods and should shorten the time to discovery. 3-D particle tracking and MHD codes have been developed using NVIDIA's CUDA and have demonstrated a speedup of nearly a factor of 20 over equivalent CPU versions of the codes. Such a speedup enables new applications to develop, including real-time running of radiation belt simulations and real-time running of global magnetospheric simulations, both of which could provide important space weather prediction tools.
Model-independent partial wave analysis using a massively-parallel fitting framework

NASA Astrophysics Data System (ADS)

Sun, L.; Aoude, R.; dos Reis, A. C.; Sokoloff, M.

2017-10-01

The functionality of GooFit, a GPU-friendly framework for doing maximum-likelihood fits, has been extended to extract model-independent $\mathscr{S}$-wave amplitudes in three-body decays such as $D^+ \to h^+ h^+ h^-$. A full amplitude analysis is done where the magnitudes and phases of the $\mathscr{S}$-wave amplitudes are anchored at a finite number of $m^2(h^+ h^-)$ control points, and a cubic spline is used to interpolate between these points. The amplitudes for $\mathscr{P}$-wave and $\mathscr{D}$-wave intermediate states are modeled as spin-dependent Breit-Wigner resonances. GooFit uses the Thrust library, with a CUDA backend for NVIDIA GPUs and an OpenMP backend for threads with conventional CPUs. Performance on a variety of platforms is compared. Executing on systems with GPUs is typically a few hundred times faster than executing the same algorithm on a single CPU.
Rapid automated classification of anesthetic depth levels using GPU based parallelization of neural networks.

PubMed

Peker, Musa; Şen, Baha; Gürüler, Hüseyin

2015-02-01

The effect of anesthesia on the patient is referred to as depth of anesthesia. Rapid classification of appropriate depth level of anesthesia is a matter of great importance in surgical operations. Similarly, accelerating classification algorithms is important for the rapid solution of problems in the field of biomedical signal processing. However, numerous time-consuming mathematical operations are required during the training and testing stages of classification algorithms, especially in neural networks. In this study, to accelerate the process, a parallel programming and computing platform (Nvidia CUDA), which facilitates dramatic increases in computing performance by harnessing the power of the graphics processing unit (GPU), was utilized. The system was employed to detect anesthetic depth level on a related electroencephalogram (EEG) data set. This data set is rather complex and large. Moreover, distinguishing more anesthetic levels with rapid response is critical in anesthesia. The proposed parallelization method yielded highly accurate classification results in less time.
Nanoscale multireference quantum chemistry: full configuration interaction on graphical processing units.

PubMed

Fales, B Scott; Levine, Benjamin G

2015-10-13

Methods based on a full configuration interaction (FCI) expansion in an active space of orbitals are widely used for modeling chemical phenomena such as bond breaking, multiply excited states, and conical intersections in small-to-medium-sized molecules, but these phenomena occur in systems of all sizes. To scale such calculations up to the nanoscale, we have developed an implementation of FCI in which electron repulsion integral transformation and several of the more expensive steps in σ vector formation are performed on graphical processing unit (GPU) hardware. When applied to a 1.7 × 1.4 × 1.4 nm silicon nanoparticle (Si72H64) described with the polarized, all-electron 6-31G** basis set, our implementation can solve for the ground state of the 16-active-electron/16-active-orbital CASCI Hamiltonian (more than 100,000,000 configurations) in 39 min on a single NVidia K40 GPU.
GPU accelerated implementation of NCI calculations using promolecular density.

PubMed

Rubez, Gaëtan; Etancelin, Jean-Matthieu; Vigouroux, Xavier; Krajecki, Michael; Boisson, Jean-Charles; Hénon, Eric

2017-05-30

The NCI approach is a modern tool to reveal chemical noncovalent interactions. It is particularly attractive to describe ligand-protein binding. A custom implementation for NCI using promolecular density is presented. It is designed to leverage the computational power of NVIDIA graphics processing unit (GPU) accelerators through the CUDA programming model. The code performances of three versions are examined on a test set of 144 systems. NCI calculations are particularly well suited to the GPU architecture, which reduces drastically the computational time. On a single compute node, the dual-GPU version leads to a 39-fold improvement for the biggest instance compared to the optimal OpenMP parallel run (C code, icc compiler) with 16 CPU cores. Energy consumption measurements carried out on both CPU and GPU NCI tests show that the GPU approach provides substantial energy savings. © 2017 Wiley Periodicals, Inc.
A GPU-based calculation using the three-dimensional FDTD method for electromagnetic field analysis.

PubMed

Nagaoka, Tomoaki; Watanabe, Soichi

2010-01-01

Numerical simulations with the numerical human model using the finite-difference time domain (FDTD) method have recently been performed frequently in a number of fields in biomedical engineering. However, the FDTD calculation runs too slowly. We focus, therefore, on general purpose programming on the graphics processing unit (GPGPU). The three-dimensional FDTD method was implemented on the GPU using Compute Unified Device Architecture (CUDA). In this study, we used the NVIDIA Tesla C1060 as a GPGPU board. The performance of the GPU is evaluated in comparison with the performance of a conventional CPU and a vector supercomputer. The results indicate that three-dimensional FDTD calculations using a GPU can significantly reduce run time in comparison with that using a conventional CPU, even a native GPU implementation of the three-dimensional FDTD method, while the GPU/CPU speed ratio varies with the calculation domain and thread block size.
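For readers unfamiliar with the method, the core of FDTD is a leapfrog update of electric and magnetic fields on a staggered grid. The sketch below is a one-dimensional version in normalized units, offered only as an illustration of that update pattern; the function name `fdtd_1d`, the grid size, the Courant number, and the impulsive source are assumptions, and the abstract's implementation is three-dimensional and written in CUDA.

```python
def fdtd_1d(n_cells, n_steps, source_cell):
    """Leapfrog update of E and H on a staggered 1-D grid (normalized units)."""
    courant = 0.5                        # Courant number; <= 1 keeps the 1-D scheme stable
    e = [0.0] * n_cells                  # electric field at integer grid points
    h = [0.0] * (n_cells - 1)            # magnetic field at half grid points
    for step in range(n_steps):
        # H update: each cell needs only its two neighboring E values.
        for i in range(n_cells - 1):
            h[i] += courant * (e[i + 1] - e[i])
        # E update: interior points only; the end points stay at zero,
        # which acts as a perfectly conducting boundary.
        for i in range(1, n_cells - 1):
            e[i] += courant * (h[i] - h[i - 1])
        if step == 0:
            e[source_cell] += 1.0        # impulsive source injected once
    return e


fields = fdtd_1d(n_cells=50, n_steps=40, source_cell=25)
```

Within each of the two inner loops every cell update is independent of the others, so on a GPU each loop body maps naturally onto one thread per cell, with the thread block size as a tuning parameter.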
A hybrid parallel architecture for electrostatic interactions in the simulation of dissipative particle dynamics

NASA Astrophysics Data System (ADS)

Yang, Sheng-Chun; Lu, Zhong-Yuan; Qian, Hu-Jun; Wang, Yong-Lei; Han, Jie-Ping

2017-11-01

In this work, we upgraded the electrostatic interaction method of CU-ENUF (Yang, et al., 2016) which first applied CUNFFT (nonequispaced Fourier transforms based on CUDA) to the reciprocal-space electrostatic computation and made the computation of electrostatic interaction done thoroughly in GPU. The upgraded edition of CU-ENUF runs concurrently in a hybrid parallel way that enables the computation parallelizing on multiple computer nodes firstly, then further on the installed GPU in each computer. By this parallel strategy, the size of simulation system will be never restricted to the throughput of a single CPU or GPU. The most critical technical problem is how to parallelize a CUNFFT in the parallel strategy, which is conquered effectively by deep-seated research of basic principles and some algorithm skills. Furthermore, the upgraded method is capable of computing electrostatic interactions for both the atomistic molecular dynamics (MD) and the dissipative particle dynamics (DPD). Finally, the benchmarks conducted for validation and performance indicate that the upgraded method is able to not only present a good precision when setting suitable parameters, but also give an efficient way to compute electrostatic interactions for huge simulation systems. Program Files doi:http://dx.doi.org/10.17632/zncf24fhpv.1 Licensing provisions: GNU General Public License 3 (GPL) Programming language: C, C++, and CUDA C Supplementary material: The program is designed for effective electrostatic interactions of large-scale simulation systems, which runs on particular computers equipped with NVIDIA GPUs. It has been tested on (a) single computer node with Intel(R) Core(TM) i7-3770@ 3.40 GHz (CPU) and GTX 980 Ti (GPU), and (b) MPI parallel computer nodes with the same configurations. Nature of problem: For molecular dynamics simulation, the electrostatic interaction is the most time-consuming computation because of its long-range feature and slow convergence in simulation space
Accumulation, Biotransformation, Histopathology and Paralysis in the Pacific Calico Scallop Argopecten ventricosus by the Paralyzing Toxins of the Dinoflagellate Gymnodinium catenatum

PubMed Central

Escobedo-Lozano, Amada Y.; Estrada, Norma; Ascencio, Felipe; Contreras, Gerardo; Alonso-Rodriguez, Rosalba

2012-01-01

The dinoflagellate Gymnodinium catenatum produces paralyzing shellfish poisons that are consumed and accumulated by bivalves. We performed short-term feeding experiments to examine ingestion, accumulation, biotransformation, histopathology, and paralysis in the juvenile Pacific calico scallop Argopecten ventricosus that consume this dinoflagellate. Depletion of algal cells was measured in closed systems. Histopathological preparations were microscopically analyzed. Paralysis was observed and the time of recovery recorded. Accumulation and possible biotransformation of toxins were measured by HPLC analysis. Feeding activity in treated scallops showed that scallops produced pseudofeces, ingestion rates decreased at 8 h; approximately 60% of the scallops were paralyzed and melanin production and hemocyte aggregation were observed in several tissues at 15 h. HPLC analysis showed that the only toxins present in the dinoflagellates and scallops were the N-sulfo-carbamoyl toxins (C1, C2); after hydrolysis, the carbamate toxins (epimers GTX2/3) were present. C1 and C2 toxins were most common in the mantle, followed by the digestive gland and stomach-complex, adductor muscle, kidney and rectum group, and finally, gills. Toxin profiles in scallop tissue were similar to the dinoflagellate; biotransformations were not present in the scallops in this short-term feeding experiment. PMID:22822356
Modular Classification of Endoscopic Endonasal Transsphenoidal Approaches to Sellar Region: Anatomic Quantitative Study.

PubMed

Belotti, Francesco; Doglietto, Francesco; Schreiber, Alberto; Ravanelli, Marco; Ferrari, Marco; Lancini, Davide; Rampinelli, Vittorio; Hirtler, Lena; Buffoli, Barbara; Bolzoni Villaret, Andrea; Maroldi, Roberto; Rodella, Luigi Fabrizio; Nicolai, Piero; Fontanella, Marco Maria

2018-01-01

Endoscopic visualization does not necessarily correspond to an adequate working space. The need for balancing invasiveness and adequacy of sellar tumor exposure has recently led to the description of multiple endoscopic endonasal transsphenoidal approaches. Comparative anatomic data on these variants are lacking. We sought to quantitatively compare endoscopic endonasal transsphenoidal approaches to the sella and parasellar region, using the concept of "surgical pyramid." Four endoscopic transsphenoidal approaches were performed in 10 injected specimens: 1) hemisphenoidotomy; 2) transrostral; 3) extended transrostral (with superior turbinectomy); and 4) extended transrostral with posterior ethmoidectomy. ApproachViewer software (part of GTx-Eyes II, University Health Network, Toronto, Canada) with a dedicated navigation system was used to quantify the surgical pyramid volume, as well as exposure of sellar and parasellar areas. Statistical analyses were performed with Friedman's tests and Nemenyi's procedure. Hemisphenoidotomy provided limited exposure of the sellar area and a small working volume. A transrostral approach was necessary to expose the entire sella. Exposure of lateral parasellar areas required superior turbinectomy or posterior ethmoidectomy. The differences between each of the modules was statistically significant. The present study validates, from an anatomic point of view, a modular classification of endoscopic endonasal transsphenoidal approaches to the sellar region. Copyright © 2017 Elsevier Inc. All rights reserved.
Development and Validation of a Liquid Chromatography-Tandem Mass Spectrometry Method Coupled with Dispersive Solid-Phase Extraction for Simultaneous Quantification of Eight Paralytic Shellfish Poisoning Toxins in Shellfish.

PubMed

Yang, Xianli; Zhou, Lei; Tan, Yanglan; Shi, Xizhi; Zhao, Zhiyong; Nie, Dongxia; Zhou, Changyan; Liu, Hong

2017-06-29

In this study, a high-performance liquid chromatography-tandem mass spectrometry (HPLC-MS/MS) method was developed for simultaneous determination of eight paralytic shellfish poisoning (PSP) toxins, including saxitoxin (STX), neosaxitoxin (NEO), gonyautoxins (GTX1-4) and the N-sulfocarbamoyl toxins C1 and C2, in sea shellfish. The samples were extracted by acetonitrile/water (80:20, v/v) with 0.1% formic acid and purified by dispersive solid-phase extraction (dSPE) with C18 silica and acidic alumina. Qualitative and quantitative detection for the target toxins were conducted under the multiple reaction monitoring (MRM) mode by using the positive electrospray ionization (ESI) mode after chromatographic separation on a TSK-gel Amide-80 HILIC column with water and acetonitrile. Matrix-matched calibration was used to compensate for matrix effects. The established method was further validated by determining the linearity (R² ≥ 0.9900), average recovery (81.52-116.50%), sensitivity (limits of detection (LODs): 0.33-5.52 μg·kg⁻¹; limits of quantitation (LOQs): 1.32-11.29 μg·kg⁻¹) and precision (relative standard deviation (RSD) ≤ 19.10%). The application of this proposed approach to thirty shellfish samples proved its desirable performance and sufficient capability for simultaneous determination of multiclass PSP toxins in sea foods.
Paralytic shellfish toxin producing Aphanizomenon gracile strains isolated from Lake Iznik, Turkey.

PubMed

Yilmaz, Mete; Foss, Amanda J; Selwood, Andrew I; Özen, Mihriban; Boundy, Michael

2018-06-15

Aphanizomenon gracile is one of the most widespread Paralytic Shellfish Toxin (PST) producing cyanobacteria in freshwater bodies in the Northern Hemisphere. It has been shown to produce various PST congeners, including saxitoxin (STX), neosaxitoxin (NEO), decarbamoylsaxitoxin (dcSTX) and gonyautoxin 5 (GTX5) in Europe, North America and Asia. Three cyanobacteria strains were isolated in Lake Iznik in northwestern Turkey. Morphological characterization of these strains suggested all three strains conformed to classical taxonomic identification of A. gracile with some differences such as clumping of filaments, partially hyaline cells in some filaments and longer than usual vegetative cells. Sequences of 16S rRNA gene of these strains were placed within an A. gracile cluster including the majority of PST producing strains, confirming the identification of these strains as A. gracile. These new strains possessed saxitoxin biosynthesis genes sxtA, sxtG and their sequences clustered with those of other A. gracile. Liquid chromatography tandem mass spectrometry (LC-MS/MS) analysis demonstrated the presence of NEO, STX, dcSTX and decarbamoylneosaxitoxin (dcNEO) in all strains. This is the first report of a PST producer in any water body in Turkey and first observation of dcNEO in an A. gracile culture. Copyright © 2018 Elsevier Ltd. All rights reserved.
A Comparison of Children's Physical Activity Levels in Physical Education, Recess, and Exergaming.

PubMed

Gao, Zan; Chen, Senlin; Stodden, David F

2015-03-01

To compare young children's different intensity physical activity (PA) levels in physical education, recess and exergaming programs. Participants were 140 first and second grade children (73 girls; mean age = 7.88 years). Beyond the daily 20-minute recess, participants attended 75-minute weekly physical education classes and 75-minute weekly exergaming classes. Children's PA levels were assessed by ActiGraph GTX3 accelerometers for 3 sessions in the 3 programs. The outcome variables were percentages of time spent in sedentary, light PA and moderate-to-vigorous PA (MVPA). There were significant main effects for program and grade, and an interaction effect for program by grade. Specifically, children's MVPA in exergaming and recess was higher than in physical education. The 2nd-grade children demonstrated lower sedentary behavior and MVPA than the first-grade children during recess; less light PA in both recess and exergaming than first-grade children; and less sedentary behavior but higher MVPA in exergaming than first-grade children. Young children generated higher PA levels in recess and exergaming as compared with physical education. Hence, other school-based PA programs may serve as essential components of a comprehensive school PA program. Implications are provided for educators and health professionals.
GPU accelerated Monte Carlo simulation of Brownian motors dynamics with CUDA

NASA Astrophysics Data System (ADS)

Spiechowicz, J.; Kostur, M.; Machura, L.

2015-06-01

This work presents an updated and extended guide on methods of a proper acceleration of the Monte Carlo integration of stochastic differential equations with the commonly available NVIDIA Graphics Processing Units using the CUDA programming environment. We outline the general aspects of the scientific computing on graphics cards and demonstrate them with two models of a well known phenomenon of the noise induced transport of Brownian motors in periodic structures. As a source of fluctuations in the considered systems we selected the three most commonly occurring noises: the Gaussian white noise, the white Poissonian noise and the dichotomous process also known as a random telegraph signal. The detailed discussion on various aspects of the applied numerical schemes is also presented. The measured speedup can be of the astonishing order of about 3000 when compared to a typical CPU. This number significantly expands the range of problems solvable by use of stochastic simulations, allowing even an interactive research in some cases.
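As a rough illustration of the kind of Monte Carlo integration this abstract describes, the Python sketch below integrates an overdamped Langevin equation with Gaussian white noise by the Euler-Maruyama scheme and averages over independent trajectories. The ratchet potential, the parameter values and all function names are assumptions made for illustration and are not taken from the paper; on a GPU each trajectory would typically be assigned to its own thread, whereas here a plain loop stands in for that.

```python
import math
import random


def ratchet_force(x):
    """Force -V'(x) for the assumed potential V(x) = sin(2*pi*x) + 0.25*sin(4*pi*x)."""
    return -2.0 * math.pi * math.cos(2.0 * math.pi * x) - math.pi * math.cos(4.0 * math.pi * x)


def simulate_path(n_steps, dt, noise, bias, rng):
    """Euler-Maruyama integration of dx = (F(x) + bias) dt + sqrt(2*noise) dW."""
    x = 0.0
    amplitude = math.sqrt(2.0 * noise * dt)  # standard deviation of one noise increment
    for _ in range(n_steps):
        x += (ratchet_force(x) + bias) * dt + amplitude * rng.gauss(0.0, 1.0)
    return x


def mean_velocity(n_paths=200, n_steps=1000, dt=1e-3, noise=0.5, bias=0.0, seed=0):
    """Average displacement per unit time over independent trajectories."""
    rng = random.Random(seed)
    total = sum(simulate_path(n_steps, dt, noise, bias, rng) for _ in range(n_paths))
    return total / (n_paths * n_steps * dt)


if __name__ == "__main__":
    print(mean_velocity(bias=1.0))
```

Because the trajectories are statistically independent, the outer sum is the part that maps naturally onto many GPU threads; the time-stepping loop inside each trajectory stays sequential.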
Scaling deep learning on GPU and knights landing clusters

DOE Office of Scientific and Technical Information (OSTI.GOV)

You, Yang; Buluc, Aydin; Demmel, James

Training neural networks has become a major bottleneck. For example, training on the ImageNet dataset with one Nvidia K20 GPU takes 21 days. To speed up training, current deep learning systems rely heavily on hardware accelerators. However, these accelerators have limited on-chip memory compared with CPUs. We use both self-hosted Intel Knights Landing (KNL) clusters and multi-GPU clusters as our target platforms. On the algorithm side, we focus on Elastic Averaging SGD (EASGD) to design algorithms for HPC clusters. We redesign four efficient algorithms for HPC systems to improve EASGD's poor scaling on clusters. Async EASGD, Async MEASGD, and Hogwild EASGD are faster than their existing counterparts (Async SGD, Async MSGD, and Hogwild SGD) in all comparisons. Sync EASGD achieves a 5.3X speedup over the original EASGD on the same platform. We achieve 91.5% weak scaling efficiency on 4253 KNL cores, which is higher than the state-of-the-art implementation.
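For readers unfamiliar with Elastic Averaging SGD, the following Python sketch shows the commonly stated synchronous update: each worker takes a gradient step plus an elastic pull toward a shared center variable, and the center moves toward the workers. It is a toy on a one-dimensional quadratic loss, not the redesigned algorithms of this record; the loss, the hyperparameters and the function name `easgd` are assumptions for illustration.

```python
import random


def easgd(n_workers=4, steps=200, lr=0.1, rho=0.5, target=3.0, noise=0.1, seed=0):
    """Synchronous EASGD on the toy loss f(x) = 0.5 * (x - target)**2."""
    rng = random.Random(seed)
    workers = [rng.uniform(-1.0, 1.0) for _ in range(n_workers)]
    center = 0.0
    for _ in range(steps):
        # Elastic differences are computed from the current iterate of every worker.
        diffs = [w - center for w in workers]
        # Noisy gradient of the toy loss at each worker's parameters.
        grads = [(w - target) + rng.gauss(0.0, noise) for w in workers]
        # Worker update: gradient step plus elastic pull toward the center.
        workers = [w - lr * (g + rho * d) for w, g, d in zip(workers, grads, diffs)]
        # Center update: move toward the workers by the summed elastic term.
        center += lr * rho * sum(diffs)
    return center, workers


if __name__ == "__main__":
    center, workers = easgd()
    print(center, workers)
```

In a distributed setting the `diffs` exchange is the only communication between workers and the center, which is why the elastic coupling strength `rho` and the communication schedule are the natural knobs when tuning scaling behaviour.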
Particle-in-cell simulations with charge-conserving current deposition on graphic processing units

NASA Astrophysics Data System (ADS)

Ren, Chuang; Kong, Xianglong; Huang, Michael; Decyk, Viktor; Mori, Warren

2011-10-01

Recently, using CUDA, we have developed an electromagnetic Particle-in-Cell (PIC) code with charge-conserving current deposition for Nvidia graphics processing units (GPUs) (Kong et al., Journal of Computational Physics 230, 1676 (2011)). On a Tesla M2050 (Fermi) card, the GPU PIC code can achieve a one-particle-step process time of 1.2 - 3.2 ns in 2D and 2.3 - 7.2 ns in 3D, depending on plasma temperatures. In this talk we will discuss novel algorithms for GPU-PIC, including a charge-conserving current deposition scheme with little branching and parallel particle sorting. These algorithms make efficient use of the GPU shared memory. We will also discuss how to replace the computation kernels of existing parallel CPU codes while keeping their parallel structures. This work was supported by U.S. Department of Energy under Grant Nos. DE-FG02-06ER54879 and DE-FC02-04ER54789 and by NSF under Grant Nos. PHY-0903797 and CCF-0747324.
GPU-based streaming architectures for fast cone-beam CT image reconstruction and demons deformable registration.

PubMed

Sharp, G C; Kandasamy, N; Singh, H; Folkert, M

2007-10-07

This paper shows how to significantly accelerate cone-beam CT reconstruction and 3D deformable image registration using the stream-processing model. We describe data-parallel designs for the Feldkamp, Davis and Kress (FDK) reconstruction algorithm, and the demons deformable registration algorithm, suitable for use on a commodity graphics processing unit. The streaming versions of these algorithms are implemented using the Brook programming environment and executed on an NVidia 8800 GPU. Performance results using CT data of a preserved swine lung indicate that the GPU-based implementations of the FDK and demons algorithms achieve a substantial speedup--up to 80 times for FDK and 70 times for demons when compared to an optimized reference implementation on a 2.8 GHz Intel processor. In addition, the accuracy of the GPU-based implementations was found to be excellent. Compared with CPU-based implementations, the RMS differences were less than 0.1 Hounsfield unit for reconstruction and less than 0.1 mm for deformable registration.
Understanding Portability of a High-Level Programming Model on Contemporary Heterogeneous Architectures

DOE PAGES

Sabne, Amit J.; Sakdhnagool, Putt; Lee, Seyong; ...

2015-07-13

Accelerator-based heterogeneous computing is gaining momentum in the high-performance computing arena. However, the increased complexity of heterogeneous architectures demands more generic, high-level programming models. OpenACC is one such attempt to tackle this problem. Although the abstraction provided by OpenACC offers productivity, it raises questions concerning both functional and performance portability. In this article, the authors propose HeteroIR, a high-level, architecture-independent intermediate representation, to map high-level programming models, such as OpenACC, to heterogeneous architectures. They present a compiler approach that translates OpenACC programs into HeteroIR and accelerator kernels to obtain OpenACC functional portability. They then evaluate the performance portability obtained by OpenACC with their approach on 12 OpenACC programs on Nvidia CUDA, AMD GCN, and Intel Xeon Phi architectures. They study the effects of various compiler optimizations and OpenACC program settings on these architectures to provide insights into the achieved performance portability.

A portable platform for accelerated PIC codes and its application to GPUs using OpenACC

NASA Astrophysics Data System (ADS)

Hariri, F.; Tran, T. M.; Jocksch, A.; Lanti, E.; Progsch, J.; Messmer, P.; Brunner, S.; Gheller, C.; Villard, L.

2016-10-01

We present a portable platform, called PIC_ENGINE, for accelerating Particle-In-Cell (PIC) codes on heterogeneous many-core architectures such as Graphic Processing Units (GPUs). The aim of this development is efficient simulations on future exascale systems by allowing different parallelization strategies depending on the application problem and the specific architecture. To this end, this platform contains the basic steps of the PIC algorithm and has been designed as a test bed for different algorithmic options and data structures. Among the architectures that this engine can explore, particular attention is given here to systems equipped with GPUs. The study demonstrates that our portable PIC implementation based on the OpenACC programming model can achieve performance closely matching theoretical predictions. Using the Cray XC30 system, Piz Daint, at the Swiss National Supercomputing Centre (CSCS), we show that PIC_ENGINE running on an NVIDIA Kepler K20X GPU can outperform the same code running on an Intel Sandy Bridge 8-core CPU by a factor of 3.4.
Accelerating a three-dimensional eco-hydrological cellular automaton on GPGPU with OpenCL

NASA Astrophysics Data System (ADS)

Senatore, Alfonso; D'Ambrosio, Donato; De Rango, Alessio; Rongo, Rocco; Spataro, William; Straface, Salvatore; Mendicino, Giuseppe

2016-10-01

This work presents an effective implementation of a numerical model for complete eco-hydrological Cellular Automata modeling on Graphical Processing Units (GPU) with OpenCL (Open Computing Language) for heterogeneous computation (i.e., on CPUs and/or GPUs). Different types of parallel implementations were carried out (e.g., use of fast local memory, loop unrolling, etc.), showing increasing performance improvements in terms of speedup, also adopting some original optimization strategies. Moreover, numerical analysis of the results (i.e., comparison of CPU and GPU outcomes in terms of rounding errors) has proven to be satisfactory. Experiments were carried out on a workstation with two CPUs (Intel Xeon E5440 at 2.83GHz), one GPU AMD R9 280X and one GPU nVIDIA Tesla K20c. Results have been extremely positive, but further testing should be performed to assess the functionality of the adopted strategies on other complete models and their ability to fruitfully exploit parallel systems resources.
Prism-based single-camera system for stereo display

NASA Astrophysics Data System (ADS)

Zhao, Yue; Cui, Xiaoyu; Wang, Zhiguo; Chen, Hongsheng; Fan, Heyu; Wu, Teresa

2016-06-01

This paper combines a prism with a single camera and puts forward a low-cost method of stereo imaging. First, according to the principles of geometrical optics, we deduce the relationship between the prism single-camera system and a dual-camera system, and according to the principle of binocular vision we deduce the relationship between binoculars and a dual camera. We can thus establish the relationship between the prism single-camera system and binoculars and obtain the positional relation of prism, camera, and object that gives the best stereo display effect. Finally, using NVIDIA's active shutter stereo glasses, we realize the three-dimensional (3-D) display of the object. The experimental results show that the proposed approach can use the prism single-camera system to simulate the various observation manners of the eyes. The stereo imaging system designed by the proposed method can faithfully restore the 3-D shape of the object being photographed.
Phases and Dynamics of Self-Assembled DNA Programmed Nanocubes

NASA Astrophysics Data System (ADS)

Knorowski, Christopher; Travesset, Alex

2013-03-01

Systems of Nanoparticles grafted with complementary DNA strands have been shown to self-assemble into an array of superlattices. In this talk, we extend our previous model, which successfully predicted equilibrium phases and dynamics of assembly for spherical Nanoparticles without fitting parameters, to the case of nanocubes. We show that the phase diagram consists of bcc and sc lattices, depending on DNA length. The bcc lattices are either rotator and orientational glass or cubatic. For temperatures above the DNA melting temperature, the system is equivalent to f-star polymer systems, and consists of bcc, also with rotator, orientational glass or cubatic orientational order as well as sc. We also provide a characterization of the dynamics, including the role of topological defects in crystal nucleation and growth. This work is funded by DOE through the Ames Lab under Contract DE-AC02-07CH11358. Most simulations are performed on the Exalted GPU cluster, which is funded by a grant from Iowa State University and Nvidia Corp.
Clinical implementation of a GPU-based simplified Monte Carlo method for a treatment planning system of proton beam therapy.

PubMed

Kohno, R; Hotta, K; Nishioka, S; Matsubara, K; Tansho, R; Suzuki, T

2011-11-21

We implemented the simplified Monte Carlo (SMC) method on graphics processing unit (GPU) architecture under the compute unified device architecture platform developed by NVIDIA. The GPU-based SMC was clinically applied for four patients with head and neck, lung, or prostate cancer. The results were compared to those obtained by a traditional CPU-based SMC with respect to the computation time and discrepancy. In the CPU- and GPU-based SMC calculations, the estimated mean statistical errors of the calculated doses in the planning target volume region were within 0.5% rms. The dose distributions calculated by the GPU- and CPU-based SMCs were similar, within statistical errors. The GPU-based SMC was 12.30-16.00 times faster than the CPU-based SMC. The computation time per beam arrangement using the GPU-based SMC for the clinical cases ranged from 9 to 67 s. The results demonstrate the successful application of the GPU-based SMC to clinical proton treatment planning.
Accelerating three-dimensional FDTD calculations on GPU clusters for electromagnetic field simulation.

PubMed

Nagaoka, Tomoaki; Watanabe, Soichi

2012-01-01

Electromagnetic simulation with anatomically realistic computational human models using the finite-difference time domain (FDTD) method has recently been performed in a number of fields in biomedical engineering. To improve the method's calculation speed and realize large-scale computing with the computational human model, we adapt three-dimensional FDTD code to a multi-GPU cluster environment with Compute Unified Device Architecture and Message Passing Interface. Our multi-GPU cluster system consists of three nodes, each fitted with seven GPU boards (NVIDIA Tesla C2070). We examined the performance of the FDTD calculation in the multi-GPU cluster environment. We confirmed that the FDTD calculation on the multi-GPU cluster is faster than that on a multi-GPU single workstation, and we also found that the GPU cluster system calculates faster than a vector supercomputer. In addition, our GPU cluster system allowed us to perform large-scale FDTD calculations because we were able to use over 100 GB of GPU memory.
Accelerating image reconstruction in dual-head PET system by GPU and symmetry properties.

PubMed

Chou, Cheng-Ying; Dong, Yun; Hung, Yukai; Kao, Yu-Jiun; Wang, Weichung; Kao, Chien-Min; Chen, Chin-Tu

2012-01-01

Positron emission tomography (PET) is an important imaging modality in both clinical usage and research studies. We have developed a compact high-sensitivity PET system that consists of two large-area panel PET detector heads, which produce more than 224 million lines of response and thus impose heavy computational demands. In this work, we employed a state-of-the-art graphics processing unit (GPU), NVIDIA Tesla C2070, to yield an efficient reconstruction process. Our approaches integrate the symmetry properties of the imaging system with features of the GPU architecture, including block/warp/thread assignments and effective memory usage, to accelerate the computations for ordered subset expectation maximization (OSEM) image reconstruction. The OSEM reconstruction algorithms were implemented in both CPU-based and GPU-based codes, and their computational performance was quantitatively analyzed and compared. The results showed that the GPU-accelerated scheme can drastically reduce the reconstruction time and thus can largely expand the applicability of the dual-head PET system.
MGUPGMA: A Fast UPGMA Algorithm With Multiple Graphics Processing Units Using NCCL

PubMed Central

Hua, Guan-Jie; Hung, Che-Lun; Lin, Chun-Yuan; Wu, Fu-Che; Chan, Yu-Wei; Tang, Chuan Yi

2017-01-01

A phylogenetic tree is a visual diagram of the relationship between a set of biological species. Scientists often use it to analyze many characteristics of the species. Distance-matrix methods, such as Unweighted Pair Group Method with Arithmetic Mean and Neighbor Joining, construct a phylogenetic tree by calculating pairwise genetic distances between taxa. These methods suffer from poor computational performance. Although several new methods with high-performance hardware and frameworks have been proposed, the issue still exists. In this work, a novel parallel Unweighted Pair Group Method with Arithmetic Mean approach on multiple Graphics Processing Units is proposed to construct a phylogenetic tree from an extremely large set of sequences. The experimental results show that the proposed approach on a DGX-1 server with 8 NVIDIA P100 graphics cards achieves approximately 3-fold to 7-fold speedup over the implementation of Unweighted Pair Group Method with Arithmetic Mean on a modern CPU and a single GPU, respectively. PMID:29051701
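The distance-matrix clustering named in this abstract can be sketched serially in a few lines. The Python below is a minimal textbook UPGMA, not the multi-GPU implementation of the record: it repeatedly merges the closest pair of clusters and recomputes distances as size-weighted averages. The function name `upgma`, the nested-tuple tree representation and the example distances are assumptions for illustration.

```python
def upgma(names, dist):
    """Build a UPGMA tree from a symmetric distance matrix (list of lists).

    Returns (tree, heights): the tree as nested tuples of names and the
    height at which each merge happened, in merge order.
    """
    n = len(names)
    clusters = {i: (names[i], 1) for i in range(n)}  # id -> (subtree, size)
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    heights = []
    while len(clusters) > 1:
        # Closest pair of current clusters; keys are always (smaller id, larger id).
        (a, b), d_ab = min(d.items(), key=lambda kv: kv[1])
        (tree_a, size_a), (tree_b, size_b) = clusters.pop(a), clusters.pop(b)
        del d[(a, b)]
        heights.append(d_ab / 2.0)
        # Distance from the merged cluster to every remaining cluster is the
        # size-weighted mean of the two old distances.
        for c in clusters:
            d_ac = d.pop((min(a, c), max(a, c)))
            d_bc = d.pop((min(b, c), max(b, c)))
            d[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    (tree, _size), = clusters.values()
    return tree, heights


if __name__ == "__main__":
    example = [
        [0, 2, 6, 10],
        [2, 0, 6, 10],
        [6, 6, 0, 10],
        [10, 10, 10, 0],
    ]
    print(upgma(["A", "B", "C", "D"], example))
```

The search for the minimum entry and the row update after each merge dominate the cost of this serial version, and they are the steps a parallel implementation would most plausibly distribute across devices.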
MAGI: a Node.js web service for fast microRNA-Seq analysis in a GPU infrastructure.

PubMed

Kim, Jihoon; Levy, Eric; Ferbrache, Alex; Stepanowsky, Petra; Farcas, Claudiu; Wang, Shuang; Brunner, Stefan; Bath, Tyler; Wu, Yuan; Ohno-Machado, Lucila

2014-10-01

MAGI is a web service for fast MicroRNA-Seq data analysis in a graphics processing unit (GPU) infrastructure. Using just a browser, users have access to results as web reports in just a few hours, a >600% end-to-end performance improvement over the state of the art. MAGI's salient features are (i) transfer of large input files in native FASTA with Qualities (FASTQ) format through drag-and-drop operations, (ii) rapid prediction of microRNA target genes leveraging parallel computing with GPU devices, (iii) all-in-one analytics with novel feature extraction, statistical test for differential expression and diagnostic plot generation for quality control and (iv) interactive visualization and exploration of results in web reports that are readily available for publication. MAGI relies on the Node.js JavaScript framework, along with NVIDIA CUDA C, PHP: Hypertext Preprocessor (PHP), Perl and R. It is freely available at http://magi.ucsd.edu. © The Author 2014. Published by Oxford University Press.
GPU-accelerated phase-field simulation of dendritic solidification in a binary alloy

NASA Astrophysics Data System (ADS)

Yamanaka, Akinori; Aoki, Takayuki; Ogawa, Satoi; Takaki, Tomohiro

2011-03-01

The phase-field simulation for dendritic solidification of a binary alloy has been accelerated by using a graphic processing unit (GPU). To perform the phase-field simulation of the alloy solidification on GPU, a program code was developed with compute unified device architecture (CUDA). In this paper, the implementation technique of the phase-field model on GPU is presented. Also, we evaluated the acceleration performance of the three-dimensional solidification simulation by using a single NVIDIA TESLA C1060 GPU and the developed program code. The results showed that the GPU calculation for 576³ computational grids achieved the performance of 170 GFLOPS by utilizing the shared memory as a software-managed cache. Furthermore, it can be demonstrated that the computation with the GPU is 100 times faster than that with a single CPU core. From the obtained results, we confirmed the feasibility of realizing a real-time full three-dimensional phase-field simulation of microstructure evolution on a personal desktop computer.
Multigroup Monte Carlo on GPUs: Comparison of history- and event-based algorithms

DOE PAGES

Hamilton, Steven P.; Slattery, Stuart R.; Evans, Thomas M.

2017-12-22

This article presents an investigation of the performance of different multigroup Monte Carlo transport algorithms on GPUs with a discussion of both history-based and event-based approaches. Several algorithmic improvements are introduced for both approaches. By modifying the history-based algorithm that is traditionally favored in CPU-based MC codes to occasionally filter out dead particles to reduce thread divergence, performance exceeds that of either the pure history-based or event-based approaches. The impacts of several algorithmic choices are discussed, including performance studies on Kepler and Pascal generation NVIDIA GPUs for fixed source and eigenvalue calculations. Single-device performance equivalent to 20–40 CPU cores on the K40 GPU and 60–80 CPU cores on the P100 GPU is achieved. In addition, nearly perfect multi-device parallel weak scaling is demonstrated on more than 16,000 nodes of the Titan supercomputer.
High-Performance 3D Compressive Sensing MRI Reconstruction Using Many-Core Architectures

PubMed Central

Kim, Daehyun; Trzasko, Joshua; Smelyanskiy, Mikhail; Haider, Clifton; Dubey, Pradeep; Manduca, Armando

2011-01-01

Compressive sensing (CS) describes how sparse signals can be accurately reconstructed from many fewer samples than required by the Nyquist criterion. Since MRI scan duration is proportional to the number of acquired samples, CS has been gaining significant attention in MRI. However, the computationally intensive nature of CS reconstructions has precluded their use in routine clinical practice. In this work, we investigate how different throughput-oriented architectures can benefit one CS algorithm and what levels of acceleration are feasible on different modern platforms. We demonstrate that a CUDA-based code running on an NVIDIA Tesla C2050 GPU can reconstruct a 256 × 160 × 80 volume from an 8-channel acquisition in 19 seconds, which is in itself a significant improvement over the state of the art. We then show that Intel's Knights Ferry can perform the same 3D MRI reconstruction in only 12 seconds, bringing CS methods even closer to clinical viability. PMID:21922017
Accelerating separable footprint (SF) forward and back projection on GPU

NASA Astrophysics Data System (ADS)

Xie, Xiaobin; McGaffin, Madison G.; Long, Yong; Fessler, Jeffrey A.; Wen, Minhua; Lin, James

2017-03-01

Statistical image reconstruction (SIR) methods for X-ray CT can improve image quality and reduce radiation dosages over conventional reconstruction methods, such as filtered back projection (FBP). However, SIR methods require much longer computation time. The separable footprint (SF) forward and back projection technique simplifies the calculation of intersecting volumes of image voxels and finite-size beams in a way that is both accurate and efficient for parallel implementation. We propose a new method to accelerate the SF forward and back projection on GPU with NVIDIA's CUDA environment. For the forward projection, we parallelize over all detector cells. For the back projection, we parallelize over all 3D image voxels. The simulation results show that the proposed method is faster than the acceleration method of the SF projectors proposed by Wu and Fessler [13]. We further accelerate the proposed method using multiple GPUs. The results show that the computation time is reduced approximately in proportion to the number of GPUs.
Algorithms for GPU-based molecular dynamics simulations of complex fluids: Applications to water, mixtures, and liquid crystals.

PubMed

Kazachenko, Sergey; Giovinazzo, Mark; Hall, Kyle Wm; Cann, Natalie M

2015-09-15

A custom code for molecular dynamics simulations has been designed to run on CUDA-enabled NVIDIA graphics processing units (GPUs). The double-precision code simulates multicomponent fluids, with intramolecular and intermolecular forces, coarse-grained and atomistic models, holonomic constraints, Nosé-Hoover thermostats, and the generation of distribution functions. Algorithms to compute Lennard-Jones and Gay-Berne interactions, and the electrostatic force using Ewald summations, are discussed. A neighbor list is introduced to improve scaling with respect to system size. Three test systems are examined: SPC/E water; an n-hexane/2-propanol mixture; and a liquid crystal mesogen, 2-(4-butyloxyphenyl)-5-octyloxypyrimidine. Code performance is analyzed for each system. With one GPU, a 33-119 fold increase in performance is achieved compared with the serial code while the use of two GPUs leads to a 69-287 fold improvement and three GPUs yield a 101-377 fold speedup. © 2015 Wiley Periodicals, Inc.
High-channel-count, high-density microelectrode array for closed-loop investigation of neuronal networks.

PubMed

Tsai, David; John, Esha; Chari, Tarun; Yuste, Rafael; Shepard, Kenneth

2015-01-01

We present a system for large-scale electrophysiological recording and stimulation of neural tissue with a planar topology. The recording system has 65,536 electrodes arranged in a 256 × 256 grid, with 25.5 μm pitch, covering an area of approximately 42.6 mm². The recording chain has 8.66 μV rms input-referred noise over a 100 Hz to 10 kHz bandwidth while providing up to 66 dB of voltage gain. When recording from all electrodes in the array, it is capable of 10-kHz sampling per electrode. All electrodes can also perform patterned electrical microstimulation. The system produces ~ 1 GB/s of data when recording from the full array. To handle, store, and perform nearly real-time analyses of this large data stream, we developed a framework based around Xilinx FPGAs, Intel x86 CPUs and the NVIDIA Streaming Multiprocessors to interface with the electrode array.
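The figures quoted in this abstract can be cross-checked with simple arithmetic. The Python sketch below recomputes the array area from the grid size and pitch, and estimates the raw data rate from the electrode count and sampling rate. The sample width is not stated in the abstract, so the 12 bits per sample used here is purely an assumption chosen for illustration, as are the function names.

```python
def array_area_mm2(n_side=256, pitch_um=25.5):
    """Area of a square n_side x n_side electrode grid with the given pitch."""
    side_mm = n_side * pitch_um / 1000.0
    return side_mm ** 2


def data_rate_bytes_per_s(n_electrodes=65536, fs_hz=10_000, bits_per_sample=12):
    """Raw data rate; bits_per_sample is an assumed value, not from the source."""
    return n_electrodes * fs_hz * bits_per_sample / 8


if __name__ == "__main__":
    print(f"area: {array_area_mm2():.1f} mm^2")
    print(f"rate: {data_rate_bytes_per_s() / 1e9:.2f} GB/s")
```

Under that assumed sample width the estimate comes to a little under 1 GB/s, which is at least consistent with the approximate rate reported; a different sample width would scale the result proportionally.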
Real-time depth camera tracking with geometrically stable weight algorithm

NASA Astrophysics Data System (ADS)

Fu, Xingyin; Zhu, Feng; Qi, Feng; Wang, Mingming

2017-03-01

We present an approach for real-time camera tracking with a depth stream. Existing methods are prone to drift in scenes without sufficient geometric information. First, we propose a new weight method for an iterative closest point algorithm commonly used in real-time dense mapping and tracking systems. By detecting uncertainty in pose and increasing the weight of points that constrain unstable transformations, our system achieves accurate and robust trajectory estimation results. Our pipeline can be fully parallelized on the GPU and incorporated seamlessly into current real-time depth camera tracking systems. Second, we compare the state-of-the-art weight algorithms and propose a weight degradation algorithm according to the measurement characteristics of a consumer depth camera. Third, we use Nvidia Kepler Shuffle instructions during warp and block reduction to improve the efficiency of our system. Results on the public TUM RGB-D database benchmark demonstrate that our camera tracking system achieves state-of-the-art results both in accuracy and efficiency.
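The warp and block reduction mentioned in this abstract follows a well-known pattern that can be emulated on the CPU. The Python sketch below mimics a shuffle-down exchange across a 32-lane warp and uses it for a tree reduction, then combines per-warp results into a block sum. It is an emulation for illustration only, not the authors' kernel; `shfl_down`, `warp_reduce_sum` and `block_reduce_sum` are hypothetical helper names, and the out-of-range behaviour (a lane keeps its own value) follows the documented behaviour of the CUDA shuffle-down intrinsic.

```python
WARP_SIZE = 32


def shfl_down(values, delta):
    """Emulate shuffle-down: lane i reads lane i + delta, or keeps its own value
    when the source lane would fall outside the warp."""
    n = len(values)
    return [values[i + delta] if i + delta < n else values[i] for i in range(n)]


def warp_reduce_sum(values):
    """Tree reduction over one warp; lane 0 ends up holding the full sum."""
    assert len(values) == WARP_SIZE
    vals = list(values)
    offset = WARP_SIZE // 2
    while offset > 0:
        shifted = shfl_down(vals, offset)
        vals = [v + s for v, s in zip(vals, shifted)]
        offset //= 2
    return vals[0]


def block_reduce_sum(values):
    """Reduce each warp, then reduce the per-warp sums with one more warp pass."""
    assert len(values) % WARP_SIZE == 0
    assert len(values) <= WARP_SIZE * WARP_SIZE
    warp_sums = [warp_reduce_sum(values[i:i + WARP_SIZE])
                 for i in range(0, len(values), WARP_SIZE)]
    warp_sums += [0] * (WARP_SIZE - len(warp_sums))  # pad to a full warp
    return warp_reduce_sum(warp_sums)


if __name__ == "__main__":
    print(block_reduce_sum(list(range(256))))
```

On real hardware the appeal of this pattern is that the five exchange steps within a warp need no shared memory and no block-wide barrier; only the hand-off of per-warp sums does.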
GNuMe: A Galerkin-based Numerical Modeling Environment for modeling geophysical fluid dynamics applications ranging from the Atmosphere to the Ocean

NASA Astrophysics Data System (ADS)

Giraldo, Francis; Abdi, Daniel; Kopera, Michal

2017-04-01

We have built a Galerkin-based Numerical Modeling Environment (GNuMe) for nonhydrostatic atmospheric and ocean processes. GNuMe uses continuous Galerkin and Discontinuous Galerkin (CG/DG) discretizations as well as non-conforming adaptive mesh refinement (AMR), along with advanced time-integration methods that exploit both CG/DG and AMR capabilities. GNuMe currently solves the compressible and incompressible Navier-Stokes equations and the shallow water equations (with wetting and drying), and work is underway to include other types of equations. Moreover, GNuMe can run in both 2D and 3D modes on accelerator hardware such as Nvidia GPUs and Intel KNL, and on standard X86 cores. In this talk, we shall present representative solutions obtained with GNuMe and will discuss where we think such a modeling framework could fit within standard Earth Systems Models. For further information on GNuMe please visit: http://frankgiraldo.wixsite.com/mysite/gnume.
A versatile model for soft patchy particles with various patch arrangements.

PubMed

Li, Zhan-Wei; Zhu, You-Liang; Lu, Zhong-Yuan; Sun, Zhao-Yan

2016-01-21

We propose a simple and general mesoscale soft patchy particle model, which can felicitously describe the deformable and surface-anisotropic characteristics of soft patchy particles. This model can be used in dynamics simulations to investigate the aggregation behavior and mechanism of various types of soft patchy particles with tunable number, size, direction, and geometrical arrangement of the patches. To improve the computational efficiency of this mesoscale model in dynamics simulations, we give the simulation algorithm that fits the compute unified device architecture (CUDA) framework of NVIDIA graphics processing units (GPUs). The validation of the model and the performance of the simulations using GPUs are demonstrated by simulating several benchmark systems of soft patchy particles with 1 to 4 patches in a regular geometrical arrangement. Because of its simplicity and computational efficiency, the soft patchy particle model will provide a powerful tool to investigate the aggregation behavior of soft patchy particles, such as patchy micelles, patchy microgels, and patchy dendrimers, over larger spatial and temporal scales.

Efficient Acceleration of the Pair-HMMs Forward Algorithm for GATK HaplotypeCaller on Graphics Processing Units.

PubMed

Ren, Shanshan; Bertels, Koen; Al-Ars, Zaid

2018-01-01

GATK HaplotypeCaller (HC) is a popular variant caller, which is widely used to identify variants in complex genomes. However, due to its high variants detection accuracy, it suffers from long execution time. In GATK HC, the pair-HMMs forward algorithm accounts for a large percentage of the total execution time. This article proposes to accelerate the pair-HMMs forward algorithm on graphics processing units (GPUs) to improve the performance of GATK HC. This article presents several GPU-based implementations of the pair-HMMs forward algorithm. It also analyzes the performance bottlenecks of the implementations on an NVIDIA Tesla K40 card with various data sets. Based on these results and the characteristics of GATK HC, we are able to identify the GPU-based implementations with the highest performance for the various analyzed data sets. Experimental results show that the GPU-based implementations of the pair-HMMs forward algorithm achieve a speedup of up to 5.47× over existing GPU-based implementations.
Design and implementation of a hybrid MPI-CUDA model for the Smith-Waterman algorithm.

PubMed

Khaled, Heba; Faheem, Hossam El Deen Mostafa; El Gohary, Rania

2015-01-01

This paper provides a novel hybrid model for solving the multiple pair-wise sequence alignment problem combining message passing interface and CUDA, the parallel computing platform and programming model invented by NVIDIA. The proposed model targets homogeneous cluster nodes equipped with similar Graphical Processing Unit (GPU) cards. The model consists of the Master Node Dispatcher (MND) and the Worker GPU Nodes (WGN). The MND distributes the workload among the cluster working nodes and then aggregates the results. The WGN performs the multiple pair-wise sequence alignments using the Smith-Waterman algorithm. We also propose a modified implementation of the Smith-Waterman algorithm based on computing the alignment matrices row-wise. The experimental results demonstrate a considerable reduction in the running time as the number of working GPU nodes increases. The proposed model achieved a performance of about 12 Giga cell updates per second when tested against the SWISS-PROT protein knowledge base running on four nodes.
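As a rough illustration of computing the Smith-Waterman alignment matrix row-wise, the Python sketch below fills one row at a time and keeps only the previous row in memory. It is a minimal sketch, not the paper's implementation: the function name `sw_score_rowwise`, the linear gap penalty, and the scoring values are all assumptions chosen for the example.

```python
def sw_score_rowwise(query, target, match=2, mismatch=-1, gap=-1):
    """Return the best local-alignment score of two sequences.

    The matrix is filled row by row and only the previous row is stored,
    so memory use is proportional to len(target). The scoring scheme
    (match, mismatch, linear gap) is an assumption for this sketch.
    """
    prev = [0] * (len(target) + 1)
    best = 0
    for q in query:
        curr = [0] * (len(target) + 1)
        for j, t in enumerate(target, start=1):
            diag = prev[j - 1] + (match if q == t else mismatch)
            up = prev[j] + gap        # gap in the target
            left = curr[j - 1] + gap  # gap in the query
            curr[j] = max(0, diag, up, left)
            best = max(best, curr[j])
        prev = curr
    return best
```

Each call updates len(query) × len(target) cells, which is the quantity counted by the "cell updates per second" metric quoted in the abstract.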
Implementation of metal-friendly EAM/FS-type semi-empirical potentials in HOOMD-blue: A GPU-accelerated molecular dynamics software

NASA Astrophysics Data System (ADS)

Yang, Lin; Zhang, Feng; Wang, Cai-Zhuang; Ho, Kai-Ming; Travesset, Alex

2018-04-01

We present an implementation of EAM and FS interatomic potentials, which are widely used in simulating metallic systems, in HOOMD-blue, a software designed to perform classical molecular dynamics simulations using GPU accelerations. We first discuss the details of our implementation and then report extensive benchmark tests. We demonstrate that single-precision floating point operations efficiently implemented on GPUs can produce sufficient accuracy when compared against double-precision codes, as demonstrated in test simulations of calculations of the glass-transition temperature of Cu64.5Zr35.5, and pair correlation function g(r) of liquid Ni3Al. Our code scales well with the size of the simulating system on NVIDIA Tesla M40 and P100 GPUs. Compared with another popular software LAMMPS running on 32 cores of AMD Opteron 6220 processors, the GPU/CPU performance ratio can reach as high as 4.6. The source code can be accessed through the HOOMD-blue web page for free by any interested user.
GPU-Powered Coherent Beamforming

NASA Astrophysics Data System (ADS)

Magro, A.; Adami, K. Zarb; Hickish, J.

2015-03-01

Graphics processing unit (GPU)-based beamforming is a relatively unexplored area in radio astronomy, possibly due to the assumption that any such system will be severely limited by the PCIe bandwidth required to transfer data to the GPU. We have developed a CUDA-based GPU implementation of a coherent beamformer, specifically designed and optimized for deployment at the BEST-2 array, which can generate an arbitrary number of synthesized beams for a wide range of parameters. It achieves ~1.3 TFLOPS on an NVIDIA Tesla K20, approximately 10x faster than an optimized, multithreaded CPU implementation. This kernel has been integrated into two real-time, GPU-based time-domain software pipelines deployed at the BEST-2 array in Medicina: a standalone beamforming pipeline and a transient detection pipeline. We present performance benchmarks for the beamforming kernel and for the transient detection pipeline with beamforming capabilities, as well as results of test observations.
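The core operation of a coherent beamformer, a phase-weighted sum over antennas, can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes a narrowband signal and a uniform linear array, which is not necessarily the BEST-2 geometry, and the function names are hypothetical.

```python
import cmath
import math


def plane_wave(n_antennas, spacing_wavelengths, angle_rad):
    """One complex sample per antenna for a unit plane wave from angle_rad.

    Assumes a uniform linear array with element spacing given in wavelengths.
    """
    phase_step = 2.0 * math.pi * spacing_wavelengths * math.sin(angle_rad)
    return [cmath.exp(1j * phase_step * k) for k in range(n_antennas)]


def steering_weights(n_antennas, spacing_wavelengths, angle_rad):
    """Phase weights that undo the per-antenna delay for direction angle_rad."""
    phase_step = 2.0 * math.pi * spacing_wavelengths * math.sin(angle_rad)
    return [cmath.exp(-1j * phase_step * k) for k in range(n_antennas)]


def beamform(samples, weights):
    """Coherent (phase-aligned) sum of one time sample per antenna."""
    return sum(w * s for w, s in zip(weights, samples))
```

Forming many beams amounts to repeating `beamform` with a different weight vector per beam, for every time sample, which is why the workload maps naturally onto thousands of GPU threads.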
Multigroup Monte Carlo on GPUs: Comparison of history- and event-based algorithms

DOE Office of Scientific and Technical Information (OSTI.GOV)

Hamilton, Steven P.; Slattery, Stuart R.; Evans, Thomas M.

This article presents an investigation of the performance of different multigroup Monte Carlo transport algorithms on GPUs with a discussion of both history-based and event-based approaches. Several algorithmic improvements are introduced for both approaches. By modifying the history-based algorithm that is traditionally favored in CPU-based MC codes to occasionally filter out dead particles to reduce thread divergence, performance exceeds that of either the pure history-based or event-based approaches. The impacts of several algorithmic choices are discussed, including performance studies on Kepler and Pascal generation NVIDIA GPUs for fixed source and eigenvalue calculations. Single-device performance equivalent to 20–40 CPU cores on the K40 GPU and 60–80 CPU cores on the P100 GPU is achieved. In addition, nearly perfect multi-device parallel weak scaling is demonstrated on more than 16,000 nodes of the Titan supercomputer.
A GPU OpenCL based cross-platform Monte Carlo dose calculation engine (goMC)

NASA Astrophysics Data System (ADS)

Tian, Zhen; Shi, Feng; Folkerts, Michael; Qin, Nan; Jiang, Steve B.; Jia, Xun

2015-09-01

Monte Carlo (MC) simulation has been recognized as the most accurate dose calculation method for radiotherapy. However, the extremely long computation time impedes its clinical application. Recently, a lot of effort has been made to realize fast MC dose calculation on graphics processing units (GPUs). However, most of the GPU-based MC dose engines have been developed under NVidia’s CUDA environment. This limits the code portability to other platforms, hindering the introduction of GPU-based MC simulations to clinical practice. The objective of this paper is to develop a GPU OpenCL based cross-platform MC dose engine named goMC with coupled photon-electron simulation for external photon and electron radiotherapy in the MeV energy range. Compared to our previously developed GPU-based MC code named gDPM (Jia et al 2012 Phys. Med. Biol. 57 7783-97), goMC has two major differences. First, it was developed under the OpenCL environment for high code portability and hence could be run not only on different GPU cards but also on CPU platforms. Second, we adopted the electron transport model used in the EGSnrc MC package and PENELOPE’s random hinge method in our new dose engine, instead of the dose planning method employed in gDPM. Dose distributions were calculated for a 15 MeV electron beam and a 6 MV photon beam in a homogeneous water phantom, a water-bone-lung-water slab phantom and a half-slab phantom. Satisfactory agreement between the two MC dose engines goMC and gDPM was observed in all cases. The average dose differences in the regions that received a dose higher than 10% of the maximum dose were 0.48-0.53% for the electron beam cases and 0.15-0.17% for the photon beam cases. In terms of efficiency, goMC was ~4-16% slower than gDPM when running on the same NVidia TITAN card for all the cases we tested, due to both the different electron transport models and the different development environments. The code portability of our new dose engine goMC was validated by
Accelerating numerical solution of stochastic differential equations with CUDA

NASA Astrophysics Data System (ADS)

Januszewski, M.; Kostur, M.

2010-01-01

Numerical integration of stochastic differential equations is commonly used in many branches of science. In this paper we present how to accelerate this kind of numerical calculations with popular NVIDIA Graphics Processing Units using the CUDA programming environment. We address general aspects of numerical programming on stream processors and illustrate them by two examples: the noisy phase dynamics in a Josephson junction and the noisy Kuramoto model. In presented cases the measured speedup can be as high as 675× compared to a typical CPU, which corresponds to several billion integration steps per second. This means that calculations which took weeks can now be completed in less than one hour. This brings stochastic simulation to a completely new level, opening for research a whole new range of problems which can now be solved interactively.

Program summary
Program title: SDE
Catalogue identifier: AEFG_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEFG_v1_0.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: GNU GPL v3
No. of lines in distributed program, including test data, etc.: 978
No. of bytes in distributed program, including test data, etc.: 5905
Distribution format: tar.gz
Programming language: CUDA C
Computer: any system with a CUDA-compatible GPU
Operating system: Linux
RAM: 64 MB of GPU memory
Classification: 4.3
External routines: The program requires the NVIDIA CUDA Toolkit Version 2.0 or newer and the GNU Scientific Library v1.0 or newer. Optionally gnuplot is recommended for quick visualization of the results.
Nature of problem: Direct numerical integration of stochastic differential equations is a computationally intensive problem, due to the necessity of calculating multiple independent realizations of the system. We exploit the inherent parallelism of this problem and perform the calculations on GPUs using the CUDA programming environment. The GPU's ability to execute
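The parallelism described in this record comes from integrating many independent realizations of the same equation. A minimal Python sketch of that structure is given below, using the Euler-Maruyama scheme for an equation of the form dx = f(x) dt + sqrt(2D) dW. The scheme, the function names, and the noise convention are assumptions made for illustration; the record does not state which integrator the distributed program uses.

```python
import math
import random


def euler_maruyama(drift, noise_strength, x0, dt, n_steps, rng):
    """Integrate dx = drift(x) dt + sqrt(2 D) dW for a single realization."""
    x = x0
    amp = math.sqrt(2.0 * noise_strength * dt)  # std. dev. of one noise kick
    for _ in range(n_steps):
        x += drift(x) * dt + amp * rng.gauss(0.0, 1.0)
    return x


def ensemble_mean(drift, noise_strength, x0, dt, n_steps, n_paths, seed=0):
    """Average the final state over independent realizations.

    On a GPU each realization would typically map to one thread;
    here the paths simply run one after another.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        total += euler_maruyama(drift, noise_strength, x0, dt, n_steps, rng)
    return total / n_paths
```

Because the paths never exchange data, the loop in `ensemble_mean` can be distributed across threads without synchronization, apart from the final average.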
GAMUT: GPU accelerated microRNA analysis to uncover target genes through CUDA-miRanda

PubMed Central

2014-01-01

Background: Non-coding sequences such as microRNAs have important roles in disease processes. Computational microRNA target identification (CMTI) is becoming increasingly important since traditional experimental methods for target identification pose many difficulties. These methods are time-consuming, costly, and often need guidance from computational methods to narrow down candidate genes anyway. However, most CMTI methods are computationally demanding, since they need to handle not only several million query microRNA and reference RNA pairs, but also several million nucleotide comparisons within each given pair. Thus, the need to perform microRNA identification at such large scale has increased the demand for parallel computing. Methods: Although most CMTI programs (e.g., the miRanda algorithm) are based on a modified Smith-Waterman (SW) algorithm, the existing parallel SW implementations (e.g., CUDASW++ 2.0/3.0, SWIPE) are unable to meet this demand in CMTI tasks. We present CUDA-miRanda, a fast microRNA target identification algorithm that takes advantage of massively parallel computing on Graphics Processing Units (GPU) using NVIDIA's Compute Unified Device Architecture (CUDA). CUDA-miRanda specifically focuses on the local alignment of short (i.e., ≤ 32 nucleotides) sequences against longer reference sequences (e.g., 20K nucleotides). Moreover, the proposed algorithm is able to report multiple alignments (up to 191 top scores) and the corresponding traceback sequences for any given (query sequence, reference sequence) pair. Results: Speeds over 5.36 Giga Cell Updates Per Second (GCUPs) are achieved on a server with 4 NVIDIA Tesla M2090 GPUs. Compared to the original miRanda algorithm, which is evaluated on an Intel Xeon E5620@2.4 GHz CPU, the experimental results show up to 166 times performance gains in terms of execution time. In addition, we have verified that the exact same targets were predicted in both CUDA-miRanda and the original mi
4D megahertz optical coherence tomography (OCT): imaging and live display beyond 1 gigavoxel/sec (Conference Presentation)

NASA Astrophysics Data System (ADS)

Huber, Robert A.; Draxinger, Wolfgang; Wieser, Wolfgang; Kolb, Jan Philip; Pfeiffer, Tom; Karpf, Sebastian N.; Eibl, Matthias; Klein, Thomas

2016-03-01

Over the last 20 years, optical coherence tomography (OCT) has become a valuable diagnostic tool in ophthalmology, with several tens of thousands of devices sold to date. Other applications, like intravascular OCT in cardiology and gastro-intestinal imaging, will follow. OCT provides 3-dimensional image data with microscopic resolution of biological tissue in vivo. In most applications, off-line processing of the acquired OCT data is sufficient. However, for OCT applications like OCT-aided surgical microscopes, for functional OCT imaging of tissue after a stimulus, or for interactive endoscopy, an OCT engine capable of acquiring, processing and displaying large and high quality 3D OCT data sets at video rate is highly desired. We developed such a prototype OCT engine and demonstrate live OCT with 25 volumes per second at a size of 320x320x320 pixels. The computer processing load of more than 1.5 TFLOPS was handled by a GTX 690 graphics processing unit with more than 3000 stream processors operating in parallel. In the talk, we will describe the optics and electronics hardware as well as the software of the system in detail and analyze current limitations. The talk also focuses on new OCT applications, where such a system improves diagnosis and monitoring of medical procedures. The additional acquisition of hyperspectral stimulated Raman signals with the system will be discussed.
Development and Validation of a Liquid Chromatography-Tandem Mass Spectrometry Method Coupled with Dispersive Solid-Phase Extraction for Simultaneous Quantification of Eight Paralytic Shellfish Poisoning Toxins in Shellfish

PubMed Central

Yang, Xianli; Zhou, Lei; Tan, Yanglan; Shi, Xizhi; Zhao, Zhiyong; Nie, Dongxia; Zhou, Changyan; Liu, Hong

2017-01-01

In this study, a high-performance liquid chromatography-tandem mass spectrometry (HPLC-MS/MS) method was developed for simultaneous determination of eight paralytic shellfish poisoning (PSP) toxins, including saxitoxin (STX), neosaxitoxin (NEO), gonyautoxins (GTX1–4) and the N-sulfo carbamoyl toxins C1 and C2, in sea shellfish. The samples were extracted by acetonitrile/water (80:20, v/v) with 0.1% formic acid and purified by dispersive solid-phase extraction (dSPE) with C18 silica and acidic alumina. Qualitative and quantitative detection of the target toxins was conducted in multiple reaction monitoring (MRM) mode using positive electrospray ionization (ESI) after chromatographic separation on a TSK-gel Amide-80 HILIC column with water and acetonitrile. Matrix-matched calibration was used to compensate for matrix effects. The established method was further validated by determining the linearity (R² ≥ 0.9900), average recovery (81.52–116.50%), sensitivity (limits of detection (LODs): 0.33–5.52 μg·kg−1; limits of quantitation (LOQs): 1.32–11.29 μg·kg−1) and precision (relative standard deviation (RSD) ≤ 19.10%). The application of this proposed approach to thirty shellfish samples proved its desirable performance and sufficient capability for simultaneous determination of multiclass PSP toxins in seafood. PMID:28661471
Paralytic Toxins Accumulation and Tissue Expression of α-Amylase and Lipase Genes in the Pacific Oyster Crassostrea gigas Fed with the Neurotoxic Dinoflagellate Alexandrium catenella

PubMed Central

Rolland, Jean-Luc; Pelletier, Kevin; Masseret, Estelle; Rieuvilleneuve, Fabien; Savar, Veronique; Santini, Adrien; Amzil, Zouher; Laabir, Mohamed

2012-01-01

The Pacific oyster Crassostrea gigas was experimentally exposed to the neurotoxic Alexandrium catenella and a non-producer of PSTs, Alexandrium tamarense (control algae), at concentrations corresponding to those observed during the blooming period. At fixed time intervals, from 0 to 48 h, we determined the clearance rate, the total filtered cells, the composition of the fecal ribbons, the profile of the PSP toxins and the variation of the expression of two α-amylase and triacylglycerol lipase precursor (TLP) genes through semi-quantitative RT-PCR. The results showed a significant decrease of the clearance rate of C. gigas fed with both Alexandrium species. However, from 29 to 48 h, the clearance rate and cell filtration activity increased only in oysters fed with A. tamarense. The toxin concentrations in the digestive gland rose above the sanitary threshold in less than 48 h of exposure and GTX6, a compound absent in A. catenella cells, accumulated. The α-amylase B gene expression level increased significantly in the time interval from 6 to 48 h in the digestive gland of oysters fed with A. tamarense, whereas the TLP gene transcript was significantly up-regulated in the digestive gland of oysters fed with the neurotoxic A. catenella. Altogether, these results suggest that the digestion capacity could be affected by PSP toxins. PMID:23203275
The role of moderate-to-vigorous physical activity in mediating the relationship between central adiposity and immunometabolic profile in postmenopausal women.

PubMed

Diniz, Tiego A; Rossi, Fabricio E; Silveira, Loreana S; Neves, Lucas Melo; Fortaleza, Ana Claudia de Souza; Christofaro, Diego G D; Lira, Fabio S; Freitas-Junior, Ismael F

2017-01-01

To analyze the role of moderate-to-vigorous physical activity (MVPA) in mediating the relationship between central adiposity and immune and metabolic profile in postmenopausal women. Cross-sectional study comprising 49 postmenopausal women (aged 59.26 ± 8.32 years) without regular physical exercise practice. Body composition was measured by dual-energy X-ray absorptiometry. Fasting blood samples were collected for assessment of nonesterified fatty acids, tumor necrosis factor-α (TNF-α), interleukin-6 (IL-6), adiponectin, insulin and estimation of insulin resistance (HOMA-IR). Physical activity level was assessed with an accelerometer (Actigraph GTX3x) and reported as a percentage of time spent in sedentary behavior and MVPA. All analyses were performed using the software SPSS 17.0, with a significance level set at 5%. Sedentary women had a positive relationship between trunk fat and IL-6 (rho = 0.471; p = 0.020), and trunk fat and HOMA-IR (rho = 0.418; p = 0.042). Adiponectin and fat mass (%) were only positively correlated in physically active women (rho = 0.441; p = 0.027). Physically active women with normal trunk fat values presented a 14.7% lower chance of having increased HOMA-IR levels (β [95%CI] = 0.147 [0.027; 0.811]). The practice of sufficient levels of MVPA was a protective factor against immunometabolic disorders in postmenopausal women.
Finite Element Study on Continuous Rotating versus Reciprocating Nickel-Titanium Instruments.

PubMed

El-Anwar, Mohamed I; Yousief, Salah A; Kataia, Engy M; El-Wahab, Tarek M Abd

2016-01-01

In the present study, GTX and ProTaper as continuous rotating endodontic files were numerically compared with WaveOne reciprocating file using finite element analysis, aiming at having a low cost, accurate/trustworthy comparison as well as finding out the effect of instrument design and manufacturing material on its lifespan. Two 3D finite element models were especially prepared for this comparison. Commercial engineering CAD/CAM package was used to model full detailed flute geometries of the instruments. Multi-linear materials were defined in analysis by using real strain-stress data of NiTi and M-Wire. Non-linear static analysis was performed to simulate the instrument inside root canal at a 45° angle in the apical portion and subjected to 0.3 N.cm torsion. The three simulations in this study showed that M-Wire is slightly more resistant to failure than conventional NiTi. On the other hand, both materials are fairly similar in case of severe locking conditions. For the same instrument geometry, M-Wire instruments may have longer lifespan than the conventional NiTi ones. In case of severe locking conditions both materials will fail similarly. Larger cross sectional area (function of instrument taper) resisted better to failure than the smaller ones, while the cross sectional shape and its cutting angles could affect instrument cutting efficiency.
Hybrid parallel computing architecture for multiview phase shifting

NASA Astrophysics Data System (ADS)

Zhong, Kai; Li, Zhongwei; Zhou, Xiaohui; Shi, Yusheng; Wang, Congjun

2014-11-01

The multiview phase-shifting method shows its powerful capability in achieving high resolution three-dimensional (3-D) shape measurement. Unfortunately, this ability results in very high computation costs and 3-D computations have to be processed offline. To realize real-time 3-D shape measurement, a hybrid parallel computing architecture is proposed for multiview phase shifting. In this architecture, the central processing unit can co-operate with the graphics processing unit (GPU) to achieve hybrid parallel computing. The high computation cost procedures, including lens distortion rectification, phase computation, correspondence, and 3-D reconstruction, are implemented on the GPU, and a three-layer kernel function model is designed to simultaneously realize coarse-grained and fine-grained parallel computing. Experimental results verify that the developed system can perform 50 fps (frames per second) real-time 3-D measurement with 260 K 3-D points per frame. A speedup of up to 180 times is obtained for the proposed technique using a NVIDIA GT560Ti graphics card rather than a sequential C implementation on a 3.4 GHz Intel Core i7 3770.
GPU-based Green’s function simulations of shear waves generated by an applied acoustic radiation force in elastic and viscoelastic models

NASA Astrophysics Data System (ADS)

Yang, Yiqun; Urban, Matthew W.; McGough, Robert J.

2018-05-01

Shear wave calculations induced by an acoustic radiation force are very time-consuming on desktop computers, and high-performance graphics processing units (GPUs) achieve dramatic reductions in the computation time for these simulations. The acoustic radiation force is calculated using the fast near field method and the angular spectrum approach, and then the shear waves are calculated in parallel with Green’s functions on a GPU. This combination enables rapid evaluation of shear waves for push beams with different spatial samplings and for apertures with different f/#. Relative to shear wave simulations that evaluate the same algorithm on an Intel i7 desktop computer, a high performance nVidia GPU reduces the time required for these calculations by a factor of 45 and 700 when applied to elastic and viscoelastic shear wave simulation models, respectively. These GPU-accelerated simulations are also compared to measurements in different viscoelastic phantoms, and the results are similar. For parametric evaluations and for comparisons with measured shear wave data, shear wave simulations with the Green’s function approach are ideally suited for high-performance GPUs.
Fast simulation of Proton Induced X-Ray Emission Tomography using CUDA

NASA Astrophysics Data System (ADS)

Beasley, D. G.; Marques, A. C.; Alves, L. C.; da Silva, R. C.

2013-07-01

A new 3D Proton Induced X-Ray Emission Tomography (PIXE-T) and Scanning Transmission Ion Microscopy Tomography (STIM-T) simulation software has been developed in Java and uses NVIDIA™ Compute Unified Device Architecture (CUDA) to calculate the X-ray attenuation for large detector areas. A challenge with PIXE-T is to get sufficient counts while retaining a small beam spot size. Therefore a high geometric efficiency is required. However, as the detector solid angle increases, the calculations required for accurate reconstruction of the data increase substantially. To overcome this limitation, the CUDA parallel computing platform was used, which enables general purpose programming of NVIDIA graphics processing units (GPUs) to perform computations traditionally handled by the central processing unit (CPU). For simulation performance evaluation, the results of a CPU- and a CUDA-based simulation of a phantom are presented. Furthermore, a comparison with the simulation code in the PIXE-Tomography reconstruction software DISRA (A. Sakellariou, D.N. Jamieson, G.J.F. Legge, 2001) is also shown. Compared to a CPU implementation, the CUDA based simulation is approximately 30× faster.
Power and Performance Trade-offs for Space Time Adaptive Processing

DOE Office of Scientific and Technical Information (OSTI.GOV)

Gawande, Nitin A.; Manzano Franco, Joseph B.; Tumeo, Antonino

Computational efficiency – performance relative to power or energy – is one of the most important concerns when designing RADAR processing systems. This paper analyzes power and performance trade-offs for a typical Space Time Adaptive Processing (STAP) application. We study STAP implementations for CUDA and OpenMP on two computationally efficient architectures, Intel Haswell Core i7-4770TE and NVIDIA Kayla with a GK208 GPU. We analyze the power and performance of STAP’s computationally intensive kernels across the two hardware testbeds. We also show the impact and trade-offs of GPU optimization techniques. We show that data parallelism can be exploited for efficient implementation on the Haswell CPU architecture. The GPU architecture is able to process large data sets without an increase in power requirement. The use of shared memory has a significant impact on the power requirement for the GPU. A balance between the use of shared memory and main memory access leads to improved performance in a typical STAP application.
Analysis of performance improvements for host and GPU interface of the APENet+ 3D Torus network

NASA Astrophysics Data System (ADS)

Ammendola, R.; Biagioni, A.; Frezza, O.; Lo Cicero, F.; Lonardo, A.; Paolucci, P. S.; Rossetti, D.; Simula, F.; Tosoratto, L.; Vicini, P.

2014-06-01

APEnet+ is an INFN (Italian Institute for Nuclear Physics) project aiming to develop a custom 3-dimensional torus interconnect network optimized for hybrid CPU-GPU clusters dedicated to high-performance scientific computing. The APEnet+ interconnect fabric is built on an FPGA-based PCI Express board with 6 bi-directional off-board links showing 34 Gbps of raw bandwidth per direction, and leverages the peer-to-peer capabilities of Fermi- and Kepler-class NVIDIA GPUs to obtain real zero-copy, GPU-to-GPU low-latency transfers. The minimization of APEnet+ transfer latency is achieved through the adoption of an RDMA protocol implemented in the FPGA with specialized hardware blocks tightly coupled with an embedded microprocessor. This architecture provides a high-performance, low-latency offload engine for both the transmit and receive sides of data transactions: preliminary results are encouraging, showing a 50% bandwidth increase for large packet size transfers. In this paper we describe the APEnet+ architecture, detailing the hardware implementation, and discuss the impact of such RDMA specialized hardware on host interface latency and bandwidth.
Real-time blood flow visualization using the graphics processing unit

NASA Astrophysics Data System (ADS)

Yang, Owen; Cuccia, David; Choi, Bernard

2011-01-01

Laser speckle imaging (LSI) is a technique in which coherent light incident on a surface produces a reflected speckle pattern that is related to the underlying movement of optical scatterers, such as red blood cells, indicating blood flow. Image-processing algorithms can be applied to produce speckle flow index (SFI) maps of relative blood flow. We present a novel algorithm that employs the NVIDIA Compute Unified Device Architecture (CUDA) platform to perform laser speckle image processing on the graphics processing unit. Software written in C was integrated with CUDA and integrated into a LabVIEW Virtual Instrument (VI) that is interfaced with a monochrome CCD camera able to acquire high-resolution raw speckle images at nearly 10 fps. With the CUDA code integrated into the LabVIEW VI, the processing and display of SFI images were performed also at ~10 fps. We present three video examples depicting real-time flow imaging during a reactive hyperemia maneuver, with fluid flow through an in vitro phantom, and a demonstration of real-time LSI during laser surgery of a port wine stain birthmark.
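The per-pixel work in laser speckle image processing can be sketched as a sliding-window computation. The Python sketch below computes the local speckle contrast K (standard deviation divided by mean) and converts it to a speckle flow index. The abstract does not give the formula or window size, so the simplified relation SFI = 1/(2·T·K²), with T the camera exposure time, the population standard deviation, and the function names are all assumptions made for this example.

```python
import math


def speckle_contrast(window):
    """K = standard deviation / mean of the intensities in one window.

    Uses the population standard deviation and assumes a positive mean.
    """
    n = len(window)
    mean = sum(window) / n
    var = sum((p - mean) ** 2 for p in window) / n
    return math.sqrt(var) / mean


def sfi_map(image, exposure_s, size=3):
    """Speckle flow index for every full size-by-size window of a 2D image.

    Assumes SFI = 1 / (2 * T * K^2); windows with zero contrast map to inf.
    """
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(rows - size + 1):
        row_out = []
        for c in range(cols - size + 1):
            window = [image[r + dr][c + dc]
                      for dr in range(size) for dc in range(size)]
            k = speckle_contrast(window)
            row_out.append(1.0 / (2.0 * exposure_s * k * k) if k > 0 else math.inf)
        out.append(row_out)
    return out
```

Every output pixel depends only on its own small window, so on a GPU each window can be assigned to a separate thread, which is what makes display at camera frame rate feasible.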
Extending the length and time scales of Gram–Schmidt Lyapunov vector computations

DOE Office of Scientific and Technical Information (OSTI.GOV)

Costa, Anthony B., E-mail: acosta@northwestern.edu; Green, Jason R., E-mail: jason.green@umb.edu; Department of Chemistry, University of Massachusetts Boston, Boston, MA 02125

Lyapunov vectors have found growing interest recently due to their ability to characterize systems out of thermodynamic equilibrium. The computation of orthogonal Gram–Schmidt vectors requires multiplication and QR decomposition of large matrices, which grow as N² (with the particle count). This expense has limited such calculations to relatively small systems and short time scales. Here, we detail two implementations of an algorithm for computing Gram–Schmidt vectors. The first is a distributed-memory message-passing method using ScaLAPACK. The second uses the newly-released MAGMA library for GPUs. We compare the performance of both codes for Lennard–Jones fluids from N=100 to 1300 between Intel Nehalem/Infiniband DDR and NVIDIA C2050 architectures. To the best of our knowledge, these are the largest systems for which the Gram–Schmidt Lyapunov vectors have been computed, and the first time their calculation has been GPU-accelerated. We conclude that Lyapunov vector calculations can be significantly extended in length and time by leveraging the power of GPU-accelerated linear algebra.
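The repeated "multiply, then QR-decompose" cycle behind Gram-Schmidt Lyapunov vectors can be shown on a toy problem. The pure-Python sketch below propagates a set of orthonormal vectors with a Jacobian at each step, re-orthonormalizes them by Gram-Schmidt, and averages the logarithms of the stretching factors to estimate Lyapunov exponents per step. It is only a small illustration under assumed names (`gram_schmidt_qr`, `lyapunov_exponents`, `jacobian_at`); the paper's codes rely on ScaLAPACK and MAGMA for the large matrices involved.

```python
import math


def gram_schmidt_qr(columns):
    """Orthonormalize column vectors with modified Gram-Schmidt.

    Returns the orthonormal columns and the diagonal of R (stretch factors).
    """
    q_cols, r_diag = [], []
    for v in columns:
        w = list(v)
        for q in q_cols:
            proj = sum(a * b for a, b in zip(q, w))
            w = [a - proj * b for a, b in zip(w, q)]
        norm = math.sqrt(sum(a * a for a in w))
        q_cols.append([a / norm for a in w])
        r_diag.append(norm)
    return q_cols, r_diag


def mat_vec(matrix, vec):
    """Product of a matrix (list of rows) with a vector."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]


def lyapunov_exponents(jacobian_at, n_dim, n_steps):
    """Average log growth rate of each Gram-Schmidt vector, per step."""
    # Start from the identity basis; q_cols[j] is the j-th vector.
    q_cols = [[1.0 if i == j else 0.0 for i in range(n_dim)]
              for j in range(n_dim)]
    sums = [0.0] * n_dim
    for step in range(n_steps):
        jac = jacobian_at(step)
        propagated = [mat_vec(jac, q) for q in q_cols]
        q_cols, r_diag = gram_schmidt_qr(propagated)
        sums = [s + math.log(r) for s, r in zip(sums, r_diag)]
    return [s / n_steps for s in sums]
```

For a particle system the basis has one vector per phase-space dimension, so both the matrix product and the QR step grow quickly with the particle count, which is the cost the abstract refers to.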

Extending the length and time scales of Gram-Schmidt Lyapunov vector computations

NASA Astrophysics Data System (ADS)

Costa, Anthony B.; Green, Jason R.

2013-08-01

Lyapunov vectors have found growing interest recently due to their ability to characterize systems out of thermodynamic equilibrium. The computation of orthogonal Gram-Schmidt vectors requires multiplication and QR decomposition of large matrices, which grow as N² (with the particle count). This expense has limited such calculations to relatively small systems and short time scales. Here, we detail two implementations of an algorithm for computing Gram-Schmidt vectors. The first is a distributed-memory message-passing method using ScaLAPACK. The second uses the newly-released MAGMA library for GPUs. We compare the performance of both codes for Lennard-Jones fluids from N=100 to 1300 between Intel Nehalem/Infiniband DDR and NVIDIA C2050 architectures. To the best of our knowledge, these are the largest systems for which the Gram-Schmidt Lyapunov vectors have been computed, and the first time their calculation has been GPU-accelerated. We conclude that Lyapunov vector calculations can be significantly extended in length and time by leveraging the power of GPU-accelerated linear algebra.
Implementation of metal-friendly EAM/FS-type semi-empirical potentials in HOOMD-blue: A GPU-accelerated molecular dynamics software

DOE PAGES

Yang, Lin; Zhang, Feng; Wang, Cai-Zhuang; ...

2018-01-12

We present an implementation of EAM and FS interatomic potentials, which are widely used in simulating metallic systems, in HOOMD-blue, a software designed to perform classical molecular dynamics simulations using GPU accelerations. We first discuss the details of our implementation and then report extensive benchmark tests. We demonstrate that single-precision floating point operations efficiently implemented on GPUs can produce sufficient accuracy when compared against double-precision codes, as demonstrated in test simulations of calculations of the glass-transition temperature of Cu64.5Zr35.5, and pair correlation function of liquid Ni3Al. Our code scales well with the size of the simulating system on NVIDIA Tesla M40 and P100 GPUs. Compared with another popular software LAMMPS running on 32 cores of AMD Opteron 6220 processors, the GPU/CPU performance ratio can reach as high as 4.6. In conclusion, the source code can be accessed through the HOOMD-blue web page for free by any interested user.
Implementation of metal-friendly EAM/FS-type semi-empirical potentials in HOOMD-blue: A GPU-accelerated molecular dynamics software

DOE Office of Scientific and Technical Information (OSTI.GOV)

Yang, Lin; Zhang, Feng; Wang, Cai-Zhuang

We present an implementation of EAM and FS interatomic potentials, which are widely used in simulating metallic systems, in HOOMD-blue, a software designed to perform classical molecular dynamics simulations using GPU accelerations. We first discuss the details of our implementation and then report extensive benchmark tests. We demonstrate that single-precision floating point operations efficiently implemented on GPUs can produce sufficient accuracy when compared against double-precision codes, as demonstrated in test simulations of calculations of the glass-transition temperature of Cu64.5Zr35.5, and pair correlation function of liquid Ni3Al. Our code scales well with the size of the simulating system on NVIDIA Tesla M40 and P100 GPUs. Compared with another popular software LAMMPS running on 32 cores of AMD Opteron 6220 processors, the GPU/CPU performance ratio can reach as high as 4.6. In conclusion, the source code can be accessed through the HOOMD-blue web page for free by any interested user.
Real-time blood flow visualization using the graphics processing unit

PubMed Central

Yang, Owen; Cuccia, David; Choi, Bernard

2011-01-01

Laser speckle imaging (LSI) is a technique in which coherent light incident on a surface produces a reflected speckle pattern that is related to the underlying movement of optical scatterers, such as red blood cells, indicating blood flow. Image-processing algorithms can be applied to produce speckle flow index (SFI) maps of relative blood flow. We present a novel algorithm that employs the NVIDIA Compute Unified Device Architecture (CUDA) platform to perform laser speckle image processing on the graphics processing unit. Software written in C was integrated with CUDA and integrated into a LabVIEW Virtual Instrument (VI) that is interfaced with a monochrome CCD camera able to acquire high-resolution raw speckle images at nearly 10 fps. With the CUDA code integrated into the LabVIEW VI, the processing and display of SFI images were performed also at ∼10 fps. We present three video examples depicting real-time flow imaging during a reactive hyperemia maneuver, with fluid flow through an in vitro phantom, and a demonstration of real-time LSI during laser surgery of a port wine stain birthmark. PMID:21280915
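The abstract does not spell out how a speckle flow index map is derived from a raw frame. A common formulation, assumed here rather than taken from this record, computes the local speckle contrast K as standard deviation over mean in a sliding window and then SFI = 1 / (2 T K²), with T the camera exposure time. The pure-Python sketch below uses hypothetical function names and a population standard deviation; each output pixel is independent, which is what makes the computation a natural fit for a CUDA kernel.

```python
import statistics

def speckle_contrast_map(image, window=7):
    """Local speckle contrast K = std / mean over each window x window patch.

    image is a list of equal-length rows of intensities; the output is
    smaller than the input because only fully covered windows are used.
    """
    rows, cols = len(image), len(image[0])
    contrast = []
    for r in range(rows - window + 1):
        out_row = []
        for c in range(cols - window + 1):
            patch = [image[r + dr][c + dc]
                     for dr in range(window) for dc in range(window)]
            mean = statistics.fmean(patch)
            out_row.append(statistics.pstdev(patch) / mean if mean > 0 else 0.0)
        contrast.append(out_row)
    return contrast

def speckle_flow_index(contrast, exposure_s):
    """Assumed mapping SFI = 1 / (2 * T * K^2); infinite where K is zero."""
    return [[1.0 / (2.0 * exposure_s * k * k) if k > 0 else float("inf")
             for k in row] for row in contrast]

# Tiny example: one 2x2 window with mean 2 and standard deviation 1.
k_map = speckle_contrast_map([[1.0, 3.0], [3.0, 1.0]], window=2)
sfi_map = speckle_flow_index(k_map, exposure_s=0.01)
```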
Bond Order Correlations in the 2D Hubbard Model

NASA Astrophysics Data System (ADS)

Moore, Conrad; Abu Asal, Sameer; Yang, Shuxiang; Moreno, Juana; Jarrell, Mark

We use the dynamical cluster approximation to study the bond correlations in the Hubbard model with next nearest neighbor (nnn) hopping to explore the region of the phase diagram where the Fermi liquid phase is separated from the pseudogap phase by the Lifshitz line at zero temperature. We implement the Hirsch-Fye cluster solver that has the advantage of providing direct access to the computation of the bond operators via the decoupling field. In the pseudogap phase, the parallel bond order susceptibility is shown to persist at zero temperature while it vanishes for the Fermi liquid phase which allows the shape of the Lifshitz line to be mapped as a function of filling and nnn hopping. Our cluster solver implements NVIDIA's CUDA language to accelerate the linear algebra of the Quantum Monte Carlo to help alleviate the sign problem by allowing for more Monte Carlo updates to be performed in a reasonable amount of computation time. Work supported by the NSF EPSCoR Cooperative Agreement No. EPS-1003897 with additional support from the Louisiana Board of Regents.
Parallel k-means++

DOE Office of Scientific and Technical Information (OSTI.GOV)

A parallelization of the k-means++ seed selection algorithm on three distinct hardware platforms: GPU, multicore CPU, and multithreaded architecture. K-means++ was developed by David Arthur and Sergei Vassilvitskii in 2007 as an extension of the k-means data clustering technique. These algorithms allow people to cluster multidimensional data, by attempting to minimize the mean distance of data points within a cluster. K-means++ improved upon traditional k-means by using a more intelligent approach to selecting the initial seeds for the clustering process. While k-means++ has become a popular alternative to traditional k-means clustering, little work has been done to parallelize this technique. We have developed original C++ code for parallelizing the algorithm on three unique hardware architectures: GPU using NVidia's CUDA/Thrust framework, multicore CPU using OpenMP, and the Cray XMT multithreaded architecture. By parallelizing the process for these platforms, we are able to perform k-means++ clustering much more quickly than it could be done before.
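For readers unfamiliar with the seed selection step being parallelized, the following serial sketch shows the k-means++ rule: the first seed is drawn uniformly, and each later seed is drawn with probability proportional to the squared distance to the nearest seed chosen so far. It is a plain-Python illustration with hypothetical names, not the C++ code the record describes, and it assumes the input contains at least k distinct points.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_pp_seeds(points, k, rng=None):
    """Pick k initial seeds using D^2 weighting (assumes >= k distinct points)."""
    rng = rng or random.Random()
    seeds = [rng.choice(points)]
    # nearest_d2[i] is the squared distance from points[i] to its closest seed.
    nearest_d2 = [squared_distance(p, seeds[0]) for p in points]
    while len(seeds) < k:
        # Points far from every existing seed are proportionally more likely.
        idx = rng.choices(range(len(points)), weights=nearest_d2, k=1)[0]
        seeds.append(points[idx])
        nearest_d2 = [min(old, squared_distance(p, points[idx]))
                      for old, p in zip(nearest_d2, points)]
    return seeds

demo_points = [(0.0, 0.0), (0.0, 0.0), (5.0, 5.0), (5.0, 5.0), (9.0, 0.0), (9.0, 0.0)]
demo_seeds = kmeans_pp_seeds(demo_points, 3, random.Random(0))
```

The two list comprehensions over all points (computing and updating nearest_d2) are the data-parallel parts; the weighted draw is a reduction or prefix sum over the same array.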
Parallel Implementation of MAFFT on CUDA-Enabled Graphics Hardware.

PubMed

Zhu, Xiangyuan; Li, Kenli; Salah, Ahmad; Shi, Lin; Li, Keqin

2015-01-01

Multiple sequence alignment (MSA) constitutes an extremely powerful tool for many biological applications including phylogenetic tree estimation, secondary structure prediction, and critical residue identification. However, aligning large biological sequences with popular tools such as MAFFT requires long runtimes on sequential architectures. Due to the ever increasing sizes of sequence databases, there is increasing demand to accelerate this task. In this paper, we demonstrate how graphic processing units (GPUs), powered by the compute unified device architecture (CUDA), can be used as an efficient computational platform to accelerate the MAFFT algorithm. To fully exploit the GPU's capabilities for accelerating MAFFT, we have optimized the sequence data organization to eliminate the bandwidth bottleneck of memory access, designed a memory allocation and reuse strategy to make full use of limited memory of GPUs, proposed a new modified-run-length encoding (MRLE) scheme to reduce memory consumption, and used high-performance shared memory to speed up I/O operations. Our implementation tested in three NVIDIA GPUs achieves speedup up to 11.28 on a Tesla K20m GPU compared to the sequential MAFFT 7.015.
Interactions between Nanoparticles and Polymer Brushes: Molecular Dynamics Simulations and Self-consistent Field Theory Calculations

NASA Astrophysics Data System (ADS)

Cheng, Shengfeng; Wen, Chengyuan; Egorov, Sergei

2015-03-01

Molecular dynamics simulations and self-consistent field theory calculations are employed to study the interactions between a nanoparticle and a polymer brush at various densities of chains grafted to a plane. Simulations with both implicit and explicit solvent are performed. In either case the nanoparticle is loaded to the brush at a constant velocity. Then a series of simulations are performed to compute the force exerted on the nanoparticle that is fixed at various distances from the grafting plane. The potential of mean force is calculated and compared to the prediction based on a self-consistent field theory. Our simulations show that the explicit solvent leads to effects that are not captured in simulations with implicit solvent, indicating the importance of including explicit solvent in molecular simulations of such systems. Our results also demonstrate an interesting correlation between the force on the nanoparticle and the density profile of the brush. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for this research.
Accelerating Monte Carlo simulations of photon transport in a voxelized geometry using a massively parallel graphics processing unit.

PubMed

Badal, Andreu; Badano, Aldo

2009-11-01

It is a known fact that Monte Carlo simulations of radiation transport are computationally intensive and may require long computing times. The authors introduce a new paradigm for the acceleration of Monte Carlo simulations: The use of a graphics processing unit (GPU) as the main computing device instead of a central processing unit (CPU). A GPU-based Monte Carlo code that simulates photon transport in a voxelized geometry with the accurate physics models from PENELOPE has been developed using the CUDA™ programming model (NVIDIA Corporation, Santa Clara, CA). An outline of the new code and a sample x-ray imaging simulation with an anthropomorphic phantom are presented. A remarkable 27-fold speed up factor was obtained using a GPU compared to a single core CPU. The reported results show that GPUs are currently a good alternative to CPUs for the simulation of radiation transport. Since the performance of GPUs is currently increasing at a faster pace than that of CPUs, the advantages of GPU-based software are likely to be more pronounced in the future.
Global magnetohydrodynamic simulations on multiple GPUs

NASA Astrophysics Data System (ADS)

Wong, Un-Hong; Wong, Hon-Cheng; Ma, Yonghui

2014-01-01

Global magnetohydrodynamic (MHD) models play the major role in investigating the solar wind-magnetosphere interaction. However, the huge computation requirement in global MHD simulations is also the main problem that needs to be solved. With the recent development of modern graphics processing units (GPUs) and the Compute Unified Device Architecture (CUDA), it is possible to perform global MHD simulations in a more efficient manner. In this paper, we present a global magnetohydrodynamic (MHD) simulator on multiple GPUs using CUDA 4.0 with GPUDirect 2.0. Our implementation is based on the modified leapfrog scheme, which is a combination of the leapfrog scheme and the two-step Lax-Wendroff scheme. GPUDirect 2.0 is used in our implementation to drive multiple GPUs. All data transferring and kernel processing are managed with CUDA 4.0 API instead of using MPI or OpenMP. Performance measurements are made on a multi-GPU system with eight NVIDIA Tesla M2050 (Fermi architecture) graphics cards. These measurements show that our multi-GPU implementation achieves a peak performance of 97.36 GFLOPS in double precision.
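The abstract names the two-step Lax-Wendroff scheme as one ingredient of its modified leapfrog method but gives no formulas. As a stand-in, and not the authors' MHD scheme, the sketch below applies the textbook two-step (Richtmyer) Lax-Wendroff update to one-dimensional linear advection on a periodic grid. The function name and the choice of test problem are assumptions made for illustration.

```python
def lax_wendroff_two_step(u, courant):
    """Advance u by one time step for u_t + a u_x = 0 on a periodic grid.

    courant = a * dt / dx; the scheme is stable for |courant| <= 1.
    """
    n = len(u)
    # Predictor: provisional values at the interfaces i + 1/2, half a step ahead.
    half = [0.5 * (u[i] + u[(i + 1) % n])
            - 0.5 * courant * (u[(i + 1) % n] - u[i]) for i in range(n)]
    # Corrector: conservative update from the interface values
    # (half[i - 1] wraps around for i = 0, giving the periodic boundary).
    return [u[i] - courant * (half[i] - half[i - 1]) for i in range(n)]

pulse = [0.0, 0.0, 1.0, 0.0, 0.0]
shifted = lax_wendroff_two_step(pulse, 1.0)   # Courant number 1: exact shift
smeared = lax_wendroff_two_step(pulse, 0.5)   # Courant number 0.5: dispersive
```

Both stages read only nearest neighbours, so each grid point can be updated by its own GPU thread, with halo exchange between devices in a multi-GPU setting.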
Message Passing on GPUs

NASA Astrophysics Data System (ADS)

Stuart, J. A.

2011-12-01

This paper explores the challenges in implementing a message passing interface usable on systems with data-parallel processors, and more specifically GPUs. As a case study, we design and implement the "DCGN" API on NVIDIA GPUs that is similar to MPI and allows full access to the underlying architecture. We introduce the notion of data-parallel thread-groups as a way to map resources to MPI ranks. We use a method that also allows the data-parallel processors to run autonomously from user-written CPU code. In order to facilitate communication, we use a sleep-based polling system to store and retrieve messages. Unlike previous systems, our method provides both performance and flexibility. By running a test suite of applications with different communication requirements, we find that a tolerable amount of overhead is incurred, somewhere between one and five percent depending on the application, and indicate the locations where this overhead accumulates. We conclude that with innovations in chipsets and drivers, this overhead will be mitigated and provide similar performance to typical CPU-based MPI implementations while providing fully-dynamic communication.
Optimizing legacy molecular dynamics software with directive-based offload

NASA Astrophysics Data System (ADS)

Michael Brown, W.; Carrillo, Jan-Michael Y.; Gavhane, Nitin; Thakkar, Foram M.; Plimpton, Steven J.

2015-10-01

Directive-based programming models are one solution for exploiting many-core coprocessors to increase simulation rates in molecular dynamics. They offer the potential to reduce code complexity with offload models that can selectively target computations to run on the CPU, the coprocessor, or both. In this paper, we describe modifications to the LAMMPS molecular dynamics code to enable concurrent calculations on a CPU and coprocessor. We demonstrate that standard molecular dynamics algorithms can run efficiently on both the CPU and an x86-based coprocessor using the same subroutines. As a consequence, we demonstrate that code optimizations for the coprocessor also result in speedups on the CPU; in extreme cases up to 4.7X. We provide results for LAMMPS benchmarks and for production molecular dynamics simulations using the Stampede hybrid supercomputer with both Intel® Xeon Phi™ coprocessors and NVIDIA GPUs. The optimizations presented have increased simulation rates by over 2X for organic molecules and over 7X for liquid crystals on Stampede. The optimizations are available as part of the "Intel package" supplied with LAMMPS.
Adaptive mesh fluid simulations on GPU

NASA Astrophysics Data System (ADS)

Wang, Peng; Abel, Tom; Kaehler, Ralf

2010-10-01

We describe an implementation of compressible inviscid fluid solvers with block-structured adaptive mesh refinement on Graphics Processing Units using NVIDIA's CUDA. We show that a class of high resolution shock capturing schemes can be mapped naturally on this architecture. Using the method of lines approach with the second order total variation diminishing Runge-Kutta time integration scheme, piecewise linear reconstruction, and a Harten-Lax-van Leer Riemann solver, we achieve an overall speedup of approximately 10 times faster execution on one graphics card as compared to a single core on the host computer. We attain this speedup in uniform grid runs as well as in problems with deep AMR hierarchies. Our framework can readily be applied to more general systems of conservation laws and extended to higher order shock capturing schemes. This is shown directly by an implementation of a magneto-hydrodynamic solver and comparing its performance to the pure hydrodynamic case. Finally, we also combined our CUDA parallel scheme with MPI to make the code run on GPU clusters. Close to ideal speedup is observed on up to four GPUs.
Stacked Multilayer Self-Organizing Map for Background Modeling.

PubMed

Zhao, Zhenjie; Zhang, Xuebo; Fang, Yongchun

2015-09-01

In this paper, a new background modeling method called stacked multilayer self-organizing map background model (SMSOM-BM) is proposed, which presents several merits such as strong representative ability for complex scenarios, easy to use, and so on. In order to enhance the representative ability of the background model and make the parameters learned automatically, the recently developed idea of representative learning (or deep learning) is elegantly employed to extend the existing single-layer self-organizing map background model to a multilayer one (namely, the proposed SMSOM-BM). As a consequence, the SMSOM-BM gains several merits including strong representative ability to learn background model of challenging scenarios, and automatic determination for most network parameters. More specifically, every pixel is modeled by a SMSOM, and spatial consistency is considered at each layer. By introducing a novel over-layer filtering process, we can train the background model layer by layer in an efficient manner. Furthermore, for real-time performance consideration, we have implemented the proposed method using NVIDIA CUDA platform. Comparative experimental results show superior performance of the proposed approach.
A real-time spike sorting method based on the embedded GPU.

PubMed

Zelan Yang; Kedi Xu; Xiang Tian; Shaomin Zhang; Xiaoxiang Zheng

2017-07-01

Microelectrode arrays with hundreds of channels have been widely used to acquire neuron population signals in neuroscience studies. Online spike sorting is becoming one of the most important challenges for high-throughput neural signal acquisition systems. Graphics processing units (GPUs) with high parallel computing capability might provide an alternative solution for the increasing real-time computational demands of spike sorting. This study reports a method of real-time spike sorting using the Compute Unified Device Architecture (CUDA), implemented on an embedded GPU (NVIDIA JETSON Tegra K1, TK1). The sorting approach is based on principal component analysis (PCA) and K-means. By analyzing the parallelism of each process, the method was further optimized for the thread and memory model of the GPU. Our results showed that the GPU-based classifier on the TK1 is 37.92 times faster than the MATLAB-based classifier on a PC, while their accuracies were the same. The high-performance computing features of the embedded GPU demonstrated in our studies suggest that the embedded GPU provides a promising platform for real-time neural signal processing.
Optimizing Likelihood Models for Particle Trajectory Segmentation in Multi-State Systems.

PubMed

Young, Dylan Christopher; Scrimgeour, Jan

2018-06-19

Particle tracking offers significant insight into the molecular mechanics that govern the behavior of living cells. The analysis of molecular trajectories that transition between different motive states, such as diffusive, driven and tethered modes, is of considerable importance, with even single trajectories containing significant amounts of information about a molecule's environment and its interactions with cellular structures. Hidden Markov models (HMM) have been widely adopted to perform the segmentation of such complex tracks. In this paper, we show that extensive analysis of hidden Markov model outputs using data derived from multi-state Brownian dynamics simulations can be used both for the optimization of the likelihood models used to describe the states of the system and for characterization of the technique's failure mechanisms. This analysis was made possible by the implementation of a parallelized adaptive direct search algorithm on an Nvidia graphics processing unit. This approach provides critical information for the visualization of HMM failure and successful design of particle tracking experiments where trajectories contain multiple mobile states. © 2018 IOP Publishing Ltd.
Numerical solution of the Navier-Stokes equations by discontinuous Galerkin method

NASA Astrophysics Data System (ADS)

Krasnov, M. M.; Kuchugov, P. A.; Ladonkina, M. E.; Lutsky, A. E.; Tishkin, V. F.

2017-02-01

Detailed unstructured grids and numerical methods of high accuracy are frequently used in the numerical simulation of gasdynamic flows in areas with complex geometry. The Galerkin method with discontinuous basis functions, or Discontinuous Galerkin Method (DGM), works well in dealing with such problems. This approach offers a number of advantages inherent to both finite-element and finite-difference approximations. Moreover, the present paper shows that DGM schemes can be viewed as an extension of the Godunov method to piecewise-polynomial functions. As is known, DGM involves significant computational complexity, and this brings up the question of ensuring the most effective use of all the computational capacity available. In order to speed up the calculations, an operator programming method has been applied in creating the computational module. This approach makes possible compact encoding of mathematical formulas and facilitates the porting of programs to parallel architectures, such as NVidia CUDA and Intel Xeon Phi. With the DGM-based software package, numerical simulations of supersonic flow past solid bodies have been carried out. The numerical results are in good agreement with the experimental ones.
DeepSite: protein-binding site predictor using 3D-convolutional neural networks.

PubMed

Jiménez, J; Doerr, S; Martínez-Rosell, G; Rose, A S; De Fabritiis, G

2017-10-01

An important step in structure-based drug design consists in the prediction of druggable binding sites. Several algorithms for detecting binding cavities, those likely to bind to a small drug compound, have been developed over the years by clever exploitation of geometric, chemical and evolutionary features of the protein. Here we present a novel knowledge-based approach that uses state-of-the-art convolutional neural networks, where the algorithm is learned by examples. In total, 7622 proteins from the scPDB database of binding sites have been evaluated using both a distance and a volumetric overlap approach. Our machine-learning based method demonstrates superior performance to two other competitive algorithmic strategies. DeepSite is freely available at www.playmolecule.org. Users can submit either a PDB ID or PDB file for pocket detection to our NVIDIA GPU-equipped servers through a WebGL graphical interface. Contact: gianni.defabritiis@upf.edu. Supplementary data are available at Bioinformatics online. © The Author (2017). Published by Oxford University Press.
Implementation of EAM and FS potentials in HOOMD-blue

NASA Astrophysics Data System (ADS)

Yang, Lin; Zhang, Feng; Travesset, Alex; Wang, Caizhuang; Ho, Kaiming

HOOMD-blue is a general-purpose software to perform classical molecular dynamics simulations entirely on GPUs. We provide full support for EAM and FS type potentials in HOOMD-blue, and report accuracy and efficiency benchmarks, including comparisons with the LAMMPS GPU package. Two problems were selected to test the accuracy: the determination of the glass transition temperature of Cu64.5Zr35.5 alloy using an FS potential and the calculation of pair distribution functions of Ni3Al using an EAM potential. In both cases, the results using HOOMD-blue are indistinguishable from those obtained by the GPU package in LAMMPS within statistical uncertainties. As tests for time efficiency, we benchmark time-steps per second using LAMMPS GPU and HOOMD-blue on one NVIDIA Tesla GPU. Compared to our typical LAMMPS simulations on one CPU cluster node which has 16 CPUs, LAMMPS GPU can be 3-3.5 times faster, and HOOMD-blue can be 4-5.5 times faster. We acknowledge the support from Laboratory Directed Research and Development (LDRD) of Ames Laboratory.
Development of an embedded atmospheric turbulence mitigation engine

NASA Astrophysics Data System (ADS)

Paolini, Aaron; Bonnett, James; Kozacik, Stephen; Kelmelis, Eric

2017-05-01

Methods to reconstruct pictures from imagery degraded by atmospheric turbulence have been under development for decades. The techniques were initially developed for observing astronomical phenomena from the Earth's surface, but have more recently been modified for ground and air surveillance scenarios. Such applications can impose significant constraints on deployment options because they both increase the computational complexity of the algorithms themselves and often dictate a requirement for low size, weight, and power (SWaP) form factors. Consequently, embedded implementations must be developed that can perform the necessary computations on low-SWaP platforms. Fortunately, there is an emerging class of embedded processors driven by the mobile and ubiquitous computing industries. We have leveraged these processors to develop embedded versions of the core atmospheric correction engine found in our ATCOM software. In this paper, we will present our experience adapting our algorithms for embedded systems on a chip (SoCs), namely the NVIDIA Tegra that couples general-purpose ARM cores with their graphics processing unit (GPU) technology and the Xilinx Zynq which pairs similar ARM cores with their field-programmable gate array (FPGA) fabric.

A heterogeneous computing accelerated SCE-UA global optimization method using OpenMP, OpenCL, CUDA, and OpenACC.

PubMed

Kan, Guangyuan; He, Xiaoyan; Ding, Liuqian; Li, Jiren; Liang, Ke; Hong, Yang

2017-10-01

The shuffled complex evolution optimization developed at the University of Arizona (SCE-UA) has been successfully applied in various kinds of scientific and engineering optimization applications, such as hydrological model parameter calibration, for many years. The algorithm possesses good global optimality, convergence stability and robustness. However, benchmark and real-world applications reveal the poor computational efficiency of the SCE-UA. This research aims at the parallelization and acceleration of the SCE-UA method based on powerful heterogeneous computing technology. The parallel SCE-UA is implemented on Intel Xeon multi-core CPU (by using OpenMP and OpenCL) and NVIDIA Tesla many-core GPU (by using OpenCL, CUDA, and OpenACC). The serial and parallel SCE-UA were tested based on the Griewank benchmark function. Comparison results indicate the parallel SCE-UA significantly improves computational efficiency compared to the original serial version. The OpenCL implementation obtains the best overall acceleration results, but with the most complex source code. The parallel SCE-UA has bright prospects to be applied in real-world applications.
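The Griewank benchmark mentioned in this abstract has the standard closed form f(x) = 1 + Σ x_i²/4000 − Π cos(x_i/√i), with a global minimum of 0 at the origin surrounded by many regularly spaced local minima. A minimal sketch of the objective follows; the function name is chosen here for illustration and this is not the authors' code.

```python
import math

def griewank(x):
    """Griewank test function; x is a sequence of coordinates (1-based index i)."""
    quadratic = sum(xi * xi for xi in x) / 4000.0
    oscillation = math.prod(math.cos(xi / math.sqrt(i))
                            for i, xi in enumerate(x, start=1))
    return 1.0 + quadratic - oscillation

at_origin = griewank([0.0] * 5)
away = griewank([math.pi])
```

In a population-based optimizer such as SCE-UA, each candidate point is evaluated independently of the others, which is the part that maps most directly onto OpenMP threads or GPU work-items.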
Flexible, fast and accurate sequence alignment profiling on GPGPU with PaSWAS.

PubMed

Warris, Sven; Yalcin, Feyruz; Jackson, Katherine J L; Nap, Jan Peter

2015-01-01

To obtain large-scale sequence alignments in a fast and flexible way is an important step in the analyses of next generation sequencing data. Applications based on the Smith-Waterman (SW) algorithm are often either not fast enough, limited to dedicated tasks or not sufficiently accurate due to statistical issues. Current SW implementations that run on graphics hardware do not report the alignment details necessary for further analysis. With the Parallel SW Alignment Software (PaSWAS) it is possible (a) to have easy access to the computational power of NVIDIA-based general purpose graphics processing units (GPGPUs) to perform high-speed sequence alignments, and (b) retrieve relevant information such as score, number of gaps and mismatches. The software reports multiple hits per alignment. The added value of the new SW implementation is demonstrated with two test cases: (1) tag recovery in next generation sequence data and (2) isotype assignment within an immunoglobulin 454 sequence data set. Both cases show the usability and versatility of the new parallel Smith-Waterman implementation.
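To make concrete what "score, number of gaps and mismatches" means for a Smith-Waterman alignment, here is a small serial sketch with traceback. It is not PaSWAS code: the scoring values, the linear gap penalty, and all names are assumptions chosen for illustration, and it reports a single best hit rather than multiple hits.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment of a and b with a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    h = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            h[i][j] = max(0, h[i - 1][j - 1] + sub,
                          h[i - 1][j] + gap, h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    top, bottom, gaps, mismatches = [], [], 0, 0
    while i > 0 and j > 0 and h[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + sub:
            mismatches += a[i - 1] != b[j - 1]
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif h[i][j] == h[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-")
            gaps, i = gaps + 1, i - 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            gaps, j = gaps + 1, j - 1
    return {"score": best, "gaps": gaps, "mismatches": mismatches,
            "aligned_a": "".join(reversed(top)),
            "aligned_b": "".join(reversed(bottom))}

exact_hit = smith_waterman("TTACGTAA", "GGACGTCC")
gapped_hit = smith_waterman("ACGTACGT", "ACGACGT")
```

Cells on the same anti-diagonal of h depend only on earlier anti-diagonals, which is the usual source of parallelism in GPU implementations of this recurrence.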
Multicore and GPU algorithms for Nussinov RNA folding

PubMed Central

2014-01-01

Background: One segment of an RNA sequence might be paired with another segment of the same RNA sequence due to the force of hydrogen bonds. This two-dimensional structure is called the RNA sequence's secondary structure. Several algorithms have been proposed to predict an RNA sequence's secondary structure. These algorithms are referred to as RNA folding algorithms. Results: We develop cache efficient, multicore, and GPU algorithms for RNA folding using Nussinov's algorithm. Conclusions: Our cache efficient algorithm provides a speedup between 1.6 and 3.0 relative to a naive straightforward single core code. The multicore version of the cache efficient single core algorithm provides a speedup, relative to the naive single core algorithm, between 7.5 and 14.0 on a 6 core hyperthreaded CPU. Our GPU algorithm for the NVIDIA C2050 is up to 1582 times as fast as the naive single core algorithm and between 5.1 and 11.2 times as fast as the fastest previously known GPU algorithm for Nussinov RNA folding. PMID:25082539
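The "naive straightforward single core code" used as the baseline in this abstract is presumably close to the textbook Nussinov recurrence, which maximizes the number of nested base pairs. A minimal sketch follows, assuming Watson-Crick plus G-U wobble pairs and no minimum hairpin loop length by default; the names and these parameter choices are illustrative, not taken from the paper.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def nussinov(seq, min_loop=0):
    """Maximum number of nested base pairs in seq (O(n^3) dynamic program)."""
    n = len(seq)
    if n == 0:
        return 0
    # dp[i][j] = best pair count for the subsequence seq[i..j]; cells with
    # i > j stay zero and act as the empty-interval base case.
    dp = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            best = max(dp[i + 1][j], dp[i][j - 1])       # i or j left unpaired
            if (seq[i], seq[j]) in PAIRS and j - i > min_loop:
                best = max(best, dp[i + 1][j - 1] + 1)   # i pairs with j
            for k in range(i + 1, j):                    # split into two parts
                best = max(best, dp[i][k] + dp[k + 1][j])
            dp[i][j] = best
    return dp[0][n - 1]

max_pairs = nussinov("GGGAAAUCC")
```

All cells with the same span are independent of one another, so the table can be filled one diagonal at a time in parallel; memory access order along those diagonals is what cache-efficient variants reorganize.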
GPU-based Green's function simulations of shear waves generated by an applied acoustic radiation force in elastic and viscoelastic models.

PubMed

Yang, Yiqun; Urban, Matthew W; McGough, Robert J

2018-05-15

Shear wave calculations induced by an acoustic radiation force are very time-consuming on desktop computers, and high-performance graphics processing units (GPUs) achieve dramatic reductions in the computation time for these simulations. The acoustic radiation force is calculated using the fast near field method and the angular spectrum approach, and then the shear waves are calculated in parallel with Green's functions on a GPU. This combination enables rapid evaluation of shear waves for push beams with different spatial samplings and for apertures with different f/#. Relative to shear wave simulations that evaluate the same algorithm on an Intel i7 desktop computer, a high performance nVidia GPU reduces the time required for these calculations by a factor of 45 and 700 when applied to elastic and viscoelastic shear wave simulation models, respectively. These GPU-accelerated simulations are also compared to measurements in different viscoelastic phantoms, and the results are similar. For parametric evaluations and for comparisons with measured shear wave data, shear wave simulations with the Green's function approach are ideally suited for high-performance GPUs.
Acceleration of High Angular Momentum Electron Repulsion Integrals and Integral Derivatives on Graphics Processing Units.

PubMed

Miao, Yipu; Merz, Kenneth M

2015-04-14

We present an efficient implementation of ab initio self-consistent field (SCF) energy and gradient calculations that run on Compute Unified Device Architecture (CUDA) enabled graphical processing units (GPUs) using recurrence relations. We first discuss the machine-generated code that calculates the electron-repulsion integrals (ERIs) for different ERI types. Next we describe the porting of the SCF gradient calculation to GPUs, which results in an acceleration of the computation of the first-order derivative of the ERIs. However, only s, p, and d ERIs and s and p derivatives could be executed simultaneously on GPUs using the current version of CUDA and generation of NVidia GPUs using a previously described algorithm [Miao and Merz J. Chem. Theory Comput. 2013, 9, 965-976.]. Hence, we developed an algorithm to compute f type ERIs and d type ERI derivatives on GPUs. Our benchmarks show that the GPU-enabled ERI and ERI derivative computation yielded speedups of 10-18 times relative to traditional CPU execution. An accuracy analysis using double-precision calculations demonstrates that the overall accuracy is satisfactory for most applications.
Statistical tools for analysis and modeling of cosmic populations and astronomical time series: CUDAHM and TSE

NASA Astrophysics Data System (ADS)

Loredo, Thomas; Budavari, Tamas; Scargle, Jeffrey D.

2018-01-01

This presentation provides an overview of open-source software packages addressing two challenging classes of astrostatistics problems. (1) CUDAHM is a C++ framework for hierarchical Bayesian modeling of cosmic populations, leveraging graphics processing units (GPUs) to enable applying this computationally challenging paradigm to large datasets. CUDAHM is motivated by measurement error problems in astronomy, where density estimation and linear and nonlinear regression must be addressed for populations of thousands to millions of objects whose features are measured with possibly complex uncertainties, potentially including selection effects. An example calculation demonstrates accurate GPU-accelerated luminosity function estimation for simulated populations of 10^6 objects in about two hours using a single NVIDIA Tesla K40c GPU. (2) Time Series Explorer (TSE) is a collection of software in Python and MATLAB for exploratory analysis and statistical modeling of astronomical time series. It comprises a library of stand-alone functions and classes, as well as an application environment for interactive exploration of time series data. The presentation will summarize key capabilities of this emerging project, including new algorithms for analysis of irregularly-sampled time series.
GPUs, a New Tool of Acceleration in CFD: Efficiency and Reliability on Smoothed Particle Hydrodynamics Methods

PubMed Central

Crespo, Alejandro C.; Dominguez, Jose M.; Barreiro, Anxo; Gómez-Gesteira, Moncho; Rogers, Benedict D.

2011-01-01

Smoothed Particle Hydrodynamics (SPH) is a numerical method commonly used in Computational Fluid Dynamics (CFD) to simulate complex free-surface flows. Simulations with this mesh-free particle method far exceed the capacity of a single processor. In this paper, as part of a dual-functioning code for either central processing units (CPUs) or Graphics Processor Units (GPUs), a parallelisation using GPUs is presented. The GPU parallelisation technique uses the Compute Unified Device Architecture (CUDA) of nVidia devices. Simulations with more than one million particles on a single GPU card exhibit speedups of up to two orders of magnitude over using a single-core CPU. It is demonstrated that the code achieves different speedups with different CUDA-enabled GPUs. The numerical behaviour of the SPH code is validated with a standard benchmark test case of dam break flow impacting on an obstacle where good agreement with the experimental results is observed. Both the achieved speed-ups and the quantitative agreement with experiments suggest that CUDA-based GPU programming can be used in SPH methods with efficiency and reliability. PMID:21695185
A method of gravity and seismic sequential inversion and its GPU implementation

NASA Astrophysics Data System (ADS)

Liu, G.; Meng, X.

2011-12-01

In this abstract, we introduce a gravity and seismic sequential inversion method to invert for density and velocity together. For the gravity inversion, we use an iterative method based on a correlation imaging algorithm; for the seismic inversion, we use full waveform inversion. The link between density and velocity is an empirical formula called the Gardner equation. For large volumes of data, we use the GPU to accelerate the computation. The gravity inversion is an iterative method based on the correlation imaging algorithm: first we calculate the correlation imaging of the observed gravity anomaly, which takes a value between -1 and +1; we then multiply this value by a small density, and the result becomes the initial density model. We compute a forward result with this initial model, calculate the correlation imaging of the misfit between the observed data and the forward data, multiply that correlation imaging result by a small density, and add it to the initial model. Repeating this procedure finally yields an inverted density model. For the seismic inversion, we use a method based on the linearity of the acoustic wave equation written in the frequency domain; with an initial velocity model, we can obtain a good velocity result. In the sequential inversion of gravity and seismic data, we need a link formula to convert between density and velocity; in our method, we use the Gardner equation. Driven by the insatiable market demand for real-time, high-definition 3D images, the programmable NVIDIA Graphic Processing Unit (GPU) as a co-processor of the CPU has been developed for high performance computing. Compute Unified Device Architecture (CUDA) is a parallel programming model and software environment provided by NVIDIA, designed to overcome the challenge of using traditional general-purpose GPUs while maintaining a low learning curve for programmers familiar with standard programming languages such as C. In our inversion processing
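The Gardner link between velocity and density named in this abstract can be sketched as follows. The abstract does not give the coefficients it uses, so this sketch assumes the commonly quoted form rho = 0.31 * V^0.25 (V in m/s, rho in g/cm^3); the function names are hypothetical and are not taken from the authors' code.

```python
# Assumed Gardner coefficients (commonly quoted values, NOT taken from the abstract):
# rho [g/cm^3] = ALPHA * V ** BETA, with V the P-wave velocity in m/s.
ALPHA = 0.31
BETA = 0.25


def velocity_to_density(v_mps: float) -> float:
    """Convert P-wave velocity (m/s) to bulk density (g/cm^3) via Gardner's relation."""
    return ALPHA * v_mps ** BETA


def density_to_velocity(rho_gcc: float) -> float:
    """Invert Gardner's relation: bulk density (g/cm^3) back to velocity (m/s)."""
    return (rho_gcc / ALPHA) ** (1.0 / BETA)
```

In a sequential scheme of the kind described, a density model from the gravity step could be passed through `density_to_velocity` to seed the seismic step, and the reverse conversion could carry the velocity result back.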
Accuracy of the Yamax CW-701 Pedometer for measuring steps in controlled and free-living conditions

PubMed Central

Coffman, Maren J; Reeve, Charlie L; Butler, Shannon; Keeling, Maiya; Talbot, Laura A

2016-01-01

Objective: The Yamax Digi-Walker CW-701 (Yamax CW-701) is a low-cost pedometer that includes a 7-day memory, a 2-week cumulative memory, and automatically resets to zero at midnight. To date, the accuracy of the Yamax CW-701 has not been determined. The purpose of this study was to assess the accuracy of steps recorded by the Yamax CW-701 pedometer compared with actual steps and two other devices. Methods: The study was conducted in a campus-based lab and in free-living settings with 22 students, faculty, and staff at a mid-sized university in the Southeastern US. While wearing a Yamax CW-701, Yamax Digi-Walker SW-200, and an ActiGraph GTX3 accelerometer, participants engaged in activities at variable speeds and conditions. To assess accuracy of each device, steps recorded were compared with actual step counts. Statistical tests included paired sample t-tests, percent accuracy, intraclass correlation coefficient, and Bland–Altman plots. Results: The Yamax CW-701 demonstrated reliability and concurrent validity during walking at a fast pace and walking on a track, and in free-living conditions. Decreased accuracy was noted walking at a slow pace. Conclusions: These findings are consistent with prior research. With most pedometers and accelerometers, adequate force and intensity must be present for a step to register. The Yamax CW-701 is accurate in recording steps taken while walking at a fast pace and in free-living settings. PMID:29942555
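Two of the statistics named in this abstract, percent accuracy and the Bland–Altman limits of agreement, can be sketched in a few lines. The abstract does not define percent accuracy, so this sketch assumes the usual ratio of recorded to actual steps; the function names are hypothetical and no study data are reproduced here.

```python
from statistics import mean, stdev


def percent_accuracy(recorded_steps: float, actual_steps: float) -> float:
    """Recorded steps as a percentage of actual steps (assumed definition)."""
    return 100.0 * recorded_steps / actual_steps


def bland_altman_limits(device, reference):
    """Return (bias, lower, upper): mean difference and 95% limits of agreement."""
    diffs = [d - r for d, r in zip(device, reference)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd
```

A Bland–Altman plot then shows each pair's difference against the pair's mean, with horizontal lines at the bias and at the two limits.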
Accuracy of the Yamax CW-701 Pedometer for measuring steps in controlled and free-living conditions.

PubMed

Coffman, Maren J; Reeve, Charlie L; Butler, Shannon; Keeling, Maiya; Talbot, Laura A

2016-01-01

The Yamax Digi-Walker CW-701 (Yamax CW-701) is a low-cost pedometer that includes a 7-day memory, a 2-week cumulative memory, and automatically resets to zero at midnight. To date, the accuracy of the Yamax CW-701 has not been determined. The purpose of this study was to assess the accuracy of steps recorded by the Yamax CW-701 pedometer compared with actual steps and two other devices. The study was conducted in a campus-based lab and in free-living settings with 22 students, faculty, and staff at a mid-sized university in the Southeastern US. While wearing a Yamax CW-701, Yamax Digi-Walker SW-200, and an ActiGraph GTX3 accelerometer, participants engaged in activities at variable speeds and conditions. To assess accuracy of each device, steps recorded were compared with actual step counts. Statistical tests included paired sample t-tests, percent accuracy, intraclass correlation coefficient, and Bland-Altman plots. The Yamax CW-701 demonstrated reliability and concurrent validity during walking at a fast pace and walking on a track, and in free-living conditions. Decreased accuracy was noted walking at a slow pace. These findings are consistent with prior research. With most pedometers and accelerometers, adequate force and intensity must be present for a step to register. The Yamax CW-701 is accurate in recording steps taken while walking at a fast pace and in free-living settings.
Review: visual analytics of climate networks

NASA Astrophysics Data System (ADS)

Nocke, T.; Buschmann, S.; Donges, J. F.; Marwan, N.; Schulz, H.-J.; Tominski, C.

2015-09-01

Network analysis has become an important approach in studying complex spatiotemporal behaviour within geophysical observation and simulation data. This new field produces increasing numbers of large geo-referenced networks to be analysed. Particular focus lies currently on the network analysis of the complex statistical interrelationship structure within climatological fields. The standard procedure for such network analyses is the extraction of network measures in combination with static standard visualisation methods. Existing interactive visualisation methods and tools for geo-referenced network exploration are often either not known to the analyst or their potential is not fully exploited. To fill this gap, we illustrate how interactive visual analytics methods in combination with geovisualisation can be tailored for visual climate network investigation. Therefore, the paper provides a problem analysis relating the multiple visualisation challenges to a survey undertaken with network analysts from the research fields of climate and complex systems science. Then, as an overview for the interested practitioner, we review the state-of-the-art in climate network visualisation and provide an overview of existing tools. As a further contribution, we introduce the visual network analytics tools CGV and GTX, providing tailored solutions for climate network analysis, including alternative geographic projections, edge bundling, and 3-D network support. Using these tools, the paper illustrates the application potentials of visual analytics for climate networks based on several use cases including examples from global, regional, and multi-layered climate networks.
New solutions for climate network visualization

NASA Astrophysics Data System (ADS)

Nocke, Thomas; Buschmann, Stefan; Donges, Jonathan F.; Marwan, Norbert

2016-04-01

An increasing number of climate and climate impact research methods deal with geo-referenced networks, including energy, trade, supply-chain, disease dissemination and climatic tele-connection networks. At the same time, the size and complexity of these networks increase, resulting in networks of more than a hundred thousand or even millions of edges, which are often temporally evolving, have additional data at nodes and edges, and can consist of multiple layers even in real 3D. This poses challenges for both the static representation and the interactive exploration of these networks, first of all avoiding edge clutter ("edge spaghetti") and allowing interactivity even for unfiltered networks. Within this presentation, we illustrate potential solutions to these challenges. To this end, we give a glimpse of a questionnaire conducted with climate and complex system scientists with respect to their network visualization requirements, and of a review of available state-of-the-art visualization techniques and tools for this purpose (see also Nocke et al., 2015). In the main part, we present alternative visualization solutions for several use cases (global, regional, and multi-layered climate networks) including alternative geographic projections, edge bundling, and 3-D network support (based on CGV and GTX tools), and implementation details to reach interactive frame rates. References: Nocke, T., S. Buschmann, J. F. Donges, N. Marwan, H.-J. Schulz, and C. Tominski: Review: Visual analytics of climate networks, Nonlinear Processes in Geophysics, 22, 545-570, doi:10.5194/npg-22-545-2015, 2015
Review: visual analytics of climate networks

NASA Astrophysics Data System (ADS)

Nocke, T.; Buschmann, S.; Donges, J. F.; Marwan, N.; Schulz, H.-J.; Tominski, C.

2015-04-01

Network analysis has become an important approach in studying complex spatiotemporal behaviour within geophysical observation and simulation data. This new field produces increasing numbers of large geo-referenced networks to be analysed. Particular focus lies currently on the network analysis of the complex statistical interrelationship structure within climatological fields. The standard procedure for such network analyses is the extraction of network measures in combination with static standard visualisation methods. Existing interactive visualisation methods and tools for geo-referenced network exploration are often either not known to the analyst or their potential is not fully exploited. To fill this gap, we illustrate how interactive visual analytics methods in combination with geovisualisation can be tailored for visual climate network investigation. Therefore, the paper provides a problem analysis, relating the multiple visualisation challenges to a survey undertaken with network analysts from the research fields of climate and complex systems science. Then, as an overview for the interested practitioner, we review the state-of-the-art in climate network visualisation and provide an overview of existing tools. As a further contribution, we introduce the visual network analytics tools CGV and GTX, providing tailored solutions for climate network analysis, including alternative geographic projections, edge bundling, and 3-D network support. Using these tools, the paper illustrates the application potentials of visual analytics for climate networks based on several use cases including examples from global, regional, and multi-layered climate networks.
A deep learning method for early screening of lung cancer

NASA Astrophysics Data System (ADS)

Zhang, Kunpeng; Jiang, Huiqin; Ma, Ling; Gao, Jianbo; Yang, Xiaopeng

2018-04-01

Lung cancer is the leading cause of cancer-related deaths among men. In this paper, we propose a pulmonary nodule detection method for early screening of lung cancer based on an improved AlexNet model. First, in order to maintain the same image quality as the existing B/S architecture PACS system, we convert the original CT image into a JPEG image by analyzing the DICOM file. Second, in view of the large size and complex background of chest CT images, we design the convolutional neural network on the basis of the AlexNet model and a sparse convolution structure. Finally, we train our models with DIGITS, software provided by NVIDIA. The main contribution of this paper is to apply the convolutional neural network to the early screening of lung cancer and to improve the screening accuracy by combining the AlexNet model with the sparse convolution structure. We conduct a series of experiments on chest CT images using the proposed method; the resulting sensitivity and specificity indicate that the method can effectively improve the accuracy of early screening of lung cancer and that it has clinical significance.
Exploiting graphics processing units for computational biology and bioinformatics.

PubMed

Payne, Joshua L; Sinnott-Armstrong, Nicholas A; Moore, Jason H

2010-09-01

Advances in the video gaming industry have led to the production of low-cost, high-performance graphics processing units (GPUs) that possess more memory bandwidth and computational capability than central processing units (CPUs), the standard workhorses of scientific computing. With the recent release of general-purpose GPUs and NVIDIA's GPU programming language, CUDA, graphics engines are being adopted widely in scientific computing applications, particularly in the fields of computational biology and bioinformatics. The goal of this article is to concisely present an introduction to GPU hardware and programming, aimed at the computational biologist or bioinformaticist. To this end, we discuss the primary differences between GPU and CPU architecture, introduce the basics of the CUDA programming language, and discuss important CUDA programming practices, such as the proper use of coalesced reads, data types, and memory hierarchies. We highlight each of these topics in the context of computing the all-pairs distance between instances in a dataset, a common procedure in numerous disciplines of scientific computing. We conclude with a runtime analysis of the GPU and CPU implementations of the all-pairs distance calculation. We show our final GPU implementation to outperform the CPU implementation by a factor of 1700.
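A minimal CPU-side sketch of the all-pairs distance calculation discussed in this abstract, written in Python rather than CUDA. It assumes Euclidean distance, which the abstract does not specify, and the function name is hypothetical; it is not the article's implementation.

```python
import math


def all_pairs_distance(points):
    """Return an n x n matrix of Euclidean distances between all pairs of points."""
    n = len(points)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # The matrix is symmetric, so only the upper triangle is computed.
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])
            dist[i][j] = d
            dist[j][i] = d
    return dist
```

The work grows quadratically with the number of instances and every (i, j) entry is independent of the others, which is what makes the problem a natural fit for assigning one pair per GPU thread.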
Accelerating Climate and Weather Simulations through Hybrid Computing

NASA Technical Reports Server (NTRS)

Zhou, Shujia; Cruz, Carlos; Duffy, Daniel; Tucker, Robert; Purcell, Mark

2011-01-01

Unconventional multi- and many-core processors (e.g. IBM (R) Cell B.E.(TM) and NVIDIA (R) GPU) have emerged as effective accelerators in trial climate and weather simulations. Yet these climate and weather models typically run on parallel computers with conventional processors (e.g. Intel, AMD, and IBM) using Message Passing Interface. To address challenges involved in efficiently and easily connecting accelerators to parallel computers, we investigated using IBM's Dynamic Application Virtualization (TM) (IBM DAV) software in a prototype hybrid computing system with representative climate and weather model components. The hybrid system comprises two Intel blades and two IBM QS22 Cell B.E. blades, connected with both InfiniBand(R) (IB) and 1-Gigabit Ethernet. The system significantly accelerates a solar radiation model component by offloading compute-intensive calculations to the Cell blades. Systematic tests show that IBM DAV can seamlessly offload compute-intensive calculations from Intel blades to Cell B.E. blades in a scalable, load-balanced manner. However, noticeable communication overhead was observed, mainly due to IP over the IB protocol. Full utilization of IB Sockets Direct Protocol and the lower latency production version of IBM DAV will reduce this overhead.
A Distributed GPU-Based Framework for Real-Time 3D Volume Rendering of Large Astronomical Data Cubes

NASA Astrophysics Data System (ADS)

Hassan, A. H.; Fluke, C. J.; Barnes, D. G.

2012-05-01

We present a framework to volume-render three-dimensional data cubes interactively using distributed ray-casting and volume-bricking over a cluster of workstations powered by one or more graphics processing units (GPUs) and a multi-core central processing unit (CPU). The main design target for this framework is to provide an in-core visualization solution able to provide three-dimensional interactive views of terabyte-sized data cubes. We tested the presented framework using a computing cluster comprising 64 nodes with a total of 128 GPUs. The framework proved to be scalable to render a 204 GB data cube with an average of 30 frames per second. Our performance analyses also compare the use of NVIDIA Tesla 1060 and 2050 GPU architectures and the effect of increasing the visualization output resolution on the rendering performance. Although our initial focus, as shown in the examples presented in this work, is volume rendering of spectral data cubes from radio astronomy, we contend that our approach has applicability to other disciplines where close to real-time volume rendering of terabyte-order three-dimensional data sets is a requirement.
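The idea behind volume-bricking can be illustrated with a short sketch. The abstract does not describe the actual bricking scheme, so this example assumes the simplest case, contiguous slabs along one axis with one brick per GPU; the function name is hypothetical.

```python
def split_into_bricks(n_slices: int, n_bricks: int):
    """Divide n_slices into n_bricks contiguous, near-equal (start, stop) ranges."""
    base, extra = divmod(n_slices, n_bricks)
    ranges = []
    start = 0
    for b in range(n_bricks):
        # The first `extra` bricks take one additional slice each.
        size = base + (1 if b < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


# Figures from the abstract: a 204 GB cube spread evenly over 128 GPUs.
per_gpu_gb = 204 / 128  # about 1.6 GB of volume data per GPU
```

Under this even split, each GPU holds roughly 1.6 GB of the cube and ray-casts only its own brick, and the partial images are then composited into the final view.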
Announcing Supercomputer Summit

DOE Office of Scientific and Technical Information (OSTI.GOV)

Wells, Jack; Bland, Buddy; Nichols, Jeff

Summit is the next leap in leadership-class computing systems for open science. With Summit we will be able to address, with greater complexity and higher fidelity, questions concerning who we are, our place on earth, and in our universe. Summit will deliver more than five times the computational performance of Titan’s 18,688 nodes, using only approximately 3,400 nodes when it arrives in 2017. Like Titan, Summit will have a hybrid architecture, and each node will contain multiple IBM POWER9 CPUs and NVIDIA Volta GPUs all connected together with NVIDIA’s high-speed NVLink. Each node will have over half a terabyte of coherent memory (high bandwidth memory + DDR4) addressable by all CPUs and GPUs plus 800GB of non-volatile RAM that can be used as a burst buffer or as extended memory. To provide a high rate of I/O throughput, the nodes will be connected in a non-blocking fat-tree using a dual-rail Mellanox EDR InfiniBand interconnect. Upon completion, Summit will allow researchers in all fields of science unprecedented access to solving some of the world’s most pressing challenges.
GeNN: a code generation framework for accelerated brain simulations

NASA Astrophysics Data System (ADS)

Yavuz, Esin; Turner, James; Nowotny, Thomas

2016-01-01

Large-scale numerical simulations of detailed brain circuit models are important for identifying hypotheses on brain functions and testing their consistency and plausibility. An ongoing challenge for simulating realistic models is, however, computational speed. In this paper, we present the GeNN (GPU-enhanced Neuronal Networks) framework, which aims to facilitate the use of graphics accelerators for computational models of large-scale neuronal networks to address this challenge. GeNN is an open source library that generates code to accelerate the execution of network simulations on NVIDIA GPUs, through a flexible and extensible interface, which does not require in-depth technical knowledge from the users. We present performance benchmarks showing that 200-fold speedup compared to a single core of a CPU can be achieved for a network of one million conductance based Hodgkin-Huxley neurons but that for other models the speedup can differ. GeNN is available for Linux, Mac OS X and Windows platforms. The source code, user manual, tutorials, Wiki, in-depth example projects and all other related information can be found on the project website http://genn-team.github.io/genn/.
(Re)engineering Earth System Models to Expose Greater Concurrency for Ultrascale Computing: Practice, Experience, and Musings

NASA Astrophysics Data System (ADS)

Mills, R. T.

2014-12-01

As the high performance computing (HPC) community pushes towards the exascale horizon, the importance and prevalence of fine-grained parallelism in new computer architectures is increasing. This is perhaps most apparent in the proliferation of so-called "accelerators" such as the Intel Xeon Phi or NVIDIA GPGPUs, but the trend also holds for CPUs, where serial performance has grown slowly and effective use of hardware threads and vector units is becoming increasingly important to realizing high performance. This has significant implications for weather, climate, and Earth system modeling codes, many of which display impressive scalability across MPI ranks but take relatively little advantage of threading and vector processing. In addition to increasing parallelism, next generation codes will also need to address increasingly deep hierarchies for data movement: NUMA/cache levels, on node vs. off node, local vs. wide neighborhoods on the interconnect, and even in the I/O system. We will discuss some approaches (grounded in experiences with the Intel Xeon Phi architecture) for restructuring Earth science codes to maximize concurrency across multiple levels (vectors, threads, MPI ranks), and also discuss some novel approaches for minimizing expensive data movement/communication.
