Science.gov

Sample records for nvidia geforce gtx

  1. Proton Testing of nVidia GTX 1050 GPU

    NASA Technical Reports Server (NTRS)

    Wyrwas, E. J.

    2017-01-01

    Single-Event Effects (SEE) testing was conducted on the nVidia GTX 1050 Graphics Processor Unit (GPU), herein referred to as the device under test (DUT). Testing was conducted at Massachusetts General Hospital's (MGH) Francis H. Burr Proton Therapy Center on April 9th, 2017 using 200-MeV protons. The purpose of this test trip was to provide a baseline assessment of the radiation susceptibility of the DUT, as no previous testing had been conducted on this component.

  2. Supercomputing with toys: harnessing the power of NVIDIA 8800GTX and PlayStation 3 for bioinformatics problems.

    PubMed

    Wilson, Justin; Dai, Manhong; Jakupovic, Elvis; Watson, Stanley; Meng, Fan

    2007-01-01

    Modern video cards and game consoles typically have much better performance-to-price ratios than general-purpose CPUs. The parallel processing capabilities of game hardware are well suited to high-throughput biomedical data analysis. Our initial results suggest that game hardware is a cost-effective platform for some computationally demanding bioinformatics problems.

  3. Test Report for NG Sensors GTX-1000

    SciTech Connect

    Manginell, Ronald P.

    2014-12-01

    This report describes initial testing of the NG Sensors GTX-1000 natural gas monitoring system. This testing showed that the retention time, peak area stability, and heating value repeatability of the GTX-1000 are promising for natural gas measurements in the field or at the well head. The repeatability can be less than 0.25% for the lower and higher heating values (LHV and HHV) for the Airgas standard tested in this report, which is very promising for a first-generation prototype. Ultimately this system should be capable of 0.1% repeatability in heating value, with significant size and power reductions compared with competing systems.

  4. SRM-Assisted Trajectory for the GTX Reference Vehicle

    NASA Technical Reports Server (NTRS)

    Riehl, John; Trefny, Charles; Kosareo, Daniel

    2002-01-01

    A goal of the GTX effort has been to demonstrate the feasibility of a single-stage-to-orbit (SSTO) vehicle that delivers a small payload to low Earth orbit. The small payload class was chosen in order to minimize the risk and cost of development of this revolutionary system. A preliminary design study by the GTX team has resulted in the current configuration that offers considerable promise for meeting the stated goal. However, the size and gross lift-off weight that result from scaling the current design to closure may be considered impractical for the small payload. In lieu of evolving the project's reference vehicle to a large-payload class, this paper offers the alternative of using solid-rocket motors in order to close the vehicle at a practical scale. This approach offers a near-term, quasi-reusable system that easily evolves to reusable SSTO following subsequent development and optimization. This paper presents an overview of the impact of the addition of SRMs on the GTX reference vehicle's performance and trajectory. The overall methods of vehicle modeling and trajectory optimization will also be presented. A key element in the trajectory optimization is the use of the program OTIS 3.10, which provides rapid convergence and a great deal of flexibility to the user. This paper will also present the methods used to implement GTX requirements into OTIS modeling.

  5. SRM-Assisted Trajectory for the GTX Reference Vehicle

    NASA Technical Reports Server (NTRS)

    Riehl, John; Trefny, Charles; Kosareo, Daniel (Technical Monitor)

    2002-01-01

    A goal of the GTX effort has been to demonstrate the feasibility of a single stage-to-orbit (SSTO) vehicle that delivers a small payload to low Earth orbit. The small payload class was chosen in order to minimize the risk and cost of development of this revolutionary system. A preliminary design study by the GTX team has resulted in the current configuration that offers considerable promise for meeting the stated goal. However, the size and gross lift-off weight that result from scaling the current design to closure may be considered impractical for the small payload. In lieu of evolving the project's reference vehicle to a large-payload class, this paper offers the alternative of using solid-rocket motors in order to close the vehicle at a practical scale. This approach offers a near-term, quasi-reusable system that easily evolves to reusable SSTO following subsequent development and optimization. This paper presents an overview of the impact of the addition of SRMs on the GTX reference vehicle's performance and trajectory. The overall methods of vehicle modeling and trajectory optimization will also be presented. A key element in the trajectory optimization is the use of the program OTIS 3.10, which provides rapid convergence and a great deal of flexibility to the user. This paper will also present the methods used to implement GTX requirements into OTIS modeling.

  6. Acceleration of spiking neural network based pattern recognition on NVIDIA graphics processors.

    PubMed

    Han, Bing; Taha, Tarek M

    2010-04-01

    There is currently a strong push in the research community to develop biological-scale implementations of neuron-based vision models. Systems at this scale are computationally demanding and generally utilize more accurate neuron models, such as the Izhikevich and Hodgkin-Huxley models, over the more popular integrate-and-fire model. We examine the feasibility of using graphics processing units (GPUs) to accelerate a spiking neural network based character recognition network to enable such large-scale systems. Two versions of the network utilizing the Izhikevich and Hodgkin-Huxley models are implemented. Three NVIDIA general-purpose (GP) GPU platforms are examined: the GeForce 9800 GX2, the Tesla C1060, and the Tesla S1070. Our results show that the GPGPUs can provide significant speedup over conventional processors. In particular, the fastest GPGPU utilized, the Tesla S1070, provided speedups of 5.6 and 84.4 over highly optimized implementations on the fastest central processing unit (CPU) tested, a quad-core 2.67 GHz Xeon processor, for the Izhikevich and Hodgkin-Huxley models, respectively. The CPU implementation utilized all four cores and the vector data parallelism offered by the processor. The results indicate that GPUs are well suited for this application domain.
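
    As a concrete illustration of the mapping used in this class of work, the sketch below integrates one Izhikevich neuron per CUDA thread for a single time step. It is a minimal sketch, not the authors' implementation; the parameter values and the input-current array I are illustrative placeholders.

        // Minimal sketch: one CUDA thread integrates one Izhikevich neuron for a
        // single 1-ms step (forward Euler, split into two 0.5-ms substeps, a
        // common stabilization for the quadratic term). a, b, c, d and I are
        // illustrative values, not those used in the paper.
        __global__ void izhikevich_step(float* v, float* u, const float* I,
                                        unsigned char* fired, int n)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i >= n) return;
            const float a = 0.02f, b = 0.2f, c = -65.0f, d = 8.0f;
            float vi = v[i], ui = u[i];
            vi += 0.5f * (0.04f * vi * vi + 5.0f * vi + 140.0f - ui + I[i]);
            vi += 0.5f * (0.04f * vi * vi + 5.0f * vi + 140.0f - ui + I[i]);
            ui += a * (b * vi - ui);
            fired[i] = 0;
            if (vi >= 30.0f) { fired[i] = 1; vi = c; ui += d; }  // spike and reset
            v[i] = vi; u[i] = ui;
        }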

  7. Enobosarm (GTx-024, S-22): a potential treatment for cachexia.

    PubMed

    Srinath, Reshmi; Dobs, Adrian

    2014-02-01

    Muscle loss and wasting occur with aging and in multiple disease states, including cancer, heart failure, chronic obstructive pulmonary disease, end-stage liver disease, end-stage renal disease, and HIV. Cachexia is defined as a multifactorial syndrome that is associated with anorexia, weight loss, and increased catabolism, with increased morbidity and mortality. Currently no therapy is approved for the treatment or prevention of cachexia. Different treatment options have been suggested, but many have proven to be ineffective or associated with adverse events. Nonsteroidal selective androgen receptor modulators (SARMs) are a new class of anabolic agents that bind the androgen receptor and exhibit tissue selectivity. Enobosarm (GTx-024, S-22) is a SARM recently developed by GTx, Inc. (TN, USA) that has been tested in Phase I, II, and III trials with promising results in terms of improving lean body mass and measurements of physical function and power. Enobosarm has received fast-track designation by the US FDA, and results from the Phase III trials POWER1 and POWER2 will help determine approval for use in the prevention and treatment of muscle wasting in patients with non-small-cell lung cancer. This article provides an introduction to enobosarm as a new therapeutic strategy for the prevention and treatment of cachexia. A review of the literature was performed using the search terms 'cachexia', 'sarcopenia', 'SARM', 'enobosarm' and 'GTx-024' in September 2013 using multiple databases as well as online resources.

  8. Performance Evaluation of the NASA GTX RBCC Flowpath

    NASA Technical Reports Server (NTRS)

    Thomas, Scott R.; Palac, Donald T.; Trefny, Charles J.; Roche, Joseph M.

    2001-01-01

    The NASA Glenn Research Center serves as NASA's lead center for aeropropulsion. Several programs are underway to explore revolutionary airbreathing propulsion systems in response to the challenge of reducing the cost of space transportation. Concepts being investigated include rocket-based combined cycle (RBCC), pulse detonation wave, and turbine-based combined cycle (TBCC) engines. The GTX concept is a vertically launched, horizontal landing, single-stage-to-orbit (SSTO) vehicle utilizing RBCC engines. The propulsion pod has a nearly half-axisymmetric flowpath that incorporates a rocket and ram-scramjet. The engine system operates from lift-off to above Mach 10, at which point the airbreathing engine flowpath is closed off and the rocket alone powers the vehicle to orbit. The paper presents an overview of the research efforts supporting the development of this RBCC propulsion system. The experimental efforts of this program consist of a series of test rigs. Each rig is focused on development and optimization of the flowpath over a specific operating mode of the engine. These rigs collectively establish propulsion system performance over all modes of operation, thereby covering the entire speed range. Computational Fluid Dynamics (CFD) analysis is an important element of the GTX propulsion system development and validation. These efforts guide experiments and flowpath design, provide insight into experimental data, and extend results to conditions and scales not achievable in ground test facilities. Some examples of important CFD results are presented.

  9. GTX Reference Vehicle Structural Verification Methods and Weight Summary

    NASA Technical Reports Server (NTRS)

    Hunter, J. E.; McCurdy, D. R.; Dunn, P. W.

    2002-01-01

    The design of a single-stage-to-orbit air breathing propulsion system requires the simultaneous development of a reference launch vehicle in order to achieve the optimal mission performance. Accordingly, for the GTX study a 300-lb payload reference vehicle was preliminarily sized to a gross liftoff weight (GLOW) of 238,000 lb. A finite element model of the integrated vehicle/propulsion system was subjected to the trajectory environment and subsequently optimized for structural efficiency. This study involved the development of aerodynamic loads mapped to finite element models of the integrated system in order to assess vehicle margins of safety. Commercially available analysis codes were used in the process along with some internally developed spreadsheets and FORTRAN codes specific to the GTX geometry for mapping of thermal and pressure loads. A mass fraction of 0.20 for the integrated system dry weight has been the driver for a vehicle design consisting of state-of-the-art composite materials in order to meet the rigid weight requirements. This paper summarizes the methodology used for preliminary analyses and presents the current status of the weight optimization for the structural components of the integrated system.

  10. GTX Reference Vehicle Structural Verification Methods and Weight Summary

    NASA Technical Reports Server (NTRS)

    Hunter, J. E.; McCurdy, D. R.; Dunn, P. W.

    2002-01-01

    The design of a single-stage-to-orbit air breathing propulsion system requires the simultaneous development of a reference launch vehicle in order to achieve the optimal mission performance. Accordingly, for the GTX study a 300-lb payload reference vehicle was preliminarily sized to a gross liftoff weight (GLOW) of 238,000 lb. A finite element model of the integrated vehicle/propulsion system was subjected to the trajectory environment and subsequently optimized for structural efficiency. This study involved the development of aerodynamic loads mapped to finite element models of the integrated system in order to assess vehicle margins of safety. Commercially available analysis codes were used in the process along with some internally developed spreadsheets and FORTRAN codes specific to the GTX geometry for mapping of thermal and pressure loads. A mass fraction of 0.20 for the integrated system dry weight has been the driver for a vehicle design consisting of state-of-the-art composite materials in order to meet the rigid weight requirements. This paper summarizes the methodology used for preliminary analyses and presents the current status of the weight optimization for the structural components of the integrated system.

  11. Tensor Algebra Library for NVidia Graphics Processing Units

    SciTech Connect

    Liakh, Dmitry

    This is a general-purpose math library implementing basic tensor algebra operations on NVidia GPU accelerators, asynchronously with respect to the CPU host. It can perform tensor contractions, tensor products, tensor additions, and other basic tensor operations, and it supports simultaneous use of multiple NVidia GPUs. Each asynchronous API function returns a handle which can later be used for querying the completion of the corresponding tensor algebra operation on a specific GPU. The tensors participating in a particular tensor operation are assumed to be stored in the local RAM of a node or in GPU RAM. The main research area where this library can be utilized is quantum many-body theory (e.g., electronic structure theory).
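
    The handle-based asynchronous pattern the record describes can be pictured with plain CUDA streams and events; the library's actual entry points are not reproduced here. In the hypothetical sketch below, an element-wise tensor addition stands in for a real contraction, and a CUDA event plays the role of the completion handle.

        #include <cuda_runtime.h>

        // Illustrative element-wise tensor addition; real contractions map to
        // batched GEMMs, but the asynchrony pattern is the same.
        __global__ void tensor_add(float* t0, const float* t1, long n)
        {
            long i = blockIdx.x * (long)blockDim.x + threadIdx.x;
            if (i < n) t0[i] += t1[i];
        }

        // Launch asynchronously and hand back an event that plays the role of
        // the library's completion handle.
        cudaEvent_t launch_tensor_add(float* d_t0, const float* d_t1, long n,
                                      cudaStream_t stream)
        {
            int block = 256;
            long grid = (n + block - 1) / block;
            tensor_add<<<(unsigned)grid, block, 0, stream>>>(d_t0, d_t1, n);
            cudaEvent_t done;
            cudaEventCreateWithFlags(&done, cudaEventDisableTiming);
            cudaEventRecord(done, stream);
            return done;  // caller polls: cudaEventQuery(done) == cudaSuccess
        }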

  12. Proton Testing of nVidia Jetson TX1

    NASA Technical Reports Server (NTRS)

    Wyrwas, Edward J.

    2017-01-01

    Single-Event Effects (SEE) testing was conducted on the nVidia Jetson TX1 System on Chip (SOC); herein referred to as device under test (DUT). Testing was conducted at Massachusetts General Hospitals (MGH) Francis H. Burr Proton Therapy Center on October 16th, 2016 using 200MeV protons. This testing trip was purposed to provide a baseline assessment of the radiation susceptibility of the DUT as no previous testing had been conducted on this component.

  13. Accelerating Monte Carlo simulations with an NVIDIA® graphics processor

    NASA Astrophysics Data System (ADS)

    Martinsen, Paul; Blaschke, Johannes; Künnemeyer, Rainer; Jordan, Robert

    2009-10-01

    Modern graphics cards, commonly used in desktop computers, have evolved beyond a simple interface between processor and display to incorporate sophisticated calculation engines that can be applied to general purpose computing. The Monte Carlo algorithm for modelling photon transport in turbid media has been implemented on an NVIDIA® 8800 GT graphics card using the CUDA toolkit. The Monte Carlo method relies on following the trajectory of millions of photons through the sample, often taking hours or days to complete. The graphics-processor implementation, processing roughly 110 million scattering events per second, was found to run more than 70 times faster than a similar, single-threaded implementation on a 2.67 GHz desktop computer.

    Program summary: Program title: Phoogle-C/Phoogle-G. Catalogue identifier: AEEB_v1_0. Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEEB_v1_0.html. Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland. Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html. No. of lines in distributed program, including test data, etc.: 51 264. No. of bytes in distributed program, including test data, etc.: 2 238 805. Distribution format: tar.gz. Programming language: C++. Computer: designed for Intel PCs; Phoogle-G requires an NVIDIA graphics card with support for CUDA 1.1. Operating system: Windows XP. Has the code been vectorised or parallelized?: Phoogle-G is written for SIMD architectures. RAM: 1 GB. Classification: 21.1. External routines: Charles Karney random number library; Microsoft Foundation Class library; NVIDIA CUDA library [1]. Nature of problem: the Monte Carlo technique is an effective algorithm for exploring the propagation of light in turbid media, but accurate results require tracing the path of many photons within the media; the independence of photons naturally lends the technique to implementation on parallel architectures.
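
    A minimal sketch of the photon-per-thread pattern such codes use is shown below. It assumes a homogeneous medium with albedo < 1 and isotropic scattering (a real code would sample, e.g., a Henyey-Greenstein phase function and apply Russian roulette), so it illustrates the parallelization only, not Phoogle-G itself.

        #include <curand_kernel.h>

        // One photon per thread in a homogeneous turbid medium. mu_t is the
        // total interaction coefficient; albedo = mu_s/mu_t (assumed < 1 so the
        // loop terminates). The tally 1 - w equals the total weight deposited
        // along the path, since each interaction deposits (1 - albedo) of the
        // remaining weight.
        __global__ void photon_walk(float* absorbed, int nPhotons,
                                    float mu_t, float albedo, unsigned long seed)
        {
            int id = blockIdx.x * blockDim.x + threadIdx.x;
            if (id >= nPhotons) return;
            curandState rng;
            curand_init(seed, id, 0, &rng);
            float x = 0.f, y = 0.f, z = 0.f, w = 1.0f;   // position, weight
            float ux = 0.f, uy = 0.f, uz = 1.f;          // direction
            while (w > 1e-4f) {
                float s = -logf(curand_uniform(&rng)) / mu_t;  // free path
                x += s * ux; y += s * uy; z += s * uz;
                w *= albedo;                              // absorb a share
                float ct = 2.f * curand_uniform(&rng) - 1.f;   // isotropic
                float st = sqrtf(1.f - ct * ct);
                float phi = 6.2831853f * curand_uniform(&rng);
                ux = st * cosf(phi); uy = st * sinf(phi); uz = ct;
            }
            atomicAdd(absorbed, 1.0f - w);  // crude global tally
        }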

  14. Investigating the Importance of Stereo Displays for Helicopter Landing Simulation

    DTIC Science & Technology

    2016-08-11

    visualization. The two instances of X Plane® were implemented using two separate PCs, each incorporating Intel i7 processors and Nvidia Quadro K4200... Nvidia GeForce GTX 680 graphics card was used to administer the stereo acuity and fusion range tests. The tests were displayed on an Asus VG278HE 3D...monitor with 1920x1080 pixels that was compatible with Nvidia 3D Vision2 and that used active shutter glasses. At a 1-m viewing distance, the

  15. Performance Validation Approach for the GTX Air-Breathing Launch Vehicle

    NASA Technical Reports Server (NTRS)

    Trefny, Charles J.; Roche, Joseph M.

    2002-01-01

    The primary objective of the GTX effort is to determine whether or not air-breathing propulsion can enable a launch vehicle to achieve orbit in a single stage. Structural weight, vehicle aerodynamics, and propulsion performance must be accurately known over the entire flight trajectory in order to make a credible assessment. Structural, aerodynamic, and propulsion parameters are strongly interdependent, which necessitates a systems approach to design, evaluation, and optimization of a single-stage-to-orbit concept. The GTX reference vehicle serves this purpose by allowing design, development, and validation of components and subsystems in a system context. The reference vehicle configuration (including propulsion) was carefully chosen so as to provide high potential for structural and volumetric efficiency and to allow the high specific impulse of air-breathing propulsion cycles to be exploited. Minor evolution of the configuration has occurred as analytical and experimental results have become available. With this development process comes increasing validation of the weight and performance levels used in system performance determination. This paper presents an overview of the GTX reference vehicle and the approach to its performance validation. Subscale test rigs and numerical studies used to develop and validate component performance levels and unit structural weights are outlined. The sensitivity of the equivalent effective specific impulse to key propulsion component efficiencies is presented. The role of flight demonstration in development and validation is discussed.

  16. Optimizing Approximate Weighted Matching on Nvidia Kepler K40

    SciTech Connect

    Naim, Md; Manne, Fredrik; Halappanavar, Mahantesh

    Matching is a fundamental graph problem with numerous applications in science and engineering. While algorithms for computing optimal matchings are difficult to parallelize, approximation algorithms generally compute high quality solutions and are amenable to parallelization. In this paper, we present efficient implementations of the current best algorithm for half-approximate weighted matching, the Suitor algorithm, on the Nvidia Kepler K40 platform. We develop four variants of the algorithm that exploit hardware features to address key challenges for a GPU implementation. We also experiment with different combinations of work assigned to a warp. Using an exhaustive set of 269 inputs, we demonstrate that the new implementation outperforms the previous best GPU algorithm by 10 to 100 times for over 100 instances, and from 100 to 1000 times for 15 instances. We also demonstrate up to 20 times speedup relative to 2 threads, and up to 5 times relative to 16 threads, for the same algorithm on an Intel Xeon platform with 16 cores. The new algorithms and implementations provided in this paper will have a direct impact on several applications that repeatedly use matching as a key compute kernel. Further, the algorithm designs and insights provided in this paper will benefit other researchers implementing graph algorithms on modern GPU architectures.
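
    For reference, the serial core of the Suitor algorithm is compact; a host-side sketch over a CSR graph is given below. The GPU variants discussed in the paper map these proposal loops onto threads or warps, but the underlying recurrence is the same. The field names are illustrative, not the paper's code.

        #include <vector>

        // Half-approximate Suitor matching: each vertex proposes to its
        // heaviest neighbour whose standing offer it can beat; a displaced
        // suitor continues proposing. Vertices u, v with suitorOf[u] == v and
        // suitorOf[v] == u form the matching.
        void suitor(int n, const std::vector<int>& xadj,
                    const std::vector<int>& adj, const std::vector<double>& w,
                    std::vector<int>& suitorOf, std::vector<double>& offer)
        {
            suitorOf.assign(n, -1);   // current suitor of each vertex
            offer.assign(n, 0.0);     // weight of that suitor's edge
            for (int v = 0; v < n; ++v) {
                int u = v;
                while (u != -1) {
                    int best = -1; double bestW = 0.0;
                    for (int e = xadj[u]; e < xadj[u + 1]; ++e) {
                        int nb = adj[e];
                        // propose only where we beat the standing offer
                        if (w[e] > offer[nb] && w[e] > bestW) { best = nb; bestW = w[e]; }
                    }
                    if (best == -1) break;           // nothing admissible left
                    int displaced = suitorOf[best];  // must re-propose
                    suitorOf[best] = u; offer[best] = bestW;
                    u = displaced;
                }
            }
        }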

  17. Immunogenic and protective efficacy of recombinant protein GtxA-N against Gallibacterium anatis challenge in chickens.

    PubMed

    Pedersen, Ida J; Pors, Susanne E; Bager Skjerning, Ragnhild J; Nielsen, Søren S; Bojesen, Anders M

    2015-10-01

    Gallibacterium anatis is a major cause of reproductive tract infections in chickens. Here, we aimed to evaluate the efficacy of the recombinant protein GtxA-N at protecting hens by addressing three objectives: (i) evaluating the antibody response following immunization; (ii) scoring and comparing lesions, following challenge with G. anatis, in immunized and non-immunized hens; and (iii) investigating whether the anti-GtxA-N antibody titre in individual hens correlated with the observed lesions. Two consecutive experiments were performed in hens. In the first experiment hens were immunized with GtxA-N on day 0 and day 14, infected with G. anatis on day 28, and euthanized on day 56. The GtxA-N antibody response was assessed in pooled serum samples throughout the experiment, using an indirect enzyme-linked immunosorbent assay (ELISA). In the second experiment the GtxA-N antibody titres were assessed in individual hens before and after immunization. Subsequently, the hens were inoculated with G. anatis, and finally all hens were euthanized and submitted for post mortem examination 48 h after inoculation. Immunization elicited strong antibody responses that lasted at least 8 weeks (P < .0001). The individual antibody titres observed in response to immunization varied considerably among hens (range: 174,100-281,500). Lesion scores following G. anatis infection were significantly lower in immunized hens compared to non-immunized hens (P = .004). Within the immunized group, no correlation was found between the individual antibody titres and the lesion scores. This study clearly demonstrated that GtxA-N is a vaccine antigen capable of inducing protective immunity against G. anatis.

  18. Extension of the validation of AOAC Official Method 2005.06 for dc-GTX2,3: interlaboratory study.

    PubMed

    Ben-Gigirey, Begoña; Rodríguez-Velasco, María L; Gago-Martínez, Ana

    2012-01-01

    AOAC Official Method 2005.06 for the determination of saxitoxin (STX)-group toxins in shellfish by LC with fluorescence detection with precolumn oxidation was previously validated and adopted First Action following a collaborative study. However, the method was not validated for all key STX-group toxins, and procedures to quantify some of them were not provided. With more STX-group toxin standards commercially available and modifications to procedures, it was possible to overcome some of these difficulties. The European Union Reference Laboratory for Marine Biotoxins conducted an interlaboratory exercise to extend the AOAC Official Method 2005.06 validation to dc-GTX2,3 and to compile precision data for several STX-group toxins. This paper reports the study design and the results obtained. The performance characteristics for dc-GTX2,3 (intralaboratory and interlaboratory precision, recovery, and theoretical quantification limit) were evaluated. The mean recoveries obtained for dc-GTX2,3 were, in general, low (53.1-58.6%). The RSD for reproducibility (RSD(r)%) for dc-GTX2,3 in all samples ranged from 28.2 to 45.7%, and HorRat values ranged from 1.5 to 2.8. The article also describes a hydrolysis protocol to convert GTX6 to NEO, which has been proven useful for the quantification of GTX6 while the GTX6 standard is not available. The performance of the participant laboratories in the application of this method was compared with that obtained in the original collaborative study of the method. Intralaboratory and interlaboratory precision data for several STX-group toxins, including dc-NEO and GTX6, are reported here. This study can be useful for those laboratories determining STX-group toxins to fully implement AOAC Official Method 2005.06 for official paralytic shellfish poisoning control. However, the overall quantitative performance obtained with the method was poor for certain toxins.

  19. A Thermal Management Systems Model for the NASA GTX RBCC Concept

    NASA Technical Reports Server (NTRS)

    Traci, Richard M.; Farr, John L., Jr.; Laganelli, Tony; Walker, James (Technical Monitor)

    2002-01-01

    The Vehicle Integrated Thermal Management Analysis Code (VITMAC) was further developed to aid the analysis, design, and optimization of propellant and thermal management concepts for advanced propulsion systems. The computational tool is based on engineering level principles and models. A graphical user interface (GUI) provides a simple and straightforward method to assess and evaluate multiple concepts before undertaking more rigorous analysis of candidate systems. The tool incorporates the Chemical Equilibrium and Applications (CEA) program and the RJPA code to permit heat transfer analysis of both rocket and air breathing propulsion systems. Key parts of the code have been validated with experimental data. The tool was specifically tailored to analyze rocket-based combined-cycle (RBCC) propulsion systems being considered for space transportation applications. This report describes the computational tool and its development and verification for NASA GTX RBCC propulsion system applications.

  20. Affordable Flight Demonstration of the GTX Air-Breathing SSTO Vehicle Concept

    NASA Technical Reports Server (NTRS)

    Krivanek, Thomas M.; Roche, Joseph M.; Riehl, John P.; Kosareo, Daniel N.

    2003-01-01

    The rocket based combined cycle (RBCC) powered single-stage-to-orbit (SSTO) reusable launch vehicle has the potential to significantly reduce the total cost per pound for orbital payload missions. To validate overall system performance, a flight demonstration must be performed. This paper presents an overview of the first phase of a flight demonstration program for the GTX SSTO vehicle concept. Phase 1 will validate the propulsion performance of the vehicle configuration over the supersonic and hypersonic air-breathing portions of the trajectory. The focus and goal of Phase 1 is to demonstrate the integration and performance of the propulsion system flowpath with the vehicle aerodynamics over the air-breathing trajectory. This demonstrator vehicle will have dual mode ramjet/scramjets, which include the inlet, combustor, and nozzle with geometrically scaled aerodynamic surface outer mold lines (OML) defining the forebody, boundary layer diverter, wings, and tail. The primary objective of this study is to demonstrate propulsion system performance and operability including the ram to scram transition, as well as to validate vehicle aerodynamics and propulsion airframe integration. To minimize overall risk and development cost the effort will incorporate proven materials, use existing turbomachinery in the propellant delivery systems, launch from an existing unmanned remote launch facility, and use basic vehicle recovery techniques to minimize control and landing requirements. A second phase would demonstrate propulsion performance across all critical portions of a space launch trajectory (lift off through transition to all-rocket) integrated with flight-like vehicle systems.

  1. Affordable Flight Demonstration of the GTX Air-Breathing SSTO Vehicle Concept

    NASA Technical Reports Server (NTRS)

    Krivanek, Thomas M.; Roche, Joseph M.; Riehl, John P.; Kosareo, Daniel N.

    2002-01-01

    The rocket based combined cycle (RBCC) powered single-stage-to-orbit (SSTO) reusable launch vehicle has the potential to significantly reduce the total cost per pound for orbital payload missions. To validate overall system performance, a flight demonstration must be performed. This paper presents an overview of the first phase of a flight demonstration program for the GTX SSTO vehicle concept. Phase 1 will validate the propulsion performance of the vehicle configuration over the supersonic and hypersonic airbreathing portions of the trajectory. The focus and goal of Phase 1 is to demonstrate the integration and performance of the propulsion system flowpath with the vehicle aerodynamics over the air-breathing trajectory. This demonstrator vehicle will have dual mode ramjet/scramjets, which include the inlet, combustor, and nozzle with geometrically scaled aerodynamic surface outer mold lines (OML) defining the forebody, boundary layer diverter, wings, and tail. The primary objective of this study is to demonstrate propulsion system performance and operability including the ram to scram transition, as well as to validate vehicle aerodynamics and propulsion airframe integration. To minimize overall risk and development cost the effort will incorporate proven materials, use existing turbomachinery in the propellant delivery systems, launch from an existing unmanned remote launch facility, and use basic vehicle recovery techniques to minimize control and landing requirements. A second phase would demonstrate propulsion performance across all critical portions of a space launch trajectory (lift off through transition to all-rocket) integrated with flight-like vehicle systems.

  2. Design Evolution and Performance Characterization of the GTX Air-Breathing Launch Vehicle Inlet

    NASA Technical Reports Server (NTRS)

    DeBonis, J. R.; Steffen, C. J., Jr.; Rice, T.; Trefny, C. J.

    2002-01-01

    The design and analysis of a second version of the inlet for the GTX rocket-based combined-cycle launch vehicle is discussed. The previous design did not achieve its predicted performance levels due to excessive turning of low-momentum corner flows and local over-contraction due to asymmetric end-walls. This design attempts to remove these problems by reducing the spike half-angle from 12 to 10 degrees and by implementing true plane-of-symmetry end-walls. Axisymmetric Reynolds-Averaged Navier-Stokes simulations using both perfect-gas and real-gas (finite-rate chemistry) assumptions were performed to aid in the design process and to create a comprehensive database of inlet performance. The inlet design, which operates over the entire air-breathing Mach number range from 0 to 12, and the performance database are presented. The performance database, for use in cycle analysis, includes predictions of mass capture, pressure recovery, throat Mach number, drag force, and heat load over the entire Mach range. Results of the computations are compared with experimental data to validate the performance database.

  3. An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU

    DOE PAGES

    Lyakh, Dmitry I.

    2015-01-05

    An efficient parallel tensor transpose algorithm is suggested for shared-memory computing units, namely, multicore CPU, Intel Xeon Phi, and NVidia GPU. The algorithm operates on dense tensors (multidimensional arrays) and is based on the optimization of cache utilization on x86 CPU and the use of shared memory on NVidia GPU. From the applied side, the ultimate goal is to minimize the overhead encountered in the transformation of tensor contractions into matrix multiplications in computer implementations of advanced methods of quantum many-body theory (e.g., in electronic structure theory and nuclear physics). A particular accent is made on higher-dimensional tensors that typically appear in the so-called multireference correlated methods of electronic structure theory. Depending on tensor dimensionality, the presented optimized algorithms can achieve an order of magnitude speedup on x86 CPUs and 2-3 times speedup on NVidia Tesla K20X GPU with respect to the naïve scattering algorithm (no memory access optimization). Furthermore, the tensor transpose routines developed in this work have been incorporated into a general-purpose tensor algebra library (TAL-SH).

  4. An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU

    NASA Astrophysics Data System (ADS)

    Lyakh, Dmitry I.

    2015-04-01

    An efficient parallel tensor transpose algorithm is suggested for shared-memory computing units, namely, multicore CPU, Intel Xeon Phi, and NVidia GPU. The algorithm operates on dense tensors (multidimensional arrays) and is based on the optimization of cache utilization on x86 CPU and the use of shared memory on NVidia GPU. From the applied side, the ultimate goal is to minimize the overhead encountered in the transformation of tensor contractions into matrix multiplications in computer implementations of advanced methods of quantum many-body theory (e.g., in electronic structure theory and nuclear physics). A particular accent is made on higher-dimensional tensors that typically appear in the so-called multireference correlated methods of electronic structure theory. Depending on tensor dimensionality, the presented optimized algorithms can achieve an order of magnitude speedup on x86 CPUs and 2-3 times speedup on NVidia Tesla K20X GPU with respect to the naïve scattering algorithm (no memory access optimization). The tensor transpose routines developed in this work have been incorporated into a general-purpose tensor algebra library (TAL-SH).
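
    The 2D building block of such transpose kernels is the classic shared-memory tiled transpose sketched below; the actual TAL-SH routines generalize this to arbitrary index permutations and dimensionalities. The tile size and padding are conventional choices, not values from the paper.

        #define TILE 32

        // Shared-memory tiled 2D transpose; launch with dim3(TILE, TILE) thread
        // blocks. The +1 padding avoids shared-memory bank conflicts, and both
        // the global reads and the global writes are coalesced.
        __global__ void transpose2d(float* out, const float* in, int rows, int cols)
        {
            __shared__ float tile[TILE][TILE + 1];
            int x = blockIdx.x * TILE + threadIdx.x;   // column in 'in'
            int y = blockIdx.y * TILE + threadIdx.y;   // row in 'in'
            if (x < cols && y < rows)
                tile[threadIdx.y][threadIdx.x] = in[(long)y * cols + x];
            __syncthreads();
            x = blockIdx.y * TILE + threadIdx.x;       // column in 'out'
            y = blockIdx.x * TILE + threadIdx.y;       // row in 'out'
            if (x < rows && y < cols)
                out[(long)y * rows + x] = tile[threadIdx.x][threadIdx.y];
        }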

  5. Optimization of Selected Remote Sensing Algorithms for Embedded NVIDIA Kepler GPU Architecture

    NASA Technical Reports Server (NTRS)

    Riha, Lubomir; Le Moigne, Jacqueline; El-Ghazawi, Tarek

    2015-01-01

    This paper evaluates the potential of the embedded graphics processing unit in Nvidia's Tegra K1 for onboard processing. The performance is compared to a general-purpose multi-core CPU and a full-fledged GPU accelerator. This study uses two algorithms: Wavelet Spectral Dimension Reduction of Hyperspectral Imagery and the Automated Cloud-Cover Assessment (ACCA) algorithm. The Tegra K1 achieved 51% of the performance of the high-end 8-core server Intel Xeon CPU for the ACCA algorithm and 20% for the dimension-reduction algorithm, while the CPU had 13.5 times higher power consumption.

  6. Scaling Deep Learning workloads: NVIDIA DGX-1/Pascal and Intel Knights Landing

    DOE PAGES

    Gawande, Nitin A.; Daily, Jeff A.; Siegel, Charles; ...

    2018-05-05

    Deep Learning (DL) algorithms have become ubiquitous in data analytics. As a result, major computing vendors—including NVIDIA, Intel, AMD, and IBM—have architectural road maps influenced by DL workloads. Furthermore, several vendors have recently advertised new computing products as accelerating large DL workloads. Unfortunately, it is difficult for data scientists to quantify the potential of these different products. Here, this article provides a performance and power analysis of important DL workloads on two major parallel architectures: NVIDIA DGX-1 (eight Pascal P100 GPUs interconnected with NVLink) and Intel Knights Landing (KNL) CPUs interconnected with Intel Omni-Path or Cray Aries. Our evaluation consists of a cross section of convolutional neural net workloads: CifarNet, AlexNet, GoogLeNet, and ResNet50 topologies using the Cifar10 and ImageNet datasets. The workloads are vendor-optimized for each architecture. We use sequentially equivalent implementations to maintain iso-accuracy between parallel and sequential DL models. Our analysis indicates that although GPUs provide the highest overall performance, the gap can close for some convolutional networks; and the KNL can be competitive in performance/watt. We find that NVLink facilitates scaling efficiency on GPUs. However, its importance is heavily dependent on neural network architecture. Furthermore, for weak-scaling—sometimes encouraged by restricted GPU memory—NVLink is less important.

  7. Scaling Deep Learning Workloads: NVIDIA DGX-1/Pascal and Intel Knights Landing

    SciTech Connect

    Gawande, Nitin A.; Landwehr, Joshua B.; Daily, Jeffrey A.

    Deep Learning (DL) algorithms have become ubiquitous in data analytics. As a result, major computing vendors—including NVIDIA, Intel, AMD, and IBM—have architectural road maps influenced by DL workloads. Furthermore, several vendors have recently advertised new computing products as accelerating DL workloads. Unfortunately, it is difficult for data scientists to quantify the potential of these different products. This paper provides a performance and power analysis of important DL workloads on two major parallel architectures: NVIDIA DGX-1 (eight Pascal P100 GPUs interconnected with NVLink) and Intel Knights Landing (KNL) CPUs interconnected with Intel Omni-Path. Our evaluation consists of a cross section of convolutional neural net workloads: CifarNet, CaffeNet, AlexNet, and GoogLeNet topologies using the Cifar10 and ImageNet datasets. The workloads are vendor-optimized for each architecture. GPUs provide the highest overall raw performance. Our analysis indicates that although GPUs provide the highest overall performance, the gap can close for some convolutional networks; and KNL can be competitive when considering performance/watt. Furthermore, NVLink is critical to GPU scaling.

  8. Scaling Deep Learning workloads: NVIDIA DGX-1/Pascal and Intel Knights Landing

    SciTech Connect

    Gawande, Nitin A.; Daily, Jeff A.; Siegel, Charles

    Deep Learning (DL) algorithms have become ubiquitous in data analytics. As a result, major computing vendors—including NVIDIA, Intel, AMD, and IBM—have architectural road maps influenced by DL workloads. Furthermore, several vendors have recently advertised new computing products as accelerating large DL workloads. Unfortunately, it is difficult for data scientists to quantify the potential of these different products. Here, this article provides a performance and power analysis of important DL workloads on two major parallel architectures: NVIDIA DGX-1 (eight Pascal P100 GPUs interconnected with NVLink) and Intel Knights Landing (KNL) CPUs interconnected with Intel Omni-Path or Cray Aries. Our evaluation consists of a cross section of convolutional neural net workloads: CifarNet, AlexNet, GoogLeNet, and ResNet50 topologies using the Cifar10 and ImageNet datasets. The workloads are vendor-optimized for each architecture. We use sequentially equivalent implementations to maintain iso-accuracy between parallel and sequential DL models. Our analysis indicates that although GPUs provide the highest overall performance, the gap can close for some convolutional networks; and the KNL can be competitive in performance/watt. We find that NVLink facilitates scaling efficiency on GPUs. However, its importance is heavily dependent on neural network architecture. Furthermore, for weak-scaling—sometimes encouraged by restricted GPU memory—NVLink is less important.

  9. Scaling deep learning workloads: NVIDIA DGX-1/Pascal and Intel Knights Landing

    SciTech Connect

    Gawande, Nitin A.; Landwehr, Joshua B.; Daily, Jeffrey A.

    Deep Learning (DL) algorithms have become ubiquitous in data analytics. As a result, major computing vendors—including NVIDIA, Intel, AMD, and IBM—have architectural road maps influenced by DL workloads. Furthermore, several vendors have recently advertised new computing products as accelerating large DL workloads. Unfortunately, it is difficult for data scientists to quantify the potential of these different products. This paper provides a performance and power analysis of important DL workloads on two major parallel architectures: NVIDIA DGX-1 (eight Pascal P100 GPUs interconnected with NVLink) and Intel Knights Landing (KNL) CPUs interconnected with Intel Omni-Path or Cray Aries. Our evaluation consists of a cross section of convolutional neural net workloads: CifarNet, AlexNet, GoogLeNet, and ResNet50 topologies using the Cifar10 and ImageNet datasets. The workloads are vendor-optimized for each architecture. Our analysis indicates that although GPUs provide the highest overall performance, the gap can close for some convolutional networks; and the KNL can be competitive in performance/watt. We find that NVLink facilitates scaling efficiency on GPUs. However, its importance is heavily dependent on neural network architecture. Furthermore, for weak-scaling—sometimes encouraged by restricted GPU memory—NVLink is less important.

  10. CUDA-based real time surgery simulation.

    PubMed

    Liu, Youquan; De, Suvranu

    2008-01-01

    In this paper we present a general software platform that enables real-time surgery simulation on the newly available compute unified device architecture (CUDA) from NVIDIA. CUDA-enabled GPUs harness the power of 128 processors, which allows data-parallel computations. Compared to previous GPGPU approaches, it is significantly more flexible, with a C language interface. We report implementation of both collision detection and consequent deformation computation algorithms. Our test results indicate that CUDA enables a twenty-fold speedup for collision detection and about a fifteen-fold speedup for deformation computation on an Intel Core 2 Quad 2.66 GHz machine with a GeForce 8800 GTX.

  11. GPU Acceleration of DSP for Communication Receivers.

    PubMed

    Gunther, Jake; Gunther, Hyrum; Moon, Todd

    2017-09-01

    Graphics processing unit (GPU) implementations of signal processing algorithms can outperform CPU-based implementations. This paper describes the GPU implementation of several algorithms encountered in a wide range of high-data-rate communication receivers, including filters, multirate filters, numerically controlled oscillators, and multi-stage digital down converters. These structures are tested by processing the 20 MHz wide FM radio band (88-108 MHz). Two receiver structures are explored: a single-channel receiver and a filter bank channelizer. Both run in real time on an NVIDIA GeForce GTX 1080 graphics card.
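
    A minimal sketch of one such structure, a single-stage digital down-converter that fuses the NCO mix with a decimating FIR filter, is given below. The tap array, decimation factor, and tuning frequency are illustrative placeholders, not the paper's design.

        // Each thread produces one complex baseband output sample: the input is
        // mixed against a numerically controlled oscillator (frequency f0 in
        // cycles/sample) and low-pass filtered by taps h[] while decimating by D.
        __global__ void ddc(const float* x, int nIn, const float* h, int nTaps,
                            float2* y, int nOut, float f0, int D)
        {
            int m = blockIdx.x * blockDim.x + threadIdx.x;
            if (m >= nOut) return;
            float2 acc = make_float2(0.f, 0.f);
            for (int k = 0; k < nTaps; ++k) {
                int n = m * D - k;                  // input sample index
                if (n < 0 || n >= nIn) continue;
                float ph = -6.2831853f * f0 * n;    // NCO phase
                float s, c;
                sincosf(ph, &s, &c);
                acc.x += h[k] * x[n] * c;           // in-phase
                acc.y += h[k] * x[n] * s;           // quadrature
            }
            y[m] = acc;
        }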

  12. Design and Implementation of the PALM-3000 Real-Time Control System

    NASA Technical Reports Server (NTRS)

    Truong, Tuan N.; Bouchez, Antonin H.; Burruss, Rick S.; Dekany, Richard G.; Guiwits, Stephen R.; Roberts, Jennifer E.; Shelton, Jean C.; Troy, Mitchell

    2012-01-01

    This paper reflects, from a computational perspective, on the experience gathered in designing and implementing real-time control of the PALM-3000 adaptive optics system currently in operation at the Palomar Observatory. We review the algorithms that serve as functional requirements driving the architecture developed, and describe key design issues and solutions that contributed to the system's low compute latency. Additionally, we describe an implementation of dense matrix-vector multiplication for wavefront reconstruction that exceeds 95% of the maximum sustained achievable bandwidth on the NVIDIA GeForce 8800 GTX GPU.
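
    The reconstruction step itself is a dense matrix-vector product, sketched naively below with one thread per output row; a tuned reconstructor (or a cuBLAS sgemv call) would tile and stage operands to approach the bandwidth figure quoted above.

        // r = M * s: M is the row-major reconstruction matrix, s the measured
        // slope vector, r the actuator command vector. One thread per row keeps
        // the sketch short; production code optimizes memory access far more
        // aggressively.
        __global__ void mvm(const float* M, const float* s, float* r,
                            int rows, int cols)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i >= rows) return;
            float acc = 0.0f;
            for (int j = 0; j < cols; ++j)
                acc += M[(long)i * cols + j] * s[j];
            r[i] = acc;
        }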

  13. 3D gaze tracking system for NVidia 3D Vision®.

    PubMed

    Wibirama, Sunu; Hamamoto, Kazuhiko

    2013-01-01

    Inappropriate parallax settings in stereoscopic content generally cause visual fatigue and visual discomfort. To optimize three-dimensional (3D) effects in stereoscopic content while taking health issues into account, understanding how a user gazes in 3D in virtual space is currently an important research topic. In this paper, we report the development of a novel 3D gaze tracking system for Nvidia 3D Vision® for use with desktop stereoscopic displays. We suggest an optimized geometric method to accurately measure the position of a virtual 3D object. Our experimental results show that the proposed system achieved better accuracy than the conventional geometric method, with average errors of 0.83 cm, 0.87 cm, and 1.06 cm in the X, Y, and Z dimensions, respectively.
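
    The conventional geometric method referred to above can be pictured by intersecting the two eye rays: since the rays are generally skew, the 3D gaze point is taken as the midpoint of their shortest connecting segment. A small host-side sketch, with illustrative types, follows.

        #include <array>

        using Vec3 = std::array<double, 3>;

        static double dot(const Vec3& a, const Vec3& b)
        { return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]; }

        // Each eye defines a ray (origin p, direction d). Solving the normal
        // equations for the closest points on the two rays gives parameters
        // t and s; the gaze point is the midpoint between them. den ~ 0 when
        // the rays are (near-)parallel and must be handled by the caller.
        Vec3 gaze_point(Vec3 pL, Vec3 dL, Vec3 pR, Vec3 dR)
        {
            Vec3 r{pR[0]-pL[0], pR[1]-pL[1], pR[2]-pL[2]};
            double a = dot(dL, dL), b = dot(dL, dR), c = dot(dR, dR);
            double d = dot(dL, r),  e = dot(dR, r);
            double den = a * c - b * b;
            double t = (d * c - b * e) / den;   // along the left ray
            double s = (b * d - a * e) / den;   // along the right ray
            return { (pL[0]+t*dL[0] + pR[0]+s*dR[0]) * 0.5,
                     (pL[1]+t*dL[1] + pR[1]+s*dR[1]) * 0.5,
                     (pL[2]+t*dL[2] + pR[2]+s*dR[2]) * 0.5 };
        }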

  14. Memory transfer optimization for a lattice Boltzmann solver on Kepler architecture nVidia GPUs

    NASA Astrophysics Data System (ADS)

    Mawson, Mark J.; Revell, Alistair J.

    2014-10-01

    The Lattice Boltzmann method (LBM) for solving fluid flow is naturally well suited to an efficient implementation for massively parallel computing, due to the prevalence of local operations in the algorithm. This paper presents and analyses the performance of a 3D lattice Boltzmann solver, optimized for third-generation nVidia GPU hardware, also known as 'Kepler'. We provide a review of previous optimization strategies and analyse data read/write times for different memory types. In LBM, the time propagation step (known as streaming) involves shifting data to adjacent locations and is central to parallel performance; here we examine three approaches which make use of different hardware options. Two of these exploit 'performance enhancing' features of the GPU: shared memory and the new shuffle instruction found in Kepler-based GPUs. These are compared to a standard transfer of data which relies instead on optimized storage to increase coalesced access. It is shown that the simpler approach is the most efficient; since the need for large numbers of registers per thread in LBM limits the block size, the efficiency of these special features is reduced. Detailed results are obtained for a D3Q19 LBM solver, which is benchmarked on nVidia K5000M and K20C GPUs. In the latter case the use of a read-only data cache is explored, and peak performance of over 1036 Million Lattice Updates Per Second (MLUPS) is achieved. The appearance of a periodic bottleneck in the solver performance is also reported, believed to be hardware related; spikes in iteration time occur with a frequency of around 11 Hz for both GPUs, independent of the size of the problem.
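
    The 'standard transfer' strategy the paper finds fastest amounts to a gather-style ('pull') streaming step over a structure-of-arrays layout, sketched below for D2Q9 with periodic boundaries. The paper's solver is D3Q19, which streams identically; collision is omitted to keep the sketch short.

        #define Q 9   // D2Q9 for brevity; D3Q19 streams the same way

        __constant__ int cx[Q] = {0, 1, 0, -1, 0, 1, -1, -1, 1};
        __constant__ int cy[Q] = {0, 0, 1, 0, -1, 1, 1, -1, -1};

        // Each thread gathers the post-collision distributions that point at
        // its own node. With f stored as f[q * nCells + cell], the write side
        // is fully coalesced.
        __global__ void stream_pull(const float* fIn, float* fOut, int nx, int ny)
        {
            int x = blockIdx.x * blockDim.x + threadIdx.x;
            int y = blockIdx.y * blockDim.y + threadIdx.y;
            if (x >= nx || y >= ny) return;
            int cell = y * nx + x, nCells = nx * ny;
            for (int q = 0; q < Q; ++q) {
                int xs = (x - cx[q] + nx) % nx;   // upstream neighbour
                int ys = (y - cy[q] + ny) % ny;   // (periodic wrap)
                fOut[q * nCells + cell] = fIn[q * nCells + ys * nx + xs];
            }
        }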

  15. Solving lattice QCD systems of equations using mixed precision solvers on GPUs

    NASA Astrophysics Data System (ADS)

    Clark, M. A.; Babich, R.; Barros, K.; Brower, R. C.; Rebbi, C.

    2010-09-01

    Modern graphics hardware is designed for highly parallel numerical tasks and promises significant cost and performance benefits for many scientific applications. One such application is lattice quantum chromodynamics (lattice QCD), where the main computational challenge is to efficiently solve the discretized Dirac equation in the presence of an SU(3) gauge field. Using NVIDIA's CUDA platform we have implemented a Wilson-Dirac sparse matrix-vector product that performs at up to 40, 135 and 212 Gflops for double, single and half precision respectively on NVIDIA's GeForce GTX 280 GPU. We have developed a new mixed precision approach for Krylov solvers using reliable updates which allows for full double precision accuracy while using only single or half precision arithmetic for the bulk of the computation. The resulting BiCGstab and CG solvers run in excess of 100 Gflops and, in terms of iterations until convergence, perform better than the usual defect-correction approach for mixed precision.
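
    The reliable-update idea can be reduced to a small host-side skeleton: iterate cheaply in low precision and periodically recompute the true residual in double precision. The sketch below uses a float Jacobi sweep (valid for diagonally dominant systems) as the inner solver purely for illustration; the paper's inner solver is a single/half-precision Krylov cycle on the Wilson-Dirac operator.

        #include <vector>

        // Mixed-precision refinement: the bulk of the arithmetic runs in float,
        // while the residual r = b - A x is recomputed in double so the iterate
        // x converges to full double-precision accuracy.
        void refine(const std::vector<double>& A, const std::vector<double>& b,
                    std::vector<double>& x, int n, int outer, int inner)
        {
            std::vector<float> Af(A.begin(), A.end());
            for (int it = 0; it < outer; ++it) {
                std::vector<double> r(n);               // 1. true residual
                for (int i = 0; i < n; ++i) {
                    double s = 0;
                    for (int j = 0; j < n; ++j) s += A[i * (size_t)n + j] * x[j];
                    r[i] = b[i] - s;
                }
                // 2. approximately solve A e = r in single precision (Jacobi)
                std::vector<float> rf(r.begin(), r.end()), e(n, 0.f), en(n);
                for (int k = 0; k < inner; ++k) {
                    for (int i = 0; i < n; ++i) {
                        float s = rf[i];
                        for (int j = 0; j < n; ++j)
                            if (j != i) s -= Af[i * (size_t)n + j] * e[j];
                        en[i] = s / Af[i * (size_t)n + i];
                    }
                    e.swap(en);
                }
                // 3. reliable update of the double-precision iterate
                for (int i = 0; i < n; ++i) x[i] += e[i];
            }
        }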

  16. Computational algorithms for simulations in atmospheric optics.

    PubMed

    Konyaev, P A; Lukin, V P

    2016-04-20

    A computer simulation technique for atmospheric and adaptive optics based on parallel programming is discussed. A parallel propagation algorithm is designed and a modified spectral-phase method for computer generation of 2D time-variant random fields is developed. Temporal power spectra of Laguerre-Gaussian beam fluctuations are considered as an example to illustrate the applications discussed. Implementation of the proposed algorithms using Intel MKL and IPP libraries and NVIDIA CUDA technology is shown to be very fast and accurate. The hardware system for the computer simulation is an off-the-shelf desktop with an Intel Core i7-4790K CPU operating at a turbo frequency of up to 5 GHz and an NVIDIA GeForce GTX 960 graphics accelerator with 1024 processors running at 1.5 GHz.
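
    The spectral-phase approach to generating a 2D random field can be sketched in a few lines with cuFFT: draw random phases, weight them by the square root of a model power spectrum, and inverse-transform to real space. The Kolmogorov-like power law and the (omitted) normalization below are placeholders, not the authors' modified method.

        #include <cuComplex.h>
        #include <cufft.h>
        #include <cuda_runtime.h>
        #include <cmath>
        #include <cstdlib>
        #include <vector>

        // Build a random spectrum on the host, inverse-FFT it on the GPU, and
        // return the real part as an (unnormalized) N x N phase screen.
        std::vector<float> phase_screen(int N)
        {
            std::vector<cufftComplex> spec(N * (size_t)N);
            for (int ky = 0; ky < N; ++ky)
                for (int kx = 0; kx < N; ++kx) {
                    int sx = kx <= N / 2 ? kx : kx - N;   // signed frequencies
                    int sy = ky <= N / 2 ? ky : ky - N;
                    double k2 = (double)sx * sx + (double)sy * sy;
                    // sqrt of a k^-11/3 (Kolmogorov-like) power spectrum
                    double amp = k2 > 0 ? pow(k2, -11.0 / 12.0) : 0.0;
                    double ph = 2.0 * M_PI * rand() / (double)RAND_MAX;
                    spec[ky * (size_t)N + kx] =
                        make_cuFloatComplex((float)(amp * cos(ph)),
                                            (float)(amp * sin(ph)));
                }
            cufftComplex* d;
            cudaMalloc(&d, sizeof(cufftComplex) * N * (size_t)N);
            cudaMemcpy(d, spec.data(), sizeof(cufftComplex) * N * (size_t)N,
                       cudaMemcpyHostToDevice);
            cufftHandle plan;
            cufftPlan2d(&plan, N, N, CUFFT_C2C);
            cufftExecC2C(plan, d, d, CUFFT_INVERSE);   // to real space
            cudaMemcpy(spec.data(), d, sizeof(cufftComplex) * N * (size_t)N,
                       cudaMemcpyDeviceToHost);
            cufftDestroy(plan); cudaFree(d);
            std::vector<float> screen(N * (size_t)N);
            for (size_t i = 0; i < screen.size(); ++i) screen[i] = spec[i].x;
            return screen;
        }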

  17. GPU acceleration for digitally reconstructed radiographs using bindless texture objects and CUDA/OpenGL interoperability.

    PubMed

    Abdellah, Marwan; Eldeib, Ayman; Owis, Mohamed I

    2015-01-01

    This paper features an advanced implementation of the X-ray rendering algorithm that harnesses the giant computing power of current commodity graphics processors to accelerate the generation of high-resolution digitally reconstructed radiographs (DRRs). The presented pipeline exploits the latest features of NVIDIA Graphics Processing Unit (GPU) architectures, mainly bindless texture objects and dynamic parallelism. The rendering throughput is substantially improved by exploiting the interoperability mechanisms between CUDA and OpenGL. The benchmarks of our optimized rendering pipeline reflect its capability of generating DRRs with resolutions of 2048² and 4096² at interactive and semi-interactive frame rates using an NVIDIA GeForce GTX 970 device.

  18. Evaluation of the Intel Xeon Phi 7120 and NVIDIA K80 as accelerators for two-dimensional panel codes

    PubMed Central

    2017-01-01

    To optimize the geometry of airfoils for a specific application is an important engineering problem. In this context genetic algorithms have enjoyed some success as they are able to explore the search space without getting stuck in local optima. However, these algorithms require the computation of aerodynamic properties for a significant number of airfoil geometries. Consequently, for low-speed aerodynamics, panel methods are most often used as the inner solver. In this paper we evaluate the performance of such an optimization algorithm on modern accelerators (more specifically, the Intel Xeon Phi 7120 and the NVIDIA K80). For that purpose, we have implemented an optimized version of the algorithm on the CPU and Xeon Phi (based on OpenMP, vectorization, and the Intel MKL library) and on the GPU (based on CUDA and the MAGMA library). We present timing results for all codes and discuss the similarities and differences between the three implementations. Overall, we observe a speedup of approximately 2.5 for adding an Intel Xeon Phi 7120 to a dual socket workstation and a speedup between 3.4 and 3.8 for adding a NVIDIA K80 to a dual socket workstation. PMID:28582389

  19. Evaluation of the Intel Xeon Phi 7120 and NVIDIA K80 as accelerators for two-dimensional panel codes.

    PubMed

    Einkemmer, Lukas

    2017-01-01

    To optimize the geometry of airfoils for a specific application is an important engineering problem. In this context genetic algorithms have enjoyed some success as they are able to explore the search space without getting stuck in local optima. However, these algorithms require the computation of aerodynamic properties for a significant number of airfoil geometries. Consequently, for low-speed aerodynamics, panel methods are most often used as the inner solver. In this paper we evaluate the performance of such an optimization algorithm on modern accelerators (more specifically, the Intel Xeon Phi 7120 and the NVIDIA K80). For that purpose, we have implemented an optimized version of the algorithm on the CPU and Xeon Phi (based on OpenMP, vectorization, and the Intel MKL library) and on the GPU (based on CUDA and the MAGMA library). We present timing results for all codes and discuss the similarities and differences between the three implementations. Overall, we observe a speedup of approximately 2.5 for adding an Intel Xeon Phi 7120 to a dual socket workstation and a speedup between 3.4 and 3.8 for adding a NVIDIA K80 to a dual socket workstation.

  20. Accelerating Smith-Waterman Algorithm for Biological Database Search on CUDA-Compatible GPUs

    NASA Astrophysics Data System (ADS)

    Munekawa, Yuma; Ino, Fumihiko; Hagihara, Kenichi

    This paper presents a fast method capable of accelerating the Smith-Waterman algorithm for biological database search on a cluster of graphics processing units (GPUs). Our method is implemented using the compute unified device architecture (CUDA), which is available on nVIDIA GPUs. As compared with previous methods, our method has four major contributions. (1) The method efficiently uses on-chip shared memory to reduce the amount of data being transferred between off-chip video memory and processing elements in the GPU. (2) It also reduces the number of data fetches by applying a data reuse technique to query and database sequences. (3) A pipelined method is also implemented to overlap GPU execution with database access. (4) Finally, a master/worker paradigm is employed to accelerate hundreds of database searches on a cluster system. In experiments, the peak performance on a GeForce GTX 280 card reaches 8.32 giga cell updates per second (GCUPS). We also find that our method reduces the number of data fetches to 1/140, achieving approximately three times higher performance than a previous CUDA-based method. Our 32-node cluster version is approximately 28 times faster than a single-GPU version. Furthermore, the effective performance reaches 75.6 giga instructions per second (GIPS) using 32 GeForce 8800 GTX cards.
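
    The recurrence being accelerated is compact; a serial scoring sketch with linear gap penalties is shown below. GPU implementations evaluate the same recurrence along anti-diagonals, whose cells are mutually independent, holding the previous diagonals in shared memory as described above. The scoring parameters are illustrative defaults.

        #include <algorithm>
        #include <string>
        #include <vector>

        // Smith-Waterman local alignment score (linear gaps, two rolling rows).
        // H(i,j) = max(0, H(i-1,j-1)+sub, H(i-1,j)+gap, H(i,j-1)+gap).
        int smith_waterman(const std::string& q, const std::string& d,
                           int match = 2, int mismatch = -1, int gap = -1)
        {
            size_t m = q.size(), n = d.size();
            std::vector<int> prev(n + 1, 0), cur(n + 1, 0);
            int best = 0;
            for (size_t i = 1; i <= m; ++i) {
                cur[0] = 0;
                for (size_t j = 1; j <= n; ++j) {
                    int sub = prev[j - 1] + (q[i - 1] == d[j - 1] ? match : mismatch);
                    cur[j] = std::max({0, sub, prev[j] + gap, cur[j - 1] + gap});
                    best = std::max(best, cur[j]);
                }
                prev.swap(cur);
            }
            return best;   // score only; traceback needs the full matrix
        }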

  1. Building a Terabyte Memory Bandwidth Compute Node with Four Consumer Electronics GPUs

    NASA Astrophysics Data System (ADS)

    Omlin, Samuel; Räss, Ludovic; Podladchikov, Yuri

    2014-05-01

    GPUs released for consumer electronics are generally built with the same chip architectures as the GPUs released for professional usage. With regard to scientific computing, there are no obvious important differences in functionality or performance between the two types of releases, yet the price can differ by up to one order of magnitude. For example, the consumer electronics release of the most recent NVIDIA Kepler architecture (GK110), named GeForce GTX TITAN, performed as well in our memory bandwidth tests as the professional release, named Tesla K20; the consumer electronics release costs about one third of the professional release. We explain how to design and assemble a well-adjusted computer with four high-end consumer electronics GPUs (GeForce GTX TITAN) combining more than 1 terabyte/s of memory bandwidth. We compare the system's performance and precision with those of hardware released for professional usage. The system can be used as a powerful workstation for scientific computing or as a compute node in a home-built GPU cluster.

  2. Parallel hyperspectral compressive sensing method on GPU

    NASA Astrophysics Data System (ADS)

    Bernabé, Sergio; Martín, Gabriel; Nascimento, José M. P.

    2015-10-01

    Remote hyperspectral sensors collect large amounts of data per flight, usually with low spatial resolution. It is known that the bandwidth connection between the satellite/airborne platform and the ground station is limited, so an onboard compression method is desirable to reduce the amount of data to be transmitted. This paper presents a parallel implementation of a compressive sensing method, called parallel hyperspectral coded aperture (P-HYCA), for graphics processing units (GPU) using the compute unified device architecture (CUDA). This method takes into account two main properties of hyperspectral datasets, namely the high correlation existing among the spectral bands and the generally low number of endmembers needed to explain the data, which largely reduces the number of measurements necessary to correctly reconstruct the original data. Experimental results conducted using synthetic and real hyperspectral datasets on two different GPU architectures by NVIDIA, GeForce GTX 590 and GeForce GTX TITAN, reveal that the use of GPUs can provide real-time compressive sensing performance. The achieved speedup is up to 20 times when compared with the processing time of HYCA running on one core of an Intel i7-2600 CPU (3.4 GHz) with 16 GB of memory.

  3. Quantum Chemical Calculations Using Accelerators: Migrating Matrix Operations to the NVIDIA Kepler GPU and the Intel Xeon Phi.

    PubMed

    Leang, Sarom S; Rendell, Alistair P; Gordon, Mark S

    2014-03-11

    Increasingly, modern computer systems comprise a multicore general-purpose processor augmented with a number of special-purpose devices or accelerators connected via an external interface such as a PCI bus. The NVIDIA Kepler Graphics Processing Unit (GPU) and the Intel Xeon Phi are two examples of such accelerators. Accelerators offer peak performances that can be well above those of the host processor. How to exploit this heterogeneous environment for legacy application codes is not, however, straightforward. This paper considers how matrix operations in typical quantum chemical calculations can be migrated to GPU and Phi systems. Double-precision general matrix multiply operations are endemic in electronic structure calculations, especially in methods that include electron correlation, such as density functional theory, second-order perturbation theory, and coupled cluster theory. The use of approaches that automatically determine whether to use the host or an accelerator, based on problem size, is explored, with computations occurring on the accelerator and/or the host. For data transfers over PCI-e, the GPU provides the best overall performance for data sizes up to 4096 MB, with consistent upload and download rates between 5-5.6 GB/s and 5.4-6.3 GB/s, respectively. The GPU outperforms the Phi for both square and nonsquare matrix multiplications.
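
    The size-based host/accelerator decision can be pictured as a simple dispatch wrapper around DGEMM, sketched below. The crossover threshold is a tuning parameter, not a value from the paper, and the naive host loop stands in for a host BLAS call.

        #include <cublas_v2.h>
        #include <cuda_runtime.h>

        // Naive column-major C = A*B on the host; a real code would call a
        // host BLAS dgemm here instead.
        void host_dgemm(int m, int n, int k, const double* A, const double* B,
                        double* C)
        {
            for (int i = 0; i < m; ++i)
                for (int j = 0; j < n; ++j) {
                    double s = 0;
                    for (int p = 0; p < k; ++p)
                        s += A[i + (size_t)p * m] * B[p + (size_t)j * k];
                    C[i + (size_t)j * m] = s;
                }
        }

        // Small multiplies stay on the host (PCI-e transfer would dominate);
        // large ones are shipped to the GPU and run through cuBLAS.
        void dgemm_dispatch(cublasHandle_t h, int m, int n, int k,
                            const double* A, const double* B, double* C)
        {
            const long crossover = 512L * 512 * 512;   // tunable flop threshold
            if ((long)m * n * k < crossover) { host_dgemm(m, n, k, A, B, C); return; }
            double *dA, *dB, *dC, one = 1.0, zero = 0.0;
            cudaMalloc(&dA, sizeof(double) * m * k);
            cudaMalloc(&dB, sizeof(double) * k * n);
            cudaMalloc(&dC, sizeof(double) * m * n);
            cudaMemcpy(dA, A, sizeof(double) * m * k, cudaMemcpyHostToDevice);
            cudaMemcpy(dB, B, sizeof(double) * k * n, cudaMemcpyHostToDevice);
            cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                        &one, dA, m, dB, k, &zero, dC, m);
            cudaMemcpy(C, dC, sizeof(double) * m * n, cudaMemcpyDeviceToHost);
            cudaFree(dA); cudaFree(dB); cudaFree(dC);
        }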

  4. NVIDIA OptiX ray-tracing engine as a new tool for modelling medical imaging systems

    NASA Astrophysics Data System (ADS)

    Pietrzak, Jakub; Kacperski, Krzysztof; Cieślar, Marek

    2015-03-01

    The most accurate technique for modelling the path of X- and gamma radiation through a numerically defined object is Monte Carlo simulation, which follows single photons according to their interaction probabilities. A simplified and much faster approach, which just integrates total interaction probabilities along selected paths, is known as ray tracing. Both techniques are used in medical imaging for simulating real imaging systems and as projectors in iterative tomographic reconstruction algorithms. Both are well suited to massively parallel implementation, e.g. on graphics processing units (GPUs), which can greatly reduce computation time at relatively low cost. In this paper we describe the application of the NVIDIA OptiX ray-tracing engine, popular in professional graphics and rendering applications, as a new powerful tool for X- and gamma ray tracing in medical imaging. It allows the implementation of a variety of physical interactions of rays with pixel-, mesh- or NURBS-based objects, and the recording of any required quantities, such as path integrals, interaction sites, and deposited energies. Using the OptiX engine we have implemented a code for rapid Monte Carlo simulations of Single Photon Emission Computed Tomography (SPECT) imaging, as well as a ray-tracing projector that can be used in reconstruction algorithms. The engine generates efficient, scalable and optimized GPU code, ready to run on multi-GPU heterogeneous systems. We have compared the results of our simulations with the GATE package. With the OptiX engine, the computation time of a Monte Carlo simulation can be reduced from days to minutes.
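
    Stripped of the engine specifics, the ray-tracing alternative described above amounts to accumulating interaction probability along each ray. The plain CUDA kernel below is a minimal sketch of that path integral with fixed-step marching and nearest-neighbour sampling; it is not OptiX code and does not reproduce the authors' implementation.

      // ray_integral.cu -- line integral of attenuation along rays (sketch).
      #include <cuda_runtime.h>

      __global__ void pathIntegral(const float* mu, int nx, int ny, int nz,
                                   const float3* origins, const float3* dirs,
                                   float* integrals, int nrays,
                                   float step, int nsteps)
      {
          int r = blockIdx.x * blockDim.x + threadIdx.x;
          if (r >= nrays) return;
          float3 p = origins[r], d = dirs[r];
          float acc = 0.0f;
          for (int s = 0; s < nsteps; ++s) {   // fixed-step ray marching
              int ix = (int)p.x, iy = (int)p.y, iz = (int)p.z;
              if (ix >= 0 && ix < nx && iy >= 0 && iy < ny && iz >= 0 && iz < nz)
                  acc += mu[(iz * ny + iy) * nx + ix] * step;  // nearest-neighbour sample
              p.x += d.x * step;
              p.y += d.y * step;
              p.z += d.z * step;
          }
          integrals[r] = acc;   // total interaction probability along the ray
      }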

  5. MATCHED FILTER COMPUTATION ON FPGA, CELL, AND GPU

    SciTech Connect

    BAKER, ZACHARY K.; GOKHALE, MAYA B.; TRIPP, JUSTIN L.

    2007-01-08

    The matched filter is an important kernel in the processing of hyperspectral data. The filter enables researchers to sift useful data from instruments that span large frequency bands. In this work, they evaluate the performance of a matched filter algorithm implementation on an FPGA co-processor (XD1000), the IBM Cell microprocessor, and the NVIDIA GeForce 6900 GTX GPU graphics card. They provide extensive discussion of the challenges and opportunities afforded by each platform. In particular, they explore the problem of partitioning the filter most efficiently between the host CPU and the co-processor. Using their results, they derive several performance metrics that provide the optimal solution for a variety of application situations.
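
    At its core, the matched filter reduces to a per-pixel dot product between a precomputed filter vector and the mean-subtracted pixel spectrum, which is what makes it amenable to all three platforms. A minimal CUDA sketch of that kernel follows; the names and pixel-interleaved layout are illustrative assumptions, not the authors' code.

      // matched_filter.cu -- per-pixel matched-filter scoring (sketch).
      #include <cuda_runtime.h>

      __global__ void matchedFilter(const float* cube,  // [npix * nbands], per-pixel interleaved
                                    const float* q,     // [nbands] precomputed filter vector
                                    const float* mean,  // [nbands] background mean spectrum
                                    float* score, int npix, int nbands)
      {
          int p = blockIdx.x * blockDim.x + threadIdx.x;
          if (p >= npix) return;
          float s = 0.0f;
          for (int b = 0; b < nbands; ++b)
              s += q[b] * (cube[p * nbands + b] - mean[b]);
          score[p] = s;   // high score = spectrum matches the target signature
      }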

  6. GPU accelerated generation of digitally reconstructed radiographs for 2-D/3-D image registration.

    PubMed

    Dorgham, Osama M; Laycock, Stephen D; Fisher, Mark H

    2012-09-01

    Recent advances in programming languages for graphics processing units (GPUs) provide developers with a convenient way of implementing applications which can be executed on the CPU and GPU interchangeably. GPUs are becoming relatively cheap, powerful, and widely available hardware components which can be used to perform intensive calculations. The last decade of hardware performance developments shows that GPU-based computation is progressing significantly faster than CPU-based computation, particularly for highly parallelisable algorithms, and future predictions indicate that this trend is likely to continue. In this paper, we introduce a way of accelerating 2-D/3-D image registration by developing a hybrid system which executes on the CPU and utilizes the GPU for parallelizing the generation of digitally reconstructed radiographs (DRRs). Based on the advancements of the GPU over the CPU, it is timely to exploit the benefits of many-core GPU technology by developing algorithms for DRR generation. Although some previous work has investigated the rendering of DRRs using the GPU, this paper investigates approximations which reduce the computational overhead while still maintaining a quality consistent with that needed for 2-D/3-D registration, with sufficient accuracy to be clinically acceptable in certain applications of radiation oncology. Furthermore, by comparing implementations of 2-D/3-D registration on the CPU and GPU, we investigate current performance and propose an optimal framework for PC implementations addressing the rigid registration problem. Using this framework, we are able to render DRR images from a 256×256×133 CT volume in ~24 ms using an NVIDIA GeForce 8800 GTX and in ~2 ms using an NVIDIA GeForce GTX 580. In addition to applications requiring fast automatic patient setup, these levels of performance suggest that image-guided radiation therapy at video frame rates is technically feasible using relatively low-cost PC hardware.

  7. Performance analysis of a parallel Monte Carlo code for simulating solar radiative transfer in cloudy atmospheres using CUDA-enabled NVIDIA GPU

    NASA Astrophysics Data System (ADS)

    Russkova, Tatiana V.

    2017-11-01

    One way to improve the performance of Monte Carlo methods for the numerical simulation of light transport in the Earth's atmosphere is parallel computing. A new algorithm oriented to parallel execution on CUDA-enabled NVIDIA graphics processors is discussed. The efficiency of parallelization is analyzed on the basis of calculating the upward and downward fluxes of solar radiation in both vertically homogeneous and inhomogeneous models of the atmosphere. Results of testing the new code under various atmospheric conditions, including continuous single-layered and multilayered clouds and selective molecular absorption, are presented, along with results of testing the code on video cards with different compute capabilities. It is shown that moving the computation from conventional PCs to the architecture of graphics processors gives more than a hundredfold increase in performance and fully reveals the capabilities of the technology used.
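
    The parallelization pattern in question, one photon history per thread, can be sketched with cuRAND. The toy kernel below traces photons through a single homogeneous scattering layer with isotropic re-scattering; it is far simpler than the cloud and molecular-absorption models of the paper and is meant only to show the structure.

      // photon_mc.cu -- one photon random walk per thread (toy sketch).
      #include <cuda_runtime.h>
      #include <curand_kernel.h>

      __global__ void photonWalk(unsigned long long seed, int nphotons,
                                 float tau_max, float albedo,
                                 unsigned int* nEscapedTop)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i >= nphotons) return;
          curandState rng;
          curand_init(seed, i, 0, &rng);
          float tau = 0.0f;   // optical depth below the top of the layer
          float mu = 1.0f;    // direction cosine (starts travelling downward)
          while (true) {
              float step = -logf(curand_uniform(&rng));     // free path in optical depth
              tau += mu * step;
              if (tau <= 0.0f) { atomicAdd(nEscapedTop, 1u); break; }  // reflected out the top
              if (tau >= tau_max) break;                    // transmitted through the layer
              if (curand_uniform(&rng) > albedo) break;     // true absorption event
              mu = 2.0f * curand_uniform(&rng) - 1.0f;      // isotropic re-scattering
          }
      }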

  8. CUDASW++: optimizing Smith-Waterman sequence database searches for CUDA-enabled graphics processing units

    PubMed Central

    Liu, Yongchao; Maskell, Douglas L; Schmidt, Bertil

    2009-01-01

    Background: The Smith-Waterman algorithm is one of the most widely used tools for searching biological sequence databases due to its high sensitivity. Unfortunately, the Smith-Waterman algorithm is computationally demanding, which is further compounded by the exponential growth of sequence databases. The recent emergence of many-core architectures, and their associated programming interfaces, provides an opportunity to accelerate sequence database searches using commonly available and inexpensive hardware. Findings: Our CUDASW++ implementation (benchmarked on a single-GPU NVIDIA GeForce GTX 280 graphics card and a dual-GPU GeForce GTX 295 graphics card) provides a significant performance improvement compared to other publicly available implementations, such as SWPS3, CBESW, SW-CUDA, and NCBI-BLAST. CUDASW++ supports query sequences of up to 59K residues. For query sequences ranging in length from 144 to 5,478 searched against Swiss-Prot release 56.6, the single-GPU version achieves an average performance of 9.509 GCUPS (minimum 9.039, maximum 9.660), and the dual-GPU version achieves an average performance of 14.484 GCUPS (minimum 10.660, maximum 16.087). Conclusion: CUDASW++ is publicly available open-source software. It provides a significant performance improvement for Smith-Waterman-based protein sequence database searches by fully exploiting the compute capability of commonly used CUDA-enabled low-cost GPUs. PMID:19416548
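
    For short subject sequences, the inter-task strategy used by CUDASW++ assigns one database sequence to each thread, which computes the full dynamic-programming recurrence with two rolling rows. The sketch below uses a simplified match/mismatch score and a linear gap penalty; the buffer bound and names are assumptions, and the real code uses substitution matrices and affine gaps.

      // sw_score.cu -- one Smith-Waterman alignment per thread (sketch).
      #include <cuda_runtime.h>

      #define MAX_QUERY 1024   // assumed per-thread row-buffer bound

      __global__ void swScore(const char* db, const int* offset, const int* len,
                              int nseq, const char* query, int qlen,
                              int match, int mismatch, int gap, int* best)
      {
          int t = blockIdx.x * blockDim.x + threadIdx.x;
          if (t >= nseq || qlen > MAX_QUERY) return;
          const char* seq = db + offset[t];
          int prev[MAX_QUERY + 1], curr[MAX_QUERY + 1];
          for (int j = 0; j <= qlen; ++j) prev[j] = 0;
          int bestScore = 0;
          for (int i = 1; i <= len[t]; ++i) {
              curr[0] = 0;
              for (int j = 1; j <= qlen; ++j) {
                  int s = (seq[i - 1] == query[j - 1]) ? match : mismatch;
                  int h = prev[j - 1] + s;                  // diagonal: (mis)match
                  h = max(h, prev[j] - gap);                // gap in query
                  h = max(h, curr[j - 1] - gap);            // gap in subject
                  h = max(h, 0);                            // local-alignment floor
                  curr[j] = h;
                  bestScore = max(bestScore, h);
              }
              for (int j = 0; j <= qlen; ++j) prev[j] = curr[j];  // roll rows
          }
          best[t] = bestScore;
      }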

  9. GOTHIC: Gravitational oct-tree code accelerated by hierarchical time step controlling

    NASA Astrophysics Data System (ADS)

    Miki, Yohei; Umemura, Masayuki

    2017-04-01

    The tree method is a widely implemented algorithm for collisionless N-body simulations in astrophysics and is well suited to GPUs. Adopting hierarchical time stepping can accelerate N-body simulations; however, it is infrequently implemented and its potential remains untested in GPU implementations. We have developed a Gravitational Oct-Tree code accelerated by HIerarchical time step Controlling, named GOTHIC, which adopts both the tree method and the hierarchical time step. The code adopts some adaptive optimizations by monitoring the execution time of each function on the fly and minimizes the time-to-solution by balancing the measured times of multiple functions. Performance measurements with realistic particle distributions on the NVIDIA Tesla M2090, K20X, and GeForce GTX TITAN X, representative GPUs of the Fermi, Kepler, and Maxwell generations, show that the hierarchical time step achieves a speedup of around 3-5 times compared to the shared time step. The measured elapsed time per step of GOTHIC is 0.30 s or 0.44 s on the GTX TITAN X when the particle distribution represents the Andromeda galaxy or the NFW sphere, respectively, with 2^24 = 16,777,216 particles. The average performance of the code corresponds to 10-30% of the theoretical single-precision peak performance of the GPU.

  10. Graphics processing unit (GPU)-based computation of heat conduction in thermally anisotropic solids

    NASA Astrophysics Data System (ADS)

    Nahas, C. A.; Balasubramaniam, Krishnan; Rajagopal, Prabhu

    2013-01-01

    Numerical modeling of anisotropic media is a computationally intensive task, since the physical properties differ in different directions, adding complexity to the field problem. Widely used in the aerospace industry because of their light weight, composite materials are a good example of thermally anisotropic media. With advancements in video gaming technology, parallel processors are much cheaper today, and access to higher-end graphics processing devices has increased dramatically over the past couple of years. Since these massively parallel GPUs are very good at handling floating-point arithmetic, they provide a new platform for engineers and scientists to accelerate their numerical models using commodity hardware. In this paper we implement a parallel finite-difference model of thermal diffusion through anisotropic media using NVIDIA CUDA (Compute Unified Device Architecture). We use the NVIDIA GeForce GTX 560 Ti, which consists of 384 CUDA cores clocked at 1645 MHz, as our primary computing device, with a standard desktop PC as the host platform. We compare the results with a standard CPU implementation for accuracy and speed and draw implications for simulation using the GPU paradigm.
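
    An explicit finite-difference update for anisotropic conduction maps naturally onto one thread per grid node. The following CUDA sketch advances a 2-D temperature field one step under a constant conductivity tensor (kxx, kyy, kxy); choosing a stable dt is left to the caller, and none of this is taken from the paper's code.

      // heat_aniso.cu -- explicit FD step for anisotropic heat conduction (sketch).
      #include <cuda_runtime.h>

      __global__ void stepHeat(const float* T, float* Tnew, int nx, int ny,
                               float kxx, float kyy, float kxy, float dt, float h)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          int j = blockIdx.y * blockDim.y + threadIdx.y;
          if (i <= 0 || j <= 0 || i >= nx - 1 || j >= ny - 1) return;  // skip boundary
          int id = j * nx + i;
          float Txx = (T[id - 1] - 2.0f * T[id] + T[id + 1]) / (h * h);
          float Tyy = (T[id - nx] - 2.0f * T[id] + T[id + nx]) / (h * h);
          float Txy = (T[id + nx + 1] - T[id + nx - 1]          // mixed derivative
                     - T[id - nx + 1] + T[id - nx - 1]) / (4.0f * h * h);
          // dT/dt = kxx*Txx + kyy*Tyy + 2*kxy*Txy for a constant tensor
          Tnew[id] = T[id] + dt * (kxx * Txx + kyy * Tyy + 2.0f * kxy * Txy);
      }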

  11. GPU-based cone beam computed tomography.

    PubMed

    Noël, Peter B; Walczak, Alan M; Xu, Jinhui; Corso, Jason J; Hoffmann, Kenneth R; Schafer, Sebastian

    2010-06-01

    The use of cone beam computed tomography (CBCT) is growing in the clinical arena due to its ability to provide 3D information during interventions, its high diagnostic quality (sub-millimeter resolution), and its short scanning times (60 s). In many situations, the short scanning time of CBCT is followed by a time-consuming 3D reconstruction. The standard reconstruction algorithm for CBCT data is filtered backprojection, which for a volume of size 256^3 takes up to 25 min on a standard system. Recent developments in the area of graphics processing units (GPUs) make it possible to access high-performance computing solutions at low cost, allowing their use in many scientific problems. We have implemented an algorithm for 3D reconstruction of CBCT data using the Compute Unified Device Architecture (CUDA) provided by NVIDIA (NVIDIA Corporation, Santa Clara, California), executed on an NVIDIA GeForce GTX 280. Our implementation reduces reconstruction times from minutes, or even hours, to a matter of seconds, while also giving the clinician the ability to view 3D volumetric data at higher resolutions. We evaluated our implementation on ten clinical data sets and one phantom data set to observe whether differences occur between CPU- and GPU-based reconstructions. Using our approach, the computation time for a 256^3 volume is reduced from 25 min on the CPU to 3.2 s on the GPU; the GPU reconstruction time for 512^3 volumes is 8.5 s. Copyright 2009 Elsevier Ireland Ltd. All rights reserved.

  12. Real-time electroholography using a multiple-graphics processing unit cluster system with a single spatial light modulator and the InfiniBand network

    NASA Astrophysics Data System (ADS)

    Niwase, Hiroaki; Takada, Naoki; Araki, Hiromitsu; Maeda, Yuki; Fujiwara, Masato; Nakayama, Hirotaka; Kakue, Takashi; Shimobaba, Tomoyoshi; Ito, Tomoyoshi

    2016-09-01

    Parallel calculation of large-pixel-count computer-generated holograms (CGHs) is well suited to multiple-graphics processing unit (multi-GPU) cluster systems. However, it is not easy for a multi-GPU cluster system to accomplish fast CGH calculations when CGH transfers between PCs are required, because the transfer between PCs becomes a bottleneck. Usually, this problem occurs only in multi-GPU cluster systems with a single spatial light modulator. To overcome it, we propose a simple method using the InfiniBand network. The computational speed of the proposed method using 13 GPUs (NVIDIA GeForce GTX TITAN X) was more than 3000 times faster than that of a CPU (Intel Core i7 4770) when the number of three-dimensional (3-D) object points exceeded 20,480. In practice, we achieved ~40 tera floating-point operations per second (TFLOPS) when the number of 3-D object points exceeded 40,960. Our proposed method was able to reconstruct a real-time movie of a 3-D object comprising 95,949 points.
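
    The CGH workload being distributed across the cluster is, per hologram pixel, a sum of Fresnel-zone phase contributions over all object points. The single-GPU CUDA kernel below is a minimal sketch of that computation with a binary-amplitude output; the structure and constants are illustrative assumptions rather than the authors' cluster code.

      // cgh_point.cu -- point-source CGH accumulation, one thread per pixel (sketch).
      #include <cuda_runtime.h>

      struct Point3 { float x, y, z, intensity; };

      __global__ void cghKernel(unsigned char* holo, int W, int H, float pitch,
                                float wavelength, const Point3* obj, int npts)
      {
          int px = blockIdx.x * blockDim.x + threadIdx.x;
          int py = blockIdx.y * blockDim.y + threadIdx.y;
          if (px >= W || py >= H) return;
          float x = (px - W / 2) * pitch, y = (py - H / 2) * pitch;
          float k = 2.0f * 3.14159265f / wavelength;
          float re = 0.0f;
          for (int j = 0; j < npts; ++j) {       // accumulate all object points
              float dx = x - obj[j].x, dy = y - obj[j].y;
              float phase = k * (dx * dx + dy * dy) / (2.0f * obj[j].z);  // Fresnel approx.
              re += obj[j].intensity * __cosf(phase);
          }
          holo[py * W + px] = re > 0.0f ? 255 : 0;   // binary amplitude hologram
      }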

  13. Real time mitigation of atmospheric turbulence in long distance imaging using the lucky region fusion algorithm with FPGA and GPU hardware acceleration

    NASA Astrophysics Data System (ADS)

    Jackson, Christopher Robert

    "Lucky-region" fusion (LRF) is a synthetic imaging technique that has proven successful in enhancing the quality of images distorted by atmospheric turbulence. The LRF algorithm selects sharp regions of an image obtained from a series of short exposure frames, and fuses the sharp regions into a final, improved image. In previous research, the LRF algorithm had been implemented on a PC using the C programming language. However, the PC did not have sufficient sequential processing power to handle real-time extraction, processing and reduction required when the LRF algorithm was applied to real-time video from fast, high-resolution image sensors. This thesis describes two hardware implementations of the LRF algorithm to achieve real-time image processing. The first was created with a VIRTEX-7 field programmable gate array (FPGA). The other developed using the graphics processing unit (GPU) of a NVIDIA GeForce GTX 690 video card. The novelty in the FPGA approach is the creation of a "black box" LRF video processing system with a general camera link input, a user controller interface, and a camera link video output. We also describe a custom hardware simulation environment we have built to test the FPGA LRF implementation. The advantage of the GPU approach is significantly improved development time, integration of image stabilization into the system, and comparable atmospheric turbulence mitigation.

  14. GBOOST: a GPU-based tool for detecting gene-gene interactions in genome-wide case control studies.

    PubMed

    Yung, Ling Sing; Yang, Can; Wan, Xiang; Yu, Weichuan

    2011-05-01

    Collecting millions of genetic variations is feasible with advanced genotyping technology. With huge amounts of genetic variation data in hand, developing efficient algorithms to carry out gene-gene interaction analysis in a timely manner has become one of the key problems in genome-wide association studies (GWAS). Boolean operation-based screening and testing (BOOST), a recent method in GWAS, completes gene-gene interaction analysis in 2.5 days on a desktop computer. Compared with central processing units (CPUs), graphics processing units (GPUs) are highly parallel hardware that provides massive computing resources. We are, therefore, motivated to use GPUs to further speed up the analysis of gene-gene interactions. We implemented the BOOST method on a GPU framework and named it GBOOST. GBOOST achieves a 40-fold speedup compared with BOOST. It completes the analysis of the Wellcome Trust Case Control Consortium Type 2 Diabetes (WTCCC T2D) genome data within 1.34 h on a desktop computer equipped with an NVIDIA GeForce GTX 285 display card. GBOOST code is available at http://bioinformatics.ust.hk/BOOST.html#GBOOST.

  15. Decryption-decompression of AES protected ZIP files on GPUs

    NASA Astrophysics Data System (ADS)

    Duong, Tan Nhat; Pham, Phong Hong; Nguyen, Duc Huu; Nguyen, Thuy Thanh; Le, Hung Duc

    2011-10-01

    AES is a strong encryption system, so decrypting and decompressing AES-encrypted ZIP files requires very large computing power together with techniques for reducing the password space; this makes implementations on common computing systems impractical. In [1], we reduced the original very large password search space to a much smaller one that is guaranteed to contain the correct password. Based on this reduced set of passwords, in this paper we parallelize decryption, decompression, and plain-text recognition for encrypted ZIP files using CUDA on NVIDIA GeForce GTX 295 graphics cards to find the correct password. The experimental results show that the speed of decrypting, decompressing, recognizing plain text, and finding the original password increases by a factor of about 45 to 180 (depending on the number of GPUs) compared to sequential execution on an Intel Core 2 Quad Q8400 (2.66 GHz). These results demonstrate the potential applicability of GPUs in this cryptanalysis field.

  16. gCUP: rapid GPU-based HIV-1 co-receptor usage prediction for next-generation sequencing.

    PubMed

    Olejnik, Michael; Steuwer, Michel; Gorlatch, Sergei; Heider, Dominik

    2014-11-15

    Next-generation sequencing (NGS) has large potential in HIV diagnostics, and genotypic prediction models have been developed and successfully tested in recent years. However, albeit highly accurate, these computational models lack the computational efficiency to reach their full potential. In this study, we demonstrate the use of graphics processing units (GPUs) in combination with a computational prediction model for HIV tropism. Our new model, named gCUP, parallelized and optimized for the GPU, is highly accurate and can classify >175,000 sequences per second on an NVIDIA GeForce GTX 460. The computational efficiency of our new model is the next step in enabling NGS technologies to reach clinical significance in HIV diagnostics. Moreover, our approach is not limited to HIV tropism prediction, but can also be easily adapted to other settings, e.g. drug resistance prediction. The source code can be downloaded at http://www.heiderlab.de. Contact: d.heider@wz-straubing.de. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

  17. Parallel halftoning technique using dot diffusion optimization

    NASA Astrophysics Data System (ADS)

    Molina-Garcia, Javier; Ponomaryov, Volodymyr I.; Reyes-Reyes, Rogelio; Cruz-Ramos, Clara

    2017-05-01

    In this paper, a novel approach to halftoning is proposed and implemented for images obtained by the dot diffusion (DD) method. The designed technique is based on an optimization of the so-called class matrix used in the DD algorithm: new versions of the class matrix are generated that contain no baron or near-baron positions, in order to minimize inconsistencies during error distribution. Two variants of the proposed class matrix, with different properties, are designed for two different applications: those where inverse halftoning is necessary, and those where it is not required. The proposed method has been implemented on a GPU (NVIDIA GeForce GTX 750 Ti) and on multicore processors (an AMD FX-6300 six-core processor and an Intel Core i5-4200U), using CUDA and OpenCV on a Linux PC. Experimental results show that the novel framework generates good-quality halftone and inverse-halftone images. Simulation results on parallel architectures demonstrate the efficiency of the novel technique when implemented for real-time processing.

  18. Real-time time-division color electroholography using a single GPU and a USB module for synchronizing reference light.

    PubMed

    Araki, Hiromitsu; Takada, Naoki; Niwase, Hiroaki; Ikawa, Shohei; Fujiwara, Masato; Nakayama, Hirotaka; Kakue, Takashi; Shimobaba, Tomoyoshi; Ito, Tomoyoshi

    2015-12-01

    We propose real-time time-division color electroholography using a single graphics processing unit (GPU) and a simple synchronization system for the reference light. To facilitate real-time time-division color electroholography, we developed a light-emitting diode (LED) controller with a universal serial bus (USB) module and a drive circuit for the reference light. A one-chip RGB LED connected to a personal computer via the LED controller was used as the reference light. A single GPU calculates three computer-generated holograms (CGHs), one for each of the red, green, and blue colors, in each frame of a three-dimensional (3D) movie. After the CGH calculation on the GPU, the CPU synchronizes the CGH display with the color switching of the one-chip RGB LED via the LED controller. Consequently, we achieved real-time time-division color electroholography for a 3D object consisting of around 1,000 points per color using an NVIDIA GeForce GTX TITAN as the GPU. Furthermore, we implemented the proposed method on various GPUs, and the experimental results showed that it was effective on all of them.

  19. Parallel hyperspectral image reconstruction using random projections

    NASA Astrophysics Data System (ADS)

    Sevilla, Jorge; Martín, Gabriel; Nascimento, José M. P.

    2016-10-01

    Spaceborne sensor systems are characterized by scarce onboard computing and storage resources and by communication links with reduced bandwidth. Random projection techniques have been demonstrated to be an effective and very lightweight way to reduce the number of measurements in hyperspectral data, thereby reducing the data to be transmitted to the Earth station. However, the reconstruction of the original data from the random projections may be computationally expensive. SpeCA is a blind hyperspectral reconstruction technique that exploits the fact that hyperspectral vectors often belong to a low-dimensional subspace. SpeCA has shown promising results in the task of recovering hyperspectral data from a reduced number of random measurements. In this manuscript we focus on the implementation of the SpeCA algorithm for graphics processing units (GPUs) using the compute unified device architecture (CUDA). Experimental results conducted using synthetic and real hyperspectral datasets on an NVIDIA GeForce GTX 980 GPU reveal that the use of GPUs can provide real-time reconstruction. The achieved speedup is up to 22 times compared with the processing time of SpeCA running on one core of an Intel i7-4790K CPU (3.4 GHz) with 32 GB of memory.

  20. Speeding-up Bioinformatics Algorithms with Heterogeneous Architectures: Highly Heterogeneous Smith-Waterman (HHeterSW).

    PubMed

    Gálvez, Sergio; Ferusic, Adis; Esteban, Francisco J; Hernández, Pilar; Caballero, Juan A; Dorado, Gabriel

    2016-10-01

    The Smith-Waterman algorithm has great sensitivity when used for biological sequence-database searches, but at the expense of high computing-power requirements. To overcome this problem, implementations in the literature exploit the different hardware architectures available in a standard PC, such as the GPU, CPU, and coprocessors. We introduce an application that splits the original database-search problem into smaller parts, resolves each of them by executing the most efficient implementation of the Smith-Waterman algorithm on the corresponding hardware architecture, and finally unifies the generated results. Using non-overlapping hardware allows simultaneous execution and up to a 2.58-fold performance gain compared with any other single algorithm for searching sequence databases. Even the performance of the popular BLAST heuristic is exceeded in 78% of the tests. The application has been tested with standard hardware: an Intel i7-4820K CPU, Intel Xeon Phi 31S1P coprocessors, and nVidia GeForce GTX 960 graphics cards. An important increase in performance has been obtained in a wide range of situations, effectively exploiting the available hardware.

  1. FPGA Implementation of the Coupled Filtering Method and the Affine Warping Method.

    PubMed

    Zhang, Chen; Liang, Tianzhu; Mok, Philip K T; Yu, Weichuan

    2017-07-01

    In ultrasound image analysis, speckle tracking methods are widely applied to study the elasticity of body tissue. However, "feature-motion decorrelation" remains a challenge for speckle tracking methods. Recently, a coupled filtering method and an affine warping method were proposed to accurately estimate strain values when the tissue deformation is large. The major drawback of these methods is their high computational complexity; even the graphics processing unit (GPU)-based program requires a long time to finish the analysis. In this paper, we propose field-programmable gate array (FPGA)-based implementations of both methods for further acceleration. The capability of FPGAs to handle the different image processing components of these methods is discussed. A fast and memory-saving image warping approach is proposed. The algorithms are reformulated to build a highly efficient pipeline on the FPGA. The final implementations on a Xilinx Virtex-7 FPGA are at least 13 times faster than the GPU implementation on an NVIDIA GeForce GTX 580 graphics card.

  2. Real-time colour hologram generation based on ray-sampling plane with multi-GPU acceleration.

    PubMed

    Sato, Hirochika; Kakue, Takashi; Ichihashi, Yasuyuki; Endo, Yutaka; Wakunami, Koki; Oi, Ryutaro; Yamamoto, Kenji; Nakayama, Hirotaka; Shimobaba, Tomoyoshi; Ito, Tomoyoshi

    2018-01-24

    Although electro-holography can reconstruct three-dimensional (3D) motion pictures, its computational cost is too heavy to allow real-time reconstruction. This study explores accelerating colour hologram generation using light-ray information on a ray-sampling (RS) plane with a graphics processing unit (GPU) to realise a real-time holographic display system. We refer to an image corresponding to light-ray information as an RS image. Colour holograms were generated from three RS images with resolutions of 2,048 × 2,048; 3,072 × 3,072 and 4,096 × 4,096 pixels. The computational results indicate that generating the colour holograms using multiple GPUs (NVIDIA GeForce GTX 1080) was approximately 300-500 times faster than generating them using a central processing unit. In addition, the results demonstrate that 3D motion pictures were successfully reconstructed from RS images of 3,072 × 3,072 pixels at approximately 15 frames per second using an electro-holographic reconstruction system in which colour holograms were generated from RS images in real time.

  3. Exploring DeepMedic for the purpose of segmenting white matter hyperintensity lesions

    NASA Astrophysics Data System (ADS)

    Lippert, Fiona; Cheng, Bastian; Golsari, Amir; Weiler, Florian; Gregori, Johannes; Thomalla, Götz; Klein, Jan

    2018-02-01

    DeepMedic, an open source software library based on a multi-channel multi-resolution 3D convolutional neural network, has recently been made publicly available for brain lesion segmentation. It has already been shown that segmentation tasks on MRI data of patients with traumatic brain injuries, brain tumors, and ischemic stroke lesions can be performed very well. In this paper we describe how it can be used efficiently for detecting and segmenting white matter hyperintensity lesions, and we examine whether it can be applied to single-channel routine 2D FLAIR data. For evaluation, we annotated 197 datasets with different numbers and sizes of white matter hyperintensity lesions. Our experiments show that substantial segmentation quality can be achieved. Compared to the original parametrization of the DeepMedic neural network, training times can be reduced drastically by adjusting the corresponding training parameters, while the Dice coefficients remain nearly unchanged. This enables a whole training process to be performed within a single day on an NVIDIA GeForce GTX 580 graphics board, which makes this library very interesting for research on low-end GPU hardware.

  4. GPU accelerated Monte-Carlo simulation of SEM images for metrology

    NASA Astrophysics Data System (ADS)

    Verduin, T.; Lokhorst, S. R.; Hagen, C. W.

    2016-03-01

    In this work we address the computation times of numerical studies in dimensional metrology. In particular, full Monte-Carlo simulation programs for scanning electron microscopy (SEM) image acquisition are known to be notoriously slow. Our quest to reduce the computation time of SEM image simulation has led us to investigate the use of graphics processing units (GPUs) for metrology. We have succeeded in creating a full Monte-Carlo simulation program for SEM images which runs entirely on a GPU. The physical scattering models of this GPU simulator are identical to those of a previous CPU-based simulator, which includes the dielectric function model for inelastic scattering and also refinements for low-voltage SEM applications. As a case study for the performance, we considered the simulated exposure of a complex feature: an isolated silicon line with rough sidewalls located on a flat silicon substrate. The surface of the rough feature is decomposed into 408,012 triangles. We used an exposure dose of 6 mC/cm^2, which corresponds to 6,553,600 primary electrons on average (Poisson distributed). We repeated the simulation for various primary electron energies: 300 eV, 500 eV, 800 eV, 1 keV, 3 keV and 5 keV. We first ran the simulation on an NVIDIA GeForce GTX 480; the very same simulation was then duplicated with our CPU-based program on an Intel Xeon X5650. Apart from statistics in the simulation, no difference is found between the CPU and GPU simulated results. The GTX 480 generates the images (depending on the primary electron energy) 350 to 425 times faster than a single-threaded Intel X5650 CPU. Although this is a tremendous speedup, we have not actually reached the maximum throughput because of the limited amount of memory available on the GTX 480. Nevertheless, the speedup enables the fast acquisition of simulated SEM images for metrology. We now have the potential to investigate case studies in CD-SEM metrology which would otherwise take an unreasonable amount of time.

  5. Three Dimensional CFD Analysis of the GTX Combustor

    NASA Technical Reports Server (NTRS)

    Steffen, C. J., Jr.; Bond, R. B.; Edwards, J. R.

    2002-01-01

    The annular combustor geometry of a combined-cycle engine has been analyzed with three-dimensional computational fluid dynamics. Both subsonic and supersonic combustion flowfields have been simulated. The subsonic combustion analysis was executed in conjunction with a direct-connect test rig. Two cold-flow results and one hot-flow result are presented. The simulations compare favorably with the test data for the two cold-flow calculations; the hot-flow data were not yet available. The hot-flow simulation indicates that the conventional ejector-ramjet cycle would not provide adequate mixing at the conditions tested. The supersonic combustion ramjet flowfield was simulated with a frozen-chemistry model. A five-parameter test matrix was specified according to statistical design-of-experiments theory. Twenty-seven separate simulations were used to assemble surrogate models for combustor mixing efficiency and total pressure recovery. Scramjet injector design parameters (injector angle, location, and fuel split) as well as mission variables (total fuel massflow and freestream Mach number) were included in the analysis. A promising injector design has been identified that provides good mixing characteristics with low total pressure losses. The surrogate models can be used to develop performance maps of different injector designs. Several complex three-way variable interactions appear within the dataset that are not adequately resolved with the current statistical analysis.

  6. Real-time simulation of a spiking neural network model of the basal ganglia circuitry using general purpose computing on graphics processing units.

    PubMed

    Igarashi, Jun; Shouno, Osamu; Fukai, Tomoki; Tsujino, Hiroshi

    2011-11-01

    Real-time simulation of a biologically realistic spiking neural network is necessary for evaluation of its capacity to interact with real environments. However, the real-time simulation of such a neural network is difficult due to its high computational costs that arise from two factors: (1) vast network size and (2) the complicated dynamics of biologically realistic neurons. In order to address these problems, mainly the latter, we chose to use general purpose computing on graphics processing units (GPGPUs) for simulation of such a neural network, taking advantage of the powerful computational capability of a graphics processing unit (GPU). As a target for real-time simulation, we used a model of the basal ganglia that has been developed according to electrophysiological and anatomical knowledge. The model consists of heterogeneous populations of 370 spiking model neurons, including computationally heavy conductance-based models, connected by 11,002 synapses. Simulation of the model has not yet been performed in real-time using a general computing server. By parallelization of the model on the NVIDIA Geforce GTX 280 GPU in data-parallel and task-parallel fashion, faster-than-real-time simulation was robustly realized with only one-third of the GPU's total computational resources. Furthermore, we used the GPU's full computational resources to perform faster-than-real-time simulation of three instances of the basal ganglia model; these instances consisted of 1100 neurons and 33,006 synapses and were synchronized at each calculation step. Finally, we developed software for simultaneous visualization of faster-than-real-time simulation output. These results suggest the potential power of GPGPU techniques in real-time simulation of realistic neural networks. Copyright © 2011 Elsevier Ltd. All rights reserved.
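
    The data-parallel part of such a simulation assigns one neuron's state update per thread at every time step. In the toy CUDA sketch below, a leaky integrate-and-fire neuron with a decaying synaptic conductance stands in for the paper's far more detailed conductance-based models; all constants are generic textbook values, not the paper's parameters.

      // snn_step.cu -- one neuron state update per thread (toy sketch).
      #include <cuda_runtime.h>

      __global__ void stepNeurons(float* v, float* gsyn, const float* Iext,
                                  unsigned char* spiked, int n, float dt)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i >= n) return;
          const float tau_m = 20.0f, tau_s = 5.0f;   // membrane/synapse time constants (ms)
          const float v_rest = -70.0f, v_th = -54.0f, v_reset = -60.0f;  // mV
          float g = gsyn[i] * expf(-dt / tau_s);     // exponential synaptic decay
          // leak + excitatory synaptic current (0 mV reversal) + external drive
          float dv = (-(v[i] - v_rest) - g * v[i] + Iext[i]) / tau_m;
          float vn = v[i] + dt * dv;                 // forward-Euler integration
          spiked[i] = (vn >= v_th);
          v[i] = spiked[i] ? v_reset : vn;           // threshold-and-reset
          gsyn[i] = g;                               // conductance carried to next step
      }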

  7. GeauxDock: Accelerating Structure-Based Virtual Screening with Heterogeneous Computing

    PubMed Central

    Fang, Ye; Ding, Yun; Feinstein, Wei P.; Koppelman, David M.; Moreno, Juana; Jarrell, Mark; Ramanujam, J.; Brylinski, Michal

    2016-01-01

    Computational modeling of drug binding to proteins is an integral component of direct drug design. Particularly, structure-based virtual screening is often used to perform large-scale modeling of putative associations between small organic molecules and their pharmacologically relevant protein targets. Because of a large number of drug candidates to be evaluated, an accurate and fast docking engine is a critical element of virtual screening. Consequently, highly optimized docking codes are of paramount importance for the effectiveness of virtual screening methods. In this communication, we describe the implementation, tuning and performance characteristics of GeauxDock, a recently developed molecular docking program. GeauxDock is built upon the Monte Carlo algorithm and features a novel scoring function combining physics-based energy terms with statistical and knowledge-based potentials. Developed specifically for heterogeneous computing platforms, the current version of GeauxDock can be deployed on modern, multi-core Central Processing Units (CPUs) as well as massively parallel accelerators, Intel Xeon Phi and NVIDIA Graphics Processing Unit (GPU). First, we carried out a thorough performance tuning of the high-level framework and the docking kernel to produce a fast serial code, which was then ported to shared-memory multi-core CPUs yielding a near-ideal scaling. Further, using Xeon Phi gives 1.9× performance improvement over a dual 10-core Xeon CPU, whereas the best GPU accelerator, GeForce GTX 980, achieves a speedup as high as 3.5×. On that account, GeauxDock can take advantage of modern heterogeneous architectures to considerably accelerate structure-based virtual screening applications. GeauxDock is open-sourced and publicly available at www.brylinski.org/geauxdock and https://figshare.com/articles/geauxdock_tar_gz/3205249. PMID:27420300

  8. Interactive collision detection for deformable models using streaming AABBs.

    PubMed

    Zhang, Xinyu; Kim, Young J

    2007-01-01

    We present an interactive and accurate collision detection algorithm for deformable, polygonal objects based on the streaming computational model. Our algorithm can detect all possible pairwise primitive-level intersections between two severely deforming models at highly interactive rates. In our streaming computational model, we consider a set of axis-aligned bounding boxes (AABBs) that bound each of the given deformable objects as an input stream and perform massively parallel pairwise overlap tests on the incoming streams. As a result, we are able to prevent performance stalls in the streaming pipeline that can be caused by the expensive indexing mechanism required by bounding volume hierarchy-based streaming algorithms. At runtime, as the underlying models deform over time, we employ a novel streaming algorithm to update the geometric changes in the AABB streams. Moreover, in order to obtain only the computed results (i.e., collision results between AABBs) without reading back the entire output streams, we propose a streaming en/decoding strategy that can be performed in a hierarchical fashion. After determining the overlapping AABBs, we perform primitive-level (e.g., triangle) intersection checking on a serial computational model such as the CPU. We implemented the entire pipeline of our algorithm using off-the-shelf graphics processors (GPUs), such as the nVIDIA GeForce 7800 GTX, for the streaming computations, and Intel dual-core 3.4 GHz processors for the serial computations. We benchmarked our algorithm with different models of varying complexities, ranging from 15K up to 50K triangles, under various deformation motions, and the timings were obtained as 30-100 FPS depending on the complexity of the models and their relative configurations. Finally, we made comparisons with a well-known GPU-based collision detection algorithm, CULLIDE [4], and observed about a threefold performance improvement over that earlier approach. We also made comparisons with a SW-based AABB culling algorithm.
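
    Expressed in today's CUDA terms rather than the graphics-pipeline streaming of the paper, the core pairwise stage is one thread per AABB pair writing an overlap flag, which a later pass compacts for exact triangle tests. A minimal sketch:

      // aabb_overlap.cu -- massively parallel pairwise AABB tests (sketch).
      #include <cuda_runtime.h>

      struct AABB { float3 lo, hi; };

      __device__ bool overlap(const AABB& a, const AABB& b) {
          // boxes overlap iff their intervals overlap on every axis
          return a.lo.x <= b.hi.x && b.lo.x <= a.hi.x &&
                 a.lo.y <= b.hi.y && b.lo.y <= a.hi.y &&
                 a.lo.z <= b.hi.z && b.lo.z <= a.hi.z;
      }

      __global__ void pairwiseAABB(const AABB* boxesA, int nA,
                                   const AABB* boxesB, int nB, unsigned char* flags)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;  // box from object A
          int j = blockIdx.y * blockDim.y + threadIdx.y;  // box from object B
          if (i >= nA || j >= nB) return;
          flags[i * nB + j] = overlap(boxesA[i], boxesB[j]) ? 1 : 0;
      }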

  9. Software beamforming: comparison between a phased array and synthetic transmit aperture.

    PubMed

    Li, Yen-Feng; Li, Pai-Chi

    2011-04-01

    The data-transfer and computation requirements are compared between software-based beamforming using a phased array (PA) and a synthetic transmit aperture (STA). The advantages of a software-based architecture are reduced system complexity and lower hardware cost. Although this architecture can be implemented using commercial CPUs or GPUs, the high computation and data-transfer requirements limit its real-time beamforming performance. In particular, transferring the raw rf data from the front-end subsystem to the software back end remains challenging with current state-of-the-art electronics technologies, which offsets the cost advantage of the software back end. This study investigated the tradeoff between the data-transfer and computation requirements. Two beamforming methods, based on a PA and an STA, respectively, were used: the former requires a higher data-transfer rate and the latter requires more memory operations. The beamformers were implemented on an NVIDIA GeForce GTX 260 GPU and an Intel Core i7 920 CPU. The frame rate of PA beamforming was 42 fps with a 128-element array transducer, 2048 samples per firing, and 189 beams per image (a 95 MB/frame data-transfer requirement). The frame rate of STA beamforming was 40 fps with 16 firings per image (an 8 MB/frame data-transfer requirement). Both approaches achieved real-time beamforming performance, but each had its own bottleneck: the required data-transfer rate was considerably lower in STA beamforming, but it required more memory operations, which limited the overall computation time. The advantages of the GPU approach over the CPU approach were clearly demonstrated.

  10. Parallel algorithm for solving Kepler’s equation on Graphics Processing Units: Application to analysis of Doppler exoplanet searches

    NASA Astrophysics Data System (ADS)

    Ford, Eric B.

    2009-05-01

    We present the results of a highly parallel Kepler equation solver using the Graphics Processing Unit (GPU) on a commercial nVidia GeForce 280GTX and the "Compute Unified Device Architecture" (CUDA) programming environment. We apply this to evaluate a goodness-of-fit statistic (e.g., χ²) for Doppler observations of stars potentially harboring multiple planetary companions (assuming negligible planet-planet interactions). Given the high dimensionality of the model parameter space (at least five dimensions per planet), a global search is extremely computationally demanding. We expect that the underlying Kepler solver and model evaluator will be combined with a wide variety of more sophisticated algorithms to provide efficient global search, parameter estimation, model comparison, and adaptive experimental design for radial velocity and/or astrometric planet searches. We tested multiple implementations using single precision, double precision, pairs of single precision, and mixed precision arithmetic. We find that the vast majority of computations can be performed using single precision arithmetic, with selective use of compensated summation for increased precision. However, standard single precision is not adequate for calculating the mean anomaly from the time of observation and orbital period when evaluating the goodness-of-fit for real planetary systems and observational data sets. Using all double precision, our GPU code outperforms a similar code using a modern CPU by a factor of over 60. Using mixed precision, our GPU code provides a speed-up factor of over 600 when evaluating n_sys > 1024 model planetary systems, each containing n_pl = 4 planets, and assuming n_obs = 256 observations of each system. We conclude that modern GPUs offer a powerful tool for repeatedly evaluating Kepler's equation and a goodness-of-fit statistic for orbital models when presented with a large parameter space.
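
    Since Kepler's equation E - e*sin(E) = M has no closed-form solution, each evaluation is a small iterative root-find, one per thread. The sketch below runs a fixed number of single-precision Newton iterations from the starting guess E0 = M; the iteration count and starting guess are simplifications relative to the paper's solver.

      // kepler_solve.cu -- per-thread Newton iteration for Kepler's equation (sketch).
      #include <cuda_runtime.h>

      __global__ void solveKepler(const float* M, const float* ecc,
                                  float* E, int n, int iters)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i >= n) return;
          float m = M[i], e = ecc[i];
          float x = m;                       // E0 = M is a safe starting guess
          for (int k = 0; k < iters; ++k) {
              float s, c;
              __sincosf(x, &s, &c);
              // Newton step on f(E) = E - e*sin(E) - M, f'(E) = 1 - e*cos(E)
              x -= (x - e * s - m) / (1.0f - e * c);
          }
          E[i] = x;
      }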

  11. Fast analysis of molecular dynamics trajectories with graphics processing units-Radial distribution function histogramming

    SciTech Connect

    Levine, Benjamin G., E-mail: ben.levine@temple.edu; Stone, John E., E-mail: johns@ks.uiuc.edu; Kohlmeyer, Axel, E-mail: akohlmey@temple.edu

    2011-05-01

    The calculation of radial distribution functions (RDFs) from molecular dynamics trajectory data is a common and computationally expensive analysis task. The rate-limiting step in the calculation of the RDF is building a histogram of the distance between atom pairs in each trajectory frame. Here we present an implementation of this histogramming scheme for multiple graphics processing units (GPUs). The algorithm features a tiling scheme to maximize the reuse of data at the fastest levels of the GPU's memory hierarchy and dynamic load balancing to allow high performance on heterogeneous configurations of GPUs. Several versions of the RDF algorithm are presented, utilizing the specific hardware features found on different generations of GPUs. We take advantage of the larger shared memory and atomic memory operations available on state-of-the-art GPUs to accelerate the code significantly. The use of atomic memory operations allows the fast, limited-capacity on-chip memory to be used much more efficiently, resulting in a fivefold increase in performance compared to the version of the algorithm without atomic operations. The ultimate version of the algorithm running in parallel on four NVIDIA GeForce GTX 480 (Fermi) GPUs was found to be 92 times faster than a multithreaded implementation running on an Intel Xeon 5550 CPU. On this multi-GPU hardware, the RDF between two selections of 1,000,000 atoms each can be calculated in 26.9 s per frame. The multi-GPU RDF algorithms described here are implemented in VMD, a widely used and freely available software package for molecular dynamics visualization and analysis.
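
    The two-level histogramming pattern described above, shared-memory atomics within each block followed by a single merge into the global histogram, can be sketched as follows. The bin count and data layout are assumptions, and the tiling and multi-GPU load balancing of the real VMD implementation are omitted.

      // rdf_hist.cu -- pair-distance histogram with on-chip atomics (sketch).
      #include <cuda_runtime.h>

      #define NBINS 256   // assumed bin count; must fit in shared memory

      __global__ void rdfHistogram(const float3* selA, int nA,
                                   const float3* selB, int nB,
                                   float binWidth, unsigned int* histGlobal)
      {
          __shared__ unsigned int hist[NBINS];
          for (int b = threadIdx.x; b < NBINS; b += blockDim.x) hist[b] = 0;
          __syncthreads();

          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < nA) {
              float3 a = selA[i];
              for (int j = 0; j < nB; ++j) {   // all pair distances for atom i
                  float dx = a.x - selB[j].x, dy = a.y - selB[j].y, dz = a.z - selB[j].z;
                  int bin = (int)(sqrtf(dx * dx + dy * dy + dz * dz) / binWidth);
                  if (bin < NBINS) atomicAdd(&hist[bin], 1u);   // fast on-chip atomic
              }
          }
          __syncthreads();
          for (int b = threadIdx.x; b < NBINS; b += blockDim.x)
              atomicAdd(&histGlobal[b], hist[b]);   // one global merge per block
      }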

  12. GeauxDock: Accelerating Structure-Based Virtual Screening with Heterogeneous Computing.

    PubMed

    Fang, Ye; Ding, Yun; Feinstein, Wei P; Koppelman, David M; Moreno, Juana; Jarrell, Mark; Ramanujam, J; Brylinski, Michal

    2016-01-01

    Computational modeling of drug binding to proteins is an integral component of direct drug design. Particularly, structure-based virtual screening is often used to perform large-scale modeling of putative associations between small organic molecules and their pharmacologically relevant protein targets. Because of a large number of drug candidates to be evaluated, an accurate and fast docking engine is a critical element of virtual screening. Consequently, highly optimized docking codes are of paramount importance for the effectiveness of virtual screening methods. In this communication, we describe the implementation, tuning and performance characteristics of GeauxDock, a recently developed molecular docking program. GeauxDock is built upon the Monte Carlo algorithm and features a novel scoring function combining physics-based energy terms with statistical and knowledge-based potentials. Developed specifically for heterogeneous computing platforms, the current version of GeauxDock can be deployed on modern, multi-core Central Processing Units (CPUs) as well as massively parallel accelerators, Intel Xeon Phi and NVIDIA Graphics Processing Unit (GPU). First, we carried out a thorough performance tuning of the high-level framework and the docking kernel to produce a fast serial code, which was then ported to shared-memory multi-core CPUs yielding a near-ideal scaling. Further, using Xeon Phi gives 1.9× performance improvement over a dual 10-core Xeon CPU, whereas the best GPU accelerator, GeForce GTX 980, achieves a speedup as high as 3.5×. On that account, GeauxDock can take advantage of modern heterogeneous architectures to considerably accelerate structure-based virtual screening applications. GeauxDock is open-sourced and publicly available at www.brylinski.org/geauxdock and https://figshare.com/articles/geauxdock_tar_gz/3205249.

  13. Fast Analysis of Molecular Dynamics Trajectories with Graphics Processing Units—Radial Distribution Function Histogramming

    PubMed Central

    Stone, John E.; Kohlmeyer, Axel

    2011-01-01

    The calculation of radial distribution functions (RDFs) from molecular dynamics trajectory data is a common and computationally expensive analysis task. The rate limiting step in the calculation of the RDF is building a histogram of the distance between atom pairs in each trajectory frame. Here we present an implementation of this histogramming scheme for multiple graphics processing units (GPUs). The algorithm features a tiling scheme to maximize the reuse of data at the fastest levels of the GPU’s memory hierarchy and dynamic load balancing to allow high performance on heterogeneous configurations of GPUs. Several versions of the RDF algorithm are presented, utilizing the specific hardware features found on different generations of GPUs. We take advantage of larger shared memory and atomic memory operations available on state-of-the-art GPUs to accelerate the code significantly. The use of atomic memory operations allows the fast, limited-capacity on-chip memory to be used much more efficiently, resulting in a fivefold increase in performance compared to the version of the algorithm without atomic operations. The ultimate version of the algorithm running in parallel on four NVIDIA GeForce GTX 480 (Fermi) GPUs was found to be 92 times faster than a multithreaded implementation running on an Intel Xeon 5550 CPU. On this multi-GPU hardware, the RDF between two selections of 1,000,000 atoms each can be calculated in 26.9 seconds per frame. The multi-GPU RDF algorithms described here are implemented in VMD, a widely used and freely available software package for molecular dynamics visualization and analysis. PMID:21547007

  14. SU-C-BRC-07: Parametrized GPU Accelerated Electron Monte Carlo Second Check

    SciTech Connect

    Haywood, J

    Purpose: I am presenting a parameterized 3D GPU-accelerated electron Monte Carlo second check program. Method: I wrote the 3D grid dose calculation algorithm in CUDA and utilized an NVIDIA GeForce GTX 780 Ti to run all of the calculations. The electron path beyond the distal end of the cone is governed by four parameters: the amplitude of scattering (AMP), the mean and width of a Gaussian energy distribution (E and α), and the percentage of photons. In my code, I adjusted all parameters until the calculated PDD and profile fit the measured 10×10 open beam data within 1%/1mm. I then wrote a user interface for reading the DICOM treatment plan and images in Python. In order to verify the algorithm, I calculated 3D dose distributions on a variety of phantoms and geometries and compared them with the Eclipse eMC calculations. I also calculated several patient-specific dose distributions, including a nose and an ear. Finally, I compared my algorithm's computation times to Eclipse's. Results: The calculated MU for all of the investigated geometries agree with the TPS within the TG-114 action level of 5%. The MU for the nose was <0.5% different, while the MU for the ear at 105 SSD was ~2% different. Calculation times for a 12 MeV 10×10 open beam ranged from 1 second for a 2.5 mm grid resolution with ~15 million particles to 33 seconds on a 1 mm grid with ~460 million particles. Eclipse calculation runtimes distributed over 10 FAS workers were 9 seconds to 15 minutes, respectively. Conclusion: The GPU-accelerated second check allows quick MU verification while accounting for patient-specific geometry and heterogeneity.

  15. NLSEmagic: Nonlinear Schrödinger equation multi-dimensional Matlab-based GPU-accelerated integrators using compact high-order schemes

    NASA Astrophysics Data System (ADS)

    Caplan, R. M.

    2013-04-01

    and both second- and fourth-order differencing in space. The integrators are written to run on NVIDIA GPUs and are interfaced with MATLAB including built-in visualization and analysis tools. Restrictions: The main restriction for the GPU integrators is the amount of RAM on the GPU as the code is currently only designed for running on a single GPU. Unusual features: Ability to visualize real-time simulations through the interaction of MATLAB and the compiled GPU integrators. Additional comments: Setup guide and Installation guide provided. Program has a dedicated web site at www.nlsemagic.com. Running time: A three-dimensional run with a grid dimension of 87×87×203 for 3360 time steps (100 non-dimensional time units) takes about one and a half minutes on a GeForce GTX 580 GPU card.

  16. Computer simulations and real-time control of ELT AO systems using graphical processing units

    NASA Astrophysics Data System (ADS)

    Wang, Lianqi; Ellerbroek, Brent

    2012-07-01

    The adaptive optics (AO) simulations at the Thirty Meter Telescope (TMT) have been carried out using the efficient, C-based multi-threaded adaptive optics simulator (MAOS, http://github.com/lianqiw/maos). By porting time-critical parts of MAOS to graphics processing units (GPUs) using NVIDIA CUDA technology, we achieved a 10-fold speedup for each GTX 580 GPU used, compared to a modern quad-core CPU. Each time step of a full-scale end-to-end simulation of the TMT narrow-field infrared AO system (NFIRAOS) takes only 0.11 seconds on a desktop with two GTX 580s. We also demonstrate that the TMT minimum variance reconstructor can be assembled in matrix-vector multiply (MVM) format in 8 seconds with 8 GTX 580 GPUs, meeting the TMT requirement for updating the reconstructor. Analysis shows that it is also possible to apply the MVM using 8 GTX 580s within the required latency.

  17. A New GPU-Enabled MODTRAN Thermal Model for the PLUME TRACKER Volcanic Emission Analysis Toolkit

    NASA Astrophysics Data System (ADS)

    Acharya, P. K.; Berk, A.; Guiang, C.; Kennett, R.; Perkins, T.; Realmuto, V. J.

    2013-12-01

    Real-time quantification of volcanic gaseous and particulate releases is important for (1) recognizing rapid increases in SO2 gaseous emissions which may signal an impending eruption; (2) characterizing ash clouds to enable safe and efficient commercial aviation; and (3) quantifying the impact of volcanic aerosols on climate forcing. The Jet Propulsion Laboratory (JPL) has developed state-of-the-art algorithms, embedded in their analyst-driven Plume Tracker toolkit, for performing SO2, NH3, and CH4 retrievals from remotely sensed multi-spectral thermal infrared imagery. While Plume Tracker provides accurate results, it typically requires extensive analyst time. A major bottleneck in this processing is the relatively slow but accurate FORTRAN-based MODTRAN atmospheric and plume radiance model, developed by Spectral Sciences, Inc. (SSI). To overcome this bottleneck, SSI, in collaboration with JPL, is porting these slow thermal radiance algorithms onto massively parallel, relatively inexpensive and commercially available GPUs. This paper discusses SSI's efforts to accelerate the MODTRAN thermal emission algorithms used by Plume Tracker. Specifically, we are developing a GPU implementation of the Curtis-Godson averaging and the Voigt in-band transmittances from near-line-center molecular absorption, which comprise the major computational bottleneck. The transmittance calculations were decomposed into separate functions, individually implemented as GPU kernels, and tested for accuracy and performance relative to the original CPU code. Speedup factors of 14 to 30× were realized for individual processing components on an NVIDIA GeForce GTX 295 graphics card with no loss of accuracy. Due to the separate host (CPU) and device (GPU) memory spaces, a redesign of the MODTRAN architecture was required to ensure efficient data transfer between host and device and to facilitate high parallel throughput. Currently, we are incorporating the separate GPU kernels into a unified GPU implementation of the accelerated thermal emission algorithms.

  18. Real-time dose computation: GPU-accelerated source modeling and superposition/convolution

    SciTech Connect

    Jacques, Robert; Wong, John; Taylor, Russell

    Purpose: To accelerate dose calculation to interactive rates using highly parallel graphics processing units (GPUs). Methods: The authors have extended their prior work in GPU-accelerated superposition/convolution with a modern dual-source model and have enhanced performance. The primary source algorithm supports both focused leaf ends and asymmetric rounded leaf ends. The extra-focal algorithm uses a discretized, isotropic area source and models multileaf collimator leaf height effects. The spectral and attenuation effects of static beam modifiers were integrated into each source's spectral function. The authors introduce the concepts of arc superposition and delta superposition. Arc superposition utilizes separate angular sampling for the total energy released per unit mass (TERMA) and superposition computations to increase accuracy and performance. Delta superposition allows single-beamlet changes to be computed efficiently. The authors extended their concept of multi-resolution superposition to include kernel tilting. Multi-resolution superposition approximates solid-angle ray-tracing, improving performance and scalability with a minor loss in accuracy. Superposition/convolution was implemented using the inverse cumulative-cumulative kernel and exact radiological path ray-tracing. The accuracy analyses were performed using multiple kernel ray samplings, both with and without kernel tilting and multi-resolution superposition. Results: Source model performance was <9 ms (data dependent) for a high-resolution (400^2) field using an NVIDIA (Santa Clara, CA) GeForce GTX 280. Computation of the physically correct multispectral TERMA attenuation was improved by a material-centric approach, which increased performance by over 80%. Superposition performance was improved by ~24%, to 0.058 and 0.94 s for 64^3 and 128^3 water phantoms, a speed-up of 101-144x over the highly optimized Pinnacle^3 (Philips, Madison, WI) implementation.

  19. A GPU-accelerated and Monte Carlo-based intensity modulated proton therapy optimization system.

    PubMed

    Ma, Jiasen; Beltran, Chris; Seum Wan Chan Tseung, Hok; Herman, Michael G

    2014-12-01

    Conventional spot scanning intensity modulated proton therapy (IMPT) treatment planning systems (TPSs) optimize proton spot weights based on analytical dose calculations. These analytical dose calculations have been shown to have severe limitations in heterogeneous materials. Monte Carlo (MC) methods do not have these limitations; however, MC-based systems have been of limited clinical use due to the large number of beam spots in IMPT and the extremely long calculation time of traditional MC techniques. In this work, the authors present a clinically applicable IMPT TPS that utilizes a very fast MC calculation. An in-house graphics processing unit (GPU)-based MC dose calculation engine was employed to generate the dose influence map for each proton spot. With the MC generated influence map, a modified least-squares optimization method was used to achieve the desired dose volume histograms (DVHs). The intrinsic CT image resolution was adopted for voxelization in simulation and optimization to preserve spatial resolution. The optimizations were computed on a multi-GPU framework to mitigate the memory limitation issues for the large dose influence maps that resulted from maintaining the intrinsic CT resolution. The effects of tail cutoff and starting condition were studied and minimized in this work. For relatively large and complex three-field head and neck cases, i.e., >100,000 spots with a target volume of ∼1000 cm³ and multiple surrounding critical structures, the optimization together with the initial MC dose influence map calculation was done in a clinically viable time frame (less than 30 min) on a GPU cluster consisting of 24 Nvidia GeForce GTX Titan cards. The in-house MC TPS plans were comparable to commercial TPS plans based on DVH comparisons. An MC-based treatment planning system was developed. The treatment planning can be performed in a clinically viable time frame on a hardware system costing around 45,000 dollars. The fast calculation and

  20. An efficient mixed-precision, hybrid CPU-GPU implementation of a nonlinearly implicit one-dimensional particle-in-cell algorithm

    SciTech Connect

    Chen, Guangye; Chacon, Luis; Barnes, Daniel C

    2012-01-01

    Recently, a fully implicit, energy- and charge-conserving particle-in-cell method has been developed for multi-scale, full-f kinetic simulations [G. Chen, et al., J. Comput. Phys. 230, 18 (2011)]. The method employs a Jacobian-free Newton-Krylov (JFNK) solver and is capable of using very large timesteps without loss of numerical stability or accuracy. A fundamental feature of the method is the segregation of particle orbit integrations from the field solver, while remaining fully self-consistent. This provides great flexibility, and dramatically improves the solver efficiency by reducing the degrees of freedom of the associated nonlinear system. However, it requires a particle push per nonlinear residual evaluation, which makes the particle push the most time-consuming operation in the algorithm. This paper describes a very efficient mixed-precision, hybrid CPU-GPU implementation of the implicit PIC algorithm. The JFNK solver is kept on the CPU (in double precision), while the inherent data parallelism of the particle mover is exploited by implementing it in single-precision on a graphics processing unit (GPU) using CUDA. Performance-oriented optimizations, with the aid of an analytical performance model, the roofline model, are employed. Despite being highly dynamic, the adaptive, charge-conserving particle mover algorithm achieves up to 300-400 GOp/s (including single-precision floating-point, integer, and logic operations) on an Nvidia GeForce GTX580, corresponding to 20-25% absolute GPU efficiency (against the peak theoretical performance) and 50-70% intrinsic efficiency (against the algorithm's maximum operational throughput, which neglects all latencies). This is about 200-300 times faster than an equivalent serial CPU implementation. When the single-precision GPU particle mover is combined with a double-precision CPU JFNK field solver, overall performance gains of ~100× vs. the double-precision CPU-only serial version are obtained, with no apparent loss of
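
    As a concrete illustration of the division of labor described above, the following is a minimal sketch — not the authors' code — of the kind of single-precision particle-push kernel such a mixed-precision scheme offloads to the GPU, with the field array produced by a double-precision CPU-side solver. All names and parameters (pushParticles, qm, dt, the linear field gather) are illustrative assumptions.

```cuda
// Hedged sketch: single-precision 1D particle push on the GPU, assuming the
// field E comes from a double-precision CPU field solve. Not the paper's code.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void pushParticles(float *x, float *v, const float *E,
                              int np, int nx, float dx, float qm, float dt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= np) return;

    // Gather: linear interpolation of the grid field to the particle position.
    float xi   = x[i] / dx;
    int   cell = (int)floorf(xi);
    float w    = xi - cell;
    float Ep   = (1.0f - w) * E[cell % nx] + w * E[(cell + 1) % nx];

    // Leapfrog update in single precision.
    float vn = v[i] + qm * Ep * dt;
    float xn = x[i] + vn * dt;

    // Periodic boundary.
    float L = nx * dx;
    if (xn < 0.0f) xn += L;
    if (xn >= L)   xn -= L;

    v[i] = vn;
    x[i] = xn;
}

int main()
{
    const int np = 1 << 20, nx = 256;
    const float dx = 1.0f, dt = 0.1f, qm = -1.0f;
    float *x, *v, *E;
    cudaMallocManaged(&x, np * sizeof(float));
    cudaMallocManaged(&v, np * sizeof(float));
    cudaMallocManaged(&E, nx * sizeof(float));
    for (int i = 0; i < np; ++i) { x[i] = (i % nx) * dx; v[i] = 0.01f; }
    for (int j = 0; j < nx; ++j) E[j] = 0.0f;

    pushParticles<<<(np + 255) / 256, 256>>>(x, v, E, np, nx, dx, qm, dt);
    cudaDeviceSynchronize();
    printf("x[0] = %f, v[0] = %f\n", x[0], v[0]);
    cudaFree(x); cudaFree(v); cudaFree(E);
    return 0;
}
```

    Keeping the push in single precision roughly halves the memory traffic for particle data, which is the dominant cost of a mover kernel of this kind.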

  1. Initial development of goCMC: a GPU-oriented fast cross-platform Monte Carlo engine for carbon ion therapy

    PubMed Central

    Qin, Nan; Pinto, Marco; Tian, Zhen; Dedes, Georgios; Pompos, Arnold; Jiang, Steve B.; Parodi, Katia; Jia, Xun

    2017-01-01

    voxel size, the computation time to simulate 10⁷ carbons was 9.9–125 sec, 2.5–50 sec and 60–612 sec on an AMD Radeon GPU card, an NVidia GeForce GTX 1080 GPU card and an Intel Xeon E5-2640 CPU, respectively. The combined accuracy, efficiency and portability make goCMC attractive for research and clinical applications in carbon ion therapy. PMID:28140352

  2. Fast GPU-based Monte Carlo code for SPECT/CT reconstructions generates improved 177Lu images.

    PubMed

    Rydén, T; Heydorn Lagerlöf, J; Hemmingsson, J; Marin, I; Svensson, J; Båth, M; Gjertsson, P; Bernhardt, P

    2018-01-04

    Full Monte Carlo (MC)-based SPECT reconstructions have a strong potential for correcting for image degrading factors, but the reconstruction times are long. The objective of this study was to develop a highly parallel Monte Carlo code for fast, ordered subset expectation maximum (OSEM) reconstructions of SPECT/CT images. The MC code was written in the Compute Unified Device Architecture language for a computer with four graphics processing units (GPUs) (GeForce GTX Titan X, Nvidia, USA). This enabled simulations of parallel photon emissions from the voxel matrix (128³ or 256³). Each computed tomography (CT) number was converted to attenuation coefficients for photo absorption, coherent scattering, and incoherent scattering. For photon scattering, the deflection angle was determined by the differential scattering cross sections. An angular response function was developed and used to model the accepted angles for photon interaction with the crystal, and a detector scattering kernel was used for modeling the photon scattering in the detector. Predefined energy and spatial resolution kernels for the crystal were used. The MC code was implemented in the OSEM reconstruction of clinical and phantom ¹⁷⁷Lu SPECT/CT images. The Jaszczak image quality phantom was used to evaluate the performance of the MC reconstruction in comparison with attenuation-corrected (AC) OSEM reconstructions and attenuation-corrected OSEM reconstructions with resolution recovery corrections (RRC). The performance of the MC code was 3200 million photons/s. The required number of photons emitted per voxel to obtain a sufficiently low noise level in the simulated image was 200 for a 128³ voxel matrix. With this number of emitted photons/voxel, the MC-based OSEM reconstruction with ten subsets was performed within 20 s/iteration. The images converged after around six iterations. Therefore, the reconstruction time was around 3 min. The activity recovery for the spheres in the Jaszczak phantom was
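
    The CT-number conversion step mentioned above is a natural data-parallel kernel. The sketch below is illustrative only — the piecewise-linear HU-to-μ mapping and the coefficients muWater and muBone are assumptions, not the published code — and shows the one-thread-per-voxel pattern such a preprocessing pass would use.

```cuda
// Hedged sketch: per-voxel conversion of CT numbers (Hounsfield units) to
// linear attenuation coefficients before MC photon transport. Illustrative
// mapping and coefficients; not the authors' implementation.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void huToMu(const short *hu, float *mu, int nvox,
                       float muWater, float muBone)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nvox) return;
    float h = (float)hu[i];
    // Below water (HU <= 0): scale between air (mu ~ 0) and water.
    // Above water: interpolate toward a bone-like coefficient at HU = 1000.
    mu[i] = (h <= 0.0f) ? muWater * (1.0f + h / 1000.0f)
                        : muWater + (muBone - muWater) * (h / 1000.0f);
    if (mu[i] < 0.0f) mu[i] = 0.0f;  // clamp air/padding voxels
}

int main()
{
    const int nvox = 128 * 128 * 128;
    short *hu; float *mu;
    cudaMallocManaged(&hu, nvox * sizeof(short));
    cudaMallocManaged(&mu, nvox * sizeof(float));
    for (int i = 0; i < nvox; ++i) hu[i] = (short)((i % 7) * 100 - 100);
    huToMu<<<(nvox + 255) / 256, 256>>>(hu, mu, nvox, 0.15f, 0.25f);
    cudaDeviceSynchronize();
    printf("mu[0] = %.4f cm^-1\n", mu[0]);
    cudaFree(hu); cudaFree(mu);
    return 0;
}
```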

  3. Initial development of goCMC: a GPU-oriented fast cross-platform Monte Carlo engine for carbon ion therapy

    NASA Astrophysics Data System (ADS)

    Qin, Nan; Pinto, Marco; Tian, Zhen; Dedes, Georgios; Pompos, Arnold; Jiang, Steve B.; Parodi, Katia; Jia, Xun

    2017-05-01

    energy and voxel size, the computation time to simulate 10⁷ carbons was 9.9-125 s, 2.5-50 s and 60-612 s on an AMD Radeon GPU card, an NVidia GeForce GTX 1080 GPU card and an Intel Xeon E5-2640 CPU, respectively. The combined accuracy, efficiency and portability make goCMC attractive for research and clinical applications in carbon ion therapy.

  4. CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment

    PubMed Central

    Manavski, Svetlin A; Valle, Giorgio

    2008-01-01

    Background Searching for similarities in protein and DNA databases has become a routine procedure in Molecular Biology. The Smith-Waterman algorithm has been available for more than 25 years. It is based on a dynamic programming approach that explores all the possible alignments between two sequences; as a result it returns the optimal local alignment. Unfortunately, the computational cost is very high, requiring a number of operations proportional to the product of the length of two sequences. Furthermore, the exponential growth of protein and DNA databases makes the Smith-Waterman algorithm unrealistic for searching similarities in large sets of sequences. For these reasons heuristic approaches such as those implemented in FASTA and BLAST tend to be preferred, allowing faster execution times at the cost of reduced sensitivity. The main motivation of our work is to exploit the huge computational power of commonly available graphic cards, to develop high performance solutions for sequence alignment. Results In this paper we present what we believe is the fastest solution of the exact Smith-Waterman algorithm running on commodity hardware. It is implemented in the recently released CUDA programming environment by NVidia. CUDA allows direct access to the hardware primitives of the last-generation Graphics Processing Units (GPU) G80. Speeds of more than 3.5 GCUPS (Giga Cell Updates Per Second) are achieved on a workstation running two GeForce 8800 GTX. Exhaustive tests have been done to compare our implementation to SSEARCH and BLAST, running on a 3 GHz Intel Pentium IV processor. Our solution was also compared to a recently published GPU implementation and to a Single Instruction Multiple Data (SIMD) solution. These tests show that our implementation performs from 2 to 30 times faster than any other previous attempt available on commodity hardware. Conclusions The results show that graphic cards are now sufficiently advanced to be used as efficient hardware
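
    For readers unfamiliar with the algorithm, the following is a hedged sketch of inter-task Smith-Waterman parallelism on a GPU: one thread scores the query against one database sequence, using a linear gap penalty and a simple match/mismatch score in place of a substitution matrix. It is illustrative, not the CUDA implementation described in the paper, which uses substitution profiles and carefully tuned memory layouts; QLEN, DLEN and swScore are assumptions.

```cuda
// Hedged sketch: one thread per database sequence, linear gaps, toy scoring.
#include <cuda_runtime.h>
#include <cstdio>

#define QLEN 32   // query length (fixed here for simplicity)
#define DLEN 64   // database sequence length

__global__ void swScore(const char *query, const char *db, int nseq, int *best)
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= nseq) return;
    const char *seq = db + s * DLEN;

    int prev[QLEN + 1], curr[QLEN + 1], maxScore = 0;
    for (int j = 0; j <= QLEN; ++j) prev[j] = 0;

    for (int i = 1; i <= DLEN; ++i) {
        curr[0] = 0;
        for (int j = 1; j <= QLEN; ++j) {
            int match = prev[j - 1] + (seq[i - 1] == query[j - 1] ? 2 : -1);
            int del   = prev[j] - 1;     // gap in the query
            int ins   = curr[j - 1] - 1; // gap in the subject
            int h = match > del ? match : del;
            h = h > ins ? h : ins;
            h = h > 0 ? h : 0;           // local alignment: floor at zero
            curr[j] = h;
            if (h > maxScore) maxScore = h;
        }
        for (int j = 0; j <= QLEN; ++j) prev[j] = curr[j];
    }
    best[s] = maxScore;
}

int main()
{
    const int nseq = 1024;
    char *q, *db; int *best;
    cudaMallocManaged(&q, QLEN);
    cudaMallocManaged(&db, nseq * DLEN);
    cudaMallocManaged(&best, nseq * sizeof(int));
    for (int j = 0; j < QLEN; ++j) q[j] = "ACGT"[j % 4];
    for (int i = 0; i < nseq * DLEN; ++i) db[i] = "ACGT"[i % 3];
    swScore<<<(nseq + 127) / 128, 128>>>(q, db, nseq, best);
    cudaDeviceSynchronize();
    printf("best[0] = %d\n", best[0]);
    cudaFree(q); cudaFree(db); cudaFree(best);
    return 0;
}
```

    Inter-task parallelism of this kind works well when the database holds many sequences of similar length; real implementations add intra-task (anti-diagonal) parallelism for long sequences.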

  5. Initial development of goCMC: a GPU-oriented fast cross-platform Monte Carlo engine for carbon ion therapy.

    PubMed

    Qin, Nan; Pinto, Marco; Tian, Zhen; Dedes, Georgios; Pompos, Arnold; Jiang, Steve B; Parodi, Katia; Jia, Xun

    2017-05-07

    beam energy and voxel size, the computation time to simulate 10⁷ carbons was 9.9-125 s, 2.5-50 s and 60-612 s on an AMD Radeon GPU card, an NVidia GeForce GTX 1080 GPU card and an Intel Xeon E5-2640 CPU, respectively. The combined accuracy, efficiency and portability make goCMC attractive for research and clinical applications in carbon ion therapy.

  6. Robust 3D-2D image registration: application to spine interventions and vertebral labeling in the presence of anatomical deformation

    NASA Astrophysics Data System (ADS)

    Otake, Yoshito; Wang, Adam S.; Webster Stayman, J.; Uneri, Ali; Kleinszig, Gerhard; Vogt, Sebastian; Khanna, A. Jay; Gokaslan, Ziya L.; Siewerdsen, Jeffrey H.

    2013-12-01

    We present a framework for robustly estimating registration between a 3D volume image and a 2D projection image and evaluate its precision and robustness in spine interventions for vertebral localization in the presence of anatomical deformation. The framework employs a normalized gradient information similarity metric and multi-start covariance matrix adaptation evolution strategy optimization with local-restarts, which provided improved robustness against deformation and content mismatch. The parallelized implementation allowed orders-of-magnitude acceleration in computation time and improved the robustness of registration via multi-start global optimization. Experiments involved a cadaver specimen and two CT datasets (supine and prone) and 36 C-arm fluoroscopy images acquired with the specimen in four positions (supine, prone, supine with lordosis, prone with kyphosis), three regions (thoracic, abdominal, and lumbar), and three levels of geometric magnification (1.7, 2.0, 2.4). Registration accuracy was evaluated in terms of projection distance error (PDE) between the estimated and true target points in the projection image, including 14,400 random trials (200 trials on the 72 registration scenarios) with initialization error up to ±200 mm and ±10°. The resulting median PDE was better than 0.1 mm in all cases, depending somewhat on the resolution of input CT and fluoroscopy images. The cadaver experiments illustrated the tradeoff between robustness and computation time, yielding a success rate of 99.993% in vertebral labeling (with ‘success’ defined as PDE <5 mm) using 1,718,664 ± 96,582 function evaluations computed in 54.0 ± 3.5 s on a mid-range GPU (nVidia, GeForce GTX690). Parameters yielding a faster search (e.g., fewer multi-starts) reduced robustness under conditions of large deformation and poor initialization (99.535% success for the same data registered in 13.1 s), but given good initialization (e.g., ±5 mm, assuming a robust initial

  7. Advanced mathematical on-line analysis in nuclear experiments. Usage of parallel computing CUDA routines in standard root analysis

    NASA Astrophysics Data System (ADS)

    Grzeszczuk, A.; Kowalski, S.

    2015-04-01

    Compute Unified Device Architecture (CUDA) is a parallel computing platform developed by Nvidia to increase the speed of graphics through parallel calculation. The success of this solution has opened General-Purpose Graphics Processing Unit (GPGPU) technology to applications not coupled with graphics. GPGPU systems can be applied as an effective tool for reducing the huge amount of data in pulse shape analysis measurements, either by on-line recalculation or by a very quick compression scheme. The simplified structure of the CUDA system and the programming model, illustrated on the example of an Nvidia GeForce GTX580 card, are presented in our poster contribution, both in a stand-alone version and as a ROOT application.

  8. Graphics Processing Unit Acceleration of Gyrokinetic Turbulence Simulations

    NASA Astrophysics Data System (ADS)

    Hause, Benjamin; Parker, Scott

    2012-10-01

    We find a substantial increase in on-node performance using Graphics Processing Unit (GPU) acceleration in gyrokinetic delta-f particle-in-cell simulation. Optimization is performed on a two-dimensional slab gyrokinetic particle simulation using the Portland Group Fortran compiler with the GPU accelerator compiler directives. We have implemented the GPU acceleration on a Core i7 gaming PC with an NVIDIA GTX 580 GPU. We find comparable, or better, acceleration relative to the NERSC DIRAC cluster with the NVIDIA Tesla C2050 computing processor. The Tesla C2050 is about 2.6 times more expensive than the GTX 580 gaming GPU. Optimization strategies and comparisons between DIRAC and the gaming PC will be presented. We will also discuss progress on optimizing the comprehensive three-dimensional general geometry GEM code.

  9. A Simple GPU-Accelerated Two-Dimensional MUSCL-Hancock Solver for Ideal Magnetohydrodynamics

    NASA Technical Reports Server (NTRS)

    Bard, Christopher; Dorelli, John C.

    2013-01-01

    We describe our experience using NVIDIA's CUDA (Compute Unified Device Architecture) C programming environment to implement a two-dimensional second-order MUSCL-Hancock ideal magnetohydrodynamics (MHD) solver on a GTX 480 Graphics Processing Unit (GPU). Taking a simple approach in which the MHD variables are stored exclusively in the global memory of the GTX 480 and accessed in a cache-friendly manner (without further optimizing memory access by, for example, staging data in the GPU's faster shared memory), we achieved a maximum speed-up of ≈126 for a 1024² grid relative to the sequential C code running on a single Intel Nehalem (2.8 GHz) core. This speedup is consistent with simple estimates based on the known floating point performance, memory throughput and parallel processing capacity of the GTX 480.

  10. A simple GPU-accelerated two-dimensional MUSCL-Hancock solver for ideal magnetohydrodynamics

    NASA Astrophysics Data System (ADS)

    Bard, Christopher M.; Dorelli, John C.

    2014-02-01

    We describe our experience using NVIDIA's CUDA (Compute Unified Device Architecture) C programming environment to implement a two-dimensional second-order MUSCL-Hancock ideal magnetohydrodynamics (MHD) solver on a GTX 480 Graphics Processing Unit (GPU). Taking a simple approach in which the MHD variables are stored exclusively in the global memory of the GTX 480 and accessed in a cache-friendly manner (without further optimizing memory access by, for example, staging data in the GPU's faster shared memory), we achieved a maximum speed-up of ≈126 for a 1024² grid relative to the sequential C code running on a single Intel Nehalem (2.8 GHz) core. This speedup is consistent with simple estimates based on the known floating point performance, memory throughput and parallel processing capacity of the GTX 480.
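
    The cache-friendly global-memory strategy the authors describe can be illustrated with a toy stencil kernel. This is an assumption-laden sketch of the access pattern, not their MHD solver: the fastest-varying grid index maps to threadIdx.x, so a warp's loads fall on consecutive addresses.

```cuda
// Hedged sketch of coalesced global-memory access for a 2D finite-volume
// update (toy first-order upwind advection, not MUSCL-Hancock MHD).
#include <cuda_runtime.h>
#include <cstdio>

__global__ void upwind2D(const float *u, float *un, int nx, int ny,
                         float cx, float cy)   // CFL numbers in x and y
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // x index: coalesced
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i <= 0 || j <= 0 || i >= nx || j >= ny) return;
    int id = j * nx + i;
    // Consecutive threads (i, i+1, ...) read consecutive global addresses.
    un[id] = u[id] - cx * (u[id] - u[id - 1])
                   - cy * (u[id] - u[id - nx]);
}

int main()
{
    const int nx = 1024, ny = 1024;
    float *u, *un;
    cudaMallocManaged(&u,  nx * ny * sizeof(float));
    cudaMallocManaged(&un, nx * ny * sizeof(float));
    for (int k = 0; k < nx * ny; ++k) u[k] = (k % nx < nx / 2) ? 1.0f : 0.0f;
    dim3 block(32, 8), grid((nx + 31) / 32, (ny + 7) / 8);
    upwind2D<<<grid, block>>>(u, un, nx, ny, 0.4f, 0.4f);
    cudaDeviceSynchronize();
    printf("un[center] = %f\n", un[(ny / 2) * nx + nx / 2]);
    cudaFree(u); cudaFree(un);
    return 0;
}
```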

  11. Dataflow-Based Implementation of Layered Sensing Applications on High-Performance Embedded Processors

    DTIC Science & Technology

    2013-03-01

    Table fragment (per-actor results: time in milliseconds; GFlops; comparison to GPU peak performance, %): Cascade Gaussian Filtering, 13, 45.19, 6.3; Difference of Gaussian, 0.512, 152... values for the GPU-targeted actor implementations in terms of Giga Floating Point Operations Per Second (GFLOPS). Our GFLOPS calculation for an actor... kernels. The results for GFLOPS are provided in the table. The actors were implemented on an NVIDIA GTX260 GPU, which provides 715 GFLOPS as peak

  12. 75 FR 32803 - Notice of Issuance of Final Determination Concerning a GTX Mobile+ Hand Held Computer

    Federal Register 2010, 2011, 2012, 2013, 2014

    2010-06-09

    ... Programmable Read-Only Memory ("PROM") chip, substantially transformed the PROM into a U.S. article. The... parts (such as various connectors and an Electronically Erasable Programmable Read Only Memory, or...

  13. Application of graphics processing units to search pipelines for gravitational waves from coalescing binaries of compact objects

    NASA Astrophysics Data System (ADS)

    Chung, Shin Kee; Wen, Linqing; Blair, David; Cannon, Kipp; Datta, Amitava

    2010-07-01

    We report a novel application of a graphics processing unit (GPU) for the purpose of accelerating the search pipelines for gravitational waves from coalescing binaries of compact objects. A speed-up of 16-fold in total has been achieved with an NVIDIA GeForce 8800 Ultra GPU card compared with one core of a 2.5 GHz Intel Q9300 central processing unit (CPU). We show that substantial improvements are possible and discuss the reduction in CPU count required for the detection of inspiral sources afforded by the use of GPUs.
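
    The core operation such pipelines accelerate is an FFT-based correlation of the data stream against each template. The sketch below shows that pattern with cuFFT (compile with -lcufft); it is a minimal illustration under assumed names (multiplyConj, tmpl), not the pipeline's actual code, and it omits the noise weighting and normalization a real matched filter applies.

```cuda
// Hedged sketch: frequency-domain correlation data * conj(template) via cuFFT.
#include <cufft.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <cmath>

__global__ void multiplyConj(const cufftComplex *d, const cufftComplex *t,
                             cufftComplex *out, int n)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= n) return;
    // out[k] = d[k] * conj(t[k])
    out[k].x = d[k].x * t[k].x + d[k].y * t[k].y;
    out[k].y = d[k].y * t[k].x - d[k].x * t[k].y;
}

int main()
{
    const int n = 1 << 16;
    cufftComplex *data, *tmpl, *corr;
    cudaMallocManaged(&data, n * sizeof(cufftComplex));
    cudaMallocManaged(&tmpl, n * sizeof(cufftComplex));
    cudaMallocManaged(&corr, n * sizeof(cufftComplex));
    for (int i = 0; i < n; ++i) {
        data[i] = make_float2(sinf(0.01f * i), 0.0f);
        tmpl[i] = make_float2(sinf(0.01f * i), 0.0f);
    }
    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, 1);
    cufftExecC2C(plan, data, data, CUFFT_FORWARD);  // in-place forward FFTs
    cufftExecC2C(plan, tmpl, tmpl, CUFFT_FORWARD);
    multiplyConj<<<(n + 255) / 256, 256>>>(data, tmpl, corr, n);
    cufftExecC2C(plan, corr, corr, CUFFT_INVERSE);  // back to the time domain
    cudaDeviceSynchronize();
    printf("corr[0] = %e (unnormalized)\n", corr[0].x / n);
    cufftDestroy(plan);
    cudaFree(data); cudaFree(tmpl); cudaFree(corr);
    return 0;
}
```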

  14. Classification of hyperspectral imagery using MapReduce on a NVIDIA graphics processing unit (Conference Presentation)

    NASA Astrophysics Data System (ADS)

    Ramirez, Andres; Rahnemoonfar, Maryam

    2017-04-01

    A hyperspectral image provides a multidimensional figure rich in data, consisting of hundreds of spectral dimensions. Analyzing the spectral and spatial information of such an image with linear and non-linear algorithms results in high computational time. In order to overcome this problem, this research presents a system using a MapReduce-Graphics Processing Unit (GPU) model that can help analyze a hyperspectral image through the usage of parallel hardware and a parallel programming model, which is simpler to handle compared to other low-level parallel programming models. Additionally, Hadoop was used as an open-source version of the MapReduce parallel programming model. This research compared classification accuracy results and timing results between the Hadoop and GPU system and tested it against the following test cases: the CPU and GPU test case, a CPU test case, and a test case where no dimensional reduction was applied.

  15. GPU accelerated study of heat transfer and fluid flow by lattice Boltzmann method on CUDA

    NASA Astrophysics Data System (ADS)

    Ren, Qinlong

    Lattice Boltzmann method (LBM) has been developed as a powerful numerical approach to simulate the complex fluid flow and heat transfer phenomena during the past two decades. As a mesoscale method based on the kinetic theory, LBM has several advantages compared with traditional numerical methods, such as the physical representation of microscopic interactions, the handling of complex geometries, and its highly parallel nature. The lattice Boltzmann method has been applied to solve various fluid behaviors and heat transfer processes such as conjugate heat transfer, magnetic and electric fields, diffusion and mixing processes, chemical reactions, multiphase flow, phase change processes, non-isothermal flow in porous media, microfluidics, and fluid-structure interactions in biological systems. In addition, as a non-body-conformal grid method, the immersed boundary method (IBM) can be applied to handle complex or moving geometries in the domain. The immersed boundary method can be coupled with the lattice Boltzmann method to study heat transfer and fluid flow problems: heat transfer and fluid flow are solved on Euler nodes by LBM, while the complex solid geometries are captured by Lagrangian nodes using the immersed boundary method. Parallel computing has been a popular topic for many decades as a way to accelerate computation in engineering and scientific fields. Today, almost all laptops and desktops have central processing units (CPUs) with multiple cores which can be used for parallel computing. However, the cost of CPUs with hundreds of cores is still high, which limits their capability for high performance computing on personal computers. Graphics processing units (GPUs), originally used for computer video cards, have emerged as the most powerful high-performance workstations in recent years. Unlike CPUs, GPUs with thousands of cores are inexpensive. For example, the GPU (GeForce GTX TITAN) which is used in the current work has 2688 cores and the price is only 1
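
    The per-node independence that makes LBM "highly parallel" is easy to see in code. Below is a hedged sketch of a D2Q9 BGK collision kernel — one thread per lattice node, distributions stored structure-of-arrays so each of the nine loads is coalesced across a warp. It is illustrative, not the dissertation's solver, and omits streaming, boundary conditions and the immersed-boundary coupling.

```cuda
// Hedged sketch: D2Q9 BGK collision step, one thread per lattice node.
#include <cuda_runtime.h>
#include <cstdio>

__constant__ float w9[9]  = {4.f/9, 1.f/9, 1.f/9, 1.f/9, 1.f/9,
                             1.f/36, 1.f/36, 1.f/36, 1.f/36};
__constant__ int   cx9[9] = {0, 1, 0, -1, 0, 1, -1, -1, 1};
__constant__ int   cy9[9] = {0, 0, 1, 0, -1, 1, 1, -1, -1};

__global__ void bgkCollide(float *f, int nnode, float omega)
{
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (n >= nnode) return;

    // Macroscopic density and velocity from the nine distributions.
    float rho = 0.f, ux = 0.f, uy = 0.f, fi[9];
    for (int q = 0; q < 9; ++q) {
        fi[q] = f[q * nnode + n];        // SoA layout: coalesced load
        rho += fi[q];
        ux  += fi[q] * cx9[q];
        uy  += fi[q] * cy9[q];
    }
    ux /= rho; uy /= rho;

    // Relax toward the second-order equilibrium distribution.
    float usq = ux * ux + uy * uy;
    for (int q = 0; q < 9; ++q) {
        float cu  = cx9[q] * ux + cy9[q] * uy;
        float feq = w9[q] * rho * (1.f + 3.f * cu + 4.5f * cu * cu - 1.5f * usq);
        f[q * nnode + n] = fi[q] + omega * (feq - fi[q]);
    }
}

int main()
{
    const int nnode = 256 * 256;
    float *f;
    cudaMallocManaged(&f, 9 * nnode * sizeof(float));
    for (int q = 0; q < 9; ++q)
        for (int n = 0; n < nnode; ++n)
            f[q * nnode + n] = (q == 0) ? 4.f/9 : (q < 5 ? 1.f/9 : 1.f/36);
    bgkCollide<<<(nnode + 255) / 256, 256>>>(f, nnode, 1.0f);
    cudaDeviceSynchronize();
    printf("f[0] = %f\n", f[0]);
    cudaFree(f);
    return 0;
}
```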

  16. Rapid data processing for ultrafast X-ray computed tomography using scalable and modular CUDA based pipelines

    NASA Astrophysics Data System (ADS)

    Frust, Tobias; Wagner, Michael; Stephan, Jan; Juckeland, Guido; Bieberle, André

    2017-10-01

    Ultrafast X-ray tomography is an advanced imaging technique for the study of dynamic processes based on the principles of electron beam scanning. A typical application case for this technique is e.g. the study of multiphase flows, that is, flows of mixtures of substances such as gas-liquid flows in pipelines or chemical reactors. At Helmholtz-Zentrum Dresden-Rossendorf (HZDR) a number of such tomography scanners are operated. Currently, there are two main points limiting their application in some fields. First, after each CT scan sequence the data of the radiation detector must be downloaded from the scanner to a data processing machine. Second, the current data processing is comparably time-consuming compared to the CT scan sequence interval. To enable online observations or use this technique to control actuators in real-time, a modular and scalable data processing tool has been developed, consisting of user-definable stages working independently together in a so-called data processing pipeline, that keeps up with the CT scanner's maximal frame rate of up to 8 kHz. The newly developed data processing stages are freely programmable and combinable. In order to achieve the highest processing performance, all relevant data processing steps, which are required for a standard slice image reconstruction, were individually implemented in separate stages using Graphics Processing Units (GPUs) and NVIDIA's CUDA programming language. Data processing performance tests on different high-end GPUs (Tesla K20c, GeForce GTX 1080, Tesla P100) showed excellent performance. Program Files doi: http://dx.doi.org/10.17632/65sx747rvm.1 Licensing provisions: LGPLv3 Programming language: C++/CUDA Supplementary material: Test data set, used for the performance analysis. Nature of problem: Ultrafast computed tomography is performed with a scan rate of up to 8 kHz. To obtain cross-sectional images from projection data, computer-based image reconstruction algorithms must be applied. The
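
    A minimal sketch of the pipelining idea (assumed names and sizes; not the HZDR code): successive detector frames are issued to independent CUDA streams from pinned host memory, so the upload of one frame overlaps the processing of the previous one and the download of the one before that.

```cuda
// Hedged sketch: overlapping H2D copy, kernel work and D2H copy with streams.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void processFrame(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 0.5f + 1.0f;   // stand-in for reconstruction work
}

int main()
{
    const int nFrames = 8, n = 1 << 20, nStreams = 4;
    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    float *hFrames, *dFrames;
    cudaMallocHost(&hFrames, nFrames * n * sizeof(float)); // pinned: async copies
    cudaMalloc(&dFrames, nStreams * n * sizeof(float));
    for (int i = 0; i < nFrames * n; ++i) hFrames[i] = 1.0f;

    for (int k = 0; k < nFrames; ++k) {
        int s = k % nStreams;                 // round-robin over streams
        float *h = hFrames + (size_t)k * n;
        float *d = dFrames + (size_t)s * n;
        cudaMemcpyAsync(d, h, n * sizeof(float), cudaMemcpyHostToDevice, streams[s]);
        processFrame<<<(n + 255) / 256, 256, 0, streams[s]>>>(d, n);
        cudaMemcpyAsync(h, d, n * sizeof(float), cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();
    printf("frame 0, sample 0 = %f\n", hFrames[0]);
    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFreeHost(hFrames); cudaFree(dFrames);
    return 0;
}
```

    Because work within one stream executes in order, reusing a device buffer per stream is safe here; frames that share a stream are serialized automatically.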

  17. Accelerating Pseudo-Random Number Generator for MCNP on GPU

    NASA Astrophysics Data System (ADS)

    Gong, Chunye; Liu, Jie; Chi, Lihua; Hu, Qingfeng; Deng, Li; Gong, Zhenghu

    2010-09-01

    Pseudo-random number generators (PRNGs) are intensively used in many stochastic algorithms in particle simulations, artificial neural networks and other scientific computation. The PRNG in the Monte Carlo N-Particle Transport Code (MCNP) requires a long period, high quality, flexible jump-ahead and high speed. In this paper, we implement such a PRNG for MCNP on NVIDIA's GTX200 Graphics Processing Units (GPUs) using the CUDA programming model. Results show that speedups of 3.80 to 8.10 times are achieved compared with 4- to 6-core CPUs, and more than 679.18 million double precision random numbers can be generated per second on the GPU.
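
    The "flexible jump-ahead" requirement is what makes a PRNG parallelizable: each thread jumps the global sequence ahead by its own offset, so the GPU reproduces a single serial stream. The sketch below shows the standard O(log k) affine skip-ahead for a 64-bit LCG; the constants and names (LCG_A, lcgSkip, genUniform) are illustrative assumptions, not the generator used in MCNP.

```cuda
// Hedged sketch: per-thread LCG substreams via logarithmic skip-ahead.
#include <cuda_runtime.h>
#include <cstdio>

#define LCG_A 2862933555777941757ULL
#define LCG_C 3037000493ULL

// Jump the LCG ahead k steps in O(log k) by squaring the affine map.
__host__ __device__ unsigned long long lcgSkip(unsigned long long s,
                                               unsigned long long k)
{
    unsigned long long a = LCG_A, c = LCG_C;
    while (k) {
        if (k & 1ULL) s = a * s + c;
        c = (a + 1ULL) * c;   // compose the affine map with itself
        a = a * a;
        k >>= 1ULL;
    }
    return s;
}

__global__ void genUniform(double *u, int n, unsigned long long seed, int perThread)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid * perThread >= n) return;
    // Each thread starts perThread draws further along the global sequence.
    unsigned long long s = lcgSkip(seed, (unsigned long long)tid * perThread);
    for (int i = 0; i < perThread && tid * perThread + i < n; ++i) {
        s = LCG_A * s + LCG_C;
        u[tid * perThread + i] = (s >> 11) * (1.0 / 9007199254740992.0); // 53 bits
    }
}

int main()
{
    const int n = 1 << 20, perThread = 16;
    double *u;
    cudaMallocManaged(&u, n * sizeof(double));
    int threads = n / perThread;
    genUniform<<<(threads + 255) / 256, 256>>>(u, n, 12345ULL, perThread);
    cudaDeviceSynchronize();
    printf("u[0]=%f u[n-1]=%f\n", u[0], u[n - 1]);
    cudaFree(u);
    return 0;
}
```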

  18. Fast quantum Monte Carlo on a GPU

    NASA Astrophysics Data System (ADS)

    Lutsyshyn, Y.

    2015-02-01

    We present a scheme for the parallelization of quantum Monte Carlo method on graphical processing units, focusing on variational Monte Carlo simulation of bosonic systems. We use asynchronous execution schemes with shared memory persistence, and obtain an excellent utilization of the accelerator. The CUDA code is provided along with a package that simulates liquid helium-4. The program was benchmarked on several models of Nvidia GPU, including Fermi GTX560 and M2090, and the Kepler architecture K20 GPU. Special optimization was developed for the Kepler cards, including placement of data structures in the register space of the Kepler GPUs. Kepler-specific optimization is discussed.

  19. CUDASW++ 3.0: accelerating Smith-Waterman protein database search by coupling CPU and GPU SIMD instructions.

    PubMed

    Liu, Yongchao; Wirawan, Adrianto; Schmidt, Bertil

    2013-04-04

    The maximal sensitivity for local alignments makes the Smith-Waterman algorithm a popular choice for protein sequence database search based on pairwise alignment. However, the algorithm is compute-intensive due to a quadratic time complexity. Corresponding runtimes are further compounded by the rapid growth of sequence databases. We present CUDASW++ 3.0, a fast Smith-Waterman protein database search algorithm, which couples CPU and GPU SIMD instructions and carries out concurrent CPU and GPU computations. For the CPU computation, this algorithm employs SSE-based vector execution units as accelerators. For the GPU computation, we have investigated for the first time a GPU SIMD parallelization, which employs CUDA PTX SIMD video instructions to gain more data parallelism beyond the SIMT execution model. Moreover, sequence alignment workloads are automatically distributed over CPUs and GPUs based on their respective compute capabilities. Evaluation on the Swiss-Prot database shows that CUDASW++ 3.0 gains a performance improvement over CUDASW++ 2.0 of up to 2.9× and 3.2×, with a maximum performance of 119.0 and 185.6 GCUPS, on a single-GPU GeForce GTX 680 and a dual-GPU GeForce GTX 690 graphics card, respectively. In addition, our algorithm has demonstrated significant speedups over other top-performing tools: SWIPE and BLAST+. CUDASW++ 3.0 is written in CUDA C++ and PTX assembly languages, targeting GPUs based on the Kepler architecture. This algorithm obtains significant speedups over its predecessor, CUDASW++ 2.0, by benefiting from the use of CPU and GPU SIMD instructions as well as the concurrent execution on CPUs and GPUs. The source code and the simulated data are available at http://cudasw.sourceforge.net.

  20. Very high frame rate volumetric integration of depth images on mobile devices.

    PubMed

    Kähler, Olaf; Adrian Prisacariu, Victor; Yuheng Ren, Carl; Sun, Xin; Torr, Philip; Murray, David

    2015-11-01

    Volumetric methods provide efficient, flexible and simple ways of integrating multiple depth images into a full 3D model. They provide dense and photorealistic 3D reconstructions, and parallelised implementations on GPUs achieve real-time performance on modern graphics hardware. To run such methods on mobile devices, providing users with freedom of movement and instantaneous reconstruction feedback, remains challenging however. In this paper we present a range of modifications to existing volumetric integration methods based on voxel block hashing, considerably improving their performance and making them applicable to tablet computer applications. We present (i) optimisations for the basic data structure, and its allocation and integration; (ii) a highly optimised raycasting pipeline; and (iii) extensions to the camera tracker to incorporate IMU data. In total, our system thus achieves frame rates up to 47 Hz on a Nvidia Shield Tablet and 910 Hz on a Nvidia GTX Titan X GPU, or even beyond 1.1 kHz without visualisation.

  1. Graphics Processing Unit Acceleration of Gyrokinetic Turbulence Simulations

    NASA Astrophysics Data System (ADS)

    Hause, Benjamin; Parker, Scott; Chen, Yang

    2013-10-01

    We find a substantial increase in on-node performance using Graphics Processing Unit (GPU) acceleration in gyrokinetic delta-f particle-in-cell simulation. Optimization is performed on a two-dimensional slab gyrokinetic particle simulation using the Portland Group Fortran compiler with OpenACC compiler directives and CUDA Fortran. A mixed implementation of both OpenACC and CUDA is demonstrated; CUDA is required for optimizing the particle deposition algorithm. We have implemented the GPU acceleration on a third-generation Core i7 gaming PC with two NVIDIA GTX 680 GPUs. We find comparable, or better, acceleration relative to the NERSC DIRAC cluster with the NVIDIA Tesla C2050 computing processor. The Tesla C2050 is about 2.6 times more expensive than the GTX 580 gaming GPU. We also see enormous speedups (10 or more) on the Titan supercomputer at Oak Ridge with Kepler K20 GPUs. Results show speed-ups comparable or better than that of OpenMP models utilizing multiple cores. The use of hybrid OpenACC, CUDA Fortran, and MPI models across many nodes will also be discussed. Optimization strategies will be presented. We will discuss progress on optimizing the comprehensive three-dimensional general geometry GEM code.

  2. A performance model for GPUs with caches

    DOE PAGES

    Dao, Thanh Tuan; Kim, Jungwon; Seo, Sangmin; ...

    2014-06-24

    To exploit the abundant computational power of the world's fastest supercomputers, an even workload distribution to the typically heterogeneous compute devices is necessary. While relatively accurate performance models exist for conventional CPUs, accurate performance estimation models for modern GPUs do not exist. This paper presents two accurate models for modern GPUs: a sampling-based linear model, and a model based on machine-learning (ML) techniques which improves the accuracy of the linear model and is applicable to modern GPUs with and without caches. We first construct the sampling-based linear model to predict the runtime of an arbitrary OpenCL kernel. Based on an analysis of NVIDIA GPUs' scheduling policies we determine the earliest sampling points that allow an accurate estimation. The linear model cannot capture well the significant effects that memory coalescing or caching as implemented in modern GPUs have on performance. We therefore propose a model based on ML techniques that takes several compiler-generated statistics about the kernel as well as the GPU's hardware performance counters as additional inputs to obtain a more accurate runtime performance estimation for modern GPUs. We demonstrate the effectiveness and broad applicability of the model by applying it to three different NVIDIA GPU architectures and one AMD GPU architecture. On an extensive set of OpenCL benchmarks, on average, the proposed model estimates the runtime performance with less than 7 percent error for a second-generation GTX 280 with no on-chip caches and less than 5 percent for the Fermi-based GTX 580 with hardware caches. On the Kepler-based GTX 680, the linear model has an error of less than 10 percent. On an AMD GPU architecture, Radeon HD 6970, the model estimates with an error rate of 8 percent. As a result, the proposed technique outperforms existing models by a factor of 5 to 6 in terms of accuracy.

  3. Fast multipole methods on a cluster of GPUs for the meshless simulation of turbulence

    NASA Astrophysics Data System (ADS)

    Yokota, R.; Narumi, T.; Sakamaki, R.; Kameoka, S.; Obi, S.; Yasuoka, K.

    2009-11-01

    Recent advances in the parallelizability of fast N-body algorithms, and the programmability of graphics processing units (GPUs) have opened a new path for particle based simulations. For the simulation of turbulence, vortex methods can now be considered as an interesting alternative to finite difference and spectral methods. The present study focuses on the efficient implementation of the fast multipole method and pseudo-particle method on a cluster of NVIDIA GeForce 8800 GT GPUs, and applies this to a vortex method calculation of homogeneous isotropic turbulence. The results of the present vortex method agree quantitatively with that of the reference calculation using a spectral method. We achieved a maximum speed of 7.48 TFlops using 64 GPUs, and the cost performance was near $9.4/GFlops. The calculation of the present vortex method on 64 GPUs took 4120 s, while the spectral method on 32 CPUs took 4910 s.

  4. Automatic detection and classification of obstacles with applications in autonomous mobile robots

    NASA Astrophysics Data System (ADS)

    Ponomaryov, Volodymyr I.; Rosas-Miranda, Dario I.

    2016-04-01

    Hardware implementation of automatic detection and classification of objects that can represent an obstacle for an autonomous mobile robot, using stereo vision algorithms, is presented. We propose and evaluate a new method to detect and classify objects for a mobile robot in outdoor conditions. This method is divided in two parts: the first one is the object detection step, based on the distance from the objects to the camera and a BLOB analysis; the second part is the classification step, based on visual primitives and an SVM classifier. The proposed method is performed on a GPU in order to reduce the processing time. This is done with the help of hardware based on multi-core processors and a GPU platform, using an NVIDIA GeForce GT640 graphics card and Matlab on a PC with Windows 10.

  5. Fast generation of computer-generated hologram by graphics processing unit

    NASA Astrophysics Data System (ADS)

    Matsuda, Sho; Fujii, Tomohiko; Yamaguchi, Takeshi; Yoshikawa, Hiroshi

    2009-02-01

    A cylindrical hologram is well known to be viewable in 360 deg. This hologram requires high pixel resolution, so a Computer-Generated Cylindrical Hologram (CGCH) requires a huge amount of calculation. In our previous research, we used a look-up table method for fast calculation with an Intel Pentium 4 at 2.8 GHz. It took 480 hours to calculate a high resolution CGCH (504,000 x 63,000 pixels, with an average of 27,000 object points). To improve the quality of the CGCH reconstructed image, the fringe pattern requires higher spatial frequency and resolution. Therefore, to increase the calculation speed, we have to change the calculation method. In this paper, to reduce the calculation time of the CGCH (912,000 x 108,000 pixels), we employ a Graphics Processing Unit (GPU). It took 4,406 hours to calculate the high resolution CGCH on a Xeon 3.4 GHz. Since the GPU has many streaming processors and a parallel processing structure, it works as a high performance parallel processor. In addition, the GPU gives maximum performance for 2-dimensional and streaming data. Recently, GPUs can also be utilized for general-purpose computation (GPGPU). For example, NVIDIA's GeForce 7 series became a programmable processor with the Cg programming language, and the subsequent GeForce 8 series has CUDA, a software development kit made by NVIDIA. Theoretically, the calculation ability of the GPU is announced as 500 GFLOPS. From the experimental result, we achieved calculation 47 times faster than in our previous work, which used a CPU. Therefore, the CGCH can be generated in 95 hours, and the total time to calculate and print the CGCH is 110 hours.
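
    The computation being accelerated is, at heart, one fringe sum per hologram pixel. The following is a hedged sketch of that pattern — one thread per pixel, summing Fresnel-approximated fringes over object points — with assumed names and parameters (cghFringe, pitch, a flat object-point list); it is not the authors' implementation, which additionally uses look-up tables and far larger hologram planes.

```cuda
// Hedged sketch: point-source CGH fringe computation, one thread per pixel.
#include <cuda_runtime.h>
#include <cstdio>

struct Point { float x, y, z, a; };   // object point: position and amplitude

__global__ void cghFringe(float *holo, int w, int h, float pitch,
                          const Point *pts, int npts, float k)
{
    int px = blockIdx.x * blockDim.x + threadIdx.x;
    int py = blockIdx.y * blockDim.y + threadIdx.y;
    if (px >= w || py >= h) return;
    float xh = (px - w / 2) * pitch, yh = (py - h / 2) * pitch;

    float sum = 0.0f;
    for (int i = 0; i < npts; ++i) {
        float dx = xh - pts[i].x, dy = yh - pts[i].y;
        // Fresnel (paraxial) phase: k * (dx^2 + dy^2) / (2 z)
        float phase = k * (dx * dx + dy * dy) / (2.0f * pts[i].z);
        sum += pts[i].a * __cosf(phase);   // fast intrinsic cosine
    }
    holo[py * w + px] = sum;
}

int main()
{
    const int w = 1024, h = 1024, npts = 1000;
    const float pitch = 1.0e-6f, lambda = 633e-9f;
    const float k = 2.0f * 3.14159265f / lambda;
    float *holo; Point *pts;
    cudaMallocManaged(&holo, w * h * sizeof(float));
    cudaMallocManaged(&pts, npts * sizeof(Point));
    for (int i = 0; i < npts; ++i)
        pts[i] = { (i % 32) * 8e-6f, (i / 32) * 8e-6f, 0.1f, 1.0f };
    dim3 block(16, 16), grid((w + 15) / 16, (h + 15) / 16);
    cghFringe<<<grid, block>>>(holo, w, h, pitch, pts, npts, k);
    cudaDeviceSynchronize();
    printf("holo[0] = %f\n", holo[0]);
    cudaFree(holo); cudaFree(pts);
    return 0;
}
```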

  6. A Large Scale, High Resolution Agent-Based Insurgency Model

    DTIC Science & Technology

    2013-09-30

    Compute Unified Device Architecture (CUDA) is NVIDIA Corporation's software development model for General Purpose Programming on Graphics Processing Units (GPGPU) (NVIDIA Corporation ... Conference, Argonne National Laboratory, Argonne, IL, October 2005; NVIDIA Corporation, NVIDIA CUDA Programming Guide 2.0 [Online], NVIDIA Corporation

  7. Multimodality imaging and state-of-art GPU technology in discriminating benign from malignant breast lesions on real time decision support system

    NASA Astrophysics Data System (ADS)

    Kostopoulos, S.; Sidiropoulos, K.; Glotsos, D.; Dimitropoulos, N.; Kalatzis, I.; Asvestas, P.; Cavouras, D.

    2014-03-01

    The aim of this study was to design a pattern recognition system for assisting the diagnosis of breast lesions, using image information from Ultrasound (US) and Digital Mammography (DM) imaging modalities. State-of-the-art computer technology was employed based on commercial Graphics Processing Unit (GPU) cards and parallel programming. An experienced radiologist outlined breast lesions on both US and DM images from 59 patients employing a custom-designed computer software application. Textural features were extracted from each lesion and were used to design the pattern recognition system. Several classifiers were tested for highest performance in discriminating benign from malignant lesions. Classifiers were also combined into ensemble schemes for further improvement of the system's classification accuracy. Following the pattern recognition system optimization, the final system was designed employing the Probabilistic Neural Network classifier (PNN) on the GPU card (GeForce 580GTX) using the CUDA programming framework and the C++ programming language. The use of such state-of-the-art technology renders the system capable of redesigning itself on site once additional verified US and DM data are collected. The mixture of US and DM features optimized performance with over 90% accuracy in correctly classifying the lesions.

  8. Fast precalculated triangular mesh algorithm for 3D binary computer-generated holograms.

    PubMed

    Yang, Fan; Kaczorowski, Andrzej; Wilkinson, Tim D

    2014-12-10

    A new method for constructing computer-generated holograms using a precalculated triangular mesh is presented. The speed of calculation can be increased dramatically by exploiting both the precalculated base triangle and GPU parallel computing. Unlike algorithms using point-based sources, this method can reconstruct a more vivid 3D object instead of a "hollow image." In addition, there is no need to do a fast Fourier transform for each 3D element every time. A ferroelectric liquid crystal spatial light modulator is used to display the binary hologram within our experiment and the hologram of a base right triangle is produced by utilizing just a one-step Fourier transform in the 2D case, which can be expanded to the 3D case by multiplying by a suitable Fresnel phase plane. All 3D holograms generated in this paper are based on Fresnel propagation; thus, the Fresnel plane is treated as a vital element in producing the hologram. A GeForce GTX 770 graphics card with 2 GB memory is used to achieve parallel computing.

  9. Design of a decision support system, trained on GPU, for assisting melanoma diagnosis in dermatoscopy images

    NASA Astrophysics Data System (ADS)

    Glotsos, Dimitris; Kostopoulos, Spiros; Lalissidou, Stella; Sidiropoulos, Konstantinos; Asvestas, Pantelis; Konstandinou, Christos; Xenogiannopoulos, George; Konstantina Nikolatou, Eirini; Perakis, Konstantinos; Bouras, Thanassis; Cavouras, Dionisis

    2015-09-01

    The purpose of this study was to design a decision support system for assisting the diagnosis of melanoma in dermatoscopy images. Clinical material comprised images of 44 dysplastic (Clark's nevi) and 44 malignant melanoma lesions, obtained from the dermatology database Dermnet. Initially, images were processed for hair removal and background correction using the Dull Razor algorithm. Processed images were segmented to isolate moles from the surrounding background, using a combination of level sets and an automated thresholding approach. Morphological (area, size, shape) and textural features (first and second order) were calculated from each one of the segmented moles. Extracted features were fed to a pattern recognition system assembled with the Probabilistic Neural Network classifier, which was trained to distinguish between benign and malignant cases, using exhaustive search and the leave-one-out method. The system was designed on the GPU card (GeForce 580GTX) using the CUDA programming framework and the C++ programming language. Results showed that the designed system discriminated benign from malignant moles with 88.6% accuracy employing morphological and textural features. The proposed system could be used for analysing moles depicted in smartphone images after appropriate training with smartphone image cases. This could assist towards early detection of melanoma, if suspicious moles were captured on a smartphone by patients and transferred to the physician together with an assessment of the mole's nature.

  10. Automatic Railway Traffic Object Detection System Using Feature Fusion Refine Neural Network under Shunting Mode.

    PubMed

    Ye, Tao; Wang, Baocheng; Song, Ping; Li, Juan

    2018-06-12

    Many accidents happen under shunting mode when the speed of a train is below 45 km/h. In this mode, train attendants observe the railway condition ahead using the traditional manual method and tell the observation results to the driver in order to avoid danger. To address this problem, an automatic object detection system based on a convolutional neural network (CNN) is proposed to detect objects ahead in shunting mode, called the Feature Fusion Refine neural network (FR-Net). It consists of three connected modules, i.e., the depthwise-pointwise convolution, the coarse detection module, and the object detection module. Depthwise-pointwise convolutions are used to achieve real-time detection. The coarse detection module coarsely refines the locations and sizes of prior anchors to provide better initialization for the subsequent module and also reduces the search space for classification, whereas the object detection module aims to regress accurate object locations and predict the class labels for the prior anchors. The experimental results on the railway traffic dataset show that FR-Net achieves 0.8953 mAP with 72.3 FPS performance on a machine with a GeForce GTX1080Ti with an input size of 320 × 320 pixels. The results imply that FR-Net achieves a good tradeoff between effectiveness and real-time performance. The proposed method can meet the needs of practical application in shunting mode.

  11. High performance in silico virtual drug screening on many-core processors.

    PubMed

    McIntosh-Smith, Simon; Price, James; Sessions, Richard B; Ibarra, Amaurys A

    2015-05-01

    Drug screening is an important part of the drug development pipeline for the pharmaceutical industry. Traditional, lab-based methods are increasingly being augmented with computational methods, ranging from simple molecular similarity searches through more complex pharmacophore matching to more computationally intensive approaches, such as molecular docking. The latter simulates the binding of drug molecules to their targets, typically protein molecules. In this work, we describe BUDE, the Bristol University Docking Engine, which has been ported to the OpenCL industry standard parallel programming language in order to exploit the performance of modern many-core processors. Our highly optimized OpenCL implementation of BUDE sustains 1.43 TFLOP/s on a single Nvidia GTX 680 GPU, or 46% of peak performance. BUDE also exploits OpenCL to deliver effective performance portability across a broad spectrum of different computer architectures from different vendors, including GPUs from Nvidia and AMD, Intel's Xeon Phi and multi-core CPUs with SIMD instruction sets.

  12. GPU Lossless Hyperspectral Data Compression System for Space Applications

    NASA Technical Reports Server (NTRS)

    Keymeulen, Didier; Aranki, Nazeeh; Hopson, Ben; Kiely, Aaron; Klimesh, Matthew; Benkrid, Khaled

    2012-01-01

    On-board lossless hyperspectral data compression reduces data volume in order to meet NASA and DoD limited downlink capabilities. At JPL, a novel, adaptive and predictive technique for lossless compression of hyperspectral data, named the Fast Lossless (FL) algorithm, was recently developed. This technique uses an adaptive filtering method and achieves state-of-the-art performance in both compression effectiveness and low complexity. Because of its outstanding performance and suitability for real-time onboard hardware implementation, the FL compressor is being formalized as the emerging CCSDS Standard for Lossless Multispectral & Hyperspectral image compression. The FL compressor is well-suited for parallel hardware implementation. A GPU hardware implementation was developed for FL targeting the current state-of-the-art GPUs from NVIDIA. The GPU implementation on an NVIDIA GeForce GTX 580 achieves a throughput performance of 583.08 Mbits/sec (44.85 MSamples/sec) and an acceleration of at least 6 times over a software implementation running on a 3.47 GHz single core Intel Xeon processor. This paper describes the design and implementation of the FL algorithm on the GPU. The massively parallel implementation will provide in the future a fast and practical real-time solution for airborne and space applications.

  13. High performance in silico virtual drug screening on many-core processors

    PubMed Central

    Price, James; Sessions, Richard B; Ibarra, Amaurys A

    2015-01-01

    Drug screening is an important part of the drug development pipeline for the pharmaceutical industry. Traditional, lab-based methods are increasingly being augmented with computational methods, ranging from simple molecular similarity searches through more complex pharmacophore matching to more computationally intensive approaches, such as molecular docking. The latter simulates the binding of drug molecules to their targets, typically protein molecules. In this work, we describe BUDE, the Bristol University Docking Engine, which has been ported to the OpenCL industry standard parallel programming language in order to exploit the performance of modern many-core processors. Our highly optimized OpenCL implementation of BUDE sustains 1.43 TFLOP/s on a single Nvidia GTX 680 GPU, or 46% of peak performance. BUDE also exploits OpenCL to deliver effective performance portability across a broad spectrum of different computer architectures from different vendors, including GPUs from Nvidia and AMD, Intel’s Xeon Phi and multi-core CPUs with SIMD instruction sets. PMID:25972727

  14. Accelerated Adaptive MGS Phase Retrieval

    NASA Technical Reports Server (NTRS)

    Lam, Raymond K.; Ohara, Catherine M.; Green, Joseph J.; Bikkannavar, Siddarayappa A.; Basinger, Scott A.; Redding, David C.; Shi, Fang

    2011-01-01

    The Modified Gerchberg-Saxton (MGS) algorithm is an image-based wavefront-sensing method that can turn any science instrument focal plane into a wavefront sensor. MGS characterizes optical systems by estimating the wavefront errors in the exit pupil using only intensity images of a star or other point source of light. This innovative implementation of MGS significantly accelerates the MGS phase retrieval algorithm by using stream-processing hardware on conventional graphics cards. Stream processing is a relatively new, yet powerful, paradigm that allows parallel processing of certain applications that apply single instructions to multiple data (SIMD). These stream processors are designed specifically to support large-scale parallel computing on a single graphics chip. Computationally intensive algorithms, such as the Fast Fourier Transform (FFT), are particularly well suited for this computing environment. This high-speed version of MGS exploits commercially available hardware to accomplish the same objective in a fraction of the original time, by performing the matrix calculations on nVidia graphics cards. The graphics processing unit (GPU) is hardware that is specialized for computationally intensive, highly parallel computation. From the software perspective, a parallel programming model called CUDA is used to transparently scale multicore parallelism in hardware. This technology gives computationally intensive applications access to the processing power of the nVidia GPUs through a C/C++ programming interface. The AAMGS (Accelerated Adaptive MGS) software takes advantage of these advanced technologies to accelerate the optical phase error characterization. With a single PC that contains four nVidia GTX-280 graphics cards, the new implementation can process four images simultaneously to produce a JWST (James Webb Space Telescope) wavefront measurement 60 times faster than the previous code.

  15. High-Speed Particle-in-Cell Simulation Parallelized with Graphic Processing Units for Low Temperature Plasmas for Material Processing

    NASA Astrophysics Data System (ADS)

    Hur, Min Young; Verboncoeur, John; Lee, Hae June

    2014-10-01

    Particle-in-cell (PIC) simulations offer higher fidelity than fluid simulations for plasma devices that require transient kinetic modeling. They use fewer approximations of the plasma kinetics, but require many particles and grid cells to obtain meaningful results, so the simulation time grows in proportion to the number of particles. Therefore, PIC simulation needs high performance computing. In this research, a graphics processing unit (GPU) is adopted for high performance computing of PIC simulation for low temperature discharge plasmas. GPUs have many-core processors and high memory bandwidth compared with a central processing unit (CPU). NVIDIA GeForce GPUs, with hundreds of cores, were used for the test and show cost-effective performance. The PIC code algorithm is divided into two modules, a field solver and a particle mover; the particle mover module is further divided into four routines, named move, boundary, Monte Carlo collision (MCC), and deposit. Overall, the GPU code solves particle motions as well as the electrostatic potential in two-dimensional geometry almost 30 times faster than a single-CPU code. This work was supported by the Korea Institute of Science Technology Information.

  16. Spatial 3D infrastructure: display-independent software framework, high-speed rendering electronics, and several new displays

    NASA Astrophysics Data System (ADS)

    Chun, Won-Suk; Napoli, Joshua; Cossairt, Oliver S.; Dorval, Rick K.; Hall, Deirdre M.; Purtell, Thomas J., II; Schooler, James F.; Banker, Yigal; Favalora, Gregg E.

    2005-03-01

    We present a software and hardware foundation to enable the rapid adoption of 3-D displays. Different 3-D displays - such as multiplanar, multiview, and electroholographic displays - naturally require different rendering methods. The adoption of these displays in the marketplace will be accelerated by a common software framework. The authors designed the SpatialGL API, a new rendering framework that unifies these display methods under one interface. SpatialGL enables complementary visualization assets to coexist through a uniform infrastructure. Also, SpatialGL supports legacy interfaces such as the OpenGL API. The authors' first implementation of SpatialGL uses multiview and multislice rendering algorithms to exploit the performance of modern graphics processing units (GPUs) to enable real-time visualization of 3-D graphics from medical imaging, oil & gas exploration, and homeland security. At the time of writing, SpatialGL runs on COTS workstations (both Windows and Linux) and on Actuality's high-performance embedded computational engine that couples an NVIDIA GeForce 6800 Ultra GPU, an AMD Athlon 64 processor, and a proprietary, high-speed, programmable volumetric frame buffer that interfaces to a 1024 x 768 x 3 digital projector. Progress is illustrated using an off-the-shelf multiview display, Actuality's multiplanar Perspecta Spatial 3D System, and an experimental multiview display. The experimental display is a quasi-holographic view-sequential system that generates aerial imagery measuring 30 mm x 25 mm x 25 mm, providing 198 horizontal views.

  17. Exploiting current-generation graphics hardware for synthetic-scene generation

    NASA Astrophysics Data System (ADS)

    Tanner, Michael A.; Keen, Wayne A.

    2010-04-01

    Increasing seeker frame rate and pixel count, as well as the demand for higher levels of scene fidelity, have driven scene generation software for hardware-in-the-loop (HWIL) and software-in-the-loop (SWIL) testing to higher levels of parallelization. Because modern PC graphics cards provide multiple computational cores (240 shader cores for current NVIDIA GeForce and Quadro cards), implementation of phenomenology codes on graphics processing units (GPUs) offers significant potential for simultaneous enhancement of simulation frame rate and fidelity. To take advantage of this potential requires algorithm implementation that is structured to minimize data transfers between the central processing unit (CPU) and the GPU. In this paper, preliminary methodologies developed at the Kinetic Hardware In-The-Loop Simulator (KHILS) will be presented. Included in this paper will be various language tradeoffs between conventional shader programming, Compute Unified Device Architecture (CUDA) and Open Computing Language (OpenCL), including performance trades and possible pathways for future tool development.

  18. Enhanced Graphics for Extended Scale Range

    NASA Technical Reports Server (NTRS)

    Hanson, Andrew J.; Chi-Wing Fu, Philip

    2012-01-01

    Enhanced Graphics for Extended Scale Range is a computer program for rendering fly-through views of scene models that include visible objects differing in size by large orders of magnitude. An example would be a scene showing a person in a park at night with the moon, stars, and galaxies in the background sky. Prior graphical computer programs exhibit arithmetic and other anomalies when rendering scenes containing objects that differ enormously in scale and distance from the viewer. The present program dynamically repartitions distance scales of objects in a scene during rendering to eliminate almost all such anomalies in a way compatible with implementation in other software and in hardware accelerators. By assigning depth ranges corresponding to rendering precision requirements, either automatically or under program control, this program spaces out object scales to match the precision requirements of the rendering arithmetic. This action includes an intelligent partition of the depth buffer ranges to avoid known anomalies from this source. The program is written in C++, using OpenGL, GLUT, and GLUI standard libraries, and nVidia GeForce Vertex Shader extensions. The program has been shown to work on several computers running UNIX and Windows operating systems.

  19. AESS: Accelerated Exact Stochastic Simulation

    NASA Astrophysics Data System (ADS)

    Jenkins, David D.; Peterson, Gregory D.

    2011-12-01

    The Stochastic Simulation Algorithm (SSA) developed by Gillespie provides a powerful mechanism for exploring the behavior of chemical systems with small species populations or with important noise contributions. Gene circuit simulations for systems biology commonly employ the SSA method, as do ecological applications. This algorithm tends to be computationally expensive, so researchers seek an efficient implementation of SSA. In this program package, the Accelerated Exact Stochastic Simulation Algorithm (AESS) contains optimized implementations of Gillespie's SSA that improve the performance of individual simulation runs or ensembles of simulations used for sweeping parameters or to provide statistically significant results. Program summary: Program title: AESS. Catalogue identifier: AEJW_v1_0. Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEJW_v1_0.html. Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland. Licensing provisions: University of Tennessee copyright agreement. No. of lines in distributed program, including test data, etc.: 10 861. No. of bytes in distributed program, including test data, etc.: 394 631. Distribution format: tar.gz. Programming language: C for processors, CUDA for NVIDIA GPUs. Computer: Developed and tested on various x86 computers and NVIDIA C1060 Tesla and GTX 480 Fermi GPUs. The system targets x86 workstations, optionally with multicore processors or NVIDIA GPUs as accelerators. Operating system: Tested under Ubuntu Linux OS and CentOS 5.5 Linux OS. Classification: 3, 16.12. Nature of problem: Simulation of chemical systems, particularly with low species populations, can be accurately performed using Gillespie's method of stochastic simulation. Numerous variations on the original stochastic simulation algorithm have been developed, including approaches that produce results with statistics that exactly match the chemical master equation (CME) as well as other approaches that approximate the CME. Solution
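
    The cheapest parallelism in stochastic simulation is across ensemble members, since Gillespie trajectories are independent. The sketch below illustrates that pattern for a toy reversible isomerization using cuRAND's device API; names, rates and the reaction system are assumptions for illustration, not AESS itself, which also optimizes the per-step propensity search.

```cuda
// Hedged sketch: one Gillespie direct-method trajectory per thread.
#include <cuda_runtime.h>
#include <curand_kernel.h>
#include <cstdio>

__global__ void ssaEnsemble(int *finalA, int nruns, int a0, int b0,
                            float c1, float c2, float tEnd, unsigned long long seed)
{
    int run = blockIdx.x * blockDim.x + threadIdx.x;
    if (run >= nruns) return;
    curandState rng;
    curand_init(seed, run, 0, &rng);   // one independent substream per run

    int a = a0, b = b0;
    float t = 0.0f;
    while (t < tEnd) {
        float p1 = c1 * a, p2 = c2 * b;      // reaction propensities
        float ptot = p1 + p2;
        if (ptot <= 0.0f) break;
        // Time to next reaction: exponential with rate ptot.
        t += -logf(curand_uniform(&rng)) / ptot;
        if (t >= tEnd) break;
        // Choose which reaction fires: A->B or B->A.
        if (curand_uniform(&rng) * ptot < p1) { --a; ++b; }
        else                                  { ++a; --b; }
    }
    finalA[run] = a;
}

int main()
{
    const int nruns = 1 << 14;
    int *finalA;
    cudaMallocManaged(&finalA, nruns * sizeof(int));
    ssaEnsemble<<<(nruns + 255) / 256, 256>>>(finalA, nruns, 100, 0,
                                              1.0f, 0.5f, 10.0f, 42ULL);
    cudaDeviceSynchronize();
    double mean = 0.0;
    for (int i = 0; i < nruns; ++i) mean += finalA[i];
    printf("mean A(tEnd) over %d runs: %.2f\n", nruns, mean / nruns);
    cudaFree(finalA);
    return 0;
}
```

    Divergent trajectory lengths cause some warp divergence here; ensemble sizes are usually large enough that the GPU stays well occupied regardless.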

  20. The visible ear simulator: a public PC application for GPU-accelerated haptic 3D simulation of ear surgery based on the visible ear data.

    PubMed

    Sorensen, Mads Solvsten; Mosegaard, Jesper; Trier, Peter

    2009-06-01

    Existing virtual simulators for middle ear surgery are based on 3-dimensional (3D) models from computed tomographic or magnetic resonance imaging data in which image quality is limited by the lack of detail (maximum, approximately 50 voxels/mm³), natural color, and texture of the source material. Virtual training often requires the purchase of a program, a customized computer, and expensive peripherals dedicated exclusively to this purpose. The Visible Ear freeware library of digital images from a fresh-frozen human temporal bone was segmented, and real-time volume rendered as a 3D model of high fidelity, true color, and great anatomic detail and realism of the surgically relevant structures. A haptic drilling model was developed for surgical interaction with the 3D model. Realistic visualization in high fidelity (approximately 125 voxels/mm³) and true color, 2D, or optional anaglyph stereoscopic 3D was achieved on a standard Core 2 Duo personal computer with a GeForce 8800 GTX graphics card, and surgical interaction was provided through a relatively inexpensive (approximately $2,500) Phantom Omni haptic 3D pointing device. This prototype is published for download (approximately 120 MB) as freeware at http://www.alexandra.dk/ves/index.htm. With increasing personal computer performance, future versions may include enhanced resolution (up to 8,000 voxels/mm³) and realistic interaction with deformable soft tissue components such as skin, tympanic membrane, dura, and cholesteatomas, features some of which are not possible with computed tomographic-/magnetic resonance imaging-based systems.

  1. Practical Implementation of Prestack Kirchhoff Time Migration on a General Purpose Graphics Processing Unit

    NASA Astrophysics Data System (ADS)

    Liu, Guofeng; Li, Chun

    2016-08-01

    In this study, we present a practical implementation of prestack Kirchhoff time migration (PSTM) on a general purpose graphics processing unit. First, we consider the three main optimizations of the PSTM GPU code, i.e., designing a configuration based on a reasonable execution, using the texture memory for velocity interpolation, and the application of an intrinsic function in device code. This approach can achieve a speedup of nearly 45 times on an NVIDIA GTX 680 GPU compared with CPU code when a larger imaging space is used, where the PSTM output is a common reflection point gather stored as I[nx][ny][nh][nt] in matrix format. However, this method requires more memory space, so the limited imaging space cannot fully exploit the GPU resources. To overcome this problem, we designed a PSTM scheme with multi-GPUs for imaging different seismic data on different GPUs using an offset value. This process can achieve the peak speedup of GPU PSTM code and it greatly increases the efficiency of the calculations, but without changing the imaging result.
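
    A schematic CUDA kernel in the spirit of the optimizations named above, with the velocity table fetched through the texture unit and square roots and divisions done with device intrinsics. The travel-time geometry, names, and parameters are illustrative assumptions, not the authors' code.

      // Schematic PSTM inner loop: one thread per output time sample,
      // velocity interpolated by the texture hardware, double-square-root
      // travel time via the __fsqrt_rn and __fdividef device intrinsics.
      __global__ void pstmKernel(const float* __restrict__ trace, int nt, float dt,
                                 cudaTextureObject_t velTex,   // v(t0) table
                                 float hs, float hg,           // src/rcv offsets
                                 float* __restrict__ image, int nt0, float dt0) {
          int it0 = blockIdx.x * blockDim.x + threadIdx.x;
          if (it0 >= nt0) return;
          float t0 = it0 * dt0;
          float v  = tex1D<float>(velTex, it0 + 0.5f);  // hardware interpolation
          float tz = 0.5f * t0;
          float ts = __fsqrt_rn(tz * tz + __fdividef(hs * hs, v * v));
          float tg = __fsqrt_rn(tz * tz + __fdividef(hg * hg, v * v));
          int it = (int)((ts + tg) / dt);
          if (it < nt) image[it0] += trace[it];     // accumulate this trace
      }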

  2. permGPU: Using graphics processing units in RNA microarray association studies.

    PubMed

    Shterev, Ivo D; Jung, Sin-Ho; George, Stephen L; Owzar, Kouros

    2010-06-16

    Many analyses of microarray association studies involve permutation, bootstrap resampling and cross-validation, which are ideally formulated as embarrassingly parallel computing problems. Given that these analyses are computationally intensive, scalable approaches that can take advantage of multi-core processor systems need to be developed. We have developed a CUDA-based implementation, permGPU, that employs graphics processing units in microarray association studies. We illustrate the performance and applicability of permGPU within the context of permutation resampling for a number of test statistics. An extensive simulation study demonstrates a dramatic increase in performance when using permGPU on an NVIDIA GTX 280 card compared to an optimized C/C++ solution running on a conventional Linux server. permGPU is available as an open-source stand-alone application and as an extension package for the R statistical environment. It provides a dramatic increase in performance for permutation resampling analysis in the context of microarray association studies. The current version offers six test statistics for carrying out permutation resampling analyses for binary, quantitative and censored time-to-event traits.
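
    To illustrate the embarrassingly parallel structure, here is a sketch of a one-thread-per-permutation CUDA kernel for a two-group mean-difference statistic on a single gene; permGPU's actual kernels and its six statistics are not reproduced here, and the sample-size bound is an assumption.

      // One thread = one permutation: each thread draws its own random
      // group relabelling and computes a mean-difference statistic.
      #include <curand_kernel.h>

      __global__ void permMeanDiff(const float* __restrict__ expr, // n samples
                                   int n, int n1,                  // group-1 size
                                   unsigned long long seed,
                                   float* __restrict__ stat, int nPerm) {
          int p = blockIdx.x * blockDim.x + threadIdx.x;
          if (p >= nPerm) return;
          curandState rs;
          curand_init(seed, p, 0, &rs);
          // Partial Fisher-Yates over a local index array draws n1 of n
          // labels without replacement (n <= 256 assumed for this sketch).
          int idx[256];
          for (int i = 0; i < n; ++i) idx[i] = i;
          float s1 = 0.f, s2 = 0.f;
          for (int i = 0; i < n1; ++i) {
              int j = i + curand(&rs) % (n - i);
              int tmp = idx[i]; idx[i] = idx[j]; idx[j] = tmp;
              s1 += expr[idx[i]];
          }
          for (int i = n1; i < n; ++i) s2 += expr[idx[i]];
          stat[p] = s1 / n1 - s2 / (n - n1);
      }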

  3. Large calculation of the flow over a hypersonic vehicle using a GPU

    NASA Astrophysics Data System (ADS)

    Elsen, Erich; LeGresley, Patrick; Darve, Eric

    2008-12-01

    Graphics processing units are capable of impressive computing performance, with up to 518 Gflops peak. Various groups have been using these processors for general purpose computing; most efforts have focused on demonstrating relatively basic calculations, e.g. numerical linear algebra, or physical simulations for visualization purposes with limited accuracy. This paper describes the simulation of a hypersonic vehicle configuration with detailed geometry and accurate boundary conditions using the compressible Euler equations. To the authors' knowledge, this is the most sophisticated calculation of this kind in terms of complexity of the geometry, the physical model, the numerical methods employed, and the accuracy of the solution. The Navier-Stokes Stanford University Solver (NSSUS) was used for this purpose. NSSUS is a multi-block structured code with a provably stable and accurate numerical discretization which uses a vertex-based finite-difference method. A multi-grid scheme is used to accelerate the solution of the system. Based on a comparison of the Intel Core 2 Duo and NVIDIA 8800GTX, speed-ups of over 40× were demonstrated for simple test geometries and 20× for complex geometries.

  4. Efficient parallel implementation of active appearance model fitting algorithm on GPU.

    PubMed

    Wang, Jinwei; Ma, Xirong; Zhu, Yuanping; Sun, Jizhou

    2014-01-01

    The active appearance model (AAM) is one of the most powerful model-based object detecting and tracking methods which has been widely used in various situations. However, the high-dimensional texture representation causes very time-consuming computations, which makes the AAM difficult to apply to real-time systems. The emergence of modern graphics processing units (GPUs) that feature a many-core, fine-grained parallel architecture provides new and promising solutions to overcome the computational challenge. In this paper, we propose an efficient parallel implementation of the AAM fitting algorithm on GPUs. Our design idea is fine-grain parallelism, in which we distribute the texture data of the AAM, in pixels, to thousands of parallel GPU threads for processing, which makes the algorithm fit better into the GPU architecture. We implement our algorithm using the compute unified device architecture (CUDA) on Nvidia's GTX 650 GPU, which has the latest Kepler architecture. To compare the performance of our algorithm with different data sizes, we built sixteen face AAM models of different dimensional textures. The experiment results show that our parallel AAM fitting algorithm can achieve real-time performance for videos even on very high-dimensional textures.
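
    A minimal sketch of the fine-grain mapping described above, with one GPU thread per texture pixel computing one element of the AAM error image; the warped-coordinate buffer and texture fetch stand in for the paper's warping step and are assumptions of this sketch.

      // Fine-grain AAM fitting step: one thread per texture pixel computes
      // the residual between the model template and the image sampled at
      // the shape-warped coordinate. Warping and model details are stubs.
      __global__ void aamErrorImage(cudaTextureObject_t frame,         // video frame
                                    const float* __restrict__ model,   // template
                                    const float2* __restrict__ warped, // warped coords
                                    float* __restrict__ error, int nPix) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i >= nPix) return;
          float2 p = warped[i];                        // where this pixel lands
          float sampled = tex2D<float>(frame, p.x, p.y);
          error[i] = sampled - model[i];               // per-pixel residual
      }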

  5. Efficient Parallel Implementation of Active Appearance Model Fitting Algorithm on GPU

    PubMed Central

    Wang, Jinwei; Ma, Xirong; Zhu, Yuanping; Sun, Jizhou

    2014-01-01

    The active appearance model (AAM) is one of the most powerful model-based object detecting and tracking methods which has been widely used in various situations. However, the high-dimensional texture representation causes very time-consuming computations, which makes the AAM difficult to apply to real-time systems. The emergence of modern graphics processing units (GPUs) that feature a many-core, fine-grained parallel architecture provides new and promising solutions to overcome the computational challenge. In this paper, we propose an efficient parallel implementation of the AAM fitting algorithm on GPUs. Our design idea is fine-grain parallelism, in which we distribute the texture data of the AAM, in pixels, to thousands of parallel GPU threads for processing, which makes the algorithm fit better into the GPU architecture. We implement our algorithm using the compute unified device architecture (CUDA) on Nvidia's GTX 650 GPU, which has the latest Kepler architecture. To compare the performance of our algorithm with different data sizes, we built sixteen face AAM models of different dimensional textures. The experiment results show that our parallel AAM fitting algorithm can achieve real-time performance for videos even on very high-dimensional textures. PMID:24723812

  6. High definition live 3D-OCT in vivo: design and evaluation of a 4D OCT engine with 1 GVoxel/s.

    PubMed

    Wieser, Wolfgang; Draxinger, Wolfgang; Klein, Thomas; Karpf, Sebastian; Pfeiffer, Tom; Huber, Robert

    2014-09-01

    We present a 1300 nm OCT system for volumetric real-time live OCT acquisition and visualization at 1 billion volume elements per second. All technological challenges and problems associated with such high scanning speed are discussed in detail as well as the solutions. In one configuration, the system acquires, processes and visualizes 26 volumes per second where each volume consists of 320 x 320 depth scans and each depth scan has 400 usable pixels. This is the fastest real-time OCT to date in terms of voxel rate. A 51 Hz volume rate is realized with half the frame number. In both configurations the speed can be sustained indefinitely. The OCT system uses a 1310 nm Fourier domain mode locked (FDML) laser operated at 3.2 MHz sweep rate. Data acquisition is performed with two dedicated digitizer cards, each running at 2.5 GS/s, hosted in a single desktop computer. Live real-time data processing and visualization are realized with custom developed software on an NVidia GTX 690 dual graphics processing unit (GPU) card. To evaluate potential future applications of such a system, we present volumetric videos captured at 26 and 51 Hz of planktonic crustaceans and skin.

  7. Scalable streaming tools for analyzing N-body simulations: Finding halos and investigating excursion sets in one pass

    NASA Astrophysics Data System (ADS)

    Ivkin, N.; Liu, Z.; Yang, L. F.; Kumar, S. S.; Lemson, G.; Neyrinck, M.; Szalay, A. S.; Braverman, V.; Budavari, T.

    2018-04-01

    Cosmological N-body simulations play a vital role in studying models for the evolution of the Universe. To compare to observations and make scientific inferences, statistical analysis on large simulation datasets, e.g., finding halos or obtaining multi-point correlation functions, is crucial. However, traditional in-memory methods for these tasks do not scale to the datasets that are forbiddingly large in modern simulations. Our prior paper (Liu et al., 2015) proposes memory-efficient streaming algorithms that can find the largest halos in a simulation with up to 10⁹ particles on a small server or desktop. However, this approach fails when directly scaling to larger datasets. This paper presents a robust streaming tool that leverages state-of-the-art techniques on GPU boosting, sampling, and parallel I/O, to significantly improve performance and scalability. Our rigorous analysis of the sketch parameters improves the previous results from finding the centers of the 10³ largest halos (Liu et al., 2015) to ∼10⁴-10⁵, and reveals the trade-offs between memory, running time and number of halos. Our experiments show that our tool can scale to datasets with up to ∼10¹² particles while using less than an hour of running time on a single Nvidia GTX 1080 GPU.
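
    The one-pass flavor of such streaming tools can be illustrated with a textbook count-min sketch that flags heavy (densely populated) grid cells as halo candidates; this is a generic structure with assumed parameters and hashing, not the authors' exact sketch.

      // Textbook count-min sketch: d hash rows of width w; update adds to
      // one counter per row, query takes the row-wise minimum, which
      // overestimates the true count with bounded error. A single pass
      // over the particle stream flags heavy grid cells.
      #include <algorithm>
      #include <cstdint>
      #include <vector>

      struct CountMin {
          int d, w;
          std::vector<uint64_t> salt;     // one hash salt per row
          std::vector<uint32_t> table;    // d x w counters
          CountMin(int rows, int width) : d(rows), w(width), table(rows * width, 0) {
              for (int r = 0; r < d; ++r) salt.push_back(0x9e3779b97f4a7c15ULL * (r + 1));
          }
          size_t slot(int r, uint64_t key) const {
              uint64_t h = (key ^ salt[r]) * 0xff51afd7ed558ccdULL;  // cheap mix
              return r * (size_t)w + (h >> 33) % w;
          }
          void add(uint64_t cellId) {
              for (int r = 0; r < d; ++r) ++table[slot(r, cellId)];
          }
          uint32_t estimate(uint64_t cellId) const {
              uint32_t m = UINT32_MAX;
              for (int r = 0; r < d; ++r) m = std::min(m, table[slot(r, cellId)]);
              return m;
          }
      };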

  8. Fast, multi-channel real-time processing of signals with microsecond latency using graphics processing units.

    PubMed

    Rath, N; Kato, S; Levesque, J P; Mauel, M E; Navratil, G A; Peng, Q

    2014-04-01

    Fast digital signal processing (DSP) has many applications. Typical hardware options for performing DSP are field-programmable gate arrays (FPGAs), application-specific integrated DSP chips, or general purpose personal computer systems. This paper presents a novel DSP platform that has been developed for feedback control on the HBT-EP tokamak device. The system runs all signal processing exclusively on a Graphics Processing Unit (GPU) to achieve real-time performance with latencies below 8 μs. Signals are transferred into and out of the GPU using PCI Express peer-to-peer direct-memory-access transfers without involvement of the central processing unit or host memory. Tests were performed on the feedback control system of the HBT-EP tokamak using forty 16-bit floating-point inputs and outputs each and a sampling rate of up to 250 kHz. Signals were digitized by a D-TACQ ACQ196 module, processing was done on an NVIDIA GTX 580 GPU programmed in CUDA, and analog output was generated by D-TACQ AO32CPCI modules.

  9. 3D Hydrodynamic Simulation of Classical Novae Explosions

    NASA Astrophysics Data System (ADS)

    Kendrick, Coleman J.

    2015-01-01

    This project investigates the formation and lifecycle of classical novae and determines how parameters such as white dwarf mass, companion star mass, and separation affect the evolution of the rotating binary system. These parameters affect the accretion rate, the frequency of the nova explosions, and the light curves. Each particle in the simulation represents a volume of hydrogen gas and is initialized randomly in the outer shell of the companion star. The forces on each particle include gravity, centrifugal, Coriolis, friction, and Langevin forces. The friction and Langevin forces are used to model the viscosity and internal pressure of the gas. A velocity Verlet method with a one-second time step is used to compute velocities and positions of the particles. A new particle recycling method was developed, which was critical for computing an accurate and stable accretion rate and keeping the particle count reasonable. I used C++ and OpenCL to create my simulations and ran them on two Nvidia GTX580s. My simulations used up to 1 million particles and required up to 10 hours to complete. My simulation results for novae U Scorpii and DD Circinus are consistent with professional hydrodynamic simulations and observed experimental data (light curves and outburst frequencies). When the white dwarf mass is increased, the time between explosions decreases dramatically. My model was used to make the first prediction for the next outburst of nova DD Circinus. My simulations also show that the companion star blocks the expanding gas shell, leading to an asymmetrical expanding shell.
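
    For reference, a serial sketch of one velocity Verlet step of the kind described; the force model is gathered into a hypothetical totalForce, which is where the gravity, centrifugal, Coriolis, friction, and Langevin terms would enter (treating velocity-dependent forces this way is itself an approximation).

      // One velocity Verlet step: advance positions with the current
      // acceleration, re-evaluate forces, then advance velocities with
      // the average of old and new accelerations.
      struct Vec3 { float x, y, z; };
      Vec3 totalForce(const Vec3& pos, const Vec3& vel);  // model-specific

      void verletStep(Vec3* pos, Vec3* vel, Vec3* acc, int n,
                      float dt, float invMass) {
          for (int i = 0; i < n; ++i) {
              pos[i].x += vel[i].x * dt + 0.5f * acc[i].x * dt * dt;
              pos[i].y += vel[i].y * dt + 0.5f * acc[i].y * dt * dt;
              pos[i].z += vel[i].z * dt + 0.5f * acc[i].z * dt * dt;
              Vec3 f = totalForce(pos[i], vel[i]);
              Vec3 aNew = { f.x * invMass, f.y * invMass, f.z * invMass };
              vel[i].x += 0.5f * (acc[i].x + aNew.x) * dt;
              vel[i].y += 0.5f * (acc[i].y + aNew.y) * dt;
              vel[i].z += 0.5f * (acc[i].z + aNew.z) * dt;
              acc[i] = aNew;                  // reuse at the next step
          }
      }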

  10. High-performance 3D compressive sensing MRI reconstruction.

    PubMed

    Kim, Daehyun; Trzasko, Joshua D; Smelyanskiy, Mikhail; Haider, Clifton R; Manduca, Armando; Dubey, Pradeep

    2010-01-01

    Compressive Sensing (CS) is a nascent sampling and reconstruction paradigm that describes how sparse or compressible signals can be accurately approximated using many fewer samples than traditionally believed. In magnetic resonance imaging (MRI), where scan duration is directly proportional to the number of acquired samples, CS has the potential to dramatically decrease scan time. However, the computationally expensive nature of CS reconstructions has so far precluded their use in routine clinical practice; instead, more easily generated but lower-quality images continue to be used. We investigate the development and optimization of a proven inexact quasi-Newton CS reconstruction algorithm on several modern parallel architectures, including CPUs, GPUs, and Intel's Many Integrated Core (MIC) architecture. Our (optimized) baseline implementation on a quad-core Core i7 is able to reconstruct a 256 × 160 × 80 volume of the neurovasculature from an 8-channel, 10× undersampled data set within 56 seconds, which is already a significant improvement over existing implementations. The latest six-core Core i7 reduces the reconstruction time further to 32 seconds. Moreover, we show that the CS algorithm benefits from modern throughput-oriented architectures. Specifically, our CUDA-based implementation on an NVIDIA GTX 480 reconstructs the same dataset in 16 seconds, while Intel's Knights Ferry (KNF) of the MIC architecture reduces the time even further to 12 seconds. Such a level of performance allows the neurovascular dataset to be reconstructed within a clinically viable time.
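
    The reconstruction being accelerated is, in its usual generic form, an ℓ1-regularized least-squares problem (notation assumed here, not the paper's exact functional):

      \min_{x}\ \tfrac{1}{2}\,\| F_{u}\,x - y \|_{2}^{2} \;+\; \lambda\,\| \Psi x \|_{1}

    where F_u is the undersampled Fourier/coil-sensitivity operator, y the acquired k-space data, Ψ a sparsifying transform, and λ a regularization weight; the inexact quasi-Newton iterations approximately minimize this objective.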

  11. Real-time stereo vision-based lane detection system

    NASA Astrophysics Data System (ADS)

    Fan, Rui; Dahnoun, Naim

    2018-07-01

    The detection of multiple curved lane markings on a non-flat road surface is still a challenging task for vehicular systems. To make an improvement, depth information can be used to enhance the robustness of lane detection systems. In this paper, the proposed lane detection system is developed from our previous work, in which the estimation of the dense vanishing point is further improved using disparity information. However, outliers in the least squares fitting severely affect the accuracy when estimating the vanishing point. Therefore, in this paper we use random sample consensus to update the parameters of the road model iteratively until the percentage of inliers exceeds a pre-set threshold. This significantly helps the system to overcome suddenly changing conditions. Furthermore, we propose a novel lane position validation approach which computes the energy of each possible solution and selects all satisfying lane positions for visualisation. The proposed system is implemented on a heterogeneous system which consists of an Intel Core i7-4720HQ CPU and an NVIDIA GTX 970M GPU. A processing speed of 143 fps has been achieved, which is over 38 times faster than our previous work. Moreover, in order to evaluate the detection precision, we tested 2495 frames including 5361 lanes. It is shown that the overall successful detection rate is increased from 98.7% to 99.5%.
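
    A compact sketch of the random-sample-consensus loop described above, assuming a generic parametric road model; the fitMinimal and residual helpers and the model parameterization are hypothetical.

      // RANSAC loop: repeatedly fit the road model to a minimal random
      // sample, count inliers, and stop once the inlier fraction exceeds
      // a preset threshold. Fitting/residual helpers are placeholders.
      #include <cstdlib>
      #include <vector>

      struct Pt { float u, v; };                   // disparity-domain point
      struct RoadModel { float a, b, c; };         // e.g. quadratic profile
      RoadModel fitMinimal(const Pt* sample, int k);    // hypothetical LS fit
      float residual(const RoadModel& m, const Pt& p);  // hypothetical

      RoadModel ransacRoad(const std::vector<Pt>& pts, int k, float tol,
                           float inlierFrac, int maxIter) {
          RoadModel best{}; int bestInliers = 0;
          for (int it = 0; it < maxIter; ++it) {
              Pt sample[8];                        // k <= 8 assumed
              for (int i = 0; i < k; ++i) sample[i] = pts[rand() % pts.size()];
              RoadModel m = fitMinimal(sample, k);
              int inliers = 0;
              for (const Pt& p : pts) inliers += (residual(m, p) < tol);
              if (inliers > bestInliers) { bestInliers = inliers; best = m; }
              if (bestInliers >= inlierFrac * pts.size()) break;  // early exit
          }
          return best;                             // refit on inliers in practice
      }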

  12. Object tracking mask-based NLUT on GPUs for real-time generation of holographic videos of three-dimensional scenes.

    PubMed

    Kwon, M-W; Kim, S-C; Yoon, S-E; Ho, Y-S; Kim, E-S

    2015-02-09

    A new object tracking mask-based novel-look-up-table (OTM-NLUT) method is proposed and implemented on graphics-processing-units (GPUs) for real-time generation of holographic videos of three-dimensional (3-D) scenes. Since the proposed method is designed to be matched with software and memory structures of the GPU, the number of compute-unified-device-architecture (CUDA) kernel function calls and the computer-generated hologram (CGH) buffer size of the proposed method have been significantly reduced. It therefore results in a great increase of the computational speed of the proposed method and enables real-time generation of CGH patterns of 3-D scenes. Experimental results show that the proposed method can generate 31.1 frames of Fresnel CGH patterns with 1,920 × 1,080 pixels per second, on average, for three test 3-D video scenarios with 12,666 object points on three GPU boards of NVIDIA GTX TITAN, and confirm the feasibility of the proposed method in the practical application of electro-holographic 3-D displays.

  13. Three-directional motion-compensation mask-based novel look-up table on graphics processing units for video-rate generation of digital holographic videos of three-dimensional scenes.

    PubMed

    Kwon, Min-Woo; Kim, Seung-Cheol; Kim, Eun-Soo

    2016-01-20

    A three-directional motion-compensation mask-based novel look-up table method is proposed and implemented on graphics processing units (GPUs) for video-rate generation of digital holographic videos of three-dimensional (3D) scenes. Since the proposed method is designed to be well matched with the software and memory structures of GPUs, the number of compute-unified-device-architecture kernel function calls can be significantly reduced. This results in a great increase of the computational speed of the proposed method, allowing video-rate generation of the computer-generated hologram (CGH) patterns of 3D scenes. Experimental results reveal that the proposed method can generate 39.8 frames of Fresnel CGH patterns with 1920×1080 pixels per second for the test 3D video scenario with 12,088 object points on dual GPU boards of NVIDIA GTX TITANs, and they confirm the feasibility of the proposed method in the practical application fields of electroholographic 3D displays.

  14. GPGPU-based explicit finite element computations for applications in biomechanics: the performance of material models, element technologies, and hardware generations.

    PubMed

    Strbac, V; Pierce, D M; Vander Sloten, J; Famaey, N

    2017-12-01

    Finite element (FE) simulations are increasingly valuable in assessing and improving the performance of biomedical devices and procedures. Due to high computational demands, such simulations may become difficult or even infeasible, especially when considering nearly incompressible and anisotropic material models prevalent in analyses of soft tissues. Implementations of GPGPU-based explicit FEs predominantly cover isotropic materials, e.g. the neo-Hookean model. To elucidate the computational expense of anisotropic materials, we implement the Gasser-Ogden-Holzapfel dispersed, fiber-reinforced model and compare solution times against the neo-Hookean model. Implementations of GPGPU-based explicit FEs conventionally rely on single-point (under) integration. To elucidate the expense of full and selective-reduced integration (more reliable), we implement both and compare the corresponding solution times against those generated using underintegration. To better understand the advancement of hardware, we compare results generated using representative Nvidia GPGPUs from three recent generations: Fermi (C2075), Kepler (K20c), and Maxwell (GTX980). We explore scaling by solving the same boundary value problem (an extension-inflation test on a segment of human aorta) with progressively larger FE meshes. Our results demonstrate substantial improvements in simulation speeds relative to two benchmark FE codes (up to 300× while maintaining accuracy), and thus open many avenues to novel applications in biomechanics and medicine.
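
    For context, the dispersed fiber-reinforced (Gasser-Ogden-Holzapfel) strain-energy function is commonly written as follows; this is the standard textbook form, not reproduced from the paper:

      \Psi = \frac{\mu}{2}\,(\bar{I}_{1} - 3)
           + \frac{k_{1}}{2 k_{2}} \sum_{i=4,6}\Big[\exp\!\big(k_{2}\,\bar{E}_{i}^{\,2}\big) - 1\Big],
      \qquad
      \bar{E}_{i} = \kappa\,(\bar{I}_{1} - 3) + (1 - 3\kappa)\,(\bar{I}_{i} - 1)

    with μ the matrix stiffness, k1 and k2 fiber parameters, κ the dispersion parameter, and Ī1, Ī4, Ī6 the isochoric strain invariants; the exponential fiber terms are what make this model markedly more expensive per element than neo-Hookean.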

  15. A real-time standard parts inspection based on deep learning

    NASA Astrophysics Data System (ADS)

    Xu, Kuan; Li, XuDong; Jiang, Hongzhi; Zhao, Huijie

    2017-10-01

    Standard parts are necessary components in mechanical structures such as bogies and connectors. These mechanical structures may shatter or loosen if standard parts are lost, so real-time standard parts inspection systems are essential to guarantee their safety. Researchers favor inspection systems based on deep learning because they work well on images with complex backgrounds, which are common in standard parts inspection. A typical inspection detection system contains two basic components: feature extractors and object classifiers. For the object classifier, the Region Proposal Network (RPN) is one of the most essential architectures in most state-of-the-art object detection systems. However, in the basic RPN architecture, the proposals of Regions of Interest (ROI) have fixed sizes (9 anchors for each pixel); they are effective but waste considerable computing resources and time. In standard parts detection situations, standard parts have known sizes, so we can choose anchor sizes based on the ground truths through machine learning. The experiments prove that we can use 2 anchors to achieve almost the same accuracy and recall rate. Overall, our standard parts detection system reaches 15 fps on an NVIDIA GTX 1080 GPU while achieving a detection accuracy of 90.01% mAP.

  16. High definition live 3D-OCT in vivo: design and evaluation of a 4D OCT engine with 1 GVoxel/s

    PubMed Central

    Wieser, Wolfgang; Draxinger, Wolfgang; Klein, Thomas; Karpf, Sebastian; Pfeiffer, Tom; Huber, Robert

    2014-01-01

    We present a 1300 nm OCT system for volumetric real-time live OCT acquisition and visualization at 1 billion volume elements per second. All technological challenges and problems associated with such high scanning speed are discussed in detail as well as the solutions. In one configuration, the system acquires, processes and visualizes 26 volumes per second where each volume consists of 320 x 320 depth scans and each depth scan has 400 usable pixels. This is the fastest real-time OCT to date in terms of voxel rate. A 51 Hz volume rate is realized with half the frame number. In both configurations the speed can be sustained indefinitely. The OCT system uses a 1310 nm Fourier domain mode locked (FDML) laser operated at 3.2 MHz sweep rate. Data acquisition is performed with two dedicated digitizer cards, each running at 2.5 GS/s, hosted in a single desktop computer. Live real-time data processing and visualization are realized with custom developed software on an NVidia GTX 690 dual graphics processing unit (GPU) card. To evaluate potential future applications of such a system, we present volumetric videos captured at 26 and 51 Hz of planktonic crustaceans and skin. PMID:25401010

  17. GPU color space conversion

    NASA Astrophysics Data System (ADS)

    Chase, Patrick; Vondran, Gary

    2011-01-01

    Tetrahedral interpolation is commonly used to implement continuous color space conversions from sparse 3D and 4D lookup tables. We investigate the implementation and optimization of tetrahedral interpolation algorithms for GPUs, and compare to the best known CPU implementations as well as to a well known GPU-based trilinear implementation. We show that a $500 NVIDIA GTX-580 GPU is 3× faster than a $1000 Intel Core i7 980X CPU for 3D interpolation, and 9× faster for 4D interpolation. Performance-relevant GPU attributes are explored, including thread scheduling, local memory characteristics, global memory hierarchy, and cache behaviors. We consider existing tetrahedral interpolation algorithms and tune based on the structure and branching capabilities of current GPUs. Global memory performance is improved by reordering and expanding the lookup table to ensure optimal access behaviors. Per-multiprocessor local memory is exploited to implement optimally coalesced global memory accesses, and local memory addressing is optimized to minimize bank conflicts. We explore the impacts of lookup table density upon computation and memory access costs. Also presented are CPU-based 3D and 4D interpolators, using SSE vector operations, that are faster than any previously published solution.
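
    A scalar sketch of the core operation: the fractional coordinates are ordered to select one of the six tetrahedra in the enclosing lattice cube, and the result is a weighted sum of four lattice entries (versus eight for trilinear). Indexing and naming here are illustrative, not the paper's tuned GPU code.

      // 3D tetrahedral interpolation, one output channel, n^3 lattice.
      // Ordering the fractional offsets picks the tetrahedron; four reads
      // along its edge path replace trilinear's eight.
      float tetraInterp(const float* lut, int n,
                        float x, float y, float z) {   // coords in [0, n-1)
          int ix = (int)x, iy = (int)y, iz = (int)z;
          float fx = x - ix, fy = y - iy, fz = z - iz;
          size_t sx = 1, sy = n, sz = (size_t)n * n;
          size_t base = ix * sx + iy * sy + iz * sz;
          size_t s1, s2, s3; float f1, f2, f3;         // sorted so f1>=f2>=f3
          if (fx >= fy) {
              if (fy >= fz)      { s1=sx; f1=fx; s2=sy; f2=fy; s3=sz; f3=fz; }
              else if (fx >= fz) { s1=sx; f1=fx; s2=sz; f2=fz; s3=sy; f3=fy; }
              else               { s1=sz; f1=fz; s2=sx; f2=fx; s3=sy; f3=fy; }
          } else {
              if (fx >= fz)      { s1=sy; f1=fy; s2=sx; f2=fx; s3=sz; f3=fz; }
              else if (fy >= fz) { s1=sy; f1=fy; s2=sz; f2=fz; s3=sx; f3=fx; }
              else               { s1=sz; f1=fz; s2=sy; f2=fy; s3=sx; f3=fx; }
          }
          return (1 - f1) * lut[base]                  // weights sum to 1
               + (f1 - f2) * lut[base + s1]
               + (f2 - f3) * lut[base + s1 + s2]
               +  f3       * lut[base + s1 + s2 + s3];
      }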

  18. 77 FR 26789 - Certain Semiconductor Chips Having Synchronous Dynamic Random Access Memory Controllers and...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2012-05-07

    ... patents. 73 FR 75131. The principal respondent was NVIDIA Corporation of Santa Clara, California (``NVIDIA''). Joining NVIDIA as respondents were approximately twenty of NVIDIA's customers. The Commission found a... accused products in the United States: NVIDIA; Hewlett-Packard Co. of Palo Alto, California; ASUS Computer...

  19. GPU Based N-Gram String Matching Algorithm with Score Table Approach for String Searching in Many Documents

    NASA Astrophysics Data System (ADS)

    Srinivasa, K. G.; Shree Devi, B. N.

    2017-10-01

    String searching in documents has become a tedious task with the evolution of Big Data. The generation of large data sets demands high-performance search algorithms in areas such as text mining, information retrieval and many others. The popularity of GPUs for general purpose computing has been increasing for various applications. Therefore it is of great interest to exploit the thread feature of a GPU to provide a high-performance search algorithm. This paper proposes an optimized new approach to the N-gram model for string search in a number of lengthy documents and its GPU implementation. The algorithm exploits GPGPUs for searching strings in many documents, employing character-level N-gram matching with a parallel Score Table approach and search using the CUDA API. The new approach of the Score Table, used for frequency storage of N-grams in a document, makes the search independent of the document's length and allows faster access to the frequency values, thus decreasing the search complexity. The extensive thread feature in a GPU has been exploited to enable parallel pre-processing of trigrams in a document for Score Table creation and parallel search in a huge number of documents, thus speeding up the whole search process even for a large pattern size. Experiments were carried out for many documents of varied length and search strings from the standard Lorem Ipsum text on NVIDIA's GeForce GT 540M GPU with 96 cores. Results prove that the parallel approach for Score Table creation and searching gives a good speedup over the same approach executed serially.
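
    A serial sketch of the Score Table idea under simple assumptions (byte trigrams hashed into a fixed-size frequency table per document; the hash and table size are illustrative): lookup cost then depends only on the pattern length, not the document length.

      // Build a per-document trigram frequency table ("Score Table"),
      // then score a pattern by summing the stored frequencies of its
      // trigrams. A high score means the pattern's trigrams are frequent.
      #include <string>
      #include <vector>

      static int trigramId(const char* s) {     // 3 bytes -> table index
          return ((unsigned char)s[0] * 131 + (unsigned char)s[1]) * 131
                 + (unsigned char)s[2];
      }

      std::vector<int> buildScoreTable(const std::string& doc) {
          std::vector<int> table(131 * 131 * 131, 0);
          for (size_t i = 0; i + 2 < doc.size(); ++i)
              ++table[trigramId(doc.data() + i) % table.size()];
          return table;
      }

      long scorePattern(const std::vector<int>& table, const std::string& pat) {
          long score = 0;
          for (size_t i = 0; i + 2 < pat.size(); ++i)
              score += table[trigramId(pat.data() + i) % table.size()];
          return score;
      }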

  20. WE-E-213CD-08: A Novel Level Set Active Contour Algorithm Using the Jensen-Renyi Divergence for Tumor Segmentation in PET.

    PubMed

    Markel, D; Naqa, I El

    2012-06-01

    Positron emission tomography (PET) presents a valuable resource for delineating the biological tumor volume (BTV) for image-guided radiotherapy. However, accurate and consistent image segmentation is a significant challenge within the context of PET, owing to its low spatial resolution and high levels of noise. Active contour methods based on level set methods can be sensitive to noise and susceptible to failure in low-contrast regions. Therefore, this work evaluates a novel active contour algorithm applied to the task of PET tumor segmentation. A novel active contour segmentation algorithm based on maximizing the Jensen-Renyi divergence between regions of interest was applied to the task of segmenting lesions in 7 patients with T3-T4 pharyngolaryngeal squamous cell carcinoma. The algorithm was implemented on an NVidia GeForce GTX 560M GPU. The cases were taken from the Louvain database, which includes contours of the macroscopically defined BTV drawn using histology of resected tissue. The images were pre-processed using denoising/deconvolution. The segmented volumes agreed well with the macroscopic contours, with an average concordance index and classification error of 0.6 ± 0.09 and 55 ± 16.5%, respectively. The algorithm in its present implementation requires approximately 0.5-1.3 sec per iteration and can reach convergence within 10-30 iterations. The Jensen-Renyi active contour method was shown to match, and in terms of concordance outperform, a variety of PET segmentation methods that have been previously evaluated using the same data. Further evaluation on a larger dataset along with performance optimization is necessary before clinical deployment. © 2012 American Association of Physicists in Medicine.
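
    For reference, the Jensen-Renyi divergence maximized by the contour evolution is commonly defined as follows (generic weights ω and order α; notation not taken from the abstract):

      JR_{\alpha}^{\omega}(P_{1},\dots,P_{n})
        = H_{\alpha}\!\Big(\sum_{i=1}^{n} \omega_{i} P_{i}\Big)
        - \sum_{i=1}^{n} \omega_{i}\, H_{\alpha}(P_{i}),
      \qquad
      H_{\alpha}(P) = \frac{1}{1-\alpha} \log \sum_{k} p_{k}^{\alpha}

    where H_α is the Rényi entropy; maximizing JR over the partition induced by the contour separates region intensity distributions P_i.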

  1. SU-E-T-422: Fast Analytical Beamlet Optimization for Volumetric Intensity-Modulated Arc Therapy

    SciTech Connect

    Chan, Kenny S K; Lee, Louis K Y; Xing, L

    2015-06-15

    Purpose: To implement a fast optimization algorithm on a CPU/GPU heterogeneous computing platform and to obtain an optimal fluence for a given target dose distribution from the pre-calculated beamlets in an analytical approach. Methods: The 2D target dose distribution was modeled as an n-dimensional vector and estimated by a linear combination of independent basis vectors. The basis set was composed of the pre-calculated beamlet dose distributions at every 6 degrees of gantry angle, and the cost function was set as the magnitude square of the vector difference between the target and the estimated dose distribution. The optimal weighting of the basis, which corresponds to the optimal fluence, was obtained analytically by the least squares method. Those basis vectors with a positive weighting were selected for entering into the next level of optimization. In total, 7 levels of optimization were implemented in the study. Ten head-and-neck and ten prostate carcinoma cases were selected for the study and mapped to a round water phantom with a diameter of 20 cm. The Matlab computation was performed in a heterogeneous programming environment with an Intel i7 CPU and an NVIDIA Geforce 840M GPU. Results: In all selected cases, the estimated dose distribution was in good agreement with the given target dose distribution, and their correlation coefficients were found to be in the range of 0.9992 to 0.9997. Their root-mean-square error was monotonically decreasing and converging after 7 cycles of optimization. The computation took only about 10 seconds, and the optimal fluence maps at each gantry angle throughout an arc were quickly obtained. Conclusion: An analytical approach is derived for finding the optimal fluence for a given target dose distribution, and a fast optimization algorithm implemented on the CPU/GPU heterogeneous computing environment greatly reduces the optimization time.
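
    In the vector formulation described, the analytical least-squares step has the familiar normal-equations form (with B the matrix whose columns are the beamlet dose distributions and d the target dose vector; notation assumed):

      w^{*} = \arg\min_{w}\ \| B\,w - d \|_{2}^{2} = (B^{\mathsf{T}} B)^{-1} B^{\mathsf{T}} d

    after which only the positively weighted basis vectors are carried into the next optimization level.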

  2. The effects of video game experience and active stereoscopy on performance in combat identification tasks.

    PubMed

    Keebler, Joseph R; Jentsch, Florian; Schuster, David

    2014-12-01

    We investigated the effects of active stereoscopic simulation-based training and individual differences in video game experience on multiple indices of combat identification (CID) performance. Fratricide is a major problem in combat operations involving military vehicles. In this research, we aimed to evaluate the effects of training on CID performance in order to reduce fratricide errors. Individuals were trained on 12 combat vehicles in a simulation, which were presented via either a non-stereoscopic or active stereoscopic display using NVIDIA's GeForce shutter glass technology. Self-report was used to assess video game experience, leading to four between-subjects groups: high video game experience with stereoscopy, low video game experience with stereoscopy, high video game experience without stereoscopy, and low video game experience without stereoscopy. We then tested participants on their memory of each vehicle's alliance and name across multiple measures, including photographs and videos. There was a main effect for both video game experience and stereoscopy across many of the dependent measures. Further, we found interactions between video game experience and stereoscopic training, such that those individuals with high video game experience in the non-stereoscopic group had the highest performance outcomes in the sample on multiple dependent measures. This study suggests that individual differences in video game experience may be predictive of enhanced performance in CID tasks. Selection based on video game experience in CID tasks may be a useful strategy for future military training. Future research should investigate the generalizability of these effects, such as identification through unmanned vehicle sensors.

  3. Multistage Analysis of Cyber Threats for Quick Mission Impact Assessment (CyberIA)

    DTIC Science & Technology

    2015-09-01

    Corporation. NVIDIA® is a registered trademark of the NVIDIA Corporation. CUDA™ is a trademark of the NVIDIA Corporation. Released by J. Lee...for developing and integrating different high-performance C/C++ algorithms. This capability is significant because NVIDIA® CUDA™ architecture

  4. SU-D-206-01: Employing a Novel Consensus Optimization Strategy to Achieve Iterative Cone Beam CT Reconstruction On a Multi-GPU Platform

    SciTech Connect

    Li, B; Southern Medical University, Guangzhou, Guangdong; Tian, Z

    Purpose: While compressed sensing-based cone-beam CT (CBCT) iterative reconstruction techniques have demonstrated tremendous capability of reconstructing high-quality images from undersampled noisy data, their long computation time still hinders wide application in routine clinics. The purpose of this study is to develop a reconstruction framework that employs modern consensus optimization techniques to achieve CBCT reconstruction on a multi-GPU platform for improved computational efficiency. Methods: Total projection data were evenly distributed to multiple GPUs. Each GPU performed reconstruction using its own projection data with a conventional total variation regularization approach to ensure image quality. In addition, the solutions from the GPUs were subject to a consistency constraint that they should be identical. We solved the optimization problem with all the constraints considered rigorously using an alternating direction method of multipliers (ADMM) algorithm. The reconstruction framework was implemented using OpenCL on a platform with two Nvidia GTX590 GPU cards, each with two GPUs. We studied the performance of our method and demonstrated its advantages through a simulation case with an NCAT phantom and an experimental case with a Catphan phantom. Results: Compared with the CBCT images reconstructed using the conventional FDK method with full projection datasets, our proposed method achieved comparable image quality with about one-third the number of projections. The computation time on the multi-GPU platform was ∼55 s and ∼35 s in the two cases, respectively, achieving a speedup factor of ∼3.0 compared with single-GPU reconstruction. Conclusion: We have developed a consensus ADMM-based CBCT reconstruction method which enables performing reconstruction on a multi-GPU platform. The achieved efficiency makes this method clinically attractive.
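
    Generically, the consensus formulation alluded to assigns each GPU its own image variable tied to a shared consensus image, and ADMM alternates the following updates (standard scaled form with penalty ρ and duals u_i; not the authors' exact notation):

      \min_{x_{1},\dots,x_{N},\,z}\ \sum_{i=1}^{N} f_{i}(x_{i})
      \quad \text{s.t.} \quad x_{i} = z,\ i = 1,\dots,N

      x_{i}^{k+1} = \arg\min_{x_{i}}\ f_{i}(x_{i})
                    + \tfrac{\rho}{2}\,\| x_{i} - z^{k} + u_{i}^{k} \|_{2}^{2}, \qquad
      z^{k+1} = \frac{1}{N} \sum_{i} \big( x_{i}^{k+1} + u_{i}^{k} \big), \qquad
      u_{i}^{k+1} = u_{i}^{k} + x_{i}^{k+1} - z^{k+1}

    where f_i is the data-fidelity-plus-regularization objective for the projections held by GPU i; each x_i-update runs independently on its own GPU, and only z and the duals are exchanged.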

  5. Fast skin dose estimation system for interventional radiology

    PubMed Central

    Takata, Takeshi; Kotoku, Jun’ichi; Maejima, Hideyuki; Kumagai, Shinobu; Arai, Norikazu; Kobayashi, Takenori; Shiraishi, Kenshiro; Yamamoto, Masayoshi; Kondo, Hiroshi; Furui, Shigeru

    2018-01-01

    To minimise the radiation dermatitis related to interventional radiology (IR), rapid and accurate dose estimation has been sought for all procedures. We propose a technique for estimating the patient skin dose rapidly and accurately using Monte Carlo (MC) simulation with a graphical processing unit (GPU, GTX 1080; Nvidia Corp.). The skin dose distribution is simulated based on an individual patient’s computed tomography (CT) dataset for fluoroscopic conditions after the CT dataset has been segmented into air, water and bone based on pixel values. The skin is assumed to be one layer at the outer surface of the body. Fluoroscopic conditions are obtained from a log file of a fluoroscopic examination. Estimating the absorbed skin dose distribution requires calibration of the dose simulated by our system. For this purpose, a linear function was used to approximate the relation between the simulated dose and the measured dose using radiophotoluminescence (RPL) glass dosimeters in a water-equivalent phantom. Differences of maximum skin dose between our system and the Particle and Heavy Ion Transport code System (PHITS) were as high as 6.1%. The relative statistical error (2 σ) for the simulated dose obtained using our system was ≤3.5%. Using a GPU, the simulation on the chest CT dataset aiming at the heart was within 3.49 s on average: the GPU is 122 times faster than a CPU (Core i7-7700K; Intel Corp.). Our system (using the GPU, the log file, and the CT dataset) estimated the skin dose more rapidly and more accurately than conventional methods. PMID:29136194

  6. Fast skin dose estimation system for interventional radiology.

    PubMed

    Takata, Takeshi; Kotoku, Jun'ichi; Maejima, Hideyuki; Kumagai, Shinobu; Arai, Norikazu; Kobayashi, Takenori; Shiraishi, Kenshiro; Yamamoto, Masayoshi; Kondo, Hiroshi; Furui, Shigeru

    2018-03-01

    To minimise the radiation dermatitis related to interventional radiology (IR), rapid and accurate dose estimation has been sought for all procedures. We propose a technique for estimating the patient skin dose rapidly and accurately using Monte Carlo (MC) simulation with a graphical processing unit (GPU, GTX 1080; Nvidia Corp.). The skin dose distribution is simulated based on an individual patient's computed tomography (CT) dataset for fluoroscopic conditions after the CT dataset has been segmented into air, water and bone based on pixel values. The skin is assumed to be one layer at the outer surface of the body. Fluoroscopic conditions are obtained from a log file of a fluoroscopic examination. Estimating the absorbed skin dose distribution requires calibration of the dose simulated by our system. For this purpose, a linear function was used to approximate the relation between the simulated dose and the measured dose using radiophotoluminescence (RPL) glass dosimeters in a water-equivalent phantom. Differences of maximum skin dose between our system and the Particle and Heavy Ion Transport code System (PHITS) were as high as 6.1%. The relative statistical error (2 σ) for the simulated dose obtained using our system was ≤3.5%. Using a GPU, the simulation on the chest CT dataset aiming at the heart was within 3.49 s on average: the GPU is 122 times faster than a CPU (Core i7-7700K; Intel Corp.). Our system (using the GPU, the log file, and the CT dataset) estimated the skin dose more rapidly and more accurately than conventional methods.

  7. Efficient implementation of the 3D-DDA ray traversal algorithm on GPU and its application in radiation dose calculation.

    PubMed

    Xiao, Kai; Chen, Danny Z; Hu, X Sharon; Zhou, Bo

    2012-12-01

    The three-dimensional digital differential analyzer (3D-DDA) algorithm is a widely used ray traversal method, which is also at the core of many convolution/superposition (C/S) dose calculation approaches. However, porting existing C/S dose calculation methods onto graphics processing units (GPUs) has brought challenges to retaining the efficiency of this algorithm. In particular, straightforward implementation of the original 3D-DDA algorithm inflicts a lot of branch divergence, which conflicts with the GPU programming model and leads to suboptimal performance. In this paper, an efficient GPU implementation of the 3D-DDA algorithm is proposed, which effectively reduces such branch divergence and improves performance of the C/S dose calculation programs running on GPU. The main idea of the proposed method is to convert a number of conditional statements in the original 3D-DDA algorithm into a set of simple operations (e.g., arithmetic, comparison, and logic) which are better supported by the GPU architecture. To verify and demonstrate the performance improvement, this ray traversal method was integrated into a GPU-based collapsed cone convolution/superposition (CCCS) dose calculation program. The proposed method has been tested using a water phantom and various clinical cases on an NVIDIA GTX570 GPU. The CCCS dose calculation program based on the efficient 3D-DDA ray traversal implementation runs 1.42∼2.67× faster than the one based on the original 3D-DDA implementation, without losing any accuracy. The results show that the proposed method can effectively reduce branch divergence in the original 3D-DDA ray traversal algorithm and improve the performance of the CCCS program running on GPU. Considering the wide utilization of the 3D-DDA algorithm, various applications can benefit from this implementation method.
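
    A generic illustration of the branch-reduction idea (not the paper's exact transformation): the axis selection in each 3D-DDA step is computed with comparison masks folded into arithmetic instead of nested conditionals, so all threads in a warp execute the same instructions.

      // One 3D-DDA step with the axis choice computed arithmetically:
      // comparisons produce 0/1 masks that select the axis with the
      // smallest tMax, avoiding divergent nested branches.
      __device__ void ddaStep(float3& tMax, const float3& tDelta,
                              int3& voxel, const int3& step) {
          int mx = (tMax.x <= tMax.y) & (tMax.x <= tMax.z);  // x is minimal
          int my = (tMax.y <  tMax.x) & (tMax.y <= tMax.z);  // y is minimal
          int mz = 1 - mx - my;                              // otherwise z
          voxel.x += mx * step.x;
          voxel.y += my * step.y;
          voxel.z += mz * step.z;
          tMax.x  += mx * tDelta.x;
          tMax.y  += my * tDelta.y;
          tMax.z  += mz * tDelta.z;
      }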

  8. GPU-accelerated Monte Carlo convolution/superposition implementation for dose calculation.

    PubMed

    Zhou, Bo; Yu, Cedric X; Chen, Danny Z; Hu, X Sharon

    2010-11-01

    Dose calculation is a key component in radiation treatment planning systems. Its performance and accuracy are crucial to the quality of treatment plans as emerging advanced radiation therapy technologies are exerting ever tighter constraints on dose calculation. A common practice is to choose either a deterministic method such as the convolution/superposition (CS) method for speed or a Monte Carlo (MC) method for accuracy. The goal of this work is to boost the performance of a hybrid Monte Carlo convolution/superposition (MCCS) method by devising a graphics processing unit (GPU) implementation so as to make the method practical for day-to-day usage. Although the MCCS algorithm combines the merits of MC fluence generation and CS fluence transport, it is still not fast enough to be used as a day-to-day planning tool. To alleviate the speed issue of MC algorithms, the authors adopted MCCS as their target method and implemented a GPU-based version. In order to fully utilize the GPU computing power, the MCCS algorithm is modified to match the GPU hardware architecture. The performance of the authors' GPU-based implementation on an Nvidia GTX260 card is compared to a multithreaded software implementation on a quad-core system. A speedup in the range of 6.7-11.4x is observed for the clinical cases used. The less than 2% statistical fluctuation also indicates that the accuracy of the authors' GPU-based implementation is in good agreement with the results from the quad-core CPU implementation. This work shows that GPU is a feasible and cost-efficient solution compared to other alternatives such as using cluster machines or field-programmable gate arrays for satisfying the increasing demands on computation speed and accuracy of dose calculation. But there are also inherent limitations of using GPU for accelerating MC-type applications, which are also analyzed in detail in this article.

  9. Full Monte Carlo-Based Biologic Treatment Plan Optimization System for Intensity Modulated Carbon Ion Therapy on Graphics Processing Unit.

    PubMed

    Qin, Nan; Shen, Chenyang; Tsai, Min-Yu; Pinto, Marco; Tian, Zhen; Dedes, Georgios; Pompos, Arnold; Jiang, Steve B; Parodi, Katia; Jia, Xun

    2018-01-01

    One of the major benefits of carbon ion therapy is enhanced biological effectiveness at the Bragg peak region. For intensity modulated carbon ion therapy (IMCT), it is desirable to use Monte Carlo (MC) methods to compute the properties of each pencil beam spot for treatment planning, because of their accuracy in modeling physics processes and estimating biological effects. We previously developed goCMC, a graphics processing unit (GPU)-oriented MC engine for carbon ion therapy. The purpose of the present study was to build a biological treatment plan optimization system using goCMC. The repair-misrepair-fixation model was implemented to compute the spatial distribution of linear-quadratic model parameters for each spot. A treatment plan optimization module was developed to minimize the difference between the prescribed and actual biological effect. We used a gradient-based algorithm to solve the optimization problem. The system was embedded in the Varian Eclipse treatment planning system under a client-server architecture to achieve a user-friendly planning environment. We tested the system with a 1-dimensional homogeneous water case and three 3-dimensional patient cases. Our system generated treatment plans with biological spread-out Bragg peaks covering the targeted regions and sparing critical structures. Using 4 NVidia GTX 1080 GPUs, the total computation time, including spot simulation, optimization, and final dose calculation, was 0.6 hour for the prostate case (8282 spots), 0.2 hour for the pancreas case (3795 spots), and 0.3 hour for the brain case (6724 spots). The computation time was dominated by MC spot simulation. We built a biological treatment plan optimization system for IMCT that performs simulations using a fast MC engine, goCMC. To the best of our knowledge, this is the first time that full MC-based IMCT inverse planning has been achieved in a clinically viable time frame. Copyright © 2017 Elsevier Inc. All rights reserved.

  10. SU-G-TeP1-15: Toward a Novel GPU Accelerated Deterministic Solution to the Linear Boltzmann Transport Equation

    SciTech Connect

    Yang, R; Fallone, B; Cross Cancer Institute, Edmonton, AB

    Purpose: To develop a Graphic Processor Unit (GPU) accelerated deterministic solution to the Linear Boltzmann Transport Equation (LBTE) for accurate dose calculations in radiotherapy (RT). A deterministic solution yields the potential for major speed improvements due to the sparse matrix-vector and vector-vector multiplications and would thus be of benefit to RT. Methods: In order to leverage the massively parallel architecture of GPUs, the first order LBTE was reformulated as a second order self-adjoint equation using the Least Squares Finite Element Method (LSFEM). This produces a symmetric positive-definite matrix which is efficiently solved using a parallelized conjugate gradient (CG) solver. The LSFEM formalism is applied in space, discrete ordinates is applied in angle, and the Multigroup method is applied in energy. The final linear system of equations produced is tightly coupled in space and angle. Our code written in CUDA-C was benchmarked on an Nvidia GeForce TITAN-X GPU against an Intel i7-6700K CPU. A spatial mesh of 30,950 tetrahedral elements was used with an S4 angular approximation. Results: To avoid repeating a full computationally intensive finite element matrix assembly at each Multigroup energy, a novel mapping algorithm was developed which minimized the operations required at each energy. Additionally, a parallelized memory mapping for the kronecker product between the sparse spatial and angular matrices, including Dirichlet boundary conditions, was created. Atomicity is preserved by graph-coloring overlapping nodes into separate kernel launches. The one-time mapping calculations for matrix assembly, kronecker product, and boundary condition application took 452±1ms on GPU. Matrix assembly for 16 energy groups took 556±3s on CPU, and 358±2ms on GPU using the mappings developed. The CG solver took 93±1s on CPU, and 468±2ms on GPU. Conclusion: Three computationally intensive subroutines in deterministically solving the LBTE have been

  11. Line-by-line spectroscopic simulations on graphics processing units

    NASA Astrophysics Data System (ADS)

    Collange, Sylvain; Daumas, Marc; Defour, David

    2008-01-01

    …C++ 2005 with Cygwin 1.5.24 under Windows XP.
    RAM: 1 gigabyte
    Classification: 21.2
    External routines: OpenGL (http://www.opengl.org)
    Nature of problem: Simulating radiative transfer on high-temperature high-pressure gases.
    Solution method: Line-by-line Monte-Carlo ray-tracing.
    Unusual features: Parallel computations are moved to the GPU.
    Additional comments: nVidia GeForce 7000 or ATI Radeon X1000 series graphics processing unit is required.
    Running time: A few minutes.

  12. 75 FR 44989 - In the Matter of Certain Semiconductor Chips Having Synchronous Dynamic Random Access Memory...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2010-07-30

    ... following respondents: NVIDIA Corporation of Santa Clara, California; Asustek Computer, Inc. of Taipei... exclusion order and cease- and-desist orders against respondents NVIDIA Corp.; Hewlett-Packard Co.; ASUS...

  13. Numerical Integration with Graphical Processing Unit for QKD Simulation

    DTIC Science & Technology

    2014-03-27

    Windows system application programming interface (API) timer. The problem sizes studied produce speedups greater than 60x on the NVIDIA Tesla C2075...CUDA API...CUDA and NVIDIA GPU Hardware...Theoretical Floating-Point Operations per Second for Intel CPUs and NVIDIA GPUs [3

  14. A hybrid reconstruction algorithm for fast and accurate 4D cone-beam CT imaging

    SciTech Connect

    Yan, Hao; Folkerts, Michael; Jiang, Steve B., E-mail: xun.jia@utsouthwestern.edu, E-mail: steve.jiang@UTSouthwestern.edu

    2014-07-15

    Purpose: 4D cone beam CT (4D-CBCT) has been utilized in radiation therapy to provide 4D image guidance in the lung and upper abdomen area. However, clinical application of 4D-CBCT is currently limited due to the long scan time and low image quality. The purpose of this paper is to develop a new 4D-CBCT reconstruction method that restores volumetric images based on the 1-min scan data acquired with a standard 3D-CBCT protocol. Methods: The model optimizes a deformation vector field that deforms a patient-specific planning CT (p-CT), so that the calculated 4D-CBCT projections match measurements. A forward-backward splitting (FBS) method is invented to solve the optimization problem. It splits the original problem into two well-studied subproblems, i.e., image reconstruction and deformable image registration. By iteratively solving the two subproblems, FBS gradually yields correct deformation information, while maintaining high image quality. The whole workflow is implemented on a graphic-processing-unit to improve efficiency. Comprehensive evaluations have been conducted on a moving phantom and three real patient cases regarding the accuracy and quality of the reconstructed images, as well as the algorithm robustness and efficiency. Results: The proposed algorithm reconstructs 4D-CBCT images from highly under-sampled projection data acquired with 1-min scans. Regarding the anatomical structure location accuracy, 0.204 mm average differences and 0.484 mm maximum difference are found for the phantom case, and the maximum differences of 0.3–0.5 mm for patients 1–3 are observed. As for the image quality, intensity errors below 5 and 20 HU compared to the planning CT are achieved for the phantom and the patient cases, respectively. Signal-noise-ratio values are improved by 12.74 and 5.12 times compared to results from the FDK algorithm using the 1-min data and 4-min data, respectively. The computation time of the algorithm on an NVIDIA GTX590 card is 1–1.5 min per
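
    Generically, a forward-backward splitting iteration alternates a gradient (forward) step on a smooth data-fidelity term f with a proximal (backward) step on a regularizer g (standard form, notation assumed):

      x^{k+1} = \operatorname{prox}_{\gamma g}\!\big( x^{k} - \gamma\,\nabla f(x^{k}) \big)

    which is consistent with the two subproblems named above: the gradient step plays the role of an image-reconstruction update and the proximal step that of a deformable-registration update.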

  15. TH-A-18C-04: Ultrafast Cone-Beam CT Scatter Correction with GPU-Based Monte Carlo Simulation

    SciTech Connect

    Xu, Y; Southern Medical University, Guangzhou; Bai, T

    2014-06-15

    Purpose: Scatter artifacts severely degrade image quality of cone-beam CT (CBCT). We present an ultrafast scatter correction framework by using GPU-based Monte Carlo (MC) simulation and a prior patient CT image, aiming to automatically finish the whole process, including both scatter correction and reconstruction, within 30 seconds. Methods: The method consists of six steps: 1) FDK reconstruction using raw projection data; 2) rigid registration of planning CT to the FDK results; 3) MC scatter calculation at sparse view angles using the planning CT; 4) interpolation of the calculated scatter signals to other angles; 5) removal of scatter from the raw projections; 6) FDK reconstruction using the scatter-corrected projections. In addition to using GPU to accelerate MC photon simulations, we also use a small number of photons and a down-sampled CT image in simulation to further reduce computation time. A novel denoising algorithm is used to eliminate MC scatter noise caused by low photon numbers. The method is validated on head-and-neck cases with simulated and clinical data. Results: We have studied the impacts of photon histories and volume down-sampling factors on the accuracy of scatter estimation. A Fourier analysis was conducted to show that scatter images calculated at 31 angles are sufficient to restore those at all angles with <0.1% error. For the simulated case with a resolution of 512×512×100, we simulated 10M photons per angle. The total computation time is 23.77 seconds on an Nvidia GTX Titan GPU. The scatter-induced shading/cupping artifacts are substantially reduced, and the average HU error of a region-of-interest is reduced from 75.9 to 19.0 HU. Similar results were found for a real patient case. Conclusion: A practical ultrafast MC-based CBCT scatter correction scheme is developed. The whole process of scatter correction and reconstruction is accomplished within 30 seconds. This study is supported in part by NIH (1R01CA154747-01), The Core Technology

  16. TH-A-18C-09: Ultra-Fast Monte Carlo Simulation for Cone Beam CT Imaging of Brain Trauma

    SciTech Connect

    Sisniega, A; Zbijewski, W; Stayman, J

    Purpose: Application of cone-beam CT (CBCT) to low-contrast soft tissue imaging, such as in detection of traumatic brain injury, is challenged by high levels of scatter. A fast, accurate scatter correction method based on Monte Carlo (MC) estimation is developed for application in high-quality CBCT imaging of acute brain injury. Methods: The correction involves MC scatter estimation executed on an NVIDIA GTX 780 GPU (MC-GPU), with a baseline simulation speed of ~1e7 photons/sec. MC-GPU is accelerated by a novel, GPU-optimized implementation of variance reduction (VR) techniques (forced detection and photon splitting). The number of simulated tracks and projections is reduced for additional speed-up. Residual noise is removed and the missing scatter projections are estimated via kernel smoothing (KS) in the projection plane and across gantry angles. The method is assessed using CBCT images of a head phantom presenting a realistic simulation of fresh intracranial hemorrhage (100 kVp, 180 mAs, 720 projections, source-detector distance 700 mm, source-axis distance 480 mm). Results: For a fixed run-time of ~1 sec/projection, GPU-optimized VR reduces the noise in MC-GPU scatter estimates by a factor of 4. For scatter correction, MC-GPU with VR is executed with 4-fold angular downsampling and 1e5 photons/projection, yielding 3.5 minute run-time per scan, and de-noised with optimized KS. Corrected CBCT images demonstrate uniformity improvement of 18 HU and contrast improvement of 26 HU compared to no correction, and a 52% increase in contrast-to-noise ratio in simulated hemorrhage compared to “oracle” constant fraction correction. Conclusion: Acceleration of MC-GPU achieved through GPU-optimized variance reduction and kernel smoothing yields an efficient (<5 min/scan) and accurate scatter correction that does not rely on additional hardware or simplifying assumptions about the scatter distribution. The method is undergoing implementation in a novel CBCT dedicated to

  17. A low-complexity 2-point step size gradient projection method with selective function evaluations for smoothed total variation based CBCT reconstructions

    NASA Astrophysics Data System (ADS)

    Song, Bongyong; Park, Justin C.; Song, William Y.

    2014-11-01

    The Barzilai-Borwein (BB) 2-point step size gradient method is receiving attention for accelerating Total Variation (TV) based CBCT reconstructions. In order to become truly viable for clinical applications, however, its convergence property needs to be properly addressed. We propose a novel fast-converging gradient projection BB method that requires ‘at most one function evaluation’ in each iterative step. This Selective Function Evaluation method, referred to as GPBB-SFE in this paper, exhibits the desired convergence property when it is combined with a ‘smoothed TV’ or any other differentiable prior. This way, the proposed GPBB-SFE algorithm offers fast and guaranteed convergence to the desired 3D CBCT image with minimal computational complexity. We first applied this algorithm to a Shepp-Logan numerical phantom. We then applied it to a CatPhan 600 physical phantom (The Phantom Laboratory, Salem, NY) and a clinically-treated head-and-neck patient, both acquired from the TrueBeam™ system (Varian Medical Systems, Palo Alto, CA). Furthermore, we accelerated the reconstruction by implementing the algorithm on an NVIDIA GTX 480 GPU card. We first compared GPBB-SFE with three recently proposed BB-based CBCT reconstruction methods available in the literature using the Shepp-Logan numerical phantom with 40 projections. We found that GPBB-SFE shows either faster convergence speed/time or superior convergence properties compared to existing BB-based algorithms. With the CatPhan 600 physical phantom, the GPBB-SFE algorithm requires only 3 function evaluations in 30 iterations and reconstructs an image of standard, 364-projection FDK reconstruction quality using only 60 projections. We then applied the algorithm to a clinically-treated head-and-neck patient. It was observed that the GPBB-SFE algorithm requires only 18 function evaluations in 30 iterations. Compared with the FDK algorithm with 364 projections, the GPBB-SFE algorithm produces visibly equivalent quality CBCT
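
    For readers unfamiliar with the BB scheme, the sketch below shows a plain Barzilai-Borwein gradient-projection loop on a nonnegativity-constrained least-squares toy problem; it omits the selective-function-evaluation safeguard that distinguishes GPBB-SFE, so it illustrates the family of methods, not the authors' algorithm.

```python
# Plain Barzilai-Borwein gradient projection (illustrative sketch).
import numpy as np

def gpbb(grad, x0, n_iter=30, alpha0=1e-3):
    x = np.maximum(x0, 0.0)                     # project onto x >= 0
    x_prev, g_prev = None, None
    for _ in range(n_iter):
        g = grad(x)
        if x_prev is None:
            alpha = alpha0                      # bootstrap step size
        else:
            s, y = x - x_prev, g - g_prev       # BB1 step: (s.s)/(s.y)
            denom = float(s @ y)
            alpha = float(s @ s) / denom if denom > 0 else alpha0
        x_prev, g_prev = x, g
        x = np.maximum(x - alpha * g, 0.0)      # gradient step + projection
    return x

# Toy usage: min ||Ax - b||^2 subject to x >= 0.
A = np.random.rand(40, 20)
b = A @ np.random.rand(20)
sol = gpbb(lambda x: 2 * A.T @ (A @ x - b), np.zeros(20))
```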

  18. A hybrid reconstruction algorithm for fast and accurate 4D cone-beam CT imaging.

    PubMed

    Yan, Hao; Zhen, Xin; Folkerts, Michael; Li, Yongbao; Pan, Tinsu; Cervino, Laura; Jiang, Steve B; Jia, Xun

    2014-07-01

    4D cone beam CT (4D-CBCT) has been utilized in radiation therapy to provide 4D image guidance in the lung and upper abdomen. However, clinical application of 4D-CBCT is currently limited due to the long scan time and low image quality. The purpose of this paper is to develop a new 4D-CBCT reconstruction method that restores volumetric images based on the 1-min scan data acquired with a standard 3D-CBCT protocol. The model optimizes a deformation vector field that deforms a patient-specific planning CT (p-CT), so that the calculated 4D-CBCT projections match measurements. A forward-backward splitting (FBS) method is introduced to solve the optimization problem. It splits the original problem into two well-studied subproblems, i.e., image reconstruction and deformable image registration. By iteratively solving the two subproblems, FBS gradually yields correct deformation information, while maintaining high image quality. The whole workflow is implemented on a graphics processing unit to improve efficiency. Comprehensive evaluations have been conducted on a moving phantom and three real patient cases regarding the accuracy and quality of the reconstructed images, as well as the algorithm's robustness and efficiency. The proposed algorithm reconstructs 4D-CBCT images from highly under-sampled projection data acquired with 1-min scans. Regarding anatomical structure location accuracy, an average difference of 0.204 mm and a maximum difference of 0.484 mm are found for the phantom case, and maximum differences of 0.3-0.5 mm are observed for patients 1-3. As for image quality, intensity errors below 5 and 20 HU compared to the planning CT are achieved for the phantom and the patient cases, respectively. Signal-to-noise ratio values are improved by 12.74 and 5.12 times compared to results from the FDK algorithm using the 1-min data and 4-min data, respectively. The computation time of the algorithm on an NVIDIA GTX590 card is 1-1.5 min per phase. High-quality 4D-CBCT imaging based
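
    The forward-backward splitting idea can be summarized in a few lines. The sketch below alternates the two subproblems per respiratory phase; recon_update, register_update and warp are hypothetical stand-ins for the reconstruction solver, deformable registration solver and DVF warping operator, not the paper's implementation.

```python
# Conceptual forward-backward splitting (FBS) loop for 4D-CBCT.
def fbs_4dcbct(p_ct, projections, recon_update, register_update, warp,
               n_iter=20):
    # One deformation vector field (DVF) per respiratory phase; None is
    # treated by warp as the identity deformation.
    dvfs = {ph: None for ph in projections}
    for _ in range(n_iter):
        for ph, proj in projections.items():
            # Reconstruction subproblem: update the phase image so its
            # forward projections match the measured (undersampled) data.
            image = recon_update(warp(p_ct, dvfs[ph]), proj)
            # Registration subproblem: refine the DVF by deformably
            # registering the planning CT onto the updated image.
            dvfs[ph] = register_update(p_ct, image, dvfs[ph])
    return {ph: warp(p_ct, dvf) for ph, dvf in dvfs.items()}
```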

  19. High performance computing for deformable image registration: towards a new paradigm in adaptive radiotherapy.

    PubMed

    Samant, Sanjiv S; Xia, Junyi; Muyan-Ozcelik, Pinar; Owens, John D

    2008-08-01

    The advent of readily available temporal imaging or time series volumetric (4D) imaging has become an indispensable component of treatment planning and adaptive radiotherapy (ART) at many radiotherapy centers. Deformable image registration (DIR) is also used in other areas of medical imaging, including motion-corrected image reconstruction. Due to long computation times, clinical applications of DIR in radiation therapy and elsewhere have been limited and consequently relegated to offline analysis. With recent advances in hardware and software, graphics processing unit (GPU) based computing is an emerging technology for general purpose computation, including DIR, and is suitable for highly parallelized computing. However, traditional general purpose computation on the GPU is limited because of the constraints of the available programming platforms. In addition, compared to CPU platforms, the GPU currently has less dedicated processor memory, which can limit the useful working data set for parallelized processing. We present an implementation of the demons algorithm using the NVIDIA 8800 GTX GPU and the new CUDA programming language. The GPU performance will be compared with single-threading and multithreading CPU implementations on an Intel dual core 2.4 GHz CPU using the C programming language. CUDA provides a C-like language programming interface, and allows for direct access to the highly parallel compute units in the GPU. Comparisons for volumetric clinical lung images acquired using 4DCT were carried out. Computation times for 100 iterations in the range of 1.8-13.5 s were observed for the GPU with image sizes ranging from 2.0 × 10^6 to 14.2 × 10^6 pixels. The GPU registration was 55-61 times faster than the CPU for the single-threading implementation, and 34-39 times faster for the multithreading implementation. For CPU-based computing, the computational time generally has a linear dependence on image size for medical imaging data. Computational efficiency is
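
    As a point of reference, the classic Thirion demons update that such GPU ports parallelize can be written compactly in NumPy; a minimal 2D sketch (not the paper's CUDA implementation) follows.

```python
# Single-threaded 2D demons registration sketch: per iteration, compute the
# demons force from the fixed-image gradient and smooth the displacement
# field. The per-voxel arithmetic is what the GPU parallelizes.
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def demons_2d(fixed, moving, n_iter=100, sigma=1.0):
    u = np.zeros((2,) + fixed.shape)            # displacement field (y, x)
    gy, gx = np.gradient(fixed)
    grid = np.indices(fixed.shape).astype(float)
    for _ in range(n_iter):
        warped = map_coordinates(moving, grid + u, order=1, mode="nearest")
        diff = warped - fixed
        denom = gx**2 + gy**2 + diff**2
        denom[denom == 0] = 1.0                 # avoid division by zero
        u[0] -= diff * gy / denom               # demons force, y component
        u[1] -= diff * gx / denom               # demons force, x component
        u = gaussian_filter(u, sigma=(0, sigma, sigma))  # regularize field
    return u
```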

  20. A low-complexity 2-point step size gradient projection method with selective function evaluations for smoothed total variation based CBCT reconstructions.

    PubMed

    Song, Bongyong; Park, Justin C; Song, William Y

    2014-11-07

    The Barzilai-Borwein (BB) 2-point step size gradient method is receiving attention for accelerating Total Variation (TV) based CBCT reconstructions. In order to become truly viable for clinical applications, however, its convergence property needs to be properly addressed. We propose a novel fast-converging gradient projection BB method that requires 'at most one function evaluation' in each iterative step. This Selective Function Evaluation method, referred to as GPBB-SFE in this paper, exhibits the desired convergence property when it is combined with a 'smoothed TV' or any other differentiable prior. This way, the proposed GPBB-SFE algorithm offers fast and guaranteed convergence to the desired 3D CBCT image with minimal computational complexity. We first applied this algorithm to a Shepp-Logan numerical phantom. We then applied it to a CatPhan 600 physical phantom (The Phantom Laboratory, Salem, NY) and a clinically-treated head-and-neck patient, both acquired from the TrueBeam™ system (Varian Medical Systems, Palo Alto, CA). Furthermore, we accelerated the reconstruction by implementing the algorithm on an NVIDIA GTX 480 GPU card. We first compared GPBB-SFE with three recently proposed BB-based CBCT reconstruction methods available in the literature using the Shepp-Logan numerical phantom with 40 projections. We found that GPBB-SFE shows either faster convergence speed/time or superior convergence properties compared to existing BB-based algorithms. With the CatPhan 600 physical phantom, the GPBB-SFE algorithm requires only 3 function evaluations in 30 iterations and reconstructs an image of standard, 364-projection FDK reconstruction quality using only 60 projections. We then applied the algorithm to a clinically-treated head-and-neck patient. It was observed that the GPBB-SFE algorithm requires only 18 function evaluations in 30 iterations. Compared with the FDK algorithm with 364 projections, the GPBB-SFE algorithm produces visibly equivalent quality CBCT image for

  1. Parallel Geospatial Data Management for Multi-Scale Environmental Data Analysis on GPUs

    NASA Astrophysics Data System (ADS)

    Wang, D.; Zhang, J.; Wei, Y.

    2013-12-01

    , we have developed a GPU-based data compression technique by reusing our previous work on Bitplane Quadtree (or BPQ-Tree) based indexing of binary bitmaps. Results have shown that our GPU-based parallel Zonal Statistics technique on 3000+ US counties over 20+ billion NASA SRTM 30-meter resolution Digital Elevation Model (DEM) raster cells has achieved impressive end-to-end runtimes: 101 seconds and 46 seconds on a low-end workstation equipped with an Nvidia GTX Titan GPU using cold and hot cache, respectively; and 60-70 seconds using a single OLCF TITAN computing node and 10-15 seconds using 8 nodes. Our experimental results clearly show the potential of using high-end computing facilities for large-scale geospatial processing.
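
    The zonal-statistics kernel at the heart of this work reduces raster cells by zone ID; a minimal CPU-side NumPy sketch of that reduction (with np.bincount standing in for the GPU parallel reduction) is shown below.

```python
# Zonal statistics: aggregate raster values (e.g., SRTM elevations) by zone
# (e.g., county) using a rasterized zone-ID grid.
import numpy as np

def zonal_stats(zone_ids, values, n_zones):
    ids = zone_ids.ravel()
    vals = values.ravel().astype(float)
    count = np.bincount(ids, minlength=n_zones)
    total = np.bincount(ids, weights=vals, minlength=n_zones)
    mean = np.divide(total, count, out=np.zeros(n_zones), where=count > 0)
    return count, total, mean

# Toy example: 3 zones over a 4x4 elevation raster.
zones = np.array([[0, 0, 1, 1], [0, 0, 1, 1], [2, 2, 1, 1], [2, 2, 2, 2]])
elev = np.arange(16).reshape(4, 4)
print(zonal_stats(zones, elev, 3))
```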

  2. Axillary Lymph Node Evaluation Utilizing Convolutional Neural Networks Using MRI Dataset.

    PubMed

    Ha, Richard; Chang, Peter; Karcich, Jenika; Mutasa, Simukayi; Fardanesh, Reza; Wynn, Ralph T; Liu, Michael Z; Jambawalikar, Sachin

    2018-04-25

    The aim of this study is to evaluate the role of convolutional neural networks (CNNs) in predicting axillary lymph node metastasis using a breast MRI dataset. An institutional review board (IRB)-approved retrospective review of our database from 1/2013 to 6/2016 identified 275 axillary lymph nodes for this study. 133 biopsy-proven metastatic axillary lymph nodes and 142 negative control lymph nodes were identified based on benign biopsies (100) and on healthy MRI screening patients (42) with at least 3 years of negative follow-up. For each breast MRI, the axillary lymph node was identified on the first T1 post-contrast dynamic images and underwent 3D segmentation using the open-source software platform 3D Slicer. A 32 × 32 patch was then extracted from the center slice of the segmented tumor data. A CNN was designed for lymph node prediction based on each of these cropped images. The CNN consisted of seven convolutional layers and max-pooling layers with 50% dropout applied in the linear layer. In addition, data augmentation and L2 regularization were performed to limit overfitting. Training was implemented using the Adam optimizer, an algorithm for first-order gradient-based optimization of stochastic objective functions based on adaptive estimates of lower-order moments. Code for this study was written in Python using the TensorFlow module (1.0.0). Experiments and CNN training were done on a Linux workstation with an NVIDIA GTX 1070 Pascal GPU. Two-class axillary lymph node metastasis prediction models were evaluated. For each lymph node, a final softmax score threshold of 0.5 was used for classification. Based on this, the CNN achieved a mean five-fold cross-validation accuracy of 84.3%. It is feasible for current deep CNN architectures to be trained to predict the likelihood of axillary lymph node metastasis. A larger dataset will likely improve our prediction model and can potentially offer a non-invasive alternative to core needle biopsy and even sentinel lymph node
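
    A Keras sketch in the spirit of the described architecture follows: seven convolutional layers with interleaved max pooling, 50% dropout before the dense layer, a two-class softmax, and the Adam optimizer. Filter counts are illustrative guesses, and the study's data augmentation and L2 regularization are omitted.

```python
# Small patch classifier for 32x32 lymph-node crops (illustrative sketch).
import tensorflow as tf
from tensorflow.keras import layers, models

def build_node_classifier(input_shape=(32, 32, 1)):
    m = models.Sequential()
    for i, filters in enumerate([32, 32, 64, 64, 128, 128, 256]):
        if i == 0:
            m.add(layers.Conv2D(filters, 3, padding="same",
                                activation="relu", input_shape=input_shape))
        else:
            m.add(layers.Conv2D(filters, 3, padding="same",
                                activation="relu"))
        if i % 2 == 1:                          # pool after every second conv
            m.add(layers.MaxPooling2D(2))
    m.add(layers.Flatten())
    m.add(layers.Dropout(0.5))                  # 50% dropout at the FC layer
    m.add(layers.Dense(2, activation="softmax"))  # metastatic vs. benign
    m.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
    return m
```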

  3. A fast GPU-based Monte Carlo simulation of proton transport with detailed modeling of nonelastic interactions.

    PubMed

    Wan Chan Tseung, H; Ma, J; Beltran, C

    2015-06-01

    Very fast Monte Carlo (MC) simulations of proton transport have been implemented recently on graphics processing units (GPUs). However, these MCs usually use simplified models for nonelastic proton-nucleus interactions. Our primary goal is to build a GPU-based proton transport MC with detailed modeling of elastic and nonelastic proton-nucleus collisions. Using the CUDA framework, the authors implemented GPU kernels for the following tasks: (1) simulation of beam spots from our possible scanning nozzle configurations, (2) proton propagation through CT geometry, taking into account nuclear elastic scattering, multiple scattering, and energy loss straggling, (3) modeling of the intranuclear cascade stage of nonelastic interactions when they occur, (4) simulation of nuclear evaporation, and (5) statistical error estimates on the dose. To validate our MC, the authors performed (1) secondary particle yield calculations in proton collisions with therapeutically relevant nuclei, (2) dose calculations in homogeneous phantoms, (3) recalculations of complex head and neck treatment plans from a commercially available treatment planning system, and compared with GEANT4.9.6p2/TOPAS. Yields, energy, and angular distributions of secondaries from nonelastic collisions on various nuclei are in good agreement with the GEANT4.9.6p2 Bertini and Binary cascade models. The 3D-gamma pass rate at 2%-2 mm for treatment plan simulations is typically 98%. The net computational time on an NVIDIA GTX680 card, including all CPU-GPU data transfers, is ∼20 s for 1 × 10^7 proton histories. Our GPU-based MC is the first of its kind to include a detailed nuclear model to handle nonelastic interactions of protons with any nucleus. Dosimetric calculations are in very good agreement with GEANT4.9.6p2/TOPAS. Our MC is being integrated into a framework to perform fast routine clinical QA of pencil-beam based treatment plans, and is being used as the dose calculation engine in a clinically
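
    To make the structure of such a transport kernel concrete, here is a deliberately toy 1D Python sketch: each history steps through a density grid with a crude 1/E stopping-power placeholder plus Gaussian energy-loss straggling. It reflects none of the paper's nuclear physics; it only shows the per-history loop that a GPU would assign to one thread.

```python
# Toy 1D proton transport with energy-loss straggling (not real physics).
import numpy as np

def transport(n_protons, e0_mev, density, dx_cm=0.1, seed=0):
    rng = np.random.default_rng(seed)
    dose = np.zeros_like(density, dtype=float)
    for _ in range(n_protons):                  # one GPU thread per history
        e = e0_mev
        for i, rho in enumerate(density):
            de = rho * dx_cm * 20.0 / max(e, 1.0)   # toy stopping power ~ 1/E
            de += rng.normal(0.0, 0.05 * de)        # energy-loss straggling
            de = min(max(de, 0.0), e)
            dose[i] += de                           # deposit energy in voxel
            e -= de
            if e <= 0.0:
                break
    return dose

dose = transport(1000, 150.0, np.ones(300))     # homogeneous water-like slab
```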

  4. Multi-GPU and multi-CPU accelerated FDTD scheme for vibroacoustic applications

    NASA Astrophysics Data System (ADS)

    Francés, J.; Otero, B.; Bleda, S.; Gallego, S.; Neipp, C.; Márquez, A.; Beléndez, A.

    2015-06-01

    The Finite-Difference Time-Domain (FDTD) method is applied to the analysis of vibroacoustic problems and to the study of the propagation of longitudinal and transversal waves in stratified media. The potential of the scheme and the relevance of each acceleration strategy for massive FDTD computations are demonstrated in this work. In this paper, we propose two new specific implementations of the two-dimensional FDTD scheme using multi-CPU and multi-GPU, respectively. In the first implementation, the open-source message passing interface Open MPI has been included in order to massively exploit the resources of a biprocessor station with two Intel Xeon processors. Moreover, in the CPU code version, the streaming SIMD extensions (SSE) and the advanced vector extensions (AVX) have been included, with shared-memory approaches that take advantage of multi-core platforms. The second implementation, the multi-GPU code version, is based on peer-to-peer communications available in CUDA on two GPUs (NVIDIA GTX 670). Subsequently, this paper presents an accurate analysis of the influence of the different code versions, including shared-memory approaches, vector instructions, and multi-processors (both CPU and GPU), and compares them in order to delimit the degree of improvement from using distributed solutions based on multi-CPU and multi-GPU. The performance of both approaches was analysed, and it has been demonstrated that the addition of shared-memory schemes to CPU computing substantially improves the performance of vector instructions, enlarging the simulation sizes that use the cache memory of CPUs efficiently. In this case, GPU computing is roughly twice as fast as the fine-tuned CPU version for both one and two nodes. However, for massive computations, explicit vector instructions do not pay off, since the memory bandwidth is the limiting factor and the performance tends to be the same as the sequential version
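
    For orientation, the inner update that all of these acceleration strategies target looks like the following minimal 2D acoustic (pressure-velocity) FDTD step in NumPy; the vibroacoustic scheme with longitudinal and transversal waves is more involved, so this is only a structural sketch.

```python
# Minimal 2D acoustic FDTD leapfrog step (structural sketch).
import numpy as np

def fdtd_step(p, vx, vy, c=343.0, rho=1.2, dt=1e-5, dx=1e-2):
    # Update particle velocities from the pressure gradient.
    vx[:, :-1] -= dt / (rho * dx) * (p[:, 1:] - p[:, :-1])
    vy[:-1, :] -= dt / (rho * dx) * (p[1:, :] - p[:-1, :])
    # Update pressure from the velocity divergence.
    p[:, 1:] -= rho * c**2 * dt / dx * (vx[:, 1:] - vx[:, :-1])
    p[1:, :] -= rho * c**2 * dt / dx * (vy[1:, :] - vy[:-1, :])
    return p, vx, vy

# Usage: impulse source in the middle of a small grid.
ny, nx = 200, 200
p, vx, vy = np.zeros((ny, nx)), np.zeros((ny, nx)), np.zeros((ny, nx))
p[ny // 2, nx // 2] = 1.0
for _ in range(100):
    p, vx, vy = fdtd_step(p, vx, vy)
```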

  5. GASPRNG: GPU accelerated scalable parallel random number generator library

    NASA Astrophysics Data System (ADS)

    Gao, Shuang; Peterson, Gregory D.

    2013-04-01

    workstation with NVIDIA GPU (tested on Fermi GTX480, Tesla C1060, Tesla M2070). Operating system: Linux with CUDA version 4.0 or later. Should also run on MacOS, Windows, or UNIX. Has the code been vectorized or parallelized?: Yes. Parallelized using MPI directives. RAM: 512 MB–732 MB (main memory on host CPU, depending on the data type of random numbers) / 512 MB (GPU global memory). Classification: 4.13, 6.5. Nature of problem: Many computational science applications are able to consume large numbers of random numbers. For example, Monte Carlo simulations can consume limitless random numbers as long as computing resources support them. Moreover, parallel computational science applications require independent streams of random numbers to attain statistically significant results. The SPRNG library provides this capability, but at a significant computational cost. The GASPRNG library presented here accelerates the generators of independent streams of random numbers using graphics processing units (GPUs). Solution method: Multiple copies of random number generators in GPUs allow a computational science application to consume large numbers of random numbers from independent, parallel streams. GASPRNG is a random number generator library that allows a computational science application to employ multiple copies of random number generators to boost performance. Users can interface GASPRNG with software code executing on microprocessors and/or GPUs. Running time: The tests provided take a few minutes to run.
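
    The core service GASPRNG provides, many statistically independent streams with one per parallel worker, can be illustrated with NumPy's seed-spawning facilities; this is a stand-in for the SPRNG/GASPRNG generator families, not their API.

```python
# Independent parallel random-number streams via seed spawning.
import numpy as np

def make_streams(n_workers, root_seed=12345):
    seqs = np.random.SeedSequence(root_seed).spawn(n_workers)
    return [np.random.Generator(np.random.PCG64(s)) for s in seqs]

# Each MPI rank / GPU block would own one stream; draws never overlap.
streams = make_streams(4)
estimates = [rng.random(100_000).mean() for rng in streams]  # toy Monte Carlo
```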

  6. Evaluation of the root canal shaping ability of two rotary nickel-titanium systems.

    PubMed

    Al-Manei, K K; Al-Hadlaq, S M S

    2014-10-01

    The aim was to investigate the canal shaping abilities of the twisted file (TF) and GT series X file (GTX) systems. Sixty mesial root canals of mandibular molars with curvatures of 15-50° were divided randomly into two groups of 30 canals each. The teeth were sectioned horizontally at 3, 6 and 9 mm from the apex. Root canals were prepared with TF and GTX files, respectively, and the shaping abilities of the systems were evaluated at three levels (coronal, middle and apical) based on the comparison of pre- and post-instrumentation photographs using AutoCAD software. Preparation time was also assessed. Data from the two groups were compared statistically using the Student's t-test. There was no significant difference between the rotary systems in terms of change in root canal cross-sectional area, root canal transportation, centring ability or minimum dentine thickness. Remaining dentine thickness at the coronal and middle levels was similar in the TF and GTX groups, but GTX instruments left significantly less dentine than TF instruments on the mesial aspects of root canals at the apical level. Root canal preparation with TF instruments required significantly less time than with GTX instruments. The TF and GTX NiTi rotary instruments showed similar shaping abilities, but root canal preparation was more rapid with the TF than with the GTX system. © 2014 International Endodontic Journal. Published by John Wiley & Sons Ltd.

  7. Image-guided thoracic surgery in the hybrid operation room.

    PubMed

    Ujiie, Hideki; Effat, Andrew; Yasufuku, Kazuhiro

    2017-01-01

    There has been an increase in the use of image-guided technology to facilitate minimally invasive therapy. The next generation of minimally invasive therapy is focused on the advancement and translation of novel image-guided technologies in therapeutic interventions, including surgery, interventional pulmonology, radiation therapy, and interventional laser therapy. To establish the efficacy of different minimally invasive therapies, we have developed a hybrid operating room, known as the guided therapeutics operating room (GTx OR), at the Toronto General Hospital. The GTx OR is equipped with multi-modality image-guidance systems, featuring a dual-source dual-energy computed tomography (CT) scanner, robotic cone-beam CT (CBCT)/fluoroscopy, a high-performance endobronchial ultrasound system, an endoscopic surgery system, a near-infrared (NIR) fluorescence imaging system, and navigation tracking systems. The novel multimodality image-guidance systems allow physicians to quickly and accurately image patients while they are on the operating table. This yields improved outcomes, since physicians are able to use image guidance during their procedures and carry out innovative multi-modality therapeutics. Multiple preclinical translational studies pertaining to innovative minimally invasive technology are being conducted in our guided therapeutics laboratory (GTx Lab). The GTx Lab is equipped with similar technology and multimodality image-guidance systems as the GTx OR, and acts as an appropriate platform for translation of research into human clinical trials. Through the GTx Lab, we are able to perform basic research, such as the development of image-guided technologies, preclinical model testing, and preclinical imaging, and then translate that research into the GTx OR. This OR allows for the utilization of new technologies in cancer therapy, including molecular imaging and other innovative imaging modalities, and therefore enables a better quality of life for

  8. Experimental and computational studies on molecularly imprinted solid-phase extraction for gonyautoxins 2,3 from dinoflagellate Alexandrium minutum.

    PubMed

    Lian, Ziru; Li, Hai-Bei; Wang, Jiangtao

    2016-08-01

    An innovative and effective extraction procedure based on molecularly imprinted solid-phase extraction (MISPE) was developed for the isolation of gonyautoxins 2,3 (GTX2,3) from an Alexandrium minutum sample. Molecularly imprinted polymer microspheres were prepared by suspension polymerization and were employed as sorbents for the solid-phase extraction of GTX2,3. An off-line MISPE protocol was optimized. Subsequently, the extract samples from A. minutum were analyzed. The results showed that the interfering matrices in the extract were effectively cleaned up by the MISPE procedure. This enabled the direct extraction of GTX2,3 from A. minutum samples with an extraction efficiency as high as 83%, notably without any need for a cleanup step prior to the extraction. Furthermore, the computational approach also provided direct evidence of the highly selective isolation of GTX2,3 from the microalgal extracts.

  9. A mitral annulus tracking approach for navigation of off-pump beating heart mitral valve repair.

    PubMed

    Li, Feng P; Rajchl, Martin; Moore, John; Peters, Terry M

    2015-01-01

    porcine data, the authors compared the tracked MVA to a manually segmented MVA. The overall accuracy is 2.37 ± 1.67 mm for single-plane images and 2.35 ± 1.55 mm for biplane images. The interoperator variation in manual segmentation was 2.32 ± 1.24 mm for single-plane images and 1.73 ± 1.18 mm for biplane images. The computational efficiency of the algorithm on a desktop computer with an Intel(®) Xeon(®) CPU @3.47 GHz and an NVIDIA GeForce 690 graphics card is such that the time required for registering four MVA points was about 60 ms. The authors developed a rapid MVA tracking algorithm for use in the guidance of off-pump beating-heart transapical mitral valve repair. This approach uses 2D biplane TEE images and was tested on a dynamic heart phantom and interventional porcine image data. Results regarding the accuracy and efficiency of the authors' MVA tracking algorithm are promising, and fulfill the requirements for surgical navigation.

  10. Supporting Real-Time Computer Vision Workloads using OpenVX on Multicore+GPU Platforms

    DTIC Science & Technology

    2015-05-01

    a registered trademark of the NVIDIA Corporation. ...from NVIDIA, we adapted an alpha version of an NVIDIA OpenVX implementation called VisionWorks® [3] to run atop PGMRT (a graph-based middleware...time support to an OpenVX implementation by NVIDIA called VisionWorks. Our modifications were applied to an alpha version of VisionWorks. This alpha

  11. Optimization of hydrophilic interaction liquid chromatography/mass spectrometry and development of solid-phase extraction for the determination of paralytic shellfish poisoning toxins.

    PubMed

    Turrell, Elizabeth; Stobo, Lesley; Lacaze, Jean-Pierre; Piletsky, Sergey; Piletska, Elena

    2008-01-01

    The combination of hydrophilic interaction liquid chromatography (HILIC) and liquid chromatography/mass spectrometry (LC/MS) for the determination of paralytic shellfish poisoning (PSP) toxins has been proposed for use in routine monitoring of shellfish. In this study, methods for the detection of multiple PSP toxins [saxitoxin (STX), neosaxitoxin (NEO), decarbamoyl saxitoxin (dcSTX), decarbamoyl neosaxitoxin (dcNEO), gonyautoxins 1-5 (GTX1, GTX2, GTX3, GTX4, GTX5), decarbamoyl gonyautoxins (dcGTX2 and dcGTX3), and the N-sulfocarbamoyl C toxins (C1 and C2)] were optimized using single (MS) and triple quadrupole (MS/MS) instruments. Chromatographic separation of the toxins was achieved by using a TSK-gel Amide-80 analytical column, although superior chromatography was observed through application of a ZIC-HILIC column. Preparative procedures used to clean up shellfish extracts and concentrate PSP toxins prior to analysis were investigated. The capacity of computationally designed polymeric (CDP) materials and HILIC solid-phase extraction (SPE) cartridges to retain highly polar PSP toxins was explored. Three CDP materials and 2 HILIC cartridges were assessed for the extraction of PSP toxins from aqueous solution. Screening of the CDPs showed that all tested polymers adsorbed PSP toxins. A variety of elution procedures were examined, with dilute 0.01% acetic acid providing optimum recovery from a CDP based on 2-(trifluoromethyl)acrylic acid as the monomer. ZIC-HILIC SPE cartridges were superior to the PolyLC equivalent, with recoveries ranging from 70 to 112% (ZIC-HILIC) and 0 to 90% (PolyLC) depending on the PSP toxin. It is proposed that optimized SPE and HILIC-MS methods can be applied for the quantitative determination of PSP toxins in shellfish.

  12. Techniques for Mapping Synthetic Aperture Radar Processing Algorithms to Multi-GPU Clusters

    DTIC Science & Technology

    2012-12-01

    Experimental results were generated with 10 nVidia Tesla C2050 GPUs having maximum throughput of 972 Gflop/s. Our approach scales well for output

  13. Bayesian Methods and Confidence Intervals for Automatic Target Recognition of SAR Canonical Shapes

    DTIC Science & Technology

    2014-03-27

    and DirectX [22]. The CUDA platform was developed by the NVIDIA Corporation to allow programmers access to the computational capabilities of the...were used for the intense repetitive computations. Developing CUDA software requires writing code for specialized compilers provided by NVIDIA and

  14. Latent uncertainties of the precalculated track Monte Carlo method

    SciTech Connect

    Renaud, Marc-André; Seuntjens, Jan; Roberge, David

    20% of the maximum dose. In proton calculations, a small (≤1 mm) distance-to-agreement error was observed at the Bragg peak. Latent uncertainty was characterized for electrons and found to follow a Poisson distribution with the number of unique tracks per energy. A track bank of 12 energies and 60000 unique tracks per pregenerated energy in water had a size of 2.4 GB and achieved a latent uncertainty of approximately 1% at an optimal efficiency gain over DOSXYZnrc. Larger track banks produced a lower latent uncertainty at the cost of increased memory consumption. Using an NVIDIA GTX 590, efficiency analysis showed an 807× efficiency increase over DOSXYZnrc for 16 MeV electrons in water and 508× for 16 MeV electrons in bone. Conclusions: The PMC method can calculate dose distributions for electrons and protons to a statistical uncertainty of 1% with a large efficiency gain over conventional MC codes. Before performing clinical dose calculations, models to calculate dose contributions from uncharged particles must be implemented. Following the successful implementation of these models, the PMC method will be evaluated as a candidate for inverse planning of modulated electron radiation therapy and scanned proton beams.

  15. Latent uncertainties of the precalculated track Monte Carlo method.

    PubMed

    Renaud, Marc-André; Roberge, David; Seuntjens, Jan

    2015-01-01

    calculations, a small (≤ 1 mm) distance-to-agreement error was observed at the Bragg peak. Latent uncertainty was characterized for electrons and found to follow a Poisson distribution with the number of unique tracks per energy. A track bank of 12 energies and 60000 unique tracks per pregenerated energy in water had a size of 2.4 GB and achieved a latent uncertainty of approximately 1% at an optimal efficiency gain over DOSXYZnrc. Larger track banks produced a lower latent uncertainty at the cost of increased memory consumption. Using an NVIDIA GTX 590, efficiency analysis showed an 807× efficiency increase over DOSXYZnrc for 16 MeV electrons in water and 508× for 16 MeV electrons in bone. The PMC method can calculate dose distributions for electrons and protons to a statistical uncertainty of 1% with a large efficiency gain over conventional MC codes. Before performing clinical dose calculations, models to calculate dose contributions from uncharged particles must be implemented. Following the successful implementation of these models, the PMC method will be evaluated as a candidate for inverse planning of modulated electron radiation therapy and scanned proton beams.
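
    A back-of-envelope reading of the reported Poisson behaviour is that the relative latent uncertainty falls roughly as one over the square root of the number of unique pregenerated tracks; the sketch below encodes that scaling, anchored to the paper's 1% at 60000 tracks operating point (the scaling itself is our inference, not a result from the paper).

```python
# Latent (track-bank) uncertainty vs. number of unique tracks per energy,
# assuming Poisson-style 1/sqrt(N) scaling anchored at 1% for 60000 tracks.
import numpy as np

def latent_uncertainty(n_tracks, sigma_ref=0.01, n_ref=60_000):
    return sigma_ref * np.sqrt(n_ref / np.asarray(n_tracks, dtype=float))

print(latent_uncertainty([15_000, 60_000, 240_000]))  # -> [0.02, 0.01, 0.005]
```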

  16. Oxidative Stress Mechanisms Do Not Discriminate between Genotoxic and Nongenotoxic Liver Carcinogens.

    PubMed

    Deferme, Lize; Wolters, Jarno; Claessen, Sandra; Briedé, Jacco; Kleinjans, Jos

    2015-08-17

    It is widely accepted that in chemical carcinogenesis different modes of action exist, e.g., genotoxic (GTX) versus nongenotoxic (NGTX) carcinogenesis. In this context, it has been suggested that oxidative stress response pathways are typical of NGTX carcinogenesis. To evaluate this, we examined oxidative stress-related changes in gene expression, cell cycle distribution, and (oxidative) DNA damage in human hepatoma cells (HepG2) exposed to GTX, NGTX, and noncarcinogens, at multiple time points (4-8-24-48-72 h). Two GTX carcinogens (azathioprine (AZA) and furan) and two NGTX carcinogens (tetradecanoyl-phorbol-acetate (TPA) and tetrachloroethylene (TCE)) were selected, as well as two noncarcinogens (diazinon (DZN) and d-mannitol (Dman)), where, per class, one compound was deemed to induce oxidative stress and the other not. The oxidative stressors AZA, TPA, and DZN induced a 10-fold higher number of gene expression changes over time compared to furan, TCE, or Dman treatment. Genes commonly expressed among AZA, TPA, and DZN were specifically involved in oxidative stress, DNA damage, and immune responses. However, differences in gene expression between GTX and NGTX carcinogens did not correlate with oxidative stress or DNA damage but could instead be assigned to compound-specific characteristics. This conclusion was underlined by results from functional readouts on ROS formation and (oxidative) DNA damage. Therefore, oxidative stress may represent the underlying cause for increased risk of liver toxicity and even carcinogenesis; however, it does not discriminate between GTX and NGTX carcinogens.

  17. Selective isolation of gonyautoxins 1,4 from the dinoflagellate Alexandrium minutum based on molecularly imprinted solid-phase extraction.

    PubMed

    Lian, Ziru; Wang, Jiangtao

    2017-09-15

    Gonyautoxins 1,4 (GTX1,4) from Alexandrium minutum samples were selectively isolated and specifically recognized by an innovative and effective extraction procedure based on molecular imprinting technology. Novel molecularly imprinted polymer microspheres (MIPMs) were prepared by a double-templated imprinting strategy using caffeine and pentoxifylline as dummy templates. The synthesized polymers displayed good affinity to GTX1,4 and were applied as sorbents. Further, an off-line molecularly imprinted solid-phase extraction (MISPE) protocol was optimized, and an effective approach based on MISPE coupled with HPLC-FLD was developed for the selective isolation of GTX1,4 from the cultured A. minutum samples. The separation method showed good extraction efficiency (73.2-81.5%) for GTX1,4, and efficient removal of interference matrices was also achieved after the MISPE process for the microalgal samples. The outcome demonstrated the superiority and great potential of the MISPE procedure for direct separation of GTX1,4 from marine microalgal extracts. Copyright © 2017. Published by Elsevier Ltd.

  18. mm_par2.0: An object-oriented molecular dynamics simulation program parallelized using a hierarchical scheme with MPI and OPENMP

    NASA Astrophysics Data System (ADS)

    Oh, Kwang Jin; Kang, Ji Hoon; Myung, Hun Joo

    2012-02-01

    decomposition is not popular due to its poor scalability. On the other hand, the domain decomposition scheme is better for scalability. It still has a limitation in utilizing a large number of cores on recent petascale computers due to the requirement that the domain size be larger than the potential cutoff distance. To go beyond this limitation, a hierarchical parallelization scheme has been adopted in this new version and implemented using MPI [7] and OPENMP [8]. Summary of revisions: (1) Object-oriented programming has been used. (2) A hierarchical parallelization scheme has been adopted. (3) The SPME routine has been fully parallelized with parallel 3D FFT using a volumetric decomposition scheme [9]. K.J.O. thanks Mr. Seung Min Lee for useful discussion on programming and debugging. Running time: Running time depends on system size and methods used. For a test system containing a protein (PDB id: 5DHFR) with the CHARMM22 force field [10] and 7023 TIP3P [11] waters in a simulation box of dimensions 62.23 Å×62.23 Å×62.23 Å, the benchmark results are given in Fig. 1. Here the potential cutoff distance was set to 12 Å and the switching function was applied from 10 Å for the force calculation in real space. For the SPME [12] calculation, K1, K2, and K3 were set to 64 and the interpolation order was set to 4. To do the fast Fourier transform, we used the Intel MKL library. All bonds including hydrogen atoms were constrained using SHAKE/RATTLE algorithms [13,14]. The code was compiled using Intel compiler version 11.1 and mvapich2 version 1.5. Fig. 2 shows performance gains from using the CUDA-enabled version [15] of mm_par for the 5DHFR simulation in water on an Intel Core2Quad 2.83 GHz and a GeForce GTX 580. Even though mm_par2.0 is not yet ported to GPU, these performance data are useful for estimating mm_par2.0 performance on GPU. [Figure captions: timing results for 1000 MD steps, where 1, 2, 4, and 8 denote the number of OPENMP threads; timing results for 1000 MD steps from double-precision simulation on CPU

  19. A Frequency Agile, Self-Adaptive Serial Link on Xilinx FPGAs

    NASA Astrophysics Data System (ADS)

    Aloisio, A.; Giordano, R.; Izzo, V.; Perrella, S.

    2015-06-01

    In this paper, we focus on the GTX transceiver modules of Xilinx Kintex 7 field-programmable gate arrays (FPGAs), which provide high bandwidth, low jitter on the recovered clock, and an equalization system on the transmitter and the receiver. We present a frequency agile, self-adaptive serial link. The link takes care of the reconfiguration of the GTX parameters in order to fully benefit from the available link bandwidth by setting the highest line rate. It is designed around an FPGA-embedded microprocessor, which drives the programmable ports of the GTX in order to control the quality of the received data and to easily calculate the bit-error rate at each sampling point of the eye diagram. We present the self-adaptive link project, a description of the test system, and the main results.
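
    The adaptation policy can be pictured as a simple loop run by the embedded microprocessor; in the hypothetical sketch below, set_line_rate and scan_eye model the GTX reconfiguration and eye-scan drivers (they are assumptions for illustration, not the Xilinx API).

```python
# Conceptual line-rate adaptation: try rates from fastest to slowest and
# keep the first one whose eye opening meets the BER target.
def adapt_link(rates_gbps, set_line_rate, scan_eye, ber_target=1e-12):
    for rate in sorted(rates_gbps, reverse=True):
        set_line_rate(rate)                 # reprogram the GTX ports
        eye = scan_eye()                    # BER at each eye sampling point
        if min(eye.values()) <= ber_target: # eye center is open enough
            return rate
    raise RuntimeError("no workable line rate found")
```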

  20. Simultaneous Range-Velocity Processing and SNR Analysis of AFIT’s Random Noise Radar

    DTIC Science & Technology

    2012-03-22

    reducing the overall processing time. Two computers, equipped with NVIDIA® GPUs, were used to process the collected data. The specifications for each...gather the results back to the CPU. Another company, AccelerEyes®, has developed a product called Jacket® that claims to be better than the parallel... [Specifications: computer 1 has 4 processing cores at 3.33 GHz, 48 GB of installed memory, and an NVIDIA Tesla 1060 GPU; computer 2 has 8 processing cores at 3.07 GHz, 48 GB of installed memory, and an NVIDIA Tesla C2070 GPU]

  1. Challenges and Opportunities in Propulsion Simulations

    DTIC Science & Technology

    2015-09-24

    leverage Nvidia GPU accelerators • Release common computational infrastructure as Distro A for collaboration • Add physics modules as either... Titan vs. Summit: interconnect, Gemini (6.4 GB/s) vs. dual-rail EDR-IB (23 GB/s); interconnect topology, 3D torus vs. non-blocking fat tree; processors, AMD Opteron™ plus NVIDIA Kepler™ vs. IBM POWER9 plus NVIDIA Volta™; file system, 32 PB at 1 TB/s (Lustre®) vs. 120 PB at 1 TB/s (GPFS™); peak power consumption, 9 MW vs. 10 MW. Source: R

  2. Removal of Paralytic Shellfish Toxins by Probiotic Lactic Acid Bacteria

    PubMed Central

    Vasama, Mari; Kumar, Himanshu; Salminen, Seppo; Haskard, Carolyn A.

    2014-01-01

    Paralytic shellfish toxins (PSTs) are non-protein neurotoxins produced by saltwater dinoflagellates and freshwater cyanobacteria. The ability of Lactobacillus rhamnosus strains GG and LC-705 (in viable and non-viable forms) to remove PSTs (saxitoxin (STX), neosaxitoxin (neoSTX), gonyautoxins 2 and 3 (GTX2/3), C-toxins 1 and 2 (C1/2)) from neutral and acidic solution (pH 7.3 and 2) was examined using HPLC. Binding decreased in the order of STX ~ neoSTX > C2 > GTX3 > GTX2 > C1. Removal of STX and neoSTX (77%–97.2%) was significantly greater than removal of GTX3 and C2 (33.3%–49.7%). There were no significant differences in toxin removal capacity between viable and non-viable forms of lactobacilli, which suggests that binding, rather than metabolism, is the mechanism of toxin removal. In general, binding was not affected by the presence of other organic molecules in solution. Importantly, this is the first study to demonstrate the ability of specific probiotic lactic acid bacteria to remove PSTs, particularly the most toxic PST, STX, from solution. Further, these results warrant thorough screening and assessment of safe and beneficial microbes for their usefulness in the seafood and water industries and their effectiveness in vivo. PMID:25046082

  3. 76 FR 384 - Certain Semiconductor Chips and Products Containing Same; Notice of Investigation

    Federal Register 2010, 2011, 2012, 2013, 2014

    2011-01-04

    ..., Dusing Road 1, Hsinchu Science Park, Hsin-Chu, Taiwan 30078. nVidia Corporation, 2701 San Tomas... respondent, to find the facts to be as alleged in the complaint and this notice and to enter an initial...

  4. 75 FR 25294 - Notice Pursuant to the National Cooperative Research and Production Act of 1993-DVD Copy Control...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2010-05-07

    ..., Baarlo Noord Limburg, THE NETHERLANDS; MIT Technology Co., Ltd., Dongguan, Guangdong, PEOPLE'S REPUBLIC... media b.v., Tilburg, THE NETHERLANDS; Mattel Inc., El Segundo, CA; nVidia Corporation, Santa Clara, CA...

  5. Modeling & Analysis of Multicore Architectures for Embedded SIGINT Applications

    DTIC Science & Technology

    2015-03-01

    NVIDIA Kepler K20 [7][8]: 2496 cores, 706 MHz, 225 W, 3520 GFLOPS, 15.6 GFLOPS/W; Intel Xeon Phi 5110P [9]: 60 cores, 1050 MHz, 225 W, 1010 GFLOPS, 4.5 GFLOPS/W; Adapteva Epiphany [10]: 16–4K cores, 800 MHz, 0.270 W, 19 GFLOPS, 70.4 GFLOPS/W... Cortex A15 and a Kepler GPU with 192 “CUDA” cores, and is more comparable as an HPEEC platform than Tesla series GPUs, such as the NVIDIA C2075 and K20

  6. Particle In Cell Codes on Highly Parallel Architectures

    NASA Astrophysics Data System (ADS)

    Tableman, Adam

    2014-10-01

    We describe strategies and examples of Particle-In-Cell codes running on Nvidia GPU and Intel Phi architectures. This includes basic implementations in skeleton codes and full-scale development versions (encompassing 1D, 2D, and 3D codes) in Osiris. Both the similarities and differences between Intel's and Nvidia's hardware will be examined. Work supported by grants NSF ACI 1339893, DOE DE SC 000849, DOE DE SC 0008316, DOE DE NA 0001833, and DOE DE FC02 04ER 54780.

  7. Investigating the Mobility of Light Autonoumous Tracked Vehicles Using a High Performance Computing Simulation Capability

    DTIC Science & Technology

    2012-08-01

    UNCLASSIFIED: Distribution Statement A. Approved for public release. DISCLAIMER: Reference herein to any specific commercial company, product...FunctionBay, S. Korea – NVIDIA – Caterpillar – MSC.Software – Advanced Micro Devices (AMD) – Aaron Bartholomew – Makarand Datar...16 GB DDR2; Graphics: 4x NVIDIA Tesla C1060; Power supply 1: 1000 W; Power supply 2: 750 W; Assembled Quad GPU Machine

  8. A GPU Parallelization of the Absolute Nodal Coordinate Formulation for Applications in Flexible Multibody Dynamics

    DTIC Science & Technology

    2012-02-17

    to be solved. Disclaimer: Reference herein to any specific commercial company , product, process, or service by trade name, trademark...data processing rather than data caching and control flow. To make use of this computational power, NVIDIA introduced a general purpose parallel...GPU implementations were run on an Intel Nehalem Xeon E5520 2.26GHz processor with an NVIDIA Tesla C2070 graphics card for varying numbers of

  9. Communication Efficient Gaussian Elimination with Partial Pivoting using a Shape Morphing Data Layout

    DTIC Science & Technology

    2013-02-21

    support comes from ParLab affiliates National Instruments, Nokia, NVIDIA, Oracle and Samsung, as well as MathWorks. Research is also supported by DOE grants DE-SC0004938, DE-SC0005136...International Business Machines Company, 1966. [17] S. Toledo. Locality of reference in LU decomposition with partial pivoting. SIAM J. Matrix Anal. Appl., 18

  10. Operational Based Vision Assessment

    DTIC Science & Technology

    2014-02-01

    formulated or supplied the drawings, specifications, or other data does not license the holder or any other person or corporation or convey any...expensive than other developers' software. The sources for the GPUs (Nvidia) and the host computer (Concurrent's iHawk) were identified. The...boundaries, which is a distracting artifact when performing visual tests. The problem has been isolated by the OBVA team to the Nvidia GPUs. The OBVA system

  11. Visual Media Reasoning - Terrain-based Geolocation

    DTIC Science & Technology

    2015-06-01

    the drawings, specifications, or other data does not license the holder or any other person or corporation; or convey any rights or permission to...3.4 Alternative Metric Investigation This section describes a graphics processing unit (GPU) based implementation in the NVIDIA CUDA programming...utilizing 2 concurrent CPU cores, each controlling a single Nvidia C2075 Tesla Fermi CUDA card. Figure 22 shows a comparison of the CPU and the GPU powered

  12. Design Tools for Accelerating Development and Usage of Multi-Core Computing Platforms

    DTIC Science & Technology

    2014-04-01

    Government formulated or supplied the drawings, specifications, or other data does not license the holder or any other person or corporation; or convey...multicore PDSP platforms. The GPU-based capabilities of TDIF are currently oriented towards NVIDIA GPUs, based on the Compute Unified Device Architecture...(CUDA) programming language [NVIDIA 2007], which can be viewed as an extension of C. The multicore PDSP capabilities currently in TDIF are oriented

  13. The RISC-V Instruction Set Manual Volume 2: Privileged Architecture Version 1.7

    DTIC Science & Technology

    2015-05-09

    DIG07-10227). Additional support came from Par Lab affiliates Nokia, NVIDIA, Oracle, and Samsung. • Project Isis: DoE Award DE-SC0003624. • ASPIRE...STARnet center funded by the Semiconductor Research Corporation. Additional support from ASPIRE industrial sponsor, Intel, and ASPIRE affiliates...Google, Huawei, Nokia, NVIDIA, Oracle, and Samsung. The content of this paper does not necessarily reflect the position or the policy of the US

  14. SciTech Connect

    Kartsaklis, Christos; Civario, G

    This paper discusses ongoing progress on the development of a Java-based library for rapid kernel prototyping in NVIDIA PTX and for PTX instruction scheduling. It is aimed at developers seeking total control of emitted PTX, highly parametric PTX emission, and tunable instruction reordering. It is primarily used for code development at ICHEC, but it is hoped that the NVIDIA GPU community will also find it beneficial.

  15. Integrating the Nqueens Algorithm into a Parameterized Benchmark Suite

    DTIC Science & Technology

    2016-02-01

    FOB is a 64-node heterogeneous cluster consisting of 16 IBM dx360M4 nodes, each with one NVIDIA Kepler K20M GPU, and 48 IBM dx360M4 nodes, and each...nodes have 256 GB of memory and an NVIDIA Tesla K40 GPU. More details on Excalibur can be found on the US Army DSRC website [19]. Figures 3 and 4 show the

  16. GPUbased, Microsecond Latency, HectoChannel MIMO Feedback Control of Magnetically Confined Plasmas

    NASA Astrophysics Data System (ADS)

    Rath, Nikolaus

    Feedback control has become a crucial tool in research on the magnetic confinement of plasmas for achieving controlled nuclear fusion. This thesis presents a novel plasma feedback control system that, for the first time, employs a Graphics Processing Unit (GPU) for microsecond-latency, real-time control computations. This novel application area for GPU computing is opened up by a new system architecture that is optimized for low-latency computations on sub-kilobyte data samples, as they occur in typical plasma control algorithms. In contrast to traditional GPU computing approaches that target complex, high-throughput computations with massive amounts of data, the architecture presented in this thesis uses the GPU as the primary processing unit rather than as an auxiliary of the CPU, and data is transferred from A-D/D-A converters directly into GPU memory using peer-to-peer PCI Express transfers. The described design has been implemented in a new, GPU-based control system for the High-Beta Tokamak - Extended Pulse (HBT-EP) device. The system is built from commodity hardware and uses an NVIDIA GeForce GPU and D-TACQ A-D/D-A converters providing a total of 96 input and 64 output channels. The system is able to run with sampling periods down to 4 μs and latencies down to 8 μs. The GPU provides a total processing power of 1.5 × 10^12 floating point operations per second. To illustrate the performance and versatility of both the general architecture and the concrete implementation, a new control algorithm has been developed. The algorithm is designed for the control of multiple rotating magnetic perturbations in situations where the plasma equilibrium is not known exactly, and features an adaptive system model: instead of requiring the rotation frequencies and growth rates embedded in the system model to be set a priori, the adaptive algorithm derives these parameters from the evolution of the perturbation amplitudes themselves. This results in non-linear control
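
    The adaptive-model idea, deriving a rotating mode's growth rate and rotation frequency from the measured amplitude evolution rather than fixing them a priori, can be illustrated in a few lines; the estimator below is a toy version, not the thesis' control law.

```python
# Estimate a rotating mode's growth rate and rotation frequency from
# successive complex amplitude samples, A(t+dt) ~= A(t) * exp((g + i*w)*dt).
import numpy as np

def estimate_mode(a, dt):
    r = np.log(a[1:] / a[:-1]).mean() / dt
    return r.real, r.imag              # (growth rate gamma, rotation omega)

dt = 4e-6                              # 4 us sampling period, as in the system
t = np.arange(200) * dt
a = np.exp((500.0 + 2j * np.pi * 3e3) * t)   # synthetic growing 3 kHz mode
print(estimate_mode(a, dt))            # approximately (500.0, 2*pi*3000)
```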

  17. Analysis of grayanatoxin in Rhododendron honey and effect on antioxidant parameters in rats.

    PubMed

    Sibel, Silici; Enis, Yonar M; Hüseyin, Sahin; Timucin, Atayoğlu A; Duran, Ozkok

    2014-10-28

    Rhododendron honey, locally known as "mad honey", contains grayanotoxin (GTX) and thus induces toxic effects when consumed in large amounts. However, it is still popularly used to treat medical conditions such as high blood pressure and gastro-intestinal disorders. The aim of this study was to evaluate the effect of GTX on antioxidant parameters measured in rats fed Rhododendron honey. A total of sixty Sprague-Dawley female rats were divided into five groups of 12 rats each, one being the control group (Group 1) and the others being the experimental groups (Groups 2 to 5). Group 2 was treated with 0.015 mg/kg/bw of Grayanotoxin-III (GTX-III) standard preparation via intraperitoneal injection. Groups 3, 4 and 5 were respectively given Rhododendron honey (RH) at doses of 0.1, 0.5, and 2.5 g/kg/bw via oral gavage. After one hour, blood samples were collected from the rats. Glutathione peroxidase (GSH-Px), superoxide dismutase (SOD) and catalase (CAT) activities and malondialdehyde (MDA) contents were examined in blood, heart, lung, liver, kidney, testicle, epididymis, spleen and brain specimens. The data from the rats in Groups 2 (GTX) and 5 (RH at 2.5 g/kg/bw) showed a negative effect on the antioxidant parameters in blood and all tissue samples examined at the specified doses and time period. Administration of GTX to rats at a dose of 0.015 mg/kg/bw resulted in lipid peroxidation. It was observed that both the grayanotoxin and high-dose Rhododendron honey treatments showed an oxidant effect on blood plasma and the organ tissues investigated. Copyright © 2014 Elsevier Ireland Ltd. All rights reserved.

  18. Androgen receptor agonists increase lean mass, improve cardiopulmonary functions and extend survival in preclinical models of Duchenne muscular dystrophy.

    PubMed

    Ponnusamy, Suriyan; Sullivan, Ryan D; You, Dahui; Zafar, Nadeem; He Yang, Chuan; Thiyagarajan, Thirumagal; Johnson, Daniel L; Barrett, Maron L; Koehler, Nikki J; Star, Mayra; Stephenson, Erin J; Bridges, Dave; Cormier, Stephania A; Pfeffer, Lawrence M; Narayanan, Ramesh

    2017-07-01

    Duchenne muscular dystrophy (DMD) is a neuromuscular disease that predominantly affects boys as a result of mutation(s) in the dystrophin gene. DMD is characterized by musculoskeletal and cardiopulmonary complications, resulting in shorter life-span. Boys afflicted by DMD typically exhibit symptoms within 3-5 years of age and declining physical functions before attaining puberty. We hypothesized that rapidly deteriorating health of pre-pubertal boys with DMD could be due to diminished anabolic actions of androgens in muscle, and that intervention with an androgen receptor (AR) agonist will reverse musculoskeletal complications and extend survival. While castration of dystrophin and utrophin double mutant (mdx-dm) mice to mimic pre-pubertal nadir androgen condition resulted in premature death, maintenance of androgen levels extended the survival. Non-steroidal selective-AR modulator, GTx-026, which selectively builds muscle and bone was tested in X-linked muscular dystrophy mice (mdx). GTx-026 significantly increased body weight, lean mass and grip strength by 60-80% over vehicle-treated mdx mice. While vehicle-treated castrated mdx mice exhibited cardiopulmonary impairment and fibrosis of heart and lungs, GTx-026 returned cardiopulmonary function and intensity of fibrosis to healthy control levels. GTx-026 elicits its musculoskeletal effects through pathways that are distinct from dystrophin-regulated pathways, making AR agonists ideal candidates for combination approaches. While castration of mdx-dm mice resulted in weaker muscle and shorter survival, GTx-026 treatment increased the muscle mass, function and survival, indicating that androgens are important for extended survival. These preclinical results support the importance of androgens and the need for intervention with AR agonists to treat DMD-affected boys. © The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.

  19. A hybrid parallel architecture for electrostatic interactions in the simulation of dissipative particle dynamics

    NASA Astrophysics Data System (ADS)

    Yang, Sheng-Chun; Lu, Zhong-Yuan; Qian, Hu-Jun; Wang, Yong-Lei; Han, Jie-Ping

    2017-11-01

    In this work, we upgraded the electrostatic interaction method of CU-ENUF (Yang, et al., 2016), which first applied CUNFFT (nonequispaced Fourier transforms based on CUDA) to the reciprocal-space electrostatic computation, so that the computation of electrostatic interactions is carried out entirely on the GPU. The upgraded edition of CU-ENUF runs in a hybrid parallel way that parallelizes the computation first across multiple computer nodes and then across the GPUs installed in each node. With this parallel strategy, the size of the simulation system is no longer restricted by the throughput of a single CPU or GPU. The most critical technical problem is how to parallelize CUNFFT within this strategy, which is solved effectively through a careful study of the underlying principles and some algorithmic techniques. Furthermore, the upgraded method is capable of computing electrostatic interactions for both atomistic molecular dynamics (MD) and dissipative particle dynamics (DPD). Finally, the benchmarks conducted for validation and performance indicate that the upgraded method not only achieves good precision with suitable parameters, but also provides an efficient way to compute electrostatic interactions for huge simulation systems. Program Files doi:http://dx.doi.org/10.17632/zncf24fhpv.1 Licensing provisions: GNU General Public License 3 (GPL) Programming language: C, C++, and CUDA C Supplementary material: The program is designed for effective electrostatic interactions of large-scale simulation systems, which runs on particular computers equipped with NVIDIA GPUs. It has been tested on (a) a single computer node with an Intel(R) Core(TM) i7-3770 @ 3.40 GHz (CPU) and a GTX 980 Ti (GPU), and (b) MPI-parallel computer nodes with the same configurations. Nature of problem: For molecular dynamics simulation, the electrostatic interaction is the most time-consuming computation because of its long-range feature and slow convergence in simulation space

  20. Synthesis of the Paralytic Shellfish Poisons (+)-Gonyautoxin 2, (+)-Gonyautoxin 3, and (+)-11,11-Dihydroxysaxitoxin.

    PubMed

    Mulcahy, John V; Walker, James R; Merit, Jeffrey E; Whitehead, Alan; Du Bois, J

    2016-05-11

    The paralytic shellfish poisons are a collection of guanidine-containing natural products that are biosynthesized by prokaryotic and eukaryotic marine organisms. These compounds bind and inhibit isoforms of the mammalian voltage-gated Na+ ion channel at concentrations ranging from 10^-11 to 10^-5 M. Here, we describe the de novo synthesis of three paralytic shellfish poisons: gonyautoxin 2, gonyautoxin 3, and 11,11-dihydroxysaxitoxin. Key steps include a diastereoselective Pictet-Spengler reaction and an intramolecular amination of an N-guanidyl pyrrole by a sulfonyl guanidine. The IC50 values of GTX 2, GTX 3, and 11,11-dhSTX were measured against rat NaV1.4 and found to be 22 nM, 15 nM, and 2.2 μM, respectively.

  1. Physiological roles of Kv2 channels in entorhinal cortex layer II stellate cells revealed by Guangxitoxin‐1E

    PubMed Central

    Hönigsperger, Christoph; Nigro, Maximiliano J.

    2016-01-01

    Key points: Kv2 channels underlie delayed-rectifier potassium currents in various neurons, although their physiological roles often remain elusive. Almost nothing is known about Kv2 channel functions in medial entorhinal cortex (mEC) neurons, which are involved in representing space, memory formation, epilepsy and dementia. Stellate cells in layer II of the mEC project to the hippocampus and are considered to be space-representing grid cells. We used the new Kv2 blocker Guangxitoxin-1E (GTx) to study Kv2 functions in these neurons. Voltage clamp recordings from mEC stellate cells in rat brain slices showed that GTx inhibited delayed-rectifier K+ current but not transient A-type current. In current clamp, GTx had multiple effects: (i) increasing excitability and bursting at moderate spike rates but reducing firing at high rates; (ii) enhancing after-depolarizations; (iii) reducing the fast and medium after-hyperpolarizations; (iv) broadening action potentials; and (v) reducing spike clustering. GTx is a useful tool for studying Kv2 channels and their functions in neurons. Abstract: The medial entorhinal cortex (mEC) is strongly involved in spatial navigation, memory, dementia and epilepsy. Although potassium channels shape neuronal activity, their roles in the mEC are largely unknown. We used the new Kv2 blocker Guangxitoxin-1E (GTx; 10–100 nm) in rat brain slices to investigate Kv2 channel functions in mEC layer II stellate cells (SCs). These neurons project to the hippocampus and are considered to be grid cells representing space. Voltage clamp recordings from nucleated patches of SCs showed that GTx inhibited a delayed-rectifier K+ current activating beyond –30 mV but not transient A-type current. In current clamp, GTx (i) had almost no effect on the first action potential but markedly slowed repolarization of late spikes during repetitive firing; (ii) enhanced the after-depolarization (ADP); (iii) reduced fast and medium after

  2. Three Dimensional Numerical Simulation of Rocket-based Combined-cycle Engine Response During Mode Transition Events

    NASA Technical Reports Server (NTRS)

    Edwards, Jack R.; McRae, D. Scott; Bond, Ryan B.; Steffan, Christopher (Technical Monitor)

    2003-01-01

    The GTX program at NASA Glenn Research Center is designed to develop a launch vehicle concept based on rocket-based combined-cycle (RBCC) propulsion. Experimental testing, cycle analysis, and computational fluid dynamics modeling have all demonstrated the viability of the GTX concept, yet significant technical issues and challenges still remain. Our research effort develops a unique capability for dynamic CFD simulation of complete high-speed propulsion devices and focuses this technology toward analysis of the GTX response during critical mode transition events. Our principal attention is focused on Mode 1/Mode 2 operation, in which initial rocket propulsion is transitioned into thermal-throat ramjet propulsion. A critical element of the GTX concept is the use of an Independent Ramjet Stream (IRS) cycle to provide propulsion at Mach numbers less than 3. In the IRS cycle, rocket thrust is initially used for primary power, and the hot rocket plume is used as a flame-holding mechanism for hydrogen fuel injected into the secondary air stream. A critical aspect is the establishment of a thermal throat in the secondary stream through the combination of area reduction effects and combustion-induced heat release. This is a necessity to enable the power-down of the rocket and the eventual shift to ramjet mode. Our focus in this first year of the grant has been in three areas, each progressing directly toward the key initial goal of simulating thermal throat formation during the IRS cycle: CFD algorithm development; simulation of Mode 1 experiments conducted at Glenn's Rig 1 facility; and IRS cycle simulations. The remainder of this report discusses each of these efforts in detail and presents a plan of work for the next year.

  3. Therapeutic transfusions of granulocytes collected by simple bag method for children with cancer and neutropenic infections: results of a single-centre pilot study.

    PubMed

    Kikuta, A; Ohto, H; Nemoto, K; Mochizuki, K; Sano, H; Ito, M; Suzuki, H

    2006-07-01

    Granulocyte transfusion therapy (GTX) can be effective for life-threatening infections unresponsive to conventional antimicrobial therapies in severely neutropenic children with cancer. We developed a new granulocyte collection method, named the 'bag method', in which neither apheresis, hydroxyethyl starch (HES), nor dexamethasone is used. We undertook a pilot study to determine the feasibility and safety of GTX collected by the bag method for children with cancer and life-threatening infections. A total of 25 GTX were administered to 13 patients (median age 3 years, range: 0.3-17; median weight 10.6 kg, range: 4.5-49.8) with neutropenia-related infections. Thirteen blood-relative donors received granulocyte colony-stimulating factor (G-CSF) (5-10 µg/kg), subcutaneously, 14 h before collection. Major end-points were granulocyte yields, post-transfusion absolute neutrophil counts (ANC) in patients, donor and patient safety, and clinical outcome on day 30. The median yield of ANC per 400 ml of processed whole blood was 6.2 × 10⁹ (range: 2.5-15.0 × 10⁹). Patients received a mean of 6.4 ± 0.8 × 10⁸ granulocytes per kg of body weight per transfusion. The 1-h and 24-h post-transfusion ANC rose to 607 ± 124/µl and 704 ± 300/µl, respectively, from the baseline of 21/µl before the first GTX. Adverse reactions were observed in five of 13 donors (bone pain, headache, vasovagal reaction; all ≤ grade 2) and in two of 25 transfusions of 13 patients (transient hypoxia; grade 3). Ten patients had favourable responses, and infection resolved in nine patients. The bag method without apheresis relieves the physical load on donors and enables patients with a low body weight to receive an adequate dose of granulocytes.

  4. Air-Breathing Launch Vehicle Technology Being Developed

    NASA Technical Reports Server (NTRS)

    Trefny, Charles J.

    2003-01-01

    Of the technical factors that would contribute to lowering the cost of space access, reusability has high potential. The primary objective of the GTX program is to determine whether or not air-breathing propulsion can enable reusable single-stage-to-orbit (SSTO) operations. The approach is based on maturation of a reference vehicle design with focus on the integration and flight-weight construction of its air-breathing rocket-based combined-cycle (RBCC) propulsion system.

  5. Prevalence, Variability and Bioconcentration of Saxitoxin-Group in Different Marine Species Present in the Food Chain

    PubMed Central

    Oyaneder Terrazas, Javiera; Contreras, Héctor R.; García, Carlos

    2017-01-01

    The saxitoxin-group (STX-group) corresponds to toxic metabolites produced by cyanobacteria and dinoflagellates of the genera Alexandrium, Gymnodinium, and Pyrodinium. Over the last decade, it has been possible to extrapolate the areas contaminated with the STX-group worldwide, including Chile, a phenomenon that has affected ≈35% of the Southern Pacific coast territory and generated a high economic impact. The objective of this research was to study the toxicity of the STX-group in all aquatic organisms (bivalves, algae, echinoderms, crustaceans, tunicates, cephalopods, gastropods, and fish) present in areas with a variable presence of harmful algal blooms (HABs). The toxic profiles of each species and the dose of STX equivalents ingested by a 60 kg person from 400 g of shellfish were then determined to establish the health risk assessment. The toxins with the highest prevalence detected were gonyautoxin-4/1 (GTX4/GTX1), gonyautoxin-3/2 (GTX3/GTX2), neosaxitoxin (neoSTX), decarbamoylsaxitoxin (dcSTX), and saxitoxin (STX), with average concentrations of 400, 2800, 280, 200, and 2000 µg kg−1, respectively, and a species-specific variability, dependent on the evaluated tissue, which demonstrates the biotransformation of the analogues in trophic transfer, with a predominance of α-epimers in all toxic profiles. The identification in multiple vectors, as well as in unregulated species, suggests that a risk assessment and risk management update are required; in addition, chemical and specific analyses for the detection of all analogues associated with the STX-group need to be established. PMID:28604648

  6. Concentrations of gatifloxacin in plasma and urine and penetration into prostatic and seminal fluid, ejaculate, and sperm cells after single oral administrations of 400 milligrams to volunteers.

    PubMed

    Naber, C K; Steghafner, M; Kinzig-Schippers, M; Sauber, C; Sörgel, F; Stahlberg, H J; Naber, K G

    2001-01-01

    Gatifloxacin (GTX), a new fluoroquinolone with extended antibacterial activity, is an interesting candidate for the treatment of chronic bacterial prostatitis (CBP). Besides the antibacterial spectrum, the concentrations in the target tissues and fluids are crucial for the treatment of CBP. Thus, it was of interest to investigate its penetration into prostatic and seminal fluid. GTX concentrations in plasma, urine, ejaculate, prostatic and seminal fluid, and sperm cells were determined by a high-performance liquid chromatography method after oral intake of a single 400-mg dose in 10 male Caucasian volunteers in the fasting state. Simultaneous application of the renal contrast agent iohexol was used to estimate the maximal possible contamination of ejaculate and prostatic and seminal fluid by urine. GTX was well tolerated. The means (standard deviations) for the following parameters were as indicated: time to maximum concentration of drug in serum, 1.66 (0.91) h; maximum concentration of drug in serum, 2.90 (0.39) microg/ml; area under the concentration-time curve from 0 to 24 h, 25.65 microg · h/ml; and half-life, 7.2 (0.90) h. Within 12 h about 50% of the drug was excreted unchanged into the urine. The mean renal clearance was 169 ml/min. The gatifloxacin concentrations in ejaculate, seminal fluid, and prostatic fluid were in the range of the corresponding plasma concentrations, which were 1.92 (0.27) microg/ml at approximately the same time point (4 h after drug intake). The concentrations in sperm cells (0.195, 0.076, and 0.011 microg/ml) could be determined in three subjects. The good penetration into prostatic and seminal fluid, the good tolerance, and the previously reported broad antibacterial spectrum suggest that GTX may be a good alternative for the treatment of chronic bacterial prostatitis. Clinical studies should be performed to confirm this assumption.

  7. Concentrations of Gatifloxacin in Plasma and Urine and Penetration into Prostatic and Seminal Fluid, Ejaculate, and Sperm Cells after Single Oral Administrations of 400 Milligrams to Volunteers

    PubMed Central

    Naber, Christoph K.; Steghafner, Michaela; Kinzig-Schippers, Martina; Sauber, Christian; Sörgel, Fritz; Stahlberg, Hans-Jürgen; Naber, Kurt G.

    2001-01-01

    Gatifloxacin (GTX), a new fluoroquinolone with extended antibacterial activity, is an interesting candidate for the treatment of chronic bacterial prostatitis (CBP). Besides the antibacterial spectrum, the concentrations in the target tissues and fluids are crucial for the treatment of CBP. Thus, it was of interest to investigate its penetration into prostatic and seminal fluid. GTX concentrations in plasma, urine, ejaculate, prostatic and seminal fluid, and sperm cells were determined by a high-performance liquid chromatography method after oral intake of a single 400-mg dose in 10 male Caucasian volunteers in the fasting state. Simultaneous application of the renal contrast agent iohexol was used to estimate the maximal possible contamination of ejaculate and prostatic and seminal fluid by urine. GTX was well tolerated. The means (standard deviations) for the following parameters were as indicated: time to maximum concentration of drug in serum, 1.66 (0.91) h; maximum concentration of drug in serum, 2.90 (0.39) μg/ml; area under the concentration-time curve from 0 to 24 h, 25.65 μg · h/ml; and half life, 7.2 (0.90) h. Within 12 h about 50% of the drug was excreted unchanged into the urine. The mean renal clearance was 169 ml/min. The gatifloxacin concentrations in ejaculate, seminal fluid, and prostatic fluid were in the range of the corresponding plasma concentrations which were 1.92 (0.27) μg/ml at approximately the same time point (4 h after drug intake). The concentrations in sperm cells (0.195, 0.076, and 0.011 μg/ml) could be determined in three subjects. The good penetration into prostatic and seminal fluid, the good tolerance, and the previously reported broad antibacterial spectrum suggest that GTX may be a good alternative for the treatment of chronic bacterial prostatitis. Clinical studies should be performed to confirm this assumption. PMID:11120980

  8. Special Topics

    DTIC Science & Technology

    2012-01-01

    training encompasses several concepts, including cognitive knowledge, a performance assessment or pretest, training, a repeat assessment or posttest ...significantly decreased mortality. For the lessons learned in casualty care to be passed on to the next group of surgeons, the training for deployed...unpaid consultant to Athena GTX, Blackhawk Products Group, CHI Systems, Combat Medical Systems, Composite Resources, Compression Works, Creative

  9. Differential Scanning Calorimetric (DSC) Analysis of Rotary Nickel-Titanium (NiTi) Endodontic File (RNEF)

    NASA Astrophysics Data System (ADS)

    Wu, Ray Chun Tung; Chung, C. Y.

    2012-12-01

    To determine the variation of Af along the axial length of rotary nickel-titanium endodontic files (RNEF). Three commercial brands of 4% taper RNEF: GTX (#20, 25 mm, Dentsply Tulsa Dental Specialties, Tulsa, OK, USA), K3 (#25, 25 mm) and TF (Twisted File #25, 27 mm) (Sybron Kerr, Orange, CA, USA) were cut into segments at 4 mm increments from the working tip. Regional specimens were measured for differential heat flow over thermal cycling, generally with continuous heating or cooling (5 °C/min) and a 5 min hold at set temperatures (start, finish temperatures): GTX: -55, 90 °C; K3: -55, 45 °C; TF: -55, 60 °C; using a differential scanning calorimeter. This experiment demonstrated regional differences in Af along the axial length of GTX and K3 files. Similar variation was not obvious in the TF samples. A contributory effect of regional differences in strain-hardening due to grinding and machining during manufacturing is proposed.

  10. Apical extrusion of Enterococcus faecalis using three different rotary instrumentation techniques: an in vitro study.

    PubMed

    Taneja, Sonali; Kumari, Manju; Barua, Madhumita; Dudeja, Chetna; Malik, Meeta

    2015-01-01

    To compare the apical extrusion of Enterococcus faecalis after instrumentation with three different Ni-Ti rotary instruments in an in vitro study. Forty freshly extracted mandibular premolars were mounted in a bacteria collection apparatus and the root canals were contaminated with a suspension of Enterococcus faecalis. The contaminated teeth were divided into 4 groups of 10 teeth each according to the rotary system used for instrumentation: Group 1: Hyflex files, Group 2: GTX files, Group 3: ProTaper files and Group 4: control group (no instrumentation). Bacteria extruded after preparation were collected into vials and the microbiological samples were incubated in BHI broth for 24 h. The colony-forming units were determined for each sample. Statistical analysis was done using one-way ANOVA followed by a post hoc independent t test. GTX files extruded the least amount of bacteria, followed by Hyflex files. Maximum extrusion of E. faecalis was seen in the rotary ProTaper group. The least amount of extrusion was seen with GTX files, followed by Hyflex files and then the rotary ProTaper system.

  11. Development of an Experiment High Performance Nozzle Research Program

    NASA Technical Reports Server (NTRS)

    2004-01-01

    As proposed in the above OAI/NASA Glenn Research Center (GRC) Co-Operative Agreement, the objective of the work was to provide consultation and assistance to the NASA GRC GTX Rocket Based Combined Cycle (RBCC) Program Team in planning and developing requirements, scale model concepts, and plans for an experimental nozzle research program. The GTX was one of the launch vehicle concepts being studied as a possible future replacement for the aging NASA Space Shuttle, and was one RBCC element in the ongoing NASA Access to Space R&D Program (Reference 1). The ultimate program objective was the development of an appropriate experimental research program to evaluate and validate proposed nozzle concepts, and thereby optimize a high performance nozzle for the GTX launch vehicle. Included in this task were the identification of appropriate existing test facilities, the development of requirements for new test rigs and fixtures, the development of scale nozzle model concepts, and the proposal of corresponding test plans. Also included were the evaluation of the originally proposed and alternate nozzle designs (in-house and contractor), the evaluation of Computational Fluid Dynamics (CFD) study results, and recommendations for geometric changes to improve nozzle thrust coefficient performance (Cfg).

  12. Analysis and Implementation of Particle-to-Particle (P2P) Graphics Processor Unit (GPU) Kernel for Black-Box Adaptive Fast Multipole Method

    DTIC Science & Technology

    2015-06-01

    5110P and 16 dx360M4 nodes each with one NVIDIA Kepler K20M/K40M GPU. Each node contained dual Intel Xeon E5-2670 (Sandy Bridge) central processing...kernel and as such does not employ multiple processors. This work makes use of a single processing core and a single NVIDIA Kepler K40 GK110...bandwidth (2 × 16 slot), 7.877 GFloat/s; Kepler K40 peak, 4,290 × 1 billion floating-point operations (GFLOPs), and 288 GB/s Kepler K40 memory

  13. Grace: A cross-platform micromagnetic simulator on graphics processing units

    NASA Astrophysics Data System (ADS)

    Zhu, Ru

    2015-12-01

    A micromagnetic simulator running on graphics processing units (GPUs) is presented. Different from the GPU implementations of other research groups, which predominantly run on NVidia's CUDA platform, this simulator is developed with C++ Accelerated Massive Parallelism (C++ AMP) and is hardware-platform independent. It runs on GPUs from vendors including NVidia, AMD and Intel, and achieves a significant performance boost, up to two orders of magnitude, compared to previous central processing unit (CPU) simulators. The simulator paves the way for running large-scale micromagnetic simulations on both high-end workstations with dedicated graphics cards and low-end personal computers with integrated graphics cards, and is freely available to download.

  14. Micromagnetics on high-performance workstation and mobile computational platforms

    NASA Astrophysics Data System (ADS)

    Fu, S.; Chang, R.; Couture, S.; Menarini, M.; Escobar, M. A.; Kuteifan, M.; Lubarda, M.; Gabay, D.; Lomakin, V.

    2015-05-01

    The feasibility of using high-performance desktop and embedded mobile computational platforms is presented, including multi-core Intel central processing units, Nvidia desktop graphics processing units, and the Nvidia Jetson TK1 platform. The FastMag finite-element-method-based micromagnetic simulator is used as a testbed, showing high efficiency on all the platforms. Optimization aspects of improving the performance of the mobile systems are discussed. The high performance, low cost, low power consumption, and rapid performance increase of embedded mobile systems make them a promising candidate for micromagnetic simulations. Such architectures can be used as standalone systems or can be built into low-power computing clusters.

  15. Real-time radar signal processing using GPGPU (general-purpose graphic processing unit)

    NASA Astrophysics Data System (ADS)

    Kong, Fanxing; Zhang, Yan Rockee; Cai, Jingxiao; Palmer, Robert D.

    2016-05-01

    This study introduces a practical approach to developing a real-time signal processing chain for general phased array radar on NVIDIA GPUs (Graphics Processing Units) using CUDA (Compute Unified Device Architecture) libraries such as cuBlas and cuFFT, which are adopted from open source libraries and optimized for the NVIDIA GPUs. The processed results are rigorously verified against those from CPUs. Performance, benchmarked as computation time for various input data cube sizes, is compared across GPUs and CPUs. Through the analysis, it is demonstrated that GPGPU (general-purpose GPU) real-time processing of array radar data is possible with relatively low-cost commercial GPUs.
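    As a minimal, hedged illustration of the cuFFT-based processing the record describes, the sketch below performs frequency-domain pulse compression, y = IFFT(FFT(x) · conj(FFT(h))), using only standard cuFFT calls; buffer sizes are illustrative and normalization of the inverse FFT is left to the caller:

      // Frequency-domain pulse compression (matched filter) with cuFFT.
      #include <cufft.h>
      #include <cuda_runtime.h>

      __global__ void multiply_conj(cufftComplex *X, const cufftComplex *H, int n) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) {
              cufftComplex x = X[i], h = H[i];
              X[i].x = x.x * h.x + x.y * h.y;   // Re(x * conj(h))
              X[i].y = x.y * h.x - x.x * h.y;   // Im(x * conj(h))
          }
      }

      // d_x: received samples, d_h: reference pulse; both length n, on device.
      void pulse_compress(cufftComplex *d_x, cufftComplex *d_h, int n) {
          cufftHandle plan;
          cufftPlan1d(&plan, n, CUFFT_C2C, 1);
          cufftExecC2C(plan, d_x, d_x, CUFFT_FORWARD);
          cufftExecC2C(plan, d_h, d_h, CUFFT_FORWARD);
          multiply_conj<<<(n + 255) / 256, 256>>>(d_x, d_h, n);
          cufftExecC2C(plan, d_x, d_x, CUFFT_INVERSE);  // output is unnormalized
          cufftDestroy(plan);
      }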

  16. Multi-Core Programming Design Patterns: Stream Processing Algorithms for Dynamic Scene Perceptions

    DTIC Science & Technology

    2014-05-01

    processor developed by IBM and other companies, incorporates the POWER5 processor as the Power Processor Element (PPE), one of the early general...deliver a power-efficient single-precision peak performance of more than 256 GFlops. Substantially more raw power became available later, when nVIDIA ...algorithms, including IBM's Cell/B.E., GPUs from NVidia and AMD and many-core CPUs from Intel. The vast growth of digital video content has been a

  17. Poster: Building a Large Tiled-Display Cluster

    DTIC Science & Technology

    2012-10-01

    graphics cards (Nvidia Quadro FX 5800), and each graphics card in a display...such as DisplayPort and HDMI (see: Nvidia Quadro 6000). We recommend these formats because they are much easier to plug-and-play. 3.4 Leverage Open...will find yourself with all the issues related to owning a server room. Today, there are a number of companies offering turn-key solutions for tiled

  18. Employing OpenCL to Accelerate Ab Initio Calculations on Graphics Processing Units.

    PubMed

    Kussmann, Jörg; Ochsenfeld, Christian

    2017-06-13

    We present an extension of our graphics processing units (GPU)-accelerated quantum chemistry package to employ OpenCL compute kernels, which can be executed on a wide range of computing devices like CPUs, Intel Xeon Phi, and AMD GPUs. Here, we focus on the use of AMD GPUs and discuss differences as compared to CUDA-based calculations on NVIDIA GPUs. First illustrative timings are presented for hybrid density functional theory calculations using serial as well as parallel compute environments. The results show that AMD GPUs are as fast or faster than comparable NVIDIA GPUs and provide a viable alternative for quantum chemical applications.

  19. High Resolution Imaging Testbed Utilizing Sodium Laser Guide Star Adaptive Optics: The Real Time Wavefront Reconstructor Computer

    DTIC Science & Technology

    2008-07-31

    Unlike the Lyrtech, each DSP on a Bittware board offers 3 MB of on-chip memory and 3 GFLOPs of 32-bit peak processing power. Based on the performance...Each NVIDIA 8800 Ultra features 576 GFLOPS on 128 612-MHz single-precision floating-point SIMD processors, arranged in 16 clusters of eight. Each

  20. 75 FR 48338 - Intel Corporation; Analysis of Proposed Consent Order to Aid Public Comment

    Federal Register 2010, 2011, 2012, 2013, 2014

    2010-08-10

    ... integrated into chipsets as well as discrete graphics cards. NVIDIA has been at the forefront of developing... to connect peripheral products such as discrete GPUs to the CPU. A bus is a connection point between... platform. Intel's commitment to maintain an open PCIe bus will provide discrete graphics manufacturers...

  1. Computational Omics Funding Opportunity | Office of Cancer Clinical Proteomics Research

    Cancer.gov

    The National Cancer Institute's Clinical Proteomic Tumor Analysis Consortium (CPTAC) and the NVIDIA Foundation are pleased to announce funding opportunities in the fight against cancer. Each organization has launched a request for proposals (RFP) that will collectively fund up to $2 million to help develop a new generation of data-intensive scientific tools to find new ways to treat cancer.

  2. A High Performance Computing Framework for Physics-based Modeling and Simulation of Military Ground Vehicles

    DTIC Science & Technology

    2011-03-25

    number one and Nebulae at number three. Both systems rely on GPU co-processing and use Intel Xeon processors and NVIDIA Tesla C2050 GPUs. In spite of a theoretical peak capability of almost 3 Petaflop/s, Nebulae clocked at 1.271 PFlop/s when running the Linpack benchmark, which puts it

  3. Enabling Computational Dynamics in Distributed Computing Environments Using a Heterogeneous Computing Template

    DTIC Science & Technology

    2011-08-09

    fastest 10 supercomputers in the world. Both systems rely on GPU co-processing, one using AMD cards, the second, called Nebulae, using NVIDIA Tesla...capability of almost 3 petaflop/s, the highest in TOP500, Nebulae only holds the No. 2 position on the TOP500 list of the

  4. Ultraviolet Communication for Medical Applications

    DTIC Science & Technology

    2015-06-01

    In the previous Phase I effort, Directed Energy Inc.’s (DEI) parent company Imaging Systems Technology (IST) demonstrated feasibility of several key...accurately model high path loss. Custom photon scatter code was rewritten for parallel execution on a graphics processing unit (GPU). The NVidia CUDA

  5. Evaluation of an Adaptive Automation Trigger Based on Task Performance, Priority, and Frequency

    DTIC Science & Technology

    2013-06-01

    with dual Intel® Xeon® CPU x5550 processors @ 2.67 GHz each, 12.0 GB RAM, and a 1.5 GB PCIe nVidia Quadro FX 4800 graphics card (Microsoft...Cole Publishing Company. Miller, C. A., & Parasuraman, R. (2007). Designing for flexible interaction between humans and automation: Delegation

  6. Finite Element Optimization for Nondestructive Evaluation on a Graphics Processing Unit for Ground Vehicle Hull Inspection

    DTIC Science & Technology

    2013-08-22

    4 cores, where the code may simultaneously run on the multiple cores or the graphics processing unit (or GPU – to be more specific, on an NVIDIA ...allowed to get accurate crack shapes. DISCLAIMER: Reference herein to any specific commercial company, product, process, or service by trade name

  7. 75 FR 32826 - Self-Regulatory Organizations; NASDAQ OMX PHLX, Inc.; Notice of Filing and Immediate...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2010-06-09

    ...''), American Express Company (``AXP''), Ciena Corp. (``CIEN''), Star Scientific, Inc. (``CIGX''), Dendreon Corp. (``DNDN''), eBay Inc. (``EBAY''), Corning Inc. (``GLW''), Halliburton Company (``HAL''), iShares Dow Jones US Real Estate (``IYR''), Motorola, Inc., (``MOT''), NVIDIA Corporation (``NVDA''), ON Semiconductor...

  8. 75 FR 30887 - Self-Regulatory Organizations; The NASDAQ Stock Market LLC; Notice of Filing and Immediate...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2010-06-02

    ...''), American Express Company (``AXP''), Ciena Corp. (``CIEN''), Star Scientific, Inc. (``CIGX''), Dendreon Corp. (``DNDN''), eBay Inc. (``EBAY''), Corning Inc. (``GLW''), Halliburton Company (``HAL''), iShares Dow Jones US Real Estate (``IYR''), Motorola, Inc., (``MOT''), NVIDIA Corporation (``NVDA''), ON Semiconductor...

  9. Analysis of the Finite Precision s-Step Biconjugate Gradient Method

    DTIC Science & Technology

    2014-03-13

    Center for Future Architecture Research, a member of STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA, and ASPIRE Lab...industrial sponsors and affiliates Intel, Google, Nokia, NVIDIA, Oracle, and Samsung. Any opinions, findings, conclusions, or recommendations in this

  10. Peregrine Software Toolchains | High-Performance Computing | NREL

    Science.gov Websites

    toolchain is an open-source alternative against which many technical applications are natively developed and tested. The Portland Group (PGI) compilers are not fully supported, but are available to the HPC community. The PGI C/C++ and Fortran compilers are partially supported. The PGI Accelerator compilers include NVIDIA GPU

  11. LU Factorization with Partial Pivoting for a Multi-CPU, Multi-GPU Shared Memory System

    SciTech Connect

    Kurzak, Jakub; Luszczek, Piotr; Faverge, Mathieu

    2012-03-01

    LU factorization with partial pivoting is a canonical numerical procedure and the main component of the High Performance LINPACK benchmark. This article presents an implementation of the algorithm for a hybrid shared-memory system with standard CPU cores and GPU accelerators. Performance in excess of one TeraFLOPS is achieved using four AMD Magny Cours CPUs and four NVIDIA Fermi GPUs.
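    For scale, a single-GPU building block of such a factorization is available as cuSOLVER's getrf routine; this hedged sketch shows only the library call, not the article's hybrid multi-CPU/multi-GPU scheduling:

      // LU factorization with partial pivoting (P*A = L*U) on one GPU.
      // d_A: n-by-n, column-major, already on the device. Error checks omitted.
      #include <cusolverDn.h>
      #include <cuda_runtime.h>

      void lu_factor(float *d_A, int n) {
          cusolverDnHandle_t h;
          cusolverDnCreate(&h);

          int lwork = 0;
          cusolverDnSgetrf_bufferSize(h, n, n, d_A, n, &lwork);

          float *d_work; int *d_piv, *d_info;
          cudaMalloc(&d_work, sizeof(float) * lwork);
          cudaMalloc(&d_piv,  sizeof(int) * n);
          cudaMalloc(&d_info, sizeof(int));

          cusolverDnSgetrf(h, n, n, d_A, n, d_work, d_piv, d_info);

          cudaFree(d_work); cudaFree(d_piv); cudaFree(d_info);
          cusolverDnDestroy(h);
      }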

  12. Using 3D Computer Graphics Multimedia to Motivate Preservice Teachers' Learning of Geometry and Pedagogy

    ERIC Educational Resources Information Center

    Goodson-Espy, Tracy; Lynch-Davis, Kathleen; Schram, Pamela; Quickenton, Art

    2010-01-01

    This paper describes the genesis and purpose of our geometry methods course, focusing on a geometry-teaching technology we created using the NVIDIA[R] Chameleon demonstration. This article presents examples from a sequence of lessons centered on a 3D computer graphics demonstration of the chameleon and its geometry. In addition, we present data…

  13. Toxin Profile of Gymnodinium catenatum (Dinophyceae) from the Portuguese Coast, as Determined by Liquid Chromatography Tandem Mass Spectrometry

    PubMed Central

    Costa, Pedro R.; Robertson, Alison; Quilliam, Michael A.

    2015-01-01

    The marine dinoflagellate Gymnodinium catenatum has been associated with paralytic shellfish poisoning (PSP) outbreaks in Portuguese waters for many years. PSP syndrome is caused by consumption of seafood contaminated with paralytic shellfish toxins (PSTs), a suite of potent neurotoxins. Gymnodinium catenatum was frequently reported along the Portuguese coast throughout the late 1980s and early 1990s, but was absent between 1995 and 2005. Since this time, G. catenatum blooms have been recurrent, causing contamination of fishery resources along the Atlantic coast of Portugal. The aim of this study was to evaluate the toxin profile of G. catenatum isolated from the Portuguese coast before and after the 10-year hiatus to determine changes and potential impacts for the region. Hydrophilic interaction liquid chromatography tandem mass spectrometry (HILIC-MS/MS) was utilized to determine the presence of any known and emerging PSTs in sample extracts. Several PST derivatives were identified, including the N-sulfocarbamoyl analogues (C1–4), gonyautoxin 5 (GTX5), gonyautoxin 6 (GTX6), and decarbamoyl derivatives, decarbamoyl saxitoxin (dcSTX), decarbamoyl neosaxitoxin (dcNeo) and decarbamoyl gonyautoxin 3 (dcGTX3). In addition, three known hydroxy benzoate derivatives, G. catenatum toxin 1 (GC1), GC2 and GC3, were confirmed in cultured and wild strains of G. catenatum. Moreover, two presumed N-hydroxylated analogues of GC2 and GC3, designated GC5 and GC6, are reported. This work contributes to our understanding of the toxigenicity of G. catenatum in the coastal waters of Portugal and provides valuable information on emerging PST classes that may be relevant for routine monitoring programs tasked with the prevention and control of marine toxins in fish and shellfish. PMID:25871287

  14. Geometric analysis of root canals prepared by four rotary NiTi shaping systems.

    PubMed

    Hashem, Ahmed Abdel Rahman; Ghoneim, Angie Galal; Lutfy, Reem Ahmed; Foda, Manar Yehia; Omar, Gihan Abdel Fatah

    2012-07-01

    A great number of nickel-titanium (NiTi) rotary systems with noncutting tips, different cross-sections, superior resistance to torsional fracture, varying tapers, and different manufacturing methods have been introduced to the market. The purpose of this study was to evaluate and compare the effect of 4 rotary NiTi preparation systems, Revo-S (RS; Micro-Mega, Besancon Cedex, France), Twisted File (TF; SybronEndo, Amersfoort, The Netherlands), ProFile GT Series X (GTX; Dentsply, Tulsa Dental Specialties, Tulsa, OK), and ProTaper (PT; Dentsply Maillefer, Ballaigues, Switzerland), on volumetric changes and transportation of curved root canals. Forty mesiobuccal canals of mandibular molars with an angle of curvature ranging from 25° to 40° were divided according to the instrument used in canal preparation into 4 groups of 10 samples each: group RS, group TF, group GTX, and group PT. Canals were scanned using an i-CAT CBCT scanner (Imaging Science International, Hatfield, PA) before and after preparation to evaluate the volumetric changes. Root canal transportation and centering ratio were evaluated at 1.3, 2.6, 5.2, and 7.8 mm from the apex. The significance level was set at P ≤ .05. The PT system removed a significantly higher amount of dentin than the other systems (P = .025). At the 1.3-mm level, there was no significant difference in canal transportation and centering ratio among the groups. However, at the other levels, TF maintained the original canal curvature, recording the significantly lowest degree of canal transportation as well as the highest mean centering ratio. The TF system showed superior shaping ability in curved canals. Revo-S and GTX were better than ProTaper regarding both canal transportation and centering ability. Copyright © 2012 American Association of Endodontists. Published by Elsevier Inc. All rights reserved.

  15. Dexamethasone promotes granulocyte mobilization by prolonging the half-life of granulocyte-colony-stimulating factor in healthy donors for granulocyte transfusions.

    PubMed

    Hiemstra, Ida H; van Hamme, John L; Janssen, Machiel H; van den Berg, Timo K; Kuijpers, Taco W

    2017-03-01

    Granulocyte transfusion (GTX) is a potential approach to correcting neutropenia and relieving the increased risk of infection in patients who are refractory to antibiotics. To mobilize enough granulocytes for transfusion, healthy donors are premedicated with granulocyte-colony-stimulating factor (G-CSF) and dexamethasone. Granulocytes have a short circulatory half-life. Consequently, patients need to receive GTX every other day to keep circulating granulocyte counts at an acceptable level. We investigated whether plasma from premedicated donors was capable of prolonging neutrophil survival and, if so, which factor could be held responsible. The effects of plasma from G-CSF/dexamethasone-treated donors on neutrophil survival were assessed by annexin-V, CD16, and CXCR4 staining and nuclear morphology. We isolated an albumin-bound protein using α-chymotrypsin and albumin depletion and further characterized it using protein analysis. The effects of dexamethasone and G-CSF were assessed using mifepristone and a G-CSF-neutralizing antibody. G-CSF plasma concentrations were determined by Western blot and Luminex analyses. G-CSF/dexamethasone plasma contained a survival-promoting factor for at least 2 days. This factor was recognized as an albumin-associated protein and was identified as G-CSF itself, which was surprising considering its reported half-life of only 4.5 hours. Compared with coadministration of dexamethasone, administration of G-CSF alone to the same GTX donors led to a faster decline in circulating G-CSF levels, whereas dexamethasone itself did not induce any G-CSF, demonstrating a role for dexamethasone in increasing G-CSF half-life. Dexamethasone increases granulocyte yield upon coadministration with G-CSF by extending G-CSF half-life. This observation might also be exploited in the coadministration of dexamethasone with other recombinant proteins to modulate their half-life. © 2016 AABB.

  16. Separation of paralytic shellfish poisoning toxins on Chromarods-SIII by thin-layer chromatography with the Iatroscan (mark 5) and flame thermionic detection.

    PubMed

    Indrasena, W M; Ackman, R G; Gill, T A

    1999-09-10

    Thin-layer chromatography (TLC) on Chromarods-SIII with the Iatroscan (Mark-5) and a flame thermionic detector (FTID) was used to develop a rapid method for the detection of paralytic shellfish poisoning (PSP) toxins. The effect of variation in hydrogen (H2) flow, air flow, scan time and detector current on the FTID peak response for both phosphatidylcholine (PC) and PSP was studied in order to define optimum detection conditions. A combination of hydrogen and air flow-rates of 50 ml/min and 1.5-2.0 l/min, respectively, along with a scan time of 40 s/rod and a detector current of 3.0 A (ampere) or above, were found to yield the best results for the detection of PSP compounds. Increasing the detector current level to as high as 3.3 A gave about 130 times more FTID response than did flame ionization detection (FID) for PSP components. Quantities of standards as small as 1 ng neosaxitoxin (NEO), 5 ng saxitoxin (STX), 5 ng B1-toxins (B1), 2 ng gonyautoxin (GTX) 2/3, 6 ng GTX 1/4 and 6 ng C-toxins (C1/C2) could be detected with the FTID. The method detection limits for toxic shellfish tissues using the FTID were 0.4, 2.1, 0.8 and 2.5 micrograms per g tissue for GTX 2/3, STX, NEO and C toxins, respectively. The FTID response increased with increasing detector current and with increasing scan time. Increasing hydrogen and air flow-rates resulted in decreasing sensitivity within defined limits. Numerous solvent systems were tested, and a solvent consisting of chloroform-methanol-water-acetic acid (30:50:8:2) could separate C toxins from GTX, which eluted ahead of NEO and STX. Accordingly, TLC/FTID with the Iatroscan (Mark-5) seems to be a promising, relatively inexpensive and rapid method of screening plant and animal tissues for PSP toxins.

  17. Comparison of the shaping ability of GT® Series X, Twisted Files and AlphaKite rotary nickel-titanium systems in simulated canals

    PubMed Central

    2013-01-01

    Background: Efforts to improve the performance of rotary NiTi instruments by enhancing the properties of the NiTi alloy or their manufacturing processes, rather than changes in instrument geometries, have been reported. The aim of this study was to compare in vitro the shaping ability of three different rotary nickel-titanium instruments produced by different manufacturing methods. Methods: Thirty simulated root canals with a curvature of 35° in resin blocks were prepared with three different rotary NiTi systems: AK - AlphaKite (Gebr. Brasseler, Germany), GTX - GT® Series X (Dentsply, Germany) and TF - Twisted Files (SybronEndo, USA). The canals were prepared according to the manufacturers' instructions. Pre- and post-instrumentation images were recorded and assessment of canal curvature modifications was carried out with an image analysis program (GSA, Germany). The preparation time and incidence of procedural errors were recorded. Instruments were evaluated under a microscope at 15× magnification (Carl Zeiss OPMI Pro Ergo, Germany) for signs of deformation. The data were statistically analyzed using SPSS (Wilcoxon and Mann–Whitney U-tests, at a confidence interval of 95%). Results: Less canal transportation was produced by TF apically, although the difference among the groups was not statistically significant. GTX removed the greatest amount of resin from the middle and coronal parts of the canal and the difference among the groups was statistically significant (p < 0.05). The shortest preparation time was registered with TF (444 s) and the longest with GTX (714 s); the difference among the groups was statistically significant (p < 0.05). During the preparation of the canals no instrument fractured. Eleven TF instruments and one AK instrument were deformed. Conclusion: Under the conditions of this study, all rotary NiTi instruments maintained the working length and prepared a well-shaped root canal. The least canal transportation was produced by AK. GTX

  18. Toxicity and paralytic shellfish toxin profiles of the xanthid crabs, Lophozozymus pictor and Zosimus aeneus, collected from some Australian coral reefs.

    PubMed

    Llewellyn, L E; Endean, R

    1989-01-01

    Purification of toxic aqueous extracts from the xanthid crabs Zosimus aeneus and Lophozozymus pictor, collected from Australian waters, yielded paralytic shellfish toxins, including saxitoxin (STX), neosaxitoxin (neoSTX) and gonyautoxins 1, 2 and 4 (GTX1,2,4). No more than two paralytic shellfish toxins were found in any of the purified extracts from any specimen. Four specimens of Z. aeneus and one specimen of L. pictor each contained more toxic material than the suggested human oral lethal dose. The moult of a specimen of L. pictor was toxic, which may indicate a route in crabs for toxin removal.

  19. Real-time liquid-crystal atmosphere turbulence simulator with graphic processing unit.

    PubMed

    Hu, Lifa; Xuan, Li; Li, Dayu; Cao, Zhaoliang; Mu, Quanquan; Liu, Yonggang; Peng, Zenghui; Lu, Xinghai

    2009-04-27

    To generate time-evolving atmosphere turbulence in real time, a phase-generating method for our liquid-crystal (LC) atmosphere turbulence simulator (ATS) is derived based on the Fourier series (FS) method. A real matrix expression for generating turbulence phases is given and calculated with a graphic processing unit (GPU), the GeForce 8800 Ultra. A liquid crystal on silicon (LCOS) with 256×256 pixels is used as the turbulence simulator. The total time to generate a turbulence phase is about 7.8 ms for calculation and readout with the GPU. A parallel processing method of calculating and sending a picture to the LCOS is used to improve the simulating speed of our LC ATS. Therefore, the real-time turbulence phase-generation frequency of our LC ATS is up to 128 Hz. To our knowledge, it is the highest speed used to generate a turbulence phase in real time.
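    A minimal, hedged sketch of the kind of kernel such a simulator needs, evaluating a phase screen as a finite Fourier series with one pixel per thread; the coefficient layout (amplitudes a, wavevectors kx/ky, phases p) is an assumption for illustration, not taken from the paper:

      // phi(x,y) = sum_k a[k] * cos(kx[k]*x + ky[k]*y + p[k]), one pixel per thread.
      // Launch with a 2-D grid covering the w-by-h screen.
      __global__ void phase_screen(float *phi, int w, int h,
                                   const float *a, const float *kx,
                                   const float *ky, const float *p, int nmodes) {
          int x = blockIdx.x * blockDim.x + threadIdx.x;
          int y = blockIdx.y * blockDim.y + threadIdx.y;
          if (x >= w || y >= h) return;
          float s = 0.0f;
          for (int k = 0; k < nmodes; ++k)
              s += a[k] * __cosf(kx[k] * x + ky[k] * y + p[k]);
          phi[y * w + x] = s;
      }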

  20. Generating Billion-Edge Scale-Free Networks in Seconds: Performance Study of a Novel GPU-based Preferential Attachment Model

    SciTech Connect

    Perumalla, Kalyan S.; Alam, Maksudul

    A novel parallel algorithm is presented for generating random scale-free networks using the preferential-attachment model. The algorithm, named cuPPA, is custom-designed for the single instruction multiple data (SIMD) style of parallel processing supported by modern processors such as graphical processing units (GPUs). To the best of our knowledge, our algorithm is the first to exploit GPUs, and also the fastest implementation available today, to generate scale-free networks using the preferential attachment model. A detailed performance study is presented to understand the scalability and runtime characteristics of the cuPPA algorithm. In one of the best cases, when executed on an NVidia GeForce 1080 GPU, cuPPA generates a scale-free network of a billion edges in less than 2 seconds.
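    A hedged sketch of the copy-model step that underlies GPU preferential attachment: choosing a uniform endpoint of an already existing edge selects a vertex with probability proportional to its degree. cuPPA's actual resolution of dependencies among concurrently generated vertices is omitted here; the edge layout is an assumption for illustration:

      #include <curand_kernel.h>

      // ep: flat endpoint array, edge j is (ep[2j], ep[2j+1]); the first nfixed
      // edges are already decided, and new vertex j adds exactly edge j.
      __global__ void attach(long long *ep, long long nfixed, long long nnew,
                             unsigned long long seed) {
          long long i = blockIdx.x * (long long)blockDim.x + threadIdx.x;
          if (i >= nnew) return;
          curandState s;
          curand_init(seed, i, 0, &s);
          // 64-bit uniform draw over the 2*nfixed endpoints already decided
          unsigned long long r = ((unsigned long long)curand(&s) << 32) | curand(&s);
          long long slot = (long long)(r % (unsigned long long)(2 * nfixed));
          long long j = nfixed + i;     // index of the new edge == id of the new vertex
          ep[2 * j]     = j;            // the newly created vertex
          ep[2 * j + 1] = ep[slot];     // degree-proportional target (copy model)
      }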

  1. Fatigue resistance of engine-driven rotary nickel-titanium instruments produced by new manufacturing methods.

    PubMed

    Gambarini, Gianluca; Grande, Nicola Maria; Plotino, Gianluca; Somma, Francesco; Garala, Manish; De Luca, Massimo; Testarelli, Luca

    2008-08-01

    The aim of the present study was to investigate whether cyclic fatigue resistance is increased for nickel-titanium instruments manufactured by using new processes. This was evaluated by comparing instruments produced by using the twisted method (TF; SybronEndo, Orange, CA) and those using the M-wire alloy (GTX; Dentsply Tulsa-Dental Specialties, Tulsa, OK) with instruments produced by a traditional NiTi grinding process (K3, SybronEndo). Tests were performed with a specific cyclic fatigue device that evaluated cycles to failure of rotary instruments inside curved artificial canals. Results indicated that size 06-25 TF instruments showed a significant increase (p < 0.05) in the mean number of cycles to failure when compared with size 06-25 K3 files. Size 06-20 K3 instruments showed no significant increase (p > 0.05) in the mean number of cycles to failure when compared with size 06-20 GT series X instruments. The new manufacturing process produced nickel-titanium rotary files (TF) significantly more resistant to fatigue than instruments produced with the traditional NiTi grinding process. Instruments produced with M-wire (GTX) were not found to be more resistant to fatigue than instruments produced with the traditional NiTi grinding process.

  2. Ecological and Physiological Studies of Gymnodinium catenatum in the Mexican Pacific: A Review

    PubMed Central

    Band-Schmidt, Christine J.; Bustillos-Guzmán, José J.; López-Cortés, David J.; Gárate-Lizárraga, Ismael; Núñez-Vázquez, Erick J.; Hernández-Sandoval, Francisco E.

    2010-01-01

    This review presents a detailed analysis of the state of knowledge of studies done in Mexico related to the dinoflagellate Gymnodinium catenatum, a paralytic toxin producer. This species was first reported in the Gulf of California in 1939; since then most studies in Mexico have focused on local blooms and seasonal variations. G. catenatum is most abundant during March and April, usually associated with water temperatures between 18 and 25 °C and an increase in nutrients. In vitro studies of G. catenatum strains from different bays along the Pacific coast of Mexico show that this species can grow in wide ranges of salinities, temperatures, and N:P ratios. Latitudinal differences are observed in the toxicity and toxin profile, but dcSTX, dcGTX2-3, C1, and C2 are usual components. A common characteristic of the toxin profile found in shellfish, when G. catenatum is present in the coastal environment, is the detection of dcGTX2-3, dcSTX, C1, and C2. Few bioassay studies have reported effects in mollusks and lethal effects in mice and shrimp; however, no adverse effects have been observed in the copepod Acartia clausi. Interestingly, genetic sequencing of D1-D2 LSU rDNA revealed that it differs only in one base pair, compared with strains from other regions. PMID:20631876

  3. Paralytic shellfish toxins in phytoplankton and shellfish samples collected from the Bohai Sea, China.

    PubMed

    Liu, Yang; Yu, Ren-Cheng; Kong, Fan-Zhou; Chen, Zhen-Fan; Dai, Li; Gao, Yan; Zhang, Qing-Chun; Wang, Yun-Feng; Yan, Tian; Zhou, Ming-Jiang

    2017-02-15

    Phytoplankton and shellfish samples collected periodically from 5 representative mariculture zones around the Bohai Sea, Laishan (LS), Laizhou (LZ), Hangu (HG), Qinhuangdao (QHD) and Huludao (HLD), were analysed for paralytic shellfish toxins (PSTs) using a high-performance liquid chromatography (HPLC) method. Toxins were detected in 13 out of 20 phytoplankton samples, and N-sulfocarbamoyl toxins (C1/2) were the predominant components of PSTs in phytoplankton samples with relatively low toxin content. However, two phytoplankton samples with high PST content collected from QHD and LS had unique toxin profiles characterized by high-potency carbamoyl toxins (GTX1/4) and decarbamoyl toxins (dcGTX2/3 and dcSTX), respectively. PSTs were commonly found in shellfish samples, and toxin content ranged from 0 to 27.6 nmol/g. High levels of PSTs were often found in scallops and clams. Shellfish from QHD in spring, and LZ and LS in autumn, exhibited high risks of PST contamination. Copyright © 2016 Elsevier Ltd. All rights reserved.

  4. Ecological and physiological studies of Gymnodinium catenatum in the Mexican Pacific: a review.

    PubMed

    Band-Schmidt, Christine J; Bustillos-Guzmán, José J; López-Cortés, David J; Gárate-Lizárraga, Ismael; Núñez-Vázquez, Erick J; Hernández-Sandoval, Francisco E

    2010-06-23

    This review presents a detailed analysis of the state of knowledge of studies done in Mexico related to the dinoflagellate Gymnodinium catenatum, a paralytic toxin producer. This species was first reported in the Gulf of California in 1939; since then most studies in Mexico have focused on local blooms and seasonal variations. G. catenatum is most abundant during March and April, usually associated with water temperatures between 18 and 25 °C and an increase in nutrients. In vitro studies of G. catenatum strains from different bays along the Pacific coast of Mexico show that this species can grow in wide ranges of salinities, temperatures, and N:P ratios. Latitudinal differences are observed in the toxicity and toxin profile, but dcSTX, dcGTX2-3, C1, and C2 are usual components. A common characteristic of the toxin profile found in shellfish, when G. catenatum is present in the coastal environment, is the detection of dcGTX2-3, dcSTX, C1, and C2. Few bioassay studies have reported effects in mollusks and lethal effects in mice and shrimp; however, no adverse effects have been observed in the copepod Acartia clausi. Interestingly, genetic sequencing of D1-D2 LSU rDNA revealed that it differs only in one base pair, compared with strains from other regions.

  5. Potentiometric chemical sensors for the detection of paralytic shellfish toxins.

    PubMed

    Ferreira, Nádia S; Cruz, Marco G N; Gomes, Maria Teresa S R; Rudnitskaya, Alisa

    2018-05-01

    Potentiometric chemical sensors for the detection of paralytic shellfish toxins have been developed. Four toxins typically encountered in Portuguese waters, namely saxitoxin, decarbamoyl saxitoxin, gonyautoxin GTX5 and C1&C2, were selected for the study. A series of miniaturized sensors with solid inner contact and plasticized polyvinylchloride membranes containing ionophores, nine compositions in total, were prepared and their characteristics evaluated. Sensors displayed cross-sensitivity to the four studied toxins, i.e. response to several toxins together with low selectivity. High selectivity towards paralytic shellfish toxins was observed in the presence of inorganic cations, with selectivity coefficients ranging from 0.04 to 0.001 for Na⁺ and K⁺ and 3.6×10⁻⁴ to 3.4×10⁻⁵ for Ca²⁺. Detection limits were in the range from 0.25 to 0.9 μmol L⁻¹ for saxitoxin and decarbamoyl saxitoxin, and from 0.08 to 1.8 μmol L⁻¹ for GTX5 and C1&C2, which allows toxin detection at concentration levels corresponding to the legal limits. The characteristics of the developed sensors allow their use in an electronic tongue multisensor system for simultaneous quantification of paralytic shellfish toxins. Copyright © 2018 Elsevier B.V. All rights reserved.

  6. Evaluating Multi-core Architectures through Accelerating the Three-Dimensional Lax–Wendroff Correction

    SciTech Connect

    You, Yang; Fu, Haohuan; Song, Shuaiwen

    2014-07-18

    Wave propagation forward modeling is a widely used computational method in oil and gas exploration. The iterative stencil loops in such problems have broad applications in scientific computing. However, executing such loops can be highly time-consuming, which greatly limits the application's performance and power efficiency. In this paper, we accelerate the forward modeling technique on the latest multi-core and many-core architectures such as Intel Sandy Bridge CPUs, the NVIDIA Fermi C2070 GPU, the NVIDIA Kepler K20x GPU, and the Intel Xeon Phi co-processor. For the GPU platforms, we propose two parallel strategies to explore the performance optimization opportunities for our stencil kernels. For Sandy Bridge CPUs and MIC, we also employ various optimization techniques in order to achieve the best performance.
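    As a hedged illustration of the iterative stencil loops in question, the kernel below performs one second-order 3-D wave update with one output point per thread; the array layout and coefficients are illustrative, not the paper's kernels:

      // next = 2*cur - prev + dt^2 * v^2 * Laplacian(cur), 7-point stencil.
      // Launch with a 3-D grid covering the nx-by-ny-by-nz volume.
      __global__ void wave_step(float *next, const float *cur, const float *prev,
                                const float *vel2, int nx, int ny, int nz, float dt2) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          int j = blockIdx.y * blockDim.y + threadIdx.y;
          int k = blockIdx.z * blockDim.z + threadIdx.z;
          if (i < 1 || j < 1 || k < 1 || i >= nx - 1 || j >= ny - 1 || k >= nz - 1)
              return;                              // skip the boundary layer
          long idx = (long)k * nx * ny + (long)j * nx + i;
          float lap = cur[idx - 1] + cur[idx + 1]
                    + cur[idx - nx] + cur[idx + nx]
                    + cur[idx - (long)nx * ny] + cur[idx + (long)nx * ny]
                    - 6.0f * cur[idx];
          next[idx] = 2.0f * cur[idx] - prev[idx] + dt2 * vel2[idx] * lap;
      }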

  7. Spiking neural networks on high performance computer clusters

    NASA Astrophysics Data System (ADS)

    Chen, Chong; Taha, Tarek M.

    2011-09-01

    In this paper we examine the acceleration of two spiking neural network models on three clusters of multicore processors representing three categories of processors: x86, STI Cell, and NVIDIA GPGPUs. The x86 cluster utilized consists of 352 dual-core AMD Opterons, the Cell cluster consists of 320 Sony Playstation 3s, while the GPGPU cluster contains 32 NVIDIA Tesla S1070 systems. The results indicate that the GPGPU platform can dominate in performance compared to the Cell and x86 platforms examined. From a cost perspective, the GPGPU is more expensive in terms of neuron/s throughput. If the cost of GPGPUs goes down in the future, this platform will become very cost-effective for these models.
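    For a concrete sense of why such models map well to GPGPUs, here is a hedged sketch of a per-timestep leaky integrate-and-fire update, one neuron per thread; it is a generic illustration, not one of the two models benchmarked in the paper:

      // One Euler step of a leaky integrate-and-fire neuron per thread:
      // dv/dt = -v/tau + i_syn; spike and reset when v crosses threshold.
      __global__ void lif_step(float *v, const float *i_syn, unsigned char *spiked,
                               int n, float dt, float tau,
                               float v_thresh, float v_reset) {
          int j = blockIdx.x * blockDim.x + threadIdx.x;
          if (j >= n) return;
          float vj = v[j] + dt * (-v[j] / tau + i_syn[j]);  // leak plus input
          spiked[j] = (vj >= v_thresh);
          v[j] = spiked[j] ? v_reset : vj;
      }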

  8. Application Modernization at LLNL and the Sierra Center of Excellence

    SciTech Connect

    Neely, J. Robert; de Supinski, Bronis R.

    In 2014, Lawrence Livermore National Laboratory began acquisition of Sierra, a pre-exascale system from IBM and Nvidia. It marks a significant shift in direction for LLNL by introducing the concept of heterogeneous computing via GPUs. LLNL's mission requires application teams to prepare for this paradigm shift. Thus, the Sierra procurement required a proposed Center of Excellence that would align the expertise of the chosen vendors with laboratory personnel representing the application developers, system software, and tool providers in a concentrated effort to prepare the laboratory's codes in advance of the system transitioning to production in 2018. This article presents LLNL's overall application strategy, with a focus on how LLNL is collaborating with IBM and Nvidia to ensure a successful transition of its mission-oriented applications into the exascale era.

  9. General purpose graphic processing unit implementation of adaptive pulse compression algorithms

    NASA Astrophysics Data System (ADS)

    Cai, Jingxiao; Zhang, Yan

    2017-07-01

    This study introduces a practical approach to implement real-time signal processing algorithms for general surveillance radar based on NVIDIA graphical processing units (GPUs). The pulse compression algorithms are implemented using compute unified device architecture (CUDA) libraries such as CUDA basic linear algebra subroutines and CUDA fast Fourier transform library, which are adopted from open source libraries and optimized for the NVIDIA GPUs. For more advanced, adaptive processing algorithms such as adaptive pulse compression, customized kernel optimization is needed and investigated. A statistical optimization approach is developed for this purpose without needing much knowledge of the physical configurations of the kernels. It was found that the kernel optimization approach can significantly improve the performance. Benchmark performance is compared with the CPU performance in terms of processing accelerations. The proposed implementation framework can be used in various radar systems including ground-based phased array radar, airborne sense and avoid radar, and aerospace surveillance radar.
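    One simple form of such empirical, configuration-agnostic tuning is to time candidate launch configurations with CUDA events and keep the fastest; the sketch below is a generic illustration (my_kernel is a stand-in workload), not the statistical method developed in this study:

      #include <cuda_runtime.h>

      __global__ void my_kernel(float *data, int n) {    // stand-in workload
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) data[i] = data[i] * 1.0001f + 1.0f;
      }

      // Time each candidate block size once and keep the fastest; a warm-up
      // launch and repeated timings would tighten the estimate.
      int best_block_size(float *d_data, int n) {
          int candidates[] = {64, 128, 256, 512, 1024};
          int best = 256; float best_ms = 1e30f;
          cudaEvent_t t0, t1;
          cudaEventCreate(&t0); cudaEventCreate(&t1);
          for (int b : candidates) {
              int grid = (n + b - 1) / b;
              cudaEventRecord(t0);
              my_kernel<<<grid, b>>>(d_data, n);
              cudaEventRecord(t1);
              cudaEventSynchronize(t1);
              float ms = 0.0f; cudaEventElapsedTime(&ms, t0, t1);
              if (ms < best_ms) { best_ms = ms; best = b; }
          }
          cudaEventDestroy(t0); cudaEventDestroy(t1);
          return best;
      }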

  10. GPU-accelerated simulations of isolated black holes

    NASA Astrophysics Data System (ADS)

    Lewis, Adam G. M.; Pfeiffer, Harald P.

    2018-05-01

    We present a port of the numerical relativity code SpEC which is capable of running on NVIDIA GPUs. Since this code must be maintained in parallel with SpEC itself, a primary design consideration is to perform as few explicit code changes as possible. We therefore rely on a hierarchy of automated porting strategies. At the highest level we use TLoops, a C++ library of our design, to automatically emit CUDA code equivalent to tensorial expressions written into C++ source using a syntax similar to analytic calculation. Next, we trace out and cache explicit matrix representations of the numerous linear transformations in the SpEC code, which allows these to be performed on the GPU using pre-existing matrix-multiplication libraries. We port the few remaining important modules by hand. In this paper we detail the specifics of our port, and present benchmarks of it simulating isolated black hole spacetimes on several generations of NVIDIA GPU.
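    The cached-matrix strategy described above reduces each traced linear transformation to a matrix multiplication; a hedged sketch of that step with cuBLAS (shapes and names illustrative, not SpEC's API):

      // Y (m x n) = M (m x k) * X (k x n), all column-major on the device.
      // M is the cached explicit matrix of a linear transform; X holds the
      // data vectors as columns. The cuBLAS handle is created by the caller.
      #include <cublas_v2.h>

      void apply_cached_transform(cublasHandle_t h, const float *d_M,
                                  const float *d_X, float *d_Y,
                                  int m, int k, int n) {
          const float one = 1.0f, zero = 0.0f;
          cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N,
                      m, n, k, &one, d_M, m, d_X, k, &zero, d_Y, m);
      }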

  11. Application Modernization at LLNL and the Sierra Center of Excellence

    DOE PAGES

    Neely, J. Robert; de Supinski, Bronis R.

    2017-09-01

    In 2014, Lawrence Livermore National Laboratory began acquisition of Sierra, a pre-exascale system from IBM and Nvidia. It marks a significant shift in direction for LLNL by introducing the concept of heterogeneous computing via GPUs. LLNL's mission requires application teams to prepare for this paradigm shift. Thus, the Sierra procurement required a proposed Center of Excellence that would align the expertise of the chosen vendors with laboratory personnel representing the application developers, system software, and tool providers in a concentrated effort to prepare the laboratory's codes in advance of the system transitioning to production in 2018. This article presents LLNL's overall application strategy, with a focus on how LLNL is collaborating with IBM and Nvidia to ensure a successful transition of its mission-oriented applications into the exascale era.

  12. MILC Code Performance on High End CPU and GPU Supercomputer Clusters

    NASA Astrophysics Data System (ADS)

    DeTar, Carleton; Gottlieb, Steven; Li, Ruizi; Toussaint, Doug

    2018-03-01

    With recent developments in parallel supercomputing architecture, many core, multi-core, and GPU processors are now commonplace, resulting in more levels of parallelism, memory hierarchy, and programming complexity. It has been necessary to adapt the MILC code to these new processors starting with NVIDIA GPUs, and more recently, the Intel Xeon Phi processors. We report on our efforts to port and optimize our code for the Intel Knights Landing architecture. We consider performance of the MILC code with MPI and OpenMP, and optimizations with QOPQDP and QPhiX. For the latter approach, we concentrate on the staggered conjugate gradient and gauge force. We also consider performance on recent NVIDIA GPUs using the QUDA library.

  13. The gputools package enables GPU computing in R.

    PubMed

    Buckner, Joshua; Wilson, Justin; Seligman, Mark; Athey, Brian; Watson, Stanley; Meng, Fan

    2010-01-01

    By default, the R statistical environment does not make use of parallelism. Researchers may resort to expensive solutions such as cluster hardware for large analysis tasks. Graphics processing units (GPUs) provide an inexpensive and computationally powerful alternative. Using R and the CUDA toolkit from Nvidia, we have implemented several functions commonly used in microarray gene expression analysis for GPU-equipped computers. R users can take advantage of the better performance provided by an Nvidia GPU. The package is available from CRAN, the R project's repository of packages, at http://cran.r-project.org/web/packages/gputools. More information about our gputools R package is available at http://brainarray.mbni.med.umich.edu/brainarray/Rgpgpu

  14. Accelerating gravitational microlensing simulations using the Xeon Phi coprocessor

    NASA Astrophysics Data System (ADS)

    Chen, B.; Kantowski, R.; Dai, X.; Baron, E.; Van der Mark, P.

    2017-04-01

    Recently Graphics Processing Units (GPUs) have been used to speed up very CPU-intensive gravitational microlensing simulations. In this work, we use the Xeon Phi coprocessor to accelerate such simulations and compare its performance on a microlensing code with that of NVIDIA's GPUs. For the selected set of parameters evaluated in our experiment, we find that the speedup by Intel's Knights Corner coprocessor is comparable to that by NVIDIA's Fermi family of GPUs with compute capability 2.0, but less significant than GPUs with higher compute capabilities such as the Kepler. However, the very recently released second generation Xeon Phi, Knights Landing, is about 5.8 times faster than the Knights Corner, and about 2.9 times faster than the Kepler GPU used in our simulations. We conclude that the Xeon Phi is a very promising alternative to GPUs for modern high performance microlensing simulations.

  15. NDetermin: Inferring Nondeterministic Sequential Specifications for Parallelism Correctness

    DTIC Science & Technology

    2011-12-16

    …Lab affiliates National Instruments, NEC, Nokia, NVIDIA, and Samsung. …concurrently update x, some of these CAS’s will fail and those parallel loop iterations will recompute their updates to x and try again. Consider the parallel…

  16. High-Performance Analysis of Filtered Semantic Graphs

    DTIC Science & Technology

    2012-05-06

    …observation that explains why SEJITS+KDT performance is so close to CombBLAS performance in practice (as shown later in Section 7) even though its in-core… …NEC, Nokia, NVIDIA, Oracle, and Samsung. This research used resources of the National Energy Research Scientific Computing Center, which is…

  17. Examination of Multi-Core Architectures

    DTIC Science & Technology

    2010-11-01

    Interim Technical Report covering February 2010 – July 2010. The report examines multi-core architecture characteristics, including the NVIDIA Tesla, the TILE64 (down to the single-tile level), and the STI Cell Broadband Engine.

  18. Best Performers Announced for the NCI-CPTAC DREAM Proteogenomics Computational Challenge | Office of Cancer Clinical Proteomics Research

    Cancer.gov

    The National Cancer Institute (NCI) Clinical Proteomic Tumor Analysis Consortium (CPTAC) is pleased to announce that teams led by Jaewoo Kang (Korea University), and Yuanfang Guan with Hongyang Li (University of Michigan) as the best performers of the NCI-CPTAC DREAM Proteogenomics Computational Challenge. Over 500 participants from 20 countries registered for the Challenge, which offered $25,000 in cash awards contributed by the NVIDIA Foundation through its Compute the Cure initiative.

  19. Computational Omics Pre-Awardees | Office of Cancer Clinical Proteomics Research

    Cancer.gov

    The National Cancer Institute's Clinical Proteomic Tumor Analysis Consortium (CPTAC) is pleased to announce the pre-awardees of the Computational Omics solicitation. Working with NVIDIA Foundation's Compute the Cure initiative and Leidos Biomedical Research Inc., the NCI, through this solicitation, seeks to leverage computational efforts to provide tools for the mining and interpretation of large-scale publicly available ‘omics’ datasets.

  20. Finite difference numerical method for the superlattice Boltzmann transport equation and case comparison of CPU(C) and GPU(CUDA) implementations

    SciTech Connect

    Priimak, Dmitri

    2014-12-01

    We present a finite difference numerical algorithm for solving two dimensional spatially homogeneous Boltzmann transport equation which describes electron transport in a semiconductor superlattice subject to crossed time dependent electric and constant magnetic fields. The algorithm is implemented both in C language targeted to CPU and in CUDA C language targeted to commodity NVidia GPU. We compare performances and merits of one implementation versus another and discuss various software optimisation techniques.
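
    The CPU-to-GPU port described here follows the usual pattern for explicit finite-difference codes: the nested C loops over grid points become one CUDA thread per point. A generic sketch of such an update for a distribution f on an (x, k) grid is shown below; the stencil and coefficients are illustrative, not the paper's actual scheme:

        // Generic sketch: one CUDA thread per grid point of the distribution
        // f(x, k); periodic in k, clamped in x. Coefficients cx, ck fold in
        // dt/dx and dt/dk. Illustrative only.
        __global__ void fd_step(const double* f, double* fnew,
                                int nx, int nk, double cx, double ck) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;   // spatial index
            int j = blockIdx.y * blockDim.y + threadIdx.y;   // momentum index
            if (i >= nx || j >= nk) return;
            int jp = (j + 1) % nk, jm = (j - 1 + nk) % nk;   // periodic in k
            int ip = min(i + 1, nx - 1), im = max(i - 1, 0); // clamped in x
            fnew[i * nk + j] = f[i * nk + j]
                - cx * (f[ip * nk + j] - f[im * nk + j])
                - ck * (f[i * nk + jp] - f[i * nk + jm]);
        }
        // launch: fd_step<<<dim3((nx+15)/16, (nk+15)/16), dim3(16,16)>>>(...);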

  1. Ultraviolet Communication for Medical Applications

    DTIC Science & Technology

    2014-05-01

    …parent company Imaging Systems Technology (IST) demonstrated feasibility of several key concepts being developed into a working prototype in the… program using multiple high-end GPUs (NVIDIA Tesla K20). Finally, the Monte Carlo simulation task will be resumed after the Milestone 2 demonstration… is acceptable for automated printing and handling. Next, the option of having our shells electroded by an external company was investigated and DEI…

  2. Universal Batch Steganalysis

    DTIC Science & Technology

    2014-06-01

    …in large-scale datasets such as might be obtained by monitoring a corporate network or social network. Identifying guilty actors, rather than payload-carrying objects, is entirely novel in steganalysis. …implementation using Compute Unified Device Architecture (CUDA) on NVIDIA graphics cards. The key to good performance is to combine computations so that…

  3. Universal Batch Steganalysis

    DTIC Science & Technology

    2014-06-30

    …identifying the ‘guilty’ user (of steganalysis) in large-scale datasets such as might be obtained by monitoring a corporate network or social network. …floating point operations (1 TFLOPs) for a 1 megapixel image. We designed a new implementation using Compute Unified Device Architecture (CUDA) on NVIDIA…

  4. Contention Bounds for Combinations of Computation Graphs and Network Topologies

    DTIC Science & Technology

    2014-08-08

    …member of STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA, and ASPIRE Lab industrial sponsors and affiliates Intel, Google, Nokia, NVIDIA, Oracle, MathWorks and Samsung. Also funded by U.S. DOE Office of Science, Office of Advanced Scientific Computing Research… DARPA Award Number HR0011-12-2-0016, and the Center for Future Architecture Research, a member of STARnet…

  5. Using Advanced Computing in Applied Dynamics: From the Dynamics of Granular Material to the Motion of the Mars Rover

    DTIC Science & Technology

    2013-08-26

    USING ADVANCED COMPUTING IN APPLIED DYNAMICS: FROM THE DYNAMICS OF GRANULAR MATERIAL TO THE MOTION OF THE MARS ROVER. Dan Negrut, NVIDIA CUDA… CONTRACT NUMBER W911NF-11-F… University of Parma, Italy • Drs. Paramsothy Jayakumar & David Lamb, US Army TARDEC • Mihai Anitescu, University of Chicago & Argonne National Lab

  6. Performance of parallel computation using CUDA for solving the one-dimensional elasticity equations

    NASA Astrophysics Data System (ADS)

    Darmawan, J. B. B.; Mungkasi, S.

    2017-01-01

    In this paper, we investigate the performance of parallel computation in solving the one-dimensional elasticity equations. Elasticity equations arise frequently in engineering science, and solving them quickly and efficiently is desirable. Therefore, we propose the use of parallel computation. Our parallel computation uses NVIDIA's CUDA. Our research results show that parallel computation using CUDA has a great advantage and is powerful when the computation is of large scale.

  7. An efficient parallel tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia GPU

    SciTech Connect

    Lyakh, Dmitry I.

    An efficient parallel tensor transpose algorithm is suggested for shared-memory computing units, namely, multicore CPU, Intel Xeon Phi, and NVidia GPU. The algorithm operates on dense tensors (multidimensional arrays) and is based on the optimization of cache utilization on x86 CPU and the use of shared memory on NVidia GPU. From the applied side, the ultimate goal is to minimize the overhead encountered in the transformation of tensor contractions into matrix multiplications in computer implementations of advanced methods of quantum many-body theory (e.g., in electronic structure theory and nuclear physics). A particular accent is made on higher-dimensional tensors that typically appear in the so-called multireference correlated methods of electronic structure theory. Depending on tensor dimensionality, the presented optimized algorithms can achieve an order of magnitude speedup on x86 CPUs and 2-3 times speedup on NVidia Tesla K20X GPU with respect to the naïve scattering algorithm (no memory access optimization). Furthermore, the tensor transpose routines developed in this work have been incorporated into a general-purpose tensor algebra library (TAL-SH).
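
    The shared-memory technique mentioned for the GPU side is, at its core, the tiled transpose: a tile is staged through shared memory so that both the global read and the global write are coalesced. A generic two-index sketch follows (not TAL-SH code; higher-dimensional tensor transposes reduce to permuted variants of this pattern):

        // Shared-memory tiled matrix transpose, the standard building block
        // behind GPU tensor transpose routines. The tile is staged through
        // shared memory so both the read and the write are coalesced.
        #define TILE 32
        __global__ void transpose(const double* in, double* out,
                                  int rows, int cols) {
            __shared__ double tile[TILE][TILE + 1];  // +1 avoids bank conflicts
            int c = blockIdx.x * TILE + threadIdx.x;
            int r = blockIdx.y * TILE + threadIdx.y;
            if (r < rows && c < cols)
                tile[threadIdx.y][threadIdx.x] = in[r * cols + c];
            __syncthreads();
            // transposed coordinates for the coalesced write
            int tc = blockIdx.y * TILE + threadIdx.x;
            int tr = blockIdx.x * TILE + threadIdx.y;
            if (tr < cols && tc < rows)
                out[tr * rows + tc] = tile[threadIdx.x][threadIdx.y];
        }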

  8. GPU acceleration of Dock6's Amber scoring computation.

    PubMed

    Yang, Hailong; Zhou, Qiongqiong; Li, Bo; Wang, Yongjian; Luan, Zhongzhi; Qian, Depei; Li, Hanlu

    2010-01-01

    Addressing the problem of virtual screening is a long-term goal in the drug discovery field, which, if properly solved, can significantly shorten new drugs' R&D cycle. The scoring functionality that evaluates the fitness of the docking result is one of the major challenges in virtual screening. In general, scoring functionality in docking requires a large amount of floating-point calculations, which usually takes several weeks or even months to be finished. This time-consuming procedure is unacceptable, especially when highly fatal and infectious viruses such as SARS and H1N1 arise, which forces the scoring task to be done in a limited time. This paper presents how to leverage the computational power of the GPU to accelerate Dock6's (http://dock.compbio.ucsf.edu/DOCK_6/) Amber (J. Comput. Chem. 25: 1157-1174, 2004) scoring with NVIDIA's CUDA (Compute Unified Device Architecture) platform (NVIDIA Corporation Technical Staff, Compute Unified Device Architecture - Programming Guide, NVIDIA Corporation, 2008). We also discuss many factors that will greatly influence the performance after porting the Amber scoring to GPU, including thread management, data transfer, and divergence hiding. Our experiments show that the GPU-accelerated Amber scoring achieves a 6.5× speedup with respect to the original version running on an AMD dual-core CPU for the same problem size. This acceleration makes the Amber scoring more competitive and efficient for large-scale virtual screening problems.

  9. Parallel hyperbolic PDE simulation on clusters: Cell versus GPU

    NASA Astrophysics Data System (ADS)

    Rostrup, Scott; De Sterck, Hans

    2010-12-01

    Increasingly, high-performance computing is looking towards data-parallel computational devices to enhance computational performance. Two technologies that have received significant attention are IBM's Cell Processor and NVIDIA's CUDA programming model for graphics processing unit (GPU) computing. In this paper we investigate the acceleration of parallel hyperbolic partial differential equation simulation on structured grids with explicit time integration on clusters with Cell and GPU backends. The message passing interface (MPI) is used for communication between nodes at the coarsest level of parallelism. Optimizations of the simulation code at the several finer levels of parallelism that the data-parallel devices provide are described in terms of data layout, data flow and data-parallel instructions. Optimized Cell and GPU performance are compared with reference code performance on a single x86 central processing unit (CPU) core in single and double precision. We further compare the CPU, Cell and GPU platforms on a chip-to-chip basis, and compare performance on single cluster nodes with two CPUs, two Cell processors or two GPUs in a shared memory configuration (without MPI). We finally compare performance on clusters with 32 CPUs, 32 Cell processors, and 32 GPUs using MPI. Our GPU cluster results use NVIDIA Tesla GPUs with GT200 architecture, but some preliminary results on recently introduced NVIDIA GPUs with the next-generation Fermi architecture are also included. This paper provides computational scientists and engineers who are considering porting their codes to accelerator environments with insight into how structured grid based explicit algorithms can be optimized for clusters with Cell and GPU accelerators. It also provides insight into the speed-up that may be gained on current and future accelerator architectures for this class of applications. Program summary: Program title: SWsolver; Catalogue identifier: AEGY_v1_0; Program summary URL…

  10. Liquid Chromatography with a Fluorimetric Detection Method for Analysis of Paralytic Shellfish Toxins and Tetrodotoxin Based on a Porous Graphitic Carbon Column

    PubMed Central

    Rey, Veronica; Botana, Ana M.; Alvarez, Mercedes; Antelo, Alvaro; Botana, Luis M.

    2016-01-01

    Paralytic shellfish toxins (PST) traditionally have been analyzed by liquid chromatography with either pre- or post-column derivatization and always with a silica-based stationary phase. This technique resulted in different methods that need more than one run to analyze the toxins. Furthermore, tetrodotoxin (TTX) was recently found in bivalves of northward locations in Europe due to climate change, so it is important to analyze it along with PST because their signs of toxicity are similar in the bioassay. The methods described here detail a new approach to eliminate different runs, by using a new porous graphitic carbon stationary phase. Firstly we describe the separation of 13 PST that belong to different groups, taking into account the side-chains of substituents, in one single run of less than 30 min with good reproducibility. The method was assayed in four shellfish matrices: mussel (Mytilus galloprovincialis), scallop (Pecten maximus), clam (Ruditapes decussatus) and oyster (Ostrea edulis). The results for all of the parameters studied are provided, and the detection limits for the majority of toxins were improved with regard to previous liquid chromatography methods: the lowest values were those for decarbamoyl-gonyautoxin 2 (dcGTX2) and gonyautoxin 2 (GTX2) in mussel (0.0001 mg saxitoxin (STX)·diHCl kg−1 for each toxin), decarbamoyl-saxitoxin (dcSTX) in clam (0.0003 mg STX·diHCl kg−1), N-sulfocarbamoyl-gonyautoxins 2 and 3 (C1 and C2) in scallop (0.0001 mg STX·diHCl kg−1 for each toxin) and dcSTX (0.0003 mg STX·diHCl kg−1) in oyster; gonyautoxin 2 (GTX2) showed the highest limit of detection in oyster (0.0366 mg STX·diHCl kg−1). Secondly, we propose a modification of the method for the simultaneous analysis of PST and TTX, with some minor changes in the solvent gradient, although the detection limit for TTX does not allow its use nowadays for regulatory purposes. PMID:27367728

  11. 26th JANNAF Airbreathing Propulsion Subcommittee Meeting. Volume 1

    NASA Technical Reports Server (NTRS)

    Fry, Ronald S. (Editor); Gannaway, Mary T. (Editor)

    2002-01-01

    This volume, the first of four volumes, is a collection of 28 unclassified/unlimited-distribution papers which were presented at the Joint Army-Navy-NASA-Air Force (JANNAF) 26th Airbreathing Propulsion Subcommittee (APS) meeting, held jointly with the 38th Combustion Subcommittee (CS), 20th Propulsion Systems Hazards Subcommittee (PSHS), and 2nd Modeling and Simulation Subcommittee. The meeting was held 8-12 April 2002 at the Bayside Inn at The Sandestin Golf & Beach Resort and Eglin Air Force Base, Destin, Florida. Topics covered include: scramjet and ramjet R&D program overviews; tactical propulsion; space access; NASA GTX status; PDE technology; actively cooled engine structures; modeling and simulation of complex hydrocarbon fuels and unsteady processes; and component modeling and simulation.

  12. Stochastic first passage time accelerated with CUDA

    NASA Astrophysics Data System (ADS)

    Pierro, Vincenzo; Troiano, Luigi; Mejuto, Elena; Filatrella, Giovanni

    2018-05-01

    The time to pass a threshold, estimated by numerical integration of stochastic trajectories, is an interesting physical quantity, for instance in Josephson junctions and atomic force microscopy, where the full trajectory is not accessible. We propose an algorithm suitable for efficient implementation on graphical processing units in the CUDA environment. For well-balanced loads the proposed approach achieves almost perfect scaling with the number of available threads and processors, and allows an acceleration of about 400× with a GTX 980 GPU with respect to a standard multicore CPU. This method allows off-the-shelf GPUs to tackle problems that are otherwise prohibitive, such as thermal activation in slowly tilted potentials. In particular, we demonstrate that it is possible to simulate the switching current distributions of Josephson junctions on the timescale of actual experiments.
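
    The natural GPU mapping for this kind of problem is one independent trajectory per thread. Below is a minimal sketch under assumed dynamics (overdamped Langevin motion with a positive mean drift so every trajectory eventually crosses; parameter names are illustrative, not the authors' code):

        // One thread per trajectory: integrate an independent Euler-Maruyama
        // path until it crosses the threshold, then record the crossing time.
        #include <curand_kernel.h>

        __global__ void first_passage(double* t_exit, int n, double x0,
                                      double xth, double drift, double noise,
                                      double dt, unsigned long long seed) {
            int tid = blockIdx.x * blockDim.x + threadIdx.x;
            if (tid >= n) return;
            curandState rng;
            curand_init(seed, tid, 0, &rng);         // independent stream per thread
            double x = x0, t = 0.0;
            while (x < xth) {                        // assumes drift > 0 so this ends
                x += drift * dt + noise * sqrt(dt) * curand_normal_double(&rng);
                t += dt;
            }
            t_exit[tid] = t;                         // one passage time per thread
        }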

  13. GPU Computing in Bayesian Inference of Realized Stochastic Volatility Model

    NASA Astrophysics Data System (ADS)

    Takaishi, Tetsuya

    2015-01-01

    The realized stochastic volatility (RSV) model that utilizes the realized volatility as additional information has been proposed to infer volatility of financial time series. We consider the Bayesian inference of the RSV model by the Hybrid Monte Carlo (HMC) algorithm. The HMC algorithm can be parallelized and thus performed on the GPU for speedup. The GPU code is developed with CUDA Fortran. We compare the computational time in performing the HMC algorithm on GPU (GTX 760) and CPU (Intel i7-4770 3.4GHz) and find that the GPU can be up to 17 times faster than the CPU. We also code the program with OpenACC and find that appropriate coding can achieve a similar speedup to that of CUDA Fortran.

  14. JANNAF 25th Airbreathing Propulsion Subcommittee, 37th Combustion Subcommittee and 1st Modeling and Simulation Subcommittee Joint Meeting. Volume 1

    NASA Technical Reports Server (NTRS)

    Fry, Ronald S.; Becker, Dorothy L.

    2000-01-01

    Volume I, the first of three volumes, is a compilation of 24 unclassified/unlimited-distribution technical papers presented at the Joint Army-Navy-NASA-Air Force (JANNAF) 25th Airbreathing Propulsion Subcommittee, 37th Combustion Subcommittee and 1st Modeling and Simulation Subcommittee (MSS) meeting held jointly with the 19th Propulsion Systems Hazards Subcommittee. The meeting was held 13-17 November 2000 at the Naval Postgraduate School and Hyatt Regency Hotel, Monterey, California. Topics covered include: a Keynote Address on Future Combat Systems, a review of the new JANNAF Modeling and Simulation Subcommittee, and technical papers on Hyper-X propulsion development and verification; GTX airbreathing launch vehicles; Hypersonic technology development, including program overviews, fuels for advanced propulsion, ramjet and scramjet research, hypersonic test medium effects; and RBCC engine design and performance, and PDE and UCAV advanced and combined cycle engine technologies.

  15. Three-dimensional photoacoustic tomography based on graphics-processing-unit-accelerated finite element method.

    PubMed

    Peng, Kuan; He, Ling; Zhu, Ziqiang; Tang, Jingtian; Xiao, Jiaying

    2013-12-01

    Compared with commonly used analytical reconstruction methods, the frequency-domain finite element method (FEM) based approach has proven to be an accurate and flexible algorithm for photoacoustic tomography. However, the FEM-based algorithm is computationally demanding, especially for three-dimensional cases. To enhance the algorithm's efficiency, in this work a parallel computational strategy is implemented in the framework of the FEM-based reconstruction algorithm using a graphics-processing-unit parallel framework named the "compute unified device architecture." A series of simulation experiments is carried out to test the accuracy and accelerating effect of the improved method. The results obtained indicate that the parallel calculation does not change the accuracy of the reconstruction algorithm, while its computational cost is significantly reduced, by a factor of 38.9, with a GTX 580 graphics card using the improved method.

  16. Ice crystals classification using airborne measurements in mixing phase

    NASA Astrophysics Data System (ADS)

    Sorin Vajaiac, Nicolae; Boscornea, Andreea

    2017-04-01

    This paper presents a case study of ice crystal classification from airborne measurements in mixed-phase clouds. Ice crystal shadows are recorded with the CIP (Cloud Imaging Probe) component of the CAPS (Cloud, Aerosol, and Precipitation Spectrometer) system. The analyzed flight was performed in the south-western part of Romania (between Pietrosani, Ramnicu Valcea, Craiova and Targu Jiu), with a Beechcraft C90 GTX which was specially equipped with a CAPS system. The temperature during the flight reached its lowest value at -35 °C. These low temperatures allow the formation of ice crystals and influence their form. For the ice crystal classification presented here, special software, OASIS (Optical Array Shadow Imaging Software), developed by DMT (Droplet Measurement Technologies), was used. The obtained results, as expected, are influenced by the atmospheric and microphysical parameters. The recorded particles were classified into four groups: edge, irregular, round and small.

  17. Rocket-Based Combined Cycle Flowpath Testing for Modes 1 and 4

    NASA Technical Reports Server (NTRS)

    Rice, Tharen

    2002-01-01

    Under sponsorship of the NASA Glenn Research Center (NASA GRC), the Johns Hopkins University Applied Physics Laboratory (JHU/APL) designed and built a five-inch diameter, Rocket-Based Combined Cycle (RBCC) engine to investigate mode 1 and mode 4 engine performance as well as Mach 4 inlet performance. This engine was designed so that engine area and length ratios were similar to those of the NASA GRC GTX engine. Unlike the GTX semi-circular engine design, the APL engine is completely axisymmetric. For this design, a traditional rocket thruster was installed inside of the scramjet flowpath, along the engine centerline. A three-part test series was conducted to determine Mode 1 and Mode 4 engine performance. In part one, testing of the rocket thruster alone was accomplished and its performance determined (average Isp efficiency = 90%). In part two, Mode 1 (air-augmented rocket) testing was conducted at a nominal chamber pressure-to-ambient pressure ratio of 100 with the engine inlet fully open. Results showed that there was neither a thrust increment nor decrement over rocket-only thrust during Mode 1 operation. In part three, Mode 4 testing was conducted with chamber pressure-to-ambient pressure ratios lower than desired (80 instead of 600) with the inlet fully closed. Results for this testing showed a performance decrease of 20% as compared to the rocket-only testing. It is felt that these results are directly related to the low pressure ratio tested and not the engine design. During this program, Mach 4 inlet testing was also conducted. For these tests, a moveable centerbody was tested to determine the maximum contraction ratio for the engine design. The experimental results agreed with CFD results conducted by NASA GRC, showing a maximum geometric contraction ratio of approximately 10.5. This report details the hardware design, test setup, experimental results and data analysis associated with the aforementioned tests.

  18. Supersonic Wind Tunnel Tests of a Half-axisymmetric 12 Deg-spike Inlet to a Rocket-based Combined-cycle Propulsion System

    NASA Technical Reports Server (NTRS)

    DeBonis, J. R.; Trefny, C. J.

    2001-01-01

    Results of an isolated inlet test for NASA's GTX air-breathing launch vehicle concept are presented. The GTX is a Vertical Take-off/Horizontal Landing reusable single-stage-to-orbit system powered by a rocket-based combined-cycle propulsion system. Tests were conducted in the NASA Glenn 1- by 1-Foot Supersonic Wind Tunnel during two entries in October 1998 and February 1999. Tests were run from Mach 2.8 to 6. Integrated performance parameters and static pressure distributions are reported. The maximum contraction ratios achieved in the tests were lower than predicted by axisymmetric Reynolds-averaged Navier-Stokes computational fluid dynamics (CFD). At Mach 6, the maximum contraction ratio was roughly one-half of the CFD value of 16. The addition of either boundary-layer trip strips or vortex generators had a negligible effect on the maximum contraction ratio. A shock boundary-layer interaction was also evident on the end-walls that terminate the annular flowpath cross section. Cut-back end-walls, designed to reduce the boundary-layer growth upstream of the shock and minimize the interaction, also had negligible effect on the maximum contraction ratio. Both the excessive turning of low-momentum corner flows and local over-contraction due to asymmetric end-walls were identified as possible reasons for the discrepancy between the CFD predictions and the experiment. It is recommended that the centerbody spike and throat angles be reduced in order to lessen the induced pressure rise. The addition of a step on the cowl surface, and planar end-walls more closely approximating a plane of symmetry, are also recommended. Provisions for end-wall boundary-layer bleed should be incorporated.

  19. Development of an Integrated Nozzle for a Symmetric, RBCC Launch Vehicle Configuration

    NASA Technical Reports Server (NTRS)

    Smith, Timothy D.; Canabal, Francisco, III; Rice, Tharen; Blaha, Bernard

    2000-01-01

    The development of rocket based combined cycle (RBCC) engines is highly dependent upon integrating several different modes of operation into a single system. One of the key components needed to develop acceptable performance levels through each mode of operation is the nozzle. It must be highly integrated to serve the expansion processes of both rocket and air-breathing modes without undue weight, drag, or complexity. The NASA GTX configuration requires a fixed geometry, altitude-compensating nozzle configuration. The initial configuration, used mainly to estimate weight and cooling requirements, was a 15° half-angle cone, which cuts a concave surface from a point within the flowpath to the vehicle trailing edge. Results of 3-D CFD calculations on this geometry are presented. To address the critical issues associated with integrated, fixed geometry, multimode nozzle development, the GTX team has initiated a series of tasks to evolve the nozzle design and validate performance levels. An overview of these tasks is given. The first element is a design activity to develop tools for integration of efficient expansion surfaces with the existing flowpath and vehicle aft-body, and to develop a second-generation nozzle design. A preliminary result using a "streamline-tracing" technique is presented. As the nozzle design evolves, a combination of 3-D CFD analysis and experimental evaluation will be used to validate the design procedure and determine the installed performance for propulsion cycle modeling. The initial experimental effort will consist of cold-flow experiments designed to validate the general trends of the streamline-tracing methodology and anchor the CFD analysis. Experiments will also be conducted to simulate nozzle performance during each mode of operation. As the design matures, hot-fire tests will be conducted to refine performance estimates and anchor more sophisticated reacting-flow analysis.

  20. GPU-based relative fuzzy connectedness image segmentation

    SciTech Connect

    Zhuge Ying; Ciesielski, Krzysztof C.; Udupa, Jayaram K.

    2013-01-15

    Purpose: Recently, clinical radiological research and practice are becoming increasingly quantitative. Further, images continue to increase in size and volume. For quantitative radiology to become practical, it is crucial that image segmentation algorithms and their implementations are rapid and yield practical run time on very large data sets. The purpose of this paper is to present a parallel version of an algorithm that belongs to the family of fuzzy connectedness (FC) algorithms, to achieve an interactive speed for segmenting large medical image data sets. Methods: The most common FC segmentations, optimizing an ℓ∞-based energy, are known as relative fuzzy connectedness (RFC) and iterative relative fuzzy connectedness (IRFC). Both RFC and IRFC objects (of which IRFC contains RFC) can be found via linear time algorithms, linear with respect to the image size. The new algorithm, P-ORFC (for parallel optimal RFC), which is implemented by using NVIDIA's Compute Unified Device Architecture (CUDA) platform, considerably improves the computational speed of the above mentioned CPU based IRFC algorithm. Results: Experiments based on four data sets of small, medium, large, and super data size achieved speedup factors of 32.8×, 22.9×, 20.9×, and 17.5×, correspondingly, on the NVIDIA Tesla C1060 platform. Although the output of P-ORFC need not precisely match that of IRFC output, it is very close to it and, as the authors prove, always lies between the RFC and IRFC objects. Conclusions: A parallel version of a top-of-the-line algorithm in the family of FC has been developed on the NVIDIA GPUs. An interactive speed of segmentation has been achieved, even for the largest medical image data set. Such GPU implementations may play a crucial role in automatic anatomy recognition in clinical radiology.

  1. Fast analytical scatter estimation using graphics processing units.

    PubMed

    Ingleby, Harry; Lippuner, Jonas; Rickey, Daniel W; Li, Yue; Elbakri, Idris

    2015-01-01

    To develop a fast patient-specific analytical estimator of first-order Compton and Rayleigh scatter in cone-beam computed tomography, implemented using graphics processing units. The authors developed an analytical estimator for first-order Compton and Rayleigh scatter in a cone-beam computed tomography geometry. The estimator was coded using NVIDIA's CUDA environment for execution on an NVIDIA graphics processing unit. Performance of the analytical estimator was validated by comparison with high-count Monte Carlo simulations for two different numerical phantoms. Monoenergetic analytical simulations were compared with monoenergetic and polyenergetic Monte Carlo simulations. Analytical and Monte Carlo scatter estimates were compared both qualitatively, from visual inspection of images and profiles, and quantitatively, using a scaled root-mean-square difference metric. Reconstruction of simulated cone-beam projection data of an anthropomorphic breast phantom illustrated the potential of this method as a component of a scatter correction algorithm. The monoenergetic analytical and Monte Carlo scatter estimates showed very good agreement. The monoenergetic analytical estimates showed good agreement for Compton single scatter and reasonable agreement for Rayleigh single scatter when compared with polyenergetic Monte Carlo estimates. For a voxelized phantom with dimensions 128 × 128 × 128 voxels and a detector with 256 × 256 pixels, the analytical estimator required 669 seconds for a single projection, using a single NVIDIA 9800 GX2 video card. Accounting for first-order scatter in cone-beam image reconstruction improves the contrast-to-noise ratio of the reconstructed images. The analytical scatter estimator, implemented using graphics processing units, provides rapid and accurate estimates of single scatter and, with further acceleration and a method to account for multiple scatter, may be useful for practical scatter correction schemes.

  2. GPU-based relative fuzzy connectedness image segmentation.

    PubMed

    Zhuge, Ying; Ciesielski, Krzysztof C; Udupa, Jayaram K; Miller, Robert W

    2013-01-01

    Recently, clinical radiological research and practice are becoming increasingly quantitative. Further, images continue to increase in size and volume. For quantitative radiology to become practical, it is crucial that image segmentation algorithms and their implementations are rapid and yield practical run time on very large data sets. The purpose of this paper is to present a parallel version of an algorithm that belongs to the family of fuzzy connectedness (FC) algorithms, to achieve an interactive speed for segmenting large medical image data sets. The most common FC segmentations, optimizing an ℓ∞-based energy, are known as relative fuzzy connectedness (RFC) and iterative relative fuzzy connectedness (IRFC). Both RFC and IRFC objects (of which IRFC contains RFC) can be found via linear time algorithms, linear with respect to the image size. The new algorithm, P-ORFC (for parallel optimal RFC), which is implemented by using NVIDIA's Compute Unified Device Architecture (CUDA) platform, considerably improves the computational speed of the above mentioned CPU based IRFC algorithm. Experiments based on four data sets of small, medium, large, and super data size, achieved speedup factors of 32.8×, 22.9×, 20.9×, and 17.5×, correspondingly, on the NVIDIA Tesla C1060 platform. Although the output of P-ORFC need not precisely match that of IRFC output, it is very close to it and, as the authors prove, always lies between the RFC and IRFC objects. A parallel version of a top-of-the-line algorithm in the family of FC has been developed on the NVIDIA GPUs. An interactive speed of segmentation has been achieved, even for the largest medical image data set. Such GPU implementations may play a crucial role in automatic anatomy recognition in clinical radiology.

  3. A parallel algorithm for the initial screening of space debris collisions prediction using the SGP4/SDP4 models and GPU acceleration

    NASA Astrophysics Data System (ADS)

    Lin, Mingpei; Xu, Ming; Fu, Xiaoyu

    2017-05-01

    Currently, a tremendous amount of space debris in Earth's orbit imperils operational spacecraft. It is essential to undertake risk assessments of collisions and predict dangerous encounters in space. However, collision predictions for an enormous amount of space debris give rise to large-scale computations. In this paper, a parallel algorithm is established on the Compute Unified Device Architecture (CUDA) platform of NVIDIA Corporation for collision prediction. According to the parallel structure of NVIDIA graphics processors, a block decomposition strategy is adopted in the algorithm. Space debris is divided into batches, and the computation and data transfer operations of adjacent batches overlap. As a consequence, the latency to access shared memory during the entire computing process is significantly reduced, and a higher computing speed is reached. Theoretically, a simulation of collision prediction for space debris of any amount and for any time span can be executed. To verify this algorithm, a simulation example including 1382 pieces of debris, whose operational time scales vary from 1 min to 3 days, is conducted on Tesla C2075 of NVIDIA. The simulation results demonstrate that with the same computational accuracy as that of a CPU, the computing speed of the parallel algorithm on a GPU is 30 times that on a CPU. Based on this algorithm, collision prediction of over 150 Chinese spacecraft for a time span of 3 days can be completed in less than 3 h on a single computer, which meets the timeliness requirement of the initial screening task. Furthermore, the algorithm can be adapted for multiple tasks, including particle filtration, constellation design, and Monte-Carlo simulation of an orbital computation.
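
    The batch-overlap idea described above is typically realized with CUDA streams: each batch's transfers and kernel are issued on its own stream, so copies for one batch overlap computation on another. An illustrative sketch follows (the trivial propagate kernel stands in for an SGP4-style step, and the host array is assumed to be pinned):

        // Illustrative sketch of overlapping transfers and computation with
        // CUDA streams; not the paper's code.
        #include <cuda_runtime.h>

        __global__ void propagate(double* state, int n) {
            int k = blockIdx.x * blockDim.x + threadIdx.x;
            if (k < n) state[k] += 1.0;          // stand-in for orbit propagation
        }

        void process_batches(double* h_state /* pinned host memory */,
                             int n_total, int batch) {
            const int NSTREAMS = 4;
            cudaStream_t s[NSTREAMS];
            double* d_state[NSTREAMS];
            for (int i = 0; i < NSTREAMS; ++i) {
                cudaStreamCreate(&s[i]);
                cudaMalloc(&d_state[i], sizeof(double) * batch);
            }
            for (int off = 0, i = 0; off < n_total;
                 off += batch, i = (i + 1) % NSTREAMS) {
                int n = (n_total - off < batch) ? n_total - off : batch;
                // copy in, propagate, copy out -- all asynchronous on stream i
                cudaMemcpyAsync(d_state[i], h_state + off, sizeof(double) * n,
                                cudaMemcpyHostToDevice, s[i]);
                propagate<<<(n + 255) / 256, 256, 0, s[i]>>>(d_state[i], n);
                cudaMemcpyAsync(h_state + off, d_state[i], sizeof(double) * n,
                                cudaMemcpyDeviceToHost, s[i]);
            }
            cudaDeviceSynchronize();             // wait for all batches
            for (int i = 0; i < NSTREAMS; ++i) {
                cudaStreamDestroy(s[i]);
                cudaFree(d_state[i]);
            }
        }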

  4. Parallel fuzzy connected image segmentation on GPU.

    PubMed

    Zhuge, Ying; Cao, Yong; Udupa, Jayaram K; Miller, Robert W

    2011-07-01

    Image segmentation techniques using fuzzy connectedness (FC) principles have shown their effectiveness in segmenting a variety of objects in several large applications. However, one challenge in these algorithms has been their excessive computational requirements when processing large image datasets. Nowadays, commodity graphics hardware provides a highly parallel computing environment. In this paper, the authors present a parallel fuzzy connected image segmentation algorithm implementation on NVIDIA's Compute Unified Device Architecture (CUDA) platform for segmenting medical image data sets. In the FC algorithm, there are two major computational tasks: (i) computing the fuzzy affinity relations and (ii) computing the fuzzy connectedness relations. These two tasks are implemented as CUDA kernels and executed on GPU. A dramatic improvement in speed for both tasks is achieved as a result. Our experiments based on three data sets of small, medium, and large data size demonstrate the efficiency of the parallel algorithm, which achieves a speed-up factor of 24.4x, 18.1x, and 10.3x, correspondingly, for the three data sets on the NVIDIA Tesla C1060 over the implementation of the algorithm on CPU, and takes 0.25, 0.72, and 15.04 s, correspondingly, for the three data sets. The authors developed a parallel algorithm of the widely used fuzzy connected image segmentation method on the NVIDIA GPUs, which are far more cost- and speed-effective than both cluster of workstations and multiprocessing systems. A near-interactive speed of segmentation has been achieved, even for the large data set.
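
    Of the two computational tasks named above, the affinity computation is the more straightforwardly data-parallel: every voxel can evaluate its affinity to a neighbor independently. A sketch of one such kernel follows, using a Gaussian of the intensity difference as a stand-in affinity function (the authors' exact affinity functions are not reproduced here):

        // Task (i) sketch: per-voxel fuzzy affinity to the +x neighbor,
        // one thread per voxel; illustrative affinity function.
        __global__ void affinity_x(const float* img, float* aff,
                                   int nx, int ny, int nz, float sigma) {
            int idx = blockIdx.x * blockDim.x + threadIdx.x;
            int n = nx * ny * nz;
            if (idx >= n) return;
            int x = idx % nx;
            float a = 0.0f;
            if (x + 1 < nx) {                    // affinity to the +x neighbor
                float d = img[idx + 1] - img[idx];
                a = expf(-(d * d) / (2.0f * sigma * sigma));
            }
            aff[idx] = a;                        // one face of the adjacency
        }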

  5. Communication Avoiding Rank Revealing QR Factorization with Column Pivoting

    DTIC Science & Technology

    2013-05-03

    …ParLab affiliates National Instruments, Nokia, NVIDIA, Oracle, and Samsung, and support from MathWorks. We also acknowledge the support of the US… bounds from equation (1.3). In practice the QR factorization with column pivoting often works well, and it is widely used even if it is known to fail, for…

  6. Kalman filter tracking on parallel architectures

    NASA Astrophysics Data System (ADS)

    Cerati, G.; Elmer, P.; Krutelyov, S.; Lantz, S.; Lefebvre, M.; McDermott, K.; Riley, D.; Tadel, M.; Wittich, P.; Wurthwein, F.; Yagil, A.

    2017-10-01

    We report on the progress of our studies towards a Kalman filter track reconstruction algorithm with optimal performance on manycore architectures. The combinatorial structure of these algorithms is not immediately compatible with an efficient SIMD (or SIMT) implementation; the challenge for us is to recast the existing software so it can readily generate hundreds of shared-memory threads that exploit the underlying instruction set of modern processors. We show how the data and associated tasks can be organized in a way that is conducive to both multithreading and vectorization. We demonstrate very good performance on Intel Xeon and Xeon Phi architectures, as well as promising first results on Nvidia GPUs.
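
    The data reorganization alluded to here usually means a structure-of-arrays layout, so that consecutive threads (or SIMD lanes) touch consecutive memory. A small illustrative sketch, with invented field names rather than the project's actual track data structures:

        // Structure-of-arrays layout: each track parameter is a separate
        // contiguous array, so thread i reads element i with unit stride.
        struct TracksSoA {
            float *x, *y, *z;        // positions, one array per parameter
            float *px, *py, *pz;     // momenta
            int n;
        };

        __global__ void propagate_tracks(TracksSoA t, float dt) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i >= t.n) return;
            // coalesced: consecutive threads access consecutive elements
            t.x[i] += t.px[i] * dt;
            t.y[i] += t.py[i] * dt;
            t.z[i] += t.pz[i] * dt;
        }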

  7. Analysis and optimization of gyrokinetic toroidal simulations on homogenous and heterogenous platforms

    DOE PAGES

    Ibrahim, Khaled Z.; Madduri, Kamesh; Williams, Samuel; ...

    2013-07-18

    The Gyrokinetic Toroidal Code (GTC) uses the particle-in-cell method to efficiently simulate plasma microturbulence. This paper presents novel analysis and optimization techniques to enhance the performance of GTC on large-scale machines. We introduce cell access analysis to better manage locality vs. synchronization tradeoffs on CPU and GPU-based architectures. Our optimized hybrid parallel implementation of GTC uses MPI, OpenMP, and NVIDIA CUDA; it achieves up to a 2× speedup over the reference Fortran version on multiple parallel systems and scales efficiently to tens of thousands of cores.

  8. Lattice QCD based on OpenCL

    NASA Astrophysics Data System (ADS)

    Bach, Matthias; Lindenstruth, Volker; Philipsen, Owe; Pinke, Christopher

    2013-09-01

    We present an OpenCL-based Lattice QCD application using a heatbath algorithm for the pure gauge case and Wilson fermions in the twisted mass formulation. The implementation is platform independent and can be used on AMD or NVIDIA GPUs, as well as on classical CPUs. On the AMD Radeon HD 5870 our double precision Dslash (D̸) implementation performs at 60 GFLOPS over a wide range of lattice sizes. The hybrid Monte Carlo presented reaches a speedup of four over the reference code running on a server CPU.

  9. Accelerating Advanced MRI Reconstructions on GPUs.

    PubMed

    Stone, S S; Haldar, J P; Tsao, S C; Hwu, W-M W; Sutton, B P; Liang, Z-P

    2008-10-01

    Computational acceleration on graphics processing units (GPUs) can make advanced magnetic resonance imaging (MRI) reconstruction algorithms attractive in clinical settings, thereby improving the quality of MR images across a broad spectrum of applications. This paper describes the acceleration of such an algorithm on NVIDIA's Quadro FX 5600. The reconstruction of a 3D image with 128³ voxels achieves up to 180 GFLOPS and requires just over one minute on the Quadro, while reconstruction on a quad-core CPU is twenty-one times slower. Furthermore, relative to the true image, the error exhibited by the advanced reconstruction is only 12%, while conventional reconstruction techniques incur error of 42%.

  10. Implementation of Headtracking and 3D Stereo with Unity and VRPN for Computer Simulations

    NASA Technical Reports Server (NTRS)

    Noyes, Matthew A.

    2013-01-01

    This paper explores low-cost hardware and software methods to provide depth cues traditionally absent in monocular displays. The use of a VRPN server in conjunction with a Microsoft Kinect and/or Nintendo Wiimote to provide head tracking information to a Unity application, and NVIDIA 3D Vision for retinal disparity support, is discussed. Methods are suggested to implement this technology with NASA's EDGE simulation graphics package, along with potential caveats. Finally, future applications of this technology to astronaut crew training, particularly when combined with an omnidirectional treadmill for virtual locomotion and NASA's ARGOS system for reduced gravity simulation, are discussed.

  11. Cretin Memory Flow on Sierra

    SciTech Connect

    Langer, S. H.; Scott, H. A.

    2016-08-05

    The Cretin iCOE project has a goal of enabling the efficient generation of Non-LTE opacities for use in radiation-hydrodynamic simulation codes using the Nvidia boards on LLNL’s upcoming Sierra system. Achieving the desired level of accuracy for some simulations requires the use of a very large number of atomic configurations (a configuration includes the atomic level for all electrons and how they are coupled together). The NLTE rate matrix needs to be solved separately in each zone. Calculating NLTE opacities can consume more time than all other physics packages used in a simulation.

  12. Graphics processing unit based computation for NDE applications

    NASA Astrophysics Data System (ADS)

    Nahas, C. A.; Rajagopal, Prabhu; Balasubramaniam, Krishnan; Krishnamurthy, C. V.

    2012-05-01

    Advances in parallel processing in recent years are helping to improve the cost of numerical simulation. Breakthroughs in Graphical Processing Unit (GPU) based computation now offer the prospect of further drastic improvements. The introduction of 'compute unified device architecture' (CUDA) by NVIDIA (the global technology company based in Santa Clara, California, USA) has made programming GPUs for general purpose computing accessible to the average programmer. Here we use CUDA to develop parallel finite difference schemes as applicable to two problems of interest to the NDE community, namely heat diffusion and elastic wave propagation. The implementations are two-dimensional. Performance improvement of the GPU implementation against serial CPU implementation is then discussed.

  13. FANTOM: Algorithm-Architecture Codesign for High-Performance Embedded Signal and Image Processing Systems

    DTIC Science & Technology

    2013-05-25

    …graphics processors by IBM, AMD, and nVIDIA. They are between general-purpose processors and special-purpose processors. In Phase II. 3.10 Measure of… particular, Dr. Kevin Irick started a company, Silicon Scapes, and he has been the CEO. 5 Implications for Related/Future Research: We speculate that… final project report in Jan. 2011. At the test and validation stage of the project, FANTOM’s partner at Raytheon quit from his company and hence from…

  14. PC Scene Generation

    NASA Astrophysics Data System (ADS)

    Buford, James A., Jr.; Cosby, David; Bunfield, Dennis H.; Mayhall, Anthony J.; Trimble, Darian E.

    2007-04-01

    AMRDEC has successfully tested hardware and software for Real-Time Scene Generation for IR and SAL Sensors on COTS PC-based hardware and video cards. AMRDEC personnel worked with nVidia and Concurrent Computer Corporation to develop a Scene Generation system capable of frame rates of at least 120 Hz while frame-locked to an external source (such as a missile seeker) with no dropped frames. Latency measurements and image validation were performed using COTS and in-house developed hardware and software. Software for the Scene Generation system was developed using OpenSceneGraph.

  15. Improving Quantum Gate Simulation using a GPU

    NASA Astrophysics Data System (ADS)

    Gutierrez, Eladio; Romero, Sergio; Trenas, Maria A.; Zapata, Emilio L.

    2008-11-01

    Due to the increasing computing power of graphics processing units (GPUs), they are becoming more and more popular for solving general purpose algorithms. As the simulation of quantum computers results in a problem with exponential complexity, it is advisable to perform a parallel computation, such as the one provided by the SIMD multiprocessors present in recent GPUs. In this paper, we focus on an important quantum algorithm, the quantum Fourier transform (QFT), in order to evaluate different parallelization strategies on a novel GPU architecture. Our implementation makes use of the new CUDA software/hardware architecture developed recently by NVIDIA.
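
    A state-vector simulator parallelizes naturally because a single-qubit gate only mixes amplitude pairs whose indices differ in one bit, and every pair can be processed by its own thread. A minimal sketch of that core kernel follows (illustrative; the 2×2 gate entries would be a Hadamard or a QFT phase rotation):

        // One thread per amplitude pair: indices i0 and i1 differ only in bit q.
        #include <cuComplex.h>

        __global__ void apply_gate(cuDoubleComplex* amp, int nqubits, int q,
                                   cuDoubleComplex g00, cuDoubleComplex g01,
                                   cuDoubleComplex g10, cuDoubleComplex g11) {
            long long pairs = 1LL << (nqubits - 1);
            long long t = blockIdx.x * (long long)blockDim.x + threadIdx.x;
            if (t >= pairs) return;
            long long mask = (1LL << q) - 1;
            long long i0 = ((t & ~mask) << 1) | (t & mask);  // bit q cleared
            long long i1 = i0 | (1LL << q);                  // bit q set
            cuDoubleComplex a = amp[i0], b = amp[i1];
            amp[i0] = cuCadd(cuCmul(g00, a), cuCmul(g01, b));
            amp[i1] = cuCadd(cuCmul(g10, a), cuCmul(g11, b));
        }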

  16. Advanced computer graphic techniques for laser range finder (LRF) simulation

    NASA Astrophysics Data System (ADS)

    Bedkowski, Janusz; Jankowski, Stanislaw

    2008-11-01

    This paper shows advanced computer graphics techniques for laser range finder (LRF) simulation. The LRF is a common sensor for unmanned ground vehicles, autonomous mobile robots and security applications. The cost of the measurement system is extremely high, therefore the simulation tool is designed. The simulation gives an opportunity to execute algorithms such as obstacle avoidance [1], SLAM for robot localization [2], detection of vegetation and water obstacles in the surroundings of the robot chassis [3], and LRF measurement in a crowd of people [1]. The Axis Aligned Bounding Box (AABB) technique and an alternative technique based on CUDA (NVIDIA Compute Unified Device Architecture) are presented.

  17. Three-dimensional scene reconstruction from a two-dimensional image

    NASA Astrophysics Data System (ADS)

    Parkins, Franz; Jacobs, Eddie

    2017-05-01

    We propose and simulate a method of reconstructing a three-dimensional scene from a two-dimensional image for developing and augmenting world models for autonomous navigation. This is an extension of the Perspective-n-Point (PnP) method, which uses a sampling of 3D scene point, 2D image point pairings and Random Sampling Consensus (RANSAC) to infer the pose of the object and produce a 3D mesh of the original scene. Using object recognition and segmentation, we simulate the implementation on a scene of 3D objects with an eye to implementation on embeddable hardware. The final solution will be deployed on the NVIDIA Tegra platform.

  18. Charge order-superfluidity transition in a two-dimensional system of hard-core bosons and emerging domain structures

    NASA Astrophysics Data System (ADS)

    Moskvin, A. S.; Panov, Yu. D.; Rybakov, F. N.; Borisov, A. B.

    2017-11-01

    We have used high-performance parallel computations on NVIDIA graphics cards, applying the method of nonlinear conjugate gradients and the Monte Carlo method, to directly observe the developing ground state configuration of a two-dimensional hard-core boson system with decreasing temperature, and its evolution with deviation from half-filling. This has allowed us to explore unconventional features of the charge order-superfluidity phase transition, specifically, the formation of an irregular domain structure, the emergence of a filamentary superfluid structure that condenses within the antiphase domain boundaries of the charge-ordered phase, and the formation and evolution of various topological structures.

  19. HASEonGPU-An adaptive, load-balanced MPI/GPU-code for calculating the amplified spontaneous emission in high power laser media

    NASA Astrophysics Data System (ADS)

    Eckert, C. H. J.; Zenker, E.; Bussmann, M.; Albach, D.

    2016-10-01

    We present an adaptive Monte Carlo algorithm for computing the amplified spontaneous emission (ASE) flux in laser gain media pumped by pulsed lasers. With the design of high power lasers in mind, which require large size gain media, we have developed the open source code HASEonGPU that is capable of utilizing multiple graphics processing units (GPUs). With HASEonGPU, time to solution is reduced to minutes on a medium size GPU cluster of 64 NVIDIA Tesla K20m GPUs, and excellent speedup is achieved when scaling to multiple GPUs. Comparison of simulation results to measurements of ASE in Yb3+:YAG ceramics shows perfect agreement.

  20. Construction of the Fock Matrix on a Grid-Based Molecular Orbital Basis Using GPGPUs.

    PubMed

    Losilla, Sergio A; Watson, Mark A; Aspuru-Guzik, Alán; Sundholm, Dage

    2015-05-12

    We present a GPGPU implementation of the construction of the Fock matrix in the molecular orbital basis using the fully numerical, grid-based bubbles representation. For a test set of molecules containing up to 90 electrons, the total Hartree-Fock energies obtained from reference GTO-based calculations are reproduced within 10⁻⁴ Eh to 10⁻⁸ Eh for most of the molecules studied. Despite the very large number of arithmetic operations involved, the high performance obtained made the calculations possible on a single Nvidia Tesla K40 GPGPU card.

  1. CELES: CUDA-accelerated simulation of electromagnetic scattering by large ensembles of spheres

    NASA Astrophysics Data System (ADS)

    Egel, Amos; Pattelli, Lorenzo; Mazzamuto, Giacomo; Wiersma, Diederik S.; Lemmer, Uli

    2017-09-01

    CELES is a freely available MATLAB toolbox to simulate light scattering by many spherical particles. Aiming at high computational performance, CELES leverages block-diagonal preconditioning, a lookup-table approach to evaluate costly functions and massively parallel execution on NVIDIA graphics processing units using the CUDA computing platform. The combination of these techniques makes it possible to efficiently address large electrodynamic problems (>10⁴ scatterers) on inexpensive consumer hardware. In this paper, we validate near- and far-field distributions against the well-established multi-sphere T-matrix (MSTM) code and discuss the convergence behavior for ensembles of different sizes, including an exemplary system comprising 10⁵ particles.
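
    The lookup-table approach mentioned above replaces repeated evaluation of a costly function with interpolation into a precomputed table. A minimal CUDA sketch of the idea follows (CELES itself is MATLAB-based; the table contents and spacing here are illustrative stand-ins for the tabulated translation-operator terms):

        // Tabulate a costly radial function once on a uniform grid; each
        // thread then evaluates it by linear interpolation. Illustrative only.
        __global__ void eval_from_table(const float* table, int npts, float dr,
                                        const float* r, float* out, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i >= n) return;
            float u = r[i] / dr;                 // fractional table position
            int k = min((int)u, npts - 2);
            float f = u - k;
            out[i] = (1.0f - f) * table[k] + f * table[k + 1];  // linear interp
        }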

  2. Parallel Implementation of Numerical Solution of Few-Body Problem Using Feynman's Continual Integrals

    NASA Astrophysics Data System (ADS)

    Naumenko, Mikhail; Samarin, Viacheslav

    2018-02-01

    A modern parallel computing algorithm has been applied to the solution of the few-body problem. The approach is based on Feynman's continual integrals method implemented in the C++ programming language using NVIDIA CUDA technology. A wide range of 3-body and 4-body bound systems has been considered, including nuclei described as consisting of protons and neutrons (e.g., ³,⁴He) and nuclei described as consisting of clusters and nucleons (e.g., ⁶He). The correctness of the results was checked by comparison with the exactly solvable 4-body oscillatory system and experimental data.

  3. Spectral-element simulation of two-dimensional elastic wave propagation in fully heterogeneous media on a GPU cluster

    NASA Astrophysics Data System (ADS)

    Rudianto, Indra; Sudarmaji

    2018-04-01

    We present an implementation of the spectral-element method for simulation of two-dimensional elastic wave propagation in fully heterogeneous media. We have incorporated most realistic geological features in the model, including surface topography, curved layer interfaces, and 2-D wave-speed heterogeneity. To accommodate such complexity, we use an unstructured quadrilateral meshing technique. Simulation was performed on a GPU cluster consisting of 24 Intel Xeon CPU cores and 4 NVIDIA Quadro graphics cards, using a CUDA and MPI implementation. We speed up the computation by a factor of about 5 compared to MPI only, and by a factor of about 40 compared to the serial implementation.

  4. End-to-end plasma bubble PIC simulations on GPUs

    NASA Astrophysics Data System (ADS)

    Germaschewski, Kai; Fox, William; Matteucci, Jackson; Bhattacharjee, Amitava

    2017-10-01

    Accelerator technologies play a crucial role in eventually achieving exascale computing capabilities. The current and upcoming leadership machines at ORNL (Titan and Summit) employ Nvidia GPUs, which provide vast computational power but also need specifically adapted computational kernels to fully exploit them. In this work, we will show end-to-end particle-in-cell simulations of the formation, evolution and coalescence of laser-generated plasma bubbles. This work showcases the GPU capabilities of the PSC particle-in-cell code, which has been adapted for this problem to support particle injection, a heating operator and a collision operator on GPUs.

  5. Aspects of GPU performance in algorithms with random memory access

    NASA Astrophysics Data System (ADS)

    Kashkovsky, Alexander V.; Shershnev, Anton A.; Vashchenkov, Pavel V.

    2017-10-01

    The numerical code for solving the Boltzmann equation on a hybrid computational cluster using the Direct Simulation Monte Carlo (DSMC) method showed that on Tesla K40 accelerators computational performance drops dramatically with an increase in the percentage of occupied GPU memory. Testing revealed that memory access time increases tens of times after a certain critical percentage of memory is occupied. Moreover, this seems to be a common problem of all NVidia GPUs, arising from their architecture. A few modifications of the numerical algorithm were suggested to overcome this problem. One of them, based on splitting the memory into "virtual" blocks, resulted in a 2.5 times speed up.
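
    The effect described here can be isolated with a simple gather microbenchmark: the same kernel run with an identity index array (coalesced) versus a shuffled one (random access) exposes the memory-access penalty. Illustrative sketch:

        // Gather kernel: the contents of idx decide the access pattern.
        // With idx[i] == i the reads are coalesced; with shuffled idx they
        // are effectively random.
        __global__ void gather(const float* src, const int* idx,
                               float* dst, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) dst[i] = src[idx[i]];
        }

    Timing the two variants at increasing array sizes would reproduce the qualitative behavior reported above, though the critical-occupancy threshold is device-specific.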

  6. The approximation of anomalous magnetic field by array of magnetized rods

    NASA Astrophysics Data System (ADS)

    Denis, Byzov; Lev, Muravyev; Natalia, Fedorova

    2017-07-01

    A method for calculating the vertical component of an anomalous magnetic field from its absolute value is presented. The conversion is based on the approximation of magnetic induction modulus anomalies by a set of singular sources and the subsequent calculation of the vertical component of the field with the chosen distribution. Rods that are uniformly magnetized along their axis were used as the set of singular sources. An applicability analysis of different methods of nonlinear optimization for solving the given task was carried out. The algorithm is implemented using parallel computing technology on NVidia GPUs. The approximation and the calculation of the vertical component are demonstrated for the regional magnetic field of North Eurasian territories.

  7. Parallelized computation for computer simulation of electrocardiograms using personal computers with multi-core CPU and general-purpose GPU.

    PubMed

    Shen, Wenfeng; Wei, Daming; Xu, Weimin; Zhu, Xin; Yuan, Shizhong

    2010-10-01

    Biological computations like electrocardiological modelling and simulation usually require high-performance computing environments. This paper introduces an implementation of parallel computation for computer simulation of electrocardiograms (ECGs) in a personal computer environment with an Intel CPU of Core (TM) 2 Quad Q6600 and a GPU of Geforce 8800GT, with software support by OpenMP and CUDA. It was tested in three parallelization device setups: (a) a four-core CPU without a general-purpose GPU, (b) a general-purpose GPU plus 1 core of CPU, and (c) a four-core CPU plus a general-purpose GPU. To effectively take advantage of a multi-core CPU and a general-purpose GPU, an algorithm based on load-prediction dynamic scheduling was developed and applied to setting (c). In the simulation with 1600 time steps, the speedup of the parallel computation as compared to the serial computation was 3.9 in setting (a), 16.8 in setting (b), and 20.0 in setting (c). This study demonstrates that a current PC with a multi-core CPU and a general-purpose GPU provides a good environment for parallel computations in biological modelling and simulation studies. Copyright 2010 Elsevier Ireland Ltd. All rights reserved.
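
    The paper's scheduling code is not published in the record; the sketch below shows one plausible reading of load-prediction dynamic scheduling, with assumed names: each time step the cells are split between the GPU and OpenMP CPU threads according to a ratio predicted from the throughputs measured on the previous step.

    ```cuda
    // Hedged sketch of load-prediction dynamic scheduling: split each step's
    // work between GPU and CPU, then update the split ratio from measured
    // throughput. The cell-update kernels are placeholders.
    #include <omp.h>
    #include <cuda_runtime.h>

    __global__ void gpu_update(float* cells, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) cells[i] += 1.0f;               // stand-in cell update
    }
    static void cpu_update(float* cells, int n) {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i) cells[i] += 1.0f;
    }

    void step(float* h_cells, float* d_cells, int n, double* gpu_share) {
        int n_gpu = (int)(n * *gpu_share);         // predicted GPU portion

        cudaEvent_t a, b;
        cudaEventCreate(&a); cudaEventCreate(&b);
        cudaEventRecord(a);
        if (n_gpu > 0)
            gpu_update<<<(n_gpu + 255) / 256, 256>>>(d_cells, n_gpu);
        cudaEventRecord(b);

        double t0 = omp_get_wtime();
        cpu_update(h_cells + n_gpu, n - n_gpu);    // CPU runs concurrently
        double t_cpu = omp_get_wtime() - t0;

        cudaEventSynchronize(b);
        float ms = 1.0f;
        cudaEventElapsedTime(&ms, a, b);
        double thr_gpu = n_gpu / (ms / 1e3 + 1e-9);    // cells per second
        double thr_cpu = (n - n_gpu) / (t_cpu + 1e-9);
        *gpu_share = thr_gpu / (thr_gpu + thr_cpu);    // prediction for next step
        cudaEventDestroy(a); cudaEventDestroy(b);
    }
    ```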

  8. Optimizing ion channel models using a parallel genetic algorithm on graphical processors.

    PubMed

    Ben-Shalom, Roy; Aviv, Amit; Razon, Benjamin; Korngreen, Alon

    2012-01-01

    We have recently shown that we can semi-automatically constrain models of voltage-gated ion channels by combining a stochastic search algorithm with ionic currents measured using multiple voltage-clamp protocols. Although numerically successful, this approach is highly demanding computationally, with optimization on a high performance Linux cluster typically lasting several days. To solve this computational bottleneck we converted our optimization algorithm for work on a graphical processing unit (GPU) using NVIDIA's CUDA. Parallelizing the process on a Fermi graphic computing engine from NVIDIA increased the speed ∼180 times over an application running on an 80 node Linux cluster, considerably reducing simulation times. This application allows users to optimize models for ion channel kinetics on a single, inexpensive, desktop "super computer," greatly reducing the time and cost of building models relevant to neuronal physiology. We also demonstrate that the point of algorithm parallelization is crucial to its performance. We substantially reduced computing time by solving the ODEs (Ordinary Differential Equations) so as to massively reduce memory transfers to and from the GPU. This approach may be applied to speed up other data intensive applications requiring iterative solutions of ODEs. Copyright © 2012 Elsevier B.V. All rights reserved.
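
    The parallelization point discussed above can be illustrated with a hedged sketch (not the authors' kinetic scheme): each thread integrates a toy one-gate channel ODE for one candidate parameter set entirely on the GPU and writes back a single fitness value, so per-generation host-device traffic is parameters in, scores out.

    ```cuda
    // Illustrative GA fitness kernel: one thread = one candidate parameter
    // set; the whole ODE integration stays on the device.
    #include <cuda_runtime.h>

    __global__ void evaluate_population(const float2* params,  // (alpha, beta)
                                        const float* target,   // recorded trace
                                        float* fitness,
                                        int n_steps, float dt, int pop) {
        int k = blockIdx.x * blockDim.x + threadIdx.x;
        if (k >= pop) return;
        float alpha = params[k].x, beta = params[k].y;
        float gate = 0.0f, err = 0.0f;
        for (int s = 0; s < n_steps; ++s) {
            // forward-Euler step of dg/dt = alpha*(1 - g) - beta*g
            gate += dt * (alpha * (1.0f - gate) - beta * gate);
            float diff = gate - target[s];
            err += diff * diff;              // accumulate squared error
        }
        fitness[k] = err;                    // one float back per candidate
    }
    ```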

  9. Hardware and Software Design of FPGA-based PCIe Gen3 interface for APEnet+ network interconnect system

    NASA Astrophysics Data System (ADS)

    Ammendola, R.; Biagioni, A.; Frezza, O.; Lo Cicero, F.; Lonardo, A.; Martinelli, M.; Paolucci, P. S.; Pastorelli, E.; Rossetti, D.; Simula, F.; Tosoratto, L.; Vicini, P.

    2015-12-01

    In the attempt to develop an interconnection architecture optimized for hybrid HPC systems dedicated to scientific computing, we designed APEnet+, a point-to-point, low-latency and high-performance network controller supporting 6 fully bidirectional off-board links over a 3D torus topology. The first release of APEnet+ (named V4) was a board based on a 40 nm Altera FPGA, integrating 6 channels at 34 Gbps of raw bandwidth per direction and a PCIe Gen2 x8 host interface. It has been the first-of-its-kind device to implement an RDMA protocol to directly read/write data from/to Fermi and Kepler NVIDIA GPUs using NVIDIA peer-to-peer and GPUDirect RDMA protocols, obtaining real zero-copy GPU-to-GPU transfers over the network. The latest generation of APEnet+ systems (now named V5) implements a PCIe Gen3 x8 host interface on a 28 nm Altera Stratix V FPGA, with multi-standard fast transceivers (up to 14.4 Gbps) and an increased amount of configurable internal resources and hardware IP cores to support main interconnection standard protocols. Herein we present the APEnet+ V5 architecture, the status of its hardware and its system software design. Both its Linux Device Driver and the low-level libraries have been redeveloped to support the PCIe Gen3 protocol, introducing optimizations and solutions based on hardware/software co-design.

  10. SU(2) lattice gauge theory simulations on Fermi GPUs

    SciTech Connect

    Cardoso, Nuno, E-mail: nunocardoso@cftp.ist.utl.p; Bicudo, Pedro, E-mail: bicudo@ist.utl.p

    2011-05-10

    In this work we explore the performance of CUDA in quenched lattice SU(2) simulations. CUDA, NVIDIA Compute Unified Device Architecture, is a hardware and software architecture developed by NVIDIA for computing on the GPU. We present an analysis and performance comparison between the GPU and CPU in single and double precision. Analyses with multiple GPUs and two different architectures (G200 and Fermi architectures) are also presented. In order to obtain a high performance, the code must be optimized for the GPU architecture, i.e., an implementation that exploits the memory hierarchy of the CUDA programming model. We produce codes for the Monte Carlo generation of SU(2) lattice gauge configurations, for the mean plaquette, for the Polyakov Loop at finite T and for the Wilson loop. We also present results for the potential using many configurations (50,000) without smearing and almost 2000 configurations with APE smearing. With two Fermi GPUs we have achieved an excellent performance, a 200x speedup over one CPU in single precision, around 110 Gflops/s. We also find that, using the Fermi architecture, double precision computations for the static quark-antiquark potential are not much slower (less than 2x slower) than single precision computations.

  11. SU(2) lattice gauge theory simulations on Fermi GPUs

    NASA Astrophysics Data System (ADS)

    Cardoso, Nuno; Bicudo, Pedro

    2011-05-01

    In this work we explore the performance of CUDA in quenched lattice SU(2) simulations. CUDA, NVIDIA Compute Unified Device Architecture, is a hardware and software architecture developed by NVIDIA for computing on the GPU. We present an analysis and performance comparison between the GPU and CPU in single and double precision. Analyses with multiple GPUs and two different architectures (G200 and Fermi architectures) are also presented. In order to obtain a high performance, the code must be optimized for the GPU architecture, i.e., an implementation that exploits the memory hierarchy of the CUDA programming model. We produce codes for the Monte Carlo generation of SU(2) lattice gauge configurations, for the mean plaquette, for the Polyakov Loop at finite T and for the Wilson loop. We also present results for the potential using many configurations (50,000) without smearing and almost 2000 configurations with APE smearing. With two Fermi GPUs we have achieved an excellent performance, a 200× speedup over one CPU in single precision, around 110 Gflops/s. We also find that, using the Fermi architecture, double precision computations for the static quark-antiquark potential are not much slower (less than 2× slower) than single precision computations.
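
    Neither paper's code appears in the record; as a hedged illustration of the arithmetic core of such SU(2) codes, the sketch below stores a group element as four reals, U = a0*I + i(a.sigma), so multiplication is quaternion-like and the plaquette observable is a trace. The link-array layout and indexing are assumptions, and periodic boundary wrapping is omitted.

    ```cuda
    // Hedged SU(2) sketch: element stored as float4 (a0, a1, a2, a3) with
    // U = a0*I + i*(a.sigma); the plaquette trace is 2*a0 of the product.
    #include <cuda_runtime.h>

    __device__ float4 su2_mul(float4 a, float4 b) {
        return make_float4(
            a.x*b.x - a.y*b.y - a.z*b.z - a.w*b.w,       // scalar part
            a.x*b.y + b.x*a.y - (a.z*b.w - a.w*b.z),     // vector part =
            a.x*b.z + b.x*a.z - (a.w*b.y - a.y*b.w),     //   a0*b + b0*a
            a.x*b.w + b.x*a.w - (a.y*b.z - a.z*b.y));    //   - a x b
    }
    __device__ float4 su2_dag(float4 a) {                // U^dagger
        return make_float4(a.x, -a.y, -a.z, -a.w);
    }

    // Re Tr of the mu-nu plaquette at each site; links[4*x + d] = U_d(x).
    // Assumed layout; periodic wrapping at the boundary is not handled.
    __global__ void plaquette(const float4* links, float* tr, int vol,
                              int mu, int nu, int stride_mu, int stride_nu) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        if (x >= vol) return;
        float4 u1 = links[4 * x + mu];                   // U_mu(x)
        float4 u2 = links[4 * (x + stride_mu) + nu];     // U_nu(x+mu)
        float4 u3 = links[4 * (x + stride_nu) + mu];     // U_mu(x+nu)
        float4 u4 = links[4 * x + nu];                   // U_nu(x)
        float4 p = su2_mul(su2_mul(u1, u2),
                           su2_mul(su2_dag(u3), su2_dag(u4)));
        tr[x] = 2.0f * p.x;                              // Re Tr(plaquette)
    }
    ```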

  12. Fast parallel tandem mass spectral library searching using GPU hardware acceleration

    PubMed Central

    Baumgardner, Lydia Ashleigh; Shanmugam, Avinash Kumar; Lam, Henry; Eng, Jimmy K.; Martin, Daniel B.

    2011-01-01

    Mass spectrometry-based proteomics is a maturing discipline of biologic research that is experiencing substantial growth. Instrumentation has steadily improved over time with the advent of faster and more sensitive instruments collecting ever larger data files. Consequently, the computational process of matching a peptide fragmentation pattern to its sequence, traditionally accomplished by sequence database searching and more recently also by spectral library searching, has become a bottleneck in many mass spectrometry experiments. In both of these methods, the main rate limiting step is the comparison of an acquired spectrum with all potential matches from a spectral library or sequence database. This is a highly parallelizable process because the core computational element can be represented as a simple but arithmetically intense multiplication of two vectors. In this paper we present a proof of concept project taking advantage of the massively parallel computing available on graphics processing units (GPUs) to distribute and accelerate the process of spectral assignment using spectral library searching. This program, which we have named FastPaSS (for Fast Parallelized Spectral Searching) is implemented in CUDA (Compute Unified Device Architecture) from NVIDIA which allows direct access to the processors in an NVIDIA GPU. Our efforts demonstrate the feasibility of GPU computing for spectral assignment, through implementation of the validated spectral searching algorithm SpectraST in the CUDA environment. PMID:21545112
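
    The arithmetically intense core named above, the dot product of an acquired spectrum against every library candidate, maps naturally onto one thread block per library spectrum. The sketch below is a generic version of that kernel, not FastPaSS itself; spectrum binning and SpectraST's scoring details are omitted and all names are illustrative.

    ```cuda
    // Generic sketch of the core comparison: dot product of one binned query
    // spectrum against many library spectra, one block per library entry,
    // with a shared-memory tree reduction.
    #include <cuda_runtime.h>

    __global__ void spectral_dots(const float* query,     // nbins floats
                                  const float* library,   // nlib x nbins
                                  float* score, int nbins) {
        extern __shared__ float part[];
        const float* entry = library + (size_t)blockIdx.x * nbins;
        float acc = 0.0f;
        for (int i = threadIdx.x; i < nbins; i += blockDim.x)
            acc += query[i] * entry[i];                   // arithmetic core
        part[threadIdx.x] = acc;
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {    // tree reduction
            if (threadIdx.x < s) part[threadIdx.x] += part[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0) score[blockIdx.x] = part[0];
    }
    // Launch (power-of-two block size required by the reduction):
    // spectral_dots<<<nlib, 256, 256 * sizeof(float)>>>(d_q, d_lib, d_s, nbins);
    ```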

  13. GPUmotif: An Ultra-Fast and Energy-Efficient Motif Analysis Program Using Graphics Processing Units

    PubMed Central

    Zandevakili, Pooya; Hu, Ming; Qin, Zhaohui

    2012-01-01

    Computational detection of TF binding patterns has become an indispensable tool in functional genomics research. With the rapid advance of new sequencing technologies, large amounts of protein-DNA interaction data have been produced. Analyzing this data can provide substantial insight into the mechanisms of transcriptional regulation. However, the massive amount of sequence data presents daunting challenges. In our previous work, we have developed a novel algorithm called Hybrid Motif Sampler (HMS) that enables more scalable and accurate motif analysis. Despite much improvement, HMS is still time-consuming due to the requirement to calculate matching probabilities position-by-position. Using the NVIDIA CUDA toolkit, we developed a graphics processing unit (GPU)-accelerated motif analysis program named GPUmotif. We proposed a “fragmentation" technique to hide data transfer time between memories. Performance comparison studies showed that commonly-used model-based motif scan and de novo motif finding procedures such as HMS can be dramatically accelerated when running GPUmotif on NVIDIA graphics cards. As a result, energy consumption can also be greatly reduced when running motif analysis using GPUmotif. The GPUmotif program is freely available at http://sourceforge.net/projects/gpumotif/ PMID:22662128
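
    GPUmotif's own source is linked above; the sketch below only illustrates the general "fragmentation" pattern it describes, with hypothetical names: the input is split into fragments copied on alternating CUDA streams, so each transfer overlaps the previous fragment's scan kernel.

    ```cuda
    // Generic fragmentation sketch: ping-pong device buffers on two CUDA
    // streams so data transfer hides behind compute.
    #include <cuda_runtime.h>

    __global__ void scan_fragment(const char* seq, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = (float)seq[i];   // stand-in for PWM scoring
    }

    void scan_all(const char* h_seq, float* d_out, size_t total, size_t frag) {
        char* d_seq[2];
        cudaStream_t st[2];
        for (int b = 0; b < 2; ++b) {
            cudaMalloc(&d_seq[b], frag);
            cudaStreamCreate(&st[b]);
        }
        for (size_t off = 0, k = 0; off < total; off += frag, ++k) {
            int b = (int)(k & 1);                        // alternate buffers
            size_t n = (off + frag <= total) ? frag : total - off;
            cudaMemcpyAsync(d_seq[b], h_seq + off, n,
                            cudaMemcpyHostToDevice, st[b]);
            scan_fragment<<<(int)((n + 255) / 256), 256, 0, st[b]>>>(
                d_seq[b], d_out + off, (int)n);
        }
        for (int b = 0; b < 2; ++b) cudaStreamSynchronize(st[b]);
        // h_seq must be pinned (cudaHostAlloc) for the copies to overlap.
    }
    ```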

  14. Pushing Memory Bandwidth Limitations Through Efficient Implementations of Block-Krylov Space Solvers on GPUs

    SciTech Connect

    Clark, M. A.; Strelchenko, Alexei; Vaquero, Alejandro

    Lattice quantum chromodynamics simulations in nuclear physics have benefited from a tremendous number of algorithmic advances such as multigrid and eigenvector deflation. These improve the time to solution but do not alleviate the intrinsic memory-bandwidth constraints of the matrix-vector operation dominating iterative solvers. Batching this operation for multiple vectors and exploiting cache and register blocking can yield a super-linear speed up. Block-Krylov solvers can naturally take advantage of such batched matrix-vector operations, further reducing the iterations to solution by sharing the Krylov space between solves. However, practical implementations typically suffer from the quadratic scaling in the number of vector-vector operations. Using the QUDA library, we present an implementation of a block-CG solver on NVIDIA GPUs which reduces the memory-bandwidth complexity of vector-vector operations from quadratic to linear. We present results for the HISQ discretization, showing a 5x speedup compared to highly-optimized independent Krylov solves on NVIDIA's SaturnV cluster.
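
    The QUDA implementation is far more elaborate; as a hedged sketch of the quadratic-to-linear idea, the kernel below forms all M×M inner products of a block of vectors (the Gram matrix used in block-CG) in one pass that reads each vector element exactly once, instead of M×M separate dot products that each re-read the data.

    ```cuda
    // Hedged sketch: fused block Gram matrix with linear, not quadratic,
    // memory traffic in the block size M (a small compile-time constant).
    #include <cuda_runtime.h>

    template <int M>
    __global__ void block_gram(const float* __restrict__ v,  // M vectors, len n
                               float* gram, int n) {          // M*M, zeroed
        float acc[M][M] = {};                                 // per-thread partials
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x) {
            float x[M];
            for (int j = 0; j < M; ++j) x[j] = v[j * n + i];  // one read each
            for (int j = 0; j < M; ++j)
                for (int k = 0; k < M; ++k) acc[j][k] += x[j] * x[k];
        }
        for (int j = 0; j < M; ++j)
            for (int k = 0; k < M; ++k)
                atomicAdd(&gram[j * M + k], acc[j][k]);       // combine blocks
    }
    // e.g. block_gram<4><<<256, 256>>>(d_v, d_gram, n);  // d_gram zeroed first
    ```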

  15. OpenACC acceleration of an unstructured CFD solver based on a reconstructed discontinuous Galerkin method for compressible flows

    DOE PAGES

    Xia, Yidong; Lou, Jialin; Luo, Hong; ...

    2015-02-09

    Here, an OpenACC directive-based graphics processing unit (GPU) parallel scheme is presented for solving the compressible Navier–Stokes equations on 3D hybrid unstructured grids with a third-order reconstructed discontinuous Galerkin method. The developed scheme requires the minimum code intrusion and algorithm alteration for upgrading a legacy solver with the GPU computing capability at very little extra effort in programming, which leads to a unified and portable code development strategy. A face coloring algorithm is adopted to eliminate the memory contention because of the threading of internal and boundary face integrals. A number of flow problems are presented to verify the implementation of the developed scheme. Timing measurements were obtained by running the resulting GPU code on one Nvidia Tesla K20c GPU card (Nvidia Corporation, Santa Clara, CA, USA) and compared with those obtained by running the equivalent Message Passing Interface (MPI) parallel CPU code on a compute node (consisting of two AMD Opteron 6128 eight-core CPUs (Advanced Micro Devices, Inc., Sunnyvale, CA, USA)). Speedup factors of up to 24× and 1.6× for the GPU code were achieved with respect to one and 16 CPU cores, respectively. The numerical results indicate that this OpenACC-based parallel scheme is an effective and extensible approach to port unstructured high-order CFD solvers to GPU computing.

  16. SciTech Connect

    Allada, Veerendra; Benjegerdes, Troy; Bode, Brett

    Commodity clusters augmented with application accelerators are evolving as competitive high performance computing systems. The Graphical Processing Unit (GPU), with a very high arithmetic density and performance per price ratio, is a good platform for scientific application acceleration. In addition to the interconnect bottlenecks among the cluster compute nodes, the cost of memory copies between the host and the GPU device has to be carefully amortized to improve the overall efficiency of the application. Scientific applications also rely on efficient implementation of the BAsic Linear Algebra Subroutines (BLAS), among which the General Matrix Multiply (GEMM) is considered the workhorse subroutine. In this paper, we study the performance of the memory copies and GEMM subroutines that are critical to port computational chemistry algorithms to GPU clusters. To that end, a benchmark based on the NetPIPE framework is developed to evaluate the latency and bandwidth of the memory copies between the host and the GPU device. The performance of the single and double precision GEMM subroutines from the NVIDIA CUBLAS 2.0 library is studied. The results have been compared with those of the BLAS routines from the Intel Math Kernel Library (MKL) to understand the computational trade-offs. The test bed is an Intel Xeon cluster equipped with NVIDIA Tesla GPUs.
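
    A condensed, hedged version of the two measurements described is sketched below, written against the modern cublas_v2 API (the paper used CUBLAS 2.0, whose call signatures differ): time a host-to-device copy, then time an SGEMM.

    ```cuda
    // Sketch of the two benchmarks: H2D copy bandwidth and SGEMM throughput.
    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    int main() {
        const int n = 4096;
        size_t bytes = (size_t)n * n * sizeof(float);
        float *h = (float*)malloc(bytes), *dA, *dB, *dC;
        cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);

        cudaEvent_t t0, t1;
        cudaEventCreate(&t0); cudaEventCreate(&t1);

        cudaEventRecord(t0);
        cudaMemcpy(dA, h, bytes, cudaMemcpyHostToDevice);     // copy benchmark
        cudaEventRecord(t1); cudaEventSynchronize(t1);
        float ms;
        cudaEventElapsedTime(&ms, t0, t1);
        printf("H2D: %.2f GB/s\n", bytes / ms / 1e6);

        cublasHandle_t hdl;
        cublasCreate(&hdl);
        float one = 1.0f, zero = 0.0f;
        cudaEventRecord(t0);
        cublasSgemm(hdl, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &one, dA, n, dB, n, &zero, dC, n);        // C = A*B
        cudaEventRecord(t1); cudaEventSynchronize(t1);
        cudaEventElapsedTime(&ms, t0, t1);
        printf("SGEMM: %.1f GFLOP/s\n", 2.0 * n * n * (double)n / ms / 1e6);

        cublasDestroy(hdl);
        return 0;
    }
    ```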

  17. A comparative study of history-based versus vectorized Monte Carlo methods in the GPU/CUDA environment for a simple neutron eigenvalue problem

    NASA Astrophysics Data System (ADS)

    Liu, Tianyu; Du, Xining; Ji, Wei; Xu, X. George; Brown, Forrest B.

    2014-06-01

    For nuclear reactor analysis such as neutron eigenvalue calculations, time-consuming Monte Carlo (MC) simulations can be accelerated by using graphics processing units (GPUs). However, traditional MC methods are often history-based, and their performance on GPUs is affected significantly by the thread-divergence problem. In this paper we describe the development of a newly designed event-based vectorized MC algorithm for solving the neutron eigenvalue problem. The code was implemented using NVIDIA's Compute Unified Device Architecture (CUDA), and tested on an NVIDIA Tesla M2090 GPU card. We found that although the vectorized MC algorithm greatly reduces the occurrence of thread divergence, thus enhancing the warp execution efficiency, the overall simulation speed is roughly ten times slower than the history-based MC code on GPUs. Profiling results suggest that the slow speed is probably due to the memory access latency caused by the large amount of global memory transactions. Possible solutions to improve the code efficiency are discussed.
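
    Neither code is reproduced in the record; the fragments below only illustrate the structural difference at issue: a history-based kernel whose per-particle while-loop diverges within a warp, versus event-based kernels that each apply a single event type to a compacted index list. Types and operations are placeholders.

    ```cuda
    // Schematic contrast of history-based vs event-based MC on the GPU.
    #include <cuda_runtime.h>

    struct Neutron { float3 x, dir; float energy; int alive; };

    // History-based: one thread follows one neutron; branches inside the
    // loop (scatter / capture / fission) diverge within a warp.
    __global__ void history_kernel(Neutron* p, int n /*, xs tables ... */) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        while (p[i].alive) {
            // sample flight distance, move, then branch on collision type
            p[i].alive = 0;                 // placeholder termination
        }
    }

    // Event-based: particles are bucketed by next event beforehand, so each
    // launch below executes a single branch-free operation per warp.
    __global__ void scatter_event(Neutron* p, const int* idx, int m) {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t >= m) return;
        Neutron& q = p[idx[t]];
        q.dir = make_float3(-q.dir.x, -q.dir.y, -q.dir.z);  // placeholder scatter
    }
    ```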

  18. Accelerating Multiple Compound Comparison Using LINGO-Based Load-Balancing Strategies on Multi-GPUs

    PubMed Central

    Lin, Chun-Yuan; Wang, Chung-Hung; Hung, Che-Lun; Lin, Yu-Shiang

    2015-01-01

    Compound comparison is an important task in computational chemistry. From the comparison results, potential inhibitors can be found and then used in pharmaceutical experiments. The time complexity of a pairwise compound comparison is O(n^2), where n is the maximal length of the compounds. In general, the length of compounds is tens to hundreds, and the computation time is small. However, more and more compounds have been synthesized and extracted, now numbering more than tens of millions. Therefore, it is still time-consuming to compare a large number of compounds (seen as a multiple compound comparison problem, abbreviated to MCC). The intrinsic time complexity of the MCC problem is O(k^2 n^2) with k compounds of maximal length n. In this paper, we propose a GPU-based algorithm for the MCC problem, called CUDA-MCC, on single and multiple GPUs. Four LINGO-based load-balancing strategies are considered in CUDA-MCC in order to accelerate the computation among thread blocks on GPUs. CUDA-MCC was implemented in C+OpenMP+CUDA. In our experiments, CUDA-MCC ran 45 times and 391 times faster than its CPU version on a single NVIDIA Tesla K20m GPU card and a dual-NVIDIA Tesla K20m GPU card, respectively. PMID:26491652

  19. Wavelet-based multicomponent denoising on GPU to improve the classification of hyperspectral images

    NASA Astrophysics Data System (ADS)

    Quesada-Barriuso, Pablo; Heras, Dora B.; Argüello, Francisco; Mouriño, J. C.

    2017-10-01

    Supervised classification allows handling a wide range of remote sensing hyperspectral applications. Enhancing the spatial organization of the pixels over the image has proven to be beneficial for the interpretation of the image content, thus increasing the classification accuracy. Denoising in the spatial domain of the image has been shown to be a technique that enhances the structures in the image. This paper proposes a multi-component denoising approach in order to increase the classification accuracy when a classification method is applied. It is computed on multicore CPUs and NVIDIA GPUs. The method combines feature extraction based on a 1D discrete wavelet transform (DWT) applied in the spectral dimension, followed by an Extended Morphological Profile (EMP) and a classifier (SVM or ELM). The multi-component noise reduction is applied to the EMP just before classification. The denoising recursively applies a separable 2D DWT, after which the number of wavelet coefficients is reduced by thresholding. Finally, inverse 2D DWT filters are applied to reconstruct the noise-free original component. The computational cost of the classifiers, as well as that of the whole classification chain, is high, but it is reduced to real-time behavior for some applications through computation on NVIDIA multi-GPU platforms.

  20. CUDA-based acceleration of collateral filtering in brain MR images

    NASA Astrophysics Data System (ADS)

    Li, Cheng-Yuan; Chang, Herng-Hua

    2017-02-01

    Image denoising is one of the fundamental and essential tasks within image processing. In medical imaging, finding an effective algorithm that can remove random noise in MR images is important. This paper proposes an effective noise reduction method for brain magnetic resonance (MR) images. Our approach is based on the collateral filter, which is a more powerful method than the bilateral filter in many cases. However, the computation of the collateral filter algorithm is quite time-consuming. To solve this problem, we improved the collateral filter algorithm with parallel computing on the GPU. We adopted CUDA, an application programming interface for the GPU by NVIDIA, to accelerate the computation. Our experimental evaluation on an Intel Xeon E5-2620 v3 CPU at 2.40 GHz with an NVIDIA Tesla K40c GPU indicated that the proposed implementation runs dramatically faster than the traditional collateral filter. We believe that the proposed framework has established a general blueprint for achieving fast and robust filtering in a wide variety of medical image denoising applications.
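
    The collateral filter itself is not specified in the record; as a hedged sketch of the filtering structure being accelerated, the kernel below implements the closely related bilateral filter on a 2D slice, one output pixel per thread, with spatial and range Gaussian weights. Names and parameters are illustrative.

    ```cuda
    // Generic bilateral-filter kernel: weight = spatial Gaussian * range
    // Gaussian; the collateral filter of the paper extends this scheme.
    #include <cuda_runtime.h>
    #include <math.h>

    __global__ void bilateral2d(const float* in, float* out, int w, int h,
                                int radius, float sigma_s, float sigma_r) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= w || y >= h) return;
        float center = in[y * w + x], num = 0.f, den = 0.f;
        for (int dy = -radius; dy <= radius; ++dy)
            for (int dx = -radius; dx <= radius; ++dx) {
                int xx = min(max(x + dx, 0), w - 1);     // clamp at borders
                int yy = min(max(y + dy, 0), h - 1);
                float v = in[yy * w + xx];
                float ws = expf(-(dx*dx + dy*dy) / (2.f * sigma_s * sigma_s));
                float wr = expf(-(v - center) * (v - center)
                                / (2.f * sigma_r * sigma_r));
                num += ws * wr * v;                      // weighted sum
                den += ws * wr;                          // normalization
            }
        out[y * w + x] = num / den;
    }
    ```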

  1. Multi-GPU Accelerated Admittance Method for High-Resolution Human Exposure Evaluation.

    PubMed

    Xiong, Zubiao; Feng, Shi; Kautz, Richard; Chandra, Sandeep; Altunyurt, Nevin; Chen, Ji

    2015-12-01

    A multi-graphics processing unit (GPU) accelerated admittance method solver is presented for solving the induced electric field in high-resolution anatomical models of the human body when exposed to external low-frequency magnetic fields. In the solver, the anatomical model is discretized as a three-dimensional network of admittances. The conjugate orthogonal conjugate gradient (COCG) iterative algorithm is employed to take advantage of the symmetric property of the complex-valued linear system of equations. Compared against the widely used biconjugate gradient stabilized method, the COCG algorithm can reduce the solving time by 3.5 times and reduce the storage requirement by about 40%. The iterative algorithm is then accelerated further by using multiple NVIDIA GPUs. The computations and data transfers between GPUs are overlapped in time by using an asynchronous concurrent execution design. The communication overhead is well hidden so that the acceleration is nearly linear with the number of GPU cards. Numerical examples show that our GPU implementation running on four NVIDIA Tesla K20c cards is up to 90 times faster than the CPU implementation running on eight CPU cores (two Intel Xeon E5-2603 processors). The implemented solver is able to solve large-dimensional problems efficiently. A whole adult body discretized at 1-mm resolution can be solved in just several minutes. The high efficiency achieved makes it practical to investigate human exposure involving a large number of cases with a high resolution that meets the requirements of international dosimetry guidelines.
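
    A hedged sketch of the asynchronous concurrent execution pattern described: each GPU updates its sub-domain on its own stream while a boundary strip moves to the neighboring card with a peer-to-peer async copy, so communication hides behind compute. The kernel, sizes, and one-directional exchange are placeholders, not the authors' solver.

    ```cuda
    // Overlapping compute and inter-GPU transfers with per-device streams.
    #include <cuda_runtime.h>

    __global__ void interior_update(float* v, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) v[i] *= 0.5f;               // stand-in for the SpMV step
    }

    void iterate(float** d_v, int ngpu, int n, int halo, cudaStream_t* st) {
        for (int g = 0; g < ngpu; ++g) {
            cudaSetDevice(g);
            interior_update<<<(n + 255) / 256, 256, 0, st[g]>>>(d_v[g], n);
            if (g + 1 < ngpu)                  // ship top strip to next card
                cudaMemcpyPeerAsync(d_v[g + 1], g + 1,       // dst, dst device
                                    d_v[g] + n - halo, g,    // src, src device
                                    halo * sizeof(float), st[g]);
        }
        for (int g = 0; g < ngpu; ++g) {       // wait for all streams
            cudaSetDevice(g);
            cudaStreamSynchronize(st[g]);
        }
        // Peer access must be enabled beforehand via cudaDeviceEnablePeerAccess.
    }
    ```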

  2. Accelerating Multiple Compound Comparison Using LINGO-Based Load-Balancing Strategies on Multi-GPUs.

    PubMed

    Lin, Chun-Yuan; Wang, Chung-Hung; Hung, Che-Lun; Lin, Yu-Shiang

    2015-01-01

    Compound comparison is an important task in computational chemistry. From the comparison results, potential inhibitors can be found and then used in pharmaceutical experiments. The time complexity of a pairwise compound comparison is O(n^2), where n is the maximal length of the compounds. In general, the length of compounds is tens to hundreds, and the computation time is small. However, more and more compounds have been synthesized and extracted, now numbering more than tens of millions. Therefore, it is still time-consuming to compare a large number of compounds (seen as a multiple compound comparison problem, abbreviated to MCC). The intrinsic time complexity of the MCC problem is O(k^2 n^2) with k compounds of maximal length n. In this paper, we propose a GPU-based algorithm for the MCC problem, called CUDA-MCC, on single and multiple GPUs. Four LINGO-based load-balancing strategies are considered in CUDA-MCC in order to accelerate the computation among thread blocks on GPUs. CUDA-MCC was implemented in C+OpenMP+CUDA. In our experiments, CUDA-MCC ran 45 times and 391 times faster than its CPU version on a single NVIDIA Tesla K20m GPU card and a dual-NVIDIA Tesla K20m GPU card, respectively.

  3. Fast parallel tandem mass spectral library searching using GPU hardware acceleration.

    PubMed

    Baumgardner, Lydia Ashleigh; Shanmugam, Avinash Kumar; Lam, Henry; Eng, Jimmy K; Martin, Daniel B

    2011-06-03

    Mass spectrometry-based proteomics is a maturing discipline of biologic research that is experiencing substantial growth. Instrumentation has steadily improved over time with the advent of faster and more sensitive instruments collecting ever larger data files. Consequently, the computational process of matching a peptide fragmentation pattern to its sequence, traditionally accomplished by sequence database searching and more recently also by spectral library searching, has become a bottleneck in many mass spectrometry experiments. In both of these methods, the main rate-limiting step is the comparison of an acquired spectrum with all potential matches from a spectral library or sequence database. This is a highly parallelizable process because the core computational element can be represented as a simple but arithmetically intense multiplication of two vectors. In this paper, we present a proof of concept project taking advantage of the massively parallel computing available on graphics processing units (GPUs) to distribute and accelerate the process of spectral assignment using spectral library searching. This program, which we have named FastPaSS (for Fast Parallelized Spectral Searching), is implemented in CUDA (Compute Unified Device Architecture) from NVIDIA, which allows direct access to the processors in an NVIDIA GPU. Our efforts demonstrate the feasibility of GPU computing for spectral assignment, through implementation of the validated spectral searching algorithm SpectraST in the CUDA environment.

  4. GPU-powered model analysis with PySB/cupSODA.

    PubMed

    Harris, Leonard A; Nobile, Marco S; Pino, James C; Lubbock, Alexander L R; Besozzi, Daniela; Mauri, Giancarlo; Cazzaniga, Paolo; Lopez, Carlos F

    2017-11-01

    A major barrier to the practical utilization of large, complex models of biochemical systems is the lack of open-source computational tools to evaluate model behaviors over high-dimensional parameter spaces. This is due to the high computational expense of performing thousands to millions of model simulations required for statistical analysis. To address this need, we have implemented a user-friendly interface between cupSODA, a GPU-powered kinetic simulator, and PySB, a Python-based modeling and simulation framework. For three example models of varying size, we show that for large numbers of simulations PySB/cupSODA achieves order-of-magnitude speedups relative to a CPU-based ordinary differential equation integrator. The PySB/cupSODA interface has been integrated into the PySB modeling framework (version 1.4.0), which can be installed from the Python Package Index (PyPI) using a Python package manager such as pip. cupSODA source code and precompiled binaries (Linux, Mac OS/X, Windows) are available at github.com/aresio/cupSODA (requires an Nvidia GPU; developer.nvidia.com/cuda-gpus). Additional information about PySB is available at pysb.org. paolo.cazzaniga@unibg.it or c.lopez@vanderbilt.edu. Supplementary data are available at Bioinformatics online. © The Author(s) 2017. Published by Oxford University Press.

  5. GPUmotif: an ultra-fast and energy-efficient motif analysis program using graphics processing units.

    PubMed

    Zandevakili, Pooya; Hu, Ming; Qin, Zhaohui

    2012-01-01

    Computational detection of TF binding patterns has become an indispensable tool in functional genomics research. With the rapid advance of new sequencing technologies, large amounts of protein-DNA interaction data have been produced. Analyzing this data can provide substantial insight into the mechanisms of transcriptional regulation. However, the massive amount of sequence data presents daunting challenges. In our previous work, we have developed a novel algorithm called Hybrid Motif Sampler (HMS) that enables more scalable and accurate motif analysis. Despite much improvement, HMS is still time-consuming due to the requirement to calculate matching probabilities position-by-position. Using the NVIDIA CUDA toolkit, we developed a graphics processing unit (GPU)-accelerated motif analysis program named GPUmotif. We proposed a "fragmentation" technique to hide data transfer time between memories. Performance comparison studies showed that commonly-used model-based motif scan and de novo motif finding procedures such as HMS can be dramatically accelerated when running GPUmotif on NVIDIA graphics cards. As a result, energy consumption can also be greatly reduced when running motif analysis using GPUmotif. The GPUmotif program is freely available at http://sourceforge.net/projects/gpumotif/

  6. The CUBLAS and CULA based GPU acceleration of adaptive finite element framework for bioluminescence tomography.

    PubMed

    Zhang, Bo; Yang, Xiang; Yang, Fei; Yang, Xin; Qin, Chenghu; Han, Dong; Ma, Xibo; Liu, Kai; Tian, Jie

    2010-09-13

    In molecular imaging (MI), especially optical molecular imaging, bioluminescence tomography (BLT) has emerged as an effective imaging modality for small animal imaging. The finite element methods (FEMs), especially the adaptive finite element (AFE) framework, play an important role in BLT. The processing speed of the FEMs and the AFE framework still needs to be improved, although multi-thread CPU technology and multi-CPU technology have already been applied. In this paper, we introduce, for the first time, a new kind of acceleration technology for the AFE framework for BLT, using the graphics processing unit (GPU). Besides raw processing speed, GPU technology strikes a balance between cost and performance. CUBLAS and CULA are two important and powerful libraries for programming on NVIDIA GPUs. With the help of CUBLAS and CULA, it is easy to code on an NVIDIA GPU, and there is no need to worry about the details of the hardware environment of a specific GPU. Numerical experiments are designed to show the necessity, effect and application of the proposed CUBLAS- and CULA-based GPU acceleration. From the results of the experiments, we conclude that the proposed CUBLAS- and CULA-based GPU acceleration method considerably improves the processing speed of the AFE framework while striking a balance between cost and performance.

  7. Swan: A tool for porting CUDA programs to OpenCL

    NASA Astrophysics Data System (ADS)

    Harvey, M. J.; De Fabritiis, G.

    2011-04-01

    The use of modern, high-performance graphical processing units (GPUs) for acceleration of scientific computation has been widely reported. The majority of this work has used the CUDA programming model supported exclusively by GPUs manufactured by NVIDIA. An industry standardisation effort has recently produced the OpenCL specification for GPU programming. This offers the benefits of hardware-independence and reduced dependence on proprietary tool-chains. Here we describe a source-to-source translation tool, "Swan", for facilitating the conversion of an existing CUDA code to use the OpenCL model, as a means to aid programmers experienced with CUDA in evaluating OpenCL and alternative hardware. While the performance of equivalent OpenCL and CUDA code on fixed hardware should be comparable, we find that a real-world CUDA application ported to OpenCL exhibits an overall 50% increase in runtime, a reduction in performance attributable to the immaturity of contemporary compilers. The ported application is shown to have platform independence, running on both NVIDIA and AMD GPUs without modification. We conclude that OpenCL is a viable platform for developing portable GPU applications but that the more mature CUDA tools continue to provide best performance. Program summary Program title: Swan Catalogue identifier: AEIH_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEIH_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: GNU Public License version 2 No. of lines in distributed program, including test data, etc.: 17 736 No. of bytes in distributed program, including test data, etc.: 131 177 Distribution format: tar.gz Programming language: C Computer: PC Operating system: Linux RAM: 256 Mbytes Classification: 6.5 External routines: NVIDIA CUDA, OpenCL Nature of problem: Graphical Processing Units (GPUs) from NVIDIA are preferentially programmed with the proprietary CUDA programming toolkit. An

  8. Atmospheric aerosol composition and source apportionments to aerosol in southern Taiwan

    NASA Astrophysics Data System (ADS)

    Tsai, Ying I.; Chen, Chien-Lung

    In this study, the chemical characteristics of winter aerosol at four sites in southern Taiwan were determined and the Gaussian Trajectory transfer coefficient model (GTx) was then used to identify the major air pollutant sources affecting the study sites. Aerosols were found to be acidic at all four sites. The most important constituents of the particulate matter (PM) by mass were SO₄²⁻, organic carbon (OC), NO₃⁻, elemental carbon (EC) and NH₄⁺, with SO₄²⁻, NO₃⁻, and NH₄⁺ together constituting 86.0-87.9% of the total PM2.5 soluble inorganic salts and 68.9-78.3% of the total PM2.5-10 soluble inorganic salts, showing that secondary photochemical solution components such as these were the major contributors to the aerosol water-soluble ions. The coastal site, Linyuan (LY), had the highest PM mass percentage of sea salts, higher in the coarse fraction, and higher sea salts during daytime than during nighttime, indicating that the prevailing daytime sea breeze brought with it more sea-salt aerosol. Other than sea salts, crustal matter, and EC in PM2.5 at Jenwu (JW) and in PM2.5-10 at LY, all aerosol components were higher during nighttime, due to relatively low nighttime mixing heights limiting vertical and horizontal dispersion. At JW, a site with heavy traffic loadings, the OC/EC ratio in the nighttime fine and coarse fractions of approximately 2.2 was higher than during daytime, indicating that in addition to primary organic aerosol (POA), secondary organic aerosol (SOA) also contributed to the nighttime PM2.5. This was also true of the nighttime coarse fraction at LY. The GTx produced correlation coefficients (r) for simulated and observed daily concentrations of PM10 at the four sites (receptors) in the range 0.45-0.59 and biases from -6% to -20%. Source apportionment indicated that point sources were the largest PM10 source at JW, LY and Daliao (DL), while at Meinung (MN), a suburban site with less local PM10, SOx and NOx emissions, upwind

  9. PSP toxin levels and plankton community composition and abundance in size-fractionated vertical profiles during spring/summer blooms of the toxic dinoflagellate Alexandrium fundyense in the Gulf of Maine and on Georges Bank, 2007, 2008, and 2010: 1. Toxin levels.

    PubMed

    Deeds, Jonathan R; Petitpas, Christian M; Shue, Vangie; White, Kevin D; Keafer, Bruce A; McGillicuddy, Dennis J; Milligan, Peter J; Anderson, Donald M; Turner, Jefferson T

    2014-05-01

    As part of the NOAA ECOHAB funded Gulf of Maine Toxicity (GOMTOX) project, we determined Alexandrium fundyense abundance, paralytic shellfish poisoning (PSP) toxin composition, and concentration in quantitatively-sampled size-fractionated (20-64, 64-100, 100-200, 200-500, and > 500 μm) particulate water samples, and the community composition of potential grazers of A. fundyense in these size fractions, at multiple depths (typically 1, 10, 20 m, and near-bottom) during 10 large-scale sampling cruises during the A. fundyense bloom season (May-August) in the coastal Gulf of Maine and on Georges Bank in 2007, 2008, and 2010. Our findings were as follows: (1) when all sampling stations and all depths were summed by year, the majority (94% ± 4%) of total PSP toxicity was contained in the 20-64 μm size fraction; (2) when further analyzed by depth, the 20-64 μm size fraction was the primary source of toxin for 97% of the stations and depths sampled over three years; (3) overall PSP toxin profiles were fairly consistent during the three seasons of sampling with gonyautoxins (1, 2, 3, and 4) dominating (90.7% ± 5.5%), followed by the carbamate toxins saxitoxin (STX) and neosaxitoxin (NEO) (7.7% ± 4.5%), followed by n-sulfocarbamoyl toxins (C1 and 2, GTX5) (1.3% ± 0.6%), followed by all decarbamoyl toxins (dcSTX, dcNEO, dcGTX2&3) (< 1%), although differences were noted between PSP toxin compositions for nearshore coastal Gulf of Maine sampling stations compared to offshore Georges Bank sampling stations for 2 out of 3 years; (4) surface cell counts of A. fundyense were a fairly reliable predictor of the presence of toxins throughout the water column; and (5) nearshore surface cell counts of A. fundyense in the coastal Gulf of Maine were not a reliable predictor of A. fundyense populations offshore on Georges Bank for 2 out of the 3 years sampled.

  10. Preliminary Sizing Completed for Single- Stage-To-Orbit Launch Vehicles Powered By Rocket-Based Combined Cycle Technology

    NASA Technical Reports Server (NTRS)

    Roche, Joseph M.

    2002-01-01

    Single-stage-to-orbit (SSTO) propulsion remains an elusive goal for launch vehicles. The physics of the problem is leading developers to a search for higher propulsion performance than is available with all-rocket power. Rocket-based combined cycle (RBCC) technology provides additional propulsion performance that may enable SSTO flight. Structural efficiency is also a major driving force in enabling SSTO flight. Increases in performance with RBCC propulsion are offset with the added size of the propulsion system. Geometrical considerations must be exploited to minimize the weight. Integration of the propulsion system with the vehicle must be carefully planned such that aeroperformance is not degraded and the air-breathing performance is enhanced. Consequently, the vehicle's structural architecture becomes one with the propulsion system architecture. Geometrical considerations applied to the integrated vehicle lead to low drag and high structural and volumetric efficiency. Sizing of the SSTO launch vehicle (GTX) is itself an elusive task. The weight of the vehicle depends strongly on the propellant required to meet the mission requirements. Changes in propellant requirements result in changes in the size of the vehicle, which in turn, affect the weight of the vehicle and change the propellant requirements. An iterative approach is necessary to size the vehicle to meet the flight requirements. GTX Sizer was developed to do exactly this. The governing geometry was built into a spreadsheet model along with scaling relationships. The scaling laws attempt to maintain structural integrity as the vehicle size is changed. Key aerodynamic relationships are maintained as the vehicle size is changed. The closed weight and center of gravity are displayed graphically on a plot of the synthesized vehicle. In addition, comprehensive tabular data of the subsystem weights and centers of gravity are generated. The model has been verified for accuracy with finite element analysis. The
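
    GTX Sizer itself is a spreadsheet model; the toy loop below (host-only, with all constants invented for illustration, not GTX data) merely shows why sizing is iterative: propellant mass follows gross weight through the rocket equation, and weight follows propellant back through a structural fraction, so closure is a fixed point.

    ```cuda
    // Toy sizing-closure loop under assumed constants (not GTX Sizer).
    #include <cstdio>
    #include <cmath>

    int main() {
        const double dv = 9000.0;          // required delta-v, m/s (assumed)
        const double isp = 420.0, g0 = 9.81;
        const double payload = 1000.0;     // kg (assumed)
        const double k_struct = 0.10;      // dry kg per kg propellant (assumed)

        double prop = 10000.0;             // initial guess
        for (int it = 0; it < 200; ++it) {
            double dry = payload + k_struct * prop;     // weight follows size
            double gross = dry * exp(dv / (isp * g0));  // rocket equation
            double next = gross - dry;                  // propellant follows weight
            if (fabs(next - prop) < 1.0) { prop = next; break; }
            prop = next;                                // iterate to closure
        }
        printf("closed propellant mass: %.0f kg\n", prop);
        return 0;
    }
    ```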

  11. Glutamine prevents oxidative stress in a model of portal hypertension.

    PubMed

    Zabot, Gilmara Pandolfo; Carvalhal, Gustavo Franco; Marroni, Norma Possa; Licks, Francielli; Hartmann, Renata Minuzzo; da Silva, Vinícius Duval; Fillmann, Henrique Sarubbi

    2017-07-07

    To evaluate the protective effects of glutamine in a model of portal hypertension (PH) induced by partial portal vein ligation (PPVL). Male Wistar rats were housed in a controlled environment and were allowed access to food and water ad libitum. Twenty-four male Wistar rats were divided into four experimental groups: (1) control group (SO) - rats underwent exploratory laparotomy; (2) control + glutamine group (SO + G) - rats were subjected to laparotomy and were treated intraperitoneally with glutamine; (3) portal hypertension group (PPVL) - rats were subjected to PPVL; and (4) PPVL + glutamine group (PPVL + G) - rats were treated intraperitoneally with glutamine for seven days. Local injuries were determined by evaluating intestinal segments for oxidative stress using lipid peroxidation and the activities of glutathione peroxidase (GPx), endothelial nitric oxide synthase (eNOS) and inducible nitric oxide synthase (iNOS) after PPVL. Lipid peroxidation of the membrane was increased in the animals subjected to PH (P < 0.01). However, the group that received glutamine for seven days after the PPVL procedure showed levels of lipid peroxidation similar to those of the control groups (P > 0.05). The activity of the antioxidant enzyme GPx was decreased in the gut of animals subjected to PH compared with that in the control group of animals not subjected to PH (P < 0.01). However, the group that received glutamine for seven days after the PPVL showed similar GPx activity to both the control groups not subjected to PH (P > 0.05). At least 10 random, non-overlapping images of each histological slide with 200 × magnification (44 pixel = 1 μm) were captured. The mean areas were summed for each group. The mean areas of eNOS staining for both of the control groups were similar. The PPVL group showed the largest area of staining for eNOS. The PPVL + G group had the second highest amount of staining, but the mean value was much lower than that of the PPVL

  12. Zebrafish neurotoxicity from aphantoxins--cyanobacterial paralytic shellfish poisons (PSPs) from Aphanizomenon flos-aquae DC-1.

    PubMed

    Zhang, Delu; Hu, Chunxiang; Wang, Gaohong; Li, Dunhai; Li, Genbao; Liu, Yongding

    2013-05-01

    Aphanizomenon flos-aquae (A. flos-aquae), a cyanobacterium frequently encountered in water blooms worldwide, is a source of neurotoxins known as PSPs or aphantoxins that present a major threat to the environment and to human health. Although the molecular mechanism of PSP action is well known, many unresolved questions remain concerning its mechanisms of toxicity. Aphantoxins purified from a natural isolate of A. flos-aquae DC-1 were analyzed by high-performance liquid chromatography (HPLC); the major component toxins were gonyautoxins 1 and 5 (GTX1 and GTX5; 34.04% and 21.28%, respectively) and neosaxitoxin (neoSTX, 12.77%). The LD50 of the aphantoxin preparation was determined to be 11.33 μg/kg (7.75 μg saxitoxin equivalents (STXeq) per kg) following intraperitoneal injection of zebrafish (Danio rerio). To address the neurotoxicology of the aphantoxin preparation, zebrafish were injected with low and high sublethal doses of A. flos-aquae DC-1 toxins (7.73 and 9.28 μg/kg; 5.3 and 6.4 μg STXeq/kg, respectively) and brain tissues were analyzed by electron microscopy and RT-PCR at different timepoints postinjection. Low-dose aphantoxin exposure was associated with chromatin condensation, cell-membrane blebbing, and the appearance of apoptotic bodies. High-dose exposure was associated with cytoplasmic vacuolization, mitochondrial swelling, and expansion of the endoplasmic reticulum. At early timepoints (3 h) many cells exhibited characteristic features of both apoptosis and necrosis. At later timepoints apoptosis appeared to predominate in the low-dose group, whereas necrosis predominated in the high-dose group. RT-PCR revealed that mRNA levels of the apoptosis-related genes encoding p53, Bax, caspase-3, and c-Jun were upregulated after aphantoxin exposure, but there was no evidence of DNA laddering; apoptosis could take place by pathways independent of DNA fragmentation. These results demonstrate that aphantoxin exposure can cause cell death in zebrafish

  13. PSP toxin levels and plankton community composition and abundance in size-fractionated vertical profiles during spring/summer blooms of the toxic dinoflagellate Alexandrium fundyense in the Gulf of Maine and on Georges Bank, 2007, 2008, and 2010: 1. Toxin levels

    PubMed Central

    Deeds, Jonathan R.; Petitpas, Christian M.; Shue, Vangie; White, Kevin D.; Keafer, Bruce A.; McGillicuddy, Dennis J.; Milligan, Peter J.; Anderson, Donald M.; Turner, Jefferson T.

    2014-01-01

    As part of the NOAA ECOHAB funded Gulf of Maine Toxicity (GOMTOX) project, we determined Alexandrium fundyense abundance, paralytic shellfish poisoning (PSP) toxin composition, and concentration in quantitatively-sampled size-fractionated (20–64, 64–100, 100–200, 200–500, and > 500 μm) particulate water samples, and the community composition of potential grazers of A. fundyense in these size fractions, at multiple depths (typically 1, 10, 20 m, and near-bottom) during 10 large-scale sampling cruises during the A. fundyense bloom season (May–August) in the coastal Gulf of Maine and on Georges Bank in 2007, 2008, and 2010. Our findings were as follows: (1) when all sampling stations and all depths were summed by year, the majority (94% ± 4%) of total PSP toxicity was contained in the 20–64 μm size fraction; (2) when further analyzed by depth, the 20–64 μm size fraction was the primary source of toxin for 97% of the stations and depths sampled over three years; (3) overall PSP toxin profiles were fairly consistent during the three seasons of sampling with gonyautoxins (1, 2, 3, and 4) dominating (90.7% ± 5.5%), followed by the carbamate toxins saxitoxin (STX) and neosaxitoxin (NEO) (7.7% ± 4.5%), followed by n-sulfocarbamoyl toxins (C1 and 2, GTX5) (1.3% ± 0.6%), followed by all decarbamoyl toxins (dcSTX, dcNEO, dcGTX2&3) (< 1%), although differences were noted between PSP toxin compositions for nearshore coastal Gulf of Maine sampling stations compared to offshore Georges Bank sampling stations for 2 out of 3 years; (4) surface cell counts of A. fundyense were a fairly reliable predictor of the presence of toxins throughout the water column; and (5) nearshore surface cell counts of A. fundyense in the coastal Gulf of Maine were not a reliable predictor of A. fundyense populations offshore on Georges Bank for 2 out of the 3 years sampled. PMID:25076816

  14. GPU accelerated population annealing algorithm

    NASA Astrophysics Data System (ADS)

    Barash, Lev Yu.; Weigel, Martin; Borovský, Michal; Janke, Wolfhard; Shchur, Lev N.

    2017-11-01

    Population annealing is a promising recent approach for Monte Carlo simulations in statistical physics, in particular for the simulation of systems with complex free-energy landscapes. It is a hybrid method, combining importance sampling through Markov chains with elements of sequential Monte Carlo in the form of population control. While it appears to provide algorithmic capabilities for the simulation of such systems that are roughly comparable to those of more established approaches such as parallel tempering, it is intrinsically much more suitable for massively parallel computing. Here, we tap into this structural advantage and present a highly optimized implementation of the population annealing algorithm on GPUs that promises speed-ups of several orders of magnitude as compared to a serial implementation on CPUs. While the sample code is for simulations of the 2D ferromagnetic Ising model, it should be easily adapted for simulations of other spin models, including disordered systems. Our code includes implementations of some advanced algorithmic features that have only recently been suggested, namely the automatic adaptation of temperature steps and a multi-histogram analysis of the data at different temperatures. Program Files doi:http://dx.doi.org/10.17632/sgzt4b7b3m.1 Licensing provisions: Creative Commons Attribution license (CC BY 4.0) Programming language: C, CUDA External routines/libraries: NVIDIA CUDA Toolkit 6.5 or newer Nature of problem: The program calculates the internal energy, specific heat, several magnetization moments, entropy and free energy of the 2D Ising model on square lattices of edge length L with periodic boundary conditions as a function of inverse temperature β. Solution method: The code uses population annealing, a hybrid method combining Markov chain updates with population control. The code is implemented for NVIDIA GPUs using the CUDA language and employs advanced techniques such as multi-spin coding, adaptive temperature
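
    The published package contains the full optimized implementation; the sketch below shows only the population-control step in schematic form, with assumed names: replica weights exp(-d_beta*E) are computed on the device, and the host then draws each replica's copy count (Poisson resampling is one common variant of this step).

    ```cuda
    // Schematic population-control step of population annealing.
    #include <cuda_runtime.h>
    #include <vector>
    #include <random>

    __global__ void anneal_weights(const float* E, float* w,
                                   float d_beta, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) w[i] = expf(-d_beta * E[i]);     // reweight to new beta
    }

    std::vector<int> resample(const std::vector<float>& w, int target_pop,
                              std::mt19937& rng) {
        double sum = 0.0;
        for (float x : w) sum += x;
        std::vector<int> copies(w.size());
        for (size_t i = 0; i < w.size(); ++i) {
            double mean = target_pop * w[i] / sum;  // expected replica count
            std::poisson_distribution<int> pois(mean);
            copies[i] = pois(rng);                  // clone/delete counts
        }
        return copies;
    }
    ```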

  15. CUDA programs for the GPU computing of the Swendsen-Wang multi-cluster spin flip algorithm: 2D and 3D Ising, Potts, and XY models

    NASA Astrophysics Data System (ADS)

    Komura, Yukihiro; Okabe, Yutaka

    2014-03-01

    We present sample CUDA programs for the GPU computing of the Swendsen-Wang multi-cluster spin flip algorithm. We deal with the classical spin models; the Ising model, the q-state Potts model, and the classical XY model. As for the lattice, both the 2D (square) lattice and the 3D (simple cubic) lattice are treated. We already reported the idea of the GPU implementation for 2D models (Komura and Okabe, 2012). We here explain the details of sample programs, and discuss the performance of the present GPU implementation for the 3D Ising and XY models. We also show the calculated results of the moment ratio for these models, and discuss phase transitions. Catalogue identifier: AERM_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AERM_v1_0.html Program obtainable from: CPC Program Library, Queen’s University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 5632 No. of bytes in distributed program, including test data, etc.: 14688 Distribution format: tar.gz Programming language: C, CUDA. Computer: System with an NVIDIA CUDA enabled GPU. Operating system: System with an NVIDIA CUDA enabled GPU. Classification: 23. External routines: NVIDIA CUDA Toolkit 3.0 or newer Nature of problem: Monte Carlo simulation of classical spin systems. Ising, q-state Potts model, and the classical XY model are treated for both two-dimensional and three-dimensional lattices. Solution method: GPU-based Swendsen-Wang multi-cluster spin flip Monte Carlo method. The CUDA implementation for the cluster-labeling is based on the work by Hawick et al. [1] and that by Kalentev et al. [2]. Restrictions: The system size is limited depending on the memory of a GPU. Running time: For the parameters used in the sample programs, it takes about a minute for each program. Of course, it depends on the system size, the number of Monte Carlo steps, etc. References: [1] K

  16. ARCHERRT – A GPU-based and photon-electron coupled Monte Carlo dose computing engine for radiation therapy: Software development and application to helical tomotherapy

    PubMed Central

    Su, Lin; Yang, Youming; Bednarz, Bryan; Sterpin, Edmond; Du, Xining; Liu, Tianyu; Ji, Wei; Xu, X. George

    2014-01-01

    Purpose: Using the graphical processing units (GPU) hardware technology, an extremely fast Monte Carlo (MC) code ARCHERRT is developed for radiation dose calculations in radiation therapy. This paper describes the detailed software development and testing for three clinical TomoTherapy® cases: the prostate, lung, and head & neck. Methods: To obtain clinically relevant dose distributions, phase space files (PSFs) created from optimized radiation therapy treatment plan fluence maps were used as the input to ARCHERRT. Patient-specific phantoms were constructed from patient CT images. Batch simulations were employed to facilitate the time-consuming task of loading large PSFs, and to improve the estimation of statistical uncertainty. Furthermore, two different Woodcock tracking algorithms were implemented and their relative performance was compared. The dose curves of an Elekta accelerator PSF incident on a homogeneous water phantom were benchmarked against DOSXYZnrc. For each of the treatment cases, dose volume histograms and isodose maps were produced from ARCHERRT and the general-purpose code, GEANT4. The gamma index analysis was performed to evaluate the similarity of voxel doses obtained from these two codes. The hardware accelerators used in this study are one NVIDIA K20 GPU, one NVIDIA K40 GPU, and six NVIDIA M2090 GPUs. In addition, to make a fairer comparison of the CPU and GPU performance, a multithreaded CPU code was developed using OpenMP and tested on an Intel E5-2620 CPU. Results: For the water phantom, the depth dose curve and dose profiles from ARCHERRT agree well with DOSXYZnrc. For clinical cases, results from ARCHERRT are compared with those from GEANT4 and good agreement is observed. Gamma index test is performed for voxels whose dose is greater than 10% of maximum dose. For 2%/2mm criteria, the passing rates for the prostate, lung case, and head & neck cases are 99.7%, 98.5%, and 97.2%, respectively. Due to specific architecture of GPU, modified

  17. ARCHERRT - a GPU-based and photon-electron coupled Monte Carlo dose computing engine for radiation therapy: software development and application to helical tomotherapy.

    PubMed

    Su, Lin; Yang, Youming; Bednarz, Bryan; Sterpin, Edmond; Du, Xining; Liu, Tianyu; Ji, Wei; Xu, X George

    2014-07-01

    Using the graphical processing units (GPU) hardware technology, an extremely fast Monte Carlo (MC) code ARCHERRT is developed for radiation dose calculations in radiation therapy. This paper describes the detailed software development and testing for three clinical TomoTherapy® cases: the prostate, lung, and head & neck. To obtain clinically relevant dose distributions, phase space files (PSFs) created from optimized radiation therapy treatment plan fluence maps were used as the input to ARCHERRT. Patient-specific phantoms were constructed from patient CT images. Batch simulations were employed to facilitate the time-consuming task of loading large PSFs, and to improve the estimation of statistical uncertainty. Furthermore, two different Woodcock tracking algorithms were implemented and their relative performance was compared. The dose curves of an Elekta accelerator PSF incident on a homogeneous water phantom were benchmarked against DOSXYZnrc. For each of the treatment cases, dose volume histograms and isodose maps were produced from ARCHERRT and the general-purpose code, GEANT4. The gamma index analysis was performed to evaluate the similarity of voxel doses obtained from these two codes. The hardware accelerators used in this study are one NVIDIA K20 GPU, one NVIDIA K40 GPU, and six NVIDIA M2090 GPUs. In addition, to make a fairer comparison of the CPU and GPU performance, a multithreaded CPU code was developed using OpenMP and tested on an Intel E5-2620 CPU. For the water phantom, the depth dose curve and dose profiles from ARCHERRT agree well with DOSXYZnrc. For clinical cases, results from ARCHERRT are compared with those from GEANT4 and good agreement is observed. Gamma index test is performed for voxels whose dose is greater than 10% of maximum dose. For 2%/2mm criteria, the passing rates for the prostate, lung case, and head & neck cases are 99.7%, 98.5%, and 97.2%, respectively. Due to specific architecture of GPU, modified Woodcock tracking algorithm

  18. A GPU OpenCL based cross-platform Monte Carlo dose calculation engine (goMC).

    PubMed

    Tian, Zhen; Shi, Feng; Folkerts, Michael; Qin, Nan; Jiang, Steve B; Jia, Xun

    2015-10-07

    Monte Carlo (MC) simulation has been recognized as the most accurate dose calculation method for radiotherapy. However, the extremely long computation time impedes its clinical application. Recently, a lot of effort has been made to realize fast MC dose calculation on graphic processing units (GPUs). However, most of the GPU-based MC dose engines have been developed under NVidia's CUDA environment. This limits the code portability to other platforms, hindering the introduction of GPU-based MC simulations to clinical practice. The objective of this paper is to develop a GPU OpenCL based cross-platform MC dose engine named goMC with coupled photon-electron simulation for external photon and electron radiotherapy in the MeV energy range. Compared to our previously developed GPU-based MC code named gDPM (Jia et al 2012 Phys. Med. Biol. 57 7783-97), goMC has two major differences. First, it was developed under the OpenCL environment for high code portability and hence could be run not only on different GPU cards but also on CPU platforms. Second, we adopted the electron transport model used in EGSnrc MC package and PENELOPE's random hinge method in our new dose engine, instead of the dose planning method employed in gDPM. Dose distributions were calculated for a 15 MeV electron beam and a 6 MV photon beam in a homogenous water phantom, a water-bone-lung-water slab phantom and a half-slab phantom. Satisfactory agreement between the two MC dose engines goMC and gDPM was observed in all cases. The average dose differences in the regions that received a dose higher than 10% of the maximum dose were 0.48-0.53% for the electron beam cases and 0.15-0.17% for the photon beam cases. In terms of efficiency, goMC was ~4-16% slower than gDPM when running on the same NVidia TITAN card for all the cases we tested, due to both the different electron transport models and the different development environments. The code portability of our new dose engine goMC was validated by

  19. High-speed railway real-time localization auxiliary method based on deep neural network

    NASA Astrophysics Data System (ADS)

    Chen, Dongjie; Zhang, Wensheng; Yang, Yang

    2017-11-01

    High-speed railway intelligent monitoring and management systems integrate schedule data, geographic information, location services, and data mining technology over time and space. Assisted localization is a significant submodule of the intelligent monitoring system. In practical applications, the general approach is to capture image sequences of the components with a high-definition camera and apply digital image processing, target detection, tracking, and even behavior analysis methods. In this paper, we present an end-to-end character recognition method, based on a deep CNN called YOLO-toc, for high-speed railway pillar plate numbers. Different from other deep CNNs, YOLO-toc is an end-to-end multi-target detection framework; furthermore, it exhibits state-of-the-art real-time detection performance, reaching nearly 50 fps on a GPU (GTX 960). Finally, we realize a real-time yet high-accuracy pillar plate number recognition system by integrating natural-scene OCR into a dedicated classification YOLO-toc model.

  20. BlochSolver: A GPU-optimized fast 3D MRI simulator for experimentally compatible pulse sequences

    NASA Astrophysics Data System (ADS)

    Kose, Ryoichi; Kose, Katsumi

    2017-08-01

    A magnetic resonance imaging (MRI) simulator, which reproduces MRI experiments using computers, has been developed using two graphics-processing-unit (GPU) boards (GTX 1080). The MRI simulator was developed to run according to pulse sequences used in experiments. Experiments and simulations were performed to demonstrate the usefulness of the MRI simulator for three types of pulse sequences, namely, three-dimensional (3D) gradient-echo, 3D radio-frequency spoiled gradient-echo, and gradient-echo multislice with practical matrix sizes. The results demonstrated that the calculation speed using two GPU boards was typically about 7 TFLOPS and about 14 times faster than the calculation speed using CPUs (two 18-core Xeons). We also found that MR images acquired by experiment could be reproduced using an appropriate number of subvoxels, and that 3D isotropic and two-dimensional multislice imaging experiments for practical matrix sizes could be simulated using the MRI simulator. Therefore, we concluded that such powerful MRI simulators are expected to become an indispensable tool for MRI research and development.
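
    Bloch simulators of this kind parallelize naturally because every magnetization vector (isochromat) evolves independently between RF events. The CUDA sketch below shows one such time step, precession about z from gradients and off-resonance followed by T1/T2 relaxation; all names (M, pos, dB0, grad) are illustrative assumptions, not the BlochSolver API.

      // One Bloch-simulator time step: each thread advances one isochromat.
      __global__ void blochStep(float3* M, const float3* pos, const float* dB0,
                                float3 grad, float dt, float T1, float T2,
                                float M0, int n) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i >= n) return;
          const float GAMMA = 2.675e8f;            // rad s^-1 T^-1 for protons
          float3 m = M[i], r = pos[i];
          // Phase accrued from gradients and off-resonance during dt.
          float dphi = GAMMA * (grad.x*r.x + grad.y*r.y + grad.z*r.z + dB0[i]) * dt;
          float c = cosf(dphi), s = sinf(dphi);
          float mx =  c*m.x + s*m.y;               // rotation about the z axis
          float my = -s*m.x + c*m.y;
          float e2 = expf(-dt/T2), e1 = expf(-dt/T1);
          M[i] = make_float3(mx*e2, my*e2, M0 + (m.z - M0)*e1);  // relaxation
      }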

  1. Review: visual analytics of climate networks

    NASA Astrophysics Data System (ADS)

    Nocke, T.; Buschmann, S.; Donges, J. F.; Marwan, N.; Schulz, H.-J.; Tominski, C.

    2015-09-01

    Network analysis has become an important approach in studying complex spatiotemporal behaviour within geophysical observation and simulation data. This new field produces increasing numbers of large geo-referenced networks to be analysed. Particular focus lies currently on the network analysis of the complex statistical interrelationship structure within climatological fields. The standard procedure for such network analyses is the extraction of network measures in combination with static standard visualisation methods. Existing interactive visualisation methods and tools for geo-referenced network exploration are often either not known to the analyst or their potential is not fully exploited. To fill this gap, we illustrate how interactive visual analytics methods in combination with geovisualisation can be tailored for visual climate network investigation. Therefore, the paper provides a problem analysis relating the multiple visualisation challenges to a survey undertaken with network analysts from the research fields of climate and complex systems science. Then, as an overview for the interested practitioner, we review the state-of-the-art in climate network visualisation and provide an overview of existing tools. As a further contribution, we introduce the visual network analytics tools CGV and GTX, providing tailored solutions for climate network analysis, including alternative geographic projections, edge bundling, and 3-D network support. Using these tools, the paper illustrates the application potentials of visual analytics for climate networks based on several use cases including examples from global, regional, and multi-layered climate networks.

  2. New solutions for climate network visualization

    NASA Astrophysics Data System (ADS)

    Nocke, Thomas; Buschmann, Stefan; Donges, Jonathan F.; Marwan, Norbert

    2016-04-01

    An increasing number of climate and climate impact research methods deal with geo-referenced networks, including energy, trade, supply-chain, disease-dissemination and climatic teleconnection networks. At the same time, the size and complexity of these networks increase, resulting in networks of more than a hundred thousand or even millions of edges, which are often temporally evolving, carry additional data at nodes and edges, and can consist of multiple layers, even in real 3D. This poses challenges for both the static representation and the interactive exploration of these networks, first of all avoiding edge clutter ("edge spaghetti") and allowing interactivity even for unfiltered networks. Within this presentation, we illustrate potential solutions to these challenges. To this end, we give a glimpse of a questionnaire conducted with climate and complex-systems scientists with respect to their network visualization requirements, and of a review of available state-of-the-art visualization techniques and tools for this purpose (see also Nocke et al., 2015). In the main part, we present alternative visualization solutions for several use cases (global, regional, and multi-layered climate networks), including alternative geographic projections, edge bundling, and 3-D network support (based on the CGV and GTX tools), and implementation details needed to reach interactive frame rates. References: Nocke, T., S. Buschmann, J. F. Donges, N. Marwan, H.-J. Schulz, and C. Tominski: Review: Visual analytics of climate networks, Nonlinear Processes in Geophysics, 22, 545-570, doi:10.5194/npg-22-545-2015, 2015.

  3. Review: visual analytics of climate networks

    NASA Astrophysics Data System (ADS)

    Nocke, T.; Buschmann, S.; Donges, J. F.; Marwan, N.; Schulz, H.-J.; Tominski, C.

    2015-04-01

    Network analysis has become an important approach in studying complex spatiotemporal behaviour within geophysical observation and simulation data. This new field produces increasing numbers of large geo-referenced networks to be analysed. Particular focus lies currently on the network analysis of the complex statistical interrelationship structure within climatological fields. The standard procedure for such network analyses is the extraction of network measures in combination with static standard visualisation methods. Existing interactive visualisation methods and tools for geo-referenced network exploration are often either not known to the analyst or their potential is not fully exploited. To fill this gap, we illustrate how interactive visual analytics methods in combination with geovisualisation can be tailored for visual climate network investigation. Therefore, the paper provides a problem analysis relating the multiple visualisation challenges to a survey undertaken with network analysts from the research fields of climate and complex systems science. Then, as an overview for the interested practitioner, we review the state-of-the-art in climate network visualisation and provide an overview of existing tools. As a further contribution, we introduce the visual network analytics tools CGV and GTX, providing tailored solutions for climate network analysis, including alternative geographic projections, edge bundling, and 3-D network support. Using these tools, the paper illustrates the application potentials of visual analytics for climate networks based on several use cases including examples from global, regional, and multi-layered climate networks.

  4. 4D megahertz optical coherence tomography (OCT): imaging and live display beyond 1 gigavoxel/sec (Conference Presentation)

    NASA Astrophysics Data System (ADS)

    Huber, Robert A.; Draxinger, Wolfgang; Wieser, Wolfgang; Kolb, Jan Philip; Pfeiffer, Tom; Karpf, Sebastian N.; Eibl, Matthias; Klein, Thomas

    2016-03-01

    Over the last 20 years, optical coherence tomography (OCT) has become a valuable diagnostic tool in ophthalmology, with tens of thousands of devices sold to date. Other applications, like intravascular OCT in cardiology and gastro-intestinal imaging, will follow. OCT provides 3-dimensional image data with microscopic resolution of biological tissue in vivo. In most applications, off-line processing of the acquired OCT data is sufficient. However, for OCT applications like OCT-aided surgical microscopes, functional OCT imaging of tissue after a stimulus, or interactive endoscopy, an OCT engine capable of acquiring, processing and displaying large, high-quality 3D OCT data sets at video rate is highly desired. We developed such a prototype OCT engine and demonstrate live OCT with 25 volumes per second at a size of 320x320x320 pixels. The computer processing load of more than 1.5 TFLOPS was handled by a GTX 690 graphics processing unit with more than 3000 stream processors operating in parallel. In the talk, we will describe the optics and electronics hardware as well as the software of the system in detail and analyze current limitations. The talk also focuses on new OCT applications, where such a system improves diagnosis and monitoring of medical procedures. The additional acquisition of hyperspectral stimulated Raman signals with the system will be discussed.
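
    Much of that processing load is Fourier-transforming every acquired spectrum (one A-scan) into a depth profile. A minimal sketch of that step using NVIDIA's cuFFT, assuming interleaved complex spectra already resident on the GPU (buffer and function names are illustrative, not this group's code):

      // Batched 1-D FFT: each acquired spectrum becomes one depth profile.
      #include <cufft.h>

      void fftAScans(cufftComplex* d_spectra, int samplesPerAScan, int nAScans) {
          cufftHandle plan;
          // One plan covers all A-scans of the volume as a batched transform.
          cufftPlan1d(&plan, samplesPerAScan, CUFFT_C2C, nAScans);
          cufftExecC2C(plan, d_spectra, d_spectra, CUFFT_FORWARD);  // in place
          cufftDestroy(plan);
      }

    Real pipelines wrap this call with k-space resampling, apodization and log-magnitude kernels, but the batched FFT is where the bulk of the quoted TFLOPS is spent.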

  5. Accuracy of the Yamax CW-701 Pedometer for measuring steps in controlled and free-living conditions

    PubMed Central

    Coffman, Maren J; Reeve, Charlie L; Butler, Shannon; Keeling, Maiya; Talbot, Laura A

    2016-01-01

    Objective The Yamax Digi-Walker CW-701 (Yamax CW-701) is a low-cost pedometer that includes a 7-day memory, a 2-week cumulative memory, and automatically resets to zero at midnight. To date, the accuracy of the Yamax CW-701 has not been determined. The purpose of this study was to assess the accuracy of steps recorded by the Yamax CW-701 pedometer compared with actual steps and two other devices. Methods The study was conducted in a campus-based lab and in free-living settings with 22 students, faculty, and staff at a mid-sized university in the Southeastern US. While wearing a Yamax CW-701, Yamax Digi-Walker SW-200, and an ActiGraph GTX3 accelerometer, participants engaged in activities at variable speeds and conditions. To assess accuracy of each device, steps recorded were compared with actual step counts. Statistical tests included paired sample t-tests, percent accuracy, intraclass correlation coefficient, and Bland–Altman plots. Results The Yamax CW-701 demonstrated reliability and concurrent validity during walking at a fast pace and walking on a track, and in free-living conditions. Decreased accuracy was noted walking at a slow pace. Conclusions These findings are consistent with prior research. With most pedometers and accelerometers, adequate force and intensity must be present for a step to register. The Yamax CW-701 is accurate in recording steps taken while walking at a fast pace and in free-living settings. PMID:29942555

  6. Accuracy of the Yamax CW-701 Pedometer for measuring steps in controlled and free-living conditions.

    PubMed

    Coffman, Maren J; Reeve, Charlie L; Butler, Shannon; Keeling, Maiya; Talbot, Laura A

    2016-01-01

    The Yamax Digi-Walker CW-701 (Yamax CW-701) is a low-cost pedometer that includes a 7-day memory, a 2-week cumulative memory, and automatically resets to zero at midnight. To date, the accuracy of the Yamax CW-701 has not been determined. The purpose of this study was to assess the accuracy of steps recorded by the Yamax CW-701 pedometer compared with actual steps and two other devices. The study was conducted in a campus-based lab and in free-living settings with 22 students, faculty, and staff at a mid-sized university in the Southeastern US. While wearing a Yamax CW-701, Yamax Digi-Walker SW-200, and an ActiGraph GTX3 accelerometer, participants engaged in activities at variable speeds and conditions. To assess accuracy of each device, steps recorded were compared with actual step counts. Statistical tests included paired sample t-tests, percent accuracy, intraclass correlation coefficient, and Bland-Altman plots. The Yamax CW-701 demonstrated reliability and concurrent validity during walking at a fast pace and walking on a track, and in free-living conditions. Decreased accuracy was noted walking at a slow pace. These findings are consistent with prior research. With most pedometers and accelerometers, adequate force and intensity must be present for a step to register. The Yamax CW-701 is accurate in recording steps taken while walking at a fast pace and in free-living settings.

  7. Development and Validation of a Liquid Chromatography-Tandem Mass Spectrometry Method Coupled with Dispersive Solid-Phase Extraction for Simultaneous Quantification of Eight Paralytic Shellfish Poisoning Toxins in Shellfish.

    PubMed

    Yang, Xianli; Zhou, Lei; Tan, Yanglan; Shi, Xizhi; Zhao, Zhiyong; Nie, Dongxia; Zhou, Changyan; Liu, Hong

    2017-06-29

    In this study, a high-performance liquid chromatography-tandem mass spectrometry (HPLC-MS/MS) method was developed for the simultaneous determination of eight paralytic shellfish poisoning (PSP) toxins, including saxitoxin (STX), neosaxitoxin (NEO), gonyautoxins (GTX1-4) and the N-sulfo carbamoyl toxins C1 and C2, in sea shellfish. The samples were extracted with acetonitrile/water (80:20, v/v) containing 0.1% formic acid and purified by dispersive solid-phase extraction (dSPE) with C18 silica and acidic alumina. Qualitative and quantitative detection of the target toxins was conducted in multiple reaction monitoring (MRM) mode with positive electrospray ionization (ESI) after chromatographic separation on a TSK-gel Amide-80 HILIC column with water and acetonitrile. Matrix-matched calibration was used to compensate for matrix effects. The established method was further validated by determining the linearity (R² ≥ 0.9900), average recovery (81.52-116.50%), sensitivity (limits of detection (LODs): 0.33-5.52 μg·kg⁻¹; limits of quantitation (LOQs): 1.32-11.29 μg·kg⁻¹) and precision (relative standard deviation (RSD) ≤ 19.10%). The application of the proposed approach to thirty shellfish samples proved its desirable performance and sufficient capability for simultaneous determination of multiclass PSP toxins in seafood.

  8. Paralytic shellfish toxin producing Aphanizomenon gracile strains isolated from Lake Iznik, Turkey.

    PubMed

    Yilmaz, Mete; Foss, Amanda J; Selwood, Andrew I; Özen, Mihriban; Boundy, Michael

    2018-06-15

    Aphanizomenon gracile is one of the most widespread Paralytic Shellfish Toxin (PST) producing cyanobacteria in freshwater bodies in the Northern Hemisphere. It has been shown to produce various PST congeners, including saxitoxin (STX), neosaxitoxin (NEO), decarbamoylsaxitoxin (dcSTX) and gonyautoxin 5 (GTX5) in Europe, North America and Asia. Three cyanobacteria strains were isolated in Lake Iznik in northwestern Turkey. Morphological characterization of these strains suggested all three strains conformed to classical taxonomic identification of A. gracile with some differences such as clumping of filaments, partially hyaline cells in some filaments and longer than usual vegetative cells. Sequences of 16S rRNA gene of these strains were placed within an A. gracile cluster including the majority of PST producing strains, confirming the identification of these strains as A. gracile. These new strains possessed saxitoxin biosynthesis genes sxtA, sxtG and their sequences clustered with those of other A. gracile. Liquid chromatography tandem mass spectrometry (LC-MS/MS) analysis demonstrated the presence of NEO, STX, dcSTX and decarbamoylneosaxitoxin (dcNEO) in all strains. This is the first report of a PST producer in any water body in Turkey and first observation of dcNEO in an A. gracile culture. Copyright © 2018 Elsevier Ltd. All rights reserved.

  9. Accumulation, biotransformation, histopathology and paralysis in the Pacific calico scallop Argopecten ventricosus by the paralyzing toxins of the dinoflagellate Gymnodinium catenatum.

    PubMed

    Escobedo-Lozano, Amada Y; Estrada, Norma; Ascencio, Felipe; Contreras, Gerardo; Alonso-Rodriguez, Rosalba

    2012-05-01

    The dinoflagellate Gymnodinium catenatum produces paralyzing shellfish poisons that are consumed and accumulated by bivalves. We performed short-term feeding experiments to examine ingestion, accumulation, biotransformation, histopathology, and paralysis in the juvenile Pacific calico scallop Argopecten ventricosus that consume this dinoflagellate. Depletion of algal cells was measured in closed systems. Histopathological preparations were microscopically analyzed. Paralysis was observed and the time of recovery recorded. Accumulation and possible biotransformation of toxins were measured by HPLC analysis. Feeding activity in treated scallops showed that scallops produced pseudofeces, ingestion rates decreased at 8 h; approximately 60% of the scallops were paralyzed and melanin production and hemocyte aggregation were observed in several tissues at 15 h. HPLC analysis showed that the only toxins present in the dinoflagellates and scallops were the N-sulfo-carbamoyl toxins (C1, C2); after hydrolysis, the carbamate toxins (epimers GTX2/3) were present. C1 and C2 toxins were most common in the mantle, followed by the digestive gland and stomach-complex, adductor muscle, kidney and rectum group, and finally, gills. Toxin profiles in scallop tissue were similar to the dinoflagellate; biotransformations were not present in the scallops in this short-term feeding experiment.

  10. Accumulation, Biotransformation, Histopathology and Paralysis in the Pacific Calico Scallop Argopecten ventricosus by the Paralyzing Toxins of the Dinoflagellate Gymnodinium catenatum

    PubMed Central

    Escobedo-Lozano, Amada Y.; Estrada, Norma; Ascencio, Felipe; Contreras, Gerardo; Alonso-Rodriguez, Rosalba

    2012-01-01

    The dinoflagellate Gymnodinium catenatum produces paralyzing shellfish poisons that are consumed and accumulated by bivalves. We performed short-term feeding experiments to examine ingestion, accumulation, biotransformation, histopathology, and paralysis in the juvenile Pacific calico scallop Argopecten ventricosus that consume this dinoflagellate. Depletion of algal cells was measured in closed systems. Histopathological preparations were microscopically analyzed. Paralysis was observed and the time of recovery recorded. Accumulation and possible biotransformation of toxins were measured by HPLC analysis. Feeding activity in treated scallops showed that scallops produced pseudofeces, ingestion rates decreased at 8 h; approximately 60% of the scallops were paralyzed and melanin production and hemocyte aggregation were observed in several tissues at 15 h. HPLC analysis showed that the only toxins present in the dinoflagellates and scallops were the N-sulfo-carbamoyl toxins (C1, C2); after hydrolysis, the carbamate toxins (epimers GTX2/3) were present. C1 and C2 toxins were most common in the mantle, followed by the digestive gland and stomach-complex, adductor muscle, kidney and rectum group, and finally, gills. Toxin profiles in scallop tissue were similar to the dinoflagellate; biotransformations were not present in the scallops in this short-term feeding experiment. PMID:22822356

  11. Modular Classification of Endoscopic Endonasal Transsphenoidal Approaches to Sellar Region: Anatomic Quantitative Study.

    PubMed

    Belotti, Francesco; Doglietto, Francesco; Schreiber, Alberto; Ravanelli, Marco; Ferrari, Marco; Lancini, Davide; Rampinelli, Vittorio; Hirtler, Lena; Buffoli, Barbara; Bolzoni Villaret, Andrea; Maroldi, Roberto; Rodella, Luigi Fabrizio; Nicolai, Piero; Fontanella, Marco Maria

    2018-01-01

    Endoscopic visualization does not necessarily correspond to an adequate working space. The need for balancing invasiveness and adequacy of sellar tumor exposure has recently led to the description of multiple endoscopic endonasal transsphenoidal approaches. Comparative anatomic data on these variants are lacking. We sought to quantitatively compare endoscopic endonasal transsphenoidal approaches to the sella and parasellar region, using the concept of the "surgical pyramid." Four endoscopic transsphenoidal approaches were performed in 10 injected specimens: 1) hemisphenoidotomy; 2) transrostral; 3) extended transrostral (with superior turbinectomy); and 4) extended transrostral with posterior ethmoidectomy. ApproachViewer software (part of GTx-Eyes II, University Health Network, Toronto, Canada) with a dedicated navigation system was used to quantify the surgical pyramid volume, as well as exposure of sellar and parasellar areas. Statistical analyses were performed with Friedman's tests and Nemenyi's procedure. Hemisphenoidotomy provided limited exposure of the sellar area and a small working volume. A transrostral approach was necessary to expose the entire sella. Exposure of lateral parasellar areas required superior turbinectomy or posterior ethmoidectomy. The differences between the modules were statistically significant. The present study validates, from an anatomic point of view, a modular classification of endoscopic endonasal transsphenoidal approaches to the sellar region. Copyright © 2017 Elsevier Inc. All rights reserved.

  12. Paralytic Toxins Accumulation and Tissue Expression of α-Amylase and Lipase Genes in the Pacific Oyster Crassostrea gigas Fed with the Neurotoxic Dinoflagellate Alexandrium catenella

    PubMed Central

    Rolland, Jean-Luc; Pelletier, Kevin; Masseret, Estelle; Rieuvilleneuve, Fabien; Savar, Veronique; Santini, Adrien; Amzil, Zouher; Laabir, Mohamed

    2012-01-01

    The Pacific oyster Crassostrea gigas was experimentally exposed to the neurotoxic Alexandrium catenella and a non-producer of PSTs, Alexandrium tamarense (control algae), at concentrations corresponding to those observed during the blooming period. At fixed time intervals, from 0 to 48 h, we determined the clearance rate, the total filtered cells, the composition of the fecal ribbons, the profile of the PSP toxins and the variation of the expression of two α-amylase and triacylglycerol lipase precursor (TLP) genes through semi-quantitative RT-PCR. The results showed a significant decrease of the clearance rate of C. gigas fed with both Alexandrium species. However, from 29 to 48 h, the clearance rate and cell filtration activity increased only in oysters fed with A. tamarense. The toxin concentrations in the digestive gland rose above the sanitary threshold in less than 48 h of exposure and GTX6, a compound absent in A. catenella cells, accumulated. The α-amylase B gene expression level increased significantly in the time interval from 6 to 48 h in the digestive gland of oysters fed with A. tamarense, whereas the TLP gene transcript was significantly up-regulated in the digestive gland of oysters fed with the neurotoxic A. catenella. All together, these results suggest that the digestion capacity could be affected by PSP toxins. PMID:23203275

  13. GPU-accelerated low-latency real-time searches for gravitational waves from compact binary coalescence

    NASA Astrophysics Data System (ADS)

    Liu, Yuan; Du, Zhihui; Chung, Shin Kee; Hooper, Shaun; Blair, David; Wen, Linqing

    2012-12-01

    We present a graphics processing unit (GPU)-accelerated time-domain low-latency algorithm to search for gravitational waves (GWs) from coalescing binaries of compact objects based on the summed parallel infinite impulse response (SPIIR) filtering technique. The aim is to facilitate fast detection of GWs with a minimum delay to allow prompt electromagnetic follow-up observations. To maximize the GPU acceleration, we apply an efficient batched parallel computing model that significantly reduces the number of synchronizations in SPIIR and optimizes the usage of the memory and hardware resources. Our code is tested on the CUDA 'Fermi' architecture in a GTX 480 graphics card and its performance is compared with a single core of an Intel Core i7 920 (2.67 GHz). A 58-fold speedup is achieved while giving results in close agreement with the CPU implementation. Our result indicates that it is possible to conduct a full search for GWs from compact binary coalescence in real time with only one desktop computer equipped with a Fermi GPU card for the initial LIGO detectors, which in the past required more than 100 CPUs.
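
    The SPIIR idea is to approximate a matched-filter template by a bank of first-order IIR filters whose outputs are summed; each recursion is causal and cheap, which is what keeps the latency low. A minimal CUDA sketch under assumed, illustrative names (one thread per filter; a production code would batch templates and replace the atomics with a proper reduction):

      // SPIIR sketch: y_f[t] = a1_f * y_f[t-1] + b0_f * x[t - d_f], summed over f.
      // ySum must be zero-initialized before launch.
      #include <cuComplex.h>

      __global__ void spiirFilter(const float* x, cuFloatComplex* ySum,
                                  const cuFloatComplex* a1, const cuFloatComplex* b0,
                                  const int* delay, int nFilters, int nSamples) {
          int f = blockIdx.x * blockDim.x + threadIdx.x;
          if (f >= nFilters) return;
          cuFloatComplex y = make_cuFloatComplex(0.f, 0.f);
          for (int t = 0; t < nSamples; ++t) {
              int td = t - delay[f];                    // per-filter time delay
              float xt = (td >= 0) ? x[td] : 0.f;
              // First-order complex recursion (xt is real).
              y = cuCaddf(cuCmulf(a1[f], y),
                          make_cuFloatComplex(b0[f].x * xt, b0[f].y * xt));
              // Accumulate the filter-bank sum into the SNR time series.
              atomicAdd(&ySum[t].x, y.x);
              atomicAdd(&ySum[t].y, y.y);
          }
      }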

  14. Development and Validation of a Liquid Chromatography-Tandem Mass Spectrometry Method Coupled with Dispersive Solid-Phase Extraction for Simultaneous Quantification of Eight Paralytic Shellfish Poisoning Toxins in Shellfish

    PubMed Central

    Yang, Xianli; Zhou, Lei; Tan, Yanglan; Shi, Xizhi; Zhao, Zhiyong; Nie, Dongxia; Zhou, Changyan; Liu, Hong

    2017-01-01

    In this study, a high-performance liquid chromatography-tandem mass spectrometry (HPLC-MS/MS) method was developed for simultaneous determination of eight paralytic shellfish poisoning (PSP) toxins, including saxitoxin (STX), neosaxitoxin (NEO), gonyautoxins (GTX1–4) and the N-sulfo carbamoyl toxins C1 and C2, in sea shellfish. The samples were extracted with acetonitrile/water (80:20, v/v) containing 0.1% formic acid and purified by dispersive solid-phase extraction (dSPE) with C18 silica and acidic alumina. Qualitative and quantitative detection of the target toxins was conducted in multiple reaction monitoring (MRM) mode with positive electrospray ionization (ESI) after chromatographic separation on a TSK-gel Amide-80 HILIC column with water and acetonitrile. Matrix-matched calibration was used to compensate for matrix effects. The established method was further validated by determining the linearity (R2 ≥ 0.9900), average recovery (81.52–116.50%), sensitivity (limits of detection (LODs): 0.33–5.52 μg·kg−1; limits of quantitation (LOQs): 1.32–11.29 μg·kg−1) and precision (relative standard deviation (RSD) ≤ 19.10%). The application of the proposed approach to thirty shellfish samples proved its desirable performance and sufficient capability for simultaneous determination of multiclass PSP toxins in seafood. PMID:28661471

  15. The role of moderate-to-vigorous physical activity in mediating the relationship between central adiposity and immunometabolic profile in postmenopausal women.

    PubMed

    Diniz, Tiego A; Rossi, Fabricio E; Silveira, Loreana S; Neves, Lucas Melo; Fortaleza, Ana Claudia de Souza; Christofaro, Diego G D; Lira, Fabio S; Freitas-Junior, Ismael F

    2017-01-01

    To analyze the role of moderate-to-vigorous physical activity (MVPA) in mediating the relationship between central adiposity and immune and metabolic profile in postmenopausal women. Cross-sectional study comprising 49 postmenopausal women (aged 59.26 ± 8.32 years) without regular physical exercise practice. Body composition was measured by dual-energy X-ray absorptiometry. Fasting blood samples were collected for assessment of nonesterified fatty acids, tumor necrosis factor-α (TNF-α), interleukin-6 (IL-6), adiponectin, insulin and estimation of insulin resistance (HOMA-IR). Physical activity level was assessed with an accelerometer (Actigraph GTX3x) and reported as a percentage of time spent in sedentary behavior and MVPA. All analyses were performed using the software SPSS 17.0, with a significance level set at 5%. Sedentary women had a positive relationship between trunk fat and IL-6 (rho = 0.471; p = 0.020), and trunk fat and HOMA-IR (rho = 0.418; p = 0.042). Adiponectin and fat mass (%) were only positively correlated in physically active women (rho = 0.441; p = 0.027). Physically active women with normal trunk fat values presented a 14.7% lower chance of having increased HOMA-IR levels (β [95%CI] = 0.147 [0.027; 0.811]). The practice of sufficient levels of MVPA was a protective factor against immunometabolic disorders in postmenopausal women.

  16. A Comparison of Children's Physical Activity Levels in Physical Education, Recess, and Exergaming.

    PubMed

    Gao, Zan; Chen, Senlin; Stodden, David F

    2015-03-01

    To compare young children's different intensity physical activity (PA) levels in physical education, recess and exergaming programs. Participants were 140 first and second grade children (73 girls; mean age = 7.88 years). Beyond the daily 20-minute recess, participants attended 75-minute weekly physical education classes and another 75-minute weekly exergaming classes. Children's PA levels were assessed by ActiGraph GTX3 accelerometers for 3 sessions in the 3 programs. The outcome variables were percentages of time spent in sedentary, light PA and moderate-to-vigorous PA (MVPA). There were significant main effects for program and grade, and an interaction effect for program by grade. Specifically, children's MVPA in exergaming and recess was higher than in physical education. The second-grade children demonstrated lower sedentary behavior and MVPA than the first-grade children during recess; less light PA in both recess and exergaming than first-grade children; and less sedentary behavior but higher MVPA in exergaming than first-grade children. Young children generated higher PA levels in recess and exergaming as compared with physical education. Hence, other school-based PA programs may serve as essential components of a comprehensive school PA program. Implications are provided for educators and health professionals.

  17. Finite Element Study on Continuous Rotating versus Reciprocating Nickel-Titanium Instruments.

    PubMed

    El-Anwar, Mohamed I; Yousief, Salah A; Kataia, Engy M; El-Wahab, Tarek M Abd

    2016-01-01

    In the present study, GTX and ProTaper, as continuously rotating endodontic files, were numerically compared with the WaveOne reciprocating file using finite element analysis, aiming to provide a low-cost, accurate and trustworthy comparison as well as to determine the effect of instrument design and manufacturing material on lifespan. Two 3D finite element models were prepared specifically for this comparison. A commercial engineering CAD/CAM package was used to model the fully detailed flute geometries of the instruments. Multi-linear materials were defined in the analysis by using real stress-strain data for NiTi and M-Wire. Non-linear static analysis was performed to simulate the instrument inside a root canal at a 45° angle in the apical portion and subjected to 0.3 N·cm torsion. The three simulations in this study showed that M-Wire is slightly more resistant to failure than conventional NiTi, whereas both materials behave similarly under severe locking conditions. For the same instrument geometry, M-Wire instruments may therefore have a longer lifespan than conventional NiTi ones, while under severe locking conditions both materials will fail similarly. A larger cross-sectional area (a function of instrument taper) resisted failure better than a smaller one, while the cross-sectional shape and its cutting angles could affect instrument cutting efficiency.

  18. A multi-port 10GbE PCIe NIC featuring UDP offload and GPUDirect capabilities.

    NASA Astrophysics Data System (ADS)

    Ammendola, Roberto; Biagioni, Andrea; Frezza, Ottorino; Lamanna, Gianluca; Lo Cicero, Francesca; Lonardo, Alessandro; Martinelli, Michele; Stanislao Paolucci, Pier; Pastorelli, Elena; Pontisso, Luca; Rossetti, Davide; Simula, Francesco; Sozzi, Marco; Tosoratto, Laura; Vicini, Piero

    2015-12-01

    NaNet-10 is a four-port 10GbE PCIe Network Interface Card designed for low-latency real-time operation with GPU systems. To this purpose the design includes a UDP offload module, for fast and clock-cycle-deterministic handling of the transport-layer protocol, plus a GPUDirect P2P/RDMA engine for low-latency communication with NVIDIA Tesla GPU devices. A dedicated module (Multi-Stream) can optionally process input UDP streams before data are delivered through PCIe DMA to their destination devices, re-organizing data from different streams so as to optimize the subsequent computation. NaNet-10 is going to be integrated in the NA62 CERN experiment in order to assess the suitability of GPGPU systems as real-time triggers; results and lessons learned while performing this activity will be reported herein.

  19. Locality-Aware CTA Clustering For Modern GPUs

    SciTech Connect

    Li, Ang; Song, Shuaiwen; Liu, Weifeng

    2017-04-08

    In this paper, we proposed a novel clustering technique for tapping into the performance potential of a largely ignored type of locality: inter-CTA locality. We first demonstrated the capability of the existing GPU hardware to exploit such locality, both spatially and temporally, on the L1 or L1/Tex unified cache. To verify the potential of this locality, we quantified its existence in a broad spectrum of applications and discussed its sources of origin. Based on these insights, we proposed the concept of CTA-Clustering and its associated software techniques. Finally, we evaluated these techniques on all modern generations of NVIDIA GPU architectures. The experimental results showed that our proposed clustering techniques could significantly improve on-chip cache performance.
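
    The record does not give the authors' exact scheme, but the general software idea can be sketched: remap the logical CTA index so that blocks touching overlapping data are numbered consecutively and hence tend to be co-scheduled, improving L1/Tex reuse. A hypothetical CUDA illustration (the swizzle and cluster size are assumptions, not the paper's technique; grid dimensions are assumed to be multiples of CLUSTER):

      // Swizzle the launch-order block index into CLUSTER x CLUSTER tiles.
      __device__ int2 clusteredBlockId(int cluster) {
          int lin    = blockIdx.y * gridDim.x + blockIdx.x;   // launch order
          int perRow = gridDim.x / cluster;                   // clusters per row
          int c = lin / (cluster * cluster);                  // which cluster
          int r = lin % (cluster * cluster);                  // rank inside it
          int bx = (c % perRow) * cluster + r % cluster;
          int by = (c / perRow) * cluster + r / cluster;
          return make_int2(bx, by);
      }

      __global__ void stencilClustered(const float* in, float* out, int w, int h) {
          const int CLUSTER = 4;                              // 4x4 blocks per cluster
          int2 b = clusteredBlockId(CLUSTER);
          int x = b.x * blockDim.x + threadIdx.x;
          int y = b.y * blockDim.y + threadIdx.y;
          if (x <= 0 || y <= 0 || x >= w - 1 || y >= h - 1) return;
          // Blocks of the same cluster read overlapping rows of `in`, so
          // temporally adjacent CTAs hit in the on-chip cache more often.
          out[y * w + x] = 0.25f * (in[y*w + x-1] + in[y*w + x+1] +
                                    in[(y-1)*w + x] + in[(y+1)*w + x]);
      }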

  20. GPU Particle Tracking and MHD Simulations with Greatly Enhanced Computational Speed

    NASA Astrophysics Data System (ADS)

    Ziemba, T.; O'Donnell, D.; Carscadden, J.; Cash, M.; Winglee, R.; Harnett, E.

    2008-12-01

    GPUs are intrinsically highly parallelized systems that provide more than an order of magnitude greater computing speed than CPU-based systems, at lower cost than a high-end workstation. Recent advancements in GPU technologies allow for full IEEE floating-point specifications with performance up to several hundred GFLOPs per GPU, and new software architectures have recently become available to ease the transition from graphics-based to scientific applications. This allows for a cheap alternative to standard supercomputing methods and should shorten the time to discovery. 3-D particle tracking and MHD codes have been developed using NVIDIA's CUDA and have demonstrated speedups of nearly a factor of 20 over equivalent CPU versions of the codes. Such a speedup enables new applications, including real-time running of radiation belt simulations and real-time running of global magnetospheric simulations, both of which could provide important space weather prediction tools.
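
    Particle tracking maps well to GPUs because each particle evolves independently. As a minimal illustration (not the authors' code), a CUDA push kernel with one thread per particle and uniform E and B fields; production radiation-belt codes would interpolate the fields from a grid:

      // Simple Lorentz-force push: v += (q/m)(E + v x B) dt; r += v dt.
      struct Particle { float3 r, v; };

      __global__ void pushParticles(Particle* p, float3 E, float3 B,
                                    float qm, float dt, int n) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i >= n) return;
          float3 v = p[i].v;
          float3 a = make_float3(qm * (E.x + v.y*B.z - v.z*B.y),
                                 qm * (E.y + v.z*B.x - v.x*B.z),
                                 qm * (E.z + v.x*B.y - v.y*B.x));
          v.x += a.x * dt;  v.y += a.y * dt;  v.z += a.z * dt;
          p[i].v = v;
          p[i].r = make_float3(p[i].r.x + v.x*dt,
                               p[i].r.y + v.y*dt,
                               p[i].r.z + v.z*dt);
      }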

  1. Real-Space Density Functional Theory on Graphical Processing Units: Computational Approach and Comparison to Gaussian Basis Set Methods.

    PubMed

    Andrade, Xavier; Aspuru-Guzik, Alán

    2013-10-08

    We discuss the application of graphical processing units (GPUs) to accelerate real-space density functional theory (DFT) calculations. To make our implementation efficient, we have developed a scheme to expose the data parallelism available in the DFT approach; this is applied to the different procedures required for a real-space DFT calculation. We present results for current-generation GPUs from AMD and Nvidia, which show that our scheme, implemented in the free code Octopus, can reach a sustained performance of up to 90 GFlops for a single GPU, representing a significant speed-up when compared to the CPU version of the code. Moreover, for some systems, our implementation can outperform a GPU Gaussian basis set code, showing that the real-space approach is a competitive alternative for DFT simulations on GPUs.
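
    The data parallelism the authors expose comes from grid operations: applying the kinetic-energy operator in real space is a finite-difference Laplacian evaluated independently at every mesh point. A minimal CUDA sketch with a second-order stencil (Octopus uses higher-order stencils; all names are illustrative):

      // Second-order 3-D Laplacian, one thread per grid point.
      // invh2 = 1 / h^2 for grid spacing h; boundary layer is skipped.
      __global__ void laplacian(const double* f, double* lap,
                                int nx, int ny, int nz, double invh2) {
          int x = blockIdx.x*blockDim.x + threadIdx.x;
          int y = blockIdx.y*blockDim.y + threadIdx.y;
          int z = blockIdx.z*blockDim.z + threadIdx.z;
          if (x < 1 || y < 1 || z < 1 || x >= nx-1 || y >= ny-1 || z >= nz-1) return;
          long i = ((long)z*ny + y)*nx + x;
          long sxy = (long)nx*ny;
          lap[i] = invh2 * (f[i-1] + f[i+1] + f[i-nx] + f[i+nx]
                            + f[i-sxy] + f[i+sxy] - 6.0*f[i]);
      }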

  2. On the effective implementation of a boundary element code on graphics processing units using an out-of-core LU algorithm

    SciTech Connect

    D'Azevedo, Ed F; Nintcheu Fata, Sylvain

    2012-01-01

    A collocation boundary element code for solving the three-dimensional Laplace equation, publicly available from http://www.intetec.org, has been adapted to run on an Nvidia Tesla general-purpose graphics processing unit (GPU). Global matrix assembly and LU factorization of the resulting dense matrix were performed on the GPU. Out-of-core techniques were used to solve problems larger than the available GPU memory. The code achieved over eight times speedup in matrix assembly and about 56 Gflops/sec in the LU factorization using only 512 Mbytes of GPU memory. Details of the GPU implementation and comparisons with the standard sequential algorithm are included to illustrate the performance of the GPU code.

  3. Multi-core and GPU accelerated simulation of a radial star target imaged with equivalent t-number circular and Gaussian pupils

    NASA Astrophysics Data System (ADS)

    Greynolds, Alan W.

    2013-09-01

    Results from the GelOE optical engineering software are presented for the through-focus, monochromatic coherent and polychromatic incoherent imaging of a radial "star" target for equivalent t-number circular and Gaussian pupils. The FFT-based simulations are carried out using OpenMP threading on a multi-core desktop computer, with and without the aid of a many-core NVIDIA GPU accessing its cuFFT library. It is found that a custom FFT optimized for the 12-core host has similar performance to a simply implemented 256-core GPU FFT. A more sophisticated version of the latter but tuned to reduce overhead on a 448-core GPU is 20 to 28 times faster than a basic FFT implementation running on one CPU core.

  4. Gravitational tree-code on graphics processing units: implementation in CUDA

    NASA Astrophysics Data System (ADS)

    Gaburov, Evghenii; Bédorf, Jeroen; Portegies Zwart, Simon

    2010-05-01

    We present a new very fast tree-code which runs on massively parallel Graphical Processing Units (GPU) with NVIDIA CUDA architecture. The tree-construction and calculation of multipole moments is carried out on the host CPU, while the force calculation which consists of tree walks and evaluation of interaction list is carried out on the GPU. In this way we achieve a sustained performance of about 100GFLOP/s and data transfer rates of about 50GB/s. It takes about a second to compute forces on a million particles with an opening angle of θ ≈ 0.5. The code has a convenient user interface and is freely available for use. http://castle.strw.leidenuniv.nl/software/octgrav.html

  5. Production Portable Dynamic Task DAG Capability

    SciTech Connect

    Edwards, Harold C.; Ibanez, Daniel Alejandro

    This report documents the ASC/ATDM Kokkos deliverable "Production Portable Dynamic Task DAG Capability." This capability enables applications to create and execute a dynamic task DAG: a collection of heterogeneous computational tasks with a directed acyclic graph (DAG) of "execute after" dependencies, where tasks and their dependencies are dynamically created and destroyed as tasks execute. The Kokkos task scheduler executes the dynamic task DAG on the target execution resource, e.g. a multicore CPU, a manycore CPU such as Intel's Knights Landing (KNL), or an NVIDIA GPU. Several major technical challenges had to be addressed during development of Kokkos' Task DAG capability: (1) portability to a GPU with its simplified hardware and micro-runtime, (2) thread-scalable memory allocation and deallocation from a bounded pool of memory, (3) a thread-scalable scheduler for the dynamic task DAG, and (4) usability by applications.

  6. Accelerated Application Development: The ORNL Titan Experience

    DOE PAGES

    Joubert, Wayne; Archibald, Richard K.; Berrill, Mark A.; ...

    2015-05-09

    The use of computational accelerators such as NVIDIA GPUs and Intel Xeon Phi processors is now widespread in the high performance computing community, with many applications delivering impressive performance gains. However, programming these systems for high performance, performance portability and software maintainability has been a challenge. In this paper we discuss experiences porting applications to the Titan system. Titan, which began planning in 2009 and was deployed for general use in 2013, was the first multi-petaflop system based on accelerator hardware. To ready applications for accelerated computing, a preparedness effort was undertaken prior to delivery of Titan. In this paper we report experiences and lessons learned from this process and describe how users are currently making use of computational accelerators on Titan.

  7. Gibraltar v 1.0

    SciTech Connect

    Curry, Matthew Leon; Ward, H. Lee; Skjellum, Anthony

    Gibraltar is a library and associated test suite which performs Reed-Solomon coding and decoding of data buffers using graphics processing units which support NVIDIA's CUDA technology. This library is used to generate redundant data allowing for recovery of lost information. For example, a user can generate m new blocks of data from n original blocks, distributing those pieces over n+m devices. If any m devices fail, the contents of those devices can be recovered from the contents of the other n devices, even if some of the original blocks are lost. This is a generalized description of RAID, a technique for increasing data storage speed and size.
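
    At its core, this kind of Reed-Solomon encoding is a GF(2^8) matrix-vector product: every byte of a parity block is a Galois-field linear combination of the corresponding bytes of the n data blocks, which parallelizes perfectly across byte positions. A minimal CUDA sketch of that inner loop, assuming host-precomputed log/antilog tables for the usual 0x11d polynomial (an illustration of the technique, not Gibraltar's API):

      // GF(2^8) tables, filled on the host via cudaMemcpyToSymbol.
      // gfExp is doubled (512 entries) so gfMul needs no modulo.
      __constant__ unsigned char gfLog[256];
      __constant__ unsigned char gfExp[512];

      __device__ unsigned char gfMul(unsigned char a, unsigned char b) {
          if (a == 0 || b == 0) return 0;
          return gfExp[gfLog[a] + gfLog[b]];
      }

      // data: n blocks of blockLen bytes; coef: n coefficients for one
      // parity block (one row of the generator matrix).
      __global__ void rsEncode(const unsigned char* data, unsigned char* parity,
                               const unsigned char* coef, int n, int blockLen) {
          int j = blockIdx.x * blockDim.x + threadIdx.x;   // byte position
          if (j >= blockLen) return;
          unsigned char acc = 0;
          for (int k = 0; k < n; ++k)                      // XOR is GF(2^8) addition
              acc ^= gfMul(coef[k], data[k * blockLen + j]);
          parity[j] = acc;
      }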

  8. Toward performance portability of the Albany finite element analysis code using the Kokkos library

    DOE PAGES

    Demeshko, Irina; Watkins, Jerry; Tezaur, Irina K.; ...

    2018-02-05

    Performance portability on heterogeneous high-performance computing (HPC) systems is a major challenge faced today by code developers: parallel code needs to be executed correctly as well as with high performance on machines with different architectures, operating systems, and software libraries. The finite element method (FEM) is a popular and flexible method for discretizing partial differential equations arising in a wide variety of scientific, engineering, and industrial applications that require HPC. This paper presents some preliminary results pertaining to our development of a performance portable implementation of the FEM-based Albany code. Performance portability is achieved using the Kokkos library. We present performance results for the Aeras global atmosphere dynamical core module in Albany. Finally, numerical experiments show that our single code implementation gives reasonable performance across three multicore/many-core architectures: NVIDIA graphics processing units (GPUs), Intel Xeon Phis, and multicore CPUs.

  9. Accelerated application development: The ORNL Titan experience

    SciTech Connect

    Joubert, Wayne; Archibald, Rick; Berrill, Mark

    2015-08-01

    The use of computational accelerators such as NVIDIA GPUs and Intel Xeon Phi processors is now widespread in the high performance computing community, with many applications delivering impressive performance gains. However, programming these systems for high performance, performance portability and software maintainability has been a challenge. In this paper we discuss experiences porting applications to the Titan system. Titan, which began planning in 2009 and was deployed for general use in 2013, was the first multi-petaflop system based on accelerator hardware. To ready applications for accelerated computing, a preparedness effort was undertaken prior to delivery of Titan. In this paper we report experiences and lessons learned from this process and describe how users are currently making use of computational accelerators on Titan.

  10. First experience of vectorizing electromagnetic physics models for detector simulation

    NASA Astrophysics Data System (ADS)

    Amadio, G.; Apostolakis, J.; Bandieramonte, M.; Bianchini, C.; Bitzes, G.; Brun, R.; Canal, P.; Carminati, F.; de Fine Licht, J.; Duhem, L.; Elvira, D.; Gheata, A.; Jun, S. Y.; Lima, G.; Novak, M.; Presbyterian, M.; Shadura, O.; Seghal, R.; Wenzel, S.

    2015-12-01

    The recent emergence of hardware architectures characterized by many-core or accelerated processors has opened new opportunities for concurrent programming models taking advantage of both SIMD and SIMT architectures. The GeantV vector prototype for detector simulations has been designed to exploit both the vector capability of mainstream CPUs and multi-threading capabilities of coprocessors including NVidia GPUs and Intel Xeon Phi. The characteristics of these architectures are very different in terms of the vectorization depth, parallelization needed to achieve optimal performance or memory access latency and speed. An additional challenge is to avoid the code duplication often inherent to supporting heterogeneous platforms. In this paper we present the first experience of vectorizing electromagnetic physics models developed for the GeantV project.

  11. GPU-based real-time trinocular stereo vision

    NASA Astrophysics Data System (ADS)

    Yao, Yuanbin; Linton, R. J.; Padir, Taskin

    2013-01-01

    Most stereovision applications are binocular, using information from a 2-camera array to perform stereo matching and compute the depth image. Trinocular stereovision with a 3-camera array has been shown to provide higher accuracy in stereo matching, which can benefit applications like distance finding, object recognition, and detection. This paper presents a real-time stereovision algorithm implemented on a GPGPU (general-purpose graphics processing unit) using a trinocular stereovision camera array. The algorithm employs a winner-take-all method to fuse disparities computed in different directions, following various image processing techniques, to obtain the depth information. The goal of the algorithm is to achieve real-time processing speed with the help of a GPGPU, using the Open Source Computer Vision Library (OpenCV) in C++ and NVIDIA's CUDA. The results are compared in accuracy and speed to verify the improvement.
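
    The winner-take-all step can be sketched for a single camera pair: for every pixel, scan the candidate disparities and keep the one minimizing a window cost; the trinocular system then fuses the per-pair winners. A minimal CUDA illustration with a sum-of-absolute-differences cost (window size and disparity range are assumptions, not the paper's values):

      // One thread per pixel: exhaustive disparity search with 5x5 SAD cost.
      __global__ void wtaDisparity(const unsigned char* left,
                                   const unsigned char* right,
                                   unsigned char* disp, int w, int h, int maxD) {
          int x = blockIdx.x*blockDim.x + threadIdx.x;
          int y = blockIdx.y*blockDim.y + threadIdx.y;
          const int R = 2;                                   // 5x5 window radius
          if (x < R + maxD || y < R || x >= w - R || y >= h - R) return;
          int bestD = 0, bestCost = 0x7fffffff;
          for (int d = 0; d < maxD; ++d) {
              int cost = 0;
              for (int dy = -R; dy <= R; ++dy)
                  for (int dx = -R; dx <= R; ++dx)
                      cost += abs((int)left [(y+dy)*w + x+dx] -
                                  (int)right[(y+dy)*w + x+dx - d]);
              if (cost < bestCost) { bestCost = cost; bestD = d; }  // winner takes all
          }
          disp[y*w + x] = (unsigned char)bestD;
      }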

  12. GPU accelerated fuzzy connected image segmentation by using CUDA.

    PubMed

    Zhuge, Ying; Cao, Yong; Miller, Robert W

    2009-01-01

    Image segmentation techniques using fuzzy connectedness principles have shown their effectiveness in segmenting a variety of objects in several large applications in recent years. However, one problem of these algorithms has been their excessive computational requirements when processing large image datasets. Nowadays commodity graphics hardware provides high parallel computing power. In this paper, we present a parallel fuzzy connected image segmentation algorithm on Nvidia's Compute Unified Device Architecture (CUDA) platform for segmenting large medical image data sets. Our experiments based on three data sets with small, medium, and large data size demonstrate the efficiency of the parallel algorithm, which achieves speed-up factors of 7.2x, 7.3x, and 14.4x, respectively, for the three data sets over the sequential implementation of the fuzzy connected image segmentation algorithm on CPU.

  13. GPU accelerated Monte Carlo simulation of Brownian motors dynamics with CUDA

    NASA Astrophysics Data System (ADS)

    Spiechowicz, J.; Kostur, M.; Machura, L.

    2015-06-01

    This work presents an updated and extended guide to methods for properly accelerating the Monte Carlo integration of stochastic differential equations with the commonly available NVIDIA Graphics Processing Units using the CUDA programming environment. We outline the general aspects of scientific computing on graphics cards and demonstrate them with two models of the well known phenomenon of noise-induced transport of Brownian motors in periodic structures. As a source of fluctuations in the considered systems we selected the three most commonly occurring noises: Gaussian white noise, white Poissonian noise and the dichotomous process, also known as a random telegraph signal. A detailed discussion of various aspects of the applied numerical schemes is also presented. The measured speedup can reach an astonishing factor of about 3000 compared to a typical CPU. This significantly expands the range of problems solvable by stochastic simulations, allowing even interactive research in some cases.
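
    For the Gaussian white noise case, the workhorse is an Euler-Maruyama step of an overdamped Langevin equation, with one thread per independent trajectory; the quoted speedups come from running very many such trajectories at once. A minimal CUDA sketch with an assumed cosine potential of period 1 (illustrative, not the paper's code):

      // x' = -V'(x) + f + sqrt(2T) xi(t), with V(x) = -cos(2 pi x).
      #include <curand_kernel.h>
      #include <math_constants.h>

      __global__ void langevinStep(float* x, curandState* rng, float force,
                                   float temperature, float dt, int n) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i >= n) return;
          curandState s = rng[i];
          // Deterministic drift: -V'(x) + f.
          float drift = -2.f*CUDART_PI_F * sinf(2.f*CUDART_PI_F * x[i]) + force;
          // Euler-Maruyama noise increment, variance 2*T*dt.
          float noise = sqrtf(2.f * temperature * dt) * curand_normal(&s);
          x[i] += drift * dt + noise;
          rng[i] = s;
      }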

  14. High-Performance 3D Compressive Sensing MRI Reconstruction Using Many-Core Architectures

    PubMed Central

    Kim, Daehyun; Trzasko, Joshua; Smelyanskiy, Mikhail; Haider, Clifton; Dubey, Pradeep; Manduca, Armando

    2011-01-01

    Compressive sensing (CS) describes how sparse signals can be accurately reconstructed from many fewer samples than required by the Nyquist criterion. Since MRI scan duration is proportional to the number of acquired samples, CS has been gaining significant attention in MRI. However, the computationally intensive nature of CS reconstructions has precluded their use in routine clinical practice. In this work, we investigate how different throughput-oriented architectures can benefit one CS algorithm and what levels of acceleration are feasible on different modern platforms. We demonstrate that a CUDA-based code running on an NVIDIA Tesla C2050 GPU can reconstruct a 256 × 160 × 80 volume from an 8-channel acquisition in 19 seconds, which is in itself a significant improvement over the state of the art. We then show that Intel's Knights Ferry can perform the same 3D MRI reconstruction in only 12 seconds, bringing CS methods even closer to clinical viability. PMID:21922017

  15. MEGADOCK 4.0: an ultra-high-performance protein-protein docking software for heterogeneous supercomputers.

    PubMed

    Ohue, Masahito; Shimoda, Takehiro; Suzuki, Shuji; Matsuzaki, Yuri; Ishida, Takashi; Akiyama, Yutaka

    2014-11-15

    The application of protein-protein docking in large-scale interactome analysis is a major challenge in structural bioinformatics and requires huge computing resources. In this work, we present MEGADOCK 4.0, an FFT-based docking software that makes extensive use of recent heterogeneous supercomputers and shows powerful, scalable performance of >97% strong scaling. MEGADOCK 4.0 is written in C++ with OpenMPI and NVIDIA CUDA 5.0 (or later) and is freely available to all academic and non-profit users at: http://www.bi.cs.titech.ac.jp/megadock. akiyama@cs.titech.ac.jp Supplementary data are available at Bioinformatics online. © The Author 2014. Published by Oxford University Press.

  16. Real-time dedispersion for fast radio transient surveys, using auto tuning on many-core accelerators

    NASA Astrophysics Data System (ADS)

    Sclocco, A.; van Leeuwen, J.; Bal, H. E.; van Nieuwpoort, R. V.

    2016-01-01

    Dedispersion, the removal of deleterious smearing of impulsive signals by the interstellar matter, is one of the most intensive processing steps in any radio survey for pulsars and fast transients. We here present a study of the parallelization of this algorithm on many-core accelerators, including GPUs from AMD and NVIDIA, and the Intel Xeon Phi. We find that dedispersion is inherently memory-bound. Even in a perfect scenario, hardware limitations keep the arithmetic intensity low, thus limiting performance. We next exploit auto-tuning to adapt dedispersion to different accelerators, observations, and even telescopes. We demonstrate that the optimal settings differ between observational setups, and that auto-tuning significantly improves performance. This impacts time-domain surveys from Apertif to SKA.
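
    The memory-bound character the authors describe is visible in the kernel itself: brute-force incoherent dedispersion does one addition per loaded sample. A minimal CUDA sketch, assuming precomputed per-channel sample delays for each trial dispersion measure and an output length chosen so that every shifted read stays in bounds (names are illustrative):

      // Sum each frequency channel at its DM-dependent delay.
      // Launch with gridDim.y == nDMs; shift[dm*nChans + c] holds delays,
      // and nOut + max(shift) <= nSamples must hold.
      __global__ void dedisperse(const float* in, float* out, const int* shift,
                                 int nChans, int nSamples, int nDMs, int nOut) {
          int t  = blockIdx.x * blockDim.x + threadIdx.x;   // output sample
          int dm = blockIdx.y;                              // trial DM
          if (t >= nOut || dm >= nDMs) return;
          float sum = 0.f;
          for (int c = 0; c < nChans; ++c)                  // one add per load:
              sum += in[c * nSamples + t + shift[dm * nChans + c]];
          out[dm * nOut + t] = sum;                         // low arithmetic intensity
      }

    Auto-tuning in this setting amounts to searching over block shapes, per-thread work and data layouts for each accelerator and observational setup, since no single configuration is optimal everywhere.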

  17. Exact diagonalization of quantum lattice models on coprocessors

    NASA Astrophysics Data System (ADS)

    Siro, T.; Harju, A.

    2016-10-01

    We implement the Lanczos algorithm on an Intel Xeon Phi coprocessor and compare its performance to a multi-core Intel Xeon CPU and an NVIDIA graphics processor. The Xeon and the Xeon Phi are parallelized with OpenMP and the graphics processor is programmed with CUDA. The performance is evaluated by measuring the execution time of a single step in the Lanczos algorithm. We study two quantum lattice models with different particle numbers, and conclude that for small systems, the multi-core CPU is the fastest platform, while for large systems, the graphics processor is the clear winner, reaching speedups of up to 7.6 compared to the CPU. The Xeon Phi outperforms the CPU with sufficiently large particle number, reaching a speedup of 2.5.

  18. The Process of Parallelizing the Conjunction Prediction Algorithm of ESA's SSA Conjunction Prediction Service Using GPGPU

    NASA Astrophysics Data System (ADS)

    Fehr, M.; Navarro, V.; Martin, L.; Fletcher, E.

    2013-08-01

    Space Situational Awareness[8] (SSA) is defined as the comprehensive knowledge, understanding and maintained awareness of the population of space objects, the space environment and existing threats and risks. As ESA's SSA Conjunction Prediction Service (CPS) requires the repetitive application of a processing algorithm against a data set of man-made space objects, it is crucial to exploit the highly parallelizable nature of this problem. Currently the CPS system makes use of OpenMP[7] for parallelization purposes using CPU threads, but only a GPU with its hundreds of cores can fully benefit from such high levels of parallelism. This paper presents the adaptation of several core algorithms[5] of the CPS for general-purpose computing on graphics processing units (GPGPU) using NVIDIA's Compute Unified Device Architecture (CUDA).

  19. CUDAEASY - a GPU accelerated cosmological lattice program

    NASA Astrophysics Data System (ADS)

    Sainio, J.

    2010-05-01

    This paper presents, to the author's knowledge, the first graphics processing unit (GPU) accelerated program that solves the evolution of interacting scalar fields in an expanding universe. We present the implementation in NVIDIA's Compute Unified Device Architecture (CUDA) and compare the performance to other similar programs in chaotic inflation models. We report speedups between one and two orders of magnitude depending on the used hardware and software while achieving small errors in single precision. Simulations that used to last roughly one day to compute can now be done in hours and this difference is expected to increase in the future. The program has been written in the spirit of LATTICEEASY and users of the aforementioned program should find it relatively easy to start using CUDAEASY in lattice simulations. The program is available at http://www.physics.utu.fi/theory/particlecosmology/cudaeasy/ under the GNU General Public License.

  20. Overview of implementation of DARPA GPU program in SAIC

    NASA Astrophysics Data System (ADS)

    Braunreiter, Dennis; Furtek, Jeremy; Chen, Hai-Wen; Healy, Dennis

    2008-04-01

    This paper reviews the implementation of the DARPA MTO STAP-BOY program, Phases I and II, conducted at Science Applications International Corporation (SAIC). The STAP-BOY program develops fast covariance factorization and tuning techniques for space-time adaptive processing (STAP) algorithm implementation on graphics processing unit (GPU) architectures for embedded systems. The first part of our presentation on the DARPA STAP-BOY program focuses on GPU implementation and algorithm innovations for a prototype radar STAP algorithm. The STAP algorithm is implemented on the GPU using stream programming (from companies such as PeakStream, ATI Technologies' CTM, and NVIDIA) and traditional graphics APIs. The algorithm includes fast range-adaptive STAP weight updates and beamforming applications, each of which has been modified to exploit the parallel nature of graphics architectures.

  1. A rapid parallelization of cone-beam projection and back-projection operator based on texture fetching interpolation

    NASA Astrophysics Data System (ADS)

    Xie, Lizhe; Hu, Yining; Chen, Yang; Shi, Luyao

    2015-03-01

    Projection and back-projection are the most computationally consuming parts of computed tomography (CT) reconstruction. Parallelization strategies using GPU computing techniques have been introduced. In this paper we present a new parallelization scheme for both projection and back-projection. The proposed method is based on the CUDA technology provided by the NVIDIA Corporation. Instead of building a complex model, we aimed at optimizing the existing algorithm and making it suitable for CUDA implementation so as to gain fast computation speed. Besides making use of the texture fetching operation, which provides faster interpolation, we fixed the number of samples in the computation of each projection to ensure the synchronization of blocks and threads, thus preventing the latency caused by inconsistent computational complexity. Experimental results demonstrate the computational efficiency and imaging quality of the proposed method.
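
    A minimal CUDA sketch of the two ideas the abstract describes, under assumed names: volume reads go through a 3-D texture object so the texture hardware performs the trilinear interpolation, and every ray takes the same fixed number of samples so threads stay in lockstep. Geometry setup (the cudaArray, texture object creation, ray origins/directions) is omitted.

      // Forward projection: one thread per ray, fixed sample count.
      __global__ void forwardProject(cudaTextureObject_t vol, float* sino,
                                     const float3* rayOrg, const float3* rayDir,
                                     int nRays, int nSamples, float step) {
          int r = blockIdx.x * blockDim.x + threadIdx.x;
          if (r >= nRays) return;
          float3 o = rayOrg[r], d = rayDir[r];
          float acc = 0.f;
          for (int k = 0; k < nSamples; ++k) {              // same trip count for
              float t = k * step;                           // all threads: no divergence
              // Hardware trilinear interpolation via the texture unit.
              acc += tex3D<float>(vol, o.x + t*d.x, o.y + t*d.y, o.z + t*d.z);
          }
          sino[r] = acc * step;                             // approximate line integral
      }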

  2. High-Performance 3D Compressive Sensing MRI Reconstruction Using Many-Core Architectures.

    PubMed

    Kim, Daehyun; Trzasko, Joshua; Smelyanskiy, Mikhail; Haider, Clifton; Dubey, Pradeep; Manduca, Armando

    2011-01-01

    Compressive sensing (CS) describes how sparse signals can be accurately reconstructed from many fewer samples than required by the Nyquist criterion. Since MRI scan duration is proportional to the number of acquired samples, CS has been gaining significant attention in MRI. However, the computationally intensive nature of CS reconstructions has precluded their use in routine clinical practice. In this work, we investigate how different throughput-oriented architectures can benefit one CS algorithm and what levels of acceleration are feasible on different modern platforms. We demonstrate that a CUDA-based code running on an NVIDIA Tesla C2050 GPU can reconstruct a 256 × 160 × 80 volume from an 8-channel acquisition in 19 seconds, which is in itself a significant improvement over the state of the art. We then show that Intel's Knights Ferry can perform the same 3D MRI reconstruction in only 12 seconds, bringing CS methods even closer to clinical viability.

  3. MGUPGMA: A Fast UPGMA Algorithm With Multiple Graphics Processing Units Using NCCL

    PubMed Central

    Hua, Guan-Jie; Hung, Che-Lun; Lin, Chun-Yuan; Wu, Fu-Che; Chan, Yu-Wei; Tang, Chuan Yi

    2017-01-01

    A phylogenetic tree is a visual diagram of the relationships among a set of biological species. Scientists usually use it to analyze many characteristics of the species. Distance-matrix methods, such as Unweighted Pair Group Method with Arithmetic Mean and Neighbor Joining, construct a phylogenetic tree by calculating pairwise genetic distances between taxa. These methods suffer from computational performance issues. Although several new methods with high-performance hardware and frameworks have been proposed, the issue still exists. In this work, a novel parallel Unweighted Pair Group Method with Arithmetic Mean approach on multiple Graphics Processing Units is proposed to construct a phylogenetic tree from an extremely large set of sequences. The experimental results show that the proposed approach on a DGX-1 server with 8 NVIDIA P100 graphics cards achieves approximately 3-fold to 7-fold speedups over implementations of Unweighted Pair Group Method with Arithmetic Mean on a modern CPU and a single GPU, respectively. PMID:29051701
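
    The multi-GPU step described above maps naturally onto NCCL's collective operations. A minimal sketch (assumed usage, not the MGUPGMA source; it presumes at most 8 GPUs and per-GPU partial distance sums already resident in device memory):

        // Sum partial distance matrices across GPUs with one NCCL all-reduce.
        #include <cuda_runtime.h>
        #include <nccl.h>

        void reducePartialDistances(float **dDist, size_t nPairs, int nGpus)
        {
            ncclComm_t comms[8];                       // assumes nGpus <= 8
            int devs[8];
            for (int i = 0; i < nGpus; ++i) devs[i] = i;
            ncclCommInitAll(comms, nGpus, devs);       // one communicator per GPU

            ncclGroupStart();                          // fuse the per-GPU calls
            for (int i = 0; i < nGpus; ++i) {
                cudaSetDevice(i);
                // In-place sum: afterwards every GPU holds the full matrix.
                ncclAllReduce(dDist[i], dDist[i], nPairs,
                              ncclFloat, ncclSum, comms[i], 0);
            }
            ncclGroupEnd();

            for (int i = 0; i < nGpus; ++i) ncclCommDestroy(comms[i]);
        }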

  4. MGUPGMA: A Fast UPGMA Algorithm With Multiple Graphics Processing Units Using NCCL.

    PubMed

    Hua, Guan-Jie; Hung, Che-Lun; Lin, Chun-Yuan; Wu, Fu-Che; Chan, Yu-Wei; Tang, Chuan Yi

    2017-01-01

    A phylogenetic tree is a visual diagram of the relationships among a set of biological species. Scientists usually use it to analyze many characteristics of the species. Distance-matrix methods, such as Unweighted Pair Group Method with Arithmetic Mean and Neighbor Joining, construct a phylogenetic tree by calculating pairwise genetic distances between taxa. These methods suffer from computational performance issues. Although several new methods with high-performance hardware and frameworks have been proposed, the issue still exists. In this work, a novel parallel Unweighted Pair Group Method with Arithmetic Mean approach on multiple Graphics Processing Units is proposed to construct a phylogenetic tree from an extremely large set of sequences. The experimental results show that the proposed approach on a DGX-1 server with 8 NVIDIA P100 graphics cards achieves approximately 3-fold to 7-fold speedups over implementations of Unweighted Pair Group Method with Arithmetic Mean on a modern CPU and a single GPU, respectively.

  5. SciTech Connect

    Messer, Bronson; Harris, James A; Parete-Koon, Suzanne T

    We describe recent development work on the core-collapse supernova code CHIMERA. CHIMERA has consumed more than 100 million CPU-hours on Oak Ridge Leadership Computing Facility (OLCF) platforms in the past 3 years, ranking it among the most important applications at the OLCF. Most of the work described has been focused on exploiting the multicore nature of the current platform (Jaguar) via, e.g., multithreading using OpenMP. In addition, we have begun a major effort to marshal the computational power of GPUs with CHIMERA. The impending upgrade of Jaguar to Titan, a 20+ PF machine with an NVIDIA GPU on many nodes, makes this work essential.

  6. Toward performance portability of the Albany finite element analysis code using the Kokkos library

    SciTech Connect

    Demeshko, Irina; Watkins, Jerry; Tezaur, Irina K.

    Performance portability on heterogeneous high-performance computing (HPC) systems is a major challenge faced today by code developers: parallel code needs to be executed correctly as well as with high performance on machines with different architectures, operating systems, and software libraries. The finite element method (FEM) is a popular and flexible method for discretizing partial differential equations arising in a wide variety of scientific, engineering, and industrial applications that require HPC. This paper presents some preliminary results pertaining to our development of a performance portable implementation of the FEM-based Albany code. Performance portability is achieved using the Kokkos library. We present performance results for the Aeras global atmosphere dynamical core module in Albany. Finally, numerical experiments show that our single code implementation gives reasonable performance across three multicore/many-core architectures: NVIDIA Graphics Processing Units (GPUs), Intel Xeon Phis, and multicore CPUs.

  7. A GPU-paralleled implementation of an enhanced face recognition algorithm

    NASA Astrophysics Data System (ADS)

    Chen, Hao; Liu, Xiyang; Shao, Shuai; Zan, Jiguo

    2013-03-01

    Face recognition based on compressed sensing and sparse representation has been a topic of active debate in recent years. This scheme increases the recognition rate as well as the anti-noise capability. However, the computational cost is high and has become a main restricting factor for real-world applications. In this paper, we introduce a GPU-accelerated hybrid variant of the face recognition algorithm named parallel face recognition algorithm (pFRA). We describe here how to carry out a parallel optimization design to take full advantage of the many-core structure of a GPU. The pFRA is tested and compared with several other implementations under different data sample sizes. Finally, our pFRA, implemented with an NVIDIA GPU and the Compute Unified Device Architecture (CUDA) programming model, achieves a significant speedup over traditional CPU implementations.

  8. Electromagnetic physics models for parallel computing architectures

    DOE PAGES

    Amadio, G.; Ananya, A.; Apostolakis, J.; ...

    2016-11-21

    The recent emergence of hardware architectures characterized by many-core or accelerated processors has opened new opportunities for concurrent programming models taking advantage of both SIMD and SIMT architectures. GeantV, a next generation detector simulation, has been designed to exploit both the vector capability of mainstream CPUs and multi-threading capabilities of coprocessors including NVidia GPUs and Intel Xeon Phi. The characteristics of these architectures are very different in terms of the vectorization depth and type of parallelization needed to achieve optimal performance. In this paper we describe implementation of electromagnetic physics models developed for parallel computing architectures as a part of the GeantV project. Finally, the results of preliminary performance evaluation and physics validation are presented as well.

  9. A GPU-based calculation using the three-dimensional FDTD method for electromagnetic field analysis.

    PubMed

    Nagaoka, Tomoaki; Watanabe, Soichi

    2010-01-01

    Numerical simulations with the numerical human model using the finite-difference time domain (FDTD) method have recently been performed frequently in a number of fields in biomedical engineering. However, the FDTD calculation runs too slowly. We focus, therefore, on general-purpose programming on the graphics processing unit (GPGPU). The three-dimensional FDTD method was implemented on the GPU using the Compute Unified Device Architecture (CUDA). In this study, we used the NVIDIA Tesla C1060 as a GPGPU board. The performance of the GPU is evaluated in comparison with the performance of a conventional CPU and a vector supercomputer. The results indicate that three-dimensional FDTD calculations using a GPU can significantly reduce run time in comparison with a conventional CPU, even for a naive GPU implementation of the three-dimensional FDTD method, while the GPU/CPU speed ratio varies with the calculation domain and thread block size.
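
    The structure of such a GPU port is easy to picture: each FDTD field update is an independent stencil, so one CUDA thread can own one grid point. The kernel below is an illustrative sketch of a single Ez update step (a generic Yee-lattice form with precomputed coefficients c1 and c2; not the code evaluated in the paper):

        #include <cuda_runtime.h>

        // One thread per grid point; ez, hx, hy are nx*ny*nz device arrays.
        __global__ void updateEz(float *ez, const float *hx, const float *hy,
                                 int nx, int ny, int nz, float c1, float c2)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            int j = blockIdx.y * blockDim.y + threadIdx.y;
            int k = blockIdx.z * blockDim.z + threadIdx.z;
            if (i < 1 || i >= nx || j < 1 || j >= ny || k >= nz) return;

            int id = (k * ny + j) * nx + i;          // linearized 3D index
            float curlH = (hy[id] - hy[id - 1])      // dHy/dx
                        - (hx[id] - hx[id - nx]);    // dHx/dy
            ez[id] = c1 * ez[id] + c2 * curlH;       // Yee update for Ez
        }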

  10. Multi-GPU accelerated three-dimensional FDTD method for electromagnetic simulation.

    PubMed

    Nagaoka, Tomoaki; Watanabe, Soichi

    2011-01-01

    Numerical simulation with a numerical human model using the finite-difference time domain (FDTD) method has recently been performed in a number of fields in biomedical engineering. To improve the method's calculation speed and realize large-scale computing with the numerical human model, we adapted the three-dimensional FDTD code to a multi-GPU environment using the Compute Unified Device Architecture (CUDA). In this study, we used NVIDIA Tesla C2070 cards as GPGPU boards. The performance of multiple GPUs is evaluated in comparison with that of a single GPU and a vector supercomputer. The calculation speed with four GPUs was approximately 3.5 times faster than with a single GPU, and was slightly (approx. 1.3 times) slower than with the supercomputer. The calculation speed of the three-dimensional FDTD method using GPUs improves significantly as the number of GPUs increases.

  11. Accelerating separable footprint (SF) forward and back projection on GPU

    NASA Astrophysics Data System (ADS)

    Xie, Xiaobin; McGaffin, Madison G.; Long, Yong; Fessler, Jeffrey A.; Wen, Minhua; Lin, James

    2017-03-01

    Statistical image reconstruction (SIR) methods for X-ray CT can improve image quality and reduce radiation dosages over conventional reconstruction methods, such as filtered back projection (FBP). However, SIR methods require much longer computation time. The separable footprint (SF) forward and back projection technique simplifies the calculation of intersecting volumes of image voxels and finite-size beams in a way that is both accurate and efficient for parallel implementation. We propose a new method to accelerate the SF forward and back projection on GPU with NVIDIA's CUDA environment. For the forward projection, we parallelize over all detector cells. For the back projection, we parallelize over all 3D image voxels. The simulation results show that the proposed method is faster than the acceleration method of the SF projectors proposed by Wu and Fessler [13]. We further accelerate the proposed method using multiple GPUs. The results show that the computation time is reduced approximately in proportion to the number of GPUs.

  12. Genetically improved BarraCUDA.

    PubMed

    Langdon, W B; Lam, Brian Yee Hong

    2017-01-01

    BarraCUDA is an open source C program which uses the BWA algorithm in parallel with nVidia CUDA to align short next generation DNA sequences against a reference genome. Recently its source code was optimised using "Genetic Improvement". The genetically improved (GI) code is up to three times faster on short paired end reads from The 1000 Genomes Project and 60% more accurate on a short BioPlanet.com GCAT alignment benchmark. GPGPU BarraCUDA running on a single K80 Tesla GPU can align short paired end nextGen sequences up to ten times faster than bwa on a 12 core server. The speed up was such that the GI version was adopted and has been regularly downloaded from SourceForge for more than 12 months.

  13. Model-independent partial wave analysis using a massively-parallel fitting framework

    NASA Astrophysics Data System (ADS)

    Sun, L.; Aoude, R.; dos Reis, A. C.; Sokoloff, M.

    2017-10-01

    The functionality of GooFit, a GPU-friendly framework for doing maximum-likelihood fits, has been extended to extract model-independent S-wave amplitudes in three-body decays such as D+ → h+h+h−. A full amplitude analysis is done where the magnitudes and phases of the S-wave amplitudes are anchored at a finite number of m²(h+h−) control points, and a cubic spline is used to interpolate between these points. The amplitudes for P-wave and D-wave intermediate states are modeled as spin-dependent Breit-Wigner resonances. GooFit uses the Thrust library, with a CUDA backend for NVIDIA GPUs and an OpenMP backend for threads with conventional CPUs. Performance on a variety of platforms is compared. Executing on systems with GPUs is typically a few hundred times faster than executing the same algorithm on a single CPU.
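
    The Thrust pattern the abstract refers to can be sketched in a few lines (a toy lineshape with made-up parameters, not GooFit's actual PDF machinery): the same functor compiles against the CUDA backend for GPUs or the OpenMP backend for CPU threads.

        #include <thrust/device_vector.h>
        #include <thrust/transform_reduce.h>
        #include <thrust/functional.h>
        #include <cmath>

        struct NegLogPdf {                 // toy Breit-Wigner-like density
            double mean, width;
            __host__ __device__ double operator()(double m2) const {
                double d = m2 - mean;
                return -log(1.0 / (d * d + 0.25 * width * width));
            }
        };

        // Negative log-likelihood over all candidate events, reduced in parallel.
        double nll(const thrust::device_vector<double> &m2, double mean, double width)
        {
            return thrust::transform_reduce(m2.begin(), m2.end(),
                                            NegLogPdf{mean, width},
                                            0.0, thrust::plus<double>());
        }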

  14. Rapid automated classification of anesthetic depth levels using GPU based parallelization of neural networks.

    PubMed

    Peker, Musa; Şen, Baha; Gürüler, Hüseyin

    2015-02-01

    The effect of anesthesia on the patient is referred to as the depth of anesthesia. Rapid classification of the appropriate depth of anesthesia is a matter of great importance in surgical operations. Similarly, accelerating classification algorithms is important for the rapid solution of problems in the field of biomedical signal processing. However, numerous time-consuming mathematical operations are required during the training and testing stages of classification algorithms, especially in neural networks. In this study, to accelerate the process, the Nvidia CUDA parallel programming and computing platform, which facilitates dramatic increases in computing performance by harnessing the power of the graphics processing unit (GPU), was utilized. The system was employed to detect the anesthetic depth level on a related electroencephalogram (EEG) data set. This data set is rather complex and large. Moreover, achieving more anesthetic levels with rapid response is critical in anesthesia. The proposed parallelization method yielded highly accurate classification results in a shorter time.

  15. GPU-accelerated phase extraction algorithm for interferograms: a real-time application

    NASA Astrophysics Data System (ADS)

    Zhu, Xiaoqiang; Wu, Yongqian; Liu, Fengwei

    2016-11-01

    Optical testing, having the merits of non-destructiveness and high sensitivity, provides a vital guideline for optical manufacturing. But the testing process is often computationally intensive and expensive, usually taking up to a few seconds, which is unacceptable for dynamic testing. In this paper, a GPU-accelerated phase extraction algorithm is proposed, based on the advanced iterative algorithm. The accelerated algorithm can extract the correct phase distribution from thirteen 1024x1024 fringe patterns with arbitrary phase shifts in 233 milliseconds on average using an NVIDIA Quadro 4000 graphics card, a 12.7x speedup over the same algorithm executed on a CPU and a 6.6x speedup over a Matlab implementation on a DWANING W5801 workstation. The performance improvement can fulfill the demands of computational accuracy and real-time application.

  16. Performance of GeantV EM Physics Models

    NASA Astrophysics Data System (ADS)

    Amadio, G.; Ananya, A.; Apostolakis, J.; Aurora, A.; Bandieramonte, M.; Bhattacharyya, A.; Bianchini, C.; Brun, R.; Canal, P.; Carminati, F.; Cosmo, G.; Duhem, L.; Elvira, D.; Folger, G.; Gheata, A.; Gheata, M.; Goulas, I.; Iope, R.; Jun, S. Y.; Lima, G.; Mohanty, A.; Nikitina, T.; Novak, M.; Pokorski, W.; Ribon, A.; Seghal, R.; Shadura, O.; Vallecorsa, S.; Wenzel, S.; Zhang, Y.

    2017-10-01

    The recent progress in parallel hardware architectures with deeper vector pipelines or many-cores technologies brings opportunities for HEP experiments to take advantage of SIMD and SIMT computing models. Launched in 2013, the GeantV project studies performance gains in propagating multiple particles in parallel, improving instruction throughput and data locality in HEP event simulation on modern parallel hardware architecture. Due to the complexity of geometry description and physics algorithms of a typical HEP application, performance analysis is indispensable in identifying factors limiting parallel execution. In this report, we will present design considerations and preliminary computing performance of GeantV physics models on coprocessors (Intel Xeon Phi and NVidia GPUs) as well as on mainstream CPUs.

  17. Hypergraph partitioning implementation for parallelizing matrix-vector multiplication using CUDA GPU-based parallel computing

    NASA Astrophysics Data System (ADS)

    Murni, Bustamam, A.; Ernastuti, Handhika, T.; Kerami, D.

    2017-07-01

    Calculation of matrix-vector multiplication in real-world problems often involves large matrices of arbitrary size. Therefore, parallelization is needed to speed up the calculation process, which usually takes a long time. Graph partitioning techniques discussed in previous studies cannot be used to parallelize matrix-vector multiplication for matrices of arbitrary size, because graph partitioning techniques assume a square and symmetric matrix. Hypergraph partitioning techniques overcome this shortcoming of graph partitioning. This paper addresses the efficient parallelization of matrix-vector multiplication through hypergraph partitioning techniques using CUDA GPU-based parallel computing. CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model that was created by NVIDIA and is implemented on the GPU (graphics processing unit).
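
    On the kernel side, the standard CUDA mapping for a sparse matrix of arbitrary size is one thread per row of a CSR matrix; the partitioning determines which rows (and which entries of x) land on which device. A minimal sketch, independent of the paper's code:

        #include <cuda_runtime.h>

        // y = A * x for a CSR matrix; one thread computes one row of y.
        __global__ void spmvCsr(int nRows, const int *rowPtr, const int *colIdx,
                                const double *val, const double *x, double *y)
        {
            int row = blockIdx.x * blockDim.x + threadIdx.x;
            if (row >= nRows) return;
            double sum = 0.0;
            for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j)
                sum += val[j] * x[colIdx[j]];    // only stored non-zeros touched
            y[row] = sum;
        }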

  18. High-performance computing on GPUs for resistivity logging of oil and gas wells

    NASA Astrophysics Data System (ADS)

    Glinskikh, V.; Dudaev, A.; Nechaev, O.; Surodina, I.

    2017-10-01

    We developed and implemented in software an algorithm for high-performance simulation of electrical logs from oil and gas wells using heterogeneous computing. The numerical solution of the 2D forward problem is based on the finite-element method and the Cholesky decomposition for solving a system of linear algebraic equations (SLAE). Software implementations of the algorithm were made using NVIDIA CUDA technology and computing libraries, allowing us to perform the decomposition of the SLAE and find its solution on the central processing unit (CPU) and the graphics processing unit (GPU). The calculation time is analyzed as a function of the matrix size and the number of its non-zero elements. We estimated the computing speed on CPU and GPU, including high-performance heterogeneous CPU-GPU computing. Using the developed algorithm, we simulated resistivity data in realistic models.
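
    For orientation, the GPU-side Cholesky step can be driven through cuSOLVER. The sketch below shows the dense-matrix routine (cusolverDnDpotrf) to illustrate the shape of the call sequence only; the paper's FEM system is sparse, and error checking is omitted for brevity:

        #include <cuda_runtime.h>
        #include <cusolverDn.h>

        // In-place Cholesky factorization A = L * L^T of an n x n matrix on the GPU.
        void choleskyGpu(double *dA, int n)     // dA: device pointer, column-major
        {
            cusolverDnHandle_t handle;
            cusolverDnCreate(&handle);

            int lwork = 0;
            cusolverDnDpotrf_bufferSize(handle, CUBLAS_FILL_MODE_LOWER,
                                        n, dA, n, &lwork);
            double *dWork; int *dInfo;
            cudaMalloc(&dWork, sizeof(double) * lwork);
            cudaMalloc(&dInfo, sizeof(int));

            cusolverDnDpotrf(handle, CUBLAS_FILL_MODE_LOWER,
                             n, dA, n, dWork, lwork, dInfo);

            cudaFree(dWork); cudaFree(dInfo);
            cusolverDnDestroy(handle);
        }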

  19. A high-speed DAQ framework for future high-level trigger and event building clusters

    NASA Astrophysics Data System (ADS)

    Caselle, M.; Ardila Perez, L. E.; Balzer, M.; Dritschler, T.; Kopmann, A.; Mohr, H.; Rota, L.; Vogelgesang, M.; Weber, M.

    2017-03-01

    Modern data acquisition and trigger systems require a throughput of several GB/s and latencies of the order of microseconds. To satisfy such requirements, a heterogeneous readout system based on FPGA readout cards and GPU-based computing nodes coupled by InfiniBand has been developed. The incoming data from the back-end electronics is delivered directly into the internal memory of GPUs through a dedicated peer-to-peer PCIe communication. High-performance DMA engines have been developed for direct communication between FPGAs and GPUs using "DirectGMA (AMD)" and "GPUDirect (NVIDIA)" technologies. The proposed infrastructure is a candidate for future generations of event building clusters, high-level trigger filter farms and low-level trigger systems. In this paper the heterogeneous FPGA-GPU architecture will be presented and its performance discussed.

  20. XaNSoNS: GPU-accelerated simulator of diffraction patterns of nanoparticles

    NASA Astrophysics Data System (ADS)

    Neverov, V. S.

    XaNSoNS is open source software with GPU support, which simulates X-ray and neutron 1D (or 2D) diffraction patterns and pair-distribution functions (PDF) for amorphous or crystalline nanoparticles (up to ∼10^7 atoms) of heterogeneous structural content. Among the multiple parameters of the structure the user may specify atomic displacements, site occupancies, molecular displacements and molecular rotations. The software uses general equations nonspecific to crystalline structures to calculate the scattering intensity. It supports four major standards of parallel computing: MPI, OpenMP, Nvidia CUDA and OpenCL, enabling it to run on various architectures, from CPU-based HPCs to consumer-level GPUs.
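
    One common form of such a structure-agnostic equation is the Debye scattering formula, I(q) = sum_ij f_i f_j sin(q r_ij)/(q r_ij). The CUDA kernel below is an illustrative sketch of that inner double sum for a single q value and one atom type (not the XaNSoNS source, which supports several parallel backends):

        #include <cuda_runtime.h>

        // Per-atom partial sums of the Debye equation for one scattering vector q.
        // Reduce 'partial' afterwards (on the host or with a device reduction).
        __global__ void debyeIntensity(const float3 *pos, int nAtoms,
                                       float q, float f, float *partial)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i >= nAtoms) return;
            float sum = 0.0f;
            for (int j = 0; j < nAtoms; ++j) {
                float dx = pos[i].x - pos[j].x;
                float dy = pos[i].y - pos[j].y;
                float dz = pos[i].z - pos[j].z;
                float qr = q * sqrtf(dx * dx + dy * dy + dz * dz);
                sum += (qr > 1e-6f) ? sinf(qr) / qr : 1.0f;  // sinc; self-term = 1
            }
            partial[i] = f * f * sum;
        }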

  1. Open source acceleration of wave optics simulations on energy efficient high-performance computing platforms

    NASA Astrophysics Data System (ADS)

    Beck, Jeffrey; Bos, Jeremy P.

    2017-05-01

    We compare several modifications to the open-source wave optics package, WavePy, intended to improve execution time. Specifically, we compare the relative performance of the Intel MKL, a CPU-based OpenCV distribution, and a GPU-based version. Performance is compared between distributions both on the same compute platform and between a fully featured computing workstation and the NVIDIA Jetson TX1 platform. Comparisons are drawn in terms of both execution time and power consumption. We have found that substituting the Fast Fourier Transform operation from OpenCV provides a marked improvement on all platforms. In addition, we show that embedded platforms offer the possibility of extensive improvement in efficiency compared to a fully featured workstation.

  2. An Investigation of Unified Memory Access Performance in CUDA

    PubMed Central

    Landaverde, Raphael; Zhang, Tiansheng; Coskun, Ayse K.; Herbordt, Martin

    2015-01-01

    Managing memory between the CPU and GPU is a major challenge in GPU computing. A programming model, Unified Memory Access (UMA), has been recently introduced by Nvidia to simplify the complexities of memory management while claiming good overall performance. In this paper, we investigate this programming model and evaluate its performance and programming model simplifications based on our experimental results. We find that beyond on-demand data transfers to the CPU, the GPU is also able to request subsets of data it requires on demand. This feature allows UMA to outperform full data transfer methods for certain parallel applications and small data sizes. We also find, however, that for the majority of applications and memory access patterns, the performance overheads associated with UMA are significant, while the simplifications to the programming model restrict flexibility for adding future optimizations. PMID:26594668
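
    The programming-model simplification being evaluated boils down to a single managed allocation in place of paired host/device buffers and explicit copies. A minimal self-contained sketch (illustrative, not the benchmark code from the paper):

        #include <cuda_runtime.h>
        #include <cstdio>

        __global__ void scale(float *x, int n, float a)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) x[i] *= a;
        }

        int main()
        {
            const int n = 1 << 20;
            float *x;
            cudaMallocManaged(&x, n * sizeof(float));  // visible to host and device
            for (int i = 0; i < n; ++i) x[i] = 1.0f;   // host writes, no cudaMemcpy
            scale<<<(n + 255) / 256, 256>>>(x, n, 2.0f);
            cudaDeviceSynchronize();                   // required before host reads
            printf("x[0] = %f\n", x[0]);               // pages migrate on demand
            cudaFree(x);
            return 0;
        }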

  3. Nanoscale multireference quantum chemistry: full configuration interaction on graphical processing units.

    PubMed

    Fales, B Scott; Levine, Benjamin G

    2015-10-13

    Methods based on a full configuration interaction (FCI) expansion in an active space of orbitals are widely used for modeling chemical phenomena such as bond breaking, multiply excited states, and conical intersections in small-to-medium-sized molecules, but these phenomena occur in systems of all sizes. To scale such calculations up to the nanoscale, we have developed an implementation of FCI in which electron repulsion integral transformation and several of the more expensive steps in σ vector formation are performed on graphical processing unit (GPU) hardware. When applied to a 1.7 × 1.4 × 1.4 nm silicon nanoparticle (Si72H64) described with the polarized, all-electron 6-31G** basis set, our implementation can solve for the ground state of the 16-active-electron/16-active-orbital CASCI Hamiltonian (more than 100,000,000 configurations) in 39 min on a single NVidia K40 GPU.

  4. Lattice QCD at finite temperature and density from Taylor expansion

    NASA Astrophysics Data System (ADS)

    Steinbrecher, Patrick

    2017-01-01

    In the first part, I present an overview of recent Lattice QCD simulations at finite temperature and density. In particular, we discuss fluctuations of conserved charges: baryon number, electric charge and strangeness. These can be obtained from Taylor expanding the QCD pressure as a function of corresponding chemical potentials. Our simulations were performed using quark masses corresponding to physical pion mass of about 140 MeV and allow a direct comparison to experimental data from ultra-relativistic heavy ion beams at hadron colliders such as the Relativistic Heavy Ion Collider at Brookhaven National Laboratory and the Large Hadron Collider at CERN. In the second part, we discuss computational challenges for current and future exascale Lattice simulations with a focus on new silicon developments from Intel and NVIDIA.
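
    For reference, the expansion in question has the standard form (a textbook statement, not a formula quoted from this record), with the generalized susceptibilities evaluated at vanishing chemical potentials:

        \frac{P}{T^4} \;=\; \sum_{i,j,k}\frac{1}{i!\,j!\,k!}\,
        \chi_{ijk}^{BQS}\left(\frac{\mu_B}{T}\right)^{i}
        \left(\frac{\mu_Q}{T}\right)^{j}\left(\frac{\mu_S}{T}\right)^{k},
        \qquad
        \chi_{ijk}^{BQS} \;=\;
        \left.\frac{\partial^{\,i+j+k}(P/T^4)}
        {\partial(\mu_B/T)^{i}\,\partial(\mu_Q/T)^{j}\,\partial(\mu_S/T)^{k}}
        \right|_{\mu_B=\mu_Q=\mu_S=0}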

  5. CUDA-Accelerated Geodesic Ray-Tracing for Fiber Tracking

    PubMed Central

    van Aart, Evert; Sepasian, Neda; Jalba, Andrei; Vilanova, Anna

    2011-01-01

    Diffusion Tensor Imaging (DTI) allows noninvasive measurement of the diffusion of water in fibrous tissue. By reconstructing the fibers from DTI data using a fiber-tracking algorithm, we can deduce the structure of the tissue. In this paper, we outline an approach to accelerating such a fiber-tracking algorithm using a Graphics Processing Unit (GPU). This algorithm, which is based on the calculation of geodesics, has shown promising results for both synthetic and real data, but is limited in its applicability by its high computational requirements. We present a solution which uses the parallelism offered by modern GPUs, in combination with the CUDA platform by NVIDIA, to significantly reduce the execution time of the fiber-tracking algorithm. Compared to a multithreaded CPU implementation of the same algorithm, our GPU mapping achieves a speedup factor of up to 40 times. PMID:21941525

  6. Unusual Domain Structure and Filamentary Superfluidity for 2D Hard-Core Bosons in Insulating Charge-Ordered Phase

    NASA Astrophysics Data System (ADS)

    Panov, Yu. D.; Moskvin, A. S.; Rybakov, F. N.; Borisov, A. B.

    2016-12-01

    We made use of a special algorithm for the Compute Unified Device Architecture for NVIDIA graphics cards, a nonlinear conjugate-gradient method to minimize the energy functional, and a Monte-Carlo technique to directly observe the formation of the ground-state configuration for 2D hard-core bosons by lowering the temperature, and its evolution with deviation away from half-filling. The novel technique allowed us to examine earlier implications and uncover novel features of the phase transitions, in particular, to look upon the nucleation of the odd domain structure, the emergence of filamentary superfluidity nucleated at the antiphase domain walls of the charge-ordered phase, and the nucleation and evolution of different topological structures.

  7. Modeling of Radiotherapy Linac Source Terms Using ARCHER Monte Carlo Code: Performance Comparison for GPU and MIC Parallel Computing Devices

    NASA Astrophysics Data System (ADS)

    Lin, Hui; Liu, Tianyu; Su, Lin; Bednarz, Bryan; Caracappa, Peter; Xu, X. George

    2017-09-01

    Monte Carlo (MC) simulation is well recognized as the most accurate method for radiation dose calculations. For radiotherapy applications, accurate modelling of the source term, i.e. the clinical linear accelerator, is critical to the simulation. The purpose of this paper is to perform source modelling, examine the accuracy and performance of the models on Intel Many Integrated Core coprocessors (aka Xeon Phi) and Nvidia GPUs using ARCHER, and explore potential optimization methods. Phase space-based source modelling has been implemented. Good agreement was found in a tomotherapy prostate patient case and a TrueBeam breast case. In terms of performance, the whole simulation for the prostate plan and the breast plan took about 173 s and 73 s, respectively, with 1% statistical error.

  8. NASA's Hybrid Reality Lab: One Giant Leap for Full Dive

    NASA Technical Reports Server (NTRS)

    Delgado, Francisco J.; Noyes, Matthew

    2017-01-01

    This presentation demonstrates how NASA is using consumer VR headsets, game engine technology, and NVIDIA's GPUs to create highly immersive future training systems augmented with extremely realistic haptic feedback, sound, and additional sensory information, and how these can be used to improve the engineering workflow. Included in this presentation are an environment simulation of the ISS, where users can interact with virtual objects, handrails, and tracked physical objects while inside VR; the integration of consumer VR headsets with the Active Response Gravity Offload System; and a space habitat architectural evaluation tool. Attendees will learn how the best elements of real and virtual worlds can be combined into a hybrid reality environment with tangible engineering and scientific applications.

  9. Electromagnetic Physics Models for Parallel Computing Architectures

    NASA Astrophysics Data System (ADS)

    Amadio, G.; Ananya, A.; Apostolakis, J.; Aurora, A.; Bandieramonte, M.; Bhattacharyya, A.; Bianchini, C.; Brun, R.; Canal, P.; Carminati, F.; Duhem, L.; Elvira, D.; Gheata, A.; Gheata, M.; Goulas, I.; Iope, R.; Jun, S. Y.; Lima, G.; Mohanty, A.; Nikitina, T.; Novak, M.; Pokorski, W.; Ribon, A.; Seghal, R.; Shadura, O.; Vallecorsa, S.; Wenzel, S.; Zhang, Y.

    2016-10-01

    The recent emergence of hardware architectures characterized by many-core or accelerated processors has opened new opportunities for concurrent programming models taking advantage of both SIMD and SIMT architectures. GeantV, a next generation detector simulation, has been designed to exploit both the vector capability of mainstream CPUs and multi-threading capabilities of coprocessors including NVidia GPUs and Intel Xeon Phi. The characteristics of these architectures are very different in terms of the vectorization depth and type of parallelization needed to achieve optimal performance. In this paper we describe implementation of electromagnetic physics models developed for parallel computing architectures as a part of the GeantV project. Results of preliminary performance evaluation and physics validation are presented as well.

  10. Implementation of Multipattern String Matching Accelerated with GPU for Intrusion Detection System

    NASA Astrophysics Data System (ADS)

    Nehemia, Rangga; Lim, Charles; Galinium, Maulahikmah; Rinaldi Widianto, Ahmad

    2017-04-01

    As Internet-related security threats continue to increase in volume and sophistication, existing Intrusion Detection Systems are also being challenged to cope with current Internet development. Multi-pattern string matching accelerated with a Graphics Processing Unit is utilized to improve the packet scanning performance of the IDS. This paper implements a multi-pattern string matching algorithm, also called Parallel Failureless Aho-Corasick, accelerated with a GPU to improve the performance of the IDS. The OpenCL library is used to allow the IDS to support various GPUs, including the popular NVIDIA and AMD GPUs used in our research. The experimental results show that the application of multi-pattern string matching on a GPU-accelerated platform provides a speedup of up to 141% in terms of throughput compared to previous research.
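
    The failureless variant is what makes the mapping GPU-friendly: one thread per starting offset, walking only goto transitions and simply terminating on a mismatch, so no failure links (and no inter-thread coordination) are needed. The paper uses OpenCL; the sketch below expresses the same idea in CUDA to match the other examples here, assuming a precomputed state-transition table:

        #include <cuda_runtime.h>

        // Parallel Failureless Aho-Corasick (illustrative sketch).
        // nextState: [numStates][256] goto table, -1 = no transition.
        // matchPat:  pattern id ending at a state, or -1.
        __global__ void pfac(const unsigned char *text, int n,
                             const int *nextState, const int *matchPat, int *hits)
        {
            int start = blockIdx.x * blockDim.x + threadIdx.x;
            if (start >= n) return;
            int state = 0, found = -1;
            for (int i = start; i < n; ++i) {
                state = nextState[state * 256 + text[i]];
                if (state < 0) break;                 // mismatch: thread is done
                if (matchPat[state] >= 0) found = matchPat[state];
            }
            hits[start] = found;                      // pattern found at this offset
        }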

  11. GPU accelerated implementation of NCI calculations using promolecular density.

    PubMed

    Rubez, Gaëtan; Etancelin, Jean-Matthieu; Vigouroux, Xavier; Krajecki, Michael; Boisson, Jean-Charles; Hénon, Eric

    2017-05-30

    The NCI approach is a modern tool to reveal chemical noncovalent interactions. It is particularly attractive to describe ligand-protein binding. A custom implementation for NCI using promolecular density is presented. It is designed to leverage the computational power of NVIDIA graphics processing unit (GPU) accelerators through the CUDA programming model. The code performances of three versions are examined on a test set of 144 systems. NCI calculations are particularly well suited to the GPU architecture, which reduces drastically the computational time. On a single compute node, the dual-GPU version leads to a 39-fold improvement for the biggest instance compared to the optimal OpenMP parallel run (C code, icc compiler) with 16 CPU cores. Energy consumption measurements carried out on both CPU and GPU NCI tests show that the GPU approach provides substantial energy savings. © 2017 Wiley Periodicals, Inc.

  12. Autofocus method for automated microscopy using embedded GPUs.

    PubMed

    Castillo-Secilla, J M; Saval-Calvo, M; Medina-Valdès, L; Cuenca-Asensi, S; Martínez-Álvarez, A; Sánchez, C; Cristóbal, G

    2017-03-01

    In this paper we present a method for autofocusing images of sputum smears taken from a microscope, which combines finding the optimal focus distance with an algorithm for extending the depth of field (EDoF). Our multifocus fusion method produces a unique image where all the relevant objects of the analyzed scene are well focused, independently of their distance to the sensor. This process is computationally expensive, which makes its automation using traditional embedded processors unfeasible. For this purpose, a low-cost optimized implementation is proposed using the limited-resource embedded GPU integrated on a cutting-edge NVIDIA system on chip. The extensive tests performed on different sputum smear image sets show the real-time capabilities of our implementation while maintaining the quality of the output image.

  13. Mendel-GPU: haplotyping and genotype imputation on graphics processing units

    PubMed Central

    Chen, Gary K.; Wang, Kai; Stram, Alex H.; Sobel, Eric M.; Lange, Kenneth

    2012-01-01

    Motivation: In modern sequencing studies, one can improve the confidence of genotype calls by phasing haplotypes using information from an external reference panel of fully typed unrelated individuals. However, the computational demands are so high that they prohibit researchers with limited computational resources from haplotyping large-scale sequence data. Results: Our graphics processing unit based software delivers haplotyping and imputation accuracies comparable to competing programs at a fraction of the computational cost and peak memory demand. Availability: Mendel-GPU, our OpenCL software, runs on Linux platforms and is portable across AMD and nVidia GPUs. Users can download both code and documentation at http://code.google.com/p/mendel-gpu/. Contact: gary.k.chen@usc.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:22954633

  14. GPU-Powered Coherent Beamforming

    NASA Astrophysics Data System (ADS)

    Magro, A.; Adami, K. Zarb; Hickish, J.

    2015-03-01

    Graphics processing unit (GPU)-based beamforming is a relatively unexplored area in radio astronomy, possibly due to the assumption that any such system will be severely limited by the PCIe bandwidth required to transfer data to the GPU. We have developed a CUDA-based GPU implementation of a coherent beamformer, specifically designed and optimized for deployment at the BEST-2 array, which can generate an arbitrary number of synthesized beams for a wide range of parameters. It achieves ~1.3 TFLOPs on an NVIDIA Tesla K20, approximately 10x faster than an optimized, multithreaded CPU implementation. This kernel has been integrated into two real-time, GPU-based time-domain software pipelines deployed at the BEST-2 array in Medicina: a standalone beamforming pipeline and a transient detection pipeline. We present performance benchmarks for the beamforming kernel and for the transient detection pipeline with beamforming capabilities, as well as results of test observations.
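
    The core of a coherent beamformer is a weighted sum of antenna voltages per output sample, which parallelizes trivially over time samples. A stripped-down illustrative kernel (not the deployed BEST-2 pipeline, which also batches beams and channels):

        #include <cuComplex.h>

        // One thread per time sample; x is [nAnt][nSamp], w holds per-antenna
        // complex weights (calibration phases / steering delays) for one beam.
        __global__ void beamform(const cuFloatComplex *x, const cuFloatComplex *w,
                                 cuFloatComplex *beam, int nAnt, int nSamp)
        {
            int t = blockIdx.x * blockDim.x + threadIdx.x;
            if (t >= nSamp) return;
            cuFloatComplex acc = make_cuFloatComplex(0.0f, 0.0f);
            for (int a = 0; a < nAnt; ++a)
                acc = cuCaddf(acc, cuCmulf(w[a], x[a * nSamp + t]));
            beam[t] = acc;    // coherent sum toward the steered direction
        }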

  15. Development of an Implicit, Charge and Energy Conserving 2D Electromagnetic PIC Code on Advanced Architectures

    NASA Astrophysics Data System (ADS)

    Payne, Joshua; Taitano, William; Knoll, Dana; Liebs, Chris; Murthy, Karthik; Feltman, Nicolas; Wang, Yijie; McCarthy, Colleen; Cieren, Emanuel

    2012-10-01

    In order to solve problems such as ion coalescence and slow MHD shocks fully kinetically, we developed a fully implicit 2D energy- and charge-conserving electromagnetic PIC code, PlasmaApp2D. PlasmaApp2D differs from previous implicit PIC implementations in that it will utilize advanced architectures such as GPUs and shared-memory CPU systems, with problems too large to fit into cache. PlasmaApp2D will be a hybrid CPU-GPU code developed primarily to run on the DARWIN cluster at LANL, utilizing four 12-core AMD Opteron CPUs and two NVIDIA Tesla GPUs per node. MPI will be used for cross-node communication, OpenMP will be used for on-node parallelism, and CUDA will be used for the GPUs. Development progress and initial results will be presented.

  16. Montblanc1: GPU accelerated radio interferometer measurement equations in support of Bayesian inference for radio observations

    NASA Astrophysics Data System (ADS)

    Perkins, S. J.; Marais, P. C.; Zwart, J. T. L.; Natarajan, I.; Tasse, C.; Smirnov, O.

    2015-09-01

    We present Montblanc, a GPU implementation of the Radio interferometer measurement equation (RIME) in support of the Bayesian inference for radio observations (BIRO) technique. BIRO uses Bayesian inference to select sky models that best match the visibilities observed by a radio interferometer. To accomplish this, BIRO evaluates the RIME multiple times, varying sky model parameters to produce multiple model visibilities. χ² values computed from the model and observed visibilities are used as likelihood values to drive the Bayesian sampling process and select the best sky model. As most of the elements of the RIME and χ² calculation are independent of one another, they are highly amenable to parallel computation. Additionally, Montblanc caters for iterative RIME evaluation to produce multiple χ² values. Modified model parameters are transferred to the GPU between each iteration. We implemented Montblanc as a Python package based upon NVIDIA's CUDA architecture. As such, it is easy to extend and implement different pipelines. At present, Montblanc supports point and Gaussian morphologies, but is designed for easy addition of new source profiles. Montblanc's RIME implementation is performant: on an NVIDIA K40, it is approximately 250 times faster than MEQTREES on a dual hexacore Intel E5-2620v2 CPU. Compared to the OSKAR simulator's GPU-implemented RIME components, it is 7.7 and 12 times faster on the same K40 for single- and double-precision floating point, respectively. However, OSKAR's RIME implementation is more general than Montblanc's BIRO-tailored RIME. Theoretical analysis of Montblanc's dominant CUDA kernel suggests that it is memory bound. In practice, profiling shows that it is balanced between compute and memory, as much of the data required by the problem is retained in L1 and L2 caches.

  17. Embedded-Based Graphics Processing Unit Cluster Platform for Multiple Sequence Alignments

    PubMed Central

    Wei, Jyh-Da; Cheng, Hui-Jun; Lin, Chun-Yuan; Ye, Jin; Yeh, Kuan-Yu

    2017-01-01

    High-end graphics processing units (GPUs), such as NVIDIA Tesla/Fermi/Kepler series cards with thousands of cores per chip, have been widely applied to high-performance computing fields over the past decade. These desktop GPU cards must be installed in personal computers/servers with desktop CPUs, and the cost and power consumption of constructing a GPU cluster platform are very high. In recent years, NVIDIA released an embedded board, called Jetson Tegra K1 (TK1), which contains 4 ARM Cortex-A15 CPUs and 192 Compute Unified Device Architecture cores (belonging to the Kepler GPU family). Jetson Tegra K1 has several advantages, such as low cost, low power consumption, and high applicability, and it has been applied in several specific applications. In our previous work, a bioinformatics platform with a single TK1 (STK platform) was constructed, and that work also proved that Web and mobile services can be implemented on the STK platform with a good cost-performance ratio by comparing the STK platform with desktop CPUs and GPUs. In this work, an embedded GPU cluster platform is constructed with multiple TK1s (MTK platform). Complex system installation and setup are necessary procedures at first. Then, 2 job assignment modes are designed for the MTK platform to provide services for users. Finally, ClustalW v2.0.11 and ClustalWtk are ported to the MTK platform. The experimental results showed that the speedup ratios achieved 5.5 and 4.8 times for ClustalW v2.0.11 and ClustalWtk, respectively, by comparing 6 TK1s with a single TK1. The MTK platform is proven to be useful for multiple sequence alignments. PMID:28835734

  18. Advantages of GPU technology in DFT calculations of intercalated graphene

    NASA Astrophysics Data System (ADS)

    Pešić, J.; Gajić, R.

    2014-09-01

    Over the past few years, the expansion of general-purpose graphics-processing unit (GPGPU) technology has had a great impact on computational science. GPGPU is the utilization of a graphics-processing unit (GPU) to perform calculations in applications usually handled by the central processing unit (CPU). The use of GPGPUs as a way to increase computational power in the materials sciences has significantly decreased computational costs in already highly demanding calculations. The level of acceleration and parallelization depends on the problem itself. Some problems can benefit from GPU acceleration and parallelization, such as the finite-difference time-domain algorithm (FDTD) and density-functional theory (DFT), while others cannot take advantage of these modern technologies. A number of GPU-supported applications have emerged in the past several years (www.nvidia.com/object/gpu-applications.html). Quantum Espresso (QE) is an integrated suite of open source computer codes for electronic-structure calculations and materials modeling at the nano-scale. It is based on DFT, the use of a plane-wave basis and a pseudopotential approach. Since the QE 5.0 version, a plug-in component for standard QE packages has been implemented that allows exploiting the capabilities of Nvidia GPU graphics cards (www.qe-forge.org/gf/proj). In this study, we have examined the impact of the use of GPU acceleration and parallelization on the numerical performance of DFT calculations. Graphene has been attracting attention worldwide and has already shown some remarkable properties. We have studied intercalated graphene, using the QE package PHonon, which employs the GPU. The term ‘intercalation’ refers to a process whereby foreign adatoms are inserted onto a graphene lattice. In addition, by intercalating different atoms between graphene layers, it is possible to tune their physical properties. Our experiments have shown there are benefits from using GPUs, and we reached an

  19. Multidisciplinary Simulation Acceleration using Multiple Shared-Memory Graphical Processing Units

    NASA Astrophysics Data System (ADS)

    Kemal, Jonathan Yashar

    For purposes of optimizing and analyzing turbomachinery and other designs, the unsteady Favre-averaged flow-field differential equations for an ideal compressible gas can be solved in conjunction with the heat conduction equation. We solve all equations using the finite-volume multiple-grid numerical technique, with the dual time-step scheme used for unsteady simulations. Our numerical solver code targets CUDA-capable Graphical Processing Units (GPUs) produced by NVIDIA. Making use of MPI, our solver can run across networked compute nodes, where each MPI process can use either a GPU or a Central Processing Unit (CPU) core for primary solver calculations. We use NVIDIA Tesla C2050/C2070 GPUs based on the Fermi architecture, and compare our resulting performance against Intel Xeon X5690 CPUs. Solver routines converted to CUDA typically run about 10 times faster on a GPU for sufficiently dense computational grids. We used a conjugate cylinder computational grid and ran a turbulent steady flow simulation using 4 increasingly dense computational grids. Our densest computational grid is divided into 13 blocks each containing 1033x1033 grid points, for a total of 13.87 million grid points or 1.07 million grid points per domain block. To obtain overall speedups, we compare the execution time of the solver's iteration loop, including all resource-intensive GPU-related memory copies. Comparing the performance of 8 GPUs to that of 8 CPUs, we obtain an overall speedup of about 6.0 when using our densest computational grid. This amounts to an 8-GPU simulation running about 39.5 times faster than a single-CPU simulation.
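
    A common skeleton for this kind of MPI-plus-CUDA layout (a sketch under assumptions, not this solver's source) binds each rank to a local GPU in round-robin fashion:

        #include <mpi.h>
        #include <cuda_runtime.h>

        int main(int argc, char **argv)
        {
            MPI_Init(&argc, &argv);
            int rank = 0, nDev = 0;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            cudaGetDeviceCount(&nDev);

            if (nDev > 0)
                cudaSetDevice(rank % nDev);   // round-robin ranks onto local GPUs
            // ... per-block solver iterations on the bound GPU (or a CPU core),
            //     with MPI halo exchanges between domain blocks ...

            MPI_Finalize();
            return 0;
        }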

  20. cudaMap: a GPU accelerated program for gene expression connectivity mapping

    PubMed Central

    2013-01-01

    Background Modern cancer research often involves large datasets and the use of sophisticated statistical techniques. Together these add a heavy computational load to the analysis, which is often coupled with issues surrounding data accessibility. Connectivity mapping is an advanced bioinformatic and computational technique dedicated to therapeutics discovery and drug re-purposing around differential gene expression analysis. On a normal desktop PC, it is common for the connectivity mapping task with a single gene signature to take > 2h to complete using sscMap, a popular Java application that runs on standard CPUs (Central Processing Units). Here, we describe new software, cudaMap, which has been implemented using CUDA C/C++ to harness the computational power of NVIDIA GPUs (Graphics Processing Units) to greatly reduce processing times for connectivity mapping. Results cudaMap can identify candidate therapeutics from the same signature in just over thirty seconds when using an NVIDIA Tesla C2050 GPU. Results from the analysis of multiple gene signatures, which would previously have taken several days, can now be obtained in as little as 10 minutes, greatly facilitating candidate therapeutics discovery with high throughput. We are able to demonstrate dramatic speed differentials between GPU assisted performance and CPU executions as the computational load increases for high accuracy evaluation of statistical significance. Conclusion Emerging ‘omics’ technologies are constantly increasing the volume of data and information to be processed in all areas of biomedical research. Embracing the multicore functionality of GPUs represents a major avenue of local accelerated computing. cudaMap will make a strong contribution in the discovery of candidate therapeutics by enabling speedy execution of heavy duty connectivity mapping tasks, which are increasingly required in modern cancer research. cudaMap is open source and can be freely downloaded from http

  1. GPU Implementation of High Rayleigh Number Three-Dimensional Mantle Convection

    NASA Astrophysics Data System (ADS)

    Sanchez, D. A.; Yuen, D. A.; Wright, G. B.; Barnett, G. A.

    2010-12-01

    Although we have entered the age of petascale computing, many factors are still prohibiting high-performance computing (HPC) from infiltrating all suitable scientific disciplines. For this reason and others, application of GPU to HPC is gaining traction in the scientific world. With its low price point, high performance potential, and competitive scalability, GPU has been an option well worth considering for the last few years. Moreover, with the advent of NVIDIA's Fermi architecture, which brings ECC memory, better double-precision performance, and more RAM to GPU, there is a strong message of corporate support for GPU in HPC. However, many doubts linger concerning the practicality of using GPU for scientific computing. In particular, GPU has a reputation for being difficult to program and suitable for only a small subset of problems. Although inroads have been made in addressing these concerns, for many scientists GPU still has hurdles to clear before becoming an acceptable choice. We explore the applicability of GPU to geophysics by implementing a three-dimensional, second-order finite-difference model of Rayleigh-Benard thermal convection on an NVIDIA GPU using C for CUDA. Our code reaches sufficient resolution, on the order of 500x500x250 evenly-spaced finite-difference gridpoints, on a single GPU. We make extensive use of highly optimized CUBLAS routines, allowing us to achieve performance on the order of 0.1 µs per timestep per gridpoint at this resolution. This performance has allowed us to study high Rayleigh number simulations, on the order of 2x10^7, on a single GPU.

  2. GPU-based fast cone beam CT reconstruction from undersampled and noisy projection data via total variation.

    PubMed

    Jia, Xun; Lou, Yifei; Li, Ruijiang; Song, William Y; Jiang, Steve B

    2010-04-01

    Cone-beam CT (CBCT) plays an important role in image guided radiation therapy (IGRT). However, the large radiation dose from serial CBCT scans in most IGRT procedures raises a clinical concern, especially for pediatric patients who are essentially excluded from receiving IGRT for this reason. The goal of this work is to develop a fast GPU-based algorithm to reconstruct CBCT from undersampled and noisy projection data so as to lower the imaging dose. The CBCT is reconstructed by minimizing an energy functional consisting of a data fidelity term and a total variation regularization term. The authors developed a GPU-friendly version of the forward-backward splitting algorithm to solve this model. A multigrid technique is also employed. It is found that 20-40 x-ray projections are sufficient to reconstruct images with satisfactory quality for IGRT. The reconstruction time ranges from 77 to 130 s on an NVIDIA Tesla C1060 (NVIDIA, Santa Clara, CA) GPU card, depending on the number of projections used, which is estimated to be about 100 times faster than similar iterative reconstruction approaches. Moreover, phantom studies indicate that the algorithm enables the CBCT to be reconstructed under a scanning protocol with as low as 0.1 mA s/projection. Compared with the currently widely used full-fan head and neck scanning protocol of approximately 360 projections with 0.4 mA s/projection, it is estimated that an overall 36-72 times dose reduction has been achieved in our fast CBCT reconstruction algorithm. This work indicates that the developed GPU-based CBCT reconstruction algorithm is capable of lowering imaging dose considerably. The high computation efficiency in this algorithm makes the iterative CBCT reconstruction approach applicable in real clinical environments.

  3. cudaMap: a GPU accelerated program for gene expression connectivity mapping.

    PubMed

    McArt, Darragh G; Bankhead, Peter; Dunne, Philip D; Salto-Tellez, Manuel; Hamilton, Peter; Zhang, Shu-Dong

    2013-10-11

    Modern cancer research often involves large datasets and the use of sophisticated statistical techniques. Together these add a heavy computational load to the analysis, which is often coupled with issues surrounding data accessibility. Connectivity mapping is an advanced bioinformatic and computational technique dedicated to therapeutics discovery and drug re-purposing around differential gene expression analysis. On a normal desktop PC, it is common for the connectivity mapping task with a single gene signature to take > 2h to complete using sscMap, a popular Java application that runs on standard CPUs (Central Processing Units). Here, we describe new software, cudaMap, which has been implemented using CUDA C/C++ to harness the computational power of NVIDIA GPUs (Graphics Processing Units) to greatly reduce processing times for connectivity mapping. cudaMap can identify candidate therapeutics from the same signature in just over thirty seconds when using an NVIDIA Tesla C2050 GPU. Results from the analysis of multiple gene signatures, which would previously have taken several days, can now be obtained in as little as 10 minutes, greatly facilitating candidate therapeutics discovery with high throughput. We are able to demonstrate dramatic speed differentials between GPU assisted performance and CPU executions as the computational load increases for high accuracy evaluation of statistical significance. Emerging 'omics' technologies are constantly increasing the volume of data and information to be processed in all areas of biomedical research. Embracing the multicore functionality of GPUs represents a major avenue of local accelerated computing. cudaMap will make a strong contribution in the discovery of candidate therapeutics by enabling speedy execution of heavy duty connectivity mapping tasks, which are increasingly required in modern cancer research. cudaMap is open source and can be freely downloaded from http://purl.oclc.org/NET/cudaMap.

  4. Embedded-Based Graphics Processing Unit Cluster Platform for Multiple Sequence Alignments.

    PubMed

    Wei, Jyh-Da; Cheng, Hui-Jun; Lin, Chun-Yuan; Ye, Jin; Yeh, Kuan-Yu

    2017-01-01

    High-end graphics processing units (GPUs), such as NVIDIA Tesla/Fermi/Kepler series cards with thousands of cores per chip, have been widely applied to high-performance computing fields over the past decade. These desktop GPU cards must be installed in personal computers/servers with desktop CPUs, and the cost and power consumption of constructing a GPU cluster platform are very high. In recent years, NVIDIA released an embedded board, called Jetson Tegra K1 (TK1), which contains 4 ARM Cortex-A15 CPUs and 192 Compute Unified Device Architecture cores (belonging to the Kepler GPU family). Jetson Tegra K1 has several advantages, such as low cost, low power consumption, and high applicability, and it has been applied in several specific applications. In our previous work, a bioinformatics platform with a single TK1 (STK platform) was constructed, and that work also proved that Web and mobile services can be implemented on the STK platform with a good cost-performance ratio by comparing the STK platform with desktop CPUs and GPUs. In this work, an embedded GPU cluster platform is constructed with multiple TK1s (MTK platform). Complex system installation and setup are necessary procedures at first. Then, 2 job assignment modes are designed for the MTK platform to provide services for users. Finally, ClustalW v2.0.11 and ClustalWtk are ported to the MTK platform. The experimental results showed that the speedup ratios achieved 5.5 and 4.8 times for ClustalW v2.0.11 and ClustalWtk, respectively, by comparing 6 TK1s with a single TK1. The MTK platform is proven to be useful for multiple sequence alignments.

  5. Parallel fuzzy connected image segmentation on GPU

    PubMed Central

    Zhuge, Ying; Cao, Yong; Udupa, Jayaram K.; Miller, Robert W.

    2011-01-01

    Purpose: Image segmentation techniques using fuzzy connectedness (FC) principles have shown their effectiveness in segmenting a variety of objects in several large applications. However, one challenge in these algorithms has been their excessive computational requirements when processing large image datasets. Nowadays, commodity graphics hardware provides a highly parallel computing environment. In this paper, the authors present a parallel fuzzy connected image segmentation algorithm implementation on NVIDIA's Compute Unified Device Architecture (CUDA) platform for segmenting medical image data sets. Methods: In the FC algorithm, there are two major computational tasks: (i) computing the fuzzy affinity relations and (ii) computing the fuzzy connectedness relations. These two tasks are implemented as CUDA kernels and executed on the GPU. A dramatic improvement in speed for both tasks is achieved as a result. Results: Our experiments based on three data sets of small, medium, and large data size demonstrate the efficiency of the parallel algorithm, which achieves speed-up factors of 24.4x, 18.1x, and 10.3x, respectively, for the three data sets on the NVIDIA Tesla C1060 over the implementation of the algorithm on CPU, and takes 0.25, 0.72, and 15.04 s, respectively, for the three data sets. Conclusions: The authors developed a parallel algorithm of the widely used fuzzy connected image segmentation method on NVIDIA GPUs, which are far more cost- and speed-effective than both clusters of workstations and multiprocessing systems. A near-interactive speed of segmentation has been achieved, even for the large data set. PMID:21859037

  6. GPU-based relative fuzzy connectedness image segmentation

    PubMed Central

    Zhuge, Ying; Ciesielski, Krzysztof C.; Udupa, Jayaram K.; Miller, Robert W.

    2013-01-01

    Purpose: Recently, clinical radiological research and practice have become increasingly quantitative. Further, images continue to increase in size and volume. For quantitative radiology to become practical, it is crucial that image segmentation algorithms and their implementations are rapid and yield practical run time on very large data sets. The purpose of this paper is to present a parallel version of an algorithm that belongs to the family of fuzzy connectedness (FC) algorithms, to achieve an interactive speed for segmenting large medical image data sets. Methods: The most common FC segmentations, optimizing an ℓ∞-based energy, are known as relative fuzzy connectedness (RFC) and iterative relative fuzzy connectedness (IRFC). Both RFC and IRFC objects (of which IRFC contains RFC) can be found via linear time algorithms, linear with respect to the image size. The new algorithm, P-ORFC (for parallel optimal RFC), which is implemented using NVIDIA’s Compute Unified Device Architecture (CUDA) platform, considerably improves the computational speed of the above-mentioned CPU-based IRFC algorithm. Results: Experiments based on four data sets of small, medium, large, and super data size achieved speedup factors of 32.8×, 22.9×, 20.9×, and 17.5×, respectively, on the NVIDIA Tesla C1060 platform. Although the output of P-ORFC need not precisely match the IRFC output, it is very close to it and, as the authors prove, always lies between the RFC and IRFC objects. Conclusions: A parallel version of a top-of-the-line algorithm in the family of FC has been developed on NVIDIA GPUs. An interactive speed of segmentation has been achieved, even for the largest medical image data set. Such GPU implementations may play a crucial role in automatic anatomy recognition in clinical radiology. PMID:23298094
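
    The connectedness computation itself can be parallelised as repeated relaxation sweeps of the recurrence K(c) = max over neighbours d of min(K(d), affinity(c, d)). A sketch of one such Jacobi sweep, restricted to x-neighbours for brevity (illustrative only, not the authors' P-ORFC code):

        // K is initialised to 1 at seed voxels and 0 elsewhere; sweeps are
        // repeated until the changed flag stays 0. aff[c] holds the
        // affinity between voxel c and voxel c + 1.
        __global__ void fcSweep(const float* aff, const float* Kin,
                                float* Kout, int nx, int ny, int nz,
                                int* changed)
        {
            int x = blockIdx.x * blockDim.x + threadIdx.x;
            int y = blockIdx.y * blockDim.y + threadIdx.y;
            int z = blockIdx.z;
            if (x >= nx || y >= ny || z >= nz) return;
            int c = (z * ny + y) * nx + x;
            float k = Kin[c];
            if (x > 0)      k = fmaxf(k, fminf(Kin[c - 1], aff[c - 1]));
            if (x < nx - 1) k = fmaxf(k, fminf(Kin[c + 1], aff[c]));
            Kout[c] = k;
            if (k != Kin[c]) atomicExch(changed, 1);
        }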

  7. Physical activity and quality of life in long-term hospitalized patients with severe mental illness: a cross-sectional study.

    PubMed

    Deenik, Jeroen; Kruisdijk, Frank; Tenback, Diederik; Braakman-Jansen, Annemarie; Taal, Erik; Hopman-Rock, Marijke; Beekman, Aartjan; Tak, Erwin; Hendriksen, Ingrid; van Harten, Peter

    2017-08-18

    Increasing physical activity in patients with severe mental illness is believed to have positive effects on physical health, psychiatric symptoms, and quality of life. To date, little is known about the relationship between physical activity and quality of life in long-term hospitalized patients with severe mental illness, and knowledge of the determinants of behavioural change is lacking. The purpose of this study was to elucidate the relationship between objectively measured physical activity and quality of life, and to explore modifiable psychological determinants of change in physical activity, in long-term hospitalized patients with severe mental illness. In 184 inpatients, physical activity was measured using an accelerometer (ActiGraph GT3X+). Quality of life was assessed by the EuroQol-5D and WHOQol-Bref. Attitude and perceived self-efficacy towards physical activity were collected using the Physical Activity Enjoyment Scale and the Multidimensional Self Efficacy Questionnaire, respectively. Patient and disease characteristics were derived retrospectively from electronic patient records. Associations and potential predictors were analysed using hierarchical regression. Physical activity was positively related to, and a predictor of, all quality of life outcomes except the environmental domain, independent of patient and disease characteristics. However, non-linear relationships showed that most of the improvement in quality of life lies in the change from sedentary to light activity. Attitude and self-efficacy were not related to physical activity. Physical activity is positively associated with quality of life, especially for patients in the lower spectrum of physical activity. An association between attitude and self-efficacy and physical activity was absent. Therefore, the results suggest the need for alternative, more integrated and (peer-)supported interventions to structurally improve physical activity in this inpatient population. Slight changes away from sedentary behaviour towards light activity may thus already yield meaningful gains in quality of life.

  8. Pyrethroids and Nectar Toxins Have Subtle Effects on the Motor Function, Grooming and Wing Fanning Behaviour of Honeybees (Apis mellifera).

    PubMed

    Oliver, Caitlin J; Softley, Samantha; Williamson, Sally M; Stevenson, Philip C; Wright, Geraldine A

    2015-01-01

    Sodium channels, found ubiquitously in animal muscle cells and neurons, are one of the main target sites of many naturally-occurring, insecticidal plant compounds and agricultural pesticides. Pyrethroids, derived from compounds found only in the Asteraceae, are particularly toxic to insects and have been successfully used as pesticides, including on flowering crops that are visited by pollinators. Pyrethrins, from which they were derived, occur naturally in the nectar of some flowering plant species. We know relatively little about how such compounds (i.e., compounds that target sodium channels) influence pollinators at low or sub-lethal doses. Here, we exposed individual adult forager honeybees to several compounds that bind to sodium channels to identify whether these compounds affect motor function. Using an assay previously developed to identify the effect of drugs and toxins on individual bees, we investigated how acute exposure to 10 ng doses (1 ppm) of the pyrethroid insecticides (cyfluthrin, tau-fluvalinate, allethrin and permethrin) and the nectar toxins (aconitine and grayanotoxin I) affected honeybee locomotion, grooming and wing fanning behaviour. Bees exposed to these compounds spent more time upside down and fanning their wings. They also had longer bouts of standing still. Bees exposed to the nectar toxin, aconitine, and the pyrethroid, allethrin, also spent less time grooming their antennae. We also found that the concentration of the nectar toxin grayanotoxin I (GTX) fed to bees affected the time spent upside down (i.e., failure to perform the righting reflex). Our data show that low doses of pyrethroids and other nectar toxins that target sodium channels mainly influence motor function through their effect on the righting reflex of adult worker honeybees.

  9. SU-E-T-37: A GPU-Based Pencil Beam Algorithm for Dose Calculations in Proton Radiation Therapy

    SciTech Connect

    Kalantzis, G; Leventouri, T; Tachibana, H

    Purpose: Recent developments in radiation therapy have been focused on applications of charged particles, especially protons. Over the years several dose calculation methods have been proposed in proton therapy. A common characteristic of all these methods is their extensive computational burden. In the current study we present, for the first time to the best of our knowledge, a GPU-based pencil beam algorithm (PBA) for proton dose calculations in Matlab. Methods: In the current study we employed an analytical expression for the protons' depth-dose distribution. The central-axis term is taken from the broad-beam central-axis depth dose in water, modified by an inverse square correction, while the distribution of the off-axis term was considered Gaussian. The serial code was implemented in MATLAB and was launched on a desktop with a quad-core Intel Xeon X5550 at 2.67GHz with 8 GB of RAM. For the parallelization on the GPU, the parallel computing toolbox was employed and the code was launched on a GTX 770 with Kepler architecture. The performance comparison was established on the speedup factors. Results: The performance of the GPU code was evaluated for three different energies: low (50 MeV), medium (100 MeV) and high (150 MeV). Four square fields were selected for each energy, and the dose calculations were performed with both the serial and parallel codes for a homogeneous water phantom with size 300×300×300 mm³. The resolution of the PBs was set to 1.0 mm. The maximum speedup of ∼127 was achieved for the highest energy and the largest field size. Conclusion: A GPU-based PB algorithm for proton dose calculations in Matlab was presented. A maximum speedup of ∼127 was achieved. Future directions of the current work include extension of our method for dose calculation in heterogeneous phantoms.
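
    Although this study was done in Matlab, the scoring step it describes maps naturally onto a CUDA kernel. A sketch under the stated model (central-axis depth dose with an inverse-square correction, multiplied by a lateral Gaussian); dd[] and sigma[] are assumed precomputed per depth index, and all names are illustrative, not the authors' code:

        // One thread scores the dose of a single pencil beam, centred at
        // (x0, y0), into one voxel of a water phantom with cubic voxels of
        // side vox. ssd is the source-to-surface distance.
        __global__ void pencilBeamDose(const float* dd, const float* sigma,
                                       float* dose, int nx, int ny, int nz,
                                       float vox, float x0, float y0,
                                       float ssd)
        {
            int x = blockIdx.x * blockDim.x + threadIdx.x;
            int y = blockIdx.y * blockDim.y + threadIdx.y;
            int z = blockIdx.z;
            if (x >= nx || y >= ny || z >= nz) return;
            float depth = z * vox;
            float invsq = (ssd * ssd) / ((ssd + depth) * (ssd + depth));
            float rx = x * vox - x0, ry = y * vox - y0;
            float s2 = sigma[z] * sigma[z];
            float lat = expf(-(rx * rx + ry * ry) / (2.0f * s2))
                      / (2.0f * 3.14159265f * s2);      // normalised Gaussian
            dose[(z * ny + y) * nx + x] += dd[z] * invsq * lat;
        }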

  10. Real-time text extraction based on the page layout analysis system

    NASA Astrophysics Data System (ADS)

    Soua, M.; Benchekroun, A.; Kachouri, R.; Akil, M.

    2017-05-01

    Several approaches have been proposed to extract text from scanned documents. However, text extraction in heterogeneous documents remains a real challenge. Indeed, text extraction in this context is a difficult task because of variations in the text due to differences in size, style and orientation, as well as the complexity of the document region background. Recently, we proposed the improved hybrid binarization based on Kmeans method (I-HBK) to extract text suitably from heterogeneous documents. In this method, the Page Layout Analysis (PLA), part of the Tesseract OCR engine, is used to identify text and image regions. Afterwards our hybrid binarization is applied separately to each kind of region. On one side, gamma correction is applied before processing image regions. On the other side, binarization is performed directly on text regions. Then, a foreground and background color study is performed to correct inverted region colors. Finally, characters are located in the binarized regions based on the PLA algorithm. In this work, we extend the integration of the PLA algorithm within the I-HBK method. In addition, to speed up the text and image separation step, we employ an efficient GPU acceleration. Through the performed experiments, we demonstrate the high F-measure accuracy of the PLA algorithm, reaching 95% on the LRDE dataset. In addition, we compare the sequential and parallel PLA versions. The obtained results give a speedup of 3.7x when comparing the parallel PLA implementation on a GPU GTX 660 to the CPU version.
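
    The pixel-parallel core of such a GPU binarization step is easy to picture. A minimal CUDA sketch (illustrative only, not the I-HBK code), where the threshold is assumed to have been computed beforehand, e.g. from a k-means split of the gray-level histogram:

        // One thread per pixel applies a precomputed global threshold.
        __global__ void binarize(const unsigned char* gray,
                                 unsigned char* bin, int n,
                                 unsigned char thresh)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) bin[i] = (gray[i] > thresh) ? 255 : 0;
        }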

  11. Metastatic pancreatic cancer: Is there a light at the end of the tunnel?

    PubMed Central

    Vaccaro, Vanja; Sperduti, Isabella; Vari, Sabrina; Bria, Emilio; Melisi, Davide; Garufi, Carlo; Nuzzo, Carmen; Scarpa, Aldo; Tortora, Giampaolo; Cognetti, Francesco; Reni, Michele; Milella, Michele

    2015-01-01

    Due to extremely poor prognosis, pancreatic cancer (PDAC) represents the fourth leading cause of cancer-related death in Western countries. For more than a decade, gemcitabine (Gem) has been the mainstay of first-line PDAC treatment. Many efforts aimed at improving single-agent Gem efficacy by either combining it with a second cytotoxic/molecularly targeted agent or pharmacokinetic modulation provided disappointing results. Recently, the field of systemic therapy of advanced PDAC is finally moving forward. Polychemotherapy has shown promise over single-agent Gem: regimens like PEFG-PEXG-PDXG and GTX provide significant potential advantages in terms of survival and/or disease control, although sometimes at the cost of poor tolerability. The PRODIGE 4/ACCORD 11 was the first phase III trial to provide unequivocal benefit using the polychemotherapy regimen FOLFIRINOX; however, the less favorable safety profile and the characteristics of the enrolled population restrict the use of FOLFIRINOX to young and fit PDAC patients. The nanoparticle albumin-bound paclitaxel (nab-Paclitaxel) formulation was developed to overcome resistance due to the desmoplastic stroma surrounding pancreatic cancer cells. Regardless of whether or not this is its main mechanism of action, the combination of nab-Paclitaxel plus Gem showed a statistically and clinically significant survival advantage over single-agent Gem and significantly improved all the secondary endpoints. Furthermore, recent findings on maintenance therapy are opening up potential new avenues in the treatment of advanced PDAC, particularly in a new era in which highly effective first-line regimens allow patients to experience prolonged disease control. Here, we provide an overview of recent advances in the systemic treatment of advanced PDAC, mostly focusing on recent findings that have set new standards in metastatic disease. Potential avenues for further development in the metastatic setting and current efforts to integrate

  13. Uptake, transfer and elimination kinetics of paralytic shellfish toxins in common octopus (Octopus vulgaris).

    PubMed

    Lopes, Vanessa M; Baptista, Miguel; Repolho, Tiago; Rosa, Rui; Costa, Pedro Reis

    2014-01-01

    Marine phycotoxins derived from harmful algal blooms are known to be associated with mass mortalities in the higher trophic levels of marine food webs. Bivalve mollusks and planktivorous fish are the most studied vectors of marine phycotoxins. However, field surveys recently showed that cephalopod mollusks also constitute potential vectors of toxins. Thus, here we determine, for the first time, the time course of accumulation and depuration of paralytic shellfish toxins (PSTs) in the common octopus (Octopus vulgaris). Concomitantly, the underlying kinetics of toxin transfer between tissue compartments was also calculated. Naturally contaminated clams were used to orally expose the octopus to PSTs for 6 days. Afterwards, the octopus specimens were fed non-contaminated shellfish during a 10-day depuration period. Toxins reached the highest concentrations in the digestive gland, surpassing the levels in the kidney by three orders of magnitude. PSTs were not detected in any other tissue analyzed. Net accumulation efficiencies of 42% for GTX5, 36% for dcSTX and 23% for C1+2 were calculated for the digestive gland. These compounds were the most abundant toxins in both the digestive gland and the contaminated shellfish diet. The small differences in the relative abundance of each toxin observed between the prey and the cephalopod predator indicate low conversion rates of these toxins. The depuration period was best described using an exponential decay model comprising a single compartment (the entire viscera). It is worth noting that since octopuses' excretion and depuration rates are low, the digestive gland is able to accumulate very high toxin concentrations for long periods of time. Therefore, the present study clearly shows that O. vulgaris is a high-potential vector of PSTs during and even after the occurrence of these toxic algal blooms. Copyright © 2013 Elsevier B.V. All rights reserved.

  14. Validation of three short physical activity questionnaires with accelerometers among university students in Spain.

    PubMed

    Rodríguez-Muñoz, Sheila; Corella, Cristina; Abarca-Sos, Alberto; Zaragoza, Javier

    2017-12-01

    Physical activity (PA) in university students has not been analyzed with specific questionnaires tailored to this population. Therefore, the purpose of this study was to analyze the validity of three PA questionnaires developed for other populations by comparison with accelerometer values: counts and moderate-to-vigorous PA (MVPA) calculated with uniaxial and triaxial cut points. One hundred and forty-five university students from Spain (92 of them women) wore a GT3X or GT3X+ accelerometer to collect PA data over 7 full days. Three questionnaires, the Physical Activity Questionnaire for Adults (PAQ-AD), the Assessment of Physical Activity Questionnaire (APALQ), and the International Physical Activity Questionnaire Short Form (IPAQ-SF), were administered jointly with the collection of accelerometer values. Finally, after the application of inclusion criteria, data from 95 participants (62 women) with a mean age of 21.96±2.33 years were analyzed to compare the instruments' measures. The correlational analysis showed that the PAQ-AD (0.44-0.56) and IPAQ-SF (0.26-0.69) questionnaires were significantly related to accelerometer scores: counts, uniaxial MVPA and triaxial MVPA. Conversely, the APALQ displayed no significant relationship with accelerometer MVPA scores for males, with either set of cut points. The PAQ-AD and IPAQ-SF questionnaires have shown adequate validity for use with Spanish university students. The use of counts to validate self-report data, in order to reduce the variability displayed by MVPA created with different cut points, is discussed. Finally, validated instruments to measure PA in university students will allow implementation of strategies for PA promotion based on reliable data.

  15. Warm temperature acclimation impacts metabolism of paralytic shellfish toxins from Alexandrium minutum in commercial oysters.

    PubMed

    Farrell, Hazel; Seebacher, Frank; O'Connor, Wayne; Zammit, Anthony; Harwood, D Tim; Murray, Shauna

    2015-09-01

    Species of Alexandrium produce potent neurotoxins termed paralytic shellfish toxins and are expanding their ranges worldwide, concurrent with increases in sea surface temperature. The metabolism of molluscs is temperature dependent, and increases in ocean temperature may influence both the abundance and distribution of Alexandrium and the dynamics of toxin uptake and depuration in shellfish. Here, we conducted a large-scale study of the effect of temperature on the uptake and depuration of paralytic shellfish toxins in three commercial oysters (Saccostrea glomerata and diploid and triploid Crassostrea gigas, n = 252 per species/ploidy level). Oysters were acclimated to two constant temperatures, reflecting current and predicted climate scenarios (22 and 27 °C), and fed a diet including the paralytic shellfish toxin-producing species Alexandrium minutum. While the oysters fed on A. minutum in similar quantities, concentrations of the toxin analogue GTX1,4 were significantly lower in warm-acclimated S. glomerata and diploid C. gigas after 12 days. Following exposure to A. minutum, toxicity of triploid C. gigas was not affected by temperature. Generally, detoxification rates were reduced in warm-acclimated oysters. The routine metabolism of the oysters was not affected by the toxins, but a significant effect was found at a cellular level in diploid C. gigas. The increasing incidences of Alexandrium blooms worldwide are a challenge for shellfish food safety regulation. Our findings indicate that rising ocean temperatures may reduce paralytic shellfish toxin accumulation in two of the three oyster types; however, they may persist for longer periods in oyster tissue. © 2015 John Wiley & Sons Ltd.

  16. Delayed heart rate recovery after exercise as a risk factor of incident type 2 diabetes mellitus after adjusting for glycometabolic parameters in men.

    PubMed

    Yu, Tae Yang; Jee, Jae Hwan; Bae, Ji Cheol; Hong, Won-Jung; Jin, Sang-Man; Kim, Jae Hyeon; Lee, Moon-Kyu

    2016-10-15

    Some studies have reported that delayed heart rate recovery (HRR) after exercise is associated with incident type 2 diabetes mellitus (T2DM). This study aimed to investigate the longitudinal association of delayed HRR following a graded exercise treadmill test (GTX) with the development of T2DM, including glucose-associated parameters as adjusting factors, in healthy Korean men. Analyses were adjusted for fasting plasma glucose, HOMA-IR, HOMA-β, and HbA1c, in addition to known confounders. HRR was calculated as peak heart rate minus heart rate after a 1-min rest (HRR 1). A Cox proportional hazards model was used to quantify the independent association between HRR and incident T2DM. During 9082 person-years of follow-up between 2006 and 2012, there were 180 (10.1%) incident cases of T2DM. After adjustment for age, BMI, systolic BP, diastolic BP, smoking status, peak heart rate, peak oxygen uptake, TG, LDL-C, HDL-C, fasting plasma glucose, HOMA-IR, HOMA-β, and HbA1c, the hazard ratios (HRs) [95% confidence interval (CI)] of incident T2DM comparing the second and third tertiles to the first tertile of HRR 1 were 0.867 (0.609-1.235) and 0.624 (0.426-0.915), respectively (p for trend=0.017). As a continuous variable, in the fully-adjusted model, the HR (95% CI) of incident T2DM associated with each 1-beat increase in HRR 1 was 0.980 (0.960-1.000) (p=0.048). This study demonstrated that delayed HRR after exercise predicts incident T2DM in men, even after adjusting for fasting glucose, HOMA-IR, HOMA-β, and HbA1c. However, only HRR 1 had clinical significance. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.

  17. Uptake, distribution and depuration of paralytic shellfish toxins from Alexandrium minutum in Australian greenlip abalone, Haliotis laevigata.

    PubMed

    Dowsett, Natalie; Hallegraeff, Gustaaf; van Ruth, Paul; van Ginkel, Roel; McNabb, Paul; Hay, Brenda; O'Connor, Wayne; Kiermeier, Andreas; Deveney, Marty; McLeod, Catherine

    2011-07-01

    Farmed greenlip abalone Haliotis laevigata were fed commercial seaweed-based food pellets or feed pellets supplemented with 8 × 10⁵ Alexandrium minutum dinoflagellate cells g⁻¹ (containing 12 ± 3.0 μg STX-equivalent 100 g⁻¹, which was mainly GTX-1,4) every second day for 50 days. Exposure of abalone to PST-supplemented feed for 50 days did not affect behaviour or survival, but resulted in accumulation of up to 1.6 μg STX-equivalent 100 g⁻¹ in the abalone foot tissue (muscle, mouth without oesophagus and epipodial fringe), which is ∼50 times lower than the maximum permissible limit (80 μg 100 g⁻¹ tissue) for PSTs in molluscan shellfish. The PST levels in the foot were reduced to 0.48 μg STX-equivalent 100 g⁻¹ after scrubbing and removal of the pigment surrounding the epithelium of the epipodial fringe (confirmed by both HPLC and LC-MS/MS). Thus, scrubbing the epipodial fringe, a common procedure during commercial abalone canning, reduced PST levels by ∼70%. Only trace levels of PSTs were detected in the viscera (stomach, gut, heart, gonad, gills and mantle) of the abalone. A toxin reduction of approximately 73% was observed in STX-contaminated abalone held in clean water and fed uncontaminated food over 50 days. The low level of PST uptake when abalone were exposed to high numbers of A. minutum cells over a prolonged period may indicate a low risk of PSP poisoning to humans from the consumption of H. laevigata that has been exposed to a bloom of potentially toxic A. minutum in Australia. Further research is required to establish if non-dietary accumulation can result in significant levels of PSTs in abalone. Crown Copyright © 2011. Published by Elsevier Ltd. All rights reserved.

  18. Two-Dimensional High Definition Versus Three-Dimensional Endoscopy in Endonasal Skull Base Surgery: A Comparative Preclinical Study.

    PubMed

    Rampinelli, Vittorio; Doglietto, Francesco; Mattavelli, Davide; Qiu, Jimmy; Raffetti, Elena; Schreiber, Alberto; Villaret, Andrea Bolzoni; Kucharczyk, Walter; Donato, Francesco; Fontanella, Marco Maria; Nicolai, Piero

    2017-09-01

    Three-dimensional (3D) endoscopy has recently been introduced in endonasal skull base surgery. Only a relatively limited number of studies have compared it to 2-dimensional, high-definition technology. The objective was to compare, in a preclinical setting for endonasal endoscopic surgery, the surgical maneuverability of 2-dimensional, high-definition and 3D endoscopy. A group of 68 volunteers, novice and experienced surgeons, were asked to perform 2 tasks, namely simulating grasping and dissection surgical maneuvers, in a model of the nasal cavities. Time to complete the tasks was recorded. A questionnaire investigating subjective feelings during the tasks was completed by each participant. In 25 subjects, the surgeons' movements were continuously tracked by a magnetic-based neuronavigator coupled with dedicated software (ApproachViewer, part of GTx-UHN), and the recorded trajectories were analyzed by comparing jitter, sum of square differences, and funnel index. Total execution time was significantly lower with 3D technology (P < 0.05) for beginners and experts. Questionnaires showed that beginners preferred 3D endoscopy more frequently than experts. A minority (14%) of beginners experienced discomfort with 3D endoscopy. Analysis of jitter showed a trend toward increased effectiveness of surgical maneuvers with 3D endoscopy. Sum of square differences and funnel index analyses documented better values with 3D endoscopy in experts. In a preclinical setting for endonasal skull base surgery, 3D technology appears to confer an advantage in terms of time of execution and precision of surgical maneuvers. Copyright © 2017 Elsevier Inc. All rights reserved.

  19. A GPU-Parallelized Eigen-Based Clutter Filter Framework for Ultrasound Color Flow Imaging.

    PubMed

    Chee, Adrian J Y; Yiu, Billy Y S; Yu, Alfred C H

    2017-01-01

    Eigen-filters with attenuation response adapted to clutter statistics in color flow imaging (CFI) have shown improved flow detection sensitivity in the presence of tissue motion. Nevertheless, their practical adoption in clinical use is not straightforward due to the high computational cost of solving eigendecompositions. Here, we provide a pedagogical description of how a real-time computing framework for eigen-based clutter filtering can be developed through a single-instruction, multiple-data (SIMD) computing approach that can be implemented on a graphics processing unit (GPU). Emphasis is placed on the single-ensemble-based eigen-filtering approach (Hankel singular value decomposition), since it is algorithmically compatible with GPU-based SIMD computing. The key algebraic principles and the corresponding SIMD algorithm are explained, and annotations on how such an algorithm can be rationally implemented on the GPU are presented. The real-time efficacy of our framework was experimentally investigated on a single GPU device (GTX Titan X), and the computing throughput for varying scan depths and slow-time ensemble lengths was studied. Using our eigen-processing framework, real-time video-range throughput (24 frames/s) can be attained for CFI frames with full view in the azimuth direction (128 scanlines), up to a scan depth of 5 cm (λ pixel axial spacing) for a slow-time ensemble length of 16 samples. Compared with frames derived from non-adaptive polynomial regression clutter filtering, the corresponding CFI image frames yielded enhanced flow detection sensitivity in vivo, as demonstrated in a carotid imaging case example. These findings indicate that GPU-enabled eigen-based clutter filtering can improve CFI flow detection performance in real time.
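
    A sketch of the filtering step only (the per-ensemble eigendecomposition is not shown): given the leading clutter eigenvectors of each slow-time ensemble, one thread projects its ensemble onto the clutter subspace and subtracts the projection. Real-valued for brevity, whereas CFI data would be complex IQ; names and layout are illustrative, not the authors' framework.

        // x: [nEns * M] ensembles; E: [nEns * M * K] clutter eigenvectors
        // (K leading vectors of length M per ensemble); y: filtered output.
        __global__ void eigenClutterFilter(const float* x, const float* E,
                                           float* y, int nEns, int M, int K)
        {
            int e = blockIdx.x * blockDim.x + threadIdx.x;
            if (e >= nEns) return;
            const float* xe = x + e * M;
            const float* Ee = E + e * M * K;
            for (int m = 0; m < M; ++m) y[e * M + m] = xe[m];
            for (int k = 0; k < K; ++k) {
                float c = 0.0f;                        // c = <E_k, x>
                for (int m = 0; m < M; ++m) c += Ee[k * M + m] * xe[m];
                for (int m = 0; m < M; ++m) y[e * M + m] -= c * Ee[k * M + m];
            }
        }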

  20. Selective androgen receptor modulators for the prevention and treatment of muscle wasting associated with cancer.

    PubMed

    Dalton, James T; Taylor, Ryan P; Mohler, Michael L; Steiner, Mitchell S

    2013-12-01

    This review highlights selective androgen receptor modulators (SARMs) as emerging agents in late-stage clinical development for the prevention and treatment of muscle wasting associated with cancer. Muscle wasting, including a loss of skeletal muscle, is a cancer-related symptom that begins early in the progression of cancer and affects a patient's quality of life, ability to tolerate chemotherapy, and survival. SARMs increase muscle mass and improve physical function in healthy and diseased individuals, and may potentially provide a new therapy for muscle wasting and cancer cachexia. SARMs modulate the same anabolic pathways targeted with classical steroidal androgens, but within the dose range in which the expected effects on muscle mass and function are seen, androgenic side-effects on the prostate, skin, and hair have not been observed. Unlike testosterone, SARMs are orally active, nonaromatizable, nonvirilizing, and tissue-selective anabolic agents. Recent clinical efficacy data for LGD-4033, MK-0773, MK-3984, and enobosarm (GTx-024, ostarine, and S-22) are reviewed. Enobosarm, a nonsteroidal SARM, is the best characterized clinically, and has consistently demonstrated increases in lean body mass and better physical function across several populations, along with a lower hazard ratio for survival in cancer patients. Results of the Phase III clinical trials entitled Prevention and treatment Of muscle Wasting in patiEnts with Cancer1 (POWER1) and POWER2, completed in May 2013 and evaluating enobosarm for the prevention and treatment of muscle wasting in patients with non-small cell lung cancer, will be available soon, and will potentially establish a SARM, enobosarm, as the first drug for the prevention and treatment of muscle wasting in cancer patients.

  1. Experimental Evaluation of the Effect of Angle-of-attack on the External Aerodynamics and Mass Capture of a Symmetric Three-engine Air-breathing Launch Vehicle Configuration at Supersonic Speeds

    NASA Technical Reports Server (NTRS)

    Kim, Hyun D.; Frate, Franco C.

    2001-01-01

    A subscale aerodynamic model of the GTX air-breathing launch vehicle was tested at NASA Glenn Research Center's 10- by 10-Foot Supersonic Wind Tunnel from Mach 2.0 to 3.5 at various angles-of-attack. The objective of the test was to investigate the effect of angle-of-attack on inlet mass capture, inlet diverter effectiveness, and the flowfield at the cowl lip plane. The flow-through inlets were tested with and without boundary-layer diverters. Quantitative measurements such as inlet mass flow rates and pitot-pressure distributions in the cowl lip plane are presented. At a 3° angle-of-attack, the flow rates for the top and side inlets were within 8 percent of the zero angle-of-attack value, and little distortion was evident at the cowl lip plane. Surface oil flow patterns showing the shock/boundary-layer interaction caused by the inlet spikes are shown. In addition to inlet data, vehicle forebody static pressure distributions, boundary-layer profiles, and temperature-sensitive paint images to evaluate the boundary-layer transition are presented. Three-dimensional parabolized Navier-Stokes computational fluid dynamics calculations of the forebody flowfield are presented and show good agreement with the experimental static pressure distributions and boundary-layer profiles. With the boundary-layer diverters installed, no adverse aerodynamic phenomena were found that would prevent the inlets from operating at the required angles-of-attack. We recommend that phase 2 of the test program be initiated, where inlet contraction ratio and diverter geometry variations will be tested.

  2. Liquid water content variation with altitude in clouds over Europe

    NASA Astrophysics Data System (ADS)

    Andreea, Boscornea; Sabina, Stefan

    2013-04-01

    Cloud water content is one of the most fundamental measurements in cloud physics. Knowledge of the vertical variability of cloud microphysical characteristics is important for a variety of reasons. The profile of liquid water content (LWC) partially governs the radiative transfer of cloudy atmospheres; LWC profiles improve our understanding of the processes acting to form and maintain cloud systems and may lead to improvements in the representation of clouds in numerical models. Presently, in situ airborne measurements provide the most accurate information about cloud microphysical characteristics. This information can be used for verification of both numerical models and cloud remote sensing techniques. The aim of this paper was to analyze the liquid water content (LWC) measurements in clouds carried out during aircraft flights. The aircraft and its platform, ATMOSLAB - Airborne Laboratory for Environmental Atmospheric Research, are the property of the National Institute for Aerospace Research "Elie Carafoli" (INCAS), Bucharest, Romania. The airborne laboratory, equipped for special research missions, is based on a Hawker Beechcraft - King Air C90 GTx aircraft and carries a CAPS sensor system - Cloud, Aerosol and Precipitation Spectrometer (30 bins, 0.51-50 μm). The processed and analyzed measurements were acquired during 4 flights from Romania (Bucharest, 44°25'57″N 26°06'14″E) to Germany (Berlin, 52°30'2″N 13°23'56″E) above the same region of Europe. The flight path started from Bucharest towards the western part of Romania, continued above Hungary and Austria at a cruise altitude between 6000 and 8500 m, and reached Berlin after 5 hours. In total we acquired data during approximately 20 flight hours, and we present the vertical and horizontal LWC variations for different cloud types. For each cloud type, the LWC values are similar to values from the literature. The vertical LWC profiles in the atmosphere measured during takeoff and landing of the aircraft have shown their

  3. Validity of activity trackers, smartphones, and phone applications to measure steps in various walking conditions.

    PubMed

    Höchsmann, C; Knaier, R; Eymann, J; Hintermann, J; Infanger, D; Schmidt-Trucksäss, A

    2018-02-20

    To examine the validity of popular smartphone accelerometer applications and a consumer activity wristband compared to a widely used research accelerometer, while assessing the impact of the phone's position on the accuracy of step detection. Twenty volunteers from 2 different age groups (Group A: 18-25 years, n = 10; Group B: 45-70 years, n = 10) were equipped with 3 iPhone SE smartphones (placed in pants pocket, shoulder bag, and backpack), 1 Samsung Galaxy S6 Edge (pants pocket), 1 Garmin Vivofit 2 wristband, and 2 ActiGraph wGT3X+ devices (worn at wrist and hip) while walking on a treadmill (1.6, 3.2, 4.8, and 6.0 km/h) and completing a walking course. All smartphones included 6 accelerometer applications. Video observation was used as the gold standard. Validity was evaluated by comparing each device with the gold standard using mean absolute percentage errors (MAPE). The MAPE of the iPhone SE (all positions) and the Garmin Vivofit was small (<3%) for treadmill walking ≥3.2 km/h and for free walking. The Samsung Galaxy and hip-worn ActiGraph showed small MAPE only for treadmill walking at 4.8 and 6.0 km/h and for free walking. The wrist-worn ActiGraph showed high MAPE (17-47%) for all walking conditions. The iPhone SE and the Garmin Vivofit 2 are accurate tools for step counting in different age groups and during various walking conditions, even during slow walking. The phone's position does not impact the accuracy of step detection, which substantially improves the versatility for physical activity assessment in clinical and research settings. © 2018 John Wiley & Sons A/S. Published by John Wiley & Sons Ltd.

  4. Choosing erosion control nets. Can't you decide? Ask the lab.

    NASA Astrophysics Data System (ADS)

    Simkova, Jana; Jacka, Lukas

    2015-04-01

    Geotextiles (GTXs) have been used to protect steep slopes against soil erosion for about 60 years, and many products have become available. The choice of an individual product is always based on its ratio of cost versus effectiveness. Generally applicable recommendations for specific site conditions are missing, and testing the effectiveness of GTXs in the field is time consuming and costly. Due to varying site conditions, the results of numerous case studies cannot be generalized. One of the major and site-specific factors affecting the erosion process, and hence the effectiveness of GTXs, is the soil. This study aimed to determine the rate of influence of three natural erosion control nets on the volume and velocity of surface runoff caused by rainfall. The nets were installed on a slope under laboratory conditions and then exposed to simulated rainfall. An impermeable plastic film was used as a substrate instead of soil to simulate non-infiltrating conditions. A comparison of the influence of the tested GTX samples on surface runoff may indicate their erosion control effect. Thus, the results could help with choosing a particular product. Under real conditions, the effect of erosion control nets would be increased by the infiltration capacity of the soil, equally for all samples. Therefore, the order of effectiveness of the samples should stay unchanged. To validate this theory, a field experiment was carried out in which soil loss was recorded along with runoff characteristics. The data trends of discharge culmination under natural conditions were similar to the trends under laboratory conditions and corresponded to the soil loss records.

  5. MULTI-CORE AND OPTICAL PROCESSOR RELATED APPLICATIONS RESEARCH AT OAK RIDGE NATIONAL LABORATORY

    SciTech Connect

    Barhen, Jacob; Kerekes, Ryan A; ST Charles, Jesse Lee

    2008-01-01

    High-speed parallelization of common tasks holds great promise as a low-risk approach to achieving the significant increases in signal processing and computational performance required for next generation innovations in reconfigurable radio systems. Researchers at the Oak Ridge National Laboratory have been working on exploiting the parallelization offered by this emerging technology and applying it to a variety of problems. This paper will highlight recent experience with four different parallel processors applied to signal processing tasks that are directly relevant to signal processing required for SDR/CR waveforms. The first is the EnLight Optical Core Processor applied to matched filter (MF) correlation processing via fast Fourier transform (FFT) of broadband Doppler-sensitive waveforms (DSW) using active sonar arrays for target tracking. The second is the IBM CELL Broadband Engine applied to a 2-D discrete Fourier transform (DFT) kernel for image processing and frequency domain processing. The third is the NVIDIA graphical processor applied to document feature clustering. EnLight Optical Core Processor: Optical processing is inherently capable of high parallelism that can be translated to very high performance, low power dissipation computing. The EnLight 256 is a small form factor signal processing chip (5×5 cm²) with a digital optical core that is being developed by an Israeli startup company. As part of its evaluation of foreign technology, ORNL's Center for Engineering Science Advanced Research (CESAR) had access to precursor EnLight 64 Alpha hardware for a preliminary assessment of capabilities in terms of large Fourier transforms for matched filter banks and on applications related to Doppler-sensitive waveforms. This processor is optimized for array operations, which it performs in fixed-point arithmetic at the rate of 16 TeraOPS at 8-bit precision. This is approximately 1000 times faster than the fastest DSP available today. The optical

  6. HONEI: A collection of libraries for numerical computations targeting multiple processor architectures

    NASA Astrophysics Data System (ADS)

    van Dyk, Danny; Geveler, Markus; Mallach, Sven; Ribbrock, Dirk; Göddeke, Dominik; Gutwenger, Carsten

    2009-12-01

    We present HONEI, an open-source collection of libraries offering a hardware oriented approach to numerical calculations. HONEI abstracts the hardware, and applications written on top of HONEI can be executed on a wide range of computer architectures such as CPUs, GPUs and the Cell processor. We demonstrate the flexibility and performance of our approach with two test applications, a Finite Element multigrid solver for the Poisson problem and a robust and fast simulation of shallow water waves. By linking against HONEI's libraries, we achieve a two-fold speedup over straightforward C++ code using HONEI's SSE backend, and an additional 3-4 and 4-16 times faster execution on the Cell and a GPU, respectively. A second important aspect of our approach is that the full performance capabilities of the hardware under consideration can be exploited by adding optimised application-specific operations to the HONEI libraries. HONEI provides all necessary infrastructure for development and evaluation of such kernels, significantly simplifying their development. Program summary: Program title: HONEI Catalogue identifier: AEDW_v1_0 Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEDW_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: GPLv2 No. of lines in distributed program, including test data, etc.: 216 180 No. of bytes in distributed program, including test data, etc.: 1 270 140 Distribution format: tar.gz Programming language: C++ Computer: x86, x86_64, NVIDIA CUDA GPUs, Cell blades and PlayStation 3 Operating system: Linux RAM: at least 500 MB free Classification: 4.8, 4.3, 6.1 External routines: SSE: none; [1] for GPU, [2] for Cell backend Nature of problem: Computational science in general and numerical simulation in particular have reached a turning point. The revolution developers are facing is not primarily driven by a change in (problem-specific) methodology, but rather by the fundamental paradigm shift of the
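
    To make the backend idea concrete, here is a small sketch (names hypothetical, not HONEI's actual API) of tag-based dispatch in CUDA C++: the same operation is specialised per backend, so application code stays hardware-agnostic, mirroring the approach the summary describes.

        #include <cuda_runtime.h>

        struct tags { struct CPU {}; struct GPU {}; };

        __global__ void scaleKernel(float* x, float a, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) x[i] *= a;
        }

        template <typename Tag> struct Scale;

        template <> struct Scale<tags::CPU> {      // portable fallback
            static void value(float* x, float a, int n) {
                for (int i = 0; i < n; ++i) x[i] *= a;
            }
        };

        template <> struct Scale<tags::GPU> {      // CUDA backend;
            static void value(float* x, float a, int n) {  // x on device
                scaleKernel<<<(n + 255) / 256, 256>>>(x, a, n);
                cudaDeviceSynchronize();
            }
        };

    Application code then calls Scale<tags::GPU>::value(...) or Scale<tags::CPU>::value(...) without changing anything else, which is the essence of the hardware abstraction being described.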

  7. Big Data GPU-Driven Parallel Processing Spatial and Spatio-Temporal Clustering Algorithms

    NASA Astrophysics Data System (ADS)

    Konstantaras, Antonios; Skounakis, Emmanouil; Kilty, James-Alexander; Frantzeskakis, Theofanis; Maravelakis, Emmanuel

    2016-04-01

  8. Massive parallelization of a 3D finite difference electromagnetic forward solution using domain decomposition methods on multiple CUDA enabled GPUs

    NASA Astrophysics Data System (ADS)

    Schultz, A.

    2010-12-01

    3D forward solvers lie at the core of inverse formulations used to image the variation of electrical conductivity within the Earth's interior. This property is associated with variations in temperature, composition, phase, presence of volatiles, and in specific settings, the presence of groundwater, geothermal resources, oil/gas or minerals. The high cost of 3D solutions has been a stumbling block to wider adoption of 3D methods. Parallel algorithms for modeling frequency-domain 3D EM problems have not achieved wide-scale adoption, with emphasis on fairly coarse-grained parallelism using MPI and similar approaches. The communications bandwidth as well as the latency required to send and receive network communication packets is a limiting factor in implementing fine-grained parallel strategies, inhibiting wide adoption of these algorithms. Leading Graphics Processor Unit (GPU) companies now produce GPUs with hundreds of GPU processor cores per die. The footprint, in silicon, of the GPU's restricted instruction set is much smaller than the general purpose instruction set required of a CPU. Consequently, the density of processor cores on a GPU can be much greater than on a CPU. GPUs also have local memory, registers and high speed communication with host CPUs, usually through PCIe type interconnects. The extremely low cost and high computational power of GPUs provide the EM geophysics community with an opportunity to achieve fine-grained (i.e. massive) parallelization of codes on low cost hardware. The current generation of GPUs (e.g. NVidia Fermi) provides 3 billion transistors per chip die, with nearly 500 processor cores and up to 6 GB of fast (DDR5) GPU memory. This latest generation of GPU supports fast hardware double precision (64 bit) floating point operations of the type required for frequency domain EM forward solutions. Each Fermi GPU board can sustain nearly 1 TFLOP in double precision, and multiple boards can be installed in the host computer system. We
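
    As an illustration of the fine-grained parallelism being described (not code from this work), here is a double-precision CUDA stencil kernel of the kind that sits at the core of a finite-difference forward solver. The real frequency-domain EM operator is complex-valued and material-dependent, so this Laplacian sweep is only a structural sketch:

        // One thread updates one interior grid point of a 7-point stencil;
        // h2inv = 1 / h^2 for grid spacing h.
        __global__ void fdSweep(const double* u, double* v,
                                int nx, int ny, int nz, double h2inv)
        {
            int x = blockIdx.x * blockDim.x + threadIdx.x;
            int y = blockIdx.y * blockDim.y + threadIdx.y;
            int z = blockIdx.z;
            if (x <= 0 || y <= 0 || z <= 0 ||
                x >= nx - 1 || y >= ny - 1 || z >= nz - 1) return;
            long long i = ((long long)z * ny + y) * nx + x;
            long long sxy = (long long)nx * ny;
            v[i] = h2inv * (u[i - 1] + u[i + 1] + u[i - nx] + u[i + nx]
                          + u[i - sxy] + u[i + sxy] - 6.0 * u[i]);
        }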

  9. ARCHER{sub RT} – A GPU-based and photon-electron coupled Monte Carlo dose computing engine for radiation therapy: Software development and application to helical tomotherapy

    SciTech Connect

    Su, Lin; Du, Xining; Liu, Tianyu

    Purpose: Using graphical processing unit (GPU) hardware technology, an extremely fast Monte Carlo (MC) code, ARCHER{sub RT}, is developed for radiation dose calculations in radiation therapy. This paper describes the detailed software development and testing for three clinical TomoTherapy® cases: the prostate, lung, and head and neck. Methods: To obtain clinically relevant dose distributions, phase space files (PSFs) created from optimized radiation therapy treatment plan fluence maps were used as the input to ARCHER{sub RT}. Patient-specific phantoms were constructed from patient CT images. Batch simulations were employed to facilitate the time-consuming task of loading large PSFs, and to improve the estimation of statistical uncertainty. Furthermore, two different Woodcock tracking algorithms were implemented and their relative performance was compared. The dose curves of an Elekta accelerator PSF incident on a homogeneous water phantom were benchmarked against DOSXYZnrc. For each of the treatment cases, dose volume histograms and isodose maps were produced from ARCHER{sub RT} and the general-purpose code, GEANT4. The gamma index analysis was performed to evaluate the similarity of voxel doses obtained from these two codes. The hardware accelerators used in this study are one NVIDIA K20 GPU, one NVIDIA K40 GPU, and six NVIDIA M2090 GPUs. In addition, to make a fairer comparison of the CPU and GPU performance, a multithreaded CPU code was developed using OpenMP and tested on an Intel E5-2620 CPU. Results: For the water phantom, the depth dose curve and dose profiles from ARCHER{sub RT} agree well with DOSXYZnrc. For clinical cases, results from ARCHER{sub RT} are compared with those from GEANT4 and good agreement is observed. Gamma index test is performed for voxels whose dose is greater than 10% of maximum dose. For 2%/2mm criteria, the passing rates for the prostate, lung case, and head and neck cases are 99.7%, 98.5%, and 97.2%, respectively. Due to
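
    One technique mentioned above, Woodcock (delta) tracking, is easy to sketch in CUDA: free paths are sampled with a global majorant cross-section, and a real interaction is accepted with probability sigma(voxel)/sigma_max, so no explicit voxel-boundary crossings are needed. The sketch below is illustrative only (1D geometry, hypothetical names), not ARCHER's implementation:

        #include <curand_kernel.h>

        // Sample the distance to the next real interaction along x.
        // Returns -1 if the particle leaves the phantom first.
        __device__ float woodcockDistance(curandState* rng,
                                          const float* sigma,  // per voxel
                                          float sigmaMax, float x0,
                                          float vox, int nx)
        {
            float x = x0;
            while (true) {
                x += -logf(curand_uniform(rng)) / sigmaMax;  // tentative step
                int ix = (int)(x / vox);
                if (ix < 0 || ix >= nx) return -1.0f;        // escaped
                if (curand_uniform(rng) * sigmaMax <= sigma[ix])
                    return x;                                // real collision
            }
        }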

  10. Accelerating numerical solution of stochastic differential equations with CUDA

    NASA Astrophysics Data System (ADS)

    Januszewski, M.; Kostur, M.

    2010-01-01

    Numerical integration of stochastic differential equations is commonly used in many branches of science. In this paper we present how to accelerate this kind of numerical calculation with popular NVIDIA Graphics Processing Units using the CUDA programming environment. We address general aspects of numerical programming on stream processors and illustrate them with two examples: the noisy phase dynamics in a Josephson junction and the noisy Kuramoto model. In the presented cases the measured speedup can be as high as 675× compared to a typical CPU, which corresponds to several billion integration steps per second. This means that calculations which took weeks can now be completed in less than one hour. This brings stochastic simulation to a completely new level, opening for research a whole new range of problems which can now be solved interactively. Program summary: Program title: SDE Catalogue identifier: AEFG_v1_0 Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEFG_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Gnu GPL v3 No. of lines in distributed program, including test data, etc.: 978 No. of bytes in distributed program, including test data, etc.: 5905 Distribution format: tar.gz Programming language: CUDA C Computer: any system with a CUDA-compatible GPU Operating system: Linux RAM: 64 MB of GPU memory Classification: 4.3 External routines: The program requires the NVIDIA CUDA Toolkit Version 2.0 or newer and the GNU Scientific Library v1.0 or newer. Optionally gnuplot is recommended for quick visualization of the results. Nature of problem: Direct numerical integration of stochastic differential equations is a computationally intensive problem, due to the necessity of calculating multiple independent realizations of the system. We exploit the inherent parallelism of this problem and perform the calculations on GPUs using the CUDA programming environment. The GPU's ability to execute
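
    A minimal CUDA sketch of the pattern exploited here (not the distributed SDE program itself): each thread integrates one independent realisation with the Euler-Maruyama scheme, drawing Gaussian increments from cuRAND. The drift shown is an illustrative overdamped Josephson-junction-like choice, a(x) = F - sin(x), with constant noise amplitude.

        #include <curand_kernel.h>

        // dx = (F - sin(x)) dt + noise dW, integrated for nSteps per thread.
        __global__ void eulerMaruyama(float* x, unsigned long long seed,
                                      int nPaths, int nSteps, float dt,
                                      float F, float noise)
        {
            int p = blockIdx.x * blockDim.x + threadIdx.x;
            if (p >= nPaths) return;
            curandState rng;
            curand_init(seed, p, 0, &rng);   // one RNG stream per path
            float xp = x[p];
            float sdt = sqrtf(dt);
            for (int s = 0; s < nSteps; ++s)
                xp += (F - sinf(xp)) * dt + noise * sdt * curand_normal(&rng);
            x[p] = xp;
        }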

  11. A GPU OpenCL based cross-platform Monte Carlo dose calculation engine (goMC)

    NASA Astrophysics Data System (ADS)

    Tian, Zhen; Shi, Feng; Folkerts, Michael; Qin, Nan; Jiang, Steve B.; Jia, Xun

    2015-09-01

    Monte Carlo (MC) simulation has been recognized as the most accurate dose calculation method for radiotherapy. However, the extremely long computation time impedes its clinical application. Recently, a lot of effort has been made to realize fast MC dose calculation on graphics processing units (GPUs). However, most of the GPU-based MC dose engines have been developed under NVidia’s CUDA environment. This limits the code portability to other platforms, hindering the introduction of GPU-based MC simulations to clinical practice. The objective of this paper is to develop a GPU OpenCL based cross-platform MC dose engine named goMC with coupled photon-electron simulation for external photon and electron radiotherapy in the MeV energy range. Compared to our previously developed GPU-based MC code named gDPM (Jia et al 2012 Phys. Med. Biol. 57 7783-97), goMC has two major differences. First, it was developed under the OpenCL environment for high code portability and hence could be run not only on different GPU cards but also on CPU platforms. Second, we adopted the electron transport model used in the EGSnrc MC package and PENELOPE’s random hinge method in our new dose engine, instead of the dose planning method employed in gDPM. Dose distributions were calculated for a 15 MeV electron beam and a 6 MV photon beam in a homogeneous water phantom, a water-bone-lung-water slab phantom and a half-slab phantom. Satisfactory agreement between the two MC dose engines goMC and gDPM was observed in all cases. The average dose differences in the regions that received a dose higher than 10% of the maximum dose were 0.48-0.53% for the electron beam cases and 0.15-0.17% for the photon beam cases. In terms of efficiency, goMC was ~4-16% slower than gDPM when running on the same NVidia TITAN card for all the cases we tested, due to both the different electron transport models and the different development environments. The code portability of our new dose engine goMC was validated by

  12. A method of gravity and seismic sequential inversion and its GPU implementation

    NASA Astrophysics Data System (ADS)

    Liu, G.; Meng, X.

    2011-12-01

    In this abstract, we introduce a sequential gravity and seismic inversion method that inverts for density and velocity together. For the gravity inversion, we use an iterative method based on a correlation imaging algorithm; for the seismic inversion, we use full waveform inversion. The link between density and velocity is an empirical relation, the Gardner equation. For large volumes of data, we use the GPU to accelerate the computation. The gravity inversion proceeds iteratively: first, we compute the correlation imaging of the observed gravity anomaly, which yields values between -1 and +1, and multiply it by a small density increment to obtain the initial density model. We then compute a forward result from this model, calculate the correlation imaging of the misfit between the observed and forward data, multiply that result by a small density increment, and add it to the current model; repeating this procedure yields the final inverted density model. For the seismic inversion, we use a method based on the linearity of the acoustic wave equation written in the frequency domain; starting from an initial velocity model, we can obtain a good velocity result. In the sequential inversion of gravity and seismic data, a link formula is needed to convert between density and velocity; in our method, we use the Gardner equation. Driven by the insatiable market demand for real-time, high-definition 3D images, the programmable NVIDIA Graphics Processing Unit (GPU) has been developed into a co-processor of the CPU for high-performance computing. Compute Unified Device Architecture (CUDA) is a parallel programming model and software environment provided by NVIDIA, designed to overcome the challenge of traditional general-purpose GPU programming while maintaining a low learning curve for programmers familiar with standard languages such as C. In our inversion processing
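
    For reference, the Gardner relation commonly used as such a link is ρ = a·V^0.25 (with a ≈ 0.31 for V in m/s and ρ in g/cm³). On the GPU, the model update described above reduces to an axpy-style kernel over the model grid; a sketch with illustrative names, not the authors' code:

        // Add the scaled correlation image of the current data misfit
        // (values in [-1, 1]) to the density model, one thread per cell.
        __global__ void updateDensity(float* model, const float* corrImage,
                                      float dRho, int n)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) model[i] += dRho * corrImage[i];
        }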

  13. Musrfit - Real Time Parameter Fitting Using GPUs

    NASA Astrophysics Data System (ADS)

    Locans, Uldis; Suter, Andreas

    High transverse field μSR (HTF-μSR) experiments typically lead to rather large data sets, since it is necessary to follow the high frequencies present in the positron decay histograms. The analysis of these data sets can be very time consuming, usually due to the limited computational power of the hardware. To overcome the limited computing resources, the rotating reference frame (RRF) transformation is often used to reduce the data sets that need to be handled. This comes at a price the μSR community is typically not aware of: (i) due to the RRF transformation the fitting parameter estimates are of poorer precision, i.e., more extended, expensive beamtime is needed; (ii) the RRF introduces systematic errors which hamper the statistical interpretation of χ2 or the maximum log-likelihood. We briefly discuss these issues in a non-exhaustive, practical way. The one and only reason for using the RRF transformation is sluggish computing power. Therefore, during this work, GPU (Graphical Processing Unit) based fitting was developed, which allows one to perform real-time full data analysis without the RRF. GPUs have become increasingly popular in scientific computing in recent years. Due to their highly parallel architecture they provide the opportunity to accelerate many applications at considerably lower cost than upgrading the CPU computational power. With the emergence of frameworks such as CUDA and OpenCL these devices have become more easily programmable. During this work GPU support was added to Musrfit, a data analysis framework for μSR experiments. The new fitting algorithm uses CUDA or OpenCL to offload the most time-consuming parts of the calculations to Nvidia or AMD GPUs. Using the current CPU implementation in Musrfit, parameter fitting can take hours for certain data sets, while the GPU version allows real-time data analysis on the same data sets. This work describes the challenges that arise in adding GPU support to Musrfit as well as the results obtained.
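
    The most time-consuming part of such a fit is evaluating the theory function at every histogram bin and reducing χ2. A generic CUDA sketch of that step (the model below is a simple damped oscillation for illustration; Musrfit's actual theory functions and kernels are more elaborate and not shown in this record):

        // chi^2 = sum_i (data_i - model_i)^2 / var_i, reduced per block.
        // Launch: chi2Kernel<<<nBlocks, bs, bs * sizeof(float)>>>(...);
        // bs must be a power of two; partial sums are added on the host.
        __global__ void chi2Kernel(const float* data, const float* var,
                                   const float* t, float* partial, int n,
                                   float A, float freq, float lambda)
        {
            extern __shared__ float s[];
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            float term = 0.0f;
            if (i < n) {
                float m = A * expf(-lambda * t[i]) * cosf(freq * t[i]);
                float r = data[i] - m;
                term = r * r / var[i];
            }
            s[threadIdx.x] = term;
            __syncthreads();
            for (int st = blockDim.x / 2; st > 0; st >>= 1) {
                if (threadIdx.x < st) s[threadIdx.x] += s[threadIdx.x + st];
                __syncthreads();
            }
            if (threadIdx.x == 0) partial[blockIdx.x] = s[0];
        }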

  14. GAMUT: GPU accelerated microRNA analysis to uncover target genes through CUDA-miRanda

    PubMed Central

    2014-01-01

    Background Non-coding sequences such as microRNAs have important roles in disease processes. Computational microRNA target identification (CMTI) is becoming increasingly important since traditional experimental methods for target identification pose many difficulties. These methods are time-consuming, costly, and often need guidance from computational methods to narrow down candidate genes anyway. However, most CMTI methods are computationally demanding, since they need to handle not only several million query microRNA and reference RNA pairs, but also several million nucleotide comparisons within each given pair. Thus, the need to perform microRNA identification at such large scale has increased the demand for parallel computing. Methods Although most CMTI programs (e.g., the miRanda algorithm) are based on a modified Smith-Waterman (SW) algorithm, the existing parallel SW implementations (e.g., CUDASW++ 2.0/3.0, SWIPE) are unable to meet this demand in CMTI tasks. We present CUDA-miRanda, a fast microRNA target identification algorithm that takes advantage of massively parallel computing on Graphics Processing Units (GPU) using NVIDIA's Compute Unified Device Architecture (CUDA). CUDA-miRanda specifically focuses on the local alignment of short (i.e., ≤ 32 nucleotides) sequences against longer reference sequences (e.g., 20K nucleotides). Moreover, the proposed algorithm is able to report multiple alignments (up to 191 top scores) and the corresponding traceback sequences for any given (query sequence, reference sequence) pair. Results Speeds over 5.36 Giga Cell Updates Per Second (GCUPs) are achieved on a server with 4 NVIDIA Tesla M2090 GPUs. Compared to the original miRanda algorithm, which is evaluated on an Intel Xeon E5620@2.4 GHz CPU, the experimental results show up to 166 times performance gains in terms of execution time. In addition, we have verified that the exact same targets were predicted in both CUDA-miRanda and the original miRanda implementations.
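
    For context, the Smith-Waterman recurrence that both miRanda and CUDA-miRanda parallelize scores each cell of the alignment matrix from its three already-computed neighbours. A minimal sketch of the per-cell update (with illustrative scoring parameters, not CUDA-miRanda's actual ones) is:

      // One local-alignment cell update: take the best of a (mis)match
      // from the diagonal, a gap from above or from the left, and the
      // zero floor that makes the alignment local.
      __device__ int sw_cell(int diag, int up, int left,
                             char q, char r, int match, int mismatch, int gap)
      {
          int score = diag + (q == r ? match : mismatch);
          score = max(score, up - gap);     // gap in the query
          score = max(score, left - gap);   // gap in the reference
          return max(score, 0);             // local alignment floor
      }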

  15. Parallelized Kalman-Filter-Based Reconstruction of Particle Tracks on Many-Core Processors and GPUs

    SciTech Connect

    Cerati, Giuseppe; Elmer, Peter; Krutelyov, Slava

    2017-01-01

    For over a decade now, physical and energy constraints have limited clock speed improvements in commodity microprocessors. Instead, chipmakers have been pushed into producing lower-power, multi-core processors such as Graphical Processing Units (GPU), ARM CPUs, and Intel MICs. Broad-based efforts from manufacturers and developers have been devoted to making these processors user-friendly enough to perform general computations. However, extracting performance from a larger number of cores, as well as specialized vector or SIMD units, requires special care in algorithm design and code optimization. One of the most computationally challenging problems in high-energy particle experiments is finding and fitting the charged-particle tracks during event reconstruction. This is expected to become by far the dominant problem at the High-Luminosity Large Hadron Collider (HL-LHC), for example. Today the most common track finding methods are those based on the Kalman filter. Experience with Kalman techniques on real tracking detector systems has shown that they are robust and provide high physics performance. This is why they are currently in use at the LHC, both in the trigger and offline. Previously we reported on the significant parallel speedups that resulted from our investigations to adapt Kalman filters to track fitting and track building on Intel Xeon and Xeon Phi. Here, we discuss our progress toward understanding these processors and the new developments to port the Kalman filter to NVIDIA GPUs.

  16. Accelerating Climate and Weather Simulations through Hybrid Computing

    NASA Technical Reports Server (NTRS)

    Zhou, Shujia; Cruz, Carlos; Duffy, Daniel; Tucker, Robert; Purcell, Mark

    2011-01-01

    Unconventional multi- and many-core processors (e.g. IBM (R) Cell B.E.(TM) and NVIDIA (R) GPU) have emerged as effective accelerators in trial climate and weather simulations. Yet these climate and weather models typically run on parallel computers with conventional processors (e.g. Intel, AMD, and IBM) using Message Passing Interface. To address challenges involved in efficiently and easily connecting accelerators to parallel computers, we investigated using IBM's Dynamic Application Virtualization (TM) (IBM DAV) software in a prototype hybrid computing system with representative climate and weather model components. The hybrid system comprises two Intel blades and two IBM QS22 Cell B.E. blades, connected with both InfiniBand(R) (IB) and 1-Gigabit Ethernet. The system significantly accelerates a solar radiation model component by offloading compute-intensive calculations to the Cell blades. Systematic tests show that IBM DAV can seamlessly offload compute-intensive calculations from Intel blades to Cell B.E. blades in a scalable, load-balanced manner. However, noticeable communication overhead was observed, mainly due to IP over the IB protocol. Full utilization of IB Sockets Direct Protocol and the lower latency production version of IBM DAV will reduce this overhead.

  17. DeepSAT's CloudCNN: A Deep Neural Network for Rapid Cloud Detection from Geostationary Satellites

    NASA Astrophysics Data System (ADS)

    Kalia, S.; Li, S.; Ganguly, S.; Nemani, R. R.

    2017-12-01

    Cloud and cloud shadow detection has important applications in weather and climate studies. It is even more crucial when we introduce geostationary satellites into the field of terrestrial remote sensing. With the challenges associated with data acquired at very high frequency (10-15 mins per scan), the ability to derive an accurate cloud/shadow mask from geostationary satellite data is critical. The key to the success of most existing algorithms is spatially and temporally varying thresholds, which better capture local atmospheric and surface effects. However, the selection of a proper threshold is difficult and may lead to erroneous results. In this work, we propose a deep neural network based approach called CloudCNN to classify cloud/shadow from Himawari-8 AHI and GOES-16 ABI multispectral data. DeepSAT's CloudCNN consists of an encoder-decoder based architecture for binary-class pixel-wise segmentation. We train CloudCNN on a multi-GPU Nvidia Devbox cluster, and deploy the prediction pipeline on the NASA Earth Exchange (NEX) Pleiades supercomputer. We achieved an overall accuracy of 93.29% on test samples. Since the predictions take only a few seconds to segment a full multi-spectral GOES-16 or Himawari-8 Full Disk image, the developed framework can be used for real-time cloud detection, cyclone detection, or extreme weather event prediction.

  18. fast_protein_cluster: parallel and optimized clustering of large-scale protein modeling data.

    PubMed

    Hung, Ling-Hong; Samudrala, Ram

    2014-06-15

    fast_protein_cluster is a fast, parallel and memory efficient package used to cluster 60 000 sets of protein models (with up to 550 000 models per set) generated by the Nutritious Rice for the World project. fast_protein_cluster is an optimized and extensible toolkit that supports Root Mean Square Deviation after optimal superposition (RMSD) and Template Modeling score (TM-score) as metrics. RMSD calculations using a laptop CPU are 60× faster than qcprot and 3× faster than current graphics processing unit (GPU) implementations. New GPU code further increases the speed of RMSD and TM-score calculations. fast_protein_cluster provides novel k-means and hierarchical clustering methods that are up to 250× and 2000× faster, respectively, than Clusco, and identify significantly more accurate models than Spicker and Clusco. fast_protein_cluster is written in C++ using OpenMP for multi-threading support. Custom streaming Single Instruction Multiple Data (SIMD) extensions and advanced vector extension intrinsics code accelerate CPU calculations, and OpenCL kernels support AMD and Nvidia GPUs. fast_protein_cluster is available under the M.I.T. license. (http://software.compbio.washington.edu/fast_protein_cluster) © The Author 2014. Published by Oxford University Press.

  19. Analysis of performance improvements for host and GPU interface of the APENet+ 3D Torus network

    NASA Astrophysics Data System (ADS)

    Ammendola A, R.; Biagioni, A.; Frezza, O.; Lo Cicero, F.; Lonardo, A.; Paolucci, P. S.; Rossetti, D.; Simula, F.; Tosoratto, L.; Vicini, P.

    2014-06-01

    APEnet+ is an INFN (Italian Institute for Nuclear Physics) project aiming to develop a custom 3-dimensional torus interconnect network optimized for hybrid CPU-GPU clusters dedicated to high-performance scientific computing. The APEnet+ interconnect fabric is built on an FPGA-based PCI-Express board with 6 bi-directional off-board links offering 34 Gbps of raw bandwidth per direction, and leverages the peer-to-peer capabilities of Fermi- and Kepler-class NVIDIA GPUs to obtain true zero-copy, GPU-to-GPU low latency transfers. The minimization of APEnet+ transfer latency is achieved through the adoption of an RDMA protocol implemented in the FPGA with specialized hardware blocks tightly coupled with an embedded microprocessor. This architecture provides a high performance, low latency offload engine for both the transmit and receive sides of data transactions: preliminary results are encouraging, showing a 50% bandwidth increase for large packet size transfers. In this paper we describe the APEnet+ architecture, detail the hardware implementation and discuss the impact of such RDMA specialized hardware on host interface latency and bandwidth.

  20. GALARIO: a GPU accelerated library for analysing radio interferometer observations

    NASA Astrophysics Data System (ADS)

    Tazzari, Marco; Beaujean, Frederik; Testi, Leonardo

    2018-06-01

    We present GALARIO, a computational library that exploits the power of modern graphical processing units (GPUs) to accelerate the analysis of observations from radio interferometers like the Atacama Large Millimeter/submillimeter Array or the Karl G. Jansky Very Large Array. GALARIO speeds up the computation of synthetic visibilities from a generic 2D model image or a radial brightness profile (for axisymmetric sources). On a GPU, GALARIO is 150 times faster than standard PYTHON and 10 times faster than serial C++ code on a CPU. Highly modular, easy to use, and easy to adopt in existing code, GALARIO comes as two compiled libraries, one for Nvidia GPUs and one for multicore CPUs, where both have the same functions with identical interfaces. GALARIO comes with PYTHON bindings but can also be used directly in C or C++. The versatility and speed of GALARIO open new analysis pathways that otherwise would be prohibitively time consuming, e.g. fitting high-resolution observations of large numbers of objects, or entire spectral cubes of molecular gas emission. It is a general tool that can be applied to any field that uses radio interferometer observations. The source code is available online at http://github.com/mtazzari/galario under the open source GNU Lesser General Public License v3.

  1. The Spherical Tokamak MEDUSA for Mexico

    NASA Astrophysics Data System (ADS)

    Ribeiro, C.; Salvador, M.; Gonzalez, J.; Munoz, O.; Tapia, A.; Arredondo, V.; Chavez, R.; Nieto, A.; Gonzalez, J.; Garza, A.; Estrada, I.; Jasso, E.; Acosta, C.; Briones, C.; Cavazos, G.; Martinez, J.; Morones, J.; Almaguer, J.; Fonck, R.

    2011-10-01

    The former spherical tokamak MEDUSA (Madison EDUcation Small Aspect-ratio tokamak, R < 0.14m, a < 0.10m, BT < 0.5T, Ip < 40kA, 3ms pulse) is currently being recommissioned at the Universidad Autónoma de Nuevo León, Mexico, as part of an agreement between the Faculties of Mech.-Elect. Eng. and Phy. Sci.-Maths. The main objective for having MEDUSA is to train students in plasma physics and related technical issues, aiming at a full design of a medium size device (e.g. Tokamak-T). Details of technical modifications and a preliminary scientific programme will be presented. MEDUSA-MX will also benefit any developments in the existing Mexican Fusion Network. Strong liaison within national and international plasma physics communities is expected. New activities on plasma & engineering modeling are expected to be developed in parallel by using existing facilities such as a multi-platform computer (Silicon Graphics Altix XE250, 128G RAM, 3.7TB HD, 2.7GHz, quad-core processor), an ancillary graphics system (NVIDIA Quadro FE 2000/1GB GDDR-5 PCI X16 128, 3.2GHz), and the COMSOL Multiphysics and SolidWorks programs.

  2. Aho-Corasick String Matching on Shared and Distributed Memory Parallel Architectures

    SciTech Connect

    Tumeo, Antonino; Villa, Oreste; Chavarría-Miranda, Daniel

    String matching is at the core of many critical applications, including network intrusion detection systems, search engines, virus scanners, spam filters, DNA and protein sequencing, and data mining. For all of these applications string matching requires a combination of (sometimes all of) the following characteristics: high and/or predictable performance, support for large data sets, and flexibility of integration and customization. Many software-based implementations targeting conventional cache-based microprocessors fail to achieve high and predictable performance requirements, while Field-Programmable Gate Array (FPGA) implementations and dedicated hardware solutions fail to support large data sets (dictionary sizes) and are difficult to integrate and customize. The advent of multicore, multithreaded, and GPU-based systems is opening the possibility for software-based solutions to reach very high performance at a sustained rate. This paper compares several software-based implementations of the Aho-Corasick string searching algorithm for high performance systems. We discuss the implementation of the algorithm on several types of shared-memory high-performance architectures (Niagara 2, large x86 SMPs and Cray XMT), distributed memory with homogeneous processing elements (InfiniBand cluster of x86 multicores) and heterogeneous processing elements (InfiniBand cluster of x86 multicores with NVIDIA Tesla C10 GPUs). We describe in detail how each solution achieves the objectives of supporting large dictionaries, sustaining high performance, and enabling customization and flexibility using various data sets.

  3. On-line range images registration with GPGPU

    NASA Astrophysics Data System (ADS)

    Będkowski, J.; Naruniec, J.

    2013-03-01

    This paper concerns the implementation of algorithms for two important aspects of modern 3D data processing: data registration and segmentation. The solution proposed for the first topic is based on 3D space decomposition, while the latter is based on image processing and local neighbourhood search. Data processing is implemented using NVIDIA compute unified device architecture (NVIDIA CUDA) parallel computation. The result of the segmentation is a coloured map where different colours correspond to different objects, such as walls, floor and stairs. The research is related to the problem of collecting 3D data with an RGB-D camera mounted on a rotating head, to be used in mobile robot applications. The data registration algorithm is aimed at on-line processing. The iterative closest point (ICP) approach is chosen as the registration method. Computations are based on a parallel fast nearest neighbour search. This procedure decomposes 3D space into cubic buckets and, therefore, the time of the matching is deterministic. The first data segmentation technique uses accelerometers integrated with the RGB-D sensor to obtain rotation compensation, together with an image processing method for defining prerequisites of the known categories. The second technique uses the adapted nearest neighbour search procedure to obtain normal vectors for each range point.
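
    As an illustration of the cubic-bucket decomposition described above, the per-point bucket hash that makes the nearest-neighbour search time deterministic might look like the following CUDA sketch (assumptions: an axis-aligned bounding box and a uniform bucket edge length; a query then only scans the 27 buckets surrounding a point):

      // Map a 3D point to its linearized bucket id, clamping to the grid.
      __device__ int bucket_index(float x, float y, float z,
                                  float xmin, float ymin, float zmin,
                                  float edge, int nx, int ny, int nz)
      {
          int ix = min(max((int)((x - xmin) / edge), 0), nx - 1);
          int iy = min(max((int)((y - ymin) / edge), 0), ny - 1);
          int iz = min(max((int)((z - zmin) / edge), 0), nz - 1);
          return (iz * ny + iy) * nx + ix;   // z-major linearization
      }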

  4. Early Experiences Writing Performance Portable OpenMP 4 Codes

    SciTech Connect

    Joubert, Wayne; Hernandez, Oscar R

    In this paper, we evaluate the recently available directives in OpenMP 4 to parallelize a computational kernel using both the traditional shared memory approach and the newer accelerator targeting capabilities. In addition, we explore various transformations that attempt to increase application performance portability, and examine the expressiveness and performance implications of using these approaches. For example, we want to understand if the target map directives in OpenMP 4 improve data locality when mapped to a shared memory system, as opposed to the first-touch policy approach in traditional OpenMP. To that end, we use recent Cray and Intel compilers to measure the performance variations of a simple application kernel when executed on the OLCF's Titan supercomputer with NVIDIA GPUs and the Beacon system with Intel Xeon Phi accelerators attached. To better understand these trade-offs, we compare our results from traditional OpenMP shared memory implementations to the newer accelerator programming model when it is used to target both the CPU and an attached heterogeneous device. We believe the results and lessons learned as presented in this paper will be useful to the larger user community by providing guidelines that can assist programmers in the development of performance portable code.

  5. GPU-based streaming architectures for fast cone-beam CT image reconstruction and demons deformable registration.

    PubMed

    Sharp, G C; Kandasamy, N; Singh, H; Folkert, M

    2007-10-07

    This paper shows how to significantly accelerate cone-beam CT reconstruction and 3D deformable image registration using the stream-processing model. We describe data-parallel designs for the Feldkamp, Davis and Kress (FDK) reconstruction algorithm, and the demons deformable registration algorithm, suitable for use on a commodity graphics processing unit. The streaming versions of these algorithms are implemented using the Brook programming environment and executed on an NVidia 8800 GPU. Performance results using CT data of a preserved swine lung indicate that the GPU-based implementations of the FDK and demons algorithms achieve a substantial speedup--up to 80 times for FDK and 70 times for demons when compared to an optimized reference implementation on a 2.8 GHz Intel processor. In addition, the accuracy of the GPU-based implementations was found to be excellent. Compared with CPU-based implementations, the RMS differences were less than 0.1 Hounsfield unit for reconstruction and less than 0.1 mm for deformable registration.
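
    As a rough sketch of why these algorithms suit the stream-processing model: in backprojection, every voxel independently accumulates a contribution from each filtered projection, so voxels map naturally to parallel threads. The kernel below is a heavily simplified, parallel-beam caricature of that step written in CUDA for concreteness (the paper's implementation uses Brook, and real FDK cone-beam geometry adds distance weighting and interpolation):

      // One thread per (x, y) column of the volume; accumulate one
      // projection taken at angle A into every voxel of the column.
      __global__ void backproject(float* vol, const float* proj,
                                  int nx, int ny, int nz,
                                  int nu, int nv, float cosA, float sinA)
      {
          int ix = blockIdx.x * blockDim.x + threadIdx.x;
          int iy = blockIdx.y * blockDim.y + threadIdx.y;
          if (ix >= nx || iy >= ny) return;
          // rotate the voxel column into detector coordinates
          // (nearest-neighbour sampling for brevity)
          float u = (ix - nx / 2) * cosA + (iy - ny / 2) * sinA + nu / 2;
          int iu = (int)(u + 0.5f);
          if (iu < 0 || iu >= nu) return;
          for (int iz = 0; iz < nz && iz < nv; ++iz)
              vol[(iz * ny + iy) * nx + ix] += proj[iz * nu + iu];
      }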

  6. Power and Performance Trade-offs for Space Time Adaptive Processing

    SciTech Connect

    Gawande, Nitin A.; Manzano Franco, Joseph B.; Tumeo, Antonino

    Computational efficiency – performance relative to power or energy – is one of the most important concerns when designing RADAR processing systems. This paper analyzes power and performance trade-offs for a typical Space Time Adaptive Processing (STAP) application. We study STAP implementations for CUDA and OpenMP on two computationally efficient architectures, the Intel Haswell Core i7-4770TE and the NVIDIA Kayla with a GK208 GPU. We analyze the power and performance of STAP's computationally intensive kernels across the two hardware testbeds. We also show the impact and trade-offs of GPU optimization techniques. We show that data parallelism can be exploited for efficient implementation on the Haswell CPU architecture. The GPU architecture is able to process large data sets without an increase in power requirement. The use of shared memory has a significant impact on the power requirement for the GPU. A balance between the use of shared memory and main memory access leads to improved performance in a typical STAP application.

  7. Announcing Supercomputer Summit

    SciTech Connect

    Wells, Jack; Bland, Buddy; Nichols, Jeff

    Summit is the next leap in leadership-class computing systems for open science. With Summit we will be able to address, with greater complexity and higher fidelity, questions concerning who we are, our place on earth, and in our universe. Summit will deliver more than five times the computational performance of Titan's 18,688 nodes, using only approximately 3,400 nodes when it arrives in 2017. Like Titan, Summit will have a hybrid architecture, and each node will contain multiple IBM POWER9 CPUs and NVIDIA Volta GPUs all connected together with NVIDIA's high-speed NVLink. Each node will have over half a terabyte of coherent memory (high bandwidth memory + DDR4) addressable by all CPUs and GPUs plus 800GB of non-volatile RAM that can be used as a burst buffer or as extended memory. To provide a high rate of I/O throughput, the nodes will be connected in a non-blocking fat-tree using a dual-rail Mellanox EDR InfiniBand interconnect. Upon completion, Summit will allow researchers in all fields of science unprecedented access to solving some of the world's most pressing challenges.

  8. Kokkos GPU Compiler

    SciTech Connect

    Moss, Nicholas

    The Kokkos Clang compiler is a version of the Clang C++ compiler that has been modified to perform targeted code generation for Kokkos constructs, with the goal of generating highly optimized code and providing semantic (domain) awareness of these constructs, such as parallel for and parallel reduce, throughout the compilation toolchain. This approach is taken to explore the possibilities of exposing the developer's intentions to the underlying compiler infrastructure (e.g. optimization and analysis passes within the middle stages of the compiler) instead of relying solely on the restricted capabilities of C++ template metaprogramming. To date our activities have focused on correct GPU code generation, and thus we have not yet focused on improving overall performance. The compiler is implemented by recognizing specific (syntactic) Kokkos constructs in order to bypass normal template expansion mechanisms and instead use the semantic knowledge of Kokkos to directly generate code in the compiler's intermediate representation (IR), which is then translated into an NVIDIA-centric GPU program and supporting runtime calls. In addition, capturing and maintaining the higher-level semantics of Kokkos directly within the lower levels of the compiler has the potential to significantly improve the ability of the compiler to communicate with the developer in the terms of their original programming model/semantics.

  9. HPC enabled real-time remote processing of laparoscopic surgery

    NASA Astrophysics Data System (ADS)

    Ronaghi, Zahra; Sapra, Karan; Izard, Ryan; Duffy, Edward; Smith, Melissa C.; Wang, Kuang-Ching; Kwartowitz, David M.

    2016-03-01

    Laparoscopic surgery is a minimally invasive surgical technique. The benefit of small incisions comes with the disadvantage of limited visualization of subsurface tissues. Image-guided surgery (IGS) uses pre-operative and intra-operative images to map subsurface structures. One particular laparoscopic system is the daVinci-si robotic surgical system. Its video streams generate approximately 360 megabytes of data per second. Processing this large data stream in real time on a bedside PC, in a single- or dual-node setup, has become challenging, and a high-performance computing (HPC) environment may not always be available at the point of care. To process this data on remote HPC clusters at the typical 30 frames per second rate, each 11.9 MB video frame must be processed by a server and returned within 1/30th of a second. We have implemented and compared the performance of compression, segmentation and registration algorithms on Clemson's Palmetto supercomputer using dual NVIDIA K40 GPUs per node. Our computing framework will also enable reliability through replication of computation. We securely transfer the files to remote HPC clusters utilizing an OpenFlow-based network service, Steroid OpenFlow Service (SOS), which can increase the performance of large data transfers over long-distance, high-bandwidth networks. As a result, utilizing a high-speed OpenFlow-based network to access computing clusters with GPUs will improve surgical procedures by providing real-time medical image processing and laparoscopic data.
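
    As a consistency check on the stated figures: 11.9 MB per frame × 30 frames/s ≈ 357 MB/s, which matches the quoted ~360 MB/s stream; equivalently, each frame has a budget of 1/30 s ≈ 33 ms for the network round trip plus processing.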

  10. GPU-accelerated phase-field simulation of dendritic solidification in a binary alloy

    NASA Astrophysics Data System (ADS)

    Yamanaka, Akinori; Aoki, Takayuki; Ogawa, Satoi; Takaki, Tomohiro

    2011-03-01

    The phase-field simulation for dendritic solidification of a binary alloy has been accelerated by using a graphics processing unit (GPU). To perform the phase-field simulation of alloy solidification on the GPU, a program code was developed with the Compute Unified Device Architecture (CUDA). In this paper, the implementation technique of the phase-field model on the GPU is presented. Also, we evaluated the acceleration performance of the three-dimensional solidification simulation by using a single NVIDIA TESLA C1060 GPU and the developed program code. The results showed that the GPU calculation for 576³ computational grid points achieved a performance of 170 GFLOPS by utilizing the shared memory as a software-managed cache. Furthermore, it was demonstrated that the computation with the GPU is 100 times faster than that with a single CPU core. From the obtained results, we confirmed the feasibility of realizing a real-time full three-dimensional phase-field simulation of microstructure evolution on a personal desktop computer.
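
    The "shared memory as a software-managed cache" idea can be illustrated with a small CUDA sketch (a 2D, 5-point stencil on 16×16 tiles; the paper's phase-field solver is 3D and considerably more involved). Each block stages its tile plus a one-cell halo once, so that all neighbour reads hit shared memory instead of DRAM:

      #define TILE 16

      // Assumes nx and ny are multiples of TILE so no thread exits early.
      __global__ void stencil_step(const float* in, float* out, int nx, int ny)
      {
          __shared__ float s[TILE + 2][TILE + 2];
          int gx = blockIdx.x * TILE + threadIdx.x;
          int gy = blockIdx.y * TILE + threadIdx.y;
          int lx = threadIdx.x + 1, ly = threadIdx.y + 1;

          s[ly][lx] = in[gy * nx + gx];   // stage the interior cell
          // stage halo cells, clamped at the domain boundary
          if (threadIdx.x == 0)        s[ly][0]        = in[gy * nx + max(gx - 1, 0)];
          if (threadIdx.x == TILE - 1) s[ly][TILE + 1] = in[gy * nx + min(gx + 1, nx - 1)];
          if (threadIdx.y == 0)        s[0][lx]        = in[max(gy - 1, 0) * nx + gx];
          if (threadIdx.y == TILE - 1) s[TILE + 1][lx] = in[min(gy + 1, ny - 1) * nx + gx];
          __syncthreads();

          // 5-point Laplacian step as a stand-in for the phase-field update
          out[gy * nx + gx] = s[ly][lx]
              + 0.1f * (s[ly][lx - 1] + s[ly][lx + 1]
                      + s[ly - 1][lx] + s[ly + 1][lx] - 4.0f * s[ly][lx]);
      }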

  11. Global magnetohydrodynamic simulations on multiple GPUs

    NASA Astrophysics Data System (ADS)

    Wong, Un-Hong; Wong, Hon-Cheng; Ma, Yonghui

    2014-01-01

    Global magnetohydrodynamic (MHD) models play the major role in investigating the solar wind-magnetosphere interaction. However, the huge computation requirement in global MHD simulations is also the main problem that needs to be solved. With the recent development of modern graphics processing units (GPUs) and the Compute Unified Device Architecture (CUDA), it is possible to perform global MHD simulations in a more efficient manner. In this paper, we present a global magnetohydrodynamic (MHD) simulator on multiple GPUs using CUDA 4.0 with GPUDirect 2.0. Our implementation is based on the modified leapfrog scheme, which is a combination of the leapfrog scheme and the two-step Lax-Wendroff scheme. GPUDirect 2.0 is used in our implementation to drive multiple GPUs. All data transferring and kernel processing are managed with CUDA 4.0 API instead of using MPI or OpenMP. Performance measurements are made on a multi-GPU system with eight NVIDIA Tesla M2050 (Fermi architecture) graphics cards. These measurements show that our multi-GPU implementation achieves a peak performance of 97.36 GFLOPS in double precision.

  12. MAGI: a Node.js web service for fast microRNA-Seq analysis in a GPU infrastructure.

    PubMed

    Kim, Jihoon; Levy, Eric; Ferbrache, Alex; Stepanowsky, Petra; Farcas, Claudiu; Wang, Shuang; Brunner, Stefan; Bath, Tyler; Wu, Yuan; Ohno-Machado, Lucila

    2014-10-01

    MAGI is a web service for fast microRNA-Seq data analysis in a graphics processing unit (GPU) infrastructure. Using just a browser, users have access to results as web reports in just a few hours, a more than 600% end-to-end performance improvement over the state of the art. MAGI's salient features are (i) transfer of large input files in native FASTA with Qualities (FASTQ) format through drag-and-drop operations, (ii) rapid prediction of microRNA target genes leveraging parallel computing with GPU devices, (iii) all-in-one analytics with novel feature extraction, statistical tests for differential expression and diagnostic plot generation for quality control and (iv) interactive visualization and exploration of results in web reports that are readily available for publication. MAGI relies on the Node.js JavaScript framework, along with NVIDIA CUDA C, PHP: Hypertext Preprocessor (PHP), Perl and R. It is freely available at http://magi.ucsd.edu. © The Author 2014. Published by Oxford University Press.

  13. OCTGRAV: Sparse Octree Gravitational N-body Code on Graphics Processing Units

    NASA Astrophysics Data System (ADS)

    Gaburov, Evghenii; Bédorf, Jeroen; Portegies Zwart, Simon

    2010-10-01

    Octgrav is a very fast tree-code which runs on massively parallel Graphical Processing Units (GPU) with the NVIDIA CUDA architecture. The algorithms are based on parallel-scan and sort methods. The tree construction and calculation of multipole moments are carried out on the host CPU, while the force calculation, which consists of tree walks and evaluation of the interaction list, is carried out on the GPU. In this way, a sustained performance of about 100 GFLOP/s and data transfer rates of about 50 GB/s are achieved. It takes about a second to compute forces on a million particles with an opening angle of θ ≈ 0.5. To test the performance and feasibility, we implemented the algorithms in CUDA in the form of a gravitational tree-code which runs completely on the GPU. The tree construction and traverse algorithms are portable to many-core devices which have support for the CUDA or OpenCL programming languages. The gravitational tree-code outperforms tuned CPU code during the tree-construction and shows a performance improvement of more than a factor of 20 overall, resulting in a processing rate of more than 2.8 million particles per second. The code has a convenient user interface and is freely available for use.
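
    The opening angle θ ≈ 0.5 quoted above is the standard Barnes-Hut style multipole acceptance criterion: a tree cell of size l at distance d from the point of evaluation is treated as a single multipole when it subtends a small enough angle. A minimal sketch of the test a tree walk performs per cell:

      // Accept the cell's multipole expansion if l / d < theta;
      // otherwise the walk must open the cell and descend.
      __device__ bool cell_is_far_enough(float l, float d, float theta)
      {
          return l < theta * d;
      }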

  14. GPU accelerated edge-region based level set evolution constrained by 2D gray-scale histogram.

    PubMed

    Balla-Arabé, Souleymane; Gao, Xinbo; Wang, Bin

    2013-07-01

    Due to its intrinsic nature, which makes it easy to handle complex shapes and topological changes, the level set method (LSM) has been widely used in image segmentation. Nevertheless, the LSM is computationally expensive, which limits its applications in real-time systems. For this purpose, we propose a new level set algorithm, which uses edge, region, and 2D histogram information simultaneously in order to efficiently segment objects of interest in a given scene. The computational complexity of the proposed LSM is greatly reduced by using the highly parallelizable lattice Boltzmann method (LBM) with a body force to solve the level set equation (LSE). The body force is the link with the image data and is defined from the proposed LSE. The proposed LSM is then implemented on an NVIDIA graphics processing unit to fully take advantage of the LBM's local nature. The new algorithm is effective, robust against noise, independent of the initial contour, fast, and highly parallelizable. The edge and region information enable the detection of objects with and without edges, and the 2D histogram information makes the method effective in a noisy environment. Experimental results on synthetic and real images demonstrate subjectively and objectively the performance of the proposed method.

  15. Designing stereoscopic information visualization for 3D-TV: What can we learn from S3D gaming?

    NASA Astrophysics Data System (ADS)

    Schild, Jonas; Masuch, Maic

    2012-03-01

    This paper explores graphical design and spatial alignment of visual information and graphical elements into stereoscopically filmed content, e.g. captions, subtitles, and especially more complex elements in 3D-TV productions. The method used is a descriptive analysis of existing computer- and video games that have been adapted for stereoscopic display using semi-automatic rendering techniques (e.g. Nvidia 3D Vision) or games which have been specifically designed for stereoscopic vision. Digital games often feature compelling visual interfaces that combine high usability with creative visual design. We explore selected examples of game interfaces in stereoscopic vision regarding their stereoscopic characteristics, how they draw attention, how we judge effect and comfort and where the interfaces fail. As a result, we propose a list of five aspects which should be considered when designing stereoscopic visual information: explicit information, implicit information, spatial reference, drawing attention, and vertical alignment. We discuss possible consequences, opportunities and challenges for integrating visual information elements into 3D-TV content. This work shall further help to improve current editing systems and identifies a need for future editing systems for 3DTV, e.g., live editing and real-time alignment of visual information into 3D footage.

  16. A Distributed GPU-Based Framework for Real-Time 3D Volume Rendering of Large Astronomical Data Cubes

    NASA Astrophysics Data System (ADS)

    Hassan, A. H.; Fluke, C. J.; Barnes, D. G.

    2012-05-01

    We present a framework to volume-render three-dimensional data cubes interactively using distributed ray-casting and volume-bricking over a cluster of workstations powered by one or more graphics processing units (GPUs) and a multi-core central processing unit (CPU). The main design target for this framework is to provide an in-core visualization solution able to provide three-dimensional interactive views of terabyte-sized data cubes. We tested the presented framework using a computing cluster comprising 64 nodes with a total of 128 GPUs. The framework proved to be scalable to render a 204 GB data cube with an average of 30 frames per second. Our performance analyses also compare the use of NVIDIA Tesla 1060 and 2050 GPU architectures and the effect of increasing the visualization output resolution on the rendering performance. Although our initial focus, as shown in the examples presented in this work, is volume rendering of spectral data cubes from radio astronomy, we contend that our approach has applicability to other disciplines where close to real-time volume rendering of terabyte-order three-dimensional data sets is a requirement.

  17. Sub-second pencil beam dose calculation on GPU for adaptive proton therapy

    NASA Astrophysics Data System (ADS)

    da Silva, Joakim; Ansorge, Richard; Jena, Rajesh

    2015-06-01

    Although proton therapy delivered using scanned pencil beams has the potential to produce better dose conformity than conventional radiotherapy, the created dose distributions are more sensitive to anatomical changes and patient motion. Therefore, the introduction of adaptive treatment techniques where the dose can be monitored as it is being delivered is highly desirable. We present a GPU-based dose calculation engine relying on the widely used pencil beam algorithm, developed for on-line dose calculation. The calculation engine was implemented from scratch, with each step of the algorithm parallelized and adapted to run efficiently on the GPU architecture. To ensure fast calculation, it employs several application-specific modifications and simplifications, and a fast scatter-based implementation of the computationally expensive kernel superposition step. The calculation time for a skull base treatment plan using two beam directions was 0.22 s on an Nvidia Tesla K40 GPU, whereas a test case of a cubic target in water from the literature took 0.14 s to calculate. The accuracy of the patient dose distributions was assessed by calculating the γ-index with respect to a gold standard Monte Carlo simulation. The passing rates were 99.2% and 96.7%, respectively, for the 3%/3 mm and 2%/2 mm criteria, matching those produced by a clinical treatment planning system.

  18. Multi-Kepler GPU vs. multi-Intel MIC for spin systems simulations

    NASA Astrophysics Data System (ADS)

    Bernaschi, M.; Bisson, M.; Salvadore, F.

    2014-10-01

    We present and compare the performance of two many-core architectures, the Nvidia Kepler and the Intel MIC, both in a single system and in a cluster configuration, for the simulation of spin systems. As a benchmark we consider the time required to update a single spin of the 3D Heisenberg spin glass model by using the over-relaxation algorithm. We present data also for a traditional high-end multi-core architecture, the Intel Sandy Bridge. The results show that although on the two Intel architectures it is possible to use basically the same code, the performance of the Intel MIC changes dramatically depending on (apparently) minor details. Another issue is that to obtain reasonable scalability with the Intel Phi coprocessor (the Phi is the coprocessor that implements the MIC architecture) in a cluster configuration it is necessary to use the so-called offload mode, which reduces the performance of the single system. As for the GPU, the Kepler architecture offers a clear advantage with respect to the previous Fermi architecture while maintaining exactly the same source code. Scalability of the multi-GPU implementation remains very good when using the CPU as a communication co-processor for the GPU. All source codes are provided for inspection and for double-checking the results.

  19. NiftySim: A GPU-based nonlinear finite element package for simulation of soft tissue biomechanics.

    PubMed

    Johnsen, Stian F; Taylor, Zeike A; Clarkson, Matthew J; Hipwell, John; Modat, Marc; Eiben, Bjoern; Han, Lianghao; Hu, Yipeng; Mertzanidou, Thomy; Hawkes, David J; Ourselin, Sebastien

    2015-07-01

    NiftySim, an open-source finite element toolkit, has been designed to allow incorporation of high-performance soft tissue simulation capabilities into biomedical applications. The toolkit provides the option of execution on fast graphics processing unit (GPU) hardware, numerous constitutive models and solid-element options, membrane and shell elements, and contact modelling facilities, in a simple-to-use library. The toolkit is founded on the total Lagrangian explicit dynamics (TLED) algorithm, which has been shown to be efficient and accurate for simulation of soft tissues. The base code is written in C++, and GPU execution is achieved using the nVidia CUDA framework. In most cases, interaction with the underlying solvers can be achieved through a single Simulator class, which may be embedded directly in third-party applications such as surgical guidance systems. Advanced capabilities such as contact modelling and nonlinear constitutive models are also provided, as are more experimental technologies like reduced order modelling. A consistent description of the underlying solution algorithm, its implementation with a focus on GPU execution, and examples of the toolkit's usage in biomedical applications are provided. Efficient mapping of the TLED algorithm to parallel hardware results in very high computational performance, far exceeding that available in commercial packages. The NiftySim toolkit provides high-performance soft tissue simulation capabilities using GPU technology for biomechanical simulation research applications in medical image computing, surgical simulation, and surgical guidance applications.

  20. Gyrokinetic particle-in-cell optimization on emerging multi- and manycore platforms

    DOE PAGES

    Madduri, Kamesh; Im, Eun-Jin; Ibrahim, Khaled Z.; ...

    2011-03-02

    The next decade of high-performance computing (HPC) systems will see a rapid evolution and divergence of multi- and manycore architectures as power and cooling constraints limit increases in microprocessor clock speeds. Understanding efficient optimization methodologies on diverse multicore designs in the context of demanding numerical methods is one of the greatest challenges faced today by the HPC community. In this paper, we examine the efficient multicore optimization of GTC, a petascale gyrokinetic toroidal fusion code for studying plasma microturbulence in tokamak devices. For GTC's key computational components (charge deposition and particle push), we explore efficient parallelization strategies across a broad range of emerging multicore designs, including the recently-released Intel Nehalem-EX, the AMD Opteron Istanbul, and the highly multithreaded Sun UltraSparc T2+. We also present the first study on tuning gyrokinetic particle-in-cell (PIC) algorithms for graphics processors, using the NVIDIA C2050 (Fermi). Our work discusses several novel optimization approaches for gyrokinetic PIC, including mixed-precision computation, particle binning and decomposition strategies, grid replication, SIMDized atomic floating-point operations, and effective GPU texture memory utilization. Overall, we achieve significant performance improvements of 1.3–4.7× on these complex PIC kernels, despite the inherent challenges of data dependency and locality. Finally, our work also points to several architectural and programming features that could significantly enhance PIC performance and productivity on next-generation architectures.
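
    To see where the data dependency comes from, consider the scatter-add at the heart of PIC charge deposition: many particles update the same grid points concurrently. The CUDA sketch below shows the naive atomic form in 1D with linear weighting (GTC's gyrokinetic deposition is considerably more complex); the paper's binning, decomposition, and grid replication strategies all aim to reduce exactly this contention:

      // Deposit particle charge q onto a periodic 1D grid of ng points
      // with cloud-in-cell (linear) weights; x holds particle positions
      // in units where x >= 0, and dx is the grid spacing.
      __global__ void deposit(const float* x, float q, int np,
                              float dx, float* rho, int ng)
      {
          int p = blockIdx.x * blockDim.x + threadIdx.x;
          if (p >= np) return;
          float xi = x[p] / dx;
          int   i  = (int)xi;
          float w  = xi - i;   // fractional distance to the left grid point
          atomicAdd(&rho[i % ng],       q * (1.0f - w));
          atomicAdd(&rho[(i + 1) % ng], q * w);
      }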

  1. Exploiting graphics processing units for computational biology and bioinformatics.

    PubMed

    Payne, Joshua L; Sinnott-Armstrong, Nicholas A; Moore, Jason H

    2010-09-01

    Advances in the video gaming industry have led to the production of low-cost, high-performance graphics processing units (GPUs) that possess more memory bandwidth and computational capability than central processing units (CPUs), the standard workhorses of scientific computing. With the recent release of general-purpose GPUs and NVIDIA's GPU programming language, CUDA, graphics engines are being adopted widely in scientific computing applications, particularly in the fields of computational biology and bioinformatics. The goal of this article is to concisely present an introduction to GPU hardware and programming, aimed at the computational biologist or bioinformaticist. To this end, we discuss the primary differences between GPU and CPU architecture, introduce the basics of the CUDA programming language, and discuss important CUDA programming practices, such as the proper use of coalesced reads, data types, and memory hierarchies. We highlight each of these topics in the context of computing the all-pairs distance between instances in a dataset, a common procedure in numerous disciplines of scientific computing. We conclude with a runtime analysis of the GPU and CPU implementations of the all-pairs distance calculation. We show our final GPU implementation to outperform the CPU implementation by a factor of 1700.
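
    A minimal CUDA version of the article's running example, assigning one thread per (i, j) pair, might look like the sketch below (a tuned version, as the article discusses, would stage tiles of the dataset in shared memory so that global reads stay coalesced):

      #include <cuda_runtime.h>

      // Euclidean distance between every pair of the n d-dimensional
      // instances in data (row-major); dist is an n x n output matrix.
      __global__ void all_pairs(const float* data, float* dist, int n, int d)
      {
          int i = blockIdx.y * blockDim.y + threadIdx.y;
          int j = blockIdx.x * blockDim.x + threadIdx.x;
          if (i >= n || j >= n) return;
          float acc = 0.0f;
          for (int k = 0; k < d; ++k) {
              float diff = data[i * d + k] - data[j * d + k];
              acc += diff * diff;
          }
          dist[i * n + j] = sqrtf(acc);
      }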

  2. A deep learning method for early screening of lung cancer

    NASA Astrophysics Data System (ADS)

    Zhang, Kunpeng; Jiang, Huiqin; Ma, Ling; Gao, Jianbo; Yang, Xiaopeng

    2018-04-01

    Lung cancer is the leading cause of cancer-related deaths among men. In this paper, we propose a pulmonary nodule detection method for early screening of lung cancer based on an improved AlexNet model. In order to maintain the same image quality as the existing B/S architecture PACS system, we first convert the original CT image into a JPEG image by analyzing the DICOM file. Second, in view of the large size and complex background of chest CT images, we design the convolutional neural network on the basis of the AlexNet model and a sparse convolution structure. Finally, we train our models with the DIGITS software provided by NVIDIA. The main contribution of this paper is to apply a convolutional neural network to the early screening of lung cancer and to improve the screening accuracy by combining the AlexNet model with the sparse convolution structure. We carried out a series of experiments on chest CT images using the proposed method; the resulting sensitivity and specificity indicate that the method presented in this paper can effectively improve the accuracy of early screening of lung cancer, and it has certain clinical significance at the same time.

  3. Performance evaluation for volumetric segmentation of multiple sclerosis lesions using MATLAB and computing engine in the graphical processing unit (GPU)

    NASA Astrophysics Data System (ADS)

    Le, Anh H.; Park, Young W.; Ma, Kevin; Jacobs, Colin; Liu, Brent J.

    2010-03-01

    Multiple Sclerosis (MS) is a progressive neurological disease affecting myelin pathways in the brain. Multiple lesions in the white matter can cause paralysis and severe motor disabilities of the affected patient. To solve the issue of inconsistency and user-dependency in manual lesion measurement of MRI, we have proposed a 3-D automated lesion quantification algorithm to enable objective and efficient lesion volume tracking. The computer-aided detection (CAD) of MS, written in MATLAB, utilizes K-Nearest Neighbors (KNN) method to compute the probability of lesions on a per-voxel basis. Despite the highly optimized algorithm of imaging processing that is used in CAD development, MS CAD integration and evaluation in clinical workflow is technically challenging due to the requirement of high computation rates and memory bandwidth in the recursive nature of the algorithm. In this paper, we present the development and evaluation of using a computing engine in the graphical processing unit (GPU) with MATLAB for segmentation of MS lesions. The paper investigates the utilization of a high-end GPU for parallel computing of KNN in the MATLAB environment to improve algorithm performance. The integration is accomplished using NVIDIA's CUDA developmental toolkit for MATLAB. The results of this study will validate the practicality and effectiveness of the prototype MS CAD in a clinical setting. The GPU method may allow MS CAD to rapidly integrate in an electronic patient record or any disease-centric health care system.

  4. Accelerated speckle imaging with the ATST visible broadband imager

    NASA Astrophysics Data System (ADS)

    Wöger, Friedrich; Ferayorni, Andrew

    2012-09-01

    The Advanced Technology Solar Telescope (ATST), a 4 meter class telescope for observations of the solar atmosphere currently in construction phase, will generate data at rates of the order of 10 TB/day with its state of the art instrumentation. The high-priority ATST Visible Broadband Imager (VBI) instrument alone will create two data streams with a bandwidth of 960 MB/s each. Because of the related data handling issues, these data will be post-processed with speckle interferometry algorithms in near-real time at the telescope using the cost-effective Graphics Processing Unit (GPU) technology that is supported by the ATST Data Handling System. In this contribution, we lay out the VBI-specific approach to its image processing pipeline, put this into the context of the underlying ATST Data Handling System infrastructure, and finally describe the details of how the algorithms were redesigned to exploit data parallelism in the speckle image reconstruction algorithms. An algorithm re-design is often required to efficiently speed up an application using GPU technology; we have chosen NVIDIA's CUDA language as basis for our implementation. We present our preliminary results of the algorithm performance using our test facilities, and base a conservative estimate on the requirements of a full system that could achieve near real-time performance at ATST on these results.

  5. Fast simulation of Proton Induced X-Ray Emission Tomography using CUDA

    NASA Astrophysics Data System (ADS)

    Beasley, D. G.; Marques, A. C.; Alves, L. C.; da Silva, R. C.

    2013-07-01

    A new 3D Proton Induced X-Ray Emission Tomography (PIXE-T) and Scanning Transmission Ion Microscopy Tomography (STIM-T) simulation software has been developed in Java; it uses the NVIDIA™ Compute Unified Device Architecture (CUDA) to calculate the X-ray attenuation for large detector areas. A challenge with PIXE-T is to get sufficient counts while retaining a small beam spot size; therefore a high geometric efficiency is required. However, as the detector solid angle increases, the calculations required for accurate reconstruction of the data increase substantially. To overcome this limitation, the CUDA parallel computing platform was used, which enables general purpose programming of NVIDIA graphics processing units (GPUs) to perform computations traditionally handled by the central processing unit (CPU). For simulation performance evaluation, the results of a CPU- and a CUDA-based simulation of a phantom are presented. Furthermore, a comparison with the simulation code in the PIXE-Tomography reconstruction software DISRA (A. Sakellariou, D.N. Jamieson, G.J.F. Legge, 2001) is also shown. Compared to a CPU implementation, the CUDA-based simulation is approximately 30× faster.

  6. Optimizing legacy molecular dynamics software with directive-based offload

    DOE PAGES

    Michael Brown, W.; Carrillo, Jan-Michael Y.; Gavhane, Nitin; ...

    2015-05-14

    Directive-based programming models are one solution for exploiting many-core coprocessors to increase simulation rates in molecular dynamics. They offer the potential to reduce code complexity with offload models that can selectively target computations to run on the CPU, the coprocessor, or both. In this paper, we describe modifications to the LAMMPS molecular dynamics code to enable concurrent calculations on a CPU and coprocessor. We also demonstrate that standard molecular dynamics algorithms can run efficiently on both the CPU and an x86-based coprocessor using the same subroutines. As a consequence, we demonstrate that code optimizations for the coprocessor also result in speedups on the CPU; in extreme cases up to 4.7X. We provide results for LAMMPS benchmarks and for production molecular dynamics simulations using the Stampede hybrid supercomputer with both Intel (R) Xeon Phi (TM) coprocessors and NVIDIA GPUs. The optimizations presented have increased simulation rates by over 2X for organic molecules and over 7X for liquid crystals on Stampede. The optimizations are available as part of the "Intel package" supplied with LAMMPS. (C) 2015 Elsevier B.V. All rights reserved.

  7. Parallel peak pruning for scalable SMP contour tree computation

    SciTech Connect

    Carr, Hamish A.; Weber, Gunther H.; Sewell, Christopher M.

    As data sets grow to exascale, automated data analysis and visualisation are increasingly important, to intermediate human understanding and to reduce demands on disk storage via in situ analysis. Trends in the architecture of high performance computing systems necessitate analysis algorithms that make effective use of combinations of massively multicore and distributed systems. One of the principal analytic tools is the contour tree, which analyses relationships between contours to identify features of more than local importance. Unfortunately, the predominant algorithms for computing the contour tree are explicitly serial, and founded on serial metaphors, which has limited the scalability of this form of analysis. While there is some work on distributed contour tree computation, and separately on hybrid GPU-CPU computation, there is no efficient algorithm with strong formal guarantees on performance allied with fast practical performance. In this paper, we report the first shared-memory SMP algorithm for fully parallel contour tree computation, with formal guarantees of O(lg n lg t) parallel steps and O(n lg n) work, and implementations with up to 10x parallel speedup in OpenMP and up to 50x speedup with NVIDIA Thrust.

  8. Understanding Portability of a High-Level Programming Model on Contemporary Heterogeneous Architectures

    DOE PAGES

    Sabne, Amit J.; Sakdhnagool, Putt; Lee, Seyong; ...

    2015-07-13

    Accelerator-based heterogeneous computing is gaining momentum in the high-performance computing arena. However, the increased complexity of heterogeneous architectures demands more generic, high-level programming models. OpenACC is one such attempt to tackle this problem. Although the abstraction provided by OpenACC offers productivity, it raises questions concerning both functional and performance portability. In this article, the authors propose HeteroIR, a high-level, architecture-independent intermediate representation, to map high-level programming models, such as OpenACC, to heterogeneous architectures. They present a compiler approach that translates OpenACC programs into HeteroIR and accelerator kernels to obtain OpenACC functional portability. They then evaluate the performance portability obtained by OpenACC with their approach on 12 OpenACC programs on Nvidia CUDA, AMD GCN, and Intel Xeon Phi architectures. They study the effects of various compiler optimizations and OpenACC program settings on these architectures to provide insights into the achieved performance portability.

  9. A portable platform for accelerated PIC codes and its application to GPUs using OpenACC

    NASA Astrophysics Data System (ADS)

    Hariri, F.; Tran, T. M.; Jocksch, A.; Lanti, E.; Progsch, J.; Messmer, P.; Brunner, S.; Gheller, C.; Villard, L.

    2016-10-01

    We present a portable platform, called PIC_ENGINE, for accelerating Particle-In-Cell (PIC) codes on heterogeneous many-core architectures such as Graphics Processing Units (GPUs). The aim of this development is efficient simulation on future exascale systems by allowing different parallelization strategies depending on the application problem and the specific architecture. To this end, the platform contains the basic steps of the PIC algorithm and has been designed as a test bed for different algorithmic options and data structures. Among the architectures that this engine can explore, particular attention is given here to systems equipped with GPUs. The study demonstrates that our portable PIC implementation based on the OpenACC programming model can achieve performance closely matching theoretical predictions. Using the Cray XC30 system, Piz Daint, at the Swiss National Supercomputing Centre (CSCS), we show that PIC_ENGINE running on an NVIDIA Kepler K20X GPU can outperform the one on an Intel Sandy Bridge 8-core CPU by a factor of 3.4.

  10. (Re)engineering Earth System Models to Expose Greater Concurrency for Ultrascale Computing: Practice, Experience, and Musings

    NASA Astrophysics Data System (ADS)

    Mills, R. T.

    2014-12-01

    As the high performance computing (HPC) community pushes towards the exascale horizon, the importance and prevalence of fine-grained parallelism in new computer architectures is increasing. This is perhaps most apparent in the proliferation of so-called "accelerators" such as the Intel Xeon Phi or NVIDIA GPGPUs, but the trend also holds for CPUs, where serial performance has grown slowly and effective use of hardware threads and vector units are becoming increasingly important to realizing high performance. This has significant implications for weather, climate, and Earth system modeling codes, many of which display impressive scalability across MPI ranks but take relatively little advantage of threading and vector processing. In addition to increasing parallelism, next generation codes will also need to address increasingly deep hierarchies for data movement: NUMA/cache levels, on node vs. off node, local vs. wide neighborhoods on the interconnect, and even in the I/O system. We will discuss some approaches (grounded in experiences with the Intel Xeon Phi architecture) for restructuring Earth science codes to maximize concurrency across multiple levels (vectors, threads, MPI ranks), and also discuss some novel approaches for minimizing expensive data movement/communication.

  11. Programming standards for effective S-3D game development

    NASA Astrophysics Data System (ADS)

    Schneider, Neil; Matveev, Alexander

    2008-02-01

    When a video game is in development, more often than not it is being rendered in three dimensions - complete with volumetric depth. It's the PC monitor that is taking this three-dimensional information, and artificially displaying it in a flat, two-dimensional format. Stereoscopic drivers take the three-dimensional information captured from DirectX and OpenGL calls and properly display it with a unique left and right sided view for each eye so a proper stereoscopic 3D image can be seen by the gamer. The two-dimensional limitation of how information is displayed on screen has encouraged programming short-cuts and work-arounds that stifle this stereoscopic 3D effect, and the purpose of this guide is to outline techniques to get the best of both worlds. While the programming requirements do not significantly add to the game development time, following these guidelines will greatly enhance your customer's stereoscopic 3D experience, increase your likelihood of earning Meant to be Seen certification, and give you instant cost-free access to the industry's most valued consumer base. While this outline is mostly based on NVIDIA's programming guide and iZ3D resources, it is designed to work with all stereoscopic 3D hardware solutions and is not proprietary in any way.

  12. Particle-in-cell simulations with charge-conserving current deposition on graphic processing units

    NASA Astrophysics Data System (ADS)

    Ren, Chuang; Kong, Xianglong; Huang, Michael; Decyk, Viktor; Mori, Warren

    2011-10-01

    Recently, using CUDA, we have developed an electromagnetic Particle-in-Cell (PIC) code with charge-conserving current deposition for Nvidia graphics processing units (GPUs) (Kong et al., Journal of Computational Physics 230, 1676 (2011)). On a Tesla M2050 (Fermi) card, the GPU PIC code can achieve a one-particle-step process time of 1.2 - 3.2 ns in 2D and 2.3 - 7.2 ns in 3D, depending on plasma temperatures. In this talk we will discuss novel algorithms for GPU-PIC, including a charge-conserving current deposition scheme with minimal branching and parallel particle sorting. These algorithms make efficient use of the GPU shared memory. We will also discuss how to replace the computation kernels of existing parallel CPU codes while keeping their parallel structures. This work was supported by the U.S. Department of Energy under Grant Nos. DE-FG02-06ER54879 and DE-FC02-04ER54789 and by NSF under Grant Nos. PHY-0903797 and CCF-0747324.
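
    A heavily simplified CUDA sketch of the scatter step in current deposition is shown below, using global atomics to resolve write conflicts; the cited code instead sorts particles by cell and accumulates in shared memory, so this illustrates only the data hazard that such schemes must manage, not their algorithm.

      // Simplified sketch: each thread adds one particle's linear-weighted
      // contribution to the current grid, with atomicAdd resolving conflicts.
      #include <cuda_runtime.h>

      __global__ void deposit_current(const float* x, const float* v, float* J,
                                      int np, int nx, float dx, float q)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i >= np) return;

          float xi = x[i] / dx;
          int   c  = min(max((int)xi, 0), nx - 2);
          float w  = xi - c;

          // Scatter q*v to the two nearest grid points (linear weighting).
          atomicAdd(&J[c],     q * v[i] * (1.0f - w));
          atomicAdd(&J[c + 1], q * v[i] * w);
      }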

  13. Development of a GPU-Accelerated 3-D Full-Wave Code for Electromagnetic Wave Propagation in a Cold Plasma

    NASA Astrophysics Data System (ADS)

    Woodbury, D.; Kubota, S.; Johnson, I.

    2014-10-01

    Computer simulations of electromagnetic wave propagation in magnetized plasmas are an important tool for both plasma heating and diagnostics. For active millimeter-wave and microwave diagnostics, accurately modeling the evolution of the beam parameters for launched, reflected, or scattered waves in a toroidal plasma requires that calculations be done using the full 3-D geometry. Previously, we reported on the application of GPGPU (General-Purpose computing on Graphics Processing Units) to a 3-D vacuum Maxwell code using the FDTD (Finite-Difference Time-Domain) method. Tests were done for Gaussian beam propagation with a hard source antenna, utilizing the parallel processing capabilities of the NVIDIA K20M GPU. In the current study, we have modified the 3-D code to include a soft source antenna and an induced current density based on the cold plasma approximation. Results from Gaussian beam propagation in an inhomogeneous anisotropic plasma, along with comparisons to ray- and beam-tracing calculations, will be presented. Additional enhancements, such as advanced coding techniques for improved speedup, will also be investigated. Supported by U.S. DoE Grant DE-FG02-99-ER54527 and in part by the U.S. DoE, Office of Science, WDTS under the Science Undergraduate Laboratory Internship program.
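
    A minimal 1-D sketch of how a cold-plasma induced current can enter an FDTD update is given below, assuming the linearized cold-plasma relation dJ/dt = eps0 * wp^2 * E (wp being the plasma frequency); the array names and the simple explicit coupling are illustrative, not the cited code.

      // 1-D FDTD electric-field update with a cold-plasma current term.
      #include <cuda_runtime.h>

      __global__ void update_E_and_J(float* Ex, float* Jx, const float* Hy,
                                     int nx, float dt, float dz,
                                     float eps0, float wp2)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i <= 0 || i >= nx) return;

          // Advance the plasma current with the present electric field.
          Jx[i] += eps0 * wp2 * dt * Ex[i];

          // Ampere's law with the induced current as an extra source term.
          Ex[i] += (dt / eps0) * ((Hy[i] - Hy[i - 1]) / dz - Jx[i]);
      }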

  14. Efficient Scalable Median Filtering Using Histogram-Based Operations.

    PubMed

    Green, Oded

    2018-05-01

    Median filtering is a smoothing technique for noise removal in images. While there are various implementations of median filtering for a single-core CPU, there are few implementations for accelerators and multi-core systems. Many parallel implementations of median filtering use a sorting algorithm to rearrange the values within the filtering window and take the median of the sorted values. While using sorting algorithms allows for simple parallel implementations, the cost of the sorting becomes prohibitive as the filtering window grows, making such algorithms, sequential and parallel alike, inefficient. In this work, we introduce the first software parallel median filter that is not based on sorting. The new algorithm uses efficient histogram-based operations, which reduce its computational requirements while also accessing the image fewer times. We show an implementation of our algorithm for both the CPU and NVIDIA's CUDA-supported graphics processing unit (GPU). The new algorithm is compared with several other leading CPU and GPU implementations. The CPU implementation shows near-perfect linear scaling on a quad-core system. The GPU implementation is several orders of magnitude faster than the other GPU implementations for mid-size median filters. For small kernels, comparison-based approaches are preferable, as fewer operations are required. Lastly, the new algorithm is open-source and can be found in the OpenCV library.
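
    The core histogram trick can be sketched as a per-pixel CUDA kernel: build a 256-bin histogram of the window (for 8-bit data) and scan it to the median rank. Green's algorithm gains its efficiency by reusing histograms between neighboring pixels; the brute-force version below only illustrates the sort-free principle, and all names are illustrative.

      // Brute-force per-pixel histogram median for 8-bit images.
      #include <cuda_runtime.h>

      __global__ void median_filter(const unsigned char* in, unsigned char* out,
                                    int w, int h, int r)
      {
          int x = blockIdx.x * blockDim.x + threadIdx.x;
          int y = blockIdx.y * blockDim.y + threadIdx.y;
          if (x < r || y < r || x >= w - r || y >= h - r) return;

          int hist[256] = {0};
          for (int dy = -r; dy <= r; ++dy)
              for (int dx = -r; dx <= r; ++dx)
                  ++hist[in[(y + dy) * w + (x + dx)]];

          // Walk the histogram until half the window population is passed.
          int half = ((2 * r + 1) * (2 * r + 1)) / 2, sum = 0, m = 0;
          while (sum <= half) sum += hist[m++];
          out[y * w + x] = (unsigned char)(m - 1);
      }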

  15. Scaling deep learning on GPU and knights landing clusters

    SciTech Connect

    You, Yang; Buluc, Aydin; Demmel, James

    Training neural networks has become a big bottleneck. For example, training the ImageNet dataset on one Nvidia K20 GPU takes 21 days. To speed up the training process, current deep learning systems rely heavily on hardware accelerators. However, these accelerators have limited on-chip memory compared with CPUs. We use both self-hosted Intel Knights Landing (KNL) clusters and multi-GPU clusters as our target platforms. From the algorithm aspect, we focus on Elastic Averaging SGD (EASGD) to design algorithms for HPC clusters. We redesign four efficient algorithms for HPC systems to improve EASGD's poor scaling on clusters. Async EASGD, Async MEASGD, and Hogwild EASGD are faster than their existing counterpart methods (Async SGD, Async MSGD, and Hogwild SGD) in all comparisons. Sync EASGD achieves a 5.3X speedup over the original EASGD on the same platform. We achieve 91.5% weak scaling efficiency on 4253 KNL cores, which is higher than the state-of-the-art implementation.
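
    For reference, the elastic averaging update that all four variants build on can be sketched in CUDA as below; eta and rho are the learning rate and elastic coefficient, the single-worker form is shown, and the names are illustrative rather than the paper's implementation.

      // Worker weights take a gradient step plus an elastic pull toward the
      // shared center variable; the center is pulled symmetrically back.
      #include <cuda_runtime.h>

      __global__ void easgd_step(float* w_local, float* w_center,
                                 const float* grad, int n, float eta, float rho)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i >= n) return;

          float diff = w_local[i] - w_center[i];
          w_local[i] -= eta * (grad[i] + rho * diff);   // worker update
          atomicAdd(&w_center[i], eta * rho * diff);    // center update
      }

    The Sync/Async/Hogwild variants differ mainly in when and how this exchange with the center variable is scheduled across workers.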

  16. GPU-based Green’s function simulations of shear waves generated by an applied acoustic radiation force in elastic and viscoelastic models

    NASA Astrophysics Data System (ADS)

    Yang, Yiqun; Urban, Matthew W.; McGough, Robert J.

    2018-05-01

    Shear wave calculations induced by an acoustic radiation force are very time-consuming on desktop computers, and high-performance graphics processing units (GPUs) achieve dramatic reductions in the computation time for these simulations. The acoustic radiation force is calculated using the fast near field method and the angular spectrum approach, and the shear waves are then calculated in parallel with Green’s functions on a GPU. This combination enables rapid evaluation of shear waves for push beams with different spatial samplings and for apertures with different f/#. Relative to shear wave simulations that evaluate the same algorithm on an Intel i7 desktop computer, a high-performance nVidia GPU reduces the time required for these calculations by factors of 45 and 700 for elastic and viscoelastic shear wave simulation models, respectively. These GPU-accelerated simulations were also compared to measurements in different viscoelastic phantoms, and the results are similar. For parametric evaluations and for comparisons with measured shear wave data, shear wave simulations with the Green’s function approach are ideally suited for high-performance GPUs.
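
    The parallelization pattern described - independent field points, each summing Green’s-function contributions from all source points - might look like the following CUDA sketch; the placeholder kernel g() stands in for the elastic or viscoelastic Green’s function, and all names are assumptions.

      // One thread per field point: sum contributions from every source.
      #include <cuda_runtime.h>
      #include <math.h>

      __device__ float g(float r, float t, float ct)
      {
          float tau = t - r / ct;  // retarded time for shear speed ct
          return (tau > 0.0f) ? expf(-tau * tau) / fmaxf(r, 1e-6f) : 0.0f;
      }

      __global__ void shear_response(const float3* src, const float* amp, int ns,
                                     const float3* obs, float* u, int nobs,
                                     float t, float ct)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i >= nobs) return;

          float sum = 0.0f;
          for (int s = 0; s < ns; ++s) {
              float dx = obs[i].x - src[s].x;
              float dy = obs[i].y - src[s].y;
              float dz = obs[i].z - src[s].z;
              float r  = sqrtf(dx * dx + dy * dy + dz * dz);
              sum += amp[s] * g(r, t, ct);
          }
          u[i] = sum;
      }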

  17. Hybrid parallel computing architecture for multiview phase shifting

    NASA Astrophysics Data System (ADS)

    Zhong, Kai; Li, Zhongwei; Zhou, Xiaohui; Shi, Yusheng; Wang, Congjun

    2014-11-01

    The multiview phase-shifting method shows its powerful capability in achieving high-resolution three-dimensional (3-D) shape measurement. Unfortunately, this ability comes at very high computational cost, and the 3-D computations have had to be processed offline. To realize real-time 3-D shape measurement, a hybrid parallel computing architecture is proposed for multiview phase shifting. In this architecture, the central processing unit cooperates with the graphics processing unit (GPU) to achieve hybrid parallel computing. The computationally expensive procedures, including lens distortion rectification, phase computation, correspondence, and 3-D reconstruction, are implemented on the GPU, and a three-layer kernel function model is designed to realize coarse-grained and fine-grained parallel computing simultaneously. Experimental results verify that the developed system can perform real-time 3-D measurement at 50 fps (frames per second) with 260 K 3-D points per frame. A speedup of up to 180 times is obtained with the proposed technique running on an NVIDIA GT560Ti graphics card rather than as sequential C code on a 3.4 GHz Intel Core i7 3770.
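
    Of the listed GPU stages, phase computation is the most naturally fine-grained, since every pixel is independent. Below is a minimal CUDA sketch of the standard four-step phase-shifting formula; buffer names are illustrative, and the abstract does not specify which N-step variant the system actually uses.

      // One thread per pixel: wrapped phase from four fringe images with
      // phase shifts of 0, pi/2, pi, 3pi/2 (standard four-step formula).
      #include <cuda_runtime.h>
      #include <math.h>

      __global__ void wrapped_phase(const float* I0, const float* I1,
                                    const float* I2, const float* I3,
                                    float* phase, int npix)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i >= npix) return;
          phase[i] = atan2f(I3[i] - I1[i], I0[i] - I2[i]);
      }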

  18. Problems Related to Parallelization of CFD Algorithms on GPU, Multi-GPU and Hybrid Architectures

    NASA Astrophysics Data System (ADS)

    Błażewicz, Marek; Kurowski, Krzysztof; Ludwiczak, Bogdan; Napierała, Krystyna

    2010-09-01

    Computational Fluid Dynamics (CFD) is one of the branches of fluid mechanics which uses numerical methods and algorithms to solve and analyze fluid flows. CFD is used in various domains, such as oil and gas reservoir uncertainty analysis, aerodynamic shape optimization (e.g. planes, cars, ships, sport helmets, skis), natural phenomena analysis, numerical simulation for weather forecasting, and realistic visualizations. CFD problems are very complex and need a lot of computational power to obtain results in a reasonable time. We have implemented a parallel application for two-dimensional CFD simulation with a free-surface approximation (the MAC method) using new hardware architectures, in particular multi-GPU and hybrid computing environments. For this purpose we decided to use NVIDIA graphics cards with the CUDA environment due to its simplicity of programming and good computational performance. We used a finite difference discretization of the Navier-Stokes equations, where the fluid is propagated over an Eulerian grid. In this model, the behavior of the fluid inside a cell depends only on the properties of the local, surrounding cells, so it is well suited to a GPU-based architecture. In this paper we demonstrate how to use the computing power of GPUs efficiently for CFD. Additionally, we present some best practices to help users analyze and improve the performance of CFD applications executed on GPUs. Finally, we discuss various challenges around the multi-GPU implementation using the example of matrix multiplication.
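
    The locality argument - each cell update reads only its immediate neighbors - is what makes such schemes map well to one-thread-per-cell GPU execution. A generic five-point diffusion step in CUDA illustrates the access pattern; this is a sketch of the stencil idea, not the paper's MAC solver.

      // Explicit diffusion step on a 2-D grid: each cell reads only its
      // four neighbors, so neighboring threads share cached memory traffic.
      #include <cuda_runtime.h>

      __global__ void diffuse(const float* u, float* u_new, int nx, int ny,
                              float nu, float dt, float h)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          int j = blockIdx.y * blockDim.y + threadIdx.y;
          if (i <= 0 || j <= 0 || i >= nx - 1 || j >= ny - 1) return;

          int c = j * nx + i;
          float lap = (u[c - 1] + u[c + 1] + u[c - nx] + u[c + nx]
                       - 4.0f * u[c]) / (h * h);
          u_new[c] = u[c] + nu * dt * lap;
      }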

  19. SciTech Connect

    Beltran, C; Kamal, H

    Purpose: To provide a multicriteria optimization algorithm for intensity modulated radiation therapy using pencil proton beam scanning. Methods: Intensity modulated radiation therapy using pencil proton beam scanning requires efficient optimization algorithms to overcome the uncertainties in Bragg peak locations. This work is focused on optimization algorithms that are based on Monte Carlo simulation of the treatment planning and use the weights and the dose volume histogram (DVH) control points to steer toward desired plans. The proton beam treatment planning process based on single-objective optimization (representing a weighted sum of multiple objectives) usually leads to time-consuming iterations involving treatment planning team members. We provide a time-efficient multicriteria optimization algorithm developed to run on an NVIDIA GPU (Graphics Processing Unit) cluster. The running time of the multicriteria optimization algorithm benefits from up-sampling of the CT voxel size in the calculations without loss of fidelity. Results: We will present preliminary results of multicriteria optimization for intensity modulated proton therapy based on DVH control points, showing optimization of a phantom case and a brain tumor case. Conclusion: The multicriteria optimization of intensity modulated radiation therapy using pencil proton beam scanning provides a novel tool for treatment planning. Work supported by a grant from Varian Inc.
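
    For context on the DVH control points that steer the optimization, a minimal CUDA sketch of evaluating one such point on the GPU is given below: count the fraction of a structure's voxels at or above a dose threshold. All names are hypothetical; the abstract does not describe this implementation detail.

      // Count the voxels of one structure at or above a dose threshold;
      // the DVH point is count / nvox, evaluated on the host afterwards.
      #include <cuda_runtime.h>

      __global__ void dvh_point(const float* dose, const int* voxels, int nvox,
                                float threshold, unsigned int* count)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i >= nvox) return;
          if (dose[voxels[i]] >= threshold)
              atomicAdd(count, 1u);
      }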

  20. Multigroup Monte Carlo on GPUs: Comparison of history- and event-based algorithms

    DOE PAGES

    Hamilton, Steven P.; Slattery, Stuart R.; Evans, Thomas M.

    2017-12-22

    This article presents an investigation of the performance of different multigroup Monte Carlo transport algorithms on GPUs, with a discussion of both history-based and event-based approaches. Several algorithmic improvements are introduced for both approaches. By modifying the history-based algorithm that is traditionally favored in CPU-based MC codes to occasionally filter out dead particles, reducing thread divergence, performance exceeds that of either the pure history-based or event-based approach. The impacts of several algorithmic choices are discussed, including performance studies on Kepler- and Pascal-generation NVIDIA GPUs for fixed-source and eigenvalue calculations. Single-device performance equivalent to 20–40 CPU cores on the K40 GPU and 60–80 CPU cores on the P100 GPU is achieved. In addition, nearly perfect multi-device parallel weak scaling is demonstrated on more than 16,000 nodes of the Titan supercomputer.
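
    A minimal sketch of the dead-particle filtering idea follows, assuming a Thrust-style compaction every few transport steps; the Particle layout and the choice of when to filter are illustrative assumptions, not the paper's actual scheme.

      // Every few transport steps, compact live histories to the front of
      // the particle array so warps stop carrying finished particles.
      #include <thrust/device_vector.h>
      #include <thrust/partition.h>

      struct Particle { float x, y, z, energy; bool alive; };

      struct IsAlive {
          __host__ __device__ bool operator()(const Particle& p) const {
              return p.alive;
          }
      };

      void filter_dead(thrust::device_vector<Particle>& particles, int& n_active)
      {
          // Stable partition keeps live particles, in order, at the front;
          // only the first n_active entries feed the next transport kernel.
          auto mid = thrust::stable_partition(particles.begin(),
                                              particles.begin() + n_active,
                                              IsAlive());
          n_active = static_cast<int>(mid - particles.begin());
      }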