xeon cpu processor: Topics by Science.gov

Sample records for xeon cpu processor

Exact diagonalization of quantum lattice models on coprocessors

NASA Astrophysics Data System (ADS)

Siro, T.; Harju, A.

2016-10-01

We implement the Lanczos algorithm on an Intel Xeon Phi coprocessor and compare its performance to a multi-core Intel Xeon CPU and an NVIDIA graphics processor. The Xeon and the Xeon Phi are parallelized with OpenMP and the graphics processor is programmed with CUDA. The performance is evaluated by measuring the execution time of a single step in the Lanczos algorithm. We study two quantum lattice models with different particle numbers, and conclude that for small systems, the multi-core CPU is the fastest platform, while for large systems, the graphics processor is the clear winner, reaching speedups of up to 7.6 compared to the CPU. The Xeon Phi outperforms the CPU with sufficiently large particle number, reaching a speedup of 2.5.
GNAQPMS v1.1: accelerating the Global Nested Air Quality Prediction Modeling System (GNAQPMS) on Intel Xeon Phi processors

NASA Astrophysics Data System (ADS)

Wang, Hui; Chen, Huansheng; Wu, Qizhong; Lin, Junmin; Chen, Xueshun; Xie, Xinwei; Wang, Rongrong; Tang, Xiao; Wang, Zifa

2017-08-01

The Global Nested Air Quality Prediction Modeling System (GNAQPMS) is the global version of the Nested Air Quality Prediction Modeling System (NAQPMS), which is a multi-scale chemical transport model used for air quality forecast and atmospheric environmental research. In this study, we present the porting and optimisation of GNAQPMS on a second-generation Intel Xeon Phi processor, codenamed Knights Landing (KNL). Compared with the first-generation Xeon Phi coprocessor (codenamed Knights Corner, KNC), KNL has many new hardware features such as a bootable processor, high-performance in-package memory and ISA compatibility with Intel Xeon processors. In particular, we describe the five optimisations we applied to the key modules of GNAQPMS, including the CBM-Z gas-phase chemistry, advection, convection and wet deposition modules. These optimisations work well on both the KNL 7250 processor and the Intel Xeon E5-2697 V4 processor. They include (1) updating the pure Message Passing Interface (MPI) parallel mode to the hybrid parallel mode with MPI and OpenMP in the emission, advection, convection and gas-phase chemistry modules; (2) fully employing the 512 bit wide vector processing units (VPUs) on the KNL platform; (3) reducing unnecessary memory access to improve cache efficiency; (4) reducing the thread local storage (TLS) in the CBM-Z gas-phase chemistry module to improve its OpenMP performance; and (5) changing the global communication from writing/reading interface files to MPI functions to improve the performance and the parallel scalability. These optimisations greatly improved the GNAQPMS performance. The same optimisations also work well for the Intel Xeon Broadwell processor, specifically E5-2697 v4. Compared with the baseline version of GNAQPMS, the optimised version was 3.51 × faster on KNL and 2.77 × faster on the CPU. Moreover, the optimised version ran at 26 % lower average power on KNL than on the CPU. With the combined performance and energy improvement, the KNL platform was 37.5 % more efficient on power consumption compared with the CPU platform. The optimisations also enabled much further parallel scalability on both the CPU cluster and the KNL cluster scaled to 40 CPU nodes and 30 KNL nodes, with a parallel efficiency of 70.4 and 42.2 %, respectively.
Cognitive Medical Wireless Testbed System (COMWITS)

DTIC Science & Technology

2016-11-01

Number: ...... ...... Sub Contractors (DD882) Names of other research staff Inventions (DD882) Scientific Progress This testbed merges two ARO grants...bit 64 bit CPU Intel Xeon Processor E5-1650v3 (6C, 3.5 GHz, Turbo, HT , 15M, 140W) Intel Core i7-3770 (3.4 GHz Quad Core, 77W) Dual Intel Xeon
Many-integrated core (MIC) technology for accelerating Monte Carlo simulation of radiation transport: A study based on the code DPM

NASA Astrophysics Data System (ADS)

Rodriguez, M.; Brualla, L.

2018-04-01

Monte Carlo simulation of radiation transport is computationally demanding to obtain reasonably low statistical uncertainties of the estimated quantities. Therefore, it can benefit in a large extent from high-performance computing. This work is aimed at assessing the performance of the first generation of the many-integrated core architecture (MIC) Xeon Phi coprocessor with respect to that of a CPU consisting of a double 12-core Xeon processor in Monte Carlo simulation of coupled electron-photonshowers. The comparison was made twofold, first, through a suite of basic tests including parallel versions of the random number generators Mersenne Twister and a modified implementation of RANECU. These tests were addressed to establish a baseline comparison between both devices. Secondly, through the p DPM code developed in this work. p DPM is a parallel version of the Dose Planning Method (DPM) program for fast Monte Carlo simulation of radiation transport in voxelized geometries. A variety of techniques addressed to obtain a large scalability on the Xeon Phi were implemented in p DPM. Maximum scalabilities of 84 . 2 × and 107 . 5 × were obtained in the Xeon Phi for simulations of electron and photon beams, respectively. Nevertheless, in none of the tests involving radiation transport the Xeon Phi performed better than the CPU. The disadvantage of the Xeon Phi with respect to the CPU owes to the low performance of the single core of the former. A single core of the Xeon Phi was more than 10 times less efficient than a single core of the CPU for all radiation transport simulations.
Vectorization for Molecular Dynamics on Intel Xeon Phi Corpocessors

NASA Astrophysics Data System (ADS)

Yi, Hongsuk

2014-03-01

Many modern processors are capable of exploiting data-level parallelism through the use of single instruction multiple data (SIMD) execution. The new Intel Xeon Phi coprocessor supports 512 bit vector registers for the high performance computing. In this paper, we have developed a hierarchical parallelization scheme for accelerated molecular dynamics simulations with the Terfoff potentials for covalent bond solid crystals on Intel Xeon Phi coprocessor systems. The scheme exploits multi-level parallelism computing. We combine thread-level parallelism using a tightly coupled thread-level and task-level parallelism with 512-bit vector register. The simulation results show that the parallel performance of SIMD implementations on Xeon Phi is apparently superior to their x86 CPU architecture.
Evaluation of an Adaptive Automation Trigger Based on Task Performance, Priority, and Frequency

DTIC Science & Technology

2013-06-01

with dual Intel ® Xeon ® CPU x5550 processors @ 2.67 GHz each, 12.0 GB RAM, and a 1.5 GB PCIe nVidia Quadro FX 4800 graphics card (Microsoft...Cole Publishing Company . Miller, C. A., & Parasuraman, R. (2007). Designing for flexible interaction between humans and automation: Delegation
Particle-in-Cell laser-plasma simulation on Xeon Phi coprocessors

NASA Astrophysics Data System (ADS)

Surmin, I. A.; Bastrakov, S. I.; Efimenko, E. S.; Gonoskov, A. A.; Korzhimanov, A. V.; Meyerov, I. B.

2016-05-01

This paper concerns the development of a high-performance implementation of the Particle-in-Cell method for plasma simulation on Intel Xeon Phi coprocessors. We discuss the suitability of the method for Xeon Phi architecture and present our experience in the porting and optimization of the existing parallel Particle-in-Cell code PICADOR. Direct porting without code modification gives performance on Xeon Phi close to that of an 8-core CPU on a benchmark problem with 50 particles per cell. We demonstrate step-by-step optimization techniques, such as improving data locality, enhancing parallelization efficiency and vectorization leading to an overall 4.2 × speedup on CPU and 7.5 × on Xeon Phi compared to the baseline version. The optimized version achieves 16.9 ns per particle update on an Intel Xeon E5-2660 CPU and 9.3 ns per particle update on an Intel Xeon Phi 5110P. For a real problem of laser ion acceleration in targets with surface grating, where a large number of macroparticles per cell is required, the speedup of Xeon Phi compared to CPU is 1.6 ×.
MILC Code Performance on High End CPU and GPU Supercomputer Clusters

NASA Astrophysics Data System (ADS)

DeTar, Carleton; Gottlieb, Steven; Li, Ruizi; Toussaint, Doug

2018-03-01

With recent developments in parallel supercomputing architecture, many core, multi-core, and GPU processors are now commonplace, resulting in more levels of parallelism, memory hierarchy, and programming complexity. It has been necessary to adapt the MILC code to these new processors starting with NVIDIA GPUs, and more recently, the Intel Xeon Phi processors. We report on our efforts to port and optimize our code for the Intel Knights Landing architecture. We consider performance of the MILC code with MPI and OpenMP, and optimizations with QOPQDP and QPhiX. For the latter approach, we concentrate on the staggered conjugate gradient and gauge force. We also consider performance on recent NVIDIA GPUs using the QUDA library.
Fast multipurpose Monte Carlo simulation for proton therapy using multi- and many-core CPU architectures.

PubMed

Souris, Kevin; Lee, John Aldo; Sterpin, Edmond

2016-04-01

Accuracy in proton therapy treatment planning can be improved using Monte Carlo (MC) simulations. However the long computation time of such methods hinders their use in clinical routine. This work aims to develop a fast multipurpose Monte Carlo simulation tool for proton therapy using massively parallel central processing unit (CPU) architectures. A new Monte Carlo, called MCsquare (many-core Monte Carlo), has been designed and optimized for the last generation of Intel Xeon processors and Intel Xeon Phi coprocessors. These massively parallel architectures offer the flexibility and the computational power suitable to MC methods. The class-II condensed history algorithm of MCsquare provides a fast and yet accurate method of simulating heavy charged particles such as protons, deuterons, and alphas inside voxelized geometries. Hard ionizations, with energy losses above a user-specified threshold, are simulated individually while soft events are regrouped in a multiple scattering theory. Elastic and inelastic nuclear interactions are sampled from ICRU 63 differential cross sections, thereby allowing for the computation of prompt gamma emission profiles. MCsquare has been benchmarked with the gate/geant4 Monte Carlo application for homogeneous and heterogeneous geometries. Comparisons with gate/geant4 for various geometries show deviations within 2%-1 mm. In spite of the limited memory bandwidth of the coprocessor simulation time is below 25 s for 10(7) primary 200 MeV protons in average soft tissues using all Xeon Phi and CPU resources embedded in a single desktop unit. MCsquare exploits the flexibility of CPU architectures to provide a multipurpose MC simulation tool. Optimized code enables the use of accurate MC calculation within a reasonable computation time, adequate for clinical practice. MCsquare also simulates prompt gamma emission and can thus be used also for in vivo range verification.
Efficient irregular wavefront propagation algorithms on Intel® Xeon Phi™

PubMed Central

Gomes, Jeremias M.; Teodoro, George; de Melo, Alba; Kong, Jun; Kurc, Tahsin; Saltz, Joel H.

2016-01-01

We investigate the execution of the Irregular Wavefront Propagation Pattern (IWPP), a fundamental computing structure used in several image analysis operations, on the Intel® Xeon Phi™ co-processor. An efficient implementation of IWPP on the Xeon Phi is a challenging problem because of IWPP’s irregularity and the use of atomic instructions in the original IWPP algorithm to resolve race conditions. On the Xeon Phi, the use of SIMD and vectorization instructions is critical to attain high performance. However, SIMD atomic instructions are not supported. Therefore, we propose a new IWPP algorithm that can take advantage of the supported SIMD instruction set. We also evaluate an alternate storage container (priority queue) to track active elements in the wavefront in an effort to improve the parallel algorithm efficiency. The new IWPP algorithm is evaluated with Morphological Reconstruction and Imfill operations as use cases. Our results show performance improvements of up to 5.63× on top of the original IWPP due to vectorization. Moreover, the new IWPP achieves speedups of 45.7× and 1.62×, respectively, as compared to efficient CPU and GPU implementations. PMID:27298591
Efficient irregular wavefront propagation algorithms on Intel® Xeon Phi™.

PubMed

Gomes, Jeremias M; Teodoro, George; de Melo, Alba; Kong, Jun; Kurc, Tahsin; Saltz, Joel H

2015-10-01

We investigate the execution of the Irregular Wavefront Propagation Pattern (IWPP), a fundamental computing structure used in several image analysis operations, on the Intel ® Xeon Phi ™ co-processor. An efficient implementation of IWPP on the Xeon Phi is a challenging problem because of IWPP's irregularity and the use of atomic instructions in the original IWPP algorithm to resolve race conditions. On the Xeon Phi, the use of SIMD and vectorization instructions is critical to attain high performance. However, SIMD atomic instructions are not supported. Therefore, we propose a new IWPP algorithm that can take advantage of the supported SIMD instruction set. We also evaluate an alternate storage container (priority queue) to track active elements in the wavefront in an effort to improve the parallel algorithm efficiency. The new IWPP algorithm is evaluated with Morphological Reconstruction and Imfill operations as use cases. Our results show performance improvements of up to 5.63 × on top of the original IWPP due to vectorization. Moreover, the new IWPP achieves speedups of 45.7 × and 1.62 × , respectively, as compared to efficient CPU and GPU implementations.
Optimizing the updated Goddard shortwave radiation Weather Research and Forecasting (WRF) scheme for Intel Many Integrated Core (MIC) architecture

NASA Astrophysics Data System (ADS)

Mielikainen, Jarno; Huang, Bormin; Huang, Allen H.-L.

2015-05-01

Intel Many Integrated Core (MIC) ushers in a new era of supercomputing speed, performance, and compatibility. It allows the developers to run code at trillions of calculations per second using the familiar programming model. In this paper, we present our results of optimizing the updated Goddard shortwave radiation Weather Research and Forecasting (WRF) scheme on Intel Many Integrated Core Architecture (MIC) hardware. The Intel Xeon Phi coprocessor is the first product based on Intel MIC architecture, and it consists of up to 61 cores connected by a high performance on-die bidirectional interconnect. The co-processor supports all important Intel development tools. Thus, the development environment is familiar one to a vast number of CPU developers. Although, getting a maximum performance out of Xeon Phi will require using some novel optimization techniques. Those optimization techniques are discusses in this paper. The results show that the optimizations improved performance of the original code on Xeon Phi 7120P by a factor of 1.3x.
Fast multipurpose Monte Carlo simulation for proton therapy using multi- and many-core CPU architectures

DOE Office of Scientific and Technical Information (OSTI.GOV)

Souris, Kevin, E-mail: kevin.souris@uclouvain.be; Lee, John Aldo; Sterpin, Edmond

2016-04-15

Purpose: Accuracy in proton therapy treatment planning can be improved using Monte Carlo (MC) simulations. However the long computation time of such methods hinders their use in clinical routine. This work aims to develop a fast multipurpose Monte Carlo simulation tool for proton therapy using massively parallel central processing unit (CPU) architectures. Methods: A new Monte Carlo, called MCsquare (many-core Monte Carlo), has been designed and optimized for the last generation of Intel Xeon processors and Intel Xeon Phi coprocessors. These massively parallel architectures offer the flexibility and the computational power suitable to MC methods. The class-II condensed history algorithmmore » of MCsquare provides a fast and yet accurate method of simulating heavy charged particles such as protons, deuterons, and alphas inside voxelized geometries. Hard ionizations, with energy losses above a user-specified threshold, are simulated individually while soft events are regrouped in a multiple scattering theory. Elastic and inelastic nuclear interactions are sampled from ICRU 63 differential cross sections, thereby allowing for the computation of prompt gamma emission profiles. MCsquare has been benchmarked with the GATE/GEANT4 Monte Carlo application for homogeneous and heterogeneous geometries. Results: Comparisons with GATE/GEANT4 for various geometries show deviations within 2%–1 mm. In spite of the limited memory bandwidth of the coprocessor simulation time is below 25 s for 10{sup 7} primary 200 MeV protons in average soft tissues using all Xeon Phi and CPU resources embedded in a single desktop unit. Conclusions: MCsquare exploits the flexibility of CPU architectures to provide a multipurpose MC simulation tool. Optimized code enables the use of accurate MC calculation within a reasonable computation time, adequate for clinical practice. MCsquare also simulates prompt gamma emission and can thus be used also for in vivo range verification.« less
Computational performance of a smoothed particle hydrodynamics simulation for shared-memory parallel computing

NASA Astrophysics Data System (ADS)

Nishiura, Daisuke; Furuichi, Mikito; Sakaguchi, Hide

2015-09-01

The computational performance of a smoothed particle hydrodynamics (SPH) simulation is investigated for three types of current shared-memory parallel computer devices: many integrated core (MIC) processors, graphics processing units (GPUs), and multi-core CPUs. We are especially interested in efficient shared-memory allocation methods for each chipset, because the efficient data access patterns differ between compute unified device architecture (CUDA) programming for GPUs and OpenMP programming for MIC processors and multi-core CPUs. We first introduce several parallel implementation techniques for the SPH code, and then examine these on our target computer architectures to determine the most effective algorithms for each processor unit. In addition, we evaluate the effective computing performance and power efficiency of the SPH simulation on each architecture, as these are critical metrics for overall performance in a multi-device environment. In our benchmark test, the GPU is found to produce the best arithmetic performance as a standalone device unit, and gives the most efficient power consumption. The multi-core CPU obtains the most effective computing performance. The computational speed of the MIC processor on Xeon Phi approached that of two Xeon CPUs. This indicates that using MICs is an attractive choice for existing SPH codes on multi-core CPUs parallelized by OpenMP, as it gains computational acceleration without the need for significant changes to the source code.
Accelerating finite-rate chemical kinetics with coprocessors: Comparing vectorization methods on GPUs, MICs, and CPUs

NASA Astrophysics Data System (ADS)

Stone, Christopher P.; Alferman, Andrew T.; Niemeyer, Kyle E.

2018-05-01

Accurate and efficient methods for solving stiff ordinary differential equations (ODEs) are a critical component of turbulent combustion simulations with finite-rate chemistry. The ODEs governing the chemical kinetics at each mesh point are decoupled by operator-splitting allowing each to be solved concurrently. An efficient ODE solver must then take into account the available thread and instruction-level parallelism of the underlying hardware, especially on many-core coprocessors, as well as the numerical efficiency. A stiff Rosenbrock and a nonstiff Runge-Kutta ODE solver are both implemented using the single instruction, multiple thread (SIMT) and single instruction, multiple data (SIMD) paradigms within OpenCL. Both methods solve multiple ODEs concurrently within the same instruction stream. The performance of these parallel implementations was measured on three chemical kinetic models of increasing size across several multicore and many-core platforms. Two separate benchmarks were conducted to clearly determine any performance advantage offered by either method. The first benchmark measured the run-time of evaluating the right-hand-side source terms in parallel and the second benchmark integrated a series of constant-pressure, homogeneous reactors using the Rosenbrock and Runge-Kutta solvers. The right-hand-side evaluations with SIMD parallelism on the host multicore Xeon CPU and many-core Xeon Phi co-processor performed approximately three times faster than the baseline multithreaded C++ code. The SIMT parallel model on the host and Phi was 13%-35% slower than the baseline while the SIMT model on the NVIDIA Kepler GPU provided approximately the same performance as the SIMD model on the Phi. The runtimes for both ODE solvers decreased significantly with the SIMD implementations on the host CPU (2.5-2.7 ×) and Xeon Phi coprocessor (4.7-4.9 ×) compared to the baseline parallel code. The SIMT implementations on the GPU ran 1.5-1.6 times faster than the baseline multithreaded CPU code; however, this was significantly slower than the SIMD versions on the host CPU or the Xeon Phi. The performance difference between the three platforms was attributed to thread divergence caused by the adaptive step-sizes within the ODE integrators. Analysis showed that the wider vector width of the GPU incurs a higher level of divergence than the narrower Sandy Bridge or Xeon Phi. The significant performance improvement provided by the SIMD parallel strategy motivates further research into more ODE solver methods that are both SIMD-friendly and computationally efficient.
Acceleration of spiking neural network based pattern recognition on NVIDIA graphics processors.

PubMed

Han, Bing; Taha, Tarek M

2010-04-01

There is currently a strong push in the research community to develop biological scale implementations of neuron based vision models. Systems at this scale are computationally demanding and generally utilize more accurate neuron models, such as the Izhikevich and the Hodgkin-Huxley models, in favor of the more popular integrate and fire model. We examine the feasibility of using graphics processing units (GPUs) to accelerate a spiking neural network based character recognition network to enable such large scale systems. Two versions of the network utilizing the Izhikevich and Hodgkin-Huxley models are implemented. Three NVIDIA general-purpose (GP) GPU platforms are examined, including the GeForce 9800 GX2, the Tesla C1060, and the Tesla S1070. Our results show that the GPGPUs can provide significant speedup over conventional processors. In particular, the fastest GPGPU utilized, the Tesla S1070, provided a speedup of 5.6 and 84.4 over highly optimized implementations on the fastest central processing unit (CPU) tested, a quadcore 2.67 GHz Xeon processor, for the Izhikevich and the Hodgkin-Huxley models, respectively. The CPU implementation utilized all four cores and the vector data parallelism offered by the processor. The results indicate that GPUs are well suited for this application domain.
Benchmarking hardware architecture candidates for the NFIRAOS real-time controller

NASA Astrophysics Data System (ADS)

Smith, Malcolm; Kerley, Dan; Herriot, Glen; Véran, Jean-Pierre

2014-07-01

As a part of the trade study for the Narrow Field Infrared Adaptive Optics System, the adaptive optics system for the Thirty Meter Telescope, we investigated the feasibility of performing real-time control computation using a Linux operating system and Intel Xeon E5 CPUs. We also investigated a Xeon Phi based architecture which allows higher levels of parallelism. This paper summarizes both the CPU based real-time controller architecture and the Xeon Phi based RTC. The Intel Xeon E5 CPU solution meets the requirements and performs the computation for one AO cycle in an average of 767 microseconds. The Xeon Phi solution did not meet the 1200 microsecond time requirement and also suffered from unpredictable execution times. More detailed benchmark results are reported for both architectures.
Static analysis of the hull plate using the finite element method

NASA Astrophysics Data System (ADS)

Ion, A.

2015-11-01

This paper aims at presenting the static analysis for two levels of a container ship's construction as follows: the first level is at the girder / hull plate and the second level is conducted at the entire strength hull of the vessel. This article will describe the work for the static analysis of a hull plate. We shall use the software package ANSYS Mechanical 14.5. The program is run on a computer with four Intel Xeon X5260 CPU processors at 3.33 GHz, 32 GB memory installed. In terms of software, the shared memory parallel version of ANSYS refers to running ANSYS across multiple cores on a SMP system. The distributed memory parallel version of ANSYS (Distributed ANSYS) refers to running ANSYS across multiple processors on SMP systems or DMP systems.
Using Intel's Knight Landing Processor to Accelerate Global Nested Air Quality Prediction Modeling System (GNAQPMS) Model

NASA Astrophysics Data System (ADS)

Wang, H.; Chen, H.; Chen, X.; Wu, Q.; Wang, Z.

2016-12-01

The Global Nested Air Quality Prediction Modeling System for Hg (GNAQPMS-Hg) is a global chemical transport model coupled Hg transport module to investigate the mercury pollution. In this study, we present our work of transplanting the GNAQPMS model on Intel Xeon Phi processor, Knights Landing (KNL) to accelerate the model. KNL is the second-generation product adopting Many Integrated Core Architecture (MIC) architecture. Compared with the first generation Knight Corner (KNC), KNL has more new hardware features, that it can be used as unique processor as well as coprocessor with other CPU. According to the Vtune tool, the high overhead modules in GNAQPMS model have been addressed, including CBMZ gas chemistry, advection and convection module, and wet deposition module. These high overhead modules were accelerated by optimizing code and using new techniques of KNL. The following optimized measures was done: 1) Changing the pure MPI parallel mode to hybrid parallel mode with MPI and OpenMP; 2.Vectorizing the code to using the 512-bit wide vector computation unit. 3. Reducing unnecessary memory access and calculation. 4. Reducing Thread Local Storage (TLS) for common variables with each OpenMP thread in CBMZ. 5. Changing the way of global communication from files writing and reading to MPI functions. After optimization, the performance of GNAQPMS is greatly increased both on CPU and KNL platform, the single-node test showed that optimized version has 2.6x speedup on two sockets CPU platform and 3.3x speedup on one socket KNL platform compared with the baseline version code, which means the KNL has 1.29x speedup when compared with 2 sockets CPU platform.
Nonlinear Wave Simulation on the Xeon Phi Knights Landing Processor

NASA Astrophysics Data System (ADS)

Hristov, Ivan; Goranov, Goran; Hristova, Radoslava

2018-02-01

We consider an interesting from computational point of view standing wave simulation by solving coupled 2D perturbed Sine-Gordon equations. We make an OpenMP realization which explores both thread and SIMD levels of parallelism. We test the OpenMP program on two different energy equivalent Intel architectures: 2× Xeon E5-2695 v2 processors, (code-named "Ivy Bridge-EP") in the Hybrilit cluster, and Xeon Phi 7250 processor (code-named "Knights Landing" (KNL). The results show 2 times better performance on KNL processor.

Hierarchical algorithms for modeling the ocean on hierarchical architectures

NASA Astrophysics Data System (ADS)

Hill, C. N.

2012-12-01

This presentation will describe an approach to using accelerator/co-processor technology that maps hierarchical, multi-scale modeling techniques to an underlying hierarchical hardware architecture. The focus of this work is on making effective use of both CPU and accelerator/co-processor parts of a system, for large scale ocean modeling. In the work, a lower resolution basin scale ocean model is locally coupled to multiple, "embedded", limited area higher resolution sub-models. The higher resolution models execute on co-processor/accelerator hardware and do not interact directly with other sub-models. The lower resolution basin scale model executes on the system CPU(s). The result is a multi-scale algorithm that aligns with hardware designs in the co-processor/accelerator space. We demonstrate this approach being used to substitute explicit process models for standard parameterizations. Code for our sub-models is implemented through a generic abstraction layer, so that we can target multiple accelerator architectures with different programming environments. We will present two application and implementation examples. One uses the CUDA programming environment and targets GPU hardware. This example employs a simple non-hydrostatic two dimensional sub-model to represent vertical motion more accurately. The second example uses a highly threaded three-dimensional model at high resolution. This targets a MIC/Xeon Phi like environment and uses sub-models as a way to explicitly compute sub-mesoscale terms. In both cases the accelerator/co-processor capability provides extra compute cycles that allow improved model fidelity for little or no extra wall-clock time cost.
Spectral-element simulation of two-dimensional elastic wave propagation in fully heterogeneous media on a GPU cluster

NASA Astrophysics Data System (ADS)

Rudianto, Indra; Sudarmaji

2018-04-01

We present an implementation of the spectral-element method for simulation of two-dimensional elastic wave propagation in fully heterogeneous media. We have incorporated most of realistic geological features in the model, including surface topography, curved layer interfaces, and 2-D wave-speed heterogeneity. To accommodate such complexity, we use an unstructured quadrilateral meshing technique. Simulation was performed on a GPU cluster, which consists of 24 core processors Intel Xeon CPU and 4 NVIDIA Quadro graphics cards using CUDA and MPI implementation. We speed up the computation by a factor of about 5 compared to MPI only, and by a factor of about 40 compared to Serial implementation.
Fast 2D FWI on a multi and many-cores workstation.

NASA Astrophysics Data System (ADS)

Thierry, Philippe; Donno, Daniela; Noble, Mark

2014-05-01

Following the introduction of x86 co-processors (Xeon Phi) and the performance increase of standard 2-socket workstations using the latest 12 cores E5-v2 x86-64 CPU, we present here a MPI + OpenMP implementation of an acoustic 2D FWI (full waveform inversion) code which simultaneously runs on the CPUs and on the co-processors installed in a workstation. The main advantage of running a 2D FWI on a workstation is to be able to quickly evaluate new features such as more complicated wave equations, new cost functions, finite-difference stencils or boundary conditions. Since the co-processor is made of 61 in-order x86 cores, each of them having up to 4 threads, this many-core can be seen as a shared memory SMP (symmetric multiprocessing) machine with its own IP address. Depending on the vendor, a single workstation can handle several co-processors making the workstation as a personal cluster under the desk. The original Fortran 90 CPU version of the 2D FWI code is just recompiled to get a Xeon Phi x86 binary. This multi and many-core configuration uses standard compilers and associated MPI as well as math libraries under Linux; therefore, the cost of code development remains constant, while improving computation time. We choose to implement the code with the so-called symmetric mode to fully use the capacity of the workstation, but we also evaluate the scalability of the code in native mode (i.e running only on the co-processor) thanks to the Linux ssh and NFS capabilities. Usual care of optimization and SIMD vectorization is used to ensure optimal performances, and to analyze the application performances and bottlenecks on both platforms. The 2D FWI implementation uses finite-difference time-domain forward modeling and a quasi-Newton (with L-BFGS algorithm) optimization scheme for the model parameters update. Parallelization is achieved through standard MPI shot gathers distribution and OpenMP for domain decomposition within the co-processor. Taking advantage of the 16 GB of memory available on the co-processor we are able to keep wavefields in memory to achieve the gradient computation by cross-correlation of forward and back-propagated wavefields needed by our time-domain FWI scheme, without heavy traffic on the i/o subsystem and PCIe bus. In this presentation we will also review some simple methodologies to determine performance expectation compared to real performances in order to get optimization effort estimation before starting any huge modification or rewriting of research codes. The key message is the ease of use and development of this hybrid configuration to reach not the absolute peak performance value but the optimal one that ensures the best balance between geophysical and computer developments.
DOE Office of Scientific and Technical Information (OSTI.GOV)

Cohen, J; Dossa, D; Gokhale, M

Critical data science applications requiring frequent access to storage perform poorly on today's computing architectures. This project addresses efficient computation of data-intensive problems in national security and basic science by exploring, advancing, and applying a new form of computing called storage-intensive supercomputing (SISC). Our goal is to enable applications that simply cannot run on current systems, and, for a broad range of data-intensive problems, to deliver an order of magnitude improvement in price/performance over today's data-intensive architectures. This technical report documents much of the work done under LDRD 07-ERD-063 Storage Intensive Supercomputing during the period 05/07-09/07. The following chapters describe:more » (1) a new file I/O monitoring tool iotrace developed to capture the dynamic I/O profiles of Linux processes; (2) an out-of-core graph benchmark for level-set expansion of scale-free graphs; (3) an entity extraction benchmark consisting of a pipeline of eight components; and (4) an image resampling benchmark drawn from the SWarp program in the LSST data processing pipeline. The performance of the graph and entity extraction benchmarks was measured in three different scenarios: data sets residing on the NFS file server and accessed over the network; data sets stored on local disk; and data sets stored on the Fusion I/O parallel NAND Flash array. The image resampling benchmark compared performance of software-only to GPU-accelerated. In addition to the work reported here, an additional text processing application was developed that used an FPGA to accelerate n-gram profiling for language classification. The n-gram application will be presented at SC07 at the High Performance Reconfigurable Computing Technologies and Applications Workshop. The graph and entity extraction benchmarks were run on a Supermicro server housing the NAND Flash 40GB parallel disk array, the Fusion-io. The Fusion system specs are as follows: SuperMicro X7DBE Xeon Dual Socket Blackford Server Motherboard; 2 Intel Xeon Dual-Core 2.66 GHz processors; 1 GB DDR2 PC2-5300 RAM (2 x 512); 80GB Hard Drive (Seagate SATA II Barracuda). The Fusion board is presently capable of 4X in a PCIe slot. The image resampling benchmark was run on a dual Xeon workstation with NVIDIA graphics card (see Chapter 5 for full specification). An XtremeData Opteron+FPGA was used for the language classification application. We observed that these benchmarks are not uniformly I/O intensive. The only benchmark that showed greater that 50% of the time in I/O was the graph algorithm when it accessed data files over NFS. When local disk was used, the graph benchmark spent at most 40% of its time in I/O. The other benchmarks were CPU dominated. The image resampling benchmark and language classification showed order of magnitude speedup over software by using co-processor technology to offload the CPU-intensive kernels. Our experiments to date suggest that emerging hardware technologies offer significant benefit to boosting the performance of data-intensive algorithms. Using GPU and FPGA co-processors, we were able to improve performance by more than an order of magnitude on the benchmark algorithms, eliminating the processor bottleneck of CPU-bound tasks. Experiments with a prototype solid state nonvolative memory available today show 10X better throughput on random reads than disk, with a 2X speedup on a graph processing benchmark when compared to the use of local SATA disk.« less
Evaluation of the Xeon phi processor as a technology for the acceleration of real-time control in high-order adaptive optics systems

NASA Astrophysics Data System (ADS)

Barr, David; Basden, Alastair; Dipper, Nigel; Schwartz, Noah; Vick, Andy; Schnetler, Hermine

2014-08-01

We present wavefront reconstruction acceleration of high-order AO systems using an Intel Xeon Phi processor. The Xeon Phi is a coprocessor providing many integrated cores and designed for accelerating compute intensive, numerical codes. Unlike other accelerator technologies, it allows virtually unchanged C/C++ to be recompiled to run on the Xeon Phi, giving the potential of making development, upgrade and maintenance faster and less complex. We benchmark the Xeon Phi in the context of AO real-time control by running a matrix vector multiply (MVM) algorithm. We investigate variability in execution time and demonstrate a substantial speed-up in loop frequency. We examine the integration of a Xeon Phi into an existing RTC system and show that performance improvements can be achieved with limited development effort.
Novel hybrid GPU-CPU implementation of parallelized Monte Carlo parametric expectation maximization estimation method for population pharmacokinetic data analysis.

PubMed

Ng, C M

2013-10-01

The development of a population PK/PD model, an essential component for model-based drug development, is both time- and labor-intensive. A graphical-processing unit (GPU) computing technology has been proposed and used to accelerate many scientific computations. The objective of this study was to develop a hybrid GPU-CPU implementation of parallelized Monte Carlo parametric expectation maximization (MCPEM) estimation algorithm for population PK data analysis. A hybrid GPU-CPU implementation of the MCPEM algorithm (MCPEMGPU) and identical algorithm that is designed for the single CPU (MCPEMCPU) were developed using MATLAB in a single computer equipped with dual Xeon 6-Core E5690 CPU and a NVIDIA Tesla C2070 GPU parallel computing card that contained 448 stream processors. Two different PK models with rich/sparse sampling design schemes were used to simulate population data in assessing the performance of MCPEMCPU and MCPEMGPU. Results were analyzed by comparing the parameter estimation and model computation times. Speedup factor was used to assess the relative benefit of parallelized MCPEMGPU over MCPEMCPU in shortening model computation time. The MCPEMGPU consistently achieved shorter computation time than the MCPEMCPU and can offer more than 48-fold speedup using a single GPU card. The novel hybrid GPU-CPU implementation of parallelized MCPEM algorithm developed in this study holds a great promise in serving as the core for the next-generation of modeling software for population PK/PD analysis.
Initial results on computational performance of Intel Many Integrated Core (MIC) architecture: implementation of the Weather and Research Forecasting (WRF) Purdue-Lin microphysics scheme

NASA Astrophysics Data System (ADS)

Mielikainen, Jarno; Huang, Bormin; Huang, Allen H.

2014-10-01

Purdue-Lin scheme is a relatively sophisticated microphysics scheme in the Weather Research and Forecasting (WRF) model. The scheme includes six classes of hydro meteors: water vapor, cloud water, raid, cloud ice, snow and graupel. The scheme is very suitable for massively parallel computation as there are no interactions among horizontal grid points. In this paper, we accelerate the Purdue Lin scheme using Intel Many Integrated Core Architecture (MIC) hardware. The Intel Xeon Phi is a high performance coprocessor consists of up to 61 cores. The Xeon Phi is connected to a CPU via the PCI Express (PICe) bus. In this paper, we will discuss in detail the code optimization issues encountered while tuning the Purdue-Lin microphysics Fortran code for Xeon Phi. In particularly, getting a good performance required utilizing multiple cores, the wide vector operations and make efficient use of memory. The results show that the optimizations improved performance of the original code on Xeon Phi 5110P by a factor of 4.2x. Furthermore, the same optimizations improved performance on Intel Xeon E5-2603 CPU by a factor of 1.2x compared to the original code.
ELT-scale Adaptive Optics real-time control with thes Intel Xeon Phi Many Integrated Core Architecture

NASA Astrophysics Data System (ADS)

Jenkins, David R.; Basden, Alastair; Myers, Richard M.

2018-05-01

We propose a solution to the increased computational demands of Extremely Large Telescope (ELT) scale adaptive optics (AO) real-time control with the Intel Xeon Phi Knights Landing (KNL) Many Integrated Core (MIC) Architecture. The computational demands of an AO real-time controller (RTC) scale with the fourth power of telescope diameter and so the next generation ELTs require orders of magnitude more processing power for the RTC pipeline than existing systems. The Xeon Phi contains a large number (≥64) of low power x86 CPU cores and high bandwidth memory integrated into a single socketed server CPU package. The increased parallelism and memory bandwidth are crucial to providing the performance for reconstructing wavefronts with the required precision for ELT scale AO. Here, we demonstrate that the Xeon Phi KNL is capable of performing ELT scale single conjugate AO real-time control computation at over 1.0kHz with less than 20μs RMS jitter. We have also shown that with a wavefront sensor camera attached the KNL can process the real-time control loop at up to 966Hz, the maximum frame-rate of the camera, with jitter remaining below 20μs RMS. Future studies will involve exploring the use of a cluster of Xeon Phis for the real-time control of the MCAO and MOAO regimes of AO. We find that the Xeon Phi is highly suitable for ELT AO real time control.
Performance Evaluation of NWChem Ab-Initio Molecular Dynamics (AIMD) Simulations on the Intel® Xeon Phi™ Processor

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bylaska, Eric J.; Jacquelin, Mathias; De Jong, Wibe A.

2017-10-20

Ab-initio Molecular Dynamics (AIMD) methods are an important class of algorithms, as they enable scientists to understand the chemistry and dynamics of molecular and condensed phase systems while retaining a first-principles-based description of their interactions. Many-core architectures such as the Intel® Xeon Phi™ processor are an interesting and promising target for these algorithms, as they can provide the computational power that is needed to solve interesting problems in chemistry. In this paper, we describe the efforts of refactoring the existing AIMD plane-wave method of NWChem from an MPI-only implementation to a scalable, hybrid code that employs MPI and OpenMP tomore » exploit the capabilities of current and future many-core architectures. We describe the optimizations required to get close to optimal performance for the multiplication of the tall-and-skinny matrices that form the core of the computational algorithm. We present strong scaling results on the complete AIMD simulation for a test case that simulates 256 water molecules and that strong-scales well on a cluster of 1024 nodes of Intel Xeon Phi processors. We compare the performance obtained with a cluster of dual-socket Intel® Xeon® E5–2698v3 processors.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)

Trędak, Przemysław, E-mail: przemyslaw.tredak@fuw.edu.pl; Rudnicki, Witold R.; Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, ul. Pawińskiego 5a, 02-106 Warsaw

The second generation Reactive Bond Order (REBO) empirical potential is commonly used to accurately model a wide range hydrocarbon materials. It is also extensible to other atom types and interactions. REBO potential assumes complex multi-body interaction model, that is difficult to represent efficiently in the SIMD or SIMT programming model. Hence, despite its importance, no efficient GPGPU implementation has been developed for this potential. Here we present a detailed description of a highly efficient GPGPU implementation of molecular dynamics algorithm using REBO potential. The presented algorithm takes advantage of rarely used properties of the SIMT architecture of a modern GPUmore » to solve difficult synchronizations issues that arise in computations of multi-body potential. Techniques developed for this problem may be also used to achieve efficient solutions of different problems. The performance of proposed algorithm is assessed using a range of model systems. It is compared to highly optimized CPU implementation (both single core and OpenMP) available in LAMMPS package. These experiments show up to 6x improvement in forces computation time using single processor of the NVIDIA Tesla K80 compared to high end 16-core Intel Xeon processor.« less
List-mode PET image reconstruction for motion correction using the Intel XEON PHI co-processor

NASA Astrophysics Data System (ADS)

Ryder, W. J.; Angelis, G. I.; Bashar, R.; Gillam, J. E.; Fulton, R.; Meikle, S.

2014-03-01

List-mode image reconstruction with motion correction is computationally expensive, as it requires projection of hundreds of millions of rays through a 3D array. To decrease reconstruction time it is possible to use symmetric multiprocessing computers or graphics processing units. The former can have high financial costs, while the latter can require refactoring of algorithms. The Xeon Phi is a new co-processor card with a Many Integrated Core architecture that can run 4 multiple-instruction, multiple data threads per core with each thread having a 512-bit single instruction, multiple data vector register. Thus, it is possible to run in the region of 220 threads simultaneously. The aim of this study was to investigate whether the Xeon Phi co-processor card is a viable alternative to an x86 Linux server for accelerating List-mode PET image reconstruction for motion correction. An existing list-mode image reconstruction algorithm with motion correction was ported to run on the Xeon Phi coprocessor with the multi-threading implemented using pthreads. There were no differences between images reconstructed using the Phi co-processor card and images reconstructed using the same algorithm run on a Linux server. However, it was found that the reconstruction runtimes were 3 times greater for the Phi than the server. A new version of the image reconstruction algorithm was developed in C++ using OpenMP for mutli-threading and the Phi runtimes decreased to 1.67 times that of the host Linux server. Data transfer from the host to co-processor card was found to be a rate-limiting step; this needs to be carefully considered in order to maximize runtime speeds. When considering the purchase price of a Linux workstation with Xeon Phi co-processor card and top of the range Linux server, the former is a cost-effective computation resource for list-mode image reconstruction. A multi-Phi workstation could be a viable alternative to cluster computers at a lower cost for medical imaging applications.
Application of Intel Many Integrated Core (MIC) accelerators to the Pleim-Xiu land surface scheme

NASA Astrophysics Data System (ADS)

Huang, Melin; Huang, Bormin; Huang, Allen H.

2015-10-01

The land-surface model (LSM) is one physics process in the weather research and forecast (WRF) model. The LSM includes atmospheric information from the surface layer scheme, radiative forcing from the radiation scheme, and precipitation forcing from the microphysics and convective schemes, together with internal information on the land's state variables and land-surface properties. The LSM is to provide heat and moisture fluxes over land points and sea-ice points. The Pleim-Xiu (PX) scheme is one LSM. The PX LSM features three pathways for moisture fluxes: evapotranspiration, soil evaporation, and evaporation from wet canopies. To accelerate the computation process of this scheme, we employ Intel Xeon Phi Many Integrated Core (MIC) Architecture as it is a multiprocessor computer structure with merits of efficient parallelization and vectorization essentials. Our results show that the MIC-based optimization of this scheme running on Xeon Phi coprocessor 7120P improves the performance by 2.3x and 11.7x as compared to the original code respectively running on one CPU socket (eight cores) and on one CPU core with Intel Xeon E5-2670.
Multi-GPU and multi-CPU accelerated FDTD scheme for vibroacoustic applications

NASA Astrophysics Data System (ADS)

Francés, J.; Otero, B.; Bleda, S.; Gallego, S.; Neipp, C.; Márquez, A.; Beléndez, A.

2015-06-01

The Finite-Difference Time-Domain (FDTD) method is applied to the analysis of vibroacoustic problems and to study the propagation of longitudinal and transversal waves in a stratified media. The potential of the scheme and the relevance of each acceleration strategy for massively computations in FDTD are demonstrated in this work. In this paper, we propose two new specific implementations of the bi-dimensional scheme of the FDTD method using multi-CPU and multi-GPU, respectively. In the first implementation, an open source message passing interface (OMPI) has been included in order to massively exploit the resources of a biprocessor station with two Intel Xeon processors. Moreover, regarding CPU code version, the streaming SIMD extensions (SSE) and also the advanced vectorial extensions (AVX) have been included with shared memory approaches that take advantage of the multi-core platforms. On the other hand, the second implementation called the multi-GPU code version is based on Peer-to-Peer communications available in CUDA on two GPUs (NVIDIA GTX 670). Subsequently, this paper presents an accurate analysis of the influence of the different code versions including shared memory approaches, vector instructions and multi-processors (both CPU and GPU) and compares them in order to delimit the degree of improvement of using distributed solutions based on multi-CPU and multi-GPU. The performance of both approaches was analysed and it has been demonstrated that the addition of shared memory schemes to CPU computing improves substantially the performance of vector instructions enlarging the simulation sizes that use efficiently the cache memory of CPUs. In this case GPU computing is slightly twice times faster than the fine tuned CPU version in both cases one and two nodes. However, for massively computations explicit vector instructions do not worth it since the memory bandwidth is the limiting factor and the performance tends to be the same than the sequential version with auto-vectorisation and also shared memory approach. In this scenario GPU computing is the best option since it provides a homogeneous behaviour. More specifically, the speedup of GPU computing achieves an upper limit of 12 for both one and two GPUs, whereas the performance reaches peak values of 80 GFlops and 146 GFlops for the performance for one GPU and two GPUs respectively. Finally, the method is applied to an earth crust profile in order to demonstrate the potential of our approach and the necessity of applying acceleration strategies in these type of applications.
Efficient implementation of the many-body Reactive Bond Order (REBO) potential on GPU

NASA Astrophysics Data System (ADS)

Trędak, Przemysław; Rudnicki, Witold R.; Majewski, Jacek A.

2016-09-01

The second generation Reactive Bond Order (REBO) empirical potential is commonly used to accurately model a wide range hydrocarbon materials. It is also extensible to other atom types and interactions. REBO potential assumes complex multi-body interaction model, that is difficult to represent efficiently in the SIMD or SIMT programming model. Hence, despite its importance, no efficient GPGPU implementation has been developed for this potential. Here we present a detailed description of a highly efficient GPGPU implementation of molecular dynamics algorithm using REBO potential. The presented algorithm takes advantage of rarely used properties of the SIMT architecture of a modern GPU to solve difficult synchronizations issues that arise in computations of multi-body potential. Techniques developed for this problem may be also used to achieve efficient solutions of different problems. The performance of proposed algorithm is assessed using a range of model systems. It is compared to highly optimized CPU implementation (both single core and OpenMP) available in LAMMPS package. These experiments show up to 6x improvement in forces computation time using single processor of the NVIDIA Tesla K80 compared to high end 16-core Intel Xeon processor.
An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU

NASA Astrophysics Data System (ADS)

Lyakh, Dmitry I.

2015-04-01

An efficient parallel tensor transpose algorithm is suggested for shared-memory computing units, namely, multicore CPU, Intel Xeon Phi, and NVidia GPU. The algorithm operates on dense tensors (multidimensional arrays) and is based on the optimization of cache utilization on x86 CPU and the use of shared memory on NVidia GPU. From the applied side, the ultimate goal is to minimize the overhead encountered in the transformation of tensor contractions into matrix multiplications in computer implementations of advanced methods of quantum many-body theory (e.g., in electronic structure theory and nuclear physics). A particular accent is made on higher-dimensional tensors that typically appear in the so-called multireference correlated methods of electronic structure theory. Depending on tensor dimensionality, the presented optimized algorithms can achieve an order of magnitude speedup on x86 CPUs and 2-3 times speedup on NVidia Tesla K20X GPU with respect to the naïve scattering algorithm (no memory access optimization). The tensor transpose routines developed in this work have been incorporated into a general-purpose tensor algebra library (TAL-SH).
Optimizing the Betts-Miller-Janjic cumulus parameterization with Intel Many Integrated Core (MIC) architecture

NASA Astrophysics Data System (ADS)

Huang, Melin; Huang, Bormin; Huang, Allen H.-L.

2015-10-01

The schemes of cumulus parameterization are responsible for the sub-grid-scale effects of convective and/or shallow clouds, and intended to represent vertical fluxes due to unresolved updrafts and downdrafts and compensating motion outside the clouds. Some schemes additionally provide cloud and precipitation field tendencies in the convective column, and momentum tendencies due to convective transport of momentum. The schemes all provide the convective component of surface rainfall. Betts-Miller-Janjic (BMJ) is one scheme to fulfill such purposes in the weather research and forecast (WRF) model. National Centers for Environmental Prediction (NCEP) has tried to optimize the BMJ scheme for operational application. As there are no interactions among horizontal grid points, this scheme is very suitable for parallel computation. With the advantage of Intel Xeon Phi Many Integrated Core (MIC) architecture, efficient parallelization and vectorization essentials, it allows us to optimize the BMJ scheme. If compared to the original code respectively running on one CPU socket (eight cores) and on one CPU core with Intel Xeon E5-2670, the MIC-based optimization of this scheme running on Xeon Phi coprocessor 7120P improves the performance by 2.4x and 17.0x, respectively.
Evaluating and optimizing the NERSC workload on Knights Landing

DOE Office of Scientific and Technical Information (OSTI.GOV)

Barnes, T; Cook, B; Deslippe, J

2017-01-30

NERSC has partnered with 20 representative application teams to evaluate performance on the Xeon-Phi Knights Landing architecture and develop an application-optimization strategy for the greater NERSC workload on the recently installed Cori system. In this article, we present early case studies and summarized results from a subset of the 20 applications highlighting the impact of important architecture differences between the Xeon-Phi and traditional Xeon processors. We summarize the status of the applications and describe the greater optimization strategy that has formed.
Evaluating and Optimizing the NERSC Workload on Knights Landing

DOE Office of Scientific and Technical Information (OSTI.GOV)

Barnes, Taylor; Cook, Brandon; Doerfler, Douglas

2016-01-01

NERSC has partnered with 20 representative application teams to evaluate performance on the Xeon-Phi Knights Landing architecture and develop an application-optimization strategy for the greater NERSC workload on the recently installed Cori system. In this article, we present early case studies and summarized results from a subset of the 20 applications highlighting the impact of important architecture differences between the Xeon-Phi and traditional Xeon processors. We summarize the status of the applications and describe the greater optimization strategy that has formed.
Peregrine System | High-Performance Computing | NREL

Science.gov Websites

) and longer-term (/projects) storage. These file systems are mounted on all nodes. Peregrine has three -2670 Xeon processors and 64 GB of memory. In addition to mounting the /home, /nopt, /projects and # cores/node Memory/node Peak (DP) performance per node 88 Intel Xeon E5-2670 "Sandy Bridge" 8
Accelerating gravitational microlensing simulations using the Xeon Phi coprocessor

NASA Astrophysics Data System (ADS)

Chen, B.; Kantowski, R.; Dai, X.; Baron, E.; Van der Mark, P.

2017-04-01

Recently Graphics Processing Units (GPUs) have been used to speed up very CPU-intensive gravitational microlensing simulations. In this work, we use the Xeon Phi coprocessor to accelerate such simulations and compare its performance on a microlensing code with that of NVIDIA's GPUs. For the selected set of parameters evaluated in our experiment, we find that the speedup by Intel's Knights Corner coprocessor is comparable to that by NVIDIA's Fermi family of GPUs with compute capability 2.0, but less significant than GPUs with higher compute capabilities such as the Kepler. However, the very recently released second generation Xeon Phi, Knights Landing, is about 5.8 times faster than the Knights Corner, and about 2.9 times faster than the Kepler GPU used in our simulations. We conclude that the Xeon Phi is a very promising alternative to GPUs for modern high performance microlensing simulations.

Generic accelerated sequence alignment in SeqAn using vectorization and multi-threading.

PubMed

Rahn, René; Budach, Stefan; Costanza, Pascal; Ehrhardt, Marcel; Hancox, Jonny; Reinert, Knut

2018-05-03

Pairwise sequence alignment is undoubtedly a central tool in many bioinformatics analyses. In this paper, we present a generically accelerated module for pairwise sequence alignments applicable for a broad range of applications. In our module, we unified the standard dynamic programming kernel used for pairwise sequence alignments and extended it with a generalized inter-sequence vectorization layout, such that many alignments can be computed simultaneously by exploiting SIMD (Single Instruction Multiple Data) instructions of modern processors. We then extended the module by adding two layers of thread-level parallelization, where we a) distribute many independent alignments on multiple threads and b) inherently parallelize a single alignment computation using a work stealing approach producing a dynamic wavefront progressing along the minor diagonal. We evaluated our alignment vectorization and parallelization on different processors, including the newest Intel® Xeon® (Skylake) and Intel® Xeon Phi™ (KNL) processors, and use cases. The instruction set AVX512-BW (Byte and Word), available on Skylake processors, can genuinely improve the performance of vectorized alignments. We could run single alignments 1600 times faster on the Xeon Phi™ and 1400 times faster on the Xeon® than executing them with our previous sequential alignment module. The module is programmed in C++ using the SeqAn (Reinert et al., 2017) library and distributed with version 2.4. under the BSD license. We support SSE4, AVX2, AVX512 instructions and included UME::SIMD, a SIMD-instruction wrapper library, to extend our module for further instruction sets. We thoroughly test all alignment components with all major C++ compilers on various platforms. rene.rahn@fu-berlin.de.
Accelerating 3D Elastic Wave Equations on Knights Landing based Intel Xeon Phi processors

NASA Astrophysics Data System (ADS)

Sourouri, Mohammed; Birger Raknes, Espen

2017-04-01

In advanced imaging methods like reverse-time migration (RTM) and full waveform inversion (FWI) the elastic wave equation (EWE) is numerically solved many times to create the seismic image or the elastic parameter model update. Thus, it is essential to optimize the solution time for solving the EWE as this will have a major impact on the total computational cost in running RTM or FWI. From a computational point of view applications implementing EWEs are associated with two major challenges. The first challenge is the amount of memory-bound computations involved, while the second challenge is the execution of such computations over very large datasets. So far, multi-core processors have not been able to tackle these two challenges, which eventually led to the adoption of accelerators such as Graphics Processing Units (GPUs). Compared to conventional CPUs, GPUs are densely populated with many floating-point units and fast memory, a type of architecture that has proven to map well to many scientific computations. Despite its architectural advantages, full-scale adoption of accelerators has yet to materialize. First, accelerators require a significant programming effort imposed by programming models such as CUDA or OpenCL. Second, accelerators come with a limited amount of memory, which also require explicit data transfers between the CPU and the accelerator over the slow PCI bus. The second generation of the Xeon Phi processor based on the Knights Landing (KNL) architecture, promises the computational capabilities of an accelerator but require the same programming effort as traditional multi-core processors. The high computational performance is realized through many integrated cores (number of cores and tiles and memory varies with the model) organized in tiles that are connected via a 2D mesh based interconnect. In contrary to accelerators, KNL is a self-hosted system, meaning explicit data transfers over the PCI bus are no longer required. However, like most accelerators, KNL sports a memory subsystem consisting of low-level caches and 16GB of high-bandwidth MCDRAM memory. For capacity computing, up to 400GB of conventional DDR4 memory is provided. Such a strict hierarchical memory layout means that data locality is imperative if the true potential of this product is to be harnessed. In this work, we study a series of optimizations specifically targeting KNL for our EWE based application to reduce the time-to-solution time for the following 3D model sizes in grid points: 1283, 2563 and 5123. We compare the results with an optimized version for multi-core CPUs running on a dual-socket Xeon E5 2680v3 system using OpenMP. Our initial naive implementation on the KNL is roughly 20% faster than the multi-core version, but by using only one thread per core and careful memory placement using the memkind library, we could achieve higher speedups. Additionally, by using the MCDRAM as cache for problem sizes that are smaller than 16 GB further performance improvements were unlocked. Depending on the problem size, our overall results indicate that the KNL based system is approximately 2.2x faster than the 24-core Xeon E5 2680v3 system, with only modest changes to the code.
Acceleration of Cherenkov angle reconstruction with the new Intel Xeon/FPGA compute platform for the particle identification in the LHCb Upgrade

NASA Astrophysics Data System (ADS)

Faerber, Christian

2017-10-01

The LHCb experiment at the LHC will upgrade its detector by 2018/2019 to a ‘triggerless’ readout scheme, where all the readout electronics and several sub-detector parts will be replaced. The new readout electronics will be able to readout the detector at 40 MHz. This increases the data bandwidth from the detector down to the Event Filter farm to 40 TBit/s, which also has to be processed to select the interesting proton-proton collision for later storage. The architecture of such a computing farm, which can process this amount of data as efficiently as possible, is a challenging task and several compute accelerator technologies are being considered for use inside the new Event Filter farm. In the high performance computing sector more and more FPGA compute accelerators are used to improve the compute performance and reduce the power consumption (e.g. in the Microsoft Catapult project and Bing search engine). Also for the LHCb upgrade the usage of an experimental FPGA accelerated computing platform in the Event Building or in the Event Filter farm is being considered and therefore tested. This platform from Intel hosts a general CPU and a high performance FPGA linked via a high speed link which is for this platform a QPI link. On the FPGA an accelerator is implemented. The used system is a two socket platform from Intel with a Xeon CPU and an FPGA. The FPGA has cache-coherent memory access to the main memory of the server and can collaborate with the CPU. As a first step, a computing intensive algorithm to reconstruct Cherenkov angles for the LHCb RICH particle identification was successfully ported in Verilog to the Intel Xeon/FPGA platform and accelerated by a factor of 35. The same algorithm was ported to the Intel Xeon/FPGA platform with OpenCL. The implementation work and the performance will be compared. Also another FPGA accelerator the Nallatech 385 PCIe accelerator with the same Stratix V FPGA were tested for performance. The results show that the Intel Xeon/FPGA platforms, which are built in general for high performance computing, are also very interesting for the High Energy Physics community.
An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU

DOE PAGES

Lyakh, Dmitry I.

2015-01-05

An efficient parallel tensor transpose algorithm is suggested for shared-memory computing units, namely, multicore CPU, Intel Xeon Phi, and NVidia GPU. The algorithm operates on dense tensors (multidimensional arrays) and is based on the optimization of cache utilization on x86 CPU and the use of shared memory on NVidia GPU. From the applied side, the ultimate goal is to minimize the overhead encountered in the transformation of tensor contractions into matrix multiplications in computer implementations of advanced methods of quantum many-body theory (e.g., in electronic structure theory and nuclear physics). A particular accent is made on higher-dimensional tensors that typicallymore » appear in the so-called multireference correlated methods of electronic structure theory. Depending on tensor dimensionality, the presented optimized algorithms can achieve an order of magnitude speedup on x86 CPUs and 2-3 times speedup on NVidia Tesla K20X GPU with respect to the na ve scattering algorithm (no memory access optimization). Furthermore, the tensor transpose routines developed in this work have been incorporated into a general-purpose tensor algebra library (TAL-SH).« less
GeantV: from CPU to accelerators

NASA Astrophysics Data System (ADS)

Amadio, G.; Ananya, A.; Apostolakis, J.; Arora, A.; Bandieramonte, M.; Bhattacharyya, A.; Bianchini, C.; Brun, R.; Canal, P.; Carminati, F.; Duhem, L.; Elvira, D.; Gheata, A.; Gheata, M.; Goulas, I.; Iope, R.; Jun, S.; Lima, G.; Mohanty, A.; Nikitina, T.; Novak, M.; Pokorski, W.; Ribon, A.; Sehgal, R.; Shadura, O.; Vallecorsa, S.; Wenzel, S.; Zhang, Y.

2016-10-01

The GeantV project aims to research and develop the next-generation simulation software describing the passage of particles through matter. While the modern CPU architectures are being targeted first, resources such as GPGPU, Intel© Xeon Phi, Atom or ARM cannot be ignored anymore by HEP CPU-bound applications. The proof of concept GeantV prototype has been mainly engineered for CPU's having vector units but we have foreseen from early stages a bridge to arbitrary accelerators. A software layer consisting of architecture/technology specific backends supports currently this concept. This approach allows to abstract out the basic types such as scalar/vector but also to formalize generic computation kernels using transparently library or device specific constructs based on Vc, CUDA, Cilk+ or Intel intrinsics. While the main goal of this approach is portable performance, as a bonus, it comes with the insulation of the core application and algorithms from the technology layer. This allows our application to be long term maintainable and versatile to changes at the backend side. The paper presents the first results of basket-based GeantV geometry navigation on the Intel© Xeon Phi KNC architecture. We present the scalability and vectorization study, conducted using Intel performance tools, as well as our preliminary conclusions on the use of accelerators for GeantV transport. We also describe the current work and preliminary results for using the GeantV transport kernel on GPUs.
Does the Intel Xeon Phi processor fit HEP workloads?

NASA Astrophysics Data System (ADS)

Nowak, A.; Bitzes, G.; Dotti, A.; Lazzaro, A.; Jarp, S.; Szostek, P.; Valsan, L.; Botezatu, M.; Leduc, J.

2014-06-01

This paper summarizes the five years of CERN openlab's efforts focused on the Intel Xeon Phi co-processor, from the time of its inception to public release. We consider the architecture of the device vis a vis the characteristics of HEP software and identify key opportunities for HEP processing, as well as scaling limitations. We report on improvements and speedups linked to parallelization and vectorization on benchmarks involving software frameworks such as Geant4 and ROOT. Finally, we extrapolate current software and hardware trends and project them onto accelerators of the future, with the specifics of offline and online HEP processing in mind.
IGA-ADS: Isogeometric analysis FEM using ADS solver

NASA Astrophysics Data System (ADS)

Łoś, Marcin M.; Woźniak, Maciej; Paszyński, Maciej; Lenharth, Andrew; Hassaan, Muhamm Amber; Pingali, Keshav

2017-08-01

In this paper we present a fast explicit solver for solution of non-stationary problems using L2 projections with isogeometric finite element method. The solver has been implemented within GALOIS framework. It enables parallel multi-core simulations of different time-dependent problems, in 1D, 2D, or 3D. We have prepared the solver framework in a way that enables direct implementation of the selected PDE and corresponding boundary conditions. In this paper we describe the installation, implementation of exemplary three PDEs, and execution of the simulations on multi-core Linux cluster nodes. We consider three case studies, including heat transfer, linear elasticity, as well as non-linear flow in heterogeneous media. The presented package generates output suitable for interfacing with Gnuplot and ParaView visualization software. The exemplary simulations show near perfect scalability on Gilbert shared-memory node with four Intel® Xeon® CPU E7-4860 processors, each possessing 10 physical cores (for a total of 40 cores).
Multi-threaded ATLAS simulation on Intel Knights Landing processors

NASA Astrophysics Data System (ADS)

Farrell, Steven; Calafiura, Paolo; Leggett, Charles; Tsulaia, Vakhtang; Dotti, Andrea; ATLAS Collaboration

2017-10-01

The Knights Landing (KNL) release of the Intel Many Integrated Core (MIC) Xeon Phi line of processors is a potential game changer for HEP computing. With 72 cores and deep vector registers, the KNL cards promise significant performance benefits for highly-parallel, compute-heavy applications. Cori, the newest supercomputer at the National Energy Research Scientific Computing Center (NERSC), was delivered to its users in two phases with the first phase online at the end of 2015 and the second phase now online at the end of 2016. Cori Phase 2 is based on the KNL architecture and contains over 9000 compute nodes with 96GB DDR4 memory. ATLAS simulation with the multithreaded Athena Framework (AthenaMT) is a good potential use-case for the KNL architecture and supercomputers like Cori. ATLAS simulation jobs have a high ratio of CPU computation to disk I/O and have been shown to scale well in multi-threading and across many nodes. In this paper we will give an overview of the ATLAS simulation application with details on its multi-threaded design. Then, we will present a performance analysis of the application on KNL devices and compare it to a traditional x86 platform to demonstrate the capabilities of the architecture and evaluate the benefits of utilizing KNL platforms like Cori for ATLAS production.
The parallel algorithm for the 2D discrete wavelet transform

NASA Astrophysics Data System (ADS)

Barina, David; Najman, Pavel; Kleparnik, Petr; Kula, Michal; Zemcik, Pavel

2018-04-01

The discrete wavelet transform can be found at the heart of many image-processing algorithms. Until now, the transform on general-purpose processors (CPUs) was mostly computed using a separable lifting scheme. As the lifting scheme consists of a small number of operations, it is preferred for processing using single-core CPUs. However, considering a parallel processing using multi-core processors, this scheme is inappropriate due to a large number of steps. On such architectures, the number of steps corresponds to the number of points that represent the exchange of data. Consequently, these points often form a performance bottleneck. Our approach appropriately rearranges calculations inside the transform, and thereby reduces the number of steps. In other words, we propose a new scheme that is friendly to parallel environments. When evaluating on multi-core CPUs, we consistently overcome the original lifting scheme. The evaluation was performed on 61-core Intel Xeon Phi and 8-core Intel Xeon processors.
Reducing adaptive optics latency using Xeon Phi many-core processors

NASA Astrophysics Data System (ADS)

Barr, David; Basden, Alastair; Dipper, Nigel; Schwartz, Noah

2015-11-01

The next generation of Extremely Large Telescopes (ELTs) for astronomy will rely heavily on the performance of their adaptive optics (AO) systems. Real-time control is at the heart of the critical technologies that will enable telescopes to deliver the best possible science and will require a very significant extrapolation from current AO hardware existing for 4-10 m telescopes. Investigating novel real-time computing architectures and testing their eligibility against anticipated challenges is one of the main priorities of technology development for the ELTs. This paper investigates the suitability of the Intel Xeon Phi, which is a commercial off-the-shelf hardware accelerator. We focus on wavefront reconstruction performance, implementing a straightforward matrix-vector multiplication (MVM) algorithm. We present benchmarking results of the Xeon Phi on a real-time Linux platform, both as a standalone processor and integrated into an existing real-time controller (RTC). Performance of single and multiple Xeon Phis are investigated. We show that this technology has the potential of greatly reducing the mean latency and variations in execution time (jitter) of large AO systems. We present both a detailed performance analysis of the Xeon Phi for a typical E-ELT first-light instrument along with a more general approach that enables us to extend to any AO system size. We show that systematic and detailed performance analysis is an essential part of testing novel real-time control hardware to guarantee optimal science results.
GeantV: From CPU to accelerators

DOE PAGES

Amadio, G.; Ananya, A.; Apostolakis, J.; ...

2016-01-01

The GeantV project aims to research and develop the next-generation simulation software describing the passage of particles through matter. While the modern CPU architectures are being targeted first, resources such as GPGPU, Intel© Xeon Phi, Atom or ARM cannot be ignored anymore by HEP CPU-bound applications. The proof of concept GeantV prototype has been mainly engineered for CPU's having vector units but we have foreseen from early stages a bridge to arbitrary accelerators. A software layer consisting of architecture/technology specific backends supports currently this concept. This approach allows to abstract out the basic types such as scalar/vector but also tomore » formalize generic computation kernels using transparently library or device specific constructs based on Vc, CUDA, Cilk+ or Intel intrinsics. While the main goal of this approach is portable performance, as a bonus, it comes with the insulation of the core application and algorithms from the technology layer. This allows our application to be long term maintainable and versatile to changes at the backend side. The paper presents the first results of basket-based GeantV geometry navigation on the Intel© Xeon Phi KNC architecture. We present the scalability and vectorization study, conducted using Intel performance tools, as well as our preliminary conclusions on the use of accelerators for GeantV transport. Lastly, we also describe the current work and preliminary results for using the GeantV transport kernel on GPUs.« less
Optimization of Selected Remote Sensing Algorithms for Embedded NVIDIA Kepler GPU Architecture

NASA Technical Reports Server (NTRS)

Riha, Lubomir; Le Moigne, Jacqueline; El-Ghazawi, Tarek

2015-01-01

This paper evaluates the potential of embedded Graphic Processing Units in the Nvidias Tegra K1 for onboard processing. The performance is compared to a general purpose multi-core CPU and full fledge GPU accelerator. This study uses two algorithms: Wavelet Spectral Dimension Reduction of Hyperspectral Imagery and Automated Cloud-Cover Assessment (ACCA) Algorithm. Tegra K1 achieved 51 for ACCA algorithm and 20 for the dimension reduction algorithm, as compared to the performance of the high-end 8-core server Intel Xeon CPU with 13.5 times higher power consumption.
GPU-Accelerated Voxelwise Hepatic Perfusion Quantification

PubMed Central

Wang, H; Cao, Y

2012-01-01

Voxelwise quantification of hepatic perfusion parameters from dynamic contrast enhanced (DCE) imaging greatly contributes to assessment of liver function in response to radiation therapy. However, the efficiency of the estimation of hepatic perfusion parameters voxel-by-voxel in the whole liver using a dual-input single-compartment model requires substantial improvement for routine clinical applications. In this paper, we utilize the parallel computation power of a graphics processing unit (GPU) to accelerate the computation, while maintaining the same accuracy as the conventional method. Using CUDA-GPU, the hepatic perfusion computations over multiple voxels are run across the GPU blocks concurrently but independently. At each voxel, non-linear least squares fitting the time series of the liver DCE data to the compartmental model is distributed to multiple threads in a block, and the computations of different time points are performed simultaneously and synchronically. An efficient fast Fourier transform in a block is also developed for the convolution computation in the model. The GPU computations of the voxel-by-voxel hepatic perfusion images are compared with ones by the CPU using the simulated DCE data and the experimental DCE MR images from patients. The computation speed is improved by 30 times using a NVIDIA Tesla C2050 GPU compared to a 2.67 GHz Intel Xeon CPU processor. To obtain liver perfusion maps with 626400 voxels in a patient’s liver, it takes 0.9 min with the GPU-accelerated voxelwise computation, compared to 110 min with the CPU, while both methods result in perfusion parameters differences less than 10−6. The method will be useful for generating liver perfusion images in clinical settings. PMID:22892645
Fast generation of computer-generated hologram by graphics processing unit

NASA Astrophysics Data System (ADS)

Matsuda, Sho; Fujii, Tomohiko; Yamaguchi, Takeshi; Yoshikawa, Hiroshi

2009-02-01

A cylindrical hologram is well known to be viewable in 360 deg. This hologram depends high pixel resolution.Therefore, Computer-Generated Cylindrical Hologram (CGCH) requires huge calculation amount.In our previous research, we used look-up table method for fast calculation with Intel Pentium4 2.8 GHz.It took 480 hours to calculate high resolution CGCH (504,000 x 63,000 pixels and the average number of object points are 27,000).To improve quality of CGCH reconstructed image, fringe pattern requires higher spatial frequency and resolution.Therefore, to increase the calculation speed, we have to change the calculation method. In this paper, to reduce the calculation time of CGCH (912,000 x 108,000 pixels), we employ Graphics Processing Unit (GPU).It took 4,406 hours to calculate high resolution CGCH on Xeon 3.4 GHz.Since GPU has many streaming processors and a parallel processing structure, GPU works as the high performance parallel processor.In addition, GPU gives max performance to 2 dimensional data and streaming data.Recently, GPU can be utilized for the general purpose (GPGPU).For example, NVIDIA's GeForce7 series became a programmable processor with Cg programming language.Next GeForce8 series have CUDA as software development kit made by NVIDIA.Theoretically, calculation ability of GPU is announced as 500 GFLOPS. From the experimental result, we have achieved that 47 times faster calculation compared with our previous work which used CPU.Therefore, CGCH can be generated in 95 hours.So, total time is 110 hours to calculate and print the CGCH.
Application of the multireference equation of motion coupled cluster method, including spin-orbit coupling, to the atomic spectra of Cr, Mn, Fe and Co

NASA Astrophysics Data System (ADS)

Liu, Zhebing; Huntington, Lee M. J.; Nooijen, Marcel

2015-10-01

The recently introduced multireference equation of motion (MR-EOM) approach is combined with a simple treatment of spin-orbit coupling, as implemented in the ORCA program. The resulting multireference equation of motion spin-orbit coupling (MR-EOM-SOC) approach is applied to the first-row transition metal atoms Cr, Mn, Fe and Co, for which experimental data are readily available. Using the MR-EOM-SOC approach, the splittings in each L-S multiplet can be accurately assessed (root mean square (RMS) errors of about 70 cm-1). The RMS errors for J-specific excitation energies range from 414 to 783 cm-1 and are comparable to previously reported J-averaged MR-EOM results using the ACESII program. The MR-EOM approach is highly efficient. A typical MR-EOM calculation of a full spin-orbit spectrum takes about 2 CPU hours on a single processor of a 12-core node, consisting of Intel XEON 2.93 GHz CPUs with 12.3 MB of shared cache memory.
Multi-GPU Accelerated Admittance Method for High-Resolution Human Exposure Evaluation.

PubMed

Xiong, Zubiao; Feng, Shi; Kautz, Richard; Chandra, Sandeep; Altunyurt, Nevin; Chen, Ji

2015-12-01

A multi-graphics processing unit (GPU) accelerated admittance method solver is presented for solving the induced electric field in high-resolution anatomical models of human body when exposed to external low-frequency magnetic fields. In the solver, the anatomical model is discretized as a three-dimensional network of admittances. The conjugate orthogonal conjugate gradient (COCG) iterative algorithm is employed to take advantage of the symmetric property of the complex-valued linear system of equations. Compared against the widely used biconjugate gradient stabilized method, the COCG algorithm can reduce the solving time by 3.5 times and reduce the storage requirement by about 40%. The iterative algorithm is then accelerated further by using multiple NVIDIA GPUs. The computations and data transfers between GPUs are overlapped in time by using asynchronous concurrent execution design. The communication overhead is well hidden so that the acceleration is nearly linear with the number of GPU cards. Numerical examples show that our GPU implementation running on four NVIDIA Tesla K20c cards can reach 90 times faster than the CPU implementation running on eight CPU cores (two Intel Xeon E5-2603 processors). The implemented solver is able to solve large dimensional problems efficiently. A whole adult body discretized in 1-mm resolution can be solved in just several minutes. The high efficiency achieved makes it practical to investigate human exposure involving a large number of cases with a high resolution that meets the requirements of international dosimetry guidelines.
Optimizing Performance of Combustion Chemistry Solvers on Intel's Many Integrated Core (MIC) Architectures

DOE Office of Scientific and Technical Information (OSTI.GOV)

Sitaraman, Hariswaran; Grout, Ray W

This work investigates novel algorithm designs and optimization techniques for restructuring chemistry integrators in zero and multidimensional combustion solvers, which can then be effectively used on the emerging generation of Intel's Many Integrated Core/Xeon Phi processors. These processors offer increased computing performance via large number of lightweight cores at relatively lower clock speeds compared to traditional processors (e.g. Intel Sandybridge/Ivybridge) used in current supercomputers. This style of processor can be productively used for chemistry integrators that form a costly part of computational combustion codes, in spite of their relatively lower clock speeds. Performance commensurate with traditional processors is achieved heremore » through the combination of careful memory layout, exposing multiple levels of fine grain parallelism and through extensive use of vendor supported libraries (Cilk Plus and Math Kernel Libraries). Important optimization techniques for efficient memory usage and vectorization have been identified and quantified. These optimizations resulted in a factor of ~ 3 speed-up using Intel 2013 compiler and ~ 1.5 using Intel 2017 compiler for large chemical mechanisms compared to the unoptimized version on the Intel Xeon Phi. The strategies, especially with respect to memory usage and vectorization, should also be beneficial for general purpose computational fluid dynamics codes.« less
Evaluation of the Intel Xeon Phi 7120 and NVIDIA K80 as accelerators for two-dimensional panel codes

PubMed Central

2017-01-01

To optimize the geometry of airfoils for a specific application is an important engineering problem. In this context genetic algorithms have enjoyed some success as they are able to explore the search space without getting stuck in local optima. However, these algorithms require the computation of aerodynamic properties for a significant number of airfoil geometries. Consequently, for low-speed aerodynamics, panel methods are most often used as the inner solver. In this paper we evaluate the performance of such an optimization algorithm on modern accelerators (more specifically, the Intel Xeon Phi 7120 and the NVIDIA K80). For that purpose, we have implemented an optimized version of the algorithm on the CPU and Xeon Phi (based on OpenMP, vectorization, and the Intel MKL library) and on the GPU (based on CUDA and the MAGMA library). We present timing results for all codes and discuss the similarities and differences between the three implementations. Overall, we observe a speedup of approximately 2.5 for adding an Intel Xeon Phi 7120 to a dual socket workstation and a speedup between 3.4 and 3.8 for adding a NVIDIA K80 to a dual socket workstation. PMID:28582389
Evaluation of the Intel Xeon Phi 7120 and NVIDIA K80 as accelerators for two-dimensional panel codes.

PubMed

Einkemmer, Lukas

2017-01-01

To optimize the geometry of airfoils for a specific application is an important engineering problem. In this context genetic algorithms have enjoyed some success as they are able to explore the search space without getting stuck in local optima. However, these algorithms require the computation of aerodynamic properties for a significant number of airfoil geometries. Consequently, for low-speed aerodynamics, panel methods are most often used as the inner solver. In this paper we evaluate the performance of such an optimization algorithm on modern accelerators (more specifically, the Intel Xeon Phi 7120 and the NVIDIA K80). For that purpose, we have implemented an optimized version of the algorithm on the CPU and Xeon Phi (based on OpenMP, vectorization, and the Intel MKL library) and on the GPU (based on CUDA and the MAGMA library). We present timing results for all codes and discuss the similarities and differences between the three implementations. Overall, we observe a speedup of approximately 2.5 for adding an Intel Xeon Phi 7120 to a dual socket workstation and a speedup between 3.4 and 3.8 for adding a NVIDIA K80 to a dual socket workstation.
Extension of the AMBER molecular dynamics software to Intel's Many Integrated Core (MIC) architecture

NASA Astrophysics Data System (ADS)

Needham, Perri J.; Bhuiyan, Ashraf; Walker, Ross C.

2016-04-01

We present an implementation of explicit solvent particle mesh Ewald (PME) classical molecular dynamics (MD) within the PMEMD molecular dynamics engine, that forms part of the AMBER v14 MD software package, that makes use of Intel Xeon Phi coprocessors by offloading portions of the PME direct summation and neighbor list build to the coprocessor. We refer to this implementation as pmemd MIC offload and in this paper present the technical details of the algorithm, including basic models for MPI and OpenMP configuration, and analyze the resultant performance. The algorithm provides the best performance improvement for large systems (>400,000 atoms), achieving a ∼35% performance improvement for satellite tobacco mosaic virus (1,067,095 atoms) when 2 Intel E5-2697 v2 processors (2 ×12 cores, 30M cache, 2.7 GHz) are coupled to an Intel Xeon Phi coprocessor (Model 7120P-1.238/1.333 GHz, 61 cores). The implementation utilizes a two-fold decomposition strategy: spatial decomposition using an MPI library and thread-based decomposition using OpenMP. We also present compiler optimization settings that improve the performance on Intel Xeon processors, while retaining simulation accuracy.

Heterogeneous High Throughput Scientific Computing with APM X-Gene and Intel Xeon Phi

NASA Astrophysics Data System (ADS)

Abdurachmanov, David; Bockelman, Brian; Elmer, Peter; Eulisse, Giulio; Knight, Robert; Muzaffar, Shahzad

2015-05-01

Electrical power requirements will be a constraint on the future growth of Distributed High Throughput Computing (DHTC) as used by High Energy Physics. Performance-per-watt is a critical metric for the evaluation of computer architectures for cost- efficient computing. Additionally, future performance growth will come from heterogeneous, many-core, and high computing density platforms with specialized processors. In this paper, we examine the Intel Xeon Phi Many Integrated Cores (MIC) co-processor and Applied Micro X-Gene ARMv8 64-bit low-power server system-on-a-chip (SoC) solutions for scientific computing applications. We report our experience on software porting, performance and energy efficiency and evaluate the potential for use of such technologies in the context of distributed computing systems such as the Worldwide LHC Computing Grid (WLCG).
Performance Evaluation of an Intel Haswell- and Ivy Bridge-Based Supercomputer Using Scientific and Engineering Applications

NASA Technical Reports Server (NTRS)

Saini, Subhash; Hood, Robert T.; Chang, Johnny; Baron, John

2016-01-01

We present a performance evaluation conducted on a production supercomputer of the Intel Xeon Processor E5- 2680v3, a twelve-core implementation of the fourth-generation Haswell architecture, and compare it with Intel Xeon Processor E5-2680v2, an Ivy Bridge implementation of the third-generation Sandy Bridge architecture. Several new architectural features have been incorporated in Haswell including improvements in all levels of the memory hierarchy as well as improvements to vector instructions and power management. We critically evaluate these new features of Haswell and compare with Ivy Bridge using several low-level benchmarks including subset of HPCC, HPCG and four full-scale scientific and engineering applications. We also present a model to predict the performance of HPCG and Cart3D within 5%, and Overflow within 10% accuracy.
Modern multicore and manycore architectures: Modelling, optimisation and benchmarking a multiblock CFD code

NASA Astrophysics Data System (ADS)

Hadade, Ioan; di Mare, Luca

2016-08-01

Modern multicore and manycore processors exhibit multiple levels of parallelism through a wide range of architectural features such as SIMD for data parallel execution or threads for core parallelism. The exploitation of multi-level parallelism is therefore crucial for achieving superior performance on current and future processors. This paper presents the performance tuning of a multiblock CFD solver on Intel SandyBridge and Haswell multicore CPUs and the Intel Xeon Phi Knights Corner coprocessor. Code optimisations have been applied on two computational kernels exhibiting different computational patterns: the update of flow variables and the evaluation of the Roe numerical fluxes. We discuss at great length the code transformations required for achieving efficient SIMD computations for both kernels across the selected devices including SIMD shuffles and transpositions for flux stencil computations and global memory transformations. Core parallelism is expressed through threading based on a number of domain decomposition techniques together with optimisations pertaining to alleviating NUMA effects found in multi-socket compute nodes. Results are correlated with the Roofline performance model in order to assert their efficiency for each distinct architecture. We report significant speedups for single thread execution across both kernels: 2-5X on the multicore CPUs and 14-23X on the Xeon Phi coprocessor. Computations at full node and chip concurrency deliver a factor of three speedup on the multicore processors and up to 24X on the Xeon Phi manycore coprocessor.
A polyphase filter for many-core architectures

NASA Astrophysics Data System (ADS)

Adámek, K.; Novotný, J.; Armour, W.

2016-07-01

In this article we discuss our implementation of a polyphase filter for real-time data processing in radio astronomy. The polyphase filter is a standard tool in digital signal processing and as such a well established algorithm. We describe in detail our implementation of the polyphase filter algorithm and its behaviour on three generations of NVIDIA GPU cards (Fermi, Kepler, Maxwell), on the Intel Xeon CPU and Xeon Phi (Knights Corner) platforms. All of our implementations aim to exploit the potential for data reuse that the algorithm offers. Our GPU implementations explore two different methods for achieving this, the first makes use of L1/Texture cache, the second uses shared memory. We discuss the usability of each of our implementations along with their behaviours. We measure performance in execution time, which is a critical factor for real-time systems, we also present results in terms of bandwidth (GB/s), compute (GFLOP/s/s) and type conversions (GTc/s). We include a presentation of our results in terms of the sample rate which can be processed in real-time by a chosen platform, which more intuitively describes the expected performance in a signal processing setting. Our findings show that, for the GPUs considered, the performance of our polyphase filter when using lower precision input data is limited by type conversions rather than device bandwidth. We compare these results to an implementation on the Xeon Phi. We show that our Xeon Phi implementation has a performance that is 1.5 × to 1.92 × greater than our CPU implementation, however is not insufficient to compete with the performance of GPUs. We conclude with a comparison of our best performing code to two other implementations of the polyphase filter, showing that our implementation is faster in nearly all cases. This work forms part of the Astro-Accelerate project, a many-core accelerated real-time data processing library for digital signal processing of time-domain radio astronomy data.
General-purpose interface bus for multiuser, multitasking computer system

NASA Technical Reports Server (NTRS)

Generazio, Edward R.; Roth, Don J.; Stang, David B.

1990-01-01

The architecture of a multiuser, multitasking, virtual-memory computer system intended for the use by a medium-size research group is described. There are three central processing units (CPU) in the configuration, each with 16 MB memory, and two 474 MB hard disks attached. CPU 1 is designed for data analysis and contains an array processor for fast-Fourier transformations. In addition, CPU 1 shares display images viewed with the image processor. CPU 2 is designed for image analysis and display. CPU 3 is designed for data acquisition and contains 8 GPIB channels and an analog-to-digital conversion input/output interface with 16 channels. Up to 9 users can access the third CPU simultaneously for data acquisition. Focus is placed on the optimization of hardware interfaces and software, facilitating instrument control, data acquisition, and processing.
Simulating Hydrologic Flow and Reactive Transport with PFLOTRAN and PETSc on Emerging Fine-Grained Parallel Computer Architectures

NASA Astrophysics Data System (ADS)

Mills, R. T.; Rupp, K.; Smith, B. F.; Brown, J.; Knepley, M.; Zhang, H.; Adams, M.; Hammond, G. E.

2017-12-01

As the high-performance computing community pushes towards the exascale horizon, power and heat considerations have driven the increasing importance and prevalence of fine-grained parallelism in new computer architectures. High-performance computing centers have become increasingly reliant on GPGPU accelerators and "manycore" processors such as the Intel Xeon Phi line, and 512-bit SIMD registers have even been introduced in the latest generation of Intel's mainstream Xeon server processors. The high degree of fine-grained parallelism and more complicated memory hierarchy considerations of such "manycore" processors present several challenges to existing scientific software. Here, we consider how the massively parallel, open-source hydrologic flow and reactive transport code PFLOTRAN - and the underlying Portable, Extensible Toolkit for Scientific Computation (PETSc) library on which it is built - can best take advantage of such architectures. We will discuss some key features of these novel architectures and our code optimizations and algorithmic developments targeted at them, and present experiences drawn from working with a wide range of PFLOTRAN benchmark problems on these architectures.
Efficient Execution of Microscopy Image Analysis on CPU, GPU, and MIC Equipped Cluster Systems.

PubMed

Andrade, G; Ferreira, R; Teodoro, George; Rocha, Leonardo; Saltz, Joel H; Kurc, Tahsin

2014-10-01

High performance computing is experiencing a major paradigm shift with the introduction of accelerators, such as graphics processing units (GPUs) and Intel Xeon Phi (MIC). These processors have made available a tremendous computing power at low cost, and are transforming machines into hybrid systems equipped with CPUs and accelerators. Although these systems can deliver a very high peak performance, making full use of its resources in real-world applications is a complex problem. Most current applications deployed to these machines are still being executed in a single processor, leaving other devices underutilized. In this paper we explore a scenario in which applications are composed of hierarchical data flow tasks which are allocated to nodes of a distributed memory machine in coarse-grain, but each of them may be composed of several finer-grain tasks which can be allocated to different devices within the node. We propose and implement novel performance aware scheduling techniques that can be used to allocate tasks to devices. We evaluate our techniques using a pathology image analysis application used to investigate brain cancer morphology, and our experimental evaluation shows that the proposed scheduling strategies significantly outperforms other efficient scheduling techniques, such as Heterogeneous Earliest Finish Time - HEFT, in cooperative executions using CPUs, GPUs, and MICs. We also experimentally show that our strategies are less sensitive to inaccuracy in the scheduling input data and that the performance gains are maintained as the application scales.
Efficient Execution of Microscopy Image Analysis on CPU, GPU, and MIC Equipped Cluster Systems

PubMed Central

Andrade, G.; Ferreira, R.; Teodoro, George; Rocha, Leonardo; Saltz, Joel H.; Kurc, Tahsin

2015-01-01

High performance computing is experiencing a major paradigm shift with the introduction of accelerators, such as graphics processing units (GPUs) and Intel Xeon Phi (MIC). These processors have made available a tremendous computing power at low cost, and are transforming machines into hybrid systems equipped with CPUs and accelerators. Although these systems can deliver a very high peak performance, making full use of its resources in real-world applications is a complex problem. Most current applications deployed to these machines are still being executed in a single processor, leaving other devices underutilized. In this paper we explore a scenario in which applications are composed of hierarchical data flow tasks which are allocated to nodes of a distributed memory machine in coarse-grain, but each of them may be composed of several finer-grain tasks which can be allocated to different devices within the node. We propose and implement novel performance aware scheduling techniques that can be used to allocate tasks to devices. We evaluate our techniques using a pathology image analysis application used to investigate brain cancer morphology, and our experimental evaluation shows that the proposed scheduling strategies significantly outperforms other efficient scheduling techniques, such as Heterogeneous Earliest Finish Time - HEFT, in cooperative executions using CPUs, GPUs, and MICs. We also experimentally show that our strategies are less sensitive to inaccuracy in the scheduling input data and that the performance gains are maintained as the application scales. PMID:26640423
Heterogeneous high throughput scientific computing with APM X-Gene and Intel Xeon Phi

DOE PAGES

Abdurachmanov, David; Bockelman, Brian; Elmer, Peter; ...

2015-05-22

Electrical power requirements will be a constraint on the future growth of Distributed High Throughput Computing (DHTC) as used by High Energy Physics. Performance-per-watt is a critical metric for the evaluation of computer architectures for cost- efficient computing. Additionally, future performance growth will come from heterogeneous, many-core, and high computing density platforms with specialized processors. In this paper, we examine the Intel Xeon Phi Many Integrated Cores (MIC) co-processor and Applied Micro X-Gene ARMv8 64-bit low-power server system-on-a-chip (SoC) solutions for scientific computing applications. As a result, we report our experience on software porting, performance and energy efficiency and evaluatemore » the potential for use of such technologies in the context of distributed computing systems such as the Worldwide LHC Computing Grid (WLCG).« less
A CPU/MIC Collaborated Parallel Framework for GROMACS on Tianhe-2 Supercomputer.

PubMed

Peng, Shaoliang; Yang, Shunyun; Su, Wenhe; Zhang, Xiaoyu; Zhang, Tenglilang; Liu, Weiguo; Zhao, Xingming

2017-06-16

Molecular Dynamics (MD) is the simulation of the dynamic behavior of atoms and molecules. As the most popular software for molecular dynamics, GROMACS cannot work on large-scale data because of limit computing resources. In this paper, we propose a CPU and Intel® Xeon Phi Many Integrated Core (MIC) collaborated parallel framework to accelerate GROMACS using the offload mode on a MIC coprocessor, with which the performance of GROMACS is improved significantly, especially with the utility of Tianhe-2 supercomputer. Furthermore, we optimize GROMACS so that it can run on both the CPU and MIC at the same time. In addition, we accelerate multi-node GROMACS so that it can be used in practice. Benchmarking on real data, our accelerated GROMACS performs very well and reduces computation time significantly. Source code: https://github.com/tianhe2/gromacs-mic.
Accelerating the Pace of Protein Functional Annotation With Intel Xeon Phi Coprocessors.

PubMed

Feinstein, Wei P; Moreno, Juana; Jarrell, Mark; Brylinski, Michal

2015-06-01

Intel Xeon Phi is a new addition to the family of powerful parallel accelerators. The range of its potential applications in computationally driven research is broad; however, at present, the repository of scientific codes is still relatively limited. In this study, we describe the development and benchmarking of a parallel version of eFindSite, a structural bioinformatics algorithm for the prediction of ligand-binding sites in proteins. Implemented for the Intel Xeon Phi platform, the parallelization of the structure alignment portion of eFindSite using pragma-based OpenMP brings about the desired performance improvements, which scale well with the number of computing cores. Compared to a serial version, the parallel code runs 11.8 and 10.1 times faster on the CPU and the coprocessor, respectively; when both resources are utilized simultaneously, the speedup is 17.6. For example, ligand-binding predictions for 501 benchmarking proteins are completed in 2.1 hours on a single Stampede node equipped with the Intel Xeon Phi card compared to 3.1 hours without the accelerator and 36.8 hours required by a serial version. In addition to the satisfactory parallel performance, porting existing scientific codes to the Intel Xeon Phi architecture is relatively straightforward with a short development time due to the support of common parallel programming models by the coprocessor. The parallel version of eFindSite is freely available to the academic community at www.brylinski.org/efindsite.
A configurable distributed high-performance computing framework for satellite's TDI-CCD imaging simulation

NASA Astrophysics Data System (ADS)

Xue, Bo; Mao, Bingjing; Chen, Xiaomei; Ni, Guoqiang

2010-11-01

This paper renders a configurable distributed high performance computing(HPC) framework for TDI-CCD imaging simulation. It uses strategy pattern to adapt multi-algorithms. Thus, this framework help to decrease the simulation time with low expense. Imaging simulation for TDI-CCD mounted on satellite contains four processes: 1) atmosphere leads degradation, 2) optical system leads degradation, 3) electronic system of TDI-CCD leads degradation and re-sampling process, 4) data integration. Process 1) to 3) utilize diversity data-intensity algorithms such as FFT, convolution and LaGrange Interpol etc., which requires powerful CPU. Even uses Intel Xeon X5550 processor, regular series process method takes more than 30 hours for a simulation whose result image size is 1500 * 1462. With literature study, there isn't any mature distributing HPC framework in this field. Here we developed a distribute computing framework for TDI-CCD imaging simulation, which is based on WCF[1], uses Client/Server (C/S) layer and invokes the free CPU resources in LAN. The server pushes the process 1) to 3) tasks to those free computing capacity. Ultimately we rendered the HPC in low cost. In the computing experiment with 4 symmetric nodes and 1 server , this framework reduced about 74% simulation time. Adding more asymmetric nodes to the computing network, the time decreased namely. In conclusion, this framework could provide unlimited computation capacity in condition that the network and task management server are affordable. And this is the brand new HPC solution for TDI-CCD imaging simulation and similar applications.
GW Calculations of Materials on the Intel Xeon-Phi Architecture

NASA Astrophysics Data System (ADS)

Deslippe, Jack; da Jornada, Felipe H.; Vigil-Fowler, Derek; Biller, Ariel; Chelikowsky, James R.; Louie, Steven G.

Intel Xeon-Phi processors are expected to power a large number of High-Performance Computing (HPC) systems around the United States and the world in the near future. We evaluate the ability of GW and pre-requisite Density Functional Theory (DFT) calculations for materials on utilizing the Xeon-Phi architecture. We describe the optimization process and performance improvements achieved. We find that the GW method, like other higher level Many-Body methods beyond standard local/semilocal approximations to Kohn-Sham DFT, is particularly well suited for many-core architectures due to the ability to exploit a large amount of parallelism over plane-waves, band-pairs and frequencies. Support provided by the SCIDAC program, Department of Energy, Office of Science, Advanced Scientic Computing Research and Basic Energy Sciences. Grant Numbers DE-SC0008877 (Austin) and DE-AC02-05CH11231 (LBNL).
Deploying electromagnetic particle-in-cell (EM-PIC) codes on Xeon Phi accelerators boards

NASA Astrophysics Data System (ADS)

Fonseca, Ricardo

2014-10-01

The complexity of the phenomena involved in several relevant plasma physics scenarios, where highly nonlinear and kinetic processes dominate, makes purely theoretical descriptions impossible. Further understanding of these scenarios requires detailed numerical modeling, but fully relativistic particle-in-cell codes such as OSIRIS are computationally intensive. The quest towards Exaflop computer systems has lead to the development of HPC systems based on add-on accelerator cards, such as GPGPUs and more recently the Xeon Phi accelerators that power the current number 1 system in the world. These cards, also referred to as Intel Many Integrated Core Architecture (MIC) offer peak theoretical performances of >1 TFlop/s for general purpose calculations in a single board, and are receiving significant attention as an attractive alternative to CPUs for plasma modeling. In this work we report on our efforts towards the deployment of an EM-PIC code on a Xeon Phi architecture system. We will focus on the parallelization and vectorization strategies followed, and present a detailed performance evaluation of code performance in comparison with the CPU code.
Improving the performance of heterogeneous multi-core processors by modifying the cache coherence protocol

NASA Astrophysics Data System (ADS)

Fang, Juan; Hao, Xiaoting; Fan, Qingwen; Chang, Zeqing; Song, Shuying

2017-05-01

In the Heterogeneous multi-core architecture, CPU and GPU processor are integrated on the same chip, which poses a new challenge to the last-level cache management. In this architecture, the CPU application and the GPU application execute concurrently, accessing the last-level cache. CPU and GPU have different memory access characteristics, so that they have differences in the sensitivity of last-level cache (LLC) capacity. For many CPU applications, a reduced share of the LLC could lead to significant performance degradation. On the contrary, GPU applications can tolerate increase in memory access latency when there is sufficient thread-level parallelism. Taking into account the GPU program memory latency tolerance characteristics, this paper presents a method that let GPU applications can access to memory directly, leaving lots of LLC space for CPU applications, in improving the performance of CPU applications and does not affect the performance of GPU applications. When the CPU application is cache sensitive, and the GPU application is insensitive to the cache, the overall performance of the system is improved significantly.
An efficient implementation of semi-numerical computation of the Hartree-Fock exchange on the Intel Phi processor

NASA Astrophysics Data System (ADS)

Liu, Fenglai; Kong, Jing

2018-07-01

Unique technical challenges and their solutions for implementing semi-numerical Hartree-Fock exchange on the Phil Processor are discussed, especially concerning the single- instruction-multiple-data type of processing and small cache size. Benchmark calculations on a series of buckyball molecules with various Gaussian basis sets on a Phi processor and a six-core CPU show that the Phi processor provides as much as 12 times of speedup with large basis sets compared with the conventional four-center electron repulsion integration approach performed on the CPU. The accuracy of the semi-numerical scheme is also evaluated and found to be comparable to that of the resolution-of-identity approach.
Parallel Mutual Information Based Construction of Genome-Scale Networks on the Intel® Xeon Phi™ Coprocessor.

PubMed

Misra, Sanchit; Pamnany, Kiran; Aluru, Srinivas

2015-01-01

Construction of whole-genome networks from large-scale gene expression data is an important problem in systems biology. While several techniques have been developed, most cannot handle network reconstruction at the whole-genome scale, and the few that can, require large clusters. In this paper, we present a solution on the Intel Xeon Phi coprocessor, taking advantage of its multi-level parallelism including many x86-based cores, multiple threads per core, and vector processing units. We also present a solution on the Intel® Xeon® processor. Our solution is based on TINGe, a fast parallel network reconstruction technique that uses mutual information and permutation testing for assessing statistical significance. We demonstrate the first ever inference of a plant whole genome regulatory network on a single chip by constructing a 15,575 gene network of the plant Arabidopsis thaliana from 3,137 microarray experiments in only 22 minutes. In addition, our optimization for parallelizing mutual information computation on the Intel Xeon Phi coprocessor holds out lessons that are applicable to other domains.
Massive parallelization of a 3D finite difference electromagnetic forward solution using domain decomposition methods on multiple CUDA enabled GPUs

NASA Astrophysics Data System (ADS)

Schultz, A.

2010-12-01

3D forward solvers lie at the core of inverse formulations used to image the variation of electrical conductivity within the Earth's interior. This property is associated with variations in temperature, composition, phase, presence of volatiles, and in specific settings, the presence of groundwater, geothermal resources, oil/gas or minerals. The high cost of 3D solutions has been a stumbling block to wider adoption of 3D methods. Parallel algorithms for modeling frequency domain 3D EM problems have not achieved wide scale adoption, with emphasis on fairly coarse grained parallelism using MPI and similar approaches. The communications bandwidth as well as the latency required to send and receive network communication packets is a limiting factor in implementing fine grained parallel strategies, inhibiting wide adoption of these algorithms. Leading Graphics Processor Unit (GPU) companies now produce GPUs with hundreds of GPU processor cores per die. The footprint, in silicon, of the GPU's restricted instruction set is much smaller than the general purpose instruction set required of a CPU. Consequently, the density of processor cores on a GPU can be much greater than on a CPU. GPUs also have local memory, registers and high speed communication with host CPUs, usually through PCIe type interconnects. The extremely low cost and high computational power of GPUs provides the EM geophysics community with an opportunity to achieve fine grained (i.e. massive) parallelization of codes on low cost hardware. The current generation of GPUs (e.g. NVidia Fermi) provides 3 billion transistors per chip die, with nearly 500 processor cores and up to 6 GB of fast (DDR5) GPU memory. This latest generation of GPU supports fast hardware double precision (64 bit) floating point operations of the type required for frequency domain EM forward solutions. Each Fermi GPU board can sustain nearly 1 TFLOP in double precision, and multiple boards can be installed in the host computer system. We describe our ongoing efforts to achieve massive parallelization on a novel hybrid GPU testbed machine currently configured with 12 Intel Westmere Xeon CPU cores (or 24 parallel computational threads) with 96 GB DDR3 system memory, 4 GPU subsystems which in aggregate contain 960 NVidia Tesla GPU cores with 16 GB dedicated DDR3 GPU memory, and a second interleved bank of 4 GPU subsystems containing in aggregate 1792 NVidia Fermi GPU cores with 12 GB dedicated DDR5 GPU memory. We are applying domain decomposition methods to a modified version of Weiss' (2001) 3D frequency domain full physics EM finite difference code, an open source GPL licensed f90 code available for download from www.OpenEM.org. This will be the core of a new hybrid 3D inversion that parallelizes frequencies across CPUs and individual forward solutions across GPUs. We describe progress made in modifying the code to use direct solvers in GPU cores dedicated to each small subdomain, iteratively improving the solution by matching adjacent subdomain boundary solutions, rather than iterative Krylov space sparse solvers as currently applied to the whole domain.
Enhancing Image Processing Performance for PCID in a Heterogeneous Network of Multi-core Processors

DTIC Science & Technology

2009-09-01

TFLOPS of Playstation 3 (PS3) nodes with IBM Cell Broadband Engine multi-cores and 15 dual-quad Xeon head nodes. The interconnect fabric includes... 4 3. INFORMATION MANAGEMENT FOR PARALLELIZATION AND...STREAMING............................................................. 7 4 . RESULTS
Optimizing legacy molecular dynamics software with directive-based offload

NASA Astrophysics Data System (ADS)

Michael Brown, W.; Carrillo, Jan-Michael Y.; Gavhane, Nitin; Thakkar, Foram M.; Plimpton, Steven J.

2015-10-01

Directive-based programming models are one solution for exploiting many-core coprocessors to increase simulation rates in molecular dynamics. They offer the potential to reduce code complexity with offload models that can selectively target computations to run on the CPU, the coprocessor, or both. In this paper, we describe modifications to the LAMMPS molecular dynamics code to enable concurrent calculations on a CPU and coprocessor. We demonstrate that standard molecular dynamics algorithms can run efficiently on both the CPU and an x86-based coprocessor using the same subroutines. As a consequence, we demonstrate that code optimizations for the coprocessor also result in speedups on the CPU; in extreme cases up to 4.7X. We provide results for LAMMPS benchmarks and for production molecular dynamics simulations using the Stampede hybrid supercomputer with both Intel® Xeon Phi™ coprocessors and NVIDIA GPUs. The optimizations presented have increased simulation rates by over 2X for organic molecules and over 7X for liquid crystals on Stampede. The optimizations are available as part of the "Intel package" supplied with LAMMPS.

Kalman filter tracking on parallel architectures

NASA Astrophysics Data System (ADS)

Cerati, G.; Elmer, P.; Krutelyov, S.; Lantz, S.; Lefebvre, M.; McDermott, K.; Riley, D.; Tadel, M.; Wittich, P.; Wurthwein, F.; Yagil, A.

2017-10-01

We report on the progress of our studies towards a Kalman filter track reconstruction algorithm with optimal performance on manycore architectures. The combinatorial structure of these algorithms is not immediately compatible with an efficient SIMD (or SIMT) implementation; the challenge for us is to recast the existing software so it can readily generate hundreds of shared-memory threads that exploit the underlying instruction set of modern processors. We show how the data and associated tasks can be organized in a way that is conducive to both multithreading and vectorization. We demonstrate very good performance on Intel Xeon and Xeon Phi architectures, as well as promising first results on Nvidia GPUs.
Bridging FPGA and GPU technologies for AO real-time control

NASA Astrophysics Data System (ADS)

Perret, Denis; Lainé, Maxime; Bernard, Julien; Gratadour, Damien; Sevin, Arnaud

2016-07-01

Our team has developed a common environment for high performance simulations and real-time control of AO systems based on the use of Graphics Processors Units in the context of the COMPASS project. Such a solution, based on the ability of the real time core in the simulation to provide adequate computing performance, limits the cost of developing AO RTC systems and makes them more scalable. A code developed and validated in the context of the simulation may be injected directly into the system and tested on sky. Furthermore, the use of relatively low cost components also offers significant advantages for the system hardware platform. However, the use of GPUs in an AO loop comes with drawbacks: the traditional way of offloading computation from CPU to GPUs - involving multiple copies and unacceptable overhead in kernel launching - is not well suited in a real time context. This last application requires the implementation of a solution enabling direct memory access (DMA) to the GPU memory from a third party device, bypassing the operating system. This allows this device to communicate directly with the real-time core of the simulation feeding it with the WFS camera pixel stream. We show that DMA between a custom FPGA-based frame-grabber and a computation unit (GPU, FPGA, or Coprocessor such as Xeon-phi) across PCIe allows us to get latencies compatible with what will be needed on ELTs. As a fine-grained synchronization mechanism is not yet made available by GPU vendors, we propose the use of memory polling to avoid interrupts handling and involvement of a CPU. Network and Vision protocols are handled by the FPGA-based Network Interface Card (NIC). We present the results we obtained on a complete AO loop using camera and deformable mirror simulators.
P-Hint-Hunt: a deep parallelized whole genome DNA methylation detection tool.

PubMed

Peng, Shaoliang; Yang, Shunyun; Gao, Ming; Liao, Xiangke; Liu, Jie; Yang, Canqun; Wu, Chengkun; Yu, Wenqiang

2017-03-14

The increasing studies have been conducted using whole genome DNA methylation detection as one of the most important part of epigenetics research to find the significant relationships among DNA methylation and several typical diseases, such as cancers and diabetes. In many of those studies, mapping the bisulfite treated sequence to the whole genome has been the main method to study DNA cytosine methylation. However, today's relative tools almost suffer from inaccuracies and time-consuming problems. In our study, we designed a new DNA methylation prediction tool ("Hint-Hunt") to solve the problem. By having an optimal complex alignment computation and Smith-Waterman matrix dynamic programming, Hint-Hunt could analyze and predict the DNA methylation status. But when Hint-Hunt tried to predict DNA methylation status with large-scale dataset, there are still slow speed and low temporal-spatial efficiency problems. In order to solve the problems of Smith-Waterman dynamic programming and low temporal-spatial efficiency, we further design a deep parallelized whole genome DNA methylation detection tool ("P-Hint-Hunt") on Tianhe-2 (TH-2) supercomputer. To the best of our knowledge, P-Hint-Hunt is the first parallel DNA methylation detection tool with a high speed-up to process large-scale dataset, and could run both on CPU and Intel Xeon Phi coprocessors. Moreover, we deploy and evaluate Hint-Hunt and P-Hint-Hunt on TH-2 supercomputer in different scales. The experimental results illuminate our tools eliminate the deviation caused by bisulfite treatment in mapping procedure and the multi-level parallel program yields a 48 times speed-up with 64 threads. P-Hint-Hunt gain a deep acceleration on CPU and Intel Xeon Phi heterogeneous platform, which gives full play of the advantages of multi-cores (CPU) and many-cores (Phi).
Using Intel Xeon Phi to accelerate the WRF TEMF planetary boundary layer scheme

NASA Astrophysics Data System (ADS)

Mielikainen, Jarno; Huang, Bormin; Huang, Allen

2014-05-01

The Weather Research and Forecasting (WRF) model is designed for numerical weather prediction and atmospheric research. The WRF software infrastructure consists of several components such as dynamic solvers and physics schemes. Numerical models are used to resolve the large-scale flow. However, subgrid-scale parameterizations are for an estimation of small-scale properties (e.g., boundary layer turbulence and convection, clouds, radiation). Those have a significant influence on the resolved scale due to the complex nonlinear nature of the atmosphere. For the cloudy planetary boundary layer (PBL), it is fundamental to parameterize vertical turbulent fluxes and subgrid-scale condensation in a realistic manner. A parameterization based on the Total Energy - Mass Flux (TEMF) that unifies turbulence and moist convection components produces a better result that the other PBL schemes. For that reason, the TEMF scheme is chosen as the PBL scheme we optimized for Intel Many Integrated Core (MIC), which ushers in a new era of supercomputing speed, performance, and compatibility. It allows the developers to run code at trillions of calculations per second using the familiar programming model. In this paper, we present our optimization results for TEMF planetary boundary layer scheme. The optimizations that were performed were quite generic in nature. Those optimizations included vectorization of the code to utilize vector units inside each CPU. Furthermore, memory access was improved by scalarizing some of the intermediate arrays. The results show that the optimization improved MIC performance by 14.8x. Furthermore, the optimizations increased CPU performance by 2.6x compared to the original multi-threaded code on quad core Intel Xeon E5-2603 running at 1.8 GHz. Compared to the optimized code running on a single CPU socket the optimized MIC code is 6.2x faster.
Thermal Hotspots in CPU Die and It's Future Architecture

NASA Astrophysics Data System (ADS)

Wang, Jian; Hu, Fu-Yuan

Owing to the increasing core frequency and chip integration and the limited die dimension, the power densities in CPU chip have been increasing fastly. The high temperature on chip resulted by power densities threats the processor's performance and chip's reliability. This paper analyzed the thermal hotspots in die and their properties. A new architecture of function units in die - - hot units distributed architecture is suggested to cope with the problems of high power densities for future processor chip.
DOE Office of Scientific and Technical Information (OSTI.GOV)

Zhang, Yao; Balaprakash, Prasanna; Meng, Jiayuan

We present Raexplore, a performance modeling framework for architecture exploration. Raexplore enables rapid, automated, and systematic search of architecture design space by combining hardware counter-based performance characterization and analytical performance modeling. We demonstrate Raexplore for two recent manycore processors IBM Blue- Gene/Q compute chip and Intel Xeon Phi, targeting a set of scientific applications. Our framework is able to capture complex interactions between architectural components including instruction pipeline, cache, and memory, and to achieve a 3–22% error for same-architecture and cross-architecture performance predictions. Furthermore, we apply our framework to assess the two processors, and discover and evaluate a list ofmore » architectural scaling options for future processor designs.« less
Evaluation of the Intel Xeon Phi Co-processor to accelerate the sensitivity map calculation for PET imaging

NASA Astrophysics Data System (ADS)

Dey, T.; Rodrigue, P.

2015-07-01

We aim to evaluate the Intel Xeon Phi coprocessor for acceleration of 3D Positron Emission Tomography (PET) image reconstruction. We focus on the sensitivity map calculation as one computational intensive part of PET image reconstruction, since it is a promising candidate for acceleration with the Many Integrated Core (MIC) architecture of the Xeon Phi. The computation of the voxels in the field of view (FoV) can be done in parallel and the 103 to 104 samples needed to calculate the detection probability of each voxel can take advantage of vectorization. We use the ray tracing kernels of the Embree project to calculate the hit points of the sample rays with the detector and in a second step the sum of the radiological path taking into account attenuation is determined. The core components are implemented using the Intel single instruction multiple data compiler (ISPC) to enable a portable implementation showing efficient vectorization either on the Xeon Phi and the Host platform. On the Xeon Phi, the calculation of the radiological path is also implemented in hardware specific intrinsic instructions (so-called `intrinsics') to allow manually-optimized vectorization. For parallelization either OpenMP and ISPC tasking (based on pthreads) are evaluated.Our implementation achieved a scalability factor of 0.90 on the Xeon Phi coprocessor (model 5110P) with 60 cores at 1 GHz. Only minor differences were found between parallelization with OpenMP and the ISPC tasking feature. The implementation using intrinsics was found to be about 12% faster than the portable ISPC version. With this version, a speedup of 1.43 was achieved on the Xeon Phi coprocessor compared to the host system (HP SL250s Gen8) equipped with two Xeon (E5-2670) CPUs, with 8 cores at 2.6 to 3.3 GHz each. Using a second Xeon Phi card the speedup could be further increased to 2.77. No significant differences were found between the results of the different Xeon Phi and the Host implementations. The examination showed that a reasonable speedup of sensitivity map calculation could be achieved on the Xeon Phi either by a portable or a hardware specific implementation.
Parallelized Kalman-Filter-Based Reconstruction of Particle Tracks on Many-Core Processors and GPUs

DOE Office of Scientific and Technical Information (OSTI.GOV)

Cerati, Giuseppe; Elmer, Peter; Krutelyov, Slava

2017-01-01

For over a decade now, physical and energy constraints have limited clock speed improvements in commodity microprocessors. Instead, chipmakers have been pushed into producing lower-power, multi-core processors such as Graphical Processing Units (GPU), ARM CPUs, and Intel MICs. Broad-based efforts from manufacturers and developers have been devoted to making these processors user-friendly enough to perform general computations. However, extracting performance from a larger number of cores, as well as specialized vector or SIMD units, requires special care in algorithm design and code optimization. One of the most computationally challenging problems in high-energy particle experiments is finding and fitting the charged-particlemore » tracks during event reconstruction. This is expected to become by far the dominant problem at the High-Luminosity Large Hadron Collider (HL-LHC), for example. Today the most common track finding methods are those based on the Kalman filter. Experience with Kalman techniques on real tracking detector systems has shown that they are robust and provide high physics performance. This is why they are currently in use at the LHC, both in the trigger and offine. Previously we reported on the significant parallel speedups that resulted from our investigations to adapt Kalman filters to track fitting and track building on Intel Xeon and Xeon Phi. Here, we discuss our progresses toward the understanding of these processors and the new developments to port the Kalman filter to NVIDIA GPUs.« less
Parallelized Kalman-Filter-Based Reconstruction of Particle Tracks on Many-Core Processors and GPUs

NASA Astrophysics Data System (ADS)

Cerati, Giuseppe; Elmer, Peter; Krutelyov, Slava; Lantz, Steven; Lefebvre, Matthieu; Masciovecchio, Mario; McDermott, Kevin; Riley, Daniel; Tadel, Matevž; Wittich, Peter; Würthwein, Frank; Yagil, Avi

2017-08-01

For over a decade now, physical and energy constraints have limited clock speed improvements in commodity microprocessors. Instead, chipmakers have been pushed into producing lower-power, multi-core processors such as Graphical Processing Units (GPU), ARM CPUs, and Intel MICs. Broad-based efforts from manufacturers and developers have been devoted to making these processors user-friendly enough to perform general computations. However, extracting performance from a larger number of cores, as well as specialized vector or SIMD units, requires special care in algorithm design and code optimization. One of the most computationally challenging problems in high-energy particle experiments is finding and fitting the charged-particle tracks during event reconstruction. This is expected to become by far the dominant problem at the High-Luminosity Large Hadron Collider (HL-LHC), for example. Today the most common track finding methods are those based on the Kalman filter. Experience with Kalman techniques on real tracking detector systems has shown that they are robust and provide high physics performance. This is why they are currently in use at the LHC, both in the trigger and offine. Previously we reported on the significant parallel speedups that resulted from our investigations to adapt Kalman filters to track fitting and track building on Intel Xeon and Xeon Phi. Here, we discuss our progresses toward the understanding of these processors and the new developments to port the Kalman filter to NVIDIA GPUs.
Computing effective properties of random heterogeneous materials on heterogeneous parallel processors

NASA Astrophysics Data System (ADS)

Leidi, Tiziano; Scocchi, Giulio; Grossi, Loris; Pusterla, Simone; D'Angelo, Claudio; Thiran, Jean-Philippe; Ortona, Alberto

2012-11-01

In recent decades, finite element (FE) techniques have been extensively used for predicting effective properties of random heterogeneous materials. In the case of very complex microstructures, the choice of numerical methods for the solution of this problem can offer some advantages over classical analytical approaches, and it allows the use of digital images obtained from real material samples (e.g., using computed tomography). On the other hand, having a large number of elements is often necessary for properly describing complex microstructures, ultimately leading to extremely time-consuming computations and high memory requirements. With the final objective of reducing these limitations, we improved an existing freely available FE code for the computation of effective conductivity (electrical and thermal) of microstructure digital models. To allow execution on hardware combining multi-core CPUs and a GPU, we first translated the original algorithm from Fortran to C, and we subdivided it into software components. Then, we enhanced the C version of the algorithm for parallel processing with heterogeneous processors. With the goal of maximizing the obtained performances and limiting resource consumption, we utilized a software architecture based on stream processing, event-driven scheduling, and dynamic load balancing. The parallel processing version of the algorithm has been validated using a simple microstructure consisting of a single sphere located at the centre of a cubic box, yielding consistent results. Finally, the code was used for the calculation of the effective thermal conductivity of a digital model of a real sample (a ceramic foam obtained using X-ray computed tomography). On a computer equipped with dual hexa-core Intel Xeon X5670 processors and an NVIDIA Tesla C2050, the parallel application version features near to linear speed-up progression when using only the CPU cores. It executes more than 20 times faster when additionally using the GPU.
Evaluating the networking characteristics of the Cray XC-40 Intel Knights Landing-based Cori supercomputer at NERSC

DOE Office of Scientific and Technical Information (OSTI.GOV)

Doerfler, Douglas; Austin, Brian; Cook, Brandon

There are many potential issues associated with deploying the Intel Xeon Phi™ (code named Knights Landing [KNL]) manycore processor in a large-scale supercomputer. One in particular is the ability to fully utilize the high-speed communications network, given that the serial performance of a Xeon Phi TM core is a fraction of a Xeon®core. In this paper, we take a look at the trade-offs associated with allocating enough cores to fully utilize the Aries high-speed network versus cores dedicated to computation, e.g., the trade-off between MPI and OpenMP. In addition, we evaluate new features of Cray MPI in support of KNL,more » such as internode optimizations. We also evaluate one-sided programming models such as Unified Parallel C. We quantify the impact of the above trade-offs and features using a suite of National Energy Research Scientific Computing Center applications.« less
DD-αAMG on QPACE 3

NASA Astrophysics Data System (ADS)

Georg, Peter; Richtmann, Daniel; Wettig, Tilo

2018-03-01

We describe our experience porting the Regensburg implementation of the DD-αAMG solver from QPACE 2 to QPACE 3. We first review how the code was ported from the first generation Intel Xeon Phi processor (Knights Corner) to its successor (Knights Landing). We then describe the modifications in the communication library necessitated by the switch from InfiniBand to Omni-Path. Finally, we present the performance of the code on a single processor as well as the scaling on many nodes, where in both cases the speedup factor is close to the theoretical expectations.
Optimizing legacy molecular dynamics software with directive-based offload

DOE PAGES

Michael Brown, W.; Carrillo, Jan-Michael Y.; Gavhane, Nitin; ...

2015-05-14

The directive-based programming models are one solution for exploiting many-core coprocessors to increase simulation rates in molecular dynamics. They offer the potential to reduce code complexity with offload models that can selectively target computations to run on the CPU, the coprocessor, or both. In our paper, we describe modifications to the LAMMPS molecular dynamics code to enable concurrent calculations on a CPU and coprocessor. We also demonstrate that standard molecular dynamics algorithms can run efficiently on both the CPU and an x86-based coprocessor using the same subroutines. As a consequence, we demonstrate that code optimizations for the coprocessor also resultmore » in speedups on the CPU; in extreme cases up to 4.7X. We provide results for LAMMAS benchmarks and for production molecular dynamics simulations using the Stampede hybrid supercomputer with both Intel (R) Xeon Phi (TM) coprocessors and NVIDIA GPUs: The optimizations presented have increased simulation rates by over 2X for organic molecules and over 7X for liquid crystals on Stampede. The optimizations are available as part of the "Intel package" supplied with LAMMPS. (C) 2015 Elsevier B.V. All rights reserved.« less
A hybrid algorithm for parallel molecular dynamics simulations

NASA Astrophysics Data System (ADS)

Mangiardi, Chris M.; Meyer, R.

2017-10-01

This article describes algorithms for the hybrid parallelization and SIMD vectorization of molecular dynamics simulations with short-range forces. The parallelization method combines domain decomposition with a thread-based parallelization approach. The goal of the work is to enable efficient simulations of very large (tens of millions of atoms) and inhomogeneous systems on many-core processors with hundreds or thousands of cores and SIMD units with large vector sizes. In order to test the efficiency of the method, simulations of a variety of configurations with up to 74 million atoms have been performed. Results are shown that were obtained on multi-core systems with Sandy Bridge and Haswell processors as well as systems with Xeon Phi many-core processors.
Parallel hyperbolic PDE simulation on clusters: Cell versus GPU

NASA Astrophysics Data System (ADS)

Rostrup, Scott; De Sterck, Hans

2010-12-01

Increasingly, high-performance computing is looking towards data-parallel computational devices to enhance computational performance. Two technologies that have received significant attention are IBM's Cell Processor and NVIDIA's CUDA programming model for graphics processing unit (GPU) computing. In this paper we investigate the acceleration of parallel hyperbolic partial differential equation simulation on structured grids with explicit time integration on clusters with Cell and GPU backends. The message passing interface (MPI) is used for communication between nodes at the coarsest level of parallelism. Optimizations of the simulation code at the several finer levels of parallelism that the data-parallel devices provide are described in terms of data layout, data flow and data-parallel instructions. Optimized Cell and GPU performance are compared with reference code performance on a single x86 central processing unit (CPU) core in single and double precision. We further compare the CPU, Cell and GPU platforms on a chip-to-chip basis, and compare performance on single cluster nodes with two CPUs, two Cell processors or two GPUs in a shared memory configuration (without MPI). We finally compare performance on clusters with 32 CPUs, 32 Cell processors, and 32 GPUs using MPI. Our GPU cluster results use NVIDIA Tesla GPUs with GT200 architecture, but some preliminary results on recently introduced NVIDIA GPUs with the next-generation Fermi architecture are also included. This paper provides computational scientists and engineers who are considering porting their codes to accelerator environments with insight into how structured grid based explicit algorithms can be optimized for clusters with Cell and GPU accelerators. It also provides insight into the speed-up that may be gained on current and future accelerator architectures for this class of applications. Program summaryProgram title: SWsolver Catalogue identifier: AEGY_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEGY_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: GPL v3 No. of lines in distributed program, including test data, etc.: 59 168 No. of bytes in distributed program, including test data, etc.: 453 409 Distribution format: tar.gz Programming language: C, CUDA Computer: Parallel Computing Clusters. Individual compute nodes may consist of x86 CPU, Cell processor, or x86 CPU with attached NVIDIA GPU accelerator. Operating system: Linux Has the code been vectorised or parallelized?: Yes. Tested on 1-128 x86 CPU cores, 1-32 Cell Processors, and 1-32 NVIDIA GPUs. RAM: Tested on Problems requiring up to 4 GB per compute node. Classification: 12 External routines: MPI, CUDA, IBM Cell SDK Nature of problem: MPI-parallel simulation of Shallow Water equations using high-resolution 2D hyperbolic equation solver on regular Cartesian grids for x86 CPU, Cell Processor, and NVIDIA GPU using CUDA. Solution method: SWsolver provides 3 implementations of a high-resolution 2D Shallow Water equation solver on regular Cartesian grids, for CPU, Cell Processor, and NVIDIA GPU. Each implementation uses MPI to divide work across a parallel computing cluster. Additional comments: Sub-program numdiff is used for the test run.
Revisiting Intel Xeon Phi optimization of Thompson cloud microphysics scheme in Weather Research and Forecasting (WRF) model

NASA Astrophysics Data System (ADS)

Mielikainen, Jarno; Huang, Bormin; Huang, Allen

2015-10-01

The Thompson cloud microphysics scheme is a sophisticated cloud microphysics scheme in the Weather Research and Forecasting (WRF) model. The scheme is very suitable for massively parallel computation as there are no interactions among horizontal grid points. Compared to the earlier microphysics schemes, the Thompson scheme incorporates a large number of improvements. Thus, we have optimized the speed of this important part of WRF. Intel Many Integrated Core (MIC) ushers in a new era of supercomputing speed, performance, and compatibility. It allows the developers to run code at trillions of calculations per second using the familiar programming model. In this paper, we present our results of optimizing the Thompson microphysics scheme on Intel Many Integrated Core Architecture (MIC) hardware. The Intel Xeon Phi coprocessor is the first product based on Intel MIC architecture, and it consists of up to 61 cores connected by a high performance on-die bidirectional interconnect. The coprocessor supports all important Intel development tools. Thus, the development environment is familiar one to a vast number of CPU developers. Although, getting a maximum performance out of MICs will require using some novel optimization techniques. New optimizations for an updated Thompson scheme are discusses in this paper. The optimizations improved the performance of the original Thompson code on Xeon Phi 7120P by a factor of 1.8x. Furthermore, the same optimizations improved the performance of the Thompson on a dual socket configuration of eight core Intel Xeon E5-2670 CPUs by a factor of 1.8x compared to the original Thompson code.
GPU: the biggest key processor for AI and parallel processing

NASA Astrophysics Data System (ADS)

Baji, Toru

2017-07-01

Two types of processors exist in the market. One is the conventional CPU and the other is Graphic Processor Unit (GPU). Typical CPU is composed of 1 to 8 cores while GPU has thousands of cores. CPU is good for sequential processing, while GPU is good to accelerate software with heavy parallel executions. GPU was initially dedicated for 3D graphics. However from 2006, when GPU started to apply general-purpose cores, it was noticed that this architecture can be used as a general purpose massive-parallel processor. NVIDIA developed a software framework Compute Unified Device Architecture (CUDA) that make it possible to easily program the GPU for these application. With CUDA, GPU started to be used in workstations and supercomputers widely. Recently two key technologies are highlighted in the industry. The Artificial Intelligence (AI) and Autonomous Driving Cars. AI requires a massive parallel operation to train many-layers of neural networks. With CPU alone, it was impossible to finish the training in a practical time. The latest multi-GPU system with P100 makes it possible to finish the training in a few hours. For the autonomous driving cars, TOPS class of performance is required to implement perception, localization, path planning processing and again SoC with integrated GPU will play a key role there. In this paper, the evolution of the GPU which is one of the biggest commercial devices requiring state-of-the-art fabrication technology will be introduced. Also overview of the GPU demanding key application like the ones described above will be introduced.
Preliminary Study of Image Reconstruction Algorithm on a Digital Signal Processor

DTIC Science & Technology

2014-03-01

5.2 Comparison of CPU-GPU, CPU-FPGA, and CPU-DSP Designs The work for implementing VHDL description of the back-projection algorithm on a physical...FPGA was not complete. Hence, the DSP implementation results are compared with the simulated results for the VHDL design. Simulating VHDL provides an...rather than at the software level. Depending on an application’s characteristics, FPGA implementations can provide a significant performance
Enhancing Image Processing Performance for PCID in a Heterogeneous Network of Multi-code Processors

NASA Astrophysics Data System (ADS)

Linderman, R.; Spetka, S.; Fitzgerald, D.; Emeny, S.

The Physically-Constrained Iterative Deconvolution (PCID) image deblurring code is being ported to heterogeneous networks of multi-core systems, including Intel Xeons and IBM Cell Broadband Engines. This paper reports results from experiments using the JAWS supercomputer at MHPCC (60 TFLOPS of dual-dual Xeon nodes linked with Infiniband) and the Cell Cluster at AFRL in Rome, NY. The Cell Cluster has 52 TFLOPS of Playstation 3 (PS3) nodes with IBM Cell Broadband Engine multi-cores and 15 dual-quad Xeon head nodes. The interconnect fabric includes Infiniband, 10 Gigabit Ethernet and 1 Gigabit Ethernet to each of the 336 PS3s. The results compare approaches to parallelizing FFT executions across the Xeons and the Cell's Synergistic Processing Elements (SPEs) for frame-level image processing. The experiments included Intel's Performance Primitives and Math Kernel Library, FFTW3.2, and Carnegie Mellon's SPIRAL. Optimization of FFTs in the PCID code led to a decrease in relative processing time for FFTs. Profiling PCID version 6.2, about one year ago, showed the 13 functions that accounted for the highest percentage of processing were all FFT processing functions. They accounted for over 88% of processing time in one run on Xeons. FFT optimizations led to improvement in the current PCID version 8.0. A recent profile showed that only two of the 19 functions with the highest processing time were FFT processing functions. Timing measurements showed that FFT processing for PCID version 8.0 has been reduced to less than 19% of overall processing time. We are working toward a goal of scaling to 200-400 cores per job (1-2 imagery frames/core). Running a pair of cores on each set of frames reduces latency by implementing parallel FFT processing. Our current results show scaling well out to 100 pairs of cores. These results support the next higher level of parallelism in PCID, where groups of several hundred frames each producing one resolved image are sent to cliques of several hundred cores in a round robin fashion. Current efforts toward further performance enhancement for PCID are shifting toward using the Playstations in conjunction with the Xeons to take advantage of outstanding price/performance as well as the Flops/Watt cost advantage. We are fine-tuning the PCID parallization strategy to balance processing over Xeons and Cell BEs to find an optimal partitioning of PCID over the heterogeneous processors. A high performance information management system that exploits native Infiniband multicast is used to improve latency among the head nodes. Using a publication/subscription oriented information management system to implement a unified communications platform makes runs on large HPCs with thousands of intercommunicating cores more flexible and more fault tolerant. It features a loose couplingof publishers to subscribers through intervening brokers. We are also working on enhancing performance for both Xeons and Cell BEs, buy moving selected operations to single precision. Techniques for adapting the code to single precision and performance results are reported.
High-performance computing on GPUs for resistivity logging of oil and gas wells

NASA Astrophysics Data System (ADS)

Glinskikh, V.; Dudaev, A.; Nechaev, O.; Surodina, I.

2017-10-01

We developed and implemented into software an algorithm for high-performance simulation of electrical logs from oil and gas wells using high-performance heterogeneous computing. The numerical solution of the 2D forward problem is based on the finite-element method and the Cholesky decomposition for solving a system of linear algebraic equations (SLAE). Software implementations of the algorithm used the NVIDIA CUDA technology and computing libraries are made, allowing us to perform decomposition of SLAE and find its solution on central processor unit (CPU) and graphics processor unit (GPU). The calculation time is analyzed depending on the matrix size and number of its non-zero elements. We estimated the computing speed on CPU and GPU, including high-performance heterogeneous CPU-GPU computing. Using the developed algorithm, we simulated resistivity data in realistic models.

An efficient MPI/OpenMP parallelization of the Hartree–Fock–Roothaan method for the first generation of Intel® Xeon Phi™ processor architecture

DOE PAGES

Mironov, Vladimir; Moskovsky, Alexander; D’Mello, Michael; ...

2017-10-04

The Hartree-Fock (HF) method in the quantum chemistry package GAMESS represents one of the most irregular algorithms in computation today. Major steps in the calculation are the irregular computation of electron repulsion integrals (ERIs) and the building of the Fock matrix. These are the central components of the main Self Consistent Field (SCF) loop, the key hotspot in Electronic Structure (ES) codes. By threading the MPI ranks in the official release of the GAMESS code, we not only speed up the main SCF loop (4x to 6x for large systems), but also achieve a significant (>2x) reduction in the overallmore » memory footprint. These improvements are a direct consequence of memory access optimizations within the MPI ranks. We benchmark our implementation against the official release of the GAMESS code on the Intel R Xeon PhiTM supercomputer. Here, scaling numbers are reported on up to 7,680 cores on Intel Xeon Phi coprocessors.« less
Development of small scale cluster computer for numerical analysis

NASA Astrophysics Data System (ADS)

Zulkifli, N. H. N.; Sapit, A.; Mohammed, A. N.

2017-09-01

In this study, two units of personal computer were successfully networked together to form a small scale cluster. Each of the processor involved are multicore processor which has four cores in it, thus made this cluster to have eight processors. Here, the cluster incorporate Ubuntu 14.04 LINUX environment with MPI implementation (MPICH2). Two main tests were conducted in order to test the cluster, which is communication test and performance test. The communication test was done to make sure that the computers are able to pass the required information without any problem and were done by using simple MPI Hello Program where the program written in C language. Additional, performance test was also done to prove that this cluster calculation performance is much better than single CPU computer. In this performance test, four tests were done by running the same code by using single node, 2 processors, 4 processors, and 8 processors. The result shows that with additional processors, the time required to solve the problem decrease. Time required for the calculation shorten to half when we double the processors. To conclude, we successfully develop a small scale cluster computer using common hardware which capable of higher computing power when compare to single CPU processor, and this can be beneficial for research that require high computing power especially numerical analysis such as finite element analysis, computational fluid dynamics, and computational physics analysis.
A High Performance Computing Framework for Physics-based Modeling and Simulation of Military Ground Vehicles

DTIC Science & Technology

2011-03-25

number one and Nebulae at number three. Both systems rely on GPU co-processing and use Intel Xeon processors cards and NVIDIA Tesla C2050 GPUs. In...spite of a theoretical peak capability of almost 3 Petaflop/s, Nebulae clocked at 1.271 PFlop/s when running the Linpack benchmark, which puts it
OpenMP GNU and Intel Fortran programs for solving the time-dependent Gross-Pitaevskii equation

NASA Astrophysics Data System (ADS)

Young-S., Luis E.; Muruganandam, Paulsamy; Adhikari, Sadhan K.; Lončar, Vladimir; Vudragović, Dušan; Balaž, Antun

2017-11-01

We present Open Multi-Processing (OpenMP) version of Fortran 90 programs for solving the Gross-Pitaevskii (GP) equation for a Bose-Einstein condensate in one, two, and three spatial dimensions, optimized for use with GNU and Intel compilers. We use the split-step Crank-Nicolson algorithm for imaginary- and real-time propagation, which enables efficient calculation of stationary and non-stationary solutions, respectively. The present OpenMP programs are designed for computers with multi-core processors and optimized for compiling with both commercially-licensed Intel Fortran and popular free open-source GNU Fortran compiler. The programs are easy to use and are elaborated with helpful comments for the users. All input parameters are listed at the beginning of each program. Different output files provide physical quantities such as energy, chemical potential, root-mean-square sizes, densities, etc. We also present speedup test results for new versions of the programs. Program files doi:http://dx.doi.org/10.17632/y8zk3jgn84.2 Licensing provisions: Apache License 2.0 Programming language: OpenMP GNU and Intel Fortran 90. Computer: Any multi-core personal computer or workstation with the appropriate OpenMP-capable Fortran compiler installed. Number of processors used: All available CPU cores on the executing computer. Journal reference of previous version: Comput. Phys. Commun. 180 (2009) 1888; ibid.204 (2016) 209. Does the new version supersede the previous version?: Not completely. It does supersede previous Fortran programs from both references above, but not OpenMP C programs from Comput. Phys. Commun. 204 (2016) 209. Nature of problem: The present Open Multi-Processing (OpenMP) Fortran programs, optimized for use with commercially-licensed Intel Fortran and free open-source GNU Fortran compilers, solve the time-dependent nonlinear partial differential (GP) equation for a trapped Bose-Einstein condensate in one (1d), two (2d), and three (3d) spatial dimensions for six different trap symmetries: axially and radially symmetric traps in 3d, circularly symmetric traps in 2d, fully isotropic (spherically symmetric) and fully anisotropic traps in 2d and 3d, as well as 1d traps, where no spatial symmetry is considered. Solution method: We employ the split-step Crank-Nicolson algorithm to discretize the time-dependent GP equation in space and time. The discretized equation is then solved by imaginary- or real-time propagation, employing adequately small space and time steps, to yield the solution of stationary and non-stationary problems, respectively. Reasons for the new version: Previously published Fortran programs [1,2] have now become popular tools [3] for solving the GP equation. These programs have been translated to the C programming language [4] and later extended to the more complex scenario of dipolar atoms [5]. Now virtually all computers have multi-core processors and some have motherboards with more than one physical computer processing unit (CPU), which may increase the number of available CPU cores on a single computer to several tens. The C programs have been adopted to be very fast on such multi-core modern computers using general-purpose graphic processing units (GPGPU) with Nvidia CUDA and computer clusters using Message Passing Interface (MPI) [6]. Nevertheless, previously developed Fortran programs are also commonly used for scientific computation and most of them use a single CPU core at a time in modern multi-core laptops, desktops, and workstations. Unless the Fortran programs are made aware and capable of making efficient use of the available CPU cores, the solution of even a realistic dynamical 1d problem, not to mention the more complicated 2d and 3d problems, could be time consuming using the Fortran programs. Previously, we published auto-parallel Fortran programs [2] suitable for Intel (but not GNU) compiler for solving the GP equation. Hence, a need for the full OpenMP version of the Fortran programs to reduce the execution time cannot be overemphasized. To address this issue, we provide here such OpenMP Fortran programs, optimized for both Intel and GNU Fortran compilers and capable of using all available CPU cores, which can significantly reduce the execution time. Summary of revisions: Previous Fortran programs [1] for solving the time-dependent GP equation in 1d, 2d, and 3d with different trap symmetries have been parallelized using the OpenMP interface to reduce the execution time on multi-core processors. There are six different trap symmetries considered, resulting in six programs for imaginary-time propagation and six for real-time propagation, totaling to 12 programs included in BEC-GP-OMP-FOR software package. All input data (number of atoms, scattering length, harmonic oscillator trap length, trap anisotropy, etc.) are conveniently placed at the beginning of each program, as before [2]. Present programs introduce a new input parameter, which is designated by Number_of_Threads and defines the number of CPU cores of the processor to be used in the calculation. If one sets the value 0 for this parameter, all available CPU cores will be used. For the most efficient calculation it is advisable to leave one CPU core unused for the background system's jobs. For example, on a machine with 20 CPU cores such that we used for testing, it is advisable to use up to 19 CPU cores. However, the total number of used CPU cores can be divided into more than one job. For instance, one can run three simulations simultaneously using 10, 4, and 5 CPU cores, respectively, thus totaling to 19 used CPU cores on a 20-core computer. The Fortran source programs are located in the directory src, and can be compiled by the make command using the makefile in the root directory BEC-GP-OMP-FOR of the software package. The examples of produced output files can be found in the directory output, although some large density files are omitted, to save space. The programs calculate the values of actually used dimensionless nonlinearities from the physical input parameters, where the input parameters correspond to the identical nonlinearity values as in the previously published programs [1], so that the output files of the old and new programs can be directly compared. The output files are conveniently named such that their contents can be easily identified, following the naming convention introduced in Ref. [2]. For example, a file named -out.txt, where is a name of the individual program, represents the general output file containing input data, time and space steps, nonlinearity, energy and chemical potential, and was named fort.7 in the old Fortran version of programs [1]. A file named -den.txt is the output file with the condensate density, which had the names fort.3 and fort.4 in the old Fortran version [1] for imaginary- and real-time propagation programs, respectively. Other possible density outputs, such as the initial density, are commented out in the programs to have a simpler set of output files, but users can uncomment and re-enable them, if needed. In addition, there are output files for reduced (integrated) 1d and 2d densities for different programs. In the real-time programs there is also an output file reporting the dynamics of evolution of root-mean-square sizes after a perturbation is introduced. The supplied real-time programs solve the stationary GP equation, and then calculate the dynamics. As the imaginary-time programs are more accurate than the real-time programs for the solution of a stationary problem, one can first solve the stationary problem using the imaginary-time programs, adapt the real-time programs to read the pre-calculated wave function and then study the dynamics. In that case the parameter NSTP in the real-time programs should be set to zero and the space mesh and nonlinearity parameters should be identical in both programs. The reader is advised to consult our previous publication where a complete description of the output files is given [2]. A readme.txt file, included in the root directory, explains the procedure to compile and run the programs. We tested our programs on a workstation with two 10-core Intel Xeon E5-2650 v3 CPUs. The parameters used for testing are given in sample input files, provided in the corresponding directory together with the programs. In Table 1 we present wall-clock execution times for runs on 1, 6, and 19 CPU cores for programs compiled using Intel and GNU Fortran compilers. The corresponding columns "Intel speedup" and "GNU speedup" give the ratio of wall-clock execution times of runs on 1 and 19 CPU cores, and denote the actual measured speedup for 19 CPU cores. In all cases and for all numbers of CPU cores, although the GNU Fortran compiler gives excellent results, the Intel Fortran compiler turns out to be slightly faster. Note that during these tests we always ran only a single simulation on a workstation at a time, to avoid any possible interference issues. Therefore, the obtained wall-clock times are more reliable than the ones that could be measured with two or more jobs running simultaneously. We also studied the speedup of the programs as a function of the number of CPU cores used. The performance of the Intel and GNU Fortran compilers is illustrated in Fig. 1, where we plot the speedup and actual wall-clock times as functions of the number of CPU cores for 2d and 3d programs. We see that the speedup increases monotonically with the number of CPU cores in all cases and has large values (between 10 and 14 for 3d programs) for the maximal number of cores. This fully justifies the development of OpenMP programs, which enable much faster and more efficient solving of the GP equation. However, a slow saturation in the speedup with the further increase in the number of CPU cores is observed in all cases, as expected. The speedup tends to increase for programs in higher dimensions, as they become more complex and have to process more data. This is why the speedups of the supplied 2d and 3d programs are larger than those of 1d programs. Also, for a single program the speedup increases with the size of the spatial grid, i.e., with the number of spatial discretization points, since this increases the amount of calculations performed by the program. To demonstrate this, we tested the supplied real2d-th program and varied the number of spatial discretization points NX=NY from 20 to 1000. The measured speedup obtained when running this program on 19 CPU cores as a function of the number of discretization points is shown in Fig. 2. The speedup first increases rapidly with the number of discretization points and eventually saturates. Additional comments: Example inputs provided with the programs take less than 30 minutes to run on a workstation with two Intel Xeon E5-2650 v3 processors (2 QPI links, 10 CPU cores, 25 MB cache, 2.3 GHz).



      
      Using all of your CPU's in HIPE
      NASA Astrophysics Data System (ADS)
      Jacobson, J. D.; Fadda, D.
         2012-09-01
         Modern computer architectures increasingly feature multi-core CPU's. For example, the MacbookPro features the Intel quad-core i7 processors. Through the use of hyper-threading, where each core can execute two threads simultaneously, the quad-core i7 can support eight simultaneous processing threads. All this on your laptop! This CPU power can now be put into service by scientists to perform data reduction tasks, but only if the software has been designed to take advantage of the multiple processor architectures. Up to now, software written for Herschel data reduction (HIPE), written in Jython and JAVA, is single-threaded and can only utilize a single processor. Users of HIPE do not get any advantage from the additional processors. Why not put all of the CPU resources to work reducing your data? We present a multi-threaded software application that corrects long-term transients in the signal from the PACS unchopped spectroscopy line scan mode. In this poster, we present a multi-threaded software framework to achieve performance improvements from parallel execution. We will show how a task to correct transients in the PACS Spectroscopy Pipeline for the un-chopped line scan mode, has been threaded. This computation-intensive task uses either a one-parameter or a three parameter exponential function, to characterize the transient. The task uses a JAVA implementation of Minpack, translated from the C (Moshier) and IDL (Markwardt) by the authors, to optimize the correction parameters. We also explain how to determine if a task can benefit from threading (Amdahl's Law), and if it is safe to thread. The design and implementation, using the JAVA concurrency package completions service is described. Pitfalls, timing bugs, thread safety, resource control, testing and performance improvements are described and plotted.
      

      
      Performance tuning Weather Research and Forecasting (WRF) Goddard longwave radiative transfer scheme on Intel Xeon Phi
      NASA Astrophysics Data System (ADS)
      Mielikainen, Jarno; Huang, Bormin; Huang, Allen H.
         2015-10-01
         Next-generation mesoscale numerical weather prediction system, the Weather Research and Forecasting (WRF) model, is a designed for dual use for forecasting and research. WRF offers multiple physics options that can be combined in any way. One of the physics options is radiance computation. The major source for energy for the earth's climate is solar radiation. Thus, it is imperative to accurately model horizontal and vertical distribution of the heating. Goddard solar radiative transfer model includes the absorption duo to water vapor,ozone, ozygen, carbon dioxide, clouds and aerosols. The model computes the interactions among the absorption and scattering by clouds, aerosols, molecules and surface. Finally, fluxes are integrated over the entire longwave spectrum.In this paper, we present our results of optimizing the Goddard longwave radiative transfer scheme on Intel Many Integrated Core Architecture (MIC) hardware. The Intel Xeon Phi coprocessor is the first product based on Intel MIC architecture, and it consists of up to 61 cores connected by a high performance on-die bidirectional interconnect. The coprocessor supports all important Intel development tools. Thus, the development environment is familiar one to a vast number of CPU developers. Although, getting a maximum performance out of MICs will require using some novel optimization techniques. Those optimization techniques are discusses in this paper. The optimizations improved the performance of the original Goddard longwave radiative transfer scheme on Xeon Phi 7120P by a factor of 2.2x. Furthermore, the same optimizations improved the performance of the Goddard longwave radiative transfer scheme on a dual socket configuration of eight core Intel Xeon E5-2670 CPUs by a factor of 2.1x compared to the original Goddard longwave radiative transfer scheme code.
      

      
      Analysis and Implementation of Particle-to-Particle (P2P) Graphics Processor Unit (GPU) Kernel for Black-Box Adaptive Fast Multipole Method
      DTIC Science & Technology
      
         2015-06-01
         5110P and 16 dx360M4 nodes each with one NVIDIA Kepler K20M/K40M GPU. Each node contained dual Intel Xeon E5-2670 (Sandy Bridge) central processing...kernel and as such does not employ multiple processors. This work makes use of a single processing core and a single NVIDIA Kepler K40 GK110...bandwidth (2 × 16 slot), 7.877 GFloat/s; Kepler K40 peak, 4,290 × 1 billion floating-point operations (GFLOPs), and 288 GB/s Kepler K40 memory
      

      
      Scaling Support Vector Machines On Modern HPC Platforms
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      You, Yang; Fu, Haohuan; Song, Shuaiwen
         2015-02-01
         We designed and implemented MIC-SVM, a highly efficient parallel SVM for x86 based multicore and many-core architectures, such as the Intel Ivy Bridge CPUs and Intel Xeon Phi co-processor (MIC). We propose various novel analysis methods and optimization techniques to fully utilize the multilevel parallelism provided by these architectures and serve as general optimization methods for other machine learning tools.
      

      
      Domain Wall Fermion Inverter on Pentium 4
      NASA Astrophysics Data System (ADS)
      Pochinsky, Andrew
         2005-03-01
         A highly optimized domain wall fermion inverter has been developed as part of the SciDAC lattice initiative. By designing the code to minimize memory bus traffic, it achieves high cache reuse and performance in excess of 2 GFlops for out of L2 cache problem sizes on a GigE cluster with 2.66 GHz Xeon processors. The code uses the SciDAC QMP communication library.
      

      
      Developing infrared array controller with software real time operating system
      NASA Astrophysics Data System (ADS)
      Sako, Shigeyuki; Miyata, Takashi; Nakamura, Tomohiko; Motohara, Kentaro; Uchimoto, Yuka Katsuno; Onaka, Takashi; Kataza, Hirokazu
         2008-07-01
         Real-time capabilities are required for a controller of a large format array to reduce a dead-time attributed by readout and data transfer. The real-time processing has been achieved by dedicated processors including DSP, CPLD, and FPGA devices. However, the dedicated processors have problems with memory resources, inflexibility, and high cost. Meanwhile, a recent PC has sufficient resources of CPUs and memories to control the infrared array and to process a large amount of frame data in real-time. In this study, we have developed an infrared array controller with a software real-time operating system (RTOS) instead of the dedicated processors. A Linux PC equipped with a RTAI extension and a dual-core CPU is used as a main computer, and one of the CPU cores is allocated to the real-time processing. A digital I/O board with DMA functions is used for an I/O interface. The signal-processing cores are integrated in the OS kernel as a real-time driver module, which is composed of two virtual devices of the clock processor and the frame processor tasks. The array controller with the RTOS realizes complicated operations easily, flexibly, and at a low cost.
      

      
      Hot Chips and Hot Interconnects for High End Computing Systems
      NASA Technical Reports Server (NTRS)
      Saini, Subhash
         2005-01-01
         I will discuss several processors: 1. The Cray proprietary processor used in the Cray X1; 2. The IBM Power 3 and Power 4 used in an IBM SP 3 and IBM SP 4 systems; 3. The Intel Itanium and Xeon, used in the SGI Altix systems and clusters respectively; 4. IBM System-on-a-Chip used in IBM BlueGene/L; 5. HP Alpha EV68 processor used in DOE ASCI Q cluster; 6. SPARC64 V processor, which is used in the Fujitsu PRIMEPOWER HPC2500; 7. An NEC proprietary processor, which is used in NEC SX-6/7; 8. Power 4+ processor, which is used in Hitachi SR11000; 9. NEC proprietary processor, which is used in Earth Simulator. The IBM POWER5 and Red Storm Computing Systems will also be discussed. The architectures of these processors will first be presented, followed by interconnection networks and a description of high-end computer systems based on these processors and networks. The performance of various hardware/programming model combinations will then be compared, based on latest NAS Parallel Benchmark results (MPI, OpenMP/HPF and hybrid (MPI + OpenMP). The tutorial will conclude with a discussion of general trends in the field of high performance computing, (quantum computing, DNA computing, cellular engineering, and neural networks).
      

      
      Cpu/gpu Computing for AN Implicit Multi-Block Compressible Navier-Stokes Solver on Heterogeneous Platform
      NASA Astrophysics Data System (ADS)
      Deng, Liang; Bai, Hanli; Wang, Fang; Xu, Qingxin
         2016-06-01
         CPU/GPU computing allows scientists to tremendously accelerate their numerical codes. In this paper, we port and optimize a double precision alternating direction implicit (ADI) solver for three-dimensional compressible Navier-Stokes equations from our in-house Computational Fluid Dynamics (CFD) software on heterogeneous platform. First, we implement a full GPU version of the ADI solver to remove a lot of redundant data transfers between CPU and GPU, and then design two fine-grain schemes, namely “one-thread-one-point” and “one-thread-one-line”, to maximize the performance. Second, we present a dual-level parallelization scheme using the CPU/GPU collaborative model to exploit the computational resources of both multi-core CPUs and many-core GPUs within the heterogeneous platform. Finally, considering the fact that memory on a single node becomes inadequate when the simulation size grows, we present a tri-level hybrid programming pattern MPI-OpenMP-CUDA that merges fine-grain parallelism using OpenMP and CUDA threads with coarse-grain parallelism using MPI for inter-node communication. We also propose a strategy to overlap the computation with communication using the advanced features of CUDA and MPI programming. We obtain speedups of 6.0 for the ADI solver on one Tesla M2050 GPU in contrast to two Xeon X5670 CPUs. Scalability tests show that our implementation can offer significant performance improvement on heterogeneous platform.
      

      
      Scalability of a Low-Cost Multi-Teraflop Linux Cluster for High-End Classical Atomistic and Quantum Mechanical Simulations
      NASA Technical Reports Server (NTRS)
      Kikuchi, Hideaki; Kalia, Rajiv K.; Nakano, Aiichiro; Vashishta, Priya; Shimojo, Fuyuki; Saini, Subhash
         2003-01-01
         Scalability of a low-cost, Intel Xeon-based, multi-Teraflop Linux cluster is tested for two high-end scientific applications: Classical atomistic simulation based on the molecular dynamics method and quantum mechanical calculation based on the density functional theory. These scalable parallel applications use space-time multiresolution algorithms and feature computational-space decomposition, wavelet-based adaptive load balancing, and spacefilling-curve-based data compression for scalable I/O. Comparative performance tests are performed on a 1,024-processor Linux cluster and a conventional higher-end parallel supercomputer, 1,184-processor IBM SP4. The results show that the performance of the Linux cluster is comparable to that of the SP4. We also study various effects, such as the sharing of memory and L2 cache among processors, on the performance.
      

      
      Tactical Operations Analysis Support Facility.
      DTIC Science & Technology
      
         1981-05-01
         Punch/Reader 2 DMC-11AR DDCMP Micro Processor 2 DMC-11DA Network Link Line Unit 2 DL-11E Async Serial Line Interface 4 Intel IN-1670 448K Words MOS Memory...86 5.3 VIRTUAL PROCESSORS - VAX-11/750 ........................... 89 5.4 A RELATIONAL DATA MANAGEMENT SYSTEM - ORACLE...Central Processing Unit (CPU) is a 16 bit processor for high-speed, real time applications, and for large multi-user, multi- task, time shared
      

      
      The Effect of NUMA Tunings on CPU Performance
      NASA Astrophysics Data System (ADS)
      Hollowell, Christopher; Caramarcu, Costin; Strecker-Kellogg, William; Wong, Antonio; Zaytsev, Alexandr
         2015-12-01
         Non-Uniform Memory Access (NUMA) is a memory architecture for symmetric multiprocessing (SMP) systems where each processor is directly connected to separate memory. Indirect access to other CPU's (remote) RAM is still possible, but such requests are slower as they must also pass through that memory's controlling CPU. In concert with a NUMA-aware operating system, the NUMA hardware architecture can help eliminate the memory performance reductions generally seen in SMP systems when multiple processors simultaneously attempt to access memory. The x86 CPU architecture has supported NUMA for a number of years. Modern operating systems such as Linux support NUMA-aware scheduling, where the OS attempts to schedule a process to the CPU directly attached to the majority of its RAM. In Linux, it is possible to further manually tune the NUMA subsystem using the numactl utility. With the release of Red Hat Enterprise Linux (RHEL) 6.3, the numad daemon became available in this distribution. This daemon monitors a system's NUMA topology and utilization, and automatically makes adjustments to optimize locality. As the number of cores in x86 servers continues to grow, efficient NUMA mappings of processes to CPUs/memory will become increasingly important. This paper gives a brief overview of NUMA, and discusses the effects of manual tunings and numad on the performance of the HEPSPEC06 benchmark, and ATLAS software.
      

      
      Static and Dynamic Frequency Scaling on Multicore CPUs
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Bao, Wenlei; Hong, Changwan; Chunduri, Sudheer
         2016-12-28
         Dynamic voltage and frequency scaling (DVFS) adapts CPU power consumption by modifying a processor’s operating frequency (and the associated voltage). Typical approaches employing DVFS involve default strategies such as running at the lowest or the highest frequency, or observing the CPU’s runtime behavior and dynamically adapting the voltage/frequency configuration based on CPU usage. In this paper, we argue that many previous approaches suffer from inherent limitations, such as not account- ing for processor-specific impact of frequency changes on energy for different workload types. We first propose a lightweight runtime-based approach to automatically adapt the frequency based on the CPU workload,more » that is agnostic of the processor characteristics. We then show that further improvements can be achieved for affine kernels in the application, using a compile-time characterization instead of run-time monitoring to select the frequency and number of CPU cores to use. Our framework relies on a one-time energy characterization of CPU-specific DVFS profiles followed by a compile-time categorization of loop-based code segments in the application. These are combined to determine a priori of the frequency and the number of cores to use to execute the application so as to optimize energy or energy-delay product, outperforming runtime approach. Extensive evaluation on 60 benchmarks and five multi-core CPUs show that our approach systematically outperforms the powersave Linux governor, while improving overall performance.« less
      

      
      A GPU Parallelization of the Absolute Nodal Coordinate Formulation for Applications in Flexible Multibody Dynamics
      DTIC Science & Technology
      
         2012-02-17
         to be solved. Disclaimer: Reference herein to any specific commercial company , product, process, or service by trade name, trademark...data processing rather than data caching and control flow. To make use of this computational power, NVIDIA introduced a general purpose parallel...GPU implementations were run on an Intel Nehalem Xeon E5520 2.26GHz processor with an NVIDIA Tesla C2070 graphics card for varying numbers of
      

      
      A parallel algorithm for the two-dimensional time fractional diffusion equation with implicit difference method.
      PubMed
      Gong, Chunye; Bao, Weimin; Tang, Guojian; Jiang, Yuewen; Liu, Jie
         2014-01-01
         It is very time consuming to solve fractional differential equations. The computational complexity of two-dimensional fractional differential equation (2D-TFDE) with iterative implicit finite difference method is O(M(x)M(y)N(2)). In this paper, we present a parallel algorithm for 2D-TFDE and give an in-depth discussion about this algorithm. A task distribution model and data layout with virtual boundary are designed for this parallel algorithm. The experimental results show that the parallel algorithm compares well with the exact solution. The parallel algorithm on single Intel Xeon X5540 CPU runs 3.16-4.17 times faster than the serial algorithm on single CPU core. The parallel efficiency of 81 processes is up to 88.24% compared with 9 processes on a distributed memory cluster system. We do think that the parallel computing technology will become a very basic method for the computational intensive fractional applications in the near future.
      

      
      Visual Media Reasoning - Terrain-based Geolocation
      DTIC Science & Technology
      
         2015-06-01
         the drawings, specifications, or other data does not license the holder or any other person or corporation ; or convey any rights or permission to...3.4 Alternative Metric Investigation This section describes a graphics processor unit (GPU) based implementation in the NVIDIA CUDA programming...utilizing 2 concurrent CPU cores, each controlling a single Nvidia C2075 Tesla Fermi CUDA card. Figure 22 shows a comparison of the CPU and the GPU powered
      

      
      Intel Xeon Phi accelerated Weather Research and Forecasting (WRF) Goddard microphysics scheme
      NASA Astrophysics Data System (ADS)
      Mielikainen, J.; Huang, B.; Huang, A. H.-L.
         2014-12-01
         The Weather Research and Forecasting (WRF) model is a numerical weather prediction system designed to serve both atmospheric research and operational forecasting needs. The WRF development is a done in collaboration around the globe. Furthermore, the WRF is used by academic atmospheric scientists, weather forecasters at the operational centers and so on. The WRF contains several physics components. The most time consuming one is the microphysics. One microphysics scheme is the Goddard cloud microphysics scheme. It is a sophisticated cloud microphysics scheme in the Weather Research and Forecasting (WRF) model. The Goddard microphysics scheme is very suitable for massively parallel computation as there are no interactions among horizontal grid points. Compared to the earlier microphysics schemes, the Goddard scheme incorporates a large number of improvements. Thus, we have optimized the Goddard scheme code. In this paper, we present our results of optimizing the Goddard microphysics scheme on Intel Many Integrated Core Architecture (MIC) hardware. The Intel Xeon Phi coprocessor is the first product based on Intel MIC architecture, and it consists of up to 61 cores connected by a high performance on-die bidirectional interconnect. The Intel MIC is capable of executing a full operating system and entire programs rather than just kernels as the GPU does. The MIC coprocessor supports all important Intel development tools. Thus, the development environment is one familiar to a vast number of CPU developers. Although, getting a maximum performance out of MICs will require using some novel optimization techniques. Those optimization techniques are discussed in this paper. The results show that the optimizations improved performance of Goddard microphysics scheme on Xeon Phi 7120P by a factor of 4.7×. In addition, the optimizations reduced the Goddard microphysics scheme's share of the total WRF processing time from 20.0 to 7.5%. Furthermore, the same optimizations improved performance on Intel Xeon E5-2670 by a factor of 2.8× compared to the original code.


   
       
            
              
          

«

5
      6
      7
   8
      9
      »

          
        

           
           
             
               
      
      Innovative HPC architectures for the study of planetary plasma environments
      NASA Astrophysics Data System (ADS)
      Amaya, Jorge; Wolf, Anna; Lembège, Bertrand; Zitz, Anke; Alvarez, Damian; Lapenta, Giovanni
         2016-04-01
         DEEP-ER is an European Commission founded project that develops a new type of High Performance Computer architecture. The revolutionary system is currently used by KU Leuven to study the effects of the solar wind on the global environments of the Earth and Mercury. The new architecture combines the versatility of Intel Xeon computing nodes with the power of the upcoming Intel Xeon Phi accelerators. Contrary to classical heterogeneous HPC architectures, where it is customary to find CPU and accelerators in the same computing nodes, in the DEEP-ER system CPU nodes are grouped together (Cluster) and independently from the accelerator nodes (Booster). The system is equipped with a state of the art interconnection network, a highly scalable and fast I/O and a fail recovery resiliency system. The final objective of the project is to introduce a scalable system that can be used to create the next generation of exascale supercomputers. The code iPic3D from KU Leuven is being adapted to this new architecture. This particle-in-cell code can now perform the computation of the electromagnetic fields in the Cluster while the particles are moved in the Booster side. Using fast and scalable Xeon Phi accelerators in the Booster we can introduce many more particles per cell in the simulation than what is possible in the current generation of HPC systems, allowing to calculate fully kinetic plasmas with very low interpolation noise. The system will be used to perform fully kinetic, low noise, 3D simulations of the interaction of the solar wind with the magnetosphere of the Earth and Mercury. Preliminary simulations have been performed in other HPC centers in order to compare the results in different systems. In this presentation we show the complexity of the plasma flow around the planets, including the development of hydrodynamic instabilities at the flanks, the presence of the collision-less shock, the magnetosheath, the magnetopause, reconnection zones, the formation of the plasma sheet and the magnetotail, and the variation of ion/electron plasma flows when crossing these frontiers. The simulations also give access to detailed information about the particle dynamics and their velocity distribution at locations that can be used for comparison with satellite data.
      

      
      Kalman Filter Tracking on Parallel Architectures
      NASA Astrophysics Data System (ADS)
      Cerati, Giuseppe; Elmer, Peter; Krutelyov, Slava; Lantz, Steven; Lefebvre, Matthieu; McDermott, Kevin; Riley, Daniel; Tadel, Matevž; Wittich, Peter; Würthwein, Frank; Yagil, Avi
         2016-11-01
         Power density constraints are limiting the performance improvements of modern CPUs. To address this we have seen the introduction of lower-power, multi-core processors such as GPGPU, ARM and Intel MIC. In order to achieve the theoretical performance gains of these processors, it will be necessary to parallelize algorithms to exploit larger numbers of lightweight cores and specialized functions like large vector units. Track finding and fitting is one of the most computationally challenging problems for event reconstruction in particle physics. At the High-Luminosity Large Hadron Collider (HL-LHC), for example, this will be by far the dominant problem. The need for greater parallelism has driven investigations of very different track finding techniques such as Cellular Automata or Hough Transforms. The most common track finding techniques in use today, however, are those based on a Kalman filter approach. Significant experience has been accumulated with these techniques on real tracking detector systems, both in the trigger and offline. They are known to provide high physics performance, are robust, and are in use today at the LHC. Given the utility of the Kalman filter in track finding, we have begun to port these algorithms to parallel architectures, namely Intel Xeon and Xeon Phi. We report here on our progress towards an end-to-end track reconstruction algorithm fully exploiting vectorization and parallelization techniques in a simplified experimental environment.
      

      
      Conversion of Mass Storage Hierarchy in an IBM Computer Network
      DTIC Science & Technology
      
         1989-03-01
         storage devices GUIDE IBM users’ group for DOS operating systems IBM International Business Machines IBM 370/145 CPU introduced in 1970 IBM 370/168 CPU...February 12, 1985, Information Systems Group, International Business Machines Corporation. "IBM 3090 Processor Complex" and 񓼪 Mass Storage System...34 Mainframe Journal, pp. 15-26, 64-65, Dallas, Texas, September-October 1987. 3. International Business Machines Corporation, Introduction to IBM 3S80 Storage
      

      
      Gigaflop performance on a CRAY-2: Multitasking a computational fluid dynamics application
      NASA Technical Reports Server (NTRS)
      Tennille, Geoffrey M.; Overman, Andrea L.; Lambiotte, Jules J.; Streett, Craig L.
         1991-01-01
         The methodology is described for converting a large, long-running applications code that executed on a single processor of a CRAY-2 supercomputer to a version that executed efficiently on multiple processors. Although the conversion of every application is different, a discussion of the types of modification used to achieve gigaflop performance is included to assist others in the parallelization of applications for CRAY computers, especially those that were developed for other computers. An existing application, from the discipline of computational fluid dynamics, that had utilized over 2000 hrs of CPU time on CRAY-2 during the previous year was chosen as a test case to study the effectiveness of multitasking on a CRAY-2. The nature of dominant calculations within the application indicated that a sustained computational rate of 1 billion floating-point operations per second, or 1 gigaflop, might be achieved. The code was first analyzed and modified for optimal performance on a single processor in a batch environment. After optimal performance on a single CPU was achieved, the code was modified to use multiple processors in a dedicated environment. The results of these two efforts were merged into a single code that had a sustained computational rate of over 1 gigaflop on a CRAY-2. Timings and analysis of performance are given for both single- and multiple-processor runs.
      

      
      Case for a field-programmable gate array multicore hybrid machine for an image-processing application
      NASA Astrophysics Data System (ADS)
      Rakvic, Ryan N.; Ives, Robert W.; Lira, Javier; Molina, Carlos
         2011-01-01
         General purpose computer designers have recently begun adding cores to their processors in order to increase performance. For example, Intel has adopted a homogeneous quad-core processor as a base for general purpose computing. PlayStation3 (PS3) game consoles contain a multicore heterogeneous processor known as the Cell, which is designed to perform complex image processing algorithms at a high level. Can modern image-processing algorithms utilize these additional cores? On the other hand, modern advancements in configurable hardware, most notably field-programmable gate arrays (FPGAs) have created an interesting question for general purpose computer designers. Is there a reason to combine FPGAs with multicore processors to create an FPGA multicore hybrid general purpose computer? Iris matching, a repeatedly executed portion of a modern iris-recognition algorithm, is parallelized on an Intel-based homogeneous multicore Xeon system, a heterogeneous multicore Cell system, and an FPGA multicore hybrid system. Surprisingly, the cheaper PS3 slightly outperforms the Intel-based multicore on a core-for-core basis. However, both multicore systems are beaten by the FPGA multicore hybrid system by >50%.
      

      
      MSTor: A program for calculating partition functions, free energies, enthalpies, entropies, and heat capacities of complex molecules including torsional anharmonicity
      NASA Astrophysics Data System (ADS)
      Zheng, Jingjing; Mielke, Steven L.; Clarkson, Kenneth L.; Truhlar, Donald G.
         2012-08-01
         We present a Fortran program package, MSTor, which calculates partition functions and thermodynamic functions of complex molecules involving multiple torsional motions by the recently proposed MS-T method. This method interpolates between the local harmonic approximation in the low-temperature limit, and the limit of free internal rotation of all torsions at high temperature. The program can also carry out calculations in the multiple-structure local harmonic approximation. The program package also includes six utility codes that can be used as stand-alone programs to calculate reduced moment of inertia matrices by the method of Kilpatrick and Pitzer, to generate conformational structures, to calculate, either analytically or by Monte Carlo sampling, volumes for torsional subdomains defined by Voronoi tessellation of the conformational subspace, to generate template input files, and to calculate one-dimensional torsional partition functions using the torsional eigenvalue summation method. Catalogue identifier: AEMF_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEMF_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 77 434 No. of bytes in distributed program, including test data, etc.: 3 264 737 Distribution format: tar.gz Programming language: Fortran 90, C, and Perl Computer: Itasca (HP Linux cluster, each node has two-socket, quad-core 2.8 GHz Intel Xeon X5560 “Nehalem EP” processors), Calhoun (SGI Altix XE 1300 cluster, each node containing two quad-core 2.66 GHz Intel Xeon “Clovertown”-class processors sharing 16 GB of main memory), Koronis (Altix UV 1000 server with 190 6-core Intel Xeon X7542 “Westmere” processors at 2.66 GHz), Elmo (Sun Fire X4600 Linux cluster with AMD Opteron cores), and Mac Pro (two 2.8 GHz Quad-core Intel Xeon processors) Operating system: Linux/Unix/Mac OS RAM: 2 Mbytes Classification: 16.3, 16.12, 23 Nature of problem: Calculation of the partition functions and thermodynamic functions (standard-state energy, enthalpy, entropy, and free energy as functions of temperatures) of complex molecules involving multiple torsional motions. Solution method: The multi-structural approximation with torsional anharmonicity (MS-T). The program also provides results for the multi-structural local harmonic approximation [1]. Restrictions: There is no limit on the number of torsions that can be included in either the Voronoi calculation or the full MS-T calculation. In practice, the range of problems that can be addressed with the present method consists of all multi-torsional problems for which one can afford to calculate all the conformations and their frequencies. Unusual features: The method can be applied to transition states as well as stable molecules. The program package also includes the hull program for the calculation of Voronoi volumes and six utility codes that can be used as stand-alone programs to calculate reduced moment-of-inertia matrices by the method of Kilpatrick and Pitzer, to generate conformational structures, to calculate, either analytically or by Monte Carlo sampling, volumes for torsional subdomain defined by Voronoi tessellation of the conformational subspace, to generate template input files, and to calculate one-dimensional torsional partition functions using the torsional eigenvalue summation method. Additional comments: The program package includes a manual, installation script, and input and output files for a test suite. Running time: There are 24 test runs. The running time of the test runs on a single processor of the Itasca computer is less than 2 seconds. J. Zheng, T. Yu, E. Papajak, I.M. Alecu, S.L. Mielke, D.G. Truhlar, Practical methods for including torsional anharmonicity in thermochemical calculations of complex molecules: The internal-coordinate multi-structural approximation, Phys. Chem. Chem. Phys. 13 (2011) 10885-10907.
      

      
      A Programming Framework for Scientific Applications on CPU-GPU Systems
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Owens, John
         2013-03-24
         At a high level, my research interests center around designing, programming, and evaluating computer systems that use new approaches to solve interesting problems. The rapid change of technology allows a variety of different architectural approaches to computationally difficult problems, and a constantly shifting set of constraints and trends makes the solutions to these problems both challenging and interesting. One of the most important recent trends in computing has been a move to commodity parallel architectures. This sea change is motivated by the industry’s inability to continue to profitably increase performance on a single processor and instead to move to multiplemore » parallel processors. In the period of review, my most significant work has been leading a research group looking at the use of the graphics processing unit (GPU) as a general-purpose processor. GPUs can potentially deliver superior performance on a broad range of problems than their CPU counterparts, but effectively mapping complex applications to a parallel programming model with an emerging programming environment is a significant and important research problem.« less
      

      
      Power Aware Distributed Systems
      DTIC Science & Technology
      
         2004-01-01
         detection or threshold functions to trigger the main CPU. The main processor can sleep and either wakeup on a schedule or by a positive threshold event...the RTOS must determine if wake-up latency can be tolerated (or, if it could be hidden by pre- wakeup ). The prediction accuracy for scheduling ...and processor shutdown/ wakeup . This analysis can be used to accurately analyze the schedulability of non-concrete periodic task sets, scheduled using
      

      
      The Advanced Communication Technology Satellite and ISDN
      NASA Technical Reports Server (NTRS)
      Lowry, Peter A.
         1996-01-01
         This paper depicts the Advanced Communication Technology Satellite (ACTS) system as a global central office switch. The ground portion of the system is the collection of earth stations or T1-VSAT's (T1 very small aperture terminals). The control software for the T1-VSAT's resides in a single CPU. The software consists of two modules, the modem manager and the call manager. The modem manager (MM) controls the RF modem portion of the T1-VSAT. It processes the orderwires from the satellite or from signaling generated by the call manager (CM). The CM controls the Recom Laboratories MSPs by receiving signaling messages from the stacked MSP shelves ro units and sending appropriate setup commands to them. There are two methods used to setup and process calls in the CM; first by dialing up a circuit using a standard telephone handset or, secondly by using an external processor connected to the CPU's second COM port, by sending and receiving signaling orderwires. It is the use of the external processor which permits the ISDN (Integrated Services Digital Network) Signaling Processor to implement ISDN calls. In August 1993, the initial testing of the ISDN Signaling Processor was carried out at ACTS System Test at Lockheed Marietta, Princeton, NJ using the spacecraft in its test configuration on the ground.
      

      
      Underwater Threat Source Localization: Processing Sensor Network TDOAs with a Terascale Optical Core Device
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Barhen, Jacob; Imam, Neena
         2007-01-01
         Revolutionary computing technologies are defined in terms of technological breakthroughs, which leapfrog over near-term projected advances in conventional hardware and software to produce paradigm shifts in computational science. For underwater threat source localization using information provided by a dynamical sensor network, one of the most promising computational advances builds upon the emergence of digital optical-core devices. In this article, we present initial results of sensor network calculations that focus on the concept of signal wavefront time-difference-of-arrival (TDOA). The corresponding algorithms are implemented on the EnLight processing platform recently introduced by Lenslet Laboratories. This tera-scale digital optical core processor is optimizedmore » for array operations, which it performs in a fixed-point-arithmetic architecture. Our results (i) illustrate the ability to reach the required accuracy in the TDOA computation, and (ii) demonstrate that a considerable speed-up can be achieved when using the EnLight 64a prototype processor as compared to a dual Intel XeonTM processor.« less
      

      
      
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Lyakh, Dmitry I.
         
         An efficient parallel tensor transpose algorithm is suggested for shared-memory computing units, namely, multicore CPU, Intel Xeon Phi, and NVidia GPU. The algorithm operates on dense tensors (multidimensional arrays) and is based on the optimization of cache utilization on x86 CPU and the use of shared memory on NVidia GPU. From the applied side, the ultimate goal is to minimize the overhead encountered in the transformation of tensor contractions into matrix multiplications in computer implementations of advanced methods of quantum many-body theory (e.g., in electronic structure theory and nuclear physics). A particular accent is made on higher-dimensional tensors that typicallymore » appear in the so-called multireference correlated methods of electronic structure theory. Depending on tensor dimensionality, the presented optimized algorithms can achieve an order of magnitude speedup on x86 CPUs and 2-3 times speedup on NVidia Tesla K20X GPU with respect to the na ve scattering algorithm (no memory access optimization). Furthermore, the tensor transpose routines developed in this work have been incorporated into a general-purpose tensor algebra library (TAL-SH).« less
      

      
      Speeding-up Bioinformatics Algorithms with Heterogeneous Architectures: Highly Heterogeneous Smith-Waterman (HHeterSW).
      PubMed
      Gálvez, Sergio; Ferusic, Adis; Esteban, Francisco J; Hernández, Pilar; Caballero, Juan A; Dorado, Gabriel
         2016-10-01
         The Smith-Waterman algorithm has a great sensitivity when used for biological sequence-database searches, but at the expense of high computing-power requirements. To overcome this problem, there are implementations in literature that exploit the different hardware-architectures available in a standard PC, such as GPU, CPU, and coprocessors. We introduce an application that splits the original database-search problem into smaller parts, resolves each of them by executing the most efficient implementations of the Smith-Waterman algorithms in different hardware architectures, and finally unifies the generated results. Using non-overlapping hardware allows simultaneous execution, and up to 2.58-fold performance gain, when compared with any other algorithm to search sequence databases. Even the performance of the popular BLAST heuristic is exceeded in 78% of the tests. The application has been tested with standard hardware: Intel i7-4820K CPU, Intel Xeon Phi 31S1P coprocessors, and nVidia GeForce GTX 960 graphics cards. An important increase in performance has been obtained in a wide range of situations, effectively exploiting the available hardware.
      

      
      GeauxDock: Accelerating Structure-Based Virtual Screening with Heterogeneous Computing
      PubMed Central
      Fang, Ye; Ding, Yun; Feinstein, Wei P.; Koppelman, David M.; Moreno, Juana; Jarrell, Mark; Ramanujam, J.; Brylinski, Michal
         2016-01-01
         Computational modeling of drug binding to proteins is an integral component of direct drug design. Particularly, structure-based virtual screening is often used to perform large-scale modeling of putative associations between small organic molecules and their pharmacologically relevant protein targets. Because of a large number of drug candidates to be evaluated, an accurate and fast docking engine is a critical element of virtual screening. Consequently, highly optimized docking codes are of paramount importance for the effectiveness of virtual screening methods. In this communication, we describe the implementation, tuning and performance characteristics of GeauxDock, a recently developed molecular docking program. GeauxDock is built upon the Monte Carlo algorithm and features a novel scoring function combining physics-based energy terms with statistical and knowledge-based potentials. Developed specifically for heterogeneous computing platforms, the current version of GeauxDock can be deployed on modern, multi-core Central Processing Units (CPUs) as well as massively parallel accelerators, Intel Xeon Phi and NVIDIA Graphics Processing Unit (GPU). First, we carried out a thorough performance tuning of the high-level framework and the docking kernel to produce a fast serial code, which was then ported to shared-memory multi-core CPUs yielding a near-ideal scaling. Further, using Xeon Phi gives 1.9× performance improvement over a dual 10-core Xeon CPU, whereas the best GPU accelerator, GeForce GTX 980, achieves a speedup as high as 3.5×. On that account, GeauxDock can take advantage of modern heterogeneous architectures to considerably accelerate structure-based virtual screening applications. GeauxDock is open-sourced and publicly available at www.brylinski.org/geauxdock and https://figshare.com/articles/geauxdock_tar_gz/3205249. PMID:27420300
      

      
      Intel Many Integrated Core (MIC) architecture optimization strategies for a memory-bound Weather Research and Forecasting (WRF) Goddard microphysics scheme
      NASA Astrophysics Data System (ADS)
      Mielikainen, Jarno; Huang, Bormin; Huang, Allen H.
         2014-10-01
         The Goddard cloud microphysics scheme is a sophisticated cloud microphysics scheme in the Weather Research and Forecasting (WRF) model. The WRF is a widely used weather prediction system in the world. It development is a done in collaborative around the globe. The Goddard microphysics scheme is very suitable for massively parallel computation as there are no interactions among horizontal grid points. Compared to the earlier microphysics schemes, the Goddard scheme incorporates a large number of improvements. Thus, we have optimized the code of this important part of WRF. In this paper, we present our results of optimizing the Goddard microphysics scheme on Intel Many Integrated Core Architecture (MIC) hardware. The Intel Xeon Phi coprocessor is the first product based on Intel MIC architecture, and it consists of up to 61 cores connected by a high performance on-die bidirectional interconnect. The Intel MIC is capable of executing a full operating system and entire programs rather than just kernels as the GPU do. The MIC coprocessor supports all important Intel development tools. Thus, the development environment is familiar one to a vast number of CPU developers. Although, getting a maximum performance out of MICs will require using some novel optimization techniques. Those optimization techniques are discusses in this paper. The results show that the optimizations improved performance of the original code on Xeon Phi 7120P by a factor of 4.7x. Furthermore, the same optimizations improved performance on a dual socket Intel Xeon E5-2670 system by a factor of 2.8x compared to the original code.
      

      
      GeauxDock: Accelerating Structure-Based Virtual Screening with Heterogeneous Computing.
      PubMed
      Fang, Ye; Ding, Yun; Feinstein, Wei P; Koppelman, David M; Moreno, Juana; Jarrell, Mark; Ramanujam, J; Brylinski, Michal
         2016-01-01
         Computational modeling of drug binding to proteins is an integral component of direct drug design. Particularly, structure-based virtual screening is often used to perform large-scale modeling of putative associations between small organic molecules and their pharmacologically relevant protein targets. Because of a large number of drug candidates to be evaluated, an accurate and fast docking engine is a critical element of virtual screening. Consequently, highly optimized docking codes are of paramount importance for the effectiveness of virtual screening methods. In this communication, we describe the implementation, tuning and performance characteristics of GeauxDock, a recently developed molecular docking program. GeauxDock is built upon the Monte Carlo algorithm and features a novel scoring function combining physics-based energy terms with statistical and knowledge-based potentials. Developed specifically for heterogeneous computing platforms, the current version of GeauxDock can be deployed on modern, multi-core Central Processing Units (CPUs) as well as massively parallel accelerators, Intel Xeon Phi and NVIDIA Graphics Processing Unit (GPU). First, we carried out a thorough performance tuning of the high-level framework and the docking kernel to produce a fast serial code, which was then ported to shared-memory multi-core CPUs yielding a near-ideal scaling. Further, using Xeon Phi gives 1.9× performance improvement over a dual 10-core Xeon CPU, whereas the best GPU accelerator, GeForce GTX 980, achieves a speedup as high as 3.5×. On that account, GeauxDock can take advantage of modern heterogeneous architectures to considerably accelerate structure-based virtual screening applications. GeauxDock is open-sourced and publicly available at www.brylinski.org/geauxdock and https://figshare.com/articles/geauxdock_tar_gz/3205249.
      

      
      Interactive high-resolution isosurface ray casting on multicore processors.
      PubMed
      Wang, Qin; JaJa, Joseph
         2008-01-01
         We present a new method for the interactive rendering of isosurfaces using ray casting on multi-core processors. This method consists of a combination of an object-order traversal that coarsely identifies possible candidate 3D data blocks for each small set of contiguous pixels, and an isosurface ray casting strategy tailored for the resulting limited-size lists of candidate 3D data blocks. While static screen partitioning is widely used in the literature, our scheme performs dynamic allocation of groups of ray casting tasks to ensure almost equal loads among the different threads running on multi-cores while maintaining spatial locality. We also make careful use of memory management environment commonly present in multi-core processors. We test our system on a two-processor Clovertown platform, each consisting of a Quad-Core 1.86-GHz Intel Xeon Processor, for a number of widely different benchmarks. The detailed experimental results show that our system is efficient and scalable, and achieves high cache performance and excellent load balancing, resulting in an overall performance that is superior to any of the previous algorithms. In fact, we achieve an interactive isosurface rendering on a 1024(2) screen for all the datasets tested up to the maximum size of the main memory of our platform.
      

      
      CPU-GPU hybrid accelerating the Zuker algorithm for RNA secondary structure prediction applications.
      PubMed
      Lei, Guoqing; Dou, Yong; Wan, Wen; Xia, Fei; Li, Rongchun; Ma, Meng; Zou, Dan
         2012-01-01
         Prediction of ribonucleic acid (RNA) secondary structure remains one of the most important research areas in bioinformatics. The Zuker algorithm is one of the most popular methods of free energy minimization for RNA secondary structure prediction. Thus far, few studies have been reported on the acceleration of the Zuker algorithm on general-purpose processors or on extra accelerators such as Field Programmable Gate-Array (FPGA) and Graphics Processing Units (GPU). To the best of our knowledge, no implementation combines both CPU and extra accelerators, such as GPUs, to accelerate the Zuker algorithm applications. In this paper, a CPU-GPU hybrid computing system that accelerates Zuker algorithm applications for RNA secondary structure prediction is proposed. The computing tasks are allocated between CPU and GPU for parallel cooperate execution. Performance differences between the CPU and the GPU in the task-allocation scheme are considered to obtain workload balance. To improve the hybrid system performance, the Zuker algorithm is optimally implemented with special methods for CPU and GPU architecture. Speedup of 15.93× over optimized multi-core SIMD CPU implementation and performance advantage of 16% over optimized GPU implementation are shown in the experimental results. More than 14% of the sequences are executed on CPU in the hybrid system. The system combining CPU and GPU to accelerate the Zuker algorithm is proven to be promising and can be applied to other bioinformatics applications.
      

      
      Accelerating Large Scale Image Analyses on Parallel, CPU-GPU Equipped Systems
      PubMed Central
      Teodoro, George; Kurc, Tahsin M.; Pan, Tony; Cooper, Lee A.D.; Kong, Jun; Widener, Patrick; Saltz, Joel H.
         2014-01-01
         The past decade has witnessed a major paradigm shift in high performance computing with the introduction of accelerators as general purpose processors. These computing devices make available very high parallel computing power at low cost and power consumption, transforming current high performance platforms into heterogeneous CPU-GPU equipped systems. Although the theoretical performance achieved by these hybrid systems is impressive, taking practical advantage of this computing power remains a very challenging problem. Most applications are still deployed to either GPU or CPU, leaving the other resource under- or un-utilized. In this paper, we propose, implement, and evaluate a performance aware scheduling technique along with optimizations to make efficient collaborative use of CPUs and GPUs on a parallel system. In the context of feature computations in large scale image analysis applications, our evaluations show that intelligently co-scheduling CPUs and GPUs can significantly improve performance over GPU-only or multi-core CPU-only approaches. PMID:25419545
      

      
      Application of Intel Many Integrated Core (MIC) architecture to the Yonsei University planetary boundary layer scheme in Weather Research and Forecasting model
      NASA Astrophysics Data System (ADS)
      Huang, Melin; Huang, Bormin; Huang, Allen H.
         2014-10-01
         The Weather Research and Forecasting (WRF) model provided operational services worldwide in many areas and has linked to our daily activity, in particular during severe weather events. The scheme of Yonsei University (YSU) is one of planetary boundary layer (PBL) models in WRF. The PBL is responsible for vertical sub-grid-scale fluxes due to eddy transports in the whole atmospheric column, determines the flux profiles within the well-mixed boundary layer and the stable layer, and thus provide atmospheric tendencies of temperature, moisture (including clouds), and horizontal momentum in the entire atmospheric column. The YSU scheme is very suitable for massively parallel computation as there are no interactions among horizontal grid points. To accelerate the computation process of the YSU scheme, we employ Intel Many Integrated Core (MIC) Architecture as it is a multiprocessor computer structure with merits of efficient parallelization and vectorization essentials. Our results show that the MIC-based optimization improved the performance of the first version of multi-threaded code on Xeon Phi 5110P by a factor of 2.4x. Furthermore, the same CPU-based optimizations improved the performance on Intel Xeon E5-2603 by a factor of 1.6x as compared to the first version of multi-threaded code.
      

      
      Heat dissipation for microprocessor using multiwalled carbon nanotubes based liquid.
      PubMed
      Hung Thang, Bui; Trinh, Pham Van; Chuc, Nguyen Van; Khoi, Phan Hong; Minh, Phan Ngoc
         2013-01-01
         Carbon nanotubes (CNTs) are one of the most valuable materials with high thermal conductivity (2000 W/m · K compared with thermal conductivity of Ag 419 W/m · K). This suggested an approach in applying the CNTs in thermal dissipation system for high power electronic devices, such as computer processor and high brightness light emitting diode (HB-LED). In this work, multiwalled carbon nanotubes (MWCNTs) based liquid was made by COOH functionalized MWCNTs dispersed in distilled water with concentration in the range between 0.2 and 1.2 gram/liter. MWCNT based liquid was used in liquid cooling system to enhance thermal dissipation for computer processor. By using distilled water in liquid cooling system, CPU's temperature decreases by about 10°C compared with using fan cooling system. By using MWCNT liquid with concentration of 1 gram/liter MWCNTs, the CPU's temperature decreases by 7°C compared with using distilled water in cooling system. Theoretically, we also showed that the presence of MWCNTs reduced thermal resistance and increased the thermal conductivity of liquid cooling system. The results have confirmed the advantages of the MWCNTs for thermal dissipation systems for the μ -processor and other high power electronic devices.
      

        
       
          

«

5
      6
      7
   8
      9
      »

          
        

     

   

   
       
            
              
          

«

6
      7
      8
   9
      10
      »

          
        

           
           
             
               
      
      3D Kirchhoff depth migration algorithm: A new scalable approach for parallelization on multicore CPU based cluster
      NASA Astrophysics Data System (ADS)
      Rastogi, Richa; Londhe, Ashutosh; Srivastava, Abhishek; Sirasala, Kirannmayi M.; Khonde, Kiran
         2017-03-01
         In this article, a new scalable 3D Kirchhoff depth migration algorithm is presented on state of the art multicore CPU based cluster. Parallelization of 3D Kirchhoff depth migration is challenging due to its high demand of compute time, memory, storage and I/O along with the need of their effective management. The most resource intensive modules of the algorithm are traveltime calculations and migration summation which exhibit an inherent trade off between compute time and other resources. The parallelization strategy of the algorithm largely depends on the storage of calculated traveltimes and its feeding mechanism to the migration process. The presented work is an extension of our previous work, wherein a 3D Kirchhoff depth migration application for multicore CPU based parallel system had been developed. Recently, we have worked on improving parallel performance of this application by re-designing the parallelization approach. The new algorithm is capable to efficiently migrate both prestack and poststack 3D data. It exhibits flexibility for migrating large number of traces within the available node memory and with minimal requirement of storage, I/O and inter-node communication. The resultant application is tested using 3D Overthrust data on PARAM Yuva II, which is a Xeon E5-2670 based multicore CPU cluster with 16 cores/node and 64 GB shared memory. Parallel performance of the algorithm is studied using different numerical experiments and the scalability results show striking improvement over its previous version. An impressive 49.05X speedup with 76.64% efficiency is achieved for 3D prestack data and 32.00X speedup with 50.00% efficiency for 3D poststack data, using 64 nodes. The results also demonstrate the effectiveness and robustness of the improved algorithm with high scalability and efficiency on a multicore CPU cluster.
      

      
      SU-E-T-423: Fast Photon Convolution Calculation with a 3D-Ideal Kernel On the GPU
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Moriya, S; Sato, M; Tachibana, H
         
         Purpose: The calculation time is a trade-off for improving the accuracy of convolution dose calculation with fine calculation spacing of the KERMA kernel. We investigated to accelerate the convolution calculation using an ideal kernel on the Graphic Processing Units (GPU). Methods: The calculation was performed on the AMD graphics hardware of Dual FirePro D700 and our algorithm was implemented using the Aparapi that convert Java bytecode to OpenCL. The process of dose calculation was separated with the TERMA and KERMA steps. The dose deposited at the coordinate (x, y, z) was determined in the process. In the dose calculation runningmore » on the central processing unit (CPU) of Intel Xeon E5, the calculation loops were performed for all calculation points. On the GPU computation, all of the calculation processes for the points were sent to the GPU and the multi-thread computation was done. In this study, the dose calculation was performed in a water equivalent homogeneous phantom with 150{sup 3} voxels (2 mm calculation grid) and the calculation speed on the GPU to that on the CPU and the accuracy of PDD were compared. Results: The calculation time for the GPU and the CPU were 3.3 sec and 4.4 hour, respectively. The calculation speed for the GPU was 4800 times faster than that for the CPU. The PDD curve for the GPU was perfectly matched to that for the CPU. Conclusion: The convolution calculation with the ideal kernel on the GPU was clinically acceptable for time and may be more accurate in an inhomogeneous region. Intensity modulated arc therapy needs dose calculations for different gantry angles at many control points. Thus, it would be more practical that the kernel uses a coarse spacing technique if the calculation is faster while keeping the similar accuracy to a current treatment planning system.« less
      

      
      Kalman Filter Tracking on Parallel Architectures
      NASA Astrophysics Data System (ADS)
      Cerati, Giuseppe; Elmer, Peter; Lantz, Steven; McDermott, Kevin; Riley, Dan; Tadel, Matevž; Wittich, Peter; Würthwein, Frank; Yagil, Avi
         2015-12-01
         Power density constraints are limiting the performance improvements of modern CPUs. To address this we have seen the introduction of lower-power, multi-core processors, but the future will be even more exciting. In order to stay within the power density limits but still obtain Moore's Law performance/price gains, it will be necessary to parallelize algorithms to exploit larger numbers of lightweight cores and specialized functions like large vector units. Example technologies today include Intel's Xeon Phi and GPGPUs. Track finding and fitting is one of the most computationally challenging problems for event reconstruction in particle physics. At the High Luminosity LHC, for example, this will be by far the dominant problem. The need for greater parallelism has driven investigations of very different track finding techniques including Cellular Automata or returning to Hough Transform. The most common track finding techniques in use today are however those based on the Kalman Filter [2]. Significant experience has been accumulated with these techniques on real tracking detector systems, both in the trigger and offline. They are known to provide high physics performance, are robust and are exactly those being used today for the design of the tracking system for HL-LHC. Our previous investigations showed that, using optimized data structures, track fitting with Kalman Filter can achieve large speedup both with Intel Xeon and Xeon Phi. We report here our further progress towards an end-to-end track reconstruction algorithm fully exploiting vectorization and parallelization techniques in a realistic simulation setup.
      

      
      Advanced Edit System.
      DTIC Science & Technology
      
         1983-01-01
         MFR Model Computer Subsystem 1. Cabinet 0, PDP-11/70 CPU with 11/70 CPU, and Floating point processor DEC 11/79-UK 2. Cabinet 1, with SDLC ... software T-square. o Unit lock causes a user-defined roundoff factor to be applied to all points selected with the cursor. V - 1 0 Grid lock...1 NL • • 1 I i * v • _ • _ . *. . - m m I 1 3 I = K» lää 12.2 1.1 2.0 1.8 1.25 11.4 Ho EJ V Ml ^"OPY RESOLUTION
      

      
      A High Performance Block Eigensolver for Nuclear Configuration Interaction Calculations
      DOE PAGES
      Aktulga, Hasan Metin; Afibuzzaman, Md.; Williams, Samuel; ...
         2017-06-01
         As on-node parallelism increases and the performance gap between the processor and the memory system widens, achieving high performance in large-scale scientific applications requires an architecture-aware design of algorithms and solvers. We focus on the eigenvalue problem arising in nuclear Configuration Interaction (CI) calculations, where a few extreme eigenpairs of a sparse symmetric matrix are needed. Here, we consider a block iterative eigensolver whose main computational kernels are the multiplication of a sparse matrix with multiple vectors (SpMM), and tall-skinny matrix operations. We then present techniques to significantly improve the SpMM and the transpose operation SpMM T by using themore » compressed sparse blocks (CSB) format. We achieve 3-4× speedup on the requisite operations over good implementations with the commonly used compressed sparse row (CSR) format. We develop a performance model that allows us to correctly estimate the performance of our SpMM kernel implementations, and we identify cache bandwidth as a potential performance bottleneck beyond DRAM. We also analyze and optimize the performance of LOBPCG kernels (inner product and linear combinations on multiple vectors) and show up to 15× speedup over using high performance BLAS libraries for these operations. The resulting high performance LOBPCG solver achieves 1.4× to 1.8× speedup over the existing Lanczos solver on a series of CI computations on high-end multicore architectures (Intel Xeons). We also analyze the performance of our techniques on an Intel Xeon Phi Knights Corner (KNC) processor.« less
      

      
      A High Performance Block Eigensolver for Nuclear Configuration Interaction Calculations
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Aktulga, Hasan Metin; Afibuzzaman, Md.; Williams, Samuel
         
         As on-node parallelism increases and the performance gap between the processor and the memory system widens, achieving high performance in large-scale scientific applications requires an architecture-aware design of algorithms and solvers. We focus on the eigenvalue problem arising in nuclear Configuration Interaction (CI) calculations, where a few extreme eigenpairs of a sparse symmetric matrix are needed. Here, we consider a block iterative eigensolver whose main computational kernels are the multiplication of a sparse matrix with multiple vectors (SpMM), and tall-skinny matrix operations. We then present techniques to significantly improve the SpMM and the transpose operation SpMM T by using themore » compressed sparse blocks (CSB) format. We achieve 3-4× speedup on the requisite operations over good implementations with the commonly used compressed sparse row (CSR) format. We develop a performance model that allows us to correctly estimate the performance of our SpMM kernel implementations, and we identify cache bandwidth as a potential performance bottleneck beyond DRAM. We also analyze and optimize the performance of LOBPCG kernels (inner product and linear combinations on multiple vectors) and show up to 15× speedup over using high performance BLAS libraries for these operations. The resulting high performance LOBPCG solver achieves 1.4× to 1.8× speedup over the existing Lanczos solver on a series of CI computations on high-end multicore architectures (Intel Xeons). We also analyze the performance of our techniques on an Intel Xeon Phi Knights Corner (KNC) processor.« less
      

      
      CPU-GPU hybrid accelerating the Zuker algorithm for RNA secondary structure prediction applications
      PubMed Central
      
         2012-01-01
         Background Prediction of ribonucleic acid (RNA) secondary structure remains one of the most important research areas in bioinformatics. The Zuker algorithm is one of the most popular methods of free energy minimization for RNA secondary structure prediction. Thus far, few studies have been reported on the acceleration of the Zuker algorithm on general-purpose processors or on extra accelerators such as Field Programmable Gate-Array (FPGA) and Graphics Processing Units (GPU). To the best of our knowledge, no implementation combines both CPU and extra accelerators, such as GPUs, to accelerate the Zuker algorithm applications. Results In this paper, a CPU-GPU hybrid computing system that accelerates Zuker algorithm applications for RNA secondary structure prediction is proposed. The computing tasks are allocated between CPU and GPU for parallel cooperate execution. Performance differences between the CPU and the GPU in the task-allocation scheme are considered to obtain workload balance. To improve the hybrid system performance, the Zuker algorithm is optimally implemented with special methods for CPU and GPU architecture. Conclusions Speedup of 15.93× over optimized multi-core SIMD CPU implementation and performance advantage of 16% over optimized GPU implementation are shown in the experimental results. More than 14% of the sequences are executed on CPU in the hybrid system. The system combining CPU and GPU to accelerate the Zuker algorithm is proven to be promising and can be applied to other bioinformatics applications. PMID:22369626
      

      
      Automated high-dose rate brachytherapy treatment planning for a single-channel vaginal cylinder applicator
      NASA Astrophysics Data System (ADS)
      Zhou, Yuhong; Klages, Peter; Tan, Jun; Chi, Yujie; Stojadinovic, Strahinja; Yang, Ming; Hrycushko, Brian; Medin, Paul; Pompos, Arnold; Jiang, Steve; Albuquerque, Kevin; Jia, Xun
         2017-06-01
         High dose rate (HDR) brachytherapy treatment planning is conventionally performed manually and/or with aids of preplanned templates. In general, the standard of care would be elevated by conducting an automated process to improve treatment planning efficiency, eliminate human error, and reduce plan quality variations. Thus, our group is developing AutoBrachy, an automated HDR brachytherapy planning suite of modules used to augment a clinical treatment planning system. This paper describes our proof-of-concept module for vaginal cylinder HDR planning that has been fully developed. After a patient CT scan is acquired, the cylinder applicator is automatically segmented using image-processing techniques. The target CTV is generated based on physician-specified treatment depth and length. Locations of the dose calculation point, apex point and vaginal surface point, as well as the central applicator channel coordinates, and the corresponding dwell positions are determined according to their geometric relationship with the applicator and written to a structure file. Dwell times are computed through iterative quadratic optimization techniques. The planning information is then transferred to the treatment planning system through a DICOM-RT interface. The entire process was tested for nine patients. The AutoBrachy cylindrical applicator module was able to generate treatment plans for these cases with clinical grade quality. Computation times varied between 1 and 3 min on an Intel Xeon CPU E3-1226 v3 processor. All geometric components in the automated treatment plans were generated accurately. The applicator channel tip positions agreed with the manually identified positions with submillimeter deviations and the channel orientations between the plans agreed within less than 1 degree. The automatically generated plans obtained clinically acceptable quality.
      

      
      Efficient molecular dynamics simulations with many-body potentials on graphics processing units
      NASA Astrophysics Data System (ADS)
      Fan, Zheyong; Chen, Wei; Vierimaa, Ville; Harju, Ari
         2017-09-01
         Graphics processing units have been extensively used to accelerate classical molecular dynamics simulations. However, there is much less progress on the acceleration of force evaluations for many-body potentials compared to pairwise ones. In the conventional force evaluation algorithm for many-body potentials, the force, virial stress, and heat current for a given atom are accumulated within different loops, which could result in write conflict between different threads in a CUDA kernel. In this work, we provide a new force evaluation algorithm, which is based on an explicit pairwise force expression for many-body potentials derived recently (Fan et al., 2015). In our algorithm, the force, virial stress, and heat current for a given atom can be accumulated within a single thread and is free of write conflicts. We discuss the formulations and algorithms and evaluate their performance. A new open-source code, GPUMD, is developed based on the proposed formulations. For the Tersoff many-body potential, the double precision performance of GPUMD using a Tesla K40 card is equivalent to that of the LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator) molecular dynamics code running with about 100 CPU cores (Intel Xeon CPU X5670 @ 2.93 GHz).
      

      
      Face classification using electronic synapses
      NASA Astrophysics Data System (ADS)
      Yao, Peng; Wu, Huaqiang; Gao, Bin; Eryilmaz, Sukru Burc; Huang, Xueyao; Zhang, Wenqiang; Zhang, Qingtian; Deng, Ning; Shi, Luping; Wong, H.-S. Philip; Qian, He
         2017-05-01
         Conventional hardware platforms consume huge amount of energy for cognitive learning due to the data movement between the processor and the off-chip memory. Brain-inspired device technologies using analogue weight storage allow to complete cognitive tasks more efficiently. Here we present an analogue non-volatile resistive memory (an electronic synapse) with foundry friendly materials. The device shows bidirectional continuous weight modulation behaviour. Grey-scale face classification is experimentally demonstrated using an integrated 1024-cell array with parallel online training. The energy consumption within the analogue synapses for each iteration is 1,000 × (20 ×) lower compared to an implementation using Intel Xeon Phi processor with off-chip memory (with hypothetical on-chip digital resistive random access memory). The accuracy on test sets is close to the result using a central processing unit. These experimental results consolidate the feasibility of analogue synaptic array and pave the way toward building an energy efficient and large-scale neuromorphic system.
      

      
      A Locality-Based Threading Algorithm for the Configuration-Interaction Method
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Shan, Hongzhang; Williams, Samuel; Johnson, Calvin
         
         The Configuration Interaction (CI) method has been widely used to solve the non-relativistic many-body Schrodinger equation. One great challenge to implementing it efficiently on manycore architectures is its immense memory and data movement requirements. To address this issue, within each node, we exploit a hybrid MPI+OpenMP programming model in lieu of the traditional flat MPI programming model. Here in this paper, we develop optimizations that partition the workloads among OpenMP threads based on data locality,-which is essential in ensuring applications with complex data access patterns scale well on manycore architectures. The new algorithm scales to 256 threadson the 64-core Intelmore » Knights Landing (KNL) manycore processor and 24 threads on dual-socket Ivy Bridge (Xeon) nodes. Compared with the original implementation, the performance has been improved by up to 7× on theKnights Landing processor and 3× on the dual-socket Ivy Bridge node.« less
      

      
      A Locality-Based Threading Algorithm for the Configuration-Interaction Method
      DOE PAGES
      Shan, Hongzhang; Williams, Samuel; Johnson, Calvin; ...
         2017-07-03
         The Configuration Interaction (CI) method has been widely used to solve the non-relativistic many-body Schrodinger equation. One great challenge to implementing it efficiently on manycore architectures is its immense memory and data movement requirements. To address this issue, within each node, we exploit a hybrid MPI+OpenMP programming model in lieu of the traditional flat MPI programming model. Here in this paper, we develop optimizations that partition the workloads among OpenMP threads based on data locality,-which is essential in ensuring applications with complex data access patterns scale well on manycore architectures. The new algorithm scales to 256 threadson the 64-core Intelmore » Knights Landing (KNL) manycore processor and 24 threads on dual-socket Ivy Bridge (Xeon) nodes. Compared with the original implementation, the performance has been improved by up to 7× on theKnights Landing processor and 3× on the dual-socket Ivy Bridge node.« less
      

      
      Face classification using electronic synapses.
      PubMed
      Yao, Peng; Wu, Huaqiang; Gao, Bin; Eryilmaz, Sukru Burc; Huang, Xueyao; Zhang, Wenqiang; Zhang, Qingtian; Deng, Ning; Shi, Luping; Wong, H-S Philip; Qian, He
         2017-05-12
         Conventional hardware platforms consume huge amount of energy for cognitive learning due to the data movement between the processor and the off-chip memory. Brain-inspired device technologies using analogue weight storage allow to complete cognitive tasks more efficiently. Here we present an analogue non-volatile resistive memory (an electronic synapse) with foundry friendly materials. The device shows bidirectional continuous weight modulation behaviour. Grey-scale face classification is experimentally demonstrated using an integrated 1024-cell array with parallel online training. The energy consumption within the analogue synapses for each iteration is 1,000 × (20 ×) lower compared to an implementation using Intel Xeon Phi processor with off-chip memory (with hypothetical on-chip digital resistive random access memory). The accuracy on test sets is close to the result using a central processing unit. These experimental results consolidate the feasibility of analogue synaptic array and pave the way toward building an energy efficient and large-scale neuromorphic system.
      

      
      Parallelization of the preconditioned IDR solver for modern multicore computer systems
      NASA Astrophysics Data System (ADS)
      Bessonov, O. A.; Fedoseyev, A. I.
         2012-10-01
         This paper present the analysis, parallelization and optimization approach for the large sparse matrix solver CNSPACK for modern multicore microprocessors. CNSPACK is an advanced solver successfully used for coupled solution of stiff problems arising in multiphysics applications such as CFD, semiconductor transport, kinetic and quantum problems. It employs iterative IDR algorithm with ILU preconditioning (user chosen ILU preconditioning order). CNSPACK has been successfully used during last decade for solving problems in several application areas, including fluid dynamics and semiconductor device simulation. However, there was a dramatic change in processor architectures and computer system organization in recent years. Due to this, performance criteria and methods have been revisited, together with involving the parallelization of the solver and preconditioner using Open MP environment. Results of the successful implementation for efficient parallelization are presented for the most advances computer system (Intel Core i7-9xx or two-processor Xeon 55xx/56xx).
      

      
      Coding for parallel execution of hardware-in-the-loop millimeter-wave scene generation models on multicore SIMD processor architectures
      NASA Astrophysics Data System (ADS)
      Olson, Richard F.
         2013-05-01
         Rendering of point scatterer based radar scenes for millimeter wave (mmW) seeker tests in real-time hardware-in-the-loop (HWIL) scene generation requires efficient algorithms and vector-friendly computer architectures for complex signal synthesis. New processor technology from Intel implements an extended 256-bit vector SIMD instruction set (AVX, AVX2) in a multi-core CPU design providing peak execution rates of hundreds of GigaFLOPS (GFLOPS) on one chip. Real world mmW scene generation code can approach peak SIMD execution rates only after careful algorithm and source code design. An effective software design will maintain high computing intensity emphasizing register-to-register SIMD arithmetic operations over data movement between CPU caches or off-chip memories. Engineers at the U.S. Army Aviation and Missile Research, Development and Engineering Center (AMRDEC) applied two basic parallel coding methods to assess new 256-bit SIMD multi-core architectures for mmW scene generation in HWIL. These include use of POSIX threads built on vector library functions and more portable, highlevel parallel code based on compiler technology (e.g. OpenMP pragmas and SIMD autovectorization). Since CPU technology is rapidly advancing toward high processor core counts and TeraFLOPS peak SIMD execution rates, it is imperative that coding methods be identified which produce efficient and maintainable parallel code. This paper describes the algorithms used in point scatterer target model rendering, the parallelization of those algorithms, and the execution performance achieved on an AVX multi-core machine using the two basic parallel coding methods. The paper concludes with estimates for scale-up performance on upcoming multi-core technology.
      

      
      CoNNeCT Baseband Processor Module Boot Code SoftWare (BCSW)
      NASA Technical Reports Server (NTRS)
      Yamamoto, Clifford K.; Orozco, David S.; Byrne, D. J.; Allen, Steven J.; Sahasrabudhe, Adit; Lang, Minh
         2012-01-01
         This software provides essential startup and initialization routines for the CoNNeCT baseband processor module (BPM) hardware upon power-up. A command and data handling (C&DH) interface is provided via 1553 and diagnostic serial interfaces to invoke operational, reconfiguration, and test commands within the code. The BCSW has features unique to the hardware it is responsible for managing. In this case, the CoNNeCT BPM is configured with an updated CPU (Atmel AT697 SPARC processor) and a unique set of memory and I/O peripherals that require customized software to operate. These features include configuration of new AT697 registers, interfacing to a new HouseKeeper with a flash controller interface, a new dual Xilinx configuration/scrub interface, and an updated 1553 remote terminal (RT) core. The BCSW is intended to provide a "safe" mode for the BPM when initially powered on or when an unexpected trap occurs, causing the processor to reset. The BCSW allows the 1553 bus controller in the spacecraft or payload controller to operate the BPM over 1553 to upload code; upload Xilinx bit files; perform rudimentary tests; read, write, and copy the non-volatile flash memory; and configure the Xilinx interface. Commands also exist over 1553 to cause the CPU to jump or call a specified address to begin execution of user-supplied code. This may be in the form of a real-time operating system, test routine, or specific application code to run on the BPM.
      

      
      The density matrix renormalization group algorithm on kilo-processor architectures: Implementation and trade-offs
      NASA Astrophysics Data System (ADS)
      Nemes, Csaba; Barcza, Gergely; Nagy, Zoltán; Legeza, Örs; Szolgay, Péter
         2014-06-01
         In the numerical analysis of strongly correlated quantum lattice models one of the leading algorithms developed to balance the size of the effective Hilbert space and the accuracy of the simulation is the density matrix renormalization group (DMRG) algorithm, in which the run-time is dominated by the iterative diagonalization of the Hamilton operator. As the most time-dominant step of the diagonalization can be expressed as a list of dense matrix operations, the DMRG is an appealing candidate to fully utilize the computing power residing in novel kilo-processor architectures. In the paper a smart hybrid CPU-GPU implementation is presented, which exploits the power of both CPU and GPU and tolerates problems exceeding the GPU memory size. Furthermore, a new CUDA kernel has been designed for asymmetric matrix-vector multiplication to accelerate the rest of the diagonalization. Besides the evaluation of the GPU implementation, the practical limits of an FPGA implementation are also discussed.
      

      
      Real-time autocorrelator for fluorescence correlation spectroscopy based on graphical-processor-unit architecture: method, implementation, and comparative studies
      NASA Astrophysics Data System (ADS)
      Laracuente, Nicholas; Grossman, Carl
         2013-03-01
         We developed an algorithm and software to calculate autocorrelation functions from real-time photon-counting data using the fast, parallel capabilities of graphical processor units (GPUs). Recent developments in hardware and software have allowed for general purpose computing with inexpensive GPU hardware. These devices are more suited for emulating hardware autocorrelators than traditional CPU-based software applications by emphasizing parallel throughput over sequential speed. Incoming data are binned in a standard multi-tau scheme with configurable points-per-bin size and are mapped into a GPU memory pattern to reduce time-expensive memory access. Applications include dynamic light scattering (DLS) and fluorescence correlation spectroscopy (FCS) experiments. We ran the software on a 64-core graphics pci card in a 3.2 GHz Intel i5 CPU based computer running Linux. FCS measurements were made on Alexa-546 and Texas Red dyes in a standard buffer (PBS). Software correlations were compared to hardware correlator measurements on the same signals. Supported by HHMI and Swarthmore College
      

      
      Performance Evaluation of Supercomputers using HPCC and IMB Benchmarks
      NASA Technical Reports Server (NTRS)
      Saini, Subhash; Ciotti, Robert; Gunney, Brian T. N.; Spelce, Thomas E.; Koniges, Alice; Dossa, Don; Adamidis, Panagiotis; Rabenseifner, Rolf; Tiyyagura, Sunil R.; Mueller, Matthias;   

         2006-01-01
         The HPC Challenge (HPCC) benchmark suite and the Intel MPI Benchmark (IMB) are used to compare and evaluate the combined performance of processor, memory subsystem and interconnect fabric of five leading supercomputers - SGI Altix BX2, Cray XI, Cray Opteron Cluster, Dell Xeon cluster, and NEC SX-8. These five systems use five different networks (SGI NUMALINK4, Cray network, Myrinet, InfiniBand, and NEC IXS). The complete set of HPCC benchmarks are run on each of these systems. Additionally, we present Intel MPI Benchmarks (IMB) results to study the performance of 11 MPI communication functions on these systems.
      


      
      The Performance of the NAS HSPs in 1st Half of 1994
      NASA Technical Reports Server (NTRS)
      Bergeron, Robert J.; Walter, Howard (Technical Monitor)
         1995-01-01
         During the first six months of 1994, the NAS (National Airspace System) 16-CPU Y-MP C90 Von Neumann (VN) delivered an average throughput of 4.045 GFLOPS while the ACSF (Aeronautics Consolidated Supercomputer Facility) 8-CPU Y-MP C90 Eagle averaged 1.658 GFLOPS. The VN rate represents a machine efficiency of 26.3% whereas the Eagle rate corresponds to a machine efficiency of 21.6%. VN displayed a greater efficiency than Eagle primarily because the stronger workload demand for its CPU cycles allowed it to devote more time to user programs and less time to idle. An additional factor increasing VN efficiency was the ability of the UNICOS 8.0 Operating System to deliver a larger fraction of CPU time to user programs. Although measurements indicate increasing vector length for both workloads, insufficient vector lengths continue to hinder HSP (High Speed Processor) performance. To improve HSP performance, NAS should continue to encourage the HSP users to modify their codes to increase program vector length.
      

        
       
          

«

6
      7
      8
   9
      10
      »

          
        

     

   

   
       
            
              
          

«

7
      8
      9
   10
      11
      »

          
        

           
           
             
               
      
      Application of queueing models to multiprogrammed computer systems operating in a time-critical environment
      NASA Technical Reports Server (NTRS)
      Eckhardt, D. E., Jr.
         1979-01-01
         A model of a central processor (CPU) which services background applications in the presence of time critical activity is presented. The CPU is viewed as an M/M/1 queueing system subject to periodic interrupts by deterministic, time critical process. The Laplace transform of the distribution of service times for the background applications is developed. The use of state of the art queueing models for studying the background processing capability of time critical computer systems is discussed and the results of a model validation study which support this application of queueing models are presented.
      

      
      Toshiba TDF-500 High Resolution Viewing And Analysis System
      NASA Astrophysics Data System (ADS)
      Roberts, Barry; Kakegawa, M.; Nishikawa, M.; Oikawa, D.
         1988-06-01
         A high resolution, operator interactive, medical viewing and analysis system has been developed by Toshiba and Bio-Imaging Research. This system provides many advanced features including high resolution displays, a very large image memory and advanced image processing capability. In particular, the system provides CRT frame buffers capable of update in one frame period, an array processor capable of image processing at operator interactive speeds, and a memory system capable of updating multiple frame buffers at frame rates whilst supporting multiple array processors. The display system provides 1024 x 1536 display resolution at 40Hz frame and 80Hz field rates. In particular, the ability to provide whole or partial update of the screen at the scanning rate is a key feature. This allows multiple viewports or windows in the display buffer with both fixed and cine capability. To support image processing features such as windowing, pan, zoom, minification, filtering, ROI analysis, multiplanar and 3D reconstruction, a high performance CPU is integrated into the system. This CPU is an array processor capable of up to 400 million instructions per second. To support the multiple viewer and array processors' instantaneous high memory bandwidth requirement, an ultra fast memory system is used. This memory system has a bandwidth capability of 400MB/sec and a total capacity of 256MB. This bandwidth is more than adequate to support several high resolution CRT's and also the fast processing unit. This fully integrated approach allows effective real time image processing. The integrated design of viewing system, memory system and array processor are key to the imaging system. It is the intention to describe the architecture of the image system in this paper.
      

      
      Measurement of fault latency in a digital avionic mini processor, part 2
      NASA Technical Reports Server (NTRS)
      Mcgough, J.; Swern, F.
         1983-01-01
         The results of fault injection experiments utilizing a gate-level emulation of the central processor unit of the Bendix BDX-930 digital computer are described. Several earlier programs were reprogrammed, expanding the instruction set to capitalize on the full power of the BDX-930 computer. As a final demonstration of fault coverage an extensive, 3-axis, high performance flght control computation was added. The stages in the development of a CPU self-test program emphasizing the relationship between fault coverage, speed, and quantity of instructions were demonstrated.
      

      
      Multitasking OS manages a team of processors
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Ripps, D.L.
         1983-07-21
         MTOS-68k is a real-time multitasking operating system designed for the popular MC68000 microprocessors. It aproaches task coordination and synchronization in a fashion that matches uniquely the structural simplicity and regularity of the 68000 instruction set. Since in many 68000 applications the speed and power of one CPU are not enough, MTOS-68k has been designed to support multiple processors, as well as multiple tasks. Typically, the devices are tightly coupled single-board computers, that is they share a backplane and parts of global memory.
      

      
      Execution of a parallel edge-based Navier-Stokes solver on commodity graphics processor units
      NASA Astrophysics Data System (ADS)
      Corral, Roque; Gisbert, Fernando; Pueblas, Jesus
         2017-02-01
         The implementation of an edge-based three-dimensional Reynolds Average Navier-Stokes solver for unstructured grids able to run on multiple graphics processing units (GPUs) is presented. Loops over edges, which are the most time-consuming part of the solver, have been written to exploit the massively parallel capabilities of GPUs. Non-blocking communications between parallel processes and between the GPU and the central processor unit (CPU) have been used to enhance code scalability. The code is written using a mixture of C++ and OpenCL, to allow the execution of the source code on GPUs. The Message Passage Interface (MPI) library is used to allow the parallel execution of the solver on multiple GPUs. A comparative study of the solver parallel performance is carried out using a cluster of CPUs and another of GPUs. It is shown that a single GPU is up to 64 times faster than a single CPU core. The parallel scalability of the solver is mainly degraded due to the loss of computing efficiency of the GPU when the size of the case decreases. However, for large enough grid sizes, the scalability is strongly improved. A cluster featuring commodity GPUs and a high bandwidth network is ten times less costly and consumes 33% less energy than a CPU-based cluster with an equivalent computational power.
      

      
      NASA Center for Climate Simulation (NCCS) Presentation
      NASA Technical Reports Server (NTRS)
      Webster, William P.
         2012-01-01
         The NASA Center for Climate Simulation (NCCS) offers integrated supercomputing, visualization, and data interaction technologies to enhance NASA's weather and climate prediction capabilities. It serves hundreds of users at NASA Goddard Space Flight Center, as well as other NASA centers, laboratories, and universities across the US. Over the past year, NCCS has continued expanding its data-centric computing environment to meet the increasingly data-intensive challenges of climate science. We doubled our Discover supercomputer's peak performance to more than 800 teraflops by adding 7,680 Intel Xeon Sandy Bridge processor-cores and most recently 240 Intel Xeon Phi Many Integrated Core (MIG) co-processors. A supercomputing-class analysis system named Dali gives users rapid access to their data on Discover and high-performance software including the Ultra-scale Visualization Climate Data Analysis Tools (UV-CDAT), with interfaces from user desktops and a 17- by 6-foot visualization wall. NCCS also is exploring highly efficient climate data services and management with a new MapReduce/Hadoop cluster while augmenting its data distribution to the science community. Using NCCS resources, NASA completed its modeling contributions to the Intergovernmental Panel on Climate Change (IPCG) Fifth Assessment Report this summer as part of the ongoing Coupled Modellntercomparison Project Phase 5 (CMIP5). Ensembles of simulations run on Discover reached back to the year 1000 to test model accuracy and projected climate change through the year 2300 based on four different scenarios of greenhouse gases, aerosols, and land use. The data resulting from several thousand IPCC/CMIP5 simulations, as well as a variety of other simulation, reanalysis, and observationdatasets, are available to scientists and decision makers through an enhanced NCCS Earth System Grid Federation Gateway. Worldwide downloads have totaled over 110 terabytes of data.
      

      
      SU-E-T-493: Accelerated Monte Carlo Methods for Photon Dosimetry Using a Dual-GPU System and CUDA.
      PubMed
      Liu, T; Ding, A; Xu, X
         2012-06-01
         To develop a Graphics Processing Unit (GPU) based Monte Carlo (MC) code that accelerates dose calculations on a dual-GPU system. We simulated a clinical case of prostate cancer treatment. A voxelized abdomen phantom derived from 120 CT slices was used containing 218×126×60 voxels, and a GE LightSpeed 16-MDCT scanner was modeled. A CPU version of the MC code was first developed in C++ and tested on Intel Xeon X5660 2.8GHz CPU, then it was translated into GPU version using CUDA C 4.1 and run on a dual Tesla m 2 090 GPU system. The code was featured with automatic assignment of simulation task to multiple GPUs, as well as accurate calculation of energy- and material- dependent cross-sections. Double-precision floating point format was used for accuracy. Doses to the rectum, prostate, bladder and femoral heads were calculated. When running on a single GPU, the MC GPU code was found to be ×19 times faster than the CPU code and ×42 times faster than MCNPX. These speedup factors were doubled on the dual-GPU system. The dose Result was benchmarked against MCNPX and a maximum difference of 1% was observed when the relative error is kept below 0.1%. A GPU-based MC code was developed for dose calculations using detailed patient and CT scanner models. Efficiency and accuracy were both guaranteed in this code. Scalability of the code was confirmed on the dual-GPU system. © 2012 American Association of Physicists in Medicine.
      

      
      Comparative Performance Analysis of Intel Xeon Phi, GPU, and CPU: A Case Study from Microscopy Image Analysis
      PubMed Central
      Teodoro, George; Kurc, Tahsin; Kong, Jun; Cooper, Lee; Saltz, Joel
         2014-01-01
         We study and characterize the performance of operations in an important class of applications on GPUs and Many Integrated Core (MIC) architectures. Our work is motivated by applications that analyze low-dimensional spatial datasets captured by high resolution sensors, such as image datasets obtained from whole slide tissue specimens using microscopy scanners. Common operations in these applications involve the detection and extraction of objects (object segmentation), the computation of features of each extracted object (feature computation), and characterization of objects based on these features (object classification). In this work, we have identify the data access and computation patterns of operations in the object segmentation and feature computation categories. We systematically implement and evaluate the performance of these operations on modern CPUs, GPUs, and MIC systems for a microscopy image analysis application. Our results show that the performance on a MIC of operations that perform regular data access is comparable or sometimes better than that on a GPU. On the other hand, GPUs are significantly more efficient than MICs for operations that access data irregularly. This is a result of the low performance of MICs when it comes to random data access. We also have examined the coordinated use of MICs and CPUs. Our experiments show that using a performance aware task strategy for scheduling application operations improves performance about 1.29× over a first-come-first-served strategy. This allows applications to obtain high performance efficiency on CPU-MIC systems - the example application attained an efficiency of 84% on 192 nodes (3072 CPU cores and 192 MICs). PMID:25419088
      

      
      Task Parallel Incomplete Cholesky Factorization using 2D Partitioned-Block Layout
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Kim, Kyungjoo; Rajamanickam, Sivasankaran; Stelle, George Widgery
         
         We introduce a task-parallel algorithm for sparse incomplete Cholesky factorization that utilizes a 2D sparse partitioned-block layout of a matrix. Our factorization algorithm follows the idea of algorithms-by-blocks by using the block layout. The algorithm-byblocks approach induces a task graph for the factorization. These tasks are inter-related to each other through their data dependences in the factorization algorithm. To process the tasks on various manycore architectures in a portable manner, we also present a portable tasking API that incorporates different tasking backends and device-specific features using an open-source framework for manycore platforms i.e., Kokkos. A performance evaluation is presented onmore » both Intel Sandybridge and Xeon Phi platforms for matrices from the University of Florida sparse matrix collection to illustrate merits of the proposed task-based factorization. Experimental results demonstrate that our task-parallel implementation delivers about 26.6x speedup (geometric mean) over single-threaded incomplete Choleskyby- blocks and 19.2x speedup over serial Cholesky performance which does not carry tasking overhead using 56 threads on the Intel Xeon Phi processor for sparse matrices arising from various application problems.« less
      

      
      Collaborating CPU and GPU for large-scale high-order CFD simulations with complex grids on the TianHe-1A supercomputer
      NASA Astrophysics Data System (ADS)
      Xu, Chuanfu; Deng, Xiaogang; Zhang, Lilun; Fang, Jianbin; Wang, Guangxue; Jiang, Yi; Cao, Wei; Che, Yonggang; Wang, Yongxian; Wang, Zhenghua; Liu, Wei; Cheng, Xinghua
         2014-12-01
         Programming and optimizing complex, real-world CFD codes on current many-core accelerated HPC systems is very challenging, especially when collaborating CPUs and accelerators to fully tap the potential of heterogeneous systems. In this paper, with a tri-level hybrid and heterogeneous programming model using MPI + OpenMP + CUDA, we port and optimize our high-order multi-block structured CFD software HOSTA on the GPU-accelerated TianHe-1A supercomputer. HOSTA adopts two self-developed high-order compact definite difference schemes WCNS and HDCS that can simulate flows with complex geometries. We present a dual-level parallelization scheme for efficient multi-block computation on GPUs and perform particular kernel optimizations for high-order CFD schemes. The GPU-only approach achieves a speedup of about 1.3 when comparing one Tesla M2050 GPU with two Xeon X5670 CPUs. To achieve a greater speedup, we collaborate CPU and GPU for HOSTA instead of using a naive GPU-only approach. We present a novel scheme to balance the loads between the store-poor GPU and the store-rich CPU. Taking CPU and GPU load balance into account, we improve the maximum simulation problem size per TianHe-1A node for HOSTA by 2.3×, meanwhile the collaborative approach can improve the performance by around 45% compared to the GPU-only approach. Further, to scale HOSTA on TianHe-1A, we propose a gather/scatter optimization to minimize PCI-e data transfer times for ghost and singularity data of 3D grid blocks, and overlap the collaborative computation and communication as far as possible using some advanced CUDA and MPI features. Scalability tests show that HOSTA can achieve a parallel efficiency of above 60% on 1024 TianHe-1A nodes. With our method, we have successfully simulated an EET high-lift airfoil configuration containing 800M cells and China's large civil airplane configuration containing 150M cells. To our best knowledge, those are the largest-scale CPU-GPU collaborative simulations that solve realistic CFD problems with both complex configurations and high-order schemes.
      

      
      A CPU benchmark for protein crystallographic refinement.
      PubMed
      Bourne, P E; Hendrickson, W A
         1990-01-01
         The CPU time required to complete a cycle of restrained least-squares refinement of a protein structure from X-ray crystallographic data using the FORTRAN codes PROTIN and PROLSQ are reported for 48 different processors, ranging from single-user workstations to supercomputers. Sequential, vector, VLIW, multiprocessor, and RISC hardware architectures are compared using both a small and a large protein structure. Representative compile times for each hardware type are also given, and the improvement in run-time when coding for a specific hardware architecture considered. The benchmarks involve scalar integer and vector floating point arithmetic and are representative of the calculations performed in many scientific disciplines.
      

      
      Adaptive real-time methodology for optimizing energy-efficient computing
      DOEpatents
      Hsu, Chung-Hsing [Los Alamos, NM; Feng, Wu-Chun [Blacksburg, VA
         2011-06-28
         Dynamic voltage and frequency scaling (DVFS) is an effective way to reduce energy and power consumption in microprocessor units. Current implementations of DVFS suffer from inaccurate modeling of power requirements and usage, and from inaccurate characterization of the relationships between the applicable variables. A system and method is proposed that adjusts CPU frequency and voltage based on run-time calculations of the workload processing time, as well as a calculation of performance sensitivity with respect to CPU frequency. The system and method are processor independent, and can be applied to either an entire system as a unit, or individually to each process running on a system.
      

      
      CPU architecture for a fast and energy-saving calculation of convolution neural networks
      NASA Astrophysics Data System (ADS)
      Knoll, Florian J.; Grelcke, Michael; Czymmek, Vitali; Holtorf, Tim; Hussmann, Stephan
         2017-06-01
         One of the most difficult problem in the use of artificial neural networks is the computational capacity. Although large search engine companies own specially developed hardware to provide the necessary computing power, for the conventional user only remains the state of the art method, which is the use of a graphic processing unit (GPU) as a computational basis. Although these processors are well suited for large matrix computations, they need massive energy. Therefore a new processor on the basis of a field programmable gate array (FPGA) has been developed and is optimized for the application of deep learning. This processor is presented in this paper. The processor can be adapted for a particular application (in this paper to an organic farming application). The power consumption is only a fraction of a GPU application and should therefore be well suited for energy-saving applications.
      

      
      A GaAs vector processor based on parallel RISC microprocessors
      NASA Astrophysics Data System (ADS)
      Misko, Tim A.; Rasset, Terry L.
         
         A vector processor architecture based on the development of a 32-bit microprocessor using gallium arsenide (GaAs) technology has been developed. The McDonnell Douglas vector processor (MVP) will be fabricated completely from GaAs digital integrated circuits. The MVP architecture includes a vector memory of 1 megabyte, a parallel bus architecture with eight processing elements connected in parallel, and a control processor. The processing elements consist of a reduced instruction set CPU (RISC) with four floating-point coprocessor units and necessary memory interface functions. This architecture has been simulated for several benchmark programs including complex fast Fourier transform (FFT), complex inner product, trigonometric functions, and sort-merge routine. The results of this study indicate that the MVP can process a 1024-point complex FFT at a speed of 112 microsec (389 megaflops) while consuming approximately 618 W of power in a volume of approximately 0.1 ft-cubed.
      

      
      Cache write generate for parallel image processing on shared memory architectures.
      PubMed
      Wittenbrink, C M; Somani, A K; Chen, C H
         1996-01-01
         We investigate cache write generate, our cache mode invention. We demonstrate that for parallel image processing applications, the new mode improves main memory bandwidth, CPU efficiency, cache hits, and cache latency. We use register level simulations validated by the UW-Proteus system. Many memory, cache, and processor configurations are evaluated.
      

      
      Message Passing on GPUs
      NASA Astrophysics Data System (ADS)
      Stuart, J. A.
         2011-12-01
         This paper explores the challenges in implementing a message passing interface usable on systems with data-parallel processors, and more specifically GPUs. As a case study, we design and implement the ``DCGN'' API on NVIDIA GPUs that is similar to MPI and allows full access to the underlying architecture. We introduce the notion of data-parallel thread-groups as a way to map resources to MPI ranks. We use a method that also allows the data-parallel processors to run autonomously from user-written CPU code. In order to facilitate communication, we use a sleep-based polling system to store and retrieve messages. Unlike previous systems, our method provides both performance and flexibility. By running a test suite of applications with different communication requirements, we find that a tolerable amount of overhead is incurred, somewhere between one and five percent depending on the application, and indicate the locations where this overhead accumulates. We conclude that with innovations in chipsets and drivers, this overhead will be mitigated and provide similar performance to typical CPU-based MPI implementations while providing fully-dynamic communication.
      

      
      A task-based parallelism and vectorized approach to 3D Method of Characteristics (MOC) reactor simulation for high performance computing architectures
      NASA Astrophysics Data System (ADS)
      Tramm, John R.; Gunow, Geoffrey; He, Tim; Smith, Kord S.; Forget, Benoit; Siegel, Andrew R.
         2016-05-01
         In this study we present and analyze a formulation of the 3D Method of Characteristics (MOC) technique applied to the simulation of full core nuclear reactors. Key features of the algorithm include a task-based parallelism model that allows independent MOC tracks to be assigned to threads dynamically, ensuring load balancing, and a wide vectorizable inner loop that takes advantage of modern SIMD computer architectures. The algorithm is implemented in a set of highly optimized proxy applications in order to investigate its performance characteristics on CPU, GPU, and Intel Xeon Phi architectures. Speed, power, and hardware cost efficiencies are compared. Additionally, performance bottlenecks are identified for each architecture in order to determine the prospects for continued scalability of the algorithm on next generation HPC architectures.
      

      
      A parallelization scheme of the periodic signals tracking algorithm for isochronous mass spectrometry on GPUs
      NASA Astrophysics Data System (ADS)
      Chen, R. J.; Wang, M.; Yan, X. L.; Yang, Q.; Lam, Y. H.; Yang, L.; Zhang, Y. H.
         2017-12-01
         The periodic signals tracking algorithm has been used to determine the revolution times of ions stored in storage rings in isochronous mass spectrometry (IMS) experiments. It has been a challenge to perform real-time data analysis by using the periodic signals tracking algorithm in the IMS experiments. In this paper, a parallelization scheme of the periodic signals tracking algorithm is introduced and a new program is developed. The computing time of data analysis can be reduced by a factor of ∼71 and of ∼346 by using our new program on Tesla C1060 GPU and Tesla K20c GPU, compared to using old program on Xeon E5540 CPU. We succeed in performing real-time data analysis for the IMS experiments by using the new program on Tesla K20c GPU.
      

      
      High performance in silico virtual drug screening on many-core processors.
      PubMed
      McIntosh-Smith, Simon; Price, James; Sessions, Richard B; Ibarra, Amaurys A
         2015-05-01
         Drug screening is an important part of the drug development pipeline for the pharmaceutical industry. Traditional, lab-based methods are increasingly being augmented with computational methods, ranging from simple molecular similarity searches through more complex pharmacophore matching to more computationally intensive approaches, such as molecular docking. The latter simulates the binding of drug molecules to their targets, typically protein molecules. In this work, we describe BUDE, the Bristol University Docking Engine, which has been ported to the OpenCL industry standard parallel programming language in order to exploit the performance of modern many-core processors. Our highly optimized OpenCL implementation of BUDE sustains 1.43 TFLOP/s on a single Nvidia GTX 680 GPU, or 46% of peak performance. BUDE also exploits OpenCL to deliver effective performance portability across a broad spectrum of different computer architectures from different vendors, including GPUs from Nvidia and AMD, Intel's Xeon Phi and multi-core CPUs with SIMD instruction sets.
      

      
      High performance in silico virtual drug screening on many-core processors
      PubMed Central
      Price, James; Sessions, Richard B; Ibarra, Amaurys A
         2015-01-01
         Drug screening is an important part of the drug development pipeline for the pharmaceutical industry. Traditional, lab-based methods are increasingly being augmented with computational methods, ranging from simple molecular similarity searches through more complex pharmacophore matching to more computationally intensive approaches, such as molecular docking. The latter simulates the binding of drug molecules to their targets, typically protein molecules. In this work, we describe BUDE, the Bristol University Docking Engine, which has been ported to the OpenCL industry standard parallel programming language in order to exploit the performance of modern many-core processors. Our highly optimized OpenCL implementation of BUDE sustains 1.43 TFLOP/s on a single Nvidia GTX 680 GPU, or 46% of peak performance. BUDE also exploits OpenCL to deliver effective performance portability across a broad spectrum of different computer architectures from different vendors, including GPUs from Nvidia and AMD, Intel’s Xeon Phi and multi-core CPUs with SIMD instruction sets. PMID:25972727
      

        
       
          

«

7
      8
      9
   10
      11
      »

          
        

     

   

   
       
            
              
          

«

8
      9
      10
   11
      12
      »

          
        

           
           
             
               
      
      Implementation of 5-layer thermal diffusion scheme in weather research and forecasting model with Intel Many Integrated Cores
      NASA Astrophysics Data System (ADS)
      Huang, Melin; Huang, Bormin; Huang, Allen H.
         2014-10-01
         For weather forecasting and research, the Weather Research and Forecasting (WRF) model has been developed, consisting of several components such as dynamic solvers and physical simulation modules. WRF includes several Land- Surface Models (LSMs). The LSMs use atmospheric information, the radiative and precipitation forcing from the surface layer scheme, the radiation scheme, and the microphysics/convective scheme all together with the land's state variables and land-surface properties, to provide heat and moisture fluxes over land and sea-ice points. The WRF 5-layer thermal diffusion simulation is an LSM based on the MM5 5-layer soil temperature model with an energy budget that includes radiation, sensible, and latent heat flux. The WRF LSMs are very suitable for massively parallel computation as there are no interactions among horizontal grid points. The features, efficient parallelization and vectorization essentials, of Intel Many Integrated Core (MIC) architecture allow us to optimize this WRF 5-layer thermal diffusion scheme. In this work, we present the results of the computing performance on this scheme with Intel MIC architecture. Our results show that the MIC-based optimization improved the performance of the first version of multi-threaded code on Xeon Phi 5110P by a factor of 2.1x. Accordingly, the same CPU-based optimizations improved the performance on Intel Xeon E5- 2603 by a factor of 1.6x as compared to the first version of multi-threaded code.
      

      
      A new nonlinear conjugate gradient coefficient under strong Wolfe-Powell line search
      NASA Astrophysics Data System (ADS)
      Mohamed, Nur Syarafina; Mamat, Mustafa; Rivaie, Mohd
         2017-08-01
         A nonlinear conjugate gradient method (CG) plays an important role in solving a large-scale unconstrained optimization problem. This method is widely used due to its simplicity. The method is known to possess sufficient descend condition and global convergence properties. In this paper, a new nonlinear of CG coefficient βk is presented by employing the Strong Wolfe-Powell inexact line search. The new βk performance is tested based on number of iterations and central processing unit (CPU) time by using MATLAB software with Intel Core i7-3470 CPU processor. Numerical experimental results show that the new βk converge rapidly compared to other classical CG method.
      

      
      Adaptive real-time methodology for optimizing energy-efficient computing
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Hsu, Chung-Hsing; Feng, Wu-Chun
         
         Dynamic voltage and frequency scaling (DVFS) is an effective way to reduce energy and power consumption in microprocessor units. Current implementations of DVFS suffer from inaccurate modeling of power requirements and usage, and from inaccurate characterization of the relationships between the applicable variables. A system and method is proposed that adjusts CPU frequency and voltage based on run-time calculations of the workload processing time, as well as a calculation of performance sensitivity with respect to CPU frequency. The system and method are processor independent, and can be applied to either an entire system as a unit, or individually to eachmore » process running on a system.« less
      

      
      Efficient Machine Learning Approach for Optimizing Scientific Computing Applications on Emerging HPC Architectures
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Arumugam, Kamesh
         
         Efficient parallel implementations of scientific applications on multi-core CPUs with accelerators such as GPUs and Xeon Phis is challenging. This requires - exploiting the data parallel architecture of the accelerator along with the vector pipelines of modern x86 CPU architectures, load balancing, and efficient memory transfer between different devices. It is relatively easy to meet these requirements for highly structured scientific applications. In contrast, a number of scientific and engineering applications are unstructured. Getting performance on accelerators for these applications is extremely challenging because many of these applications employ irregular algorithms which exhibit data-dependent control-ow and irregular memory accesses. Furthermore,more » these applications are often iterative with dependency between steps, and thus making it hard to parallelize across steps. As a result, parallelism in these applications is often limited to a single step. Numerical simulation of charged particles beam dynamics is one such application where the distribution of work and memory access pattern at each time step is irregular. Applications with these properties tend to present significant branch and memory divergence, load imbalance between different processor cores, and poor compute and memory utilization. Prior research on parallelizing such irregular applications have been focused around optimizing the irregular, data-dependent memory accesses and control-ow during a single step of the application independent of the other steps, with the assumption that these patterns are completely unpredictable. We observed that the structure of computation leading to control-ow divergence and irregular memory accesses in one step is similar to that in the next step. It is possible to predict this structure in the current step by observing the computation structure of previous steps. In this dissertation, we present novel machine learning based optimization techniques to address the parallel implementation challenges of such irregular applications on different HPC architectures. In particular, we use supervised learning to predict the computation structure and use it to address the control-ow and memory access irregularities in the parallel implementation of such applications on GPUs, Xeon Phis, and heterogeneous architectures composed of multi-core CPUs with GPUs or Xeon Phis. We use numerical simulation of charged particles beam dynamics simulation as a motivating example throughout the dissertation to present our new approach, though they should be equally applicable to a wide range of irregular applications. The machine learning approach presented here use predictive analytics and forecasting techniques to adaptively model and track the irregular memory access pattern at each time step of the simulation to anticipate the future memory access pattern. Access pattern forecasts can then be used to formulate optimization decisions during application execution which improves the performance of the application at a future time step based on the observations from earlier time steps. In heterogeneous architectures, forecasts can also be used to improve the memory performance and resource utilization of all the processing units to deliver a good aggregate performance. We used these optimization techniques and anticipation strategy to design a cache-aware, memory efficient parallel algorithm to address the irregularities in the parallel implementation of charged particles beam dynamics simulation on different HPC architectures. Experimental result using a diverse mix of HPC architectures shows that our approach in using anticipation strategy is effective in maximizing data reuse, ensuring workload balance, minimizing branch and memory divergence, and in improving resource utilization.« less
      

      
      Computer hardware for radiologists: Part I
      PubMed Central
      Indrajit, IK; Alam, A
         2010-01-01
         Computers are an integral part of modern radiology practice. They are used in different radiology modalities to acquire, process, and postprocess imaging data. They have had a dramatic influence on contemporary radiology practice. Their impact has extended further with the emergence of Digital Imaging and Communications in Medicine (DICOM), Picture Archiving and Communication System (PACS), Radiology information system (RIS) technology, and Teleradiology. A basic overview of computer hardware relevant to radiology practice is presented here. The key hardware components in a computer are the motherboard, central processor unit (CPU), the chipset, the random access memory (RAM), the memory modules, bus, storage drives, and ports. The personnel computer (PC) has a rectangular case that contains important components called hardware, many of which are integrated circuits (ICs). The fiberglass motherboard is the main printed circuit board and has a variety of important hardware mounted on it, which are connected by electrical pathways called “buses”. The CPU is the largest IC on the motherboard and contains millions of transistors. Its principal function is to execute “programs”. A Pentium® 4 CPU has transistors that execute a billion instructions per second. The chipset is completely different from the CPU in design and function; it controls data and interaction of buses between the motherboard and the CPU. Memory (RAM) is fundamentally semiconductor chips storing data and instructions for access by a CPU. RAM is classified by storage capacity, access speed, data rate, and configuration. PMID:21042437
      

      
      Electromagnetic Physics Models for Parallel Computing Architectures
      NASA Astrophysics Data System (ADS)
      Amadio, G.; Ananya, A.; Apostolakis, J.; Aurora, A.; Bandieramonte, M.; Bhattacharyya, A.; Bianchini, C.; Brun, R.; Canal, P.; Carminati, F.; Duhem, L.; Elvira, D.; Gheata, A.; Gheata, M.; Goulas, I.; Iope, R.; Jun, S. Y.; Lima, G.; Mohanty, A.; Nikitina, T.; Novak, M.; Pokorski, W.; Ribon, A.; Seghal, R.; Shadura, O.; Vallecorsa, S.; Wenzel, S.; Zhang, Y.
         2016-10-01
         The recent emergence of hardware architectures characterized by many-core or accelerated processors has opened new opportunities for concurrent programming models taking advantage of both SIMD and SIMT architectures. GeantV, a next generation detector simulation, has been designed to exploit both the vector capability of mainstream CPUs and multi-threading capabilities of coprocessors including NVidia GPUs and Intel Xeon Phi. The characteristics of these architectures are very different in terms of the vectorization depth and type of parallelization needed to achieve optimal performance. In this paper we describe implementation of electromagnetic physics models developed for parallel computing architectures as a part of the GeantV project. Results of preliminary performance evaluation and physics validation are presented as well.
      

      
      A programming framework for data streaming on the Xeon Phi
      NASA Astrophysics Data System (ADS)
      Chapeland, S.;  ALICE Collaboration
         2017-10-01
         ALICE (A Large Ion Collider Experiment) is the dedicated heavy-ion detector studying the physics of strongly interacting matter and the quark-gluon plasma at the CERN LHC (Large Hadron Collider). After the second long shut-down of the LHC, the ALICE detector will be upgraded to cope with an interaction rate of 50 kHz in Pb-Pb collisions, producing in the online computing system (O2) a sustained throughput of 3.4 TB/s. This data will be processed on the fly so that the stream to permanent storage does not exceed 90 GB/s peak, the raw data being discarded. In the context of assessing different computing platforms for the O2 system, we have developed a framework for the Intel Xeon Phi processors (MIC). It provides the components to build a processing pipeline streaming the data from the PC memory to a pool of permanent threads running on the MIC, and back to the host after processing. It is based on explicit offloading mechanisms (data transfer, asynchronous tasks) and basic building blocks (FIFOs, memory pools, C++11 threads). The user only needs to implement the processing method to be run on the MIC. We present in this paper the architecture, implementation, and performance of this system.
      

      
      Time-efficient simulations of tight-binding electronic structures with Intel Xeon PhiTM many-core processors
      NASA Astrophysics Data System (ADS)
      Ryu, Hoon; Jeong, Yosang; Kang, Ji-Hoon; Cho, Kyu Nam
         2016-12-01
         Modelling of multi-million atomic semiconductor structures is important as it not only predicts properties of physically realizable novel materials, but can accelerate advanced device designs. This work elaborates a new Technology-Computer-Aided-Design (TCAD) tool for nanoelectronics modelling, which uses a sp3d5s∗ tight-binding approach to describe multi-million atomic structures, and simulate electronic structures with high performance computing (HPC), including atomic effects such as alloy and dopant disorders. Being named as Quantum simulation tool for Advanced Nanoscale Devices (Q-AND), the tool shows nice scalability on traditional multi-core HPC clusters implying the strong capability of large-scale electronic structure simulations, particularly with remarkable performance enhancement on latest clusters of Intel Xeon PhiTM coprocessors. A review of the recent modelling study conducted to understand an experimental work of highly phosphorus-doped silicon nanowires, is presented to demonstrate the utility of Q-AND. Having been developed via Intel Parallel Computing Center project, Q-AND will be open to public to establish a sound framework of nanoelectronics modelling with advanced HPC clusters of a many-core base. With details of the development methodology and exemplary study of dopant electronics, this work will present a practical guideline for TCAD development to researchers in the field of computational nanoelectronics.
      

      
      FPGA Online Tracking Algorithm for the PANDA Straw Tube Tracker
      NASA Astrophysics Data System (ADS)
      Liang, Yutie; Ye, Hua; Galuska, Martin J.; Gessler, Thomas; Kuhn, Wolfgang; Lange, Jens Soren; Wagner, Milan N.; Liu, Zhen'an; Zhao, Jingzhou
         2017-06-01
         A novel FPGA based online tracking algorithm for helix track reconstruction in a solenoidal field, developed for the PANDA spectrometer, is described. Employing the Straw Tube Tracker detector with 4636 straw tubes, the algorithm includes a complex track finder, and a track fitter. Implemented in VHDL, the algorithm is tested on a Xilinx Virtex-4 FX60 FPGA chip with different types of events, at different event rates. A processing time of 7 $\\mu$s per event for an average of 6 charged tracks is obtained. The momentum resolution is about 3\\% (4\\%) for $p_t$ ($p_z$) at 1 GeV/c. Comparing to the algorithm running on a CPU chip (single core Intel Xeon E5520 at 2.26 GHz), an improvement of 3 orders of magnitude in processing time is obtained. The algorithm can handle severe overlapping of events which are typical for interaction rates above 10 MHz.
      

      
      Advanced electronics for the CTF MEG system.
      PubMed
      McCubbin, J; Vrba, J; Spear, P; McKenzie, D; Willis, R; Loewen, R; Robinson, S E; Fife, A A
         2004-11-30
         Development of the CTF MEG system has been advanced with the introduction of a computer processing cluster between the data acquisition electronics and the host computer. The advent of fast processors, memory, and network interfaces has made this innovation feasible for large data streams at high sampling rates. We have implemented tasks including anti-alias filter, sample rate decimation, higher gradient balancing, crosstalk correction, and optional filters with a cluster consisting of 4 dual Intel Xeon processors operating on up to 275 channel MEG systems at 12 kHz sample rate. The architecture is expandable with additional processors to implement advanced processing tasks which may include e.g., continuous head localization/motion correction, optional display filters, coherence calculations, or real time synthetic channels (via beamformer). We also describe an electronics configuration upgrade to provide operator console access to the peripheral interface features such as analog signal and trigger I/O. This allows remote location of the acoustically noisy electronics cabinet and fitting of the cabinet with doors for improved EMI shielding. Finally, we present the latest performance results available for the CTF 275 channel MEG system including an unshielded SEF (median nerve electrical stimulation) measurement enhanced by application of an adaptive beamformer technique (SAM) which allows recognition of the nominal 20-ms response in the unaveraged signal.
      

      
      Analysis of impact of general-purpose graphics processor units in supersonic flow modeling
      NASA Astrophysics Data System (ADS)
      Emelyanov, V. N.; Karpenko, A. G.; Kozelkov, A. S.; Teterina, I. V.; Volkov, K. N.; Yalozo, A. V.
         2017-06-01
         Computational methods are widely used in prediction of complex flowfields associated with off-normal situations in aerospace engineering. Modern graphics processing units (GPU) provide architectures and new programming models that enable to harness their large processing power and to design computational fluid dynamics (CFD) simulations at both high performance and low cost. Possibilities of the use of GPUs for the simulation of external and internal flows on unstructured meshes are discussed. The finite volume method is applied to solve three-dimensional unsteady compressible Euler and Navier-Stokes equations on unstructured meshes with high resolution numerical schemes. CUDA technology is used for programming implementation of parallel computational algorithms. Solutions of some benchmark test cases on GPUs are reported, and the results computed are compared with experimental and computational data. Approaches to optimization of the CFD code related to the use of different types of memory are considered. Speedup of solution on GPUs with respect to the solution on central processor unit (CPU) is compared. Performance measurements show that numerical schemes developed achieve 20-50 speedup on GPU hardware compared to CPU reference implementation. The results obtained provide promising perspective for designing a GPU-based software framework for applications in CFD.
      

      
      A FAST ITERATIVE METHOD FOR SOLVING THE EIKONAL EQUATION ON TRIANGULATED SURFACES*
      PubMed Central
      Fu, Zhisong; Jeong, Won-Ki; Pan, Yongsheng; Kirby, Robert M.; Whitaker, Ross T.
         2012-01-01
         This paper presents an efficient, fine-grained parallel algorithm for solving the Eikonal equation on triangular meshes. The Eikonal equation, and the broader class of Hamilton–Jacobi equations to which it belongs, have a wide range of applications from geometric optics and seismology to biological modeling and analysis of geometry and images. The ability to solve such equations accurately and efficiently provides new capabilities for exploring and visualizing parameter spaces and for solving inverse problems that rely on such equations in the forward model. Efficient solvers on state-of-the-art, parallel architectures require new algorithms that are not, in many cases, optimal, but are better suited to synchronous updates of the solution. In previous work [W. K. Jeong and R. T. Whitaker, SIAM J. Sci. Comput., 30 (2008), pp. 2512–2534], the authors proposed the fast iterative method (FIM) to efficiently solve the Eikonal equation on regular grids. In this paper we extend the fast iterative method to solve Eikonal equations efficiently on triangulated domains on the CPU and on parallel architectures, including graphics processors. We propose a new local update scheme that provides solutions of first-order accuracy for both architectures. We also propose a novel triangle-based update scheme and its corresponding data structure for efficient irregular data mapping to parallel single-instruction multiple-data (SIMD) processors. We provide detailed descriptions of the implementations on a single CPU, a multicore CPU with shared memory, and SIMD architectures with comparative results against state-of-the-art Eikonal solvers. PMID:22641200
      

      
      A GPU-accelerated semi-implicit fractional step method for numerical solutions of incompressible Navier-Stokes equations
      NASA Astrophysics Data System (ADS)
      Ha, Sanghyun; Park, Junshin; You, Donghyun
         2017-11-01
         Utility of the computational power of modern Graphics Processing Units (GPUs) is elaborated for solutions of incompressible Navier-Stokes equations which are integrated using a semi-implicit fractional-step method. Due to its serial and bandwidth-bound nature, the present choice of numerical methods is considered to be a good candidate for evaluating the potential of GPUs for solving Navier-Stokes equations using non-explicit time integration. An efficient algorithm is presented for GPU acceleration of the Alternating Direction Implicit (ADI) and the Fourier-transform-based direct solution method used in the semi-implicit fractional-step method. OpenMP is employed for concurrent collection of turbulence statistics on a CPU while Navier-Stokes equations are computed on a GPU. Extension to multiple NVIDIA GPUs is implemented using NVLink supported by the Pascal architecture. Performance of the present method is experimented on multiple Tesla P100 GPUs compared with a single-core Xeon E5-2650 v4 CPU in simulations of boundary-layer flow over a flat plate. Supported by the National Research Foundation of Korea (NRF) Grant funded by the Korea government (Ministry of Science, ICT and Future Planning NRF-2016R1E1A2A01939553, NRF-2014R1A2A1A11049599, and Ministry of Trade, Industry and Energy 201611101000230).
      

      
      Time-domain seismic modeling in viscoelastic media for full waveform inversion on heterogeneous computing platforms with OpenCL
      NASA Astrophysics Data System (ADS)
      Fabien-Ouellet, Gabriel; Gloaguen, Erwan; Giroux, Bernard
         2017-03-01
         Full Waveform Inversion (FWI) aims at recovering the elastic parameters of the Earth by matching recordings of the ground motion with the direct solution of the wave equation. Modeling the wave propagation for realistic scenarios is computationally intensive, which limits the applicability of FWI. The current hardware evolution brings increasing parallel computing power that can speed up the computations in FWI. However, to take advantage of the diversity of parallel architectures presently available, new programming approaches are required. In this work, we explore the use of OpenCL to develop a portable code that can take advantage of the many parallel processor architectures now available. We present a program called SeisCL for 2D and 3D viscoelastic FWI in the time domain. The code computes the forward and adjoint wavefields using finite-difference and outputs the gradient of the misfit function given by the adjoint state method. To demonstrate the code portability on different architectures, the performance of SeisCL is tested on three different devices: Intel CPUs, NVidia GPUs and Intel Xeon PHI. Results show that the use of GPUs with OpenCL can speed up the computations by nearly two orders of magnitudes over a single threaded application on the CPU. Although OpenCL allows code portability, we show that some device-specific optimization is still required to get the best performance out of a specific architecture. Using OpenCL in conjunction with MPI allows the domain decomposition of large models on several devices located on different nodes of a cluster. For large enough models, the speedup of the domain decomposition varies quasi-linearly with the number of devices. Finally, we investigate two different approaches to compute the gradient by the adjoint state method and show the significant advantages of using OpenCL for FWI.
      

      
      Development of seismic tomography software for hybrid supercomputers
      NASA Astrophysics Data System (ADS)
      Nikitin, Alexandr; Serdyukov, Alexandr; Duchkov, Anton
         2015-04-01
         Seismic tomography is a technique used for computing velocity model of geologic structure from first arrival travel times of seismic waves. The technique is used in processing of regional and global seismic data, in seismic exploration for prospecting and exploration of mineral and hydrocarbon deposits, and in seismic engineering for monitoring the condition of engineering structures and the surrounding host medium. As a consequence of development of seismic monitoring systems and increasing volume of seismic data, there is a growing need for new, more effective computational algorithms for use in seismic tomography applications with improved performance, accuracy and resolution. To achieve this goal, it is necessary to use modern high performance computing systems, such as supercomputers with hybrid architecture that use not only CPUs, but also accelerators and co-processors for computation. The goal of this research is the development of parallel seismic tomography algorithms and software package for such systems, to be used in processing of large volumes of seismic data (hundreds of gigabytes and more). These algorithms and software package will be optimized for the most common computing devices used in modern hybrid supercomputers, such as Intel Xeon CPUs, NVIDIA Tesla accelerators and Intel Xeon Phi co-processors. In this work, the following general scheme of seismic tomography is utilized. Using the eikonal equation solver, arrival times of seismic waves are computed based on assumed velocity model of geologic structure being analyzed. In order to solve the linearized inverse problem, tomographic matrix is computed that connects model adjustments with travel time residuals, and the resulting system of linear equations is regularized and solved to adjust the model. The effectiveness of parallel implementations of existing algorithms on target architectures is considered. During the first stage of this work, algorithms were developed for execution on supercomputers using multicore CPUs only, with preliminary performance tests showing good parallel efficiency on large numerical grids. Porting of the algorithms to hybrid supercomputers is currently ongoing.
      

      
      Rapid insights from remote sensing in the geosciences
      NASA Astrophysics Data System (ADS)
      Plaza, Antonio
         2015-03-01
         The growing availability of capacity computing for atomistic materials modeling has encouraged the use of high-accuracy computationally intensive interatomic potentials, such as SNAP. These potentials also happen to scale well on petascale computing platforms. SNAP has a very general form and uses machine-learning techniques to reproduce the energies, forces, and stress tensors of a large set of small configurations of atoms, which are obtained using high-accuracy quantum electronic structure (QM) calculations. The local environment of each atom is characterized by a set of bispectrum components of the local neighbor density projected on to a basis of hyperspherical harmonics in four dimensions. The computational cost per atom is much greater than that of simpler potentials such as Lennard-Jones or EAM, while the communication cost remains modest. We discuss a variety of strategies for implementing SNAP in the LAMMPS molecular dynamics package. We present scaling results obtained running SNAP on three different classes of machine: a conventional Intel Xeon CPU cluster; the Titan GPU-based system; and the combined Sequoia and Vulcan BlueGene/Q. The growing availability of capacity computing for atomistic materials modeling has encouraged the use of high-accuracy computationally intensive interatomic potentials, such as SNAP. These potentials also happen to scale well on petascale computing platforms. SNAP has a very general form and uses machine-learning techniques to reproduce the energies, forces, and stress tensors of a large set of small configurations of atoms, which are obtained using high-accuracy quantum electronic structure (QM) calculations. The local environment of each atom is characterized by a set of bispectrum components of the local neighbor density projected on to a basis of hyperspherical harmonics in four dimensions. The computational cost per atom is much greater than that of simpler potentials such as Lennard-Jones or EAM, while the communication cost remains modest. We discuss a variety of strategies for implementing SNAP in the LAMMPS molecular dynamics package. We present scaling results obtained running SNAP on three different classes of machine: a conventional Intel Xeon CPU cluster; the Titan GPU-based system; and the combined Sequoia and Vulcan BlueGene/Q. Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corp., for the U.S. Dept. of Energy's National Nuclear Security Admin. under Contract DE-AC04-94AL85000.
      

      
      Performance Study of Monte Carlo Codes on Xeon Phi Coprocessors — Testing MCNP 6.1 and Profiling ARCHER Geometry Module on the FS7ONNi Problem
      NASA Astrophysics Data System (ADS)
      Liu, Tianyu; Wolfe, Noah; Lin, Hui; Zieb, Kris; Ji, Wei; Caracappa, Peter; Carothers, Christopher; Xu, X. George
         2017-09-01
         This paper contains two parts revolving around Monte Carlo transport simulation on Intel Many Integrated Core coprocessors (MIC, also known as Xeon Phi). (1) MCNP 6.1 was recompiled into multithreading (OpenMP) and multiprocessing (MPI) forms respectively without modification to the source code. The new codes were tested on a 60-core 5110P MIC. The test case was FS7ONNi, a radiation shielding problem used in MCNP's verification and validation suite. It was observed that both codes became slower on the MIC than on a 6-core X5650 CPU, by a factor of 4 for the MPI code and, abnormally, 20 for the OpenMP code, and both exhibited limited capability of strong scaling. (2) We have recently added a Constructive Solid Geometry (CSG) module to our ARCHER code to provide better support for geometry modelling in radiation shielding simulation. The functions of this module are frequently called in the particle random walk process. To identify the performance bottleneck we developed a CSG proxy application and profiled the code using the geometry data from FS7ONNi. The profiling data showed that the code was primarily memory latency bound on the MIC. This study suggests that despite low initial porting e_ort, Monte Carlo codes do not naturally lend themselves to the MIC platform — just like to the GPUs, and that the memory latency problem needs to be addressed in order to achieve decent performance gain.
      

      
      The Fluke Security Project
      DTIC Science & Technology
      
         2000-04-01
         be an extension of Utah’s nascent Quarks system, oriented to closely coupled cluster environments. However, the grant did not actually begin until... Intel x86, implemented ten virtual machine monitors and servers, including a virtual memory manager, a checkpointer, a process manager, a file server...Fluke, we developed a novel hierarchical processor scheduling frame- work called CPU inheritance scheduling [5]. This is a framework for scheduling
      

      
      Automatic Adaptation of Tunable Distributed Applications
      DTIC Science & Technology
      
         2001-01-01
         size, weight, and battery life, with a single CPU, less memory, smaller hard disk, and lower bandwidth network connectivity. The power of PDAs is...wireless, and bluetooth [32] facilities; thus achieving different rates of data transmission. 1 With the trend of “write once, run everywhere...applications, a single component can execute on multiple processors (or machines) in parallel. These parallel applications, written in a specialized language
      

      
      MATCHED FILTER COMPUTATION ON FPGA, CELL, AND GPU
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      BAKER, ZACHARY K.; GOKHALE, MAYA B.; TRIPP, JUSTIN L.
         2007-01-08
         The matched filter is an important kernel in the processing of hyperspectral data. The filter enables researchers to sift useful data from instruments that span large frequency bands. In this work, they evaluate the performance of a matched filter algorithm implementation on accelerated co-processor (XD1000), the IBM Cell microprocessor, and the NVIDIA GeForce 6900 GTX GPU graphics card. They provide extensive discussion of the challenges and opportunities afforded by each platform. In particular, they explore the problems of partitioning the filter most efficiently between the host CPU and the co-processor. Using their results, they derive several performance metrics that providemore » the optimal solution for a variety of application situations.« less
      

        
       
          

«

8
      9
      10
   11
      12
      »

          
        

     

   

   
       
            
              
          

«

9
      10
      11
   12
      13
      »

          
        

           
           
             
               
      
      Loran-C digital word generator for use with a KIM-1 microprocessor system
      NASA Technical Reports Server (NTRS)
      Nickum, J. D.
         1977-01-01
         The problem of translating the time of occurrence of received Loran-C pulses into a time, referenced to a particular period of occurrence is addressed and applied to the design of a digital word generator for a Loran-C sensor processor package. The digital information from this word generator is processed in a KIM-1 microprocessor system which is based on the MOS 6502 CPU. This final system will consist of a complete time difference sensor processor for determining position information using Loran-C charts. The system consists of the KIM-1 microprocessor module, a 4K RAM memory board, a user interface, and the Loran-C word generator.
      

      
      Theorem Proving in Intel Hardware Design
      NASA Technical Reports Server (NTRS)
      O'Leary, John
         2009-01-01
         For the past decade, a framework combining model checking (symbolic trajectory evaluation) and higher-order logic theorem proving has been in production use at Intel. Our tools and methodology have been used to formally verify execution cluster functionality (including floating-point operations) for a number of Intel products, including the Pentium(Registered TradeMark)4 and Core(TradeMark)i7 processors. Hardware verification in 2009 is much more challenging than it was in 1999 - today s CPU chip designs contain many processor cores and significant firmware content. This talk will attempt to distill the lessons learned over the past ten years, discuss how they apply to today s problems, outline some future directions.
      

      
      GPU accelerated Monte-Carlo simulation of SEM images for metrology
      NASA Astrophysics Data System (ADS)
      Verduin, T.; Lokhorst, S. R.; Hagen, C. W.
         2016-03-01
         In this work we address the computation times of numerical studies in dimensional metrology. In particular, full Monte-Carlo simulation programs for scanning electron microscopy (SEM) image acquisition are known to be notoriously slow. Our quest in reducing the computation time of SEM image simulation has led us to investigate the use of graphics processing units (GPUs) for metrology. We have succeeded in creating a full Monte-Carlo simulation program for SEM images, which runs entirely on a GPU. The physical scattering models of this GPU simulator are identical to a previous CPU-based simulator, which includes the dielectric function model for inelastic scattering and also refinements for low-voltage SEM applications. As a case study for the performance, we considered the simulated exposure of a complex feature: an isolated silicon line with rough sidewalls located on a at silicon substrate. The surface of the rough feature is decomposed into 408 012 triangles. We have used an exposure dose of 6 mC/cm2, which corresponds to 6 553 600 primary electrons on average (Poisson distributed). We repeat the simulation for various primary electron energies, 300 eV, 500 eV, 800 eV, 1 keV, 3 keV and 5 keV. At first we run the simulation on a GeForce GTX480 from NVIDIA. The very same simulation is duplicated on our CPU-based program, for which we have used an Intel Xeon X5650. Apart from statistics in the simulation, no difference is found between the CPU and GPU simulated results. The GTX480 generates the images (depending on the primary electron energy) 350 to 425 times faster than a single threaded Intel X5650 CPU. Although this is a tremendous speedup, we actually have not reached the maximum throughput because of the limited amount of available memory on the GTX480. Nevertheless, the speedup enables the fast acquisition of simulated SEM images for metrology. We now have the potential to investigate case studies in CD-SEM metrology, which otherwise would take unreasonable amounts of computation time.
      

      
      Collaborating CPU and GPU for large-scale high-order CFD simulations with complex grids on the TianHe-1A supercomputer
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Xu, Chuanfu, E-mail: xuchuanfu@nudt.edu.cn; Deng, Xiaogang; Zhang, Lilun
         
         Programming and optimizing complex, real-world CFD codes on current many-core accelerated HPC systems is very challenging, especially when collaborating CPUs and accelerators to fully tap the potential of heterogeneous systems. In this paper, with a tri-level hybrid and heterogeneous programming model using MPI + OpenMP + CUDA, we port and optimize our high-order multi-block structured CFD software HOSTA on the GPU-accelerated TianHe-1A supercomputer. HOSTA adopts two self-developed high-order compact definite difference schemes WCNS and HDCS that can simulate flows with complex geometries. We present a dual-level parallelization scheme for efficient multi-block computation on GPUs and perform particular kernel optimizations formore » high-order CFD schemes. The GPU-only approach achieves a speedup of about 1.3 when comparing one Tesla M2050 GPU with two Xeon X5670 CPUs. To achieve a greater speedup, we collaborate CPU and GPU for HOSTA instead of using a naive GPU-only approach. We present a novel scheme to balance the loads between the store-poor GPU and the store-rich CPU. Taking CPU and GPU load balance into account, we improve the maximum simulation problem size per TianHe-1A node for HOSTA by 2.3×, meanwhile the collaborative approach can improve the performance by around 45% compared to the GPU-only approach. Further, to scale HOSTA on TianHe-1A, we propose a gather/scatter optimization to minimize PCI-e data transfer times for ghost and singularity data of 3D grid blocks, and overlap the collaborative computation and communication as far as possible using some advanced CUDA and MPI features. Scalability tests show that HOSTA can achieve a parallel efficiency of above 60% on 1024 TianHe-1A nodes. With our method, we have successfully simulated an EET high-lift airfoil configuration containing 800M cells and China's large civil airplane configuration containing 150M cells. To our best knowledge, those are the largest-scale CPU–GPU collaborative simulations that solve realistic CFD problems with both complex configurations and high-order schemes.« less
      

      
      Accelerated Application Development: The ORNL Titan Experience
      DOE PAGES
      Joubert, Wayne; Archibald, Richard K.; Berrill, Mark A.; ...
         2015-05-09
         The use of computational accelerators such as NVIDIA GPUs and Intel Xeon Phi processors is now widespread in the high performance computing community, with many applications delivering impressive performance gains. However, programming these systems for high performance, performance portability and software maintainability has been a challenge. In this paper we discuss experiences porting applications to the Titan system. Titan, which began planning in 2009 and was deployed for general use in 2013, was the first multi-petaflop system based on accelerator hardware. To ready applications for accelerated computing, a preparedness effort was undertaken prior to delivery of Titan. In this papermore » we report experiences and lessons learned from this process and describe how users are currently making use of computational accelerators on Titan.« less
      

      
      Accelerated application development: The ORNL Titan experience
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Joubert, Wayne; Archibald, Rick; Berrill, Mark
         2015-08-01
         The use of computational accelerators such as NVIDIA GPUs and Intel Xeon Phi processors is now widespread in the high performance computing community, with many applications delivering impressive performance gains. However, programming these systems for high performance, performance portability and software maintainability has been a challenge. In this paper we discuss experiences porting applications to the Titan system. Titan, which began planning in 2009 and was deployed for general use in 2013, was the first multi-petaflop system based on accelerator hardware. To ready applications for accelerated computing, a preparedness effort was undertaken prior to delivery of Titan. In this papermore » we report experiences and lessons learned from this process and describe how users are currently making use of computational accelerators on Titan.« less
      

      
      Parallelization of MRCI based on hole-particle symmetry.
      PubMed
      Suo, Bing; Zhai, Gaohong; Wang, Yubin; Wen, Zhenyi; Hu, Xiangqian; Li, Lemin
         2005-01-15
         The parallel implementation of multireference configuration interaction program based on the hole-particle symmetry is described. The platform to implement the parallelization is an Intel-Architectural cluster consisting of 12 nodes, each of which is equipped with two 2.4-G XEON processors, 3-GB memory, and 36-GB disk, and are connected by a Gigabit Ethernet Switch. The dependence of speedup on molecular symmetries and task granularities is discussed. Test calculations show that the scaling with the number of nodes is about 1.9 (for C1 and Cs), 1.65 (for C2v), and 1.55 (for D2h) when the number of nodes is doubled. The largest calculation performed on this cluster involves 5.6 x 10(8) CSFs.
      

      
      Electromagnetic physics models for parallel computing architectures
      DOE PAGES
      Amadio, G.; Ananya, A.; Apostolakis, J.; ...
         2016-11-21
         The recent emergence of hardware architectures characterized by many-core or accelerated processors has opened new opportunities for concurrent programming models taking advantage of both SIMD and SIMT architectures. GeantV, a next generation detector simulation, has been designed to exploit both the vector capability of mainstream CPUs and multi-threading capabilities of coprocessors including NVidia GPUs and Intel Xeon Phi. The characteristics of these architectures are very different in terms of the vectorization depth and type of parallelization needed to achieve optimal performance. In this paper we describe implementation of electromagnetic physics models developed for parallel computing architectures as a part ofmore » the GeantV project. Finally, the results of preliminary performance evaluation and physics validation are presented as well.« less
      

      
      First experience of vectorizing electromagnetic physics models for detector simulation
      NASA Astrophysics Data System (ADS)
      Amadio, G.; Apostolakis, J.; Bandieramonte, M.; Bianchini, C.; Bitzes, G.; Brun, R.; Canal, P.; Carminati, F.; de Fine Licht, J.; Duhem, L.; Elvira, D.; Gheata, A.; Jun, S. Y.; Lima, G.; Novak, M.; Presbyterian, M.; Shadura, O.; Seghal, R.; Wenzel, S.
         2015-12-01
         The recent emergence of hardware architectures characterized by many-core or accelerated processors has opened new opportunities for concurrent programming models taking advantage of both SIMD and SIMT architectures. The GeantV vector prototype for detector simulations has been designed to exploit both the vector capability of mainstream CPUs and multi-threading capabilities of coprocessors including NVidia GPUs and Intel Xeon Phi. The characteristics of these architectures are very different in terms of the vectorization depth, parallelization needed to achieve optimal performance or memory access latency and speed. An additional challenge is to avoid the code duplication often inherent to supporting heterogeneous platforms. In this paper we present the first experience of vectorizing electromagnetic physics models developed for the GeantV project.
      

      
      Parallelizing ATLAS Reconstruction and Simulation: Issues and Optimization Solutions for Scaling on Multi- and Many-CPU Platforms
      NASA Astrophysics Data System (ADS)
      Leggett, C.; Binet, S.; Jackson, K.; Levinthal, D.; Tatarkhanov, M.; Yao, Y.
         2011-12-01
         Thermal limitations have forced CPU manufacturers to shift from simply increasing clock speeds to improve processor performance, to producing chip designs with multi- and many-core architectures. Further the cores themselves can run multiple threads as a zero overhead context switch allowing low level resource sharing (Intel Hyperthreading). To maximize bandwidth and minimize memory latency, memory access has become non uniform (NUMA). As manufacturers add more cores to each chip, a careful understanding of the underlying architecture is required in order to fully utilize the available resources. We present AthenaMP and the Atlas event loop manager, the driver of the simulation and reconstruction engines, which have been rewritten to make use of multiple cores, by means of event based parallelism, and final stage I/O synchronization. However, initial studies on 8 andl6 core Intel architectures have shown marked non-linearities as parallel process counts increase, with as much as 30% reductions in event throughput in some scenarios. Since the Intel Nehalem architecture (both Gainestown and Westmere) will be the most common choice for the next round of hardware procurements, an understanding of these scaling issues is essential. Using hardware based event counters and Intel's Performance Tuning Utility, we have studied the performance bottlenecks at the hardware level, and discovered optimization schemes to maximize processor throughput. We have also produced optimization mechanisms, common to all large experiments, that address the extreme nature of today's HEP code, which due to it's size, places huge burdens on the memory infrastructure of today's processors.
      

      
      Right-Brain/Left-Brain Integrated Associative Processor Employing Convertible Multiple-Instruction-Stream Multiple-Data-Stream Elements
      NASA Astrophysics Data System (ADS)
      Hayakawa, Hitoshi; Ogawa, Makoto; Shibata, Tadashi
         2005-04-01
         A very large scale integrated circuit (VLSI) architecture for a multiple-instruction-stream multiple-data-stream (MIMD) associative processor has been proposed. The processor employs an architecture that enables seamless switching from associative operations to arithmetic operations. The MIMD element is convertible to a regular central processing unit (CPU) while maintaining its high performance as an associative processor. Therefore, the MIMD associative processor can perform not only on-chip perception, i.e., searching for the vector most similar to an input vector throughout the on-chip cache memory, but also arithmetic and logic operations similar to those in ordinary CPUs, both simultaneously in parallel processing. Three key technologies have been developed to generate the MIMD element: associative-operation-and-arithmetic-operation switchable calculation units, a versatile register control scheme within the MIMD element for flexible operations, and a short instruction set for minimizing the memory size for program storage. Key circuit blocks were designed and fabricated using 0.18 μm complementary metal-oxide-semiconductor (CMOS) technology. As a result, the full-featured MIMD element is estimated to be 3 mm2, showing the feasibility of an 8-parallel-MIMD-element associative processor in a single chip of 5 mm× 5 mm.
      

      
      Accelerated Monte Carlo Simulation on the Chemical Stage in Water Radiolysis using GPU
      PubMed Central
      Tian, Zhen; Jiang, Steve B.; Jia, Xun
         2018-01-01
         The accurate simulation of water radiolysis is an important step to understand the mechanisms of radiobiology and quantitatively test some hypotheses regarding radiobiological effects. However, the simulation of water radiolysis is highly time consuming, taking hours or even days to be completed by a conventional CPU processor. This time limitation hinders cell-level simulations for a number of research studies. We recently initiated efforts to develop gMicroMC, a GPU-based fast microscopic MC simulation package for water radiolysis. The first step of this project focused on accelerating the simulation of the chemical stage, the most time consuming stage in the entire water radiolysis process. A GPU-friendly parallelization strategy was designed to address the highly correlated many-body simulation problem caused by the mutual competitive chemical reactions between the radiolytic molecules. Two cases were tested, using a 750 keV electron and a 5 MeV proton incident in pure water, respectively. The time-dependent yields of all the radiolytic species during the chemical stage were used to evaluate the accuracy of the simulation. The relative differences between our simulation and the Geant4-DNA simulation were on average 5.3% and 4.4% for the two cases. Our package, executed on an Nvidia Titan black GPU card, successfully completed the chemical stage simulation of the two cases within 599.2 s and 489.0 s. As compared with Geant4-DNA that was executed on an Intel i7-5500U CPU processor and needed 28.6 h and 26.8 h for the two cases using a single CPU core, our package achieved a speed-up factor of 171.1-197.2. PMID:28323637
      

      
      Accelerated Monte Carlo simulation on the chemical stage in water radiolysis using GPU
      NASA Astrophysics Data System (ADS)
      Tian, Zhen; Jiang, Steve B.; Jia, Xun
         2017-04-01
         The accurate simulation of water radiolysis is an important step to understand the mechanisms of radiobiology and quantitatively test some hypotheses regarding radiobiological effects. However, the simulation of water radiolysis is highly time consuming, taking hours or even days to be completed by a conventional CPU processor. This time limitation hinders cell-level simulations for a number of research studies. We recently initiated efforts to develop gMicroMC, a GPU-based fast microscopic MC simulation package for water radiolysis. The first step of this project focused on accelerating the simulation of the chemical stage, the most time consuming stage in the entire water radiolysis process. A GPU-friendly parallelization strategy was designed to address the highly correlated many-body simulation problem caused by the mutual competitive chemical reactions between the radiolytic molecules. Two cases were tested, using a 750 keV electron and a 5 MeV proton incident in pure water, respectively. The time-dependent yields of all the radiolytic species during the chemical stage were used to evaluate the accuracy of the simulation. The relative differences between our simulation and the Geant4-DNA simulation were on average 5.3% and 4.4% for the two cases. Our package, executed on an Nvidia Titan black GPU card, successfully completed the chemical stage simulation of the two cases within 599.2 s and 489.0 s. As compared with Geant4-DNA that was executed on an Intel i7-5500U CPU processor and needed 28.6 h and 26.8 h for the two cases using a single CPU core, our package achieved a speed-up factor of 171.1-197.2.
      

      
      Accelerated Monte Carlo simulation on the chemical stage in water radiolysis using GPU.
      PubMed
      Tian, Zhen; Jiang, Steve B; Jia, Xun
         2017-04-21
         The accurate simulation of water radiolysis is an important step to understand the mechanisms of radiobiology and quantitatively test some hypotheses regarding radiobiological effects. However, the simulation of water radiolysis is highly time consuming, taking hours or even days to be completed by a conventional CPU processor. This time limitation hinders cell-level simulations for a number of research studies. We recently initiated efforts to develop gMicroMC, a GPU-based fast microscopic MC simulation package for water radiolysis. The first step of this project focused on accelerating the simulation of the chemical stage, the most time consuming stage in the entire water radiolysis process. A GPU-friendly parallelization strategy was designed to address the highly correlated many-body simulation problem caused by the mutual competitive chemical reactions between the radiolytic molecules. Two cases were tested, using a 750 keV electron and a 5 MeV proton incident in pure water, respectively. The time-dependent yields of all the radiolytic species during the chemical stage were used to evaluate the accuracy of the simulation. The relative differences between our simulation and the Geant4-DNA simulation were on average 5.3% and 4.4% for the two cases. Our package, executed on an Nvidia Titan black GPU card, successfully completed the chemical stage simulation of the two cases within 599.2 s and 489.0 s. As compared with Geant4-DNA that was executed on an Intel i7-5500U CPU processor and needed 28.6 h and 26.8 h for the two cases using a single CPU core, our package achieved a speed-up factor of 171.1-197.2.
      

      
      The impact of Moore's Law and loss of Dennard scaling: Are DSP SoCs an energy efficient alternative to x86 SoCs?
      NASA Astrophysics Data System (ADS)
      Johnsson, L.; Netzer, G.
         2016-10-01
         Moore's law, the doubling of transistors per unit area for each CMOS technology generation, is expected to continue throughout the decade, while Dennard voltage scaling resulting in constant power per unit area stopped about a decade ago. The semiconductor industry's response to the loss of Dennard scaling and the consequent challenges in managing power distribution and dissipation has been leveled off clock rates, a die performance gain reduced from about a factor of 2.8 to 1.4 per technology generation, and multi-core processor dies with increased cache sizes. Increased caches sizes offers performance benefits for many applications as well as energy savings. Accessing data in cache is considerably more energy efficient than main memory accesses. Further, caches consume less power than a corresponding amount of functional logic. As feature sizes continue to be scaled down an increasing fraction of the die must be “underutilized” or “dark” due to power constraints. With power being a prime design constraint there is a concerted effort to find significantly more energy efficient chip architectures than dominant in servers today, with chips potentially incorporating several types of cores to cover a range of applications, or different functions in an application, as is already common for the mobile processor market. Digital Signal Processors (DSPs), largely targeting the embedded and mobile processor markets, typically have been designed for a power consumption of 10% or less of a typical x86 CPU, yet with much more than 10% of the floating-point capability of the same technology generation x86 CPUs. Thus, DSPs could potentially offer an energy efficient alternative to x86 CPUs. Here we report an assessment of the Texas Instruments TMS320C6678 DSP in regards to its energy efficiency for two common HPC benchmarks: STREAM (memory system benchmark) and HPL (CPU benchmark)
      

      
      Parallelization of a Monte Carlo particle transport simulation code
      NASA Astrophysics Data System (ADS)
      Hadjidoukas, P.; Bousis, C.; Emfietzoglou, D.
         2010-05-01
         We have developed a high performance version of the Monte Carlo particle transport simulation code MC4. The original application code, developed in Visual Basic for Applications (VBA) for Microsoft Excel, was first rewritten in the C programming language for improving code portability. Several pseudo-random number generators have been also integrated and studied. The new MC4 version was then parallelized for shared and distributed-memory multiprocessor systems using the Message Passing Interface. Two parallel pseudo-random number generator libraries (SPRNG and DCMT) have been seamlessly integrated. The performance speedup of parallel MC4 has been studied on a variety of parallel computing architectures including an Intel Xeon server with 4 dual-core processors, a Sun cluster consisting of 16 nodes of 2 dual-core AMD Opteron processors and a 200 dual-processor HP cluster. For large problem size, which is limited only by the physical memory of the multiprocessor server, the speedup results are almost linear on all systems. We have validated the parallel implementation against the serial VBA and C implementations using the same random number generator. Our experimental results on the transport and energy loss of electrons in a water medium show that the serial and parallel codes are equivalent in accuracy. The present improvements allow for studying of higher particle energies with the use of more accurate physical models, and improve statistics as more particles tracks can be simulated in low response time.
      

      
      DIALIGN P: fast pair-wise and multiple sequence alignment using parallel processors.
      PubMed
      Schmollinger, Martin; Nieselt, Kay; Kaufmann, Michael; Morgenstern, Burkhard
         2004-09-09
         Parallel computing is frequently used to speed up computationally expensive tasks in Bioinformatics. Herein, a parallel version of the multi-alignment program DIALIGN is introduced. We propose two ways of dividing the program into independent sub-routines that can be run on different processors: (a) pair-wise sequence alignments that are used as a first step to multiple alignment account for most of the CPU time in DIALIGN. Since alignments of different sequence pairs are completely independent of each other, they can be distributed to multiple processors without any effect on the resulting output alignments. (b) For alignments of large genomic sequences, we use a heuristics by splitting up sequences into sub-sequences based on a previously introduced anchored alignment procedure. For our test sequences, this combined approach reduces the program running time of DIALIGN by up to 97%. By distributing sub-routines to multiple processors, the running time of DIALIGN can be crucially improved. With these improvements, it is possible to apply the program in large-scale genomics and proteomics projects that were previously beyond its scope.
      

      
      A low power biomedical signal processor ASIC based on hardware software codesign.
      PubMed
      Nie, Z D; Wang, L; Chen, W G; Zhang, T; Zhang, Y T
         2009-01-01
         A low power biomedical digital signal processor ASIC based on hardware and software codesign methodology was presented in this paper. The codesign methodology was used to achieve higher system performance and design flexibility. The hardware implementation included a low power 32bit RISC CPU ARM7TDMI, a low power AHB-compatible bus, and a scalable digital co-processor that was optimized for low power Fast Fourier Transform (FFT) calculations. The co-processor could be scaled for 8-point, 16-point and 32-point FFTs, taking approximate 50, 100 and 150 clock circles, respectively. The complete design was intensively simulated using ARM DSM model and was emulated by ARM Versatile platform, before conducted to silicon. The multi-million-gate ASIC was fabricated using SMIC 0.18 microm mixed-signal CMOS 1P6M technology. The die area measures 5,000 microm x 2,350 microm. The power consumption was approximately 3.6 mW at 1.8 V power supply and 1 MHz clock rate. The power consumption for FFT calculations was less than 1.5 % comparing with the conventional embedded software-based solution.
      

      
      Orthorectification by Using Gpgpu Method
      NASA Astrophysics Data System (ADS)
      Sahin, H.; Kulur, S.
         2012-07-01
         Thanks to the nature of the graphics processing, the newly released products offer highly parallel processing units with high-memory bandwidth and computational power of more than teraflops per second. The modern GPUs are not only powerful graphic engines but also they are high level parallel programmable processors with very fast computing capabilities and high-memory bandwidth speed compared to central processing units (CPU). Data-parallel computations can be shortly described as mapping data elements to parallel processing threads. The rapid development of GPUs programmability and capabilities attracted the attentions of researchers dealing with complex problems which need high level calculations. This interest has revealed the concepts of "General Purpose Computation on Graphics Processing Units (GPGPU)" and "stream processing". The graphic processors are powerful hardware which is really cheap and affordable. So the graphic processors became an alternative to computer processors. The graphic chips which were standard application hardware have been transformed into modern, powerful and programmable processors to meet the overall needs. Especially in recent years, the phenomenon of the usage of graphics processing units in general purpose computation has led the researchers and developers to this point. The biggest problem is that the graphics processing units use different programming models unlike current programming methods. Therefore, an efficient GPU programming requires re-coding of the current program algorithm by considering the limitations and the structure of the graphics hardware. Currently, multi-core processors can not be programmed by using traditional programming methods. Event procedure programming method can not be used for programming the multi-core processors. GPUs are especially effective in finding solution for repetition of the computing steps for many data elements when high accuracy is needed. Thus, it provides the computing process more quickly and accurately. Compared to the GPUs, CPUs which perform just one computing in a time according to the flow control are slower in performance. This structure can be evaluated for various applications of computer technology. In this study covers how general purpose parallel programming and computational power of the GPUs can be used in photogrammetric applications especially direct georeferencing. The direct georeferencing algorithm is coded by using GPGPU method and CUDA (Compute Unified Device Architecture) programming language. Results provided by this method were compared with the traditional CPU programming. In the other application the projective rectification is coded by using GPGPU method and CUDA programming language. Sample images of various sizes, as compared to the results of the program were evaluated. GPGPU method can be used especially in repetition of same computations on highly dense data, thus finding the solution quickly.
      

      
      High Performance Computing Assets for Ocean Acoustics Research
      DTIC Science & Technology
      
         2016-11-18
         independently on processing units with access to a typically available amount of memory, say 16 or 32 gigabytes. Our models require each processor to...allow results to be obtained with limited amounts of memory available to individual processing units (with no time frame for successful completion...put into use. One file server computer to store simulation output has also been purchased. The first workstation has 28 CPU cores, dual- thread , (56
      

        
       
          

«

9
      10
      11
   12
      13
      »

          
        

     

   

   
       
            
              
          

«

10
      11
      12
   13
      14
      »

          
        

           
           
             
               
      
      Multiprocessing MCNP on an IBM RS/6000 cluster
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      McKinney, G.W.; West, J.T.
         1993-01-01
         The advent of high-performance computer systems has brought to maturity programming concepts like vectorization, multiprocessing, and multitasking. While there are many schools of thought as to the most significant factor in obtaining order-of-magnitude increases in performance, such speedup can only be achieved by integrating the computer system and application code. Vectorization leads to faster manipulation of arrays by overlapping instruction CPU cycles. Discrete ordinates codes, which require the solving of large matrices, have proved to be major benefactors of vectorization. Monte Carlo transport, on the other hand, typically contains numerous logic statements and requires extensive redevelopment to benefit from vectorization.more » Multiprocessing and multitasking provide additional CPU cycles via multiple processors. Such systems are generally designed with either common memory access (multitasking) or distributed memory access. In both cases, theoretical speedup, as a function of the number of processors (P) and the fraction of task time that multiprocesses (f), can be formulated using Amdahl's Law S ((f,P) = 1 f + f/P). However, for most applications this theoretical limit cannot be achieved, due to additional terms not included in Amdahl's Law. Monte Carlo transport is a natural candidate for multiprocessing, since the particle tracks are generally independent and the precision of the result increases as the square root of the number of particles tracked.« less
      

      
      Research on control law accelerator of digital signal process chip TMS320F28035 for real-time data acquisition and processing
      NASA Astrophysics Data System (ADS)
      Zhao, Shuangle; Zhang, Xueyi; Sun, Shengli; Wang, Xudong
         2017-08-01
         TI C2000 series digital signal process (DSP) chip has been widely used in electrical engineering, measurement and control, communications and other professional fields, DSP TMS320F28035 is one of the most representative of a kind. When using the DSP program, need data acquisition and data processing, and if the use of common mode C or assembly language programming, the program sequence, analogue-to-digital (AD) converter cannot be real-time acquisition, often missing a lot of data. The control low accelerator (CLA) processor can run in parallel with the main central processing unit (CPU), and the frequency is consistent with the main CPU, and has the function of floating point operations. Therefore, the CLA coprocessor is used in the program, and the CLA kernel is responsible for data processing. The main CPU is responsible for the AD conversion. The advantage of this method is to reduce the time of data processing and realize the real-time performance of data acquisition.
      

      
      A report documenting the completion of the Los Alamos National Laboratory portion of the ASC level II milestone ""Visualization on the supercomputing platform
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Ahrens, James P; Patchett, John M; Lo, Li - Ta
         2011-01-24
         This report provides documentation for the completion of the Los Alamos portion of the ASC Level II 'Visualization on the Supercomputing Platform' milestone. This ASC Level II milestone is a joint milestone between Sandia National Laboratory and Los Alamos National Laboratory. The milestone text is shown in Figure 1 with the Los Alamos portions highlighted in boldfaced text. Visualization and analysis of petascale data is limited by several factors which must be addressed as ACES delivers the Cielo platform. Two primary difficulties are: (1) Performance of interactive rendering, which is the most computationally intensive portion of the visualization process. Formore » terascale platforms, commodity clusters with graphics processors (GPUs) have been used for interactive rendering. For petascale platforms, visualization and rendering may be able to run efficiently on the supercomputer platform itself. (2) I/O bandwidth, which limits how much information can be written to disk. If we simply analyze the sparse information that is saved to disk we miss the opportunity to analyze the rich information produced every timestep by the simulation. For the first issue, we are pursuing in-situ analysis, in which simulations are coupled directly with analysis libraries at runtime. This milestone will evaluate the visualization and rendering performance of current and next generation supercomputers in contrast to GPU-based visualization clusters, and evaluate the perfromance of common analysis libraries coupled with the simulation that analyze and write data to disk during a running simulation. This milestone will explore, evaluate and advance the maturity level of these technologies and their applicability to problems of interest to the ASC program. In conclusion, we improved CPU-based rendering performance by a a factor of 2-10 times on our tests. In addition, we evaluated CPU and CPU-based rendering performance. We encourage production visualization experts to consider using CPU-based rendering solutions when it is appropriate. For example, on remote supercomputers CPU-based rendering can offer a means of viewing data without having to offload the data or geometry onto a CPU-based visualization system. In terms of comparative performance of the CPU and CPU we believe that further optimizations of the performance of both CPU or CPU-based rendering are possible. The simulation community is currently confronting this reality as they work to port their simulations to different hardware architectures. What is interesting about CPU rendering of massive datasets is that for part two decades CPU performance has significantly outperformed CPU-based systems. Based on our advancements, evaluations and explorations we believe that CPU-based rendering has returned as one viable option for the visualization of massive datasets.« less
      

      
      Graphics processing unit accelerated phase field dislocation dynamics: Application to bi-metallic interfaces
      DOE PAGES
      Eghtesad, Adnan; Germaschewski, Kai; Beyerlein, Irene J.; ...
         2017-10-14
         We present the first high-performance computing implementation of the meso-scale phase field dislocation dynamics (PFDD) model on a graphics processing unit (GPU)-based platform. The implementation takes advantage of the portable OpenACC standard directive pragmas along with Nvidia's compute unified device architecture (CUDA) fast Fourier transform (FFT) library called CUFFT to execute the FFT computations within the PFDD formulation on the same GPU platform. The overall implementation is termed ACCPFDD-CUFFT. The package is entirely performance portable due to the use of OPENACC-CUDA inter-operability, in which calls to CUDA functions are replaced with the OPENACC data regions for a host central processingmore » unit (CPU) and device (GPU). A comprehensive benchmark study has been conducted, which compares a number of FFT routines, the Numerical Recipes FFT (FOURN), Fastest Fourier Transform in the West (FFTW), and the CUFFT. The last one exploits the advantages of the GPU hardware for FFT calculations. The novel ACCPFDD-CUFFT implementation is verified using the analytical solutions for the stress field around an infinite edge dislocation and subsequently applied to simulate the interaction and motion of dislocations through a bi-phase copper-nickel (Cu–Ni) interface. It is demonstrated that the ACCPFDD-CUFFT implementation on a single TESLA K80 GPU offers a 27.6X speedup relative to the serial version and a 5X speedup relative to the 22-multicore Intel Xeon CPU E5-2699 v4 @ 2.20 GHz version of the code.« less
      

      
      Graphics processing unit accelerated phase field dislocation dynamics: Application to bi-metallic interfaces
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Eghtesad, Adnan; Germaschewski, Kai; Beyerlein, Irene J.
         
         We present the first high-performance computing implementation of the meso-scale phase field dislocation dynamics (PFDD) model on a graphics processing unit (GPU)-based platform. The implementation takes advantage of the portable OpenACC standard directive pragmas along with Nvidia's compute unified device architecture (CUDA) fast Fourier transform (FFT) library called CUFFT to execute the FFT computations within the PFDD formulation on the same GPU platform. The overall implementation is termed ACCPFDD-CUFFT. The package is entirely performance portable due to the use of OPENACC-CUDA inter-operability, in which calls to CUDA functions are replaced with the OPENACC data regions for a host central processingmore » unit (CPU) and device (GPU). A comprehensive benchmark study has been conducted, which compares a number of FFT routines, the Numerical Recipes FFT (FOURN), Fastest Fourier Transform in the West (FFTW), and the CUFFT. The last one exploits the advantages of the GPU hardware for FFT calculations. The novel ACCPFDD-CUFFT implementation is verified using the analytical solutions for the stress field around an infinite edge dislocation and subsequently applied to simulate the interaction and motion of dislocations through a bi-phase copper-nickel (Cu–Ni) interface. It is demonstrated that the ACCPFDD-CUFFT implementation on a single TESLA K80 GPU offers a 27.6X speedup relative to the serial version and a 5X speedup relative to the 22-multicore Intel Xeon CPU E5-2699 v4 @ 2.20 GHz version of the code.« less
      

      
      MC64-ClustalWP2: A Highly-Parallel Hybrid Strategy to Align Multiple Sequences in Many-Core Architectures
      PubMed Central
      Díaz, David; Esteban, Francisco J.; Hernández, Pilar; Caballero, Juan Antonio; Guevara, Antonio
         2014-01-01
         We have developed the MC64-ClustalWP2 as a new implementation of the Clustal W algorithm, integrating a novel parallelization strategy and significantly increasing the performance when aligning long sequences in architectures with many cores. It must be stressed that in such a process, the detailed analysis of both the software and hardware features and peculiarities is of paramount importance to reveal key points to exploit and optimize the full potential of parallelism in many-core CPU systems. The new parallelization approach has focused into the most time-consuming stages of this algorithm. In particular, the so-called progressive alignment has drastically improved the performance, due to a fine-grained approach where the forward and backward loops were unrolled and parallelized. Another key approach has been the implementation of the new algorithm in a hybrid-computing system, integrating both an Intel Xeon multi-core CPU and a Tilera Tile64 many-core card. A comparison with other Clustal W implementations reveals the high-performance of the new algorithm and strategy in many-core CPU architectures, in a scenario where the sequences to align are relatively long (more than 10 kb) and, hence, a many-core GPU hardware cannot be used. Thus, the MC64-ClustalWP2 runs multiple alignments more than 18x than the original Clustal W algorithm, and more than 7x than the best x86 parallel implementation to date, being publicly available through a web service. Besides, these developments have been deployed in cost-effective personal computers and should be useful for life-science researchers, including the identification of identities and differences for mutation/polymorphism analyses, biodiversity and evolutionary studies and for the development of molecular markers for paternity testing, germplasm management and protection, to assist breeding, illegal traffic control, fraud prevention and for the protection of the intellectual property (identification/traceability), including the protected designation of origin, among other applications. PMID:24710354
      

      
      Accelerating a three-dimensional eco-hydrological cellular automaton on GPGPU with OpenCL
      NASA Astrophysics Data System (ADS)
      Senatore, Alfonso; D'Ambrosio, Donato; De Rango, Alessio; Rongo, Rocco; Spataro, William; Straface, Salvatore; Mendicino, Giuseppe
         2016-10-01
         This work presents an effective implementation of a numerical model for complete eco-hydrological Cellular Automata modeling on Graphical Processing Units (GPU) with OpenCL (Open Computing Language) for heterogeneous computation (i.e., on CPUs and/or GPUs). Different types of parallel implementations were carried out (e.g., use of fast local memory, loop unrolling, etc), showing increasing performance improvements in terms of speedup, adopting also some original optimizations strategies. Moreover, numerical analysis of results (i.e., comparison of CPU and GPU outcomes in terms of rounding errors) have proven to be satisfactory. Experiments were carried out on a workstation with two CPUs (Intel Xeon E5440 at 2.83GHz), one GPU AMD R9 280X and one GPU nVIDIA Tesla K20c. Results have been extremely positive, but further testing should be performed to assess the functionality of the adopted strategies on other complete models and their ability to fruitfully exploit parallel systems resources.
      

      
      CUDA-based acceleration of collateral filtering in brain MR images
      NASA Astrophysics Data System (ADS)
      Li, Cheng-Yuan; Chang, Herng-Hua
         2017-02-01
         Image denoising is one of the fundamental and essential tasks within image processing. In medical imaging, finding an effective algorithm that can remove random noise in MR images is important. This paper proposes an effective noise reduction method for brain magnetic resonance (MR) images. Our approach is based on the collateral filter which is a more powerful method than the bilateral filter in many cases. However, the computation of the collateral filter algorithm is quite time-consuming. To solve this problem, we improved the collateral filter algorithm with parallel computing using GPU. We adopted CUDA, an application programming interface for GPU by NVIDIA, to accelerate the computation. Our experimental evaluation on an Intel Xeon CPU E5-2620 v3 2.40GHz with a NVIDIA Tesla K40c GPU indicated that the proposed implementation runs dramatically faster than the traditional collateral filter. We believe that the proposed framework has established a general blueprint for achieving fast and robust filtering in a wide variety of medical image denoising applications.
      

      
      Large Scale GW Calculations on the Cori System
      NASA Astrophysics Data System (ADS)
      Deslippe, Jack; Del Ben, Mauro; da Jornada, Felipe; Canning, Andrew; Louie, Steven
         
         The NERSC Cori system, powered by 9000+ Intel Xeon-Phi processors, represents one of the largest HPC systems for open-science in the United States and the world. We discuss the optimization of the GW methodology for this system, including both node level and system-scale optimizations. We highlight multiple large scale (thousands of atoms) case studies and discuss both absolute application performance and comparison to calculations on more traditional HPC architectures. We find that the GW method is particularly well suited for many-core architectures due to the ability to exploit a large amount of parallelism across many layers of the system. This work was supported by the U.S. Department of Energy, Office of Science, Basic Energy Sciences, Materials Sciences and Engineering Division, as part of the Computational Materials Sciences Program.
      

      
      Benchmarking and tuning the MILC code on clusters and supercomputers
      NASA Astrophysics Data System (ADS)
      Gottlieb, Steven
         2002-03-01
         Recently, we have benchmarked and tuned the MILC code on a number of architectures including Intel Itanium and Pentium IV (PIV), dual-CPU Athlon, and the latest Compaq Alpha nodes. Results will be presented for many of these, and we shall discuss some simple code changes that can result in a very dramatic speedup of the KS conjugate gradient on processors with more advanced memory systems such as PIV, IBM SP and Alpha.
      

      
      Benchmarking and tuning the MILC code on clusters and supercomputers
      NASA Astrophysics Data System (ADS)
      Gottlieb, Steven
         
         Recently, we have benchmarked and tuned the MILC code on a number of architectures including Intel Itanium and Pentium IV (PIV), dual-CPU Athlon, and the latest Compaq Alpha nodes. Results will be presented for many of these, and we shall discuss some simple code changes that can result in a very dramatic speedup of the KS conjugate gradient on processors with more advanced memory systems such as PIV, IBM SP and Alpha.
      

      
      Research on SEU hardening of heterogeneous Dual-Core SoC
      NASA Astrophysics Data System (ADS)
      Huang, Kun; Hu, Keliu; Deng, Jun; Zhang, Tao
         2017-08-01
         The implementation of Single-Event Upsets (SEU) hardening has various schemes. However, some of them require a lot of human, material and financial resources. This paper proposes an easy scheme on SEU hardening for Heterogeneous Dual-core SoC (HD SoC) which contains three techniques. First, the automatic Triple Modular Redundancy (TMR) technique is adopted to harden the register heaps of the processor and the instruction-fetching module. Second, Hamming codes are used to harden the random access memory (RAM). Last, a software signature technique is applied to check the programs which are running on CPU. The scheme need not to consume additional resources, and has little influence on the performance of CPU. These technologies are very mature, easy to implement and needs low cost. According to the simulation result, the scheme can satisfy the basic demand of SEU-hardening.
      

      
      Free-Space Optical Interconnect Employing VCSEL Diodes
      NASA Technical Reports Server (NTRS)
      Simons, Rainee N.; Savich, Gregory R.; Torres, Heidi
         2009-01-01
         Sensor signal processing is widely used on aircraft and spacecraft. The scheme employs multiple input/output nodes for data acquisition and CPU (central processing unit) nodes for data processing. To connect 110 nodes and CPU nodes, scalable interconnections such as backplanes are desired because the number of nodes depends on requirements of each mission. An optical backplane consisting of vertical-cavity surface-emitting lasers (VCSELs), VCSEL drivers, photodetectors, and transimpedance amplifiers is the preferred approach since it can handle several hundred megabits per second data throughput.The next generation of satellite-borne systems will require transceivers and processors that can handle several Gb/s of data. Optical interconnects have been praised for both their speed and functionality with hopes that light can relieve the electrical bottleneck predicted for the near future. Optoelectronic interconnects provide a factor of ten improvement over electrical interconnects.
      

      
      Irregular large-scale computed tomography on multiple graphics processors improves energy-efficiency metrics for industrial applications
      NASA Astrophysics Data System (ADS)
      Jimenez, Edward S.; Goodman, Eric L.; Park, Ryeojin; Orr, Laurel J.; Thompson, Kyle R.
         2014-09-01
         This paper will investigate energy-efficiency for various real-world industrial computed-tomography reconstruction algorithms, both CPU- and GPU-based implementations. This work shows that the energy required for a given reconstruction is based on performance and problem size. There are many ways to describe performance and energy efficiency, thus this work will investigate multiple metrics including performance-per-watt, energy-delay product, and energy consumption. This work found that irregular GPU-based approaches1 realized tremendous savings in energy consumption when compared to CPU implementations while also significantly improving the performance-per- watt and energy-delay product metrics. Additional energy savings and other metric improvement was realized on the GPU-based reconstructions by improving storage I/O by implementing a parallel MIMD-like modularization of the compute and I/O tasks.
      

      
      Bayer image parallel decoding based on GPU
      NASA Astrophysics Data System (ADS)
      Hu, Rihui; Xu, Zhiyong; Wei, Yuxing; Sun, Shaohua
         2012-11-01
         In the photoelectrical tracking system, Bayer image is decompressed in traditional method, which is CPU-based. However, it is too slow when the images become large, for example, 2K×2K×16bit. In order to accelerate the Bayer image decoding, this paper introduces a parallel speedup method for NVIDA's Graphics Processor Unit (GPU) which supports CUDA architecture. The decoding procedure can be divided into three parts: the first is serial part, the second is task-parallelism part, and the last is data-parallelism part including inverse quantization, inverse discrete wavelet transform (IDWT) as well as image post-processing part. For reducing the execution time, the task-parallelism part is optimized by OpenMP techniques. The data-parallelism part could advance its efficiency through executing on the GPU as CUDA parallel program. The optimization techniques include instruction optimization, shared memory access optimization, the access memory coalesced optimization and texture memory optimization. In particular, it can significantly speed up the IDWT by rewriting the 2D (Tow-dimensional) serial IDWT into 1D parallel IDWT. Through experimenting with 1K×1K×16bit Bayer image, data-parallelism part is 10 more times faster than CPU-based implementation. Finally, a CPU+GPU heterogeneous decompression system was designed. The experimental result shows that it could achieve 3 to 5 times speed increase compared to the CPU serial method.
      

      
      Scalable Algorithms for Clustering Large Geospatiotemporal Data Sets on Manycore Architectures
      NASA Astrophysics Data System (ADS)
      Mills, R. T.; Hoffman, F. M.; Kumar, J.; Sreepathi, S.; Sripathi, V.
         2016-12-01
         The increasing availability of high-resolution geospatiotemporal data sets from sources such as observatory networks, remote sensing platforms, and computational Earth system models has opened new possibilities for knowledge discovery using data sets fused from disparate sources. Traditional algorithms and computing platforms are impractical for the analysis and synthesis of data sets of this size; however, new algorithmic approaches that can effectively utilize the complex memory hierarchies and the extremely high levels of available parallelism in state-of-the-art high-performance computing platforms can enable such analysis. We describe a massively parallel implementation of accelerated k-means clustering and some optimizations to boost computational intensity and utilization of wide SIMD lanes on state-of-the art multi- and manycore processors, including the second-generation Intel Xeon Phi ("Knights Landing") processor based on the Intel Many Integrated Core (MIC) architecture, which includes several new features, including an on-package high-bandwidth memory. We also analyze the code in the context of a few practical applications to the analysis of climatic and remotely-sensed vegetation phenology data sets, and speculate on some of the new applications that such scalable analysis methods may enable.
      

      
      High-performance hardware implementation of a parallel database search engine for real-time peptide mass fingerprinting
      PubMed Central
      Bogdán, István A.; Rivers, Jenny; Beynon, Robert J.; Coca, Daniel
         2008-01-01
         Motivation: Peptide mass fingerprinting (PMF) is a method for protein identification in which a protein is fragmented by a defined cleavage protocol (usually proteolysis with trypsin), and the masses of these products constitute a ‘fingerprint’ that can be searched against theoretical fingerprints of all known proteins. In the first stage of PMF, the raw mass spectrometric data are processed to generate a peptide mass list. In the second stage this protein fingerprint is used to search a database of known proteins for the best protein match. Although current software solutions can typically deliver a match in a relatively short time, a system that can find a match in real time could change the way in which PMF is deployed and presented. In a paper published earlier we presented a hardware design of a raw mass spectra processor that, when implemented in Field Programmable Gate Array (FPGA) hardware, achieves almost 170-fold speed gain relative to a conventional software implementation running on a dual processor server. In this article we present a complementary hardware realization of a parallel database search engine that, when running on a Xilinx Virtex 2 FPGA at 100 MHz, delivers 1800-fold speed-up compared with an equivalent C software routine, running on a 3.06 GHz Xeon workstation. The inherent scalability of the design means that processing speed can be multiplied by deploying the design on multiple FPGAs. The database search processor and the mass spectra processor, running on a reconfigurable computing platform, provide a complete real-time PMF protein identification solution. Contact: d.coca@sheffield.ac.uk PMID:18453553
      

      
      Computational multicore on two-layer 1D shallow water equations for erodible dambreak
      NASA Astrophysics Data System (ADS)
      Simanjuntak, C. A.; Bagustara, B. A. R. H.; Gunawan, P. H.
         2018-03-01
         The simulation of erodible dambreak using two-layer shallow water equations and SCHR scheme are elaborated in this paper. The results show that the two-layer SWE model in a good agreement with the data experiment which is performed by Louvain-la-Neuve Université Catholique de Louvain. Moreover, the parallel algorithm with multicore architecture are given in the results. The results show that Computer I with processor Intel(R) Core(TM) i5-2500 CPU Quad-Core has the best performance to accelerate the computational time. Moreover, Computer III with processor AMD A6-5200 APU Quad-Core is observed has higher speedup and efficiency. The speedup and efficiency of Computer III with number of grids 3200 are 3.716050530 times and 92.9% respectively.
      

      
      The Impact of IBM Cell Technology on the Programming Paradigm in the Context of Computer Systems for Climate and Weather Models
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Zhou, Shujia; Duffy, Daniel; Clune, Thomas
         
         The call for ever-increasing model resolutions and physical processes in climate and weather models demands a continual increase in computing power. The IBM Cell processor's order-of-magnitude peak performance increase over conventional processors makes it very attractive to fulfill this requirement. However, the Cell's characteristics, 256KB local memory per SPE and the new low-level communication mechanism, make it very challenging to port an application. As a trial, we selected the solar radiation component of the NASA GEOS-5 climate model, which: (1) is representative of column physics components (half the total computational time), (2) has an extremely high computational intensity: the ratiomore » of computational load to main memory transfers, and (3) exhibits embarrassingly parallel column computations. In this paper, we converted the baseline code (single-precision Fortran) to C and ported it to an IBM BladeCenter QS20. For performance, we manually SIMDize four independent columns and include several unrolling optimizations. Our results show that when compared with the baseline implementation running on one core of Intel's Xeon Woodcrest, Dempsey, and Itanium2, the Cell is approximately 8.8x, 11.6x, and 12.8x faster, respectively. Our preliminary analysis shows that the Cell can also accelerate the dynamics component (~;;25percent total computational time). We believe these dramatic performance improvements make the Cell processor very competitive as an accelerator.« less
      

      
      Symposium on Turbulence (10th) Held in Rollo, Missouri on September 22-24, 1986
      DTIC Science & Technology
      
         1986-09-24
         that speckle velooimstry is rather excercised when attempting to obtain promising under 3000 oiraulstanoes, quantitative information from thisCould yOU...and free. small scale intermittency , it will have~Jack Herrin2. NCAR: Isn’t the reason to be based on some alternative measure olarge helicity...processor obtained under NASA’s Nu- merical Aerodynamic Simulation (NAS) project combines a relatively fast CPU with about 258 million words of memory. This
      

        
       
          

«

10
      11
      12
   13
      14
      »

          
        

     

   

   
       
            
              
          

«

11
      12
      13
   14
      15
      »

          
        

           
           
             
               
      
      DSS 13 Microprocessor Antenna Controller
      NASA Technical Reports Server (NTRS)
      Gosline, R. M.
         1984-01-01
         A microprocessor based antenna controller system developed as part of the unattended station project for DSS 13 is described. Both the hardware and software top level designs are presented and the major problems encounted are discussed. Developments useful to related projects include a JPL standard 15 line interface using a single board computer, a general purpose parser, a fast floating point to ASCII conversion technique, and experience gained in using off board floating point processors with the 8080 CPU.
      

      
      Data General Corporation Advanced Operating System/Virtual Storage (AOS/ VS). Revision 7.60
      DTIC Science & Technology
      
         1989-02-22
         control list for each directory and data file. An access control list includes the users who can and cannot access files as well as the access...and any required data, it can -5- February 22, 1989 Final Evaluation Report Data General AOS/VS SYSTEM OVERVIEW operate asynchronously and in parallel...memory. The IOC can perform the data transfer without further interventiin from the CPU. The I/O channels interface with the processor or system
      

      
      Multiprocessing MCNP on an IBN RS/6000 cluster
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      McKinney, G.W.; West, J.T.
         1993-01-01
         The advent of high-performance computer systems has brought to maturity programming concepts like vectorization, multiprocessing, and multitasking. While there are many schools of thought as to the most significant factor in obtaining order-of-magnitude increases in performance, such speedup can only be achieved by integrating the computer system and application code. Vectorization leads to faster manipulation of arrays by overlapping instruction CPU cycles. Discrete ordinates codes, which require the solving of large matrices, have proved to be major benefactors of vectorization. Monte Carlo transport, on the other hand, typically contains numerous logic statements and requires extensive redevelopment to benefit from vectorization.more » Multiprocessing and multitasking provide additional CPU cycles via multiple processors. Such systems are generally designed with either common memory access (multitasking) or distributed memory access. In both cases, theoretical speedup, as a function of the number of processors P and the fraction f of task time that multiprocesses, can be formulated using Amdahl's law: S(f, P) =1/(1-f+f/P). However, for most applications, this theoretical limit cannot be achieved because of additional terms (e.g., multitasking overhead, memory overlap, etc.) that are not included in Amdahl's law. Monte Carlo transport is a natural candidate for multiprocessing because the particle tracks are generally independent, and the precision of the result increases as the square Foot of the number of particles tracked.« less
      

      
      Multiprocessing MCNP on an IBM RS/6000 cluster
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      McKinney, G.W.; West, J.T.
         1993-03-01
         The advent of high-performance computer systems has brought to maturity programming concepts like vectorization, multiprocessing, and multitasking. While there are many schools of thought as to the most significant factor in obtaining order-of-magnitude increases in performance, such speedup can only be achieved by integrating the computer system and application code. Vectorization leads to faster manipulation of arrays by overlapping instruction CPU cycles. Discrete ordinates codes, which require the solving of large matrices, have proved to be major benefactors of vectorization. Monte Carlo transport, on the other hand, typically contains numerous logic statements and requires extensive redevelopment to benefit from vectorization.more » Multiprocessing and multitasking provide additional CPU cycles via multiple processors. Such systems are generally designed with either common memory access (multitasking) or distributed memory access. In both cases, theoretical speedup, as a function of the number of processors (P) and the fraction of task time that multiprocesses (f), can be formulated using Amdahl`s Law S ((f,P) = 1 f + f/P). However, for most applications this theoretical limit cannot be achieved, due to additional terms not included in Amdahl`s Law. Monte Carlo transport is a natural candidate for multiprocessing, since the particle tracks are generally independent and the precision of the result increases as the square root of the number of particles tracked.« less
      

      
      A communication-avoiding, hybrid-parallel, rank-revealing orthogonalization method.
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Hoemmen, Mark
         2010-11-01
         Orthogonalization consumes much of the run time of many iterative methods for solving sparse linear systems and eigenvalue problems. Commonly used algorithms, such as variants of Gram-Schmidt or Householder QR, have performance dominated by communication. Here, 'communication' includes both data movement between the CPU and memory, and messages between processors in parallel. Our Tall Skinny QR (TSQR) family of algorithms requires asymptotically fewer messages between processors and data movement between CPU and memory than typical orthogonalization methods, yet achieves the same accuracy as Householder QR factorization. Furthermore, in block orthogonalizations, TSQR is faster and more accurate than existing approaches formore » orthogonalizing the vectors within each block ('normalization'). TSQR's rank-revealing capability also makes it useful for detecting deflation in block iterative methods, for which existing approaches sacrifice performance, accuracy, or both. We have implemented a version of TSQR that exploits both distributed-memory and shared-memory parallelism, and supports real and complex arithmetic. Our implementation is optimized for the case of orthogonalizing a small number (5-20) of very long vectors. The shared-memory parallel component uses Intel's Threading Building Blocks, though its modular design supports other shared-memory programming models as well, including computation on the GPU. Our implementation achieves speedups of 2 times or more over competing orthogonalizations. It is available now in the development branch of the Trilinos software package, and will be included in the 10.8 release.« less
      

      
      Real-time image reconstruction and display system for MRI using a high-speed personal computer.
      PubMed
      Haishi, T; Kose, K
         1998-09-01
         A real-time NMR image reconstruction and display system was developed using a high-speed personal computer and optimized for the 32-bit multitasking Microsoft Windows 95 operating system. The system was operated at various CPU clock frequencies by changing the motherboard clock frequency and the processor/bus frequency ratio. When the Pentium CPU was used at the 200 MHz clock frequency, the reconstruction time for one 128 x 128 pixel image was 48 ms and that for the image display on the enlarged 256 x 256 pixel window was about 8 ms. NMR imaging experiments were performed with three fast imaging sequences (FLASH, multishot EPI, and one-shot EPI) to demonstrate the ability of the real-time system. It was concluded that in most cases, high-speed PC would be the best choice for the image reconstruction and display system for real-time MRI. Copyright 1998 Academic Press.
      

      
      Miniature Heat Pipes
      NASA Technical Reports Server (NTRS)
      
         1997-01-01
         Small Business Innovation Research contracts from Goddard Space Flight Center to Thermacore Inc. have fostered the company work on devices tagged "heat pipes" for space application. To control the extreme temperature ranges in space, heat pipes are important to spacecraft. The problem was to maintain an 8-watt central processing unit (CPU) at less than 90 C in a notebook computer using no power, with very little space available and without using forced convection. Thermacore's answer was in the design of a powder metal wick that transfers CPU heat from a tightly confined spot to an area near available air flow. The heat pipe technology permits a notebook computer to be operated in any position without loss of performance. Miniature heat pipe technology has successfully been applied, such as in Pentium Processor notebook computers. The company expects its heat pipes to accommodate desktop computers as well. Cellular phones, camcorders, and other hand-held electronics are forsible applications for heat pipes.
      

      
      Benchmarking worker nodes using LHCb productions and comparing with HEPSpec06
      NASA Astrophysics Data System (ADS)
      Charpentier, P.
         2017-10-01
         In order to estimate the capabilities of a computing slot with limited processing time, it is necessary to know with a rather good precision its “power”. This allows for example pilot jobs to match a task for which the required CPU-work is known, or to define the number of events to be processed knowing the CPU-work per event. Otherwise one always has the risk that the task is aborted because it exceeds the CPU capabilities of the resource. It also allows a better accounting of the consumed resources. The traditional way the CPU power is estimated in WLCG since 2007 is using the HEP-Spec06 benchmark (HS06) suite that was verified at the time to scale properly with a set of typical HEP applications. However, the hardware architecture of processors has evolved, all WLCG experiments moved to using 64-bit applications and use different compilation flags from those advertised for running HS06. It is therefore interesting to check the scaling of HS06 with the HEP applications. For this purpose, we have been using CPU intensive massive simulation productions from the LHCb experiment and compared their event throughput to the HS06 rating of the worker nodes. We also compared it with a much faster benchmark script that is used by the DIRAC framework used by LHCb for evaluating at run time the performance of the worker nodes. This contribution reports on the finding of these comparisons: the main observation is that the scaling with HS06 is no longer fulfilled, while the fast benchmarks have a better scaling but are less precise. One can also clearly see that some hardware or software features when enabled on the worker nodes may enhance their performance beyond expectation from either benchmark, depending on external factors.
      

      
      Method and apparatus for measuring spatial uniformity of radiation
      DOEpatents
      Field, Halden
         2002-01-01
         A method and apparatus for measuring the spatial uniformity of the intensity of a radiation beam from a radiation source based on a single sampling time and/or a single pulse of radiation. The measuring apparatus includes a plurality of radiation detectors positioned on planar mounting plate to form a radiation receiving area that has a shape and size approximating the size and shape of the cross section of the radiation beam. The detectors concurrently receive portions of the radiation beam and transmit electrical signals representative of the intensity of impinging radiation to a signal processor circuit connected to each of the detectors and adapted to concurrently receive the electrical signals from the detectors and process with a central processing unit (CPU) the signals to determine intensities of the radiation impinging at each detector location. The CPU displays the determined intensities and relative intensity values corresponding to each detector location to an operator of the measuring apparatus on an included data display device. Concurrent sampling of each detector is achieved by connecting to each detector a sample and hold circuit that is configured to track the signal and store it upon receipt of a "capture" signal. A switching device then selectively retrieves the signals and transmits the signals to the CPU through a single analog to digital (A/D) converter. The "capture" signal. is then removed from the sample-and-hold circuits. Alternatively, concurrent sampling is achieved by providing an A/D converter for each detector, each of which transmits a corresponding digital signal to the CPU. The sampling or reading of the detector signals can be controlled by the CPU or level-detection and timing circuit.
      

      
      SU-E-J-91: FFT Based Medical Image Registration Using a Graphics Processing Unit (GPU).
      PubMed
      Luce, J; Hoggarth, M; Lin, J; Block, A; Roeske, J
         2012-06-01
         To evaluate the efficiency gains obtained from using a Graphics Processing Unit (GPU) to perform a Fourier Transform (FT) based image registration. Fourier-based image registration involves obtaining the FT of the component images, and analyzing them in Fourier space to determine the translations and rotations of one image set relative to another. An important property of FT registration is that by enlarging the images (adding additional pixels), one can obtain translations and rotations with sub-pixel resolution. The expense, however, is an increased computational time. GPUs may decrease the computational time associated with FT image registration by taking advantage of their parallel architecture to perform matrix computations much more efficiently than a Central Processor Unit (CPU). In order to evaluate the computational gains produced by a GPU, images with known translational shifts were utilized. A program was written in the Interactive Data Language (IDL; Exelis, Boulder, CO) to performCPU-based calculations. Subsequently, the program was modified using GPU bindings (Tech-X, Boulder, CO) to perform GPU-based computation on the same system. Multiple image sizes were used, ranging from 256×256 to 2304×2304. The time required to complete the full algorithm by the CPU and GPU were benchmarked and the speed increase was defined as the ratio of the CPU-to-GPU computational time. The ratio of the CPU-to- GPU time was greater than 1.0 for all images, which indicates the GPU is performing the algorithm faster than the CPU. The smallest improvement, a 1.21 ratio, was found with the smallest image size of 256×256, and the largest speedup, a 4.25 ratio, was observed with the largest image size of 2304×2304. GPU programming resulted in a significant decrease in computational time associated with a FT image registration algorithm. The inclusion of the GPU may provide near real-time, sub-pixel registration capability. © 2012 American Association of Physicists in Medicine.
      

      
      Evaluating Multi-core Architectures through Accelerating the Three-Dimensional Lax–Wendroff Correction
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      You, Yang; Fu, Haohuan; Song, Shuaiwen
         2014-07-18
         Wave propagation forward modeling is a widely used computational method in oil and gas exploration. The iterative stencil loops in such problems have broad applications in scientific computing. However, executing such loops can be highly time time-consuming, which greatly limits application’s performance and power efficiency. In this paper, we accelerate the forward modeling technique on the latest multi-core and many-core architectures such as Intel Sandy Bridge CPUs, NVIDIA Fermi C2070 GPU, NVIDIA Kepler K20x GPU, and the Intel Xeon Phi Co-processor. For the GPU platforms, we propose two parallel strategies to explore the performance optimization opportunities for our stencil kernels.more » For Sandy Bridge CPUs and MIC, we also employ various optimization techniques in order to achieve the best.« less
      

      
      A GPU-accelerated semi-implicit fractional-step method for numerical solutions of incompressible Navier-Stokes equations
      NASA Astrophysics Data System (ADS)
      Ha, Sanghyun; Park, Junshin; You, Donghyun
         2018-01-01
         Utility of the computational power of Graphics Processing Units (GPUs) is elaborated for solutions of incompressible Navier-Stokes equations which are integrated using a semi-implicit fractional-step method. The Alternating Direction Implicit (ADI) and the Fourier-transform-based direct solution methods used in the semi-implicit fractional-step method take advantage of multiple tridiagonal matrices whose inversion is known as the major bottleneck for acceleration on a typical multi-core machine. A novel implementation of the semi-implicit fractional-step method designed for GPU acceleration of the incompressible Navier-Stokes equations is presented. Aspects of the programing model of Compute Unified Device Architecture (CUDA), which are critical to the bandwidth-bound nature of the present method are discussed in detail. A data layout for efficient use of CUDA libraries is proposed for acceleration of tridiagonal matrix inversion and fast Fourier transform. OpenMP is employed for concurrent collection of turbulence statistics on a CPU while the Navier-Stokes equations are computed on a GPU. Performance of the present method using CUDA is assessed by comparing the speed of solving three tridiagonal matrices using ADI with the speed of solving one heptadiagonal matrix using a conjugate gradient method. An overall speedup of 20 times is achieved using a Tesla K40 GPU in comparison with a single-core Xeon E5-2660 v3 CPU in simulations of turbulent boundary-layer flow over a flat plate conducted on over 134 million grids. Enhanced performance of 48 times speedup is reached for the same problem using a Tesla P100 GPU.
      

      
      ICON-MIC: Implementing a CPU/MIC Collaboration Parallel Framework for ICON on Tianhe-2 Supercomputer.
      PubMed
      Wang, Zihao; Chen, Yu; Zhang, Jingrong; Li, Lun; Wan, Xiaohua; Liu, Zhiyong; Sun, Fei; Zhang, Fa
         2018-03-01
         Electron tomography (ET) is an important technique for studying the three-dimensional structures of the biological ultrastructure. Recently, ET has reached sub-nanometer resolution for investigating the native and conformational dynamics of macromolecular complexes by combining with the sub-tomogram averaging approach. Due to the limited sampling angles, ET reconstruction typically suffers from the "missing wedge" problem. Using a validation procedure, iterative compressed-sensing optimized nonuniform fast Fourier transform (NUFFT) reconstruction (ICON) demonstrates its power in restoring validated missing information for a low-signal-to-noise ratio biological ET dataset. However, the huge computational demand has become a bottleneck for the application of ICON. In this work, we implemented a parallel acceleration technology ICON-many integrated core (MIC) on Xeon Phi cards to address the huge computational demand of ICON. During this step, we parallelize the element-wise matrix operations and use the efficient summation of a matrix to reduce the cost of matrix computation. We also developed parallel versions of NUFFT on MIC to achieve a high acceleration of ICON by using more efficient fast Fourier transform (FFT) calculation. We then proposed a hybrid task allocation strategy (two-level load balancing) to improve the overall performance of ICON-MIC by making full use of the idle resources on Tianhe-2 supercomputer. Experimental results using two different datasets show that ICON-MIC has high accuracy in biological specimens under different noise levels and a significant acceleration, up to 13.3 × , compared with the CPU version. Further, ICON-MIC has good scalability efficiency and overall performance on Tianhe-2 supercomputer.
      

      
      Multi-Threaded Algorithms for GPGPU in the ATLAS High Level Trigger
      NASA Astrophysics Data System (ADS)
      Conde Muíño, P.; ATLAS Collaboration
         2017-10-01
         General purpose Graphics Processor Units (GPGPU) are being evaluated for possible future inclusion in an upgraded ATLAS High Level Trigger farm. We have developed a demonstrator including GPGPU implementations of Inner Detector and Muon tracking and Calorimeter clustering within the ATLAS software framework. ATLAS is a general purpose particle physics experiment located on the LHC collider at CERN. The ATLAS Trigger system consists of two levels, with Level-1 implemented in hardware and the High Level Trigger implemented in software running on a farm of commodity CPU. The High Level Trigger reduces the trigger rate from the 100 kHz Level-1 acceptance rate to 1.5 kHz for recording, requiring an average per-event processing time of ∼ 250 ms for this task. The selection in the high level trigger is based on reconstructing tracks in the Inner Detector and Muon Spectrometer and clusters of energy deposited in the Calorimeter. Performing this reconstruction within the available farm resources presents a significant challenge that will increase significantly with future LHC upgrades. During the LHC data taking period starting in 2021, luminosity will reach up to three times the original design value. Luminosity will increase further to 7.5 times the design value in 2026 following LHC and ATLAS upgrades. Corresponding improvements in the speed of the reconstruction code will be needed to provide the required trigger selection power within affordable computing resources. Key factors determining the potential benefit of including GPGPU as part of the HLT processor farm are: the relative speed of the CPU and GPGPU algorithm implementations; the relative execution times of the GPGPU algorithms and serial code remaining on the CPU; the number of GPGPU required, and the relative financial cost of the selected GPGPU. We give a brief overview of the algorithms implemented and present new measurements that compare the performance of various configurations exploiting GPGPU cards.
      

      
      WARP3D-Release 10.8: Dynamic Nonlinear Analysis of Solids using a Preconditioned Conjugate Gradient Software Architecture
      NASA Technical Reports Server (NTRS)
      Koppenhoefer, Kyle C.; Gullerud, Arne S.; Ruggieri, Claudio; Dodds, Robert H., Jr.; Healy, Brian E.
         1998-01-01
         This report describes theoretical background material and commands necessary to use the WARP3D finite element code. WARP3D is under continuing development as a research code for the solution of very large-scale, 3-D solid models subjected to static and dynamic loads. Specific features in the code oriented toward the investigation of ductile fracture in metals include a robust finite strain formulation, a general J-integral computation facility (with inertia, face loading), an element extinction facility to model crack growth, nonlinear material models including viscoplastic effects, and the Gurson-Tver-gaard dilatant plasticity model for void growth. The nonlinear, dynamic equilibrium equations are solved using an incremental-iterative, implicit formulation with full Newton iterations to eliminate residual nodal forces. The history integration of the nonlinear equations of motion is accomplished with Newmarks Beta method. A central feature of WARP3D involves the use of a linear-preconditioned conjugate gradient (LPCG) solver implemented in an element-by-element format to replace a conventional direct linear equation solver. This software architecture dramatically reduces both the memory requirements and CPU time for very large, nonlinear solid models since formation of the assembled (dynamic) stiffness matrix is avoided. Analyses thus exhibit the numerical stability for large time (load) steps provided by the implicit formulation coupled with the low memory requirements characteristic of an explicit code. In addition to the much lower memory requirements of the LPCG solver, the CPU time required for solution of the linear equations during each Newton iteration is generally one-half or less of the CPU time required for a traditional direct solver. All other computational aspects of the code (element stiffnesses, element strains, stress updating, element internal forces) are implemented in the element-by- element, blocked architecture. This greatly improves vectorization of the code on uni-processor hardware and enables straightforward parallel-vector processing of element blocks on multi-processor hardware.
      

      
      Development of an optical parallel logic device and a half-adder circuit for digital optical processing
      NASA Technical Reports Server (NTRS)
      Athale, R. A.; Lee, S. H.
         1978-01-01
         The paper describes the fabrication and operation of an optical parallel logic (OPAL) device which performs Boolean algebraic operations on binary images. Several logic operations on two input binary images were demonstrated using an 8 x 8 device with a CdS photoconductor and a twisted nematic liquid crystal. Two such OPAL devices can be interconnected to form a half-adder circuit which is one of the essential components of a CPU in a digital signal processor.
      

      
      A Study on the Effectiveness of Lockup-Free Caches for a Reduced Instruction Set Computer (RISC) Processor
      DTIC Science & Technology
      
         1992-09-01
         to acquire or develop effective simulation tools to observe the behavior of a RISC implementation as it executes different types of programs . We choose...Performance Computer performance is measured by the amount of the time required to execute a program . Performance encompasses two types of time, elapsed time...and CPU time. Elapsed time is the time required to execute a program from start to finish. It includes latency of input/output activities such as
      

      
      Floating-point performance of ARM cores and their efficiency in classical molecular dynamics
      NASA Astrophysics Data System (ADS)
      Nikolskiy, V.; Stegailov, V.
         2016-02-01
         Supercomputing of the exascale era is going to be inevitably limited by power efficiency. Nowadays different possible variants of CPU architectures are considered. Recently the development of ARM processors has come to the point when their floating point performance can be seriously considered for a range of scientific applications. In this work we present the analysis of the floating point performance of the latest ARM cores and their efficiency for the algorithms of classical molecular dynamics.
      

      
      A 32-bit NMOS microprocessor with a large register file
      NASA Astrophysics Data System (ADS)
      Sherburne, R. W., Jr.; Katevenis, M. G. H.; Patterson, D. A.; Sequin, C. H.
         1984-10-01
         Two scaled versions of a 32-bit NMOS reduced instruction set computer CPU, called RISC II, have been implemented on two different processing lines using the simple Mead and Conway layout rules with lambda values of 2 and 1.5 microns (corresponding to drawn gate lengths of 4 and 3 microns), respectively. The design utilizes a small set of simple instructions in conjunction with a large register file in order to provide high performance. This approach has resulted in two surprisingly powerful single-chip processors.
      

      
      Survey and Analysis of Environmental Requirements for Shipboard Electronic Equipment Applications. Appendix A. Volume 2.
      DTIC Science & Technology
      
         1991-07-31
         INTELLIGENT SCSI DMV-719 MAS MIL CONTROLLER DY-4 SYSTEMS BYTE-WIDE MEMORY CARD DMV-536 MEM MIL DY-4 SYSTEMS POWER SUPPLY UNIT DMV-870 PWR MIL P age No. 5 06/10...FORCE COMPUTERS PROCESSOR CPU-386 SERIES SBC COM FORCE COMPUTERS ADVANCED SYSTEM CONTROL ASCU -1/2 SBC COM UNITI FORCE COMPUTERS GRAPHICS CONTROLLER AGC...RECORD VENDOR: JANZ COMPUTER AG DIVISION: VENDOR ADDRESS: Im Doerener Feld 3 D-4790 Paderborn Germany MARKETING: Johannes Kunz TECHNICAL: Arnulf
      

        
       
          

«

11
      12
      13
   14
      15
      »

          
        

     

   

   
       
            
              
          

«

12
      13
      14
   15
      16
      »

          
        

           
           
             
               
      
      Spectrophotometry (by Barbara Sawrey and Gabriele Wienhausen)
      NASA Astrophysics Data System (ADS)
      Pringle, David L.
         1998-08-01
         Science Media: San Diego, 1997. 1-10 copies, 99 each; 11-20 copies, 69 each; 21+ copies, $49 each. (Note: CD operates with both Mac and PC.) Spectrophotometry is an interactive CD-ROM which introduces the basics of UV-visible spectrophotometry with some mention of infrared and other forms of spectrophotometry. A Macintosh System 7.5 or higher, CPU 68040 or Power PC processor, 6 megabytes of free RAM, 2.6 megabytes of free disk space, and 4X CD-ROM or faster are required.
      

      
      Efficient Approximation Algorithms for Weighted $b$-Matching
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Khan, Arif; Pothen, Alex; Mostofa Ali Patwary, Md.
         2016-01-01
         We describe a half-approximation algorithm, b-Suitor, for computing a b-Matching of maximum weight in a graph with weights on the edges. b-Matching is a generalization of the well-known Matching problem in graphs, where the objective is to choose a subset of M edges in the graph such that at most a specified number b(v) of edges in M are incident on each vertex v. Subject to this restriction we maximize the sum of the weights of the edges in M. We prove that the b-Suitor algorithm computes the same b-Matching as the one obtained by the greedy algorithm for themore » problem. We implement the algorithm on serial and shared-memory parallel processors, and compare its performance against a collection of approximation algorithms that have been proposed for the Matching problem. Our results show that the b-Suitor algorithm outperforms the Greedy and Locally Dominant edge algorithms by one to two orders of magnitude on a serial processor. The b-Suitor algorithm has a high degree of concurrency, and it scales well up to 240 threads on a shared memory multiprocessor. The b-Suitor algorithm outperforms the Locally Dominant edge algorithm by a factor of fourteen on 16 cores of an Intel Xeon multiprocessor.« less
      

      
      Quantum Chemical Calculations Using Accelerators: Migrating Matrix Operations to the NVIDIA Kepler GPU and the Intel Xeon Phi.
      PubMed
      Leang, Sarom S; Rendell, Alistair P; Gordon, Mark S
         2014-03-11
         Increasingly, modern computer systems comprise a multicore general-purpose processor augmented with a number of special purpose devices or accelerators connected via an external interface such as a PCI bus. The NVIDIA Kepler Graphical Processing Unit (GPU) and the Intel Phi are two examples of such accelerators. Accelerators offer peak performances that can be well above those of the host processor. How to exploit this heterogeneous environment for legacy application codes is not, however, straightforward. This paper considers how matrix operations in typical quantum chemical calculations can be migrated to the GPU and Phi systems. Double precision general matrix multiply operations are endemic in electronic structure calculations, especially methods that include electron correlation, such as density functional theory, second order perturbation theory, and coupled cluster theory. The use of approaches that automatically determine whether to use the host or an accelerator, based on problem size, is explored, with computations that are occurring on the accelerator and/or the host. For data-transfers over PCI-e, the GPU provides the best overall performance for data sizes up to 4096 MB with consistent upload and download rates between 5-5.6 GB/s and 5.4-6.3 GB/s, respectively. The GPU outperforms the Phi for both square and nonsquare matrix multiplications.
      

      
      Open release of the DCA++ project
      NASA Astrophysics Data System (ADS)
      Haehner, Urs; Solca, Raffaele; Staar, Peter; Alvarez, Gonzalo; Maier, Thomas; Summers, Michael; Schulthess, Thomas
         
         We present the first open release of the DCA++ project, a highly scalable and efficient research code to solve quantum many-body problems with cutting edge quantum cluster algorithms. The implemented dynamical cluster approximation (DCA) and its DCA+ extension with a continuous self-energy capture nonlocal correlations in strongly correlated electron systems thereby allowing insight into high-Tc superconductivity. With the increasing heterogeneity of modern machines, DCA++ provides portable performance on conventional and emerging new architectures, such as hybrid CPU-GPU and Xeon Phi, sustaining multiple petaflops on ORNL's Titan and CSCS' Piz Daint. Moreover, we will describe how best practices in software engineering can be applied to make software development sustainable and scalable in a research group. Software testing and documentation not only prevent productivity collapse, but more importantly, they are necessary for correctness, credibility and reproducibility of scientific results. This research used resources of the Oak Ridge Leadership Computing Facility (OLCF) awarded by the INCITE program, and of the Swiss National Supercomputing Center. OLCF is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725.
      

      
      Performance Analysis of GFDL's GCM Line-By-Line Radiative Transfer Model on GPU and MIC Architectures
      NASA Astrophysics Data System (ADS)
      Menzel, R.; Paynter, D.; Jones, A. L.
         2017-12-01
         Due to their relatively low computational cost, radiative transfer models in global climate models (GCMs) run on traditional CPU architectures generally consist of shortwave and longwave parameterizations over a small number of wavelength bands. With the rise of newer GPU and MIC architectures, however, the performance of high resolution line-by-line radiative transfer models may soon approach those of the physical parameterizations currently employed in GCMs. Here we present an analysis of the current performance of a new line-by-line radiative transfer model currently under development at GFDL. Although originally designed to specifically exploit GPU architectures through the use of CUDA, the radiative transfer model has recently been extended to include OpenMP in an effort to also effectively target MIC architectures such as Intel's Xeon Phi. Using input data provided by the upcoming Radiative Forcing Model Intercomparison Project (RFMIP, as part of CMIP 6), we compare model results and performance data for various model configurations and spectral resolutions run on both GPU and Intel Knights Landing architectures to analogous runs of the standard Oxford Reference Forward Model on traditional CPUs.
      

      
      A heterogeneous computing accelerated SCE-UA global optimization method using OpenMP, OpenCL, CUDA, and OpenACC.
      PubMed
      Kan, Guangyuan; He, Xiaoyan; Ding, Liuqian; Li, Jiren; Liang, Ke; Hong, Yang
         2017-10-01
         The shuffled complex evolution optimization developed at the University of Arizona (SCE-UA) has been successfully applied in various kinds of scientific and engineering optimization applications, such as hydrological model parameter calibration, for many years. The algorithm possesses good global optimality, convergence stability and robustness. However, benchmark and real-world applications reveal the poor computational efficiency of the SCE-UA. This research aims at the parallelization and acceleration of the SCE-UA method based on powerful heterogeneous computing technology. The parallel SCE-UA is implemented on Intel Xeon multi-core CPU (by using OpenMP and OpenCL) and NVIDIA Tesla many-core GPU (by using OpenCL, CUDA, and OpenACC). The serial and parallel SCE-UA were tested based on the Griewank benchmark function. Comparison results indicate the parallel SCE-UA significantly improves computational efficiency compared to the original serial version. The OpenCL implementation obtains the best overall acceleration results however, with the most complex source code. The parallel SCE-UA has bright prospects to be applied in real-world applications.
      

      
      The Transition to a Many-core World
      NASA Astrophysics Data System (ADS)
      Mattson, T. G.
         2012-12-01
         The need to increase performance within a fixed energy budget has pushed the computer industry to many core processors. This is grounded in the physics of computing and is not a trend that will just go away. It is hard to overestimate the profound impact of many-core processors on software developers. Virtually every facet of the software development process will need to change to adapt to these new processors. In this talk, we will look at many-core hardware and consider its evolution from a perspective grounded in the CPU. We will show that the number of cores will inevitably increase, but in addition, a quest to maximize performance per watt will push these cores to be heterogeneous. We will show that the inevitable result of these changes is a computing landscape where the distinction between the CPU and the GPU is blurred. We will then consider the much more pressing problem of software in a many core world. Writing software for heterogeneous many core processors is well beyond the ability of current programmers. One solution is to support a software development process where programmer teams are split into two distinct groups: a large group of domain-expert productivity programmers and much smaller team of computer-scientist efficiency programmers. The productivity programmers work in terms of high level frameworks to express the concurrency in their problems while avoiding any details for how that concurrency is exploited. The second group, the efficiency programmers, map applications expressed in terms of these frameworks onto the target many-core system. In other words, we can solve the many-core software problem by creating a software infrastructure that only requires a small subset of programmers to become master parallel programmers. This is different from the discredited dream of automatic parallelism. Note that productivity programmers still need to define the architecture of their software in a way that exposes the concurrency inherent in their problem. We submit that domain-expert programmers understand "what is concurrent". The parallel programming problem emerges from the complexity of "how that concurrency is utilized" on real hardware. The research described in this talk was carried out in collaboration with the ParLab at UC Berkeley. We use a design pattern language to define the high level frameworks exposed to domain-expert, productivity programmers. We then use tools from the SEJITS project (Selective embedded Just In time Specializers) to build the software transformation tool chains thst turn these framework-oriented designs into highly efficient code. The final ingredient is a software platform to serve as a target for these tools. One such platform is the OpenCL industry standard for programming heterogeneous systems. We will briefly describe OpenCL and show how it provides a vendor-neutral software target for current and future many core systems; both CPU-based, GPU-based, and heterogeneous combinations of the two.
      

      
      (U) Status of Trinity and Crossroads Systems
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Archer, Billy Joe; Lujan, James Westley; Hemmert, K. S.
         2017-01-10
         (U) This paper provides a general overview of current and future plans for the Advanced Simulation and Computing (ASC) Advanced Technology (AT) systems fielded by the New Mexico Alliance for Computing at Extreme Scale (ACES), a collaboration between Los Alamos Laboratory and Sandia National Laboratories. Additionally, this paper touches on research of technology beyond traditional CMOS. The status of Trinity, ASCs first AT system, and Crossroads, anticipated to succeed Trinity as the third AT system in 2020 will be presented, along with initial performance studies of the Intel Knights Landing Xeon Phi processors, introduced on Trinity. The challenges and opportunitiesmore » for our production simulation codes on AT systems will also be discussed. Trinity and Crossroads are a joint procurement by ACES and Lawrence Berkeley Laboratory as part of the Alliance for application Performance at EXtreme scale (APEX) http://apex.lanl.gov.« less
      

      
      A new parallel algorithm of MP2 energy calculations.
      PubMed
      Ishimura, Kazuya; Pulay, Peter; Nagase, Shigeru
         2006-03-01
         A new parallel algorithm has been developed for second-order Møller-Plesset perturbation theory (MP2) energy calculations. Its main projected applications are for large molecules, for instance, for the calculation of dispersion interaction. Tests on a moderate number of processors (2-16) show that the program has high CPU and parallel efficiency. Timings are presented for two relatively large molecules, taxol (C(47)H(51)NO(14)) and luciferin (C(11)H(8)N(2)O(3)S(2)), the former with the 6-31G* and 6-311G** basis sets (1,032 and 1,484 basis functions, 164 correlated orbitals), and the latter with the aug-cc-pVDZ and aug-cc-pVTZ basis sets (530 and 1,198 basis functions, 46 correlated orbitals). An MP2 energy calculation on C(130)H(10) (1,970 basis functions, 265 correlated orbitals) completed in less than 2 h on 128 processors.
      

      
      Real-time digital holographic microscopy using the graphic processing unit.
      PubMed
      Shimobaba, Tomoyoshi; Sato, Yoshikuni; Miura, Junya; Takenouchi, Mai; Ito, Tomoyoshi
         2008-08-04
         Digital holographic microscopy (DHM) is a well-known powerful method allowing both the amplitude and phase of a specimen to be simultaneously observed. In order to obtain a reconstructed image from a hologram, numerous calculations for the Fresnel diffraction are required. The Fresnel diffraction can be accelerated by the FFT (Fast Fourier Transform) algorithm. However, real-time reconstruction from a hologram is difficult even if we use a recent central processing unit (CPU) to calculate the Fresnel diffraction by the FFT algorithm. In this paper, we describe a real-time DHM system using a graphic processing unit (GPU) with many stream processors, which allows use as a highly parallel processor. The computational speed of the Fresnel diffraction using the GPU is faster than that of recent CPUs. The real-time DHM system can obtain reconstructed images from holograms whose size is 512 x 512 grids in 24 frames per second.
      

      
      Designing a Virtual-Memory Implementation Using the Motorola MC68010 16- Bit Microprocessor with Multi-Processor Capability Interfaced to the VMEbus
      DTIC Science & Technology
      
         1990-06-01
         RAM and ROM output enable signals. Figure C.7 shows the logic for the interrupt priority level (IPLO* through IPL2 *) and the interrupt acknowledge...IACK681* signal is sent to the DUART when a level one interrupt acknowledge is output by the CPU. The logic for the IACK681* and the IPLO* through IPL2 ...signals are actually implemented with an EPLD. Listing D.4 in Appendix D presents the Abel description of the IACK681* and IPLO* through IPL2
      

      
      Simultaneous Range-Velocity Processing and SNR Analysis of AFIT’s Random Noise Radar
      DTIC Science & Technology
      
         2012-03-22
         reducing the overall processing time. Two computers, equipped with NVIDIA ® GPUs, were used to process the col- 45 lected data. The specifications for each...gather the results back to the CPU. Another company , AccelerEyes®, has developed a product called Jacket® that claims to be better than the parallel...Number of Processing Cores 4 8 Processor Speed 3.33 GHz 3.07 GHz Installed Memory 48 GB 48 GB GPU Make NVIDIA NVIDIA GPU Model Tesla 1060 Tesla C2070 GPU
      

      
      Fault Tolerant Microcontroller for the Configurable Fault Tolerant Processor
      DTIC Science & Technology
      
         2008-09-01
         many others come to mind I also wish to thank Jan Grey for providing an excellent System-on-a-Chip that formed a core component of this thesis...developed by Jan Gray as documented in his "Building a RISC CPU and System-on-a-Chip in an FPGA" series of articles that was published in Circuit Cellar...those detailed by Jan Gray in his "Getting Started with the XSOC Project v0.93" [16]. The XSOC distribution is available at <http://www.fpgacpu.org
      

      
      Symplectic multi-particle tracking on GPUs
      NASA Astrophysics Data System (ADS)
      Liu, Zhicong; Qiang, Ji
         2018-05-01
         A symplectic multi-particle tracking model is implemented on the Graphic Processing Units (GPUs) using the Compute Unified Device Architecture (CUDA) language. The symplectic tracking model can preserve phase space structure and reduce non-physical effects in long term simulation, which is important for beam property evaluation in particle accelerators. Though this model is computationally expensive, it is very suitable for parallelization and can be accelerated significantly by using GPUs. In this paper, we optimized the implementation of the symplectic tracking model on both single GPU and multiple GPUs. Using a single GPU processor, the code achieves a factor of 2-10 speedup for a range of problem sizes compared with the time on a single state-of-the-art Central Processing Unit (CPU) node with similar power consumption and semiconductor technology. It also shows good scalability on a multi-GPU cluster at Oak Ridge Leadership Computing Facility. In an application to beam dynamics simulation, the GPU implementation helps save more than a factor of two total computing time in comparison to the CPU implementation.
      

      
      Transportable GPU (General Processor Units) chip set technology for standard computer architectures
      NASA Astrophysics Data System (ADS)
      Fosdick, R. E.; Denison, H. C.
         1982-11-01
         The USAFR-developed GPU Chip Set has been utilized by Tracor to implement both USAF and Navy Standard 16-Bit Airborne Computer Architectures. Both configurations are currently being delivered into DOD full-scale development programs. Leadless Hermetic Chip Carrier packaging has facilitated implementation of both architectures on single 41/2 x 5 substrates. The CMOS and CMOS/SOS implementations of the GPU Chip Set have allowed both CPU implementations to use less than 3 watts of power each. Recent efforts by Tracor for USAF have included the definition of a next-generation GPU Chip Set that will retain the application-proven architecture of the current chip set while offering the added cost advantages of transportability across ISO-CMOS and CMOS/SOS processes and across numerous semiconductor manufacturers using a newly-defined set of common design rules. The Enhanced GPU Chip Set will increase speed by an approximate factor of 3 while significantly reducing chip counts and costs of standard CPU implementations.
      

      
      WinHPC System Configuration | High-Performance Computing | NREL
      Science.gov Websites
      
         
         CPUs with 48GB of memory. Node 04 has dual Intel Xeon E5530 CPUs with 24GB of memory. Nodes 05-20 have dual AMD Opteron 2374 HE CPUs with 16GB of memory. Nodes 21-30 have been decommissioned. Nodes 31-35 have dual Intel Xeon X5675 CPUs with 48GB of memory. Nodes 36-37 have dual Intel Xeon E5-2680 CPUs with
      

      
      Ultrafast treatment plan optimization for volumetric modulated arc therapy (VMAT)
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Men Chunhua; Romeijn, H. Edwin; Jia Xun
         2010-11-15
         Purpose: To develop a novel aperture-based algorithm for volumetric modulated arc therapy (VMAT) treatment plan optimization with high quality and high efficiency. Methods: The VMAT optimization problem is formulated as a large-scale convex programming problem solved by a column generation approach. The authors consider a cost function consisting two terms, the first enforcing a desired dose distribution and the second guaranteeing a smooth dose rate variation between successive gantry angles. A gantry rotation is discretized into 180 beam angles and for each beam angle, only one MLC aperture is allowed. The apertures are generated one by one in a sequentialmore » way. At each iteration of the column generation method, a deliverable MLC aperture is generated for one of the unoccupied beam angles by solving a subproblem with the consideration of MLC mechanic constraints. A subsequent master problem is then solved to determine the dose rate at all currently generated apertures by minimizing the cost function. When all 180 beam angles are occupied, the optimization completes, yielding a set of deliverable apertures and associated dose rates that produce a high quality plan. Results: The algorithm was preliminarily tested on five prostate and five head-and-neck clinical cases, each with one full gantry rotation without any couch/collimator rotations. High quality VMAT plans have been generated for all ten cases with extremely high efficiency. It takes only 5-8 min on CPU (MATLAB code on an Intel Xeon 2.27 GHz CPU) and 18-31 s on GPU (CUDA code on an NVIDIA Tesla C1060 GPU card) to generate such plans. Conclusions: The authors have developed an aperture-based VMAT optimization algorithm which can generate clinically deliverable high quality treatment plans at very high efficiency.« less
      

      
      Ultrafast treatment plan optimization for volumetric modulated arc therapy (VMAT).
      PubMed
      Men, Chunhua; Romeijn, H Edwin; Jia, Xun; Jiang, Steve B
         2010-11-01
         To develop a novel aperture-based algorithm for volumetric modulated are therapy (VMAT) treatment plan optimization with high quality and high efficiency. The VMAT optimization problem is formulated as a large-scale convex programming problem solved by a column generation approach. The authors consider a cost function consisting two terms, the first enforcing a desired dose distribution and the second guaranteeing a smooth dose rate variation between successive gantry angles. A gantry rotation is discretized into 180 beam angles and for each beam angle, only one MLC aperture is allowed. The apertures are generated one by one in a sequential way. At each iteration of the column generation method, a deliverable MLC aperture is generated for one of the unoccupied beam angles by solving a subproblem with the consideration of MLC mechanic constraints. A subsequent master problem is then solved to determine the dose rate at all currently generated apertures by minimizing the cost function. When all 180 beam angles are occupied, the optimization completes, yielding a set of deliverable apertures and associated dose rates that produce a high quality plan. The algorithm was preliminarily tested on five prostate and five head-and-neck clinical cases, each with one full gantry rotation without any couch/collimator rotations. High quality VMAT plans have been generated for all ten cases with extremely high efficiency. It takes only 5-8 min on CPU (MATLAB code on an Intel Xeon 2.27 GHz CPU) and 18-31 s on GPU (CUDA code on an NVIDIA Tesla C1060 GPU card) to generate such plans. The authors have developed an aperture-based VMAT optimization algorithm which can generate clinically deliverable high quality treatment plans at very high efficiency.
      

      
      Comparing performance of many-core CPUs and GPUs for static and motion compensated reconstruction of C-arm CT data.
      PubMed
      Hofmann, Hannes G; Keck, Benjamin; Rohkohl, Christopher; Hornegger, Joachim
         2011-01-01
         Interventional reconstruction of 3-D volumetric data from C-arm CT projections is a computationally demanding task. Hardware optimization is not an option but mandatory for interventional image processing and, in particular, for image reconstruction due to the high demands on performance. Several groups have published fast analytical 3-D reconstruction on highly parallel hardware such as GPUs to mitigate this issue. The authors show that the performance of modern CPU-based systems is in the same order as current GPUs for static 3-D reconstruction and outperforms them for a recent motion compensated (3-D+time) image reconstruction algorithm. This work investigates two algorithms: Static 3-D reconstruction as well as a recent motion compensated algorithm. The evaluation was performed using a standardized reconstruction benchmark, RABBITCT, to get comparable results and two additional clinical data sets. The authors demonstrate for a parametric B-spline motion estimation scheme that the derivative computation, which requires many write operations to memory, performs poorly on the GPU and can highly benefit from modern CPU architectures with large caches. Moreover, on a 32-core Intel Xeon server system, the authors achieve linear scaling with the number of cores used and reconstruction times almost in the same range as current GPUs. Algorithmic innovations in the field of motion compensated image reconstruction may lead to a shift back to CPUs in the future. For analytical 3-D reconstruction, the authors show that the gap between GPUs and CPUs became smaller. It can be performed in less than 20 s (on-the-fly) using a 32-core server.
      

      
      Architecture of security management unit for safe hosting of multiple agents
      NASA Astrophysics Data System (ADS)
      Gilmont, Tanguy; Legat, Jean-Didier; Quisquater, Jean-Jacques
         1999-04-01
         In such growing areas as remote applications in large public networks, electronic commerce, digital signature, intellectual property and copyright protection, and even operating system extensibility, the hardware security level offered by existing processors is insufficient. They lack protection mechanisms that prevent the user from tampering critical data owned by those applications. Some devices make exception, but have not enough processing power nor enough memory to stand up to such applications (e.g. smart cards). This paper proposes an architecture of secure processor, in which the classical memory management unit is extended into a new security management unit. It allows ciphered code execution and ciphered data processing. An internal permanent memory can store cipher keys and critical data for several client agents simultaneously. The ordinary supervisor privilege scheme is replaced by a privilege inheritance mechanism that is more suited to operating system extensibility. The result is a secure processor that has hardware support for extensible multitask operating systems, and can be used for both general applications and critical applications needing strong protection. The security management unit and the internal permanent memory can be added to an existing CPU core without loss of performance, and do not require it to be modified.
      

        
       
          

«

12
      13
      14
   15
      16
      »

          
        

     

   

   
       
            
              
          

«

13
      14
      15
   16
      17
      »

          
        

           
           
             
               
      
      A Framework to Analyze the Performance of Load Balancing Schemes for Ensembles of Stochastic Simulations
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Ahn, Tae-Hyuk; Sandu, Adrian; Watson, Layne T.
         2015-08-01
         Ensembles of simulations are employed to estimate the statistics of possible future states of a system, and are widely used in important applications such as climate change and biological modeling. Ensembles of runs can naturally be executed in parallel. However, when the CPU times of individual simulations vary considerably, a simple strategy of assigning an equal number of tasks per processor can lead to serious work imbalances and low parallel efficiency. This paper presents a new probabilistic framework to analyze the performance of dynamic load balancing algorithms for ensembles of simulations where many tasks are mapped onto each processor, andmore » where the individual compute times vary considerably among tasks. Four load balancing strategies are discussed: most-dividing, all-redistribution, random-polling, and neighbor-redistribution. Simulation results with a stochastic budding yeast cell cycle model are consistent with the theoretical analysis. It is especially significant that there is a provable global decrease in load imbalance for the local rebalancing algorithms due to scalability concerns for the global rebalancing algorithms. The overall simulation time is reduced by up to 25 %, and the total processor idle time by 85 %.« less
      

      
      Equalizer: a scalable parallel rendering framework.
      PubMed
      Eilemann, Stefan; Makhinya, Maxim; Pajarola, Renato
         2009-01-01
         Continuing improvements in CPU and GPU performances as well as increasing multi-core processor and cluster-based parallelism demand for flexible and scalable parallel rendering solutions that can exploit multipipe hardware accelerated graphics. In fact, to achieve interactive visualization, scalable rendering systems are essential to cope with the rapid growth of data sets. However, parallel rendering systems are non-trivial to develop and often only application specific implementations have been proposed. The task of developing a scalable parallel rendering framework is even more difficult if it should be generic to support various types of data and visualization applications, and at the same time work efficiently on a cluster with distributed graphics cards. In this paper we introduce a novel system called Equalizer, a toolkit for scalable parallel rendering based on OpenGL which provides an application programming interface (API) to develop scalable graphics applications for a wide range of systems ranging from large distributed visualization clusters and multi-processor multipipe graphics systems to single-processor single-pipe desktop machines. We describe the system architecture, the basic API, discuss its advantages over previous approaches, present example configurations and usage scenarios as well as scalability results.
      

      
      SAFARI digital processing unit: performance analysis of the SpaceWire links in case of a LEON3-FT based CPU
      NASA Astrophysics Data System (ADS)
      Giusi, Giovanni; Liu, Scige J.; Di Giorgio, Anna M.; Galli, Emanuele; Pezzuto, Stefano; Farina, Maria; Spinoglio, Luigi
         2014-08-01
         SAFARI (SpicA FAR infrared Instrument) is a far-infrared imaging Fourier Transform Spectrometer for the SPICA mission. The Digital Processing Unit (DPU) of the instrument implements the functions of controlling the overall instrument and implementing the science data compression and packing. The DPU design is based on the use of a LEON family processor. In SAFARI, all instrument components are connected to the central DPU via SpaceWire links. On these links science data, housekeeping and commands flows are in some cases multiplexed, therefore the interface control shall be able to cope with variable throughput needs. The effective data transfer workload can be an issue for the overall system performances and becomes a critical parameter for the on-board software design, both at application layer level and at lower, and more HW related, levels. To analyze the system behavior in presence of the expected SAFARI demanding science data flow, we carried out a series of performance tests using the standard GR-CPCI-UT699 LEON3-FT Development Board, provided by Aeroflex/Gaisler, connected to the emulator of the SAFARI science data links, in a point-to-point topology. Two different communication protocols have been used in the tests, the ECSS-E-ST-50-52C RMAP protocol and an internally defined one, the SAFARI internal data handling protocol. An incremental approach has been adopted to measure the system performances at different levels of the communication protocol complexity. In all cases the performance has been evaluated by measuring the CPU workload and the bus latencies. The tests have been executed initially in a custom low level execution environment and finally using the Real- Time Executive for Multiprocessor Systems (RTEMS), which has been selected as the operating system to be used onboard SAFARI. The preliminary results of the carried out performance analysis confirmed the possibility of using a LEON3 CPU processor in the SAFARI DPU, but pointed out, in agreement with previous similar studies, the need of carefully designing the overall architecture to implement some of the DPU functionalities on additional processing devices.
      

      
      SpaceCubeX: A Framework for Evaluating Hybrid Multi-Core CPU FPGA DSP Architectures
      NASA Technical Reports Server (NTRS)
      Schmidt, Andrew G.; Weisz, Gabriel; French, Matthew; Flatley, Thomas; Villalpando, Carlos Y.
         2017-01-01
         The SpaceCubeX project is motivated by the need for high performance, modular, and scalable on-board processing to help scientists answer critical 21st century questions about global climate change, air quality, ocean health, and ecosystem dynamics, while adding new capabilities such as low-latency data products for extreme event warnings. These goals translate into on-board processing throughput requirements that are on the order of 100-1,000 more than those of previous Earth Science missions for standard processing, compression, storage, and downlink operations. To study possible future architectures to achieve these performance requirements, the SpaceCubeX project provides an evolvable testbed and framework that enables a focused design space exploration of candidate hybrid CPU/FPGA/DSP processing architectures. The framework includes ArchGen, an architecture generator tool populated with candidate architecture components, performance models, and IP cores, that allows an end user to specify the type, number, and connectivity of a hybrid architecture. The framework requires minimal extensions to integrate new processors, such as the anticipated High Performance Spaceflight Computer (HPSC), reducing time to initiate benchmarking by months. To evaluate the framework, we leverage a wide suite of high performance embedded computing benchmarks and Earth science scenarios to ensure robust architecture characterization. We report on our projects Year 1 efforts and demonstrate the capabilities across four simulation testbed models, a baseline SpaceCube 2.0 system, a dual ARM A9 processor system, a hybrid quad ARM A53 and FPGA system, and a hybrid quad ARM A53 and DSP system.
      

      
      Radiation hardened microprocessor for small payloads
      NASA Technical Reports Server (NTRS)
      Shah, Ravi
         1993-01-01
         The RH-3000 program is developing a rad-hard space qualified 32-bit MIPS R-3000 RISC processor under the Naval Research Lab sponsorship. In addition, under IR&D Harris is developing RHC-3000 for embedded control applications where low cost and radiation tolerance are primary concerns. The development program leverages heavily from commercial development of the MIPS R-3000. The commercial R-3000 has a large installed user base and several foundry partners are currently producing a wide variety of R-3000 derivative products. One of the MIPS derivative products, the LR33000 from LSI Logic, was used as the basis for the design of the RH-3000 chipset. The RH-3000 chipset consists of three core chips and two support chips. The core chips include the CPU, which is the R-3000 integer unit and the FPA/MD chip pair, which performs the R-3010 floating point functions. The two support whips contain all the support functions required for fault tolerance support, real-time support, memory management, timers, and other functions. The Harris development effort had first passed silicon success in June, 1992 with the first rad-hard 32-bit RH-3000 CPU chip. The CPU device is 30 kgates, has a 508 mil by 503 mil die size and is fabricated at Harris Semiconductor on the rad-hard CMOS Silicon on Sapphire (SOS) process. The CPU device successfully passed tesing against 600,000 test vectors derived directly on the LSI/MIPS test suite and has been operational as a single board computer running C code for the past year. In addition, the RH-3000 program has developed the methodology for converting commercially developed designs utilizing logic synthesis techniques based on a combination of VHDK and schematic data bases.
      

      
      Efficient Helicopter Aerodynamic and Aeroacoustic Predictions on Parallel Computers
      NASA Technical Reports Server (NTRS)
      Wissink, Andrew M.; Lyrintzis, Anastasios S.; Strawn, Roger C.; Oliker, Leonid; Biswas, Rupak
         1996-01-01
         This paper presents parallel implementations of two codes used in a combined CFD/Kirchhoff methodology to predict the aerodynamics and aeroacoustics properties of helicopters. The rotorcraft Navier-Stokes code, TURNS, computes the aerodynamic flowfield near the helicopter blades and the Kirchhoff acoustics code computes the noise in the far field, using the TURNS solution as input. The overall parallel strategy adds MPI message passing calls to the existing serial codes to allow for communication between processors. As a result, the total code modifications required for parallel execution are relatively small. The biggest bottleneck in running the TURNS code in parallel comes from the LU-SGS algorithm that solves the implicit system of equations. We use a new hybrid domain decomposition implementation of LU-SGS to obtain good parallel performance on the SP-2. TURNS demonstrates excellent parallel speedups for quasi-steady and unsteady three-dimensional calculations of a helicopter blade in forward flight. The execution rate attained by the code on 114 processors is six times faster than the same cases run on one processor of the Cray C-90. The parallel Kirchhoff code also shows excellent parallel speedups and fast execution rates. As a performance demonstration, unsteady acoustic pressures are computed at 1886 far-field observer locations for a sample acoustics problem. The calculation requires over two hundred hours of CPU time on one C-90 processor but takes only a few hours on 80 processors of the SP2. The resultant far-field acoustic field is analyzed with state of-the-art audio and video rendering of the propagating acoustic signals.
      

      
      Evaluating the transport layer of the ALFA framework for the Intel® Xeon Phi™ Coprocessor
      NASA Astrophysics Data System (ADS)
      Santogidis, Aram; Hirstius, Andreas; Lalis, Spyros
         2015-12-01
         The ALFA framework supports the software development of major High Energy Physics experiments. As part of our research effort to optimize the transport layer of ALFA, we focus on profiling its data transfer performance for inter-node communication on the Intel Xeon Phi Coprocessor. In this article we present the collected performance measurements with the related analysis of the results. The optimization opportunities that are discovered, help us to formulate the future plans of enabling high performance data transfer for ALFA on the Intel Xeon Phi architecture.
      

      
      Short Message Service (SMS) Command and Control (C2) Awareness in Android-based Smartphones Using Kernel-Level Auditing
      DTIC Science & Technology
      
         2012-06-14
         Display 480 x 800 pixels (3.7 inches) CPU Qualcomm QSD8250 1GHz Memory (internal) 512MB RAM / 512 MB ROM Kernel version 2.6.35.7-ge0fb012 Figure 3.5: HTC...development and writing). The 34 MSM kernel provided by the AOSP and compatible with the HTC Nexus One’s motherboard and Qualcomm chipset, is used for this...building the kernel is having the prebuilt toolchains and the right kernel for the hardware. Many HTC products use Qualcomm processors which uses the
      

      
      The Fermilab lattice supercomputer project
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Fischler, M.; Atac, R.; Cook, A.
         1989-02-01
         The ACPMAPS system is a highly cost effective, local memory MIMD computer targeted at algorithm development and production running for gauge theory on the lattice. The machine consists of a compound hypercube of crates, each of which is a full crossbar switch containing several processors. The processing nodes are single board array processors based on the Weitek XL chip set, each with a peak power of 20 MFLOPS and supported by 8 MBytes of data memory. The system currently being assembled has a peak power of 5 GFLOPS, delivering performance at approximately $250/MFLOP. The system is programmable in C andmore » Fortran. An underpinning of software routines (CANOPY) provides an easy and natural way of coding lattice problems, such that the details of parallelism, and communication and system architecture are transparent to the user. CANOPY can easily be ported to any single CPU or MIMD system which supports C, and allows the coding of typical applications with very little effort. 3 refs., 1 fig.« less
      

      
      Intricacies of modern supercomputing illustrated with recent advances in simulations of strongly correlated electron systems
      NASA Astrophysics Data System (ADS)
      Schulthess, Thomas C.
         2013-03-01
         The continued thousand-fold improvement in sustained application performance per decade on modern supercomputers keeps opening new opportunities for scientific simulations. But supercomputers have become very complex machines, built with thousands or tens of thousands of complex nodes consisting of multiple CPU cores or, most recently, a combination of CPU and GPU processors. Efficient simulations on such high-end computing systems require tailored algorithms that optimally map numerical methods to particular architectures. These intricacies will be illustrated with simulations of strongly correlated electron systems, where the development of quantum cluster methods, Monte Carlo techniques, as well as their optimal implementation by means of algorithms with improved data locality and high arithmetic density have gone hand in hand with evolving computer architectures. The present work would not have been possible without continued access to computing resources at the National Center for Computational Science of Oak Ridge National Laboratory, which is funded by the Facilities Division of the Office of Advanced Scientific Computing Research, and the Swiss National Supercomputing Center (CSCS) that is funded by ETH Zurich.
      

      
      Parallel Computer System for 3D Visualization Stereo on GPU
      NASA Astrophysics Data System (ADS)
      Al-Oraiqat, Anas M.; Zori, Sergii A.
         2018-03-01
         This paper proposes the organization of a parallel computer system based on Graphic Processors Unit (GPU) for 3D stereo image synthesis. The development is based on the modified ray tracing method developed by the authors for fast search of tracing rays intersections with scene objects. The system allows significant increase in the productivity for the 3D stereo synthesis of photorealistic quality. The generalized procedure of 3D stereo image synthesis on the Graphics Processing Unit/Graphics Processing Clusters (GPU/GPC) is proposed. The efficiency of the proposed solutions by GPU implementation is compared with single-threaded and multithreaded implementations on the CPU. The achieved average acceleration in multi-thread implementation on the test GPU and CPU is about 7.5 and 1.6 times, respectively. Studying the influence of choosing the size and configuration of the computational Compute Unified Device Archi-tecture (CUDA) network on the computational speed shows the importance of their correct selection. The obtained experimental estimations can be significantly improved by new GPUs with a large number of processing cores and multiprocessors, as well as optimized configuration of the computing CUDA network.
      

      
      Performance and scalability of Fourier domain optical coherence tomography acceleration using graphics processing units.
      PubMed
      Li, Jian; Bloch, Pavel; Xu, Jing; Sarunic, Marinko V; Shannon, Lesley
         2011-05-01
         Fourier domain optical coherence tomography (FD-OCT) provides faster line rates, better resolution, and higher sensitivity for noninvasive, in vivo biomedical imaging compared to traditional time domain OCT (TD-OCT). However, because the signal processing for FD-OCT is computationally intensive, real-time FD-OCT applications demand powerful computing platforms to deliver acceptable performance. Graphics processing units (GPUs) have been used as coprocessors to accelerate FD-OCT by leveraging their relatively simple programming model to exploit thread-level parallelism. Unfortunately, GPUs do not "share" memory with their host processors, requiring additional data transfers between the GPU and CPU. In this paper, we implement a complete FD-OCT accelerator on a consumer grade GPU/CPU platform. Our data acquisition system uses spectrometer-based detection and a dual-arm interferometer topology with numerical dispersion compensation for retinal imaging. We demonstrate that the maximum line rate is dictated by the memory transfer time and not the processing time due to the GPU platform's memory model. Finally, we discuss how the performance trends of GPU-based accelerators compare to the expected future requirements of FD-OCT data rates.
      

      
      Validation of columnar CsI x-ray detector responses obtained with hybridMANTIS, a CPU-GPU Monte Carlo code for coupled x-ray, electron, and optical transport.
      PubMed
      Sharma, Diksha; Badano, Aldo
         2013-03-01
         hybridMANTIS is a Monte Carlo package for modeling indirect x-ray imagers using columnar geometry based on a hybrid concept that maximizes the utilization of available CPU and graphics processing unit processors in a workstation. The authors compare hybridMANTIS x-ray response simulations to previously published MANTIS and experimental data for four cesium iodide scintillator screens. These screens have a variety of reflective and absorptive surfaces with different thicknesses. The authors analyze hybridMANTIS results in terms of modulation transfer function and calculate the root mean square difference and Swank factors from simulated and experimental results. The comparison suggests that hybridMANTIS better matches the experimental data as compared to MANTIS, especially at high spatial frequencies and for the thicker screens. hybridMANTIS simulations are much faster than MANTIS with speed-ups up to 5260. hybridMANTIS is a useful tool for improved description and optimization of image acquisition stages in medical imaging systems and for modeling the forward problem in iterative reconstruction algorithms.
      

      
      LHCb Kalman Filter cross architecture studies
      NASA Astrophysics Data System (ADS)
      Cámpora Pérez, Daniel Hugo
         2017-10-01
         The 2020 upgrade of the LHCb detector will vastly increase the rate of collisions the Online system needs to process in software, in order to filter events in real time. 30 million collisions per second will pass through a selection chain, where each step is executed conditional to its prior acceptance. The Kalman Filter is a fit applied to all reconstructed tracks which, due to its time characteristics and early execution in the selection chain, consumes 40% of the whole reconstruction time in the current trigger software. This makes the Kalman Filter a time-critical component as the LHCb trigger evolves into a full software trigger in the Upgrade. I present a new Kalman Filter algorithm for LHCb that can efficiently make use of any kind of SIMD processor, and its design is explained in depth. Performance benchmarks are compared between a variety of hardware architectures, including x86_64 and Power8, and the Intel Xeon Phi accelerator, and the suitability of said architectures to efficiently perform the LHCb Reconstruction process is determined.
      

      
      Traditional Tracking with Kalman Filter on Parallel Architectures
      NASA Astrophysics Data System (ADS)
      Cerati, Giuseppe; Elmer, Peter; Lantz, Steven; MacNeill, Ian; McDermott, Kevin; Riley, Dan; Tadel, Matevž; Wittich, Peter; Würthwein, Frank; Yagil, Avi
         2015-05-01
         Power density constraints are limiting the performance improvements of modern CPUs. To address this, we have seen the introduction of lower-power, multi-core processors, but the future will be even more exciting. In order to stay within the power density limits but still obtain Moore's Law performance/price gains, it will be necessary to parallelize algorithms to exploit larger numbers of lightweight cores and specialized functions like large vector units. Example technologies today include Intel's Xeon Phi and GPGPUs. Track finding and fitting is one of the most computationally challenging problems for event reconstruction in particle physics. At the High Luminosity LHC, for example, this will be by far the dominant problem. The most common track finding techniques in use today are however those based on the Kalman Filter. Significant experience has been accumulated with these techniques on real tracking detector systems, both in the trigger and offline. We report the results of our investigations into the potential and limitations of these algorithms on the new parallel hardware.
      

      
      Extreme Scale Plasma Turbulence Simulations on Top Supercomputers Worldwide
      DOE PAGES
      Tang, William; Wang, Bei; Ethier, Stephane; ...
         2016-11-01
         The goal of the extreme scale plasma turbulence studies described in this paper is to expedite the delivery of reliable predictions on confinement physics in large magnetic fusion systems by using world-class supercomputers to carry out simulations with unprecedented resolution and temporal duration. This has involved architecture-dependent optimizations of performance scaling and addressing code portability and energy issues, with the metrics for multi-platform comparisons being 'time-to-solution' and 'energy-to-solution'. Realistic results addressing how confinement losses caused by plasma turbulence scale from present-day devices to the much larger $25 billion international ITER fusion facility have been enabled by innovative advances in themore » GTC-P code including (i) implementation of one-sided communication from MPI 3.0 standard; (ii) creative optimization techniques on Xeon Phi processors; and (iii) development of a novel performance model for the key kernels of the PIC code. Our results show that modeling data movement is sufficient to predict performance on modern supercomputer platforms.« less
      

      
      Spectral Element Method for the Simulation of Unsteady Compressible Flows
      NASA Technical Reports Server (NTRS)
      Diosady, Laslo Tibor; Murman, Scott M.
         2013-01-01
         This work uses a discontinuous-Galerkin spectral-element method (DGSEM) to solve the compressible Navier-Stokes equations [1{3]. The inviscid ux is computed using the approximate Riemann solver of Roe [4]. The viscous fluxes are computed using the second form of Bassi and Rebay (BR2) [5] in a manner consistent with the spectral-element approximation. The method of lines with the classical 4th-order explicit Runge-Kutta scheme is used for time integration. Results for polynomial orders up to p = 15 (16th order) are presented. The code is parallelized using the Message Passing Interface (MPI). The computations presented in this work are performed using the Sandy Bridge nodes of the NASA Pleiades supercomputer at NASA Ames Research Center. Each Sandy Bridge node consists of 2 eight-core Intel Xeon E5-2670 processors with a clock speed of 2.6Ghz and 2GB per core memory. On a Sandy Bridge node the Tau Benchmark [6] runs in a time of 7.6s.
      

      
      Methods for compressible fluid simulation on GPUs using high-order finite differences
      NASA Astrophysics Data System (ADS)
      Pekkilä, Johannes; Väisälä, Miikka S.; Käpylä, Maarit J.; Käpylä, Petri J.; Anjum, Omer
         2017-08-01
         We focus on implementing and optimizing a sixth-order finite-difference solver for simulating compressible fluids on a GPU using third-order Runge-Kutta integration. Since graphics processing units perform well in data-parallel tasks, this makes them an attractive platform for fluid simulation. However, high-order stencil computation is memory-intensive with respect to both main memory and the caches of the GPU. We present two approaches for simulating compressible fluids using 55-point and 19-point stencils. We seek to reduce the requirements for memory bandwidth and cache size in our methods by using cache blocking and decomposing a latency-bound kernel into several bandwidth-bound kernels. Our fastest implementation is bandwidth-bound and integrates 343 million grid points per second on a Tesla K40t GPU, achieving a 3 . 6 × speedup over a comparable hydrodynamics solver benchmarked on two Intel Xeon E5-2690v3 processors. Our alternative GPU implementation is latency-bound and achieves the rate of 168 million updates per second.
      

      
      Performance verification and system integration tests of the pulse shape processor for the soft x-ray spectrometer onboard ASTRO-H
      NASA Astrophysics Data System (ADS)
      Takeda, Sawako; Tashiro, Makoto S.; Ishisaki, Yoshitaka; Tsujimoto, Masahiro; Seta, Hiromi; Shimoda, Yuya; Yamaguchi, Sunao; Uehara, Sho; Terada, Yukikatsu; Fujimoto, Ryuichi; Mitsuda, Kazuhisa
         2014-07-01
         The soft X-ray spectrometer (SXS) aboard ASTRO-H is equipped with dedicated digital signal processing units called pulse shape processors (PSPs). The X-ray microcalorimeter system SXS has 36 sensor pixels, which are operated at 50 mK to measure heat input of X-ray photons and realize an energy resolution of 7 eV FWHM in the range 0.3-12.0 keV. Front-end signal processing electronics are used to filter and amplify the electrical pulse output from the sensor and for analog-to-digital conversion. The digitized pulses from the 36 pixels are multiplexed and are sent to the PSP over low-voltage differential signaling lines. Each of two identical PSP units consists of an FPGA board, which assists the hardware logic, and two CPU boards, which assist the onboard software. The FPGA board triggers at every pixel event and stores the triggering information as a pulse waveform in the installed memory. The CPU boards read the event data to evaluate pulse heights by an optimal filtering algorithm. The evaluated X-ray photon data (including the pixel ID, energy, and arrival time information) are transferred to the satellite data recorder along with event quality information. The PSP units have been developed and tested with the engineering model (EM) and the flight model. Utilizing the EM PSP, we successfully verified the entire hardware system and the basic software design of the PSPs, including their communication capability and signal processing performance. In this paper, we show the key metrics of the EM test, such as accuracy and synchronicity of sampling clocks, event grading capability, and resultant energy resolution.
      

      
      Comparison of Acceleration Techniques for Selected Low-Level Bioinformatics Operations
      PubMed Central
      Langenkämper, Daniel; Jakobi, Tobias; Feld, Dustin; Jelonek, Lukas; Goesmann, Alexander; Nattkemper, Tim W.
         2016-01-01
         Within the recent years clock rates of modern processors stagnated while the demand for computing power continued to grow. This applied particularly for the fields of life sciences and bioinformatics, where new technologies keep on creating rapidly growing piles of raw data with increasing speed. The number of cores per processor increased in an attempt to compensate for slight increments of clock rates. This technological shift demands changes in software development, especially in the field of high performance computing where parallelization techniques are gaining in importance due to the pressing issue of large sized datasets generated by e.g., modern genomics. This paper presents an overview of state-of-the-art manual and automatic acceleration techniques and lists some applications employing these in different areas of sequence informatics. Furthermore, we provide examples for automatic acceleration of two use cases to show typical problems and gains of transforming a serial application to a parallel one. The paper should aid the reader in deciding for a certain techniques for the problem at hand. We compare four different state-of-the-art automatic acceleration approaches (OpenMP, PluTo-SICA, PPCG, and OpenACC). Their performance as well as their applicability for selected use cases is discussed. While optimizations targeting the CPU worked better in the complex k-mer use case, optimizers for Graphics Processing Units (GPUs) performed better in the matrix multiplication example. But performance is only superior at a certain problem size due to data migration overhead. We show that automatic code parallelization is feasible with current compiler software and yields significant increases in execution speed. Automatic optimizers for CPU are mature and usually no additional manual adjustment is required. In contrast, some automatic parallelizers targeting GPUs still lack maturity and are limited to simple statements and structures. PMID:26904094
      

        
       
          

«

13
      14
      15
   16
      17
      »

          
        

     

   

   
       
            
              
          

«

14
      15
      16
   17
      18
      »

          
        

           
           
             
               
      
      Accelerating Wright–Fisher Forward Simulations on the Graphics Processing Unit
      PubMed Central
      Lawrie, David S.
         2017-01-01
         Forward Wright–Fisher simulations are powerful in their ability to model complex demography and selection scenarios, but suffer from slow execution on the Central Processor Unit (CPU), thus limiting their usefulness. However, the single-locus Wright–Fisher forward algorithm is exceedingly parallelizable, with many steps that are so-called “embarrassingly parallel,” consisting of a vast number of individual computations that are all independent of each other and thus capable of being performed concurrently. The rise of modern Graphics Processing Units (GPUs) and programming languages designed to leverage the inherent parallel nature of these processors have allowed researchers to dramatically speed up many programs that have such high arithmetic intensity and intrinsic concurrency. The presented GPU Optimized Wright–Fisher simulation, or “GO Fish” for short, can be used to simulate arbitrary selection and demographic scenarios while running over 250-fold faster than its serial counterpart on the CPU. Even modest GPU hardware can achieve an impressive speedup of over two orders of magnitude. With simulations so accelerated, one can not only do quick parametric bootstrapping of previously estimated parameters, but also use simulated results to calculate the likelihoods and summary statistics of demographic and selection models against real polymorphism data, all without restricting the demographic and selection scenarios that can be modeled or requiring approximations to the single-locus forward algorithm for efficiency. Further, as many of the parallel programming techniques used in this simulation can be applied to other computationally intensive algorithms important in population genetics, GO Fish serves as an exciting template for future research into accelerating computation in evolution. GO Fish is part of the Parallel PopGen Package available at: http://dl42.github.io/ParallelPopGen/. PMID:28768689
      

      
      Comparison of Acceleration Techniques for Selected Low-Level Bioinformatics Operations.
      PubMed
      Langenkämper, Daniel; Jakobi, Tobias; Feld, Dustin; Jelonek, Lukas; Goesmann, Alexander; Nattkemper, Tim W
         2016-01-01
         Within the recent years clock rates of modern processors stagnated while the demand for computing power continued to grow. This applied particularly for the fields of life sciences and bioinformatics, where new technologies keep on creating rapidly growing piles of raw data with increasing speed. The number of cores per processor increased in an attempt to compensate for slight increments of clock rates. This technological shift demands changes in software development, especially in the field of high performance computing where parallelization techniques are gaining in importance due to the pressing issue of large sized datasets generated by e.g., modern genomics. This paper presents an overview of state-of-the-art manual and automatic acceleration techniques and lists some applications employing these in different areas of sequence informatics. Furthermore, we provide examples for automatic acceleration of two use cases to show typical problems and gains of transforming a serial application to a parallel one. The paper should aid the reader in deciding for a certain techniques for the problem at hand. We compare four different state-of-the-art automatic acceleration approaches (OpenMP, PluTo-SICA, PPCG, and OpenACC). Their performance as well as their applicability for selected use cases is discussed. While optimizations targeting the CPU worked better in the complex k-mer use case, optimizers for Graphics Processing Units (GPUs) performed better in the matrix multiplication example. But performance is only superior at a certain problem size due to data migration overhead. We show that automatic code parallelization is feasible with current compiler software and yields significant increases in execution speed. Automatic optimizers for CPU are mature and usually no additional manual adjustment is required. In contrast, some automatic parallelizers targeting GPUs still lack maturity and are limited to simple statements and structures.
      

      
      Parallel Agent-Based Simulations on Clusters of GPUs and Multi-Core Processors
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Aaby, Brandon G; Perumalla, Kalyan S; Seal, Sudip K
         2010-01-01
         An effective latency-hiding mechanism is presented in the parallelization of agent-based model simulations (ABMS) with millions of agents. The mechanism is designed to accommodate the hierarchical organization as well as heterogeneity of current state-of-the-art parallel computing platforms. We use it to explore the computation vs. communication trade-off continuum available with the deep computational and memory hierarchies of extant platforms and present a novel analytical model of the tradeoff. We describe our implementation and report preliminary performance results on two distinct parallel platforms suitable for ABMS: CUDA threads on multiple, networked graphical processing units (GPUs), and pthreads on multi-core processors. Messagemore » Passing Interface (MPI) is used for inter-GPU as well as inter-socket communication on a cluster of multiple GPUs and multi-core processors. Results indicate the benefits of our latency-hiding scheme, delivering as much as over 100-fold improvement in runtime for certain benchmark ABMS application scenarios with several million agents. This speed improvement is obtained on our system that is already two to three orders of magnitude faster on one GPU than an equivalent CPU-based execution in a popular simulator in Java. Thus, the overall execution of our current work is over four orders of magnitude faster when executed on multiple GPUs.« less
      

      
      Scalable NIC-based reduction on large-scale clusters
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Moody, A.; Fernández, J. C.; Petrini, F.
         2003-01-01
         Many parallel algorithms require effiaent support for reduction mllectives. Over the years, researchers have developed optimal reduction algonduns by taking inm account system size, dam size, and complexities of reduction operations. However, all of these algorithm have assumed the faa that the reduction precessing takes place on the host CPU. Modem Network Interface Cards (NICs) sport programmable processors with substantial memory and thus introduce a fresh variable into the equation This raises the following intersting challenge: Can we take advantage of modern NICs to implementJost redudion operations? In this paper, we take on this challenge in the context of large-scalemore » clusters. Through experiments on the 960-node, 1920-processor or ASCI Linux Cluster (ALC) located at the Lawrence Livermore National Laboratory, we show that NIC-based reductions indeed perform with reduced latency and immed consistency over host-based aleorithms for the wmmon case and that these benefits scale as the system grows. In the largest configuration tested--1812 processors-- our NIC-based algorithm can sum a single element vector in 73 ps with 32-bi integers and in 118 with Mbit floating-point numnbers. These results represent an improvement, respeaively, of 121% and 39% with resvect w the {approx}roductionle vel MPI library« less
      

      
      QuickProbs—A Fast Multiple Sequence Alignment Algorithm Designed for Graphics Processors
      PubMed Central
      Gudyś, Adam; Deorowicz, Sebastian
         2014-01-01
         Multiple sequence alignment is a crucial task in a number of biological analyses like secondary structure prediction, domain searching, phylogeny, etc. MSAProbs is currently the most accurate alignment algorithm, but its effectiveness is obtained at the expense of computational time. In the paper we present QuickProbs, the variant of MSAProbs customised for graphics processors. We selected the two most time consuming stages of MSAProbs to be redesigned for GPU execution: the posterior matrices calculation and the consistency transformation. Experiments on three popular benchmarks (BAliBASE, PREFAB, OXBench-X) on quad-core PC equipped with high-end graphics card show QuickProbs to be 5.7 to 9.7 times faster than original CPU-parallel MSAProbs. Additional tests performed on several protein families from Pfam database give overall speed-up of 6.7. Compared to other algorithms like MAFFT, MUSCLE, or ClustalW, QuickProbs proved to be much more accurate at similar speed. Additionally we introduce a tuned variant of QuickProbs which is significantly more accurate on sets of distantly related sequences than MSAProbs without exceeding its computation time. The GPU part of QuickProbs was implemented in OpenCL, thus the package is suitable for graphics processors produced by all major vendors. PMID:24586435
      

      
      Summary of Documentation for DYNA3D-ParaDyn's Software Quality Assurance Regression Test Problems
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Zywicz, Edward
         
         The Software Quality Assurance (SQA) regression test suite for DYNA3D (Zywicz and Lin, 2015) and ParaDyn (DeGroot, et al., 2015) currently contains approximately 600 problems divided into 21 suites, and is a required component of ParaDyn’s SQA plan (Ferencz and Oliver, 2013). The regression suite allows developers to ensure that software modifications do not unintentionally alter the code response. The entire regression suite is run prior to permanently incorporating any software modification or addition. When code modifications alter test problem results, the specific cause must be determined and fully understood before the software changes and revised test answers can bemore » incorporated. The regression suite is executed on LLNL platforms using a Python script and an associated data file. The user specifies the DYNA3D or ParaDyn executable, number of processors to use, test problems to run, and other options to the script. The data file details how each problem and its answer extraction scripts are executed. For each problem in the regression suite there exists an input deck, an eight-processor partition file, an answer file, and various extraction scripts. These scripts assemble a temporary answer file in a specific format from the simulation results. The temporary and stored answer files are compared to a specific level of numerical precision, and when differences are detected the test problem is flagged as failed. Presently, numerical results are stored and compared to 16 digits. At this accuracy level different processor types, compilers, number of partitions, etc. impact the results to various degrees. Thus, for consistency purposes the regression suite is run with ParaDyn using 8 processors on machines with a specific processor type (currently the Intel Xeon E5530 processor). For non-parallel regression problems, i.e., the two XFEM problems, DYNA3D is used instead. When environments or platforms change, executables using the current source code and the new resource are created and the regression suite is run. If differences in answers arise, the new answers are retained provided that the differences are inconsequential. This bootstrap approach allows the test suite answers to evolve in a controlled manner with a high level of confidence. Developers also run the entire regression suite with (serial) DYNA3D. While these results normally differ from the stored (parallel) answers, abnormal termination or wildly different values are strong indicators of potential issues.« less
      

      
      Parallel algorithms for large-scale biological sequence alignment on Xeon-Phi based clusters.
      PubMed
      Lan, Haidong; Chan, Yuandong; Xu, Kai; Schmidt, Bertil; Peng, Shaoliang; Liu, Weiguo
         2016-07-19
         Computing alignments between two or more sequences are common operations frequently performed in computational molecular biology. The continuing growth of biological sequence databases establishes the need for their efficient parallel implementation on modern accelerators. This paper presents new approaches to high performance biological sequence database scanning with the Smith-Waterman algorithm and the first stage of progressive multiple sequence alignment based on the ClustalW heuristic on a Xeon Phi-based compute cluster. Our approach uses a three-level parallelization scheme to take full advantage of the compute power available on this type of architecture; i.e. cluster-level data parallelism, thread-level coarse-grained parallelism, and vector-level fine-grained parallelism. Furthermore, we re-organize the sequence datasets and use Xeon Phi shuffle operations to improve I/O efficiency. Evaluations show that our method achieves a peak overall performance up to 220 GCUPS for scanning real protein sequence databanks on a single node consisting of two Intel E5-2620 CPUs and two Intel Xeon Phi 7110P cards. It also exhibits good scalability in terms of sequence length and size, and number of compute nodes for both database scanning and multiple sequence alignment. Furthermore, the achieved performance is highly competitive in comparison to optimized Xeon Phi and GPU implementations. Our implementation is available at https://github.com/turbo0628/LSDBS-mpi .
      

      
      The CMS High-Level Trigger
      NASA Astrophysics Data System (ADS)
      Covarelli, R.
         2009-12-01
         At the startup of the LHC, the CMS data acquisition is expected to be able to sustain an event readout rate of up to 100 kHz from the Level-1 trigger. These events will be read into a large processor farm which will run the "High-Level Trigger" (HLT) selection algorithms and will output a rate of about 150 Hz for permanent data storage. In this report HLT performances are shown for selections based on muons, electrons, photons, jets, missing transverse energy, τ leptons and b quarks: expected efficiencies, background rates and CPU time consumption are reported as well as relaxation criteria foreseen for a LHC startup instantaneous luminosity.
      

      
      Multitasking the code ARC3D. [for computational fluid dynamics
      NASA Technical Reports Server (NTRS)
      Barton, John T.; Hsiung, Christopher C.
         1986-01-01
         The CRAY multitasking system was developed in order to utilize all four processors and sharply reduce the wall clock run time. This paper describes the techniques used to modify the computational fluid dynamics code ARC3D for this run and analyzes the achieved speedup. The ARC3D code solves either the Euler or thin-layer N-S equations using an implicit approximate factorization scheme. Results indicate that multitask processing can be used to achieve wall clock speedup factors of over three times, depending on the nature of the program code being used. Multitasking appears to be particularly advantageous for large-memory problems running on multiple CPU computers.
      

      
      Interaction sorting method for molecular dynamics on multi-core SIMD CPU architecture.
      PubMed
      Matvienko, Sergey; Alemasov, Nikolay; Fomin, Eduard
         2015-02-01
         Molecular dynamics (MD) is widely used in computational biology for studying binding mechanisms of molecules, molecular transport, conformational transitions, protein folding, etc. The method is computationally expensive; thus, the demand for the development of novel, much more efficient algorithms is still high. Therefore, the new algorithm designed in 2007 and called interaction sorting (IS) clearly attracted interest, as it outperformed the most efficient MD algorithms. In this work, a new IS modification is proposed which allows the algorithm to utilize SIMD processor instructions. This paper shows that the improvement provides an additional gain in performance, 9% to 45% in comparison to the original IS method.
      

      
      Parasail: SIMD C library for global, semi-global, and local pairwise sequence alignments
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Daily, Jeffrey A.
         
         Sequence alignment algorithms are a key component of many bioinformatics applications. Though various fast Smith-Waterman local sequence alignment implementations have been developed for x86 CPUs, most are embedded into larger database search tools. In addition, fast implementations of Needleman-Wunsch global sequence alignment and its semi-global variants are not as widespread. This article presents the first software library for local, global, and semi-global pairwise intra-sequence alignments and improves the performance of previous intra-sequence implementations. As a result, a faster intra-sequence pairwise alignment implementation is described and benchmarked. Using a 375 residue query sequence a speed of 136 billion cell updates permore » second (GCUPS) was achieved on a dual Intel Xeon E5-2670 12-core processor system, the highest reported for an implementation based on Farrar’s ’striped’ approach. When using only a single thread, parasail was 1.7 times faster than Rognes’s SWIPE. For many score matrices, parasail is faster than BLAST. The software library is designed for 64 bit Linux, OS X, or Windows on processors with SSE2, SSE41, or AVX2. Source code is available from https://github.com/jeffdaily/parasail under the Battelle BSD-style license. In conclusion, applications that require optimal alignment scores could benefit from the improved performance. For the first time, SIMD global, semi-global, and local alignments are available in a stand-alone C library.« less
      

      
      High-performance sparse matrix-matrix products on Intel KNL and multicore architectures
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Nagasaka, Y; Matsuoka, S; Azad, A
         
         Sparse matrix-matrix multiplication (SpGEMM) is a computational primitive that is widely used in areas ranging from traditional numerical applications to recent big data analysis and machine learning. Although many SpGEMM algorithms have been proposed, hardware specific optimizations for multi- and many-core processors are lacking and a detailed analysis of their performance under various use cases and matrices is not available. We firstly identify and mitigate multiple bottlenecks with memory management and thread scheduling on Intel Xeon Phi (Knights Landing or KNL). Specifically targeting multi- and many-core processors, we develop a hash-table-based algorithm and optimize a heap-based shared-memory SpGEMM algorithm. Wemore » examine their performance together with other publicly available codes. Different from the literature, our evaluation also includes use cases that are representative of real graph algorithms, such as multi-source breadth-first search or triangle counting. Our hash-table and heap-based algorithms are showing significant speedups from libraries in the majority of the cases while different algorithms dominate the other scenarios with different matrix size, sparsity, compression factor and operation type. We wrap up in-depth evaluation results and make a recipe to give the best SpGEMM algorithm for target scenario. A critical finding is that hash-table-based SpGEMM gets a significant performance boost if the nonzeros are not required to be sorted within each row of the output matrix.« less
      

      
      Parasail: SIMD C library for global, semi-global, and local pairwise sequence alignments
      DOE PAGES
      Daily, Jeffrey A.
         2016-02-10
         Sequence alignment algorithms are a key component of many bioinformatics applications. Though various fast Smith-Waterman local sequence alignment implementations have been developed for x86 CPUs, most are embedded into larger database search tools. In addition, fast implementations of Needleman-Wunsch global sequence alignment and its semi-global variants are not as widespread. This article presents the first software library for local, global, and semi-global pairwise intra-sequence alignments and improves the performance of previous intra-sequence implementations. As a result, a faster intra-sequence pairwise alignment implementation is described and benchmarked. Using a 375 residue query sequence a speed of 136 billion cell updates permore » second (GCUPS) was achieved on a dual Intel Xeon E5-2670 12-core processor system, the highest reported for an implementation based on Farrar’s ’striped’ approach. When using only a single thread, parasail was 1.7 times faster than Rognes’s SWIPE. For many score matrices, parasail is faster than BLAST. The software library is designed for 64 bit Linux, OS X, or Windows on processors with SSE2, SSE41, or AVX2. Source code is available from https://github.com/jeffdaily/parasail under the Battelle BSD-style license. In conclusion, applications that require optimal alignment scores could benefit from the improved performance. For the first time, SIMD global, semi-global, and local alignments are available in a stand-alone C library.« less
      

      
      Particle Identification on an FPGA Accelerated Compute Platform for the LHCb Upgrade
      NASA Astrophysics Data System (ADS)
      Fäerber, Christian; Schwemmer, Rainer; Machen, Jonathan; Neufeld, Niko
         2017-07-01
         The current LHCb readout system will be upgraded in 2018 to a “triggerless” readout of the entire detector at the Large Hadron Collider collision rate of 40 MHz. The corresponding bandwidth from the detector down to the foreseen dedicated computing farm (event filter farm), which acts as the trigger, has to be increased by a factor of almost 100 from currently 500 Gb/s up to 40 Tb/s. The event filter farm will preanalyze the data and will select the events on an event by event basis. This will reduce the bandwidth down to a manageable size to write the interesting physics data to tape. The design of such a system is a challenging task, and the reason why different new technologies are considered and have to be investigated for the different parts of the system. For the usage in the event building farm or in the event filter farm (trigger), an experimental field programmable gate array (FPGA) accelerated computing platform is considered and, therefore, tested. FPGA compute accelerators are used more and more in standard servers such as for Microsoft Bing search or Baidu search. The platform we use hosts a general Intel CPU and a high-performance FPGA linked via the high-speed Intel QuickPath Interconnect. An accelerator is implemented on the FPGA. It is very likely that these platforms, which are built, in general, for high-performance computing, are also very interesting for the high-energy physics community. First, the performance results of smaller test cases performed at the beginning are presented. Afterward, a part of the existing LHCb RICH particle identification is tested and is ported to the experimental FPGA accelerated platform. We have compared the performance of the LHCb RICH particle identification running on a normal CPU with the performance of the same algorithm, which is running on the Xeon-FPGA compute accelerator platform.
      

      
      Prestack depth migration for complex 2D structure using phase-screen propagators
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Roberts, P.; Huang, Lian-Jie; Burch, C.
         1997-11-01
         We present results for the phase-screen propagator method applied to prestack depth migration of the Marmousi synthetic data set. The data were migrated as individual common-shot records and the resulting partial images were superposed to obtain the final complete Image. Tests were performed to determine the minimum number of frequency components required to achieve the best quality image and this in turn provided estimates of the minimum computing time. Running on a single processor SUN SPARC Ultra I, high quality images were obtained in as little as 8.7 CPU hours and adequate images were obtained in as little as 4.4more » CPU hours. Different methods were tested for choosing the reference velocity used for the background phase-shift operation and for defining the slowness perturbation screens. Although the depths of some of the steeply dipping, high-contrast features were shifted slightly the overall image quality was fairly insensitive to the choice of the reference velocity. Our jests show the phase-screen method to be a reliable and fast algorithm for imaging complex geologic structures, at least for complex 2D synthetic data where the velocity model is known.« less
      

      
      Algorithms and Application of Sparse Matrix Assembly and Equation Solvers for Aeroacoustics
      NASA Technical Reports Server (NTRS)
      Watson, W. R.; Nguyen, D. T.; Reddy, C. J.; Vatsa, V. N.; Tang, W. H.
         2001-01-01
         An algorithm for symmetric sparse equation solutions on an unstructured grid is described. Efficient, sequential sparse algorithms for degree-of-freedom reordering, supernodes, symbolic/numerical factorization, and forward backward solution phases are reviewed. Three sparse algorithms for the generation and assembly of symmetric systems of matrix equations are presented. The accuracy and numerical performance of the sequential version of the sparse algorithms are evaluated over the frequency range of interest in a three-dimensional aeroacoustics application. Results show that the solver solutions are accurate using a discretization of 12 points per wavelength. Results also show that the first assembly algorithm is impractical for high-frequency noise calculations. The second and third assembly algorithms have nearly equal performance at low values of source frequencies, but at higher values of source frequencies the third algorithm saves CPU time and RAM. The CPU time and the RAM required by the second and third assembly algorithms are two orders of magnitude smaller than that required by the sparse equation solver. A sequential version of these sparse algorithms can, therefore, be conveniently incorporated into a substructuring for domain decomposition formulation to achieve parallel computation, where different substructures are handles by different parallel processors.
      

      
      Validation of columnar CsI x-ray detector responses obtained with hybridMANTIS, a CPU-GPU Monte Carlo code for coupled x-ray, electron, and optical transport
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Sharma, Diksha; Badano, Aldo
         2013-03-15
         Purpose: hybridMANTIS is a Monte Carlo package for modeling indirect x-ray imagers using columnar geometry based on a hybrid concept that maximizes the utilization of available CPU and graphics processing unit processors in a workstation. Methods: The authors compare hybridMANTIS x-ray response simulations to previously published MANTIS and experimental data for four cesium iodide scintillator screens. These screens have a variety of reflective and absorptive surfaces with different thicknesses. The authors analyze hybridMANTIS results in terms of modulation transfer function and calculate the root mean square difference and Swank factors from simulated and experimental results. Results: The comparison suggests thatmore » hybridMANTIS better matches the experimental data as compared to MANTIS, especially at high spatial frequencies and for the thicker screens. hybridMANTIS simulations are much faster than MANTIS with speed-ups up to 5260. Conclusions: hybridMANTIS is a useful tool for improved description and optimization of image acquisition stages in medical imaging systems and for modeling the forward problem in iterative reconstruction algorithms.« less
      

      
      High-Speed Particle-in-Cell Simulation Parallelized with Graphic Processing Units for Low Temperature Plasmas for Material Processing
      NASA Astrophysics Data System (ADS)
      Hur, Min Young; Verboncoeur, John; Lee, Hae June
         2014-10-01
         Particle-in-cell (PIC) simulations have high fidelity in the plasma device requiring transient kinetic modeling compared with fluid simulations. It uses less approximation on the plasma kinetics but requires many particles and grids to observe the semantic results. It means that the simulation spends lots of simulation time in proportion to the number of particles. Therefore, PIC simulation needs high performance computing. In this research, a graphic processing unit (GPU) is adopted for high performance computing of PIC simulation for low temperature discharge plasmas. GPUs have many-core processors and high memory bandwidth compared with a central processing unit (CPU). NVIDIA GeForce GPUs were used for the test with hundreds of cores which show cost-effective performance. PIC code algorithm is divided into two modules which are a field solver and a particle mover. The particle mover module is divided into four routines which are named move, boundary, Monte Carlo collision (MCC), and deposit. Overall, the GPU code solves particle motions as well as electrostatic potential in two-dimensional geometry almost 30 times faster than a single CPU code. This work was supported by the Korea Institute of Science Technology Information.
      

      
      High performance computing for deformable image registration: towards a new paradigm in adaptive radiotherapy.
      PubMed
      Samant, Sanjiv S; Xia, Junyi; Muyan-Ozcelik, Pinar; Owens, John D
         2008-08-01
         The advent of readily available temporal imaging or time series volumetric (4D) imaging has become an indispensable component of treatment planning and adaptive radiotherapy (ART) at many radiotherapy centers. Deformable image registration (DIR) is also used in other areas of medical imaging, including motion corrected image reconstruction. Due to long computation time, clinical applications of DIR in radiation therapy and elsewhere have been limited and consequently relegated to offline analysis. With the recent advances in hardware and software, graphics processing unit (GPU) based computing is an emerging technology for general purpose computation, including DIR, and is suitable for highly parallelized computing. However, traditional general purpose computation on the GPU is limited because the constraints of the available programming platforms. As well, compared to CPU programming, the GPU currently has reduced dedicated processor memory, which can limit the useful working data set for parallelized processing. We present an implementation of the demons algorithm using the NVIDIA 8800 GTX GPU and the new CUDA programming language. The GPU performance will be compared with single threading and multithreading CPU implementations on an Intel dual core 2.4 GHz CPU using the C programming language. CUDA provides a C-like language programming interface, and allows for direct access to the highly parallel compute units in the GPU. Comparisons for volumetric clinical lung images acquired using 4DCT were carried out. Computation time for 100 iterations in the range of 1.8-13.5 s was observed for the GPU with image size ranging from 2.0 x 10(6) to 14.2 x 10(6) pixels. The GPU registration was 55-61 times faster than the CPU for the single threading implementation, and 34-39 times faster for the multithreading implementation. For CPU based computing, the computational time generally has a linear dependence on image size for medical imaging data. Computational efficiency is characterized in terms of time per megapixels per iteration (TPMI) with units of seconds per megapixels per iteration (or spmi). For the demons algorithm, our CPU implementation yielded largely invariant values of TPMI. The mean TPMIs were 0.527 spmi and 0.335 spmi for the single threading and multithreading cases, respectively, with <2% variation over the considered image data range. For GPU computing, we achieved TPMI =0.00916 spmi with 3.7% variation, indicating optimized memory handling under CUDA. The paradigm of GPU based real-time DIR opens up a host of clinical applications for medical imaging.
      

      
      GENIE: a software package for gene-gene interaction analysis in genetic association studies using multiple GPU or CPU cores.
      PubMed
      Chikkagoudar, Satish; Wang, Kai; Li, Mingyao
         2011-05-26
         Gene-gene interaction in genetic association studies is computationally intensive when a large number of SNPs are involved. Most of the latest Central Processing Units (CPUs) have multiple cores, whereas Graphics Processing Units (GPUs) also have hundreds of cores and have been recently used to implement faster scientific software. However, currently there are no genetic analysis software packages that allow users to fully utilize the computing power of these multi-core devices for genetic interaction analysis for binary traits. Here we present a novel software package GENIE, which utilizes the power of multiple GPU or CPU processor cores to parallelize the interaction analysis. GENIE reads an entire genetic association study dataset into memory and partitions the dataset into fragments with non-overlapping sets of SNPs. For each fragment, GENIE analyzes: 1) the interaction of SNPs within it in parallel, and 2) the interaction between the SNPs of the current fragment and other fragments in parallel. We tested GENIE on a large-scale candidate gene study on high-density lipoprotein cholesterol. Using an NVIDIA Tesla C1060 graphics card, the GPU mode of GENIE achieves a speedup of 27 times over its single-core CPU mode run. GENIE is open-source, economical, user-friendly, and scalable. Since the computing power and memory capacity of graphics cards are increasing rapidly while their cost is going down, we anticipate that GENIE will achieve greater speedups with faster GPU cards. Documentation, source code, and precompiled binaries can be downloaded from http://www.cceb.upenn.edu/~mli/software/GENIE/.
      

        
       
          

«

14
      15
      16
   17
      18
      »

          
        

     

   

   
       
            
              
          

«

15
      16
      17
   18
      19
      »

          
        

           
           
             
               
      
      GPU-Q-J, a fast method for calculating root mean square deviation (RMSD) after optimal superposition
      PubMed Central
      
         2011-01-01
         Background Calculation of the root mean square deviation (RMSD) between the atomic coordinates of two optimally superposed structures is a basic component of structural comparison techniques. We describe a quaternion based method, GPU-Q-J, that is stable with single precision calculations and suitable for graphics processor units (GPUs). The application was implemented on an ATI 4770 graphics card in C/C++ and Brook+ in Linux where it was 260 to 760 times faster than existing unoptimized CPU methods. Source code is available from the Compbio website http://software.compbio.washington.edu/misc/downloads/st_gpu_fit/ or from the author LHH. Findings The Nutritious Rice for the World Project (NRW) on World Community Grid predicted de novo, the structures of over 62,000 small proteins and protein domains returning a total of 10 billion candidate structures. Clustering ensembles of structures on this scale requires calculation of large similarity matrices consisting of RMSDs between each pair of structures in the set. As a real-world test, we calculated the matrices for 6 different ensembles from NRW. The GPU method was 260 times faster that the fastest existing CPU based method and over 500 times faster than the method that had been previously used. Conclusions GPU-Q-J is a significant advance over previous CPU methods. It relieves a major bottleneck in the clustering of large numbers of structures for NRW. It also has applications in structure comparison methods that involve multiple superposition and RMSD determination steps, particularly when such methods are applied on a proteome and genome wide scale. PMID:21453553
      

      
      GENIE: a software package for gene-gene interaction analysis in genetic association studies using multiple GPU or CPU cores
      PubMed Central
      
         2011-01-01
         Background Gene-gene interaction in genetic association studies is computationally intensive when a large number of SNPs are involved. Most of the latest Central Processing Units (CPUs) have multiple cores, whereas Graphics Processing Units (GPUs) also have hundreds of cores and have been recently used to implement faster scientific software. However, currently there are no genetic analysis software packages that allow users to fully utilize the computing power of these multi-core devices for genetic interaction analysis for binary traits. Findings Here we present a novel software package GENIE, which utilizes the power of multiple GPU or CPU processor cores to parallelize the interaction analysis. GENIE reads an entire genetic association study dataset into memory and partitions the dataset into fragments with non-overlapping sets of SNPs. For each fragment, GENIE analyzes: 1) the interaction of SNPs within it in parallel, and 2) the interaction between the SNPs of the current fragment and other fragments in parallel. We tested GENIE on a large-scale candidate gene study on high-density lipoprotein cholesterol. Using an NVIDIA Tesla C1060 graphics card, the GPU mode of GENIE achieves a speedup of 27 times over its single-core CPU mode run. Conclusions GENIE is open-source, economical, user-friendly, and scalable. Since the computing power and memory capacity of graphics cards are increasing rapidly while their cost is going down, we anticipate that GENIE will achieve greater speedups with faster GPU cards. Documentation, source code, and precompiled binaries can be downloaded from http://www.cceb.upenn.edu/~mli/software/GENIE/. PMID:21615923
      

      
      Solving Coupled Gross--Pitaevskii Equations on a Cluster of PlayStation 3 Computers
      NASA Astrophysics Data System (ADS)
      Edwards, Mark; Heward, Jeffrey; Clark, C. W.
         2009-05-01
         At Georgia Southern University we have constructed an 8+1--node cluster of Sony PlayStation 3 (PS3) computers with the intention of using this computing resource to solve problems related to the behavior of ultra--cold atoms in general with a particular emphasis on studying bose--bose and bose--fermi mixtures confined in optical lattices. As a first project that uses this computing resource, we have implemented a parallel solver of the coupled time--dependent, one--dimensional Gross--Pitaevskii (TDGP) equations. These equations govern the behavior of dual-- species bosonic mixtures. We chose the split--operator/FFT to solve the coupled 1D TDGP equations. The fast Fourier transform component of this solver can be readily parallelized on the PS3 cpu known as the Cell Broadband Engine (CellBE). Each CellBE chip contains a single 64--bit PowerPC Processor Element known as the PPE and eight ``Synergistic Processor Element'' identified as the SPE's. We report on this algorithm and compare its performance to a non--parallel solver as applied to modeling evaporative cooling in dual--species bosonic mixtures.
      

      
      FPGA-accelerated algorithm for the regular expression matching system
      NASA Astrophysics Data System (ADS)
      Russek, P.; Wiatr, K.
         2015-01-01
         This article describes an algorithm to support a regular expressions matching system. The goal was to achieve an attractive performance system with low energy consumption. The basic idea of the algorithm comes from a concept of the Bloom filter. It starts from the extraction of static sub-strings for strings of regular expressions. The algorithm is devised to gain from its decomposition into parts which are intended to be executed by custom hardware and the central processing unit (CPU). The pipelined custom processor architecture is proposed and a software algorithm explained accordingly. The software part of the algorithm was coded in C and runs on a processor from the ARM family. The hardware architecture was described in VHDL and implemented in field programmable gate array (FPGA). The performance results and required resources of the above experiments are given. An example of target application for the presented solution is computer and network security systems. The idea was tested on nearly 100,000 body-based viruses from the ClamAV virus database. The solution is intended for the emerging technology of clusters of low-energy computing nodes.
      

      
      New computing systems and their impact on structural analysis and design
      NASA Technical Reports Server (NTRS)
      Noor, Ahmed K.
         1989-01-01
         A review is given of the recent advances in computer technology that are likely to impact structural analysis and design. The computational needs for future structures technology are described. The characteristics of new and projected computing systems are summarized. Advances in programming environments, numerical algorithms, and computational strategies for new computing systems are reviewed, and a novel partitioning strategy is outlined for maximizing the degree of parallelism. The strategy is designed for computers with a shared memory and a small number of powerful processors (or a small number of clusters of medium-range processors). It is based on approximating the response of the structure by a combination of symmetric and antisymmetric response vectors, each obtained using a fraction of the degrees of freedom of the original finite element model. The strategy was implemented on the CRAY X-MP/4 and the Alliant FX/8 computers. For nonlinear dynamic problems on the CRAY X-MP with four CPUs, it resulted in an order of magnitude reduction in total analysis time, compared with the direct analysis on a single-CPU CRAY X-MP machine.
      

      
      A workload model and measures for computer performance evaluation
      NASA Technical Reports Server (NTRS)
      Kerner, H.; Kuemmerle, K.
         1972-01-01
         A generalized workload definition is presented which constructs measurable workloads of unit size from workload elements, called elementary processes. An elementary process makes almost exclusive use of one of the processors, CPU, I/O processor, etc., and is measured by the cost of its execution. Various kinds of user programs can be simulated by quantitative composition of elementary processes into a type. The character of the type is defined by the weights of its elementary processes and its structure by the amount and sequence of transitions between its elementary processes. A set of types is batched to a mix. Mixes of identical cost are considered as equivalent amounts of workload. These formalized descriptions of workloads allow investigators to compare the results of different studies quantitatively. Since workloads of different composition are assigned a unit of cost, these descriptions enable determination of cost effectiveness of different workloads on a machine. Subsequently performance parameters such as throughput rate, gain factor, internal and external delay factors are defined and used to demonstrate the effects of various workload attributes on the performance of a selected large scale computer system.
      

      
      Conference on Real-Time Computer Applications in Nuclear, Particle and Plasma Physics, 6th, Williamsburg, VA, May 15-19, 1989, Proceedings
      NASA Technical Reports Server (NTRS)
      Pordes, Ruth (Editor)
         1989-01-01
         Papers on real-time computer applications in nuclear, particle, and plasma physics are presented, covering topics such as expert systems tactics in testing FASTBUS segment interconnect modules, trigger control in a high energy physcis experiment, the FASTBUS read-out system for the Aleph time projection chamber, a multiprocessor data acquisition systems, DAQ software architecture for Aleph, a VME multiprocessor system for plasma control at the JT-60 upgrade, and a multiasking, multisinked, multiprocessor data acquisition front end. Other topics include real-time data reduction using a microVAX processor, a transputer based coprocessor for VEDAS, simulation of a macropipelined multi-CPU event processor for use in FASTBUS, a distributed VME control system for the LISA superconducting Linac, a distributed system for laboratory process automation, and a distributed system for laboratory process automation. Additional topics include a structure macro assembler for the event handler, a data acquisition and control system for Thomson scattering on ATF, remote procedure execution software for distributed systems, and a PC-based graphic display real-time particle beam uniformity.
      

      
      A System-on-Chip Solution for Point-of-Care Ultrasound Imaging Systems: Architecture and ASIC Implementation.
      PubMed
      Kang, Jeeun; Yoon, Changhan; Lee, Jaejin; Kye, Sang-Bum; Lee, Yongbae; Chang, Jin Ho; Kim, Gi-Duck; Yoo, Yangmo; Song, Tai-kyong
         2016-04-01
         In this paper, we present a novel system-on-chip (SOC) solution for a portable ultrasound imaging system (PUS) for point-of-care applications. The PUS-SOC includes all of the signal processing modules (i.e., the transmit and dynamic receive beamformer modules, mid- and back-end processors, and color Doppler processors) as well as an efficient architecture for hardware-based imaging methods (e.g., dynamic delay calculation, multi-beamforming, and coded excitation and compression). The PUS-SOC was fabricated using a UMC 130-nm NAND process and has 16.8 GFLOPS of computing power with a total equivalent gate count of 12.1 million, which is comparable to a Pentium-4 CPU. The size and power consumption of the PUS-SOC are 27×27 mm(2) and 1.2 W, respectively. Based on the PUS-SOC, a prototype hand-held US imaging system was implemented. Phantom experiments demonstrated that the PUS-SOC can provide appropriate image quality for point-of-care applications with a compact PDA size ( 200×120×45 mm(3)) and 3 hours of battery life.
      

      
      GPUs, a New Tool of Acceleration in CFD: Efficiency and Reliability on Smoothed Particle Hydrodynamics Methods
      PubMed Central
      Crespo, Alejandro C.; Dominguez, Jose M.; Barreiro, Anxo; Gómez-Gesteira, Moncho; Rogers, Benedict D.
         2011-01-01
         Smoothed Particle Hydrodynamics (SPH) is a numerical method commonly used in Computational Fluid Dynamics (CFD) to simulate complex free-surface flows. Simulations with this mesh-free particle method far exceed the capacity of a single processor. In this paper, as part of a dual-functioning code for either central processing units (CPUs) or Graphics Processor Units (GPUs), a parallelisation using GPUs is presented. The GPU parallelisation technique uses the Compute Unified Device Architecture (CUDA) of nVidia devices. Simulations with more than one million particles on a single GPU card exhibit speedups of up to two orders of magnitude over using a single-core CPU. It is demonstrated that the code achieves different speedups with different CUDA-enabled GPUs. The numerical behaviour of the SPH code is validated with a standard benchmark test case of dam break flow impacting on an obstacle where good agreement with the experimental results is observed. Both the achieved speed-ups and the quantitative agreement with experiments suggest that CUDA-based GPU programming can be used in SPH methods with efficiency and reliability. PMID:21695185
      

      
      Real-time compression of raw computed tomography data: technology, architecture, and benefits
      NASA Astrophysics Data System (ADS)
      Wegener, Albert; Chandra, Naveen; Ling, Yi; Senzig, Robert; Herfkens, Robert
         2009-02-01
         Compression of computed tomography (CT) projection samples reduces slip ring and disk drive costs. A lowcomplexity, CT-optimized compression algorithm called Prism CTTM achieves at least 1.59:1 and up to 2.75:1 lossless compression on twenty-six CT projection data sets. We compare the lossless compression performance of Prism CT to alternative lossless coders, including Lempel-Ziv, Golomb-Rice, and Huffman coders using representative CT data sets. Prism CT provides the best mean lossless compression ratio of 1.95:1 on the representative data set. Prism CT compression can be integrated into existing slip rings using a single FPGA. Prism CT decompression operates at 100 Msamp/sec using one core of a dual-core Xeon CPU. We describe a methodology to evaluate the effects of lossy compression on image quality to achieve even higher compression ratios. We conclude that lossless compression of raw CT signals provides significant cost savings and performance improvements for slip rings and disk drive subsystems in all CT machines. Lossy compression should be considered in future CT data acquisition subsystems because it provides even more system benefits above lossless compression while achieving transparent diagnostic image quality. This result is demonstrated on a limited dataset using appropriately selected compression ratios and an experienced radiologist.
      

      
      Early Experiences Writing Performance Portable OpenMP 4 Codes
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Joubert, Wayne; Hernandez, Oscar R
         
         In this paper, we evaluate the recently available directives in OpenMP 4 to parallelize a computational kernel using both the traditional shared memory approach and the newer accelerator targeting capabilities. In addition, we explore various transformations that attempt to increase application performance portability, and examine the expressiveness and performance implications of using these approaches. For example, we want to understand if the target map directives in OpenMP 4 improve data locality when mapped to a shared memory system, as opposed to the traditional first touch policy approach in traditional OpenMP. To that end, we use recent Cray and Intel compilersmore » to measure the performance variations of a simple application kernel when executed on the OLCF s Titan supercomputer with NVIDIA GPUs and the Beacon system with Intel Xeon Phi accelerators attached. To better understand these trade-offs, we compare our results from traditional OpenMP shared memory implementations to the newer accelerator programming model when it is used to target both the CPU and an attached heterogeneous device. We believe the results and lessons learned as presented in this paper will be useful to the larger user community by providing guidelines that can assist programmers in the development of performance portable code.« less
      

      
      GPU-based ultra-fast dose calculation using a finite size pencil beam model.
      PubMed
      Gu, Xuejun; Choi, Dongju; Men, Chunhua; Pan, Hubert; Majumdar, Amitava; Jiang, Steve B
         2009-10-21
         Online adaptive radiation therapy (ART) is an attractive concept that promises the ability to deliver an optimal treatment in response to the inter-fraction variability in patient anatomy. However, it has yet to be realized due to technical limitations. Fast dose deposit coefficient calculation is a critical component of the online planning process that is required for plan optimization of intensity-modulated radiation therapy (IMRT). Computer graphics processing units (GPUs) are well suited to provide the requisite fast performance for the data-parallel nature of dose calculation. In this work, we develop a dose calculation engine based on a finite-size pencil beam (FSPB) algorithm and a GPU parallel computing framework. The developed framework can accommodate any FSPB model. We test our implementation in the case of a water phantom and the case of a prostate cancer patient with varying beamlet and voxel sizes. All testing scenarios achieved speedup ranging from 200 to 400 times when using a NVIDIA Tesla C1060 card in comparison with a 2.27 GHz Intel Xeon CPU. The computational time for calculating dose deposition coefficients for a nine-field prostate IMRT plan with this new framework is less than 1 s. This indicates that the GPU-based FSPB algorithm is well suited for online re-planning for adaptive radiotherapy.
      

      
      A Real-Time Capable Software-Defined Receiver Using GPU for Adaptive Anti-Jam GPS Sensors
      PubMed Central
      Seo, Jiwon; Chen, Yu-Hsuan; De Lorenzo, David S.; Lo, Sherman; Enge, Per; Akos, Dennis; Lee, Jiyun
         2011-01-01
         Due to their weak received signal power, Global Positioning System (GPS) signals are vulnerable to radio frequency interference. Adaptive beam and null steering of the gain pattern of a GPS antenna array can significantly increase the resistance of GPS sensors to signal interference and jamming. Since adaptive array processing requires intensive computational power, beamsteering GPS receivers were usually implemented using hardware such as field-programmable gate arrays (FPGAs). However, a software implementation using general-purpose processors is much more desirable because of its flexibility and cost effectiveness. This paper presents a GPS software-defined radio (SDR) with adaptive beamsteering capability for anti-jam applications. The GPS SDR design is based on an optimized desktop parallel processing architecture using a quad-core Central Processing Unit (CPU) coupled with a new generation Graphics Processing Unit (GPU) having massively parallel processors. This GPS SDR demonstrates sufficient computational capability to support a four-element antenna array and future GPS L5 signal processing in real time. After providing the details of our design and optimization schemes for future GPU-based GPS SDR developments, the jamming resistance of our GPS SDR under synthetic wideband jamming is presented. Since the GPS SDR uses commercial-off-the-shelf hardware and processors, it can be easily adopted in civil GPS applications requiring anti-jam capabilities. PMID:22164116
      

      
      A real-time capable software-defined receiver using GPU for adaptive anti-jam GPS sensors.
      PubMed
      Seo, Jiwon; Chen, Yu-Hsuan; De Lorenzo, David S; Lo, Sherman; Enge, Per; Akos, Dennis; Lee, Jiyun
         2011-01-01
         Due to their weak received signal power, Global Positioning System (GPS) signals are vulnerable to radio frequency interference. Adaptive beam and null steering of the gain pattern of a GPS antenna array can significantly increase the resistance of GPS sensors to signal interference and jamming. Since adaptive array processing requires intensive computational power, beamsteering GPS receivers were usually implemented using hardware such as field-programmable gate arrays (FPGAs). However, a software implementation using general-purpose processors is much more desirable because of its flexibility and cost effectiveness. This paper presents a GPS software-defined radio (SDR) with adaptive beamsteering capability for anti-jam applications. The GPS SDR design is based on an optimized desktop parallel processing architecture using a quad-core Central Processing Unit (CPU) coupled with a new generation Graphics Processing Unit (GPU) having massively parallel processors. This GPS SDR demonstrates sufficient computational capability to support a four-element antenna array and future GPS L5 signal processing in real time. After providing the details of our design and optimization schemes for future GPU-based GPS SDR developments, the jamming resistance of our GPS SDR under synthetic wideband jamming is presented. Since the GPS SDR uses commercial-off-the-shelf hardware and processors, it can be easily adopted in civil GPS applications requiring anti-jam capabilities.
      

      
      Global fully kinetic models of planetary magnetospheres with iPic3D
      NASA Astrophysics Data System (ADS)
      Gonzalez, D.; Sanna, L.; Amaya, J.; Zitz, A.; Lembege, B.; Markidis, S.; Schriver, D.; Walker, R. J.; Berchem, J.; Peng, I. B.; Travnicek, P. M.; Lapenta, G.
         2016-12-01
         We report on the latest developments of our approach to model planetary magnetospheres, mini magnetospheres and the Earth's magnetosphere with the fully kinetic, electromagnetic particle in cell code iPic3D. The code treats electrons and multiple species of ions as full kinetic particles. We review: 1) Why a fully kinetic model and in particular why kinetic electrons are needed for capturing some of the most important aspects of the physics processes of planetary magnetospheres. 2) Why the energy conserving implicit method (ECIM) in its newest implementation [1] is the right approach to reach this goal. We consider the different electron scales and study how the new IECIM can be tuned to resolve only the electron scales of interest while averaging over the unresolved scales preserving their contribution to the evolution. 3) How with modern computing planetary magnetospheres, mini magnetosphere and eventually Earth's magnetosphere can be modeled with fully kinetic electrons. The path from petascale to exascale for iPiC3D is outlined based on the DEEP-ER project [2], using dynamic allocation of different processor architectures (Xeon and Xeon Phi) and innovative I/O technologies.Specifically results from models of Mercury are presented and compared with MESSENGER observations and with previous hybrid (fluid electrons and kinetic ions) simulations. The plasma convection around the planets includes the development of hydrodynamic instabilities at the flanks, the presence of the collisionless shocks, the magnetosheath, the magnetopause, reconnection zones, the formation of the plasma sheet and the magnetotail, and the variation of ion/electron plasma flows when crossing these frontiers. Given the full kinetic nature of our approach we focus on detailed particle dynamics and distribution at locations that can be used for comparison with satellite data. [1] Lapenta, G. (2016). Exactly Energy Conserving Implicit Moment Particle in Cell Formulation. arXiv preprint arXiv:1602.06326.[2] www.deep-er.eu
      

      
      A UNIX SVR4-OS 9 distributed data acquisition for high energy physics
      NASA Astrophysics Data System (ADS)
      Drouhin, F.; Schwaller, B.; Fontaine, J. C.; Charles, F.; Pallares, A.; Huss, D.
         1998-08-01
         The distributed data acquisition (DAQ) system developed by the GRPHE (Groupe de Recherche en Physique des Hautes Energies) group is a combination of hardware and software dedicated to high energy physics. The system described here is used in the beam tests of the CMS tracker. The central processor of the system is a RISC CPU hosted in a VME card, running a POSIX compliant UNIX system. Specialized real-time OS9 VME cards perform the instrumentation control. The main data flow goes over a deterministic high speed network. The UNIX system manages a list of OS9 front-end systems with a synchronisation protocol running over a TCP/IP layer.
      

      
      Parallel implementation of Hartree-Fock and density functional theory analytical second derivatives
      NASA Astrophysics Data System (ADS)
      Baker, Jon; Wolinski, Krzysztof; Malagoli, Massimo; Pulay, Peter
         2004-01-01
         We present an efficient, parallel implementation for the calculation of Hartree-Fock and density functional theory analytical Hessian (force constant, nuclear second derivative) matrices. These are important for the determination of harmonic vibrational frequencies, and to classify stationary points on potential energy surfaces. Our program is designed for modest parallelism (4-16 CPUs) as exemplified by our standard eight-processor QuantumCube™. We can routinely handle systems with up to 100+ atoms and 1000+ basis functions using under 0.5 GB of RAM memory per CPU. Timings are presented for several systems, ranging in size from aspirin (C9H8O4) to nickel octaethylporphyrin (C36H44N4Ni).
      

      
      SpaceCube Version 1.5
      NASA Technical Reports Server (NTRS)
      Geist, Alessandro; Lin, Michael; Flatley, Tom; Petrick, David
         2013-01-01
         SpaceCube 1.5 is a high-performance and low-power system in a compact form factor. It is a hybrid processing system consisting of CPU (central processing unit), FPGA (field-programmable gate array), and DSP (digital signal processor) processing elements. The primary processing engine is the Virtex- 5 FX100T FPGA, which has two embedded processors. The SpaceCube 1.5 System was a bridge to the SpaceCube 2.0 and SpaceCube 2.0 Mini processing systems. The SpaceCube 1.5 system was the primary avionics in the successful SMART (Small Rocket/Spacecraft Technology) Sounding Rocket mission that was launched in the summer of 2011. For SMART and similar missions, an avionics processor is required that is reconfigurable, has high processing capability, has multi-gigabit interfaces, is low power, and comes in a rugged/compact form factor. The original SpaceCube 1.0 met a number of the criteria, but did not possess the multi-gigabit interfaces that were required and is a higher-cost system. The SpaceCube 1.5 was designed with those mission requirements in mind. The SpaceCube 1.5 features one Xilinx Virtex-5 FX100T FPGA and has excellent size, weight, and power characteristics [4×4×3 in. (approx. = 10×10×8 cm), 3 lb (approx. = 1.4 kg), and 5 to 15 W depending on the application]. The estimated computing power of the two PowerPC 440s in the Virtex-5 FPGA is 1100 DMIPS each. The SpaceCube 1.5 includes two Gigabit Ethernet (1 Gbps) interfaces as well as two SATA-I/II interfaces (1.5 to 3.0 Gbps) for recording to data drives. The SpaceCube 1.5 also features DDR2 SDRAM (double data rate synchronous dynamic random access memory); 4- Gbit Flash for storing application code for the CPU, FPGA, and DSP processing elements; and a Xilinx Platform Flash XL to store FPGA configuration files or application code. The system also incorporates a 12 bit analog to digital converter with the ability to read 32 discrete analog sensor inputs. The SpaceCube 1.5 design also has a built-in accelerometer. In addition, the system has 12 receive and transmit RS- 422 interfaces for legacy support. The SpaceCube 1.5 processor card represents the first NASA Goddard design in a compact form factor featuring the Xilinx Virtex- 5. The SpaceCube 1.5 incorporates backward compatibility with the Space- Cube 1.0 form factor and stackable architecture. It also makes use of low-cost commercial parts, but is designed for operation in harsh environments.
      

      
      A derivation and scalable implementation of the synchronous parallel kinetic Monte Carlo method for simulating long-time dynamics
      NASA Astrophysics Data System (ADS)
      Byun, Hye Suk; El-Naggar, Mohamed Y.; Kalia, Rajiv K.; Nakano, Aiichiro; Vashishta, Priya
         2017-10-01
         Kinetic Monte Carlo (KMC) simulations are used to study long-time dynamics of a wide variety of systems. Unfortunately, the conventional KMC algorithm is not scalable to larger systems, since its time scale is inversely proportional to the simulated system size. A promising approach to resolving this issue is the synchronous parallel KMC (SPKMC) algorithm, which makes the time scale size-independent. This paper introduces a formal derivation of the SPKMC algorithm based on local transition-state and time-dependent Hartree approximations, as well as its scalable parallel implementation based on a dual linked-list cell method. The resulting algorithm has achieved a weak-scaling parallel efficiency of 0.935 on 1024 Intel Xeon processors for simulating biological electron transfer dynamics in a 4.2 billion-heme system, as well as decent strong-scaling parallel efficiency. The parallel code has been used to simulate a lattice of cytochrome complexes on a bacterial-membrane nanowire, and it is broadly applicable to other problems such as computational synthesis of new materials.
      

      
      BlochSolver: A GPU-optimized fast 3D MRI simulator for experimentally compatible pulse sequences
      NASA Astrophysics Data System (ADS)
      Kose, Ryoichi; Kose, Katsumi
         2017-08-01
         A magnetic resonance imaging (MRI) simulator, which reproduces MRI experiments using computers, has been developed using two graphic-processor-unit (GPU) boards (GTX 1080). The MRI simulator was developed to run according to pulse sequences used in experiments. Experiments and simulations were performed to demonstrate the usefulness of the MRI simulator for three types of pulse sequences, namely, three-dimensional (3D) gradient-echo, 3D radio-frequency spoiled gradient-echo, and gradient-echo multislice with practical matrix sizes. The results demonstrated that the calculation speed using two GPU boards was typically about 7 TFLOPS and about 14 times faster than the calculation speed using CPUs (two 18-core Xeons). We also found that MR images acquired by experiment could be reproduced using an appropriate number of subvoxels, and that 3D isotropic and two-dimensional multislice imaging experiments for practical matrix sizes could be simulated using the MRI simulator. Therefore, we concluded that such powerful MRI simulators are expected to become an indispensable tool for MRI research and development.
      

        
       
          

«

15
      16
      17
   18
      19
      »

          
        

     

   

   
       
            
              
          

«

16
      17
      18
   19
      20
      »

          
        

           
           
             
               
      
      Speeding up spin-component-scaled third-order pertubation theory with the chain of spheres approximation: the COSX-SCS-MP3 method
      NASA Astrophysics Data System (ADS)
      Izsák, Róbert; Neese, Frank
         2013-07-01
         The 'chain of spheres' approximation, developed earlier for the efficient evaluation of the self-consistent field exchange term, is introduced here into the evaluation of the external exchange term of higher order correlation methods. Its performance is studied in the specific case of the spin-component-scaled third-order Møller--Plesset perturbation (SCS-MP3) theory. The results indicate that the approximation performs excellently in terms of both computer time and achievable accuracy. Significant speedups over a conventional method are obtained for larger systems and basis sets. Owing to this development, SCS-MP3 calculations on molecules of the size of penicillin (42 atoms) with a polarised triple-zeta basis set can be performed in ∼3 hours using 16 cores of an Intel Xeon E7-8837 processor with a 2.67 GHz clock speed, which represents a speedup by a factor of 8-9 compared to the previously most efficient algorithm. Thus, the increased accuracy offered by SCS-MP3 can now be explored for at least medium-sized molecules.
      

      
      Thread-Level Parallelization and Optimization of NWChem for the Intel MIC Architecture
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Shan, Hongzhang; Williams, Samuel; Jong, Wibe de
         
         In the multicore era it was possible to exploit the increase in on-chip parallelism by simply running multiple MPI processes per chip. Unfortunately, manycore processors' greatly increased thread- and data-level parallelism coupled with a reduced memory capacity demand an altogether different approach. In this paper we explore augmenting two NWChem modules, triples correction of the CCSD(T) and Fock matrix construction, with OpenMP in order that they might run efficiently on future manycore architectures. As the next NERSC machine will be a self-hosted Intel MIC (Xeon Phi) based supercomputer, we leverage an existing MIC testbed at NERSC to evaluate our experiments.more » In order to proxy the fact that future MIC machines will not have a host processor, we run all of our experiments in tt native mode. We found that while straightforward application of OpenMP to the deep loop nests associated with the tensor contractions of CCSD(T) was sufficient in attaining high performance, significant effort was required to safely and efficiently thread the TEXAS integral package when constructing the Fock matrix. Ultimately, our new MPI OpenMP hybrid implementations attain up to 65x better performance for the triples part of the CCSD(T) due in large part to the fact that the limited on-card memory limits the existing MPI implementation to a single process per card. Additionally, we obtain up to 1.6x better performance on Fock matrix constructions when compared with the best MPI implementations running multiple processes per card.« less
      

      
      Thread-level parallelization and optimization of NWChem for the Intel MIC architecture
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Shan, Hongzhang; Williams, Samuel; de Jong, Wibe
         
         In the multicore era it was possible to exploit the increase in on-chip parallelism by simply running multiple MPI processes per chip. Unfortunately, manycore processors' greatly increased thread- and data-level parallelism coupled with a reduced memory capacity demand an altogether different approach. In this paper we explore augmenting two NWChem modules, triples correction of the CCSD(T) and Fock matrix construction, with OpenMP in order that they might run efficiently on future manycore architectures. As the next NERSC machine will be a self-hosted Intel MIC (Xeon Phi) based supercomputer, we leverage an existing MIC testbed at NERSC to evaluate our experiments.more » In order to proxy the fact that future MIC machines will not have a host processor, we run all of our experiments in native mode. We found that while straightforward application of OpenMP to the deep loop nests associated with the tensor contractions of CCSD(T) was sufficient in attaining high performance, significant e ort was required to safely and efeciently thread the TEXAS integral package when constructing the Fock matrix. Ultimately, our new MPI+OpenMP hybrid implementations attain up to 65× better performance for the triples part of the CCSD(T) due in large part to the fact that the limited on-card memory limits the existing MPI implementation to a single process per card. Additionally, we obtain up to 1.6× better performance on Fock matrix constructions when compared with the best MPI implementations running multiple processes per card.« less
      

      
      Faster Smith-Waterman database searches with inter-sequence SIMD parallelisation
      PubMed Central
      
         2011-01-01
         Background The Smith-Waterman algorithm for local sequence alignment is more sensitive than heuristic methods for database searching, but also more time-consuming. The fastest approach to parallelisation with SIMD technology has previously been described by Farrar in 2007. The aim of this study was to explore whether further speed could be gained by other approaches to parallelisation. Results A faster approach and implementation is described and benchmarked. In the new tool SWIPE, residues from sixteen different database sequences are compared in parallel to one query residue. Using a 375 residue query sequence a speed of 106 billion cell updates per second (GCUPS) was achieved on a dual Intel Xeon X5650 six-core processor system, which is over six times more rapid than software based on Farrar's 'striped' approach. SWIPE was about 2.5 times faster when the programs used only a single thread. For shorter queries, the increase in speed was larger. SWIPE was about twice as fast as BLAST when using the BLOSUM50 score matrix, while BLAST was about twice as fast as SWIPE for the BLOSUM62 matrix. The software is designed for 64 bit Linux on processors with SSSE3. Source code is available from http://dna.uio.no/swipe/ under the GNU Affero General Public License. Conclusions Efficient parallelisation using SIMD on standard hardware makes it possible to run Smith-Waterman database searches more than six times faster than before. The approach described here could significantly widen the potential application of Smith-Waterman searches. Other applications that require optimal local alignment scores could also benefit from improved performance. PMID:21631914
      

      
      Faster Smith-Waterman database searches with inter-sequence SIMD parallelisation.
      PubMed
      Rognes, Torbjørn
         2011-06-01
         The Smith-Waterman algorithm for local sequence alignment is more sensitive than heuristic methods for database searching, but also more time-consuming. The fastest approach to parallelisation with SIMD technology has previously been described by Farrar in 2007. The aim of this study was to explore whether further speed could be gained by other approaches to parallelisation. A faster approach and implementation is described and benchmarked. In the new tool SWIPE, residues from sixteen different database sequences are compared in parallel to one query residue. Using a 375 residue query sequence a speed of 106 billion cell updates per second (GCUPS) was achieved on a dual Intel Xeon X5650 six-core processor system, which is over six times more rapid than software based on Farrar's 'striped' approach. SWIPE was about 2.5 times faster when the programs used only a single thread. For shorter queries, the increase in speed was larger. SWIPE was about twice as fast as BLAST when using the BLOSUM50 score matrix, while BLAST was about twice as fast as SWIPE for the BLOSUM62 matrix. The software is designed for 64 bit Linux on processors with SSSE3. Source code is available from http://dna.uio.no/swipe/ under the GNU Affero General Public License. Efficient parallelisation using SIMD on standard hardware makes it possible to run Smith-Waterman database searches more than six times faster than before. The approach described here could significantly widen the potential application of Smith-Waterman searches. Other applications that require optimal local alignment scores could also benefit from improved performance.
      

      
      Software for embedded processors: Problems and solutions
      NASA Astrophysics Data System (ADS)
      Bogaerts, J. A. C.
         1990-08-01
         Data Acquistion systems in HEP experiments use a wide spectrum of computers to cope with two major problems: high event rates and a large data volume. They do this by using special fast trigger processors at the source to reduce the event rate by several orders of magnitude. The next stage of a data acquisition system consists of a network of fast but conventional microprocessors which are embedded in high speed bus systems where data is still further reduced, filtered and merged. In the final stage complete events are farmed out to a another collection of processors, which reconstruct the events and perhaps achieve a further event rejection by a small factor, prior to recording onto magnetic tape. Detectors are monitored by analyzing a fraction of the data. This may be done for individual detectors at an early state of the data acquisition or it may be delayed till the complete events are available. A network of workstations is used for monitoring, displays and run control. Software for trigger processors must have a simple structure. Rejection algorithms are carefully optimized, and overheads introduced by system software cannot be tolerated. The embedded microprocessors have to co-operate, and need to be synchronized with the preceding and following stages. Real time kernels are typically used to solve synchronization and communication problems. Applications are usually coded in C, which is reasonably efficient and allows direct control over low level hardware functions. Event reconstruction software is very similar or even identical to offline software, predominantly written in FORTRAN. With the advent of powerful RISC processors, and with manufacturers tending to adopt open bus architectures, there is a move towards commercial processors and hence the introduction of the UNIX operating system. Building and controlling such a heterogeneous data acquisition system puts a heavy strain on the software. Communications is now as important as CPU capacity and I/O bandwidth, the traditional key parameters of a HEP data acquisition system. Software engineering and real time system simulation tools are becoming indispensible for the design of future data acquisition systems.
      

      
      Use of a graphics processing unit (GPU) to facilitate real-time 3D graphic presentation of the patient skin-dose distribution during fluoroscopic interventional procedures
      PubMed Central
      Rana, Vijay; Rudin, Stephen; Bednarek, Daniel R.
         2012-01-01
         We have developed a dose-tracking system (DTS) that calculates the radiation dose to the patient’s skin in real-time by acquiring exposure parameters and imaging-system-geometry from the digital bus on a Toshiba Infinix C-arm unit. The cumulative dose values are then displayed as a color map on an OpenGL-based 3D graphic of the patient for immediate feedback to the interventionalist. Determination of those elements on the surface of the patient 3D-graphic that intersect the beam and calculation of the dose for these elements in real time demands fast computation. Reducing the size of the elements results in more computation load on the computer processor and therefore a tradeoff occurs between the resolution of the patient graphic and the real-time performance of the DTS. The speed of the DTS for calculating dose to the skin is limited by the central processing unit (CPU) and can be improved by using the parallel processing power of a graphics processing unit (GPU). Here, we compare the performance speed of GPU-based DTS software to that of the current CPU-based software as a function of the resolution of the patient graphics. Results show a tremendous improvement in speed using the GPU. While an increase in the spatial resolution of the patient graphics resulted in slowing down the computational speed of the DTS on the CPU, the speed of the GPU-based DTS was hardly affected. This GPU-based DTS can be a powerful tool for providing accurate, real-time feedback about patient skin-dose to physicians while performing interventional procedures. PMID:24027616
      

      
      Design and implementation of a UNIX based distributed computing system
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Love, J.S.; Michael, M.W.
         1994-12-31
         We have designed, implemented, and are running a corporate-wide distributed processing batch queue on a large number of networked workstations using the UNIX{reg_sign} operating system. Atlas Wireline researchers and scientists have used the system for over a year. The large increase in available computer power has greatly reduced the time required for nuclear and electromagnetic tool modeling. Use of remote distributed computing has simultaneously reduced computation costs and increased usable computer time. The system integrates equipment from different manufacturers, using various CPU architectures, distinct operating system revisions, and even multiple processors per machine. Various differences between the machines have tomore » be accounted for in the master scheduler. These differences include shells, command sets, swap spaces, memory sizes, CPU sizes, and OS revision levels. Remote processing across a network must be performed in a manner that is seamless from the users` perspective. The system currently uses IBM RISC System/6000{reg_sign}, SPARCstation{sup TM}, HP9000s700, HP9000s800, and DEC Alpha AXP{sup TM} machines. Each CPU in the network has its own speed rating, allowed working hours, and workload parameters. The system if designed so that all of the computers in the network can be optimally scheduled without adversely impacting the primary users of the machines. The increase in the total usable computational capacity by means of distributed batch computing can change corporate computing strategy. The integration of disparate computer platforms eliminates the need to buy one type of computer for computations, another for graphics, and yet another for day-to-day operations. It might be possible, for example, to meet all research and engineering computing needs with existing networked computers.« less
      

      
      Real-time unmanned aircraft systems surveillance video mosaicking using GPU
      NASA Astrophysics Data System (ADS)
      Camargo, Aldo; Anderson, Kyle; Wang, Yi; Schultz, Richard R.; Fevig, Ronald A.
         2010-04-01
         Digital video mosaicking from Unmanned Aircraft Systems (UAS) is being used for many military and civilian applications, including surveillance, target recognition, border protection, forest fire monitoring, traffic control on highways, monitoring of transmission lines, among others. Additionally, NASA is using digital video mosaicking to explore the moon and planets such as Mars. In order to compute a "good" mosaic from video captured by a UAS, the algorithm must deal with motion blur, frame-to-frame jitter associated with an imperfectly stabilized platform, perspective changes as the camera tilts in flight, as well as a number of other factors. The most suitable algorithms use SIFT (Scale-Invariant Feature Transform) to detect the features consistent between video frames. Utilizing these features, the next step is to estimate the homography between two consecutives video frames, perform warping to properly register the image data, and finally blend the video frames resulting in a seamless video mosaick. All this processing takes a great deal of resources of resources from the CPU, so it is almost impossible to compute a real time video mosaic on a single processor. Modern graphics processing units (GPUs) offer computational performance that far exceeds current CPU technology, allowing for real-time operation. This paper presents the development of a GPU-accelerated digital video mosaicking implementation and compares it with CPU performance. Our tests are based on two sets of real video captured by a small UAS aircraft; one video comes from Infrared (IR) and Electro-Optical (EO) cameras. Our results show that we can obtain a speed-up of more than 50 times using GPU technology, so real-time operation at a video capture of 30 frames per second is feasible.
      

      
      Use of a graphics processing unit (GPU) to facilitate real-time 3D graphic presentation of the patient skin-dose distribution during fluoroscopic interventional procedures.
      PubMed
      Rana, Vijay; Rudin, Stephen; Bednarek, Daniel R
         2012-02-23
         We have developed a dose-tracking system (DTS) that calculates the radiation dose to the patient's skin in real-time by acquiring exposure parameters and imaging-system-geometry from the digital bus on a Toshiba Infinix C-arm unit. The cumulative dose values are then displayed as a color map on an OpenGL-based 3D graphic of the patient for immediate feedback to the interventionalist. Determination of those elements on the surface of the patient 3D-graphic that intersect the beam and calculation of the dose for these elements in real time demands fast computation. Reducing the size of the elements results in more computation load on the computer processor and therefore a tradeoff occurs between the resolution of the patient graphic and the real-time performance of the DTS. The speed of the DTS for calculating dose to the skin is limited by the central processing unit (CPU) and can be improved by using the parallel processing power of a graphics processing unit (GPU). Here, we compare the performance speed of GPU-based DTS software to that of the current CPU-based software as a function of the resolution of the patient graphics. Results show a tremendous improvement in speed using the GPU. While an increase in the spatial resolution of the patient graphics resulted in slowing down the computational speed of the DTS on the CPU, the speed of the GPU-based DTS was hardly affected. This GPU-based DTS can be a powerful tool for providing accurate, real-time feedback about patient skin-dose to physicians while performing interventional procedures.
      

      
      Implementation of a cone-beam backprojection algorithm on the cell broadband engine processor
      NASA Astrophysics Data System (ADS)
      Bockenbach, Olivier; Knaup, Michael; Kachelrieß, Marc
         2007-03-01
         Tomographic image reconstruction is computationally very demanding. In all cases the backprojection represents the performance bottleneck due to the high operational count and due to the high demand put on the memory subsystem. In the past, solving this problem has lead to the implementation of specific architectures, connecting Application Specific Integrated Circuits (ASICs) or Field Programmable Gate Arrays (FPGAs) to memory through dedicated high speed busses. More recently, there have also been attempt to use Graphic Processing Units (GPUs) to perform the backprojection step. Originally aimed at the gaming market, IBM, Toshiba and Sony have introduced the Cell Broadband Engine (CBE) processor, often considered as a multicomputer on a chip. Clocked at 3 GHz, the Cell allows for a theoretical performance of 192 GFlops and a peak data transfer rate over the internal bus of 200 GB/s. This performance indeed makes the Cell a very attractive architecture for implementing tomographic image reconstruction algorithms. In this study, we investigate the relative performance of a perspective backprojection algorithm when implemented on a standard PC and on the Cell processor. We compare these results to the performance achievable with FPGAs based boards and high end GPUs. The cone-beam backprojection performance was assessed by backprojecting a full circle scan of 512 projections of 1024x1024 pixels into a volume of size 512x512x512 voxels. It took 3.2 minutes on the PC (single CPU) and is as fast as 13.6 seconds on the Cell.
      

      
      The growth of the UniTree mass storage system at the NASA Center for Computational Sciences
      NASA Technical Reports Server (NTRS)
      Tarshish, Adina; Salmon, Ellen
         1993-01-01
         In October 1992, the NASA Center for Computational Sciences made its Convex-based UniTree system generally available to users. The ensuing months saw the growth of near-online data from nil to nearly three terabytes, a doubling of the number of CPU's on the facility's Cray YMP (the primary data source for UniTree), and the necessity for an aggressive regimen for repacking sparse tapes and hierarchical 'vaulting' of old files to freestanding tape. Connectivity was enhanced as well with the addition of UltraNet HiPPI. This paper describes the increasing demands placed on the storage system's performance and throughput that resulted from the significant augmentation of compute-server processor power and network speed.
      

      
      Instrumentation & Data Acquisition System (D AS) Engineer
      NASA Technical Reports Server (NTRS)
      Jackson, Markus Deon
         2015-01-01
         The primary job of an Instrumentation and Data Acquisition System (DAS) Engineer is to properly measure physical phenomenon of hardware using appropriate instrumentation and DAS equipment designed to record data during a specified test of the hardware. A DAS system includes a CPU or processor, a data storage device such as a hard drive, a data communication bus such as Universal Serial Bus, software to control the DAS system processes like calibrations, recording of data and processing of data. It also includes signal conditioning amplifiers, and certain sensors for specified measurements. My internship responsibilities have included testing and adjusting Pacific Instruments Model 9355 signal conditioning amplifiers, writing and performing checkout procedures, writing and performing calibration procedures while learning the basics of instrumentation.
      

      
      A Unix SVR-4-OS9 distributed data acquisition for high energy physics
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Drouhin, F.; Schwaller, B.; Fontaine, J.C.
         1998-08-01
         The distributed data acquisition (DAQ) system developed by the GRPHE (Groupe de Recherche en Physique des Hautes Energies) group is a combination of hardware and software dedicated to high energy physics. The system described here is used in the beam tests of the CMs tracker. The central processor of the system is a RISC CPU hosted in a VME card, running a POSIX compliant UNIX system. Specialized real-time OS9 VME cards perform the instrumentation control. The main data flow goes over a deterministic high speed network. The Unix system manages a list of OS9 front-end systems with a synchronization protocolmore » running over a TCP/IP layer.« less
      

      
      Portable LQCD Monte Carlo code using OpenACC
      NASA Astrophysics Data System (ADS)
      Bonati, Claudio; Calore, Enrico; Coscetti, Simone; D'Elia, Massimo; Mesiti, Michele; Negro, Francesco; Fabio Schifano, Sebastiano; Silvi, Giorgio; Tripiccione, Raffaele
         2018-03-01
         Varying from multi-core CPU processors to many-core GPUs, the present scenario of HPC architectures is extremely heterogeneous. In this context, code portability is increasingly important for easy maintainability of applications; this is relevant in scientific computing where code changes are numerous and frequent. In this talk we present the design and optimization of a state-of-the-art production level LQCD Monte Carlo application, using the OpenACC directives model. OpenACC aims to abstract parallel programming to a descriptive level, where programmers do not need to specify the mapping of the code on the target machine. We describe the OpenACC implementation and show that the same code is able to target different architectures, including state-of-the-art CPUs and GPUs.
      

      
      Computational algorithms for simulations in atmospheric optics.
      PubMed
      Konyaev, P A; Lukin, V P
         2016-04-20
         A computer simulation technique for atmospheric and adaptive optics based on parallel programing is discussed. A parallel propagation algorithm is designed and a modified spectral-phase method for computer generation of 2D time-variant random fields is developed. Temporal power spectra of Laguerre-Gaussian beam fluctuations are considered as an example to illustrate the applications discussed. Implementation of the proposed algorithms using Intel MKL and IPP libraries and NVIDIA CUDA technology is shown to be very fast and accurate. The hardware system for the computer simulation is an off-the-shelf desktop with an Intel Core i7-4790K CPU operating at a turbo-speed frequency up to 5 GHz and an NVIDIA GeForce GTX-960 graphics accelerator with 1024 1.5 GHz processors.
      

      
      Accelerating Fibre Orientation Estimation from Diffusion Weighted Magnetic Resonance Imaging Using GPUs
      PubMed Central
      Hernández, Moisés; Guerrero, Ginés D.; Cecilia, José M.; García, José M.; Inuggi, Alberto; Jbabdi, Saad; Behrens, Timothy E. J.; Sotiropoulos, Stamatios N.
         2013-01-01
         With the performance of central processing units (CPUs) having effectively reached a limit, parallel processing offers an alternative for applications with high computational demands. Modern graphics processing units (GPUs) are massively parallel processors that can execute simultaneously thousands of light-weight processes. In this study, we propose and implement a parallel GPU-based design of a popular method that is used for the analysis of brain magnetic resonance imaging (MRI). More specifically, we are concerned with a model-based approach for extracting tissue structural information from diffusion-weighted (DW) MRI data. DW-MRI offers, through tractography approaches, the only way to study brain structural connectivity, non-invasively and in-vivo. We parallelise the Bayesian inference framework for the ball & stick model, as it is implemented in the tractography toolbox of the popular FSL software package (University of Oxford). For our implementation, we utilise the Compute Unified Device Architecture (CUDA) programming model. We show that the parameter estimation, performed through Markov Chain Monte Carlo (MCMC), is accelerated by at least two orders of magnitude, when comparing a single GPU with the respective sequential single-core CPU version. We also illustrate similar speed-up factors (up to 120x) when comparing a multi-GPU with a multi-CPU implementation. PMID:23658616
      

      
      Three-Dimensional Nacelle Aeroacoustics Code With Application to Impedance Education
      NASA Technical Reports Server (NTRS)
      Watson, Willie R.
         2000-01-01
         A three-dimensional nacelle acoustics code that accounts for uniform mean flow and variable surface impedance liners is developed. The code is linked to a commercial version of the NASA-developed General Purpose Solver (for solution of linear systems of equations) in order to obtain the capability to study high frequency waves that may require millions of grid points for resolution. Detailed, single-processor statistics for the performance of the solver in rigid and soft-wall ducts are presented. Over the range of frequencies of current interest in nacelle liner research, noise attenuation levels predicted from the code were in excellent agreement with those predicted from mode theory. The equation solver is memory efficient, requiring only a small fraction of the memory available on modern computers. As an application, the code is combined with an optimization algorithm and used to reduce the impedance spectrum of a ceramic liner. The primary problem with using the code to perform optimization studies at frequencies above I1kHz is the excessive CPU time (a major portion of which is matrix assembly). The research recommends that research be directed toward development of a rapid sparse assembler and exploitation of the multiprocessor capability of the solver to further reduce CPU time.
      

      
      Optimization of the coherence function estimation for multi-core central processing unit
      NASA Astrophysics Data System (ADS)
      Cheremnov, A. G.; Faerman, V. A.; Avramchuk, V. S.
         2017-02-01
         The paper considers use of parallel processing on multi-core central processing unit for optimization of the coherence function evaluation arising in digital signal processing. Coherence function along with other methods of spectral analysis is commonly used for vibration diagnosis of rotating machinery and its particular nodes. An algorithm is given for the function evaluation for signals represented with digital samples. The algorithm is analyzed for its software implementation and computational problems. Optimization measures are described, including algorithmic, architecture and compiler optimization, their results are assessed for multi-core processors from different manufacturers. Thus, speeding-up of the parallel execution with respect to sequential execution was studied and results are presented for Intel Core i7-4720HQ и AMD FX-9590 processors. The results show comparatively high efficiency of the optimization measures taken. In particular, acceleration indicators and average CPU utilization have been significantly improved, showing high degree of parallelism of the constructed calculating functions. The developed software underwent state registration and will be used as a part of a software and hardware solution for rotating machinery fault diagnosis and pipeline leak location with acoustic correlation method.
      

      
      High performance 3D adaptive filtering for DSP based portable medical imaging systems
      NASA Astrophysics Data System (ADS)
      Bockenbach, Olivier; Ali, Murtaza; Wainwright, Ian; Nadeski, Mark
         2015-03-01
         Portable medical imaging devices have proven valuable for emergency medical services both in the field and hospital environments and are becoming more prevalent in clinical settings where the use of larger imaging machines is impractical. Despite their constraints on power, size and cost, portable imaging devices must still deliver high quality images. 3D adaptive filtering is one of the most advanced techniques aimed at noise reduction and feature enhancement, but is computationally very demanding and hence often cannot be run with sufficient performance on a portable platform. In recent years, advanced multicore digital signal processors (DSP) have been developed that attain high processing performance while maintaining low levels of power dissipation. These processors enable the implementation of complex algorithms on a portable platform. In this study, the performance of a 3D adaptive filtering algorithm on a DSP is investigated. The performance is assessed by filtering a volume of size 512x256x128 voxels sampled at a pace of 10 MVoxels/sec with an Ultrasound 3D probe. Relative performance and power is addressed between a reference PC (Quad Core CPU) and a TMS320C6678 DSP from Texas Instruments.
      

        
       
          

«

16
      17
      18
   19
      20
      »

          
        

     

   

   
       
            
              
          

«

17
      18
      19
   20
      21
      »

          
        

           
           
             
               
      
      Dynamic Load-Balancing for Distributed Heterogeneous Computing of Parallel CFD Problems
      NASA Technical Reports Server (NTRS)
      Ecer, A.; Chien, Y. P.; Boenisch, T.; Akay, H. U.
         2000-01-01
         The developed methodology is aimed at improving the efficiency of executing block-structured algorithms on parallel, distributed, heterogeneous computers. The basic approach of these algorithms is to divide the flow domain into many sub- domains called blocks, and solve the governing equations over these blocks. Dynamic load balancing problem is defined as the efficient distribution of the blocks among the available processors over a period of several hours of computations. In environments with computers of different architecture, operating systems, CPU speed, memory size, load, and network speed, balancing the loads and managing the communication between processors becomes crucial. Load balancing software tools for mutually dependent parallel processes have been created to efficiently utilize an advanced computation environment and algorithms. These tools are dynamic in nature because of the chances in the computer environment during execution time. More recently, these tools were extended to a second operating system: NT. In this paper, the problems associated with this application will be discussed. Also, the developed algorithms were combined with the load sharing capability of LSF to efficiently utilize workstation clusters for parallel computing. Finally, results will be presented on running a NASA based code ADPAC to demonstrate the developed tools for dynamic load balancing.
      

      
      Kmerind: A Flexible Parallel Library for K-mer Indexing of Biological Sequences on Distributed Memory Systems.
      PubMed
      Pan, Tony; Flick, Patrick; Jain, Chirag; Liu, Yongchao; Aluru, Srinivas
         2017-10-09
         Counting and indexing fixed length substrings, or k-mers, in biological sequences is a key step in many bioinformatics tasks including genome alignment and mapping, genome assembly, and error correction. While advances in next generation sequencing technologies have dramatically reduced the cost and improved latency and throughput, few bioinformatics tools can efficiently process the datasets at the current generation rate of 1.8 terabases every 3 days. We present Kmerind, a high performance parallel k-mer indexing library for distributed memory environments. The Kmerind library provides a set of simple and consistent APIs with sequential semantics and parallel implementations that are designed to be flexible and extensible. Kmerind's k-mer counter performs similarly or better than the best existing k-mer counting tools even on shared memory systems. In a distributed memory environment, Kmerind counts k-mers in a 120 GB sequence read dataset in less than 13 seconds on 1024 Xeon CPU cores, and fully indexes their positions in approximately 17 seconds. Querying for 1% of the k-mers in these indices can be completed in 0.23 seconds and 28 seconds, respectively. Kmerind is the first k-mer indexing library for distributed memory environments, and the first extensible library for general k-mer indexing and counting. Kmerind is available at https://github.com/ParBLiSS/kmerind.
      

      
      Comparison of multihardware parallel implementations for a phase unwrapping algorithm
      NASA Astrophysics Data System (ADS)
      Hernandez-Lopez, Francisco Javier; Rivera, Mariano; Salazar-Garibay, Adan; Legarda-Sáenz, Ricardo
         2018-04-01
         Phase unwrapping is an important problem in the areas of optical metrology, synthetic aperture radar (SAR) image analysis, and magnetic resonance imaging (MRI) analysis. These images are becoming larger in size and, particularly, the availability and need for processing of SAR and MRI data have increased significantly with the acquisition of remote sensing data and the popularization of magnetic resonators in clinical diagnosis. Therefore, it is important to develop faster and accurate phase unwrapping algorithms. We propose a parallel multigrid algorithm of a phase unwrapping method named accumulation of residual maps, which builds on a serial algorithm that consists of the minimization of a cost function; minimization achieved by means of a serial Gauss-Seidel kind algorithm. Our algorithm also optimizes the original cost function, but unlike the original work, our algorithm is a parallel Jacobi class with alternated minimizations. This strategy is known as the chessboard type, where red pixels can be updated in parallel at same iteration since they are independent. Similarly, black pixels can be updated in parallel in an alternating iteration. We present parallel implementations of our algorithm for different parallel multicore architecture such as CPU-multicore, Xeon Phi coprocessor, and Nvidia graphics processing unit. In all the cases, we obtain a superior performance of our parallel algorithm when compared with the original serial version. In addition, we present a detailed comparative performance of the developed parallel versions.
      

      
      Using the GeoFEST Faulted Region Simulation System
      NASA Technical Reports Server (NTRS)
      Parker, Jay W.; Lyzenga, Gregory A.; Donnellan, Andrea; Judd, Michele A.; Norton, Charles D.; Baker, Teresa; Tisdale, Edwin R.; Li, Peggy
         2004-01-01
         GeoFEST (the Geophysical Finite Element Simulation Tool) simulates stress evolution, fault slip and plastic/elastic processes in realistic materials, and so is suitable for earthquake cycle studies in regions such as Southern California. Many new capabilities and means of access for GeoFEST are now supported. New abilities include MPI-based cluster parallel computing using automatic PYRAMID/Parmetis-based mesh partitioning, automatic mesh generation for layered media with rectangular faults, and results visualization that is integrated with remote sensing data. The parallel GeoFEST application has been successfully run on over a half-dozen computers, including Intel Xeon clusters, Itanium II and Altix machines, and the Apple G5 cluster. It is not separately optimized for different machines, but relies on good domain partitioning for load-balance and low communication, and careful writing of the parallel diagonally preconditioned conjugate gradient solver to keep communication overhead low. Demonstrated thousand-step solutions for over a million finite elements on 64 processors require under three hours, and scaling tests show high efficiency when using more than (order of) 4000 elements per processor. The source code and documentation for GeoFEST is available at no cost from Open Channel Foundation. In addition GeoFEST may be used through a browser-based portal environment available to approved users. That environment includes semi-automated geometry creation and mesh generation tools, GeoFEST, and RIVA-based visualization tools that include the ability to generate a flyover animation showing deformations and topography. Work is in progress to support simulation of a region with several faults using 16 million elements, using a strain energy metric to adapt the mesh to faithfully represent the solution in a region of widely varying strain.
      

      
      SoAx: A generic C++ Structure of Arrays for handling particles in HPC codes
      NASA Astrophysics Data System (ADS)
      Homann, Holger; Laenen, Francois
         2018-03-01
         The numerical study of physical problems often require integrating the dynamics of a large number of particles evolving according to a given set of equations. Particles are characterized by the information they are carrying such as an identity, a position other. There are generally speaking two different possibilities for handling particles in high performance computing (HPC) codes. The concept of an Array of Structures (AoS) is in the spirit of the object-oriented programming (OOP) paradigm in that the particle information is implemented as a structure. Here, an object (realization of the structure) represents one particle and a set of many particles is stored in an array. In contrast, using the concept of a Structure of Arrays (SoA), a single structure holds several arrays each representing one property (such as the identity) of the whole set of particles. The AoS approach is often implemented in HPC codes due to its handiness and flexibility. For a class of problems, however, it is known that the performance of SoA is much better than that of AoS. We confirm this observation for our particle problem. Using a benchmark we show that on modern Intel Xeon processors the SoA implementation is typically several times faster than the AoS one. On Intel's MIC co-processors the performance gap even attains a factor of ten. The same is true for GPU computing, using both computational and multi-purpose GPUs. Combining performance and handiness, we present the library SoAx that has optimal performance (on CPUs, MICs, and GPUs) while providing the same handiness as AoS. For this, SoAx uses modern C++ design techniques such template meta programming that allows to automatically generate code for user defined heterogeneous data structures.
      

      
      Parasail: SIMD C library for global, semi-global, and local pairwise sequence alignments.
      PubMed
      Daily, Jeff
         2016-02-10
         Sequence alignment algorithms are a key component of many bioinformatics applications. Though various fast Smith-Waterman local sequence alignment implementations have been developed for x86 CPUs, most are embedded into larger database search tools. In addition, fast implementations of Needleman-Wunsch global sequence alignment and its semi-global variants are not as widespread. This article presents the first software library for local, global, and semi-global pairwise intra-sequence alignments and improves the performance of previous intra-sequence implementations. A faster intra-sequence local pairwise alignment implementation is described and benchmarked, including new global and semi-global variants. Using a 375 residue query sequence a speed of 136 billion cell updates per second (GCUPS) was achieved on a dual Intel Xeon E5-2670 24-core processor system, the highest reported for an implementation based on Farrar's 'striped' approach. Rognes's SWIPE optimal database search application is still generally the fastest available at 1.2 to at best 2.4 times faster than Parasail for sequences shorter than 500 amino acids. However, Parasail was faster for longer sequences. For global alignments, Parasail's prefix scan implementation is generally the fastest, faster even than Farrar's 'striped' approach, however the opal library is faster for single-threaded applications. The software library is designed for 64 bit Linux, OS X, or Windows on processors with SSE2, SSE41, or AVX2. Source code is available from https://github.com/jeffdaily/parasail under the Battelle BSD-style license. Applications that require optimal alignment scores could benefit from the improved performance. For the first time, SIMD global, semi-global, and local alignments are available in a stand-alone C library.
      

      
      Real-time estimation of lesion depth and control of radiofrequency ablation within ex vivo animal tissues using a neural network.
      PubMed
      Wang, Yearnchee Curtis; Chan, Terence Chee-Hung; Sahakian, Alan Varteres
         2018-01-04
         Radiofrequency ablation (RFA), a method of inducing thermal ablation (cell death), is often used to destroy tumours or potentially cancerous tissue. Current techniques for RFA estimation (electrical impedance tomography, Nakagami ultrasound, etc.) require long compute times (≥ 2 s) and measurement devices other than the RFA device. This study aims to determine if a neural network (NN) can estimate ablation lesion depth for control of bipolar RFA using complex electrical impedance - since tissue electrical conductivity varies as a function of tissue temperature - in real time using only the RFA therapy device's electrodes. Three-dimensional, cubic models comprised of beef liver, pork loin or pork belly represented target tissue. Temperature and complex electrical impedance from 72 data generation ablations in pork loin and belly were used for training the NN (403 s on Xeon processor). NN inputs were inquiry depth, starting complex impedance and current complex impedance. Training-validation-test splits were 70%-0%-30% and 80%-10%-10% (overfit test). Once the NN-estimated lesion depth for a margin reached the target lesion depth, RFA was stopped for that margin of tissue. The NN trained to 93% accuracy and an NN-integrated control ablated tissue to within 1.0 mm of the target lesion depth on average. Full 15-mm depth maps were calculated in 0.2 s on a single-core ARMv7 processor. The results show that a NN could make lesion depth estimations in real-time using less in situ devices than current techniques. With the NN-based technique, physicians could deliver quicker and more precise ablation therapy.
      

      
      An efficient and portable SIMD algorithm for charge/current deposition in Particle-In-Cell codes
      NASA Astrophysics Data System (ADS)
      Vincenti, H.; Lobet, M.; Lehe, R.; Sasanka, R.; Vay, J.-L.
         2017-01-01
         In current computer architectures, data movement (from die to network) is by far the most energy consuming part of an algorithm (≈ 20 pJ/word on-die to ≈10,000 pJ/word on the network). To increase memory locality at the hardware level and reduce energy consumption related to data movement, future exascale computers tend to use many-core processors on each compute nodes that will have a reduced clock speed to allow for efficient cooling. To compensate for frequency decrease, machine vendors are making use of long SIMD instruction registers that are able to process multiple data with one arithmetic operator in one clock cycle. SIMD register length is expected to double every four years. As a consequence, Particle-In-Cell (PIC) codes will have to achieve good vectorization to fully take advantage of these upcoming architectures. In this paper, we present a new algorithm that allows for efficient and portable SIMD vectorization of current/charge deposition routines that are, along with the field gathering routines, among the most time consuming parts of the PIC algorithm. Our new algorithm uses a particular data structure that takes into account memory alignment constraints and avoids gather/scatter instructions that can significantly affect vectorization performances on current CPUs. The new algorithm was successfully implemented in the 3D skeleton PIC code PICSAR and tested on Haswell Xeon processors (AVX2-256 bits wide data registers). Results show a factor of × 2 to × 2.5 speed-up in double precision for particle shape factor of orders 1- 3. The new algorithm can be applied as is on future KNL (Knights Landing) architectures that will include AVX-512 instruction sets with 512 bits register lengths (8 doubles/16 singles).
      

      
      TU-AB-BRC-10: Modeling of Radiotherapy Linac Source Terms Using ARCHER Monte Carlo Code: Performance Comparison of GPU and MIC Computing Accelerators
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Liu, T; Lin, H; Xu, X
         
         Purpose: (1) To perform phase space (PS) based source modeling for Tomotherapy and Varian TrueBeam 6 MV Linacs, (2) to examine the accuracy and performance of the ARCHER Monte Carlo code on a heterogeneous computing platform with Many Integrated Core coprocessors (MIC, aka Xeon Phi) and GPUs, and (3) to explore the software micro-optimization methods. Methods: The patient-specific source of Tomotherapy and Varian TrueBeam Linacs was modeled using the PS approach. For the helical Tomotherapy case, the PS data were calculated in our previous study (Su et al. 2014 41(7) Medical Physics). For the single-view Varian TrueBeam case, we analyticallymore » derived them from the raw patient-independent PS data in IAEA’s database, partial geometry information of the jaw and MLC as well as the fluence map. The phantom was generated from DICOM images. The Monte Carlo simulation was performed by ARCHER-MIC and GPU codes, which were benchmarked against a modified parallel DPM code. Software micro-optimization was systematically conducted, and was focused on SIMD vectorization of tight for-loops and data prefetch, with the ultimate goal of increasing 512-bit register utilization and reducing memory access latency. Results: Dose calculation was performed for two clinical cases, a Tomotherapy-based prostate cancer treatment and a TrueBeam-based left breast treatment. ARCHER was verified against the DPM code. The statistical uncertainty of the dose to the PTV was less than 1%. Using double-precision, the total wall time of the multithreaded CPU code on a X5650 CPU was 339 seconds for the Tomotherapy case and 131 seconds for the TrueBeam, while on 3 5110P MICs it was reduced to 79 and 59 seconds, respectively. The single-precision GPU code on a K40 GPU took 45 seconds for the Tomotherapy dose calculation. Conclusion: We have extended ARCHER, the MIC and GPU-based Monte Carlo dose engine to Tomotherapy and Truebeam dose calculations.« less
      

      
      Measurements of the LHCb software stack on the ARM architecture
      NASA Astrophysics Data System (ADS)
      Vijay Kartik, S.; Couturier, Ben; Clemencic, Marco; Neufeld, Niko
         2014-06-01
         The ARM architecture is a power-efficient design that is used in most processors in mobile devices all around the world today since they provide reasonable compute performance per watt. The current LHCb software stack is designed (and thus expected) to build and run on machines with the x86/x86_64 architecture. This paper outlines the process of measuring the performance of the LHCb software stack on the ARM architecture - specifically, the ARMv7 architecture on Cortex-A9 processors from NVIDIA and on full-fledged ARM servers with chipsets from Calxeda - and makes comparisons with the performance on x86_64 architectures on the Intel Xeon L5520/X5650 and AMD Opteron 6272. The paper emphasises the aspects of performance per core with respect to the power drawn by the compute nodes for the given performance - this ensures a fair real-world comparison with much more 'powerful' Intel/AMD processors. The comparisons of these real workloads in the context of LHCb are also complemented with the standard synthetic benchmarks HEPSPEC and Coremark. The pitfalls and solutions for the non-trivial task of porting the source code to build for the ARMv7 instruction set are presented. The specific changes in the build process needed for ARM-specific portions of the software stack are described, to serve as pointers for further attempts taken up by other groups in this direction. Cases where architecture-specific tweaks at the assembler lever (both in ROOT and the LHCb software stack) were needed for a successful compile are detailed - these cases are good indicators of where/how the software stack as well as the build system can be made more portable and multi-arch friendly. The experience gained from the tasks described in this paper are intended to i) assist in making an informed choice about ARM-based server solutions as a feasible low-power alternative to the current compute nodes, and ii) revisit the software design and build system for portability and generic improvements.
      

      
      Spherical demons: fast diffeomorphic landmark-free surface registration.
      PubMed
      Yeo, B T Thomas; Sabuncu, Mert R; Vercauteren, Tom; Ayache, Nicholas; Fischl, Bruce; Golland, Polina
         2010-03-01
         We present the Spherical Demons algorithm for registering two spherical images. By exploiting spherical vector spline interpolation theory, we show that a large class of regularizors for the modified Demons objective function can be efficiently approximated on the sphere using iterative smoothing. Based on one parameter subgroups of diffeomorphisms, the resulting registration is diffeomorphic and fast. The Spherical Demons algorithm can also be modified to register a given spherical image to a probabilistic atlas. We demonstrate two variants of the algorithm corresponding to warping the atlas or warping the subject. Registration of a cortical surface mesh to an atlas mesh, both with more than 160 k nodes requires less than 5 min when warping the atlas and less than 3 min when warping the subject on a Xeon 3.2 GHz single processor machine. This is comparable to the fastest nondiffeomorphic landmark-free surface registration algorithms. Furthermore, the accuracy of our method compares favorably to the popular FreeSurfer registration algorithm. We validate the technique in two different applications that use registration to transfer segmentation labels onto a new image 1) parcellation of in vivo cortical surfaces and 2) Brodmann area localization in ex vivo cortical surfaces.
      

      
      Parallel Application Performance on Two Generations of Intel Xeon HPC Platforms
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Chang, Christopher H.; Long, Hai; Sides, Scott
         2015-10-15
         Two next-generation node configurations hosting the Haswell microarchitecture were tested with a suite of microbenchmarks and application examples, and compared with a current Ivy Bridge production node on NREL" tm s Peregrine high-performance computing cluster. A primary conclusion from this study is that the additional cores are of little value to individual task performance--limitations to application parallelism, or resource contention among concurrently running but independent tasks, limits effective utilization of these added cores. Hyperthreading generally impacts throughput negatively, but can improve performance in the absence of detailed attention to runtime workflow configuration. The observations offer some guidance to procurement ofmore » future HPC systems at NREL. First, raw core count must be balanced with available resources, particularly memory bandwidth. Balance-of-system will determine value more than processor capability alone. Second, hyperthreading continues to be largely irrelevant to the workloads that are commonly seen, and were tested here, at NREL. Finally, perhaps the most impactful enhancement to productivity might occur through enabling multiple concurrent jobs per node. Given the right type and size of workload, more may be achieved by doing many slow things at once, than fast things in order.« less
      

      
      Spherical Demons: Fast Diffeomorphic Landmark-Free Surface Registration
      PubMed Central
      Yeo, B.T. Thomas; Sabuncu, Mert R.; Vercauteren, Tom; Ayache, Nicholas; Fischl, Bruce; Golland, Polina
         2010-01-01
         We present the Spherical Demons algorithm for registering two spherical images. By exploiting spherical vector spline interpolation theory, we show that a large class of regularizors for the modified Demons objective function can be efficiently approximated on the sphere using iterative smoothing. Based on one parameter subgroups of diffeomorphisms, the resulting registration is diffeomorphic and fast. The Spherical Demons algorithm can also be modified to register a given spherical image to a probabilistic atlas. We demonstrate two variants of the algorithm corresponding to warping the atlas or warping the subject. Registration of a cortical surface mesh to an atlas mesh, both with more than 160k nodes requires less than 5 minutes when warping the atlas and less than 3 minutes when warping the subject on a Xeon 3.2GHz single processor machine. This is comparable to the fastest non-diffeomorphic landmark-free surface registration algorithms. Furthermore, the accuracy of our method compares favorably to the popular FreeSurfer registration algorithm. We validate the technique in two different applications that use registration to transfer segmentation labels onto a new image: (1) parcellation of in-vivo cortical surfaces and (2) Brodmann area localization in ex-vivo cortical surfaces. PMID:19709963
      

      
      Comparison of Monte Carlo simulated and measured performance parameters of miniPET scanner
      NASA Astrophysics Data System (ADS)
      Kis, S. A.; Emri, M.; Opposits, G.; Bükki, T.; Valastyán, I.; Hegyesi, Gy.; Imrek, J.; Kalinka, G.; Molnár, J.; Novák, D.; Végh, J.; Kerek, A.; Trón, L.; Balkay, L.
         2007-02-01
         In vivo imaging of small laboratory animals is a valuable tool in the development of new drugs. For this purpose, miniPET, an easy to scale modular small animal PET camera has been developed at our institutes. The system has four modules, which makes it possible to rotate the whole detector system around the axis of the field of view. Data collection and image reconstruction are performed using a data acquisition (DAQ) module with Ethernet communication facility and a computer cluster of commercial PCs. Performance tests were carried out to determine system parameters, such as energy resolution, sensitivity and noise equivalent count rate. A modified GEANT4-based GATE Monte Carlo software package was used to simulate PET data analogous to those of the performance measurements. GATE was run on a Linux cluster of 10 processors (64 bit, Xeon with 3.0 GHz) and controlled by a SUN grid engine. The application of this special computer cluster reduced the time necessary for the simulations by an order of magnitude. The simulated energy spectra, maximum rate of true coincidences and sensitivity of the camera were in good agreement with the measured parameters.
      

      
      A Survey of Techniques for Approximate Computing
      DOE PAGES
      Mittal, Sparsh
         2016-03-18
         Approximate computing trades off computation quality with the effort expended and as rising performance demands confront with plateauing resource budgets, approximate computing has become, not merely attractive, but even imperative. Here, we present a survey of techniques for approximate computing (AC). We discuss strategies for finding approximable program portions and monitoring output quality, techniques for using AC in different processing units (e.g., CPU, GPU and FPGA), processor components, memory technologies etc., and programming frameworks for AC. Moreover, we classify these techniques based on several key characteristics to emphasize their similarities and differences. Finally, the aim of this paper is tomore » provide insights to researchers into working of AC techniques and inspire more efforts in this area to make AC the mainstream computing approach in future systems.« less
      

      
      Stochastic first passage time accelerated with CUDA
      NASA Astrophysics Data System (ADS)
      Pierro, Vincenzo; Troiano, Luigi; Mejuto, Elena; Filatrella, Giovanni
         2018-05-01
         The numerical integration of stochastic trajectories to estimate the time to pass a threshold is an interesting physical quantity, for instance in Josephson junctions and atomic force microscopy, where the full trajectory is not accessible. We propose an algorithm suitable for efficient implementation on graphical processing unit in CUDA environment. The proposed approach for well balanced loads achieves almost perfect scaling with the number of available threads and processors, and allows an acceleration of about 400× with a GPU GTX980 respect to standard multicore CPU. This method allows with off the shell GPU to challenge problems that are otherwise prohibitive, as thermal activation in slowly tilted potentials. In particular, we demonstrate that it is possible to simulate the switching currents distributions of Josephson junctions in the timescale of actual experiments.
      

      
      [A novel biologic electricity signal measurement based on neuron chip].
      PubMed
      Lei, Yinsheng; Wang, Mingshi; Sun, Tongjing; Zhu, Qiang; Qin, Ran
         2006-06-01
         Neuron chip is a multiprocessor with three pipeline CPU; its communication protocol and control processor are integrated in effect to carry out the function of communication, control, attemper, I/O, etc. A novel biologic electronic signal measurement network system is composed of intelligent measurement nodes with neuron chip at the core. In this study, the electronic signals such as ECG, EEG, EMG and BOS can be synthetically measured by those intelligent nodes, and some valuable diagnostic messages are found. Wavelet transform is employed in this system to analyze various biologic electronic signals due to its strong time-frequency ability of decomposing signal local character. Better effect is gained. This paper introduces the hardware structure of network and intelligent measurement node, the measurement theory and the signal figure of data acquisition and processing.
      

      
      High-performance image reconstruction in fluorescence tomography on desktop computers and graphics hardware.
      PubMed
      Freiberger, Manuel; Egger, Herbert; Liebmann, Manfred; Scharfetter, Hermann
         2011-11-01
         Image reconstruction in fluorescence optical tomography is a three-dimensional nonlinear ill-posed problem governed by a system of partial differential equations. In this paper we demonstrate that a combination of state of the art numerical algorithms and a careful hardware optimized implementation allows to solve this large-scale inverse problem in a few seconds on standard desktop PCs with modern graphics hardware. In particular, we present methods to solve not only the forward but also the non-linear inverse problem by massively parallel programming on graphics processors. A comparison of optimized CPU and GPU implementations shows that the reconstruction can be accelerated by factors of about 15 through the use of the graphics hardware without compromising the accuracy in the reconstructed images.
      

      
      High-performance dynamic quantum clustering on graphics processors
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Wittek, Peter, E-mail: peterwittek@acm.org
         2013-01-15
         Clustering methods in machine learning may benefit from borrowing metaphors from physics. Dynamic quantum clustering associates a Gaussian wave packet with the multidimensional data points and regards them as eigenfunctions of the Schroedinger equation. The clustering structure emerges by letting the system evolve and the visual nature of the algorithm has been shown to be useful in a range of applications. Furthermore, the method only uses matrix operations, which readily lend themselves to parallelization. In this paper, we develop an implementation on graphics hardware and investigate how this approach can accelerate the computations. We achieve a speedup of up tomore » two magnitudes over a multicore CPU implementation, which proves that quantum-like methods and acceleration by graphics processing units have a great relevance to machine learning.« less
      

      
      Large-N in Volcano Settings: Volcanosri
      NASA Astrophysics Data System (ADS)
      Lees, J. M.; Song, W.; Xing, G.; Vick, S.; Phillips, D.
         2014-12-01
         We seek a paradigm shift in the approach we take on volcano monitoring where the compromise from high fidelity to large numbers of sensors is used to increase coverage and resolution. Accessibility, danger and the risk of equipment loss requires that we develop systems that are independent and inexpensive. Furthermore, rather than simply record data on hard disk for later analysis we desire a system that will work autonomously, capitalizing on wireless technology and in field network analysis. To this end we are currently producing a low cost seismic array which will incorporate, at the very basic level, seismological tools for first cut analysis of a volcano in crises mode. At the advanced end we expect to perform tomographic inversions in the network in near real time. Geophone (4 Hz) sensors connected to a low cost recording system will be installed on an active volcano where triggering earthquake location and velocity analysis will take place independent of human interaction. Stations are designed to be inexpensive and possibly disposable. In one of the first implementations the seismic nodes consist of an Arduino Due processor board with an attached Seismic Shield. The Arduino Due processor board contains an Atmel SAM3X8E ARM Cortex-M3 CPU. This 32 bit 84 MHz processor can filter and perform coarse seismic event detection on a 1600 sample signal in fewer than 200 milliseconds. The Seismic Shield contains a GPS module, 900 MHz high power mesh network radio, SD card, seismic amplifier, and 24 bit ADC. External sensors can be attached to either this 24-bit ADC or to the internal multichannel 12 bit ADC contained on the Arduino Due processor board. This allows the node to support attachment of multiple sensors. By utilizing a high-speed 32 bit processor complex signal processing tasks can be performed simultaneously on multiple sensors. Using a 10 W solar panel, second system being developed can run autonomously and collect data on 3 channels at 100Hz for 6 months with the installed 16Gb SD card. Initial designs and test results will be presented and discussed.
      

        
       
          

«

17
      18
      19
   20
      21
      »

          
        

     

   

   
       
            
              
          

«

18
      19
      20
   21
      22
      »

          
        

           
           
             
               
      
      Parallel design patterns for a low-power, software-defined compressed video encoder
      NASA Astrophysics Data System (ADS)
      Bruns, Michael W.; Hunt, Martin A.; Prasad, Durga; Gunupudi, Nageswara R.; Sonachalam, Sekar
         2011-06-01
         Video compression algorithms such as H.264 offer much potential for parallel processing that is not always exploited by the technology of a particular implementation. Consumer mobile encoding devices often achieve real-time performance and low power consumption through parallel processing in Application Specific Integrated Circuit (ASIC) technology, but many other applications require a software-defined encoder. High quality compression features needed for some applications such as 10-bit sample depth or 4:2:2 chroma format often go beyond the capability of a typical consumer electronics device. An application may also need to efficiently combine compression with other functions such as noise reduction, image stabilization, real time clocks, GPS data, mission/ESD/user data or software-defined radio in a low power, field upgradable implementation. Low power, software-defined encoders may be implemented using a massively parallel memory-network processor array with 100 or more cores and distributed memory. The large number of processor elements allow the silicon device to operate more efficiently than conventional DSP or CPU technology. A dataflow programming methodology may be used to express all of the encoding processes including motion compensation, transform and quantization, and entropy coding. This is a declarative programming model in which the parallelism of the compression algorithm is expressed as a hierarchical graph of tasks with message communication. Data parallel and task parallel design patterns are supported without the need for explicit global synchronization control. An example is described of an H.264 encoder developed for a commercially available, massively parallel memorynetwork processor device.
      

      
      Navier-Stokes Simulation of Airconditioning Facility of a Large Modem Computer Room
      NASA Technical Reports Server (NTRS)
      
         2005-01-01
         NASA recently assembled one of the world's fastest operational supercomputers to meet the agency's new high performance computing needs. This large-scale system, named Columbia, consists of 20 interconnected SGI Altix 512-processor systems, for a total of 10,240 Intel Itanium-2 processors. High-fidelity CFD simulations were performed for the NASA Advanced Supercomputing (NAS) computer room at Ames Research Center. The purpose of the simulations was to assess the adequacy of the existing air handling and conditioning system and make recommendations for changes in the design of the system if needed. The simulations were performed with NASA's OVERFLOW-2 CFD code which utilizes overset structured grids. A new set of boundary conditions were developed and added to the flow solver for modeling the roomls air-conditioning and proper cooling of the equipment. Boundary condition parameters for the flow solver are based on cooler CFM (flow rate) ratings and some reasonable assumptions of flow and heat transfer data for the floor and central processing units (CPU) . The geometry modeling from blue prints and grid generation were handled by the NASA Ames software package Chimera Grid Tools (CGT). This geometric model was developed as a CGT-scripted template, which can be easily modified to accommodate any changes in shape and size of the room, locations and dimensions of the CPU racks, disk racks, coolers, power distribution units, and mass-storage system. The compute nodes are grouped in pairs of racks with an aisle in the middle. High-speed connection cables connect the racks with overhead cable trays. The cool air from the cooling units is pumped into the computer room from a sub-floor through perforated floor tiles. The CPU cooling fans draw cool air from the floor tiles, which run along the outside length of each rack, and eject warm air into the center isle between the racks. This warm air is eventually drawn into the cooling units located near the walls of the room. One major concern is that the hot air ejected to the middle isle might recirculate back into the cool rack side and cause thermal short-cycling. The simulations analyzed and addressed the following important elements of the computer room: 1) High-temperature build-up in certain regions of the room; 2) Areas of low air circulation in the room; 3) Potential short-cycling of the computer rack cooling system; 4) Effectiveness of the perforated cooling floor tiles; 5) Effect of changes in various aspects of the cooling units. Detailed flow visualization is performed to show temperature distribution, air-flow streamlines and velocities in the computer room.
      

      
      Introducing concurrency in the Gaudi data processing framework
      NASA Astrophysics Data System (ADS)
      Clemencic, Marco; Hegner, Benedikt; Mato, Pere; Piparo, Danilo
         2014-06-01
         In the past, the increasing demands for HEP processing resources could be fulfilled by the ever increasing clock-frequencies and by distributing the work to more and more physical machines. Limitations in power consumption of both CPUs and entire data centres are bringing an end to this era of easy scalability. To get the most CPU performance per watt, future hardware will be characterised by less and less memory per processor, as well as thinner, more specialized and more numerous cores per die, and rather heterogeneous resources. To fully exploit the potential of the many cores, HEP data processing frameworks need to allow for parallel execution of reconstruction or simulation algorithms on several events simultaneously. We describe our experience in introducing concurrency related capabilities into Gaudi, a generic data processing software framework, which is currently being used by several HEP experiments, including the ATLAS and LHCb experiments at the LHC. After a description of the concurrent framework and the most relevant design choices driving its development, we describe the behaviour of the framework in a more realistic environment, using a subset of the real LHCb reconstruction workflow, and present our strategy and the used tools to validate the physics outcome of the parallel framework against the results of the present, purely sequential LHCb software. We then summarize the measurement of the code performance of the multithreaded application in terms of memory and CPU usage.
      

      
      Architecting the Finite Element Method Pipeline for the GPU.
      PubMed
      Fu, Zhisong; Lewis, T James; Kirby, Robert M; Whitaker, Ross T
         2014-02-01
         The finite element method (FEM) is a widely employed numerical technique for approximating the solution of partial differential equations (PDEs) in various science and engineering applications. Many of these applications benefit from fast execution of the FEM pipeline. One way to accelerate the FEM pipeline is by exploiting advances in modern computational hardware, such as the many-core streaming processors like the graphical processing unit (GPU). In this paper, we present the algorithms and data-structures necessary to move the entire FEM pipeline to the GPU. First we propose an efficient GPU-based algorithm to generate local element information and to assemble the global linear system associated with the FEM discretization of an elliptic PDE. To solve the corresponding linear system efficiently on the GPU, we implement a conjugate gradient method preconditioned with a geometry-informed algebraic multi-grid (AMG) method preconditioner. We propose a new fine-grained parallelism strategy, a corresponding multigrid cycling stage and efficient data mapping to the many-core architecture of GPU. Comparison of our on-GPU assembly versus a traditional serial implementation on the CPU achieves up to an 87 × speedup. Focusing on the linear system solver alone, we achieve a speedup of up to 51 × versus use of a comparable state-of-the-art serial CPU linear system solver. Furthermore, the method compares favorably with other GPU-based, sparse, linear solvers.
      

      
      Heterogeneous real-time computing in radio astronomy
      NASA Astrophysics Data System (ADS)
      Ford, John M.; Demorest, Paul; Ransom, Scott
         2010-07-01
         Modern computer architectures suited for general purpose computing are often not the best choice for either I/O-bound or compute-bound problems. Sometimes the best choice is not to choose a single architecture, but to take advantage of the best characteristics of different computer architectures to solve your problems. This paper examines the tradeoffs between using computer systems based on the ubiquitous X86 Central Processing Units (CPU's), Field Programmable Gate Array (FPGA) based signal processors, and Graphical Processing Units (GPU's). We will show how a heterogeneous system can be produced that blends the best of each of these technologies into a real-time signal processing system. FPGA's tightly coupled to analog-to-digital converters connect the instrument to the telescope and supply the first level of computing to the system. These FPGA's are coupled to other FPGA's to continue to provide highly efficient processing power. Data is then packaged up and shipped over fast networks to a cluster of general purpose computers equipped with GPU's, which are used for floating-point intensive computation. Finally, the data is handled by the CPU and written to disk, or further processed. Each of the elements in the system has been chosen for its specific characteristics and the role it can play in creating a system that does the most for the least, in terms of power, space, and money.
      

      
      Implementing Molecular Dynamics for Hybrid High Performance Computers - 1. Short Range Forces
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Brown, W Michael; Wang, Peng; Plimpton, Steven J
         
         The use of accelerators such as general-purpose graphics processing units (GPGPUs) have become popular in scientific computing applications due to their low cost, impressive floating-point capabilities, high memory bandwidth, and low electrical power requirements. Hybrid high performance computers, machines with more than one type of floating-point processor, are now becoming more prevalent due to these advantages. In this work, we discuss several important issues in porting a large molecular dynamics code for use on parallel hybrid machines - 1) choosing a hybrid parallel decomposition that works on central processing units (CPUs) with distributed memory and accelerator cores with shared memory,more » 2) minimizing the amount of code that must be ported for efficient acceleration, 3) utilizing the available processing power from both many-core CPUs and accelerators, and 4) choosing a programming model for acceleration. We present our solution to each of these issues for short-range force calculation in the molecular dynamics package LAMMPS. We describe algorithms for efficient short range force calculation on hybrid high performance machines. We describe a new approach for dynamic load balancing of work between CPU and accelerator cores. We describe the Geryon library that allows a single code to compile with both CUDA and OpenCL for use on a variety of accelerators. Finally, we present results on a parallel test cluster containing 32 Fermi GPGPUs and 180 CPU cores.« less
      

      
      Design of a system based on DSP and FPGA for video recording and replaying
      NASA Astrophysics Data System (ADS)
      Kang, Yan; Wang, Heng
         2013-08-01
         This paper brings forward a video recording and replaying system with the architecture of Digital Signal Processor (DSP) and Field Programmable Gate Array (FPGA). The system achieved encoding, recording, decoding and replaying of Video Graphics Array (VGA) signals which are displayed on a monitor during airplanes and ships' navigating. In the architecture, the DSP is a main processor which is used for a large amount of complicated calculation during digital signal processing. The FPGA is a coprocessor for preprocessing video signals and implementing logic control in the system. In the hardware design of the system, Peripheral Device Transfer (PDT) function of the External Memory Interface (EMIF) is utilized to implement seamless interface among the DSP, the synchronous dynamic RAM (SDRAM) and the First-In-First-Out (FIFO) in the system. This transfer mode can avoid the bottle-neck of the data transfer and simplify the circuit between the DSP and its peripheral chips. The DSP's EMIF and two level matching chips are used to implement Advanced Technology Attachment (ATA) protocol on physical layer of the interface of an Integrated Drive Electronics (IDE) Hard Disk (HD), which has a high speed in data access and does not rely on a computer. Main functions of the logic on the FPGA are described and the screenshots of the behavioral simulation are provided in this paper. In the design of program on the DSP, Enhanced Direct Memory Access (EDMA) channels are used to transfer data between the FIFO and the SDRAM to exert the CPU's high performance on computing without intervention by the CPU and save its time spending. JPEG2000 is implemented to obtain high fidelity in video recording and replaying. Ways and means of acquiring high performance for code are briefly present. The ability of data processing of the system is desirable. And smoothness of the replayed video is acceptable. By right of its design flexibility and reliable operation, the system based on DSP and FPGA for video recording and replaying has a considerable perspective in analysis after the event, simulated exercitation and so forth.
      

      
      Research on numerical control system based on S3C2410 and MCX314AL
      NASA Astrophysics Data System (ADS)
      Ren, Qiang; Jiang, Tingbiao
         2008-10-01
         With the rapid development of micro-computer technology, embedded system, CNC technology and integrated circuits, numerical control system with powerful functions can be realized by several high-speed CPU chips and RISC (Reduced Instruction Set Computing) chips which have small size and strong stability. In addition, the real-time operating system also makes the attainment of embedded system possible. Developing the NC system based on embedded technology can overcome some shortcomings of common PC-based CNC system, such as the waste of resources, low control precision, low frequency and low integration. This paper discusses a hardware platform of ENC (Embedded Numerical Control) system based on embedded processor chip ARM (Advanced RISC Machines)-S3C2410 and DSP (Digital Signal Processor)-MCX314AL and introduces the process of developing ENC system software. Finally write the MCX314AL's driver under the embedded Linux operating system. The embedded Linux operating system can deal with multitask well moreover satisfy the real-time and reliability of movement control. NC system has the advantages of best using resources and compact system with embedded technology. It provides a wealth of functions and superior performance with a lower cost. It can be sure that ENC is the direction of the future development.
      

      
      Acceleration of the Smith-Waterman algorithm using single and multiple graphics processors
      NASA Astrophysics Data System (ADS)
      Khajeh-Saeed, Ali; Poole, Stephen; Blair Perot, J.
         2010-06-01
         Finding regions of similarity between two very long data streams is a computationally intensive problem referred to as sequence alignment. Alignment algorithms must allow for imperfect sequence matching with different starting locations and some gaps and errors between the two data sequences. Perhaps the most well known application of sequence matching is the testing of DNA or protein sequences against genome databases. The Smith-Waterman algorithm is a method for precisely characterizing how well two sequences can be aligned and for determining the optimal alignment of those two sequences. Like many applications in computational science, the Smith-Waterman algorithm is constrained by the memory access speed and can be accelerated significantly by using graphics processors (GPUs) as the compute engine. In this work we show that effective use of the GPU requires a novel reformulation of the Smith-Waterman algorithm. The performance of this new version of the algorithm is demonstrated using the SSCA#1 (Bioinformatics) benchmark running on one GPU and on up to four GPUs executing in parallel. The results indicate that for large problems a single GPU is up to 45 times faster than a CPU for this application, and the parallel implementation shows linear speed up on up to 4 GPUs.
      

      
      MoNET: media over net gateway processor for next-generation network
      NASA Astrophysics Data System (ADS)
      Elabd, Hammam; Sundar, Rangarajan; Dedes, John
         2001-12-01
         MoNETTM (Media over Net) SX000 product family is designed using a scalable voice, video and packet-processing platform to address applications with channel densities from few voice channels to four OC3 per card. This platform is developed for bridging public circuit-switched network to the next generation packet telephony and data network. The platform consists of a DSP farm, RISC processors and interface modules. DSP farm is required to execute voice compression, image compression and line echo cancellation algorithms for large number of voice, video, fax, and modem or data channels. RISC CPUs are used for performing various packetizations based on RTP, UDP/IP and ATM encapsulations. In addition, RISC CPUs also participate in the DSP farm load management and communication with the host and other MoP devices. The MoNETTM S1000 communications device is designed for voice processing and for bridging TDM to ATM and IP packet networks. The S1000 consists of the DSP farm based on Carmel DSP core and 32-bit RISC CPU, along with Ethernet, Utopia, PCI, and TDM interfaces. In this paper, we will describe the VoIP infrastructure, building blocks of the S500, S1000 and S3000 devices, algorithms executed on these device and associated channel densities, detailed DSP architecture, memory architecture, data flow and scheduling.
      

      
      Design and optimization of a portable LQCD Monte Carlo code using OpenACC
      NASA Astrophysics Data System (ADS)
      Bonati, Claudio; Coscetti, Simone; D'Elia, Massimo; Mesiti, Michele; Negro, Francesco; Calore, Enrico; Schifano, Sebastiano Fabio; Silvi, Giorgio; Tripiccione, Raffaele
         
         The present panorama of HPC architectures is extremely heterogeneous, ranging from traditional multi-core CPU processors, supporting a wide class of applications but delivering moderate computing performance, to many-core Graphics Processor Units (GPUs), exploiting aggressive data-parallelism and delivering higher performances for streaming computing applications. In this scenario, code portability (and performance portability) become necessary for easy maintainability of applications; this is very relevant in scientific computing where code changes are very frequent, making it tedious and prone to error to keep different code versions aligned. In this work, we present the design and optimization of a state-of-the-art production-level LQCD Monte Carlo application, using the directive-based OpenACC programming model. OpenACC abstracts parallel programming to a descriptive level, relieving programmers from specifying how codes should be mapped onto the target architecture. We describe the implementation of a code fully written in OpenAcc, and show that we are able to target several different architectures, including state-of-the-art traditional CPUs and GPUs, with the same code. We also measure performance, evaluating the computing efficiency of our OpenACC code on several architectures, comparing with GPU-specific implementations and showing that a good level of performance-portability can be reached.
      

      
      Data Parallel Bin-Based Indexing for Answering Queries on Multi-Core Architectures
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Gosink, Luke; Wu, Kesheng; Bethel, E. Wes
         2009-06-02
         The multi-core trend in CPUs and general purpose graphics processing units (GPUs) offers new opportunities for the database community. The increase of cores at exponential rates is likely to affect virtually every server and client in the coming decade, and presents database management systems with a huge, compelling disruption that will radically change how processing is done. This paper presents a new parallel indexing data structure for answering queries that takes full advantage of the increasing thread-level parallelism emerging in multi-core architectures. In our approach, our Data Parallel Bin-based Index Strategy (DP-BIS) first bins the base data, and then partitionsmore » and stores the values in each bin as a separate, bin-based data cluster. In answering a query, the procedures for examining the bin numbers and the bin-based data clusters offer the maximum possible level of concurrency; each record is evaluated by a single thread and all threads are processed simultaneously in parallel. We implement and demonstrate the effectiveness of DP-BIS on two multi-core architectures: a multi-core CPU and a GPU. The concurrency afforded by DP-BIS allows us to fully utilize the thread-level parallelism provided by each architecture--for example, our GPU-based DP-BIS implementation simultaneously evaluates over 12,000 records with an equivalent number of concurrently executing threads. In comparing DP-BIS's performance across these architectures, we show that the GPU-based DP-BIS implementation requires significantly less computation time to answer a query than the CPU-based implementation. We also demonstrate in our analysis that DP-BIS provides better overall performance than the commonly utilized CPU and GPU-based projection index. Finally, due to data encoding, we show that DP-BIS accesses significantly smaller amounts of data than index strategies that operate solely on a column's base data; this smaller data footprint is critical for parallel processors that possess limited memory resources (e.g., GPUs).« less
      

      
      SpaceCube 2.0: An Advanced Hybrid Onboard Data Processor
      NASA Technical Reports Server (NTRS)
      Lin, Michael; Flatley, Thomas; Godfrey, John; Geist, Alessandro; Espinosa, Daniel; Petrick, David
         2011-01-01
         The SpaceCube 2.0 is a compact, high performance, low-power onboard processing system that takes advantage of cutting-edge hybrid (CPU/FPGA/DSP) processing elements. The SpaceCube 2.0 design concept includes two commercial Virtex-5 field-programmable gate array (FPGA) parts protected by gradiation hardened by software" technology, and possesses exceptional size, weight, and power characteristics [5x5x7 in., 3.5 lb (approximately equal to 12.7 x 12.7 x 17.8 cm, 1.6 kg) 5-25 W, depending on the application fs required clock rate]. The two Virtex-5 FPGA parts are implemented in a unique back-toback configuration to maximize data transfer and computing performance. Draft computing power specifications for the SpaceCube 2.0 unit include four PowerPC 440s (1100 DMIPS each), 500+ DSP48Es (2x580 GMACS), 100+ LVDS high-speed serial I/Os (1.25 Gbps each), and 2x190 GFLOPS single-precision (65 GFLOPS double-precision) floating point performance. The SpaceCube 2.0 includes PROM memory for CPU boot, health and safety, and basic command and telemetry functionality; RAM memory for program execution; and FLASH/EEPROM memory to store algorithms and application code for the CPU, FPGA, and DSP processing elements. Program execution can be reconfigured in real time and algorithms can be updated, modified, and/or replaced at any point during the mission. Gigabit Ethernet, Spacewire, SATA and highspeed LVDS serial/parallel I/O channels are available for instrument/sensor data ingest, and mission-unique instrument interfaces can be accommodated using a compact PCI (cPCI) expansion card interface. The SpaceCube 2.0 can be utilized in NASA Earth Science, Helio/Astrophysics and Exploration missions, and Department of Defense satellites for onboard data processing. It can also be used in commercial communication and mapping satellites.
      

      
      The Photon Shell Game and the Quantum von Neumann Architecture with Superconducting Circuits
      NASA Astrophysics Data System (ADS)
      Mariantoni, Matteo
         2012-02-01
         Superconducting quantum circuits have made significant advances over the past decade, allowing more complex and integrated circuits that perform with good fidelity. We have recently implemented a machine comprising seven quantum channels, with three superconducting resonators, two phase qubits, and two zeroing registers. I will explain the design and operation of this machine, first showing how a single microwave photon | 1 > can be prepared in one resonator and coherently transferred between the three resonators. I will also show how more exotic states such as double photon states | 2 > and superposition states | 0 >+ | 1 > can be shuffled among the resonators as well [1]. I will then demonstrate how this machine can be used as the quantum-mechanical analog of the von Neumann computer architecture, which for a classical computer comprises a central processing unit and a memory holding both instructions and data. The quantum version comprises a quantum central processing unit (quCPU) that exchanges data with a quantum random-access memory (quRAM) integrated on one chip, with instructions stored on a classical computer. I will also present a proof-of-concept demonstration of a code that involves all seven quantum elements: (1), Preparing an entangled state in the quCPU, (2), writing it to the quRAM, (3), preparing a second state in the quCPU, (4), zeroing it, and, (5), reading out the first state stored in the quRAM [2]. Finally, I will demonstrate that the quantum von Neumann machine provides one unit cell of a two-dimensional qubit-resonator array that can be used for surface code quantum computing. This will allow the realization of a scalable, fault-tolerant quantum processor with the most forgiving error rates to date. [4pt] [1] M. Mariantoni et al., Nature Physics 7, 287-293 (2011.)[0pt] [2] M. Mariantoni et al., Science 334, 61-65 (2011).
      

      
      libdrdc: software standards library
      NASA Astrophysics Data System (ADS)
      Erickson, David; Peng, Tie
         2008-04-01
         This paper presents the libdrdc software standards library including internal nomenclature, definitions, units of measure, coordinate reference frames, and representations for use in autonomous systems research. This library is a configurable, portable C-function wrapped C++ / Object Oriented C library developed to be independent of software middleware, system architecture, processor, or operating system. It is designed to use the automatically-tuned linear algebra suite (ATLAS) and Basic Linear Algebra Suite (BLAS) and port to firmware and software. The library goal is to unify data collection and representation for various microcontrollers and Central Processing Unit (CPU) cores and to provide a common Application Binary Interface (ABI) for research projects at all scales. The library supports multi-platform development and currently works on Windows, Unix, GNU/Linux, and Real-Time Executive for Multiprocessor Systems (RTEMS). This library is made available under LGPL version 2.1 license.
      

      
      Efficient multitasking of Choleski matrix factorization on CRAY supercomputers
      NASA Technical Reports Server (NTRS)
      Overman, Andrea L.; Poole, Eugene L.
         1991-01-01
         A Choleski method is described and used to solve linear systems of equations that arise in large scale structural analysis. The method uses a novel variable-band storage scheme and is structured to exploit fast local memory caches while minimizing data access delays between main memory and vector registers. Several parallel implementations of this method are described for the CRAY-2 and CRAY Y-MP computers demonstrating the use of microtasking and autotasking directives. A portable parallel language, FORCE, is used for comparison with the microtasked and autotasked implementations. Results are presented comparing the matrix factorization times for three representative structural analysis problems from runs made in both dedicated and multi-user modes on both computers. CPU and wall clock timings are given for the parallel implementations and are compared to single processor timings of the same algorithm.
      

      
      Gain in computational efficiency by vectorization in the dynamic simulation of multi-body systems
      NASA Technical Reports Server (NTRS)
      Amirouche, F. M. L.; Shareef, N. H.
         1991-01-01
         An improved technique for the identification and extraction of the exact quantities associated with the degrees of freedom at the element as well as the flexible body level is presented. It is implemented in the dynamic equations of motions based on the recursive formulation of Kane et al. (1987) and presented in a matrix form, integrating the concepts of strain energy, the finite-element approach, modal analysis, and reduction of equations. This technique eliminates the CPU intensive matrix multiplication operations in the code's hot spots for the dynamic simulation of the interconnected rigid and flexible bodies. A study of a simple robot with flexible links is presented by comparing the execution times on a scalar machine and a vector-processor with and without vector options. Performance figures demonstrating the substantial gains achieved by the technique are plotted.
      

      
      Implementation of the DAST ARW II control laws using an 8086 microprocessor and an 8087 floating-point coprocessor. [drones for aeroelasticity research
      NASA Technical Reports Server (NTRS)
      Kelly, G. L.; Berthold, G.; Abbott, L.
         1982-01-01
         A 5 MHZ single-board microprocessor system which incorporates an 8086 CPU and an 8087 Numeric Data Processor is used to implement the control laws for the NASA Drones for Aerodynamic and Structural Testing, Aeroelastic Research Wing II. The control laws program was executed in 7.02 msec, with initialization consuming 2.65 msec and the control law loop 4.38 msec. The software emulator execution times for these two tasks were 36.67 and 61.18, respectively, for a total of 97.68 msec. The space, weight and cost reductions achieved in the present, aircraft control application of this combination of a 16-bit microprocessor with an 80-bit floating point coprocessor may be obtainable in other real time control applications.
      

      
      Towards more stable operation of the Tokyo Tier2 center
      NASA Astrophysics Data System (ADS)
      Nakamura, T.; Mashimo, T.; Matsui, N.; Sakamoto, H.; Ueda, I.
         2014-06-01
         The Tokyo Tier2 center, which is located at the International Center for Elementary Particle Physics (ICEPP) in the University of Tokyo, was established as a regional analysis center in Japan for the ATLAS experiment. The official operation with WLCG was started in 2007 after the several years development since 2002. In December 2012, we have replaced almost all hardware as the third system upgrade to deal with analysis for further growing data of the ATLAS experiment. The number of CPU cores are increased by factor of two (9984 cores in total), and the performance of individual CPU core is improved by 20% according to the HEPSPEC06 benchmark test at 32bit compile mode. The score is estimated as 18.03 (SL6) per core by using Intel Xeon E5-2680 2.70 GHz. Since all worker nodes are made by 16 CPU cores configuration, we deployed 624 blade servers in total. They are connected to 6.7 PB of disk storage system with non-blocking 10 Gbps internal network backbone by using two center network switches (NetIron MLXe-32). The disk storage is made by 102 of RAID6 disk arrays (Infortrend DS S24F-G2840-4C16DO0) and served by equivalent number of 1U file servers with 8G-FC connection to maximize the file transfer throughput per storage capacity. As of February 2013, 2560 CPU cores and 2.00 PB of disk storage have already been deployed for WLCG. Currently, the remaining non-grid resources for both CPUs and disk storage are used as dedicated resources for the data analysis by the ATLAS Japan collaborators. Since all hardware in the non-grid resources are made by same architecture with Tier2 resource, they will be able to be migrated as the Tier2 extra resource on demand of the ATLAS experiment in the future. In addition to the upgrade of computing resources, we expect the improvement of connectivity on the wide area network. Thanks to the Japanese NREN (NII), another 10 Gbps trans-Pacific line from Japan to Washington will be available additionally with existing two 10 Gbps lines (Tokyo to New York and Tokyo to Los Angeles). The new line will be connected to LHCONE for the more improvement of the connectivity. In this circumstance, we are working for the further stable operation. For instance, we have newly introduced GPFS (IBM) for the non-grid disk storage, while Disk Pool Manager (DPM) are continued to be used as Tier2 disk storage from the previous system. Since the number of files stored in a DPM pool will be increased with increasing the total amount of data, the development of stable database configuration is one of the crucial issues as well as scalability. We have started some studies on the performance of asynchronous database replication so that we can take daily full backup. In this report, we would like to introduce several improvements in terms of the performances and stability of our new system and possibility of the further improvement of local I/O performance in the multi-core worker node. We also present the status of the wide area network connectivity from Japan to US and/or EU with LHCONE.
      

      
      Bin-Hash Indexing: A Parallel Method for Fast Query Processing
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Bethel, Edward W; Gosink, Luke J.; Wu, Kesheng
         2008-06-27
         This paper presents a new parallel indexing data structure for answering queries. The index, called Bin-Hash, offers extremely high levels of concurrency, and is therefore well-suited for the emerging commodity of parallel processors, such as multi-cores, cell processors, and general purpose graphics processing units (GPU). The Bin-Hash approach first bins the base data, and then partitions and separately stores the values in each bin as a perfect spatial hash table. To answer a query, we first determine whether or not a record satisfies the query conditions based on the bin boundaries. For the bins with records that can not bemore » resolved, we examine the spatial hash tables. The procedures for examining the bin numbers and the spatial hash tables offer the maximum possible level of concurrency; all records are able to be evaluated by our procedure independently in parallel. Additionally, our Bin-Hash procedures access much smaller amounts of data than similar parallel methods, such as the projection index. This smaller data footprint is critical for certain parallel processors, like GPUs, where memory resources are limited. To demonstrate the effectiveness of Bin-Hash, we implement it on a GPU using the data-parallel programming language CUDA. The concurrency offered by the Bin-Hash index allows us to fully utilize the GPU's massive parallelism in our work; over 12,000 records can be simultaneously evaluated at any one time. We show that our new query processing method is an order of magnitude faster than current state-of-the-art CPU-based indexing technologies. Additionally, we compare our performance to existing GPU-based projection index strategies.« less
      

        
       
          

«

18
      19
      20
   21
      22
      »

          
        

     

   

   
       
            
              
          

«

19
      20
      21
   22
      23
      »

          
        

           
           
             
               
      
      Thermo-mechanical properties of carbon nanotubes and applications in thermal management
      NASA Astrophysics Data System (ADS)
      Nguyen, Manh Hong; Thang Bui, Hung; Trinh Pham, Van; Phan, Ngoc Hong; Nguyen, Tuan Hong; Chuc Nguyen, Van; Quang Le, Dinh; Khoi Phan, Hong; Phan, Ngoc Minh
         2016-06-01
         Thanks to their very high thermal conductivity, high Young’s modulus and unique tensile strength, carbon nanotubes (CNTs) have become one of the most suitable nano additives for heat conductive materials. In this work, we present results obtained for the synthesis of heat conductive materials containing CNT based thermal greases, nanoliquids and lubricating oils. These synthesized heat conductive materials were applied to thermal management for high power electronic devices (CPUs, LEDs) and internal combustion engines. The simulation and experimental results on thermal greases for an Intel Pentium IV processor showed that the thermal conductivity of greases increases 1.4 times and the saturation temperature of the CPU decreased by 5 °C by using thermal grease containing 2 wt% CNTs. Nanoliquids containing CNT based distilled water/ethylene glycol were successfully applied in heat dissipation for an Intel Core i5 processor and a 450 W floodlight LED. The experimental results showed that the saturation temperature of the Intel Core i5 processor and the 450 W floodlight LED decreased by about 6 °C and 3.5 °C, respectively, when using nanoliquids containing 1 g l-1 of CNTs. The CNTs were also effectively utilized additive materials for the synthesis of lubricating oils to improve the thermal conductivity, heat dissipation efficiency and performance efficiency of engines. The experimental results show that the thermal conductivity of lubricating oils increased by 12.5%, the engine saved 15% fuel consumption, and the longevity of the lubricating oil increased up to 20 000 km by using 0.1% vol. CNTs in the lubricating oils. All above results have confirmed the tremendous application potential of heat conductive materials containing CNTs in thermal management for high power electronic devices, internal combustion engines and other high power apparatus.
      

      
      Enhanced tactical radar correlator (ETRAC): true interoperability of the 1990s
      NASA Astrophysics Data System (ADS)
      Guillen, Frank J.
         1994-10-01
         The enhanced tactical radar correlator (ETRAC) system is under development at Westinghouse Electric Corporation for the Army Space Program Office (ASPO). ETRAC is a real-time synthetic aperture radar (SAR) processing system that provides tactical IMINT to the corps commander. It features an open architecture comprised of ruggedized commercial-off-the-shelf (COTS), UNIX based workstations and processors. The architecture features the DoD common SAR processor (CSP), a multisensor computing platform to accommodate a variety of current and future imaging needs. ETRAC's principal functions include: (1) Mission planning and control -- ETRAC provides mission planning and control for the U-2R and ASARS-2 sensor, including capability for auto replanning, retasking, and immediate spot. (2) Image formation -- the image formation processor (IFP) provides the CPU intensive processing capability to produce real-time imagery for all ASARS imaging modes of operation. (3) Image exploitation -- two exploitation workstations are provided for first-phase image exploitation, manipulation, and annotation. Products include INTEL reports, annotated NITF SID imagery, high resolution hard copy prints and targeting data. ETRAC is transportable via two C-130 aircraft, with autonomous drive on/off capability for high mobility. Other autonomous capabilities include rapid setup/tear down, extended stand-alone support, internal environmental control units (ECUs) and power generation. ETRAC's mission is to provide the Army field commander with accurate, reliable, and timely imagery intelligence derived from collections made by the ASARS-2 sensor, located on-board the U-2R aircraft. To accomplish this mission, ETRAC receives video phase history (VPH) directly from the U-2R aircraft and converts it in real time into soft copy imagery for immediate exploitation and dissemination to the tactical users.
      

      
      High Performance Computing and Visualization Infrastructure for Simultaneous Parallel Computing and Parallel Visualization Research
      DTIC Science & Technology
      
         2016-11-09
         Total Number: Sub Contractors (DD882) Names of Personnel receiving masters degrees Names of personnel receiving PHDs Names of other research staff...Broadcom 5720 QP 1Gb Network Daughter Card (2) Intel Xeon E5-2680 v3 2.5GHz, 30M Cache, 9.60GT/s QPI, Turbo, HT , 12C/24T (120W...Broadcom 5720 QP 1Gb Network Daughter Card (2) Intel Xeon E5-2680 v3 2.5GHz, 30M Cache, 9.60GT/s QPI, Turbo, HT , 12C/24T (120W
      

      
      Symptoms of problematic cellular phone use, functional impairment and its association with depression among adolescents in Southern Taiwan.
      PubMed
      Yen, Cheng-Fang; Tang, Tze-Chun; Yen, Ju-Yu; Lin, Huang-Chi; Huang, Chi-Fen; Liu, Shu-Chun; Ko, Chih-Hung
         2009-08-01
         The aims of this study were: (1) to examine the prevalence of symptoms of problematic cellular phone use (CPU); (2) to examine the associations between the symptoms of problematic CPU, functional impairment caused by CPU and the characteristics of CPU; (3) to establish the optimal cut-off point of the number of symptoms for functional impairment caused by CPU; and (4) to examine the association between problematic CPU and depression in adolescents. A total of 10,191 adolescent students in Southern Taiwan were recruited into this study. Participants' self-reported symptoms of problematic CPU and functional impairments caused by CPU were collected. The associations of symptoms of problematic CPU with functional impairments and with the characteristics of CPU were examined. The cut-off point of the number of symptoms for functional impairment was also determined. The association between problematic CPU and depression was examined by logistic regression analysis. The results indicated that the symptoms of problematic CPU were prevalent in adolescents. The adolescents who had any one of the symptoms of problematic CPU were more likely to report at least one dimension of functional impairment caused by CPU, called more on cellular phones, sent more text messages, or spent more time and higher fees on CPU. Having four or more symptoms of problematic CPU had the highest potential to differentiate between the adolescents with and without functional impairment caused by CPU. Adolescents who had significant depression were more likely to have four or more symptoms of problematic CPU. The results of this study may provide a basis for detecting symptoms of problematic CPU in adolescents.
      

      
      Fast H-DROP: A thirty times accelerated version of H-DROP for interactive SVM-based prediction of helical domain linkers
      NASA Astrophysics Data System (ADS)
      Richa, Tambi; Ide, Soichiro; Suzuki, Ryosuke; Ebina, Teppei; Kuroda, Yutaka
         2017-02-01
         Efficient and rapid prediction of domain regions from amino acid sequence information alone is often required for swift structural and functional characterization of large multi-domain proteins. Here we introduce Fast H-DROP, a thirty times accelerated version of our previously reported H-DROP (Helical Domain linker pRediction using OPtimal features), which is unique in specifically predicting helical domain linkers (boundaries). Fast H-DROP, analogously to H-DROP, uses optimum features selected from a set of 3000 ones by combining a random forest and a stepwise feature selection protocol. We reduced the computational time from 8.5 min per sequence in H-DROP to 14 s per sequence in Fast H-DROP on an 8 Xeon processor Linux server by using SWISS-PROT instead of Genbank non-redundant (nr) database for generating the PSSMs. The sensitivity and precision of Fast H-DROP assessed by cross-validation were 33.7 and 36.2%, which were merely 2% lower than that of H-DROP. The reduced computational time of Fast H-DROP, without affecting prediction performances, makes it more interactive and user-friendly. Fast H-DROP and H-DROP are freely available from http://domserv.lab.tuat.ac.jp/.
      

      
      A novel image toggle tool for comparison of serial mammograms: automatic density normalization and alignment-development of the tool and initial experience.
      PubMed
      Honda, Satoshi; Tsunoda, Hiroko; Fukuda, Wataru; Saida, Yukihisa
         2014-12-01
         The purpose is to develop a new image toggle tool with automatic density normalization (ADN) and automatic alignment (AA) for comparing serial digital mammograms (DMGs). We developed an ADN and AA process to compare the images of serial DMGs. In image density normalization, a linear interpolation was applied by taking two points of high- and low-brightness areas. The alignment was calculated by determining the point of the greatest correlation while shifting the alignment between the current and prior images. These processes were performed on a PC with a 3.20-GHz Xeon processor and 8 GB of main memory. We selected 12 suspected breast cancer patients who had undergone screening DMGs in the past. Automatic processing was retrospectively performed on these images. Two radiologists subjectively evaluated them. The process of the developed algorithm took approximately 1 s per image. In our preliminary experience, two images could not be aligned approximately. When they were aligned, image toggling allowed detection of differences between examinations easily. We developed a new tool to facilitate comparative reading of DMGs on a mammography viewing system. Using this tool for toggling comparisons might improve the interpretation efficiency of serial DMGs.
      

      
      Porting plasma physics simulation codes to modern computing architectures using the libmrc framework
      NASA Astrophysics Data System (ADS)
      Germaschewski, Kai; Abbott, Stephen
         2015-11-01
         Available computing power has continued to grow exponentially even after single-core performance satured in the last decade. The increase has since been driven by more parallelism, both using more cores and having more parallelism in each core, e.g. in GPUs and Intel Xeon Phi. Adapting existing plasma physics codes is challenging, in particular as there is no single programming model that covers current and future architectures. We will introduce the open-source libmrc framework that has been used to modularize and port three plasma physics codes: The extended MHD code MRCv3 with implicit time integration and curvilinear grids; the OpenGGCM global magnetosphere model; and the particle-in-cell code PSC. libmrc consolidates basic functionality needed for simulations based on structured grids (I/O, load balancing, time integrators), and also introduces a parallel object model that makes it possible to maintain multiple implementations of computational kernels, on e.g. conventional processors and GPUs. It handles data layout conversions and enables us to port performance-critical parts of a code to a new architecture step-by-step, while the rest of the code can remain unchanged. We will show examples of the performance gains and some physics applications.
      

      
      Implementation of molecular dynamics and its extensions with the coarse-grained UNRES force field on massively parallel systems; towards millisecond-scale simulations of protein structure, dynamics, and thermodynamics
      PubMed Central
      Liwo, Adam; Ołdziej, Stanisław; Czaplewski, Cezary; Kleinerman, Dana S.; Blood, Philip; Scheraga, Harold A.
         2010-01-01
         We report the implementation of our united-residue UNRES force field for simulations of protein structure and dynamics with massively parallel architectures. In addition to coarse-grained parallelism already implemented in our previous work, in which each conformation was treated by a different task, we introduce a fine-grained level in which energy and gradient evaluation are split between several tasks. The Message Passing Interface (MPI) libraries have been utilized to construct the parallel code. The parallel performance of the code has been tested on a professional Beowulf cluster (Xeon Quad Core), a Cray XT3 supercomputer, and two IBM BlueGene/P supercomputers with canonical and replica-exchange molecular dynamics. With IBM BlueGene/P, about 50 % efficiency and 120-fold speed-up of the fine-grained part was achieved for a single trajectory of a 767-residue protein with use of 256 processors/trajectory. Because of averaging over the fast degrees of freedom, UNRES provides an effective 1000-fold speed-up compared to the experimental time scale and, therefore, enables us to effectively carry out millisecond-scale simulations of proteins with 500 and more amino-acid residues in days of wall-clock time. PMID:20305729
      

      
      GPU Lossless Hyperspectral Data Compression System for Space Applications
      NASA Technical Reports Server (NTRS)
      Keymeulen, Didier; Aranki, Nazeeh; Hopson, Ben; Kiely, Aaron; Klimesh, Matthew; Benkrid, Khaled
         2012-01-01
         On-board lossless hyperspectral data compression reduces data volume in order to meet NASA and DoD limited downlink capabilities. At JPL, a novel, adaptive and predictive technique for lossless compression of hyperspectral data, named the Fast Lossless (FL) algorithm, was recently developed. This technique uses an adaptive filtering method and achieves state-of-the-art performance in both compression effectiveness and low complexity. Because of its outstanding performance and suitability for real-time onboard hardware implementation, the FL compressor is being formalized as the emerging CCSDS Standard for Lossless Multispectral & Hyperspectral image compression. The FL compressor is well-suited for parallel hardware implementation. A GPU hardware implementation was developed for FL targeting the current state-of-the-art GPUs from NVIDIA(Trademark). The GPU implementation on a NVIDIA(Trademark) GeForce(Trademark) GTX 580 achieves a throughput performance of 583.08 Mbits/sec (44.85 MSamples/sec) and an acceleration of at least 6 times a software implementation running on a 3.47 GHz single core Intel(Trademark) Xeon(Trademark) processor. This paper describes the design and implementation of the FL algorithm on the GPU. The massively parallel implementation will provide in the future a fast and practical real-time solution for airborne and space applications.
      

      
      Leveraging FPGAs for Accelerating Short Read Alignment.
      PubMed
      Arram, James; Kaplan, Thomas; Luk, Wayne; Jiang, Peiyong
         2017-01-01
         One of the key challenges facing genomics today is how to efficiently analyze the massive amounts of data produced by next-generation sequencing platforms. With general-purpose computing systems struggling to address this challenge, specialized processors such as the Field-Programmable Gate Array (FPGA) are receiving growing interest. The means by which to leverage this technology for accelerating genomic data analysis is however largely unexplored. In this paper, we present a runtime reconfigurable architecture for accelerating short read alignment using FPGAs. This architecture exploits the reconfigurability of FPGAs to allow the development of fast yet flexible alignment designs. We apply this architecture to develop an alignment design which supports exact and approximate alignment with up to two mismatches. Our design is based on the FM-index, with optimizations to improve the alignment performance. In particular, the n-step FM-index, index oversampling, a seed-and-compare stage, and bi-directional backtracking are included. Our design is implemented and evaluated on a 1U Maxeler MPC-X2000 dataflow node with eight Altera Stratix-V FPGAs. Measurements show that our design is 28 times faster than Bowtie2 running with 16 threads on dual Intel Xeon E5-2640 CPUs, and nine times faster than Soap3-dp running on an NVIDIA Tesla C2070 GPU.
      

      
      Efficient parallel simulation of CO2 geologic sequestration insaline aquifers
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Zhang, Keni; Doughty, Christine; Wu, Yu-Shu
         2007-01-01
         An efficient parallel simulator for large-scale, long-termCO2 geologic sequestration in saline aquifers has been developed. Theparallel simulator is a three-dimensional, fully implicit model thatsolves large, sparse linear systems arising from discretization of thepartial differential equations for mass and energy balance in porous andfractured media. The simulator is based on the ECO2N module of the TOUGH2code and inherits all the process capabilities of the single-CPU TOUGH2code, including a comprehensive description of the thermodynamics andthermophysical properties of H2O-NaCl- CO2 mixtures, modeling singleand/or two-phase isothermal or non-isothermal flow processes, two-phasemixtures, fluid phases appearing or disappearing, as well as saltprecipitation or dissolution. The newmore » parallel simulator uses MPI forparallel implementation, the METIS software package for simulation domainpartitioning, and the iterative parallel linear solver package Aztec forsolving linear equations by multiple processors. In addition, theparallel simulator has been implemented with an efficient communicationscheme. Test examples show that a linear or super-linear speedup can beobtained on Linux clusters as well as on supercomputers. Because of thesignificant improvement in both simulation time and memory requirement,the new simulator provides a powerful tool for tackling larger scale andmore complex problems than can be solved by single-CPU codes. Ahigh-resolution simulation example is presented that models buoyantconvection, induced by a small increase in brine density caused bydissolution of CO2.« less
      

      
      Evaluation of the Monotonic Lagrangian Grid and Lat-Long Grid for Air Traffic Management
      NASA Technical Reports Server (NTRS)
      Kaplan, Carolyn; Dahm, Johann; Oran, Elaine; Alexandrov, Natalia; Boris, Jay
         2011-01-01
         The Air Traffic Monotonic Lagrangian Grid (ATMLG) is used to simulate a 24 hour period of air traffic flow in the National Airspace System (NAS). During this time period, there are 41,594 flights over the United States, and the flight plan information (departure and arrival airports and times, and waypoints along the way) are obtained from an Federal Aviation Administration (FAA) Enhanced Traffic Management System (ETMS) dataset. Two simulation procedures are tested and compared: one based on the Monotonic Lagrangian Grid (MLG), and the other based on the stationary Latitude-Longitude (Lat- Long) grid. Simulating one full day of air traffic over the United States required the following amounts of CPU time on a single processor of an SGI Altix: 88 s for the MLG method, and 163 s for the Lat-Long grid method. We present a discussion of the amount of CPU time required for each of the simulation processes (updating aircraft trajectories, sorting, conflict detection and resolution, etc.), and show that the main advantage of the MLG method is that it is a general sorting algorithm that can sort on multiple properties. We discuss how many MLG neighbors must be considered in the separation assurance procedure in order to ensure a five-mile separation buffer between aircraft, and we investigate the effect of removing waypoints from aircraft trajectories. When aircraft choose their own trajectory, there are more flights with shorter duration times and fewer CD&R maneuvers, resulting in significant fuel savings.
      

      
      Numericware i: Identical by State Matrix Calculator
      PubMed Central
      Kim, Bongsong; Beavis, William D
         2017-01-01
         We introduce software, Numericware i, to compute identical by state (IBS) matrix based on genotypic data. Calculating an IBS matrix with a large dataset requires large computer memory and takes lengthy processing time. Numericware i addresses these challenges with 2 algorithmic methods: multithreading and forward chopping. The multithreading allows computational routines to concurrently run on multiple central processing unit (CPU) processors. The forward chopping addresses memory limitation by dividing a dataset into appropriately sized subsets. Numericware i allows calculation of the IBS matrix for a large genotypic dataset using a laptop or a desktop computer. For comparison with different software, we calculated genetic relationship matrices using Numericware i, SPAGeDi, and TASSEL with the same genotypic dataset. Numericware i calculates IBS coefficients between 0 and 2, whereas SPAGeDi and TASSEL produce different ranges of values including negative values. The Pearson correlation coefficient between the matrices from Numericware i and TASSEL was high at .9972, whereas SPAGeDi showed low correlation with Numericware i (.0505) and TASSEL (.0587). With a high-dimensional dataset of 500 entities by 10 000 000 SNPs, Numericware i spent 382 minutes using 19 CPU threads and 64 GB memory by dividing the dataset into 3 pieces, whereas SPAGeDi and TASSEL failed with the same dataset. Numericware i is freely available for Windows and Linux under CC-BY 4.0 license at https://figshare.com/s/f100f33a8857131eb2db. PMID:28469375
      

      
      DOE SBIR Phase-1 Report on Hybrid CPU-GPU Parallel Development of the Eulerian-Lagrangian Barracuda Multiphase Program
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Dr. Dale M. Snider
         2011-02-28
         This report gives the result from the Phase-1 work on demonstrating greater than 10x speedup of the Barracuda computer program using parallel methods and GPU processors (General-Purpose Graphics Processing Unit or Graphics Processing Unit). Phase-1 demonstrated a 12x speedup on a typical Barracuda function using the GPU processor. The problem test case used about 5 million particles and 250,000 Eulerian grid cells. The relative speedup, compared to a single CPU, increases with increased number of particles giving greater than 12x speedup. Phase-1 work provided a path for reformatting data structure modifications to give good parallel performance while keeping a friendlymore » environment for new physics development and code maintenance. The implementation of data structure changes will be in Phase-2. Phase-1 laid the ground work for the complete parallelization of Barracuda in Phase-2, with the caveat that implemented computer practices for parallel programming done in Phase-1 gives immediate speedup in the current Barracuda serial running code. The Phase-1 tasks were completed successfully laying the frame work for Phase-2. The detailed results of Phase-1 are within this document. In general, the speedup of one function would be expected to be higher than the speedup of the entire code because of I/O functions and communication between the algorithms. However, because one of the most difficult Barracuda algorithms was parallelized in Phase-1 and because advanced parallelization methods and proposed parallelization optimization techniques identified in Phase-1 will be used in Phase-2, an overall Barracuda code speedup (relative to a single CPU) is expected to be greater than 10x. This means that a job which takes 30 days to complete will be done in 3 days. Tasks completed in Phase-1 are: Task 1: Profile the entire Barracuda code and select which subroutines are to be parallelized (See Section Choosing a Function to Accelerate) Task 2: Select a GPU consultant company and jointly parallelize subroutines (CPFD chose the small business EMPhotonics for the Phase-1 the technical partner. See Section Technical Objective and Approach) Task 3: Integrate parallel subroutines into Barracuda (See Section Results from Phase-1 and its subsections) Task 4: Testing, refinement, and optimization of parallel methodology (See Section Results from Phase-1 and Section Result Comparison Program) Task 5: Integrate Phase-1 parallel subroutines into Barracuda and release (See Section Results from Phase-1 and its subsections) Task 6: Roadmap of Phase-2 (See Section Plan for Phase-2) With the completion of Phase 1 we have the base understanding to completely parallelize Barracuda. An overview of the work to move Barracuda to a parallelized code is given in Plan for Phase-2.« less
      

      
      Initial development of goCMC: a GPU-oriented fast cross-platform Monte Carlo engine for carbon ion therapy
      PubMed Central
      Qin, Nan; Pinto, Marco; Tian, Zhen; Dedes, Georgios; Pompos, Arnold; Jiang, Steve B.; Parodi, Katia; Jia, Xun
         2017-01-01
         Monte Carlo (MC) simulation is considered as the most accurate method for calculation of absorbed dose and fundamental physics quantities related to biological effects in carbon ion therapy. To improve its computational efficiency, we have developed a GPU-oriented fast MC package named goCMC, for carbon therapy. goCMC simulates particle transport in voxelized geometry with kinetic energy up to 450 MeV/u. Class II condensed history simulation scheme with a continuous slowing down approximation was employed. Energy straggling and multiple scattering were modeled. δ-electrons were terminated with their energy locally deposited. Four types of nuclear interactions were implemented in goCMC, i.e., carbon-hydrogen, carbon-carbon, carbon-oxygen and carbon-calcium inelastic collisions. Total cross section data from Geant4 were used. Secondary particles produced in these interactions were sampled according to particle yield with energy and directional distribution data derived from Geant4 simulation results. Secondary charged particles were transported following the condensed history scheme, whereas secondary neutral particles were ignored. goCMC was developed under OpenCL framework and is executable on different platforms, e.g. GPU and multi-core CPU. We have validated goCMC with Geant4 in cases with different beam energy and phantoms including four homogeneous phantoms, one heterogeneous half-slab phantom, and one patient case. For each case 3 × 107 carbon ions were simulated, such that in the region with dose greater than 10% of maximum dose, the mean relative statistical uncertainty was less than 1%. Good agreements for dose distributions and range estimations between goCMC and Geant4 were observed. 3D gamma passing rates with 1%/1 mm criterion were over 90% within 10%) isodose line except in two extreme cases, and those with 2%/1 mm criterion were all over 96%. Efficiency and code portability were tested with different GPUs and CPUs. Depending on the beam energy and voxel size, the computation time to simulate 107 carbons was 9.9–125 sec, 2.5–50 sec and 60–612 sec on an AMD Radeon GPU card, an NVidia GeForce GTX 1080 GPU card and an Intel Xeon E5-2640 CPU, respectively. The combined accuracy, efficiency and portability make goCMC attractive for research and clinical applications in carbon ion therapy. PMID:28140352
      

      
      SU-E-T-500: Initial Implementation of GPU-Based Particle Swarm Optimization for 4D IMRT Planning in Lung SBRT
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Modiri, A; Hagan, A; Gu, X
         
         Purpose 4D-IMRT planning, combined with dynamic MLC tracking delivery, utilizes the temporal dimension as an additional degree of freedom to achieve improved OAR-sparing. The computational complexity for such optimization increases exponentially with increase in dimensionality. In order to accomplish this task in a clinically-feasible time frame, we present an initial implementation of GPU-based 4D-IMRT planning based on particle swarm optimization (PSO). Methods The target and normal structures were manually contoured on ten phases of a 4DCT scan of a NSCLC patient with a 54cm3 right-lower-lobe tumor (1.5cm motion). Corresponding ten 3D-IMRT plans were created in the Eclipse treatment planning systemmore » (Ver-13.6). A vendor-provided scripting interface was used to export 3D-dose matrices corresponding to each control point (10 phases × 9 beams × 166 control points = 14,940), which served as input to PSO. The optimization task was to iteratively adjust the weights of each control point and scale the corresponding dose matrices. In order to handle the large amount of data in GPU memory, dose matrices were sparsified and placed in contiguous memory blocks with the 14,940 weight-variables. PSO was implemented on CPU (dual-Xeon, 3.1GHz) and GPU (dual-K20 Tesla, 2496 cores, 3.52Tflops, each) platforms. NiftyReg, an open-source deformable image registration package, was used to calculate the summed dose. Results The 4D-PSO plan yielded PTV coverage comparable to the clinical ITV-based plan and significantly higher OAR-sparing, as follows: lung Dmean=33%; lung V20=27%; spinal cord Dmax=26%; esophagus Dmax=42%; heart Dmax=0%; heart Dmean=47%. The GPU-PSO processing time for 14940 variables and 7 PSO-particles was 41% that of CPU-PSO (199 vs. 488 minutes). Conclusion Truly 4D-IMRT planning can yield significant OAR dose-sparing while preserving PTV coverage. The corresponding optimization problem is large-scale, non-convex and computationally rigorous. Our initial results indicate that GPU-based PSO with further software optimization can make such planning clinically feasible. This work was supported through funding from the National Institutes of Health and Varian Medical Systems.« less
      

      
      Initial development of goCMC: a GPU-oriented fast cross-platform Monte Carlo engine for carbon ion therapy
      NASA Astrophysics Data System (ADS)
      Qin, Nan; Pinto, Marco; Tian, Zhen; Dedes, Georgios; Pompos, Arnold; Jiang, Steve B.; Parodi, Katia; Jia, Xun
         2017-05-01
         Monte Carlo (MC) simulation is considered as the most accurate method for calculation of absorbed dose and fundamental physics quantities related to biological effects in carbon ion therapy. To improve its computational efficiency, we have developed a GPU-oriented fast MC package named goCMC, for carbon therapy. goCMC simulates particle transport in voxelized geometry with kinetic energy up to 450 MeV u-1. Class II condensed history simulation scheme with a continuous slowing down approximation was employed. Energy straggling and multiple scattering were modeled. δ-electrons were terminated with their energy locally deposited. Four types of nuclear interactions were implemented in goCMC, i.e. carbon-hydrogen, carbon-carbon, carbon-oxygen and carbon-calcium inelastic collisions. Total cross section data from Geant4 were used. Secondary particles produced in these interactions were sampled according to particle yield with energy and directional distribution data derived from Geant4 simulation results. Secondary charged particles were transported following the condensed history scheme, whereas secondary neutral particles were ignored. goCMC was developed under OpenCL framework and is executable on different platforms, e.g. GPU and multi-core CPU. We have validated goCMC with Geant4 in cases with different beam energy and phantoms including four homogeneous phantoms, one heterogeneous half-slab phantom, and one patient case. For each case 3× {{10}7} carbon ions were simulated, such that in the region with dose greater than 10% of maximum dose, the mean relative statistical uncertainty was less than 1%. Good agreements for dose distributions and range estimations between goCMC and Geant4 were observed. 3D gamma passing rates with 1%/1 mm criterion were over 90% within 10% isodose line except in two extreme cases, and those with 2%/1 mm criterion were all over 96%. Efficiency and code portability were tested with different GPUs and CPUs. Depending on the beam energy and voxel size, the computation time to simulate {{10}7} carbons was 9.9-125 s, 2.5-50 s and 60-612 s on an AMD Radeon GPU card, an NVidia GeForce GTX 1080 GPU card and an Intel Xeon E5-2640 CPU, respectively. The combined accuracy, efficiency and portability make goCMC attractive for research and clinical applications in carbon ion therapy.
      

      
      Initial development of goCMC: a GPU-oriented fast cross-platform Monte Carlo engine for carbon ion therapy.
      PubMed
      Qin, Nan; Pinto, Marco; Tian, Zhen; Dedes, Georgios; Pompos, Arnold; Jiang, Steve B; Parodi, Katia; Jia, Xun
         2017-05-07
         Monte Carlo (MC) simulation is considered as the most accurate method for calculation of absorbed dose and fundamental physics quantities related to biological effects in carbon ion therapy. To improve its computational efficiency, we have developed a GPU-oriented fast MC package named goCMC, for carbon therapy. goCMC simulates particle transport in voxelized geometry with kinetic energy up to 450 MeV u -1 . Class II condensed history simulation scheme with a continuous slowing down approximation was employed. Energy straggling and multiple scattering were modeled. δ-electrons were terminated with their energy locally deposited. Four types of nuclear interactions were implemented in goCMC, i.e. carbon-hydrogen, carbon-carbon, carbon-oxygen and carbon-calcium inelastic collisions. Total cross section data from Geant4 were used. Secondary particles produced in these interactions were sampled according to particle yield with energy and directional distribution data derived from Geant4 simulation results. Secondary charged particles were transported following the condensed history scheme, whereas secondary neutral particles were ignored. goCMC was developed under OpenCL framework and is executable on different platforms, e.g. GPU and multi-core CPU. We have validated goCMC with Geant4 in cases with different beam energy and phantoms including four homogeneous phantoms, one heterogeneous half-slab phantom, and one patient case. For each case [Formula: see text] carbon ions were simulated, such that in the region with dose greater than 10% of maximum dose, the mean relative statistical uncertainty was less than 1%. Good agreements for dose distributions and range estimations between goCMC and Geant4 were observed. 3D gamma passing rates with 1%/1 mm criterion were over 90% within 10% isodose line except in two extreme cases, and those with 2%/1 mm criterion were all over 96%. Efficiency and code portability were tested with different GPUs and CPUs. Depending on the beam energy and voxel size, the computation time to simulate [Formula: see text] carbons was 9.9-125 s, 2.5-50 s and 60-612 s on an AMD Radeon GPU card, an NVidia GeForce GTX 1080 GPU card and an Intel Xeon E5-2640 CPU, respectively. The combined accuracy, efficiency and portability make goCMC attractive for research and clinical applications in carbon ion therapy.
      

      
      Optimizing meridional advection of the Advanced Research WRF (ARW) dynamics for Intel Xeon Phi coprocessor
      NASA Astrophysics Data System (ADS)
      Mielikainen, Jarno; Huang, Bormin; Huang, Allen H.-L.
         2015-05-01
         The most widely used community weather forecast and research model in the world is the Weather Research and Forecast (WRF) model. Two distinct varieties of WRF exist. The one we are interested is the Advanced Research WRF (ARW) is an experimental, advanced research version featuring very high resolution. The WRF Nonhydrostatic Mesoscale Model (WRF-NMM) has been designed for forecasting operations. WRF consists of dynamics code and several physics modules. The WRF-ARW core is based on an Eulerian solver for the fully compressible nonhydrostatic equations. In the paper, we optimize a meridional (north-south direction) advection subroutine for Intel Xeon Phi coprocessor. Advection is of the most time consuming routines in the ARW dynamics core. It advances the explicit perturbation horizontal momentum equations by adding in the large-timestep tendency along with the small timestep pressure gradient tendency. We will describe the challenges we met during the development of a high-speed dynamics code subroutine for MIC architecture. Furthermore, lessons learned from the code optimization process will be discussed. The results show that the optimizations improved performance of the original code on Xeon Phi 7120P by a factor of 1.2x.
      

      
      GPU-Accelerated Molecular Modeling Coming Of Age
      PubMed Central
      Stone, John E.; Hardy, David J.; Ufimtsev, Ivan S.
         2010-01-01
         Graphics processing units (GPUs) have traditionally been used in molecular modeling solely for visualization of molecular structures and animation of trajectories resulting from molecular dynamics simulations. Modern GPUs have evolved into fully programmable, massively parallel co-processors that can now be exploited to accelerate many scientific computations, typically providing about one order of magnitude speedup over CPU code and in special cases providing speedups of two orders of magnitude. This paper surveys the development of molecular modeling algorithms that leverage GPU computing, the advances already made and remaining issues to be resolved, and the continuing evolution of GPU technology that promises to become even more useful to molecular modeling. Hardware acceleration with commodity GPUs is expected to benefit the overall computational biology community by bringing teraflops performance to desktop workstations and in some cases potentially changing what were formerly batch-mode computational jobs into interactive tasks. PMID:20675161
      

        
       
          

«

19
      20
      21
   22
      23
      »

          
        

     

   

   
       
            
              
          

«

20
      21
      22
   23
      24
      »

          
        

           
           
             
               
      
      Novel Hybrid Scheduling Technique for Sensor Nodes with Mixed Criticality Tasks.
      PubMed
      Micea, Mihai-Victor; Stangaciu, Cristina-Sorina; Stangaciu, Valentin; Curiac, Daniel-Ioan
         2017-06-26
         Sensor networks become increasingly a key technology for complex control applications. Their potential use in safety- and time-critical domains has raised the need for task scheduling mechanisms specially adapted to sensor node specific requirements, often materialized in predictable jitter-less execution of tasks characterized by different criticality levels. This paper offers an efficient scheduling solution, named Hybrid Hard Real-Time Scheduling (H²RTS), which combines a static, clock driven method with a dynamic, event driven scheduling technique, in order to provide high execution predictability, while keeping a high node Central Processing Unit (CPU) utilization factor. From the detailed, integrated schedulability analysis of the H²RTS, a set of sufficiency tests are introduced and demonstrated based on the processor demand and linear upper bound metrics. The performance and correct behavior of the proposed hybrid scheduling technique have been extensively evaluated and validated both on a simulator and on a sensor mote equipped with ARM7 microcontroller.
      

      
      GPU Based Software Correlators - Perspectives for VLBI2010
      NASA Technical Reports Server (NTRS)
      Hobiger, Thomas; Kimura, Moritaka; Takefuji, Kazuhiro; Oyama, Tomoaki; Koyama, Yasuhiro; Kondo, Tetsuro; Gotoh, Tadahiro; Amagai, Jun
         2010-01-01
         Caused by historical separation and driven by the requirements of the PC gaming industry, Graphics Processing Units (GPUs) have evolved to massive parallel processing systems which entered the area of non-graphic related applications. Although a single processing core on the GPU is much slower and provides less functionality than its counterpart on the CPU, the huge number of these small processing entities outperforms the classical processors when the application can be parallelized. Thus, in recent years various radio astronomical projects have started to make use of this technology either to realize the correlator on this platform or to establish the post-processing pipeline with GPUs. Therefore, the feasibility of GPUs as a choice for a VLBI correlator is being investigated, including pros and cons of this technology. Additionally, a GPU based software correlator will be reviewed with respect to energy consumption/GFlop/sec and cost/GFlop/sec.
      

      
      Reaching multi-nanosecond timescales in combined QM/MM molecular dynamics simulations through parallel horsetail sampling.
      PubMed
      Martins-Costa, Marilia T C; Ruiz-López, Manuel F
         2017-04-15
         We report an enhanced sampling technique that allows to reach the multi-nanosecond timescale in quantum mechanics/molecular mechanics molecular dynamics simulations. The proposed technique, called horsetail sampling, is a specific type of multiple molecular dynamics approach exhibiting high parallel efficiency. It couples a main simulation with a large number of shorter trajectories launched on independent processors at periodic time intervals. The technique is applied to study hydrogen peroxide at the water liquid-vapor interface, a system of considerable atmospheric relevance. A total simulation time of a little more than 6 ns has been attained for a total CPU time of 5.1 years representing only about 20 days of wall-clock time. The discussion of the results highlights the strong influence of the solvation effects at the interface on the structure and the electronic properties of the solute. © 2017 Wiley Periodicals, Inc. © 2017 Wiley Periodicals, Inc.
      

      
      An efficient nonlinear relaxation technique for the three-dimensional, Reynolds-averaged Navier-Stokes equations
      NASA Technical Reports Server (NTRS)
      Edwards, Jack R.; Mcrae, D. S.
         1993-01-01
         An efficient implicit method for the computation of steady, three-dimensional, compressible Navier-Stokes flowfields is presented. A nonlinear iteration strategy based on planar Gauss-Seidel sweeps is used to drive the solution toward a steady state, with approximate factorization errors within a crossflow plane reduced by the application of a quasi-Newton technique. A hybrid discretization approach is employed, with flux-vector splitting utilized in the streamwise direction and central differences with artificial dissipation used for the transverse fluxes. Convergence histories and comparisons with experimental data are presented for several 3-D shock-boundary layer interactions. Both laminar and turbulent cases are considered, with turbulent closure provided by a modification of the Baldwin-Barth one-equation model. For the problems considered (175,000-325,000 mesh points), the algorithm provides steady-state convergence in 900-2000 CPU seconds on a single processor of a Cray Y-MP.
      

      
      GPU-accelerated molecular modeling coming of age.
      PubMed
      Stone, John E; Hardy, David J; Ufimtsev, Ivan S; Schulten, Klaus
         2010-09-01
         Graphics processing units (GPUs) have traditionally been used in molecular modeling solely for visualization of molecular structures and animation of trajectories resulting from molecular dynamics simulations. Modern GPUs have evolved into fully programmable, massively parallel co-processors that can now be exploited to accelerate many scientific computations, typically providing about one order of magnitude speedup over CPU code and in special cases providing speedups of two orders of magnitude. This paper surveys the development of molecular modeling algorithms that leverage GPU computing, the advances already made and remaining issues to be resolved, and the continuing evolution of GPU technology that promises to become even more useful to molecular modeling. Hardware acceleration with commodity GPUs is expected to benefit the overall computational biology community by bringing teraflops performance to desktop workstations and in some cases potentially changing what were formerly batch-mode computational jobs into interactive tasks. (c) 2010 Elsevier Inc. All rights reserved.
      

      
      Toward Millions of File System IOPS on Low-Cost, Commodity Hardware
      PubMed Central
      Zheng, Da; Burns, Randal; Szalay, Alexander S.
         2013-01-01
         We describe a storage system that removes I/O bottlenecks to achieve more than one million IOPS based on a user-space file abstraction for arrays of commodity SSDs. The file abstraction refactors I/O scheduling and placement for extreme parallelism and non-uniform memory and I/O. The system includes a set-associative, parallel page cache in the user space. We redesign page caching to eliminate CPU overhead and lock-contention in non-uniform memory architecture machines. We evaluate our design on a 32 core NUMA machine with four, eight-core processors. Experiments show that our design delivers 1.23 million 512-byte read IOPS. The page cache realizes the scalable IOPS of Linux asynchronous I/O (AIO) and increases user-perceived I/O performance linearly with cache hit rates. The parallel, set-associative cache matches the cache hit rates of the global Linux page cache under real workloads. PMID:24402052
      

      
      Toward Millions of File System IOPS on Low-Cost, Commodity Hardware.
      PubMed
      Zheng, Da; Burns, Randal; Szalay, Alexander S
         2013-01-01
         We describe a storage system that removes I/O bottlenecks to achieve more than one million IOPS based on a user-space file abstraction for arrays of commodity SSDs. The file abstraction refactors I/O scheduling and placement for extreme parallelism and non-uniform memory and I/O. The system includes a set-associative, parallel page cache in the user space. We redesign page caching to eliminate CPU overhead and lock-contention in non-uniform memory architecture machines. We evaluate our design on a 32 core NUMA machine with four, eight-core processors. Experiments show that our design delivers 1.23 million 512-byte read IOPS. The page cache realizes the scalable IOPS of Linux asynchronous I/O (AIO) and increases user-perceived I/O performance linearly with cache hit rates. The parallel, set-associative cache matches the cache hit rates of the global Linux page cache under real workloads.
      

      
      Programming for 1.6 Millon cores: Early experiences with IBM's BG/Q SMP architecture
      NASA Astrophysics Data System (ADS)
      Glosli, James
         2013-03-01
         With the stall in clock cycle improvements a decade ago, the drive for computational performance has continues along a path of increasing core counts on a processor. The multi-core evolution has been expressed in both a symmetric multi processor (SMP) architecture and cpu/GPU architecture. Debates rage in the high performance computing (HPC) community which architecture best serves HPC. In this talk I will not attempt to resolve that debate but perhaps fuel it. I will discuss the experience of exploiting Sequoia, a 98304 node IBM Blue Gene/Q SMP at Lawrence Livermore National Laboratory. The advantages and challenges of leveraging the computational power BG/Q will be detailed through the discussion of two applications. The first application is a Molecular Dynamics code called ddcMD. This is a code developed over the last decade at LLNL and ported to BG/Q. The second application is a cardiac modeling code called Cardioid. This is a code that was recently designed and developed at LLNL to exploit the fine scale parallelism of BG/Q's SMP architecture. Through the lenses of these efforts I'll illustrate the need to rethink how we express and implement our computational approaches. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.
      

      
      Bringing MapReduce Closer To Data With Active Drives
      NASA Astrophysics Data System (ADS)
      Golpayegani, N.; Prathapan, S.; Warmka, R.; Wyatt, B.; Halem, M.; Trantham, J. D.; Markey, C. A.
         2017-12-01
         Moving computation closer to the data location has been a much theorized improvement to computation for decades. The increase in processor performance, the decrease in processor size and power requirement combined with the increase in data intensive computing has created a push to move computation as close to data as possible. We will show the next logical step in this evolution in computing: moving computation directly to storage. Hypothetical systems, known as Active Drives, have been proposed as early as 1998. These Active Drives would have a general-purpose CPU on each disk allowing for computations to be performed on them without the need to transfer the data to the computer over the system bus or via a network. We will utilize Seagate's Active Drives to perform general purpose parallel computing using the MapReduce programming model directly on each drive. We will detail how the MapReduce programming model can be adapted to the Active Drive compute model to perform general purpose computing with comparable results to traditional MapReduce computations performed via Hadoop. We will show how an Active Drive based approach significantly reduces the amount of data leaving the drive when performing several common algorithms: subsetting and gridding. We will show that an Active Drive based design significantly improves data transfer speeds into and out of drives compared to Hadoop's HDFS while at the same time keeping comparable compute speeds as Hadoop.
      

      
      Mongoose: Creation of a Rad-Hard MIPS R3000
      NASA Technical Reports Server (NTRS)
      Lincoln, Dan; Smith, Brian
         1993-01-01
         This paper describes the development of a 32 Bit, full MIPS R3000 code-compatible Rad-Hard CPU, code named Mongoose. Mongoose progressed from contract award, through the design cycle, to operational silicon in 12 months to meet a space mission for NASA. The goal was the creation of a fully static device capable of operation to the maximum Mil-883 derated speed, worst-case post-rad exposure with full operational integrity. This included consideration of features for functional enhancements relating to mission compatibility and removal of commercial practices not supported by Rad-Hard technology. 'Mongoose' developed from an evolution of LSI Logic's MIPS-I embedded processor, LR33000, code named Cobra, to its Rad-Hard 'equivalent', Mongoose. The term 'equivalent' is used to infer that the core of the processor is functionally identical, allowing the same use and optimizations of the MIPS-I Instruction Set software tool suite for compilation, software program trace, etc. This activity was started in September of 1991 under a contract from NASA-Goddard Space Flight Center (GSFC)-Flight Data Systems. The approach affected a teaming of NASA-GSFC for program development, LSI Logic for system and ASIC design coupled with the Rad-Hard process technology, and Harris (GASD) for Rad-Hard microprocessor design expertise. The program culminated with the generation of Rad-Hard Mongoose prototypes one year later.
      

      
      Mobile high-performance computing (HPC) for synthetic aperture radar signal processing
      NASA Astrophysics Data System (ADS)
      Misko, Joshua; Kim, Youngsoo; Qi, Chenchen; Sirkeci, Birsen
         2018-04-01
         The importance of mobile high-performance computing has emerged in numerous battlespace applications at the tactical edge in hostile environments. Energy efficient computing power is a key enabler for diverse areas ranging from real-time big data analytics and atmospheric science to network science. However, the design of tactical mobile data centers is dominated by power, thermal, and physical constraints. Presently, it is very unlikely to achieve required computing processing power by aggregating emerging heterogeneous many-core processing platforms consisting of CPU, Field Programmable Gate Arrays and Graphic Processor cores constrained by power and performance. To address these challenges, we performed a Synthetic Aperture Radar case study for Automatic Target Recognition (ATR) using Deep Neural Networks (DNNs). However, these DNN models are typically trained using GPUs with gigabytes of external memories and massively used 32-bit floating point operations. As a result, DNNs do not run efficiently on hardware appropriate for low power or mobile applications. To address this limitation, we proposed for compressing DNN models for ATR suited to deployment on resource constrained hardware. This proposed compression framework utilizes promising DNN compression techniques including pruning and weight quantization while also focusing on processor features common to modern low-power devices. Following this methodology as a guideline produced a DNN for ATR tuned to maximize classification throughput, minimize power consumption, and minimize memory footprint on a low-power device.
      

      
      Processor core for real time background identification of HD video based on OpenCV Gaussian mixture model algorithm
      NASA Astrophysics Data System (ADS)
      Genovese, Mariangela; Napoli, Ettore
         2013-05-01
         The identification of moving objects is a fundamental step in computer vision processing chains. The development of low cost and lightweight smart cameras steadily increases the request of efficient and high performance circuits able to process high definition video in real time. The paper proposes two processor cores aimed to perform the real time background identification on High Definition (HD, 1920 1080 pixel) video streams. The implemented algorithm is the OpenCV version of the Gaussian Mixture Model (GMM), an high performance probabilistic algorithm for the segmentation of the background that is however computationally intensive and impossible to implement on general purpose CPU with the constraint of real time processing. In the proposed paper, the equations of the OpenCV GMM algorithm are optimized in such a way that a lightweight and low power implementation of the algorithm is obtained. The reported performances are also the result of the use of state of the art truncated binary multipliers and ROM compression techniques for the implementation of the non-linear functions. The first circuit has commercial FPGA devices as a target and provides speed and logic resource occupation that overcome previously proposed implementations. The second circuit is oriented to an ASIC (UMC-90nm) standard cell implementation. Both implementations are able to process more than 60 frames per second in 1080p format, a frame rate compatible with HD television.
      

      
      Using Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data
      NASA Astrophysics Data System (ADS)
      O'Connor, A. S.; Justice, B.; Harris, A. T.
         2013-12-01
         Graphics Processing Units (GPUs) are high-performance multiple-core processors capable of very high computational speeds and large data throughput. Modern GPUs are inexpensive and widely available commercially. These are general-purpose parallel processors with support for a variety of programming interfaces, including industry standard languages such as C. GPU implementations of algorithms that are well suited for parallel processing can often achieve speedups of several orders of magnitude over optimized CPU codes. Significant improvements in speeds for imagery orthorectification, atmospheric correction, target detection and image transformations like Independent Components Analsyis (ICA) have been achieved using GPU-based implementations. Additional optimizations, when factored in with GPU processing capabilities, can provide 50x - 100x reduction in the time required to process large imagery. Exelis Visual Information Solutions (VIS) has implemented a CUDA based GPU processing frame work for accelerating ENVI and IDL processes that can best take advantage of parallelization. Testing Exelis VIS has performed shows that orthorectification can take as long as two hours with a WorldView1 35,0000 x 35,000 pixel image. With GPU orthorecification, the same orthorectification process takes three minutes. By speeding up image processing, imagery can successfully be used by first responders, scientists making rapid discoveries with near real time data, and provides an operational component to data centers needing to quickly process and disseminate data.
      

      
      Heterogeneous computing architecture for fast detection of SNP-SNP interactions.
      PubMed
      Sluga, Davor; Curk, Tomaz; Zupan, Blaz; Lotric, Uros
         2014-06-25
         The extent of data in a typical genome-wide association study (GWAS) poses considerable computational challenges to software tools for gene-gene interaction discovery. Exhaustive evaluation of all interactions among hundreds of thousands to millions of single nucleotide polymorphisms (SNPs) may require weeks or even months of computation. Massively parallel hardware within a modern Graphic Processing Unit (GPU) and Many Integrated Core (MIC) coprocessors can shorten the run time considerably. While the utility of GPU-based implementations in bioinformatics has been well studied, MIC architecture has been introduced only recently and may provide a number of comparative advantages that have yet to be explored and tested. We have developed a heterogeneous, GPU and Intel MIC-accelerated software module for SNP-SNP interaction discovery to replace the previously single-threaded computational core in the interactive web-based data exploration program SNPsyn. We report on differences between these two modern massively parallel architectures and their software environments. Their utility resulted in an order of magnitude shorter execution times when compared to the single-threaded CPU implementation. GPU implementation on a single Nvidia Tesla K20 runs twice as fast as that for the MIC architecture-based Xeon Phi P5110 coprocessor, but also requires considerably more programming effort. General purpose GPUs are a mature platform with large amounts of computing power capable of tackling inherently parallel problems, but can prove demanding for the programmer. On the other hand the new MIC architecture, albeit lacking in performance reduces the programming effort and makes it up with a more general architecture suitable for a wider range of problems.
      

      
      Heterogeneous computing architecture for fast detection of SNP-SNP interactions
      PubMed Central
      
         2014-01-01
         Background The extent of data in a typical genome-wide association study (GWAS) poses considerable computational challenges to software tools for gene-gene interaction discovery. Exhaustive evaluation of all interactions among hundreds of thousands to millions of single nucleotide polymorphisms (SNPs) may require weeks or even months of computation. Massively parallel hardware within a modern Graphic Processing Unit (GPU) and Many Integrated Core (MIC) coprocessors can shorten the run time considerably. While the utility of GPU-based implementations in bioinformatics has been well studied, MIC architecture has been introduced only recently and may provide a number of comparative advantages that have yet to be explored and tested. Results We have developed a heterogeneous, GPU and Intel MIC-accelerated software module for SNP-SNP interaction discovery to replace the previously single-threaded computational core in the interactive web-based data exploration program SNPsyn. We report on differences between these two modern massively parallel architectures and their software environments. Their utility resulted in an order of magnitude shorter execution times when compared to the single-threaded CPU implementation. GPU implementation on a single Nvidia Tesla K20 runs twice as fast as that for the MIC architecture-based Xeon Phi P5110 coprocessor, but also requires considerably more programming effort. Conclusions General purpose GPUs are a mature platform with large amounts of computing power capable of tackling inherently parallel problems, but can prove demanding for the programmer. On the other hand the new MIC architecture, albeit lacking in performance reduces the programming effort and makes it up with a more general architecture suitable for a wider range of problems. PMID:24964802
      

      
      A real-time coherent dedispersion pipeline for the giant metrewave radio telescope
      NASA Astrophysics Data System (ADS)
      De, Kishalay; Gupta, Yashwant
         2016-02-01
         A fully real-time coherent dedispersion system has been developed for the pulsar back-end at the Giant Metrewave Radio Telescope (GMRT). The dedispersion pipeline uses the single phased array voltage beam produced by the existing GMRT software back-end (GSB) to produce coherently dedispersed intensity output in real time, for the currently operational bandwidths of 16 MHz and 32 MHz. Provision has also been made to coherently dedisperse voltage beam data from observations recorded on disk. We discuss the design and implementation of the real-time coherent dedispersion system, describing the steps carried out to optimise the performance of the pipeline. Presently functioning on an Intel Xeon X5550 CPU equipped with a NVIDIA Tesla C2075 GPU, the pipeline allows dispersion free, high time resolution data to be obtained in real-time. We illustrate the significant improvements over the existing incoherent dedispersion system at the GMRT, and present some preliminary results obtained from studies of pulsars using this system, demonstrating its potential as a useful tool for low frequency pulsar observations. We describe the salient features of our implementation, comparing it with other recently developed real-time coherent dedispersion systems. This implementation of a real-time coherent dedispersion pipeline for a large, low frequency array instrument like the GMRT, will enable long-term observing programs using coherent dedispersion to be carried out routinely at the observatory. We also outline the possible improvements for such a pipeline, including prospects for the upgraded GMRT which will have bandwidths about ten times larger than at present.
      

      
      Fast analysis of molecular dynamics trajectories with graphics processing units-Radial distribution function histogramming
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Levine, Benjamin G., E-mail: ben.levine@temple.ed; Stone, John E., E-mail: johns@ks.uiuc.ed; Kohlmeyer, Axel, E-mail: akohlmey@temple.ed
         2011-05-01
         The calculation of radial distribution functions (RDFs) from molecular dynamics trajectory data is a common and computationally expensive analysis task. The rate limiting step in the calculation of the RDF is building a histogram of the distance between atom pairs in each trajectory frame. Here we present an implementation of this histogramming scheme for multiple graphics processing units (GPUs). The algorithm features a tiling scheme to maximize the reuse of data at the fastest levels of the GPU's memory hierarchy and dynamic load balancing to allow high performance on heterogeneous configurations of GPUs. Several versions of the RDF algorithm aremore » presented, utilizing the specific hardware features found on different generations of GPUs. We take advantage of larger shared memory and atomic memory operations available on state-of-the-art GPUs to accelerate the code significantly. The use of atomic memory operations allows the fast, limited-capacity on-chip memory to be used much more efficiently, resulting in a fivefold increase in performance compared to the version of the algorithm without atomic operations. The ultimate version of the algorithm running in parallel on four NVIDIA GeForce GTX 480 (Fermi) GPUs was found to be 92 times faster than a multithreaded implementation running on an Intel Xeon 5550 CPU. On this multi-GPU hardware, the RDF between two selections of 1,000,000 atoms each can be calculated in 26.9 s per frame. The multi-GPU RDF algorithms described here are implemented in VMD, a widely used and freely available software package for molecular dynamics visualization and analysis.« less
      

      
      Fast Analysis of Molecular Dynamics Trajectories with Graphics Processing Units—Radial Distribution Function Histogramming
      PubMed Central
      Stone, John E.; Kohlmeyer, Axel
         2011-01-01
         The calculation of radial distribution functions (RDFs) from molecular dynamics trajectory data is a common and computationally expensive analysis task. The rate limiting step in the calculation of the RDF is building a histogram of the distance between atom pairs in each trajectory frame. Here we present an implementation of this histogramming scheme for multiple graphics processing units (GPUs). The algorithm features a tiling scheme to maximize the reuse of data at the fastest levels of the GPU’s memory hierarchy and dynamic load balancing to allow high performance on heterogeneous configurations of GPUs. Several versions of the RDF algorithm are presented, utilizing the specific hardware features found on different generations of GPUs. We take advantage of larger shared memory and atomic memory operations available on state-of-the-art GPUs to accelerate the code significantly. The use of atomic memory operations allows the fast, limited-capacity on-chip memory to be used much more efficiently, resulting in a fivefold increase in performance compared to the version of the algorithm without atomic operations. The ultimate version of the algorithm running in parallel on four NVIDIA GeForce GTX 480 (Fermi) GPUs was found to be 92 times faster than a multithreaded implementation running on an Intel Xeon 5550 CPU. On this multi-GPU hardware, the RDF between two selections of 1,000,000 atoms each can be calculated in 26.9 seconds per frame. The multi-GPU RDF algorithms described here are implemented in VMD, a widely used and freely available software package for molecular dynamics visualization and analysis. PMID:21547007
      

      
      Wilson Dslash Kernel From Lattice QCD Optimization
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Joo, Balint; Smelyanskiy, Mikhail; Kalamkar, Dhiraj D.
         2015-07-01
         Lattice Quantum Chromodynamics (LQCD) is a numerical technique used for calculations in Theoretical Nuclear and High Energy Physics. LQCD is traditionally one of the first applications ported to many new high performance computing architectures and indeed LQCD practitioners have been known to design and build custom LQCD computers. Lattice QCD kernels are frequently used as benchmarks (e.g. 168.wupwise in the SPEC suite) and are generally well understood, and as such are ideal to illustrate several optimization techniques. In this chapter we will detail our work in optimizing the Wilson-Dslash kernels for Intel Xeon Phi, however, as we will show themore » technique gives excellent performance on regular Xeon Architecture as well.« less
      

      
      Implementation of High-Order Multireference Coupled-Cluster Methods on Intel Many Integrated Core Architecture.
      PubMed
      Aprà, E; Kowalski, K
         2016-03-08
         In this paper we discuss the implementation of multireference coupled-cluster formalism with singles, doubles, and noniterative triples (MRCCSD(T)), which is capable of taking advantage of the processing power of the Intel Xeon Phi coprocessor. We discuss the integration of two levels of parallelism underlying the MRCCSD(T) implementation with computational kernels designed to offload the computationally intensive parts of the MRCCSD(T) formalism to Intel Xeon Phi coprocessors. Special attention is given to the enhancement of the parallel performance by task reordering that has improved load balancing in the noniterative part of the MRCCSD(T) calculations. We also discuss aspects regarding efficient optimization and vectorization strategies.
      

        
       
          

«

20
      21
      22
   23
      24
      »

          
        

     

   

   
       
            
              
          

«

21
      22
      23
   24
      25
      »

          
        

           
           
             
               
      
      HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi
      DOE PAGES
      Dongarra, Jack; Gates, Mark; Haidar, Azzam; ...
         2015-01-01
         This paper presents the design and implementation of several fundamental dense linear algebra (DLA) algorithms for multicore with Intel Xeon Phi coprocessors. In particular, we consider algorithms for solving linear systems. Further, we give an overview of the MAGMA MIC library, an open source, high performance library, that incorporates the developments presented here and, more broadly, provides the DLA functionality equivalent to that of the popular LAPACK library while targeting heterogeneous architectures that feature a mix of multicore CPUs and coprocessors. The LAPACK-compliance simplifies the use of the MAGMA MIC library in applications, while providing them with portably performant DLA.more » High performance is obtained through the use of the high-performance BLAS, hardware-specific tuning, and a hybridization methodology whereby we split the algorithm into computational tasks of various granularities. Execution of those tasks is properly scheduled over the heterogeneous hardware by minimizing data movements and mapping algorithmic requirements to the architectural strengths of the various heterogeneous hardware components. Our methodology and programming techniques are incorporated into the MAGMA MIC API, which abstracts the application developer from the specifics of the Xeon Phi architecture and is therefore applicable to algorithms beyond the scope of DLA.« less
      

      
      An integrated pipeline of open source software adapted for multi-CPU architectures: use in the large-scale identification of single nucleotide polymorphisms.
      PubMed
      Jayashree, B; Hanspal, Manindra S; Srinivasan, Rajgopal; Vigneshwaran, R; Varshney, Rajeev K; Spurthi, N; Eshwar, K; Ramesh, N; Chandra, S; Hoisington, David A
         2007-01-01
         The large amounts of EST sequence data available from a single species of an organism as well as for several species within a genus provide an easy source of identification of intra- and interspecies single nucleotide polymorphisms (SNPs). In the case of model organisms, the data available are numerous, given the degree of redundancy in the deposited EST data. There are several available bioinformatics tools that can be used to mine this data; however, using them requires a certain level of expertise: the tools have to be used sequentially with accompanying format conversion and steps like clustering and assembly of sequences become time-intensive jobs even for moderately sized datasets. We report here a pipeline of open source software extended to run on multiple CPU architectures that can be used to mine large EST datasets for SNPs and identify restriction sites for assaying the SNPs so that cost-effective CAPS assays can be developed for SNP genotyping in genetics and breeding applications. At the International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), the pipeline has been implemented to run on a Paracel high-performance system consisting of four dual AMD Opteron processors running Linux with MPICH. The pipeline can be accessed through user-friendly web interfaces at http://hpc.icrisat.cgiar.org/PBSWeb and is available on request for academic use. We have validated the developed pipeline by mining chickpea ESTs for interspecies SNPs, development of CAPS assays for SNP genotyping, and confirmation of restriction digestion pattern at the sequence level.
      

      
      Monitoring of computing resource use of active software releases at ATLAS
      NASA Astrophysics Data System (ADS)
      Limosani, Antonio; ATLAS Collaboration
         2017-10-01
         The LHC is the world’s most powerful particle accelerator, colliding protons at centre of mass energy of 13 TeV. As the energy and frequency of collisions has grown in the search for new physics, so too has demand for computing resources needed for event reconstruction. We will report on the evolution of resource usage in terms of CPU and RAM in key ATLAS offline reconstruction workflows at the TierO at CERN and on the WLCG. Monitoring of workflows is achieved using the ATLAS PerfMon package, which is the standard ATLAS performance monitoring system running inside Athena jobs. Systematic daily monitoring has recently been expanded to include all workflows beginning at Monte Carlo generation through to end-user physics analysis, beyond that of event reconstruction. Moreover, the move to a multiprocessor mode in production jobs has facilitated the use of tools, such as “MemoryMonitor”, to measure the memory shared across processors in jobs. Resource consumption is broken down into software domains and displayed in plots generated using Python visualization libraries and collected into pre-formatted auto-generated Web pages, which allow the ATLAS developer community to track the performance of their algorithms. This information is however preferentially filtered to domain leaders and developers through the use of JIRA and via reports given at ATLAS software meetings. Finally, we take a glimpse of the future by reporting on the expected CPU and RAM usage in benchmark workflows associated with the High Luminosity LHC and anticipate the ways performance monitoring will evolve to understand and benchmark future workflows.
      

      
      Conceptual design of the X-IFU Instrument Control Unit on board the ESA Athena mission
      NASA Astrophysics Data System (ADS)
      Corcione, L.; Ligori, S.; Capobianco, V.; Bonino, D.; Valenziano, L.; Guizzo, G. P.
         2016-07-01
         Athena is one of L-class missions selected in the ESA Cosmic Vision 2015-2025 program for the science theme of the Hot and Energetic Universe. The Athena model payload includes the X-ray Integral Field Unit (X-IFU), an advanced actively shielded X-ray microcalorimeter spectrometer for high spectral resolution imaging, utilizing cooled Transition Edge Sensors. This paper describes the preliminary architecture of Instrument Control Unit (ICU), which is aimed at operating all XIFU's subsystems, as well as at implementing the main functional interfaces of the instrument with the S/C control unit. The ICU functions include the TC/TM management with S/C, science data formatting and transmission to S/C Mass Memory, housekeeping data handling, time distribution for synchronous operations and the management of the X-IFU components (i.e. CryoCoolers, Filter Wheel, Detector Readout Electronics Event Processor, Power Distribution Unit). ICU functions baseline implementation for the phase-A study foresees the usage of standard and Space-qualified components from the heritage of past and current space missions (e.g. Gaia, Euclid), which currently encompasses Leon2/Leon3 based CPU board and standard Space-qualified interfaces for the exchange commands and data between ICU and X-IFU subsystems. Alternative architecture, arranged around a powerful PowerPC-based CPU, is also briefly presented, with the aim of endowing the system with enhanced hardware resources and processing power capability, for the handling of control and science data processing tasks not defined yet at this stage of the mission study.
      

      
      Performance optimization of Qbox and WEST on Intel Knights Landing
      NASA Astrophysics Data System (ADS)
      Zheng, Huihuo; Knight, Christopher; Galli, Giulia; Govoni, Marco; Gygi, Francois
         
         We present the optimization of electronic structure codes Qbox and WEST targeting the Intel®Xeon Phi™processor, codenamed Knights Landing (KNL). Qbox is an ab-initio molecular dynamics code based on plane wave density functional theory (DFT) and WEST is a post-DFT code for excited state calculations within many-body perturbation theory. Both Qbox and WEST employ highly scalable algorithms which enable accurate large-scale electronic structure calculations on leadership class supercomputer platforms beyond 100,000 cores, such as Mira and Theta at the Argonne Leadership Computing Facility. In this work, features of the KNL architecture (e.g. hierarchical memory) are explored to achieve higher performance in key algorithms of the Qbox and WEST codes and to develop a road-map for further development targeting next-generation computing architectures. In particular, the optimizations of the Qbox and WEST codes on the KNL platform will target efficient large-scale electronic structure calculations of nanostructured materials exhibiting complex structures and prediction of their electronic and thermal properties for use in solar and thermal energy conversion device. This work was supported by MICCoM, as part of Comp. Mats. Sci. Program funded by the U.S. DOE, Office of Sci., BES, MSE Division. This research used resources of the ALCF, which is a DOE Office of Sci. User Facility under Contract DE-AC02-06CH11357.
      

      
      
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Yeung, Yu-Hong; Pothen, Alex; Halappanavar, Mahantesh
         
         We present an augmented matrix approach to update the solution to a linear system of equations when the coefficient matrix is modified by a few elements within a principal submatrix. This problem arises in the dynamic security analysis of a power grid, where operators need to performmore » $N-x$ contingency analysis, i.e., determine the state of the system when up to $x$ links from $N$ fail. Our algorithms augment the coefficient matrix to account for the changes in it, and then compute the solution to the augmented system without refactoring the modified matrix. We provide two algorithms, a direct method, and a hybrid direct-iterative method for solving the augmented system. We also exploit the sparsity of the matrices and vectors to accelerate the overall computation. Our algorithms are compared on three power grids with PARDISO, a parallel direct solver, and CHOLMOD, a direct solver with the ability to modify the Cholesky factors of the coefficient matrix. We show that our augmented algorithms outperform PARDISO (by two orders of magnitude), and CHOLMOD (by a factor of up to 5). Further, our algorithms scale better than CHOLMOD as the number of elements updated increases. The solutions are computed with high accuracy. Our algorithms are capable of computing $N-x$ contingency analysis on a $778K$ bus grid, updating a solution with $x=20$ elements in $$1.6 \\times 10^{-2}$$ seconds on an Intel Xeon processor.« less
      

      
      Multiparametric fat-water separation method for fast chemical-shift imaging guidance of thermal therapies.
      PubMed
      Lin, Jonathan S; Hwang, Ken-Pin; Jackson, Edward F; Hazle, John D; Stafford, R Jason; Taylor, Brian A
         2013-10-01
         A k-means-based classification algorithm is investigated to assess suitability for rapidly separating and classifying fat/water spectral peaks from a fast chemical shift imaging technique for magnetic resonance temperature imaging. Algorithm testing is performed in simulated mathematical phantoms and agar gel phantoms containing mixed fat/water regions. Proton resonance frequencies (PRFs), apparent spin-spin relaxation (T2*) times, and T1-weighted (T1-W) amplitude values were calculated for each voxel using a single-peak autoregressive moving average (ARMA) signal model. These parameters were then used as criteria for k-means sorting, with the results used to determine PRF ranges of each chemical species cluster for further classification. To detect the presence of secondary chemical species, spectral parameters were recalculated when needed using a two-peak ARMA signal model during the subsequent classification steps. Mathematical phantom simulations involved the modulation of signal-to-noise ratios (SNR), maximum PRF shift (MPS) values, analysis window sizes, and frequency expansion factor sizes in order to characterize the algorithm performance across a variety of conditions. In agar, images were collected on a 1.5T clinical MR scanner using acquisition parameters close to simulation, and algorithm performance was assessed by comparing classification results to manually segmented maps of the fat/water regions. Performance was characterized quantitatively using the Dice Similarity Coefficient (DSC), sensitivity, and specificity. The simulated mathematical phantom experiments demonstrated good fat/water separation depending on conditions, specifically high SNR, moderate MPS value, small analysis window size, and low but nonzero frequency expansion factor size. Physical phantom results demonstrated good identification for both water (0.997 ± 0.001, 0.999 ± 0.001, and 0.986 ± 0.001 for DSC, sensitivity, and specificity, respectively) and fat (0.763 ± 0.006, 0.980 ± 0.004, and 0.941 ± 0.002 for DSC, sensitivity, and specificity, respectively). Temperature uncertainties, based on PRF uncertainties from a 5 × 5-voxel ROI, were 0.342 and 0.351°C for pure and mixed fat/water regions, respectively. Algorithm speed was tested using 25 × 25-voxel and whole image ROIs containing both fat and water, resulting in average processing times per acquisition of 2.00 ± 0.07 s and 146 ± 1 s, respectively, using uncompiled MATLAB scripts running on a shared CPU server with eight Intel Xeon(TM) E5640 quad-core processors (2.66 GHz, 12 MB cache) and 12 GB RAM. Results from both the mathematical and physical phantom suggest the k-means-based classification algorithm could be useful for rapid, dynamic imaging in an ROI for thermal interventions. Successful separation of fat/water information would aid in reducing errors from the nontemperature sensitive fat PRF, as well as potentially facilitate using fat as an internal reference for PRF shift thermometry when appropriate. Additionally, the T1-W or R2* signals may be used for monitoring temperature in surrounding adipose tissue.
      

      
      Cost efficient CFD simulations: Proper selection of domain partitioning strategies
      NASA Astrophysics Data System (ADS)
      Haddadi, Bahram; Jordan, Christian; Harasek, Michael
         2017-10-01
         Computational Fluid Dynamics (CFD) is one of the most powerful simulation methods, which is used for temporally and spatially resolved solutions of fluid flow, heat transfer, mass transfer, etc. One of the challenges of Computational Fluid Dynamics is the extreme hardware demand. Nowadays super-computers (e.g. High Performance Computing, HPC) featuring multiple CPU cores are applied for solving-the simulation domain is split into partitions for each core. Some of the different methods for partitioning are investigated in this paper. As a practical example, a new open source based solver was utilized for simulating packed bed adsorption, a common separation method within the field of thermal process engineering. Adsorption can for example be applied for removal of trace gases from a gas stream or pure gases production like Hydrogen. For comparing the performance of the partitioning methods, a 60 million cell mesh for a packed bed of spherical adsorbents was created; one second of the adsorption process was simulated. Different partitioning methods available in OpenFOAM® (Scotch, Simple, and Hierarchical) have been used with different numbers of sub-domains. The effect of the different methods and number of processor cores on the simulation speedup and also energy consumption were investigated for two different hardware infrastructures (Vienna Scientific Clusters VSC 2 and VSC 3). As a general recommendation an optimum number of cells per processor core was calculated. Optimized simulation speed, lower energy consumption and consequently the cost effects are reported here.
      

      
      A FAST ITERATIVE METHOD FOR SOLVING THE EIKONAL EQUATION ON TETRAHEDRAL DOMAINS
      PubMed Central
      Fu, Zhisong; Kirby, Robert M.; Whitaker, Ross T.
         2014-01-01
         Generating numerical solutions to the eikonal equation and its many variations has a broad range of applications in both the natural and computational sciences. Efficient solvers on cutting-edge, parallel architectures require new algorithms that may not be theoretically optimal, but that are designed to allow asynchronous solution updates and have limited memory access patterns. This paper presents a parallel algorithm for solving the eikonal equation on fully unstructured tetrahedral meshes. The method is appropriate for the type of fine-grained parallelism found on modern massively-SIMD architectures such as graphics processors and takes into account the particular constraints and capabilities of these computing platforms. This work builds on previous work for solving these equations on triangle meshes; in this paper we adapt and extend previous two-dimensional strategies to accommodate three-dimensional, unstructured, tetrahedralized domains. These new developments include a local update strategy with data compaction for tetrahedral meshes that provides solutions on both serial and parallel architectures, with a generalization to inhomogeneous, anisotropic speed functions. We also propose two new update schemes, specialized to mitigate the natural data increase observed when moving to three dimensions, and the data structures necessary for efficiently mapping data to parallel SIMD processors in a way that maintains computational density. Finally, we present descriptions of the implementations for a single CPU, as well as multicore CPUs with shared memory and SIMD architectures, with comparative results against state-of-the-art eikonal solvers. PMID:25221418
      

      
      Case Study of Using High Performance Commercial Processors in Space
      NASA Technical Reports Server (NTRS)
      Ferguson, Roscoe C.; Olivas, Zulema
         2009-01-01
         The purpose of the Space Shuttle Cockpit Avionics Upgrade project (1999 2004) was to reduce crew workload and improve situational awareness. The upgrade was to augment the Shuttle avionics system with new hardware and software. A major success of this project was the validation of the hardware architecture and software design. This was significant because the project incorporated new technology and approaches for the development of human rated space software. An early version of this system was tested at the Johnson Space Center for one month by teams of astronauts. The results were positive, but NASA eventually cancelled the project towards the end of the development cycle. The goal to reduce crew workload and improve situational awareness resulted in the need for high performance Central Processing Units (CPUs). The choice of CPU selected was the PowerPC family, which is a reduced instruction set computer (RISC) known for its high performance. However, the requirement for radiation tolerance resulted in the re-evaluation of the selected family member of the PowerPC line. Radiation testing revealed that the original selected processor (PowerPC 7400) was too soft to meet mission objectives and an effort was established to perform trade studies and performance testing to determine a feasible candidate. At that time, the PowerPC RAD750s were radiation tolerant, but did not meet the required performance needs of the project. Thus, the final solution was to select the PowerPC 7455. This processor did not have a radiation tolerant version, but had some ability to detect failures. However, its cache tags did not provide parity and thus the project incorporated a software strategy to detect radiation failures. The strategy was to incorporate dual paths for software generating commands to the legacy Space Shuttle avionics to prevent failures due to the softness of the upgraded avionics.
      

      
      Case Study of Using High Performance Commercial Processors in a Space Environment
      NASA Technical Reports Server (NTRS)
      Ferguson, Roscoe C.; Olivas, Zulema
         2009-01-01
         The purpose of the Space Shuttle Cockpit Avionics Upgrade project was to reduce crew workload and improve situational awareness. The upgrade was to augment the Shuttle avionics system with new hardware and software. A major success of this project was the validation of the hardware architecture and software design. This was significant because the project incorporated new technology and approaches for the development of human rated space software. An early version of this system was tested at the Johnson Space Center for one month by teams of astronauts. The results were positive, but NASA eventually cancelled the project towards the end of the development cycle. The goal to reduce crew workload and improve situational awareness resulted in the need for high performance Central Processing Units (CPUs). The choice of CPU selected was the PowerPC family, which is a reduced instruction set computer (RISC) known for its high performance. However, the requirement for radiation tolerance resulted in the reevaluation of the selected family member of the PowerPC line. Radiation testing revealed that the original selected processor (PowerPC 7400) was too soft to meet mission objectives and an effort was established to perform trade studies and performance testing to determine a feasible candidate. At that time, the PowerPC RAD750s where radiation tolerant, but did not meet the required performance needs of the project. Thus, the final solution was to select the PowerPC 7455. This processor did not have a radiation tolerant version, but faired better than the 7400 in the ability to detect failures. However, its cache tags did not provide parity and thus the project incorporated a software strategy to detect radiation failures. The strategy was to incorporate dual paths for software generating commands to the legacy Space Shuttle avionics to prevent failures due to the softness of the upgraded avionics.
      

      
      Tough2{_}MP: A parallel version of TOUGH2
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Zhang, Keni; Wu, Yu-Shu; Ding, Chris
         2003-04-09
         TOUGH2{_}MP is a massively parallel version of TOUGH2. It was developed for running on distributed-memory parallel computers to simulate large simulation problems that may not be solved by the standard, single-CPU TOUGH2 code. The new code implements an efficient massively parallel scheme, while preserving the full capacity and flexibility of the original TOUGH2 code. The new software uses the METIS software package for grid partitioning and AZTEC software package for linear-equation solving. The standard message-passing interface is adopted for communication among processors. Numerical performance of the current version code has been tested on CRAY-T3E and IBM RS/6000 SP platforms. Inmore » addition, the parallel code has been successfully applied to real field problems of multi-million-cell simulations for three-dimensional multiphase and multicomponent fluid and heat flow, as well as solute transport. In this paper, we will review the development of the TOUGH2{_}MP, and discuss the basic features, modules, and their applications.« less
      

      
      GPU-based streaming architectures for fast cone-beam CT image reconstruction and demons deformable registration.
      PubMed
      Sharp, G C; Kandasamy, N; Singh, H; Folkert, M
         2007-10-07
         This paper shows how to significantly accelerate cone-beam CT reconstruction and 3D deformable image registration using the stream-processing model. We describe data-parallel designs for the Feldkamp, Davis and Kress (FDK) reconstruction algorithm, and the demons deformable registration algorithm, suitable for use on a commodity graphics processing unit. The streaming versions of these algorithms are implemented using the Brook programming environment and executed on an NVidia 8800 GPU. Performance results using CT data of a preserved swine lung indicate that the GPU-based implementations of the FDK and demons algorithms achieve a substantial speedup--up to 80 times for FDK and 70 times for demons when compared to an optimized reference implementation on a 2.8 GHz Intel processor. In addition, the accuracy of the GPU-based implementations was found to be excellent. Compared with CPU-based implementations, the RMS differences were less than 0.1 Hounsfield unit for reconstruction and less than 0.1 mm for deformable registration.
      

      
      A Study about Kalman Filters Applied to Embedded Sensors
      PubMed Central
      Valade, Aurélien; Acco, Pascal; Grabolosa, Pierre; Fourniols, Jean-Yves
         2017-01-01
         Over the last decade, smart sensors have grown in complexity and can now handle multiple measurement sources. This work establishes a methodology to achieve better estimates of physical values by processing raw measurements within a sensor using multi-physical models and Kalman filters for data fusion. A driving constraint being production cost and power consumption, this methodology focuses on algorithmic complexity while meeting real-time constraints and improving both precision and reliability despite low power processors limitations. Consequently, processing time available for other tasks is maximized. The known problem of estimating a 2D orientation using an inertial measurement unit with automatic gyroscope bias compensation will be used to illustrate the proposed methodology applied to a low power STM32L053 microcontroller. This application shows promising results with a processing time of 1.18 ms at 32 MHz with a 3.8% CPU usage due to the computation at a 26 Hz measurement and estimation rate. PMID:29206187
      

      
      Libpsht - algorithms for efficient spherical harmonic transforms
      NASA Astrophysics Data System (ADS)
      Reinecke, M.
         2011-02-01
         Libpsht (or "library for performant spherical harmonic transforms") is a collection of algorithms for efficient conversion between spatial-domain and spectral-domain representations of data defined on the sphere. The package supports both transforms of scalars and spin-1 and spin-2 quantities, and can be used for a wide range of pixelisations (including HEALPix, GLESP, and ECP). It will take advantage of hardware features such as multiple processor cores and floating-point vector operations, if available. Even without this additional acceleration, the employed algorithms are among the most efficient (in terms of CPU time, as well as memory consumption) currently being used in the astronomical community. The library is written in strictly standard-conforming C90, ensuring portability to many different hard- and software platforms, and allowing straightforward integration with codes written in various programming languages like C, C++, Fortran, Python etc. Libpsht is distributed under the terms of the GNU General Public License (GPL) version 2 and can be downloaded from .
      

      
      Libpsht: Algorithms for Efficient Spherical Harmonic Transforms
      NASA Astrophysics Data System (ADS)
      Reinecke, Martin
         2010-10-01
         Libpsht (or "library for Performing Spherical Harmonic Transforms") is a collection of algorithms for efficient conversion between spatial-domain and spectral-domain representations of data defined on the sphere. The package supports transforms of scalars as well as spin-1 and spin-2 quantities, and can be used for a wide range of pixelisations (including HEALPix, GLESP and ECP). It will take advantage of hardware features like multiple processor cores and floating-point vector operations, if available. Even without this additional acceleration, the employed algorithms are among the most efficient (in terms of CPU time as well as memory consumption) currently being used in the astronomical community. The library is written in strictly standard-conforming C90, ensuring portability to many different hard- and software platforms, and allowing straightforward integration with codes written in various programming languages like C, C++, Fortran, Python etc. Libpsht is distributed under the terms of the GNU General Public License (GPL) version 2. Development on this project has ended; its successor is libsharp (ascl:1402.033).
      

      
      A Multiprocessor SoC Architecture with Efficient Communication Infrastructure and Advanced Compiler Support for Easy Application Development
      NASA Astrophysics Data System (ADS)
      Urfianto, Mohammad Zalfany; Isshiki, Tsuyoshi; Khan, Arif Ullah; Li, Dongju; Kunieda, Hiroaki
         
         This paper presentss a Multiprocessor System-on-Chips (MPSoC) architecture used as an execution platform for the new C-language based MPSoC design framework we are currently developing. The MPSoC architecture is based on an existing SoC platform with a commercial RISC core acting as the host CPU. We extend the existing SoC with a multiprocessor-array block that is used as the main engine to run parallel applications modeled in our design framework. Utilizing several optimizations provided by our compiler, an efficient inter-communication between processing elements with minimum overhead is implemented. A host-interface is designed to integrate the existing RISC core to the multiprocessor-array. The experimental results show that an efficacious integration is achieved, proving that the designed communication module can be used to efficiently incorporate off-the-shelf processors as a processing element for MPSoC architectures designed using our framework.
      

      
      MPIGeneNet: Parallel Calculation of Gene Co-Expression Networks on Multicore Clusters.
      PubMed
      Gonzalez-Dominguez, Jorge; Martin, Maria J
         2017-10-10
         In this work we present MPIGeneNet, a parallel tool that applies Pearson's correlation and Random Matrix Theory to construct gene co-expression networks. It is based on the state-of-the-art sequential tool RMTGeneNet, which provides networks with high robustness and sensitivity at the expenses of relatively long runtimes for large scale input datasets. MPIGeneNet returns the same results as RMTGeneNet but improves the memory management, reduces the I/O cost, and accelerates the two most computationally demanding steps of co-expression network construction by exploiting the compute capabilities of common multicore CPU clusters. Our performance evaluation on two different systems using three typical input datasets shows that MPIGeneNet is significantly faster than RMTGeneNet. As an example, our tool is up to 175.41 times faster on a cluster with eight nodes, each one containing two 12-core Intel Haswell processors. Source code of MPIGeneNet, as well as a reference manual, are available at https://sourceforge.net/projects/mpigenenet/.
      

      
      Novel Hybrid Scheduling Technique for Sensor Nodes with Mixed Criticality Tasks
      PubMed Central
      Micea, Mihai-Victor; Stangaciu, Cristina-Sorina; Stangaciu, Valentin; Curiac, Daniel-Ioan
         2017-01-01
         Sensor networks become increasingly a key technology for complex control applications. Their potential use in safety- and time-critical domains has raised the need for task scheduling mechanisms specially adapted to sensor node specific requirements, often materialized in predictable jitter-less execution of tasks characterized by different criticality levels. This paper offers an efficient scheduling solution, named Hybrid Hard Real-Time Scheduling (H2RTS), which combines a static, clock driven method with a dynamic, event driven scheduling technique, in order to provide high execution predictability, while keeping a high node Central Processing Unit (CPU) utilization factor. From the detailed, integrated schedulability analysis of the H2RTS, a set of sufficiency tests are introduced and demonstrated based on the processor demand and linear upper bound metrics. The performance and correct behavior of the proposed hybrid scheduling technique have been extensively evaluated and validated both on a simulator and on a sensor mote equipped with ARM7 microcontroller. PMID:28672856
      

      
      Vectorization of a particle code used in the simulation of rarefied hypersonic flow
      NASA Technical Reports Server (NTRS)
      Baganoff, D.
         1990-01-01
         A limitation of the direct simulation Monte Carlo (DSMC) method is that it does not allow efficient use of vector architectures that predominate in current supercomputers. Consequently, the problems that can be handled are limited to those of one- and two-dimensional flows. This work focuses on a reformulation of the DSMC method with the objective of designing a procedure that is optimized to the vector architectures found on machines such as the Cray-2. In addition, it focuses on finding a better balance between algorithmic complexity and the total number of particles employed in a simulation so that the overall performance of a particle simulation scheme can be greatly improved. Simulations of the flow about a 3D blunt body are performed with 10 to the 7th particles and 4 x 10 to the 5th mesh cells. Good statistics are obtained with time averaging over 800 time steps using 4.5 h of Cray-2 single-processor CPU time.
      

        
       
          

«

21
      22
      23
   24
      25
      »

          
        

     

   

   
       
            
              
          

«

21
      22
      23
      24
   25
      »

          
        

           
           
             
               
      
      Measurement of fault latency in a digital avionic miniprocessor
      NASA Technical Reports Server (NTRS)
      Mcgough, J. G.; Swern, F. L.
         1981-01-01
         The results of fault injection experiments utilizing a gate-level emulation of the central processor unit of the Bendix BDX-930 digital computer are presented. The failure detection coverage of comparison-monitoring and a typical avionics CPU self-test program was determined. The specific tasks and experiments included: (1) inject randomly selected gate-level and pin-level faults and emulate six software programs using comparison-monitoring to detect the faults; (2) based upon the derived empirical data develop and validate a model of fault latency that will forecast a software program's detecting ability; (3) given a typical avionics self-test program, inject randomly selected faults at both the gate-level and pin-level and determine the proportion of faults detected; (4) determine why faults were undetected; (5) recommend how the emulation can be extended to multiprocessor systems such as SIFT; and (6) determine the proportion of faults detected by a uniprocessor BIT (built-in-test) irrespective of self-test.
      

      
      Full Parallel Implementation of an All-Electron Four-Component Dirac-Kohn-Sham Program.
      PubMed
      Rampino, Sergio; Belpassi, Leonardo; Tarantelli, Francesco; Storchi, Loriano
         2014-09-09
         A full distributed-memory implementation of the Dirac-Kohn-Sham (DKS) module of the program BERTHA (Belpassi et al., Phys. Chem. Chem. Phys. 2011, 13, 12368-12394) is presented, where the self-consistent field (SCF) procedure is replicated on all the parallel processes, each process working on subsets of the global matrices. The key feature of the implementation is an efficient procedure for switching between two matrix distribution schemes, one (integral-driven) optimal for the parallel computation of the matrix elements and another (block-cyclic) optimal for the parallel linear algebra operations. This approach, making both CPU-time and memory scalable with the number of processors used, virtually overcomes at once both time and memory barriers associated with DKS calculations. Performance, portability, and numerical stability of the code are illustrated on the basis of test calculations on three gold clusters of increasing size, an organometallic compound, and a perovskite model. The calculations are performed on a Beowulf and a BlueGene/Q system.
      

      
      Implementation of metal-friendly EAM/FS-type semi-empirical potentials in HOOMD-blue: A GPU-accelerated molecular dynamics software
      NASA Astrophysics Data System (ADS)
      Yang, Lin; Zhang, Feng; Wang, Cai-Zhuang; Ho, Kai-Ming; Travesset, Alex
         2018-04-01
         We present an implementation of EAM and FS interatomic potentials, which are widely used in simulating metallic systems, in HOOMD-blue, a software designed to perform classical molecular dynamics simulations using GPU accelerations. We first discuss the details of our implementation and then report extensive benchmark tests. We demonstrate that single-precision floating point operations efficiently implemented on GPUs can produce sufficient accuracy when compared against double-precision codes, as demonstrated in test simulations of calculations of the glass-transition temperature of Cu64.5Zr35.5, and pair correlation function g (r) of liquid Ni3Al. Our code scales well with the size of the simulating system on NVIDIA Tesla M40 and P100 GPUs. Compared with another popular software LAMMPS running on 32 cores of AMD Opteron 6220 processors, the GPU/CPU performance ratio can reach as high as 4.6. The source code can be accessed through the HOOMD-blue web page for free by any interested user.
      

      
      Association between problematic cellular phone use and suicide: the moderating effect of family function and depression.
      PubMed
      Wang, Peng-Wei; Liu, Tai-Ling; Ko, Chih-Hung; Lin, Huang-Chi; Huang, Mei-Feng; Yeh, Yi-Chun; Yen, Cheng-Fang
         2014-02-01
         Suicidal ideation and attempt among adolescents are risk factors for eventual completed suicide. Cellular phone use (CPU) has markedly changed the everyday lives of adolescents. Issues about how cellular phone use relates to adolescent mental health, such as suicidal ideation and attempts, are important because of the high rate of cellular phone usage among children in that age group. This study explored the association between problematic CPU and suicidal ideation and attempts among adolescents and investigated how family function and depression influence the association between problematic CPU and suicidal ideation and attempts. A total of 5051 (2872 girls and 2179 boys) adolescents who owned at least one cellular phone completed the research questionnaires. We collected data on participants' CPU and suicidal behavior (ideation and attempts) during the past month as well as information on family function and history of depression. Five hundred thirty-two adolescents (10.54%) had problematic CPU. The rates of suicidal ideation were 23.50% and 11.76% in adolescents with problematic CPU and without problematic CPU, respectively. The rates of suicidal attempts in both groups were 13.70% and 5.45%, respectively. Family function, but not depression, had a moderating effect on the association between problematic CPU and suicidal ideation and attempt. This study highlights the association between problematic CPU and suicidal ideation as well as attempts and indicates that good family function may have a more significant role on reducing the risks of suicidal ideation and attempts in adolescents with problematic CPU than in those without problematic CPU. © 2014.
      

      
      Exploiting MIC architectures for the simulation of channeling of charged particles in crystals
      NASA Astrophysics Data System (ADS)
      Bagli, Enrico; Karpusenko, Vadim
         2016-08-01
         Coherent effects of ultra-relativistic particles in crystals is an area of science under development. DYNECHARM + + is a toolkit for the simulation of coherent interactions between high-energy charged particles and complex crystal structures. The particle trajectory in a crystal is computed through numerical integration of the equation of motion. The code was revised and improved in order to exploit parallelization on multi-cores and vectorization of single instructions on multiple data. An Intel Xeon Phi card was adopted for the performance measurements. The computation time was proved to scale linearly as a function of the number of physical and virtual cores. By enabling the auto-vectorization flag of the compiler a three time speedup was obtained. The performances of the card were compared to the Dual Xeon ones.
      

      
      Analytical Performance Modeling and Validation of Intel’s Xeon Phi Architecture
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Chunduri, Sudheer; Balaprakash, Prasanna; Morozov, Vitali
         
         Modeling the performance of scientific applications on emerging hardware plays a central role in achieving extreme-scale computing goals. Analytical models that capture the interaction between applications and hardware characteristics are attractive because even a reasonably accurate model can be useful for performance tuning before the hardware is made available. In this paper, we develop a hardware model for Intel’s second-generation Xeon Phi architecture code-named Knights Landing (KNL) for the SKOPE framework. We validate the KNL hardware model by projecting the performance of mini-benchmarks and application kernels. The results show that our KNL model can project the performance with prediction errorsmore » of 10% to 20%. The hardware model also provides informative recommendations for code transformations and tuning.« less
      

      
      Efficient parallel implicit methods for rotary-wing aerodynamics calculations
      NASA Astrophysics Data System (ADS)
      Wissink, Andrew M.
         
         Euler/Navier-Stokes Computational Fluid Dynamics (CFD) methods are commonly used for prediction of the aerodynamics and aeroacoustics of modern rotary-wing aircraft. However, their widespread application to large complex problems is limited lack of adequate computing power. Parallel processing offers the potential for dramatic increases in computing power, but most conventional implicit solution methods are inefficient in parallel and new techniques must be adopted to realize its potential. This work proposes alternative implicit schemes for Euler/Navier-Stokes rotary-wing calculations which are robust and efficient in parallel. The first part of this work proposes an efficient parallelizable modification of the Lower Upper-Symmetric Gauss Seidel (LU-SGS) implicit operator used in the well-known Transonic Unsteady Rotor Navier Stokes (TURNS) code. The new hybrid LU-SGS scheme couples a point-relaxation approach of the Data Parallel-Lower Upper Relaxation (DP-LUR) algorithm for inter-processor communication with the Symmetric Gauss Seidel algorithm of LU-SGS for on-processor computations. With the modified operator, TURNS is implemented in parallel using Message Passing Interface (MPI) for communication. Numerical performance and parallel efficiency are evaluated on the IBM SP2 and Thinking Machines CM-5 multi-processors for a variety of steady-state and unsteady test cases. The hybrid LU-SGS scheme maintains the numerical performance of the original LU-SGS algorithm in all cases and shows a good degree of parallel efficiency. It experiences a higher degree of robustness than DP-LUR for third-order upwind solutions. The second part of this work examines use of Krylov subspace iterative solvers for the nonlinear CFD solutions. The hybrid LU-SGS scheme is used as a parallelizable preconditioner. Two iterative methods are tested, Generalized Minimum Residual (GMRES) and Orthogonal s-Step Generalized Conjugate Residual (OSGCR). The Newton method demonstrates good parallel performance on the IBM SP2, with OS-GCR giving slightly better performance than GMRES on large numbers of processors. For steady and quasi-steady calculations, the convergence rate is accelerated but the overall solution time remains about the same as the standard hybrid LU-SGS scheme. For unsteady calculations, however, the Newton method maintains a higher degree of time-accuracy which allows tbe use of larger timesteps and results in CPU savings of 20-35%.
      

      
      Shadow: Running Tor in a Box for Accurate and Efficient Experimentation
      DTIC Science & Technology
      
         2011-09-23
         Modeling the speed of a target CPU is done by running an OpenSSL [31] speed test on a real CPU of that type. This provides us with the raw CPU processing...rate, but we are also interested in the processing speed of an application. By running application 5 benchmarks on the same CPU as the OpenSSL speed test...simulation, saving CPU cy- cles on our simulation host machine. Shadow removes cryptographic processing by preloading the main OpenSSL [31] functions used
      

      
      A biomolecular electrostatics solver using Python, GPUs and boundary elements that can handle solvent-filled cavities and Stern layers.
      PubMed
      Cooper, Christopher D; Bardhan, Jaydeep P; Barba, L A
         2014-03-01
         The continuum theory applied to biomolecular electrostatics leads to an implicit-solvent model governed by the Poisson-Boltzmann equation. Solvers relying on a boundary integral representation typically do not consider features like solvent-filled cavities or ion-exclusion (Stern) layers, due to the added difficulty of treating multiple boundary surfaces. This has hindered meaningful comparisons with volume-based methods, and the effects on accuracy of including these features has remained unknown. This work presents a solver called PyGBe that uses a boundary-element formulation and can handle multiple interacting surfaces. It was used to study the effects of solvent-filled cavities and Stern layers on the accuracy of calculating solvation energy and binding energy of proteins, using the well-known apbs finite-difference code for comparison. The results suggest that if required accuracy for an application allows errors larger than about 2% in solvation energy, then the simpler, single-surface model can be used. When calculating binding energies, the need for a multi-surface model is problem-dependent, becoming more critical when ligand and receptor are of comparable size. Comparing with the apbs solver, the boundary-element solver is faster when the accuracy requirements are higher. The cross-over point for the PyGBe code is in the order of 1-2% error, when running on one gpu card (nvidia Tesla C2075), compared with apbs running on six Intel Xeon cpu cores. PyGBe achieves algorithmic acceleration of the boundary element method using a treecode, and hardware acceleration using gpus via PyCuda from a user-visible code that is all Python. The code is open-source under MIT license.
      

      
      An empirical comparison of several recent epistatic interaction detection methods.
      PubMed
      Wang, Yue; Liu, Guimei; Feng, Mengling; Wong, Limsoon
         2011-11-01
         Many new methods have recently been proposed for detecting epistatic interactions in GWAS data. There is, however, no in-depth independent comparison of these methods yet. Five recent methods-TEAM, BOOST, SNPHarvester, SNPRuler and Screen and Clean (SC)-are evaluated here in terms of power, type-1 error rate, scalability and completeness. In terms of power, TEAM performs best on data with main effect and BOOST performs best on data without main effect. In terms of type-1 error rate, TEAM and BOOST have higher type-1 error rates than SNPRuler and SNPHarvester. SC does not control type-1 error rate well. In terms of scalability, we tested the five methods using a dataset with 100 000 SNPs on a 64 bit Ubuntu system, with Intel (R) Xeon(R) CPU 2.66 GHz, 16 GB memory. TEAM takes ~36 days to finish and SNPRuler reports heap allocation problems. BOOST scales up to 100 000 SNPs and the cost is much lower than that of TEAM. SC and SNPHarvester are the most scalable. In terms of completeness, we study how frequently the pruning techniques employed by these methods incorrectly prune away the most significant epistatic interactions. We find that, on average, 20% of datasets without main effect and 60% of datasets with main effect are pruned incorrectly by BOOST, SNPRuler and SNPHarvester. The software for the five methods tested are available from the URLs below. TEAM: http://csbio.unc.edu/epistasis/download.php BOOST: http://ihome.ust.hk/~eeyang/papers.html. SNPHarvester: http://bioinformatics.ust.hk/SNPHarvester.html. SNPRuler: http://bioinformatics.ust.hk/SNPRuler.zip. Screen and Clean: http://wpicr.wpic.pitt.edu/WPICCompGen/. wangyue@nus.edu.sg.
      

      
      A biomolecular electrostatics solver using Python, GPUs and boundary elements that can handle solvent-filled cavities and Stern layers
      NASA Astrophysics Data System (ADS)
      Cooper, Christopher D.; Bardhan, Jaydeep P.; Barba, L. A.
         2014-03-01
         The continuum theory applied to biomolecular electrostatics leads to an implicit-solvent model governed by the Poisson-Boltzmann equation. Solvers relying on a boundary integral representation typically do not consider features like solvent-filled cavities or ion-exclusion (Stern) layers, due to the added difficulty of treating multiple boundary surfaces. This has hindered meaningful comparisons with volume-based methods, and the effects on accuracy of including these features has remained unknown. This work presents a solver called PyGBe that uses a boundary-element formulation and can handle multiple interacting surfaces. It was used to study the effects of solvent-filled cavities and Stern layers on the accuracy of calculating solvation energy and binding energy of proteins, using the well-known APBS finite-difference code for comparison. The results suggest that if required accuracy for an application allows errors larger than about 2% in solvation energy, then the simpler, single-surface model can be used. When calculating binding energies, the need for a multi-surface model is problem-dependent, becoming more critical when ligand and receptor are of comparable size. Comparing with the APBS solver, the boundary-element solver is faster when the accuracy requirements are higher. The cross-over point for the PyGBe code is on the order of 1-2% error, when running on one GPU card (NVIDIA Tesla C2075), compared with APBS running on six Intel Xeon CPU cores. PyGBe achieves algorithmic acceleration of the boundary element method using a treecode, and hardware acceleration using GPUs via PyCuda from a user-visible code that is all Python. The code is open-source under MIT license.
      

      
      ASC-ATDM Performance Portability Requirements for 2015-2019
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Edwards, Harold C.; Trott, Christian Robert
         
         This report outlines the research, development, and support requirements for the Advanced Simulation and Computing (ASC ) Advanced Technology, Development, and Mitigation (ATDM) Performance Portability (a.k.a., Kokkos) project for 2015 - 2019 . The research and development (R&D) goal for Kokkos (v2) has been to create and demonstrate a thread - parallel programming model a nd standard C++ library - based implementation that enables performance portability across diverse manycore architectures such as multicore CPU, Intel Xeon Phi, and NVIDIA Kepler GPU. This R&D goal has been achieved for algorithms that use data parallel pat terns including parallel - for, parallelmore » - reduce, and parallel - scan. Current R&D is focusing on hierarchical parallel patterns such as a directed acyclic graph (DAG) of asynchronous tasks where each task contain s nested data parallel algorithms. This five y ear plan includes R&D required to f ully and performance portably exploit thread parallelism across current and anticipated next generation platforms (NGP). The Kokkos library is being evaluated by many projects exploring algorithm s and code design for NGP. Some production libraries and applications such as Trilinos and LAMMPS have already committed to Kokkos as their foundation for manycore parallelism an d performance portability. These five year requirements includes support required for current and antic ipated ASC projects to be effective and productive in their use of Kokkos on NGP. The greatest risk to the success of Kokkos and ASC projects relying upon Kokkos is a lack of staffing resources to support Kokkos to the degree needed by these ASC projects. This support includes up - to - date tutorials, documentation, multi - platform (hardware and software stack) testing, minor feature enhancements, thread - scalable algorithm consulting, and managing collaborative R&D.« less
      

      
      Investigating the Use of the Intel Xeon Phi for Event Reconstruction
      NASA Astrophysics Data System (ADS)
      Sherman, Keegan; Gilfoyle, Gerard
         2014-09-01
         The physics goal of Jefferson Lab is to understand how quarks and gluons form nuclei and it is being upgraded to a higher, 12-GeV beam energy. The new CLAS12 detector in Hall B will collect 5-10 terabytes of data per day and will require considerable computing resources. We are investigating tools, such as the Intel Xeon Phi, to speed up the event reconstruction. The Kalman Filter is one of the methods being studied. It is a linear algebra algorithm that estimates the state of a system by combining existing data and predictions of those measurements. The tools required to apply this technique (i.e. matrix multiplication, matrix inversion) are being written using C++ intrinsics for Intel's Xeon Phi Coprocessor, which uses the Many Integrated Cores (MIC) architecture. The Intel MIC is a new high-performance chip that connects to a host machine through the PCIe bus and is built to run highly vectorized and parallelized code making it a well-suited device for applications such as the Kalman Filter. Our tests of the MIC optimized algorithms needed for the filter show significant increases in speed. For example, matrix multiplication of 5x5 matrices on the MIC was able to run up to 69 times faster than the host core. The physics goal of Jefferson Lab is to understand how quarks and gluons form nuclei and it is being upgraded to a higher, 12-GeV beam energy. The new CLAS12 detector in Hall B will collect 5-10 terabytes of data per day and will require considerable computing resources. We are investigating tools, such as the Intel Xeon Phi, to speed up the event reconstruction. The Kalman Filter is one of the methods being studied. It is a linear algebra algorithm that estimates the state of a system by combining existing data and predictions of those measurements. The tools required to apply this technique (i.e. matrix multiplication, matrix inversion) are being written using C++ intrinsics for Intel's Xeon Phi Coprocessor, which uses the Many Integrated Cores (MIC) architecture. The Intel MIC is a new high-performance chip that connects to a host machine through the PCIe bus and is built to run highly vectorized and parallelized code making it a well-suited device for applications such as the Kalman Filter. Our tests of the MIC optimized algorithms needed for the filter show significant increases in speed. For example, matrix multiplication of 5x5 matrices on the MIC was able to run up to 69 times faster than the host core. Work supported by the University of Richmond and the US Department of Energy.
      

      
      The association between problematic cellular phone use and risky behaviors and low self-esteem among Taiwanese adolescents.
      PubMed
      Yang, Yuan-Sheng; Yen, Ju-Yu; Ko, Chih-Hung; Cheng, Chung-Ping; Yen, Cheng-Fang
         2010-04-28
         Cellular phone use (CPU) is an important part of life for many adolescents. However, problematic CPU may complicate physiological and psychological problems. The aim of our study was to examine the associations between problematic CPU and a series of risky behaviors and low self-esteem in Taiwanese adolescents. A total of 11,111 adolescent students in Southern Taiwan were randomly selected into this study. We used the Problematic Cellular Phone Use Questionnaire to identify the adolescents with problematic CPU. Meanwhile, a series of risky behaviors and self-esteem were evaluated. Multilevel logistic regression analyses were employed to examine the associations between problematic CPU and risky behaviors and low self-esteem regarding gender and age. The results indicated that positive associations were found between problematic CPU and aggression, insomnia, smoking cigarettes, suicidal tendencies, and low self-esteem in all groups with different sexes and ages. However, gender and age differences existed in the associations between problematic CPU and suspension from school, criminal records, tattooing, short nocturnal sleep duration, unprotected sex, illicit drugs use, drinking alcohol and chewing betel nuts. There were positive associations between problematic CPU and a series of risky behaviors and low self-esteem in Taiwanese adolescents. It is worthy for parents and mental health professionals to pay attention to adolescents' problematic CPU.
      

      
      Parallel Task Management Library for MARTe
      NASA Astrophysics Data System (ADS)
      Valcarcel, Daniel F.; Alves, Diogo; Neto, Andre; Reux, Cedric; Carvalho, Bernardo B.; Felton, Robert; Lomas, Peter J.; Sousa, Jorge; Zabeo, Luca
         2014-06-01
         The Multithreaded Application Real-Time executor (MARTe) is a real-time framework with increasing popularity and support in the thermonuclear fusion community. It allows modular code to run in a multi-threaded environment leveraging on the current multi-core processor (CPU) technology. One application that relies on the MARTe framework is the Joint European Torus (JET) tokamak WAll Load Limiter System (WALLS). It calculates and monitors the temperature on metal tiles and plasma facing components (PFCs) that can melt or flake if their temperature gets too high when exposed to power loads. One of the main time consuming tasks in WALLS is the calculation of thermal diffusion models in real-time. These models tend to be described by very large state-space models thus making them perfect candidates for parallelisation. MARTe's traditional approach for task parallelisation is to split the problem into several Real-Time Threads, each responsible for a self-contained sequential execution of an input-to-output chain. This is usually possible, but it might not always be practical for algorithmic or technical reasons. Also, it might not be easily scalable with an increase in the number of available CPU cores. The WorkLibrary introduces a “GPU-like approach” of splitting work among the available cores of modern CPUs that is (i) straightforward to use in an application, (ii) scalable with the availability of cores and all of this (iii) without rewriting or recompiling the source code. The first part of this article explains the motivation behind the library, its architecture and implementation. The second part presents a real application for WALLS, a parallel version of a large state-space model describing the 2D thermal diffusion on a JET tile.
      

      
      Speeding up tsunami wave propagation modeling
      NASA Astrophysics Data System (ADS)
      Lavrentyev, Mikhail; Romanenko, Alexey
         2014-05-01
         Trans-oceanic wave propagation is one of the most time/CPU consuming parts of the tsunami modeling process. The so-called Method Of Splitting Tsunami (MOST) software package, developed at PMEL NOAA USA (Pacific Marine Environmental Laboratory of the National Oceanic and Atmospheric Administration, USA), is widely used to evaluate the tsunami parameters. However, it takes time to simulate trans-ocean wave propagation, that is up to 5 hours CPU time to "drive" the wave from Chili (epicenter) to the coast of Japan (even using a rather coarse computational mesh). Accurate wave height prediction requires fine meshes which leads to dramatic increase in time for simulation. Computation time is among the critical parameter as it takes only about 20 minutes for tsunami wave to approach the coast of Japan after earthquake at Japan trench or Sagami trench (as it was after the Great East Japan Earthquake on March 11, 2011). MOST solves numerically the hyperbolic system for three unknown functions, namely velocity vector and wave height (shallow water approximation). The system could be split into two independent systems by orthogonal directions (splitting method). Each system can be treated independently. This calculation scheme is well suited for SIMD architecture and GPUs as well. We performed adaptation of MOST package to GPU. Several numerical tests showed 40x performance gain for NVIDIA Tesla C2050 GPU vs. single core of Intel i7 processor. Results of numerical experiments were compared with other available simulation data. Calculation results, obtained at GPU, differ from the reference ones by 10^-3 cm of the wave height simulating 24 hours wave propagation. This allows us to speak about possibility to develop real-time system for evaluating tsunami danger.
      

      
      Discovering epistasis in large scale genetic association studies by exploiting graphics cards.
      PubMed
      Chen, Gary K; Guo, Yunfei
         2013-12-03
         Despite the enormous investments made in collecting DNA samples and generating germline variation data across thousands of individuals in modern genome-wide association studies (GWAS), progress has been frustratingly slow in explaining much of the heritability in common disease. Today's paradigm of testing independent hypotheses on each single nucleotide polymorphism (SNP) marker is unlikely to adequately reflect the complex biological processes in disease risk. Alternatively, modeling risk as an ensemble of SNPs that act in concert in a pathway, and/or interact non-additively on log risk for example, may be a more sensible way to approach gene mapping in modern studies. Implementing such analyzes genome-wide can quickly become intractable due to the fact that even modest size SNP panels on modern genotype arrays (500k markers) pose a combinatorial nightmare, require tens of billions of models to be tested for evidence of interaction. In this article, we provide an in-depth analysis of programs that have been developed to explicitly overcome these enormous computational barriers through the use of processors on graphics cards known as Graphics Processing Units (GPU). We include tutorials on GPU technology, which will convey why they are growing in appeal with today's numerical scientists. One obvious advantage is the impressive density of microprocessor cores that are available on only a single GPU. Whereas high end servers feature up to 24 Intel or AMD CPU cores, the latest GPU offerings from nVidia feature over 2600 cores. Each compute node may be outfitted with up to 4 GPU devices. Success on GPUs varies across problems. However, epistasis screens fare well due to the high degree of parallelism exposed in these problems. Papers that we review routinely report GPU speedups of over two orders of magnitude (>100x) over standard CPU implementations.
      

      
      Discovering epistasis in large scale genetic association studies by exploiting graphics cards
      PubMed Central
      Chen, Gary K.; Guo, Yunfei
         2013-01-01
         Despite the enormous investments made in collecting DNA samples and generating germline variation data across thousands of individuals in modern genome-wide association studies (GWAS), progress has been frustratingly slow in explaining much of the heritability in common disease. Today's paradigm of testing independent hypotheses on each single nucleotide polymorphism (SNP) marker is unlikely to adequately reflect the complex biological processes in disease risk. Alternatively, modeling risk as an ensemble of SNPs that act in concert in a pathway, and/or interact non-additively on log risk for example, may be a more sensible way to approach gene mapping in modern studies. Implementing such analyzes genome-wide can quickly become intractable due to the fact that even modest size SNP panels on modern genotype arrays (500k markers) pose a combinatorial nightmare, require tens of billions of models to be tested for evidence of interaction. In this article, we provide an in-depth analysis of programs that have been developed to explicitly overcome these enormous computational barriers through the use of processors on graphics cards known as Graphics Processing Units (GPU). We include tutorials on GPU technology, which will convey why they are growing in appeal with today's numerical scientists. One obvious advantage is the impressive density of microprocessor cores that are available on only a single GPU. Whereas high end servers feature up to 24 Intel or AMD CPU cores, the latest GPU offerings from nVidia feature over 2600 cores. Each compute node may be outfitted with up to 4 GPU devices. Success on GPUs varies across problems. However, epistasis screens fare well due to the high degree of parallelism exposed in these problems. Papers that we review routinely report GPU speedups of over two orders of magnitude (>100x) over standard CPU implementations. PMID:24348518
      

      
      GPU-accelerated atmospheric chemical kinetics in the ECHAM/MESSy (EMAC) Earth system model (version 2.52)
      NASA Astrophysics Data System (ADS)
      Alvanos, Michail; Christoudias, Theodoros
         2017-10-01
         This paper presents an application of GPU accelerators in Earth system modeling. We focus on atmospheric chemical kinetics, one of the most computationally intensive tasks in climate-chemistry model simulations. We developed a software package that automatically generates CUDA kernels to numerically integrate atmospheric chemical kinetics in the global climate model ECHAM/MESSy Atmospheric Chemistry (EMAC), used to study climate change and air quality scenarios. A source-to-source compiler outputs a CUDA-compatible kernel by parsing the FORTRAN code generated by the Kinetic PreProcessor (KPP) general analysis tool. All Rosenbrock methods that are available in the KPP numerical library are supported.Performance evaluation, using Fermi and Pascal CUDA-enabled GPU accelerators, shows achieved speed-ups of 4. 5 × and 20. 4 × , respectively, of the kernel execution time. A node-to-node real-world production performance comparison shows a 1. 75 × speed-up over the non-accelerated application using the KPP three-stage Rosenbrock solver. We provide a detailed description of the code optimizations used to improve the performance including memory optimizations, control code simplification, and reduction of idle time. The accuracy and correctness of the accelerated implementation are evaluated by comparing to the CPU-only code of the application. The median relative difference is found to be less than 0.000000001 % when comparing the output of the accelerated kernel the CPU-only code.The approach followed, including the computational workload division, and the developed GPU solver code can potentially be used as the basis for hardware acceleration of numerous geoscientific models that rely on KPP for atmospheric chemical kinetics applications.
      

      
      A Wideband Fast Multipole Method for the two-dimensional complex Helmholtz equation
      NASA Astrophysics Data System (ADS)
      Cho, Min Hyung; Cai, Wei
         2010-12-01
         A Wideband Fast Multipole Method (FMM) for the 2D Helmholtz equation is presented. It can evaluate the interactions between N particles governed by the fundamental solution of 2D complex Helmholtz equation in a fast manner for a wide range of complex wave number k, which was not easy with the original FMM due to the instability of the diagonalized conversion operator. This paper includes the description of theoretical backgrounds, the FMM algorithm, software structures, and some test runs. Program summaryProgram title: 2D-WFMM Catalogue identifier: AEHI_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEHI_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 4636 No. of bytes in distributed program, including test data, etc.: 82 582 Distribution format: tar.gz Programming language: C Computer: Any Operating system: Any operating system with gcc version 4.2 or newer Has the code been vectorized or parallelized?: Multi-core processors with shared memory RAM: Depending on the number of particles N and the wave number k Classification: 4.8, 4.12 External routines: OpenMP ( http://openmp.org/wp/) Nature of problem: Evaluate interaction between N particles governed by the fundamental solution of 2D Helmholtz equation with complex k. Solution method: Multilevel Fast Multipole Algorithm in a hierarchical quad-tree structure with cutoff level which combines low frequency method and high frequency method. Running time: Depending on the number of particles N, wave number k, and number of cores in CPU. CPU time increases as N log N.
      

        
       
          

«

21
      22
      23
      24
   25
      »

          
        

     

   

   
       
            
              
          

«

21
      22
      23
      24
      25
   »

          
        

           
           
             
               
      
      A parallel algorithm for the initial screening of space debris collisions prediction using the SGP4/SDP4 models and GPU acceleration
      NASA Astrophysics Data System (ADS)
      Lin, Mingpei; Xu, Ming; Fu, Xiaoyu
         2017-05-01
         Currently, a tremendous amount of space debris in Earth's orbit imperils operational spacecraft. It is essential to undertake risk assessments of collisions and predict dangerous encounters in space. However, collision predictions for an enormous amount of space debris give rise to large-scale computations. In this paper, a parallel algorithm is established on the Compute Unified Device Architecture (CUDA) platform of NVIDIA Corporation for collision prediction. According to the parallel structure of NVIDIA graphics processors, a block decomposition strategy is adopted in the algorithm. Space debris is divided into batches, and the computation and data transfer operations of adjacent batches overlap. As a consequence, the latency to access shared memory during the entire computing process is significantly reduced, and a higher computing speed is reached. Theoretically, a simulation of collision prediction for space debris of any amount and for any time span can be executed. To verify this algorithm, a simulation example including 1382 pieces of debris, whose operational time scales vary from 1 min to 3 days, is conducted on Tesla C2075 of NVIDIA. The simulation results demonstrate that with the same computational accuracy as that of a CPU, the computing speed of the parallel algorithm on a GPU is 30 times that on a CPU. Based on this algorithm, collision prediction of over 150 Chinese spacecraft for a time span of 3 days can be completed in less than 3 h on a single computer, which meets the timeliness requirement of the initial screening task. Furthermore, the algorithm can be adapted for multiple tasks, including particle filtration, constellation design, and Monte-Carlo simulation of an orbital computation.
      

      
      PIMS: Memristor-Based Processing-in-Memory-and-Storage.
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Cook, Jeanine
         
         Continued progress in computing has augmented the quest for higher performance with a new quest for higher energy efficiency. This has led to the re-emergence of Processing-In-Memory (PIM) ar- chitectures that offer higher density and performance with some boost in energy efficiency. Past PIM work either integrated a standard CPU with a conventional DRAM to improve the CPU- memory link, or used a bit-level processor with Single Instruction Multiple Data (SIMD) control, but neither matched the energy consumption of the memory to the computation. We originally proposed to develop a new architecture derived from PIM that more effectively addressed energymore » efficiency for high performance scientific, data analytics, and neuromorphic applications. We also originally planned to implement a von Neumann architecture with arithmetic/logic units (ALUs) that matched the power consumption of an advanced storage array to maximize energy efficiency. Implementing this architecture in storage was our original idea, since by augmenting storage (in- stead of memory), the system could address both in-memory computation and applications that accessed larger data sets directly from storage, hence Processing-in-Memory-and-Storage (PIMS). However, as our research matured, we discovered several things that changed our original direc- tion, the most important being that a PIM that implements a standard von Neumann-type archi- tecture results in significant energy efficiency improvement, but only about a O(10) performance improvement. In addition to this, the emergence of new memory technologies moved us to propos- ing a non-von Neumann architecture, called Superstrider, implemented not in storage, but in a new DRAM technology called High Bandwidth Memory (HBM). HBM is a stacked DRAM tech- nology that includes a logic layer where an architecture such as Superstrider could potentially be implemented.« less
      

      
      From experiment to design -- Fault characterization and detection in parallel computer systems using computational accelerators
      NASA Astrophysics Data System (ADS)
      Yim, Keun Soo
         
         This dissertation summarizes experimental validation and co-design studies conducted to optimize the fault detection capabilities and overheads in hybrid computer systems (e.g., using CPUs and Graphics Processing Units, or GPUs), and consequently to improve the scalability of parallel computer systems using computational accelerators. The experimental validation studies were conducted to help us understand the failure characteristics of CPU-GPU hybrid computer systems under various types of hardware faults. The main characterization targets were faults that are difficult to detect and/or recover from, e.g., faults that cause long latency failures (Ch. 3), faults in dynamically allocated resources (Ch. 4), faults in GPUs (Ch. 5), faults in MPI programs (Ch. 6), and microarchitecture-level faults with specific timing features (Ch. 7). The co-design studies were based on the characterization results. One of the co-designed systems has a set of source-to-source translators that customize and strategically place error detectors in the source code of target GPU programs (Ch. 5). Another co-designed system uses an extension card to learn the normal behavioral and semantic execution patterns of message-passing processes executing on CPUs, and to detect abnormal behaviors of those parallel processes (Ch. 6). The third co-designed system is a co-processor that has a set of new instructions in order to support software-implemented fault detection techniques (Ch. 7). The work described in this dissertation gains more importance because heterogeneous processors have become an essential component of state-of-the-art supercomputers. GPUs were used in three of the five fastest supercomputers that were operating in 2011. Our work included comprehensive fault characterization studies in CPU-GPU hybrid computers. In CPUs, we monitored the target systems for a long period of time after injecting faults (a temporally comprehensive experiment), and injected faults into various types of program states that included dynamically allocated memory (to be spatially comprehensive). In GPUs, we used fault injection studies to demonstrate the importance of detecting silent data corruption (SDC) errors that are mainly due to the lack of fine-grained protections and the massive use of fault-insensitive data. This dissertation also presents transparent fault tolerance frameworks and techniques that are directly applicable to hybrid computers built using only commercial off-the-shelf hardware components. This dissertation shows that by developing understanding of the failure characteristics and error propagation paths of target programs, we were able to create fault tolerance frameworks and techniques that can quickly detect and recover from hardware faults with low performance and hardware overheads.
      

      
      The association between problematic cellular phone use and risky behaviors and low self-esteem among Taiwanese adolescents
      PubMed Central
      
         2010-01-01
         Background Cellular phone use (CPU) is an important part of life for many adolescents. However, problematic CPU may complicate physiological and psychological problems. The aim of our study was to examine the associations between problematic CPU and a series of risky behaviors and low self-esteem in Taiwanese adolescents. Methods A total of 11,111 adolescent students in Southern Taiwan were randomly selected into this study. We used the Problematic Cellular Phone Use Questionnaire to identify the adolescents with problematic CPU. Meanwhile, a series of risky behaviors and self-esteem were evaluated. Multilevel logistic regression analyses were employed to examine the associations between problematic CPU and risky behaviors and low self-esteem regarding gender and age. Results The results indicated that positive associations were found between problematic CPU and aggression, insomnia, smoking cigarettes, suicidal tendencies, and low self-esteem in all groups with different sexes and ages. However, gender and age differences existed in the associations between problematic CPU and suspension from school, criminal records, tattooing, short nocturnal sleep duration, unprotected sex, illicit drugs use, drinking alcohol and chewing betel nuts. Conclusions There were positive associations between problematic CPU and a series of risky behaviors and low self-esteem in Taiwanese adolescents. It is worthy for parents and mental health professionals to pay attention to adolescents' problematic CPU. PMID:20426807
      

      
      Accelerating Spaceborne SAR Imaging Using Multiple CPU/GPU Deep Collaborative Computing
      PubMed Central
      Zhang, Fan; Li, Guojun; Li, Wei; Hu, Wei; Hu, Yuxin
         2016-01-01
         With the development of synthetic aperture radar (SAR) technologies in recent years, the huge amount of remote sensing data brings challenges for real-time imaging processing. Therefore, high performance computing (HPC) methods have been presented to accelerate SAR imaging, especially the GPU based methods. In the classical GPU based imaging algorithm, GPU is employed to accelerate image processing by massive parallel computing, and CPU is only used to perform the auxiliary work such as data input/output (IO). However, the computing capability of CPU is ignored and underestimated. In this work, a new deep collaborative SAR imaging method based on multiple CPU/GPU is proposed to achieve real-time SAR imaging. Through the proposed tasks partitioning and scheduling strategy, the whole image can be generated with deep collaborative multiple CPU/GPU computing. In the part of CPU parallel imaging, the advanced vector extension (AVX) method is firstly introduced into the multi-core CPU parallel method for higher efficiency. As for the GPU parallel imaging, not only the bottlenecks of memory limitation and frequent data transferring are broken, but also kinds of optimized strategies are applied, such as streaming, parallel pipeline and so on. Experimental results demonstrate that the deep CPU/GPU collaborative imaging method enhances the efficiency of SAR imaging on single-core CPU by 270 times and realizes the real-time imaging in that the imaging rate outperforms the raw data generation rate. PMID:27070606
      

      
      Accelerating Spaceborne SAR Imaging Using Multiple CPU/GPU Deep Collaborative Computing.
      PubMed
      Zhang, Fan; Li, Guojun; Li, Wei; Hu, Wei; Hu, Yuxin
         2016-04-07
         With the development of synthetic aperture radar (SAR) technologies in recent years, the huge amount of remote sensing data brings challenges for real-time imaging processing. Therefore, high performance computing (HPC) methods have been presented to accelerate SAR imaging, especially the GPU based methods. In the classical GPU based imaging algorithm, GPU is employed to accelerate image processing by massive parallel computing, and CPU is only used to perform the auxiliary work such as data input/output (IO). However, the computing capability of CPU is ignored and underestimated. In this work, a new deep collaborative SAR imaging method based on multiple CPU/GPU is proposed to achieve real-time SAR imaging. Through the proposed tasks partitioning and scheduling strategy, the whole image can be generated with deep collaborative multiple CPU/GPU computing. In the part of CPU parallel imaging, the advanced vector extension (AVX) method is firstly introduced into the multi-core CPU parallel method for higher efficiency. As for the GPU parallel imaging, not only the bottlenecks of memory limitation and frequent data transferring are broken, but also kinds of optimized strategies are applied, such as streaming, parallel pipeline and so on. Experimental results demonstrate that the deep CPU/GPU collaborative imaging method enhances the efficiency of SAR imaging on single-core CPU by 270 times and realizes the real-time imaging in that the imaging rate outperforms the raw data generation rate.
      

      
      Adaptive MCMC in Bayesian phylogenetics: an application to analyzing partitioned data in BEAST.
      PubMed
      Baele, Guy; Lemey, Philippe; Rambaut, Andrew; Suchard, Marc A
         2017-06-15
         Advances in sequencing technology continue to deliver increasingly large molecular sequence datasets that are often heavily partitioned in order to accurately model the underlying evolutionary processes. In phylogenetic analyses, partitioning strategies involve estimating conditionally independent models of molecular evolution for different genes and different positions within those genes, requiring a large number of evolutionary parameters that have to be estimated, leading to an increased computational burden for such analyses. The past two decades have also seen the rise of multi-core processors, both in the central processing unit (CPU) and Graphics processing unit processor markets, enabling massively parallel computations that are not yet fully exploited by many software packages for multipartite analyses. We here propose a Markov chain Monte Carlo (MCMC) approach using an adaptive multivariate transition kernel to estimate in parallel a large number of parameters, split across partitioned data, by exploiting multi-core processing. Across several real-world examples, we demonstrate that our approach enables the estimation of these multipartite parameters more efficiently than standard approaches that typically use a mixture of univariate transition kernels. In one case, when estimating the relative rate parameter of the non-coding partition in a heterochronous dataset, MCMC integration efficiency improves by > 14-fold. Our implementation is part of the BEAST code base, a widely used open source software package to perform Bayesian phylogenetic inference. guy.baele@kuleuven.be. Supplementary data are available at Bioinformatics online. © The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
      

      
      SU-G-TeP1-15: Toward a Novel GPU Accelerated Deterministic Solution to the Linear Boltzmann Transport Equation
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Yang, R; Fallone, B; Cross Cancer Institute, Edmonton, AB
         
         Purpose: To develop a Graphic Processor Unit (GPU) accelerated deterministic solution to the Linear Boltzmann Transport Equation (LBTE) for accurate dose calculations in radiotherapy (RT). A deterministic solution yields the potential for major speed improvements due to the sparse matrix-vector and vector-vector multiplications and would thus be of benefit to RT. Methods: In order to leverage the massively parallel architecture of GPUs, the first order LBTE was reformulated as a second order self-adjoint equation using the Least Squares Finite Element Method (LSFEM). This produces a symmetric positive-definite matrix which is efficiently solved using a parallelized conjugate gradient (CG) solver. Themore » LSFEM formalism is applied in space, discrete ordinates is applied in angle, and the Multigroup method is applied in energy. The final linear system of equations produced is tightly coupled in space and angle. Our code written in CUDA-C was benchmarked on an Nvidia GeForce TITAN-X GPU against an Intel i7-6700K CPU. A spatial mesh of 30,950 tetrahedral elements was used with an S4 angular approximation. Results: To avoid repeating a full computationally intensive finite element matrix assembly at each Multigroup energy, a novel mapping algorithm was developed which minimized the operations required at each energy. Additionally, a parallelized memory mapping for the kronecker product between the sparse spatial and angular matrices, including Dirichlet boundary conditions, was created. Atomicity is preserved by graph-coloring overlapping nodes into separate kernel launches. The one-time mapping calculations for matrix assembly, kronecker product, and boundary condition application took 452±1ms on GPU. Matrix assembly for 16 energy groups took 556±3s on CPU, and 358±2ms on GPU using the mappings developed. The CG solver took 93±1s on CPU, and 468±2ms on GPU. Conclusion: Three computationally intensive subroutines in deterministically solving the LBTE have been formulated on GPU, resulting in two orders of magnitude speedup. Funding support from Natural Sciences and Engineering Research Council and Alberta Innovates Health Solutions. Dr. Fallone is a co-founder and CEO of MagnetTx Oncology Solutions (under discussions to license Alberta bi-planar linac MR for commercialization).« less
      

      
      Symptoms of Problematic Cellular Phone Use, Functional Impairment and Its Association with Depression among Adolescents in Southern Taiwan
      ERIC Educational Resources Information Center
      Yen, Cheng-Fang; Tang, Tze-Chun; Yen, Ju-Yu; Lin, Huang-Chi; Huang, Chi-Fen; Liu, Shu-Chun; Ko, Chih-Hung
         2009-01-01
         The aims of this study were: (1) to examine the prevalence of symptoms of problematic cellular phone use (CPU); (2) to examine the associations between the symptoms of problematic CPU, functional impairment caused by CPU and the characteristics of CPU; (3) to establish the optimal cut-off point of the number of symptoms for functional impairment…
      

      
      Hypoxia/oxidative stress alters the pharmacokinetics of CPU86017-RS through mitochondrial dysfunction and NADPH oxidase activation.
      PubMed
      Gao, Jie; Ding, Xuan-sheng; Zhang, Yu-mao; Dai, De-zai; Liu, Mei; Zhang, Can; Dai, Yin
         2013-12-01
         Hypoxia/oxidative stress can alter the pharmacokinetics (PK) of CPU86017-RS, a novel antiarrhythmic agent. The aim of this study was to investigate the mechanisms underlying the alteration of PK of CPU86017-RS by hypoxia/oxidative stress. Male SD rats exposed to normal or intermittent hypoxia (10% O2) were administered CPU86017-RS (20, 40 or 80 mg/kg, ig) for 8 consecutive days. The PK parameters of CPU86017-RS were examined on d 8. In a separate set of experiments, female SD rats were injected with isoproterenol (ISO) for 5 consecutive days to induce a stress-related status, then CPU86017-RS (80 mg/kg, ig) was administered, and the tissue distributions were examined. The levels of Mn-SOD (manganese containing superoxide dismutase), endoplasmic reticulum (ER) stress sensor proteins (ATF-6, activating transcription factor 6 and PERK, PRK-like ER kinase) and activation of NADPH oxidase (NOX) were detected with Western blotting. Rat liver microsomes were incubated under N2 for in vitro study. The Cmax, t1/2, MRT (mean residence time) and AUC (area under the curve) of CPU86017-RS were significantly increased in the hypoxic rats receiving the 3 different doses of CPU86017-RS. The hypoxia-induced alteration of PK was associated with significantly reduced Mn-SOD level, and increased ATF-6, PERK and NOX levels. In ISO-treated rats, the distributions of CPU86017-RS in plasma, heart, kidney, and liver were markedly increased, and NOX levels in heart, kidney, and liver were significantly upregulated. Co-administration of the NOX blocker apocynin eliminated the abnormalities in the PK and tissue distributions of CPU86017-RS induced by hypoxia/oxidative stress. The metabolism of CPU86017-RS in the N2-treated liver microsomes was significantly reduced, addition of N-acetylcysteine (NAC), but not vitamin C, effectively reversed this change. The altered PK and metabolism of CPU86017-RS induced by hypoxia/oxidative stress are produced by mitochondrial abnormalities, NOX activation and ER stress; these abnormalities are significantly alleviated by apocynin or NAC.
      

      
      FLY MPI-2: a parallel tree code for LSS
      NASA Astrophysics Data System (ADS)
      Becciani, U.; Comparato, M.; Antonuccio-Delogu, V.
         2006-04-01
         New version program summaryProgram title: FLY 3.1 Catalogue identifier: ADSC_v2_0 Licensing provisions: yes Program summary URL:http://cpc.cs.qub.ac.uk/summaries/ADSC_v2_0 Program obtainable from: CPC Program Library, Queen's University of Belfast, N. Ireland No. of lines in distributed program, including test data, etc.: 158 172 No. of bytes in distributed program, including test data, etc.: 4 719 953 Distribution format: tar.gz Programming language: Fortran 90, C Computer: Beowulf cluster, PC, MPP systems Operating system: Linux, Aix RAM: 100M words Catalogue identifier of previous version: ADSC_v1_0 Journal reference of previous version: Comput. Phys. Comm. 155 (2003) 159 Does the new version supersede the previous version?: yes Nature of problem: FLY is a parallel collisionless N-body code for the calculation of the gravitational force Solution method: FLY is based on the hierarchical oct-tree domain decomposition introduced by Barnes and Hut (1986) Reasons for the new version: The new version of FLY is implemented by using the MPI-2 standard: the distributed version 3.1 was developed by using the MPICH2 library on a PC Linux cluster. Today the FLY performance allows us to consider the FLY code among the most powerful parallel codes for tree N-body simulations. Another important new feature regards the availability of an interface with hydrodynamical Paramesh based codes. Simulations must follow a box large enough to accurately represent the power spectrum of fluctuations on very large scales so that we may hope to compare them meaningfully with real data. The number of particles then sets the mass resolution of the simulation, which we would like to make as fine as possible. The idea to build an interface between two codes, that have different and complementary cosmological tasks, allows us to execute complex cosmological simulations with FLY, specialized for DM evolution, and a code specialized for hydrodynamical components that uses a Paramesh block structure. Summary of revisions: The parallel communication schema was totally changed. The new version adopts the MPICH2 library. Now FLY can be executed on all Unix systems having an MPI-2 standard library. The main data structure, is declared in a module procedure of FLY (fly_h.F90 routine). FLY creates the MPI Window object for one-sided communication for all the shared arrays, with a call like the following: CALL MPI_WIN_CREATE(POS, SIZE, REAL8, MPI_INFO_NULL, MPI_COMM_WORLD, WIN_POS, IERR) the following main window objects are created: win_pos, win_vel, win_acc: particles positions velocities and accelerations, win_pos_cell, win_mass_cell, win_quad, win_subp, win_grouping: cells positions, masses, quadrupole momenta, tree structure and grouping cells. Other windows are created for dynamic load balance and global counters. Restrictions: The program uses the leapfrog integrator schema, but could be changed by the user. Unusual features: FLY uses the MPI-2 standard: the MPICH2 library on Linux systems was adopted. To run this version of FLY the working directory must be shared among all the processors that execute FLY. Additional comments: Full documentation for the program is included in the distribution in the form of a README file, a User Guide and a Reference manuscript. Running time: IBM Linux Cluster 1350, 512 nodes with 2 processors for each node and 2 GB RAM for each processor, at Cineca, was adopted to make performance tests. Processor type: Intel Xeon Pentium IV 3.0 GHz and 512 KB cache (128 nodes have Nocona processors). Internal Network: Myricom LAN Card "C" Version and "D" Version. Operating System: Linux SuSE SLES 8. The code was compiled using the mpif90 compiler version 8.1 and with basic optimization options in order to have performances that could be useful compared with other generic clusters Processors
      

      
      Synthesis and characterization of conductive, biodegradable, elastomeric polyurethanes for biomedical applications.
      PubMed
      Xu, Cancan; Yepez, Gerardo; Wei, Zi; Liu, Fuqiang; Bugarin, Alejandro; Hong, Yi
         2016-09-01
         Biodegradable conductive polymers are currently of significant interest in tissue repair and regeneration, drug delivery, and bioelectronics. However, biodegradable materials exhibiting both conductive and elastic properties have rarely been reported to date. To that end, an electrically conductive polyurethane (CPU) was synthesized from polycaprolactone diol, hexadiisocyanate, and aniline trimer and subsequently doped with (1S)-(+)-10-camphorsulfonic acid (CSA). All CPU films showed good elasticity within a 30% strain range. The electrical conductivity of the CPU films, as enhanced with increasing amounts of CSA, ranged from 2.7 ± 0.9 × 10(-10) to 4.4 ± 0.6 × 10(-7) S/cm in a dry state and 4.2 ± 0.5 × 10(-8) to 7.3 ± 1.5 × 10(-5) S/cm in a wet state. The redox peaks of a CPU1.5 film (molar ratio CSA:aniline trimer = 1.5:1) in the cyclic voltammogram confirmed the desired good electroactivity. The doped CPU film exhibited good electrical stability (87% of initial conductivity after 150 hours charge) as measured in a cell culture medium. The degradation rates of CPU films increased with increasing CSA content in both phosphate-buffered solution (PBS) and lipase/PBS solutions. After 7 days of enzymatic degradation, the conductivity of all CSA-doped CPU films had decreased to that of the undoped CPU film. Mouse 3T3 fibroblasts proliferated and spread on all CPU films. This developed biodegradable CPU with good elasticity, electrical stability, and biocompatibility may find potential applications in tissue engineering, smart drug release, and electronics. © 2016 Wiley Periodicals, Inc. J Biomed Mater Res Part A: 104A: 2305-2314, 2016. © 2016 Wiley Periodicals, Inc.
      

      
      Semiconductor Ion Implanters
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      MacKinnon, Barry A.; Ruffell, John P.
         
         In 1953 the Raytheon CK722 transistor was priced at $7.60. Based upon this, an Intel Xeon Quad Core processor containing 820,000,000 transistors should list at $6.2 billion. Particle accelerator technology plays an important part in the remarkable story of why that Intel product can be purchased today for a few hundred dollars. Most people of the mid twentieth century would be astonished at the ubiquity of semiconductors in the products we now buy and use every day. Though relatively expensive in the nineteen fifties they now exist in a wide range of items from high-end multicore microprocessors like the Intelmore » product to disposable items containing 'only' hundreds or thousands like RFID chips and talking greeting cards. This historical development has been fueled by continuous advancement of the several individual technologies involved in the production of semiconductor devices including Ion Implantation and the charged particle beamlines at the heart of implant machines. In the course of its 40 year development, the worldwide implanter industry has reached annual sales levels around $2B, installed thousands of dedicated machines and directly employs thousands of workers. It represents in all these measures, as much and possibly more than any other industrial application of particle accelerator technology. This presentation discusses the history of implanter development. It touches on some of the people involved and on some of the developmental changes and challenges imposed as the requirements of the semiconductor industry evolved.« less
      

      
      Connectivity: Performance Portable Algorithms for graph connectivity v. 0.1
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Slota, George; Rajamanickam, Sivasankaran; Madduri, Kamesh
         
         Graphs occur in several places in real world from road networks, social networks and scientific simulations. Connectivity is a graph analysis software to graph connectivity in modern architectures like multicore CPUs, Xeon Phi and GPUs.
      

      
      Acceleration of Monte Carlo simulation of photon migration in complex heterogeneous media using Intel many-integrated core architecture.
      PubMed
      Gorshkov, Anton V; Kirillin, Mikhail Yu
         2015-08-01
         Over two decades, the Monte Carlo technique has become a gold standard in simulation of light propagation in turbid media, including biotissues. Technological solutions provide further advances of this technique. The Intel Xeon Phi coprocessor is a new type of accelerator for highly parallel general purpose computing, which allows execution of a wide range of applications without substantial code modification. We present a technical approach of porting our previously developed Monte Carlo (MC) code for simulation of light transport in tissues to the Intel Xeon Phi coprocessor. We show that employing the accelerator allows reducing computational time of MC simulation and obtaining simulation speed-up comparable to GPU. We demonstrate the performance of the developed code for simulation of light transport in the human head and determination of the measurement volume in near-infrared spectroscopy brain sensing.
      

      
      Parallel Semi-Implicit Spectral Element Atmospheric Model
      NASA Astrophysics Data System (ADS)
      Fournier, A.; Thomas, S.; Loft, R.
         2001-05-01
         The shallow-water equations (SWE) have long been used to test atmospheric-modeling numerical methods. The SWE contain essential wave-propagation and nonlinear effects of more complete models. We present a semi-implicit (SI) improvement of the Spectral Element Atmospheric Model to solve the SWE (SEAM, Taylor et al. 1997, Fournier et al. 2000, Thomas & Loft 2000). SE methods are h-p finite element methods combining the geometric flexibility of size-h finite elements with the accuracy of degree-p spectral methods. Our work suggests that exceptional parallel-computation performance is achievable by a General-Circulation-Model (GCM) dynamical core, even at modest climate-simulation resolutions (>1o). The code derivation involves weak variational formulation of the SWE, Gauss(-Lobatto) quadrature over the collocation points, and Legendre cardinal interpolators. Appropriate weak variation yields a symmetric positive-definite Helmholtz operator. To meet the Ladyzhenskaya-Babuska-Brezzi inf-sup condition and avoid spurious modes, we use a staggered grid. The SI scheme combines leapfrog and Crank-Nicholson schemes for the nonlinear and linear terms respectively. The localization of operations to elements ideally fits the method to cache-based microprocessor computer architectures --derivatives are computed as collections of small (8x8), naturally cache-blocked matrix-vector products. SEAM also has desirable boundary-exchange communication, like finite-difference models. Timings on on the IBM SP and Compaq ES40 supercomputers indicate that the SI code (20-min timestep) requires 1/3 the CPU time of the explicit code (2-min timestep) for T42 resolutions. Both codes scale nearly linearly out to 400 processors. We achieved single-processor performance up to 30% of peak for both codes on the 375-MHz IBM Power-3 processors. Fast computation and linear scaling lead to a useful climate-simulation dycore only if enough model time is computed per unit wall-clock time. An efficient SI solver is essential to substantially increase this rate. Parallel preconditioning for an iterative conjugate-gradient elliptic solver is described. We are building a GCM dycore capable of 200 GF% lOPS sustained performance on clustered RISC/cache architectures using hybrid MPI/OpenMP programming.
      

      
      System for processing an encrypted instruction stream in hardware
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Griswold, Richard L.; Nickless, William K.; Conrad, Ryan C.
         
         A system and method of processing an encrypted instruction stream in hardware is disclosed. Main memory stores the encrypted instruction stream and unencrypted data. A central processing unit (CPU) is operatively coupled to the main memory. A decryptor is operatively coupled to the main memory and located within the CPU. The decryptor decrypts the encrypted instruction stream upon receipt of an instruction fetch signal from a CPU core. Unencrypted data is passed through to the CPU core without decryption upon receipt of a data fetch signal.
      

      
      A fast CT reconstruction scheme for a general multi-core PC.
      PubMed
      Zeng, Kai; Bai, Erwei; Wang, Ge
         2007-01-01
         Expensive computational cost is a severe limitation in CT reconstruction for clinical applications that need real-time feedback. A primary example is bolus-chasing computed tomography (CT) angiography (BCA) that we have been developing for the past several years. To accelerate the reconstruction process using the filtered backprojection (FBP) method, specialized hardware or graphics cards can be used. However, specialized hardware is expensive and not flexible. The graphics processing unit (GPU) in a current graphic card can only reconstruct images in a reduced precision and is not easy to program. In this paper, an acceleration scheme is proposed based on a multi-core PC. In the proposed scheme, several techniques are integrated, including utilization of geometric symmetry, optimization of data structures, single-instruction multiple-data (SIMD) processing, multithreaded computation, and an Intel C++ compilier. Our scheme maintains the original precision and involves no data exchange between the GPU and CPU. The merits of our scheme are demonstrated in numerical experiments against the traditional implementation. Our scheme achieves a speedup of about 40, which can be further improved by several folds using the latest quad-core processors.
      

      
      Mass production of extensive air showers for the Pierre Auger Collaboration using Grid Technology
      NASA Astrophysics Data System (ADS)
      Lozano Bahilo, Julio; Pierre Auger Collaboration
         2012-06-01
         When ultra-high energy cosmic rays enter the atmosphere they interact producing extensive air showers (EAS) which are the objects studied by the Pierre Auger Observatory. The number of particles involved in an EAS at these energies is of the order of billions and the generation of a single simulated EAS requires many hours of computing time with current processors. In addition, the storage space consumed by the output of one simulated EAS is very high. Therefore we have to make use of Grid resources to be able to generate sufficient quantities of showers for our physics studies in reasonable time periods. We have developed a set of highly automated scripts written in common software scripting languages in order to deal with the high number of jobs which we have to submit regularly to the Grid. In spite of the low number of sites supporting our Virtual Organization (VO) we have reached the top spot on CPU consumption among non LHC (Large Hadron Collider) VOs within EGI (European Grid Infrastructure).
      

      
      Generic Software Architecture for Launchers
      NASA Astrophysics Data System (ADS)
      Carre, Emilien; Gast, Philippe; Hiron, Emmanuel; Leblanc, Alain; Lesens, David; Mescam, Emmanuelle; Moro, Pierre
         2015-09-01
         The definition and reuse of generic software architecture for launchers is not so usual for several reasons: the number of European launcher families is very small (Ariane 5 and Vega for these last decades); the real time constraints (reactivity and determinism needs) are very hard; low levels of versatility are required (implying often an ad hoc development of the launcher mission). In comparison, satellites are often built on a generic platform made up of reusable hardware building blocks (processors, star-trackers, gyroscopes, etc.) and reusable software building blocks (middleware, TM/TC, On Board Control Procedure, etc.). If some of these reasons are still valid (e.g. the limited number of development), the increase of the available CPU power makes today an approach based on a generic time triggered middleware (ensuring the full determinism of the system) and a centralised mission and vehicle management (offering more flexibility in the design and facilitating the long term maintenance) achievable. This paper presents an example of generic software architecture which could be envisaged for future launchers, based on the previously described principles and supported by model driven engineering and automatic code generation.
      

        
       
          

«

21
      22
      23
      24
      25
   »

          
        

     

   

   Some links on this page may take you to non-federal websites. Their policies may differ from this site.