cpu time required: Topics by Science.gov

Sample records for cpu time required

Techniques for increasing the efficiency of Earth gravity calculations for precision orbit determination

NASA Technical Reports Server (NTRS)

Smith, R. L.; Lyubomirsky, A. S.

1981-01-01

Two techniques were analyzed. The first is a representation using Chebyshev expansions in three-dimensional cells. The second technique employs a temporary file for storing the components of the nonspherical gravity force. Computer storage requirements and relative CPU time requirements are presented. The Chebyshev gravity representation can provide a significant reduction in CPU time in precision orbit calculations, but at the cost of a large amount of direct-access storage space, which is required for a global model.
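As a rough illustration of the Chebyshev idea in this abstract, the sketch below fits a Chebyshev expansion to a function on a single one-dimensional interval and evaluates it with Clenshaw's recurrence. The report itself uses three-dimensional cells and a real gravity model; the function `inverse_square`, the interval, and the degree are illustrative assumptions, not details from the record.

```python
import math

def chebyshev_coefficients(f, a, b, degree):
    """Coefficients of the Chebyshev interpolant of f on the interval [a, b]."""
    n = degree + 1
    # Angles of the Chebyshev nodes; cos(theta) lies in [-1, 1].
    thetas = [math.pi * (k + 0.5) / n for k in range(n)]
    # Sample f at the nodes mapped from [-1, 1] to [a, b].
    values = [f(0.5 * (b - a) * math.cos(t) + 0.5 * (b + a)) for t in thetas]
    coeffs = []
    for j in range(n):
        s = sum(v * math.cos(j * t) for v, t in zip(values, thetas))
        coeffs.append(2.0 * s / n)
    coeffs[0] *= 0.5  # the constant term carries half weight
    return coeffs

def chebyshev_evaluate(coeffs, a, b, x):
    """Evaluate the expansion at x with Clenshaw's recurrence."""
    t = (2.0 * x - a - b) / (b - a)  # map x from [a, b] to [-1, 1]
    b1 = b2 = 0.0
    for c in reversed(coeffs[1:]):
        b1, b2 = c + 2.0 * t * b1 - b2, b1
    return coeffs[0] + t * b1 - b2

def inverse_square(r):
    """Hypothetical stand-in for an expensive force component."""
    return 1.0 / (r * r)

# Precompute once per cell (this is the storage cost), then evaluate cheaply.
coeffs = chebyshev_coefficients(inverse_square, 1.0, 1.2, 6)
approx = chebyshev_evaluate(coeffs, 1.0, 1.2, 1.07)
```

The trade-off mirrors the one the abstract reports: the coefficients must be stored for every cell, but each later evaluation costs only a handful of multiply-adds.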
Lossless data compression for improving the performance of a GPU-based beamformer.

PubMed

Lok, U-Wai; Fan, Gang-Wei; Li, Pai-Chi

2015-04-01

The powerful parallel computation ability of a graphics processing unit (GPU) makes it feasible to perform dynamic receive beamforming. However, a real-time GPU-based beamformer requires a high data rate to transfer radio-frequency (RF) data from hardware to software memory, as well as from central processing unit (CPU) to GPU memory. There are data compression methods (e.g. Joint Photographic Experts Group (JPEG)) available for the hardware front end to reduce data size, alleviating the data transfer requirement of the hardware interface. Nevertheless, the required decoding time may even be larger than the transmission time of its original data, in turn degrading the overall performance of the GPU-based beamformer. This article proposes and implements a lossless compression-decompression algorithm, which enables parallel compression and decompression of data. By this means, the data transfer requirement of the hardware interface and the transmission time of CPU to GPU data transfers are reduced, without sacrificing image quality. In simulation results, the compression ratio reached around 1.7. The encoder design of our lossless compression approach requires low hardware resources and reasonable latency in a field programmable gate array. In addition, the transmission time of transferring data from CPU to GPU with the parallel decoding process improved by threefold, as compared with transferring original uncompressed data. These results show that our proposed lossless compression plus parallel decoder approach not only mitigates the transmission bandwidth requirement to transfer data from the hardware front end to the software system but also reduces the transmission time for CPU to GPU data transfer. © The Author(s) 2014.
GPU-Acceleration of Sequence Homology Searches with Database Subsequence Clustering.

PubMed

Suzuki, Shuji; Kakuta, Masanori; Ishida, Takashi; Akiyama, Yutaka

2016-01-01

Sequence homology searches are used in various fields and require large amounts of computation time, especially for metagenomic analysis, owing to the large number of queries and the database size. To accelerate computing analyses, graphics processing units (GPUs) are widely used as a low-cost, high-performance computing platform. Therefore, we mapped the time-consuming steps involved in GHOSTZ, which is a state-of-the-art homology search algorithm for protein sequences, onto a GPU and implemented it as GHOSTZ-GPU. In addition, we optimized memory access for GPU calculations and for communication between the CPU and GPU. As per results of the evaluation test involving metagenomic data, GHOSTZ-GPU with 12 CPU threads and 1 GPU was approximately 3.0- to 4.1-fold faster than GHOSTZ with 12 CPU threads. Moreover, GHOSTZ-GPU with 12 CPU threads and 3 GPUs was approximately 5.8- to 7.7-fold faster than GHOSTZ with 12 CPU threads.
File Usage Analysis and Resource Usage Prediction: a Measurement-Based Study. Ph.D. Thesis

NASA Technical Reports Server (NTRS)

Devarakonda, Murthy V.-S.

1987-01-01

A probabilistic scheme was developed to predict process resource usage in UNIX. Given the identity of the program being run, the scheme predicts CPU time, file I/O, and memory requirements of a process at the beginning of its life. The scheme uses a state-transition model of the program's resource usage in its past executions for prediction. The states of the model are the resource regions obtained from an off-line cluster analysis of processes run on the system. The proposed method is shown to work on data collected from a VAX 11/780 running 4.3 BSD UNIX. The results show that the predicted values correlate well with the actual. The coefficient of correlation between the predicted and actual values of CPU time is 0.84. Errors in prediction are mostly small. Some 82% of errors in CPU time prediction are less than 0.5 standard deviations of process CPU time.
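The state-transition idea can be sketched as a first-order model over resource regions. This is only a minimal illustration under assumptions: the region labels, the single-step history, and the most-frequent-successor rule are hypothetical and are not taken from the thesis.

```python
from collections import Counter, defaultdict

def build_transition_counts(state_history):
    """Count how often each resource region followed each other region."""
    counts = defaultdict(Counter)
    for previous, following in zip(state_history, state_history[1:]):
        counts[previous][following] += 1
    return counts

def predict_next_state(counts, last_state):
    """Predict the most frequently observed successor of last_state."""
    successors = counts.get(last_state)
    if not successors:
        return last_state  # no history for this state: assume it repeats
    return successors.most_common(1)[0][0]

# Hypothetical regions (clusters of CPU time, file I/O, memory) seen in
# past executions of one program, oldest first.
history = ["low", "low", "low", "high", "low", "low", "low"]
counts = build_transition_counts(history)
predicted = predict_next_state(counts, history[-1])
```

In this toy history the program usually stays in the "low" region, so that region is the prediction for its next run.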
Predictability of process resource usage - A measurement-based study on UNIX

NASA Technical Reports Server (NTRS)

Devarakonda, Murthy V.; Iyer, Ravishankar K.

1989-01-01

A probabilistic scheme is developed to predict process resource usage in UNIX. Given the identity of the program being run, the scheme predicts CPU time, file I/O, and memory requirements of a process at the beginning of its life. The scheme uses a state-transition model of the program's resource usage in its past executions for prediction. The states of the model are the resource regions obtained from an off-line cluster analysis of processes run on the system. The proposed method is shown to work on data collected from a VAX 11/780 running 4.3 BSD UNIX. The results show that the predicted values correlate well with the actual. The correlation coefficient between the predicted and actual values of CPU time is 0.84. Errors in prediction are mostly small. Some 82 percent of errors in CPU time prediction are less than 0.5 standard deviations of process CPU time.
Predictability of process resource usage: A measurement-based study of UNIX

NASA Technical Reports Server (NTRS)

Devarakonda, Murthy V.; Iyer, Ravishankar K.

1987-01-01

A probabilistic scheme is developed to predict process resource usage in UNIX. Given the identity of the program being run, the scheme predicts CPU time, file I/O, and memory requirements of a process at the beginning of its life. The scheme uses a state-transition model of the program's resource usage in its past executions for prediction. The states of the model are the resource regions obtained from an off-line cluster analysis of processes run on the system. The proposed method is shown to work on data collected from a VAX 11/780 running 4.3 BSD UNIX. The results show that the predicted values correlate well with the actual. The correlation coefficient between the predicted and actual values of CPU time is 0.84. Errors in prediction are mostly small. Some 82% of errors in CPU time prediction are less than 0.5 standard deviations of process CPU time.
On the cost of approximating and recognizing a noise perturbed straight line or a quadratic curve segment in the plane. [central processing units]

NASA Technical Reports Server (NTRS)

Cooper, D. B.; Yalabik, N.

1975-01-01

Approximation of noisy data in the plane by straight lines or elliptic or single-branch hyperbolic curve segments arises in pattern recognition, data compaction, and other problems. The efficient search for and approximation of data by such curves were examined. Recursive least-squares linear curve-fitting was used, and ellipses and hyperbolas are parameterized as quadratic functions in x and y. The error minimized by the algorithm is interpreted, and central processing unit (CPU) times for estimating parameters for fitting straight lines and quadratic curves were determined and compared. CPU time for data search was also determined for the case of straight line fitting. Quadratic curve fitting is shown to require about six times as much CPU time as does straight line fitting, and curves relating CPU time and fitting error were determined for straight line fitting. Results are derived on early sequential determination of whether or not the underlying curve is a straight line.
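Recursive least-squares line fitting, as mentioned in this abstract, can be sketched as follows. The parameterization (slope and intercept), the large initial variance, and the sample points are assumptions chosen for illustration; the report's own formulation and its quadratic-curve variant are not reproduced here.

```python
def rls_line_fit(points, initial_variance=1e6):
    """Fit y = slope * x + intercept, updating the estimate one point at a time."""
    theta = [0.0, 0.0]  # [slope, intercept]
    # P tracks the uncertainty of theta; a large start means "no prior knowledge".
    P = [[initial_variance, 0.0], [0.0, initial_variance]]
    for x, y in points:
        phi = [x, 1.0]  # regressor for this sample
        P_phi = [P[0][0] * phi[0] + P[0][1] * phi[1],
                 P[1][0] * phi[0] + P[1][1] * phi[1]]
        denom = 1.0 + phi[0] * P_phi[0] + phi[1] * P_phi[1]
        gain = [P_phi[0] / denom, P_phi[1] / denom]
        error = y - (theta[0] * phi[0] + theta[1] * phi[1])
        theta = [theta[0] + gain[0] * error, theta[1] + gain[1] * error]
        # P <- P - gain * (P phi)^T, valid because P stays symmetric.
        P = [[P[i][j] - gain[i] * P_phi[j] for j in range(2)] for i in range(2)]
    return theta

# Noise-free samples of the hypothetical line y = 2x + 1.
samples = [(float(x), 2.0 * x + 1.0) for x in range(10)]
slope, intercept = rls_line_fit(samples)
```

Each new point costs a fixed number of operations, which is why per-point CPU time is a natural unit of comparison. A quadratic fit would use a longer regressor and a larger matrix, so it is plausibly more expensive per update, in line with the roughly sixfold cost the abstract reports.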
GPU Optimizations for a Production Molecular Docking Code*

PubMed Central

Landaverde, Raphael; Herbordt, Martin C.

2015-01-01

Modeling molecular docking is critical to both understanding life processes and designing new drugs. In previous work we created the first published GPU-accelerated docking code (PIPER) which achieved a roughly 5× speed-up over a contemporaneous 4 core CPU. Advances in GPU architecture and in the CPU code, however, have since reduced this relative performance by a factor of 10. In this paper we describe the upgrade of GPU PIPER. This required an entire rewrite, including algorithm changes and moving most remaining non-accelerated CPU code onto the GPU. The result is a 7× improvement in GPU performance and a 3.3× speedup over the CPU-only code. We find that this difference in time is almost entirely due to the difference in run times of the 3D FFT library functions on CPU (MKL) and GPU (cuFFT), respectively. The GPU code has been integrated into the ClusPro docking server which has over 4000 active users. PMID:26594667
GPU-Acceleration of Sequence Homology Searches with Database Subsequence Clustering

PubMed Central

Suzuki, Shuji; Kakuta, Masanori; Ishida, Takashi; Akiyama, Yutaka

2016-01-01

Sequence homology searches are used in various fields and require large amounts of computation time, especially for metagenomic analysis, owing to the large number of queries and the database size. To accelerate computing analyses, graphics processing units (GPUs) are widely used as a low-cost, high-performance computing platform. Therefore, we mapped the time-consuming steps involved in GHOSTZ, which is a state-of-the-art homology search algorithm for protein sequences, onto a GPU and implemented it as GHOSTZ-GPU. In addition, we optimized memory access for GPU calculations and for communication between the CPU and GPU. As per results of the evaluation test involving metagenomic data, GHOSTZ-GPU with 12 CPU threads and 1 GPU was approximately 3.0- to 4.1-fold faster than GHOSTZ with 12 CPU threads. Moreover, GHOSTZ-GPU with 12 CPU threads and 3 GPUs was approximately 5.8- to 7.7-fold faster than GHOSTZ with 12 CPU threads. PMID:27482905
GPU Optimizations for a Production Molecular Docking Code.

PubMed

Landaverde, Raphael; Herbordt, Martin C

2014-09-01

Modeling molecular docking is critical to both understanding life processes and designing new drugs. In previous work we created the first published GPU-accelerated docking code (PIPER) which achieved a roughly 5× speed-up over a contemporaneous 4 core CPU. Advances in GPU architecture and in the CPU code, however, have since reduced this relative performance by a factor of 10. In this paper we describe the upgrade of GPU PIPER. This required an entire rewrite, including algorithm changes and moving most remaining non-accelerated CPU code onto the GPU. The result is a 7× improvement in GPU performance and a 3.3× speedup over the CPU-only code. We find that this difference in time is almost entirely due to the difference in run times of the 3D FFT library functions on CPU (MKL) and GPU (cuFFT), respectively. The GPU code has been integrated into the ClusPro docking server which has over 4000 active users.
A fast sequence assembly method based on compressed data structures.

PubMed

Liang, Peifeng; Zhang, Yancong; Lin, Kui; Hu, Jinglu

2014-01-01

Assembling a large genome using next generation sequencing reads requires large computer memory and a long execution time. To reduce these requirements, a memory- and time-efficient assembler, called FMJ-Assembler, is presented by applying the FM-index in JR-Assembler, where FM stands for the FMR-index derived from the FM-index and BWT, and J stands for jumping extension. The FMJ-Assembler uses an expanded FM-index and BWT to compress read data and save memory, and the jumping extension method makes it faster in CPU time. An extensive comparison of the FMJ-Assembler with current assemblers shows that the FMJ-Assembler achieves a better or comparable overall assembly quality and requires lower memory use and less CPU time. All these advantages indicate that the FMJ-Assembler will be an efficient assembly method in next generation sequencing technology.
On the Finite Element Implementation of the Generalized Method of Cells Micromechanics Constitutive Model

NASA Technical Reports Server (NTRS)

Wilt, T. E.

1995-01-01

The Generalized Method of Cells (GMC), a micromechanics based constitutive model, is implemented into the finite element code MARC using the user subroutine HYPELA. Comparisons in terms of transverse deformation response, micro stress and strain distributions, and required CPU time are presented for GMC and finite element models of fiber/matrix unit cell. GMC is shown to provide comparable predictions of the composite behavior and requires significantly less CPU time as compared to a finite element analysis of the unit cell. Details as to the organization of the HYPELA code are provided with the actual HYPELA code included in the appendix.
Benchmarking hardware architecture candidates for the NFIRAOS real-time controller

NASA Astrophysics Data System (ADS)

Smith, Malcolm; Kerley, Dan; Herriot, Glen; Véran, Jean-Pierre

2014-07-01

As a part of the trade study for the Narrow Field Infrared Adaptive Optics System, the adaptive optics system for the Thirty Meter Telescope, we investigated the feasibility of performing real-time control computation using a Linux operating system and Intel Xeon E5 CPUs. We also investigated a Xeon Phi based architecture which allows higher levels of parallelism. This paper summarizes both the CPU based real-time controller architecture and the Xeon Phi based RTC. The Intel Xeon E5 CPU solution meets the requirements and performs the computation for one AO cycle in an average of 767 microseconds. The Xeon Phi solution did not meet the 1200 microsecond time requirement and also suffered from unpredictable execution times. More detailed benchmark results are reported for both architectures.
Optimum element density studies for finite-element thermal analysis of hypersonic aircraft structures

NASA Technical Reports Server (NTRS)

Ko, William L.; Olona, Timothy; Muramoto, Kyle M.

1990-01-01

Different finite element models previously set up for thermal analysis of the space shuttle orbiter structure are discussed and their shortcomings identified. Element density criteria are established for the finite element thermal modelings of space shuttle orbiter-type large, hypersonic aircraft structures. These criteria are based on rigorous studies on solution accuracies using different finite element models having different element densities set up for one cell of the orbiter wing. Also, a method for optimization of the transient thermal analysis computer central processing unit (CPU) time is discussed. Based on the newly established element density criteria, the orbiter wing midspan segment was modeled for the examination of thermal analysis solution accuracies and the extent of computation CPU time requirements. The results showed that the distributions of the structural temperatures and the thermal stresses obtained from this wing segment model were satisfactory and the computation CPU time was at the acceptable level. The studies offered the hope that modeling the large, hypersonic aircraft structures using high-density elements for transient thermal analysis is possible if a CPU optimization technique was used.
Parallelized computation for computer simulation of electrocardiograms using personal computers with multi-core CPU and general-purpose GPU.

PubMed

Shen, Wenfeng; Wei, Daming; Xu, Weimin; Zhu, Xin; Yuan, Shizhong

2010-10-01

Biological computations like electrocardiological modelling and simulation usually require high-performance computing environments. This paper introduces an implementation of parallel computation for computer simulation of electrocardiograms (ECGs) in a personal computer environment with an Intel CPU of Core (TM) 2 Quad Q6600 and a GPU of Geforce 8800GT, with software support by OpenMP and CUDA. It was tested in three parallelization device setups: (a) a four-core CPU without a general-purpose GPU, (b) a general-purpose GPU plus 1 core of CPU, and (c) a four-core CPU plus a general-purpose GPU. To effectively take advantage of a multi-core CPU and a general-purpose GPU, an algorithm based on load-prediction dynamic scheduling was developed and applied to setting (c). In the simulation with 1600 time steps, the speedup of the parallel computation as compared to the serial computation was 3.9 in setting (a), 16.8 in setting (b), and 20.0 in setting (c). This study demonstrates that a current PC with a multi-core CPU and a general-purpose GPU provides a good environment for parallel computations in biological modelling and simulation studies. Copyright 2010 Elsevier Ireland Ltd. All rights reserved.
Airloads on Bluff Bodies, with Application to the Rotor-Induced Downloads on Tilt-Rotor Aircraft.

DTIC Science & Technology

1983-09-01

interference aerodynamics would be tion on hover performance (Ref. (11). to study the two-dimensional section characteristics of a wing in the wake of a ... resources for large numbers of vortices; a typical case requires 10-15 min CPU time on the Ames Cray 1S computer. Figure 6 shows a typical result. Here ... CPU time per case on a Prime 550 computer to converge to a steady solution; this would be equivalent to one or two seconds on
SU-E-J-91: FFT Based Medical Image Registration Using a Graphics Processing Unit (GPU).

PubMed

Luce, J; Hoggarth, M; Lin, J; Block, A; Roeske, J

2012-06-01

To evaluate the efficiency gains obtained from using a Graphics Processing Unit (GPU) to perform a Fourier Transform (FT) based image registration. Fourier-based image registration involves obtaining the FT of the component images, and analyzing them in Fourier space to determine the translations and rotations of one image set relative to another. An important property of FT registration is that by enlarging the images (adding additional pixels), one can obtain translations and rotations with sub-pixel resolution. The expense, however, is an increased computational time. GPUs may decrease the computational time associated with FT image registration by taking advantage of their parallel architecture to perform matrix computations much more efficiently than a Central Processor Unit (CPU). In order to evaluate the computational gains produced by a GPU, images with known translational shifts were utilized. A program was written in the Interactive Data Language (IDL; Exelis, Boulder, CO) to perform CPU-based calculations. Subsequently, the program was modified using GPU bindings (Tech-X, Boulder, CO) to perform GPU-based computation on the same system. Multiple image sizes were used, ranging from 256×256 to 2304×2304. The time required to complete the full algorithm by the CPU and GPU was benchmarked and the speed increase was defined as the ratio of the CPU-to-GPU computational time. The ratio of the CPU-to-GPU time was greater than 1.0 for all images, which indicates the GPU is performing the algorithm faster than the CPU. The smallest improvement, a 1.21 ratio, was found with the smallest image size of 256×256, and the largest speedup, a 4.25 ratio, was observed with the largest image size of 2304×2304. GPU programming resulted in a significant decrease in computational time associated with an FT image registration algorithm. The inclusion of the GPU may provide near real-time, sub-pixel registration capability. © 2012 American Association of Physicists in Medicine.
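The Fourier-space analysis described here is commonly done by phase correlation. The sketch below is a deliberately small, assumption-laden version: it works on a one-dimensional signal, recovers only an integer circular shift, and uses a naive O(n²) transform written out in plain Python rather than an FFT library or the IDL and GPU bindings named in the abstract.

```python
import cmath

def dft(signal, inverse=False):
    """Naive discrete Fourier transform (O(n^2)), enough for a small demo."""
    n = len(signal)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(signal[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                for t in range(n))
        out.append(s / n if inverse else s)
    return out

def estimate_shift(reference, moved):
    """Estimate the circular shift of `moved` relative to `reference`."""
    F = dft(reference)
    G = dft(moved)
    cross = []
    for f, g in zip(F, G):
        c = g * f.conjugate()
        magnitude = abs(c)
        # Keep only the phase; skip frequencies with no energy.
        cross.append(c / magnitude if magnitude > 1e-12 else 0.0)
    correlation = dft(cross, inverse=True)
    # The correlation peaks at the index equal to the shift.
    return max(range(len(correlation)), key=lambda i: correlation[i].real)

reference = [0.0, 1.0, 4.0, 2.0, 0.0, 0.0, 0.0, 0.0]
moved = reference[-3:] + reference[:-3]  # shift right by 3 samples
shift = estimate_shift(reference, moved)
```

Almost all of the work sits in the three transforms, which is why moving them to a GPU is where a speedup would be expected. The image enlargement the abstract uses for sub-pixel resolution makes those transforms larger still.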
A CPU benchmark for protein crystallographic refinement.

PubMed

Bourne, P E; Hendrickson, W A

1990-01-01

The CPU time required to complete a cycle of restrained least-squares refinement of a protein structure from X-ray crystallographic data using the FORTRAN codes PROTIN and PROLSQ is reported for 48 different processors, ranging from single-user workstations to supercomputers. Sequential, vector, VLIW, multiprocessor, and RISC hardware architectures are compared using both a small and a large protein structure. Representative compile times for each hardware type are also given, and the improvement in run-time when coding for a specific hardware architecture is considered. The benchmarks involve scalar integer and vector floating point arithmetic and are representative of the calculations performed in many scientific disciplines.
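For readers who want to reproduce this kind of measurement on a modern system, the sketch below shows one way to separate CPU time from elapsed wall-clock time using Python's standard library. The workload `sum_of_squares` is a hypothetical stand-in and has nothing to do with PROTIN or PROLSQ.

```python
import time

def measure(func, *args):
    """Run func(*args) and return (result, cpu_seconds, wall_seconds)."""
    wall_start = time.perf_counter()  # elapsed real time
    cpu_start = time.process_time()   # CPU time used by this process only
    result = func(*args)
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    return result, cpu_seconds, wall_seconds

def sum_of_squares(n):
    """Hypothetical compute-bound workload."""
    return sum(i * i for i in range(n))

result, cpu_seconds, wall_seconds = measure(sum_of_squares, 100_000)
```

CPU time excludes periods when the process is sleeping or waiting on I/O, so it is the fairer figure when comparing processors on a compute-bound benchmark.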
Adaptive real-time methodology for optimizing energy-efficient computing

DOEpatents

Hsu, Chung-Hsing [Los Alamos, NM]; Feng, Wu-Chun [Blacksburg, VA]

2011-06-28

Dynamic voltage and frequency scaling (DVFS) is an effective way to reduce energy and power consumption in microprocessor units. Current implementations of DVFS suffer from inaccurate modeling of power requirements and usage, and from inaccurate characterization of the relationships between the applicable variables. A system and method is proposed that adjusts CPU frequency and voltage based on run-time calculations of the workload processing time, as well as a calculation of performance sensitivity with respect to CPU frequency. The system and method are processor independent, and can be applied to either an entire system as a unit, or individually to each process running on a system.
32 CFR 701.53 - FOIA fee schedule.

Code of Federal Regulations, 2014 CFR

2014-07-01

... human time) and machine time. (1) Human time. Human time is all the time spent by humans performing the...) Machine time. Machine time involves only direct costs of the central processing unit (CPU), input/output... exist to calculate CPU time, no machine costs can be passed on to the requester. When CPU calculations...

32 CFR 701.53 - FOIA fee schedule.

Code of Federal Regulations, 2012 CFR

2012-07-01

... human time) and machine time. (1) Human time. Human time is all the time spent by humans performing the...) Machine time. Machine time involves only direct costs of the central processing unit (CPU), input/output... exist to calculate CPU time, no machine costs can be passed on to the requester. When CPU calculations...
32 CFR 701.53 - FOIA fee schedule.

Code of Federal Regulations, 2013 CFR

2013-07-01

... human time) and machine time. (1) Human time. Human time is all the time spent by humans performing the...) Machine time. Machine time involves only direct costs of the central processing unit (CPU), input/output... exist to calculate CPU time, no machine costs can be passed on to the requester. When CPU calculations...
Performance analysis of the FDTD method applied to holographic volume gratings: Multi-core CPU versus GPU computing

NASA Astrophysics Data System (ADS)

Francés, J.; Bleda, S.; Neipp, C.; Márquez, A.; Pascual, I.; Beléndez, A.

2013-03-01

The finite-difference time-domain method (FDTD) allows electromagnetic field distribution analysis as a function of time and space. The method is applied to analyze holographic volume gratings (HVGs) for the near-field distribution at optical wavelengths. Usually, this application requires the simulation of wide areas, which implies more memory and processing time. In this work, we propose a specific implementation of the FDTD method including several add-ons for a precise simulation of optical diffractive elements. Values in the near-field region are computed considering the illumination of the grating by means of a plane wave for different angles of incidence and including absorbing boundaries as well. We compare the results obtained by FDTD with those obtained using a matrix method (MM) applied to diffraction gratings. In addition, we have developed two optimized versions of the algorithm, for both CPU and GPU, in order to analyze the improvement of using the new NVIDIA Fermi GPU architecture versus a highly tuned multi-core CPU as a function of the simulation size. In particular, the optimized CPU implementation takes advantage of the arithmetic and data transfer streaming SIMD (single instruction multiple data) extensions (SSE) included explicitly in the code and also of multi-threading by means of OpenMP directives. Good agreement between the results of the FDTD and MM methods is obtained, thus validating our methodology. Moreover, the performance of the GPU is compared to the SSE+OpenMP CPU implementation, and it is quantitatively determined that a highly optimized CPU program can be competitive for a wider range of simulation sizes, whereas GPU computing becomes more powerful for large-scale simulations.
Upwind relaxation methods for the Navier-Stokes equations using inner iterations

NASA Technical Reports Server (NTRS)

Taylor, Arthur C., III; Ng, Wing-Fai; Walters, Robert W.

1992-01-01

A subsonic and a supersonic problem are respectively treated by an upwind line-relaxation algorithm for the Navier-Stokes equations using inner iterations to accelerate steady-state solution convergence and thereby minimize CPU time. While the ability of the inner iterative procedure to mimic the quadratic convergence of the direct solver method is attested to in both test problems, some of the nonquadratic inner iterative results are noted to have been more efficient than the quadratic. In the more successful, supersonic test case, inner iteration required only about 65 percent of the line-relaxation method-entailed CPU time.
Comparison of Conjugate Gradient Density Matrix Search and Chebyshev Expansion Methods for Avoiding Diagonalization in Large-Scale Electronic Structure Calculations

NASA Technical Reports Server (NTRS)

Bates, Kevin R.; Daniels, Andrew D.; Scuseria, Gustavo E.

1998-01-01

We report a comparison of two linear-scaling methods which avoid the diagonalization bottleneck of traditional electronic structure algorithms. The Chebyshev expansion method (CEM) is implemented for carbon tight-binding calculations of large systems and its memory and timing requirements compared to those of our previously implemented conjugate gradient density matrix search (CG-DMS). Benchmark calculations are carried out on icosahedral fullerenes from C60 to C8640 and the linear scaling memory and CPU requirements of the CEM demonstrated. We show that the CPU requisites of the CEM and CG-DMS are similar for calculations with comparable accuracy.
Agglomeration Multigrid for an Unstructured-Grid Flow Solver

NASA Technical Reports Server (NTRS)

Frink, Neal; Pandya, Mohagna J.

2004-01-01

An agglomeration multigrid scheme has been implemented into the sequential version of the NASA code USM3Dns, a tetrahedral cell-centered finite volume Euler/Navier-Stokes flow solver. Efficiency and robustness of the multigrid-enhanced flow solver have been assessed for three configurations assuming an inviscid flow and one configuration assuming a viscous fully turbulent flow. The inviscid studies include a transonic flow over the ONERA M6 wing and a generic business jet with flow-through nacelles and a low subsonic flow over a high-lift trapezoidal wing. The viscous case includes a fully turbulent flow over the RAE 2822 rectangular wing. The multigrid solutions converged with 12%-33% of the Central Processing Unit (CPU) time required by the solutions obtained without multigrid. For all of the inviscid cases, multigrid in conjunction with an explicit time-stepping scheme performed the best with regard to the run time memory and CPU time requirements. However, for the viscous case multigrid had to be used with an implicit backward Euler time-stepping scheme that increased the run time memory requirement by 22% as compared to the run made without multigrid.
Efficient methods for implementation of multi-level nonrigid mass-preserving image registration on GPUs and multi-threaded CPUs.

PubMed

Ellingwood, Nathan D; Yin, Youbing; Smith, Matthew; Lin, Ching-Long

2016-04-01

Faster and more accurate methods for registration of images are important for research involved in conducting population-based studies that utilize medical imaging, as well as improvements for use in clinical applications. We present a novel computation- and memory-efficient multi-level method on graphics processing units (GPU) for performing registration of two computed tomography (CT) volumetric lung images. We developed a computation- and memory-efficient Diffeomorphic Multi-level B-Spline Transform Composite (DMTC) method to implement nonrigid mass-preserving registration of two CT lung images on GPU. The framework consists of a hierarchy of B-Spline control grids of increasing resolution. A similarity criterion known as the sum of squared tissue volume difference (SSTVD) was adopted to preserve lung tissue mass. The use of SSTVD consists of the calculation of the tissue volume, the Jacobian, and their derivatives, which makes its implementation on GPU challenging due to memory constraints. The use of the DMTC method enabled reduced computation and memory storage of variables with minimal communication between GPU and Central Processing Unit (CPU) due to ability to pre-compute values. The method was assessed on six healthy human subjects. Resultant GPU-generated displacement fields were compared against the previously validated CPU counterpart fields, showing good agreement with an average normalized root mean square error (nRMS) of 0.044±0.015. Runtime and performance speedup are compared between single-threaded CPU, multi-threaded CPU, and GPU algorithms. Best performance speedup occurs at the highest resolution in the GPU implementation for the SSTVD cost and cost gradient computations, with a speedup of 112 times that of the single-threaded CPU version and 11 times over the twelve-threaded version when considering average time per iteration using a Nvidia Tesla K20X GPU. 
The proposed GPU-based DMTC method outperforms its multi-threaded CPU version in terms of runtime. Total registration runtime was reduced to 2.9 min on the GPU version, compared to 12.8 min on the twelve-threaded CPU version and 112.5 min on a single-threaded CPU. Furthermore, the GPU implementation discussed in this work can be adapted for use with other cost functions that require calculation of the first derivatives. Copyright © 2015 Elsevier Ireland Ltd. All rights reserved.
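The end-to-end speedups implied by the total runtimes quoted in this abstract can be worked out directly. The helper name `speedup` is an assumption for illustration; the three runtimes are the ones stated in the record.

```python
def speedup(baseline_minutes, improved_minutes):
    """Ratio of baseline runtime to improved runtime."""
    return baseline_minutes / improved_minutes

gpu_minutes = 2.9             # GPU version
cpu12_minutes = 12.8          # twelve-threaded CPU version
cpu1_minutes = 112.5          # single-threaded CPU version

gpu_vs_single = speedup(cpu1_minutes, gpu_minutes)    # about 38.8
gpu_vs_twelve = speedup(cpu12_minutes, gpu_minutes)   # about 4.4
twelve_vs_single = speedup(cpu1_minutes, cpu12_minutes)  # about 8.8
```

These whole-registration ratios are smaller than the 112-fold and 11-fold figures in the abstract, which refer to the average time per iteration of the SSTVD cost and cost gradient computations at the highest resolution only.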
Adaptive real-time methodology for optimizing energy-efficient computing

DOE Office of Scientific and Technical Information (OSTI.GOV)

Hsu, Chung-Hsing; Feng, Wu-Chun

Dynamic voltage and frequency scaling (DVFS) is an effective way to reduce energy and power consumption in microprocessor units. Current implementations of DVFS suffer from inaccurate modeling of power requirements and usage, and from inaccurate characterization of the relationships between the applicable variables. A system and method is proposed that adjusts CPU frequency and voltage based on run-time calculations of the workload processing time, as well as a calculation of performance sensitivity with respect to CPU frequency. The system and method are processor independent, and can be applied to either an entire system as a unit, or individually to each process running on a system.
Sequence search on a supercomputer.

PubMed

Gotoh, O; Tagashira, Y

1986-01-10

A set of programs was developed for searching nucleic acid and protein sequence data bases for sequences similar to a given sequence. The programs, written in FORTRAN 77, were optimized for vector processing on a Hitachi S810-20 supercomputer. A search of a 500-residue protein sequence against the entire PIR data base Ver. 1.0 (1) (0.5 M residues) is carried out in a CPU time of 45 sec. About 4 min is required for an exhaustive search of a 1500-base nucleotide sequence against all mammalian sequences (1.2M bases) in Genbank Ver. 29.0. The CPU time is reduced to about a quarter with a faster version.
Accelerating Spaceborne SAR Imaging Using Multiple CPU/GPU Deep Collaborative Computing

PubMed Central

Zhang, Fan; Li, Guojun; Li, Wei; Hu, Wei; Hu, Yuxin

2016-01-01

With the development of synthetic aperture radar (SAR) technologies in recent years, the huge amount of remote sensing data brings challenges for real-time imaging processing. Therefore, high performance computing (HPC) methods have been presented to accelerate SAR imaging, especially GPU-based methods. In the classical GPU-based imaging algorithm, the GPU is employed to accelerate image processing by massively parallel computing, and the CPU is only used to perform auxiliary work such as data input/output (IO). However, the computing capability of the CPU is ignored and underestimated. In this work, a new deep collaborative SAR imaging method based on multiple CPU/GPU is proposed to achieve real-time SAR imaging. Through the proposed task partitioning and scheduling strategy, the whole image can be generated with deep collaborative multiple CPU/GPU computing. For CPU parallel imaging, the advanced vector extension (AVX) method is first introduced into the multi-core CPU parallel method for higher efficiency. For GPU parallel imaging, not only are the bottlenecks of memory limitation and frequent data transfer broken, but various optimization strategies are also applied, such as streaming and parallel pipelining. Experimental results demonstrate that the deep CPU/GPU collaborative imaging method improves the efficiency of SAR imaging by 270 times relative to a single-core CPU and achieves real-time imaging, in that the imaging rate outperforms the raw data generation rate. PMID:27070606
Accelerating Spaceborne SAR Imaging Using Multiple CPU/GPU Deep Collaborative Computing.

PubMed

Zhang, Fan; Li, Guojun; Li, Wei; Hu, Wei; Hu, Yuxin

2016-04-07

With the development of synthetic aperture radar (SAR) technologies in recent years, the huge amount of remote sensing data brings challenges for real-time imaging processing. Therefore, high performance computing (HPC) methods have been presented to accelerate SAR imaging, especially GPU-based methods. In the classical GPU-based imaging algorithm, the GPU is employed to accelerate image processing by massively parallel computing, and the CPU is only used to perform auxiliary work such as data input/output (IO). However, the computing capability of the CPU is ignored and underestimated. In this work, a new deep collaborative SAR imaging method based on multiple CPU/GPU is proposed to achieve real-time SAR imaging. Through the proposed task partitioning and scheduling strategy, the whole image can be generated with deep collaborative multiple CPU/GPU computing. For CPU parallel imaging, the advanced vector extension (AVX) method is first introduced into the multi-core CPU parallel method for higher efficiency. For GPU parallel imaging, not only are the bottlenecks of memory limitation and frequent data transfer broken, but various optimization strategies are also applied, such as streaming and parallel pipelining. Experimental results demonstrate that the deep CPU/GPU collaborative imaging method improves the efficiency of SAR imaging by 270 times relative to a single-core CPU and achieves real-time imaging, in that the imaging rate outperforms the raw data generation rate.
Two-dimensional Euler and Navier-Stokes Time accurate simulations of fan rotor flows

NASA Technical Reports Server (NTRS)

Boretti, A. A.

1990-01-01

Two numerical methods are presented which describe the unsteady flow field in the blade-to-blade plane of an axial fan rotor. These methods solve the compressible, time-dependent, Euler and the compressible, turbulent, time-dependent, Navier-Stokes conservation equations for mass, momentum, and energy. The Navier-Stokes equations are written in Favre-averaged form and are closed with an approximate two-equation turbulence model with low Reynolds number and compressibility effects included. The unsteady aerodynamic component is obtained by superposing inflow or outflow unsteadiness to the steady conditions through time-dependent boundary conditions. The integration in space is performed by using a finite volume scheme, and the integration in time is performed by using k-stage Runge-Kutta schemes, k = 2,5. The numerical integration algorithm allows the reduction of the computational cost of an unsteady simulation involving high frequency disturbances in both CPU time and memory requirements. Less than 200 sec of CPU time is required to advance the Euler equations on a computational grid of about 2000 grid points through 10,000 time steps on a CRAY Y-MP computer, with a required memory of less than 0.3 megawords.
Evaluation of the CPU time for solving the radiative transfer equation with high-order resolution schemes applying the normalized weighting-factor method

NASA Astrophysics Data System (ADS)

Xamán, J.; Zavala-Guillén, I.; Hernández-López, I.; Uriarte-Flores, J.; Hernández-Pérez, I.; Macías-Melo, E. V.; Aguilar-Castro, K. M.

2018-03-01

In this paper, we evaluated the convergence rate (CPU time) of a new mathematical formulation for the numerical solution of the radiative transfer equation (RTE) with several High-Order (HO) and High-Resolution (HR) schemes. In computational fluid dynamics, this procedure is known as the Normalized Weighting-Factor (NWF) method and it is adopted here. The NWF method is used to incorporate the high-order resolution schemes in the discretized RTE. The NWF method is compared, in terms of computer time needed to obtain a converged solution, with the widely used deferred-correction (DC) technique for the calculations of a two-dimensional cavity with emitting-absorbing-scattering gray media using the discrete ordinates method. Six parameters, viz. the grid size, the order of quadrature, the absorption coefficient, the emissivity of the boundary surface, the under-relaxation factor, and the scattering albedo are considered to evaluate ten schemes. The results showed that using the DC method, in general, the scheme that had the lowest CPU time is the SOU. In contrast with the results of the DC procedure, the CPU time for the DIAMOND and QUICK schemes using the NWF method is shown to be between 3.8 and 23.1% faster and between 12.6 and 56.1% faster, respectively. However, the other schemes are more time consuming when the NWF is used instead of the DC method. Additionally, a second test case was presented and the results showed that, depending on the problem under consideration, the NWF procedure may be computationally faster or slower than the DC method. As an example, the CPU times for the QUICK and SMART schemes are 61.8 and 203.7% slower, respectively, when the NWF formulation is used for the second test case. Finally, future research to explore the computational cost of the NWF method in more complex problems is required.
Dynamic Quantum Allocation and Swap-Time Variability in Time-Sharing Operating Systems.

ERIC Educational Resources Information Center

Bhat, U. Narayan; Nance, Richard E.

The effects of dynamic quantum allocation and swap-time variability on central processing unit (CPU) behavior are investigated using a model that allows both quantum length and swap-time to be state-dependent random variables. Effective CPU utilization is defined to be the proportion of a CPU busy period that is devoted to program processing, i.e.…
Performance of the OVERFLOW-MLP and LAURA-MLP CFD Codes on the NASA Ames 512 CPU Origin System

NASA Technical Reports Server (NTRS)

Taft, James R.

2000-01-01

The shared memory Multi-Level Parallelism (MLP) technique, developed last year at NASA Ames, has been very successful in dramatically improving the performance of important NASA CFD codes. This new and very simple parallel programming technique was first inserted into the OVERFLOW production CFD code in FY 1998. The OVERFLOW-MLP code's parallel performance scaled linearly to 256 CPUs on the NASA Ames 256 CPU Origin 2000 system (steger). Overall performance exceeded 20.1 GFLOP/s, or about 4.5x the performance of a dedicated 16 CPU C90 system. All of this was achieved without any major modification to the original vector based code. The OVERFLOW-MLP code is now in production on the in-house Origin systems as well as being used offsite at commercial aerospace companies. Partially as a result of this work, NASA Ames has purchased a new 512 CPU Origin 2000 system to further test the limits of parallel performance for NASA codes of interest. This paper presents the performance obtained from the latest optimization efforts on this machine for the LAURA-MLP and OVERFLOW-MLP codes. The Langley Aerothermodynamics Upwind Relaxation Algorithm (LAURA) code is a key simulation tool in the development of the next generation shuttle, interplanetary reentry vehicles, and nearly all "X" plane development. This code sustains about 4-5 GFLOP/s on a dedicated 16 CPU C90. At this rate, expected workloads would require over 100 C90 CPU years of computing over the next few calendar years. It is not feasible to expect that this would be affordable or available to the user community. Dramatic performance gains on cheaper systems are needed. This code is expected to be perhaps the largest consumer of NASA Ames compute cycles per run in the coming year. The OVERFLOW CFD code is extensively used in the government and commercial aerospace communities to evaluate new aircraft designs.
It is one of the largest consumers of NASA supercomputing cycles and large simulations of highly resolved full aircraft are routinely undertaken. Typical large problems might require hundreds of Cray C90 CPU hours to complete. The dramatic performance gains with the 256 CPU steger system are exciting. Obtaining results in hours instead of months is revolutionizing the way in which aircraft manufacturers are looking at future aircraft simulation work. Figure 2 below is a current state of the art plot of OVERFLOW-MLP performance on the 512 CPU Lomax system. As can be seen, the chart indicates that OVERFLOW-MLP continues to scale linearly with CPU count up to 512 CPUs on a large 35 million point full aircraft RANS simulation. At this point performance is such that a fully converged simulation of 2500 time steps is completed in less than 2 hours of elapsed time. Further work over the next few weeks will improve the performance of this code even further. The LAURA code has been converted to the MLP format as well. This code is currently being optimized for the 512 CPU system. Performance statistics indicate that the goal of 100 GFLOP/s will be achieved by year's end. This amounts to 20x the 16 CPU C90 result and strongly demonstrates the viability of the new parallel systems rapidly solving very large simulations in a production environment.
Dense GPU-enhanced surface reconstruction from stereo endoscopic images for intraoperative registration.

PubMed

Rohl, Sebastian; Bodenstedt, Sebastian; Suwelack, Stefan; Dillmann, Rudiger; Speidel, Stefanie; Kenngott, Hannes; Muller-Stich, Beat P

2012-03-01

In laparoscopic surgery, soft tissue deformations substantially change the surgical site, thus impeding the use of preoperative planning during intraoperative navigation. Extracting depth information from endoscopic images and building a surface model of the surgical field-of-view is one way to represent this constantly deforming environment. The information can then be used for intraoperative registration. Stereo reconstruction is a typical problem within computer vision. However, most of the available methods do not fulfill the specific requirements in a minimally invasive setting such as the need of real-time performance, the problem of view-dependent specular reflections and large curved areas with partly homogeneous or periodic textures and occlusions. In this paper, the authors present an approach toward intraoperative surface reconstruction based on stereo endoscopic images. The authors describe our answer to this problem through correspondence analysis, disparity correction and refinement, 3D reconstruction, point cloud smoothing and meshing. Real-time performance is achieved by implementing the algorithms on the gpu. The authors also present a new hybrid cpu-gpu algorithm that unifies the advantages of the cpu and the gpu version. In a comprehensive evaluation using in vivo data, in silico data from the literature and virtual data from a newly developed simulation environment, the cpu, the gpu, and the hybrid cpu-gpu versions of the surface reconstruction are compared to a cpu and a gpu algorithm from the literature. The recommended approach toward intraoperative surface reconstruction can be conducted in real-time depending on the image resolution (20 fps for the gpu and 14fps for the hybrid cpu-gpu version on resolution of 640 × 480). It is robust to homogeneous regions without texture, large image changes, noise or errors from camera calibration, and it reconstructs the surface down to sub millimeter accuracy. 
In all the experiments within the simulation environment, the mean distance to ground truth data is between 0.05 and 0.6 mm for the hybrid cpu-gpu version. The hybrid cpu-gpu algorithm clearly outperforms its cpu and gpu counterparts (mean distance reduction 26% and 45%, respectively, for the experiments in the simulation environment). The recommended approach for surface reconstruction is fast, robust, and accurate. It can represent changes in the intraoperative environment and can be used to adapt a preoperative model within the surgical site by registration of these two models.
Restricted Collision List method for faster Direct Simulation Monte-Carlo (DSMC) collisions

DOE Office of Scientific and Technical Information (OSTI.GOV)

Macrossan, Michael N., E-mail: m.macrossan@uq.edu.au

The ‘Restricted Collision List’ (RCL) method for speeding up the calculation of DSMC Variable Soft Sphere collisions, with Borgnakke–Larsen (BL) energy exchange, is presented. The method cuts down considerably on the number of random collision parameters which must be calculated (deflection and azimuthal angles, and the BL energy exchange factors). A relatively short list of these parameters is generated and the parameters required in any cell are selected from this list. The list is regenerated at intervals approximately equal to the smallest mean collision time in the flow, and the chance of any particle re-using the same collision parameters in two successive collisions is negligible. The results using this method are indistinguishable from those obtained with standard DSMC. The CPU time saving depends on how much of a DSMC calculation is devoted to collisions and how much is devoted to other tasks, such as moving particles and calculating particle interactions with flow boundaries. For 1-dimensional calculations of flow in a tube, the new method saves 20% of the CPU time per collision for VSS scattering with no energy exchange. With RCL applied to rotational energy exchange, the CPU saving can be greater; for small values of the rotational collision number, for which most collisions involve some rotational energy exchange, the CPU time may be reduced by 50% or more.
An efficient mixed-precision, hybrid CPU-GPU implementation of a nonlinearly implicit one-dimensional particle-in-cell algorithm

DOE Office of Scientific and Technical Information (OSTI.GOV)

Chen, Guangye; Chacon, Luis; Barnes, Daniel C

2012-01-01

Recently, a fully implicit, energy- and charge-conserving particle-in-cell method has been developed for multi-scale, full-f kinetic simulations [G. Chen, et al., J. Comput. Phys. 230, 18 (2011)]. The method employs a Jacobian-free Newton-Krylov (JFNK) solver and is capable of using very large timesteps without loss of numerical stability or accuracy. A fundamental feature of the method is the segregation of particle orbit integrations from the field solver, while remaining fully self-consistent. This provides great flexibility, and dramatically improves the solver efficiency by reducing the degrees of freedom of the associated nonlinear system. However, it requires a particle push per nonlinear residual evaluation, which makes the particle push the most time-consuming operation in the algorithm. This paper describes a very efficient mixed-precision, hybrid CPU-GPU implementation of the implicit PIC algorithm. The JFNK solver is kept on the CPU (in double precision), while the inherent data parallelism of the particle mover is exploited by implementing it in single-precision on a graphics processing unit (GPU) using CUDA. Performance-oriented optimizations, with the aid of an analytical performance model, the roofline model, are employed. Despite being highly dynamic, the adaptive, charge-conserving particle mover algorithm achieves up to 300-400 GOp/s (including single-precision floating-point, integer, and logic operations) on a Nvidia GeForce GTX580, corresponding to 20-25% absolute GPU efficiency (against the peak theoretical performance) and 50-70% intrinsic efficiency (against the algorithm's maximum operational throughput, which neglects all latencies). This is about 200-300 times faster than an equivalent serial CPU implementation. When the single-precision GPU particle mover is combined with a double-precision CPU JFNK field solver, overall performance gains of 100× vs. the double-precision CPU-only serial version are obtained, with no apparent loss of robustness or accuracy when applied to a challenging long-time scale ion acoustic wave simulation.
Evaluating Mobile Graphics Processing Units (GPUs) for Real-Time Resource Constrained Applications

DOE Office of Scientific and Technical Information (OSTI.GOV)

Meredith, J; Conger, J; Liu, Y

2005-11-11

Modern graphics processing units (GPUs) can provide tremendous performance boosts for some applications beyond what a single CPU can accomplish, and their performance is growing at a rate faster than CPUs as well. Mobile GPUs available for laptops have the small form factor and low power requirements suitable for use in embedded processing. We evaluated several desktop and mobile GPUs and CPUs on traditional and non-traditional graphics tasks, as well as on the most time consuming pieces of a full hyperspectral imaging application. Accuracy remained high despite small differences in arithmetic operations like rounding. Performance improvements are summarized here relative to a desktop Pentium 4 CPU.
Benchmarking worker nodes using LHCb productions and comparing with HEPSpec06

NASA Astrophysics Data System (ADS)

Charpentier, P.

2017-10-01

In order to estimate the capabilities of a computing slot with limited processing time, it is necessary to know with a rather good precision its “power”. This allows for example pilot jobs to match a task for which the required CPU-work is known, or to define the number of events to be processed knowing the CPU-work per event. Otherwise one always has the risk that the task is aborted because it exceeds the CPU capabilities of the resource. It also allows a better accounting of the consumed resources. The traditional way the CPU power is estimated in WLCG since 2007 is using the HEP-Spec06 benchmark (HS06) suite that was verified at the time to scale properly with a set of typical HEP applications. However, the hardware architecture of processors has evolved, all WLCG experiments moved to using 64-bit applications and use different compilation flags from those advertised for running HS06. It is therefore interesting to check the scaling of HS06 with the HEP applications. For this purpose, we have been using CPU intensive massive simulation productions from the LHCb experiment and compared their event throughput to the HS06 rating of the worker nodes. We also compared it with a much faster benchmark script that is used by the DIRAC framework used by LHCb for evaluating at run time the performance of the worker nodes. This contribution reports on the finding of these comparisons: the main observation is that the scaling with HS06 is no longer fulfilled, while the fast benchmarks have a better scaling but are less precise. One can also clearly see that some hardware or software features when enabled on the worker nodes may enhance their performance beyond expectation from either benchmark, depending on external factors.

Efficient sparse matrix multiplication scheme for the CYBER 203

NASA Technical Reports Server (NTRS)

Lambiotte, J. J., Jr.

1984-01-01

This work has been directed toward the development of an efficient algorithm for performing this computation on the CYBER-203. The desire to provide software which gives the user the choice between the often conflicting goals of minimizing central processing (CPU) time or storage requirements has led to a diagonal-based algorithm in which one of three types of storage is selected for each diagonal. For each storage type, an initialization sub-routine estimates the CPU and storage requirements based upon results from previously performed numerical experimentation. These requirements are adjusted by weights provided by the user which reflect the relative importance the user places on the resources. The three storage types employed were chosen to be efficient on the CYBER-203 for diagonals which are sparse, moderately sparse, or dense; however, for many densities, no diagonal type is most efficient with respect to both resource requirements. The user-supplied weights dictate the choice.
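The weighted trade-off described in this abstract can be illustrated with a small sketch. It assumes, purely for illustration, that the user weights enter as a weighted sum of the estimated CPU and storage costs; the abstract does not state the exact adjustment rule, and the storage-type names and cost figures below are hypothetical.

```python
def choose_storage(estimates, cpu_weight, storage_weight):
    """Pick the storage type with the lowest weighted cost for one diagonal.

    estimates: dict mapping a storage-type name to (cpu_cost, storage_cost).
    The weighted-sum rule is an assumption, not the report's documented formula.
    """
    def weighted_cost(name):
        cpu_cost, storage_cost = estimates[name]
        return cpu_weight * cpu_cost + storage_weight * storage_cost

    return min(estimates, key=weighted_cost)


# Hypothetical estimates for one diagonal: no type wins on both resources.
diagonal_estimates = {
    "sparse": (8.0, 2.0),
    "moderately_sparse": (5.0, 4.0),
    "dense": (3.0, 9.0),
}

print(choose_storage(diagonal_estimates, cpu_weight=1.0, storage_weight=0.0))
print(choose_storage(diagonal_estimates, cpu_weight=0.0, storage_weight=1.0))
print(choose_storage(diagonal_estimates, cpu_weight=1.0, storage_weight=1.0))
```

With these made-up numbers, a user who only cares about CPU time gets the dense layout, one who only cares about storage gets the sparse layout, and equal weights select the intermediate type.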
A Study on the Effectiveness of Lockup-Free Caches for a Reduced Instruction Set Computer (RISC) Processor

DTIC Science & Technology

1992-09-01

to acquire or develop effective simulation tools to observe the behavior of a RISC implementation as it executes different types of programs. We choose...Performance Computer performance is measured by the amount of the time required to execute a program. Performance encompasses two types of time, elapsed time...and CPU time. Elapsed time is the time required to execute a program from start to finish. It includes latency of input/output activities such as
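The distinction this excerpt draws between elapsed time and CPU time can be observed directly. The following is a minimal sketch using Python's standard `time` module; the helper names (`measure`, `io_like_wait`, `cpu_bound`) and the workload sizes are illustrative choices, and a sleep is used as a stand-in for input/output latency.

```python
import time


def measure(func):
    """Return (elapsed_seconds, cpu_seconds) for a single call of func."""
    wall_start = time.perf_counter()   # wall-clock (elapsed) timer
    cpu_start = time.process_time()    # CPU time of this process (user + system)
    func()
    cpu = time.process_time() - cpu_start
    elapsed = time.perf_counter() - wall_start
    return elapsed, cpu


def io_like_wait():
    # Sleeping stands in for I/O latency: elapsed time grows, CPU time barely moves.
    time.sleep(0.2)


def cpu_bound():
    # A busy loop keeps the processor occupied for most of the elapsed interval.
    total = 0
    for i in range(2_000_000):
        total += i * i
    return total


if __name__ == "__main__":
    for name, workload in (("wait", io_like_wait), ("compute", cpu_bound)):
        elapsed, cpu = measure(workload)
        print(f"{name}: elapsed={elapsed:.3f}s cpu={cpu:.3f}s")
```

For the waiting workload the elapsed time is close to the sleep duration while the CPU time stays near zero; for the busy loop the two values are much closer together.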
Efficient Scalable Median Filtering Using Histogram-Based Operations.

PubMed

Green, Oded

2018-05-01

Median filtering is a smoothing technique for noise removal in images. While there are various implementations of median filtering for a single-core CPU, there are few implementations for accelerators and multi-core systems. Many parallel implementations of median filtering use a sorting algorithm for rearranging the values within a filtering window and taking the median of the sorted value. While using sorting algorithms allows for simple parallel implementations, the cost of the sorting becomes prohibitive as the filtering windows grow. This makes such algorithms, sequential and parallel alike, inefficient. In this work, we introduce the first software parallel median filtering that is non-sorting-based. The new algorithm uses efficient histogram-based operations. These reduce the computational requirements of the new algorithm while also accessing the image fewer times. We show an implementation of our algorithm for both the CPU and NVIDIA's CUDA supported graphics processing unit (GPU). The new algorithm is compared with several other leading CPU and GPU implementations. The CPU implementation has near perfect linear scaling with a speedup on a quad-core system. The GPU implementation is several orders of magnitude faster than the other GPU implementations for mid-size median filters. For small kernels, and , comparison-based approaches are preferable as fewer operations are required. Lastly, the new algorithm is open-source and can be found in the OpenCV library.
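To make the idea of a non-sorting, histogram-based median concrete, here is a minimal sequential sketch of the classic sliding-histogram approach for 8-bit images. It is not the parallel algorithm of the paper or the OpenCV implementation; the function names and the edge-replication border handling are assumptions made for this example.

```python
def _clamp(value, upper):
    """Clamp an index into the range [0, upper]."""
    return max(0, min(value, upper))


def median_from_histogram(hist, count):
    """Return the median of `count` samples summarised in a 256-bin histogram."""
    target = (count + 1) // 2          # 1-based rank of the median (count is odd)
    running = 0
    for value, occurrences in enumerate(hist):
        running += occurrences
        if running >= target:
            return value
    raise ValueError("histogram holds fewer samples than count")


def median_filter_hist(image, radius):
    """Median-filter an 8-bit image (list of rows) with a (2r+1) x (2r+1) window."""
    height, width = len(image), len(image[0])
    size = 2 * radius + 1
    count = size * size
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        # Rows covered by the window, with edge replication at the borders.
        rows = [image[_clamp(y + dy, height - 1)] for dy in range(-radius, radius + 1)]
        # Build the histogram for the first window of this row from scratch.
        hist = [0] * 256
        for row in rows:
            for dx in range(-radius, radius + 1):
                hist[row[_clamp(dx, width - 1)]] += 1
        out[y][0] = median_from_histogram(hist, count)
        # Slide right: remove the column leaving the window, add the one entering.
        for x in range(1, width):
            old_x = _clamp(x - radius - 1, width - 1)
            new_x = _clamp(x + radius, width - 1)
            for row in rows:
                hist[row[old_x]] -= 1
                hist[row[new_x]] += 1
            out[y][x] = median_from_histogram(hist, count)
    return out


if __name__ == "__main__":
    noisy = [[10, 10, 10], [10, 255, 10], [10, 10, 10]]
    print(median_filter_hist(noisy, 1))
```

Each step to the right touches only 2(2r+1) pixels instead of re-sorting the whole window, which is why the per-pixel cost grows linearly rather than super-linearly with the window width.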
WARP3D-Release 10.8: Dynamic Nonlinear Analysis of Solids using a Preconditioned Conjugate Gradient Software Architecture

NASA Technical Reports Server (NTRS)

Koppenhoefer, Kyle C.; Gullerud, Arne S.; Ruggieri, Claudio; Dodds, Robert H., Jr.; Healy, Brian E.

1998-01-01

This report describes theoretical background material and commands necessary to use the WARP3D finite element code. WARP3D is under continuing development as a research code for the solution of very large-scale, 3-D solid models subjected to static and dynamic loads. Specific features in the code oriented toward the investigation of ductile fracture in metals include a robust finite strain formulation, a general J-integral computation facility (with inertia, face loading), an element extinction facility to model crack growth, nonlinear material models including viscoplastic effects, and the Gurson-Tvergaard dilatant plasticity model for void growth. The nonlinear, dynamic equilibrium equations are solved using an incremental-iterative, implicit formulation with full Newton iterations to eliminate residual nodal forces. The history integration of the nonlinear equations of motion is accomplished with Newmark's Beta method. A central feature of WARP3D involves the use of a linear-preconditioned conjugate gradient (LPCG) solver implemented in an element-by-element format to replace a conventional direct linear equation solver. This software architecture dramatically reduces both the memory requirements and CPU time for very large, nonlinear solid models since formation of the assembled (dynamic) stiffness matrix is avoided. Analyses thus exhibit the numerical stability for large time (load) steps provided by the implicit formulation coupled with the low memory requirements characteristic of an explicit code. In addition to the much lower memory requirements of the LPCG solver, the CPU time required for solution of the linear equations during each Newton iteration is generally one-half or less of the CPU time required for a traditional direct solver. All other computational aspects of the code (element stiffnesses, element strains, stress updating, element internal forces) are implemented in the element-by-element, blocked architecture.
This greatly improves vectorization of the code on uni-processor hardware and enables straightforward parallel-vector processing of element blocks on multi-processor hardware.
General approach to boat simulation in virtual reality systems

NASA Astrophysics Data System (ADS)

Aranov, Vladislav Y.; Belyaev, Sergey Y.

2002-02-01

The paper is dedicated to real-time simulation of sport boats, particularly a kayak and a high-speed skimming boat, for training purposes. Such training is a topical issue, since kayaking and riding a high-speed skimming boat are both extreme sports. Participating in such competitions puts athletes in danger, particularly due to rapids, waterfalls, different water streams, and other obstacles. In order to make the simulation realistic, it is necessary to calculate data for at least 30 frames per second. These calculations may take no more than 5% of CPU time, because the very time-consuming 3D rendering process takes the remaining 95%. This paper describes an approach for creating minimal boat simulator models that satisfy these requirements. In addition, this approach can be used for other watercraft models of this kind.
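The two figures quoted in this abstract (30 frames per second, 5% of CPU time) imply a per-frame time budget for the boat model. A small worked calculation, assuming the 5% share applies uniformly to every frame on a single processor (an assumption of this sketch, not a statement from the paper):

```python
def simulation_budget_ms(frames_per_second, cpu_share):
    """Milliseconds per frame available to the simulation step."""
    frame_time_ms = 1000.0 / frames_per_second   # total time per frame
    return frame_time_ms * cpu_share             # share left for the simulation


budget = simulation_budget_ms(frames_per_second=30, cpu_share=0.05)
print(f"{budget:.2f} ms per frame")  # prints "1.67 ms per frame"
```

Under these assumptions each frame lasts about 33.3 ms, of which roughly 1.67 ms is available to the physics model and the rest goes to rendering.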
GPU: the biggest key processor for AI and parallel processing

NASA Astrophysics Data System (ADS)

Baji, Toru

2017-07-01

Two types of processors exist in the market. One is the conventional CPU and the other is the Graphics Processing Unit (GPU). A typical CPU is composed of 1 to 8 cores, while a GPU has thousands of cores. The CPU is good for sequential processing, while the GPU is good at accelerating software with heavy parallel execution. The GPU was initially dedicated to 3D graphics. However, from 2006, when GPUs started to adopt general-purpose cores, it was noticed that this architecture can be used as a general-purpose massively parallel processor. NVIDIA developed a software framework, Compute Unified Device Architecture (CUDA), that makes it possible to easily program the GPU for these applications. With CUDA, GPUs started to be used widely in workstations and supercomputers. Recently, two key technologies have been highlighted in the industry: Artificial Intelligence (AI) and autonomous driving cars. AI requires massively parallel operations to train many-layered neural networks. With CPUs alone, it was impossible to finish the training in a practical time. The latest multi-GPU system with P100 makes it possible to finish the training in a few hours. For autonomous driving cars, TOPS-class performance is required to implement perception, localization, and path planning processing, and again an SoC with an integrated GPU will play a key role there. In this paper, the evolution of the GPU, which is one of the biggest commercial devices requiring state-of-the-art fabrication technology, is introduced. An overview of key GPU-demanding applications like the ones described above is also given.
A comparison of native GPU computing versus OpenACC for implementing flow-routing algorithms in hydrological applications

NASA Astrophysics Data System (ADS)

Rueda, Antonio J.; Noguera, José M.; Luque, Adrián

2016-02-01

In recent years GPU computing has gained wide acceptance as a simple low-cost solution for speeding up computationally expensive processing in many scientific and engineering applications. However, in most cases accelerating a traditional CPU implementation for a GPU is a non-trivial task that requires a thorough refactoring of the code and specific optimizations that depend on the architecture of the device. OpenACC is a promising technology that aims at reducing the effort required to accelerate C/C++/Fortran code on an attached multicore device. With this technology, the CPU code only has to be augmented with a few compiler directives to identify the areas to be accelerated and the way in which data has to be moved between the CPU and GPU. Its potential benefits are multiple: better code readability, less development time, lower risk of errors and less dependency on the underlying architecture and future evolution of the GPU technology. Our aim with this work is to evaluate the pros and cons of using OpenACC against native GPU implementations in computationally expensive hydrological applications, using the classic D8 algorithm of O'Callaghan and Mark for river network extraction as a case study. We implemented the flow accumulation step of this algorithm on the CPU, using OpenACC and two different CUDA versions, comparing the length and complexity of the code and its performance with different datasets. In summary, although OpenACC cannot match the performance of an optimized CUDA implementation (×3.5 slower on average), it provides a significant performance improvement over a CPU implementation (×2-6) with far simpler code and less implementation effort.
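For readers unfamiliar with the flow accumulation step mentioned in this abstract, the following is a minimal sequential sketch of the general D8 idea: each cell drains to its steepest downslope neighbour among the eight surrounding cells, and accumulation counts the cells draining through each location. It is not the authors' CPU, OpenACC, or CUDA code; the function names, the unit cell size, and the treatment of pits and flats (they simply do not drain) are assumptions of this example.

```python
import math

# Offsets of the eight neighbours considered by D8.
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def d8_receiver(dem, r, c):
    """Return (row, col) of the steepest downslope neighbour, or None for a pit or flat."""
    rows, cols = len(dem), len(dem[0])
    best, best_slope = None, 0.0
    for dr, dc in NEIGHBOURS:
        nr, nc = r + dr, c + dc
        if 0 <= nr < rows and 0 <= nc < cols:
            # Slope = elevation drop divided by distance (1 or sqrt(2) cell units).
            slope = (dem[r][c] - dem[nr][nc]) / math.hypot(dr, dc)
            if slope > best_slope:
                best, best_slope = (nr, nc), slope
    return best


def flow_accumulation(dem):
    """Count, for every cell, the cells draining through it (itself included)."""
    rows, cols = len(dem), len(dem[0])
    acc = [[1] * cols for _ in range(rows)]
    # Visit cells from highest to lowest so every donor is final before its receiver.
    cells = sorted(
        ((dem[r][c], r, c) for r in range(rows) for c in range(cols)), reverse=True
    )
    for _, r, c in cells:
        target = d8_receiver(dem, r, c)
        if target is not None:
            acc[target[0]][target[1]] += acc[r][c]
    return acc


if __name__ == "__main__":
    tilted = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
    for row in flow_accumulation(tilted):
        print(row)
```

The elevation-ordered sweep is inherently sequential, which hints at why a GPU version needs restructuring rather than a direct port.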
Jobs masonry in LHCb with elastic Grid Jobs

NASA Astrophysics Data System (ADS)

Stagni, F.; Charpentier, Ph

2015-12-01

In any distributed computing infrastructure, a job is normally forbidden to run for an indefinite amount of time. This limitation is implemented using different technologies, the most common one being the CPU time limit implemented by batch queues. It is therefore important to have a good estimate of how much CPU work a job will require: otherwise, it might be killed by the batch system, or by whatever system is controlling the jobs’ execution. In many modern interwares, the jobs are actually executed by pilot jobs, that can use the whole available time in running multiple consecutive jobs. If at some point the available time in a pilot is too short for the execution of any job, it should be released, while it could have been used efficiently by a shorter job. Within LHCbDIRAC, the LHCb extension of the DIRAC interware, we developed a simple way to fully exploit computing capabilities available to a pilot, even for resources with limited time capabilities, by adding elasticity to production MonteCarlo (MC) simulation jobs. With our approach, independently of the time available, LHCbDIRAC will always have the possibility to execute a MC job, whose length will be adapted to the available amount of time: therefore the same job, running on different computing resources with different time limits, will produce different amounts of events. The decision on the number of events to be produced is made just in time at the start of the job, when the capabilities of the resource are known. In order to know how many events a MC job will be instructed to produce, LHCbDIRAC simply requires three values: the CPU-work per event for that type of job, the power of the machine it is running on, and the time left for the job before being killed. Knowing these values, we can estimate the number of events the job will be able to simulate with the available CPU time. 
This paper will demonstrate that, using this simple but effective solution, LHCb manages to make a more efficient use of the available resources, and that it can easily use new types of resources. An example is represented by resources provided by batch queues, where low-priority MC jobs can be used as "masonry" jobs in multi-jobs pilots. A second example is represented by opportunistic resources with limited available time.
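The estimate described in this abstract (CPU work per event, machine power, time left) can be illustrated with a minimal sketch. The function name, the units, and the `safety_margin` parameter are assumptions made for illustration; they are not taken from LHCbDIRAC.

```python
import math


def events_to_produce(cpu_work_per_event, machine_power, time_left_s,
                      safety_margin=1.0):
    """Estimate how many MC events fit into the remaining CPU time.

    cpu_work_per_event: normalized CPU work per event (assumed unit: power * seconds)
    machine_power:      CPU power of the worker node in the same normalized unit
    time_left_s:        seconds left before the job would be killed
    safety_margin:      hypothetical fraction of the time left that is actually used
    """
    if cpu_work_per_event <= 0 or machine_power <= 0:
        raise ValueError("work per event and machine power must be positive")
    usable_work = machine_power * max(time_left_s, 0.0) * safety_margin
    return math.floor(usable_work / cpu_work_per_event)


# The same job type on two resources with different time limits.
short_queue = events_to_produce(100.0, 10.0, 3600.0)    # 1 hour left
long_queue = events_to_produce(100.0, 10.0, 36000.0)    # 10 hours left
```

With these made-up numbers the same job would be told to produce ten times as many events on the longer queue, which is the elasticity the abstract describes.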
Symptoms of problematic cellular phone use, functional impairment and its association with depression among adolescents in Southern Taiwan.

PubMed

Yen, Cheng-Fang; Tang, Tze-Chun; Yen, Ju-Yu; Lin, Huang-Chi; Huang, Chi-Fen; Liu, Shu-Chun; Ko, Chih-Hung

2009-08-01

The aims of this study were: (1) to examine the prevalence of symptoms of problematic cellular phone use (CPU); (2) to examine the associations between the symptoms of problematic CPU, functional impairment caused by CPU and the characteristics of CPU; (3) to establish the optimal cut-off point of the number of symptoms for functional impairment caused by CPU; and (4) to examine the association between problematic CPU and depression in adolescents. A total of 10,191 adolescent students in Southern Taiwan were recruited into this study. Participants' self-reported symptoms of problematic CPU and functional impairments caused by CPU were collected. The associations of symptoms of problematic CPU with functional impairments and with the characteristics of CPU were examined. The cut-off point of the number of symptoms for functional impairment was also determined. The association between problematic CPU and depression was examined by logistic regression analysis. The results indicated that the symptoms of problematic CPU were prevalent in adolescents. The adolescents who had any one of the symptoms of problematic CPU were more likely to report at least one dimension of functional impairment caused by CPU, called more on cellular phones, sent more text messages, or spent more time and higher fees on CPU. Having four or more symptoms of problematic CPU had the highest potential to differentiate between the adolescents with and without functional impairment caused by CPU. Adolescents who had significant depression were more likely to have four or more symptoms of problematic CPU. The results of this study may provide a basis for detecting symptoms of problematic CPU in adolescents.
Performance and scalability of Fourier domain optical coherence tomography acceleration using graphics processing units.

PubMed

Li, Jian; Bloch, Pavel; Xu, Jing; Sarunic, Marinko V; Shannon, Lesley

2011-05-01

Fourier domain optical coherence tomography (FD-OCT) provides faster line rates, better resolution, and higher sensitivity for noninvasive, in vivo biomedical imaging compared to traditional time domain OCT (TD-OCT). However, because the signal processing for FD-OCT is computationally intensive, real-time FD-OCT applications demand powerful computing platforms to deliver acceptable performance. Graphics processing units (GPUs) have been used as coprocessors to accelerate FD-OCT by leveraging their relatively simple programming model to exploit thread-level parallelism. Unfortunately, GPUs do not "share" memory with their host processors, requiring additional data transfers between the GPU and CPU. In this paper, we implement a complete FD-OCT accelerator on a consumer grade GPU/CPU platform. Our data acquisition system uses spectrometer-based detection and a dual-arm interferometer topology with numerical dispersion compensation for retinal imaging. We demonstrate that the maximum line rate is dictated by the memory transfer time and not the processing time due to the GPU platform's memory model. Finally, we discuss how the performance trends of GPU-based accelerators compare to the expected future requirements of FD-OCT data rates.
An efficient sparse matrix multiplication scheme for the CYBER 205 computer

NASA Technical Reports Server (NTRS)

Lambiotte, Jules J., Jr.

1988-01-01

This paper describes the development of an efficient algorithm for computing the product of a matrix and vector on a CYBER 205 vector computer. The desire to provide software which allows the user to choose between the often conflicting goals of minimizing central processing unit (CPU) time or storage requirements has led to a diagonal-based algorithm in which one of four types of storage is selected for each diagonal. The candidate storage types employed were chosen to be efficient on the CYBER 205 for diagonals which have nonzero structure which is dense, moderately sparse, very sparse and short, or very sparse and long; however, for many densities, no diagonal type is most efficient with respect to both resource requirements, and a trade-off must be made. For each diagonal, an initialization subroutine estimates the CPU time and storage required for each storage type based on results from previously performed numerical experimentation. These requirements are adjusted by weights provided by the user which reflect the relative importance the user places on the two resources. The adjusted resource requirements are then compared to select the most efficient storage and computational scheme.
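A rough sketch of the two ideas in this abstract follows: a matrix-vector product over a matrix stored by diagonals, and a user-weighted choice between storage types. It is plain Python, it covers only the simplest case of fully stored diagonals, and the estimate values and storage-type names are hypothetical; it is not the CYBER 205 algorithm itself.

```python
def diag_matvec(n, diagonals, x):
    """Compute y = A x for an n x n matrix stored by diagonals.

    diagonals maps an offset k to the list of entries A[i][i + k]
    (k > 0: superdiagonal, k < 0: subdiagonal), each of length n - |k|.
    """
    y = [0.0] * n
    for k, values in diagonals.items():
        first_row = max(0, -k)          # first row that this diagonal touches
        for t, a in enumerate(values):
            i = first_row + t
            y[i] += a * x[i + k]
    return y


def choose_storage(estimates, w_time, w_storage):
    """Pick the storage type with the lowest weighted cost.

    estimates maps a storage-type name to (estimated CPU time, estimated storage).
    The weights express how much the user cares about each resource.
    """
    return min(estimates,
               key=lambda s: w_time * estimates[s][0] + w_storage * estimates[s][1])


# 3 x 3 tridiagonal example: 2 on the main diagonal, -1 next to it.
tridiag = {0: [2.0, 2.0, 2.0], 1: [-1.0, -1.0], -1: [-1.0, -1.0]}
y = diag_matvec(3, tridiag, [1.0, 2.0, 3.0])

# Hypothetical estimates for one diagonal.
estimates = {"dense": (1.0, 10.0), "sparse": (3.0, 2.0)}
```

Weighting only CPU time selects the dense form here, while weighting only storage selects the sparse form, which mirrors the trade-off the abstract describes.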
Accelerated event-by-event Monte Carlo microdosimetric calculations of electrons and protons tracks on a multi-core CPU and a CUDA-enabled GPU.

PubMed

Kalantzis, Georgios; Tachibana, Hidenobu

2014-01-01

For microdosimetric calculations event-by-event Monte Carlo (MC) methods are considered the most accurate. The main shortcoming of those methods is the extensive requirement for computational time. In this work we present an event-by-event MC code of low projectile energy electron and proton tracks for accelerated microdosimetric MC simulations on a graphic processing unit (GPU). Additionally, a hybrid implementation scheme was realized by employing OpenMP and CUDA in such a way that both GPU and multi-core CPU were utilized simultaneously. The two implementation schemes have been tested and compared with the sequential single threaded MC code on the CPU. Performance comparison was established on the speed-up for a set of benchmarking cases of electron and proton tracks. A maximum speedup of 67.2 was achieved for the GPU-based MC code, while a further improvement of the speedup up to 20% was achieved for the hybrid approach. The results indicate the capability of our CPU-GPU implementation for accelerated MC microdosimetric calculations of both electron and proton tracks without loss of accuracy. Copyright © 2013 Elsevier Ireland Ltd. All rights reserved.
An incomplete assembly with thresholding algorithm for systems of reaction-diffusion equations in three space dimensions IAT for reaction-diffusion systems

NASA Astrophysics Data System (ADS)

Moore, Peter K.

2003-07-01

Solving systems of reaction-diffusion equations in three space dimensions can be prohibitively expensive both in terms of storage and CPU time. Herein, I present a new incomplete assembly procedure that is designed to reduce storage requirements. Incomplete assembly is analogous to incomplete factorization in that only a fixed number of nonzero entries are stored per row and a drop tolerance is used to discard small values. The algorithm is incorporated in a finite element method-of-lines code and tested on a set of reaction-diffusion systems. The effect of incomplete assembly on CPU time and storage and on the performance of the temporal integrator DASPK, algebraic solver GMRES and preconditioner ILUT is studied.
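The row-wise thresholding idea (a fixed number of stored nonzeros per row plus a drop tolerance) might be sketched as below. The use of an absolute tolerance and the tie-breaking behaviour are assumptions; the abstract does not specify them, and this is not the author's code.

```python
def threshold_row(row, max_entries, drop_tol):
    """Keep at most max_entries of the largest entries in one matrix row.

    row:         dict mapping column index -> value
    max_entries: fixed number of nonzeros stored per row
    drop_tol:    entries with |value| < drop_tol are discarded first
                 (absolute tolerance, an assumption of this sketch)
    """
    survivors = {j: v for j, v in row.items() if abs(v) >= drop_tol}
    largest = sorted(survivors.items(), key=lambda jv: abs(jv[1]), reverse=True)
    return dict(sorted(largest[:max_entries]))


row = {0: 4.0, 1: -0.001, 2: 0.5, 5: -2.0}
stored = threshold_row(row, max_entries=2, drop_tol=0.01)
```

Here the entry in column 1 falls below the tolerance, and of the three survivors only the two largest in magnitude are kept.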
Hybrid Computational Architecture for Multi-Scale Modeling of Materials and Devices

DTIC Science & Technology

2016-01-03

...GHz, 20 cores (40 with hyper-threading (HT)). Single node performance (columns: Node, # of cores, Total CPU time, User CPU time, System CPU time, Elapsed time)...
INTEL20, 40 (with HT): 534.785, 529.984, 4.800, 541.179
20: 468.873, 466.119, 2.754, 476.878
10: 671.798, 669.653, 2.145, 680.510
8: 772.269, 770.256, 2.013
Fidelity Optimization of Microprocessor System Simulations.

DTIC Science & Technology

1981-03-01

effort feasible in terms of required CPU time would be to employ a separate clock with an artificially compressed time base in the serial...
Newmark local time stepping on high-performance computing architectures

DOE Office of Scientific and Technical Information (OSTI.GOV)

Rietmann, Max, E-mail: max.rietmann@erdw.ethz.ch; Institute of Geophysics, ETH Zurich; Grote, Marcus, E-mail: marcus.grote@unibas.ch

In multi-scale complex media, finite element meshes often require areas of local refinement, creating small elements that can dramatically reduce the global time-step for wave-propagation problems due to the CFL condition. Local time stepping (LTS) algorithms allow an explicit time-stepping scheme to adapt the time-step to the element size, allowing near-optimal time-steps everywhere in the mesh. We develop an efficient multilevel LTS-Newmark scheme and implement it in a widely used continuous finite element seismic wave-propagation package. In particular, we extend the standard LTS formulation with adaptations to continuous finite element methods that can be implemented very efficiently with very strong element-size contrasts (more than 100x). The implementation runs on large CPU and GPU clusters, and we present both synthetic validation examples and large-scale, realistic application examples to demonstrate the performance and applicability of the method and implementation on thousands of CPU cores and hundreds of GPUs.
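To illustrate why local refinement hurts a global time step, the sketch below assigns each element to an LTS level from a CFL-type bound dt <= C * h / c. The factor-of-two ratio between levels, the Courant number, and the cost model are assumptions of this sketch, not details taken from the abstract.

```python
import math


def lts_levels(element_sizes, wave_speeds, courant=0.5):
    """Assign each element a level k so that it steps with dt_coarse / 2**k.

    The stable step of an element is taken as courant * h / c. Rounding the
    level up only ever makes an element's step smaller, so the bound holds.
    """
    dt_stable = [courant * h / c for h, c in zip(element_sizes, wave_speeds)]
    dt_coarse = max(dt_stable)
    levels = [max(0, math.ceil(math.log2(dt_coarse / dt))) for dt in dt_stable]
    return dt_coarse, levels


def ideal_lts_speedup(levels):
    """Element updates per coarse step: global stepping versus LTS (overhead ignored)."""
    global_cost = len(levels) * 2 ** max(levels)
    lts_cost = sum(2 ** k for k in levels)
    return global_cost / lts_cost


# Three elements, one of them 100x smaller than the largest.
dt_coarse, levels = lts_levels([1.0, 0.5, 0.01], [1.0, 1.0, 1.0])
speedup = ideal_lts_speedup(levels)
```

In this toy mesh a single tiny element forces 128 substeps; with a global step every element pays for them, while with LTS only the tiny element does.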
RTOS kernel in portable electrocardiograph

NASA Astrophysics Data System (ADS)

Centeno, C. A.; Voos, J. A.; Riva, G. G.; Zerbini, C.; Gonzalez, E. A.

2011-12-01

This paper presents the use of a Real Time Operating System (RTOS) on a portable electrocardiograph based on a microcontroller platform. All medical device digital functions are performed by the microcontroller. The electrocardiograph CPU is based on the 18F4550 microcontroller, in which a uCOS-II RTOS can be embedded. The decision to use this kernel is based on its benefits, its license for educational use, and its intrinsic time control and peripherals management. The feasibility of its use on the electrocardiograph is evaluated based on the minimum memory requirements due to the kernel structure. The kernel's own tools were used for time estimation and evaluation of resources used by each process. After this feasibility analysis, the cyclic code was migrated to a structure based on separate processes or tasks able to synchronize events, resulting in an electrocardiograph running on one Central Processing Unit (CPU) based on an RTOS.
Study on efficiency of time computation in x-ray imaging simulation base on Monte Carlo algorithm using graphics processing unit

DOE Office of Scientific and Technical Information (OSTI.GOV)

Setiani, Tia Dwi, E-mail: tiadwisetiani@gmail.com; Suprijadi; Nuclear Physics and Biophysics Research Division, Faculty of Mathematics and Natural Sciences, Institut Teknologi Bandung Jalan Ganesha 10 Bandung, 40132

Monte Carlo (MC) is one of the powerful techniques for simulation in x-ray imaging. The MC method can simulate radiation transport within matter with high accuracy and provides a natural way to simulate radiation transport in complex systems. One of the MC-based codes widely used for simulating radiographic images is MC-GPU, a code developed by Andreu Badal. This study aimed to investigate the computation time of x-ray imaging simulation on a GPU (Graphics Processing Unit) compared to a standard CPU (Central Processing Unit). Furthermore, the effect of physical parameters on the quality of radiographic images and the comparison of image quality resulting from simulation on the GPU and CPU are evaluated in this paper. The simulations were run serially on a CPU and on two GPUs with 384 cores and 2304 cores. In the GPU simulations, each core calculates one photon, so a large number of photons are calculated simultaneously. Results show that the simulations on GPU were significantly accelerated compared to CPU. The simulations on the 2304-core GPU were about 64-114 times faster than on the CPU, while the simulations on the 384-core GPU were about 20-31 times faster than on a single CPU core. Another result shows that optimum image quality was obtained with 10^8 or more histories and energies from 60 keV to 90 keV. Analyzed statistically, the quality of the GPU and CPU images is relatively the same.
Continuous piecewise-linear, reduced-order electrochemical model for lithium-ion batteries in real-time applications

NASA Astrophysics Data System (ADS)

Farag, Mohammed; Fleckenstein, Matthias; Habibi, Saeid

2017-02-01

Model-order reduction and minimization of the CPU run-time while maintaining the model accuracy are critical requirements for real-time implementation of lithium-ion electrochemical battery models. In this paper, an isothermal, continuous, piecewise-linear, electrode-average model is developed by using an optimal knot placement technique. The proposed model reduces the univariate nonlinear function of the electrode's open circuit potential dependence on the state of charge to continuous piecewise regions. The parameterization experiments were chosen to provide a trade-off between extensive experimental characterization techniques and purely identifying all parameters using optimization techniques. The model is then parameterized in each continuous, piecewise-linear, region. Applying the proposed technique cuts down the CPU run-time by around 20%, compared to the reduced-order, electrode-average model. Finally, the model validation against real-time driving profiles (FTP-72, WLTP) demonstrates the ability of the model to predict the cell voltage accurately with less than 2% error.
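The core run-time idea, replacing a nonlinear open circuit potential curve by a continuous piecewise-linear function over a few knots, can be sketched as follows. The knot positions and values are made up for illustration, and the optimal knot placement step from the paper is not implemented here.

```python
from bisect import bisect_right


def piecewise_linear(knots_x, knots_y, x):
    """Evaluate a continuous piecewise-linear function at x.

    knots_x must be strictly increasing; values outside the knot range
    are clamped to the end values (an assumption of this sketch).
    """
    if x <= knots_x[0]:
        return knots_y[0]
    if x >= knots_x[-1]:
        return knots_y[-1]
    i = bisect_right(knots_x, x) - 1          # segment that contains x
    x0, x1 = knots_x[i], knots_x[i + 1]
    y0, y1 = knots_y[i], knots_y[i + 1]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Hypothetical knots: state of charge -> open circuit potential in volts.
soc_knots = [0.0, 0.5, 1.0]
ocp_knots = [3.0, 3.5, 4.5]
ocp_mid = piecewise_linear(soc_knots, ocp_knots, 0.25)
```

Each evaluation costs one binary search and one linear interpolation, which is the kind of saving over evaluating a nonlinear fit that motivates the approach.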
Multi-Threaded Algorithms for GPGPU in the ATLAS High Level Trigger

NASA Astrophysics Data System (ADS)

Conde Muíño, P.; ATLAS Collaboration

2017-10-01

General purpose Graphics Processor Units (GPGPU) are being evaluated for possible future inclusion in an upgraded ATLAS High Level Trigger farm. We have developed a demonstrator including GPGPU implementations of Inner Detector and Muon tracking and Calorimeter clustering within the ATLAS software framework. ATLAS is a general purpose particle physics experiment located on the LHC collider at CERN. The ATLAS Trigger system consists of two levels, with Level-1 implemented in hardware and the High Level Trigger implemented in software running on a farm of commodity CPU. The High Level Trigger reduces the trigger rate from the 100 kHz Level-1 acceptance rate to 1.5 kHz for recording, requiring an average per-event processing time of ∼ 250 ms for this task. The selection in the high level trigger is based on reconstructing tracks in the Inner Detector and Muon Spectrometer and clusters of energy deposited in the Calorimeter. Performing this reconstruction within the available farm resources presents a significant challenge that will increase significantly with future LHC upgrades. During the LHC data taking period starting in 2021, luminosity will reach up to three times the original design value. Luminosity will increase further to 7.5 times the design value in 2026 following LHC and ATLAS upgrades. Corresponding improvements in the speed of the reconstruction code will be needed to provide the required trigger selection power within affordable computing resources. Key factors determining the potential benefit of including GPGPU as part of the HLT processor farm are: the relative speed of the CPU and GPGPU algorithm implementations; the relative execution times of the GPGPU algorithms and serial code remaining on the CPU; the number of GPGPU required, and the relative financial cost of the selected GPGPU. We give a brief overview of the algorithms implemented and present new measurements that compare the performance of various configurations exploiting GPGPU cards.
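The rates quoted in this abstract imply a rough size for the processing load, which the short calculation below works out. It assumes, as a simplification not stated in the abstract, that each event occupies one processing slot for the whole average processing time.

```python
def events_in_flight(input_rate_hz, mean_processing_time_s):
    """Little's law: average number of events being processed at any moment."""
    return input_rate_hz * mean_processing_time_s


level1_rate_hz = 100e3      # Level-1 acceptance rate from the abstract
output_rate_hz = 1.5e3      # recording rate from the abstract
mean_time_s = 0.250         # average per-event processing time from the abstract

in_flight = events_in_flight(level1_rate_hz, mean_time_s)
rejection_factor = level1_rate_hz / output_rate_hz
```

Under that assumption about 25,000 events are in flight at once, and the High Level Trigger keeps roughly one event in 67, which gives a feel for why faster reconstruction code matters as luminosity rises.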

The Creation of a CPU Timer for High Fidelity Programs

NASA Technical Reports Server (NTRS)

Dick, Aidan A.

2011-01-01

Using the C and C++ programming languages, a tool was developed that measures the efficiency of a program by recording the amount of CPU time that various functions consume. By inserting the tool between lines of code in the program, one can receive a detailed report of the absolute and relative time consumption associated with each section. After adapting the generic tool for a high-fidelity launch vehicle simulation program called MAVERIC, the components of a frequently used function called "derivatives()" were measured. Out of the 34 sub-functions in "derivatives()", it was found that the top 8 sub-functions made up 83.1% of the total time spent. In order to decrease the overall run time of MAVERIC, a change was implemented in the sub-function "Event_Controller()". Reformatting "Event_Controller()" led to a 36.9% decrease in the total CPU time spent by that sub-function, and a 3.2% decrease in the total CPU time spent by the overarching function "derivatives()".
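The kind of tool described here, one that is inserted around sections of code and reports absolute and relative CPU time, might look like the sketch below. It is written in Python rather than the C and C++ used in the report, and the class and section names are hypothetical.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class CpuTimer:
    """Accumulate CPU time (not wall-clock time) per named code section."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def section(self, name):
        start = time.process_time()
        try:
            yield
        finally:
            self.totals[name] += time.process_time() - start

    def report(self):
        """Return {name: (cpu_seconds, percent_of_total)}."""
        total = sum(self.totals.values())
        return {name: (t, 100.0 * t / total if total > 0 else 0.0)
                for name, t in self.totals.items()}


timer = CpuTimer()
with timer.section("busy_loop"):
    sum(i * i for i in range(200_000))
with timer.section("sleep"):
    time.sleep(0.01)        # sleeping consumes almost no CPU time
report = timer.report()
```

Because `time.process_time()` counts only CPU time of the current process, the sleeping section should contribute very little, which is the distinction between CPU time and elapsed time that such a tool relies on.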
Dosimetric comparison of helical tomotherapy treatment plans for total marrow irradiation created using GPU and CPU dose calculation engines.

PubMed

Nalichowski, Adrian; Burmeister, Jay

2013-07-01

To compare optimization characteristics, plan quality, and treatment delivery efficiency between total marrow irradiation (TMI) plans using the new TomoTherapy graphic processing unit (GPU) based dose engine and CPU/cluster based dose engine. Five TMI plans created on an anthropomorphic phantom were optimized and calculated with both dose engines. The planning treatment volume (PTV) included all the bones from head to mid femur except for upper extremities. Evaluated organs at risk (OAR) consisted of lung, liver, heart, kidneys, and brain. The following treatment parameters were used to generate the TMI plans: field widths of 2.5 and 5 cm, modulation factors of 2 and 2.5, and pitch of either 0.287 or 0.43. The optimization parameters were chosen based on the PTV and OAR priorities and the plans were optimized with a fixed number of iterations. The PTV constraint was selected to ensure that at least 95% of the PTV received the prescription dose. The plans were evaluated based on D80 and D50 (dose to 80% and 50% of the OAR volume, respectively) and hotspot volumes within the PTVs. Gamma indices (Γ) were also used to compare planar dose distributions between the two modalities. The optimization and dose calculation times were compared between the two systems. The treatment delivery times were also evaluated. The results showed very good dosimetric agreement between the GPU and CPU calculated plans for any of the evaluated planning parameters indicating that both systems converge on nearly identical plans. All D80 and D50 parameters varied by less than 3% of the prescription dose with an average difference of 0.8%. A gamma analysis Γ(3%, 3 mm) < 1 of the GPU plan resulted in over 90% of calculated voxels satisfying Γ < 1 criterion as compared to baseline CPU plan. The average number of voxels meeting the Γ < 1 criterion for all the plans was 97%. 
In terms of dose optimization/calculation efficiency, there was a 20-fold reduction in planning time with the new GPU system. The average optimization/dose calculation time utilizing the traditional CPU/cluster based system was 579 vs 26.8 min for the GPU based system. There was no difference in the calculated treatment delivery time per fraction. Beam-on time varied based on field width and pitch and ranged between 15 and 28 min. The TomoTherapy GPU based dose engine is capable of calculating TMI treatment plans with plan quality nearly identical to plans calculated using the traditional CPU/cluster based system, while significantly reducing the time required for optimization and dose calculation.
A Spiking Neural Simulator Integrating Event-Driven and Time-Driven Computation Schemes Using Parallel CPU-GPU Co-Processing: A Case Study.

PubMed

Naveros, Francisco; Luque, Niceto R; Garrido, Jesús A; Carrillo, Richard R; Anguita, Mancia; Ros, Eduardo

2015-07-01

Time-driven simulation methods in traditional CPU architectures perform well and precisely when simulating small-scale spiking neural networks. Nevertheless, they still have drawbacks when simulating large-scale systems. Conversely, event-driven simulation methods in CPUs and time-driven simulation methods in graphic processing units (GPUs) can outperform CPU time-driven methods under certain conditions. With this performance improvement in mind, we have developed an event-and-time-driven spiking neural network simulator suitable for a hybrid CPU-GPU platform. Our neural simulator is able to efficiently simulate bio-inspired spiking neural networks consisting of different neural models, which can be distributed heterogeneously in both small layers and large layers or subsystems. For the sake of efficiency, the low-activity parts of the neural network can be simulated in CPU using event-driven methods while the high-activity subsystems can be simulated in either CPU (a few neurons) or GPU (thousands or millions of neurons) using time-driven methods. In this brief, we have undertaken a comparative study of these different simulation methods. For benchmarking the different simulation methods and platforms, we have used a cerebellar-inspired neural-network model consisting of a very dense granular layer and a Purkinje layer with a smaller number of cells (according to biological ratios). Thus, this cerebellar-like network includes a dense diverging neural layer (increasing the dimensionality of its internal representation and sparse coding) and a converging neural layer (integration) similar to many other biologically inspired and also artificial neural networks.
Software beamforming: comparison between a phased array and synthetic transmit aperture.

PubMed

Li, Yen-Feng; Li, Pai-Chi

2011-04-01

The data-transfer and computation requirements are compared between software-based beamforming using a phased array (PA) and a synthetic transmit aperture (STA). The advantages of a software-based architecture are reduced system complexity and lower hardware cost. Although this architecture can be implemented using commercial CPUs or GPUs, the high computation and data-transfer requirements limit its real-time beamforming performance. In particular, transferring the raw RF data from the front-end subsystem to the software back end remains challenging with current state-of-the-art electronics technologies, which offsets the cost advantage of the software back end. This study investigated the tradeoff between the data-transfer and computation requirements. Two beamforming methods based on a PA and STA, respectively, were used: the former requires a higher data transfer rate and the latter requires more memory operations. The beamformers were implemented on an NVIDIA GeForce GTX 260 GPU and an Intel Core i7 920 CPU. The frame rate of PA beamforming was 42 fps with a 128-element array transducer, with 2048 samples per firing and 189 beams per image (with a 95 MB/frame data-transfer requirement). The frame rate of STA beamforming was 40 fps with 16 firings per image (with an 8 MB/frame data-transfer requirement). Both approaches achieved real-time beamforming performance but each had its own bottleneck. The required data-transfer speed was considerably reduced in STA beamforming, but STA required more memory operations, which limited the overall computation time. The advantages of the GPU approach over the CPU approach were clearly demonstrated.
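The per-frame data sizes quoted in this abstract can be approximately reproduced with a short calculation. The sketch assumes 2 bytes per RF sample, one firing per beam for the phased array, and all 128 channels recorded on every firing; none of these is stated in the abstract.

```python
MIB = 1024 ** 2


def frame_bytes(elements, samples_per_firing, firings, bytes_per_sample=2):
    """Raw RF data volume for one image frame."""
    return elements * samples_per_firing * firings * bytes_per_sample


pa_frame = frame_bytes(128, 2048, 189)    # phased array: 189 beams per image
sta_frame = frame_bytes(128, 2048, 16)    # synthetic transmit aperture: 16 firings

pa_rate_mib_s = pa_frame * 42 / MIB       # at the reported 42 fps
sta_rate_mib_s = sta_frame * 40 / MIB     # at the reported 40 fps
```

Under these assumptions the phased array needs about 94.5 MiB per frame and STA exactly 8 MiB per frame, close to the 95 MB and 8 MB in the abstract, and the sustained transfer rates differ by more than a factor of ten.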
hybridMANTIS: a CPU-GPU Monte Carlo method for modeling indirect x-ray detectors with columnar scintillators

NASA Astrophysics Data System (ADS)

Sharma, Diksha; Badal, Andreu; Badano, Aldo

2012-04-01

The computational modeling of medical imaging systems often requires obtaining a large number of simulated images with low statistical uncertainty which translates into prohibitive computing times. We describe a novel hybrid approach for Monte Carlo simulations that maximizes utilization of CPUs and GPUs in modern workstations. We apply the method to the modeling of indirect x-ray detectors using a new and improved version of the code MANTIS, an open source software tool used for the Monte Carlo simulations of indirect x-ray imagers. We first describe a GPU implementation of the physics and geometry models in fastDETECT2 (the optical transport model) and a serial CPU version of the same code. We discuss its new features like on-the-fly column geometry and columnar crosstalk in relation to the MANTIS code, and point out areas where our model provides more flexibility for the modeling of realistic columnar structures in large area detectors. Second, we modify PENELOPE (the open source software package that handles the x-ray and electron transport in MANTIS) to allow direct output of location and energy deposited during x-ray and electron interactions occurring within the scintillator. This information is then handled by optical transport routines in fastDETECT2. A load balancer dynamically allocates optical transport showers to the GPU and CPU computing cores. Our hybridMANTIS approach achieves a significant speed-up factor of 627 when compared to MANTIS and of 35 when compared to the same code running only in a CPU instead of a GPU. Using hybridMANTIS, we successfully hide hours of optical transport time by running it in parallel with the x-ray and electron transport, thus shifting the computational bottleneck from optical to x-ray transport.
The new code requires much less memory than MANTIS and, as a result, allows us to efficiently simulate large area detectors.
Accelerating moderately stiff chemical kinetics in reactive-flow simulations using GPUs

NASA Astrophysics Data System (ADS)

Niemeyer, Kyle E.; Sung, Chih-Jen

2014-01-01

The chemical kinetics ODEs arising from operator-split reactive-flow simulations were solved on GPUs using explicit integration algorithms. Nonstiff chemical kinetics of a hydrogen oxidation mechanism (9 species and 38 irreversible reactions) were computed using the explicit fifth-order Runge-Kutta-Cash-Karp method, and the GPU-accelerated version performed faster than single- and six-core CPU versions by factors of 126 and 25, respectively, for 524,288 ODEs. Moderately stiff kinetics, represented with mechanisms for hydrogen/carbon-monoxide (13 species and 54 irreversible reactions) and methane (53 species and 634 irreversible reactions) oxidation, were computed using the stabilized explicit second-order Runge-Kutta-Chebyshev (RKC) algorithm. The GPU-based RKC implementation demonstrated an increase in performance of nearly 59 and 10 times, for problem sizes consisting of 262,144 ODEs and larger, than the single- and six-core CPU-based RKC algorithms using the hydrogen/carbon-monoxide mechanism. With the methane mechanism, RKC-GPU performed more than 65 and 11 times faster, for problem sizes consisting of 131,072 ODEs and larger, than the single- and six-core RKC-CPU versions, and up to 57 times faster than the six-core CPU-based implicit VODE algorithm on 65,536 ODEs. In the presence of more severe stiffness, such as ethylene oxidation (111 species and 1566 irreversible reactions), RKC-GPU performed more than 17 times faster than RKC-CPU on six cores for 32,768 ODEs and larger, and at best 4.5 times faster than VODE on six CPU cores for 65,536 ODEs. With a larger time step size, RKC-GPU performed at best 2.5 times slower than six-core VODE for 8192 ODEs and larger. Therefore, the need for developing new strategies for integrating stiff chemistry on GPUs was discussed.
Novel hybrid GPU-CPU implementation of parallelized Monte Carlo parametric expectation maximization estimation method for population pharmacokinetic data analysis.

PubMed

Ng, C M

2013-10-01

The development of a population PK/PD model, an essential component for model-based drug development, is both time- and labor-intensive. Graphics processing unit (GPU) computing technology has been proposed and used to accelerate many scientific computations. The objective of this study was to develop a hybrid GPU-CPU implementation of a parallelized Monte Carlo parametric expectation maximization (MCPEM) estimation algorithm for population PK data analysis. A hybrid GPU-CPU implementation of the MCPEM algorithm (MCPEMGPU) and an identical algorithm designed for a single CPU (MCPEMCPU) were developed using MATLAB on a single computer equipped with dual Xeon 6-Core E5690 CPUs and an NVIDIA Tesla C2070 GPU parallel computing card that contained 448 stream processors. Two different PK models with rich/sparse sampling design schemes were used to simulate population data in assessing the performance of MCPEMCPU and MCPEMGPU. Results were analyzed by comparing the parameter estimation and model computation times. Speedup factor was used to assess the relative benefit of parallelized MCPEMGPU over MCPEMCPU in shortening model computation time. The MCPEMGPU consistently achieved shorter computation time than the MCPEMCPU and can offer more than 48-fold speedup using a single GPU card. The novel hybrid GPU-CPU implementation of the parallelized MCPEM algorithm developed in this study holds great promise for serving as the core of the next generation of modeling software for population PK/PD analysis.
3D Kirchhoff depth migration algorithm: A new scalable approach for parallelization on multicore CPU based cluster

NASA Astrophysics Data System (ADS)

Rastogi, Richa; Londhe, Ashutosh; Srivastava, Abhishek; Sirasala, Kirannmayi M.; Khonde, Kiran

2017-03-01

In this article, a new scalable 3D Kirchhoff depth migration algorithm is presented for a state-of-the-art multicore CPU based cluster. Parallelization of 3D Kirchhoff depth migration is challenging due to its high demand for compute time, memory, storage and I/O, along with the need for their effective management. The most resource-intensive modules of the algorithm are traveltime calculations and migration summation, which exhibit an inherent trade-off between compute time and other resources. The parallelization strategy of the algorithm largely depends on the storage of calculated traveltimes and its feeding mechanism to the migration process. The presented work is an extension of our previous work, wherein a 3D Kirchhoff depth migration application for a multicore CPU based parallel system had been developed. Recently, we have worked on improving the parallel performance of this application by re-designing the parallelization approach. The new algorithm is capable of efficiently migrating both prestack and poststack 3D data. It exhibits flexibility for migrating a large number of traces within the available node memory and with minimal requirement of storage, I/O and inter-node communication. The resultant application is tested using 3D Overthrust data on PARAM Yuva II, which is a Xeon E5-2670 based multicore CPU cluster with 16 cores/node and 64 GB shared memory. Parallel performance of the algorithm is studied using different numerical experiments and the scalability results show striking improvement over its previous version. An impressive 49.05X speedup with 76.64% efficiency is achieved for 3D prestack data and 32.00X speedup with 50.00% efficiency for 3D poststack data, using 64 nodes. The results also demonstrate the effectiveness and robustness of the improved algorithm with high scalability and efficiency on a multicore CPU cluster.
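The speedup and efficiency figures in this abstract are consistent with the usual definition of parallel efficiency as speedup divided by the number of workers. The check below assumes that the 64 nodes are the worker count used in that definition.

```python
def parallel_efficiency(speedup, workers):
    """Parallel efficiency as a fraction: speedup divided by worker count."""
    return speedup / workers


prestack = 100 * parallel_efficiency(49.05, 64)     # percent
poststack = 100 * parallel_efficiency(32.00, 64)    # percent
```

Rounded to two decimals these give 76.64% and 50.00%, matching the values quoted for prestack and poststack data.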
Exploring compression techniques for ROOT IO

NASA Astrophysics Data System (ADS)

Zhang, Z.; Bockelman, B.

2017-10-01

ROOT provides a flexible format used throughout the HEP community. The number of use cases - from an archival data format to end-stage analysis - has required a number of tradeoffs to be exposed to the user. For example, a high “compression level” in the traditional DEFLATE algorithm will result in a smaller file (saving disk space) at the cost of slower decompression (costing CPU time when read). At the scale of the LHC experiment, poor design choices can result in terabytes of wasted space or wasted CPU time. We explore and attempt to quantify some of these tradeoffs. Specifically, we explore: the use of alternate compressing algorithms to optimize for read performance; an alternate method of compressing individual events to allow efficient random access; and a new approach to whole-file compression. Quantitative results are given, as well as guidance on how to make compression decisions for different use cases.
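To make the notion of a DEFLATE "compression level" concrete, here is a minimal sketch using Python's standard `zlib` module rather than ROOT's own I/O layer. The payload is a made-up, highly repetitive byte string standing in for event data, so the sizes it prints say nothing about real HEP files; the sketch only shows how the level is exposed and that every level round-trips losslessly.

```python
import zlib

# Hypothetical payload: repetitive bytes standing in for serialized event data.
payload = b"event:run=1,lumi=42,pt=13.7;" * 4096

compressed_sizes = {}
for level in (1, 6, 9):  # 1 = fastest, 9 = smallest output, 6 = zlib's usual default
    packed = zlib.compress(payload, level)
    # DEFLATE is lossless: decompression must reproduce the input exactly.
    assert zlib.decompress(packed) == payload
    compressed_sizes[level] = len(packed)
    print(f"level {level}: {len(payload)} -> {len(packed)} bytes")
```

Timing the `compress` and `decompress` calls on representative data (for example with `time.perf_counter`) would be the natural next step for quantifying the space versus CPU-time tradeoff the abstract describes.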
CMSA: a heterogeneous CPU/GPU computing system for multiple similar RNA/DNA sequence alignment.

PubMed

Chen, Xi; Wang, Chen; Tang, Shanjiang; Yu, Ce; Zou, Quan

2017-06-24

Multiple sequence alignment (MSA) is a classic and powerful technique for sequence analysis in bioinformatics. With the rapid growth of biological datasets, MSA parallelization becomes necessary to keep its running time at an acceptable level. Although there is a lot of work on MSA problems, existing approaches are either insufficient or contain some implicit assumptions that limit the generality of usage. First, the information about users' sequences, including the sizes of datasets and the lengths of sequences, can take arbitrary values and is generally unknown before submission, which is unfortunately ignored by previous work. Second, the center star strategy is suited for aligning similar sequences, but its first stage, center sequence selection, is highly time-consuming and requires further optimization. Moreover, given the heterogeneous CPU/GPU platform, prior studies consider MSA parallelization on GPU devices only, leaving the CPUs idle during the computation. Co-run computation, however, can maximize the utilization of the computing resources by enabling the workload computation on both CPU and GPU simultaneously. This paper presents CMSA, a robust and efficient MSA system for large-scale datasets on the heterogeneous CPU/GPU platform. It performs and optimizes multiple sequence alignment automatically for users' submitted sequences without any assumptions. CMSA adopts the co-run computation model so that both CPU and GPU devices are fully utilized. Moreover, CMSA proposes an improved center star strategy that reduces the time complexity of its center sequence selection process from O(mn^2) to O(mn). The experimental results show that CMSA achieves an up to 11× speedup and outperforms the state-of-the-art software. CMSA focuses on the multiple similar RNA/DNA sequence alignment and proposes a novel bitmap based algorithm to improve the center star strategy.
We can conclude that harvesting the high performance of modern GPUs is a promising approach to accelerate multiple sequence alignment. Besides, adopting the co-run computation model can maximize the entire system utilization significantly. The source code is available at https://github.com/wangvsa/CMSA .
The Research and Test of Fast Radio Burst Real-time Search Algorithm Based on GPU Acceleration

NASA Astrophysics Data System (ADS)

Wang, J.; Chen, M. Z.; Pei, X.; Wang, Z. Q.

2017-03-01

In order to satisfy the research needs of the Nanshan 25 m radio telescope of Xinjiang Astronomical Observatory (XAO) and to study the key technology of the planned QiTai radio Telescope (QTT), the receiver group of XAO studied a GPU (Graphics Processing Unit) based real-time FRB searching algorithm, which was developed from the original CPU (Central Processing Unit) based FRB searching algorithm, and built an FRB real-time searching system. The comparison of the GPU system and the CPU system shows that, while maintaining the accuracy of the search, the speed of the GPU-accelerated algorithm is improved by 35-45 times compared with the CPU algorithm.
Multigrid direct numerical simulation of the whole process of flow transition in 3-D boundary layers

NASA Technical Reports Server (NTRS)

Liu, Chaoqun; Liu, Zhining

1993-01-01

A new technology was developed in this study which provides a successful numerical simulation of the whole process of flow transition in 3-D boundary layers, including linear growth, secondary instability, breakdown, and transition at relatively low CPU cost. Most other spatial numerical simulations require high CPU cost and blow up at the stage of flow breakdown. A fourth-order finite difference scheme on stretched and staggered grids, a fully implicit time marching technique, a semi-coarsening multigrid based on the so-called approximate line-box relaxation, and a buffer domain for the outflow boundary conditions were all used for high-order accuracy, good stability, and fast convergence. A new fine-coarse-fine grid mapping technique was developed to keep the code running after the laminar flow breaks down. The computational results are in good agreement with linear stability theory, secondary instability theory, and some experiments. The cost for a typical case with 162 x 34 x 34 grid is around 2 CRAY-YMP CPU hours for 10 T-S periods.
Exploring the use of I/O nodes for computation in a MIMD multiprocessor

NASA Technical Reports Server (NTRS)

Kotz, David; Cai, Ting

1995-01-01

As parallel systems move into the production scientific-computing world, the emphasis will be on cost-effective solutions that provide high throughput for a mix of applications. Cost effective solutions demand that a system make effective use of all of its resources. Many MIMD multiprocessors today, however, distinguish between 'compute' and 'I/O' nodes, the latter having attached disks and being dedicated to running the file-system server. This static division of responsibilities simplifies system management but does not necessarily lead to the best performance in workloads that need a different balance of computation and I/O. Of course, computational processes sharing a node with a file-system service may receive less CPU time, network bandwidth, and memory bandwidth than they would on a computation-only node. In this paper we begin to examine this issue experimentally. We found that high performance I/O does not necessarily require substantial CPU time, leaving plenty of time for application computation. There were some complex file-system requests, however, which left little CPU time available to the application. (The impact on network and memory bandwidth still needs to be determined.) For applications (or users) that cannot tolerate an occasional interruption, we recommend that they continue to use only compute nodes. For tolerant applications needing more cycles than those provided by the compute nodes, we recommend that they take full advantage of both compute and I/O nodes for computation, and that operating systems should make this possible.
Neural Network and Regression Approximations in High Speed Civil Transport Aircraft Design Optimization

NASA Technical Reports Server (NTRS)

Patnaik, Surya N.; Guptill, James D.; Hopkins, Dale A.; Lavelle, Thomas M.

1998-01-01

Nonlinear mathematical-programming-based design optimization can be an elegant method. However, the calculations required to generate the merit function, constraints, and their gradients, which are frequently required, can make the process computationally intensive. The computational burden can be greatly reduced by using approximating analyzers derived from an original analyzer utilizing neural networks and linear regression methods. The experience gained from using both of these approximation methods in the design optimization of a high speed civil transport aircraft is the subject of this paper. The Langley Research Center's Flight Optimization System was selected for the aircraft analysis. This software was exercised to generate a set of training data with which a neural network and a regression method were trained, thereby producing the two approximating analyzers. The derived analyzers were coupled to the Lewis Research Center's CometBoards test bed to provide the optimization capability. With the combined software, both approximation methods were examined for use in aircraft design optimization, and both performed satisfactorily. The CPU time for solution of the problem, which had been measured in hours, was reduced to minutes with the neural network approximation and to seconds with the regression method. Instability encountered in the aircraft analysis software at certain design points was also eliminated. On the other hand, there were costs and difficulties associated with training the approximating analyzers. The CPU time required to generate the input-output pairs and to train the approximating analyzers was seven times that required for solution of the problem.
Developing infrared array controller with software real time operating system

NASA Astrophysics Data System (ADS)

Sako, Shigeyuki; Miyata, Takashi; Nakamura, Tomohiko; Motohara, Kentaro; Uchimoto, Yuka Katsuno; Onaka, Takashi; Kataza, Hirokazu

2008-07-01

Real-time capabilities are required for a controller of a large format array to reduce the dead time attributable to readout and data transfer. Real-time processing has been achieved by dedicated processors including DSP, CPLD, and FPGA devices. However, dedicated processors have problems with memory resources, inflexibility, and high cost. Meanwhile, a recent PC has sufficient CPU and memory resources to control the infrared array and to process a large amount of frame data in real time. In this study, we have developed an infrared array controller with a software real-time operating system (RTOS) instead of the dedicated processors. A Linux PC equipped with an RTAI extension and a dual-core CPU is used as the main computer, and one of the CPU cores is allocated to the real-time processing. A digital I/O board with DMA functions is used as the I/O interface. The signal-processing cores are integrated in the OS kernel as a real-time driver module, which is composed of two virtual devices for the clock processor and the frame processor tasks. The array controller with the RTOS realizes complicated operations easily, flexibly, and at a low cost.
A Hybrid CPU/GPU Pattern-Matching Algorithm for Deep Packet Inspection

PubMed Central

Chen, Yaw-Chung

2015-01-01

The large quantities of data now being transferred via high-speed networks have made deep packet inspection indispensable for security purposes. Scalable and low-cost signature-based network intrusion detection systems have been developed for deep packet inspection for various software platforms. Traditional approaches that only involve central processing units (CPUs) are now considered inadequate in terms of inspection speed. Graphic processing units (GPUs) have superior parallel processing power, but transmission bottlenecks can reduce optimal GPU efficiency. In this paper we describe our proposal for a hybrid CPU/GPU pattern-matching algorithm (HPMA) that divides and distributes the packet-inspecting workload between a CPU and GPU. All packets are initially inspected by the CPU and filtered using a simple pre-filtering algorithm, and packets that might contain malicious content are sent to the GPU for further inspection. Test results indicate that in terms of random payload traffic, the matching speed of our proposed algorithm was 3.4 times and 2.7 times faster than those of the AC-CPU and AC-GPU algorithms, respectively. Further, HPMA achieved higher energy efficiency than the other tested algorithms. PMID:26437335
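The two-stage flow described in this abstract (a cheap pre-filter over every packet, then full matching only on packets that might be malicious) can be sketched in plain Python. This is only an illustration of the idea under stated assumptions: the signatures, the two-byte prefix pre-filter, and the function names are all hypothetical, and the second stage uses ordinary substring search on the CPU as a stand-in for the GPU matching stage. It is not the authors' HPMA implementation.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical signature set; a real IDS would load thousands of rules.
SIGNATURES: List[bytes] = [b"cmd.exe", b"/etc/passwd", b"DROP TABLE"]
PREFIX_LEN = 2  # assumed pre-filter width; every signature must be at least this long


def build_prefilter(signatures: Iterable[bytes]) -> Set[bytes]:
    """Collect the leading PREFIX_LEN bytes of every signature."""
    return {sig[:PREFIX_LEN] for sig in signatures}


def prefilter(payload: bytes, prefixes: Set[bytes]) -> bool:
    """Cheap first pass: True if any PREFIX_LEN-byte window is a signature prefix.

    It can report false positives but never false negatives, because a payload
    that contains a signature also contains that signature's prefix.
    """
    return any(
        payload[i:i + PREFIX_LEN] in prefixes
        for i in range(len(payload) - PREFIX_LEN + 1)
    )


def full_match(payload: bytes, signatures: Iterable[bytes]) -> List[bytes]:
    """Expensive second pass (stand-in for the GPU stage): exact substring search."""
    return [sig for sig in signatures if sig in payload]


def inspect(packets: Iterable[bytes]) -> Tuple[List[bytes], List[Tuple[bytes, List[bytes]]]]:
    """Return (packets forwarded to the second stage, packets with confirmed hits)."""
    prefixes = build_prefilter(SIGNATURES)
    suspicious = [p for p in packets if prefilter(p, prefixes)]
    alerts = []
    for packet in suspicious:
        hits = full_match(packet, SIGNATURES)
        if hits:
            alerts.append((packet, hits))
    return suspicious, alerts


packets = [b"GET /index.html", b"GET /etc/passwd", b"hello /e world"]
suspicious, alerts = inspect(packets)
print(len(suspicious), "forwarded;", len(alerts), "confirmed")
```

In this toy run the third packet is a pre-filter false positive: it is forwarded to the second stage but produces no alert, which is the price paid for keeping the first stage cheap.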
A Hybrid CPU/GPU Pattern-Matching Algorithm for Deep Packet Inspection.

PubMed

Lee, Chun-Liang; Lin, Yi-Shan; Chen, Yaw-Chung

2015-01-01

The large quantities of data now being transferred via high-speed networks have made deep packet inspection indispensable for security purposes. Scalable and low-cost signature-based network intrusion detection systems have been developed for deep packet inspection for various software platforms. Traditional approaches that only involve central processing units (CPUs) are now considered inadequate in terms of inspection speed. Graphic processing units (GPUs) have superior parallel processing power, but transmission bottlenecks can reduce optimal GPU efficiency. In this paper we describe our proposal for a hybrid CPU/GPU pattern-matching algorithm (HPMA) that divides and distributes the packet-inspecting workload between a CPU and GPU. All packets are initially inspected by the CPU and filtered using a simple pre-filtering algorithm, and packets that might contain malicious content are sent to the GPU for further inspection. Test results indicate that in terms of random payload traffic, the matching speed of our proposed algorithm was 3.4 times and 2.7 times faster than those of the AC-CPU and AC-GPU algorithms, respectively. Further, HPMA achieved higher energy efficiency than the other tested algorithms.
Clinical implementation of a GPU-based simplified Monte Carlo method for a treatment planning system of proton beam therapy.

PubMed

Kohno, R; Hotta, K; Nishioka, S; Matsubara, K; Tansho, R; Suzuki, T

2011-11-21

We implemented the simplified Monte Carlo (SMC) method on graphics processing unit (GPU) architecture under the computer-unified device architecture platform developed by NVIDIA. The GPU-based SMC was clinically applied for four patients with head and neck, lung, or prostate cancer. The results were compared to those obtained by a traditional CPU-based SMC with respect to the computation time and discrepancy. In the CPU- and GPU-based SMC calculations, the estimated mean statistical errors of the calculated doses in the planning target volume region were within 0.5% rms. The dose distributions calculated by the GPU- and CPU-based SMCs were similar, within statistical errors. The GPU-based SMC was 12.30-16.00 times faster than the CPU-based SMC. The computation time per beam arrangement using the GPU-based SMC for the clinical cases ranged from 9 to 67 s. The results demonstrate the successful application of the GPU-based SMC to clinical proton treatment planning.
The Performance of the NAS HSPs in 1st Half of 1994

NASA Technical Reports Server (NTRS)

Bergeron, Robert J.; Walter, Howard (Technical Monitor)

1995-01-01

During the first six months of 1994, the NAS (National Airspace System) 16-CPU Y-MP C90 Von Neumann (VN) delivered an average throughput of 4.045 GFLOPS while the ACSF (Aeronautics Consolidated Supercomputer Facility) 8-CPU Y-MP C90 Eagle averaged 1.658 GFLOPS. The VN rate represents a machine efficiency of 26.3% whereas the Eagle rate corresponds to a machine efficiency of 21.6%. VN displayed a greater efficiency than Eagle primarily because the stronger workload demand for its CPU cycles allowed it to devote more time to user programs and less time to idle. An additional factor increasing VN efficiency was the ability of the UNICOS 8.0 Operating System to deliver a larger fraction of CPU time to user programs. Although measurements indicate increasing vector length for both workloads, insufficient vector lengths continue to hinder HSP (High Speed Processor) performance. To improve HSP performance, NAS should continue to encourage the HSP users to modify their codes to increase program vector length.
A Subsonic Aircraft Design Optimization With Neural Network and Regression Approximators

NASA Technical Reports Server (NTRS)

Patnaik, Surya N.; Coroneos, Rula M.; Guptill, James D.; Hopkins, Dale A.; Haller, William J.

2004-01-01

The Flight-Optimization-System (FLOPS) code encountered difficulty in analyzing a subsonic aircraft. The limitation made the design optimization problematic. The deficiencies have been alleviated through use of neural network and regression approximations. The insight gained from using the approximators is discussed in this paper. The FLOPS code is reviewed. Analysis models are developed and validated for each approximator. The regression method appears to hug the data points, while the neural network approximation follows a mean path. For an analysis cycle, the approximate model required milliseconds of central processing unit (CPU) time versus seconds by the FLOPS code. Performance of the approximators was satisfactory for aircraft analysis. A design optimization capability has been created by coupling the derived analyzers to the optimization test bed CometBoards. The approximators were efficient reanalysis tools in the aircraft design optimization. Instability encountered in the FLOPS analyzer was eliminated. The convergence characteristics were improved for the design optimization. The CPU time required to calculate the optimum solution, measured in hours with the FLOPS code, was reduced to minutes with the neural network approximation and to seconds with the regression method. Generation of the approximators required the manipulation of a very large quantity of data. Design sensitivity with respect to the bounds of aircraft constraints is easily generated.

Static and Dynamic Frequency Scaling on Multicore CPUs

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bao, Wenlei; Hong, Changwan; Chunduri, Sudheer

2016-12-28

Dynamic voltage and frequency scaling (DVFS) adapts CPU power consumption by modifying a processor’s operating frequency (and the associated voltage). Typical approaches employing DVFS involve default strategies such as running at the lowest or the highest frequency, or observing the CPU’s runtime behavior and dynamically adapting the voltage/frequency configuration based on CPU usage. In this paper, we argue that many previous approaches suffer from inherent limitations, such as not accounting for the processor-specific impact of frequency changes on energy for different workload types. We first propose a lightweight runtime-based approach to automatically adapt the frequency based on the CPU workload that is agnostic of the processor characteristics. We then show that further improvements can be achieved for affine kernels in the application, using a compile-time characterization instead of run-time monitoring to select the frequency and number of CPU cores to use. Our framework relies on a one-time energy characterization of CPU-specific DVFS profiles followed by a compile-time categorization of loop-based code segments in the application. These are combined to determine a priori the frequency and the number of cores to use to execute the application so as to optimize energy or energy-delay product, outperforming the runtime approach. Extensive evaluation on 60 benchmarks and five multi-core CPUs shows that our approach systematically outperforms the powersave Linux governor, while improving overall performance.
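The selection step mentioned in this abstract, choosing a frequency and core count that optimizes either energy or energy-delay product (EDP), can be illustrated with a small Python sketch. Everything concrete here is an assumption made for illustration: the `Profile` record, the function names, and above all the numbers in `PROFILES`, which are invented placeholders and not measurements from the paper. EDP is taken in its usual form, energy multiplied by execution time.

```python
from typing import List, NamedTuple


class Profile(NamedTuple):
    """One hypothetical characterization entry for a single code segment."""
    freq_ghz: float
    cores: int
    energy_j: float  # energy to run the segment, in joules
    time_s: float    # execution time of the segment, in seconds


# Invented placeholder values for illustration only; not data from the paper.
PROFILES: List[Profile] = [
    Profile(1.2, 4, 50.0, 10.0),
    Profile(1.2, 8, 55.0, 8.0),
    Profile(2.0, 4, 60.0, 6.0),
    Profile(2.6, 8, 90.0, 3.5),
]


def energy_delay_product(profile: Profile) -> float:
    """EDP = energy x delay; lower is better."""
    return profile.energy_j * profile.time_s


def select_configuration(profiles: List[Profile], objective: str = "edp") -> Profile:
    """Pick the configuration minimizing the chosen objective ('energy' or 'edp')."""
    if objective == "energy":
        return min(profiles, key=lambda p: p.energy_j)
    if objective == "edp":
        return min(profiles, key=energy_delay_product)
    raise ValueError(f"unknown objective: {objective!r}")


best_energy = select_configuration(PROFILES, "energy")
best_edp = select_configuration(PROFILES, "edp")
print("min energy:", best_energy)
print("min EDP:   ", best_edp)
```

With these placeholder values the two objectives disagree: the lowest-energy configuration is the slow four-core one, while the lowest EDP goes to the fast eight-core one, which is why the choice of objective matters.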
Requirements Analysis for Large Ada Programs: Lessons Learned on CCPDS-R

DTIC Science & Technology

1989-12-01

when the design had matured and This approach was not optimal from the formal the SRS role was to be the tester’s contract, implemen- testing and...on the software development CPU processing load. These constraints primar- process is the necessity to include sufficient testing ily affect algorithm...allocations and timing requirements are by-products of the software design process when multiple CSCls are a P R StrR eSOFTWARE ENGINEERING executed within
Development of small scale cluster computer for numerical analysis

NASA Astrophysics Data System (ADS)

Zulkifli, N. H. N.; Sapit, A.; Mohammed, A. N.

2017-09-01

In this study, two personal computers were successfully networked together to form a small-scale cluster. Each computer has a multicore processor with four cores, giving the cluster eight processors in total. The cluster runs Ubuntu 14.04 Linux with an MPI implementation (MPICH2). Two main tests were conducted on the cluster: a communication test and a performance test. The communication test, which used a simple MPI Hello program written in C, was done to make sure that the computers could pass the required information without any problem. In addition, a performance test was done to show that the cluster's calculation performance is much better than that of a single-CPU computer. In this performance test, the same code was run four times, using a single node, 2 processors, 4 processors, and 8 processors. The results show that the time required to solve the problem decreases as processors are added: the time required for the calculation is halved when the number of processors is doubled. To conclude, we successfully developed a small-scale cluster computer from common hardware that offers higher computing power than a single-CPU processor. This can be beneficial for research that requires high computing power, especially numerical analysis such as finite element analysis, computational fluid dynamics, and computational physics analysis.
Deterministic Stress Modeling of Hot Gas Segregation in a Turbine

NASA Technical Reports Server (NTRS)

Busby, Judy; Sondak, Doug; Staubach, Brent; Davis, Roger

1998-01-01

Simulation of unsteady viscous turbomachinery flowfields is presently impractical as a design tool due to the long run times required. Designers rely predominantly on steady-state simulations, but these simulations do not account for some of the important unsteady flow physics. Unsteady flow effects can be modeled as source terms in the steady flow equations. These source terms, referred to as Lumped Deterministic Stresses (LDS), can be used to drive steady flow solution procedures to reproduce the time-average of an unsteady flow solution. The goal of this work is to investigate the feasibility of using inviscid lumped deterministic stresses to model unsteady combustion hot streak migration effects on the turbine blade tip and outer air seal heat loads using a steady computational approach. The LDS model is obtained from an unsteady inviscid calculation. The LDS model is then used with a steady viscous computation to simulate the time-averaged viscous solution. Both two-dimensional and three-dimensional applications are examined. The inviscid LDS model produces good results for the two-dimensional case and requires less than 10% of the CPU time of the unsteady viscous run. For the three-dimensional case, the LDS model does a good job of reproducing the time-averaged viscous temperature migration and separation as well as heat load on the outer air seal at a CPU cost that is 25% of that of an unsteady viscous computation.
Characterization and referral patterns of ST-elevation myocardial infarction patients admitted to chest pain units rather than directly to catheterization laboratories. Data from the German Chest Pain Unit Registry.

PubMed

Schmidt, Frank P; Perne, Andrea; Hochadel, Matthias; Giannitsis, Evangelos; Darius, Harald; Maier, Lars S; Schmitt, Claus; Heusch, Gerd; Voigtländer, Thomas; Mudra, Harald; Gori, Tommaso; Senges, Jochen; Münzel, Thomas

2017-03-15

Direct transfer to the catheterization laboratory for primary percutaneous coronary intervention (PCI) is standard of care for patients with ST-segment elevation myocardial infarction (STEMI). Nevertheless, a significant number of STEMI-patients are initially treated in chest pain units (CPUs) of admitting hospitals. Thus, it is important to characterize these patients and to define why an important deviation from recommended clinical pathways occurs and in particular to quantify the impact of deviation on critical time intervals. 1679 STEMI patients admitted to a CPU in the period from 2010 to 2015 were enrolled in the German CPU registry (8.5% of 19,666). 55.9% of the patients were delivered by an emergency medical system (EMS), 16.1% transferred from other hospitals and 15.2% referred by a general practitioner (GP). 12.7% were self-referrals. 55% did not get a pre-hospital ECG. Compared to the EMS, referral by GPs markedly delayed critical time intervals while a pre-hospital ECG demonstrating ST-segment elevation reduced door-to-balloon time. When compared to STEMI patients (n=21,674) enrolled in the ALKK-registry, CPU-STEMI patients had a lower risk profile, their treatment in the CPU was guideline-conform and in-hospital mortality was low (1.5%). CPU-STEMI patients represent a numerically significant group because a pre-hospital ECG was not documented. Treatment in the CPU is guideline-conform and the intra-hospital mortality is low. The lack of a pre-hospital ECG and admission via the GP substantially delay critical time intervals, suggesting that in patients with symptoms suggestive of an ACS, the EMS should be contacted and not the GP. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
Simulation Testing of Embedded Flight Software

NASA Technical Reports Server (NTRS)

Shahabuddin, Mohammad; Reinholtz, William

2004-01-01

Virtual Real Time (VRT) is a computer program for testing embedded flight software by computational simulation in a workstation, in contradistinction to testing it in its target central processing unit (CPU). The disadvantages of testing in the target CPU include the need for an expensive test bed, the necessity for testers and programmers to take turns using the test bed, and the lack of software tools for debugging in a real-time environment. By virtue of its architecture, most of the flight software of the type in question is amenable to development and testing on workstations, for which there is an abundance of commercially available debugging and analysis software tools. Unfortunately, the timing of a workstation differs from that of a target CPU in a test bed. VRT, in conjunction with closed-loop simulation software, provides a capability for executing embedded flight software on a workstation in a close-to-real-time environment. A scale factor is used to convert between execution time in VRT on a workstation and execution on a target CPU. VRT includes high-resolution operating-system timers that enable the synchronization of flight software with simulation software and ground software, all running on different workstations.
Performance Analysis of the NAS Y-MP Workload

NASA Technical Reports Server (NTRS)

Bergeron, Robert J.; Kutler, Paul (Technical Monitor)

1997-01-01

This paper describes the performance characteristics of the computational workloads on the NAS Cray Y-MP machines, a Y-MP 832 and later a Y-MP 8128. Hardware measurements indicated that the Y-MP workload performance matured over time, ultimately sustaining an average throughput of 0.8 GFLOPS and a vector operation fraction of 87%. The measurements also revealed an operation rate exceeding 1 per clock period, a well-balanced architecture featuring a strong utilization of vector functional units, and an efficient memory organization. Introduction of the larger memory 8128 increased throughput by allowing a more efficient utilization of CPUs. Throughput also depended on the metering of the batch queues; low-idle Saturday workloads required a buffer of small jobs to prevent memory starvation of the CPU. UNICOS required about 7% of total CPU time to service the 832 workloads; this overhead decreased to 5% for the 8128 workloads. While most of the system time went to service I/O requests, efficient scheduling prevented excessive idle due to I/O wait. System measurements disclosed no obvious bottlenecks in the response of the machine and UNICOS to the workloads. In most cases, Cray-provided software tools were quite sufficient for measuring the performance of both the machine and operating system.
Emissivity of Rocket Plume Particulates

DTIC Science & Technology

1992-09-01

V. EXPERIMENTAL RESULTS ........ ............... 29 VI. CONCLUSIONS AND RECOMMENDATIONS .... ........ 32 APPENDIX A. CATS -E SOFTWARE...interfaced through the CATS E Thermal Analysis software, which is MS-DOS based, and can be run on any 28b or higher CPU. This system allows real-time...body source to establish the parameters required by the CATS program for proper microscope/scanner interface. A complete description of microscope
Algorithms and Application of Sparse Matrix Assembly and Equation Solvers for Aeroacoustics

NASA Technical Reports Server (NTRS)

Watson, W. R.; Nguyen, D. T.; Reddy, C. J.; Vatsa, V. N.; Tang, W. H.

2001-01-01

An algorithm for symmetric sparse equation solutions on an unstructured grid is described. Efficient, sequential sparse algorithms for degree-of-freedom reordering, supernodes, symbolic/numerical factorization, and forward/backward solution phases are reviewed. Three sparse algorithms for the generation and assembly of symmetric systems of matrix equations are presented. The accuracy and numerical performance of the sequential version of the sparse algorithms are evaluated over the frequency range of interest in a three-dimensional aeroacoustics application. Results show that the solver solutions are accurate using a discretization of 12 points per wavelength. Results also show that the first assembly algorithm is impractical for high-frequency noise calculations. The second and third assembly algorithms have nearly equal performance at low values of source frequencies, but at higher values of source frequencies the third algorithm saves CPU time and RAM. The CPU time and the RAM required by the second and third assembly algorithms are two orders of magnitude smaller than that required by the sparse equation solver. A sequential version of these sparse algorithms can, therefore, be conveniently incorporated into a substructuring for domain decomposition formulation to achieve parallel computation, where different substructures are handled by different parallel processors.
High-Speed Particle-in-Cell Simulation Parallelized with Graphic Processing Units for Low Temperature Plasmas for Material Processing

NASA Astrophysics Data System (ADS)

Hur, Min Young; Verboncoeur, John; Lee, Hae June

2014-10-01

Compared with fluid simulations, particle-in-cell (PIC) simulations have high fidelity for plasma devices that require transient kinetic modeling. PIC uses fewer approximations of the plasma kinetics but requires many particles and grids to observe the semantic results. As a result, simulation time grows in proportion to the number of particles, and PIC simulation needs high performance computing. In this research, a graphic processing unit (GPU) is adopted for high performance computing of PIC simulation for low temperature discharge plasmas. GPUs have many-core processors and high memory bandwidth compared with a central processing unit (CPU). NVIDIA GeForce GPUs with hundreds of cores, which offer cost-effective performance, were used for the test. The PIC code algorithm is divided into two modules: a field solver and a particle mover. The particle mover module is divided into four routines, named move, boundary, Monte Carlo collision (MCC), and deposit. Overall, the GPU code solves particle motions as well as electrostatic potential in two-dimensional geometry almost 30 times faster than a single CPU code. This work was supported by the Korea Institute of Science Technology Information.
Use of general purpose graphics processing units with MODFLOW

USGS Publications Warehouse

Hughes, Joseph D.; White, Jeremy T.

2013-01-01

To evaluate the use of general-purpose graphics processing units (GPGPUs) to improve the performance of MODFLOW, an unstructured preconditioned conjugate gradient (UPCG) solver has been developed. The UPCG solver uses a compressed sparse row storage scheme and includes Jacobi, zero fill-in incomplete, and modified-incomplete lower-upper (LU) factorization, and generalized least-squares polynomial preconditioners. The UPCG solver also includes options for sequential and parallel solution on the central processing unit (CPU) using OpenMP. For simulations utilizing the GPGPU, all basic linear algebra operations are performed on the GPGPU; memory copies between the CPU and GPGPU occur prior to the first iteration of the UPCG solver and after satisfying head and flow criteria or exceeding a maximum number of iterations. The efficiency of the UPCG solver for GPGPU and CPU solutions is benchmarked using simulations of a synthetic, heterogeneous unconfined aquifer with tens of thousands to millions of active grid cells. Testing indicates GPGPU speedups on the order of 2 to 8, relative to the standard MODFLOW preconditioned conjugate gradient (PCG) solver, can be achieved when (1) memory copies between the CPU and GPGPU are optimized, (2) the percentage of time performing memory copies between the CPU and GPGPU is small relative to the calculation time, (3) high-performance GPGPU cards are utilized, and (4) CPU-GPGPU combinations are used to execute sequential operations that are difficult to parallelize. Furthermore, UPCG solver testing indicates GPGPU speedups exceed parallel CPU speedups achieved using OpenMP on multicore CPUs for preconditioners that can be easily parallelized.
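For readers unfamiliar with the ingredients named in this abstract, the following pure-Python sketch combines two of them: compressed sparse row (CSR) storage and a Jacobi (diagonal) preconditioner inside a conjugate gradient iteration. It is a textbook-style illustration, not the UPCG code. The function names, the small test matrix, and the residual-norm stopping test (used here in place of MODFLOW's head and flow criteria) are all assumptions.

```python
from typing import List, Tuple


def csr_matvec(data: List[float], indices: List[int], indptr: List[int],
               x: List[float]) -> List[float]:
    """y = A @ x for a matrix in CSR form (values, column indices, row pointers)."""
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        for k in range(indptr[row], indptr[row + 1]):
            y[row] += data[k] * x[indices[k]]
    return y


def csr_diagonal(data: List[float], indices: List[int], indptr: List[int]) -> List[float]:
    """Extract the main diagonal, which is all the Jacobi preconditioner needs."""
    diag = [0.0] * (len(indptr) - 1)
    for row in range(len(diag)):
        for k in range(indptr[row], indptr[row + 1]):
            if indices[k] == row:
                diag[row] = data[k]
    return diag


def dot(a: List[float], b: List[float]) -> float:
    return sum(u * v for u, v in zip(a, b))


def jacobi_pcg(data, indices, indptr, b, tol=1e-10, max_iter=100) -> Tuple[List[float], int]:
    """Solve A x = b for symmetric positive definite A; return (x, iterations used)."""
    n = len(b)
    inv_diag = [1.0 / d for d in csr_diagonal(data, indices, indptr)]
    x = [0.0] * n
    r = list(b)                                  # residual for the zero initial guess
    if dot(r, r) ** 0.5 < tol:
        return x, 0
    z = [inv_diag[i] * r[i] for i in range(n)]   # apply the Jacobi preconditioner
    p = list(z)
    rz = dot(r, z)
    for iteration in range(1, max_iter + 1):
        ap = csr_matvec(data, indices, indptr, p)
        alpha = rz / dot(p, ap)
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * ap[i] for i in range(n)]
        if dot(r, r) ** 0.5 < tol:               # assumed stopping test: residual 2-norm
            return x, iteration
        z = [inv_diag[i] * r[i] for i in range(n)]
        rz_new = dot(r, z)
        p = [z[i] + (rz_new / rz) * p[i] for i in range(n)]
        rz = rz_new
    return x, max_iter


# Small symmetric positive definite test matrix (tridiagonal: 4 on the diagonal, -1 beside it).
data = [4.0, -1.0, -1.0, 4.0, -1.0, -1.0, 4.0, -1.0, -1.0, 4.0]
indices = [0, 1, 0, 1, 2, 1, 2, 3, 2, 3]
indptr = [0, 2, 5, 8, 10]
b = [2.0, 4.0, 6.0, 13.0]   # equals A @ [1, 2, 3, 4]

solution, iterations = jacobi_pcg(data, indices, indptr, b)
print(solution, "in", iterations, "iterations")
```

The matrix-vector product and the vector updates are the "basic linear algebra operations" that a GPU port would move onto the device; the Jacobi preconditioner is attractive in that setting because applying it is an element-wise multiply with no dependencies between rows.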
Memory interface simulator: A computer design aid

NASA Technical Reports Server (NTRS)

Taylor, D. S.; Williams, T.; Weatherbee, J. E.

1972-01-01

Results are presented of a study conducted with a digital simulation model being used in the design of the Automatically Reconfigurable Modular Multiprocessor System (ARMMS), a candidate computer system for future manned and unmanned space missions. The model simulates the activity involved as instructions are fetched from random access memory for execution in one of the system central processing units. A series of model runs measured instruction execution time under various assumptions pertaining to the CPU's and the interface between the CPU's and RAM. Design tradeoffs are presented in the following areas: Bus widths, CPU microprogram read only memory cycle time, multiple instruction fetch, and instruction mix.
ELT-scale Adaptive Optics real-time control with the Intel Xeon Phi Many Integrated Core Architecture

NASA Astrophysics Data System (ADS)

Jenkins, David R.; Basden, Alastair; Myers, Richard M.

2018-05-01

We propose a solution to the increased computational demands of Extremely Large Telescope (ELT) scale adaptive optics (AO) real-time control with the Intel Xeon Phi Knights Landing (KNL) Many Integrated Core (MIC) Architecture. The computational demands of an AO real-time controller (RTC) scale with the fourth power of telescope diameter and so the next generation ELTs require orders of magnitude more processing power for the RTC pipeline than existing systems. The Xeon Phi contains a large number (≥64) of low power x86 CPU cores and high bandwidth memory integrated into a single socketed server CPU package. The increased parallelism and memory bandwidth are crucial to providing the performance for reconstructing wavefronts with the required precision for ELT scale AO. Here, we demonstrate that the Xeon Phi KNL is capable of performing ELT scale single conjugate AO real-time control computation at over 1.0 kHz with less than 20 μs RMS jitter. We have also shown that with a wavefront sensor camera attached the KNL can process the real-time control loop at up to 966 Hz, the maximum frame-rate of the camera, with jitter remaining below 20 μs RMS. Future studies will involve exploring the use of a cluster of Xeon Phis for the real-time control of the MCAO and MOAO regimes of AO. We find that the Xeon Phi is highly suitable for ELT AO real time control.
Nuclear shell model code CRUNCHER

DOE Office of Scientific and Technical Information (OSTI.GOV)

Resler, D.A.; Grimes, S.M.

1988-05-01

A new nuclear shell model code CRUNCHER, patterned after the code VLADIMIR, has been developed. While CRUNCHER and VLADIMIR employ the techniques of an uncoupled basis and the Lanczos process, improvements in the new code allow it to handle much larger problems than the previous code and to perform them more efficiently. Tests involving a moderately sized calculation indicate that CRUNCHER running on a SUN 3/260 workstation requires approximately one-half the central processing unit (CPU) time required by VLADIMIR running on a CRAY-1 supercomputer.
Polydrug use among college students in Brazil: a nationwide survey.

PubMed

Oliveira, Lúcio Garcia de; Alberghini, Denis Guilherme; Santos, Bernardo dos; Andrade, Arthur Guerra de

2013-01-01

To estimate the frequency of polydrug use (alcohol and illicit drugs) among college students and its associations with gender and age group. A nationwide sample of 12,544 college students was asked to complete a questionnaire on their use of drugs according to three time parameters (lifetime, past 12 months, and last 30 days). The co-use of drugs was investigated as concurrent polydrug use (CPU) and simultaneous polydrug use (SPU), a subcategory of CPU that involves the use of drugs at the same time or in close temporal proximity. Almost 26% of college students reported having engaged in CPU in the past 12 months. Among these students, 37% had engaged in SPU. In the past 30 days, 17% of college students had engaged in CPU. Among these, 35% had engaged in SPU. Marijuana was the illicit drug most frequently used with alcohol (either as CPU or SPU), especially among males. Among females, the most commonly reported combination was alcohol and prescribed medications. A high proportion of Brazilian college students may be engaging in polydrug use. College administrators should keep themselves informed to be able to identify such use and to develop educational interventions to prevent such behavior.
Spectrum Savings from High Performance Recording and Playback Onboard the Test Article

DTIC Science & Technology

2013-02-20

execute within a Windows 7 environment, and data is recorded on SSDs. The underlying database is implemented using MySQL. Figure 1 illustrates the... MySQL database. This is effectively the time at which the recorded data are available for retransmission. CPU and Memory utilization were collected...17.7% MySQL avg. 3.9% EQDR Total avg. 21.6% Table 1 CPU Utilization with 260 Mbits/sec Load The difference between the total System CPU (27.8
Real-Time Ada Problem Solution Study

DTIC Science & Technology

1989-03-24

been performed, there is a larger base of information concerning standards and guidelines for Ada usage, as well "lessons learned". A number of...the target machine and operate in conjunction with the application programs, they also require system resources (CPU, memory). The utilization of...Transporter-Consumer 1694 154 6. Producer-Transpt-Buffer- Transp -Consumer 2248 204 7. Relay 906 82 8. Conditional Entry - no rendezvous 170 15
Hypoxia/oxidative stress alters the pharmacokinetics of CPU86017-RS through mitochondrial dysfunction and NADPH oxidase activation.

PubMed

Gao, Jie; Ding, Xuan-sheng; Zhang, Yu-mao; Dai, De-zai; Liu, Mei; Zhang, Can; Dai, Yin

2013-12-01

Hypoxia/oxidative stress can alter the pharmacokinetics (PK) of CPU86017-RS, a novel antiarrhythmic agent. The aim of this study was to investigate the mechanisms underlying the alteration of PK of CPU86017-RS by hypoxia/oxidative stress. Male SD rats exposed to normal or intermittent hypoxia (10% O2) were administered CPU86017-RS (20, 40 or 80 mg/kg, ig) for 8 consecutive days. The PK parameters of CPU86017-RS were examined on d 8. In a separate set of experiments, female SD rats were injected with isoproterenol (ISO) for 5 consecutive days to induce a stress-related status, then CPU86017-RS (80 mg/kg, ig) was administered, and the tissue distributions were examined. The levels of Mn-SOD (manganese containing superoxide dismutase), endoplasmic reticulum (ER) stress sensor proteins (ATF-6, activating transcription factor 6 and PERK, PKR-like ER kinase) and activation of NADPH oxidase (NOX) were detected with Western blotting. Rat liver microsomes were incubated under N2 for in vitro study. The Cmax, t1/2, MRT (mean residence time) and AUC (area under the curve) of CPU86017-RS were significantly increased in the hypoxic rats receiving the 3 different doses of CPU86017-RS. The hypoxia-induced alteration of PK was associated with significantly reduced Mn-SOD level, and increased ATF-6, PERK and NOX levels. In ISO-treated rats, the distributions of CPU86017-RS in plasma, heart, kidney, and liver were markedly increased, and NOX levels in heart, kidney, and liver were significantly upregulated. Co-administration of the NOX blocker apocynin eliminated the abnormalities in the PK and tissue distributions of CPU86017-RS induced by hypoxia/oxidative stress. The metabolism of CPU86017-RS in the N2-treated liver microsomes was significantly reduced; addition of N-acetylcysteine (NAC), but not vitamin C, effectively reversed this change. The altered PK and metabolism of CPU86017-RS induced by hypoxia/oxidative stress are produced by mitochondrial abnormalities, NOX activation and ER stress; these abnormalities are significantly alleviated by apocynin or NAC.
Acceleration of discrete stochastic biochemical simulation using GPGPU.

PubMed

Sumiyoshi, Kei; Hirata, Kazuki; Hiroi, Noriko; Funahashi, Akira

2015-01-01

For systems made up of a small number of molecules, such as a biochemical network in a single cell, a simulation requires a stochastic approach, instead of a deterministic approach. The stochastic simulation algorithm (SSA) simulates the stochastic behavior of a spatially homogeneous system. Since stochastic approaches produce different results each time they are used, multiple runs are required in order to obtain statistical results; this results in a large computational cost. We have implemented a parallel method for using SSA to simulate a stochastic model; the method uses a graphics processing unit (GPU), which enables multiple realizations at the same time, and thus reduces the computational time and cost. During the simulation, for the purpose of analysis, each time course is recorded at each time step. A straightforward implementation of this method on a GPU is about 16 times faster than a sequential simulation on a CPU with hybrid parallelization; each of the multiple simulations is run simultaneously, and the computational tasks within each simulation are parallelized. We also implemented an improvement to the memory access and reduced the memory footprint, in order to optimize the computations on the GPU. We also implemented an asynchronous data transfer scheme to accelerate the time course recording function. To analyze the acceleration of our implementation on various sizes of model, we performed SSA simulations on different model sizes and compared these computation times to those for sequential simulations with a CPU. When used with the improved time course recording function, our method was shown to accelerate the SSA simulation by a factor of up to 130.
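To make the cost of "multiple runs" concrete, here is a minimal sketch of the direct-method stochastic simulation algorithm for a hypothetical birth-death model, with the time course recorded at every reaction event. The model, rate constants, and function names are assumptions chosen for illustration and are not taken from the paper; the realizations run one after another here, whereas the abstract describes running them simultaneously on a GPU.

```python
import random

# Sketch of the direct-method SSA for an assumed birth-death model:
#   0 -> X with propensity k_birth, X -> 0 with propensity k_death * x.
# Illustrative only; not the authors' GPU implementation.

def ssa_birth_death(k_birth, k_death, x0, t_end, rng):
    """Simulate one realization and return its recorded time course."""
    t, x = 0.0, x0
    times, counts = [t], [x]
    while True:
        a_birth = k_birth
        a_death = k_death * x
        a_total = a_birth + a_death
        if a_total == 0.0:
            break                            # no reaction can fire any more
        t += rng.expovariate(a_total)        # waiting time to the next reaction
        if t > t_end:
            break
        if rng.random() < a_birth / a_total:
            x += 1                           # birth fired
        else:
            x -= 1                           # death fired
        times.append(t)
        counts.append(x)
    return times, counts

def run_realizations(n_runs, seed, **model):
    """Run independent realizations sequentially with one seeded generator."""
    rng = random.Random(seed)
    return [ssa_birth_death(rng=rng, **model) for _ in range(n_runs)]

runs = run_realizations(200, seed=1, k_birth=10.0, k_death=1.0, x0=0, t_end=10.0)
mean_final = sum(counts[-1] for _, counts in runs) / len(runs)
```

Because the realizations share no state apart from the random number stream, they are independent tasks, which is what allows many of them to be executed at the same time.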
Acceleration of discrete stochastic biochemical simulation using GPGPU

PubMed Central

Sumiyoshi, Kei; Hirata, Kazuki; Hiroi, Noriko; Funahashi, Akira

2015-01-01

For systems made up of a small number of molecules, such as a biochemical network in a single cell, a simulation requires a stochastic approach, instead of a deterministic approach. The stochastic simulation algorithm (SSA) simulates the stochastic behavior of a spatially homogeneous system. Since stochastic approaches produce different results each time they are used, multiple runs are required in order to obtain statistical results; this results in a large computational cost. We have implemented a parallel method for using SSA to simulate a stochastic model; the method uses a graphics processing unit (GPU), which enables multiple realizations at the same time, and thus reduces the computational time and cost. During the simulation, for the purpose of analysis, each time course is recorded at each time step. A straightforward implementation of this method on a GPU is about 16 times faster than a sequential simulation on a CPU with hybrid parallelization; each of the multiple simulations is run simultaneously, and the computational tasks within each simulation are parallelized. We also implemented an improvement to the memory access and reduced the memory footprint, in order to optimize the computations on the GPU. We also implemented an asynchronous data transfer scheme to accelerate the time course recording function. To analyze the acceleration of our implementation on various sizes of model, we performed SSA simulations on different model sizes and compared these computation times to those for sequential simulations with a CPU. When used with the improved time course recording function, our method was shown to accelerate the SSA simulation by a factor of up to 130. PMID:25762936

Assessment of Linear Finite-Difference Poisson-Boltzmann Solvers

PubMed Central

Wang, Jun; Luo, Ray

2009-01-01

CPU time and memory usage are two vital issues that any numerical solvers for the Poisson-Boltzmann equation have to face in biomolecular applications. In this study we systematically analyzed the CPU time and memory usage of five commonly used finite-difference solvers with a large and diversified set of biomolecular structures. Our comparative analysis shows that modified incomplete Cholesky conjugate gradient and geometric multigrid are the most efficient in the diversified test set. For the two efficient solvers, our test shows that their CPU times increase approximately linearly with the numbers of grids. Their CPU times also increase almost linearly with the negative logarithm of the convergence criterion at very similar rate. Our comparison further shows that geometric multigrid performs better in the large set of tested biomolecules. However, modified incomplete Cholesky conjugate gradient is superior to geometric multigrid in molecular dynamics simulations of tested molecules. We also investigated other significant components in numerical solutions of the Poisson-Boltzmann equation. It turns out that the time-limiting step is the free boundary condition setup for the linear systems for the selected proteins if the electrostatic focusing is not used. Thus, development of future numerical solvers for the Poisson-Boltzmann equation should balance all aspects of the numerical procedures in realistic biomolecular applications. PMID:20063271
Exact diagonalization of quantum lattice models on coprocessors

NASA Astrophysics Data System (ADS)

Siro, T.; Harju, A.

2016-10-01

We implement the Lanczos algorithm on an Intel Xeon Phi coprocessor and compare its performance to a multi-core Intel Xeon CPU and an NVIDIA graphics processor. The Xeon and the Xeon Phi are parallelized with OpenMP and the graphics processor is programmed with CUDA. The performance is evaluated by measuring the execution time of a single step in the Lanczos algorithm. We study two quantum lattice models with different particle numbers, and conclude that for small systems, the multi-core CPU is the fastest platform, while for large systems, the graphics processor is the clear winner, reaching speedups of up to 7.6 compared to the CPU. The Xeon Phi outperforms the CPU with sufficiently large particle number, reaching a speedup of 2.5.
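For readers unfamiliar with the quantity being timed, the following is a minimal dense-matrix sketch of the Lanczos iteration in pure Python. It is an illustration under stated assumptions (a small symmetric matrix stored as nested lists, hypothetical function names, no reorthogonalization), not the authors' OpenMP or CUDA code; in an exact-diagonalization setting the matrix-vector product inside each step is the sparse Hamiltonian application that dominates the run time.

```python
import math

# Sketch of the Lanczos iteration for a real symmetric matrix.
# Illustrative only; names and the dense storage are assumptions.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(matrix, v):
    return [dot(row, v) for row in matrix]

def lanczos(matrix, v0, steps):
    """Return the diagonal (alphas) and off-diagonal (betas) of the tridiagonal T."""
    norm = math.sqrt(dot(v0, v0))
    v = [x / norm for x in v0]
    v_prev = [0.0] * len(v)
    beta = 0.0
    alphas, betas = [], []
    for _ in range(steps):
        w = matvec(matrix, v)                # the expensive part of each step
        alpha = dot(w, v)
        w = [wi - alpha * vi - beta * pi for wi, vi, pi in zip(w, v, v_prev)]
        alphas.append(alpha)
        beta = math.sqrt(dot(w, w))
        if beta < 1e-12:
            break                            # Krylov subspace is exhausted
        betas.append(beta)
        v_prev, v = v, [wi / beta for wi in w]
    return alphas, betas[:len(alphas) - 1]

alphas, betas = lanczos([[2.0, 1.0], [1.0, 3.0]], [1.0, 0.0], steps=2)
```

The eigenvalues of the small tridiagonal matrix built from `alphas` and `betas` approximate the extreme eigenvalues of the original matrix, so only the matrix-vector product needs to scale to large systems.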
Instrumentation complex for Langley Research Center's National Transonic Facility

NASA Technical Reports Server (NTRS)

Russell, C. H.; Bryant, C. S.

1977-01-01

The instrumentation discussed in the present paper was developed to ensure reliable operation for a 2.5-meter cryogenic high-Reynolds-number fan-driven transonic wind tunnel. It will incorporate four CPU's and associated analog and digital input/output equipment, necessary for acquiring research data, controlling the tunnel parameters, and monitoring the process conditions. Connected in a multipoint distributed network, the CPU's will support data base management and processing; research measurement data acquisition and display; process monitoring; and communication control. The design will allow essential processes to continue, in the case of major hardware failures, by switching input/output equipment to alternate CPU's and by eliminating nonessential functions. It will also permit software modularization by CPU activity and thereby reduce complexity and development time.
Accelerating Smith-Waterman Alignment for Protein Database Search Using Frequency Distance Filtration Scheme Based on CPU-GPU Collaborative System.

PubMed

Liu, Yu; Hong, Yang; Lin, Chun-Yuan; Hung, Che-Lun

2015-01-01

The Smith-Waterman (SW) algorithm has been widely utilized for searching biological sequence databases in bioinformatics. Recently, several works have adopted graphics cards with graphics processing units (GPUs) and their associated CUDA model to enhance the performance of SW computations. However, these works mainly focused on the protein database search by using the intertask parallelization technique, and only used the GPU capability to do the SW computations one by one. Hence, in this paper, we propose an efficient SW alignment method, called CUDA-SWfr, for the protein database search by using the intratask parallelization technique based on a CPU-GPU collaborative system. Before doing the SW computations on the GPU, a procedure is applied on the CPU by using the frequency distance filtration scheme (FDFS) to eliminate the unnecessary alignments. The experimental results indicate that CUDA-SWfr runs 9.6 times and 96 times faster than the CPU-based SW method without and with FDFS, respectively.
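The computation being accelerated is the Smith-Waterman dynamic-programming recurrence. The sketch below computes only the best local alignment score, with a linear gap penalty and simple match/mismatch scores; these scoring values and the function name are assumptions for illustration (protein searches normally use a substitution matrix and affine gaps), and the frequency distance filtration step is not reproduced here.

```python
# Sketch of the Smith-Waterman local alignment score with a linear gap penalty.
# Scoring parameters are illustrative assumptions, not those used by CUDA-SWfr.

def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero, which is what makes the alignment local.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best

score = smith_waterman_score("TTACGTTT", "GGACGGG")
```

Each cell depends on its left, upper, and upper-left neighbours, so cells on the same anti-diagonal are independent; parallelizing within one alignment (intratask) exploits that, while intertask parallelization instead assigns whole alignments to separate threads.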
Evaluation of the Monotonic Lagrangian Grid and Lat-Long Grid for Air Traffic Management

NASA Technical Reports Server (NTRS)

Kaplan, Carolyn; Dahm, Johann; Oran, Elaine; Alexandrov, Natalia; Boris, Jay

2011-01-01

The Air Traffic Monotonic Lagrangian Grid (ATMLG) is used to simulate a 24 hour period of air traffic flow in the National Airspace System (NAS). During this time period, there are 41,594 flights over the United States, and the flight plan information (departure and arrival airports and times, and waypoints along the way) is obtained from a Federal Aviation Administration (FAA) Enhanced Traffic Management System (ETMS) dataset. Two simulation procedures are tested and compared: one based on the Monotonic Lagrangian Grid (MLG), and the other based on the stationary Latitude-Longitude (Lat-Long) grid. Simulating one full day of air traffic over the United States required the following amounts of CPU time on a single processor of an SGI Altix: 88 s for the MLG method, and 163 s for the Lat-Long grid method. We present a discussion of the amount of CPU time required for each of the simulation processes (updating aircraft trajectories, sorting, conflict detection and resolution, etc.), and show that the main advantage of the MLG method is that it is a general sorting algorithm that can sort on multiple properties. We discuss how many MLG neighbors must be considered in the separation assurance procedure in order to ensure a five-mile separation buffer between aircraft, and we investigate the effect of removing waypoints from aircraft trajectories. When aircraft choose their own trajectory, there are more flights with shorter duration times and fewer CD&R maneuvers, resulting in significant fuel savings.
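The stationary-grid idea can be sketched as a cell-binning neighbour search: aircraft are sorted into fixed cells and only aircraft in the same or adjacent cells are compared. The sketch below is an assumption-laden illustration, not the ATMLG code: it uses flat x/y coordinates in miles instead of latitude and longitude, ignores altitude, and the function name and sample positions are invented.

```python
import math
from collections import defaultdict
from itertools import product

# Sketch of conflict detection on a stationary grid. Planar coordinates in
# miles and all names are illustrative assumptions.

def find_conflicts(positions, separation=5.0):
    """Return the set of id pairs (a, b), a < b, closer than `separation`."""
    cell_size = separation                   # neighbours then lie in adjacent cells
    grid = defaultdict(list)
    for ident, (x, y) in positions.items():
        cell = (math.floor(x / cell_size), math.floor(y / cell_size))
        grid[cell].append(ident)

    conflicts = set()
    for (cx, cy), members in grid.items():
        for dx, dy in product((-1, 0, 1), repeat=2):
            for other in grid.get((cx + dx, cy + dy), ()):
                for ident in members:
                    if ident < other:        # count each pair once
                        ax, ay = positions[ident]
                        bx, by = positions[other]
                        if math.hypot(ax - bx, ay - by) < separation:
                            conflicts.add((ident, other))
    return conflicts

aircraft = {"A": (0.0, 0.0), "B": (3.0, 3.0), "C": (100.0, 100.0), "D": (-4.0, 0.0)}
conflicts = find_conflicts(aircraft)
```

With the cell size equal to the separation distance, any pair closer than the buffer must fall in the same or neighbouring cells, so the number of distance checks grows with local traffic density rather than with the square of the total number of flights.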
GPU-Q-J, a fast method for calculating root mean square deviation (RMSD) after optimal superposition

PubMed Central

2011-01-01

Background: Calculation of the root mean square deviation (RMSD) between the atomic coordinates of two optimally superposed structures is a basic component of structural comparison techniques. We describe a quaternion based method, GPU-Q-J, that is stable with single precision calculations and suitable for graphics processor units (GPUs). The application was implemented on an ATI 4770 graphics card in C/C++ and Brook+ in Linux where it was 260 to 760 times faster than existing unoptimized CPU methods. Source code is available from the Compbio website http://software.compbio.washington.edu/misc/downloads/st_gpu_fit/ or from the author LHH. Findings: The Nutritious Rice for the World Project (NRW) on World Community Grid predicted, de novo, the structures of over 62,000 small proteins and protein domains, returning a total of 10 billion candidate structures. Clustering ensembles of structures on this scale requires calculation of large similarity matrices consisting of RMSDs between each pair of structures in the set. As a real-world test, we calculated the matrices for 6 different ensembles from NRW. The GPU method was 260 times faster than the fastest existing CPU based method and over 500 times faster than the method that had been previously used. Conclusions: GPU-Q-J is a significant advance over previous CPU methods. It relieves a major bottleneck in the clustering of large numbers of structures for NRW. It also has applications in structure comparison methods that involve multiple superposition and RMSD determination steps, particularly when such methods are applied on a proteome and genome wide scale. PMID:21453553
Fast and high-order numerical algorithms for the solution of multidimensional nonlinear fractional Ginzburg-Landau equation

NASA Astrophysics Data System (ADS)

Mohebbi, Akbar

2018-02-01

In this paper we propose two fast and accurate numerical methods for the solution of multidimensional space fractional Ginzburg-Landau equation (FGLE). In the presented methods, to avoid solving a nonlinear system of algebraic equations and to increase the accuracy and efficiency of method, we split the complex problem into simpler sub-problems using the split-step idea. For a homogeneous FGLE, we propose a method which has fourth-order of accuracy in time component and spectral accuracy in space variable and for nonhomogeneous one, we introduce another scheme based on the Crank-Nicolson approach which has second-order of accuracy in time variable. Due to using the Fourier spectral method for fractional Laplacian operator, the resulting schemes are fully diagonal and easy to code. Numerical results are reported in terms of accuracy, computational order and CPU time to demonstrate the accuracy and efficiency of the proposed methods and to compare the results with the analytical solutions. The results show that the present methods are accurate and require low CPU time. It is illustrated that the numerical results are in good agreement with the theoretical ones.
GPU accelerated generation of digitally reconstructed radiographs for 2-D/3-D image registration.

PubMed

Dorgham, Osama M; Laycock, Stephen D; Fisher, Mark H

2012-09-01

Recent advances in programming languages for graphics processing units (GPUs) provide developers with a convenient way of implementing applications which can be executed on the CPU and GPU interchangeably. GPUs are becoming relatively cheap, powerful, and widely available hardware components, which can be used to perform intensive calculations. The last decade of hardware performance developments shows that GPU-based computation is progressing significantly faster than CPU-based computation, particularly if one considers the execution of highly parallelisable algorithms. Future predictions illustrate that this trend is likely to continue. In this paper, we introduce a way of accelerating 2-D/3-D image registration by developing a hybrid system which executes on the CPU and utilizes the GPU for parallelizing the generation of digitally reconstructed radiographs (DRRs). Based on the advancements of the GPU over the CPU, it is timely to exploit the benefits of many-core GPU technology by developing algorithms for DRR generation. Although some previous work has investigated the rendering of DRRs using the GPU, this paper investigates approximations which reduce the computational overhead while still maintaining a quality consistent with that needed for 2-D/3-D registration with sufficient accuracy to be clinically acceptable in certain applications of radiation oncology. Furthermore, by comparing implementations of 2-D/3-D registration on the CPU and GPU, we investigate current performance and propose an optimal framework for PC implementations addressing the rigid registration problem. Using this framework, we are able to render DRR images from a 256×256×133 CT volume in ~24 ms using an NVidia GeForce 8800 GTX and in ~2 ms using NVidia GeForce GTX 580. In addition to applications requiring fast automatic patient setup, these levels of performance suggest image-guided radiation therapy at video frame rates is technically feasible using relatively low cost PC architecture.
High Performance Computing Assets for Ocean Acoustics Research

DTIC Science & Technology

2016-11-18

independently on processing units with access to a typically available amount of memory, say 16 or 32 gigabytes. Our models require each processor to...allow results to be obtained with limited amounts of memory available to individual processing units (with no time frame for successful completion...put into use. One file server computer to store simulation output has also been purchased. The first workstation has 28 CPU cores, dual- thread , (56
Integrals for IBS and beam cooling

DOE Office of Scientific and Technical Information (OSTI.GOV)

Burov, A.; /Fermilab

Simulation of beam cooling usually requires performing certain integral transformations every time step or so, which is a significant burden on the CPU. Examples are the dispersion integrals (Hilbert transforms) in the stochastic cooling, wake fields and IBS integrals. An original method is suggested for fast and sufficiently accurate computation of the integrals. This method is applied for the dispersion integral. Some methodical aspects of the IBS analysis are discussed.
Integrals for IBS and Beam Cooling

DOE Office of Scientific and Technical Information (OSTI.GOV)

Burov, A.

Simulation of beam cooling usually requires performing certain integral transformations every time step or so, which is a significant burden on the CPU. Examples are the dispersion integrals (Hilbert transforms) in the stochastic cooling, wake fields and IBS integrals. An original method is suggested for fast and sufficiently accurate computation of the integrals. This method is applied for the dispersion integral. Some methodical aspects of the IBS analysis are discussed.
Accelerating statistical image reconstruction algorithms for fan-beam x-ray CT using cloud computing

NASA Astrophysics Data System (ADS)

Srivastava, Somesh; Rao, A. Ravishankar; Sheinin, Vadim

2011-03-01

Statistical image reconstruction algorithms potentially offer many advantages to x-ray computed tomography (CT), e.g. lower radiation dose. But, their adoption in practical CT scanners requires extra computation power, which is traditionally provided by incorporating additional computing hardware (e.g. CPU-clusters, GPUs, FPGAs etc.) into a scanner. An alternative solution is to access the required computation power over the internet from a cloud computing service, which is orders-of-magnitude more cost-effective. This is because users only pay a small pay-as-you-go fee for the computation resources used (i.e. CPU time, storage etc.), and completely avoid purchase, maintenance and upgrade costs. In this paper, we investigate the benefits and shortcomings of using cloud computing for statistical image reconstruction. We parallelized the most time-consuming parts of our application, the forward and back projectors, using MapReduce, the standard parallelization library on clouds. From preliminary investigations, we found that a large speedup is possible at a very low cost. But, communication overheads inside MapReduce can limit the maximum speedup, and a better MapReduce implementation might become necessary in the future. All the experiments for this paper, including development and testing, were completed on the Amazon Elastic Compute Cloud (EC2) for less than $20.
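The forward projector is a sparse matrix-vector product, which maps naturally onto a map step (one contribution per nonzero system-matrix entry) and a reduce step (summing contributions per ray). The sketch below imitates that structure with plain Python generators and a dictionary; it is a hedged illustration with invented names and data, and it does not use or reproduce the API of any real MapReduce framework or the authors' implementation.

```python
from collections import defaultdict

# Sketch of a forward projection y = A x written in map/reduce style.
# Names and the tiny system matrix are illustrative assumptions.

def map_phase(entries, image):
    """Emit one (ray, contribution) pair per nonzero system-matrix entry."""
    for ray, pixel, weight in entries:
        yield ray, weight * image[pixel]

def reduce_phase(pairs):
    """Sum all contributions that share the same ray index."""
    sums = defaultdict(float)
    for ray, value in pairs:
        sums[ray] += value
    return dict(sums)

def forward_project(entries, image):
    return reduce_phase(map_phase(entries, image))

# (ray, pixel, weight) triples for a toy 2-ray, 3-pixel system matrix.
entries = [(0, 0, 1.0), (0, 1, 1.0), (1, 1, 0.5), (1, 2, 0.5)]
sinogram = forward_project(entries, [2.0, 4.0, 6.0])
```

In a distributed setting the emitted pairs have to be shuffled between machines before the reduce step, which is the kind of communication overhead the abstract identifies as a limit on speedup.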
Analysis of Multivariate Experimental Data Using A Simplified Regression Model Search Algorithm

NASA Technical Reports Server (NTRS)

Ulbrich, Norbert M.

2013-01-01

A new regression model search algorithm was developed that may be applied to both general multivariate experimental data sets and wind tunnel strain-gage balance calibration data. The algorithm is a simplified version of a more complex algorithm that was originally developed for the NASA Ames Balance Calibration Laboratory. The new algorithm performs regression model term reduction to prevent overfitting of data. It has the advantage that it needs only about one tenth of the original algorithm's CPU time for the completion of a regression model search. In addition, extensive testing showed that the prediction accuracy of math models obtained from the simplified algorithm is similar to the prediction accuracy of math models obtained from the original algorithm. The simplified algorithm, however, cannot guarantee that search constraints related to a set of statistical quality requirements are always satisfied in the optimized regression model. Therefore, the simplified algorithm is not intended to replace the original algorithm. Instead, it may be used to generate an alternate optimized regression model of experimental data whenever the application of the original search algorithm fails or requires too much CPU time. Data from a machine calibration of NASA's MK40 force balance is used to illustrate the application of the new search algorithm.
Analysis of Multivariate Experimental Data Using A Simplified Regression Model Search Algorithm

NASA Technical Reports Server (NTRS)

Ulbrich, Norbert Manfred

2013-01-01

A new regression model search algorithm was developed in 2011 that may be used to analyze both general multivariate experimental data sets and wind tunnel strain-gage balance calibration data. The new algorithm is a simplified version of a more complex search algorithm that was originally developed at the NASA Ames Balance Calibration Laboratory. The new algorithm has the advantage that it needs only about one tenth of the original algorithm's CPU time for the completion of a search. In addition, extensive testing showed that the prediction accuracy of math models obtained from the simplified algorithm is similar to the prediction accuracy of math models obtained from the original algorithm. The simplified algorithm, however, cannot guarantee that search constraints related to a set of statistical quality requirements are always satisfied in the optimized regression models. Therefore, the simplified search algorithm is not intended to replace the original search algorithm. Instead, it may be used to generate an alternate optimized regression model of experimental data whenever the application of the original search algorithm either fails or requires too much CPU time. Data from a machine calibration of NASA's MK40 force balance is used to illustrate the application of the new regression model search algorithm.
Three-Dimensional Nacelle Aeroacoustics Code With Application to Impedance Eduction

NASA Technical Reports Server (NTRS)

Watson, Willie R.

2000-01-01

A three-dimensional nacelle acoustics code that accounts for uniform mean flow and variable surface impedance liners is developed. The code is linked to a commercial version of the NASA-developed General Purpose Solver (for solution of linear systems of equations) in order to obtain the capability to study high frequency waves that may require millions of grid points for resolution. Detailed, single-processor statistics for the performance of the solver in rigid and soft-wall ducts are presented. Over the range of frequencies of current interest in nacelle liner research, noise attenuation levels predicted from the code were in excellent agreement with those predicted from mode theory. The equation solver is memory efficient, requiring only a small fraction of the memory available on modern computers. As an application, the code is combined with an optimization algorithm and used to educe the impedance spectrum of a ceramic liner. The primary problem with using the code to perform optimization studies at frequencies above 1 kHz is the excessive CPU time (a major portion of which is matrix assembly). It is recommended that research be directed toward development of a rapid sparse assembler and exploitation of the multiprocessor capability of the solver to further reduce CPU time.
Wang-Landau sampling: Saving CPU time

NASA Astrophysics Data System (ADS)

Ferreira, L. S.; Jorge, L. N.; Leão, S. A.; Caparica, A. A.

2018-04-01

In this work we propose an improvement to the Wang-Landau (WL) method that allows an economy in CPU time of about 60% leading to the same results with the same accuracy. We used the 2D Ising model to show that one can initiate all WL simulations using the outputs of an advanced WL level from a previous simulation. We showed that up to the seventh WL level (f6) the simulations are not biased yet and can proceed to any value that the simulation from the very beginning would reach. As a result the initial WL levels can be simulated just once. It was also observed that the saving in CPU time is larger for larger lattice sizes, exactly where the computational cost is considerable. We carried out high-resolution simulations beginning initially from the first WL level (f0) and another beginning from the eighth WL level (f7) using all the data at the end of the previous level and showed that the results for the critical temperature Tc and the critical static exponents β and γ coincide within the error bars. Finally we applied the same procedure to the 1/2-spin Baxter-Wu model and the economy in CPU time was of about 64%.
YAdumper: extracting and translating large information volumes from relational databases to structured flat files.

PubMed

Fernández, José M; Valencia, Alfonso

2004-10-12

Downloading the information stored in relational databases into XML and other flat formats is a common task in bioinformatics. This periodical dumping of information requires considerable CPU time, disk and memory resources. YAdumper has been developed as a purpose-specific tool to deal with the integral structured information download of relational databases. YAdumper is a Java application that organizes database extraction following an XML template based on an external Document Type Declaration. Compared with other non-native alternatives, YAdumper substantially reduces memory requirements and considerably improves writing performance.
Fast CPU-based Monte Carlo simulation for radiotherapy dose calculation.

PubMed

Ziegenhein, Peter; Pirner, Sven; Ph Kamerling, Cornelis; Oelfke, Uwe

2015-08-07

Monte-Carlo (MC) simulations are considered to be the most accurate method for calculating dose distributions in radiotherapy. Their clinical application, however, is still limited by the long runtimes that conventional implementations of MC algorithms require to deliver sufficiently accurate results on high resolution imaging data. In order to overcome this obstacle we developed the software-package PhiMC, which is capable of computing precise dose distributions in a sub-minute time-frame by leveraging the potential of modern many- and multi-core CPU-based computers. PhiMC is based on the well verified dose planning method (DPM). We could demonstrate that PhiMC delivers dose distributions which are in excellent agreement with DPM. The multi-core implementation of PhiMC scales well between different computer architectures and achieves a speed-up of up to 37× compared to the original DPM code executed on a modern system. Furthermore, we could show that our CPU-based implementation on a modern workstation is between 1.25× and 1.95× faster than a well-known GPU implementation of the same simulation method on an NVIDIA Tesla C2050. Since CPUs can address several hundred GB of RAM, the typical GPU memory limitation does not apply to our implementation and high resolution clinical plans can be calculated.
Benchmark measurements and calculations of a 3-dimensional neutron streaming experiment

NASA Astrophysics Data System (ADS)

Barnett, D. A., Jr.

1991-02-01

An experimental assembly known as the Dog-Legged Void assembly was constructed to measure the effect of neutron streaming in iron and void regions. The primary purpose of the measurements was to provide benchmark data against which various neutron transport calculation tools could be compared. The measurements included neutron flux spectra at four places and integral measurements at two places in the iron streaming path as well as integral measurements along several axial traverses. These data have been used in the verification of Oak Ridge National Laboratory's three-dimensional discrete ordinates code, TORT. For a base case calculation using one-half inch mesh spacing, finite difference spatial differencing, an S_16 quadrature and P_1 cross sections in the MUFT multigroup structure, the calculated solution agreed to within 18 percent with the spectral measurements and to within 24 percent with the integral measurements. Variations on the base case using a few-group energy structure and P_1 and P_3 cross sections showed similar agreement. Calculations using a linear nodal spatial differencing scheme and few-group cross sections also showed similar agreement. For the same mesh size, the nodal method was seen to require 2.2 times as much CPU time as the finite difference method. A nodal calculation using a typical mesh spacing of 2 inches, which had approximately 32 times fewer mesh cells than the base case, agreed with the measurements to within 34 percent and yet required only 8 percent of the CPU time.
Multi-GPU Accelerated Admittance Method for High-Resolution Human Exposure Evaluation.

PubMed

Xiong, Zubiao; Feng, Shi; Kautz, Richard; Chandra, Sandeep; Altunyurt, Nevin; Chen, Ji

2015-12-01

A multi-graphics processing unit (GPU) accelerated admittance method solver is presented for solving the induced electric field in high-resolution anatomical models of human body when exposed to external low-frequency magnetic fields. In the solver, the anatomical model is discretized as a three-dimensional network of admittances. The conjugate orthogonal conjugate gradient (COCG) iterative algorithm is employed to take advantage of the symmetric property of the complex-valued linear system of equations. Compared against the widely used biconjugate gradient stabilized method, the COCG algorithm can reduce the solving time by 3.5 times and reduce the storage requirement by about 40%. The iterative algorithm is then accelerated further by using multiple NVIDIA GPUs. The computations and data transfers between GPUs are overlapped in time by using asynchronous concurrent execution design. The communication overhead is well hidden so that the acceleration is nearly linear with the number of GPU cards. Numerical examples show that our GPU implementation running on four NVIDIA Tesla K20c cards can reach 90 times faster than the CPU implementation running on eight CPU cores (two Intel Xeon E5-2603 processors). The implemented solver is able to solve large dimensional problems efficiently. A whole adult body discretized in 1-mm resolution can be solved in just several minutes. The high efficiency achieved makes it practical to investigate human exposure involving a large number of cases with a high resolution that meets the requirements of international dosimetry guidelines.
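For readers unfamiliar with COCG, the iteration can be sketched as follows. This is a plain-Python illustration, not the solver described in the abstract: the function name `cocg`, the 3 x 3 test matrix and the stopping rule are assumptions, and breakdown (a vanishing unconjugated product) is not handled. The point it shows is that for a complex symmetric matrix (A equal to its transpose, not its conjugate transpose) the conjugate-gradient recurrences can be kept if every inner product is taken without complex conjugation, so each iteration needs one matrix-vector product and only a few work vectors.

```python
def cocg(matvec, b, tol=1e-10, max_iter=200):
    """Solve A x = b for a complex symmetric A (A == A^T, not Hermitian)."""
    n = len(b)
    x = [0j] * n
    r = list(b)                       # residual for the initial guess x = 0
    p = list(r)                       # search direction
    rho = sum(ri * ri for ri in r)    # unconjugated product r^T r
    b_norm = sum(abs(bi) ** 2 for bi in b) ** 0.5
    for _ in range(max_iter):
        # Stop once the ordinary (conjugated) residual norm is small enough.
        if sum(abs(ri) ** 2 for ri in r) ** 0.5 <= tol * b_norm:
            break
        q = matvec(p)                 # the only matrix-vector product per iteration
        alpha = rho / sum(pi * qi for pi, qi in zip(p, q))   # p^T A p, no conjugate
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        rho_new = sum(ri * ri for ri in r)
        beta = rho_new / rho
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rho = rho_new
    return x


# Small complex symmetric example (values chosen only for illustration).
A = [[4 + 1j, 1, 0],
     [1, 3 + 1j, 1],
     [0, 1, 2 + 1j]]
b = [1, 2, 3]


def apply_A(v):
    return [sum(a * vi for a, vi in zip(row, v)) for row in A]


x = cocg(apply_A, b)
residual = max(abs(bi - yi) for bi, yi in zip(b, apply_A(x)))
print("max residual:", residual)
```

Passing the matrix as a `matvec` callable keeps the sketch independent of how the admittance network is stored.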

Multi-GPU Jacobian accelerated computing for soft-field tomography.

PubMed

Borsic, A; Attardo, E A; Halter, R J

2012-10-01

Image reconstruction in soft-field tomography is based on an inverse problem formulation, where a forward model is fitted to the data. In medical applications, where the anatomy presents complex shapes, it is common to use finite element models (FEMs) to represent the volume of interest and solve a partial differential equation that models the physics of the system. Over the last decade, there has been a shifting interest from 2D modeling to 3D modeling, as the underlying physics of most problems are 3D. Although the increased computational power of modern computers allows working with much larger FEM models, the computational time required to reconstruct 3D images on a fine 3D FEM model can be significant, on the order of hours. For example, in electrical impedance tomography (EIT) applications using a dense 3D FEM mesh with half a million elements, a single reconstruction iteration takes approximately 15-20 min with optimized routines running on a modern multi-core PC. It is desirable to accelerate image reconstruction to enable researchers to more easily and rapidly explore data and reconstruction parameters. Furthermore, providing high-speed reconstructions is essential for some promising clinical application of EIT. For 3D problems, 70% of the computing time is spent building the Jacobian matrix, and 25% of the time in forward solving. In this work, we focus on accelerating the Jacobian computation by using single and multiple GPUs. First, we discuss an optimized implementation on a modern multi-core PC architecture and show how computing time is bounded by the CPU-to-memory bandwidth; this factor limits the rate at which data can be fetched by the CPU. Gains associated with the use of multiple CPU cores are minimal, since data operands cannot be fetched fast enough to saturate the processing power of even a single CPU core. GPUs have much faster memory bandwidths compared to CPUs and better parallelism. 
We are able to obtain acceleration factors of 20 times on a single NVIDIA S1070 GPU, and of 50 times on four GPUs, bringing the Jacobian computing time for a fine 3D mesh from 12 min to 14 s. We regard this as an important step toward gaining interactive reconstruction times in 3D imaging, particularly when coupled in the future with acceleration of the forward problem. While we demonstrate results for EIT, these results apply to any soft-field imaging modality where the Jacobian matrix is computed with the adjoint method.
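The figures quoted in this abstract can be put in perspective with Amdahl's law. The short calculation below is a back-of-the-envelope sketch, assuming the quoted 70% and 25% time shares apply to the same runs as the quoted speedups. The helper name `overall_speedup` is introduced here, and the last line is a hypothetical scenario rather than a reported result.

```python
def overall_speedup(fraction, factor):
    """Amdahl's law: speedup of a whole job when `fraction` of it runs `factor` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


jacobian_fraction = 0.70             # share of computing time quoted in the abstract
measured = (12 * 60) / 14            # Jacobian time: 12 min down to 14 s
print(round(measured, 1))            # 51.4

# Accelerating only the Jacobian by 50x leaves the other 30% untouched.
print(round(overall_speedup(jacobian_fraction, 50), 2))          # 3.18

# Hypothetical: if the forward solve (25%) were accelerated by the same factor.
print(round(overall_speedup(jacobian_fraction + 0.25, 50), 2))   # 14.49
```

Under these assumptions a whole reconstruction iteration gains about a factor of three from the Jacobian acceleration alone, which is consistent with the abstract's remark about coupling it with future acceleration of the forward problem.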
Multi-GPU Jacobian Accelerated Computing for Soft Field Tomography

PubMed Central

Borsic, A.; Attardo, E. A.; Halter, R. J.

2012-01-01

Image reconstruction in soft-field tomography is based on an inverse problem formulation, where a forward model is fitted to the data. In medical applications, where the anatomy presents complex shapes, it is common to use Finite Element Models to represent the volume of interest and to solve a partial differential equation that models the physics of the system. Over the last decade, there has been a shifting interest from 2D modeling to 3D modeling, as the underlying physics of most problems are three-dimensional. Though the increased computational power of modern computers allows working with much larger FEM models, the computational time required to reconstruct 3D images on a fine 3D FEM model can be significant, on the order of hours. For example, in Electrical Impedance Tomography applications using a dense 3D FEM mesh with half a million elements, a single reconstruction iteration takes approximately 15 to 20 minutes with optimized routines running on a modern multi-core PC. It is desirable to accelerate image reconstruction to enable researchers to more easily and rapidly explore data and reconstruction parameters. Further, providing high-speed reconstructions is essential for some promising clinical applications of EIT. For 3D problems 70% of the computing time is spent building the Jacobian matrix, and 25% of the time in forward solving. In the present work, we focus on accelerating the Jacobian computation by using single and multiple GPUs. First, we discuss an optimized implementation on a modern multi-core PC architecture and show how computing time is bounded by the CPU-to-memory bandwidth; this factor limits the rate at which data can be fetched by the CPU. Gains associated with use of multiple CPU cores are minimal, since data operands cannot be fetched fast enough to saturate the processing power of even a single CPU core. GPUs have much faster memory bandwidths compared to CPUs and better parallelism. 
We are able to obtain acceleration factors of 20 times on a single NVIDIA S1070 GPU, and of 50 times on 4 GPUs, bringing the Jacobian computing time for a fine 3D mesh from 12 minutes to 14 seconds. We regard this as an important step towards gaining interactive reconstruction times in 3D imaging, particularly when coupled in the future with acceleration of the forward problem. While we demonstrate results for Electrical Impedance Tomography, these results apply to any soft-field imaging modality where the Jacobian matrix is computed with the Adjoint Method. PMID:23010857
On localization attacks against cloud infrastructure

NASA Astrophysics Data System (ADS)

Ge, Linqiang; Yu, Wei; Sistani, Mohammad Ali

2013-05-01

One of the key characteristics of cloud computing is the device and location independence that enables users to access systems regardless of their location. Because cloud computing is heavily based on sharing resources, it is vulnerable to cyber attacks. In this paper, we investigate a localization attack that enables the adversary to leverage central processing unit (CPU) resources to localize the physical location of the server used by victims. By increasing and reducing CPU usage through the malicious virtual machine (VM), the response time from the victim VM will increase and decrease correspondingly. In this way, by embedding the probing signal into the CPU usage and correlating the same pattern in the response time from the victim VM, the adversary can find the location of the victim VM. To determine attack accuracy, we investigate features in both the time and frequency domains. We conduct both theoretical and experimental studies to demonstrate the effectiveness of such an attack.
CUDA Fortran acceleration for the finite-difference time-domain method

NASA Astrophysics Data System (ADS)

Hadi, Mohammed F.; Esmaeili, Seyed A.

2013-05-01

A detailed description of programming the three-dimensional finite-difference time-domain (FDTD) method to run on graphical processing units (GPUs) using CUDA Fortran is presented. Two FDTD-to-CUDA thread-block mapping designs are investigated and their performances compared. Comparative assessment of trade-offs between GPU's shared memory and L1 cache is also discussed. This presentation is for the benefit of FDTD programmers who work exclusively with Fortran and are reluctant to port their codes to C in order to utilize GPU computing. The derived CUDA Fortran code is compared with an optimized CPU version that runs on a workstation-class CPU to present a realistic GPU to CPU run time comparison and thus help in making better informed investment decisions on FDTD code redesigns and equipment upgrades. All analyses are mirrored with CUDA C simulations to put in perspective the present state of CUDA Fortran development.
A report documenting the completion of the Los Alamos National Laboratory portion of the ASC level II milestone "Visualization on the supercomputing platform"

DOE Office of Scientific and Technical Information (OSTI.GOV)

Ahrens, James P; Patchett, John M; Lo, Li - Ta

2011-01-24

This report provides documentation for the completion of the Los Alamos portion of the ASC Level II 'Visualization on the Supercomputing Platform' milestone. This ASC Level II milestone is a joint milestone between Sandia National Laboratory and Los Alamos National Laboratory. The milestone text is shown in Figure 1 with the Los Alamos portions highlighted in boldfaced text. Visualization and analysis of petascale data is limited by several factors which must be addressed as ACES delivers the Cielo platform. Two primary difficulties are: (1) Performance of interactive rendering, which is the most computationally intensive portion of the visualization process. For terascale platforms, commodity clusters with graphics processors (GPUs) have been used for interactive rendering. For petascale platforms, visualization and rendering may be able to run efficiently on the supercomputer platform itself. (2) I/O bandwidth, which limits how much information can be written to disk. If we simply analyze the sparse information that is saved to disk we miss the opportunity to analyze the rich information produced every timestep by the simulation. For the first issue, we are pursuing in-situ analysis, in which simulations are coupled directly with analysis libraries at runtime. This milestone will evaluate the visualization and rendering performance of current and next generation supercomputers in contrast to GPU-based visualization clusters, and evaluate the performance of common analysis libraries coupled with the simulation that analyze and write data to disk during a running simulation. This milestone will explore, evaluate and advance the maturity level of these technologies and their applicability to problems of interest to the ASC program. In conclusion, we improved CPU-based rendering performance by a factor of 2-10 times on our tests. In addition, we evaluated CPU and GPU-based rendering performance. We encourage production visualization experts to consider using CPU-based rendering solutions when it is appropriate. For example, on remote supercomputers CPU-based rendering can offer a means of viewing data without having to offload the data or geometry onto a GPU-based visualization system. In terms of comparative performance of the CPU and GPU we believe that further optimizations of the performance of both CPU and GPU-based rendering are possible. The simulation community is currently confronting this reality as they work to port their simulations to different hardware architectures. What is interesting about CPU rendering of massive datasets is that for the past two decades GPU performance has significantly outperformed CPU-based systems. Based on our advancements, evaluations and explorations we believe that CPU-based rendering has returned as one viable option for the visualization of massive datasets.
An Adaptive Priority Tuning System for Optimized Local CPU Scheduling using BOINC Clients

NASA Astrophysics Data System (ADS)

Mnaouer, Adel B.; Ragoonath, Colin

2010-11-01

Volunteer Computing (VC) is a Distributed Computing model which utilizes idle CPU cycles from computing resources donated by volunteers who are connected through the Internet to form a very large-scale, loosely coupled High Performance Computing environment. Distributed Volunteer Computing environments such as the BOINC framework are concerned mainly with the efficient scheduling of the available resources to the applications which require them. The BOINC framework thus contains a number of scheduling policies/algorithms, both on the server side and on the client, which work together to make the most of the available resources and to provide a degree of QoS in an environment which is highly volatile. This paper focuses on the BOINC client and introduces an adaptive priority tuning client-side middleware application which improves the execution times of Work Units (WUs) while maintaining an acceptable Maximum Response Time (MRT) for the end user. We have conducted extensive experimentation on the proposed system, and the results show a clear speedup of BOINC applications using our optimized middleware as opposed to running with the original BOINC client.
Fog computing job scheduling optimization based on bees swarm

NASA Astrophysics Data System (ADS)

Bitam, Salim; Zeadally, Sherali; Mellouk, Abdelhamid

2018-04-01

Fog computing is a new computing architecture, composed of a set of near-user edge devices called fog nodes, which collaborate together in order to perform computational services such as running applications, storing an important amount of data, and transmitting messages. Fog computing extends cloud computing by deploying digital resources at the premise of mobile users. In this new paradigm, management and operating functions, such as job scheduling aim at providing high-performance, cost-effective services requested by mobile users and executed by fog nodes. We propose a new bio-inspired optimization approach called Bees Life Algorithm (BLA) aimed at addressing the job scheduling problem in the fog computing environment. Our proposed approach is based on the optimized distribution of a set of tasks among all the fog computing nodes. The objective is to find an optimal tradeoff between CPU execution time and allocated memory required by fog computing services established by mobile users. Our empirical performance evaluation results demonstrate that the proposal outperforms the traditional particle swarm optimization and genetic algorithm in terms of CPU execution time and allocated memory.
Application of queueing models to multiprogrammed computer systems operating in a time-critical environment

NASA Technical Reports Server (NTRS)

Eckhardt, D. E., Jr.

1979-01-01

A model of a central processor (CPU) which services background applications in the presence of time-critical activity is presented. The CPU is viewed as an M/M/1 queueing system subject to periodic interrupts by a deterministic, time-critical process. The Laplace transform of the distribution of service times for the background applications is developed. The use of state-of-the-art queueing models for studying the background processing capability of time-critical computer systems is discussed, and the results of a model validation study which support this application of queueing models are presented.
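The setting described here can be explored numerically with a small simulation. The sketch below is an illustration under stated assumptions, not the analytical model of the report: it assumes first-come-first-served background jobs with Poisson arrivals and exponential CPU demands, a time-critical task that occupies the CPU for the first `busy` seconds of every `period`, and preempt-resume behaviour for the interrupted job. The names `finish_time` and `mean_response` and all parameter values are hypothetical.

```python
import math
import random


def finish_time(start, work, period, busy):
    """Time at which `work` seconds of CPU have been delivered, starting at `start`,
    when the CPU is unavailable during [k*period, k*period + busy) for every integer k."""
    k = math.floor(start / period)
    t = max(start, k * period + busy)      # wait out a window already in progress
    while True:
        available = (k + 1) * period - t   # CPU time left before the next interrupt
        if work <= available:
            return t + work
        work -= available
        k += 1
        t = k * period + busy              # resume after the next time-critical window


def mean_response(arrival_rate, service_rate, period, busy, jobs, seed):
    """Average response time (waiting plus service) of the background jobs."""
    rng = random.Random(seed)
    clock = 0.0          # arrival time of the current job
    server_free = 0.0    # completion time of the previous job
    total = 0.0
    for _ in range(jobs):
        clock += rng.expovariate(arrival_rate)
        work = rng.expovariate(service_rate)
        server_free = finish_time(max(clock, server_free), work, period, busy)
        total += server_free - clock
    return total / jobs


without_interrupts = mean_response(0.5, 1.0, 1.0, 0.0, 100000, seed=7)
with_interrupts = mean_response(0.5, 1.0, 1.0, 0.2, 100000, seed=7)
print("plain M/M/1 theory 1/(mu - lambda):", 1.0 / (1.0 - 0.5))
print("simulated, no interrupts:", round(without_interrupts, 2))
print("simulated, 20% of each period taken:", round(with_interrupts, 2))
```

With `busy` set to zero the simulation should land close to the textbook M/M/1 mean response time, which gives a quick sanity check before adding the time-critical load.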
Aligner optimization increases accuracy and decreases compute times in multi-species sequence data.

PubMed

Robinson, Kelly M; Hawkins, Aziah S; Santana-Cruz, Ivette; Adkins, Ricky S; Shetty, Amol C; Nagaraj, Sushma; Sadzewicz, Lisa; Tallon, Luke J; Rasko, David A; Fraser, Claire M; Mahurkar, Anup; Silva, Joana C; Dunning Hotopp, Julie C

2017-09-01

As sequencing technologies have evolved, the tools to analyze these sequences have made similar advances. However, for multi-species samples, we observed important and adverse differences in alignment specificity and computation time for bwa-mem (Burrows-Wheeler aligner-maximum exact matches) relative to bwa-aln. Therefore, we sought to optimize bwa-mem for alignment of data from multi-species samples in order to reduce alignment time and increase the specificity of alignments. In the multi-species cases examined, there was one majority member (i.e. Plasmodium falciparum or Brugia malayi) and one minority member (i.e. human or the Wolbachia endosymbiont wBm) of the sequence data. Increasing bwa-mem seed length from the default value reduced the number of read pairs from the majority sequence member that incorrectly aligned to the reference genome of the minority sequence member. Combining both source genomes into a single reference genome increased the specificity of mapping, while also reducing the central processing unit (CPU) time. In Plasmodium, at a seed length of 18 nt, 24.1% of reads mapped to the human genome using 1.7±0.1 CPU hours, while 83.6% of reads mapped to the Plasmodium genome using 0.2±0.0 CPU hours (total: 107.7% reads mapping; in 1.9±0.1 CPU hours). In contrast, 97.1% of the reads mapped to a combined Plasmodium-human reference in only 0.7±0.0 CPU hours. Overall, the results suggest that combining all references into a single reference database and using a 23 nt seed length reduces the computational time, while maximizing specificity. Similar results were found for simulated sequence reads from a mock metagenomic data set. We found similar improvements to computation time in a publicly available human-only data set.
GPU-Accelerated Voxelwise Hepatic Perfusion Quantification

PubMed Central

Wang, H; Cao, Y

2012-01-01

Voxelwise quantification of hepatic perfusion parameters from dynamic contrast enhanced (DCE) imaging greatly contributes to assessment of liver function in response to radiation therapy. However, the efficiency of the estimation of hepatic perfusion parameters voxel-by-voxel in the whole liver using a dual-input single-compartment model requires substantial improvement for routine clinical applications. In this paper, we utilize the parallel computation power of a graphics processing unit (GPU) to accelerate the computation, while maintaining the same accuracy as the conventional method. Using CUDA-GPU, the hepatic perfusion computations over multiple voxels are run across the GPU blocks concurrently but independently. At each voxel, non-linear least squares fitting of the time series of the liver DCE data to the compartmental model is distributed to multiple threads in a block, and the computations of different time points are performed simultaneously and synchronously. An efficient fast Fourier transform in a block is also developed for the convolution computation in the model. The GPU computations of the voxel-by-voxel hepatic perfusion images are compared with those by the CPU using the simulated DCE data and the experimental DCE MR images from patients. The computation speed is improved by 30 times using an NVIDIA Tesla C2050 GPU compared to a 2.67 GHz Intel Xeon CPU processor. To obtain liver perfusion maps with 626400 voxels in a patient's liver, it takes 0.9 min with the GPU-accelerated voxelwise computation, compared to 110 min with the CPU, while the two methods yield perfusion parameter differences of less than 10^-6. The method will be useful for generating liver perfusion images in clinical settings. PMID:22892645
Integrated DoD Voice and Data Networks and Ground Packet Radio Technology

DTIC Science & Technology

1976-08-01

as the traffic requirement level increases. Moreover, the satellite switch selection problem is only meaningful over a limited traffic range. When...5: CPU TIMES VS. NUMBER OF SWITCHES SATELLITE SWITCH SELECTION ALGORITHM Computer Used: PDP-10 * 0'5" means 0 minutes and 5 seconds. 5.30...Saturation Algorithm for Topological Design of Packet-Switched Communications Networks," National Telecommunications Conference Proceedings, San
GO, an exec for running the programs: CELL, COLLIDER, MAGIC, PATRICIA, PETROS, TRANSPORT, and TURTLE

DOE Office of Scientific and Technical Information (OSTI.GOV)

Shoaee, H.

1982-05-01

An exec has been written and placed on the PEP group's public disk to facilitate the use of several PEP related computer programs available on VM. The exec's program list currently includes: CELL, COLLIDER, MAGIC, PATRICIA, PETROS, TRANSPORT, and TURTLE. In addition, provisions have been made to allow addition of new programs to this list as they become available. The GO exec is directly callable from inside the Wylbur editor (in fact, currently this is the only way to use the GO exec). It provides the option of running any of the above programs in either interactive or batch mode. In the batch mode, the GO exec sends the data in the Wylbur active file along with the information required to run the job to the batch monitor (BMON, a virtual machine that schedules and controls execution of batch jobs). This enables the user to proceed with other VM activities at his/her terminal while the job executes, thus making it of particular interest to the users with jobs requiring much CPU time to execute and/or those wishing to run multiple jobs independently. In the interactive mode, useful for small jobs requiring less CPU time, the job is executed by the user's own Virtual Machine using the data in the active file as input. At the termination of an interactive job, the GO exec facilitates examination of the output by placing it in the Wylbur active file.
GO, an exec for running the programs: CELL, COLLIDER, MAGIC, PATRICIA, PETROS, TRANSPORT and TURTLE

DOE Office of Scientific and Technical Information (OSTI.GOV)

Shoaee, H.

1982-05-01

An exec has been written and placed on the PEP group's public disk (PUBRL 192) to facilitate the use of several PEP related computer programs available on VM. The exec's program list currently includes: CELL, COLLIDER, MAGIC, PATRICIA, PETROS, TRANSPORT, and TURTLE. In addition, provisions have been made to allow addition of new programs to this list as they become available. The GO exec is directly callable from inside the Wylbur editor (in fact, currently this is the only way to use the GO exec). It provides the option of running any of the above programs in either interactive or batch mode. In the batch mode, the GO exec sends the data in the Wylbur active file along with the information required to run the job to the batch monitor (BMON, a virtual machine that schedules and controls execution of batch jobs). This enables the user to proceed with other VM activities at his/her terminal while the job executes, thus making it of particular interest to the users with jobs requiring much CPU time to execute and/or those wishing to run multiple jobs independently. In the interactive mode, useful for small jobs requiring less CPU time, the job is executed by the user's own Virtual Machine using the data in the active file as input. At the termination of an interactive job, the GO exec facilitates examination of the output by placing it in the Wylbur active file.
Revisiting Molecular Dynamics on a CPU/GPU system: Water Kernel and SHAKE Parallelization.

PubMed

Ruymgaart, A Peter; Elber, Ron

2012-11-13

We report Graphics Processing Unit (GPU) and Open-MP parallel implementations of water-specific force calculations and of bond constraints for use in Molecular Dynamics simulations. We focus on a typical laboratory computing environment in which a CPU with a few cores is attached to a GPU. We discuss in detail the design of the code and we illustrate performance comparable to highly optimized codes such as GROMACS. Besides speed, our code shows excellent energy conservation. Utilization of water-specific lists allows the efficient calculation of non-bonded interactions that include water molecules and results in a speed-up factor of more than 40 on the GPU compared to code optimized on a single CPU core for systems larger than 20,000 atoms. This is up four-fold from a factor of 10 reported in our initial GPU implementation that did not include a water-specific code. Another optimization is the implementation of constrained dynamics entirely on the GPU. The routine, which enforces constraints of all bonds, runs in parallel on multiple Open-MP cores or entirely on the GPU. It is based on Conjugate Gradient solution of the Lagrange multipliers (CG SHAKE). The GPU implementation is partially in double precision and requires no communication with the CPU during the execution of the SHAKE algorithm. The (parallel) implementation of SHAKE allows an increase of the time step to 2.0 fs while maintaining excellent energy conservation. Interestingly, CG SHAKE is faster than the usual bond relaxation algorithm even on a single core if high accuracy is expected. The significant speedup of the optimized components transfers the computational bottleneck of the MD calculation to the reciprocal part of Particle Mesh Ewald (PME).
Radiation hardened microprocessor for small payloads

NASA Technical Reports Server (NTRS)

Shah, Ravi

1993-01-01

The RH-3000 program is developing a rad-hard, space-qualified 32-bit MIPS R-3000 RISC processor under Naval Research Lab sponsorship. In addition, under IR&D, Harris is developing the RHC-3000 for embedded control applications where low cost and radiation tolerance are primary concerns. The development program leverages heavily from commercial development of the MIPS R-3000. The commercial R-3000 has a large installed user base and several foundry partners are currently producing a wide variety of R-3000 derivative products. One of the MIPS derivative products, the LR33000 from LSI Logic, was used as the basis for the design of the RH-3000 chipset. The RH-3000 chipset consists of three core chips and two support chips. The core chips include the CPU, which is the R-3000 integer unit, and the FPA/MD chip pair, which performs the R-3010 floating point functions. The two support chips contain all the support functions required for fault tolerance support, real-time support, memory management, timers, and other functions. The Harris development effort achieved first-pass silicon success in June, 1992 with the first rad-hard 32-bit RH-3000 CPU chip. The CPU device is 30 kgates, has a 508 mil by 503 mil die size and is fabricated at Harris Semiconductor on the rad-hard CMOS Silicon on Sapphire (SOS) process. The CPU device successfully passed testing against 600,000 test vectors derived directly from the LSI/MIPS test suite and has been operational as a single board computer running C code for the past year. In addition, the RH-3000 program has developed the methodology for converting commercially developed designs utilizing logic synthesis techniques based on a combination of VHDL and schematic data bases.
Self-organized neural maps of human protein sequences.

PubMed Central

Ferrán, E. A.; Pflugfelder, B.; Ferrara, P.

1994-01-01

We have recently described a method based on artificial neural networks to cluster protein sequences into families. The network was trained with Kohonen's unsupervised learning algorithm using, as inputs, the matrix patterns derived from the dipeptide composition of the proteins. We present here a large-scale application of that method to classify the 1,758 human protein sequences stored in the SwissProt database (release 19.0), whose lengths are greater than 50 amino acids. In the final 2-dimensional topologically ordered map of 15 x 15 neurons, proteins belonging to known families were associated with the same neuron or with neighboring ones. Also, as an attempt to reduce the time-consuming learning procedure, we compared 2 learning protocols: one of 500 epochs (100 SUN CPU-hours [CPU-h]), and another one of 30 epochs (6.7 CPU-h). A further reduction of learning-computing time, by a factor of about 3.3, with similar protein clustering results, was achieved using a matrix of 11 x 11 components to represent the sequences. Although network training is time consuming, the classification of a new protein in the final ordered map is very fast (14.6 CPU-seconds). We also show a comparison between the artificial neural network approach and conventional methods of biosequence analysis. PMID:8019421
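The input representation and the fast classification step mentioned in this abstract can be illustrated with a short sketch. It is not the authors' implementation: the residue ordering, the normalisation to relative frequencies, the squared Euclidean distance and the names `dipeptide_composition` and `best_matching_neuron` are assumptions made for illustration, and the training of the Kohonen map itself is omitted.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"          # assumed ordering of the 20 residues
INDEX = {aa: k for k, aa in enumerate(AMINO_ACIDS)}


def dipeptide_composition(sequence):
    """Return a 20 x 20 matrix of relative dipeptide frequencies for a protein sequence."""
    counts = [[0] * 20 for _ in range(20)]
    total = 0
    for first, second in zip(sequence, sequence[1:]):
        if first in INDEX and second in INDEX:    # skip non-standard residue codes
            counts[INDEX[first]][INDEX[second]] += 1
            total += 1
    if total == 0:
        return [[0.0] * 20 for _ in range(20)]
    return [[c / total for c in row] for row in counts]


def best_matching_neuron(pattern, weights):
    """Index of the neuron whose weight matrix is closest to `pattern`."""
    def distance(w):
        return sum((p - q) ** 2
                   for prow, wrow in zip(pattern, w)
                   for p, q in zip(prow, wrow))
    return min(range(len(weights)), key=lambda n: distance(weights[n]))


pattern = dipeptide_composition("ACAC")       # dipeptides: AC, CA, AC
print(pattern[INDEX["A"]][INDEX["C"]])        # relative frequency of the dipeptide AC
```

Classifying a new sequence against a trained map only requires one pass over the neurons, which is consistent with the abstract's observation that classification is fast even though training is slow.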
The METAL System. Volume I and Volume II. Appendices.

DTIC Science & Technology

1981-01-01

demands , and fair CPU time were measured. The fair measure reported here includes the pure CPU time plus a pro-rated portion of the time consumed by the...syntactic class or the form matched . NO = noun VB = verb OTR = other part of speech IT-12 Although the above feature is not used by the system at present...indicate the syntactic class of the form matched . NO = noun other than gerund ("content", "dark", "African") INF = infinitive ("direct", "equal", "content
High performance computing for deformable image registration: towards a new paradigm in adaptive radiotherapy.

PubMed

Samant, Sanjiv S; Xia, Junyi; Muyan-Ozcelik, Pinar; Owens, John D

2008-08-01

Readily available temporal imaging or time series volumetric (4D) imaging has become an indispensable component of treatment planning and adaptive radiotherapy (ART) at many radiotherapy centers. Deformable image registration (DIR) is also used in other areas of medical imaging, including motion corrected image reconstruction. Due to long computation time, clinical applications of DIR in radiation therapy and elsewhere have been limited and consequently relegated to offline analysis. With the recent advances in hardware and software, graphics processing unit (GPU) based computing is an emerging technology for general purpose computation, including DIR, and is suitable for highly parallelized computing. However, traditional general purpose computation on the GPU is limited because of the constraints of the available programming platforms. As well, compared to CPU programming, the GPU currently has reduced dedicated processor memory, which can limit the useful working data set for parallelized processing. We present an implementation of the demons algorithm using the NVIDIA 8800 GTX GPU and the new CUDA programming language. The GPU performance will be compared with single threading and multithreading CPU implementations on an Intel dual core 2.4 GHz CPU using the C programming language. CUDA provides a C-like language programming interface, and allows for direct access to the highly parallel compute units in the GPU. Comparisons for volumetric clinical lung images acquired using 4DCT were carried out. Computation time for 100 iterations in the range of 1.8-13.5 s was observed for the GPU with image size ranging from 2.0 x 10(6) to 14.2 x 10(6) pixels. The GPU registration was 55-61 times faster than the CPU for the single threading implementation, and 34-39 times faster for the multithreading implementation. For CPU based computing, the computational time generally has a linear dependence on image size for medical imaging data. 
Computational efficiency is characterized in terms of time per megapixels per iteration (TPMI) with units of seconds per megapixels per iteration (or spmi). For the demons algorithm, our CPU implementation yielded largely invariant values of TPMI. The mean TPMIs were 0.527 spmi and 0.335 spmi for the single threading and multithreading cases, respectively, with <2% variation over the considered image data range. For GPU computing, we achieved TPMI = 0.00916 spmi with 3.7% variation, indicating optimized memory handling under CUDA. The paradigm of GPU based real-time DIR opens up a host of clinical applications for medical imaging.
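The TPMI metric in this record is simple arithmetic, and a short sketch shows how the reported figures relate to one another. Pairing the shortest time with the smallest image and the longest time with the largest image is an assumption; the abstract gives only the two ranges.

```python
def tpmi(seconds, pixels, iterations):
    """Time per megapixel per iteration, in spmi."""
    return seconds / (pixels / 1e6) / iterations


# Endpoints of the reported GPU ranges (pairing assumed, see above).
gpu_small = tpmi(1.8, 2.0e6, 100)    # smallest image
gpu_large = tpmi(13.5, 14.2e6, 100)  # largest image

# Ratios of the reported mean TPMI values.
speedup_single = 0.527 / 0.00916  # CPU single threading vs. GPU
speedup_multi = 0.335 / 0.00916   # CPU multithreading vs. GPU

print(f"GPU TPMI: {gpu_small:.5f} to {gpu_large:.5f} spmi")
print(f"Speedup: {speedup_single:.1f}x (single), {speedup_multi:.1f}x (multi)")
```

Both endpoint values land close to the reported mean of 0.00916 spmi, and the two ratios fall inside the quoted 55-61 and 34-39 speedup ranges.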
Fast in-memory elastic full-waveform inversion using consumer-grade GPUs

NASA Astrophysics Data System (ADS)

Sivertsen Bergslid, Tore; Birger Raknes, Espen; Arntsen, Børge

2017-04-01

Full-waveform inversion (FWI) is a technique to estimate subsurface properties by using the recorded waveform produced by a seismic source and applying inverse theory. This is done through an iterative optimization procedure, where each iteration requires solving the wave equation many times, then trying to minimize the difference between the modeled and the measured seismic data. Having to model many of these seismic sources per iteration means that this is a highly computationally demanding procedure, which usually involves writing a lot of data to disk. We have written code that does forward modeling and inversion entirely in memory. A typical HPC cluster has many more CPUs than GPUs. Since FWI involves modeling many seismic sources per iteration, the obvious approach is to parallelize the code on a source-by-source basis, where each core of the CPU performs one modeling, and do all modelings simultaneously. With this approach, the GPU is already at a major disadvantage in pure numbers. Fortunately, GPUs can more than make up for this hardware disadvantage by performing each modeling much faster than a CPU. Another benefit of parallelizing each individual modeling is that it lets each modeling use a lot more RAM. If one node has 128 GB of RAM and 20 CPU cores, each modeling can use only 6.4 GB RAM if one is running the node at full capacity with source-by-source parallelization on the CPU. A parallelized per-source code using GPUs can use 64 GB RAM per modeling. Whenever a modeling uses more RAM than is available and has to start using regular disk space the runtime increases dramatically, due to slow file I/O. The extremely high computational speed of the GPUs combined with the large amount of RAM available for each modeling lets us do high frequency FWI for fairly large models very quickly. For a single modeling, our GPU code outperforms the single-threaded CPU-code by a factor of about 75. 
Successful inversions have been run on data with frequencies up to 40 Hz for a model of 2001 by 600 grid points with 5 m grid spacing and 5000 time steps, in less than 2.5 minutes per source. In practice, using 15 nodes (30 GPUs) to model 101 sources, each iteration took approximately 9 minutes. For reference, the same inversion run with our CPU code uses two hours per iteration. This was done using only a very simple wavefield interpolation technique, saving every second timestep. Using a more sophisticated checkpointing or wavefield reconstruction method would allow us to increase this model size significantly. Our results show that ordinary gaming GPUs are a viable alternative to the expensive professional GPUs often used today, when performing large scale modeling and inversion in geophysics.
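The memory and timing figures in this record can be cross-checked with a few lines of arithmetic. The sketch assumes two GPUs per node (30 GPUs on 15 nodes) and that sources are processed in equal-sized batches, one source per GPU; the batching model is an illustration, not the authors' scheduler.

```python
import math


def ram_per_modeling(node_ram_gb, concurrent_modelings):
    """RAM available to each modeling when a node runs several at once."""
    return node_ram_gb / concurrent_modelings


cpu_ram_gb = ram_per_modeling(128, 20)  # one modeling per CPU core
gpu_ram_gb = ram_per_modeling(128, 2)   # one modeling per GPU (2 per node assumed)

# 101 sources on 30 GPUs: number of sequential batches per iteration.
batches = math.ceil(101 / 30)
# Upper bound on iteration time if every source takes the quoted 2.5 minutes.
upper_bound_minutes = batches * 2.5

# Reported per-iteration times: about two hours on CPUs, about 9 minutes on GPUs.
iteration_speedup = 120 / 9

print(cpu_ram_gb, gpu_ram_gb, batches, upper_bound_minutes)
print(f"Per-iteration speedup: {iteration_speedup:.1f}x")
```

Under these assumptions the 10-minute bound is consistent with the reported 9 minutes per iteration.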
Rapid and semi-analytical design and simulation of a toroidal magnet made with YBCO and MgB2 superconductors

DOE PAGES

Dimitrov, I. K.; Zhang, X.; Solovyov, V. F.; ...

2015-07-07

Recent advances in second-generation (YBCO) high-temperature superconducting wire could potentially enable the design of super high performance energy storage devices that combine the high energy density of chemical storage with the high power of superconducting magnetic storage. However, the high aspect ratio and the considerable filament size of these wires require the concomitant development of dedicated optimization methods that account for the critical current density in type-II superconductors. In this study, we report on the novel application and results of a CPU-efficient semianalytical computer code based on the Radia 3-D magnetostatics software package. Our algorithm is used to simulate and optimize the energy density of a superconducting magnetic energy storage device model, based on design constraints, such as overall size and number of coils. The rapid performance of the code is pivoted on analytical calculations of the magnetic field based on an efficient implementation of the Biot-Savart law for a large variety of 3-D “base” geometries in the Radia package. The significantly reduced CPU time and simple data input in conjunction with the consideration of realistic input variables, such as material-specific, temperature, and magnetic-field-dependent critical current densities, have enabled the Radia-based algorithm to outperform finite-element approaches in CPU time at the same accuracy levels. Comparative simulations of MgB2 and YBCO-based devices are performed at 4.2 K, in order to ascertain the realistic efficiency of the design configurations.

Robotic goalie with 3 ms reaction time at 4% CPU load using event-based dynamic vision sensor

PubMed Central

Delbruck, Tobi; Lang, Manuel

2013-01-01

Conventional vision-based robotic systems that must operate quickly require high video frame rates and consequently high computational costs. Visual response latencies are lower-bounded by the frame period, e.g., 20 ms for 50 Hz frame rate. This paper shows how an asynchronous neuromorphic dynamic vision sensor (DVS) silicon retina is used to build a fast self-calibrating robotic goalie, which offers high update rates and low latency at low CPU load. Independent and asynchronous per pixel illumination change events from the DVS signify moving objects and are used in software to track multiple balls. Motor actions to block the most “threatening” ball are based on measured ball positions and velocities. The goalie also sees its single-axis goalie arm and calibrates the motor output map during idle periods so that it can plan open-loop arm movements to desired visual locations. Blocking capability is about 80% for balls shot from 1 m from the goal even with the fastest shots, and approaches 100% accuracy when the ball does not beat the limits of the servo motor to move the arm to the necessary position in time. Running with standard USB buses under a standard preemptive multitasking operating system (Windows), the goalie robot achieves median update rates of 550 Hz, with latencies of 2.2 ± 2 ms from ball movement to motor command at a peak CPU load of less than 4%. Practical observations and measurements of USB device latency are provided. PMID:24311999
A CPU/MIC Collaborated Parallel Framework for GROMACS on Tianhe-2 Supercomputer.

PubMed

Peng, Shaoliang; Yang, Shunyun; Su, Wenhe; Zhang, Xiaoyu; Zhang, Tenglilang; Liu, Weiguo; Zhao, Xingming

2017-06-16

Molecular Dynamics (MD) is the simulation of the dynamic behavior of atoms and molecules. As the most popular software for molecular dynamics, GROMACS cannot work on large-scale data because of limited computing resources. In this paper, we propose a CPU and Intel® Xeon Phi Many Integrated Core (MIC) collaborated parallel framework to accelerate GROMACS using the offload mode on a MIC coprocessor, with which the performance of GROMACS is improved significantly, especially with the use of the Tianhe-2 supercomputer. Furthermore, we optimize GROMACS so that it can run on both the CPU and MIC at the same time. In addition, we accelerate multi-node GROMACS so that it can be used in practice. Benchmarking on real data, our accelerated GROMACS performs very well and reduces computation time significantly. Source code: https://github.com/tianhe2/gromacs-mic.
Particle-in-Cell laser-plasma simulation on Xeon Phi coprocessors

NASA Astrophysics Data System (ADS)

Surmin, I. A.; Bastrakov, S. I.; Efimenko, E. S.; Gonoskov, A. A.; Korzhimanov, A. V.; Meyerov, I. B.

2016-05-01

This paper concerns the development of a high-performance implementation of the Particle-in-Cell method for plasma simulation on Intel Xeon Phi coprocessors. We discuss the suitability of the method for Xeon Phi architecture and present our experience in the porting and optimization of the existing parallel Particle-in-Cell code PICADOR. Direct porting without code modification gives performance on Xeon Phi close to that of an 8-core CPU on a benchmark problem with 50 particles per cell. We demonstrate step-by-step optimization techniques, such as improving data locality, enhancing parallelization efficiency, and vectorization, leading to an overall 4.2× speedup on CPU and 7.5× on Xeon Phi compared to the baseline version. The optimized version achieves 16.9 ns per particle update on an Intel Xeon E5-2660 CPU and 9.3 ns per particle update on an Intel Xeon Phi 5110P. For a real problem of laser ion acceleration in targets with surface grating, where a large number of macroparticles per cell is required, the speedup of Xeon Phi compared to CPU is 1.6×.
Design Alternatives to Improve Access Time Performance of Disk Drives Under DOS and UNIX

NASA Astrophysics Data System (ADS)

Hospodor, Andy

For the past 25 years, improvements in CPU performance have overshadowed improvements in the access time performance of disk drives. CPU performance has been slanted towards greater instruction execution rates, measured in millions of instructions per second (MIPS). However, the slant for performance of disk storage has been towards capacity and corresponding increased storage densities. The IBM PC, introduced in 1982, processed only a fraction of a MIP. Follow-on CPUs, such as the 80486 and 80586, sported 5-10 MIPS by 1992. Single user PCs and workstations, with one CPU and one disk drive, became the dominant application, as implied by their production volumes. However, disk drives did not enjoy a corresponding improvement in access time performance, although the potential still exists. The time to access a disk drive improves (decreases) in two ways: by altering the mechanical properties of the drive or by adding cache to the drive. This paper explores the improvement to access time performance of disk drives using cache, prefetch, faster rotation rates, and faster seek acceleration.
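The two levers named in this record, mechanical properties and drive cache, can be put into a textbook-style access time model. The sketch below is a generic first-order model, not the paper's: it takes average rotational latency as half a revolution and blends cache hits with mechanical accesses by hit ratio. All drive parameters in the example are hypothetical.

```python
def avg_rotational_latency_ms(rpm):
    """Average rotational latency: the time for half a revolution."""
    return 0.5 * 60_000.0 / rpm


def effective_access_ms(seek_ms, rpm, hit_ratio, cache_ms=0.0):
    """Expected access time for a drive with an on-board cache.

    hit_ratio is the fraction of requests served from the cache.
    """
    mechanical = seek_ms + avg_rotational_latency_ms(rpm)
    return hit_ratio * cache_ms + (1.0 - hit_ratio) * mechanical


# Hypothetical drive: 12 ms average seek, 3600 rpm, 1 ms cache service time.
baseline = effective_access_ms(12.0, 3600, hit_ratio=0.0)
faster_spindle = effective_access_ms(12.0, 5400, hit_ratio=0.0)
with_cache = effective_access_ms(12.0, 3600, hit_ratio=0.5, cache_ms=1.0)

print(f"{baseline:.2f} ms, {faster_spindle:.2f} ms, {with_cache:.2f} ms")
```

In this model a faster spindle trims only the rotational term, while a cache scales down the whole mechanical cost in proportion to the hit ratio.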
Caffe con Troll: Shallow Ideas to Speed Up Deep Learning

PubMed Central

Hadjis, Stefan; Abuzaid, Firas; Zhang, Ce; Ré, Christopher

2016-01-01

We present Caffe con Troll (CcT), a fully compatible end-to-end version of the popular framework Caffe with rebuilt internals. We built CcT to examine the performance characteristics of training and deploying general-purpose convolutional neural networks across different hardware architectures. We find that, by employing standard batching optimizations for CPU training, we achieve a 4.5× throughput improvement over Caffe on popular networks like CaffeNet. Moreover, with these improvements, the end-to-end training time for CNNs is directly proportional to the FLOPS delivered by the CPU, which enables us to efficiently train hybrid CPU-GPU systems for CNNs. PMID:27314106
OpenMP GNU and Intel Fortran programs for solving the time-dependent Gross-Pitaevskii equation

NASA Astrophysics Data System (ADS)

Young-S., Luis E.; Muruganandam, Paulsamy; Adhikari, Sadhan K.; Lončar, Vladimir; Vudragović, Dušan; Balaž, Antun

2017-11-01

We present an Open Multi-Processing (OpenMP) version of Fortran 90 programs for solving the Gross-Pitaevskii (GP) equation for a Bose-Einstein condensate in one, two, and three spatial dimensions, optimized for use with GNU and Intel compilers. We use the split-step Crank-Nicolson algorithm for imaginary- and real-time propagation, which enables efficient calculation of stationary and non-stationary solutions, respectively. The present OpenMP programs are designed for computers with multi-core processors and optimized for compiling with both the commercially-licensed Intel Fortran and the popular free open-source GNU Fortran compiler. The programs are easy to use and are elaborated with helpful comments for the users. All input parameters are listed at the beginning of each program. Different output files provide physical quantities such as energy, chemical potential, root-mean-square sizes, densities, etc. We also present speedup test results for new versions of the programs. Program files doi: http://dx.doi.org/10.17632/y8zk3jgn84.2 Licensing provisions: Apache License 2.0 Programming language: OpenMP GNU and Intel Fortran 90. Computer: Any multi-core personal computer or workstation with the appropriate OpenMP-capable Fortran compiler installed. Number of processors used: All available CPU cores on the executing computer. Journal reference of previous version: Comput. Phys. Commun. 180 (2009) 1888; ibid. 204 (2016) 209. Does the new version supersede the previous version?: Not completely. It does supersede previous Fortran programs from both references above, but not OpenMP C programs from Comput. Phys. Commun. 204 (2016) 209. 
Nature of problem: The present Open Multi-Processing (OpenMP) Fortran programs, optimized for use with commercially-licensed Intel Fortran and free open-source GNU Fortran compilers, solve the time-dependent nonlinear partial differential (GP) equation for a trapped Bose-Einstein condensate in one (1d), two (2d), and three (3d) spatial dimensions for six different trap symmetries: axially and radially symmetric traps in 3d, circularly symmetric traps in 2d, fully isotropic (spherically symmetric) and fully anisotropic traps in 2d and 3d, as well as 1d traps, where no spatial symmetry is considered. Solution method: We employ the split-step Crank-Nicolson algorithm to discretize the time-dependent GP equation in space and time. The discretized equation is then solved by imaginary- or real-time propagation, employing adequately small space and time steps, to yield the solution of stationary and non-stationary problems, respectively. Reasons for the new version: Previously published Fortran programs [1,2] have now become popular tools [3] for solving the GP equation. These programs have been translated to the C programming language [4] and later extended to the more complex scenario of dipolar atoms [5]. Now virtually all computers have multi-core processors and some have motherboards with more than one physical central processing unit (CPU), which may increase the number of available CPU cores on a single computer to several tens. The C programs have been adapted to be very fast on such modern multi-core computers, on general-purpose graphics processing units (GPGPU) with Nvidia CUDA, and on computer clusters using Message Passing Interface (MPI) [6]. Nevertheless, previously developed Fortran programs are also commonly used for scientific computation and most of them use a single CPU core at a time in modern multi-core laptops, desktops, and workstations. 
Unless the Fortran programs are made aware and capable of making efficient use of the available CPU cores, the solution of even a realistic dynamical 1d problem, not to mention the more complicated 2d and 3d problems, could be time consuming using the Fortran programs. Previously, we published auto-parallel Fortran programs [2] suitable for the Intel (but not the GNU) compiler for solving the GP equation. Hence, the need for a full OpenMP version of the Fortran programs to reduce the execution time cannot be overemphasized. To address this issue, we provide here such OpenMP Fortran programs, optimized for both Intel and GNU Fortran compilers and capable of using all available CPU cores, which can significantly reduce the execution time. Summary of revisions: Previous Fortran programs [1] for solving the time-dependent GP equation in 1d, 2d, and 3d with different trap symmetries have been parallelized using the OpenMP interface to reduce the execution time on multi-core processors. There are six different trap symmetries considered, resulting in six programs for imaginary-time propagation and six for real-time propagation, totaling 12 programs included in the BEC-GP-OMP-FOR software package. All input data (number of atoms, scattering length, harmonic oscillator trap length, trap anisotropy, etc.) are conveniently placed at the beginning of each program, as before [2]. Present programs introduce a new input parameter, which is designated by Number_of_Threads and defines the number of CPU cores of the processor to be used in the calculation. If one sets the value 0 for this parameter, all available CPU cores will be used. For the most efficient calculation it is advisable to leave one CPU core unused for the system's background jobs. For example, on a machine with 20 CPU cores such as the one we used for testing, it is advisable to use up to 19 CPU cores. However, the total number of used CPU cores can be divided into more than one job. 
For instance, one can run three simulations simultaneously using 10, 4, and 5 CPU cores, respectively, thus totaling 19 used CPU cores on a 20-core computer. The Fortran source programs are located in the directory src, and can be compiled by the make command using the makefile in the root directory BEC-GP-OMP-FOR of the software package. The examples of produced output files can be found in the directory output, although some large density files are omitted, to save space. The programs calculate the values of actually used dimensionless nonlinearities from the physical input parameters, where the input parameters correspond to the identical nonlinearity values as in the previously published programs [1], so that the output files of the old and new programs can be directly compared. The output files are conveniently named such that their contents can be easily identified, following the naming convention introduced in Ref. [2]. For example, a file named <code>-out.txt, where <code> is the name of the individual program, represents the general output file containing input data, time and space steps, nonlinearity, energy and chemical potential, and was named fort.7 in the old Fortran version of programs [1]. A file named <code>-den.txt is the output file with the condensate density, which had the names fort.3 and fort.4 in the old Fortran version [1] for imaginary- and real-time propagation programs, respectively. Other possible density outputs, such as the initial density, are commented out in the programs to have a simpler set of output files, but users can uncomment and re-enable them, if needed. In addition, there are output files for reduced (integrated) 1d and 2d densities for different programs. In the real-time programs there is also an output file reporting the dynamics of evolution of root-mean-square sizes after a perturbation is introduced. The supplied real-time programs solve the stationary GP equation, and then calculate the dynamics. 
As the imaginary-time programs are more accurate than the real-time programs for the solution of a stationary problem, one can first solve the stationary problem using the imaginary-time programs, adapt the real-time programs to read the pre-calculated wave function and then study the dynamics. In that case the parameter NSTP in the real-time programs should be set to zero and the space mesh and nonlinearity parameters should be identical in both programs. The reader is advised to consult our previous publication where a complete description of the output files is given [2]. A readme.txt file, included in the root directory, explains the procedure to compile and run the programs. We tested our programs on a workstation with two 10-core Intel Xeon E5-2650 v3 CPUs. The parameters used for testing are given in sample input files, provided in the corresponding directory together with the programs. In Table 1 we present wall-clock execution times for runs on 1, 6, and 19 CPU cores for programs compiled using Intel and GNU Fortran compilers. The corresponding columns "Intel speedup" and "GNU speedup" give the ratio of wall-clock execution times of runs on 1 and 19 CPU cores, and denote the actual measured speedup for 19 CPU cores. In all cases and for all numbers of CPU cores, although the GNU Fortran compiler gives excellent results, the Intel Fortran compiler turns out to be slightly faster. Note that during these tests we always ran only a single simulation on a workstation at a time, to avoid any possible interference issues. Therefore, the obtained wall-clock times are more reliable than the ones that could be measured with two or more jobs running simultaneously. We also studied the speedup of the programs as a function of the number of CPU cores used. The performance of the Intel and GNU Fortran compilers is illustrated in Fig. 1, where we plot the speedup and actual wall-clock times as functions of the number of CPU cores for 2d and 3d programs. 
We see that the speedup increases monotonically with the number of CPU cores in all cases and has large values (between 10 and 14 for 3d programs) for the maximal number of cores. This fully justifies the development of OpenMP programs, which enable much faster and more efficient solving of the GP equation. However, a slow saturation in the speedup with the further increase in the number of CPU cores is observed in all cases, as expected. The speedup tends to increase for programs in higher dimensions, as they become more complex and have to process more data. This is why the speedups of the supplied 2d and 3d programs are larger than those of 1d programs. Also, for a single program the speedup increases with the size of the spatial grid, i.e., with the number of spatial discretization points, since this increases the amount of calculations performed by the program. To demonstrate this, we tested the supplied real2d-th program and varied the number of spatial discretization points NX=NY from 20 to 1000. The measured speedup obtained when running this program on 19 CPU cores as a function of the number of discretization points is shown in Fig. 2. The speedup first increases rapidly with the number of discretization points and eventually saturates. Additional comments: Example inputs provided with the programs take less than 30 minutes to run on a workstation with two Intel Xeon E5-2650 v3 processors (2 QPI links, 10 CPU cores, 25 MB cache, 2.3 GHz).
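The speedup figures in this record can be turned into parallel efficiencies with a short calculation. The sketch below also inverts Amdahl's law to estimate the parallel fraction implied by a measured speedup; Amdahl's law is not used in the record itself and is brought in here only as a rough interpretive model. The function names are illustrative.

```python
def speedup(t_one_core, t_n_cores):
    """Ratio of wall-clock times, as in the reported speedup columns."""
    return t_one_core / t_n_cores


def efficiency(s, n_cores):
    """Fraction of ideal linear scaling actually achieved."""
    return s / n_cores


def amdahl_parallel_fraction(s, n_cores):
    """Parallel fraction p implied by Amdahl's law S = 1 / ((1 - p) + p / n)."""
    return (1.0 - 1.0 / s) / (1.0 - 1.0 / n_cores)


# Reported range for the 3d programs on 19 cores: speedup between 10 and 14.
for s in (10.0, 14.0):
    print(f"S={s:.0f}: efficiency={efficiency(s, 19):.2f}, "
          f"implied p={amdahl_parallel_fraction(s, 19):.3f}")

# The job split suggested in the record leaves one of 20 cores free.
jobs = [10, 4, 5]
cores_left_for_system = 20 - sum(jobs)
```

Under this model, a speedup of 10 to 14 on 19 cores corresponds to roughly 95% to 98% of the run time being parallelizable, which fits the slow saturation the record describes.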



      
      Storage element performance optimization for CMS analysis jobs
      NASA Astrophysics Data System (ADS)
      Behrmann, G.; Dahlblom, J.; Guldmyr, J.; Happonen, K.; Lindén, T.
         2012-12-01
         Tier-2 computing sites in the Worldwide Large Hadron Collider Computing Grid (WLCG) host CPU resources (Compute Element, CE) and storage resources (Storage Element, SE). The vast amount of data that needs to be processed from the Large Hadron Collider (LHC) experiments requires good and efficient use of the available resources. Having a good CPU efficiency for the end users' analysis jobs requires that the performance of the storage system is able to scale with I/O requests from hundreds or even thousands of simultaneous jobs. In this presentation we report on the work on improving the SE performance at the Helsinki Institute of Physics (HIP) Tier-2 used for the Compact Muon Solenoid (CMS) experiment at the LHC. Statistics from CMS grid jobs are collected and stored in the CMS Dashboard for further analysis, which allows for easy performance monitoring by the sites and by the CMS collaboration. As part of the monitoring framework CMS uses the JobRobot, which sends 100 analysis jobs to each site every four hours. CMS also uses the HammerCloud tool for site monitoring and stress testing, and it has replaced the JobRobot. The performance of the analysis workflow submitted with JobRobot or HammerCloud can be used to track the performance due to site configuration changes, since the analysis workflow is kept the same for all sites and for months in time. The CPU efficiency of the JobRobot jobs at HIP was increased approximately by 50 % to more than 90 %, by tuning the SE and by improvements in the CMSSW and dCache software. The performance of the CMS analysis jobs improved significantly too. Similar work has been done on other CMS Tier-sites, since on average the CPU efficiency for CMSSW jobs has increased during 2011. Better monitoring of the SE allows faster detection of problems, so that the performance level can be kept high. 
The next storage upgrade at HIP consists of SAS disk enclosures which can be stress tested on demand with HammerCloud workflows, to make sure that the I/O-performance is good.
      

      
      Personal Computer and Workstation Operating Systems Tutorial
      DTIC Science & Technology
      
         1994-03-01
         to a RAM area where it is executed by the CPU. The program consists of instructions that perform operations on data. The CPU will perform two basic...memory to improve system performance. More often the user will buy a new fixed disk so the computer will hold more programs internally. The trend today...MHZ. Another way to view how fast the information is going into the register is in a time domain rather than a frequency domain knowing that time and
      

      
      Fast simulation of Proton Induced X-Ray Emission Tomography using CUDA
      NASA Astrophysics Data System (ADS)
      Beasley, D. G.; Marques, A. C.; Alves, L. C.; da Silva, R. C.
         2013-07-01
         A new 3D Proton Induced X-Ray Emission Tomography (PIXE-T) and Scanning Transmission Ion Microscopy Tomography (STIM-T) simulation software has been developed in Java and uses NVIDIA™ Compute Unified Device Architecture (CUDA) to calculate the X-ray attenuation for large detector areas. A challenge with PIXE-T is to get sufficient counts while retaining a small beam spot size. Therefore a high geometric efficiency is required. However, as the detector solid angle increases, the calculations required for accurate reconstruction of the data increase substantially. To overcome this limitation, the CUDA parallel computing platform was used, which enables general purpose programming of NVIDIA graphics processing units (GPUs) to perform computations traditionally handled by the central processing unit (CPU). For simulation performance evaluation, the results of a CPU- and a CUDA-based simulation of a phantom are presented. Furthermore, a comparison with the simulation code in the PIXE-Tomography reconstruction software DISRA (A. Sakellariou, D.N. Jamieson, G.J.F. Legge, 2001) is also shown. Compared to a CPU implementation, the CUDA based simulation is approximately 30× faster.
      

      
      An efficient compression scheme for bitmap indices
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Wu, Kesheng; Otoo, Ekow J.; Shoshani, Arie
         2004-04-13
         When using an out-of-core indexing method to answer a query, it is generally assumed that the I/O cost dominates the overall query response time. Because of this, most research on indexing methods concentrates on reducing the sizes of indices. For bitmap indices, compression has been used for this purpose. However, in most cases, operations on these compressed bitmaps, mostly bitwise logical operations such as AND, OR, and NOT, spend more time in CPU than in I/O. To speed up these operations, a number of specialized bitmap compression schemes have been developed, the best known of which is the byte-aligned bitmap code (BBC). They are usually faster in performing logical operations than the general-purpose compression schemes, but the time spent in CPU still dominates the total query response time. To reduce the query response time, we designed a CPU-friendly scheme named the word-aligned hybrid (WAH) code. In this paper, we prove that the sizes of WAH compressed bitmap indices are about two words per row for a large range of attributes. This size is smaller than typical sizes of commonly used indices, such as a B-tree. Therefore, WAH compressed indices are appropriate not only for low cardinality attributes but also for high cardinality attributes. In the worst case, the time to operate on compressed bitmaps is proportional to the total size of the bitmaps involved. The total size of the bitmaps required to answer a query on one attribute is proportional to the number of hits. These indicate that WAH compressed bitmap indices are optimal. To verify their effectiveness, we generated bitmap indices for four different datasets and measured the response time of many range queries. Tests confirm that sizes of compressed bitmap indices are indeed smaller than B-tree indices, and query processing with WAH compressed indices is much faster than with BBC compressed indices, projection indices and B-tree indices. In addition, we also verified that the average query response time is proportional to the index size. This indicates that the compressed bitmap indices are efficient for very large datasets.
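         The idea of a word-aligned code can be illustrated with a minimal sketch. It assumes 32-bit words with 31 payload bits, a top bit that marks a fill word, and a second bit that carries the fill value; this layout and the names wah_encode and wah_decode are illustrative assumptions, not taken from the abstract, and the sketch omits the compressed AND/OR/NOT operations that the paper measures.

```python
GROUP = 31  # payload bits carried by each 32-bit word (assumed layout)
MAX_RUN = 0x3FFFFFFF  # largest group count a fill word can hold


def wah_encode(bits):
    """Encode a list of 0/1 values into WAH-style 32-bit words."""
    # Pad with zeros so the input splits into whole 31-bit groups.
    padded = bits + [0] * (-len(bits) % GROUP)
    words = []
    for start in range(0, len(padded), GROUP):
        group = padded[start:start + GROUP]
        value = int("".join(map(str, group)), 2)
        if value == 0 or value == (1 << GROUP) - 1:
            # Uniform group: store it as (or merge it into) a fill word.
            fill_bit = 1 if value else 0
            header = (1 << 31) | (fill_bit << 30)
            same_fill = bool(words) and (words[-1] >> 30) == (header >> 30)
            if same_fill and (words[-1] & MAX_RUN) < MAX_RUN:
                words[-1] += 1  # extend the run by one group
            else:
                words.append(header | 1)
        else:
            words.append(value)  # literal word: top bit is 0
    return words


def wah_decode(words, n_bits):
    """Expand WAH-style words back into the first n_bits bits."""
    bits = []
    for word in words:
        if word >> 31:  # fill word
            fill_bit = (word >> 30) & 1
            bits.extend([fill_bit] * ((word & MAX_RUN) * GROUP))
        else:  # literal word
            bits.extend(int(c) for c in format(word, "031b"))
    return bits[:n_bits]


# Two all-zero groups, one mixed group, two all-one groups.
example = [0] * 62 + [1] + [0] * 30 + [1] * 62
encoded = wah_encode(example)
```

         Because every word is either a literal or a run, a logical operation can step through two bitmaps one machine word at a time without bit-level shifting, which is the property the abstract describes as CPU-friendly.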
      

      
      Optimizing a mobile robot control system using GPU acceleration
      NASA Astrophysics Data System (ADS)
      Tuck, Nat; McGuinness, Michael; Martin, Fred
         2012-01-01
         This paper describes our attempt to optimize a robot control program for the Intelligent Ground Vehicle Competition (IGVC) by running computationally intensive portions of the system on a commodity graphics processing unit (GPU). The IGVC Autonomous Challenge requires a control program that performs a number of different computationally intensive tasks ranging from computer vision to path planning. For the 2011 competition our Robot Operating System (ROS) based control system would not run comfortably on the multicore CPU on our custom robot platform. The process of profiling the ROS control program and selecting appropriate modules for porting to run on a GPU is described. A GPU-targeting compiler, Bacon, is used to speed up development and help optimize the ported modules. The impact of the ported modules on overall performance is discussed. We conclude that GPU optimization can free a significant amount of CPU resources with minimal effort for expensive user-written code, but that replacing heavily-optimized library functions is more difficult, and a much less efficient use of time.
      

      
      Cross-Identification of Astronomical Catalogs on Multiple GPUs
      NASA Astrophysics Data System (ADS)
      Lee, M. A.; Budavári, T.
         2013-10-01
         One of the most fundamental problems in observational astronomy is the cross-identification of sources. Observations are made in different wavelengths, at different times, and from different locations and instruments, resulting in a large set of independent observations. The scientific outcome is often limited by our ability to quickly perform meaningful associations between detections. The matching, however, is difficult scientifically, statistically, as well as computationally. The former two require detailed physical modeling and advanced probabilistic concepts; the latter is due to the large volumes of data and the problem's combinatorial nature. In order to tackle the computational challenge and to prepare for future surveys, whose measurements will be exponentially increasing in size past the scale of feasible CPU-based solutions, we developed a new implementation which addresses the issue by performing the associations on multiple Graphics Processing Units (GPUs). Our implementation utilizes up to 6 GPUs in combination with the Thrust library to achieve a speed-up of more than 40x versus the previous best implementation running on a multi-CPU SQL Server.
      

      
      Inexact adaptive Newton methods
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Bertiger, W.I.; Kelsey, F.J.
         1985-02-01
         The Inexact Adaptive Newton method (IAN) is a modification of the Adaptive Implicit Method (AIM) with improved Newton convergence. Both methods simplify the Jacobian at each time step by zeroing coefficients in regions where saturations are changing slowly. The methods differ in how the diagonal block terms are treated. On test problems with up to 3,000 cells, IAN consistently saves approximately 30% of the CPU time when compared to the fully implicit method. AIM shows similar savings on some problems, but takes as much CPU time as fully implicit on other test problems due to poor Newton convergence.
      

      
      Semiempirical Quantum Chemical Calculations Accelerated on a Hybrid Multicore CPU-GPU Computing Platform.
      PubMed
      Wu, Xin; Koslowski, Axel; Thiel, Walter
         2012-07-10
         In this work, we demonstrate that semiempirical quantum chemical calculations can be accelerated significantly by leveraging the graphics processing unit (GPU) as a coprocessor on a hybrid multicore CPU-GPU computing platform. Semiempirical calculations using the MNDO, AM1, PM3, OM1, OM2, and OM3 model Hamiltonians were systematically profiled for three types of test systems (fullerenes, water clusters, and solvated crambin) to identify the most time-consuming sections of the code. The corresponding routines were ported to the GPU and optimized employing both existing library functions and a GPU kernel that carries out a sequence of noniterative Jacobi transformations during pseudodiagonalization. The overall computation times for single-point energy calculations and geometry optimizations of large molecules were reduced by one order of magnitude for all methods, as compared to runs on a single CPU core.
      

      
      A GPU-based calculation using the three-dimensional FDTD method for electromagnetic field analysis.
      PubMed
      Nagaoka, Tomoaki; Watanabe, Soichi
         2010-01-01
         Numerical simulations with the numerical human model using the finite-difference time domain (FDTD) method have recently been performed frequently in a number of fields in biomedical engineering. However, the FDTD calculation runs too slowly. We focus, therefore, on general purpose programming on the graphics processing unit (GPGPU). The three-dimensional FDTD method was implemented on the GPU using Compute Unified Device Architecture (CUDA). In this study, we used the NVIDIA Tesla C1060 as a GPGPU board. The performance of the GPU is evaluated in comparison with the performance of a conventional CPU and a vector supercomputer. The results indicate that three-dimensional FDTD calculations using a GPU can significantly reduce run time in comparison with that using a conventional CPU, even a native GPU implementation of the three-dimensional FDTD method, while the GPU/CPU speed ratio varies with the calculation domain and thread block size.
      

      
      Research on control law accelerator of digital signal process chip TMS320F28035 for real-time data acquisition and processing
      NASA Astrophysics Data System (ADS)
      Zhao, Shuangle; Zhang, Xueyi; Sun, Shengli; Wang, Xudong
         2017-08-01
         The TI C2000 series of digital signal processor (DSP) chips has been widely used in electrical engineering, measurement and control, communications, and other professional fields; the DSP TMS320F28035 is one of the most representative of its kind. A DSP program must perform both data acquisition and data processing. If it is written in ordinary C or assembly language and runs sequentially, the analogue-to-digital (AD) converter cannot acquire data in real time and often misses a large amount of data. The control law accelerator (CLA) processor can run in parallel with the main central processing unit (CPU) at the same clock frequency, and it supports floating-point operations. Therefore, the CLA coprocessor is used in the program: the CLA kernel is responsible for data processing, and the main CPU is responsible for the AD conversion. The advantage of this method is that it reduces the data processing time and achieves real-time data acquisition.
      

      
      Bayer image parallel decoding based on GPU
      NASA Astrophysics Data System (ADS)
      Hu, Rihui; Xu, Zhiyong; Wei, Yuxing; Sun, Shaohua
         2012-11-01
         In the photoelectrical tracking system, Bayer images are traditionally decompressed on the CPU. However, this is too slow when the images become large, for example, 2K×2K×16bit. In order to accelerate Bayer image decoding, this paper introduces a parallel speedup method for NVIDIA's Graphics Processing Unit (GPU), which supports the CUDA architecture. The decoding procedure can be divided into three parts: the first is the serial part, the second is the task-parallel part, and the last is the data-parallel part, which includes inverse quantization, the inverse discrete wavelet transform (IDWT), and image post-processing. To reduce the execution time, the task-parallel part is optimized with OpenMP techniques. The data-parallel part gains efficiency by executing on the GPU as a CUDA parallel program. The optimization techniques include instruction optimization, shared memory access optimization, coalesced memory access optimization, and texture memory optimization. In particular, the IDWT can be sped up significantly by rewriting the 2D (two-dimensional) serial IDWT as a 1D parallel IDWT. In experiments with a 1K×1K×16bit Bayer image, the data-parallel part is more than 10 times faster than the CPU-based implementation. Finally, a CPU+GPU heterogeneous decompression system was designed. The experimental results show that it achieves a 3- to 5-fold speed increase compared to the CPU serial method.
      

      
      SU-E-T-423: Fast Photon Convolution Calculation with a 3D-Ideal Kernel On the GPU
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Moriya, S; Sato, M; Tachibana, H
         
         Purpose: The calculation time is a trade-off for improving the accuracy of convolution dose calculation with fine calculation spacing of the KERMA kernel. We investigated accelerating the convolution calculation using an ideal kernel on the graphics processing unit (GPU). Methods: The calculation was performed on AMD Dual FirePro D700 graphics hardware, and our algorithm was implemented using Aparapi, which converts Java bytecode to OpenCL. The dose calculation was separated into the TERMA and KERMA steps. The dose deposited at the coordinate (x, y, z) was determined in the process. In the dose calculation running on the central processing unit (CPU), an Intel Xeon E5, the calculation loops were performed for all calculation points. In the GPU computation, all of the calculation processes for the points were sent to the GPU and the multi-thread computation was done. In this study, the dose calculation was performed in a water-equivalent homogeneous phantom with 150³ voxels (2 mm calculation grid), and the calculation speed on the GPU relative to that on the CPU and the accuracy of the PDD were compared. Results: The calculation times for the GPU and the CPU were 3.3 sec and 4.4 hours, respectively. The calculation speed for the GPU was 4800 times faster than that for the CPU. The PDD curve for the GPU matched that for the CPU perfectly. Conclusion: The convolution calculation with the ideal kernel on the GPU was clinically acceptable for time and may be more accurate in an inhomogeneous region. Intensity modulated arc therapy needs dose calculations for different gantry angles at many control points. Thus, it would be more practical for the kernel to use a coarse spacing technique if the calculation is faster while keeping accuracy similar to that of a current treatment planning system.
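         The speed-up quoted in the abstract above can be checked with a short worked calculation. The sketch below uses only the reported figures (3.3 s on the GPU, 4.4 hours on the CPU, 150 voxels per side); the variable names are illustrative.

```python
# Figures quoted in the abstract above.
gpu_seconds = 3.3
cpu_hours = 4.4
voxels_per_side = 150

cpu_seconds = cpu_hours * 3600        # CPU time expressed in seconds
speedup = cpu_seconds / gpu_seconds   # ratio of CPU time to GPU time
total_voxels = voxels_per_side ** 3   # size of the 150 x 150 x 150 phantom

print(f"CPU time: {cpu_seconds:.0f} s")
print(f"Speed-up: {speedup:.0f}x")
print(f"Voxels:   {total_voxels}")
```

         The ratio works out to about 4800, consistent with the reported factor, for a phantom of 3,375,000 voxels.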
      

      
      Embedded real-time operating system micro kernel design
      NASA Astrophysics Data System (ADS)
      Cheng, Xiao-hui; Li, Ming-qiang; Wang, Xin-zheng
         2005-12-01
         Embedded systems usually require real-time behavior. Based on an 8051 microcontroller, an embedded real-time operating system micro kernel is proposed, consisting of six parts: critical section processing, task scheduling, interrupt handling, semaphore and message mailbox communication, clock management, and memory management. CPU time and other resources are distributed among tasks rationally according to their importance and urgency. The design proposed here provides the position, definition, function and principle of the micro kernel. The kernel runs on an ATMEL AT89C51 microcontroller platform. Simulation results show that the designed micro kernel is stable and reliable and responds quickly while operating in an application system.


   
       
            
              
          


          
        

           
           
             
               
      
      Multibody dynamic simulation of knee contact mechanics
      PubMed Central
      Bei, Yanhong; Fregly, Benjamin J.
         2006-01-01
         Multibody dynamic musculoskeletal models capable of predicting muscle forces and joint contact pressures simultaneously would be valuable for studying clinical issues related to knee joint degeneration and restoration. Current three-dimensional multi-body knee models are either quasi-static with deformable contact or dynamic with rigid contact. This study proposes a computationally efficient methodology for combining multibody dynamic simulation methods with a deformable contact knee model. The methodology requires preparation of the articular surface geometry, development of efficient methods to calculate distances between contact surfaces, implementation of an efficient contact solver that accounts for the unique characteristics of human joints, and specification of an application programming interface for integration with any multibody dynamic simulation environment. The current implementation accommodates natural or artificial tibiofemoral joint models, small or large strain contact models, and linear or nonlinear material models. Applications are presented for static analysis (via dynamic simulation) of a natural knee model created from MRI and CT data and dynamic simulation of an artificial knee model produced from manufacturer’s CAD data. Small and large strain natural knee static analyses required 1 min of CPU time and predicted similar contact conditions except for peak pressure, which was higher for the large strain model. Linear and nonlinear artificial knee dynamic simulations required 10 min of CPU time and predicted similar contact force and torque but different contact pressures, which were lower for the nonlinear model due to increased contact area. This methodology provides an important step toward the realization of dynamic musculoskeletal models that can predict in vivo knee joint motion and loading simultaneously. PMID:15564115
      

      
      An evaluation of superminicomputers for thermal analysis
      NASA Technical Reports Server (NTRS)
      Storaasli, O. O.; Vidal, J. B.; Jones, G. K.
         1962-01-01
         The feasibility and cost effectiveness of solving thermal analysis problems on superminicomputers is demonstrated. Conventional thermal analysis and the changing computer environment, computer hardware and software used, six thermal analysis test problems, performance of superminicomputers (CPU time, accuracy, turnaround, and cost) and comparison with large computers are considered. Although the CPU times for superminicomputers were 15 to 30 times greater than the fastest mainframe computer, the minimum cost to obtain the solutions on superminicomputers was from 11 percent to 59 percent of the cost of mainframe solutions. The turnaround (elapsed) time is highly dependent on the computer load, but for large problems, superminicomputers produced results in less elapsed time than a typically loaded mainframe computer.
      

      
      Numerical study of the effects of icing on viscous flow over wings
      NASA Technical Reports Server (NTRS)
      Sankar, L. N.
         1994-01-01
         An improved hybrid method for computing unsteady compressible viscous flows is presented. This method divides the computational domain into two zones. In the outer zone, the unsteady full-potential equation (FPE) is solved. In the inner zone, the Navier-Stokes equations are solved using a diagonal form of an alternating-direction implicit (ADI) approximate factorization procedure. The two zones are tightly coupled so that steady and unsteady flows may be efficiently solved. Characteristic-based viscous/inviscid interface boundary conditions are employed to avoid spurious reflections at that interface. The resulting CPU times are less than 60 percent of that required for a full-blown Navier-Stokes analysis for steady flow applications and about 60 percent of the Navier-Stokes CPU times for unsteady flows in non-vector processing machines. Applications of the method are presented for a rectangular NACA 0012 wing in low subsonic steady flow at moderate and high angles of attack, and for an F-5 wing in steady and unsteady subsonic and transonic flows. Steady surface pressures are in very good agreement with experimental data and are essentially identical to Navier-Stokes predictions. Density contours show that shocks cross the viscous/inviscid interface smoothly, so that the accuracy of full Navier-Stokes equations can be retained with a significant savings in computational time.
      

      
      Task Assignment Heuristics for Distributed CFD Applications
      NASA Technical Reports Server (NTRS)
      Lopez-Benitez, N.; Djomehri, M. J.; Biswas, R.; Biegel, Bryan (Technical Monitor)
         2001-01-01
         CFD applications require high-performance computational platforms: 1. Complex physics and domain configuration demand strongly coupled solutions; 2. Applications are CPU and memory intensive; and 3. Huge resource requirements can only be satisfied by teraflop-scale machines or distributed computing.
      

      
      A Reliability-Based Particle Filter for Humanoid Robot Self-Localization in RoboCup Standard Platform League
      PubMed Central
      Sánchez, Eduardo Munera; Alcobendas, Manuel Muñoz; Noguera, Juan Fco. Blanes; Gilabert, Ginés Benet; Simó Ten, José E.
         2013-01-01
         This paper deals with the problem of humanoid robot localization and proposes a new method for position estimation that has been developed for the RoboCup Standard Platform League environment. Firstly, a complete vision system has been implemented in the Nao robot platform that enables the detection of relevant field markers. The detection of field markers provides some estimation of distances for the current robot position. To reduce errors in these distance measurements, extrinsic and intrinsic camera calibration procedures have been developed and described. To validate the localization algorithm, experiments covering many of the typical situations that arise during RoboCup games have been developed: ranging from degradation in position estimation to total loss of position (due to falls, ‘kidnapped robot’, or penalization). The self-localization method developed is based on the classical particle filter algorithm. The main contribution of this work is a new particle selection strategy. Our approach reduces the CPU computing time required for each iteration and so eases the limited resource availability problem that is common in robot platforms such as Nao. The experimental results show the quality of the new algorithm in terms of localization and CPU time consumption. PMID:24193098
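         For orientation, the resampling step of a classical particle filter can be sketched as follows. This is a generic systematic (low-variance) resampler, not the new particle selection strategy that the abstract describes; the function name and the explicit offset parameter u0 are assumptions made so the example is deterministic.

```python
def systematic_resample(weights, u0):
    """Return the indices of particles kept by systematic resampling.

    weights: non-negative particle weights (need not be normalised).
    u0: offset in [0, 1); normally drawn at random, fixed here for clarity.
    """
    n = len(weights)
    total = sum(weights)
    indices = []
    j = 0
    cumulative = weights[0] / total
    for i in range(n):
        position = (u0 + i) / n  # evenly spaced pointers over [0, 1)
        # Advance to the particle whose cumulative weight covers the pointer.
        while position > cumulative and j < n - 1:
            j += 1
            cumulative += weights[j] / total
        indices.append(j)
    return indices


# One particle carries most of the weight, so it is selected several times.
kept = systematic_resample([0.1, 0.1, 0.7, 0.1], u0=0.5)
```

         Each iteration of a filter of this kind costs time proportional to the number of particles, which is why the number of particles maintained matters on a CPU-limited platform such as the one the abstract describes.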
      

      
      Design of a memory-access controller with 3.71-times-enhanced energy efficiency for Internet-of-Things-oriented nonvolatile microcontroller unit
      NASA Astrophysics Data System (ADS)
      Natsui, Masanori; Hanyu, Takahiro
         2018-04-01
         In realizing a nonvolatile microcontroller unit (MCU) for sensor nodes in Internet-of-Things (IoT) applications, it is important to solve the data-transfer bottleneck between the central processing unit (CPU) and the nonvolatile memory constituting the MCU. As one circuit-oriented approach to solving this problem, we propose a memory access minimization technique for magnetoresistive-random-access-memory (MRAM)-embedded nonvolatile MCUs. In addition to multiplexing and prefetching of memory access, the proposed technique realizes efficient instruction fetch by eliminating redundant memory access while considering the code length of the instruction to be fetched and the transition of the memory address to be accessed. As a result, the performance of the MCU can be improved while relaxing the performance requirement for the embedded MRAM, and compact and low-power implementation can be performed as compared with the conventional cache-based one. Through the evaluation using a system consisting of a general purpose 32-bit CPU and embedded MRAM, it is demonstrated that the proposed technique increases the peak efficiency of the system up to 3.71 times, while a 2.29-fold area reduction is achieved compared with the cache-based one.
      

      
      Design and implementation of a UNIX based distributed computing system
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Love, J.S.; Michael, M.W.
         1994-12-31
         We have designed, implemented, and are running a corporate-wide distributed processing batch queue on a large number of networked workstations using the UNIX® operating system. Atlas Wireline researchers and scientists have used the system for over a year. The large increase in available computer power has greatly reduced the time required for nuclear and electromagnetic tool modeling. Use of remote distributed computing has simultaneously reduced computation costs and increased usable computer time. The system integrates equipment from different manufacturers, using various CPU architectures, distinct operating system revisions, and even multiple processors per machine. Various differences between the machines have to be accounted for in the master scheduler. These differences include shells, command sets, swap spaces, memory sizes, CPU sizes, and OS revision levels. Remote processing across a network must be performed in a manner that is seamless from the users' perspective. The system currently uses IBM RISC System/6000®, SPARCstation™, HP9000s700, HP9000s800, and DEC Alpha AXP™ machines. Each CPU in the network has its own speed rating, allowed working hours, and workload parameters. The system is designed so that all of the computers in the network can be optimally scheduled without adversely impacting the primary users of the machines. The increase in the total usable computational capacity by means of distributed batch computing can change corporate computing strategy. The integration of disparate computer platforms eliminates the need to buy one type of computer for computations, another for graphics, and yet another for day-to-day operations. It might be possible, for example, to meet all research and engineering computing needs with existing networked computers.
      

      
      Real time display Fourier-domain OCT using multi-thread parallel computing with data vectorization
      NASA Astrophysics Data System (ADS)
      Eom, Tae Joong; Kim, Hoon Seop; Kim, Chul Min; Lee, Yeung Lak; Choi, Eun-Seo
         2011-03-01
         We demonstrate a real-time display of processed OCT images using multi-thread parallel computing with a quad-core CPU of a personal computer. The data of each A-line are treated as one vector to maximize the data translation rate between the cores of the CPU and RAM stored image data. A display rate of 29.9 frames/sec for processed OCT data (4096 FFT-size x 500 A-scans) is achieved in our system using a wavelength swept source with 52-kHz swept frequency. The data processing times of the OCT image and a Doppler OCT image with a 4-time average are 23.8 msec and 91.4 msec.
      

      
      Full waveform time domain solutions for source and induced magnetotelluric and controlled-source electromagnetic fields using quasi-equivalent time domain decomposition and GPU parallelization
      NASA Astrophysics Data System (ADS)
      Imamura, N.; Schultz, A.
         2015-12-01
         Recently, a full waveform time domain solution has been developed for the magnetotelluric (MT) and controlled-source electromagnetic (CSEM) methods. The ultimate goal of this approach is to obtain a computationally tractable direct waveform joint inversion for source fields and earth conductivity structure in three and four dimensions. This is desirable on several grounds, including the improved spatial resolving power expected from use of a multitude of source illuminations of non-zero wavenumber, the ability to operate in areas of high levels of source signal spatial complexity and non-stationarity, etc. This goal would not be obtainable if one were to adopt the finite difference time-domain (FDTD) approach for the forward problem. This is particularly true for the case of MT surveys, since an enormous number of degrees of freedom are required to represent the observed MT waveforms across the large frequency bandwidth. It means that for FDTD simulation, the smallest time steps should be finer than that required to represent the highest frequency, while the number of time steps should also cover the lowest frequency. This leads to a linear system that is computationally burdensome to solve. We have implemented a code that addresses this situation through the use of a fictitious wave domain method and GPUs to speed up the computation time. We also substantially reduce the size of the linear systems by applying concepts from successive cascade decimation, through quasi-equivalent time domain decomposition. By combining these refinements, we have made good progress toward implementing the core of a full waveform joint source field/earth conductivity inverse modeling method. Our results show that using a previous-generation CPU/GPU combination speeds up computations by an order of magnitude over a parallel CPU-only approach. In part, this arises from the use of the quasi-equivalent time domain decomposition, which shrinks the size of the linear system dramatically.
      

      
      A fast three-dimensional gamma evaluation using a GPU utilizing texture memory for on-the-fly interpolations.
      PubMed
      Persoon, Lucas C G G; Podesta, Mark; van Elmpt, Wouter J C; Nijsten, Sebastiaan M J J G; Verhaegen, Frank
         2011-07-01
         A widely accepted method to quantify differences in dose distributions is the gamma (γ) evaluation. Currently, almost all gamma implementations utilize the central processing unit (CPU). Recently, the graphics processing unit (GPU) has become a powerful platform for specific computing tasks. In this study, we describe the implementation of a 3D gamma evaluation using a GPU to improve calculation time. The gamma evaluation algorithm was implemented on an NVIDIA Tesla C2050 GPU using the compute unified device architecture (CUDA). First, several cubic virtual phantoms were simulated. These phantoms were tested with varying dose cube sizes and set-ups, introducing artificial dose differences. Second, to show applicability in clinical practice, five patient cases have been evaluated using the 3D dose distribution from a treatment planning system as the reference and the delivered dose determined during treatment as the comparison. A calculation time comparison between the CPU and GPU was made with varying thread-block sizes including the option of using texture or global memory. A GPU over CPU speed-up of 66 ± 12 was achieved for the virtual phantoms. For the patient cases, a speed-up of 57 ± 15 using the GPU was obtained. A thread-block size of 16 x 16 performed best in all cases. The use of texture memory improved the total calculation time, especially when interpolation was applied. Differences between the CPU and GPU gammas were negligible. The GPU and its features, such as texture memory, decreased the calculation time for gamma evaluations considerably without loss of accuracy.
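         To make the quantity being accelerated concrete, the sketch below computes a gamma value in one dimension by brute force on the CPU. It assumes the commonly used definition, in which gamma at a reference point is the minimum over evaluated points of the combined distance and dose difference, each scaled by its tolerance. The 3 mm and 3% defaults, the global normalisation, and the function name are illustrative assumptions and are not taken from the abstract.

```python
import math


def gamma_1d(ref_pos, ref_dose, eval_pos, eval_dose, dta_mm=3.0, dose_tol=0.03):
    """Return one gamma value per reference point (brute-force search)."""
    norm = max(ref_dose)  # global normalisation dose (assumed convention)
    gammas = []
    for xr, dr in zip(ref_pos, ref_dose):
        best = math.inf
        for xe, de in zip(eval_pos, eval_dose):
            dist_term = ((xe - xr) / dta_mm) ** 2
            dose_term = ((de - dr) / (dose_tol * norm)) ** 2
            best = min(best, math.sqrt(dist_term + dose_term))
        gammas.append(best)
    return gammas


positions = [0.0, 1.0, 2.0, 3.0]       # mm
reference = [1.0, 1.0, 1.0, 1.0]       # flat reference dose profile
shifted = [1.03, 1.03, 1.03, 1.03]     # evaluated dose 3% higher everywhere

same = gamma_1d(positions, reference, positions, reference)
offset = gamma_1d(positions, reference, positions, shifted)
```

         The nested loop visits every pair of reference and evaluated points, and each reference point is independent of the others, which is the kind of workload that maps naturally onto one GPU thread per point.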
      

      
      GPU based contouring method on grid DEM data
      NASA Astrophysics Data System (ADS)
      Tan, Liheng; Wan, Gang; Li, Feng; Chen, Xiaohui; Du, Wenlong
         2017-08-01
         This paper presents a novel method to generate contour lines from grid DEM data based on the programmable GPU pipeline. The previous contouring approaches often use CPU to construct a finite element mesh from the raw DEM data, and then extract contour segments from the elements. They also need a tracing or sorting strategy to generate the final continuous contours. These approaches can be heavily CPU-costing and time-consuming. Meanwhile the generated contours would be unsmooth if the raw data is sparsely distributed. Unlike the CPU approaches, we employ the GPU's vertex shader to generate a triangular mesh with arbitrary user-defined density, in which the height of each vertex is calculated through a third-order Cardinal spline function. Then in the same frame, segments are extracted from the triangles by the geometry shader, and translated to the CPU-side with an internal order in the GPU's transform feedback stage. Finally we propose a "Grid Sorting" algorithm to achieve the continuous contour lines by travelling the segments only once. Our method makes use of multiple stages of GPU pipeline for computation, which can generate smooth contour lines, and is significantly faster than the previous CPU approaches. The algorithm can be easily implemented with OpenGL 3.3 API or higher on consumer-level PCs.
      

      
      Implementation of ADI: Schemes on MIMD parallel computers
      NASA Technical Reports Server (NTRS)
      Vanderwijngaart, Rob F.
         1993-01-01
         In order to simulate the effects of the impingement of hot exhaust jets of High Performance Aircraft on landing surfaces a multi-disciplinary computation coupling flow dynamics to heat conduction in the runway needs to be carried out. Such simulations, which are essentially unsteady, require very large computational power in order to be completed within a reasonable time frame of the order of an hour. Such power can be furnished by the latest generation of massively parallel computers. These remove the bottleneck of ever more congested data paths to one or a few highly specialized central processing units (CPU's) by having many off-the-shelf CPU's work independently on their own data, and exchange information only when needed. During the past year the first phase of this project was completed, in which the optimal strategy for mapping an ADI-algorithm for the three dimensional unsteady heat equation to a MIMD parallel computer was identified. This was done by implementing and comparing three different domain decomposition techniques that define the tasks for the CPU's in the parallel machine. These implementations were done for a Cartesian grid and Dirichlet boundary conditions. The most promising technique was then used to implement the heat equation solver on a general curvilinear grid with a suite of nontrivial boundary conditions. Finally, this technique was also used to implement the Scalar Penta-diagonal (SP) benchmark, which was taken from the NAS Parallel Benchmarks report. All implementations were done in the programming language C on the Intel iPSC/860 computer.
      

      
      Evaluation of nonlinear structural dynamic responses using a fast-running spring-mass formulation
      NASA Astrophysics Data System (ADS)
      Benjamin, A. S.; Altman, B. S.; Gruda, J. D.
         
         In today's world, accurate finite-element simulations of large nonlinear systems may require meshes composed of hundreds of thousands of degrees of freedom. Even with today's fast computers and the promise of ever-faster ones in the future, central processing unit (CPU) expenditures for such problems could be measured in days. Many contemporary engineering problems, such as those found in risk assessment, probabilistic structural analysis, and structural design optimization, cannot tolerate the cost or turnaround time for such CPU-intensive analyses, because these applications require a large number of cases to be run with different inputs. For many risk assessment applications, analysts would prefer running times to be measurable in minutes. There is therefore a need for approximation methods which can solve such problems far more efficiently than the very detailed methods and yet maintain an acceptable degree of accuracy. For this purpose, we have been working on two methods of approximation: neural networks and spring-mass models. This paper presents our work and results to date for spring-mass modeling and analysis, since we are further along in this area than in the neural network formulation. It describes the physical and numerical models contained in a code we developed called STRESS, which stands for 'Spring-mass Transient Response Evaluation for structural Systems'. The paper also presents results for a demonstration problem, and compares these with results obtained for the same problem using PRONTO3D, a state-of-the-art finite element code which was also developed at Sandia.
      

      
      Multibody Parachute Flight Simulations for Planetary Entry Trajectories Using "Equilibrium Points"
      NASA Technical Reports Server (NTRS)
      Raiszadeh, Ben
         2003-01-01
         A method has been developed to reduce numerical stiffness and computer CPU requirements of high fidelity multibody flight simulations involving parachutes for planetary entry trajectories. Typical parachute entry configurations consist of entry bodies suspended from a parachute, connected by flexible lines. To accurately calculate line forces and moments, the simulations need to keep track of the point where the flexible lines meet (confluence point). In previous multibody parachute flight simulations, the confluence point has been modeled as a point mass. Using a point mass for the confluence point tends to make the simulation numerically stiff, because its mass is typically much less than the main rigid body masses. One solution for stiff differential equations is to use a very small integration time step. However, this results in large computer CPU requirements. In the method described in the paper, the need for using a mass as the confluence point has been eliminated. Instead, the confluence point is modeled using an "equilibrium point". This point is calculated at every integration step as the point at which the sum of all line forces is zero (static equilibrium). The use of this "equilibrium point" has the advantage of both reducing the numerical stiffness of the simulations, and eliminating the dynamical equations associated with vibration of a lumped mass on a high-tension string.
      

      
      Multiprocessing MCNP on an IBM RS/6000 cluster
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      McKinney, G.W.; West, J.T.
         1993-01-01
         The advent of high-performance computer systems has brought to maturity programming concepts like vectorization, multiprocessing, and multitasking. While there are many schools of thought as to the most significant factor in obtaining order-of-magnitude increases in performance, such speedup can only be achieved by integrating the computer system and application code. Vectorization leads to faster manipulation of arrays by overlapping instruction CPU cycles. Discrete ordinates codes, which require the solving of large matrices, have proved to be major beneficiaries of vectorization. Monte Carlo transport, on the other hand, typically contains numerous logic statements and requires extensive redevelopment to benefit from vectorization. Multiprocessing and multitasking provide additional CPU cycles via multiple processors. Such systems are generally designed with either common memory access (multitasking) or distributed memory access. In both cases, theoretical speedup, as a function of the number of processors (P) and the fraction of task time that multiprocesses (f), can be formulated using Amdahl's Law, S(f,P) = 1/[(1 − f) + f/P]. However, for most applications this theoretical limit cannot be achieved, due to additional terms not included in Amdahl's Law. Monte Carlo transport is a natural candidate for multiprocessing, since the particle tracks are generally independent and the precision of the result increases as the square root of the number of particles tracked.
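The Amdahl's Law bound mentioned in this abstract can be illustrated with a minimal sketch. It assumes the standard form of the law, with f the fraction of task time that multiprocesses and P the number of processors; the function name `amdahl_speedup` and the sample value f = 0.95 are illustrative choices, not taken from the record.

```python
def amdahl_speedup(f: float, p: int) -> float:
    """Theoretical speedup when a fraction f of the work runs on p processors.

    Assumes the standard form of Amdahl's Law: the serial part (1 - f) is
    unchanged and the parallel part f is divided evenly over p processors.
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie in [0, 1]")
    if p < 1:
        raise ValueError("p must be at least 1")
    return 1.0 / ((1.0 - f) + f / p)


# Illustrative sweep: even with 95% of the work parallelized, the speedup
# can never exceed 1 / (1 - f) = 20, however many processors are added.
for processors in (1, 2, 8, 64, 1024):
    print(processors, round(amdahl_speedup(0.95, processors), 2))
```

The sweep shows why the abstract calls this a theoretical limit: communication and load-imbalance costs, which the formula omits, only lower the achieved speedup further.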
      

      
      Rapid Monte Carlo simulation of detector DQE(f)
      PubMed Central
      Star-Lack, Josh; Sun, Mingshan; Meyer, Andre; Morf, Daniel; Constantin, Dragos; Fahrig, Rebecca; Abel, Eric
         2014-01-01
         Purpose: Performance optimization of indirect x-ray detectors requires proper characterization of both ionizing (gamma) and optical photon transport in a heterogeneous medium. As the tool of choice for modeling detector physics, Monte Carlo methods have failed to gain traction as a design utility, due mostly to excessive simulation times and a lack of convenient simulation packages. The most important figure-of-merit in assessing detector performance is the detective quantum efficiency (DQE), for which most of the computational burden has traditionally been associated with the determination of the noise power spectrum (NPS) from an ensemble of flood images, each conventionally having 10^7 − 10^9 detected gamma photons. In this work, the authors show that the idealized conditions inherent in a numerical simulation allow for a dramatic reduction in the number of gamma and optical photons required to accurately predict the NPS. Methods: The authors derived an expression for the mean squared error (MSE) of a simulated NPS when computed using the International Electrotechnical Commission-recommended technique based on taking the 2D Fourier transform of flood images. It is shown that the MSE is inversely proportional to the number of flood images, and is independent of the input fluence provided that the input fluence is above a minimal value that avoids biasing the estimate. The authors then propose to further lower the input fluence so that each event creates a point-spread function rather than a flood field. The authors use this finding as the foundation for a novel algorithm in which the characteristic MTF(f), NPS(f), and DQE(f) curves are simultaneously generated from the results of a single run. The authors also investigate lowering the number of optical photons used in a scintillator simulation to further increase efficiency. 
Simulation results are compared with measurements performed on a Varian AS1000 portal imager, and with a previously published simulation performed using clinical fluence levels. Results: On the order of only 10–100 gamma photons per flood image were required to be detected to avoid biasing the NPS estimate. This allowed for a factor of 10^7 reduction in fluence compared to clinical levels with no loss of accuracy. An optimal signal-to-noise ratio (SNR) was achieved by increasing the number of flood images from a typical value of 100 up to 500, thereby illustrating the importance of flood image quantity over the number of gammas per flood. For the point-spread ensemble technique, an additional 2× reduction in the number of incident gammas was realized. As a result, when modeling gamma transport in a thick pixelated array, the simulation time was reduced from 2.5 × 10^6 CPU min if using clinical fluence levels to 3.1 CPU min if using optimized fluence levels while also producing a higher SNR. The AS1000 DQE(f) simulation entailing both optical and radiative transport matched experimental results to within 11%, and required 14.5 min to complete on a single CPU. Conclusions: The authors demonstrate the feasibility of accurately modeling x-ray detector DQE(f) with completion times on the order of several minutes using a single CPU. Convenience of simulation can be achieved using GEANT4 which offers both gamma and optical photon transport capabilities. PMID:24593734
      

      
      Rapid Monte Carlo simulation of detector DQE(f)
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Star-Lack, Josh, E-mail: josh.starlack@varian.com; Sun, Mingshan; Abel, Eric
         2014-03-15
         Purpose: Performance optimization of indirect x-ray detectors requires proper characterization of both ionizing (gamma) and optical photon transport in a heterogeneous medium. As the tool of choice for modeling detector physics, Monte Carlo methods have failed to gain traction as a design utility, due mostly to excessive simulation times and a lack of convenient simulation packages. The most important figure-of-merit in assessing detector performance is the detective quantum efficiency (DQE), for which most of the computational burden has traditionally been associated with the determination of the noise power spectrum (NPS) from an ensemble of flood images, each conventionally having 10^7 − 10^9 detected gamma photons. In this work, the authors show that the idealized conditions inherent in a numerical simulation allow for a dramatic reduction in the number of gamma and optical photons required to accurately predict the NPS. Methods: The authors derived an expression for the mean squared error (MSE) of a simulated NPS when computed using the International Electrotechnical Commission-recommended technique based on taking the 2D Fourier transform of flood images. It is shown that the MSE is inversely proportional to the number of flood images, and is independent of the input fluence provided that the input fluence is above a minimal value that avoids biasing the estimate. The authors then propose to further lower the input fluence so that each event creates a point-spread function rather than a flood field. The authors use this finding as the foundation for a novel algorithm in which the characteristic MTF(f), NPS(f), and DQE(f) curves are simultaneously generated from the results of a single run. The authors also investigate lowering the number of optical photons used in a scintillator simulation to further increase efficiency. 
Simulation results are compared with measurements performed on a Varian AS1000 portal imager, and with a previously published simulation performed using clinical fluence levels. Results: On the order of only 10–100 gamma photons per flood image were required to be detected to avoid biasing the NPS estimate. This allowed for a factor of 10^7 reduction in fluence compared to clinical levels with no loss of accuracy. An optimal signal-to-noise ratio (SNR) was achieved by increasing the number of flood images from a typical value of 100 up to 500, thereby illustrating the importance of flood image quantity over the number of gammas per flood. For the point-spread ensemble technique, an additional 2× reduction in the number of incident gammas was realized. As a result, when modeling gamma transport in a thick pixelated array, the simulation time was reduced from 2.5 × 10^6 CPU min if using clinical fluence levels to 3.1 CPU min if using optimized fluence levels while also producing a higher SNR. The AS1000 DQE(f) simulation entailing both optical and radiative transport matched experimental results to within 11%, and required 14.5 min to complete on a single CPU. Conclusions: The authors demonstrate the feasibility of accurately modeling x-ray detector DQE(f) with completion times on the order of several minutes using a single CPU. Convenience of simulation can be achieved using GEANT4 which offers both gamma and optical photon transport capabilities.
      

      
      Subsonic Aircraft With Regression and Neural-Network Approximators Designed
      NASA Technical Reports Server (NTRS)
      Patnaik, Surya N.; Hopkins, Dale A.
         2004-01-01
         At the NASA Glenn Research Center, NASA Langley Research Center's Flight Optimization System (FLOPS) and the design optimization testbed COMETBOARDS with regression and neural-network-analysis approximators have been coupled to obtain a preliminary aircraft design methodology. For a subsonic aircraft, the optimal design, that is the airframe-engine combination, is obtained by the simulation. The aircraft is powered by two high-bypass-ratio engines with a nominal thrust of about 35,000 lbf. It is to carry 150 passengers at a cruise speed of Mach 0.8 over a range of 3000 n mi and to operate on a 6000-ft runway. The aircraft design utilized a neural network and a regression-approximations-based analysis tool, along with a multioptimizer cascade algorithm that uses sequential linear programming, sequential quadratic programming, the method of feasible directions, and then sequential quadratic programming again. Optimal aircraft weight versus the number of design iterations is shown. The central processing unit (CPU) time to solution is given. It is shown that the regression-method-based analyzer exhibited a smoother convergence pattern than the FLOPS code. The optimum weight obtained by the approximation technique and the FLOPS code differed by 1.3 percent. Prediction by the approximation technique exhibited no error for the aircraft wing area and turbine entry temperature, whereas it was within 2 percent for most other parameters. Cascade strategy was required by FLOPS as well as the approximators. The regression method had a tendency to hug the data points, whereas the neural network exhibited a propensity to follow a mean path. The performance of the neural network and regression methods was considered adequate. It was at about the same level for small, standard, and large models with redundancy ratios (defined as the number of input-output pairs to the number of unknown coefficients) of 14, 28, and 57, respectively. 
On an SGI Octane workstation (Silicon Graphics, Inc., Mountain View, CA), the regression training required a fraction of a CPU second, whereas neural network training was between 1 and 9 min, as given. For a single analysis cycle, the 3-sec CPU time required by the FLOPS code was reduced to milliseconds by the approximators. For design calculations, the time with the FLOPS code was 34 min. It was reduced to 2 sec with the regression method and to 4 min by the neural network technique. The performance of the regression and neural network methods was found to be satisfactory for the analysis and design optimization of the subsonic aircraft.
      

      
      VAXCMS - VAX CONTINUOUS MONITORING SYSTEM, VERSION 2.2
      NASA Technical Reports Server (NTRS)
      Farkas, L.
         1994-01-01
         The VAX Continuous Monitoring System (VAXCMS) was developed at NASA Headquarters to aid system managers in monitoring the performance of VAX systems through the generation of graphic images which summarize trends in performance metrics over time. Since its initial development, VAXCMS has been extensively modified at the NASA Lewis Research Center. Data is produced by utilizing the VMS MONITOR utility to collect the performance data, and then feeding the data through custom-developed linkages to the Computer Associates' TELL-A-GRAF computer graphics software to generate the chart images for analysis by the system manager. The VMS ACCOUNTING utility is also utilized to gather interactive process information. The charts that are generated by VAXCMS are: 1) CPU modes for each node over the most recent four month period 2) CPU modes for the cluster as a whole using a weighted average of all the nodes in the cluster based on processing power 3) Percent of primary memory in use for each node over the most recent four month period 4) Interactive processes for all nodes over the most recent four month period 5) Daily, weekly, and monthly, performance summaries for CPU modes, percent of primary memory in use, and page fault rates for each node 6) Daily disk I/O performance data plotting Average Disk I/O Response Time based on I/O Operation Rate and Queue Length. VAXCMS is written in DCL and VAX FORTRAN for use with DEC VAX series computers running VMS 5.1 or later. This program requires the TELL-A-GRAF graphics package in order to generate plots of system data. A FORTRAN compiler is required. The standard distribution medium for VAXCMS is a 9-track 1600 BPI magnetic tape in DEC VAX BACKUP format. It is also available on a TK50 tape cartridge in DEC VAX BACKUP format. An electronic copy of the documentation in ASCII format is included on the distribution medium. Portions of this code are copyrighted by Mr. David Lavery and are distributed with his permission. 
These portions of the code may not be redistributed commercially.
      

      
      Method and apparatus for measuring spatial uniformity of radiation
      DOEpatents
      Field, Halden
         2002-01-01
         A method and apparatus for measuring the spatial uniformity of the intensity of a radiation beam from a radiation source based on a single sampling time and/or a single pulse of radiation. The measuring apparatus includes a plurality of radiation detectors positioned on a planar mounting plate to form a radiation receiving area that has a shape and size approximating the size and shape of the cross section of the radiation beam. The detectors concurrently receive portions of the radiation beam and transmit electrical signals representative of the intensity of impinging radiation to a signal processor circuit connected to each of the detectors and adapted to concurrently receive the electrical signals from the detectors and process with a central processing unit (CPU) the signals to determine intensities of the radiation impinging at each detector location. The CPU displays the determined intensities and relative intensity values corresponding to each detector location to an operator of the measuring apparatus on an included data display device. Concurrent sampling of each detector is achieved by connecting to each detector a sample and hold circuit that is configured to track the signal and store it upon receipt of a "capture" signal. A switching device then selectively retrieves the signals and transmits the signals to the CPU through a single analog to digital (A/D) converter. The "capture" signal is then removed from the sample-and-hold circuits. Alternatively, concurrent sampling is achieved by providing an A/D converter for each detector, each of which transmits a corresponding digital signal to the CPU. The sampling or reading of the detector signals can be controlled by the CPU or a level-detection and timing circuit.
      

        
       
          


          
        

           
           
             
               
      
      CUDASW++ 3.0: accelerating Smith-Waterman protein database search by coupling CPU and GPU SIMD instructions.
      PubMed
      Liu, Yongchao; Wirawan, Adrianto; Schmidt, Bertil
         2013-04-04
         The maximal sensitivity for local alignments makes the Smith-Waterman algorithm a popular choice for protein sequence database search based on pairwise alignment. However, the algorithm is compute-intensive due to a quadratic time complexity. Corresponding runtimes are further compounded by the rapid growth of sequence databases. We present CUDASW++ 3.0, a fast Smith-Waterman protein database search algorithm, which couples CPU and GPU SIMD instructions and carries out concurrent CPU and GPU computations. For the CPU computation, this algorithm employs SSE-based vector execution units as accelerators. For the GPU computation, we have investigated for the first time a GPU SIMD parallelization, which employs CUDA PTX SIMD video instructions to gain more data parallelism beyond the SIMT execution model. Moreover, sequence alignment workloads are automatically distributed over CPUs and GPUs based on their respective compute capabilities. Evaluation on the Swiss-Prot database shows that CUDASW++ 3.0 achieves a performance improvement over CUDASW++ 2.0 of up to 2.9 and 3.2 times, with a maximum performance of 119.0 and 185.6 GCUPS, on a single-GPU GeForce GTX 680 and a dual-GPU GeForce GTX 690 graphics card, respectively. In addition, our algorithm has demonstrated significant speedups over other top-performing tools: SWIPE and BLAST+. CUDASW++ 3.0 is written in CUDA C++ and PTX assembly languages, targeting GPUs based on the Kepler architecture. This algorithm obtains significant speedups over its predecessor, CUDASW++ 2.0, by benefiting from the use of CPU and GPU SIMD instructions as well as the concurrent execution on CPUs and GPUs. The source code and the simulated data are available at http://cudasw.sourceforge.net.
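To make the quadratic cost mentioned in this abstract concrete, here is a minimal, unoptimized sketch of Smith-Waterman local alignment scoring. It is not the CUDASW++ implementation: it assumes a simple match/mismatch score and a linear gap penalty (real protein searches typically use a substitution matrix and affine gaps), and the function name and default scores are illustrative.

```python
def smith_waterman_score(a: str, b: str,
                         match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score of a and b (linear gap penalty).

    Fills a len(a) x len(b) dynamic-programming table row by row, so the
    running time is proportional to len(a) * len(b): the quadratic cost.
    Only the previous row is kept, since the score alone is needed.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("XXACGTXX", "ACGT"))
```

Each table cell depends only on its left, upper, and upper-left neighbours, which is the property that SIMD and GPU implementations exploit to update many cells at once.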
      

      
      Requiring Writing, Ensuring Distance-Learning Outcomes.
      ERIC Educational Resources Information Center
      Mulligan, Roark; Geary, Susan
         1999-01-01
         Discussion of the quality of distance learning programs focuses on a study of Christopher Newport University (CPU ONLINE) that showed learning outcomes in online courses were as good as outcomes in traditional courses. Describes online writing requirements that helped students develop critical thinking skills. (Author/LRW)
      

      
      Application of graphics processing units to search pipelines for gravitational waves from coalescing binaries of compact objects
      NASA Astrophysics Data System (ADS)
      Chung, Shin Kee; Wen, Linqing; Blair, David; Cannon, Kipp; Datta, Amitava
         2010-07-01
         We report a novel application of a graphics processing unit (GPU) for the purpose of accelerating the search pipelines for gravitational waves from coalescing binaries of compact objects. A speed-up of 16-fold in total has been achieved with an NVIDIA GeForce 8800 Ultra GPU card compared with one core of a 2.5 GHz Intel Q9300 central processing unit (CPU). We show that substantial improvements are possible and discuss the reduction in CPU count required for the detection of inspiral sources afforded by the use of GPUs.
      

      
      Fast 3D shape screening of large chemical databases through alignment-recycling
      PubMed Central
      Fontaine, Fabien; Bolton, Evan; Borodina, Yulia; Bryant, Stephen H
         2007-01-01
         Background: Large chemical databases require fast, efficient, and simple ways of looking for similar structures. Although such tasks are now fairly well resolved for graph-based similarity queries, they remain an issue for 3D approaches, particularly for those based on 3D shape overlays. Inspired by a recent technique developed to compare molecular shapes, we designed a hybrid methodology, alignment-recycling, that enables efficient retrieval and alignment of structures with similar 3D shapes. Results: Using a dataset of more than one million PubChem compounds of limited size (< 28 heavy atoms) and flexibility (< 6 rotatable bonds), we obtained a set of a few thousand diverse structures covering entirely the 3D shape space of the conformers of the dataset. Transformation matrices gathered from the overlays between these diverse structures and the 3D conformer dataset allowed us to drastically (100-fold) reduce the CPU time required for shape overlay. The alignment-recycling heuristic produces results consistent with de novo alignment calculation, with better than 80% hit list overlap on average. Conclusion: Overlay-based 3D methods are computationally demanding when searching large databases. Alignment-recycling reduces the CPU time to perform shape similarity searches by breaking the alignment problem into three steps: selection of diverse shapes to describe the database shape-space; overlay of the database conformers to the diverse shapes; and non-optimized overlay of query and database conformers using common reference shapes. The precomputation, required by the first two steps, is a significant cost of the method; however, once performed, querying is two orders of magnitude faster. Extensions and variations of this methodology, for example, to handle more flexible and larger small-molecules are discussed. PMID:17880744
      

      
      Performance and accuracy of criticality calculations performed using WARP – A framework for continuous energy Monte Carlo neutron transport in general 3D geometries on GPUs
      DOE PAGES
      Bergmann, Ryan M.; Rowland, Kelly L.; Radnović, Nikola; ...
         2017-05-01
         In this companion paper to "Algorithmic Choices in WARP - A Framework for Continuous Energy Monte Carlo Neutron Transport in General 3D Geometries on GPUs" (doi:10.1016/j.anucene.2014.10.039), the WARP Monte Carlo neutron transport framework for graphics processing units (GPUs) is benchmarked against production-level central processing unit (CPU) Monte Carlo neutron transport codes for both performance and accuracy. We compare neutron flux spectra, multiplication factors, runtimes, speedup factors, and costs of various GPU and CPU platforms running either WARP, Serpent 2.1.24, or MCNP 6.1. WARP compares well with the results of the production-level codes, and it is shown that on the newest hardware considered, GPU platforms running WARP are between 0.8 and 7.6 times as fast as CPU platforms running production codes. Also, the GPU platforms running WARP were between 15% and 50% as expensive to purchase and between 80% and 90% as expensive to operate as equivalent CPU platforms performing at an equal simulation rate.
      

      
      Performance and accuracy of criticality calculations performed using WARP – A framework for continuous energy Monte Carlo neutron transport in general 3D geometries on GPUs
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Bergmann, Ryan M.; Rowland, Kelly L.; Radnović, Nikola
         
         In this companion paper to "Algorithmic Choices in WARP - A Framework for Continuous Energy Monte Carlo Neutron Transport in General 3D Geometries on GPUs" (doi:10.1016/j.anucene.2014.10.039), the WARP Monte Carlo neutron transport framework for graphics processing units (GPUs) is benchmarked against production-level central processing unit (CPU) Monte Carlo neutron transport codes for both performance and accuracy. We compare neutron flux spectra, multiplication factors, runtimes, speedup factors, and costs of various GPU and CPU platforms running either WARP, Serpent 2.1.24, or MCNP 6.1. WARP compares well with the results of the production-level codes, and it is shown that on the newest hardware considered, GPU platforms running WARP are between 0.8 and 7.6 times as fast as CPU platforms running production codes. Also, the GPU platforms running WARP were between 15% and 50% as expensive to purchase and between 80% and 90% as expensive to operate as equivalent CPU platforms performing at an equal simulation rate.
      

      
      Accelerating Monte Carlo simulations of photon transport in a voxelized geometry using a massively parallel graphics processing unit.
      PubMed
      Badal, Andreu; Badano, Aldo
         2009-11-01
         It is a known fact that Monte Carlo simulations of radiation transport are computationally intensive and may require long computing times. The authors introduce a new paradigm for the acceleration of Monte Carlo simulations: The use of a graphics processing unit (GPU) as the main computing device instead of a central processing unit (CPU). A GPU-based Monte Carlo code that simulates photon transport in a voxelized geometry with the accurate physics models from PENELOPE has been developed using the CUDA™ programming model (NVIDIA Corporation, Santa Clara, CA). An outline of the new code and a sample x-ray imaging simulation with an anthropomorphic phantom are presented. A remarkable 27-fold speedup factor was obtained using a GPU compared to a single-core CPU. The reported results show that GPUs are currently a good alternative to CPUs for the simulation of radiation transport. Since the performance of GPUs is currently increasing at a faster pace than that of CPUs, the advantages of GPU-based software are likely to be more pronounced in the future.
      

      
      AMITIS: A 3D GPU-Based Hybrid-PIC Model for Space and Plasma Physics
      NASA Astrophysics Data System (ADS)
      Fatemi, Shahab; Poppe, Andrew R.; Delory, Gregory T.; Farrell, William M.
         2017-05-01
         We have developed, for the first time, an advanced modeling infrastructure in space simulations (AMITIS) with an embedded three-dimensional self-consistent grid-based hybrid model of plasma (kinetic ions and fluid electrons) that runs entirely on graphics processing units (GPUs). The model uses NVIDIA GPUs and their associated parallel computing platform, CUDA, developed for general purpose processing on GPUs. The model uses a single CPU-GPU pair, where the CPU transfers data between the system and GPU memory, executes CUDA kernels, and writes simulation outputs on the disk. All computations, including moving particles, calculating macroscopic properties of particles on a grid, and solving hybrid model equations are processed on a single GPU. We explain various computing kernels within AMITIS and compare their performance with an already existing well-tested hybrid model of plasma that runs in parallel using multi-CPU platforms. We show that AMITIS runs ∼10 times faster than the parallel CPU-based hybrid model. We also introduce an implicit solver for computation of Faraday’s Equation, resulting in an explicit-implicit scheme for the hybrid model equation. We show that the proposed scheme is stable and accurate. We examine the AMITIS energy conservation and show that the energy is conserved with an error < 0.2% after 500,000 timesteps, even when a very low number of particles per cell is used.
      

      
      Enhanced round robin CPU scheduling with burst time based time quantum
      NASA Astrophysics Data System (ADS)
      Indusree, J. R.; Prabadevi, B.
         2017-11-01
         Process scheduling is a very important function of an operating system. The best-known process-scheduling algorithms are the First Come First Serve (FCFS) algorithm, the Round Robin (RR) algorithm, the Priority scheduling algorithm and the Shortest Job First (SJF) algorithm. Compared to its peers, the Round Robin (RR) algorithm has the advantage that it gives a fair share of the CPU to the processes which are already in the ready queue. The effectiveness of the RR algorithm greatly depends on the chosen time quantum value. In this research paper, we propose an enhanced process scheduling algorithm called Enhanced Round Robin with Burst-time based Time Quantum (ERRBTQ), which calculates the time quantum from the burst times of the processes already in the ready queue. The experimental results and analysis of the ERRBTQ algorithm clearly indicate improved performance when compared with conventional RR and its variants.
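The dependence of Round Robin on the time quantum can be explored with a small simulation. The sketch below is not the ERRBTQ algorithm, whose quantum formula the abstract does not give; as a stated assumption it recomputes the quantum once per pass as the rounded mean of the remaining burst times in the ready queue, and it assumes all processes arrive at time 0. All function names are hypothetical.

```python
from collections import deque


def round_robin(bursts, quantum_fn):
    """Simulate Round Robin for processes that all arrive at time 0.

    quantum_fn maps the remaining burst times in the ready queue to the
    quantum for the next pass. Returns each process's completion time.
    """
    remaining = list(bursts)
    queue = deque(range(len(bursts)))
    completion = [0] * len(bursts)
    clock = 0
    while queue:
        # Recompute the quantum once per pass over the ready queue.
        quantum = max(1, quantum_fn([remaining[pid] for pid in queue]))
        for _ in range(len(queue)):
            pid = queue.popleft()
            run = min(quantum, remaining[pid])
            clock += run
            remaining[pid] -= run
            if remaining[pid] > 0:
                queue.append(pid)       # not finished: back of the queue
            else:
                completion[pid] = clock
    return completion


def average_waiting_time(bursts, quantum_fn):
    """Mean of (completion time - burst time) over all processes."""
    completion = round_robin(bursts, quantum_fn)
    return sum(c - b for c, b in zip(completion, bursts)) / len(bursts)


def fixed_quantum(remaining):
    return 4  # conventional RR with a constant quantum


def mean_burst_quantum(remaining):
    # Assumed burst-time-based rule, for illustration only.
    return round(sum(remaining) / len(remaining))


bursts = [5, 3, 8]
print(average_waiting_time(bursts, fixed_quantum))
print(average_waiting_time(bursts, mean_burst_quantum))
```

Passing the quantum rule as a function keeps the scheduler loop unchanged, so other burst-time-based rules (median, maximum, and so on) can be compared on the same workload.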
      

      
      IO Management Controller for Time and Space Partitioning Architectures
      NASA Astrophysics Data System (ADS)
      Lachaize, Jerome; Deredempt, Marie-Helene; Galizzi, Julien
         2015-09-01
         Integrated Modular Avionics (IMA) has been industrialized in the aeronautical domain to enable the independent qualification of different software applications from different suppliers on the same generic computer, this computer being a single terminal in a deterministic network. This concept made it possible to distribute the different applications efficiently and transparently across the network, and to size accurately the hardware equipment to embed on the aircraft, through the configuration of the virtual computers and the virtual network. This concept has been studied for the space domain, and requirements have been issued [D04], [D05]. Experiments in the space domain have been done, at the computer level, through ESA and CNES initiatives [D02], [D03]. One possible IMA implementation may use Time and Space Partitioning (TSP) technology. Studies on Time and Space Partitioning [D02] for controlling access to resources such as CPU and memories, and studies on hardware/software interface standardization [D01], showed that for space domain technologies where I/O components (or IP) do not cover advanced features such as buffering, descriptors or virtualization, CPU overhead in terms of performance is mainly due to shared interface management in the execution platform, and to the high frequency of I/O accesses, the latter leading to a large number of context switches. This paper will present a solution to reduce this execution overhead with an open, modular and configurable controller.
      

      
      Computer simulations and real-time control of ELT AO systems using graphical processing units
      NASA Astrophysics Data System (ADS)
      Wang, Lianqi; Ellerbroek, Brent
         2012-07-01
         The adaptive optics (AO) simulations at the Thirty Meter Telescope (TMT) have been carried out using the efficient, C-based multi-threaded adaptive optics simulator (MAOS, http://github.com/lianqiw/maos). By porting time-critical parts of MAOS to graphical processing units (GPU) using NVIDIA CUDA technology, we achieved a 10-fold speedup for each GTX 580 GPU used compared to a modern quad-core CPU. Each time step of a full-scale end-to-end simulation for the TMT narrow field infrared AO system (NFIRAOS) takes only 0.11 second on a desktop with two GTX 580s. We also demonstrate that the TMT minimum variance reconstructor can be assembled in matrix vector multiply (MVM) format in 8 seconds with 8 GTX 580 GPUs, meeting the TMT requirement for updating the reconstructor. Analysis shows that it is also possible to apply the MVM using 8 GTX 580s within the required latency.
      

      
      Beyond where to how: a machine learning approach for sensing mobility contexts using smartphone sensors.
      PubMed
      Guinness, Robert E
         2015-04-28
         This paper presents the results of research on the use of smartphone sensors (namely, GPS and accelerometers), geospatial information (points of interest, such as bus stops and train stations) and machine learning (ML) to sense mobility contexts. Our goal is to develop techniques to continuously and automatically detect a smartphone user's mobility activities, including walking, running, driving and using a bus or train, in real-time or near-real-time (<5 s). We investigated a wide range of supervised learning techniques for classification, including decision trees (DT), support vector machines (SVM), naive Bayes classifiers (NB), Bayesian networks (BN), logistic regression (LR), artificial neural networks (ANN) and several instance-based classifiers (KStar, LWL and IBk). Applying ten-fold cross-validation, the best performers in terms of correct classification rate (i.e., recall) were DT (96.5%), BN (90.9%), LWL (95.5%) and KStar (95.6%). In particular, the DT-algorithm RandomForest exhibited the best overall performance. After a feature selection process for a subset of algorithms, the performance was improved slightly. Furthermore, after tuning the parameters of RandomForest, performance improved to above 97.5%. Lastly, we measured the computational complexity of the classifiers, in terms of central processing unit (CPU) time needed for classification, to provide a rough comparison between the algorithms in terms of battery usage requirements. As a result, the classifiers can be ranked from lowest to highest complexity (i.e., computational cost) as follows: SVM, ANN, LR, BN, DT, NB, IBk, LWL and KStar. The instance-based classifiers take considerably more computational time than the non-instance-based classifiers, whereas the slowest non-instance-based classifier (NB) required about five times as much CPU time as the fastest classifier (SVM). The above results suggest that DT algorithms are excellent candidates for detecting mobility contexts in smartphones, both in terms of performance and computational complexity.
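The ten-fold cross-validation mentioned in the abstract above can be illustrated with a minimal Python sketch. It is not the author's pipeline: the fold layout (contiguous, unshuffled), the function names and the majority-class stand-in classifier are all assumptions made for illustration, and any real classifier (DT, SVM, and so on) would take the place of the stand-in.

```python
from collections import Counter


def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def majority_class_predictor(train_labels):
    """Hypothetical stand-in classifier: always predicts the most common training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda _features: most_common


def cross_validated_rate(features, labels, k=10):
    """Fraction of held-out samples classified correctly, pooled over all k folds."""
    correct = 0
    for train_idx, test_idx in k_fold_indices(len(labels), k):
        predict = majority_class_predictor([labels[i] for i in train_idx])
        correct += sum(predict(features[i]) == labels[i] for i in test_idx)
    return correct / len(labels)
```

In practice the samples would normally be shuffled (or stratified by class) before splitting; the contiguous folds here only keep the sketch short and deterministic.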
      

      
      Beyond Where to How: A Machine Learning Approach for Sensing Mobility Contexts Using Smartphone Sensors †
      PubMed Central
      Guinness, Robert E.
         2015-01-01
         This paper presents the results of research on the use of smartphone sensors (namely, GPS and accelerometers), geospatial information (points of interest, such as bus stops and train stations) and machine learning (ML) to sense mobility contexts. Our goal is to develop techniques to continuously and automatically detect a smartphone user's mobility activities, including walking, running, driving and using a bus or train, in real-time or near-real-time (<5 s). We investigated a wide range of supervised learning techniques for classification, including decision trees (DT), support vector machines (SVM), naive Bayes classifiers (NB), Bayesian networks (BN), logistic regression (LR), artificial neural networks (ANN) and several instance-based classifiers (KStar, LWL and IBk). Applying ten-fold cross-validation, the best performers in terms of correct classification rate (i.e., recall) were DT (96.5%), BN (90.9%), LWL (95.5%) and KStar (95.6%). In particular, the DT-algorithm RandomForest exhibited the best overall performance. After a feature selection process for a subset of algorithms, the performance was improved slightly. Furthermore, after tuning the parameters of RandomForest, performance improved to above 97.5%. Lastly, we measured the computational complexity of the classifiers, in terms of central processing unit (CPU) time needed for classification, to provide a rough comparison between the algorithms in terms of battery usage requirements. As a result, the classifiers can be ranked from lowest to highest complexity (i.e., computational cost) as follows: SVM, ANN, LR, BN, DT, NB, IBk, LWL and KStar. The instance-based classifiers take considerably more computational time than the non-instance-based classifiers, whereas the slowest non-instance-based classifier (NB) required about five times as much CPU time as the fastest classifier (SVM). The above results suggest that DT algorithms are excellent candidates for detecting mobility contexts in smartphones, both in terms of performance and computational complexity. PMID:25928060
      

      
      Study of data I/O performance on distributed disk system in mask data preparation
      NASA Astrophysics Data System (ADS)
      Ohara, Shuichiro; Odaira, Hiroyuki; Chikanaga, Tomoyuki; Hamaji, Masakazu; Yoshioka, Yasuharu
         2010-09-01
         Data volume is growing every day in Mask Data Preparation (MDP), while faster data handling is always required. MDP flows typically introduce a Distributed Processing (DP) system to meet this demand, because using hundreds of CPUs is a reasonable solution. However, even if the number of CPUs is increased, the throughput may saturate because hard disk I/O and network speeds can become bottlenecks. MDP therefore needs heavy investment not only in hundreds of CPUs but also in storage and network devices that raise the throughput. NCS would like to introduce a new distributed processing system called "NDE". NDE can act as a distributed disk system that raises the throughput without heavy investment, because it is designed to use multiple conventional hard drives appropriately over a network. In this paper, NCS studies I/O performance with the OASIS® data format on NDE, which contributes to achieving high throughput.
      

      
      Double dissociation of the anterior and posterior dorsomedial caudate-putamen in the acquisition and expression of associative learning with the nicotine stimulus.
      PubMed
      Charntikov, Sergios; Pittenger, Steven T; Swalve, Natashia; Li, Ming; Bevins, Rick A
         2017-07-15
         Tobacco use is the leading cause of preventable deaths worldwide. This habit is not only debilitating to individual users but also to those around them (second-hand smoking). Nicotine is the main addictive component of tobacco products and is a moderate stimulant and a mild reinforcer. Importantly, besides its unconditional effects, nicotine also has conditioned stimulus effects that may contribute to the tenacity of the smoking habit. Because the neurobiological substrates underlying these processes are virtually unexplored, the present study investigated the functional involvement of the dorsomedial caudate putamen (dmCPu) in learning processes with nicotine as an interoceptive stimulus. Rats were trained using the discriminated goal-tracking task where nicotine injections (0.4 mg/kg; SC), on some days, were paired with intermittent (36 per session) sucrose deliveries; sucrose was not available on interspersed saline days. Pre-training excitotoxic or post-training transient lesions of anterior or posterior dmCPu were used to elucidate the role of these areas in acquisition or expression of associative learning with nicotine stimulus. Pre-training lesion of p-dmCPu inhibited acquisition while post-training lesions of p-dmCPu attenuated the expression of associative learning with the nicotine stimulus. On the other hand, post-training lesions of a-dmCPu evoked nicotine-like responding following saline treatment indicating the role of this area in disinhibition of learned motor behaviors. These results, for the first time, show functionally distinct involvement of a- and p-dmCPu in various stages of associative learning using nicotine stimulus and provide an initial account of neural plasticity underlying these learning processes. Copyright © 2017 Elsevier Ltd. All rights reserved.
      

      
      Free-Space Optical Interconnect Employing VCSEL Diodes
      NASA Technical Reports Server (NTRS)
      Simons, Rainee N.; Savich, Gregory R.; Torres, Heidi
         2009-01-01
         Sensor signal processing is widely used on aircraft and spacecraft. The scheme employs multiple input/output (I/O) nodes for data acquisition and CPU (central processing unit) nodes for data processing. To connect I/O nodes and CPU nodes, scalable interconnections such as backplanes are desired because the number of nodes depends on the requirements of each mission. An optical backplane consisting of vertical-cavity surface-emitting lasers (VCSELs), VCSEL drivers, photodetectors, and transimpedance amplifiers is the preferred approach since it can handle several hundred megabits per second of data throughput. The next generation of satellite-borne systems will require transceivers and processors that can handle several Gb/s of data. Optical interconnects have been praised for both their speed and functionality with hopes that light can relieve the electrical bottleneck predicted for the near future. Optoelectronic interconnects provide a factor of ten improvement over electrical interconnects.
      

      
      Finite difference numerical method for the superlattice Boltzmann transport equation and case comparison of CPU(C) and GPU(CUDA) implementations
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Priimak, Dmitri
         2014-12-01
         We present a finite difference numerical algorithm for solving the two-dimensional spatially homogeneous Boltzmann transport equation, which describes electron transport in a semiconductor superlattice subject to crossed time-dependent electric and constant magnetic fields. The algorithm is implemented both in C targeted at CPUs and in CUDA C targeted at commodity NVIDIA GPUs. We compare the performance and merits of the two implementations and discuss various software optimisation techniques.
      

      
      The development of an interim generalized gate logic software simulator
      NASA Technical Reports Server (NTRS)
      Mcgough, J. G.; Nemeroff, S.
         1985-01-01
         A proof-of-concept computer program called IGGLOSS (Interim Generalized Gate Logic Software Simulator) was developed and is discussed. The simulator engine was designed to perform stochastic estimation of self test coverage (fault-detection latency times) of digital computers or systems. A major attribute of the IGGLOSS is its high-speed simulation: 9.5 x 1,000,000 gates/cpu sec for nonfaulted circuits and 4.4 x 1,000,000 gates/cpu sec for faulted circuits on a VAX 11/780 host computer.
      

      
      Acoustic reverse-time migration using GPU card and POSIX thread based on the adaptive optimal finite-difference scheme and the hybrid absorbing boundary condition
      NASA Astrophysics Data System (ADS)
      Cai, Xiaohui; Liu, Yang; Ren, Zhiming
         2018-06-01
         Reverse-time migration (RTM) is a powerful tool for imaging geologically complex structures such as steep-dip and subsalt. However, its implementation is quite computationally expensive. Recently, as a low-cost solution, the graphic processing unit (GPU) was introduced to improve the efficiency of RTM. In the paper, we develop three ameliorative strategies to implement RTM on GPU card. First, given the high accuracy and efficiency of the adaptive optimal finite-difference (FD) method based on least squares (LS) on central processing unit (CPU), we study the optimal LS-based FD method on GPU. Second, we develop the CPU-based hybrid absorbing boundary condition (ABC) to the GPU-based one by addressing two issues of the former when introduced to GPU card: time-consuming and chaotic threads. Third, for large-scale data, the combinatorial strategy for optimal checkpointing and efficient boundary storage is introduced for the trade-off between memory and recomputation. To save the time of communication between host and disk, the portable operating system interface (POSIX) thread is utilized to create the other CPU core at the checkpoints. Applications of the three strategies on GPU with the compute unified device architecture (CUDA) programming language in RTM demonstrate their efficiency and validity.
      

      
      Association between problematic cellular phone use and suicide: the moderating effect of family function and depression.
      PubMed
      Wang, Peng-Wei; Liu, Tai-Ling; Ko, Chih-Hung; Lin, Huang-Chi; Huang, Mei-Feng; Yeh, Yi-Chun; Yen, Cheng-Fang
         2014-02-01
         Suicidal ideation and attempt among adolescents are risk factors for eventual completed suicide. Cellular phone use (CPU) has markedly changed the everyday lives of adolescents. Issues about how cellular phone use relates to adolescent mental health, such as suicidal ideation and attempts, are important because of the high rate of cellular phone usage among children in that age group. This study explored the association between problematic CPU and suicidal ideation and attempts among adolescents and investigated how family function and depression influence the association between problematic CPU and suicidal ideation and attempts. A total of 5051 (2872 girls and 2179 boys) adolescents who owned at least one cellular phone completed the research questionnaires. We collected data on participants' CPU and suicidal behavior (ideation and attempts) during the past month as well as information on family function and history of depression. Five hundred thirty-two adolescents (10.54%) had problematic CPU. The rates of suicidal ideation were 23.50% and 11.76% in adolescents with problematic CPU and without problematic CPU, respectively. The rates of suicidal attempts in both groups were 13.70% and 5.45%, respectively. Family function, but not depression, had a moderating effect on the association between problematic CPU and suicidal ideation and attempt. This study highlights the association between problematic CPU and suicidal ideation as well as attempts and indicates that good family function may have a more significant role on reducing the risks of suicidal ideation and attempts in adolescents with problematic CPU than in those without problematic CPU. © 2014.
      

        
       
          


          
        

           
           
             
               
      
      High-performance computing on GPUs for resistivity logging of oil and gas wells
      NASA Astrophysics Data System (ADS)
      Glinskikh, V.; Dudaev, A.; Nechaev, O.; Surodina, I.
         2017-10-01
         We developed and implemented in software an algorithm for high-performance simulation of electrical logs from oil and gas wells using high-performance heterogeneous computing. The numerical solution of the 2D forward problem is based on the finite-element method and the Cholesky decomposition for solving a system of linear algebraic equations (SLAE). Software implementations of the algorithm use NVIDIA CUDA technology and computing libraries, allowing us to perform the decomposition of the SLAE and find its solution on the central processing unit (CPU) and the graphics processing unit (GPU). The calculation time is analyzed as a function of the matrix size and the number of its non-zero elements. We estimated the computing speed on CPU and GPU, including high-performance heterogeneous CPU-GPU computing. Using the developed algorithm, we simulated resistivity data in realistic models.
      

      
      Analysis OpenMP performance of AMD and Intel architecture for breaking waves simulation using MPS
      NASA Astrophysics Data System (ADS)
      Alamsyah, M. N. A.; Utomo, A.; Gunawan, P. H.
         2018-03-01
         A simulation of breaking waves using the Navier-Stokes equations via the moving particle semi-implicit (MPS) method over a closed domain is given. The results show that parallel computing on a multicore architecture using the OpenMP platform can reduce the computational time to almost half of the serial time. Two computer architectures (AMD and Intel) are compared. The Intel architecture gives better results than the AMD architecture in CPU time. However, in efficiency, the computer with the AMD architecture scores slightly higher than the Intel one. For the simulation with 1512 particles, the CPU times using Intel and AMD are 12662.47 and 28282.30, respectively. Moreover, for a similar number of particles, AMD obtains an efficiency of 50.09% and Intel up to 49.42%.
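The abstract above reports both CPU time and efficiency without stating how efficiency is defined. A minimal Python sketch, assuming the common definition (speedup divided by the number of threads), is given below. The function names and the timings used in the usage note are hypothetical and are not taken from the paper.

```python
def speedup(serial_time, parallel_time):
    """Ratio of serial run time to parallel run time (same units for both)."""
    if serial_time <= 0 or parallel_time <= 0:
        raise ValueError("run times must be positive")
    return serial_time / parallel_time


def parallel_efficiency(serial_time, parallel_time, n_threads):
    """Speedup divided by thread count; 1.0 would be ideal linear scaling."""
    if n_threads < 1:
        raise ValueError("n_threads must be at least 1")
    return speedup(serial_time, parallel_time) / n_threads
```

For example, with hypothetical timings of 100 s serial and 50 s on 4 threads, `speedup(100.0, 50.0)` gives 2.0 and `parallel_efficiency(100.0, 50.0, 4)` gives 0.5, that is, 50% efficiency.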
      

      
      GPU Computing in Bayesian Inference of Realized Stochastic Volatility Model
      NASA Astrophysics Data System (ADS)
      Takaishi, Tetsuya
         2015-01-01
         The realized stochastic volatility (RSV) model, which utilizes the realized volatility as additional information, has been proposed to infer the volatility of financial time series. We consider the Bayesian inference of the RSV model by the Hybrid Monte Carlo (HMC) algorithm. The HMC algorithm can be parallelized and thus performed on the GPU for speedup. The GPU code is developed with CUDA Fortran. We compare the computational time of the HMC algorithm on a GPU (GTX 760) and a CPU (Intel i7-4770 3.4GHz) and find that the GPU can be up to 17 times faster than the CPU. We also code the program with OpenACC and find that appropriate coding can achieve a speedup similar to that of CUDA Fortran.
      

      
      Shadow: Running Tor in a Box for Accurate and Efficient Experimentation
      DTIC Science & Technology
      
         2011-09-23
         Modeling the speed of a target CPU is done by running an OpenSSL [31] speed test on a real CPU of that type. This provides us with the raw CPU processing...rate, but we are also interested in the processing speed of an application. By running application 5 benchmarks on the same CPU as the OpenSSL speed test...simulation, saving CPU cycles on our simulation host machine. Shadow removes cryptographic processing by preloading the main OpenSSL [31] functions used
      

      
      Multi-GPU and multi-CPU accelerated FDTD scheme for vibroacoustic applications
      NASA Astrophysics Data System (ADS)
      Francés, J.; Otero, B.; Bleda, S.; Gallego, S.; Neipp, C.; Márquez, A.; Beléndez, A.
         2015-06-01
         The Finite-Difference Time-Domain (FDTD) method is applied to the analysis of vibroacoustic problems and to the study of the propagation of longitudinal and transversal waves in stratified media. The potential of the scheme and the relevance of each acceleration strategy for massive computations in FDTD are demonstrated in this work. We propose two new specific implementations of the bi-dimensional scheme of the FDTD method using multi-CPU and multi-GPU, respectively. In the first implementation, an open source message passing interface (OMPI) has been included in order to exploit fully the resources of a biprocessor station with two Intel Xeon processors. Moreover, in the CPU code version, the streaming SIMD extensions (SSE) and the advanced vector extensions (AVX) have been included together with shared memory approaches that take advantage of multi-core platforms. The second implementation, called the multi-GPU code version, is based on the Peer-to-Peer communications available in CUDA on two GPUs (NVIDIA GTX 670). This paper then presents an accurate analysis of the influence of the different code versions, including shared memory approaches, vector instructions and multi-processors (both CPU and GPU), and compares them in order to delimit the degree of improvement obtained with distributed solutions based on multi-CPU and multi-GPU. The performance of both approaches was analysed, and it has been demonstrated that adding shared memory schemes to CPU computing substantially improves the performance of vector instructions, enlarging the simulation sizes that use the cache memory of CPUs efficiently. In this case GPU computing is about twice as fast as the fine-tuned CPU version, in both the one-node and two-node cases. However, for massive computations explicit vector instructions are not worthwhile, since the memory bandwidth is the limiting factor and the performance tends to be the same as that of the sequential version with auto-vectorisation and the shared memory approach. In this scenario GPU computing is the best option since it provides homogeneous behaviour. More specifically, the speedup of GPU computing reaches an upper limit of 12 for both one and two GPUs, whereas the performance reaches peak values of 80 GFlops and 146 GFlops for one GPU and two GPUs, respectively. Finally, the method is applied to an earth crust profile in order to demonstrate the potential of our approach and the necessity of applying acceleration strategies in these types of applications.
      

      
      Parallel hyperbolic PDE simulation on clusters: Cell versus GPU
      NASA Astrophysics Data System (ADS)
      Rostrup, Scott; De Sterck, Hans
         2010-12-01
         Increasingly, high-performance computing is looking towards data-parallel computational devices to enhance computational performance. Two technologies that have received significant attention are IBM's Cell Processor and NVIDIA's CUDA programming model for graphics processing unit (GPU) computing. In this paper we investigate the acceleration of parallel hyperbolic partial differential equation simulation on structured grids with explicit time integration on clusters with Cell and GPU backends. The message passing interface (MPI) is used for communication between nodes at the coarsest level of parallelism. Optimizations of the simulation code at the several finer levels of parallelism that the data-parallel devices provide are described in terms of data layout, data flow and data-parallel instructions. Optimized Cell and GPU performance are compared with reference code performance on a single x86 central processing unit (CPU) core in single and double precision. We further compare the CPU, Cell and GPU platforms on a chip-to-chip basis, and compare performance on single cluster nodes with two CPUs, two Cell processors or two GPUs in a shared memory configuration (without MPI). We finally compare performance on clusters with 32 CPUs, 32 Cell processors, and 32 GPUs using MPI. Our GPU cluster results use NVIDIA Tesla GPUs with GT200 architecture, but some preliminary results on recently introduced NVIDIA GPUs with the next-generation Fermi architecture are also included. This paper provides computational scientists and engineers who are considering porting their codes to accelerator environments with insight into how structured grid based explicit algorithms can be optimized for clusters with Cell and GPU accelerators. It also provides insight into the speed-up that may be gained on current and future accelerator architectures for this class of applications. 
         Program summary
         Program title: SWsolver
         Catalogue identifier: AEGY_v1_0
         Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEGY_v1_0.html
         Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
         Licensing provisions: GPL v3
         No. of lines in distributed program, including test data, etc.: 59 168
         No. of bytes in distributed program, including test data, etc.: 453 409
         Distribution format: tar.gz
         Programming language: C, CUDA
         Computer: Parallel Computing Clusters. Individual compute nodes may consist of x86 CPU, Cell processor, or x86 CPU with attached NVIDIA GPU accelerator.
         Operating system: Linux
         Has the code been vectorised or parallelized?: Yes. Tested on 1-128 x86 CPU cores, 1-32 Cell Processors, and 1-32 NVIDIA GPUs.
         RAM: Tested on problems requiring up to 4 GB per compute node.
         Classification: 12
         External routines: MPI, CUDA, IBM Cell SDK
         Nature of problem: MPI-parallel simulation of Shallow Water equations using high-resolution 2D hyperbolic equation solver on regular Cartesian grids for x86 CPU, Cell Processor, and NVIDIA GPU using CUDA.
         Solution method: SWsolver provides 3 implementations of a high-resolution 2D Shallow Water equation solver on regular Cartesian grids, for CPU, Cell Processor, and NVIDIA GPU. Each implementation uses MPI to divide work across a parallel computing cluster.
         Additional comments: Sub-program numdiff is used for the test run.
      

      
      GRAPEVINE: Grids about anything by Poisson's equation in a visually interactive networking environment
      NASA Technical Reports Server (NTRS)
      Sorenson, Reese L.; Mccann, Karen
         1992-01-01
         A proven 3-D multiple-block elliptic grid generator, designed to run in 'batch mode' on a supercomputer, is improved by the creation of a modern graphical user interface (GUI) running on a workstation. The two parts are connected in real time by a network. The resultant system offers a significant speedup in the process of preparing and formatting input data and the ability to watch the grid solution converge by replotting the grid at each iteration step. The result is a reduction in user time and CPU time required to generate the grid and an enhanced understanding of the elliptic solution process. This software system, called GRAPEVINE, is described, and certain observations are made concerning the creation of such software.
      

      
      Interactive brain shift compensation using GPU based programming
      NASA Astrophysics Data System (ADS)
      van der Steen, Sander; Noordmans, Herke Jan; Verdaasdonk, Rudolf
         2009-02-01
         Processing large image files or real-time video streams requires intense computational power. Driven by the gaming industry, the processing power of graphics processing units (GPUs) has increased significantly. With the pixel shader model 4.0 the GPU can be used for image processing 10x faster than the CPU. Dedicated software was developed to deform 3D MR and CT image sets for real-time brain shift correction during navigated neurosurgery using landmarks or cortical surface traces defined by the navigation pointer. Feedback was given using orthogonal slices and an interactively raytraced 3D brain image. GPU-based programming enables real-time processing of high-definition image datasets, and various applications can be developed in medicine, optics and image sciences.
      

      
      Heterogeneous CPU-GPU moving targets detection for UAV video
      NASA Astrophysics Data System (ADS)
      Li, Maowen; Tang, Linbo; Han, Yuqi; Yu, Chunlei; Zhang, Chao; Fu, Huiquan
         2017-07-01
         Moving target detection is gaining popularity in civilian and military applications. On some motion detection monitoring platforms, low-resolution stationary cameras are being replaced by moving HD cameras mounted on UAVs. The pixels belonging to moving targets in HD video taken by a UAV are always in the minority, and the background of the frame is usually moving because of the motion of the UAV. The high computational cost of the algorithm prevents running it at higher frame resolutions. Hence, to solve the problem of moving target detection in UAV video, we propose a heterogeneous CPU-GPU moving target detection algorithm for UAV video. More specifically, we use background registration to eliminate the impact of the moving background and frame differencing to detect small moving targets. To achieve real-time processing, we design a heterogeneous CPU-GPU framework for our method. The experimental results show that our method can detect the main moving targets in HD video taken by a UAV, and the average processing time is 52.16 ms per frame, which is fast enough to solve the problem.
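The frame-difference step named in the abstract above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: frames are grayscale images stored as nested lists, the background registration step has already aligned them, and the function name and threshold are hypothetical. The paper's GPU implementation is not reproduced here.

```python
def frame_difference_mask(prev_frame, curr_frame, threshold):
    """Return a binary mask marking pixels whose intensity changed by more than threshold.

    Both frames are equal-sized lists of rows of grayscale intensities and are
    assumed to be already registered (aligned) to each other.
    """
    if len(prev_frame) != len(curr_frame):
        raise ValueError("frames must have the same number of rows")
    mask = []
    for prev_row, curr_row in zip(prev_frame, curr_frame):
        if len(prev_row) != len(curr_row):
            raise ValueError("frames must have the same number of columns")
        mask.append([1 if abs(c - p) > threshold else 0
                     for p, c in zip(prev_row, curr_row)])
    return mask
```

Each output pixel depends only on the two input pixels at the same position, which is why this kind of step maps naturally onto one GPU thread per pixel.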
      

      
      Multiprocessing MCNP on an IBM RS/6000 cluster
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      McKinney, G.W.; West, J.T.
         1993-01-01
         The advent of high-performance computer systems has brought to maturity programming concepts like vectorization, multiprocessing, and multitasking. While there are many schools of thought as to the most significant factor in obtaining order-of-magnitude increases in performance, such speedup can only be achieved by integrating the computer system and application code. Vectorization leads to faster manipulation of arrays by overlapping instruction CPU cycles. Discrete ordinates codes, which require the solving of large matrices, have proved to be major beneficiaries of vectorization. Monte Carlo transport, on the other hand, typically contains numerous logic statements and requires extensive redevelopment to benefit from vectorization. Multiprocessing and multitasking provide additional CPU cycles via multiple processors. Such systems are generally designed with either common memory access (multitasking) or distributed memory access. In both cases, theoretical speedup, as a function of the number of processors P and the fraction f of task time that multiprocesses, can be formulated using Amdahl's law: S(f, P) = 1/(1 - f + f/P). However, for most applications, this theoretical limit cannot be achieved because of additional terms (e.g., multitasking overhead, memory overlap, etc.) that are not included in Amdahl's law. Monte Carlo transport is a natural candidate for multiprocessing because the particle tracks are generally independent, and the precision of the result increases as the square root of the number of particles tracked.
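Amdahl's law as stated in the abstract above, S(f, P) = 1/(1 - f + f/P), can be evaluated with a short Python sketch. The function name and the input checks are illustrative assumptions. As the abstract notes, overhead terms are not part of the formula, so the result is an upper bound rather than a prediction.

```python
def amdahl_speedup(f, p):
    """Theoretical speedup S(f, P) = 1 / (1 - f + f / P).

    f: fraction of the task time that can be multiprocessed (0 <= f <= 1)
    p: number of processors (p >= 1)
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie between 0 and 1")
    if p < 1:
        raise ValueError("p must be at least 1")
    return 1.0 / ((1.0 - f) + f / p)
```

As P grows, the speedup approaches 1/(1 - f). With f = 0.9, for instance, no number of processors can give more than a tenfold speedup.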
      

      
      Multiprocessing MCNP on an IBM RS/6000 cluster
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      McKinney, G.W.; West, J.T.
         1993-03-01
         The advent of high-performance computer systems has brought to maturity programming concepts like vectorization, multiprocessing, and multitasking. While there are many schools of thought as to the most significant factor in obtaining order-of-magnitude increases in performance, such speedup can only be achieved by integrating the computer system and application code. Vectorization leads to faster manipulation of arrays by overlapping instruction CPU cycles. Discrete ordinates codes, which require the solving of large matrices, have proved to be major beneficiaries of vectorization. Monte Carlo transport, on the other hand, typically contains numerous logic statements and requires extensive redevelopment to benefit from vectorization. Multiprocessing and multitasking provide additional CPU cycles via multiple processors. Such systems are generally designed with either common memory access (multitasking) or distributed memory access. In both cases, theoretical speedup, as a function of the number of processors (P) and the fraction of task time that multiprocesses (f), can be formulated using Amdahl's Law: S(f, P) = 1/(1 - f + f/P). However, for most applications this theoretical limit cannot be achieved, due to additional terms not included in Amdahl's Law. Monte Carlo transport is a natural candidate for multiprocessing, since the particle tracks are generally independent and the precision of the result increases as the square root of the number of particles tracked.
      

      
      The association between problematic cellular phone use and risky behaviors and low self-esteem among Taiwanese adolescents.
      PubMed
      Yang, Yuan-Sheng; Yen, Ju-Yu; Ko, Chih-Hung; Cheng, Chung-Ping; Yen, Cheng-Fang
         2010-04-28
         Cellular phone use (CPU) is an important part of life for many adolescents. However, problematic CPU may complicate physiological and psychological problems. The aim of our study was to examine the associations between problematic CPU and a series of risky behaviors and low self-esteem in Taiwanese adolescents. A total of 11,111 adolescent students in Southern Taiwan were randomly selected into this study. We used the Problematic Cellular Phone Use Questionnaire to identify the adolescents with problematic CPU. Meanwhile, a series of risky behaviors and self-esteem were evaluated. Multilevel logistic regression analyses were employed to examine the associations between problematic CPU and risky behaviors and low self-esteem regarding gender and age. The results indicated that positive associations were found between problematic CPU and aggression, insomnia, smoking cigarettes, suicidal tendencies, and low self-esteem in all groups with different sexes and ages. However, gender and age differences existed in the associations between problematic CPU and suspension from school, criminal records, tattooing, short nocturnal sleep duration, unprotected sex, illicit drugs use, drinking alcohol and chewing betel nuts. There were positive associations between problematic CPU and a series of risky behaviors and low self-esteem in Taiwanese adolescents. It is worthy for parents and mental health professionals to pay attention to adolescents' problematic CPU.
      

      
      Fast multipurpose Monte Carlo simulation for proton therapy using multi- and many-core CPU architectures.
      PubMed
      Souris, Kevin; Lee, John Aldo; Sterpin, Edmond
         2016-04-01
         Accuracy in proton therapy treatment planning can be improved using Monte Carlo (MC) simulations. However, the long computation time of such methods hinders their use in clinical routine. This work aims to develop a fast multipurpose Monte Carlo simulation tool for proton therapy using massively parallel central processing unit (CPU) architectures. A new Monte Carlo code, called MCsquare (many-core Monte Carlo), has been designed and optimized for the last generation of Intel Xeon processors and Intel Xeon Phi coprocessors. These massively parallel architectures offer the flexibility and the computational power suitable to MC methods. The class-II condensed history algorithm of MCsquare provides a fast and yet accurate method of simulating heavy charged particles such as protons, deuterons, and alphas inside voxelized geometries. Hard ionizations, with energy losses above a user-specified threshold, are simulated individually while soft events are regrouped in a multiple scattering theory. Elastic and inelastic nuclear interactions are sampled from ICRU 63 differential cross sections, thereby allowing for the computation of prompt gamma emission profiles. MCsquare has been benchmarked with the gate/geant4 Monte Carlo application for homogeneous and heterogeneous geometries. Comparisons with gate/geant4 for various geometries show deviations within 2%-1 mm. In spite of the limited memory bandwidth of the coprocessor, simulation time is below 25 s for 10^7 primary 200 MeV protons in average soft tissues using all Xeon Phi and CPU resources embedded in a single desktop unit. MCsquare exploits the flexibility of CPU architectures to provide a multipurpose MC simulation tool. Optimized code enables the use of accurate MC calculation within a reasonable computation time, adequate for clinical practice. MCsquare also simulates prompt gamma emission and can thus also be used for in vivo range verification.
      

      
      Fast Simulation of Dynamic Ultrasound Images Using the GPU.
      PubMed
      Storve, Sigurd; Torp, Hans
         2017-10-01
         Simulated ultrasound data is a valuable tool for development and validation of quantitative image analysis methods in echocardiography. Unfortunately, simulation time can become prohibitive for phantoms consisting of a large number of point scatterers. The COLE algorithm by Gao et al. is a fast convolution-based simulator that trades simulation accuracy for improved speed. We present highly efficient parallelized CPU and GPU implementations of the COLE algorithm with an emphasis on dynamic simulations involving moving point scatterers. We argue that it is crucial to minimize the amount of data transfers from the CPU to achieve good performance on the GPU. We achieve this by storing the complete trajectories of the dynamic point scatterers as spline curves in the GPU memory. This leads to good efficiency when simulating sequences consisting of a large number of frames, such as B-mode and tissue Doppler data for a full cardiac cycle. In addition, we propose a phase-based subsample delay technique that efficiently eliminates flickering artifacts seen in B-mode sequences when COLE is used without enough temporal oversampling. To assess the performance, we used a laptop computer and a desktop computer, each equipped with a multicore Intel CPU and an NVIDIA GPU. Running the simulator on a high-end TITAN X GPU, we observed two orders of magnitude speedup compared to the parallel CPU version, three orders of magnitude speedup compared to simulation times reported by Gao et al. in their paper on COLE, and a speedup of 27000 times compared to the multithreaded version of Field II, using numbers reported in a paper by Jensen. We hope that by releasing the simulator as an open-source project we will encourage its use and further development.
      

      
      A Framework for Batched and GPU-Resident Factorization Algorithms Applied to Block Householder Transformations
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Dong, Tingzing Tim; Tomov, Stanimire Z; Luszczek, Piotr R
         
         As modern hardware keeps evolving, an increasingly effective approach to developing energy efficient and high-performance solvers is to design them to work on many small size and independent problems. Many applications already need this functionality, especially for GPUs, which are currently known to be about four to five times more energy efficient than multicore CPUs. We describe the development of one-sided factorizations that work for a set of small dense matrices in parallel, and we illustrate our techniques on the QR factorization based on Householder transformations. We refer to this mode of operation as a batched factorization. Our approach is based on representing the algorithms as a sequence of batched BLAS routines for GPU-only execution. This is in contrast to the hybrid CPU-GPU algorithms that rely heavily on using the multicore CPU for specific parts of the workload. But for a system to benefit fully from the GPU's significantly higher energy efficiency, avoiding the use of the multicore CPU must be a primary design goal, so the system can rely more heavily on the more efficient GPU. Additionally, this will result in the removal of the costly CPU-to-GPU communication. Furthermore, we do not use a single symmetric multiprocessor (on the GPU) to factorize a single problem at a time. We illustrate how our performance analysis, and the use of profiling and tracing tools, guided the development and optimization of our batched factorization to achieve up to a 2-fold speedup and a 3-fold energy efficiency improvement compared to our highly optimized batched CPU implementations based on the MKL library (when using two sockets of Intel Sandy Bridge CPUs). Compared to a batched QR factorization featured in the CUBLAS library for GPUs, we achieved up to 5x speedup on the K40 GPU.
      

      
      A parallel method of atmospheric correction for multispectral high spatial resolution remote sensing images
      NASA Astrophysics Data System (ADS)
      Zhao, Shaoshuai; Ni, Chen; Cao, Jing; Li, Zhengqiang; Chen, Xingfeng; Ma, Yan; Yang, Leiku; Hou, Weizhen; Qie, Lili; Ge, Bangyu; Liu, Li; Xing, Jin
         2018-03-01
         The remote sensing image is usually polluted by atmosphere components especially like aerosol particles. For the quantitative remote sensing applications, the radiative transfer model based atmospheric correction is used to get the reflectance with decoupling the atmosphere and surface by consuming a long computational time. The parallel computing is a solution method for the temporal acceleration. The parallel strategy which uses multi-CPU to work simultaneously is designed to do atmospheric correction for a multispectral remote sensing image. The parallel framework's flow and the main parallel body of atmospheric correction are described. Then, the multispectral remote sensing image of the Chinese Gaofen-2 satellite is used to test the acceleration efficiency. When the CPU number is increasing from 1 to 8, the computational speed is also increasing. The biggest acceleration rate is 6.5. Under the 8 CPU working mode, the whole image atmospheric correction costs 4 minutes.
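         The scaling figures in this abstract can be turned into two standard metrics. The sketch below assumes that the reported maximum acceleration of 6.5 was obtained with all 8 CPUs (the abstract does not say so explicitly) and computes the parallel efficiency and the Karp-Flatt experimentally determined serial fraction; both function names are illustrative.

```python
def parallel_efficiency(speedup: float, processors: int) -> float:
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / processors


def karp_flatt_serial_fraction(speedup: float, processors: int) -> float:
    """Karp-Flatt metric: e = (1/S - 1/P) / (1 - 1/P), defined for P > 1."""
    if processors < 2:
        raise ValueError("the metric needs at least two processors")
    return (1.0 / speedup - 1.0 / processors) / (1.0 - 1.0 / processors)


if __name__ == "__main__":
    # Assumption: speedup 6.5 corresponds to the 8-CPU run.
    s, p = 6.5, 8
    print("efficiency:", parallel_efficiency(s, p))
    print("serial fraction:", round(karp_flatt_serial_fraction(s, p), 4))
```

         Under that assumption the run is about 81% efficient, with an implied serial (or overhead) share of roughly 3% of the single-CPU runtime.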
      

      
      Convolution of large 3D images on GPU and its decomposition
      NASA Astrophysics Data System (ADS)
      Karas, Pavel; Svoboda, David
         2011-12-01
         In this article, we propose a method for computing convolution of large 3D images. The convolution is performed in a frequency domain using a convolution theorem. The algorithm is accelerated on a graphics card by means of the CUDA parallel computing model. Convolution is decomposed in a frequency domain using the decimation in frequency algorithm. We pay attention to keeping our approach efficient in terms of both time and memory consumption and also in terms of memory transfers between CPU and GPU which have a significant influence on overall computational time. We also study the implementation on multiple GPUs and compare the results between the multi-GPU and multi-CPU implementations.
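         The convolution theorem that this abstract relies on can be checked on a tiny 1D signal. The sketch below is plain Python with a naive O(n^2) DFT, not the authors' CUDA implementation or their decimation-in-frequency decomposition; all function names are illustrative. It shows that multiplying spectra pointwise and transforming back reproduces the circular convolution computed directly.

```python
import cmath


def dft(x, inverse=False):
    """Naive discrete Fourier transform (inverse includes the 1/n scaling)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for m, v in enumerate(x):
            acc += v * cmath.exp(sign * 2j * cmath.pi * k * m / n)
        out.append(acc / n if inverse else acc)
    return out


def circular_convolve_direct(a, b):
    """Circular convolution from the definition."""
    n = len(a)
    return [sum(a[m] * b[(i - m) % n] for m in range(n)) for i in range(n)]


def circular_convolve_spectral(a, b):
    """Circular convolution via pointwise product in the frequency domain."""
    fa, fb = dft(a), dft(b)
    product = [p * q for p, q in zip(fa, fb)]
    return [v.real for v in dft(product, inverse=True)]


if __name__ == "__main__":
    a, b = [1.0, 2.0, 3.0, 4.0], [1.0, 0.0, 0.0, 1.0]
    print(circular_convolve_direct(a, b))
    print([round(v, 6) for v in circular_convolve_spectral(a, b)])
```

         In practice a fast Fourier transform replaces the naive DFT, and linear (non-circular) convolution is obtained by zero-padding both inputs first.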
      

      
      CscoreTool: fast Hi-C compartment analysis at high resolution.
      PubMed
      Zheng, Xiaobin; Zheng, Yixian
         2018-05-01
         The genome-wide chromosome conformation capture (Hi-C) has revealed that the eukaryotic genome can be partitioned into A and B compartments that have distinctive chromatin and transcription features. The current principal component analysis (PCA)-based method for A/B compartment prediction from Hi-C data requires substantial CPU time and memory. We report the development of a method, CscoreTool, which enables fast and memory-efficient determination of A/B compartments at high resolution even in datasets with low sequencing depth. Availability: https://github.com/scoutzxb/CscoreTool. Contact: xzheng@carnegiescience.edu. Supplementary data are available at Bioinformatics online.
      

      
      A Whale of a Tale: Creating Spacecraft Telemetry Data Analysis Products for the Deep Impact Mission
      NASA Technical Reports Server (NTRS)
      Sturdevant, Kathryn
         2006-01-01
         A description of the Whale product generation utility and its means of analyzing project data for Deep Impact Missions is presented. The topics include: 1) Whale Definition; 2) Whale Overview; 3) Whale Challenges; 4) Network Configuration; 5) Network Diagram; 6) Whale Data Flow: Design Decisions; 7) Whale Data Flow Diagram; 8) Whale Data Flow; 9) Whale Team and Users; 10) Creeping Requirements; 11) Whale Competition; 12) Statistics: Processing Time; 13) CPU and Disk Usage; 14) The Ripple Effect of More Data; and 15) Data Validation and the Automation Challenge.
      

      
      The association between problematic cellular phone use and risky behaviors and low self-esteem among Taiwanese adolescents
      PubMed Central
      
         2010-01-01
         Background Cellular phone use (CPU) is an important part of life for many adolescents. However, problematic CPU may complicate physiological and psychological problems. The aim of our study was to examine the associations between problematic CPU and a series of risky behaviors and low self-esteem in Taiwanese adolescents. Methods A total of 11,111 adolescent students in Southern Taiwan were randomly selected into this study. We used the Problematic Cellular Phone Use Questionnaire to identify the adolescents with problematic CPU. Meanwhile, a series of risky behaviors and self-esteem were evaluated. Multilevel logistic regression analyses were employed to examine the associations between problematic CPU and risky behaviors and low self-esteem regarding gender and age. Results The results indicated that positive associations were found between problematic CPU and aggression, insomnia, smoking cigarettes, suicidal tendencies, and low self-esteem in all groups with different sexes and ages. However, gender and age differences existed in the associations between problematic CPU and suspension from school, criminal records, tattooing, short nocturnal sleep duration, unprotected sex, illicit drugs use, drinking alcohol and chewing betel nuts. Conclusions There were positive associations between problematic CPU and a series of risky behaviors and low self-esteem in Taiwanese adolescents. It is worthy for parents and mental health professionals to pay attention to adolescents' problematic CPU. PMID:20426807
      

        
       
          


          
        

           
           
             
               
      
      CPU-GPU hybrid accelerating the Zuker algorithm for RNA secondary structure prediction applications.
      PubMed
      Lei, Guoqing; Dou, Yong; Wan, Wen; Xia, Fei; Li, Rongchun; Ma, Meng; Zou, Dan
         2012-01-01
         Prediction of ribonucleic acid (RNA) secondary structure remains one of the most important research areas in bioinformatics. The Zuker algorithm is one of the most popular methods of free energy minimization for RNA secondary structure prediction. Thus far, few studies have been reported on the acceleration of the Zuker algorithm on general-purpose processors or on extra accelerators such as Field Programmable Gate-Array (FPGA) and Graphics Processing Units (GPU). To the best of our knowledge, no implementation combines both CPU and extra accelerators, such as GPUs, to accelerate the Zuker algorithm applications. In this paper, a CPU-GPU hybrid computing system that accelerates Zuker algorithm applications for RNA secondary structure prediction is proposed. The computing tasks are allocated between CPU and GPU for parallel cooperate execution. Performance differences between the CPU and the GPU in the task-allocation scheme are considered to obtain workload balance. To improve the hybrid system performance, the Zuker algorithm is optimally implemented with special methods for CPU and GPU architecture. Speedup of 15.93× over optimized multi-core SIMD CPU implementation and performance advantage of 16% over optimized GPU implementation are shown in the experimental results. More than 14% of the sequences are executed on CPU in the hybrid system. The system combining CPU and GPU to accelerate the Zuker algorithm is proven to be promising and can be applied to other bioinformatics applications.
      

      
      Comparison of reversible methods for data compression
      NASA Astrophysics Data System (ADS)
      Heer, Volker K.; Reinfelder, Hans-Erich
         1990-07-01
         Widely differing methods for data compression described in the ACR-NEMA draft are used in medical imaging. In our contribution we briefly review various methods and discuss their relevant advantages and disadvantages. In detail we evaluate 1st order DPCM, pyramid transformation, and S transformation. As coding algorithms we compare both fixed and adaptive Huffman coding and Lempel-Ziv coding. Our comparison is performed on typical medical images from CT, MR, DSA, and DLR (Digital Luminescence Radiography). Apart from the achieved compression factors, we take into account the CPU time and main memory required both for compression and for decompression. For a realistic comparison we have implemented the mentioned algorithms in the C programming language on a MicroVAX II and a SPARCstation 1.
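         The kind of measurement this abstract describes (compression factor plus CPU time for compression and decompression) can be sketched with the Python standard library. This is not the authors' C code or any of their coders: it uses zlib, whose DEFLATE format combines LZ77 matching with Huffman coding, and a synthetic ramp "image" stands in for real medical data. The function name `measure_lossless` is illustrative.

```python
import time
import zlib


def measure_lossless(data: bytes, level: int = 6):
    """Return (compression factor, compress CPU s, decompress CPU s)."""
    t0 = time.process_time()
    packed = zlib.compress(data, level)
    t1 = time.process_time()
    restored = zlib.decompress(packed)
    t2 = time.process_time()
    # A reversible method must reproduce the input exactly.
    if restored != data:
        raise RuntimeError("round trip failed")
    return len(data) / len(packed), t1 - t0, t2 - t1


if __name__ == "__main__":
    # Synthetic 256 x 256 8-bit ramp, used only as a placeholder image.
    image = bytes((x + y) % 256 for y in range(256) for x in range(256))
    factor, t_comp, t_decomp = measure_lossless(image)
    print(f"factor={factor:.2f} compress={t_comp:.4f}s decompress={t_decomp:.4f}s")
```

         `time.process_time` reports CPU time rather than wall-clock time, which matches the metric named in the abstract; for stable numbers the measurement would normally be repeated and averaged.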
      

      
      Optimization of Selected Remote Sensing Algorithms for Embedded NVIDIA Kepler GPU Architecture
      NASA Technical Reports Server (NTRS)
      Riha, Lubomir; Le Moigne, Jacqueline; El-Ghazawi, Tarek
         2015-01-01
         This paper evaluates the potential of the embedded Graphics Processing Unit in NVIDIA's Tegra K1 for onboard processing. The performance is compared to a general-purpose multi-core CPU and a full-fledged GPU accelerator. This study uses two algorithms: Wavelet Spectral Dimension Reduction of Hyperspectral Imagery and the Automated Cloud-Cover Assessment (ACCA) Algorithm. Tegra K1 achieved 51 for ACCA algorithm and 20 for the dimension reduction algorithm, as compared to the performance of the high-end 8-core server Intel Xeon CPU with 13.5 times higher power consumption.
      

      
      Establishment and progress of the chest pain unit certification process in Germany and the local experiences of Mainz.
      PubMed
      Post, Felix; Gori, Tommaso; Senges, Jochen; Giannitsis, Evangelos; Katus, Hugo; Münzel, Thomas
         2012-03-01
         The establishment of chest pain units (CPUs) in the USA and UK has led to improvements in the prognosis of patients with chest pain and myocardial infarction, optimizing access to specialized diagnostic and therapeutic facilities and reducing costs. To establish a uniform implementation of this type of service in Germany, the German Cardiac Society (DGK) founded a 'CPU task force' in 2007, which developed a set of standard requirements and a nationwide certification programme. The recommendations for minimum standard requirements were published in 2008. As of November 2011, 132 CPUs were certified and 36 units were in the certification process. The aim of the DGK is to certify as many as 250 centres (units) throughout Germany within the next 2 years, to provide nationwide coverage. Applications from Switzerland are also being filed. Public awareness campaigns in cooperation with national league soccer teams were organized to raise awareness of the importance for early diagnosis and treatment of cardiac diseases and to publicize the existence of these new facilities. The German model of CPU certification allows nationwide and prospectively European-wide standardization of patient care and to improve adherence to international guidelines. Coupled with awareness campaigns and with the launch of a German CPU Registry, this process is aimed at improving the education and treatment of patients with chest pain and to provide scientific information about the quality of patient care.
      

      
      Real-time image reconstruction and display system for MRI using a high-speed personal computer.
      PubMed
      Haishi, T; Kose, K
         1998-09-01
         A real-time NMR image reconstruction and display system was developed using a high-speed personal computer and optimized for the 32-bit multitasking Microsoft Windows 95 operating system. The system was operated at various CPU clock frequencies by changing the motherboard clock frequency and the processor/bus frequency ratio. When the Pentium CPU was used at the 200 MHz clock frequency, the reconstruction time for one 128 x 128 pixel image was 48 ms and that for the image display on the enlarged 256 x 256 pixel window was about 8 ms. NMR imaging experiments were performed with three fast imaging sequences (FLASH, multishot EPI, and one-shot EPI) to demonstrate the ability of the real-time system. It was concluded that in most cases, high-speed PC would be the best choice for the image reconstruction and display system for real-time MRI. Copyright 1998 Academic Press.
      

      
      Design of high-performance parallelized gene predictors in MATLAB.
      PubMed
      Rivard, Sylvain Robert; Mailloux, Jean-Gabriel; Beguenane, Rachid; Bui, Hung Tien
         2012-04-10
         This paper proposes a method of implementing parallel gene prediction algorithms in MATLAB. The proposed designs are based on either Goertzel's algorithm or on FFTs and have been implemented using varying amounts of parallelism on a central processing unit (CPU) and on a graphics processing unit (GPU). Results show that an implementation using a straightforward approach can require over 4.5 h to process 15 million base pairs (bps) whereas a properly designed one could perform the same task in less than five minutes. In the best case, a GPU implementation can yield these results in 57 s. The present work shows how parallelism can be used in MATLAB for gene prediction in very large DNA sequences to produce results that are over 270 times faster than a conventional approach. This is significant as MATLAB is typically overlooked due to its apparent slow processing time even though it offers a convenient environment for bioinformatics. From a practical standpoint, this work proposes two strategies for accelerating genome data processing which rely on different parallelization mechanisms. Using a CPU, the work shows that direct access to the MEX function increases execution speed and that the PARFOR construct should be used in order to take full advantage of the parallelizable Goertzel implementation. When the target is a GPU, the work shows that data needs to be segmented into manageable sizes within the GFOR construct before processing in order to minimize execution time.
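         Goertzel's algorithm, named in this abstract, evaluates a single DFT bin with one multiply-add recurrence per sample. The sketch below is plain Python rather than the authors' MATLAB/MEX code. The period-3 measure built on top of it (summing the power at bin N/3 over the four nucleotide indicator sequences) is a common spectral approach to gene prediction and is an assumption here, not a detail confirmed by the abstract; both function names are illustrative.

```python
import math


def goertzel_power(samples, k):
    """Squared magnitude |X[k]|^2 of DFT bin k via the Goertzel recurrence."""
    n = len(samples)
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev = s_prev2 = 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2


def period3_power(sequence):
    """Sum of bin N/3 power over the A, C, G, T indicator sequences.

    Assumes len(sequence) is a multiple of 3 so that bin N/3 exists.
    """
    n = len(sequence)
    if n == 0 or n % 3:
        raise ValueError("sequence length must be a positive multiple of 3")
    total = 0.0
    for base in "ACGT":
        indicator = [1.0 if ch == base else 0.0 for ch in sequence]
        total += goertzel_power(indicator, n // 3)
    return total


if __name__ == "__main__":
    print(period3_power("ATG" * 10))  # strongly period-3 toy sequence
    print(period3_power("A" * 30))    # no period-3 component
```

         Because each window and each indicator sequence is independent, the outer loops are the natural place for the kind of parallel constructs the abstract mentions.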
      

      
      SpaceCubeX: A Framework for Evaluating Hybrid Multi-Core CPU FPGA DSP Architectures
      NASA Technical Reports Server (NTRS)
      Schmidt, Andrew G.; Weisz, Gabriel; French, Matthew; Flatley, Thomas; Villalpando, Carlos Y.
         2017-01-01
         The SpaceCubeX project is motivated by the need for high performance, modular, and scalable on-board processing to help scientists answer critical 21st century questions about global climate change, air quality, ocean health, and ecosystem dynamics, while adding new capabilities such as low-latency data products for extreme event warnings. These goals translate into on-board processing throughput requirements that are on the order of 100-1,000 times more than those of previous Earth Science missions for standard processing, compression, storage, and downlink operations. To study possible future architectures to achieve these performance requirements, the SpaceCubeX project provides an evolvable testbed and framework that enables a focused design space exploration of candidate hybrid CPU/FPGA/DSP processing architectures. The framework includes ArchGen, an architecture generator tool populated with candidate architecture components, performance models, and IP cores, that allows an end user to specify the type, number, and connectivity of a hybrid architecture. The framework requires minimal extensions to integrate new processors, such as the anticipated High Performance Spaceflight Computer (HPSC), reducing time to initiate benchmarking by months. To evaluate the framework, we leverage a wide suite of high performance embedded computing benchmarks and Earth science scenarios to ensure robust architecture characterization. We report on our project's Year 1 efforts and demonstrate the capabilities across four simulation testbed models: a baseline SpaceCube 2.0 system, a dual ARM A9 processor system, a hybrid quad ARM A53 and FPGA system, and a hybrid quad ARM A53 and DSP system.
      

      
      QR-decomposition based SENSE reconstruction using parallel architecture.
      PubMed
      Ullah, Irfan; Nisar, Habab; Raza, Haseeb; Qasim, Malik; Inam, Omair; Omer, Hammad
         2018-04-01
         Magnetic Resonance Imaging (MRI) is a powerful medical imaging technique that provides essential clinical information about the human body. One major limitation of MRI is its long scan time. Implementation of advanced MRI algorithms on a parallel architecture (to exploit inherent parallelism) has great potential to reduce the scan time. Sensitivity Encoding (SENSE) is a Parallel Magnetic Resonance Imaging (pMRI) algorithm that utilizes receiver coil sensitivities to reconstruct MR images from the acquired under-sampled k-space data. At the heart of SENSE lies inversion of a rectangular encoding matrix. This work presents a novel implementation of a GPU-based SENSE algorithm, which employs QR decomposition for the inversion of the rectangular encoding matrix. For a fair comparison, the performance of the proposed GPU-based SENSE reconstruction is evaluated against single and multicore CPU using OpenMP. Several experiments against various acceleration factors (AFs) are performed using multichannel (8, 12 and 30) phantom and in-vivo human head and cardiac datasets. Experimental results show that GPU significantly reduces the computation time of SENSE reconstruction as compared to multi-core CPU (approximately 12x speedup) and single-core CPU (approximately 53x speedup) without any degradation in the quality of the reconstructed images. Copyright © 2018 Elsevier Ltd. All rights reserved.
      

      
      GPU Linear Algebra Libraries and GPGPU Programming for Accelerating MOPAC Semiempirical Quantum Chemistry Calculations.
      PubMed
      Maia, Julio Daniel Carvalho; Urquiza Carvalho, Gabriel Aires; Mangueira, Carlos Peixoto; Santana, Sidney Ramos; Cabral, Lucidio Anjos Formiga; Rocha, Gerd B
         2012-09-11
         In this study, we present some modifications in the semiempirical quantum chemistry MOPAC2009 code that accelerate single-point energy calculations (1SCF) of medium-size (up to 2500 atoms) molecular systems using GPU coprocessors and multithreaded shared-memory CPUs. Our modifications consisted of using a combination of highly optimized linear algebra libraries for both CPU (LAPACK and BLAS from Intel MKL) and GPU (MAGMA and CUBLAS) to hasten time-consuming parts of MOPAC such as the pseudodiagonalization, full diagonalization, and density matrix assembling. We have shown that it is possible to obtain large speedups just by using CPU serial linear algebra libraries in the MOPAC code. As a special case, we show a speedup of up to 14 times for a methanol simulation box containing 2400 atoms and 4800 basis functions, with even greater gains in performance when using multithreaded CPUs (2.1 times in relation to the single-threaded CPU code using linear algebra libraries) and GPUs (3.8 times). This degree of acceleration opens new perspectives for modeling larger structures which appear in inorganic chemistry (such as zeolites and MOFs), biochemistry (such as polysaccharides, small proteins, and DNA fragments), and materials science (such as nanotubes and fullerenes). In addition, we believe that this parallel (GPU-GPU) MOPAC code will make it feasible to use semiempirical methods in lengthy molecular simulations using both hybrid QM/MM and QM/QM potentials.
      

      
      Prestack depth migration for complex 2D structure using phase-screen propagators
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Roberts, P.; Huang, Lian-Jie; Burch, C.
         1997-11-01
         We present results for the phase-screen propagator method applied to prestack depth migration of the Marmousi synthetic data set. The data were migrated as individual common-shot records and the resulting partial images were superposed to obtain the final complete image. Tests were performed to determine the minimum number of frequency components required to achieve the best quality image and this in turn provided estimates of the minimum computing time. Running on a single processor SUN SPARC Ultra I, high quality images were obtained in as little as 8.7 CPU hours and adequate images were obtained in as little as 4.4 CPU hours. Different methods were tested for choosing the reference velocity used for the background phase-shift operation and for defining the slowness perturbation screens. Although the depths of some of the steeply dipping, high-contrast features were shifted slightly, the overall image quality was fairly insensitive to the choice of the reference velocity. Our tests show the phase-screen method to be a reliable and fast algorithm for imaging complex geologic structures, at least for complex 2D synthetic data where the velocity model is known.
      

      
      Accelerating Monte Carlo simulations of photon transport in a voxelized geometry using a massively parallel graphics processing unit
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Badal, Andreu; Badano, Aldo
         
         Purpose: It is a known fact that Monte Carlo simulations of radiation transport are computationally intensive and may require long computing times. The authors introduce a new paradigm for the acceleration of Monte Carlo simulations: The use of a graphics processing unit (GPU) as the main computing device instead of a central processing unit (CPU). Methods: A GPU-based Monte Carlo code that simulates photon transport in a voxelized geometry with the accurate physics models from PENELOPE has been developed using the CUDA programming model (NVIDIA Corporation, Santa Clara, CA). Results: An outline of the new code and a sample x-ray imaging simulation with an anthropomorphic phantom are presented. A remarkable 27-fold speed-up factor was obtained using a GPU compared to a single core CPU. Conclusions: The reported results show that GPUs are currently a good alternative to CPUs for the simulation of radiation transport. Since the performance of GPUs is currently increasing at a faster pace than that of CPUs, the advantages of GPU-based software are likely to be more pronounced in the future.
      

      
      Optimizing Tensor Contraction Expressions for Hybrid CPU-GPU Execution
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Ma, Wenjing; Krishnamoorthy, Sriram; Villa, Oreste
         2013-03-01
         Tensor contractions are generalized multidimensional matrix multiplication operations that widely occur in quantum chemistry. Efficient execution of tensor contractions on Graphics Processing Units (GPUs) requires several challenges to be addressed, including index permutation and small dimension-sizes reducing thread block utilization. Moreover, to apply the same optimizations to various expressions, we need a code generation tool. In this paper, we present our approach to automatically generate CUDA code to execute tensor contractions on GPUs, including management of data movement between CPU and GPU. To evaluate our tool, GPU-enabled code is generated for the most expensive contractions in CCSD(T), a key coupled cluster method, and incorporated into NWChem, a popular computational chemistry suite. For this method, we demonstrate speedup over a factor of 8.4 using one GPU (instead of one core per node) and over 2.6 when utilizing the entire system using hybrid CPU+GPU solution with 2 GPUs and 5 cores (instead of 7 cores per node). Finally, we analyze the implementation behavior on future GPU systems.
      

      
      Symptoms of Problematic Cellular Phone Use, Functional Impairment and Its Association with Depression among Adolescents in Southern Taiwan
      ERIC Educational Resources Information Center
      Yen, Cheng-Fang; Tang, Tze-Chun; Yen, Ju-Yu; Lin, Huang-Chi; Huang, Chi-Fen; Liu, Shu-Chun; Ko, Chih-Hung
         2009-01-01
         The aims of this study were: (1) to examine the prevalence of symptoms of problematic cellular phone use (CPU); (2) to examine the associations between the symptoms of problematic CPU, functional impairment caused by CPU and the characteristics of CPU; (3) to establish the optimal cut-off point of the number of symptoms for functional impairment…
      

      
      Derivative free Davidon-Fletcher-Powell (DFP) for solving symmetric systems of nonlinear equations
      NASA Astrophysics Data System (ADS)
      Mamat, M.; Dauda, M. K.; Mohamed, M. A. bin; Waziri, M. Y.; Mohamad, F. S.; Abdullah, H.
         2018-03-01
         Problems arising in engineering, economics, modelling, industry, computing, and science are mostly nonlinear in nature. Numerical solutions to such systems are widely applied in those areas. Over the years, there has been significant theoretical study to develop methods for solving such systems; despite these efforts, the methods developed still have deficiencies. As a contribution to solving systems of the form $F(x) = 0$, $x \in \mathbb{R}^n$, a derivative-free method via the classical Davidon-Fletcher-Powell (DFP) update is presented. This is achieved by simply approximating the inverse Hessian matrix $Q_{k+1}^{-1}$ by $\theta_k I$. The modified method satisfies the descent condition and possesses local superlinear convergence properties. Without computing any derivative, the proposed method never failed to converge throughout the numerical experiments. Performance is measured by the number of iterations and CPU time; different initial starting points were used on a set of 40 benchmark test problems. With the aid of the squared norm merit function and a derivative-free line search technique, the approach yields a method for solving symmetric systems of nonlinear equations that is capable of significantly reducing the CPU time and number of iterations compared to its counterparts. A comparison between the proposed method and the classical DFP update found that the proposed method is the top performer and outperformed the existing method in almost all cases. In terms of number of iterations, out of the 40 problems solved, the proposed method solved 38 successfully (95%), while classical DFP solved 2 problems (5%). In terms of CPU time, the proposed method solved 29 out of the 40 problems (72.5%) successfully, whereas classical DFP solved 11 (27.5%). The method is valid in terms of derivation, reliable in terms of number of iterations, and accurate in terms of CPU time. Thus, it is suitable and achieved the objective.
      

      
      First Update of the Criteria for Certification of Chest Pain Units in Germany: Facelift or New Model?
      PubMed
      Breuckmann, Frank; Rassaf, Tienush
         2016-03-01
         In an effort to provide a systematic and specific standard-of-care for patients with acute chest pain, the German Cardiac Society introduced criteria for certification of specialized chest pain units (CPUs) in 2008, which have been replaced by a recent update published in 2015. We reviewed the development of CPU establishment in Germany during the past 7 years and compared and commented on the current update of the certification criteria. As of October 2015, 228 CPUs in Germany have been successfully certified by the German Cardiac Society; 300 CPUs are needed for full coverage, closing gaps in rural regions. Current changes of the criteria mainly affect guideline-adherent adaptations of diagnostic work-ups, therapeutic strategies, risk stratification, in-hospital timing and education, and quality measures, whereas the overall structure remained unchanged. Benchmarking by participation within the German CPU registry is encouraged. Even though the history is short, the concept of certified CPUs in Germany is accepted and successful, as underlined by its recent implementation in national and international guidelines. First registry data demonstrated a high standard of quality-of-care. The current update provides rational adaptations to new guidelines and developments without raising the level for successful certifications. A periodic release of fast-track updates with shorter time frames and an increase of minimum requirements should be considered.
      

      
      Efficient calculation of many-body induced electrostatics in molecular systems
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      McLaughlin, Keith, E-mail: kmclaugh@mail.usf.edu; Cioce, Christian R.; Pham, Tony
         
         Potential energy functions including many-body polarization are in widespread use in simulations of aqueous and biological systems, metal-organics, molecular clusters, and other systems where electronically induced redistribution of charge among local atomic sites is of importance. The polarization interactions, treated here via the methods of Thole and Applequist, while long-ranged, can be computed for moderate-sized periodic systems with extremely high accuracy by extending Ewald summation to the induced fields as demonstrated by Nymand, Sala, and others. These full Ewald polarization calculations, however, are expensive and often limited to very small systems, particularly in Monte Carlo simulations, which may require energy evaluation over several hundred-thousand configurations. For such situations, it shall be shown that sufficiently accurate computation of the polarization energy can be produced in a fraction of the central processing unit (CPU) time by neglecting the long-range extension to the induced fields while applying the long-range treatments of Ewald or Wolf to the static fields; these methods, denoted Ewald E-Static and Wolf E-Static (WES), respectively, provide an effective means to obtain polarization energies for intermediate and large systems including those with several thousand polarizable sites in a fraction of the CPU time. Furthermore, we shall demonstrate a means to optimize the damping for WES calculations via extrapolation from smaller trial systems.
      

      
      A communication-avoiding, hybrid-parallel, rank-revealing orthogonalization method.
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Hoemmen, Mark
         2010-11-01
         Orthogonalization consumes much of the run time of many iterative methods for solving sparse linear systems and eigenvalue problems. Commonly used algorithms, such as variants of Gram-Schmidt or Householder QR, have performance dominated by communication. Here, 'communication' includes both data movement between the CPU and memory, and messages between processors in parallel. Our Tall Skinny QR (TSQR) family of algorithms requires asymptotically fewer messages between processors and data movement between CPU and memory than typical orthogonalization methods, yet achieves the same accuracy as Householder QR factorization. Furthermore, in block orthogonalizations, TSQR is faster and more accurate than existing approaches for orthogonalizing the vectors within each block ('normalization'). TSQR's rank-revealing capability also makes it useful for detecting deflation in block iterative methods, for which existing approaches sacrifice performance, accuracy, or both. We have implemented a version of TSQR that exploits both distributed-memory and shared-memory parallelism, and supports real and complex arithmetic. Our implementation is optimized for the case of orthogonalizing a small number (5-20) of very long vectors. The shared-memory parallel component uses Intel's Threading Building Blocks, though its modular design supports other shared-memory programming models as well, including computation on the GPU. Our implementation achieves speedups of 2 times or more over competing orthogonalizations. It is available now in the development branch of the Trilinos software package, and will be included in the 10.8 release.
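         The reduction idea behind TSQR can be illustrated with a minimal sketch; this is not the Trilinos implementation. A tall matrix is split into row blocks, each block is factored independently, the small R factors are stacked, and the stack is factored once more. In a parallel setting only those small R factors would need to be communicated. To stay dependency-free, the local factorization here uses modified Gram-Schmidt rather than Householder QR, so the sketch does not reproduce the accuracy properties described in the abstract. The function names `mgs_r` and `tsqr_r` are hypothetical, and every block is assumed to have full column rank.

```python
import math

def mgs_r(rows):
    """R factor (upper triangular, positive diagonal) of a full-column-rank
    matrix given as a list of rows, via modified Gram-Schmidt."""
    n = len(rows[0])
    cols = [[row[j] for row in rows] for j in range(n)]
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        r[j][j] = math.sqrt(sum(v * v for v in cols[j]))
        q = [v / r[j][j] for v in cols[j]]
        for k in range(j + 1, n):
            r[j][k] = sum(a * b for a, b in zip(q, cols[k]))
            cols[k] = [b - r[j][k] * a for a, b in zip(q, cols[k])]
    return r

def tsqr_r(rows, block_rows):
    """One-level TSQR reduction: factor each row block, stack the R factors,
    then factor the stack. Assumes each full block has full column rank."""
    n = len(rows[0])
    stacked = []
    for start in range(0, len(rows), block_rows):
        block = rows[start:start + block_rows]
        # A trailing block with fewer rows than columns is passed through unfactored.
        stacked.extend(mgs_r(block) if len(block) >= n else block)
    return mgs_r(stacked)

# Illustrative 12 x 2 matrix split into three 4-row blocks.
a = [[float(i + 1), float((i * i) % 7 + 1)] for i in range(12)]
r_tsqr = tsqr_r(a, block_rows=4)
r_direct = mgs_r(a)
```

         Because the R factor with a positive diagonal is unique for a full-rank matrix, `r_tsqr` and `r_direct` should agree up to rounding error.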
      

      
      Speeding up tsunami wave propagation modeling
      NASA Astrophysics Data System (ADS)
      Lavrentyev, Mikhail; Romanenko, Alexey
         2014-05-01
         Trans-oceanic wave propagation is one of the most time/CPU consuming parts of the tsunami modeling process. The so-called Method Of Splitting Tsunami (MOST) software package, developed at PMEL NOAA USA (Pacific Marine Environmental Laboratory of the National Oceanic and Atmospheric Administration, USA), is widely used to evaluate the tsunami parameters. However, simulating trans-ocean wave propagation takes time: up to 5 hours of CPU time to "drive" the wave from Chile (epicenter) to the coast of Japan, even using a rather coarse computational mesh. Accurate wave height prediction requires fine meshes, which leads to a dramatic increase in simulation time. Computation time is a critical parameter, as it takes only about 20 minutes for a tsunami wave to approach the coast of Japan after an earthquake at the Japan trench or Sagami trench (as it was after the Great East Japan Earthquake on March 11, 2011). MOST numerically solves the hyperbolic system for three unknown functions, namely the velocity vector and wave height (shallow water approximation). The system can be split into two independent systems by orthogonal directions (splitting method), and each system can be treated independently. This calculation scheme is well suited for SIMD architectures and GPUs as well. We adapted the MOST package to the GPU. Several numerical tests showed a 40x performance gain for an NVIDIA Tesla C2050 GPU vs. a single core of an Intel i7 processor. Results of numerical experiments were compared with other available simulation data. Calculation results obtained on the GPU differ from the reference ones by 10^-3 cm of wave height when simulating 24 hours of wave propagation. This opens the possibility of developing a real-time system for evaluating tsunami danger.
      

      
      A time-efficient implementation of Extended Kalman Filter for sequential orbit determination and a case study for onboard application
      NASA Astrophysics Data System (ADS)
      Tang, Jingshi; Wang, Haihong; Chen, Qiuli; Chen, Zhonggui; Zheng, Jinjun; Cheng, Haowen; Liu, Lin
         2018-07-01
         Onboard orbit determination (OD) is often used in space missions, with which mission support can be partially accomplished autonomously, with less dependency on ground stations. In major Global Navigation Satellite Systems (GNSS), inter-satellite link is also an essential upgrade in the future generations. To serve for autonomous operation, sequential OD method is crucial to provide real-time or near real-time solutions. The Extended Kalman Filter (EKF) is an effective and convenient sequential estimator that is widely used in onboard application. The filter requires the solutions of state transition matrix (STM) and the process noise transition matrix, which are always obtained by numerical integration. However, numerically integrating the differential equations is a CPU intensive process and consumes a large portion of the time in EKF procedures. In this paper, we present an implementation that uses the analytical solutions of these transition matrices to replace the numerical calculations. This analytical implementation is demonstrated and verified using a fictitious constellation based on selected medium Earth orbit (MEO) and inclined Geosynchronous orbit (IGSO) satellites. We show that this implementation performs effectively and converges quickly, steadily and accurately in the presence of considerable errors in the initial values, measurements and force models. The filter is able to converge within 2-4 h of flight time in our simulation. The observation residual is consistent with simulated measurement error, which is about a few centimeters in our scenarios. Compared to results implemented with numerically integrated STM, the analytical implementation shows results with consistent accuracy, while it takes only about half the CPU time to filter a 10-day measurement series. The future possible extensions are also discussed to fit in various missions.
      

      
      Integration of TomoPy and the ASTRA toolbox for advanced processing and reconstruction of tomographic synchrotron data
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Pelt, Daniël M.; Gürsoy, Doğa; Palenstijn, Willem Jan
         2016-04-28
         The processing of tomographic synchrotron data requires advanced and efficient software to be able to produce accurate results in reasonable time. In this paper, the integration of two software toolboxes, TomoPy and the ASTRA toolbox, which, together, provide a powerful framework for processing tomographic data, is presented. The integration combines the advantages of both toolboxes, such as the user-friendliness and CPU-efficient methods of TomoPy and the flexibility and optimized GPU-based reconstruction methods of the ASTRA toolbox. It is shown that both toolboxes can be easily installed and used together, requiring only minor changes to existing TomoPy scripts. Furthermore, it is shown that the efficient GPU-based reconstruction methods of the ASTRA toolbox can significantly decrease the time needed to reconstruct large datasets, and that advanced reconstruction methods can improve reconstruction quality compared with TomoPy's standard reconstruction method.
      

        
       
          


          
        

           
           
             
               
      
      Integration of TomoPy and the ASTRA toolbox for advanced processing and reconstruction of tomographic synchrotron data
      PubMed Central
      Pelt, Daniël M.; Gürsoy, Doǧa; Palenstijn, Willem Jan; Sijbers, Jan; De Carlo, Francesco; Batenburg, Kees Joost
         2016-01-01
         The processing of tomographic synchrotron data requires advanced and efficient software to be able to produce accurate results in reasonable time. In this paper, the integration of two software toolboxes, TomoPy and the ASTRA toolbox, which, together, provide a powerful framework for processing tomographic data, is presented. The integration combines the advantages of both toolboxes, such as the user-friendliness and CPU-efficient methods of TomoPy and the flexibility and optimized GPU-based reconstruction methods of the ASTRA toolbox. It is shown that both toolboxes can be easily installed and used together, requiring only minor changes to existing TomoPy scripts. Furthermore, it is shown that the efficient GPU-based reconstruction methods of the ASTRA toolbox can significantly decrease the time needed to reconstruct large datasets, and that advanced reconstruction methods can improve reconstruction quality compared with TomoPy’s standard reconstruction method. PMID:27140167
      

      
      Synthesis and characterization of conductive, biodegradable, elastomeric polyurethanes for biomedical applications.
      PubMed
      Xu, Cancan; Yepez, Gerardo; Wei, Zi; Liu, Fuqiang; Bugarin, Alejandro; Hong, Yi
         2016-09-01
         Biodegradable conductive polymers are currently of significant interest in tissue repair and regeneration, drug delivery, and bioelectronics. However, biodegradable materials exhibiting both conductive and elastic properties have rarely been reported to date. To that end, an electrically conductive polyurethane (CPU) was synthesized from polycaprolactone diol, hexadiisocyanate, and aniline trimer and subsequently doped with (1S)-(+)-10-camphorsulfonic acid (CSA). All CPU films showed good elasticity within a 30% strain range. The electrical conductivity of the CPU films, as enhanced with increasing amounts of CSA, ranged from 2.7 ± 0.9 × 10^-10 to 4.4 ± 0.6 × 10^-7 S/cm in a dry state and 4.2 ± 0.5 × 10^-8 to 7.3 ± 1.5 × 10^-5 S/cm in a wet state. The redox peaks of a CPU1.5 film (molar ratio CSA:aniline trimer = 1.5:1) in the cyclic voltammogram confirmed the desired good electroactivity. The doped CPU film exhibited good electrical stability (87% of initial conductivity after 150 hours charge) as measured in a cell culture medium. The degradation rates of CPU films increased with increasing CSA content in both phosphate-buffered solution (PBS) and lipase/PBS solutions. After 7 days of enzymatic degradation, the conductivity of all CSA-doped CPU films had decreased to that of the undoped CPU film. Mouse 3T3 fibroblasts proliferated and spread on all CPU films. This developed biodegradable CPU with good elasticity, electrical stability, and biocompatibility may find potential applications in tissue engineering, smart drug release, and electronics. © 2016 Wiley Periodicals, Inc. J Biomed Mater Res Part A: 104A: 2305-2314, 2016.
      

      
      GPU based framework for geospatial analyses
      NASA Astrophysics Data System (ADS)
      Cosmin Sandric, Ionut; Ionita, Cristian; Dardala, Marian; Furtuna, Titus
         2017-04-01
         Parallel processing on multiple CPU cores is already used at large scale in geocomputing, but parallel processing on graphics cards is just beginning. Being able to use a simple laptop with a dedicated graphics card for advanced and very fast geocomputation is an advantage that every scientist wants to have. The need for high-speed computation in the geosciences has increased in the last 10 years, mostly due to the increase in available datasets. These datasets are becoming more and more detailed and hence require more space to store and more time to process. Distributed computation on multicore CPUs and GPUs plays an important role by processing small parts of these big datasets one by one. This way of computing speeds up the process because, instead of using just one process for each dataset, the user can use all the cores of a CPU or up to hundreds of cores of a GPU. The framework provides the end user with standalone tools for morphometry analyses at multiscale level. An important part of the framework is dedicated to uncertainty propagation in geospatial analyses. The uncertainty may come from the data collection, may be induced by the model, or may have an infinite number of sources. These uncertainties play important roles when a spatial delineation of the phenomena is modelled. Uncertainty propagation is implemented inside the GPU framework using Monte Carlo simulations. The GPU framework with the standalone tools proved to be a reliable tool for modelling complex natural phenomena. The framework is based on NVIDIA CUDA technology and is written in the C++ programming language. The source code will be available on GitHub at https://github.com/sandricionut/GeoRsGPU Acknowledgement: GPU framework for geospatial analysis, Young Researchers Grant (ICUB-University of Bucharest) 2016, director Ionut Sandric
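         Monte Carlo uncertainty propagation of the kind mentioned in this abstract can be sketched on the CPU in a few lines. The example below is not taken from the GeoRsGPU code. It perturbs two neighbouring elevation values with Gaussian noise and looks at the resulting spread in a simple slope measure. The function names, the slope formula, and the error model (independent Gaussian elevation errors with standard deviation `sigma`) are all assumptions made for illustration.

```python
import random
import statistics

def slope_percent(z_left, z_right, cell_size):
    """Slope between two neighbouring cells, in percent."""
    return 100.0 * abs(z_right - z_left) / cell_size

def propagate_uncertainty(z_left, z_right, cell_size, sigma, n_draws=10000, seed=0):
    """Monte Carlo estimate of the mean and standard deviation of the slope
    when both elevations carry independent Gaussian errors (assumed model)."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    draws = [
        slope_percent(z_left + rng.gauss(0.0, sigma),
                      z_right + rng.gauss(0.0, sigma),
                      cell_size)
        for _ in range(n_draws)
    ]
    return statistics.mean(draws), statistics.stdev(draws)

# Hypothetical inputs: 5 m rise over a 10 m cell, 0.5 m elevation error.
mean_slope, sd_slope = propagate_uncertainty(100.0, 105.0, 10.0, sigma=0.5)
```

         Each draw is independent of the others, which is what makes this kind of simulation a natural fit for the many-core parallelism the abstract describes.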
      

      
      Coding for parallel execution of hardware-in-the-loop millimeter-wave scene generation models on multicore SIMD processor architectures
      NASA Astrophysics Data System (ADS)
      Olson, Richard F.
         2013-05-01
         Rendering of point scatterer based radar scenes for millimeter wave (mmW) seeker tests in real-time hardware-in-the-loop (HWIL) scene generation requires efficient algorithms and vector-friendly computer architectures for complex signal synthesis. New processor technology from Intel implements an extended 256-bit vector SIMD instruction set (AVX, AVX2) in a multi-core CPU design providing peak execution rates of hundreds of GigaFLOPS (GFLOPS) on one chip. Real world mmW scene generation code can approach peak SIMD execution rates only after careful algorithm and source code design. An effective software design will maintain high computing intensity emphasizing register-to-register SIMD arithmetic operations over data movement between CPU caches or off-chip memories. Engineers at the U.S. Army Aviation and Missile Research, Development and Engineering Center (AMRDEC) applied two basic parallel coding methods to assess new 256-bit SIMD multi-core architectures for mmW scene generation in HWIL. These include use of POSIX threads built on vector library functions and more portable, high-level parallel code based on compiler technology (e.g. OpenMP pragmas and SIMD autovectorization). Since CPU technology is rapidly advancing toward high processor core counts and TeraFLOPS peak SIMD execution rates, it is imperative that coding methods be identified which produce efficient and maintainable parallel code. This paper describes the algorithms used in point scatterer target model rendering, the parallelization of those algorithms, and the execution performance achieved on an AVX multi-core machine using the two basic parallel coding methods. The paper concludes with estimates for scale-up performance on upcoming multi-core technology.
      

      
      General-purpose interface bus for multiuser, multitasking computer system
      NASA Technical Reports Server (NTRS)
      Generazio, Edward R.; Roth, Don J.; Stang, David B.
         1990-01-01
         The architecture of a multiuser, multitasking, virtual-memory computer system intended for use by a medium-size research group is described. There are three central processing units (CPU) in the configuration, each with 16 MB memory, and two 474 MB hard disks attached. CPU 1 is designed for data analysis and contains an array processor for fast-Fourier transformations. In addition, CPU 1 shares display images viewed with the image processor. CPU 2 is designed for image analysis and display. CPU 3 is designed for data acquisition and contains 8 GPIB channels and an analog-to-digital conversion input/output interface with 16 channels. Up to 9 users can access the third CPU simultaneously for data acquisition. Focus is placed on the optimization of hardware interfaces and software, facilitating instrument control, data acquisition, and processing.
      

      
      CPU-GPU hybrid accelerating the Zuker algorithm for RNA secondary structure prediction applications
      PubMed Central
      
         2012-01-01
         Background: Prediction of ribonucleic acid (RNA) secondary structure remains one of the most important research areas in bioinformatics. The Zuker algorithm is one of the most popular methods of free energy minimization for RNA secondary structure prediction. Thus far, few studies have been reported on the acceleration of the Zuker algorithm on general-purpose processors or on extra accelerators such as Field Programmable Gate-Array (FPGA) and Graphics Processing Units (GPU). To the best of our knowledge, no implementation combines both CPU and extra accelerators, such as GPUs, to accelerate the Zuker algorithm applications. Results: In this paper, a CPU-GPU hybrid computing system that accelerates Zuker algorithm applications for RNA secondary structure prediction is proposed. The computing tasks are allocated between CPU and GPU for cooperative parallel execution. Performance differences between the CPU and the GPU in the task-allocation scheme are considered to obtain workload balance. To improve the hybrid system performance, the Zuker algorithm is optimally implemented with special methods for CPU and GPU architecture. Conclusions: A speedup of 15.93× over an optimized multi-core SIMD CPU implementation and a performance advantage of 16% over an optimized GPU implementation are shown in the experimental results. More than 14% of the sequences are executed on CPU in the hybrid system. The system combining CPU and GPU to accelerate the Zuker algorithm is proven to be promising and can be applied to other bioinformatics applications. PMID:22369626
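         The workload-balance idea in this abstract can be checked with a back-of-envelope model. This is not the authors' task-allocation scheme. It assumes that all sequences cost the same, that CPU and GPU run fully in parallel, and that there is no transfer or scheduling overhead. The function names are hypothetical.

```python
def balanced_cpu_fraction(cpu_rate, gpu_rate):
    """Share of tasks to give the CPU so that both devices finish together,
    where each rate is a throughput in tasks per second (idealized model)."""
    return cpu_rate / (cpu_rate + gpu_rate)

def hybrid_speedup_over_gpu(cpu_fraction):
    """Speedup of the hybrid run relative to GPU-only execution when the GPU
    handles the remaining (1 - cpu_fraction) of the work and is the last to finish."""
    return 1.0 / (1.0 - cpu_fraction)

# With 14% of the sequences on the CPU, the GPU handles 86% of the work.
speedup = hybrid_speedup_over_gpu(0.14)  # about 1.163
```

         Under these idealized assumptions a 14% CPU share corresponds to a speedup of roughly 1.16 over the GPU alone, which is in line with the 16% advantage reported above.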
      

      
      Increases in cytoplasmic dopamine compromise the normal resistance of the nucleus accumbens to methamphetamine neurotoxicity
      PubMed Central
      Thomas, David M.; Francescutti-Verbeem, Dina M.; Kuhn, Donald M.
         2016-01-01
         Methamphetamine (METH) is a neurotoxic drug of abuse that damages the dopamine (DA) neuronal system in a highly delimited manner. The brain structure most affected by METH is the caudate–putamen (CPu) where long-term DA depletion and microglial activation are most evident. Even damage within the CPu is remarkably heterogenous with lateral and ventral aspects showing the greatest deficits. The nucleus accumbens (NAc) is largely spared of the damage that accompanies binge METH intoxication. Increases in cytoplasmic DA produced by reserpine, L-DOPA or clorgyline prior to METH uncover damage in the NAc as evidenced by microglial activation and depletion of DA, tyrosine hydroxylase (TH), and the DA transporter. These effects do not occur in the NAc after treatment with METH alone. In contrast to the CPu where DA, TH, and DA transporter levels remain depleted chronically, DA nerve ending alterations in the NAc show a partial recovery over time. None of the treatments that enhance METH toxicity in the NAc and CPu lead to losses of TH protein or DA cell bodies in the substantia nigra or the ventral tegmentum. These data show that increases in cytoplasmic DA dramatically broaden the neurotoxic profile of METH to include brain structures not normally targeted for damage by METH alone. The resistance of the NAc to METH-induced neurotoxicity and its ability to recover reveal a fundamentally different neuroplasticity by comparison to the CPu. Recruitment of the NAc as a target of METH neurotoxicity by alterations in DA homeostasis is significant in light of the important roles played by this brain structure. PMID:19457119
      

      
      Increases in cytoplasmic dopamine compromise the normal resistance of the nucleus accumbens to methamphetamine neurotoxicity.
      PubMed
      Thomas, David M; Francescutti-Verbeem, Dina M; Kuhn, Donald M
         2009-06-01
         Methamphetamine (METH) is a neurotoxic drug of abuse that damages the dopamine (DA) neuronal system in a highly delimited manner. The brain structure most affected by METH is the caudate-putamen (CPu) where long-term DA depletion and microglial activation are most evident. Even damage within the CPu is remarkably heterogenous with lateral and ventral aspects showing the greatest deficits. The nucleus accumbens (NAc) is largely spared of the damage that accompanies binge METH intoxication. Increases in cytoplasmic DA produced by reserpine, L-DOPA or clorgyline prior to METH uncover damage in the NAc as evidenced by microglial activation and depletion of DA, tyrosine hydroxylase (TH), and the DA transporter. These effects do not occur in the NAc after treatment with METH alone. In contrast to the CPu where DA, TH, and DA transporter levels remain depleted chronically, DA nerve ending alterations in the NAc show a partial recovery over time. None of the treatments that enhance METH toxicity in the NAc and CPu lead to losses of TH protein or DA cell bodies in the substantia nigra or the ventral tegmentum. These data show that increases in cytoplasmic DA dramatically broaden the neurotoxic profile of METH to include brain structures not normally targeted for damage by METH alone. The resistance of the NAc to METH-induced neurotoxicity and its ability to recover reveal a fundamentally different neuroplasticity by comparison to the CPu. Recruitment of the NAc as a target of METH neurotoxicity by alterations in DA homeostasis is significant in light of the important roles played by this brain structure.
      

      
      Multiresource allocation and scheduling for periodic soft real-time applications
      NASA Astrophysics Data System (ADS)
      Gopalan, Kartik; Chiueh, Tzi-cker
         2001-12-01
         Real-time applications that utilize multiple system resources, such as CPU, disks, and network links, require coordinated scheduling of these resources in order to meet their end-to-end performance requirements. Most state-of-the-art operating systems support independent resource allocation and deadline-driven scheduling but lack coordination among multiple heterogeneous resources. This paper describes the design and implementation of an Integrated Real-time Resource Scheduler (IRS) that performs coordinated allocation and scheduling of multiple heterogeneous resources on the same machine for periodic soft real-time applications. The principal feature of IRS is a heuristic multi-resource allocation algorithm that reserves multiple resources for real-time applications in a manner that can maximize the number of applications admitted into the system in the long run. At run-time, a global scheduler dispatches the tasks of the soft real-time application to individual resource schedulers according to the precedence constraints between tasks. The individual resource schedulers, which could be any deadline-based schedulers, can make scheduling decisions locally and yet collectively satisfy a real-time application's performance requirements. The tightness of overall timing guarantees is ultimately determined by the properties of individual resource schedulers. However, IRS maximizes overall system resource utilization efficiency by coordinating deadline assignment across multiple tasks in a soft real-time application.
      

      
      Analysis of cache for streaming tape drive
      NASA Technical Reports Server (NTRS)
      Chinnaswamy, V.
         1993-01-01
         A tape subsystem consists of a controller and a tape drive. Tapes are used for backup, data interchange, and software distribution. The backup operation is addressed. During a backup operation, data is read from disk, processed in CPU, and then sent to tape. The processing speeds of a disk subsystem, CPU, and a tape subsystem are likely to be different. A powerful CPU can read data from a fast disk, process it, and supply the data to the tape subsystem at a faster rate than the tape subsystem can handle. On the other hand, a slow disk drive and a slow CPU may not be able to supply data fast enough to keep a tape drive busy all the time. The backup process may supply data to the tape drive in bursts. Each burst may be followed by an idle period. Depending on the nature of the file distribution on the disk, the input stream to the tape subsystem may vary significantly during backup. To compensate for these differences and optimize the utilization of a tape subsystem, a cache or buffer is introduced in the tape controller. Most tape drives today are streaming tape drives. A streaming tape drive goes into reposition when there is no data from the controller. Once the drive goes into reposition, the controller can receive data, but it cannot supply data to the tape drive until the drive completes its reposition. A controller can also receive data from the host and send data to the tape drive at the same time. The relationship of cache size, host transfer rate, drive transfer rate, reposition, and ramp up times for optimal performance of the tape subsystem is investigated. Formulas developed will also show the advantages of cache watermarks to increase the streaming time of the tape drive, maximum loss due to insufficient cache, tradeoffs between cache and reposition times, and the effectiveness of cache on a streaming tape drive due to idle times or interruptions in host transfers. Several mathematical formulas are developed to predict the performance of the tape drive. Some examples are given illustrating the usefulness of these formulas. Finally, a summary and some conclusions are provided.
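         The report's own formulas are not reproduced in this record. As a rough illustration of the quantities involved, the sketch below uses a simple fluid model that is an assumption, not the report's derivation. The drive starts streaming once the cache holds `watermark` bytes and drains it at `drive_rate - host_rate`. It then repositions while the host refills the cache to the watermark. All names are hypothetical.

```python
import math

def streaming_time(watermark, host_rate, drive_rate):
    """Seconds the drive can stream before the cache runs empty."""
    if host_rate >= drive_rate:
        return math.inf  # the host keeps up, so the drive never starves
    return watermark / (drive_rate - host_rate)

def streaming_duty_cycle(watermark, host_rate, drive_rate, reposition_time):
    """Fraction of time spent streaming in one stream/reposition cycle."""
    stream = streaming_time(watermark, host_rate, drive_rate)
    if math.isinf(stream):
        return 1.0
    # The drive can restart only after it has repositioned AND the cache
    # has been refilled to the watermark, whichever takes longer.
    idle = max(reposition_time, watermark / host_rate)
    return stream / (stream + idle)

# Hypothetical numbers: 8 MB watermark, 1 MB/s host, 3 MB/s drive, 2 s reposition.
t_stream = streaming_time(8e6, 1e6, 3e6)          # 4.0 seconds
duty = streaming_duty_cycle(8e6, 1e6, 3e6, 2.0)   # 4 / (4 + 8)
```

         In this example the refill time dominates the reposition time, so the average throughput `duty * drive_rate` equals the host rate, as one would expect when the host is the bottleneck.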
      

      
      An efficient implementation of 3D high-resolution imaging for large-scale seismic data with GPU/CPU heterogeneous parallel computing
      NASA Astrophysics Data System (ADS)
      Xu, Jincheng; Liu, Wei; Wang, Jin; Liu, Linong; Zhang, Jianfeng
         2018-02-01
         De-absorption pre-stack time migration (QPSTM) compensates for the absorption and dispersion of seismic waves by introducing an effective Q parameter, thereby making it an effective tool for 3D, high-resolution imaging of seismic data. Although the optimal aperture obtained via stationary-phase migration reduces the computational cost of 3D QPSTM and yields 3D stationary-phase QPSTM, the associated computational efficiency is still the main problem in the processing of 3D, high-resolution images for real large-scale seismic data. In the current paper, we proposed a division method for large-scale, 3D seismic data to optimize the performance of stationary-phase QPSTM on clusters of graphics processing units (GPU). Then, we designed an imaging point parallel strategy to achieve an optimal parallel computing performance. Afterward, we adopted an asynchronous double buffering scheme for multi-stream to perform the GPU/CPU parallel computing. Moreover, several key optimization strategies of computation and storage based on the compute unified device architecture (CUDA) were adopted to accelerate the 3D stationary-phase QPSTM algorithm. Compared with the initial GPU code, the implementation of the key optimization steps, including thread optimization, shared memory optimization, register optimization and special function units (SFU), greatly improved the efficiency. A numerical example employing real large-scale, 3D seismic data showed that our scheme is nearly 80 times faster than the CPU-QPSTM algorithm. Our GPU/CPU heterogeneous parallel computing framework significantly reduces the computational cost and facilitates 3D high-resolution imaging for large-scale seismic data.
      

      
      GPU-accelerated automatic identification of robust beam setups for proton and carbon-ion radiotherapy
      NASA Astrophysics Data System (ADS)
      Ammazzalorso, F.; Bednarz, T.; Jelen, U.
         2014-03-01
         We demonstrate acceleration on graphics processing units (GPU) of automatic identification of robust particle therapy beam setups, minimizing negative dosimetric effects of Bragg peak displacement caused by treatment-time patient positioning errors. Our particle therapy research toolkit, RobuR, was extended with OpenCL support and used to implement calculation on GPU of the Port Homogeneity Index, a metric scoring irradiation port robustness through analysis of tissue density patterns prior to dose optimization and computation. Results were benchmarked against an independent native CPU implementation. Numerical results were in agreement between the GPU implementation and native CPU implementation. For 10 skull base cases, the GPU-accelerated implementation was employed to select beam setups for proton and carbon ion treatment plans, which proved to be dosimetrically robust when recomputed in the presence of various simulated positioning errors. From the point of view of performance, average running time on the GPU decreased by at least one order of magnitude compared to the CPU, rendering the GPU-accelerated analysis a feasible step in a clinical treatment planning interactive session. In conclusion, selection of robust particle therapy beam setups can be effectively accelerated on a GPU and become an unintrusive part of the particle therapy treatment planning workflow. Additionally, the speed gain opens new usage scenarios, like interactive analysis manipulation (e.g. constraining of some setup) and re-execution. Finally, through OpenCL portable parallelism, the new implementation is suitable also for CPU-only use, taking advantage of multiple cores, and can potentially exploit types of accelerators other than GPUs.
      

      
      Improving the performance of heterogeneous multi-core processors by modifying the cache coherence protocol
      NASA Astrophysics Data System (ADS)
      Fang, Juan; Hao, Xiaoting; Fan, Qingwen; Chang, Zeqing; Song, Shuying
         2017-05-01
         In a heterogeneous multi-core architecture, the CPU and GPU are integrated on the same chip, which poses a new challenge for last-level cache management. In this architecture, CPU and GPU applications execute concurrently and both access the last-level cache (LLC). CPUs and GPUs have different memory access characteristics, so they differ in their sensitivity to LLC capacity. For many CPU applications, a reduced share of the LLC can lead to significant performance degradation. In contrast, GPU applications can tolerate increased memory access latency when there is sufficient thread-level parallelism. Taking this latency tolerance of GPU programs into account, this paper presents a method that lets GPU applications access memory directly, leaving more LLC space for CPU applications; this improves the performance of CPU applications without affecting the performance of GPU applications. When the CPU application is cache sensitive and the GPU application is insensitive to the cache, the overall performance of the system is improved significantly.
      

      
      A novel potential/viscous flow coupling technique for computing helicopter flow fields
      NASA Technical Reports Server (NTRS)
      Summa, J. Michael; Strash, Daniel J.; Yoo, Sungyul
         1993-01-01
         The primary objective of this work was to demonstrate the feasibility of a new potential/viscous flow coupling procedure for reducing computational effort while maintaining solution accuracy. This closed-loop, overlapped velocity-coupling concept has been developed in a new two-dimensional code, ZAP2D (Zonal Aerodynamics Program - 2D), a three-dimensional code for wing analysis, ZAP3D (Zonal Aerodynamics Program - 3D), and a three-dimensional code for isolated helicopter rotors in hover, ZAPR3D (Zonal Aerodynamics Program for Rotors - 3D). Comparisons with large domain ARC3D solutions and with experimental data for a NACA 0012 airfoil have shown that the required domain size can be reduced to a few tenths of a percent chord for the low Mach and low angle of attack cases and to less than 2-5 chords for the high Mach and high angle of attack cases while maintaining solution accuracies to within a few percent. This represents CPU time reductions by a factor of 2-4 compared with ARC2D. The current ZAP3D calculation for a rectangular plan-form wing of aspect ratio 5 with an outer domain radius of about 1.2 chords represents a speed-up in CPU time over the ARC3D large domain calculation by about a factor of 2.5 while maintaining solution accuracies to within a few percent. A ZAPR3D simulation for a two-bladed rotor in hover with a reduced grid domain of about two chord lengths was able to capture the wake effects and compared accurately with the experimental pressure data. Further development is required in order to substantiate the promise of computational improvements due to the ZAPR3D coupling concept.
      

      
      GPU-based stochastic-gradient optimization for non-rigid medical image registration in time-critical applications
      NASA Astrophysics Data System (ADS)
      Bhosale, Parag; Staring, Marius; Al-Ars, Zaid; Berendsen, Floris F.
         2018-03-01
         Currently, non-rigid image registration algorithms are too computationally intensive to use in time-critical applications. Existing implementations that focus on speed typically address this by either parallelization on GPU-hardware, or by introducing methodically novel techniques into CPU-oriented algorithms. Stochastic gradient descent (SGD) optimization and variations thereof have proven to drastically reduce the computational burden for CPU-based image registration, but have not been successfully applied in GPU hardware due to its stochastic nature. This paper proposes 1) NiftyRegSGD, a SGD optimization for the GPU-based image registration tool NiftyReg, 2) random chunk sampler, a new random sampling strategy that better utilizes the memory bandwidth of GPU hardware. Experiments have been performed on 3D lung CT data of 19 patients, which compared NiftyRegSGD (with and without random chunk sampler) with CPU-based elastix Fast Adaptive SGD (FASGD) and NiftyReg. The registration runtime was 21.5s, 4.4s and 2.8s for elastix-FASGD, NiftyRegSGD without, and NiftyRegSGD with random chunk sampling, respectively, while similar accuracy was obtained. Our method is publicly available at https://github.com/SuperElastix/NiftyRegSGD.
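         The abstract does not spell out how the random chunk sampler works. The Python sketch below shows one plausible reading, assuming voxels live in a flat array and a "chunk" is a run of contiguous indices. The function names, the chunk length and the volume size are hypothetical and are not taken from NiftyRegSGD.

```python
import random

def random_voxel_sample(n_voxels, n_samples, rng):
    """Baseline: draw each sample index independently (scattered memory reads)."""
    return [rng.randrange(n_voxels) for _ in range(n_samples)]

def random_chunk_sample(n_voxels, n_samples, chunk_len, rng):
    """Draw random chunk start positions, then take contiguous indices from each.

    Contiguous indices let neighbouring GPU threads read neighbouring
    addresses, which is the usual motivation for chunked sampling.
    This is an illustrative assumption, not the published algorithm.
    """
    if chunk_len <= 0 or chunk_len > n_voxels:
        raise ValueError("chunk_len must be in 1..n_voxels")
    indices = []
    while len(indices) < n_samples:
        start = rng.randrange(n_voxels - chunk_len + 1)
        indices.extend(range(start, start + chunk_len))
    return indices[:n_samples]

rng = random.Random(0)
sample = random_chunk_sample(n_voxels=1_000_000, n_samples=2048, chunk_len=32, rng=rng)
```

         Both samplers return the same number of indices; only the memory access pattern differs, which is where a bandwidth gain on GPU hardware would be expected to come from.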
      

      
      Implementation of GPU accelerated SPECT reconstruction with Monte Carlo-based scatter correction.
      PubMed
      Bexelius, Tobias; Sohlberg, Antti
         2018-06-01
         Statistical SPECT reconstruction can be very time-consuming, especially when compensations for collimator and detector response, attenuation, and scatter are included in the reconstruction. This work proposes an accelerated SPECT reconstruction algorithm based on graphics processing unit (GPU) processing. Ordered subset expectation maximization (OSEM) algorithm with CT-based attenuation modelling, depth-dependent Gaussian convolution-based collimator-detector response modelling, and Monte Carlo-based scatter compensation was implemented using OpenCL. The OpenCL implementation was compared against the existing multi-threaded OSEM implementation running on a central processing unit (CPU) in terms of scatter-to-primary ratios, standardized uptake values (SUVs), and processing speed using mathematical phantoms and clinical multi-bed bone SPECT/CT studies. The difference in scatter-to-primary ratios, visual appearance, and SUVs between GPU and CPU implementations was minor. On the other hand, at its best, the GPU implementation was found to be 24 times faster than the multi-threaded CPU version on a normal 128 × 128 matrix size 3 bed bone SPECT/CT data set when compensations for collimator and detector response, attenuation, and scatter were included. GPU SPECT reconstructions show great promise as an everyday clinical reconstruction tool.
      

      
      Large-scale neural circuit mapping data analysis accelerated with the graphical processing unit (GPU).
      PubMed
      Shi, Yulin; Veidenbaum, Alexander V; Nicolau, Alex; Xu, Xiangmin
         2015-01-15
         Modern neuroscience research demands computing power. Neural circuit mapping studies such as those using laser scanning photostimulation (LSPS) produce large amounts of data and require intensive computation for post hoc processing and analysis. Here we report on the design and implementation of a cost-effective desktop computer system for accelerated experimental data processing with recent GPU computing technology. A new version of Matlab software with GPU enabled functions is used to develop programs that run on Nvidia GPUs to harness their parallel computing power. We evaluated both the central processing unit (CPU) and GPU-enabled computational performance of our system in benchmark testing and practical applications. The experimental results show that the GPU-CPU co-processing of simulated data and actual LSPS experimental data clearly outperformed the multi-core CPU with up to a 22× speedup, depending on computational tasks. Further, we present a comparison of numerical accuracy between GPU and CPU computation to verify the precision of GPU computation. In addition, we show how GPUs can be effectively adapted to improve the performance of commercial image processing software such as Adobe Photoshop. To our best knowledge, this is the first demonstration of GPU application in neural circuit mapping and electrophysiology-based data processing. Together, GPU enabled computation enhances our ability to process large-scale data sets derived from neural circuit mapping studies, allowing for increased processing speeds while retaining data precision. Copyright © 2014 Elsevier B.V. All rights reserved.
      

      
      Large scale neural circuit mapping data analysis accelerated with the graphical processing unit (GPU)
      PubMed Central
      Shi, Yulin; Veidenbaum, Alexander V.; Nicolau, Alex; Xu, Xiangmin
         2014-01-01
         Background Modern neuroscience research demands computing power. Neural circuit mapping studies such as those using laser scanning photostimulation (LSPS) produce large amounts of data and require intensive computation for post-hoc processing and analysis. New Method Here we report on the design and implementation of a cost-effective desktop computer system for accelerated experimental data processing with recent GPU computing technology. A new version of Matlab software with GPU enabled functions is used to develop programs that run on Nvidia GPUs to harness their parallel computing power. Results We evaluated both the central processing unit (CPU) and GPU-enabled computational performance of our system in benchmark testing and practical applications. The experimental results show that the GPU-CPU co-processing of simulated data and actual LSPS experimental data clearly outperformed the multi-core CPU with up to a 22x speedup, depending on computational tasks. Further, we present a comparison of numerical accuracy between GPU and CPU computation to verify the precision of GPU computation. In addition, we show how GPUs can be effectively adapted to improve the performance of commercial image processing software such as Adobe Photoshop. Comparison with Existing Method(s) To our best knowledge, this is the first demonstration of GPU application in neural circuit mapping and electrophysiology-based data processing. Conclusions Together, GPU enabled computation enhances our ability to process large-scale data sets derived from neural circuit mapping studies, allowing for increased processing speeds while retaining data precision. PMID:25277633
      

      
      Near-realtime simulations of bioelectric activity in small mammalian hearts using graphical processing units
      PubMed Central
      Vigmond, Edward J.; Boyle, Patrick M.; Leon, L. Joshua; Plank, Gernot
         2014-01-01
         Simulations of cardiac bioelectric phenomena remain a significant challenge despite continual advancements in computational machinery. Spanning large temporal and spatial ranges demands millions of nodes to accurately depict geometry, and a comparable number of timesteps to capture dynamics. This study explores a new hardware computing paradigm, the graphics processing unit (GPU), to accelerate cardiac models, and analyzes results in the context of simulating a small mammalian heart in real time. The ODEs associated with membrane ionic flow were computed on traditional CPU and compared to GPU performance, for one to four parallel processing units. The scalability of solving the PDE responsible for tissue coupling was examined on a cluster using up to 128 cores. Results indicate that the GPU implementation was between 9 and 17 times faster than the CPU implementation and scaled similarly. Solving the PDE was still 160 times slower than real time. PMID:19964295
      

      
      Vector computer memory bank contention
      NASA Technical Reports Server (NTRS)
      Bailey, D. H.
         1985-01-01
         A number of vector supercomputers feature very large memories. Unfortunately the large capacity memory chips that are used in these computers are much slower than the fast central processing unit (CPU) circuitry. As a result, memory bank reservation times (in CPU ticks) are much longer than on previous generations of computers. A consequence of these long reservation times is that memory bank contention is sharply increased, resulting in significantly lowered performance rates. The phenomenon of memory bank contention in vector computers is analyzed using both a Markov chain model and a Monte Carlo simulation program. The results of this analysis indicate that future generations of supercomputers must either employ much faster memory chips or else feature very large numbers of independent memory banks.
      

        
       
          


          
        

           
           
             
               
      
      New Focal Plane Array Controller for the Instruments of the Subaru Telescope
      NASA Astrophysics Data System (ADS)
      Nakaya, Hidehiko; Komiyama, Yutaka; Miyazaki, Satoshi; Yamashita, Takuya; Yagi, Masafumi; Sekiguchi, Maki
         2006-03-01
         We have developed a next-generation data acquisition system, MESSIA5 (Modularized Extensible System for Image Acquisition), which comprises the digital part of a focal plane array controller. The new data acquisition system was constructed based on a 64 bit, 66 MHz PCI (peripheral component interconnect) bus architecture and runs on an x86 CPU computer with (non-real-time) Linux. The system, including the CPU board, is placed at the telescope focus, and standard gigabit Ethernet is adopted for the data transfer, as opposed to a dedicated fiber link. During the summer of 2002, we installed the new system for the first time on the Subaru prime-focus camera Suprime-Cam and successfully improved the observing performance.
      

      
      Vector computer memory bank contention
      NASA Technical Reports Server (NTRS)
      Bailey, David H.
         1987-01-01
         A number of vector supercomputers feature very large memories. Unfortunately the large capacity memory chips that are used in these computers are much slower than the fast central processing unit (CPU) circuitry. As a result, memory bank reservation times (in CPU ticks) are much longer than on previous generations of computers. A consequence of these long reservation times is that memory bank contention is sharply increased, resulting in significantly lowered performance rates. The phenomenon of memory bank contention in vector computers is analyzed using both a Markov chain model and a Monte Carlo simulation program. The results of this analysis indicate that future generations of supercomputers must either employ much faster memory chips or else feature very large numbers of independent memory banks.
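         The abstract gives no details of the Monte Carlo simulation program. As a rough illustration of the effect it describes, the Python sketch below uses a toy model with assumptions of its own: a single request stream, uniformly random bank addresses, and a fixed reservation time per access. The function name and parameter values are hypothetical.

```python
import random

def simulate_bank_contention(n_banks, busy_ticks, n_requests, seed=0):
    """Return the fraction of CPU ticks on which a memory request was issued.

    Toy model: one request stream, uniformly random bank addresses, and a bank
    that stays reserved for `busy_ticks` ticks after each access. A request to
    a reserved bank stalls the stream until that bank is free.
    """
    rng = random.Random(seed)
    free_at = [0] * n_banks   # tick at which each bank becomes available
    tick = 0
    for _ in range(n_requests):
        bank = rng.randrange(n_banks)
        tick = max(tick, free_at[bank])      # stall if the bank is reserved
        free_at[bank] = tick + busy_ticks    # reserve the bank
        tick += 1                            # issuing a request takes one tick
    return n_requests / tick

for banks in (16, 64, 256):
    rate = simulate_bank_contention(banks, busy_ticks=8, n_requests=100_000)
    print(f"{banks:4d} banks: {rate:.3f} requests per tick")
```

         In this toy model the issue rate rises toward one request per tick as the number of banks grows or the reservation time shrinks, which mirrors the two remedies named in the abstract.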
      

      
      Bridging FPGA and GPU technologies for AO real-time control
      NASA Astrophysics Data System (ADS)
      Perret, Denis; Lainé, Maxime; Bernard, Julien; Gratadour, Damien; Sevin, Arnaud
         2016-07-01
         Our team has developed a common environment for high performance simulations and real-time control of AO systems based on the use of Graphics Processing Units in the context of the COMPASS project. Such a solution, based on the ability of the real time core in the simulation to provide adequate computing performance, limits the cost of developing AO RTC systems and makes them more scalable. A code developed and validated in the context of the simulation may be injected directly into the system and tested on sky. Furthermore, the use of relatively low cost components also offers significant advantages for the system hardware platform. However, the use of GPUs in an AO loop comes with drawbacks: the traditional way of offloading computation from CPU to GPUs, involving multiple copies and unacceptable overhead in kernel launching, is not well suited to a real time context. This last application requires the implementation of a solution enabling direct memory access (DMA) to the GPU memory from a third party device, bypassing the operating system. This allows this device to communicate directly with the real-time core of the simulation, feeding it with the WFS camera pixel stream. We show that DMA between a custom FPGA-based frame-grabber and a computation unit (GPU, FPGA, or coprocessor such as Xeon Phi) across PCIe allows us to get latencies compatible with what will be needed on ELTs. As a fine-grained synchronization mechanism is not yet made available by GPU vendors, we propose the use of memory polling to avoid interrupt handling and involvement of a CPU. Network and Vision protocols are handled by the FPGA-based Network Interface Card (NIC). We present the results we obtained on a complete AO loop using camera and deformable mirror simulators.
      

      
      A numerical code for the simulation of non-equilibrium chemically reacting flows on hybrid CPU-GPU clusters
      NASA Astrophysics Data System (ADS)
      Kudryavtsev, Alexey N.; Kashkovsky, Alexander V.; Borisov, Semyon P.; Shershnev, Anton A.
         2017-10-01
         In the present work, a computer code, RCFS, for numerical simulation of chemically reacting compressible flows on hybrid CPU/GPU supercomputers is developed. It solves the 3D unsteady Euler equations for multispecies chemically reacting flows in general curvilinear coordinates using shock-capturing TVD schemes. Time advancement is carried out using explicit Runge-Kutta TVD schemes. The implementation uses the CUDA application programming interface to perform GPU computations. Data is distributed between GPUs via a domain decomposition technique. The developed code is verified on a number of test cases, including supersonic flow over a cylinder.
      

      
      Generalized conjugate-gradient methods for the Navier-Stokes equations
      NASA Technical Reports Server (NTRS)
      Ajmani, Kumud; Ng, Wing-Fai; Liou, Meng-Sing
         1991-01-01
         A generalized conjugate-gradient method is used to solve the two-dimensional, compressible Navier-Stokes equations of fluid flow. The equations are discretized with an implicit, upwind finite-volume formulation. Preconditioning techniques are incorporated into the new solver to accelerate convergence of the overall iterative method. The superiority of the new solver is demonstrated by comparisons with a conventional line Gauss-Seidel Relaxation solver. Computational test results for transonic flow (trailing edge flow in a transonic turbine cascade) and hypersonic flow (M = 6.0 shock-on-shock phenomena on a cylindrical leading edge) are presented. When applied to the transonic cascade case, the new solver is 4.4 times faster in terms of number of iterations and 3.1 times faster in terms of CPU time than the Relaxation solver. For the hypersonic shock case, the new solver is 3.0 times faster in terms of number of iterations and 2.2 times faster in terms of CPU time than the Relaxation solver.
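         The gap between the iteration and CPU-time figures implies that each iteration of the new solver costs more than a relaxation sweep. The short Python calculation below makes that explicit, assuming "N times faster in terms of number of iterations" means the new solver needs 1/N as many iterations; the function name is illustrative.

```python
def relative_cost_per_iteration(iteration_speedup, cpu_speedup):
    """Ratio of new-solver to relaxation-solver CPU time per iteration.

    If the new solver needs 1/iteration_speedup as many iterations but
    1/cpu_speedup as much total CPU time, each of its iterations costs
    iteration_speedup / cpu_speedup times as much.
    """
    return iteration_speedup / cpu_speedup

transonic = relative_cost_per_iteration(4.4, 3.1)
hypersonic = relative_cost_per_iteration(3.0, 2.2)
print(f"transonic: {transonic:.2f}x, hypersonic: {hypersonic:.2f}x")
```

         Under that reading, an iteration of the new solver is roughly 1.4 times as expensive as a relaxation iteration in both test cases, and the saving comes from needing far fewer iterations.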
      

      
      
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Li, C.; Yu, G.; Wang, K.
         
         The physical designs of new-concept reactors, which have complex structures, various materials and neutron energy spectra, have greatly increased the demands on calculation methods and the corresponding computing hardware. Along with widely used parallel algorithms, heterogeneous platform architectures have been introduced into numerical computations in reactor physics. Because of its naturally parallel characteristics, the CPU-FPGA architecture is often used to accelerate numerical computation. This paper studies the application and features of this kind of heterogeneous platform in numerical calculations of reactor physics through practical examples. The designed neutron diffusion module based on the CPU-FPGA architecture achieves a speed-up factor of 11.2, which shows that it is feasible to apply this kind of heterogeneous platform to reactor physics. (authors)
      

      
      An emulator for minimizing computer resources for finite element analysis
      NASA Technical Reports Server (NTRS)
      Melosh, R.; Utku, S.; Islam, M.; Salama, M.
         1984-01-01
         A computer code, SCOPE, has been developed for predicting the computer resources required for a given analysis code, computer hardware, and structural problem. The cost of running the code is a small fraction (about 3 percent) of the cost of performing the actual analysis. However, its accuracy in predicting the CPU and I/O resources depends intrinsically on the accuracy of calibration data that must be developed once for the computer hardware and the finite element analysis code of interest. Testing of the SCOPE code on the AMDAHL 470 V/8 computer and the ELAS finite element analysis program indicated small I/O errors (3.2 percent), larger CPU errors (17.8 percent), and negligible total errors (1.5 percent).
      

      
      WE-C-217BCD-08: Rapid Monte Carlo Simulations of DQE(f) of Scintillator-Based Detectors.
      PubMed
      Star-Lack, J; Abel, E; Constantin, D; Fahrig, R; Sun, M
         2012-06-01
         Monte Carlo simulations of DQE(f) can greatly aid in the design of scintillator-based detectors by helping optimize key parameters including scintillator material and thickness, pixel size, surface finish, and septa reflectivity. However, the additional optical transport significantly increases simulation times, necessitating a large number of parallel processors to adequately explore the parameter space. To address this limitation, we have optimized the DQE(f) algorithm, reducing simulation times per design iteration to 10 minutes on a single CPU. DQE(f) is proportional to the ratio MTF(f)^2/NPS(f). The LSF-MTF simulation uses a slanted line source and is rapidly performed with relatively few gammas launched. However, the conventional NPS simulation for standard radiation exposure levels requires the acquisition of multiple flood fields (nRun), each requiring billions of input gamma photons (nGamma), many of which will scintillate, thereby producing thousands of optical photons (nOpt) per deposited MeV. The resulting execution time is proportional to the product nRun x nGamma x nOpt. In this investigation, we revisit the theoretical derivation of DQE(f), and reveal significant computation time savings through the optimization of nRun, nGamma, and nOpt. Using GEANT4, we determine optimal values for these three variables for a GOS scintillator-amorphous silicon portal imager. Both isotropic and Mie optical scattering processes were modeled. Simulation results were validated against the literature. We found that, depending on the radiative and optical attenuation properties of the scintillator, the NPS can be accurately computed using values for nGamma below 1000, and values for nOpt below 500/MeV. nRun should remain above 200. Using these parameters, typical computation times for a complete NPS ranged from 2-10 minutes on a single CPU. The number of launched particles and corresponding execution times for a DQE simulation can be dramatically reduced, allowing for accurate computation with modest computer hardware. NIH R01 CA138426. Several authors work for Varian Medical Systems. © 2012 American Association of Physicists in Medicine.
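         The two proportionalities stated in the abstract can be written down directly. The Python sketch below does so; the function names are hypothetical, and the parameter settings are placeholders chosen only to match the orders of magnitude mentioned ("billions", "thousands", "below 1000", "below 500/MeV"), not values reported by the study.

```python
def nps_cost(n_run, n_gamma, n_opt):
    """Relative execution cost, taken as proportional to nRun * nGamma * nOpt."""
    return n_run * n_gamma * n_opt

def dqe_from(mtf, nps, scale=1.0):
    """DQE(f) up to a constant factor: scale * MTF(f)^2 / NPS(f), per frequency bin."""
    return [scale * m * m / n for m, n in zip(mtf, nps)]

# Placeholder settings (assumptions for illustration, not data from the abstract).
baseline = nps_cost(n_run=200, n_gamma=1_000_000_000, n_opt=1000)
reduced = nps_cost(n_run=200, n_gamma=1000, n_opt=500)
print(f"cost ratio baseline/reduced: {baseline / reduced:.0f}")
```

         Because the cost is a plain product, reducing nGamma and nOpt by several orders of magnitude reduces the run time by the same combined factor, provided nRun is held fixed.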
      

      
      CPU-GPU mixed implementation of virtual node method for real-time interactive cutting of deformable objects using OpenCL.
      PubMed
      Jia, Shiyu; Zhang, Weizhong; Yu, Xiaokang; Pan, Zhenkuan
         2015-09-01
         Surgical simulators need to simulate interactive cutting of deformable objects in real time. The goal of this work was to design an interactive cutting algorithm that eliminates traditional cutting state classification and can work simultaneously with real-time GPU-accelerated deformation without affecting its numerical stability. A modified virtual node method for cutting is proposed. Deformable object is modeled as a real tetrahedral mesh embedded in a virtual tetrahedral mesh, and the former is used for graphics rendering and collision, while the latter is used for deformation. Cutting algorithm first subdivides real tetrahedrons to eliminate all face and edge intersections, then splits faces, edges and vertices along cutting tool trajectory to form cut surfaces. Next virtual tetrahedrons containing more than one connected real tetrahedral fragments are duplicated, and connectivity between virtual tetrahedrons is updated. Finally, embedding relationship between real and virtual tetrahedral meshes is updated. Co-rotational linear finite element method is used for deformation. Cutting and collision are processed by CPU, while deformation is carried out by GPU using OpenCL. Efficiency of GPU-accelerated deformation algorithm was tested using block models with varying numbers of tetrahedrons. Effectiveness of our cutting algorithm under multiple cuts and self-intersecting cuts was tested using a block model and a cylinder model. Cutting of a more complex liver model was performed, and detailed performance characteristics of cutting, deformation and collision were measured and analyzed. Our cutting algorithm can produce continuous cut surfaces when traditional minimal element creation algorithm fails. Our GPU-accelerated deformation algorithm remains stable with constant time step under multiple arbitrary cuts and works on both NVIDIA and AMD GPUs. GPU-CPU speed ratio can be as high as 10 for models with 80,000 tetrahedrons. 
Forty to sixty percent real-time performance and 100-200 Hz simulation rate are achieved for the liver model with 3,101 tetrahedrons. Major bottlenecks for simulation efficiency are cutting, collision processing and CPU-GPU data transfer. Future work needs to improve on these areas.
      

      
      A Specialized Diacylglycerol Acyltransferase Contributes to the Extreme Medium-Chain Fatty Acid Content of Cuphea Seed Oil
      PubMed Central
      Iskandarov, Umidjon; Silva, Jillian E.; Andersson, Mariette
         2017-01-01
         Seed oils of many Cuphea sp. contain >90% of medium-chain fatty acids, such as decanoic acid (10:0). These seed oils, which are among the most compositionally variant in the plant kingdom, arise from specialized fatty acid biosynthetic enzymes and specialized acyltransferases. These include lysophosphatidic acid acyltransferases (LPAT) and diacylglycerol acyltransferases (DGAT) that are required for successive acylation of medium-chain fatty acids in the sn-2 and sn-3 positions of seed triacylglycerols (TAGs). Here we report the identification of a cDNA for a DGAT1-type enzyme, designated CpuDGAT1, from the transcriptome of C. avigera var pulcherrima developing seeds. Microsomes of camelina (Camelina sativa) seeds engineered for CpuDGAT1 expression displayed DGAT activity with 10:0-CoA and the diacylglycerol didecanoyl, that was approximately 4-fold higher than that in camelina seed microsomes lacking CpuDGAT1. In addition, coexpression in camelina seeds of CpuDGAT1 with a C. viscosissima FatB thioesterase (CvFatB1) that generates 10:0 resulted in TAGs with nearly 15 mol % of 10:0. More strikingly, expression of CpuDGAT1 and CvFatB1 with the previously described CvLPAT2, a 10:0-CoA-specific Cuphea LPAT, increased 10:0 amounts to 25 mol % in camelina seed TAG. These TAGs contained up to 40 mol % 10:0 in the sn-2 position, nearly double the amounts obtained from coexpression of CvFatB1 and CvLPAT2 alone. Although enriched in diacylglycerol, 10:0 was not detected in phosphatidylcholine in these seeds. These findings are consistent with channeling of 10:0 into TAG through the combined activities of specialized LPAT and DGAT activities and demonstrate the biotechnological use of these enzymes to generate 10:0-rich seed oils. PMID:28325847
      

      
      A Specialized Diacylglycerol Acyltransferase Contributes to the Extreme Medium-Chain Fatty Acid Content of Cuphea Seed Oil.
      PubMed
      Iskandarov, Umidjon; Silva, Jillian E; Kim, Hae Jin; Andersson, Mariette; Cahoon, Rebecca E; Mockaitis, Keithanne; Cahoon, Edgar B
         2017-05-01
         Seed oils of many Cuphea sp. contain >90% of medium-chain fatty acids, such as decanoic acid (10:0). These seed oils, which are among the most compositionally variant in the plant kingdom, arise from specialized fatty acid biosynthetic enzymes and specialized acyltransferases. These include lysophosphatidic acid acyltransferases (LPAT) and diacylglycerol acyltransferases (DGAT) that are required for successive acylation of medium-chain fatty acids in the sn-2 and sn-3 positions of seed triacylglycerols (TAGs). Here we report the identification of a cDNA for a DGAT1-type enzyme, designated CpuDGAT1, from the transcriptome of C. avigera var pulcherrima developing seeds. Microsomes of camelina (Camelina sativa) seeds engineered for CpuDGAT1 expression displayed DGAT activity with 10:0-CoA and the diacylglycerol didecanoyl, that was approximately 4-fold higher than that in camelina seed microsomes lacking CpuDGAT1. In addition, coexpression in camelina seeds of CpuDGAT1 with a C. viscosissima FatB thioesterase (CvFatB1) that generates 10:0 resulted in TAGs with nearly 15 mol % of 10:0. More strikingly, expression of CpuDGAT1 and CvFatB1 with the previously described CvLPAT2, a 10:0-CoA-specific Cuphea LPAT, increased 10:0 amounts to 25 mol % in camelina seed TAG. These TAGs contained up to 40 mol % 10:0 in the sn-2 position, nearly double the amounts obtained from coexpression of CvFatB1 and CvLPAT2 alone. Although enriched in diacylglycerol, 10:0 was not detected in phosphatidylcholine in these seeds. These findings are consistent with channeling of 10:0 into TAG through the combined activities of specialized LPAT and DGAT activities and demonstrate the biotechnological use of these enzymes to generate 10:0-rich seed oils. © 2017 American Society of Plant Biologists. All Rights Reserved.
      

      
      Automated Testing Experience of the Linear Aerospike SR-71 Experiment (LASRE) Controller
      NASA Technical Reports Server (NTRS)
      Larson, Richard R.
         1999-01-01
         System controllers must be fail-safe, low cost, flexible to software changes, able to output health and status words, and permit rapid retest qualification. The system controller designed and tested for the aerospike engine program was an attempt to meet these requirements. This paper describes (1) the aerospike controller design, (2) the automated simulation testing techniques, and (3) the real time monitoring data visualization structure. Controller cost was minimized by design of a single-string system that used an off-the-shelf 486 central processing unit (CPU). A linked-list architecture, with states (nodes) defined in a user-friendly state table, accomplished software changes to the controller. Proven to be fail-safe, this system reported the abort cause and automatically reverted to a safe condition for any first failure. A real time simulation and test system automated the software checkout and retest requirements. A program requirement to decode all abort causes in real time during all ground and flight tests assured the safety of flight decisions and the proper execution of mission rules. The design also included health and status words, and provided a real time analysis interpretation for all health and status data.
      

      
      System for processing an encrypted instruction stream in hardware
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Griswold, Richard L.; Nickless, William K.; Conrad, Ryan C.
         
         A system and method of processing an encrypted instruction stream in hardware is disclosed. Main memory stores the encrypted instruction stream and unencrypted data. A central processing unit (CPU) is operatively coupled to the main memory. A decryptor is operatively coupled to the main memory and located within the CPU. The decryptor decrypts the encrypted instruction stream upon receipt of an instruction fetch signal from a CPU core. Unencrypted data is passed through to the CPU core without decryption upon receipt of a data fetch signal.
      

      
      Execution of a parallel edge-based Navier-Stokes solver on commodity graphics processor units
      NASA Astrophysics Data System (ADS)
      Corral, Roque; Gisbert, Fernando; Pueblas, Jesus
         2017-02-01
         The implementation of an edge-based three-dimensional Reynolds-Averaged Navier-Stokes solver for unstructured grids able to run on multiple graphics processing units (GPUs) is presented. Loops over edges, which are the most time-consuming part of the solver, have been written to exploit the massively parallel capabilities of GPUs. Non-blocking communications between parallel processes and between the GPU and the central processing unit (CPU) have been used to enhance code scalability. The code is written using a mixture of C++ and OpenCL, to allow the execution of the source code on GPUs. The Message Passing Interface (MPI) library is used to allow the parallel execution of the solver on multiple GPUs. A comparative study of the solver parallel performance is carried out using a cluster of CPUs and another of GPUs. It is shown that a single GPU is up to 64 times faster than a single CPU core. The parallel scalability of the solver is mainly degraded due to the loss of computing efficiency of the GPU when the size of the case decreases. However, for large enough grid sizes, the scalability is strongly improved. A cluster featuring commodity GPUs and a high bandwidth network is ten times less costly and consumes 33% less energy than a CPU-based cluster with an equivalent computational power.
      

      
      A practically unconditionally gradient stable scheme for the N-component Cahn-Hilliard system
      NASA Astrophysics Data System (ADS)
      Lee, Hyun Geun; Choi, Jeong-Whan; Kim, Junseok
         2012-02-01
         We present a practically unconditionally gradient stable conservative nonlinear numerical scheme for the N-component Cahn-Hilliard system modeling the phase separation of an N-component mixture. The scheme is based on a nonlinear splitting method and is solved by an efficient and accurate nonlinear multigrid method. The scheme allows us to convert the N-component Cahn-Hilliard system into a system of N-1 binary Cahn-Hilliard equations and significantly reduces the required computer memory and CPU time. We observe that our numerical solutions are consistent with the linear stability analysis results. We also demonstrate the efficiency of the proposed scheme with various numerical experiments.
      

      
      Computation of Turbulent Heat Transfer on the Walls of a 180 Degree Turn Channel With a Low Reynolds Number Reynolds Stress Model
      NASA Technical Reports Server (NTRS)
      Ameri, A. A.; Rigby, D. L.; Steinthorsson, E.; Gaugler, Raymond (Technical Monitor)
         2002-01-01
         The Low Reynolds number version of the Stress-omega model and the two equation k-omega model of Wilcox were used for the calculation of turbulent heat transfer in a 180 degree turn simulating an internal coolant passage. The Stress-omega model was chosen for its robustness. The turbulent thermal fluxes were calculated by modifying and using the Generalized Gradient Diffusion Hypothesis. The results showed that using this Reynolds Stress model allowed better prediction of heat transfer compared to the k-omega two equation model. This improvement however required a finer grid and commensurately more CPU time.
      

      
      Construction of the energy matrix for complex atoms. Part VIII: Hyperfine structure HPC calculations for terbium atom
      NASA Astrophysics Data System (ADS)
      Elantkowska, Magdalena; Ruczkowski, Jarosław; Sikorski, Andrzej; Dembczyński, Jerzy
         2017-11-01
         A parametric analysis of the hyperfine structure (hfs) for the even parity configurations of atomic terbium (Tb I) is presented in this work. We introduce the complete set of 4f^N-core states in our high-performance computing (HPC) calculations. For calculations of the huge hyperfine structure matrix, requiring approximately 5000 hours when run on a single CPU, we propose methods utilizing a personal computer cluster or, alternatively, a cluster of Microsoft Azure virtual machines (VM). These methods give a 12-fold performance boost, enabling the calculations to complete in an acceptable time.
      

      
      Accelerating Advanced MRI Reconstructions on GPUs
      PubMed Central
      Stone, S.S.; Haldar, J.P.; Tsao, S.C.; Hwu, W.-m.W.; Sutton, B.P.; Liang, Z.-P.
         2008-01-01
         Computational acceleration on graphics processing units (GPUs) can make advanced magnetic resonance imaging (MRI) reconstruction algorithms attractive in clinical settings, thereby improving the quality of MR images across a broad spectrum of applications. This paper describes the acceleration of such an algorithm on NVIDIA’s Quadro FX 5600. The reconstruction of a 3D image with 128^3 voxels achieves up to 180 GFLOPS and requires just over one minute on the Quadro, while reconstruction on a quad-core CPU is twenty-one times slower. Furthermore, relative to the true image, the error exhibited by the advanced reconstruction is only 12%, while conventional reconstruction techniques incur error of 42%. PMID:21796230
      

      
      Accelerating Advanced MRI Reconstructions on GPUs.
      PubMed
      Stone, S S; Haldar, J P; Tsao, S C; Hwu, W-M W; Sutton, B P; Liang, Z-P
         2008-10-01
         Computational acceleration on graphics processing units (GPUs) can make advanced magnetic resonance imaging (MRI) reconstruction algorithms attractive in clinical settings, thereby improving the quality of MR images across a broad spectrum of applications. This paper describes the acceleration of such an algorithm on NVIDIA's Quadro FX 5600. The reconstruction of a 3D image with 128^3 voxels achieves up to 180 GFLOPS and requires just over one minute on the Quadro, while reconstruction on a quad-core CPU is twenty-one times slower. Furthermore, relative to the true image, the error exhibited by the advanced reconstruction is only 12%, while conventional reconstruction techniques incur error of 42%.
      

      
      Sputnik: ad hoc distributed computation.
      PubMed
      Völkel, Gunnar; Lausser, Ludwig; Schmid, Florian; Kraus, Johann M; Kestler, Hans A
         2015-04-15
         In bioinformatic applications, computationally demanding algorithms are often parallelized to speed up computation. Nevertheless, setting up computational environments for distributed computation is often tedious. The aims of this project were lightweight ad hoc setup and fault-tolerant computation requiring only a Java runtime and no administrator rights, while utilizing all CPU cores effectively. The Sputnik framework provides ad hoc distributed computation on the Java Virtual Machine and makes full use of all supplied CPU cores. It provides a graphical user interface for deployment setup and a web user interface displaying the status of current computation jobs. Neither a permanent setup nor administrator privileges are required. We demonstrate the utility of our approach on feature selection of microarray data. The Sputnik framework is available on Github http://github.com/sysbio-bioinf/sputnik under the Eclipse Public License.
      

        
       
          


          
        

           
           
             
               
      
      Algorithms of GPU-enabled reactive force field (ReaxFF) molecular dynamics.
      PubMed
      Zheng, Mo; Li, Xiaoxia; Guo, Li
         2013-04-01
         Reactive force field (ReaxFF), a recent and novel bond order potential, allows for reactive molecular dynamics (ReaxFF MD) simulations for modeling larger and more complex molecular systems involving chemical reactions when compared with computation intensive quantum mechanical methods. However, ReaxFF MD can be approximately 10-50 times slower than classical MD due to its explicit modeling of bond forming and breaking, the dynamic charge equilibration at each time-step, and its one order smaller time-step than the classical MD, all of which pose significant computational challenges in simulation capability to reach spatio-temporal scales of nanometers and nanoseconds. The very recent advances of graphics processing unit (GPU) provide not only highly favorable performance for GPU enabled MD programs compared with CPU implementations but also an opportunity to manage with the computing power and memory demanding nature imposed on computer hardware by ReaxFF MD. In this paper, we present the algorithms of GMD-Reax, the first GPU enabled ReaxFF MD program with significantly improved performance surpassing CPU implementations on desktop workstations. The performance of GMD-Reax has been benchmarked on a PC equipped with a NVIDIA C2050 GPU for coal pyrolysis simulation systems with atoms ranging from 1378 to 27,283. GMD-Reax achieved speedups as high as 12 times faster than Duin et al.'s FORTRAN codes in Lammps on 8 CPU cores and 6 times faster than the Lammps' C codes based on PuReMD in terms of the simulation time per time-step averaged over 100 steps. GMD-Reax could be used as a new and efficient computational tool for exploiting very complex molecular reactions via ReaxFF MD simulation on desktop workstations. Copyright © 2013 Elsevier Inc. All rights reserved.
      

      
      A survey of CPU-GPU heterogeneous computing techniques
      DOE PAGES
      Mittal, Sparsh; Vetter, Jeffrey S.
         2015-07-04
         As both CPU and GPU become employed in a wide range of applications, it has been acknowledged that both of these processing units (PUs) have their unique features and strengths and hence, CPU-GPU collaboration is inevitable to achieve high-performance computing. This has motivated a significant amount of research on heterogeneous computing techniques, along with the design of CPU-GPU fused chips and petascale heterogeneous supercomputers. In this paper, we survey heterogeneous computing techniques (HCTs) such as workload-partitioning which enable utilizing both CPU and GPU to improve performance and/or energy efficiency. We review heterogeneous computing approaches at runtime, algorithm, programming, compiler and application level. Further, we review both discrete and fused CPU-GPU systems; and discuss benchmark suites designed for evaluating heterogeneous computing systems (HCSs). Furthermore, we believe that this paper will provide insights into working and scope of applications of HCTs to researchers and motivate them to further harness the computational powers of CPUs and GPUs to achieve the goal of exascale performance.
      

      
      A survey of CPU-GPU heterogeneous computing techniques
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Mittal, Sparsh; Vetter, Jeffrey S.
         
         As both CPU and GPU become employed in a wide range of applications, it has been acknowledged that both of these processing units (PUs) have their unique features and strengths and hence, CPU-GPU collaboration is inevitable to achieve high-performance computing. This has motivated a significant amount of research on heterogeneous computing techniques, along with the design of CPU-GPU fused chips and petascale heterogeneous supercomputers. In this paper, we survey heterogeneous computing techniques (HCTs) such as workload-partitioning which enable utilizing both CPU and GPU to improve performance and/or energy efficiency. We review heterogeneous computing approaches at runtime, algorithm, programming, compiler and application level. Further, we review both discrete and fused CPU-GPU systems; and discuss benchmark suites designed for evaluating heterogeneous computing systems (HCSs). Furthermore, we believe that this paper will provide insights into working and scope of applications of HCTs to researchers and motivate them to further harness the computational powers of CPUs and GPUs to achieve the goal of exascale performance.
      

      
      Evaluation of user input methods for manipulating a tablet personal computer in sterile techniques.
      PubMed
      Yamada, Akira; Komatsu, Daisuke; Suzuki, Takeshi; Kurozumi, Masahiro; Fujinaga, Yasunari; Ueda, Kazuhiko; Kadoya, Masumi
         2017-02-01
         To determine a quick and accurate user input method for manipulating tablet personal computers (PCs) in sterile techniques. We evaluated three different manipulation methods, (1) Computer mouse and sterile system drape, (2) Fingers and sterile system drape, and (3) Digitizer stylus and sterile ultrasound probe cover with a pinhole, in terms of the central processing unit (CPU) performance, manipulation performance, and contactlessness. A significant decrease in CPU score ([Formula: see text]) and an increase in CPU temperature ([Formula: see text]) were observed when a system drape was used. The respective mean times taken to select a target image from an image series (ST) and the mean times for measuring points on an image (MT) were [Formula: see text] and [Formula: see text] s for the computer mouse method, [Formula: see text] and [Formula: see text] s for the finger method, and [Formula: see text] and [Formula: see text] s for the digitizer stylus method, respectively. The ST for the finger method was significantly longer than for the digitizer stylus method ([Formula: see text]). The MT for the computer mouse method was significantly longer than for the digitizer stylus method ([Formula: see text]). The mean success rate for measuring points on an image was significantly lower for the finger method when the diameter of the target was equal to or smaller than 8 mm than for the other methods. No significant difference in the adenosine triphosphate amount at the surface of the tablet PC was observed before, during, or after manipulation via the digitizer stylus method while wearing starch-powdered sterile gloves ([Formula: see text]). Quick and accurate manipulation of tablet PCs in sterile techniques without CPU load is feasible using a digitizer stylus and sterile ultrasound probe cover with a pinhole.
      

      
      Massively parallel data processing for quantitative total flow imaging with optical coherence microscopy and tomography
      NASA Astrophysics Data System (ADS)
      Sylwestrzak, Marcin; Szlag, Daniel; Marchand, Paul J.; Kumar, Ashwin S.; Lasser, Theo
         2017-08-01
         We present an application of massively parallel processing of quantitative flow measurements data acquired using spectral optical coherence microscopy (SOCM). The need for massive signal processing of these particular datasets has been a major hurdle for many applications based on SOCM. In view of this difficulty, we implemented and adapted quantitative total flow estimation algorithms on graphics processing units (GPU) and achieved a 150 fold reduction in processing time when compared to a former CPU implementation. As SOCM constitutes the microscopy counterpart to spectral optical coherence tomography (SOCT), the developed processing procedure can be applied to both imaging modalities. We present the developed DLL library integrated in MATLAB (with an example) and have included the source code for adaptations and future improvements. Catalogue identifier: AFBT_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AFBT_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: GNU GPLv3 No. of lines in distributed program, including test data, etc.: 913552 No. of bytes in distributed program, including test data, etc.: 270876249 Distribution format: tar.gz Programming language: CUDA/C, MATLAB. Computer: Intel x64 CPU, GPU supporting CUDA technology. Operating system: 64-bit Windows 7 Professional. Has the code been vectorized or parallelized?: Yes, CPU code has been vectorized in MATLAB, CUDA code has been parallelized. RAM: Dependent on users parameters, typically between several gigabytes and several tens of gigabytes Classification: 6.5, 18. 
Nature of problem: Speed up of data processing in optical coherence microscopy Solution method: Utilization of GPU for massively parallel data processing Additional comments: Compiled DLL library with source code and documentation, example of utilization (MATLAB script with raw data) Running time: 1,8 s for one B-scan (150 × faster in comparison to the CPU data processing time)
      

      
      Fast multipurpose Monte Carlo simulation for proton therapy using multi- and many-core CPU architectures
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Souris, Kevin, E-mail: kevin.souris@uclouvain.be; Lee, John Aldo; Sterpin, Edmond
         2016-04-15
         Purpose: Accuracy in proton therapy treatment planning can be improved using Monte Carlo (MC) simulations. However the long computation time of such methods hinders their use in clinical routine. This work aims to develop a fast multipurpose Monte Carlo simulation tool for proton therapy using massively parallel central processing unit (CPU) architectures. Methods: A new Monte Carlo, called MCsquare (many-core Monte Carlo), has been designed and optimized for the last generation of Intel Xeon processors and Intel Xeon Phi coprocessors. These massively parallel architectures offer the flexibility and the computational power suitable to MC methods. The class-II condensed history algorithm of MCsquare provides a fast and yet accurate method of simulating heavy charged particles such as protons, deuterons, and alphas inside voxelized geometries. Hard ionizations, with energy losses above a user-specified threshold, are simulated individually while soft events are regrouped in a multiple scattering theory. Elastic and inelastic nuclear interactions are sampled from ICRU 63 differential cross sections, thereby allowing for the computation of prompt gamma emission profiles. MCsquare has been benchmarked with the GATE/GEANT4 Monte Carlo application for homogeneous and heterogeneous geometries. Results: Comparisons with GATE/GEANT4 for various geometries show deviations within 2%–1 mm. In spite of the limited memory bandwidth of the coprocessor, simulation time is below 25 s for 10^7 primary 200 MeV protons in average soft tissues using all Xeon Phi and CPU resources embedded in a single desktop unit. Conclusions: MCsquare exploits the flexibility of CPU architectures to provide a multipurpose MC simulation tool. Optimized code enables the use of accurate MC calculation within a reasonable computation time, adequate for clinical practice. MCsquare also simulates prompt gamma emission and can thus be used also for in vivo range verification.
      

      
      A configurable distributed high-performance computing framework for satellite's TDI-CCD imaging simulation
      NASA Astrophysics Data System (ADS)
      Xue, Bo; Mao, Bingjing; Chen, Xiaomei; Ni, Guoqiang
         2010-11-01
         This paper presents a configurable distributed high-performance computing (HPC) framework for TDI-CCD imaging simulation. It uses the strategy pattern to accommodate multiple algorithms, which helps to decrease the simulation time at low expense. Imaging simulation for a TDI-CCD mounted on a satellite contains four processes: 1) degradation caused by the atmosphere, 2) degradation caused by the optical system, 3) degradation caused by the electronic system of the TDI-CCD, together with the re-sampling process, and 4) data integration. Processes 1) to 3) use a variety of data-intensive algorithms such as FFT, convolution and Lagrange interpolation, which require a powerful CPU. Even on an Intel Xeon X5550 processor, the regular serial processing method takes more than 30 hours for a simulation whose result image size is 1500 * 1462. According to our literature study, there is no mature distributed HPC framework in this field. Here we developed a distributed computing framework for TDI-CCD imaging simulation, which is based on WCF[1], uses a Client/Server (C/S) layer and invokes the free CPU resources in the LAN. The server pushes the tasks of processes 1) to 3) to the free computing capacity, thereby achieving HPC at low cost. In the computing experiment with 4 symmetric nodes and 1 server, this framework reduced the simulation time by about 74%. When more asymmetric nodes were added to the computing network, the time decreased accordingly. In conclusion, this framework could provide unlimited computation capacity on condition that the network and task management server are affordable. It is a brand new HPC solution for TDI-CCD imaging simulation and similar applications.
      

      
      Compiler-based code generation and autotuning for geometric multigrid on GPU-accelerated supercomputers
      DOE PAGES
      Basu, Protonu; Williams, Samuel; Van Straalen, Brian; ...
         2017-04-05
         GPUs, with their high bandwidths and computational capabilities, are an increasingly popular target for scientific computing. Unfortunately, to date, harnessing the power of the GPU has required use of a GPU-specific programming model like CUDA, OpenCL, or OpenACC. Thus, in order to deliver portability across CPU-based and GPU-accelerated supercomputers, programmers are forced to write and maintain two versions of their applications or frameworks. In this paper, we explore the use of a compiler-based autotuning framework based on CUDA-CHiLL to deliver not only portability, but also performance portability across CPU- and GPU-accelerated platforms for the geometric multigrid linear solvers found in many scientific applications. We also show that with autotuning we can attain near Roofline (a performance bound for a computation and target architecture) performance across the key operations in the miniGMG benchmark for both CPU- and GPU-based architectures as well as for multiple stencil discretizations and smoothers. We show that our technology is readily interoperable with MPI resulting in performance at scale equal to that obtained via hand-optimized MPI+CUDA implementation.
      

      
      Energy consumption optimization of the total-FETI solver by changing the CPU frequency
      NASA Astrophysics Data System (ADS)
      Horak, David; Riha, Lubomir; Sojka, Radim; Kruzik, Jakub; Beseda, Martin; Cermak, Martin; Schuchart, Joseph
         2017-07-01
         The energy consumption of supercomputers is one of the critical problems for the upcoming Exascale supercomputing era. Awareness of power and energy consumption is required on both the software and the hardware side. This paper deals with the energy consumption evaluation of the Finite Element Tearing and Interconnect (FETI) based solvers of linear systems, which is an established method for solving real-world engineering problems. We have evaluated the effect of the CPU frequency on the energy consumption of the FETI solver using a linear elasticity 3D cube synthetic benchmark. In this problem, we have evaluated the effect of frequency tuning on the energy consumption of the essential processing kernels of the FETI method. The paper provides results for two types of frequency tuning: (1) static tuning and (2) dynamic tuning. For static tuning experiments, the frequency is set before execution and kept constant during the runtime. For dynamic tuning, the frequency is changed during the program execution to adapt the system to the actual needs of the application. The paper shows that static tuning brings up to 12% energy savings when compared to default CPU settings (the highest clock rate). The dynamic tuning improves this further by up to 3%.
      

      
      Compiler-based code generation and autotuning for geometric multigrid on GPU-accelerated supercomputers
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Basu, Protonu; Williams, Samuel; Van Straalen, Brian
         
         GPUs, with their high bandwidths and computational capabilities, are an increasingly popular target for scientific computing. Unfortunately, to date, harnessing the power of the GPU has required use of a GPU-specific programming model like CUDA, OpenCL, or OpenACC. Thus, in order to deliver portability across CPU-based and GPU-accelerated supercomputers, programmers are forced to write and maintain two versions of their applications or frameworks. In this paper, we explore the use of a compiler-based autotuning framework based on CUDA-CHiLL to deliver not only portability, but also performance portability across CPU- and GPU-accelerated platforms for the geometric multigrid linear solvers found in many scientific applications. We also show that with autotuning we can attain near Roofline (a performance bound for a computation and target architecture) performance across the key operations in the miniGMG benchmark for both CPU- and GPU-based architectures as well as for multiple stencil discretizations and smoothers. We show that our technology is readily interoperable with MPI resulting in performance at scale equal to that obtained via hand-optimized MPI+CUDA implementation.
      

      
      The Effect of Multigrid Parameters in a 3D Heat Diffusion Equation
      NASA Astrophysics Data System (ADS)
      Oliveira, F. De; Franco, S. R.; Pinto, M. A. Villela
         2018-02-01
         The aim of this paper is to reduce the necessary CPU time to solve the three-dimensional heat diffusion equation using Dirichlet boundary conditions. The finite difference method (FDM) is used to discretize the differential equations with a second-order accuracy central difference scheme (CDS). The algebraic equations systems are solved using the lexicographical and red-black Gauss-Seidel methods, associated with the geometric multigrid method with a correction scheme (CS) and V-cycle. Comparisons are made between two types of restriction: injection and full weighting. The used prolongation process is the trilinear interpolation. This work is concerned with the study of the influence of the smoothing value (v), number of mesh levels (L) and number of unknowns (N) on the CPU time, as well as the analysis of algorithm complexity.
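         As an illustration of one ingredient named in this abstract, the following is a minimal sketch of a red-black Gauss-Seidel smoother for the steady problem -∇²u = f on a uniform 3D grid with Dirichlet boundary values, using the second-order central difference stencil. It is not the authors' code: the function name, the array layout (boundary values stored in the outermost layer of `u`) and the use of NumPy are assumptions made for illustration, and the multigrid components (restriction, prolongation, V-cycle) are not shown.

```python
import numpy as np


def red_black_gauss_seidel(u, f, h, sweeps=1):
    """Apply red-black Gauss-Seidel sweeps for -laplace(u) = f (hypothetical helper).

    u holds the Dirichlet boundary values in its outermost layer, which is
    never modified. f has the same shape as u, and h is the uniform grid
    spacing. The update is done in place and u is returned.
    """
    i, j, k = np.indices(u.shape)
    interior = np.zeros(u.shape, dtype=bool)
    interior[1:-1, 1:-1, 1:-1] = True
    for _ in range(sweeps):
        # Colour 0 ("red") points are updated first, then colour 1 ("black").
        for color in (0, 1):
            mask = interior & ((i + j + k) % 2 == color)
            # Sum of the six nearest neighbours, recomputed before each
            # half-sweep so the second colour sees the updated first colour.
            nb = np.zeros_like(u)
            nb[1:-1, 1:-1, 1:-1] = (
                u[:-2, 1:-1, 1:-1] + u[2:, 1:-1, 1:-1]
                + u[1:-1, :-2, 1:-1] + u[1:-1, 2:, 1:-1]
                + u[1:-1, 1:-1, :-2] + u[1:-1, 1:-1, 2:]
            )
            u[mask] = (nb[mask] + h * h * f[mask]) / 6.0
    return u
```

         Points of one colour depend only on neighbours of the other colour, so each half-sweep can be updated all at once, which is one reason this ordering is often compared with the lexicographical one.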
      

      
      Analysis and improvements of Adaptive Particle Refinement (APR) through CPU time, accuracy and robustness considerations
      NASA Astrophysics Data System (ADS)
      Chiron, L.; Oger, G.; de Leffe, M.; Le Touzé, D.
         2018-02-01
         While smoothed-particle hydrodynamics (SPH) simulations are usually performed using uniform particle distributions, local particle refinement techniques have been developed to concentrate fine spatial resolutions in identified areas of interest. Although the formalism of this method is relatively easy to implement, its robustness at coarse/fine interfaces can be problematic. Analysis performed in [16] shows that the radius of refined particles should be greater than half the radius of unrefined particles to ensure robustness. In this article, the basics of an Adaptive Particle Refinement (APR) technique, inspired by AMR in mesh-based methods, are presented. This approach ensures robustness with alleviated constraints. Simulations applying the new formalism proposed achieve accuracy comparable to fully refined spatial resolutions, together with robustness, low CPU times and maintained parallel efficiency.
      

      
      Fast polyenergetic forward projection for image formation using OpenCL on a heterogeneous parallel computing platform.
      PubMed
      Zhou, Lili; Clifford Chao, K S; Chang, Jenghwa
         2012-11-01
         Simulated projection images of digital phantoms constructed from CT scans have been widely used for clinical and research applications but their quality and computation speed are not optimal for real-time comparison with the radiography acquired with an x-ray source of different energies. In this paper, the authors performed polyenergetic forward projections using open computing language (OpenCL) in a parallel computing ecosystem consisting of CPU and general purpose graphics processing unit (GPGPU) for fast and realistic image formation. The proposed polyenergetic forward projection uses a lookup table containing the NIST published mass attenuation coefficients (μ∕ρ) for different tissue types and photon energies ranging from 1 keV to 20 MeV. The CT images of interested sites are first segmented into different tissue types based on the CT numbers and converted to a three-dimensional attenuation phantom by linking each voxel to the corresponding tissue type in the lookup table. The x-ray source can be a radioisotope or an x-ray generator with a known spectrum described as weight w(n) for energy bin E(n). The Siddon method is used to compute the x-ray transmission line integral for E(n) and the x-ray fluence is the weighted sum of the exponential of line integral for all energy bins with added Poisson noise. To validate this method, a digital head and neck phantom constructed from the CT scan of a Rando head phantom was segmented into three (air, gray∕white matter, and bone) regions for calculating the polyenergetic projection images for the Mohan 4 MV energy spectrum. To accelerate the calculation, the authors partitioned the workloads using the task parallelism and data parallelism and scheduled them in a parallel computing ecosystem consisting of CPU and GPGPU (NVIDIA Tesla C2050) using OpenCL only. The authors explored the task overlapping strategy and the sequential method for generating the first and subsequent DRRs. 
A dispatcher was designed to drive the high-degree parallelism of the task overlapping strategy. Numerical experiments were conducted to compare the performance of the OpenCL∕GPGPU-based implementation with the CPU-based implementation. The projection images were similar to typical portal images obtained with a 4 or 6 MV x-ray source. For a phantom size of 512 × 512 × 223, the time for calculating the line integrals for a 512 × 512 image panel was 16.2 ms on GPGPU for one energy bin in comparison to 8.83 s on CPU. The total computation time for generating one polyenergetic projection image of 512 × 512 was 0.3 s (141 s for CPU). The relative difference between the projection images obtained with the CPU-based and OpenCL∕GPGPU-based implementations was on the order of 10^-6, and the images were virtually indistinguishable. The task overlapping strategy was 5.84 and 1.16 times faster than the sequential method for the first and the subsequent digitally reconstructed radiographs, respectively. The authors have successfully built digital phantoms using anatomic CT images and NIST μ∕ρ tables for simulating realistic polyenergetic projection images and optimized the processing speed with parallel computing using GPGPU∕OpenCL-based implementation. The computation time was fast (0.3 s per projection image) enough for real-time IGRT (image-guided radiotherapy) applications.
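         The fluence model described in this abstract (a weighted sum over energy bins of the exponential of the transmission line integral, with added Poisson noise) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' OpenCL implementation: the function name is hypothetical, the line integrals are assumed to have been computed already (for example by a Siddon-type ray tracer, which is not shown), the attenuation enters with a negative sign in the exponent as in the Beer-Lambert law, and the `photons` scaling used for the noise is an assumption.

```python
import numpy as np


def polyenergetic_fluence(line_integrals, weights, photons=None, rng=None):
    """Combine per-energy-bin line integrals into one projection image.

    line_integrals: array of shape (n_bins, ...) with the attenuation line
        integral for each energy bin E(n) and each detector pixel.
    weights: array of shape (n_bins,) with the spectrum weights w(n).
    photons: optional incident photon count per pixel (assumed parameter);
        when given, Poisson noise is applied at that count level.
    """
    line_integrals = np.asarray(line_integrals, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Beer-Lambert transmission for every energy bin and pixel.
    transmission = np.exp(-line_integrals)
    # Weighted sum over the energy-bin axis.
    fluence = np.tensordot(weights, transmission, axes=(0, 0))
    if photons is not None:
        rng = np.random.default_rng() if rng is None else rng
        fluence = rng.poisson(fluence * photons) / photons
    return fluence
```

         With two equally weighted bins and line integrals of 0 and ln 2, the noise-free result is 0.5 × 1 + 0.5 × 0.5 = 0.75. For scale, the timings quoted in the abstract correspond to a ratio of roughly 545 for the line integrals of one energy bin (8.83 s versus 16.2 ms) and 470 for a full projection image (141 s versus 0.3 s).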
      

      
      The applicability of turbulence models to aerodynamic and propulsion flowfields at McDonnell-Douglas Aerospace
      NASA Technical Reports Server (NTRS)
      Kral, Linda D.; Ladd, John A.; Mani, Mori
         1995-01-01
         The objective of this viewgraph presentation is to evaluate turbulence models for integrated aircraft components such as the forebody, wing, inlet, diffuser, nozzle, and afterbody. The one-equation models have replaced the algebraic models as the baseline turbulence models. The Spalart-Allmaras one-equation model consistently performs better than the Baldwin-Barth model, particularly in the log-layer and free shear layers. Also, the Spalart-Allmaras model is not grid dependent like the Baldwin-Barth model. No general turbulence model exists for all engineering applications. The Spalart-Allmaras one-equation model and the Chien k-epsilon models are the preferred turbulence models. Although the two-equation models often better predict the flow field, they may take from two to five times the CPU time. Future directions are in further benchmarking the Menter blended k-omega/k-epsilon model and algorithmic improvements to reduce CPU time of the two-equation model.
      

      
      GPU Particle Tracking and MHD Simulations with Greatly Enhanced Computational Speed
      NASA Astrophysics Data System (ADS)
      Ziemba, T.; O'Donnell, D.; Carscadden, J.; Cash, M.; Winglee, R.; Harnett, E.
         2008-12-01
         GPUs are intrinsically highly parallelized systems that provide more than an order of magnitude greater computing speed than CPU-based systems, for less cost than a high-end workstation. Recent advancements in GPU technologies allow for full IEEE float specifications with performance up to several hundred GFLOPs per GPU, and new software architectures have recently become available to ease the transition from graphics-based to scientific applications. This allows for a cheap alternative to standard supercomputing methods and should shorten the time to discovery. 3-D particle tracking and MHD codes have been developed using NVIDIA's CUDA and have demonstrated a speed-up of nearly a factor of 20 over equivalent CPU versions of the codes. Such a speed-up enables new applications to develop, including real-time running of radiation belt simulations and real-time running of global magnetospheric simulations, both of which could provide important space weather prediction tools.
      

      
      PrismTech Data Distribution Service Java API Evaluation
      NASA Technical Reports Server (NTRS)
      Riggs, Cortney
         2008-01-01
         My internship duties with Launch Control Systems required me to start performance testing of PrismTech Limited's implementation of the Object Management Group (OMG) Data Distribution Service (DDS) specification through its Java programming language application programming interface (API). DDS is a networking middleware for Real-Time Data Distribution. The performance testing involves latency, redundant publishers, extended duration, redundant failover, and read performance. Time constraints allowed only for a data throughput test. I have designed the testing applications to perform all performance tests when time allows. Performance evaluation data such as megabits per second and central processing unit (CPU) time consumption were not easily attainable through the Java programming language; they required new methods and classes created in the test applications. Evaluation of this product showed the rate at which data can be sent across the network. Performance rates are better on Linux platforms than on AIX and Sun platforms. Compared to the previous C++ programming language API, the performance evaluation also shows the language differences for the implementation. The Java API of the DDS has a lower throughput performance than the C++ API.
      

      
      Transition-Tempered Metadynamics Is a Promising Tool for Studying the Permeation of Drug-like Molecules through Membranes.
      PubMed
      Sun, Rui; Dama, James F; Tan, Jeffrey S; Rose, John P; Voth, Gregory A
         2016-10-11
         Metadynamics is an important enhanced sampling technique in molecular dynamics simulation to efficiently explore potential energy surfaces. The recently developed transition-tempered metadynamics (TTMetaD) has been proven to converge asymptotically without sacrificing exploration of the collective variable space in the early stages of simulations, unlike other convergent metadynamics (MetaD) methods. We have applied TTMetaD to study the permeation of drug-like molecules through a lipid bilayer to further investigate the usefulness of this method as applied to problems of relevance to medicinal chemistry. First, ethanol permeation through a lipid bilayer was studied to compare TTMetaD with nontempered metadynamics and well-tempered metadynamics. The bias energies computed from various metadynamics simulations were compared to the potential of mean force calculated from umbrella sampling. Though all of the MetaD simulations agree with one another asymptotically, TTMetaD is able to predict the most accurate and reliable estimate of the potential of mean force for permeation in the early stages of the simulations and is robust to the choice of required additional parameters. We also show that using multiple randomly initialized replicas allows convergence analysis and also provides an efficient means to converge the simulations in shorter wall times and, more unexpectedly, in shorter CPU times; splitting the CPU time between multiple replicas appears to lead to less overall error. After validating the method, we studied the permeation of a more complicated drug-like molecule, trimethoprim. Three sets of TTMetaD simulations with different choices of collective variables were carried out, and all converged within feasible simulation time. The minimum free energy paths showed that TTMetaD was able to predict almost identical permeation mechanisms in each case despite significantly different definitions of collective variables.
      

      
      A parallel algorithm for the initial screening of space debris collisions prediction using the SGP4/SDP4 models and GPU acceleration
      NASA Astrophysics Data System (ADS)
      Lin, Mingpei; Xu, Ming; Fu, Xiaoyu
         2017-05-01
         Currently, a tremendous amount of space debris in Earth's orbit imperils operational spacecraft. It is essential to undertake risk assessments of collisions and predict dangerous encounters in space. However, collision predictions for an enormous amount of space debris give rise to large-scale computations. In this paper, a parallel algorithm is established on the Compute Unified Device Architecture (CUDA) platform of NVIDIA Corporation for collision prediction. According to the parallel structure of NVIDIA graphics processors, a block decomposition strategy is adopted in the algorithm. Space debris is divided into batches, and the computation and data transfer operations of adjacent batches overlap. As a consequence, the latency to access shared memory during the entire computing process is significantly reduced, and a higher computing speed is reached. Theoretically, a simulation of collision prediction for space debris of any amount and for any time span can be executed. To verify this algorithm, a simulation example including 1382 pieces of debris, whose operational time scales vary from 1 min to 3 days, is conducted on Tesla C2075 of NVIDIA. The simulation results demonstrate that with the same computational accuracy as that of a CPU, the computing speed of the parallel algorithm on a GPU is 30 times that on a CPU. Based on this algorithm, collision prediction of over 150 Chinese spacecraft for a time span of 3 days can be completed in less than 3 h on a single computer, which meets the timeliness requirement of the initial screening task. Furthermore, the algorithm can be adapted for multiple tasks, including particle filtration, constellation design, and Monte-Carlo simulation of an orbital computation.
      

      
      Real-time digital holographic microscopy using the graphic processing unit.
      PubMed
      Shimobaba, Tomoyoshi; Sato, Yoshikuni; Miura, Junya; Takenouchi, Mai; Ito, Tomoyoshi
         2008-08-04
         Digital holographic microscopy (DHM) is a well-known powerful method allowing both the amplitude and phase of a specimen to be simultaneously observed. In order to obtain a reconstructed image from a hologram, numerous calculations for the Fresnel diffraction are required. The Fresnel diffraction can be accelerated by the FFT (Fast Fourier Transform) algorithm. However, real-time reconstruction from a hologram is difficult even if we use a recent central processing unit (CPU) to calculate the Fresnel diffraction by the FFT algorithm. In this paper, we describe a real-time DHM system using a graphic processing unit (GPU) with many stream processors, which allows use as a highly parallel processor. The computational speed of the Fresnel diffraction using the GPU is faster than that of recent CPUs. The real-time DHM system can obtain reconstructed images from holograms whose size is 512 x 512 grids at 24 frames per second.
      

      
      47 CFR 15.102 - CPU boards and power supplies used in personal computers.
      Code of Federal Regulations, 2013 CFR
      
         2013-10-01
         ... computers. 15.102 Section 15.102 Telecommunication FEDERAL COMMUNICATIONS COMMISSION GENERAL RADIO FREQUENCY DEVICES Unintentional Radiators § 15.102 CPU boards and power supplies used in personal computers. (a... modifications that must be made to a personal computer, peripheral device, CPU board or power supply during...
      

        
       
          

      
      47 CFR 15.102 - CPU boards and power supplies used in personal computers.
      Code of Federal Regulations, 2011 CFR
      
         2011-10-01
         ... computers. 15.102 Section 15.102 Telecommunication FEDERAL COMMUNICATIONS COMMISSION GENERAL RADIO FREQUENCY DEVICES Unintentional Radiators § 15.102 CPU boards and power supplies used in personal computers. (a... modifications that must be made to a personal computer, peripheral device, CPU board or power supply during...
      

      
      47 CFR 15.102 - CPU boards and power supplies used in personal computers.
      Code of Federal Regulations, 2010 CFR
      
         2010-10-01
         ... computers. 15.102 Section 15.102 Telecommunication FEDERAL COMMUNICATIONS COMMISSION GENERAL RADIO FREQUENCY DEVICES Unintentional Radiators § 15.102 CPU boards and power supplies used in personal computers. (a... modifications that must be made to a personal computer, peripheral device, CPU board or power supply during...
      

      
      47 CFR 15.102 - CPU boards and power supplies used in personal computers.
      Code of Federal Regulations, 2014 CFR
      
         2014-10-01
         ... computers. 15.102 Section 15.102 Telecommunication FEDERAL COMMUNICATIONS COMMISSION GENERAL RADIO FREQUENCY DEVICES Unintentional Radiators § 15.102 CPU boards and power supplies used in personal computers. (a... modifications that must be made to a personal computer, peripheral device, CPU board or power supply during...
      

      
      47 CFR 15.102 - CPU boards and power supplies used in personal computers.
      Code of Federal Regulations, 2012 CFR
      
         2012-10-01
         ... computers. 15.102 Section 15.102 Telecommunication FEDERAL COMMUNICATIONS COMMISSION GENERAL RADIO FREQUENCY DEVICES Unintentional Radiators § 15.102 CPU boards and power supplies used in personal computers. (a... modifications that must be made to a personal computer, peripheral device, CPU board or power supply during...
      

      
      Online performance evaluation of RAID 5 using CPU utilization
      NASA Astrophysics Data System (ADS)
      Jin, Hai; Yang, Hua; Zhang, Jiangling
         1998-09-01
         Redundant arrays of independent disks (RAID) technology is an efficient way to solve the bottleneck problem between CPU processing ability and the I/O subsystem. From the system point of view, the most important metric of on-line performance is the utilization of the CPU. This paper first presents a way to calculate the CPU utilization of a system connected with a RAID level 5 subsystem using a statistical average method. From the simulation results of CPU utilization of a system connected with a RAID level 5 subsystem, we can see that using multiple disks as an array to access data in parallel is an efficient way to enhance the on-line performance of a disk storage system. Using high-end disk drives to compose the disk array is the key to enhancing the on-line performance of the system.
      

      
      Performance assessment of a pre-partitioned adaptive chemistry approach in large-eddy simulation of turbulent flames
      NASA Astrophysics Data System (ADS)
      Pepiot, Perrine; Liang, Youwen; Newale, Ashish; Pope, Stephen
         2016-11-01
         A pre-partitioned adaptive chemistry (PPAC) approach recently developed and validated in the simplified framework of a partially-stirred reactor is applied to the simulation of turbulent flames using a LES/particle PDF framework. The PPAC approach was shown to simultaneously provide significant savings in CPU and memory requirements, two major limiting factors in LES/particle PDF. The savings are achieved by providing each particle in the PDF method with a specialized reduced representation and kinetic model adjusted to its changing composition. Both representation and model are identified efficiently from a pre-determined list using a low-dimensional binary-tree search algorithm, thereby keeping the run-time overhead associated with the adaptive strategy to a minimum. The Sandia D flame is used as benchmark to quantify the performance of the PPAC algorithm in a turbulent combustion setting. In particular, the CPU and memory benefits, the distribution of the various representations throughout the computational domain, and the relationship between the user-defined error tolerances used to derive the reduced representations and models and the actual errors observed in LES/PDF are characterized. This material is based upon work supported by the U.S. Department of Energy Office of Science, Office of Basic Energy Sciences under Award Number DE-FG02-90ER14128.
      

      
      I/O-Efficient Scientific Computation Using TPIE
      NASA Technical Reports Server (NTRS)
      Vengroff, Darren Erik; Vitter, Jeffrey Scott
         1996-01-01
         In recent years, input/output (I/O)-efficient algorithms for a wide variety of problems have appeared in the literature. However, systems specifically designed to assist programmers in implementing such algorithms have remained scarce. TPIE is a system designed to support I/O-efficient paradigms for problems from a variety of domains, including computational geometry, graph algorithms, and scientific computation. The TPIE interface frees programmers from having to deal not only with explicit read and write calls, but also the complex memory management that must be performed for I/O-efficient computation. In this paper we discuss applications of TPIE to problems in scientific computation. We discuss algorithmic issues underlying the design and implementation of the relevant components of TPIE and present performance results of programs written to solve a series of benchmark problems using our current TPIE prototype. Some of the benchmarks we present are based on the NAS parallel benchmarks while others are of our own creation. We demonstrate that the central processing unit (CPU) overhead required to manage I/O is small and that even with just a single disk, the I/O overhead of I/O-efficient computation ranges from negligible to the same order of magnitude as CPU time. We conjecture that if we use a number of disks in parallel this overhead can be all but eliminated.
      

      
      Compute-unified device architecture implementation of a block-matching algorithm for multiple graphical processing unit cards
      PubMed Central
      Massanes, Francesc; Cadennes, Marie; Brankov, Jovan G.
         2012-01-01
         In this paper we describe and evaluate a fast implementation of a classical block matching motion estimation algorithm for multiple Graphical Processing Units (GPUs) using the Compute Unified Device Architecture (CUDA) computing engine. The implemented block matching algorithm (BMA) uses summed absolute difference (SAD) error criterion and full grid search (FS) for finding optimal block displacement. In this evaluation we compared the execution time of a GPU and CPU implementation for images of various sizes, using integer and non-integer search grids. The results show that use of a GPU card can shorten computation time by a factor of 200 times for integer and 1000 times for a non-integer search grid. The additional speedup for non-integer search grid comes from the fact that GPU has built-in hardware for image interpolation. Further, when using multiple GPU cards, the presented evaluation shows the importance of the data splitting method across multiple cards, but an almost linear speedup with a number of cards is achievable. In addition we compared execution time of the proposed FS GPU implementation with two existing, highly optimized non-full grid search CPU based motion estimations methods, namely implementation of the Pyramidal Lucas Kanade Optical flow algorithm in OpenCV and Simplified Unsymmetrical multi-Hexagon search in H.264/AVC standard. In these comparisons, FS GPU implementation still showed modest improvement even though the computational complexity of FS GPU implementation is substantially higher than non-FS CPU implementation. We also demonstrated that for an image sequence of 720×480 pixels in resolution, commonly used in video surveillance, the proposed GPU implementation is sufficiently fast for real-time motion estimation at 30 frames-per-second using two NVIDIA C1060 Tesla GPU cards. PMID:22347787
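The abstract above names the two ingredients of the block-matching algorithm: a summed absolute difference (SAD) error criterion and a full grid search (FS) over candidate displacements. The following is a minimal single-threaded sketch of that idea for integer displacements only, written in plain Python. It is not the authors' CUDA implementation; the function names, the nested-list frame representation, and the convention that the displacement points from the block in the current frame to its match in the reference frame are all assumptions made for illustration.

```python
def sad(ref, cur, top, left, dy, dx, size):
    """Summed absolute difference between a block of `cur` at (top, left)
    and the block of `ref` displaced by (dy, dx)."""
    total = 0
    for r in range(size):
        for c in range(size):
            total += abs(cur[top + r][left + c] - ref[top + dy + r][left + dx + c])
    return total


def full_search(ref, cur, top, left, size, radius):
    """Test every integer displacement within +/- radius and return
    (dy, dx, best_sad) for the candidate with the lowest SAD."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # Skip candidates whose block would fall outside the reference frame.
            if not (0 <= top + dy and top + dy + size <= height):
                continue
            if not (0 <= left + dx and left + dx + size <= width):
                continue
            cost = sad(ref, cur, top, left, dy, dx, size)
            if best is None or cost < best[2]:
                best = (dy, dx, cost)
    return best


# Toy 8 x 8 frames: `cur` is `ref` shifted by one row and two columns
# (indices are clamped at the border, which the test block never touches).
ref = [[r * 8 + c for c in range(8)] for r in range(8)]
cur = [[ref[min(r + 1, 7)][min(c + 2, 7)] for c in range(8)] for r in range(8)]

print(full_search(ref, cur, top=2, left=2, size=3, radius=2))  # (1, 2, 0)
```

Every candidate displacement is scored independently of the others, which is why the full search maps naturally onto many parallel GPU threads.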
      

      
      Compute-unified device architecture implementation of a block-matching algorithm for multiple graphical processing unit cards.
      PubMed
      Massanes, Francesc; Cadennes, Marie; Brankov, Jovan G
         2011-07-01
         In this paper we describe and evaluate a fast implementation of a classical block matching motion estimation algorithm for multiple Graphical Processing Units (GPUs) using the Compute Unified Device Architecture (CUDA) computing engine. The implemented block matching algorithm (BMA) uses summed absolute difference (SAD) error criterion and full grid search (FS) for finding optimal block displacement. In this evaluation we compared the execution time of a GPU and CPU implementation for images of various sizes, using integer and non-integer search grids. The results show that use of a GPU card can shorten computation time by a factor of 200 times for integer and 1000 times for a non-integer search grid. The additional speedup for non-integer search grid comes from the fact that GPU has built-in hardware for image interpolation. Further, when using multiple GPU cards, the presented evaluation shows the importance of the data splitting method across multiple cards, but an almost linear speedup with a number of cards is achievable. In addition we compared execution time of the proposed FS GPU implementation with two existing, highly optimized non-full grid search CPU based motion estimations methods, namely implementation of the Pyramidal Lucas Kanade Optical flow algorithm in OpenCV and Simplified Unsymmetrical multi-Hexagon search in H.264/AVC standard. In these comparisons, FS GPU implementation still showed modest improvement even though the computational complexity of FS GPU implementation is substantially higher than non-FS CPU implementation. We also demonstrated that for an image sequence of 720×480 pixels in resolution, commonly used in video surveillance, the proposed GPU implementation is sufficiently fast for real-time motion estimation at 30 frames-per-second using two NVIDIA C1060 Tesla GPU cards.
      

      
      GPU accelerated Monte-Carlo simulation of SEM images for metrology
      NASA Astrophysics Data System (ADS)
      Verduin, T.; Lokhorst, S. R.; Hagen, C. W.
         2016-03-01
         In this work we address the computation times of numerical studies in dimensional metrology. In particular, full Monte-Carlo simulation programs for scanning electron microscopy (SEM) image acquisition are known to be notoriously slow. Our quest in reducing the computation time of SEM image simulation has led us to investigate the use of graphics processing units (GPUs) for metrology. We have succeeded in creating a full Monte-Carlo simulation program for SEM images, which runs entirely on a GPU. The physical scattering models of this GPU simulator are identical to a previous CPU-based simulator, which includes the dielectric function model for inelastic scattering and also refinements for low-voltage SEM applications. As a case study for the performance, we considered the simulated exposure of a complex feature: an isolated silicon line with rough sidewalls located on a flat silicon substrate. The surface of the rough feature is decomposed into 408 012 triangles. We have used an exposure dose of 6 mC/cm², which corresponds to 6 553 600 primary electrons on average (Poisson distributed). We repeat the simulation for various primary electron energies, 300 eV, 500 eV, 800 eV, 1 keV, 3 keV and 5 keV. At first we run the simulation on a GeForce GTX480 from NVIDIA. The very same simulation is duplicated on our CPU-based program, for which we have used an Intel Xeon X5650. Apart from statistics in the simulation, no difference is found between the CPU and GPU simulated results. The GTX480 generates the images (depending on the primary electron energy) 350 to 425 times faster than a single threaded Intel X5650 CPU. Although this is a tremendous speedup, we actually have not reached the maximum throughput because of the limited amount of available memory on the GTX480. Nevertheless, the speedup enables the fast acquisition of simulated SEM images for metrology. We now have the potential to investigate case studies in CD-SEM metrology, which otherwise would take unreasonable amounts of computation time.
      

      
      Deployment of 464XLAT (RFC6877) alongside IPv6-only CPU resources at WLCG sites
      NASA Astrophysics Data System (ADS)
      Froy, T. S.; Traynor, D. P.; Walker, C. J.
         2017-10-01
         IPv4 is now officially deprecated by the IETF. A significant amount of effort has already been expended by the HEPiX IPv6 Working Group on testing dual-stacked hosts and IPv6-only CPU resources. Dual-stack adds complexity and administrative overhead to sites that may already be starved of resource. This has resulted in a very slow uptake of IPv6 from WLCG sites. 464XLAT (RFC6877) is intended for IPv6 single-stack environments that require the ability to communicate with IPv4-only endpoints. This paper will present a deployment strategy for 464XLAT, operational experiences of using 464XLAT in production at a WLCG site and important information to consider prior to deploying 464XLAT.
      

      
      Signal-to-noise ratio of Singer product apertures
      NASA Astrophysics Data System (ADS)
      Shutler, Paul M. E.; Byard, Kevin
         2017-09-01
         Formulae for the signal-to-noise ratio (SNR) of Singer product apertures are derived, allowing optimal Singer product apertures to be identified, and the CPU time required to decode them is quantified. This allows a systematic comparison to be made of the performance of Singer product apertures against both conventionally wrapped Singer apertures, and also conventional product apertures such as square uniformly redundant arrays. For very large images, equivalently for images at very high resolution, the SNR of Singer product apertures is asymptotically as good as the best conventional apertures, but Singer product apertures decode faster than any conventional aperture by at least a factor of ten for image sizes up to several megapixels. These theoretical predictions are verified using numerical simulations, demonstrating that coded aperture video is for the first time a realistic possibility.
      

      
      The GPU implementation of micro-Doppler period estimation
      NASA Astrophysics Data System (ADS)
      Yang, Liyuan; Wang, Junling; Bi, Ran
         2018-03-01
         To address the high computational complexity and poor real-time performance of wideband radar echo signal processing, this paper designs a program based on the CPU-GPU heterogeneous parallel structure to improve the real-time extraction of micro-motion features. First, we discuss the principle of the micro-Doppler effect generated by the rolling of the scattering points on an orbiting satellite, and analyse how to use a Kalman filter to compensate for the translational motion of the tumbling satellite and how to use joint time-frequency analysis and the inverse Radon transform to extract the micro-motion features from the compensated echo. Second, the advantages of the GPU in terms of real-time processing and the working principle of CPU-GPU heterogeneous parallelism are analysed, and a GPU-based program flow to extract the micro-motion features from the radar echo signal of a rolling satellite is designed. At the end of the article, the extraction results are given to verify the correctness of the program and algorithm.
      

      
      A Fast Code for Jupiter Atmospheric Entry
      NASA Technical Reports Server (NTRS)
      Tauber, Michael E.; Wercinski, Paul; Yang, Lily; Chen, Yih-Kanq; Arnold, James (Technical Monitor)
         1998-01-01
         A fast code was developed to calculate the forebody heating environment and heat shielding that is required for Jupiter atmospheric entry probes. A carbon phenolic heat shield material was assumed and, since computational efficiency was a major goal, analytic expressions were used, primarily, to calculate the heating, ablation and the required insulation. The code was verified by comparison with flight measurements from the Galileo probe's entry; the calculation required 3.5 sec of CPU time on a work station. The computed surface recessions from ablation were compared with the flight values at six body stations. The average, absolute, predicted difference in the recession was 12.5% too high. The forebody's mass loss was overpredicted by 5.5% and the heat shield mass was calculated to be 15% less than the probe's actual heat shield. However, the calculated heat shield mass did not include contingencies for the various uncertainties that must be considered in the design of probes. Therefore, the agreement with the Galileo probe's values was considered satisfactory, especially in view of the code's fast running time and the methods' approximations.
      

      
      Thirty Meter Telescope narrow-field infrared adaptive optics system real-time controller prototyping results
      NASA Astrophysics Data System (ADS)
      Smith, Malcolm; Kerley, Dan; Chapin, Edward L.; Dunn, Jennifer; Herriot, Glen; Véran, Jean-Pierre; Boyer, Corinne; Ellerbroek, Brent; Gilles, Luc; Wang, Lianqi
         2016-07-01
         Prototyping and benchmarking were performed for the Real-Time Controller (RTC) of the Narrow Field InfraRed Adaptive Optics System (NFIRAOS). To perform wavefront correction, NFIRAOS utilizes two deformable mirrors (DM) and one tip/tilt stage (TTS). The RTC receives wavefront information from six Laser Guide Star (LGS) Shack-Hartmann WaveFront Sensors (WFS), one high-order Natural Guide Star Pyramid WaveFront Sensor (PWFS) and multiple low-order instrument detectors. The RTC uses this information to determine the commands to send to the wavefront correctors. NFIRAOS is the first light AO system for the Thirty Meter Telescope (TMT). The prototyping was performed using dual-socket high performance Linux servers with the real-time (PREEMPT_RT) patch and demonstrated the viability of a commercial off-the-shelf (COTS) hardware approach to large scale AO reconstruction. In particular, a large custom matrix vector multiplication (MVM) was benchmarked and met the latency requirements. In addition, all major inter-machine communication was verified to be adequate using 10Gb and 40Gb Ethernet. The results of this prototyping have enabled a CPU-based NFIRAOS RTC design to proceed with confidence that COTS hardware can be used to meet the demanding performance requirements.
      

      
      Numerical solutions of the complete Navier-Stokes equations
      NASA Technical Reports Server (NTRS)
      Hassan, H. A.
         1993-01-01
         The objective of this study is to compare the use of assumed pdf (probability density function) approaches for modeling supersonic turbulent reacting flowfields with the more elaborate approach where the pdf evolution equation is solved. Assumed pdf approaches for averaging the chemical source terms require modest increases in CPU time, typically of the order of 20 percent above treating the source terms as 'laminar.' However, it is difficult to assume a form for these pdf's a priori that correctly mimics the behavior of the actual pdf governing the flow. Solving the evolution equation for the pdf is a theoretically sound approach, but because of the large dimensionality of this function, its solution requires a Monte Carlo method which is computationally expensive and slow to converge. Preliminary results show both pdf approaches to yield similar solutions for the mean flow variables.
      

      
      Power and Performance Trade-offs for Space Time Adaptive Processing
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Gawande, Nitin A.; Manzano Franco, Joseph B.; Tumeo, Antonino
         
         Computational efficiency – performance relative to power or energy – is one of the most important concerns when designing RADAR processing systems. This paper analyzes power and performance trade-offs for a typical Space Time Adaptive Processing (STAP) application. We study STAP implementations for CUDA and OpenMP on two computationally efficient architectures, Intel Haswell Core I7-4770TE and NVIDIA Kayla with a GK208 GPU. We analyze the power and performance of STAP’s computationally intensive kernels across the two hardware testbeds. We also show the impact and trade-offs of GPU optimization techniques. We show that data parallelism can be exploited for efficient implementation on the Haswell CPU architecture. The GPU architecture is able to process large size data sets without increase in power requirement. The use of shared memory has a significant impact on the power requirement for the GPU. A balance between the use of shared memory and main memory access leads to an improved performance in a typical STAP application.
      

      
      influx_s: increasing numerical stability and precision for metabolic flux analysis in isotope labelling experiments.
      PubMed
      Sokol, Serguei; Millard, Pierre; Portais, Jean-Charles
         2012-03-01
         The problem of stationary metabolic flux analysis based on isotope labelling experiments first appeared in the early 1950s and was basically solved in the early 2000s. Several algorithms and software packages are available for this problem. However, the generic stochastic algorithms (simulated annealing or evolution algorithms) currently used in these software packages require a lot of time to achieve acceptable precision. For deterministic algorithms, a common drawback is the lack of convergence stability for ill-conditioned systems or when started from a random point. In this article, we present a new deterministic algorithm with significantly increased numerical stability and accuracy of flux estimation compared with commonly used algorithms. It requires relatively short CPU time (from several seconds to several minutes with a standard PC architecture) to estimate fluxes in the central carbon metabolism network of Escherichia coli. The software package influx_s implementing this algorithm is distributed under an OpenSource licence at http://metasys.insa-toulouse.fr/software/influx/. Supplementary data are available at Bioinformatics online.
      

      
      Use of a graphics processing unit (GPU) to facilitate real-time 3D graphic presentation of the patient skin-dose distribution during fluoroscopic interventional procedures
      PubMed Central
      Rana, Vijay; Rudin, Stephen; Bednarek, Daniel R.
         2012-01-01
         We have developed a dose-tracking system (DTS) that calculates the radiation dose to the patient’s skin in real-time by acquiring exposure parameters and imaging-system-geometry from the digital bus on a Toshiba Infinix C-arm unit. The cumulative dose values are then displayed as a color map on an OpenGL-based 3D graphic of the patient for immediate feedback to the interventionalist. Determination of those elements on the surface of the patient 3D-graphic that intersect the beam and calculation of the dose for these elements in real time demands fast computation. Reducing the size of the elements results in more computation load on the computer processor and therefore a tradeoff occurs between the resolution of the patient graphic and the real-time performance of the DTS. The speed of the DTS for calculating dose to the skin is limited by the central processing unit (CPU) and can be improved by using the parallel processing power of a graphics processing unit (GPU). Here, we compare the performance speed of GPU-based DTS software to that of the current CPU-based software as a function of the resolution of the patient graphics. Results show a tremendous improvement in speed using the GPU. While an increase in the spatial resolution of the patient graphics resulted in slowing down the computational speed of the DTS on the CPU, the speed of the GPU-based DTS was hardly affected. This GPU-based DTS can be a powerful tool for providing accurate, real-time feedback about patient skin-dose to physicians while performing interventional procedures. PMID:24027616
      

      
      Real-time unmanned aircraft systems surveillance video mosaicking using GPU
      NASA Astrophysics Data System (ADS)
      Camargo, Aldo; Anderson, Kyle; Wang, Yi; Schultz, Richard R.; Fevig, Ronald A.
         2010-04-01
         Digital video mosaicking from Unmanned Aircraft Systems (UAS) is being used for many military and civilian applications, including surveillance, target recognition, border protection, forest fire monitoring, traffic control on highways, and monitoring of transmission lines. Additionally, NASA is using digital video mosaicking to explore the moon and planets such as Mars. In order to compute a "good" mosaic from video captured by a UAS, the algorithm must deal with motion blur, frame-to-frame jitter associated with an imperfectly stabilized platform, perspective changes as the camera tilts in flight, as well as a number of other factors. The most suitable algorithms use SIFT (Scale-Invariant Feature Transform) to detect the features consistent between video frames. Utilizing these features, the next step is to estimate the homography between two consecutive video frames, perform warping to properly register the image data, and finally blend the video frames, resulting in a seamless video mosaic. All this processing takes a great deal of CPU resources, so it is almost impossible to compute a real-time video mosaic on a single processor. Modern graphics processing units (GPUs) offer computational performance that far exceeds current CPU technology, allowing for real-time operation. This paper presents the development of a GPU-accelerated digital video mosaicking implementation and compares it with CPU performance. Our tests are based on two sets of real video captured by a small UAS aircraft; one video comes from Infrared (IR) and Electro-Optical (EO) cameras. Our results show that we can obtain a speed-up of more than 50 times using GPU technology, so real-time operation at a video capture rate of 30 frames per second is feasible.
      

        
       
          


          
        

           
           
             
               
      
      Use of a graphics processing unit (GPU) to facilitate real-time 3D graphic presentation of the patient skin-dose distribution during fluoroscopic interventional procedures.
      PubMed
      Rana, Vijay; Rudin, Stephen; Bednarek, Daniel R
         2012-02-23
         We have developed a dose-tracking system (DTS) that calculates the radiation dose to the patient's skin in real-time by acquiring exposure parameters and imaging-system-geometry from the digital bus on a Toshiba Infinix C-arm unit. The cumulative dose values are then displayed as a color map on an OpenGL-based 3D graphic of the patient for immediate feedback to the interventionalist. Determination of those elements on the surface of the patient 3D-graphic that intersect the beam and calculation of the dose for these elements in real time demands fast computation. Reducing the size of the elements results in more computation load on the computer processor and therefore a tradeoff occurs between the resolution of the patient graphic and the real-time performance of the DTS. The speed of the DTS for calculating dose to the skin is limited by the central processing unit (CPU) and can be improved by using the parallel processing power of a graphics processing unit (GPU). Here, we compare the performance speed of GPU-based DTS software to that of the current CPU-based software as a function of the resolution of the patient graphics. Results show a tremendous improvement in speed using the GPU. While an increase in the spatial resolution of the patient graphics resulted in slowing down the computational speed of the DTS on the CPU, the speed of the GPU-based DTS was hardly affected. This GPU-based DTS can be a powerful tool for providing accurate, real-time feedback about patient skin-dose to physicians while performing interventional procedures.
      

      
      Fast-GPU-PCC: A GPU-Based Technique to Compute Pairwise Pearson's Correlation Coefficients for Time Series Data-fMRI Study.
      PubMed
      Eslami, Taban; Saeed, Fahad
         2018-04-20
         Functional magnetic resonance imaging (fMRI) is a non-invasive brain imaging technique, which has been regularly used for studying the brain's functional activities in the past few years. A widely used measure for capturing functional associations in the brain is Pearson's correlation coefficient. Pearson's correlation is widely used for constructing functional networks and studying dynamic functional connectivity of the brain. These are useful measures for understanding the effects of brain disorders on connectivity among brain regions. fMRI scanners produce a huge number of voxels, and using traditional central processing unit (CPU)-based techniques for computing pairwise correlations is very time consuming, especially when a large number of subjects are being studied. In this paper, we propose a graphics processing unit (GPU)-based algorithm called Fast-GPU-PCC for computing pairwise Pearson's correlation coefficients. Based on the symmetric property of Pearson's correlation, this approach returns the N(N − 1)/2 correlation coefficients located in the strictly upper triangular part of the correlation matrix. Storing correlations in a one-dimensional array with the order proposed in this paper is useful for further usage. Our experiments on real and synthetic fMRI data for different numbers of voxels and varying lengths of time series show that the proposed approach outperformed state-of-the-art GPU-based techniques as well as the sequential CPU-based versions. We show that Fast-GPU-PCC runs 62 times faster than the CPU-based version and about 2 to 3 times faster than two other state-of-the-art GPU-based methods.
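
         The storage idea in this abstract (keep only the N(N − 1)/2 strictly upper triangular correlations in a one-dimensional array) can be sketched on the CPU as follows. This is a minimal illustration, not the Fast-GPU-PCC implementation: the function names are hypothetical, and the row-by-row ordering is an assumption, since the abstract does not state which ordering the paper uses.

```python
import math

def pairwise_pearson_upper(series):
    """Return the N(N-1)/2 Pearson correlations of the strict upper triangle,
    flattened row by row: (0,1), (0,2), ..., (0,N-1), (1,2), ...
    Assumes every series has the same length and non-zero variance."""
    # Centre each series and scale it to unit length, so that the
    # correlation of two series reduces to a plain dot product.
    normalised = []
    for s in series:
        mean = sum(s) / len(s)
        centred = [x - mean for x in s]
        norm = math.sqrt(sum(c * c for c in centred))
        normalised.append([c / norm for c in centred])
    n = len(normalised)
    out = []
    for i in range(n):
        for j in range(i + 1, n):
            out.append(sum(a * b for a, b in zip(normalised[i], normalised[j])))
    return out

def flat_index(i, j, n):
    """Position of pair (i, j), with i < j, in the flattened upper triangle."""
    return i * n - i * (i + 1) // 2 + (j - i - 1)
```

         With this layout, the correlation between series i and j (i < j) is found at `flat_index(i, j, n)` without ever materialising the full N × N matrix.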
      

      
      An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU
      NASA Astrophysics Data System (ADS)
      Lyakh, Dmitry I.
         2015-04-01
         An efficient parallel tensor transpose algorithm is suggested for shared-memory computing units, namely, multicore CPU, Intel Xeon Phi, and NVidia GPU. The algorithm operates on dense tensors (multidimensional arrays) and is based on the optimization of cache utilization on x86 CPU and the use of shared memory on NVidia GPU. From the applied side, the ultimate goal is to minimize the overhead encountered in the transformation of tensor contractions into matrix multiplications in computer implementations of advanced methods of quantum many-body theory (e.g., in electronic structure theory and nuclear physics). Particular emphasis is placed on higher-dimensional tensors that typically appear in the so-called multireference correlated methods of electronic structure theory. Depending on tensor dimensionality, the presented optimized algorithms can achieve an order of magnitude speedup on x86 CPUs and a 2-3 times speedup on NVidia Tesla K20X GPU with respect to the naïve scattering algorithm (no memory access optimization). The tensor transpose routines developed in this work have been incorporated into a general-purpose tensor algebra library (TAL-SH).
      

      
      Short-term dopaminergic regulation of GABA release in dopamine deafferented caudate-putamen is not directly associated with glutamic acid decarboxylase gene expression.
      PubMed
      O'Connor, W T; Lindefors, N; Brené, S; Herrera-Marschitz, M; Persson, H; Ungerstedt, U
         1991-07-08
         In vivo microdialysis and in situ hybridization were combined to study dopaminergic regulation of gamma-amino butyric acid (GABA) neurons in rat caudate-putamen (CPu). Potassium-stimulated GABA release in CPu was elevated following a dopamine deafferentation. Local perfusion with exogenous dopamine (50 microM) for 3 h via the microdialysis probe attenuated the potassium-stimulated increase in extracellular GABA in CPu. Expression of glutamic acid decarboxylase (GAD) mRNA was also increased in the dopamine deafferented CPu. However, local perfusion with dopamine had no significant attenuating effect on the increased GAD mRNA expression. These findings indicate that dopaminergic regulation of GABA neurons in the dopamine deafferented CPu includes both a short-term effect at the level of GABA release independent of changes in GAD mRNA expression and a long-term modulation at the level of GAD gene expression.
      

      
      Combined AIE/EBE/GMRES approach to incompressible flows. [Adaptive Implicit-Explicit/Grouped Element-by-Element/Generalized Minimum Residuals]
      NASA Technical Reports Server (NTRS)
      Liou, J.; Tezduyar, T. E.
         1990-01-01
         Adaptive implicit-explicit (AIE), grouped element-by-element (GEBE), and generalized minimum residuals (GMRES) solution techniques for incompressible flows are combined. In this approach, the GEBE and GMRES iteration methods are employed to solve the equation systems resulting from the implicitly treated elements, and therefore no direct solution effort is involved. The benchmarking results demonstrate that this approach can substantially reduce the CPU time and memory requirements in large-scale flow problems. Although the description of the concepts and the numerical demonstration are based on incompressible flows, the approach presented here is applicable to a larger class of problems in computational mechanics.
      

      
      Moment method analysis of linearly tapered slot antennas: Low loss components for switched beam radiometers
      NASA Technical Reports Server (NTRS)
      Koeksal, Adnan; Trew, Robert J.; Kauffman, J. Frank
         1992-01-01
         A Moment Method Model for the radiation pattern characterization of single Linearly Tapered Slot Antennas (LTSA) in air or on a dielectric substrate is developed. This characterization consists of: (1) finding the radiated far-fields of the antenna; (2) determining the E-Plane and H-Plane beamwidths and sidelobe levels; and (3) determining the D-Plane beamwidth and cross polarization levels, as the antenna parameters (length, height, taper angle, substrate thickness, and relative substrate permittivity) vary. The LTSA geometry does not lend itself to analytical solution within the given parameter ranges. Therefore, a computer modeling scheme and a code are necessary to analyze the problem. This necessity imposes some further objectives or requirements on the solution method (modeling) and tool (computer code). These may be listed as follows: (1) a good approximation to the real antenna geometry; and (2) feasible computer storage and time requirements. According to these requirements, the work is concentrated on the development of efficient modeling schemes for these types of problems and on reducing the central processing unit (CPU) time required by the computer code. A Method of Moments (MoM) code is developed for the analysis of LTSAs within the parameter ranges given.
      

      
      High Performance Computing of Meshless Time Domain Method on Multi-GPU Cluster
      NASA Astrophysics Data System (ADS)
      Ikuno, Soichiro; Nakata, Susumu; Hirokawa, Yuta; Itoh, Taku
         2015-01-01
         High performance computing of the Meshless Time Domain Method (MTDM) on multiple GPUs using the supercomputer HA-PACS (Highly Accelerated Parallel Advanced system for Computational Sciences) at the University of Tsukuba is investigated. Generally, the finite difference time domain (FDTD) method is adopted for the numerical simulation of electromagnetic wave propagation phenomena. However, the numerical domain must be divided into rectangular meshes, and it is difficult to apply the method to problems in a complex domain. On the other hand, MTDM can easily be adapted to such problems because it does not require meshes. In the present study, we implement MTDM on a multi-GPU cluster to speed up the method, and numerically investigate its performance on the cluster. To reduce the computation time, the communication time between the decomposed domains is hidden behind the perfectly matched layer (PML) calculation procedure. The computational results show that MTDM on 128 GPUs is 173 times faster than a single-CPU calculation.
      

      
      The DISTO data acquisition system at SATURNE
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Balestra, F.; Bedfer, Y.; Bertini, R.
         1998-06-01
         The DISTO collaboration has built a large-acceptance magnetic spectrometer designed to provide broad kinematic coverage of multiparticle final states produced in pp scattering. The spectrometer has been installed in the polarized proton beam of the Saturne accelerator in Saclay to study polarization observables in the $\vec{p}p \to pK^{+}\vec{Y}$ ($Y = \Lambda$, $\Sigma^{0}$ or $Y^{*}$) reaction and vector meson production ($\psi$, $\omega$ and $\rho$) in pp collisions. The data acquisition system is based on a VME 68030 CPU running the OS/9 operating system, housed in a single VME crate together with the CAMAC interface, the triple-port ECL memories, and four RISC R3000 CPUs. The digitization of signals from the detectors is made by PCOS III and FERA front-end electronics. Data of several events belonging to a single Saturne extraction are stored in VME triple-port ECL memories using a hardwired fast sequencer. The buffer, optionally filtered by the RISC R3000 CPUs, is recorded on a DLT cassette by the DAQ CPU using the on-board SCSI interface during the acceleration cycle. Two UNIX workstations are connected to the VME CPUs through a fast parallel bus and the Local Area Network. They analyze a subset of events for on-line monitoring. The data acquisition system is able to read and record 3,500 events per burst in the present configuration with a dead time of 15%.
      

      
      Efficient and portable acceleration of quantum chemical many-body methods in mixed floating point precision using OpenACC compiler directives
      NASA Astrophysics Data System (ADS)
      Eriksen, Janus J.
         2017-09-01
         It is demonstrated how the non-proprietary OpenACC standard of compiler directives may be used to compactly and efficiently accelerate the rate-determining steps of two of the most routinely applied many-body methods of electronic structure theory, namely the second-order Møller-Plesset (MP2) model in its resolution-of-the-identity approximated form and the (T) triples correction to the coupled cluster singles and doubles model (CCSD(T)). By means of compute directives as well as the use of optimised device math libraries, the operations involved in the energy kernels have been ported to graphics processing unit (GPU) accelerators, and the associated data transfers correspondingly optimised to such a degree that the final implementations (using either double and/or single precision arithmetics) are capable of scaling to as large systems as allowed for by the capacity of the host central processing unit (CPU) main memory. The performance of the hybrid CPU/GPU implementations is assessed through calculations on test systems of alanine amino acid chains using one-electron basis sets of increasing size (ranging from double- to pentuple-ζ quality). For all but the smallest problem sizes of the present study, the optimised accelerated codes (using a single multi-core CPU host node in conjunction with six GPUs) are found to be capable of reducing the total time-to-solution by at least an order of magnitude over optimised, OpenMP-threaded CPU-only reference implementations.
      

      
      Data Parallel Bin-Based Indexing for Answering Queries on Multi-Core Architectures
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Gosink, Luke; Wu, Kesheng; Bethel, E. Wes
         2009-06-02
         The multi-core trend in CPUs and general purpose graphics processing units (GPUs) offers new opportunities for the database community. The increase of cores at exponential rates is likely to affect virtually every server and client in the coming decade, and presents database management systems with a huge, compelling disruption that will radically change how processing is done. This paper presents a new parallel indexing data structure for answering queries that takes full advantage of the increasing thread-level parallelism emerging in multi-core architectures. In our approach, our Data Parallel Bin-based Index Strategy (DP-BIS) first bins the base data, and then partitions and stores the values in each bin as a separate, bin-based data cluster. In answering a query, the procedures for examining the bin numbers and the bin-based data clusters offer the maximum possible level of concurrency; each record is evaluated by a single thread and all threads are processed simultaneously in parallel. We implement and demonstrate the effectiveness of DP-BIS on two multi-core architectures: a multi-core CPU and a GPU. The concurrency afforded by DP-BIS allows us to fully utilize the thread-level parallelism provided by each architecture--for example, our GPU-based DP-BIS implementation simultaneously evaluates over 12,000 records with an equivalent number of concurrently executing threads. In comparing DP-BIS's performance across these architectures, we show that the GPU-based DP-BIS implementation requires significantly less computation time to answer a query than the CPU-based implementation. We also demonstrate in our analysis that DP-BIS provides better overall performance than the commonly utilized CPU and GPU-based projection index. Finally, due to data encoding, we show that DP-BIS accesses significantly smaller amounts of data than index strategies that operate solely on a column's base data; this smaller data footprint is critical for parallel processors that possess limited memory resources (e.g., GPUs).
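
         The two-stage idea described in this abstract (bin the base data, store each bin's values as a separate cluster, then answer a query by examining bin numbers first and bin contents only where needed) can be sketched sequentially as below. This is only an illustration under stated assumptions: the function names, the half-open bin convention, and the range-query form are hypothetical, and the data-parallel GPU execution that DP-BIS is actually about is not reproduced here.

```python
from bisect import bisect_right

def build_binned_index(values, edges):
    """Group (record id, value) pairs by bin; bin b covers [edges[b], edges[b+1]).
    Assumes edges is sorted and every value lies in [edges[0], edges[-1])."""
    clusters = [[] for _ in range(len(edges) - 1)]
    for rid, v in enumerate(values):
        if not edges[0] <= v < edges[-1]:
            raise ValueError(f"value {v!r} is outside the binned range")
        b = bisect_right(edges, v) - 1
        clusters[b].append((rid, v))
    return clusters

def range_query(clusters, edges, lo, hi):
    """Return the sorted record ids whose value satisfies lo <= value < hi."""
    hits = []
    for b, cluster in enumerate(clusters):
        b_lo, b_hi = edges[b], edges[b + 1]
        if b_hi <= lo or b_lo >= hi:
            continue  # bin lies entirely outside the query range
        if lo <= b_lo and b_hi <= hi:
            # Bin fully covered: accept every record without reading values.
            hits.extend(rid for rid, _ in cluster)
        else:
            # Boundary bin: only here must the stored values be examined.
            hits.extend(rid for rid, v in cluster if lo <= v < hi)
    return sorted(hits)
```

         In this sketch each record in a boundary bin is tested independently of every other record, which is the property that makes a one-thread-per-record mapping natural on a multi-core CPU or GPU.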
      

      
      SpaceCube 2.0: An Advanced Hybrid Onboard Data Processor
      NASA Technical Reports Server (NTRS)
      Lin, Michael; Flatley, Thomas; Godfrey, John; Geist, Alessandro; Espinosa, Daniel; Petrick, David
         2011-01-01
         The SpaceCube 2.0 is a compact, high performance, low-power onboard processing system that takes advantage of cutting-edge hybrid (CPU/FPGA/DSP) processing elements. The SpaceCube 2.0 design concept includes two commercial Virtex-5 field-programmable gate array (FPGA) parts protected by "radiation hardened by software" technology, and possesses exceptional size, weight, and power characteristics [5x5x7 in., 3.5 lb (approximately 12.7 x 12.7 x 17.8 cm, 1.6 kg), 5-25 W, depending on the application's required clock rate]. The two Virtex-5 FPGA parts are implemented in a unique back-to-back configuration to maximize data transfer and computing performance. Draft computing power specifications for the SpaceCube 2.0 unit include four PowerPC 440s (1100 DMIPS each), 500+ DSP48Es (2x580 GMACS), 100+ LVDS high-speed serial I/Os (1.25 Gbps each), and 2x190 GFLOPS single-precision (65 GFLOPS double-precision) floating point performance. The SpaceCube 2.0 includes PROM memory for CPU boot, health and safety, and basic command and telemetry functionality; RAM memory for program execution; and FLASH/EEPROM memory to store algorithms and application code for the CPU, FPGA, and DSP processing elements. Program execution can be reconfigured in real time and algorithms can be updated, modified, and/or replaced at any point during the mission. Gigabit Ethernet, Spacewire, SATA and high-speed LVDS serial/parallel I/O channels are available for instrument/sensor data ingest, and mission-unique instrument interfaces can be accommodated using a compact PCI (cPCI) expansion card interface. The SpaceCube 2.0 can be utilized in NASA Earth Science, Helio/Astrophysics and Exploration missions, and Department of Defense satellites for onboard data processing. It can also be used in commercial communication and mapping satellites.
      

      
      A new nonlinear conjugate gradient coefficient under strong Wolfe-Powell line search
      NASA Astrophysics Data System (ADS)
      Mohamed, Nur Syarafina; Mamat, Mustafa; Rivaie, Mohd
         2017-08-01
         The nonlinear conjugate gradient (CG) method plays an important role in solving large-scale unconstrained optimization problems. The method is widely used due to its simplicity and is known to possess the sufficient descent condition and global convergence properties. In this paper, a new nonlinear CG coefficient βk is presented by employing the strong Wolfe-Powell inexact line search. The performance of the new βk is tested based on the number of iterations and central processing unit (CPU) time by using MATLAB software with an Intel Core i7-3470 CPU processor. Numerical experimental results show that the new βk converges rapidly compared with other classical CG methods.
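
         For orientation, the standard textbook form of the quantities named in this abstract is sketched below. The notation ($f$ for the objective, $g_k = \nabla f(x_k)$, $d_k$ for the search direction, $\alpha_k$ for the step length, and constants $\delta$, $\sigma$, $c$) is an assumption, not taken from the paper, and the paper's new formula for $\beta_k$ is not given in the abstract, so it is left unspecified here.

```latex
% CG iteration (standard form; the paper's specific \beta_k is not reproduced)
x_{k+1} = x_k + \alpha_k d_k, \qquad
d_k =
\begin{cases}
  -g_k, & k = 0,\\
  -g_k + \beta_k d_{k-1}, & k \ge 1.
\end{cases}

% Strong Wolfe-Powell conditions on the step length \alpha_k, with 0 < \delta < \sigma < 1
f(x_k + \alpha_k d_k) \le f(x_k) + \delta\, \alpha_k\, g_k^{T} d_k,
\qquad
\left| g(x_k + \alpha_k d_k)^{T} d_k \right| \le \sigma \left| g_k^{T} d_k \right|.

% Sufficient descent condition, for some constant c > 0
g_k^{T} d_k \le -c\, \lVert g_k \rVert^{2}.
```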
      

      
      Hypermatrix scheme for finite element systems on CDC STAR-100 computer
      NASA Technical Reports Server (NTRS)
      Noor, A. K.; Voigt, S. J.
         1975-01-01
         A study is made of the adaptation of the hypermatrix (block matrix) scheme for solving large systems of finite element equations to the CDC STAR-100 computer. Discussion is focused on the organization of the hypermatrix computation using Cholesky decomposition and the mode of storage of the different submatrices to take advantage of the STAR pipeline (streaming) capability. Consideration is also given to the associated data handling problems and the means of balancing the I/O and CPU times in the solution process. Numerical examples are presented showing the anticipated gain in CPU speed over the CDC 6600 to be obtained by using the proposed algorithms on the STAR computer.
      

      
      An efficient implementation of semi-numerical computation of the Hartree-Fock exchange on the Intel Phi processor
      NASA Astrophysics Data System (ADS)
      Liu, Fenglai; Kong, Jing
         2018-07-01
         Unique technical challenges and their solutions for implementing semi-numerical Hartree-Fock exchange on the Phi processor are discussed, especially concerning the single-instruction-multiple-data type of processing and small cache size. Benchmark calculations on a series of buckyball molecules with various Gaussian basis sets on a Phi processor and a six-core CPU show that the Phi processor provides as much as a 12-fold speedup with large basis sets compared with the conventional four-center electron repulsion integration approach performed on the CPU. The accuracy of the semi-numerical scheme is also evaluated and found to be comparable to that of the resolution-of-identity approach.
      

      
      Real-Time Agent-Based Modeling Simulation with in-situ Visualization of Complex Biological Systems: A Case Study on Vocal Fold Inflammation and Healing.
      PubMed
      Seekhao, Nuttiiya; Shung, Caroline; JaJa, Joseph; Mongeau, Luc; Li-Jessen, Nicole Y K
         2016-05-01
         We present an efficient and scalable scheme for implementing agent-based modeling (ABM) simulation with In Situ visualization of large complex systems on heterogeneous computing platforms. The scheme is designed to make optimal use of the resources available on a heterogeneous platform consisting of a multicore CPU and a GPU, resulting in minimal to no resource idle time. Furthermore, the scheme was implemented under a client-server paradigm that enables remote users to visualize and analyze simulation data as it is being generated at each time step of the model. Performance of a simulation case study of vocal fold inflammation and wound healing with 3.8 million agents shows 35× and 7× speedup in execution time over single-core and multi-core CPU respectively. Each iteration of the model took less than 200 ms to simulate, visualize and send the results to the client. This enables users to monitor the simulation in real-time and modify its course as needed.
      

      
      Research on SEU hardening of heterogeneous Dual-Core SoC
      NASA Astrophysics Data System (ADS)
      Huang, Kun; Hu, Keliu; Deng, Jun; Zhang, Tao
         2017-08-01
         Single-Event Upset (SEU) hardening can be implemented through various schemes. However, some of them require a lot of human, material and financial resources. This paper proposes an easy SEU hardening scheme for a Heterogeneous Dual-core SoC (HD SoC) which contains three techniques. First, the automatic Triple Modular Redundancy (TMR) technique is adopted to harden the register heaps of the processor and the instruction-fetching module. Second, Hamming codes are used to harden the random access memory (RAM). Last, a software signature technique is applied to check the programs which are running on the CPU. The scheme does not need to consume additional resources, and has little influence on the performance of the CPU. These technologies are very mature, easy to implement and low in cost. According to the simulation result, the scheme can satisfy the basic demand of SEU hardening.
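
         Two of the techniques named in this abstract can be illustrated in software: bitwise majority voting, as used in TMR, and single-error correction with a Hamming code. The sketch below is an assumption-laden illustration rather than the paper's hardware design: the function names are hypothetical, and the Hamming(7,4) code is chosen only because it is the smallest standard example; the abstract does not say which code length protects the RAM.

```python
def tmr_vote(a, b, c):
    """Bitwise majority vote over three redundant copies of a word."""
    return (a & b) | (a & c) | (b & c)

def hamming74_encode(data_bits):
    """Encode 4 data bits into a 7-bit codeword laid out as
    [p1, p2, d1, p4, d2, d3, d4] (parity bits at positions 1, 2 and 4)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4   # covers positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers positions 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_correct(codeword):
    """Correct at most one flipped bit.
    Returns (data_bits, error_position), where error_position is 0 if the
    codeword was clean and otherwise the 1-based position that was repaired."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4  # the syndrome spells out the bad position
    if position:
        c[position - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], position
```

         A single upset in one of the three TMR copies is outvoted by the other two, and a single upset in a stored codeword is located by the syndrome and flipped back; neither mechanism handles two simultaneous upsets in the same protected unit.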
      

      
      Irregular large-scale computed tomography on multiple graphics processors improves energy-efficiency metrics for industrial applications
      NASA Astrophysics Data System (ADS)
      Jimenez, Edward S.; Goodman, Eric L.; Park, Ryeojin; Orr, Laurel J.; Thompson, Kyle R.
         2014-09-01
         This paper will investigate energy-efficiency for various real-world industrial computed-tomography reconstruction algorithms, both CPU- and GPU-based implementations. This work shows that the energy required for a given reconstruction is based on performance and problem size. There are many ways to describe performance and energy efficiency, thus this work will investigate multiple metrics including performance-per-watt, energy-delay product, and energy consumption. This work found that irregular GPU-based approaches realized tremendous savings in energy consumption when compared to CPU implementations while also significantly improving the performance-per-watt and energy-delay product metrics. Additional energy savings and other metric improvements were realized on the GPU-based reconstructions by improving storage I/O by implementing a parallel MIMD-like modularization of the compute and I/O tasks.
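
         The three metrics named in this abstract have common textbook definitions, sketched below. The exact definitions used in the paper are not stated in the abstract, so these formulas, the function names, and the assumption of a constant average power draw over the run are all illustrative.

```python
def energy_joules(avg_power_w, runtime_s):
    """Energy consumption: average power (W) times runtime (s)."""
    return avg_power_w * runtime_s

def performance_per_watt(work_units, runtime_s, avg_power_w):
    """Throughput (work units per second) divided by average power (W)."""
    return (work_units / runtime_s) / avg_power_w

def energy_delay_product(avg_power_w, runtime_s):
    """Energy (J) times runtime (s); penalises runs that save energy by being slow."""
    return energy_joules(avg_power_w, runtime_s) * runtime_s
```

         Because the energy-delay product scales with the square of the runtime, a device that draws more power but finishes much sooner can still come out ahead on that metric.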
      

      
      An Investigation of Unified Memory Access Performance in CUDA
      PubMed Central
      Landaverde, Raphael; Zhang, Tiansheng; Coskun, Ayse K.; Herbordt, Martin
         2015-01-01
         Managing memory between the CPU and GPU is a major challenge in GPU computing. A programming model, Unified Memory Access (UMA), has been recently introduced by Nvidia to simplify the complexities of memory management while claiming good overall performance. In this paper, we investigate this programming model and evaluate its performance and programming model simplifications based on our experimental results. We find that beyond on-demand data transfers to the CPU, the GPU is also able to request subsets of data it requires on demand. This feature allows UMA to outperform full data transfer methods for certain parallel applications and small data sizes. We also find, however, that for the majority of applications and memory access patterns, the performance overheads associated with UMA are significant, while the simplifications to the programming model restrict flexibility for adding future optimizations. PMID:26594668
      

      
      Reduced order model based on principal component analysis for process simulation and optimization
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Lang, Y.; Malacina, A.; Biegler, L.
         2009-01-01
         It is well-known that distributed parameter computational fluid dynamics (CFD) models provide more accurate results than conventional, lumped-parameter unit operation models used in process simulation. Consequently, the use of CFD models in process/equipment co-simulation offers the potential to optimize overall plant performance with respect to complex thermal and fluid flow phenomena. Because solving CFD models is time-consuming compared to the overall process simulation, we consider the development of fast reduced order models (ROMs) based on CFD results to closely approximate the high-fidelity equipment models in the co-simulation. By considering process equipment items with complicated geometries and detailed thermodynamic property models, this study proposes a strategy to develop ROMs based on principal component analysis (PCA). Taking advantage of commercial process simulation and CFD software (for example, Aspen Plus and FLUENT), we are able to develop systematic CFD-based ROMs for equipment models in an efficient manner. In particular, we show that the validity of the ROM is more robust within a well-sampled input domain and the CPU time is significantly reduced. Typically, it takes at most several CPU seconds to evaluate the ROM compared to several CPU hours or more to solve the CFD model. Two case studies, involving two power plant equipment examples, are described and demonstrate the benefits of using our proposed ROM methodology for process simulation and optimization.
      

      
      Generic algorithms for high performance scalable geocomputing
      NASA Astrophysics Data System (ADS)
      de Jong, Kor; Schmitz, Oliver; Karssenberg, Derek
         2016-04-01
         During the last decade, the characteristics of computing hardware have changed a lot. For example, instead of a single general purpose CPU core, personal computers nowadays contain multiple cores per CPU and often general purpose accelerators, like GPUs. Additionally, compute nodes are often grouped together to form clusters or a supercomputer, providing enormous amounts of compute power. For existing earth simulation models to be able to use modern hardware platforms, their compute intensive parts must be rewritten. This can be a major undertaking and may involve many technical challenges. Compute tasks must be distributed over CPU cores, offloaded to hardware accelerators, or distributed to different compute nodes. And ideally, all of this should be done in such a way that the compute task scales well with the hardware resources. This presents two challenges: 1) how to make good use of all the compute resources and 2) how to make these compute resources available for developers of simulation models, who may not (want to) have the required technical background for distributing compute tasks. The first challenge requires the use of specialized technology (e.g.: threads, OpenMP, MPI, OpenCL, CUDA). The second challenge requires the abstraction of the logic handling the distribution of compute tasks from the model-specific logic, hiding the technical details from the model developer. To assist the model developer, we are developing a C++ software library (called Fern) containing algorithms that can use all CPU cores available in a single compute node (distributing tasks over multiple compute nodes will be done at a later stage). The algorithms are grid-based (finite difference) and include local and spatial operations such as convolution filters. The algorithms handle distribution of the compute tasks to CPU cores internally. 
In the resulting model, the low-level details of how this is done are separated from the model-specific logic representing the modeled system. This contrasts with practices in which code for distributing compute tasks is mixed with model-specific code, and it results in a more maintainable model. For flexibility and efficiency, the algorithms are configurable at compile time with respect to the following aspects: data type, value type, no-data handling, input value domain handling, and output value range handling. This makes the algorithms usable in very different contexts, without the need for making intrusive changes to existing models when using them. Applications that benefit from using the Fern library include the construction of forward simulation models in (global) hydrology (e.g. PCR-GLOBWB (Van Beek et al. 2011)), ecology, geomorphology, or land use change (e.g. PLUC (Verstegen et al. 2014)) and manipulation of hyper-resolution land surface data such as digital elevation models and remote sensing data. Using the Fern library, we have also created an add-on to the PCRaster Python Framework (Karssenberg et al. 2010) allowing its users to speed up their spatio-temporal models, sometimes by changing just a single line of Python code in their model. In our presentation we will give an overview of the design of the algorithms, providing examples of different contexts where they can be used to replace existing sequential algorithms, including the PCRaster environmental modeling software (www.pcraster.eu). We will show how the algorithms can be configured to behave differently when necessary. References: Karssenberg, D., Schmitz, O., Salamon, P., De Jong, K. and Bierkens, M.F.P., 2010, A software framework for construction of process-based stochastic spatio-temporal models and data assimilation. Environmental Modelling & Software, 25, pp. 489-502. Best Paper Award 2010: Software and Decision Support. Van Beek, L. P. H., Y. Wada, and M. F. P. Bierkens. 2011. Global monthly water stress: 1. Water balance and water availability. Water Resources Research. 47. Verstegen, J. A., D. Karssenberg, F. van der Hilst, and A. P. C. Faaij. 2014. Identifying a land use change cellular automaton by Bayesian data assimilation. Environmental Modelling & Software 53:121-136.
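The separation described in this abstract, with model-specific logic on one side and task distribution plus no-data handling on the other, can be illustrated with a minimal Python sketch. This is not the Fern API (Fern is a C++ library and its interface is not shown in the abstract); the names `apply_local`, `local_add` and the `NODATA` marker are hypothetical, and a thread pool is used only to show the structure, since CPython threads will not give a real speedup for pure-Python arithmetic.

```python
from concurrent.futures import ThreadPoolExecutor

NODATA = None  # hypothetical no-data marker for this sketch


def local_add(a, b):
    """Model-specific logic: a cell-wise (local) operation."""
    return a + b


def apply_local(op, grid1, grid2, workers=4):
    """Distribution logic: apply `op` cell by cell, one task per grid row.

    No-data handling lives here, so the model code never sees it.
    """
    def do_row(rows):
        row1, row2 = rows
        return [
            NODATA if (x is NODATA or y is NODATA) else op(x, y)
            for x, y in zip(row1, row2)
        ]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves row order, so the output grid lines up with the input.
        return list(pool.map(do_row, zip(grid1, grid2)))


elevation = [[1, 2], [3, NODATA]]
correction = [[10, 20], [30, 40]]
result = apply_local(local_add, elevation, correction)
```

The model developer only writes `local_add`; swapping the thread pool for another execution strategy would not change that function.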
      

        
       
          


          
        

     

   

   
       
            
              
          


          
        

           
           
             
               
      
      A parallel algorithm for the two-dimensional time fractional diffusion equation with implicit difference method.
      PubMed
      Gong, Chunye; Bao, Weimin; Tang, Guojian; Jiang, Yuewen; Liu, Jie
         2014-01-01
         Solving fractional differential equations is very time consuming. The computational complexity of the two-dimensional time fractional diffusion equation (2D-TFDE) with an iterative implicit finite difference method is $O(M_x M_y N^2)$. In this paper, we present a parallel algorithm for the 2D-TFDE and give an in-depth discussion of this algorithm. A task distribution model and a data layout with virtual boundaries are designed for this parallel algorithm. The experimental results show that the parallel algorithm compares well with the exact solution. The parallel algorithm on a single Intel Xeon X5540 CPU runs 3.16-4.17 times faster than the serial algorithm on a single CPU core. The parallel efficiency of 81 processes is up to 88.24% compared with 9 processes on a distributed memory cluster system. We believe that parallel computing will become a basic method for computationally intensive fractional applications in the near future.
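The abstract quotes a parallel efficiency of 81 processes relative to 9 processes. A minimal sketch of how such a relative efficiency is commonly computed follows, assuming the usual definition (relative speedup divided by the increase in process count); the abstract does not state the exact definition used, and the timings in the example are made-up placeholders, not values from the paper.

```python
def relative_efficiency(t_base, p_base, t_new, p_new):
    """Parallel efficiency of a run on p_new processes relative to a
    baseline run on p_base processes (1.0 means ideal scaling)."""
    relative_speedup = t_base / t_new
    ideal_speedup = p_new / p_base
    return relative_speedup / ideal_speedup


# Hypothetical timings: 810 s on 9 processes, 100 s on 81 processes.
eff = relative_efficiency(t_base=810.0, p_base=9, t_new=100.0, p_new=81)
```

Under this definition, the reported 88.24% would correspond to a relative speedup of about 0.8824 × 9 ≈ 7.9 when going from 9 to 81 processes.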
      

      
      Real-time autocorrelator for fluorescence correlation spectroscopy based on graphical-processor-unit architecture: method, implementation, and comparative studies
      NASA Astrophysics Data System (ADS)
      Laracuente, Nicholas; Grossman, Carl
         2013-03-01
         We developed an algorithm and software to calculate autocorrelation functions from real-time photon-counting data using the fast, parallel capabilities of graphical processor units (GPUs). Recent developments in hardware and software have allowed for general purpose computing with inexpensive GPU hardware. These devices are better suited to emulating hardware autocorrelators than traditional CPU-based software applications because they emphasize parallel throughput over sequential speed. Incoming data are binned in a standard multi-tau scheme with configurable points-per-bin size and are mapped into a GPU memory pattern to reduce time-expensive memory access. Applications include dynamic light scattering (DLS) and fluorescence correlation spectroscopy (FCS) experiments. We ran the software on a 64-core graphics PCI card in a 3.2 GHz Intel i5 CPU based computer running Linux. FCS measurements were made on Alexa-546 and Texas Red dyes in a standard buffer (PBS). Software correlations were compared to hardware correlator measurements on the same signals. Supported by HHMI and Swarthmore College.
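The multi-tau binning mentioned above can be sketched on the CPU in a few lines of Python. This is one common variant of the scheme, written sequentially for clarity; it is not the authors' GPU implementation, and the function names and the choice of normalisation are assumptions made for illustration.

```python
def rebin(counts):
    """Halve the time resolution by summing adjacent pairs of samples
    (a trailing unpaired sample is dropped)."""
    return [counts[i] + counts[i + 1] for i in range(0, len(counts) - 1, 2)]


def multi_tau_autocorrelation(counts, points_per_level=8, levels=3):
    """Return (lag, G) pairs, with lag in units of the original sample time.

    Level 0 covers lags 1..points_per_level at full resolution. Each later
    level doubles the bin width and covers only the upper half of its lag
    range, because the lower half was already measured at finer resolution.
    """
    result = []
    data = list(counts)
    width = 1
    for level in range(levels):
        first = 1 if level == 0 else points_per_level // 2 + 1
        for k in range(first, points_per_level + 1):
            n = len(data) - k
            if n <= 0:
                break
            g = sum(data[i] * data[i + k] for i in range(n)) / n
            # Divide by width**2 so values are comparable across levels.
            result.append((k * width, g / (width * width)))
        data = rebin(data)
        width *= 2
    return result


# A constant signal gives a flat correlation function at every lag.
flat = multi_tau_autocorrelation([2] * 64)
```

The lag spacing grows geometrically, which is what lets a fixed number of channels span several decades of delay time.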
      

      
      SU-E-J-60: Efficient Monte Carlo Dose Calculation On CPU-GPU Heterogeneous Systems
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Xiao, K; Chen, D. Z; Hu, X. S
         
         Purpose: It is well-known that the performance of GPU-based Monte Carlo dose calculation implementations is bounded by memory bandwidth. One major cause of this bottleneck is the random memory writing patterns in dose deposition, which lead to several memory efficiency issues on GPU such as un-coalesced writing and atomic operations. We propose a new method to alleviate such issues on CPU-GPU heterogeneous systems, which achieves overall performance improvement for Monte Carlo dose calculation. Methods: Dose deposition accumulates dose into the voxels of a dose volume along the trajectories of radiation rays. Our idea is to partition this procedure into the following three steps, which are fine-tuned for CPU or GPU: (1) each GPU thread writes dose results with location information to a buffer on GPU memory, which achieves fully-coalesced and atomic-free memory transactions; (2) the dose results in the buffer are transferred to CPU memory; (3) the dose volume is constructed from the dose buffer on CPU. We organize the processing of all radiation rays into streams. Since the steps within a stream use different hardware resources (i.e., GPU, DMA, CPU), we can overlap the execution of these steps for different streams by pipelining. Results: We evaluated our method using a Monte Carlo Convolution Superposition (MCCS) program and tested our implementation for various clinical cases on a heterogeneous system containing an Intel i7 quad-core CPU and an NVIDIA TITAN GPU. Compared with a straightforward MCCS implementation on the same system (using both CPU and GPU for radiation ray tracing), our method gained 2-5X speedup without losing dose calculation accuracy. Conclusion: The results show that our new method improves the effective memory bandwidth and overall performance for MCCS on the CPU-GPU systems. Our proposed method can also be applied to accelerate other Monte Carlo dose calculation approaches. This research was supported in part by NSF under Grants CCF-1217906, and also in part by a research contract from the Sandia National Laboratories.
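The three-step partition described in the Methods can be sketched as plain Python, with ordinary lists standing in for GPU and CPU memory. This is only an illustration of the data flow, not the authors' implementation: the function names are hypothetical, there is no real GPU, DMA transfer or pipelining here, and each "thread" is simply a list of (voxel index, dose) records.

```python
def write_dose_buffer(per_ray_deposits):
    """Step 1 stand-in: every ray appends (voxel_index, dose) records to a
    flat buffer instead of adding into the shared dose volume, so no two
    writers ever touch the same location."""
    buffer = []
    for deposits in per_ray_deposits:
        buffer.extend(deposits)
    return buffer


def transfer_to_cpu(buffer):
    """Step 2 stand-in: copy the buffer (a device-to-host transfer in the
    real system)."""
    return list(buffer)


def build_dose_volume(buffer, n_voxels):
    """Step 3: the CPU scatters the buffered records into the dose volume."""
    volume = [0.0] * n_voxels
    for voxel, dose in buffer:
        volume[voxel] += dose
    return volume


rays = [[(0, 1.0), (2, 0.5)], [(2, 0.25), (1, 2.0)]]
volume = build_dose_volume(transfer_to_cpu(write_dose_buffer(rays)), n_voxels=4)
```

The random-access accumulation happens only in the last step, which is the part the abstract assigns to the CPU.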
      

      
      CUDA-Accelerated Geodesic Ray-Tracing for Fiber Tracking
      PubMed Central
      van Aart, Evert; Sepasian, Neda; Jalba, Andrei; Vilanova, Anna
         2011-01-01
         Diffusion Tensor Imaging (DTI) allows noninvasive measurement of the diffusion of water in fibrous tissue. By reconstructing the fibers from DTI data using a fiber-tracking algorithm, we can deduce the structure of the tissue. In this paper, we outline an approach to accelerating such a fiber-tracking algorithm using a Graphics Processing Unit (GPU). This algorithm, which is based on the calculation of geodesics, has shown promising results for both synthetic and real data, but is limited in its applicability by its high computational requirements. We present a solution which uses the parallelism offered by modern GPUs, in combination with the CUDA platform by NVIDIA, to significantly reduce the execution time of the fiber-tracking algorithm. Compared to a multithreaded CPU implementation of the same algorithm, our GPU mapping achieves a speedup factor of up to 40 times. PMID:21941525
      

      
      Simulating electron wave dynamics in graphene superlattices exploiting parallel processing advantages
      NASA Astrophysics Data System (ADS)
      Rodrigues, Manuel J.; Fernandes, David E.; Silveirinha, Mário G.; Falcão, Gabriel
         2018-01-01
         This work introduces a parallel computing framework to characterize the propagation of electron waves in graphene-based nanostructures. The electron wave dynamics is modeled using both "microscopic" and effective medium formalisms and the numerical solution of the two-dimensional massless Dirac equation is determined using a Finite-Difference Time-Domain scheme. The propagation of electron waves in graphene superlattices with localized scattering centers is studied, and the role of the symmetry of the microscopic potential in the electron velocity is discussed. The computational methodologies target the parallel capabilities of heterogeneous multi-core CPU and multi-GPU environments and are built with the OpenCL parallel programming framework which provides a portable, vendor agnostic and high throughput-performance solution. The proposed heterogeneous multi-GPU implementation achieves speedup ratios up to 75x when compared to multi-thread and multi-core CPU execution, reducing simulation times from several hours to a couple of minutes.
      

      
      Fast data reconstructed method of Fourier transform imaging spectrometer based on multi-core CPU
      NASA Astrophysics Data System (ADS)
      Yu, Chunchao; Du, Debiao; Xia, Zongze; Song, Li; Zheng, Weijian; Yan, Min; Lei, Zhenggang
         2017-10-01
         An imaging spectrometer acquires a two-dimensional spatial image and a one-dimensional spectrum at the same time, which makes it highly useful in color and spectral measurement, true-color image synthesis, military reconnaissance, and similar applications. To achieve fast reconstruction of Fourier transform imaging spectrometer data, the paper designs an optimized reconstruction algorithm based on OpenMP parallel computing, which was further applied to the processing of data from the HyperSpectral Imager of the Chinese `HJ-1' satellite. The results show that the method based on multi-core parallel computing makes effective use of multi-core CPU hardware resources and significantly improves the efficiency of spectrum reconstruction. If the technique is applied to a workstation with more cores, it should be possible to process Fourier transform imaging spectrometer data in real time on a single computer.
      

      
      A proximity algorithm accelerated by Gauss-Seidel iterations for L1/TV denoising models
      NASA Astrophysics Data System (ADS)
      Li, Qia; Micchelli, Charles A.; Shen, Lixin; Xu, Yuesheng
         2012-09-01
         Our goal in this paper is to improve the computational performance of the proximity algorithms for the L1/TV denoising model. This leads us to a new characterization of all solutions to the L1/TV model via fixed-point equations expressed in terms of the proximity operators. Based upon this observation we develop an algorithm for solving the model and establish its convergence. Furthermore, we demonstrate that the proposed algorithm can be accelerated through the use of the componentwise Gauss-Seidel iteration so that the CPU time consumed is significantly reduced. Numerical experiments using the proposed algorithm for impulsive noise removal are included, with a comparison to three recently developed algorithms. The numerical results show that while the proposed algorithm enjoys a high quality of the restored images, as the other three known algorithms do, it performs significantly better in terms of computational efficiency measured in the CPU time consumed.
      

      
      Classification of hyperspectral imagery using MapReduce on a NVIDIA graphics processing unit (Conference Presentation)
      NASA Astrophysics Data System (ADS)
      Ramirez, Andres; Rahnemoonfar, Maryam
         2017-04-01
         A hyperspectral image is a data-rich multidimensional image consisting of hundreds of spectral dimensions. Analyzing the spectral and spatial information of such an image with linear and non-linear algorithms requires long computation times. To overcome this problem, this research presents a system using a MapReduce-Graphics Processing Unit (GPU) model that helps analyze a hyperspectral image through parallel hardware and a parallel programming model that is simpler to handle than other low-level parallel programming models. Additionally, Hadoop was used as an open-source version of the MapReduce parallel programming model. This research compared classification accuracy and timing results between the Hadoop and GPU systems and tested them against the following test cases: the CPU and GPU test case, a CPU test case and a test case where no dimensional reduction was applied.
      

      
      Accelerated Monte Carlo Simulation on the Chemical Stage in Water Radiolysis using GPU
      PubMed Central
      Tian, Zhen; Jiang, Steve B.; Jia, Xun
         2018-01-01
         The accurate simulation of water radiolysis is an important step to understand the mechanisms of radiobiology and quantitatively test some hypotheses regarding radiobiological effects. However, the simulation of water radiolysis is highly time consuming, taking hours or even days to be completed by a conventional CPU processor. This time limitation hinders cell-level simulations for a number of research studies. We recently initiated efforts to develop gMicroMC, a GPU-based fast microscopic MC simulation package for water radiolysis. The first step of this project focused on accelerating the simulation of the chemical stage, the most time consuming stage in the entire water radiolysis process. A GPU-friendly parallelization strategy was designed to address the highly correlated many-body simulation problem caused by the mutual competitive chemical reactions between the radiolytic molecules. Two cases were tested, using a 750 keV electron and a 5 MeV proton incident in pure water, respectively. The time-dependent yields of all the radiolytic species during the chemical stage were used to evaluate the accuracy of the simulation. The relative differences between our simulation and the Geant4-DNA simulation were on average 5.3% and 4.4% for the two cases. Our package, executed on an Nvidia Titan black GPU card, successfully completed the chemical stage simulation of the two cases within 599.2 s and 489.0 s. As compared with Geant4-DNA that was executed on an Intel i7-5500U CPU processor and needed 28.6 h and 26.8 h for the two cases using a single CPU core, our package achieved a speed-up factor of 171.1-197.2. PMID:28323637
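The quoted speed-up factors can be checked against the quoted run times with a short calculation. The sketch below simply converts the CPU hours to seconds and divides by the GPU seconds; the helper name `speedup` is introduced here for illustration only.

```python
def speedup(cpu_hours, gpu_seconds):
    """Ratio of CPU run time to GPU run time, with the CPU time given in hours."""
    return cpu_hours * 3600.0 / gpu_seconds


electron_case = speedup(28.6, 599.2)  # 750 keV electron case
proton_case = speedup(26.8, 489.0)    # 5 MeV proton case
```

With the rounded run times from the abstract this gives roughly 172 and 197, close to the reported range of 171.1-197.2; the small difference in the first value presumably comes from rounding of the published run times.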
      

      
      Quasi-elastic light scattering: Signal storage, correlation, and spectrum analysis under control of an 8-bit microprocessor
      NASA Astrophysics Data System (ADS)
      Glatter, Otto; Fuchs, Heribert; Jorde, Christian; Eigner, Wolf-Dieter
         1987-03-01
         The microprocessor of an 8-bit PC system is used as a central control unit for the acquisition and evaluation of data from quasi-elastic light scattering experiments. Data are sampled with a width of 8 bits under control of the CPU. This limits the minimum sample time to 20 μs. Shorter sample times would need a direct memory access channel. The 8-bit CPU can address a 64-kbyte RAM without additional paging. Up to 49 000 sample points can be measured without interruption. After storage, a correlation function or a power spectrum can be calculated from such a primary data set. Furthermore access is provided to the primary data for stability control, statistical tests, and for comparison of different evaluation methods for the same experiment. A detailed analysis of the signal (histogram) and of the effect of overflows is possible and shows that the number of pulses but not the number of overflows determines the error in the result. The correlation function can be computed with reasonable accuracy from data with a mean pulse rate greater than one, the power spectrum needs a three times higher pulse rate for convergence. The statistical accuracy of the results from 49 000 sample points is of the order of a few percent. Additional averages are necessary to improve their quality. The hardware extensions for the PC system are inexpensive. The main disadvantage of the present system is the high minimum sampling time of 20 μs and the fact that the correlogram or the power spectrum cannot be computed on-line as it can be done with hardware correlators or spectrum analyzers. These shortcomings and the storage size restrictions can be removed with a faster 16/32-bit CPU.
      

      
      Accelerated Monte Carlo simulation on the chemical stage in water radiolysis using GPU
      NASA Astrophysics Data System (ADS)
      Tian, Zhen; Jiang, Steve B.; Jia, Xun
         2017-04-01
         The accurate simulation of water radiolysis is an important step to understand the mechanisms of radiobiology and quantitatively test some hypotheses regarding radiobiological effects. However, the simulation of water radiolysis is highly time consuming, taking hours or even days to be completed by a conventional CPU processor. This time limitation hinders cell-level simulations for a number of research studies. We recently initiated efforts to develop gMicroMC, a GPU-based fast microscopic MC simulation package for water radiolysis. The first step of this project focused on accelerating the simulation of the chemical stage, the most time consuming stage in the entire water radiolysis process. A GPU-friendly parallelization strategy was designed to address the highly correlated many-body simulation problem caused by the mutual competitive chemical reactions between the radiolytic molecules. Two cases were tested, using a 750 keV electron and a 5 MeV proton incident in pure water, respectively. The time-dependent yields of all the radiolytic species during the chemical stage were used to evaluate the accuracy of the simulation. The relative differences between our simulation and the Geant4-DNA simulation were on average 5.3% and 4.4% for the two cases. Our package, executed on an Nvidia Titan black GPU card, successfully completed the chemical stage simulation of the two cases within 599.2 s and 489.0 s. As compared with Geant4-DNA that was executed on an Intel i7-5500U CPU processor and needed 28.6 h and 26.8 h for the two cases using a single CPU core, our package achieved a speed-up factor of 171.1-197.2.
      

      
      Accelerated Monte Carlo simulation on the chemical stage in water radiolysis using GPU.
      PubMed
      Tian, Zhen; Jiang, Steve B; Jia, Xun
         2017-04-21
         The accurate simulation of water radiolysis is an important step to understand the mechanisms of radiobiology and quantitatively test some hypotheses regarding radiobiological effects. However, the simulation of water radiolysis is highly time consuming, taking hours or even days to be completed by a conventional CPU processor. This time limitation hinders cell-level simulations for a number of research studies. We recently initiated efforts to develop gMicroMC, a GPU-based fast microscopic MC simulation package for water radiolysis. The first step of this project focused on accelerating the simulation of the chemical stage, the most time consuming stage in the entire water radiolysis process. A GPU-friendly parallelization strategy was designed to address the highly correlated many-body simulation problem caused by the mutual competitive chemical reactions between the radiolytic molecules. Two cases were tested, using a 750 keV electron and a 5 MeV proton incident in pure water, respectively. The time-dependent yields of all the radiolytic species during the chemical stage were used to evaluate the accuracy of the simulation. The relative differences between our simulation and the Geant4-DNA simulation were on average 5.3% and 4.4% for the two cases. Our package, executed on an Nvidia Titan black GPU card, successfully completed the chemical stage simulation of the two cases within 599.2 s and 489.0 s. As compared with Geant4-DNA that was executed on an Intel i7-5500U CPU processor and needed 28.6 h and 26.8 h for the two cases using a single CPU core, our package achieved a speed-up factor of 171.1-197.2.
      

      
      Synthesis and Characterization of Biodegradable Polyurethane for Hypopharyngeal Tissue Engineering
      PubMed Central
      Shen, Zhisen; Lu, Dakai; Li, Qun; Zhang, Zongyong
         2015-01-01
         Biodegradable crosslinked polyurethane (cPU) was synthesized using polyethylene glycol (PEG), L-lactide (L-LA), and hexamethylene diisocyanate (HDI), with iron acetylacetonate (Fe(acac)3) as the catalyst and PEG as the extender. Chemical components of the obtained polymers were characterized by FTIR spectroscopy, 1H NMR spectra, and Gel Permeation Chromatography (GPC). The thermodynamic properties, mechanical behaviors, surface hydrophilicity, degradability, and cytotoxicity were tested via differential scanning calorimetry (DSC), tensile tests, contact angle measurements, and cell culture. The results show that the synthesized cPU possessed good flexibility with quite low glass transition temperature (Tg, −22°C) and good wettability. Water uptake measured as high as 229.7 ± 18.7%. These properties make cPU a good candidate material for engineering soft tissues such as the hypopharynx. In vitro and in vivo tests showed that cPU has the ability to support the growth of human hypopharyngeal fibroblasts and angiogenesis was observed around cPU after it was implanted subcutaneously in SD rats. PMID:25839041
      

      
      Is our medical school socially accountable? The case of Faculty of Medicine, Suez Canal University.
      PubMed
      Hosny, Somaya; Ghaly, Mona; Boelen, Charles
         2015-04-01
         Faculty of Medicine, Suez Canal University (FOM/SCU) was established as a community-oriented school with innovative educational strategies. Social accountability represents the commitment of the medical school towards the community it serves. The aim was to assess FOM/SCU compliance with social accountability using the "Conceptualization, Production, Usability" (CPU) model. FOM/SCU's practice was reviewed against CPU model parameters. CPU consists of three domains, 11 sections and 31 parameters. Data were collected through unstructured interviews with the main stakeholders and review of documents from 2005 to 2013. FOM/SCU shows general compliance with the three domains of the CPU. Very good compliance was shown with the "P" domain of the model through FOM/SCU's innovative educational system, students and faculty members. More work is needed on the "C" and "U" domains. FOM/SCU complies with many parameters of the CPU model; however, more work should be accomplished to comply with some items in the C and U domains so that FOM/SCU can be recognized as a proactive socially accountable school.
      

      
      Synthesis and characterization of biodegradable polyurethane for hypopharyngeal tissue engineering.
      PubMed
      Shen, Zhisen; Lu, Dakai; Li, Qun; Zhang, Zongyong; Zhu, Yabin
         2015-01-01
         Biodegradable crosslinked polyurethane (cPU) was synthesized using polyethylene glycol (PEG), L-lactide (L-LA), and hexamethylene diisocyanate (HDI), with iron acetylacetonate (Fe(acac)3) as the catalyst and PEG as the extender. Chemical components of the obtained polymers were characterized by FTIR spectroscopy, 1H NMR spectra, and Gel Permeation Chromatography (GPC). The thermodynamic properties, mechanical behaviors, surface hydrophilicity, degradability, and cytotoxicity were tested via differential scanning calorimetry (DSC), tensile tests, contact angle measurements, and cell culture. The results show that the synthesized cPU possessed good flexibility with quite low glass transition temperature (Tg, -22°C) and good wettability. Water uptake measured as high as 229.7 ± 18.7%. These properties make cPU a good candidate material for engineering soft tissues such as the hypopharynx. In vitro and in vivo tests showed that cPU has the ability to support the growth of human hypopharyngeal fibroblasts and angiogenesis was observed around cPU after it was implanted subcutaneously in SD rats.
      

      
      Low-memory iterative density fitting.
      PubMed
      Grajciar, Lukáš
         2015-07-30
         A new low-memory modification of the density fitting approximation based on a combination of a continuous fast multipole method (CFMM) and a preconditioned conjugate gradient solver is presented. The iterative conjugate gradient solver uses preconditioners formed from blocks of the Coulomb metric matrix that decrease the number of iterations needed for convergence by up to one order of magnitude. The matrix-vector products needed within the iterative algorithm are calculated using CFMM, which evaluates them with only linear-scaling memory requirements. Compared with the standard density fitting implementation, up to a 15-fold reduction of the memory requirements is achieved for the most efficient preconditioner at a cost of only a 25% increase in computational time. The potential of the method is demonstrated by performing density functional theory calculations for a zeolite fragment with 2592 atoms and 121,248 auxiliary basis functions on a single 12-core CPU workstation. © 2015 Wiley Periodicals, Inc.
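The solver structure named in this abstract, a preconditioned conjugate gradient iteration that only needs matrix-vector products, can be sketched generically. The code below is a textbook PCG in pure Python, not the paper's implementation: a small dense matrix stands in for the CFMM-evaluated product, and a diagonal (Jacobi) preconditioner stands in for the block preconditioners built from the Coulomb metric matrix. All function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pcg(matvec, b, apply_preconditioner, tol=1e-10, max_iter=200):
    """Solve A x = b for symmetric positive definite A, given only a
    function computing A @ p and a function applying the preconditioner."""
    x = [0.0] * len(b)
    r = list(b)  # residual b - A x for the initial guess x = 0
    z = apply_preconditioner(r)
    p = list(z)
    rz = dot(r, z)
    for iteration in range(1, max_iter + 1):
        Ap = matvec(p)  # the only place the matrix is touched
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if dot(r, r) ** 0.5 < tol:
            return x, iteration
        z = apply_preconditioner(r)
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


# Small SPD test system standing in for the density-fitting equations.
A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
matvec = lambda v: [dot(row, v) for row in A]
jacobi = lambda r: [ri / A[i][i] for i, ri in enumerate(r)]
solution, iterations = pcg(matvec, b, jacobi)
```

Because the matrix enters only through `matvec`, it never has to be stored explicitly, which is the property the low-memory approach relies on.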
      

      
      Multi-GPU implementation of a VMAT treatment plan optimization algorithm.
      PubMed
      Tian, Zhen; Peng, Fei; Folkerts, Michael; Tan, Jun; Jia, Xun; Jiang, Steve B
         2015-06-01
         Volumetric modulated arc therapy (VMAT) optimization is a computationally challenging problem due to its large data size, high degrees of freedom, and many hardware constraints. High-performance graphics processing units (GPUs) have been used to speed up the computations. However, GPU's relatively small memory size cannot handle cases with a large dose-deposition coefficient (DDC) matrix in cases of, e.g., those with a large target size, multiple targets, multiple arcs, and/or small beamlet size. The main purpose of this paper is to report an implementation of a column-generation-based VMAT algorithm, previously developed in the authors' group, on a multi-GPU platform to solve the memory limitation problem. While the column-generation-based VMAT algorithm has been previously developed, the GPU implementation details have not been reported. Hence, another purpose is to present detailed techniques employed for GPU implementation. The authors also would like to utilize this particular problem as an example problem to study the feasibility of using a multi-GPU platform to solve large-scale problems in medical physics. The column-generation approach generates VMAT apertures sequentially by solving a pricing problem (PP) and a master problem (MP) iteratively. In the authors' method, the sparse DDC matrix is first stored on a CPU in coordinate list format (COO). On the GPU side, this matrix is split into four submatrices according to beam angles, which are stored on four GPUs in compressed sparse row format. Computation of beamlet price, the first step in PP, is accomplished using multi-GPUs. A fast inter-GPU data transfer scheme is accomplished using peer-to-peer access. The remaining steps of PP and MP problems are implemented on CPU or a single GPU due to their modest problem scale and computational loads. Barzilai and Borwein algorithm with a subspace step scheme is adopted here to solve the MP problem. 
A head and neck (H&N) cancer case is then used to validate the authors' method. The authors also compare their multi-GPU implementation with three different single GPU implementation strategies, i.e., truncating DDC matrix (S1), repeatedly transferring DDC matrix between CPU and GPU (S2), and porting computations involving DDC matrix to CPU (S3), in terms of both plan quality and computational efficiency. Two more H&N patient cases and three prostate cases are used to demonstrate the advantages of the authors' method. The authors' multi-GPU implementation can finish the optimization process within ∼ 1 min for the H&N patient case. S1 leads to an inferior plan quality although its total time was 10 s shorter than the multi-GPU implementation due to the reduced matrix size. S2 and S3 yield the same plan quality as the multi-GPU implementation but take ∼4 and ∼6 min, respectively. High computational efficiency was consistently achieved for the other five patient cases tested, with VMAT plans of clinically acceptable quality obtained within 23-46 s. Conversely, to obtain clinically comparable or acceptable plans for all six of these VMAT cases that the authors have tested in this paper, the optimization time needed in a commercial TPS system on CPU was found to be in an order of several minutes. The results demonstrate that the multi-GPU implementation of the authors' column-generation-based VMAT optimization can handle the large-scale VMAT optimization problem efficiently without sacrificing plan quality. The authors' study may serve as an example to shed some light on other large-scale medical physics problems that require multi-GPU techniques.
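The storage scheme described above (a sparse matrix held in coordinate list format, then split into submatrices stored in compressed sparse row format) can be sketched in pure Python. This is an illustration of the two formats only: it assumes that columns correspond to beamlets and that grouping by beam angle amounts to taking contiguous column blocks, which the abstract does not state, and the function names are hypothetical.

```python
def split_by_column_block(rows, cols, vals, n_cols, n_parts):
    """Split COO triplets into n_parts contiguous column blocks, renumbering
    columns so that each block starts at column 0."""
    block = -(-n_cols // n_parts)  # ceiling division
    parts = [([], [], []) for _ in range(n_parts)]
    for r, c, v in zip(rows, cols, vals):
        p = c // block
        parts[p][0].append(r)
        parts[p][1].append(c - p * block)
        parts[p][2].append(v)
    return parts


def coo_to_csr(n_rows, rows, cols, vals):
    """Convert COO triplets to CSR arrays (indptr, indices, data)."""
    indptr = [0] * (n_rows + 1)
    for r in rows:
        indptr[r + 1] += 1
    for r in range(n_rows):
        indptr[r + 1] += indptr[r]  # running sum gives row start offsets
    indices = [0] * len(vals)
    data = [0.0] * len(vals)
    next_slot = indptr[:-1]  # slicing copies, so indptr is not modified
    for r, c, v in zip(rows, cols, vals):
        k = next_slot[r]
        indices[k], data[k] = c, v
        next_slot[r] += 1
    return indptr, indices, data


# A 2 x 4 matrix with four nonzeros, split into two column blocks.
rows, cols, vals = [0, 0, 1, 1], [0, 3, 1, 2], [1.0, 2.0, 3.0, 4.0]
blocks = split_by_column_block(rows, cols, vals, n_cols=4, n_parts=2)
csr_blocks = [coo_to_csr(2, *blk) for blk in blocks]
```

Each CSR block could then be placed on its own device, which is the role the four GPUs play in the abstract.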
      

      
      FastGCN: A GPU Accelerated Tool for Fast Gene Co-Expression Networks
      PubMed Central
      Liang, Meimei; Zhang, Futao; Jin, Gulei; Zhu, Jun
         2015-01-01
         Gene co-expression networks are one valuable type of biological network. Many methods and tools have been published to construct gene co-expression networks; however, most of these tools and methods are inconvenient and time consuming for large datasets. We have developed a user-friendly, accelerated and optimized tool for constructing gene co-expression networks that can fully harness the parallel nature of GPU (Graphic Processing Unit) architectures. Genetic entropies were exploited to filter out genes with no or small expression changes in the raw data preprocessing step. Pearson correlation coefficients were then calculated. After that, we normalized these coefficients and employed the False Discovery Rate to control for multiple testing. Finally, module identification was conducted to construct the co-expression networks. All of these calculations were implemented on a GPU. We also compressed the coefficient matrix to save space. We compared the performance of the GPU implementation with those of multi-core CPU implementations with 16 CPU threads, a single-thread C/C++ implementation and a single-thread R implementation. Our results show that the GPU implementation largely outperforms the single-thread C/C++ and single-thread R implementations, and that it outperforms the multi-core CPU implementation as the number of genes increases. With the test dataset containing 16,000 genes and 590 individuals, the GPU implementation runs more than 63 times faster than the single-thread R implementation when 50 percent of genes were filtered out and about 80 times faster when no genes were filtered out. PMID:25602758
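The Pearson correlation step mentioned in this abstract is the part with quadratic cost in the number of genes, which is why it parallelises well. A minimal sequential sketch in Python follows; it is not FastGCN code, the function names are hypothetical, and the tiny expression table is made up for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_pairs(expression):
    """Correlate every pair of genes once (upper triangle only).

    `expression` maps a gene name to its expression values across samples.
    Each pair is independent of the others, so the pairs could be handed
    to separate threads.
    """
    genes = sorted(expression)
    return {
        (g1, g2): pearson(expression[g1], expression[g2])
        for i, g1 in enumerate(genes)
        for g2 in genes[i + 1:]
    }


# Made-up expression values for three genes across three samples.
table = {"geneA": [1.0, 2.0, 3.0], "geneB": [2.0, 4.0, 6.0], "geneC": [3.0, 2.0, 1.0]}
pairs = correlation_pairs(table)
```

For 16,000 genes this loop would visit about 128 million pairs, which gives a sense of why the authors moved the computation to a GPU.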
      

      
      FastGCN: a GPU accelerated tool for fast gene co-expression networks.
      PubMed
      Liang, Meimei; Zhang, Futao; Jin, Gulei; Zhu, Jun
         2015-01-01
         Gene co-expression networks are one valuable type of biological network. Many methods and tools have been published to construct gene co-expression networks; however, most of these tools and methods are inconvenient and time consuming for large datasets. We have developed a user-friendly, accelerated and optimized tool for constructing gene co-expression networks that can fully harness the parallel nature of GPU (Graphic Processing Unit) architectures. Genetic entropies were exploited to filter out genes with no or small expression changes in the raw data preprocessing step. Pearson correlation coefficients were then calculated. After that, we normalized these coefficients and employed the False Discovery Rate to control for multiple testing. Finally, module identification was conducted to construct the co-expression networks. All of these calculations were implemented on a GPU. We also compressed the coefficient matrix to save space. We compared the performance of the GPU implementation with those of multi-core CPU implementations with 16 CPU threads, a single-thread C/C++ implementation and a single-thread R implementation. Our results show that the GPU implementation largely outperforms the single-thread C/C++ and single-thread R implementations, and that it outperforms the multi-core CPU implementation as the number of genes increases. With the test dataset containing 16,000 genes and 590 individuals, the GPU implementation runs more than 63 times faster than the single-thread R implementation when 50 percent of genes were filtered out and about 80 times faster when no genes were filtered out.
      

      
      Preliminary Study of Image Reconstruction Algorithm on a Digital Signal Processor
      DTIC Science & Technology
      
         2014-03-01
         5.2 Comparison of CPU-GPU, CPU-FPGA, and CPU-DSP Designs The work for implementing VHDL description of the back-projection algorithm on a physical...FPGA was not complete. Hence, the DSP implementation results are compared with the simulated results for the VHDL design. Simulating VHDL provides an...rather than at the software level. Depending on an application’s characteristics, FPGA implementations can provide a significant performance
      

        
       
          


          
        

           
           
             
               
      
      Molecular dynamics simulations through GPU video games technologies
      PubMed Central
      Loukatou, Styliani; Papageorgiou, Louis; Fakourelis, Paraskevas; Filntisi, Arianna; Polychronidou, Eleftheria; Bassis, Ioannis; Megalooikonomou, Vasileios; Makałowski, Wojciech; Vlachakis, Dimitrios; Kossida, Sophia
         2016-01-01
         Bioinformatics is the scientific field that focuses on the application of computer technology to the management of biological information. Over the years, bioinformatics applications have been used to store, process and integrate biological and genetic information, using a wide range of methodologies. One of the key techniques used to understand the physical movements of atoms and molecules is molecular dynamics (MD). MD is an in silico method to simulate the physical motions of atoms and molecules under certain conditions. It has become a strategic technique and now plays a key role in many areas of the exact sciences, such as chemistry, biology, physics and medicine. Due to their complexity, MD calculations can require enormous amounts of computer memory and time, and their execution has therefore been a major problem. Despite the huge computational cost, molecular dynamics has been implemented on traditional computers with a central processing unit (CPU). Graphics processing unit (GPU) computing technology was first designed to improve video games by rapidly creating and displaying images in a frame buffer for output to devices such as screens. The hybrid GPU-CPU implementation, combined with parallel computing, is a novel technology for performing a wide range of calculations. GPUs have been proposed and used to accelerate many scientific computations, including MD simulations. Herein, we describe the new methodologies developed initially for video games and how they are now applied in MD simulations. PMID:27525251
      

      
      OpenACC acceleration of an unstructured CFD solver based on a reconstructed discontinuous Galerkin method for compressible flows
      DOE PAGES
      Xia, Yidong; Lou, Jialin; Luo, Hong; ...
         2015-02-09
         Here, an OpenACC directive-based graphics processing unit (GPU) parallel scheme is presented for solving the compressible Navier–Stokes equations on 3D hybrid unstructured grids with a third-order reconstructed discontinuous Galerkin method. The developed scheme requires the minimum code intrusion and algorithm alteration for upgrading a legacy solver with the GPU computing capability at very little extra effort in programming, which leads to a unified and portable code development strategy. A face coloring algorithm is adopted to eliminate the memory contention because of the threading of internal and boundary face integrals. A number of flow problems are presented to verify the implementation of the developed scheme. Timing measurements were obtained by running the resulting GPU code on one Nvidia Tesla K20c GPU card (Nvidia Corporation, Santa Clara, CA, USA) and compared with those obtained by running the equivalent Message Passing Interface (MPI) parallel CPU code on a compute node (consisting of two AMD Opteron 6128 eight-core CPUs (Advanced Micro Devices, Inc., Sunnyvale, CA, USA)). Speedup factors of up to 24× and 1.6× for the GPU code were achieved with respect to one and 16 CPU cores, respectively. The numerical results indicate that this OpenACC-based parallel scheme is an effective and extensible approach to port unstructured high-order CFD solvers to GPU computing.
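The two speedup figures quoted above can be related by simple arithmetic. The sketch below assumes that both "up to" factors refer to the same test case, which the abstract does not state, so the result is only indicative; the variable and function names are illustrative.

```python
def parallel_efficiency(speedup, cores):
    """Fraction of ideal linear speedup achieved on `cores` cores."""
    return speedup / cores

# Figures quoted in the abstract (GPU time relative to CPU time).
gpu_vs_1_core = 24.0
gpu_vs_16_cores = 1.6

# Assumption: same test case for both figures, so the implied speedup of
# the 16-core MPI run over the one-core run is the ratio of the two.
implied_mpi_speedup = gpu_vs_1_core / gpu_vs_16_cores      # about 15
mpi_efficiency = parallel_efficiency(implied_mpi_speedup, 16)

print(f"implied 16-core MPI speedup: {implied_mpi_speedup:.1f}x")
print(f"implied MPI parallel efficiency: {mpi_efficiency:.0%}")
```

Under that assumption the CPU baseline already scales well across the node, so the 1.6× figure is the more representative measure of what one GPU adds over a fully used 16-core node.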
      

      
      Source parameter inversion of compound earthquakes on GPU/CPU hybrid platform
      NASA Astrophysics Data System (ADS)
      Wang, Y.; Ni, S.; Chen, W.
         2012-12-01
         Determining earthquake source parameters is an essential problem in seismology. Accurate and timely determination of earthquake parameters (such as moment, depth, and the strike, dip and rake of fault planes) is significant for both rupture dynamics and ground motion prediction or simulation. The study of the rupture process, especially for moderate and large earthquakes, is also essential, as more detailed kinematic studies have become routine work for seismologists. However, some of these events behave very unusually and intrigue seismologists. These earthquakes usually consist of two sub-events of similar size that occur within a very short time interval, such as the mb4.5 event of Dec. 9, 2003 in Virginia. Studying these special events, including determining the source parameters of each sub-event, will help in understanding earthquake dynamics. However, the seismic signals of the two distinct sources are mixed, which makes inversion difficult. For common events, the Cut and Paste (CAP) method, which jointly uses body waves and surface waves with independent time shifts and weights, has proven effective for resolving source parameters. CAP can resolve fault orientation and focal depth using a grid search algorithm. Based on this method, we developed an algorithm (MUL_CAP) to simultaneously acquire the parameters of two distinct events. However, the simultaneous inversion of both sub-events makes the computation very time-consuming, so we developed a hybrid GPU and CPU version of CAP (HYBRID_CAP) to improve computational efficiency. Thanks to the GPU's advantages in multi-dimensional storage and processing, the revised code performs very well on the combined GPU-CPU architecture, with speedup factors as high as 40x-90x compared with classical CAP on a traditional CPU architecture. As a benchmark, we take synthetics as observations and invert for the source parameters of two given sub-events; the inversion results are very consistent with the true parameters. For the event in Virginia, USA on 9 Dec. 2003, we re-invert the source parameters, and detailed analysis of regional waveforms indicates that the Virginia earthquake included two sub-events of Mw4.05 and Mw4.25 at the same depth of 10 km with a focal mechanism of strike65/dip32/rake135, consistent with previous studies. Moreover, compared with the traditional two-source model method, MUL_CAP is more automatic, with no need for human intervention.
      

      
      Acceleration of fluoro-CT reconstruction for a mobile C-Arm on GPU and FPGA hardware: a simulation study
      NASA Astrophysics Data System (ADS)
      Xue, Xinwei; Cheryauka, Arvi; Tubbs, David
         2006-03-01
         CT imaging in interventional and minimally-invasive surgery requires high-performance computing solutions that meet operating room demands, healthcare business requirements, and the constraints of a mobile C-arm system. The computational requirements of clinical procedures using CT-like data are increasing rapidly, mainly due to the need for rapid access to medical imagery during critical surgical procedures. The highly parallel nature of Radon transform and CT algorithms enables embedded computing solutions utilizing a parallel processing architecture to realize a significant gain of computational intensity with comparable hardware and program coding/testing expenses. In this paper, using a sample 2D and 3D CT problem, we explore the programming challenges and the potential benefits of embedded computing using commodity hardware components. The accuracy and performance results obtained on three computational platforms (a single CPU, a single GPU, and a solution based on FPGA technology) have been analyzed. We have shown that hardware-accelerated CT image reconstruction can be achieved with similar levels of noise and clarity of feature when compared to program execution on a CPU, while running one or more orders of magnitude faster. 3D cone-beam or helical CT reconstruction and a variety of volumetric image processing applications will benefit from similar accelerations.
      

      
      Many-integrated core (MIC) technology for accelerating Monte Carlo simulation of radiation transport: A study based on the code DPM
      NASA Astrophysics Data System (ADS)
      Rodriguez, M.; Brualla, L.
         2018-04-01
         Monte Carlo simulation of radiation transport is computationally demanding when reasonably low statistical uncertainties of the estimated quantities are required. Therefore, it can benefit to a large extent from high-performance computing. This work is aimed at assessing the performance of the first generation of the many-integrated core architecture (MIC) Xeon Phi coprocessor with respect to that of a CPU consisting of a double 12-core Xeon processor in Monte Carlo simulation of coupled electron-photon showers. The comparison was twofold. First, it was made through a suite of basic tests including parallel versions of the random number generators Mersenne Twister and a modified implementation of RANECU. These tests were intended to establish a baseline comparison between both devices. Second, it was made through the pDPM code developed in this work. pDPM is a parallel version of the Dose Planning Method (DPM) program for fast Monte Carlo simulation of radiation transport in voxelized geometries. A variety of techniques aimed at obtaining high scalability on the Xeon Phi were implemented in pDPM. Maximum scalabilities of 84.2× and 107.5× were obtained on the Xeon Phi for simulations of electron and photon beams, respectively. Nevertheless, in none of the tests involving radiation transport did the Xeon Phi perform better than the CPU. The disadvantage of the Xeon Phi with respect to the CPU is due to the low performance of its single core. A single core of the Xeon Phi was more than 10 times less efficient than a single core of the CPU for all radiation transport simulations.
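A rough back-of-the-envelope reading of these figures shows why high scalability did not translate into a win. The sketch below makes two assumptions that are not in the abstract: that "efficiency" means per-core throughput, and that the single-core penalty is exactly 10 (the abstract only says "more than 10", so the results are upper bounds for the Xeon Phi). All names are illustrative.

```python
def cpu_core_equivalents(scalability, single_core_slowdown):
    """Coprocessor throughput expressed in units of one CPU core."""
    return scalability / single_core_slowdown

# Figures from the abstract.
electron_scalability = 84.2
photon_scalability = 107.5
cpu_cores = 2 * 12             # "double 12-core Xeon processor"

# Assumption: slowdown taken as exactly 10, its stated lower limit.
slowdown = 10.0

electron_equiv = cpu_core_equivalents(electron_scalability, slowdown)
photon_equiv = cpu_core_equivalents(photon_scalability, slowdown)

# Parallel efficiency the 24-core CPU would need just to match the
# coprocessor in the more favorable (photon) case.
break_even_efficiency = photon_equiv / cpu_cores

print(f"electron beams: at most {electron_equiv:.2f} CPU-core equivalents")
print(f"photon beams:   at most {photon_equiv:.2f} CPU-core equivalents")
print(f"CPU break-even parallel efficiency: {break_even_efficiency:.0%}")
```

Under these assumptions the coprocessor is worth at most about 11 CPU cores, so a 24-core CPU running at even moderate parallel efficiency stays ahead, which is consistent with the reported outcome.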
      

      
      Research on fast Fourier transforms algorithm of huge remote sensing image technology with GPU and partitioning technology.
      PubMed
      Yang, Xue; Li, Xue-You; Li, Jia-Guo; Ma, Jun; Zhang, Li; Yang, Jan; Du, Quan-Ye
         2014-02-01
         The fast Fourier transform (FFT) is a basic approach to remote sensing image processing. As the capacity to capture remote sensing images improves, with hyperspectral, high-spatial-resolution and high-temporal-resolution features, how to use FFT technology to process huge remote sensing images efficiently has become a critical step and a research hot spot in current image processing technology. The FFT algorithm, one of the basic algorithms of image processing, can be used for stripe noise removal, image compression, image registration, etc. in processing remote sensing images. The CUFFT function library is an FFT algorithm library based on the GPU, while FFTW is an FFT algorithm library developed for the CPU on the PC platform and is currently the fastest CPU-based FFT function library. However, both share a common problem: once the available memory is smaller than the image, an out-of-memory error or memory overflow occurs when either of these two methods is used to compute the image FFT. To address this problem, a GPU and partitioning technology based Huge Remote Fast Fourier Transform (HRFFT) algorithm is proposed in this paper. By improving the FFT algorithm in the CUFFT function library, the out-of-memory and memory overflow problem is solved. Moreover, the method is validated by experiments with CCD images from the HJ-1A satellite. When applied to practical image processing, it improves the processing results and speeds up the processing, which saves computation time and achieves sound results.
      

      
      Parallel Scaling Characteristics of Selected NERSC User Project Codes
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Skinner, David; Verdier, Francesca; Anand, Harsh
         
         This report documents parallel scaling characteristics of NERSC user project codes between Fiscal Year 2003 and the first half of Fiscal Year 2004 (Oct 2002-March 2004). The codes analyzed cover 60% of all the CPU hours delivered during that time frame on seaborg, a 6080 CPU IBM SP and the largest parallel computer at NERSC. The scale in terms of concurrency and problem size of the workload is analyzed. Drawing on batch queue logs, performance data and feedback from researchers we detail the motivations, benefits, and challenges of implementing highly parallel scientific codes on current NERSC High Performance Computing systems. An evaluation and outlook of the NERSC workload for Allocation Year 2005 is presented.
      

      
      Event- and Time-Driven Techniques Using Parallel CPU-GPU Co-processing for Spiking Neural Networks
      PubMed Central
      Naveros, Francisco; Garrido, Jesus A.; Carrillo, Richard R.; Ros, Eduardo; Luque, Niceto R.
         2017-01-01
         Modeling and simulating the neural structures which make up our central neural system is instrumental for deciphering the computational neural cues beneath. Higher levels of biological plausibility usually impose higher levels of complexity in mathematical modeling, from neural to behavioral levels. This paper focuses on overcoming the simulation problems (accuracy and performance) derived from using higher levels of mathematical complexity at a neural level. This study proposes different techniques for simulating neural models that hold incremental levels of mathematical complexity: leaky integrate-and-fire (LIF), adaptive exponential integrate-and-fire (AdEx), and Hodgkin-Huxley (HH) neural models (ranged from low to high neural complexity). The studied techniques are classified into two main families depending on how the neural-model dynamic evaluation is computed: the event-driven or the time-driven families. Whilst event-driven techniques pre-compile and store the neural dynamics within look-up tables, time-driven techniques compute the neural dynamics iteratively during the simulation time. We propose two modifications for the event-driven family: a look-up table recombination to better cope with the incremental neural complexity together with a better handling of the synchronous input activity. Regarding the time-driven family, we propose a modification in computing the neural dynamics: the bi-fixed-step integration method. This method automatically adjusts the simulation step size to better cope with the stiffness of the neural model dynamics running in CPU platforms. One version of this method is also implemented for hybrid CPU-GPU platforms. Finally, we analyze how the performance and accuracy of these modifications evolve with increasing levels of neural complexity. 
We also demonstrate how the proposed modifications which constitute the main contribution of this study systematically outperform the traditional event- and time-driven techniques under increasing levels of neural complexity. PMID:28223930
      

      
      GPU-Based Real-Time Volumetric Ultrasound Image Reconstruction for a Ring Array
      PubMed Central
      Choe, Jung Woo; Nikoozadeh, Amin; Oralkan, Ömer; Khuri-Yakub, Butrus T.
         2014-01-01
         Synthetic phased array (SPA) beamforming with Hadamard coding and aperture weighting is an optimal option for real-time volumetric imaging with a ring array, a particularly attractive geometry in intracardiac and intravascular applications. However, the imaging frame rate of this method is limited by the immense computational load required in synthetic beamforming. For fast imaging with a ring array, we developed graphics processing unit (GPU)-based, real-time image reconstruction software that exploits massive data-level parallelism in beamforming operations. The GPU-based software reconstructs and displays three cross-sectional images at 45 frames per second (fps). This frame rate is 4.5 times higher than that for our previously-developed multi-core CPU-based software. In an alternative imaging mode, it shows one B-mode image rotating about the axis and its maximum intensity projection (MIP), processed at a rate of 104 fps. This paper describes the image reconstruction procedure on the GPU platform and presents the experimental images obtained using this software. PMID:23529080
      

      
      An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU
      DOE PAGES
      Lyakh, Dmitry I.
         2015-01-05
         An efficient parallel tensor transpose algorithm is suggested for shared-memory computing units, namely, multicore CPU, Intel Xeon Phi, and NVidia GPU. The algorithm operates on dense tensors (multidimensional arrays) and is based on the optimization of cache utilization on x86 CPU and the use of shared memory on NVidia GPU. From the applied side, the ultimate goal is to minimize the overhead encountered in the transformation of tensor contractions into matrix multiplications in computer implementations of advanced methods of quantum many-body theory (e.g., in electronic structure theory and nuclear physics). Particular emphasis is placed on higher-dimensional tensors that typically appear in the so-called multireference correlated methods of electronic structure theory. Depending on tensor dimensionality, the presented optimized algorithms can achieve an order of magnitude speedup on x86 CPUs and 2-3 times speedup on NVidia Tesla K20X GPU with respect to the naive scattering algorithm (no memory access optimization). Furthermore, the tensor transpose routines developed in this work have been incorporated into a general-purpose tensor algebra library (TAL-SH).
      

      
      A fast summation method for oscillatory lattice sums
      NASA Astrophysics Data System (ADS)
      Denlinger, Ryan; Gimbutas, Zydrunas; Greengard, Leslie; Rokhlin, Vladimir
         2017-02-01
         We present a fast summation method for lattice sums of the type which arise when solving wave scattering problems with periodic boundary conditions. While there are a variety of effective algorithms in the literature for such calculations, the approach presented here is new and leads to a rigorous analysis of Wood's anomalies. These arise when illuminating a grating at specific combinations of the angle of incidence and the frequency of the wave, for which the lattice sums diverge. They were discovered by Wood in 1902 as singularities in the spectral response. The primary tools in our approach are the Euler-Maclaurin formula and a steepest descent argument. The resulting algorithm has super-algebraic convergence and requires only milliseconds of CPU time.
      

      
      Study and Optimization of Helicopter Subfloor Energy Absorption Structure with Foldcore Sandwich Structures
      NASA Astrophysics Data System (ADS)
      HuaZhi, Zhou; ZhiJin, Wang
         2017-11-01
         The intersection element is an important part of the helicopter subfloor structure. In order to improve its crashworthiness, the floor and the skin of the intersection element are replaced with foldcore sandwich structures. Foldcore is a kind of high-energy-absorption structure. Compared with the original structure, the new intersection element shows better buffering capacity and energy-absorption capacity. To reduce the structure's mass while still satisfying the crashworthiness requirements, the geometric parameters of the intersection element are optimized using NSGA-II and Anisotropic Kriging. A significant saving in CPU time is obtained by replacing the numerical model with an Anisotropic Kriging surrogate model. The optimization yields a 17.15% reduction in the mass of the intersection element.
      

      
      Fast calculation of the line-spread-function by transversal directions decoupling
      NASA Astrophysics Data System (ADS)
      Parravicini, Jacopo; Tartara, Luca; Hasani, Elton; Tomaselli, Alessandra
         2016-07-01
         We propose a simplified method to calculate the optical spread function of a paradigmatic system constituted by a pupil-lens with a line-shaped illumination (‘line-spread-function’). Our approach is based on decoupling the two transversal directions of the beam and treating the propagation by means of the Fourier optics formalism. This requires simpler calculations with respect to the more usual Bessel-function-based method. The model is discussed and compared with standard calculation methods by carrying out computer simulations. The proposed approach is found to be much faster than the Bessel-function-based one (CPU time ≲ 5% of the standard method), while the results of the two methods present a very good mutual agreement.
      

      
      Complexity transitions in global algorithms for sparse linear systems over finite fields
      NASA Astrophysics Data System (ADS)
      Braunstein, A.; Leone, M.; Ricci-Tersenghi, F.; Zecchina, R.
         2002-09-01
         We study the computational complexity of a very basic problem, namely that of finding solutions to a very large set of random linear equations in a finite Galois field modulo q. Using tools from statistical mechanics we are able to identify phase transitions in the structure of the solution space and to connect them to the changes in the performance of a global algorithm, namely Gaussian elimination. Crossing phase boundaries produces a dramatic increase in memory and CPU requirements necessary for the algorithms. In turn, this causes the saturation of the upper bounds for the running time. We illustrate the results on the specific problem of integer factorization, which is of central interest for deciphering messages encrypted with the RSA cryptosystem.
      

      
      Autonomous Modal Identification of the Space Shuttle Tail Rudder
      NASA Technical Reports Server (NTRS)
      Pappa, Richard S.; James, George H., III; Zimmerman, David C.
         1997-01-01
         Autonomous modal identification automates the calculation of natural vibration frequencies, damping, and mode shapes of a structure from experimental data. This technology complements damage detection techniques that use continuous or periodic monitoring of vibration characteristics. The approach shown in the paper incorporates the Eigensystem Realization Algorithm (ERA) as a data analysis engine and an autonomous supervisor to condense multiple estimates of modal parameters using ERA's Consistent-Mode Indicator and correlation of mode shapes. The procedure was applied to free-decay responses of a Space Shuttle tail rudder and successfully identified the seven modes of the structure below 250 Hz. The final modal parameters are a condensed set of results for 87 individual ERA cases requiring approximately five minutes of CPU time on a DEC Alpha computer.
      

      
      GPU accelerated FDTD solver and its application in MRI.
      PubMed
      Chi, J; Liu, F; Jin, J; Mason, D G; Crozier, S
         2010-01-01
         The finite difference time domain (FDTD) method is a popular technique for computational electromagnetics (CEM). The large computational power often required, however, has been a limiting factor for its applications. In this paper, we will present a graphics processing unit (GPU)-based parallel FDTD solver and its successful application to the investigation of a novel B1 shimming scheme for high-field magnetic resonance imaging (MRI). The optimized shimming scheme exhibits considerably improved transmit B(1) profiles. The GPU implementation dramatically shortened the runtime of FDTD simulation of electromagnetic field compared with its CPU counterpart. The acceleration in runtime has made such investigation possible, and will pave the way for other studies of large-scale computational electromagnetic problems in modern MRI which were previously impractical.
      

      
      High performance GPU processing for inversion using uniform grid searches
      NASA Astrophysics Data System (ADS)
      Venetis, Ioannis E.; Saltogianni, Vasso; Stiros, Stathis; Gallopoulos, Efstratios
         2017-04-01
         Many geophysical problems are described by redundant, highly non-linear systems of ordinary equations with constant terms deriving from measurements and hence representing stochastic variables. Solution (inversion) of such problems is based on numerical optimization methods, on Monte Carlo sampling, or on exhaustive searches in cases of two or even three "free" unknown variables. Recently the TOPological INVersion (TOPINV) algorithm, a grid search-based technique in the R^n space, has been proposed. TOPINV is not based on the minimization of a certain cost function and involves only forward computations, hence avoiding computational errors. The basic concept is to transform observation equations into inequalities on the basis of an optimization parameter k and of their standard errors, and through repeated "scans" of n-dimensional search grids for decreasing values of k to identify the optimal clusters of gridpoints which satisfy observation inequalities and by definition contain the "true" solution. Stochastic optimal solutions and their variance-covariance matrices are then computed as first and second statistical moments. Such exhaustive uniform searches produce an excessive computational load and are extremely time-consuming on common CPU-based computers. An alternative is to use a GPU-based computing platform, which is nowadays affordable to the research community and provides much higher computing performance. Implementing TOPINV in the CUDA programming language allows the attained speedup in execution time on such a high-performance platform to be investigated. Based on synthetic data we compared the execution time required for two typical geophysical problems, modeling magma sources and seismic faults, described with up to 18 unknown variables, on both CPU/FORTRAN and GPU/CUDA platforms. The same problems were solved on both platforms for several different sizes of search grids (up to 10^12 gridpoints) and numbers of unknown variables, and execution time as a function of the grid dimension was recorded for each problem. Results indicate an average speedup in calculations by a factor of 100 on the GPU platform; for example, problems with 10^12 gridpoints require less than two hours instead of several days on conventional desktop computers. Such a speedup encourages the application of TOPINV on high-performance platforms, such as a GPU, in cases where nearly real-time decisions are necessary, for example finite fault modeling to identify possible tsunami sources.
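The scan described in this abstract (turn each observation equation into an inequality scaled by k and its standard error, keep the gridpoints that satisfy all of them, repeat for smaller k, then take statistical moments) can be sketched on a toy problem. This is a minimal serial illustration, not the TOPINV code: the forward model, grid, standard errors and sequence of k values are made up, and only per-variable variances are computed rather than a full variance-covariance matrix.

```python
import itertools

def scan_grid(axes, forward, observations, sigmas, k):
    """Return gridpoints x with |forward(x)[i] - obs[i]| <= k * sigma[i] for all i."""
    accepted = []
    for point in itertools.product(*axes):
        predicted = forward(point)
        if all(abs(p - o) <= k * s
               for p, o, s in zip(predicted, observations, sigmas)):
            accepted.append(point)
    return accepted

def mean_and_variance(points):
    """First moment and per-variable second central moment of a cluster."""
    n = len(points)
    dims = len(points[0])
    mean = [sum(p[d] for p in points) / n for d in range(dims)]
    var = [sum((p[d] - mean[d]) ** 2 for p in points) / n for d in range(dims)]
    return mean, var

# Hypothetical forward model: two unknowns, three redundant observations.
def forward(x):
    a, b = x
    return (a + b, a - b, a * b)

true_point = (3.0, 2.0)
observations = forward(true_point)        # noise-free synthetic data
sigmas = (0.5, 0.5, 0.5)                  # assumed standard errors
axis = [i * 0.25 for i in range(21)]      # 0.0, 0.25, ..., 5.0
axes = [axis, axis]

clusters = {}
for k in (4.0, 2.0, 1.0):                 # decreasing k tightens the inequalities
    accepted = scan_grid(axes, forward, observations, sigmas, k)
    if not accepted:
        break                             # keep the last non-empty cluster
    clusters[k] = accepted

best_k = min(clusters)
solution, variance = mean_and_variance(clusters[best_k])
```

Each gridpoint is tested independently of all others, which is the property that lets one GPU thread handle one gridpoint (or a block of them) in a CUDA implementation.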
      

      
      A Fast Code for Jupiter Atmospheric Entry Analysis
      NASA Technical Reports Server (NTRS)
      Yauber, Michael E.; Wercinski, Paul; Yang, Lily; Chen, Yih-Kanq
         1999-01-01
         A fast code was developed to calculate the forebody heating environment and heat shielding that is required for Jupiter atmospheric entry probes. A carbon phenolic heat shield material was assumed and, since computational efficiency was a major goal, analytic expressions were used, primarily, to calculate the heating, ablation and the required insulation. The code was verified by comparison with flight measurements from the Galileo probe's entry. The calculation required 3.5 sec of CPU time on a work station, or three to four orders of magnitude less than for previous Jovian entry heat shields. The computed surface recessions from ablation were compared with the flight values at six body stations. The average, absolute, predicted difference in the recession was 13.7% too high. The forebody's mass loss was overpredicted by 5.3% and the heat shield mass was calculated to be 15% less than the probe's actual heat shield. However, the calculated heat shield mass did not include contingencies for the various uncertainties that must be considered in the design of probes. Therefore, the agreement with the Galileo probe's values was satisfactory in view of the code's fast running time and the methods' approximations.
      

      
      Application of high-performance computing to numerical simulation of human movement
      NASA Technical Reports Server (NTRS)
      Anderson, F. C.; Ziegler, J. M.; Pandy, M. G.; Whalen, R. T.
         1995-01-01
         We have examined the feasibility of using massively-parallel and vector-processing supercomputers to solve large-scale optimization problems for human movement. Specifically, we compared the computational expense of determining the optimal controls for the single support phase of gait using a conventional serial machine (SGI Iris 4D25), a MIMD parallel machine (Intel iPSC/860), and a parallel-vector-processing machine (Cray Y-MP 8/864). With the human body modeled as a 14 degree-of-freedom linkage actuated by 46 musculotendinous units, computation of the optimal controls for gait could take up to 3 months of CPU time on the Iris. Both the Cray and the Intel are able to reduce this time to practical levels. The optimal solution for gait can be found with about 77 hours of CPU on the Cray and with about 88 hours of CPU on the Intel. Although the overall speeds of the Cray and the Intel were found to be similar, the unique capabilities of each machine are better suited to different portions of the computational algorithm used. The Intel was best suited to computing the derivatives of the performance criterion and the constraints whereas the Cray was best suited to parameter optimization of the controls. These results suggest that the ideal computer architecture for solving very large-scale optimal control problems is a hybrid system in which a vector-processing machine is integrated into the communication network of a MIMD parallel machine.
      

      
      Disk-based k-mer counting on a PC
      PubMed Central
      
         2013-01-01
         Background: The k-mer counting problem, which is to build the histogram of occurrences of every k-symbol long substring in a given text, is important for many bioinformatics applications. They include developing de Bruijn graph genome assemblers, fast multiple sequence alignment and repeat detection. Results: We propose a simple, yet efficient, parallel disk-based algorithm for counting k-mers. Experiments show that it usually offers the fastest solution to the considered problem, while demanding a relatively small amount of memory. In particular, it is capable of counting the statistics for short-read human genome data, from a gzipped FASTQ input file, in less than 40 minutes on a PC with 16 GB of RAM and 6 CPU cores, and for long-read human genome data in less than 70 minutes. On a more powerful machine, using 32 GB of RAM and 32 CPU cores, the tasks are accomplished in less than half the time. For most tested settings of this problem and mammalian-size data, no other algorithm can accomplish this task in comparable time. Our solution is also among the memory-frugal ones; most competitive algorithms cannot work efficiently on a PC with 16 GB of memory for such massive data. Conclusions: By making use of cheap disk space and exploiting CPU and I/O parallelism we propose a very competitive k-mer counting procedure, called KMC. Our results suggest that judicious resource management may make it possible to solve at least some bioinformatics problems with massive data on a commodity personal computer. PMID:23679007
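The k-mer counting problem as defined in this abstract can be stated in a few lines. The sketch below is an in-memory illustration of the problem only; it is not KMC and shows none of its disk-based or parallel machinery. The function name and the example string are illustrative.

```python
from collections import Counter

def count_kmers(text, k):
    """Histogram of every k-symbol substring of `text`."""
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))

# A 10-symbol toy sequence has 10 - 3 + 1 = 8 overlapping 3-mers.
counts = count_kmers("ACGTACGTAC", 3)
for kmer, n in sorted(counts.items()):
    print(kmer, n)
```

For genome-scale input the table of distinct k-mers no longer fits comfortably in RAM, which is the constraint the disk-based approach in the abstract is designed around.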
      

        
       
          


          
        

           
           
             
               
      
      Effect of Fiber Orientation on Dynamic Compressive Properties of an Ultra-High Performance Concrete
      DTIC Science & Technology
      
         2017-08-01
         measurements for LSFfiberOrient function for multiple cores. Elapsed time is the total time taken to run ; CPU time is the number of cores times the...Superscripts  Maximum value during a test  Measured value from a calibration run ...movement left or right. Before cutting, the Cor-Tuf Baseline beam was placed on the table and squared with the blade . The blade was then moved into
      

      
      Rt-Space: A Real-Time Stochastically-Provisioned Adaptive Container Environment
      DTIC Science & Technology
      
         2017-08-04
         This project was directed at component-based soft real-time (SRT) systems implemented on multicore platforms. To facilitate...upon average-case or near-average-case task execution times. The main intellectual contribution of this project was the development of methods for...allocating CPU time to components and associated analysis for validating SRT correctness.
      

      
      Acceleration for 2D time-domain elastic full waveform inversion using a single GPU card
      NASA Astrophysics Data System (ADS)
      Jiang, Jinpeng; Zhu, Peimin
         2018-05-01
         Full waveform inversion (FWI) is a challenging procedure due to the high computational cost related to the modeling, especially for the elastic case. The graphics processing unit (GPU) has become a popular device for the high-performance computing (HPC). To reduce the long computation time, we design and implement the GPU-based 2D elastic FWI (EFWI) in time domain using a single GPU card. We parallelize the forward modeling and gradient calculations using the CUDA programming language. To overcome the limitation of relatively small global memory on GPU, the boundary saving strategy is exploited to reconstruct the forward wavefield. Moreover, the L-BFGS optimization method used in the inversion increases the convergence of the misfit function. A multiscale inversion strategy is performed in the workflow to obtain the accurate inversion results. In our tests, the GPU-based implementations using a single GPU device achieve >15 times speedup in forward modeling, and about 12 times speedup in gradient calculation, compared with the eight-core CPU implementations optimized by OpenMP. The test results from the GPU implementations are verified to have enough accuracy by comparing the results obtained from the CPU implementations.
      

      
      MOIL-opt: Energy-Conserving Molecular Dynamics on a GPU/CPU system
      PubMed Central
      Ruymgaart, A. Peter; Cardenas, Alfredo E.; Elber, Ron
         2011-01-01
         We report an optimized version of the molecular dynamics program MOIL that runs on a shared memory system with OpenMP and exploits the power of a Graphics Processing Unit (GPU). The model is a heterogeneous computing system on a single node with several cores sharing the same memory and a GPU. This is a typical laboratory tool, which provides excellent performance at minimal cost. Besides performance, emphasis is placed on the accuracy and stability of the algorithm, probed by energy conservation for explicit-solvent atomically-detailed models. Especially for long simulations, energy conservation is critical due to the phenomenon known as “energy drift”, in which energy errors accumulate linearly as a function of simulation time. To achieve long-time dynamics with acceptable accuracy the drift must be particularly small. We identify several means of controlling long-time numerical accuracy while maintaining excellent speedup. To maintain a high level of energy conservation SHAKE and the Ewald reciprocal summation are run in double precision. Double precision summation of real-space non-bonded interactions improves energy conservation. In our best option, the energy drift using 1 fs for a time step while constraining the distances of all bonds is undetectable in a 10 ns simulation of solvated DHFR (dihydrofolate reductase). Faster options, shaking only bonds with hydrogen atoms, are also very well behaved and have drifts of less than 1 kcal/mol per nanosecond for the same system. CPU/GPU implementations require changes in programming models. We consider the use of a list of neighbors and quadratic versus linear interpolation in lookup tables of different sizes. Quadratic interpolation with a smaller number of grid points is faster than linear lookup tables (with finer representation) without loss of accuracy. Atomic neighbor lists were found most efficient. Typical speedups are about a factor of 10 compared to a single-core single-precision code. PMID:22328867
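
The difference between linear and quadratic table lookup mentioned in the abstract above can be sketched as follows. This is a generic illustration on a uniform grid, not MOIL's implementation; the function names, the uniform spacing, and the three-point Lagrange form are all assumptions made for this example.

```python
def build_table(f, lo, hi, n):
    """Sample f at n + 1 evenly spaced points on [lo, hi]."""
    h = (hi - lo) / n
    return [f(lo + i * h) for i in range(n + 1)], h


def linear_lookup(table, lo, h, x):
    """Interpolate linearly between the two grid points around x."""
    i = min(int((x - lo) / h), len(table) - 2)
    t = (x - lo) / h - i          # fractional position inside the cell
    return table[i] * (1.0 - t) + table[i + 1] * t


def quadratic_lookup(table, lo, h, x):
    """Three-point Lagrange interpolation around the nearest interior node."""
    i = min(max(round((x - lo) / h), 1), len(table) - 2)
    t = (x - lo) / h - i          # offset from the centre node, in grid units
    return (table[i - 1] * t * (t - 1.0) / 2.0
            + table[i] * (1.0 - t * t)
            + table[i + 1] * t * (t + 1.0) / 2.0)


if __name__ == "__main__":
    # Hypothetical smooth test function; a finer linear table is
    # compared with a coarser quadratic one at a single point.
    f = lambda r: 1.0 / r
    fine, h_fine = build_table(f, 1.0, 10.0, 256)
    coarse, h_coarse = build_table(f, 1.0, 10.0, 64)
    x = 3.7
    print(abs(linear_lookup(fine, 1.0, h_fine, x) - f(x)))
    print(abs(quadratic_lookup(coarse, 1.0, h_coarse, x) - f(x)))
```

The quadratic form reads three table entries instead of two but reproduces curvature, so a coarser (smaller) table may reach comparable accuracy; whether that is faster depends on the hardware's memory behavior, which is the trade-off the abstract reports on.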
      

      
      Read margin analysis of crossbar arrays using the cell-variability-aware simulation method
      NASA Astrophysics Data System (ADS)
      Sun, Wookyung; Choi, Sujin; Shin, Hyungsoon
         2018-02-01
         This paper proposes a new concept of read margin analysis of crossbar arrays using cell-variability-aware simulation. The size of the crossbar array should be considered to predict the read margin characteristic of the crossbar array because the read margin depends on the number of word lines and bit lines. However, an excessively high CPU time is required to simulate large arrays using a commercial circuit simulator. A variability-aware MATLAB simulator that considers independent variability sources is developed to analyze the characteristics of the read margin according to the array size. The developed MATLAB simulator provides an effective method for reducing the simulation time while maintaining the accuracy of the read margin estimation in the crossbar array. The simulation is also highly efficient in analyzing the characteristics of the crossbar memory array considering the statistical variations in the cell characteristics.
      

      
      Single-chip pulse programmer for magnetic resonance imaging using a 32-bit microcontroller.
      PubMed
      Handa, Shinya; Domalain, Thierry; Kose, Katsumi
         2007-08-01
         A magnetic resonance imaging (MRI) pulse programmer has been developed using a single-chip microcontroller (ADμC7026). The microcontroller includes all the components required for the MRI pulse programmer: a 32-bit RISC CPU core, 62 kbytes of flash memory, 8 kbytes of SRAM, two 32-bit timers, four 12-bit DA converters, and 40 bits of general purpose I/O. An evaluation board for the microcontroller was connected to a host personal computer (PC), an MRI transceiver, and a gradient driver using interface circuitry. Target (embedded) and host PC programs were developed to enable MRI pulse sequence generation by the microcontroller. The pulse programmer achieved a (nominal) time resolution of approximately 100 ns and a minimum time delay between successive events of approximately 9 μs. Imaging experiments using the pulse programmer demonstrated the effectiveness of our approach.
      

      
      Novel Hybrid Scheduling Technique for Sensor Nodes with Mixed Criticality Tasks.
      PubMed
      Micea, Mihai-Victor; Stangaciu, Cristina-Sorina; Stangaciu, Valentin; Curiac, Daniel-Ioan
         2017-06-26
         Sensor networks become increasingly a key technology for complex control applications. Their potential use in safety- and time-critical domains has raised the need for task scheduling mechanisms specially adapted to sensor node specific requirements, often materialized in predictable jitter-less execution of tasks characterized by different criticality levels. This paper offers an efficient scheduling solution, named Hybrid Hard Real-Time Scheduling (H²RTS), which combines a static, clock driven method with a dynamic, event driven scheduling technique, in order to provide high execution predictability, while keeping a high node Central Processing Unit (CPU) utilization factor. From the detailed, integrated schedulability analysis of the H²RTS, a set of sufficiency tests are introduced and demonstrated based on the processor demand and linear upper bound metrics. The performance and correct behavior of the proposed hybrid scheduling technique have been extensively evaluated and validated both on a simulator and on a sensor mote equipped with ARM7 microcontroller.
      

      
      OSCAR a Matlab based optical FFT code
      NASA Astrophysics Data System (ADS)
      Degallaix, Jérôme
         2010-05-01
         Optical simulation software packages are essential tools for designing and commissioning laser interferometers. This article aims to introduce OSCAR, a Matlab based FFT code, to the experimentalist community. OSCAR (Optical Simulation Containing Ansys Results) is used to simulate the steady state electric fields in optical cavities with realistic mirrors. The main advantage of OSCAR over other similar packages is the simplicity of its code, which requires only a short time to master. As a result, even for a beginner, it is relatively easy to modify OSCAR to suit other specific purposes. OSCAR includes an extensive manual and numerous detailed examples such as simulating thermal aberration, calculating cavity eigenmodes and diffraction loss, and simulating flat beam cavities and three-mirror ring cavities. An example is also provided of how to run OSCAR on the GPU of modern graphics cards instead of the CPU, making the simulation up to 20 times faster.
      

      
      Mercury BLASTP: Accelerating Protein Sequence Alignment
      PubMed Central
      Jacob, Arpith; Lancaster, Joseph; Buhler, Jeremy; Harris, Brandon; Chamberlain, Roger D.
         2008-01-01
         Large-scale protein sequence comparison is an important but compute-intensive task in molecular biology. BLASTP is the most popular tool for comparative analysis of protein sequences. In recent years, an exponential increase in the size of protein sequence databases has required either exponentially more running time or a cluster of machines to keep pace. To address this problem, we have designed and built a high-performance FPGA-accelerated version of BLASTP, Mercury BLASTP. In this paper, we describe the architecture of the portions of the application that are accelerated in the FPGA, and we also describe the integration of these FPGA-accelerated portions with the existing BLASTP software. We have implemented Mercury BLASTP on a commodity workstation with two Xilinx Virtex-II 6000 FPGAs. We show that the new design runs 11-15 times faster than software BLASTP on a modern CPU while delivering close to 99% identical results. PMID:19492068
      

      
      Single-chip pulse programmer for magnetic resonance imaging using a 32-bit microcontroller
      NASA Astrophysics Data System (ADS)
      Handa, Shinya; Domalain, Thierry; Kose, Katsumi
         2007-08-01
         A magnetic resonance imaging (MRI) pulse programmer has been developed using a single-chip microcontroller (ADμC7026). The microcontroller includes all the components required for the MRI pulse programmer: a 32-bit RISC CPU core, 62kbytes of flash memory, 8kbytes of SRAM, two 32-bit timers, four 12-bit DA converters, and 40bits of general purpose I/O. An evaluation board for the microcontroller was connected to a host personal computer (PC), an MRI transceiver, and a gradient driver using interface circuitry. Target (embedded) and host PC programs were developed to enable MRI pulse sequence generation by the microcontroller. The pulse programmer achieved a (nominal) time resolution of approximately 100ns and a minimum time delay between successive events of approximately 9μs. Imaging experiments using the pulse programmer demonstrated the effectiveness of our approach.
      

      
      A simplified method for elastic-plastic-creep structural analysis
      NASA Technical Reports Server (NTRS)
      Kaufman, A.
         1984-01-01
         A simplified inelastic analysis computer program (ANSYPM) was developed for predicting the stress-strain history at the critical location of a thermomechanically cycled structure from an elastic solution. The program uses an iterative and incremental procedure to estimate the plastic strains from the material stress-strain properties and a plasticity hardening model. Creep effects are calculated on the basis of stress relaxation at constant strain, creep at constant stress or a combination of stress relaxation and creep accumulation. The simplified method was exercised on a number of problems involving uniaxial and multiaxial loading, isothermal and nonisothermal conditions, dwell times at various points in the cycles, different materials and kinematic hardening. Good agreement was found between these analytical results and nonlinear finite element solutions for these problems. The simplified analysis program used less than 1 percent of the CPU time required for a nonlinear finite element analysis.
      

      
      A simplified method for elastic-plastic-creep structural analysis
      NASA Technical Reports Server (NTRS)
      Kaufman, A.
         1985-01-01
         A simplified inelastic analysis computer program (ANSYPM) was developed for predicting the stress-strain history at the critical location of a thermomechanically cycled structure from an elastic solution. The program uses an iterative and incremental procedure to estimate the plastic strains from the material stress-strain properties and a plasticity hardening model. Creep effects are calculated on the basis of stress relaxation at constant strain, creep at constant stress or a combination of stress relaxation and creep accumulation. The simplified method was exercised on a number of problems involving uniaxial and multiaxial loading, isothermal and nonisothermal conditions, dwell times at various points in the cycles, different materials and kinematic hardening. Good agreement was found between these analytical results and nonlinear finite element solutions for these problems. The simplified analysis program used less than 1 percent of the CPU time required for a nonlinear finite element analysis.
      

      
      GPU-Powered Coherent Beamforming
      NASA Astrophysics Data System (ADS)
      Magro, A.; Adami, K. Zarb; Hickish, J.
         2015-03-01
         Graphics processing unit (GPU)-based beamforming is a relatively unexplored area in radio astronomy, possibly due to the assumption that any such system will be severely limited by the PCIe bandwidth required to transfer data to the GPU. We have developed a CUDA-based GPU implementation of a coherent beamformer, specifically designed and optimized for deployment at the BEST-2 array, which can generate an arbitrary number of synthesized beams for a wide range of parameters. It achieves ~1.3 TFLOPs on an NVIDIA Tesla K20, approximately 10x faster than an optimized, multithreaded CPU implementation. This kernel has been integrated into two real-time, GPU-based time-domain software pipelines deployed at the BEST-2 array in Medicina: a standalone beamforming pipeline and a transient detection pipeline. We present performance benchmarks for the beamforming kernel and for the transient detection pipeline with beamforming capabilities, as well as results of test observations.
      

      
      New core-reflector boundary conditions for transient nodal reactor calculations
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Lee, E.K.; Kim, C.H.; Joo, H.K.
         1995-09-01
         New core-reflector boundary conditions designed for the exclusion of the reflector region in transient nodal reactor calculations are formulated. Spatially flat frequency approximations for the temporal neutron behavior and two types of transverse leakage approximations in the reflector region are introduced to solve the transverse-integrated time-dependent one-dimensional diffusion equation and then to obtain relationships between net current and flux at the core-reflector interfaces. To examine the effectiveness of the new core-reflector boundary conditions in transient nodal reactor computations, nodal expansion method (NEM) computations with and without explicit representation of the reflector are performed for the Laboratorium fuer Reaktorregelung und Anlagen (LRA) boiling water reactor (BWR) and Nuclear Energy Agency Committee on Reactor Physics (NEACRP) pressurized water reactor (PWR) rod ejection kinetics benchmark problems. Good agreement between the two NEM computations is demonstrated in all the important transient parameters of the two benchmark problems. A significant CPU time saving is also demonstrated with the boundary condition model with transverse leakage (BCMTL) approximations in the reflector region. In the three-dimensional LRA BWR, the BCMTL and the explicit reflector model computations differ by approximately 4% in transient peak power density, while the BCMTL yields a CPU time saving of >40% by excluding both the axial and the radial reflector regions from explicit computational nodes. In the NEACRP PWR problem, which includes six different transient cases, the largest difference is 24.4% in the transient maximum power in the one-node-per-assembly B1 transient results. This difference in the transient maximum power of the B1 case is shown to reduce to 11.7% in the four-node-per-assembly computations. As for the computing time, BCMTL is shown to reduce the CPU time by >20% in all six transient cases of the NEACRP PWR.
      

      
      Ground Shock Effects from Accidental Explosions
      DTIC Science & Technology
      
         1976-11-01
         1,200 P0 A = V P cp 8 Horizontal $D_h = D_v \tan[\sin^{-1}(c_p/U)]$ $V_h = V_v \tan[\sin^{-1}(c_p/U)]$ \\ - \\ tanfainŕ (cp/U)] For tan sin (c /U...explosive are not included in the present analysis. This effect will limit the credibility of the direct-induced ground shock predictions, but if the...analysis. Dr. D. R. Richmond of Lovelace Foundation provided data on human shock tolerances. 26 REFERENCES 1. "Structures to Resist the Effects of
      

      
      WinHPC System Policies | High-Performance Computing | NREL
      Science.gov Websites
      
         
         requiring high CPU utilization or large amounts of memory should be run on the worker nodes. WinHPC02 is not associated data are removed when NREL worker status is discontinued. Users should make arrangements to save other users. Licenses are returned to the license pool when other users close the application or after
      

      
      Leap Frog and Time Step Sub-Cycle Scheme for Coupled Neutronics and Thermal-Hydraulic Codes
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Lu, S.
         2002-07-01
         As a result of advancing TCP/IP based inter-process communication technology, more and more legacy thermal-hydraulic codes have been coupled with neutronics codes to provide best-estimate capabilities for reactivity related reactor transient analysis. Most of the coupling schemes are based on closely coupled serial or parallel approaches. Therefore, the execution of the coupled codes usually requires significant CPU time when a complicated system is analyzed. The Leap Frog scheme has been used to reduce the run time. The extent of the decoupling is usually determined based on a trial and error process for a specific analysis. It is the intent of this paper to develop a set of general criteria, which can be used to invoke the automatic Leap Frog algorithm. The algorithm will not only provide the run time reduction but also preserve the accuracy. The criteria will also serve as the basis of an automatic time step sub-cycle scheme when a sudden reactivity change is introduced and the thermal-hydraulic code is marching with a relatively large time step. (authors)
      

      
      Dust Dynamics in Protoplanetary Disks: Parallel Computing with PVM
      NASA Astrophysics Data System (ADS)
      de La Fuente Marcos, Carlos; Barge, Pierre; de La Fuente Marcos, Raúl
         2002-03-01
         We describe a parallel version of our high-order-accuracy particle-mesh code for the simulation of collisionless protoplanetary disks. We use this code to carry out a massively parallel, two-dimensional, time-dependent, numerical simulation, which includes dust particles, to study the potential role of large-scale, gaseous vortices in protoplanetary disks. This noncollisional problem is easy to parallelize on message-passing multicomputer architectures. We performed the simulations on a cache-coherent nonuniform memory access Origin 2000 machine, using both the parallel virtual machine (PVM) and message-passing interface (MPI) message-passing libraries. Our performance analysis suggests that, for our problem, PVM is about 25% faster than MPI. Using PVM and MPI made it possible to reduce CPU time and increase code performance. This allows for simulations with a large number of particles (N ~ 10^5-10^6) in reasonable CPU times. The performance of our implementation of the parallel code on an Origin 2000 supercomputer is presented and discussed. It exhibits very good speedup behavior and low load unbalancing. Our results confirm that giant gaseous vortices can play a dominant role in giant planet formation.
      

      
      SU-F-BRD-02: Application of ARCHERRT-- A GPU-Based Monte Carlo Dose Engine for Radiation Therapy -- to Tomotherapy and Patient-Independent IMRT
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Su, L; Du, X; Liu, T
         
         Purpose: As a module of ARCHER -- Accelerated Radiation-transport Computations in Heterogeneous EnviRonments, ARCHERRT is designed for RadioTherapy (RT) dose calculation. This paper describes the application of ARCHERRT to patient-dependent TomoTherapy and patient-independent IMRT. It also conducts a 'fair' comparison of different GPUs and a multicore CPU. Methods: The source input used for patient-dependent TomoTherapy is a phase space file (PSF) generated from the optimized plan. For patient-independent IMRT, the open field PSF is used for different cases. The intensity modulation is simulated by a fluence map. The GEANT4 code is used as the benchmark. DVH and gamma index tests are employed to evaluate the accuracy of the ARCHERRT code. Some previous studies reported misleading speedups by comparing GPU code with serial CPU code. To perform a fairer comparison, we write multi-threaded code with OpenMP to fully exploit the computing potential of the CPU. The hardware involved in this study comprises a 6-core Intel E5-2620 CPU and 6 NVIDIA M2090 GPUs, a K20 GPU and a K40 GPU. Results: Dosimetric results from ARCHERRT and GEANT4 show good agreement. The 2%/2mm gamma test pass rates for different clinical cases are 97.2% to 99.7%. A single M2090 GPU needs 50-79 seconds for the simulation to achieve a statistical error of 1% in the PTV. The K40 card is about 1.7-1.8 times faster than the M2090 card. Using 6 M2090 cards, the simulation can be finished in about 10 seconds. For comparison, the Intel E5-2620 needs 507-879 seconds for the same simulation. Conclusion: We successfully applied ARCHERRT to Tomotherapy and patient-independent IMRT, and conducted a fair comparison between GPU and CPU performance. The ARCHERRT code is both accurate and efficient and may be used towards clinical applications.
      

      
      Collaborating CPU and GPU for large-scale high-order CFD simulations with complex grids on the TianHe-1A supercomputer
      NASA Astrophysics Data System (ADS)
      Xu, Chuanfu; Deng, Xiaogang; Zhang, Lilun; Fang, Jianbin; Wang, Guangxue; Jiang, Yi; Cao, Wei; Che, Yonggang; Wang, Yongxian; Wang, Zhenghua; Liu, Wei; Cheng, Xinghua
         2014-12-01
         Programming and optimizing complex, real-world CFD codes on current many-core accelerated HPC systems is very challenging, especially when collaborating CPUs and accelerators to fully tap the potential of heterogeneous systems. In this paper, with a tri-level hybrid and heterogeneous programming model using MPI + OpenMP + CUDA, we port and optimize our high-order multi-block structured CFD software HOSTA on the GPU-accelerated TianHe-1A supercomputer. HOSTA adopts two self-developed high-order compact finite difference schemes, WCNS and HDCS, that can simulate flows with complex geometries. We present a dual-level parallelization scheme for efficient multi-block computation on GPUs and perform particular kernel optimizations for high-order CFD schemes. The GPU-only approach achieves a speedup of about 1.3 when comparing one Tesla M2050 GPU with two Xeon X5670 CPUs. To achieve a greater speedup, we collaborate CPU and GPU for HOSTA instead of using a naive GPU-only approach. We present a novel scheme to balance the loads between the store-poor GPU and the store-rich CPU. Taking CPU and GPU load balance into account, we improve the maximum simulation problem size per TianHe-1A node for HOSTA by 2.3×, while the collaborative approach improves the performance by around 45% compared to the GPU-only approach. Further, to scale HOSTA on TianHe-1A, we propose a gather/scatter optimization to minimize PCI-e data transfer times for ghost and singularity data of 3D grid blocks, and overlap the collaborative computation and communication as far as possible using some advanced CUDA and MPI features. Scalability tests show that HOSTA can achieve a parallel efficiency of above 60% on 1024 TianHe-1A nodes. With our method, we have successfully simulated an EET high-lift airfoil configuration containing 800M cells and China's large civil airplane configuration containing 150M cells. To the best of our knowledge, those are the largest-scale CPU-GPU collaborative simulations that solve realistic CFD problems with both complex configurations and high-order schemes.
      

        
       
          


          
        

           
           
             
               
      
      The VLBA correlator: Real-time in the distributed era
      NASA Technical Reports Server (NTRS)
      Wells, D. C.
         1992-01-01
         The correlator is the signal processing engine of the Very Long Baseline Array (VLBA). Radio signals are recorded on special wideband (128 Mb/s) digital recorders at the 10 telescopes, with sampling times controlled by hydrogen maser clocks. The magnetic tapes are shipped to the Array Operations Center in Socorro, New Mexico, where they are played back simultaneously into the correlator. Real-time software and firmware control the playback drives to achieve synchronization, compute models of the wavefront delay, control the numerous modules of the correlator, and record FITS files of the fringe visibilities at the back-end of the correlator. In addition to the more than 3000 custom VLSI chips which handle the massive data flow of the signal processing, the correlator contains a total of more than 100 programmable computers, 8-, 16- and 32-bit CPUs. Code is downloaded into front-end CPUs dependent on operating mode. Low-level code is assembly language, high-level code is C running under a RT OS. We use VxWorks on Motorola MVME147 CPUs. Code development is on a complex of SPARC workstations connected to the RT CPUs by Ethernet. The overall management of the correlation process is dependent on a database management system. We use Ingres running on a Sparcstation-2. We transfer logging information from the database of the VLBA Monitor and Control System to our database using Ingres/NET. Job scripts are computed and are transferred to the real-time computers using NFS, and correlation job execution logs and status flow back by the same route. Operator status and control displays use windows on workstations, interfaced to the real-time processes by network protocols. The extensive network protocol support provided by VxWorks is invaluable. The VLBA Correlator's dependence on network protocols is an example of the radical transformation of the real-time world over the past five years. Real-time is becoming more like conventional computing. Paradoxically, 'conventional' computing is also adopting practices from the real-time world: semaphores, shared memory, light-weight threads, and concurrency. This appears to be a convergence of thinking.
      

      
      Examining the architecture of cellular computing through a comparative study with a computer
      PubMed Central
      Wang, Degeng; Gribskov, Michael
         2005-01-01
         The computer and the cell both use information embedded in simple coding, the binary software code and the quadruple genomic code, respectively, to support system operations. A comparative examination of their system architecture as well as their information storage and utilization schemes is performed. On top of the code, both systems display a modular, multi-layered architecture, which, in the case of a computer, arises from human engineering efforts through a combination of hardware implementation and software abstraction. Using the computer as a reference system, a simplistic mapping of the architectural components between the two is easily detected. This comparison also reveals that a cell abolishes the software–hardware barrier through genomic encoding for the constituents of the biochemical network, a cell's ‘hardware’ equivalent to the computer central processing unit (CPU). The information loading (gene expression) process acts as a major determinant of the encoded constituent's abundance, which, in turn, often determines the ‘bandwidth’ of a biochemical pathway. Cellular processes are implemented in biochemical pathways in parallel manners. In a computer, on the other hand, the software provides only instructions and data for the CPU. A process represents just sequentially ordered actions by the CPU and only virtual parallelism can be implemented through CPU time-sharing. Whereas process management in a computer may simply mean job scheduling, coordinating pathway bandwidth through the gene expression machinery represents a major process management scheme in a cell. In summary, a cell can be viewed as a super-parallel computer, which computes through controlled hardware composition. While we have, at best, a very fragmented understanding of cellular operation, we have a thorough understanding of the computer throughout the engineering process. The potential utilization of this knowledge to the benefit of systems biology is discussed. PMID:16849179
      

      
      Examining the architecture of cellular computing through a comparative study with a computer.
      PubMed
      Wang, Degeng; Gribskov, Michael
         2005-06-22
         The computer and the cell both use information embedded in simple coding, the binary software code and the quadruple genomic code, respectively, to support system operations. A comparative examination of their system architecture as well as their information storage and utilization schemes is performed. On top of the code, both systems display a modular, multi-layered architecture, which, in the case of a computer, arises from human engineering efforts through a combination of hardware implementation and software abstraction. Using the computer as a reference system, a simplistic mapping of the architectural components between the two is easily detected. This comparison also reveals that a cell abolishes the software-hardware barrier through genomic encoding for the constituents of the biochemical network, a cell's "hardware" equivalent to the computer central processing unit (CPU). The information loading (gene expression) process acts as a major determinant of the encoded constituent's abundance, which, in turn, often determines the "bandwidth" of a biochemical pathway. Cellular processes are implemented in biochemical pathways in parallel manners. In a computer, on the other hand, the software provides only instructions and data for the CPU. A process represents just sequentially ordered actions by the CPU and only virtual parallelism can be implemented through CPU time-sharing. Whereas process management in a computer may simply mean job scheduling, coordinating pathway bandwidth through the gene expression machinery represents a major process management scheme in a cell. In summary, a cell can be viewed as a super-parallel computer, which computes through controlled hardware composition. While we have, at best, a very fragmented understanding of cellular operation, we have a thorough understanding of the computer throughout the engineering process. The potential utilization of this knowledge to the benefit of systems biology is discussed.
      

      
      Using all of your CPU's in HIPE
      NASA Astrophysics Data System (ADS)
      Jacobson, J. D.; Fadda, D.
         2012-09-01
         Modern computer architectures increasingly feature multi-core CPUs. For example, the MacBook Pro features Intel quad-core i7 processors. Through the use of hyper-threading, where each core can execute two threads simultaneously, the quad-core i7 can support eight simultaneous processing threads. All this on your laptop! This CPU power can now be put into service by scientists to perform data reduction tasks, but only if the software has been designed to take advantage of multi-processor architectures. Up to now, the software for Herschel data reduction (HIPE), written in Jython and Java, has been single-threaded and can only utilize a single processor. Users of HIPE do not get any advantage from the additional processors. Why not put all of the CPU resources to work reducing your data? We present a multi-threaded software application that corrects long-term transients in the signal from the PACS unchopped spectroscopy line scan mode. In this poster, we present a multi-threaded software framework to achieve performance improvements from parallel execution. We will show how a task to correct transients in the PACS Spectroscopy Pipeline for the unchopped line scan mode has been threaded. This computation-intensive task uses either a one-parameter or a three-parameter exponential function to characterize the transient. The task uses a Java implementation of Minpack, translated from the C (Moshier) and IDL (Markwardt) versions by the authors, to optimize the correction parameters. We also explain how to determine if a task can benefit from threading (Amdahl's Law), and if it is safe to thread. The design and implementation, using the completion service of the Java concurrency package, are described. Pitfalls, timing bugs, thread safety, resource control, testing and performance improvements are described and plotted.
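The abstract above names Amdahl's Law as the check for whether a task is worth threading. As a rough illustration only (the function name and the 90% figure are assumptions, not values from the poster), the bound can be computed like this:

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on speedup when only part of a task can run in parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part is untouched; the parallel part is divided among workers.
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Hypothetical task in which 90% of the run time is threadable.
    for n in (1, 2, 4, 8):
        print(n, "threads ->", round(amdahl_speedup(0.9, n), 2))
```

Even with eight hardware threads, the serial remainder caps the gain well below eight, which is why the serial fraction is worth measuring before threading a task.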
      

      
      Computing the Density Matrix in Electronic Structure Theory on Graphics Processing Units.
      PubMed
      Cawkwell, M J; Sanville, E J; Mniszewski, S M; Niklasson, Anders M N
         2012-11-13
         The self-consistent solution of a Schrödinger-like equation for the density matrix is a critical and computationally demanding step in quantum-based models of interatomic bonding. This step was tackled historically via the diagonalization of the Hamiltonian. We have investigated the performance and accuracy of the second-order spectral projection (SP2) algorithm for the computation of the density matrix via a recursive expansion of the Fermi operator in a series of generalized matrix-matrix multiplications. We demonstrate that owing to its simplicity, the SP2 algorithm [Niklasson, A. M. N. Phys. Rev. B 2002, 66, 155115] is exceptionally well suited to implementation on graphics processing units (GPUs). The performance in double and single precision arithmetic of a hybrid GPU/central processing unit (CPU) and full GPU implementation of the SP2 algorithm exceeds that of a CPU-only implementation of the SP2 algorithm and traditional matrix diagonalization when the dimensions of the matrices exceed about 2000 × 2000. Padding schemes for arrays allocated in the GPU memory that optimize the performance of the CUBLAS implementations of the level 3 BLAS DGEMM and SGEMM subroutines for generalized matrix-matrix multiplications are described in detail. The analysis of the relative performance of the hybrid CPU/GPU and full GPU implementations indicates that the transfer of arrays between the GPU and CPU constitutes only a small fraction of the total computation time. The errors measured in the self-consistent density matrices computed using the SP2 algorithm are generally smaller than those measured in matrices computed via diagonalization. Furthermore, the errors in the density matrices computed using the SP2 algorithm do not exhibit any dependence on system size, whereas the errors increase linearly with the number of orbitals when diagonalization is employed.
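To make the "recursive expansion in matrix-matrix multiplications" concrete, here is a minimal sketch of the SP2 recursion as it is commonly stated: rescale the Hamiltonian so its spectrum lies in [0, 1], then repeatedly apply X² or 2X − X² depending on whether the trace is above or below the occupation number. The function names, the Gershgorin spectral bounds, and the pure-Python matrix product (standing in for DGEMM) are assumptions of this illustration, not details taken from the paper.

```python
def matmul(a, b):
    """Dense matrix product on lists of lists (stand-in for a DGEMM call)."""
    n, m, p = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]


def trace(a):
    return sum(a[i][i] for i in range(len(a)))


def sp2_density_matrix(h, n_occ, tol=1e-12, max_iter=100):
    """Approximate the zero-temperature density matrix of a symmetric h."""
    n = len(h)
    # Gershgorin circles give cheap spectral bounds (assumption of this sketch).
    radii = [sum(abs(h[i][j]) for j in range(n) if j != i) for i in range(n)]
    e_min = min(h[i][i] - radii[i] for i in range(n))
    e_max = max(h[i][i] + radii[i] for i in range(n))
    width = e_max - e_min
    if width == 0.0:
        raise ValueError("spectral bounds coincide; nothing to project")
    # Map the spectrum into [0, 1] with the low-energy states near 1.
    x = [[((e_max if i == j else 0.0) - h[i][j]) / width for j in range(n)]
         for i in range(n)]
    for _ in range(max_iter):
        x2 = matmul(x, x)
        if abs(trace(x) - trace(x2)) < tol:  # X is (nearly) idempotent
            break
        if trace(x) > n_occ:
            x = x2  # X^2 lowers the trace
        else:
            # 2X - X^2 raises the trace
            x = [[2.0 * x[i][j] - x2[i][j] for j in range(n)] for i in range(n)]
    return x


if __name__ == "__main__":
    # Toy 2x2 Hamiltonian with one occupied state.
    print(sp2_density_matrix([[0.0, 1.0], [1.0, 0.0]], n_occ=1))
```

Each iteration costs one matrix product, which is the property that makes the method map well onto GPU BLAS routines.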
      

      
      Optimal endothelialisation of a new compliant poly(carbonate-urea)urethane vascular graft with effect of physiological shear stress.
      PubMed
      Salacinski, H J; Tai, N R; Punshon, G; Giudiceandrea, A; Hamilton, G; Seifalian, A M
         2000-10-01
         To define the optimal seeding conditions of a new stress-free poly(carbonate-urea)urethane (CPU) graft with compliance similar to that of human artery, with a honeycomb structure engineered during the manufacturing process to enhance adhesion and growth of endothelial cells. ¹¹¹Indium-oxine radiolabeled human umbilical vein endothelial cells (HUVEC) were seeded onto CPU grafts at (a) concentrations from 2-24 × 10^5 cells/cm^2 and (b) incubated for 0.5, 1, 2, 4 and 6 h. Following incubation, graft segments were subjected to three washing/gamma counting procedures and scanning electron microscopy (SEM). Cell viability was measured using a modified Alamar blue™ assay. To test physiological retention, a pulsatile flow phantom was used to subject optimally seeded (16 × 10^5, 4 h) CPU grafts to arterial shear stress for 6 h with real-time acquisition of scintigraphic images of seeded grafts using a nuclear medicine gamma camera system. The seeding efficiency of 54 ± 13% post three washes was achieved using 16 × 10^5 cells/cm^2. Similarly, in SEM micrographs a seeding density of 16 × 10^5 cells/cm^2 resulted in a confluent monolayer. Seeded CPU segments incubated for 4 h exhibited significantly higher resistance to wash-off than segments incubated for 30 min (p < 0.05). Exposure of seeded grafts to pulsatile shear stress resulted in some cell loss, with 67 ± 3% of cells adherent following 6 h of perfusion with ongoing metabolic activity. Thus, optimal conditions were 16 × 10^5 cells/cm^2 at 4 h. The optimal seeding conditions have been defined for a "tissue-engineered" vascular graft which allow complete endothelialisation and high cell-to-substrate strength that resists hydrodynamic stress. Copyright 2000 Harcourt Publishers Ltd.
      

      
      Comparing performance of many-core CPUs and GPUs for static and motion compensated reconstruction of C-arm CT data.
      PubMed
      Hofmann, Hannes G; Keck, Benjamin; Rohkohl, Christopher; Hornegger, Joachim
         2011-01-01
         Interventional reconstruction of 3-D volumetric data from C-arm CT projections is a computationally demanding task. Hardware optimization is not an option but mandatory for interventional image processing and, in particular, for image reconstruction due to the high demands on performance. Several groups have published fast analytical 3-D reconstruction on highly parallel hardware such as GPUs to mitigate this issue. The authors show that the performance of modern CPU-based systems is in the same order as current GPUs for static 3-D reconstruction and outperforms them for a recent motion compensated (3-D+time) image reconstruction algorithm. This work investigates two algorithms: Static 3-D reconstruction as well as a recent motion compensated algorithm. The evaluation was performed using a standardized reconstruction benchmark, RABBITCT, to get comparable results and two additional clinical data sets. The authors demonstrate for a parametric B-spline motion estimation scheme that the derivative computation, which requires many write operations to memory, performs poorly on the GPU and can highly benefit from modern CPU architectures with large caches. Moreover, on a 32-core Intel Xeon server system, the authors achieve linear scaling with the number of cores used and reconstruction times almost in the same range as current GPUs. Algorithmic innovations in the field of motion compensated image reconstruction may lead to a shift back to CPUs in the future. For analytical 3-D reconstruction, the authors show that the gap between GPUs and CPUs became smaller. It can be performed in less than 20 s (on-the-fly) using a 32-core server.
      

      
      Scalable algorithms for 3D extended MHD.
      NASA Astrophysics Data System (ADS)
      Chacon, Luis
         2007-11-01
         In the modeling of plasmas with extended MHD (XMHD), the challenge is to resolve long time scales while rendering the whole simulation manageable. In XMHD, this is particularly difficult because fast (dispersive) waves are supported, resulting in a very stiff set of PDEs. In explicit schemes, such stiffness results in stringent numerical stability time-step constraints, rendering them inefficient and algorithmically unscalable. In implicit schemes, it yields very ill-conditioned algebraic systems, which are difficult to invert. In this talk, we present recent theoretical and computational progress that demonstrates a scalable 3D XMHD solver (i.e., CPU ∼ N, with N the number of degrees of freedom). The approach is based on Newton-Krylov methods, which are preconditioned for efficiency. The preconditioning stage admits suitable approximations without compromising the quality of the overall solution. In this work, we employ optimal (CPU ∼ N) multilevel methods on a parabolized XMHD formulation, which renders the whole algorithm scalable. The (crucial) parabolization step is required to render XMHD multilevel-friendly. Algebraically, the parabolization step can be interpreted as a Schur factorization of the Jacobian matrix, thereby providing a solid foundation for the current (and future extensions of the) approach. We will build towards 3D extended MHD [L. Chacón, Comput. Phys. Comm., 163 (3), 143-171 (2004); L. Chacón et al., 33rd EPS Conf. Plasma Physics, Rome, Italy, 2006] by discussing earlier algorithmic breakthroughs in 2D reduced MHD [L. Chacón et al., J. Comput. Phys. 178 (1), 15-36 (2002)] and 2D Hall MHD [L. Chacón et al., J. Comput. Phys., 188 (2), 573-592 (2003)].
      

      
      Symplectic multi-particle tracking on GPUs
      NASA Astrophysics Data System (ADS)
      Liu, Zhicong; Qiang, Ji
         2018-05-01
         A symplectic multi-particle tracking model is implemented on the Graphic Processing Units (GPUs) using the Compute Unified Device Architecture (CUDA) language. The symplectic tracking model can preserve phase space structure and reduce non-physical effects in long term simulation, which is important for beam property evaluation in particle accelerators. Though this model is computationally expensive, it is very suitable for parallelization and can be accelerated significantly by using GPUs. In this paper, we optimized the implementation of the symplectic tracking model on both single GPU and multiple GPUs. Using a single GPU processor, the code achieves a factor of 2-10 speedup for a range of problem sizes compared with the time on a single state-of-the-art Central Processing Unit (CPU) node with similar power consumption and semiconductor technology. It also shows good scalability on a multi-GPU cluster at Oak Ridge Leadership Computing Facility. In an application to beam dynamics simulation, the GPU implementation helps save more than a factor of two total computing time in comparison to the CPU implementation.
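The abstract does not spell out the tracking map itself. As a generic illustration of why a symplectic scheme keeps long-term simulations well behaved, the sketch below uses a drift-kick-drift (leapfrog) step for a single particle in a linear focusing force. The Hamiltonian, the function names, and the parameter values are all assumptions chosen for the example, not the model used in the paper.

```python
def track(x, p, k, step, n_steps):
    """Drift-kick-drift map for H = p^2/2 + k*x^2/2 (unit mass)."""
    for _ in range(n_steps):
        x += 0.5 * step * p   # half drift
        p -= step * k * x     # full kick from the linear focusing force
        x += 0.5 * step * p   # half drift
    return x, p


def energy(x, p, k):
    return 0.5 * p * p + 0.5 * k * x * x


if __name__ == "__main__":
    k, step = 1.0, 0.1
    x0, p0 = 1.0, 0.0
    x, p = track(x0, p0, k, step, 10_000)
    drift = abs(energy(x, p, k) - energy(x0, p0, k)) / energy(x0, p0, k)
    print("relative energy error after 10,000 steps:", drift)
```

Because every step is an area-preserving map in (x, p), the energy error stays bounded instead of growing with the number of turns. Each particle is advanced independently, which is what makes this kind of tracking a natural fit for one-thread-per-particle GPU kernels.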
      

      
      Machine learning based job status prediction in scientific clusters
      DOE PAGES
      Yoo, Wucherl; Sim, Alex; Wu, Kesheng
         2016-09-01
         Large high-performance computing systems are built with an increasing number of components, with more CPU cores, more memory, and more storage space. At the same time, scientific applications have been growing in complexity. Together, they are leading to more frequent unsuccessful job statuses on HPC systems. From measured job statuses, 23.4% of CPU time was spent on unsuccessful jobs. Here, we set out to study whether these unsuccessful job statuses could be anticipated from known job characteristics. To explore this possibility, we have developed a job status prediction method for the execution of jobs on scientific clusters. The Random Forests algorithm was applied to extract and characterize the patterns of unsuccessful job statuses. Experimental results show that our method can predict the unsuccessful job statuses from the monitored ongoing job executions in 99.8% of the cases with 83.6% recall and 94.8% precision. Lastly, this prediction accuracy can be sufficiently high that it can be used in mitigation procedures for predicted failures.
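For readers less familiar with the metrics quoted above, the following sketch shows how precision and recall are defined from prediction counts, and how an F1 score follows from the two reported values. The counts in the usage section are hypothetical, and the F1 figure is derived here for illustration; it is not reported in the abstract.

```python
def precision(true_pos: int, false_pos: int) -> float:
    """Share of jobs flagged as unsuccessful that really were unsuccessful."""
    return true_pos / (true_pos + false_pos)


def recall(true_pos: int, false_neg: int) -> float:
    """Share of truly unsuccessful jobs that the predictor flagged."""
    return true_pos / (true_pos + false_neg)


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2.0 * p * r / (p + r)


if __name__ == "__main__":
    # Hypothetical confusion counts, for illustration only.
    print(precision(true_pos=90, false_pos=10), recall(true_pos=90, false_neg=30))
    # F1 derived from the precision and recall quoted in the abstract.
    print(round(f1_score(0.948, 0.836), 3))
```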
      

      
      RXIO: Design and implementation of high performance RDMA-capable GridFTP
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Tian, Yuan; Yu, Weikuan; Vetter, Jeffrey S.
         2011-12-21
         For its low latency, high bandwidth, and low CPU utilization, Remote Direct Memory Access (RDMA) has established itself as an effective data movement technology in many networking environments. However, the transport protocols of grid run-time systems, such as GridFTP in Globus, are not yet capable of utilizing RDMA. In this study, we examine the architecture of GridFTP for the feasibility of enabling RDMA. An RDMA-capable XIO (RXIO) framework is designed and implemented to extend its XIO system and match the characteristics of RDMA. Our experimental results demonstrate that RDMA can significantly improve the performance of GridFTP, reducing the latency by 32% and increasing the bandwidth by more than three times. In achieving such performance improvements, RDMA dramatically cuts down CPU utilization of GridFTP clients and servers. In conclusion, these results demonstrate that RXIO can effectively exploit the benefits of RDMA for GridFTP. It offers a good prototype to further leverage GridFTP on wide-area RDMA networks.
      

      
      Machine learning based job status prediction in scientific clusters
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Yoo, Wucherl; Sim, Alex; Wu, Kesheng
         
         Large high-performance computing systems are built with an increasing number of components, with more CPU cores, more memory, and more storage space. At the same time, scientific applications have been growing in complexity. Together, they are leading to more frequent unsuccessful job statuses on HPC systems. From measured job statuses, 23.4% of CPU time was spent on unsuccessful jobs. Here, we set out to study whether these unsuccessful job statuses could be anticipated from known job characteristics. To explore this possibility, we have developed a job status prediction method for the execution of jobs on scientific clusters. The Random Forests algorithm was applied to extract and characterize the patterns of unsuccessful job statuses. Experimental results show that our method can predict the unsuccessful job statuses from the monitored ongoing job executions in 99.8% of the cases with 83.6% recall and 94.8% precision. Lastly, this prediction accuracy can be sufficiently high that it can be used in mitigation procedures for predicted failures.
      

      
      The density matrix renormalization group algorithm on kilo-processor architectures: Implementation and trade-offs
      NASA Astrophysics Data System (ADS)
      Nemes, Csaba; Barcza, Gergely; Nagy, Zoltán; Legeza, Örs; Szolgay, Péter
         2014-06-01
         In the numerical analysis of strongly correlated quantum lattice models one of the leading algorithms developed to balance the size of the effective Hilbert space and the accuracy of the simulation is the density matrix renormalization group (DMRG) algorithm, in which the run-time is dominated by the iterative diagonalization of the Hamilton operator. As the most time-dominant step of the diagonalization can be expressed as a list of dense matrix operations, the DMRG is an appealing candidate to fully utilize the computing power residing in novel kilo-processor architectures. In the paper a smart hybrid CPU-GPU implementation is presented, which exploits the power of both CPU and GPU and tolerates problems exceeding the GPU memory size. Furthermore, a new CUDA kernel has been designed for asymmetric matrix-vector multiplication to accelerate the rest of the diagonalization. Besides the evaluation of the GPU implementation, the practical limits of an FPGA implementation are also discussed.
      

      
      Development of the Large-Scale Statistical Analysis System of Satellites Observations Data with Grid Datafarm Architecture
      NASA Astrophysics Data System (ADS)
      Yamamoto, K.; Murata, K.; Kimura, E.; Honda, R.
         2006-12-01
         In the Solar-Terrestrial Physics (STP) field, the amount of satellite observation data has been increasing every year. Three problems must be solved to achieve large-scale statistical analyses of such data. (i) More CPU power and larger memory and disk size are required. However, the total power of personal computers is not enough to analyze such an amount of data. Supercomputers provide a high-performance CPU and a large memory area, but they are usually separated from the Internet or connected only for the purpose of programming or data file transfer. (ii) Most of the observation data files are managed at distributed data sites over the Internet. Users have to know where the data files are located. (iii) Since no common data format is currently available in the STP field, users have to prepare a reading program for each data set by themselves. To overcome problems (i) and (ii), we constructed a parallel and distributed data analysis environment based on the Gfarm reference implementation of the Grid Datafarm architecture. Gfarm shares computational resources and performs parallel distributed processing. In addition, Gfarm provides the Gfarm filesystem, which can serve as a virtual directory tree across nodes. The Gfarm environment is composed of three parts: a metadata server to manage distributed file information, filesystem nodes to provide computational resources, and a client that submits jobs to the metadata server and manages data processing scheduling. In the present study, both data files and data processes are parallelized on Gfarm with 6 filesystem nodes; each node has a Pentium V 1GHz CPU, 256MB memory and 40GB disk. To evaluate the performance of the present Gfarm system, we scanned a large number of data files, each about 300MB in size, using three processing methods: sequential processing in one node, sequential processing by each node, and parallel processing by each node. As a result, comparing the number of files with the elapsed time, parallel and distributed processing shortened the elapsed time to 1/5 of that of sequential processing. On the other hand, sequential processing times were shorter in another experiment, in which the file size was smaller than 100KB. In this case, the elapsed time to scan one file is within one second. This implies that disk swapping took place in the case of parallel processing by each node. We note that the operation became unstable when the number of files exceeded 1000. To overcome problem (iii), we developed an original data class. This class supports reading data files with various data formats: it converts them into an original data format, defining schemata for every type of data and encapsulating the structure of the data files. In addition, since this class provides a time re-sampling function, users can easily convert multiple data arrays with different time resolutions into arrays with the same time resolution. Finally, using Gfarm, we achieved a high-performance environment for large-scale statistical data analyses. It should be noted that the present method is effective only when the size of each data file is large enough. At present, we are building a new Gfarm environment with 8 nodes, each with an Athlon 64 x2 Dual Core 2GHz CPU, 2GB memory and 1.2TB disk (using RAID0). Our original class is to be implemented on the new Gfarm environment. In the present talk, we show the latest results of applying the present system to data analyses with a huge number of satellite observation data files.
      

      
      Computer hardware for radiologists: Part I
      PubMed Central
      Indrajit, IK; Alam, A
         2010-01-01
         Computers are an integral part of modern radiology practice. They are used in different radiology modalities to acquire, process, and postprocess imaging data. They have had a dramatic influence on contemporary radiology practice. Their impact has extended further with the emergence of Digital Imaging and Communications in Medicine (DICOM), Picture Archiving and Communication System (PACS), Radiology information system (RIS) technology, and Teleradiology. A basic overview of computer hardware relevant to radiology practice is presented here. The key hardware components in a computer are the motherboard, central processor unit (CPU), the chipset, the random access memory (RAM), the memory modules, bus, storage drives, and ports. The personal computer (PC) has a rectangular case that contains important components called hardware, many of which are integrated circuits (ICs). The fiberglass motherboard is the main printed circuit board and has a variety of important hardware mounted on it, which are connected by electrical pathways called “buses”. The CPU is the largest IC on the motherboard and contains millions of transistors. Its principal function is to execute “programs”. A Pentium® 4 CPU has transistors that execute a billion instructions per second. The chipset is completely different from the CPU in design and function; it controls data and interaction of buses between the motherboard and the CPU. Memory (RAM) is fundamentally semiconductor chips storing data and instructions for access by a CPU. RAM is classified by storage capacity, access speed, data rate, and configuration. PMID:21042437
      

      
      Extreme-Scale Stochastic Particle Tracing for Uncertain Unsteady Flow Analysis
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Guo, Hanqi; He, Wenbin; Seo, Sangmin
         2016-11-13
         We present an efficient and scalable solution to estimate uncertain transport behaviors using stochastic flow maps (SFMs) for visualizing and analyzing uncertain unsteady flows. SFM computation is extremely expensive because it requires many Monte Carlo runs to trace densely seeded particles in the flow. We alleviate the computational cost by decoupling the time dependencies in SFMs so that we can process adjacent time steps independently and then compose them together for longer time periods. Adaptive refinement is also used to reduce the number of runs for each location. We then parallelize over tasks (packets of particles in our design) to achieve high efficiency in MPI/thread hybrid programming. Such a task model also enables CPU/GPU coprocessing. We show the scalability on two supercomputers, Mira (up to 1M Blue Gene/Q cores) and Titan (up to 128K Opteron cores and 8K GPUs), that can trace billions of particles in seconds.
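One way to picture the "process adjacent time steps independently, then compose" idea is to discretize each stochastic flow map as a row-stochastic transition matrix over grid cells, so that composing two intervals becomes a matrix product. That discretization, the function name, and the toy numbers below are assumptions of this sketch; the abstract does not state how the maps are represented.

```python
def compose(first, second):
    """Transition matrix for applying `first` (t0 -> t1), then `second` (t1 -> t2).

    Entry [i][j] is the probability that a particle seeded in cell i at t0
    is found in cell j at t2.
    """
    cells = len(first)
    return [[sum(first[i][k] * second[k][j] for k in range(cells))
             for j in range(cells)]
            for i in range(cells)]


if __name__ == "__main__":
    # Toy two-cell domain: cell 1 is absorbing, cell 0 leaks into it.
    step = [[0.5, 0.5],
            [0.0, 1.0]]
    two_steps = compose(step, step)
    print(two_steps)
```

Because each per-interval matrix can be estimated without waiting for the previous interval to finish, the expensive Monte Carlo tracing parallelizes over time as well as over seed locations.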
      

      
      OpenCL-based vicinity computation for 3D multiresolution mesh compression
      NASA Astrophysics Data System (ADS)
      Hachicha, Soumaya; Elkefi, Akram; Ben Amar, Chokri
         2017-03-01
         3D multiresolution mesh compression systems are still widely addressed in many domains. These systems increasingly require volumetric data to be processed in real time. Therefore, performance is becoming constrained by hardware resource usage and by the need for an overall reduction in computational time. In this paper, our contribution lies entirely in computing, in real time, the triangle neighborhoods of 3D progressive meshes for a robust compression algorithm based on the scan-based wavelet transform (WT) technique. The originality of this latter algorithm is to compute the WT with minimum memory usage by processing data as they are acquired. However, with large data, this technique is considered poor in terms of computational complexity. For that reason, this work exploits the GPU to accelerate the computation using OpenCL as a heterogeneous programming language. Experiments demonstrate that, aside from the portability across various platforms and the flexibility guaranteed by the OpenCL-based implementation, this method can achieve a speedup factor of 5 compared to the sequential CPU implementation.
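The abstract does not describe the vicinity computation in detail. As a plain sequential illustration of what a triangle neighborhood query involves, the sketch below treats two triangles as neighbors when they share an edge; that definition, the function name, and the tiny mesh are assumptions made for the example, and the paper's OpenCL kernels are not reproduced here.

```python
from collections import defaultdict


def triangle_neighbors(triangles):
    """Map each triangle index to the set of triangles sharing an edge with it."""
    edge_to_tris = defaultdict(list)
    for t, (a, b, c) in enumerate(triangles):
        for u, v in ((a, b), (b, c), (c, a)):
            # Store edges with sorted endpoints so orientation does not matter.
            edge_to_tris[(min(u, v), max(u, v))].append(t)
    neighbors = {t: set() for t in range(len(triangles))}
    for tris in edge_to_tris.values():
        for t in tris:
            neighbors[t].update(other for other in tris if other != t)
    return neighbors


if __name__ == "__main__":
    # Two triangles sharing edge (1, 2), plus one isolated triangle.
    mesh = [(0, 1, 2), (2, 1, 3), (4, 5, 6)]
    print(triangle_neighbors(mesh))
```

The per-triangle work is independent once the edge table exists, which suggests why this step is a reasonable candidate for GPU offloading.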
      

      
      A Fully Automated Approach to Spike Sorting.
      PubMed
      Chung, Jason E; Magland, Jeremy F; Barnett, Alex H; Tolosa, Vanessa M; Tooker, Angela C; Lee, Kye Y; Shah, Kedar G; Felix, Sarah H; Frank, Loren M; Greengard, Leslie F
         2017-09-13
         Understanding the detailed dynamics of neuronal networks will require the simultaneous measurement of spike trains from hundreds of neurons (or more). Currently, approaches to extracting spike times and labels from raw data are time consuming, lack standardization, and involve manual intervention, making it difficult to maintain data provenance and assess the quality of scientific results. Here, we describe an automated clustering approach and associated software package that addresses these problems and provides novel cluster quality metrics. We show that our approach has accuracy comparable to or exceeding that achieved using manual or semi-manual techniques with desktop central processing unit (CPU) runtimes faster than acquisition time for up to hundreds of electrodes. Moreover, a single choice of parameters in the algorithm is effective for a variety of electrode geometries and across multiple brain regions. This algorithm has the potential to enable reproducible and automated spike sorting of larger scale recordings than is currently possible. Copyright © 2017 Elsevier Inc. All rights reserved.
      

      
      Far-field radiation patterns of aperture antennas by the Winograd Fourier transform algorithm
      NASA Technical Reports Server (NTRS)
      Heisler, R.
         1978-01-01
         A more time-efficient algorithm for computing the discrete Fourier transform, the Winograd Fourier transform (WFT), is described. The WFT algorithm is compared with other transform algorithms. Results indicate that the WFT algorithm in antenna analysis appears to be a very successful application. Significant savings in CPU time will improve the computer turnaround time and circumvent the need to resort to weekend runs.
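For context, the quantity that the WFT and other fast algorithms compute is the discrete Fourier transform. The sketch below is the direct O(N²) definition, which serves as the baseline those algorithms improve on; it is not the Winograd algorithm itself, and the function name is an assumption.

```python
import cmath


def dft(samples):
    """Direct evaluation of X[k] = sum_t x[t] * exp(-2*pi*i*k*t/N)."""
    n = len(samples)
    return [sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
            for k in range(n)]


if __name__ == "__main__":
    # A unit impulse has a flat spectrum.
    print(dft([1, 0, 0, 0]))
```

The nested loops perform N² complex multiplications, and reducing that count is where fast transforms save CPU time.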
      

      
      Measurements of neuron soma size and density in rat dorsal striatum, nucleus accumbens core and nucleus accumbens shell: differences between striatal region and brain hemisphere, but not sex.
      PubMed
      Meitzen, John; Pflepsen, Kelsey R; Stern, Christopher M; Meisel, Robert L; Mermelstein, Paul G
         2011-01-07
         Both hemispheric bias and sex differences exist in striatal-mediated behaviors and pathologies. The extent to which these dimorphisms can be attributed to an underlying neuroanatomical difference is unclear. We therefore quantified neuron soma size and density in the dorsal striatum (CPu) as well as the core (AcbC) and shell (AcbS) subregions of the nucleus accumbens to determine whether these anatomical measurements differ by region, hemisphere, or sex in adult Sprague-Dawley rats. Neuron soma size was larger in the CPu than the AcbC or AcbS. Neuron density was greatest in the AcbS, intermediate in the AcbC, and least dense in the CPu. CPu neuron density was greater in the left in comparison to the right hemisphere. No attribute was sexually dimorphic. These results provide the first evidence that hemispheric bias in the striatum and striatal-mediated behaviors can be attributed to a lateralization in neuronal density within the CPu. In contrast, sexual dimorphisms appear mediated by factors other than gross anatomical differences. Copyright © 2010 Elsevier Ireland Ltd. All rights reserved.
      

        
       
          


          
        

           
           
             
               
      
      48 CFR 252.204-7011 - Alternative Line Item Structure.
      Code of Federal Regulations, 2011 CFR
      
         2011-10-01
         ... Unit Unit price Amount 0001 Computer, Desktop with CPU, Monitor, Keyboard and Mouse 20 EA Alternative... Unit Unit Price Amount 0001 Computer, Desktop with CPU, Keyboard and Mouse 20 EA 0002 Monitor 20 EA...
      

      
      48 CFR 252.204-7011 - Alternative Line Item Structure.
      Code of Federal Regulations, 2014 CFR
      
         2014-10-01
         ... Unit Unit price Amount 0001 Computer, Desktop with CPU, Monitor, Keyboard and Mouse 20 EA Alternative... Unit Unit Price Amount 0001 Computer, Desktop with CPU, Keyboard and Mouse 20 EA 0002 Monitor 20 EA...
      

      
      48 CFR 252.204-7011 - Alternative Line Item Structure.
      Code of Federal Regulations, 2012 CFR
      
         2012-10-01
         ... Unit Unit price Amount 0001 Computer, Desktop with CPU, Monitor, Keyboard and Mouse 20 EA Alternative... Unit Unit Price Amount 0001 Computer, Desktop with CPU, Keyboard and Mouse 20 EA 0002 Monitor 20 EA...
      

      
      48 CFR 252.204-7011 - Alternative Line Item Structure.
      Code of Federal Regulations, 2013 CFR
      
         2013-10-01
         ... Unit Unit price Amount 0001 Computer, Desktop with CPU, Monitor, Keyboard and Mouse 20 EA Alternative... Unit Unit Price Amount 0001 Computer, Desktop with CPU, Keyboard and Mouse 20 EA 0002 Monitor 20 EA...
      

      
      Efficient parallel simulation of CO2 geologic sequestration in saline aquifers
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Zhang, Keni; Doughty, Christine; Wu, Yu-Shu
         2007-01-01
         An efficient parallel simulator for large-scale, long-term CO2 geologic sequestration in saline aquifers has been developed. The parallel simulator is a three-dimensional, fully implicit model that solves large, sparse linear systems arising from discretization of the partial differential equations for mass and energy balance in porous and fractured media. The simulator is based on the ECO2N module of the TOUGH2 code and inherits all the process capabilities of the single-CPU TOUGH2 code, including a comprehensive description of the thermodynamics and thermophysical properties of H2O-NaCl-CO2 mixtures, modeling single- and/or two-phase isothermal or non-isothermal flow processes, two-phase mixtures, fluid phases appearing or disappearing, as well as salt precipitation or dissolution. The new parallel simulator uses MPI for parallel implementation, the METIS software package for simulation domain partitioning, and the iterative parallel linear solver package Aztec for solving linear equations by multiple processors. In addition, the parallel simulator has been implemented with an efficient communication scheme. Test examples show that a linear or super-linear speedup can be obtained on Linux clusters as well as on supercomputers. Because of the significant improvement in both simulation time and memory requirement, the new simulator provides a powerful tool for tackling larger scale and more complex problems than can be solved by single-CPU codes. A high-resolution simulation example is presented that models buoyant convection, induced by a small increase in brine density caused by dissolution of CO2.
      

      
      Fast and flexible gpu accelerated binding free energy calculations within the amber molecular dynamics package.
      PubMed
      Mermelstein, Daniel J; Lin, Charles; Nelson, Gard; Kretsch, Rachael; McCammon, J Andrew; Walker, Ross C
         2018-07-15
         Alchemical free energy (AFE) calculations based on molecular dynamics (MD) simulations are key tools in both improving our understanding of a wide variety of biological processes and accelerating the design and optimization of therapeutics for numerous diseases. Computing power and theory have, however, long been insufficient to enable AFE calculations to be routinely applied in early stage drug discovery. One of the major difficulties in performing AFE calculations is the length of time required for calculations to converge to an ensemble average. CPU implementations of MD-based free energy algorithms can effectively only reach tens of nanoseconds per day for systems on the order of 50,000 atoms, even running on massively parallel supercomputers. Therefore, converged free energy calculations on large numbers of potential lead compounds are often untenable, preventing researchers from gaining crucial insight into molecular recognition, potential druggability and other crucial areas of interest. Graphics Processing Units (GPUs) can help address this. We present here a seamless GPU implementation, within the PMEMD module of the AMBER molecular dynamics package, of thermodynamic integration (TI) capable of reaching speeds of >140 ns/day for a 44,907-atom system, with accuracy equivalent to the existing CPU implementation in AMBER. The implementation described here is currently part of the AMBER 18 beta code and will be an integral part of the upcoming version 18 release of AMBER. © 2018 Wiley Periodicals, Inc.
      

      
      CoNNeCT Baseband Processor Module Boot Code SoftWare (BCSW)
      NASA Technical Reports Server (NTRS)
      Yamamoto, Clifford K.; Orozco, David S.; Byrne, D. J.; Allen, Steven J.; Sahasrabudhe, Adit; Lang, Minh
         2012-01-01
         This software provides essential startup and initialization routines for the CoNNeCT baseband processor module (BPM) hardware upon power-up. A command and data handling (C&DH) interface is provided via 1553 and diagnostic serial interfaces to invoke operational, reconfiguration, and test commands within the code. The BCSW has features unique to the hardware it is responsible for managing. In this case, the CoNNeCT BPM is configured with an updated CPU (Atmel AT697 SPARC processor) and a unique set of memory and I/O peripherals that require customized software to operate. These features include configuration of new AT697 registers, interfacing to a new HouseKeeper with a flash controller interface, a new dual Xilinx configuration/scrub interface, and an updated 1553 remote terminal (RT) core. The BCSW is intended to provide a "safe" mode for the BPM when initially powered on or when an unexpected trap occurs, causing the processor to reset. The BCSW allows the 1553 bus controller in the spacecraft or payload controller to operate the BPM over 1553 to upload code; upload Xilinx bit files; perform rudimentary tests; read, write, and copy the non-volatile flash memory; and configure the Xilinx interface. Commands also exist over 1553 to cause the CPU to jump or call a specified address to begin execution of user-supplied code. This may be in the form of a real-time operating system, test routine, or specific application code to run on the BPM.
      

      
      Accelerating cardiac bidomain simulations using graphics processing units.
      PubMed
      Neic, A; Liebmann, M; Hoetzl, E; Mitchell, L; Vigmond, E J; Haase, G; Plank, G
         2012-08-01
         Anatomically realistic and biophysically detailed multiscale computer models of the heart are playing an increasingly important role in advancing our understanding of integrated cardiac function in health and disease. Such detailed simulations, however, are computationally vastly demanding, which is a limiting factor for a wider adoption of in-silico modeling. While current trends in high-performance computing (HPC) hardware promise to alleviate this problem, exploiting the potential of such architectures remains challenging since strongly scalable algorithms are necessitated to reduce execution times. Alternatively, acceleration technologies such as graphics processing units (GPUs) are being considered. While the potential of GPUs has been demonstrated in various applications, benefits in the context of bidomain simulations where large sparse linear systems have to be solved in parallel with advanced numerical techniques are less clear. In this study, the feasibility of multi-GPU bidomain simulations is demonstrated by running strong scalability benchmarks using a state-of-the-art model of rabbit ventricles. The model is spatially discretized using the finite element methods (FEM) on fully unstructured grids. The GPU code is directly derived from a large pre-existing code, the Cardiac Arrhythmia Research Package (CARP), with very minor perturbation of the code base. Overall, bidomain simulations were sped up by a factor of 11.8 to 16.3 in benchmarks running on 6-20 GPUs compared to the same number of CPU cores. To match the fastest GPU simulation which engaged 20 GPUs, 476 CPU cores were required on a national supercomputing facility.
      

      
      Accelerating Cardiac Bidomain Simulations Using Graphics Processing Units
      PubMed Central
      Neic, Aurel; Liebmann, Manfred; Hoetzl, Elena; Mitchell, Lawrence; Vigmond, Edward J.; Haase, Gundolf
         2013-01-01
         Anatomically realistic and biophysically detailed multiscale computer models of the heart are playing an increasingly important role in advancing our understanding of integrated cardiac function in health and disease. Such detailed simulations, however, are computationally vastly demanding, which is a limiting factor for a wider adoption of in-silico modeling. While current trends in high-performance computing (HPC) hardware promise to alleviate this problem, exploiting the potential of such architectures remains challenging since strongly scalable algorithms are necessitated to reduce execution times. Alternatively, acceleration technologies such as graphics processing units (GPUs) are being considered. While the potential of GPUs has been demonstrated in various applications, benefits in the context of bidomain simulations where large sparse linear systems have to be solved in parallel with advanced numerical techniques are less clear. In this study, the feasibility of multi-GPU bidomain simulations is demonstrated by running strong scalability benchmarks using a state-of-the-art model of rabbit ventricles. The model is spatially discretized using the finite element methods (FEM) on fully unstructured grids. The GPU code is directly derived from a large pre-existing code, the Cardiac Arrhythmia Research Package (CARP), with very minor perturbation of the code base. Overall, bidomain simulations were sped up by a factor of 11.8 to 16.3 in benchmarks running on 6–20 GPUs compared to the same number of CPU cores. To match the fastest GPU simulation which engaged 20 GPUs, 476 CPU cores were required on a national supercomputing facility. PMID:22692867
      

      
      Numericware i: Identical by State Matrix Calculator
      PubMed Central
      Kim, Bongsong; Beavis, William D
         2017-01-01
         We introduce software, Numericware i, to compute identical by state (IBS) matrix based on genotypic data. Calculating an IBS matrix with a large dataset requires large computer memory and takes lengthy processing time. Numericware i addresses these challenges with 2 algorithmic methods: multithreading and forward chopping. The multithreading allows computational routines to concurrently run on multiple central processing unit (CPU) processors. The forward chopping addresses memory limitation by dividing a dataset into appropriately sized subsets. Numericware i allows calculation of the IBS matrix for a large genotypic dataset using a laptop or a desktop computer. For comparison with different software, we calculated genetic relationship matrices using Numericware i, SPAGeDi, and TASSEL with the same genotypic dataset. Numericware i calculates IBS coefficients between 0 and 2, whereas SPAGeDi and TASSEL produce different ranges of values including negative values. The Pearson correlation coefficient between the matrices from Numericware i and TASSEL was high at .9972, whereas SPAGeDi showed low correlation with Numericware i (.0505) and TASSEL (.0587). With a high-dimensional dataset of 500 entities by 10 000 000 SNPs, Numericware i spent 382 minutes using 19 CPU threads and 64 GB memory by dividing the dataset into 3 pieces, whereas SPAGeDi and TASSEL failed with the same dataset. Numericware i is freely available for Windows and Linux under CC-BY 4.0 license at https://figshare.com/s/f100f33a8857131eb2db. PMID:28469375
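The chunking idea described in this abstract can be sketched as follows. This is a minimal illustration, not Numericware i's actual code: it assumes genotypes coded as 0/1/2 allele counts and a commonly used allele-sharing score of `2 - |g_i - g_j|` averaged over SNPs (which does fall between 0 and 2), and the function name `ibs_matrix_chunked` and its `chunk_size` parameter are hypothetical. Multithreading is omitted.

```python
from typing import List, Sequence


def ibs_matrix_chunked(genotypes: Sequence[Sequence[int]],
                       chunk_size: int) -> List[List[float]]:
    """Average allele sharing between every pair of individuals.

    genotypes[i][s] is the 0/1/2 allele count of individual i at SNP s.
    SNP columns are processed chunk_size at a time, so only one slice of
    the dataset has to be held in working memory at once.
    """
    n = len(genotypes)
    n_snps = len(genotypes[0])
    totals = [[0.0] * n for _ in range(n)]
    for start in range(0, n_snps, chunk_size):
        stop = min(start + chunk_size, n_snps)
        chunk = [row[start:stop] for row in genotypes]
        for i in range(n):
            for j in range(i, n):
                # Assumed score: 2 shared alleles minus the count difference.
                shared = sum(2 - abs(a - b) for a, b in zip(chunk[i], chunk[j]))
                totals[i][j] += shared
                totals[j][i] = totals[i][j]  # keep the matrix symmetric
    return [[value / n_snps for value in row] for row in totals]


example = [[0, 1, 2, 2], [0, 1, 2, 2], [2, 1, 0, 0]]
matrix = ibs_matrix_chunked(example, chunk_size=3)
```

Because the per-chunk sums are simply accumulated, the result does not depend on the chunk size; only peak memory use does.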
      

      
      Toward GPGPU accelerated human electromechanical cardiac simulations
      PubMed Central
      Vigueras, Guillermo; Roy, Ishani; Cookson, Andrew; Lee, Jack; Smith, Nicolas; Nordsletten, David
         2014-01-01
         In this paper, we look at the acceleration of weakly coupled electromechanics using the graphics processing unit (GPU). Specifically, we port to the GPU a number of components of Heart—a CPU-based finite element code developed for simulating multi-physics problems. On the basis of a criterion of computational cost, we implemented on the GPU the ODE and PDE solution steps for the electrophysiology problem and the Jacobian and residual evaluation for the mechanics problem. Performance of the GPU implementation is then compared with single core CPU (SC) execution as well as multi-core CPU (MC) computations with equivalent theoretical performance. Results show that for a human scale left ventricle mesh, GPU acceleration of the electrophysiology problem provided speedups of 164 × compared with SC and 5.5 times compared with MC for the solution of the ODE model. Speedup of up to 72 × compared with SC and 2.6 × compared with MC was also observed for the PDE solve. Using the same human geometry, the GPU implementation of mechanics residual/Jacobian computation provided speedups of up to 44 × compared with SC and 2.0 × compared with MC. © 2013 The Authors. International Journal for Numerical Methods in Biomedical Engineering published by John Wiley & Sons, Ltd. PMID:24115492
      

      
      Structural optimization with approximate sensitivities
      NASA Technical Reports Server (NTRS)
      Patnaik, S. N.; Hopkins, D. A.; Coroneos, R.
         1994-01-01
         Computational efficiency in structural optimization can be enhanced if the intensive computations associated with the calculation of the sensitivities, that is, gradients of the behavior constraints, are reduced. Approximation to gradients of the behavior constraints that can be generated with small amount of numerical calculations is proposed. Structural optimization with these approximate sensitivities produced correct optimum solution. Approximate gradients performed well for different nonlinear programming methods, such as the sequence of unconstrained minimization technique, method of feasible directions, sequence of quadratic programming, and sequence of linear programming. Structural optimization with approximate gradients can reduce by one third the CPU time that would otherwise be required to solve the problem with explicit closed-form gradients. The proposed gradient approximation shows potential to reduce intensive computation that has been associated with traditional structural optimization.
      

      
      Accelerating Large Scale Image Analyses on Parallel, CPU-GPU Equipped Systems
      PubMed Central
      Teodoro, George; Kurc, Tahsin M.; Pan, Tony; Cooper, Lee A.D.; Kong, Jun; Widener, Patrick; Saltz, Joel H.
         2014-01-01
         The past decade has witnessed a major paradigm shift in high performance computing with the introduction of accelerators as general purpose processors. These computing devices make available very high parallel computing power at low cost and power consumption, transforming current high performance platforms into heterogeneous CPU-GPU equipped systems. Although the theoretical performance achieved by these hybrid systems is impressive, taking practical advantage of this computing power remains a very challenging problem. Most applications are still deployed to either GPU or CPU, leaving the other resource under- or un-utilized. In this paper, we propose, implement, and evaluate a performance aware scheduling technique along with optimizations to make efficient collaborative use of CPUs and GPUs on a parallel system. In the context of feature computations in large scale image analysis applications, our evaluations show that intelligently co-scheduling CPUs and GPUs can significantly improve performance over GPU-only or multi-core CPU-only approaches. PMID:25419545
      

      
      Brain and behaviour phenotyping of a mouse model of neurofibromatosis type-1: an MRI/DTI study on social cognition.
      PubMed
      Petrella, L I; Cai, Y; Sereno, J V; Gonçalves, S I; Silva, A J; Castelo-Branco, M
         2016-09-01
         Neurofibromatosis type-1 (NF1) is a common neurogenetic disorder and an important cause of intellectual disability. Brain-behaviour associations can be examined in vivo using morphometric magnetic resonance imaging (MRI) and diffusion tensor imaging (DTI) to study brain structure. Here, we studied structural and behavioural phenotypes in heterozygous Nf1 mice (Nf1(+/-) ) using T2-weighted imaging MRI and DTI, with a focus on social recognition deficits. We found that Nf1(+/-) mice have larger volumes than wild-type (WT) mice in regions of interest involved in social cognition, the prefrontal cortex (PFC) and the caudate-putamen (CPu). Higher diffusivity was found across a distributed network of cortical and subcortical brain regions, within and beyond these regions. Significant differences were observed for the social recognition test. Most importantly, significant structure-function correlations were identified concerning social recognition performance and PFC volumes in Nf1(+/-) mice. Analyses of spatial learning corroborated the previously known deficits in the mutant mice, as corroborated by platform crossings, training quadrant time and average proximity measures. Moreover, linear discriminant analysis of spatial performance identified 2 separate sub-groups in Nf1(+/-) mice. A significant correlation between quadrant time and CPu volumes was found specifically for the sub-group of Nf1(+/-) mice with lower spatial learning performance, suggesting additional evidence for reorganization of this region. We found strong evidence that social and spatial cognition deficits can be associated with PFC/CPu structural changes and reorganization in NF1. © 2016 John Wiley & Sons Ltd and International Behavioural and Neural Genetics Society.
      

      
      SU-E-T-493: Accelerated Monte Carlo Methods for Photon Dosimetry Using a Dual-GPU System and CUDA.
      PubMed
      Liu, T; Ding, A; Xu, X
         2012-06-01
         To develop a Graphics Processing Unit (GPU) based Monte Carlo (MC) code that accelerates dose calculations on a dual-GPU system. We simulated a clinical case of prostate cancer treatment. A voxelized abdomen phantom derived from 120 CT slices was used containing 218×126×60 voxels, and a GE LightSpeed 16-MDCT scanner was modeled. A CPU version of the MC code was first developed in C++ and tested on an Intel Xeon X5660 2.8GHz CPU, then it was translated into a GPU version using CUDA C 4.1 and run on a dual Tesla M2090 GPU system. The code featured automatic assignment of simulation tasks to multiple GPUs, as well as accurate calculation of energy- and material-dependent cross-sections. Double-precision floating point format was used for accuracy. Doses to the rectum, prostate, bladder and femoral heads were calculated. When running on a single GPU, the MC GPU code was found to be 19 times faster than the CPU code and 42 times faster than MCNPX. These speedup factors were doubled on the dual-GPU system. The dose result was benchmarked against MCNPX and a maximum difference of 1% was observed when the relative error is kept below 0.1%. A GPU-based MC code was developed for dose calculations using detailed patient and CT scanner models. Efficiency and accuracy were both guaranteed in this code. Scalability of the code was confirmed on the dual-GPU system. © 2012 American Association of Physicists in Medicine.
      

      
      GPU acceleration towards real-time image reconstruction in 3D tomographic diffractive microscopy
      NASA Astrophysics Data System (ADS)
      Bailleul, J.; Simon, B.; Debailleul, M.; Liu, H.; Haeberlé, O.
         2012-06-01
         Phase microscopy techniques regained interest in allowing for the observation of unprepared specimens with excellent temporal resolution. Tomographic diffractive microscopy is an extension of holographic microscopy which permits 3D observations with a finer resolution than incoherent light microscopes. Specimens are imaged by a series of 2D holograms: their accumulation progressively fills the range of frequencies of the specimen in Fourier space. A 3D inverse FFT eventually provides a spatial image of the specimen. Consequently, acquisition then reconstruction are mandatory to produce an image that could prelude real-time control of the observed specimen. The MIPS Laboratory has built a tomographic diffractive microscope with an unsurpassed 130nm resolution but a low imaging speed - no less than one minute. Afterwards, a high-end PC reconstructs the 3D image in 20 seconds. We now expect an interactive system providing preview images during the acquisition for monitoring purposes. We first present a prototype implementing this solution on CPU: acquisition and reconstruction are tied in a producer-consumer scheme, sharing common data into CPU memory. Then we present a prototype dispatching some reconstruction tasks to GPU in order to take advantage of SIMD parallelization for FFT and higher bandwidth for filtering operations. The CPU scheme takes 6 seconds for a 3D image update while the GPU scheme can go down to 2 or > 1 seconds depending on the GPU class. This opens opportunities for 4D imaging of living organisms or crystallization processes. We also consider the relevance of GPU for 3D image interaction in our specific conditions.
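The producer-consumer coupling of acquisition and reconstruction mentioned in this abstract can be illustrated with a short sketch. It is not the MIPS prototype: holograms are replaced by integer placeholders, the inverse FFT by a counter, and all names (`acquire`, `reconstruct`, `run`, `preview_every`) are hypothetical.

```python
import queue
import threading


def acquire(n_holograms, buffer):
    """Producer: stands in for the camera, emitting one hologram at a time."""
    for index in range(n_holograms):
        buffer.put(index)      # a real system would enqueue image data
    buffer.put(None)           # sentinel: acquisition has finished


def reconstruct(buffer, previews, preview_every):
    """Consumer: accumulates holograms and records a preview periodically."""
    accumulated = 0
    while True:
        item = buffer.get()
        if item is None:
            break
        accumulated += 1
        if accumulated % preview_every == 0:
            # Placeholder for the 3D inverse FFT that would yield a preview.
            previews.append(accumulated)


def run(n_holograms=10, preview_every=4):
    # Bounded queue: acquisition blocks instead of running far ahead.
    buffer = queue.Queue(maxsize=2)
    previews = []
    consumer = threading.Thread(target=reconstruct,
                                args=(buffer, previews, preview_every))
    producer = threading.Thread(target=acquire, args=(n_holograms, buffer))
    consumer.start()
    producer.start()
    producer.join()
    consumer.join()
    return previews


previews = run(10, 4)  # previews after the 4th and 8th hologram
```

The bounded queue is the design choice worth noting: it lets reconstruction start while acquisition is still running, yet caps how much shared memory the pending holograms can occupy.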
      

      
      Managing Contention and Timing Constraints in a Real-Time Database System
      DTIC Science & Technology
      
         1995-01-01
         In order to realize many of these goals, StarBase is constructed on top of RT-Mach, a real-time operating system developed at Carnegie Mellon...University [11]. StarBase differs from previous RT-DBMS work [1, 2, 3] in that a) it relies on a real-time operating system which provides priority...CPU and resource scheduling provided by the underlying real-time operating system. Issues of data contention are dealt with by use of a priority
      

      
      Accelerating next generation sequencing data analysis with system level optimizations.
      PubMed
      Kathiresan, Nagarajan; Temanni, Ramzi; Almabrazi, Hakeem; Syed, Najeeb; Jithesh, Puthen V; Al-Ali, Rashid
         2017-08-22
         Next generation sequencing (NGS) data analysis is highly compute intensive. In-memory computing, vectorization, bulk data transfer, and CPU frequency scaling are some of the hardware features in modern computing architectures. To get the best execution time and utilize these hardware features, it is necessary to tune the system level parameters before running the application. We studied the GATK-HaplotypeCaller, which is part of common NGS workflows and consumes more than 43% of the total execution time. Multiple GATK 3.x versions were benchmarked and the execution time of HaplotypeCaller was optimized by various system level parameters which included: (i) tuning the parallel garbage collection and kernel shared memory to simulate in-memory computing, (ii) architecture-specific tuning in the PairHMM library for vectorization, (iii) including Java 1.8 features through GATK source code compilation and building a runtime environment for parallel sorting and bulk data transfer, and (iv) replacing the default 'on-demand' CPU frequency mode with 'performance-mode' to accelerate the Java multi-threads. As a result, the HaplotypeCaller execution time was reduced by 82.66% in GATK 3.3 and 42.61% in GATK 3.7. Overall, the execution time of NGS pipeline was reduced to 70.60% and 34.14% for GATK 3.3 and GATK 3.7 respectively.
      

      
      HACC: Extreme Scaling and Performance Across Diverse Architectures
      NASA Astrophysics Data System (ADS)
      Habib, Salman; Morozov, Vitali; Frontiere, Nicholas; Finkel, Hal; Pope, Adrian; Heitmann, Katrin
         2013-11-01
         Supercomputing is evolving towards hybrid and accelerator-based architectures with millions of cores. The HACC (Hardware/Hybrid Accelerated Cosmology Code) framework exploits this diverse landscape at the largest scales of problem size, obtaining high scalability and sustained performance. Developed to satisfy the science requirements of cosmological surveys, HACC melds particle and grid methods using a novel algorithmic structure that flexibly maps across architectures, including CPU/GPU, multi/many-core, and Blue Gene systems. We demonstrate the success of HACC on two very different machines, the CPU/GPU system Titan and the BG/Q systems Sequoia and Mira, attaining unprecedented levels of scalable performance. We demonstrate strong and weak scaling on Titan, obtaining up to 99.2% parallel efficiency, evolving 1.1 trillion particles. On Sequoia, we reach 13.94 PFlops (69.2% of peak) and 90% parallel efficiency on 1,572,864 cores, with 3.6 trillion particles, the largest cosmological benchmark yet performed. HACC design concepts are applicable to several other supercomputer applications.
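The performance figures quoted in this abstract can be related with a little arithmetic. The sketch below is illustrative only: `implied_peak` simply divides the sustained rate by the stated fraction of peak, and `parallel_efficiency` uses the usual strong-scaling definition (speedup divided by the increase in core count). The timings passed to `parallel_efficiency` are made-up numbers chosen to give an efficiency near 99.2%; they are not taken from the HACC paper.

```python
def implied_peak(sustained_pflops: float, fraction_of_peak: float) -> float:
    """Peak rate implied by a sustained rate and its fraction of peak."""
    return sustained_pflops / fraction_of_peak


def parallel_efficiency(t_base: float, n_base: int,
                        t_scaled: float, n_scaled: int) -> float:
    """Strong-scaling efficiency relative to a baseline run.

    speedup    = t_base / t_scaled
    efficiency = speedup / (n_scaled / n_base)
    """
    speedup = t_base / t_scaled
    return speedup / (n_scaled / n_base)


# 13.94 PFlops at 69.2% of peak implies a peak of roughly 20.1 PFlops.
sequoia_peak = implied_peak(13.94, 0.692)

# Hypothetical timings: 100 s on 1000 cores versus 12.6 s on 8000 cores.
efficiency = parallel_efficiency(100.0, 1000, 12.6, 8000)
```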
      

      
      Modeling and docking antibody structures with Rosetta
      PubMed Central
      Weitzner, Brian D.; Jeliazkov, Jeliazko R.; Lyskov, Sergey; Marze, Nicholas; Kuroda, Daisuke; Frick, Rahel; Adolf-Bryfogle, Jared; Biswas, Naireeta; Dunbrack, Roland L.; Gray, Jeffrey J.
         2017-01-01
         We describe Rosetta-based computational protocols for predicting the three-dimensional structure of an antibody from sequence (RosettaAntibody) and then docking the antibody to protein antigens (SnugDock). Antibody modeling leverages canonical loop conformations to graft large segments from experimentally-determined structures as well as (1) energetic calculations to minimize loops, (2) docking methodology to refine the VL–VH relative orientation, and (3) de novo prediction of the elusive complementarity determining region (CDR) H3 loop. To alleviate model uncertainty, antibody–antigen docking resamples CDR loop conformations and can use multiple models to represent an ensemble of conformations for the antibody, the antigen or both. These protocols can be run fully-automated via the ROSIE web server (http://rosie.rosettacommons.org/) or manually on a computer with user control of individual steps. For best results, the protocol requires roughly 1,000 CPU-hours for antibody modeling and 250 CPU-hours for antibody–antigen docking. Tasks can be completed in under a day by using public supercomputers. PMID:28125104
      

        
       
          


          
        

           
           
             
               
      
      Tier-2 Optimisation for Computational Density/Diversity and Big Data
      NASA Astrophysics Data System (ADS)
      Fay, R. B.; Bland, J.
         2014-06-01
         As the number of cores on chip continues to trend upwards and new CPU architectures emerge, increasing CPU density and diversity presents multiple challenges to site administrators. These include scheduling for massively multi-core systems (potentially including Graphical Processing Units (GPU), integrated and dedicated, and Many Integrated Core (MIC)) to ensure a balanced throughput of jobs while preserving overall cluster throughput, as well as the increasing complexity of developing for these heterogeneous platforms, and the challenge in managing this more complex mix of resources. In addition, meeting data demands as both dataset sizes increase and as the rate of demand scales with increased computational power requires additional performance from the associated storage elements. In this report, we evaluate one emerging technology, Solid State Drive (SSD) caching for RAID controllers, with consideration to its potential to assist in meeting evolving demand. We also briefly consider the broader developing trends outlined above in order to identify issues that may develop and assess what actions should be taken in the immediate term to address those.
      

      
      GPU-based efficient realistic techniques for bleeding and smoke generation in surgical simulators.
      PubMed
      Halic, Tansel; Sankaranarayanan, Ganesh; De, Suvranu
         2010-12-01
         In actual surgery, smoke and bleeding due to cauterization processes provide important visual cues to the surgeon, which have been proposed as factors in surgical skill assessment. While several virtual reality (VR)-based surgical simulators have incorporated the effects of bleeding and smoke generation, they are not realistic due to the requirement of real-time performance. To be interactive, visual update must be performed at at least 30 Hz and haptic (touch) information must be refreshed at 1 kHz. Simulation of smoke and bleeding is, therefore, either ignored or simulated using highly simplified techniques, since other computationally intensive processes compete for the available Central Processing Unit (CPU) resources. In this study we developed a novel low-cost method to generate realistic bleeding and smoke in VR-based surgical simulators, which outsources the computations to the graphical processing unit (GPU), thus freeing up the CPU for other time-critical tasks. This method is independent of the complexity of the organ models in the virtual environment. User studies were performed using 20 subjects to determine the visual quality of the simulations compared to real surgical videos. The smoke and bleeding simulation were implemented as part of a laparoscopic adjustable gastric banding (LAGB) simulator. For the bleeding simulation, the original implementation using the shader did not incur noticeable overhead. However, for smoke generation, an input/output (I/O) bottleneck was observed and two different methods were developed to overcome this limitation. Based on our benchmark results, a buffered approach performed better than a pipelined approach and could support up to 15 video streams in real time. Human subject studies showed that the visual realism of the simulations were as good as in real surgery (median rating of 4 on a 5-point Likert scale). 
Based on the performance results and subject study, both bleeding and smoke simulations were concluded to be efficient, highly realistic and well suited to VR-based surgical simulators. Copyright © 2010 John Wiley & Sons, Ltd.
      

      
      GPU-based Efficient Realistic Techniques for Bleeding and Smoke Generation in Surgical Simulators
      PubMed Central
      Halic, Tansel; Sankaranarayanan, Ganesh; De, Suvranu
         2010-01-01
         Background In actual surgery, smoke and bleeding due to cautery processes provide important visual cues to the surgeon which have been proposed as factors in surgical skill assessment. While several virtual reality (VR)-based surgical simulators have incorporated effects of bleeding and smoke generation, they are not realistic due to the requirement of real time performance. To be interactive, visual update must be performed at at least 30 Hz and haptic (touch) information must be refreshed at 1 kHz. Simulation of smoke and bleeding is, therefore, either ignored or simulated using highly simplified techniques since other computationally intensive processes compete for the available CPU resources. Methods In this work, we develop a novel low-cost method to generate realistic bleeding and smoke in VR-based surgical simulators which outsources the computations to the graphical processing unit (GPU), thus freeing up the CPU for other time-critical tasks. This method is independent of the complexity of the organ models in the virtual environment. User studies were performed using 20 subjects to determine the visual quality of the simulations compared to real surgical videos. Results The smoke and bleeding simulation were implemented as part of a Laparoscopic Adjustable Gastric Banding (LAGB) simulator. For the bleeding simulation, the original implementation using the shader did not incur noticeable overhead. However, for smoke generation, an I/O (Input/Output) bottleneck was observed and two different methods were developed to overcome this limitation. Based on our benchmark results, a buffered approach performed better than a pipelined approach and could support up to 15 video streams in real time. Human subject studies showed that the visual realism of the simulations were as good as in real surgery (median rating of 4 on a 5-point Likert scale). 
Conclusions Based on the performance results and subject study, both bleeding and smoke simulations were concluded to be efficient, highly realistic and well suited in VR-based surgical simulators. PMID:20878651
      

      
      A graphics-card implementation of Monte-Carlo simulations for cosmic-ray transport
      NASA Astrophysics Data System (ADS)
      Tautz, R. C.
         2016-05-01
         A graphics card implementation of a test-particle simulation code is presented that is based on the CUDA extension of the C/C++ programming language. The original CPU version has been developed for the calculation of cosmic-ray diffusion coefficients in artificial Kolmogorov-type turbulence. In the new implementation, the magnetic turbulence generation, which is the most time-consuming part, is separated from the particle transport and is performed on a graphics card. In this article, the modification of the basic approach of integrating test particle trajectories to employ the SIMD (single instruction, multiple data) model is presented and verified. The efficiency of the new code is tested and several language-specific accelerating factors are discussed. For the example of isotropic magnetostatic turbulence, sample results are shown and a comparison to the results of the CPU implementation is performed.
      

      
      Mitigating the Insider Threat with High-Dimensional Anomaly Detection
      DTIC Science & Technology
      
         2004-12-01
         a more serious attack. Various systems such as NSM [56], GrIDS [57], snort [58], Emerald [59], and Spice [60] generate alerts for portscan...reboot etc. The user measurements include the user profiles such as time of login, duration of user session, cumulative CPU time, names of files...already been implemented in a real-time system for information retrieval [3]. A technique developed at SRI in the Emerald system [22] uses historical
      

      
      Meeting the Challenge of Distributed Real-Time & Embedded (DRE) Systems
      DTIC Science & Technology
      
         2012-05-10
         IP RTOS Middleware Middleware Services DRE Applications Operating Sys & Protocols Hardware & Networks Middleware Middleware Services DRE...Services COTS & standards-based middleware, language, OS , network, & hardware platforms • Real-time CORBA (TAO) middleware • ADAPTIVE Communication...SPLs) F-15 product variant A/V 8-B product variant F/A 18 product variant UCAV product variant Software Produce-Line Hardware (CPU, Memory, I/O) OS
      

      
      The optical design of 3D ICs for smartphone and optro-electronics sensing module
      NASA Astrophysics Data System (ADS)
      Huang, Jiun-Woei
         2018-03-01
         Smartphones offer limited space for imaging systems. Current smartphone lenses are of the refractive type, so their effective focal length is limited by the thickness of the phone. Other components, such as optro-electronics sensing chips, proximity optical sensors, and UV indexer chips, must also be integrated into the smartphone within limited space. Because future smartphones will require multiple lenses, proximity optical sensors, UV indexers, and other optro-electronics sensing chips within the limited space of the CPU board, optro-electronics 3D ICs integrated with optical lenses or components may be a key technology for 3C products. A reflective lens design is fitted to CMOS, proximity optical sensors, UV indexer, and other optro-electronics sensing chips based on a 3D IC. The reflective lens can reach three times the effective focal length and is able to resolve small objects. The system can be assembled and integrated into one 3D IC more easily.
      

      
      Three dimensional unstructured multigrid for the Euler equations
      NASA Technical Reports Server (NTRS)
      Mavriplis, D. J.
         1991-01-01
         The three dimensional Euler equations are solved on unstructured tetrahedral meshes using a multigrid strategy. The driving algorithm consists of an explicit vertex-based finite element scheme, which employs an edge-based data structure to assemble the residuals. The multigrid approach employs a sequence of independently generated coarse and fine meshes to accelerate the convergence to steady-state of the fine grid solution. Variables, residuals and corrections are passed back and forth between the various grids of the sequence using linear interpolation. The addresses and weights for interpolation are determined in a preprocessing stage using an efficient graph traversal algorithm. The preprocessing operation is shown to require a negligible fraction of the CPU time required by the overall solution procedure, while gains in overall solution efficiencies greater than an order of magnitude are demonstrated on meshes containing up to 350,000 vertices. Solutions using globally regenerated fine meshes as well as adaptively refined meshes are given.
      

      
      Accurate acceleration of kinetic Monte Carlo simulations through the modification of rate constants.
      PubMed
      Chatterjee, Abhijit; Voter, Arthur F
         2010-05-21
         We present a novel computational algorithm called the accelerated superbasin kinetic Monte Carlo (AS-KMC) method that enables a more efficient study of rare-event dynamics than the standard KMC method while maintaining control over the error. In AS-KMC, the rate constants for processes that are observed many times are lowered during the course of a simulation. As a result, rare processes are observed more frequently than in KMC and the time progresses faster. We first derive error estimates for AS-KMC when the rate constants are modified. These error estimates are next employed to develop a procedure for lowering process rates with control over the maximum error. Finally, numerical calculations are performed to demonstrate that the AS-KMC method captures the correct dynamics, while providing significant CPU savings over KMC in most cases. We show that the AS-KMC method can be employed with any KMC model, even when no time scale separation is present (although in such cases no computational speed-up is observed), without requiring the knowledge of various time scales present in the system.
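For orientation, a single step of the standard rejection-free KMC method that the abstract uses as its baseline can be sketched as follows. This is a minimal illustration and not the authors' AS-KMC algorithm; the function name `kmc_step` and its arguments are hypothetical. It shows why lowering the rate constants of frequently observed fast processes makes the clock advance faster: the mean time increment is the reciprocal of the total rate.

```python
import bisect
import itertools
import math
import random


def kmc_step(rates, rng):
    """One step of standard rejection-free kinetic Monte Carlo.

    rates: non-negative rate constants, one per possible process.
    rng:   a random.Random instance.
    Returns (index of the chosen process, time increment).
    """
    cumulative = list(itertools.accumulate(rates))
    total = cumulative[-1]
    if total <= 0.0:
        raise ValueError("at least one rate must be positive")
    # Choose process i with probability rates[i] / total.
    target = rng.random() * total
    chosen = min(bisect.bisect_right(cumulative, target), len(rates) - 1)
    # Waiting time is exponentially distributed with mean 1 / total.
    dt = -math.log(1.0 - rng.random()) / total
    return chosen, dt


rng = random.Random(0)
process, dt = kmc_step([0.0, 5.0], rng)  # the zero-rate process is never chosen
```

With rates of, say, `[1000.0, 1.0]`, the rare second process is picked about once in a thousand steps and each step advances time by roughly 1/1001 on average; reducing the first rate raises both figures, which is the effect the abstract describes.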
      

      
      Numerical Procedures for Inlet/Diffuser/Nozzle Flows
      NASA Technical Reports Server (NTRS)
      Rubin, Stanley G.
         1998-01-01
         Two primitive variable, pressure based, flux-split, RNS/NS solution procedures for viscous flows are presented. Both methods are uniformly valid across the full Mach number range, i.e., from the incompressible limit to high supersonic speeds. The first method is an 'optimized' version of a previously developed global pressure relaxation RNS procedure. A considerable reduction in the number of relatively expensive matrix inversions, and thereby in the computational time, has been achieved with this procedure. CPU times are reduced by a factor of 15 for predominantly elliptic flows (incompressible and low subsonic). The second method is a time-marching, 'linearized' convection RNS/NS procedure. The key to the efficiency of this procedure is the reduction to a single LU inversion at the inflow cross-plane. The remainder of the algorithm simply requires back-substitution with this LU and the corresponding residual vector at any cross-plane location. This method is not time-consistent, but has a convective-type CFL stability limitation. Both formulations are robust and provide accurate solutions for a variety of internal viscous flows to be provided herein.
      

      
      Accelerating epistasis analysis in human genetics with consumer graphics hardware.
      PubMed
      Sinnott-Armstrong, Nicholas A; Greene, Casey S; Cancare, Fabio; Moore, Jason H
         2009-07-24
         Human geneticists are now capable of measuring more than one million DNA sequence variations from across the human genome. The new challenge is to develop computationally feasible methods capable of analyzing these data for associations with common human disease, particularly in the context of epistasis. Epistasis describes the situation where multiple genes interact in a complex non-linear manner to determine an individual's disease risk and is thought to be ubiquitous for common diseases. Multifactor Dimensionality Reduction (MDR) is an algorithm capable of detecting epistasis. An exhaustive analysis with MDR is often computationally expensive, particularly for high order interactions. This challenge has previously been met with parallel computation and expensive hardware. The option we examine here exploits commodity hardware designed for computer graphics. In modern computers Graphics Processing Units (GPUs) have more memory bandwidth and computational capability than Central Processing Units (CPUs) and are well suited to this problem. Advances in the video game industry have led to an economy of scale creating a situation where these powerful components are readily available at very low cost. Here we implement and evaluate the performance of the MDR algorithm on GPUs. Of primary interest are the time required for an epistasis analysis and the price to performance ratio of available solutions. We found that using MDR on GPUs consistently increased performance per machine over both a feature rich Java software package and a C++ cluster implementation. The performance of a GPU workstation running a GPU implementation reduces computation time by a factor of 160 compared to an 8-core workstation running the Java implementation on CPUs. This GPU workstation performs similarly to 150 cores running an optimized C++ implementation on a Beowulf cluster. 
Furthermore this GPU system provides extremely cost effective performance while leaving the CPU available for other tasks. The GPU workstation containing three GPUs costs $2000 while obtaining similar performance on a Beowulf cluster requires 150 CPU cores which, including the added infrastructure and support cost of the cluster system, cost approximately $82,500. Graphics hardware based computing provides a cost effective means to perform genetic analysis of epistasis using MDR on large datasets without the infrastructure of a computing cluster.
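The price-to-performance comparison in this abstract reduces to simple arithmetic on the quoted figures. The short sketch below only restates those numbers; the variable names are illustrative, and treating the two systems as exactly equal in performance is an assumption taken from the abstract's phrase "performs similarly".

```python
# Figures quoted in the abstract.
gpu_workstation_cost = 2_000    # USD, workstation containing three GPUs
cluster_cost = 82_500           # USD, 150 CPU cores incl. infrastructure and support
cluster_cores = 150

# How many times more the cluster costs for roughly the same throughput.
cost_ratio = cluster_cost / gpu_workstation_cost

# Cost per CPU core on the cluster, and the cost per "equivalent core"
# on the GPU workstation if it is credited with 150 cores of performance.
cluster_cost_per_core = cluster_cost / cluster_cores
gpu_cost_per_equivalent_core = gpu_workstation_cost / cluster_cores

print(cost_ratio, cluster_cost_per_core, round(gpu_cost_per_equivalent_core, 2))
```

Under these assumptions the cluster costs about 41 times as much as the GPU workstation for similar performance.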
      

      
      Purple L1 Milestone Review Panel TotalView Debugger Functionality and Performance for ASC Purple
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Wolfe, M
         2006-12-12
         ASC code teams require a robust software debugging tool to help developers quickly find bugs in their codes and get their codes running. Development debugging commonly runs up to 512 processes. Production jobs run up to full ASC Purple scale, and at times require introspection while running. Developers want a debugger that runs on all their development and production platforms and that works with all compilers and runtimes used with ASC codes. The TotalView Multiprocess Debugger made by Etnus was specified for ASC Purple to address this needed capability. The ASC Purple environment builds on the environment seen by TotalView on ASCI White. The debugger must now operate with the Power5 CPU, Federation switch, AIX 5.3 operating system including large pages, IBM compilers 7 and 9, POE 4.2 parallel environment, and rs6000 SLURM resource manager. Users require robust, basic debugger functionality with acceptable performance at development debugging scale. A TotalView installation must be provided at the beginning of the early user access period that meets these requirements. A functional enhancement, fast conditional data watchpoints, and a scalability enhancement, capability up to 8192 processes, are to be demonstrated.
      

      
      Interactivity vs. fairness in networked linux systems
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Wu, Wenji; Crawford, Matt; /Fermilab
         
         In general, the Linux 2.6 scheduler can ensure fairness and provide excellent interactive performance at the same time. However, our experiments and mathematical analysis have shown that the current Linux interactivity mechanism tends to incorrectly categorize non-interactive network applications as interactive, which can lead to serious fairness or starvation issues. In the extreme, a single process can unjustifiably obtain up to 95% of the CPU! The root causes are that: (1) network packets arrive at the receiver independently and discretely, and the 'relatively fast' non-interactive network process might frequently sleep to wait for packet arrival. Though each sleep lasts for a very short period of time, the wait-for-packet sleeps occur so frequently that they lead to interactive status for the process. (2) The current Linux interactivity mechanism provides the possibility that a non-interactive network process could receive a high CPU share, and at the same time be incorrectly categorized as 'interactive.' In this paper, we propose and test a possible solution to address the interactivity vs. fairness problems. Experimental results have proved the effectiveness of the proposed solution.
      

      
      GPU-Meta-Storms: computing the structure similarities among massive amount of microbial community samples using GPU.
      PubMed
      Su, Xiaoquan; Wang, Xuetao; Jing, Gongchao; Ning, Kang
         2014-04-01
         The number of microbial community samples is increasing with exponential speed. Data-mining among microbial community samples could facilitate the discovery of valuable biological information that is still hidden in the massive data. However, current methods for the comparison among microbial communities are limited by their ability to process large amount of samples each with complex community structure. We have developed an optimized GPU-based software, GPU-Meta-Storms, to efficiently measure the quantitative phylogenetic similarity among massive amount of microbial community samples. Our results have shown that GPU-Meta-Storms would be able to compute the pair-wise similarity scores for 10 240 samples within 20 min, which gained a speed-up of >17 000 times compared with single-core CPU, and >2600 times compared with 16-core CPU. Therefore, the high-performance of GPU-Meta-Storms could facilitate in-depth data mining among massive microbial community samples, and make the real-time analysis and monitoring of temporal or conditional changes for microbial communities possible. GPU-Meta-Storms is implemented by CUDA (Compute Unified Device Architecture) and C++. Source code is available at http://www.computationalbioenergy.org/meta-storms.html.
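The throughput implied by these figures can be worked out directly. The sketch below assumes that "pair-wise similarity scores" means one score per unordered pair of samples and that the full 20 minutes were used; both are assumptions, so the resulting rate is a lower bound rather than a measured value.

```python
n_samples = 10_240
wall_seconds = 20 * 60          # "within 20 min", taken as an upper bound

# One score per unordered pair of distinct samples (assumption).
n_pairs = n_samples * (n_samples - 1) // 2

# Lower bound on the GPU comparison rate.
pairs_per_second = n_pairs / wall_seconds

# Lower bound on the implied single-core CPU time, using the quoted
# speed-up of more than 17,000 times.
single_core_days = 17_000 * wall_seconds / 86_400

print(n_pairs, round(pairs_per_second), round(single_core_days))
```

On these assumptions the run covers about 52 million comparisons, and the same job on a single CPU core would take on the order of several months.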
      

      
      Hotspot detection using image pattern recognition based on higher-order local auto-correlation
      NASA Astrophysics Data System (ADS)
      Maeda, Shimon; Matsunawa, Tetsuaki; Ogawa, Ryuji; Ichikawa, Hirotaka; Takahata, Kazuhiro; Miyairi, Masahiro; Kotani, Toshiya; Nojima, Shigeki; Tanaka, Satoshi; Nakagawa, Kei; Saito, Tamaki; Mimotogi, Shoji; Inoue, Soichi; Nosato, Hirokazu; Sakanashi, Hidenori; Kobayashi, Takumi; Murakawa, Masahiro; Higuchi, Tetsuya; Takahashi, Eiichi; Otsu, Nobuyuki
         2011-04-01
         Below the 40 nm design node, systematic variation due to lithography must be taken into consideration during the early stage of design. So far, litho-aware design using lithography simulation models has been widely applied to assure that designs are printed on silicon without any error. However, the lithography simulation approach is very time consuming, and under time-to-market pressure, repetitive redesign by this approach may cause the market window to be missed. This paper proposes a fast hotspot detection support method by flexible and intelligent vision system image pattern recognition based on Higher-Order Local Autocorrelation. Our method learns the geometrical properties of the given design data without any defects as normal patterns, and automatically detects the design patterns with hotspots from the test data as abnormal patterns. The Higher-Order Local Autocorrelation method can extract features from the graphic image of the design pattern, and the computational cost of the extraction is constant regardless of the number of design pattern polygons. This approach can reduce turnaround time (TAT) dramatically on only 1 CPU, compared with the conventional simulation-based approach, and by distributed processing, this has proven to deliver linear scalability with each additional CPU.
      

      
      Intact intracortical microstimulation (ICMS) representations of rostral and caudal forelimb areas in rats with quinolinic acid lesions of the medial or lateral caudate-putamen in an animal model of Huntington's disease.
      PubMed
      Karl, Jenni M; Sacrey, Lori-Ann R; McDonald, Robert J; Whishaw, Ian Q
         2008-09-05
         Neurotoxic, cell-specific lesions of the rat caudate-putamen (CPu) have been proposed as a model of human Huntington's disease and as such impair performance on many motor tasks, including skilled forelimbs tasks such as reaching for food. Because the CPu and motor cortex share reciprocal connections, it has been proposed that the motor deficits are due in part to a secondary disruption of motor cortex. The purpose of the present study was to examine the functionality of the motor cortex using intracortical microstimulation (ICMS) following neurotoxic lesions of the CPu. ICMS maps have been shown to be sensitive indicators of motor skill, cortical injury, learning, and experience. Long-evans hooded rats received a sham, a medial, or a lateral CPu lesion using the neurotoxin, quinolinic acid (2,3-pyridinedicarboxylic acid). Two weeks later the motor cortex was stimulated under light ketamine anesthesia. Neither lateral nor medial lesions of the CPu altered the stimulation threshold for eliciting forelimb movements, the type of movements elicited, or the size of the rostral forelimb (RFA) and caudal forelimb areas (CFA) from which movements were elicited. The preservation of ICMS forelimb movement representations (the forelimb map) in rats with cell-specific CPu lesions suggests motor impairments following lesions of the lateral striatum are not due to the disruption of the motor map. Therefore, the impairments that follow striatal cell loss are due either to alterations in circuitry that is independent of motor cortex or to alterations in circuitry afferent to the motor cortex projections.
      

      
      Collateral projections of nucleus raphe dorsalis neurones to the caudate-putamen and region around the nucleus raphe magnus and nucleus reticularis gigantocellularis pars alpha in the rat.
      PubMed
      Li, Y Q; Kaneko, T; Mizuno, N
         2001-02-16
         It was examined whether or not the nucleus raphe dorsalis (RD) neurons projecting to the caudate-putamen (CPu) might also project to the motor-controlling region around the nucleus raphe magnus (NRM) and nucleus reticularis gigantocellularis pars alpha (Gia) in the rat. Single RD neurons projecting to the CPu and NRM/Gia by way of axon collaterals were identified by the retrograde double-labeling method with fluorescent dyes, Fast Blue and Diamidino Yellow, which were injected respectively into the CPu and NRM/Gia. Then, serotonin (5-HT)-like immunoreactivity of the double-labeled RD neurons was examined immunohistochemically; approximately 60% of the double-labeled RD neurons showed 5-HT-like immunoreactivity. The results indicated that some of serotonergic and non-serotonergic RD neurons might control motor functions simultaneously at the levels of the CPu and NRM/Gia by way of axon collaterals.
      

      
      First LHCb measurement with data from the LHC Run 2
      NASA Astrophysics Data System (ADS)
      Anderlini, L.; Amerio, S.
         2017-01-01
         LHCb has recently introduced a novel real-time detector alignment and calibration strategy for the Run 2. Data collected at the start of each LHC fill are processed in few minutes and used to update the alignment. On the other hand, the calibration constants will be evaluated for each run of data taking. An increase in the CPU and disk capacity of the event filter farm, combined with improvements to the reconstruction software, allow for efficient, exclusive selections already in the first stage of the High Level Trigger (HLT1), while the second stage, HLT2, performs complete, offline-quality, event reconstruction. In Run 2, LHCb will collect the largest data sample of charm mesons ever recorded. Novel data processing and analysis techniques are required to maximise the physics potential of this data sample with the available computing resources, taking into account data preservation constraints. In this write-up, we describe the full analysis chain used to obtain important results analysing the data collected in proton-proton collisions in 2015, such as the J/ψ and open charm production cross-sections, and consider the further steps required to obtain real-time results after the LHCb upgrade.
      

      
      Fast, Accurate and Shift-Varying Line Projections for Iterative Reconstruction Using the GPU
      PubMed Central
      Pratx, Guillem; Chinn, Garry; Olcott, Peter D.; Levin, Craig S.
         2013-01-01
         List-mode processing provides an efficient way to deal with sparse projections in iterative image reconstruction for emission tomography. An issue often reported is the tremendous amount of computation required by such algorithm. Each recorded event requires several back- and forward line projections. We investigated the use of the programmable graphics processing unit (GPU) to accelerate the line-projection operations and implement fully-3D list-mode ordered-subsets expectation-maximization for positron emission tomography (PET). We designed a reconstruction approach that incorporates resolution kernels, which model the spatially-varying physical processes associated with photon emission, transport and detection. Our development is particularly suitable for applications where the projection data is sparse, such as high-resolution, dynamic, and time-of-flight PET reconstruction. The GPU approach runs more than 50 times faster than an equivalent CPU implementation while image quality and accuracy are virtually identical. This paper describes in details how the GPU can be used to accelerate the line projection operations, even when the lines-of-response have arbitrary endpoint locations and shift-varying resolution kernels are used. A quantitative evaluation is included to validate the correctness of this new approach. PMID:19244015
      

      
      Large-scale ground motion simulation using GPGPU
      NASA Astrophysics Data System (ADS)
      Aoi, S.; Maeda, T.; Nishizawa, N.; Aoki, T.
         2012-12-01
         Huge computational resources are required to perform large-scale ground motion simulations using the 3-D finite difference method (FDM) for realistic and complex models with high accuracy. Furthermore, thousands of different simulations are necessary to evaluate the variability of the assessment caused by uncertainty in the assumptions of the source models for future earthquakes. To overcome the problem of restricted computational resources, we introduced the use of GPGPU (general-purpose computing on graphics processing units), which is the technique of using a GPU as an accelerator for computation that has traditionally been conducted by the CPU. We employed the CPU version of GMS (Ground motion Simulator; Aoi et al., 2004) as the original code and implemented the function for GPU calculation using CUDA (Compute Unified Device Architecture). GMS is a total system for seismic wave propagation simulation based on a 3-D FDM scheme using discontinuous grids (Aoi & Fujiwara, 1999), which includes the solver as well as the preprocessor tools (parameter generation tool) and postprocessor tools (filter tool, visualization tool, and so on). The computational model is decomposed in two horizontal directions and each decomposed model is allocated to a different GPU. We evaluated the performance of our newly developed GPU version of GMS on TSUBAME2.0, which is one of the fastest Japanese supercomputers and is operated by the Tokyo Institute of Technology. First, we performed a strong scaling test using a model with about 22 million grids and achieved speed-ups of 3.2 and 7.3 times using 4 and 16 GPUs. Next, we performed a weak scaling test in which the model size (number of grids) is increased in proportion to the degree of parallelism (number of GPUs). 
The result showed almost perfect linearity up to the simulation with 22 billion grids using 1024 GPUs, where the calculation speed reached 79.7 TFlops, about 34 times faster than the CPU calculation using the same number of cores. Finally, we applied GPU calculation to the simulation of the 2011 Tohoku-oki earthquake. The model was constructed using a slip model from inversion of strong motion data (Suzuki et al., 2012), and a geological- and geophysical-based velocity structure model comprising all the Tohoku and Kanto regions as well as the large source area, which consists of about 1.9 billion grids. The overall characteristics of observed velocity seismograms for periods longer than 8 s were successfully reproduced (Maeda et al., 2012 AGU meeting). The turnaround time for a 50 thousand-step calculation (which corresponds to 416 s of seismogram) using 100 GPUs was 52 minutes, which is fairly short, especially considering that this is the performance for a realistic and complex model.
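The strong-scaling figures above can be turned into parallel efficiencies with one line of arithmetic. The sketch below assumes the quoted speed-ups are measured relative to a single GPU, which the abstract does not state explicitly; the helper name `parallel_efficiency` is illustrative.

```python
def parallel_efficiency(speedup, n_devices):
    """Fraction of ideal linear speed-up actually achieved."""
    return speedup / n_devices


# Strong-scaling figures quoted in the abstract (22 million grid points).
eff_4 = parallel_efficiency(3.2, 4)     # 80% of ideal on 4 GPUs
eff_16 = parallel_efficiency(7.3, 16)   # roughly 46% of ideal on 16 GPUs

# Tohoku-oki run: 50,000 steps in 52 minutes on 100 GPUs, covering 416 s.
steps = 50_000
wall_seconds_per_step = 52 * 60 / steps       # wall-clock time per time step
simulated_seconds_per_step = 416 / steps      # simulated time per time step

print(eff_4, eff_16, wall_seconds_per_step, simulated_seconds_per_step)
```

The drop in efficiency from 4 to 16 GPUs on a fixed-size problem is typical of strong scaling, while the weak-scaling result in the abstract shows near-linear behaviour when the problem grows with the GPU count.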
      

        
       
          


          
        

           
           
             
               
      
      An evaluation of superminicomputers for thermal analysis
      NASA Technical Reports Server (NTRS)
      Storaasli, O. O.; Vidal, J. B.; Jones, G. K.
         1982-01-01
         The use of superminicomputers for solving a series of increasingly complex thermal analysis problems is investigated. The approach involved (1) installation and verification of the SPAR thermal analyzer software on superminicomputers at Langley Research Center and Goddard Space Flight Center, (2) solution of six increasingly complex thermal problems on this equipment, and (3) comparison of solution (accuracy, CPU time, turnaround time, and cost) with solutions on large mainframe computers.
      

      
      Real Time Control of the SSC String Magnets
      NASA Astrophysics Data System (ADS)
      Calvo, O.; Flora, R.; MacPherson, M.
         1987-08-01
         The system described in this paper, called SECAR, was designed to control the excitation of a test string of magnets for the proposed Superconducting Super Collider (SSC) and will be used to upgrade the present Tevatron Excitation, Control and Regulation (TECAR) hardware and software. It resides in a VME crate and is controlled by a 68020/68881 based CPU running the application software under a real time operating system named VRTX.
      

      
      Real time control of the SSC string magnets
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Calvo, O.; Flora, R.; MacPherson, M.
         1987-08-01
         The system described in this paper, called SECAR, was designed to control the excitation of a test string of magnets for the proposed Superconducting Super Collider (SSC) and will be used to upgrade the present Tevatron Excitation, Control and Regulation (TECAR) hardware and software. It resides in a VME crate and is controlled by a 68020/68881 based CPU running the application software under a real time operating system named VRTX.
      

      
      Adaptive Multilevel Middleware for Object Systems
      DTIC Science & Technology
      
         2006-12-01
         the system at the system-call level or using the CORBA-standard Extensible Transport Framework (ETF). Transparent insertion is highly desirable from an...often as it needs to. This is remedied by using the real-time scheduling class in a stock Linux kernel. We used sched_setscheduler system call (with...real-time scheduling class (SCHED_FIFO) for all the ML-NFD programs, later experiments with CPU load indicate that a stock Linux kernel is not
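As an illustration of the system call named in this excerpt, the sketch below requests the SCHED_FIFO real-time class for the calling process through Python's `os` module, which wraps `sched_setscheduler` on Linux. It is not taken from the report: the function name `try_enable_fifo` and the priority value are hypothetical, and the call normally needs root privileges or the CAP_SYS_NICE capability.

```python
import os


def try_enable_fifo(priority=1):
    """Try to place the calling process in SCHED_FIFO; return True on success.

    priority must lie in the SCHED_FIFO range reported by
    os.sched_get_priority_min/max (1..99 on Linux).
    """
    if not hasattr(os, "sched_setscheduler"):
        return False  # scheduler interface not available on this platform
    try:
        # pid 0 means "the calling process".
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
    except OSError:
        return False  # typically insufficient privileges
    return True


if try_enable_fifo():
    print("running under SCHED_FIFO")
else:
    print("could not switch to SCHED_FIFO; staying in the default class")
```

Returning a flag rather than raising keeps the example usable on unprivileged accounts, where the request is expected to be refused.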
      

      
      GPU-based prompt gamma ray imaging from boron neutron capture therapy.
      PubMed
      Yoon, Do-Kun; Jung, Joo-Young; Jo Hong, Key; Sil Lee, Keum; Suk Suh, Tae
         2015-01-01
         The purpose of this research is to perform the fast reconstruction of a prompt gamma ray image using a graphics processing unit (GPU) computation from boron neutron capture therapy (BNCT) simulations. To evaluate the accuracy of the reconstructed image, a phantom including four boron uptake regions (BURs) was used in the simulation. After the Monte Carlo simulation of the BNCT, the modified ordered subset expectation maximization reconstruction algorithm using the GPU computation was used to reconstruct the images with fewer projections. The computation times for image reconstruction were compared between the GPU and the central processing unit (CPU). Also, the accuracy of the reconstructed image was evaluated by a receiver operating characteristic (ROC) curve analysis. The image reconstruction time using the GPU was 196 times faster than the conventional reconstruction time using the CPU. For the four BURs, the area under curve values from the ROC curve were 0.6726 (A-region), 0.6890 (B-region), 0.7384 (C-region), and 0.8009 (D-region). The tomographic image using the prompt gamma ray event from the BNCT simulation was acquired using the GPU computation in order to perform a fast reconstruction during treatment. The authors verified the feasibility of the prompt gamma ray image reconstruction using the GPU computation for BNCT simulations.
      

      
      Multi-GPU implementation of a VMAT treatment plan optimization algorithm
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Tian, Zhen; Folkerts, Michael; Tan, Jun
         
         Purpose: Volumetric modulated arc therapy (VMAT) optimization is a computationally challenging problem due to its large data size, high degrees of freedom, and many hardware constraints. High-performance graphics processing units (GPUs) have been used to speed up the computations. However, the GPU’s relatively small memory cannot handle cases with a large dose-deposition coefficient (DDC) matrix, e.g., those with a large target size, multiple targets, multiple arcs, and/or small beamlet size. The main purpose of this paper is to report an implementation of a column-generation-based VMAT algorithm, previously developed in the authors’ group, on a multi-GPU platform to solve the memory limitation problem. While the column-generation-based VMAT algorithm has been previously developed, the GPU implementation details have not been reported. Hence, another purpose is to present detailed techniques employed for GPU implementation. The authors also would like to utilize this particular problem as an example problem to study the feasibility of using a multi-GPU platform to solve large-scale problems in medical physics. Methods: The column-generation approach generates VMAT apertures sequentially by solving a pricing problem (PP) and a master problem (MP) iteratively. In the authors’ method, the sparse DDC matrix is first stored on a CPU in coordinate list format (COO). On the GPU side, this matrix is split into four submatrices according to beam angles, which are stored on four GPUs in compressed sparse row format. Computation of beamlet price, the first step in PP, is accomplished using multi-GPUs. A fast inter-GPU data transfer scheme is accomplished using peer-to-peer access. The remaining steps of PP and MP problems are implemented on CPU or a single GPU due to their modest problem scale and computational loads. The Barzilai and Borwein algorithm with a subspace step scheme is adopted here to solve the MP problem. 
A head and neck (H and N) cancer case is then used to validate the authors’ method. The authors also compare their multi-GPU implementation with three different single GPU implementation strategies, i.e., truncating DDC matrix (S1), repeatedly transferring DDC matrix between CPU and GPU (S2), and porting computations involving DDC matrix to CPU (S3), in terms of both plan quality and computational efficiency. Two more H and N patient cases and three prostate cases are used to demonstrate the advantages of the authors’ method. Results: The authors’ multi-GPU implementation can finish the optimization process within ∼1 min for the H and N patient case. S1 leads to an inferior plan quality although its total time was 10 s shorter than the multi-GPU implementation due to the reduced matrix size. S2 and S3 yield the same plan quality as the multi-GPU implementation but take ∼4 and ∼6 min, respectively. High computational efficiency was consistently achieved for the other five patient cases tested, with VMAT plans of clinically acceptable quality obtained within 23–46 s. Conversely, to obtain clinically comparable or acceptable plans for all six of these VMAT cases that the authors have tested in this paper, the optimization time needed in a commercial TPS system on CPU was found to be on the order of several minutes. Conclusions: The results demonstrate that the multi-GPU implementation of the authors’ column-generation-based VMAT optimization can handle the large-scale VMAT optimization problem efficiently without sacrificing plan quality. The authors’ study may serve as an example to shed some light on other large-scale medical physics problems that require multi-GPU techniques.
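The two sparse storage formats named in the Methods can be illustrated with a small pure-Python conversion from coordinate list (COO) triplets to compressed sparse row (CSR) arrays, followed by a CSR matrix-vector product. This is a generic sketch, not the authors' GPU code; the function names are hypothetical, and the split into per-GPU submatrices by beam angle is not modelled.

```python
def coo_to_csr(n_rows, rows, cols, vals):
    """Convert COO triplets to CSR arrays (indptr, indices, data)."""
    # Count the nonzeros in each row.
    counts = [0] * n_rows
    for r in rows:
        counts[r] += 1
    # Prefix sums give the start of each row in the packed arrays.
    indptr = [0] * (n_rows + 1)
    for i in range(n_rows):
        indptr[i + 1] = indptr[i] + counts[i]
    # Scatter each entry into the next free slot of its row.
    indices = [0] * len(vals)
    data = [0.0] * len(vals)
    next_slot = indptr[:-1].copy()
    for r, c, v in zip(rows, cols, vals):
        k = next_slot[r]
        indices[k] = c
        data[k] = v
        next_slot[r] += 1
    return indptr, indices, data


def csr_matvec(indptr, indices, data, x):
    """Multiply a CSR matrix by a dense vector x."""
    return [
        sum(data[k] * x[indices[k]] for k in range(indptr[i], indptr[i + 1]))
        for i in range(len(indptr) - 1)
    ]


# 3 x 3 example with four nonzeros given in arbitrary order.
indptr, indices, data = coo_to_csr(
    3, rows=[0, 2, 0, 1], cols=[0, 1, 2, 1], vals=[1.0, 4.0, 2.0, 3.0]
)
y = csr_matvec(indptr, indices, data, [1.0, 1.0, 1.0])
```

CSR keeps each row's entries contiguous, which suits row-wise products such as dose calculation, whereas COO is simpler to build incrementally; that trade-off is one plausible reason for assembling in COO on the CPU and converting before the GPU computation.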
      

      
      Machine-Aided Indexing of Technical Literature
      ERIC Educational Resources Information Center
      Klingbiel, Paul H.
         1973-01-01
         To index at the Defense Documentation Center (DDC), an automated system must choose single words or phrases rapidly and economically. Automation of DDC's indexing has been machine-aided from its inception. A machine-aided indexing system is described that indexes one million words of text per hour of CPU time. (22 references) (Author/SJ)
      

      
      Transient dynamics capability at Sandia National Laboratories
      NASA Technical Reports Server (NTRS)
      Attaway, Steven W.; Biffle, Johnny H.; Sjaardema, G. D.; Heinstein, M. W.; Schoof, L. A.
         1993-01-01
         A brief overview of the transient dynamics capabilities at Sandia National Laboratories, with an emphasis on recent new developments and current research, is presented. In addition, the Sandia National Laboratories (SNL) Engineering Analysis Code Access System (SEACAS), which is a collection of structural and thermal codes and utilities used by analysts at SNL, is described. The SEACAS system includes pre- and post-processing codes, analysis codes, database translation codes, support libraries, Unix shell scripts for execution, and an installation system. SEACAS is used at SNL on a daily basis as a production, research, and development system for the engineering analysts and code developers. Over the past year, approximately 190 days of CPU time were used by SEACAS codes on jobs running from a few seconds up to two and one-half days of CPU time. SEACAS is running on several different systems at SNL including Cray Unicos, Hewlett Packard HP-UX, Digital Equipment Ultrix, and Sun SunOS. An overview of SEACAS, including a short description of the codes in the system, is presented. Abstracts and references for the codes are listed at the end of the report.
      

      
      Ab initio quantum chemical calculation of electron transfer matrix elements for large molecules
      NASA Astrophysics Data System (ADS)
      Zhang, Linda Yu; Friesner, Richard A.; Murphy, Robert B.
         1997-07-01
         Using a diabatic state formalism and pseudospectral numerical methods, we have developed an efficient ab initio quantum chemical approach to the calculation of electron transfer matrix elements for large molecules. The theory is developed at the Hartree-Fock level and validated by comparison with results in the literature for small systems. As an example of the power of the method, we calculate the electronic coupling between two bacteriochlorophyll molecules in various intermolecular geometries. Only a single self-consistent field (SCF) calculation on each of the monomers is needed to generate coupling matrix elements for all of the molecular pairs. The largest calculations performed, utilizing 1778 basis functions, required ~14 h on an IBM 390 workstation. This is considerably less CPU time than would be necessitated with a supermolecule adiabatic state calculation and a conventional electronic structure code.
      

      
      Efficient Resources Provisioning Based on Load Forecasting in Cloud
      PubMed Central
      Hu, Rongdong; Jiang, Jingfei; Liu, Guangming; Wang, Lixin
         2014-01-01
         Cloud providers should ensure QoS while maximizing resource utilization. One optimal strategy is to allocate resources promptly and at fine granularity according to each application's actual resource demand. The necessary precondition of this strategy is obtaining future load information in advance. We propose a multi-step-ahead load forecasting method, KSwSVR, based on statistical learning theory, which is suitable for the complex and dynamic characteristics of the cloud computing environment. It integrates an improved support vector regression algorithm and a Kalman smoother. Public trace data taken from multiple types of resources were used to verify its prediction accuracy, stability, and adaptability in comparison with AR, BPNN, and standard SVR. Subsequently, based on the predicted results, a simple and efficient strategy is proposed for resource provisioning. A CPU allocation experiment indicated that it can effectively reduce resource consumption while meeting service level agreement requirements. PMID:24701160
      

      
      Effect of acute and continuous morphine treatment on transcription factor expression in subregions of the rat caudate putamen. Marked modulation by D4 receptor activation.
      PubMed
      Gago, Belén; Suárez-Boomgaard, Diana; Fuxe, Kjell; Brené, Stefan; Reina-Sánchez, María Dolores; Rodríguez-Pérez, Luis M; Agnati, Luigi F; de la Calle, Adelaida; Rivera, Alicia
         2011-08-17
         Acute administration of the dopamine D(4) receptor (D(4)R) agonist PD168,077 induces a down-regulation of the μ opioid receptor (MOR) in the striosomal compartment of the rat caudate putamen (CPu), suggesting a striosomal D(4)R/MOR receptor interaction in line with their high co-distribution in this brain subregion. The present work was designed to explore if a D(4)R/MOR receptor interaction also occurs in the modulation of the expression pattern of several transcription factors in striatal subregions that play a central role in drug addiction. Thus, c-Fos, FosB/ΔFosB and P-CREB immunoreactive profiles were quantified in the rat CPu after either acute or continuous (6-day) administration of morphine and/or PD168,077. Acute and continuous administration of morphine induced different patterns of expression of these transcription factors, effects that were time-course and region dependent and fully blocked by PD168,077 co-administration. Moreover, this effect of the D(4)R agonist was counteracted by the D(4)R antagonist L745,870. Interestingly, at some time-points, combined treatment with morphine and PD168,077 substantially increased c-Fos, FosB/ΔFosB and P-CREB expression. The results of this study give indications for a general antagonistic D(4)R/MOR receptor interaction at the level of transcription factors. The change in the transcription factor expression by D(4)R/MOR interactions in turn suggests a modulation of neuronal activity in the CPu that could be of relevance for drug addiction. Copyright © 2011 Elsevier B.V. All rights reserved.
      

      
      P-Hint-Hunt: a deep parallelized whole genome DNA methylation detection tool.
      PubMed
      Peng, Shaoliang; Yang, Shunyun; Gao, Ming; Liao, Xiangke; Liu, Jie; Yang, Canqun; Wu, Chengkun; Yu, Wenqiang
         2017-03-14
         A growing number of studies use whole genome DNA methylation detection, one of the most important parts of epigenetics research, to find significant relationships between DNA methylation and several typical diseases, such as cancers and diabetes. In many of those studies, mapping the bisulfite treated sequence to the whole genome has been the main method to study DNA cytosine methylation. However, most of today's related tools suffer from inaccuracy and long run times. In our study, we designed a new DNA methylation prediction tool ("Hint-Hunt") to solve the problem. Using an optimal complex alignment computation and Smith-Waterman matrix dynamic programming, Hint-Hunt can analyze and predict the DNA methylation status. However, when Hint-Hunt predicts DNA methylation status on large-scale datasets, it still suffers from slow speed and low temporal-spatial efficiency. To address the cost of Smith-Waterman dynamic programming and the low temporal-spatial efficiency, we further designed a deep parallelized whole genome DNA methylation detection tool ("P-Hint-Hunt") on the Tianhe-2 (TH-2) supercomputer. To the best of our knowledge, P-Hint-Hunt is the first parallel DNA methylation detection tool with a high speed-up for processing large-scale datasets, and it can run on both CPUs and Intel Xeon Phi coprocessors. Moreover, we deployed and evaluated Hint-Hunt and P-Hint-Hunt on the TH-2 supercomputer at different scales. The experimental results show that our tools eliminate the deviation caused by bisulfite treatment in the mapping procedure and that the multi-level parallel program yields a 48-fold speed-up with 64 threads. P-Hint-Hunt achieves deep acceleration on the CPU and Intel Xeon Phi heterogeneous platform, taking full advantage of multi-core (CPU) and many-core (Phi) processors.
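
The Smith-Waterman dynamic programming step named in this abstract can be illustrated with a minimal scoring sketch. This is the textbook local-alignment recurrence with a linear gap penalty, not the Hint-Hunt implementation; the function name `smith_waterman_score` and the scoring parameters are assumptions chosen for illustration only.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local-alignment score between strings a and b.

    Textbook Smith-Waterman with a linear gap penalty. Only two rows of
    the DP matrix are kept, so memory is O(len(b)). The scoring values
    are illustrative assumptions, not taken from the abstract.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Score for aligning a[i-1] with b[j-1].
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


if __name__ == "__main__":
    print(smith_waterman_score("GGACGTCC", "TTACGTAA"))
```

Each cell depends only on its left, upper, and upper-left neighbours, so cells on the same anti-diagonal are independent; that property is what parallel implementations of this recurrence typically exploit.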
      

      
      GPU acceleration of Runge Kutta-Fehlberg and its comparison with Dormand-Prince method
      NASA Astrophysics Data System (ADS)
      Seen, Wo Mei; Gobithaasan, R. U.; Miura, Kenjiro T.
         2014-07-01
         The emergence of Graphics Processing Units (GPUs) has brought a significant reduction in processing time and a speedup in computer graphics. GPUs have been developed to surpass the Central Processing Unit (CPU) in terms of performance and processing speed. This evolution has opened up a new area in computing and research in which highly parallel GPUs are used for non-graphical algorithms. Physical or phenomenal simulations and modelling can be accelerated through General Purpose Graphics Processing Unit (GPGPU) and Compute Unified Device Architecture (CUDA) implementations. These phenomena can be represented with mathematical models in the form of Ordinary Differential Equations (ODEs), which capture the rate of change between independent and dependent variables. ODEs are numerically integrated over time in order to simulate these behaviours. The classical Runge-Kutta (RK) scheme is the common method used to numerically solve ODEs. The Runge-Kutta-Fehlberg (RKF) scheme has been specially developed to provide an estimate of the principal local truncation error at each step, known as the embedded estimate technique. This paper delves into the implementation of the RKF scheme for GPU devices and compares its results with the Dormand-Prince method. A pseudo code is developed to show the implementation in detail. Hence, practitioners will be able to understand the data allocation in GPU, formation of RKF kernels and the flow of data to/from GPU-CPU upon RKF kernel evaluation. The pseudo code is then written in C Language and two ODE models are executed to show the achievable speedup as compared to CPU implementation. The accuracy and efficiency of the proposed implementation method are discussed in the final section of this paper.
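
The embedded error estimate that distinguishes the RKF scheme can be sketched on the CPU for a scalar ODE. The coefficients below are the standard Fehlberg 4(5) tableau; the function names `rkf45_step` and `integrate`, and the step-size heuristic (safety factor 0.84, growth clamped to the range 0.1 to 2), are assumptions for illustration and are not taken from the paper's GPU pseudo code.

```python
def rkf45_step(f, t, y, h):
    """One Runge-Kutta-Fehlberg 4(5) step for a scalar ODE y' = f(t, y).

    Returns the 4th-order solution and the embedded error estimate,
    i.e. the absolute difference between the 5th- and 4th-order results.
    """
    k1 = h * f(t, y)
    k2 = h * f(t + h / 4, y + k1 / 4)
    k3 = h * f(t + 3 * h / 8, y + 3 * k1 / 32 + 9 * k2 / 32)
    k4 = h * f(t + 12 * h / 13,
               y + 1932 * k1 / 2197 - 7200 * k2 / 2197 + 7296 * k3 / 2197)
    k5 = h * f(t + h,
               y + 439 * k1 / 216 - 8 * k2 + 3680 * k3 / 513 - 845 * k4 / 4104)
    k6 = h * f(t + h / 2,
               y - 8 * k1 / 27 + 2 * k2 - 3544 * k3 / 2565
               + 1859 * k4 / 4104 - 11 * k5 / 40)
    y4 = y + 25 * k1 / 216 + 1408 * k3 / 2565 + 2197 * k4 / 4104 - k5 / 5
    y5 = (y + 16 * k1 / 135 + 6656 * k3 / 12825 + 28561 * k4 / 56430
          - 9 * k5 / 50 + 2 * k6 / 55)
    return y4, abs(y5 - y4)


def integrate(f, t0, y0, t_end, tol=1e-8, h=0.1):
    """Adaptive integration: accept a step only if its error estimate <= tol."""
    t, y = t0, y0
    while t_end - t > 1e-14:
        h = min(h, t_end - t)
        y_new, err = rkf45_step(f, t, y, h)
        if err <= tol:
            t, y = t + h, y_new
        # Common heuristic controller (an assumption, not from the abstract).
        scale = 0.84 * (tol / err) ** 0.25 if err > 0 else 2.0
        h *= min(2.0, max(0.1, scale))
    return y


if __name__ == "__main__":
    # y' = y with y(0) = 1; the exact value at t = 1 is e.
    print(integrate(lambda t, y: y, 0.0, 1.0, 1.0))
```

Because each ODE chooses its own step size, different ODEs in a batch may need different numbers of steps, which matters when many of them are advanced together on a GPU.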
      

      
      GPU accelerated implementation of NCI calculations using promolecular density.
      PubMed
      Rubez, Gaëtan; Etancelin, Jean-Matthieu; Vigouroux, Xavier; Krajecki, Michael; Boisson, Jean-Charles; Hénon, Eric
         2017-05-30
         The NCI approach is a modern tool to reveal chemical noncovalent interactions. It is particularly attractive to describe ligand-protein binding. A custom implementation for NCI using promolecular density is presented. It is designed to leverage the computational power of NVIDIA graphics processing unit (GPU) accelerators through the CUDA programming model. The code performances of three versions are examined on a test set of 144 systems. NCI calculations are particularly well suited to the GPU architecture, which reduces drastically the computational time. On a single compute node, the dual-GPU version leads to a 39-fold improvement for the biggest instance compared to the optimal OpenMP parallel run (C code, icc compiler) with 16 CPU cores. Energy consumption measurements carried out on both CPU and GPU NCI tests show that the GPU approach provides substantial energy savings. © 2017 Wiley Periodicals, Inc.
      

      
      The Effect of NUMA Tunings on CPU Performance
      NASA Astrophysics Data System (ADS)
      Hollowell, Christopher; Caramarcu, Costin; Strecker-Kellogg, William; Wong, Antonio; Zaytsev, Alexandr
         2015-12-01
         Non-Uniform Memory Access (NUMA) is a memory architecture for symmetric multiprocessing (SMP) systems where each processor is directly connected to separate memory. Indirect access to another CPU's (remote) RAM is still possible, but such requests are slower as they must also pass through that memory's controlling CPU. In concert with a NUMA-aware operating system, the NUMA hardware architecture can help eliminate the memory performance reductions generally seen in SMP systems when multiple processors simultaneously attempt to access memory. The x86 CPU architecture has supported NUMA for a number of years. Modern operating systems such as Linux support NUMA-aware scheduling, where the OS attempts to schedule a process to the CPU directly attached to the majority of its RAM. In Linux, it is possible to further manually tune the NUMA subsystem using the numactl utility. With the release of Red Hat Enterprise Linux (RHEL) 6.3, the numad daemon became available in this distribution. This daemon monitors a system's NUMA topology and utilization, and automatically makes adjustments to optimize locality. As the number of cores in x86 servers continues to grow, efficient NUMA mappings of processes to CPUs/memory will become increasingly important. This paper gives a brief overview of NUMA, and discusses the effects of manual tunings and numad on the performance of the HEPSPEC06 benchmark and ATLAS software.
      

      
      Accelerating finite-rate chemical kinetics with coprocessors: Comparing vectorization methods on GPUs, MICs, and CPUs
      NASA Astrophysics Data System (ADS)
      Stone, Christopher P.; Alferman, Andrew T.; Niemeyer, Kyle E.
         2018-05-01
         Accurate and efficient methods for solving stiff ordinary differential equations (ODEs) are a critical component of turbulent combustion simulations with finite-rate chemistry. The ODEs governing the chemical kinetics at each mesh point are decoupled by operator-splitting allowing each to be solved concurrently. An efficient ODE solver must then take into account the available thread and instruction-level parallelism of the underlying hardware, especially on many-core coprocessors, as well as the numerical efficiency. A stiff Rosenbrock and a nonstiff Runge-Kutta ODE solver are both implemented using the single instruction, multiple thread (SIMT) and single instruction, multiple data (SIMD) paradigms within OpenCL. Both methods solve multiple ODEs concurrently within the same instruction stream. The performance of these parallel implementations was measured on three chemical kinetic models of increasing size across several multicore and many-core platforms. Two separate benchmarks were conducted to clearly determine any performance advantage offered by either method. The first benchmark measured the run-time of evaluating the right-hand-side source terms in parallel and the second benchmark integrated a series of constant-pressure, homogeneous reactors using the Rosenbrock and Runge-Kutta solvers. The right-hand-side evaluations with SIMD parallelism on the host multicore Xeon CPU and many-core Xeon Phi co-processor performed approximately three times faster than the baseline multithreaded C++ code. The SIMT parallel model on the host and Phi was 13%-35% slower than the baseline while the SIMT model on the NVIDIA Kepler GPU provided approximately the same performance as the SIMD model on the Phi. The runtimes for both ODE solvers decreased significantly with the SIMD implementations on the host CPU (2.5-2.7 ×) and Xeon Phi coprocessor (4.7-4.9 ×) compared to the baseline parallel code. 
The SIMT implementations on the GPU ran 1.5-1.6 times faster than the baseline multithreaded CPU code; however, this was significantly slower than the SIMD versions on the host CPU or the Xeon Phi. The performance difference between the three platforms was attributed to thread divergence caused by the adaptive step-sizes within the ODE integrators. Analysis showed that the wider vector width of the GPU incurs a higher level of divergence than the narrower Sandy Bridge or Xeon Phi. The significant performance improvement provided by the SIMD parallel strategy motivates further research into more ODE solver methods that are both SIMD-friendly and computationally efficient.
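
The link between vector width and divergence described in this abstract can be illustrated with a toy cost model. It assumes that lanes grouped into one vector advance in lock step, so every lane in a group pays for the slowest lane's step count; the function name `lockstep_efficiency` and the sample step counts are hypothetical and do not come from the paper's measurements.

```python
def lockstep_efficiency(steps, width):
    """Fraction of issued lane-steps that do useful work.

    steps[i] is the number of adaptive integrator steps ODE i needs.
    ODEs are packed into groups of `width` lanes; a group runs for as
    many steps as its slowest member (toy lock-step model, an assumption).
    """
    useful = sum(steps)
    issued = 0
    for start in range(0, len(steps), width):
        group = steps[start:start + width]
        issued += max(group) * width
    return useful / issued


if __name__ == "__main__":
    # Hypothetical per-ODE step counts produced by an adaptive solver.
    steps = [1, 4, 2, 8, 3, 3, 5, 1]
    for width in (1, 2, 4, 8):
        print(width, lockstep_efficiency(steps, width))
```

In this model the efficiency never improves as the width grows, which is consistent with the abstract's observation that a wider vector incurs more divergence when step counts vary between ODEs.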
      

      
      A proposed through-flow inverse method for the design of mixed-flow pumps
      NASA Technical Reports Server (NTRS)
      Borges, Joao Eduardo
         1991-01-01
         A through-flow (hub-to-shroud) truly inverse method is proposed and described. It uses an imposition of mean swirl, i.e., radius times mean tangential velocity, given throughout the meridional section of the turbomachine as an initial design specification. In the present implementation, it is assumed that the fluid is inviscid, incompressible, and irrotational at inlet and that the blades are supposed to have zero thickness. Only blade rows that impart to the fluid a constant work along the space are considered. An application of this procedure to design the rotor of a mixed-flow pump is described in detail. The strategy used to find a suitable mean swirl distribution and the other design inputs is also described. The final blade shape and pressure distributions on the blade surface are presented, showing that it is possible to obtain feasible designs using this technique. Another advantage of this technique is the fact that it does not require large amounts of CPU time.
      

      
      Inversion of surface parameters using fast learning neural networks
      NASA Technical Reports Server (NTRS)
      Dawson, M. S.; Olvera, J.; Fung, A. K.; Manry, M. T.
         1992-01-01
         A neural network approach to the inversion of surface scattering parameters is presented. Simulated data sets based on a surface scattering model are used so that the data may be viewed as taken from a completely known randomly rough surface. The fast learning (FL) neural network and a multilayer perceptron (MLP) trained with backpropagation learning (BP network) are tested on the simulated backscattering data. The RMS error of training the FL network is found to be less than one half the error of the BP network while requiring one to two orders of magnitude less CPU time. When applied to inversion of parameters from a statistically rough surface, the FL method is successful at recovering the surface permittivity, the surface correlation length, and the RMS surface height in less time and with less error than the BP network. Further applications of the FL neural network to the inversion of parameters from backscatter measurements of an inhomogeneous layer above a half space are shown.
      

      
      FDDO and DSMC analyses of rarefied gas flow through 2D nozzles
      NASA Technical Reports Server (NTRS)
      Chung, Chan-Hong; De Witt, Kenneth J.; Jeng, Duen-Ren; Penko, Paul F.
         1992-01-01
         Two different approaches, the finite-difference method coupled with the discrete-ordinate method (FDDO), and the direct-simulation Monte Carlo (DSMC) method, are used in the analysis of the flow of a rarefied gas expanding through a two-dimensional nozzle and into a surrounding low-density environment. In the FDDO analysis, by employing the discrete-ordinate method, the Boltzmann equation simplified by a model collision integral is transformed to a set of partial differential equations which are continuous in physical space but are point functions in molecular velocity space. The set of partial differential equations is solved by means of a finite-difference approximation. In the DSMC analysis, the variable hard sphere model is used as a molecular model and the no time counter method is employed as a collision sampling technique. The results of both the FDDO and the DSMC methods show good agreement. The FDDO method requires less computational effort than the DSMC method by factors of 10 to 40 in CPU time, depending on the degree of rarefaction.
      

      
      Novel Hybrid Scheduling Technique for Sensor Nodes with Mixed Criticality Tasks
      PubMed Central
      Micea, Mihai-Victor; Stangaciu, Cristina-Sorina; Stangaciu, Valentin; Curiac, Daniel-Ioan
         2017-01-01
         Sensor networks are increasingly becoming a key technology for complex control applications. Their potential use in safety- and time-critical domains has raised the need for task scheduling mechanisms specially adapted to sensor node specific requirements, often materialized in predictable jitter-less execution of tasks characterized by different criticality levels. This paper offers an efficient scheduling solution, named Hybrid Hard Real-Time Scheduling (H2RTS), which combines a static, clock driven method with a dynamic, event driven scheduling technique, in order to provide high execution predictability, while keeping a high node Central Processing Unit (CPU) utilization factor. From the detailed, integrated schedulability analysis of the H2RTS, a set of sufficiency tests is introduced and demonstrated based on the processor demand and linear upper bound metrics. The performance and correct behavior of the proposed hybrid scheduling technique have been extensively evaluated and validated both on a simulator and on a sensor mote equipped with an ARM7 microcontroller. PMID:28672856
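
The processor demand metric mentioned in this abstract can be illustrated with the classic processor-demand criterion for earliest-deadline-first scheduling of synchronous periodic tasks. This is a generic textbook test, not the H2RTS sufficiency tests themselves; the task tuple layout `(C, D, T)` with integer execution time, relative deadline, and period, the assumption that D does not exceed T, and the function names are all choices made for this sketch.

```python
from functools import reduce
from math import gcd


def demand(tasks, t):
    """Total execution time of all jobs with both release and deadline in [0, t]."""
    return sum(max(0, (t - d) // p + 1) * c for c, d, p in tasks)


def edf_feasible(tasks):
    """Classic processor-demand test for synchronous periodic tasks.

    tasks is a list of (C, D, T) integer tuples with D <= T (assumed).
    The set passes if utilization <= 1 and demand(t) <= t at every
    absolute deadline t up to the hyperperiod.
    """
    hyper = reduce(lambda a, b: a * b // gcd(a, b), (p for _, _, p in tasks))
    # Integer form of the utilization check: sum(C/T) <= 1.
    if sum(c * (hyper // p) for c, _, p in tasks) > hyper:
        return False
    deadlines = sorted({d + k * p
                        for _, d, p in tasks
                        for k in range((hyper - d) // p + 1)})
    return all(demand(tasks, t) <= t for t in deadlines)


if __name__ == "__main__":
    print(edf_feasible([(1, 4, 4), (2, 6, 6), (3, 12, 12)]))
    print(edf_feasible([(2, 2, 4), (2, 2, 4)]))
```

The second task set has a utilization of exactly 1 but still fails, because both jobs demand 4 time units before the deadline at t = 2; this is the kind of case a demand-based test catches and a utilization-only test misses.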
      

        
       
          

      The graphics and data acquisition software package
      NASA Technical Reports Server (NTRS)
      Crosier, W. G.
         1981-01-01
         A software package was developed for use with micro and minicomputers, particularly the LSI-11/PDP-11 series. The package has a number of Fortran-callable subroutines which perform a variety of frequently needed tasks for biomedical applications. All routines are well documented, flexible, easy to use and modify, and require minimal programmer knowledge of peripheral hardware. The package is also economical of memory and CPU time. A single subroutine call can perform any one of the following functions: (1) plot an array of integer values from sampled A/D data; (2) plot an array of Y values versus an array of X values; (3) draw horizontal and/or vertical grid lines of selectable type; (4) annotate grid lines with user units; (5) get coordinates of user controlled crosshairs from the terminal for interactive graphics; (6) sample any analog channel with program selectable gain; (7) wait a specified time interval; and (8) perform random access I/O of one or more blocks of a sequential disk file. Several miscellaneous functions are also provided.
      

      
      A hybrid short read mapping accelerator
      PubMed Central
      
         2013-01-01
         Background: The rapid growth of short read datasets poses a new challenge to the short read mapping problem in terms of sensitivity and execution speed. Existing methods often use a restrictive error model for computing the alignments to improve speed, whereas more flexible error models are generally too slow for large-scale applications. A number of short read mapping software tools have been proposed. However, designs based on hardware are relatively rare. Field programmable gate arrays (FPGAs) have been successfully used in a number of specific application areas, such as the DSP and communications domains due to their outstanding parallel data processing capabilities, making them a competitive platform to solve problems that are “inherently parallel”. Results: We present a hybrid system for short read mapping utilizing both FPGA-based hardware and CPU-based software. The computation intensive alignment and the seed generation operations are mapped onto an FPGA. We present a computationally efficient, parallel block-wise alignment structure (Align Core) to approximate the conventional dynamic programming algorithm. The performance is compared to the multi-threaded CPU-based GASSST and BWA software implementations. For single-end alignment, our hybrid system achieves faster processing speed than GASSST (with a similar sensitivity) and BWA (with a higher sensitivity); for paired-end alignment, our design achieves a slightly worse sensitivity than that of BWA but has a higher processing speed. Conclusions: This paper shows that our hybrid system can effectively accelerate the mapping of short reads to a reference genome based on the seed-and-extend approach. The performance comparison to the GASSST and BWA software implementations under different conditions shows that our hybrid design achieves a high degree of sensitivity and requires less overall execution time with only modest FPGA resource utilization.
Our hybrid system design also shows that the performance bottleneck for the short read mapping problem can be changed from the alignment stage to the seed generation stage, which provides an additional requirement for the future development of short read aligners. PMID:23441908
      

      
      A FAST ITERATIVE METHOD FOR SOLVING THE EIKONAL EQUATION ON TRIANGULATED SURFACES*
      PubMed Central
      Fu, Zhisong; Jeong, Won-Ki; Pan, Yongsheng; Kirby, Robert M.; Whitaker, Ross T.
         2012-01-01
         This paper presents an efficient, fine-grained parallel algorithm for solving the Eikonal equation on triangular meshes. The Eikonal equation, and the broader class of Hamilton–Jacobi equations to which it belongs, have a wide range of applications from geometric optics and seismology to biological modeling and analysis of geometry and images. The ability to solve such equations accurately and efficiently provides new capabilities for exploring and visualizing parameter spaces and for solving inverse problems that rely on such equations in the forward model. Efficient solvers on state-of-the-art, parallel architectures require new algorithms that are not, in many cases, optimal, but are better suited to synchronous updates of the solution. In previous work [W. K. Jeong and R. T. Whitaker, SIAM J. Sci. Comput., 30 (2008), pp. 2512–2534], the authors proposed the fast iterative method (FIM) to efficiently solve the Eikonal equation on regular grids. In this paper we extend the fast iterative method to solve Eikonal equations efficiently on triangulated domains on the CPU and on parallel architectures, including graphics processors. We propose a new local update scheme that provides solutions of first-order accuracy for both architectures. We also propose a novel triangle-based update scheme and its corresponding data structure for efficient irregular data mapping to parallel single-instruction multiple-data (SIMD) processors. We provide detailed descriptions of the implementations on a single CPU, a multicore CPU with shared memory, and SIMD architectures with comparative results against state-of-the-art Eikonal solvers. PMID:22641200
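
The active-list idea behind the fast iterative method can be sketched on a regular 2D grid, the setting of the earlier work this abstract cites. The triangle-based update scheme proposed for triangulated surfaces is not reproduced here. The sketch is serial, uses the standard first-order Godunov upwind update with a constant speed, and the names `local_update` and `fim_grid` are hypothetical.

```python
import math


def local_update(U, i, j, h, speed):
    """First-order Godunov upwind solve of |grad U| = 1/speed at node (i, j)."""
    rows, cols = len(U), len(U[0])
    a = min(U[i - 1][j] if i > 0 else math.inf,
            U[i + 1][j] if i < rows - 1 else math.inf)
    b = min(U[i][j - 1] if j > 0 else math.inf,
            U[i][j + 1] if j < cols - 1 else math.inf)
    r = h / speed
    if abs(a - b) >= r:
        return min(a, b) + r          # information arrives from one axis only
    return 0.5 * (a + b + math.sqrt(2.0 * r * r - (a - b) ** 2))


def fim_grid(rows, cols, source, h=1.0, speed=1.0, eps=1e-9):
    """Serial sketch of an active-list iteration on a regular grid."""
    U = [[math.inf] * cols for _ in range(rows)]
    U[source[0]][source[1]] = 0.0

    def neighbors(i, j):
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            x, y = i + di, j + dj
            if 0 <= x < rows and 0 <= y < cols and (x, y) != source:
                yield x, y

    active = set(neighbors(*source))
    while active:
        next_active = set()
        for i, j in active:
            old = U[i][j]
            new = min(old, local_update(U, i, j, h, speed))
            U[i][j] = new
            if abs(old - new) < eps:
                # Converged node: leave the list, wake neighbours it improves.
                for x, y in neighbors(i, j):
                    if (x, y) in active:
                        continue
                    cand = local_update(U, x, y, h, speed)
                    if cand < U[x][y] - eps:
                        U[x][y] = cand
                        next_active.add((x, y))
            else:
                next_active.add((i, j))  # still changing: keep it active
        active = next_active
    return U


if __name__ == "__main__":
    U = fim_grid(8, 8, (0, 0))
    print(U[0][5], U[1][1], U[3][4])
```

Unlike ordered methods that process one node at a time from a priority queue, every node in the active list can be updated independently within a sweep, which is what makes this style of iteration a fit for SIMD hardware.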
      

      
      Optimization and uncertainty assessment of strongly nonlinear groundwater models with high parameter dimensionality
      NASA Astrophysics Data System (ADS)
      Keating, Elizabeth H.; Doherty, John; Vrugt, Jasper A.; Kang, Qinjun
         2010-10-01
         Highly parameterized and CPU-intensive groundwater models are increasingly being used to understand and predict flow and transport through aquifers. Despite their frequent use, these models pose significant challenges for parameter estimation and predictive uncertainty analysis algorithms, particularly global methods which usually require very large numbers of forward runs. Here we present a general methodology for parameter estimation and uncertainty analysis that can be utilized in these situations. Our proposed method includes extraction of a surrogate model that mimics key characteristics of a full process model, followed by testing and implementation of a pragmatic uncertainty analysis technique, called null-space Monte Carlo (NSMC), that merges the strengths of gradient-based search and parameter dimensionality reduction. As part of the surrogate model analysis, the results of NSMC are compared with a formal Bayesian approach using the DiffeRential Evolution Adaptive Metropolis (DREAM) algorithm. Such a comparison has never been accomplished before, especially in the context of high parameter dimensionality. Despite the highly nonlinear nature of the inverse problem, the existence of multiple local minima, and the relatively large parameter dimensionality, both methods performed well and results compare favorably with each other. Experiences gained from the surrogate model analysis are then transferred to calibrate the full highly parameterized and CPU intensive groundwater model and to explore predictive uncertainty of predictions made by that model. The methodology presented here is generally applicable to any highly parameterized and CPU-intensive environmental model, where efficient methods such as NSMC provide the only practical means for conducting predictive uncertainty analysis.
      

      
      Distributed GPU Computing in GIScience
      NASA Astrophysics Data System (ADS)
      Jiang, Y.; Yang, C.; Huang, Q.; Li, J.; Sun, M.
         2013-12-01
         Geoscientists strive to discover potential principles and patterns hidden inside ever-growing Big Data for scientific discoveries. To better achieve this objective, more capable computing resources are required to process, analyze and visualize Big Data (Ferreira et al., 2003; Li et al., 2013). Current CPU-based computing techniques cannot promptly meet the computing challenges caused by the increasing number of datasets from different domains, such as social media, earth observation, and environmental sensing (Li et al., 2013). Meanwhile, CPU-based computing resources structured as clusters or supercomputers are costly. In the past several years, as GPU-based technology has matured in both capability and performance, GPU-based computing has emerged as a new computing paradigm. Compared with a traditional microprocessor, the modern GPU, as a compelling alternative microprocessor, offers outstanding parallel processing capability with cost-effectiveness and efficiency (Owens et al., 2008), although it was initially designed for graphical rendering in the visualization pipeline. This presentation reports a distributed GPU computing framework for integrating GPU-based computing within a distributed environment. Within this framework, 1) for each single computer, both GPU-based and CPU-based computing resources can be fully utilized to improve the performance of visualizing and processing Big Data; 2) within a network environment, a variety of computers can be used to build up a virtual supercomputer to support CPU-based and GPU-based computing in a distributed computing environment; 3) GPUs, as graphics-targeted devices, are used to greatly improve the rendering efficiency in distributed geo-visualization, especially for 3D/4D visualization. Key words: Geovisualization, GIScience, Spatiotemporal Studies. References: 1. Ferreira de Oliveira, M. C., & Levkowitz, H. (2003). From visual data exploration to visual data mining: A survey.
Visualization and Computer Graphics, IEEE Transactions on, 9(3), 378-394. 2. Li, J., Jiang, Y., Yang, C., Huang, Q., & Rice, M. (2013). Visualizing 3D/4D Environmental Data Using Many-core Graphics Processing Units (GPUs) and Multi-core Central Processing Units (CPUs). Computers & Geosciences, 59(9), 78-89. 3. Owens, J. D., Houston, M., Luebke, D., Green, S., Stone, J. E., & Phillips, J. C. (2008). GPU computing. Proceedings of the IEEE, 96(5), 879-899.
      

      
      An integrated pipeline of open source software adapted for multi-CPU architectures: use in the large-scale identification of single nucleotide polymorphisms.
      PubMed
      Jayashree, B; Hanspal, Manindra S; Srinivasan, Rajgopal; Vigneshwaran, R; Varshney, Rajeev K; Spurthi, N; Eshwar, K; Ramesh, N; Chandra, S; Hoisington, David A
         2007-01-01
         The large amounts of EST sequence data available from a single species of an organism as well as for several species within a genus provide an easy source of identification of intra- and interspecies single nucleotide polymorphisms (SNPs). In the case of model organisms, the data available are numerous, given the degree of redundancy in the deposited EST data. There are several available bioinformatics tools that can be used to mine this data; however, using them requires a certain level of expertise: the tools have to be used sequentially with accompanying format conversion and steps like clustering and assembly of sequences become time-intensive jobs even for moderately sized datasets. We report here a pipeline of open source software extended to run on multiple CPU architectures that can be used to mine large EST datasets for SNPs and identify restriction sites for assaying the SNPs so that cost-effective CAPS assays can be developed for SNP genotyping in genetics and breeding applications. At the International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), the pipeline has been implemented to run on a Paracel high-performance system consisting of four dual AMD Opteron processors running Linux with MPICH. The pipeline can be accessed through user-friendly web interfaces at http://hpc.icrisat.cgiar.org/PBSWeb and is available on request for academic use. We have validated the developed pipeline by mining chickpea ESTs for interspecies SNPs, development of CAPS assays for SNP genotyping, and confirmation of restriction digestion pattern at the sequence level.
      

      
      Improving Block-level Efficiency with scsi-mq
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Caldwell, Blake A
         2015-01-01
         Current generation solid-state storage devices are exposing new bottlenecks in the SCSI and block layers of the Linux kernel, where IO throughput is limited by lock contention, inefficient interrupt handling, and poor memory locality. To address these limitations, the Linux kernel block layer underwent a major rewrite with the blk-mq project to move from a single request queue to a multi-queue model. The Linux SCSI subsystem rework to make use of this new model, known as scsi-mq, has been merged into the Linux kernel and work is underway for dm-multipath support in the upcoming Linux 4.0 kernel. These pieces were necessary to make use of the multi-queue block layer in a Lustre parallel filesystem with high availability requirements. We undertook adding support of the 3.18 kernel to Lustre with scsi-mq and dm-multipath patches to evaluate the potential of these efficiency improvements. In this paper we evaluate the block-level performance of scsi-mq with backing storage hardware representative of an HPC-targeted Lustre filesystem. Our findings show that SCSI write request latency is reduced by as much as 13.6%. Additionally, when profiling the CPU usage of our prototype Lustre filesystem, we found that CPU idle time increased by a factor of 7 with Linux 3.18 and blk-mq as compared to a standard 2.6.32 Linux kernel. Our findings demonstrate increased efficiency of the multi-queue block layer even with disk-based caching storage arrays used in existing parallel filesystems.
      

      
      Defining and Enforcing Hardware Security Requirements
      DTIC Science & Technology
      
         2011-12-01
         Computer-Aided Design CPU Central Processing Unit CTL Computation Tree Logic DARPA The Defense Advanced Projects Research Agency DFF D-type Flip-Flop DNF...They too have no global knowledge of what is going on, nor any meaning to attach to any bit, whether storage or gating . . . it is we who attach...This option is prohibitively expensive with the current trends in the global distribution of the steps in IC design and fabrication. The second option
      

      
      Feasibility Study for an Air Force Environmental Model and Data Exchange. Volume 1. Model and Data Requirements with Recommendations.
      DTIC Science & Technology
      
         1983-07-01
         CC6600 FOR 64CPU 56K N s0 so AT123D 102 W L IBM4360 FOR 2 K CLEARY GROUN WATER FLOW 240 C FOR " CLEARY MASS TRANSPORT 222 U C FOR Y *DIURNAL 102 W H L...user support. $250,000 Computer Hardware Principally acquisition of terminals, printers, and modems for distribution to the user community. $100,000
      

      
      Helicopter In-Flight Monitoring System Second Generation (HIMS II).
      DTIC Science & Technology
      
         1983-08-01
         acquisition cycle. B. Computer Chassis CPU (DEC LSI-11/2) -- Executes instructions contained in the memory. 32K memory (DEC MSV11-DD) -- Contains program...when the operator executes command #2, 3, or 5 (display data). New cartridges can be inserted as required for truly unlimited, continuous data...is called bootstrapping. The software, which is stored on a tape cartridge, is loaded into memory by execution of a small program stored in read-only
      

      
      MEqTrees Telescope and Radio-sky Simulations and CPU Benchmarking
      NASA Astrophysics Data System (ADS)
      Shanmugha Sundaram, G. A.
         2009-09-01
         MEqTrees is a Python-based implementation of the classical Measurement Equation, wherein the various 2×2 Jones matrices are parametrized representations in the spatial and sky domains for any generic radio telescope. Customized simulations of radio-source sky models and corrupt Jones terms are demonstrated based on a policy framework, with performance estimates derived for array configurations, "dirty"-map residuals and processing power requirements for such computations on conventional platforms.
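
         For orientation, the Measurement Equation referred to above is commonly written in the following 2×2 matrix form. This is a sketch of the standard textbook formulation, not a formula taken from this record; the symbols V, B and J, the antenna labels p and q, and the number of effects n are conventional names assumed here.

```latex
% Visibility measured by the interferometer formed by antennas p and q
% for a single source with brightness matrix B (all matrices are 2x2):
V_{pq} = J_p \, B \, J_q^{H}

% Each antenna's Jones matrix is a product of individual propagation
% and instrumental effects, applied in the order the signal meets them:
J_p = J_{p,n} \, J_{p,n-1} \cdots J_{p,1}
```

         Here the superscript H denotes the conjugate transpose. A "corrupt Jones term" in a simulation would correspond to perturbing one of the factors of J_p.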
      

      
      Using SimCPU in Cooperative Learning Laboratories.
      ERIC Educational Resources Information Center
      Lin, Janet Mei-Chuen; Wu, Cheng-Chih; Liu, Hsi-Jen
         1999-01-01
         Reports research findings of an experimental design in which cooperative-learning strategies were applied to closed-lab instruction of computing concepts. SimCPU, a software package specially designed for closed-lab usage, was used by 171 high school students in four classes. Results showed that collaboration enhanced learning and that blending…
      

      
      Accelerating solutions of one-dimensional unsteady PDEs with GPU-based swept time-space decomposition
      NASA Astrophysics Data System (ADS)
      Magee, Daniel J.; Niemeyer, Kyle E.
         2018-03-01
         The expedient design of precision components in aerospace and other high-tech industries requires simulations of physical phenomena often described by partial differential equations (PDEs) without exact solutions. Modern design problems require simulations with a level of resolution difficult to achieve in reasonable amounts of time, even in effectively parallelized solvers. Though the scale of the problem relative to available computing power is the greatest impediment to accelerating these applications, significant performance gains can be achieved through careful attention to the details of memory communication and access. The swept time-space decomposition rule reduces communication between sub-domains by exhausting the domain of influence before communicating boundary values. Here we present a GPU implementation of the swept rule, which modifies the algorithm for improved performance on this processing architecture by prioritizing use of private (shared) memory, avoiding interblock communication, and overwriting unnecessary values. It shows significant improvement in the execution time of finite-difference solvers for one-dimensional unsteady PDEs, producing speedups of 2-9× for a range of problem sizes compared with simple GPU versions and 7-300× compared with parallel CPU versions. However, for a more sophisticated one-dimensional system of equations discretized with a second-order finite-volume scheme, the swept rule performs 1.2-1.9× worse than a standard implementation for all problem sizes.
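
         The idea of exhausting a sub-domain's data before communicating can be illustrated with a small serial sketch. It is not the authors' GPU code: the explicit heat-equation update, the coefficient r, and both function names are assumptions chosen for illustration, and only the first, communication-free phase of the decomposition is shown.

```python
def heat_step(u, r):
    """One explicit three-point update; the result is 2 points shorter than u."""
    return [u[i] + r * (u[i - 1] - 2.0 * u[i] + u[i + 1])
            for i in range(1, len(u) - 1)]


def advance_block_without_communication(block, r):
    """Advance one sub-domain as far as its own data allows.

    Every step consumes one point at each edge, so a block of width w
    produces a triangle in space-time: level k holds the points with
    local indices k .. w-1-k. No neighbour values are needed until the
    triangle is complete.
    """
    levels = [list(block)]           # level 0 is the initial data
    while len(levels[-1]) >= 3:      # a three-point stencil needs 3 values
        levels.append(heat_step(levels[-1], r))
    return levels
```

         A block of eight points can therefore be advanced three time levels (widths 8, 6, 4, 2) before any boundary exchange is needed, and the values obtained match those of a solver that communicates at every step.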
      

      
      Advances in Mechanisms Supporting Data Collection on Future Force Networks: Product Manager C4ISR On-the-Move
      DTIC Science & Technology
      
         2008-12-01
         for Layer 3 data capture: NetPoll ncap tget Monitor session Radio System switch router User App interface box GPS This model applies to most fixed...developed a lightweight, custom implementation, termed ncap . As described in Section 3.1, the Ground Truth System provides a linkage between host...computer CPU time and GPS time, and ncap leverages this to perform highly precise (əmsec) time tagging of offered and received packets. Such
      

      
      GPU-accelerated low-latency real-time searches for gravitational waves from compact binary coalescence
      NASA Astrophysics Data System (ADS)
      Liu, Yuan; Du, Zhihui; Chung, Shin Kee; Hooper, Shaun; Blair, David; Wen, Linqing
         2012-12-01
         We present a graphics processing unit (GPU)-accelerated time-domain low-latency algorithm to search for gravitational waves (GWs) from coalescing binaries of compact objects based on the summed parallel infinite impulse response (SPIIR) filtering technique. The aim is to facilitate fast detection of GWs with a minimum delay to allow prompt electromagnetic follow-up observations. To maximize the GPU acceleration, we apply an efficient batched parallel computing model that significantly reduces the number of synchronizations in SPIIR and optimizes the usage of the memory and hardware resource. Our code is tested on the CUDA ‘Fermi’ architecture in a GTX 480 graphics card and its performance is compared with a single core of Intel Core i7 920 (2.67 GHz). A 58-fold speedup is achieved while giving results in close agreement with the CPU implementation. Our result indicates that it is possible to conduct a full search for GWs from compact binary coalescence in real time with only one desktop computer equipped with a Fermi GPU card for the initial LIGO detectors which in the past required more than 100 CPUs.
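
         As a rough illustration of what a summed parallel IIR filter bank computes, the serial sketch below runs several first-order recursive filters over the same input and adds their outputs. It is a simplified assumption of the general technique, not the authors' CUDA implementation; the function name and the coefficient names a1, b0 and delay are hypothetical.

```python
def spiir_output(x, filters):
    """Sum the outputs of first-order IIR filters applied to the same input.

    x       -- list of input samples (real or complex)
    filters -- list of (a1, b0, delay) tuples, one per filter, where
               y[k] = a1 * y[k-1] + b0 * x[k-delay]
    """
    total = [0j] * len(x)
    for a1, b0, delay in filters:
        y_prev = 0j                                   # filter starts at rest
        for k in range(len(x)):
            x_delayed = x[k - delay] if k >= delay else 0.0
            y_prev = a1 * y_prev + b0 * x_delayed     # recursive update
            total[k] += y_prev                        # accumulate the sum
    return total
```

         Each filter in the bank is independent of the others, which is what makes this structure a natural fit for parallel hardware; only the final summation combines them.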
      

      
      Chimera grids in the simulation of three-dimensional flowfields in turbine-blade-coolant passages
      NASA Technical Reports Server (NTRS)
      Stephens, M. A.; Rimlinger, M. J.; Shih, T. I.-P.; Civinskas, K. C.
         1993-01-01
         When computing flows inside geometrically complex turbine-blade coolant passages, the structure of the grid system used can affect significantly the overall time and cost required to obtain solutions. This paper addresses this issue while evaluating and developing computational tools for the design and analysis of coolant-passages, and is divided into two parts. In the first part, the various types of structured and unstructured grids are compared in relation to their ability to provide solutions in a timely and cost-effective manner. This comparison shows that the overlapping structured grids, known as Chimera grids, can rival and in some instances exceed the cost-effectiveness of unstructured grids in terms of both the man hours needed to generate grids and the amount of computer memory and CPU time needed to obtain solutions. In the second part, a computational tool utilizing Chimera grids was used to compute the flow and heat transfer in two different turbine-blade coolant passages that contain baffles and numerous pin fins. These computations showed the versatility and flexibility offered by Chimera grids.
      

      
      Practical Implementation of Prestack Kirchhoff Time Migration on a General Purpose Graphics Processing Unit
      NASA Astrophysics Data System (ADS)
      Liu, Guofeng; Li, Chun
         2016-08-01
         In this study, we present a practical implementation of prestack Kirchhoff time migration (PSTM) on a general purpose graphics processing unit. First, we consider the three main optimizations of the PSTM GPU code, i.e., designing a configuration based on a reasonable execution, using the texture memory for velocity interpolation, and the application of an intrinsic function in device code. This approach can achieve a speedup of nearly 45 times on an NVIDIA GTX 680 GPU compared with CPU code when a larger imaging space is used, where the PSTM output is a common reflection point gather stored as I[nx][ny][nh][nt] in matrix format. However, this method requires more memory space, so a limited imaging space cannot fully exploit the GPU resources. To overcome this problem, we designed a PSTM scheme with multi-GPUs for imaging different seismic data on different GPUs using an offset value. This process can achieve the peak speedup of GPU PSTM code and it greatly increases the efficiency of the calculations, but without changing the imaging result.
      

      
      Real-time electroholography using a multiple-graphics processing unit cluster system with a single spatial light modulator and the InfiniBand network
      NASA Astrophysics Data System (ADS)
      Niwase, Hiroaki; Takada, Naoki; Araki, Hiromitsu; Maeda, Yuki; Fujiwara, Masato; Nakayama, Hirotaka; Kakue, Takashi; Shimobaba, Tomoyoshi; Ito, Tomoyoshi
         2016-09-01
         Parallel calculations of large-pixel-count computer-generated holograms (CGHs) are suitable for multiple-graphics processing unit (multi-GPU) cluster systems. However, it is not easy for a multi-GPU cluster system to accomplish fast CGH calculations when CGH transfers between PCs are required. In these cases, the CGH transfer between the PCs becomes a bottleneck. Usually, this problem occurs only in multi-GPU cluster systems with a single spatial light modulator. To overcome this problem, we propose a simple method using the InfiniBand network. The computational speed of the proposed method using 13 GPUs (NVIDIA GeForce GTX TITAN X) was more than 3000 times faster than that of a CPU (Intel Core i7 4770) when the number of three-dimensional (3-D) object points exceeded 20,480. In practice, we achieved ~40 tera floating point operations per second (TFLOPS) when the number of 3-D object points exceeded 40,960. Our proposed method was able to reconstruct a real-time movie of a 3-D object comprising 95,949 points.
      

      
      GASPRNG: GPU accelerated scalable parallel random number generator library
      NASA Astrophysics Data System (ADS)
      Gao, Shuang; Peterson, Gregory D.
         2013-04-01
         Graphics processors represent a promising technology for accelerating computational science applications. Many computational science applications require fast and scalable random number generation with good statistical properties, so they use the Scalable Parallel Random Number Generators library (SPRNG). We present the GPU Accelerated SPRNG library (GASPRNG) to accelerate SPRNG in GPU-based high performance computing systems. GASPRNG includes code for a host CPU and CUDA code for execution on NVIDIA graphics processing units (GPUs) along with a programming interface to support various usage models for pseudorandom numbers and computational science applications executing on the CPU, GPU, or both. This paper describes the implementation approach used to produce high performance and also describes how to use the programming interface. The programming interface allows a user to be able to use GASPRNG the same way as SPRNG on traditional serial or parallel computers as well as to develop tightly coupled programs executing primarily on the GPU. We also describe how to install GASPRNG and use it. To help illustrate linking with GASPRNG, various demonstration codes are included for the different usage models. GASPRNG on a single GPU shows up to 280x speedup over SPRNG on a single CPU core and is able to scale for larger systems in the same manner as SPRNG. Because GASPRNG generates identical streams of pseudorandom numbers as SPRNG, users can be confident about the quality of GASPRNG for scalable computational science applications. Catalogue identifier: AEOI_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEOI_v1_0.html Program obtainable from: CPC Program Library, Queen’s University, Belfast, N. Ireland Licensing provisions: UTK license. No. of lines in distributed program, including test data, etc.: 167900 No. of bytes in distributed program, including test data, etc.: 1422058 Distribution format: tar.gz Programming language: C and CUDA. 
Computer: Any PC or workstation with NVIDIA GPU (Tested on Fermi GTX480, Tesla C1060, Tesla M2070). Operating system: Linux with CUDA version 4.0 or later. Should also run on MacOS, Windows, or UNIX. Has the code been vectorized or parallelized?: Yes. Parallelized using MPI directives. RAM: 512 MB ~ 732 MB (main memory on host CPU, depending on the data type of random numbers) / 512 MB (GPU global memory). Classification: 4.13, 6.5. Nature of problem: Many computational science applications are able to consume large numbers of random numbers. For example, Monte Carlo simulations are able to consume limitless random numbers for the computation as long as resources for the computing are supported. Moreover, parallel computational science applications require independent streams of random numbers to attain statistically significant results. The SPRNG library provides this capability, but at a significant computational cost. The GASPRNG library presented here accelerates the generators of independent streams of random numbers using graphical processing units (GPUs). Solution method: Multiple copies of random number generators in GPUs allow a computational science application to consume large numbers of random numbers from independent, parallel streams. GASPRNG is a random number generators library to allow a computational science application to employ multiple copies of random number generators to boost performance. Users can interface GASPRNG with software code executing on microprocessors and/or GPUs. Running time: The tests provided take a few minutes to run.
      

      
      Computational Issues Associated with Temporally Deforming Geometries Such as Thrust Vectoring Nozzles
      NASA Technical Reports Server (NTRS)
      Boyalakuntla, Kishore; Soni, Bharat K.; Thornburg, Hugh J.; Yu, Robert
         1996-01-01
         During the past decade, computational simulation of fluid flow around complex configurations has progressed significantly and many notable successes have been reported; however, unsteady time-dependent solutions are not easily obtainable. The present effort involves unsteady time-dependent simulation of temporally deforming geometries. Grid generation for a complex configuration can be a time-consuming process, and temporally varying geometries necessitate the regeneration of such grids for every time step. Traditional grid generation techniques have been tried and demonstrated to be inadequate for such simulations. Non-Uniform Rational B-splines (NURBS) based techniques provide a compact and accurate representation of the geometry. This definition can be coupled with a distribution mesh for a user-defined spacing. The present method greatly reduces CPU requirements for time-dependent remeshing, facilitating the simulation of more complex unsteady problems. A thrust vectoring nozzle has been chosen to demonstrate the capability, as it is of current interest in the aerospace industry for better maneuverability of fighter aircraft in close combat and in post-stall regimes. This current effort is the first step towards multidisciplinary design optimization, which involves coupling the aerodynamic, heat transfer, and structural analysis techniques. Applications include simulation of temporally deforming bodies and aeroelastic problems.
      

        
       
          


          
        

     

   
