overlapping instruction cpu: Topics by Science.gov

Sample records for overlapping instruction cpu

System for processing an encrypted instruction stream in hardware

DOE Office of Scientific and Technical Information (OSTI.GOV)

Griswold, Richard L.; Nickless, William K.; Conrad, Ryan C.

A system and method of processing an encrypted instruction stream in hardware is disclosed. Main memory stores the encrypted instruction stream and unencrypted data. A central processing unit (CPU) is operatively coupled to the main memory. A decryptor is operatively coupled to the main memory and located within the CPU. The decryptor decrypts the encrypted instruction stream upon receipt of an instruction fetch signal from a CPU core. Unencrypted data is passed through to the CPU core without decryption upon receipt of a data fetch signal.
Memory interface simulator: A computer design aid

NASA Technical Reports Server (NTRS)

Taylor, D. S.; Williams, T.; Weatherbee, J. E.

1972-01-01

Results are presented of a study conducted with a digital simulation model being used in the design of the Automatically Reconfigurable Modular Multiprocessor System (ARMMS), a candidate computer system for future manned and unmanned space missions. The model simulates the activity involved as instructions are fetched from random access memory for execution in one of the system central processing units. A series of model runs measured instruction execution time under various assumptions pertaining to the CPU's and the interface between the CPU's and RAM. Design tradeoffs are presented in the following areas: Bus widths, CPU microprogram read only memory cycle time, multiple instruction fetch, and instruction mix.
Multiprocessing MCNP on an IBN RS/6000 cluster

DOE Office of Scientific and Technical Information (OSTI.GOV)

McKinney, G.W.; West, J.T.

1993-01-01

The advent of high-performance computer systems has brought to maturity programming concepts like vectorization, multiprocessing, and multitasking. While there are many schools of thought as to the most significant factor in obtaining order-of-magnitude increases in performance, such speedup can only be achieved by integrating the computer system and application code. Vectorization leads to faster manipulation of arrays by overlapping instruction CPU cycles. Discrete ordinates codes, which require the solving of large matrices, have proved to be major benefactors of vectorization. Monte Carlo transport, on the other hand, typically contains numerous logic statements and requires extensive redevelopment to benefit from vectorization.more » Multiprocessing and multitasking provide additional CPU cycles via multiple processors. Such systems are generally designed with either common memory access (multitasking) or distributed memory access. In both cases, theoretical speedup, as a function of the number of processors P and the fraction f of task time that multiprocesses, can be formulated using Amdahl's law: S(f, P) =1/(1-f+f/P). However, for most applications, this theoretical limit cannot be achieved because of additional terms (e.g., multitasking overhead, memory overlap, etc.) that are not included in Amdahl's law. Monte Carlo transport is a natural candidate for multiprocessing because the particle tracks are generally independent, and the precision of the result increases as the square Foot of the number of particles tracked.« less
CUDASW++ 3.0: accelerating Smith-Waterman protein database search by coupling CPU and GPU SIMD instructions.

PubMed

Liu, Yongchao; Wirawan, Adrianto; Schmidt, Bertil

2013-04-04

The maximal sensitivity for local alignments makes the Smith-Waterman algorithm a popular choice for protein sequence database search based on pairwise alignment. However, the algorithm is compute-intensive due to a quadratic time complexity. Corresponding runtimes are further compounded by the rapid growth of sequence databases. We present CUDASW++ 3.0, a fast Smith-Waterman protein database search algorithm, which couples CPU and GPU SIMD instructions and carries out concurrent CPU and GPU computations. For the CPU computation, this algorithm employs SSE-based vector execution units as accelerators. For the GPU computation, we have investigated for the first time a GPU SIMD parallelization, which employs CUDA PTX SIMD video instructions to gain more data parallelism beyond the SIMT execution model. Moreover, sequence alignment workloads are automatically distributed over CPUs and GPUs based on their respective compute capabilities. Evaluation on the Swiss-Prot database shows that CUDASW++ 3.0 gains a performance improvement over CUDASW++ 2.0 up to 2.9 and 3.2, with a maximum performance of 119.0 and 185.6 GCUPS, on a single-GPU GeForce GTX 680 and a dual-GPU GeForce GTX 690 graphics card, respectively. In addition, our algorithm has demonstrated significant speedups over other top-performing tools: SWIPE and BLAST+. CUDASW++ 3.0 is written in CUDA C++ and PTX assembly languages, targeting GPUs based on the Kepler architecture. This algorithm obtains significant speedups over its predecessor: CUDASW++ 2.0, by benefiting from the use of CPU and GPU SIMD instructions as well as the concurrent execution on CPUs and GPUs. The source code and the simulated data are available at http://cudasw.sourceforge.net.
Using SimCPU in Cooperative Learning Laboratories.

ERIC Educational Resources Information Center

Lin, Janet Mei-Chuen; Wu, Cheng-Chih; Liu, Hsi-Jen

1999-01-01

Reports research findings of an experimental design in which cooperative-learning strategies were applied to closed-lab instruction of computing concepts. SimCPU, a software package specially designed for closed-lab usage was used by 171 high school students of four classes. Results showed that collaboration enhanced learning and that blending…
Multiprocessing MCNP on an IBM RS/6000 cluster

DOE Office of Scientific and Technical Information (OSTI.GOV)

McKinney, G.W.; West, J.T.

1993-01-01

The advent of high-performance computer systems has brought to maturity programming concepts like vectorization, multiprocessing, and multitasking. While there are many schools of thought as to the most significant factor in obtaining order-of-magnitude increases in performance, such speedup can only be achieved by integrating the computer system and application code. Vectorization leads to faster manipulation of arrays by overlapping instruction CPU cycles. Discrete ordinates codes, which require the solving of large matrices, have proved to be major benefactors of vectorization. Monte Carlo transport, on the other hand, typically contains numerous logic statements and requires extensive redevelopment to benefit from vectorization.more » Multiprocessing and multitasking provide additional CPU cycles via multiple processors. Such systems are generally designed with either common memory access (multitasking) or distributed memory access. In both cases, theoretical speedup, as a function of the number of processors (P) and the fraction of task time that multiprocesses (f), can be formulated using Amdahl's Law S ((f,P) = 1 f + f/P). However, for most applications this theoretical limit cannot be achieved, due to additional terms not included in Amdahl's Law. Monte Carlo transport is a natural candidate for multiprocessing, since the particle tracks are generally independent and the precision of the result increases as the square root of the number of particles tracked.« less
CPU SIM: A Computer Simulator for Use in an Introductory Computer Organization-Architecture Class.

ERIC Educational Resources Information Center

Skrein, Dale

1994-01-01

CPU SIM, an interactive low-level computer simulation package that runs on the Macintosh computer, is described. The program is designed for instructional use in the first or second year of undergraduate computer science, to teach various features of typical computer organization through hands-on exercises. (MSE)
Computer hardware for radiologists: Part I

PubMed Central

Indrajit, IK; Alam, A

2010-01-01

Computers are an integral part of modern radiology practice. They are used in different radiology modalities to acquire, process, and postprocess imaging data. They have had a dramatic influence on contemporary radiology practice. Their impact has extended further with the emergence of Digital Imaging and Communications in Medicine (DICOM), Picture Archiving and Communication System (PACS), Radiology information system (RIS) technology, and Teleradiology. A basic overview of computer hardware relevant to radiology practice is presented here. The key hardware components in a computer are the motherboard, central processor unit (CPU), the chipset, the random access memory (RAM), the memory modules, bus, storage drives, and ports. The personnel computer (PC) has a rectangular case that contains important components called hardware, many of which are integrated circuits (ICs). The fiberglass motherboard is the main printed circuit board and has a variety of important hardware mounted on it, which are connected by electrical pathways called “buses”. The CPU is the largest IC on the motherboard and contains millions of transistors. Its principal function is to execute “programs”. A Pentium® 4 CPU has transistors that execute a billion instructions per second. The chipset is completely different from the CPU in design and function; it controls data and interaction of buses between the motherboard and the CPU. Memory (RAM) is fundamentally semiconductor chips storing data and instructions for access by a CPU. RAM is classified by storage capacity, access speed, data rate, and configuration. PMID:21042437
Multiprocessing MCNP on an IBM RS/6000 cluster

DOE Office of Scientific and Technical Information (OSTI.GOV)

McKinney, G.W.; West, J.T.

1993-03-01

The advent of high-performance computer systems has brought to maturity programming concepts like vectorization, multiprocessing, and multitasking. While there are many schools of thought as to the most significant factor in obtaining order-of-magnitude increases in performance, such speedup can only be achieved by integrating the computer system and application code. Vectorization leads to faster manipulation of arrays by overlapping instruction CPU cycles. Discrete ordinates codes, which require the solving of large matrices, have proved to be major benefactors of vectorization. Monte Carlo transport, on the other hand, typically contains numerous logic statements and requires extensive redevelopment to benefit from vectorization.more » Multiprocessing and multitasking provide additional CPU cycles via multiple processors. Such systems are generally designed with either common memory access (multitasking) or distributed memory access. In both cases, theoretical speedup, as a function of the number of processors (P) and the fraction of task time that multiprocesses (f), can be formulated using Amdahl`s Law S ((f,P) = 1 f + f/P). However, for most applications this theoretical limit cannot be achieved, due to additional terms not included in Amdahl`s Law. Monte Carlo transport is a natural candidate for multiprocessing, since the particle tracks are generally independent and the precision of the result increases as the square root of the number of particles tracked.« less
Multi-GPU and multi-CPU accelerated FDTD scheme for vibroacoustic applications

NASA Astrophysics Data System (ADS)

Francés, J.; Otero, B.; Bleda, S.; Gallego, S.; Neipp, C.; Márquez, A.; Beléndez, A.

2015-06-01

The Finite-Difference Time-Domain (FDTD) method is applied to the analysis of vibroacoustic problems and to study the propagation of longitudinal and transversal waves in a stratified media. The potential of the scheme and the relevance of each acceleration strategy for massively computations in FDTD are demonstrated in this work. In this paper, we propose two new specific implementations of the bi-dimensional scheme of the FDTD method using multi-CPU and multi-GPU, respectively. In the first implementation, an open source message passing interface (OMPI) has been included in order to massively exploit the resources of a biprocessor station with two Intel Xeon processors. Moreover, regarding CPU code version, the streaming SIMD extensions (SSE) and also the advanced vectorial extensions (AVX) have been included with shared memory approaches that take advantage of the multi-core platforms. On the other hand, the second implementation called the multi-GPU code version is based on Peer-to-Peer communications available in CUDA on two GPUs (NVIDIA GTX 670). Subsequently, this paper presents an accurate analysis of the influence of the different code versions including shared memory approaches, vector instructions and multi-processors (both CPU and GPU) and compares them in order to delimit the degree of improvement of using distributed solutions based on multi-CPU and multi-GPU. The performance of both approaches was analysed and it has been demonstrated that the addition of shared memory schemes to CPU computing improves substantially the performance of vector instructions enlarging the simulation sizes that use efficiently the cache memory of CPUs. In this case GPU computing is slightly twice times faster than the fine tuned CPU version in both cases one and two nodes. However, for massively computations explicit vector instructions do not worth it since the memory bandwidth is the limiting factor and the performance tends to be the same than the sequential version with auto-vectorisation and also shared memory approach. In this scenario GPU computing is the best option since it provides a homogeneous behaviour. More specifically, the speedup of GPU computing achieves an upper limit of 12 for both one and two GPUs, whereas the performance reaches peak values of 80 GFlops and 146 GFlops for the performance for one GPU and two GPUs respectively. Finally, the method is applied to an earth crust profile in order to demonstrate the potential of our approach and the necessity of applying acceleration strategies in these type of applications.
Design Alternatives to Improve Access Time Performance of Disk Drives Under DOS and UNIX

NASA Astrophysics Data System (ADS)

Hospodor, Andy

For the past 25 years, improvements in CPU performance have overshadowed improvements in the access time performance of disk drives. CPU performance has been slanted towards greater instruction execution rates, measured in millions of instructions per second (MIPS). However, the slant for performance of disk storage has been towards capacity and corresponding increased storage densities. The IBM PC, introduced in 1982, processed only a fraction of a MIP. Follow-on CPUs, such as the 80486 and 80586, sported 5-10 MIPS by 1992. Single user PCs and workstations, with one CPU and one disk drive, became the dominant application, as implied by their production volumes. However, disk drives did not enjoy a corresponding improvement in access time performance, although the potential still exists. The time to access a disk drive improves (decreases) in two ways: by altering the mechanical properties of the drive or by adding cache to the drive. This paper explores the improvement to access time performance of disk drives using cache, prefetch, faster rotation rates, and faster seek acceleration.
Personal Computer and Workstation Operating Systems Tutorial

DTIC Science & Technology

1994-03-01

to a RAM area where it is executed by the CPU. The program consists of instructions that perform operations on data. The CPU will perform two basic...memory to improve system performance. More often the user will buy a new fixed disk so the computer will hold more programs internally. The trend today...MHZ. Another way to view how fast the information is going into the register is in a time domain rather than a frequency domain knowing that time and
A Symposium: Relevant Cue Research, a Program of Systematic Evaluation: Considerations for Sustaining Instructional Design Research Using an Integrated Learning System.

ERIC Educational Resources Information Center

Grabowski, Barbara

An intelligent videodisc system on which comprehensive instructional development research can be conducted has been developed. This integrated learning system combines all other existing media, except objects, using a videodisc, microcomputer, printer, single monitor, hard disc storage with CPU for random access digitized audio, and headphones.…
Libsharp - spherical harmonic transforms revisited

NASA Astrophysics Data System (ADS)

Reinecke, M.; Seljebotn, D. S.

2013-06-01

We present libsharp, a code library for spherical harmonic transforms (SHTs), which evolved from the libpsht library and addresses several of its shortcomings, such as adding MPI support for distributed memory systems and SHTs of fields with arbitrary spin, but also supporting new developments in CPU instruction sets like the Advanced Vector Extensions (AVX) or fused multiply-accumulate (FMA) instructions. The library is implemented in portable C99 and provides an interface that can be easily accessed from other programming languages such as C++, Fortran, Python, etc. Generally, libsharp's performance is at least on par with that of its predecessor; however, significant improvements were made to the algorithms for scalar SHTs, which are roughly twice as fast when using the same CPU capabilities. The library is available at http://sourceforge.net/projects/libsharp/ under the terms of the GNU General Public License.
SU-E-J-60: Efficient Monte Carlo Dose Calculation On CPU-GPU Heterogeneous Systems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Xiao, K; Chen, D. Z; Hu, X. S

Purpose: It is well-known that the performance of GPU-based Monte Carlo dose calculation implementations is bounded by memory bandwidth. One major cause of this bottleneck is the random memory writing patterns in dose deposition, which leads to several memory efficiency issues on GPU such as un-coalesced writing and atomic operations. We propose a new method to alleviate such issues on CPU-GPU heterogeneous systems, which achieves overall performance improvement for Monte Carlo dose calculation. Methods: Dose deposition is to accumulate dose into the voxels of a dose volume along the trajectories of radiation rays. Our idea is to partition this proceduremore » into the following three steps, which are fine-tuned for CPU or GPU: (1) each GPU thread writes dose results with location information to a buffer on GPU memory, which achieves fully-coalesced and atomic-free memory transactions; (2) the dose results in the buffer are transferred to CPU memory; (3) the dose volume is constructed from the dose buffer on CPU. We organize the processing of all radiation rays into streams. Since the steps within a stream use different hardware resources (i.e., GPU, DMA, CPU), we can overlap the execution of these steps for different streams by pipelining. Results: We evaluated our method using a Monte Carlo Convolution Superposition (MCCS) program and tested our implementation for various clinical cases on a heterogeneous system containing an Intel i7 quad-core CPU and an NVIDIA TITAN GPU. Comparing with a straightforward MCCS implementation on the same system (using both CPU and GPU for radiation ray tracing), our method gained 2-5X speedup without losing dose calculation accuracy. Conclusion: The results show that our new method improves the effective memory bandwidth and overall performance for MCCS on the CPU-GPU systems. Our proposed method can also be applied to accelerate other Monte Carlo dose calculation approaches. This research was supported in part by NSF under Grants CCF-1217906, and also in part by a research contract from the Sandia National Laboratories.« less
Chip architecture - A revolution brewing

NASA Astrophysics Data System (ADS)

Guterl, F.

1983-07-01

Techniques being explored by microchip designers and manufacturers to both speed up memory access and instruction execution while protecting memory are discussed. Attention is given to hardwiring control logic, pipelining for parallel processing, devising orthogonal instruction sets for interchangeable instruction fields, and the development of hardware for implementation of virtual memory and multiuser systems to provide memory management and protection. The inclusion of microcode in mainframes eliminated logic circuits that control timing and gating of the CPU. However, improvements in memory architecture have reduced access time to below that needed for instruction execution. Hardwiring the functions as a virtual memory enhances memory protection. Parallelism involves a redundant architecture, which allows identical operations to be performed simultaneously, and can be directed with microcode to avoid abortion of intermediate instructions once on set of instructions has been completed.
gpuPOM: a GPU-based Princeton Ocean Model

NASA Astrophysics Data System (ADS)

Xu, S.; Huang, X.; Zhang, Y.; Fu, H.; Oey, L.-Y.; Xu, F.; Yang, G.

2014-11-01

Rapid advances in the performance of the graphics processing unit (GPU) have made the GPU a compelling solution for a series of scientific applications. However, most existing GPU acceleration works for climate models are doing partial code porting for certain hot spots, and can only achieve limited speedup for the entire model. In this work, we take the mpiPOM (a parallel version of the Princeton Ocean Model) as our starting point, design and implement a GPU-based Princeton Ocean Model. By carefully considering the architectural features of the state-of-the-art GPU devices, we rewrite the full mpiPOM model from the original Fortran version into a new Compute Unified Device Architecture C (CUDA-C) version. We take several accelerating methods to further improve the performance of gpuPOM, including optimizing memory access in a single GPU, overlapping communication and boundary operations among multiple GPUs, and overlapping input/output (I/O) between the hybrid Central Processing Unit (CPU) and the GPU. Our experimental results indicate that the performance of the gpuPOM on a workstation containing 4 GPUs is comparable to a powerful cluster with 408 CPU cores and it reduces the energy consumption by 6.8 times.
Accelerating finite-rate chemical kinetics with coprocessors: Comparing vectorization methods on GPUs, MICs, and CPUs

NASA Astrophysics Data System (ADS)

Stone, Christopher P.; Alferman, Andrew T.; Niemeyer, Kyle E.

2018-05-01

Accurate and efficient methods for solving stiff ordinary differential equations (ODEs) are a critical component of turbulent combustion simulations with finite-rate chemistry. The ODEs governing the chemical kinetics at each mesh point are decoupled by operator-splitting allowing each to be solved concurrently. An efficient ODE solver must then take into account the available thread and instruction-level parallelism of the underlying hardware, especially on many-core coprocessors, as well as the numerical efficiency. A stiff Rosenbrock and a nonstiff Runge-Kutta ODE solver are both implemented using the single instruction, multiple thread (SIMT) and single instruction, multiple data (SIMD) paradigms within OpenCL. Both methods solve multiple ODEs concurrently within the same instruction stream. The performance of these parallel implementations was measured on three chemical kinetic models of increasing size across several multicore and many-core platforms. Two separate benchmarks were conducted to clearly determine any performance advantage offered by either method. The first benchmark measured the run-time of evaluating the right-hand-side source terms in parallel and the second benchmark integrated a series of constant-pressure, homogeneous reactors using the Rosenbrock and Runge-Kutta solvers. The right-hand-side evaluations with SIMD parallelism on the host multicore Xeon CPU and many-core Xeon Phi co-processor performed approximately three times faster than the baseline multithreaded C++ code. The SIMT parallel model on the host and Phi was 13%-35% slower than the baseline while the SIMT model on the NVIDIA Kepler GPU provided approximately the same performance as the SIMD model on the Phi. The runtimes for both ODE solvers decreased significantly with the SIMD implementations on the host CPU (2.5-2.7 ×) and Xeon Phi coprocessor (4.7-4.9 ×) compared to the baseline parallel code. The SIMT implementations on the GPU ran 1.5-1.6 times faster than the baseline multithreaded CPU code; however, this was significantly slower than the SIMD versions on the host CPU or the Xeon Phi. The performance difference between the three platforms was attributed to thread divergence caused by the adaptive step-sizes within the ODE integrators. Analysis showed that the wider vector width of the GPU incurs a higher level of divergence than the narrower Sandy Bridge or Xeon Phi. The significant performance improvement provided by the SIMD parallel strategy motivates further research into more ODE solver methods that are both SIMD-friendly and computationally efficient.
A 32-bit NMOS microprocessor with a large register file

NASA Astrophysics Data System (ADS)

Sherburne, R. W., Jr.; Katevenis, M. G. H.; Patterson, D. A.; Sequin, C. H.

1984-10-01

Two scaled versions of a 32-bit NMOS reduced instruction set computer CPU, called RISC II, have been implemented on two different processing lines using the simple Mead and Conway layout rules with lambda values of 2 and 1.5 microns (corresponding to drawn gate lengths of 4 and 3 microns), respectively. The design utilizes a small set of simple instructions in conjunction with a large register file in order to provide high performance. This approach has resulted in two surprisingly powerful single-chip processors.
Rapid Computation of Thermodynamic Properties over Multidimensional Nonbonded Parameter Spaces Using Adaptive Multistate Reweighting.

PubMed

Naden, Levi N; Shirts, Michael R

2016-04-12

We show how thermodynamic properties of molecular models can be computed over a large, multidimensional parameter space by combining multistate reweighting analysis with a linear basis function approach. This approach reduces the computational cost to estimate thermodynamic properties from molecular simulations for over 130,000 tested parameter combinations from over 1000 CPU years to tens of CPU days. This speed increase is achieved primarily by computing the potential energy as a linear combination of basis functions, computed from either modified simulation code or as the difference of energy between two reference states, which can be done without any simulation code modification. The thermodynamic properties are then estimated with the Multistate Bennett Acceptance Ratio (MBAR) as a function of multiple model parameters without the need to define a priori how the states are connected by a pathway. Instead, we adaptively sample a set of points in parameter space to create mutual configuration space overlap. The existence of regions of poor configuration space overlap are detected by analyzing the eigenvalues of the sampled states' overlap matrix. The configuration space overlap to sampled states is monitored alongside the mean and maximum uncertainty to determine convergence, as neither the uncertainty or the configuration space overlap alone is a sufficient metric of convergence. This adaptive sampling scheme is demonstrated by estimating with high precision the solvation free energies of charged particles of Lennard-Jones plus Coulomb functional form with charges between -2 and +2 and generally physical values of σij and ϵij in TIP3P water. We also compute entropy, enthalpy, and radial distribution functions of arbitrary unsampled parameter combinations using only the data from these sampled states and use the estimates of free energies over the entire space to examine the deviation of atomistic simulations from the Born approximation to the solvation free energy.

Fast polyenergetic forward projection for image formation using OpenCL on a heterogeneous parallel computing platform.

PubMed

Zhou, Lili; Clifford Chao, K S; Chang, Jenghwa

2012-11-01

Simulated projection images of digital phantoms constructed from CT scans have been widely used for clinical and research applications but their quality and computation speed are not optimal for real-time comparison with the radiography acquired with an x-ray source of different energies. In this paper, the authors performed polyenergetic forward projections using open computing language (OpenCL) in a parallel computing ecosystem consisting of CPU and general purpose graphics processing unit (GPGPU) for fast and realistic image formation. The proposed polyenergetic forward projection uses a lookup table containing the NIST published mass attenuation coefficients (μ∕ρ) for different tissue types and photon energies ranging from 1 keV to 20 MeV. The CT images of interested sites are first segmented into different tissue types based on the CT numbers and converted to a three-dimensional attenuation phantom by linking each voxel to the corresponding tissue type in the lookup table. The x-ray source can be a radioisotope or an x-ray generator with a known spectrum described as weight w(n) for energy bin E(n). The Siddon method is used to compute the x-ray transmission line integral for E(n) and the x-ray fluence is the weighted sum of the exponential of line integral for all energy bins with added Poisson noise. To validate this method, a digital head and neck phantom constructed from the CT scan of a Rando head phantom was segmented into three (air, gray∕white matter, and bone) regions for calculating the polyenergetic projection images for the Mohan 4 MV energy spectrum. To accelerate the calculation, the authors partitioned the workloads using the task parallelism and data parallelism and scheduled them in a parallel computing ecosystem consisting of CPU and GPGPU (NVIDIA Tesla C2050) using OpenCL only. The authors explored the task overlapping strategy and the sequential method for generating the first and subsequent DRRs. A dispatcher was designed to drive the high-degree parallelism of the task overlapping strategy. Numerical experiments were conducted to compare the performance of the OpenCL∕GPGPU-based implementation with the CPU-based implementation. The projection images were similar to typical portal images obtained with a 4 or 6 MV x-ray source. For a phantom size of 512 × 512 × 223, the time for calculating the line integrals for a 512 × 512 image panel was 16.2 ms on GPGPU for one energy bin in comparison to 8.83 s on CPU. The total computation time for generating one polyenergetic projection image of 512 × 512 was 0.3 s (141 s for CPU). The relative difference between the projection images obtained with the CPU-based and OpenCL∕GPGPU-based implementations was on the order of 10(-6) and was virtually indistinguishable. The task overlapping strategy was 5.84 and 1.16 times faster than the sequential method for the first and the subsequent digitally reconstruction radiographies, respectively. The authors have successfully built digital phantoms using anatomic CT images and NIST μ∕ρ tables for simulating realistic polyenergetic projection images and optimized the processing speed with parallel computing using GPGPU∕OpenCL-based implementation. The computation time was fast (0.3 s per projection image) enough for real-time IGRT (image-guided radiotherapy) applications.
The PLATO IV Architecture.

ERIC Educational Resources Information Center

Stifle, Jack

The PLATO IV computer-based instructional system consists of a large scale centrally located CDC 6400 computer and a large number of remote student terminals. This is a brief and general description of the proposed input/output hardware necessary to interface the student terminals with the computer's central processing unit (CPU) using available…
Cpu/gpu Computing for AN Implicit Multi-Block Compressible Navier-Stokes Solver on Heterogeneous Platform

NASA Astrophysics Data System (ADS)

Deng, Liang; Bai, Hanli; Wang, Fang; Xu, Qingxin

2016-06-01

CPU/GPU computing allows scientists to tremendously accelerate their numerical codes. In this paper, we port and optimize a double precision alternating direction implicit (ADI) solver for three-dimensional compressible Navier-Stokes equations from our in-house Computational Fluid Dynamics (CFD) software on heterogeneous platform. First, we implement a full GPU version of the ADI solver to remove a lot of redundant data transfers between CPU and GPU, and then design two fine-grain schemes, namely “one-thread-one-point” and “one-thread-one-line”, to maximize the performance. Second, we present a dual-level parallelization scheme using the CPU/GPU collaborative model to exploit the computational resources of both multi-core CPUs and many-core GPUs within the heterogeneous platform. Finally, considering the fact that memory on a single node becomes inadequate when the simulation size grows, we present a tri-level hybrid programming pattern MPI-OpenMP-CUDA that merges fine-grain parallelism using OpenMP and CUDA threads with coarse-grain parallelism using MPI for inter-node communication. We also propose a strategy to overlap the computation with communication using the advanced features of CUDA and MPI programming. We obtain speedups of 6.0 for the ADI solver on one Tesla M2050 GPU in contrast to two Xeon X5670 CPUs. Scalability tests show that our implementation can offer significant performance improvement on heterogeneous platform.
An efficient implementation of semi-numerical computation of the Hartree-Fock exchange on the Intel Phi processor

NASA Astrophysics Data System (ADS)

Liu, Fenglai; Kong, Jing

2018-07-01

Unique technical challenges and their solutions for implementing semi-numerical Hartree-Fock exchange on the Phil Processor are discussed, especially concerning the single- instruction-multiple-data type of processing and small cache size. Benchmark calculations on a series of buckyball molecules with various Gaussian basis sets on a Phi processor and a six-core CPU show that the Phi processor provides as much as 12 times of speedup with large basis sets compared with the conventional four-center electron repulsion integration approach performed on the CPU. The accuracy of the semi-numerical scheme is also evaluated and found to be comparable to that of the resolution-of-identity approach.
[Research on fast implementation method of image Gaussian RBF interpolation based on CUDA].

PubMed

Chen, Hao; Yu, Haizhong

2014-04-01

Image interpolation is often required during medical image processing and analysis. Although interpolation method based on Gaussian radial basis function (GRBF) has high precision, the long calculation time still limits its application in field of image interpolation. To overcome this problem, a method of two-dimensional and three-dimensional medical image GRBF interpolation based on computing unified device architecture (CUDA) is proposed in this paper. According to single instruction multiple threads (SIMT) executive model of CUDA, various optimizing measures such as coalesced access and shared memory are adopted in this study. To eliminate the edge distortion of image interpolation, natural suture algorithm is utilized in overlapping regions while adopting data space strategy of separating 2D images into blocks or dividing 3D images into sub-volumes. Keeping a high interpolation precision, the 2D and 3D medical image GRBF interpolation achieved great acceleration in each basic computing step. The experiments showed that the operative efficiency of image GRBF interpolation based on CUDA platform was obviously improved compared with CPU calculation. The present method is of a considerable reference value in the application field of image interpolation.
Performance analysis of the FDTD method applied to holographic volume gratings: Multi-core CPU versus GPU computing

NASA Astrophysics Data System (ADS)

Francés, J.; Bleda, S.; Neipp, C.; Márquez, A.; Pascual, I.; Beléndez, A.

2013-03-01

The finite-difference time-domain method (FDTD) allows electromagnetic field distribution analysis as a function of time and space. The method is applied to analyze holographic volume gratings (HVGs) for the near-field distribution at optical wavelengths. Usually, this application requires the simulation of wide areas, which implies more memory and time processing. In this work, we propose a specific implementation of the FDTD method including several add-ons for a precise simulation of optical diffractive elements. Values in the near-field region are computed considering the illumination of the grating by means of a plane wave for different angles of incidence and including absorbing boundaries as well. We compare the results obtained by FDTD with those obtained using a matrix method (MM) applied to diffraction gratings. In addition, we have developed two optimized versions of the algorithm, for both CPU and GPU, in order to analyze the improvement of using the new NVIDIA Fermi GPU architecture versus highly tuned multi-core CPU as a function of the size simulation. In particular, the optimized CPU implementation takes advantage of the arithmetic and data transfer streaming SIMD (single instruction multiple data) extensions (SSE) included explicitly in the code and also of multi-threading by means of OpenMP directives. A good agreement between the results obtained using both FDTD and MM methods is obtained, thus validating our methodology. Moreover, the performance of the GPU is compared to the SSE+OpenMP CPU implementation, and it is quantitatively determined that a highly optimized CPU program can be competitive for a wider range of simulation sizes, whereas GPU computing becomes more powerful for large-scale simulations.
Measurement of fault latency in a digital avionic mini processor, part 2

NASA Technical Reports Server (NTRS)

Mcgough, J.; Swern, F.

1983-01-01

The results of fault injection experiments utilizing a gate-level emulation of the central processor unit of the Bendix BDX-930 digital computer are described. Several earlier programs were reprogrammed, expanding the instruction set to capitalize on the full power of the BDX-930 computer. As a final demonstration of fault coverage an extensive, 3-axis, high performance flght control computation was added. The stages in the development of a CPU self-test program emphasizing the relationship between fault coverage, speed, and quantity of instructions were demonstrated.
Supporting Undergraduate Computer Architecture Students Using a Visual MIPS64 CPU Simulator

ERIC Educational Resources Information Center

Patti, D.; Spadaccini, A.; Palesi, M.; Fazzino, F.; Catania, V.

2012-01-01

The topics of computer architecture are always taught using an Assembly dialect as an example. The most commonly used textbooks in this field use the MIPS64 Instruction Set Architecture (ISA) to help students in learning the fundamentals of computer architecture because of its orthogonality and its suitability for real-world applications. This…
Bayer image parallel decoding based on GPU

NASA Astrophysics Data System (ADS)

Hu, Rihui; Xu, Zhiyong; Wei, Yuxing; Sun, Shaohua

2012-11-01

In the photoelectrical tracking system, Bayer image is decompressed in traditional method, which is CPU-based. However, it is too slow when the images become large, for example, 2K×2K×16bit. In order to accelerate the Bayer image decoding, this paper introduces a parallel speedup method for NVIDA's Graphics Processor Unit (GPU) which supports CUDA architecture. The decoding procedure can be divided into three parts: the first is serial part, the second is task-parallelism part, and the last is data-parallelism part including inverse quantization, inverse discrete wavelet transform (IDWT) as well as image post-processing part. For reducing the execution time, the task-parallelism part is optimized by OpenMP techniques. The data-parallelism part could advance its efficiency through executing on the GPU as CUDA parallel program. The optimization techniques include instruction optimization, shared memory access optimization, the access memory coalesced optimization and texture memory optimization. In particular, it can significantly speed up the IDWT by rewriting the 2D (Tow-dimensional) serial IDWT into 1D parallel IDWT. Through experimenting with 1K×1K×16bit Bayer image, data-parallelism part is 10 more times faster than CPU-based implementation. Finally, a CPU+GPU heterogeneous decompression system was designed. The experimental result shows that it could achieve 3 to 5 times speed increase compared to the CPU serial method.
Design of a memory-access controller with 3.71-times-enhanced energy efficiency for Internet-of-Things-oriented nonvolatile microcontroller unit

NASA Astrophysics Data System (ADS)

Natsui, Masanori; Hanyu, Takahiro

2018-04-01

In realizing a nonvolatile microcontroller unit (MCU) for sensor nodes in Internet-of-Things (IoT) applications, it is important to solve the data-transfer bottleneck between the central processing unit (CPU) and the nonvolatile memory constituting the MCU. As one circuit-oriented approach to solving this problem, we propose a memory access minimization technique for magnetoresistive-random-access-memory (MRAM)-embedded nonvolatile MCUs. In addition to multiplexing and prefetching of memory access, the proposed technique realizes efficient instruction fetch by eliminating redundant memory access while considering the code length of the instruction to be fetched and the transition of the memory address to be accessed. As a result, the performance of the MCU can be improved while relaxing the performance requirement for the embedded MRAM, and compact and low-power implementation can be performed as compared with the conventional cache-based one. Through the evaluation using a system consisting of a general purpose 32-bit CPU and embedded MRAM, it is demonstrated that the proposed technique increases the peak efficiency of the system up to 3.71 times, while a 2.29-fold area reduction is achieved compared with the cache-based one.
A graphics-card implementation of Monte-Carlo simulations for cosmic-ray transport

NASA Astrophysics Data System (ADS)

Tautz, R. C.

2016-05-01

A graphics card implementation of a test-particle simulation code is presented that is based on the CUDA extension of the C/C++ programming language. The original CPU version has been developed for the calculation of cosmic-ray diffusion coefficients in artificial Kolmogorov-type turbulence. In the new implementation, the magnetic turbulence generation, which is the most time-consuming part, is separated from the particle transport and is performed on a graphics card. In this article, the modification of the basic approach of integrating test particle trajectories to employ the SIMD (single instruction, multiple data) model is presented and verified. The efficiency of the new code is tested and several language-specific accelerating factors are discussed. For the example of isotropic magnetostatic turbulence, sample results are shown and a comparison to the results of the CPU implementation is performed.
Research on SEU hardening of heterogeneous Dual-Core SoC

NASA Astrophysics Data System (ADS)

Huang, Kun; Hu, Keliu; Deng, Jun; Zhang, Tao

2017-08-01

The implementation of Single-Event Upsets (SEU) hardening has various schemes. However, some of them require a lot of human, material and financial resources. This paper proposes an easy scheme on SEU hardening for Heterogeneous Dual-core SoC (HD SoC) which contains three techniques. First, the automatic Triple Modular Redundancy (TMR) technique is adopted to harden the register heaps of the processor and the instruction-fetching module. Second, Hamming codes are used to harden the random access memory (RAM). Last, a software signature technique is applied to check the programs which are running on CPU. The scheme need not to consume additional resources, and has little influence on the performance of CPU. These technologies are very mature, easy to implement and needs low cost. According to the simulation result, the scheme can satisfy the basic demand of SEU-hardening.
An Efficient Method to Detect Mutual Overlap of a Large Set of Unordered Images for Structure-From

NASA Astrophysics Data System (ADS)

Wang, X.; Zhan, Z. Q.; Heipke, C.

2017-05-01

Recently, low-cost 3D reconstruction based on images has become a popular focus of photogrammetry and computer vision research. Methods which can handle an arbitrary geometric setup of a large number of unordered and convergent images are of particular interest. However, determining the mutual overlap poses a considerable challenge. We propose a new method which was inspired by and improves upon methods employing random k-d forests for this task. Specifically, we first derive features from the images and then a random k-d forest is used to find the nearest neighbours in feature space. Subsequently, the degree of similarity between individual images, the image overlaps and thus images belonging to a common block are calculated as input to a structure-from-motion (sfm) pipeline. In our experiments we show the general applicability of the new method and compare it with other methods by analyzing the time efficiency. Orientations and 3D reconstructions were successfully conducted with our overlap graphs by sfm. The results show a speed-up of a factor of 80 compared to conventional pairwise matching, and of 8 and 2 compared to the VocMatch approach using 1 and 4 CPU, respectively.
Cosimulation of embedded system using RTOS software simulator

NASA Astrophysics Data System (ADS)

Wang, Shihao; Duan, Zhigang; Liu, Mingye

2003-09-01

Embedded system design often employs co-simulation to verify system's function; one efficient verification tool of software is Instruction Set Simulator (ISS). As a full functional model of target CPU, ISS interprets instruction of embedded software step by step, which usually is time-consuming since it simulates at low-level. Hence ISS often becomes the bottleneck of co-simulation in a complicated system. In this paper, a new software verification tools, the RTOS software simulator (RSS) was presented. The mechanism of its operation was described in a full details. In RSS method, RTOS API is extended and hardware simulator driver is adopted to deal with data-exchange and synchronism between the two simulators.
A Cache Design to Exploit Structural Locality

DTIC Science & Technology

1991-12-01

memory and secondary storage. Main memory was used to store the instructions and data of an executing pro- gram, while secondary storage held programs ...efficiency of the CPU and faster turnaround of executing programs . In addition to the well known spatial and temporal aspects of locality, Hobart has...identified a third aspect, which he has called structural locality (9). This type of locality is defined as the tendency of an executing program to
Helicopter In-Flight Monitoring System Second Generation (HIMS II).

DTIC Science & Technology

1983-08-01

acquisition cycle. B. Computer Chassis CPU (DEC LSI-II/2) -- Executes instructions contained in the memory. 32K memory (DEC MSVII-DD) --Contains program...when the operator executes command #2, 3, or 5 (display data). New cartridges can be inserted as required for truly unlimited, continuous data...is called bootstrapping. The software, which is stored on a tape cartridge, is loaded into memory by execution of a small program stored in read-only
Collaborating CPU and GPU for large-scale high-order CFD simulations with complex grids on the TianHe-1A supercomputer

NASA Astrophysics Data System (ADS)

Xu, Chuanfu; Deng, Xiaogang; Zhang, Lilun; Fang, Jianbin; Wang, Guangxue; Jiang, Yi; Cao, Wei; Che, Yonggang; Wang, Yongxian; Wang, Zhenghua; Liu, Wei; Cheng, Xinghua

2014-12-01

Programming and optimizing complex, real-world CFD codes on current many-core accelerated HPC systems is very challenging, especially when collaborating CPUs and accelerators to fully tap the potential of heterogeneous systems. In this paper, with a tri-level hybrid and heterogeneous programming model using MPI + OpenMP + CUDA, we port and optimize our high-order multi-block structured CFD software HOSTA on the GPU-accelerated TianHe-1A supercomputer. HOSTA adopts two self-developed high-order compact definite difference schemes WCNS and HDCS that can simulate flows with complex geometries. We present a dual-level parallelization scheme for efficient multi-block computation on GPUs and perform particular kernel optimizations for high-order CFD schemes. The GPU-only approach achieves a speedup of about 1.3 when comparing one Tesla M2050 GPU with two Xeon X5670 CPUs. To achieve a greater speedup, we collaborate CPU and GPU for HOSTA instead of using a naive GPU-only approach. We present a novel scheme to balance the loads between the store-poor GPU and the store-rich CPU. Taking CPU and GPU load balance into account, we improve the maximum simulation problem size per TianHe-1A node for HOSTA by 2.3×, meanwhile the collaborative approach can improve the performance by around 45% compared to the GPU-only approach. Further, to scale HOSTA on TianHe-1A, we propose a gather/scatter optimization to minimize PCI-e data transfer times for ghost and singularity data of 3D grid blocks, and overlap the collaborative computation and communication as far as possible using some advanced CUDA and MPI features. Scalability tests show that HOSTA can achieve a parallel efficiency of above 60% on 1024 TianHe-1A nodes. With our method, we have successfully simulated an EET high-lift airfoil configuration containing 800M cells and China's large civil airplane configuration containing 150M cells. To our best knowledge, those are the largest-scale CPU-GPU collaborative simulations that solve realistic CFD problems with both complex configurations and high-order schemes.
The Photon Shell Game and the Quantum von Neumann Architecture with Superconducting Circuits

NASA Astrophysics Data System (ADS)

Mariantoni, Matteo

2012-02-01

Superconducting quantum circuits have made significant advances over the past decade, allowing more complex and integrated circuits that perform with good fidelity. We have recently implemented a machine comprising seven quantum channels, with three superconducting resonators, two phase qubits, and two zeroing registers. I will explain the design and operation of this machine, first showing how a single microwave photon | 1 > can be prepared in one resonator and coherently transferred between the three resonators. I will also show how more exotic states such as double photon states | 2 > and superposition states | 0 >+ | 1 > can be shuffled among the resonators as well [1]. I will then demonstrate how this machine can be used as the quantum-mechanical analog of the von Neumann computer architecture, which for a classical computer comprises a central processing unit and a memory holding both instructions and data. The quantum version comprises a quantum central processing unit (quCPU) that exchanges data with a quantum random-access memory (quRAM) integrated on one chip, with instructions stored on a classical computer. I will also present a proof-of-concept demonstration of a code that involves all seven quantum elements: (1), Preparing an entangled state in the quCPU, (2), writing it to the quRAM, (3), preparing a second state in the quCPU, (4), zeroing it, and, (5), reading out the first state stored in the quRAM [2]. Finally, I will demonstrate that the quantum von Neumann machine provides one unit cell of a two-dimensional qubit-resonator array that can be used for surface code quantum computing. This will allow the realization of a scalable, fault-tolerant quantum processor with the most forgiving error rates to date. [4pt] [1] M. Mariantoni et al., Nature Physics 7, 287-293 (2011.)[0pt] [2] M. Mariantoni et al., Science 334, 61-65 (2011).
Speeding-up Bioinformatics Algorithms with Heterogeneous Architectures: Highly Heterogeneous Smith-Waterman (HHeterSW).

PubMed

Gálvez, Sergio; Ferusic, Adis; Esteban, Francisco J; Hernández, Pilar; Caballero, Juan A; Dorado, Gabriel

2016-10-01

The Smith-Waterman algorithm has a great sensitivity when used for biological sequence-database searches, but at the expense of high computing-power requirements. To overcome this problem, there are implementations in literature that exploit the different hardware-architectures available in a standard PC, such as GPU, CPU, and coprocessors. We introduce an application that splits the original database-search problem into smaller parts, resolves each of them by executing the most efficient implementations of the Smith-Waterman algorithms in different hardware architectures, and finally unifies the generated results. Using non-overlapping hardware allows simultaneous execution, and up to 2.58-fold performance gain, when compared with any other algorithm to search sequence databases. Even the performance of the popular BLAST heuristic is exceeded in 78% of the tests. The application has been tested with standard hardware: Intel i7-4820K CPU, Intel Xeon Phi 31S1P coprocessors, and nVidia GeForce GTX 960 graphics cards. An important increase in performance has been obtained in a wide range of situations, effectively exploiting the available hardware.
Examining the architecture of cellular computing through a comparative study with a computer

PubMed Central

Wang, Degeng; Gribskov, Michael

2005-01-01

The computer and the cell both use information embedded in simple coding, the binary software code and the quadruple genomic code, respectively, to support system operations. A comparative examination of their system architecture as well as their information storage and utilization schemes is performed. On top of the code, both systems display a modular, multi-layered architecture, which, in the case of a computer, arises from human engineering efforts through a combination of hardware implementation and software abstraction. Using the computer as a reference system, a simplistic mapping of the architectural components between the two is easily detected. This comparison also reveals that a cell abolishes the software–hardware barrier through genomic encoding for the constituents of the biochemical network, a cell's ‘hardware’ equivalent to the computer central processing unit (CPU). The information loading (gene expression) process acts as a major determinant of the encoded constituent's abundance, which, in turn, often determines the ‘bandwidth’ of a biochemical pathway. Cellular processes are implemented in biochemical pathways in parallel manners. In a computer, on the other hand, the software provides only instructions and data for the CPU. A process represents just sequentially ordered actions by the CPU and only virtual parallelism can be implemented through CPU time-sharing. Whereas process management in a computer may simply mean job scheduling, coordinating pathway bandwidth through the gene expression machinery represents a major process management scheme in a cell. In summary, a cell can be viewed as a super-parallel computer, which computes through controlled hardware composition. While we have, at best, a very fragmented understanding of cellular operation, we have a thorough understanding of the computer throughout the engineering process. The potential utilization of this knowledge to the benefit of systems biology is discussed. PMID:16849179

Examining the architecture of cellular computing through a comparative study with a computer.

PubMed

Wang, Degeng; Gribskov, Michael

2005-06-22

The computer and the cell both use information embedded in simple coding, the binary software code and the quadruple genomic code, respectively, to support system operations. A comparative examination of their system architecture as well as their information storage and utilization schemes is performed. On top of the code, both systems display a modular, multi-layered architecture, which, in the case of a computer, arises from human engineering efforts through a combination of hardware implementation and software abstraction. Using the computer as a reference system, a simplistic mapping of the architectural components between the two is easily detected. This comparison also reveals that a cell abolishes the software-hardware barrier through genomic encoding for the constituents of the biochemical network, a cell's "hardware" equivalent to the computer central processing unit (CPU). The information loading (gene expression) process acts as a major determinant of the encoded constituent's abundance, which, in turn, often determines the "bandwidth" of a biochemical pathway. Cellular processes are implemented in biochemical pathways in parallel manners. In a computer, on the other hand, the software provides only instructions and data for the CPU. A process represents just sequentially ordered actions by the CPU and only virtual parallelism can be implemented through CPU time-sharing. Whereas process management in a computer may simply mean job scheduling, coordinating pathway bandwidth through the gene expression machinery represents a major process management scheme in a cell. In summary, a cell can be viewed as a super-parallel computer, which computes through controlled hardware composition. While we have, at best, a very fragmented understanding of cellular operation, we have a thorough understanding of the computer throughout the engineering process. The potential utilization of this knowledge to the benefit of systems biology is discussed.
Parallel-vector out-of-core equation solver for computational mechanics

NASA Technical Reports Server (NTRS)

Qin, J.; Agarwal, T. K.; Storaasli, O. O.; Nguyen, D. T.; Baddourah, M. A.

1993-01-01

A parallel/vector out-of-core equation solver is developed for shared-memory computers, such as the Cray Y-MP machine. The input/ output (I/O) time is reduced by using the a synchronous BUFFER IN and BUFFER OUT, which can be executed simultaneously with the CPU instructions. The parallel and vector capability provided by the supercomputers is also exploited to enhance the performance. Numerical applications in large-scale structural analysis are given to demonstrate the efficiency of the present out-of-core solver.
A Study on the Effectiveness of Lockup-Free Caches for a Reduced Instruction Set Computer (RISC) Processor

DTIC Science & Technology

1992-09-01

to acquire or develop effective simulation tools to observe the behavior of a RISC implementation as it executes different types of programs . We choose...Performance Computer performance is measured by the amount of the time required to execute a program . Performance encompasses two types of time, elapsed time...and CPU time. Elapsed time is the time required to execute a program from start to finish. It includes latency of input/output activities such as
Efficient irregular wavefront propagation algorithms on Intel® Xeon Phi™

PubMed Central

Gomes, Jeremias M.; Teodoro, George; de Melo, Alba; Kong, Jun; Kurc, Tahsin; Saltz, Joel H.

2016-01-01

We investigate the execution of the Irregular Wavefront Propagation Pattern (IWPP), a fundamental computing structure used in several image analysis operations, on the Intel® Xeon Phi™ co-processor. An efficient implementation of IWPP on the Xeon Phi is a challenging problem because of IWPP’s irregularity and the use of atomic instructions in the original IWPP algorithm to resolve race conditions. On the Xeon Phi, the use of SIMD and vectorization instructions is critical to attain high performance. However, SIMD atomic instructions are not supported. Therefore, we propose a new IWPP algorithm that can take advantage of the supported SIMD instruction set. We also evaluate an alternate storage container (priority queue) to track active elements in the wavefront in an effort to improve the parallel algorithm efficiency. The new IWPP algorithm is evaluated with Morphological Reconstruction and Imfill operations as use cases. Our results show performance improvements of up to 5.63× on top of the original IWPP due to vectorization. Moreover, the new IWPP achieves speedups of 45.7× and 1.62×, respectively, as compared to efficient CPU and GPU implementations. PMID:27298591
Efficient irregular wavefront propagation algorithms on Intel® Xeon Phi™.

PubMed

Gomes, Jeremias M; Teodoro, George; de Melo, Alba; Kong, Jun; Kurc, Tahsin; Saltz, Joel H

2015-10-01

We investigate the execution of the Irregular Wavefront Propagation Pattern (IWPP), a fundamental computing structure used in several image analysis operations, on the Intel ® Xeon Phi ™ co-processor. An efficient implementation of IWPP on the Xeon Phi is a challenging problem because of IWPP's irregularity and the use of atomic instructions in the original IWPP algorithm to resolve race conditions. On the Xeon Phi, the use of SIMD and vectorization instructions is critical to attain high performance. However, SIMD atomic instructions are not supported. Therefore, we propose a new IWPP algorithm that can take advantage of the supported SIMD instruction set. We also evaluate an alternate storage container (priority queue) to track active elements in the wavefront in an effort to improve the parallel algorithm efficiency. The new IWPP algorithm is evaluated with Morphological Reconstruction and Imfill operations as use cases. Our results show performance improvements of up to 5.63 × on top of the original IWPP due to vectorization. Moreover, the new IWPP achieves speedups of 45.7 × and 1.62 × , respectively, as compared to efficient CPU and GPU implementations.
CUDA-based high-performance computing of the S-BPF algorithm with no-waiting pipelining

NASA Astrophysics Data System (ADS)

Deng, Lin; Yan, Bin; Chang, Qingmei; Han, Yu; Zhang, Xiang; Xi, Xiaoqi; Li, Lei

2015-10-01

The backprojection-filtration (BPF) algorithm has become a good solution for local reconstruction in cone-beam computed tomography (CBCT). However, the reconstruction speed of BPF is a severe limitation for clinical applications. The selective-backprojection filtration (S-BPF) algorithm is developed to improve the parallel performance of BPF by selective backprojection. Furthermore, the general-purpose graphics processing unit (GP-GPU) is a popular tool for accelerating the reconstruction. Much work has been performed aiming for the optimization of the cone-beam back-projection. As the cone-beam back-projection process becomes faster, the data transportation holds a much bigger time proportion in the reconstruction than before. This paper focuses on minimizing the total time in the reconstruction with the S-BPF algorithm by hiding the data transportation among hard disk, CPU and GPU. And based on the analysis of the S-BPF algorithm, some strategies are implemented: (1) the asynchronous calls are used to overlap the implemention of CPU and GPU, (2) an innovative strategy is applied to obtain the DBP image to hide the transport time effectively, (3) two streams for data transportation and calculation are synchronized by the cudaEvent in the inverse of finite Hilbert transform on GPU. Our main contribution is a smart reconstruction of the S-BPF algorithm with GPU's continuous calculation and no data transportation time cost. a 5123 volume is reconstructed in less than 0.7 second on a single Tesla-based K20 GPU from 182 views projection with 5122 pixel per projection. The time cost of our implementation is about a half of that without the overlap behavior.
GENIE: a software package for gene-gene interaction analysis in genetic association studies using multiple GPU or CPU cores.

PubMed

Chikkagoudar, Satish; Wang, Kai; Li, Mingyao

2011-05-26

Gene-gene interaction in genetic association studies is computationally intensive when a large number of SNPs are involved. Most of the latest Central Processing Units (CPUs) have multiple cores, whereas Graphics Processing Units (GPUs) also have hundreds of cores and have been recently used to implement faster scientific software. However, currently there are no genetic analysis software packages that allow users to fully utilize the computing power of these multi-core devices for genetic interaction analysis for binary traits. Here we present a novel software package GENIE, which utilizes the power of multiple GPU or CPU processor cores to parallelize the interaction analysis. GENIE reads an entire genetic association study dataset into memory and partitions the dataset into fragments with non-overlapping sets of SNPs. For each fragment, GENIE analyzes: 1) the interaction of SNPs within it in parallel, and 2) the interaction between the SNPs of the current fragment and other fragments in parallel. We tested GENIE on a large-scale candidate gene study on high-density lipoprotein cholesterol. Using an NVIDIA Tesla C1060 graphics card, the GPU mode of GENIE achieves a speedup of 27 times over its single-core CPU mode run. GENIE is open-source, economical, user-friendly, and scalable. Since the computing power and memory capacity of graphics cards are increasing rapidly while their cost is going down, we anticipate that GENIE will achieve greater speedups with faster GPU cards. Documentation, source code, and precompiled binaries can be downloaded from http://www.cceb.upenn.edu/~mli/software/GENIE/.
Testing and Validating Gadget2 for GPUs

NASA Astrophysics Data System (ADS)

Wibking, Benjamin; Holley-Bockelmann, K.; Berlind, A. A.

2013-01-01

We are currently upgrading a version of Gadget2 (Springel et al., 2005) that is optimized for NVIDIA's CUDA GPU architecture (Frigaard, unpublished) to work with the latest libraries and graphics cards. Preliminary tests of its performance indicate a ~40x speedup in the particle force tree approximation calculation, with overall speedup of 5-10x for cosmological simulations run with GPUs compared to running on the same CPU cores without GPU acceleration. We believe this speedup can be reasonably increased by an additional factor of two with futher optimization, including overlap of computation on CPU and GPU. Tests of single-precision GPU numerical fidelity currently indicate accuracy of the mass function and the spectral power density to within a few percent of extended-precision CPU results with the unmodified form of Gadget. Additionally, we plan to test and optimize the GPU code for Millenium-scale "grand challenge" simulations of >10^9 particles, a scale that has been previously untested with this code, with the aid of the NSF XSEDE flagship GPU-based supercomputing cluster codenamed "Keeneland." Current work involves additional validation of numerical results, extending the numerical precision of the GPU calculations to double precision, and evaluating performance/accuracy tradeoffs. We believe that this project, if successful, will yield substantial computational performance benefits to the N-body research community as the next generation of GPU supercomputing resources becomes available, both increasing the electrical power efficiency of ever-larger computations (making simulations possible a decade from now at scales and resolutions unavailable today) and accelerating the pace of research in the field.
GENIE: a software package for gene-gene interaction analysis in genetic association studies using multiple GPU or CPU cores

PubMed Central

2011-01-01

Background Gene-gene interaction in genetic association studies is computationally intensive when a large number of SNPs are involved. Most of the latest Central Processing Units (CPUs) have multiple cores, whereas Graphics Processing Units (GPUs) also have hundreds of cores and have been recently used to implement faster scientific software. However, currently there are no genetic analysis software packages that allow users to fully utilize the computing power of these multi-core devices for genetic interaction analysis for binary traits. Findings Here we present a novel software package GENIE, which utilizes the power of multiple GPU or CPU processor cores to parallelize the interaction analysis. GENIE reads an entire genetic association study dataset into memory and partitions the dataset into fragments with non-overlapping sets of SNPs. For each fragment, GENIE analyzes: 1) the interaction of SNPs within it in parallel, and 2) the interaction between the SNPs of the current fragment and other fragments in parallel. We tested GENIE on a large-scale candidate gene study on high-density lipoprotein cholesterol. Using an NVIDIA Tesla C1060 graphics card, the GPU mode of GENIE achieves a speedup of 27 times over its single-core CPU mode run. Conclusions GENIE is open-source, economical, user-friendly, and scalable. Since the computing power and memory capacity of graphics cards are increasing rapidly while their cost is going down, we anticipate that GENIE will achieve greater speedups with faster GPU cards. Documentation, source code, and precompiled binaries can be downloaded from http://www.cceb.upenn.edu/~mli/software/GENIE/. PMID:21615923
Symptoms of problematic cellular phone use, functional impairment and its association with depression among adolescents in Southern Taiwan.

PubMed

Yen, Cheng-Fang; Tang, Tze-Chun; Yen, Ju-Yu; Lin, Huang-Chi; Huang, Chi-Fen; Liu, Shu-Chun; Ko, Chih-Hung

2009-08-01

The aims of this study were: (1) to examine the prevalence of symptoms of problematic cellular phone use (CPU); (2) to examine the associations between the symptoms of problematic CPU, functional impairment caused by CPU and the characteristics of CPU; (3) to establish the optimal cut-off point of the number of symptoms for functional impairment caused by CPU; and (4) to examine the association between problematic CPU and depression in adolescents. A total of 10,191 adolescent students in Southern Taiwan were recruited into this study. Participants' self-reported symptoms of problematic CPU and functional impairments caused by CPU were collected. The associations of symptoms of problematic CPU with functional impairments and with the characteristics of CPU were examined. The cut-off point of the number of symptoms for functional impairment was also determined. The association between problematic CPU and depression was examined by logistic regression analysis. The results indicated that the symptoms of problematic CPU were prevalent in adolescents. The adolescents who had any one of the symptoms of problematic CPU were more likely to report at least one dimension of functional impairment caused by CPU, called more on cellular phones, sent more text messages, or spent more time and higher fees on CPU. Having four or more symptoms of problematic CPU had the highest potential to differentiate between the adolescents with and without functional impairment caused by CPU. Adolescents who had significant depression were more likely to have four or more symptoms of problematic CPU. The results of this study may provide a basis for detecting symptoms of problematic CPU in adolescents.
Predicting the integration of overlapping memories by decoding mnemonic processing states during learning

PubMed Central

Richter, Franziska R.; Chanales, Avi J. H.; Kuhl, Brice A.

2015-01-01

The hippocampal memory system is thought to alternate between two opposing processing states: encoding and retrieval. When present experience overlaps with past experience, this creates a potential tradeoff between encoding the present and retrieving the past. This tradeoff may be resolved by memory integration—that is, by forming a mnemonic representation that links present experience with overlapping past experience. Here, we used fMRI decoding analyses to predict when—and establish how—past and present experiences become integrated in memory. In an initial experiment, we alternately instructed subjects to adopt encoding, retrieval or integration states during overlapping learning. We then trained across-subject pattern classifiers to ‘read out’ the instructed processing states from fMRI activity patterns. We show that an integration state was clearly dissociable from encoding or retrieval states. Moreover, trial-by-trial fluctuations in decoded evidence for an integration state during learning reliably predicted behavioral expressions of successful memory integration. Strikingly, the decoding algorithm also successfully predicted specific instances of spontaneous memory integration in an entirely independent sample of subjects for whom processing state instructions were not administered. Finally, we show that medial prefrontal cortex and hippocampus differentially contribute to encoding, retrieval, and integration states: whereas hippocampus signals the tradeoff between encoding vs. retrieval states, medial prefrontal cortex actively represents past experience in relation to new learning. PMID:26327243
A scalable approach to solving dense linear algebra problems on hybrid CPU-GPU systems

DOE PAGES

Song, Fengguang; Dongarra, Jack

2014-10-01

Aiming to fully exploit the computing power of all CPUs and all graphics processing units (GPUs) on hybrid CPU-GPU systems to solve dense linear algebra problems, in this paper we design a class of heterogeneous tile algorithms to maximize the degree of parallelism, to minimize the communication volume, and to accommodate the heterogeneity between CPUs and GPUs. The new heterogeneous tile algorithms are executed upon our decentralized dynamic scheduling runtime system, which schedules a task graph dynamically and transfers data between compute nodes automatically. The runtime system uses a new distributed task assignment protocol to solve data dependencies between tasksmore » without any coordination between processing units. By overlapping computation and communication through dynamic scheduling, we are able to attain scalable performance for the double-precision Cholesky factorization and QR factorization. Finally, our approach demonstrates a performance comparable to Intel MKL on shared-memory multicore systems and better performance than both vendor (e.g., Intel MKL) and open source libraries (e.g., StarPU) in the following three environments: heterogeneous clusters with GPUs, conventional clusters without GPUs, and shared-memory systems with multiple GPUs.« less
A scalable approach to solving dense linear algebra problems on hybrid CPU-GPU systems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Song, Fengguang; Dongarra, Jack

Aiming to fully exploit the computing power of all CPUs and all graphics processing units (GPUs) on hybrid CPU-GPU systems to solve dense linear algebra problems, in this paper we design a class of heterogeneous tile algorithms to maximize the degree of parallelism, to minimize the communication volume, and to accommodate the heterogeneity between CPUs and GPUs. The new heterogeneous tile algorithms are executed upon our decentralized dynamic scheduling runtime system, which schedules a task graph dynamically and transfers data between compute nodes automatically. The runtime system uses a new distributed task assignment protocol to solve data dependencies between tasksmore » without any coordination between processing units. By overlapping computation and communication through dynamic scheduling, we are able to attain scalable performance for the double-precision Cholesky factorization and QR factorization. Finally, our approach demonstrates a performance comparable to Intel MKL on shared-memory multicore systems and better performance than both vendor (e.g., Intel MKL) and open source libraries (e.g., StarPU) in the following three environments: heterogeneous clusters with GPUs, conventional clusters without GPUs, and shared-memory systems with multiple GPUs.« less
Multitasking OS manages a team of processors

DOE Office of Scientific and Technical Information (OSTI.GOV)

Ripps, D.L.

1983-07-21

MTOS-68k is a real-time multitasking operating system designed for the popular MC68000 microprocessors. It aproaches task coordination and synchronization in a fashion that matches uniquely the structural simplicity and regularity of the 68000 instruction set. Since in many 68000 applications the speed and power of one CPU are not enough, MTOS-68k has been designed to support multiple processors, as well as multiple tasks. Typically, the devices are tightly coupled single-board computers, that is they share a backplane and parts of global memory.
Pairwise Sequence Alignment Library

DOE Office of Scientific and Technical Information (OSTI.GOV)

Jeff Daily, PNNL

2015-05-20

Vector extensions, such as SSE, have been part of the x86 CPU since the 1990s, with applications in graphics, signal processing, and scientific applications. Although many algorithms and applications can naturally benefit from automatic vectorization techniques, there are still many that are difficult to vectorize due to their dependence on irregular data structures, dense branch operations, or data dependencies. Sequence alignment, one of the most widely used operations in bioinformatics workflows, has a computational footprint that features complex data dependencies. The trend of widening vector registers adversely affects the state-of-the-art sequence alignment algorithm based on striped data layouts. Therefore, amore » novel SIMD implementation of a parallel scan-based sequence alignment algorithm that can better exploit wider SIMD units was implemented as part of the Parallel Sequence Alignment Library (parasail). Parasail features: Reference implementations of all known vectorized sequence alignment approaches. Implementations of Smith Waterman (SW), semi-global (SG), and Needleman Wunsch (NW) sequence alignment algorithms. Implementations across all modern CPU instruction sets including AVX2 and KNC. Language interfaces for C/C++ and Python.« less
Design and implementation of a reconfigurable mixed-signal SoC based on field programmable analog arrays

NASA Astrophysics Data System (ADS)

Liu, Lintao; Gao, Yuhan; Deng, Jun

2017-11-01

This work presents a reconfigurable mixed-signal system-on-chip (SoC), which integrates switched-capacitor-based field programmable analog arrays (FPAA), analog-to-digital converter (ADC), digital-to-analog converter, digital down converter , digital up converter, 32-bit reduced instruction-set computer central processing unit (CPU) and other digital IPs on a single chip with 0.18 μm CMOS technology. The FPAA intellectual property could be reconfigured as different function circuits, such as gain amplifier, divider, sine generator, and so on. This single-chip integrated mixed-signal system is a complete modern signal processing system, occupying a die area of 7 × 8 mm 2 and consuming 719 mW with a clock frequency of 150 MHz for CPU and 200 MHz for ADC/DAC. This SoC chip can help customers to shorten design cycles, save board area, reduce the system power consumption and depress the system integration risk, which would afford a big prospect of application for wireless communication. Project supported by the National High Technology and Development Program of China (No. 2012AA012303).
Configuring compute nodes of a parallel computer in an operational group into a plurality of independent non-overlapping collective networks

DOEpatents

Archer, Charles J.; Inglett, Todd A.; Ratterman, Joseph D.; Smith, Brian E.

2010-03-02

Methods, apparatus, and products are disclosed for configuring compute nodes of a parallel computer in an operational group into a plurality of independent non-overlapping collective networks, the compute nodes in the operational group connected together for data communications through a global combining network, that include: partitioning the compute nodes in the operational group into a plurality of non-overlapping subgroups; designating one compute node from each of the non-overlapping subgroups as a master node; and assigning, to the compute nodes in each of the non-overlapping subgroups, class routing instructions that organize the compute nodes in that non-overlapping subgroup as a collective network such that the master node is a physical root.
Instructional Television: Do We Have the Courage to Succeed?

ERIC Educational Resources Information Center

Bosner, Paul

1976-01-01

A discussion of instructional television's implementation in terms of massiveness of the system; the decentralization of the system; and the fragmented, overlapping funding structure of the system, which results in obscured centers of power resting somewhere between state and local control, with subtle direction by the federal government.…
Collaborating CPU and GPU for large-scale high-order CFD simulations with complex grids on the TianHe-1A supercomputer

DOE Office of Scientific and Technical Information (OSTI.GOV)

Xu, Chuanfu, E-mail: xuchuanfu@nudt.edu.cn; Deng, Xiaogang; Zhang, Lilun

Programming and optimizing complex, real-world CFD codes on current many-core accelerated HPC systems is very challenging, especially when collaborating CPUs and accelerators to fully tap the potential of heterogeneous systems. In this paper, with a tri-level hybrid and heterogeneous programming model using MPI + OpenMP + CUDA, we port and optimize our high-order multi-block structured CFD software HOSTA on the GPU-accelerated TianHe-1A supercomputer. HOSTA adopts two self-developed high-order compact definite difference schemes WCNS and HDCS that can simulate flows with complex geometries. We present a dual-level parallelization scheme for efficient multi-block computation on GPUs and perform particular kernel optimizations formore » high-order CFD schemes. The GPU-only approach achieves a speedup of about 1.3 when comparing one Tesla M2050 GPU with two Xeon X5670 CPUs. To achieve a greater speedup, we collaborate CPU and GPU for HOSTA instead of using a naive GPU-only approach. We present a novel scheme to balance the loads between the store-poor GPU and the store-rich CPU. Taking CPU and GPU load balance into account, we improve the maximum simulation problem size per TianHe-1A node for HOSTA by 2.3×, meanwhile the collaborative approach can improve the performance by around 45% compared to the GPU-only approach. Further, to scale HOSTA on TianHe-1A, we propose a gather/scatter optimization to minimize PCI-e data transfer times for ghost and singularity data of 3D grid blocks, and overlap the collaborative computation and communication as far as possible using some advanced CUDA and MPI features. Scalability tests show that HOSTA can achieve a parallel efficiency of above 60% on 1024 TianHe-1A nodes. With our method, we have successfully simulated an EET high-lift airfoil configuration containing 800M cells and China's large civil airplane configuration containing 150M cells. To our best knowledge, those are the largest-scale CPU–GPU collaborative simulations that solve realistic CFD problems with both complex configurations and high-order schemes.« less
Cultivating Computational Thinking Practices and Mathematical Habits of Mind in Lattice Land

ERIC Educational Resources Information Center

Pei, Christina; Weintrop, David; Wilensky, Uri

2018-01-01

There is a great deal of overlap between the set of practices collected under the term "computational thinking" and the mathematical habits of mind that are the focus of much mathematics instruction. Despite this overlap, the links between these two desirable educational outcomes are rarely made explicit, either in classrooms or in the…

How Leadership Content Knowledge in Writing Influences Leadership Practice in Elementary Schools

ERIC Educational Resources Information Center

Olsen, Heather Stuart

2010-01-01

In an era of increased accountability mandates, school leaders face daunting challenges to improve instruction. Despite the vast research on instructional leadership, little is known about how principals improve teaching and learning in the subject of writing. Leadership content knowledge is the overlap of knowledge of subject matter and…
Prioritizing Elementary School Writing Instruction: Cultivating Middle School Readiness for Students with Learning Disabilities

ERIC Educational Resources Information Center

Ciullo, Stephen; Mason, Linda

2017-01-01

Helping elementary students with learning disabilities (LD) prepare for the rigor of middle school writing is an instructional priority. Fortunately, several standards-based skills in upper elementary school and middle school overlap. Teachers in upper elementary grades, specifically fourth and fifth grades, have the opportunity to provide…
Generating Performance Models for Irregular Applications

DOE Office of Scientific and Technical Information (OSTI.GOV)

Friese, Ryan D.; Tallent, Nathan R.; Vishnu, Abhinav

2017-05-30

Many applications have irregular behavior --- non-uniform input data, input-dependent solvers, irregular memory accesses, unbiased branches --- that cannot be captured using today's automated performance modeling techniques. We describe new hierarchical critical path analyses for the \\Palm model generation tool. To create a model's structure, we capture tasks along representative MPI critical paths. We create a histogram of critical tasks with parameterized task arguments and instance counts. To model each task, we identify hot instruction-level sub-paths and model each sub-path based on data flow, instruction scheduling, and data locality. We describe application models that generate accurate predictions for strong scalingmore » when varying CPU speed, cache speed, memory speed, and architecture. We present results for the Sweep3D neutron transport benchmark; Page Rank on multiple graphs; Support Vector Machine with pruning; and PFLOTRAN's reactive flow/transport solver with domain-induced load imbalance.« less
A FAST ITERATIVE METHOD FOR SOLVING THE EIKONAL EQUATION ON TRIANGULATED SURFACES*

PubMed Central

Fu, Zhisong; Jeong, Won-Ki; Pan, Yongsheng; Kirby, Robert M.; Whitaker, Ross T.

2012-01-01

This paper presents an efficient, fine-grained parallel algorithm for solving the Eikonal equation on triangular meshes. The Eikonal equation, and the broader class of Hamilton–Jacobi equations to which it belongs, have a wide range of applications from geometric optics and seismology to biological modeling and analysis of geometry and images. The ability to solve such equations accurately and efficiently provides new capabilities for exploring and visualizing parameter spaces and for solving inverse problems that rely on such equations in the forward model. Efficient solvers on state-of-the-art, parallel architectures require new algorithms that are not, in many cases, optimal, but are better suited to synchronous updates of the solution. In previous work [W. K. Jeong and R. T. Whitaker, SIAM J. Sci. Comput., 30 (2008), pp. 2512–2534], the authors proposed the fast iterative method (FIM) to efficiently solve the Eikonal equation on regular grids. In this paper we extend the fast iterative method to solve Eikonal equations efficiently on triangulated domains on the CPU and on parallel architectures, including graphics processors. We propose a new local update scheme that provides solutions of first-order accuracy for both architectures. We also propose a novel triangle-based update scheme and its corresponding data structure for efficient irregular data mapping to parallel single-instruction multiple-data (SIMD) processors. We provide detailed descriptions of the implementations on a single CPU, a multicore CPU with shared memory, and SIMD architectures with comparative results against state-of-the-art Eikonal solvers. PMID:22641200
A Big RISC

DTIC Science & Technology

1983-07-18

architecture . Design , performance, and cost of BRISC is presented. Performance is shown to be better than high end mainframes such as the IBM 3081 and Amdahl 470V/8 on integer benchmarks written in C, Pascal and LISP. The cost, conservatively estimated to be $132,400 is about the same as a high end minicomputer such as the VAX-11/780. BRISC has a CPU cycle time of 46 ns, providing a RISC I instruction execution rate of greater than 15 MIPs. BRISC is designed with a Structured Computer Aided Logic Design System (SCALD) by Valid Logic Systems. An evaluation of the utility of
cuBLASTP: Fine-Grained Parallelization of Protein Sequence Search on CPU+GPU.

PubMed

Zhang, Jing; Wang, Hao; Feng, Wu-Chun

2017-01-01

BLAST, short for Basic Local Alignment Search Tool, is a ubiquitous tool used in the life sciences for pairwise sequence search. However, with the advent of next-generation sequencing (NGS), whether at the outset or downstream from NGS, the exponential growth of sequence databases is outstripping our ability to analyze the data. While recent studies have utilized the graphics processing unit (GPU) to speedup the BLAST algorithm for searching protein sequences (i.e., BLASTP), these studies use coarse-grained parallelism, where one sequence alignment is mapped to only one thread. Such an approach does not efficiently utilize the capabilities of a GPU, particularly due to the irregularity of BLASTP in both execution paths and memory-access patterns. To address the above shortcomings, we present a fine-grained approach to parallelize BLASTP, where each individual phase of sequence search is mapped to many threads on a GPU. This approach, which we refer to as cuBLASTP, reorders data-access patterns and reduces divergent branches of the most time-consuming phases (i.e., hit detection and ungapped extension). In addition, cuBLASTP optimizes the remaining phases (i.e., gapped extension and alignment with trace back) on a multicore CPU and overlaps their execution with the phases running on the GPU.
PIC codes for plasma accelerators on emerging computer architectures (GPUS, Multicore/Manycore CPUS)

NASA Astrophysics Data System (ADS)

Vincenti, Henri

2016-03-01

The advent of exascale computers will enable 3D simulations of a new laser-plasma interaction regimes that were previously out of reach of current Petasale computers. However, the paradigm used to write current PIC codes will have to change in order to fully exploit the potentialities of these new computing architectures. Indeed, achieving Exascale computing facilities in the next decade will be a great challenge in terms of energy consumption and will imply hardware developments directly impacting our way of implementing PIC codes. As data movement (from die to network) is by far the most energy consuming part of an algorithm future computers will tend to increase memory locality at the hardware level and reduce energy consumption related to data movement by using more and more cores on each compute nodes (''fat nodes'') that will have a reduced clock speed to allow for efficient cooling. To compensate for frequency decrease, CPU machine vendors are making use of long SIMD instruction registers that are able to process multiple data with one arithmetic operator in one clock cycle. SIMD register length is expected to double every four years. GPU's also have a reduced clock speed per core and can process Multiple Instructions on Multiple Datas (MIMD). At the software level Particle-In-Cell (PIC) codes will thus have to achieve both good memory locality and vectorization (for Multicore/Manycore CPU) to fully take advantage of these upcoming architectures. In this talk, we present the portable solutions we implemented in our high performance skeleton PIC code PICSAR to both achieve good memory locality and cache reuse as well as good vectorization on SIMD architectures. We also present the portable solutions used to parallelize the Pseudo-sepctral quasi-cylindrical code FBPIC on GPUs using the Numba python compiler.
Parallel hyperbolic PDE simulation on clusters: Cell versus GPU

NASA Astrophysics Data System (ADS)

Rostrup, Scott; De Sterck, Hans

2010-12-01

Increasingly, high-performance computing is looking towards data-parallel computational devices to enhance computational performance. Two technologies that have received significant attention are IBM's Cell Processor and NVIDIA's CUDA programming model for graphics processing unit (GPU) computing. In this paper we investigate the acceleration of parallel hyperbolic partial differential equation simulation on structured grids with explicit time integration on clusters with Cell and GPU backends. The message passing interface (MPI) is used for communication between nodes at the coarsest level of parallelism. Optimizations of the simulation code at the several finer levels of parallelism that the data-parallel devices provide are described in terms of data layout, data flow and data-parallel instructions. Optimized Cell and GPU performance are compared with reference code performance on a single x86 central processing unit (CPU) core in single and double precision. We further compare the CPU, Cell and GPU platforms on a chip-to-chip basis, and compare performance on single cluster nodes with two CPUs, two Cell processors or two GPUs in a shared memory configuration (without MPI). We finally compare performance on clusters with 32 CPUs, 32 Cell processors, and 32 GPUs using MPI. Our GPU cluster results use NVIDIA Tesla GPUs with GT200 architecture, but some preliminary results on recently introduced NVIDIA GPUs with the next-generation Fermi architecture are also included. This paper provides computational scientists and engineers who are considering porting their codes to accelerator environments with insight into how structured grid based explicit algorithms can be optimized for clusters with Cell and GPU accelerators. It also provides insight into the speed-up that may be gained on current and future accelerator architectures for this class of applications. Program summaryProgram title: SWsolver Catalogue identifier: AEGY_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEGY_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: GPL v3 No. of lines in distributed program, including test data, etc.: 59 168 No. of bytes in distributed program, including test data, etc.: 453 409 Distribution format: tar.gz Programming language: C, CUDA Computer: Parallel Computing Clusters. Individual compute nodes may consist of x86 CPU, Cell processor, or x86 CPU with attached NVIDIA GPU accelerator. Operating system: Linux Has the code been vectorised or parallelized?: Yes. Tested on 1-128 x86 CPU cores, 1-32 Cell Processors, and 1-32 NVIDIA GPUs. RAM: Tested on Problems requiring up to 4 GB per compute node. Classification: 12 External routines: MPI, CUDA, IBM Cell SDK Nature of problem: MPI-parallel simulation of Shallow Water equations using high-resolution 2D hyperbolic equation solver on regular Cartesian grids for x86 CPU, Cell Processor, and NVIDIA GPU using CUDA. Solution method: SWsolver provides 3 implementations of a high-resolution 2D Shallow Water equation solver on regular Cartesian grids, for CPU, Cell Processor, and NVIDIA GPU. Each implementation uses MPI to divide work across a parallel computing cluster. Additional comments: Sub-program numdiff is used for the test run.
fast_protein_cluster: parallel and optimized clustering of large-scale protein modeling data.

PubMed

Hung, Ling-Hong; Samudrala, Ram

2014-06-15

fast_protein_cluster is a fast, parallel and memory efficient package used to cluster 60 000 sets of protein models (with up to 550 000 models per set) generated by the Nutritious Rice for the World project. fast_protein_cluster is an optimized and extensible toolkit that supports Root Mean Square Deviation after optimal superposition (RMSD) and Template Modeling score (TM-score) as metrics. RMSD calculations using a laptop CPU are 60× faster than qcprot and 3× faster than current graphics processing unit (GPU) implementations. New GPU code further increases the speed of RMSD and TM-score calculations. fast_protein_cluster provides novel k-means and hierarchical clustering methods that are up to 250× and 2000× faster, respectively, than Clusco, and identify significantly more accurate models than Spicker and Clusco. fast_protein_cluster is written in C++ using OpenMP for multi-threading support. Custom streaming Single Instruction Multiple Data (SIMD) extensions and advanced vector extension intrinsics code accelerate CPU calculations, and OpenCL kernels support AMD and Nvidia GPUs. fast_protein_cluster is available under the M.I.T. license. (http://software.compbio.washington.edu/fast_protein_cluster) © The Author 2014. Published by Oxford University Press.
Best bang for your buck: GPU nodes for GROMACS biomolecular simulations

PubMed Central

Páll, Szilárd; Fechner, Martin; Esztermann, Ansgar; de Groot, Bert L.; Grubmüller, Helmut

2015-01-01

The molecular dynamics simulation package GROMACS runs efficiently on a wide variety of hardware from commodity workstations to high performance computing clusters. Hardware features are well‐exploited with a combination of single instruction multiple data, multithreading, and message passing interface (MPI)‐based single program multiple data/multiple program multiple data parallelism while graphics processing units (GPUs) can be used as accelerators to compute interactions off‐loaded from the CPU. Here, we evaluate which hardware produces trajectories with GROMACS 4.6 or 5.0 in the most economical way. We have assembled and benchmarked compute nodes with various CPU/GPU combinations to identify optimal compositions in terms of raw trajectory production rate, performance‐to‐price ratio, energy efficiency, and several other criteria. Although hardware prices are naturally subject to trends and fluctuations, general tendencies are clearly visible. Adding any type of GPU significantly boosts a node's simulation performance. For inexpensive consumer‐class GPUs this improvement equally reflects in the performance‐to‐price ratio. Although memory issues in consumer‐class GPUs could pass unnoticed as these cards do not support error checking and correction memory, unreliable GPUs can be sorted out with memory checking tools. Apart from the obvious determinants for cost‐efficiency like hardware expenses and raw performance, the energy consumption of a node is a major cost factor. Over the typical hardware lifetime until replacement of a few years, the costs for electrical power and cooling can become larger than the costs of the hardware itself. Taking that into account, nodes with a well‐balanced ratio of CPU and consumer‐class GPU resources produce the maximum amount of GROMACS trajectory over their lifetime. © 2015 The Authors. Journal of Computational Chemistry Published by Wiley Periodicals, Inc. PMID:26238484
Best bang for your buck: GPU nodes for GROMACS biomolecular simulations.

PubMed

Kutzner, Carsten; Páll, Szilárd; Fechner, Martin; Esztermann, Ansgar; de Groot, Bert L; Grubmüller, Helmut

2015-10-05

The molecular dynamics simulation package GROMACS runs efficiently on a wide variety of hardware from commodity workstations to high performance computing clusters. Hardware features are well-exploited with a combination of single instruction multiple data, multithreading, and message passing interface (MPI)-based single program multiple data/multiple program multiple data parallelism while graphics processing units (GPUs) can be used as accelerators to compute interactions off-loaded from the CPU. Here, we evaluate which hardware produces trajectories with GROMACS 4.6 or 5.0 in the most economical way. We have assembled and benchmarked compute nodes with various CPU/GPU combinations to identify optimal compositions in terms of raw trajectory production rate, performance-to-price ratio, energy efficiency, and several other criteria. Although hardware prices are naturally subject to trends and fluctuations, general tendencies are clearly visible. Adding any type of GPU significantly boosts a node's simulation performance. For inexpensive consumer-class GPUs this improvement equally reflects in the performance-to-price ratio. Although memory issues in consumer-class GPUs could pass unnoticed as these cards do not support error checking and correction memory, unreliable GPUs can be sorted out with memory checking tools. Apart from the obvious determinants for cost-efficiency like hardware expenses and raw performance, the energy consumption of a node is a major cost factor. Over the typical hardware lifetime until replacement of a few years, the costs for electrical power and cooling can become larger than the costs of the hardware itself. Taking that into account, nodes with a well-balanced ratio of CPU and consumer-class GPU resources produce the maximum amount of GROMACS trajectory over their lifetime. © 2015 The Authors. Journal of Computational Chemistry Published by Wiley Periodicals, Inc.
Coding for parallel execution of hardware-in-the-loop millimeter-wave scene generation models on multicore SIMD processor architectures

NASA Astrophysics Data System (ADS)

Olson, Richard F.

2013-05-01

Rendering of point scatterer based radar scenes for millimeter wave (mmW) seeker tests in real-time hardware-in-the-loop (HWIL) scene generation requires efficient algorithms and vector-friendly computer architectures for complex signal synthesis. New processor technology from Intel implements an extended 256-bit vector SIMD instruction set (AVX, AVX2) in a multi-core CPU design providing peak execution rates of hundreds of GigaFLOPS (GFLOPS) on one chip. Real world mmW scene generation code can approach peak SIMD execution rates only after careful algorithm and source code design. An effective software design will maintain high computing intensity emphasizing register-to-register SIMD arithmetic operations over data movement between CPU caches or off-chip memories. Engineers at the U.S. Army Aviation and Missile Research, Development and Engineering Center (AMRDEC) applied two basic parallel coding methods to assess new 256-bit SIMD multi-core architectures for mmW scene generation in HWIL. These include use of POSIX threads built on vector library functions and more portable, highlevel parallel code based on compiler technology (e.g. OpenMP pragmas and SIMD autovectorization). Since CPU technology is rapidly advancing toward high processor core counts and TeraFLOPS peak SIMD execution rates, it is imperative that coding methods be identified which produce efficient and maintainable parallel code. This paper describes the algorithms used in point scatterer target model rendering, the parallelization of those algorithms, and the execution performance achieved on an AVX multi-core machine using the two basic parallel coding methods. The paper concludes with estimates for scale-up performance on upcoming multi-core technology.
A new robust algorithm for computation of a triangle circumscribed sphere in E3 and a hypersphere simplex

NASA Astrophysics Data System (ADS)

Skala, Vaclav

2016-06-01

There are many applications in which a bounding sphere containing the given triangle E3 is needed, e.g. fast collision detection, ray-triangle intersecting in raytracing etc. This is a typical geometrical problem in E3 and it has also applications in computational problems in general. In this paper a new fast and robust algorithm of circumscribed sphere computation in the n-dimensional space is presented and specification for the E3 space is given, too. The presented method is convenient for use on GPU or with SSE or Intel's AVX instructions on a standard CPU.
Interaction sorting method for molecular dynamics on multi-core SIMD CPU architecture.

PubMed

Matvienko, Sergey; Alemasov, Nikolay; Fomin, Eduard

2015-02-01

Molecular dynamics (MD) is widely used in computational biology for studying binding mechanisms of molecules, molecular transport, conformational transitions, protein folding, etc. The method is computationally expensive; thus, the demand for the development of novel, much more efficient algorithms is still high. Therefore, the new algorithm designed in 2007 and called interaction sorting (IS) clearly attracted interest, as it outperformed the most efficient MD algorithms. In this work, a new IS modification is proposed which allows the algorithm to utilize SIMD processor instructions. This paper shows that the improvement provides an additional gain in performance, 9% to 45% in comparison to the original IS method.
Understanding Evolutionary Potential in Virtual CPU Instruction Set Architectures

PubMed Central

Bryson, David M.; Ofria, Charles

2013-01-01

We investigate fundamental decisions in the design of instruction set architectures for linear genetic programs that are used as both model systems in evolutionary biology and underlying solution representations in evolutionary computation. We subjected digital organisms with each tested architecture to seven different computational environments designed to present a range of evolutionary challenges. Our goal was to engineer a general purpose architecture that would be effective under a broad range of evolutionary conditions. We evaluated six different types of architectural features for the virtual CPUs: (1) genetic flexibility: we allowed digital organisms to more precisely modify the function of genetic instructions, (2) memory: we provided an increased number of registers in the virtual CPUs, (3) decoupled sensors and actuators: we separated input and output operations to enable greater control over data flow. We also tested a variety of methods to regulate expression: (4) explicit labels that allow programs to dynamically refer to specific genome positions, (5) position-relative search instructions, and (6) multiple new flow control instructions, including conditionals and jumps. Each of these features also adds complication to the instruction set and risks slowing evolution due to epistatic interactions. Two features (multiple argument specification and separated I/O) demonstrated substantial improvements in the majority of test environments, along with versions of each of the remaining architecture modifications that show significant improvements in multiple environments. However, some tested modifications were detrimental, though most exhibit no systematic effects on evolutionary potential, highlighting the robustness of digital evolution. Combined, these observations enhance our understanding of how instruction architecture impacts evolutionary potential, enabling the creation of architectures that support more rapid evolution of complex solutions to a broad range of challenges. PMID:24376669
Association between problematic cellular phone use and suicide: the moderating effect of family function and depression.

PubMed

Wang, Peng-Wei; Liu, Tai-Ling; Ko, Chih-Hung; Lin, Huang-Chi; Huang, Mei-Feng; Yeh, Yi-Chun; Yen, Cheng-Fang

2014-02-01

Suicidal ideation and attempt among adolescents are risk factors for eventual completed suicide. Cellular phone use (CPU) has markedly changed the everyday lives of adolescents. Issues about how cellular phone use relates to adolescent mental health, such as suicidal ideation and attempts, are important because of the high rate of cellular phone usage among children in that age group. This study explored the association between problematic CPU and suicidal ideation and attempts among adolescents and investigated how family function and depression influence the association between problematic CPU and suicidal ideation and attempts. A total of 5051 (2872 girls and 2179 boys) adolescents who owned at least one cellular phone completed the research questionnaires. We collected data on participants' CPU and suicidal behavior (ideation and attempts) during the past month as well as information on family function and history of depression. Five hundred thirty-two adolescents (10.54%) had problematic CPU. The rates of suicidal ideation were 23.50% and 11.76% in adolescents with problematic CPU and without problematic CPU, respectively. The rates of suicidal attempts in both groups were 13.70% and 5.45%, respectively. Family function, but not depression, had a moderating effect on the association between problematic CPU and suicidal ideation and attempt. This study highlights the association between problematic CPU and suicidal ideation as well as attempts and indicates that good family function may have a more significant role on reducing the risks of suicidal ideation and attempts in adolescents with problematic CPU than in those without problematic CPU. © 2014.
Uncertainty quantification of seabed parameters for large data volumes along survey tracks with a tempered particle filter

NASA Astrophysics Data System (ADS)

Dettmer, J.; Quijano, J. E.; Dosso, S. E.; Holland, C. W.; Mandolesi, E.

2016-12-01

Geophysical seabed properties are important for the detection and classification of unexploded ordnance. However, current surveying methods such as vertical seismic profiling, coring, or inversion are of limited use when surveying large areas with high spatial sampling density. We consider surveys based on a source and receiver array towed by an autonomous vehicle which produce large volumes of seabed reflectivity data that contain unprecedented and detailed seabed information. The data are analyzed with a particle filter, which requires efficient reflection-coefficient computation, efficient inversion algorithms and efficient use of computer resources. The filter quantifies information content of multiple sequential data sets by considering results from previous data along the survey track to inform the importance sampling at the current point. Challenges arise from environmental changes along the track where the number of sediment layers and their properties change. This is addressed by a trans-dimensional model in the filter which allows layering complexity to change along a track. Efficiency is improved by likelihood tempering of various particle subsets and including exchange moves (parallel tempering). The filter is implemented on a hybrid computer that combines central processing units (CPUs) and graphics processing units (GPUs) to exploit three levels of parallelism: (1) fine-grained parallel computation of spherical reflection coefficients with a GPU implementation of Levin integration; (2) updating particles by concurrent CPU processes which exchange information using automatic load balancing (coarse grained parallelism); (3) overlapping CPU-GPU communication (a major bottleneck) with GPU computation by staggering CPU access to the multiple GPUs. The algorithm is applied to spherical reflection coefficients for data sets along a 14-km track on the Malta Plateau, Mediterranean Sea. We demonstrate substantial efficiency gains over previous methods. [This research was supported in part by the U.S. Dept of Defense, thought the Strategic Environmental Research and Development Program (SERDP).
Shadow: Running Tor in a Box for Accurate and Efficient Experimentation

DTIC Science & Technology

2011-09-23

Modeling the speed of a target CPU is done by running an OpenSSL [31] speed test on a real CPU of that type. This provides us with the raw CPU processing...rate, but we are also interested in the processing speed of an application. By running application 5 benchmarks on the same CPU as the OpenSSL speed test...simulation, saving CPU cy- cles on our simulation host machine. Shadow removes cryptographic processing by preloading the main OpenSSL [31] functions used
Performance analysis of distributed symmetric sparse matrix vector multiplication algorithm for multi-core architectures

DOE PAGES

Oryspayev, Dossay; Aktulga, Hasan Metin; Sosonkina, Masha; ...

2015-07-14

In this article, sparse matrix vector multiply (SpMVM) is an important kernel that frequently arises in high performance computing applications. Due to its low arithmetic intensity, several approaches have been proposed in literature to improve its scalability and efficiency in large scale computations. In this paper, our target systems are high end multi-core architectures and we use messaging passing interface + open multiprocessing hybrid programming model for parallelism. We analyze the performance of recently proposed implementation of the distributed symmetric SpMVM, originally developed for large sparse symmetric matrices arising in ab initio nuclear structure calculations. We also study important featuresmore » of this implementation and compare with previously reported implementations that do not exploit underlying symmetry. Our SpMVM implementations leverage the hybrid paradigm to efficiently overlap expensive communications with computations. Our main comparison criterion is the "CPU core hours" metric, which is the main measure of resource usage on supercomputers. We analyze the effects of topology-aware mapping heuristic using simplified network load model. Furthermore, we have tested the different SpMVM implementations on two large clusters with 3D Torus and Dragonfly topology. Our results show that the distributed SpMVM implementation that exploits matrix symmetry and hides communication yields the best value for the "CPU core hours" metric and significantly reduces data movement overheads.« less
The association between problematic cellular phone use and risky behaviors and low self-esteem among Taiwanese adolescents.

PubMed

Yang, Yuan-Sheng; Yen, Ju-Yu; Ko, Chih-Hung; Cheng, Chung-Ping; Yen, Cheng-Fang

2010-04-28

Cellular phone use (CPU) is an important part of life for many adolescents. However, problematic CPU may complicate physiological and psychological problems. The aim of our study was to examine the associations between problematic CPU and a series of risky behaviors and low self-esteem in Taiwanese adolescents. A total of 11,111 adolescent students in Southern Taiwan were randomly selected into this study. We used the Problematic Cellular Phone Use Questionnaire to identify the adolescents with problematic CPU. Meanwhile, a series of risky behaviors and self-esteem were evaluated. Multilevel logistic regression analyses were employed to examine the associations between problematic CPU and risky behaviors and low self-esteem regarding gender and age. The results indicated that positive associations were found between problematic CPU and aggression, insomnia, smoking cigarettes, suicidal tendencies, and low self-esteem in all groups with different sexes and ages. However, gender and age differences existed in the associations between problematic CPU and suspension from school, criminal records, tattooing, short nocturnal sleep duration, unprotected sex, illicit drugs use, drinking alcohol and chewing betel nuts. There were positive associations between problematic CPU and a series of risky behaviors and low self-esteem in Taiwanese adolescents. It is worthy for parents and mental health professionals to pay attention to adolescents' problematic CPU.

Digital altruists: Resolving key questions about the empathy-altruism hypothesis in an Internet sample.

PubMed

McAuliffe, William H B; Forster, Daniel E; Philippe, Joachner; McCullough, Michael E

2018-06-01

Researchers have identified the capacity to take the perspective of others as a precursor to empathy-induced altruistic motivation. Consequently, investigators frequently use so-called perspective-taking instructions to manipulate empathic concern. However, most experiments using perspective-taking instructions have had modest sample sizes, undermining confidence in the replicability of results. In addition, it is unknown whether perspective-taking instructions work because they increase empathic concern or because comparison conditions reduce empathic concern (or both). Finally, some researchers have found that egoistic factors that do not involve empathic concern, including self-oriented emotions and self-other overlap, mediate the relationship between perspective-taking instructions and helping. The present investigation was a high-powered, preregistered effort that addressed methodological shortcomings of previous experiments to clarify how and when perspective-taking manipulations affect emotional arousal and prosocial motivation in a prototypical experimental paradigm administered over the Internet. Perspective-taking instructions did not clearly increase empathic concern; this null finding was not due to ceiling effects. Instructions to remain objective, on the other hand, unequivocally reduced empathic concern relative to a no-instructions control condition. Empathic concern was the most strongly felt emotion in all conditions, suggesting that distressed targets primarily elicit other-oriented concern. Empathic concern uniquely predicted the quality of social support provided to the target, which supports the empathy-altruism hypothesis and contradicts the role of self-oriented emotions and self-other overlap in explaining helping behavior. Empathy-induced altruism may be responsible for many prosocial acts that occur in everyday settings, including the increasing number of prosocial acts that occur online. (PsycINFO Database Record (c) 2018 APA, all rights reserved).
The association between problematic cellular phone use and risky behaviors and low self-esteem among Taiwanese adolescents

PubMed Central

2010-01-01

Background Cellular phone use (CPU) is an important part of life for many adolescents. However, problematic CPU may complicate physiological and psychological problems. The aim of our study was to examine the associations between problematic CPU and a series of risky behaviors and low self-esteem in Taiwanese adolescents. Methods A total of 11,111 adolescent students in Southern Taiwan were randomly selected into this study. We used the Problematic Cellular Phone Use Questionnaire to identify the adolescents with problematic CPU. Meanwhile, a series of risky behaviors and self-esteem were evaluated. Multilevel logistic regression analyses were employed to examine the associations between problematic CPU and risky behaviors and low self-esteem regarding gender and age. Results The results indicated that positive associations were found between problematic CPU and aggression, insomnia, smoking cigarettes, suicidal tendencies, and low self-esteem in all groups with different sexes and ages. However, gender and age differences existed in the associations between problematic CPU and suspension from school, criminal records, tattooing, short nocturnal sleep duration, unprotected sex, illicit drugs use, drinking alcohol and chewing betel nuts. Conclusions There were positive associations between problematic CPU and a series of risky behaviors and low self-esteem in Taiwanese adolescents. It is worthy for parents and mental health professionals to pay attention to adolescents' problematic CPU. PMID:20426807
CPU-GPU hybrid accelerating the Zuker algorithm for RNA secondary structure prediction applications.

PubMed

Lei, Guoqing; Dou, Yong; Wan, Wen; Xia, Fei; Li, Rongchun; Ma, Meng; Zou, Dan

2012-01-01

Prediction of ribonucleic acid (RNA) secondary structure remains one of the most important research areas in bioinformatics. The Zuker algorithm is one of the most popular methods of free energy minimization for RNA secondary structure prediction. Thus far, few studies have been reported on the acceleration of the Zuker algorithm on general-purpose processors or on extra accelerators such as Field Programmable Gate-Array (FPGA) and Graphics Processing Units (GPU). To the best of our knowledge, no implementation combines both CPU and extra accelerators, such as GPUs, to accelerate the Zuker algorithm applications. In this paper, a CPU-GPU hybrid computing system that accelerates Zuker algorithm applications for RNA secondary structure prediction is proposed. The computing tasks are allocated between CPU and GPU for parallel cooperate execution. Performance differences between the CPU and the GPU in the task-allocation scheme are considered to obtain workload balance. To improve the hybrid system performance, the Zuker algorithm is optimally implemented with special methods for CPU and GPU architecture. Speedup of 15.93× over optimized multi-core SIMD CPU implementation and performance advantage of 16% over optimized GPU implementation are shown in the experimental results. More than 14% of the sequences are executed on CPU in the hybrid system. The system combining CPU and GPU to accelerate the Zuker algorithm is proven to be promising and can be applied to other bioinformatics applications.
Accelerating Spaceborne SAR Imaging Using Multiple CPU/GPU Deep Collaborative Computing

PubMed Central

Zhang, Fan; Li, Guojun; Li, Wei; Hu, Wei; Hu, Yuxin

2016-01-01

With the development of synthetic aperture radar (SAR) technologies in recent years, the huge amount of remote sensing data brings challenges for real-time imaging processing. Therefore, high performance computing (HPC) methods have been presented to accelerate SAR imaging, especially the GPU based methods. In the classical GPU based imaging algorithm, GPU is employed to accelerate image processing by massive parallel computing, and CPU is only used to perform the auxiliary work such as data input/output (IO). However, the computing capability of CPU is ignored and underestimated. In this work, a new deep collaborative SAR imaging method based on multiple CPU/GPU is proposed to achieve real-time SAR imaging. Through the proposed tasks partitioning and scheduling strategy, the whole image can be generated with deep collaborative multiple CPU/GPU computing. In the part of CPU parallel imaging, the advanced vector extension (AVX) method is firstly introduced into the multi-core CPU parallel method for higher efficiency. As for the GPU parallel imaging, not only the bottlenecks of memory limitation and frequent data transferring are broken, but also kinds of optimized strategies are applied, such as streaming, parallel pipeline and so on. Experimental results demonstrate that the deep CPU/GPU collaborative imaging method enhances the efficiency of SAR imaging on single-core CPU by 270 times and realizes the real-time imaging in that the imaging rate outperforms the raw data generation rate. PMID:27070606
Accelerating Spaceborne SAR Imaging Using Multiple CPU/GPU Deep Collaborative Computing.

PubMed

Zhang, Fan; Li, Guojun; Li, Wei; Hu, Wei; Hu, Yuxin

2016-04-07

With the development of synthetic aperture radar (SAR) technologies in recent years, the huge amount of remote sensing data brings challenges for real-time imaging processing. Therefore, high performance computing (HPC) methods have been presented to accelerate SAR imaging, especially the GPU based methods. In the classical GPU based imaging algorithm, GPU is employed to accelerate image processing by massive parallel computing, and CPU is only used to perform the auxiliary work such as data input/output (IO). However, the computing capability of CPU is ignored and underestimated. In this work, a new deep collaborative SAR imaging method based on multiple CPU/GPU is proposed to achieve real-time SAR imaging. Through the proposed tasks partitioning and scheduling strategy, the whole image can be generated with deep collaborative multiple CPU/GPU computing. In the part of CPU parallel imaging, the advanced vector extension (AVX) method is firstly introduced into the multi-core CPU parallel method for higher efficiency. As for the GPU parallel imaging, not only the bottlenecks of memory limitation and frequent data transferring are broken, but also kinds of optimized strategies are applied, such as streaming, parallel pipeline and so on. Experimental results demonstrate that the deep CPU/GPU collaborative imaging method enhances the efficiency of SAR imaging on single-core CPU by 270 times and realizes the real-time imaging in that the imaging rate outperforms the raw data generation rate.
BBMerge – Accurate paired shotgun read merging via overlap

DOE PAGES

Bushnell, Brian; Rood, Jonathan; Singer, Esther

2017-10-26

Merging paired-end shotgun reads generated on high-throughput sequencing platforms can substantially improve various subsequent bioinformatics processes, including genome assembly, binning, mapping, annotation, and clustering for taxonomic analysis. With the inexorable growth of sequence data volume and CPU core counts, the speed and scalability of read-processing tools becomes ever-more important. The accuracy of shotgun read merging is crucial as well, as errors introduced by incorrect merging percolate through to reduce the quality of downstream analysis. Thus, we designed a new tool to maximize accuracy and minimize processing time, allowing the use of read merging on larger datasets, and in analyses highlymore » sensitive to errors. We present BBMerge, a new merging tool for paired-end shotgun sequence data. We benchmark BBMerge by comparison with eight other widely used merging tools, assessing speed, accuracy and scalability. Evaluations of both synthetic and real-world datasets demonstrate that BBMerge produces merged shotgun reads with greater accuracy and at higher speed than any existing merging tool examined. BBMerge also provides the ability to merge non-overlapping shotgun read pairs by using k-mer frequency information to assemble the unsequenced gap between reads, achieving a significantly higher merge rate while maintaining or increasing accuracy.« less
BBMerge – Accurate paired shotgun read merging via overlap

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bushnell, Brian; Rood, Jonathan; Singer, Esther

Merging paired-end shotgun reads generated on high-throughput sequencing platforms can substantially improve various subsequent bioinformatics processes, including genome assembly, binning, mapping, annotation, and clustering for taxonomic analysis. With the inexorable growth of sequence data volume and CPU core counts, the speed and scalability of read-processing tools becomes ever-more important. The accuracy of shotgun read merging is crucial as well, as errors introduced by incorrect merging percolate through to reduce the quality of downstream analysis. Thus, we designed a new tool to maximize accuracy and minimize processing time, allowing the use of read merging on larger datasets, and in analyses highlymore » sensitive to errors. We present BBMerge, a new merging tool for paired-end shotgun sequence data. We benchmark BBMerge by comparison with eight other widely used merging tools, assessing speed, accuracy and scalability. Evaluations of both synthetic and real-world datasets demonstrate that BBMerge produces merged shotgun reads with greater accuracy and at higher speed than any existing merging tool examined. BBMerge also provides the ability to merge non-overlapping shotgun read pairs by using k-mer frequency information to assemble the unsequenced gap between reads, achieving a significantly higher merge rate while maintaining or increasing accuracy.« less
Vectorization for Molecular Dynamics on Intel Xeon Phi Corpocessors

NASA Astrophysics Data System (ADS)

Yi, Hongsuk

2014-03-01

Many modern processors are capable of exploiting data-level parallelism through the use of single instruction multiple data (SIMD) execution. The new Intel Xeon Phi coprocessor supports 512 bit vector registers for the high performance computing. In this paper, we have developed a hierarchical parallelization scheme for accelerated molecular dynamics simulations with the Terfoff potentials for covalent bond solid crystals on Intel Xeon Phi coprocessor systems. The scheme exploits multi-level parallelism computing. We combine thread-level parallelism using a tightly coupled thread-level and task-level parallelism with 512-bit vector register. The simulation results show that the parallel performance of SIMD implementations on Xeon Phi is apparently superior to their x86 CPU architecture.
Design on the x-ray oral digital image display card

NASA Astrophysics Data System (ADS)

Wang, Liping; Gu, Guohua; Chen, Qian

2009-10-01

According to the main characteristics of X-ray imaging, the X-ray display card is successfully designed and debugged using the basic principle of correlated double sampling (CDS) and combined with embedded computer technology. CCD sensor drive circuit and the corresponding procedures have been designed. Filtering and sampling hold circuit have been designed. The data exchange with PC104 bus has been implemented. Using complex programmable logic device as a device to provide gating and timing logic, the functions which counting, reading CPU control instructions, corresponding exposure and controlling sample-and-hold have been completed. According to the image effect and noise analysis, the circuit components have been adjusted. And high-quality images have been obtained.
Flow simulations about steady-complex and unsteady moving configurations using structured-overlapped and unstructured grids

NASA Technical Reports Server (NTRS)

Newman, James C., III

1995-01-01

The limiting factor in simulating flows past realistic configurations of interest has been the discretization of the physical domain on which the governing equations of fluid flow may be solved. In an attempt to circumvent this problem, many Computational Fluid Dynamic (CFD) methodologies that are based on different grid generation and domain decomposition techniques have been developed. However, due to the costs involved and expertise required, very few comparative studies between these methods have been performed. In the present work, the two CFD methodologies which show the most promise for treating complex three-dimensional configurations as well as unsteady moving boundary problems are evaluated. These are namely the structured-overlapped and the unstructured grid schemes. Both methods use a cell centered, finite volume, upwind approach. The structured-overlapped algorithm uses an approximately factored, alternating direction implicit scheme to perform the time integration, whereas, the unstructured algorithm uses an explicit Runge-Kutta method. To examine the accuracy, efficiency, and limitations of each scheme, they are applied to the same steady complex multicomponent configurations and unsteady moving boundary problems. The steady complex cases consist of computing the subsonic flow about a two-dimensional high-lift multielement airfoil and the transonic flow about a three-dimensional wing/pylon/finned store assembly. The unsteady moving boundary problems are a forced pitching oscillation of an airfoil in a transonic freestream and a two-dimensional, subsonic airfoil/store separation sequence. Accuracy was accessed through the comparison of computed and experimentally measured pressure coefficient data on several of the wing/pylon/finned store assembly's components and at numerous angles-of-attack for the pitching airfoil. From this study, it was found that both the structured-overlapped and the unstructured grid schemes yielded flow solutions of comparable accuracy for these simulations. This study also indicated that, overall, the structured-overlapped scheme was slightly more CPU efficient than the unstructured approach.
Symptoms of Problematic Cellular Phone Use, Functional Impairment and Its Association with Depression among Adolescents in Southern Taiwan

ERIC Educational Resources Information Center

Yen, Cheng-Fang; Tang, Tze-Chun; Yen, Ju-Yu; Lin, Huang-Chi; Huang, Chi-Fen; Liu, Shu-Chun; Ko, Chih-Hung

2009-01-01

The aims of this study were: (1) to examine the prevalence of symptoms of problematic cellular phone use (CPU); (2) to examine the associations between the symptoms of problematic CPU, functional impairment caused by CPU and the characteristics of CPU; (3) to establish the optimal cut-off point of the number of symptoms for functional impairment…
Hypoxia/oxidative stress alters the pharmacokinetics of CPU86017-RS through mitochondrial dysfunction and NADPH oxidase activation.

PubMed

Gao, Jie; Ding, Xuan-sheng; Zhang, Yu-mao; Dai, De-zai; Liu, Mei; Zhang, Can; Dai, Yin

2013-12-01

Hypoxia/oxidative stress can alter the pharmacokinetics (PK) of CPU86017-RS, a novel antiarrhythmic agent. The aim of this study was to investigate the mechanisms underlying the alteration of PK of CPU86017-RS by hypoxia/oxidative stress. Male SD rats exposed to normal or intermittent hypoxia (10% O2) were administered CPU86017-RS (20, 40 or 80 mg/kg, ig) for 8 consecutive days. The PK parameters of CPU86017-RS were examined on d 8. In a separate set of experiments, female SD rats were injected with isoproterenol (ISO) for 5 consecutive days to induce a stress-related status, then CPU86017-RS (80 mg/kg, ig) was administered, and the tissue distributions were examined. The levels of Mn-SOD (manganese containing superoxide dismutase), endoplasmic reticulum (ER) stress sensor proteins (ATF-6, activating transcription factor 6 and PERK, PRK-like ER kinase) and activation of NADPH oxidase (NOX) were detected with Western blotting. Rat liver microsomes were incubated under N2 for in vitro study. The Cmax, t1/2, MRT (mean residence time) and AUC (area under the curve) of CPU86017-RS were significantly increased in the hypoxic rats receiving the 3 different doses of CPU86017-RS. The hypoxia-induced alteration of PK was associated with significantly reduced Mn-SOD level, and increased ATF-6, PERK and NOX levels. In ISO-treated rats, the distributions of CPU86017-RS in plasma, heart, kidney, and liver were markedly increased, and NOX levels in heart, kidney, and liver were significantly upregulated. Co-administration of the NOX blocker apocynin eliminated the abnormalities in the PK and tissue distributions of CPU86017-RS induced by hypoxia/oxidative stress. The metabolism of CPU86017-RS in the N2-treated liver microsomes was significantly reduced, addition of N-acetylcysteine (NAC), but not vitamin C, effectively reversed this change. The altered PK and metabolism of CPU86017-RS induced by hypoxia/oxidative stress are produced by mitochondrial abnormalities, NOX activation and ER stress; these abnormalities are significantly alleviated by apocynin or NAC.
Synthesis and characterization of conductive, biodegradable, elastomeric polyurethanes for biomedical applications.

PubMed

Xu, Cancan; Yepez, Gerardo; Wei, Zi; Liu, Fuqiang; Bugarin, Alejandro; Hong, Yi

2016-09-01

Biodegradable conductive polymers are currently of significant interest in tissue repair and regeneration, drug delivery, and bioelectronics. However, biodegradable materials exhibiting both conductive and elastic properties have rarely been reported to date. To that end, an electrically conductive polyurethane (CPU) was synthesized from polycaprolactone diol, hexadiisocyanate, and aniline trimer and subsequently doped with (1S)-(+)-10-camphorsulfonic acid (CSA). All CPU films showed good elasticity within a 30% strain range. The electrical conductivity of the CPU films, as enhanced with increasing amounts of CSA, ranged from 2.7 ± 0.9 × 10(-10) to 4.4 ± 0.6 × 10(-7) S/cm in a dry state and 4.2 ± 0.5 × 10(-8) to 7.3 ± 1.5 × 10(-5) S/cm in a wet state. The redox peaks of a CPU1.5 film (molar ratio CSA:aniline trimer = 1.5:1) in the cyclic voltammogram confirmed the desired good electroactivity. The doped CPU film exhibited good electrical stability (87% of initial conductivity after 150 hours charge) as measured in a cell culture medium. The degradation rates of CPU films increased with increasing CSA content in both phosphate-buffered solution (PBS) and lipase/PBS solutions. After 7 days of enzymatic degradation, the conductivity of all CSA-doped CPU films had decreased to that of the undoped CPU film. Mouse 3T3 fibroblasts proliferated and spread on all CPU films. This developed biodegradable CPU with good elasticity, electrical stability, and biocompatibility may find potential applications in tissue engineering, smart drug release, and electronics. © 2016 Wiley Periodicals, Inc. J Biomed Mater Res Part A: 104A: 2305-2314, 2016. © 2016 Wiley Periodicals, Inc.
General-purpose interface bus for multiuser, multitasking computer system

NASA Technical Reports Server (NTRS)

Generazio, Edward R.; Roth, Don J.; Stang, David B.

1990-01-01

The architecture of a multiuser, multitasking, virtual-memory computer system intended for the use by a medium-size research group is described. There are three central processing units (CPU) in the configuration, each with 16 MB memory, and two 474 MB hard disks attached. CPU 1 is designed for data analysis and contains an array processor for fast-Fourier transformations. In addition, CPU 1 shares display images viewed with the image processor. CPU 2 is designed for image analysis and display. CPU 3 is designed for data acquisition and contains 8 GPIB channels and an analog-to-digital conversion input/output interface with 16 channels. Up to 9 users can access the third CPU simultaneously for data acquisition. Focus is placed on the optimization of hardware interfaces and software, facilitating instrument control, data acquisition, and processing.
CPU-GPU hybrid accelerating the Zuker algorithm for RNA secondary structure prediction applications

PubMed Central

2012-01-01

Background Prediction of ribonucleic acid (RNA) secondary structure remains one of the most important research areas in bioinformatics. The Zuker algorithm is one of the most popular methods of free energy minimization for RNA secondary structure prediction. Thus far, few studies have been reported on the acceleration of the Zuker algorithm on general-purpose processors or on extra accelerators such as Field Programmable Gate-Array (FPGA) and Graphics Processing Units (GPU). To the best of our knowledge, no implementation combines both CPU and extra accelerators, such as GPUs, to accelerate the Zuker algorithm applications. Results In this paper, a CPU-GPU hybrid computing system that accelerates Zuker algorithm applications for RNA secondary structure prediction is proposed. The computing tasks are allocated between CPU and GPU for parallel cooperate execution. Performance differences between the CPU and the GPU in the task-allocation scheme are considered to obtain workload balance. To improve the hybrid system performance, the Zuker algorithm is optimally implemented with special methods for CPU and GPU architecture. Conclusions Speedup of 15.93× over optimized multi-core SIMD CPU implementation and performance advantage of 16% over optimized GPU implementation are shown in the experimental results. More than 14% of the sequences are executed on CPU in the hybrid system. The system combining CPU and GPU to accelerate the Zuker algorithm is proven to be promising and can be applied to other bioinformatics applications. PMID:22369626
Jobs masonry in LHCb with elastic Grid Jobs

NASA Astrophysics Data System (ADS)

Stagni, F.; Charpentier, Ph

2015-12-01

In any distributed computing infrastructure, a job is normally forbidden to run for an indefinite amount of time. This limitation is implemented using different technologies, the most common one being the CPU time limit implemented by batch queues. It is therefore important to have a good estimate of how much CPU work a job will require: otherwise, it might be killed by the batch system, or by whatever system is controlling the jobs’ execution. In many modern interwares, the jobs are actually executed by pilot jobs, that can use the whole available time in running multiple consecutive jobs. If at some point the available time in a pilot is too short for the execution of any job, it should be released, while it could have been used efficiently by a shorter job. Within LHCbDIRAC, the LHCb extension of the DIRAC interware, we developed a simple way to fully exploit computing capabilities available to a pilot, even for resources with limited time capabilities, by adding elasticity to production MonteCarlo (MC) simulation jobs. With our approach, independently of the time available, LHCbDIRAC will always have the possibility to execute a MC job, whose length will be adapted to the available amount of time: therefore the same job, running on different computing resources with different time limits, will produce different amounts of events. The decision on the number of events to be produced is made just in time at the start of the job, when the capabilities of the resource are known. In order to know how many events a MC job will be instructed to produce, LHCbDIRAC simply requires three values: the CPU-work per event for that type of job, the power of the machine it is running on, and the time left for the job before being killed. Knowing these values, we can estimate the number of events the job will be able to simulate with the available CPU time. This paper will demonstrate that, using this simple but effective solution, LHCb manages to make a more efficient use of the available resources, and that it can easily use new types of resources. An example is represented by resources provided by batch queues, where low-priority MC jobs can be used as "masonry" jobs in multi-jobs pilots. A second example is represented by opportunistic resources with limited available time.
Improving the performance of heterogeneous multi-core processors by modifying the cache coherence protocol

NASA Astrophysics Data System (ADS)

Fang, Juan; Hao, Xiaoting; Fan, Qingwen; Chang, Zeqing; Song, Shuying

2017-05-01

In the Heterogeneous multi-core architecture, CPU and GPU processor are integrated on the same chip, which poses a new challenge to the last-level cache management. In this architecture, the CPU application and the GPU application execute concurrently, accessing the last-level cache. CPU and GPU have different memory access characteristics, so that they have differences in the sensitivity of last-level cache (LLC) capacity. For many CPU applications, a reduced share of the LLC could lead to significant performance degradation. On the contrary, GPU applications can tolerate increase in memory access latency when there is sufficient thread-level parallelism. Taking into account the GPU program memory latency tolerance characteristics, this paper presents a method that let GPU applications can access to memory directly, leaving lots of LLC space for CPU applications, in improving the performance of CPU applications and does not affect the performance of GPU applications. When the CPU application is cache sensitive, and the GPU application is insensitive to the cache, the overall performance of the system is improved significantly.
Right-Brain/Left-Brain Integrated Associative Processor Employing Convertible Multiple-Instruction-Stream Multiple-Data-Stream Elements

NASA Astrophysics Data System (ADS)

Hayakawa, Hitoshi; Ogawa, Makoto; Shibata, Tadashi

2005-04-01

A very large scale integrated circuit (VLSI) architecture for a multiple-instruction-stream multiple-data-stream (MIMD) associative processor has been proposed. The processor employs an architecture that enables seamless switching from associative operations to arithmetic operations. The MIMD element is convertible to a regular central processing unit (CPU) while maintaining its high performance as an associative processor. Therefore, the MIMD associative processor can perform not only on-chip perception, i.e., searching for the vector most similar to an input vector throughout the on-chip cache memory, but also arithmetic and logic operations similar to those in ordinary CPUs, both simultaneously in parallel processing. Three key technologies have been developed to generate the MIMD element: associative-operation-and-arithmetic-operation switchable calculation units, a versatile register control scheme within the MIMD element for flexible operations, and a short instruction set for minimizing the memory size for program storage. Key circuit blocks were designed and fabricated using 0.18 μm complementary metal-oxide-semiconductor (CMOS) technology. As a result, the full-featured MIMD element is estimated to be 3 mm2, showing the feasibility of an 8-parallel-MIMD-element associative processor in a single chip of 5 mm× 5 mm.
A report documenting the completion of the Los Alamos National Laboratory portion of the ASC level II milestone ""Visualization on the supercomputing platform

DOE Office of Scientific and Technical Information (OSTI.GOV)

Ahrens, James P; Patchett, John M; Lo, Li - Ta

2011-01-24

This report provides documentation for the completion of the Los Alamos portion of the ASC Level II 'Visualization on the Supercomputing Platform' milestone. This ASC Level II milestone is a joint milestone between Sandia National Laboratory and Los Alamos National Laboratory. The milestone text is shown in Figure 1 with the Los Alamos portions highlighted in boldfaced text. Visualization and analysis of petascale data is limited by several factors which must be addressed as ACES delivers the Cielo platform. Two primary difficulties are: (1) Performance of interactive rendering, which is the most computationally intensive portion of the visualization process. Formore » terascale platforms, commodity clusters with graphics processors (GPUs) have been used for interactive rendering. For petascale platforms, visualization and rendering may be able to run efficiently on the supercomputer platform itself. (2) I/O bandwidth, which limits how much information can be written to disk. If we simply analyze the sparse information that is saved to disk we miss the opportunity to analyze the rich information produced every timestep by the simulation. For the first issue, we are pursuing in-situ analysis, in which simulations are coupled directly with analysis libraries at runtime. This milestone will evaluate the visualization and rendering performance of current and next generation supercomputers in contrast to GPU-based visualization clusters, and evaluate the perfromance of common analysis libraries coupled with the simulation that analyze and write data to disk during a running simulation. This milestone will explore, evaluate and advance the maturity level of these technologies and their applicability to problems of interest to the ASC program. In conclusion, we improved CPU-based rendering performance by a a factor of 2-10 times on our tests. In addition, we evaluated CPU and CPU-based rendering performance. We encourage production visualization experts to consider using CPU-based rendering solutions when it is appropriate. For example, on remote supercomputers CPU-based rendering can offer a means of viewing data without having to offload the data or geometry onto a CPU-based visualization system. In terms of comparative performance of the CPU and CPU we believe that further optimizations of the performance of both CPU or CPU-based rendering are possible. The simulation community is currently confronting this reality as they work to port their simulations to different hardware architectures. What is interesting about CPU rendering of massive datasets is that for part two decades CPU performance has significantly outperformed CPU-based systems. Based on our advancements, evaluations and explorations we believe that CPU-based rendering has returned as one viable option for the visualization of massive datasets.« less
Multi-GPU Accelerated Admittance Method for High-Resolution Human Exposure Evaluation.

PubMed

Xiong, Zubiao; Feng, Shi; Kautz, Richard; Chandra, Sandeep; Altunyurt, Nevin; Chen, Ji

2015-12-01

A multi-graphics processing unit (GPU) accelerated admittance method solver is presented for solving the induced electric field in high-resolution anatomical models of human body when exposed to external low-frequency magnetic fields. In the solver, the anatomical model is discretized as a three-dimensional network of admittances. The conjugate orthogonal conjugate gradient (COCG) iterative algorithm is employed to take advantage of the symmetric property of the complex-valued linear system of equations. Compared against the widely used biconjugate gradient stabilized method, the COCG algorithm can reduce the solving time by 3.5 times and reduce the storage requirement by about 40%. The iterative algorithm is then accelerated further by using multiple NVIDIA GPUs. The computations and data transfers between GPUs are overlapped in time by using asynchronous concurrent execution design. The communication overhead is well hidden so that the acceleration is nearly linear with the number of GPU cards. Numerical examples show that our GPU implementation running on four NVIDIA Tesla K20c cards can reach 90 times faster than the CPU implementation running on eight CPU cores (two Intel Xeon E5-2603 processors). The implemented solver is able to solve large dimensional problems efficiently. A whole adult body discretized in 1-mm resolution can be solved in just several minutes. The high efficiency achieved makes it practical to investigate human exposure involving a large number of cases with a high resolution that meets the requirements of international dosimetry guidelines.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Gorentla Venkata, Manjunath; Graham, Richard L; Ladd, Joshua S

This paper describes the design and implementation of InfiniBand (IB) CORE-Direct based blocking and nonblocking broadcast operations within the Cheetah collective operation framework. It describes a novel approach that fully ofFLoads collective operations and employs only user-supplied buffers. For a 64 rank communicator, the latency of CORE-Direct based hierarchical algorithm is better than production-grade Message Passing Interface (MPI) implementations, 150% better than the default Open MPI algorithm and 115% better than the shared memory optimized MVAPICH implementation for a one kilobyte (KB) message, and for eight mega-bytes (MB) it is 48% and 64% better, respectively. Flat-topology broadcast achieves 99.9% overlapmore » in a polling based communication-computation test, and 95.1% overlap for a wait based test, compared with 92.4% and 17.0%, respectively, for a similar Central Processing Unit (CPU) based implementation.« less
A radiation-hardened, computer for satellite applications

DOE Office of Scientific and Technical Information (OSTI.GOV)

Gaona, J.I. Jr.

1996-08-01

This paper describes high reliability radiation hardened computers built by Sandia for application aboard DOE satellite programs requiring 32 bit processing. The computers highlight a radiation hardened (10 kGy(Si)) R3000 executing up to 10 million reduced instruction set instructions (RISC) per second (MIPS), a dual purpose module control bus used for real-time default and power management which allows for extended mission operation on as little as 1.2 watts, and a local area network capable of 480 Mbits/s. The central processing unit (CPU) is the NASA Goddard R3000 nicknamed the ``Mongoose or Mongoose 1``. The Sandia Satellite Computer (SSC) uses Rational`smore » Ada compiler, debugger, operating system kernel, and enhanced floating point emulation library targeted at the Mongoose. The SSC gives Sandia the capability of processing complex types of spacecraft attitude determination and control algorithms and of modifying programmed control laws via ground command. And in general, SSC offers end users the ability to process data onboard the spacecraft that would normally have been sent to the ground which allows reconsideration of traditional space-grounded partitioning options.« less
A GPU-based large-scale Monte Carlo simulation method for systems with long-range interactions

NASA Astrophysics Data System (ADS)

Liang, Yihao; Xing, Xiangjun; Li, Yaohang

2017-06-01

In this work we present an efficient implementation of Canonical Monte Carlo simulation for Coulomb many body systems on graphics processing units (GPU). Our method takes advantage of the GPU Single Instruction, Multiple Data (SIMD) architectures, and adopts the sequential updating scheme of Metropolis algorithm. It makes no approximation in the computation of energy, and reaches a remarkable 440-fold speedup, compared with the serial implementation on CPU. We further use this method to simulate primitive model electrolytes, and measure very precisely all ion-ion pair correlation functions at high concentrations. From these data, we extract the renormalized Debye length, renormalized valences of constituent ions, and renormalized dielectric constants. These results demonstrate unequivocally physics beyond the classical Poisson-Boltzmann theory.
A survey of CPU-GPU heterogeneous computing techniques

DOE PAGES

Mittal, Sparsh; Vetter, Jeffrey S.

2015-07-04

As both CPU and GPU become employed in a wide range of applications, it has been acknowledged that both of these processing units (PUs) have their unique features and strengths and hence, CPU-GPU collaboration is inevitable to achieve high-performance computing. This has motivated significant amount of research on heterogeneous computing techniques, along with the design of CPU-GPU fused chips and petascale heterogeneous supercomputers. In this paper, we survey heterogeneous computing techniques (HCTs) such as workload-partitioning which enable utilizing both CPU and GPU to improve performance and/or energy efficiency. We review heterogeneous computing approaches at runtime, algorithm, programming, compiler and applicationmore » level. Further, we review both discrete and fused CPU-GPU systems; and discuss benchmark suites designed for evaluating heterogeneous computing systems (HCSs). Furthermore, we believe that this paper will provide insights into working and scope of applications of HCTs to researchers and motivate them to further harness the computational powers of CPUs and GPUs to achieve the goal of exascale performance.« less
A survey of CPU-GPU heterogeneous computing techniques

DOE Office of Scientific and Technical Information (OSTI.GOV)

Mittal, Sparsh; Vetter, Jeffrey S.

As both CPU and GPU become employed in a wide range of applications, it has been acknowledged that both of these processing units (PUs) have their unique features and strengths and hence, CPU-GPU collaboration is inevitable to achieve high-performance computing. This has motivated significant amount of research on heterogeneous computing techniques, along with the design of CPU-GPU fused chips and petascale heterogeneous supercomputers. In this paper, we survey heterogeneous computing techniques (HCTs) such as workload-partitioning which enable utilizing both CPU and GPU to improve performance and/or energy efficiency. We review heterogeneous computing approaches at runtime, algorithm, programming, compiler and applicationmore » level. Further, we review both discrete and fused CPU-GPU systems; and discuss benchmark suites designed for evaluating heterogeneous computing systems (HCSs). Furthermore, we believe that this paper will provide insights into working and scope of applications of HCTs to researchers and motivate them to further harness the computational powers of CPUs and GPUs to achieve the goal of exascale performance.« less
47 CFR 15.102 - CPU boards and power supplies used in personal computers.

Code of Federal Regulations, 2013 CFR

2013-10-01

... computers. 15.102 Section 15.102 Telecommunication FEDERAL COMMUNICATIONS COMMISSION GENERAL RADIO FREQUENCY DEVICES Unintentional Radiators § 15.102 CPU boards and power supplies used in personal computers. (a... modifications that must be made to a personal computer, peripheral device, CPU board or power supply during...
47 CFR 15.102 - CPU boards and power supplies used in personal computers.

Code of Federal Regulations, 2011 CFR

2011-10-01

... computers. 15.102 Section 15.102 Telecommunication FEDERAL COMMUNICATIONS COMMISSION GENERAL RADIO FREQUENCY DEVICES Unintentional Radiators § 15.102 CPU boards and power supplies used in personal computers. (a... modifications that must be made to a personal computer, peripheral device, CPU board or power supply during...
47 CFR 15.102 - CPU boards and power supplies used in personal computers.

Code of Federal Regulations, 2010 CFR

2010-10-01

... computers. 15.102 Section 15.102 Telecommunication FEDERAL COMMUNICATIONS COMMISSION GENERAL RADIO FREQUENCY DEVICES Unintentional Radiators § 15.102 CPU boards and power supplies used in personal computers. (a... modifications that must be made to a personal computer, peripheral device, CPU board or power supply during...
47 CFR 15.102 - CPU boards and power supplies used in personal computers.

Code of Federal Regulations, 2014 CFR

2014-10-01

... computers. 15.102 Section 15.102 Telecommunication FEDERAL COMMUNICATIONS COMMISSION GENERAL RADIO FREQUENCY DEVICES Unintentional Radiators § 15.102 CPU boards and power supplies used in personal computers. (a... modifications that must be made to a personal computer, peripheral device, CPU board or power supply during...
47 CFR 15.102 - CPU boards and power supplies used in personal computers.

Code of Federal Regulations, 2012 CFR

2012-10-01

... computers. 15.102 Section 15.102 Telecommunication FEDERAL COMMUNICATIONS COMMISSION GENERAL RADIO FREQUENCY DEVICES Unintentional Radiators § 15.102 CPU boards and power supplies used in personal computers. (a... modifications that must be made to a personal computer, peripheral device, CPU board or power supply during...
Online performance evaluation of RAID 5 using CPU utilization

NASA Astrophysics Data System (ADS)

Jin, Hai; Yang, Hua; Zhang, Jiangling

1998-09-01

Redundant arrays of independent disks (RAID) technology is the efficient way to solve the bottleneck problem between CPU processing ability and I/O subsystem. For the system point of view, the most important metric of on line performance is the utilization of CPU. This paper first employs the way to calculate the CPU utilization of system connected with RAID level 5 using statistic average method. From the simulation results of CPU utilization of system connected with RAID level 5 subsystem can we see that using multiple disks as an array to access data in parallel is the efficient way to enhance the on-line performance of disk storage system. USing high-end disk drivers to compose the disk array is the key to enhance the on-line performance of system.
Parallelized computation for computer simulation of electrocardiograms using personal computers with multi-core CPU and general-purpose GPU.

PubMed

Shen, Wenfeng; Wei, Daming; Xu, Weimin; Zhu, Xin; Yuan, Shizhong

2010-10-01

Biological computations like electrocardiological modelling and simulation usually require high-performance computing environments. This paper introduces an implementation of parallel computation for computer simulation of electrocardiograms (ECGs) in a personal computer environment with an Intel CPU of Core (TM) 2 Quad Q6600 and a GPU of Geforce 8800GT, with software support by OpenMP and CUDA. It was tested in three parallelization device setups: (a) a four-core CPU without a general-purpose GPU, (b) a general-purpose GPU plus 1 core of CPU, and (c) a four-core CPU plus a general-purpose GPU. To effectively take advantage of a multi-core CPU and a general-purpose GPU, an algorithm based on load-prediction dynamic scheduling was developed and applied to setting (c). In the simulation with 1600 time steps, the speedup of the parallel computation as compared to the serial computation was 3.9 in setting (a), 16.8 in setting (b), and 20.0 in setting (c). This study demonstrates that a current PC with a multi-core CPU and a general-purpose GPU provides a good environment for parallel computations in biological modelling and simulation studies. Copyright 2010 Elsevier Ireland Ltd. All rights reserved.
Is Outdoor Education Environmental Education?

ERIC Educational Resources Information Center

Parkin, Danny

1998-01-01

Explores the relationship between outdoor education and environmental education by examining the broad nature of outdoor education and discussing whether outdoor education and environmental education are overlapping philosophies or separate methods of instruction. Includes analysis of a survey of outdoor educators and details a process for…
32 CFR 701.53 - FOIA fee schedule.

Code of Federal Regulations, 2014 CFR

2014-07-01

... human time) and machine time. (1) Human time. Human time is all the time spent by humans performing the...) Machine time. Machine time involves only direct costs of the central processing unit (CPU), input/output... exist to calculate CPU time, no machine costs can be passed on to the requester. When CPU calculations...
32 CFR 701.53 - FOIA fee schedule.

Code of Federal Regulations, 2012 CFR

2012-07-01

... human time) and machine time. (1) Human time. Human time is all the time spent by humans performing the...) Machine time. Machine time involves only direct costs of the central processing unit (CPU), input/output... exist to calculate CPU time, no machine costs can be passed on to the requester. When CPU calculations...
32 CFR 701.53 - FOIA fee schedule.

Code of Federal Regulations, 2013 CFR

2013-07-01

... human time) and machine time. (1) Human time. Human time is all the time spent by humans performing the...) Machine time. Machine time involves only direct costs of the central processing unit (CPU), input/output... exist to calculate CPU time, no machine costs can be passed on to the requester. When CPU calculations...
The Multiple Component Alternative for Gifted Education.

ERIC Educational Resources Information Center

Swassing, Ray

1984-01-01

The Multiple Component Model (MCM) of gifted education includes instruction which may overlap in literature, history, art, enrichment, languages, science, physics, math, music, and dance. The model rests on multifactored identification and requires systematic development and selection of components with ongoing feedback and evaluation. (CL)
FPGA Online Tracking Algorithm for the PANDA Straw Tube Tracker

NASA Astrophysics Data System (ADS)

Liang, Yutie; Ye, Hua; Galuska, Martin J.; Gessler, Thomas; Kuhn, Wolfgang; Lange, Jens Soren; Wagner, Milan N.; Liu, Zhen'an; Zhao, Jingzhou

2017-06-01

A novel FPGA based online tracking algorithm for helix track reconstruction in a solenoidal field, developed for the PANDA spectrometer, is described. Employing the Straw Tube Tracker detector with 4636 straw tubes, the algorithm includes a complex track finder, and a track fitter. Implemented in VHDL, the algorithm is tested on a Xilinx Virtex-4 FX60 FPGA chip with different types of events, at different event rates. A processing time of 7 $\\mu$s per event for an average of 6 charged tracks is obtained. The momentum resolution is about 3\\% (4\\%) for $p_t$ ($p_z$) at 1 GeV/c. Comparing to the algorithm running on a CPU chip (single core Intel Xeon E5520 at 2.26 GHz), an improvement of 3 orders of magnitude in processing time is obtained. The algorithm can handle severe overlapping of events which are typical for interaction rates above 10 MHz.
ClusCo: clustering and comparison of protein models.

PubMed

Jamroz, Michal; Kolinski, Andrzej

2013-02-22

The development, optimization and validation of protein modeling methods require efficient tools for structural comparison. Frequently, a large number of models need to be compared with the target native structure. The main reason for the development of Clusco software was to create a high-throughput tool for all-versus-all comparison, because calculating similarity matrix is the one of the bottlenecks in the protein modeling pipeline. Clusco is fast and easy-to-use software for high-throughput comparison of protein models with different similarity measures (cRMSD, dRMSD, GDT_TS, TM-Score, MaxSub, Contact Map Overlap) and clustering of the comparison results with standard methods: K-means Clustering or Hierarchical Agglomerative Clustering. The application was highly optimized and written in C/C++, including the code for parallel execution on CPU and GPU, which resulted in a significant speedup over similar clustering and scoring computation programs.
Gender differences in learning physical science concepts: Does computer animation help equalize them?

NASA Astrophysics Data System (ADS)

Jacek, Laura Lee

This dissertation details an experiment designed to identify gender differences in learning using three experimental treatments: animation, static graphics, and verbal instruction alone. Three learning presentations were used in testing of 332 university students. Statistical analysis was performed using ANOVA, binomial tests for differences of proportion, and descriptive statistics. Results showed that animation significantly improved women's long-term learning over static graphics (p = 0.067), but didn't significantly improve men's long-term learning over static graphics. In all cases, women's scores improved with animation over both other forms of instruction for long-term testing, indicating that future research should not abandon the study of animation as a tool that may promote gender equity in science. Short-term test differences were smaller, and not statistically significant. Variation present in short-term scores was related more to presentation topic than treatment. This research also details characteristics of each of the three presentations, to identify variables (e.g. level of abstraction in presentation) affecting score differences within treatments. Differences between men's and women's scores were non-standard between presentations, but these differences were not statistically significant (long-term p = 0.2961, short-term p = 0.2893). In future research, experiments might be better designed to test these presentational variables in isolation, possibly yielding more distinctive differences between presentational scores. Differences in confidence interval overlaps between presentations suggested that treatment superiority may be somewhat dependent on the design or topic of the learning presentation. Confidence intervals greatly overlap in all situations. This undercut, to some degree, the surety of conclusions indicating superiority of one treatment type over the others. However, confidence intervals for animation were smaller, overlapped nearly completely for men and women (there was less overlap between the genders for the other two treatments), and centered around slightly higher means, lending further support to the conclusion that animation helped equalize men's and women's learning. The most important conclusion identified in this research is that gender is an important variable experimental populations testing animation as a learning device. Averages indicated that both men and women prefer to work with animation over either static graphics or verbal instruction alone.

Short-term dopaminergic regulation of GABA release in dopamine deafferented caudate-putamen is not directly associated with glutamic acid decarboxylase gene expression.

PubMed

O'Connor, W T; Lindefors, N; Brené, S; Herrera-Marschitz, M; Persson, H; Ungerstedt, U

1991-07-08

In vivo microdialysis and in situ hybridization were combined to study dopaminergic regulation of gamma-amino butyric acid (GABA) neurons in rat caudate-putamen (CPu). Potassium-stimulated GABA release in CPu was elevated following a dopamine deafferentation. Local perfusion with exogenous dopamine (50 microM) for 3 h via the microdialysis probe attenuated the potassium-stimulated increase in extracellular GABA in CPu. Expression of glutamic acid decarboxylase (GAD) mRNA was also increased in the dopamine deafferented CPu. However, local perfusion with dopamine had no significant attenuating effect on the increased GAD mRNA expression. These findings indicate that dopaminergic regulation of GABA neurons in the dopamine deafferented CPu includes both a short-term effect at the level of GABA release independent of changes in GAD mRNA expression and a long-term modulation at the level of GAD gene expression.
GPU-Acceleration of Sequence Homology Searches with Database Subsequence Clustering.

PubMed

Suzuki, Shuji; Kakuta, Masanori; Ishida, Takashi; Akiyama, Yutaka

2016-01-01

Sequence homology searches are used in various fields and require large amounts of computation time, especially for metagenomic analysis, owing to the large number of queries and the database size. To accelerate computing analyses, graphics processing units (GPUs) are widely used as a low-cost, high-performance computing platform. Therefore, we mapped the time-consuming steps involved in GHOSTZ, which is a state-of-the-art homology search algorithm for protein sequences, onto a GPU and implemented it as GHOSTZ-GPU. In addition, we optimized memory access for GPU calculations and for communication between the CPU and GPU. As per results of the evaluation test involving metagenomic data, GHOSTZ-GPU with 12 CPU threads and 1 GPU was approximately 3.0- to 4.1-fold faster than GHOSTZ with 12 CPU threads. Moreover, GHOSTZ-GPU with 12 CPU threads and 3 GPUs was approximately 5.8- to 7.7-fold faster than GHOSTZ with 12 CPU threads.
Portable nuclear material detector and process

DOEpatents

Hofstetter, Kenneth J [Aiken, SC; Fulghum, Charles K [Aiken, SC; Harpring, Lawrence J [North Augusta, SC; Huffman, Russell K [Augusta, GA; Varble, Donald L [Evans, GA

2008-04-01

A portable, hand held, multi-sensor radiation detector is disclosed. The detection apparatus has a plurality of spaced sensor locations which are contained within a flexible housing. The detection apparatus, when suspended from an elevation, will readily assume a substantially straight, vertical orientation and may be used to monitor radiation levels from shipping containers. The flexible detection array can also assume a variety of other orientations to facilitate any unique container shapes or to conform to various physical requirements with respect to deployment of the detection array. The output of each sensor within the array is processed by at least one CPU which provides information in a usable form to a user interface. The user interface is used to provide the power requirements and operating instructions to the operational components within the detection array.
Teaching Linear Measurement Concepts. . .K to 6

ERIC Educational Resources Information Center

Norrie, A. L.

1974-01-01

Three distinct, but overlapping, stages in measurement are identified as intuitive thinking, logical thinking, and formal operations. Three types of representation are body units, non-standard, and standard units. Instructional sequences and activities are suggested for grades 1-3 and grades 4-6 based on these considerations. (LS)
Classroom Communities' Adaptations of the Practice of Scientific Argumentation

ERIC Educational Resources Information Center

Berland, Leema K.; Reiser, Brian J.

2011-01-01

Scientific argumentation is increasingly seen as a key inquiry practice for students in science classrooms. This is a complex practice that entails three overlapping, instructional goals: Participants "articulate their understandings" and work to "persuade others of those understandings" in order to "make sense of the…
Dynamic Quantum Allocation and Swap-Time Variability in Time-Sharing Operating Systems.

ERIC Educational Resources Information Center

Bhat, U. Narayan; Nance, Richard E.

The effects of dynamic quantum allocation and swap-time variability on central processing unit (CPU) behavior are investigated using a model that allows both quantum length and swap-time to be state-dependent random variables. Effective CPU utilization is defined to be the proportion of a CPU busy period that is devoted to program processing, i.e.…
Towards a high performance geometry library for particle-detector simulations

DOE PAGES

Apostolakis, J.; Bandieramonte, M.; Bitzes, G.; ...

2015-05-22

Thread-parallelization and single-instruction multiple data (SIMD) ”vectorisation” of software components in HEP computing has become a necessity to fully benefit from current and future computing hardware. In this context, the Geant-Vector/GPU simulation project aims to re-engineer current software for the simulation of the passage of particles through detectors in order to increase the overall event throughput. As one of the core modules in this area, the geometry library plays a central role and vectorising its algorithms will be one of the cornerstones towards achieving good CPU performance. Here, we report on the progress made in vectorising the shape primitives, asmore » well as in applying new C++ template based optimizations of existing code available in the Geant4, ROOT or USolids geometry libraries. We will focus on a presentation of our software development approach that aims to provide optimized code for all use cases of the library (e.g., single particle and many-particle APIs) and to support different architectures (CPU and GPU) while keeping the code base small, manageable and maintainable. We report on a generic and templated C++ geometry library as a continuation of the AIDA USolids project. As a result, the experience gained with these developments will be beneficial to other parts of the simulation software, such as for the optimization of the physics library, and possibly to other parts of the experiment software stack, such as reconstruction and analysis.« less
Towards a high performance geometry library for particle-detector simulations

DOE Office of Scientific and Technical Information (OSTI.GOV)

Apostolakis, J.; Bandieramonte, M.; Bitzes, G.

Thread-parallelization and single-instruction multiple data (SIMD) ”vectorisation” of software components in HEP computing has become a necessity to fully benefit from current and future computing hardware. In this context, the Geant-Vector/GPU simulation project aims to re-engineer current software for the simulation of the passage of particles through detectors in order to increase the overall event throughput. As one of the core modules in this area, the geometry library plays a central role and vectorising its algorithms will be one of the cornerstones towards achieving good CPU performance. Here, we report on the progress made in vectorising the shape primitives, asmore » well as in applying new C++ template based optimizations of existing code available in the Geant4, ROOT or USolids geometry libraries. We will focus on a presentation of our software development approach that aims to provide optimized code for all use cases of the library (e.g., single particle and many-particle APIs) and to support different architectures (CPU and GPU) while keeping the code base small, manageable and maintainable. We report on a generic and templated C++ geometry library as a continuation of the AIDA USolids project. As a result, the experience gained with these developments will be beneficial to other parts of the simulation software, such as for the optimization of the physics library, and possibly to other parts of the experiment software stack, such as reconstruction and analysis.« less
Allowing for Horizontally Heterogeneous Clouds and Generalized Overlap in an Atmospheric GCM

NASA Technical Reports Server (NTRS)

Lee, D.; Oreopoulos, L.; Suarez, M.

2011-01-01

While fully accounting for 3D effects in Global Climate Models (GCMs) appears not realistic at the present time for a variety of reasons such as computational cost and unavailability of 3D cloud structure in the models, incorporation in radiation schemes of subgrid cloud variability described by one-point statistics is now considered feasible and is being actively pursued. This development has gained momentum once it was demonstrated that CPU-intensive spectrally explicit Independent Column Approximation (lCA) can be substituted by stochastic Monte Carlo ICA (McICA) calculations where spectral integration is accomplished in a manner that produces relatively benign random noise. The McICA approach has been implemented in Goddard's GEOS-5 atmospheric GCM as part of the implementation of the RRTMG radiation package. GEOS-5 with McICA and RRTMG can handle horizontally variable clouds which can be set via a cloud generator to arbitrarily overlap within the full spectrum of maximum and random both in terms of cloud fraction and layer condensate distributions. In our presentation we will show radiative and other impacts of the combined horizontal and vertical cloud variability on multi-year simulations of an otherwise untuned GEOS-5 with fixed SSTs. Introducing cloud horizontal heterogeneity without changing the mean amounts of condensate reduces reflected solar and increases thermal radiation to space, but disproportionate changes may increase the radiative imbalance at TOA. The net radiation at TOA can be modulated by allowing the parameters of the generalized overlap and heterogeneity scheme to vary, a dependence whose behavior we will discuss. The sensitivity of the cloud radiative forcing to the parameters of cloud horizontal heterogeneity and comparisons of CERES-derived forcing will be shown.
Hybrid multicore/vectorisation technique applied to the elastic wave equation on a staggered grid

NASA Astrophysics Data System (ADS)

Titarenko, Sofya; Hildyard, Mark

2017-07-01

In modern physics it has become common to find the solution of a problem by solving numerically a set of PDEs. Whether solving them on a finite difference grid or by a finite element approach, the main calculations are often applied to a stencil structure. In the last decade it has become usual to work with so called big data problems where calculations are very heavy and accelerators and modern architectures are widely used. Although CPU and GPU clusters are often used to solve such problems, parallelisation of any calculation ideally starts from a single processor optimisation. Unfortunately, it is impossible to vectorise a stencil structured loop with high level instructions. In this paper we suggest a new approach to rearranging the data structure which makes it possible to apply high level vectorisation instructions to a stencil loop and which results in significant acceleration. The suggested method allows further acceleration if shared memory APIs are used. We show the effectiveness of the method by applying it to an elastic wave propagation problem on a finite difference grid. We have chosen Intel architecture for the test problem and OpenMP (Open Multi-Processing) since they are extensively used in many applications.
Synthesis and Characterization of Biodegradable Polyurethane for Hypopharyngeal Tissue Engineering

PubMed Central

Shen, Zhisen; Lu, Dakai; Li, Qun; Zhang, Zongyong

2015-01-01

Biodegradable crosslinked polyurethane (cPU) was synthesized using polyethylene glycol (PEG), L-lactide (L-LA), and hexamethylene diisocyanate (HDI), with iron acetylacetonate (Fe(acac)3) as the catalyst and PEG as the extender. Chemical components of the obtained polymers were characterized by FTIR spectroscopy, 1H NMR spectra, and Gel Permeation Chromatography (GPC). The thermodynamic properties, mechanical behaviors, surface hydrophilicity, degradability, and cytotoxicity were tested via differential scanning calorimetry (DSC), tensile tests, contact angle measurements, and cell culture. The results show that the synthesized cPU possessed good flexibility with quite low glass transition temperature (T g, −22°C) and good wettability. Water uptake measured as high as 229.7 ± 18.7%. These properties make cPU a good candidate material for engineering soft tissues such as the hypopharynx. In vitro and in vivo tests showed that cPU has the ability to support the growth of human hypopharyngeal fibroblasts and angiogenesis was observed around cPU after it was implanted subcutaneously in SD rats. PMID:25839041
Is our medical school socially accountable? The case of Faculty of Medicine, Suez Canal University.

PubMed

Hosny, Somaya; Ghaly, Mona; Boelen, Charles

2015-04-01

Faculty of Medicine, Suez Canal University (FOM/SCU) was established as community oriented school with innovative educational strategies. Social accountability represents the commitment of the medical school towards the community it serves. To assess FOM/SCU compliance to social accountability using the "Conceptualization, Production, Usability" (CPU) model. FOM/SCU's practice was reviewed against CPU model parameters. CPU consists of three domains, 11 sections and 31 parameters. Data were collected through unstructured interviews with the main stakeholders and documents review since 2005 to 2013. FOM/SCU shows general compliance to the three domains of the CPU. Very good compliance was shown to the "P" domain of the model through FOM/SCU's innovative educational system, students and faculty members. More work is needed on the "C" and "U" domains. FOM/SCU complies with many parameters of the CPU model; however, more work should be accomplished to comply with some items in the C and U domains so that FOM/SCU can be recognized as a proactive socially accountable school.
GPU Optimizations for a Production Molecular Docking Code*

PubMed Central

Landaverde, Raphael; Herbordt, Martin C.

2015-01-01

Modeling molecular docking is critical to both understanding life processes and designing new drugs. In previous work we created the first published GPU-accelerated docking code (PIPER) which achieved a roughly 5× speed-up over a contemporaneous 4 core CPU. Advances in GPU architecture and in the CPU code, however, have since reduced this relalative performance by a factor of 10. In this paper we describe the upgrade of GPU PIPER. This required an entire rewrite, including algorithm changes and moving most remaining non-accelerated CPU code onto the GPU. The result is a 7× improvement in GPU performance and a 3.3× speedup over the CPU-only code. We find that this difference in time is almost entirely due to the difference in run times of the 3D FFT library functions on CPU (MKL) and GPU (cuFFT), respectively. The GPU code has been integrated into the ClusPro docking server which has over 4000 active users. PMID:26594667
GPU-Acceleration of Sequence Homology Searches with Database Subsequence Clustering

PubMed Central

Suzuki, Shuji; Kakuta, Masanori; Ishida, Takashi; Akiyama, Yutaka

2016-01-01

Sequence homology searches are used in various fields and require large amounts of computation time, especially for metagenomic analysis, owing to the large number of queries and the database size. To accelerate computing analyses, graphics processing units (GPUs) are widely used as a low-cost, high-performance computing platform. Therefore, we mapped the time-consuming steps involved in GHOSTZ, which is a state-of-the-art homology search algorithm for protein sequences, onto a GPU and implemented it as GHOSTZ-GPU. In addition, we optimized memory access for GPU calculations and for communication between the CPU and GPU. As per results of the evaluation test involving metagenomic data, GHOSTZ-GPU with 12 CPU threads and 1 GPU was approximately 3.0- to 4.1-fold faster than GHOSTZ with 12 CPU threads. Moreover, GHOSTZ-GPU with 12 CPU threads and 3 GPUs was approximately 5.8- to 7.7-fold faster than GHOSTZ with 12 CPU threads. PMID:27482905
GPU Optimizations for a Production Molecular Docking Code.

PubMed

Landaverde, Raphael; Herbordt, Martin C

2014-09-01

Modeling molecular docking is critical to both understanding life processes and designing new drugs. In previous work we created the first published GPU-accelerated docking code (PIPER) which achieved a roughly 5× speed-up over a contemporaneous 4 core CPU. Advances in GPU architecture and in the CPU code, however, have since reduced this relalative performance by a factor of 10. In this paper we describe the upgrade of GPU PIPER. This required an entire rewrite, including algorithm changes and moving most remaining non-accelerated CPU code onto the GPU. The result is a 7× improvement in GPU performance and a 3.3× speedup over the CPU-only code. We find that this difference in time is almost entirely due to the difference in run times of the 3D FFT library functions on CPU (MKL) and GPU (cuFFT), respectively. The GPU code has been integrated into the ClusPro docking server which has over 4000 active users.
Synthesis and characterization of biodegradable polyurethane for hypopharyngeal tissue engineering.

PubMed

Shen, Zhisen; Lu, Dakai; Li, Qun; Zhang, Zongyong; Zhu, Yabin

2015-01-01

Biodegradable crosslinked polyurethane (cPU) was synthesized using polyethylene glycol (PEG), L-lactide (L-LA), and hexamethylene diisocyanate (HDI), with iron acetylacetonate (Fe(acac)3) as the catalyst and PEG as the extender. Chemical components of the obtained polymers were characterized by FTIR spectroscopy, (1)H NMR spectra, and Gel Permeation Chromatography (GPC). The thermodynamic properties, mechanical behaviors, surface hydrophilicity, degradability, and cytotoxicity were tested via differential scanning calorimetry (DSC), tensile tests, contact angle measurements, and cell culture. The results show that the synthesized cPU possessed good flexibility with quite low glass transition temperature (T g , -22°C) and good wettability. Water uptake measured as high as 229.7 ± 18.7%. These properties make cPU a good candidate material for engineering soft tissues such as the hypopharynx. In vitro and in vivo tests showed that cPU has the ability to support the growth of human hypopharyngeal fibroblasts and angiogenesis was observed around cPU after it was implanted subcutaneously in SD rats.
Preliminary Study of Image Reconstruction Algorithm on a Digital Signal Processor

DTIC Science & Technology

2014-03-01

5.2 Comparison of CPU-GPU, CPU-FPGA, and CPU-DSP Designs The work for implementing VHDL description of the back-projection algorithm on a physical...FPGA was not complete. Hence, the DSP implementation results are compared with the simulated results for the VHDL design. Simulating VHDL provides an...rather than at the software level. Depending on an application’s characteristics, FPGA implementations can provide a significant performance
Use of general purpose graphics processing units with MODFLOW

USGS Publications Warehouse

Hughes, Joseph D.; White, Jeremy T.

2013-01-01

To evaluate the use of general-purpose graphics processing units (GPGPUs) to improve the performance of MODFLOW, an unstructured preconditioned conjugate gradient (UPCG) solver has been developed. The UPCG solver uses a compressed sparse row storage scheme and includes Jacobi, zero fill-in incomplete, and modified-incomplete lower-upper (LU) factorization, and generalized least-squares polynomial preconditioners. The UPCG solver also includes options for sequential and parallel solution on the central processing unit (CPU) using OpenMP. For simulations utilizing the GPGPU, all basic linear algebra operations are performed on the GPGPU; memory copies between the central processing unit CPU and GPCPU occur prior to the first iteration of the UPCG solver and after satisfying head and flow criteria or exceeding a maximum number of iterations. The efficiency of the UPCG solver for GPGPU and CPU solutions is benchmarked using simulations of a synthetic, heterogeneous unconfined aquifer with tens of thousands to millions of active grid cells. Testing indicates GPGPU speedups on the order of 2 to 8, relative to the standard MODFLOW preconditioned conjugate gradient (PCG) solver, can be achieved when (1) memory copies between the CPU and GPGPU are optimized, (2) the percentage of time performing memory copies between the CPU and GPGPU is small relative to the calculation time, (3) high-performance GPGPU cards are utilized, and (4) CPU-GPGPU combinations are used to execute sequential operations that are difficult to parallelize. Furthermore, UPCG solver testing indicates GPGPU speedups exceed parallel CPU speedups achieved using OpenMP on multicore CPUs for preconditioners that can be easily parallelized.
Exact diagonalization of quantum lattice models on coprocessors

NASA Astrophysics Data System (ADS)

Siro, T.; Harju, A.

2016-10-01

We implement the Lanczos algorithm on an Intel Xeon Phi coprocessor and compare its performance to a multi-core Intel Xeon CPU and an NVIDIA graphics processor. The Xeon and the Xeon Phi are parallelized with OpenMP and the graphics processor is programmed with CUDA. The performance is evaluated by measuring the execution time of a single step in the Lanczos algorithm. We study two quantum lattice models with different particle numbers, and conclude that for small systems, the multi-core CPU is the fastest platform, while for large systems, the graphics processor is the clear winner, reaching speedups of up to 7.6 compared to the CPU. The Xeon Phi outperforms the CPU with sufficiently large particle number, reaching a speedup of 2.5.
Instrumentation complex for Langley Research Center's National Transonic Facility

NASA Technical Reports Server (NTRS)

Russell, C. H.; Bryant, C. S.

1977-01-01

The instrumentation discussed in the present paper was developed to ensure reliable operation for a 2.5-meter cryogenic high-Reynolds-number fan-driven transonic wind tunnel. It will incorporate four CPU's and associated analog and digital input/output equipment, necessary for acquiring research data, controlling the tunnel parameters, and monitoring the process conditions. Connected in a multipoint distributed network, the CPU's will support data base management and processing; research measurement data acquisition and display; process monitoring; and communication control. The design will allow essential processes to continue, in the case of major hardware failures, by switching input/output equipment to alternate CPU's and by eliminating nonessential functions. It will also permit software modularization by CPU activity and thereby reduce complexity and development time.

North Dakota Standards and Benchmarks--Content Standards: Library/Technology Literacy

ERIC Educational Resources Information Center

North Dakota Department of Public Instruction, 2003

2003-01-01

The Library/Technology Literacy Standards for the State of North Dakota were developed during 2000-2002 by a team of library and technology specialists, assisted by representatives from the Department of Public Instruction. The initial task was to decide whether technology and library curricula overlapped enough to create a shared set of…
From Shop to Shakespeare: Interdisciplinary Instruction at Auburn High School, Riner, Virginia.

ERIC Educational Resources Information Center

Bull, Steve; Sauter, Jerry; Harris, Kevan; Sumner, Bonnie; Jervis, Charles; Miller, Bob; Turner, Pat

This paper provides an overview of the interdisciplinary program at Auburn High School, a small high school in Riner, Virginia, and describes a recent schoolwide project to construct an Elizabethan gazebo and Shakespeare garden. To develop interdisciplinary units, the teachers begin by brainstorming ideas, looking for overlapping content. The next…
Beyond Abstinence-Only: Relationships between Abstinence Education and Comprehensive Topic Instruction

ERIC Educational Resources Information Center

Jeffries, William L., IV; Dodge, Brian; Bandiera, Frank C.; Reece, Michael

2010-01-01

In the United States, a debate exists as to whether abstinence-only or comprehensive sexuality education strategies are most beneficial for school-age youth. Despite abstinence being a fundamental component of comprehensive education, the two are often characterized as polar opposites. Few studies have examined overlaps between the approaches. The…
Academic Discussions: An Analysis of Instructional Discourse and an Argument for an Integrative Assessment Framework

ERIC Educational Resources Information Center

Elizabeth, Tracy; Ross Anderson, Trisha L.; Snow, Elana H.; Selman, Robert L.

2012-01-01

This article describes the structure of academic discussions during the implementation of a literacy curriculum in the upper elementary grades. The authors examine the quality of academic discussion, using existing discourse analysis frameworks designed to evaluate varying attributes of classroom discourse. To integrate the overlapping qualities…
Connecting the Curriculum through National Science and Mathematics Standards: A Matrix Approach.

ERIC Educational Resources Information Center

Francis, Raymond

This paper provides instructions for linking conceptual understandings using the Connections Matrix. The Connections Matrix and the process of connecting the curriculum works equally well with state-level learning objectives or outcomes. The intent of this process is to help educators see the overlap and connections between what teachers say they…
PubChem3D: Shape compatibility filtering using molecular shape quadrupoles

PubMed Central

2011-01-01

Background PubChem provides a 3-D neighboring relationship, which involves finding the maximal shape overlap between two static compound 3-D conformations, a computationally intensive step. It is highly desirable to avoid this overlap computation, especially if it can be determined with certainty that a conformer pair cannot meet the criteria to be a 3-D neighbor. As such, PubChem employs a series of pre-filters, based on the concept of volume, to remove approximately 65% of all conformer neighbor pairs prior to shape overlap optimization. Given that molecular volume, a somewhat vague concept, is rather effective, it leads one to wonder: can the existing PubChem 3-D neighboring relationship, which consists of billions of shape similar conformer pairs from tens of millions of unique small molecules, be used to identify additional shape descriptor relationships? Or, put more specifically, can one place an upper bound on shape similarity using other "fuzzy" shape-like concepts like length, width, and height? Results Using a basis set of 4.18 billion 3-D neighbor pairs identified from single conformer per compound neighboring of 17.1 million molecules, shape descriptors were computed for all conformers. These steric shape descriptors included several forms of molecular volume and shape quadrupoles, which essentially embody the length, width, and height of a conformer. For a given 3-D neighbor conformer pair, the volume and each quadrupole component (Qx, Qy, and Qz) were binned and their frequency of occurrence was examined. Per molecular volume type, this effectively produced three different maps, one per quadrupole component (Qx, Qy, and Qz), of allowed values for the similarity metric, shape Tanimoto (ST) ≥ 0.8. The efficiency of these relationships (in terms of true positive, true negative, false positive and false negative) as a function of ST threshold was determined in a test run of 13.2 billion conformer pairs not previously considered by the 3-D neighbor set. At an ST ≥ 0.8, a filtering efficiency of 40.4% of true negatives was achieved with only 32 false negatives out of 24 million true positives, when applying the separate Qx, Qy, and Qz maps in a series (Qxyz). This efficiency increased linearly as a function of ST threshold in the range 0.8-0.99. The Qx filter was consistently the most efficient followed by Qy and then by Qz. Use of a monopole volume showed the best overall performance, followed by the self-overlap volume and then by the analytic volume. Application of the monopole-based Qxyz filter in a "real world" test of 3-D neighboring of 4,218 chemicals of biomedical interest against 26.1 million molecules in PubChem reduced the total CPU cost of neighboring by between 24-38% and, if used as the initial filter, removed from consideration 48.3% of all conformer pairs at almost negligible computational overhead. Conclusion Basic shape descriptors, such as those embodied by size, length, width, and height, can be highly effective in identifying shape incompatible compound conformer pairs. When performing a 3-D search using a shape similarity cut-off, computation can be avoided by identifying conformer pairs that cannot meet the result criteria. Applying this methodology as a filter for PubChem 3-D neighboring computation, an improvement of 31% was realized, increasing the average conformer pair throughput from 154,000 to 202,000 per second per CPU core. PMID:21774809
SU-G-TeP1-15: Toward a Novel GPU Accelerated Deterministic Solution to the Linear Boltzmann Transport Equation

DOE Office of Scientific and Technical Information (OSTI.GOV)

Yang, R; Fallone, B; Cross Cancer Institute, Edmonton, AB

Purpose: To develop a Graphic Processor Unit (GPU) accelerated deterministic solution to the Linear Boltzmann Transport Equation (LBTE) for accurate dose calculations in radiotherapy (RT). A deterministic solution yields the potential for major speed improvements due to the sparse matrix-vector and vector-vector multiplications and would thus be of benefit to RT. Methods: In order to leverage the massively parallel architecture of GPUs, the first order LBTE was reformulated as a second order self-adjoint equation using the Least Squares Finite Element Method (LSFEM). This produces a symmetric positive-definite matrix which is efficiently solved using a parallelized conjugate gradient (CG) solver. Themore » LSFEM formalism is applied in space, discrete ordinates is applied in angle, and the Multigroup method is applied in energy. The final linear system of equations produced is tightly coupled in space and angle. Our code written in CUDA-C was benchmarked on an Nvidia GeForce TITAN-X GPU against an Intel i7-6700K CPU. A spatial mesh of 30,950 tetrahedral elements was used with an S4 angular approximation. Results: To avoid repeating a full computationally intensive finite element matrix assembly at each Multigroup energy, a novel mapping algorithm was developed which minimized the operations required at each energy. Additionally, a parallelized memory mapping for the kronecker product between the sparse spatial and angular matrices, including Dirichlet boundary conditions, was created. Atomicity is preserved by graph-coloring overlapping nodes into separate kernel launches. The one-time mapping calculations for matrix assembly, kronecker product, and boundary condition application took 452±1ms on GPU. Matrix assembly for 16 energy groups took 556±3s on CPU, and 358±2ms on GPU using the mappings developed. The CG solver took 93±1s on CPU, and 468±2ms on GPU. Conclusion: Three computationally intensive subroutines in deterministically solving the LBTE have been formulated on GPU, resulting in two orders of magnitude speedup. Funding support from Natural Sciences and Engineering Research Council and Alberta Innovates Health Solutions. Dr. Fallone is a co-founder and CEO of MagnetTx Oncology Solutions (under discussions to license Alberta bi-planar linac MR for commercialization).« less
A novel potential/viscous flow coupling technique for computing helicopter flow fields

NASA Technical Reports Server (NTRS)

Summa, J. Michael; Strash, Daniel J.; Yoo, Sungyul

1993-01-01

The primary objective of this work was to demonstrate the feasibility of a new potential/viscous flow coupling procedure for reducing computational effort while maintaining solution accuracy. This closed-loop, overlapped velocity-coupling concept has been developed in a new two-dimensional code, ZAP2D (Zonal Aerodynamics Program - 2D), a three-dimensional code for wing analysis, ZAP3D (Zonal Aerodynamics Program - 3D), and a three-dimensional code for isolated helicopter rotors in hover, ZAPR3D (Zonal Aerodynamics Program for Rotors - 3D). Comparisons with large domain ARC3D solutions and with experimental data for a NACA 0012 airfoil have shown that the required domain size can be reduced to a few tenths of a percent chord for the low Mach and low angle of attack cases and to less than 2-5 chords for the high Mach and high angle of attack cases while maintaining solution accuracies to within a few percent. This represents CPU time reductions by a factor of 2-4 compared with ARC2D. The current ZAP3D calculation for a rectangular plan-form wing of aspect ratio 5 with an outer domain radius of about 1.2 chords represents a speed-up in CPU time over the ARC3D large domain calculation by about a factor of 2.5 while maintaining solution accuracies to within a few percent. A ZAPR3D simulation for a two-bladed rotor in hover with a reduced grid domain of about two chord lengths was able to capture the wake effects and compared accurately with the experimental pressure data. Further development is required in order to substantiate the promise of computational improvements due to the ZAPR3D coupling concept.
Fast 3D shape screening of large chemical databases through alignment-recycling

PubMed Central

Fontaine, Fabien; Bolton, Evan; Borodina, Yulia; Bryant, Stephen H

2007-01-01

Background Large chemical databases require fast, efficient, and simple ways of looking for similar structures. Although such tasks are now fairly well resolved for graph-based similarity queries, they remain an issue for 3D approaches, particularly for those based on 3D shape overlays. Inspired by a recent technique developed to compare molecular shapes, we designed a hybrid methodology, alignment-recycling, that enables efficient retrieval and alignment of structures with similar 3D shapes. Results Using a dataset of more than one million PubChem compounds of limited size (< 28 heavy atoms) and flexibility (< 6 rotatable bonds), we obtained a set of a few thousand diverse structures covering entirely the 3D shape space of the conformers of the dataset. Transformation matrices gathered from the overlays between these diverse structures and the 3D conformer dataset allowed us to drastically (100-fold) reduce the CPU time required for shape overlay. The alignment-recycling heuristic produces results consistent with de novo alignment calculation, with better than 80% hit list overlap on average. Conclusion Overlay-based 3D methods are computationally demanding when searching large databases. Alignment-recycling reduces the CPU time to perform shape similarity searches by breaking the alignment problem into three steps: selection of diverse shapes to describe the database shape-space; overlay of the database conformers to the diverse shapes; and non-optimized overlay of query and database conformers using common reference shapes. The precomputation, required by the first two steps, is a significant cost of the method; however, once performed, querying is two orders of magnitude faster. Extensions and variations of this methodology, for example, to handle more flexible and larger small-molecules are discussed. PMID:17880744
A parallel algorithm for the initial screening of space debris collisions prediction using the SGP4/SDP4 models and GPU acceleration

NASA Astrophysics Data System (ADS)

Lin, Mingpei; Xu, Ming; Fu, Xiaoyu

2017-05-01

Currently, a tremendous amount of space debris in Earth's orbit imperils operational spacecraft. It is essential to undertake risk assessments of collisions and predict dangerous encounters in space. However, collision predictions for an enormous amount of space debris give rise to large-scale computations. In this paper, a parallel algorithm is established on the Compute Unified Device Architecture (CUDA) platform of NVIDIA Corporation for collision prediction. According to the parallel structure of NVIDIA graphics processors, a block decomposition strategy is adopted in the algorithm. Space debris is divided into batches, and the computation and data transfer operations of adjacent batches overlap. As a consequence, the latency to access shared memory during the entire computing process is significantly reduced, and a higher computing speed is reached. Theoretically, a simulation of collision prediction for space debris of any amount and for any time span can be executed. To verify this algorithm, a simulation example including 1382 pieces of debris, whose operational time scales vary from 1 min to 3 days, is conducted on Tesla C2075 of NVIDIA. The simulation results demonstrate that with the same computational accuracy as that of a CPU, the computing speed of the parallel algorithm on a GPU is 30 times that on a CPU. Based on this algorithm, collision prediction of over 150 Chinese spacecraft for a time span of 3 days can be completed in less than 3 h on a single computer, which meets the timeliness requirement of the initial screening task. Furthermore, the algorithm can be adapted for multiple tasks, including particle filtration, constellation design, and Monte-Carlo simulation of an orbital computation.
Study on efficiency of time computation in x-ray imaging simulation base on Monte Carlo algorithm using graphics processing unit

DOE Office of Scientific and Technical Information (OSTI.GOV)

Setiani, Tia Dwi, E-mail: tiadwisetiani@gmail.com; Suprijadi; Nuclear Physics and Biophysics Reaserch Division, Faculty of Mathematics and Natural Sciences, Institut Teknologi Bandung Jalan Ganesha 10 Bandung, 40132

Monte Carlo (MC) is one of the powerful techniques for simulation in x-ray imaging. MC method can simulate the radiation transport within matter with high accuracy and provides a natural way to simulate radiation transport in complex systems. One of the codes based on MC algorithm that are widely used for radiographic images simulation is MC-GPU, a codes developed by Andrea Basal. This study was aimed to investigate the time computation of x-ray imaging simulation in GPU (Graphics Processing Unit) compared to a standard CPU (Central Processing Unit). Furthermore, the effect of physical parameters to the quality of radiographic imagesmore » and the comparison of image quality resulted from simulation in the GPU and CPU are evaluated in this paper. The simulations were run in CPU which was simulated in serial condition, and in two GPU with 384 cores and 2304 cores. In simulation using GPU, each cores calculates one photon, so, a large number of photon were calculated simultaneously. Results show that the time simulations on GPU were significantly accelerated compared to CPU. The simulations on the 2304 core of GPU were performed about 64 -114 times faster than on CPU, while the simulation on the 384 core of GPU were performed about 20 – 31 times faster than in a single core of CPU. Another result shows that optimum quality of images from the simulation was gained at the history start from 10{sup 8} and the energy from 60 Kev to 90 Kev. Analyzed by statistical approach, the quality of GPU and CPU images are relatively the same.« less
A Hybrid CPU/GPU Pattern-Matching Algorithm for Deep Packet Inspection

PubMed Central

Chen, Yaw-Chung

2015-01-01

The large quantities of data now being transferred via high-speed networks have made deep packet inspection indispensable for security purposes. Scalable and low-cost signature-based network intrusion detection systems have been developed for deep packet inspection for various software platforms. Traditional approaches that only involve central processing units (CPUs) are now considered inadequate in terms of inspection speed. Graphic processing units (GPUs) have superior parallel processing power, but transmission bottlenecks can reduce optimal GPU efficiency. In this paper we describe our proposal for a hybrid CPU/GPU pattern-matching algorithm (HPMA) that divides and distributes the packet-inspecting workload between a CPU and GPU. All packets are initially inspected by the CPU and filtered using a simple pre-filtering algorithm, and packets that might contain malicious content are sent to the GPU for further inspection. Test results indicate that in terms of random payload traffic, the matching speed of our proposed algorithm was 3.4 times and 2.7 times faster than those of the AC-CPU and AC-GPU algorithms, respectively. Further, HPMA achieved higher energy efficiency than the other tested algorithms. PMID:26437335
A Hybrid CPU/GPU Pattern-Matching Algorithm for Deep Packet Inspection.

PubMed

Lee, Chun-Liang; Lin, Yi-Shan; Chen, Yaw-Chung

2015-01-01

The large quantities of data now being transferred via high-speed networks have made deep packet inspection indispensable for security purposes. Scalable and low-cost signature-based network intrusion detection systems have been developed for deep packet inspection for various software platforms. Traditional approaches that only involve central processing units (CPUs) are now considered inadequate in terms of inspection speed. Graphic processing units (GPUs) have superior parallel processing power, but transmission bottlenecks can reduce optimal GPU efficiency. In this paper we describe our proposal for a hybrid CPU/GPU pattern-matching algorithm (HPMA) that divides and distributes the packet-inspecting workload between a CPU and GPU. All packets are initially inspected by the CPU and filtered using a simple pre-filtering algorithm, and packets that might contain malicious content are sent to the GPU for further inspection. Test results indicate that in terms of random payload traffic, the matching speed of our proposed algorithm was 3.4 times and 2.7 times faster than those of the AC-CPU and AC-GPU algorithms, respectively. Further, HPMA achieved higher energy efficiency than the other tested algorithms.
Performance of the OVERFLOW-MLP and LAURA-MLP CFD Codes on the NASA Ames 512 CPU Origin System

NASA Technical Reports Server (NTRS)

Taft, James R.

2000-01-01

The shared memory Multi-Level Parallelism (MLP) technique, developed last year at NASA Ames has been very successful in dramatically improving the performance of important NASA CFD codes. This new and very simple parallel programming technique was first inserted into the OVERFLOW production CFD code in FY 1998. The OVERFLOW-MLP code's parallel performance scaled linearly to 256 CPUs on the NASA Ames 256 CPU Origin 2000 system (steger). Overall performance exceeded 20.1 GFLOP/s, or about 4.5x the performance of a dedicated 16 CPU C90 system. All of this was achieved without any major modification to the original vector based code. The OVERFLOW-MLP code is now in production on the inhouse Origin systems as well as being used offsite at commercial aerospace companies. Partially as a result of this work, NASA Ames has purchased a new 512 CPU Origin 2000 system to further test the limits of parallel performance for NASA codes of interest. This paper presents the performance obtained from the latest optimization efforts on this machine for the LAURA-MLP and OVERFLOW-MLP codes. The Langley Aerothermodynamics Upwind Relaxation Algorithm (LAURA) code is a key simulation tool in the development of the next generation shuttle, interplanetary reentry vehicles, and nearly all "X" plane development. This code sustains about 4-5 GFLOP/s on a dedicated 16 CPU C90. At this rate, expected workloads would require over 100 C90 CPU years of computing over the next few calendar years. It is not feasible to expect that this would be affordable or available to the user community. Dramatic performance gains on cheaper systems are needed. This code is expected to be perhaps the largest consumer of NASA Ames compute cycles per run in the coming year.The OVERFLOW CFD code is extensively used in the government and commercial aerospace communities to evaluate new aircraft designs. It is one of the largest consumers of NASA supercomputing cycles and large simulations of highly resolved full aircraft are routinely undertaken. Typical large problems might require 100s of Cray C90 CPU hours to complete. The dramatic performance gains with the 256 CPU steger system are exciting. Obtaining results in hours instead of months is revolutionizing the way in which aircraft manufacturers are looking at future aircraft simulation work. Figure 2 below is a current state of the art plot of OVERFLOW-MLP performance on the 512 CPU Lomax system. As can be seen, the chart indicates that OVERFLOW-MLP continues to scale linearly with CPU count up to 512 CPUs on a large 35 million point full aircraft RANS simulation. At this point performance is such that a fully converged simulation of 2500 time steps is completed in less than 2 hours of elapsed time. Further work over the next few weeks will improve the performance of this code even further.The LAURA code has been converted to the MLP format as well. This code is currently being optimized for the 512 CPU system. Performance statistics indicate that the goal of 100 GFLOP/s will be achieved by year's end. This amounts to 20x the 16 CPU C90 result and strongly demonstrates the viability of the new parallel systems rapidly solving very large simulations in a production environment.
User's Manual for FOMOCO Utilities-Force and Moment Computation Tools for Overset Grids

NASA Technical Reports Server (NTRS)

Chan, William M.; Buning, Pieter G.

1996-01-01

In the numerical computations of flows around complex configurations, accurate calculations of force and moment coefficients for aerodynamic surfaces are required. When overset grid methods are used, the surfaces on which force and moment coefficients are sought typically consist of a collection of overlapping surface grids. Direct integration of flow quantities on the overlapping grids would result in the overlapped regions being counted more than once. The FOMOCO Utilities is a software package for computing flow coefficients (force, moment, and mass flow rate) on a collection of overset surfaces with accurate accounting of the overlapped zones. FOMOCO Utilities can be used in stand-alone mode or in conjunction with the Chimera overset grid compressible Navier-Stokes flow solver OVERFLOW. The software package consists of two modules corresponding to a two-step procedure: (1) hybrid surface grid generation (MIXSUR module), and (2) flow quantities integration (OVERINT module). Instructions on how to use this software package are described in this user's manual. Equations used in the flow coefficients calculation are given in Appendix A.
Ground Shock Effects from Accidental Explosions

DTIC Science & Technology

1976-11-01

1,200 P0 A = V P cp 8 Horizontal Dh = Dv tannin " 1 (cp/U)] Vh = Vv tan [sin" 1 (cp/U)] \\ - \\ tanfainŕ (cp/U)] For tan sin (c /U...explosive are not included in the present analysis . This effect will limit the credibility of the direct- induced ground shock predictions, but if the... analysis . Dr. D. R. Richmond of Lovelace Foundation provided data on human shock tolerances. 26 REFERENCES 1. "Structures to Resist the Effects of
Spectrum Savings from High Performance Recording and Playback Onboard the Test Article

DTIC Science & Technology

2013-02-20

execute within a Windows 7 environment, and data is recorded on SSDs. The underlying database is implemented using MySQL . Figure 1 illustrates the... MySQL database. This is effectively the time at which the recorded data are available for retransmission. CPU and Memory utilization were collected...17.7% MySQL avg. 3.9% EQDR Total avg. 21.6% Table 1 CPU Utilization with260 Mbits/sec Load The difference between the total System CPU (27.8
Corporate Sector Practice Informs Online Workforce Training for Australian Government Agencies: Towards Effective Educational-Learning Systems Design

ERIC Educational Resources Information Center

McKay, Elspeth; Vilela, Cenie

2011-01-01

The purpose of this paper is to outline government online training practice. We searched individual research domains of the human-dimensions of Human Computer Interaction (HCI), information and communications technologies (ICT) and instructional design for evidence of either corporate sector or government training practices. We overlapped these…
Effectiveness of Selected Supplemental Reading Comprehension Interventions: Impacts on a First Cohort of Fifth-Grade Students. NCEE 2009-4032

ERIC Educational Resources Information Center

James-Burdumy, Susanne; Mansfield, Wendy; Deke, John; Carey, Nancy; Lugo-Gil, Julieta; Hershey, Alan; Douglas, Aaron; Gersten, Russell; Newman-Gonchar, Rebecca; Dimino, Joseph; Faddis, Bonnie

2009-01-01

This document reports on the impacts on student achievement for four supplemental reading curricula that use similar overlapping instructional strategies designed to improve reading comprehension in social studies and science text. Fifth-grade reading comprehension for each of three commercially-available curricula (Project CRISS, ReadAbout, and…
Integrating Internet-Based Learning in an Educational System: A Systemic Approach.

ERIC Educational Resources Information Center

Jones, Marshall G.; Harmon, Stephen W.

The purpose of this paper is to identify the various components of the educational system in higher education, to illustrate how and where they interact, overlap, and come together so that it may be better understood how Web-based instruction (WBI) may impact higher education. A definition of an educational system is given, and three principles…

Curriculum-Based Measurement of Oral Reading: A Preliminary Investigation of Confidence Interval Overlap to Detect Reliable Growth

ERIC Educational Resources Information Center

Van Norman, Ethan R.

2016-01-01

Curriculum-based measurement of oral reading (CBM-R) progress monitoring data is used to measure student response to instruction. Federal legislation permits educators to use CBM-R progress monitoring data as a basis for determining the presence of specific learning disabilities. However, decision making frameworks originally developed for CBM-R…
Novel hybrid GPU-CPU implementation of parallelized Monte Carlo parametric expectation maximization estimation method for population pharmacokinetic data analysis.

PubMed

Ng, C M

2013-10-01

The development of a population PK/PD model, an essential component for model-based drug development, is both time- and labor-intensive. A graphical-processing unit (GPU) computing technology has been proposed and used to accelerate many scientific computations. The objective of this study was to develop a hybrid GPU-CPU implementation of parallelized Monte Carlo parametric expectation maximization (MCPEM) estimation algorithm for population PK data analysis. A hybrid GPU-CPU implementation of the MCPEM algorithm (MCPEMGPU) and identical algorithm that is designed for the single CPU (MCPEMCPU) were developed using MATLAB in a single computer equipped with dual Xeon 6-Core E5690 CPU and a NVIDIA Tesla C2070 GPU parallel computing card that contained 448 stream processors. Two different PK models with rich/sparse sampling design schemes were used to simulate population data in assessing the performance of MCPEMCPU and MCPEMGPU. Results were analyzed by comparing the parameter estimation and model computation times. Speedup factor was used to assess the relative benefit of parallelized MCPEMGPU over MCPEMCPU in shortening model computation time. The MCPEMGPU consistently achieved shorter computation time than the MCPEMCPU and can offer more than 48-fold speedup using a single GPU card. The novel hybrid GPU-CPU implementation of parallelized MCPEM algorithm developed in this study holds a great promise in serving as the core for the next-generation of modeling software for population PK/PD analysis.
Dense GPU-enhanced surface reconstruction from stereo endoscopic images for intraoperative registration.

PubMed

Rohl, Sebastian; Bodenstedt, Sebastian; Suwelack, Stefan; Dillmann, Rudiger; Speidel, Stefanie; Kenngott, Hannes; Muller-Stich, Beat P

2012-03-01

In laparoscopic surgery, soft tissue deformations substantially change the surgical site, thus impeding the use of preoperative planning during intraoperative navigation. Extracting depth information from endoscopic images and building a surface model of the surgical field-of-view is one way to represent this constantly deforming environment. The information can then be used for intraoperative registration. Stereo reconstruction is a typical problem within computer vision. However, most of the available methods do not fulfill the specific requirements in a minimally invasive setting such as the need of real-time performance, the problem of view-dependent specular reflections and large curved areas with partly homogeneous or periodic textures and occlusions. In this paper, the authors present an approach toward intraoperative surface reconstruction based on stereo endoscopic images. The authors describe our answer to this problem through correspondence analysis, disparity correction and refinement, 3D reconstruction, point cloud smoothing and meshing. Real-time performance is achieved by implementing the algorithms on the gpu. The authors also present a new hybrid cpu-gpu algorithm that unifies the advantages of the cpu and the gpu version. In a comprehensive evaluation using in vivo data, in silico data from the literature and virtual data from a newly developed simulation environment, the cpu, the gpu, and the hybrid cpu-gpu versions of the surface reconstruction are compared to a cpu and a gpu algorithm from the literature. The recommended approach toward intraoperative surface reconstruction can be conducted in real-time depending on the image resolution (20 fps for the gpu and 14fps for the hybrid cpu-gpu version on resolution of 640 × 480). It is robust to homogeneous regions without texture, large image changes, noise or errors from camera calibration, and it reconstructs the surface down to sub millimeter accuracy. In all the experiments within the simulation environment, the mean distance to ground truth data is between 0.05 and 0.6 mm for the hybrid cpu-gpu version. The hybrid cpu-gpu algorithm shows a much more superior performance than its cpu and gpu counterpart (mean distance reduction 26% and 45%, respectively, for the experiments in the simulation environment). The recommended approach for surface reconstruction is fast, robust, and accurate. It can represent changes in the intraoperative environment and can be used to adapt a preoperative model within the surgical site by registration of these two models.
Polydrug use among college students in Brazil: a nationwide survey.

PubMed

Oliveira, Lúcio Garcia de; Alberghini, Denis Guilherme; Santos, Bernardo dos; Andrade, Arthur Guerra de

2013-01-01

To estimate the frequency of polydrug use (alcohol and illicit drugs) among college students and its associations with gender and age group. A nationwide sample of 12,544 college students was asked to complete a questionnaire on their use of drugs according to three time parameters (lifetime, past 12 months, and last 30 days). The co-use of drugs was investigated as concurrent polydrug use (CPU) and simultaneous polydrug use (SPU), a subcategory of CPU that involves the use of drugs at the same time or in close temporal proximity. Almost 26% of college students reported having engaged in CPU in the past 12 months. Among these students, 37% had engaged in SPU. In the past 30 days, 17% college students had engaged in CPU. Among these, 35% had engaged in SPU. Marijuana was the illicit drug mostly frequently used with alcohol (either as CPU or SPU), especially among males. Among females, the most commonly reported combination was alcohol and prescribed medications. A high proportion of Brazilian college students may be engaging in polydrug use. College administrators should keep themselves informed to be able to identify such use and to develop educational interventions to prevent such behavior.
Measurements of neuron soma size and density in rat dorsal striatum, nucleus accumbens core and nucleus accumbens shell: differences between striatal region and brain hemisphere, but not sex.

PubMed

Meitzen, John; Pflepsen, Kelsey R; Stern, Christopher M; Meisel, Robert L; Mermelstein, Paul G

2011-01-07

Both hemispheric bias and sex differences exist in striatal-mediated behaviors and pathologies. The extent to which these dimorphisms can be attributed to an underlying neuroanatomical difference is unclear. We therefore quantified neuron soma size and density in the dorsal striatum (CPu) as well as the core (AcbC) and shell (AcbS) subregions of the nucleus accumbens to determine whether these anatomical measurements differ by region, hemisphere, or sex in adult Sprague-Dawley rats. Neuron soma size was larger in the CPu than the AcbC or AcbS. Neuron density was greatest in the AcbS, intermediate in the AcbC, and least dense in the CPu. CPu neuron density was greater in the left in comparison to the right hemisphere. No attribute was sexually dimorphic. These results provide the first evidence that hemispheric bias in the striatum and striatal-mediated behaviors can be attributed to a lateralization in neuronal density within the CPu. In contrast, sexual dimorphisms appear mediated by factors other than gross anatomical differences. Copyright © 2010 Elsevier Ireland Ltd. All rights reserved.
48 CFR 252.204-7011 - Alternative Line Item Structure.

Code of Federal Regulations, 2011 CFR

2011-10-01

... Unit Unit price Amount 0001 Computer, Desktop with CPU, Monitor, Keyboard and Mouse 20 EA Alternative... Unit Unit Price Amount 0001 Computer, Desktop with CPU, Keyboard and Mouse 20 EA 0002 Monitor 20 EA...
48 CFR 252.204-7011 - Alternative Line Item Structure.

Code of Federal Regulations, 2014 CFR

2014-10-01

... Unit Unit price Amount 0001 Computer, Desktop with CPU, Monitor, Keyboard and Mouse 20 EA Alternative... Unit Unit Price Amount 0001 Computer, Desktop with CPU, Keyboard and Mouse 20 EA 0002 Monitor 20 EA...
48 CFR 252.204-7011 - Alternative Line Item Structure.

Code of Federal Regulations, 2012 CFR

2012-10-01

... Unit Unit price Amount 0001 Computer, Desktop with CPU, Monitor, Keyboard and Mouse 20 EA Alternative... Unit Unit Price Amount 0001 Computer, Desktop with CPU, Keyboard and Mouse 20 EA 0002 Monitor 20 EA...
48 CFR 252.204-7011 - Alternative Line Item Structure.

Code of Federal Regulations, 2013 CFR

2013-10-01

... Unit Unit price Amount 0001 Computer, Desktop with CPU, Monitor, Keyboard and Mouse 20 EA Alternative... Unit Unit Price Amount 0001 Computer, Desktop with CPU, Keyboard and Mouse 20 EA 0002 Monitor 20 EA...
Hybrid Computational Architecture for Multi-Scale Modeling of Materials and Devices

DTIC Science & Technology

2016-01-03

Equivalent: Total Number: Sub Contractors (DD882) Names of Faculty Supported Names of Under Graduate students supported Names of Personnel receiving masters...GHz, 20 cores (40 with hyper-threading ( HT )) Single node performance Node # of cores Total CPU time User CPU time System CPU time Elapsed time...INTEL20 40 (with HT ) 534.785 529.984 4.800 541.179 20 468.873 466.119 2.754 476.878 10 671.798 669.653 2.145 680.510 8 772.269 770.256 2.013
From Physics Model to Results: An Optimizing Framework for Cross-Architecture Code Generation

DOE PAGES

Blazewicz, Marek; Hinder, Ian; Koppelman, David M.; ...

2013-01-01

Starting from a high-level problem description in terms of partial differential equations using abstract tensor notation, the Chemora framework discretizes, optimizes, and generates complete high performance codes for a wide range of compute architectures. Chemora extends the capabilities of Cactus, facilitating the usage of large-scale CPU/GPU systems in an efficient manner for complex applications, without low-level code tuning. Chemora achieves parallelism through MPI and multi-threading, combining OpenMP and CUDA. Optimizations include high-level code transformations, efficient loop traversal strategies, dynamically selected data and instruction cache usage strategies, and JIT compilation of GPU code tailored to the problem characteristics. The discretization ismore » based on higher-order finite differences on multi-block domains. Chemora's capabilities are demonstrated by simulations of black hole collisions. This problem provides an acid test of the framework, as the Einstein equations contain hundreds of variables and thousands of terms.« less
A GaAs vector processor based on parallel RISC microprocessors

NASA Astrophysics Data System (ADS)

Misko, Tim A.; Rasset, Terry L.

A vector processor architecture based on the development of a 32-bit microprocessor using gallium arsenide (GaAs) technology has been developed. The McDonnell Douglas vector processor (MVP) will be fabricated completely from GaAs digital integrated circuits. The MVP architecture includes a vector memory of 1 megabyte, a parallel bus architecture with eight processing elements connected in parallel, and a control processor. The processing elements consist of a reduced instruction set CPU (RISC) with four floating-point coprocessor units and necessary memory interface functions. This architecture has been simulated for several benchmark programs including complex fast Fourier transform (FFT), complex inner product, trigonometric functions, and sort-merge routine. The results of this study indicate that the MVP can process a 1024-point complex FFT at a speed of 112 microsec (389 megaflops) while consuming approximately 618 W of power in a volume of approximately 0.1 ft-cubed.
Fibered fluorescence microscopy (FFM) of intra epidermal nerve fibers--translational marker for peripheral neuropathies in preclinical research: processing and analysis of the data

NASA Astrophysics Data System (ADS)

Cornelissen, Frans; De Backer, Steve; Lemeire, Jan; Torfs, Berf; Nuydens, Rony; Meert, Theo; Schelkens, Peter; Scheunders, Paul

2008-08-01

Peripheral neuropathy can be caused by diabetes or AIDS or be a side-effect of chemotherapy. Fibered Fluorescence Microscopy (FFM) is a recently developed imaging modality using a fiber optic probe connected to a laser scanning unit. It allows for in-vivo scanning of small animal subjects by moving the probe along the tissue surface. In preclinical research, FFM enables non-invasive, longitudinal in vivo assessment of intra epidermal nerve fibre density in various models for peripheral neuropathies. By moving the probe, FFM allows visualization of larger surfaces, since, during the movement, images are continuously captured, allowing to acquire an area larger then the field of view of the probe. For analysis purposes, we need to obtain a single static image from the multiple overlapping frames. We introduce a mosaicing procedure for this kind of video sequence. Construction of mosaic images with sub-pixel alignment is indispensable and must be integrated into a global consistent image aligning. An additional motivation for the mosaicing is the use of overlapping redundant information to improve the signal to noise ratio of the acquisition, because the individual frames tend to have both high noise levels and intensity inhomogeneities. For longitudinal analysis, mosaics captured at different times must be aligned as well. For alignment, global correlation-based matching is compared with interest point matching. Use of algorithms working on multiple CPU's (parallel processor/cluster/grid) is imperative for use in a screening model.
GPU acceleration of a petascale application for turbulent mixing at high Schmidt number using OpenMP 4.5

NASA Astrophysics Data System (ADS)

Clay, M. P.; Buaria, D.; Yeung, P. K.; Gotoh, T.

2018-07-01

This paper reports on the successful implementation of a massively parallel GPU-accelerated algorithm for the direct numerical simulation of turbulent mixing at high Schmidt number. The work stems from a recent development (Comput. Phys. Commun., vol. 219, 2017, 313-328), in which a low-communication algorithm was shown to attain high degrees of scalability on the Cray XE6 architecture when overlapping communication and computation via dedicated communication threads. An even higher level of performance has now been achieved using OpenMP 4.5 on the Cray XK7 architecture, where on each node the 16 integer cores of an AMD Interlagos processor share a single Nvidia K20X GPU accelerator. In the new algorithm, data movements are minimized by performing virtually all of the intensive scalar field computations in the form of combined compact finite difference (CCD) operations on the GPUs. A memory layout in departure from usual practices is found to provide much better performance for a specific kernel required to apply the CCD scheme. Asynchronous execution enabled by adding the OpenMP 4.5 NOWAIT clause to TARGET constructs improves scalability when used to overlap computation on the GPUs with computation and communication on the CPUs. On the 27-petaflops supercomputer Titan at Oak Ridge National Laboratory, USA, a GPU-to-CPU speedup factor of approximately 5 is consistently observed at the largest problem size of 81923 grid points for the scalar field computed with 8192 XK7 nodes.
Accelerating Large Scale Image Analyses on Parallel, CPU-GPU Equipped Systems

PubMed Central

Teodoro, George; Kurc, Tahsin M.; Pan, Tony; Cooper, Lee A.D.; Kong, Jun; Widener, Patrick; Saltz, Joel H.

2014-01-01

The past decade has witnessed a major paradigm shift in high performance computing with the introduction of accelerators as general purpose processors. These computing devices make available very high parallel computing power at low cost and power consumption, transforming current high performance platforms into heterogeneous CPU-GPU equipped systems. Although the theoretical performance achieved by these hybrid systems is impressive, taking practical advantage of this computing power remains a very challenging problem. Most applications are still deployed to either GPU or CPU, leaving the other resource under- or un-utilized. In this paper, we propose, implement, and evaluate a performance aware scheduling technique along with optimizations to make efficient collaborative use of CPUs and GPUs on a parallel system. In the context of feature computations in large scale image analysis applications, our evaluations show that intelligently co-scheduling CPUs and GPUs can significantly improve performance over GPU-only or multi-core CPU-only approaches. PMID:25419545
File Usage Analysis and Resource Usage Prediction: a Measurement-Based Study. Ph.D. Thesis

NASA Technical Reports Server (NTRS)

Devarakonda, Murthy V.-S.

1987-01-01

A probabilistic scheme was developed to predict process resource usage in UNIX. Given the identity of the program being run, the scheme predicts CPU time, file I/O, and memory requirements of a process at the beginning of its life. The scheme uses a state-transition model of the program's resource usage in its past executions for prediction. The states of the model are the resource regions obtained from an off-line cluster analysis of processes run on the system. The proposed method is shown to work on data collected from a VAX 11/780 running 4.3 BSD UNIX. The results show that the predicted values correlate well with the actual. The coefficient of correlation between the predicted and actual values of CPU time is 0.84. Errors in prediction are mostly small. Some 82% of errors in CPU time prediction are less than 0.5 standard deviations of process CPU time.
Predictability of process resource usage - A measurement-based study on UNIX

NASA Technical Reports Server (NTRS)

Devarakonda, Murthy V.; Iyer, Ravishankar K.

1989-01-01

A probabilistic scheme is developed to predict process resource usage in UNIX. Given the identity of the program being run, the scheme predicts CPU time, file I/O, and memory requirements of a process at the beginning of its life. The scheme uses a state-transition model of the program's resource usage in its past executions for prediction. The states of the model are the resource regions obtained from an off-line cluster analysis of processes run on the system. The proposed method is shown to work on data collected from a VAX 11/780 running 4.3 BSD UNIX. The results show that the predicted values correlate well with the actual. The correlation coefficient betweeen the predicted and actual values of CPU time is 0.84. Errors in prediction are mostly small. Some 82 percent of errors in CPU time prediction are less than 0.5 standard deviations of process CPU time.
Predictability of process resource usage: A measurement-based study of UNIX

NASA Technical Reports Server (NTRS)

Devarakonda, Murthy V.; Iyer, Ravishankar K.

1987-01-01

A probabilistic scheme is developed to predict process resource usage in UNIX. Given the identity of the program being run, the scheme predicts CPU time, file I/O, and memory requirements of a process at the beginning of its life. The scheme uses a state-transition model of the program's resource usage in its past executions for prediction. The states of the model are the resource regions obtained from an off-line cluster analysis of processes run on the system. The proposed method is shown to work on data collected from a VAX 11/780 running 4.3 BSD UNIX. The results show that the predicted values correlate well with the actual. The correlation coefficient between the predicted and actual values of CPU time is 0.84. Errors in prediction are mostly small. Some 82% of errors in CPU time prediction are less than 0.5 standard deviations of process CPU time.
Career and Technical Education (CTE) Directors' Experiences with CTE's Contributions to Science, Technology, Engineering, and Math (STEM) Education Implementation

ERIC Educational Resources Information Center

Nkhata, Bentry

2013-01-01

In spite of the large overlap in the goals of CTE and STEM education, there is little evidence of the role(s) CTE delivery systems, programs, curricula, or pedagogical strategies can play in advancing STEM education. Because of their responsibilities, especially for organizational and instructional leadership, school district CTE directors could…
A Spiking Neural Simulator Integrating Event-Driven and Time-Driven Computation Schemes Using Parallel CPU-GPU Co-Processing: A Case Study.

PubMed

Naveros, Francisco; Luque, Niceto R; Garrido, Jesús A; Carrillo, Richard R; Anguita, Mancia; Ros, Eduardo

2015-07-01

Time-driven simulation methods in traditional CPU architectures perform well and precisely when simulating small-scale spiking neural networks. Nevertheless, they still have drawbacks when simulating large-scale systems. Conversely, event-driven simulation methods in CPUs and time-driven simulation methods in graphic processing units (GPUs) can outperform CPU time-driven methods under certain conditions. With this performance improvement in mind, we have developed an event-and-time-driven spiking neural network simulator suitable for a hybrid CPU-GPU platform. Our neural simulator is able to efficiently simulate bio-inspired spiking neural networks consisting of different neural models, which can be distributed heterogeneously in both small layers and large layers or subsystems. For the sake of efficiency, the low-activity parts of the neural network can be simulated in CPU using event-driven methods while the high-activity subsystems can be simulated in either CPU (a few neurons) or GPU (thousands or millions of neurons) using time-driven methods. In this brief, we have undertaken a comparative study of these different simulation methods. For benchmarking the different simulation methods and platforms, we have used a cerebellar-inspired neural-network model consisting of a very dense granular layer and a Purkinje layer with a smaller number of cells (according to biological ratios). Thus, this cerebellar-like network includes a dense diverging neural layer (increasing the dimensionality of its internal representation and sparse coding) and a converging neural layer (integration) similar to many other biologically inspired and also artificial neural networks.

Efficient methods for implementation of multi-level nonrigid mass-preserving image registration on GPUs and multi-threaded CPUs.

PubMed

Ellingwood, Nathan D; Yin, Youbing; Smith, Matthew; Lin, Ching-Long

2016-04-01

Faster and more accurate methods for registration of images are important for research involved in conducting population-based studies that utilize medical imaging, as well as improvements for use in clinical applications. We present a novel computation- and memory-efficient multi-level method on graphics processing units (GPU) for performing registration of two computed tomography (CT) volumetric lung images. We developed a computation- and memory-efficient Diffeomorphic Multi-level B-Spline Transform Composite (DMTC) method to implement nonrigid mass-preserving registration of two CT lung images on GPU. The framework consists of a hierarchy of B-Spline control grids of increasing resolution. A similarity criterion known as the sum of squared tissue volume difference (SSTVD) was adopted to preserve lung tissue mass. The use of SSTVD consists of the calculation of the tissue volume, the Jacobian, and their derivatives, which makes its implementation on GPU challenging due to memory constraints. The use of the DMTC method enabled reduced computation and memory storage of variables with minimal communication between GPU and Central Processing Unit (CPU) due to ability to pre-compute values. The method was assessed on six healthy human subjects. Resultant GPU-generated displacement fields were compared against the previously validated CPU counterpart fields, showing good agreement with an average normalized root mean square error (nRMS) of 0.044±0.015. Runtime and performance speedup are compared between single-threaded CPU, multi-threaded CPU, and GPU algorithms. Best performance speedup occurs at the highest resolution in the GPU implementation for the SSTVD cost and cost gradient computations, with a speedup of 112 times that of the single-threaded CPU version and 11 times over the twelve-threaded version when considering average time per iteration using a Nvidia Tesla K20X GPU. The proposed GPU-based DMTC method outperforms its multi-threaded CPU version in terms of runtime. Total registration time reduced runtime to 2.9min on the GPU version, compared to 12.8min on twelve-threaded CPU version and 112.5min on a single-threaded CPU. Furthermore, the GPU implementation discussed in this work can be adapted for use of other cost functions that require calculation of the first derivatives. Copyright © 2015 Elsevier Ireland Ltd. All rights reserved.
GPU based contouring method on grid DEM data

NASA Astrophysics Data System (ADS)

Tan, Liheng; Wan, Gang; Li, Feng; Chen, Xiaohui; Du, Wenlong

2017-08-01

This paper presents a novel method to generate contour lines from grid DEM data based on the programmable GPU pipeline. The previous contouring approaches often use CPU to construct a finite element mesh from the raw DEM data, and then extract contour segments from the elements. They also need a tracing or sorting strategy to generate the final continuous contours. These approaches can be heavily CPU-costing and time-consuming. Meanwhile the generated contours would be unsmooth if the raw data is sparsely distributed. Unlike the CPU approaches, we employ the GPU's vertex shader to generate a triangular mesh with arbitrary user-defined density, in which the height of each vertex is calculated through a third-order Cardinal spline function. Then in the same frame, segments are extracted from the triangles by the geometry shader, and translated to the CPU-side with an internal order in the GPU's transform feedback stage. Finally we propose a "Grid Sorting" algorithm to achieve the continuous contour lines by travelling the segments only once. Our method makes use of multiple stages of GPU pipeline for computation, which can generate smooth contour lines, and is significantly faster than the previous CPU approaches. The algorithm can be easily implemented with OpenGL 3.3 API or higher on consumer-level PCs.
Massive parallelization of a 3D finite difference electromagnetic forward solution using domain decomposition methods on multiple CUDA enabled GPUs

NASA Astrophysics Data System (ADS)

Schultz, A.

2010-12-01

3D forward solvers lie at the core of inverse formulations used to image the variation of electrical conductivity within the Earth's interior. This property is associated with variations in temperature, composition, phase, presence of volatiles, and in specific settings, the presence of groundwater, geothermal resources, oil/gas or minerals. The high cost of 3D solutions has been a stumbling block to wider adoption of 3D methods. Parallel algorithms for modeling frequency domain 3D EM problems have not achieved wide scale adoption, with emphasis on fairly coarse grained parallelism using MPI and similar approaches. The communications bandwidth as well as the latency required to send and receive network communication packets is a limiting factor in implementing fine grained parallel strategies, inhibiting wide adoption of these algorithms. Leading Graphics Processor Unit (GPU) companies now produce GPUs with hundreds of GPU processor cores per die. The footprint, in silicon, of the GPU's restricted instruction set is much smaller than the general purpose instruction set required of a CPU. Consequently, the density of processor cores on a GPU can be much greater than on a CPU. GPUs also have local memory, registers and high speed communication with host CPUs, usually through PCIe type interconnects. The extremely low cost and high computational power of GPUs provides the EM geophysics community with an opportunity to achieve fine grained (i.e. massive) parallelization of codes on low cost hardware. The current generation of GPUs (e.g. NVidia Fermi) provides 3 billion transistors per chip die, with nearly 500 processor cores and up to 6 GB of fast (DDR5) GPU memory. This latest generation of GPU supports fast hardware double precision (64 bit) floating point operations of the type required for frequency domain EM forward solutions. Each Fermi GPU board can sustain nearly 1 TFLOP in double precision, and multiple boards can be installed in the host computer system. We describe our ongoing efforts to achieve massive parallelization on a novel hybrid GPU testbed machine currently configured with 12 Intel Westmere Xeon CPU cores (or 24 parallel computational threads) with 96 GB DDR3 system memory, 4 GPU subsystems which in aggregate contain 960 NVidia Tesla GPU cores with 16 GB dedicated DDR3 GPU memory, and a second interleved bank of 4 GPU subsystems containing in aggregate 1792 NVidia Fermi GPU cores with 12 GB dedicated DDR5 GPU memory. We are applying domain decomposition methods to a modified version of Weiss' (2001) 3D frequency domain full physics EM finite difference code, an open source GPL licensed f90 code available for download from www.OpenEM.org. This will be the core of a new hybrid 3D inversion that parallelizes frequencies across CPUs and individual forward solutions across GPUs. We describe progress made in modifying the code to use direct solvers in GPU cores dedicated to each small subdomain, iteratively improving the solution by matching adjacent subdomain boundary solutions, rather than iterative Krylov space sparse solvers as currently applied to the whole domain.
Intact intracortical microstimulation (ICMS) representations of rostral and caudal forelimb areas in rats with quinolinic acid lesions of the medial or lateral caudate-putamen in an animal model of Huntington's disease.

PubMed

Karl, Jenni M; Sacrey, Lori-Ann R; McDonald, Robert J; Whishaw, Ian Q

2008-09-05

Neurotoxic, cell-specific lesions of the rat caudate-putamen (CPu) have been proposed as a model of human Huntington's disease and as such impair performance on many motor tasks, including skilled forelimbs tasks such as reaching for food. Because the CPu and motor cortex share reciprocal connections, it has been proposed that the motor deficits are due in part to a secondary disruption of motor cortex. The purpose of the present study was to examine the functionality of the motor cortex using intracortical microstimulation (ICMS) following neurotoxic lesions of the CPu. ICMS maps have been shown to be sensitive indicators of motor skill, cortical injury, learning, and experience. Long-evans hooded rats received a sham, a medial, or a lateral CPu lesion using the neurotoxin, quinolinic acid (2,3-pyridinedicarboxylic acid). Two weeks later the motor cortex was stimulated under light ketamine anesthesia. Neither lateral nor medial lesions of the CPu altered the stimulation threshold for eliciting forelimb movements, the type of movements elicited, or the size of the rostral forelimb (RFA) and caudal forelimb areas (CFA) from which movements were elicited. The preservation of ICMS forelimb movement representations (the forelimb map) in rats with cell-specific CPu lesions suggests motor impairments following lesions of the lateral striatum are not due to the disruption of the motor map. Therefore, the impairments that follow striatal cell loss are due either to alterations in circuitry that is independent of motor cortex or to alterations in circuitry afferent to the motor cortex projections.
Characterization and referral patterns of ST-elevation myocardial infarction patients admitted to chest pain units rather than directly to catherization laboratories. Data from the German Chest Pain Unit Registry.

PubMed

Schmidt, Frank P; Perne, Andrea; Hochadel, Matthias; Giannitsis, Evangelos; Darius, Harald; Maier, Lars S; Schmitt, Claus; Heusch, Gerd; Voigtländer, Thomas; Mudra, Harald; Gori, Tommaso; Senges, Jochen; Münzel, Thomas

2017-03-15

Direct transfer to the catheterization laboratory for primary percutaneous coronary intervention (PCI) is standard of care for patients with ST-segment elevation myocardial infarction (STEMI). Nevertheless, a significant number of STEMI-patients are initially treated in chest pain units (CPUs) of admitting hospitals. Thus, it is important to characterize these patients and to define why an important deviation from recommended clinical pathways occurs and in particular to quantify the impact of deviation on critical time intervals. 1679 STEMI patients admitted to a CPU in the period from 2010 to 2015 were enrolled in the German CPU registry (8.5% of 19,666). 55.9% of the patients were delivered by an emergency medical system (EMS), 16.1% transferred from other hospitals and 15.2% referred by a general practitioner (GP). 12.7% were self-referrals. 55% did not get a pre-hospital ECG. Compared to the EMS, referral by GPs markedly delayed critical time intervals while a pre-hospital ECG demonstrating ST-segment elevation reduced door-to-balloon time. When compared to STEMI patients (n=21,674) enrolled in the ALKK-registry, CPU-STEMI patients had a lower risk profile, their treatment in the CPU was guideline-conform and in-hospital mortality was low (1.5%). CPU-STEMI patients represent a numerically significant group because a pre-hospital ECG was not documented. Treatment in the CPU is guideline-conform and the intra-hospital mortality is low. The lack of a pre-hospital ECG and admission via the GP substantially delay critical time intervals suggesting that in patients with symptoms suggestive an ACS, the EMS should be contacted and not the GP. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
Double dissociation of the anterior and posterior dorsomedial caudate-putamen in the acquisition and expression of associative learning with the nicotine stimulus.

PubMed

Charntikov, Sergios; Pittenger, Steven T; Swalve, Natashia; Li, Ming; Bevins, Rick A

2017-07-15

Tobacco use is the leading cause of preventable deaths worldwide. This habit is not only debilitating to individual users but also to those around them (second-hand smoking). Nicotine is the main addictive component of tobacco products and is a moderate stimulant and a mild reinforcer. Importantly, besides its unconditional effects, nicotine also has conditioned stimulus effects that may contribute to the tenacity of the smoking habit. Because the neurobiological substrates underlying these processes are virtually unexplored, the present study investigated the functional involvement of the dorsomedial caudate putamen (dmCPu) in learning processes with nicotine as an interoceptive stimulus. Rats were trained using the discriminated goal-tracking task where nicotine injections (0.4 mg/kg; SC), on some days, were paired with intermittent (36 per session) sucrose deliveries; sucrose was not available on interspersed saline days. Pre-training excitotoxic or post-training transient lesions of anterior or posterior dmCPu were used to elucidate the role of these areas in acquisition or expression of associative learning with nicotine stimulus. Pre-training lesion of p-dmCPu inhibited acquisition while post-training lesions of p-dmCPu attenuated the expression of associative learning with the nicotine stimulus. On the other hand, post-training lesions of a-dmCPu evoked nicotine-like responding following saline treatment indicating the role of this area in disinhibition of learned motor behaviors. These results, for the first time, show functionally distinct involvement of a- and p-dmCPu in various stages of associative learning using nicotine stimulus and provide an initial account of neural plasticity underlying these learning processes. Copyright © 2017 Elsevier Ltd. All rights reserved.
Accelerating moderately stiff chemical kinetics in reactive-flow simulations using GPUs

NASA Astrophysics Data System (ADS)

Niemeyer, Kyle E.; Sung, Chih-Jen

2014-01-01

The chemical kinetics ODEs arising from operator-split reactive-flow simulations were solved on GPUs using explicit integration algorithms. Nonstiff chemical kinetics of a hydrogen oxidation mechanism (9 species and 38 irreversible reactions) were computed using the explicit fifth-order Runge-Kutta-Cash-Karp method, and the GPU-accelerated version performed faster than single- and six-core CPU versions by factors of 126 and 25, respectively, for 524,288 ODEs. Moderately stiff kinetics, represented with mechanisms for hydrogen/carbon-monoxide (13 species and 54 irreversible reactions) and methane (53 species and 634 irreversible reactions) oxidation, were computed using the stabilized explicit second-order Runge-Kutta-Chebyshev (RKC) algorithm. The GPU-based RKC implementation demonstrated an increase in performance of nearly 59 and 10 times, for problem sizes consisting of 262,144 ODEs and larger, than the single- and six-core CPU-based RKC algorithms using the hydrogen/carbon-monoxide mechanism. With the methane mechanism, RKC-GPU performed more than 65 and 11 times faster, for problem sizes consisting of 131,072 ODEs and larger, than the single- and six-core RKC-CPU versions, and up to 57 times faster than the six-core CPU-based implicit VODE algorithm on 65,536 ODEs. In the presence of more severe stiffness, such as ethylene oxidation (111 species and 1566 irreversible reactions), RKC-GPU performed more than 17 times faster than RKC-CPU on six cores for 32,768 ODEs and larger, and at best 4.5 times faster than VODE on six CPU cores for 65,536 ODEs. With a larger time step size, RKC-GPU performed at best 2.5 times slower than six-core VODE for 8192 ODEs and larger. Therefore, the need for developing new strategies for integrating stiff chemistry on GPUs was discussed.
A Three-Attribute Transfer Skills Framework--Part I: Establishing the Model and Its Relation to Chemical Education

ERIC Educational Resources Information Center

Dori, Yehudit Judy; Sasson, Irit

2013-01-01

This paper presents Part I of a two-part study. This first part reviews the literature of transfer of learning as one of the major goals of instruction. Transfer refers to students' ability to apply knowledge and skills in new learning contexts. The literature suggests partially or non-overlapping definitions, and empirical studies on transfer…
A Grid and Group Explanation of Students' and Instructors' Preferences in Computer Assisted Instruction: A Case Study of University Classrooms in Thailand

ERIC Educational Resources Information Center

Limwudhikraijirath, Aree

2009-01-01

This study was a case study which had three overlapping purposes. The first purpose was to use Douglas's typology to explain the educational culture of the Faculty of Management Sciences (FMS) at Prince of Songkla University (PSU) Hatyai, Songkhla, Thailand. The second purpose was to describe the students' and instructor's preferences about…
Discrepancy between mRNA and protein abundance: Insight from information retrieval process in computers

PubMed Central

Wang, Degeng

2008-01-01

Discrepancy between the abundance of cognate protein and RNA molecules is frequently observed. A theoretical understanding of this discrepancy remains elusive, and it is frequently described as surprises and/or technical difficulties in the literature. Protein and RNA represent different steps of the multi-stepped cellular genetic information flow process, in which they are dynamically produced and degraded. This paper explores a comparison with a similar process in computers - multi-step information flow from storage level to the execution level. Functional similarities can be found in almost every facet of the retrieval process. Firstly, common architecture is shared, as the ribonome (RNA space) and the proteome (protein space) are functionally similar to the computer primary memory and the computer cache memory respectively. Secondly, the retrieval process functions, in both systems, to support the operation of dynamic networks – biochemical regulatory networks in cells and, in computers, the virtual networks (of CPU instructions) that the CPU travels through while executing computer programs. Moreover, many regulatory techniques are implemented in computers at each step of the information retrieval process, with a goal of optimizing system performance. Cellular counterparts can be easily identified for these regulatory techniques. In other words, this comparative study attempted to utilize theoretical insight from computer system design principles as catalysis to sketch an integrative view of the gene expression process, that is, how it functions to ensure efficient operation of the overall cellular regulatory network. In context of this bird’s-eye view, discrepancy between protein and RNA abundance became a logical observation one would expect. It was suggested that this discrepancy, when interpreted in the context of system operation, serves as a potential source of information to decipher regulatory logics underneath biochemical network operation. PMID:18757239
Toshiba TDF-500 High Resolution Viewing And Analysis System

NASA Astrophysics Data System (ADS)

Roberts, Barry; Kakegawa, M.; Nishikawa, M.; Oikawa, D.

1988-06-01

A high resolution, operator interactive, medical viewing and analysis system has been developed by Toshiba and Bio-Imaging Research. This system provides many advanced features including high resolution displays, a very large image memory and advanced image processing capability. In particular, the system provides CRT frame buffers capable of update in one frame period, an array processor capable of image processing at operator interactive speeds, and a memory system capable of updating multiple frame buffers at frame rates whilst supporting multiple array processors. The display system provides 1024 x 1536 display resolution at 40Hz frame and 80Hz field rates. In particular, the ability to provide whole or partial update of the screen at the scanning rate is a key feature. This allows multiple viewports or windows in the display buffer with both fixed and cine capability. To support image processing features such as windowing, pan, zoom, minification, filtering, ROI analysis, multiplanar and 3D reconstruction, a high performance CPU is integrated into the system. This CPU is an array processor capable of up to 400 million instructions per second. To support the multiple viewer and array processors' instantaneous high memory bandwidth requirement, an ultra fast memory system is used. This memory system has a bandwidth capability of 400MB/sec and a total capacity of 256MB. This bandwidth is more than adequate to support several high resolution CRT's and also the fast processing unit. This fully integrated approach allows effective real time image processing. The integrated design of viewing system, memory system and array processor are key to the imaging system. It is the intention to describe the architecture of the image system in this paper.
PIMS: Memristor-Based Processing-in-Memory-and-Storage.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Cook, Jeanine

Continued progress in computing has augmented the quest for higher performance with a new quest for higher energy efficiency. This has led to the re-emergence of Processing-In-Memory (PIM) ar- chitectures that offer higher density and performance with some boost in energy efficiency. Past PIM work either integrated a standard CPU with a conventional DRAM to improve the CPU- memory link, or used a bit-level processor with Single Instruction Multiple Data (SIMD) control, but neither matched the energy consumption of the memory to the computation. We originally proposed to develop a new architecture derived from PIM that more effectively addressed energymore » efficiency for high performance scientific, data analytics, and neuromorphic applications. We also originally planned to implement a von Neumann architecture with arithmetic/logic units (ALUs) that matched the power consumption of an advanced storage array to maximize energy efficiency. Implementing this architecture in storage was our original idea, since by augmenting storage (in- stead of memory), the system could address both in-memory computation and applications that accessed larger data sets directly from storage, hence Processing-in-Memory-and-Storage (PIMS). However, as our research matured, we discovered several things that changed our original direc- tion, the most important being that a PIM that implements a standard von Neumann-type archi- tecture results in significant energy efficiency improvement, but only about a O(10) performance improvement. In addition to this, the emergence of new memory technologies moved us to propos- ing a non-von Neumann architecture, called Superstrider, implemented not in storage, but in a new DRAM technology called High Bandwidth Memory (HBM). HBM is a stacked DRAM tech- nology that includes a logic layer where an architecture such as Superstrider could potentially be implemented.« less
Discrepancy between mRNA and protein abundance: insight from information retrieval process in computers.

PubMed

Wang, Degeng

2008-12-01

Discrepancy between the abundance of cognate protein and RNA molecules is frequently observed. A theoretical understanding of this discrepancy remains elusive, and it is frequently described as surprises and/or technical difficulties in the literature. Protein and RNA represent different steps of the multi-stepped cellular genetic information flow process, in which they are dynamically produced and degraded. This paper explores a comparison with a similar process in computers-multi-step information flow from storage level to the execution level. Functional similarities can be found in almost every facet of the retrieval process. Firstly, common architecture is shared, as the ribonome (RNA space) and the proteome (protein space) are functionally similar to the computer primary memory and the computer cache memory, respectively. Secondly, the retrieval process functions, in both systems, to support the operation of dynamic networks-biochemical regulatory networks in cells and, in computers, the virtual networks (of CPU instructions) that the CPU travels through while executing computer programs. Moreover, many regulatory techniques are implemented in computers at each step of the information retrieval process, with a goal of optimizing system performance. Cellular counterparts can be easily identified for these regulatory techniques. In other words, this comparative study attempted to utilize theoretical insight from computer system design principles as catalysis to sketch an integrative view of the gene expression process, that is, how it functions to ensure efficient operation of the overall cellular regulatory network. In context of this bird's-eye view, discrepancy between protein and RNA abundance became a logical observation one would expect. It was suggested that this discrepancy, when interpreted in the context of system operation, serves as a potential source of information to decipher regulatory logics underneath biochemical network operation.
DOE Office of Scientific and Technical Information (OSTI.GOV)

Endele, Max; Etzrodt, Martin; Schroeder, Timm, E-mail: timm.schroeder@bsse.ethz.ch

Hematopoiesis is the cumulative consequence of finely tuned signaling pathways activated through extrinsic factors, such as local niche signals and systemic hematopoietic cytokines. Whether extrinsic factors actively instruct the lineage choice of hematopoietic stem and progenitor cells or are only selectively allowing survival and proliferation of already intrinsically lineage-committed cells has been debated over decades. Recent results demonstrated that cytokines can instruct lineage choice. However, the precise function of individual cytokine-triggered signaling molecules in inducing cellular events like proliferation, lineage choice, and differentiation remains largely elusive. Signal transduction pathways activated by different cytokine receptors are highly overlapping, but support themore » production of distinct hematopoietic lineages. Cellular context, signaling dynamics, and the crosstalk of different signaling pathways determine the cellular response of a given extrinsic signal. New tools to manipulate and continuously quantify signaling events at the single cell level are therefore required to thoroughly interrogate how dynamic signaling networks yield a specific cellular response. - Highlights: • Recent studies provided definite proof for lineage-instructive action of cytokines. • Signaling pathways involved in hematopoietic lineage instruction remain elusive. • New tools are emerging to quantitatively study dynamic signaling networks over time.« less
Accelerating Smith-Waterman Alignment for Protein Database Search Using Frequency Distance Filtration Scheme Based on CPU-GPU Collaborative System.

PubMed

Liu, Yu; Hong, Yang; Lin, Chun-Yuan; Hung, Che-Lun

2015-01-01

The Smith-Waterman (SW) algorithm has been widely utilized for searching biological sequence databases in bioinformatics. Recently, several works have adopted the graphic card with Graphic Processing Units (GPUs) and their associated CUDA model to enhance the performance of SW computations. However, these works mainly focused on the protein database search by using the intertask parallelization technique, and only using the GPU capability to do the SW computations one by one. Hence, in this paper, we will propose an efficient SW alignment method, called CUDA-SWfr, for the protein database search by using the intratask parallelization technique based on a CPU-GPU collaborative system. Before doing the SW computations on GPU, a procedure is applied on CPU by using the frequency distance filtration scheme (FDFS) to eliminate the unnecessary alignments. The experimental results indicate that CUDA-SWfr runs 9.6 times and 96 times faster than the CPU-based SW method without and with FDFS, respectively.
On the cost of approximating and recognizing a noise perturbed straight line or a quadratic curve segment in the plane. [central processing units

NASA Technical Reports Server (NTRS)

Cooper, D. B.; Yalabik, N.

1975-01-01

Approximation of noisy data in the plane by straight lines or elliptic or single-branch hyperbolic curve segments arises in pattern recognition, data compaction, and other problems. The efficient search for and approximation of data by such curves were examined. Recursive least-squares linear curve-fitting was used, and ellipses and hyperbolas are parameterized as quadratic functions in x and y. The error minimized by the algorithm is interpreted, and central processing unit (CPU) times for estimating parameters for fitting straight lines and quadratic curves were determined and compared. CPU time for data search was also determined for the case of straight line fitting. Quadratic curve fitting is shown to require about six times as much CPU time as does straight line fitting, and curves relating CPU time and fitting error were determined for straight line fitting. Results are derived on early sequential determination of whether or not the underlying curve is a straight line.
The Creation of a CPU Timer for High Fidelity Programs

NASA Technical Reports Server (NTRS)

Dick, Aidan A.

2011-01-01

Using C and C++ programming languages, a tool was developed that measures the efficiency of a program by recording the amount of CPU time that various functions consume. By inserting the tool between lines of code in the program, one can receive a detailed report of the absolute and relative time consumption associated with each section. After adapting the generic tool for a high-fidelity launch vehicle simulation program called MAVERIC, the components of a frequently used function called "derivatives ( )" were measured. Out of the 34 sub-functions in "derivatives ( )", it was found that the top 8 sub-functions made up 83.1% of the total time spent. In order to decrease the overall run time of MAVERIC, a launch vehicle simulation program, a change was implemented in the sub-function "Event_Controller ( )". Reformatting "Event_Controller ( )" led to a 36.9% decrease in the total CPU time spent by that sub-function, and a 3.2% decrease in the total CPU time spent by the overarching function "derivatives ( )".
Clinical implementation of a GPU-based simplified Monte Carlo method for a treatment planning system of proton beam therapy.

PubMed

Kohno, R; Hotta, K; Nishioka, S; Matsubara, K; Tansho, R; Suzuki, T

2011-11-21

We implemented the simplified Monte Carlo (SMC) method on graphics processing unit (GPU) architecture under the computer-unified device architecture platform developed by NVIDIA. The GPU-based SMC was clinically applied for four patients with head and neck, lung, or prostate cancer. The results were compared to those obtained by a traditional CPU-based SMC with respect to the computation time and discrepancy. In the CPU- and GPU-based SMC calculations, the estimated mean statistical errors of the calculated doses in the planning target volume region were within 0.5% rms. The dose distributions calculated by the GPU- and CPU-based SMCs were similar, within statistical errors. The GPU-based SMC showed 12.30-16.00 times faster performance than the CPU-based SMC. The computation time per beam arrangement using the GPU-based SMC for the clinical cases ranged 9-67 s. The results demonstrate the successful application of the GPU-based SMC to a clinical proton treatment planning.
Collateral projections of nucleus raphe dorsalis neurones to the caudate-putamen and region around the nucleus raphe magnus and nucleus reticularis gigantocellularis pars alpha in the rat.

PubMed

Li, Y Q; Kaneko, T; Mizuno, N

2001-02-16

It was examined whether or not the nucleus raphe dorsalis (RD) neurons projecting to the caudate-putamen (CPu) might also project to the motor-controlling region around the nucleus raphe magnus (NRM) and nucleus reticularis gigantocellularis pars alpha (Gia) in the rat. Single RD neurons projecting to the CPu and NRM/Gia by way of axon collaterals were identified by the retrograde double-labeling method with fluorescent dyes, Fast Blue and Diamidino Yellow, which were injected respectively into the CPu and NRM/Gia. Then, serotonin (5-HT)-like immunoreactivity of the double-labeled RD neurons was examined immunohistochemically; approximately 60% of the double-labeled RD neurons showed 5-HT-like immunoreactivity. The results indicated that some of serotonergic and non-serotonergic RD neurons might control motor functions simultaneously at the levels of the CPu and NRM/Gia by way of axon collaterals.
The Performance of the NAS HSPs in 1st Half of 1994

NASA Technical Reports Server (NTRS)

Bergeron, Robert J.; Walter, Howard (Technical Monitor)

1995-01-01

During the first six months of 1994, the NAS (National Airspace System) 16-CPU Y-MP C90 Von Neumann (VN) delivered an average throughput of 4.045 GFLOPS while the ACSF (Aeronautics Consolidated Supercomputer Facility) 8-CPU Y-MP C90 Eagle averaged 1.658 GFLOPS. The VN rate represents a machine efficiency of 26.3% whereas the Eagle rate corresponds to a machine efficiency of 21.6%. VN displayed a greater efficiency than Eagle primarily because the stronger workload demand for its CPU cycles allowed it to devote more time to user programs and less time to idle. An additional factor increasing VN efficiency was the ability of the UNICOS 8.0 Operating System to deliver a larger fraction of CPU time to user programs. Although measurements indicate increasing vector length for both workloads, insufficient vector lengths continue to hinder HSP (High Speed Processor) performance. To improve HSP performance, NAS should continue to encourage the HSP users to modify their codes to increase program vector length.

Caffe con Troll: Shallow Ideas to Speed Up Deep Learning

PubMed Central

Hadjis, Stefan; Abuzaid, Firas; Zhang, Ce; Ré, Christopher

2016-01-01

We present Caffe con Troll (CcT), a fully compatible end-to-end version of the popular framework Caffe with rebuilt internals. We built CcT to examine the performance characteristics of training and deploying general-purpose convolutional neural networks across different hardware architectures. We find that, by employing standard batching optimizations for CPU training, we achieve a 4.5× throughput improvement over Caffe on popular networks like CaffeNet. Moreover, with these improvements, the end-to-end training time for CNNs is directly proportional to the FLOPS delivered by the CPU, which enables us to efficiently train hybrid CPU-GPU systems for CNNs. PMID:27314106
Caffe con Troll: Shallow Ideas to Speed Up Deep Learning.

PubMed

Hadjis, Stefan; Abuzaid, Firas; Zhang, Ce; Ré, Christopher

2015-01-01

We present Caffe con Troll (CcT), a fully compatible end-to-end version of the popular framework Caffe with rebuilt internals. We built CcT to examine the performance characteristics of training and deploying general-purpose convolutional neural networks across different hardware architectures. We find that, by employing standard batching optimizations for CPU training, we achieve a 4.5× throughput improvement over Caffe on popular networks like CaffeNet. Moreover, with these improvements, the end-to-end training time for CNNs is directly proportional to the FLOPS delivered by the CPU, which enables us to efficiently train hybrid CPU-GPU systems for CNNs.
The Effect of NUMA Tunings on CPU Performance

NASA Astrophysics Data System (ADS)

Hollowell, Christopher; Caramarcu, Costin; Strecker-Kellogg, William; Wong, Antonio; Zaytsev, Alexandr

2015-12-01

Non-Uniform Memory Access (NUMA) is a memory architecture for symmetric multiprocessing (SMP) systems where each processor is directly connected to separate memory. Indirect access to other CPU's (remote) RAM is still possible, but such requests are slower as they must also pass through that memory's controlling CPU. In concert with a NUMA-aware operating system, the NUMA hardware architecture can help eliminate the memory performance reductions generally seen in SMP systems when multiple processors simultaneously attempt to access memory. The x86 CPU architecture has supported NUMA for a number of years. Modern operating systems such as Linux support NUMA-aware scheduling, where the OS attempts to schedule a process to the CPU directly attached to the majority of its RAM. In Linux, it is possible to further manually tune the NUMA subsystem using the numactl utility. With the release of Red Hat Enterprise Linux (RHEL) 6.3, the numad daemon became available in this distribution. This daemon monitors a system's NUMA topology and utilization, and automatically makes adjustments to optimize locality. As the number of cores in x86 servers continues to grow, efficient NUMA mappings of processes to CPUs/memory will become increasingly important. This paper gives a brief overview of NUMA, and discusses the effects of manual tunings and numad on the performance of the HEPSPEC06 benchmark, and ATLAS software.
Static and Dynamic Frequency Scaling on Multicore CPUs

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bao, Wenlei; Hong, Changwan; Chunduri, Sudheer

2016-12-28

Dynamic voltage and frequency scaling (DVFS) adapts CPU power consumption by modifying a processor’s operating frequency (and the associated voltage). Typical approaches employing DVFS involve default strategies such as running at the lowest or the highest frequency, or observing the CPU’s runtime behavior and dynamically adapting the voltage/frequency configuration based on CPU usage. In this paper, we argue that many previous approaches suffer from inherent limitations, such as not account- ing for processor-specific impact of frequency changes on energy for different workload types. We first propose a lightweight runtime-based approach to automatically adapt the frequency based on the CPU workload,more » that is agnostic of the processor characteristics. We then show that further improvements can be achieved for affine kernels in the application, using a compile-time characterization instead of run-time monitoring to select the frequency and number of CPU cores to use. Our framework relies on a one-time energy characterization of CPU-specific DVFS profiles followed by a compile-time categorization of loop-based code segments in the application. These are combined to determine a priori of the frequency and the number of cores to use to execute the application so as to optimize energy or energy-delay product, outperforming runtime approach. Extensive evaluation on 60 benchmarks and five multi-core CPUs show that our approach systematically outperforms the powersave Linux governor, while improving overall performance.« less
“Superluminal” FITS File Processing on Multiprocessors: Zero Time Endian Conversion Technique

NASA Astrophysics Data System (ADS)

Eguchi, Satoshi

2013-05-01

The FITS is the standard file format in astronomy, and it has been extended to meet the astronomical needs of the day. However, astronomical datasets have been inflating year by year. In the case of the ALMA telescope, a ˜TB-scale four-dimensional data cube may be produced for one target. Considering that typical Internet bandwidth is tens of MB/s at most, the original data cubes in FITS format are hosted on a VO server, and the region which a user is interested in should be cut out and transferred to the user (Eguchi et al. 2012). The system will equip a very high-speed disk array to process a TB-scale data cube in 10 s, and disk I/O speed, endian conversion, and data processing speeds will be comparable. Hence, reducing the endian conversion time is one of issues to solve in our system. In this article, I introduce a technique named “just-in-time endian conversion”, which delays the endian conversion for each pixel just before it is really needed, to sweep out the endian conversion time; by applying this method, the FITS processing speed increases 20% for single threading and 40% for multi-threading compared to CFITSIO. The speedup tightly relates to modern CPU architecture to improve the efficiency of instruction pipelines due to break of “causality”, a programmed instruction code sequence.
A higher-speed compressive sensing camera through multi-diode design

NASA Astrophysics Data System (ADS)

Herman, Matthew A.; Tidman, James; Hewitt, Donna; Weston, Tyler; McMackin, Lenore

2013-05-01

Obtaining high frame rates is a challenge with compressive sensing (CS) systems that gather measurements in a sequential manner, such as the single-pixel CS camera. One strategy for increasing the frame rate is to divide the FOV into smaller areas that are sampled and reconstructed in parallel. Following this strategy, InView has developed a multi-aperture CS camera using an 8×4 array of photodiodes that essentially act as 32 individual simultaneously operating single-pixel cameras. Images reconstructed from each of the photodiode measurements are stitched together to form the full FOV. To account for crosstalk between the sub-apertures, novel modulation patterns have been developed to allow neighboring sub-apertures to share energy. Regions of overlap not only account for crosstalk energy that would otherwise be reconstructed as noise, but they also allow for tolerance in the alignment of the DMD to the lenslet array. Currently, the multi-aperture camera is built into a computational imaging workstation configuration useful for research and development purposes. In this configuration, modulation patterns are generated in a CPU and sent to the DMD via PCI express, which allows the operator to develop and change the patterns used in the data acquisition step. The sensor data is collected and then streamed to the workstation via an Ethernet or USB connection for the reconstruction step. Depending on the amount of data taken and the amount of overlap between sub-apertures, frame rates of 2-5 frames per second can be achieved. In a stand-alone camera platform, currently in development, pattern generation and reconstruction will be implemented on-board.
The Research and Test of Fast Radio Burst Real-time Search Algorithm Based on GPU Acceleration

NASA Astrophysics Data System (ADS)

Wang, J.; Chen, M. Z.; Pei, X.; Wang, Z. Q.

2017-03-01

In order to satisfy the research needs of Nanshan 25 m radio telescope of Xinjiang Astronomical Observatory (XAO) and study the key technology of the planned QiTai radio Telescope (QTT), the receiver group of XAO studied the GPU (Graphics Processing Unit) based real-time FRB searching algorithm which developed from the original FRB searching algorithm based on CPU (Central Processing Unit), and built the FRB real-time searching system. The comparison of the GPU system and the CPU system shows that: on the basis of ensuring the accuracy of the search, the speed of the GPU accelerated algorithm is improved by 35-45 times compared with the CPU algorithm.
Chimera grids in the simulation of three-dimensional flowfields in turbine-blade-coolant passages

NASA Technical Reports Server (NTRS)

Stephens, M. A.; Rimlinger, M. J.; Shih, T. I.-P.; Civinskas, K. C.

1993-01-01

When computing flows inside geometrically complex turbine-blade coolant passages, the structure of the grid system used can affect significantly the overall time and cost required to obtain solutions. This paper addresses this issue while evaluating and developing computational tools for the design and analysis of coolant-passages, and is divided into two parts. In the first part, the various types of structured and unstructured grids are compared in relation to their ability to provide solutions in a timely and cost-effective manner. This comparison shows that the overlapping structured grids, known as Chimera grids, can rival and in some instances exceed the cost-effectiveness of unstructured grids in terms of both the man hours needed to generate grids and the amount of computer memory and CPU time needed to obtain solutions. In the second part, a computational tool utilizing Chimera grids was used to compute the flow and heat transfer in two different turbine-blade coolant passages that contain baffles and numerous pin fins. These computations showed the versatility and flexibility offered by Chimera grids.
Combustion Power Unit--400: CPU-400.

ERIC Educational Resources Information Center

Combustion Power Co., Palo Alto, CA.

Aerospace technology may have led to a unique basic unit for processing solid wastes and controlling pollution. The Combustion Power Unit--400 (CPU-400) is designed as a turboelectric generator plant that will use municipal solid wastes as fuel. The baseline configuration is a modular unit that is designed to utilize 400 tons of refuse per day…
Particle-in-Cell laser-plasma simulation on Xeon Phi coprocessors

NASA Astrophysics Data System (ADS)

Surmin, I. A.; Bastrakov, S. I.; Efimenko, E. S.; Gonoskov, A. A.; Korzhimanov, A. V.; Meyerov, I. B.

2016-05-01

This paper concerns the development of a high-performance implementation of the Particle-in-Cell method for plasma simulation on Intel Xeon Phi coprocessors. We discuss the suitability of the method for Xeon Phi architecture and present our experience in the porting and optimization of the existing parallel Particle-in-Cell code PICADOR. Direct porting without code modification gives performance on Xeon Phi close to that of an 8-core CPU on a benchmark problem with 50 particles per cell. We demonstrate step-by-step optimization techniques, such as improving data locality, enhancing parallelization efficiency and vectorization leading to an overall 4.2 × speedup on CPU and 7.5 × on Xeon Phi compared to the baseline version. The optimized version achieves 16.9 ns per particle update on an Intel Xeon E5-2660 CPU and 9.3 ns per particle update on an Intel Xeon Phi 5110P. For a real problem of laser ion acceleration in targets with surface grating, where a large number of macroparticles per cell is required, the speedup of Xeon Phi compared to CPU is 1.6 ×.
Performance and accuracy of criticality calculations performed using WARP – A framework for continuous energy Monte Carlo neutron transport in general 3D geometries on GPUs

DOE PAGES

Bergmann, Ryan M.; Rowland, Kelly L.; Radnović, Nikola; ...

2017-05-01

In this companion paper to "Algorithmic Choices in WARP - A Framework for Continuous Energy Monte Carlo Neutron Transport in General 3D Geometries on GPUs" (doi:10.1016/j.anucene.2014.10.039), the WARP Monte Carlo neutron transport framework for graphics processing units (GPUs) is benchmarked against production-level central processing unit (CPU) Monte Carlo neutron transport codes for both performance and accuracy. We compare neutron flux spectra, multiplication factors, runtimes, speedup factors, and costs of various GPU and CPU platforms running either WARP, Serpent 2.1.24, or MCNP 6.1. WARP compares well with the results of the production-level codes, and it is shown that on the newestmore » hardware considered, GPU platforms running WARP are between 0.8 to 7.6 times as fast as CPU platforms running production codes. Also, the GPU platforms running WARP were between 15% and 50% as expensive to purchase and between 80% to 90% as expensive to operate as equivalent CPU platforms performing at an equal simulation rate.« less
Optimum element density studies for finite-element thermal analysis of hypersonic aircraft structures

NASA Technical Reports Server (NTRS)

Ko, William L.; Olona, Timothy; Muramoto, Kyle M.

1990-01-01

Different finite element models previously set up for thermal analysis of the space shuttle orbiter structure are discussed and their shortcomings identified. Element density criteria are established for the finite element thermal modelings of space shuttle orbiter-type large, hypersonic aircraft structures. These criteria are based on rigorous studies on solution accuracies using different finite element models having different element densities set up for one cell of the orbiter wing. Also, a method for optimization of the transient thermal analysis computer central processing unit (CPU) time is discussed. Based on the newly established element density criteria, the orbiter wing midspan segment was modeled for the examination of thermal analysis solution accuracies and the extent of computation CPU time requirements. The results showed that the distributions of the structural temperatures and the thermal stresses obtained from this wing segment model were satisfactory and the computation CPU time was at the acceptable level. The studies offered the hope that modeling the large, hypersonic aircraft structures using high-density elements for transient thermal analysis is possible if a CPU optimization technique was used.
Software Defined Radio with Parallelized Software Architecture

NASA Technical Reports Server (NTRS)

Heckler, Greg

2013-01-01

This software implements software-defined radio procession over multicore, multi-CPU systems in a way that maximizes the use of CPU resources in the system. The software treats each processing step in either a communications or navigation modulator or demodulator system as an independent, threaded block. Each threaded block is defined with a programmable number of input or output buffers; these buffers are implemented using POSIX pipes. In addition, each threaded block is assigned a unique thread upon block installation. A modulator or demodulator system is built by assembly of the threaded blocks into a flow graph, which assembles the processing blocks to accomplish the desired signal processing. This software architecture allows the software to scale effortlessly between single CPU/single-core computers or multi-CPU/multi-core computers without recompilation. NASA spaceflight and ground communications systems currently rely exclusively on ASICs or FPGAs. This software allows low- and medium-bandwidth (100 bps to approx.50 Mbps) software defined radios to be designed and implemented solely in C/C++ software, while lowering development costs and facilitating reuse and extensibility.
Software Defined Radio with Parallelized Software Architecture

NASA Technical Reports Server (NTRS)

Heckler, Greg

2013-01-01

This software implements software-defined radio procession over multi-core, multi-CPU systems in a way that maximizes the use of CPU resources in the system. The software treats each processing step in either a communications or navigation modulator or demodulator system as an independent, threaded block. Each threaded block is defined with a programmable number of input or output buffers; these buffers are implemented using POSIX pipes. In addition, each threaded block is assigned a unique thread upon block installation. A modulator or demodulator system is built by assembly of the threaded blocks into a flow graph, which assembles the processing blocks to accomplish the desired signal processing. This software architecture allows the software to scale effortlessly between single CPU/single-core computers or multi-CPU/multi-core computers without recompilation. NASA spaceflight and ground communications systems currently rely exclusively on ASICs or FPGAs. This software allows low- and medium-bandwidth (100 bps to .50 Mbps) software defined radios to be designed and implemented solely in C/C++ software, while lowering development costs and facilitating reuse and extensibility.
Work stealing for GPU-accelerated parallel programs in a global address space framework: WORK STEALING ON GPU-ACCELERATED SYSTEMS

DOE Office of Scientific and Technical Information (OSTI.GOV)

Arafat, Humayun; Dinan, James; Krishnamoorthy, Sriram

Task parallelism is an attractive approach to automatically load balance the computation in a parallel system and adapt to dynamism exhibited by parallel systems. Exploiting task parallelism through work stealing has been extensively studied in shared and distributed-memory contexts. In this paper, we study the design of a system that uses work stealing for dynamic load balancing of task-parallel programs executed on hybrid distributed-memory CPU-graphics processing unit (GPU) systems in a global-address space framework. We take into account the unique nature of the accelerator model employed by GPUs, the significant performance difference between GPU and CPU execution as a functionmore » of problem size, and the distinct CPU and GPU memory domains. We consider various alternatives in designing a distributed work stealing algorithm for CPU-GPU systems, while taking into account the impact of task distribution and data movement overheads. These strategies are evaluated using microbenchmarks that capture various execution configurations as well as the state-of-the-art CCSD(T) application module from the computational chemistry domain.« less
Work stealing for GPU-accelerated parallel programs in a global address space framework

DOE Office of Scientific and Technical Information (OSTI.GOV)

Arafat, Humayun; Dinan, James; Krishnamoorthy, Sriram

Task parallelism is an attractive approach to automatically load balance the computation in a parallel system and adapt to dynamism exhibited by parallel systems. Exploiting task parallelism through work stealing has been extensively studied in shared and distributed-memory contexts. In this paper, we study the design of a system that uses work stealing for dynamic load balancing of task-parallel programs executed on hybrid distributed-memory CPU-graphics processing unit (GPU) systems in a global-address space framework. We take into account the unique nature of the accelerator model employed by GPUs, the significant performance difference between GPU and CPU execution as a functionmore » of problem size, and the distinct CPU and GPU memory domains. We consider various alternatives in designing a distributed work stealing algorithm for CPU-GPU systems, while taking into account the impact of task distribution and data movement overheads. These strategies are evaluated using microbenchmarks that capture various execution configurations as well as the state-of-the-art CCSD(T) application module from the computational chemistry domain« less
Performance and accuracy of criticality calculations performed using WARP – A framework for continuous energy Monte Carlo neutron transport in general 3D geometries on GPUs

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bergmann, Ryan M.; Rowland, Kelly L.; Radnović, Nikola

In this companion paper to "Algorithmic Choices in WARP - A Framework for Continuous Energy Monte Carlo Neutron Transport in General 3D Geometries on GPUs" (doi:10.1016/j.anucene.2014.10.039), the WARP Monte Carlo neutron transport framework for graphics processing units (GPUs) is benchmarked against production-level central processing unit (CPU) Monte Carlo neutron transport codes for both performance and accuracy. We compare neutron flux spectra, multiplication factors, runtimes, speedup factors, and costs of various GPU and CPU platforms running either WARP, Serpent 2.1.24, or MCNP 6.1. WARP compares well with the results of the production-level codes, and it is shown that on the newestmore » hardware considered, GPU platforms running WARP are between 0.8 to 7.6 times as fast as CPU platforms running production codes. Also, the GPU platforms running WARP were between 15% and 50% as expensive to purchase and between 80% to 90% as expensive to operate as equivalent CPU platforms performing at an equal simulation rate.« less
Optimizing legacy molecular dynamics software with directive-based offload

NASA Astrophysics Data System (ADS)

Michael Brown, W.; Carrillo, Jan-Michael Y.; Gavhane, Nitin; Thakkar, Foram M.; Plimpton, Steven J.

2015-10-01

Directive-based programming models are one solution for exploiting many-core coprocessors to increase simulation rates in molecular dynamics. They offer the potential to reduce code complexity with offload models that can selectively target computations to run on the CPU, the coprocessor, or both. In this paper, we describe modifications to the LAMMPS molecular dynamics code to enable concurrent calculations on a CPU and coprocessor. We demonstrate that standard molecular dynamics algorithms can run efficiently on both the CPU and an x86-based coprocessor using the same subroutines. As a consequence, we demonstrate that code optimizations for the coprocessor also result in speedups on the CPU; in extreme cases up to 4.7X. We provide results for LAMMPS benchmarks and for production molecular dynamics simulations using the Stampede hybrid supercomputer with both Intel® Xeon Phi™ coprocessors and NVIDIA GPUs. The optimizations presented have increased simulation rates by over 2X for organic molecules and over 7X for liquid crystals on Stampede. The optimizations are available as part of the "Intel package" supplied with LAMMPS.
An efficient mixed-precision, hybrid CPU-GPU implementation of a nonlinearly implicit one-dimensional particle-in-cell algorithm

DOE Office of Scientific and Technical Information (OSTI.GOV)

Chen, Guangye; Chacon, Luis; Barnes, Daniel C

2012-01-01

Recently, a fully implicit, energy- and charge-conserving particle-in-cell method has been developed for multi-scale, full-f kinetic simulations [G. Chen, et al., J. Comput. Phys. 230, 18 (2011)]. The method employs a Jacobian-free Newton-Krylov (JFNK) solver and is capable of using very large timesteps without loss of numerical stability or accuracy. A fundamental feature of the method is the segregation of particle orbit integrations from the field solver, while remaining fully self-consistent. This provides great flexibility, and dramatically improves the solver efficiency by reducing the degrees of freedom of the associated nonlinear system. However, it requires a particle push per nonlinearmore » residual evaluation, which makes the particle push the most time-consuming operation in the algorithm. This paper describes a very efficient mixed-precision, hybrid CPU-GPU implementation of the implicit PIC algorithm. The JFNK solver is kept on the CPU (in double precision), while the inherent data parallelism of the particle mover is exploited by implementing it in single-precision on a graphics processing unit (GPU) using CUDA. Performance-oriented optimizations, with the aid of an analytical performance model, the roofline model, are employed. Despite being highly dynamic, the adaptive, charge-conserving particle mover algorithm achieves up to 300 400 GOp/s (including single-precision floating-point, integer, and logic operations) on a Nvidia GeForce GTX580, corresponding to 20 25% absolute GPU efficiency (against the peak theoretical performance) and 50-70% intrinsic efficiency (against the algorithm s maximum operational throughput, which neglects all latencies). This is about 200-300 times faster than an equivalent serial CPU implementation. When the single-precision GPU particle mover is combined with a double-precision CPU JFNK field solver, overall performance gains 100 vs. the double-precision CPU-only serial version are obtained, with no apparent loss of robustness or accuracy when applied to a challenging long-time scale ion acoustic wave simulation.« less
Validation of GPU based TomoTherapy dose calculation engine.

PubMed

Chen, Quan; Lu, Weiguo; Chen, Yu; Chen, Mingli; Henderson, Douglas; Sterpin, Edmond

2012-04-01

The graphic processing unit (GPU) based TomoTherapy convolution/superposition(C/S) dose engine (GPU dose engine) achieves a dramatic performance improvement over the traditional CPU-cluster based TomoTherapy dose engine (CPU dose engine). Besides the architecture difference between the GPU and CPU, there are several algorithm changes from the CPU dose engine to the GPU dose engine. These changes made the GPU dose slightly different from the CPU-cluster dose. In order for the commercial release of the GPU dose engine, its accuracy has to be validated. Thirty eight TomoTherapy phantom plans and 19 patient plans were calculated with both dose engines to evaluate the equivalency between the two dose engines. Gamma indices (Γ) were used for the equivalency evaluation. The GPU dose was further verified with the absolute point dose measurement with ion chamber and film measurements for phantom plans. Monte Carlo calculation was used as a reference for both dose engines in the accuracy evaluation in heterogeneous phantom and actual patients. The GPU dose engine showed excellent agreement with the current CPU dose engine. The majority of cases had over 99.99% of voxels with Γ(1%, 1 mm) < 1. The worst case observed in the phantom had 0.22% voxels violating the criterion. In patient cases, the worst percentage of voxels violating the criterion was 0.57%. For absolute point dose verification, all cases agreed with measurement to within ±3% with average error magnitude within 1%. All cases passed the acceptance criterion that more than 95% of the pixels have Γ(3%, 3 mm) < 1 in film measurement, and the average passing pixel percentage is 98.5%-99%. The GPU dose engine also showed similar degree of accuracy in heterogeneous media as the current TomoTherapy dose engine. It is verified and validated that the ultrafast TomoTherapy GPU dose engine can safely replace the existing TomoTherapy cluster based dose engine without degradation in dose accuracy.

SU-E-J-91: FFT Based Medical Image Registration Using a Graphics Processing Unit (GPU).

PubMed

Luce, J; Hoggarth, M; Lin, J; Block, A; Roeske, J

2012-06-01

To evaluate the efficiency gains obtained from using a Graphics Processing Unit (GPU) to perform a Fourier Transform (FT) based image registration. Fourier-based image registration involves obtaining the FT of the component images, and analyzing them in Fourier space to determine the translations and rotations of one image set relative to another. An important property of FT registration is that by enlarging the images (adding additional pixels), one can obtain translations and rotations with sub-pixel resolution. The expense, however, is an increased computational time. GPUs may decrease the computational time associated with FT image registration by taking advantage of their parallel architecture to perform matrix computations much more efficiently than a Central Processor Unit (CPU). In order to evaluate the computational gains produced by a GPU, images with known translational shifts were utilized. A program was written in the Interactive Data Language (IDL; Exelis, Boulder, CO) to performCPU-based calculations. Subsequently, the program was modified using GPU bindings (Tech-X, Boulder, CO) to perform GPU-based computation on the same system. Multiple image sizes were used, ranging from 256×256 to 2304×2304. The time required to complete the full algorithm by the CPU and GPU were benchmarked and the speed increase was defined as the ratio of the CPU-to-GPU computational time. The ratio of the CPU-to- GPU time was greater than 1.0 for all images, which indicates the GPU is performing the algorithm faster than the CPU. The smallest improvement, a 1.21 ratio, was found with the smallest image size of 256×256, and the largest speedup, a 4.25 ratio, was observed with the largest image size of 2304×2304. GPU programming resulted in a significant decrease in computational time associated with a FT image registration algorithm. The inclusion of the GPU may provide near real-time, sub-pixel registration capability. © 2012 American Association of Physicists in Medicine.
Bimodal Bilinguals Co-activate Both Languages during Spoken Comprehension

PubMed Central

Shook, Anthony; Marian, Viorica

2012-01-01

Bilinguals have been shown to activate their two languages in parallel, and this process can often be attributed to overlap in input between the two languages. The present study examines whether two languages that do not overlap in input structure, and that have distinct phonological systems, such as American Sign Language (ASL) and English, are also activated in parallel. Hearing ASL-English bimodal bilinguals’ and English monolinguals’ eye-movements were recorded during a visual world paradigm, in which participants were instructed, in English, to select objects from a display. In critical trials, the target item appeared with a competing item that overlapped with the target in ASL phonology. Bimodal bilinguals looked more at competing items than at phonologically unrelated items, and looked more at competing items relative to monolinguals, indicating activation of the sign-language during spoken English comprehension. The findings suggest that language co-activation is not modality specific, and provide insight into the mechanisms that may underlie cross-modal language co-activation in bimodal bilinguals, including the role that top-down and lateral connections between levels of processing may play in language comprehension. PMID:22770677
Deployment of the OSIRIS EM-PIC code on the Intel Knights Landing architecture

NASA Astrophysics Data System (ADS)

Fonseca, Ricardo

2017-10-01

Electromagnetic particle-in-cell (EM-PIC) codes such as OSIRIS have found widespread use in modelling the highly nonlinear and kinetic processes that occur in several relevant plasma physics scenarios, ranging from astrophysical settings to high-intensity laser plasma interaction. Being computationally intensive, these codes require large scale HPC systems, and a continuous effort in adapting the algorithm to new hardware and computing paradigms. In this work, we report on our efforts on deploying the OSIRIS code on the new Intel Knights Landing (KNL) architecture. Unlike the previous generation (Knights Corner), these boards are standalone systems, and introduce several new features, include the new AVX-512 instructions and on-package MCDRAM. We will focus on the parallelization and vectorization strategies followed, as well as memory management, and present a detailed performance evaluation of code performance in comparison with the CPU code. This work was partially supported by Fundaçã para a Ciência e Tecnologia (FCT), Portugal, through Grant No. PTDC/FIS-PLA/2940/2014.
Matching Jobs, People, and Instructional Content: An Innovative Application of a Latent Semantic Analysis-Based Technology

DTIC Science & Technology

2003-03-01

information technologies that can: (a) represent knowledge and skills, (b) identify people with all or parts of the knowledge and task experience...needed but lacked, A might be at too advanced a level for the 8 individual to understand given his or her previous knowledge , B might overlap too...SEMANTIC ANALYSIS-BASED TECHNOLOGY Darrell Laham Knowledge Analysis Technologies 4940 Pearl East Circle #200 Boulder, CO 80301 Winston
Evaluating Academic Journals Using Impact Factor and Local Citation Score

ERIC Educational Resources Information Center

Chung, Hye-Kyung

2007-01-01

This study presents a method for journal collection evaluation using citation analysis. Cost-per-use (CPU) for each title is used to measure cost-effectiveness with higher CPU scores indicating cost-effective titles. Use data are based on the impact factor and locally collected citation score of each title and is compared to the cost of managing…
Increases in cytoplasmic dopamine compromise the normal resistance of the nucleus accumbens to methamphetamine neurotoxicity

PubMed Central

Thomas, David M.; Francescutti-Verbeem, Dina M.; Kuhnt, Donald M.

2016-01-01

Methamphetamine (METH) is a neurotoxic drug of abuse that damages the dopamine (DA) neuronal system in a highly delimited manner. The brain structure most affected by METH is the caudate–putamen (CPu) where long-term DA depletion and microglial activation are most evident. Even damage within the CPu is remarkably heterogenous with lateral and ventral aspects showing the greatest deficits. The nucleus accumbens (NAc) is largely spared of the damage that accompanies binge METH intoxication. Increases in cytoplasmic DA produced by reserpine, L-DOPA or clorgyline prior to METH uncover damage in the NAc as evidenced by microglial activation and depletion of DA, tyrosine hydroxylase (TH), and the DA transporter. These effects do not occur in the NAc after treatment with METH alone. In contrast to the CPu where DA, TH, and DA transporter levels remain depleted chronically, DA nerve ending alterations in the NAc show a partial recovery over time. None of the treatments that enhance METH toxicity in the NAc and CPu lead to losses of TH protein or DA cell bodies in the substantia nigra or the ventral tegmentum. These data show that increases in cytoplasmic DA dramatically broaden the neurotoxic profile of METH to include brain structures not normally targeted for damage by METH alone. The resistance of the NAc to METH-induced neurotoxicity and its ability to recover reveal a fundamentally different neuroplasticity by comparison to the CPu. Recruitment of the NAc as a target of METH neurotoxicity by alterations in DA homeostasis is significant in light of the important roles played by this brain structure. PMID:19457119
Increases in cytoplasmic dopamine compromise the normal resistance of the nucleus accumbens to methamphetamine neurotoxicity.

PubMed

Thomas, David M; Francescutti-Verbeem, Dina M; Kuhn, Donald M

2009-06-01

Methamphetamine (METH) is a neurotoxic drug of abuse that damages the dopamine (DA) neuronal system in a highly delimited manner. The brain structure most affected by METH is the caudate-putamen (CPu) where long-term DA depletion and microglial activation are most evident. Even damage within the CPu is remarkably heterogenous with lateral and ventral aspects showing the greatest deficits. The nucleus accumbens (NAc) is largely spared of the damage that accompanies binge METH intoxication. Increases in cytoplasmic DA produced by reserpine, L-DOPA or clorgyline prior to METH uncover damage in the NAc as evidenced by microglial activation and depletion of DA, tyrosine hydroxylase (TH), and the DA transporter. These effects do not occur in the NAc after treatment with METH alone. In contrast to the CPu where DA, TH, and DA transporter levels remain depleted chronically, DA nerve ending alterations in the NAc show a partial recovery over time. None of the treatments that enhance METH toxicity in the NAc and CPu lead to losses of TH protein or DA cell bodies in the substantia nigra or the ventral tegmentum. These data show that increases in cytoplasmic DA dramatically broaden the neurotoxic profile of METH to include brain structures not normally targeted for damage by METH alone. The resistance of the NAc to METH-induced neurotoxicity and its ability to recover reveal a fundamentally different neuroplasticity by comparison to the CPu. Recruitment of the NAc as a target of METH neurotoxicity by alterations in DA homeostasis is significant in light of the important roles played by this brain structure.
Pipelined CPU Design with FPGA in Teaching Computer Architecture

ERIC Educational Resources Information Center

Lee, Jong Hyuk; Lee, Seung Eun; Yu, Heon Chang; Suh, Taeweon

2012-01-01

This paper presents a pipelined CPU design project with a field programmable gate array (FPGA) system in a computer architecture course. The class project is a five-stage pipelined 32-bit MIPS design with experiments on the Altera DE2 board. For proper scheduling, milestones were set every one or two weeks to help students complete the project on…
Optimizing legacy molecular dynamics software with directive-based offload

DOE PAGES

Michael Brown, W.; Carrillo, Jan-Michael Y.; Gavhane, Nitin; ...

2015-05-14

The directive-based programming models are one solution for exploiting many-core coprocessors to increase simulation rates in molecular dynamics. They offer the potential to reduce code complexity with offload models that can selectively target computations to run on the CPU, the coprocessor, or both. In our paper, we describe modifications to the LAMMPS molecular dynamics code to enable concurrent calculations on a CPU and coprocessor. We also demonstrate that standard molecular dynamics algorithms can run efficiently on both the CPU and an x86-based coprocessor using the same subroutines. As a consequence, we demonstrate that code optimizations for the coprocessor also resultmore » in speedups on the CPU; in extreme cases up to 4.7X. We provide results for LAMMAS benchmarks and for production molecular dynamics simulations using the Stampede hybrid supercomputer with both Intel (R) Xeon Phi (TM) coprocessors and NVIDIA GPUs: The optimizations presented have increased simulation rates by over 2X for organic molecules and over 7X for liquid crystals on Stampede. The optimizations are available as part of the "Intel package" supplied with LAMMPS. (C) 2015 Elsevier B.V. All rights reserved.« less
Accelerated event-by-event Monte Carlo microdosimetric calculations of electrons and protons tracks on a multi-core CPU and a CUDA-enabled GPU.

PubMed

Kalantzis, Georgios; Tachibana, Hidenobu

2014-01-01

For microdosimetric calculations event-by-event Monte Carlo (MC) methods are considered the most accurate. The main shortcoming of those methods is the extensive requirement for computational time. In this work we present an event-by-event MC code of low projectile energy electron and proton tracks for accelerated microdosimetric MC simulations on a graphic processing unit (GPU). Additionally, a hybrid implementation scheme was realized by employing OpenMP and CUDA in such a way that both GPU and multi-core CPU were utilized simultaneously. The two implementation schemes have been tested and compared with the sequential single threaded MC code on the CPU. Performance comparison was established on the speed-up for a set of benchmarking cases of electron and proton tracks. A maximum speedup of 67.2 was achieved for the GPU-based MC code, while a further improvement of the speedup up to 20% was achieved for the hybrid approach. The results indicate the capability of our CPU-GPU implementation for accelerated MC microdosimetric calculations of both electron and proton tracks without loss of accuracy. Copyright © 2013 Elsevier Ireland Ltd. All rights reserved.
A novel heterogeneous algorithm to simulate multiphase flow in porous media on multicore CPU-GPU systems

NASA Astrophysics Data System (ADS)

McClure, J. E.; Prins, J. F.; Miller, C. T.

2014-07-01

Multiphase flow implementations of the lattice Boltzmann method (LBM) are widely applied to the study of porous medium systems. In this work, we construct a new variant of the popular "color" LBM for two-phase flow in which a three-dimensional, 19-velocity (D3Q19) lattice is used to compute the momentum transport solution while a three-dimensional, seven velocity (D3Q7) lattice is used to compute the mass transport solution. Based on this formulation, we implement a novel heterogeneous GPU-accelerated algorithm in which the mass transport solution is computed by multiple shared memory CPU cores programmed using OpenMP while a concurrent solution of the momentum transport is performed using a GPU. The heterogeneous solution is demonstrated to provide speedup of 2.6 × as compared to multi-core CPU solution and 1.8 × compared to GPU solution due to concurrent utilization of both CPU and GPU bandwidths. Furthermore, we verify that the proposed formulation provides an accurate physical representation of multiphase flow processes and demonstrate that the approach can be applied to perform heterogeneous simulations of two-phase flow in porous media using a typical GPU-accelerated workstation.
Simulation Testing of Embedded Flight Software

NASA Technical Reports Server (NTRS)

Shahabuddin, Mohammad; Reinholtz, William

2004-01-01

Virtual Real Time (VRT) is a computer program for testing embedded flight software by computational simulation in a workstation, in contradistinction to testing it in its target central processing unit (CPU). The disadvantages of testing in the target CPU include the need for an expensive test bed, the necessity for testers and programmers to take turns using the test bed, and the lack of software tools for debugging in a real-time environment. By virtue of its architecture, most of the flight software of the type in question is amenable to development and testing on workstations, for which there is an abundance of commercially available debugging and analysis software tools. Unfortunately, the timing of a workstation differs from that of a target CPU in a test bed. VRT, in conjunction with closed-loop simulation software, provides a capability for executing embedded flight software on a workstation in a close-to-real-time environment. A scale factor is used to convert between execution time in VRT on a workstation and execution on a target CPU. VRT includes high-resolution operating- system timers that enable the synchronization of flight software with simulation software and ground software, all running on different workstations.
High thermal conductivity liquid metal pad for heat dissipation in electronic devices

NASA Astrophysics Data System (ADS)

Lin, Zuoye; Liu, Huiqiang; Li, Qiuguo; Liu, Han; Chu, Sheng; Yang, Yuhua; Chu, Guang

2018-05-01

Novel thermal interface materials using Ag-doped Ga-based liquid metal were proposed for heat dissipation of electronic packaging and precision equipment. On one hand, the viscosity and fluidity of liquid metal was controlled to prevent leakage; on the other hand, the thermal conductivity of the Ga-based liquid metal was increased up to 46 W/mK by incorporating Ag nanoparticles. A series of experiments were performed to evaluate the heat dissipation performance on a CPU of smart-phone. The results demonstrated that the Ag-doped Ga-based liquid metal pad can effectively decrease the CPU temperature and change the heat flow path inside the smart-phone. To understand the heat flow path from CPU to screen through the interface material, heat dissipation mechanism was simulated and discussed.
Dynamic modelling and estimation of the error due to asynchronism in a redundant asynchronous multiprocessor system

NASA Technical Reports Server (NTRS)

Huynh, Loc C.; Duval, R. W.

1986-01-01

The use of Redundant Asynchronous Multiprocessor System to achieve ultrareliable Fault Tolerant Control Systems shows great promise. The development has been hampered by the inability to determine whether differences in the outputs of redundant CPU's are due to failures or to accrued error built up by slight differences in CPU clock intervals. This study derives an analytical dynamic model of the difference between redundant CPU's due to differences in their clock intervals and uses this model with on-line parameter identification to idenitify the differences in the clock intervals. The ability of this methodology to accurately track errors due to asynchronisity generate an error signal with the effect of asynchronisity removed and this signal may be used to detect and isolate actual system failures.
Interface Between Cosmetic and Migraine Surgery.

PubMed

Gfrerer, Lisa; Guyuron, Bahman

2017-10-01

This article describes connections between migraine surgery and cosmetic surgery including technical overlap, benefits for patients, and why every plastic surgeon may consider screening cosmetic surgery patients for migraine headache (MH). Contemporary migraine surgery began by an observation made following forehead rejuvenation, and the connection has continued. The prevalence of MH among females in the USA is 26%, and females account for 91% of cosmetic surgery procedures and 81-91% of migraine surgery procedures, which suggests substantial overlap between both patient populations. At the same time, recent reports show an overall increase in cosmetic facial procedures. Surgical techniques between some of the most commonly performed facial surgeries and migraine surgery overlap, creating opportunity for consolidation. In particular, forehead lift, blepharoplasty, septo-rhinoplasty, and rhytidectomy can easily be part of the migraine surgery, depending on the migraine trigger sites. Patients could benefit from simultaneous improvement in MH symptoms and rejuvenation of the face. Simple tools such as the Migraine Headache Index could be used to screen cosmetic surgery patients for MH. Similarity between patient populations, demand for both facial and MH procedures, and technical overlap suggest great incentive for plastic surgeons to combine both. Level of Evidence V This journal requires that authors assign a level of evidence to each article. For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online Instructions to Authors www.springer.com/00266 .
Multiband Radio Frequency Interconnect (MRFI) Technology For Next Generation Mobile/Airborne Computing Systems

DTIC Science & Technology

2017-02-01

enable high scalability and reconfigurability for inter-CPU/Memory communications with an increased number of communication channels in frequency ...interconnect technology (MRFI) to enable high scalability and re-configurability for inter-CPU/Memory communications with an increased number of communication ...testing in the University of California, Los Angeles (UCLA) Center for High Frequency Electronics, and Dr. Afshin Momtaz at Broadcom Corporation for
LU Factorization with Partial Pivoting for a Multi-CPU, Multi-GPU Shared Memory System

DOE Office of Scientific and Technical Information (OSTI.GOV)

Kurzak, Jakub; Luszczek, Pitior; Faverge, Mathieu

2012-03-01

LU factorization with partial pivoting is a canonical numerical procedure and the main component of the High Performance LINPACK benchmark. This article presents an implementation of the algorithm for a hybrid, shared memory, system with standard CPU cores and GPU accelerators. Performance in excess of one TeraFLOPS is achieved using four AMD Magny Cours CPUs and four NVIDIA Fermi GPUs.
Evaluation of patients with methamphetamine- and cocaine-related chest pain in a chest pain observation unit.

PubMed

Diercks, Deborah B; Kirk, J Douglas; Turnipseed, Samuel D; Amsterdam, Ezra A

2007-12-01

Risk of acute coronary events in patients with methamphetamine and cocaine intoxication has been described. Little is known about the need for additional evaluation in these patients who do not have evidence of myocardial infarction after the initial emergency department evaluation. We herein describe our experience with these patients in a chest pain unit (CPU) and the rate of cardiac-related chest pain in this group. Retrospective analysis of patients evaluated in our CPU from January 1, 2000 to December 16, 2004 with a history of chest pain. Patients who had a positive urine toxicologic screen for methamphetamine or cocaine were included. No patients had ECG or cardiac injury marker evidence of myocardial infarction or ischemia during the initial emergency department evaluation. A diagnosis of cardiac-related chest pain was based upon positive diagnostic testing (exercise stress testing, nuclear perfusion imaging, stress echocardiography, or coronary artery stenosis >70%). During the study period, 4568 patients were evaluated in the CPU. A total of 1690 (37%) of patients admitted to the CPU underwent urine toxicologic testing. The result of urine toxicologic test was positive for cocaine or methamphetamine in 224 (5%). In the 2871 patients who underwent diagnostic testing for coronary artery disease (CAD), 401 (14%) were found to have positive results. There was no difference in the prevalence of CAD between those with positive result for toxicology screens (26/156, 17%) and those without (375/2715, 13%, RR 1.2, 95% CI 0.8-1.7). These findings suggest a relatively high rate of CAD in patients with methamphetamine and cocaine use evaluated in a CPU.
Benchmarking worker nodes using LHCb productions and comparing with HEPSpec06

NASA Astrophysics Data System (ADS)

Charpentier, P.

2017-10-01

In order to estimate the capabilities of a computing slot with limited processing time, it is necessary to know with a rather good precision its “power”. This allows for example pilot jobs to match a task for which the required CPU-work is known, or to define the number of events to be processed knowing the CPU-work per event. Otherwise one always has the risk that the task is aborted because it exceeds the CPU capabilities of the resource. It also allows a better accounting of the consumed resources. The traditional way the CPU power is estimated in WLCG since 2007 is using the HEP-Spec06 benchmark (HS06) suite that was verified at the time to scale properly with a set of typical HEP applications. However, the hardware architecture of processors has evolved, all WLCG experiments moved to using 64-bit applications and use different compilation flags from those advertised for running HS06. It is therefore interesting to check the scaling of HS06 with the HEP applications. For this purpose, we have been using CPU intensive massive simulation productions from the LHCb experiment and compared their event throughput to the HS06 rating of the worker nodes. We also compared it with a much faster benchmark script that is used by the DIRAC framework used by LHCb for evaluating at run time the performance of the worker nodes. This contribution reports on the finding of these comparisons: the main observation is that the scaling with HS06 is no longer fulfilled, while the fast benchmarks have a better scaling but are less precise. One can also clearly see that some hardware or software features when enabled on the worker nodes may enhance their performance beyond expectation from either benchmark, depending on external factors.
Method and apparatus for measuring spatial uniformity of radiation

DOEpatents

Field, Halden

2002-01-01

A method and apparatus for measuring the spatial uniformity of the intensity of a radiation beam from a radiation source based on a single sampling time and/or a single pulse of radiation. The measuring apparatus includes a plurality of radiation detectors positioned on planar mounting plate to form a radiation receiving area that has a shape and size approximating the size and shape of the cross section of the radiation beam. The detectors concurrently receive portions of the radiation beam and transmit electrical signals representative of the intensity of impinging radiation to a signal processor circuit connected to each of the detectors and adapted to concurrently receive the electrical signals from the detectors and process with a central processing unit (CPU) the signals to determine intensities of the radiation impinging at each detector location. The CPU displays the determined intensities and relative intensity values corresponding to each detector location to an operator of the measuring apparatus on an included data display device. Concurrent sampling of each detector is achieved by connecting to each detector a sample and hold circuit that is configured to track the signal and store it upon receipt of a "capture" signal. A switching device then selectively retrieves the signals and transmits the signals to the CPU through a single analog to digital (A/D) converter. The "capture" signal. is then removed from the sample-and-hold circuits. Alternatively, concurrent sampling is achieved by providing an A/D converter for each detector, each of which transmits a corresponding digital signal to the CPU. The sampling or reading of the detector signals can be controlled by the CPU or level-detection and timing circuit.

A comparison of native GPU computing versus OpenACC for implementing flow-routing algorithms in hydrological applications

NASA Astrophysics Data System (ADS)

Rueda, Antonio J.; Noguera, José M.; Luque, Adrián

2016-02-01

In recent years GPU computing has gained wide acceptance as a simple low-cost solution for speeding up computationally expensive processing in many scientific and engineering applications. However, in most cases accelerating a traditional CPU implementation for a GPU is a non-trivial task that requires a thorough refactorization of the code and specific optimizations that depend on the architecture of the device. OpenACC is a promising technology that aims at reducing the effort required to accelerate C/C++/Fortran code on an attached multicore device. Virtually with this technology the CPU code only has to be augmented with a few compiler directives to identify the areas to be accelerated and the way in which data has to be moved between the CPU and GPU. Its potential benefits are multiple: better code readability, less development time, lower risk of errors and less dependency on the underlying architecture and future evolution of the GPU technology. Our aim with this work is to evaluate the pros and cons of using OpenACC against native GPU implementations in computationally expensive hydrological applications, using the classic D8 algorithm of O'Callaghan and Mark for river network extraction as case-study. We implemented the flow accumulation step of this algorithm in CPU, using OpenACC and two different CUDA versions, comparing the length and complexity of the code and its performance with different datasets. We advance that although OpenACC can not match the performance of a CUDA optimized implementation (×3.5 slower in average), it provides a significant performance improvement against a CPU implementation (×2-6) with by far a simpler code and less implementation effort.
Nucleus Accumbens Invulnerability to Methamphetamine Neurotoxicity

PubMed Central

Kuhn, Donald M.; Angoa-Pérez, Mariana; Thomas, David M.

2016-01-01

Methamphetamine (Meth) is a neurotoxic drug of abuse that damages neurons and nerve endings throughout the central nervous system. Emerging studies of human Meth addicts using both postmortem analyses of brain tissue and noninvasive imaging studies of intact brains have confirmed that Meth causes persistent structural abnormalities. Animal and human studies have also defined a number of significant functional problems and comorbid psychiatric disorders associated with long-term Meth abuse. This review summarizes the salient features of Meth-induced neurotoxicity with a focus on the dopamine (DA) neuronal system. DA nerve endings in the caudate-putamen (CPu) are damaged by Meth in a highly delimited manner. Even within the CPu, damage is remarkably heterogeneous, with ventral and lateral aspects showing the greatest deficits. The nucleus accumbens (NAc) is largely spared the damage that accompanies binge Meth intoxication, but relatively subtle changes in the disposition of DA in its nerve endings can lead to dramatic increases in Meth-induced toxicity in the CPu and overcome the normal resistance of the NAc to damage. In contrast to the CPu, where DA neuronal deficiencies are persistent, alterations in the NAc show a partial recovery. Animal models have been indispensable in studies of the causes and consequences of Meth neurotoxicity and in the development of new therapies. This research has shown that increases in cytoplasmic DA dramatically broaden the neurotoxic profile of Meth to include brain structures not normally targeted for damage. The resistance of the NAc to Meth-induced neurotoxicity and its ability to recover reveal a fundamentally different neuroplasticity by comparison to the CPu. Recruitment of the NAc as a target of Meth neurotoxicity by alterations in DA homeostasis is significant in light of the numerous important roles played by this brain structure. PMID:23382149
Nucleus accumbens invulnerability to methamphetamine neurotoxicity.

PubMed

Kuhn, Donald M; Angoa-Pérez, Mariana; Thomas, David M

2011-01-01

Methamphetamine (Meth) is a neurotoxic drug of abuse that damages neurons and nerve endings throughout the central nervous system. Emerging studies of human Meth addicts using both postmortem analyses of brain tissue and noninvasive imaging studies of intact brains have confirmed that Meth causes persistent structural abnormalities. Animal and human studies have also defined a number of significant functional problems and comorbid psychiatric disorders associated with long-term Meth abuse. This review summarizes the salient features of Meth-induced neurotoxicity with a focus on the dopamine (DA) neuronal system. DA nerve endings in the caudate-putamen (CPu) are damaged by Meth in a highly delimited manner. Even within the CPu, damage is remarkably heterogeneous, with ventral and lateral aspects showing the greatest deficits. The nucleus accumbens (NAc) is largely spared the damage that accompanies binge Meth intoxication, but relatively subtle changes in the disposition of DA in its nerve endings can lead to dramatic increases in Meth-induced toxicity in the CPu and overcome the normal resistance of the NAc to damage. In contrast to the CPu, where DA neuronal deficiencies are persistent, alterations in the NAc show a partial recovery. Animal models have been indispensable in studies of the causes and consequences of Meth neurotoxicity and in the development of new therapies. This research has shown that increases in cytoplasmic DA dramatically broaden the neurotoxic profile of Meth to include brain structures not normally targeted for damage. The resistance of the NAc to Meth-induced neurotoxicity and its ability to recover reveal a fundamentally different neuroplasticity by comparison to the CPu. Recruitment of the NAc as a target of Meth neurotoxicity by alterations in DA homeostasis is significant in light of the numerous important roles played by this brain structure.
A CPU/MIC Collaborated Parallel Framework for GROMACS on Tianhe-2 Supercomputer.

PubMed

Peng, Shaoliang; Yang, Shunyun; Su, Wenhe; Zhang, Xiaoyu; Zhang, Tenglilang; Liu, Weiguo; Zhao, Xingming

2017-06-16

Molecular Dynamics (MD) is the simulation of the dynamic behavior of atoms and molecules. As the most popular software for molecular dynamics, GROMACS cannot work on large-scale data because of limit computing resources. In this paper, we propose a CPU and Intel® Xeon Phi Many Integrated Core (MIC) collaborated parallel framework to accelerate GROMACS using the offload mode on a MIC coprocessor, with which the performance of GROMACS is improved significantly, especially with the utility of Tianhe-2 supercomputer. Furthermore, we optimize GROMACS so that it can run on both the CPU and MIC at the same time. In addition, we accelerate multi-node GROMACS so that it can be used in practice. Benchmarking on real data, our accelerated GROMACS performs very well and reduces computation time significantly. Source code: https://github.com/tianhe2/gromacs-mic.
High-performance computing on GPUs for resistivity logging of oil and gas wells

NASA Astrophysics Data System (ADS)

Glinskikh, V.; Dudaev, A.; Nechaev, O.; Surodina, I.

2017-10-01

We developed and implemented into software an algorithm for high-performance simulation of electrical logs from oil and gas wells using high-performance heterogeneous computing. The numerical solution of the 2D forward problem is based on the finite-element method and the Cholesky decomposition for solving a system of linear algebraic equations (SLAE). Software implementations of the algorithm used the NVIDIA CUDA technology and computing libraries are made, allowing us to perform decomposition of SLAE and find its solution on central processor unit (CPU) and graphics processor unit (GPU). The calculation time is analyzed depending on the matrix size and number of its non-zero elements. We estimated the computing speed on CPU and GPU, including high-performance heterogeneous CPU-GPU computing. Using the developed algorithm, we simulated resistivity data in realistic models.
Common elements of adolescent prevention programs: minimizing burden while maximizing reach.

PubMed

Boustani, Maya M; Frazier, Stacy L; Becker, Kimberly D; Bechor, Michele; Dinizulu, Sonya M; Hedemann, Erin R; Ogle, Robert R; Pasalich, Dave S

2015-03-01

A growing number of evidence-based youth prevention programs are available, but challenges related to dissemination and implementation limit their reach and impact. The current review identifies common elements across evidence-based prevention programs focused on the promotion of health-related outcomes in adolescents. We reviewed and coded descriptions of the programs for common practice and instructional elements. Problem-solving emerged as the most common practice element, followed by communication skills, and insight building. Psychoeducation, modeling, and role play emerged as the most common instructional elements. In light of significant comorbidity in poor outcomes for youth, and corresponding overlap in their underlying skills deficits, we propose that synthesizing the prevention literature using a common elements approach has the potential to yield novel information and inform prevention programming to minimize burden and maximize reach and impact for youth.
Software Techniques for Non-Von Neumann Architectures

DTIC Science & Technology

1990-01-01

Commtopo programmable Benes net.; hypercubic lattice for QCD Control CENTRALIZED Assign STATIC Memory :SHARED Synch UNIVERSAL Max-cpu 566 Proessor...boards (each = 4 floating point units, 2 multipliers) Cpu-size 32-bit floating point chips Perform 11.4 Gflops Market quantum chromodynamics ( QCD ...functions there should exist a capability to define hierarchies and lattices of complex objects. A complex object can be made up of a set of simple objects
Visual Media Reasoning - Terrain-based Geolocation

DTIC Science & Technology

2015-06-01

the drawings, specifications, or other data does not license the holder or any other person or corporation ; or convey any rights or permission to...3.4 Alternative Metric Investigation This section describes a graphics processor unit (GPU) based implementation in the NVIDIA CUDA programming...utilizing 2 concurrent CPU cores, each controlling a single Nvidia C2075 Tesla Fermi CUDA card. Figure 22 shows a comparison of the CPU and the GPU powered
Self-organized neural maps of human protein sequences.

PubMed Central

Ferrán, E. A.; Pflugfelder, B.; Ferrara, P.

1994-01-01

We have recently described a method based on artificial neural networks to cluster protein sequences into families. The network was trained with Kohonen's unsupervised learning algorithm using, as inputs, the matrix patterns derived from the dipeptide composition of the proteins. We present here a large-scale application of that method to classify the 1,758 human protein sequences stored in the SwissProt database (release 19.0), whose lengths are greater than 50 amino acids. In the final 2-dimensional topologically ordered map of 15 x 15 neurons, proteins belonging to known families were associated with the same neuron or with neighboring ones. Also, as an attempt to reduce the time-consuming learning procedure, we compared 2 learning protocols: one of 500 epochs (100 SUN CPU-hours [CPU-h]), and another one of 30 epochs (6.7 CPU-h). A further reduction of learning-computing time, by a factor of about 3.3, with similar protein clustering results, was achieved using a matrix of 11 x 11 components to represent the sequences. Although network training is time consuming, the classification of a new protein in the final ordered map is very fast (14.6 CPU-seconds). We also show a comparison between the artificial neural network approach and conventional methods of biosequence analysis. PMID:8019421
A CFD Heterogeneous Parallel Solver Based on Collaborating CPU and GPU

NASA Astrophysics Data System (ADS)

Lai, Jianqi; Tian, Zhengyu; Li, Hua; Pan, Sha

2018-03-01

Since Graphic Processing Unit (GPU) has a strong ability of floating-point computation and memory bandwidth for data parallelism, it has been widely used in the areas of common computing such as molecular dynamics (MD), computational fluid dynamics (CFD) and so on. The emergence of compute unified device architecture (CUDA), which reduces the complexity of compiling program, brings the great opportunities to CFD. There are three different modes for parallel solution of NS equations: parallel solver based on CPU, parallel solver based on GPU and heterogeneous parallel solver based on collaborating CPU and GPU. As we can see, GPUs are relatively rich in compute capacity but poor in memory capacity and the CPUs do the opposite. We need to make full use of the GPUs and CPUs, so a CFD heterogeneous parallel solver based on collaborating CPU and GPU has been established. Three cases are presented to analyse the solver’s computational accuracy and heterogeneous parallel efficiency. The numerical results agree well with experiment results, which demonstrate that the heterogeneous parallel solver has high computational precision. The speedup on a single GPU is more than 40 for laminar flow, it decreases for turbulent flow, but it still can reach more than 20. What’s more, the speedup increases as the grid size becomes larger.
Assessment of Linear Finite-Difference Poisson-Boltzmann Solvers

PubMed Central

Wang, Jun; Luo, Ray

2009-01-01

CPU time and memory usage are two vital issues that any numerical solvers for the Poisson-Boltzmann equation have to face in biomolecular applications. In this study we systematically analyzed the CPU time and memory usage of five commonly used finite-difference solvers with a large and diversified set of biomolecular structures. Our comparative analysis shows that modified incomplete Cholesky conjugate gradient and geometric multigrid are the most efficient in the diversified test set. For the two efficient solvers, our test shows that their CPU times increase approximately linearly with the numbers of grids. Their CPU times also increase almost linearly with the negative logarithm of the convergence criterion at very similar rate. Our comparison further shows that geometric multigrid performs better in the large set of tested biomolecules. However, modified incomplete Cholesky conjugate gradient is superior to geometric multigrid in molecular dynamics simulations of tested molecules. We also investigated other significant components in numerical solutions of the Poisson-Boltzmann equation. It turns out that the time-limiting step is the free boundary condition setup for the linear systems for the selected proteins if the electrostatic focusing is not used. Thus, development of future numerical solvers for the Poisson-Boltzmann equation should balance all aspects of the numerical procedures in realistic biomolecular applications. PMID:20063271
Wang-Landau sampling: Saving CPU time

NASA Astrophysics Data System (ADS)

Ferreira, L. S.; Jorge, L. N.; Leão, S. A.; Caparica, A. A.

2018-04-01

In this work we propose an improvement to the Wang-Landau (WL) method that allows an economy in CPU time of about 60% leading to the same results with the same accuracy. We used the 2D Ising model to show that one can initiate all WL simulations using the outputs of an advanced WL level from a previous simulation. We showed that up to the seventh WL level (f6) the simulations are not biased yet and can proceed to any value that the simulation from the very beginning would reach. As a result the initial WL levels can be simulated just once. It was also observed that the saving in CPU time is larger for larger lattice sizes, exactly where the computational cost is considerable. We carried out high-resolution simulations beginning initially from the first WL level (f0) and another beginning from the eighth WL level (f7) using all the data at the end of the previous level and showed that the results for the critical temperature Tc and the critical static exponents β and γ coincide within the error bars. Finally we applied the same procedure to the 1/2-spin Baxter-Wu model and the economy in CPU time was of about 64%.
A new finite element and finite difference hybrid method for computing electrostatics of ionic solvated biomolecule

NASA Astrophysics Data System (ADS)

Ying, Jinyong; Xie, Dexuan

2015-10-01

The Poisson-Boltzmann equation (PBE) is one widely-used implicit solvent continuum model for calculating electrostatics of ionic solvated biomolecule. In this paper, a new finite element and finite difference hybrid method is presented to solve PBE efficiently based on a special seven-overlapped box partition with one central box containing the solute region and surrounded by six neighboring boxes. In particular, an efficient finite element solver is applied to the central box while a fast preconditioned conjugate gradient method using a multigrid V-cycle preconditioning is constructed for solving a system of finite difference equations defined on a uniform mesh of each neighboring box. Moreover, the PBE domain, the box partition, and an interface fitted tetrahedral mesh of the central box can be generated adaptively for a given PQR file of a biomolecule. This new hybrid PBE solver is programmed in C, Fortran, and Python as a software tool for predicting electrostatics of a biomolecule in a symmetric 1:1 ionic solvent. Numerical results on two test models with analytical solutions and 12 proteins validate this new software tool, and demonstrate its high performance in terms of CPU time and memory usage.
Patterns and Practices for Future Architectures

DTIC Science & Technology

2014-08-01

14. SUBJECT TERMS computing architecture, graph algorithms, high-performance computing, big data , GPU 15. NUMBER OF PAGES 44 16. PRICE CODE 17...at Vertex 1 6 Figure 4: Data Structures Created by Kernel 1 of Single CPU, List Implementation Using the Graph in the Example from Section 1.2 9...Figure 5: Kernel 2 of Graph500 BFS Reference Implementation: Single CPU, List 10 Figure 6: Data Structures for Sequential CSR Algorithm 12 Figure 7
Conversion of Mass Storage Hierarchy in an IBM Computer Network

DTIC Science & Technology

1989-03-01

storage devices GUIDE IBM users’ group for DOS operating systems IBM International Business Machines IBM 370/145 CPU introduced in 1970 IBM 370/168 CPU...February 12, 1985, Information Systems Group, International Business Machines Corporation. "IBM 3090 Processor Complex" and 񓼪 Mass Storage System...34 Mainframe Journal, pp. 15-26, 64-65, Dallas, Texas, September-October 1987. 3. International Business Machines Corporation, Introduction to IBM 3S80 Storage
A parallel method of atmospheric correction for multispectral high spatial resolution remote sensing images

NASA Astrophysics Data System (ADS)

Zhao, Shaoshuai; Ni, Chen; Cao, Jing; Li, Zhengqiang; Chen, Xingfeng; Ma, Yan; Yang, Leiku; Hou, Weizhen; Qie, Lili; Ge, Bangyu; Liu, Li; Xing, Jin

2018-03-01

The remote sensing image is usually polluted by atmosphere components especially like aerosol particles. For the quantitative remote sensing applications, the radiative transfer model based atmospheric correction is used to get the reflectance with decoupling the atmosphere and surface by consuming a long computational time. The parallel computing is a solution method for the temporal acceleration. The parallel strategy which uses multi-CPU to work simultaneously is designed to do atmospheric correction for a multispectral remote sensing image. The parallel framework's flow and the main parallel body of atmospheric correction are described. Then, the multispectral remote sensing image of the Chinese Gaofen-2 satellite is used to test the acceleration efficiency. When the CPU number is increasing from 1 to 8, the computational speed is also increasing. The biggest acceleration rate is 6.5. Under the 8 CPU working mode, the whole image atmospheric correction costs 4 minutes.
On localization attacks against cloud infrastructure

NASA Astrophysics Data System (ADS)

Ge, Linqiang; Yu, Wei; Sistani, Mohammad Ali

2013-05-01

One of the key characteristics of cloud computing is the device and location independence that enables the user to access systems regardless of their location. Because cloud computing is heavily based on sharing resource, it is vulnerable to cyber attacks. In this paper, we investigate a localization attack that enables the adversary to leverage central processing unit (CPU) resources to localize the physical location of server used by victims. By increasing and reducing CPU usage through the malicious virtual machine (VM), the response time from the victim VM will increase and decrease correspondingly. In this way, by embedding the probing signal into the CPU usage and correlating the same pattern in the response time from the victim VM, the adversary can find the location of victim VM. To determine attack accuracy, we investigate features in both the time and frequency domains. We conduct both theoretical and experimental study to demonstrate the effectiveness of such an attack.
Storage strategies of eddy-current FE-BI model for GPU implementation

NASA Astrophysics Data System (ADS)

Bardel, Charles; Lei, Naiguang; Udpa, Lalita

2013-01-01

In the past few years graphical processing units (GPUs) have shown tremendous improvements in computational throughput over standard CPU architecture. However, this comes at the cost of restructuring the algorithms to meet the strengths and drawbacks of this GPU architecture. A major drawback is the state of limited memory, and hence storage of FE stiffness matrices on the GPU is important. In contrast to storage on CPU the GPU storage format has significant influence on the overall performance. This paper presents an investigation of a storage strategy in the implementation of a two-dimensional finite element-boundary integral (FE-BI) model for Eddy current NDE applications, on GPU architecture. Specifically, the high dimensional matrices are manipulated by examining the matrix structure and optimally splitting into structurally independent component matrices for efficient storage and retrieval of each component. Results obtained using the proposed approach are compared to those of conventional CPU implementation for validating the method.
Semiempirical Quantum Chemical Calculations Accelerated on a Hybrid Multicore CPU-GPU Computing Platform.

PubMed

Wu, Xin; Koslowski, Axel; Thiel, Walter

2012-07-10

In this work, we demonstrate that semiempirical quantum chemical calculations can be accelerated significantly by leveraging the graphics processing unit (GPU) as a coprocessor on a hybrid multicore CPU-GPU computing platform. Semiempirical calculations using the MNDO, AM1, PM3, OM1, OM2, and OM3 model Hamiltonians were systematically profiled for three types of test systems (fullerenes, water clusters, and solvated crambin) to identify the most time-consuming sections of the code. The corresponding routines were ported to the GPU and optimized employing both existing library functions and a GPU kernel that carries out a sequence of noniterative Jacobi transformations during pseudodiagonalization. The overall computation times for single-point energy calculations and geometry optimizations of large molecules were reduced by one order of magnitude for all methods, as compared to runs on a single CPU core.
Multigroup Monte Carlo on GPUs: Comparison of history- and event-based algorithms

DOE Office of Scientific and Technical Information (OSTI.GOV)

Hamilton, Steven P.; Slattery, Stuart R.; Evans, Thomas M.

This article presents an investigation of the performance of different multigroup Monte Carlo transport algorithms on GPUs with a discussion of both history-based and event-based approaches. Several algorithmic improvements are introduced for both approaches. By modifying the history-based algorithm that is traditionally favored in CPU-based MC codes to occasionally filter out dead particles to reduce thread divergence, performance exceeds that of either the pure history-based or event-based approaches. The impacts of several algorithmic choices are discussed, including performance studies on Kepler and Pascal generation NVIDIA GPUs for fixed source and eigenvalue calculations. Single-device performance equivalent to 20–40 CPU cores onmore » the K40 GPU and 60–80 CPU cores on the P100 GPU is achieved. Last, in addition, nearly perfect multi-device parallel weak scaling is demonstrated on more than 16,000 nodes of the Titan supercomputer.« less

Multigroup Monte Carlo on GPUs: Comparison of history- and event-based algorithms

DOE PAGES

Hamilton, Steven P.; Slattery, Stuart R.; Evans, Thomas M.

2017-12-22

This article presents an investigation of the performance of different multigroup Monte Carlo transport algorithms on GPUs with a discussion of both history-based and event-based approaches. Several algorithmic improvements are introduced for both approaches. By modifying the history-based algorithm that is traditionally favored in CPU-based MC codes to occasionally filter out dead particles to reduce thread divergence, performance exceeds that of either the pure history-based or event-based approaches. The impacts of several algorithmic choices are discussed, including performance studies on Kepler and Pascal generation NVIDIA GPUs for fixed source and eigenvalue calculations. Single-device performance equivalent to 20–40 CPU cores onmore » the K40 GPU and 60–80 CPU cores on the P100 GPU is achieved. Last, in addition, nearly perfect multi-device parallel weak scaling is demonstrated on more than 16,000 nodes of the Titan supercomputer.« less
A GPU-based calculation using the three-dimensional FDTD method for electromagnetic field analysis.

PubMed

Nagaoka, Tomoaki; Watanabe, Soichi

2010-01-01

Numerical simulations with the numerical human model using the finite-difference time domain (FDTD) method have recently been performed frequently in a number of fields in biomedical engineering. However, the FDTD calculation runs too slowly. We focus, therefore, on general purpose programming on the graphics processing unit (GPGPU). The three-dimensional FDTD method was implemented on the GPU using Compute Unified Device Architecture (CUDA). In this study, we used the NVIDIA Tesla C1060 as a GPGPU board. The performance of the GPU is evaluated in comparison with the performance of a conventional CPU and a vector supercomputer. The results indicate that three-dimensional FDTD calculations using a GPU can significantly reduce run time in comparison with that using a conventional CPU, even a native GPU implementation of the three-dimensional FDTD method, while the GPU/CPU speed ratio varies with the calculation domain and thread block size.
CUDA Fortran acceleration for the finite-difference time-domain method

NASA Astrophysics Data System (ADS)

Hadi, Mohammed F.; Esmaeili, Seyed A.

2013-05-01

A detailed description of programming the three-dimensional finite-difference time-domain (FDTD) method to run on graphical processing units (GPUs) using CUDA Fortran is presented. Two FDTD-to-CUDA thread-block mapping designs are investigated and their performances compared. Comparative assessment of trade-offs between GPU's shared memory and L1 cache is also discussed. This presentation is for the benefit of FDTD programmers who work exclusively with Fortran and are reluctant to port their codes to C in order to utilize GPU computing. The derived CUDA Fortran code is compared with an optimized CPU version that runs on a workstation-class CPU to present a realistic GPU to CPU run time comparison and thus help in making better informed investment decisions on FDTD code redesigns and equipment upgrades. All analyses are mirrored with CUDA C simulations to put in perspective the present state of CUDA Fortran development.
Near-Zero Emissions Oxy-Combustion Flue Gas Purification

DOE Office of Scientific and Technical Information (OSTI.GOV)

Minish Shah; Nich Degenstein; Monica Zanfir

2012-06-30

The objectives of this project were to carry out an experimental program to enable development and design of near zero emissions (NZE) CO{sub 2} processing unit (CPU) for oxy-combustion plants burning high and low sulfur coals and to perform commercial viability assessment. The NZE CPU was proposed to produce high purity CO{sub 2} from the oxycombustion flue gas, to achieve > 95% CO{sub 2} capture rate and to achieve near zero atmospheric emissions of criteria pollutants. Two SOx/NOx removal technologies were proposed depending on the SOx levels in the flue gas. The activated carbon process was proposed for power plantsmore » burning low sulfur coal and the sulfuric acid process was proposed for power plants burning high sulfur coal. For plants burning high sulfur coal, the sulfuric acid process would convert SOx and NOx in to commercial grade sulfuric and nitric acid by-products, thus reducing operating costs associated with SOx/NOx removal. For plants burning low sulfur coal, investment in separate FGD and SCR equipment for producing high purity CO{sub 2} would not be needed. To achieve high CO{sub 2} capture rates, a hybrid process that combines cold box and VPSA (vacuum pressure swing adsorption) was proposed. In the proposed hybrid process, up to 90% of CO{sub 2} in the cold box vent stream would be recovered by CO{sub 2} VPSA and then it would be recycled and mixed with the flue gas stream upstream of the compressor. The overall recovery from the process will be > 95%. The activated carbon process was able to achieve simultaneous SOx and NOx removal in a single step. The removal efficiencies were >99.9% for SOx and >98% for NOx, thus exceeding the performance targets of >99% and >95%, respectively. The process was also found to be suitable for power plants burning both low and high sulfur coals. Sulfuric acid process did not meet the performance expectations. Although it could achieve high SOx (>99%) and NOx (>90%) removal efficiencies, it could not produce by-product sulfuric and nitric acids that meet the commercial product specifications. The sulfuric acid will have to be disposed of by neutralization, thus lowering the value of the technology to same level as that of the activated carbon process. Therefore, it was decided to discontinue any further efforts on sulfuric acid process. Because of encouraging results on the activated carbon process, it was decided to add a new subtask on testing this process in a dual bed continuous unit. A 40 days long continuous operation test confirmed the excellent SOx/NOx removal efficiencies achieved in the batch operation. This test also indicated the need for further efforts on optimization of adsorption-regeneration cycle to maintain long term activity of activated carbon material at a higher level. The VPSA process was tested in a pilot unit. It achieved CO{sub 2} recovery of > 95% and CO{sub 2} purity of >80% (by vol.) from simulated cold box feed streams. The overall CO{sub 2} recovery from the cold box VPSA hybrid process was projected to be >99% for plants with low air ingress (2%) and >97% for plants with high air ingress (10%). Economic analysis was performed to assess value of the NZE CPU. The advantage of NZE CPU over conventional CPU is only apparent when CO{sub 2} capture and avoided costs are compared. For greenfield plants, cost of avoided CO{sub 2} and cost of captured CO{sub 2} are generally about 11-14% lower using the NZE CPU compared to using a conventional CPU. For older plants with high air intrusion, the cost of avoided CO{sub 2} and capture CO{sub 2} are about 18-24% lower using the NZE CPU. Lower capture costs for NZE CPU are due to lower capital investment in FGD/SCR and higher CO{sub 2} capture efficiency. In summary, as a result of this project, we now have developed one technology option for NZE CPU based on the activated carbon process and coldbox-VPSA hybrid process. This technology is projected to work for both low and high sulfur coal plants. The NZE CPU technology is projected to achieve near zero stack emissions, produce high purity CO{sub 2} relatively free of trace impurities and achieve ~99% CO{sub 2} capture rate while lowering the CO{sub 2} capture costs.« less
The STEM Lecture Hall: A Study of Effective Instructional Practices for Diverse Learners

NASA Astrophysics Data System (ADS)

Reimer, Lynn Christine

First-generation, low-income, underrepresented minority (URM) and female undergraduates are matriculating into science, technology, engineering, and math (STEM) majors at unprecedented levels. However, a disproportionate number of these students end up graduating in non-STEM disciplines. Attrition rates have been observed to spike in conjunction with introductory STEM courses in chemistry, biology, and physics. These "gateway" courses tend to be housed in large, impersonal lecture halls. First-generation and URM students struggle in this environment, possibly because of instructors' reliance on lecture-based content delivery and rote memorization. Recent social psychological studies suggest the problem may be related to cultural mismatch, or misalignment between independent learning norms typical of American universities and interdependent learning expectancies for first-generation and URM students. Value-affirming and utility-value interventions yield impressive academic achievement gains for these students. These findings overlap with a second body of literature on culturally responsive instruction. Active gateway learning practices that emphasize interactive instruction, frequent assessment, and epistemological instruction can be successful because of their propensity to incorporate values affirming and utility-value techniques. The present study observed instruction for gateway STEM courses over a three-year period at the University of California, Irvine (N = 13,856 undergraduates in 168 courses). Exploratory polychoric factor analysis was used to identify latent variables for observational data on gateway STEM instructional practices. Variables were regressed on institutional student data. Practices implemented in large lecture halls fall into three general categories: Faculty-Student Interaction, Epistemological Instruction, and Peer Interaction . The present study found that Faculty-Student Interaction was negatively associated with student outcomes for female and first-generation students; and Epistemological Instruction was negatively associated with student outcomes for Hispanic students. More importantly, Peer Interaction was positively associated with student outcomes for female, first-generation, and Hispanic students. Study implications and limitations are discussed with reference to the research literature.
A fast CT reconstruction scheme for a general multi-core PC.

PubMed

Zeng, Kai; Bai, Erwei; Wang, Ge

2007-01-01

Expensive computational cost is a severe limitation in CT reconstruction for clinical applications that need real-time feedback. A primary example is bolus-chasing computed tomography (CT) angiography (BCA) that we have been developing for the past several years. To accelerate the reconstruction process using the filtered backprojection (FBP) method, specialized hardware or graphics cards can be used. However, specialized hardware is expensive and not flexible. The graphics processing unit (GPU) in a current graphic card can only reconstruct images in a reduced precision and is not easy to program. In this paper, an acceleration scheme is proposed based on a multi-core PC. In the proposed scheme, several techniques are integrated, including utilization of geometric symmetry, optimization of data structures, single-instruction multiple-data (SIMD) processing, multithreaded computation, and an Intel C++ compilier. Our scheme maintains the original precision and involves no data exchange between the GPU and CPU. The merits of our scheme are demonstrated in numerical experiments against the traditional implementation. Our scheme achieves a speedup of about 40, which can be further improved by several folds using the latest quad-core processors.
SWPS3 - fast multi-threaded vectorized Smith-Waterman for IBM Cell/B.E. and x86/SSE2.

PubMed

Szalkowski, Adam; Ledergerber, Christian; Krähenbühl, Philipp; Dessimoz, Christophe

2008-10-29

We present swps3, a vectorized implementation of the Smith-Waterman local alignment algorithm optimized for both the Cell/BE and x86 architectures. The paper describes swps3 and compares its performances with several other implementations. Our benchmarking results show that swps3 is currently the fastest implementation of a vectorized Smith-Waterman on the Cell/BE, outperforming the only other known implementation by a factor of at least 4: on a Playstation 3, it achieves up to 8.0 billion cell-updates per second (GCUPS). Using the SSE2 instruction set, a quad-core Intel Pentium can reach 15.7 GCUPS. We also show that swps3 on this CPU is faster than a recent GPU implementation. Finally, we note that under some circumstances, alignments are computed at roughly the same speed as BLAST, a heuristic method. The Cell/BE can be a powerful platform to align biological sequences. Besides, the performance gap between exact and heuristic methods has almost disappeared, especially for long protein sequences.
SWPS3 – fast multi-threaded vectorized Smith-Waterman for IBM Cell/B.E. and ×86/SSE2

PubMed Central

Szalkowski, Adam; Ledergerber, Christian; Krähenbühl, Philipp; Dessimoz, Christophe

2008-01-01

Background We present swps3, a vectorized implementation of the Smith-Waterman local alignment algorithm optimized for both the Cell/BE and ×86 architectures. The paper describes swps3 and compares its performances with several other implementations. Findings Our benchmarking results show that swps3 is currently the fastest implementation of a vectorized Smith-Waterman on the Cell/BE, outperforming the only other known implementation by a factor of at least 4: on a Playstation 3, it achieves up to 8.0 billion cell-updates per second (GCUPS). Using the SSE2 instruction set, a quad-core Intel Pentium can reach 15.7 GCUPS. We also show that swps3 on this CPU is faster than a recent GPU implementation. Finally, we note that under some circumstances, alignments are computed at roughly the same speed as BLAST, a heuristic method. Conclusion The Cell/BE can be a powerful platform to align biological sequences. Besides, the performance gap between exact and heuristic methods has almost disappeared, especially for long protein sequences. PMID:18959793
Parallel seed-based approach to multiple protein structure similarities detection

DOE PAGES

Chapuis, Guillaume; Le Boudic-Jamin, Mathilde; Andonov, Rumen; ...

2015-01-01

Finding similarities between protein structures is a crucial task in molecular biology. Most of the existing tools require proteins to be aligned in order-preserving way and only find single alignments even when multiple similar regions exist. We propose a new seed-based approach that discovers multiple pairs of similar regions. Its computational complexity is polynomial and it comes with a quality guarantee—the returned alignments have both root mean squared deviations (coordinate-based as well as internal-distances based) lower than a given threshold, if such exist. We do not require the alignments to be order preserving (i.e., we consider nonsequential alignments), which makesmore » our algorithm suitable for detecting similar domains when comparing multidomain proteins as well as to detect structural repetitions within a single protein. Because the search space for nonsequential alignments is much larger than for sequential ones, the computational burden is addressed by extensive use of parallel computing techniques: a coarse-grain level parallelism making use of available CPU cores for computation and a fine-grain level parallelism exploiting bit-level concurrency as well as vector instructions.« less
Adaptive track scheduling to optimize concurrency and vectorization in GeantV

DOE PAGES

Apostolakis, J.; Bandieramonte, M.; Bitzes, G.; ...

2015-05-22

The GeantV project is focused on the R&D of new particle transport techniques to maximize parallelism on multiple levels, profiting from the use of both SIMD instructions and co-processors for the CPU-intensive calculations specific to this type of applications. In our approach, vectors of tracks belonging to multiple events and matching different locality criteria must be gathered and dispatched to algorithms having vector signatures. While the transport propagates tracks and changes their individual states, data locality becomes harder to maintain. The scheduling policy has to be changed to maintain efficient vectors while keeping an optimal level of concurrency. The modelmore » has complex dynamics requiring tuning the thresholds to switch between the normal regime and special modes, i.e. prioritizing events to allow flushing memory, adding new events in the transport pipeline to boost locality, dynamically adjusting the particle vector size or switching between vector to single track mode when vectorization causes only overhead. Lastly, this work requires a comprehensive study for optimizing these parameters to make the behaviour of the scheduler self-adapting, presenting here its initial results.« less
A Fast CT Reconstruction Scheme for a General Multi-Core PC

PubMed Central

Zeng, Kai; Bai, Erwei; Wang, Ge

2007-01-01

Expensive computational cost is a severe limitation in CT reconstruction for clinical applications that need real-time feedback. A primary example is bolus-chasing computed tomography (CT) angiography (BCA) that we have been developing for the past several years. To accelerate the reconstruction process using the filtered backprojection (FBP) method, specialized hardware or graphics cards can be used. However, specialized hardware is expensive and not flexible. The graphics processing unit (GPU) in a current graphic card can only reconstruct images in a reduced precision and is not easy to program. In this paper, an acceleration scheme is proposed based on a multi-core PC. In the proposed scheme, several techniques are integrated, including utilization of geometric symmetry, optimization of data structures, single-instruction multiple-data (SIMD) processing, multithreaded computation, and an Intel C++ compilier. Our scheme maintains the original precision and involves no data exchange between the GPU and CPU. The merits of our scheme are demonstrated in numerical experiments against the traditional implementation. Our scheme achieves a speedup of about 40, which can be further improved by several folds using the latest quad-core processors. PMID:18256731
SNAVA-A real-time multi-FPGA multi-model spiking neural network simulation architecture.

PubMed

Sripad, Athul; Sanchez, Giovanny; Zapata, Mireya; Pirrone, Vito; Dorta, Taho; Cambria, Salvatore; Marti, Albert; Krishnamourthy, Karthikeyan; Madrenas, Jordi

2018-01-01

Spiking Neural Networks (SNN) for Versatile Applications (SNAVA) simulation platform is a scalable and programmable parallel architecture that supports real-time, large-scale, multi-model SNN computation. This parallel architecture is implemented in modern Field-Programmable Gate Arrays (FPGAs) devices to provide high performance execution and flexibility to support large-scale SNN models. Flexibility is defined in terms of programmability, which allows easy synapse and neuron implementation. This has been achieved by using a special-purpose Processing Elements (PEs) for computing SNNs, and analyzing and customizing the instruction set according to the processing needs to achieve maximum performance with minimum resources. The parallel architecture is interfaced with customized Graphical User Interfaces (GUIs) to configure the SNN's connectivity, to compile the neuron-synapse model and to monitor SNN's activity. Our contribution intends to provide a tool that allows to prototype SNNs faster than on CPU/GPU architectures but significantly cheaper than fabricating a customized neuromorphic chip. This could be potentially valuable to the computational neuroscience and neuromorphic engineering communities. Copyright © 2017 Elsevier Ltd. All rights reserved.
Thermal Hotspots in CPU Die and It's Future Architecture

NASA Astrophysics Data System (ADS)

Wang, Jian; Hu, Fu-Yuan

Owing to the increasing core frequency and chip integration and the limited die dimension, the power densities in CPU chip have been increasing fastly. The high temperature on chip resulted by power densities threats the processor's performance and chip's reliability. This paper analyzed the thermal hotspots in die and their properties. A new architecture of function units in die - - hot units distributed architecture is suggested to cope with the problems of high power densities for future processor chip.
Restricted Collision List method for faster Direct Simulation Monte-Carlo (DSMC) collisions

DOE Office of Scientific and Technical Information (OSTI.GOV)

Macrossan, Michael N., E-mail: m.macrossan@uq.edu.au

The ‘Restricted Collision List’ (RCL) method for speeding up the calculation of DSMC Variable Soft Sphere collisions, with Borgnakke–Larsen (BL) energy exchange, is presented. The method cuts down considerably on the number of random collision parameters which must be calculated (deflection and azimuthal angles, and the BL energy exchange factors). A relatively short list of these parameters is generated and the parameters required in any cell are selected from this list. The list is regenerated at intervals approximately equal to the smallest mean collision time in the flow, and the chance of any particle re-using the same collision parameters inmore » two successive collisions is negligible. The results using this method are indistinguishable from those obtained with standard DSMC. The CPU time saving depends on how much of a DSMC calculation is devoted to collisions and how much is devoted to other tasks, such as moving particles and calculating particle interactions with flow boundaries. For 1-dimensional calculations of flow in a tube, the new method saves 20% of the CPU time per collision for VSS scattering with no energy exchange. With RCL applied to rotational energy exchange, the CPU saving can be greater; for small values of the rotational collision number, for which most collisions involve some rotational energy exchange, the CPU may be reduced by 50% or more.« less
GeantV: from CPU to accelerators

NASA Astrophysics Data System (ADS)

Amadio, G.; Ananya, A.; Apostolakis, J.; Arora, A.; Bandieramonte, M.; Bhattacharyya, A.; Bianchini, C.; Brun, R.; Canal, P.; Carminati, F.; Duhem, L.; Elvira, D.; Gheata, A.; Gheata, M.; Goulas, I.; Iope, R.; Jun, S.; Lima, G.; Mohanty, A.; Nikitina, T.; Novak, M.; Pokorski, W.; Ribon, A.; Sehgal, R.; Shadura, O.; Vallecorsa, S.; Wenzel, S.; Zhang, Y.

2016-10-01

The GeantV project aims to research and develop the next-generation simulation software describing the passage of particles through matter. While the modern CPU architectures are being targeted first, resources such as GPGPU, Intel© Xeon Phi, Atom or ARM cannot be ignored anymore by HEP CPU-bound applications. The proof of concept GeantV prototype has been mainly engineered for CPU's having vector units but we have foreseen from early stages a bridge to arbitrary accelerators. A software layer consisting of architecture/technology specific backends supports currently this concept. This approach allows to abstract out the basic types such as scalar/vector but also to formalize generic computation kernels using transparently library or device specific constructs based on Vc, CUDA, Cilk+ or Intel intrinsics. While the main goal of this approach is portable performance, as a bonus, it comes with the insulation of the core application and algorithms from the technology layer. This allows our application to be long term maintainable and versatile to changes at the backend side. The paper presents the first results of basket-based GeantV geometry navigation on the Intel© Xeon Phi KNC architecture. We present the scalability and vectorization study, conducted using Intel performance tools, as well as our preliminary conclusions on the use of accelerators for GeantV transport. We also describe the current work and preliminary results for using the GeantV transport kernel on GPUs.
Efficient Scalable Median Filtering Using Histogram-Based Operations.

PubMed

Green, Oded

2018-05-01

Median filtering is a smoothing technique for noise removal in images. While there are various implementations of median filtering for a single-core CPU, there are few implementations for accelerators and multi-core systems. Many parallel implementations of median filtering use a sorting algorithm for rearranging the values within a filtering window and taking the median of the sorted value. While using sorting algorithms allows for simple parallel implementations, the cost of the sorting becomes prohibitive as the filtering windows grow. This makes such algorithms, sequential and parallel alike, inefficient. In this work, we introduce the first software parallel median filtering that is non-sorting-based. The new algorithm uses efficient histogram-based operations. These reduce the computational requirements of the new algorithm while also accessing the image fewer times. We show an implementation of our algorithm for both the CPU and NVIDIA's CUDA supported graphics processing unit (GPU). The new algorithm is compared with several other leading CPU and GPU implementations. The CPU implementation has near perfect linear scaling with a speedup on a quad-core system. The GPU implementation is several orders of magnitude faster than the other GPU implementations for mid-size median filters. For small kernels, and , comparison-based approaches are preferable as fewer operations are required. Lastly, the new algorithm is open-source and can be found in the OpenCV library.
AMITIS: A 3D GPU-Based Hybrid-PIC Model for Space and Plasma Physics

NASA Astrophysics Data System (ADS)

Fatemi, Shahab; Poppe, Andrew R.; Delory, Gregory T.; Farrell, William M.

2017-05-01

We have developed, for the first time, an advanced modeling infrastructure in space simulations (AMITIS) with an embedded three-dimensional self-consistent grid-based hybrid model of plasma (kinetic ions and fluid electrons) that runs entirely on graphics processing units (GPUs). The model uses NVIDIA GPUs and their associated parallel computing platform, CUDA, developed for general purpose processing on GPUs. The model uses a single CPU-GPU pair, where the CPU transfers data between the system and GPU memory, executes CUDA kernels, and writes simulation outputs on the disk. All computations, including moving particles, calculating macroscopic properties of particles on a grid, and solving hybrid model equations are processed on a single GPU. We explain various computing kernels within AMITIS and compare their performance with an already existing well-tested hybrid model of plasma that runs in parallel using multi-CPU platforms. We show that AMITIS runs ∼10 times faster than the parallel CPU-based hybrid model. We also introduce an implicit solver for computation of Faraday’s Equation, resulting in an explicit-implicit scheme for the hybrid model equation. We show that the proposed scheme is stable and accurate. We examine the AMITIS energy conservation and show that the energy is conserved with an error < 0.2% after 500,000 timesteps, even when a very low number of particles per cell is used.
A Burst-Based “Hebbian” Learning Rule at Retinogeniculate Synapses Links Retinal Waves to Activity-Dependent Refinement

PubMed Central

Butts, Daniel A; Kanold, Patrick O; Shatz, Carla J

2007-01-01

Patterned spontaneous activity in the developing retina is necessary to drive synaptic refinement in the lateral geniculate nucleus (LGN). Using perforated patch recordings from neurons in LGN slices during the period of eye segregation, we examine how such burst-based activity can instruct this refinement. Retinogeniculate synapses have a novel learning rule that depends on the latencies between pre- and postsynaptic bursts on the order of one second: coincident bursts produce long-lasting synaptic enhancement, whereas non-overlapping bursts produce mild synaptic weakening. It is consistent with “Hebbian” development thought to exist at this synapse, and we demonstrate computationally that such a rule can robustly use retinal waves to drive eye segregation and retinotopic refinement. Thus, by measuring plasticity induced by natural activity patterns, synaptic learning rules can be linked directly to their larger role in instructing the patterning of neural connectivity. PMID:17341130
Physician discretion is safe and may lower stress test utilization in emergency department chest pain unit patients.

PubMed

Napoli, Anthony M; Arrighi, James A; Siket, Matthew S; Gibbs, Frantz J

2012-03-01

Chest pain unit (CPU) observation with defined stress utilization protocols is a common management option for low-risk emergency department patients. We sought to evaluate the safety of a joint emergency medicine and cardiology staffed CPU. Prospective observational trial of consecutive patients admitted to an emergency department CPU was conducted. A standard 6-hour observation protocol was followed by cardiology consultation and stress utilization largely at their discretion. Included patients were at low/intermediate risk by the American Heart Association, had nondiagnostic electrocardiograms, and a normal initial troponin. Excluded patients were those with an acute comorbidity, age >75, and a history of coronary artery disease, or had a coexistent problem restricting 24-hour observation. Primary outcome was 30-day major adverse cardiovascular events-defined as death, nonfatal acute myocardial infarction, revascularization, or out-of-hospital cardiac arrest. A total of 1063 patients were enrolled over 8 months. The mean age of the patients was 52.8 ± 11.8 years, and 51% (95% confidence interval [CI], 48-54) were female. The mean thrombolysis in myocardial infarction and Diamond & Forrester scores were 0.6% (95% CI, 0.51-0.62) and 33% (95% CI, 31-35), respectively. In all, 51% (95% CI, 48-54) received stress testing (52% nuclear stress, 39% stress echocardiogram, 5% exercise, 4% other). In all, 0.9% patients (n = 10, 95% CI, 0.4-1.5) were diagnosed with a non-ST elevation myocardial infarction and 2.2% (n = 23, 95% CI, 1.3-3) with acute coronary syndrome. There was 1 (95% CI, 0%-0.3%) case of a 30-day major adverse cardiovascular events. The 51% stress test utilization rate was less than the range reported in previous CPU studies (P < 0.05). Joint emergency medicine and cardiology management of patients within a CPU protocol is safe, efficacious, and may safely reduce stress testing rates.
3D Kirchhoff depth migration algorithm: A new scalable approach for parallelization on multicore CPU based cluster

NASA Astrophysics Data System (ADS)

Rastogi, Richa; Londhe, Ashutosh; Srivastava, Abhishek; Sirasala, Kirannmayi M.; Khonde, Kiran

2017-03-01

In this article, a new scalable 3D Kirchhoff depth migration algorithm is presented on state of the art multicore CPU based cluster. Parallelization of 3D Kirchhoff depth migration is challenging due to its high demand of compute time, memory, storage and I/O along with the need of their effective management. The most resource intensive modules of the algorithm are traveltime calculations and migration summation which exhibit an inherent trade off between compute time and other resources. The parallelization strategy of the algorithm largely depends on the storage of calculated traveltimes and its feeding mechanism to the migration process. The presented work is an extension of our previous work, wherein a 3D Kirchhoff depth migration application for multicore CPU based parallel system had been developed. Recently, we have worked on improving parallel performance of this application by re-designing the parallelization approach. The new algorithm is capable to efficiently migrate both prestack and poststack 3D data. It exhibits flexibility for migrating large number of traces within the available node memory and with minimal requirement of storage, I/O and inter-node communication. The resultant application is tested using 3D Overthrust data on PARAM Yuva II, which is a Xeon E5-2670 based multicore CPU cluster with 16 cores/node and 64 GB shared memory. Parallel performance of the algorithm is studied using different numerical experiments and the scalability results show striking improvement over its previous version. An impressive 49.05X speedup with 76.64% efficiency is achieved for 3D prestack data and 32.00X speedup with 50.00% efficiency for 3D poststack data, using 64 nodes. The results also demonstrate the effectiveness and robustness of the improved algorithm with high scalability and efficiency on a multicore CPU cluster.

Aligner optimization increases accuracy and decreases compute times in multi-species sequence data.

PubMed

Robinson, Kelly M; Hawkins, Aziah S; Santana-Cruz, Ivette; Adkins, Ricky S; Shetty, Amol C; Nagaraj, Sushma; Sadzewicz, Lisa; Tallon, Luke J; Rasko, David A; Fraser, Claire M; Mahurkar, Anup; Silva, Joana C; Dunning Hotopp, Julie C

2017-09-01

As sequencing technologies have evolved, the tools to analyze these sequences have made similar advances. However, for multi-species samples, we observed important and adverse differences in alignment specificity and computation time for bwa- mem (Burrows-Wheeler aligner-maximum exact matches) relative to bwa-aln. Therefore, we sought to optimize bwa-mem for alignment of data from multi-species samples in order to reduce alignment time and increase the specificity of alignments. In the multi-species cases examined, there was one majority member (i.e. Plasmodium falciparum or Brugia malayi ) and one minority member (i.e. human or the Wolbachia endosymbiont w Bm) of the sequence data. Increasing bwa-mem seed length from the default value reduced the number of read pairs from the majority sequence member that incorrectly aligned to the reference genome of the minority sequence member. Combining both source genomes into a single reference genome increased the specificity of mapping, while also reducing the central processing unit (CPU) time. In Plasmodium , at a seed length of 18 nt, 24.1 % of reads mapped to the human genome using 1.7±0.1 CPU hours, while 83.6 % of reads mapped to the Plasmodium genome using 0.2±0.0 CPU hours (total: 107.7 % reads mapping; in 1.9±0.1 CPU hours). In contrast, 97.1 % of the reads mapped to a combined Plasmodium- human reference in only 0.7±0.0 CPU hours. Overall, the results suggest that combining all references into a single reference database and using a 23 nt seed length reduces the computational time, while maximizing specificity. Similar results were found for simulated sequence reads from a mock metagenomic data set. We found similar improvements to computation time in a publicly available human-only data set.
A Specialized Diacylglycerol Acyltransferase Contributes to the Extreme Medium-Chain Fatty Acid Content of Cuphea Seed Oil1[OPEN

PubMed Central

Iskandarov, Umidjon; Silva, Jillian E.; Andersson, Mariette

2017-01-01

Seed oils of many Cuphea sp. contain >90% of medium-chain fatty acids, such as decanoic acid (10:0). These seed oils, which are among the most compositionally variant in the plant kingdom, arise from specialized fatty acid biosynthetic enzymes and specialized acyltransferases. These include lysophosphatidic acid acyltransferases (LPAT) and diacylglycerol acyltransferases (DGAT) that are required for successive acylation of medium-chain fatty acids in the sn-2 and sn-3 positions of seed triacylglycerols (TAGs). Here we report the identification of a cDNA for a DGAT1-type enzyme, designated CpuDGAT1, from the transcriptome of C. avigera var pulcherrima developing seeds. Microsomes of camelina (Camelina sativa) seeds engineered for CpuDGAT1 expression displayed DGAT activity with 10:0-CoA and the diacylglycerol didecanoyl, that was approximately 4-fold higher than that in camelina seed microsomes lacking CpuDGAT1. In addition, coexpression in camelina seeds of CpuDGAT1 with a C. viscosissima FatB thioesterase (CvFatB1) that generates 10:0 resulted in TAGs with nearly 15 mol % of 10:0. More strikingly, expression of CpuDGAT1 and CvFatB1 with the previously described CvLPAT2, a 10:0-CoA-specific Cuphea LPAT, increased 10:0 amounts to 25 mol % in camelina seed TAG. These TAGs contained up to 40 mol % 10:0 in the sn-2 position, nearly double the amounts obtained from coexpression of CvFatB1 and CvLPAT2 alone. Although enriched in diacylglycerol, 10:0 was not detected in phosphatidylcholine in these seeds. These findings are consistent with channeling of 10:0 into TAG through the combined activities of specialized LPAT and DGAT activities and demonstrate the biotechnological use of these enzymes to generate 10:0-rich seed oils. PMID:28325847
A Specialized Diacylglycerol Acyltransferase Contributes to the Extreme Medium-Chain Fatty Acid Content of Cuphea Seed Oil.

PubMed

Iskandarov, Umidjon; Silva, Jillian E; Kim, Hae Jin; Andersson, Mariette; Cahoon, Rebecca E; Mockaitis, Keithanne; Cahoon, Edgar B

2017-05-01

Seed oils of many Cuphea sp. contain >90% of medium-chain fatty acids, such as decanoic acid (10:0). These seed oils, which are among the most compositionally variant in the plant kingdom, arise from specialized fatty acid biosynthetic enzymes and specialized acyltransferases. These include lysophosphatidic acid acyltransferases (LPAT) and diacylglycerol acyltransferases (DGAT) that are required for successive acylation of medium-chain fatty acids in the sn -2 and sn -3 positions of seed triacylglycerols (TAGs). Here we report the identification of a cDNA for a DGAT1-type enzyme, designated CpuDGAT1, from the transcriptome of C. avigera var pulcherrima developing seeds. Microsomes of camelina ( Camelina sativa ) seeds engineered for CpuDGAT1 expression displayed DGAT activity with 10:0-CoA and the diacylglycerol didecanoyl, that was approximately 4-fold higher than that in camelina seed microsomes lacking CpuDGAT1. In addition, coexpression in camelina seeds of CpuDGAT1 with a C. viscosissima FatB thioesterase (CvFatB1) that generates 10:0 resulted in TAGs with nearly 15 mol % of 10:0. More strikingly, expression of CpuDGAT1 and CvFatB1 with the previously described CvLPAT2, a 10:0-CoA-specific Cuphea LPAT, increased 10:0 amounts to 25 mol % in camelina seed TAG. These TAGs contained up to 40 mol % 10:0 in the sn -2 position, nearly double the amounts obtained from coexpression of CvFatB1 and CvLPAT2 alone. Although enriched in diacylglycerol, 10:0 was not detected in phosphatidylcholine in these seeds. These findings are consistent with channeling of 10:0 into TAG through the combined activities of specialized LPAT and DGAT activities and demonstrate the biotechnological use of these enzymes to generate 10:0-rich seed oils. © 2017 American Society of Plant Biologists. All Rights Reserved.
SU-E-T-423: Fast Photon Convolution Calculation with a 3D-Ideal Kernel On the GPU

DOE Office of Scientific and Technical Information (OSTI.GOV)

Moriya, S; Sato, M; Tachibana, H

Purpose: The calculation time is a trade-off for improving the accuracy of convolution dose calculation with fine calculation spacing of the KERMA kernel. We investigated to accelerate the convolution calculation using an ideal kernel on the Graphic Processing Units (GPU). Methods: The calculation was performed on the AMD graphics hardware of Dual FirePro D700 and our algorithm was implemented using the Aparapi that convert Java bytecode to OpenCL. The process of dose calculation was separated with the TERMA and KERMA steps. The dose deposited at the coordinate (x, y, z) was determined in the process. In the dose calculation runningmore » on the central processing unit (CPU) of Intel Xeon E5, the calculation loops were performed for all calculation points. On the GPU computation, all of the calculation processes for the points were sent to the GPU and the multi-thread computation was done. In this study, the dose calculation was performed in a water equivalent homogeneous phantom with 150{sup 3} voxels (2 mm calculation grid) and the calculation speed on the GPU to that on the CPU and the accuracy of PDD were compared. Results: The calculation time for the GPU and the CPU were 3.3 sec and 4.4 hour, respectively. The calculation speed for the GPU was 4800 times faster than that for the CPU. The PDD curve for the GPU was perfectly matched to that for the CPU. Conclusion: The convolution calculation with the ideal kernel on the GPU was clinically acceptable for time and may be more accurate in an inhomogeneous region. Intensity modulated arc therapy needs dose calculations for different gantry angles at many control points. Thus, it would be more practical that the kernel uses a coarse spacing technique if the calculation is faster while keeping the similar accuracy to a current treatment planning system.« less
A Framework for Batched and GPU-Resident Factorization Algorithms Applied to Block Householder Transformations

DOE Office of Scientific and Technical Information (OSTI.GOV)

Dong, Tingzing Tim; Tomov, Stanimire Z; Luszczek, Piotr R

As modern hardware keeps evolving, an increasingly effective approach to developing energy efficient and high-performance solvers is to design them to work on many small size and independent problems. Many applications already need this functionality, especially for GPUs, which are currently known to be about four to five times more energy efficient than multicore CPUs. We describe the development of one-sided factorizations that work for a set of small dense matrices in parallel, and we illustrate our techniques on the QR factorization based on Householder transformations. We refer to this mode of operation as a batched factorization. Our approach ismore » based on representing the algorithms as a sequence of batched BLAS routines for GPU-only execution. This is in contrast to the hybrid CPU-GPU algorithms that rely heavily on using the multicore CPU for specific parts of the workload. But for a system to benefit fully from the GPU's significantly higher energy efficiency, avoiding the use of the multicore CPU must be a primary design goal, so the system can rely more heavily on the more efficient GPU. Additionally, this will result in the removal of the costly CPU-to-GPU communication. Furthermore, we do not use a single symmetric multiprocessor(on the GPU) to factorize a single problem at a time. We illustrate how our performance analysis, and the use of profiling and tracing tools, guided the development and optimization of our batched factorization to achieve up to a 2-fold speedup and a 3-fold energy efficiency improvement compared to our highly optimized batched CPU implementations based on the MKL library(when using two sockets of Intel Sandy Bridge CPUs). Compared to a batched QR factorization featured in the CUBLAS library for GPUs, we achieved up to 5x speedup on the K40 GPU.« less
Methamphetamine-induced neurotoxicity is attenuated in transgenic mice with a null mutation for interleukin-6.

PubMed

Ladenheim, B; Krasnova, I N; Deng, X; Oyler, J M; Polettini, A; Moran, T H; Huestis, M A; Cadet, J L

2000-12-01

Increasing evidence implicates apoptosis as a major mechanism of cell death in methamphetamine (METH) neurotoxicity. The involvement of a neuroimmune component in apoptotic cell death after injury or chemical damage suggests that cytokines may play a role in METH effects. In the present study, we examined if the absence of IL-6 in knockout (IL-6-/-) mice could provide protection against METH-induced neurotoxicity. Administration of METH resulted in a significant reduction of [(125)I]RTI-121-labeled dopamine transporters in the caudate-putamen (CPu) and cortex as well as depletion of dopamine in the CPu and frontal cortex of wild-type mice. However, these METH-induced effects were significantly attenuated in IL-6-/- animals. METH also caused a decrease in serotonin levels in the CPu and hippocampus of wild-type mice, but no reduction was observed in IL-6-/- animals. Moreover, METH induced decreases in [(125)I]RTI-55-labeled serotonin transporters in the hippocampal CA3 region and in the substantia nigra-reticulata but increases in serotonin transporters in the CPu and cingulate cortex in wild-type animals, all of which were attenuated in IL-6-/- mice. Additionally, METH caused increased gliosis in the CPu and cortices of wild-type mice as measured by [(3)H]PK-11195 binding; this gliotic response was almost completely inhibited in IL-6-/- animals. There was also significant protection against METH-induced DNA fragmentation, measured by the number of terminal deoxynucleotidyl transferase-mediated dUTP nick-end-labeled (TUNEL) cells in the cortices. The protective effects against METH toxicity observed in the IL-6-/- mice were not caused by differences in temperature elevation or in METH accumulation in wild-type and mutant animals. Therefore, these observations support the proposition that IL-6 may play an important role in the neurotoxicity of METH.
Structure of the airflow above surface waves

NASA Astrophysics Data System (ADS)

Buckley, Marc; Veron, Fabrice

2016-04-01

Weather, climate and upper ocean patterns are controlled by the exchanges of momentum, heat, mass, and energy across the ocean surface. These fluxes are, in turn, influenced by the small-scale physics at the wavy air-sea interface. We present laboratory measurements of the fine-scale airflow structure above waves, achieved in over 15 different wind-wave conditions, with wave ages Cp/u* ranging from 1.4 to 66.7 (where Cp is the peak phase speed of the waves, and u* the air friction velocity). The experiments were performed in the large (42-m long) wind-wave-current tank at University of Delaware's Air-Sea Interaction laboratory (USA). A combined Particle Image Velocimetry and Laser Induced Fluorescence system was specifically developed for this study, and provided two-dimensional airflow velocity measurement as low as 100 um above the air-water interface. Starting at very low wind speeds (U10~2m/s), we directly observe coherent turbulent structures within the buffer and logarithmic layers of the airflow above the air-water interface, whereby low horizontal velocity air is ejected away from the surface, and higher velocity fluid is swept downward. Wave phase coherent quadrant analysis shows that such turbulent momentum flux events are wave-phase dependent. Airflow separation events are directly observed over young wind waves (Cp/u*<3.7) and counted using measured vorticity and surface viscous stress criteria. Detached high spanwise vorticity layers cause intense wave-coherent turbulence downwind of wave crests, as shown by wave-phase averaging of turbulent momentum fluxes. Mean wave-coherent airflow motions and fluxes also show strong phase-locked patterns, including a sheltering effect, upwind of wave crests over old mechanically generated swells (Cp/u*=31.7), and downwind of crests over young wind waves (Cp/u*=3.7). Over slightly older wind waves (Cp/u* = 6.5), the measured wave-induced airflow perturbations are qualitatively consistent with linear critical layer theory.
Evaluation of the CPU time for solving the radiative transfer equation with high-order resolution schemes applying the normalized weighting-factor method

NASA Astrophysics Data System (ADS)

Xamán, J.; Zavala-Guillén, I.; Hernández-López, I.; Uriarte-Flores, J.; Hernández-Pérez, I.; Macías-Melo, E. V.; Aguilar-Castro, K. M.

2018-03-01

In this paper, we evaluated the convergence rate (CPU time) of a new mathematical formulation for the numerical solution of the radiative transfer equation (RTE) with several High-Order (HO) and High-Resolution (HR) schemes. In computational fluid dynamics, this procedure is known as the Normalized Weighting-Factor (NWF) method and it is adopted here. The NWF method is used to incorporate the high-order resolution schemes in the discretized RTE. The NWF method is compared, in terms of computer time needed to obtain a converged solution, with the widely used deferred-correction (DC) technique for the calculations of a two-dimensional cavity with emitting-absorbing-scattering gray media using the discrete ordinates method. Six parameters, viz. the grid size, the order of quadrature, the absorption coefficient, the emissivity of the boundary surface, the under-relaxation factor, and the scattering albedo are considered to evaluate ten schemes. The results showed that using the DC method, in general, the scheme that had the lowest CPU time is the SOU. In contrast, with the results of theDC procedure the CPU time for DIAMOND and QUICK schemes using the NWF method is shown to be, between the 3.8 and 23.1% faster and 12.6 and 56.1% faster, respectively. However, the other schemes are more time consuming when theNWFis used instead of the DC method. Additionally, a second test case was presented and the results showed that depending on the problem under consideration, the NWF procedure may be computationally faster or slower that the DC method. As an example, the CPU time for QUICK and SMART schemes are 61.8 and 203.7%, respectively, slower when the NWF formulation is used for the second test case. Finally, future researches to explore the computational cost of the NWF method in more complex problems are required.
GPU accelerated generation of digitally reconstructed radiographs for 2-D/3-D image registration.

PubMed

Dorgham, Osama M; Laycock, Stephen D; Fisher, Mark H

2012-09-01

Recent advances in programming languages for graphics processing units (GPUs) provide developers with a convenient way of implementing applications which can be executed on the CPU and GPU interchangeably. GPUs are becoming relatively cheap, powerful, and widely available hardware components, which can be used to perform intensive calculations. The last decade of hardware performance developments shows that GPU-based computation is progressing significantly faster than CPU-based computation, particularly if one considers the execution of highly parallelisable algorithms. Future predictions illustrate that this trend is likely to continue. In this paper, we introduce a way of accelerating 2-D/3-D image registration by developing a hybrid system which executes on the CPU and utilizes the GPU for parallelizing the generation of digitally reconstructed radiographs (DRRs). Based on the advancements of the GPU over the CPU, it is timely to exploit the benefits of many-core GPU technology by developing algorithms for DRR generation. Although some previous work has investigated the rendering of DRRs using the GPU, this paper investigates approximations which reduce the computational overhead while still maintaining a quality consistent with that needed for 2-D/3-D registration with sufficient accuracy to be clinically acceptable in certain applications of radiation oncology. Furthermore, by comparing implementations of 2-D/3-D registration on the CPU and GPU, we investigate current performance and propose an optimal framework for PC implementations addressing the rigid registration problem. Using this framework, we are able to render DRR images from a 256×256×133 CT volume in ~24 ms using an NVidia GeForce 8800 GTX and in ~2 ms using NVidia GeForce GTX 580. In addition to applications requiring fast automatic patient setup, these levels of performance suggest image-guided radiation therapy at video frame rates is technically feasible using relatively low cost PC architecture.
Techniques for increasing the efficiency of Earth gravity calculations for precision orbit determination

NASA Technical Reports Server (NTRS)

Smith, R. L.; Lyubomirsky, A. S.

1981-01-01

Two techniques were analyzed. The first is a representation using Chebyshev expansions in three-dimensional cells. The second technique employs a temporary file for storing the components of the nonspherical gravity force. Computer storage requirements and relative CPU time requirements are presented. The Chebyshev gravity representation can provide a significant reduction in CPU time in precision orbit calculations, but at the cost of a large amount of direct-access storage space, which is required for a global model.
Convolutional Neural Network on Embedded Linux(trademark) System-on-Chip: A Methodology and Performance Benchmark

DTIC Science & Technology

2016-05-01

A9 CPU and 15 W for the i7 CPU. A method of accelerating this computation is by using a customized hardware unit called a field- programmable gate...implementation of custom logic to accelerate com- putational workloads. This FPGA fabric, in addition to the standard programmable logic, contains 220...chip; field- programmable gate array Daniel Gebhardt U U U U 18 (619) 553-2786 INITIAL DISTRIBUTION 84300 Library (2) 85300 Archive/Stock (1
Convolutional Neural Network on Embedded Linux System-on-Chip: A Methodology and Performance Benchmark

DTIC Science & Technology

2016-05-01

A9 CPU and 15 W for the i7 CPU. A method of accelerating this computation is by using a customized hardware unit called a field- programmable gate...implementation of custom logic to accelerate com- putational workloads. This FPGA fabric, in addition to the standard programmable logic, contains 220...chip; field- programmable gate array Daniel Gebhardt U U U U 18 (619) 553-2786 INITIAL DISTRIBUTION 84300 Library (2) 85300 Archive/Stock (1
Finite difference numerical method for the superlattice Boltzmann transport equation and case comparison of CPU(C) and GPU(CUDA) implementations

DOE Office of Scientific and Technical Information (OSTI.GOV)

Priimak, Dmitri

2014-12-01

We present a finite difference numerical algorithm for solving two dimensional spatially homogeneous Boltzmann transport equation which describes electron transport in a semiconductor superlattice subject to crossed time dependent electric and constant magnetic fields. The algorithm is implemented both in C language targeted to CPU and in CUDA C language targeted to commodity NVidia GPU. We compare performances and merits of one implementation versus another and discuss various software optimisation techniques.
The development of an interim generalized gate logic software simulator

NASA Technical Reports Server (NTRS)

Mcgough, J. G.; Nemeroff, S.

1985-01-01

A proof-of-concept computer program called IGGLOSS (Interim Generalized Gate Logic Software Simulator) was developed and is discussed. The simulator engine was designed to perform stochastic estimation of self test coverage (fault-detection latency times) of digital computers or systems. A major attribute of the IGGLOSS is its high-speed simulation: 9.5 x 1,000,000 gates/cpu sec for nonfaulted circuits and 4.4 x 1,000,000 gates/cpu sec for faulted circuits on a VAX 11/780 host computer.
An Xdata Architecture for Federated Graph Models and Multi-tier Asymmetric Computing

DTIC Science & Technology

2014-01-01

Wikipedia, a scale-free random graph (kron), Akamai trace route data, Bitcoin transaction data, and a Twitter follower network. We present results for...3x (SSSP on a random graph) and nearly 300x (Akamai and Bitcoin ) over the CPU performance of a well-known and widely deployed CPU-based graph...provided better throughput for smaller frontiers such as roadmaps or the Bitcoin data set. In our work, we have focused on two-phase kernels, but it
Autonomic Recovery: HyperCheck: A Hardware-Assisted Integrity Monitor

DTIC Science & Technology

2013-08-01

system (OS). HyperCheck leverages the CPU System Management Mode ( SMM ), present in x86 systems, to securely generate and transmit the full state of the...HyperCheck harnesses the CPU System Management Mode ( SMM ) which is present in all x86 commodity systems to create a snapshot view of the current state of the...protect the software above it. Our assumptions are that the attacker does not have physical access to the machine and that the SMM BIOS is locked and
Effective electron-density map improvement and structure validation on a Linux multi-CPU web cluster: The TB Structural Genomics Consortium Bias Removal Web Service.

PubMed

Reddy, Vinod; Swanson, Stanley M; Segelke, Brent; Kantardjieff, Katherine A; Sacchettini, James C; Rupp, Bernhard

2003-12-01

Anticipating a continuing increase in the number of structures solved by molecular replacement in high-throughput crystallography and drug-discovery programs, a user-friendly web service for automated molecular replacement, map improvement, bias removal and real-space correlation structure validation has been implemented. The service is based on an efficient bias-removal protocol, Shake&wARP, and implemented using EPMR and the CCP4 suite of programs, combined with various shell scripts and Fortran90 routines. The service returns improved maps, converted data files and real-space correlation and B-factor plots. User data are uploaded through a web interface and the CPU-intensive iteration cycles are executed on a low-cost Linux multi-CPU cluster using the Condor job-queuing package. Examples of map improvement at various resolutions are provided and include model completion and reconstruction of absent parts, sequence correction, and ligand validation in drug-target structures.
Comparing the Consumption of CPU Hours with Scientific Output for the Extreme Science and Engineering Discovery Environment (XSEDE).

PubMed

Knepper, Richard; Börner, Katy

2016-01-01

This paper presents the results of a study that compares resource usage with publication output using data about the consumption of CPU cycles from the Extreme Science and Engineering Discovery Environment (XSEDE) and resulting scientific publications for 2,691 institutions/teams. Specifically, the datasets comprise a total of 5,374,032,696 central processing unit (CPU) hours run in XSEDE during July 1, 2011 to August 18, 2015 and 2,882 publications that cite the XSEDE resource. Three types of studies were conducted: a geospatial analysis of XSEDE providers and consumers, co-authorship network analysis of XSEDE publications, and bi-modal network analysis of how XSEDE resources are used by different research fields. Resulting visualizations show that a diverse set of consumers make use of XSEDE resources, that users of XSEDE publish together frequently, and that the users of XSEDE with the highest resource usage tend to be "traditional" high-performance computing (HPC) community members from astronomy, atmospheric science, physics, chemistry, and biology.
The Application of Virtex-II Pro FPGA in High-Speed Image Processing Technology of Robot Vision Sensor

NASA Astrophysics Data System (ADS)

Ren, Y. J.; Zhu, J. G.; Yang, X. Y.; Ye, S. H.

2006-10-01

The Virtex-II Pro FPGA is applied to the vision sensor tracking system of IRB2400 robot. The hardware platform, which undertakes the task of improving SNR and compressing data, is constructed by using the high-speed image processing of FPGA. The lower level image-processing algorithm is realized by combining the FPGA frame and the embedded CPU. The velocity of image processing is accelerated due to the introduction of FPGA and CPU. The usage of the embedded CPU makes it easily to realize the logic design of interface. Some key techniques are presented in the text, such as read-write process, template matching, convolution, and some modules are simulated too. In the end, the compare among the modules using this design, using the PC computer and using the DSP, is carried out. Because the high-speed image processing system core is a chip of FPGA, the function of which can renew conveniently, therefore, to a degree, the measure system is intelligent.
Study of data I/O performance on distributed disk system in mask data preparation

NASA Astrophysics Data System (ADS)

Ohara, Shuichiro; Odaira, Hiroyuki; Chikanaga, Tomoyuki; Hamaji, Masakazu; Yoshioka, Yasuharu

2010-09-01

Data volume is getting larger every day in Mask Data Preparation (MDP). In the meantime, faster data handling is always required. MDP flow typically introduces Distributed Processing (DP) system to realize the demand because using hundreds of CPU is a reasonable solution. However, even if the number of CPU were increased, the throughput might be saturated because hard disk I/O and network speeds could be bottlenecks. So, MDP needs to invest a lot of money to not only hundreds of CPU but also storage and a network device which make the throughput faster. NCS would like to introduce new distributed processing system which is called "NDE". NDE could be a distributed disk system which makes the throughput faster without investing a lot of money because it is designed to use multiple conventional hard drives appropriately over network. NCS studies I/O performance with OASIS® data format on NDE which contributes to realize the high throughput in this paper.

A study of the relationship between the performance and dependability of a fault-tolerant computer

NASA Technical Reports Server (NTRS)

Goswami, Kumar K.

1994-01-01

This thesis studies the relationship by creating a tool (FTAPE) that integrates a high stress workload generator with fault injection and by using the tool to evaluate system performance under error conditions. The workloads are comprised of processes which are formed from atomic components that represent CPU, memory, and I/O activity. The fault injector is software-implemented and is capable of injecting any memory addressable location, including special registers and caches. This tool has been used to study a Tandem Integrity S2 Computer. Workloads with varying numbers of processes and varying compositions of CPU, memory, and I/O activity are first characterized in terms of performance. Then faults are injected into these workloads. The results show that as the number of concurrent processes increases, the mean fault latency initially increases due to increased contention for the CPU. However, for even higher numbers of processes (less than 3 processes), the mean latency decreases because long latency faults are paged out before they can be activated.
The DISTO data acquisition system at SATURNE

DOE Office of Scientific and Technical Information (OSTI.GOV)

Balestra, F.; Bedfer, Y.; Bertini, R.

1998-06-01

The DISTO collaboration has built a large-acceptance magnetic spectrometer designed to provide broad kinematic coverage of multiparticle final states produced in pp scattering. The spectrometer has been installed in the polarized proton beam of the Saturne accelerator in Saclay to study polarization observables in the {rvec p}p {yields} pK{sup +}{rvec Y} (Y = {Lambda}, {Sigma}{sup 0} or Y{sup *}) reaction and vector meson production ({psi}, {omega} and {rho}) in pp collisions. The data acquisition system is based on a VME 68030 CPU running the OS/9 operating system, housed in a single VME crate together with the CAMAC interface, the triplemore » port ECL memories, and four RISC R3000 CPU. The digitization of signals from the detectors is made by PCOS III and FERA front-end electronics. Data of several events belonging to a single Saturne extraction are stored in VME triple-port ECL memories using a hardwired fast sequencer. The buffer, optionally filtered by the RISC R3000 CPU, is recorded on a DLT cassette by DAQ CPU using the on-board SCSI interface during the acceleration cycle. Two UNIX workstations are connected to the VME CPUs through a fast parallel bus and the Local Area Network. They analyze a subset of events for on-line monitoring. The data acquisition system is able to read and record 3,500 ev/burst in the present configuration with a dead time of 15%.« less
Early modulation by the dopamine D4 receptor of morphine-induced changes in the opioid peptide systems in the rat caudate putamen.

PubMed

Gago, Belén; Fuxe, Kjell; Brené, Stefan; Díaz-Cabiale, Zaida; Reina-Sánchez, María Dolores; Suárez-Boomgaard, Diana; Roales-Buján, Ruth; Valderrama-Carvajal, Alejandra; de la Calle, Adelaida; Rivera, Alicia

2013-12-01

The peptides dynorphin and enkephalin modulate many physiological processes, such as motor activity and the control of mood and motivation. Their expression in the caudate putamen (CPu) is regulated by dopamine and opioid receptors. The current work was designed to explore the early effects of the acute activation of D4 and/or μ opioid receptors by the agonists PD168,077 and morphine, respectively, on the regulation of the expression of these opioid peptides in the rat CPu, on transcription factors linked to them, and on the expression of μ opioid receptors. In situ hybridization experiments showed that acute treatment with morphine (10 mg/kg) decreased both enkephalin and dynorphin mRNA levels in the CPu after 30 min, but PD168,077 (1 mg/kg) did not modify their expression. Coadministration of the two agonists demonstrated that PD168,077 counteracted the morphine-induced changes and even increased enkephalin mRNA levels. The immunohistochemistry studies showed that morphine administration also increased striatal μ opioid receptor immunoreactivity but reduced P-CREB expression, effects that were blocked by the PD168,077-induced activation of D4 receptors. The current results present evidence of functional D4 -μ opioid receptor interactions, with consequences for the opioid peptide mRNA levels in the rat CPu, contributing to the integration of DA and opioid peptide signaling. Copyright © 2013 Wiley Periodicals, Inc.
Efficient and portable acceleration of quantum chemical many-body methods in mixed floating point precision using OpenACC compiler directives

NASA Astrophysics Data System (ADS)

Eriksen, Janus J.

2017-09-01

It is demonstrated how the non-proprietary OpenACC standard of compiler directives may be used to compactly and efficiently accelerate the rate-determining steps of two of the most routinely applied many-body methods of electronic structure theory, namely the second-order Møller-Plesset (MP2) model in its resolution-of-the-identity approximated form and the (T) triples correction to the coupled cluster singles and doubles model (CCSD(T)). By means of compute directives as well as the use of optimised device math libraries, the operations involved in the energy kernels have been ported to graphics processing unit (GPU) accelerators, and the associated data transfers correspondingly optimised to such a degree that the final implementations (using either double and/or single precision arithmetics) are capable of scaling to as large systems as allowed for by the capacity of the host central processing unit (CPU) main memory. The performance of the hybrid CPU/GPU implementations is assessed through calculations on test systems of alanine amino acid chains using one-electron basis sets of increasing size (ranging from double- to pentuple-ζ quality). For all but the smallest problem sizes of the present study, the optimised accelerated codes (using a single multi-core CPU host node in conjunction with six GPUs) are found to be capable of reducing the total time-to-solution by at least an order of magnitude over optimised, OpenMP-threaded CPU-only reference implementations.
GeantV: From CPU to accelerators

DOE PAGES

Amadio, G.; Ananya, A.; Apostolakis, J.; ...

2016-01-01

The GeantV project aims to research and develop the next-generation simulation software describing the passage of particles through matter. While the modern CPU architectures are being targeted first, resources such as GPGPU, Intel© Xeon Phi, Atom or ARM cannot be ignored anymore by HEP CPU-bound applications. The proof of concept GeantV prototype has been mainly engineered for CPU's having vector units but we have foreseen from early stages a bridge to arbitrary accelerators. A software layer consisting of architecture/technology specific backends supports currently this concept. This approach allows to abstract out the basic types such as scalar/vector but also tomore » formalize generic computation kernels using transparently library or device specific constructs based on Vc, CUDA, Cilk+ or Intel intrinsics. While the main goal of this approach is portable performance, as a bonus, it comes with the insulation of the core application and algorithms from the technology layer. This allows our application to be long term maintainable and versatile to changes at the backend side. The paper presents the first results of basket-based GeantV geometry navigation on the Intel© Xeon Phi KNC architecture. We present the scalability and vectorization study, conducted using Intel performance tools, as well as our preliminary conclusions on the use of accelerators for GeantV transport. Lastly, we also describe the current work and preliminary results for using the GeantV transport kernel on GPUs.« less
Large-scale neural circuit mapping data analysis accelerated with the graphical processing unit (GPU).

PubMed

Shi, Yulin; Veidenbaum, Alexander V; Nicolau, Alex; Xu, Xiangmin

2015-01-15

Modern neuroscience research demands computing power. Neural circuit mapping studies such as those using laser scanning photostimulation (LSPS) produce large amounts of data and require intensive computation for post hoc processing and analysis. Here we report on the design and implementation of a cost-effective desktop computer system for accelerated experimental data processing with recent GPU computing technology. A new version of Matlab software with GPU enabled functions is used to develop programs that run on Nvidia GPUs to harness their parallel computing power. We evaluated both the central processing unit (CPU) and GPU-enabled computational performance of our system in benchmark testing and practical applications. The experimental results show that the GPU-CPU co-processing of simulated data and actual LSPS experimental data clearly outperformed the multi-core CPU with up to a 22× speedup, depending on computational tasks. Further, we present a comparison of numerical accuracy between GPU and CPU computation to verify the precision of GPU computation. In addition, we show how GPUs can be effectively adapted to improve the performance of commercial image processing software such as Adobe Photoshop. To our best knowledge, this is the first demonstration of GPU application in neural circuit mapping and electrophysiology-based data processing. Together, GPU enabled computation enhances our ability to process large-scale data sets derived from neural circuit mapping studies, allowing for increased processing speeds while retaining data precision. Copyright © 2014 Elsevier B.V. All rights reserved.
Methamphetamine-induced stereotypy correlates negatively with patch-enhanced prodynorphin and arc mRNA expression in the rat caudate putamen: the role of mu opioid receptor activation.

PubMed

Horner, Kristen A; Noble, Erika S; Gilbert, Yamiece E

2010-06-01

Amphetamines induce stereotypy, which correlates with patch-enhanced c-Fos expression the patch compartment of caudate putamen (CPu). Methamphetamine (METH) treatment also induces patch-enhanced expression of prodynorphin (PD), arc and zif/268 in the CPu. Whether patch-enhanced activation of any of these genes correlates with METH-induced stereotypy is unknown, and the factors that contribute to this pattern of expression are poorly understood. Activation of mu opioid receptors, which are expressed by the neurons of the patch compartment, may underlie METH-induced patch-enhanced gene expression and stereotypy. The current study examined whether striatal mu opioid receptor blockade altered METH-induced stereotypy and patch-enhanced gene expression, and if there was a correlation between the two responses. Animals were intrastriatally infused with the mu antagonist CTAP (10 microg/microl), treated with METH (7.5 mg/kg, s.c.), placed in activity chambers for 3h, and then sacrificed. CTAP pretreatment attenuated METH-induced increases in PD, arc and zif/268 mRNA expression and significantly reduced METH-induced stereotypy. Patch-enhanced PD and arc mRNA expression in the dorsolateral CPu correlated negatively with METH-induced stereotypy. These data indicate that mu opioid receptor activation contributes to METH-induced gene expression in the CPu and stereotypy, and that patch-enhanced PD and arc expression may be a homeostatic response to METH treatment. Copyright 2010 Elsevier Inc. All rights reserved.
Large scale neural circuit mapping data analysis accelerated with the graphical processing unit (GPU)

PubMed Central

Shi, Yulin; Veidenbaum, Alexander V.; Nicolau, Alex; Xu, Xiangmin

2014-01-01

Background Modern neuroscience research demands computing power. Neural circuit mapping studies such as those using laser scanning photostimulation (LSPS) produce large amounts of data and require intensive computation for post-hoc processing and analysis. New Method Here we report on the design and implementation of a cost-effective desktop computer system for accelerated experimental data processing with recent GPU computing technology. A new version of Matlab software with GPU enabled functions is used to develop programs that run on Nvidia GPUs to harness their parallel computing power. Results We evaluated both the central processing unit (CPU) and GPU-enabled computational performance of our system in benchmark testing and practical applications. The experimental results show that the GPU-CPU co-processing of simulated data and actual LSPS experimental data clearly outperformed the multi-core CPU with up to a 22x speedup, depending on computational tasks. Further, we present a comparison of numerical accuracy between GPU and CPU computation to verify the precision of GPU computation. In addition, we show how GPUs can be effectively adapted to improve the performance of commercial image processing software such as Adobe Photoshop. Comparison with Existing Method(s) To our best knowledge, this is the first demonstration of GPU application in neural circuit mapping and electrophysiology-based data processing. Conclusions Together, GPU enabled computation enhances our ability to process large-scale data sets derived from neural circuit mapping studies, allowing for increased processing speeds while retaining data precision. PMID:25277633
Routine Microsecond Molecular Dynamics Simulations with AMBER on GPUs. 2. Explicit Solvent Particle Mesh Ewald.

PubMed

Salomon-Ferrer, Romelia; Götz, Andreas W; Poole, Duncan; Le Grand, Scott; Walker, Ross C

2013-09-10

We present an implementation of explicit solvent all atom classical molecular dynamics (MD) within the AMBER program package that runs entirely on CUDA-enabled GPUs. First released publicly in April 2010 as part of version 11 of the AMBER MD package and further improved and optimized over the last two years, this implementation supports the three most widely used statistical mechanical ensembles (NVE, NVT, and NPT), uses particle mesh Ewald (PME) for the long-range electrostatics, and runs entirely on CUDA-enabled NVIDIA graphics processing units (GPUs), providing results that are statistically indistinguishable from the traditional CPU version of the software and with performance that exceeds that achievable by the CPU version of AMBER software running on all conventional CPU-based clusters and supercomputers. We briefly discuss three different precision models developed specifically for this work (SPDP, SPFP, and DPDP) and highlight the technical details of the approach as it extends beyond previously reported work [Götz et al., J. Chem. Theory Comput. 2012, DOI: 10.1021/ct200909j; Le Grand et al., Comp. Phys. Comm. 2013, DOI: 10.1016/j.cpc.2012.09.022].We highlight the substantial improvements in performance that are seen over traditional CPU-only machines and provide validation of our implementation and precision models. We also provide evidence supporting our decision to deprecate the previously described fully single precision (SPSP) model from the latest release of the AMBER software package.
Methamphetamine-Induced Stereotypy Correlates Negatively with Patch-Enhanced Prodynorphin and ARC mRNA Expression in the Rat Caudate Putamen: The Role of Mu Opioid Receptor Activation

PubMed Central

Horner, Kristen A.; Noble, Erika S.; Gilbert, Yamiece E.

2010-01-01

Amphetamines induce stereotypy, which correlates with patch-enhanced c-Fos expression the patch compartment of caudate putamen (CPu). Methamphetamine (METH) treatment also induces patch-enhanced expression of prodynorphin (PD), arc and zif/268 in the CPu. Whether patch-enhanced activation of any of these genes correlates with METH-induced stereotypy is unknown, and the factors that contribute to this pattern of expression are poorly understood. Activation of mu opioid receptors, which are expressed by the neurons of the patch compartment, may underlie METH-induced patch-enhanced gene expression and stereotypy. The current study examined whether striatal mu opioid receptor blockade altered METH-induced stereotypy and patch-enhanced gene expression, and if there was a correlation between the two responses. Animals were intrastriatally infused with the mu antagonist CTAP (10 μg/μl), treated with METH (7.5 mg/kg, s.c.), placed in activity chambers for 3h, and then sacrificed. CTAP pretreatment attenuated METH-induced increases in PD, arc and zif/268 mRNA expression and significantly reduced METH-induced stereotypy. Patch-enhanced PD and arc mRNA expression in the dorsolateral CPu correlated negatively with METH-induced stereotypy. These data indicate that mu opioid receptor activation contributes to METH-induced gene expression in the CPu and stereotypy, and that patch-enhanced PD and arc expression may be a homeostatic response to METH treatment. PMID:20298714
Early top-down control of visual processing predicts working memory performance

PubMed Central

Rutman, Aaron M.; Clapp, Wesley C.; Chadick, James Z.; Gazzaley, Adam

2009-01-01

Selective attention confers a behavioral benefit for both perceptual and working memory (WM) performance, often attributed to top-down modulation of sensory neural processing. However, the direct relationship between early activity modulation in sensory cortices during selective encoding and subsequent WM performance has not been established. To explore the influence of selective attention on WM recognition, we used electroencephalography (EEG) to study the temporal dynamics of top-down modulation in a selective, delayed-recognition paradigm. Participants were presented with overlapped, “double-exposed” images of faces and natural scenes, and were instructed to either remember the face or the scene while simultaneously ignoring the other stimulus. Here, we present evidence that the degree to which participants modulate the early P100 (97–129 ms) event-related potential (ERP) during selective stimulus encoding significantly correlates with their subsequent WM recognition. These results contribute to our evolving understanding of the mechanistic overlap between attention and memory. PMID:19413473
OpenMP GNU and Intel Fortran programs for solving the time-dependent Gross-Pitaevskii equation

NASA Astrophysics Data System (ADS)

Young-S., Luis E.; Muruganandam, Paulsamy; Adhikari, Sadhan K.; Lončar, Vladimir; Vudragović, Dušan; Balaž, Antun

2017-11-01

We present Open Multi-Processing (OpenMP) version of Fortran 90 programs for solving the Gross-Pitaevskii (GP) equation for a Bose-Einstein condensate in one, two, and three spatial dimensions, optimized for use with GNU and Intel compilers. We use the split-step Crank-Nicolson algorithm for imaginary- and real-time propagation, which enables efficient calculation of stationary and non-stationary solutions, respectively. The present OpenMP programs are designed for computers with multi-core processors and optimized for compiling with both commercially-licensed Intel Fortran and popular free open-source GNU Fortran compiler. The programs are easy to use and are elaborated with helpful comments for the users. All input parameters are listed at the beginning of each program. Different output files provide physical quantities such as energy, chemical potential, root-mean-square sizes, densities, etc. We also present speedup test results for new versions of the programs. Program files doi:http://dx.doi.org/10.17632/y8zk3jgn84.2 Licensing provisions: Apache License 2.0 Programming language: OpenMP GNU and Intel Fortran 90. Computer: Any multi-core personal computer or workstation with the appropriate OpenMP-capable Fortran compiler installed. Number of processors used: All available CPU cores on the executing computer. Journal reference of previous version: Comput. Phys. Commun. 180 (2009) 1888; ibid.204 (2016) 209. Does the new version supersede the previous version?: Not completely. It does supersede previous Fortran programs from both references above, but not OpenMP C programs from Comput. Phys. Commun. 204 (2016) 209. Nature of problem: The present Open Multi-Processing (OpenMP) Fortran programs, optimized for use with commercially-licensed Intel Fortran and free open-source GNU Fortran compilers, solve the time-dependent nonlinear partial differential (GP) equation for a trapped Bose-Einstein condensate in one (1d), two (2d), and three (3d) spatial dimensions for six different trap symmetries: axially and radially symmetric traps in 3d, circularly symmetric traps in 2d, fully isotropic (spherically symmetric) and fully anisotropic traps in 2d and 3d, as well as 1d traps, where no spatial symmetry is considered. Solution method: We employ the split-step Crank-Nicolson algorithm to discretize the time-dependent GP equation in space and time. The discretized equation is then solved by imaginary- or real-time propagation, employing adequately small space and time steps, to yield the solution of stationary and non-stationary problems, respectively. Reasons for the new version: Previously published Fortran programs [1,2] have now become popular tools [3] for solving the GP equation. These programs have been translated to the C programming language [4] and later extended to the more complex scenario of dipolar atoms [5]. Now virtually all computers have multi-core processors and some have motherboards with more than one physical computer processing unit (CPU), which may increase the number of available CPU cores on a single computer to several tens. The C programs have been adopted to be very fast on such multi-core modern computers using general-purpose graphic processing units (GPGPU) with Nvidia CUDA and computer clusters using Message Passing Interface (MPI) [6]. Nevertheless, previously developed Fortran programs are also commonly used for scientific computation and most of them use a single CPU core at a time in modern multi-core laptops, desktops, and workstations. Unless the Fortran programs are made aware and capable of making efficient use of the available CPU cores, the solution of even a realistic dynamical 1d problem, not to mention the more complicated 2d and 3d problems, could be time consuming using the Fortran programs. Previously, we published auto-parallel Fortran programs [2] suitable for Intel (but not GNU) compiler for solving the GP equation. Hence, a need for the full OpenMP version of the Fortran programs to reduce the execution time cannot be overemphasized. To address this issue, we provide here such OpenMP Fortran programs, optimized for both Intel and GNU Fortran compilers and capable of using all available CPU cores, which can significantly reduce the execution time. Summary of revisions: Previous Fortran programs [1] for solving the time-dependent GP equation in 1d, 2d, and 3d with different trap symmetries have been parallelized using the OpenMP interface to reduce the execution time on multi-core processors. There are six different trap symmetries considered, resulting in six programs for imaginary-time propagation and six for real-time propagation, totaling to 12 programs included in BEC-GP-OMP-FOR software package. All input data (number of atoms, scattering length, harmonic oscillator trap length, trap anisotropy, etc.) are conveniently placed at the beginning of each program, as before [2]. Present programs introduce a new input parameter, which is designated by Number_of_Threads and defines the number of CPU cores of the processor to be used in the calculation. If one sets the value 0 for this parameter, all available CPU cores will be used. For the most efficient calculation it is advisable to leave one CPU core unused for the background system's jobs. For example, on a machine with 20 CPU cores such that we used for testing, it is advisable to use up to 19 CPU cores. However, the total number of used CPU cores can be divided into more than one job. For instance, one can run three simulations simultaneously using 10, 4, and 5 CPU cores, respectively, thus totaling to 19 used CPU cores on a 20-core computer. The Fortran source programs are located in the directory src, and can be compiled by the make command using the makefile in the root directory BEC-GP-OMP-FOR of the software package. The examples of produced output files can be found in the directory output, although some large density files are omitted, to save space. The programs calculate the values of actually used dimensionless nonlinearities from the physical input parameters, where the input parameters correspond to the identical nonlinearity values as in the previously published programs [1], so that the output files of the old and new programs can be directly compared. The output files are conveniently named such that their contents can be easily identified, following the naming convention introduced in Ref. [2]. For example, a file named -out.txt, where is a name of the individual program, represents the general output file containing input data, time and space steps, nonlinearity, energy and chemical potential, and was named fort.7 in the old Fortran version of programs [1]. A file named -den.txt is the output file with the condensate density, which had the names fort.3 and fort.4 in the old Fortran version [1] for imaginary- and real-time propagation programs, respectively. Other possible density outputs, such as the initial density, are commented out in the programs to have a simpler set of output files, but users can uncomment and re-enable them, if needed. In addition, there are output files for reduced (integrated) 1d and 2d densities for different programs. In the real-time programs there is also an output file reporting the dynamics of evolution of root-mean-square sizes after a perturbation is introduced. The supplied real-time programs solve the stationary GP equation, and then calculate the dynamics. As the imaginary-time programs are more accurate than the real-time programs for the solution of a stationary problem, one can first solve the stationary problem using the imaginary-time programs, adapt the real-time programs to read the pre-calculated wave function and then study the dynamics. In that case the parameter NSTP in the real-time programs should be set to zero and the space mesh and nonlinearity parameters should be identical in both programs. The reader is advised to consult our previous publication where a complete description of the output files is given [2]. A readme.txt file, included in the root directory, explains the procedure to compile and run the programs. We tested our programs on a workstation with two 10-core Intel Xeon E5-2650 v3 CPUs. The parameters used for testing are given in sample input files, provided in the corresponding directory together with the programs. In Table 1 we present wall-clock execution times for runs on 1, 6, and 19 CPU cores for programs compiled using Intel and GNU Fortran compilers. The corresponding columns "Intel speedup" and "GNU speedup" give the ratio of wall-clock execution times of runs on 1 and 19 CPU cores, and denote the actual measured speedup for 19 CPU cores. In all cases and for all numbers of CPU cores, although the GNU Fortran compiler gives excellent results, the Intel Fortran compiler turns out to be slightly faster. Note that during these tests we always ran only a single simulation on a workstation at a time, to avoid any possible interference issues. Therefore, the obtained wall-clock times are more reliable than the ones that could be measured with two or more jobs running simultaneously. We also studied the speedup of the programs as a function of the number of CPU cores used. The performance of the Intel and GNU Fortran compilers is illustrated in Fig. 1, where we plot the speedup and actual wall-clock times as functions of the number of CPU cores for 2d and 3d programs. We see that the speedup increases monotonically with the number of CPU cores in all cases and has large values (between 10 and 14 for 3d programs) for the maximal number of cores. This fully justifies the development of OpenMP programs, which enable much faster and more efficient solving of the GP equation. However, a slow saturation in the speedup with the further increase in the number of CPU cores is observed in all cases, as expected. The speedup tends to increase for programs in higher dimensions, as they become more complex and have to process more data. This is why the speedups of the supplied 2d and 3d programs are larger than those of 1d programs. Also, for a single program the speedup increases with the size of the spatial grid, i.e., with the number of spatial discretization points, since this increases the amount of calculations performed by the program. To demonstrate this, we tested the supplied real2d-th program and varied the number of spatial discretization points NX=NY from 20 to 1000. The measured speedup obtained when running this program on 19 CPU cores as a function of the number of discretization points is shown in Fig. 2. The speedup first increases rapidly with the number of discretization points and eventually saturates. Additional comments: Example inputs provided with the programs take less than 30 minutes to run on a workstation with two Intel Xeon E5-2650 v3 processors (2 QPI links, 10 CPU cores, 25 MB cache, 2.3 GHz).



      
      High performance computing for deformable image registration: towards a new paradigm in adaptive radiotherapy.
      PubMed
      Samant, Sanjiv S; Xia, Junyi; Muyan-Ozcelik, Pinar; Owens, John D
         2008-08-01
         The advent of readily available temporal imaging or time series volumetric (4D) imaging has become an indispensable component of treatment planning and adaptive radiotherapy (ART) at many radiotherapy centers. Deformable image registration (DIR) is also used in other areas of medical imaging, including motion corrected image reconstruction. Due to long computation time, clinical applications of DIR in radiation therapy and elsewhere have been limited and consequently relegated to offline analysis. With the recent advances in hardware and software, graphics processing unit (GPU) based computing is an emerging technology for general purpose computation, including DIR, and is suitable for highly parallelized computing. However, traditional general purpose computation on the GPU is limited because the constraints of the available programming platforms. As well, compared to CPU programming, the GPU currently has reduced dedicated processor memory, which can limit the useful working data set for parallelized processing. We present an implementation of the demons algorithm using the NVIDIA 8800 GTX GPU and the new CUDA programming language. The GPU performance will be compared with single threading and multithreading CPU implementations on an Intel dual core 2.4 GHz CPU using the C programming language. CUDA provides a C-like language programming interface, and allows for direct access to the highly parallel compute units in the GPU. Comparisons for volumetric clinical lung images acquired using 4DCT were carried out. Computation time for 100 iterations in the range of 1.8-13.5 s was observed for the GPU with image size ranging from 2.0 x 10(6) to 14.2 x 10(6) pixels. The GPU registration was 55-61 times faster than the CPU for the single threading implementation, and 34-39 times faster for the multithreading implementation. For CPU based computing, the computational time generally has a linear dependence on image size for medical imaging data. Computational efficiency is characterized in terms of time per megapixels per iteration (TPMI) with units of seconds per megapixels per iteration (or spmi). For the demons algorithm, our CPU implementation yielded largely invariant values of TPMI. The mean TPMIs were 0.527 spmi and 0.335 spmi for the single threading and multithreading cases, respectively, with <2% variation over the considered image data range. For GPU computing, we achieved TPMI =0.00916 spmi with 3.7% variation, indicating optimized memory handling under CUDA. The paradigm of GPU based real-time DIR opens up a host of clinical applications for medical imaging.
      

      
      Airloads on Bluff Bodies, with Application to the Rotor-Induced Downloads on Tilt-Rotor Aircraft.
      DTIC Science & Technology
      
         1983-09-01
         interference aerodynamics would be tion on hover performance (Ref. (11). to study the two-dimensional sec- tion characteristics of a wing in the wake of a...resources for large numbers of vortices; a typical case requires 10-15 min CPU time on the Ames Cray IS computer. Figure 6 shows a typical result. Here...CPU time per case on a Prime 550UPPER SURFACE (WINDWARD) computer to converge to a steady solution; this would be equivalent to one or two seconds on
      

      
      Dynamic mechanical analysis and organization/storage of data for polymetric materials
      NASA Technical Reports Server (NTRS)
      Rosenberg, M.; Buckley, W.
         1982-01-01
         Dynamic mechanical analysis was performed on a variety of temperature resistant polymers and composite resin matrices. Data on glass transition temperatures and degree of cure attained were derived. In addition a laboratory based computer system was installed and data base set up to allow entry of composite data. The laboratory CPU termed TYCHO is based on a DEC PDP 11/44 CPU with a Datatrieve relational data base. The function of TYCHO is integration of chemical laboratory analytical instrumentation and storage of chemical structures for modeling of new polymeric structures and compounds
      

      
      The METAL System. Volume I and Volume II. Appendices.
      DTIC Science & Technology
      
         1981-01-01
         demands , and fair CPU time were measured. The fair measure reported here includes the pure CPU time plus a pro-rated portion of the time consumed by the...syntactic class or the form matched . NO = noun VB = verb OTR = other part of speech IT-12 Although the above feature is not used by the system at present...indicate the syntactic class of the form matched . NO = noun other than gerund ("content", "dark", "African") INF = infinitive ("direct", "equal", "content
      

      
      Advanced Edit System.
      DTIC Science & Technology
      
         1983-01-01
         MFR Model Computer Subsystem 1. Cabinet 0, PDP-11/70 CPU with 11/70 CPU, and Floating point processor DEC 11/79-UK 2. Cabinet 1, with SDLC ... software T-square. o Unit lock causes a user-defined roundoff factor to be applied to all points selected with the cursor. V - 1 0 Grid lock...1 NL • • 1 I i * v • _ • _ . *. . - m m I 1 3 I = K» lää 12.2 1.1 2.0 1.8 1.25 11.4 Ho EJ V Ml ^"OPY RESOLUTION
      

      
      Optimization of Selected Remote Sensing Algorithms for Embedded NVIDIA Kepler GPU Architecture
      NASA Technical Reports Server (NTRS)
      Riha, Lubomir; Le Moigne, Jacqueline; El-Ghazawi, Tarek
         2015-01-01
         This paper evaluates the potential of embedded Graphic Processing Units in the Nvidias Tegra K1 for onboard processing. The performance is compared to a general purpose multi-core CPU and full fledge GPU accelerator. This study uses two algorithms: Wavelet Spectral Dimension Reduction of Hyperspectral Imagery and Automated Cloud-Cover Assessment (ACCA) Algorithm. Tegra K1 achieved 51 for ACCA algorithm and 20 for the dimension reduction algorithm, as compared to the performance of the high-end 8-core server Intel Xeon CPU with 13.5 times higher power consumption.
      

      
      A multithreaded and GPU-optimized compact finite difference algorithm for turbulent mixing at high Schmidt number using petascale computing
      NASA Astrophysics Data System (ADS)
      Clay, M. P.; Yeung, P. K.; Buaria, D.; Gotoh, T.
         2017-11-01
         Turbulent mixing at high Schmidt number is a multiscale problem which places demanding requirements on direct numerical simulations to resolve fluctuations down the to Batchelor scale. We use a dual-grid, dual-scheme and dual-communicator approach where velocity and scalar fields are computed by separate groups of parallel processes, the latter using a combined compact finite difference (CCD) scheme on finer grid with a static 3-D domain decomposition free of the communication overhead of memory transposes. A high degree of scalability is achieved for a 81923 scalar field at Schmidt number 512 in turbulence with a modest inertial range, by overlapping communication with computation whenever possible. On the Cray XE6 partition of Blue Waters, use of a dedicated thread for communication combined with OpenMP locks and nested parallelism reduces CCD timings by 34% compared to an MPI baseline. The code has been further optimized for the 27-petaflops Cray XK7 machine Titan using GPUs as accelerators with the latest OpenMP 4.5 directives, giving 2.7X speedup compared to CPU-only execution at the largest problem size. Supported by NSF Grant ACI-1036170, the NCSA Blue Waters Project with subaward via UIUC, and a DOE INCITE allocation at ORNL.
      

      
      SU-F-BRD-02: Application of ARCHERRT-- A GPU-Based Monte Carlo Dose Engine for Radiation Therapy -- to Tomotherapy and Patient-Independent IMRT
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Su, L; Du, X; Liu, T
         
         Purpose: As a module of ARCHER -- Accelerated Radiation-transport Computations in Heterogeneous EnviRonments, ARCHER{sub RT} is designed for RadioTherapy (RT) dose calculation. This paper describes the application of ARCHERRT on patient-dependent TomoTherapy and patient-independent IMRT. It also conducts a 'fair' comparison of different GPUs and multicore CPU. Methods: The source input used for patient-dependent TomoTherapy is phase space file (PSF) generated from optimized plan. For patient-independent IMRT, the open filed PSF is used for different cases. The intensity modulation is simulated by fluence map. The GEANT4 code is used as benchmark. DVH and gamma index test are employed to evaluatemore » the accuracy of ARCHER{sub RT} code. Some previous studies reported misleading speedups by comparing GPU code with serial CPU code. To perform a fairer comparison, we write multi-thread code with OpenMP to fully exploit computing potential of CPU. The hardware involved in this study are a 6-core Intel E5-2620 CPU and 6 NVIDIA M2090 GPUs, a K20 GPU and a K40 GPU. Results: Dosimetric results from ARCHER{sub RT} and GEANT4 show good agreement. The 2%/2mm gamma test pass rates for different clinical cases are 97.2% to 99.7%. A single M2090 GPU needs 50~79 seconds for the simulation to achieve a statistical error of 1% in the PTV. The K40 card is about 1.7∼1.8 times faster than M2090 card. Using 6 M2090 card, the simulation can be finished in about 10 seconds. For comparison, Intel E5-2620 needs 507∼879 seconds for the same simulation. Conclusion: We successfully applied ARCHER{sub RT} to Tomotherapy and patient-independent IMRT, and conducted a fair comparison between GPU and CPU performance. The ARCHER{sub RT} code is both accurate and efficient and may be used towards clinical applications.« less


   
       
            
              
          

«

15
      16
      17
   18
      19
      »

          
        

           
           
             
               
      
      CMSA: a heterogeneous CPU/GPU computing system for multiple similar RNA/DNA sequence alignment.
      PubMed
      Chen, Xi; Wang, Chen; Tang, Shanjiang; Yu, Ce; Zou, Quan
         2017-06-24
         The multiple sequence alignment (MSA) is a classic and powerful technique for sequence analysis in bioinformatics. With the rapid growth of biological datasets, MSA parallelization becomes necessary to keep its running time in an acceptable level. Although there are a lot of work on MSA problems, their approaches are either insufficient or contain some implicit assumptions that limit the generality of usage. First, the information of users' sequences, including the sizes of datasets and the lengths of sequences, can be of arbitrary values and are generally unknown before submitted, which are unfortunately ignored by previous work. Second, the center star strategy is suited for aligning similar sequences. But its first stage, center sequence selection, is highly time-consuming and requires further optimization. Moreover, given the heterogeneous CPU/GPU platform, prior studies consider the MSA parallelization on GPU devices only, making the CPUs idle during the computation. Co-run computation, however, can maximize the utilization of the computing resources by enabling the workload computation on both CPU and GPU simultaneously. This paper presents CMSA, a robust and efficient MSA system for large-scale datasets on the heterogeneous CPU/GPU platform. It performs and optimizes multiple sequence alignment automatically for users' submitted sequences without any assumptions. CMSA adopts the co-run computation model so that both CPU and GPU devices are fully utilized. Moreover, CMSA proposes an improved center star strategy that reduces the time complexity of its center sequence selection process from O(mn 2 ) to O(mn). The experimental results show that CMSA achieves an up to 11× speedup and outperforms the state-of-the-art software. CMSA focuses on the multiple similar RNA/DNA sequence alignment and proposes a novel bitmap based algorithm to improve the center star strategy. We can conclude that harvesting the high performance of modern GPU is a promising approach to accelerate multiple sequence alignment. Besides, adopting the co-run computation model can maximize the entire system utilization significantly. The source code is available at https://github.com/wangvsa/CMSA .
      

      
      Many-integrated core (MIC) technology for accelerating Monte Carlo simulation of radiation transport: A study based on the code DPM
      NASA Astrophysics Data System (ADS)
      Rodriguez, M.; Brualla, L.
         2018-04-01
         Monte Carlo simulation of radiation transport is computationally demanding to obtain reasonably low statistical uncertainties of the estimated quantities. Therefore, it can benefit in a large extent from high-performance computing. This work is aimed at assessing the performance of the first generation of the many-integrated core architecture (MIC) Xeon Phi coprocessor with respect to that of a CPU consisting of a double 12-core Xeon processor in Monte Carlo simulation of coupled electron-photonshowers. The comparison was made twofold, first, through a suite of basic tests including parallel versions of the random number generators Mersenne Twister and a modified implementation of RANECU. These tests were addressed to establish a baseline comparison between both devices. Secondly, through the p DPM code developed in this work. p DPM is a parallel version of the Dose Planning Method (DPM) program for fast Monte Carlo simulation of radiation transport in voxelized geometries. A variety of techniques addressed to obtain a large scalability on the Xeon Phi were implemented in p DPM. Maximum scalabilities of 84 . 2 × and 107 . 5 × were obtained in the Xeon Phi for simulations of electron and photon beams, respectively. Nevertheless, in none of the tests involving radiation transport the Xeon Phi performed better than the CPU. The disadvantage of the Xeon Phi with respect to the CPU owes to the low performance of the single core of the former. A single core of the Xeon Phi was more than 10 times less efficient than a single core of the CPU for all radiation transport simulations.
      

      
      An efficient implementation of 3D high-resolution imaging for large-scale seismic data with GPU/CPU heterogeneous parallel computing
      NASA Astrophysics Data System (ADS)
      Xu, Jincheng; Liu, Wei; Wang, Jin; Liu, Linong; Zhang, Jianfeng
         2018-02-01
         De-absorption pre-stack time migration (QPSTM) compensates for the absorption and dispersion of seismic waves by introducing an effective Q parameter, thereby making it an effective tool for 3D, high-resolution imaging of seismic data. Although the optimal aperture obtained via stationary-phase migration reduces the computational cost of 3D QPSTM and yields 3D stationary-phase QPSTM, the associated computational efficiency is still the main problem in the processing of 3D, high-resolution images for real large-scale seismic data. In the current paper, we proposed a division method for large-scale, 3D seismic data to optimize the performance of stationary-phase QPSTM on clusters of graphics processing units (GPU). Then, we designed an imaging point parallel strategy to achieve an optimal parallel computing performance. Afterward, we adopted an asynchronous double buffering scheme for multi-stream to perform the GPU/CPU parallel computing. Moreover, several key optimization strategies of computation and storage based on the compute unified device architecture (CUDA) were adopted to accelerate the 3D stationary-phase QPSTM algorithm. Compared with the initial GPU code, the implementation of the key optimization steps, including thread optimization, shared memory optimization, register optimization and special function units (SFU), greatly improved the efficiency. A numerical example employing real large-scale, 3D seismic data showed that our scheme is nearly 80 times faster than the CPU-QPSTM algorithm. Our GPU/CPU heterogeneous parallel computing framework significant reduces the computational cost and facilitates 3D high-resolution imaging for large-scale seismic data.
      

      
      A fast three-dimensional gamma evaluation using a GPU utilizing texture memory for on-the-fly interpolations.
      PubMed
      Persoon, Lucas C G G; Podesta, Mark; van Elmpt, Wouter J C; Nijsten, Sebastiaan M J J G; Verhaegen, Frank
         2011-07-01
         A widely accepted method to quantify differences in dose distributions is the gamma (gamma) evaluation. Currently, almost all gamma implementations utilize the central processing unit (CPU). Recently, the graphics processing unit (GPU) has become a powerful platform for specific computing tasks. In this study, we describe the implementation of a 3D gamma evaluation using a GPU to improve calculation time. The gamma evaluation algorithm was implemented on an NVIDIA Tesla C2050 GPU using the compute unified device architecture (CUDA). First, several cubic virtual phantoms were simulated. These phantoms were tested with varying dose cube sizes and set-ups, introducing artificial dose differences. Second, to show applicability in clinical practice, five patient cases have been evaluated using the 3D dose distribution from a treatment planning system as the reference and the delivered dose determined during treatment as the comparison. A calculation time comparison between the CPU and GPU was made with varying thread-block sizes including the option of using texture or global memory. A GPU over CPU speed-up of 66 +/- 12 was achieved for the virtual phantoms. For the patient cases, a speed-up of 57 +/- 15 using the GPU was obtained. A thread-block size of 16 x 16 performed best in all cases. The use of texture memory improved the total calculation time, especially when interpolation was applied. Differences between the CPU and GPU gammas were negligible. The GPU and its features, such as texture memory, decreased the calculation time for gamma evaluations considerably without loss of accuracy.
      

      
      Lossless data compression for improving the performance of a GPU-based beamformer.
      PubMed
      Lok, U-Wai; Fan, Gang-Wei; Li, Pai-Chi
         2015-04-01
         The powerful parallel computation ability of a graphics processing unit (GPU) makes it feasible to perform dynamic receive beamforming However, a real time GPU-based beamformer requires high data rate to transfer radio-frequency (RF) data from hardware to software memory, as well as from central processing unit (CPU) to GPU memory. There are data compression methods (e.g. Joint Photographic Experts Group (JPEG)) available for the hardware front end to reduce data size, alleviating the data transfer requirement of the hardware interface. Nevertheless, the required decoding time may even be larger than the transmission time of its original data, in turn degrading the overall performance of the GPU-based beamformer. This article proposes and implements a lossless compression-decompression algorithm, which enables in parallel compression and decompression of data. By this means, the data transfer requirement of hardware interface and the transmission time of CPU to GPU data transfers are reduced, without sacrificing image quality. In simulation results, the compression ratio reached around 1.7. The encoder design of our lossless compression approach requires low hardware resources and reasonable latency in a field programmable gate array. In addition, the transmission time of transferring data from CPU to GPU with the parallel decoding process improved by threefold, as compared with transferring original uncompressed data. These results show that our proposed lossless compression plus parallel decoder approach not only mitigate the transmission bandwidth requirement to transfer data from hardware front end to software system but also reduce the transmission time for CPU to GPU data transfer. © The Author(s) 2014.
      

      
      Research on fast Fourier transforms algorithm of huge remote sensing image technology with GPU and partitioning technology.
      PubMed
      Yang, Xue; Li, Xue-You; Li, Jia-Guo; Ma, Jun; Zhang, Li; Yang, Jan; Du, Quan-Ye
         2014-02-01
         Fast Fourier transforms (FFT) is a basic approach to remote sensing image processing. With the improvement of capacity of remote sensing image capture with the features of hyperspectrum, high spatial resolution and high temporal resolution, how to use FFT technology to efficiently process huge remote sensing image becomes the critical step and research hot spot of current image processing technology. FFT algorithm, one of the basic algorithms of image processing, can be used for stripe noise removal, image compression, image registration, etc. in processing remote sensing image. CUFFT function library is the FFT algorithm library based on CPU and FFTW. FFTW is a FFT algorithm developed based on CPU in PC platform, and is currently the fastest CPU based FFT algorithm function library. However there is a common problem that once the available memory or memory is less than the capacity of image, there will be out of memory or memory overflow when using the above two methods to realize image FFT arithmetic. To address this problem, a CPU and partitioning technology based Huge Remote Fast Fourier Transform (HRFFT) algorithm is proposed in this paper. By improving the FFT algorithm in CUFFT function library, the problem of out of memory and memory overflow is solved. Moreover, this method is proved rational by experiment combined with the CCD image of HJ-1A satellite. When applied to practical image processing, it improves effect of the image processing, speeds up the processing, which saves the time of computation and achieves sound result.
      

      
      GPU: the biggest key processor for AI and parallel processing
      NASA Astrophysics Data System (ADS)
      Baji, Toru
         2017-07-01
         Two types of processors exist in the market. One is the conventional CPU and the other is Graphic Processor Unit (GPU). Typical CPU is composed of 1 to 8 cores while GPU has thousands of cores. CPU is good for sequential processing, while GPU is good to accelerate software with heavy parallel executions. GPU was initially dedicated for 3D graphics. However from 2006, when GPU started to apply general-purpose cores, it was noticed that this architecture can be used as a general purpose massive-parallel processor. NVIDIA developed a software framework Compute Unified Device Architecture (CUDA) that make it possible to easily program the GPU for these application. With CUDA, GPU started to be used in workstations and supercomputers widely. Recently two key technologies are highlighted in the industry. The Artificial Intelligence (AI) and Autonomous Driving Cars. AI requires a massive parallel operation to train many-layers of neural networks. With CPU alone, it was impossible to finish the training in a practical time. The latest multi-GPU system with P100 makes it possible to finish the training in a few hours. For the autonomous driving cars, TOPS class of performance is required to implement perception, localization, path planning processing and again SoC with integrated GPU will play a key role there. In this paper, the evolution of the GPU which is one of the biggest commercial devices requiring state-of-the-art fabrication technology will be introduced. Also overview of the GPU demanding key application like the ones described above will be introduced.
      

      
      GPU-accelerated automatic identification of robust beam setups for proton and carbon-ion radiotherapy
      NASA Astrophysics Data System (ADS)
      Ammazzalorso, F.; Bednarz, T.; Jelen, U.
         2014-03-01
         We demonstrate acceleration on graphic processing units (GPU) of automatic identification of robust particle therapy beam setups, minimizing negative dosimetric effects of Bragg peak displacement caused by treatment-time patient positioning errors. Our particle therapy research toolkit, RobuR, was extended with OpenCL support and used to implement calculation on GPU of the Port Homogeneity Index, a metric scoring irradiation port robustness through analysis of tissue density patterns prior to dose optimization and computation. Results were benchmarked against an independent native CPU implementation. Numerical results were in agreement between the GPU implementation and native CPU implementation. For 10 skull base cases, the GPU-accelerated implementation was employed to select beam setups for proton and carbon ion treatment plans, which proved to be dosimetrically robust, when recomputed in presence of various simulated positioning errors. From the point of view of performance, average running time on the GPU decreased by at least one order of magnitude compared to the CPU, rendering the GPU-accelerated analysis a feasible step in a clinical treatment planning interactive session. In conclusion, selection of robust particle therapy beam setups can be effectively accelerated on a GPU and become an unintrusive part of the particle therapy treatment planning workflow. Additionally, the speed gain opens new usage scenarios, like interactive analysis manipulation (e.g. constraining of some setup) and re-execution. Finally, through OpenCL portable parallelism, the new implementation is suitable also for CPU-only use, taking advantage of multiple cores, and can potentially exploit types of accelerators other than GPUs.
      

      
      Learn Locally, Act Globally: Learning Language from Variation Set Cues
      PubMed Central
      Onnis, Luca; Waterfall, Heidi R.; Edelman, Shimon
         2011-01-01
         Variation set structure — partial overlap of successive utterances in child-directed speech — has been shown to correlate with progress in children’s acquisition of syntax. We demonstrate the benefits of variation set structure directly: in miniature artificial languages, arranging a certain proportion of utterances in a training corpus in variation sets facilitated word and phrase constituent learning in adults. Our findings have implications for understanding the mechanisms of L1 acquisition by children, and for the development of more efficient algorithms for automatic language acquisition, as well as better methods for L2 instruction. PMID:19019350
      

      
      Performance measurements of the first RAID prototype
      NASA Technical Reports Server (NTRS)
      Chervenak, Ann L.
         1990-01-01
         The performance is examined of Redundant Arrays of Inexpensive Disks (RAID) the First, a prototype disk array. A hierarchy of bottlenecks was discovered in the system that limit overall performance. The most serious is the memory system contention on the Sun 4/280 host CPU, which limits array bandwidth to 2.3 MBytes/sec. The array performs more successfully on small random operations, achieving nearly 300 I/Os per second before the Sun 4/280 becomes CPU limited. Other bottlenecks in the system are the VME backplane, bandwidth on the disk controller, and overheads associated with the SCSI protocol. All are examined in detail. The main conclusion is that to achieve the potential bandwidth of arrays, more powerful CPU's alone will not suffice. Just as important are adequate host memory bandwidth and support for high bandwidth on disk controllers. Current disk controllers are more often designed to achieve large numbers of small random operations, rather than high bandwidth. Operating systems also need to change to support high bandwidth from disk arrays. In particular, they should transfer data in larger blocks, and should support asynchronous I/O to improve sequential write performance.
      

      
      Fast simulation of Proton Induced X-Ray Emission Tomography using CUDA
      NASA Astrophysics Data System (ADS)
      Beasley, D. G.; Marques, A. C.; Alves, L. C.; da Silva, R. C.
         2013-07-01
         A new 3D Proton Induced X-Ray Emission Tomography (PIXE-T) and Scanning Transmission Ion Microscopy Tomography (STIM-T) simulation software has been developed in Java and uses NVIDIA™ Common Unified Device Architecture (CUDA) to calculate the X-ray attenuation for large detector areas. A challenge with PIXE-T is to get sufficient counts while retaining a small beam spot size. Therefore a high geometric efficiency is required. However, as the detector solid angle increases the calculations required for accurate reconstruction of the data increase substantially. To overcome this limitation, the CUDA parallel computing platform was used which enables general purpose programming of NVIDIA graphics processing units (GPUs) to perform computations traditionally handled by the central processing unit (CPU). For simulation performance evaluation, the results of a CPU- and a CUDA-based simulation of a phantom are presented. Furthermore, a comparison with the simulation code in the PIXE-Tomography reconstruction software DISRA (A. Sakellariou, D.N. Jamieson, G.J.F. Legge, 2001) is also shown. Compared to a CPU implementation, the CUDA based simulation is approximately 30× faster.
      

      
      Research on control law accelerator of digital signal process chip TMS320F28035 for real-time data acquisition and processing
      NASA Astrophysics Data System (ADS)
      Zhao, Shuangle; Zhang, Xueyi; Sun, Shengli; Wang, Xudong
         2017-08-01
         TI C2000 series digital signal process (DSP) chip has been widely used in electrical engineering, measurement and control, communications and other professional fields, DSP TMS320F28035 is one of the most representative of a kind. When using the DSP program, need data acquisition and data processing, and if the use of common mode C or assembly language programming, the program sequence, analogue-to-digital (AD) converter cannot be real-time acquisition, often missing a lot of data. The control low accelerator (CLA) processor can run in parallel with the main central processing unit (CPU), and the frequency is consistent with the main CPU, and has the function of floating point operations. Therefore, the CLA coprocessor is used in the program, and the CLA kernel is responsible for data processing. The main CPU is responsible for the AD conversion. The advantage of this method is to reduce the time of data processing and realize the real-time performance of data acquisition.
      

      
      Optimizing Tensor Contraction Expressions for Hybrid CPU-GPU Execution
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Ma, Wenjing; Krishnamoorthy, Sriram; Villa, Oreste
         2013-03-01
         Tensor contractions are generalized multidimensional matrix multiplication operations that widely occur in quantum chemistry. Efficient execution of tensor contractions on Graphics Processing Units (GPUs) requires several challenges to be addressed, including index permutation and small dimension-sizes reducing thread block utilization. Moreover, to apply the same optimizations to various expressions, we need a code generation tool. In this paper, we present our approach to automatically generate CUDA code to execute tensor contractions on GPUs, including management of data movement between CPU and GPU. To evaluate our tool, GPU-enabled code is generated for the most expensive contractions in CCSD(T), a key coupledmore » cluster method, and incorporated into NWChem, a popular computational chemistry suite. For this method, we demonstrate speedup over a factor of 8.4 using one GPU (instead of one core per node) and over 2.6 when utilizing the entire system using hybrid CPU+GPU solution with 2 GPUs and 5 cores (instead of 7 cores per node). Finally, we analyze the implementation behavior on future GPU systems.« less
      

      
      Massively parallel algorithm and implementation of RI-MP2 energy calculation for peta-scale many-core supercomputers.
      PubMed
      Katouda, Michio; Naruse, Akira; Hirano, Yukihiko; Nakajima, Takahito
         2016-11-15
         A new parallel algorithm and its implementation for the RI-MP2 energy calculation utilizing peta-flop-class many-core supercomputers are presented. Some improvements from the previous algorithm (J. Chem. Theory Comput. 2013, 9, 5373) have been performed: (1) a dual-level hierarchical parallelization scheme that enables the use of more than 10,000 Message Passing Interface (MPI) processes and (2) a new data communication scheme that reduces network communication overhead. A multi-node and multi-GPU implementation of the present algorithm is presented for calculations on a central processing unit (CPU)/graphics processing unit (GPU) hybrid supercomputer. Benchmark results of the new algorithm and its implementation using the K computer (CPU clustering system) and TSUBAME 2.5 (CPU/GPU hybrid system) demonstrate high efficiency. The peak performance of 3.1 PFLOPS is attained using 80,199 nodes of the K computer. The peak performance of the multi-node and multi-GPU implementation is 514 TFLOPS using 1349 nodes and 4047 GPUs of TSUBAME 2.5. © 2016 Wiley Periodicals, Inc. © 2016 Wiley Periodicals, Inc.
      

      
      Multigrid direct numerical simulation of the whole process of flow transition in 3-D boundary layers
      NASA Technical Reports Server (NTRS)
      Liu, Chaoqun; Liu, Zhining
         1993-01-01
         A new technology was developed in this study which provides a successful numerical simulation of the whole process of flow transition in 3-D boundary layers, including linear growth, secondary instability, breakdown, and transition at relatively low CPU cost. Most other spatial numerical simulations require high CPU cost and blow up at the stage of flow breakdown. A fourth-order finite difference scheme on stretched and staggered grids, a fully implicit time marching technique, a semi-coarsening multigrid based on the so-called approximate line-box relaxation, and a buffer domain for the outflow boundary conditions were all used for high-order accuracy, good stability, and fast convergence. A new fine-coarse-fine grid mapping technique was developed to keep the code running after the laminar flow breaks down. The computational results are in good agreement with linear stability theory, secondary instability theory, and some experiments. The cost for a typical case with 162 x 34 x 34 grid is around 2 CRAY-YMP CPU hours for 10 T-S periods.
      

      
      Heterogeneous compute in computer vision: OpenCL in OpenCV
      NASA Astrophysics Data System (ADS)
      Gasparakis, Harris
         2014-02-01
         We explore the relevance of Heterogeneous System Architecture (HSA) in Computer Vision, both as a long term vision, and as a near term emerging reality via the recently ratified OpenCL 2.0 Khronos standard. After a brief review of OpenCL 1.2 and 2.0, including HSA features such as Shared Virtual Memory (SVM) and platform atomics, we identify what genres of Computer Vision workloads stand to benefit by leveraging those features, and we suggest a new mental framework that replaces GPU compute with hybrid HSA APU compute. As a case in point, we discuss, in some detail, popular object recognition algorithms (part-based models), emphasizing the interplay and concurrent collaboration between the GPU and CPU. We conclude by describing how OpenCL has been incorporated in OpenCV, a popular open source computer vision library, emphasizing recent work on the Transparent API, to appear in OpenCV 3.0, which unifies the native CPU and OpenCL execution paths under a single API, allowing the same code to execute either on CPU or on a OpenCL enabled device, without even recompiling.
      

      
      Comparing the Consumption of CPU Hours with Scientific Output for the Extreme Science and Engineering Discovery Environment (XSEDE)
      PubMed Central
      Börner, Katy
         2016-01-01
         This paper presents the results of a study that compares resource usage with publication output using data about the consumption of CPU cycles from the Extreme Science and Engineering Discovery Environment (XSEDE) and resulting scientific publications for 2,691 institutions/teams. Specifically, the datasets comprise a total of 5,374,032,696 central processing unit (CPU) hours run in XSEDE during July 1, 2011 to August 18, 2015 and 2,882 publications that cite the XSEDE resource. Three types of studies were conducted: a geospatial analysis of XSEDE providers and consumers, co-authorship network analysis of XSEDE publications, and bi-modal network analysis of how XSEDE resources are used by different research fields. Resulting visualizations show that a diverse set of consumers make use of XSEDE resources, that users of XSEDE publish together frequently, and that the users of XSEDE with the highest resource usage tend to be “traditional” high-performance computing (HPC) community members from astronomy, atmospheric science, physics, chemistry, and biology. PMID:27310174
      

      
      Heterogeneous CPU-GPU moving targets detection for UAV video
      NASA Astrophysics Data System (ADS)
      Li, Maowen; Tang, Linbo; Han, Yuqi; Yu, Chunlei; Zhang, Chao; Fu, Huiquan
         2017-07-01
         Moving targets detection is gaining popularity in civilian and military applications. On some monitoring platform of motion detection, some low-resolution stationary cameras are replaced by moving HD camera based on UAVs. The pixels of moving targets in the HD Video taken by UAV are always in a minority, and the background of the frame is usually moving because of the motion of UAVs. The high computational cost of the algorithm prevents running it at higher resolutions the pixels of frame. Hence, to solve the problem of moving targets detection based UAVs video, we propose a heterogeneous CPU-GPU moving target detection algorithm for UAV video. More specifically, we use background registration to eliminate the impact of the moving background and frame difference to detect small moving targets. In order to achieve the effect of real-time processing, we design the solution of heterogeneous CPU-GPU framework for our method. The experimental results show that our method can detect the main moving targets from the HD video taken by UAV, and the average process time is 52.16ms per frame which is fast enough to solve the problem.
      

      
      An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU
      NASA Astrophysics Data System (ADS)
      Lyakh, Dmitry I.
         2015-04-01
         An efficient parallel tensor transpose algorithm is suggested for shared-memory computing units, namely, multicore CPU, Intel Xeon Phi, and NVidia GPU. The algorithm operates on dense tensors (multidimensional arrays) and is based on the optimization of cache utilization on x86 CPU and the use of shared memory on NVidia GPU. From the applied side, the ultimate goal is to minimize the overhead encountered in the transformation of tensor contractions into matrix multiplications in computer implementations of advanced methods of quantum many-body theory (e.g., in electronic structure theory and nuclear physics). A particular accent is made on higher-dimensional tensors that typically appear in the so-called multireference correlated methods of electronic structure theory. Depending on tensor dimensionality, the presented optimized algorithms can achieve an order of magnitude speedup on x86 CPUs and 2-3 times speedup on NVidia Tesla K20X GPU with respect to the naïve scattering algorithm (no memory access optimization). The tensor transpose routines developed in this work have been incorporated into a general-purpose tensor algebra library (TAL-SH).
      

      
      An optimization program based on the method of feasible directions: Theory and users guide
      NASA Technical Reports Server (NTRS)
      Belegundu, Ashok D.; Berke, Laszlo; Patnaik, Surya N.
         1994-01-01
         The theory and user instructions for an optimization code based on the method of feasible directions are presented. The code was written for wide distribution and ease of attachment to other simulation software. Although the theory of the method of feasible direction was developed in the 1960's, many considerations are involved in its actual implementation as a computer code. Included in the code are a number of features to improve robustness in optimization. The search direction is obtained by solving a quadratic program using an interior method based on Karmarkar's algorithm. The theory is discussed focusing on the important and often overlooked role played by the various parameters guiding the iterations within the program. Also discussed is a robust approach for handling infeasible starting points. The code was validated by solving a variety of structural optimization test problems that have known solutions obtained by other optimization codes. It has been observed that this code is robust: it has solved a variety of problems from different starting points. However, the code is inefficient in that it takes considerable CPU time as compared with certain other available codes. Further work is required to improve its efficiency while retaining its robustness.
      

        
       
          

«

15
      16
      17
   18
      19
      »

          
        

     

   

   
       
            
              
          

«

16
      17
      18
   19
      20
      »

          
        

           
           
             
               
      
      A Single Chip Automotive Control LSI Using SOI Bipolar Complimentary MOS Double-Diffused MOS
      NASA Astrophysics Data System (ADS)
      Kawamoto, Kazunori; Mizuno, Shoji; Abe, Hirofumi; Higuchi, Yasushi; Ishihara, Hideaki; Fukumoto, Harutsugu; Watanabe, Takamoto; Fujino, Seiji; Shirakawa, Isao
         2001-04-01
         Using the example of an air bag controller, a single chip solution for automotive sub-control systems is investigated, by using a technological combination of improved circuits, bipolar complimentary metal oxide silicon double-diffused metal oxide silicon (BiCDMOS) and thick silicon on insulator (SOI). For circuits, an automotive specific reduced instruction set computer (RISC) center processing unit (CPU), and a novel, all integrated system clock generator, dividing digital phase-locked loop (DDPLL) are proposed. For the device technologies, the authors use SOI-BiCDMOS with trench dielectric-isolation (TD) which enables integration of various devices in an integrated circuit (IC) while avoiding parasitic miss operations by ideal isolation. The structures of the SOI layer and TD, are optimized for obtaining desired device characteristics and high electromagnetic interference (EMI) immunity. While performing all the air bag system functions over a wide range of supply voltage, and ambient temperature, the resulting single chip reduces the electronic parts to about a half of those in the conventional air bags. The combination of single chip oriented circuits and thick SOI-BiCDMOS technologies offered in this work is valuable for size reduction and improved reliability of automotive electronic control units (ECUs).
      

      
      Accelerating the Original Profile Kernel.
      PubMed
      Hamp, Tobias; Goldberg, Tatyana; Rost, Burkhard
         2013-01-01
         One of the most accurate multi-class protein classification systems continues to be the profile-based SVM kernel introduced by the Leslie group. Unfortunately, its CPU requirements render it too slow for practical applications of large-scale classification tasks. Here, we introduce several software improvements that enable significant acceleration. Using various non-redundant data sets, we demonstrate that our new implementation reaches a maximal speed-up as high as 14-fold for calculating the same kernel matrix. Some predictions are over 200 times faster and render the kernel as possibly the top contender in a low ratio of speed/performance. Additionally, we explain how to parallelize various computations and provide an integrative program that reduces creating a production-quality classifier to a single program call. The new implementation is available as a Debian package under a free academic license and does not depend on commercial software. For non-Debian based distributions, the source package ships with a traditional Makefile-based installer. Download and installation instructions can be found at https://rostlab.org/owiki/index.php/Fast_Profile_Kernel. Bugs and other issues may be reported at https://rostlab.org/bugzilla3/enter_bug.cgi?product=fastprofkernel.
      

      
      Inexact adaptive Newton methods
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Bertiger, W.I.; Kelsey, F.J.
         1985-02-01
         The Inexact Adaptive Newton method (IAN) is a modification of the Adaptive Implicit Method/sup 1/ (AIM) with improved Newton convergence. Both methods simplify the Jacobian at each time step by zeroing coefficients in regions where saturations are changing slowly. The methods differ in how the diagonal block terms are treated. On test problems with up to 3,000 cells, IAN consistently saves approximately 30% of the CPU time when compared to the fully implicit method. AIM shows similar savings on some problems, but takes as much CPU time as fully implicit on other test problems due to poor Newton convergence.
      

      
      A numerical code for the simulation of non-equilibrium chemically reacting flows on hybrid CPU-GPU clusters
      NASA Astrophysics Data System (ADS)
      Kudryavtsev, Alexey N.; Kashkovsky, Alexander V.; Borisov, Semyon P.; Shershnev, Anton A.
         2017-10-01
         In the present work a computer code RCFS for numerical simulation of chemically reacting compressible flows on hybrid CPU/GPU supercomputers is developed. It solves 3D unsteady Euler equations for multispecies chemically reacting flows in general curvilinear coordinates using shock-capturing TVD schemes. Time advancement is carried out using the explicit Runge-Kutta TVD schemes. Program implementation uses CUDA application programming interface to perform GPU computations. Data between GPUs is distributed via domain decomposition technique. The developed code is verified on the number of test cases including supersonic flow over a cylinder.
      

      
      Application of queueing models to multiprogrammed computer systems operating in a time-critical environment
      NASA Technical Reports Server (NTRS)
      Eckhardt, D. E., Jr.
         1979-01-01
         A model of a central processor (CPU) which services background applications in the presence of time critical activity is presented. The CPU is viewed as an M/M/1 queueing system subject to periodic interrupts by deterministic, time critical process. The Laplace transform of the distribution of service times for the background applications is developed. The use of state of the art queueing models for studying the background processing capability of time critical computer systems is discussed and the results of a model validation study which support this application of queueing models are presented.
      

      
      Upwind relaxation methods for the Navier-Stokes equations using inner iterations
      NASA Technical Reports Server (NTRS)
      Taylor, Arthur C., III; Ng, Wing-Fai; Walters, Robert W.
         1992-01-01
         A subsonic and a supersonic problem are respectively treated by an upwind line-relaxation algorithm for the Navier-Stokes equations using inner iterations to accelerate steady-state solution convergence and thereby minimize CPU time. While the ability of the inner iterative procedure to mimic the quadratic convergence of the direct solver method is attested to in both test problems, some of the nonquadratic inner iterative results are noted to have been more efficient than the quadratic. In the more successful, supersonic test case, inner iteration required only about 65 percent of the line-relaxation method-entailed CPU time.
      

      
      Application of graphics processing units to search pipelines for gravitational waves from coalescing binaries of compact objects
      NASA Astrophysics Data System (ADS)
      Chung, Shin Kee; Wen, Linqing; Blair, David; Cannon, Kipp; Datta, Amitava
         2010-07-01
         We report a novel application of a graphics processing unit (GPU) for the purpose of accelerating the search pipelines for gravitational waves from coalescing binaries of compact objects. A speed-up of 16-fold in total has been achieved with an NVIDIA GeForce 8800 Ultra GPU card compared with one core of a 2.5 GHz Intel Q9300 central processing unit (CPU). We show that substantial improvements are possible and discuss the reduction in CPU count required for the detection of inspiral sources afforded by the use of GPUs.
      

      
      A 3D front tracking method on a CPU/GPU system
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Bo, Wurigen; Grove, John
         2011-01-21
         We describe the method to port a sequential 3D interface tracking code to a GPU with CUDA. The interface is represented as a triangular mesh. Interface geometry properties and point propagation are performed on a GPU. Interface mesh adaptation is performed on a CPU. The convergence of the method is assessed from the test problems with given velocity fields. Performance results show overall speedups from 11 to 14 for the test problems under mesh refinement. We also briefly describe our ongoing work to couple the interface tracking method with a hydro solver.
      

      
      Dosimetric comparison of helical tomotherapy treatment plans for total marrow irradiation created using GPU and CPU dose calculation engines.
      PubMed
      Nalichowski, Adrian; Burmeister, Jay
         2013-07-01
         To compare optimization characteristics, plan quality, and treatment delivery efficiency between total marrow irradiation (TMI) plans using the new TomoTherapy graphic processing unit (GPU) based dose engine and CPU/cluster based dose engine. Five TMI plans created on an anthropomorphic phantom were optimized and calculated with both dose engines. The planning treatment volume (PTV) included all the bones from head to mid femur except for upper extremities. Evaluated organs at risk (OAR) consisted of lung, liver, heart, kidneys, and brain. The following treatment parameters were used to generate the TMI plans: field widths of 2.5 and 5 cm, modulation factors of 2 and 2.5, and pitch of either 0.287 or 0.43. The optimization parameters were chosen based on the PTV and OAR priorities and the plans were optimized with a fixed number of iterations. The PTV constraint was selected to ensure that at least 95% of the PTV received the prescription dose. The plans were evaluated based on D80 and D50 (dose to 80% and 50% of the OAR volume, respectively) and hotspot volumes within the PTVs. Gamma indices (Γ) were also used to compare planar dose distributions between the two modalities. The optimization and dose calculation times were compared between the two systems. The treatment delivery times were also evaluated. The results showed very good dosimetric agreement between the GPU and CPU calculated plans for any of the evaluated planning parameters indicating that both systems converge on nearly identical plans. All D80 and D50 parameters varied by less than 3% of the prescription dose with an average difference of 0.8%. A gamma analysis Γ(3%, 3 mm) < 1 of the GPU plan resulted in over 90% of calculated voxels satisfying Γ < 1 criterion as compared to baseline CPU plan. The average number of voxels meeting the Γ < 1 criterion for all the plans was 97%. In terms of dose optimization/calculation efficiency, there was a 20-fold reduction in planning time with the new GPU system. The average optimization/dose calculation time utilizing the traditional CPU/cluster based system was 579 vs 26.8 min for the GPU based system. There was no difference in the calculated treatment delivery time per fraction. Beam-on time varied based on field width and pitch and ranged between 15 and 28 min. The TomoTherapy GPU based dose engine is capable of calculating TMI treatment plans with plan quality nearly identical to plans calculated using the traditional CPU/cluster based system, while significantly reducing the time required for optimization and dose calculation.
      

      
      Fast in-memory elastic full-waveform inversion using consumer-grade GPUs
      NASA Astrophysics Data System (ADS)
      Sivertsen Bergslid, Tore; Birger Raknes, Espen; Arntsen, Børge
         2017-04-01
         Full-waveform inversion (FWI) is a technique to estimate subsurface properties by using the recorded waveform produced by a seismic source and applying inverse theory. This is done through an iterative optimization procedure, where each iteration requires solving the wave equation many times, then trying to minimize the difference between the modeled and the measured seismic data. Having to model many of these seismic sources per iteration means that this is a highly computationally demanding procedure, which usually involves writing a lot of data to disk. We have written code that does forward modeling and inversion entirely in memory. A typical HPC cluster has many more CPUs than GPUs. Since FWI involves modeling many seismic sources per iteration, the obvious approach is to parallelize the code on a source-by-source basis, where each core of the CPU performs one modeling, and do all modelings simultaneously. With this approach, the GPU is already at a major disadvantage in pure numbers. Fortunately, GPUs can more than make up for this hardware disadvantage by performing each modeling much faster than a CPU. Another benefit of parallelizing each individual modeling is that it lets each modeling use a lot more RAM. If one node has 128 GB of RAM and 20 CPU cores, each modeling can use only 6.4 GB RAM if one is running the node at full capacity with source-by-source parallelization on the CPU. A parallelized per-source code using GPUs can use 64 GB RAM per modeling. Whenever a modeling uses more RAM than is available and has to start using regular disk space the runtime increases dramatically, due to slow file I/O. The extremely high computational speed of the GPUs combined with the large amount of RAM available for each modeling lets us do high frequency FWI for fairly large models very quickly. For a single modeling, our GPU code outperforms the single-threaded CPU-code by a factor of about 75. Successful inversions have been run on data with frequencies up to 40 Hz for a model of 2001 by 600 grid points with 5 m grid spacing and 5000 time steps, in less than 2.5 minutes per source. In practice, using 15 nodes (30 GPUs) to model 101 sources, each iteration took approximately 9 minutes. For reference, the same inversion run with our CPU code uses two hours per iteration. This was done using only a very simple wavefield interpolation technique, saving every second timestep. Using a more sophisticated checkpointing or wavefield reconstruction method would allow us to increase this model size significantly. Our results show that ordinary gaming GPUs are a viable alternative to the expensive professional GPUs often used today, when performing large scale modeling and inversion in geophysics.
      

      
      Producing an Infrared Multiwavelength Galactic Plane Atlas Using Montage, Pegasus, and Amazon Web Services
      NASA Astrophysics Data System (ADS)
      Rynge, M.; Juve, G.; Kinney, J.; Good, J.; Berriman, B.; Merrihew, A.; Deelman, E.
         2014-05-01
         In this paper, we describe how to leverage cloud resources to generate large-scale mosaics of the galactic plane in multiple wavelengths. Our goal is to generate a 16-wavelength infrared Atlas of the Galactic Plane at a common spatial sampling of 1 arcsec, processed so that they appear to have been measured with a single instrument. This will be achieved by using the Montage image mosaic engine process observations from the 2MASS, GLIMPSE, MIPSGAL, MSX and WISE datasets, over a wavelength range of 1 μm to 24 μm, and by using the Pegasus Workflow Management System for managing the workload. When complete, the Atlas will be made available to the community as a data product. We are generating images that cover ±180° in Galactic longitude and ±20° in Galactic latitude, to the extent permitted by the spatial coverage of each dataset. Each image will be 5°x5° in size (including an overlap of 1° with neighboring tiles), resulting in an atlas of 1,001 images. The final size will be about 50 TBs. This paper will focus on the computational challenges, solutions, and lessons learned in producing the Atlas. To manage the computation we are using the Pegasus Workflow Management System, a mature, highly fault-tolerant system now in release 4.2.2 that has found wide applicability across many science disciplines. A scientific workflow describes the dependencies between the tasks and in most cases the workflow is described as a directed acyclic graph, where the nodes are tasks and the edges denote the task dependencies. A defining property for a scientific workflow is that it manages data flow between tasks. Applied to the galactic plane project, each 5 by 5 mosaic is a Pegasus workflow. Pegasus is used to fetch the source images, execute the image mosaicking steps of Montage, and store the final outputs in a storage system. As these workflows are very I/O intensive, care has to be taken when choosing what infrastructure to execute the workflow on. In our setup, we choose to use dynamically provisioned compute clusters running on the Amazon Elastic Compute Cloud (EC2). All our instances are using the same base image, which is configured to come up as a master node by default. The master node is a central instance from where the workflow can be managed. Additional worker instances are provisioned and configured to accept work assignments from the master node. The system allows for adding/removing workers in an ad hoc fashion, and could be run in large configurations. To-date we have performed 245,000 CPU hours of computing and generated 7,029 images and totaling 30 TB. With the current set up our runtime would be 340,000 CPU hours for the whole project. Using spot m2.4xlarge instances, the cost would be approximately $5,950. Using faster AWS instances, such as cc2.8xlarge could potentially decrease the total CPU hours and further reduce the compute costs. The paper will explore these tradeoffs.
      

      
      Machine learning methods can replace 3D profile method in classification of amyloidogenic hexapeptides.
      PubMed
      Stanislawski, Jerzy; Kotulska, Malgorzata; Unold, Olgierd
         2013-01-17
         Amyloids are proteins capable of forming fibrils. Many of them underlie serious diseases, like Alzheimer disease. The number of amyloid-associated diseases is constantly increasing. Recent studies indicate that amyloidogenic properties can be associated with short segments of aminoacids, which transform the structure when exposed. A few hundreds of such peptides have been experimentally found. Experimental testing of all possible aminoacid combinations is currently not feasible. Instead, they can be predicted by computational methods. 3D profile is a physicochemical-based method that has generated the most numerous dataset - ZipperDB. However, it is computationally very demanding. Here, we show that dataset generation can be accelerated. Two methods to increase the classification efficiency of amyloidogenic candidates are presented and tested: simplified 3D profile generation and machine learning methods. We generated a new dataset of hexapeptides, using more economical 3D profile algorithm, which showed very good classification overlap with ZipperDB (93.5%). The new part of our dataset contains 1779 segments, with 204 classified as amyloidogenic. The dataset of 6-residue sequences with their binary classification, based on the energy of the segment, was applied for training machine learning methods. A separate set of sequences from ZipperDB was used as a test set. The most effective methods were Alternating Decision Tree and Multilayer Perceptron. Both methods obtained area under ROC curve of 0.96, accuracy 91%, true positive rate ca. 78%, and true negative rate 95%. A few other machine learning methods also achieved a good performance. The computational time was reduced from 18-20 CPU-hours (full 3D profile) to 0.5 CPU-hours (simplified 3D profile) to seconds (machine learning). We showed that the simplified profile generation method does not introduce an error with regard to the original method, while increasing the computational efficiency. Our new dataset proved representative enough to use simple statistical methods for testing the amylogenicity based only on six letter sequences. Statistical machine learning methods such as Alternating Decision Tree and Multilayer Perceptron can replace the energy based classifier, with advantage of very significantly reduced computational time and simplicity to perform the analysis. Additionally, a decision tree provides a set of very easily interpretable rules.
      

      
      Physical fitness modulates incidental but not intentional statistical learning of simultaneous auditory sequences during concurrent physical exercise.
      PubMed
      Daikoku, Tatsuya; Takahashi, Yuji; Futagami, Hiroko; Tarumoto, Nagayoshi; Yasuda, Hideki
         2017-02-01
         In real-world auditory environments, humans are exposed to overlapping auditory information such as those made by human voices and musical instruments even during routine physical activities such as walking and cycling. The present study investigated how concurrent physical exercise affects performance of incidental and intentional learning of overlapping auditory streams, and whether physical fitness modulates the performances of learning. Participants were grouped with 11 participants with lower and higher fitness each, based on their Vo 2 max value. They were presented simultaneous auditory sequences with a distinct statistical regularity each other (i.e. statistical learning), while they were pedaling on the bike and seating on a bike at rest. In experiment 1, they were instructed to attend to one of the two sequences and ignore to the other sequence. In experiment 2, they were instructed to attend to both of the two sequences. After exposure to the sequences, learning effects were evaluated by familiarity test. In the experiment 1, performance of statistical learning of ignored sequences during concurrent pedaling could be higher in the participants with high than low physical fitness, whereas in attended sequence, there was no significant difference in performance of statistical learning between high than low physical fitness. Furthermore, there was no significant effect of physical fitness on learning while resting. In the experiment 2, the both participants with high and low physical fitness could perform intentional statistical learning of two simultaneous sequences in the both exercise and rest sessions. The improvement in physical fitness might facilitate incidental but not intentional statistical learning of simultaneous auditory sequences during concurrent physical exercise.
      

      
      Using Intel's Knight Landing Processor to Accelerate Global Nested Air Quality Prediction Modeling System (GNAQPMS) Model
      NASA Astrophysics Data System (ADS)
      Wang, H.; Chen, H.; Chen, X.; Wu, Q.; Wang, Z.
         2016-12-01
         The Global Nested Air Quality Prediction Modeling System for Hg (GNAQPMS-Hg) is a global chemical transport model coupled Hg transport module to investigate the mercury pollution. In this study, we present our work of transplanting the GNAQPMS model on Intel Xeon Phi processor, Knights Landing (KNL) to accelerate the model. KNL is the second-generation product adopting Many Integrated Core Architecture (MIC) architecture. Compared with the first generation Knight Corner (KNC), KNL has more new hardware features, that it can be used as unique processor as well as coprocessor with other CPU. According to the Vtune tool, the high overhead modules in GNAQPMS model have been addressed, including CBMZ gas chemistry, advection and convection module, and wet deposition module. These high overhead modules were accelerated by optimizing code and using new techniques of KNL. The following optimized measures was done: 1) Changing the pure MPI parallel mode to hybrid parallel mode with MPI and OpenMP; 2.Vectorizing the code to using the 512-bit wide vector computation unit. 3. Reducing unnecessary memory access and calculation. 4. Reducing Thread Local Storage (TLS) for common variables with each OpenMP thread in CBMZ. 5. Changing the way of global communication from files writing and reading to MPI functions. After optimization, the performance of GNAQPMS is greatly increased both on CPU and KNL platform, the single-node test showed that optimized version has 2.6x speedup on two sockets CPU platform and 3.3x speedup on one socket KNL platform compared with the baseline version code, which means the KNL has 1.29x speedup when compared with 2 sockets CPU platform.
      

      
      Evaluation of user input methods for manipulating a tablet personal computer in sterile techniques.
      PubMed
      Yamada, Akira; Komatsu, Daisuke; Suzuki, Takeshi; Kurozumi, Masahiro; Fujinaga, Yasunari; Ueda, Kazuhiko; Kadoya, Masumi
         2017-02-01
         To determine a quick and accurate user input method for manipulating tablet personal computers (PCs) in sterile techniques. We evaluated three different manipulation methods, (1) Computer mouse and sterile system drape, (2) Fingers and sterile system drape, and (3) Digitizer stylus and sterile ultrasound probe cover with a pinhole, in terms of the central processing unit (CPU) performance, manipulation performance, and contactlessness. A significant decrease in CPU score ([Formula: see text]) and an increase in CPU temperature ([Formula: see text]) were observed when a system drape was used. The respective mean times taken to select a target image from an image series (ST) and the mean times for measuring points on an image (MT) were [Formula: see text] and [Formula: see text] s for the computer mouse method, [Formula: see text] and [Formula: see text] s for the finger method, and [Formula: see text] and [Formula: see text] s for the digitizer stylus method, respectively. The ST for the finger method was significantly longer than for the digitizer stylus method ([Formula: see text]). The MT for the computer mouse method was significantly longer than for the digitizer stylus method ([Formula: see text]). The mean success rate for measuring points on an image was significantly lower for the finger method when the diameter of the target was equal to or smaller than 8 mm than for the other methods. No significant difference in the adenosine triphosphate amount at the surface of the tablet PC was observed before, during, or after manipulation via the digitizer stylus method while wearing starch-powdered sterile gloves ([Formula: see text]). Quick and accurate manipulation of tablet PCs in sterile techniques without CPU load is feasible using a digitizer stylus and sterile ultrasound probe cover with a pinhole.
      

      
      Cardiac computed tomography in patients with symptomatic new-onset atrial fibrillation, rule-out acute coronary syndrome, but with intermediate pretest probability for coronary artery disease admitted to a chest pain unit.
      PubMed
      Koopmann, Matthias; Hinrichs, Liane; Olligs, Jan; Lichtenberg, Michael; Eckardt, Lars; Böse, Dirk; Möhlenkamp, Stefan; Waltenberger, Johannes; Breuckmann, Frank
         2018-01-24
         Atrial fibrillation (AF) and coronary artery disease (CAD) may be encountered coincidently in a large portion of patients. However, data on coronary artery calcium burden in such patients are lacking. Thus, we sought to determine the value of cardiac computed tomography (CCT) in patients presenting with new-onset AF associated with an intermediate pretest probability for CAD admitted to a chest pain unit (CPU). Calcium scores (CS) of 73 new-onset, symptomatic AF subjects without typical clinical, electrocardiographic, or laboratory signs of acute coronary syndrome (ACS) admitted to our CPU were analyzed. In addition, results from computed tomography angiography (CTA) were related to coronary angiography findings whenever available. Calcium scores of zero were found in 25%. Median Agatston score was 77 (interquartile range: 1-270) with gender- and territory-specific dispersal. CS scores above average were present in about 50%, high (> 400)-to-very high (> 1000) CS scores were found in 22%. Overall percentile ranking showed a relative accordance to the reference percentile distribution. Additional CTA was performed in 47%, revealing stenoses in 12%. Coronary angiography was performed in 22% and resulted in coronary intervention or surgical revascularization in 7%. On univariate analysis, CS > 50th percentile failed to serve as an independent determinant of significant stenosis during catheterization. Within a CPU setting, relevant CAD was excluded or confirmed in almost 50%, the latter with a high proportion of coronary angiographies and subsequent coronary interventions, underlining the diagnostic value of CCT in symptomatic, non-ACS, new-onset AF patients when admitted to a CPU.
      

      
      Fast multipurpose Monte Carlo simulation for proton therapy using multi- and many-core CPU architectures.
      PubMed
      Souris, Kevin; Lee, John Aldo; Sterpin, Edmond
         2016-04-01
         Accuracy in proton therapy treatment planning can be improved using Monte Carlo (MC) simulations. However the long computation time of such methods hinders their use in clinical routine. This work aims to develop a fast multipurpose Monte Carlo simulation tool for proton therapy using massively parallel central processing unit (CPU) architectures. A new Monte Carlo, called MCsquare (many-core Monte Carlo), has been designed and optimized for the last generation of Intel Xeon processors and Intel Xeon Phi coprocessors. These massively parallel architectures offer the flexibility and the computational power suitable to MC methods. The class-II condensed history algorithm of MCsquare provides a fast and yet accurate method of simulating heavy charged particles such as protons, deuterons, and alphas inside voxelized geometries. Hard ionizations, with energy losses above a user-specified threshold, are simulated individually while soft events are regrouped in a multiple scattering theory. Elastic and inelastic nuclear interactions are sampled from ICRU 63 differential cross sections, thereby allowing for the computation of prompt gamma emission profiles. MCsquare has been benchmarked with the gate/geant4 Monte Carlo application for homogeneous and heterogeneous geometries. Comparisons with gate/geant4 for various geometries show deviations within 2%-1 mm. In spite of the limited memory bandwidth of the coprocessor simulation time is below 25 s for 10(7) primary 200 MeV protons in average soft tissues using all Xeon Phi and CPU resources embedded in a single desktop unit. MCsquare exploits the flexibility of CPU architectures to provide a multipurpose MC simulation tool. Optimized code enables the use of accurate MC calculation within a reasonable computation time, adequate for clinical practice. MCsquare also simulates prompt gamma emission and can thus be used also for in vivo range verification.
      

      
      Revisiting Molecular Dynamics on a CPU/GPU system: Water Kernel and SHAKE Parallelization.
      PubMed
      Ruymgaart, A Peter; Elber, Ron
         2012-11-13
         We report Graphics Processing Unit (GPU) and Open-MP parallel implementations of water-specific force calculations and of bond constraints for use in Molecular Dynamics simulations. We focus on a typical laboratory computing-environment in which a CPU with a few cores is attached to a GPU. We discuss in detail the design of the code and we illustrate performance comparable to highly optimized codes such as GROMACS. Beside speed our code shows excellent energy conservation. Utilization of water-specific lists allows the efficient calculations of non-bonded interactions that include water molecules and results in a speed-up factor of more than 40 on the GPU compared to code optimized on a single CPU core for systems larger than 20,000 atoms. This is up four-fold from a factor of 10 reported in our initial GPU implementation that did not include a water-specific code. Another optimization is the implementation of constrained dynamics entirely on the GPU. The routine, which enforces constraints of all bonds, runs in parallel on multiple Open-MP cores or entirely on the GPU. It is based on Conjugate Gradient solution of the Lagrange multipliers (CG SHAKE). The GPU implementation is partially in double precision and requires no communication with the CPU during the execution of the SHAKE algorithm. The (parallel) implementation of SHAKE allows an increase of the time step to 2.0fs while maintaining excellent energy conservation. Interestingly, CG SHAKE is faster than the usual bond relaxation algorithm even on a single core if high accuracy is expected. The significant speedup of the optimized components transfers the computational bottleneck of the MD calculation to the reciprocal part of Particle Mesh Ewald (PME).
      

      
      Radiation hardened microprocessor for small payloads
      NASA Technical Reports Server (NTRS)
      Shah, Ravi
         1993-01-01
         The RH-3000 program is developing a rad-hard space qualified 32-bit MIPS R-3000 RISC processor under the Naval Research Lab sponsorship. In addition, under IR&D Harris is developing RHC-3000 for embedded control applications where low cost and radiation tolerance are primary concerns. The development program leverages heavily from commercial development of the MIPS R-3000. The commercial R-3000 has a large installed user base and several foundry partners are currently producing a wide variety of R-3000 derivative products. One of the MIPS derivative products, the LR33000 from LSI Logic, was used as the basis for the design of the RH-3000 chipset. The RH-3000 chipset consists of three core chips and two support chips. The core chips include the CPU, which is the R-3000 integer unit and the FPA/MD chip pair, which performs the R-3010 floating point functions. The two support whips contain all the support functions required for fault tolerance support, real-time support, memory management, timers, and other functions. The Harris development effort had first passed silicon success in June, 1992 with the first rad-hard 32-bit RH-3000 CPU chip. The CPU device is 30 kgates, has a 508 mil by 503 mil die size and is fabricated at Harris Semiconductor on the rad-hard CMOS Silicon on Sapphire (SOS) process. The CPU device successfully passed tesing against 600,000 test vectors derived directly on the LSI/MIPS test suite and has been operational as a single board computer running C code for the past year. In addition, the RH-3000 program has developed the methodology for converting commercially developed designs utilizing logic synthesis techniques based on a combination of VHDK and schematic data bases.
      

      
      Using all of your CPU's in HIPE
      NASA Astrophysics Data System (ADS)
      Jacobson, J. D.; Fadda, D.
         2012-09-01
         Modern computer architectures increasingly feature multi-core CPU's. For example, the MacbookPro features the Intel quad-core i7 processors. Through the use of hyper-threading, where each core can execute two threads simultaneously, the quad-core i7 can support eight simultaneous processing threads. All this on your laptop! This CPU power can now be put into service by scientists to perform data reduction tasks, but only if the software has been designed to take advantage of the multiple processor architectures. Up to now, software written for Herschel data reduction (HIPE), written in Jython and JAVA, is single-threaded and can only utilize a single processor. Users of HIPE do not get any advantage from the additional processors. Why not put all of the CPU resources to work reducing your data? We present a multi-threaded software application that corrects long-term transients in the signal from the PACS unchopped spectroscopy line scan mode. In this poster, we present a multi-threaded software framework to achieve performance improvements from parallel execution. We will show how a task to correct transients in the PACS Spectroscopy Pipeline for the un-chopped line scan mode, has been threaded. This computation-intensive task uses either a one-parameter or a three parameter exponential function, to characterize the transient. The task uses a JAVA implementation of Minpack, translated from the C (Moshier) and IDL (Markwardt) by the authors, to optimize the correction parameters. We also explain how to determine if a task can benefit from threading (Amdahl's Law), and if it is safe to thread. The design and implementation, using the JAVA concurrency package completions service is described. Pitfalls, timing bugs, thread safety, resource control, testing and performance improvements are described and plotted.
      

        
       
          

«

16
      17
      18
   19
      20
      »

          
        

     

   

   
       
            
              
          

«

17
      18
      19
   20
      21
      »

          
        

           
           
             
               
      
      Computing the Density Matrix in Electronic Structure Theory on Graphics Processing Units.
      PubMed
      Cawkwell, M J; Sanville, E J; Mniszewski, S M; Niklasson, Anders M N
         2012-11-13
         The self-consistent solution of a Schrödinger-like equation for the density matrix is a critical and computationally demanding step in quantum-based models of interatomic bonding. This step was tackled historically via the diagonalization of the Hamiltonian. We have investigated the performance and accuracy of the second-order spectral projection (SP2) algorithm for the computation of the density matrix via a recursive expansion of the Fermi operator in a series of generalized matrix-matrix multiplications. We demonstrate that owing to its simplicity, the SP2 algorithm [Niklasson, A. M. N. Phys. Rev. B2002, 66, 155115] is exceptionally well suited to implementation on graphics processing units (GPUs). The performance in double and single precision arithmetic of a hybrid GPU/central processing unit (CPU) and full GPU implementation of the SP2 algorithm exceed those of a CPU-only implementation of the SP2 algorithm and traditional matrix diagonalization when the dimensions of the matrices exceed about 2000 × 2000. Padding schemes for arrays allocated in the GPU memory that optimize the performance of the CUBLAS implementations of the level 3 BLAS DGEMM and SGEMM subroutines for generalized matrix-matrix multiplications are described in detail. The analysis of the relative performance of the hybrid CPU/GPU and full GPU implementations indicate that the transfer of arrays between the GPU and CPU constitutes only a small fraction of the total computation time. The errors measured in the self-consistent density matrices computed using the SP2 algorithm are generally smaller than those measured in matrices computed via diagonalization. Furthermore, the errors in the density matrices computed using the SP2 algorithm do not exhibit any dependence of system size, whereas the errors increase linearly with the number of orbitals when diagonalization is employed.
      

      
      A novel iterative mixed model to remap three complex orthopedic traits in dogs
      PubMed Central
      Huang, Meng; Hayward, Jessica J.; Corey, Elizabeth; Garrison, Susan J.; Wagner, Gabriela R.; Krotscheck, Ursula; Hayashi, Kei; Schweitzer, Peter A.; Lust, George; Boyko, Adam R.; Todhunter, Rory J.
         2017-01-01
         Hip dysplasia (HD), elbow dysplasia (ED), and rupture of the cranial (anterior) cruciate ligament (RCCL) are the most common complex orthopedic traits of dogs and all result in debilitating osteoarthritis. We reanalyzed previously reported data: the Norberg angle (a quantitative measure of HD) in 921 dogs, ED in 113 cases and 633 controls, and RCCL in 271 cases and 399 controls and their genotypes at ~185,000 single nucleotide polymorphisms. A novel fixed and random model with a circulating probability unification (FarmCPU) function, with marker-based principal components and a kinship matrix to correct for population stratification, was used. A Bonferroni correction at p<0.01 resulted in a P< 6.96 ×10−8. Six loci were identified; three for HD and three for RCCL. An associated locus at CFA28:34,369,342 for HD was described previously in the same dogs using a conventional mixed model. No loci were identified for RCCL in the previous report but the two loci for ED in the previous report did not reach genome-wide significance using the FarmCPU model. These results were supported by simulation which demonstrated that the FarmCPU held no power advantage over the linear mixed model for the ED sample but provided additional power for the HD and RCCL samples. Candidate genes for HD and RCCL are discussed. When using FarmCPU software, we recommend a resampling test, that a positive control be used to determine the optimum pseudo quantitative trait nucleotide-based covariate structure of the model, and a negative control be used consisting of permutation testing and the identical resampling test as for the non-permuted phenotypes. PMID:28614352
      

      
      Feasibility and Safety of Evaluating Patients with Prior Coronary Artery Disease Using an Accelerated Diagnostic Algorithm in a Chest Pain Unit
      PubMed Central
      Goldkorn, Ronen; Goitein, Orly; Ben-Zekery, Sagit; Shlomo, Nir; Narodetsky, Michael; Livne, Moran; Sabbag, Avi; Asher, Elad; Matetzky, Shlomi
         2016-01-01
         An accelerated diagnostic protocol for evaluating low-risk patients with acute chest pain in a cardiologist-based chest pain unit (CPU) is widely employed today. However, limited data exist regarding the feasibility of such an algorithm for patients with a history of prior coronary artery disease (CAD). The aim of the current study was to assess the feasibility and safety of evaluating patients with a history of prior CAD using an accelerated diagnostic protocol. We evaluated 1,220 consecutive patients presenting with acute chest pain and hospitalized in our CPU. Patients were stratified according to whether they had a history of prior CAD or not. The primary composite outcome was defined as a composite of readmission due to chest pain, acute coronary syndrome, coronary revascularization, or death during a 60-day follow-up period. Overall, 268 (22%) patients had a history of prior CAD. Non-invasive evaluation was performed in 1,112 (91%) patients. While patients with a history of prior CAD had more comorbidities, the two study groups were similar regarding hospitalization rates (9% vs. 13%, p = 0.08), coronary angiography (13% vs. 11%, p = 0.41), and revascularization (6.5% vs. 5.7%, p = 0.8) performed during CPU evaluation. At 60-days the primary endpoint was observed in 12 (1.6%) and 6 (3.2%) patients without and with a history of prior CAD, respectively (p = 0.836). No mortalities were recorded. To conclude, Patients with a history of prior CAD can be expeditiously and safely evaluated using an accelerated diagnostic protocol in a CPU with outcomes not differing from patients without such a history. PMID:27669521
      

      
      Auditory stimulation by exposure to melodic music increases dopamine and serotonin activities in rat forebrain areas linked to reward and motor control.
      PubMed
      Moraes, Michele M; Rabelo, Patrícia C R; Pinto, Valéria A; Pires, Washington; Wanner, Samuel P; Szawka, Raphael E; Soares, Danusa D
         2018-04-23
         Listening to melodic music is regarded as a non-pharmacological intervention that ameliorates various disease symptoms, likely by changing the activity of brain monoaminergic systems. Here, we investigated the effects of exposure to melodic music on the concentrations of dopamine (DA), serotonin (5-HT) and their respective metabolites in the caudate-putamen (CPu) and nucleus accumbens (NAcc), areas linked to reward and motor control. Male adult Wistar rats were randomly assigned to a control group or a group exposed to music. The music group was submitted to 8 music sessions [Mozart's sonata for two pianos (K. 488) at an average sound pressure of 65 dB]. The control rats were handled in the same way but were not exposed to music. Immediately after the last exposure or control session, the rats were euthanized, and their brains were quickly removed to analyze the concentrations of 5-HT, DA, 5-hydroxyindoleacetic acid (5-HIAA) and 3,4-dihydroxyphenylacetic acid (DOPAC) in the CPu and NAcc. Auditory stimuli affected the monoaminergic system in these two brain structures. In the CPu, auditory stimuli increased the concentrations of DA and 5-HIAA but did not change the DOPAC or 5-HT levels. In the NAcc, music markedly increased the DOPAC/DA ratio, suggesting an increase in DA turnover. Our data indicate that auditory stimuli, such as exposure to melodic music, increase DA levels and the release of 5-HT in the CPu as well as DA turnover in the NAcc, suggesting that the music had a direct impact on monoamine activity in these brain areas. Copyright © 2018 Elsevier B.V. All rights reserved.
      

      
      Optimal endothelialisation of a new compliant poly(carbonate-urea)urethane vascular graft with effect of physiological shear stress.
      PubMed
      Salacinski, H J; Tai, N R; Punshon, G; Giudiceandrea, A; Hamilton, G; Seifalian, A M
         2000-10-01
         to define the optimal seeding conditions of a new stress free poly(carbonate-urea)urethane (CPU) graft with compliance similar to that of human artery with honeycomb structure engineered during the manufacturing process to enhance adhesion and growth of endothelial cells. (111)Indium-oxine radiolabeled human umbilical vein endothelial cells (HUVEC) were seeded onto CPU grafts at (a) concentrations from 2-24x10(5)cells/cm(2)and (b) incubated for 0.5, 1, 2, 4 and 6 h. Following incubation, graft segments were subjected to three washing/gamma counting procedures and scanning electron microscopy (SEM). Cell viability was measured using a modified Alamar blue(TM)assay. To test physiological retention a pulsatile flow phantom was used to subject optimally seeded (16x10(5), 4 h) CPU grafts to arterial shear stress for 6 h with real time acquisition of scintigraphic images of seeded grafts using a nuclear medicine gamma camera system. the seeding efficiency of 54+/-13% post three washes was achieved using 16x10(5)cells/cm(2). Similarly in SEM micrographs a seeding density of 16x10(5)cells/cm(2)resulted in a confluent monolayer. Seeded CPU segments incubated for 4 h exhibited significantly higher resistance to wash-off than segments incubated for 30 min (p <0.05). Exposure of seeded grafts to pulsatile shear stress resulted in some cell loss with 67+/-3% of cells adherent following 6 h of perfusion with ongoing metabolic activity. Thus, optimal conditions were 16x10(5)cells/cm(2)at 4 h. the optimal seeding conditions have been defined for "tissue-engineered" vascular graft which allow complete endothelialisation and high cell-to-substrate strength that resists hydrodynamic stress. Copyright 2000 Harcourt Publishers Ltd.
      

      
      Fast Simulation of Dynamic Ultrasound Images Using the GPU.
      PubMed
      Storve, Sigurd; Torp, Hans
         2017-10-01
         Simulated ultrasound data is a valuable tool for development and validation of quantitative image analysis methods in echocardiography. Unfortunately, simulation time can become prohibitive for phantoms consisting of a large number of point scatterers. The COLE algorithm by Gao et al. is a fast convolution-based simulator that trades simulation accuracy for improved speed. We present highly efficient parallelized CPU and GPU implementations of the COLE algorithm with an emphasis on dynamic simulations involving moving point scatterers. We argue that it is crucial to minimize the amount of data transfers from the CPU to achieve good performance on the GPU. We achieve this by storing the complete trajectories of the dynamic point scatterers as spline curves in the GPU memory. This leads to good efficiency when simulating sequences consisting of a large number of frames, such as B-mode and tissue Doppler data for a full cardiac cycle. In addition, we propose a phase-based subsample delay technique that efficiently eliminates flickering artifacts seen in B-mode sequences when COLE is used without enough temporal oversampling. To assess the performance, we used a laptop computer and a desktop computer, each equipped with a multicore Intel CPU and an NVIDIA GPU. Running the simulator on a high-end TITAN X GPU, we observed two orders of magnitude speedup compared to the parallel CPU version, three orders of magnitude speedup compared to simulation times reported by Gao et al. in their paper on COLE, and a speedup of 27000 times compared to the multithreaded version of Field II, using numbers reported in a paper by Jensen. We hope that by releasing the simulator as an open-source project we will encourage its use and further development.
      

      
      Morphological-transformation-based technique of edge detection and skeletonization of an image using a single spatial light modulator
      NASA Astrophysics Data System (ADS)
      Munshi, Soumika; Datta, A. K.
         2003-03-01
         A technique of optically detecting the edge and skeleton of an image by defining shift operations for morphological transformation is described. A (2 × 2) source array, which acts as the structuring element of morphological operations, casts four angularly shifted optical projections of the input image. The resulting dilated image, when superimposed with the complementary input image, produces the edge image. For skeletonization, the source array casts four partially overlapped output images of the inverted input image, which is negated, and the resultant image is recorded in a CCD camera. This overlapped eroded image is again eroded and then dilated, producing an opened image. The difference between the eroded and opened image is then computed, resulting in a thinner image. This procedure of obtaining a thinned image is iterated until the difference image becomes zero, maintaining the connectivity conditions. The technique has been optically implemented using a single spatial modulator and has the advantage of single-instruction parallel processing of the image. The techniques have been tested both for binary and grey images.
      

      
      On the Finite Element Implementation of the Generalized Method of Cells Micromechanics Constitutive Model
      NASA Technical Reports Server (NTRS)
      Wilt, T. E.
         1995-01-01
         The Generalized Method of Cells (GMC), a micromechanics based constitutive model, is implemented into the finite element code MARC using the user subroutine HYPELA. Comparisons in terms of transverse deformation response, micro stress and strain distributions, and required CPU time are presented for GMC and finite element models of fiber/matrix unit cell. GMC is shown to provide comparable predictions of the composite behavior and requires significantly less CPU time as compared to a finite element analysis of the unit cell. Details as to the organization of the HYPELA code are provided with the actual HYPELA code included in the appendix.
      

      
      Sequence search on a supercomputer.
      PubMed
      Gotoh, O; Tagashira, Y
         1986-01-10
         A set of programs was developed for searching nucleic acid and protein sequence data bases for sequences similar to a given sequence. The programs, written in FORTRAN 77, were optimized for vector processing on a Hitachi S810-20 supercomputer. A search of a 500-residue protein sequence against the entire PIR data base Ver. 1.0 (1) (0.5 M residues) is carried out in a CPU time of 45 sec. About 4 min is required for an exhaustive search of a 1500-base nucleotide sequence against all mammalian sequences (1.2M bases) in Genbank Ver. 29.0. The CPU time is reduced to about a quarter with a faster version.
      

      
      A CPU benchmark for protein crystallographic refinement.
      PubMed
      Bourne, P E; Hendrickson, W A
         1990-01-01
         The CPU time required to complete a cycle of restrained least-squares refinement of a protein structure from X-ray crystallographic data using the FORTRAN codes PROTIN and PROLSQ are reported for 48 different processors, ranging from single-user workstations to supercomputers. Sequential, vector, VLIW, multiprocessor, and RISC hardware architectures are compared using both a small and a large protein structure. Representative compile times for each hardware type are also given, and the improvement in run-time when coding for a specific hardware architecture considered. The benchmarks involve scalar integer and vector floating point arithmetic and are representative of the calculations performed in many scientific disciplines.
      

      
      Adaptive real-time methodology for optimizing energy-efficient computing
      DOEpatents
      Hsu, Chung-Hsing [Los Alamos, NM; Feng, Wu-Chun [Blacksburg, VA
         2011-06-28
         Dynamic voltage and frequency scaling (DVFS) is an effective way to reduce energy and power consumption in microprocessor units. Current implementations of DVFS suffer from inaccurate modeling of power requirements and usage, and from inaccurate characterization of the relationships between the applicable variables. A system and method is proposed that adjusts CPU frequency and voltage based on run-time calculations of the workload processing time, as well as a calculation of performance sensitivity with respect to CPU frequency. The system and method are processor independent, and can be applied to either an entire system as a unit, or individually to each process running on a system.
      

      
      Comparison of Conjugate Gradient Density Matrix Search and Chebyshev Expansion Methods for Avoiding Diagonalization in Large-Scale Electronic Structure Calculations
      NASA Technical Reports Server (NTRS)
      Bates, Kevin R.; Daniels, Andrew D.; Scuseria, Gustavo E.
         1998-01-01
         We report a comparison of two linear-scaling methods which avoid the diagonalization bottleneck of traditional electronic structure algorithms. The Chebyshev expansion method (CEM) is implemented for carbon tight-binding calculations of large systems and its memory and timing requirements compared to those of our previously implemented conjugate gradient density matrix search (CG-DMS). Benchmark calculations are carried out on icosahedral fullerenes from C60 to C8640 and the linear scaling memory and CPU requirements of the CEM demonstrated. We show that the CPU requisites of the CEM and CG-DMS are similar for calculations with comparable accuracy.
      

      
      Input/output behavior of supercomputing applications
      NASA Technical Reports Server (NTRS)
      Miller, Ethan L.
         1991-01-01
         The collection and analysis of supercomputer I/O traces and their use in a collection of buffering and caching simulations are described. This serves two purposes. First, it gives a model of how individual applications running on supercomputers request file system I/O, allowing system designer to optimize I/O hardware and file system algorithms to that model. Second, the buffering simulations show what resources are needed to maximize the CPU utilization of a supercomputer given a very bursty I/O request rate. By using read-ahead and write-behind in a large solid stated disk, one or two applications were sufficient to fully utilize a Cray Y-MP CPU.
      

      
      Exploiting graphics processing units for computational biology and bioinformatics.
      PubMed
      Payne, Joshua L; Sinnott-Armstrong, Nicholas A; Moore, Jason H
         2010-09-01
         Advances in the video gaming industry have led to the production of low-cost, high-performance graphics processing units (GPUs) that possess more memory bandwidth and computational capability than central processing units (CPUs), the standard workhorses of scientific computing. With the recent release of generalpurpose GPUs and NVIDIA's GPU programming language, CUDA, graphics engines are being adopted widely in scientific computing applications, particularly in the fields of computational biology and bioinformatics. The goal of this article is to concisely present an introduction to GPU hardware and programming, aimed at the computational biologist or bioinformaticist. To this end, we discuss the primary differences between GPU and CPU architecture, introduce the basics of the CUDA programming language, and discuss important CUDA programming practices, such as the proper use of coalesced reads, data types, and memory hierarchies. We highlight each of these topics in the context of computing the all-pairs distance between instances in a dataset, a common procedure in numerous disciplines of scientific computing. We conclude with a runtime analysis of the GPU and CPU implementations of the all-pairs distance calculation. We show our final GPU implementation to outperform the CPU implementation by a factor of 1700.
      

      
      Resource Isolation Method for Program’S Performance on CMP
      NASA Astrophysics Data System (ADS)
      Guan, Ti; Liu, Chunxiu; Xu, Zheng; Li, Huicong; Ma, Qiang
         2017-10-01
         Data center and cloud computing are more popular, which make more benefits for customers and the providers. However, in data center or clusters, commonly there is more than one program running on one server, but programs may interference with each other. The interference may take a little effect, however, the interference may cause serious drop down of performance. In order to avoid the performance interference problem, the mechanism of isolate resource for different programs is a better choice. In this paper we propose a light cost resource isolation method to improve program’s performance. This method uses Cgroups to set the dedicated CPU and memory resource for a program, aiming at to guarantee the program’s performance. There are three engines to realize this method: Program Monitor Engine top program’s resource usage of CPU and memory and transfer the information to Resource Assignment Engine; Resource Assignment Engine calculates the size of CPU and memory resource should be applied for the program; Cgroups Control Engine divide resource by Linux tool Cgroups, and drag program in control group for execution. The experiment result show that making use of the resource isolation method proposed by our paper, program’s performance can be improved.
      

      
      Data Acquisition System for Multi-Frequency Radar Flight Operations Preparation
      NASA Technical Reports Server (NTRS)
      Leachman, Jonathan
         2010-01-01
         A three-channel data acquisition system was developed for the NASA Multi-Frequency Radar (MFR) system. The system is based on a commercial-off-the-shelf (COTS) industrial PC (personal computer) and two dual-channel 14-bit digital receiver cards. The decimated complex envelope representations of the three radar signals are passed to the host PC via the PCI bus, and then processed in parallel by multiple cores of the PC CPU (central processing unit). The innovation is this parallelization of the radar data processing using multiple cores of a standard COTS multi-core CPU. The data processing portion of the data acquisition software was built using autonomous program modules or threads, which can run simultaneously on different cores. A master program module calculates the optimal number of processing threads, launches them, and continually supplies each with data. The benefit of this new parallel software architecture is that COTS PCs can be used to implement increasingly complex processing algorithms on an increasing number of radar range gates and data rates. As new PCs become available with higher numbers of CPU cores, the software will automatically utilize the additional computational capacity.
      

      
      Compiler-based code generation and autotuning for geometric multigrid on GPU-accelerated supercomputers
      DOE PAGES
      Basu, Protonu; Williams, Samuel; Van Straalen, Brian; ...
         2017-04-05
         GPUs, with their high bandwidths and computational capabilities are an increasingly popular target for scientific computing. Unfortunately, to date, harnessing the power of the GPU has required use of a GPU-specific programming model like CUDA, OpenCL, or OpenACC. Thus, in order to deliver portability across CPU-based and GPU-accelerated supercomputers, programmers are forced to write and maintain two versions of their applications or frameworks. In this paper, we explore the use of a compiler-based autotuning framework based on CUDA-CHiLL to deliver not only portability, but also performance portability across CPU- and GPU-accelerated platforms for the geometric multigrid linear solvers found inmore » many scientific applications. We also show that with autotuning we can attain near Roofline (a performance bound for a computation and target architecture) performance across the key operations in the miniGMG benchmark for both CPU- and GPU-based architectures as well as for a multiple stencil discretizations and smoothers. We show that our technology is readily interoperable with MPI resulting in performance at scale equal to that obtained via hand-optimized MPI+CUDA implementation.« less
      

      
      An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU
      DOE PAGES
      Lyakh, Dmitry I.
         2015-01-05
         An efficient parallel tensor transpose algorithm is suggested for shared-memory computing units, namely, multicore CPU, Intel Xeon Phi, and NVidia GPU. The algorithm operates on dense tensors (multidimensional arrays) and is based on the optimization of cache utilization on x86 CPU and the use of shared memory on NVidia GPU. From the applied side, the ultimate goal is to minimize the overhead encountered in the transformation of tensor contractions into matrix multiplications in computer implementations of advanced methods of quantum many-body theory (e.g., in electronic structure theory and nuclear physics). A particular accent is made on higher-dimensional tensors that typicallymore » appear in the so-called multireference correlated methods of electronic structure theory. Depending on tensor dimensionality, the presented optimized algorithms can achieve an order of magnitude speedup on x86 CPUs and 2-3 times speedup on NVidia Tesla K20X GPU with respect to the na ve scattering algorithm (no memory access optimization). Furthermore, the tensor transpose routines developed in this work have been incorporated into a general-purpose tensor algebra library (TAL-SH).« less
      

      
      Purine 5‧,8-cyclo-2‧-deoxynucleoside lesions in irradiated DNA
      NASA Astrophysics Data System (ADS)
      Chatgilialoglu, Chryssostomos; Krokidis, Marios G.; Papadopoulos, Kyriakos; Terzidis, Michael A.
         2016-11-01
         Having their position gained among the smallest bulky DNA lesions recognized by the nucleotide excision repair (NER) enzyme, purine 5‧,8-cyclo-2‧-deoxynucleosides (5‧,8-cPu) are increasingly attracting the interest in the field of genome integrity in health and diseases. Exclusively generated by one of the most harmful of the reactive oxygen species, the hydroxyl radical, 5‧,8-cPu can be utilized also for highly valuable information regarding the oxidative status nearby the area where the genetic information is stored. Herein, we have collected the most recently reported biological studies, focusing on the repair mechanism of these lesions and their biological significance particularly in transcription. The LC-MS/MS quantification protocols that appeared in the literature are discussed in details, along with the reported values for the four 5‧,8-cPu produced by in vitro γ-radiolysis experiments with calf thymus DNA. Mechanistic insights in the formation of the purine 5‧,8-cyclo-2‧-deoxynucleosides and their chemical stability are also given in the light of their potential to be utilized as DNA biomarkers of oxidative stress.
      

      
      Energy consumption optimization of the total-FETI solver by changing the CPU frequency
      NASA Astrophysics Data System (ADS)
      Horak, David; Riha, Lubomir; Sojka, Radim; Kruzik, Jakub; Beseda, Martin; Cermak, Martin; Schuchart, Joseph
         2017-07-01
         The energy consumption of supercomputers is one of the critical problems for the upcoming Exascale supercomputing era. The awareness of power and energy consumption is required on both software and hardware side. This paper deals with the energy consumption evaluation of the Finite Element Tearing and Interconnect (FETI) based solvers of linear systems, which is an established method for solving real-world engineering problems. We have evaluated the effect of the CPU frequency on the energy consumption of the FETI solver using a linear elasticity 3D cube synthetic benchmark. In this problem, we have evaluated the effect of frequency tuning on the energy consumption of the essential processing kernels of the FETI method. The paper provides results for two types of frequency tuning: (1) static tuning and (2) dynamic tuning. For static tuning experiments, the frequency is set before execution and kept constant during the runtime. For dynamic tuning, the frequency is changed during the program execution to adapt the system to the actual needs of the application. The paper shows that static tuning brings up 12% energy savings when compared to default CPU settings (the highest clock rate). The dynamic tuning improves this further by up to 3%.
      

        
       
          

«

17
      18
      19
   20
      21
      »

          
        

     

   

   
       
            
              
          

«

18
      19
      20
   21
      22
      »

          
        

           
           
             
               
      
      GPU-based stochastic-gradient optimization for non-rigid medical image registration in time-critical applications
      NASA Astrophysics Data System (ADS)
      Bhosale, Parag; Staring, Marius; Al-Ars, Zaid; Berendsen, Floris F.
         2018-03-01
         Currently, non-rigid image registration algorithms are too computationally intensive to use in time-critical applications. Existing implementations that focus on speed typically address this by either parallelization on GPU-hardware, or by introducing methodically novel techniques into CPU-oriented algorithms. Stochastic gradient descent (SGD) optimization and variations thereof have proven to drastically reduce the computational burden for CPU-based image registration, but have not been successfully applied in GPU hardware due to its stochastic nature. This paper proposes 1) NiftyRegSGD, a SGD optimization for the GPU-based image registration tool NiftyReg, 2) random chunk sampler, a new random sampling strategy that better utilizes the memory bandwidth of GPU hardware. Experiments have been performed on 3D lung CT data of 19 patients, which compared NiftyRegSGD (with and without random chunk sampler) with CPU-based elastix Fast Adaptive SGD (FASGD) and NiftyReg. The registration runtime was 21.5s, 4.4s and 2.8s for elastix-FASGD, NiftyRegSGD without, and NiftyRegSGD with random chunk sampling, respectively, while similar accuracy was obtained. Our method is publicly available at https://github.com/SuperElastix/NiftyRegSGD.
      

      
      Fast MPEG-CDVS Encoder With GPU-CPU Hybrid Computing
      NASA Astrophysics Data System (ADS)
      Duan, Ling-Yu; Sun, Wei; Zhang, Xinfeng; Wang, Shiqi; Chen, Jie; Yin, Jianxiong; See, Simon; Huang, Tiejun; Kot, Alex C.; Gao, Wen
         2018-05-01
         The compact descriptors for visual search (CDVS) standard from ISO/IEC moving pictures experts group (MPEG) has succeeded in enabling the interoperability for efficient and effective image retrieval by standardizing the bitstream syntax of compact feature descriptors. However, the intensive computation of CDVS encoder unfortunately hinders its widely deployment in industry for large-scale visual search. In this paper, we revisit the merits of low complexity design of CDVS core techniques and present a very fast CDVS encoder by leveraging the massive parallel execution resources of GPU. We elegantly shift the computation-intensive and parallel-friendly modules to the state-of-the-arts GPU platforms, in which the thread block allocation and the memory access are jointly optimized to eliminate performance loss. In addition, those operations with heavy data dependence are allocated to CPU to resolve the extra but non-necessary computation burden for GPU. Furthermore, we have demonstrated the proposed fast CDVS encoder can work well with those convolution neural network approaches which has harmoniously leveraged the advantages of GPU platforms, and yielded significant performance improvements. Comprehensive experimental results over benchmarks are evaluated, which has shown that the fast CDVS encoder using GPU-CPU hybrid computing is promising for scalable visual search.
      

      
      Implementation of GPU accelerated SPECT reconstruction with Monte Carlo-based scatter correction.
      PubMed
      Bexelius, Tobias; Sohlberg, Antti
         2018-06-01
         Statistical SPECT reconstruction can be very time-consuming especially when compensations for collimator and detector response, attenuation, and scatter are included in the reconstruction. This work proposes an accelerated SPECT reconstruction algorithm based on graphics processing unit (GPU) processing. Ordered subset expectation maximization (OSEM) algorithm with CT-based attenuation modelling, depth-dependent Gaussian convolution-based collimator-detector response modelling, and Monte Carlo-based scatter compensation was implemented using OpenCL. The OpenCL implementation was compared against the existing multi-threaded OSEM implementation running on a central processing unit (CPU) in terms of scatter-to-primary ratios, standardized uptake values (SUVs), and processing speed using mathematical phantoms and clinical multi-bed bone SPECT/CT studies. The difference in scatter-to-primary ratios, visual appearance, and SUVs between GPU and CPU implementations was minor. On the other hand, at its best, the GPU implementation was noticed to be 24 times faster than the multi-threaded CPU version on a normal 128 × 128 matrix size 3 bed bone SPECT/CT data set when compensations for collimator and detector response, attenuation, and scatter were included. GPU SPECT reconstructions show great promise as an every day clinical reconstruction tool.
      

      
      Compiler-based code generation and autotuning for geometric multigrid on GPU-accelerated supercomputers
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Basu, Protonu; Williams, Samuel; Van Straalen, Brian
         
         GPUs, with their high bandwidths and computational capabilities are an increasingly popular target for scientific computing. Unfortunately, to date, harnessing the power of the GPU has required use of a GPU-specific programming model like CUDA, OpenCL, or OpenACC. Thus, in order to deliver portability across CPU-based and GPU-accelerated supercomputers, programmers are forced to write and maintain two versions of their applications or frameworks. In this paper, we explore the use of a compiler-based autotuning framework based on CUDA-CHiLL to deliver not only portability, but also performance portability across CPU- and GPU-accelerated platforms for the geometric multigrid linear solvers found inmore » many scientific applications. We also show that with autotuning we can attain near Roofline (a performance bound for a computation and target architecture) performance across the key operations in the miniGMG benchmark for both CPU- and GPU-based architectures as well as for a multiple stencil discretizations and smoothers. We show that our technology is readily interoperable with MPI resulting in performance at scale equal to that obtained via hand-optimized MPI+CUDA implementation.« less
      

      
      GPU color space conversion
      NASA Astrophysics Data System (ADS)
      Chase, Patrick; Vondran, Gary
         2011-01-01
         Tetrahedral interpolation is commonly used to implement continuous color space conversions from sparse 3D and 4D lookup tables. We investigate the implementation and optimization of tetrahedral interpolation algorithms for GPUs, and compare to the best known CPU implementations as well as to a well known GPU-based trilinear implementation. We show that a 500 NVIDIA GTX-580 GPU is 3x faster than a 1000 Intel Core i7 980X CPU for 3D interpolation, and 9x faster for 4D interpolation. Performance-relevant GPU attributes are explored including thread scheduling, local memory characteristics, global memory hierarchy, and cache behaviors. We consider existing tetrahedral interpolation algorithms and tune based on the structure and branching capabilities of current GPUs. Global memory performance is improved by reordering and expanding the lookup table to ensure optimal access behaviors. Per multiprocessor local memory is exploited to implement optimally coalesced global memory accesses, and local memory addressing is optimized to minimize bank conflicts. We explore the impacts of lookup table density upon computation and memory access costs. Also presented are CPU-based 3D and 4D interpolators, using SSE vector operations that are faster than any previously published solution.
      

      
      GNAQPMS v1.1: accelerating the Global Nested Air Quality Prediction Modeling System (GNAQPMS) on Intel Xeon Phi processors
      NASA Astrophysics Data System (ADS)
      Wang, Hui; Chen, Huansheng; Wu, Qizhong; Lin, Junmin; Chen, Xueshun; Xie, Xinwei; Wang, Rongrong; Tang, Xiao; Wang, Zifa
         2017-08-01
         The Global Nested Air Quality Prediction Modeling System (GNAQPMS) is the global version of the Nested Air Quality Prediction Modeling System (NAQPMS), which is a multi-scale chemical transport model used for air quality forecast and atmospheric environmental research. In this study, we present the porting and optimisation of GNAQPMS on a second-generation Intel Xeon Phi processor, codenamed Knights Landing (KNL). Compared with the first-generation Xeon Phi coprocessor (codenamed Knights Corner, KNC), KNL has many new hardware features such as a bootable processor, high-performance in-package memory and ISA compatibility with Intel Xeon processors. In particular, we describe the five optimisations we applied to the key modules of GNAQPMS, including the CBM-Z gas-phase chemistry, advection, convection and wet deposition modules. These optimisations work well on both the KNL 7250 processor and the Intel Xeon E5-2697 V4 processor. They include (1) updating the pure Message Passing Interface (MPI) parallel mode to the hybrid parallel mode with MPI and OpenMP in the emission, advection, convection and gas-phase chemistry modules; (2) fully employing the 512 bit wide vector processing units (VPUs) on the KNL platform; (3) reducing unnecessary memory access to improve cache efficiency; (4) reducing the thread local storage (TLS) in the CBM-Z gas-phase chemistry module to improve its OpenMP performance; and (5) changing the global communication from writing/reading interface files to MPI functions to improve the performance and the parallel scalability. These optimisations greatly improved the GNAQPMS performance. The same optimisations also work well for the Intel Xeon Broadwell processor, specifically E5-2697 v4. Compared with the baseline version of GNAQPMS, the optimised version was 3.51 × faster on KNL and 2.77 × faster on the CPU. Moreover, the optimised version ran at 26 % lower average power on KNL than on the CPU. With the combined performance and energy improvement, the KNL platform was 37.5 % more efficient on power consumption compared with the CPU platform. The optimisations also enabled much further parallel scalability on both the CPU cluster and the KNL cluster scaled to 40 CPU nodes and 30 KNL nodes, with a parallel efficiency of 70.4 and 42.2 %, respectively.
      

      
      Technical Competencies Applied in Experimental Fluid Dynamics
      NASA Astrophysics Data System (ADS)
      Tagg, Randall
         2017-11-01
         The practical design, construction, and operation of fluid dynamics experiments require a broad range of competencies. Three types are instrumental, procedural, and design. Respective examples would be operation of a spectrum analyzer, soft-soldering or brazing flow plumbing, and design of a small wind tunnel. Some competencies, such as the selection and installation of pumping systems, are unique to fluid dynamics and fluids engineering. Others, such as the design and construction of electronic amplifiers or optical imaging systems, overlap with other fields. Thus the identification and development of learning materials and methods for instruction are part of a larger effort to identify competencies needed in active research and technical innovation.
      

      
      Final Technical Report for Years 1-4 of the Early Career Research Project "Viscosity and equation of state of hot and dense QCD matter" - ARRA portion
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Molnar, Denes
         2014-04-14
         The Section below summarizes research activities and achievements during the first four years of the PI’s Early Career Research Project (ECRP). Two main areas have been advanced: i) radiative 3 ↔ 2 radiative transport, via development of a new computer code MPC/Grid that solves the Boltzmann transport equation in full 6+1D (3X+3V+time) on both single-CPU and parallel computers; ii) development of a self-consistent framework to convert viscous fluids to particles, and application of this framework to relativistic heavy-ion collisions, in particular, determination of the shear viscosity. Year 5 of the ECRP is under a separate award number, and therefore itmore » has its own report document ’Final Technical Report for Year 5 of the Early Career Research Project “Viscosity and equation of state of hot and dense QCDmatter”’ (award DE-SC0008028). The PI’s group was also part of the DOE JET Topical Collaboration, a multi-institution project that overlapped in time significantly with the ECRP. Purdue achievements as part of the JET Topical Collaboration are in a separate report “Final Technical Report summarizing Purdue research activities as part of the DOE JET Topical Collaboration” (award DE-SC0004077).« less
      

      
      An evaluation of superminicomputers for thermal analysis
      NASA Technical Reports Server (NTRS)
      Storaasli, O. O.; Vidal, J. B.; Jones, G. K.
         1962-01-01
         The feasibility and cost effectiveness of solving thermal analysis problems on superminicomputers is demonstrated. Conventional thermal analysis and the changing computer environment, computer hardware and software used, six thermal analysis test problems, performance of superminicomputers (CPU time, accuracy, turnaround, and cost) and comparison with large computers are considered. Although the CPU times for superminicomputers were 15 to 30 times greater than the fastest mainframe computer, the minimum cost to obtain the solutions on superminicomputers was from 11 percent to 59 percent of the cost of mainframe solutions. The turnaround (elapsed) time is highly dependent on the computer load, but for large problems, superminicomputers produced results in less elapsed time than a typically loaded mainframe computer.
      

      
      Evaluating Mobile Graphics Processing Units (GPUs) for Real-Time Resource Constrained Applications
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Meredith, J; Conger, J; Liu, Y
         2005-11-11
         Modern graphics processing units (GPUs) can provide tremendous performance boosts for some applications beyond what a single CPU can accomplish, and their performance is growing at a rate faster than CPUs as well. Mobile GPUs available for laptops have the small form factor and low power requirements suitable for use in embedded processing. We evaluated several desktop and mobile GPUs and CPUs on traditional and non-traditional graphics tasks, as well as on the most time consuming pieces of a full hyperspectral imaging application. Accuracy remained high despite small differences in arithmetic operations like rounding. Performance improvements are summarized here relativemore » to a desktop Pentium 4 CPU.« less
      

      
      A new nonlinear conjugate gradient coefficient under strong Wolfe-Powell line search
      NASA Astrophysics Data System (ADS)
      Mohamed, Nur Syarafina; Mamat, Mustafa; Rivaie, Mohd
         2017-08-01
         A nonlinear conjugate gradient method (CG) plays an important role in solving a large-scale unconstrained optimization problem. This method is widely used due to its simplicity. The method is known to possess sufficient descend condition and global convergence properties. In this paper, a new nonlinear of CG coefficient βk is presented by employing the Strong Wolfe-Powell inexact line search. The new βk performance is tested based on number of iterations and central processing unit (CPU) time by using MATLAB software with Intel Core i7-3470 CPU processor. Numerical experimental results show that the new βk converge rapidly compared to other classical CG method.
      

      
      Hypermatrix scheme for finite element systems on CDC STAR-100 computer
      NASA Technical Reports Server (NTRS)
      Noor, A. K.; Voigt, S. J.
         1975-01-01
         A study is made of the adaptation of the hypermatrix (block matrix) scheme for solving large systems of finite element equations to the CDC STAR-100 computer. Discussion is focused on the organization of the hypermatrix computation using Cholesky decomposition and the mode of storage of the different submatrices to take advantage of the STAR pipeline (streaming) capability. Consideration is also given to the associated data handling problems and the means of balancing the I/Q and cpu times in the solution process. Numerical examples are presented showing anticipated gain in cpu speed over the CDC 6600 to be obtained by using the proposed algorithms on the STAR computer.
      

      
      Adaptive real-time methodology for optimizing energy-efficient computing
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Hsu, Chung-Hsing; Feng, Wu-Chun
         
         Dynamic voltage and frequency scaling (DVFS) is an effective way to reduce energy and power consumption in microprocessor units. Current implementations of DVFS suffer from inaccurate modeling of power requirements and usage, and from inaccurate characterization of the relationships between the applicable variables. A system and method is proposed that adjusts CPU frequency and voltage based on run-time calculations of the workload processing time, as well as a calculation of performance sensitivity with respect to CPU frequency. The system and method are processor independent, and can be applied to either an entire system as a unit, or individually to eachmore » process running on a system.« less
      

      
      Convolution of large 3D images on GPU and its decomposition
      NASA Astrophysics Data System (ADS)
      Karas, Pavel; Svoboda, David
         2011-12-01
         In this article, we propose a method for computing convolution of large 3D images. The convolution is performed in a frequency domain using a convolution theorem. The algorithm is accelerated on a graphic card by means of the CUDA parallel computing model. Convolution is decomposed in a frequency domain using the decimation in frequency algorithm. We pay attention to keeping our approach efficient in terms of both time and memory consumption and also in terms of memory transfers between CPU and GPU which have a significant inuence on overall computational time. We also study the implementation on multiple GPUs and compare the results between the multi-GPU and multi-CPU implementations.
      

      
      Real time display Fourier-domain OCT using multi-thread parallel computing with data vectorization
      NASA Astrophysics Data System (ADS)
      Eom, Tae Joong; Kim, Hoon Seop; Kim, Chul Min; Lee, Yeung Lak; Choi, Eun-Seo
         2011-03-01
         We demonstrate a real-time display of processed OCT images using multi-thread parallel computing with a quad-core CPU of a personal computer. The data of each A-line are treated as one vector to maximize the data translation rate between the cores of the CPU and RAM stored image data. A display rate of 29.9 frames/sec for processed OCT data (4096 FFT-size x 500 A-scans) is achieved in our system using a wavelength swept source with 52-kHz swept frequency. The data processing times of the OCT image and a Doppler OCT image with a 4-time average are 23.8 msec and 91.4 msec.
      

      
      Research on numerical control system based on S3C2410 and MCX314AL
      NASA Astrophysics Data System (ADS)
      Ren, Qiang; Jiang, Tingbiao
         2008-10-01
         With the rapid development of micro-computer technology, embedded system, CNC technology and integrated circuits, numerical control system with powerful functions can be realized by several high-speed CPU chips and RISC (Reduced Instruction Set Computing) chips which have small size and strong stability. In addition, the real-time operating system also makes the attainment of embedded system possible. Developing the NC system based on embedded technology can overcome some shortcomings of common PC-based CNC system, such as the waste of resources, low control precision, low frequency and low integration. This paper discusses a hardware platform of ENC (Embedded Numerical Control) system based on embedded processor chip ARM (Advanced RISC Machines)-S3C2410 and DSP (Digital Signal Processor)-MCX314AL and introduces the process of developing ENC system software. Finally write the MCX314AL's driver under the embedded Linux operating system. The embedded Linux operating system can deal with multitask well moreover satisfy the real-time and reliability of movement control. NC system has the advantages of best using resources and compact system with embedded technology. It provides a wealth of functions and superior performance with a lower cost. It can be sure that ENC is the direction of the future development.
      

      
      An Energy-Aware Runtime Management of Multi-Core Sensory Swarms.
      PubMed
      Kim, Sungchan; Yang, Hoeseok
         2017-08-24
         In sensory swarms, minimizing energy consumption under performance constraint is one of the key objectives. One possible approach to this problem is to monitor application workload that is subject to change at runtime, and to adjust system configuration adaptively to satisfy the performance goal. As today's sensory swarms are usually implemented using multi-core processors with adjustable clock frequency, we propose to monitor the CPU workload periodically and adjust the task-to-core allocation or clock frequency in an energy-efficient way in response to the workload variations. In doing so, we present an online heuristic that determines the most energy-efficient adjustment that satisfies the performance requirement. The proposed method is based on a simple yet effective energy model that is built upon performance prediction using IPC (instructions per cycle) measured online and power equation derived empirically. The use of IPC accounts for memory intensities of a given workload, enabling the accurate prediction of execution time. Hence, the model allows us to rapidly and accurately estimate the effect of the two control knobs, clock frequency adjustment and core allocation. The experiments show that the proposed technique delivers considerable energy saving of up to 45%compared to the state-of-the-art multi-core energy management technique.
      

      
      An Energy-Aware Runtime Management of Multi-Core Sensory Swarms
      PubMed Central
      Kim, Sungchan
         2017-01-01
         In sensory swarms, minimizing energy consumption under performance constraint is one of the key objectives. One possible approach to this problem is to monitor application workload that is subject to change at runtime, and to adjust system configuration adaptively to satisfy the performance goal. As today’s sensory swarms are usually implemented using multi-core processors with adjustable clock frequency, we propose to monitor the CPU workload periodically and adjust the task-to-core allocation or clock frequency in an energy-efficient way in response to the workload variations. In doing so, we present an online heuristic that determines the most energy-efficient adjustment that satisfies the performance requirement. The proposed method is based on a simple yet effective energy model that is built upon performance prediction using IPC (instructions per cycle) measured online and power equation derived empirically. The use of IPC accounts for memory intensities of a given workload, enabling the accurate prediction of execution time. Hence, the model allows us to rapidly and accurately estimate the effect of the two control knobs, clock frequency adjustment and core allocation. The experiments show that the proposed technique delivers considerable energy saving of up to 45%compared to the state-of-the-art multi-core energy management technique. PMID:28837094
      

      
      Acoustic reverse-time migration using GPU card and POSIX thread based on the adaptive optimal finite-difference scheme and the hybrid absorbing boundary condition
      NASA Astrophysics Data System (ADS)
      Cai, Xiaohui; Liu, Yang; Ren, Zhiming
         2018-06-01
         Reverse-time migration (RTM) is a powerful tool for imaging geologically complex structures such as steep-dip and subsalt. However, its implementation is quite computationally expensive. Recently, as a low-cost solution, the graphic processing unit (GPU) was introduced to improve the efficiency of RTM. In the paper, we develop three ameliorative strategies to implement RTM on GPU card. First, given the high accuracy and efficiency of the adaptive optimal finite-difference (FD) method based on least squares (LS) on central processing unit (CPU), we study the optimal LS-based FD method on GPU. Second, we develop the CPU-based hybrid absorbing boundary condition (ABC) to the GPU-based one by addressing two issues of the former when introduced to GPU card: time-consuming and chaotic threads. Third, for large-scale data, the combinatorial strategy for optimal checkpointing and efficient boundary storage is introduced for the trade-off between memory and recomputation. To save the time of communication between host and disk, the portable operating system interface (POSIX) thread is utilized to create the other CPU core at the checkpoints. Applications of the three strategies on GPU with the compute unified device architecture (CUDA) programming language in RTM demonstrate their efficiency and validity.
      

      
      Fast MPEG-CDVS Encoder With GPU-CPU Hybrid Computing.
      PubMed
      Duan, Ling-Yu; Sun, Wei; Zhang, Xinfeng; Wang, Shiqi; Chen, Jie; Yin, Jianxiong; See, Simon; Huang, Tiejun; Kot, Alex C; Gao, Wen
         2018-05-01
         The compact descriptors for visual search (CDVS) standard from ISO/IEC moving pictures experts group has succeeded in enabling the interoperability for efficient and effective image retrieval by standardizing the bitstream syntax of compact feature descriptors. However, the intensive computation of a CDVS encoder unfortunately hinders its widely deployment in industry for large-scale visual search. In this paper, we revisit the merits of low complexity design of CDVS core techniques and present a very fast CDVS encoder by leveraging the massive parallel execution resources of graphics processing unit (GPU). We elegantly shift the computation-intensive and parallel-friendly modules to the state-of-the-arts GPU platforms, in which the thread block allocation as well as the memory access mechanism are jointly optimized to eliminate performance loss. In addition, those operations with heavy data dependence are allocated to CPU for resolving the extra but non-necessary computation burden for GPU. Furthermore, we have demonstrated the proposed fast CDVS encoder can work well with those convolution neural network approaches which enables to leverage the advantages of GPU platforms harmoniously, and yield significant performance improvements. Comprehensive experimental results over benchmarks are evaluated, which has shown that the fast CDVS encoder using GPU-CPU hybrid computing is promising for scalable visual search.
      

        
       
          

«

18
      19
      20
   21
      22
      »

          
        

     

   

   
       
            
              
          

«

19
      20
      21
   22
      23
      »

          
        

           
           
             
               
      
      Toward GPGPU accelerated human electromechanical cardiac simulations
      PubMed Central
      Vigueras, Guillermo; Roy, Ishani; Cookson, Andrew; Lee, Jack; Smith, Nicolas; Nordsletten, David
         2014-01-01
         In this paper, we look at the acceleration of weakly coupled electromechanics using the graphics processing unit (GPU). Specifically, we port to the GPU a number of components of Heart—a CPU-based finite element code developed for simulating multi-physics problems. On the basis of a criterion of computational cost, we implemented on the GPU the ODE and PDE solution steps for the electrophysiology problem and the Jacobian and residual evaluation for the mechanics problem. Performance of the GPU implementation is then compared with single core CPU (SC) execution as well as multi-core CPU (MC) computations with equivalent theoretical performance. Results show that for a human scale left ventricle mesh, GPU acceleration of the electrophysiology problem provided speedups of 164 × compared with SC and 5.5 times compared with MC for the solution of the ODE model. Speedup of up to 72 × compared with SC and 2.6 × compared with MC was also observed for the PDE solve. Using the same human geometry, the GPU implementation of mechanics residual/Jacobian computation provided speedups of up to 44 × compared with SC and 2.0 × compared with MC. © 2013 The Authors. International Journal for Numerical Methods in Biomedical Engineering published by John Wiley & Sons, Ltd. PMID:24115492
      

      
      Fast CPU-based Monte Carlo simulation for radiotherapy dose calculation.
      PubMed
      Ziegenhein, Peter; Pirner, Sven; Ph Kamerling, Cornelis; Oelfke, Uwe
         2015-08-07
         Monte-Carlo (MC) simulations are considered to be the most accurate method for calculating dose distributions in radiotherapy. Its clinical application, however, still is limited by the long runtimes conventional implementations of MC algorithms require to deliver sufficiently accurate results on high resolution imaging data. In order to overcome this obstacle we developed the software-package PhiMC, which is capable of computing precise dose distributions in a sub-minute time-frame by leveraging the potential of modern many- and multi-core CPU-based computers. PhiMC is based on the well verified dose planning method (DPM). We could demonstrate that PhiMC delivers dose distributions which are in excellent agreement to DPM. The multi-core implementation of PhiMC scales well between different computer architectures and achieves a speed-up of up to 37[Formula: see text] compared to the original DPM code executed on a modern system. Furthermore, we could show that our CPU-based implementation on a modern workstation is between 1.25[Formula: see text] and 1.95[Formula: see text] faster than a well-known GPU implementation of the same simulation method on a NVIDIA Tesla C2050. Since CPUs work on several hundreds of GB RAM the typical GPU memory limitation does not apply for our implementation and high resolution clinical plans can be calculated.
      

      
      QR-decomposition based SENSE reconstruction using parallel architecture.
      PubMed
      Ullah, Irfan; Nisar, Habab; Raza, Haseeb; Qasim, Malik; Inam, Omair; Omer, Hammad
         2018-04-01
         Magnetic Resonance Imaging (MRI) is a powerful medical imaging technique that provides essential clinical information about the human body. One major limitation of MRI is its long scan time. Implementation of advance MRI algorithms on a parallel architecture (to exploit inherent parallelism) has a great potential to reduce the scan time. Sensitivity Encoding (SENSE) is a Parallel Magnetic Resonance Imaging (pMRI) algorithm that utilizes receiver coil sensitivities to reconstruct MR images from the acquired under-sampled k-space data. At the heart of SENSE lies inversion of a rectangular encoding matrix. This work presents a novel implementation of GPU based SENSE algorithm, which employs QR decomposition for the inversion of the rectangular encoding matrix. For a fair comparison, the performance of the proposed GPU based SENSE reconstruction is evaluated against single and multicore CPU using openMP. Several experiments against various acceleration factors (AFs) are performed using multichannel (8, 12 and 30) phantom and in-vivo human head and cardiac datasets. Experimental results show that GPU significantly reduces the computation time of SENSE reconstruction as compared to multi-core CPU (approximately 12x speedup) and single-core CPU (approximately 53x speedup) without any degradation in the quality of the reconstructed images. Copyright © 2018 Elsevier Ltd. All rights reserved.
      

      
      Scalable Metropolis Monte Carlo for simulation of hard shapes
      NASA Astrophysics Data System (ADS)
      Anderson, Joshua A.; Eric Irrgang, M.; Glotzer, Sharon C.
         2016-07-01
         We design and implement a scalable hard particle Monte Carlo simulation toolkit (HPMC), and release it open source as part of HOOMD-blue. HPMC runs in parallel on many CPUs and many GPUs using domain decomposition. We employ BVH trees instead of cell lists on the CPU for fast performance, especially with large particle size disparity, and optimize inner loops with SIMD vector intrinsics on the CPU. Our GPU kernel proposes many trial moves in parallel on a checkerboard and uses a block-level queue to redistribute work among threads and avoid divergence. HPMC supports a wide variety of shape classes, including spheres/disks, unions of spheres, convex polygons, convex spheropolygons, concave polygons, ellipsoids/ellipses, convex polyhedra, convex spheropolyhedra, spheres cut by planes, and concave polyhedra. NVT and NPT ensembles can be run in 2D or 3D triclinic boxes. Additional integration schemes permit Frenkel-Ladd free energy computations and implicit depletant simulations. In a benchmark system of a fluid of 4096 pentagons, HPMC performs 10 million sweeps in 10 min on 96 CPU cores on XSEDE Comet. The same simulation would take 7.6 h in serial. HPMC also scales to large system sizes, and the same benchmark with 16.8 million particles runs in 1.4 h on 2048 GPUs on OLCF Titan.
      

      
      Massively parallel data processing for quantitative total flow imaging with optical coherence microscopy and tomography
      NASA Astrophysics Data System (ADS)
      Sylwestrzak, Marcin; Szlag, Daniel; Marchand, Paul J.; Kumar, Ashwin S.; Lasser, Theo
         2017-08-01
         We present an application of massively parallel processing of quantitative flow measurements data acquired using spectral optical coherence microscopy (SOCM). The need for massive signal processing of these particular datasets has been a major hurdle for many applications based on SOCM. In view of this difficulty, we implemented and adapted quantitative total flow estimation algorithms on graphics processing units (GPU) and achieved a 150 fold reduction in processing time when compared to a former CPU implementation. As SOCM constitutes the microscopy counterpart to spectral optical coherence tomography (SOCT), the developed processing procedure can be applied to both imaging modalities. We present the developed DLL library integrated in MATLAB (with an example) and have included the source code for adaptations and future improvements. Catalogue identifier: AFBT_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AFBT_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: GNU GPLv3 No. of lines in distributed program, including test data, etc.: 913552 No. of bytes in distributed program, including test data, etc.: 270876249 Distribution format: tar.gz Programming language: CUDA/C, MATLAB. Computer: Intel x64 CPU, GPU supporting CUDA technology. Operating system: 64-bit Windows 7 Professional. Has the code been vectorized or parallelized?: Yes, CPU code has been vectorized in MATLAB, CUDA code has been parallelized. RAM: Dependent on users parameters, typically between several gigabytes and several tens of gigabytes Classification: 6.5, 18. Nature of problem: Speed up of data processing in optical coherence microscopy Solution method: Utilization of GPU for massively parallel data processing Additional comments: Compiled DLL library with source code and documentation, example of utilization (MATLAB script with raw data) Running time: 1,8 s for one B-scan (150 × faster in comparison to the CPU data processing time)
      

      
      Structurally divergent lysophosphatidic acid acyltransferases with high selectivity for saturated medium chain fatty acids from Cuphea seeds.
      PubMed
      Kim, Hae Jin; Silva, Jillian E; Iskandarov, Umidjon; Andersson, Mariette; Cahoon, Rebecca E; Mockaitis, Keithanne; Cahoon, Edgar B
         2015-12-01
         Lysophosphatidic acid acyltransferase (LPAT) catalyzes acylation of the sn-2 position on lysophosphatidic acid by an acyl CoA substrate to produce the phosphatidic acid precursor of polar glycerolipids and triacylglycerols (TAGs). In the case of TAGs, this reaction is typically catalyzed by an LPAT2 from microsomal LPAT class A that has high specificity for C18 fatty acids containing Δ9 unsaturation. Because of this specificity, the occurrence of saturated fatty acids in the TAG sn-2 position is infrequent in seed oils. To identify LPATs with variant substrate specificities, deep transcriptomic mining was performed on seeds of two Cuphea species producing TAGs that are highly enriched in saturated C8 and C10 fatty acids. From these analyses, cDNAs for seven previously unreported LPATs were identified, including cDNAs from Cuphea viscosissima (CvLPAT2) and Cuphea avigera var. pulcherrima (CpuLPAT2a) encoding microsomal, seed-specific class A LPAT2s and a cDNA from C. avigera var. pulcherrima (CpuLPATB) encoding a microsomal, seed-specific LPAT from the bacterial-type class B. The activities of these enzymes were characterized in Camelina sativa by seed-specific co-expression with cDNAs for various Cuphea FatB acyl-acyl carrier protein thioesterases (FatB) that produce a variety of saturated medium-chain fatty acids. CvLPAT2 and CpuLPAT2a expression resulted in accumulation of 10:0 fatty acids in the Camelina sativa TAG sn-2 position, indicating a 10:0 CoA specificity that has not been previously described for plant LPATs. CpuLPATB expression generated TAGs with 14:0 at the sn-2 position, but not 10:0. Identification of these LPATs provides tools for understanding the structural basis of LPAT substrate specificity and for generating altered oil functionalities. © 2015 The Authors The Plant Journal © 2015 John Wiley & Sons Ltd.
      

      
      Derivative free Davidon-Fletcher-Powell (DFP) for solving symmetric systems of nonlinear equations
      NASA Astrophysics Data System (ADS)
      Mamat, M.; Dauda, M. K.; Mohamed, M. A. bin; Waziri, M. Y.; Mohamad, F. S.; Abdullah, H.
         2018-03-01
         Research from the work of engineers, economist, modelling, industry, computing, and scientist are mostly nonlinear equations in nature. Numerical solution to such systems is widely applied in those areas of mathematics. Over the years, there has been significant theoretical study to develop methods for solving such systems, despite these efforts, unfortunately the methods developed do have deficiency. In a contribution to solve systems of the form F(x) = 0, x ∈ Rn , a derivative free method via the classical Davidon-Fletcher-Powell (DFP) update is presented. This is achieved by simply approximating the inverse Hessian matrix with {Q}k+1-1 to θkI. The modified method satisfied the descent condition and possess local superlinear convergence properties. Interestingly, without computing any derivative, the proposed method never fail to converge throughout the numerical experiments. The output is based on number of iterations and CPU time, different initial starting points were used on a solve 40 benchmark test problems. With the aid of the squared norm merit function and derivative-free line search technique, the approach yield a method of solving symmetric systems of nonlinear equations that is capable of significantly reducing the CPU time and number of iteration, as compared to its counterparts. A comparison between the proposed method and classical DFP update were made and found that the proposed methodis the top performer and outperformed the existing method in almost all the cases. In terms of number of iterations, out of the 40 problems solved, the proposed method solved 38 successfully, (95%) while classical DFP solved 2 problems (i.e. 05%). In terms of CPU time, the proposed method solved 29 out of the 40 problems given, (i.e.72.5%) successfully whereas classical DFP solves 11 (27.5%). The method is valid in terms of derivation, reliable in terms of number of iterations and accurate in terms of CPU time. Thus, suitable and achived the objective.
      

      
      Aripiprazole Increases the PKA Signalling and Expression of the GABAA Receptor and CREB1 in the Nucleus Accumbens of Rats.
      PubMed
      Pan, Bo; Lian, Jiamei; Huang, Xu-Feng; Deng, Chao
         2016-05-01
         The GABAA receptor is implicated in the pathophysiology of schizophrenia and regulated by PKA signalling. Current antipsychotics bind with D2-like receptors, but not the GABAA receptor. The cAMP-responsive element-binding protein 1 (CREB1) is also associated with PKA signalling and may be related to the positive symptoms of schizophrenia. This study investigated the effects of antipsychotics in modulating D2-mediated PKA signalling and its downstream GABAA receptors and CREB1. Rats were treated orally with aripiprazole (0.75 mg/kg, ter in die (t.i.d.)), bifeprunox (0.8 mg/kg, t.i.d.), haloperidol (0.1 mg/kg, t.i.d.) or vehicle for 1 week. The levels of PKA-Cα and p-PKA in the prefrontal cortex (PFC), nucleus accumbens (NAc) and caudate putamen (CPu) were detected by Western blots. The mRNA levels of Gabrb1, Gabrb2, Gabrb3 and Creb1, and their protein expression were measured by qRT-PCR and Western blots, respectively. Aripiprazole elevated the levels of p-PKA and the ratio of p-PKA/PKA in the NAc, but not the PFC and CPu. Correlated with this elevated PKA signalling, aripiprazole elevated the mRNA and protein expression of the GABAA (β-1) receptor and CREB1 in the NAc. While haloperidol elevated the levels of p-PKA and the ratio of p-PKA/PKA in both NAc and CPu, it only tended to increase the expression of the GABAA (β-1) receptor and CREB1 in the NAc, but not the CPu. Bifeprunox had no effects on PKA signalling in these brain regions. These results suggest that aripiprazole has selective effects on upregulating the GABAA (β-1) receptor and CREB1 in the NAc, probably via activating PKA signalling.
      

      
      Fast multipurpose Monte Carlo simulation for proton therapy using multi- and many-core CPU architectures
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Souris, Kevin, E-mail: kevin.souris@uclouvain.be; Lee, John Aldo; Sterpin, Edmond
         2016-04-15
         Purpose: Accuracy in proton therapy treatment planning can be improved using Monte Carlo (MC) simulations. However the long computation time of such methods hinders their use in clinical routine. This work aims to develop a fast multipurpose Monte Carlo simulation tool for proton therapy using massively parallel central processing unit (CPU) architectures. Methods: A new Monte Carlo, called MCsquare (many-core Monte Carlo), has been designed and optimized for the last generation of Intel Xeon processors and Intel Xeon Phi coprocessors. These massively parallel architectures offer the flexibility and the computational power suitable to MC methods. The class-II condensed history algorithmmore » of MCsquare provides a fast and yet accurate method of simulating heavy charged particles such as protons, deuterons, and alphas inside voxelized geometries. Hard ionizations, with energy losses above a user-specified threshold, are simulated individually while soft events are regrouped in a multiple scattering theory. Elastic and inelastic nuclear interactions are sampled from ICRU 63 differential cross sections, thereby allowing for the computation of prompt gamma emission profiles. MCsquare has been benchmarked with the GATE/GEANT4 Monte Carlo application for homogeneous and heterogeneous geometries. Results: Comparisons with GATE/GEANT4 for various geometries show deviations within 2%–1 mm. In spite of the limited memory bandwidth of the coprocessor simulation time is below 25 s for 10{sup 7} primary 200 MeV protons in average soft tissues using all Xeon Phi and CPU resources embedded in a single desktop unit. Conclusions: MCsquare exploits the flexibility of CPU architectures to provide a multipurpose MC simulation tool. Optimized code enables the use of accurate MC calculation within a reasonable computation time, adequate for clinical practice. MCsquare also simulates prompt gamma emission and can thus be used also for in vivo range verification.« less
      

      
      Widespread Increases in Malondialdehyde Immunoreactivity in Dopamine-Rich and Dopamine-Poor Regions of Rat Brain Following Multiple, High Doses of Methamphetamine
      PubMed Central
      Horner, Kristen A.; Gilbert, Yamiece E.; Cline, Susan D.
         2011-01-01
         Treatment with multiple high doses of methamphetamine (METH) can induce oxidative damage, including dopamine (DA)-mediated reactive oxygen species (ROS) formation, which may contribute to the neurotoxic damage of monoamine neurons and long-term depletion of DA in the caudate putamen (CPu) and substantia nigra pars compacta (SNpc). Malondialdehyde (MDA), a product of lipid peroxidation by ROS, is commonly used as a marker of oxidative damage and treatment with multiple high doses of METH increases MDA reactivity in the CPu of humans and experimental animals. Recent data indicate that MDA itself may contribute to the destruction of DA neurons, as MDA causes the accumulation of toxic intermediates of DA metabolism via its chemical modification of the enzymes necessary for the breakdown of DA. However, it has been shown that in human METH abusers there is also increased MDA reactivity in the frontal cortex, which receives relatively fewer DA afferents than the CPu. These data suggest that METH may induce neuronal damage regardless of the regional density of DA or origin of DA input. The goal of the current study was to examine the modification of proteins by MDA in the DA-rich nigrostriatal and mesoaccumbal systems, as well as the less DA-dense cortex and hippocampus following a neurotoxic regimen of METH treatment. Animals were treated with METH (10 mg/kg) every 2 h for 6 h, sacrificed 1 week later, and examined using immunocytochemistry for changes in MDA-adducted proteins. Multiple, high doses of METH significantly increased MDA immunoreactivity (MDA-ir) in the CPu, SNpc, cortex, and hippocampus. Multiple METH administration also increased MDA-ir in the ventral tegmental area and nucleus accumbens. Our data indicate that multiple METH treatment can induce persistent and widespread neuronal damage that may not necessarily be limited to the nigrostriatal DA system. PMID:21602916
      

      
      Toward production of jet fuel functionality in oilseeds: identification of FatB acyl-acyl carrier protein thioesterases and evaluation of combinatorial expression strategies in Camelina seeds
      PubMed Central
      Kim, Hae Jin; Silva, Jillian E.; Vu, Hieu Sy; Mockaitis, Keithanne; Nam, Jeong-Won; Cahoon, Edgar B.
         2015-01-01
         Seeds of members of the genus Cuphea accumulate medium-chain fatty acids (MCFAs; 8:0–14:0). MCFA- and palmitic acid- (16:0) rich vegetable oils have received attention for jet fuel production, given their similarity in chain length to Jet A fuel hydrocarbons. Studies were conducted to test genes, including those from Cuphea, for their ability to confer jet fuel-type fatty acid accumulation in seed oil of the emerging biofuel crop Camelina sativa. Transcriptomes from Cuphea viscosissima and Cuphea pulcherrima developing seeds that accumulate >90% of C8 and C10 fatty acids revealed three FatB cDNAs (CpuFatB3, CvFatB1, and CpuFatB4) expressed predominantly in seeds and structurally divergent from typical FatB thioesterases that release 16:0 from acyl carrier protein (ACP). Expression of CpuFatB3 and CvFatB1 resulted in Camelina oil with capric acid (10:0), and CpuFatB4 expression conferred myristic acid (14:0) production and increased 16:0. Co-expression of combinations of previously characterized Cuphea and California bay FatBs produced Camelina oils with mixtures of C8–C16 fatty acids, but amounts of each fatty acid were less than obtained by expression of individual FatB cDNAs. Increases in lauric acid (12:0) and 14:0, but not 10:0, in Camelina oil and at the sn-2 position of triacylglycerols resulted from inclusion of a coconut lysophosphatidic acid acyltransferase specialized for MCFAs. RNA interference (RNAi) suppression of Camelina β-ketoacyl-ACP synthase II, however, reduced 12:0 in seeds expressing a 12:0-ACP-specific FatB. Camelina lines presented here provide platforms for additional metabolic engineering targeting fatty acid synthase and specialized acyltransferases for achieving oils with high levels of jet fuel-type fatty acids. PMID:25969557
      

      
      hybrid\\scriptsize{{MANTIS}}: a CPU-GPU Monte Carlo method for modeling indirect x-ray detectors with columnar scintillators
      NASA Astrophysics Data System (ADS)
      Sharma, Diksha; Badal, Andreu; Badano, Aldo
         2012-04-01
         The computational modeling of medical imaging systems often requires obtaining a large number of simulated images with low statistical uncertainty which translates into prohibitive computing times. We describe a novel hybrid approach for Monte Carlo simulations that maximizes utilization of CPUs and GPUs in modern workstations. We apply the method to the modeling of indirect x-ray detectors using a new and improved version of the code \\scriptsize{{MANTIS}}, an open source software tool used for the Monte Carlo simulations of indirect x-ray imagers. We first describe a GPU implementation of the physics and geometry models in fast\\scriptsize{{DETECT}}2 (the optical transport model) and a serial CPU version of the same code. We discuss its new features like on-the-fly column geometry and columnar crosstalk in relation to the \\scriptsize{{MANTIS}} code, and point out areas where our model provides more flexibility for the modeling of realistic columnar structures in large area detectors. Second, we modify \\scriptsize{{PENELOPE}} (the open source software package that handles the x-ray and electron transport in \\scriptsize{{MANTIS}}) to allow direct output of location and energy deposited during x-ray and electron interactions occurring within the scintillator. This information is then handled by optical transport routines in fast\\scriptsize{{DETECT}}2. A load balancer dynamically allocates optical transport showers to the GPU and CPU computing cores. Our hybrid\\scriptsize{{MANTIS}} approach achieves a significant speed-up factor of 627 when compared to \\scriptsize{{MANTIS}} and of 35 when compared to the same code running only in a CPU instead of a GPU. Using hybrid\\scriptsize{{MANTIS}}, we successfully hide hours of optical transport time by running it in parallel with the x-ray and electron transport, thus shifting the computational bottleneck from optical to x-ray transport. The new code requires much less memory than \\scriptsize{{MANTIS}} and, as a result, allows us to efficiently simulate large area detectors.
      

      
      Toward production of jet fuel functionality in oilseeds: Identification of FatB acyl-acyl carrier protein thioesterases and evaluation of combinatorial expression strategies in Camelina seeds
      DOE PAGES
      Kim, Hae Jin; Silva, Jillian E.; Vu, Hieu Sy; ...
         2015-05-11
         Seeds of members of the genus Cuphea accumulate medium-chain fatty acids (MCFAs; 8:0–14:0). MCFA- and palmitic acid- (16:0) rich vegetable oils have received attention for jet fuel production, given their similarity in chain length to Jet A fuel hydrocarbons. Studies were conducted to test genes, including those from Cuphea, for their ability to confer jet fuel-type fatty acid accumulation in seed oil of the emerging biofuel crop Camelina sativa. Transcriptomes from Cuphea viscosissima and Cuphea pulcherrima developing seeds that accumulate >90% of C8 and C10 fatty acids revealed three FatB cDNAs ( CpuFatB3, CvFatB1, and CpuFatB4) expressed predominantly in seedsmore » and structurally divergent from typical FatB thioesterases that release 16:0 from acyl carrier protein (ACP). Expression of CpuFatB3 and CvFatB1 resulted in Camelina oil with capric acid (10:0), and CpuFatB4 expression conferred myristic acid (14:0) production and increased 16:0. Co-expression of combinations of previously characterized Cuphea and California bay FatBs produced Camelina oils with mixtures of C8–C16 fatty acids, but amounts of each fatty acid were less than obtained by expression of individual FatB cDNAs. Increases in lauric acid (12:0) and 14:0, but not 10:0, in Camelina oil and at the sn-2 position of triacylglycerols resulted from inclusion of a coconut lysophosphatidic acid acyltransferase specialized for MCFAs. RNA interference (RNAi) suppression of Camelina β-ketoacyl-ACP synthase II, however, reduced 12:0 in seeds expressing a 12:0-ACP-specific FatB. Here, Camelina lines presented here provide platforms for additional metabolic engineering targeting fatty acid synthase and specialized acyltransferases for achieving oils with high levels of jet fuel-type fatty acids.« less
      

      
      
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Kim, Hae Jin; Silva, Jillian E.; Vu, Hieu Sy
         
         Seeds of members of the genus Cuphea accumulate medium-chain fatty acids (MCFAs; 8:0–14:0). MCFA- and palmitic acid- (16:0) rich vegetable oils have received attention for jet fuel production, given their similarity in chain length to Jet A fuel hydrocarbons. Studies were conducted to test genes, including those from Cuphea, for their ability to confer jet fuel-type fatty acid accumulation in seed oil of the emerging biofuel crop Camelina sativa. Transcriptomes from Cuphea viscosissima and Cuphea pulcherrima developing seeds that accumulate >90% of C8 and C10 fatty acids revealed three FatB cDNAs ( CpuFatB3, CvFatB1, and CpuFatB4) expressed predominantly in seedsmore » and structurally divergent from typical FatB thioesterases that release 16:0 from acyl carrier protein (ACP). Expression of CpuFatB3 and CvFatB1 resulted in Camelina oil with capric acid (10:0), and CpuFatB4 expression conferred myristic acid (14:0) production and increased 16:0. Co-expression of combinations of previously characterized Cuphea and California bay FatBs produced Camelina oils with mixtures of C8–C16 fatty acids, but amounts of each fatty acid were less than obtained by expression of individual FatB cDNAs. Increases in lauric acid (12:0) and 14:0, but not 10:0, in Camelina oil and at the sn-2 position of triacylglycerols resulted from inclusion of a coconut lysophosphatidic acid acyltransferase specialized for MCFAs. RNA interference (RNAi) suppression of Camelina β-ketoacyl-ACP synthase II, however, reduced 12:0 in seeds expressing a 12:0-ACP-specific FatB. Here, Camelina lines presented here provide platforms for additional metabolic engineering targeting fatty acid synthase and specialized acyltransferases for achieving oils with high levels of jet fuel-type fatty acids.« less
      

      
      Implicit and explicit social mentalizing: dual processes driven by a shared neural network
      PubMed Central
      Van Overwalle, Frank; Vandekerckhove, Marie
         2013-01-01
         Recent social neuroscientific evidence indicates that implicit and explicit inferences on the mind of another person (i.e., intentions, attributions or traits), are subserved by a shared mentalizing network. Under both implicit and explicit instructions, ERP studies reveal that early inferences occur at about the same time, and fMRI studies demonstrate an overlap in core mentalizing areas, including the temporo-parietal junction (TPJ) and the medial prefrontal cortex (mPFC). These results suggest a rapid shared implicit intuition followed by a slower explicit verification processes (as revealed by additional brain activation during explicit vs. implicit inferences). These data provide support for a default-adjustment dual-process framework of social mentalizing. PMID:24062663
      

      
      Context Switching with Multiple Register Windows: A RISC Performance Study
      NASA Technical Reports Server (NTRS)
      Konsek, Marion B.; Reed, Daniel A.; Watcharawittayakul, Wittaya
         1987-01-01
         Although previous studies have shown that a large file of overlapping register windows can greatly reduce procedure call/return overhead, the effects of register windows in a multiprogramming environment are poorly understood. This paper investigates the performance of multiprogrammed, reduced instruction set computers (RISCs) as a function of window management strategy. Using an analytic model that reflects context switch and procedure call overheads, we analyze the performance of simple, linearly self-recursive programs. For more complex programs, we present the results of a simulation study. These studies show that a simple strategy that saves all windows prior to a context switch, but restores only a single window following a context switch, performs near optimally.
      

      
      Vector computer memory bank contention
      NASA Technical Reports Server (NTRS)
      Bailey, D. H.
         1985-01-01
         A number of vector supercomputers feature very large memories. Unfortunately the large capacity memory chips that are used in these computers are much slower than the fast central processing unit (CPU) circuitry. As a result, memory bank reservation times (in CPU ticks) are much longer than on previous generations of computers. A consequence of these long reservation times is that memory bank contention is sharply increased, resulting in significantly lowered performance rates. The phenomenon of memory bank contention in vector computers is analyzed using both a Markov chain model and a Monte Carlo simulation program. The results of this analysis indicate that future generations of supercomputers must either employ much faster memory chips or else feature very large numbers of independent memory banks.
      

      
      
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Li, C.; Yu, G.; Wang, K.
         
         The physical designs of the new concept reactors which have complex structure, various materials and neutronic energy spectrum, have greatly improved the requirements to the calculation methods and the corresponding computing hardware. Along with the widely used parallel algorithm, heterogeneous platforms architecture has been introduced into numerical computations in reactor physics. Because of the natural parallel characteristics, the CPU-FPGA architecture is often used to accelerate numerical computation. This paper studies the application and features of this kind of heterogeneous platforms used in numerical calculation of reactor physics through practical examples. After the designed neutron diffusion module based on CPU-FPGA architecturemore » achieves a 11.2 speed up factor, it is proved to be feasible to apply this kind of heterogeneous platform into reactor physics. (authors)« less
      

      
      Introduction of Parallel GPGPU Acceleration Algorithms for the Solution of Radiative Transfer
      NASA Technical Reports Server (NTRS)
      Godoy, William F.; Liu, Xu
         2011-01-01
         General-purpose computing on graphics processing units (GPGPU) is a recent technique that allows the parallel graphics processing unit (GPU) to accelerate calculations performed sequentially by the central processing unit (CPU). To introduce GPGPU to radiative transfer, the Gauss-Seidel solution of the well-known expressions for 1-D and 3-D homogeneous, isotropic media is selected as a test case. Different algorithms are introduced to balance memory and GPU-CPU communication, critical aspects of GPGPU. Results show that speed-ups of one to two orders of magnitude are obtained when compared to sequential solutions. The underlying value of GPGPU is its potential extension in radiative solvers (e.g., Monte Carlo, discrete ordinates) at a minimal learning curve.
      

      
      GPU Computing in Bayesian Inference of Realized Stochastic Volatility Model
      NASA Astrophysics Data System (ADS)
      Takaishi, Tetsuya
         2015-01-01
         The realized stochastic volatility (RSV) model that utilizes the realized volatility as additional information has been proposed to infer volatility of financial time series. We consider the Bayesian inference of the RSV model by the Hybrid Monte Carlo (HMC) algorithm. The HMC algorithm can be parallelized and thus performed on the GPU for speedup. The GPU code is developed with CUDA Fortran. We compare the computational time in performing the HMC algorithm on GPU (GTX 760) and CPU (Intel i7-4770 3.4GHz) and find that the GPU can be up to 17 times faster than the CPU. We also code the program with OpenACC and find that appropriate coding can achieve the similar speedup with CUDA Fortran.
      

        
       
          

«

19
      20
      21
   22
      23
      »

          
        

     

   

   
       
            
              
          

«

20
      21
      22
   23
      24
      »

          
        

           
           
             
               
      
      Parallel Scaling Characteristics of Selected NERSC User ProjectCodes
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Skinner, David; Verdier, Francesca; Anand, Harsh
         
         This report documents parallel scaling characteristics of NERSC user project codes between Fiscal Year 2003 and the first half of Fiscal Year 2004 (Oct 2002-March 2004). The codes analyzed cover 60% of all the CPU hours delivered during that time frame on seaborg, a 6080 CPU IBM SP and the largest parallel computer at NERSC. The scale in terms of concurrency and problem size of the workload is analyzed. Drawing on batch queue logs, performance data and feedback from researchers we detail the motivations, benefits, and challenges of implementing highly parallel scientific codes on current NERSC High Performance Computing systems.more » An evaluation and outlook of the NERSC workload for Allocation Year 2005 is presented.« less
      

      
      An emulator for minimizing computer resources for finite element analysis
      NASA Technical Reports Server (NTRS)
      Melosh, R.; Utku, S.; Islam, M.; Salama, M.
         1984-01-01
         A computer code, SCOPE, has been developed for predicting the computer resources required for a given analysis code, computer hardware, and structural problem. The cost of running the code is a small fraction (about 3 percent) of the cost of performing the actual analysis. However, its accuracy in predicting the CPU and I/O resources depends intrinsically on the accuracy of calibration data that must be developed once for the computer hardware and the finite element analysis code of interest. Testing of the SCOPE code on the AMDAHL 470 V/8 computer and the ELAS finite element analysis program indicated small I/O errors (3.2 percent), larger CPU errors (17.8 percent), and negligible total errors (1.5 percent).
      

      
      New Focal Plane Array Controller for the Instruments of the Subaru Telescope
      NASA Astrophysics Data System (ADS)
      Nakaya, Hidehiko; Komiyama, Yutaka; Miyazaki, Satoshi; Yamashita, Takuya; Yagi, Masafumi; Sekiguchi, Maki
         2006-03-01
         We have developed a next-generation data acquisition system, MESSIA5 (Modularized Extensible System for Image Acquisition), which comprises the digital part of a focal plane array controller. The new data acquisition system was constructed based on a 64 bit, 66 MHz PCI (peripheral component interconnect) bus architecture and runs on an x86 CPU computer with (non-real-time) Linux. The system, including the CPU board, is placed at the telescope focus, and standard gigabit Ethernet is adopted for the data transfer, as opposed to a dedicated fiber link. During the summer of 2002, we installed the new system for the first time on the Subaru prime-focus camera Suprime-Cam and successfully improved the observing performance.
      

      
      Comprehensive silicon solar cell computer modeling
      NASA Technical Reports Server (NTRS)
      Lamorte, M. F.
         1984-01-01
         The development of an efficient, comprehensive Si solar cell modeling program that has the capability of simulation accuracy of 5 percent or less is examined. A general investigation of computerized simulation is provided. Computer simulation programs are subdivided into a number of major tasks: (1) analytical method used to represent the physical system; (2) phenomena submodels that comprise the simulation of the system; (3) coding of the analysis and the phenomena submodels; (4) coding scheme that results in efficient use of the CPU so that CPU costs are low; and (5) modularized simulation program with respect to structures that may be analyzed, addition and/or modification of phenomena submodels as new experimental data become available, and the addition of other photovoltaic materials.
      

      
      Vector computer memory bank contention
      NASA Technical Reports Server (NTRS)
      Bailey, David H.
         1987-01-01
         A number of vector supercomputers feature very large memories. Unfortunately the large capacity memory chips that are used in these computers are much slower than the fast central processing unit (CPU) circuitry. As a result, memory bank reservation times (in CPU ticks) are much longer than on previous generations of computers. A consequence of these long reservation times is that memory bank contention is sharply increased, resulting in significantly lowered performance rates. The phenomenon of memory bank contention in vector computers is analyzed using both a Markov chain model and a Monte Carlo simulation program. The results of this analysis indicate that future generations of supercomputers must either employ much faster memory chips or else feature very large numbers of independent memory banks.
      

      
      ATLAS offline software performance monitoring and optimization
      NASA Astrophysics Data System (ADS)
      Chauhan, N.; Kabra, G.; Kittelmann, T.; Langenberg, R.; Mandrysch, R.; Salzburger, A.; Seuster, R.; Ritsch, E.; Stewart, G.; van Eldik, N.; Vitillo, R.; Atlas Collaboration
         2014-06-01
         In a complex multi-developer, multi-package software environment, such as the ATLAS offline framework Athena, tracking the performance of the code can be a non-trivial task in itself. In this paper we describe improvements in the instrumentation of ATLAS offline software that have given considerable insight into the performance of the code and helped to guide the optimization work. The first tool we used to instrument the code is PAPI, which is a programing interface for accessing hardware performance counters. PAPI events can count floating point operations, cycles, instructions and cache accesses. Triggering PAPI to start/stop counting for each algorithm and processed event results in a good understanding of the algorithm level performance of ATLAS code. Further data can be obtained using Pin, a dynamic binary instrumentation tool. Pin tools can be used to obtain similar statistics as PAPI, but advantageously without requiring recompilation of the code. Fine grained routine and instruction level instrumentation is also possible. Pin tools can additionally interrogate the arguments to functions, like those in linear algebra libraries, so that a detailed usage profile can be obtained. These tools have characterized the extensive use of vector and matrix operations in ATLAS tracking. Currently, CLHEP is used here, which is not an optimal choice. To help evaluate replacement libraries a testbed has been setup allowing comparison of the performance of different linear algebra libraries (including CLHEP, Eigen and SMatrix/SVector). Results are then presented via the ATLAS Performance Management Board framework, which runs daily with the current development branch of the code and monitors reconstruction and Monte-Carlo jobs. This framework analyses the CPU and memory performance of algorithms and an overview of results are presented on a web page. These tools have provided the insight necessary to plan and implement performance enhancements in ATLAS code by identifying the most common operations, with the call parameters well understood, and allowing improvements to be quantified in detail.
      

      
      I Plan Therefore I Choose: Free-Choice Bias Due to Prior Action-Probability but Not Action-Value
      PubMed Central
      Suriya-Arunroj, Lalitta; Gail, Alexander
         2015-01-01
         According to an emerging view, decision-making, and motor planning are tightly entangled at the level of neural processing. Choice is influenced not only by the values associated with different options, but also biased by other factors. Here we test the hypothesis that preliminary action planning can induce choice biases gradually and independently of objective value when planning overlaps with one of the potential action alternatives. Subjects performed center-out reaches obeying either a clockwise or counterclockwise cue-response rule in two tasks. In the probabilistic task, a pre-cue indicated the probability of each of the two potential rules to become valid. When the subsequent rule-cue unambiguously indicated which of the pre-cued rules was actually valid (instructed trials), subjects responded faster to rules pre-cued with higher probability. When subjects were allowed to choose freely between two equally rewarded rules (choice trials) they chose the originally more likely rule more often and faster, despite the lack of an objective advantage in selecting this target. In the amount task, the pre-cue indicated the amount of potential reward associated with each rule. Subjects responded faster to rules pre-cued with higher reward amount in instructed trials of the amount task, equivalent to the more likely rule in the probabilistic task. Yet, in contrast, subjects showed hardly any choice bias and no increase in response speed in favor of the original high-reward target in the choice trials of the amount task. We conclude that free-choice behavior is robustly biased when predictability encourages the planning of one of the potential responses, while prior reward expectations without action planning do not induce such strong bias. Our results provide behavioral evidence for distinct contributions of expected value and action planning in decision-making and a tight interdependence of motor planning and action selection, supporting the idea that the underlying neural mechanisms overlap. PMID:26635565
      

      
      Execution of a parallel edge-based Navier-Stokes solver on commodity graphics processor units
      NASA Astrophysics Data System (ADS)
      Corral, Roque; Gisbert, Fernando; Pueblas, Jesus
         2017-02-01
         The implementation of an edge-based three-dimensional Reynolds Average Navier-Stokes solver for unstructured grids able to run on multiple graphics processing units (GPUs) is presented. Loops over edges, which are the most time-consuming part of the solver, have been written to exploit the massively parallel capabilities of GPUs. Non-blocking communications between parallel processes and between the GPU and the central processor unit (CPU) have been used to enhance code scalability. The code is written using a mixture of C++ and OpenCL, to allow the execution of the source code on GPUs. The Message Passage Interface (MPI) library is used to allow the parallel execution of the solver on multiple GPUs. A comparative study of the solver parallel performance is carried out using a cluster of CPUs and another of GPUs. It is shown that a single GPU is up to 64 times faster than a single CPU core. The parallel scalability of the solver is mainly degraded due to the loss of computing efficiency of the GPU when the size of the case decreases. However, for large enough grid sizes, the scalability is strongly improved. A cluster featuring commodity GPUs and a high bandwidth network is ten times less costly and consumes 33% less energy than a CPU-based cluster with an equivalent computational power.
      

      
      Computers in anesthesia and intensive care: lack of evidence that the central unit serves as reservoir of pathogens.
      PubMed
      Quinzio, Lorenzo; Blazek, Michael; Hartmann, Bernd; Röhrig, Rainer; Wille, Burkhard; Junger, Axel; Hempelmann, Gunter
         2005-01-01
         Computers are becoming increasingly visible in operating rooms (OR) and intensive care units (ICU) for use in bedside documentation. Recently, they have been suspected as possibly acting as reservoirs for microorganisms and vehicles for the transfer of pathogens to patients, causing nosocomial infections. The purpose of this study was to examine the microbiological (bacteriological and mycological) contamination of the central unit of computers used in an OR, a surgical and a pediatric ICU of a tertiary teaching hospital. Sterile swab samples were taken from five sites in each of 13 computers stationed at the two ICUs and 12 computers at the OR. Sample sites within the chassis housing of the computer processing unit (CPU) included the CPU fan, ventilator, and metal casing. External sites were the ventilator and the bottom of the computer tower. Quantitative and qualitative microbiological analyses were performed according to commonly used methods. One hundred and ninety sites were cultured for bacteria and fungi. Analyses of swabs taken at five equivalent sites inside and outside the computer chassis did not find any significant-number of potentially pathogenic bacteria or fungi. This can probably be attributed to either the absence or the low number of pathogens detected on the surfaces. Microbial contamination in the CPU of OR and ICU computers is too low for designating them as a reservoir for microorganisms.
      

      
      Reduced order model based on principal component analysis for process simulation and optimization
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Lang, Y.; Malacina, A.; Biegler, L.
         2009-01-01
         It is well-known that distributed parameter computational fluid dynamics (CFD) models provide more accurate results than conventional, lumped-parameter unit operation models used in process simulation. Consequently, the use of CFD models in process/equipment co-simulation offers the potential to optimize overall plant performance with respect to complex thermal and fluid flow phenomena. Because solving CFD models is time-consuming compared to the overall process simulation, we consider the development of fast reduced order models (ROMs) based on CFD results to closely approximate the high-fidelity equipment models in the co-simulation. By considering process equipment items with complicated geometries and detailed thermodynamic property models,more » this study proposes a strategy to develop ROMs based on principal component analysis (PCA). Taking advantage of commercial process simulation and CFD software (for example, Aspen Plus and FLUENT), we are able to develop systematic CFD-based ROMs for equipment models in an efficient manner. In particular, we show that the validity of the ROM is more robust within well-sampled input domain and the CPU time is significantly reduced. Typically, it takes at most several CPU seconds to evaluate the ROM compared to several CPU hours or more to solve the CFD model. Two case studies, involving two power plant equipment examples, are described and demonstrate the benefits of using our proposed ROM methodology for process simulation and optimization.« less
      

      
      GPU-based Branchless Distance-Driven Projection and Backprojection
      PubMed Central
      Liu, Rui; Fu, Lin; De Man, Bruno; Yu, Hengyong
         2017-01-01
         Projection and backprojection operations are essential in a variety of image reconstruction and physical correction algorithms in CT. The distance-driven (DD) projection and backprojection are widely used for their highly sequential memory access pattern and low arithmetic cost. However, a typical DD implementation has an inner loop that adjusts the calculation depending on the relative position between voxel and detector cell boundaries. The irregularity of the branch behavior makes it inefficient to be implemented on massively parallel computing devices such as graphics processing units (GPUs). Such irregular branch behaviors can be eliminated by factorizing the DD operation as three branchless steps: integration, linear interpolation, and differentiation, all of which are highly amenable to massive vectorization. In this paper, we implement and evaluate a highly parallel branchless DD algorithm for 3D cone beam CT. The algorithm utilizes the texture memory and hardware interpolation on GPUs to achieve fast computational speed. The developed branchless DD algorithm achieved 137-fold speedup for forward projection and 188-fold speedup for backprojection relative to a single-thread CPU implementation. Compared with a state-of-the-art 32-thread CPU implementation, the proposed branchless DD achieved 8-fold acceleration for forward projection and 10-fold acceleration for backprojection. GPU based branchless DD method was evaluated by iterative reconstruction algorithms with both simulation and real datasets. It obtained visually identical images as the CPU reference algorithm. PMID:29333480
      

      
      GPU-based Branchless Distance-Driven Projection and Backprojection.
      PubMed
      Liu, Rui; Fu, Lin; De Man, Bruno; Yu, Hengyong
         2017-12-01
         Projection and backprojection operations are essential in a variety of image reconstruction and physical correction algorithms in CT. The distance-driven (DD) projection and backprojection are widely used for their highly sequential memory access pattern and low arithmetic cost. However, a typical DD implementation has an inner loop that adjusts the calculation depending on the relative position between voxel and detector cell boundaries. The irregularity of the branch behavior makes it inefficient to be implemented on massively parallel computing devices such as graphics processing units (GPUs). Such irregular branch behaviors can be eliminated by factorizing the DD operation as three branchless steps: integration, linear interpolation, and differentiation, all of which are highly amenable to massive vectorization. In this paper, we implement and evaluate a highly parallel branchless DD algorithm for 3D cone beam CT. The algorithm utilizes the texture memory and hardware interpolation on GPUs to achieve fast computational speed. The developed branchless DD algorithm achieved 137-fold speedup for forward projection and 188-fold speedup for backprojection relative to a single-thread CPU implementation. Compared with a state-of-the-art 32-thread CPU implementation, the proposed branchless DD achieved 8-fold acceleration for forward projection and 10-fold acceleration for backprojection. GPU based branchless DD method was evaluated by iterative reconstruction algorithms with both simulation and real datasets. It obtained visually identical images as the CPU reference algorithm.
      

      
      Establishment and progress of the chest pain unit certification process in Germany and the local experiences of Mainz.
      PubMed
      Post, Felix; Gori, Tommaso; Senges, Jochen; Giannitsis, Evangelos; Katus, Hugo; Münzel, Thomas
         2012-03-01
         The establishment of chest pain units (CPUs) in the USA and UK has led to improvements in the prognosis of patients with chest pain and myocardial infarction, optimizing access to specialized diagnostic and therapeutic facilities and reducing costs. To establish a uniform implementation of this type of service in Germany, the German Cardiac Society (DGK) founded a 'CPU task force' in 2007, which developed a set of standard requirements and a nationwide certification programme. The recommendations for minimum standard requirements were published in 2008. As of November 2011, 132 CPUs were certified and 36 units were in the certification process. The aim of the DGK is to certify as many as 250 centres (units) throughout Germany within the next 2 years, to provide nationwide coverage. Applications from Switzerland are also being filed. Public awareness campaigns in cooperation with national league soccer teams were organized to raise awareness of the importance for early diagnosis and treatment of cardiac diseases and to publicize the existence of these new facilities. The German model of CPU certification allows nationwide and prospectively European-wide standardization of patient care and to improve adherence to international guidelines. Coupled with awareness campaigns and with the launch of a German CPU Registry, this process is aimed at improving the education and treatment of patients with chest pain and to provide scientific information about the quality of patient care.
      

      
      GPU Linear Algebra Libraries and GPGPU Programming for Accelerating MOPAC Semiempirical Quantum Chemistry Calculations.
      PubMed
      Maia, Julio Daniel Carvalho; Urquiza Carvalho, Gabriel Aires; Mangueira, Carlos Peixoto; Santana, Sidney Ramos; Cabral, Lucidio Anjos Formiga; Rocha, Gerd B
         2012-09-11
         In this study, we present some modifications in the semiempirical quantum chemistry MOPAC2009 code that accelerate single-point energy calculations (1SCF) of medium-size (up to 2500 atoms) molecular systems using GPU coprocessors and multithreaded shared-memory CPUs. Our modifications consisted of using a combination of highly optimized linear algebra libraries for both CPU (LAPACK and BLAS from Intel MKL) and GPU (MAGMA and CUBLAS) to hasten time-consuming parts of MOPAC such as the pseudodiagonalization, full diagonalization, and density matrix assembling. We have shown that it is possible to obtain large speedups just by using CPU serial linear algebra libraries in the MOPAC code. As a special case, we show a speedup of up to 14 times for a methanol simulation box containing 2400 atoms and 4800 basis functions, with even greater gains in performance when using multithreaded CPUs (2.1 times in relation to the single-threaded CPU code using linear algebra libraries) and GPUs (3.8 times). This degree of acceleration opens new perspectives for modeling larger structures which appear in inorganic chemistry (such as zeolites and MOFs), biochemistry (such as polysaccharides, small proteins, and DNA fragments), and materials science (such as nanotubes and fullerenes). In addition, we believe that this parallel (GPU-GPU) MOPAC code will make it feasible to use semiempirical methods in lengthy molecular simulations using both hybrid QM/MM and QM/QM potentials.
      

      
      Simulation of Earth textures by conditional image quilting
      NASA Astrophysics Data System (ADS)
      Mahmud, K.; Mariethoz, G.; Caers, J.; Tahmasebi, P.; Baker, A.
         2014-04-01
         Training image-based approaches for stochastic simulations have recently gained attention in surface and subsurface hydrology. This family of methods allows the creation of multiple realizations of a study domain, with a spatial continuity based on a training image (TI) that contains the variability, connectivity, and structural properties deemed realistic. A major drawback of these methods is their computational and/or memory cost, making certain applications challenging. It was found that similar methods, also based on training images or exemplars, have been proposed in computer graphics. One such method, image quilting (IQ), is introduced in this paper and adapted for hydrogeological applications. The main difficulty is that Image Quilting was originally not designed to produce conditional simulations and was restricted to 2-D images. In this paper, the original method developed in computer graphics has been modified to accommodate conditioning data and 3-D problems. This new conditional image quilting method (CIQ) is patch based, does not require constructing a pattern databases, and can be used with both categorical and continuous training images. The main concept is to optimally cut the patches such that they overlap with minimum discontinuity. The optimal cut is determined using a dynamic programming algorithm. Conditioning is accomplished by prior selection of patches that are compatible with the conditioning data. The performance of CIQ is tested for a variety of hydrogeological test cases. The results, when compared with previous multiple-point statistics (MPS) methods, indicate an improvement in CPU time by a factor of at least 50.
      

      
      A Hybrid Task Graph Scheduler for High Performance Image Processing Workflows.
      PubMed
      Blattner, Timothy; Keyrouz, Walid; Bhattacharyya, Shuvra S; Halem, Milton; Brady, Mary
         2017-12-01
         Designing applications for scalability is key to improving their performance in hybrid and cluster computing. Scheduling code to utilize parallelism is difficult, particularly when dealing with data dependencies, memory management, data motion, and processor occupancy. The Hybrid Task Graph Scheduler (HTGS) improves programmer productivity when implementing hybrid workflows for multi-core and multi-GPU systems. The Hybrid Task Graph Scheduler (HTGS) is an abstract execution model, framework, and API that increases programmer productivity when implementing hybrid workflows for such systems. HTGS manages dependencies between tasks, represents CPU and GPU memories independently, overlaps computations with disk I/O and memory transfers, keeps multiple GPUs occupied, and uses all available compute resources. Through these abstractions, data motion and memory are explicit; this makes data locality decisions more accessible. To demonstrate the HTGS application program interface (API), we present implementations of two example algorithms: (1) a matrix multiplication that shows how easily task graphs can be used; and (2) a hybrid implementation of microscopy image stitching that reduces code size by ≈ 43% compared to a manually coded hybrid workflow implementation and showcases the minimal overhead of task graphs in HTGS. Both of the HTGS-based implementations show good performance. In image stitching the HTGS implementation achieves similar performance to the hybrid workflow implementation. Matrix multiplication with HTGS achieves 1.3× and 1.8× speedup over the multi-threaded OpenBLAS library for 16k × 16k and 32k × 32k size matrices, respectively.
      

      
      CPU Act
      THOMAS, 112th Congress
      Sen. Hagan, Kay R. [D-NC
         2011-10-20
         Senate - 10/20/2011 Read twice and referred to the Committee on Health, Education, Labor, and Pensions. (All Actions) Tracker: This bill has the status IntroducedHere are the steps for Status of Legislation:
      

      
      Benchmarking hardware architecture candidates for the NFIRAOS real-time controller
      NASA Astrophysics Data System (ADS)
      Smith, Malcolm; Kerley, Dan; Herriot, Glen; Véran, Jean-Pierre
         2014-07-01
         As a part of the trade study for the Narrow Field Infrared Adaptive Optics System, the adaptive optics system for the Thirty Meter Telescope, we investigated the feasibility of performing real-time control computation using a Linux operating system and Intel Xeon E5 CPUs. We also investigated a Xeon Phi based architecture which allows higher levels of parallelism. This paper summarizes both the CPU based real-time controller architecture and the Xeon Phi based RTC. The Intel Xeon E5 CPU solution meets the requirements and performs the computation for one AO cycle in an average of 767 microseconds. The Xeon Phi solution did not meet the 1200 microsecond time requirement and also suffered from unpredictable execution times. More detailed benchmark results are reported for both architectures.
      

      
      The Effect of Multigrid Parameters in a 3D Heat Diffusion Equation
      NASA Astrophysics Data System (ADS)
      Oliveira, F. De; Franco, S. R.; Pinto, M. A. Villela
         2018-02-01
         The aim of this paper is to reduce the necessary CPU time to solve the three-dimensional heat diffusion equation using Dirichlet boundary conditions. The finite difference method (FDM) is used to discretize the differential equations with a second-order accuracy central difference scheme (CDS). The algebraic equations systems are solved using the lexicographical and red-black Gauss-Seidel methods, associated with the geometric multigrid method with a correction scheme (CS) and V-cycle. Comparisons are made between two types of restriction: injection and full weighting. The used prolongation process is the trilinear interpolation. This work is concerned with the study of the influence of the smoothing value (v), number of mesh levels (L) and number of unknowns (N) on the CPU time, as well as the analysis of algorithm complexity.
      

      
      Deployment of 464XLAT (RFC6877) alongside IPv6-only CPU resources at WLCG sites
      NASA Astrophysics Data System (ADS)
      Froy, T. S.; Traynor, D. P.; Walker, C. J.
         2017-10-01
         IPv4 is now officially deprecated by the IETF. A significant amount of effort has already been expended by the HEPiX IPv6 Working Group on testing dual-stacked hosts and IPv6-only CPU resources. Dual-stack adds complexity and administrative overhead to sites that may already be starved of resource. This has resulted in a very slow uptake of IPv6 from WLCG sites. 464XLAT (RFC6877) is intended for IPv6 single-stack environments that require the ability to communicate with IPv4-only endpoints. This paper will present a deployment strategy for 464XLAT, operational experiences of using 464XLAT in production at a WLCG site and important information to consider prior to deploying 464XLAT.
      

        
       
          

«

20
      21
      22
   23
      24
      »

          
        

     

   

   
       
            
              
          

«

21
      22
      23
   24
      25
      »

          
        

           
           
             
               
      
      An incomplete assembly with thresholding algorithm for systems of reaction-diffusion equations in three space dimensions IAT for reaction-diffusion systems
      NASA Astrophysics Data System (ADS)
      Moore, Peter K.
         2003-07-01
         Solving systems of reaction-diffusion equations in three space dimensions can be prohibitively expensive both in terms of storage and CPU time. Herein, I present a new incomplete assembly procedure that is designed to reduce storage requirements. Incomplete assembly is analogous to incomplete factorization in that only a fixed number of nonzero entries are stored per row and a drop tolerance is used to discard small values. The algorithm is incorporated in a finite element method-of-lines code and tested on a set of reaction-diffusion systems. The effect of incomplete assembly on CPU time and storage and on the performance of the temporal integrator DASPK, algebraic solver GMRES and preconditioner ILUT is studied.
      

      
      Astrophysical data mining with GPU. A case study: Genetic classification of globular clusters
      NASA Astrophysics Data System (ADS)
      Cavuoti, S.; Garofalo, M.; Brescia, M.; Paolillo, M.; Pescape', A.; Longo, G.; Ventre, G.
         2014-01-01
         We present a multi-purpose genetic algorithm, designed and implemented with GPGPU/CUDA parallel computing technology. The model was derived from our CPU serial implementation, named GAME (Genetic Algorithm Model Experiment). It was successfully tested and validated on the detection of candidate Globular Clusters in deep, wide-field, single band HST images. The GPU version of GAME will be made available to the community by integrating it into the web application DAMEWARE (DAta Mining Web Application REsource, http://dame.dsf.unina.it/beta_info.html), a public data mining service specialized on massive astrophysical data. Since genetic algorithms are inherently parallel, the GPGPU computing paradigm leads to a speedup of a factor of 200× in the training phase with respect to the CPU based version.
      

      
      Development of a low-cost, unmanned surface vehicle for military applications
      NASA Astrophysics Data System (ADS)
      Cadena, A.
         2012-06-01
         This paper describes the development of an USV (Unmanned Surface Vehicle) prototype that serves as an educational platform and can be use for coastal patrol and operations in the jungle. The USV length is less than 2 m and range of 5000 m. It's composed by the following modules: propulsion, power, motor driver, CPU, sensor suite, camera system, communication and weapon system. The weapon system is formed by an experimental assault rifle and a rocket launcher with a fire control system. The assault rifle haven't got mechanical moving parts, the bullets (7.62x51mm round) are electronically ignited. The CPU is an FPGA development kit. The USV can be operate in remote mode or fully autonomous. Results of some systems from laboratory and sea trials are show.
      

      
      A Graphics Processing Unit Implementation of Coulomb Interaction in Molecular Dynamics.
      PubMed
      Jha, Prateek K; Sknepnek, Rastko; Guerrero-García, Guillermo Iván; Olvera de la Cruz, Monica
         2010-10-12
         We report a GPU implementation in HOOMD Blue of long-range electrostatic interactions based on the orientation-averaged Ewald sum scheme, introduced by Yakub and Ronchi (J. Chem. Phys. 2003, 119, 11556). The performance of the method is compared to an optimized CPU version of the traditional Ewald sum available in LAMMPS, in the molecular dynamics of electrolytes. Our GPU implementation is significantly faster than the CPU implementation of the Ewald method for small to a sizable number of particles (∼10(5)). Thermodynamic and structural properties of monovalent and divalent hydrated salts in the bulk are calculated for a wide range of ionic concentrations. An excellent agreement between the two methods was found at the level of electrostatic energy, heat capacity, radial distribution functions, and integrated charge of the electrolytes.
      

      
      Tempest: Accelerated MS/MS database search software for heterogeneous computing platforms
      PubMed Central
      Adamo, Mark E.; Gerber, Scott A.
         2017-01-01
         MS/MS database search algorithms derive a set of candidate peptide sequences from in-silico digest of a protein sequence database, and compute theoretical fragmentation patterns to match these candidates against observed MS/MS spectra. The original Tempest publication described these operations mapped to a CPU-GPU model, in which the CPU generates peptide candidates that are asynchronously sent to a discrete GPU to be scored against experimental spectra in parallel (Milloy et al., 2012). The current version of Tempest expands this model, incorporating OpenCL to offer seamless parallelization across multicore CPUs, GPUs, integrated graphics chips, and general-purpose coprocessors. Three protocols describe how to configure and run a Tempest search, including discussion of how to leverage Tempest's unique feature set to produce optimal results. PMID:27603022
      

      
      Analysis OpenMP performance of AMD and Intel architecture for breaking waves simulation using MPS
      NASA Astrophysics Data System (ADS)
      Alamsyah, M. N. A.; Utomo, A.; Gunawan, P. H.
         2018-03-01
         Simulation of breaking waves by using Navier-Stokes equation via moving particle semi-implicit method (MPS) over close domain is given. The results show the parallel computing on multicore architecture using OpenMP platform can reduce the computational time almost half of the serial time. Here, the comparison using two computer architectures (AMD and Intel) are performed. The results using Intel architecture is shown better than AMD architecture in CPU time. However, in efficiency, the computer with AMD architecture gives slightly higher than the Intel. For the simulation by 1512 number of particles, the CPU time using Intel and AMD are 12662.47 and 28282.30 respectively. Moreover, the efficiency using similar number of particles, AMD obtains 50.09 % and Intel up to 49.42 %.
      

      
      Analysis and improvements of Adaptive Particle Refinement (APR) through CPU time, accuracy and robustness considerations
      NASA Astrophysics Data System (ADS)
      Chiron, L.; Oger, G.; de Leffe, M.; Le Touzé, D.
         2018-02-01
         While smoothed-particle hydrodynamics (SPH) simulations are usually performed using uniform particle distributions, local particle refinement techniques have been developed to concentrate fine spatial resolutions in identified areas of interest. Although the formalism of this method is relatively easy to implement, its robustness at coarse/fine interfaces can be problematic. Analysis performed in [16] shows that the radius of refined particles should be greater than half the radius of unrefined particles to ensure robustness. In this article, the basics of an Adaptive Particle Refinement (APR) technique, inspired by AMR in mesh-based methods, are presented. This approach ensures robustness with alleviated constraints. Simulations applying the new formalism proposed achieve accuracy comparable to fully refined spatial resolutions, together with robustness, low CPU times and maintained parallel efficiency.
      

      
      Task-set inertia and memory-consolidation bottleneck in dual tasks.
      PubMed
      Koch, Iring; Rumiati, Raffaella I
         2006-11-01
         Three dual-task experiments examined the influence of processing a briefly presented visual object for deferred verbal report on performance in an unrelated auditory-manual reaction time (RT) task. RT was increased at short stimulus-onset asynchronies (SOAs) relative to long SOAs, showing that memory consolidation processes can produce a functional processing bottleneck in dual-task performance. In addition, the experiments manipulated the spatial compatibility of the orientation of the visual object and the side of the speeded manual response. This cross-task compatibility produced relative RT benefits only when the instruction for the visual task emphasized overlap at the level of response codes across the task sets (Experiment 1). However, once the effective task set was in place, it continued to produce cross-task compatibility effects even in single-task situations ("ignore" trials in Experiment 2) and when instructions for the visual task did not explicitly require spatial coding of object orientation (Experiment 3). Taken together, the data suggest a considerable degree of task-set inertia in dual-task performance, which is also reinforced by finding costs of switching task sequences (e.g., AC --> BC vs. BC --> BC) in Experiment 3.
      

      
      Comparing large lecture mechanics curricula using the Force Concept Inventory: A five thousand student study
      NASA Astrophysics Data System (ADS)
      Caballero, Marcos D.; Greco, Edwin F.; Murray, Eric R.; Bujak, Keith R.; Jackson Marr, M.; Catrambone, Richard; Kohlmyer, Matthew A.; Schatz, Michael F.
         2012-07-01
         The performance of over 5000 students in introductory calculus-based mechanics courses at the Georgia Institute of Technology was assessed using the Force Concept Inventory (FCI). Results from two different curricula were compared: a traditional mechanics curriculum and the Matter & Interactions (M&I) curriculum. Both were taught with similar interactive pedagogy. Post-instruction FCI averages were significantly higher for the traditional curriculum than for the M&I curriculum; the differences between curricula persist after accounting for factors such as pre-instruction FCI scores, grade point averages, and SAT scores. FCI performance on categories of items organized by concepts was also compared; traditional averages were significantly higher in each concept. We examined differences in student preparation between the curricula and found that the relative fraction of homework and lecture topics devoted to FCI force and motion concepts correlated with the observed performance differences. Concept inventories, as instruments for evaluating curricular reforms, are generally limited to the particular choice of content and goals of the instrument. Moreover, concept inventories fail to measure what are perhaps the most interesting aspects of reform: the non-overlapping content and goals that are not present in courses without reform.
      

      
      29 CFR 4901.32 - Fee schedule.
      Code of Federal Regulations, 2011 CFR
      
         2011-07-01
         ... an established agency-wide average rate for CPU operating costs and operator/programmer salaries... of duplication. (c) Other charges. The scheduled fees, set forth in paragraphs (a) and (b) of this...
      

      
      29 CFR 4901.32 - Fee schedule.
      Code of Federal Regulations, 2010 CFR
      
         2010-07-01
         ... an established agency-wide average rate for CPU operating costs and operator/programmer salaries... of duplication. (c) Other charges. The scheduled fees, set forth in paragraphs (a) and (b) of this...
      

      
      Δ9-tetrahydrocannabinol prevents methamphetamine-induced neurotoxicity.
      PubMed
      Castelli, M Paola; Madeddu, Camilla; Casti, Alberto; Casu, Angelo; Casti, Paola; Scherma, Maria; Fattore, Liana; Fadda, Paola; Ennas, M Grazia
         2014-01-01
         Methamphetamine (METH) is a potent psychostimulant with neurotoxic properties. Heavy use increases the activation of neuronal nitric oxide synthase (nNOS), production of peroxynitrites, microglia stimulation, and induces hyperthermia and anorectic effects. Most METH recreational users also consume cannabis. Preclinical studies have shown that natural (Δ9-tetrahydrocannabinol, Δ9-THC) and synthetic cannabinoid CB1 and CB2 receptor agonists exert neuroprotective effects on different models of cerebral damage. Here, we investigated the neuroprotective effect of Δ9-THC on METH-induced neurotoxicity by examining its ability to reduce astrocyte activation and nNOS overexpression in selected brain areas. Rats exposed to a METH neurotoxic regimen (4 × 10 mg/kg, 2 hours apart) were pre- or post-treated with Δ9-THC (1 or 3 mg/kg) and sacrificed 3 days after the last METH administration. Semi-quantitative immunohistochemistry was performed using antibodies against nNOS and Glial Fibrillary Acidic Protein (GFAP). Results showed that, as compared to corresponding controls (i) METH-induced nNOS overexpression in the caudate-putamen (CPu) was significantly attenuated by pre- and post-treatment with both doses of Δ9-THC (-19% and -28% for 1 mg/kg pre- and post-treated animals; -25% and -21% for 3 mg/kg pre- and post-treated animals); (ii) METH-induced GFAP-immunoreactivity (IR) was significantly reduced in the CPu by post-treatment with 1 mg/kg Δ9-THC1 (-50%) and by pre-treatment with 3 mg/kg Δ9-THC (-53%); (iii) METH-induced GFAP-IR was significantly decreased in the prefrontal cortex (PFC) by pre- and post-treatment with both doses of Δ9-THC (-34% and -47% for 1 mg/kg pre- and post-treated animals; -37% and -29% for 3 mg/kg pre- and post-treated animals). The cannabinoid CB1 receptor antagonist SR141716A attenuated METH-induced nNOS overexpression in the CPu, but failed to counteract the Δ9-THC-mediated reduction of METH-induced GFAP-IR both in the PFC and CPu. Our results indicate that Δ9-THC reduces METH-induced brain damage via inhibition of nNOS expression and astrocyte activation through CB1-dependent and independent mechanisms, respectively.
      

      
      GPU accelerated Monte-Carlo simulation of SEM images for metrology
      NASA Astrophysics Data System (ADS)
      Verduin, T.; Lokhorst, S. R.; Hagen, C. W.
         2016-03-01
         In this work we address the computation times of numerical studies in dimensional metrology. In particular, full Monte-Carlo simulation programs for scanning electron microscopy (SEM) image acquisition are known to be notoriously slow. Our quest in reducing the computation time of SEM image simulation has led us to investigate the use of graphics processing units (GPUs) for metrology. We have succeeded in creating a full Monte-Carlo simulation program for SEM images, which runs entirely on a GPU. The physical scattering models of this GPU simulator are identical to a previous CPU-based simulator, which includes the dielectric function model for inelastic scattering and also refinements for low-voltage SEM applications. As a case study for the performance, we considered the simulated exposure of a complex feature: an isolated silicon line with rough sidewalls located on a at silicon substrate. The surface of the rough feature is decomposed into 408 012 triangles. We have used an exposure dose of 6 mC/cm2, which corresponds to 6 553 600 primary electrons on average (Poisson distributed). We repeat the simulation for various primary electron energies, 300 eV, 500 eV, 800 eV, 1 keV, 3 keV and 5 keV. At first we run the simulation on a GeForce GTX480 from NVIDIA. The very same simulation is duplicated on our CPU-based program, for which we have used an Intel Xeon X5650. Apart from statistics in the simulation, no difference is found between the CPU and GPU simulated results. The GTX480 generates the images (depending on the primary electron energy) 350 to 425 times faster than a single threaded Intel X5650 CPU. Although this is a tremendous speedup, we actually have not reached the maximum throughput because of the limited amount of available memory on the GTX480. Nevertheless, the speedup enables the fast acquisition of simulated SEM images for metrology. We now have the potential to investigate case studies in CD-SEM metrology, which otherwise would take unreasonable amounts of computation time.
      

      
      Distributed GPU Computing in GIScience
      NASA Astrophysics Data System (ADS)
      Jiang, Y.; Yang, C.; Huang, Q.; Li, J.; Sun, M.
         2013-12-01
         Geoscientists strived to discover potential principles and patterns hidden inside ever-growing Big Data for scientific discoveries. To better achieve this objective, more capable computing resources are required to process, analyze and visualize Big Data (Ferreira et al., 2003; Li et al., 2013). Current CPU-based computing techniques cannot promptly meet the computing challenges caused by increasing amount of datasets from different domains, such as social media, earth observation, environmental sensing (Li et al., 2013). Meanwhile CPU-based computing resources structured as cluster or supercomputer is costly. In the past several years with GPU-based technology matured in both the capability and performance, GPU-based computing has emerged as a new computing paradigm. Compare to traditional computing microprocessor, the modern GPU, as a compelling alternative microprocessor, has outstanding high parallel processing capability with cost-effectiveness and efficiency(Owens et al., 2008), although it is initially designed for graphical rendering in visualization pipe. This presentation reports a distributed GPU computing framework for integrating GPU-based computing within distributed environment. Within this framework, 1) for each single computer, computing resources of both GPU-based and CPU-based can be fully utilized to improve the performance of visualizing and processing Big Data; 2) within a network environment, a variety of computers can be used to build up a virtual super computer to support CPU-based and GPU-based computing in distributed computing environment; 3) GPUs, as a specific graphic targeted device, are used to greatly improve the rendering efficiency in distributed geo-visualization, especially for 3D/4D visualization. Key words: Geovisualization, GIScience, Spatiotemporal Studies Reference : 1. Ferreira de Oliveira, M. C., & Levkowitz, H. (2003). From visual data exploration to visual data mining: A survey. Visualization and Computer Graphics, IEEE Transactions on, 9(3), 378-394. 2. Li, J., Jiang, Y., Yang, C., Huang, Q., & Rice, M. (2013). Visualizing 3D/4D Environmental Data Using Many-core Graphics Processing Units (GPUs) and Multi-core Central Processing Units (CPUs). Computers & Geosciences, 59(9), 78-89. 3. Owens, J. D., Houston, M., Luebke, D., Green, S., Stone, J. E., & Phillips, J. C. (2008). GPU computing. Proceedings of the IEEE, 96(5), 879-899.
      

      
      Analysis of cache for streaming tape drive
      NASA Technical Reports Server (NTRS)
      Chinnaswamy, V.
         1993-01-01
         A tape subsystem consists of a controller and a tape drive. Tapes are used for backup, data interchange, and software distribution. The backup operation is addressed. During a backup operation, data is read from disk, processed in CPU, and then sent to tape. The processing speeds of a disk subsystem, CPU, and a tape subsystem are likely to be different. A powerful CPU can read data from a fast disk, process it, and supply the data to the tape subsystem at a faster rate than the tape subsystem can handle. On the other hand, a slow disk drive and a slow CPU may not be able to supply data fast enough to keep a tape drive busy all the time. The backup process may supply data to tape drive in bursts. Each burst may be followed by an idle period. Depending on the nature of the file distribution in the disk, the input stream to the tape subsystem may vary significantly during backup. To compensate for these differences and optimize the utilization of a tape subsystem, a cache or buffer is introduced in the tape controller. Most of the tape drives today are streaming tape drives. A streaming tape drive goes into reposition when there is no data from the controller. Once the drive goes into reposition, the controller can receive data, but it cannot supply data to the tape drive until the drive completes its reposition. A controller can also receive data from the host and send data to the tape drive at the same time. The relationship of cache size, host transfer rate, drive transfer rate, reposition, and ramp up times for optimal performance of the tape subsystem are investigated. Formulas developed will also show the advantages of cache watermarks to increase the streaming time of the tape drive, maximum loss due to insufficient cache, tradeoffs between cache and reposition times and the effectiveness of cache on a streaming tape drive due to idle times or interruptions due in host transfers. Several mathematical formulas are developed to predict the performance of the tape drive. Some examples are given illustrating the usefulness of these formulas. Finally, a summary and some conclusions are provided.
      

      
      Storage element performance optimization for CMS analysis jobs
      NASA Astrophysics Data System (ADS)
      Behrmann, G.; Dahlblom, J.; Guldmyr, J.; Happonen, K.; Lindén, T.
         2012-12-01
         Tier-2 computing sites in the Worldwide Large Hadron Collider Computing Grid (WLCG) host CPU-resources (Compute Element, CE) and storage resources (Storage Element, SE). The vast amount of data that needs to processed from the Large Hadron Collider (LHC) experiments requires good and efficient use of the available resources. Having a good CPU efficiency for the end users analysis jobs requires that the performance of the storage system is able to scale with I/O requests from hundreds or even thousands of simultaneous jobs. In this presentation we report on the work on improving the SE performance at the Helsinki Institute of Physics (HIP) Tier-2 used for the Compact Muon Experiment (CMS) at the LHC. Statistics from CMS grid jobs are collected and stored in the CMS Dashboard for further analysis, which allows for easy performance monitoring by the sites and by the CMS collaboration. As part of the monitoring framework CMS uses the JobRobot which sends every four hours 100 analysis jobs to each site. CMS also uses the HammerCloud tool for site monitoring and stress testing and it has replaced the JobRobot. The performance of the analysis workflow submitted with JobRobot or HammerCloud can be used to track the performance due to site configuration changes, since the analysis workflow is kept the same for all sites and for months in time. The CPU efficiency of the JobRobot jobs at HIP was increased approximately by 50 % to more than 90 %, by tuning the SE and by improvements in the CMSSW and dCache software. The performance of the CMS analysis jobs improved significantly too. Similar work has been done on other CMS Tier-sites, since on average the CPU efficiency for CMSSW jobs has increased during 2011. Better monitoring of the SE allows faster detection of problems, so that the performance level can be kept high. The next storage upgrade at HIP consists of SAS disk enclosures which can be stress tested on demand with HammerCloud workflows, to make sure that the I/O-performance is good.
      

      
      Toward production of jet fuel functionality in oilseeds: identification of FatB acyl-acyl carrier protein thioesterases and evaluation of combinatorial expression strategies in Camelina seeds.
      PubMed
      Kim, Hae Jin; Silva, Jillian E; Vu, Hieu Sy; Mockaitis, Keithanne; Nam, Jeong-Won; Cahoon, Edgar B
         2015-07-01
         Seeds of members of the genus Cuphea accumulate medium-chain fatty acids (MCFAs; 8:0-14:0). MCFA- and palmitic acid- (16:0) rich vegetable oils have received attention for jet fuel production, given their similarity in chain length to Jet A fuel hydrocarbons. Studies were conducted to test genes, including those from Cuphea, for their ability to confer jet fuel-type fatty acid accumulation in seed oil of the emerging biofuel crop Camelina sativa. Transcriptomes from Cuphea viscosissima and Cuphea pulcherrima developing seeds that accumulate >90% of C8 and C10 fatty acids revealed three FatB cDNAs (CpuFatB3, CvFatB1, and CpuFatB4) expressed predominantly in seeds and structurally divergent from typical FatB thioesterases that release 16:0 from acyl carrier protein (ACP). Expression of CpuFatB3 and CvFatB1 resulted in Camelina oil with capric acid (10:0), and CpuFatB4 expression conferred myristic acid (14:0) production and increased 16:0. Co-expression of combinations of previously characterized Cuphea and California bay FatBs produced Camelina oils with mixtures of C8-C16 fatty acids, but amounts of each fatty acid were less than obtained by expression of individual FatB cDNAs. Increases in lauric acid (12:0) and 14:0, but not 10:0, in Camelina oil and at the sn-2 position of triacylglycerols resulted from inclusion of a coconut lysophosphatidic acid acyltransferase specialized for MCFAs. RNA interference (RNAi) suppression of Camelina β-ketoacyl-ACP synthase II, however, reduced 12:0 in seeds expressing a 12:0-ACP-specific FatB. Camelina lines presented here provide platforms for additional metabolic engineering targeting fatty acid synthase and specialized acyltransferases for achieving oils with high levels of jet fuel-type fatty acids. © The Author 2015. Published by Oxford University Press on behalf of the Society for Experimental Biology.
      

      
      CPU-GPU mixed implementation of virtual node method for real-time interactive cutting of deformable objects using OpenCL.
      PubMed
      Jia, Shiyu; Zhang, Weizhong; Yu, Xiaokang; Pan, Zhenkuan
         2015-09-01
         Surgical simulators need to simulate interactive cutting of deformable objects in real time. The goal of this work was to design an interactive cutting algorithm that eliminates traditional cutting state classification and can work simultaneously with real-time GPU-accelerated deformation without affecting its numerical stability. A modified virtual node method for cutting is proposed. Deformable object is modeled as a real tetrahedral mesh embedded in a virtual tetrahedral mesh, and the former is used for graphics rendering and collision, while the latter is used for deformation. Cutting algorithm first subdivides real tetrahedrons to eliminate all face and edge intersections, then splits faces, edges and vertices along cutting tool trajectory to form cut surfaces. Next virtual tetrahedrons containing more than one connected real tetrahedral fragments are duplicated, and connectivity between virtual tetrahedrons is updated. Finally, embedding relationship between real and virtual tetrahedral meshes is updated. Co-rotational linear finite element method is used for deformation. Cutting and collision are processed by CPU, while deformation is carried out by GPU using OpenCL. Efficiency of GPU-accelerated deformation algorithm was tested using block models with varying numbers of tetrahedrons. Effectiveness of our cutting algorithm under multiple cuts and self-intersecting cuts was tested using a block model and a cylinder model. Cutting of a more complex liver model was performed, and detailed performance characteristics of cutting, deformation and collision were measured and analyzed. Our cutting algorithm can produce continuous cut surfaces when traditional minimal element creation algorithm fails. Our GPU-accelerated deformation algorithm remains stable with constant time step under multiple arbitrary cuts and works on both NVIDIA and AMD GPUs. GPU-CPU speed ratio can be as high as 10 for models with 80,000 tetrahedrons. Forty to sixty percent real-time performance and 100-200 Hz simulation rate are achieved for the liver model with 3,101 tetrahedrons. Major bottlenecks for simulation efficiency are cutting, collision processing and CPU-GPU data transfer. Future work needs to improve on these areas.
      

      
      GPU-Accelerated Stony-Brook University 5-class Microphysics Scheme in WRF
      NASA Astrophysics Data System (ADS)
      Mielikainen, J.; Huang, B.; Huang, A.
         2011-12-01
         The Weather Research and Forecasting (WRF) model is a next-generation mesoscale numerical weather prediction system. Microphysics plays an important role in weather and climate prediction. Several bulk water microphysics schemes are available within the WRF, with different numbers of simulated hydrometeor classes and methods for estimating their size fall speeds, distributions and densities. Stony-Brook University scheme (SBU-YLIN) is a 5-class scheme with riming intensity predicted to account for mixed-phase processes. In the past few years, co-processing on Graphics Processing Units (GPUs) has been a disruptive technology in High Performance Computing (HPC). GPUs use the ever increasing transistor count for adding more processor cores. Therefore, GPUs are well suited for massively data parallel processing with high floating point arithmetic intensity. Thus, it is imperative to update legacy scientific applications to take advantage of this unprecedented increase in computing power. CUDA is an extension to the C programming language offering programming GPU's directly. It is designed so that its constructs allow for natural expression of data-level parallelism. A CUDA program is organized into two parts: a serial program running on the CPU and a CUDA kernel running on the GPU. The CUDA code consists of three computational phases: transmission of data into the global memory of the GPU, execution of the CUDA kernel, and transmission of results from the GPU into the memory of CPU. CUDA takes a bottom-up point of view of parallelism is which thread is an atomic unit of parallelism. Individual threads are part of groups called warps, within which every thread executes exactly the same sequence of instructions. To test SBU-YLIN, we used a CONtinental United States (CONUS) benchmark data set for 12 km resolution domain for October 24, 2001. A WRF domain is a geographic region of interest discretized into a 2-dimensional grid parallel to the ground. Each grid point has multiple levels, which correspond to various vertical heights in the atmosphere. The size of the CONUS 12 km domain is 433 x 308 horizontal grid points with 35 vertical levels. First, the entire SBU-YLIN Fortran code was rewritten in C in preparation of GPU accelerated version. After that, C code was verified against Fortran code for identical outputs. Default compiler options from WRF were used for gfortran and gcc compilers. The processing time for the original Fortran code is 12274 ms and 12893 ms for C version. The processing times for GPU implementation of SBU-YLIN microphysics scheme with I/O are 57.7 ms and 37.2 ms for 1 and 2 GPUs, respectively. The corresponding speedups are 213x and 330x compared to a Fortran implementation. Without I/O the speedup is 896x on 1 GPU. Obviously, ignoring I/O time speedup scales linearly with GPUs. Thus, 2 GPUs have a speedup of 1788x without I/O. Microphysics computation is just a small part of the whole WRF model. After having completely implemented WRF on GPU, the inputs for SBU-YLIN do not have to be transferred from CPU. Instead they are results of previous WRF modules. Therefore, the role of I/O is greatly diminished once all of WRF have been converted to run on GPUs. In the near future, we expect to have a WRF running completely on GPUs for a superior performance.
      

      
      Mongoose: Creation of a Rad-Hard MIPS R3000
      NASA Technical Reports Server (NTRS)
      Lincoln, Dan; Smith, Brian
         1993-01-01
         This paper describes the development of a 32 Bit, full MIPS R3000 code-compatible Rad-Hard CPU, code named Mongoose. Mongoose progressed from contract award, through the design cycle, to operational silicon in 12 months to meet a space mission for NASA. The goal was the creation of a fully static device capable of operation to the maximum Mil-883 derated speed, worst-case post-rad exposure with full operational integrity. This included consideration of features for functional enhancements relating to mission compatibility and removal of commercial practices not supported by Rad-Hard technology. 'Mongoose' developed from an evolution of LSI Logic's MIPS-I embedded processor, LR33000, code named Cobra, to its Rad-Hard 'equivalent', Mongoose. The term 'equivalent' is used to infer that the core of the processor is functionally identical, allowing the same use and optimizations of the MIPS-I Instruction Set software tool suite for compilation, software program trace, etc. This activity was started in September of 1991 under a contract from NASA-Goddard Space Flight Center (GSFC)-Flight Data Systems. The approach affected a teaming of NASA-GSFC for program development, LSI Logic for system and ASIC design coupled with the Rad-Hard process technology, and Harris (GASD) for Rad-Hard microprocessor design expertise. The program culminated with the generation of Rad-Hard Mongoose prototypes one year later.
      

        
       
          

«

21
      22
      23
   24
      25
      »

          
        

     

   

   
       
            
              
          

«

21
      22
      23
      24
   25
      »

          
        

           
           
             
               
      
      Optimizing ATLAS code with different profilers
      NASA Astrophysics Data System (ADS)
      Kama, S.; Seuster, R.; Stewart, G. A.; Vitillo, R. A.
         2014-06-01
         After the current maintenance period, the LHC will provide higher energy collisions with increased luminosity. In order to keep up with these higher rates, ATLAS software needs to speed up substantially. However, ATLAS code is composed of approximately 6M lines, written by many different programmers with different backgrounds, which makes code optimisation a challenge. To help with this effort different profiling tools and techniques are being used. These include well known tools, such as the Valgrind suite and Intel Amplifier; less common tools like Pin, PAPI, and GOoDA; as well as techniques such as library interposing. In this paper we will mainly focus on Pin tools and GOoDA. Pin is a dynamic binary instrumentation tool which can obtain statistics such as call counts, instruction counts and interrogate functions' arguments. It has been used to obtain CLHEP Matrix profiles, operations and vector sizes for linear algebra calculations which has provided the insight necessary to achieve significant performance improvements. Complimenting this, GOoDA, an in-house performance tool built in collaboration with Google, which is based on hardware performance monitoring unit events, is used to identify hot-spots in the code for different types of hardware limitations, such as CPU resources, caches, or memory bandwidth. GOoDA has been used in improvement of the performance of new magnetic field code and identification of potential vectorization targets in several places, such as Runge-Kutta propagation code.
      

      
      A flexible algorithm for calculating pair interactions on SIMD architectures
      NASA Astrophysics Data System (ADS)
      Páll, Szilárd; Hess, Berk
         2013-12-01
         Calculating interactions or correlations between pairs of particles is typically the most time-consuming task in particle simulation or correlation analysis. Straightforward implementations using a double loop over particle pairs have traditionally worked well, especially since compilers usually do a good job of unrolling the inner loop. In order to reach high performance on modern CPU and accelerator architectures, single-instruction multiple-data (SIMD) parallelization has become essential. Avoiding memory bottlenecks is also increasingly important and requires reducing the ratio of memory to arithmetic operations. Moreover, when pairs only interact within a certain cut-off distance, good SIMD utilization can only be achieved by reordering input and output data, which quickly becomes a limiting factor. Here we present an algorithm for SIMD parallelization based on grouping a fixed number of particles, e.g. 2, 4, or 8, into spatial clusters. Calculating all interactions between particles in a pair of such clusters improves data reuse compared to the traditional scheme and results in a more efficient SIMD parallelization. Adjusting the cluster size allows the algorithm to map to SIMD units of various widths. This flexibility not only enables fast and efficient implementation on current CPUs and accelerator architectures like GPUs or Intel MIC, but it also makes the algorithm future-proof. We present the algorithm with an application to molecular dynamics simulations, where we can also make use of the effective buffering the method introduces.
      

      
      SuBSENSE: a universal change detection method with local adaptive sensitivity.
      PubMed
      St-Charles, Pierre-Luc; Bilodeau, Guillaume-Alexandre; Bergevin, Robert
         2015-01-01
         Foreground/background segmentation via change detection in video sequences is often used as a stepping stone in high-level analytics and applications. Despite the wide variety of methods that have been proposed for this problem, none has been able to fully address the complex nature of dynamic scenes in real surveillance tasks. In this paper, we present a universal pixel-level segmentation method that relies on spatiotemporal binary features as well as color information to detect changes. This allows camouflaged foreground objects to be detected more easily while most illumination variations are ignored. Besides, instead of using manually set, frame-wide constants to dictate model sensitivity and adaptation speed, we use pixel-level feedback loops to dynamically adjust our method's internal parameters without user intervention. These adjustments are based on the continuous monitoring of model fidelity and local segmentation noise levels. This new approach enables us to outperform all 32 previously tested state-of-the-art methods on the 2012 and 2014 versions of the ChangeDetection.net dataset in terms of overall F-Measure. The use of local binary image descriptors for pixel-level modeling also facilitates high-speed parallel implementations: our own version, which used no low-level or architecture-specific instruction, reached real-time processing speed on a midlevel desktop CPU. A complete C++ implementation based on OpenCV is available online.
      

      
      Overlap between the neural correlates of cued recall and source memory: evidence for a generic recollection network?
      PubMed Central
      Hayama, Hiroki R.; Vilberg, Kaia L.
         2012-01-01
         Recall of a studied item and retrieval of its encoding context (source memory) both depend upon recollection of qualitative information about the study episode. The present study investigated whether recall and source memory engage overlapping neural regions. Subjects (N=18) studied a series of words which were presented either to the left or right of fixation. fMRI data were collected during a subsequent test phase in which three-letter word stems were presented, two-thirds of which could be completed by a study item. Instructions were to use each stem as a cue to recall a studied word and, when recall was successful, to indicate the word’s study location. When recall failed, the stem was to be completed with the first word to come to mind. Relative to stems for which recall failed, word stems eliciting successful recall were associated with enhanced activity in a variety of cortical regions, including bilateral parietal, posterior midline, and parahippocampal cortex. Activity in these regions was enhanced when recall was accompanied by successful rather than unsuccessful source retrieval. It is proposed that the regions form part of a ‘recollection network’ in which activity is graded according to the amount of information retrieved about a study episode. PMID:22288393
      

      
      Overlap between the neural correlates of cued recall and source memory: evidence for a generic recollection network?
      PubMed
      Hayama, Hiroki R; Vilberg, Kaia L; Rugg, Michael D
         2012-05-01
         Recall of a studied item and retrieval of its encoding context (source memory) both depend on recollection of qualitative information about the study episode. This study investigated whether recall and source memory engage overlapping neural regions. Participants (n = 18) studied a series of words, which were presented either to the left or right of fixation. fMRI data were collected during a subsequent test phase in which three-letter word-stems were presented, two thirds of which could be completed by a study item. Instructions were to use each stem as a cue to recall a studied word and, when recall was successful, to indicate the word's study location. When recall failed, the stem was to be completed with the first word to come to mind. Relative to stems for which recall failed, word-stems eliciting successful recall were associated with enhanced activity in a variety of cortical regions, including bilateral parietal, posterior midline, and parahippocampal cortex. Activity in these regions was enhanced when recall was accompanied by successful rather than unsuccessful source retrieval. It is proposed that the regions form part of a "recollection network" in which activity is graded according to the amount of information retrieved about a study episode.
      

      
      Electronic Communications: Education Via a Virtual Workshop.
      ERIC Educational Resources Information Center
      Leibensperger, Roslyn; Mehringer, Susan; Trefethen, Anne; Kalos, Malvin
         1997-01-01
         Describes a virtual workshop where participants across the United States learn by interacting with their own computers. Highlights the program's goals, audience activity, goals versus accomplishments, CPU usage, consulting, and effectiveness. (Author/VWL)
      

      
      General approach to boat simulation in virtual reality systems
      NASA Astrophysics Data System (ADS)
      Aranov, Vladislav Y.; Belyaev, Sergey Y.
         2002-02-01
         The paper is dedicated to real time simulation of sport boats, particularly a kayak and high-speed skimming boat, for training goals. This training is issue of the day, since kayaking and riding a high-speed skimming boat are both extreme sports. Participating in such types of competitions puts sportsmen into danger, particularly due to rapids, waterfalls, different water streams, and other obstacles. In order to make the simulation realistic, it is necessary to calculate data for at least 30 frames per second. These calculations may take not more than 5% CPU time, because very time-consuming 3D rendering process takes the rest - 95% CPU time. This paper describes an approach for creating minimal boat simulator models that satisfy the mentioned requirements. Besides, this approach can be used for other watercraft models of this kind.
      

      
      Enhanced round robin CPU scheduling with burst time based time quantum
      NASA Astrophysics Data System (ADS)
      Indusree, J. R.; Prabadevi, B.
         2017-11-01
         Process scheduling is a very important functionality of Operating system. The main-known process-scheduling algorithms are First Come First Serve (FCFS) algorithm, Round Robin (RR) algorithm, Priority scheduling algorithm and Shortest Job First (SJF) algorithm. Compared to its peers, Round Robin (RR) algorithm has the advantage that it gives fair share of CPU to the processes which are already in the ready-queue. The effectiveness of the RR algorithm greatly depends on chosen time quantum value. Through this research paper, we are proposing an enhanced algorithm called Enhanced Round Robin with Burst-time based Time Quantum (ERRBTQ) process scheduling algorithm which calculates time quantum as per the burst-time of processes already in ready queue. The experimental results and analysis of ERRBTQ algorithm clearly indicates the improved performance when compared with conventional RR and its variants.
      

      
      
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Edwards, Harold C.; Ibanez, Daniel Alejandro
         
         This report documents the ASC/ATDM Kokkos deliverable "Production Portable Dy- namic Task DAG Capability." This capability enables applications to create and execute a dynamic task DAG ; a collection of heterogeneous computational tasks with a directed acyclic graph (DAG) of "execute after" dependencies where tasks and their dependencies are dynamically created and destroyed as tasks execute. The Kokkos task scheduler executes the dynamic task DAG on the target execution resource; e.g. a multicore CPU, a manycore CPU such as Intel's Knights Landing (KNL), or an NVIDIA GPU. Several major technical challenges had to be addressed during development of Kokkos' Taskmore » DAG capability: (1) portability to a GPU with it's simplified hardware and micro- runtime, (2) thread-scalable memory allocation and deallocation from a bounded pool of memory, (3) thread-scalable scheduler for dynamic task DAG, (4) usability by applications.« less
      

      
      BESIII physical offline data analysis on virtualization platform
      NASA Astrophysics Data System (ADS)
      Huang, Q.; Li, H.; Kan, B.; Shi, J.; Lei, X.
         2015-12-01
         In this contribution, we present an ongoing work, which aims at benefiting BESIII computing system for higher resource utilization and more efficient job operations brought by cloud and virtualization technology with Openstack and KVM. We begin with the architecture of BESIII offline software to understand how it works. We mainly report the KVM performance evaluation and optimization from various factors in hardware and kernel. Experimental results show the CPU performance penalty of KVM can be approximately decreased to 3%. In addition, the performance comparison between KVM and physical machines in aspect of CPU, disk IO and network IO is also presented. Finally, we present our development work, an adaptive cloud scheduler, which allocates and reclaims VMs dynamically according to the status of TORQUE queue and the size of resource pool to improve resource utilization and job processing efficiency.
      

      
      A fast sequence assembly method based on compressed data structures.
      PubMed
      Liang, Peifeng; Zhang, Yancong; Lin, Kui; Hu, Jinglu
         2014-01-01
         Assembling a large genome using next generation sequencing reads requires large computer memory and a long execution time. To reduce these requirements, a memory and time efficient assembler is presented from applying FM-index in JR-Assembler, called FMJ-Assembler, where FM stand for FMR-index derived from the FM-index and BWT and J for jumping extension. The FMJ-Assembler uses expanded FM-index and BWT to compress data of reads to save memory and jumping extension method make it faster in CPU time. An extensive comparison of the FMJ-Assembler with current assemblers shows that the FMJ-Assembler achieves a better or comparable overall assembly quality and requires lower memory use and less CPU time. All these advantages of the FMJ-Assembler indicate that the FMJ-Assembler will be an efficient assembly method in next generation sequencing technology.
      

      
      Free-Space Optical Interconnect Employing VCSEL Diodes
      NASA Technical Reports Server (NTRS)
      Simons, Rainee N.; Savich, Gregory R.; Torres, Heidi
         2009-01-01
         Sensor signal processing is widely used on aircraft and spacecraft. The scheme employs multiple input/output nodes for data acquisition and CPU (central processing unit) nodes for data processing. To connect 110 nodes and CPU nodes, scalable interconnections such as backplanes are desired because the number of nodes depends on requirements of each mission. An optical backplane consisting of vertical-cavity surface-emitting lasers (VCSELs), VCSEL drivers, photodetectors, and transimpedance amplifiers is the preferred approach since it can handle several hundred megabits per second data throughput.The next generation of satellite-borne systems will require transceivers and processors that can handle several Gb/s of data. Optical interconnects have been praised for both their speed and functionality with hopes that light can relieve the electrical bottleneck predicted for the near future. Optoelectronic interconnects provide a factor of ten improvement over electrical interconnects.
      

      
      Irregular large-scale computed tomography on multiple graphics processors improves energy-efficiency metrics for industrial applications
      NASA Astrophysics Data System (ADS)
      Jimenez, Edward S.; Goodman, Eric L.; Park, Ryeojin; Orr, Laurel J.; Thompson, Kyle R.
         2014-09-01
         This paper will investigate energy-efficiency for various real-world industrial computed-tomography reconstruction algorithms, both CPU- and GPU-based implementations. This work shows that the energy required for a given reconstruction is based on performance and problem size. There are many ways to describe performance and energy efficiency, thus this work will investigate multiple metrics including performance-per-watt, energy-delay product, and energy consumption. This work found that irregular GPU-based approaches1 realized tremendous savings in energy consumption when compared to CPU implementations while also significantly improving the performance-per- watt and energy-delay product metrics. Additional energy savings and other metric improvement was realized on the GPU-based reconstructions by improving storage I/O by implementing a parallel MIMD-like modularization of the compute and I/O tasks.
      

      
      An Investigation of Unified Memory Access Performance in CUDA
      PubMed Central
      Landaverde, Raphael; Zhang, Tiansheng; Coskun, Ayse K.; Herbordt, Martin
         2015-01-01
         Managing memory between the CPU and GPU is a major challenge in GPU computing. A programming model, Unified Memory Access (UMA), has been recently introduced by Nvidia to simplify the complexities of memory management while claiming good overall performance. In this paper, we investigate this programming model and evaluate its performance and programming model simplifications based on our experimental results. We find that beyond on-demand data transfers to the CPU, the GPU is also able to request subsets of data it requires on demand. This feature allows UMA to outperform full data transfer methods for certain parallel applications and small data sizes. We also find, however, that for the majority of applications and memory access patterns, the performance overheads associated with UMA are significant, while the simplifications to the programming model restrict flexibility for adding future optimizations. PMID:26594668
      

      
      Development of an Implicit, Charge and Energy Conserving 2D Electromagnetic PIC Code on Advanced Architectures
      NASA Astrophysics Data System (ADS)
      Payne, Joshua; Taitano, William; Knoll, Dana; Liebs, Chris; Murthy, Karthik; Feltman, Nicolas; Wang, Yijie; McCarthy, Colleen; Cieren, Emanuel
         2012-10-01
         In order to solve problems such as the ion coalescence and slow MHD shocks fully kinetically we developed a fully implicit 2D energy and charge conserving electromagnetic PIC code, PlasmaApp2D. PlasmaApp2D differs from previous implicit PIC implementations in that it will utilize advanced architectures such as GPUs and shared memory CPU systems, with problems too large to fit into cache. PlasmaApp2D will be a hybrid CPU-GPU code developed primarily to run on the DARWIN cluster at LANL utilizing four 12-core AMD Opteron CPUs and two NVIDIA Tesla GPUs per node. MPI will be used for cross-node communication, OpenMP will be used for on-node parallelism, and CUDA will be used for the GPUs. Development progress and initial results will be presented.
      

      
      Implementation of ADI: Schemes on MIMD parallel computers
      NASA Technical Reports Server (NTRS)
      Vanderwijngaart, Rob F.
         1993-01-01
         In order to simulate the effects of the impingement of hot exhaust jets of High Performance Aircraft on landing surfaces a multi-disciplinary computation coupling flow dynamics to heat conduction in the runway needs to be carried out. Such simulations, which are essentially unsteady, require very large computational power in order to be completed within a reasonable time frame of the order of an hour. Such power can be furnished by the latest generation of massively parallel computers. These remove the bottleneck of ever more congested data paths to one or a few highly specialized central processing units (CPU's) by having many off-the-shelf CPU's work independently on their own data, and exchange information only when needed. During the past year the first phase of this project was completed, in which the optimal strategy for mapping an ADI-algorithm for the three dimensional unsteady heat equation to a MIMD parallel computer was identified. This was done by implementing and comparing three different domain decomposition techniques that define the tasks for the CPU's in the parallel machine. These implementations were done for a Cartesian grid and Dirichlet boundary conditions. The most promising technique was then used to implement the heat equation solver on a general curvilinear grid with a suite of nontrivial boundary conditions. Finally, this technique was also used to implement the Scalar Penta-diagonal (SP) benchmark, which was taken from the NAS Parallel Benchmarks report. All implementations were done in the programming language C on the Intel iPSC/860 computer.
      

      
      Levo-tetrahydropalmatine inhibits the acquisition of ketamine-induced conditioned place preference by regulating the expression of ERK and CREB phosphorylation in rats.
      PubMed
      Du, Yan; Du, Li; Cao, Jie; Hölscher, Christian; Feng, Yongming; Su, Hongliang; Wang, Yujin; Yun, Ke-Ming
         2017-01-15
         Levo-tetrahydropalmatine (l-THP) is an alkaloid purified from the Chinese herbs Corydalis and Stephania and has been used in many traditional Chinese herbal preparations for its sedative, analgesic and hypnotic properties. Previous studies demonstrated that l-THP has antagonistic activity on dopamine receptors; thus, it may have potential therapeutic effects on drug abuse. However, whether l-THP affects ketamine-induced conditioned place preference (CPP) remains unclear. Therefore, the present study was designed to evaluate the effects of l-THP on the rewarding behavior of ketamine through CPP. Results revealed that ketamine (5, 10 and 15mg/kg) induced CPP in rats. Furthermore, Ketamine (10mg/kg) promoted the phosphorylation of extracellular-regulated kinase (ERK) and cAMP responsive element binding protein (CREB) in the hippocampus (Hip) and caudate putamen (CPu), but not in the prefrontal cortex (PFc). l-THP (20mg/kg) co-administered with ketamine during conditioning inhibited the acquisition of ketamine-induced CPP in rats. Furthermore, l-THP (20mg/kg) prevented the enhanced phosphorylation of ERK and CREB in CPu and Hip. These results suggest that l-THP has potential therapeutic effects on ketamine-induced CPP. The underlying molecular mechanism may be related to its inhibitory effect on ERK and CREB phosphorylation in Hip and CPu. The present data supports the potential use of l-THP for the treatment of ketamine addiction. Copyright © 2016 Elsevier B.V. All rights reserved.
      

      
      Invasive treatment of NSTEMI patients in German Chest Pain Units - Evidence for a treatment paradox.
      PubMed
      Schmidt, Frank P; Schmitt, Claus; Hochadel, Matthias; Giannitsis, Evangelos; Darius, Harald; Maier, Lars S; Schmitt, Claus; Heusch, Gerd; Voigtländer, Thomas; Mudra, Harald; Gori, Tommaso; Senges, Jochen; Münzel, Thomas
         2018-03-15
         Patients with non ST-segment elevation myocardial infarction (NSTEMI) represent the largest fraction of patients with acute coronary syndrome in German Chest Pain units. Recent evidence on early vs. selective percutaneous coronary intervention (PCI) is ambiguous with respect to effects on mortality, myocardial infarction (MI) and recurrent angina. With the present study we sought to investigate the prognostic impact of PCI and its timing in German Chest Pain Unit (CPU) NSTEMI patients. Data from 1549 patients whose leading diagnosis was NSTEMI were retrieved from the German CPU registry for the interval between 3/2010 and 3/2014. Follow-up was available at median of 167days after discharge. The patients were grouped into a higher (Group A) and lower risk group (Group B) according to GRACE score and additional criteria on admission. Group A had higher Killip classes, higher BNP levels, reduced EF and significant more triple vessel disease (p<0.001). Surprisingly, patients in group A less frequently received early diagnostic catheterization and PCI. While conservative management did not affect prognosis in Group B, higher-risk CPU-NSTEMI patients without PCI had a significantly worse survival. The present results reveal a substantial treatment gap in higher-risk NSTEMI patients in German Chest Pain Units. This treatment paradox may worsen prognosis in patients who could derive the largest benefit from early revascularization. Copyright © 2017 The Authors. Published by Elsevier B.V. All rights reserved.
      

      
      Application of high-performance computing to numerical simulation of human movement
      NASA Technical Reports Server (NTRS)
      Anderson, F. C.; Ziegler, J. M.; Pandy, M. G.; Whalen, R. T.
         1995-01-01
         We have examined the feasibility of using massively-parallel and vector-processing supercomputers to solve large-scale optimization problems for human movement. Specifically, we compared the computational expense of determining the optimal controls for the single support phase of gait using a conventional serial machine (SGI Iris 4D25), a MIMD parallel machine (Intel iPSC/860), and a parallel-vector-processing machine (Cray Y-MP 8/864). With the human body modeled as a 14 degree-of-freedom linkage actuated by 46 musculotendinous units, computation of the optimal controls for gait could take up to 3 months of CPU time on the Iris. Both the Cray and the Intel are able to reduce this time to practical levels. The optimal solution for gait can be found with about 77 hours of CPU on the Cray and with about 88 hours of CPU on the Intel. Although the overall speeds of the Cray and the Intel were found to be similar, the unique capabilities of each machine are better suited to different portions of the computational algorithm used. The Intel was best suited to computing the derivatives of the performance criterion and the constraints whereas the Cray was best suited to parameter optimization of the controls. These results suggest that the ideal computer architecture for solving very large-scale optimal control problems is a hybrid system in which a vector-processing machine is integrated into the communication network of a MIMD parallel machine.
      

      
      Disk-based k-mer counting on a PC
      PubMed Central
      
         2013-01-01
         Background The k-mer counting problem, which is to build the histogram of occurrences of every k-symbol long substring in a given text, is important for many bioinformatics applications. They include developing de Bruijn graph genome assemblers, fast multiple sequence alignment and repeat detection. Results We propose a simple, yet efficient, parallel disk-based algorithm for counting k-mers. Experiments show that it usually offers the fastest solution to the considered problem, while demanding a relatively small amount of memory. In particular, it is capable of counting the statistics for short-read human genome data, in input gzipped FASTQ file, in less than 40 minutes on a PC with 16 GB of RAM and 6 CPU cores, and for long-read human genome data in less than 70 minutes. On a more powerful machine, using 32 GB of RAM and 32 CPU cores, the tasks are accomplished in less than half the time. No other algorithm for most tested settings of this problem and mammalian-size data can accomplish this task in comparable time. Our solution also belongs to memory-frugal ones; most competitive algorithms cannot efficiently work on a PC with 16 GB of memory for such massive data. Conclusions By making use of cheap disk space and exploiting CPU and I/O parallelism we propose a very competitive k-mer counting procedure, called KMC. Our results suggest that judicious resource management may allow to solve at least some bioinformatics problems with massive data on a commodity personal computer. PMID:23679007
      

        
       
          

«

21
      22
      23
      24
   25
      »

          
        

     

   

   
       
            
              
          

«

21
      22
      23
      24
      25
   »

          
        

           
           
             
               
      
      Exploring the use of I/O nodes for computation in a MIMD multiprocessor
      NASA Technical Reports Server (NTRS)
      Kotz, David; Cai, Ting
         1995-01-01
         As parallel systems move into the production scientific-computing world, the emphasis will be on cost-effective solutions that provide high throughput for a mix of applications. Cost effective solutions demand that a system make effective use of all of its resources. Many MIMD multiprocessors today, however, distinguish between 'compute' and 'I/O' nodes, the latter having attached disks and being dedicated to running the file-system server. This static division of responsibilities simplifies system management but does not necessarily lead to the best performance in workloads that need a different balance of computation and I/O. Of course, computational processes sharing a node with a file-system service may receive less CPU time, network bandwidth, and memory bandwidth than they would on a computation-only node. In this paper we begin to examine this issue experimentally. We found that high performance I/O does not necessarily require substantial CPU time, leaving plenty of time for application computation. There were some complex file-system requests, however, which left little CPU time available to the application. (The impact on network and memory bandwidth still needs to be determined.) For applications (or users) that cannot tolerate an occasional interruption, we recommend that they continue to use only compute nodes. For tolerant applications needing more cycles than those provided by the compute nodes, we recommend that they take full advantage of both compute and I/O nodes for computation, and that operating systems should make this possible.
      

      
      Evidence-Based Medicine Curriculum Improves Pediatric Emergency Fellows' Scores on In-Training Examinations.
      PubMed
      Tavarez, Melissa M; Kenkre, Tanya S; Zuckerbraun, Noel
         2017-05-30
         The aim of this study was to determine if implementation of our evidence-based medicine (EBM) curriculum had an effect on pediatric emergency medicine fellows' scores on the relevant section of the in-training examination (ITE). We obtained deidentified subscores for 22 fellows over 6 academic years for the Core Knowledge in Scholarly Activities (SA) and, as a balance measure, Emergencies Treated Medically sections. We divided the subscores into the following 3 instruction periods: "baseline" for academic years before our current EBM curriculum, "transition" for academic years with use of a research method curriculum with some overlapping EBM content, and "EBM" for academic years with our current EBM curriculum. We analyzed data using the Kruskal-Wallis test, the Mann-Whitney U test, and multivariate mixed-effects linear models. The SA subscore median was higher during the EBM period in comparison with the baseline and transition periods. In contrast, the Emergencies Treated Medically subscore median was similar across instruction periods. Multivariate modeling demonstrated that our EBM curriculum had the following independent effects on the fellows' SA subscore: (1) in comparison with the transition period, the fellows' SA subscore was 21 percentage points higher (P = 0.005); and (2) in comparison to the baseline period, the fellows' SA subscore was 28 percentage points higher during the EBM curriculum instruction period (P < 0.001). Our EBM curriculum was associated with significantly higher scores on the SA section of the ITE. Pediatric emergency medicine educators could consider using fellows' scores on this section of the ITE to assess the effect of their EBM curricula.
      

      
      When global rule reversal meets local task switching: The neural mechanisms of coordinated behavioral adaptation to instructed multi-level demand changes.
      PubMed
      Shi, Yiquan; Wolfensteller, Uta; Schubert, Torsten; Ruge, Hannes
         2018-02-01
         Cognitive flexibility is essential to cope with changing task demands and often it is necessary to adapt to combined changes in a coordinated manner. The present fMRI study examined how the brain implements such multi-level adaptation processes. Specifically, on a "local," hierarchically lower level, switching between two tasks was required across trials while the rules of each task remained unchanged for blocks of trials. On a "global" level regarding blocks of twelve trials, the task rules could reverse or remain the same. The current task was cued at the start of each trial while the current task rules were instructed before the start of a new block. We found that partly overlapping and partly segregated neural networks play different roles when coping with the combination of global rule reversal and local task switching. The fronto-parietal control network (FPN) supported the encoding of reversed rules at the time of explicit rule instruction. The same regions subsequently supported local task switching processes during actual implementation trials, irrespective of rule reversal condition. By contrast, a cortico-striatal network (CSN) including supplementary motor area and putamen was increasingly engaged across implementation trials and more so for rule reversal than for nonreversal blocks, irrespective of task switching condition. Together, these findings suggest that the brain accomplishes the coordinated adaptation to multi-level demand changes by distributing processing resources either across time (FPN for reversed rule encoding and later for task switching) or across regions (CSN for reversed rule implementation and FPN for concurrent task switching). © 2017 Wiley Periodicals, Inc.
      

      
      47 CFR 15.32 - Test procedures for CPU boards and computer power supplies.
      Code of Federal Regulations, 2011 CFR
      
         2011-10-01
         ... result in a complete personal computer system. If the oscillator and the microprocessor circuits are... microprocessor circuits are contained on separate circuit boards, both boards, typical of the combination that...
      

      
      47 CFR 15.32 - Test procedures for CPU boards and computer power supplies.
      Code of Federal Regulations, 2013 CFR
      
         2013-10-01
         ... result in a complete personal computer system. If the oscillator and the microprocessor circuits are... microprocessor circuits are contained on separate circuit boards, both boards, typical of the combination that...
      

      
      47 CFR 15.32 - Test procedures for CPU boards and computer power supplies.
      Code of Federal Regulations, 2014 CFR
      
         2014-10-01
         ... result in a complete personal computer system. If the oscillator and the microprocessor circuits are... microprocessor circuits are contained on separate circuit boards, both boards, typical of the combination that...
      

      
      47 CFR 15.32 - Test procedures for CPU boards and computer power supplies.
      Code of Federal Regulations, 2012 CFR
      
         2012-10-01
         ... result in a complete personal computer system. If the oscillator and the microprocessor circuits are... microprocessor circuits are contained on separate circuit boards, both boards, typical of the combination that...
      

      
      47 CFR 15.32 - Test procedures for CPU boards and computer power supplies.
      Code of Federal Regulations, 2010 CFR
      
         2010-10-01
         ... result in a complete personal computer system. If the oscillator and the microprocessor circuits are... microprocessor circuits are contained on separate circuit boards, both boards, typical of the combination that...
      

      
      A filtering method to generate high quality short reads using illumina paired-end technology.
      PubMed
      Eren, A Murat; Vineis, Joseph H; Morrison, Hilary G; Sogin, Mitchell L
         2013-01-01
         Consensus between independent reads improves the accuracy of genome and transcriptome analyses, however lack of consensus between very similar sequences in metagenomic studies can and often does represent natural variation of biological significance. The common use of machine-assigned quality scores on next generation platforms does not necessarily correlate with accuracy. Here, we describe using the overlap of paired-end, short sequence reads to identify error-prone reads in marker gene analyses and their contribution to spurious OTUs following clustering analysis using QIIME. Our approach can also reduce error in shotgun sequencing data generated from libraries with small, tightly constrained insert sizes. The open-source implementation of this algorithm in Python programming language with user instructions can be obtained from https://github.com/meren/illumina-utils.
      

      
      The BRAIN Initiative Provides a Unifying Context for Integrating Core STEM Competencies into a Neurobiology Course.
      PubMed
      Schaefer, Jennifer E
         2016-01-01
         The Brain Research through Advancing Innovative Neurotechnologies (BRAIN) Initiative introduced by the Obama Administration in 2013 presents a context for integrating many STEM competencies into undergraduate neuroscience coursework. The BRAIN Initiative core principles overlap with core STEM competencies identified by the AAAS Vision and Change report and other entities. This neurobiology course utilizes the BRAIN Initiative to serve as the unifying theme that facilitates a primary emphasis on student competencies such as scientific process, scientific communication, and societal relevance while teaching foundational neurobiological content such as brain anatomy, cellular neurophysiology, and activity modulation. Student feedback indicates that the BRAIN Initiative is an engaging and instructional context for this course. Course module organization, suitable BRAIN Initiative commentary literature, sample primary literature, and important assignments are presented.
      

      
      From experiment to design -- Fault characterization and detection in parallel computer systems using computational accelerators
      NASA Astrophysics Data System (ADS)
      Yim, Keun Soo
         
         This dissertation summarizes experimental validation and co-design studies conducted to optimize the fault detection capabilities and overheads in hybrid computer systems (e.g., using CPUs and Graphics Processing Units, or GPUs), and consequently to improve the scalability of parallel computer systems using computational accelerators. The experimental validation studies were conducted to help us understand the failure characteristics of CPU-GPU hybrid computer systems under various types of hardware faults. The main characterization targets were faults that are difficult to detect and/or recover from, e.g., faults that cause long latency failures (Ch. 3), faults in dynamically allocated resources (Ch. 4), faults in GPUs (Ch. 5), faults in MPI programs (Ch. 6), and microarchitecture-level faults with specific timing features (Ch. 7). The co-design studies were based on the characterization results. One of the co-designed systems has a set of source-to-source translators that customize and strategically place error detectors in the source code of target GPU programs (Ch. 5). Another co-designed system uses an extension card to learn the normal behavioral and semantic execution patterns of message-passing processes executing on CPUs, and to detect abnormal behaviors of those parallel processes (Ch. 6). The third co-designed system is a co-processor that has a set of new instructions in order to support software-implemented fault detection techniques (Ch. 7). The work described in this dissertation gains more importance because heterogeneous processors have become an essential component of state-of-the-art supercomputers. GPUs were used in three of the five fastest supercomputers that were operating in 2011. Our work included comprehensive fault characterization studies in CPU-GPU hybrid computers. In CPUs, we monitored the target systems for a long period of time after injecting faults (a temporally comprehensive experiment), and injected faults into various types of program states that included dynamically allocated memory (to be spatially comprehensive). In GPUs, we used fault injection studies to demonstrate the importance of detecting silent data corruption (SDC) errors that are mainly due to the lack of fine-grained protections and the massive use of fault-insensitive data. This dissertation also presents transparent fault tolerance frameworks and techniques that are directly applicable to hybrid computers built using only commercial off-the-shelf hardware components. This dissertation shows that by developing understanding of the failure characteristics and error propagation paths of target programs, we were able to create fault tolerance frameworks and techniques that can quickly detect and recover from hardware faults with low performance and hardware overheads.
      

      
      Simple Interval Timers for Microcomputers.
      ERIC Educational Resources Information Center
      McInerney, M.; Burgess, G.
         1985-01-01
         Discusses simple interval timers for microcomputers, including (1) the Jiffy clock; (2) CPU count timers; (3) screen count timers; (4) light pen timers; and (5) chip timers. Also examines some of the general characteristics of all types of timers. (JN)
      

      
      Ice-sheet modelling accelerated by graphics cards
      NASA Astrophysics Data System (ADS)
      Brædstrup, Christian Fredborg; Damsgaard, Anders; Egholm, David Lundbek
         2014-11-01
         Studies of glaciers and ice sheets have increased the demand for high performance numerical ice flow models over the past decades. When exploring the highly non-linear dynamics of fast flowing glaciers and ice streams, or when coupling multiple flow processes for ice, water, and sediment, researchers are often forced to use super-computing clusters. As an alternative to conventional high-performance computing hardware, the Graphical Processing Unit (GPU) is capable of massively parallel computing while retaining a compact design and low cost. In this study, we present a strategy for accelerating a higher-order ice flow model using a GPU. By applying the newest GPU hardware, we achieve up to 180× speedup compared to a similar but serial CPU implementation. Our results suggest that GPU acceleration is a competitive option for ice-flow modelling when compared to CPU-optimised algorithms parallelised by the OpenMP or Message Passing Interface (MPI) protocols.
      

      
      Gpu Implementation of a Viscous Flow Solver on Unstructured Grids
      NASA Astrophysics Data System (ADS)
      Xu, Tianhao; Chen, Long
         2016-06-01
         Graphics processing units have gained popularities in scientific computing over past several years due to their outstanding parallel computing capability. Computational fluid dynamics applications involve large amounts of calculations, therefore a latest GPU card is preferable of which the peak computing performance and memory bandwidth are much better than a contemporary high-end CPU. We herein focus on the detailed implementation of our GPU targeting Reynolds-averaged Navier-Stokes equations solver based on finite-volume method. The solver employs a vertex-centered scheme on unstructured grids for the sake of being capable of handling complex topologies. Multiple optimizations are carried out to improve the memory accessing performance and kernel utilization. Both steady and unsteady flow simulation cases are carried out using explicit Runge-Kutta scheme. The solver with GPU acceleration in this paper is demonstrated to have competitive advantages over the CPU targeting one.
      

      
      A Study of Quality of Service Communication for High-Speed Packet-Switching Computer Sub-Networks
      NASA Technical Reports Server (NTRS)
      Cui, Zhenqian
         1999-01-01
         In this thesis, we analyze various factors that affect quality of service (QoS) communication in high-speed, packet-switching sub-networks. We hypothesize that sub-network-wide bandwidth reservation and guaranteed CPU processing power at endpoint systems for handling data traffic are indispensable to achieving hard end-to-end quality of service. Different bandwidth reservation strategies, traffic characterization schemes, and scheduling algorithms affect the network resources and CPU usage as well as the extent that QoS can be achieved. In order to analyze those factors, we design and implement a communication layer. Our experimental analysis supports our research hypothesis. The Resource ReSerVation Protocol (RSVP) is designed to realize resource reservation. Our analysis of RSVP shows that using RSVP solely is insufficient to provide hard end-to-end quality of service in a high-speed sub-network. Analysis of the IEEE 802.lp protocol also supports the research hypothesis.
      

      
      Exploring compression techniques for ROOT IO
      NASA Astrophysics Data System (ADS)
      Zhang, Z.; Bockelman, B.
         2017-10-01
         ROOT provides an flexible format used throughout the HEP community. The number of use cases - from an archival data format to end-stage analysis - has required a number of tradeoffs to be exposed to the user. For example, a high “compression level” in the traditional DEFLATE algorithm will result in a smaller file (saving disk space) at the cost of slower decompression (costing CPU time when read). At the scale of the LHC experiment, poor design choices can result in terabytes of wasted space or wasted CPU time. We explore and attempt to quantify some of these tradeoffs. Specifically, we explore: the use of alternate compressing algorithms to optimize for read performance; an alternate method of compressing individual events to allow efficient random access; and a new approach to whole-file compression. Quantitative results are given, as well as guidance on how to make compression decisions for different use cases.
      

      
      Near-realtime simulations of biolelectric activity in small mammalian hearts using graphical processing units
      PubMed Central
      Vigmond, Edward J.; Boyle, Patrick M.; Leon, L. Joshua; Plank, Gernot
         2014-01-01
         Simulations of cardiac bioelectric phenomena remain a significant challenge despite continual advancements in computational machinery. Spanning large temporal and spatial ranges demands millions of nodes to accurately depict geometry, and a comparable number of timesteps to capture dynamics. This study explores a new hardware computing paradigm, the graphics processing unit (GPU), to accelerate cardiac models, and analyzes results in the context of simulating a small mammalian heart in real time. The ODEs associated with membrane ionic flow were computed on traditional CPU and compared to GPU performance, for one to four parallel processing units. The scalability of solving the PDE responsible for tissue coupling was examined on a cluster using up to 128 cores. Results indicate that the GPU implementation was between 9 and 17 times faster than the CPU implementation and scaled similarly. Solving the PDE was still 160 times slower than real time. PMID:19964295
      

      
      Simulating electron wave dynamics in graphene superlattices exploiting parallel processing advantages
      NASA Astrophysics Data System (ADS)
      Rodrigues, Manuel J.; Fernandes, David E.; Silveirinha, Mário G.; Falcão, Gabriel
         2018-01-01
         This work introduces a parallel computing framework to characterize the propagation of electron waves in graphene-based nanostructures. The electron wave dynamics is modeled using both "microscopic" and effective medium formalisms and the numerical solution of the two-dimensional massless Dirac equation is determined using a Finite-Difference Time-Domain scheme. The propagation of electron waves in graphene superlattices with localized scattering centers is studied, and the role of the symmetry of the microscopic potential in the electron velocity is discussed. The computational methodologies target the parallel capabilities of heterogeneous multi-core CPU and multi-GPU environments and are built with the OpenCL parallel programming framework which provides a portable, vendor agnostic and high throughput-performance solution. The proposed heterogeneous multi-GPU implementation achieves speedup ratios up to 75x when compared to multi-thread and multi-core CPU execution, reducing simulation times from several hours to a couple of minutes.
      

      
      High-speed zero-copy data transfer for DAQ applications
      NASA Astrophysics Data System (ADS)
      Pisani, Flavio; Cámpora Pérez, Daniel Hugo; Neufeld, Niko
         2015-05-01
         The LHCb Data Acquisition (DAQ) will be upgraded in 2020 to a trigger-free readout. In order to achieve this goal we will need to connect around 500 nodes with a total network capacity of 32 Tb/s. To get such an high network capacity we are testing zero-copy technology in order to maximize the theoretical link throughput without adding excessive CPU and memory bandwidth overhead, leaving free resources for data processing resulting in less power, space and money used for the same result. We develop a modular test application which can be used with different transport layers. For the zero-copy implementation we choose the OFED IBVerbs API because it can provide low level access and high throughput. We present throughput and CPU usage measurements of 40 GbE solutions using Remote Direct Memory Access (RDMA), for several network configurations to test the scalability of the system.
      

      
      Fast data reconstructed method of Fourier transform imaging spectrometer based on multi-core CPU
      NASA Astrophysics Data System (ADS)
      Yu, Chunchao; Du, Debiao; Xia, Zongze; Song, Li; Zheng, Weijian; Yan, Min; Lei, Zhenggang
         2017-10-01
         Imaging spectrometer can gain two-dimensional space image and one-dimensional spectrum at the same time, which shows high utility in color and spectral measurements, the true color image synthesis, military reconnaissance and so on. In order to realize the fast reconstructed processing of the Fourier transform imaging spectrometer data, the paper designed the optimization reconstructed algorithm with OpenMP parallel calculating technology, which was further used for the optimization process for the HyperSpectral Imager of `HJ-1' Chinese satellite. The results show that the method based on multi-core parallel computing technology can control the multi-core CPU hardware resources competently and significantly enhance the calculation of the spectrum reconstruction processing efficiency. If the technology is applied to more cores workstation in parallel computing, it will be possible to complete Fourier transform imaging spectrometer real-time data processing with a single computer.
      

        
       
          

«

21
      22
      23
      24
      25
   »

          
        

     

   

   Some links on this page may take you to non-federal websites. Their policies may differ from this site.