Sample records for "provide additional CPU"

  1. Comprehensive silicon solar cell computer modeling

    NASA Technical Reports Server (NTRS)

    Lamorte, M. F.

    1984-01-01

    The development of an efficient, comprehensive Si solar cell modeling program that has the capability of simulation accuracy of 5 percent or less is examined. A general investigation of computerized simulation is provided. Computer simulation programs are subdivided into a number of major tasks: (1) analytical method used to represent the physical system; (2) phenomena submodels that comprise the simulation of the system; (3) coding of the analysis and the phenomena submodels; (4) coding scheme that results in efficient use of the CPU so that CPU costs are low; and (5) modularized simulation program with respect to structures that may be analyzed, addition and/or modification of phenomena submodels as new experimental data become available, and the addition of other photovoltaic materials.

  2. A report documenting the completion of the Los Alamos National Laboratory portion of the ASC Level II milestone "Visualization on the supercomputing platform"

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ahrens, James P; Patchett, John M; Lo, Li-Ta

    2011-01-24

    This report provides documentation for the completion of the Los Alamos portion of the ASC Level II 'Visualization on the Supercomputing Platform' milestone. This ASC Level II milestone is a joint milestone between Sandia National Laboratory and Los Alamos National Laboratory. The milestone text is shown in Figure 1 with the Los Alamos portions highlighted in boldfaced text. Visualization and analysis of petascale data is limited by several factors which must be addressed as ACES delivers the Cielo platform. Two primary difficulties are: (1) Performance of interactive rendering, which is the most computationally intensive portion of the visualization process. For terascale platforms, commodity clusters with graphics processors (GPUs) have been used for interactive rendering. For petascale platforms, visualization and rendering may be able to run efficiently on the supercomputer platform itself. (2) I/O bandwidth, which limits how much information can be written to disk. If we simply analyze the sparse information that is saved to disk we miss the opportunity to analyze the rich information produced every timestep by the simulation. For the first issue, we are pursuing in-situ analysis, in which simulations are coupled directly with analysis libraries at runtime. This milestone will evaluate the visualization and rendering performance of current and next-generation supercomputers in contrast to GPU-based visualization clusters, and evaluate the performance of common analysis libraries coupled with the simulation that analyze and write data to disk during a running simulation. This milestone will explore, evaluate and advance the maturity level of these technologies and their applicability to problems of interest to the ASC program. In conclusion, we improved CPU-based rendering performance by a factor of 2-10 on our tests. In addition, we evaluated CPU- and GPU-based rendering performance. We encourage production visualization experts to consider using CPU-based rendering solutions when it is appropriate. For example, on remote supercomputers CPU-based rendering can offer a means of viewing data without having to offload the data or geometry onto a GPU-based visualization system. In terms of comparative performance of the CPU and GPU, we believe that further optimizations of the performance of both CPU- and GPU-based rendering are possible. The simulation community is currently confronting this reality as they work to port their simulations to different hardware architectures. What is interesting about CPU rendering of massive datasets is that for the past two decades GPU performance has significantly outperformed CPU-based systems. Based on our advancements, evaluations and explorations we believe that CPU-based rendering has returned as one viable option for the visualization of massive datasets.
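
    The in-situ coupling idea described above can be illustrated with a minimal sketch (hypothetical interface, not the milestone's actual analysis library): the solver hands each timestep's full-resolution field to an analysis callback instead of writing every timestep to disk.

```python
# Minimal in-situ sketch (hypothetical interface, not the milestone's code):
# the solver hands each timestep's full field to an analysis callback, so the
# analysis sees every timestep even though little is ever written to disk.
import numpy as np

def run_simulation(steps, analyze):
    rng = np.random.default_rng(0)
    field = np.zeros((128, 128))
    for t in range(steps):
        field += rng.normal(0.0, 1.0, field.shape)   # stand-in for a solver update
        analyze(t, field)                             # in-situ analysis hook

summaries = []
run_simulation(10, lambda t, f: summaries.append((t, float(f.mean()), float(f.max()))))
print(summaries[-1])   # statistics of the final timestep
```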

  3. Shadow: Running Tor in a Box for Accurate and Efficient Experimentation

    DTIC Science & Technology

    2011-09-23

    Modeling the speed of a target CPU is done by running an OpenSSL [31] speed test on a real CPU of that type. This provides us with the raw CPU processing...rate, but we are also interested in the processing speed of an application. By running application benchmarks on the same CPU as the OpenSSL speed test...simulation, saving CPU cycles on our simulation host machine. Shadow removes cryptographic processing by preloading the main OpenSSL [31] functions used

  4. General-purpose interface bus for multiuser, multitasking computer system

    NASA Technical Reports Server (NTRS)

    Generazio, Edward R.; Roth, Don J.; Stang, David B.

    1990-01-01

    The architecture of a multiuser, multitasking, virtual-memory computer system intended for use by a medium-size research group is described. There are three central processing units (CPU) in the configuration, each with 16 MB memory, and two 474 MB hard disks attached. CPU 1 is designed for data analysis and contains an array processor for fast Fourier transformations. In addition, CPU 1 shares display images viewed with the image processor. CPU 2 is designed for image analysis and display. CPU 3 is designed for data acquisition and contains 8 GPIB channels and an analog-to-digital conversion input/output interface with 16 channels. Up to 9 users can access the third CPU simultaneously for data acquisition. Focus is placed on the optimization of hardware interfaces and software, facilitating instrument control, data acquisition, and processing.

  5. Symptoms of problematic cellular phone use, functional impairment and its association with depression among adolescents in Southern Taiwan.

    PubMed

    Yen, Cheng-Fang; Tang, Tze-Chun; Yen, Ju-Yu; Lin, Huang-Chi; Huang, Chi-Fen; Liu, Shu-Chun; Ko, Chih-Hung

    2009-08-01

    The aims of this study were: (1) to examine the prevalence of symptoms of problematic cellular phone use (CPU); (2) to examine the associations between the symptoms of problematic CPU, functional impairment caused by CPU and the characteristics of CPU; (3) to establish the optimal cut-off point of the number of symptoms for functional impairment caused by CPU; and (4) to examine the association between problematic CPU and depression in adolescents. A total of 10,191 adolescent students in Southern Taiwan were recruited into this study. Participants' self-reported symptoms of problematic CPU and functional impairments caused by CPU were collected. The associations of symptoms of problematic CPU with functional impairments and with the characteristics of CPU were examined. The cut-off point of the number of symptoms for functional impairment was also determined. The association between problematic CPU and depression was examined by logistic regression analysis. The results indicated that the symptoms of problematic CPU were prevalent in adolescents. The adolescents who had any one of the symptoms of problematic CPU were more likely to report at least one dimension of functional impairment caused by CPU, called more on cellular phones, sent more text messages, or spent more time on and paid higher fees for CPU. Having four or more symptoms of problematic CPU had the highest potential to differentiate between the adolescents with and without functional impairment caused by CPU. Adolescents who had significant depression were more likely to have four or more symptoms of problematic CPU. The results of this study may provide a basis for detecting symptoms of problematic CPU in adolescents.

  6. Preliminary Study of Image Reconstruction Algorithm on a Digital Signal Processor

    DTIC Science & Technology

    2014-03-01

    5.2 Comparison of CPU-GPU, CPU-FPGA, and CPU-DSP Designs The work for implementing a VHDL description of the back-projection algorithm on a physical...FPGA was not complete. Hence, the DSP implementation results are compared with the simulated results for the VHDL design. Simulating VHDL provides an...rather than at the software level. Depending on an application's characteristics, FPGA implementations can provide a significant performance

  7. GPU-Acceleration of Sequence Homology Searches with Database Subsequence Clustering.

    PubMed

    Suzuki, Shuji; Kakuta, Masanori; Ishida, Takashi; Akiyama, Yutaka

    2016-01-01

    Sequence homology searches are used in various fields and require large amounts of computation time, especially for metagenomic analysis, owing to the large number of queries and the database size. To accelerate computing analyses, graphics processing units (GPUs) are widely used as a low-cost, high-performance computing platform. Therefore, we mapped the time-consuming steps involved in GHOSTZ, which is a state-of-the-art homology search algorithm for protein sequences, onto a GPU and implemented it as GHOSTZ-GPU. In addition, we optimized memory access for GPU calculations and for communication between the CPU and GPU. As per results of the evaluation test involving metagenomic data, GHOSTZ-GPU with 12 CPU threads and 1 GPU was approximately 3.0- to 4.1-fold faster than GHOSTZ with 12 CPU threads. Moreover, GHOSTZ-GPU with 12 CPU threads and 3 GPUs was approximately 5.8- to 7.7-fold faster than GHOSTZ with 12 CPU threads.

  8. SU-E-J-91: FFT Based Medical Image Registration Using a Graphics Processing Unit (GPU).

    PubMed

    Luce, J; Hoggarth, M; Lin, J; Block, A; Roeske, J

    2012-06-01

    To evaluate the efficiency gains obtained from using a Graphics Processing Unit (GPU) to perform a Fourier Transform (FT) based image registration. Fourier-based image registration involves obtaining the FT of the component images, and analyzing them in Fourier space to determine the translations and rotations of one image set relative to another. An important property of FT registration is that by enlarging the images (adding additional pixels), one can obtain translations and rotations with sub-pixel resolution. The expense, however, is an increased computational time. GPUs may decrease the computational time associated with FT image registration by taking advantage of their parallel architecture to perform matrix computations much more efficiently than a Central Processor Unit (CPU). In order to evaluate the computational gains produced by a GPU, images with known translational shifts were utilized. A program was written in the Interactive Data Language (IDL; Exelis, Boulder, CO) to perform CPU-based calculations. Subsequently, the program was modified using GPU bindings (Tech-X, Boulder, CO) to perform GPU-based computation on the same system. Multiple image sizes were used, ranging from 256×256 to 2304×2304. The time required to complete the full algorithm by the CPU and GPU were benchmarked and the speed increase was defined as the ratio of the CPU-to-GPU computational time. The ratio of the CPU-to-GPU time was greater than 1.0 for all images, which indicates the GPU is performing the algorithm faster than the CPU. The smallest improvement, a 1.21 ratio, was found with the smallest image size of 256×256, and the largest speedup, a 4.25 ratio, was observed with the largest image size of 2304×2304. GPU programming resulted in a significant decrease in computational time associated with a FT image registration algorithm. The inclusion of the GPU may provide near real-time, sub-pixel registration capability. © 2012 American Association of Physicists in Medicine.
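
    As a concrete illustration of the Fourier-based registration idea (a minimal NumPy sketch, not the authors' IDL/GPU code), the translation between two images can be recovered from the peak of the phase-correlation surface; enlarging (zero-padding or upsampling) the transforms before the inverse FFT is what pushes the estimate to sub-pixel resolution, at the cost of more computation.

```python
# Minimal phase-correlation sketch (assumption: not the authors' IDL/GPU code).
# Integer-pixel shifts are recovered here; padding/upsampling before the inverse
# FFT is what yields sub-pixel resolution, at higher computational cost.
import numpy as np

def estimate_shift(ref, moving):
    """Return (row, col) such that moving ~= np.roll(ref, shift, axis=(0, 1))."""
    F_ref = np.fft.fft2(ref)
    F_mov = np.fft.fft2(moving)
    cross = F_mov * np.conj(F_ref)
    cross /= np.abs(cross) + 1e-12            # keep only the phase -> sharp peak
    corr = np.fft.ifft2(cross).real
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    return tuple(p if p <= n // 2 else p - n for p, n in zip(peak, corr.shape))

rng = np.random.default_rng(0)
ref = rng.random((256, 256))
moving = np.roll(ref, shift=(5, -3), axis=(0, 1))
print(estimate_shift(ref, moving))            # expected: (5, -3)
```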

  9. GPU-Acceleration of Sequence Homology Searches with Database Subsequence Clustering

    PubMed Central

    Suzuki, Shuji; Kakuta, Masanori; Ishida, Takashi; Akiyama, Yutaka

    2016-01-01

    Sequence homology searches are used in various fields and require large amounts of computation time, especially for metagenomic analysis, owing to the large number of queries and the database size. To accelerate computing analyses, graphics processing units (GPUs) are widely used as a low-cost, high-performance computing platform. Therefore, we mapped the time-consuming steps involved in GHOSTZ, which is a state-of-the-art homology search algorithm for protein sequences, onto a GPU and implemented it as GHOSTZ-GPU. In addition, we optimized memory access for GPU calculations and for communication between the CPU and GPU. As per results of the evaluation test involving metagenomic data, GHOSTZ-GPU with 12 CPU threads and 1 GPU was approximately 3.0- to 4.1-fold faster than GHOSTZ with 12 CPU threads. Moreover, GHOSTZ-GPU with 12 CPU threads and 3 GPUs was approximately 5.8- to 7.7-fold faster than GHOSTZ with 12 CPU threads. PMID:27482905

  10. Hypoxia/oxidative stress alters the pharmacokinetics of CPU86017-RS through mitochondrial dysfunction and NADPH oxidase activation.

    PubMed

    Gao, Jie; Ding, Xuan-sheng; Zhang, Yu-mao; Dai, De-zai; Liu, Mei; Zhang, Can; Dai, Yin

    2013-12-01

    Hypoxia/oxidative stress can alter the pharmacokinetics (PK) of CPU86017-RS, a novel antiarrhythmic agent. The aim of this study was to investigate the mechanisms underlying the alteration of PK of CPU86017-RS by hypoxia/oxidative stress. Male SD rats exposed to normal or intermittent hypoxia (10% O2) were administered CPU86017-RS (20, 40 or 80 mg/kg, ig) for 8 consecutive days. The PK parameters of CPU86017-RS were examined on d 8. In a separate set of experiments, female SD rats were injected with isoproterenol (ISO) for 5 consecutive days to induce a stress-related status, then CPU86017-RS (80 mg/kg, ig) was administered, and the tissue distributions were examined. The levels of Mn-SOD (manganese containing superoxide dismutase), endoplasmic reticulum (ER) stress sensor proteins (ATF-6, activating transcription factor 6 and PERK, PRK-like ER kinase) and activation of NADPH oxidase (NOX) were detected with Western blotting. Rat liver microsomes were incubated under N2 for in vitro study. The Cmax, t1/2, MRT (mean residence time) and AUC (area under the curve) of CPU86017-RS were significantly increased in the hypoxic rats receiving the 3 different doses of CPU86017-RS. The hypoxia-induced alteration of PK was associated with significantly reduced Mn-SOD level, and increased ATF-6, PERK and NOX levels. In ISO-treated rats, the distributions of CPU86017-RS in plasma, heart, kidney, and liver were markedly increased, and NOX levels in heart, kidney, and liver were significantly upregulated. Co-administration of the NOX blocker apocynin eliminated the abnormalities in the PK and tissue distributions of CPU86017-RS induced by hypoxia/oxidative stress. The metabolism of CPU86017-RS in the N2-treated liver microsomes was significantly reduced; addition of N-acetylcysteine (NAC), but not vitamin C, effectively reversed this change. The altered PK and metabolism of CPU86017-RS induced by hypoxia/oxidative stress are produced by mitochondrial abnormalities, NOX activation and ER stress; these abnormalities are significantly alleviated by apocynin or NAC.

  11. A survey of CPU-GPU heterogeneous computing techniques

    DOE PAGES

    Mittal, Sparsh; Vetter, Jeffrey S.

    2015-07-04

    As both CPU and GPU become employed in a wide range of applications, it has been acknowledged that both of these processing units (PUs) have their unique features and strengths and hence, CPU-GPU collaboration is inevitable to achieve high-performance computing. This has motivated a significant amount of research on heterogeneous computing techniques, along with the design of CPU-GPU fused chips and petascale heterogeneous supercomputers. In this paper, we survey heterogeneous computing techniques (HCTs) such as workload-partitioning which enable utilizing both CPU and GPU to improve performance and/or energy efficiency. We review heterogeneous computing approaches at runtime, algorithm, programming, compiler and application level. Further, we review both discrete and fused CPU-GPU systems; and discuss benchmark suites designed for evaluating heterogeneous computing systems (HCSs). Furthermore, we believe that this paper will provide insights into working and scope of applications of HCTs to researchers and motivate them to further harness the computational powers of CPUs and GPUs to achieve the goal of exascale performance.
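
    A minimal sketch of the workload-partitioning idea surveyed above (illustrative only; the throughput numbers are assumed, not taken from the survey): split the work items in proportion to the measured rate of each processing unit so the CPU and GPU finish at roughly the same time.

```python
# Illustrative workload-partitioning sketch (assumed throughput numbers, not
# taken from the survey): split N items in proportion to device throughput.
def partition(n_items, cpu_rate, gpu_rate):
    """Return (n_cpu, n_gpu) so both devices finish at roughly the same time."""
    gpu_share = gpu_rate / (cpu_rate + gpu_rate)
    n_gpu = round(n_items * gpu_share)
    return n_items - n_gpu, n_gpu

# Example: a GPU measured at 4x the CPU's item rate receives 80% of the work.
print(partition(1_000_000, cpu_rate=2.0e6, gpu_rate=8.0e6))   # -> (200000, 800000)
```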

  12. A survey of CPU-GPU heterogeneous computing techniques

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Mittal, Sparsh; Vetter, Jeffrey S.

    As both CPU and GPU become employed in a wide range of applications, it has been acknowledged that both of these processing units (PUs) have their unique features and strengths and hence, CPU-GPU collaboration is inevitable to achieve high-performance computing. This has motivated a significant amount of research on heterogeneous computing techniques, along with the design of CPU-GPU fused chips and petascale heterogeneous supercomputers. In this paper, we survey heterogeneous computing techniques (HCTs) such as workload-partitioning which enable utilizing both CPU and GPU to improve performance and/or energy efficiency. We review heterogeneous computing approaches at runtime, algorithm, programming, compiler and application level. Further, we review both discrete and fused CPU-GPU systems; and discuss benchmark suites designed for evaluating heterogeneous computing systems (HCSs). Furthermore, we believe that this paper will provide insights into working and scope of applications of HCTs to researchers and motivate them to further harness the computational powers of CPUs and GPUs to achieve the goal of exascale performance.

  13. Parallel hyperbolic PDE simulation on clusters: Cell versus GPU

    NASA Astrophysics Data System (ADS)

    Rostrup, Scott; De Sterck, Hans

    2010-12-01

    Increasingly, high-performance computing is looking towards data-parallel computational devices to enhance computational performance. Two technologies that have received significant attention are IBM's Cell Processor and NVIDIA's CUDA programming model for graphics processing unit (GPU) computing. In this paper we investigate the acceleration of parallel hyperbolic partial differential equation simulation on structured grids with explicit time integration on clusters with Cell and GPU backends. The message passing interface (MPI) is used for communication between nodes at the coarsest level of parallelism. Optimizations of the simulation code at the several finer levels of parallelism that the data-parallel devices provide are described in terms of data layout, data flow and data-parallel instructions. Optimized Cell and GPU performance are compared with reference code performance on a single x86 central processing unit (CPU) core in single and double precision. We further compare the CPU, Cell and GPU platforms on a chip-to-chip basis, and compare performance on single cluster nodes with two CPUs, two Cell processors or two GPUs in a shared memory configuration (without MPI). We finally compare performance on clusters with 32 CPUs, 32 Cell processors, and 32 GPUs using MPI. Our GPU cluster results use NVIDIA Tesla GPUs with GT200 architecture, but some preliminary results on recently introduced NVIDIA GPUs with the next-generation Fermi architecture are also included. This paper provides computational scientists and engineers who are considering porting their codes to accelerator environments with insight into how structured grid based explicit algorithms can be optimized for clusters with Cell and GPU accelerators. It also provides insight into the speed-up that may be gained on current and future accelerator architectures for this class of applications.
    Program summary:
      Program title: SWsolver
      Catalogue identifier: AEGY_v1_0
      Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEGY_v1_0.html
      Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
      Licensing provisions: GPL v3
      No. of lines in distributed program, including test data, etc.: 59 168
      No. of bytes in distributed program, including test data, etc.: 453 409
      Distribution format: tar.gz
      Programming language: C, CUDA
      Computer: Parallel Computing Clusters. Individual compute nodes may consist of x86 CPU, Cell processor, or x86 CPU with attached NVIDIA GPU accelerator.
      Operating system: Linux
      Has the code been vectorised or parallelized?: Yes. Tested on 1-128 x86 CPU cores, 1-32 Cell Processors, and 1-32 NVIDIA GPUs.
      RAM: Tested on problems requiring up to 4 GB per compute node.
      Classification: 12
      External routines: MPI, CUDA, IBM Cell SDK
      Nature of problem: MPI-parallel simulation of Shallow Water equations using high-resolution 2D hyperbolic equation solver on regular Cartesian grids for x86 CPU, Cell Processor, and NVIDIA GPU using CUDA.
      Solution method: SWsolver provides 3 implementations of a high-resolution 2D Shallow Water equation solver on regular Cartesian grids, for CPU, Cell Processor, and NVIDIA GPU. Each implementation uses MPI to divide work across a parallel computing cluster.
      Additional comments: Sub-program numdiff is used for the test run.

  14. Ground Shock Effects from Accidental Explosions

    DTIC Science & Technology

    1976-11-01

    Horizontal: Dh = Dv tan[sin^-1(cp/U)], Vh = Vv tan[sin^-1(cp/U)]. For tan[sin^-1(cp/U)]...explosive are not included in the present analysis. This effect will limit the credibility of the direct-induced ground shock predictions, but if the... analysis. Dr. D. R. Richmond of Lovelace Foundation provided data on human shock tolerances. REFERENCES 1. "Structures to Resist the Effects of

  15. Parallelized computation for computer simulation of electrocardiograms using personal computers with multi-core CPU and general-purpose GPU.

    PubMed

    Shen, Wenfeng; Wei, Daming; Xu, Weimin; Zhu, Xin; Yuan, Shizhong

    2010-10-01

    Biological computations like electrocardiological modelling and simulation usually require high-performance computing environments. This paper introduces an implementation of parallel computation for computer simulation of electrocardiograms (ECGs) in a personal computer environment with an Intel CPU of Core (TM) 2 Quad Q6600 and a GPU of Geforce 8800GT, with software support by OpenMP and CUDA. It was tested in three parallelization device setups: (a) a four-core CPU without a general-purpose GPU, (b) a general-purpose GPU plus 1 core of CPU, and (c) a four-core CPU plus a general-purpose GPU. To effectively take advantage of a multi-core CPU and a general-purpose GPU, an algorithm based on load-prediction dynamic scheduling was developed and applied to setting (c). In the simulation with 1600 time steps, the speedup of the parallel computation as compared to the serial computation was 3.9 in setting (a), 16.8 in setting (b), and 20.0 in setting (c). This study demonstrates that a current PC with a multi-core CPU and a general-purpose GPU provides a good environment for parallel computations in biological modelling and simulation studies. Copyright 2010 Elsevier Ireland Ltd. All rights reserved.
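
    The load-balancing idea behind setting (c) can be sketched as follows (a simplified, hypothetical illustration, not the authors' load-prediction scheduler): CPU and GPU workers pull fixed-size chunks from a shared queue, so the faster device naturally processes more of them; the load-prediction step would additionally size each chunk from the device's measured throughput.

```python
# Simplified dynamic-scheduling sketch (hypothetical, not the paper's scheduler):
# workers pull chunks from a shared queue, so the faster device drains more work.
import queue
import threading
import time

def worker(name, steps_per_second, tasks, log):
    while True:
        try:
            chunk = tasks.get_nowait()
        except queue.Empty:
            return
        time.sleep(chunk / steps_per_second)   # stand-in for the real simulation work
        log.append(name)

tasks, log = queue.Queue(), []
for _ in range(40):
    tasks.put(100)                             # 40 chunks of 100 simulation steps
workers = [threading.Thread(target=worker, args=(n, s, tasks, log))
           for n, s in (("cpu-core", 1.0e4), ("gpu", 5.0e4))]
for w in workers:
    w.start()
for w in workers:
    w.join()
print({name: log.count(name) for name in ("cpu-core", "gpu")})   # gpu takes ~5x more
```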

  16. Methamphetamine-induced neurotoxicity is attenuated in transgenic mice with a null mutation for interleukin-6.

    PubMed

    Ladenheim, B; Krasnova, I N; Deng, X; Oyler, J M; Polettini, A; Moran, T H; Huestis, M A; Cadet, J L

    2000-12-01

    Increasing evidence implicates apoptosis as a major mechanism of cell death in methamphetamine (METH) neurotoxicity. The involvement of a neuroimmune component in apoptotic cell death after injury or chemical damage suggests that cytokines may play a role in METH effects. In the present study, we examined if the absence of IL-6 in knockout (IL-6-/-) mice could provide protection against METH-induced neurotoxicity. Administration of METH resulted in a significant reduction of [(125)I]RTI-121-labeled dopamine transporters in the caudate-putamen (CPu) and cortex as well as depletion of dopamine in the CPu and frontal cortex of wild-type mice. However, these METH-induced effects were significantly attenuated in IL-6-/- animals. METH also caused a decrease in serotonin levels in the CPu and hippocampus of wild-type mice, but no reduction was observed in IL-6-/- animals. Moreover, METH induced decreases in [(125)I]RTI-55-labeled serotonin transporters in the hippocampal CA3 region and in the substantia nigra-reticulata but increases in serotonin transporters in the CPu and cingulate cortex in wild-type animals, all of which were attenuated in IL-6-/- mice. Additionally, METH caused increased gliosis in the CPu and cortices of wild-type mice as measured by [(3)H]PK-11195 binding; this gliotic response was almost completely inhibited in IL-6-/- animals. There was also significant protection against METH-induced DNA fragmentation, measured by the number of terminal deoxynucleotidyl transferase-mediated dUTP nick-end-labeled (TUNEL) cells in the cortices. The protective effects against METH toxicity observed in the IL-6-/- mice were not caused by differences in temperature elevation or in METH accumulation in wild-type and mutant animals. Therefore, these observations support the proposition that IL-6 may play an important role in the neurotoxicity of METH.

  17. GPU accelerated generation of digitally reconstructed radiographs for 2-D/3-D image registration.

    PubMed

    Dorgham, Osama M; Laycock, Stephen D; Fisher, Mark H

    2012-09-01

    Recent advances in programming languages for graphics processing units (GPUs) provide developers with a convenient way of implementing applications which can be executed on the CPU and GPU interchangeably. GPUs are becoming relatively cheap, powerful, and widely available hardware components, which can be used to perform intensive calculations. The last decade of hardware performance developments shows that GPU-based computation is progressing significantly faster than CPU-based computation, particularly if one considers the execution of highly parallelisable algorithms. Future predictions illustrate that this trend is likely to continue. In this paper, we introduce a way of accelerating 2-D/3-D image registration by developing a hybrid system which executes on the CPU and utilizes the GPU for parallelizing the generation of digitally reconstructed radiographs (DRRs). Based on the advancements of the GPU over the CPU, it is timely to exploit the benefits of many-core GPU technology by developing algorithms for DRR generation. Although some previous work has investigated the rendering of DRRs using the GPU, this paper investigates approximations which reduce the computational overhead while still maintaining a quality consistent with that needed for 2-D/3-D registration with sufficient accuracy to be clinically acceptable in certain applications of radiation oncology. Furthermore, by comparing implementations of 2-D/3-D registration on the CPU and GPU, we investigate current performance and propose an optimal framework for PC implementations addressing the rigid registration problem. Using this framework, we are able to render DRR images from a 256×256×133 CT volume in ~24 ms using an NVidia GeForce 8800 GTX and in ~2 ms using NVidia GeForce GTX 580. In addition to applications requiring fast automatic patient setup, these levels of performance suggest image-guided radiation therapy at video frame rates is technically feasible using relatively low cost PC architecture.
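
    At its core, a DRR is a set of line integrals of attenuation through the CT volume; the sketch below is an assumption-level simplification with parallel rays along one axis (not the paper's perspective-projection GPU kernels with interpolation), showing the basic operation being accelerated.

```python
# Simplified DRR sketch (parallel rays along the z axis and no interpolation;
# an assumption, not the paper's perspective-projection GPU kernels).
import numpy as np

rng = np.random.default_rng(0)
mu = rng.random((133, 256, 256)).astype(np.float32)   # attenuation per voxel (slices, rows, cols)
voxel_mm = 1.0                                        # assumed voxel spacing along the ray

path_integral = mu.sum(axis=0) * voxel_mm             # line integral along each z-directed ray
drr = np.exp(-path_integral)                          # Beer-Lambert transmitted intensity
print(drr.shape, float(drr.min()), float(drr.max()))
```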

  18. Classification of hyperspectral imagery using MapReduce on a NVIDIA graphics processing unit (Conference Presentation)

    NASA Astrophysics Data System (ADS)

    Ramirez, Andres; Rahnemoonfar, Maryam

    2017-04-01

    A hyperspectral image provides a multidimensional, data-rich representation consisting of hundreds of spectral dimensions. Analyzing the spectral and spatial information of such an image with linear and non-linear algorithms will result in high computational time. In order to overcome this problem, this research presents a system using a MapReduce-Graphics Processing Unit (GPU) model that can help analyze a hyperspectral image through the usage of parallel hardware and a parallel programming model, which will be simpler to handle compared to other low-level parallel programming models. Additionally, Hadoop was used as an open-source version of the MapReduce parallel programming model. This research compared classification accuracy results and timing results between the Hadoop and GPU system and tested it against the following test cases: the CPU and GPU test case, a CPU test case and a test case where no dimensional reduction was applied.
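
    A toy sketch of the MapReduce pattern used here (illustrative only; the stand-in classifier and data are hypothetical, and the real pipeline runs the same pattern on Hadoop): map emits a (class, 1) pair per pixel spectrum and reduce sums the counts per class.

```python
# Toy MapReduce sketch (hypothetical classifier and data; the real pipeline runs
# the same pattern on Hadoop): map emits (class, 1) per spectrum, reduce sums.
from collections import defaultdict

def map_phase(pixels, classify):
    for spectrum in pixels:
        yield classify(spectrum), 1

def reduce_phase(pairs):
    totals = defaultdict(int)
    for label, count in pairs:
        totals[label] += count
    return dict(totals)

classify = lambda spectrum: "vegetation" if sum(spectrum) > 1.0 else "water"  # stand-in
pixels = [[0.2, 0.9], [0.1, 0.3], [0.8, 0.7]]
print(reduce_phase(map_phase(pixels, classify)))   # {'vegetation': 2, 'water': 1}
```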

  19. The Performance of the NAS HSPs in 1st Half of 1994

    NASA Technical Reports Server (NTRS)

    Bergeron, Robert J.; Walter, Howard (Technical Monitor)

    1995-01-01

    During the first six months of 1994, the NAS (Numerical Aerodynamic Simulation) 16-CPU Y-MP C90 Von Neumann (VN) delivered an average throughput of 4.045 GFLOPS while the ACSF (Aeronautics Consolidated Supercomputer Facility) 8-CPU Y-MP C90 Eagle averaged 1.658 GFLOPS. The VN rate represents a machine efficiency of 26.3% whereas the Eagle rate corresponds to a machine efficiency of 21.6%. VN displayed a greater efficiency than Eagle primarily because the stronger workload demand for its CPU cycles allowed it to devote more time to user programs and less time to idle. An additional factor increasing VN efficiency was the ability of the UNICOS 8.0 Operating System to deliver a larger fraction of CPU time to user programs. Although measurements indicate increasing vector length for both workloads, insufficient vector lengths continue to hinder HSP (High Speed Processor) performance. To improve HSP performance, NAS should continue to encourage the HSP users to modify their codes to increase program vector length.
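
    The quoted efficiencies follow directly from delivered throughput divided by aggregate peak; the check below assumes a peak of roughly 0.96 GFLOPS per Y-MP C90 CPU (an assumption consistent with the figures in the record, not stated in it).

```python
# Efficiency = delivered GFLOPS / (CPUs x peak per CPU). The ~0.96 GFLOPS peak
# per Y-MP C90 CPU is an assumption that reproduces the quoted percentages.
peak_per_cpu = 0.96   # GFLOPS, assumed
for name, cpus, delivered in (("VN", 16, 4.045), ("Eagle", 8, 1.658)):
    print(f"{name}: {delivered / (cpus * peak_per_cpu):.1%}")   # ~26.3%, ~21.6%
```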

  20. Multi-GPU and multi-CPU accelerated FDTD scheme for vibroacoustic applications

    NASA Astrophysics Data System (ADS)

    Francés, J.; Otero, B.; Bleda, S.; Gallego, S.; Neipp, C.; Márquez, A.; Beléndez, A.

    2015-06-01

    The Finite-Difference Time-Domain (FDTD) method is applied to the analysis of vibroacoustic problems and to study the propagation of longitudinal and transversal waves in stratified media. The potential of the scheme and the relevance of each acceleration strategy for massive computations in FDTD are demonstrated in this work. In this paper, we propose two new specific implementations of the bi-dimensional scheme of the FDTD method using multi-CPU and multi-GPU, respectively. In the first implementation, an open source message passing interface (OMPI) has been included in order to massively exploit the resources of a biprocessor station with two Intel Xeon processors. Moreover, regarding the CPU code version, the streaming SIMD extensions (SSE) and also the advanced vectorial extensions (AVX) have been included with shared memory approaches that take advantage of the multi-core platforms. On the other hand, the second implementation, called the multi-GPU code version, is based on Peer-to-Peer communications available in CUDA on two GPUs (NVIDIA GTX 670). Subsequently, this paper presents an accurate analysis of the influence of the different code versions including shared memory approaches, vector instructions and multi-processors (both CPU and GPU) and compares them in order to delimit the degree of improvement of using distributed solutions based on multi-CPU and multi-GPU. The performance of both approaches was analysed and it has been demonstrated that the addition of shared memory schemes to CPU computing substantially improves the performance of vector instructions, enlarging the simulation sizes that use the cache memory of CPUs efficiently. In this case GPU computing is roughly twice as fast as the fine-tuned CPU version for both one and two nodes. However, for massive computations explicit vector instructions are not worthwhile, since the memory bandwidth is the limiting factor and the performance tends to be the same as the sequential version with auto-vectorisation and also the shared memory approach. In this scenario GPU computing is the best option since it provides a homogeneous behaviour. More specifically, the speedup of GPU computing reaches an upper limit of 12 for both one and two GPUs, whereas the performance reaches peak values of 80 GFlops and 146 GFlops for one GPU and two GPUs, respectively. Finally, the method is applied to an earth crust profile in order to demonstrate the potential of our approach and the necessity of applying acceleration strategies in these types of applications.

  1. A novel iterative mixed model to remap three complex orthopedic traits in dogs

    PubMed Central

    Huang, Meng; Hayward, Jessica J.; Corey, Elizabeth; Garrison, Susan J.; Wagner, Gabriela R.; Krotscheck, Ursula; Hayashi, Kei; Schweitzer, Peter A.; Lust, George; Boyko, Adam R.; Todhunter, Rory J.

    2017-01-01

    Hip dysplasia (HD), elbow dysplasia (ED), and rupture of the cranial (anterior) cruciate ligament (RCCL) are the most common complex orthopedic traits of dogs and all result in debilitating osteoarthritis. We reanalyzed previously reported data: the Norberg angle (a quantitative measure of HD) in 921 dogs, ED in 113 cases and 633 controls, and RCCL in 271 cases and 399 controls and their genotypes at ~185,000 single nucleotide polymorphisms. A novel fixed and random model with a circulating probability unification (FarmCPU) function, with marker-based principal components and a kinship matrix to correct for population stratification, was used. A Bonferroni correction at p<0.01 resulted in a P < 6.96 × 10^-8. Six loci were identified; three for HD and three for RCCL. An associated locus at CFA28:34,369,342 for HD was described previously in the same dogs using a conventional mixed model. No loci were identified for RCCL in the previous report but the two loci for ED in the previous report did not reach genome-wide significance using the FarmCPU model. These results were supported by simulation which demonstrated that the FarmCPU held no power advantage over the linear mixed model for the ED sample but provided additional power for the HD and RCCL samples. Candidate genes for HD and RCCL are discussed. When using FarmCPU software, we recommend a resampling test, that a positive control be used to determine the optimum pseudo quantitative trait nucleotide-based covariate structure of the model, and a negative control be used consisting of permutation testing and the identical resampling test as for the non-permuted phenotypes. PMID:28614352

  2. Software Defined Radio with Parallelized Software Architecture

    NASA Technical Reports Server (NTRS)

    Heckler, Greg

    2013-01-01

    This software implements software-defined radio processing over multi-core, multi-CPU systems in a way that maximizes the use of CPU resources in the system. The software treats each processing step in either a communications or navigation modulator or demodulator system as an independent, threaded block. Each threaded block is defined with a programmable number of input or output buffers; these buffers are implemented using POSIX pipes. In addition, each threaded block is assigned a unique thread upon block installation. A modulator or demodulator system is built by assembly of the threaded blocks into a flow graph, which assembles the processing blocks to accomplish the desired signal processing. This software architecture allows the software to scale effortlessly between single CPU/single-core computers or multi-CPU/multi-core computers without recompilation. NASA spaceflight and ground communications systems currently rely exclusively on ASICs or FPGAs. This software allows low- and medium-bandwidth (100 bps to approximately 50 Mbps) software defined radios to be designed and implemented solely in C/C++ software, while lowering development costs and facilitating reuse and extensibility.
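
    The flow-graph architecture described above can be sketched as follows (a hypothetical Python analogue, not the NASA C/C++ implementation, with queues standing in for the POSIX pipes): each block runs in its own thread and stages are chained through buffers.

```python
# Hypothetical sketch of the threaded-block flow graph (Python queues stand in
# for the POSIX pipes of the real C/C++ implementation).
import queue
import threading

def block(func, inbuf, outbuf):
    while True:
        item = inbuf.get()
        if item is None:          # sentinel: shut down and tell the next stage
            outbuf.put(None)
            return
        outbuf.put(func(item))

src, mid, sink = queue.Queue(), queue.Queue(), queue.Queue()
stages = [threading.Thread(target=block, args=(lambda x: x * 2, src, mid)),
          threading.Thread(target=block, args=(lambda x: x + 1, mid, sink))]
for s in stages:
    s.start()
for sample in (1, 2, 3):
    src.put(sample)
src.put(None)
for s in stages:
    s.join()
print([sink.get() for _ in range(3)])   # -> [3, 5, 7]
```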

  3. Software Defined Radio with Parallelized Software Architecture

    NASA Technical Reports Server (NTRS)

    Heckler, Greg

    2013-01-01

    This software implements software-defined radio processing over multi-core, multi-CPU systems in a way that maximizes the use of CPU resources in the system. The software treats each processing step in either a communications or navigation modulator or demodulator system as an independent, threaded block. Each threaded block is defined with a programmable number of input or output buffers; these buffers are implemented using POSIX pipes. In addition, each threaded block is assigned a unique thread upon block installation. A modulator or demodulator system is built by assembly of the threaded blocks into a flow graph, which assembles the processing blocks to accomplish the desired signal processing. This software architecture allows the software to scale effortlessly between single CPU/single-core computers or multi-CPU/multi-core computers without recompilation. NASA spaceflight and ground communications systems currently rely exclusively on ASICs or FPGAs. This software allows low- and medium-bandwidth (100 bps to approximately 50 Mbps) software defined radios to be designed and implemented solely in C/C++ software, while lowering development costs and facilitating reuse and extensibility.

  4. Convolutional Neural Network on Embedded Linux™ System-on-Chip: A Methodology and Performance Benchmark

    DTIC Science & Technology

    2016-05-01

    A9 CPU and 15 W for the i7 CPU. A method of accelerating this computation is by using a customized hardware unit called a field-programmable gate...implementation of custom logic to accelerate computational workloads. This FPGA fabric, in addition to the standard programmable logic, contains 220...chip; field-programmable gate array

  5. Convolutional Neural Network on Embedded Linux System-on-Chip: A Methodology and Performance Benchmark

    DTIC Science & Technology

    2016-05-01

    A9 CPU and 15 W for the i7 CPU. A method of accelerating this computation is by using a customized hardware unit called a field-programmable gate...implementation of custom logic to accelerate computational workloads. This FPGA fabric, in addition to the standard programmable logic, contains 220...chip; field-programmable gate array

  6. A novel heterogeneous algorithm to simulate multiphase flow in porous media on multicore CPU-GPU systems

    NASA Astrophysics Data System (ADS)

    McClure, J. E.; Prins, J. F.; Miller, C. T.

    2014-07-01

    Multiphase flow implementations of the lattice Boltzmann method (LBM) are widely applied to the study of porous medium systems. In this work, we construct a new variant of the popular "color" LBM for two-phase flow in which a three-dimensional, 19-velocity (D3Q19) lattice is used to compute the momentum transport solution while a three-dimensional, seven-velocity (D3Q7) lattice is used to compute the mass transport solution. Based on this formulation, we implement a novel heterogeneous GPU-accelerated algorithm in which the mass transport solution is computed by multiple shared memory CPU cores programmed using OpenMP while a concurrent solution of the momentum transport is performed using a GPU. The heterogeneous solution is demonstrated to provide a speedup of 2.6× compared to the multi-core CPU solution and 1.8× compared to the GPU solution due to concurrent utilization of both CPU and GPU bandwidths. Furthermore, we verify that the proposed formulation provides an accurate physical representation of multiphase flow processes and demonstrate that the approach can be applied to perform heterogeneous simulations of two-phase flow in porous media using a typical GPU-accelerated workstation.

  7. Quasi-elastic light scattering: Signal storage, correlation, and spectrum analysis under control of an 8-bit microprocessor

    NASA Astrophysics Data System (ADS)

    Glatter, Otto; Fuchs, Heribert; Jorde, Christian; Eigner, Wolf-Dieter

    1987-03-01

    The microprocessor of an 8-bit PC system is used as a central control unit for the acquisition and evaluation of data from quasi-elastic light scattering experiments. Data are sampled with a width of 8 bits under control of the CPU. This limits the minimum sample time to 20 μs. Shorter sample times would need a direct memory access channel. The 8-bit CPU can address a 64-kbyte RAM without additional paging. Up to 49 000 sample points can be measured without interruption. After storage, a correlation function or a power spectrum can be calculated from such a primary data set. Furthermore access is provided to the primary data for stability control, statistical tests, and for comparison of different evaluation methods for the same experiment. A detailed analysis of the signal (histogram) and of the effect of overflows is possible and shows that the number of pulses but not the number of overflows determines the error in the result. The correlation function can be computed with reasonable accuracy from data with a mean pulse rate greater than one, the power spectrum needs a three times higher pulse rate for convergence. The statistical accuracy of the results from 49 000 sample points is of the order of a few percent. Additional averages are necessary to improve their quality. The hardware extensions for the PC system are inexpensive. The main disadvantage of the present system is the high minimum sampling time of 20 μs and the fact that the correlogram or the power spectrum cannot be computed on-line as it can be done with hardware correlators or spectrum analyzers. These shortcomings and the storage size restrictions can be removed with a faster 16/32-bit CPU.
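
    The off-line correlation step described above amounts to a normalized photon-count autocorrelation over the stored samples; the sketch below is an illustrative NumPy version (not the original 8-bit routine).

```python
# Illustrative sketch (not the original 8-bit code): normalized intensity
# autocorrelation g(tau) computed off-line from the stored count samples.
import numpy as np

def autocorrelation(counts, max_lag):
    counts = np.asarray(counts, dtype=float)
    mean_sq = counts.mean() ** 2
    return [float(np.mean(counts[:-lag] * counts[lag:]) / mean_sq)
            for lag in range(1, max_lag + 1)]

rng = np.random.default_rng(1)
samples = rng.poisson(lam=4.0, size=49_000)    # 49,000 stored samples, as above
print(autocorrelation(samples, max_lag=5))     # ~1.0 at every lag for uncorrelated counts
```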

  8. A CPU/MIC Collaborated Parallel Framework for GROMACS on Tianhe-2 Supercomputer.

    PubMed

    Peng, Shaoliang; Yang, Shunyun; Su, Wenhe; Zhang, Xiaoyu; Zhang, Tenglilang; Liu, Weiguo; Zhao, Xingming

    2017-06-16

    Molecular Dynamics (MD) is the simulation of the dynamic behavior of atoms and molecules. As the most popular software for molecular dynamics, GROMACS cannot work on large-scale data because of limited computing resources. In this paper, we propose a CPU and Intel® Xeon Phi Many Integrated Core (MIC) collaborated parallel framework to accelerate GROMACS using the offload mode on a MIC coprocessor, with which the performance of GROMACS is improved significantly, especially with the use of the Tianhe-2 supercomputer. Furthermore, we optimize GROMACS so that it can run on both the CPU and MIC at the same time. In addition, we accelerate multi-node GROMACS so that it can be used in practice. Benchmarking on real data, our accelerated GROMACS performs very well and reduces computation time significantly. Source code: https://github.com/tianhe2/gromacs-mic.

  9. Measurements of neuron soma size and density in rat dorsal striatum, nucleus accumbens core and nucleus accumbens shell: differences between striatal region and brain hemisphere, but not sex.

    PubMed

    Meitzen, John; Pflepsen, Kelsey R; Stern, Christopher M; Meisel, Robert L; Mermelstein, Paul G

    2011-01-07

    Both hemispheric bias and sex differences exist in striatal-mediated behaviors and pathologies. The extent to which these dimorphisms can be attributed to an underlying neuroanatomical difference is unclear. We therefore quantified neuron soma size and density in the dorsal striatum (CPu) as well as the core (AcbC) and shell (AcbS) subregions of the nucleus accumbens to determine whether these anatomical measurements differ by region, hemisphere, or sex in adult Sprague-Dawley rats. Neuron soma size was larger in the CPu than the AcbC or AcbS. Neuron density was greatest in the AcbS, intermediate in the AcbC, and least dense in the CPu. CPu neuron density was greater in the left in comparison to the right hemisphere. No attribute was sexually dimorphic. These results provide the first evidence that hemispheric bias in the striatum and striatal-mediated behaviors can be attributed to a lateralization in neuronal density within the CPu. In contrast, sexual dimorphisms appear mediated by factors other than gross anatomical differences. Copyright © 2010 Elsevier Ireland Ltd. All rights reserved.

  10. Performance analysis of the FDTD method applied to holographic volume gratings: Multi-core CPU versus GPU computing

    NASA Astrophysics Data System (ADS)

    Francés, J.; Bleda, S.; Neipp, C.; Márquez, A.; Pascual, I.; Beléndez, A.

    2013-03-01

    The finite-difference time-domain method (FDTD) allows electromagnetic field distribution analysis as a function of time and space. The method is applied to analyze holographic volume gratings (HVGs) for the near-field distribution at optical wavelengths. Usually, this application requires the simulation of wide areas, which implies more memory and time processing. In this work, we propose a specific implementation of the FDTD method including several add-ons for a precise simulation of optical diffractive elements. Values in the near-field region are computed considering the illumination of the grating by means of a plane wave for different angles of incidence and including absorbing boundaries as well. We compare the results obtained by FDTD with those obtained using a matrix method (MM) applied to diffraction gratings. In addition, we have developed two optimized versions of the algorithm, for both CPU and GPU, in order to analyze the improvement of using the new NVIDIA Fermi GPU architecture versus highly tuned multi-core CPU as a function of the size simulation. In particular, the optimized CPU implementation takes advantage of the arithmetic and data transfer streaming SIMD (single instruction multiple data) extensions (SSE) included explicitly in the code and also of multi-threading by means of OpenMP directives. A good agreement between the results obtained using both FDTD and MM methods is obtained, thus validating our methodology. Moreover, the performance of the GPU is compared to the SSE+OpenMP CPU implementation, and it is quantitatively determined that a highly optimized CPU program can be competitive for a wider range of simulation sizes, whereas GPU computing becomes more powerful for large-scale simulations.
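
    The kernel being tuned in both implementations is the Yee update stencil applied to whole arrays each time step; the sketch below is a stripped-down 2-D TMz update in NumPy (normalized units, no sources or absorbing boundaries, and not the authors' SSE/OpenMP or CUDA code) that shows the data-parallel structure those optimizations exploit.

```python
# Stripped-down 2-D FDTD (TMz) update sketch in normalized units; an assumption-
# level illustration of the stencil, not the authors' SSE/OpenMP or CUDA kernels.
import numpy as np

def fdtd_step(ez, hx, hy, c=0.5):               # c = Courant number (stable for c <= 1/sqrt(2))
    hx[:, :-1] -= c * (ez[:, 1:] - ez[:, :-1])   # H update from the curl of E
    hy[:-1, :] += c * (ez[1:, :] - ez[:-1, :])
    ez[1:, 1:] += c * ((hy[1:, 1:] - hy[:-1, 1:])
                       - (hx[1:, 1:] - hx[1:, :-1]))   # E update from the curl of H

n = 256
ez, hx, hy = (np.zeros((n, n)) for _ in range(3))
ez[n // 2, n // 2] = 1.0                         # point excitation at the grid center
for _ in range(100):
    fdtd_step(ez, hx, hy)
print(float(np.abs(ez).max()))
```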

  11. Study on efficiency of time computation in x-ray imaging simulation base on Monte Carlo algorithm using graphics processing unit

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Setiani, Tia Dwi, E-mail: tiadwisetiani@gmail.com; Suprijadi; Nuclear Physics and Biophysics Research Division, Faculty of Mathematics and Natural Sciences, Institut Teknologi Bandung, Jalan Ganesha 10, Bandung 40132

    Monte Carlo (MC) is one of the powerful techniques for simulation in x-ray imaging. The MC method can simulate radiation transport within matter with high accuracy and provides a natural way to simulate radiation transport in complex systems. One of the codes based on the MC algorithm that is widely used for radiographic image simulation is MC-GPU, a code developed by Andreu Badal. This study aimed to investigate the computation time of x-ray imaging simulation on a GPU (Graphics Processing Unit) compared to a standard CPU (Central Processing Unit). Furthermore, the effect of physical parameters on the quality of radiographic images and the comparison of image quality resulting from simulation on the GPU and CPU are evaluated in this paper. The simulations were run on a CPU in serial mode and on two GPUs with 384 cores and 2304 cores. In the GPU simulation, each core calculates one photon, so a large number of photons are calculated simultaneously. Results show that the simulations on the GPU were significantly accelerated compared to the CPU. The simulations on the 2304-core GPU ran about 64-114 times faster than on the CPU, while the simulations on the 384-core GPU ran about 20-31 times faster than on a single CPU core. Another result shows that optimum image quality was obtained for histories starting from 10^8 and energies from 60 keV to 90 keV. Analyzed by a statistical approach, the quality of the GPU and CPU images is relatively the same.

  12. Optimizing the Betts-Miller-Janjic cumulus parameterization with Intel Many Integrated Core (MIC) architecture

    NASA Astrophysics Data System (ADS)

    Huang, Melin; Huang, Bormin; Huang, Allen H.-L.

    2015-10-01

    The schemes of cumulus parameterization are responsible for the sub-grid-scale effects of convective and/or shallow clouds, and are intended to represent vertical fluxes due to unresolved updrafts and downdrafts and compensating motion outside the clouds. Some schemes additionally provide cloud and precipitation field tendencies in the convective column, and momentum tendencies due to convective transport of momentum. The schemes all provide the convective component of surface rainfall. Betts-Miller-Janjic (BMJ) is one scheme to fulfill such purposes in the weather research and forecast (WRF) model. The National Centers for Environmental Prediction (NCEP) has tried to optimize the BMJ scheme for operational application. As there are no interactions among horizontal grid points, this scheme is very suitable for parallel computation. The Intel Xeon Phi Many Integrated Core (MIC) architecture, with its efficient parallelization and vectorization capabilities, allows us to optimize the BMJ scheme. Compared to the original code running on one CPU socket (eight cores) and on one CPU core of an Intel Xeon E5-2670, the MIC-based optimization of this scheme running on a Xeon Phi 7120P coprocessor improves performance by 2.4x and 17.0x, respectively.

  13. Dynamic mechanical analysis and organization/storage of data for polymetric materials

    NASA Technical Reports Server (NTRS)

    Rosenberg, M.; Buckley, W.

    1982-01-01

    Dynamic mechanical analysis was performed on a variety of temperature resistant polymers and composite resin matrices. Data on glass transition temperatures and degree of cure attained were derived. In addition, a laboratory-based computer system was installed and a database set up to allow entry of composite data. The laboratory CPU, termed TYCHO, is based on a DEC PDP 11/44 CPU with a Datatrieve relational database. The function of TYCHO is integration of chemical laboratory analytical instrumentation and storage of chemical structures for modeling of new polymeric structures and compounds.

  14. Accelerated event-by-event Monte Carlo microdosimetric calculations of electrons and protons tracks on a multi-core CPU and a CUDA-enabled GPU.

    PubMed

    Kalantzis, Georgios; Tachibana, Hidenobu

    2014-01-01

    For microdosimetric calculations event-by-event Monte Carlo (MC) methods are considered the most accurate. The main shortcoming of those methods is the extensive requirement for computational time. In this work we present an event-by-event MC code of low projectile energy electron and proton tracks for accelerated microdosimetric MC simulations on a graphic processing unit (GPU). Additionally, a hybrid implementation scheme was realized by employing OpenMP and CUDA in such a way that both GPU and multi-core CPU were utilized simultaneously. The two implementation schemes have been tested and compared with the sequential single threaded MC code on the CPU. Performance comparison was established on the speed-up for a set of benchmarking cases of electron and proton tracks. A maximum speedup of 67.2 was achieved for the GPU-based MC code, while a further improvement of the speedup up to 20% was achieved for the hybrid approach. The results indicate the capability of our CPU-GPU implementation for accelerated MC microdosimetric calculations of both electron and proton tracks without loss of accuracy. Copyright © 2013 Elsevier Ireland Ltd. All rights reserved.

  15. Multigroup Monte Carlo on GPUs: Comparison of history- and event-based algorithms

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Hamilton, Steven P.; Slattery, Stuart R.; Evans, Thomas M.

    This article presents an investigation of the performance of different multigroup Monte Carlo transport algorithms on GPUs with a discussion of both history-based and event-based approaches. Several algorithmic improvements are introduced for both approaches. By modifying the history-based algorithm that is traditionally favored in CPU-based MC codes to occasionally filter out dead particles to reduce thread divergence, performance exceeds that of either the pure history-based or event-based approaches. The impacts of several algorithmic choices are discussed, including performance studies on Kepler and Pascal generation NVIDIA GPUs for fixed source and eigenvalue calculations. Single-device performance equivalent to 20–40 CPU cores on the K40 GPU and 60–80 CPU cores on the P100 GPU is achieved. In addition, nearly perfect multi-device parallel weak scaling is demonstrated on more than 16,000 nodes of the Titan supercomputer.

  16. Multigroup Monte Carlo on GPUs: Comparison of history- and event-based algorithms

    DOE PAGES

    Hamilton, Steven P.; Slattery, Stuart R.; Evans, Thomas M.

    2017-12-22

    This article presents an investigation of the performance of different multigroup Monte Carlo transport algorithms on GPUs with a discussion of both history-based and event-based approaches. Several algorithmic improvements are introduced for both approaches. By modifying the history-based algorithm that is traditionally favored in CPU-based MC codes to occasionally filter out dead particles to reduce thread divergence, performance exceeds that of either the pure history-based or event-based approaches. The impacts of several algorithmic choices are discussed, including performance studies on Kepler and Pascal generation NVIDIA GPUs for fixed source and eigenvalue calculations. Single-device performance equivalent to 20–40 CPU cores on the K40 GPU and 60–80 CPU cores on the P100 GPU is achieved. In addition, nearly perfect multi-device parallel weak scaling is demonstrated on more than 16,000 nodes of the Titan supercomputer.
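
    The dead-particle filtering mentioned above is essentially periodic stream compaction of the particle arrays so that active work stays contiguous; the sketch below is a conceptual NumPy illustration with toy physics (not the actual GPU transport code).

```python
# Conceptual sketch of periodic dead-particle compaction (toy physics, not the
# actual GPU transport code): compacting keeps the active set contiguous, which
# on a GPU reduces thread divergence in the history-based loop.
import numpy as np

rng = np.random.default_rng(0)
position = rng.random(1_000_000)
alive = np.ones(position.size, dtype=bool)

for step in range(20):
    n_active = int(alive.sum())
    position[alive] += rng.normal(0.0, 0.1, n_active)   # "transport" active particles
    alive[alive] = rng.random(n_active) > 0.1           # ~10% absorbed each step
    if step % 5 == 4:                                   # every few steps: compact
        position = position[alive]
        alive = np.ones(position.size, dtype=bool)

print(position[alive].size, "particles still alive after 20 steps")
```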

  17. Routine Microsecond Molecular Dynamics Simulations with AMBER on GPUs. 2. Explicit Solvent Particle Mesh Ewald.

    PubMed

    Salomon-Ferrer, Romelia; Götz, Andreas W; Poole, Duncan; Le Grand, Scott; Walker, Ross C

    2013-09-10

    We present an implementation of explicit solvent all atom classical molecular dynamics (MD) within the AMBER program package that runs entirely on CUDA-enabled GPUs. First released publicly in April 2010 as part of version 11 of the AMBER MD package and further improved and optimized over the last two years, this implementation supports the three most widely used statistical mechanical ensembles (NVE, NVT, and NPT), uses particle mesh Ewald (PME) for the long-range electrostatics, and runs entirely on CUDA-enabled NVIDIA graphics processing units (GPUs), providing results that are statistically indistinguishable from the traditional CPU version of the software and with performance that exceeds that achievable by the CPU version of AMBER software running on all conventional CPU-based clusters and supercomputers. We briefly discuss three different precision models developed specifically for this work (SPDP, SPFP, and DPDP) and highlight the technical details of the approach as it extends beyond previously reported work [Götz et al., J. Chem. Theory Comput. 2012, DOI: 10.1021/ct200909j; Le Grand et al., Comp. Phys. Comm. 2013, DOI: 10.1016/j.cpc.2012.09.022]. We highlight the substantial improvements in performance that are seen over traditional CPU-only machines and provide validation of our implementation and precision models. We also provide evidence supporting our decision to deprecate the previously described fully single precision (SPSP) model from the latest release of the AMBER software package.

  18. Multiprocessing MCNP on an IBM RS/6000 cluster

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    McKinney, G.W.; West, J.T.

    1993-01-01

    The advent of high-performance computer systems has brought to maturity programming concepts like vectorization, multiprocessing, and multitasking. While there are many schools of thought as to the most significant factor in obtaining order-of-magnitude increases in performance, such speedup can only be achieved by integrating the computer system and application code. Vectorization leads to faster manipulation of arrays by overlapping instruction CPU cycles. Discrete ordinates codes, which require the solving of large matrices, have proved to be major benefactors of vectorization. Monte Carlo transport, on the other hand, typically contains numerous logic statements and requires extensive redevelopment to benefit from vectorization. Multiprocessing and multitasking provide additional CPU cycles via multiple processors. Such systems are generally designed with either common memory access (multitasking) or distributed memory access. In both cases, theoretical speedup, as a function of the number of processors (P) and the fraction of task time that multiprocesses (f), can be formulated using Amdahl's Law, S(f, P) = 1 / ((1 - f) + f/P). However, for most applications this theoretical limit cannot be achieved, due to additional terms not included in Amdahl's Law. Monte Carlo transport is a natural candidate for multiprocessing, since the particle tracks are generally independent and the precision of the result increases as the square root of the number of particles tracked.
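
    A quick worked example of the Amdahl's Law expression quoted above (the parallel fraction f = 0.95 is chosen only for illustration).

```python
# Amdahl's Law: S(f, P) = 1 / ((1 - f) + f / P); f = 0.95 is an illustrative value.
def amdahl_speedup(f, p):
    return 1.0 / ((1.0 - f) + f / p)

for p in (2, 8, 32):
    print(p, round(amdahl_speedup(0.95, p), 2))   # -> 1.9, 5.93, 12.55
```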

  19. Toward production of jet fuel functionality in oilseeds: identification of FatB acyl-acyl carrier protein thioesterases and evaluation of combinatorial expression strategies in Camelina seeds

    PubMed Central

    Kim, Hae Jin; Silva, Jillian E.; Vu, Hieu Sy; Mockaitis, Keithanne; Nam, Jeong-Won; Cahoon, Edgar B.

    2015-01-01

    Seeds of members of the genus Cuphea accumulate medium-chain fatty acids (MCFAs; 8:0–14:0). MCFA- and palmitic acid- (16:0) rich vegetable oils have received attention for jet fuel production, given their similarity in chain length to Jet A fuel hydrocarbons. Studies were conducted to test genes, including those from Cuphea, for their ability to confer jet fuel-type fatty acid accumulation in seed oil of the emerging biofuel crop Camelina sativa. Transcriptomes from Cuphea viscosissima and Cuphea pulcherrima developing seeds that accumulate >90% of C8 and C10 fatty acids revealed three FatB cDNAs (CpuFatB3, CvFatB1, and CpuFatB4) expressed predominantly in seeds and structurally divergent from typical FatB thioesterases that release 16:0 from acyl carrier protein (ACP). Expression of CpuFatB3 and CvFatB1 resulted in Camelina oil with capric acid (10:0), and CpuFatB4 expression conferred myristic acid (14:0) production and increased 16:0. Co-expression of combinations of previously characterized Cuphea and California bay FatBs produced Camelina oils with mixtures of C8–C16 fatty acids, but amounts of each fatty acid were less than obtained by expression of individual FatB cDNAs. Increases in lauric acid (12:0) and 14:0, but not 10:0, in Camelina oil and at the sn-2 position of triacylglycerols resulted from inclusion of a coconut lysophosphatidic acid acyltransferase specialized for MCFAs. RNA interference (RNAi) suppression of Camelina β-ketoacyl-ACP synthase II, however, reduced 12:0 in seeds expressing a 12:0-ACP-specific FatB. Camelina lines presented here provide platforms for additional metabolic engineering targeting fatty acid synthase and specialized acyltransferases for achieving oils with high levels of jet fuel-type fatty acids. PMID:25969557

  20. Toward production of jet fuel functionality in oilseeds: Identification of FatB acyl-acyl carrier protein thioesterases and evaluation of combinatorial expression strategies in Camelina seeds

    DOE PAGES

    Kim, Hae Jin; Silva, Jillian E.; Vu, Hieu Sy; ...

    2015-05-11

    Seeds of members of the genus Cuphea accumulate medium-chain fatty acids (MCFAs; 8:0–14:0). MCFA- and palmitic acid- (16:0) rich vegetable oils have received attention for jet fuel production, given their similarity in chain length to Jet A fuel hydrocarbons. Studies were conducted to test genes, including those from Cuphea, for their ability to confer jet fuel-type fatty acid accumulation in seed oil of the emerging biofuel crop Camelina sativa. Transcriptomes from Cuphea viscosissima and Cuphea pulcherrima developing seeds that accumulate >90% of C8 and C10 fatty acids revealed three FatB cDNAs (CpuFatB3, CvFatB1, and CpuFatB4) expressed predominantly in seeds and structurally divergent from typical FatB thioesterases that release 16:0 from acyl carrier protein (ACP). Expression of CpuFatB3 and CvFatB1 resulted in Camelina oil with capric acid (10:0), and CpuFatB4 expression conferred myristic acid (14:0) production and increased 16:0. Co-expression of combinations of previously characterized Cuphea and California bay FatBs produced Camelina oils with mixtures of C8–C16 fatty acids, but amounts of each fatty acid were less than obtained by expression of individual FatB cDNAs. Increases in lauric acid (12:0) and 14:0, but not 10:0, in Camelina oil and at the sn-2 position of triacylglycerols resulted from inclusion of a coconut lysophosphatidic acid acyltransferase specialized for MCFAs. RNA interference (RNAi) suppression of Camelina β-ketoacyl-ACP synthase II, however, reduced 12:0 in seeds expressing a 12:0-ACP-specific FatB. The Camelina lines presented here provide platforms for additional metabolic engineering targeting fatty acid synthase and specialized acyltransferases for achieving oils with high levels of jet fuel-type fatty acids.

  2. Optimizing legacy molecular dynamics software with directive-based offload

    NASA Astrophysics Data System (ADS)

    Michael Brown, W.; Carrillo, Jan-Michael Y.; Gavhane, Nitin; Thakkar, Foram M.; Plimpton, Steven J.

    2015-10-01

    Directive-based programming models are one solution for exploiting many-core coprocessors to increase simulation rates in molecular dynamics. They offer the potential to reduce code complexity with offload models that can selectively target computations to run on the CPU, the coprocessor, or both. In this paper, we describe modifications to the LAMMPS molecular dynamics code to enable concurrent calculations on a CPU and coprocessor. We demonstrate that standard molecular dynamics algorithms can run efficiently on both the CPU and an x86-based coprocessor using the same subroutines. As a consequence, we demonstrate that code optimizations for the coprocessor also result in speedups on the CPU; in extreme cases up to 4.7X. We provide results for LAMMPS benchmarks and for production molecular dynamics simulations using the Stampede hybrid supercomputer with both Intel® Xeon Phi™ coprocessors and NVIDIA GPUs. The optimizations presented have increased simulation rates by over 2X for organic molecules and over 7X for liquid crystals on Stampede. The optimizations are available as part of the "Intel package" supplied with LAMMPS.

  3. Performance and scalability of Fourier domain optical coherence tomography acceleration using graphics processing units.

    PubMed

    Li, Jian; Bloch, Pavel; Xu, Jing; Sarunic, Marinko V; Shannon, Lesley

    2011-05-01

    Fourier domain optical coherence tomography (FD-OCT) provides faster line rates, better resolution, and higher sensitivity for noninvasive, in vivo biomedical imaging compared to traditional time domain OCT (TD-OCT). However, because the signal processing for FD-OCT is computationally intensive, real-time FD-OCT applications demand powerful computing platforms to deliver acceptable performance. Graphics processing units (GPUs) have been used as coprocessors to accelerate FD-OCT by leveraging their relatively simple programming model to exploit thread-level parallelism. Unfortunately, GPUs do not "share" memory with their host processors, requiring additional data transfers between the GPU and CPU. In this paper, we implement a complete FD-OCT accelerator on a consumer grade GPU/CPU platform. Our data acquisition system uses spectrometer-based detection and a dual-arm interferometer topology with numerical dispersion compensation for retinal imaging. We demonstrate that the maximum line rate is dictated by the memory transfer time and not the processing time due to the GPU platform's memory model. Finally, we discuss how the performance trends of GPU-based accelerators compare to the expected future requirements of FD-OCT data rates.

  4. Mcqueuer

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    2016-09-12

    Mcqueuer is a simple tool that allows anyone from researchers to experienced developers to create multi-node/multi-core jobs by simply creating a file with a list of commands. Users simply combine tasks, which would otherwise each be their own job on the cluster, into a single file that is given to Mcqueuer. Mcqueuer then does the heavy lifting required to process the tasks in parallel in a single multi-node job. In addition, Mcqueuer provides load balancing, which frees the user from having to worry about complex memory and CPU considerations and lets them focus on the processing itself.
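
    The pattern described above (a plain-text file of commands executed in parallel with dynamic load balancing) can be illustrated with the generic sketch below. It is not Mcqueuer's actual implementation or interface; the file name and worker count are placeholders.

        import subprocess
        from concurrent.futures import ProcessPoolExecutor

        def run_command(cmd):
            """Run one task line from the command file; return its exit code."""
            return subprocess.run(cmd, shell=True).returncode

        def run_command_file(path, workers=8):
            with open(path) as f:
                commands = [line.strip() for line in f
                            if line.strip() and not line.startswith("#")]
            # The pool hands out one command at a time as workers free up,
            # which is what provides the load balancing across cores.
            with ProcessPoolExecutor(max_workers=workers) as pool:
                return list(pool.map(run_command, commands))

        if __name__ == "__main__":
            print(run_command_file("tasks.txt"))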

  5. Cardiac computed tomography in patients with symptomatic new-onset atrial fibrillation, rule-out acute coronary syndrome, but with intermediate pretest probability for coronary artery disease admitted to a chest pain unit.

    PubMed

    Koopmann, Matthias; Hinrichs, Liane; Olligs, Jan; Lichtenberg, Michael; Eckardt, Lars; Böse, Dirk; Möhlenkamp, Stefan; Waltenberger, Johannes; Breuckmann, Frank

    2018-01-24

    Atrial fibrillation (AF) and coronary artery disease (CAD) may be encountered coincidently in a large portion of patients. However, data on coronary artery calcium burden in such patients are lacking. Thus, we sought to determine the value of cardiac computed tomography (CCT) in patients presenting with new-onset AF associated with an intermediate pretest probability for CAD admitted to a chest pain unit (CPU). Calcium scores (CS) of 73 new-onset, symptomatic AF subjects without typical clinical, electrocardiographic, or laboratory signs of acute coronary syndrome (ACS) admitted to our CPU were analyzed. In addition, results from computed tomography angiography (CTA) were related to coronary angiography findings whenever available. Calcium scores of zero were found in 25%. Median Agatston score was 77 (interquartile range: 1-270) with gender- and territory-specific dispersal. CS scores above average were present in about 50%, high (> 400)-to-very high (> 1000) CS scores were found in 22%. Overall percentile ranking showed a relative accordance to the reference percentile distribution. Additional CTA was performed in 47%, revealing stenoses in 12%. Coronary angiography was performed in 22% and resulted in coronary intervention or surgical revascularization in 7%. On univariate analysis, CS > 50th percentile failed to serve as an independent determinant of significant stenosis during catheterization. Within a CPU setting, relevant CAD was excluded or confirmed in almost 50%, the latter with a high proportion of coronary angiographies and subsequent coronary interventions, underlining the diagnostic value of CCT in symptomatic, non-ACS, new-onset AF patients when admitted to a CPU.

  6. Radiation hardened microprocessor for small payloads

    NASA Technical Reports Server (NTRS)

    Shah, Ravi

    1993-01-01

    The RH-3000 program is developing a rad-hard space qualified 32-bit MIPS R-3000 RISC processor under the Naval Research Lab sponsorship. In addition, under IR&D Harris is developing RHC-3000 for embedded control applications where low cost and radiation tolerance are primary concerns. The development program leverages heavily from commercial development of the MIPS R-3000. The commercial R-3000 has a large installed user base and several foundry partners are currently producing a wide variety of R-3000 derivative products. One of the MIPS derivative products, the LR33000 from LSI Logic, was used as the basis for the design of the RH-3000 chipset. The RH-3000 chipset consists of three core chips and two support chips. The core chips include the CPU, which is the R-3000 integer unit, and the FPA/MD chip pair, which performs the R-3010 floating point functions. The two support chips contain all the support functions required for fault tolerance support, real-time support, memory management, timers, and other functions. The Harris development effort achieved first-pass silicon success in June 1992 with the first rad-hard 32-bit RH-3000 CPU chip. The CPU device is 30 kgates, has a 508 mil by 503 mil die size and is fabricated at Harris Semiconductor on the rad-hard CMOS Silicon on Sapphire (SOS) process. The CPU device successfully passed testing against 600,000 test vectors derived directly from the LSI/MIPS test suite and has been operational as a single board computer running C code for the past year. In addition, the RH-3000 program has developed the methodology for converting commercially developed designs utilizing logic synthesis techniques based on a combination of VHDL and schematic databases.

  7. Double dissociation of the anterior and posterior dorsomedial caudate-putamen in the acquisition and expression of associative learning with the nicotine stimulus.

    PubMed

    Charntikov, Sergios; Pittenger, Steven T; Swalve, Natashia; Li, Ming; Bevins, Rick A

    2017-07-15

    Tobacco use is the leading cause of preventable deaths worldwide. This habit is not only debilitating to individual users but also to those around them (second-hand smoking). Nicotine is the main addictive component of tobacco products and is a moderate stimulant and a mild reinforcer. Importantly, besides its unconditional effects, nicotine also has conditioned stimulus effects that may contribute to the tenacity of the smoking habit. Because the neurobiological substrates underlying these processes are virtually unexplored, the present study investigated the functional involvement of the dorsomedial caudate putamen (dmCPu) in learning processes with nicotine as an interoceptive stimulus. Rats were trained using the discriminated goal-tracking task where nicotine injections (0.4 mg/kg; SC), on some days, were paired with intermittent (36 per session) sucrose deliveries; sucrose was not available on interspersed saline days. Pre-training excitotoxic or post-training transient lesions of anterior or posterior dmCPu were used to elucidate the role of these areas in acquisition or expression of associative learning with nicotine stimulus. Pre-training lesion of p-dmCPu inhibited acquisition while post-training lesions of p-dmCPu attenuated the expression of associative learning with the nicotine stimulus. On the other hand, post-training lesions of a-dmCPu evoked nicotine-like responding following saline treatment indicating the role of this area in disinhibition of learned motor behaviors. These results, for the first time, show functionally distinct involvement of a- and p-dmCPu in various stages of associative learning using nicotine stimulus and provide an initial account of neural plasticity underlying these learning processes. Copyright © 2017 Elsevier Ltd. All rights reserved.

  8. On the Finite Element Implementation of the Generalized Method of Cells Micromechanics Constitutive Model

    NASA Technical Reports Server (NTRS)

    Wilt, T. E.

    1995-01-01

    The Generalized Method of Cells (GMC), a micromechanics based constitutive model, is implemented into the finite element code MARC using the user subroutine HYPELA. Comparisons in terms of transverse deformation response, micro stress and strain distributions, and required CPU time are presented for GMC and finite element models of fiber/matrix unit cell. GMC is shown to provide comparable predictions of the composite behavior and requires significantly less CPU time as compared to a finite element analysis of the unit cell. Details as to the organization of the HYPELA code are provided with the actual HYPELA code included in the appendix.

  9. Techniques for increasing the efficiency of Earth gravity calculations for precision orbit determination

    NASA Technical Reports Server (NTRS)

    Smith, R. L.; Lyubomirsky, A. S.

    1981-01-01

    Two techniques were analyzed. The first is a representation using Chebyshev expansions in three-dimensional cells. The second technique employs a temporary file for storing the components of the nonspherical gravity force. Computer storage requirements and relative CPU time requirements are presented. The Chebyshev gravity representation can provide a significant reduction in CPU time in precision orbit calculations, but at the cost of a large amount of direct-access storage space, which is required for a global model.
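
    As a one-dimensional illustration of the first technique (the actual representation uses three-dimensional cells), the sketch below fits a Chebyshev series to an expensive force evaluation over a single cell and then evaluates the cheap polynomial in its place. The function, degree, and cell bounds are placeholders, not values from the paper.

        import numpy as np
        from numpy.polynomial import Chebyshev

        def expensive_accel(r):
            # stand-in for a costly nonspherical-gravity evaluation
            return -1.0 / r**2 + 1e-3 * np.sin(40.0 / r)

        cell = (6800.0, 7000.0)                        # one radial cell, km (illustrative)
        r_nodes = np.linspace(cell[0], cell[1], 200)
        cheb = Chebyshev.fit(r_nodes, expensive_accel(r_nodes), deg=12, domain=cell)

        r_query = np.linspace(cell[0], cell[1], 5)
        print(np.max(np.abs(cheb(r_query) - expensive_accel(r_query))))   # fit error inside the cell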

  10. An Xdata Architecture for Federated Graph Models and Multi-tier Asymmetric Computing

    DTIC Science & Technology

    2014-01-01

    Wikipedia, a scale-free random graph (kron), Akamai trace route data, Bitcoin transaction data, and a Twitter follower network. We present results for...3x (SSSP on a random graph) and nearly 300x (Akamai and Bitcoin ) over the CPU performance of a well-known and widely deployed CPU-based graph...provided better throughput for smaller frontiers such as roadmaps or the Bitcoin data set. In our work, we have focused on two-phase kernels, but it

  11. Association between problematic cellular phone use and suicide: the moderating effect of family function and depression.

    PubMed

    Wang, Peng-Wei; Liu, Tai-Ling; Ko, Chih-Hung; Lin, Huang-Chi; Huang, Mei-Feng; Yeh, Yi-Chun; Yen, Cheng-Fang

    2014-02-01

    Suicidal ideation and attempt among adolescents are risk factors for eventual completed suicide. Cellular phone use (CPU) has markedly changed the everyday lives of adolescents. Issues about how cellular phone use relates to adolescent mental health, such as suicidal ideation and attempts, are important because of the high rate of cellular phone usage among children in that age group. This study explored the association between problematic CPU and suicidal ideation and attempts among adolescents and investigated how family function and depression influence the association between problematic CPU and suicidal ideation and attempts. A total of 5051 (2872 girls and 2179 boys) adolescents who owned at least one cellular phone completed the research questionnaires. We collected data on participants' CPU and suicidal behavior (ideation and attempts) during the past month as well as information on family function and history of depression. Five hundred thirty-two adolescents (10.54%) had problematic CPU. The rates of suicidal ideation were 23.50% and 11.76% in adolescents with problematic CPU and without problematic CPU, respectively. The rates of suicidal attempts in both groups were 13.70% and 5.45%, respectively. Family function, but not depression, had a moderating effect on the association between problematic CPU and suicidal ideation and attempt. This study highlights the association between problematic CPU and suicidal ideation as well as attempts and indicates that good family function may have a more significant role on reducing the risks of suicidal ideation and attempts in adolescents with problematic CPU than in those without problematic CPU. © 2014.

  12. Evaluation of patients with methamphetamine- and cocaine-related chest pain in a chest pain observation unit.

    PubMed

    Diercks, Deborah B; Kirk, J Douglas; Turnipseed, Samuel D; Amsterdam, Ezra A

    2007-12-01

    Risk of acute coronary events in patients with methamphetamine and cocaine intoxication has been described. Little is known about the need for additional evaluation in these patients who do not have evidence of myocardial infarction after the initial emergency department evaluation. We herein describe our experience with these patients in a chest pain unit (CPU) and the rate of cardiac-related chest pain in this group. Retrospective analysis of patients evaluated in our CPU from January 1, 2000 to December 16, 2004 with a history of chest pain. Patients who had a positive urine toxicologic screen for methamphetamine or cocaine were included. No patients had ECG or cardiac injury marker evidence of myocardial infarction or ischemia during the initial emergency department evaluation. A diagnosis of cardiac-related chest pain was based upon positive diagnostic testing (exercise stress testing, nuclear perfusion imaging, stress echocardiography, or coronary artery stenosis >70%). During the study period, 4568 patients were evaluated in the CPU. A total of 1690 (37%) of patients admitted to the CPU underwent urine toxicologic testing. The result of urine toxicologic test was positive for cocaine or methamphetamine in 224 (5%). In the 2871 patients who underwent diagnostic testing for coronary artery disease (CAD), 401 (14%) were found to have positive results. There was no difference in the prevalence of CAD between those with positive result for toxicology screens (26/156, 17%) and those without (375/2715, 13%, RR 1.2, 95% CI 0.8-1.7). These findings suggest a relatively high rate of CAD in patients with methamphetamine and cocaine use evaluated in a CPU.

  13. CUDASW++ 3.0: accelerating Smith-Waterman protein database search by coupling CPU and GPU SIMD instructions.

    PubMed

    Liu, Yongchao; Wirawan, Adrianto; Schmidt, Bertil

    2013-04-04

    The maximal sensitivity for local alignments makes the Smith-Waterman algorithm a popular choice for protein sequence database search based on pairwise alignment. However, the algorithm is compute-intensive due to a quadratic time complexity. Corresponding runtimes are further compounded by the rapid growth of sequence databases. We present CUDASW++ 3.0, a fast Smith-Waterman protein database search algorithm, which couples CPU and GPU SIMD instructions and carries out concurrent CPU and GPU computations. For the CPU computation, this algorithm employs SSE-based vector execution units as accelerators. For the GPU computation, we have investigated for the first time a GPU SIMD parallelization, which employs CUDA PTX SIMD video instructions to gain more data parallelism beyond the SIMT execution model. Moreover, sequence alignment workloads are automatically distributed over CPUs and GPUs based on their respective compute capabilities. Evaluation on the Swiss-Prot database shows that CUDASW++ 3.0 gains a performance improvement over CUDASW++ 2.0 up to 2.9 and 3.2, with a maximum performance of 119.0 and 185.6 GCUPS, on a single-GPU GeForce GTX 680 and a dual-GPU GeForce GTX 690 graphics card, respectively. In addition, our algorithm has demonstrated significant speedups over other top-performing tools: SWIPE and BLAST+. CUDASW++ 3.0 is written in CUDA C++ and PTX assembly languages, targeting GPUs based on the Kepler architecture. This algorithm obtains significant speedups over its predecessor: CUDASW++ 2.0, by benefiting from the use of CPU and GPU SIMD instructions as well as the concurrent execution on CPUs and GPUs. The source code and the simulated data are available at http://cudasw.sourceforge.net.
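
    The abstract above notes that alignment workloads are distributed over CPUs and GPUs according to their respective compute capabilities. A minimal sketch of one way to do such a proportional split, based on measured throughput, is shown below; it is not CUDASW++'s actual distribution heuristic, and the throughput numbers are placeholders.

        def split_workload(num_seqs, cpu_gcups, gpu_gcups):
            """Split a batch between CPU and GPU in proportion to measured throughput."""
            gpu_share = gpu_gcups / (cpu_gcups + gpu_gcups)
            gpu_seqs = round(num_seqs * gpu_share)
            return num_seqs - gpu_seqs, gpu_seqs

        cpu_n, gpu_n = split_workload(100_000, cpu_gcups=30.0, gpu_gcups=120.0)
        print(cpu_n, gpu_n)   # -> 20000 80000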

  14. The association between problematic cellular phone use and risky behaviors and low self-esteem among Taiwanese adolescents.

    PubMed

    Yang, Yuan-Sheng; Yen, Ju-Yu; Ko, Chih-Hung; Cheng, Chung-Ping; Yen, Cheng-Fang

    2010-04-28

    Cellular phone use (CPU) is an important part of life for many adolescents. However, problematic CPU may complicate physiological and psychological problems. The aim of our study was to examine the associations between problematic CPU and a series of risky behaviors and low self-esteem in Taiwanese adolescents. A total of 11,111 adolescent students in Southern Taiwan were randomly selected for this study. We used the Problematic Cellular Phone Use Questionnaire to identify the adolescents with problematic CPU. Meanwhile, a series of risky behaviors and self-esteem were evaluated. Multilevel logistic regression analyses were employed to examine the associations between problematic CPU and risky behaviors and low self-esteem regarding gender and age. The results indicated that positive associations were found between problematic CPU and aggression, insomnia, smoking cigarettes, suicidal tendencies, and low self-esteem in all groups with different sexes and ages. However, gender and age differences existed in the associations between problematic CPU and suspension from school, criminal records, tattooing, short nocturnal sleep duration, unprotected sex, illicit drug use, drinking alcohol and chewing betel nuts. There were positive associations between problematic CPU and a series of risky behaviors and low self-esteem in Taiwanese adolescents. It is worthwhile for parents and mental health professionals to pay attention to adolescents' problematic CPU.

  15. Multiprocessing MCNP on an IBM RS/6000 cluster

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    McKinney, G.W.; West, J.T.

    1993-01-01

    The advent of high-performance computer systems has brought to maturity programming concepts like vectorization, multiprocessing, and multitasking. While there are many schools of thought as to the most significant factor in obtaining order-of-magnitude increases in performance, such speedup can only be achieved by integrating the computer system and application code. Vectorization leads to faster manipulation of arrays by overlapping instruction CPU cycles. Discrete ordinates codes, which require the solving of large matrices, have proved to be major benefactors of vectorization. Monte Carlo transport, on the other hand, typically contains numerous logic statements and requires extensive redevelopment to benefit from vectorization. Multiprocessing and multitasking provide additional CPU cycles via multiple processors. Such systems are generally designed with either common memory access (multitasking) or distributed memory access. In both cases, theoretical speedup, as a function of the number of processors P and the fraction f of task time that multiprocesses, can be formulated using Amdahl's law: S(f, P) = 1/(1 - f + f/P). However, for most applications, this theoretical limit cannot be achieved because of additional terms (e.g., multitasking overhead, memory overlap, etc.) that are not included in Amdahl's law. Monte Carlo transport is a natural candidate for multiprocessing because the particle tracks are generally independent, and the precision of the result increases as the square root of the number of particles tracked.

  16. Multiprocessing MCNP on an IBM RS/6000 cluster

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    McKinney, G.W.; West, J.T.

    1993-03-01

    The advent of high-performance computer systems has brought to maturity programming concepts like vectorization, multiprocessing, and multitasking. While there are many schools of thought as to the most significant factor in obtaining order-of-magnitude increases in performance, such speedup can only be achieved by integrating the computer system and application code. Vectorization leads to faster manipulation of arrays by overlapping instruction CPU cycles. Discrete ordinates codes, which require the solving of large matrices, have proved to be major benefactors of vectorization. Monte Carlo transport, on the other hand, typically contains numerous logic statements and requires extensive redevelopment to benefit from vectorization. Multiprocessing and multitasking provide additional CPU cycles via multiple processors. Such systems are generally designed with either common memory access (multitasking) or distributed memory access. In both cases, theoretical speedup, as a function of the number of processors (P) and the fraction of task time that multiprocesses (f), can be formulated using Amdahl's Law: S(f, P) = 1/(1 - f + f/P). However, for most applications this theoretical limit cannot be achieved, due to additional terms not included in Amdahl's Law. Monte Carlo transport is a natural candidate for multiprocessing, since the particle tracks are generally independent and the precision of the result increases as the square root of the number of particles tracked.

  17. Optimizing legacy molecular dynamics software with directive-based offload

    DOE PAGES

    Michael Brown, W.; Carrillo, Jan-Michael Y.; Gavhane, Nitin; ...

    2015-05-14

    Directive-based programming models are one solution for exploiting many-core coprocessors to increase simulation rates in molecular dynamics. They offer the potential to reduce code complexity with offload models that can selectively target computations to run on the CPU, the coprocessor, or both. In our paper, we describe modifications to the LAMMPS molecular dynamics code to enable concurrent calculations on a CPU and coprocessor. We also demonstrate that standard molecular dynamics algorithms can run efficiently on both the CPU and an x86-based coprocessor using the same subroutines. As a consequence, we demonstrate that code optimizations for the coprocessor also result in speedups on the CPU; in extreme cases up to 4.7X. We provide results for LAMMPS benchmarks and for production molecular dynamics simulations using the Stampede hybrid supercomputer with both Intel® Xeon Phi™ coprocessors and NVIDIA GPUs. The optimizations presented have increased simulation rates by over 2X for organic molecules and over 7X for liquid crystals on Stampede. The optimizations are available as part of the "Intel package" supplied with LAMMPS. © 2015 Elsevier B.V. All rights reserved.

  18. Simulation Testing of Embedded Flight Software

    NASA Technical Reports Server (NTRS)

    Shahabuddin, Mohammad; Reinholtz, William

    2004-01-01

    Virtual Real Time (VRT) is a computer program for testing embedded flight software by computational simulation in a workstation, in contradistinction to testing it in its target central processing unit (CPU). The disadvantages of testing in the target CPU include the need for an expensive test bed, the necessity for testers and programmers to take turns using the test bed, and the lack of software tools for debugging in a real-time environment. By virtue of its architecture, most of the flight software of the type in question is amenable to development and testing on workstations, for which there is an abundance of commercially available debugging and analysis software tools. Unfortunately, the timing of a workstation differs from that of a target CPU in a test bed. VRT, in conjunction with closed-loop simulation software, provides a capability for executing embedded flight software on a workstation in a close-to-real-time environment. A scale factor is used to convert between execution time in VRT on a workstation and execution on a target CPU. VRT includes high-resolution operating-system timers that enable the synchronization of flight software with simulation software and ground software, all running on different workstations.

  19. The association between problematic cellular phone use and risky behaviors and low self-esteem among Taiwanese adolescents

    PubMed Central

    2010-01-01

    Background Cellular phone use (CPU) is an important part of life for many adolescents. However, problematic CPU may complicate physiological and psychological problems. The aim of our study was to examine the associations between problematic CPU and a series of risky behaviors and low self-esteem in Taiwanese adolescents. Methods A total of 11,111 adolescent students in Southern Taiwan were randomly selected for this study. We used the Problematic Cellular Phone Use Questionnaire to identify the adolescents with problematic CPU. Meanwhile, a series of risky behaviors and self-esteem were evaluated. Multilevel logistic regression analyses were employed to examine the associations between problematic CPU and risky behaviors and low self-esteem regarding gender and age. Results The results indicated that positive associations were found between problematic CPU and aggression, insomnia, smoking cigarettes, suicidal tendencies, and low self-esteem in all groups with different sexes and ages. However, gender and age differences existed in the associations between problematic CPU and suspension from school, criminal records, tattooing, short nocturnal sleep duration, unprotected sex, illicit drug use, drinking alcohol and chewing betel nuts. Conclusions There were positive associations between problematic CPU and a series of risky behaviors and low self-esteem in Taiwanese adolescents. It is worthwhile for parents and mental health professionals to pay attention to adolescents' problematic CPU. PMID:20426807

  20. CPU-GPU hybrid accelerating the Zuker algorithm for RNA secondary structure prediction applications.

    PubMed

    Lei, Guoqing; Dou, Yong; Wan, Wen; Xia, Fei; Li, Rongchun; Ma, Meng; Zou, Dan

    2012-01-01

    Prediction of ribonucleic acid (RNA) secondary structure remains one of the most important research areas in bioinformatics. The Zuker algorithm is one of the most popular methods of free energy minimization for RNA secondary structure prediction. Thus far, few studies have been reported on the acceleration of the Zuker algorithm on general-purpose processors or on extra accelerators such as Field Programmable Gate-Array (FPGA) and Graphics Processing Units (GPU). To the best of our knowledge, no implementation combines both CPU and extra accelerators, such as GPUs, to accelerate the Zuker algorithm applications. In this paper, a CPU-GPU hybrid computing system that accelerates Zuker algorithm applications for RNA secondary structure prediction is proposed. The computing tasks are allocated between CPU and GPU for parallel cooperate execution. Performance differences between the CPU and the GPU in the task-allocation scheme are considered to obtain workload balance. To improve the hybrid system performance, the Zuker algorithm is optimally implemented with special methods for CPU and GPU architecture. Speedup of 15.93× over optimized multi-core SIMD CPU implementation and performance advantage of 16% over optimized GPU implementation are shown in the experimental results. More than 14% of the sequences are executed on CPU in the hybrid system. The system combining CPU and GPU to accelerate the Zuker algorithm is proven to be promising and can be applied to other bioinformatics applications.

  1. Accelerating Spaceborne SAR Imaging Using Multiple CPU/GPU Deep Collaborative Computing

    PubMed Central

    Zhang, Fan; Li, Guojun; Li, Wei; Hu, Wei; Hu, Yuxin

    2016-01-01

    With the development of synthetic aperture radar (SAR) technologies in recent years, the huge amount of remote sensing data brings challenges for real-time imaging processing. Therefore, high performance computing (HPC) methods have been presented to accelerate SAR imaging, especially the GPU based methods. In the classical GPU based imaging algorithm, GPU is employed to accelerate image processing by massive parallel computing, and CPU is only used to perform the auxiliary work such as data input/output (IO). However, the computing capability of CPU is ignored and underestimated. In this work, a new deep collaborative SAR imaging method based on multiple CPU/GPU is proposed to achieve real-time SAR imaging. Through the proposed tasks partitioning and scheduling strategy, the whole image can be generated with deep collaborative multiple CPU/GPU computing. In the part of CPU parallel imaging, the advanced vector extension (AVX) method is firstly introduced into the multi-core CPU parallel method for higher efficiency. As for the GPU parallel imaging, not only the bottlenecks of memory limitation and frequent data transferring are broken, but also kinds of optimized strategies are applied, such as streaming, parallel pipeline and so on. Experimental results demonstrate that the deep CPU/GPU collaborative imaging method enhances the efficiency of SAR imaging on single-core CPU by 270 times and realizes the real-time imaging in that the imaging rate outperforms the raw data generation rate. PMID:27070606

  2. Accelerating Spaceborne SAR Imaging Using Multiple CPU/GPU Deep Collaborative Computing.

    PubMed

    Zhang, Fan; Li, Guojun; Li, Wei; Hu, Wei; Hu, Yuxin

    2016-04-07

    With the development of synthetic aperture radar (SAR) technologies in recent years, the huge amount of remote sensing data brings challenges for real-time imaging processing. Therefore, high performance computing (HPC) methods have been presented to accelerate SAR imaging, especially the GPU based methods. In the classical GPU based imaging algorithm, GPU is employed to accelerate image processing by massive parallel computing, and CPU is only used to perform the auxiliary work such as data input/output (IO). However, the computing capability of CPU is ignored and underestimated. In this work, a new deep collaborative SAR imaging method based on multiple CPU/GPU is proposed to achieve real-time SAR imaging. Through the proposed tasks partitioning and scheduling strategy, the whole image can be generated with deep collaborative multiple CPU/GPU computing. In the part of CPU parallel imaging, the advanced vector extension (AVX) method is firstly introduced into the multi-core CPU parallel method for higher efficiency. As for the GPU parallel imaging, not only the bottlenecks of memory limitation and frequent data transferring are broken, but also kinds of optimized strategies are applied, such as streaming, parallel pipeline and so on. Experimental results demonstrate that the deep CPU/GPU collaborative imaging method enhances the efficiency of SAR imaging on single-core CPU by 270 times and realizes the real-time imaging in that the imaging rate outperforms the raw data generation rate.

  3. On Convergence Acceleration Techniques for Unstructured Meshes

    NASA Technical Reports Server (NTRS)

    Mavriplis, Dimitri J.

    1998-01-01

    A discussion of convergence acceleration techniques as they relate to computational fluid dynamics problems on unstructured meshes is given. Rather than providing a detailed description of particular methods, the various building blocks of current solution techniques are discussed and examples of solution strategies using one or several of these ideas are given. Issues relating to unstructured grid CFD problems are given additional consideration, including suitability of algorithms to current hardware trends, memory and CPU tradeoffs, treatment of non-linearities, and the development of efficient strategies for handling anisotropy-induced stiffness. The outlook for future potential improvements is also discussed.

  4. Toward production of jet fuel functionality in oilseeds: identification of FatB acyl-acyl carrier protein thioesterases and evaluation of combinatorial expression strategies in Camelina seeds.

    PubMed

    Kim, Hae Jin; Silva, Jillian E; Vu, Hieu Sy; Mockaitis, Keithanne; Nam, Jeong-Won; Cahoon, Edgar B

    2015-07-01

    Seeds of members of the genus Cuphea accumulate medium-chain fatty acids (MCFAs; 8:0-14:0). MCFA- and palmitic acid- (16:0) rich vegetable oils have received attention for jet fuel production, given their similarity in chain length to Jet A fuel hydrocarbons. Studies were conducted to test genes, including those from Cuphea, for their ability to confer jet fuel-type fatty acid accumulation in seed oil of the emerging biofuel crop Camelina sativa. Transcriptomes from Cuphea viscosissima and Cuphea pulcherrima developing seeds that accumulate >90% of C8 and C10 fatty acids revealed three FatB cDNAs (CpuFatB3, CvFatB1, and CpuFatB4) expressed predominantly in seeds and structurally divergent from typical FatB thioesterases that release 16:0 from acyl carrier protein (ACP). Expression of CpuFatB3 and CvFatB1 resulted in Camelina oil with capric acid (10:0), and CpuFatB4 expression conferred myristic acid (14:0) production and increased 16:0. Co-expression of combinations of previously characterized Cuphea and California bay FatBs produced Camelina oils with mixtures of C8-C16 fatty acids, but amounts of each fatty acid were less than obtained by expression of individual FatB cDNAs. Increases in lauric acid (12:0) and 14:0, but not 10:0, in Camelina oil and at the sn-2 position of triacylglycerols resulted from inclusion of a coconut lysophosphatidic acid acyltransferase specialized for MCFAs. RNA interference (RNAi) suppression of Camelina β-ketoacyl-ACP synthase II, however, reduced 12:0 in seeds expressing a 12:0-ACP-specific FatB. Camelina lines presented here provide platforms for additional metabolic engineering targeting fatty acid synthase and specialized acyltransferases for achieving oils with high levels of jet fuel-type fatty acids. © The Author 2015. Published by Oxford University Press on behalf of the Society for Experimental Biology.

  5. Testing and Validating Gadget2 for GPUs

    NASA Astrophysics Data System (ADS)

    Wibking, Benjamin; Holley-Bockelmann, K.; Berlind, A. A.

    2013-01-01

    We are currently upgrading a version of Gadget2 (Springel et al., 2005) that is optimized for NVIDIA's CUDA GPU architecture (Frigaard, unpublished) to work with the latest libraries and graphics cards. Preliminary tests of its performance indicate a ~40x speedup in the particle force tree approximation calculation, with overall speedup of 5-10x for cosmological simulations run with GPUs compared to running on the same CPU cores without GPU acceleration. We believe this speedup can be reasonably increased by an additional factor of two with further optimization, including overlap of computation on CPU and GPU. Tests of single-precision GPU numerical fidelity currently indicate accuracy of the mass function and the spectral power density to within a few percent of extended-precision CPU results with the unmodified form of Gadget. Additionally, we plan to test and optimize the GPU code for Millennium-scale "grand challenge" simulations of >10^9 particles, a scale that has been previously untested with this code, with the aid of the NSF XSEDE flagship GPU-based supercomputing cluster codenamed "Keeneland." Current work involves additional validation of numerical results, extending the numerical precision of the GPU calculations to double precision, and evaluating performance/accuracy tradeoffs. We believe that this project, if successful, will yield substantial computational performance benefits to the N-body research community as the next generation of GPU supercomputing resources becomes available, both increasing the electrical power efficiency of ever-larger computations (making simulations possible a decade from now at scales and resolutions unavailable today) and accelerating the pace of research in the field.

  6. OpenMP GNU and Intel Fortran programs for solving the time-dependent Gross-Pitaevskii equation

    NASA Astrophysics Data System (ADS)

    Young-S., Luis E.; Muruganandam, Paulsamy; Adhikari, Sadhan K.; Lončar, Vladimir; Vudragović, Dušan; Balaž, Antun

    2017-11-01

    We present Open Multi-Processing (OpenMP) version of Fortran 90 programs for solving the Gross-Pitaevskii (GP) equation for a Bose-Einstein condensate in one, two, and three spatial dimensions, optimized for use with GNU and Intel compilers. We use the split-step Crank-Nicolson algorithm for imaginary- and real-time propagation, which enables efficient calculation of stationary and non-stationary solutions, respectively. The present OpenMP programs are designed for computers with multi-core processors and optimized for compiling with both commercially-licensed Intel Fortran and popular free open-source GNU Fortran compiler. The programs are easy to use and are elaborated with helpful comments for the users. All input parameters are listed at the beginning of each program. Different output files provide physical quantities such as energy, chemical potential, root-mean-square sizes, densities, etc. We also present speedup test results for new versions of the programs. Program files doi:http://dx.doi.org/10.17632/y8zk3jgn84.2 Licensing provisions: Apache License 2.0 Programming language: OpenMP GNU and Intel Fortran 90. Computer: Any multi-core personal computer or workstation with the appropriate OpenMP-capable Fortran compiler installed. Number of processors used: All available CPU cores on the executing computer. Journal reference of previous version: Comput. Phys. Commun. 180 (2009) 1888; ibid.204 (2016) 209. Does the new version supersede the previous version?: Not completely. It does supersede previous Fortran programs from both references above, but not OpenMP C programs from Comput. Phys. Commun. 204 (2016) 209. Nature of problem: The present Open Multi-Processing (OpenMP) Fortran programs, optimized for use with commercially-licensed Intel Fortran and free open-source GNU Fortran compilers, solve the time-dependent nonlinear partial differential (GP) equation for a trapped Bose-Einstein condensate in one (1d), two (2d), and three (3d) spatial dimensions for six different trap symmetries: axially and radially symmetric traps in 3d, circularly symmetric traps in 2d, fully isotropic (spherically symmetric) and fully anisotropic traps in 2d and 3d, as well as 1d traps, where no spatial symmetry is considered. Solution method: We employ the split-step Crank-Nicolson algorithm to discretize the time-dependent GP equation in space and time. The discretized equation is then solved by imaginary- or real-time propagation, employing adequately small space and time steps, to yield the solution of stationary and non-stationary problems, respectively. Reasons for the new version: Previously published Fortran programs [1,2] have now become popular tools [3] for solving the GP equation. These programs have been translated to the C programming language [4] and later extended to the more complex scenario of dipolar atoms [5]. Now virtually all computers have multi-core processors and some have motherboards with more than one physical computer processing unit (CPU), which may increase the number of available CPU cores on a single computer to several tens. The C programs have been adopted to be very fast on such multi-core modern computers using general-purpose graphic processing units (GPGPU) with Nvidia CUDA and computer clusters using Message Passing Interface (MPI) [6]. Nevertheless, previously developed Fortran programs are also commonly used for scientific computation and most of them use a single CPU core at a time in modern multi-core laptops, desktops, and workstations. 
Unless the Fortran programs are made aware and capable of making efficient use of the available CPU cores, the solution of even a realistic dynamical 1d problem, not to mention the more complicated 2d and 3d problems, could be time consuming using the Fortran programs. Previously, we published auto-parallel Fortran programs [2] suitable for Intel (but not GNU) compiler for solving the GP equation. Hence, a need for the full OpenMP version of the Fortran programs to reduce the execution time cannot be overemphasized. To address this issue, we provide here such OpenMP Fortran programs, optimized for both Intel and GNU Fortran compilers and capable of using all available CPU cores, which can significantly reduce the execution time. Summary of revisions: Previous Fortran programs [1] for solving the time-dependent GP equation in 1d, 2d, and 3d with different trap symmetries have been parallelized using the OpenMP interface to reduce the execution time on multi-core processors. There are six different trap symmetries considered, resulting in six programs for imaginary-time propagation and six for real-time propagation, totaling to 12 programs included in BEC-GP-OMP-FOR software package. All input data (number of atoms, scattering length, harmonic oscillator trap length, trap anisotropy, etc.) are conveniently placed at the beginning of each program, as before [2]. Present programs introduce a new input parameter, which is designated by Number_of_Threads and defines the number of CPU cores of the processor to be used in the calculation. If one sets the value 0 for this parameter, all available CPU cores will be used. For the most efficient calculation it is advisable to leave one CPU core unused for the background system's jobs. For example, on a machine with 20 CPU cores such that we used for testing, it is advisable to use up to 19 CPU cores. However, the total number of used CPU cores can be divided into more than one job. For instance, one can run three simulations simultaneously using 10, 4, and 5 CPU cores, respectively, thus totaling to 19 used CPU cores on a 20-core computer. The Fortran source programs are located in the directory src, and can be compiled by the make command using the makefile in the root directory BEC-GP-OMP-FOR of the software package. The examples of produced output files can be found in the directory output, although some large density files are omitted, to save space. The programs calculate the values of actually used dimensionless nonlinearities from the physical input parameters, where the input parameters correspond to the identical nonlinearity values as in the previously published programs [1], so that the output files of the old and new programs can be directly compared. The output files are conveniently named such that their contents can be easily identified, following the naming convention introduced in Ref. [2]. For example, a file named -out.txt, where is a name of the individual program, represents the general output file containing input data, time and space steps, nonlinearity, energy and chemical potential, and was named fort.7 in the old Fortran version of programs [1]. A file named -den.txt is the output file with the condensate density, which had the names fort.3 and fort.4 in the old Fortran version [1] for imaginary- and real-time propagation programs, respectively. 
Other possible density outputs, such as the initial density, are commented out in the programs to have a simpler set of output files, but users can uncomment and re-enable them, if needed. In addition, there are output files for reduced (integrated) 1d and 2d densities for different programs. In the real-time programs there is also an output file reporting the dynamics of evolution of root-mean-square sizes after a perturbation is introduced. The supplied real-time programs solve the stationary GP equation, and then calculate the dynamics. As the imaginary-time programs are more accurate than the real-time programs for the solution of a stationary problem, one can first solve the stationary problem using the imaginary-time programs, adapt the real-time programs to read the pre-calculated wave function and then study the dynamics. In that case the parameter NSTP in the real-time programs should be set to zero and the space mesh and nonlinearity parameters should be identical in both programs. The reader is advised to consult our previous publication where a complete description of the output files is given [2]. A readme.txt file, included in the root directory, explains the procedure to compile and run the programs. We tested our programs on a workstation with two 10-core Intel Xeon E5-2650 v3 CPUs. The parameters used for testing are given in sample input files, provided in the corresponding directory together with the programs. In Table 1 we present wall-clock execution times for runs on 1, 6, and 19 CPU cores for programs compiled using Intel and GNU Fortran compilers. The corresponding columns "Intel speedup" and "GNU speedup" give the ratio of wall-clock execution times of runs on 1 and 19 CPU cores, and denote the actual measured speedup for 19 CPU cores. In all cases and for all numbers of CPU cores, although the GNU Fortran compiler gives excellent results, the Intel Fortran compiler turns out to be slightly faster. Note that during these tests we always ran only a single simulation on a workstation at a time, to avoid any possible interference issues. Therefore, the obtained wall-clock times are more reliable than the ones that could be measured with two or more jobs running simultaneously. We also studied the speedup of the programs as a function of the number of CPU cores used. The performance of the Intel and GNU Fortran compilers is illustrated in Fig. 1, where we plot the speedup and actual wall-clock times as functions of the number of CPU cores for 2d and 3d programs. We see that the speedup increases monotonically with the number of CPU cores in all cases and has large values (between 10 and 14 for 3d programs) for the maximal number of cores. This fully justifies the development of OpenMP programs, which enable much faster and more efficient solving of the GP equation. However, a slow saturation in the speedup with the further increase in the number of CPU cores is observed in all cases, as expected. The speedup tends to increase for programs in higher dimensions, as they become more complex and have to process more data. This is why the speedups of the supplied 2d and 3d programs are larger than those of 1d programs. Also, for a single program the speedup increases with the size of the spatial grid, i.e., with the number of spatial discretization points, since this increases the amount of calculations performed by the program. To demonstrate this, we tested the supplied real2d-th program and varied the number of spatial discretization points NX=NY from 20 to 1000. 
The measured speedup obtained when running this program on 19 CPU cores as a function of the number of discretization points is shown in Fig. 2. The speedup first increases rapidly with the number of discretization points and eventually saturates. Additional comments: Example inputs provided with the programs take less than 30 minutes to run on a workstation with two Intel Xeon E5-2650 v3 processors (2 QPI links, 10 CPU cores, 25 MB cache, 2.3 GHz).
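
    Since the summary above revolves around the split-step Crank-Nicolson scheme, the following single-core NumPy sketch of imaginary-time propagation for the 1d GP equation in a harmonic trap may help fix the idea. It is an illustration only, not the published Fortran code; the grid size, time step, nonlinearity g, and trap are placeholders, and a production implementation would use a tridiagonal (Thomas) solver rather than dense matrices.

        import numpy as np

        # Illustrative grid, trap, and interaction strength (not values from the paper)
        N, L, dt, g = 256, 16.0, 1e-3, 10.0
        x = np.linspace(-L / 2, L / 2, N)
        dx = x[1] - x[0]
        V = 0.5 * x**2                          # harmonic trap

        # Crank-Nicolson matrices for the kinetic step of the imaginary-time GP equation:
        # (I - dt/4 * D2) psi_new = (I + dt/4 * D2) psi, with D2 the second-difference operator
        D2 = (np.diag(np.ones(N - 1), -1) - 2.0 * np.eye(N) + np.diag(np.ones(N - 1), 1)) / dx**2
        A = np.eye(N) - 0.25 * dt * D2
        B = np.eye(N) + 0.25 * dt * D2
        cn_step = np.linalg.solve(A, B)         # dense for brevity; real codes solve a tridiagonal system

        psi = np.exp(-x**2)                     # initial guess
        psi /= np.sqrt(np.sum(psi**2) * dx)

        for _ in range(5000):
            psi *= np.exp(-0.5 * dt * (V + g * psi**2))   # half step of trap + nonlinearity
            psi = cn_step @ psi                            # Crank-Nicolson kinetic step
            psi *= np.exp(-0.5 * dt * (V + g * psi**2))   # second half step
            psi /= np.sqrt(np.sum(psi**2) * dx)            # renormalize (imaginary time)

        mu = np.sum(0.5 * np.gradient(psi, dx)**2 + (V + g * psi**2) * psi**2) * dx
        print("chemical potential estimate:", mu)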

  7. Symptoms of Problematic Cellular Phone Use, Functional Impairment and Its Association with Depression among Adolescents in Southern Taiwan

    ERIC Educational Resources Information Center

    Yen, Cheng-Fang; Tang, Tze-Chun; Yen, Ju-Yu; Lin, Huang-Chi; Huang, Chi-Fen; Liu, Shu-Chun; Ko, Chih-Hung

    2009-01-01

    The aims of this study were: (1) to examine the prevalence of symptoms of problematic cellular phone use (CPU); (2) to examine the associations between the symptoms of problematic CPU, functional impairment caused by CPU and the characteristics of CPU; (3) to establish the optimal cut-off point of the number of symptoms for functional impairment…

  8. Synthesis and characterization of conductive, biodegradable, elastomeric polyurethanes for biomedical applications.

    PubMed

    Xu, Cancan; Yepez, Gerardo; Wei, Zi; Liu, Fuqiang; Bugarin, Alejandro; Hong, Yi

    2016-09-01

    Biodegradable conductive polymers are currently of significant interest in tissue repair and regeneration, drug delivery, and bioelectronics. However, biodegradable materials exhibiting both conductive and elastic properties have rarely been reported to date. To that end, an electrically conductive polyurethane (CPU) was synthesized from polycaprolactone diol, hexadiisocyanate, and aniline trimer and subsequently doped with (1S)-(+)-10-camphorsulfonic acid (CSA). All CPU films showed good elasticity within a 30% strain range. The electrical conductivity of the CPU films, as enhanced with increasing amounts of CSA, ranged from 2.7 ± 0.9 × 10⁻¹⁰ to 4.4 ± 0.6 × 10⁻⁷ S/cm in a dry state and 4.2 ± 0.5 × 10⁻⁸ to 7.3 ± 1.5 × 10⁻⁵ S/cm in a wet state. The redox peaks of a CPU1.5 film (molar ratio CSA:aniline trimer = 1.5:1) in the cyclic voltammogram confirmed the desired good electroactivity. The doped CPU film exhibited good electrical stability (87% of initial conductivity after 150 hours of charging) as measured in a cell culture medium. The degradation rates of CPU films increased with increasing CSA content in both phosphate-buffered solution (PBS) and lipase/PBS solutions. After 7 days of enzymatic degradation, the conductivity of all CSA-doped CPU films had decreased to that of the undoped CPU film. Mouse 3T3 fibroblasts proliferated and spread on all CPU films. This developed biodegradable CPU with good elasticity, electrical stability, and biocompatibility may find potential applications in tissue engineering, smart drug release, and electronics. © 2016 Wiley Periodicals, Inc. J Biomed Mater Res Part A: 104A: 2305-2314, 2016.

  9. CPU-GPU hybrid accelerating the Zuker algorithm for RNA secondary structure prediction applications

    PubMed Central

    2012-01-01

    Background Prediction of ribonucleic acid (RNA) secondary structure remains one of the most important research areas in bioinformatics. The Zuker algorithm is one of the most popular methods of free energy minimization for RNA secondary structure prediction. Thus far, few studies have been reported on the acceleration of the Zuker algorithm on general-purpose processors or on extra accelerators such as Field Programmable Gate Arrays (FPGAs) and Graphics Processing Units (GPUs). To the best of our knowledge, no implementation combines both CPU and extra accelerators, such as GPUs, to accelerate Zuker algorithm applications. Results In this paper, a CPU-GPU hybrid computing system that accelerates Zuker algorithm applications for RNA secondary structure prediction is proposed. The computing tasks are allocated between CPU and GPU for parallel cooperative execution. Performance differences between the CPU and the GPU in the task-allocation scheme are considered to obtain workload balance. To improve the hybrid system performance, the Zuker algorithm is optimally implemented with special methods for the CPU and GPU architectures. Conclusions The experimental results show a speedup of 15.93× over an optimized multi-core SIMD CPU implementation and a performance advantage of 16% over an optimized GPU implementation. More than 14% of the sequences are executed on the CPU in the hybrid system. The system combining CPU and GPU to accelerate the Zuker algorithm is shown to be promising and can be applied to other bioinformatics applications. PMID:22369626
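    The task-allocation idea above, splitting a batch of sequences between CPU and GPU in proportion to their measured throughputs so that both devices finish at roughly the same time, can be sketched in a few lines of Python. This is an illustrative reconstruction only: the throughput figures are hypothetical and the paper's actual scheduler also accounts for per-sequence characteristics.

        # Hypothetical static workload split between CPU and GPU workers.
        def split_workload(n_tasks, cpu_rate, gpu_rate):
            """Return (n_cpu, n_gpu) so both devices take roughly equal wall-clock time."""
            cpu_share = cpu_rate / (cpu_rate + gpu_rate)
            n_cpu = round(n_tasks * cpu_share)
            return n_cpu, n_tasks - n_cpu

        # Example: a GPU that processes sequences several times faster than the CPU.
        n_cpu, n_gpu = split_workload(n_tasks=10000, cpu_rate=150.0, gpu_rate=850.0)
        print(f"CPU gets {n_cpu} sequences, GPU gets {n_gpu}")   # ~15% go to the CPU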

  10. An efficient mixed-precision, hybrid CPU-GPU implementation of a nonlinearly implicit one-dimensional particle-in-cell algorithm

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Chen, Guangye; Chacon, Luis; Barnes, Daniel C

    2012-01-01

    Recently, a fully implicit, energy- and charge-conserving particle-in-cell method has been developed for multi-scale, full-f kinetic simulations [G. Chen, et al., J. Comput. Phys. 230, 18 (2011)]. The method employs a Jacobian-free Newton-Krylov (JFNK) solver and is capable of using very large timesteps without loss of numerical stability or accuracy. A fundamental feature of the method is the segregation of particle orbit integrations from the field solver, while remaining fully self-consistent. This provides great flexibility, and dramatically improves the solver efficiency by reducing the degrees of freedom of the associated nonlinear system. However, it requires a particle push per nonlinear residual evaluation, which makes the particle push the most time-consuming operation in the algorithm. This paper describes a very efficient mixed-precision, hybrid CPU-GPU implementation of the implicit PIC algorithm. The JFNK solver is kept on the CPU (in double precision), while the inherent data parallelism of the particle mover is exploited by implementing it in single precision on a graphics processing unit (GPU) using CUDA. Performance-oriented optimizations, with the aid of an analytical performance model, the roofline model, are employed. Despite being highly dynamic, the adaptive, charge-conserving particle mover algorithm achieves up to 300-400 GOp/s (including single-precision floating-point, integer, and logic operations) on an Nvidia GeForce GTX580, corresponding to 20-25% absolute GPU efficiency (against the peak theoretical performance) and 50-70% intrinsic efficiency (against the algorithm's maximum operational throughput, which neglects all latencies). This is about 200-300 times faster than an equivalent serial CPU implementation. When the single-precision GPU particle mover is combined with a double-precision CPU JFNK field solver, overall performance gains of about 100× vs. the double-precision CPU-only serial version are obtained, with no apparent loss of robustness or accuracy when applied to a challenging long-time scale ion acoustic wave simulation.
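    The mixed-precision split described above can be mimicked on the CPU alone with NumPy: field-level quantities stay in double precision while the particle push runs in single precision. The sketch below is a toy 1-D electrostatic push, not the paper's charge-conserving mover, and the grid size, particle count, and time step are arbitrary illustrative values.

        # Toy mixed-precision particle push: float64 field, float32 particles.
        import numpy as np

        nx, n_part, dt, dx = 64, 100_000, 0.1, 1.0
        rng = np.random.default_rng(0)

        # "Field solver" side: double-precision electric field on the grid.
        E_grid = np.sin(2 * np.pi * np.arange(nx) / nx).astype(np.float64)

        # "Particle mover" side: positions and velocities in single precision.
        x = (rng.random(n_part) * nx * dx).astype(np.float32)
        v = rng.standard_normal(n_part).astype(np.float32)

        def push(x, v, E_grid):
            """One push: gather E by linear interpolation in float32, then advance."""
            E32 = E_grid.astype(np.float32)            # demote the field once per push
            i = (x / dx).astype(np.int32) % nx
            w = (x / dx - np.floor(x / dx)).astype(np.float32)
            E_at_x = (1 - w) * E32[i] + w * E32[(i + 1) % nx]
            v = v + dt * E_at_x                        # charge/mass = 1 for simplicity
            x = (x + dt * v) % (nx * dx)               # periodic domain
            return x, v

        x, v = push(x, v, E_grid)
        print(x.dtype, v.dtype, E_grid.dtype)          # float32 float32 float64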

  11. Improving the performance of heterogeneous multi-core processors by modifying the cache coherence protocol

    NASA Astrophysics Data System (ADS)

    Fang, Juan; Hao, Xiaoting; Fan, Qingwen; Chang, Zeqing; Song, Shuying

    2017-05-01

    In heterogeneous multi-core architectures, CPU and GPU processors are integrated on the same chip, which poses a new challenge to last-level cache (LLC) management. In this architecture, CPU applications and GPU applications execute concurrently and access the last-level cache. CPU and GPU have different memory access characteristics, and thus differ in their sensitivity to LLC capacity. For many CPU applications, a reduced share of the LLC can lead to significant performance degradation. On the contrary, GPU applications can tolerate increased memory access latency when there is sufficient thread-level parallelism. Taking into account this latency-tolerance characteristic of GPU programs, this paper presents a method that lets GPU applications access memory directly, bypassing the LLC and leaving more LLC space for CPU applications, which improves the performance of CPU applications without affecting the performance of GPU applications. When the CPU application is cache sensitive and the GPU application is insensitive to the cache, the overall performance of the system is improved significantly.
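    The bypass policy can be caricatured as a simple routing rule, as in the hypothetical Python sketch below; the sensitivity threshold and the way applications are classified are invented for illustration and are not taken from the paper.

        # Toy policy: route a request around the shared LLC when it comes from a
        # latency-tolerant (cache-insensitive) GPU application.
        def route_request(app_kind, cache_sensitivity):
            """Return 'LLC' or 'bypass' for a memory request from the given application."""
            if app_kind == "GPU" and cache_sensitivity < 0.2:   # hypothetical threshold
                return "bypass"                                 # go straight to memory
            return "LLC"

        print(route_request("GPU", 0.05))   # bypass
        print(route_request("CPU", 0.90))   # LLC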

  12. Large-scale neural circuit mapping data analysis accelerated with the graphical processing unit (GPU).

    PubMed

    Shi, Yulin; Veidenbaum, Alexander V; Nicolau, Alex; Xu, Xiangmin

    2015-01-15

    Modern neuroscience research demands computing power. Neural circuit mapping studies such as those using laser scanning photostimulation (LSPS) produce large amounts of data and require intensive computation for post hoc processing and analysis. Here we report on the design and implementation of a cost-effective desktop computer system for accelerated experimental data processing with recent GPU computing technology. A new version of Matlab software with GPU enabled functions is used to develop programs that run on Nvidia GPUs to harness their parallel computing power. We evaluated both the central processing unit (CPU) and GPU-enabled computational performance of our system in benchmark testing and practical applications. The experimental results show that the GPU-CPU co-processing of simulated data and actual LSPS experimental data clearly outperformed the multi-core CPU with up to a 22× speedup, depending on computational tasks. Further, we present a comparison of numerical accuracy between GPU and CPU computation to verify the precision of GPU computation. In addition, we show how GPUs can be effectively adapted to improve the performance of commercial image processing software such as Adobe Photoshop. To our best knowledge, this is the first demonstration of GPU application in neural circuit mapping and electrophysiology-based data processing. Together, GPU enabled computation enhances our ability to process large-scale data sets derived from neural circuit mapping studies, allowing for increased processing speeds while retaining data precision. Copyright © 2014 Elsevier B.V. All rights reserved.

  13. Large scale neural circuit mapping data analysis accelerated with the graphical processing unit (GPU)

    PubMed Central

    Shi, Yulin; Veidenbaum, Alexander V.; Nicolau, Alex; Xu, Xiangmin

    2014-01-01

    Background Modern neuroscience research demands computing power. Neural circuit mapping studies such as those using laser scanning photostimulation (LSPS) produce large amounts of data and require intensive computation for post-hoc processing and analysis. New Method Here we report on the design and implementation of a cost-effective desktop computer system for accelerated experimental data processing with recent GPU computing technology. A new version of Matlab software with GPU enabled functions is used to develop programs that run on Nvidia GPUs to harness their parallel computing power. Results We evaluated both the central processing unit (CPU) and GPU-enabled computational performance of our system in benchmark testing and practical applications. The experimental results show that the GPU-CPU co-processing of simulated data and actual LSPS experimental data clearly outperformed the multi-core CPU with up to a 22x speedup, depending on computational tasks. Further, we present a comparison of numerical accuracy between GPU and CPU computation to verify the precision of GPU computation. In addition, we show how GPUs can be effectively adapted to improve the performance of commercial image processing software such as Adobe Photoshop. Comparison with Existing Method(s) To our best knowledge, this is the first demonstration of GPU application in neural circuit mapping and electrophysiology-based data processing. Conclusions Together, GPU enabled computation enhances our ability to process large-scale data sets derived from neural circuit mapping studies, allowing for increased processing speeds while retaining data precision. PMID:25277633

  14. GO, an exec for running the programs: CELL, COLLIDER, MAGIC, PATRICIA, PETROS, TRANSPORT, and TURTLE

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Shoaee, H.

    1982-05-01

    An exec has been written and placed on the PEP group's public disk to facilitate the use of several PEP related computer programs available on VM. The exec's program list currently includes: CELL, COLLIDER, MAGIC, PATRICIA, PETROS, TRANSPORT, and TURTLE. In addition, provisions have been made to allow addition of new programs to this list as they become available. The GO exec is directly callable from inside the Wylbur editor (in fact, currently this is the only way to use the GO exec). It provides the option of running any of the above programs in either interactive or batch mode. In the batch mode, the GO exec sends the data in the Wylbur active file along with the information required to run the job to the batch monitor (BMON, a virtual machine that schedules and controls execution of batch jobs). This enables the user to proceed with other VM activities at his/her terminal while the job executes, thus making it of particular interest to the users with jobs requiring much CPU time to execute and/or those wishing to run multiple jobs independently. In the interactive mode, useful for small jobs requiring less CPU time, the job is executed by the user's own Virtual Machine using the data in the active file as input. At the termination of an interactive job, the GO exec facilitates examination of the output by placing it in the Wylbur active file.

  15. GO, an exec for running the programs: CELL, COLLIDER, MAGIC, PATRICIA, PETROS, TRANSPORT and TURTLE

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Shoaee, H.

    1982-05-01

    An exec has been written and placed on the PEP group's public disk (PUBRL 192) to facilitate the use of several PEP related computer programs available on VM. The exec's program list currently includes: CELL, COLLIDER, MAGIC, PATRICIA, PETROS, TRANSPORT, and TURTLE. In addition, provisions have been made to allow addition of new programs to this list as they become available. The GO exec is directly callable from inside the Wylbur editor (in fact, currently this is the only way to use the GO exec). It provides the option of running any of the above programs in either interactive or batch mode. In the batch mode, the GO exec sends the data in the Wylbur active file along with the information required to run the job to the batch monitor (BMON, a virtual machine that schedules and controls execution of batch jobs). This enables the user to proceed with other VM activities at his/her terminal while the job executes, thus making it of particular interest to the users with jobs requiring much CPU time to execute and/or those wishing to run multiple jobs independently. In the interactive mode, useful for small jobs requiring less CPU time, the job is executed by the user's own Virtual Machine using the data in the active file as input. At the termination of an interactive job, the GO exec facilitates examination of the output by placing it in the Wylbur active file.

  16. GPU Computing in Bayesian Inference of Realized Stochastic Volatility Model

    NASA Astrophysics Data System (ADS)

    Takaishi, Tetsuya

    2015-01-01

    The realized stochastic volatility (RSV) model that utilizes the realized volatility as additional information has been proposed to infer volatility of financial time series. We consider the Bayesian inference of the RSV model by the Hybrid Monte Carlo (HMC) algorithm. The HMC algorithm can be parallelized and thus performed on the GPU for speedup. The GPU code is developed with CUDA Fortran. We compare the computational time of performing the HMC algorithm on the GPU (GTX 760) and the CPU (Intel i7-4770, 3.4 GHz) and find that the GPU can be up to 17 times faster than the CPU. We also code the program with OpenACC and find that, with appropriate coding, a speedup similar to that of CUDA Fortran can be achieved.
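    For readers unfamiliar with the sampler, the plain-NumPy sketch below shows a single leapfrog-based HMC step applied to a toy one-dimensional Gaussian target rather than the realized stochastic volatility model; the step size, trajectory length, and target are arbitrary choices and no GPU is involved.

        # Minimal HMC for a standard normal target: U(q) = q^2 / 2, grad U = q.
        import numpy as np

        rng = np.random.default_rng(1)

        def U(q):
            return 0.5 * q * q

        def grad_U(q):
            return q

        def hmc_step(q, eps=0.1, n_leap=20):
            p = rng.standard_normal()                  # sample a fresh momentum
            q_new, p_new = q, p
            p_new -= 0.5 * eps * grad_U(q_new)         # half step in momentum
            for i in range(n_leap):
                q_new += eps * p_new                   # full step in position
                if i != n_leap - 1:
                    p_new -= eps * grad_U(q_new)
            p_new -= 0.5 * eps * grad_U(q_new)         # final half step
            dH = (U(q_new) + 0.5 * p_new**2) - (U(q) + 0.5 * p**2)
            return q_new if np.log(rng.random()) < -dH else q   # Metropolis test

        q, samples = 0.0, []
        for _ in range(5000):
            q = hmc_step(q)
            samples.append(q)
        print(np.mean(samples), np.var(samples))       # should be close to 0 and 1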

  17. A Specialized Diacylglycerol Acyltransferase Contributes to the Extreme Medium-Chain Fatty Acid Content of Cuphea Seed Oil

    PubMed Central

    Iskandarov, Umidjon; Silva, Jillian E.; Andersson, Mariette

    2017-01-01

    Seed oils of many Cuphea sp. contain >90% of medium-chain fatty acids, such as decanoic acid (10:0). These seed oils, which are among the most compositionally variant in the plant kingdom, arise from specialized fatty acid biosynthetic enzymes and specialized acyltransferases. These include lysophosphatidic acid acyltransferases (LPAT) and diacylglycerol acyltransferases (DGAT) that are required for successive acylation of medium-chain fatty acids in the sn-2 and sn-3 positions of seed triacylglycerols (TAGs). Here we report the identification of a cDNA for a DGAT1-type enzyme, designated CpuDGAT1, from the transcriptome of C. avigera var pulcherrima developing seeds. Microsomes of camelina (Camelina sativa) seeds engineered for CpuDGAT1 expression displayed DGAT activity with 10:0-CoA and the diacylglycerol didecanoyl, that was approximately 4-fold higher than that in camelina seed microsomes lacking CpuDGAT1. In addition, coexpression in camelina seeds of CpuDGAT1 with a C. viscosissima FatB thioesterase (CvFatB1) that generates 10:0 resulted in TAGs with nearly 15 mol % of 10:0. More strikingly, expression of CpuDGAT1 and CvFatB1 with the previously described CvLPAT2, a 10:0-CoA-specific Cuphea LPAT, increased 10:0 amounts to 25 mol % in camelina seed TAG. These TAGs contained up to 40 mol % 10:0 in the sn-2 position, nearly double the amounts obtained from coexpression of CvFatB1 and CvLPAT2 alone. Although enriched in diacylglycerol, 10:0 was not detected in phosphatidylcholine in these seeds. These findings are consistent with channeling of 10:0 into TAG through the combined activities of specialized LPAT and DGAT activities and demonstrate the biotechnological use of these enzymes to generate 10:0-rich seed oils. PMID:28325847

  18. A Specialized Diacylglycerol Acyltransferase Contributes to the Extreme Medium-Chain Fatty Acid Content of Cuphea Seed Oil.

    PubMed

    Iskandarov, Umidjon; Silva, Jillian E; Kim, Hae Jin; Andersson, Mariette; Cahoon, Rebecca E; Mockaitis, Keithanne; Cahoon, Edgar B

    2017-05-01

    Seed oils of many Cuphea sp. contain >90% of medium-chain fatty acids, such as decanoic acid (10:0). These seed oils, which are among the most compositionally variant in the plant kingdom, arise from specialized fatty acid biosynthetic enzymes and specialized acyltransferases. These include lysophosphatidic acid acyltransferases (LPAT) and diacylglycerol acyltransferases (DGAT) that are required for successive acylation of medium-chain fatty acids in the sn-2 and sn-3 positions of seed triacylglycerols (TAGs). Here we report the identification of a cDNA for a DGAT1-type enzyme, designated CpuDGAT1, from the transcriptome of C. avigera var pulcherrima developing seeds. Microsomes of camelina (Camelina sativa) seeds engineered for CpuDGAT1 expression displayed DGAT activity with 10:0-CoA and the diacylglycerol didecanoyl, that was approximately 4-fold higher than that in camelina seed microsomes lacking CpuDGAT1. In addition, coexpression in camelina seeds of CpuDGAT1 with a C. viscosissima FatB thioesterase (CvFatB1) that generates 10:0 resulted in TAGs with nearly 15 mol % of 10:0. More strikingly, expression of CpuDGAT1 and CvFatB1 with the previously described CvLPAT2, a 10:0-CoA-specific Cuphea LPAT, increased 10:0 amounts to 25 mol % in camelina seed TAG. These TAGs contained up to 40 mol % 10:0 in the sn-2 position, nearly double the amounts obtained from coexpression of CvFatB1 and CvLPAT2 alone. Although enriched in diacylglycerol, 10:0 was not detected in phosphatidylcholine in these seeds. These findings are consistent with channeling of 10:0 into TAG through the combined activities of specialized LPAT and DGAT activities and demonstrate the biotechnological use of these enzymes to generate 10:0-rich seed oils. © 2017 American Society of Plant Biologists. All Rights Reserved.

  19. A Framework for Batched and GPU-Resident Factorization Algorithms Applied to Block Householder Transformations

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Dong, Tingzing Tim; Tomov, Stanimire Z; Luszczek, Piotr R

    As modern hardware keeps evolving, an increasingly effective approach to developing energy efficient and high-performance solvers is to design them to work on many small size and independent problems. Many applications already need this functionality, especially for GPUs, which are currently known to be about four to five times more energy efficient than multicore CPUs. We describe the development of one-sided factorizations that work for a set of small dense matrices in parallel, and we illustrate our techniques on the QR factorization based on Householder transformations. We refer to this mode of operation as a batched factorization. Our approach is based on representing the algorithms as a sequence of batched BLAS routines for GPU-only execution. This is in contrast to the hybrid CPU-GPU algorithms that rely heavily on using the multicore CPU for specific parts of the workload. But for a system to benefit fully from the GPU's significantly higher energy efficiency, avoiding the use of the multicore CPU must be a primary design goal, so the system can rely more heavily on the more efficient GPU. Additionally, this will result in the removal of the costly CPU-to-GPU communication. Furthermore, we do not use a single symmetric multiprocessor (on the GPU) to factorize a single problem at a time. We illustrate how our performance analysis, and the use of profiling and tracing tools, guided the development and optimization of our batched factorization to achieve up to a 2-fold speedup and a 3-fold energy efficiency improvement compared to our highly optimized batched CPU implementations based on the MKL library (when using two sockets of Intel Sandy Bridge CPUs). Compared to a batched QR factorization featured in the CUBLAS library for GPUs, we achieved up to 5x speedup on the K40 GPU.
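    The batched mode of operation, issuing many small independent factorizations as a single call, can be illustrated on the CPU with NumPy's stacked-array QR (available in recent NumPy releases). This is only a sketch of the idea, not the GPU batched BLAS implementation described in the paper.

        # Factorize 10,000 independent 32x32 matrices with one stacked QR call.
        import numpy as np

        rng = np.random.default_rng(0)
        batch = rng.standard_normal((10_000, 32, 32))

        Q, R = np.linalg.qr(batch)          # one call over the whole batch
        print(Q.shape, R.shape)             # (10000, 32, 32) for both factors
        print(np.allclose(Q @ R, batch))    # True: A = QR holds for every matrix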

  20. Evaluation of the CPU time for solving the radiative transfer equation with high-order resolution schemes applying the normalized weighting-factor method

    NASA Astrophysics Data System (ADS)

    Xamán, J.; Zavala-Guillén, I.; Hernández-López, I.; Uriarte-Flores, J.; Hernández-Pérez, I.; Macías-Melo, E. V.; Aguilar-Castro, K. M.

    2018-03-01

    In this paper, we evaluated the convergence rate (CPU time) of a new mathematical formulation for the numerical solution of the radiative transfer equation (RTE) with several High-Order (HO) and High-Resolution (HR) schemes. In computational fluid dynamics, this procedure is known as the Normalized Weighting-Factor (NWF) method and it is adopted here. The NWF method is used to incorporate the high-order resolution schemes in the discretized RTE. The NWF method is compared, in terms of the computer time needed to obtain a converged solution, with the widely used deferred-correction (DC) technique for the calculation of a two-dimensional cavity with emitting-absorbing-scattering gray media using the discrete ordinates method. Six parameters, viz. the grid size, the order of quadrature, the absorption coefficient, the emissivity of the boundary surface, the under-relaxation factor, and the scattering albedo, are considered to evaluate ten schemes. The results showed that, using the DC method, the scheme with the lowest CPU time is in general the SOU scheme. In contrast with the results of the DC procedure, the CPU times for the DIAMOND and QUICK schemes using the NWF method are shown to be between 3.8 and 23.1% and between 12.6 and 56.1% lower, respectively. However, the other schemes are more time consuming when the NWF is used instead of the DC method. Additionally, a second test case was presented, and the results showed that, depending on the problem under consideration, the NWF procedure may be computationally faster or slower than the DC method. As an example, the CPU times for the QUICK and SMART schemes are 61.8 and 203.7% higher, respectively, when the NWF formulation is used for the second test case. Finally, future research is required to explore the computational cost of the NWF method in more complex problems.

  1. Effective electron-density map improvement and structure validation on a Linux multi-CPU web cluster: The TB Structural Genomics Consortium Bias Removal Web Service.

    PubMed

    Reddy, Vinod; Swanson, Stanley M; Segelke, Brent; Kantardjieff, Katherine A; Sacchettini, James C; Rupp, Bernhard

    2003-12-01

    Anticipating a continuing increase in the number of structures solved by molecular replacement in high-throughput crystallography and drug-discovery programs, a user-friendly web service for automated molecular replacement, map improvement, bias removal and real-space correlation structure validation has been implemented. The service is based on an efficient bias-removal protocol, Shake&wARP, and implemented using EPMR and the CCP4 suite of programs, combined with various shell scripts and Fortran90 routines. The service returns improved maps, converted data files and real-space correlation and B-factor plots. User data are uploaded through a web interface and the CPU-intensive iteration cycles are executed on a low-cost Linux multi-CPU cluster using the Condor job-queuing package. Examples of map improvement at various resolutions are provided and include model completion and reconstruction of absent parts, sequence correction, and ligand validation in drug-target structures.

  2. Comparing the Consumption of CPU Hours with Scientific Output for the Extreme Science and Engineering Discovery Environment (XSEDE).

    PubMed

    Knepper, Richard; Börner, Katy

    2016-01-01

    This paper presents the results of a study that compares resource usage with publication output using data about the consumption of CPU cycles from the Extreme Science and Engineering Discovery Environment (XSEDE) and resulting scientific publications for 2,691 institutions/teams. Specifically, the datasets comprise a total of 5,374,032,696 central processing unit (CPU) hours run in XSEDE during July 1, 2011 to August 18, 2015 and 2,882 publications that cite the XSEDE resource. Three types of studies were conducted: a geospatial analysis of XSEDE providers and consumers, co-authorship network analysis of XSEDE publications, and bi-modal network analysis of how XSEDE resources are used by different research fields. Resulting visualizations show that a diverse set of consumers make use of XSEDE resources, that users of XSEDE publish together frequently, and that the users of XSEDE with the highest resource usage tend to be "traditional" high-performance computing (HPC) community members from astronomy, atmospheric science, physics, chemistry, and biology.

  3. System for processing an encrypted instruction stream in hardware

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Griswold, Richard L.; Nickless, William K.; Conrad, Ryan C.

    A system and method of processing an encrypted instruction stream in hardware is disclosed. Main memory stores the encrypted instruction stream and unencrypted data. A central processing unit (CPU) is operatively coupled to the main memory. A decryptor is operatively coupled to the main memory and located within the CPU. The decryptor decrypts the encrypted instruction stream upon receipt of an instruction fetch signal from a CPU core. Unencrypted data is passed through to the CPU core without decryption upon receipt of a data fetch signal.
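    A toy software model of this fetch-path behaviour is sketched below: instruction fetches pass through a decryptor while data fetches bypass it. The XOR "cipher" and the memory layout are placeholders invented for illustration and do not reflect the actual hardware design.

        # Toy model: decrypt only on instruction fetch, pass data through untouched.
        INSTR_FETCH, DATA_FETCH = "instr", "data"
        KEY = 0x5A

        def decrypt(byte):
            return byte ^ KEY                    # placeholder cipher, not the real one

        class Memory:
            def __init__(self, encrypted_code, plain_data):
                self.code = encrypted_code       # instruction stream stored encrypted
                self.data = plain_data           # data stored unencrypted

            def fetch(self, kind, addr):
                if kind == INSTR_FETCH:
                    return decrypt(self.code[addr])   # decryptor on the fetch path
                return self.data[addr]                # data bypasses the decryptor

        code_plain = [0x01, 0x02, 0x03]
        mem = Memory([b ^ KEY for b in code_plain], [10, 20, 30])
        print([mem.fetch(INSTR_FETCH, a) for a in range(3)])   # [1, 2, 3]
        print([mem.fetch(DATA_FETCH, a) for a in range(3)])    # [10, 20, 30]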

  4. Method and apparatus for measuring spatial uniformity of radiation

    DOEpatents

    Field, Halden

    2002-01-01

    A method and apparatus for measuring the spatial uniformity of the intensity of a radiation beam from a radiation source based on a single sampling time and/or a single pulse of radiation. The measuring apparatus includes a plurality of radiation detectors positioned on a planar mounting plate to form a radiation receiving area that has a shape and size approximating the size and shape of the cross section of the radiation beam. The detectors concurrently receive portions of the radiation beam and transmit electrical signals representative of the intensity of impinging radiation to a signal processor circuit connected to each of the detectors and adapted to concurrently receive the electrical signals from the detectors and process with a central processing unit (CPU) the signals to determine intensities of the radiation impinging at each detector location. The CPU displays the determined intensities and relative intensity values corresponding to each detector location to an operator of the measuring apparatus on an included data display device. Concurrent sampling of each detector is achieved by connecting to each detector a sample-and-hold circuit that is configured to track the signal and store it upon receipt of a "capture" signal. A switching device then selectively retrieves the signals and transmits the signals to the CPU through a single analog-to-digital (A/D) converter. The "capture" signal is then removed from the sample-and-hold circuits. Alternatively, concurrent sampling is achieved by providing an A/D converter for each detector, each of which transmits a corresponding digital signal to the CPU. The sampling or reading of the detector signals can be controlled by the CPU or a level-detection and timing circuit.
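    The post-capture processing step, computing per-detector intensities and their values relative to the beam average, can be sketched as follows; the readings and calibration factors are invented, and a real instrument would obtain them from the sample-and-hold and A/D chain described above.

        # Per-detector intensities and relative (to the mean) intensity values.
        readings = [0.98, 1.02, 1.05, 0.95, 1.00, 1.01]   # one concurrent sample each
        calibration = [1.0] * len(readings)               # per-detector gain factors

        intensities = [r * c for r, c in zip(readings, calibration)]
        mean_intensity = sum(intensities) / len(intensities)
        relative = [i / mean_intensity for i in intensities]

        for idx, (val, rel) in enumerate(zip(intensities, relative)):
            print(f"detector {idx}: intensity {val:.3f}, relative {rel:.3f}")
        print(f"non-uniformity (max-min relative spread): {max(relative) - min(relative):.3f}")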

  5. A comparison of native GPU computing versus OpenACC for implementing flow-routing algorithms in hydrological applications

    NASA Astrophysics Data System (ADS)

    Rueda, Antonio J.; Noguera, José M.; Luque, Adrián

    2016-02-01

    In recent years GPU computing has gained wide acceptance as a simple low-cost solution for speeding up computationally expensive processing in many scientific and engineering applications. However, in most cases accelerating a traditional CPU implementation for a GPU is a non-trivial task that requires a thorough refactorization of the code and specific optimizations that depend on the architecture of the device. OpenACC is a promising technology that aims at reducing the effort required to accelerate C/C++/Fortran code on an attached multicore device. With this technology, the CPU code in principle only has to be augmented with a few compiler directives to identify the areas to be accelerated and the way in which data has to be moved between the CPU and GPU. Its potential benefits are multiple: better code readability, less development time, lower risk of errors and less dependency on the underlying architecture and future evolution of the GPU technology. Our aim with this work is to evaluate the pros and cons of using OpenACC against native GPU implementations in computationally expensive hydrological applications, using the classic D8 algorithm of O'Callaghan and Mark for river network extraction as case-study. We implemented the flow accumulation step of this algorithm on the CPU, using OpenACC and two different CUDA versions, comparing the length and complexity of the code and its performance with different datasets. We find that although OpenACC cannot match the performance of an optimized CUDA implementation (about 3.5× slower on average), it provides a significant performance improvement over a CPU implementation (2-6×) with far simpler code and less implementation effort.
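    For reference, a compact and deliberately unoptimized Python version of the D8 flow-accumulation step is sketched below: each cell drains to its steepest downslope neighbour, and accumulation is propagated from high ground to low ground. The tiny synthetic DEM is for illustration only, and the code makes no claim to match the CPU, OpenACC, or CUDA implementations compared in the paper.

        # Reference D8 flow accumulation on a toy 3x3 DEM.
        import numpy as np

        dem = np.array([[9., 8., 7.],
                        [8., 6., 5.],
                        [7., 5., 3.]])
        rows, cols = dem.shape
        neigh = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

        def receiver(r, c):
            """Index of the steepest downslope neighbour, or None for a pit/outlet."""
            best, best_drop = None, 0.0
            for dr, dc in neigh:
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    drop = (dem[r, c] - dem[rr, cc]) / (dr * dr + dc * dc) ** 0.5
                    if drop > best_drop:
                        best, best_drop = (rr, cc), drop
            return best

        acc = np.ones_like(dem)                    # every cell contributes itself
        order = sorted(((r, c) for r in range(rows) for c in range(cols)),
                       key=lambda rc: dem[rc], reverse=True)
        for r, c in order:                         # high ground first; cells only drain
            rec = receiver(r, c)                   # to strictly lower neighbours
            if rec is not None:
                acc[rec] += acc[r, c]
        print(acc)                                 # the outlet (lowest cell) collects all 9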

  6. Interaction sorting method for molecular dynamics on multi-core SIMD CPU architecture.

    PubMed

    Matvienko, Sergey; Alemasov, Nikolay; Fomin, Eduard

    2015-02-01

    Molecular dynamics (MD) is widely used in computational biology for studying binding mechanisms of molecules, molecular transport, conformational transitions, protein folding, etc. The method is computationally expensive; thus, the demand for the development of novel, much more efficient algorithms is still high. Therefore, the new algorithm designed in 2007 and called interaction sorting (IS) clearly attracted interest, as it outperformed the most efficient MD algorithms. In this work, a new IS modification is proposed which allows the algorithm to utilize SIMD processor instructions. This paper shows that the improvement provides an additional gain in performance, 9% to 45% in comparison to the original IS method.

  7. Tier-2 Optimisation for Computational Density/Diversity and Big Data

    NASA Astrophysics Data System (ADS)

    Fay, R. B.; Bland, J.

    2014-06-01

    As the number of cores per chip continues to trend upwards and new CPU architectures emerge, increasing CPU density and diversity presents multiple challenges to site administrators. These include scheduling for massively multi-core systems (potentially including integrated and dedicated Graphics Processing Units (GPUs) as well as Many Integrated Core (MIC) devices) to ensure a balanced throughput of jobs while preserving overall cluster throughput, the increasing complexity of developing for these heterogeneous platforms, and the challenge of managing this more complex mix of resources. In addition, meeting data demands as both dataset sizes increase and as the rate of demand scales with increased computational power requires additional performance from the associated storage elements. In this report, we evaluate one emerging technology, Solid State Drive (SSD) caching for RAID controllers, with consideration to its potential to assist in meeting evolving demand. We also briefly consider the broader developing trends outlined above in order to identify issues that may develop and assess what actions should be taken in the immediate term to address them.

  8. GPU accelerated manifold correction method for spinning compact binaries

    NASA Astrophysics Data System (ADS)

    Ran, Chong-xi; Liu, Song; Zhong, Shuang-ying

    2018-04-01

    The graphics processing unit (GPU) acceleration of the manifold correction algorithm based on the compute unified device architecture (CUDA) technology is designed to simulate the dynamic evolution of the Post-Newtonian (PN) Hamiltonian formulation of spinning compact binaries. The feasibility and the efficiency of parallel computation on the GPU have been confirmed by various numerical experiments. The numerical comparisons show that the accuracy of the manifold correction method executed on the GPU is in good agreement with that of the purely central processing unit (CPU)-based execution of the codes. The acceleration achieved when the codes are implemented on the GPU can be increased enormously through the use of shared memory and register optimization techniques without additional hardware costs, giving a speedup of nearly 13 times compared with the codes executed on the CPU for a phase-space scan (comprising 314 × 314 orbits). In addition, the GPU-accelerated manifold correction method is used to numerically study how the dynamics are affected by the spin-induced quadrupole-monopole interaction for black hole binary systems.

  9. Fast multipurpose Monte Carlo simulation for proton therapy using multi- and many-core CPU architectures.

    PubMed

    Souris, Kevin; Lee, John Aldo; Sterpin, Edmond

    2016-04-01

    Accuracy in proton therapy treatment planning can be improved using Monte Carlo (MC) simulations. However, the long computation time of such methods hinders their use in clinical routine. This work aims to develop a fast multipurpose Monte Carlo simulation tool for proton therapy using massively parallel central processing unit (CPU) architectures. A new Monte Carlo, called MCsquare (many-core Monte Carlo), has been designed and optimized for the last generation of Intel Xeon processors and Intel Xeon Phi coprocessors. These massively parallel architectures offer the flexibility and the computational power suitable to MC methods. The class-II condensed history algorithm of MCsquare provides a fast and yet accurate method of simulating heavy charged particles such as protons, deuterons, and alphas inside voxelized geometries. Hard ionizations, with energy losses above a user-specified threshold, are simulated individually while soft events are regrouped in a multiple scattering theory. Elastic and inelastic nuclear interactions are sampled from ICRU 63 differential cross sections, thereby allowing for the computation of prompt gamma emission profiles. MCsquare has been benchmarked with the gate/geant4 Monte Carlo application for homogeneous and heterogeneous geometries. Comparisons with gate/geant4 for various geometries show deviations within 2%-1 mm. In spite of the limited memory bandwidth of the coprocessor, simulation time is below 25 s for 10⁷ primary 200 MeV protons in average soft tissues using all Xeon Phi and CPU resources embedded in a single desktop unit. MCsquare exploits the flexibility of CPU architectures to provide a multipurpose MC simulation tool. Optimized code enables the use of accurate MC calculation within a reasonable computation time, adequate for clinical practice. MCsquare also simulates prompt gamma emission and can thus be used also for in vivo range verification.

  10. Data Acquisition System for Multi-Frequency Radar Flight Operations Preparation

    NASA Technical Reports Server (NTRS)

    Leachman, Jonathan

    2010-01-01

    A three-channel data acquisition system was developed for the NASA Multi-Frequency Radar (MFR) system. The system is based on a commercial-off-the-shelf (COTS) industrial PC (personal computer) and two dual-channel 14-bit digital receiver cards. The decimated complex envelope representations of the three radar signals are passed to the host PC via the PCI bus, and then processed in parallel by multiple cores of the PC CPU (central processing unit). The innovation is this parallelization of the radar data processing using multiple cores of a standard COTS multi-core CPU. The data processing portion of the data acquisition software was built using autonomous program modules or threads, which can run simultaneously on different cores. A master program module calculates the optimal number of processing threads, launches them, and continually supplies each with data. The benefit of this new parallel software architecture is that COTS PCs can be used to implement increasingly complex processing algorithms on an increasing number of radar range gates and data rates. As new PCs become available with higher numbers of CPU cores, the software will automatically utilize the additional computational capacity.
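    A schematic Python analogue of this master/worker structure is shown below: the number of worker threads is derived from the available CPU count and blocks of samples are fed to a thread pool. The per-block "processing" is a dummy stand-in for the real per-channel radar computation, and in practice the heavy numerical kernels would release the interpreter lock (e.g. through NumPy or compiled code) to obtain true parallelism.

        # Master launches workers sized to the CPU count and keeps feeding them blocks.
        import os
        from concurrent.futures import ThreadPoolExecutor

        def process_block(block):
            """Placeholder for the per-block radar signal processing."""
            return sum(x * x for x in block)       # e.g. a crude power estimate

        def acquire_blocks(n_blocks, block_len=4096):
            """Fake acquisition: yield blocks of synthetic samples."""
            for i in range(n_blocks):
                yield [float((i + j) % 7) for j in range(block_len)]

        n_workers = max(1, (os.cpu_count() or 1) - 1)   # leave one core for acquisition
        with ThreadPoolExecutor(max_workers=n_workers) as pool:
            results = list(pool.map(process_block, acquire_blocks(64)))
        print(f"{n_workers} worker threads processed {len(results)} blocks")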

  11. Fast MPEG-CDVS Encoder With GPU-CPU Hybrid Computing

    NASA Astrophysics Data System (ADS)

    Duan, Ling-Yu; Sun, Wei; Zhang, Xinfeng; Wang, Shiqi; Chen, Jie; Yin, Jianxiong; See, Simon; Huang, Tiejun; Kot, Alex C.; Gao, Wen

    2018-05-01

    The compact descriptors for visual search (CDVS) standard from ISO/IEC moving pictures experts group (MPEG) has succeeded in enabling interoperability for efficient and effective image retrieval by standardizing the bitstream syntax of compact feature descriptors. However, the intensive computation of the CDVS encoder unfortunately hinders its wide deployment in industry for large-scale visual search. In this paper, we revisit the merits of the low-complexity design of the CDVS core techniques and present a very fast CDVS encoder by leveraging the massive parallel execution resources of the GPU. We elegantly shift the computation-intensive and parallel-friendly modules to state-of-the-art GPU platforms, in which the thread block allocation and the memory access are jointly optimized to eliminate performance loss. In addition, those operations with heavy data dependence are allocated to the CPU to resolve the extra but unnecessary computation burden for the GPU. Furthermore, we demonstrate that the proposed fast CDVS encoder works well with convolutional neural network approaches, which have harmoniously leveraged the advantages of GPU platforms and yielded significant performance improvements. Comprehensive experimental results over benchmarks are evaluated, showing that the fast CDVS encoder using GPU-CPU hybrid computing is promising for scalable visual search.

  12. Lossless data compression for improving the performance of a GPU-based beamformer.

    PubMed

    Lok, U-Wai; Fan, Gang-Wei; Li, Pai-Chi

    2015-04-01

    The powerful parallel computation ability of a graphics processing unit (GPU) makes it feasible to perform dynamic receive beamforming. However, a real-time GPU-based beamformer requires a high data rate to transfer radio-frequency (RF) data from hardware to software memory, as well as from central processing unit (CPU) to GPU memory. There are data compression methods (e.g. Joint Photographic Experts Group (JPEG)) available for the hardware front end to reduce data size, alleviating the data transfer requirement of the hardware interface. Nevertheless, the required decoding time may even be larger than the transmission time of the original data, in turn degrading the overall performance of the GPU-based beamformer. This article proposes and implements a lossless compression-decompression algorithm, which enables compression and decompression of data in parallel. By this means, the data transfer requirement of the hardware interface and the transmission time of CPU-to-GPU data transfers are reduced, without sacrificing image quality. In simulation results, the compression ratio reached around 1.7. The encoder design of our lossless compression approach requires low hardware resources and reasonable latency in a field programmable gate array. In addition, the transmission time of transferring data from CPU to GPU with the parallel decoding process improved threefold, as compared with transferring the original uncompressed data. These results show that our proposed lossless compression plus parallel decoder approach not only mitigates the transmission bandwidth requirement to transfer data from the hardware front end to the software system but also reduces the transmission time for CPU-to-GPU data transfer. © The Author(s) 2014.
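    The chunk-parallel idea can be sketched in host-side Python as below, with zlib standing in for the paper's custom lossless encoder; the chunk size and the synthetic "RF data" are arbitrary.

        # Chunked, lossless, thread-parallel compression and decompression.
        import zlib
        from concurrent.futures import ThreadPoolExecutor

        def compress_chunks(data: bytes, chunk_size: int = 1 << 16):
            chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
            with ThreadPoolExecutor() as pool:      # zlib releases the GIL on large buffers
                return list(pool.map(zlib.compress, chunks))

        def decompress_chunks(compressed):
            with ThreadPoolExecutor() as pool:
                return b"".join(pool.map(zlib.decompress, compressed))

        raw = bytes(range(256)) * 4096              # stand-in for RF samples
        packed = compress_chunks(raw)
        assert decompress_chunks(packed) == raw     # lossless round trip
        ratio = len(raw) / sum(len(c) for c in packed)
        print(f"compression ratio: {ratio:.2f}")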

  13. GPU-accelerated automatic identification of robust beam setups for proton and carbon-ion radiotherapy

    NASA Astrophysics Data System (ADS)

    Ammazzalorso, F.; Bednarz, T.; Jelen, U.

    2014-03-01

    We demonstrate acceleration on graphic processing units (GPU) of automatic identification of robust particle therapy beam setups, minimizing negative dosimetric effects of Bragg peak displacement caused by treatment-time patient positioning errors. Our particle therapy research toolkit, RobuR, was extended with OpenCL support and used to implement calculation on GPU of the Port Homogeneity Index, a metric scoring irradiation port robustness through analysis of tissue density patterns prior to dose optimization and computation. Results were benchmarked against an independent native CPU implementation. Numerical results were in agreement between the GPU implementation and native CPU implementation. For 10 skull base cases, the GPU-accelerated implementation was employed to select beam setups for proton and carbon ion treatment plans, which proved to be dosimetrically robust, when recomputed in presence of various simulated positioning errors. From the point of view of performance, average running time on the GPU decreased by at least one order of magnitude compared to the CPU, rendering the GPU-accelerated analysis a feasible step in a clinical treatment planning interactive session. In conclusion, selection of robust particle therapy beam setups can be effectively accelerated on a GPU and become an unintrusive part of the particle therapy treatment planning workflow. Additionally, the speed gain opens new usage scenarios, like interactive analysis manipulation (e.g. constraining of some setup) and re-execution. Finally, through OpenCL portable parallelism, the new implementation is suitable also for CPU-only use, taking advantage of multiple cores, and can potentially exploit types of accelerators other than GPUs.

  14. 47 CFR 15.102 - CPU boards and power supplies used in personal computers.

    Code of Federal Regulations, 2013 CFR

    2013-10-01

    ... computers. 15.102 Section 15.102 Telecommunication FEDERAL COMMUNICATIONS COMMISSION GENERAL RADIO FREQUENCY DEVICES Unintentional Radiators § 15.102 CPU boards and power supplies used in personal computers. (a... modifications that must be made to a personal computer, peripheral device, CPU board or power supply during...

  15. 47 CFR 15.102 - CPU boards and power supplies used in personal computers.

    Code of Federal Regulations, 2011 CFR

    2011-10-01

    ... computers. 15.102 Section 15.102 Telecommunication FEDERAL COMMUNICATIONS COMMISSION GENERAL RADIO FREQUENCY DEVICES Unintentional Radiators § 15.102 CPU boards and power supplies used in personal computers. (a... modifications that must be made to a personal computer, peripheral device, CPU board or power supply during...

  16. 47 CFR 15.102 - CPU boards and power supplies used in personal computers.

    Code of Federal Regulations, 2010 CFR

    2010-10-01

    ... computers. 15.102 Section 15.102 Telecommunication FEDERAL COMMUNICATIONS COMMISSION GENERAL RADIO FREQUENCY DEVICES Unintentional Radiators § 15.102 CPU boards and power supplies used in personal computers. (a... modifications that must be made to a personal computer, peripheral device, CPU board or power supply during...

  17. 47 CFR 15.102 - CPU boards and power supplies used in personal computers.

    Code of Federal Regulations, 2014 CFR

    2014-10-01

    ... computers. 15.102 Section 15.102 Telecommunication FEDERAL COMMUNICATIONS COMMISSION GENERAL RADIO FREQUENCY DEVICES Unintentional Radiators § 15.102 CPU boards and power supplies used in personal computers. (a... modifications that must be made to a personal computer, peripheral device, CPU board or power supply during...

  18. 47 CFR 15.102 - CPU boards and power supplies used in personal computers.

    Code of Federal Regulations, 2012 CFR

    2012-10-01

    ... computers. 15.102 Section 15.102 Telecommunication FEDERAL COMMUNICATIONS COMMISSION GENERAL RADIO FREQUENCY DEVICES Unintentional Radiators § 15.102 CPU boards and power supplies used in personal computers. (a... modifications that must be made to a personal computer, peripheral device, CPU board or power supply during...

  19. Online performance evaluation of RAID 5 using CPU utilization

    NASA Astrophysics Data System (ADS)

    Jin, Hai; Yang, Hua; Zhang, Jiangling

    1998-09-01

    Redundant arrays of independent disks (RAID) technology is an efficient way to address the bottleneck between CPU processing ability and the I/O subsystem. From the system's point of view, the most important metric of on-line performance is CPU utilization. This paper first calculates the CPU utilization of a system connected to a RAID level 5 subsystem using a statistical averaging method. The simulation results for the CPU utilization of a system connected to a RAID level 5 subsystem show that using multiple disks as an array to access data in parallel is an efficient way to enhance the on-line performance of the disk storage system. Using high-end disk drives to compose the disk array is the key to enhancing the on-line performance of the system.

  20. A world-wide databridge supported by a commercial cloud provider

    NASA Astrophysics Data System (ADS)

    Tat Cheung, Kwong; Field, Laurence; Furano, Fabrizio

    2017-10-01

    Volunteer computing has the potential to provide significant additional computing capacity for the LHC experiments. One of the challenges with exploiting volunteer computing is to support a global community of volunteers that provides heterogeneous resources. However, high energy physics applications require more data input and output than the CPU-intensive applications that are typically used by other volunteer computing projects. While the so-called databridge has already been successfully proposed as a method to span the untrusted and trusted domains of volunteer computing and Grid computing, respectively, globally transferring data between potentially poor-performing residential networks and CERN could be unreliable, leading to wasted resource usage. The expectation is that by placing a storage endpoint that is part of a wider, flexible geographical databridge deployment closer to the volunteers, the transfer success rate and the overall performance can be improved. This contribution investigates the provision of a globally distributed databridge implemented upon a commercial cloud provider.

  1. Toward GPGPU accelerated human electromechanical cardiac simulations

    PubMed Central

    Vigueras, Guillermo; Roy, Ishani; Cookson, Andrew; Lee, Jack; Smith, Nicolas; Nordsletten, David

    2014-01-01

    In this paper, we look at the acceleration of weakly coupled electromechanics using the graphics processing unit (GPU). Specifically, we port to the GPU a number of components of Heart, a CPU-based finite element code developed for simulating multi-physics problems. On the basis of a criterion of computational cost, we implemented on the GPU the ODE and PDE solution steps for the electrophysiology problem and the Jacobian and residual evaluation for the mechanics problem. Performance of the GPU implementation is then compared with single core CPU (SC) execution as well as multi-core CPU (MC) computations with equivalent theoretical performance. Results show that for a human scale left ventricle mesh, GPU acceleration of the electrophysiology problem provided speedups of 164× compared with SC and 5.5× compared with MC for the solution of the ODE model. Speedup of up to 72× compared with SC and 2.6× compared with MC was also observed for the PDE solve. Using the same human geometry, the GPU implementation of mechanics residual/Jacobian computation provided speedups of up to 44× compared with SC and 2.0× compared with MC. © 2013 The Authors. International Journal for Numerical Methods in Biomedical Engineering published by John Wiley & Sons, Ltd. PMID:24115492

  2. Accelerating statistical image reconstruction algorithms for fan-beam x-ray CT using cloud computing

    NASA Astrophysics Data System (ADS)

    Srivastava, Somesh; Rao, A. Ravishankar; Sheinin, Vadim

    2011-03-01

    Statistical image reconstruction algorithms potentially offer many advantages to x-ray computed tomography (CT), e.g. lower radiation dose. But, their adoption in practical CT scanners requires extra computation power, which is traditionally provided by incorporating additional computing hardware (e.g. CPU-clusters, GPUs, FPGAs etc.) into a scanner. An alternative solution is to access the required computation power over the internet from a cloud computing service, which is orders-of-magnitude more cost-effective. This is because users only pay a small pay-as-you-go fee for the computation resources used (i.e. CPU time, storage etc.), and completely avoid purchase, maintenance and upgrade costs. In this paper, we investigate the benefits and shortcomings of using cloud computing for statistical image reconstruction. We parallelized the most time-consuming parts of our application, the forward and back projectors, using MapReduce, the standard parallelization library on clouds. From preliminary investigations, we found that a large speedup is possible at a very low cost. But, communication overheads inside MapReduce can limit the maximum speedup, and a better MapReduce implementation might become necessary in the future. All the experiments for this paper, including development and testing, were completed on the Amazon Elastic Compute Cloud (EC2) for less than $20.
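    A minimal map-reduce-style decomposition of a forward projector, parallelized over projection angles with Python's multiprocessing instead of a cloud MapReduce service, is sketched below; the rectangular phantom, the angle count, and the use of scipy.ndimage.rotate are illustrative choices rather than the authors' projector.

        # Map step: one parallel-beam projection per angle; reduce step: assemble sinogram.
        import numpy as np
        from multiprocessing import Pool
        from scipy.ndimage import rotate

        PHANTOM = np.zeros((128, 128))
        PHANTOM[32:96, 48:80] = 1.0                  # simple rectangular phantom

        def project(angle_deg):
            """One projection: rotate the image and sum along columns (line integrals)."""
            return rotate(PHANTOM, angle_deg, reshape=False, order=1).sum(axis=0)

        if __name__ == "__main__":
            angles = np.linspace(0.0, 180.0, 90, endpoint=False)
            with Pool() as pool:
                rows = pool.map(project, angles)     # angles processed in parallel
            sinogram = np.stack(rows)                # reduce: assemble the sinogram
            print(sinogram.shape)                    # (90, 128)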

  3. Developments in the ATLAS Tracking Software ahead of LHC Run 2

    NASA Astrophysics Data System (ADS)

    Styles, Nicholas; Bellomo, Massimiliano; Salzburger, Andreas; ATLAS Collaboration

    2015-05-01

    After a hugely successful first run, the Large Hadron Collider (LHC) is currently in a shut-down period, during which essential maintenance and upgrades are being performed on the accelerator. The ATLAS experiment, one of the four large LHC experiments, has also used this period for consolidation and further development of the detector and of its software framework, ahead of the new challenges that will be brought by the increased centre-of-mass energy and instantaneous luminosity in the next run period. This is of particular relevance for the ATLAS Tracking software, responsible for reconstructing the trajectory of charged particles through the detector, which faces a steep increase in CPU consumption due to the additional combinatorics of the high-multiplicity environment. The steps taken to mitigate this increase and stay within the available computing resources, while maintaining the excellent performance of the tracking software in terms of the information provided to the physics analyses, will be presented. Particular focus will be given to changes to the Event Data Model, replacement of the maths library, and adoption of a new persistent output format. The resulting CPU profiling results will be discussed, as well as the performance of the algorithms for physics processes under the expected conditions for the next LHC run.

  4. 32 CFR 701.53 - FOIA fee schedule.

    Code of Federal Regulations, 2014 CFR

    2014-07-01

    ... human time) and machine time. (1) Human time. Human time is all the time spent by humans performing the...) Machine time. Machine time involves only direct costs of the central processing unit (CPU), input/output... exist to calculate CPU time, no machine costs can be passed on to the requester. When CPU calculations...

  5. 32 CFR 701.53 - FOIA fee schedule.

    Code of Federal Regulations, 2012 CFR

    2012-07-01

    ... human time) and machine time. (1) Human time. Human time is all the time spent by humans performing the...) Machine time. Machine time involves only direct costs of the central processing unit (CPU), input/output... exist to calculate CPU time, no machine costs can be passed on to the requester. When CPU calculations...

  6. 32 CFR 701.53 - FOIA fee schedule.

    Code of Federal Regulations, 2013 CFR

    2013-07-01

    ... human time) and machine time. (1) Human time. Human time is all the time spent by humans performing the...) Machine time. Machine time involves only direct costs of the central processing unit (CPU), input/output... exist to calculate CPU time, no machine costs can be passed on to the requester. When CPU calculations...

  7. cellGPU: Massively parallel simulations of dynamic vertex models

    NASA Astrophysics Data System (ADS)

    Sussman, Daniel M.

    2017-10-01

    Vertex models represent confluent tissue by polygonal or polyhedral tilings of space, with the individual cells interacting via force laws that depend on both the geometry of the cells and the topology of the tessellation. This dependence on the connectivity of the cellular network introduces several complications to performing molecular-dynamics-like simulations of vertex models, and in particular makes parallelizing the simulations difficult. cellGPU addresses this difficulty and lays the foundation for massively parallelized, GPU-based simulations of these models. This article discusses its implementation for a pair of two-dimensional models, and compares the typical performance that can be expected between running cellGPU entirely on the CPU versus its performance when running on a range of commercial and server-grade graphics cards. By implementing the calculation of topological changes and forces on cells in a highly parallelizable fashion, cellGPU enables researchers to simulate time- and length-scales previously inaccessible via existing single-threaded CPU implementations.
    Program Files doi: http://dx.doi.org/10.17632/6j2cj29t3r.1
    Licensing provisions: MIT
    Programming language: CUDA/C++
    Nature of problem: Simulations of off-lattice "vertex models" of cells, in which the interaction forces depend on both the geometry and the topology of the cellular aggregate.
    Solution method: Highly parallelized GPU-accelerated dynamical simulations in which the force calculations and the topological features can be handled on either the CPU or GPU.
    Additional comments: The code is hosted at https://gitlab.com/dmsussman/cellGPU, with documentation additionally maintained at http://dmsussman.gitlab.io/cellGPUdocumentation

  8. Multigrid direct numerical simulation of the whole process of flow transition in 3-D boundary layers

    NASA Technical Reports Server (NTRS)

    Liu, Chaoqun; Liu, Zhining

    1993-01-01

    A new technology was developed in this study which provides a successful numerical simulation of the whole process of flow transition in 3-D boundary layers, including linear growth, secondary instability, breakdown, and transition at relatively low CPU cost. Most other spatial numerical simulations require high CPU cost and blow up at the stage of flow breakdown. A fourth-order finite difference scheme on stretched and staggered grids, a fully implicit time marching technique, a semi-coarsening multigrid based on the so-called approximate line-box relaxation, and a buffer domain for the outflow boundary conditions were all used for high-order accuracy, good stability, and fast convergence. A new fine-coarse-fine grid mapping technique was developed to keep the code running after the laminar flow breaks down. The computational results are in good agreement with linear stability theory, secondary instability theory, and some experiments. The cost for a typical case with 162 x 34 x 34 grid is around 2 CRAY-YMP CPU hours for 10 T-S periods.

  9. Comparing the Consumption of CPU Hours with Scientific Output for the Extreme Science and Engineering Discovery Environment (XSEDE)

    PubMed Central

    Börner, Katy

    2016-01-01

    This paper presents the results of a study that compares resource usage with publication output using data about the consumption of CPU cycles from the Extreme Science and Engineering Discovery Environment (XSEDE) and resulting scientific publications for 2,691 institutions/teams. Specifically, the datasets comprise a total of 5,374,032,696 central processing unit (CPU) hours run in XSEDE during July 1, 2011 to August 18, 2015 and 2,882 publications that cite the XSEDE resource. Three types of studies were conducted: a geospatial analysis of XSEDE providers and consumers, co-authorship network analysis of XSEDE publications, and bi-modal network analysis of how XSEDE resources are used by different research fields. Resulting visualizations show that a diverse set of consumers make use of XSEDE resources, that users of XSEDE publish together frequently, and that the users of XSEDE with the highest resource usage tend to be “traditional” high-performance computing (HPC) community members from astronomy, atmospheric science, physics, chemistry, and biology. PMID:27310174

  10. Short-term dopaminergic regulation of GABA release in dopamine deafferented caudate-putamen is not directly associated with glutamic acid decarboxylase gene expression.

    PubMed

    O'Connor, W T; Lindefors, N; Brené, S; Herrera-Marschitz, M; Persson, H; Ungerstedt, U

    1991-07-08

    In vivo microdialysis and in situ hybridization were combined to study dopaminergic regulation of gamma-amino butyric acid (GABA) neurons in rat caudate-putamen (CPu). Potassium-stimulated GABA release in CPu was elevated following a dopamine deafferentation. Local perfusion with exogenous dopamine (50 microM) for 3 h via the microdialysis probe attenuated the potassium-stimulated increase in extracellular GABA in CPu. Expression of glutamic acid decarboxylase (GAD) mRNA was also increased in the dopamine deafferented CPu. However, local perfusion with dopamine had no significant attenuating effect on the increased GAD mRNA expression. These findings indicate that dopaminergic regulation of GABA neurons in the dopamine deafferented CPu includes both a short-term effect at the level of GABA release independent of changes in GAD mRNA expression and a long-term modulation at the level of GAD gene expression.

  11. Using all of your CPU's in HIPE

    NASA Astrophysics Data System (ADS)

    Jacobson, J. D.; Fadda, D.

    2012-09-01

    Modern computer architectures increasingly feature multi-core CPUs. For example, the MacBook Pro features the Intel quad-core i7 processor. Through hyper-threading, where each core can execute two threads simultaneously, the quad-core i7 can support eight simultaneous processing threads. All this on your laptop! This CPU power can now be put into service by scientists to perform data reduction tasks, but only if the software has been designed to take advantage of multi-processor architectures. Up to now, the Herschel data reduction software (HIPE), written in Jython and Java, has been single-threaded and can only utilize a single processor, so HIPE users get no advantage from the additional processors. Why not put all of the CPU resources to work reducing your data? In this poster, we present a multi-threaded software framework to achieve performance improvements from parallel execution, and show how a task that corrects long-term transients in the signal from the PACS unchopped spectroscopy line scan mode has been threaded. This computation-intensive task uses either a one-parameter or a three-parameter exponential function to characterize the transient, and a Java implementation of Minpack, translated from the C (Moshier) and IDL (Markwardt) versions by the authors, to optimize the correction parameters. We also explain how to determine whether a task can benefit from threading (Amdahl's Law) and whether it is safe to thread. The design and implementation, using the Java concurrency package's completion service, are described. Pitfalls, timing bugs, thread safety, resource control, testing, and performance improvements are described and plotted.
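    As a rough illustration of the threading considerations above, the following Python sketch (not the authors' Java/HIPE framework) estimates the best-case speedup from Amdahl's Law and farms independent per-pixel transient corrections out to a thread pool; all function and variable names are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def amdahl_speedup(parallel_fraction, n_workers):
    """Upper bound on speedup when only part of a task can run in parallel."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_workers)

def correct_transient(timeline):
    # Placeholder for the per-pixel exponential fit; each fit is independent,
    # so the work is safe to distribute across threads.
    return sum(timeline) / len(timeline)

def run_parallel(timelines, n_workers=8):
    results = [None] * len(timelines)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = {pool.submit(correct_transient, t): i for i, t in enumerate(timelines)}
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()
    return results

if __name__ == "__main__":
    # If 95% of the task parallelizes, 8 hyper-threads give at most ~5.9x.
    print(amdahl_speedup(parallel_fraction=0.95, n_workers=8))
    print(run_parallel([[1.0, 2.0, 3.0]] * 4))
```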

  12. Fast Simulation of Dynamic Ultrasound Images Using the GPU.

    PubMed

    Storve, Sigurd; Torp, Hans

    2017-10-01

    Simulated ultrasound data is a valuable tool for development and validation of quantitative image analysis methods in echocardiography. Unfortunately, simulation time can become prohibitive for phantoms consisting of a large number of point scatterers. The COLE algorithm by Gao et al. is a fast convolution-based simulator that trades simulation accuracy for improved speed. We present highly efficient parallelized CPU and GPU implementations of the COLE algorithm with an emphasis on dynamic simulations involving moving point scatterers. We argue that it is crucial to minimize the amount of data transfers from the CPU to achieve good performance on the GPU. We achieve this by storing the complete trajectories of the dynamic point scatterers as spline curves in the GPU memory. This leads to good efficiency when simulating sequences consisting of a large number of frames, such as B-mode and tissue Doppler data for a full cardiac cycle. In addition, we propose a phase-based subsample delay technique that efficiently eliminates flickering artifacts seen in B-mode sequences when COLE is used without enough temporal oversampling. To assess the performance, we used a laptop computer and a desktop computer, each equipped with a multicore Intel CPU and an NVIDIA GPU. Running the simulator on a high-end TITAN X GPU, we observed two orders of magnitude speedup compared to the parallel CPU version, three orders of magnitude speedup compared to simulation times reported by Gao et al. in their paper on COLE, and a speedup of 27000 times compared to the multithreaded version of Field II, using numbers reported in a paper by Jensen. We hope that by releasing the simulator as an open-source project we will encourage its use and further development.
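    The following Python sketch illustrates the general idea of a convolution-style scatterer simulation with a phase-based subsample delay, under assumptions of our own (a Gaussian-enveloped pulse, a single RF line, and made-up sampling and center frequencies); it is not the COLE algorithm or the authors' GPU implementation.

```python
import numpy as np

def simulate_rf_line(depths_m, amplitudes, fs=50e6, f0=5e6, c=1540.0, n_samples=2048):
    """Toy convolution-style simulation: each scatterer contributes a delayed,
    weighted copy of a pulse. The fractional part of the delay is applied as a
    linear phase ramp in the frequency domain, one way to avoid the flicker
    caused by rounding delays to whole samples."""
    t = np.arange(n_samples) / fs
    pulse = np.sin(2 * np.pi * f0 * t) * np.exp(-((t - 0.5e-6) ** 2) / (2 * (0.2e-6) ** 2))
    line = np.zeros(n_samples)
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    for depth, amp in zip(depths_m, amplitudes):
        delay_s = 2.0 * depth / c                        # two-way travel time
        delay_samples = delay_s * fs
        n_int, frac = int(delay_samples), delay_samples - int(delay_samples)
        shifted = np.roll(pulse, n_int)                  # integer-sample delay
        spec = np.fft.rfft(shifted) * np.exp(-2j * np.pi * freqs * frac / fs)
        line += amp * np.fft.irfft(spec, n_samples)      # subsample (phase) delay
    return line

rf = simulate_rf_line(depths_m=[0.01, 0.0123, 0.02], amplitudes=[1.0, 0.5, 0.8])
```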

  13. The DISTO data acquisition system at SATURNE

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Balestra, F.; Bedfer, Y.; Bertini, R.

    1998-06-01

    The DISTO collaboration has built a large-acceptance magnetic spectrometer designed to provide broad kinematic coverage of multiparticle final states produced in pp scattering. The spectrometer has been installed in the polarized proton beam of the Saturne accelerator in Saclay to study polarization observables in the p↑p → pK⁺Y↑ (Y = Λ, Σ⁰ or Y*) reaction and vector meson production (φ, ω and ρ) in pp collisions. The data acquisition system is based on a VME 68030 CPU running the OS/9 operating system, housed in a single VME crate together with the CAMAC interface, the triple-port ECL memories, and four RISC R3000 CPUs. The digitization of signals from the detectors is performed by PCOS III and FERA front-end electronics. Data from several events belonging to a single Saturne extraction are stored in VME triple-port ECL memories using a hardwired fast sequencer. The buffer, optionally filtered by the RISC R3000 CPUs, is recorded on a DLT cassette by the DAQ CPU using the on-board SCSI interface during the acceleration cycle. Two UNIX workstations are connected to the VME CPUs through a fast parallel bus and the Local Area Network; they analyze a subset of events for on-line monitoring. The data acquisition system is able to read and record 3,500 events/burst in the present configuration with a dead time of 15%.

  14. Dynamic Quantum Allocation and Swap-Time Variability in Time-Sharing Operating Systems.

    ERIC Educational Resources Information Center

    Bhat, U. Narayan; Nance, Richard E.

    The effects of dynamic quantum allocation and swap-time variability on central processing unit (CPU) behavior are investigated using a model that allows both quantum length and swap-time to be state-dependent random variables. Effective CPU utilization is defined to be the proportion of a CPU busy period that is devoted to program processing, i.e.…

  15. Fast MPEG-CDVS Encoder With GPU-CPU Hybrid Computing.

    PubMed

    Duan, Ling-Yu; Sun, Wei; Zhang, Xinfeng; Wang, Shiqi; Chen, Jie; Yin, Jianxiong; See, Simon; Huang, Tiejun; Kot, Alex C; Gao, Wen

    2018-05-01

    The compact descriptors for visual search (CDVS) standard from the ISO/IEC Moving Picture Experts Group has succeeded in enabling interoperability for efficient and effective image retrieval by standardizing the bitstream syntax of compact feature descriptors. However, the intensive computation of a CDVS encoder unfortunately hinders its wide deployment in industry for large-scale visual search. In this paper, we revisit the merits of the low-complexity design of the CDVS core techniques and present a very fast CDVS encoder by leveraging the massive parallel execution resources of the graphics processing unit (GPU). We shift the computation-intensive and parallel-friendly modules to state-of-the-art GPU platforms, in which the thread block allocation and the memory access mechanism are jointly optimized to eliminate performance loss. In addition, operations with heavy data dependence are allocated to the CPU to relieve the GPU of unnecessary computation burden. Furthermore, we demonstrate that the proposed fast CDVS encoder works well with convolutional neural network approaches, which allows the advantages of GPU platforms to be leveraged harmoniously and yields significant performance improvements. Comprehensive experimental results over benchmarks show that the fast CDVS encoder using GPU-CPU hybrid computing is promising for scalable visual search.

  16. Scalable Metropolis Monte Carlo for simulation of hard shapes

    NASA Astrophysics Data System (ADS)

    Anderson, Joshua A.; Eric Irrgang, M.; Glotzer, Sharon C.

    2016-07-01

    We design and implement a scalable hard particle Monte Carlo simulation toolkit (HPMC), and release it open source as part of HOOMD-blue. HPMC runs in parallel on many CPUs and many GPUs using domain decomposition. We employ BVH trees instead of cell lists on the CPU for fast performance, especially with large particle size disparity, and optimize inner loops with SIMD vector intrinsics on the CPU. Our GPU kernel proposes many trial moves in parallel on a checkerboard and uses a block-level queue to redistribute work among threads and avoid divergence. HPMC supports a wide variety of shape classes, including spheres/disks, unions of spheres, convex polygons, convex spheropolygons, concave polygons, ellipsoids/ellipses, convex polyhedra, convex spheropolyhedra, spheres cut by planes, and concave polyhedra. NVT and NPT ensembles can be run in 2D or 3D triclinic boxes. Additional integration schemes permit Frenkel-Ladd free energy computations and implicit depletant simulations. In a benchmark system of a fluid of 4096 pentagons, HPMC performs 10 million sweeps in 10 min on 96 CPU cores on XSEDE Comet. The same simulation would take 7.6 h in serial. HPMC also scales to large system sizes, and the same benchmark with 16.8 million particles runs in 1.4 h on 2048 GPUs on OLCF Titan.
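    For readers unfamiliar with hard-particle Monte Carlo, the Python sketch below shows the core acceptance rule for hard disks in a periodic box: a trial move is accepted only if it creates no overlap. It is a minimal serial illustration, not HPMC's checkerboard GPU kernel or BVH-accelerated CPU path.

```python
import numpy as np

def hard_disk_sweep(pos, box, sigma=1.0, max_disp=0.1, rng=None):
    """One Metropolis sweep for hard disks: propose a random displacement for each
    particle and accept it only if no overlap is created (the hard-core potential
    makes the acceptance rule a pure overlap test)."""
    rng = rng or np.random.default_rng()
    n = len(pos)
    for i in rng.permutation(n):
        trial = pos[i] + rng.uniform(-max_disp, max_disp, size=2)
        trial %= box                                    # periodic boundaries
        d = pos - trial
        d -= box * np.rint(d / box)                     # minimum-image convention
        d2 = np.einsum("ij,ij->i", d, d)
        d2[i] = np.inf                                  # ignore self-distance
        if np.all(d2 >= sigma * sigma):                 # no overlap -> accept
            pos[i] = trial
    return pos

rng = np.random.default_rng(0)
positions = rng.uniform(0.0, 20.0, size=(64, 2))
positions = hard_disk_sweep(positions, box=20.0, rng=rng)
```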

  17. Evaluating Mobile Graphics Processing Units (GPUs) for Real-Time Resource Constrained Applications

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Meredith, J; Conger, J; Liu, Y

    2005-11-11

    Modern graphics processing units (GPUs) can provide tremendous performance boosts for some applications beyond what a single CPU can accomplish, and their performance is growing at a rate faster than CPUs as well. Mobile GPUs available for laptops have the small form factor and low power requirements suitable for use in embedded processing. We evaluated several desktop and mobile GPUs and CPUs on traditional and non-traditional graphics tasks, as well as on the most time-consuming pieces of a full hyperspectral imaging application. Accuracy remained high despite small differences in arithmetic operations like rounding. Performance improvements are summarized here relative to a desktop Pentium 4 CPU.

  18. An efficient implementation of semi-numerical computation of the Hartree-Fock exchange on the Intel Phi processor

    NASA Astrophysics Data System (ADS)

    Liu, Fenglai; Kong, Jing

    2018-07-01

    Unique technical challenges and their solutions for implementing semi-numerical Hartree-Fock exchange on the Phi processor are discussed, especially concerning the single-instruction-multiple-data type of processing and the small cache size. Benchmark calculations on a series of buckyball molecules with various Gaussian basis sets on a Phi processor and a six-core CPU show that the Phi processor provides as much as a 12-fold speedup with large basis sets compared with the conventional four-center electron repulsion integration approach performed on the CPU. The accuracy of the semi-numerical scheme is also evaluated and found to be comparable to that of the resolution-of-identity approach.

  19. WARP3D-Release 10.8: Dynamic Nonlinear Analysis of Solids using a Preconditioned Conjugate Gradient Software Architecture

    NASA Technical Reports Server (NTRS)

    Koppenhoefer, Kyle C.; Gullerud, Arne S.; Ruggieri, Claudio; Dodds, Robert H., Jr.; Healy, Brian E.

    1998-01-01

    This report describes theoretical background material and commands necessary to use the WARP3D finite element code. WARP3D is under continuing development as a research code for the solution of very large-scale, 3-D solid models subjected to static and dynamic loads. Specific features in the code oriented toward the investigation of ductile fracture in metals include a robust finite strain formulation, a general J-integral computation facility (with inertia, face loading), an element extinction facility to model crack growth, nonlinear material models including viscoplastic effects, and the Gurson-Tvergaard dilatant plasticity model for void growth. The nonlinear, dynamic equilibrium equations are solved using an incremental-iterative, implicit formulation with full Newton iterations to eliminate residual nodal forces. The history integration of the nonlinear equations of motion is accomplished with Newmark's Beta method. A central feature of WARP3D involves the use of a linear-preconditioned conjugate gradient (LPCG) solver implemented in an element-by-element format to replace a conventional direct linear equation solver. This software architecture dramatically reduces both the memory requirements and CPU time for very large, nonlinear solid models since formation of the assembled (dynamic) stiffness matrix is avoided. Analyses thus exhibit the numerical stability for large time (load) steps provided by the implicit formulation coupled with the low memory requirements characteristic of an explicit code. In addition to the much lower memory requirements of the LPCG solver, the CPU time required for solution of the linear equations during each Newton iteration is generally one-half or less of the CPU time required for a traditional direct solver. All other computational aspects of the code (element stiffnesses, element strains, stress updating, element internal forces) are implemented in the element-by-element, blocked architecture. This greatly improves vectorization of the code on uni-processor hardware and enables straightforward parallel-vector processing of element blocks on multi-processor hardware.
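    A minimal sketch of the solver idea, in Python rather than the code's own language: a Jacobi-preconditioned conjugate gradient that only needs a routine applying the stiffness operator to a vector, so the assembled matrix is never formed. WARP3D's LPCG solver uses an element-by-element formulation and its own preconditioner; the operator and diagonal preconditioner below are placeholders.

```python
import numpy as np

def pcg(apply_A, b, diag_A, tol=1e-8, max_iter=500):
    """Jacobi-preconditioned conjugate gradient. apply_A(x) returns A @ x and can be
    evaluated element-by-element (summing element contributions), so the global
    stiffness matrix never needs to be assembled explicitly."""
    x = np.zeros_like(b)
    r = b - apply_A(x)
    z = r / diag_A                                # diagonal (Jacobi) preconditioner
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = apply_A(p)
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            break
        z = r / diag_A
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# Usage with an explicit SPD matrix standing in for the element-by-element operator:
A = np.array([[4.0, 1.0], [1.0, 3.0]])
x = pcg(lambda v: A @ v, b=np.array([1.0, 2.0]), diag_A=np.diag(A))
```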

  20. BESIII physical offline data analysis on virtualization platform

    NASA Astrophysics Data System (ADS)

    Huang, Q.; Li, H.; Kan, B.; Shi, J.; Lei, X.

    2015-12-01

    In this contribution, we present ongoing work that aims to benefit the BESIII computing system through the higher resource utilization and more efficient job operations brought by cloud and virtualization technology with OpenStack and KVM. We begin with the architecture of the BESIII offline software to understand how it works. We mainly report the KVM performance evaluation and optimization across various hardware and kernel factors. Experimental results show that the CPU performance penalty of KVM can be decreased to approximately 3%. In addition, a performance comparison between KVM and physical machines in terms of CPU, disk I/O, and network I/O is presented. Finally, we present our development work, an adaptive cloud scheduler, which allocates and reclaims VMs dynamically according to the status of the TORQUE queue and the size of the resource pool to improve resource utilization and job processing efficiency.

  1. Research on SEU hardening of heterogeneous Dual-Core SoC

    NASA Astrophysics Data System (ADS)

    Huang, Kun; Hu, Keliu; Deng, Jun; Zhang, Tao

    2017-08-01

    Single-Event Upset (SEU) hardening can be implemented with various schemes; however, some of them require considerable human, material, and financial resources. This paper proposes a simple SEU-hardening scheme for a Heterogeneous Dual-core SoC (HD SoC) that combines three techniques. First, automatic Triple Modular Redundancy (TMR) is adopted to harden the register files of the processor and the instruction-fetch module. Second, Hamming codes are used to harden the random access memory (RAM). Last, a software signature technique is applied to check the programs running on the CPU. The scheme consumes few additional resources and has little influence on the performance of the CPU. These techniques are mature, easy to implement, and low in cost. According to the simulation results, the scheme satisfies the basic requirements of SEU hardening.
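    The first two techniques can be illustrated with a short Python sketch: a bitwise triple-modular-redundancy majority voter and a Hamming(7,4) encode/decode pair that corrects any single-bit upset. This is only a functional illustration of the concepts, not the hardware implementation described in the paper.

```python
def tmr_vote(a, b, c):
    """Bitwise majority vote across three redundant copies of a register value."""
    return (a & b) | (a & c) | (b & c)

def hamming74_encode(nibble):
    """Encode 4 data bits into a 7-bit Hamming(7,4) codeword (layout p1 p2 d1 p3 d2 d3 d4)."""
    d = [(nibble >> i) & 1 for i in (3, 2, 1, 0)]       # d1..d4
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    bits = [p1, p2, d[0], p3, d[1], d[2], d[3]]
    return sum(b << (6 - i) for i, b in enumerate(bits))

def hamming74_decode(code):
    """Correct any single-bit error and return the 4 data bits."""
    bits = [(code >> (6 - i)) & 1 for i in range(7)]    # codeword positions 1..7
    s1 = bits[0] ^ bits[2] ^ bits[4] ^ bits[6]
    s2 = bits[1] ^ bits[2] ^ bits[5] ^ bits[6]
    s3 = bits[3] ^ bits[4] ^ bits[5] ^ bits[6]
    syndrome = s1 + (s2 << 1) + (s3 << 2)               # 1-based error position, 0 if none
    if syndrome:
        bits[syndrome - 1] ^= 1                         # flip the faulty bit
    d = [bits[2], bits[4], bits[5], bits[6]]
    return (d[0] << 3) | (d[1] << 2) | (d[2] << 1) | d[3]

assert tmr_vote(0b1010, 0b1010, 0b0011) == 0b1010
assert all(hamming74_decode(hamming74_encode(n) ^ (1 << e)) == n
           for n in range(16) for e in range(7))
```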

  2. Irregular large-scale computed tomography on multiple graphics processors improves energy-efficiency metrics for industrial applications

    NASA Astrophysics Data System (ADS)

    Jimenez, Edward S.; Goodman, Eric L.; Park, Ryeojin; Orr, Laurel J.; Thompson, Kyle R.

    2014-09-01

    This paper investigates energy efficiency for various real-world industrial computed-tomography reconstruction algorithms, in both CPU- and GPU-based implementations. This work shows that the energy required for a given reconstruction depends on performance and problem size. There are many ways to describe performance and energy efficiency, so this work investigates multiple metrics, including performance-per-watt, energy-delay product, and energy consumption. This work found that irregular GPU-based approaches realized tremendous savings in energy consumption when compared to CPU implementations, while also significantly improving the performance-per-watt and energy-delay product metrics. Additional energy savings and further metric improvements were realized in the GPU-based reconstructions by improving storage I/O through a parallel MIMD-like modularization of the compute and I/O tasks.
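    For reference, the metrics named above can be computed as in the Python sketch below; the runtime, power, and workload numbers are hypothetical and only illustrate why a faster but higher-power device can still win on the energy-delay product.

```python
def energy_metrics(runtime_s, avg_power_w, work_units):
    """Common energy-efficiency metrics for one reconstruction run:
    energy (J), performance-per-watt (work/s/W), and energy-delay product (J*s)."""
    energy_j = avg_power_w * runtime_s
    perf = work_units / runtime_s
    return {
        "energy_J": energy_j,
        "performance_per_watt": perf / avg_power_w,
        "energy_delay_product": energy_j * runtime_s,
    }

# Hypothetical numbers: a GPU run that is 8x faster at 2x the average power
# uses 4x less energy and improves the energy-delay product by 32x.
cpu = energy_metrics(runtime_s=800.0, avg_power_w=150.0, work_units=1.0)
gpu = energy_metrics(runtime_s=100.0, avg_power_w=300.0, work_units=1.0)
print(cpu, gpu)
```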

  3. Structure of the airflow above surface waves

    NASA Astrophysics Data System (ADS)

    Buckley, Marc; Veron, Fabrice

    2016-04-01

    Weather, climate and upper ocean patterns are controlled by the exchanges of momentum, heat, mass, and energy across the ocean surface. These fluxes are, in turn, influenced by the small-scale physics at the wavy air-sea interface. We present laboratory measurements of the fine-scale airflow structure above waves, achieved in over 15 different wind-wave conditions, with wave ages Cp/u* ranging from 1.4 to 66.7 (where Cp is the peak phase speed of the waves, and u* the air friction velocity). The experiments were performed in the large (42-m long) wind-wave-current tank at University of Delaware's Air-Sea Interaction laboratory (USA). A combined Particle Image Velocimetry and Laser Induced Fluorescence system was specifically developed for this study, and provided two-dimensional airflow velocity measurement as low as 100 um above the air-water interface. Starting at very low wind speeds (U10~2m/s), we directly observe coherent turbulent structures within the buffer and logarithmic layers of the airflow above the air-water interface, whereby low horizontal velocity air is ejected away from the surface, and higher velocity fluid is swept downward. Wave phase coherent quadrant analysis shows that such turbulent momentum flux events are wave-phase dependent. Airflow separation events are directly observed over young wind waves (Cp/u*<3.7) and counted using measured vorticity and surface viscous stress criteria. Detached high spanwise vorticity layers cause intense wave-coherent turbulence downwind of wave crests, as shown by wave-phase averaging of turbulent momentum fluxes. Mean wave-coherent airflow motions and fluxes also show strong phase-locked patterns, including a sheltering effect, upwind of wave crests over old mechanically generated swells (Cp/u*=31.7), and downwind of crests over young wind waves (Cp/u*=3.7). Over slightly older wind waves (Cp/u* = 6.5), the measured wave-induced airflow perturbations are qualitatively consistent with linear critical layer theory.

  4. Monitoring and tracing of critical software systems: State of the work and project definition

    DTIC Science & Technology

    2008-12-01

    analysis, troubleshooting and debugging. Some of these subsystems already come with ad hoc tracers for events like wireless connections or SCSI disk... SQLite). Additional synthetic events (e.g. states) are added to the database. The database thus consists of contexts (process, CPU, state), event... capability on a [operating] system-by-system basis. Additionally, the mechanics of querying the data in an ad-hoc manner outside the boundaries of the...

  5. Establishment and progress of the chest pain unit certification process in Germany and the local experiences of Mainz.

    PubMed

    Post, Felix; Gori, Tommaso; Senges, Jochen; Giannitsis, Evangelos; Katus, Hugo; Münzel, Thomas

    2012-03-01

    The establishment of chest pain units (CPUs) in the USA and UK has led to improvements in the prognosis of patients with chest pain and myocardial infarction, optimizing access to specialized diagnostic and therapeutic facilities and reducing costs. To establish a uniform implementation of this type of service in Germany, the German Cardiac Society (DGK) founded a 'CPU task force' in 2007, which developed a set of standard requirements and a nationwide certification programme. The recommendations for minimum standard requirements were published in 2008. As of November 2011, 132 CPUs were certified and 36 units were in the certification process. The aim of the DGK is to certify as many as 250 centres (units) throughout Germany within the next 2 years, to provide nationwide coverage. Applications from Switzerland are also being filed. Public awareness campaigns in cooperation with national league soccer teams were organized to raise awareness of the importance of early diagnosis and treatment of cardiac diseases and to publicize the existence of these new facilities. The German model of CPU certification allows nationwide and, prospectively, Europe-wide standardization of patient care and improved adherence to international guidelines. Coupled with awareness campaigns and with the launch of a German CPU Registry, this process aims to improve the education and treatment of patients with chest pain and to provide scientific information about the quality of patient care.

  6. SU-E-J-60: Efficient Monte Carlo Dose Calculation On CPU-GPU Heterogeneous Systems

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Xiao, K; Chen, D. Z; Hu, X. S

    Purpose: It is well-known that the performance of GPU-based Monte Carlo dose calculation implementations is bounded by memory bandwidth. One major cause of this bottleneck is the random memory writing patterns in dose deposition, which leads to several memory efficiency issues on the GPU such as un-coalesced writing and atomic operations. We propose a new method to alleviate such issues on CPU-GPU heterogeneous systems, which achieves overall performance improvement for Monte Carlo dose calculation. Methods: Dose deposition accumulates dose into the voxels of a dose volume along the trajectories of radiation rays. Our idea is to partition this procedure into the following three steps, which are fine-tuned for the CPU or GPU: (1) each GPU thread writes dose results with location information to a buffer in GPU memory, which achieves fully-coalesced and atomic-free memory transactions; (2) the dose results in the buffer are transferred to CPU memory; (3) the dose volume is constructed from the dose buffer on the CPU. We organize the processing of all radiation rays into streams. Since the steps within a stream use different hardware resources (i.e., GPU, DMA, CPU), we can overlap the execution of these steps for different streams by pipelining. Results: We evaluated our method using a Monte Carlo Convolution Superposition (MCCS) program and tested our implementation for various clinical cases on a heterogeneous system containing an Intel i7 quad-core CPU and an NVIDIA TITAN GPU. Compared with a straightforward MCCS implementation on the same system (using both CPU and GPU for radiation ray tracing), our method gained a 2-5X speedup without losing dose calculation accuracy. Conclusion: The results show that our new method improves the effective memory bandwidth and overall performance for MCCS on CPU-GPU systems. Our proposed method can also be applied to accelerate other Monte Carlo dose calculation approaches. This research was supported in part by NSF under Grants CCF-1217906, and also in part by a research contract from the Sandia National Laboratories.
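    The three-step buffering idea can be sketched in plain Python/NumPy, standing in for the CUDA streams and DMA transfers of the actual implementation: the "GPU" stage emits (voxel index, dose) pairs into a buffer, and a separate worker accumulates the previous buffer into the dose volume while the next batch is produced. All names, sizes, and the random deposition pattern are illustrative.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def gpu_stage(ray_batch, n_voxels, rng):
    """Stand-in for the GPU step: emit (voxel index, dose) pairs into a buffer
    instead of scattering atomically into the dose volume."""
    idx = rng.integers(0, n_voxels, size=len(ray_batch))
    dose = rng.random(len(ray_batch))
    return idx, dose

def cpu_stage(buffer, dose_volume):
    """CPU step: accumulate the buffered (index, dose) pairs into the dose volume."""
    idx, dose = buffer
    np.add.at(dose_volume, idx, dose)

def run_pipelined(batches, n_voxels=64**3, seed=0):
    dose_volume = np.zeros(n_voxels)
    rng = np.random.default_rng(seed)
    with ThreadPoolExecutor(max_workers=1) as cpu_pool:
        pending = None
        for batch in batches:
            buffer = gpu_stage(batch, n_voxels, rng)    # stream k on the "GPU"
            if pending is not None:
                pending.result()                        # wait for stream k-1 on the CPU
            pending = cpu_pool.submit(cpu_stage, buffer, dose_volume)
        if pending is not None:
            pending.result()
    return dose_volume

vol = run_pipelined([np.arange(1000)] * 4)
```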

  7. Interfacing a high performance disk array file server to a Gigabit LAN

    NASA Technical Reports Server (NTRS)

    Seshan, Srinivasan; Katz, Randy H.

    1993-01-01

    Our previous prototype, RAID-1, identified several bottlenecks in typical file server architectures. The most important bottleneck was the lack of a high-bandwidth path between disk, memory, and the network. Workstation servers, such as the Sun-4/280, have very slow access to peripherals on busses far from the CPU. For the RAID-2 system, we addressed this problem by designing a crossbar interconnect, Xbus board, that provides a 40MB/s path between disk, memory, and the network interfaces. However, this interconnect does not provide the system CPU with low latency access to control the various interfaces. To provide a high data rate to clients on the network, we were forced to carefully and efficiently design the network software. A block diagram of the system hardware architecture is given. In the following subsections, we describe pieces of the RAID-2 file server hardware that had a significant impact on the design of the network interface.

  8. GPU Particle Tracking and MHD Simulations with Greatly Enhanced Computational Speed

    NASA Astrophysics Data System (ADS)

    Ziemba, T.; O'Donnell, D.; Carscadden, J.; Cash, M.; Winglee, R.; Harnett, E.

    2008-12-01

    GPUs are intrinsically highly parallelized systems that provide more than an order of magnitude more computing speed than CPU-based systems, for less cost than a high-end workstation. Recent advancements in GPU technologies allow for full IEEE float specifications with performance up to several hundred GFLOPs per GPU, and new software architectures have recently become available to ease the transition from graphics-based to scientific applications. This allows for a cheap alternative to standard supercomputing methods and should shorten the time to discovery. 3-D particle tracking and MHD codes have been developed using NVIDIA's CUDA and have demonstrated a speedup of nearly a factor of 20 over equivalent CPU versions of the codes. Such a speedup enables new applications, including real-time running of radiation belt simulations and real-time running of global magnetospheric simulations, both of which could provide important space weather prediction tools.

  9. Efficient sparse matrix multiplication scheme for the CYBER 203

    NASA Technical Reports Server (NTRS)

    Lambiotte, J. J., Jr.

    1984-01-01

    This work has been directed toward the development of an efficient algorithm for performing this computation on the CYBER-203. The desire to provide software that gives the user the choice between the often conflicting goals of minimizing central processing unit (CPU) time or storage requirements has led to a diagonal-based algorithm in which one of three types of storage is selected for each diagonal. For each storage type, an initialization subroutine estimates the CPU and storage requirements based upon results from previously performed numerical experimentation. These requirements are adjusted by weights provided by the user which reflect the relative importance the user places on the resources. The three storage types employed were chosen to be efficient on the CYBER-203 for diagonals which are sparse, moderately sparse, or dense; however, for many densities, no storage type is most efficient with respect to both resource requirements, and the user-supplied weights dictate the choice.
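    The weighted selection can be illustrated with a small Python sketch; the per-slot cost models below are invented placeholders rather than the CYBER-203 measurements used by the original initialization routines.

```python
def choose_storage(density, w_cpu=1.0, w_mem=1.0):
    """Pick a storage type for one diagonal from its density (fraction of nonzero
    entries), minimizing a weighted sum of rough per-slot CPU-time and storage cost
    estimates supplied by the user."""
    costs = {
        # storage type: (CPU cost per slot, storage per slot), both density-dependent
        "sparse":   (4.0 * density,       2.0 * density),        # index + value per nonzero
        "moderate": (0.5 + 2.0 * density, 0.2 + 1.5 * density),  # compressed, fixed overhead
        "dense":    (1.0,                 1.0),                  # store and process every slot
    }
    return min(costs, key=lambda k: w_cpu * costs[k][0] + w_mem * costs[k][1])

print(choose_storage(0.9,  w_cpu=10.0, w_mem=1.0))   # "dense"  (CPU-time dominated)
print(choose_storage(0.01, w_cpu=1.0,  w_mem=10.0))  # "sparse" (storage dominated)
```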

  10. Strong scaling of general-purpose molecular dynamics simulations on GPUs

    NASA Astrophysics Data System (ADS)

    Glaser, Jens; Nguyen, Trung Dac; Anderson, Joshua A.; Lui, Pak; Spiga, Filippo; Millan, Jaime A.; Morse, David C.; Glotzer, Sharon C.

    2015-07-01

    We describe a highly optimized implementation of MPI domain decomposition in a GPU-enabled, general-purpose molecular dynamics code, HOOMD-blue (Anderson and Glotzer, 2013). Our approach is inspired by a traditional CPU-based code, LAMMPS (Plimpton, 1995), but is implemented within a code that was designed for execution on GPUs from the start (Anderson et al., 2008). The software supports short-ranged pair force and bond force fields and achieves optimal GPU performance using an autotuning algorithm. We are able to demonstrate equivalent or superior scaling on up to 3375 GPUs in Lennard-Jones and dissipative particle dynamics (DPD) simulations of up to 108 million particles. GPUDirect RDMA capabilities in recent GPU generations provide better performance in full double precision calculations. For a representative polymer physics application, HOOMD-blue 1.0 provides an effective GPU vs. CPU node speed-up of 12.5 ×.

  11. Accelerated Cartesian expansions for the rapid solution of periodic multiscale problems

    DOE PAGES

    Baczewski, Andrew David; Dault, Daniel L.; Shanker, Balasubramaniam

    2012-07-03

    We present an algorithm for the fast and efficient solution of integral equations that arise in the analysis of scattering from periodic arrays of PEC objects, such as multiband frequency selective surfaces (FSS) or metamaterial structures. Our approach relies upon the method of Accelerated Cartesian Expansions (ACE) to rapidly evaluate the requisite potential integrals. ACE is analogous to FMM in that it can be used to accelerate the matrix vector product used in the solution of systems discretized using MoM. Here, ACE provides linear scaling in both CPU time and memory. Details regarding the implementation of this method within the context of periodic systems are provided, as well as results that establish error convergence and scalability. In addition, we also demonstrate the applicability of this algorithm by studying several exemplary electrically dense systems.

  12. Fast multipurpose Monte Carlo simulation for proton therapy using multi- and many-core CPU architectures

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Souris, Kevin, E-mail: kevin.souris@uclouvain.be; Lee, John Aldo; Sterpin, Edmond

    2016-04-15

    Purpose: Accuracy in proton therapy treatment planning can be improved using Monte Carlo (MC) simulations. However, the long computation time of such methods hinders their use in clinical routine. This work aims to develop a fast multipurpose Monte Carlo simulation tool for proton therapy using massively parallel central processing unit (CPU) architectures. Methods: A new Monte Carlo, called MCsquare (many-core Monte Carlo), has been designed and optimized for the last generation of Intel Xeon processors and Intel Xeon Phi coprocessors. These massively parallel architectures offer the flexibility and the computational power suitable to MC methods. The class-II condensed history algorithm of MCsquare provides a fast and yet accurate method of simulating heavy charged particles such as protons, deuterons, and alphas inside voxelized geometries. Hard ionizations, with energy losses above a user-specified threshold, are simulated individually while soft events are regrouped in a multiple scattering theory. Elastic and inelastic nuclear interactions are sampled from ICRU 63 differential cross sections, thereby allowing for the computation of prompt gamma emission profiles. MCsquare has been benchmarked with the GATE/GEANT4 Monte Carlo application for homogeneous and heterogeneous geometries. Results: Comparisons with GATE/GEANT4 for various geometries show deviations within 2%-1 mm. In spite of the limited memory bandwidth of the coprocessor, simulation time is below 25 s for 10^7 primary 200 MeV protons in average soft tissues using all Xeon Phi and CPU resources embedded in a single desktop unit. Conclusions: MCsquare exploits the flexibility of CPU architectures to provide a multipurpose MC simulation tool. Optimized code enables the use of accurate MC calculation within a reasonable computation time, adequate for clinical practice. MCsquare also simulates prompt gamma emission and can thus be used also for in vivo range verification.

  13. Synthesis and Characterization of Biodegradable Polyurethane for Hypopharyngeal Tissue Engineering

    PubMed Central

    Shen, Zhisen; Lu, Dakai; Li, Qun; Zhang, Zongyong

    2015-01-01

    Biodegradable crosslinked polyurethane (cPU) was synthesized using polyethylene glycol (PEG), L-lactide (L-LA), and hexamethylene diisocyanate (HDI), with iron acetylacetonate (Fe(acac)3) as the catalyst and PEG as the extender. Chemical components of the obtained polymers were characterized by FTIR spectroscopy, 1H NMR spectra, and Gel Permeation Chromatography (GPC). The thermodynamic properties, mechanical behaviors, surface hydrophilicity, degradability, and cytotoxicity were tested via differential scanning calorimetry (DSC), tensile tests, contact angle measurements, and cell culture. The results show that the synthesized cPU possessed good flexibility with a quite low glass transition temperature (Tg, −22°C) and good wettability. Water uptake was as high as 229.7 ± 18.7%. These properties make cPU a good candidate material for engineering soft tissues such as the hypopharynx. In vitro and in vivo tests showed that cPU has the ability to support the growth of human hypopharyngeal fibroblasts, and angiogenesis was observed around cPU after it was implanted subcutaneously in SD rats. PMID:25839041

  14. Is our medical school socially accountable? The case of Faculty of Medicine, Suez Canal University.

    PubMed

    Hosny, Somaya; Ghaly, Mona; Boelen, Charles

    2015-04-01

    Faculty of Medicine, Suez Canal University (FOM/SCU) was established as a community-oriented school with innovative educational strategies. Social accountability represents the commitment of the medical school towards the community it serves. The aim was to assess FOM/SCU's compliance with social accountability using the "Conceptualization, Production, Usability" (CPU) model. FOM/SCU's practice was reviewed against the CPU model parameters. The CPU model consists of three domains, 11 sections, and 31 parameters. Data were collected through unstructured interviews with the main stakeholders and document review covering 2005 to 2013. FOM/SCU shows general compliance with the three domains of the CPU model. Very good compliance was shown with the "P" domain of the model through FOM/SCU's innovative educational system, students, and faculty members. More work is needed on the "C" and "U" domains. FOM/SCU complies with many parameters of the CPU model; however, more work should be accomplished on some items in the C and U domains so that FOM/SCU can be recognized as a proactive, socially accountable school.

  15. GPU Optimizations for a Production Molecular Docking Code*

    PubMed Central

    Landaverde, Raphael; Herbordt, Martin C.

    2015-01-01

    Modeling molecular docking is critical to both understanding life processes and designing new drugs. In previous work we created the first published GPU-accelerated docking code (PIPER) which achieved a roughly 5× speed-up over a contemporaneous 4-core CPU. Advances in GPU architecture and in the CPU code, however, have since reduced this relative performance by a factor of 10. In this paper we describe the upgrade of GPU PIPER. This required an entire rewrite, including algorithm changes and moving most remaining non-accelerated CPU code onto the GPU. The result is a 7× improvement in GPU performance and a 3.3× speedup over the CPU-only code. We find that this difference in time is almost entirely due to the difference in run times of the 3D FFT library functions on CPU (MKL) and GPU (cuFFT), respectively. The GPU code has been integrated into the ClusPro docking server which has over 4000 active users. PMID:26594667

  16. GPU Optimizations for a Production Molecular Docking Code.

    PubMed

    Landaverde, Raphael; Herbordt, Martin C

    2014-09-01

    Modeling molecular docking is critical to both understanding life processes and designing new drugs. In previous work we created the first published GPU-accelerated docking code (PIPER) which achieved a roughly 5× speed-up over a contemporaneous 4-core CPU. Advances in GPU architecture and in the CPU code, however, have since reduced this relative performance by a factor of 10. In this paper we describe the upgrade of GPU PIPER. This required an entire rewrite, including algorithm changes and moving most remaining non-accelerated CPU code onto the GPU. The result is a 7× improvement in GPU performance and a 3.3× speedup over the CPU-only code. We find that this difference in time is almost entirely due to the difference in run times of the 3D FFT library functions on CPU (MKL) and GPU (cuFFT), respectively. The GPU code has been integrated into the ClusPro docking server which has over 4000 active users.

  17. Synthesis and characterization of biodegradable polyurethane for hypopharyngeal tissue engineering.

    PubMed

    Shen, Zhisen; Lu, Dakai; Li, Qun; Zhang, Zongyong; Zhu, Yabin

    2015-01-01

    Biodegradable crosslinked polyurethane (cPU) was synthesized using polyethylene glycol (PEG), L-lactide (L-LA), and hexamethylene diisocyanate (HDI), with iron acetylacetonate (Fe(acac)3) as the catalyst and PEG as the extender. Chemical components of the obtained polymers were characterized by FTIR spectroscopy, 1H NMR spectra, and Gel Permeation Chromatography (GPC). The thermodynamic properties, mechanical behaviors, surface hydrophilicity, degradability, and cytotoxicity were tested via differential scanning calorimetry (DSC), tensile tests, contact angle measurements, and cell culture. The results show that the synthesized cPU possessed good flexibility with a quite low glass transition temperature (Tg, -22°C) and good wettability. Water uptake was as high as 229.7 ± 18.7%. These properties make cPU a good candidate material for engineering soft tissues such as the hypopharynx. In vitro and in vivo tests showed that cPU has the ability to support the growth of human hypopharyngeal fibroblasts, and angiogenesis was observed around cPU after it was implanted subcutaneously in SD rats.

  18. A FAST ITERATIVE METHOD FOR SOLVING THE EIKONAL EQUATION ON TRIANGULATED SURFACES*

    PubMed Central

    Fu, Zhisong; Jeong, Won-Ki; Pan, Yongsheng; Kirby, Robert M.; Whitaker, Ross T.

    2012-01-01

    This paper presents an efficient, fine-grained parallel algorithm for solving the Eikonal equation on triangular meshes. The Eikonal equation, and the broader class of Hamilton–Jacobi equations to which it belongs, have a wide range of applications from geometric optics and seismology to biological modeling and analysis of geometry and images. The ability to solve such equations accurately and efficiently provides new capabilities for exploring and visualizing parameter spaces and for solving inverse problems that rely on such equations in the forward model. Efficient solvers on state-of-the-art, parallel architectures require new algorithms that are not, in many cases, optimal, but are better suited to synchronous updates of the solution. In previous work [W. K. Jeong and R. T. Whitaker, SIAM J. Sci. Comput., 30 (2008), pp. 2512–2534], the authors proposed the fast iterative method (FIM) to efficiently solve the Eikonal equation on regular grids. In this paper we extend the fast iterative method to solve Eikonal equations efficiently on triangulated domains on the CPU and on parallel architectures, including graphics processors. We propose a new local update scheme that provides solutions of first-order accuracy for both architectures. We also propose a novel triangle-based update scheme and its corresponding data structure for efficient irregular data mapping to parallel single-instruction multiple-data (SIMD) processors. We provide detailed descriptions of the implementations on a single CPU, a multicore CPU with shared memory, and SIMD architectures with comparative results against state-of-the-art Eikonal solvers. PMID:22641200
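    As a simplified illustration of the fast iterative method, the Python sketch below solves the Eikonal equation on a regular 2D grid with a Godunov upwind local update and an active list; the triangle-based update scheme and SIMD data layout described in the paper are not reproduced here.

```python
import numpy as np

def fim_eikonal_2d(speed, sources, h=1.0, eps=1e-6):
    """Fast-iterative-method-style solver for |grad T| = 1/speed on a regular grid:
    nodes on an active list are relaxed with an upwind local update until they stop
    changing, at which point their neighbors are considered for activation."""
    ny, nx = speed.shape
    T = np.full((ny, nx), np.inf)
    for (i, j) in sources:
        T[i, j] = 0.0

    def neighbors(i, j):
        for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            if 0 <= i + di < ny and 0 <= j + dj < nx:
                yield i + di, j + dj

    def local_update(i, j):
        a = min([T[n] for n in neighbors(i, j) if n[0] != i] + [np.inf])  # vertical pair
        b = min([T[n] for n in neighbors(i, j) if n[1] != j] + [np.inf])  # horizontal pair
        f = h / speed[i, j]
        if abs(a - b) >= f:
            return min(a, b) + f
        return 0.5 * (a + b + np.sqrt(2.0 * f * f - (a - b) ** 2))

    active = {n for s in sources for n in neighbors(*s)} - set(sources)
    while active:
        next_active = set()
        for (i, j) in active:
            new = local_update(i, j)
            if T[i, j] - new > eps:          # still improving: keep relaxing
                T[i, j] = new
                next_active.add((i, j))
            else:                            # converged: try to activate neighbors
                for n in neighbors(i, j):
                    if local_update(*n) < T[n] - eps:
                        next_active.add(n)
        active = next_active - set(sources)
    return T

T = fim_eikonal_2d(np.ones((64, 64)), sources=[(0, 0)])
```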

  19. Use of general purpose graphics processing units with MODFLOW

    USGS Publications Warehouse

    Hughes, Joseph D.; White, Jeremy T.

    2013-01-01

    To evaluate the use of general-purpose graphics processing units (GPGPUs) to improve the performance of MODFLOW, an unstructured preconditioned conjugate gradient (UPCG) solver has been developed. The UPCG solver uses a compressed sparse row storage scheme and includes Jacobi, zero fill-in incomplete, and modified-incomplete lower-upper (LU) factorization, and generalized least-squares polynomial preconditioners. The UPCG solver also includes options for sequential and parallel solution on the central processing unit (CPU) using OpenMP. For simulations utilizing the GPGPU, all basic linear algebra operations are performed on the GPGPU; memory copies between the CPU and GPGPU occur prior to the first iteration of the UPCG solver and after satisfying head and flow criteria or exceeding a maximum number of iterations. The efficiency of the UPCG solver for GPGPU and CPU solutions is benchmarked using simulations of a synthetic, heterogeneous unconfined aquifer with tens of thousands to millions of active grid cells. Testing indicates GPGPU speedups on the order of 2 to 8, relative to the standard MODFLOW preconditioned conjugate gradient (PCG) solver, can be achieved when (1) memory copies between the CPU and GPGPU are optimized, (2) the percentage of time performing memory copies between the CPU and GPGPU is small relative to the calculation time, (3) high-performance GPGPU cards are utilized, and (4) CPU-GPGPU combinations are used to execute sequential operations that are difficult to parallelize. Furthermore, UPCG solver testing indicates GPGPU speedups exceed parallel CPU speedups achieved using OpenMP on multicore CPUs for preconditioners that can be easily parallelized.

  20. Compute-unified device architecture implementation of a block-matching algorithm for multiple graphical processing unit cards

    PubMed Central

    Massanes, Francesc; Cadennes, Marie; Brankov, Jovan G.

    2012-01-01

    In this paper we describe and evaluate a fast implementation of a classical block matching motion estimation algorithm for multiple Graphical Processing Units (GPUs) using the Compute Unified Device Architecture (CUDA) computing engine. The implemented block matching algorithm (BMA) uses summed absolute difference (SAD) error criterion and full grid search (FS) for finding optimal block displacement. In this evaluation we compared the execution time of a GPU and CPU implementation for images of various sizes, using integer and non-integer search grids. The results show that use of a GPU card can shorten computation time by a factor of 200 times for integer and 1000 times for a non-integer search grid. The additional speedup for non-integer search grid comes from the fact that GPU has built-in hardware for image interpolation. Further, when using multiple GPU cards, the presented evaluation shows the importance of the data splitting method across multiple cards, but an almost linear speedup with a number of cards is achievable. In addition we compared execution time of the proposed FS GPU implementation with two existing, highly optimized non-full grid search CPU based motion estimations methods, namely implementation of the Pyramidal Lucas Kanade Optical flow algorithm in OpenCV and Simplified Unsymmetrical multi-Hexagon search in H.264/AVC standard. In these comparisons, FS GPU implementation still showed modest improvement even though the computational complexity of FS GPU implementation is substantially higher than non-FS CPU implementation. We also demonstrated that for an image sequence of 720×480 pixels in resolution, commonly used in video surveillance, the proposed GPU implementation is sufficiently fast for real-time motion estimation at 30 frames-per-second using two NVIDIA C1060 Tesla GPU cards. PMID:22347787
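    The underlying full-search SAD criterion is easy to state in a few lines of Python; the sketch below is a slow CPU reference of the brute-force search (the work that the paper maps onto GPU thread blocks), with made-up block and search-range sizes.

```python
import numpy as np

def block_match_fs(ref, cur, block=16, search=8):
    """Full-search block matching with a sum-of-absolute-differences (SAD) criterion.
    For every block in the current frame, every integer displacement within
    [-search, +search] is tested against the reference frame."""
    h, w = cur.shape
    vectors = np.zeros((h // block, w // block, 2), dtype=int)
    for by in range(0, h - block + 1, block):
        for bx in range(0, w - block + 1, block):
            cur_blk = cur[by:by + block, bx:bx + block].astype(np.int32)
            best = (np.inf, 0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    y, x = by + dy, bx + dx
                    if y < 0 or x < 0 or y + block > h or x + block > w:
                        continue                        # skip displacements leaving the frame
                    ref_blk = ref[y:y + block, x:x + block].astype(np.int32)
                    sad = np.abs(cur_blk - ref_blk).sum()
                    if sad < best[0]:
                        best = (sad, dy, dx)
            vectors[by // block, bx // block] = best[1:]
    return vectors

rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
mv = block_match_fs(frame, np.roll(frame, (2, 3), axis=(0, 1)))
```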

  1. Compute-unified device architecture implementation of a block-matching algorithm for multiple graphical processing unit cards.

    PubMed

    Massanes, Francesc; Cadennes, Marie; Brankov, Jovan G

    2011-07-01

    In this paper we describe and evaluate a fast implementation of a classical block matching motion estimation algorithm for multiple Graphical Processing Units (GPUs) using the Compute Unified Device Architecture (CUDA) computing engine. The implemented block matching algorithm (BMA) uses summed absolute difference (SAD) error criterion and full grid search (FS) for finding optimal block displacement. In this evaluation we compared the execution time of a GPU and CPU implementation for images of various sizes, using integer and non-integer search grids. The results show that use of a GPU card can shorten computation time by a factor of 200 times for integer and 1000 times for a non-integer search grid. The additional speedup for non-integer search grid comes from the fact that GPU has built-in hardware for image interpolation. Further, when using multiple GPU cards, the presented evaluation shows the importance of the data splitting method across multiple cards, but an almost linear speedup with a number of cards is achievable. In addition we compared execution time of the proposed FS GPU implementation with two existing, highly optimized non-full grid search CPU based motion estimations methods, namely implementation of the Pyramidal Lucas Kanade Optical flow algorithm in OpenCV and Simplified Unsymmetrical multi-Hexagon search in H.264/AVC standard. In these comparisons, FS GPU implementation still showed modest improvement even though the computational complexity of FS GPU implementation is substantially higher than non-FS CPU implementation. We also demonstrated that for an image sequence of 720×480 pixels in resolution, commonly used in video surveillance, the proposed GPU implementation is sufficiently fast for real-time motion estimation at 30 frames-per-second using two NVIDIA C1060 Tesla GPU cards.

  2. Tempest: GPU-CPU computing for high-throughput database spectral matching.

    PubMed

    Milloy, Jeffrey A; Faherty, Brendan K; Gerber, Scott A

    2012-07-06

    Modern mass spectrometers are now capable of producing hundreds of thousands of tandem (MS/MS) spectra per experiment, making the translation of these fragmentation spectra into peptide matches a common bottleneck in proteomics research. When coupled with experimental designs that enrich for post-translational modifications such as phosphorylation and/or include isotopically labeled amino acids for quantification, additional burdens are placed on this computational infrastructure by shotgun sequencing. To address this issue, we have developed a new database searching program that utilizes the massively parallel compute capabilities of a graphical processing unit (GPU) to produce peptide spectral matches in a very high throughput fashion. Our program, named Tempest, combines efficient database digestion and MS/MS spectral indexing on a CPU with fast similarity scoring on a GPU. In our implementation, the entire similarity score, including the generation of full theoretical peptide candidate fragmentation spectra and its comparison to experimental spectra, is conducted on the GPU. Although Tempest uses the classical SEQUEST XCorr score as a primary metric for evaluating similarity for spectra collected at unit resolution, we have developed a new "Accelerated Score" for MS/MS spectra collected at high resolution that is based on a computationally inexpensive dot product but exhibits scoring accuracy similar to that of the classical XCorr. In our experience, Tempest provides compute-cluster level performance in an affordable desktop computer.

  3. High Capacity Single Table Performance Design Using Partitioning in Oracle or PostgreSQL

    DTIC Science & Technology

    2012-03-01

    In addition to pure response time, the report considers other key performance indicators (KPIs); the remaining indexed excerpt consists only of table-of-contents and acronym-list fragments (e.g., Automatic Storage Management (ASM), CPU, I/O, operating system).

  4. Development of the Large-Scale Statistical Analysis System of Satellites Observations Data with Grid Datafarm Architecture

    NASA Astrophysics Data System (ADS)

    Yamamoto, K.; Murata, K.; Kimura, E.; Honda, R.

    2006-12-01

    In the Solar-Terrestrial Physics (STP) field, the amount of satellite observation data has been increasing every year. Three problems must be solved to achieve large-scale statistical analyses of such data. (i) More CPU power and larger memory and disk sizes are required; the total power of personal computers is not sufficient to analyze such amounts of data. Super-computers provide high-performance CPUs and rich memory, but they are usually separated from the Internet or connected only for programming or data file transfer. (ii) Most of the observation data files are managed at distributed data sites over the Internet, so users have to know where the data files are located. (iii) Since no common data format is available in the STP field, users have to prepare a reading program for each dataset themselves. To overcome problems (i) and (ii), we constructed a parallel and distributed data analysis environment based on the Gfarm reference implementation of the Grid Datafarm architecture. Gfarm shares computational resources and performs parallel distributed processing. In addition, Gfarm provides the Gfarm filesystem, which acts as a virtual directory tree among nodes. The Gfarm environment is composed of three parts: a metadata server that manages distributed file information, filesystem nodes that provide computational resources, and a client that submits jobs to the metadata server and manages data processing schedules. In the present study, both data files and data processes are parallelized on a Gfarm with 6 filesystem nodes; each node has a 1 GHz Pentium CPU, 256 MB of memory, and a 40 GB disk. To evaluate the performance of the present Gfarm system, we scanned a large number of data files, each about 300 MB in size, using three processing methods: sequential processing on one node, sequential processing on each node, and parallel processing on each node. Comparing the number of files against the elapsed time, parallel and distributed processing shortened the elapsed time to one fifth of that of sequential processing. On the other hand, sequential processing was faster in another experiment in which each file was smaller than 100 KB; in that case the elapsed time to scan one file is within one second, which implies that disk swapping took place during parallel processing on each node. We note that the operation became unstable when the number of files exceeded 1000. To overcome problem (iii), we developed an original data class. This class supports reading data files in various formats by converting them into a common internal format: it defines schemata for every type of data and encapsulates the structure of the data files. In addition, since this class provides a time re-sampling function, users can easily convert multiple data arrays with different time resolutions into arrays with a common time resolution. Finally, using Gfarm, we achieved a high-performance environment for large-scale statistical data analyses. It should be noted that the present method is effective only when individual data files are large enough. At present, we are building a new Gfarm environment with 8 nodes; each node has a 2 GHz Athlon 64 X2 dual-core CPU, 2 GB of memory, and 1.2 TB of disk (using RAID0). Our original class is to be implemented on the new Gfarm environment.
In the present talk, we show the latest results with applying the present system for data analyses with huge number of satellite observation data files.

  5. Japanese Ubiquitous Network Project: Ubila

    NASA Astrophysics Data System (ADS)

    Ohashi, Masayoshi

    Recently, the advent of sophisticated technologies has stimulated ambient paradigms that may include high-performance CPUs, compact real-time operating systems, a variety of devices and sensors, low-power and high-speed radio communications, and, in particular, third-generation mobile phones. In addition, due to the spread of broadband access networks, various ubiquitous terminals and sensors can be connected closely.

  6. Exact diagonalization of quantum lattice models on coprocessors

    NASA Astrophysics Data System (ADS)

    Siro, T.; Harju, A.

    2016-10-01

    We implement the Lanczos algorithm on an Intel Xeon Phi coprocessor and compare its performance to a multi-core Intel Xeon CPU and an NVIDIA graphics processor. The Xeon and the Xeon Phi are parallelized with OpenMP and the graphics processor is programmed with CUDA. The performance is evaluated by measuring the execution time of a single step in the Lanczos algorithm. We study two quantum lattice models with different particle numbers, and conclude that for small systems, the multi-core CPU is the fastest platform, while for large systems, the graphics processor is the clear winner, reaching speedups of up to 7.6 compared to the CPU. The Xeon Phi outperforms the CPU with sufficiently large particle number, reaching a speedup of 2.5.
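    For context, a single Lanczos step consists of one matrix-vector product plus a few vector operations, which is the kernel whose execution time is measured above. The Python sketch below is a textbook version without reorthogonalization, run on a small stand-in operator rather than the authors' lattice Hamiltonians.

```python
import numpy as np

def lanczos(apply_H, v0, m=50):
    """Plain Lanczos iteration: builds the tridiagonal projection of a Hermitian
    operator. Each step is dominated by one matrix-vector product (apply_H) plus
    a few vector updates."""
    v = v0 / np.linalg.norm(v0)
    v_prev = np.zeros_like(v)
    alphas, betas = [], [0.0]
    for _ in range(m):
        w = apply_H(v)                       # the expensive step: H @ v
        alpha = np.vdot(v, w).real
        w = w - alpha * v - betas[-1] * v_prev
        beta = np.linalg.norm(w)
        alphas.append(alpha)
        if beta < 1e-12:
            break
        betas.append(beta)
        v_prev, v = v, w / beta
    # Ritz values (approximate eigenvalues) of the tridiagonal projection
    T = np.diag(alphas) + np.diag(betas[1:len(alphas)], 1) + np.diag(betas[1:len(alphas)], -1)
    return np.linalg.eigvalsh(T)

# Toy Hermitian operator standing in for a quantum lattice Hamiltonian
H = np.diag(np.arange(1.0, 101.0))
H[0, 1] = H[1, 0] = 0.5
print(lanczos(lambda x: H @ x, np.random.default_rng(1).standard_normal(100))[:3])
```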

  7. Instrumentation complex for Langley Research Center's National Transonic Facility

    NASA Technical Reports Server (NTRS)

    Russell, C. H.; Bryant, C. S.

    1977-01-01

    The instrumentation discussed in the present paper was developed to ensure reliable operation of a 2.5-meter cryogenic high-Reynolds-number fan-driven transonic wind tunnel. It will incorporate four CPUs and associated analog and digital input/output equipment necessary for acquiring research data, controlling the tunnel parameters, and monitoring the process conditions. Connected in a multipoint distributed network, the CPUs will support data base management and processing; research measurement data acquisition and display; process monitoring; and communication control. The design will allow essential processes to continue, in the case of major hardware failures, by switching input/output equipment to alternate CPUs and by eliminating nonessential functions. It will also permit software modularization by CPU activity and thereby reduce complexity and development time.

  8. Resource Isolation Method for Program's Performance on CMP

    NASA Astrophysics Data System (ADS)

    Guan, Ti; Liu, Chunxiu; Xu, Zheng; Li, Huicong; Ma, Qiang

    2017-10-01

    Data centers and cloud computing are increasingly popular, bringing benefits to both customers and providers. However, in data centers or clusters, more than one program commonly runs on a single server, and programs may interfere with each other. The interference may be minor, but it can also cause a serious drop in performance. To avoid this performance interference problem, isolating resources for different programs is a better choice. In this paper we propose a lightweight resource isolation method to improve a program's performance. The method uses Cgroups to set dedicated CPU and memory resources for a program in order to guarantee its performance. Three engines realize this method: the Program Monitor Engine tracks the program's CPU and memory usage and transfers this information to the Resource Assignment Engine; the Resource Assignment Engine calculates the CPU and memory resources that should be assigned to the program; and the Cgroups Control Engine partitions resources with the Linux Cgroups tool and places the program in a control group for execution. Experimental results show that the proposed resource isolation method improves program performance.
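    A minimal Python sketch of the Cgroups step, assuming a cgroup v1 hierarchy mounted at /sys/fs/cgroup and root privileges; the group name, CPU set, and memory limit are illustrative, and this is not the paper's engine implementation.

```python
import os

CGROUP_ROOT = "/sys/fs/cgroup"   # assumes a cgroup v1 hierarchy mounted here (needs root)

def isolate(name, pid, cpus="0-3", mem_bytes=2 * 1024**3):
    """Create a control group, pin it to a CPU set, cap its memory, and move the
    target process into it, using the common cgroup v1 controller files."""
    cpu_dir = os.path.join(CGROUP_ROOT, "cpuset", name)
    mem_dir = os.path.join(CGROUP_ROOT, "memory", name)
    os.makedirs(cpu_dir, exist_ok=True)
    os.makedirs(mem_dir, exist_ok=True)

    def write(path, value):
        with open(path, "w") as f:
            f.write(str(value))

    write(os.path.join(cpu_dir, "cpuset.cpus"), cpus)           # dedicated CPU cores
    write(os.path.join(cpu_dir, "cpuset.mems"), "0")            # NUMA memory node 0
    write(os.path.join(mem_dir, "memory.limit_in_bytes"), mem_bytes)
    write(os.path.join(cpu_dir, "tasks"), pid)                  # move the process in
    write(os.path.join(mem_dir, "tasks"), pid)

# isolate("analysis_job", pid=12345, cpus="2-3", mem_bytes=1 * 1024**3)
```

    On cgroup v2 (unified hierarchy) systems the interface files differ (for example, cpuset.cpus and memory.max live under a single per-group directory), so the paths above would need to be adapted.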

  9. Energy consumption optimization of the total-FETI solver by changing the CPU frequency

    NASA Astrophysics Data System (ADS)

    Horak, David; Riha, Lubomir; Sojka, Radim; Kruzik, Jakub; Beseda, Martin; Cermak, Martin; Schuchart, Joseph

    2017-07-01

    The energy consumption of supercomputers is one of the critical problems for the upcoming Exascale supercomputing era. Awareness of power and energy consumption is required on both the software and hardware sides. This paper deals with the energy consumption evaluation of Finite Element Tearing and Interconnect (FETI) based solvers of linear systems, which is an established method for solving real-world engineering problems. We have evaluated the effect of the CPU frequency on the energy consumption of the FETI solver using a linear elasticity 3D cube synthetic benchmark. In this problem, we have evaluated the effect of frequency tuning on the energy consumption of the essential processing kernels of the FETI method. The paper provides results for two types of frequency tuning: (1) static tuning and (2) dynamic tuning. For static tuning experiments, the frequency is set before execution and kept constant during the runtime. For dynamic tuning, the frequency is changed during the program execution to adapt the system to the actual needs of the application. The paper shows that static tuning brings up to 12% energy savings when compared to default CPU settings (the highest clock rate). Dynamic tuning improves this further by up to 3%.

  10. Open source acceleration of wave optics simulations on energy efficient high-performance computing platforms

    NASA Astrophysics Data System (ADS)

    Beck, Jeffrey; Bos, Jeremy P.

    2017-05-01

    We compare several modifications to the open-source wave optics package, WavePy, intended to improve execution time. Specifically, we compare the relative performance of the Intel MKL, a CPU-based OpenCV distribution, and a GPU-based version. Performance is compared between distributions both on the same compute platform and between a fully featured computing workstation and the NVIDIA Jetson TX1 platform. Comparisons are drawn in terms of both execution time and power consumption. We have found that substituting the Fast Fourier Transform operation from OpenCV provides a marked improvement on all platforms. In addition, we show that embedded platforms offer considerable potential for improved efficiency compared with a fully featured workstation.
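
    The reported gains from substituting the FFT can be probed with a small benchmark of interchangeable FFT backends. The sketch below is a loose analogue rather than the authors' WavePy modification: it times NumPy's FFT against SciPy's multi-worker FFT on a complex 2D field of hypothetical size.

```python
import time
import numpy as np
from scipy import fft as sp_fft

def time_fft(fft2, field, repeats=20):
    """Average wall-clock time of a 2D FFT over several repetitions."""
    fft2(field)                       # warm-up call
    start = time.perf_counter()
    for _ in range(repeats):
        fft2(field)
    return (time.perf_counter() - start) / repeats

if __name__ == "__main__":
    n = 2048                          # hypothetical grid size for the phase screen
    field = (np.random.rand(n, n) + 1j * np.random.rand(n, n)).astype(np.complex64)

    t_numpy = time_fft(np.fft.fft2, field)
    t_scipy = time_fft(lambda x: sp_fft.fft2(x, workers=-1), field)  # use all CPU cores
    print(f"numpy.fft: {t_numpy*1e3:.1f} ms   scipy.fft (all workers): {t_scipy*1e3:.1f} ms")
```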

  11. GPU Linear Algebra Libraries and GPGPU Programming for Accelerating MOPAC Semiempirical Quantum Chemistry Calculations.

    PubMed

    Maia, Julio Daniel Carvalho; Urquiza Carvalho, Gabriel Aires; Mangueira, Carlos Peixoto; Santana, Sidney Ramos; Cabral, Lucidio Anjos Formiga; Rocha, Gerd B

    2012-09-11

    In this study, we present some modifications in the semiempirical quantum chemistry MOPAC2009 code that accelerate single-point energy calculations (1SCF) of medium-size (up to 2500 atoms) molecular systems using GPU coprocessors and multithreaded shared-memory CPUs. Our modifications consisted of using a combination of highly optimized linear algebra libraries for both CPU (LAPACK and BLAS from Intel MKL) and GPU (MAGMA and CUBLAS) to hasten time-consuming parts of MOPAC such as the pseudodiagonalization, full diagonalization, and density matrix assembling. We have shown that it is possible to obtain large speedups just by using CPU serial linear algebra libraries in the MOPAC code. As a special case, we show a speedup of up to 14 times for a methanol simulation box containing 2400 atoms and 4800 basis functions, with even greater gains in performance when using multithreaded CPUs (2.1 times in relation to the single-threaded CPU code using linear algebra libraries) and GPUs (3.8 times). This degree of acceleration opens new perspectives for modeling larger structures which appear in inorganic chemistry (such as zeolites and MOFs), biochemistry (such as polysaccharides, small proteins, and DNA fragments), and materials science (such as nanotubes and fullerenes). In addition, we believe that this parallel (GPU-GPU) MOPAC code will make it feasible to use semiempirical methods in lengthy molecular simulations using both hybrid QM/MM and QM/QM potentials.
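
    Because pseudodiagonalization and full diagonalization dominate the 1SCF cost, much of the reported speedup comes from handing the dense symmetric eigenproblem to an optimized library. The snippet below is only a generic illustration of that bottleneck, timing a LAPACK-backed eigensolver on a random symmetric matrix of hypothetical size; it is unrelated to the MOPAC code itself.

```python
import time
import numpy as np

def time_diagonalization(n=2000):
    """Time the dense symmetric eigendecomposition that dominates a 1SCF step."""
    a = np.random.rand(n, n)
    fock_like = (a + a.T) / 2.0        # symmetric stand-in for a Fock-type matrix

    start = time.perf_counter()
    eigenvalues, eigenvectors = np.linalg.eigh(fock_like)  # dispatched to LAPACK
    elapsed = time.perf_counter() - start
    print(f"{n}x{n} symmetric diagonalization: {elapsed:.2f} s")
    return eigenvalues, eigenvectors

if __name__ == "__main__":
    time_diagonalization()
```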

  12. A Hybrid CPU/GPU Pattern-Matching Algorithm for Deep Packet Inspection

    PubMed Central

    Chen, Yaw-Chung

    2015-01-01

    The large quantities of data now being transferred via high-speed networks have made deep packet inspection indispensable for security purposes. Scalable and low-cost signature-based network intrusion detection systems have been developed for deep packet inspection for various software platforms. Traditional approaches that only involve central processing units (CPUs) are now considered inadequate in terms of inspection speed. Graphic processing units (GPUs) have superior parallel processing power, but transmission bottlenecks can reduce optimal GPU efficiency. In this paper we describe our proposal for a hybrid CPU/GPU pattern-matching algorithm (HPMA) that divides and distributes the packet-inspecting workload between a CPU and GPU. All packets are initially inspected by the CPU and filtered using a simple pre-filtering algorithm, and packets that might contain malicious content are sent to the GPU for further inspection. Test results indicate that in terms of random payload traffic, the matching speed of our proposed algorithm was 3.4 times and 2.7 times faster than those of the AC-CPU and AC-GPU algorithms, respectively. Further, HPMA achieved higher energy efficiency than the other tested algorithms. PMID:26437335
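
    The divide-and-distribute idea can be summarized as a cheap CPU pre-filter that forwards only suspicious packets to an exact matcher (the GPU stage in HPMA). The sketch below is a simplified, CPU-only caricature with invented signatures: the pre-filter checks only short signature prefixes, which is far weaker than the paper's algorithm but shows the control flow.

```python
# Hypothetical signature set; a real system would load thousands of patterns.
SIGNATURES = [b"/etc/passwd", b"cmd.exe", b"<script>"]
PREFIXES = {sig[:2] for sig in SIGNATURES}      # cheap pre-filter keys

def cpu_prefilter(packet: bytes) -> bool:
    """Cheap first pass: does the payload contain any 2-byte signature prefix?"""
    return any(prefix in packet for prefix in PREFIXES)

def exact_match(packet: bytes) -> list:
    """Exact second pass (the GPU stage in HPMA); here a plain CPU substring search."""
    return [sig for sig in SIGNATURES if sig in packet]

def inspect(packets):
    suspicious = [p for p in packets if cpu_prefilter(p)]   # stays on the CPU
    return {p: exact_match(p) for p in suspicious}          # would be batched to the GPU

if __name__ == "__main__":
    traffic = [b"GET /index.html",
               b"GET /../../etc/passwd",
               b"POST <script>alert(1)</script>"]
    for packet, hits in inspect(traffic).items():
        print(packet, "->", hits)
```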

  13. A Hybrid CPU/GPU Pattern-Matching Algorithm for Deep Packet Inspection.

    PubMed

    Lee, Chun-Liang; Lin, Yi-Shan; Chen, Yaw-Chung

    2015-01-01

    The large quantities of data now being transferred via high-speed networks have made deep packet inspection indispensable for security purposes. Scalable and low-cost signature-based network intrusion detection systems have been developed for deep packet inspection for various software platforms. Traditional approaches that only involve central processing units (CPUs) are now considered inadequate in terms of inspection speed. Graphic processing units (GPUs) have superior parallel processing power, but transmission bottlenecks can reduce optimal GPU efficiency. In this paper we describe our proposal for a hybrid CPU/GPU pattern-matching algorithm (HPMA) that divides and distributes the packet-inspecting workload between a CPU and GPU. All packets are initially inspected by the CPU and filtered using a simple pre-filtering algorithm, and packets that might contain malicious content are sent to the GPU for further inspection. Test results indicate that in terms of random payload traffic, the matching speed of our proposed algorithm was 3.4 times and 2.7 times faster than those of the AC-CPU and AC-GPU algorithms, respectively. Further, HPMA achieved higher energy efficiency than the other tested algorithms.

  14. Exploring the use of I/O nodes for computation in a MIMD multiprocessor

    NASA Technical Reports Server (NTRS)

    Kotz, David; Cai, Ting

    1995-01-01

    As parallel systems move into the production scientific-computing world, the emphasis will be on cost-effective solutions that provide high throughput for a mix of applications. Cost effective solutions demand that a system make effective use of all of its resources. Many MIMD multiprocessors today, however, distinguish between 'compute' and 'I/O' nodes, the latter having attached disks and being dedicated to running the file-system server. This static division of responsibilities simplifies system management but does not necessarily lead to the best performance in workloads that need a different balance of computation and I/O. Of course, computational processes sharing a node with a file-system service may receive less CPU time, network bandwidth, and memory bandwidth than they would on a computation-only node. In this paper we begin to examine this issue experimentally. We found that high performance I/O does not necessarily require substantial CPU time, leaving plenty of time for application computation. There were some complex file-system requests, however, which left little CPU time available to the application. (The impact on network and memory bandwidth still needs to be determined.) For applications (or users) that cannot tolerate an occasional interruption, we recommend that they continue to use only compute nodes. For tolerant applications needing more cycles than those provided by the compute nodes, we recommend that they take full advantage of both compute and I/O nodes for computation, and that operating systems should make this possible.

  15. Performance of the OVERFLOW-MLP and LAURA-MLP CFD Codes on the NASA Ames 512 CPU Origin System

    NASA Technical Reports Server (NTRS)

    Taft, James R.

    2000-01-01

    The shared memory Multi-Level Parallelism (MLP) technique, developed last year at NASA Ames, has been very successful in dramatically improving the performance of important NASA CFD codes. This new and very simple parallel programming technique was first inserted into the OVERFLOW production CFD code in FY 1998. The OVERFLOW-MLP code's parallel performance scaled linearly to 256 CPUs on the NASA Ames 256 CPU Origin 2000 system (steger). Overall performance exceeded 20.1 GFLOP/s, or about 4.5x the performance of a dedicated 16 CPU C90 system. All of this was achieved without any major modification to the original vector based code. The OVERFLOW-MLP code is now in production on the inhouse Origin systems as well as being used offsite at commercial aerospace companies. Partially as a result of this work, NASA Ames has purchased a new 512 CPU Origin 2000 system to further test the limits of parallel performance for NASA codes of interest. This paper presents the performance obtained from the latest optimization efforts on this machine for the LAURA-MLP and OVERFLOW-MLP codes. The Langley Aerothermodynamics Upwind Relaxation Algorithm (LAURA) code is a key simulation tool in the development of the next generation shuttle, interplanetary reentry vehicles, and nearly all "X" plane development. This code sustains about 4-5 GFLOP/s on a dedicated 16 CPU C90. At this rate, expected workloads would require over 100 C90 CPU years of computing over the next few calendar years. It is not feasible to expect that this would be affordable or available to the user community. Dramatic performance gains on cheaper systems are needed. This code is expected to be perhaps the largest consumer of NASA Ames compute cycles per run in the coming year. The OVERFLOW CFD code is extensively used in the government and commercial aerospace communities to evaluate new aircraft designs. It is one of the largest consumers of NASA supercomputing cycles and large simulations of highly resolved full aircraft are routinely undertaken. Typical large problems might require 100s of Cray C90 CPU hours to complete. The dramatic performance gains with the 256 CPU steger system are exciting. Obtaining results in hours instead of months is revolutionizing the way in which aircraft manufacturers are looking at future aircraft simulation work. Figure 2 below is a current state of the art plot of OVERFLOW-MLP performance on the 512 CPU Lomax system. As can be seen, the chart indicates that OVERFLOW-MLP continues to scale linearly with CPU count up to 512 CPUs on a large 35 million point full aircraft RANS simulation. At this point performance is such that a fully converged simulation of 2500 time steps is completed in less than 2 hours of elapsed time. Further work over the next few weeks will improve the performance of this code even further. The LAURA code has been converted to the MLP format as well. This code is currently being optimized for the 512 CPU system. Performance statistics indicate that the goal of 100 GFLOP/s will be achieved by year's end. This amounts to 20x the 16 CPU C90 result and strongly demonstrates the viability of the new parallel systems rapidly solving very large simulations in a production environment.

  16. Spectrum Savings from High Performance Recording and Playback Onboard the Test Article

    DTIC Science & Technology

    2013-02-20

    execute within a Windows 7 environment, and data is recorded on SSDs. The underlying database is implemented using MySQL. Figure 1 illustrates the... MySQL database. This is effectively the time at which the recorded data are available for retransmission. CPU and Memory utilization were collected... 17.7% MySQL avg. 3.9% EQDR Total avg. 21.6% Table 1 CPU Utilization with 260 Mbits/sec Load The difference between the total System CPU (27.8

  17. The ATLAS Tier-3 in Geneva and the Trigger Development Facility

    NASA Astrophysics Data System (ADS)

    Gadomski, S.; Meunier, Y.; Pasche, P.; Baud, J.-P.; ATLAS Collaboration

    2011-12-01

    The ATLAS Tier-3 farm at the University of Geneva provides storage and processing power for analysis of ATLAS data. In addition the facility is used for development, validation and commissioning of the High Level Trigger of ATLAS [1]. The latter purpose leads to additional requirements on the availability of latest software and data, which will be presented. The farm is also a part of the WLCG [2], and is available to all members of the ATLAS Virtual Organization. The farm currently provides 268 CPU cores and 177 TB of storage space. A grid Storage Element, implemented with the Disk Pool Manager software [3], is available and integrated with the ATLAS Distributed Data Management system [4]. The batch system can be used directly by local users, or with a grid interface provided by NorduGrid ARC middleware [5]. In this article we will present the use cases that we support, as well as the experience with the software and the hardware we are using. Results of I/O benchmarking tests, which were done for our DPM Storage Element and for the NFS servers we are using, will also be presented.

  18. Message Passing on GPUs

    NASA Astrophysics Data System (ADS)

    Stuart, J. A.

    2011-12-01

    This paper explores the challenges in implementing a message passing interface usable on systems with data-parallel processors, and more specifically GPUs. As a case study, we design and implement the "DCGN" API on NVIDIA GPUs that is similar to MPI and allows full access to the underlying architecture. We introduce the notion of data-parallel thread-groups as a way to map resources to MPI ranks. We use a method that also allows the data-parallel processors to run autonomously from user-written CPU code. In order to facilitate communication, we use a sleep-based polling system to store and retrieve messages. Unlike previous systems, our method provides both performance and flexibility. By running a test suite of applications with different communication requirements, we find that a tolerable amount of overhead is incurred, somewhere between one and five percent depending on the application, and indicate the locations where this overhead accumulates. We conclude that with innovations in chipsets and drivers, this overhead will be mitigated and provide similar performance to typical CPU-based MPI implementations while providing fully-dynamic communication.

  19. High performance computing for deformable image registration: towards a new paradigm in adaptive radiotherapy.

    PubMed

    Samant, Sanjiv S; Xia, Junyi; Muyan-Ozcelik, Pinar; Owens, John D

    2008-08-01

    The advent of readily available temporal imaging or time series volumetric (4D) imaging has become an indispensable component of treatment planning and adaptive radiotherapy (ART) at many radiotherapy centers. Deformable image registration (DIR) is also used in other areas of medical imaging, including motion corrected image reconstruction. Due to long computation time, clinical applications of DIR in radiation therapy and elsewhere have been limited and consequently relegated to offline analysis. With the recent advances in hardware and software, graphics processing unit (GPU) based computing is an emerging technology for general purpose computation, including DIR, and is suitable for highly parallelized computing. However, traditional general purpose computation on the GPU is limited by the constraints of the available programming platforms. In addition, compared to CPU programming, the GPU currently has reduced dedicated processor memory, which can limit the useful working data set for parallelized processing. We present an implementation of the demons algorithm using the NVIDIA 8800 GTX GPU and the new CUDA programming language. The GPU performance will be compared with single threading and multithreading CPU implementations on an Intel dual core 2.4 GHz CPU using the C programming language. CUDA provides a C-like language programming interface, and allows for direct access to the highly parallel compute units in the GPU. Comparisons for volumetric clinical lung images acquired using 4DCT were carried out. Computation time for 100 iterations in the range of 1.8-13.5 s was observed for the GPU with image size ranging from 2.0 x 10(6) to 14.2 x 10(6) pixels. The GPU registration was 55-61 times faster than the CPU for the single threading implementation, and 34-39 times faster for the multithreading implementation. For CPU based computing, the computational time generally has a linear dependence on image size for medical imaging data. Computational efficiency is characterized in terms of time per megapixels per iteration (TPMI) with units of seconds per megapixels per iteration (or spmi). For the demons algorithm, our CPU implementation yielded largely invariant values of TPMI. The mean TPMIs were 0.527 spmi and 0.335 spmi for the single threading and multithreading cases, respectively, with <2% variation over the considered image data range. For GPU computing, we achieved TPMI = 0.00916 spmi with 3.7% variation, indicating optimized memory handling under CUDA. The paradigm of GPU based real-time DIR opens up a host of clinical applications for medical imaging.
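
    The TPMI figure of merit is simply the computation time divided by the image size in megapixels and by the number of iterations. The small check below uses rounded numbers quoted in the abstract (the exact image size behind each timing is an assumption) and reproduces the reported order of magnitude.

```python
def tpmi(seconds, megapixels, iterations=100):
    """Time per megapixel per iteration (spmi), as defined in the abstract."""
    return seconds / (megapixels * iterations)

# GPU timings quoted: 1.8-13.5 s for 100 iterations on 2.0-14.2 megapixel volumes.
print(f"GPU, small volume: {tpmi(1.8, 2.0):.5f} spmi")    # ~0.009, close to the reported 0.00916
print(f"GPU, large volume: {tpmi(13.5, 14.2):.5f} spmi")  # ~0.0095
```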

  20. Massively parallel data processing for quantitative total flow imaging with optical coherence microscopy and tomography

    NASA Astrophysics Data System (ADS)

    Sylwestrzak, Marcin; Szlag, Daniel; Marchand, Paul J.; Kumar, Ashwin S.; Lasser, Theo

    2017-08-01

    We present an application of massively parallel processing of quantitative flow measurement data acquired using spectral optical coherence microscopy (SOCM). The need for massive signal processing of these particular datasets has been a major hurdle for many applications based on SOCM. In view of this difficulty, we implemented and adapted quantitative total flow estimation algorithms on graphics processing units (GPU) and achieved a 150-fold reduction in processing time when compared to a former CPU implementation. As SOCM constitutes the microscopy counterpart to spectral optical coherence tomography (SOCT), the developed processing procedure can be applied to both imaging modalities. We present the developed DLL library integrated in MATLAB (with an example) and have included the source code for adaptations and future improvements.
    Program summary:
    Catalogue identifier: AFBT_v1_0
    Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AFBT_v1_0.html
    Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
    Licensing provisions: GNU GPLv3
    No. of lines in distributed program, including test data, etc.: 913552
    No. of bytes in distributed program, including test data, etc.: 270876249
    Distribution format: tar.gz
    Programming language: CUDA/C, MATLAB
    Computer: Intel x64 CPU, GPU supporting CUDA technology
    Operating system: 64-bit Windows 7 Professional
    Has the code been vectorized or parallelized?: Yes, CPU code has been vectorized in MATLAB, CUDA code has been parallelized
    RAM: Dependent on the user's parameters, typically between several gigabytes and several tens of gigabytes
    Classification: 6.5, 18
    Nature of problem: Speed-up of data processing in optical coherence microscopy
    Solution method: Utilization of GPU for massively parallel data processing
    Additional comments: Compiled DLL library with source code and documentation, example of utilization (MATLAB script with raw data)
    Running time: 1.8 s for one B-scan (150× faster in comparison to the CPU data processing time)

  1. Examining the architecture of cellular computing through a comparative study with a computer

    PubMed Central

    Wang, Degeng; Gribskov, Michael

    2005-01-01

    The computer and the cell both use information embedded in simple coding, the binary software code and the quadruple genomic code, respectively, to support system operations. A comparative examination of their system architecture as well as their information storage and utilization schemes is performed. On top of the code, both systems display a modular, multi-layered architecture, which, in the case of a computer, arises from human engineering efforts through a combination of hardware implementation and software abstraction. Using the computer as a reference system, a simplistic mapping of the architectural components between the two is easily detected. This comparison also reveals that a cell abolishes the software–hardware barrier through genomic encoding for the constituents of the biochemical network, a cell's ‘hardware’ equivalent to the computer central processing unit (CPU). The information loading (gene expression) process acts as a major determinant of the encoded constituent's abundance, which, in turn, often determines the ‘bandwidth’ of a biochemical pathway. Cellular processes are implemented in biochemical pathways in parallel manners. In a computer, on the other hand, the software provides only instructions and data for the CPU. A process represents just sequentially ordered actions by the CPU and only virtual parallelism can be implemented through CPU time-sharing. Whereas process management in a computer may simply mean job scheduling, coordinating pathway bandwidth through the gene expression machinery represents a major process management scheme in a cell. In summary, a cell can be viewed as a super-parallel computer, which computes through controlled hardware composition. While we have, at best, a very fragmented understanding of cellular operation, we have a thorough understanding of the computer throughout the engineering process. The potential utilization of this knowledge to the benefit of systems biology is discussed. PMID:16849179

  2. Examining the architecture of cellular computing through a comparative study with a computer.

    PubMed

    Wang, Degeng; Gribskov, Michael

    2005-06-22

    The computer and the cell both use information embedded in simple coding, the binary software code and the quadruple genomic code, respectively, to support system operations. A comparative examination of their system architecture as well as their information storage and utilization schemes is performed. On top of the code, both systems display a modular, multi-layered architecture, which, in the case of a computer, arises from human engineering efforts through a combination of hardware implementation and software abstraction. Using the computer as a reference system, a simplistic mapping of the architectural components between the two is easily detected. This comparison also reveals that a cell abolishes the software-hardware barrier through genomic encoding for the constituents of the biochemical network, a cell's "hardware" equivalent to the computer central processing unit (CPU). The information loading (gene expression) process acts as a major determinant of the encoded constituent's abundance, which, in turn, often determines the "bandwidth" of a biochemical pathway. Cellular processes are implemented in biochemical pathways in parallel manners. In a computer, on the other hand, the software provides only instructions and data for the CPU. A process represents just sequentially ordered actions by the CPU and only virtual parallelism can be implemented through CPU time-sharing. Whereas process management in a computer may simply mean job scheduling, coordinating pathway bandwidth through the gene expression machinery represents a major process management scheme in a cell. In summary, a cell can be viewed as a super-parallel computer, which computes through controlled hardware composition. While we have, at best, a very fragmented understanding of cellular operation, we have a thorough understanding of the computer throughout the engineering process. The potential utilization of this knowledge to the benefit of systems biology is discussed.

  3. Computer hardware for radiologists: Part I

    PubMed Central

    Indrajit, IK; Alam, A

    2010-01-01

    Computers are an integral part of modern radiology practice. They are used in different radiology modalities to acquire, process, and postprocess imaging data. They have had a dramatic influence on contemporary radiology practice. Their impact has extended further with the emergence of Digital Imaging and Communications in Medicine (DICOM), Picture Archiving and Communication System (PACS), Radiology information system (RIS) technology, and Teleradiology. A basic overview of computer hardware relevant to radiology practice is presented here. The key hardware components in a computer are the motherboard, central processing unit (CPU), the chipset, the random access memory (RAM), the memory modules, bus, storage drives, and ports. The personal computer (PC) has a rectangular case that contains important components called hardware, many of which are integrated circuits (ICs). The fiberglass motherboard is the main printed circuit board and has a variety of important hardware mounted on it, which are connected by electrical pathways called “buses”. The CPU is the largest IC on the motherboard and contains millions of transistors. Its principal function is to execute “programs”. A Pentium® 4 CPU has transistors that execute a billion instructions per second. The chipset is completely different from the CPU in design and function; it controls data and interaction of buses between the motherboard and the CPU. Memory (RAM) is fundamentally semiconductor chips storing data and instructions for access by a CPU. RAM is classified by storage capacity, access speed, data rate, and configuration. PMID:21042437

  4. Novel hybrid GPU-CPU implementation of parallelized Monte Carlo parametric expectation maximization estimation method for population pharmacokinetic data analysis.

    PubMed

    Ng, C M

    2013-10-01

    The development of a population PK/PD model, an essential component for model-based drug development, is both time- and labor-intensive. A graphical-processing unit (GPU) computing technology has been proposed and used to accelerate many scientific computations. The objective of this study was to develop a hybrid GPU-CPU implementation of parallelized Monte Carlo parametric expectation maximization (MCPEM) estimation algorithm for population PK data analysis. A hybrid GPU-CPU implementation of the MCPEM algorithm (MCPEMGPU) and identical algorithm that is designed for the single CPU (MCPEMCPU) were developed using MATLAB in a single computer equipped with dual Xeon 6-Core E5690 CPU and a NVIDIA Tesla C2070 GPU parallel computing card that contained 448 stream processors. Two different PK models with rich/sparse sampling design schemes were used to simulate population data in assessing the performance of MCPEMCPU and MCPEMGPU. Results were analyzed by comparing the parameter estimation and model computation times. Speedup factor was used to assess the relative benefit of parallelized MCPEMGPU over MCPEMCPU in shortening model computation time. The MCPEMGPU consistently achieved shorter computation time than the MCPEMCPU and can offer more than 48-fold speedup using a single GPU card. The novel hybrid GPU-CPU implementation of parallelized MCPEM algorithm developed in this study holds a great promise in serving as the core for the next-generation of modeling software for population PK/PD analysis.

  5. Sputnik: ad hoc distributed computation.

    PubMed

    Völkel, Gunnar; Lausser, Ludwig; Schmid, Florian; Kraus, Johann M; Kestler, Hans A

    2015-04-15

    In bioinformatic applications, computationally demanding algorithms are often parallelized to speed up computation. Nevertheless, setting up computational environments for distributed computation is often tedious. The aims of this project were a lightweight, ad hoc setup and fault-tolerant computation requiring only a Java runtime and no administrator rights, while utilizing all CPU cores most effectively. The Sputnik framework provides ad hoc distributed computation on the Java Virtual Machine which uses all supplied CPU cores fully. It provides a graphical user interface for deployment setup and a web user interface displaying the current status of computation jobs. Neither a permanent setup nor administrator privileges are required. We demonstrate the utility of our approach on feature selection of microarray data. The Sputnik framework is available on Github http://github.com/sysbio-bioinf/sputnik under the Eclipse Public License. hkestler@fli-leibniz.de or hans.kestler@uni-ulm.de Supplementary data are available at Bioinformatics online. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

  6. Dense GPU-enhanced surface reconstruction from stereo endoscopic images for intraoperative registration.

    PubMed

    Rohl, Sebastian; Bodenstedt, Sebastian; Suwelack, Stefan; Dillmann, Rudiger; Speidel, Stefanie; Kenngott, Hannes; Muller-Stich, Beat P

    2012-03-01

    In laparoscopic surgery, soft tissue deformations substantially change the surgical site, thus impeding the use of preoperative planning during intraoperative navigation. Extracting depth information from endoscopic images and building a surface model of the surgical field of view is one way to represent this constantly deforming environment. The information can then be used for intraoperative registration. Stereo reconstruction is a typical problem within computer vision. However, most of the available methods do not fulfill the specific requirements of a minimally invasive setting, such as the need for real-time performance, view-dependent specular reflections, and large curved areas with partly homogeneous or periodic textures and occlusions. In this paper, the authors present an approach toward intraoperative surface reconstruction based on stereo endoscopic images. The authors describe their answer to this problem through correspondence analysis, disparity correction and refinement, 3D reconstruction, point cloud smoothing and meshing. Real-time performance is achieved by implementing the algorithms on the GPU. The authors also present a new hybrid CPU-GPU algorithm that unifies the advantages of the CPU and the GPU version. In a comprehensive evaluation using in vivo data, in silico data from the literature and virtual data from a newly developed simulation environment, the CPU, the GPU, and the hybrid CPU-GPU versions of the surface reconstruction are compared to a CPU and a GPU algorithm from the literature. The recommended approach toward intraoperative surface reconstruction can be conducted in real time, depending on the image resolution (20 fps for the GPU and 14 fps for the hybrid CPU-GPU version at a resolution of 640 × 480). It is robust to homogeneous regions without texture, large image changes, noise or errors from camera calibration, and it reconstructs the surface down to sub-millimeter accuracy. In all the experiments within the simulation environment, the mean distance to ground truth data is between 0.05 and 0.6 mm for the hybrid CPU-GPU version. The hybrid CPU-GPU algorithm shows clearly superior performance compared with its CPU and GPU counterparts (mean distance reduction 26% and 45%, respectively, for the experiments in the simulation environment). The recommended approach for surface reconstruction is fast, robust, and accurate. It can represent changes in the intraoperative environment and can be used to adapt a preoperative model within the surgical site by registration of these two models.

  7. Using virtual machine monitors to overcome the challenges of monitoring and managing virtualized cloud infrastructures

    NASA Astrophysics Data System (ADS)

    Bamiah, Mervat Adib; Brohi, Sarfraz Nawaz; Chuprat, Suriayati

    2012-01-01

    Virtualization is one of the hottest research topics nowadays. Several academic researchers and developers from the IT industry are designing approaches for solving the security and manageability issues of Virtual Machines (VMs) residing on virtualized cloud infrastructures. Moving an application from a physical to a virtual platform increases efficiency and flexibility and reduces management cost as well as effort. Cloud computing is adopting the paradigm of virtualization; using this technique, memory, CPU and computational power are provided to clients' VMs by utilizing the underlying physical hardware. Besides these advantages, there are a few challenges in adopting virtualization, such as management of VMs and network traffic, unexpected additional cost and resource allocation. The Virtual Machine Monitor (VMM), or hypervisor, is the tool used by cloud providers to manage the VMs on the cloud. There are several heterogeneous hypervisors provided by various vendors, including VMware, Hyper-V, Xen and Kernel Virtual Machine (KVM). Considering the challenge of VM management, this paper describes several techniques to monitor and manage virtualized cloud infrastructures.

  8. Polydrug use among college students in Brazil: a nationwide survey.

    PubMed

    Oliveira, Lúcio Garcia de; Alberghini, Denis Guilherme; Santos, Bernardo dos; Andrade, Arthur Guerra de

    2013-01-01

    To estimate the frequency of polydrug use (alcohol and illicit drugs) among college students and its associations with gender and age group. A nationwide sample of 12,544 college students was asked to complete a questionnaire on their use of drugs according to three time parameters (lifetime, past 12 months, and last 30 days). The co-use of drugs was investigated as concurrent polydrug use (CPU) and simultaneous polydrug use (SPU), a subcategory of CPU that involves the use of drugs at the same time or in close temporal proximity. Almost 26% of college students reported having engaged in CPU in the past 12 months. Among these students, 37% had engaged in SPU. In the past 30 days, 17% college students had engaged in CPU. Among these, 35% had engaged in SPU. Marijuana was the illicit drug mostly frequently used with alcohol (either as CPU or SPU), especially among males. Among females, the most commonly reported combination was alcohol and prescribed medications. A high proportion of Brazilian college students may be engaging in polydrug use. College administrators should keep themselves informed to be able to identify such use and to develop educational interventions to prevent such behavior.

  9. 48 CFR 252.204-7011 - Alternative Line Item Structure.

    Code of Federal Regulations, 2011 CFR

    2011-10-01

    ... Unit Unit price Amount 0001 Computer, Desktop with CPU, Monitor, Keyboard and Mouse 20 EA Alternative... Unit Unit Price Amount 0001 Computer, Desktop with CPU, Keyboard and Mouse 20 EA 0002 Monitor 20 EA...

  10. 48 CFR 252.204-7011 - Alternative Line Item Structure.

    Code of Federal Regulations, 2014 CFR

    2014-10-01

    ... Unit Unit price Amount 0001 Computer, Desktop with CPU, Monitor, Keyboard and Mouse 20 EA Alternative... Unit Unit Price Amount 0001 Computer, Desktop with CPU, Keyboard and Mouse 20 EA 0002 Monitor 20 EA...

  11. 48 CFR 252.204-7011 - Alternative Line Item Structure.

    Code of Federal Regulations, 2012 CFR

    2012-10-01

    ... Unit Unit price Amount 0001 Computer, Desktop with CPU, Monitor, Keyboard and Mouse 20 EA Alternative... Unit Unit Price Amount 0001 Computer, Desktop with CPU, Keyboard and Mouse 20 EA 0002 Monitor 20 EA...

  12. 48 CFR 252.204-7011 - Alternative Line Item Structure.

    Code of Federal Regulations, 2013 CFR

    2013-10-01

    ... Unit Unit price Amount 0001 Computer, Desktop with CPU, Monitor, Keyboard and Mouse 20 EA Alternative... Unit Unit Price Amount 0001 Computer, Desktop with CPU, Keyboard and Mouse 20 EA 0002 Monitor 20 EA...

  13. Hybrid Computational Architecture for Multi-Scale Modeling of Materials and Devices

    DTIC Science & Technology

    2016-01-03

    Single-node performance for a 20-core CPU (40 logical cores with hyper-threading (HT)):
    # of cores | Total CPU time | User CPU time | System CPU time | Elapsed time
    40 (with HT) | 534.785 | 529.984 | 4.800 | 541.179
    20 | 468.873 | 466.119 | 2.754 | 476.878
    10 | 671.798 | 669.653 | 2.145 | 680.510
    8 | 772.269 | 770.256 | 2.013 | (truncated in the source)

  14. QR-decomposition based SENSE reconstruction using parallel architecture.

    PubMed

    Ullah, Irfan; Nisar, Habab; Raza, Haseeb; Qasim, Malik; Inam, Omair; Omer, Hammad

    2018-04-01

    Magnetic Resonance Imaging (MRI) is a powerful medical imaging technique that provides essential clinical information about the human body. One major limitation of MRI is its long scan time. Implementation of advance MRI algorithms on a parallel architecture (to exploit inherent parallelism) has a great potential to reduce the scan time. Sensitivity Encoding (SENSE) is a Parallel Magnetic Resonance Imaging (pMRI) algorithm that utilizes receiver coil sensitivities to reconstruct MR images from the acquired under-sampled k-space data. At the heart of SENSE lies inversion of a rectangular encoding matrix. This work presents a novel implementation of GPU based SENSE algorithm, which employs QR decomposition for the inversion of the rectangular encoding matrix. For a fair comparison, the performance of the proposed GPU based SENSE reconstruction is evaluated against single and multicore CPU using openMP. Several experiments against various acceleration factors (AFs) are performed using multichannel (8, 12 and 30) phantom and in-vivo human head and cardiac datasets. Experimental results show that GPU significantly reduces the computation time of SENSE reconstruction as compared to multi-core CPU (approximately 12x speedup) and single-core CPU (approximately 53x speedup) without any degradation in the quality of the reconstructed images. Copyright © 2018 Elsevier Ltd. All rights reserved.
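
    At the core of the approach is the inversion of the rectangular encoding matrix, which a QR decomposition turns into an orthogonal factor and a triangular solve. The snippet below is a minimal NumPy sketch of that single step for a toy complex-valued encoding matrix; the matrix sizes and data are made up, and the actual implementation performs this per group of aliased pixels on the GPU.

```python
import numpy as np

def sense_qr_solve(E, y):
    """Least-squares solution of E x = y via QR, i.e. inversion of the encoding matrix.

    E : (m, n) complex encoding matrix built from coil sensitivities (m >= n)
    y : (m,)   under-sampled (folded) pixel values from all coils
    """
    Q, R = np.linalg.qr(E)                     # E = Q R, Q has orthonormal columns
    return np.linalg.solve(R, Q.conj().T @ y)  # solve the triangular system R x = Q^H y

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m, n = 8, 3                                # e.g. 8 coils unfolding 3 aliased pixels
    E = rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))
    x_true = rng.standard_normal(n) + 1j * rng.standard_normal(n)
    y = E @ x_true
    x = sense_qr_solve(E, y)
    print(np.allclose(x, x_true))              # True: exact recovery for noiseless data
```

    Compared with forming the normal equations, the QR route avoids squaring the condition number of the encoding matrix, which is one reason it is attractive for SENSE-type reconstructions.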

  15. Efficient parallel simulation of CO2 geologic sequestration in saline aquifers

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Zhang, Keni; Doughty, Christine; Wu, Yu-Shu

    2007-01-01

    An efficient parallel simulator for large-scale, long-term CO2 geologic sequestration in saline aquifers has been developed. The parallel simulator is a three-dimensional, fully implicit model that solves large, sparse linear systems arising from discretization of the partial differential equations for mass and energy balance in porous and fractured media. The simulator is based on the ECO2N module of the TOUGH2 code and inherits all the process capabilities of the single-CPU TOUGH2 code, including a comprehensive description of the thermodynamics and thermophysical properties of H2O-NaCl-CO2 mixtures, modeling single and/or two-phase isothermal or non-isothermal flow processes, two-phase mixtures, fluid phases appearing or disappearing, as well as salt precipitation or dissolution. The new parallel simulator uses MPI for parallel implementation, the METIS software package for simulation domain partitioning, and the iterative parallel linear solver package Aztec for solving linear equations by multiple processors. In addition, the parallel simulator has been implemented with an efficient communication scheme. Test examples show that a linear or super-linear speedup can be obtained on Linux clusters as well as on supercomputers. Because of the significant improvement in both simulation time and memory requirement, the new simulator provides a powerful tool for tackling larger scale and more complex problems than can be solved by single-CPU codes. A high-resolution simulation example is presented that models buoyant convection, induced by a small increase in brine density caused by dissolution of CO2.

  16. SAFARI digital processing unit: performance analysis of the SpaceWire links in case of a LEON3-FT based CPU

    NASA Astrophysics Data System (ADS)

    Giusi, Giovanni; Liu, Scige J.; Di Giorgio, Anna M.; Galli, Emanuele; Pezzuto, Stefano; Farina, Maria; Spinoglio, Luigi

    2014-08-01

    SAFARI (SpicA FAR infrared Instrument) is a far-infrared imaging Fourier Transform Spectrometer for the SPICA mission. The Digital Processing Unit (DPU) of the instrument implements the functions of controlling the overall instrument and implementing the science data compression and packing. The DPU design is based on the use of a LEON family processor. In SAFARI, all instrument components are connected to the central DPU via SpaceWire links. On these links science data, housekeeping and command flows are in some cases multiplexed, therefore the interface control shall be able to cope with variable throughput needs. The effective data transfer workload can be an issue for the overall system performance and becomes a critical parameter for the on-board software design, both at application layer level and at lower, and more HW related, levels. To analyze the system behavior in the presence of the expected demanding SAFARI science data flow, we carried out a series of performance tests using the standard GR-CPCI-UT699 LEON3-FT Development Board, provided by Aeroflex/Gaisler, connected to the emulator of the SAFARI science data links, in a point-to-point topology. Two different communication protocols have been used in the tests, the ECSS-E-ST-50-52C RMAP protocol and an internally defined one, the SAFARI internal data handling protocol. An incremental approach has been adopted to measure the system performance at different levels of the communication protocol complexity. In all cases the performance has been evaluated by measuring the CPU workload and the bus latencies. The tests have been executed initially in a custom low level execution environment and finally using the Real-Time Executive for Multiprocessor Systems (RTEMS), which has been selected as the operating system to be used onboard SAFARI. The preliminary results of the performance analysis carried out confirmed the possibility of using a LEON3 CPU in the SAFARI DPU, but pointed out, in agreement with previous similar studies, the need to carefully design the overall architecture to implement some of the DPU functionalities on additional processing devices.

  17. The VLBA correlator: Real-time in the distributed era

    NASA Technical Reports Server (NTRS)

    Wells, D. C.

    1992-01-01

    The correlator is the signal processing engine of the Very Long Baseline Array (VLBA). Radio signals are recorded on special wideband (128 Mb/s) digital recorders at the 10 telescopes, with sampling times controlled by hydrogen maser clocks. The magnetic tapes are shipped to the Array Operations Center in Socorro, New Mexico, where they are played back simultaneously into the correlator. Real-time software and firmware controls the playback drives to achieve synchronization, compute models of the wavefront delay, control the numerous modules of the correlator, and record FITS files of the fringe visibilities at the back-end of the correlator. In addition to the more than 3000 custom VLSI chips which handle the massive data flow of the signal processing, the correlator contains a total of more than 100 programmable computers, 8-, 16- and 32-bit CPUs. Code is downloaded into front-end CPUs depending on the operating mode. Low-level code is assembly language; high-level code is C running under an RT OS. We use VxWorks on Motorola MVME147 CPUs. Code development is on a complex of SPARC workstations connected to the RT CPUs by Ethernet. The overall management of the correlation process is dependent on a database management system. We use Ingres running on a Sparcstation-2. We transfer logging information from the database of the VLBA Monitor and Control System to our database using Ingres/NET. Job scripts are computed and transferred to the real-time computers using NFS, and correlation job execution logs and status flow back by the same route. Operator status and control displays use windows on workstations, interfaced to the real-time processes by network protocols. The extensive network protocol support provided by VxWorks is invaluable. The VLBA Correlator's dependence on network protocols is an example of the radical transformation of the real-time world over the past five years. Real-time is becoming more like conventional computing. Paradoxically, 'conventional' computing is also adopting practices from the real-time world: semaphores, shared memory, light-weight threads, and concurrency. This appears to be a convergence of thinking.

  18. Accelerating Large Scale Image Analyses on Parallel, CPU-GPU Equipped Systems

    PubMed Central

    Teodoro, George; Kurc, Tahsin M.; Pan, Tony; Cooper, Lee A.D.; Kong, Jun; Widener, Patrick; Saltz, Joel H.

    2014-01-01

    The past decade has witnessed a major paradigm shift in high performance computing with the introduction of accelerators as general purpose processors. These computing devices make available very high parallel computing power at low cost and power consumption, transforming current high performance platforms into heterogeneous CPU-GPU equipped systems. Although the theoretical performance achieved by these hybrid systems is impressive, taking practical advantage of this computing power remains a very challenging problem. Most applications are still deployed to either GPU or CPU, leaving the other resource under- or un-utilized. In this paper, we propose, implement, and evaluate a performance aware scheduling technique along with optimizations to make efficient collaborative use of CPUs and GPUs on a parallel system. In the context of feature computations in large scale image analysis applications, our evaluations show that intelligently co-scheduling CPUs and GPUs can significantly improve performance over GPU-only or multi-core CPU-only approaches. PMID:25419545
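
    A performance-aware scheduler of the kind described above can be approximated by a greedy list scheduler that places each task on whichever device is expected to finish it first, given per-task CPU and GPU cost estimates. The sketch below is a toy model with invented task costs, not the authors' runtime system.

```python
def co_schedule(tasks):
    """Greedy earliest-finish-time assignment of tasks to a CPU pool and a GPU.

    tasks: list of (name, cpu_cost, gpu_cost) tuples with estimated execution times.
    Returns the per-task assignment and the overall makespan.
    """
    free_at = {"cpu": 0.0, "gpu": 0.0}   # time at which each device becomes idle
    assignment = []

    # Longest-processing-time-first order (by the cheaper-device cost of each task).
    for name, cpu_cost, gpu_cost in sorted(tasks, key=lambda t: -min(t[1], t[2])):
        cost = {"cpu": cpu_cost, "gpu": gpu_cost}
        device = min(("cpu", "gpu"), key=lambda d: free_at[d] + cost[d])
        free_at[device] += cost[device]
        assignment.append((name, device, free_at[device]))

    return assignment, max(free_at.values())

if __name__ == "__main__":
    # Hypothetical per-task cost estimates (seconds) for image-analysis feature operators.
    feature_ops = [("color-histogram", 4.0, 1.0), ("texture", 6.0, 1.5),
                   ("morphometry", 2.0, 2.5), ("segmentation", 8.0, 2.0)]
    plan, makespan = co_schedule(feature_ops)
    for name, device, finish in plan:
        print(f"{name:15s} -> {device}  (finishes at t={finish:.1f} s)")
    print("makespan:", makespan)
```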

  19. File Usage Analysis and Resource Usage Prediction: a Measurement-Based Study. Ph.D. Thesis

    NASA Technical Reports Server (NTRS)

    Devarakonda, Murthy V.-S.

    1987-01-01

    A probabilistic scheme was developed to predict process resource usage in UNIX. Given the identity of the program being run, the scheme predicts CPU time, file I/O, and memory requirements of a process at the beginning of its life. The scheme uses a state-transition model of the program's resource usage in its past executions for prediction. The states of the model are the resource regions obtained from an off-line cluster analysis of processes run on the system. The proposed method is shown to work on data collected from a VAX 11/780 running 4.3 BSD UNIX. The results show that the predicted values correlate well with the actual. The coefficient of correlation between the predicted and actual values of CPU time is 0.84. Errors in prediction are mostly small. Some 82% of errors in CPU time prediction are less than 0.5 standard deviations of process CPU time.
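
    The prediction scheme can be caricatured as a first-order state-transition (Markov) model over resource regions. The sketch below is a rough simplification with made-up CPU-time data: it bins past CPU times of one program into regions (standing in for the off-line cluster analysis), estimates transition counts between consecutive runs, and predicts the region of the next run.

```python
import numpy as np

def build_regions(cpu_times, n_regions=3):
    """Bin past CPU times into resource regions (a stand-in for the off-line clustering)."""
    edges = np.quantile(cpu_times, np.linspace(0, 1, n_regions + 1)[1:-1])
    return edges, np.digitize(cpu_times, edges)

def predict_next_region(states, n_regions=3):
    """Fit a first-order state-transition model and predict the state after the last run."""
    counts = np.zeros((n_regions, n_regions))
    for a, b in zip(states[:-1], states[1:]):
        counts[a, b] += 1
    row = counts[states[-1]]
    return int(row.argmax()) if row.sum() else int(states[-1])

if __name__ == "__main__":
    # Hypothetical CPU times (seconds) of past executions of one program.
    history = np.array([0.2, 0.3, 2.5, 0.25, 0.31, 2.7, 0.22, 0.28, 2.6, 0.24])
    edges, states = build_regions(history)
    print("region edges:", edges, "predicted next region:", predict_next_region(states))
```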

  20. Predictability of process resource usage - A measurement-based study on UNIX

    NASA Technical Reports Server (NTRS)

    Devarakonda, Murthy V.; Iyer, Ravishankar K.

    1989-01-01

    A probabilistic scheme is developed to predict process resource usage in UNIX. Given the identity of the program being run, the scheme predicts CPU time, file I/O, and memory requirements of a process at the beginning of its life. The scheme uses a state-transition model of the program's resource usage in its past executions for prediction. The states of the model are the resource regions obtained from an off-line cluster analysis of processes run on the system. The proposed method is shown to work on data collected from a VAX 11/780 running 4.3 BSD UNIX. The results show that the predicted values correlate well with the actual. The correlation coefficient between the predicted and actual values of CPU time is 0.84. Errors in prediction are mostly small. Some 82 percent of errors in CPU time prediction are less than 0.5 standard deviations of process CPU time.

  1. Predictability of process resource usage: A measurement-based study of UNIX

    NASA Technical Reports Server (NTRS)

    Devarakonda, Murthy V.; Iyer, Ravishankar K.

    1987-01-01

    A probabilistic scheme is developed to predict process resource usage in UNIX. Given the identity of the program being run, the scheme predicts CPU time, file I/O, and memory requirements of a process at the beginning of its life. The scheme uses a state-transition model of the program's resource usage in its past executions for prediction. The states of the model are the resource regions obtained from an off-line cluster analysis of processes run on the system. The proposed method is shown to work on data collected from a VAX 11/780 running 4.3 BSD UNIX. The results show that the predicted values correlate well with the actual. The correlation coefficient between the predicted and actual values of CPU time is 0.84. Errors in prediction are mostly small. Some 82% of errors in CPU time prediction are less than 0.5 standard deviations of process CPU time.

  2. Host-Based Systemic Network Obfuscation System for Windows

    DTIC Science & Technology

    2011-06-01

    speed, CPU speed, and memory size. These additional parameters are control variables and do not change throughout the experiment. The applications...physical median that passes the network traffic, such as a wireless signal or Ethernet cable and does not need obfuscation. The colored layers in Figure...Gul09] Ron Gula, “ Enchanced Operating System Identification with Nessus.” [Online]. Available: http://blog.tenablesecurity.com/2009/02

  3. GPU acceleration towards real-time image reconstruction in 3D tomographic diffractive microscopy

    NASA Astrophysics Data System (ADS)

    Bailleul, J.; Simon, B.; Debailleul, M.; Liu, H.; Haeberlé, O.

    2012-06-01

    Phase microscopy techniques have regained interest because they allow the observation of unprepared specimens with excellent temporal resolution. Tomographic diffractive microscopy is an extension of holographic microscopy which permits 3D observations with a finer resolution than incoherent light microscopes. Specimens are imaged by a series of 2D holograms: their accumulation progressively fills the range of frequencies of the specimen in Fourier space. A 3D inverse FFT eventually provides a spatial image of the specimen. Consequently, acquisition and then reconstruction are required to produce an image that could be a prelude to real-time control of the observed specimen. The MIPS Laboratory has built a tomographic diffractive microscope with an unsurpassed 130 nm resolution but a low imaging speed of no less than one minute. Afterwards, a high-end PC reconstructs the 3D image in 20 seconds. We now expect an interactive system providing preview images during the acquisition for monitoring purposes. We first present a prototype implementing this solution on the CPU: acquisition and reconstruction are tied in a producer-consumer scheme, sharing common data in CPU memory. Then we present a prototype dispatching some reconstruction tasks to the GPU in order to take advantage of SIMD parallelization for FFT and higher bandwidth for filtering operations. The CPU scheme takes 6 seconds for a 3D image update, while the GPU scheme can go down to about 2 seconds, or just over 1 second, depending on the GPU class. This opens opportunities for 4D imaging of living organisms or crystallization processes. We also consider the relevance of the GPU for 3D image interaction in our specific conditions.
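
    The producer-consumer coupling between acquisition and reconstruction can be sketched with two threads sharing a queue: the producer pushes 2D holograms as they are acquired, while the consumer accumulates their spectra in Fourier space and periodically runs an inverse FFT to refresh the preview. The code below is a schematic CPU-only mock-up with synthetic data and hypothetical sizes, not the laboratory's system.

```python
import queue
import threading
import numpy as np

N_HOLOGRAMS, SIZE = 32, 128           # hypothetical acquisition parameters
holograms = queue.Queue(maxsize=8)    # shared buffer between the two threads

def acquire():
    """Producer: push 2D holograms as the (simulated) camera delivers them."""
    for _ in range(N_HOLOGRAMS):
        holograms.put(np.random.rand(SIZE, SIZE))
    holograms.put(None)               # sentinel: acquisition finished

def reconstruct(preview_every=8):
    """Consumer: accumulate spectra in Fourier space and periodically refresh a preview."""
    spectrum = np.zeros((SIZE, SIZE), dtype=complex)
    count = 0
    while True:
        frame = holograms.get()
        if frame is None:
            break
        spectrum += np.fft.fft2(frame)      # fill the frequency support progressively
        count += 1
        if count % preview_every == 0:
            preview = np.abs(np.fft.ifft2(spectrum))
            print(f"preview after {count} holograms, mean intensity {preview.mean():.3f}")

producer = threading.Thread(target=acquire)
consumer = threading.Thread(target=reconstruct)
producer.start(); consumer.start()
producer.join(); consumer.join()
```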

  4. Invasive treatment of NSTEMI patients in German Chest Pain Units - Evidence for a treatment paradox.

    PubMed

    Schmidt, Frank P; Schmitt, Claus; Hochadel, Matthias; Giannitsis, Evangelos; Darius, Harald; Maier, Lars S; Schmitt, Claus; Heusch, Gerd; Voigtländer, Thomas; Mudra, Harald; Gori, Tommaso; Senges, Jochen; Münzel, Thomas

    2018-03-15

    Patients with non ST-segment elevation myocardial infarction (NSTEMI) represent the largest fraction of patients with acute coronary syndrome in German Chest Pain units. Recent evidence on early vs. selective percutaneous coronary intervention (PCI) is ambiguous with respect to effects on mortality, myocardial infarction (MI) and recurrent angina. With the present study we sought to investigate the prognostic impact of PCI and its timing in German Chest Pain Unit (CPU) NSTEMI patients. Data from 1549 patients whose leading diagnosis was NSTEMI were retrieved from the German CPU registry for the interval between 3/2010 and 3/2014. Follow-up was available at median of 167days after discharge. The patients were grouped into a higher (Group A) and lower risk group (Group B) according to GRACE score and additional criteria on admission. Group A had higher Killip classes, higher BNP levels, reduced EF and significant more triple vessel disease (p<0.001). Surprisingly, patients in group A less frequently received early diagnostic catheterization and PCI. While conservative management did not affect prognosis in Group B, higher-risk CPU-NSTEMI patients without PCI had a significantly worse survival. The present results reveal a substantial treatment gap in higher-risk NSTEMI patients in German Chest Pain Units. This treatment paradox may worsen prognosis in patients who could derive the largest benefit from early revascularization. Copyright © 2017 The Authors. Published by Elsevier B.V. All rights reserved.

  5. A Spiking Neural Simulator Integrating Event-Driven and Time-Driven Computation Schemes Using Parallel CPU-GPU Co-Processing: A Case Study.

    PubMed

    Naveros, Francisco; Luque, Niceto R; Garrido, Jesús A; Carrillo, Richard R; Anguita, Mancia; Ros, Eduardo

    2015-07-01

    Time-driven simulation methods in traditional CPU architectures perform well and precisely when simulating small-scale spiking neural networks. Nevertheless, they still have drawbacks when simulating large-scale systems. Conversely, event-driven simulation methods in CPUs and time-driven simulation methods in graphic processing units (GPUs) can outperform CPU time-driven methods under certain conditions. With this performance improvement in mind, we have developed an event-and-time-driven spiking neural network simulator suitable for a hybrid CPU-GPU platform. Our neural simulator is able to efficiently simulate bio-inspired spiking neural networks consisting of different neural models, which can be distributed heterogeneously in both small layers and large layers or subsystems. For the sake of efficiency, the low-activity parts of the neural network can be simulated in CPU using event-driven methods while the high-activity subsystems can be simulated in either CPU (a few neurons) or GPU (thousands or millions of neurons) using time-driven methods. In this brief, we have undertaken a comparative study of these different simulation methods. For benchmarking the different simulation methods and platforms, we have used a cerebellar-inspired neural-network model consisting of a very dense granular layer and a Purkinje layer with a smaller number of cells (according to biological ratios). Thus, this cerebellar-like network includes a dense diverging neural layer (increasing the dimensionality of its internal representation and sparse coding) and a converging neural layer (integration) similar to many other biologically inspired and also artificial neural networks.

  6. Efficient methods for implementation of multi-level nonrigid mass-preserving image registration on GPUs and multi-threaded CPUs.

    PubMed

    Ellingwood, Nathan D; Yin, Youbing; Smith, Matthew; Lin, Ching-Long

    2016-04-01

    Faster and more accurate methods for image registration are important both for population-based studies that utilize medical imaging and for clinical applications. We present a novel computation- and memory-efficient multi-level method on graphics processing units (GPU) for performing registration of two computed tomography (CT) volumetric lung images. We developed a computation- and memory-efficient Diffeomorphic Multi-level B-Spline Transform Composite (DMTC) method to implement nonrigid mass-preserving registration of two CT lung images on GPU. The framework consists of a hierarchy of B-Spline control grids of increasing resolution. A similarity criterion known as the sum of squared tissue volume difference (SSTVD) was adopted to preserve lung tissue mass. The use of SSTVD consists of the calculation of the tissue volume, the Jacobian, and their derivatives, which makes its implementation on GPU challenging due to memory constraints. The use of the DMTC method enabled reduced computation and memory storage of variables with minimal communication between GPU and Central Processing Unit (CPU) due to the ability to pre-compute values. The method was assessed on six healthy human subjects. Resultant GPU-generated displacement fields were compared against the previously validated CPU counterpart fields, showing good agreement with an average normalized root mean square error (nRMS) of 0.044±0.015. Runtime and performance speedup are compared between single-threaded CPU, multi-threaded CPU, and GPU algorithms. Best performance speedup occurs at the highest resolution in the GPU implementation for the SSTVD cost and cost gradient computations, with a speedup of 112 times that of the single-threaded CPU version and 11 times over the twelve-threaded version when considering average time per iteration using an Nvidia Tesla K20X GPU. The proposed GPU-based DMTC method outperforms its multi-threaded CPU version in terms of runtime. Total registration time was reduced to 2.9 min on the GPU version, compared to 12.8 min on the twelve-threaded CPU version and 112.5 min on a single-threaded CPU. Furthermore, the GPU implementation discussed in this work can be adapted for use of other cost functions that require calculation of the first derivatives. Copyright © 2015 Elsevier Ireland Ltd. All rights reserved.

  7. Free-Space Optical Interconnect Employing VCSEL Diodes

    NASA Technical Reports Server (NTRS)

    Simons, Rainee N.; Savich, Gregory R.; Torres, Heidi

    2009-01-01

    Sensor signal processing is widely used on aircraft and spacecraft. The scheme employs multiple input/output (I/O) nodes for data acquisition and CPU (central processing unit) nodes for data processing. To connect I/O nodes and CPU nodes, scalable interconnections such as backplanes are desired because the number of nodes depends on the requirements of each mission. An optical backplane consisting of vertical-cavity surface-emitting lasers (VCSELs), VCSEL drivers, photodetectors, and transimpedance amplifiers is the preferred approach since it can handle several hundred megabits per second of data throughput. The next generation of satellite-borne systems will require transceivers and processors that can handle several Gb/s of data. Optical interconnects have been praised for both their speed and functionality, with hopes that light can relieve the electrical bottleneck predicted for the near future. Optoelectronic interconnects provide a factor of ten improvement over electrical interconnects.

  8. GPU based contouring method on grid DEM data

    NASA Astrophysics Data System (ADS)

    Tan, Liheng; Wan, Gang; Li, Feng; Chen, Xiaohui; Du, Wenlong

    2017-08-01

    This paper presents a novel method to generate contour lines from grid DEM data based on the programmable GPU pipeline. Previous contouring approaches often use the CPU to construct a finite element mesh from the raw DEM data and then extract contour segments from the elements. They also need a tracing or sorting strategy to generate the final continuous contours. These approaches can be CPU-intensive and time-consuming, and the generated contours can be unsmooth if the raw data are sparsely distributed. Unlike the CPU approaches, we employ the GPU's vertex shader to generate a triangular mesh with arbitrary user-defined density, in which the height of each vertex is calculated through a third-order Cardinal spline function. Then, in the same frame, segments are extracted from the triangles by the geometry shader and transferred to the CPU side, with an internal order, in the GPU's transform feedback stage. Finally, we propose a "Grid Sorting" algorithm to obtain continuous contour lines by traversing the segments only once. Our method makes use of multiple stages of the GPU pipeline for computation, generates smooth contour lines, and is significantly faster than previous CPU approaches. The algorithm can be easily implemented with the OpenGL 3.3 API or higher on consumer-level PCs.
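
    The sketch below is a plain CPU illustration of the per-triangle step that the geometry shader performs in such a pipeline: for one contour level, each triangle edge crossed by the level is intersected by linear interpolation and a line segment is emitted. The spline densification and the "Grid Sorting" stage described in the abstract are not reproduced here.

        // Minimal sketch: marching-triangles style segment extraction for one level.
        #include <cstdio>
        #include <vector>

        struct Vec2 { double x, y; };
        struct Vertex { Vec2 p; double h; };            // position + interpolated height
        struct Segment { Vec2 a, b; };

        // Linear interpolation of the crossing point on an edge.
        static Vec2 cross_point(const Vertex& u, const Vertex& v, double level) {
            double t = (level - u.h) / (v.h - u.h);
            return {u.p.x + t * (v.p.x - u.p.x), u.p.y + t * (v.p.y - u.p.y)};
        }

        // Returns true and writes a segment if the contour level crosses the triangle.
        static bool contour_triangle(const Vertex tri[3], double level, Segment* out) {
            std::vector<Vec2> pts;
            for (int i = 0; i < 3; ++i) {
                const Vertex& u = tri[i];
                const Vertex& v = tri[(i + 1) % 3];
                if ((u.h < level) != (v.h < level))     // edge straddles the level
                    pts.push_back(cross_point(u, v, level));
            }
            if (pts.size() != 2) return false;          // no crossing (or degenerate)
            *out = {pts[0], pts[1]};
            return true;
        }

        int main() {
            Vertex tri[3] = {{{0, 0}, 10.0}, {{1, 0}, 30.0}, {{0, 1}, 20.0}};
            Segment s;
            if (contour_triangle(tri, 15.0, &s))
                std::printf("segment (%.2f,%.2f)-(%.2f,%.2f)\n", s.a.x, s.a.y, s.b.x, s.b.y);
            return 0;
        }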

  9. SPIDR, a general-purpose readout system for pixel ASICs

    NASA Astrophysics Data System (ADS)

    van der Heijden, B.; Visser, J.; van Beuzekom, M.; Boterenbrood, H.; Kulis, S.; Munneke, B.; Schreuder, F.

    2017-02-01

    The SPIDR (Speedy PIxel Detector Readout) system is a flexible general-purpose readout platform that can be easily adapted to test and characterize new and existing detector readout ASICs. It was originally designed for the readout of pixel ASICs from the Medipix/Timepix family, but other types of ASICs or front-end circuits can be read out as well. The SPIDR system consists of an FPGA board with memory and various communication interfaces, FPGA firmware, a CPU subsystem, and an API library on the PC. The FPGA firmware can be adapted to read out other ASICs by re-using IP blocks. The available IP blocks include a UDP packet builder, 1 and 10 Gigabit Ethernet MACs, and a "soft core" CPU. Currently the firmware is targeted at the Xilinx VC707 development board and at a custom board called Compact-SPIDR. The firmware can easily be ported to other Xilinx 7 series and UltraScale FPGAs. The gap between an ASIC and the data acquisition back-end is bridged by the SPIDR system. Using the high pin count VITA 57 FPGA Mezzanine Card (FMC) connector, only a simple chip carrier PCB is required. A 1 and a 10 Gigabit Ethernet interface handle the connection to the back-end. These can be used simultaneously for high-speed data and configuration over separate channels. In addition to the FMC connector, configurable inputs and outputs are available for synchronization with other detectors. A high-resolution (≈ 27 ps bin size) time-to-digital converter is provided for time stamping events in the detector. The SPIDR system is frequently used as readout for the Medipix3 and Timepix3 ASICs. Using the 10 Gigabit Ethernet interface it is possible to read out a single chip at full bandwidth or up to 12 chips at a reduced rate. Another recent application is the test-bed for the VeloPix ASIC, which is being developed for the Vertex Detector of the LHCb experiment. In this case the SPIDR system processes the 20 Gbps scrambled data stream from the VeloPix and distributes it over four 10 Gigabit Ethernet links, and in addition provides the slow and fast control for the chip.

  10. Towards High Resolution Numerical Algorithms for Wave Dominated Physical Phenomena

    DTIC Science & Technology

    2009-01-30

    results are scaled as floating point operations per second, obtained by counting the number of floating point additions and multiplications in the...black horizontal line. Perhaps the most striking feature at first is the fact that the memory bandwidth measured for flux lifting transcends this...theoretical peak performance values. For a suitable CPU-limited workload, this means that a single workstation equipped with multiple GPUs can do work that

  11. Intact intracortical microstimulation (ICMS) representations of rostral and caudal forelimb areas in rats with quinolinic acid lesions of the medial or lateral caudate-putamen in an animal model of Huntington's disease.

    PubMed

    Karl, Jenni M; Sacrey, Lori-Ann R; McDonald, Robert J; Whishaw, Ian Q

    2008-09-05

    Neurotoxic, cell-specific lesions of the rat caudate-putamen (CPu) have been proposed as a model of human Huntington's disease and as such impair performance on many motor tasks, including skilled forelimb tasks such as reaching for food. Because the CPu and motor cortex share reciprocal connections, it has been proposed that the motor deficits are due in part to a secondary disruption of motor cortex. The purpose of the present study was to examine the functionality of the motor cortex using intracortical microstimulation (ICMS) following neurotoxic lesions of the CPu. ICMS maps have been shown to be sensitive indicators of motor skill, cortical injury, learning, and experience. Long-Evans hooded rats received a sham, a medial, or a lateral CPu lesion using the neurotoxin quinolinic acid (2,3-pyridinedicarboxylic acid). Two weeks later the motor cortex was stimulated under light ketamine anesthesia. Neither lateral nor medial lesions of the CPu altered the stimulation threshold for eliciting forelimb movements, the type of movements elicited, or the size of the rostral forelimb area (RFA) and caudal forelimb area (CFA) from which movements were elicited. The preservation of ICMS forelimb movement representations (the forelimb map) in rats with cell-specific CPu lesions suggests that the motor impairments following lesions of the lateral striatum are not due to disruption of the motor map. Therefore, the impairments that follow striatal cell loss are due either to alterations in circuitry that is independent of the motor cortex or to alterations in circuitry afferent to the motor cortex projections.

  12. Characterization and referral patterns of ST-elevation myocardial infarction patients admitted to chest pain units rather than directly to catheterization laboratories. Data from the German Chest Pain Unit Registry.

    PubMed

    Schmidt, Frank P; Perne, Andrea; Hochadel, Matthias; Giannitsis, Evangelos; Darius, Harald; Maier, Lars S; Schmitt, Claus; Heusch, Gerd; Voigtländer, Thomas; Mudra, Harald; Gori, Tommaso; Senges, Jochen; Münzel, Thomas

    2017-03-15

    Direct transfer to the catheterization laboratory for primary percutaneous coronary intervention (PCI) is the standard of care for patients with ST-segment elevation myocardial infarction (STEMI). Nevertheless, a significant number of STEMI patients are initially treated in chest pain units (CPUs) of admitting hospitals. Thus, it is important to characterize these patients, to define why this deviation from recommended clinical pathways occurs, and in particular to quantify the impact of the deviation on critical time intervals. 1679 STEMI patients admitted to a CPU in the period from 2010 to 2015 were enrolled in the German CPU registry (8.5% of 19,666). 55.9% of the patients were delivered by an emergency medical system (EMS), 16.1% were transferred from other hospitals, and 15.2% were referred by a general practitioner (GP). 12.7% were self-referrals. 55% did not get a pre-hospital ECG. Compared to the EMS, referral by GPs markedly delayed critical time intervals, while a pre-hospital ECG demonstrating ST-segment elevation reduced door-to-balloon time. When compared to STEMI patients (n=21,674) enrolled in the ALKK registry, CPU-STEMI patients had a lower risk profile, their treatment in the CPU was guideline-conform, and in-hospital mortality was low (1.5%). CPU-STEMI patients represent a numerically significant group, in part because a pre-hospital ECG was frequently not documented. Treatment in the CPU is guideline-conform and the intra-hospital mortality is low. The lack of a pre-hospital ECG and admission via the GP substantially delay critical time intervals, suggesting that in patients with symptoms suggestive of an ACS, the EMS should be contacted and not the GP. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.

  13. Accelerating moderately stiff chemical kinetics in reactive-flow simulations using GPUs

    NASA Astrophysics Data System (ADS)

    Niemeyer, Kyle E.; Sung, Chih-Jen

    2014-01-01

    The chemical kinetics ODEs arising from operator-split reactive-flow simulations were solved on GPUs using explicit integration algorithms. Nonstiff chemical kinetics of a hydrogen oxidation mechanism (9 species and 38 irreversible reactions) were computed using the explicit fifth-order Runge-Kutta-Cash-Karp method, and the GPU-accelerated version performed faster than single- and six-core CPU versions by factors of 126 and 25, respectively, for 524,288 ODEs. Moderately stiff kinetics, represented with mechanisms for hydrogen/carbon-monoxide (13 species and 54 irreversible reactions) and methane (53 species and 634 irreversible reactions) oxidation, were computed using the stabilized explicit second-order Runge-Kutta-Chebyshev (RKC) algorithm. The GPU-based RKC implementation demonstrated an increase in performance of nearly 59 and 10 times, for problem sizes consisting of 262,144 ODEs and larger, than the single- and six-core CPU-based RKC algorithms using the hydrogen/carbon-monoxide mechanism. With the methane mechanism, RKC-GPU performed more than 65 and 11 times faster, for problem sizes consisting of 131,072 ODEs and larger, than the single- and six-core RKC-CPU versions, and up to 57 times faster than the six-core CPU-based implicit VODE algorithm on 65,536 ODEs. In the presence of more severe stiffness, such as ethylene oxidation (111 species and 1566 irreversible reactions), RKC-GPU performed more than 17 times faster than RKC-CPU on six cores for 32,768 ODEs and larger, and at best 4.5 times faster than VODE on six CPU cores for 65,536 ODEs. With a larger time step size, RKC-GPU performed at best 2.5 times slower than six-core VODE for 8192 ODEs and larger. Therefore, the need for developing new strategies for integrating stiff chemistry on GPUs was discussed.
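
    The sketch below illustrates why explicit integrators map well onto GPUs in this setting: in operator-split reactive-flow codes each cell carries an independent ODE system, so cells can be advanced concurrently. OpenMP stands in for the GPU, and a classical RK4 step on a toy linear system stands in for the RKCK/RKC schemes used in the paper; the right-hand side is purely hypothetical.

        // Minimal sketch: many independent ODE systems advanced in parallel.
        #include <cstdio>
        #include <vector>

        constexpr int kSpecies = 4;
        struct State { double y[kSpecies]; };

        // Toy right-hand side (hypothetical); a real case would evaluate chemical
        // production rates for each species.
        static void rhs(const double* y, double* dy) {
            for (int i = 0; i < kSpecies; ++i) dy[i] = -0.5 * (i + 1) * y[i];
        }

        static void rk4_step(State& s, double dt) {
            double k1[kSpecies], k2[kSpecies], k3[kSpecies], k4[kSpecies], tmp[kSpecies];
            rhs(s.y, k1);
            for (int i = 0; i < kSpecies; ++i) tmp[i] = s.y[i] + 0.5 * dt * k1[i];
            rhs(tmp, k2);
            for (int i = 0; i < kSpecies; ++i) tmp[i] = s.y[i] + 0.5 * dt * k2[i];
            rhs(tmp, k3);
            for (int i = 0; i < kSpecies; ++i) tmp[i] = s.y[i] + dt * k3[i];
            rhs(tmp, k4);
            for (int i = 0; i < kSpecies; ++i)
                s.y[i] += dt / 6.0 * (k1[i] + 2.0 * k2[i] + 2.0 * k3[i] + k4[i]);
        }

        int main() {
            std::vector<State> cells(262144, State{{1.0, 1.0, 1.0, 1.0}});
            const double dt = 1e-3;
            #pragma omp parallel for                     // one independent ODE system per cell
            for (long c = 0; c < (long)cells.size(); ++c)
                for (int step = 0; step < 100; ++step) rk4_step(cells[c], dt);
            std::printf("cell 0, species 0: %.6f\n", cells[0].y[0]);
            return 0;
        }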

  14. Accelerating Smith-Waterman Alignment for Protein Database Search Using Frequency Distance Filtration Scheme Based on CPU-GPU Collaborative System.

    PubMed

    Liu, Yu; Hong, Yang; Lin, Chun-Yuan; Hung, Che-Lun

    2015-01-01

    The Smith-Waterman (SW) algorithm has been widely utilized for searching biological sequence databases in bioinformatics. Recently, several works have adopted the graphic card with Graphic Processing Units (GPUs) and their associated CUDA model to enhance the performance of SW computations. However, these works mainly focused on the protein database search by using the intertask parallelization technique, and only using the GPU capability to do the SW computations one by one. Hence, in this paper, we will propose an efficient SW alignment method, called CUDA-SWfr, for the protein database search by using the intratask parallelization technique based on a CPU-GPU collaborative system. Before doing the SW computations on GPU, a procedure is applied on CPU by using the frequency distance filtration scheme (FDFS) to eliminate the unnecessary alignments. The experimental results indicate that CUDA-SWfr runs 9.6 times and 96 times faster than the CPU-based SW method without and with FDFS, respectively.
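
    The sketch below illustrates the idea of a frequency-distance pre-filter (an illustration of the FDFS concept, not necessarily the exact criterion used by CUDA-SWfr): residue-frequency vectors are compared with a cheap L1 distance, and sequences whose distance exceeds a threshold are skipped before the expensive Smith-Waterman alignment. The threshold and sequences are hypothetical.

        // Minimal sketch: frequency-distance filtering before alignment.
        #include <array>
        #include <cstdio>
        #include <cstdlib>
        #include <string>
        #include <vector>

        static std::array<int, 26> freq_vector(const std::string& seq) {
            std::array<int, 26> f{};                       // indexed by letter A..Z
            for (char c : seq)
                if (c >= 'A' && c <= 'Z') ++f[c - 'A'];
            return f;
        }

        static int l1_distance(const std::array<int, 26>& a, const std::array<int, 26>& b) {
            int d = 0;
            for (int i = 0; i < 26; ++i) d += std::abs(a[i] - b[i]);
            return d;
        }

        int main() {
            const std::string query = "MKVLAAGICLLAV";
            const std::vector<std::string> db = {"MKVLAGGICLLTV", "GGGGGGGGGGWWW"};
            const int threshold = 8;                       // hypothetical cutoff

            auto fq = freq_vector(query);
            for (const auto& s : db) {
                int d = l1_distance(fq, freq_vector(s));
                if (d > threshold)
                    std::printf("skip   \"%s\" (distance %d)\n", s.c_str(), d);
                else
                    std::printf("align  \"%s\" (distance %d)  -> run SW on GPU\n", s.c_str(), d);
            }
            return 0;
        }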

  15. On the cost of approximating and recognizing a noise perturbed straight line or a quadratic curve segment in the plane. [central processing units

    NASA Technical Reports Server (NTRS)

    Cooper, D. B.; Yalabik, N.

    1975-01-01

    Approximation of noisy data in the plane by straight lines or elliptic or single-branch hyperbolic curve segments arises in pattern recognition, data compaction, and other problems. The efficient search for and approximation of data by such curves were examined. Recursive least-squares linear curve-fitting was used, and ellipses and hyperbolas are parameterized as quadratic functions in x and y. The error minimized by the algorithm is interpreted, and central processing unit (CPU) times for estimating parameters for fitting straight lines and quadratic curves were determined and compared. CPU time for data search was also determined for the case of straight line fitting. Quadratic curve fitting is shown to require about six times as much CPU time as does straight line fitting, and curves relating CPU time and fitting error were determined for straight line fitting. Results are derived on early sequential determination of whether or not the underlying curve is a straight line.
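
    As a small illustration of the kind of incremental (recursive) least-squares fitting discussed above, the sketch below fits a straight line y = a*x + b by updating running sums per point, so the fit can be re-evaluated after every new sample without revisiting earlier data. The quadratic-curve case adds more normal-equation terms and is not shown; the data are made up.

        // Minimal sketch: incremental least-squares straight-line fit.
        #include <cstdio>

        struct LineFit {
            double n = 0, sx = 0, sy = 0, sxx = 0, sxy = 0;
            void add(double x, double y) { ++n; sx += x; sy += y; sxx += x * x; sxy += x * y; }
            bool solve(double* a, double* b) const {
                double det = n * sxx - sx * sx;
                if (det == 0.0) return false;              // degenerate (single point or vertical)
                *a = (n * sxy - sx * sy) / det;
                *b = (sy * sxx - sx * sxy) / det;
                return true;
            }
        };

        int main() {
            const double xs[] = {0, 1, 2, 3, 4};
            const double ys[] = {0.1, 0.9, 2.2, 2.8, 4.1};  // noisy samples of y = x
            LineFit fit;
            for (int i = 0; i < 5; ++i) {
                fit.add(xs[i], ys[i]);
                double a, b;
                if (fit.solve(&a, &b))
                    std::printf("after %d points: y = %.3f x + %.3f\n", i + 1, a, b);
            }
            return 0;
        }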

  16. The Creation of a CPU Timer for High Fidelity Programs

    NASA Technical Reports Server (NTRS)

    Dick, Aidan A.

    2011-01-01

    Using C and C++ programming languages, a tool was developed that measures the efficiency of a program by recording the amount of CPU time that various functions consume. By inserting the tool between lines of code in the program, one can receive a detailed report of the absolute and relative time consumption associated with each section. After adapting the generic tool for a high-fidelity launch vehicle simulation program called MAVERIC, the components of a frequently used function called "derivatives ( )" were measured. Out of the 34 sub-functions in "derivatives ( )", it was found that the top 8 sub-functions made up 83.1% of the total time spent. In order to decrease the overall run time of MAVERIC, a launch vehicle simulation program, a change was implemented in the sub-function "Event_Controller ( )". Reformatting "Event_Controller ( )" led to a 36.9% decrease in the total CPU time spent by that sub-function, and a 3.2% decrease in the total CPU time spent by the overarching function "derivatives ( )".
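
    The sketch below shows the general kind of instrumentation described above (it is not the MAVERIC tool itself): a scoped timer accumulates CPU time per labelled code section and reports absolute and relative shares at the end. The section names and workloads are hypothetical.

        // Minimal sketch: per-section CPU time accounting with a scoped timer.
        #include <cstdio>
        #include <ctime>
        #include <map>
        #include <string>
        #include <utility>

        static std::map<std::string, double> g_totals;     // seconds of CPU time per label

        struct ScopedCpuTimer {
            std::string label;
            std::clock_t start;
            explicit ScopedCpuTimer(std::string l) : label(std::move(l)), start(std::clock()) {}
            ~ScopedCpuTimer() {
                g_totals[label] += double(std::clock() - start) / CLOCKS_PER_SEC;
            }
        };

        static void busy_work(long n) { volatile double x = 0; for (long i = 0; i < n; ++i) x += i * 1e-9; }

        int main() {
            {
                ScopedCpuTimer t("Event_Controller");
                busy_work(20000000);
            }
            {
                ScopedCpuTimer t("aero_forces");           // hypothetical sub-function name
                busy_work(60000000);
            }
            double total = 0;
            for (const auto& kv : g_totals) total += kv.second;
            for (const auto& kv : g_totals)
                std::printf("%-20s %8.3f s  (%5.1f%%)\n", kv.first.c_str(), kv.second,
                            100.0 * kv.second / total);
            return 0;
        }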

  17. Clinical implementation of a GPU-based simplified Monte Carlo method for a treatment planning system of proton beam therapy.

    PubMed

    Kohno, R; Hotta, K; Nishioka, S; Matsubara, K; Tansho, R; Suzuki, T

    2011-11-21

    We implemented the simplified Monte Carlo (SMC) method on a graphics processing unit (GPU) architecture under the compute unified device architecture (CUDA) platform developed by NVIDIA. The GPU-based SMC was clinically applied for four patients with head and neck, lung, or prostate cancer. The results were compared to those obtained by a traditional CPU-based SMC with respect to computation time and discrepancy. In the CPU- and GPU-based SMC calculations, the estimated mean statistical errors of the calculated doses in the planning target volume region were within 0.5% rms. The dose distributions calculated by the GPU- and CPU-based SMCs were similar, within statistical errors. The GPU-based SMC showed 12.30-16.00 times faster performance than the CPU-based SMC. The computation time per beam arrangement using the GPU-based SMC for the clinical cases ranged from 9 to 67 s. The results demonstrate the successful application of the GPU-based SMC to clinical proton treatment planning.

  18. Collateral projections of nucleus raphe dorsalis neurones to the caudate-putamen and region around the nucleus raphe magnus and nucleus reticularis gigantocellularis pars alpha in the rat.

    PubMed

    Li, Y Q; Kaneko, T; Mizuno, N

    2001-02-16

    It was examined whether or not the nucleus raphe dorsalis (RD) neurons projecting to the caudate-putamen (CPu) might also project to the motor-controlling region around the nucleus raphe magnus (NRM) and nucleus reticularis gigantocellularis pars alpha (Gia) in the rat. Single RD neurons projecting to the CPu and NRM/Gia by way of axon collaterals were identified by the retrograde double-labeling method with fluorescent dyes, Fast Blue and Diamidino Yellow, which were injected respectively into the CPu and NRM/Gia. Then, serotonin (5-HT)-like immunoreactivity of the double-labeled RD neurons was examined immunohistochemically; approximately 60% of the double-labeled RD neurons showed 5-HT-like immunoreactivity. The results indicated that some of serotonergic and non-serotonergic RD neurons might control motor functions simultaneously at the levels of the CPu and NRM/Gia by way of axon collaterals.

  19. Nationwide but still inhomogeneous distribution of certified chest pain units across Germany : Need to strengthen rural regions.

    PubMed

    Varnavas, V; Rassaf, T; Breuckmann, F

    2018-02-01

    The purpose of this work was to analyze structure, distribution, and bed capacities of certified German chest pain units (CPUs) to unveil potential gaps despite nationwide certification of 230 units till the end of 2015. Analysis of number and structure of CPUs per state, resident count, and population density by standardized telephone interview, online research, and data collection from the registry of the Federal Statistical Office for all certified German CPUs. Nationwide, German health facilities provided a mean of 1 CPU bed within a certified unit per 65,000 inhabitants. Bremen, Hamburg, Hesse, and Rhineland-Palatinate provided more than 1 bed per 50,000 inhabitants. Most CPUs (49%) were located in the emergency room. All university hospitals in Germany provided a certified CPU. Most units were found in academic teaching hospitals (146 CPUs). Only 42 CPUs were found in nonacademic providers of primary health care. The absolute necessary number of CPUs to reach full nationwide coverage is still unknown. The current analysis shows a high number of CPUs and bed capacities within the cities and industrial areas without relevant gaps, but also demonstrates a certain undersupply in more rural areas as well as in some of the former eastern federal states of Germany.

  20. Generic algorithms for high performance scalable geocomputing

    NASA Astrophysics Data System (ADS)

    de Jong, Kor; Schmitz, Oliver; Karssenberg, Derek

    2016-04-01

    During the last decade, the characteristics of computing hardware have changed a lot. For example, instead of a single general-purpose CPU core, personal computers nowadays contain multiple cores per CPU and often general-purpose accelerators, like GPUs. Additionally, compute nodes are often grouped together to form clusters or a supercomputer, providing enormous amounts of compute power. For existing earth simulation models to be able to use modern hardware platforms, their compute-intensive parts must be rewritten. This can be a major undertaking and may involve many technical challenges. Compute tasks must be distributed over CPU cores, offloaded to hardware accelerators, or distributed to different compute nodes. And ideally, all of this should be done in such a way that the compute task scales well with the hardware resources. This presents two challenges: 1) how to make good use of all the compute resources and 2) how to make these compute resources available to developers of simulation models, who may not (want to) have the required technical background for distributing compute tasks. The first challenge requires the use of specialized technology (e.g. threads, OpenMP, MPI, OpenCL, CUDA). The second challenge requires the abstraction of the logic handling the distribution of compute tasks from the model-specific logic, hiding the technical details from the model developer. To assist the model developer, we are developing a C++ software library (called Fern) containing algorithms that can use all CPU cores available in a single compute node (distributing tasks over multiple compute nodes will be done at a later stage). The algorithms are grid-based (finite difference) and include local and spatial operations such as convolution filters. The algorithms handle the distribution of compute tasks to CPU cores internally. In the resulting model, the low-level details of how this is done are separated from the model-specific logic representing the modeled system. This contrasts with practices in which code for distributing compute tasks is mixed with model-specific code, and it results in a more maintainable model. For flexibility and efficiency, the algorithms are configurable at compile time with respect to the following aspects: data type, value type, no-data handling, input value domain handling, and output value range handling. This makes the algorithms usable in very different contexts, without the need for making intrusive changes to existing models when using them. Applications that benefit from the Fern library include the construction of forward simulation models in (global) hydrology (e.g. PCR-GLOBWB (Van Beek et al. 2011)), ecology, geomorphology, or land use change (e.g. PLUC (Verstegen et al. 2014)), and the manipulation of hyper-resolution land surface data such as digital elevation models and remote sensing data. Using the Fern library, we have also created an add-on to the PCRaster Python Framework (Karssenberg et al. 2010) allowing its users to speed up their spatio-temporal models, sometimes by changing just a single line of Python code in their model. In our presentation we will give an overview of the design of the algorithms, providing examples of different contexts where they can be used to replace existing sequential algorithms, including the PCRaster environmental modeling software (www.pcraster.eu). We will show how the algorithms can be configured to behave differently when necessary.

    References:
    Karssenberg, D., Schmitz, O., Salamon, P., De Jong, K., and Bierkens, M.F.P., 2010. A software framework for construction of process-based stochastic spatio-temporal models and data assimilation. Environmental Modelling & Software 25, 489-502.
    Van Beek, L.P.H., Wada, Y., and Bierkens, M.F.P., 2011. Global monthly water stress: 1. Water balance and water availability. Water Resources Research 47.
    Verstegen, J.A., Karssenberg, D., van der Hilst, F., and Faaij, A.P.C., 2014. Identifying a land use change cellular automaton by Bayesian data assimilation. Environmental Modelling & Software 53, 121-136.
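
    The sketch below conveys the general idea behind such a library; it is not the Fern API, and all names are hypothetical. A generic, type-parameterised local operation is applied to a raster, with the distribution over CPU cores handled inside the algorithm so that model code only states what to compute.

        // Minimal sketch: a generic cell-by-cell operation parallelised internally.
        #include <algorithm>
        #include <cstdio>
        #include <thread>
        #include <vector>

        template <typename T, typename UnaryOp>
        void local_operation(const std::vector<T>& in, std::vector<T>& out, UnaryOp op) {
            const unsigned n_threads = std::max(1u, std::thread::hardware_concurrency());
            const size_t chunk = (in.size() + n_threads - 1) / n_threads;
            std::vector<std::thread> workers;
            for (unsigned t = 0; t < n_threads; ++t) {
                const size_t begin = t * chunk;
                const size_t end = std::min(in.size(), begin + chunk);
                workers.emplace_back([&, begin, end] {
                    for (size_t i = begin; i < end; ++i) out[i] = op(in[i]);  // cell-by-cell
                });
            }
            for (auto& w : workers) w.join();
        }

        int main() {
            std::vector<float> dem(1 << 20, 100.0f);        // a flat "elevation" raster
            std::vector<float> result(dem.size());
            // Model-specific logic: convert metres to feet; the parallelism is hidden.
            local_operation(dem, result, [](float m) { return m * 3.2808f; });
            std::printf("result[0] = %.2f\n", result[0]);
            return 0;
        }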

  1. SeqLib: a C ++ API for rapid BAM manipulation, sequence alignment and sequence assembly

    PubMed Central

    Wala, Jeremiah; Beroukhim, Rameen

    2017-01-01

    We present SeqLib, a C++ API and command line tool that provides a rapid and user-friendly interface to BAM/SAM/CRAM files, global sequence alignment operations and sequence assembly. Four C libraries perform core operations in SeqLib: HTSlib for BAM access, BWA-MEM and BLAT for sequence alignment and Fermi for error correction and sequence assembly. Benchmarking indicates that SeqLib has lower CPU and memory requirements than leading C++ sequence analysis APIs. We demonstrate an example of how minimal SeqLib code can extract, error-correct and assemble reads from a CRAM file and then align with BWA-MEM. SeqLib also provides additional capabilities, including chromosome-aware interval queries and read plotting. Command line tools are available for performing integrated error correction, micro-assemblies and alignment. Availability and Implementation: SeqLib is available on Linux and OSX for the C++98 standard and later at github.com/walaj/SeqLib. SeqLib is released under the Apache2 license. Additional capabilities for BLAT alignment are available under the BLAT license. Contact: jwala@broadinstitue.org; rameen@broadinstitute.org PMID:28011768

  2. SeqLib: a C ++ API for rapid BAM manipulation, sequence alignment and sequence assembly.

    PubMed

    Wala, Jeremiah; Beroukhim, Rameen

    2017-03-01

    We present SeqLib, a C++ API and command line tool that provides a rapid and user-friendly interface to BAM/SAM/CRAM files, global sequence alignment operations and sequence assembly. Four C libraries perform core operations in SeqLib: HTSlib for BAM access, BWA-MEM and BLAT for sequence alignment and Fermi for error correction and sequence assembly. Benchmarking indicates that SeqLib has lower CPU and memory requirements than leading C++ sequence analysis APIs. We demonstrate an example of how minimal SeqLib code can extract, error-correct and assemble reads from a CRAM file and then align with BWA-MEM. SeqLib also provides additional capabilities, including chromosome-aware interval queries and read plotting. Command line tools are available for performing integrated error correction, micro-assemblies and alignment. SeqLib is available on Linux and OSX for the C++98 standard and later at github.com/walaj/SeqLib. SeqLib is released under the Apache2 license. Additional capabilities for BLAT alignment are available under the BLAT license. jwala@broadinstitue.org; rameen@broadinstitute.org. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com

  3. 75 FR 48338 - Intel Corporation; Analysis of Proposed Consent Order to Aid Public Comment

    Federal Register 2010, 2011, 2012, 2013, 2014

    2010-08-10

    ... integrated into chipsets as well as discrete graphics cards. NVIDIA has been at the forefront of developing... to connect peripheral products such as discrete GPUs to the CPU. A bus is a connection point between... platform. Intel's commitment to maintain an open PCIe bus will provide discrete graphics manufacturers...

  4. Caffe con Troll: Shallow Ideas to Speed Up Deep Learning

    PubMed Central

    Hadjis, Stefan; Abuzaid, Firas; Zhang, Ce; Ré, Christopher

    2016-01-01

    We present Caffe con Troll (CcT), a fully compatible end-to-end version of the popular framework Caffe with rebuilt internals. We built CcT to examine the performance characteristics of training and deploying general-purpose convolutional neural networks across different hardware architectures. We find that, by employing standard batching optimizations for CPU training, we achieve a 4.5× throughput improvement over Caffe on popular networks like CaffeNet. Moreover, with these improvements, the end-to-end training time for CNNs is directly proportional to the FLOPS delivered by the CPU, which enables us to efficiently train hybrid CPU-GPU systems for CNNs. PMID:27314106

  5. Caffe con Troll: Shallow Ideas to Speed Up Deep Learning.

    PubMed

    Hadjis, Stefan; Abuzaid, Firas; Zhang, Ce; Ré, Christopher

    2015-01-01

    We present Caffe con Troll (CcT), a fully compatible end-to-end version of the popular framework Caffe with rebuilt internals. We built CcT to examine the performance characteristics of training and deploying general-purpose convolutional neural networks across different hardware architectures. We find that, by employing standard batching optimizations for CPU training, we achieve a 4.5× throughput improvement over Caffe on popular networks like CaffeNet. Moreover, with these improvements, the end-to-end training time for CNNs is directly proportional to the FLOPS delivered by the CPU, which enables us to efficiently train hybrid CPU-GPU systems for CNNs.

  6. Memory interface simulator: A computer design aid

    NASA Technical Reports Server (NTRS)

    Taylor, D. S.; Williams, T.; Weatherbee, J. E.

    1972-01-01

    Results are presented of a study conducted with a digital simulation model being used in the design of the Automatically Reconfigurable Modular Multiprocessor System (ARMMS), a candidate computer system for future manned and unmanned space missions. The model simulates the activity involved as instructions are fetched from random access memory for execution in one of the system central processing units. A series of model runs measured instruction execution time under various assumptions pertaining to the CPUs and the interface between the CPUs and RAM. Design tradeoffs are presented in the following areas: bus widths, CPU microprogram read-only memory cycle time, multiple instruction fetch, and instruction mix.

  7. Reduced order model based on principal component analysis for process simulation and optimization

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Lang, Y.; Malacina, A.; Biegler, L.

    2009-01-01

    It is well known that distributed parameter computational fluid dynamics (CFD) models provide more accurate results than conventional, lumped-parameter unit operation models used in process simulation. Consequently, the use of CFD models in process/equipment co-simulation offers the potential to optimize overall plant performance with respect to complex thermal and fluid flow phenomena. Because solving CFD models is time-consuming compared to the overall process simulation, we consider the development of fast reduced order models (ROMs) based on CFD results to closely approximate the high-fidelity equipment models in the co-simulation. By considering process equipment items with complicated geometries and detailed thermodynamic property models, this study proposes a strategy to develop ROMs based on principal component analysis (PCA). Taking advantage of commercial process simulation and CFD software (for example, Aspen Plus and FLUENT), we are able to develop systematic CFD-based ROMs for equipment models in an efficient manner. In particular, we show that the validity of the ROM is more robust within a well-sampled input domain and that the CPU time is significantly reduced. Typically, it takes at most several CPU seconds to evaluate the ROM, compared to several CPU hours or more to solve the CFD model. Two case studies, involving two power plant equipment examples, are described and demonstrate the benefits of using the proposed ROM methodology for process simulation and optimization.
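
    The sketch below is a toy illustration of a PCA-based reduced order model of this general kind, not the authors' workflow: CFD snapshot outputs are compressed with PCA, a linear map from process inputs to the retained PCA scores is fitted, and new operating points are then evaluated cheaply. It assumes the Eigen headers are available; all data values are made up.

        // Minimal sketch: PCA compression of snapshots plus a linear input-to-score map.
        #include <Eigen/Dense>
        #include <cstdio>

        int main() {
            using Eigen::MatrixXd; using Eigen::VectorXd; using Eigen::RowVectorXd;

            // Toy training data: 6 CFD runs, 2 inputs each, 4 output field samples each.
            MatrixXd inputs(6, 2), outputs(6, 4);
            inputs  << 1,1,  1,2,  2,1,  2,2,  3,1,  3,2;
            outputs << 1.0,2.0,3.0,4.0,  1.1,2.2,3.1,4.2,  2.0,2.9,4.1,5.0,
                       2.1,3.1,4.2,5.2,  3.0,3.9,5.1,6.0,  3.1,4.1,5.2,6.2;

            // PCA of the outputs: centre, then SVD; keep the leading k components.
            RowVectorXd mean = outputs.colwise().mean();
            MatrixXd centred = outputs.rowwise() - mean;
            Eigen::JacobiSVD<MatrixXd> svd(centred, Eigen::ComputeThinU | Eigen::ComputeThinV);
            const int k = 2;
            MatrixXd basis  = svd.matrixV().leftCols(k);      // principal directions
            MatrixXd scores = centred * basis;                // reduced coordinates

            // Fit a linear (affine) map inputs -> scores by least squares; this is the ROM.
            MatrixXd A(inputs.rows(), 3);
            A << inputs, VectorXd::Ones(inputs.rows());
            MatrixXd W = A.colPivHouseholderQr().solve(scores);

            // Evaluate the ROM at a new operating point.
            RowVectorXd x_new(3); x_new << 2.5, 1.5, 1.0;
            RowVectorXd y_new = (x_new * W) * basis.transpose() + mean;
            std::printf("ROM prediction: %.3f %.3f %.3f %.3f\n",
                        y_new(0), y_new(1), y_new(2), y_new(3));
            return 0;
        }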

  8. A Study of Quality of Service Communication for High-Speed Packet-Switching Computer Sub-Networks

    NASA Technical Reports Server (NTRS)

    Cui, Zhenqian

    1999-01-01

    In this thesis, we analyze various factors that affect quality of service (QoS) communication in high-speed, packet-switching sub-networks. We hypothesize that sub-network-wide bandwidth reservation and guaranteed CPU processing power at endpoint systems for handling data traffic are indispensable to achieving hard end-to-end quality of service. Different bandwidth reservation strategies, traffic characterization schemes, and scheduling algorithms affect the network resources and CPU usage as well as the extent to which QoS can be achieved. In order to analyze those factors, we design and implement a communication layer. Our experimental analysis supports our research hypothesis. The Resource ReSerVation Protocol (RSVP) is designed to realize resource reservation. Our analysis of RSVP shows that using RSVP alone is insufficient to provide hard end-to-end quality of service in a high-speed sub-network. Analysis of the IEEE 802.1p protocol also supports the research hypothesis.

  9. Exploring compression techniques for ROOT IO

    NASA Astrophysics Data System (ADS)

    Zhang, Z.; Bockelman, B.

    2017-10-01

    ROOT provides a flexible format used throughout the HEP community. The number of use cases - from an archival data format to end-stage analysis - has required a number of tradeoffs to be exposed to the user. For example, a high “compression level” in the traditional DEFLATE algorithm will result in a smaller file (saving disk space) at the cost of slower decompression (costing CPU time when read). At the scale of the LHC experiments, poor design choices can result in terabytes of wasted space or wasted CPU time. We explore and attempt to quantify some of these tradeoffs. Specifically, we explore: the use of alternative compression algorithms to optimize for read performance; an alternative method of compressing individual events to allow efficient random access; and a new approach to whole-file compression. Quantitative results are given, as well as guidance on how to make compression decisions for different use cases.
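
    The sketch below measures the size/CPU trade-off discussed above on a toy buffer by compressing it at several DEFLATE levels. It uses plain zlib as a stand-in (link with -lz); ROOT's own baskets and alternative algorithms such as LZ4 or ZSTD are not modelled.

        // Minimal sketch: compressed size and CPU time at different DEFLATE levels.
        #include <cstdio>
        #include <ctime>
        #include <string>
        #include <vector>
        #include <zlib.h>

        int main() {
            // Mildly repetitive payload so that compression has something to do.
            std::string data;
            for (int i = 0; i < 20000; ++i) data += "event:" + std::to_string(i % 97) + ";";

            for (int level = 1; level <= 9; level += 2) {
                uLongf out_size = compressBound(data.size());
                std::vector<Bytef> out(out_size);

                std::clock_t t0 = std::clock();
                int rc = compress2(out.data(), &out_size,
                                   reinterpret_cast<const Bytef*>(data.data()),
                                   data.size(), level);
                double cpu_s = double(std::clock() - t0) / CLOCKS_PER_SEC;

                if (rc != Z_OK) { std::fprintf(stderr, "compress2 failed: %d\n", rc); return 1; }
                std::printf("level %d: %lu -> %lu bytes, %.4f s CPU\n",
                            level, (unsigned long)data.size(), (unsigned long)out_size, cpu_s);
            }
            return 0;
        }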

  10. Simulating electron wave dynamics in graphene superlattices exploiting parallel processing advantages

    NASA Astrophysics Data System (ADS)

    Rodrigues, Manuel J.; Fernandes, David E.; Silveirinha, Mário G.; Falcão, Gabriel

    2018-01-01

    This work introduces a parallel computing framework to characterize the propagation of electron waves in graphene-based nanostructures. The electron wave dynamics is modeled using both "microscopic" and effective medium formalisms and the numerical solution of the two-dimensional massless Dirac equation is determined using a Finite-Difference Time-Domain scheme. The propagation of electron waves in graphene superlattices with localized scattering centers is studied, and the role of the symmetry of the microscopic potential in the electron velocity is discussed. The computational methodologies target the parallel capabilities of heterogeneous multi-core CPU and multi-GPU environments and are built with the OpenCL parallel programming framework which provides a portable, vendor agnostic and high throughput-performance solution. The proposed heterogeneous multi-GPU implementation achieves speedup ratios up to 75x when compared to multi-thread and multi-core CPU execution, reducing simulation times from several hours to a couple of minutes.

  11. Libsharp - spherical harmonic transforms revisited

    NASA Astrophysics Data System (ADS)

    Reinecke, M.; Seljebotn, D. S.

    2013-06-01

    We present libsharp, a code library for spherical harmonic transforms (SHTs), which evolved from the libpsht library and addresses several of its shortcomings, such as adding MPI support for distributed memory systems and SHTs of fields with arbitrary spin, but also supporting new developments in CPU instruction sets like the Advanced Vector Extensions (AVX) or fused multiply-accumulate (FMA) instructions. The library is implemented in portable C99 and provides an interface that can be easily accessed from other programming languages such as C++, Fortran, Python, etc. Generally, libsharp's performance is at least on par with that of its predecessor; however, significant improvements were made to the algorithms for scalar SHTs, which are roughly twice as fast when using the same CPU capabilities. The library is available at http://sourceforge.net/projects/libsharp/ under the terms of the GNU General Public License.

  12. High-speed zero-copy data transfer for DAQ applications

    NASA Astrophysics Data System (ADS)

    Pisani, Flavio; Cámpora Pérez, Daniel Hugo; Neufeld, Niko

    2015-05-01

    The LHCb Data Acquisition (DAQ) will be upgraded in 2020 to a trigger-free readout. In order to achieve this goal we will need to connect around 500 nodes with a total network capacity of 32 Tb/s. To get such a high network capacity we are testing zero-copy technology in order to maximize the theoretical link throughput without adding excessive CPU and memory bandwidth overhead, leaving resources free for data processing and resulting in less power, space and money used for the same result. We develop a modular test application which can be used with different transport layers. For the zero-copy implementation we choose the OFED IBVerbs API because it can provide low-level access and high throughput. We present throughput and CPU usage measurements of 40 GbE solutions using Remote Direct Memory Access (RDMA), for several network configurations, to test the scalability of the system.

  13. A GPU accelerated and error-controlled solver for the unbounded Poisson equation in three dimensions

    NASA Astrophysics Data System (ADS)

    Exl, Lukas

    2017-12-01

    An efficient solver for the three dimensional free-space Poisson equation is presented. The underlying numerical method is based on finite Fourier series approximation. While the error of all involved approximations can be fully controlled, the overall computation error is driven by the convergence of the finite Fourier series of the density. For smooth and fast-decaying densities the proposed method will be spectrally accurate. The method scales with O(N log N) operations, where N is the total number of discretization points in the Cartesian grid. The majority of the computational costs come from fast Fourier transforms (FFT), which makes it ideal for GPU computation. Several numerical computations on CPU and GPU validate the method and show efficiency and convergence behavior. Tests are performed using the Vienna Scientific Cluster 3 (VSC3). A free MATLAB implementation for CPU and GPU is provided to the interested community.

  14. GPU accelerated implementation of NCI calculations using promolecular density.

    PubMed

    Rubez, Gaëtan; Etancelin, Jean-Matthieu; Vigouroux, Xavier; Krajecki, Michael; Boisson, Jean-Charles; Hénon, Eric

    2017-05-30

    The NCI approach is a modern tool to reveal chemical noncovalent interactions. It is particularly attractive for describing ligand-protein binding. A custom implementation of NCI using promolecular density is presented. It is designed to leverage the computational power of NVIDIA graphics processing unit (GPU) accelerators through the CUDA programming model. The performance of three code versions is examined on a test set of 144 systems. NCI calculations are particularly well suited to the GPU architecture, which drastically reduces the computational time. On a single compute node, the dual-GPU version leads to a 39-fold improvement for the biggest instance compared to the optimal OpenMP parallel run (C code, icc compiler) with 16 CPU cores. Energy consumption measurements carried out on both CPU and GPU NCI tests show that the GPU approach provides substantial energy savings. © 2017 Wiley Periodicals, Inc.
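
    The sketch below evaluates the quantity at the heart of an NCI analysis: a promolecular density built from simple single-exponential atomic densities (a crude stand-in for the tabulated atomic densities actually used) and the reduced density gradient s = |grad rho| / (2 (3 pi^2)^(1/3) rho^(4/3)) on a few grid points, with OpenMP standing in for the GPU. It is an illustration only, not the authors' CUDA implementation; atoms, coefficients, and the grid are hypothetical.

        // Minimal sketch: promolecular density and reduced density gradient on a grid.
        #include <cmath>
        #include <cstdio>
        #include <vector>

        struct Atom { double x, y, z, c, lambda; };   // crude rho_A(r) = c * exp(-r/lambda)

        static void density_and_gradient(const std::vector<Atom>& atoms,
                                         double x, double y, double z,
                                         double* rho, double* grad_norm) {
            double r0 = 0.0, gx = 0.0, gy = 0.0, gz = 0.0;
            for (const Atom& a : atoms) {
                double dx = x - a.x, dy = y - a.y, dz = z - a.z;
                double r = std::sqrt(dx * dx + dy * dy + dz * dz) + 1e-12;
                double rho_a = a.c * std::exp(-r / a.lambda);
                r0 += rho_a;
                double drho_dr = -rho_a / a.lambda;   // radial derivative
                gx += drho_dr * dx / r;  gy += drho_dr * dy / r;  gz += drho_dr * dz / r;
            }
            *rho = r0;
            *grad_norm = std::sqrt(gx * gx + gy * gy + gz * gz);
        }

        int main() {
            const std::vector<Atom> atoms = {{0, 0, 0, 1.0, 0.5}, {1.4, 0, 0, 1.0, 0.5}};
            const double pi = 3.14159265358979323846;
            const double prefac = 2.0 * std::pow(3.0 * pi * pi, 1.0 / 3.0);
            const int n = 5;                           // tiny grid along the bond axis
            #pragma omp parallel for
            for (int i = 0; i < n; ++i) {
                double x = 1.4 * i / (n - 1);
                double rho, gnorm;
                density_and_gradient(atoms, x, 0.0, 0.0, &rho, &gnorm);
                double s = gnorm / (prefac * std::pow(rho, 4.0 / 3.0));
                std::printf("x = %.2f  rho = %.4f  s = %.4f\n", x, rho, s);
            }
            return 0;
        }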

  15. The Effect of NUMA Tunings on CPU Performance

    NASA Astrophysics Data System (ADS)

    Hollowell, Christopher; Caramarcu, Costin; Strecker-Kellogg, William; Wong, Antonio; Zaytsev, Alexandr

    2015-12-01

    Non-Uniform Memory Access (NUMA) is a memory architecture for symmetric multiprocessing (SMP) systems where each processor is directly connected to separate memory. Indirect access to another CPU's (remote) RAM is still possible, but such requests are slower as they must also pass through that memory's controlling CPU. In concert with a NUMA-aware operating system, the NUMA hardware architecture can help eliminate the memory performance reductions generally seen in SMP systems when multiple processors simultaneously attempt to access memory. The x86 CPU architecture has supported NUMA for a number of years. Modern operating systems such as Linux support NUMA-aware scheduling, where the OS attempts to schedule a process to the CPU directly attached to the majority of its RAM. In Linux, it is possible to further manually tune the NUMA subsystem using the numactl utility. With the release of Red Hat Enterprise Linux (RHEL) 6.3, the numad daemon became available in this distribution. This daemon monitors a system's NUMA topology and utilization, and automatically makes adjustments to optimize locality. As the number of cores in x86 servers continues to grow, efficient NUMA mappings of processes to CPUs/memory will become increasingly important. This paper gives a brief overview of NUMA, and discusses the effects of manual tunings and numad on the performance of the HEPSPEC06 benchmark and ATLAS software.

  16. Static and Dynamic Frequency Scaling on Multicore CPUs

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Bao, Wenlei; Hong, Changwan; Chunduri, Sudheer

    2016-12-28

    Dynamic voltage and frequency scaling (DVFS) adapts CPU power consumption by modifying a processor's operating frequency (and the associated voltage). Typical approaches employing DVFS involve default strategies such as running at the lowest or the highest frequency, or observing the CPU's runtime behavior and dynamically adapting the voltage/frequency configuration based on CPU usage. In this paper, we argue that many previous approaches suffer from inherent limitations, such as not accounting for the processor-specific impact of frequency changes on energy for different workload types. We first propose a lightweight runtime-based approach to automatically adapt the frequency based on the CPU workload, that is agnostic of the processor characteristics. We then show that further improvements can be achieved for affine kernels in the application, using a compile-time characterization instead of run-time monitoring to select the frequency and number of CPU cores to use. Our framework relies on a one-time energy characterization of CPU-specific DVFS profiles followed by a compile-time categorization of loop-based code segments in the application. These are combined to determine a priori the frequency and the number of cores to use to execute the application so as to optimize energy or energy-delay product, outperforming the runtime approach. Extensive evaluation on 60 benchmarks and five multi-core CPUs shows that our approach systematically outperforms the powersave Linux governor, while improving overall performance.

  17. Parallel and serial computing tools for testing single-locus and epistatic SNP effects of quantitative traits in genome-wide association studies

    PubMed Central

    Ma, Li; Runesha, H Birali; Dvorkin, Daniel; Garbe, John R; Da, Yang

    2008-01-01

    Background: Genome-wide association studies (GWAS) using single nucleotide polymorphism (SNP) markers provide opportunities to detect epistatic SNPs associated with quantitative traits and to detect the exact mode of an epistasis effect. Computational difficulty is the main bottleneck for epistasis testing in large scale GWAS.
    Results: The EPISNPmpi and EPISNP computer programs were developed for testing single-locus and epistatic SNP effects on quantitative traits in GWAS, including tests of three single-locus effects for each SNP (SNP genotypic effect, additive and dominance effects) and five epistasis effects for each pair of SNPs (two-locus interaction, additive × additive, additive × dominance, dominance × additive, and dominance × dominance) based on the extended Kempthorne model. EPISNPmpi is the parallel computing program for epistasis testing in large scale GWAS and achieved excellent scalability for large scale analysis and portability for various parallel computing platforms. EPISNP is the serial computing program based on the EPISNPmpi code for epistasis testing in small scale GWAS using commonly available operating systems and computer hardware. Three serial computing utility programs were developed for graphical viewing of test results and epistasis networks, and for estimating CPU time and disk space requirements.
    Conclusion: The EPISNPmpi parallel computing program provides an effective computing tool for epistasis testing in large scale GWAS, and the epiSNP serial computing programs are convenient tools for epistasis analysis in small scale GWAS using commonly available computer hardware. PMID:18644146

  18. Brain and behaviour phenotyping of a mouse model of neurofibromatosis type-1: an MRI/DTI study on social cognition.

    PubMed

    Petrella, L I; Cai, Y; Sereno, J V; Gonçalves, S I; Silva, A J; Castelo-Branco, M

    2016-09-01

    Neurofibromatosis type-1 (NF1) is a common neurogenetic disorder and an important cause of intellectual disability. Brain-behaviour associations can be examined in vivo using morphometric magnetic resonance imaging (MRI) and diffusion tensor imaging (DTI) to study brain structure. Here, we studied structural and behavioural phenotypes in heterozygous Nf1 mice (Nf1(+/-)) using T2-weighted MRI and DTI, with a focus on social recognition deficits. We found that Nf1(+/-) mice have larger volumes than wild-type (WT) mice in regions of interest involved in social cognition, the prefrontal cortex (PFC) and the caudate-putamen (CPu). Higher diffusivity was found across a distributed network of cortical and subcortical brain regions, within and beyond these regions. Significant differences were observed for the social recognition test. Most importantly, significant structure-function correlations were identified concerning social recognition performance and PFC volumes in Nf1(+/-) mice. Analyses of spatial learning confirmed the previously known deficits in the mutant mice, as reflected by platform crossings, training quadrant time, and average proximity measures. Moreover, linear discriminant analysis of spatial performance identified two separate sub-groups of Nf1(+/-) mice. A significant correlation between quadrant time and CPu volumes was found specifically for the sub-group of Nf1(+/-) mice with lower spatial learning performance, providing additional evidence for reorganization of this region. We found strong evidence that social and spatial cognition deficits can be associated with PFC/CPu structural changes and reorganization in NF1. © 2016 John Wiley & Sons Ltd and International Behavioural and Neural Genetics Society.

  19. Reducing power usage on demand

    NASA Astrophysics Data System (ADS)

    Corbett, G.; Dewhurst, A.

    2016-10-01

    The Science and Technology Facilities Council (STFC) datacentre provides large-scale High Performance Computing facilities for the scientific community. It currently consumes approximately 1.5 MW, and this has risen by 25% in the past two years. STFC has been investigating leveraging preemption in the Tier 1 batch farm to save power. HEP experiments are increasingly using jobs that can be killed to take advantage of opportunistic CPU resources or novel cost models such as Amazon's spot pricing. Additionally, schemes from energy providers are available that offer financial incentives to reduce power consumption at peak times. Under normal operating conditions, 3% of the batch farm capacity is wasted due to draining machines. By using preempt-able jobs, nodes can be rapidly made available to run multicore jobs without this wasted resource. The use of preempt-able jobs has been extended so that at peak times machines can be hibernated quickly to save energy. This paper describes the implementation of the above and demonstrates that STFC could in future take advantage of such energy saving schemes.

  20. Structurally divergent lysophosphatidic acid acyltransferases with high selectivity for saturated medium chain fatty acids from Cuphea seeds.

    PubMed

    Kim, Hae Jin; Silva, Jillian E; Iskandarov, Umidjon; Andersson, Mariette; Cahoon, Rebecca E; Mockaitis, Keithanne; Cahoon, Edgar B

    2015-12-01

    Lysophosphatidic acid acyltransferase (LPAT) catalyzes acylation of the sn-2 position on lysophosphatidic acid by an acyl CoA substrate to produce the phosphatidic acid precursor of polar glycerolipids and triacylglycerols (TAGs). In the case of TAGs, this reaction is typically catalyzed by an LPAT2 from microsomal LPAT class A that has high specificity for C18 fatty acids containing Δ9 unsaturation. Because of this specificity, the occurrence of saturated fatty acids in the TAG sn-2 position is infrequent in seed oils. To identify LPATs with variant substrate specificities, deep transcriptomic mining was performed on seeds of two Cuphea species producing TAGs that are highly enriched in saturated C8 and C10 fatty acids. From these analyses, cDNAs for seven previously unreported LPATs were identified, including cDNAs from Cuphea viscosissima (CvLPAT2) and Cuphea avigera var. pulcherrima (CpuLPAT2a) encoding microsomal, seed-specific class A LPAT2s and a cDNA from C. avigera var. pulcherrima (CpuLPATB) encoding a microsomal, seed-specific LPAT from the bacterial-type class B. The activities of these enzymes were characterized in Camelina sativa by seed-specific co-expression with cDNAs for various Cuphea FatB acyl-acyl carrier protein thioesterases (FatB) that produce a variety of saturated medium-chain fatty acids. CvLPAT2 and CpuLPAT2a expression resulted in accumulation of 10:0 fatty acids in the Camelina sativa TAG sn-2 position, indicating a 10:0 CoA specificity that has not been previously described for plant LPATs. CpuLPATB expression generated TAGs with 14:0 at the sn-2 position, but not 10:0. Identification of these LPATs provides tools for understanding the structural basis of LPAT substrate specificity and for generating altered oil functionalities. © 2015 The Authors The Plant Journal © 2015 John Wiley & Sons Ltd.

  1. hybridMANTIS: a CPU-GPU Monte Carlo method for modeling indirect x-ray detectors with columnar scintillators

    NASA Astrophysics Data System (ADS)

    Sharma, Diksha; Badal, Andreu; Badano, Aldo

    2012-04-01

    The computational modeling of medical imaging systems often requires obtaining a large number of simulated images with low statistical uncertainty, which translates into prohibitive computing times. We describe a novel hybrid approach for Monte Carlo simulations that maximizes utilization of CPUs and GPUs in modern workstations. We apply the method to the modeling of indirect x-ray detectors using a new and improved version of the code MANTIS, an open source software tool used for the Monte Carlo simulation of indirect x-ray imagers. We first describe a GPU implementation of the physics and geometry models in fastDETECT2 (the optical transport model) and a serial CPU version of the same code. We discuss its new features, like on-the-fly column geometry and columnar crosstalk, in relation to the MANTIS code, and point out areas where our model provides more flexibility for the modeling of realistic columnar structures in large area detectors. Second, we modify PENELOPE (the open source software package that handles the x-ray and electron transport in MANTIS) to allow direct output of the location and energy deposited during x-ray and electron interactions occurring within the scintillator. This information is then handled by optical transport routines in fastDETECT2. A load balancer dynamically allocates optical transport showers to the GPU and CPU computing cores. Our hybridMANTIS approach achieves a significant speed-up factor of 627 when compared to MANTIS and of 35 when compared to the same code running only on a CPU instead of a GPU. Using hybridMANTIS, we successfully hide hours of optical transport time by running it in parallel with the x-ray and electron transport, thus shifting the computational bottleneck from optical to x-ray transport. The new code requires much less memory than MANTIS and, as a result, allows us to efficiently simulate large area detectors.

  2. Routine Microsecond Molecular Dynamics Simulations with AMBER on GPUs. 1. Generalized Born

    PubMed Central

    2012-01-01

    We present an implementation of generalized Born implicit solvent all-atom classical molecular dynamics (MD) within the AMBER program package that runs entirely on CUDA enabled NVIDIA graphics processing units (GPUs). We discuss the algorithms that are used to exploit the processing power of the GPUs and show the performance that can be achieved in comparison to simulations on conventional CPU clusters. The implementation supports three different precision models in which the contributions to the forces are calculated in single precision floating point arithmetic but accumulated in double precision (SPDP), or everything is computed in single precision (SPSP) or double precision (DPDP). In addition to performance, we have focused on understanding the implications of the different precision models on the outcome of implicit solvent MD simulations. We show results for a range of tests including the accuracy of single point force evaluations and energy conservation as well as structural properties pertaining to protein dynamics. The numerical noise due to rounding errors within the SPSP precision model is sufficiently large to lead to an accumulation of errors which can result in unphysical trajectories for long time scale simulations. We recommend the use of the mixed-precision SPDP model since the numerical results obtained are comparable with those of the full double precision DPDP model and the reference double precision CPU implementation but at significantly reduced computational cost. Our implementation provides performance for GB simulations on a single desktop that is on par with, and in some cases exceeds, that of traditional supercomputers. PMID:22582031

  3. Using SimCPU in Cooperative Learning Laboratories.

    ERIC Educational Resources Information Center

    Lin, Janet Mei-Chuen; Wu, Cheng-Chih; Liu, Hsi-Jen

    1999-01-01

    Reports research findings of an experimental design in which cooperative-learning strategies were applied to closed-lab instruction of computing concepts. SimCPU, a software package specially designed for closed-lab usage was used by 171 high school students of four classes. Results showed that collaboration enhanced learning and that blending…

  4. A Cache Design to Exploit Structural Locality

    DTIC Science & Technology

    1991-12-01

    memory and secondary storage. Main memory was used to store the instructions and data of an executing pro- gram, while secondary storage held programs ...efficiency of the CPU and faster turnaround of executing programs . In addition to the well known spatial and temporal aspects of locality, Hobart has...identified a third aspect, which he has called structural locality (9). This type of locality is defined as the tendency of an executing program to

  5. The Research and Test of Fast Radio Burst Real-time Search Algorithm Based on GPU Acceleration

    NASA Astrophysics Data System (ADS)

    Wang, J.; Chen, M. Z.; Pei, X.; Wang, Z. Q.

    2017-03-01

    In order to satisfy the research needs of the Nanshan 25 m radio telescope of Xinjiang Astronomical Observatory (XAO) and to study key technology for the planned QiTai radio Telescope (QTT), the receiver group of XAO studied a GPU (Graphics Processing Unit) based real-time FRB search algorithm, developed from the original CPU (Central Processing Unit) based FRB search algorithm, and built an FRB real-time search system. The comparison of the GPU system and the CPU system shows that, while preserving the accuracy of the search, the GPU-accelerated algorithm is 35-45 times faster than the CPU algorithm.
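
    The sketch below illustrates the core operation in such a search, incoherent dedispersion: for each trial dispersion measure (DM), frequency channels are shifted by the cold-plasma delay and summed. OpenMP stands in for the GPU parallelisation over DM trials; the band, sampling time, and DM grid are hypothetical, and detection logic (matched filtering, thresholding) is not shown.

        // Minimal sketch: brute-force incoherent dedispersion over a DM trial grid.
        #include <cmath>
        #include <cstdio>
        #include <vector>

        int main() {
            const int n_chan = 64, n_samp = 4096, n_dm = 128;
            const double f_lo = 1.25, f_hi = 1.50;                 // GHz (hypothetical band)
            const double t_samp = 1e-4;                            // s per sample
            const double k_dm = 4.148808e-3;                       // s GHz^2 cm^3 pc^-1

            // Toy filterbank: all zeros; a dispersed impulse plus noise would go here.
            std::vector<float> data(n_chan * n_samp, 0.0f);

            std::vector<float> dm_time(n_dm * n_samp, 0.0f);
            #pragma omp parallel for
            for (int d = 0; d < n_dm; ++d) {
                double dm = d * 5.0;                               // pc cm^-3 trial grid
                for (int c = 0; c < n_chan; ++c) {
                    double f = f_lo + (f_hi - f_lo) * c / (n_chan - 1);
                    double delay = k_dm * dm * (1.0 / (f * f) - 1.0 / (f_hi * f_hi));
                    int shift = (int)std::lround(delay / t_samp);
                    for (int t = 0; t + shift < n_samp; ++t)
                        dm_time[d * n_samp + t] += data[c * n_samp + t + shift];
                }
            }
            std::printf("dedispersed %d DM trials x %d samples\n", n_dm, n_samp);
            return 0;
        }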

  6. Supercomputer Provides Molecular Insight into Cellulose (Fact Sheet)

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Not Available

    2011-02-01

    Groundbreaking research at the National Renewable Energy Laboratory (NREL) has used supercomputing simulations to calculate the work that enzymes must do to deconstruct cellulose, which is a fundamental step in biomass conversion technologies for biofuels production. NREL used the new high-performance supercomputer Red Mesa to conduct several million central processing unit (CPU) hours of simulation.

  7. CPU SIM: A Computer Simulator for Use in an Introductory Computer Organization-Architecture Class.

    ERIC Educational Resources Information Center

    Skrein, Dale

    1994-01-01

    CPU SIM, an interactive low-level computer simulation package that runs on the Macintosh computer, is described. The program is designed for instructional use in the first or second year of undergraduate computer science, to teach various features of typical computer organization through hands-on exercises. (MSE)

  8. Combustion Power Unit--400: CPU-400.

    ERIC Educational Resources Information Center

    Combustion Power Co., Palo Alto, CA.

    Aerospace technology may have led to a unique basic unit for processing solid wastes and controlling pollution. The Combustion Power Unit--400 (CPU-400) is designed as a turboelectric generator plant that will use municipal solid wastes as fuel. The baseline configuration is a modular unit that is designed to utilize 400 tons of refuse per day…

  9. Particle-in-Cell laser-plasma simulation on Xeon Phi coprocessors

    NASA Astrophysics Data System (ADS)

    Surmin, I. A.; Bastrakov, S. I.; Efimenko, E. S.; Gonoskov, A. A.; Korzhimanov, A. V.; Meyerov, I. B.

    2016-05-01

    This paper concerns the development of a high-performance implementation of the Particle-in-Cell method for plasma simulation on Intel Xeon Phi coprocessors. We discuss the suitability of the method for Xeon Phi architecture and present our experience in the porting and optimization of the existing parallel Particle-in-Cell code PICADOR. Direct porting without code modification gives performance on Xeon Phi close to that of an 8-core CPU on a benchmark problem with 50 particles per cell. We demonstrate step-by-step optimization techniques, such as improving data locality, enhancing parallelization efficiency and vectorization leading to an overall 4.2 × speedup on CPU and 7.5 × on Xeon Phi compared to the baseline version. The optimized version achieves 16.9 ns per particle update on an Intel Xeon E5-2660 CPU and 9.3 ns per particle update on an Intel Xeon Phi 5110P. For a real problem of laser ion acceleration in targets with surface grating, where a large number of macroparticles per cell is required, the speedup of Xeon Phi compared to CPU is 1.6 ×.
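
    The optimizations mentioned (data locality, vectorization) can be hinted at with a hedged sketch: a structure-of-arrays particle layout and an OpenMP SIMD loop for the particle push. This is not the PICADOR code; the names and the simplified field model are assumptions.

    ```cpp
    #include <cstddef>
    #include <vector>

    // Structure-of-arrays particle storage: contiguous per-component arrays help
    // both cache locality and compiler vectorization, the two optimizations the
    // abstract highlights. (Illustrative only; not the PICADOR implementation.)
    struct Particles {
        std::vector<double> x, v;   // 1D positions and velocities
    };

    void push(Particles& p, const std::vector<double>& E, double qm, double dt) {
        const std::size_t n = p.x.size();
        #pragma omp simd
        for (std::size_t i = 0; i < n; ++i) {
            p.v[i] += qm * E[i] * dt;   // velocity update from the local field
            p.x[i] += p.v[i] * dt;      // position update
        }
    }
    ```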

  10. Performance and accuracy of criticality calculations performed using WARP – A framework for continuous energy Monte Carlo neutron transport in general 3D geometries on GPUs

    DOE PAGES

    Bergmann, Ryan M.; Rowland, Kelly L.; Radnović, Nikola; ...

    2017-05-01

    In this companion paper to "Algorithmic Choices in WARP - A Framework for Continuous Energy Monte Carlo Neutron Transport in General 3D Geometries on GPUs" (doi:10.1016/j.anucene.2014.10.039), the WARP Monte Carlo neutron transport framework for graphics processing units (GPUs) is benchmarked against production-level central processing unit (CPU) Monte Carlo neutron transport codes for both performance and accuracy. We compare neutron flux spectra, multiplication factors, runtimes, speedup factors, and costs of various GPU and CPU platforms running either WARP, Serpent 2.1.24, or MCNP 6.1. WARP compares well with the results of the production-level codes, and it is shown that on the newest hardware considered, GPU platforms running WARP are between 0.8 and 7.6 times as fast as CPU platforms running production codes. Also, the GPU platforms running WARP were between 15% and 50% as expensive to purchase and between 80% and 90% as expensive to operate as equivalent CPU platforms performing at an equal simulation rate.

  11. Optimum element density studies for finite-element thermal analysis of hypersonic aircraft structures

    NASA Technical Reports Server (NTRS)

    Ko, William L.; Olona, Timothy; Muramoto, Kyle M.

    1990-01-01

    Different finite element models previously set up for thermal analysis of the space shuttle orbiter structure are discussed and their shortcomings identified. Element density criteria are established for the finite element thermal modeling of space shuttle orbiter-type large, hypersonic aircraft structures. These criteria are based on rigorous studies of solution accuracy using different finite element models with different element densities set up for one cell of the orbiter wing. Also, a method for optimization of the transient thermal analysis computer central processing unit (CPU) time is discussed. Based on the newly established element density criteria, the orbiter wing midspan segment was modeled for the examination of thermal analysis solution accuracies and the extent of computation CPU time requirements. The results showed that the distributions of the structural temperatures and the thermal stresses obtained from this wing segment model were satisfactory and the computation CPU time was at an acceptable level. The studies suggest that modeling large, hypersonic aircraft structures with high-density elements for transient thermal analysis is feasible if a CPU optimization technique is used.

  12. Work stealing for GPU-accelerated parallel programs in a global address space framework: WORK STEALING ON GPU-ACCELERATED SYSTEMS

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Arafat, Humayun; Dinan, James; Krishnamoorthy, Sriram

    Task parallelism is an attractive approach to automatically load balance the computation in a parallel system and adapt to dynamism exhibited by parallel systems. Exploiting task parallelism through work stealing has been extensively studied in shared and distributed-memory contexts. In this paper, we study the design of a system that uses work stealing for dynamic load balancing of task-parallel programs executed on hybrid distributed-memory CPU-graphics processing unit (GPU) systems in a global-address space framework. We take into account the unique nature of the accelerator model employed by GPUs, the significant performance difference between GPU and CPU execution as a function of problem size, and the distinct CPU and GPU memory domains. We consider various alternatives in designing a distributed work stealing algorithm for CPU-GPU systems, while taking into account the impact of task distribution and data movement overheads. These strategies are evaluated using microbenchmarks that capture various execution configurations as well as the state-of-the-art CCSD(T) application module from the computational chemistry domain.
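
    As a hedged illustration of the core idea (not the paper's distributed CPU-GPU scheduler), a minimal shared-memory work-stealing queue looks as follows: the owner pushes and pops tasks at the back, and idle workers steal from the front.

    ```cpp
    #include <deque>
    #include <mutex>
    #include <optional>

    // Minimal shared-memory work-stealing queue: the owning worker pushes/pops
    // tasks at the back, idle workers ("thieves") steal from the front. The
    // paper's distributed, GPU-aware scheme is far more involved; this only
    // shows the basic queue discipline.
    class WorkStealingQueue {
    public:
        void push(int task) {
            std::lock_guard<std::mutex> lk(m_);
            tasks_.push_back(task);
        }
        std::optional<int> pop() {              // called by the owning worker
            std::lock_guard<std::mutex> lk(m_);
            if (tasks_.empty()) return std::nullopt;
            int t = tasks_.back(); tasks_.pop_back();
            return t;
        }
        std::optional<int> steal() {            // called by an idle worker
            std::lock_guard<std::mutex> lk(m_);
            if (tasks_.empty()) return std::nullopt;
            int t = tasks_.front(); tasks_.pop_front();
            return t;
        }
    private:
        std::deque<int> tasks_;
        std::mutex m_;
    };
    ```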

  13. Work stealing for GPU-accelerated parallel programs in a global address space framework

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Arafat, Humayun; Dinan, James; Krishnamoorthy, Sriram

    Task parallelism is an attractive approach to automatically load balance the computation in a parallel system and adapt to dynamism exhibited by parallel systems. Exploiting task parallelism through work stealing has been extensively studied in shared and distributed-memory contexts. In this paper, we study the design of a system that uses work stealing for dynamic load balancing of task-parallel programs executed on hybrid distributed-memory CPU-graphics processing unit (GPU) systems in a global-address space framework. We take into account the unique nature of the accelerator model employed by GPUs, the significant performance difference between GPU and CPU execution as a function of problem size, and the distinct CPU and GPU memory domains. We consider various alternatives in designing a distributed work stealing algorithm for CPU-GPU systems, while taking into account the impact of task distribution and data movement overheads. These strategies are evaluated using microbenchmarks that capture various execution configurations as well as the state-of-the-art CCSD(T) application module from the computational chemistry domain.

  14. Performance and accuracy of criticality calculations performed using WARP – A framework for continuous energy Monte Carlo neutron transport in general 3D geometries on GPUs

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Bergmann, Ryan M.; Rowland, Kelly L.; Radnović, Nikola

    In this companion paper to "Algorithmic Choices in WARP - A Framework for Continuous Energy Monte Carlo Neutron Transport in General 3D Geometries on GPUs" (doi:10.1016/j.anucene.2014.10.039), the WARP Monte Carlo neutron transport framework for graphics processing units (GPUs) is benchmarked against production-level central processing unit (CPU) Monte Carlo neutron transport codes for both performance and accuracy. We compare neutron flux spectra, multiplication factors, runtimes, speedup factors, and costs of various GPU and CPU platforms running either WARP, Serpent 2.1.24, or MCNP 6.1. WARP compares well with the results of the production-level codes, and it is shown that on the newest hardware considered, GPU platforms running WARP are between 0.8 and 7.6 times as fast as CPU platforms running production codes. Also, the GPU platforms running WARP were between 15% and 50% as expensive to purchase and between 80% and 90% as expensive to operate as equivalent CPU platforms performing at an equal simulation rate.

  15. Highly Productive Application Development with ViennaCL for Accelerators

    NASA Astrophysics Data System (ADS)

    Rupp, K.; Weinbub, J.; Rudolf, F.

    2012-12-01

    The use of graphics processing units (GPUs) for the acceleration of general purpose computations has become very attractive over recent years, and accelerators based on many integrated CPU cores are about to hit the market. However, there are discussions about the benefit of GPU computing when comparing the reduction of execution times with the increased development effort [1]. To counter these concerns, our open-source linear algebra library ViennaCL [2,3] uses modern programming techniques such as generic programming in order to provide a convenient access layer for accelerator and GPU computing. Other GPU-accelerated libraries are primarily tuned for performance, but less tailored to productivity and portability: MAGMA [4] provides dense linear algebra operations via a LAPACK-comparable interface, but no dedicated matrix and vector types. Cusp [5] is closest in functionality to ViennaCL for sparse matrices, but is based on CUDA and thus restricted to devices from NVIDIA. However, no convenience layer for dense linear algebra is provided with Cusp. ViennaCL is written in C++ and uses OpenCL to access the resources of accelerators, GPUs and multi-core CPUs in a unified way. On the one hand, the library provides iterative solvers from the family of Krylov methods, including various preconditioners, for the solution of linear systems typically obtained from the discretization of partial differential equations. On the other hand, dense linear algebra operations are supported, including algorithms such as QR factorization and singular value decomposition. The user application interface of ViennaCL is compatible with uBLAS [6], which is part of the peer-reviewed Boost C++ libraries [7]. This allows existing applications based on uBLAS to be ported to ViennaCL with a minimum of effort. Conversely, the interface compatibility allows the iterative solvers from ViennaCL to be used with uBLAS types directly, thus enabling code reuse beyond CPU-GPU boundaries. Out-of-the-box support for types from the Eigen library [8] and MTL 4 [9] is provided as well, enabling a seamless transition from single-core CPU to GPU and multi-core CPU computations. Case studies from the numerical solution of PDEs are given and isolated performance benchmarks are discussed. Also, pitfalls in scientific computing with GPUs and accelerators are addressed, allowing for a first evaluation of whether these novel devices can be mapped well to certain applications. References: [1] R. Bordawekar et al., Technical Report, IBM, 2010 [2] ViennaCL library. Online: http://viennacl.sourceforge.net/ [3] K. Rupp et al., GPUScA, 2010 [4] MAGMA library. Online: http://icl.cs.utk.edu/magma/ [5] Cusp library. Online: http://code.google.com/p/cusp-library/ [6] uBLAS library. Online: http://www.boost.org/libs/numeric/ublas/ [7] Boost C++ Libraries. Online: http://www.boost.org/ [8] Eigen library. Online: http://eigen.tuxfamily.org/ [9] MTL 4 Library. Online: http://www.mtl4.org/
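
    The portability argument rests on generic programming: algorithms written against a minimal vector interface can be reused across backend types. The sketch below illustrates that design idea in plain C++ with standard containers; it deliberately does not use the actual ViennaCL API.

    ```cpp
    #include <cstddef>
    #include <vector>

    // Sketch of the generic-programming idea behind interface-compatible
    // libraries: the algorithm is written against a minimal concept (size(),
    // operator[], value_type), so any conforming vector type -- a CPU container
    // today, an accelerator-backed type tomorrow -- can be plugged in without
    // rewriting the solver code.
    template <typename Vector>
    typename Vector::value_type dot(const Vector& a, const Vector& b) {
        typename Vector::value_type s = 0;
        for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
        return s;
    }

    template <typename Vector>
    void axpy(typename Vector::value_type alpha, const Vector& x, Vector& y) {
        for (std::size_t i = 0; i < x.size(); ++i) y[i] += alpha * x[i];
    }

    int main() {
        std::vector<double> x(100, 1.0), y(100, 2.0);
        axpy(dot(x, y) / x.size(), x, y);   // works for any interface-compatible type
    }
    ```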

  16. Integration of High-Performance Computing into Cloud Computing Services

    NASA Astrophysics Data System (ADS)

    Vouk, Mladen A.; Sills, Eric; Dreher, Patrick

    High-Performance Computing (HPC) projects span a spectrum of computer hardware implementations ranging from peta-flop supercomputers, high-end tera-flop facilities running a variety of operating systems and applications, to mid-range and smaller computational clusters used for HPC application development, pilot runs and prototype staging clusters. What they all have in common is that they operate as a stand-alone system rather than a scalable and shared user re-configurable resource. The advent of cloud computing has changed the traditional HPC implementation. In this article, we will discuss a very successful production-level architecture and policy framework for supporting HPC services within a more general cloud computing infrastructure. This integrated environment, called Virtual Computing Lab (VCL), has been operating at NC State since fall 2004. Nearly 8,500,000 HPC CPU-Hrs were delivered by this environment to NC State faculty and students during 2009. In addition, we present and discuss operational data that show that integration of HPC and non-HPC (or general VCL) services in a cloud can substantially reduce the cost of delivering cloud services (down to cents per CPU hour).

  17. Adaptation of a Multi-Block Structured Solver for Effective Use in a Hybrid CPU/GPU Massively Parallel Environment

    NASA Astrophysics Data System (ADS)

    Gutzwiller, David; Gontier, Mathieu; Demeulenaere, Alain

    2014-11-01

    Multi-block structured solvers hold many advantages over their unstructured counterparts, such as a smaller memory footprint and efficient serial performance. Historically, multi-block structured solvers have not been easily adapted for use in a High Performance Computing (HPC) environment, and the recent trend towards hybrid GPU/CPU architectures has further complicated the situation. This paper will elaborate on developments and innovations applied to the NUMECA FINE/Turbo solver that have allowed near-linear scalability with real-world problems on over 250 hybrid CPU/GPU cluster nodes. Discussion will focus on the implementation of virtual partitioning and load balancing algorithms using a novel meta-block concept. This implementation is transparent to the user, allowing all pre- and post-processing steps to be performed using a simple, unpartitioned grid topology. Additional discussion will elaborate on developments that have improved parallel performance, including fully parallel I/O with the ADIOS API and the GPU porting of the computationally heavy CPUBooster convergence acceleration module.
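
    The meta-block partitioning and load-balancing details are not given in this record; as a hedged stand-in, the sketch below shows a generic longest-processing-time heuristic that assigns blocks (weighted by an assumed per-block cost) to the least-loaded rank.

    ```cpp
    #include <algorithm>
    #include <cstddef>
    #include <functional>
    #include <queue>
    #include <utility>
    #include <vector>

    // Generic load-balancing sketch (not NUMECA's meta-block algorithm): sort
    // blocks by cost, then repeatedly assign the largest remaining block to the
    // currently least-loaded rank (the classic LPT heuristic).
    std::vector<int> assign_blocks(const std::vector<long>& block_cost, int n_ranks) {
        std::vector<int> order(block_cost.size());
        for (std::size_t i = 0; i < order.size(); ++i) order[i] = static_cast<int>(i);
        std::sort(order.begin(), order.end(),
                  [&](int a, int b) { return block_cost[a] > block_cost[b]; });

        // min-heap of (accumulated load, rank id)
        using Load = std::pair<long, int>;
        std::priority_queue<Load, std::vector<Load>, std::greater<Load>> ranks;
        for (int r = 0; r < n_ranks; ++r) ranks.push({0, r});

        std::vector<int> owner(block_cost.size());
        for (int b : order) {
            auto [load, r] = ranks.top();
            ranks.pop();
            owner[b] = r;                         // give the block to the emptiest rank
            ranks.push({load + block_cost[b], r});
        }
        return owner;   // owner[i] = rank assigned to block i
    }
    ```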

  18. Modeling and Simulation of the Economics of Mining in the Bitcoin Market.

    PubMed

    Cocco, Luisanna; Marchesi, Michele

    2016-01-01

    On January 3, 2009, Satoshi Nakamoto gave rise to the "Bitcoin Blockchain", creating the first block of the chain hashing on his computer's central processing unit (CPU). Since then, the hash calculations to mine Bitcoin have been getting more and more complex, and consequently the mining hardware evolved to adapt to this increasing difficulty. Three generations of mining hardware have followed the CPU generation: the GPU, FPGA, and ASIC generations. This work presents an agent-based artificial market model of the Bitcoin mining process and of the Bitcoin transactions. The goal of this work is to model the economy of the mining process, starting from the GPU generation, the first with economic significance. The model reproduces some "stylized facts" found in real-time price series and some core aspects of the mining business. In particular, the computational experiments performed can reproduce the unit root property, the fat tail phenomenon and the volatility clustering of Bitcoin price series. In addition, under proper assumptions, they can reproduce the generation of Bitcoins, the hashing capability, the power consumption, and the mining hardware and electrical energy expenditures of the Bitcoin network.

  19. Transient dynamics capability at Sandia National Laboratories

    NASA Technical Reports Server (NTRS)

    Attaway, Steven W.; Biffle, Johnny H.; Sjaardema, G. D.; Heinstein, M. W.; Schoof, L. A.

    1993-01-01

    A brief overview of the transient dynamics capabilities at Sandia National Laboratories, with an emphasis on recent new developments and current research, is presented. In addition, the Sandia National Laboratories (SNL) Engineering Analysis Code Access System (SEACAS), which is a collection of structural and thermal codes and utilities used by analysts at SNL, is described. The SEACAS system includes pre- and post-processing codes, analysis codes, database translation codes, support libraries, Unix shell scripts for execution, and an installation system. SEACAS is used at SNL on a daily basis as a production, research, and development system for the engineering analysts and code developers. Over the past year, approximately 190 days of CPU time were used by SEACAS codes on jobs running from a few seconds up to two and one-half days of CPU time. SEACAS is running on several different systems at SNL including Cray Unicos, Hewlett Packard HP-UX, Digital Equipment Ultrix, and Sun SunOS. An overview of SEACAS, including a short description of the codes in the system, is presented. Abstracts and references for the codes are listed at the end of the report.

  20. Validation of GPU based TomoTherapy dose calculation engine.

    PubMed

    Chen, Quan; Lu, Weiguo; Chen, Yu; Chen, Mingli; Henderson, Douglas; Sterpin, Edmond

    2012-04-01

    The graphics processing unit (GPU) based TomoTherapy convolution/superposition (C/S) dose engine (GPU dose engine) achieves a dramatic performance improvement over the traditional CPU-cluster based TomoTherapy dose engine (CPU dose engine). Besides the architecture difference between the GPU and CPU, there are several algorithm changes from the CPU dose engine to the GPU dose engine. These changes made the GPU dose slightly different from the CPU-cluster dose. In order for the commercial release of the GPU dose engine, its accuracy has to be validated. Thirty-eight TomoTherapy phantom plans and 19 patient plans were calculated with both dose engines to evaluate the equivalency between the two dose engines. Gamma indices (Γ) were used for the equivalency evaluation. The GPU dose was further verified with the absolute point dose measurement with ion chamber and film measurements for phantom plans. Monte Carlo calculation was used as a reference for both dose engines in the accuracy evaluation in heterogeneous phantom and actual patients. The GPU dose engine showed excellent agreement with the current CPU dose engine. The majority of cases had over 99.99% of voxels with Γ(1%, 1 mm) < 1. The worst case observed in the phantom had 0.22% voxels violating the criterion. In patient cases, the worst percentage of voxels violating the criterion was 0.57%. For absolute point dose verification, all cases agreed with measurement to within ±3% with average error magnitude within 1%. All cases passed the acceptance criterion that more than 95% of the pixels have Γ(3%, 3 mm) < 1 in film measurement, and the average passing pixel percentage is 98.5%-99%. The GPU dose engine also showed a similar degree of accuracy in heterogeneous media to the current TomoTherapy dose engine. It is verified and validated that the ultrafast TomoTherapy GPU dose engine can safely replace the existing TomoTherapy cluster based dose engine without degradation in dose accuracy.
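
    For readers unfamiliar with the Γ(dose, distance) criterion, the sketch below computes a simplified 1D gamma index under assumed dose-difference and distance-to-agreement tolerances; clinical gamma analysis is 3D, interpolated, and considerably more elaborate.

    ```cpp
    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Simplified 1D gamma index: for each evaluated point, search the reference
    // profile and take the minimum combined dose-difference / distance-to-
    // agreement metric; a point passes if its gamma is below 1.
    double gamma_1d(const std::vector<double>& ref, const std::vector<double>& eval,
                    double spacing_mm, double dose_tol, double dta_mm) {
        double worst = 0.0;
        for (std::size_t i = 0; i < eval.size(); ++i) {
            double best = 1e30;
            for (std::size_t j = 0; j < ref.size(); ++j) {
                double dd = (eval[i] - ref[j]) / dose_tol;                         // dose term
                double dr = (double(i) - double(j)) * spacing_mm / dta_mm;         // distance term
                best = std::min(best, std::sqrt(dd * dd + dr * dr));
            }
            worst = std::max(worst, best);
        }
        return worst;   // maximum gamma over all evaluated points
    }
    ```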

  1. Round-off error in long-term orbital integrations using multistep methods

    NASA Technical Reports Server (NTRS)

    Quinlan, Gerald D.

    1994-01-01

    Techniques for reducing roundoff error are compared by testing them on high-order Störmer and symmetric multistep methods. The best technique for most applications is to write the equation in summed, function-evaluation form and to store the coefficients as rational numbers. A larger error reduction can be achieved by writing the equation in backward-difference form and performing some of the additions in extended precision, but this entails a larger central processing unit (CPU) cost.
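
    The paper's exact techniques (summed form, rational coefficients, extended-precision additions) are not reproduced here; as a related, well-known illustration of reducing round-off in long accumulations, compensated (Kahan) summation is sketched below.

    ```cpp
    #include <vector>

    // Kahan (compensated) summation: the low-order bits lost in each addition
    // are carried in a separate compensation term. This is not the paper's exact
    // scheme, but it illustrates how round-off in long accumulations can be
    // reduced without performing every operation in extended precision.
    double kahan_sum(const std::vector<double>& x) {
        double sum = 0.0, c = 0.0;          // c accumulates the lost low-order part
        for (double v : x) {
            double y = v - c;
            double t = sum + y;             // low-order digits of y may be lost here
            c = (t - sum) - y;              // ...and are recovered into c
            sum = t;
        }
        return sum;
    }
    ```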

  2. Development of small scale cluster computer for numerical analysis

    NASA Astrophysics Data System (ADS)

    Zulkifli, N. H. N.; Sapit, A.; Mohammed, A. N.

    2017-09-01

    In this study, two units of personal computer were successfully networked together to form a small scale cluster. Each of the processors involved is a multicore processor with four cores, giving the cluster eight processing cores in total. The cluster runs the Ubuntu 14.04 LINUX environment with an MPI implementation (MPICH2). Two main tests were conducted on the cluster: a communication test and a performance test. The communication test was done to make sure that the computers are able to pass the required information without any problem, and was carried out using a simple MPI Hello program written in the C language. Additionally, a performance test was done to show that the cluster's computational performance is much better than that of a single-CPU computer. In this performance test, the same code was run using a single node, 2 processors, 4 processors, and 8 processors. The results show that with additional processors the time required to solve the problem decreases, and the calculation time is roughly halved when the number of processors is doubled. To conclude, we successfully developed a small scale cluster computer using common hardware which is capable of higher computing power than a single-CPU machine, and this can be beneficial for research that requires high computing power, especially numerical analysis such as finite element analysis, computational fluid dynamics, and computational physics analysis.
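
    A communication check of the kind described can be as small as the following MPI "hello" program, written against the standard MPI C API (assuming an MPICH-style installation); the file name and run command are illustrative.

    ```cpp
    #include <cstdio>
    #include <mpi.h>

    // Minimal MPI communication check: every process reports its rank; rank 0
    // also reports the total process count.
    // Build/run (assuming MPICH or similar): mpicxx hello.cpp && mpirun -np 8 ./a.out
    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0, size = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        std::printf("Hello from rank %d\n", rank);
        if (rank == 0) std::printf("Cluster is running %d MPI processes\n", size);
        MPI_Finalize();
        return 0;
    }
    ```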

  3. Low-power secure body area network for vital sensors toward IEEE802.15.6.

    PubMed

    Kuroda, Masahiro; Qiu, Shuye; Tochikubo, Osamu

    2009-01-01

    Many healthcare/medical services have started using personal area networks, such as Bluetooth and ZigBee; these networks consist of various types of vital sensors. These works focus on generalized functions for sensor networks that expect enough battery capacity and low-power CPU/RF (Radio Frequency) modules, but pay less attention to easy-to-use privacy protection. In this paper, we propose a commercially-deployable secure body area network (S-BAN) with reduced computational burden on a real sensor that has limited RAM/ROM sizes and CPU/RF power consumption under a light-weight battery. Our proposed S-BAN provides vital data ordering among sensors that are involved in an S-BAN and also provides low-power networking with zero-administration security by automatic private key generation. We design and implement the power-efficient media access control (MAC) with resource-constraint security in sensors. Then, we evaluate the power efficiency of the S-BAN consisting of small sensors, such as an accessory-type ECG and ring-type SpO2. The evaluation of power efficiency of the S-BAN using real sensors convinces us of the deployability of the S-BAN and will also help us in providing feedback to the IEEE802.15.6 MAC, which will be the standard for BANs.

  4. HALO: a reconfigurable image enhancement and multisensor fusion system

    NASA Astrophysics Data System (ADS)

    Wu, F.; Hickman, D. L.; Parker, Steve J.

    2014-06-01

    Contemporary high definition (HD) cameras and affordable infrared (IR) imagers are set to dramatically improve the effectiveness of security, surveillance and military vision systems. However, the quality of imagery is often compromised by camera shake, or poor scene visibility due to inadequate illumination or bad atmospheric conditions. A versatile vision processing system called HALO™ is presented that can address these issues, by providing flexible image processing functionality on a low size, weight and power (SWaP) platform. Example processing functions include video distortion correction, stabilisation, multi-sensor fusion and image contrast enhancement (ICE). The system is based around an all-programmable system-on-a-chip (SoC), which combines the computational power of a field-programmable gate array (FPGA) with the flexibility of a CPU. The FPGA accelerates computationally intensive real-time processes, whereas the CPU provides management and decision making functions that can automatically reconfigure the platform based on user input and scene content. These capabilities enable a HALO™ equipped reconnaissance or surveillance system to operate in poor visibility, providing potentially critical operational advantages in visually complex and challenging usage scenarios. The choice of an FPGA based SoC is discussed, and the HALO™ architecture and its implementation are described. The capabilities of image distortion correction, stabilisation, fusion and ICE are illustrated using laboratory and trials data.

  5. Evaluating Academic Journals Using Impact Factor and Local Citation Score

    ERIC Educational Resources Information Center

    Chung, Hye-Kyung

    2007-01-01

    This study presents a method for journal collection evaluation using citation analysis. Cost-per-use (CPU) for each title is used to measure cost-effectiveness with higher CPU scores indicating cost-effective titles. Use data are based on the impact factor and locally collected citation score of each title and is compared to the cost of managing…

  6. A novel heuristic algorithm for capacitated vehicle routing problem

    NASA Astrophysics Data System (ADS)

    Kır, Sena; Yazgan, Harun Reşit; Tüncel, Emre

    2017-09-01

    The vehicle routing problem with capacity constraints is considered in this paper. It is quite difficult to achieve an optimal solution with traditional optimization methods because of the high computational complexity of large-scale problems. Consequently, new heuristic and metaheuristic approaches have been developed to solve this problem. In this paper, we construct a new heuristic algorithm based on tabu search and adaptive large neighborhood search (ALNS) with several specifically designed operators and features to solve the capacitated vehicle routing problem (CVRP). The effectiveness of the proposed algorithm is illustrated on benchmark problems. The algorithm performs better on large-scale instances and gains an advantage in terms of CPU time. In addition, we solved a real-life CVRP using the proposed algorithm and found encouraging results in comparison with the company's current situation.
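
    The proposed tabu search/ALNS algorithm itself is not reproduced here; as a hedged baseline, the sketch below shows a capacity-aware nearest-neighbour construction of the kind such metaheuristics typically use as a starting solution. All names and the depot-at-index-0 convention are assumptions.

    ```cpp
    #include <cmath>
    #include <limits>
    #include <vector>

    struct Point { double x, y; };

    // Capacity-aware nearest-neighbour construction for the CVRP: start at the
    // depot (index 0), repeatedly visit the nearest unserved customer that still
    // fits the remaining vehicle capacity, and open a new route when none fits.
    std::vector<std::vector<int>> construct_routes(const std::vector<Point>& p,
                                                   const std::vector<int>& demand,
                                                   int capacity) {
        auto dist = [&](int a, int b) {
            return std::hypot(p[a].x - p[b].x, p[a].y - p[b].y);
        };
        const int n = static_cast<int>(p.size());
        std::vector<bool> served(n, false);
        served[0] = true;                          // index 0 is the depot
        int remaining = n - 1;
        std::vector<std::vector<int>> routes;
        while (remaining > 0) {
            std::vector<int> route;
            int load = 0, current = 0;
            while (true) {
                int best = -1;
                double best_d = std::numeric_limits<double>::max();
                for (int i = 1; i < n; ++i)
                    if (!served[i] && load + demand[i] <= capacity && dist(current, i) < best_d) {
                        best = i;
                        best_d = dist(current, i);
                    }
                if (best < 0) break;               // no feasible customer: close the route
                route.push_back(best);
                served[best] = true;
                load += demand[best];
                current = best;
                --remaining;
            }
            if (route.empty()) break;              // a demand exceeds capacity; avoid looping forever
            routes.push_back(route);
        }
        return routes;
    }
    ```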

  7. GPU Based Software Correlators - Perspectives for VLBI2010

    NASA Technical Reports Server (NTRS)

    Hobiger, Thomas; Kimura, Moritaka; Takefuji, Kazuhiro; Oyama, Tomoaki; Koyama, Yasuhiro; Kondo, Tetsuro; Gotoh, Tadahiro; Amagai, Jun

    2010-01-01

    Caused by historical separation and driven by the requirements of the PC gaming industry, Graphics Processing Units (GPUs) have evolved to massive parallel processing systems which entered the area of non-graphic related applications. Although a single processing core on the GPU is much slower and provides less functionality than its counterpart on the CPU, the huge number of these small processing entities outperforms the classical processors when the application can be parallelized. Thus, in recent years various radio astronomical projects have started to make use of this technology either to realize the correlator on this platform or to establish the post-processing pipeline with GPUs. Therefore, the feasibility of GPUs as a choice for a VLBI correlator is being investigated, including pros and cons of this technology. Additionally, a GPU based software correlator will be reviewed with respect to energy consumption/GFlop/sec and cost/GFlop/sec.

  8. System architecture of a gallium arsenide one-gigahertz digital IC tester

    NASA Technical Reports Server (NTRS)

    Fouts, Douglas J.; Johnson, John M.; Butner, Steven E.; Long, Stephen I.

    1987-01-01

    The design for a 1-GHz digital integrated circuit tester for the evaluation of custom GaAs chips and subsystems is discussed. Technology-related problems affecting the design of a GaAs computer are discussed, with emphasis on the problems introduced by long printed-circuit-board interconnect. High-speed interface modules provide a link between the low-speed microprocessor and the chip under test. Memory-multiplexer and memory-shift register architectures for the storage of test vectors are described in addition to an architecture for local data storage consisting of a long chain of GaAs shift registers. The tester is constructed around a VME system card cage and backplane, and very little high-speed interconnect exists between boards. The tester has a three part self-test consisting of a CPU board confidence test, a main memory confidence test, and a high-speed interface module functional test.

  9. VAX CLuster upgrade: Report of a CPC task force

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Hanson, J.; Berry, H.; Kessler, P.

    The CSCF VAX cluster provides interactive computing for 100 users during prime time, plus a considerable amount of daytime and overnight batch processing. While this cluster represents less than 10% of the VAX computing power at BNL (6 MIPS out of 70), it has served as an important center for this larger network, supporting special hardware and software too expensive to maintain on every machine. In addition, it is the only unrestricted facility available to VAX/VMS users (other machines are typically dedicated to special projects). This committee's analysis shows that the CPUs on the CSCF cluster are currently badly oversaturated, frequently giving extremely poor interactive response. Short batch jobs (a necessary part of interactive work) typically take 3 to 4 times as long to execute as they would on an idle machine. There is also an immediate need for more scratch disk space and user permanent file space.

  10. Coronary Artery Calcium as an Independent Surrogate Marker in the Risk Assessment of Patients With Atrial Fibrillation and an Intermediate Pretest Likelihood for Coronary Artery Disease Admitted to a German Chest Pain Unit.

    PubMed

    Breuckmann, Frank; Olligs, Jan; Hinrichs, Liane; Koopmann, Matthias; Lichtenberg, Michael; Böse, Dirk; Fischer, Dieter; Eckardt, Lars; Waltenberger, Johannes; Garvey, J Lee

    2016-03-01

    About 10% of patients admitted to a chest pain unit (CPU) exhibit atrial fibrillation (AF). The aim was to determine whether calcium scores (CS) are superior to common risk scores for coronary artery disease (CAD) in patients presenting with atypical chest pain, newly diagnosed AF, and intermediate pretest probability for CAD within the CPU. In 73 subjects, CS was related to the following risk scores: Global Registry of Acute Coronary Events (GRACE) score, including a new model of a frequency-normalized approach; Thrombolysis In Myocardial Infarction score; European Society of Cardiology Systematic Coronary Risk Evaluation (SCORE); Framingham risk score; and Prospective Cardiovascular Münster Study score. Revascularization rates during index stay were assessed. Median CS was 77 (interquartile range, 1-270), with higher values in men and the left anterior descending artery. Only the modified GRACE (ρ = 0.27; P = 0.02) and the SCORE (ρ = 0.39; P < 0.005) were significantly correlated with CS, whereas the GRACE (τ = 0.21; P = 0.04) and modified GRACE (τ = 0.23; P = 0.02) scores were significantly correlated with percentile groups. Only the CS significantly discriminated between those with and without stenosis (P < 0.01). Apart from the modified GRACE score, overall correlations between risk scores and calcium burden, as well as revascularization rates during index stay, were low. By contrast, the determination of CS may be used as an additional surrogate marker in risk stratification in AF patients with intermediate pretest likelihood for CAD admitted to a CPU. © 2016 Wiley Periodicals, Inc.

  11. Real-time unmanned aircraft systems surveillance video mosaicking using GPU

    NASA Astrophysics Data System (ADS)

    Camargo, Aldo; Anderson, Kyle; Wang, Yi; Schultz, Richard R.; Fevig, Ronald A.

    2010-04-01

    Digital video mosaicking from Unmanned Aircraft Systems (UAS) is being used for many military and civilian applications, including surveillance, target recognition, border protection, forest fire monitoring, traffic control on highways, and monitoring of transmission lines, among others. Additionally, NASA is using digital video mosaicking to explore the moon and planets such as Mars. In order to compute a "good" mosaic from video captured by a UAS, the algorithm must deal with motion blur, frame-to-frame jitter associated with an imperfectly stabilized platform, perspective changes as the camera tilts in flight, as well as a number of other factors. The most suitable algorithms use SIFT (Scale-Invariant Feature Transform) to detect the features consistent between video frames. Utilizing these features, the next step is to estimate the homography between two consecutive video frames, perform warping to properly register the image data, and finally blend the video frames, resulting in a seamless video mosaic. All this processing takes a great deal of CPU resources, so it is almost impossible to compute a real-time video mosaic on a single processor. Modern graphics processing units (GPUs) offer computational performance that far exceeds current CPU technology, allowing for real-time operation. This paper presents the development of a GPU-accelerated digital video mosaicking implementation and compares it with CPU performance. Our tests are based on two sets of real video captured by a small UAS aircraft, acquired with Infrared (IR) and Electro-Optical (EO) cameras. Our results show that we can obtain a speed-up of more than 50 times using GPU technology, so real-time operation at a video capture rate of 30 frames per second is feasible.
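
    A hedged sketch of the frame-registration step described (SIFT features, homography estimation, warping) is shown below using OpenCV's CPU API; it assumes OpenCV 4.4 or newer is available and does not reproduce the paper's GPU implementation or the blending stage.

    ```cpp
    #include <opencv2/opencv.hpp>
    #include <vector>

    // Two-frame registration step of a mosaicking pipeline on the CPU: detect
    // SIFT features, match them, estimate a homography with RANSAC, and warp the
    // new frame into the mosaic coordinate frame. Blending would follow.
    cv::Mat register_frame(const cv::Mat& mosaic, const cv::Mat& frame) {
        auto sift = cv::SIFT::create();
        std::vector<cv::KeyPoint> k1, k2;
        cv::Mat d1, d2;
        sift->detectAndCompute(mosaic, cv::noArray(), k1, d1);
        sift->detectAndCompute(frame,  cv::noArray(), k2, d2);

        cv::BFMatcher matcher(cv::NORM_L2);
        std::vector<cv::DMatch> matches;
        matcher.match(d2, d1, matches);                 // frame -> mosaic matches

        std::vector<cv::Point2f> src, dst;
        for (const auto& m : matches) {
            src.push_back(k2[m.queryIdx].pt);
            dst.push_back(k1[m.trainIdx].pt);
        }
        cv::Mat H = cv::findHomography(src, dst, cv::RANSAC, 3.0);

        cv::Mat warped;
        cv::warpPerspective(frame, warped, H, mosaic.size());
        return warped;                                  // ready to blend into the mosaic
    }
    ```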

  12. Analysis of impact of general-purpose graphics processor units in supersonic flow modeling

    NASA Astrophysics Data System (ADS)

    Emelyanov, V. N.; Karpenko, A. G.; Kozelkov, A. S.; Teterina, I. V.; Volkov, K. N.; Yalozo, A. V.

    2017-06-01

    Computational methods are widely used in prediction of complex flowfields associated with off-normal situations in aerospace engineering. Modern graphics processing units (GPUs) provide architectures and new programming models that make it possible to harness their large processing power and to design computational fluid dynamics (CFD) simulations at both high performance and low cost. Possibilities of the use of GPUs for the simulation of external and internal flows on unstructured meshes are discussed. The finite volume method is applied to solve three-dimensional unsteady compressible Euler and Navier-Stokes equations on unstructured meshes with high resolution numerical schemes. CUDA technology is used for programming implementation of parallel computational algorithms. Solutions of some benchmark test cases on GPUs are reported, and the results computed are compared with experimental and computational data. Approaches to optimization of the CFD code related to the use of different types of memory are considered. Speedup of the solution on GPUs with respect to the solution on a central processing unit (CPU) is compared. Performance measurements show that the numerical schemes developed achieve a 20-50× speedup on GPU hardware compared to the CPU reference implementation. The results obtained provide a promising perspective for designing a GPU-based software framework for applications in CFD.

  13. Interactivity vs. fairness in networked linux systems

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Wu, Wenji; Crawford, Matt; /Fermilab

    In general, the Linux 2.6 scheduler can ensure fairness and provide excellent interactive performance at the same time. However, our experiments and mathematical analysis have shown that the current Linux interactivity mechanism tends to incorrectly categorize non-interactive network applications as interactive, which can lead to serious fairness or starvation issues. In the extreme, a single process can unjustifiably obtain up to 95% of the CPU! The root cause is due to the facts that: (1) network packets arrive at the receiver independently and discretely, and the 'relatively fast' non-interactive network process might frequently sleep to wait for packet arrival. Though each sleep lasts for a very short period of time, the wait-for-packet sleeps occur so frequently that they lead to interactive status for the process. (2) The current Linux interactivity mechanism provides the possibility that a non-interactive network process could receive a high CPU share, and at the same time be incorrectly categorized as 'interactive.' In this paper, we propose and test a possible solution to address the interactivity vs. fairness problems. Experiment results have proved the effectiveness of the proposed solution.

  14. Parallel Optimization of 3D Cardiac Electrophysiological Model Using GPU

    PubMed Central

    Xia, Yong; Zhang, Henggui

    2015-01-01

    Large-scale 3D virtual heart model simulations are highly demanding in computational resources. This imposes a big challenge for traditional CPU-based computing resources, which either cannot meet the full computational demand or are not easily available due to high costs. GPU as a parallel computing environment therefore provides an alternative for solving the large-scale computational problems of whole heart modeling. In this study, using a 3D sheep atrial model as a test bed, we developed a GPU-based simulation algorithm to simulate the conduction of electrical excitation waves in the 3D atria. In the GPU algorithm, a multicellular tissue model was split into two components: one is the single cell model (ordinary differential equation) and the other is the diffusion term of the monodomain model (partial differential equation). Such a decoupling enabled realization of the GPU parallel algorithm. Furthermore, several optimization strategies were proposed based on the features of the virtual heart model, which enabled a 200-fold speedup as compared to a CPU implementation. In conclusion, an optimized GPU algorithm has been developed that provides an economic and powerful platform for 3D whole heart simulations. PMID:26581957
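
    The ODE/PDE decoupling can be sketched as an operator-splitting update on a toy 1D cable: each cell's membrane ODE is advanced independently (the part that maps to one GPU thread per cell), followed by an explicit diffusion step. The kinetics and geometry here are illustrative assumptions, not the sheep atrial model.

    ```cpp
    #include <cstddef>
    #include <vector>

    // Operator-splitting update for a monodomain-style model: (1) advance each
    // cell's membrane ODE independently, then (2) apply the diffusion term with
    // explicit finite differences (no-flux ends). A cubic excitation term stands
    // in for detailed atrial cell kinetics.
    void step(std::vector<double>& v, double dt, double D, double dx) {
        const std::size_t n = v.size();
        // (1) cell-model (ODE) part: independent per cell
        for (std::size_t i = 0; i < n; ++i)
            v[i] += dt * (v[i] * (1.0 - v[i]) * (v[i] - 0.1));   // toy kinetics
        // (2) diffusion (PDE) part
        std::vector<double> vn(v);
        for (std::size_t i = 1; i + 1 < n; ++i)
            vn[i] = v[i] + dt * D * (v[i - 1] - 2.0 * v[i] + v[i + 1]) / (dx * dx);
        v.swap(vn);
    }
    ```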

  15. Parallel Optimization of 3D Cardiac Electrophysiological Model Using GPU.

    PubMed

    Xia, Yong; Wang, Kuanquan; Zhang, Henggui

    2015-01-01

    Large-scale 3D virtual heart model simulations are highly demanding in computational resources. This imposes a big challenge for traditional CPU-based computing resources, which either cannot meet the full computational demand or are not easily available due to high costs. GPU as a parallel computing environment therefore provides an alternative for solving the large-scale computational problems of whole heart modeling. In this study, using a 3D sheep atrial model as a test bed, we developed a GPU-based simulation algorithm to simulate the conduction of electrical excitation waves in the 3D atria. In the GPU algorithm, a multicellular tissue model was split into two components: one is the single cell model (ordinary differential equation) and the other is the diffusion term of the monodomain model (partial differential equation). Such a decoupling enabled realization of the GPU parallel algorithm. Furthermore, several optimization strategies were proposed based on the features of the virtual heart model, which enabled a 200-fold speedup as compared to a CPU implementation. In conclusion, an optimized GPU algorithm has been developed that provides an economic and powerful platform for 3D whole heart simulations.

  16. An efficient sparse matrix multiplication scheme for the CYBER 205 computer

    NASA Technical Reports Server (NTRS)

    Lambiotte, Jules J., Jr.

    1988-01-01

    This paper describes the development of an efficient algorithm for computing the product of a matrix and vector on a CYBER 205 vector computer. The desire to provide software which allows the user to choose between the often conflicting goals of minimizing central processing unit (CPU) time or storage requirements has led to a diagonal-based algorithm in which one of four types of storage is selected for each diagonal. The candidate storage types employed were chosen to be efficient on the CYBER 205 for diagonals which have nonzero structure which is dense, moderately sparse, very sparse and short, or very sparse and long; however, for many densities, no diagonal type is most efficient with respect to both resource requirements, and a trade-off must be made. For each diagonal, an initialization subroutine estimates the CPU time and storage required for each storage type based on results from previously performed numerical experimentation. These requirements are adjusted by weights provided by the user which reflect the relative importance the user places on the two resources. The adjusted resource requirements are then compared to select the most efficient storage and computational scheme.
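
    The per-diagonal storage-type selection is not reproduced here; the sketch below shows plain diagonal (DIA) storage and the matrix-vector product it yields, which is the long, regular loop structure that made diagonal-based schemes attractive on vector machines. The struct layout is an assumption for the sketch.

    ```cpp
    #include <cstddef>
    #include <vector>

    // Diagonal (DIA) storage sketch: the matrix is kept as a set of diagonals,
    // each identified by its offset from the main diagonal. The product y = A*x
    // then becomes one regular loop per stored diagonal.
    struct DiaMatrix {
        int n;                                   // matrix dimension
        std::vector<int> offsets;                // offset of each stored diagonal
        std::vector<std::vector<double>> diags;  // diags[k][i] multiplies x[i + offsets[k]]
    };

    std::vector<double> matvec(const DiaMatrix& A, const std::vector<double>& x) {
        std::vector<double> y(A.n, 0.0);
        for (std::size_t k = 0; k < A.offsets.size(); ++k) {
            const int off = A.offsets[k];
            for (int i = 0; i < A.n; ++i) {
                const int j = i + off;
                if (j >= 0 && j < A.n) y[i] += A.diags[k][i] * x[j];
            }
        }
        return y;
    }
    ```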

  17. Increases in cytoplasmic dopamine compromise the normal resistance of the nucleus accumbens to methamphetamine neurotoxicity

    PubMed Central

    Thomas, David M.; Francescutti-Verbeem, Dina M.; Kuhnt, Donald M.

    2016-01-01

    Methamphetamine (METH) is a neurotoxic drug of abuse that damages the dopamine (DA) neuronal system in a highly delimited manner. The brain structure most affected by METH is the caudate–putamen (CPu) where long-term DA depletion and microglial activation are most evident. Even damage within the CPu is remarkably heterogenous with lateral and ventral aspects showing the greatest deficits. The nucleus accumbens (NAc) is largely spared of the damage that accompanies binge METH intoxication. Increases in cytoplasmic DA produced by reserpine, L-DOPA or clorgyline prior to METH uncover damage in the NAc as evidenced by microglial activation and depletion of DA, tyrosine hydroxylase (TH), and the DA transporter. These effects do not occur in the NAc after treatment with METH alone. In contrast to the CPu where DA, TH, and DA transporter levels remain depleted chronically, DA nerve ending alterations in the NAc show a partial recovery over time. None of the treatments that enhance METH toxicity in the NAc and CPu lead to losses of TH protein or DA cell bodies in the substantia nigra or the ventral tegmentum. These data show that increases in cytoplasmic DA dramatically broaden the neurotoxic profile of METH to include brain structures not normally targeted for damage by METH alone. The resistance of the NAc to METH-induced neurotoxicity and its ability to recover reveal a fundamentally different neuroplasticity by comparison to the CPu. Recruitment of the NAc as a target of METH neurotoxicity by alterations in DA homeostasis is significant in light of the important roles played by this brain structure. PMID:19457119

  18. Increases in cytoplasmic dopamine compromise the normal resistance of the nucleus accumbens to methamphetamine neurotoxicity.

    PubMed

    Thomas, David M; Francescutti-Verbeem, Dina M; Kuhn, Donald M

    2009-06-01

    Methamphetamine (METH) is a neurotoxic drug of abuse that damages the dopamine (DA) neuronal system in a highly delimited manner. The brain structure most affected by METH is the caudate-putamen (CPu) where long-term DA depletion and microglial activation are most evident. Even damage within the CPu is remarkably heterogenous with lateral and ventral aspects showing the greatest deficits. The nucleus accumbens (NAc) is largely spared of the damage that accompanies binge METH intoxication. Increases in cytoplasmic DA produced by reserpine, L-DOPA or clorgyline prior to METH uncover damage in the NAc as evidenced by microglial activation and depletion of DA, tyrosine hydroxylase (TH), and the DA transporter. These effects do not occur in the NAc after treatment with METH alone. In contrast to the CPu where DA, TH, and DA transporter levels remain depleted chronically, DA nerve ending alterations in the NAc show a partial recovery over time. None of the treatments that enhance METH toxicity in the NAc and CPu lead to losses of TH protein or DA cell bodies in the substantia nigra or the ventral tegmentum. These data show that increases in cytoplasmic DA dramatically broaden the neurotoxic profile of METH to include brain structures not normally targeted for damage by METH alone. The resistance of the NAc to METH-induced neurotoxicity and its ability to recover reveal a fundamentally different neuroplasticity by comparison to the CPu. Recruitment of the NAc as a target of METH neurotoxicity by alterations in DA homeostasis is significant in light of the important roles played by this brain structure.

  19. Pipelined CPU Design with FPGA in Teaching Computer Architecture

    ERIC Educational Resources Information Center

    Lee, Jong Hyuk; Lee, Seung Eun; Yu, Heon Chang; Suh, Taeweon

    2012-01-01

    This paper presents a pipelined CPU design project with a field programmable gate array (FPGA) system in a computer architecture course. The class project is a five-stage pipelined 32-bit MIPS design with experiments on the Altera DE2 board. For proper scheduling, milestones were set every one or two weeks to help students complete the project on…

  20. Where Are the Asteroids? The Design of ASTPT and ASTID.

    DTIC Science & Technology

    1980-04-15

    [Nomenclature fragment: nutation in longitude; obliquity of the ecliptic of date; obliquity of the ecliptic at 1950.0; equatorial precession.] ...need an additional rotation by the obliquity of the ecliptic, r′ = R1(−ε0) r; ε0 = 23°26′... (6). There is a very old trick in astronomy to simplify ... execution speed. This is accomplished by using an approximate geocentric ecliptic position to eliminate, as quickly (in terms of CPU time) as possible

  1. Portable implementation model for CFD simulations. Application to hybrid CPU/GPU supercomputers

    NASA Astrophysics Data System (ADS)

    Oyarzun, Guillermo; Borrell, Ricard; Gorobets, Andrey; Oliva, Assensi

    2017-10-01

    Nowadays, high performance computing (HPC) systems experience a disruptive moment with a variety of novel architectures and frameworks, without any clarity of which one is going to prevail. In this context, the portability of codes across different architectures is of major importance. This paper presents a portable implementation model based on an algebraic operational approach for direct numerical simulation (DNS) and large eddy simulation (LES) of incompressible turbulent flows using unstructured hybrid meshes. The strategy proposed consists in representing the whole time-integration algorithm using only three basic algebraic operations: sparse matrix-vector product, a linear combination of vectors and dot product. The main idea is based on decomposing the nonlinear operators into a concatenation of two SpMV operations. This provides high modularity and portability. An exhaustive analysis of the proposed implementation for hybrid CPU/GPU supercomputers has been conducted with tests using up to 128 GPUs. The main objective consists in understanding the challenges of implementing CFD codes on new architectures.
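
    The three building blocks named above can be written down directly; expressing the whole time-integration loop through only these kernels is what confines a port to reimplementing three routines. The CSR layout below is an assumption for the sketch, not the paper's data structure.

    ```cpp
    #include <cstddef>
    #include <vector>

    // The three algebraic building blocks named in the abstract: sparse
    // matrix-vector product, linear combination of vectors, and dot product.
    struct Csr {                                  // compressed sparse row matrix
        std::vector<int> row_ptr, col;
        std::vector<double> val;
    };

    void spmv(const Csr& A, const std::vector<double>& x, std::vector<double>& y) {
        for (std::size_t i = 0; i + 1 < A.row_ptr.size(); ++i) {
            double s = 0.0;
            for (int k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k) s += A.val[k] * x[A.col[k]];
            y[i] = s;
        }
    }

    void axpby(double a, const std::vector<double>& x, double b, std::vector<double>& y) {
        for (std::size_t i = 0; i < x.size(); ++i) y[i] = a * x[i] + b * y[i];
    }

    double dot(const std::vector<double>& x, const std::vector<double>& y) {
        double s = 0.0;
        for (std::size_t i = 0; i < x.size(); ++i) s += x[i] * y[i];
        return s;
    }
    ```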

  2. Continuous piecewise-linear, reduced-order electrochemical model for lithium-ion batteries in real-time applications

    NASA Astrophysics Data System (ADS)

    Farag, Mohammed; Fleckenstein, Matthias; Habibi, Saeid

    2017-02-01

    Model-order reduction and minimization of the CPU run-time while maintaining the model accuracy are critical requirements for real-time implementation of lithium-ion electrochemical battery models. In this paper, an isothermal, continuous, piecewise-linear, electrode-average model is developed by using an optimal knot placement technique. The proposed model reduces the univariate nonlinear function of the electrode's open circuit potential dependence on the state of charge to continuous piecewise regions. The parameterization experiments were chosen to provide a trade-off between extensive experimental characterization techniques and purely identifying all parameters using optimization techniques. The model is then parameterized in each continuous, piecewise-linear, region. Applying the proposed technique cuts down the CPU run-time by around 20%, compared to the reduced-order, electrode-average model. Finally, the model validation against real-time driving profiles (FTP-72, WLTP) demonstrates the ability of the model to predict the cell voltage accurately with less than 2% error.
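
    Evaluating a continuous piecewise-linear open-circuit-potential curve from a knot table is straightforward, as the hedged sketch below shows; the optimal knot placement and the full electrode-average model are not reproduced, and the function names are assumptions.

    ```cpp
    #include <cstddef>
    #include <vector>

    // Evaluate a continuous piecewise-linear curve (e.g., open-circuit potential
    // vs. state of charge) from precomputed knots. The knot positions themselves
    // would come from the optimal placement step described in the abstract.
    double piecewise_linear(const std::vector<double>& soc_knot,
                            const std::vector<double>& ocv_knot, double soc) {
        if (soc <= soc_knot.front()) return ocv_knot.front();
        if (soc >= soc_knot.back())  return ocv_knot.back();
        std::size_t k = 1;
        while (soc_knot[k] < soc) ++k;            // find the enclosing segment
        const double t = (soc - soc_knot[k - 1]) / (soc_knot[k] - soc_knot[k - 1]);
        return ocv_knot[k - 1] + t * (ocv_knot[k] - ocv_knot[k - 1]);
    }
    ```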

  3. Adaptive multi-GPU Exchange Monte Carlo for the 3D Random Field Ising Model

    NASA Astrophysics Data System (ADS)

    Navarro, Cristóbal A.; Huang, Wei; Deng, Youjin

    2016-08-01

    This work presents an adaptive multi-GPU Exchange Monte Carlo approach for the simulation of the 3D Random Field Ising Model (RFIM). The design is based on a two-level parallelization. The first level, spin-level parallelism, maps the parallel computation as optimal 3D thread-blocks that simulate blocks of spins in shared memory with minimal halo surface, assuming a constant block volume. The second level, replica-level parallelism, uses multi-GPU computation to handle the simulation of an ensemble of replicas. CUDA's concurrent kernel execution feature is used in order to fill the occupancy of each GPU with many replicas, providing a performance boost that is more noticeable at the smallest values of L. In addition to the two-level parallel design, the work proposes an adaptive multi-GPU approach that dynamically builds a proper temperature set free of exchange bottlenecks. The strategy is based on mid-point insertions at the temperature gaps where the exchange rate is most compromised. The extra work generated by the insertions is balanced across the GPUs independently of where the mid-point insertions were performed. Performance results show that spin-level performance is approximately two orders of magnitude faster than a single-core CPU version and one order of magnitude faster than a parallel multi-core CPU version running on 16 cores. Multi-GPU performance is highly convenient under a weak scaling setting, reaching up to 99% efficiency as long as the number of GPUs and L increase together. The combination of the adaptive approach with the parallel multi-GPU design has extended our possibilities of simulation to sizes of L = 32, 64 for a workstation with two GPUs. Sizes beyond L = 64 can eventually be studied using larger multi-GPU systems.
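
    The exchange step being monitored is the standard parallel-tempering swap test between adjacent temperatures, sketched below; the adaptive mid-point insertion logic itself is not reproduced.

    ```cpp
    #include <cmath>
    #include <random>

    // Standard exchange (parallel tempering) acceptance test between two
    // replicas at inverse temperatures beta1, beta2 with energies e1, e2:
    // accept with probability min(1, exp((beta1 - beta2) * (e1 - e2))).
    // Adaptive schemes monitor how often such swaps succeed and insert new
    // temperatures where the acceptance rate collapses.
    bool accept_swap(double beta1, double e1, double beta2, double e2, std::mt19937& rng) {
        const double delta = (beta1 - beta2) * (e1 - e2);
        if (delta >= 0.0) return true;
        std::uniform_real_distribution<double> u(0.0, 1.0);
        return u(rng) < std::exp(delta);
    }
    ```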

  4. Comparison of Acceleration Techniques for Selected Low-Level Bioinformatics Operations

    PubMed Central

    Langenkämper, Daniel; Jakobi, Tobias; Feld, Dustin; Jelonek, Lukas; Goesmann, Alexander; Nattkemper, Tim W.

    2016-01-01

    In recent years, clock rates of modern processors have stagnated while the demand for computing power has continued to grow. This applies particularly to the fields of life sciences and bioinformatics, where new technologies keep on creating rapidly growing piles of raw data with increasing speed. The number of cores per processor increased in an attempt to compensate for slight increments of clock rates. This technological shift demands changes in software development, especially in the field of high performance computing where parallelization techniques are gaining in importance due to the pressing issue of large-sized datasets generated by, e.g., modern genomics. This paper presents an overview of state-of-the-art manual and automatic acceleration techniques and lists some applications employing these in different areas of sequence informatics. Furthermore, we provide examples for automatic acceleration of two use cases to show typical problems and gains of transforming a serial application to a parallel one. The paper should aid the reader in deciding on a suitable technique for the problem at hand. We compare four different state-of-the-art automatic acceleration approaches (OpenMP, PluTo-SICA, PPCG, and OpenACC). Their performance as well as their applicability for selected use cases is discussed. While optimizations targeting the CPU worked better in the complex k-mer use case, optimizers for Graphics Processing Units (GPUs) performed better in the matrix multiplication example. However, GPU performance is only superior beyond a certain problem size, due to data migration overhead. We show that automatic code parallelization is feasible with current compiler software and yields significant increases in execution speed. Automatic optimizers for CPUs are mature and usually no additional manual adjustment is required. In contrast, some automatic parallelizers targeting GPUs still lack maturity and are limited to simple statements and structures. PMID:26904094
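
    As a minimal example of the manual CPU-side parallelization discussed, the matrix-multiplication use case can be distributed across cores with a single OpenMP pragma (compile with an OpenMP-enabled compiler, e.g. g++ -fopenmp); this is a generic sketch, not the paper's benchmark code.

    ```cpp
    #include <vector>

    // Matrix multiplication with the outer loop distributed across CPU cores by
    // one OpenMP pragma. A, B, and C are n x n matrices stored row-major in flat
    // vectors; C must already be sized to n*n.
    void matmul(const std::vector<double>& A, const std::vector<double>& B,
                std::vector<double>& C, int n) {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) {
                double s = 0.0;
                for (int k = 0; k < n; ++k) s += A[i * n + k] * B[k * n + j];
                C[i * n + j] = s;
            }
    }
    ```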

  5. Transition-Tempered Metadynamics Is a Promising Tool for Studying the Permeation of Drug-like Molecules through Membranes.

    PubMed

    Sun, Rui; Dama, James F; Tan, Jeffrey S; Rose, John P; Voth, Gregory A

    2016-10-11

    Metadynamics is an important enhanced sampling technique in molecular dynamics simulation to efficiently explore potential energy surfaces. The recently developed transition-tempered metadynamics (TTMetaD) has been proven to converge asymptotically without sacrificing exploration of the collective variable space in the early stages of simulations, unlike other convergent metadynamics (MetaD) methods. We have applied TTMetaD to study the permeation of drug-like molecules through a lipid bilayer to further investigate the usefulness of this method as applied to problems of relevance to medicinal chemistry. First, ethanol permeation through a lipid bilayer was studied to compare TTMetaD with nontempered metadynamics and well-tempered metadynamics. The bias energies computed from various metadynamics simulations were compared to the potential of mean force calculated from umbrella sampling. Though all of the MetaD simulations agree with one another asymptotically, TTMetaD is able to predict the most accurate and reliable estimate of the potential of mean force for permeation in the early stages of the simulations and is robust to the choice of required additional parameters. We also show that using multiple randomly initialized replicas allows convergence analysis and also provides an efficient means to converge the simulations in shorter wall times and, more unexpectedly, in shorter CPU times; splitting the CPU time between multiple replicas appears to lead to less overall error. After validating the method, we studied the permeation of a more complicated drug-like molecule, trimethoprim. Three sets of TTMetaD simulations with different choices of collective variables were carried out, and all converged within feasible simulation time. The minimum free energy paths showed that TTMetaD was able to predict almost identical permeation mechanisms in each case despite significantly different definitions of collective variables.
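
    For orientation, the sketch below shows the 1D Gaussian-hill bias accumulation common to all metadynamics variants; the transition-tempering rule, which damps hill heights after basin-to-basin crossings, is deliberately omitted, and the struct layout is an assumption.

    ```cpp
    #include <cmath>
    #include <vector>

    // 1D sketch of a metadynamics bias: Gaussian hills deposited along the
    // visited values of the collective variable s. Transition-tempered MetaD
    // additionally scales the hill height based on crossings between predefined
    // basins; that tempering rule is omitted here.
    struct MetadBias {
        std::vector<double> centers;   // where hills have been deposited
        double height, width;          // hill height and Gaussian width

        void deposit(double s) { centers.push_back(s); }

        double value(double s) const {               // bias energy at s
            double v = 0.0;
            for (double c : centers)
                v += height * std::exp(-(s - c) * (s - c) / (2.0 * width * width));
            return v;
        }
    };
    ```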

  6. Discovering epistasis in large scale genetic association studies by exploiting graphics cards.

    PubMed

    Chen, Gary K; Guo, Yunfei

    2013-12-03

    Despite the enormous investments made in collecting DNA samples and generating germline variation data across thousands of individuals in modern genome-wide association studies (GWAS), progress has been frustratingly slow in explaining much of the heritability in common disease. Today's paradigm of testing independent hypotheses on each single nucleotide polymorphism (SNP) marker is unlikely to adequately reflect the complex biological processes in disease risk. Alternatively, modeling risk as an ensemble of SNPs that act in concert in a pathway, and/or interact non-additively on log risk for example, may be a more sensible way to approach gene mapping in modern studies. Implementing such analyses genome-wide can quickly become intractable because even modest-size SNP panels on modern genotype arrays (500k markers) pose a combinatorial nightmare, requiring tens of billions of models to be tested for evidence of interaction. In this article, we provide an in-depth analysis of programs that have been developed to explicitly overcome these enormous computational barriers through the use of processors on graphics cards known as Graphics Processing Units (GPU). We include tutorials on GPU technology, which will convey why they are growing in appeal with today's numerical scientists. One obvious advantage is the impressive density of microprocessor cores that are available on only a single GPU. Whereas high-end servers feature up to 24 Intel or AMD CPU cores, the latest GPU offerings from nVidia feature over 2600 cores. Each compute node may be outfitted with up to 4 GPU devices. Success on GPUs varies across problems. However, epistasis screens fare well due to the high degree of parallelism exposed in these problems. Papers that we review routinely report GPU speedups of over two orders of magnitude (>100x) over standard CPU implementations.

  7. Discovering epistasis in large scale genetic association studies by exploiting graphics cards

    PubMed Central

    Chen, Gary K.; Guo, Yunfei

    2013-01-01

    Despite the enormous investments made in collecting DNA samples and generating germline variation data across thousands of individuals in modern genome-wide association studies (GWAS), progress has been frustratingly slow in explaining much of the heritability in common disease. Today's paradigm of testing independent hypotheses on each single nucleotide polymorphism (SNP) marker is unlikely to adequately reflect the complex biological processes in disease risk. Alternatively, modeling risk as an ensemble of SNPs that act in concert in a pathway, and/or interact non-additively on log risk for example, may be a more sensible way to approach gene mapping in modern studies. Implementing such analyses genome-wide can quickly become intractable because even modest-size SNP panels on modern genotype arrays (500k markers) pose a combinatorial nightmare, requiring tens of billions of models to be tested for evidence of interaction. In this article, we provide an in-depth analysis of programs that have been developed to explicitly overcome these enormous computational barriers through the use of processors on graphics cards known as Graphics Processing Units (GPUs). We include tutorials on GPU technology, which convey why it is growing in appeal among today's numerical scientists. One obvious advantage is the impressive density of microprocessor cores available on a single GPU. Whereas high-end servers feature up to 24 Intel or AMD CPU cores, the latest GPU offerings from NVIDIA feature over 2600 cores, and each compute node may be outfitted with up to 4 GPU devices. Success on GPUs varies across problems; however, epistasis screens fare well due to the high degree of parallelism exposed in these problems. Papers that we review routinely report GPU speedups of over two orders of magnitude (>100x) over standard CPU implementations. PMID:24348518

  8. Comparison of Acceleration Techniques for Selected Low-Level Bioinformatics Operations.

    PubMed

    Langenkämper, Daniel; Jakobi, Tobias; Feld, Dustin; Jelonek, Lukas; Goesmann, Alexander; Nattkemper, Tim W

    2016-01-01

    In recent years, clock rates of modern processors have stagnated while the demand for computing power has continued to grow. This applies particularly to the fields of life sciences and bioinformatics, where new technologies keep creating rapidly growing piles of raw data at increasing speed. To compensate for the only slight increments in clock rate, the number of cores per processor has increased. This technological shift demands changes in software development, especially in the field of high-performance computing, where parallelization techniques are gaining in importance due to the pressing issue of large datasets generated by, e.g., modern genomics. This paper presents an overview of state-of-the-art manual and automatic acceleration techniques and lists some applications employing these in different areas of sequence informatics. Furthermore, we provide examples of automatic acceleration for two use cases to show typical problems and gains of transforming a serial application into a parallel one. The paper should aid the reader in choosing a suitable technique for the problem at hand. We compare four different state-of-the-art automatic acceleration approaches (OpenMP, PluTo-SICA, PPCG, and OpenACC). Their performance as well as their applicability to selected use cases is discussed. While optimizations targeting the CPU worked better in the complex k-mer use case, optimizers for Graphics Processing Units (GPUs) performed better in the matrix multiplication example, but only above a certain problem size, due to data migration overhead. We show that automatic code parallelization is feasible with current compiler software and yields significant increases in execution speed. Automatic optimizers for the CPU are mature and usually require no additional manual adjustment. In contrast, some automatic parallelizers targeting GPUs still lack maturity and are limited to simple statements and structures.
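
    The matrix multiplication result above is easy to rationalize: the kernel exposes massive, regular data parallelism, but the operands must first cross the PCIe bus, so small problems are dominated by transfer time. A minimal, untiled CUDA kernel of the kind such auto-parallelizers emit is sketched below; names and the launch configuration are illustrative only:

    ```cuda
    // Naive dense matrix multiply C = A * B for n x n row-major matrices.
    // Each thread computes one element of C; host-to-device transfers of A, B
    // and the device-to-host transfer of C are the "data migration overhead"
    // that makes small problem sizes unprofitable on the GPU.
    __global__ void matMul(const float* A, const float* B, float* C, int n)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= n || col >= n) return;

        float acc = 0.0f;
        for (int k = 0; k < n; ++k)
            acc += A[row * n + k] * B[k * n + col];
        C[row * n + col] = acc;
    }

    // Typical launch:
    //   dim3 block(16, 16);
    //   dim3 grid((n + 15) / 16, (n + 15) / 16);
    //   matMul<<<grid, block>>>(dA, dB, dC, n);
    ```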

  9. High thermal conductivity liquid metal pad for heat dissipation in electronic devices

    NASA Astrophysics Data System (ADS)

    Lin, Zuoye; Liu, Huiqiang; Li, Qiuguo; Liu, Han; Chu, Sheng; Yang, Yuhua; Chu, Guang

    2018-05-01

    Novel thermal interface materials using an Ag-doped Ga-based liquid metal were proposed for heat dissipation in electronic packaging and precision equipment. On one hand, the viscosity and fluidity of the liquid metal were controlled to prevent leakage; on the other hand, the thermal conductivity of the Ga-based liquid metal was increased up to 46 W/mK by incorporating Ag nanoparticles. A series of experiments was performed to evaluate the heat dissipation performance on the CPU of a smartphone. The results demonstrated that the Ag-doped Ga-based liquid metal pad can effectively decrease the CPU temperature and change the heat flow path inside the smartphone. To understand the heat flow path from the CPU to the screen through the interface material, the heat dissipation mechanism was simulated and discussed.

  10. Dynamic modelling and estimation of the error due to asynchronism in a redundant asynchronous multiprocessor system

    NASA Technical Reports Server (NTRS)

    Huynh, Loc C.; Duval, R. W.

    1986-01-01

    The use of a Redundant Asynchronous Multiprocessor System to achieve ultrareliable Fault Tolerant Control Systems shows great promise. The development has been hampered by the inability to determine whether differences in the outputs of redundant CPUs are due to failures or to accrued error built up by slight differences in CPU clock intervals. This study derives an analytical dynamic model of the difference between redundant CPUs due to differences in their clock intervals and uses this model with on-line parameter identification to identify the differences in the clock intervals. The ability of this methodology to accurately track errors due to asynchronism makes it possible to generate an error signal with the effect of asynchronism removed, and this signal may be used to detect and isolate actual system failures.

  11. Analysis of cache for streaming tape drive

    NASA Technical Reports Server (NTRS)

    Chinnaswamy, V.

    1993-01-01

    A tape subsystem consists of a controller and a tape drive. Tapes are used for backup, data interchange, and software distribution. The backup operation is addressed. During a backup operation, data is read from disk, processed in the CPU, and then sent to tape. The processing speeds of a disk subsystem, CPU, and a tape subsystem are likely to be different. A powerful CPU can read data from a fast disk, process it, and supply the data to the tape subsystem at a faster rate than the tape subsystem can handle. On the other hand, a slow disk drive and a slow CPU may not be able to supply data fast enough to keep a tape drive busy all the time. The backup process may supply data to the tape drive in bursts. Each burst may be followed by an idle period. Depending on the nature of the file distribution on the disk, the input stream to the tape subsystem may vary significantly during backup. To compensate for these differences and optimize the utilization of a tape subsystem, a cache or buffer is introduced in the tape controller. Most tape drives today are streaming tape drives. A streaming tape drive goes into reposition when there is no data from the controller. Once the drive goes into reposition, the controller can receive data, but it cannot supply data to the tape drive until the drive completes its reposition. A controller can also receive data from the host and send data to the tape drive at the same time. The relationships among cache size, host transfer rate, drive transfer rate, reposition time, and ramp-up time for optimal performance of the tape subsystem are investigated. The formulas developed also show the advantages of cache watermarks in increasing the streaming time of the tape drive, the maximum loss due to insufficient cache, the tradeoffs between cache size and reposition times, and the effectiveness of cache on a streaming tape drive in the presence of idle times or interruptions in host transfers. Several mathematical formulas are developed to predict the performance of the tape drive. Some examples are given illustrating the usefulness of these formulas. Finally, a summary and some conclusions are provided.
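
    As a back-of-the-envelope illustration of the kind of relationship the abstract alludes to (this toy model and its symbols are ours, not the paper's formulas): with host rate R_h, drive rate R_d > R_h, reposition time T_r, and a controller that restarts the drive once the cache holds H bytes,

    ```latex
    t_s = \frac{H}{R_d - R_h}, \qquad
    t_w \approx \max\!\left(T_r,\; \frac{H}{R_h}\right), \qquad
    \text{streaming duty cycle} \approx \frac{t_s}{t_s + t_w},
    ```

    so a larger cache (or a well-placed restart watermark) lengthens each streaming burst and reduces the number of repositions, which is qualitatively the effect the paper's formulas quantify.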

  12. Multiband Radio Frequency Interconnect (MRFI) Technology For Next Generation Mobile/Airborne Computing Systems

    DTIC Science & Technology

    2017-02-01

    enable high scalability and reconfigurability for inter-CPU/Memory communications with an increased number of communication channels in frequency ...interconnect technology (MRFI) to enable high scalability and re-configurability for inter-CPU/Memory communications with an increased number of communication ...testing in the University of California, Los Angeles (UCLA) Center for High Frequency Electronics, and Dr. Afshin Momtaz at Broadcom Corporation for

  13. LU Factorization with Partial Pivoting for a Multi-CPU, Multi-GPU Shared Memory System

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kurzak, Jakub; Luszczek, Piotr; Faverge, Mathieu

    2012-03-01

    LU factorization with partial pivoting is a canonical numerical procedure and the main component of the High Performance LINPACK benchmark. This article presents an implementation of the algorithm for a hybrid, shared memory, system with standard CPU cores and GPU accelerators. Performance in excess of one TeraFLOPS is achieved using four AMD Magny Cours CPUs and four NVIDIA Fermi GPUs.

  14. Accelerating finite-rate chemical kinetics with coprocessors: Comparing vectorization methods on GPUs, MICs, and CPUs

    NASA Astrophysics Data System (ADS)

    Stone, Christopher P.; Alferman, Andrew T.; Niemeyer, Kyle E.

    2018-05-01

    Accurate and efficient methods for solving stiff ordinary differential equations (ODEs) are a critical component of turbulent combustion simulations with finite-rate chemistry. The ODEs governing the chemical kinetics at each mesh point are decoupled by operator-splitting allowing each to be solved concurrently. An efficient ODE solver must then take into account the available thread and instruction-level parallelism of the underlying hardware, especially on many-core coprocessors, as well as the numerical efficiency. A stiff Rosenbrock and a nonstiff Runge-Kutta ODE solver are both implemented using the single instruction, multiple thread (SIMT) and single instruction, multiple data (SIMD) paradigms within OpenCL. Both methods solve multiple ODEs concurrently within the same instruction stream. The performance of these parallel implementations was measured on three chemical kinetic models of increasing size across several multicore and many-core platforms. Two separate benchmarks were conducted to clearly determine any performance advantage offered by either method. The first benchmark measured the run-time of evaluating the right-hand-side source terms in parallel and the second benchmark integrated a series of constant-pressure, homogeneous reactors using the Rosenbrock and Runge-Kutta solvers. The right-hand-side evaluations with SIMD parallelism on the host multicore Xeon CPU and many-core Xeon Phi co-processor performed approximately three times faster than the baseline multithreaded C++ code. The SIMT parallel model on the host and Phi was 13%-35% slower than the baseline while the SIMT model on the NVIDIA Kepler GPU provided approximately the same performance as the SIMD model on the Phi. The runtimes for both ODE solvers decreased significantly with the SIMD implementations on the host CPU (2.5-2.7 ×) and Xeon Phi coprocessor (4.7-4.9 ×) compared to the baseline parallel code. The SIMT implementations on the GPU ran 1.5-1.6 times faster than the baseline multithreaded CPU code; however, this was significantly slower than the SIMD versions on the host CPU or the Xeon Phi. The performance difference between the three platforms was attributed to thread divergence caused by the adaptive step-sizes within the ODE integrators. Analysis showed that the wider vector width of the GPU incurs a higher level of divergence than the narrower Sandy Bridge or Xeon Phi. The significant performance improvement provided by the SIMD parallel strategy motivates further research into more ODE solver methods that are both SIMD-friendly and computationally efficient.
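
    The SIMT mapping described above is easiest to see in a stripped-down form: one GPU thread integrates one independent reactor, and divergence appears as soon as an adaptive integrator takes a different number of steps in each thread. The sketch below is deliberately simplified (fixed-step RK4 on a one-species toy ODE dy/dt = -k*y, not the paper's Rosenbrock or Runge-Kutta chemistry solvers):

    ```cuda
    // One thread integrates one independent "reactor" governed by dy/dt = -k*y
    // with classical fixed-step RK4. Real kinetics solvers advance a full
    // species vector with adaptive step sizes, which is where thread
    // divergence (and the SIMD-vs-SIMT performance gap) comes from.
    __global__ void integrateReactors(float* y, const float* k,
                                      int nReactors, float dt, int nSteps)
    {
        int r = blockIdx.x * blockDim.x + threadIdx.x;
        if (r >= nReactors) return;

        float yr = y[r];
        float kr = k[r];
        for (int step = 0; step < nSteps; ++step) {
            float k1 = -kr * yr;
            float k2 = -kr * (yr + 0.5f * dt * k1);
            float k3 = -kr * (yr + 0.5f * dt * k2);
            float k4 = -kr * (yr + dt * k3);
            yr += dt * (k1 + 2.0f * k2 + 2.0f * k3 + k4) / 6.0f;
        }
        y[r] = yr;
    }
    ```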

  15. Data Parallel Bin-Based Indexing for Answering Queries on Multi-Core Architectures

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Gosink, Luke; Wu, Kesheng; Bethel, E. Wes

    2009-06-02

    The multi-core trend in CPUs and general purpose graphics processing units (GPUs) offers new opportunities for the database community. The increase of cores at exponential rates is likely to affect virtually every server and client in the coming decade, and presents database management systems with a huge, compelling disruption that will radically change how processing is done. This paper presents a new parallel indexing data structure for answering queries that takes full advantage of the increasing thread-level parallelism emerging in multi-core architectures. In our approach, our Data Parallel Bin-based Index Strategy (DP-BIS) first bins the base data, and then partitions and stores the values in each bin as a separate, bin-based data cluster. In answering a query, the procedures for examining the bin numbers and the bin-based data clusters offer the maximum possible level of concurrency; each record is evaluated by a single thread and all threads are processed simultaneously in parallel. We implement and demonstrate the effectiveness of DP-BIS on two multi-core architectures: a multi-core CPU and a GPU. The concurrency afforded by DP-BIS allows us to fully utilize the thread-level parallelism provided by each architecture--for example, our GPU-based DP-BIS implementation simultaneously evaluates over 12,000 records with an equivalent number of concurrently executing threads. In comparing DP-BIS's performance across these architectures, we show that the GPU-based DP-BIS implementation requires significantly less computation time to answer a query than the CPU-based implementation. We also demonstrate in our analysis that DP-BIS provides better overall performance than the commonly utilized CPU and GPU-based projection index. Finally, due to data encoding, we show that DP-BIS accesses significantly smaller amounts of data than index strategies that operate solely on a column's base data; this smaller data footprint is critical for parallel processors that possess limited memory resources (e.g., GPUs).
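
    The one-thread-per-record evaluation that DP-BIS relies on can be pictured with a stripped-down candidate check over bin numbers. The snippet below is only a schematic of that idea (array names and the single-attribute range query are our assumptions); the actual strategy additionally reads the bin-based data clusters for the boundary bins to resolve records that the bin number alone cannot decide:

    ```cuda
    // Each thread inspects one record's bin number against a 1-D range query.
    // Bins strictly inside [loBin, hiBin] are hits outright; records falling
    // in the two boundary bins would need a second pass over the stored
    // bin-based data clusters to compare against the exact query bounds.
    __global__ void binFilter(const int* recordBin, int nRecords,
                              int loBin, int hiBin, unsigned char* isCandidate)
    {
        int r = blockIdx.x * blockDim.x + threadIdx.x;
        if (r >= nRecords) return;

        int b = recordBin[r];
        isCandidate[r] = (b >= loBin && b <= hiBin) ? 1 : 0;
    }
    ```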

  16. Benchmarking worker nodes using LHCb productions and comparing with HEPSpec06

    NASA Astrophysics Data System (ADS)

    Charpentier, P.

    2017-10-01

    In order to estimate the capabilities of a computing slot with limited processing time, it is necessary to know with a rather good precision its “power”. This allows for example pilot jobs to match a task for which the required CPU-work is known, or to define the number of events to be processed knowing the CPU-work per event. Otherwise one always has the risk that the task is aborted because it exceeds the CPU capabilities of the resource. It also allows a better accounting of the consumed resources. The traditional way the CPU power is estimated in WLCG since 2007 is using the HEP-Spec06 benchmark (HS06) suite that was verified at the time to scale properly with a set of typical HEP applications. However, the hardware architecture of processors has evolved, all WLCG experiments moved to using 64-bit applications and use different compilation flags from those advertised for running HS06. It is therefore interesting to check the scaling of HS06 with the HEP applications. For this purpose, we have been using CPU intensive massive simulation productions from the LHCb experiment and compared their event throughput to the HS06 rating of the worker nodes. We also compared it with a much faster benchmark script that is used by the DIRAC framework used by LHCb for evaluating at run time the performance of the worker nodes. This contribution reports on the finding of these comparisons: the main observation is that the scaling with HS06 is no longer fulfilled, while the fast benchmarks have a better scaling but are less precise. One can also clearly see that some hardware or software features when enabled on the worker nodes may enhance their performance beyond expectation from either benchmark, depending on external factors.
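
    The matching described above is simple arithmetic once a power estimate exists. If a slot offers wall-time T and rated power P (for example in HS06 per core), and one event costs w units of CPU-work (HS06 seconds), a pilot can safely request about (an illustrative relation, not DIRAC's exact accounting)

    ```latex
    N_{\mathrm{events}} \;\le\; \left\lfloor \frac{\epsilon \, P \, T}{w} \right\rfloor ,
    \qquad 0 < \epsilon < 1 ,
    ```

    where the safety margin epsilon absorbs exactly the benchmark mis-scaling the paper documents.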

  17. Nucleus Accumbens Invulnerability to Methamphetamine Neurotoxicity

    PubMed Central

    Kuhn, Donald M.; Angoa-Pérez, Mariana; Thomas, David M.

    2016-01-01

    Methamphetamine (Meth) is a neurotoxic drug of abuse that damages neurons and nerve endings throughout the central nervous system. Emerging studies of human Meth addicts using both postmortem analyses of brain tissue and noninvasive imaging studies of intact brains have confirmed that Meth causes persistent structural abnormalities. Animal and human studies have also defined a number of significant functional problems and comorbid psychiatric disorders associated with long-term Meth abuse. This review summarizes the salient features of Meth-induced neurotoxicity with a focus on the dopamine (DA) neuronal system. DA nerve endings in the caudate-putamen (CPu) are damaged by Meth in a highly delimited manner. Even within the CPu, damage is remarkably heterogeneous, with ventral and lateral aspects showing the greatest deficits. The nucleus accumbens (NAc) is largely spared the damage that accompanies binge Meth intoxication, but relatively subtle changes in the disposition of DA in its nerve endings can lead to dramatic increases in Meth-induced toxicity in the CPu and overcome the normal resistance of the NAc to damage. In contrast to the CPu, where DA neuronal deficiencies are persistent, alterations in the NAc show a partial recovery. Animal models have been indispensable in studies of the causes and consequences of Meth neurotoxicity and in the development of new therapies. This research has shown that increases in cytoplasmic DA dramatically broaden the neurotoxic profile of Meth to include brain structures not normally targeted for damage. The resistance of the NAc to Meth-induced neurotoxicity and its ability to recover reveal a fundamentally different neuroplasticity by comparison to the CPu. Recruitment of the NAc as a target of Meth neurotoxicity by alterations in DA homeostasis is significant in light of the numerous important roles played by this brain structure. PMID:23382149

  18. Nucleus accumbens invulnerability to methamphetamine neurotoxicity.

    PubMed

    Kuhn, Donald M; Angoa-Pérez, Mariana; Thomas, David M

    2011-01-01

    Methamphetamine (Meth) is a neurotoxic drug of abuse that damages neurons and nerve endings throughout the central nervous system. Emerging studies of human Meth addicts using both postmortem analyses of brain tissue and noninvasive imaging studies of intact brains have confirmed that Meth causes persistent structural abnormalities. Animal and human studies have also defined a number of significant functional problems and comorbid psychiatric disorders associated with long-term Meth abuse. This review summarizes the salient features of Meth-induced neurotoxicity with a focus on the dopamine (DA) neuronal system. DA nerve endings in the caudate-putamen (CPu) are damaged by Meth in a highly delimited manner. Even within the CPu, damage is remarkably heterogeneous, with ventral and lateral aspects showing the greatest deficits. The nucleus accumbens (NAc) is largely spared the damage that accompanies binge Meth intoxication, but relatively subtle changes in the disposition of DA in its nerve endings can lead to dramatic increases in Meth-induced toxicity in the CPu and overcome the normal resistance of the NAc to damage. In contrast to the CPu, where DA neuronal deficiencies are persistent, alterations in the NAc show a partial recovery. Animal models have been indispensable in studies of the causes and consequences of Meth neurotoxicity and in the development of new therapies. This research has shown that increases in cytoplasmic DA dramatically broaden the neurotoxic profile of Meth to include brain structures not normally targeted for damage. The resistance of the NAc to Meth-induced neurotoxicity and its ability to recover reveal a fundamentally different neuroplasticity by comparison to the CPu. Recruitment of the NAc as a target of Meth neurotoxicity by alterations in DA homeostasis is significant in light of the numerous important roles played by this brain structure.

  19. Highly efficient spatial data filtering in parallel using the opensource library CPPPO

    NASA Astrophysics Data System (ADS)

    Municchi, Federico; Goniva, Christoph; Radl, Stefan

    2016-10-01

    CPPPO is a compilation of parallel data processing routines developed with the aim of creating a library for "scale bridging" (i.e., connecting different scales by means of closure models) in a multi-scale approach. CPPPO features a number of parallel filtering algorithms designed for use with structured and unstructured Eulerian meshes, as well as Lagrangian data sets. In addition, data can be processed on the fly, allowing the collection of relevant statistics without saving individual snapshots of the simulation state. Our library is provided with an interface to the widely used CFD solver OpenFOAM®, and can be easily connected to any other software package via interface modules. Also, we introduce a novel, extremely efficient approach to parallel data filtering, and show that our algorithms scale super-linearly on multi-core clusters. Furthermore, we provide a guideline for choosing the optimal Eulerian cell selection algorithm depending on the number of CPU cores used. Finally, we demonstrate the accuracy and the parallel scalability of CPPPO in a showcase focusing on heat and mass transfer from a dense bed of particles.

  20. Robotic goalie with 3 ms reaction time at 4% CPU load using event-based dynamic vision sensor

    PubMed Central

    Delbruck, Tobi; Lang, Manuel

    2013-01-01

    Conventional vision-based robotic systems that must operate quickly require high video frame rates and consequently high computational costs. Visual response latencies are lower-bounded by the frame period, e.g., 20 ms for a 50 Hz frame rate. This paper shows how an asynchronous neuromorphic dynamic vision sensor (DVS) silicon retina is used to build a fast self-calibrating robotic goalie, which offers high update rates and low latency at low CPU load. Independent and asynchronous per-pixel illumination change events from the DVS signify moving objects and are used in software to track multiple balls. Motor actions to block the most "threatening" ball are based on measured ball positions and velocities. The goalie also sees its single-axis goalie arm and calibrates the motor output map during idle periods so that it can plan open-loop arm movements to desired visual locations. Blocking capability is about 80% for balls shot from 1 m from the goal even with the fastest shots, and approaches 100% accuracy when the ball does not beat the limits of the servo motor to move the arm to the necessary position in time. Running with standard USB buses under a standard preemptive multitasking operating system (Windows), the goalie robot achieves median update rates of 550 Hz, with latencies of 2.2 ± 2 ms from ball movement to motor command at a peak CPU load of less than 4%. Practical observations and measurements of USB device latency are provided. PMID:24311999

  1. Optimization and uncertainty assessment of strongly nonlinear groundwater models with high parameter dimensionality

    NASA Astrophysics Data System (ADS)

    Keating, Elizabeth H.; Doherty, John; Vrugt, Jasper A.; Kang, Qinjun

    2010-10-01

    Highly parameterized and CPU-intensive groundwater models are increasingly being used to understand and predict flow and transport through aquifers. Despite their frequent use, these models pose significant challenges for parameter estimation and predictive uncertainty analysis algorithms, particularly global methods which usually require very large numbers of forward runs. Here we present a general methodology for parameter estimation and uncertainty analysis that can be utilized in these situations. Our proposed method includes extraction of a surrogate model that mimics key characteristics of a full process model, followed by testing and implementation of a pragmatic uncertainty analysis technique, called null-space Monte Carlo (NSMC), that merges the strengths of gradient-based search and parameter dimensionality reduction. As part of the surrogate model analysis, the results of NSMC are compared with a formal Bayesian approach using the DiffeRential Evolution Adaptive Metropolis (DREAM) algorithm. Such a comparison has never been accomplished before, especially in the context of high parameter dimensionality. Despite the highly nonlinear nature of the inverse problem, the existence of multiple local minima, and the relatively large parameter dimensionality, both methods performed well and results compare favorably with each other. Experiences gained from the surrogate model analysis are then transferred to calibrate the full highly parameterized and CPU intensive groundwater model and to explore predictive uncertainty of predictions made by that model. The methodology presented here is generally applicable to any highly parameterized and CPU-intensive environmental model, where efficient methods such as NSMC provide the only practical means for conducting predictive uncertainty analysis.
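
    For orientation, the null-space Monte Carlo step can be written compactly (a sketch of the commonly published formulation; the notation is ours, not the paper's): a random parameter draw is projected so that its difference from the calibrated set lies in the approximate null space of the weighted Jacobian, leaving the model-to-measurement fit nearly unchanged,

    ```latex
    \mathbf{p}_i \;=\; \mathbf{p}_{\mathrm{cal}}
        + \mathbf{V}_2 \mathbf{V}_2^{\mathsf{T}}
          \left( \boldsymbol{\pi}_i - \mathbf{p}_{\mathrm{cal}} \right),
    ```

    where V_2 spans the low-singular-value (null-space) directions and pi_i is a draw from the prior; each resulting parameter set is then cheaply re-calibrated by adjusting only its solution-space components.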

  2. High-performance computing on GPUs for resistivity logging of oil and gas wells

    NASA Astrophysics Data System (ADS)

    Glinskikh, V.; Dudaev, A.; Nechaev, O.; Surodina, I.

    2017-10-01

    We developed and implemented into software an algorithm for high-performance simulation of electrical logs from oil and gas wells using high-performance heterogeneous computing. The numerical solution of the 2D forward problem is based on the finite-element method and the Cholesky decomposition for solving a system of linear algebraic equations (SLAE). Software implementations of the algorithm were made using the NVIDIA CUDA technology and computing libraries, allowing us to perform the decomposition of the SLAE and find its solution on the central processing unit (CPU) and the graphics processing unit (GPU). The calculation time is analyzed as a function of the matrix size and the number of its non-zero elements. We estimated the computing speed on the CPU and GPU, including high-performance heterogeneous CPU-GPU computing. Using the developed algorithm, we simulated resistivity data in realistic models.

  3. Design of a memory-access controller with 3.71-times-enhanced energy efficiency for Internet-of-Things-oriented nonvolatile microcontroller unit

    NASA Astrophysics Data System (ADS)

    Natsui, Masanori; Hanyu, Takahiro

    2018-04-01

    In realizing a nonvolatile microcontroller unit (MCU) for sensor nodes in Internet-of-Things (IoT) applications, it is important to solve the data-transfer bottleneck between the central processing unit (CPU) and the nonvolatile memory constituting the MCU. As one circuit-oriented approach to solving this problem, we propose a memory access minimization technique for magnetoresistive-random-access-memory (MRAM)-embedded nonvolatile MCUs. In addition to multiplexing and prefetching of memory access, the proposed technique realizes efficient instruction fetch by eliminating redundant memory access while considering the code length of the instruction to be fetched and the transition of the memory address to be accessed. As a result, the performance of the MCU can be improved while relaxing the performance requirement for the embedded MRAM, and compact and low-power implementation can be performed as compared with the conventional cache-based one. Through the evaluation using a system consisting of a general purpose 32-bit CPU and embedded MRAM, it is demonstrated that the proposed technique increases the peak efficiency of the system up to 3.71 times, while a 2.29-fold area reduction is achieved compared with the cache-based one.

  4. Hotspot detection using image pattern recognition based on higher-order local auto-correlation

    NASA Astrophysics Data System (ADS)

    Maeda, Shimon; Matsunawa, Tetsuaki; Ogawa, Ryuji; Ichikawa, Hirotaka; Takahata, Kazuhiro; Miyairi, Masahiro; Kotani, Toshiya; Nojima, Shigeki; Tanaka, Satoshi; Nakagawa, Kei; Saito, Tamaki; Mimotogi, Shoji; Inoue, Soichi; Nosato, Hirokazu; Sakanashi, Hidenori; Kobayashi, Takumi; Murakawa, Masahiro; Higuchi, Tetsuya; Takahashi, Eiichi; Otsu, Nobuyuki

    2011-04-01

    Below the 40 nm design node, systematic variation due to lithography must be taken into consideration during the early stage of design. So far, litho-aware design using lithography simulation models has been widely applied to assure that designs are printed on silicon without any error. However, the lithography simulation approach is very time consuming, and under time-to-market pressure, repetitive redesign by this approach may result in missing the market window. This paper proposes a fast hotspot detection support method using flexible and intelligent image pattern recognition based on Higher-Order Local Autocorrelation (HLAC). Our method learns the geometrical properties of given design data without any defects as normal patterns, and automatically detects design patterns with hotspots in the test data as abnormal patterns. The HLAC method can extract features from the graphic image of a design pattern, and the computational cost of the extraction is constant regardless of the number of design-pattern polygons. This approach can reduce turnaround time (TAT) dramatically even on a single CPU, compared with the conventional simulation-based approach, and with distributed processing it has proven to deliver linear scalability with each additional CPU.
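
    The HLAC features mentioned above have a compact standard definition: for an image f(r) and displacement vectors a_1, ..., a_N,

    ```latex
    x_N(\mathbf{a}_1,\dots,\mathbf{a}_N) \;=\;
        \sum_{\mathbf{r}} f(\mathbf{r})\, f(\mathbf{r}+\mathbf{a}_1) \cdots f(\mathbf{r}+\mathbf{a}_N),
    ```

    evaluated in practice for orders N <= 2 with displacements confined to a 3x3 window. Because each feature is a single pass over the rasterized layout, the extraction cost is independent of the number of polygons, which is the property the abstract emphasizes.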

  5. A study of transonic aerodynamic analysis methods for use with a hypersonic aircraft synthesis code

    NASA Technical Reports Server (NTRS)

    Sandlin, Doral R.; Davis, Paul Christopher

    1992-01-01

    A means of performing routine transonic lift, drag, and moment analyses on hypersonic all-body and wing-body configurations was studied. The analysis method is to be used in conjunction with the Hypersonic Vehicle Optimization Code (HAVOC). A review of existing techniques is presented, after which three methods, chosen to represent a spectrum of capabilities, are tested and the results are compared with experimental data. The three methods consist of a wave drag code, a full potential code, and a Navier-Stokes code. The wave drag code, representing the empirical approach, has very fast CPU times but very limited and sporadic results. The full potential code provides results which compare favorably to the wind tunnel data, but with a dramatic increase in computational time. Even more extreme is the Navier-Stokes code, which provides the most favorable and complete results, but with a very large turnaround time. The full potential code, TRANAIR, is used for additional analyses because of the superior results it can provide over empirical and semi-empirical methods, and because of its automated grid generation. TRANAIR analyses include an all-body hypersonic cruise configuration and an oblique flying wing supersonic transport.

  6. Software Techniques for Non-Von Neumann Architectures

    DTIC Science & Technology

    1990-01-01

    Excerpt from a table of machine characteristics: Comm. topology: programmable Benes network; hypercubic lattice for QCD. Control: centralized. Assignment: static. Memory: shared. Synch: universal. Max CPUs: 566. Processor boards: each with 4 floating-point units and 2 multipliers. CPU size: 32-bit floating-point chips. Performance: 11.4 Gflops. Market: quantum chromodynamics (QCD). ...functions there should exist a capability to define hierarchies and lattices of complex objects. A complex object can be made up of a set of simple objects

  7. Visual Media Reasoning - Terrain-based Geolocation

    DTIC Science & Technology

    2015-06-01

    3.4 Alternative Metric Investigation. This section describes a graphics processor unit (GPU) based implementation in the NVIDIA CUDA programming...utilizing 2 concurrent CPU cores, each controlling a single Nvidia C2075 Tesla Fermi CUDA card. Figure 22 shows a comparison of the CPU and the GPU powered

  8. Self-organized neural maps of human protein sequences.

    PubMed Central

    Ferrán, E. A.; Pflugfelder, B.; Ferrara, P.

    1994-01-01

    We have recently described a method based on artificial neural networks to cluster protein sequences into families. The network was trained with Kohonen's unsupervised learning algorithm using, as inputs, the matrix patterns derived from the dipeptide composition of the proteins. We present here a large-scale application of that method to classify the 1,758 human protein sequences stored in the SwissProt database (release 19.0), whose lengths are greater than 50 amino acids. In the final 2-dimensional topologically ordered map of 15 x 15 neurons, proteins belonging to known families were associated with the same neuron or with neighboring ones. Also, as an attempt to reduce the time-consuming learning procedure, we compared 2 learning protocols: one of 500 epochs (100 SUN CPU-hours [CPU-h]), and another one of 30 epochs (6.7 CPU-h). A further reduction of learning-computing time, by a factor of about 3.3, with similar protein clustering results, was achieved using a matrix of 11 x 11 components to represent the sequences. Although network training is time consuming, the classification of a new protein in the final ordered map is very fast (14.6 CPU-seconds). We also show a comparison between the artificial neural network approach and conventional methods of biosequence analysis. PMID:8019421

  9. A CFD Heterogeneous Parallel Solver Based on Collaborating CPU and GPU

    NASA Astrophysics Data System (ADS)

    Lai, Jianqi; Tian, Zhengyu; Li, Hua; Pan, Sha

    2018-03-01

    Since the Graphics Processing Unit (GPU) offers strong floating-point computation capability and high memory bandwidth for data parallelism, it has been widely used in general-purpose computing areas such as molecular dynamics (MD), computational fluid dynamics (CFD) and so on. The emergence of the compute unified device architecture (CUDA), which reduces the complexity of program development, brings great opportunities to CFD. There are three different modes for the parallel solution of the NS equations: a parallel solver based on the CPU, a parallel solver based on the GPU, and a heterogeneous parallel solver based on collaborating CPU and GPU. GPUs are relatively rich in compute capacity but poor in memory capacity, and CPUs are the opposite. To make full use of both the GPUs and the CPUs, a CFD heterogeneous parallel solver based on collaborating CPU and GPU has been established. Three cases are presented to analyse the solver's computational accuracy and heterogeneous parallel efficiency. The numerical results agree well with experimental results, which demonstrates that the heterogeneous parallel solver has high computational precision. The speedup on a single GPU is more than 40 for laminar flow; it decreases for turbulent flow, but can still reach more than 20. Moreover, the speedup increases as the grid size becomes larger.

  10. Assessment of Linear Finite-Difference Poisson-Boltzmann Solvers

    PubMed Central

    Wang, Jun; Luo, Ray

    2009-01-01

    CPU time and memory usage are two vital issues that any numerical solvers for the Poisson-Boltzmann equation have to face in biomolecular applications. In this study we systematically analyzed the CPU time and memory usage of five commonly used finite-difference solvers with a large and diversified set of biomolecular structures. Our comparative analysis shows that modified incomplete Cholesky conjugate gradient and geometric multigrid are the most efficient in the diversified test set. For the two efficient solvers, our test shows that their CPU times increase approximately linearly with the numbers of grids. Their CPU times also increase almost linearly with the negative logarithm of the convergence criterion at very similar rate. Our comparison further shows that geometric multigrid performs better in the large set of tested biomolecules. However, modified incomplete Cholesky conjugate gradient is superior to geometric multigrid in molecular dynamics simulations of tested molecules. We also investigated other significant components in numerical solutions of the Poisson-Boltzmann equation. It turns out that the time-limiting step is the free boundary condition setup for the linear systems for the selected proteins if the electrostatic focusing is not used. Thus, development of future numerical solvers for the Poisson-Boltzmann equation should balance all aspects of the numerical procedures in realistic biomolecular applications. PMID:20063271
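
    For reference, the equation these solvers discretize on the finite-difference grid is the linearized Poisson-Boltzmann equation, in its standard form (Gaussian units; the notation is generic rather than the paper's):

    ```latex
    \nabla \cdot \bigl[\epsilon(\mathbf{r})\, \nabla \phi(\mathbf{r})\bigr]
        \;-\; \bar{\kappa}^2(\mathbf{r})\, \phi(\mathbf{r})
        \;=\; -4\pi \rho(\mathbf{r}),
    ```

    where epsilon is the position-dependent dielectric, the kappa-bar-squared term describes ion screening (zero inside the solute), and rho is the fixed solute charge density; the solvers compared above differ in how they solve the resulting sparse linear system, not in the equation itself.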

  11. Wang-Landau sampling: Saving CPU time

    NASA Astrophysics Data System (ADS)

    Ferreira, L. S.; Jorge, L. N.; Leão, S. A.; Caparica, A. A.

    2018-04-01

    In this work we propose an improvement to the Wang-Landau (WL) method that saves about 60% of the CPU time while leading to the same results with the same accuracy. Using the 2D Ising model, we show that all WL simulations can be initiated from the outputs of an advanced WL level of a previous simulation. We show that up to the seventh WL level (f6) the simulations are not yet biased and can reach any value that a simulation started from the very beginning would reach. As a result, the initial WL levels need to be simulated only once. It was also observed that the saving in CPU time is larger for larger lattice sizes, exactly where the computational cost is considerable. We carried out high-resolution simulations beginning from the first WL level (f0) and others beginning from the eighth WL level (f7), using all the data from the end of the previous level, and showed that the results for the critical temperature Tc and the critical static exponents β and γ coincide within the error bars. Finally, we applied the same procedure to the spin-1/2 Baxter-Wu model, where the saving in CPU time was about 64%.
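
    For context, the standard WL iteration that the proposed shortcut reuses across runs is (textbook algorithm, not specific to this paper):

    ```latex
    P(E_a \to E_b) = \min\!\left(1, \frac{g(E_a)}{g(E_b)}\right), \qquad
    \ln g(E) \leftarrow \ln g(E) + \ln f_k, \quad H(E) \leftarrow H(E) + 1, \qquad
    f_{k+1} = \sqrt{f_k}, \; f_0 = e,
    ```

    with the density-of-states estimate and histogram updated after every move and the modification factor reduced each time the histogram becomes sufficiently flat. The seventh level f6 thus corresponds to ln f6 = 2^{-6}, and the paper's point is that new runs may be started from the stored g(E) at that level instead of from f0.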

  12. Provenance-aware optimization of workload for distributed data production

    NASA Astrophysics Data System (ADS)

    Makatun, Dzmitry; Lauret, Jérôme; Rudová, Hana; Šumbera, Michal

    2017-10-01

    Distributed data processing in High Energy and Nuclear Physics (HENP) is a prominent example of big data analysis. With petabytes of data being processed at tens of computational sites with thousands of CPUs, standard job scheduling approaches either do not address the problem complexity well or are dedicated to only one specific aspect of the problem (CPU, network or storage). Previously we developed a new job scheduling approach dedicated to distributed data production - an essential part of data processing in HENP (preprocessing in big data terminology). In this contribution, we discuss load balancing with multiple data sources and data replication, present recent improvements made to our planner and provide results of simulations which demonstrate the advantage over standard scheduling policies for the new use case. Multiple data sources (provenance) are common in the computing models of many applications, where the data may be copied to several destinations. The initial input data set would hence already be partially replicated to multiple locations, and the task of the scheduler is to maximize overall computational throughput considering possible data movements and CPU allocation. The studies have shown that our approach can provide a significant gain in overall computational performance in a wide scope of simulations considering a realistic size of the computational Grid and various input data distributions.

  13. Performance assessment of a pre-partitioned adaptive chemistry approach in large-eddy simulation of turbulent flames

    NASA Astrophysics Data System (ADS)

    Pepiot, Perrine; Liang, Youwen; Newale, Ashish; Pope, Stephen

    2016-11-01

    A pre-partitioned adaptive chemistry (PPAC) approach recently developed and validated in the simplified framework of a partially-stirred reactor is applied to the simulation of turbulent flames using a LES/particle PDF framework. The PPAC approach was shown to simultaneously provide significant savings in CPU and memory requirements, two major limiting factors in LES/particle PDF. The savings are achieved by providing each particle in the PDF method with a specialized reduced representation and kinetic model adjusted to its changing composition. Both representation and model are identified efficiently from a pre-determined list using a low-dimensional binary-tree search algorithm, thereby keeping the run-time overhead associated with the adaptive strategy to a minimum. The Sandia D flame is used as benchmark to quantify the performance of the PPAC algorithm in a turbulent combustion setting. In particular, the CPU and memory benefits, the distribution of the various representations throughout the computational domain, and the relationship between the user-defined error tolerances used to derive the reduced representations and models and the actual errors observed in LES/PDF are characterized. This material is based upon work supported by the U.S. Department of Energy Office of Science, Office of Basic Energy Sciences under Award Number DE-FG02-90ER14128.

  14. Acceleration of spiking neural network based pattern recognition on NVIDIA graphics processors.

    PubMed

    Han, Bing; Taha, Tarek M

    2010-04-01

    There is currently a strong push in the research community to develop biological-scale implementations of neuron-based vision models. Systems at this scale are computationally demanding and generally utilize more accurate neuron models, such as the Izhikevich and the Hodgkin-Huxley models, rather than the more popular integrate-and-fire model. We examine the feasibility of using graphics processing units (GPUs) to accelerate a spiking neural network based character recognition network to enable such large scale systems. Two versions of the network utilizing the Izhikevich and Hodgkin-Huxley models are implemented. Three NVIDIA general-purpose (GP) GPU platforms are examined, including the GeForce 9800 GX2, the Tesla C1060, and the Tesla S1070. Our results show that the GPGPUs can provide significant speedup over conventional processors. In particular, the fastest GPGPU utilized, the Tesla S1070, provided a speedup of 5.6 and 84.4 over highly optimized implementations on the fastest central processing unit (CPU) tested, a quad-core 2.67 GHz Xeon processor, for the Izhikevich and the Hodgkin-Huxley models, respectively. The CPU implementation utilized all four cores and the vector data parallelism offered by the processor. The results indicate that GPUs are well suited for this application domain.

  15. Patterns and Practices for Future Architectures

    DTIC Science & Technology

    2014-08-01

    Subject terms: computing architecture, graph algorithms, high-performance computing, big data, GPU. Excerpts from the list of figures: Figure 4: Data Structures Created by Kernel 1 of Single CPU, List Implementation Using the Graph in the Example from Section 1.2; Figure 5: Kernel 2 of Graph500 BFS Reference Implementation: Single CPU, List; Figure 6: Data Structures for Sequential CSR Algorithm.

  16. Conversion of Mass Storage Hierarchy in an IBM Computer Network

    DTIC Science & Technology

    1989-03-01

    Glossary excerpts: GUIDE: IBM users' group for DOS operating systems. IBM: International Business Machines. IBM 370/145: CPU introduced in 1970. IBM 370/168: CPU... Reference excerpts: February 12, 1985, Information Systems Group, International Business Machines Corporation. "IBM 3090 Processor Complex" and "... Mass Storage System," Mainframe Journal, pp. 15-26, 64-65, Dallas, Texas, September-October 1987. International Business Machines Corporation, Introduction to IBM 3380 Storage

  17. A parallel method of atmospheric correction for multispectral high spatial resolution remote sensing images

    NASA Astrophysics Data System (ADS)

    Zhao, Shaoshuai; Ni, Chen; Cao, Jing; Li, Zhengqiang; Chen, Xingfeng; Ma, Yan; Yang, Leiku; Hou, Weizhen; Qie, Lili; Ge, Bangyu; Liu, Li; Xing, Jin

    2018-03-01

    Remote sensing images are usually contaminated by atmospheric components, especially aerosol particles. For quantitative remote sensing applications, radiative-transfer-model-based atmospheric correction is used to obtain the surface reflectance by decoupling the atmosphere and the surface, which consumes a long computational time. Parallel computing is one way to accelerate this step. A parallel strategy in which multiple CPUs work simultaneously is designed to perform atmospheric correction of a multispectral remote sensing image. The flow of the parallel framework and the main parallel body of the atmospheric correction are described. Then, a multispectral remote sensing image from the Chinese Gaofen-2 satellite is used to test the acceleration efficiency. As the number of CPUs increases from 1 to 8, the computational speed increases accordingly. The maximum speedup is 6.5. With 8 CPUs, atmospheric correction of the whole image takes 4 minutes.
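
    The reported numbers imply a parallel efficiency that is easy to check: with speedup S(p) = T_1 / T_p,

    ```latex
    E(p) = \frac{S(p)}{p}, \qquad E(8) \approx \frac{6.5}{8} \approx 0.81,
    ```

    i.e., roughly 81% efficiency at 8 CPUs, the remainder presumably lost to the serial portions of the workflow and to I/O.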

  18. On localization attacks against cloud infrastructure

    NASA Astrophysics Data System (ADS)

    Ge, Linqiang; Yu, Wei; Sistani, Mohammad Ali

    2013-05-01

    One of the key characteristics of cloud computing is the device and location independence that enables the user to access systems regardless of their location. Because cloud computing relies heavily on resource sharing, it is vulnerable to cyber attacks. In this paper, we investigate a localization attack that enables the adversary to leverage central processing unit (CPU) resources to localize the physical location of the server used by victims. By increasing and reducing CPU usage through a malicious virtual machine (VM), the response time from the victim VM will increase and decrease correspondingly. In this way, by embedding a probing signal into the CPU usage and correlating the same pattern in the response time from the victim VM, the adversary can find the location of the victim VM. To determine attack accuracy, we investigate features in both the time and frequency domains. We conduct both theoretical and experimental studies to demonstrate the effectiveness of such an attack.
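
    The detection step sketched above amounts to correlating the injected CPU-usage pattern with the observed response-time series; in the time domain this is a normalized cross-correlation of the form (our notation, for illustration)

    ```latex
    r(\tau) \;=\;
      \frac{\sum_{t} \bigl(p_t - \bar p\bigr)\bigl(d_{t+\tau} - \bar d\bigr)}
           {\sqrt{\sum_{t} (p_t - \bar p)^2}\;\sqrt{\sum_{t} (d_{t+\tau} - \bar d)^2}},
    ```

    where p_t is the probe pattern and d_t the measured response time; a co-located victim shows a correlation peak well above that measured against unrelated servers.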

  19. Storage strategies of eddy-current FE-BI model for GPU implementation

    NASA Astrophysics Data System (ADS)

    Bardel, Charles; Lei, Naiguang; Udpa, Lalita

    2013-01-01

    In the past few years, graphical processing units (GPUs) have shown tremendous improvements in computational throughput over standard CPU architectures. However, this comes at the cost of restructuring algorithms to match the strengths and drawbacks of the GPU architecture. A major drawback is the limited device memory, and hence the storage of FE stiffness matrices on the GPU is important. In contrast to CPU storage, the GPU storage format has a significant influence on overall performance. This paper presents an investigation of a storage strategy in the implementation of a two-dimensional finite element-boundary integral (FE-BI) model for eddy-current NDE applications on GPU architecture. Specifically, the high-dimensional matrices are manipulated by examining the matrix structure and optimally splitting it into structurally independent component matrices for efficient storage and retrieval of each component. Results obtained using the proposed approach are compared to those of a conventional CPU implementation to validate the method.
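
    Why the storage layout matters so much on the GPU is easiest to see with the sparse matrix-vector product at the heart of iterative FE solves. The sketch below uses the common CSR format with real-valued entries for brevity (the paper's scheme instead splits the FE-BI system into structurally independent component matrices, which is a different storage choice):

    ```cuda
    // Sparse matrix-vector product y = A*x with A stored in CSR format.
    // One thread per row; how the nonzeros are laid out in memory governs
    // coalescing and therefore throughput, which is the point under study.
    __global__ void csrSpMV(const int* rowPtr, const int* colIdx,
                            const float* val, const float* x,
                            float* y, int nRows)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= nRows) return;

        float acc = 0.0f;
        for (int k = rowPtr[row]; k < rowPtr[row + 1]; ++k)
            acc += val[k] * x[colIdx[k]];
        y[row] = acc;
    }
    ```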

  20. Semiempirical Quantum Chemical Calculations Accelerated on a Hybrid Multicore CPU-GPU Computing Platform.

    PubMed

    Wu, Xin; Koslowski, Axel; Thiel, Walter

    2012-07-10

    In this work, we demonstrate that semiempirical quantum chemical calculations can be accelerated significantly by leveraging the graphics processing unit (GPU) as a coprocessor on a hybrid multicore CPU-GPU computing platform. Semiempirical calculations using the MNDO, AM1, PM3, OM1, OM2, and OM3 model Hamiltonians were systematically profiled for three types of test systems (fullerenes, water clusters, and solvated crambin) to identify the most time-consuming sections of the code. The corresponding routines were ported to the GPU and optimized employing both existing library functions and a GPU kernel that carries out a sequence of noniterative Jacobi transformations during pseudodiagonalization. The overall computation times for single-point energy calculations and geometry optimizations of large molecules were reduced by one order of magnitude for all methods, as compared to runs on a single CPU core.

  1. A GPU-based calculation using the three-dimensional FDTD method for electromagnetic field analysis.

    PubMed

    Nagaoka, Tomoaki; Watanabe, Soichi

    2010-01-01

    Numerical simulations with numerical human models using the finite-difference time domain (FDTD) method have recently been performed frequently in a number of fields in biomedical engineering. However, the FDTD calculation runs too slowly. We focus, therefore, on general-purpose programming on the graphics processing unit (GPGPU). The three-dimensional FDTD method was implemented on the GPU using the Compute Unified Device Architecture (CUDA). In this study, we used the NVIDIA Tesla C1060 as a GPGPU board. The performance of the GPU is evaluated in comparison with the performance of a conventional CPU and a vector supercomputer. The results indicate that three-dimensional FDTD calculations using a GPU can significantly reduce run time in comparison with a conventional CPU, even for a naive GPU implementation of the three-dimensional FDTD method, while the GPU/CPU speed ratio varies with the calculation domain and thread block size.
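
    The FDTD update itself is what maps so naturally onto the GPU: every cell is advanced by the same local stencil. A one-dimensional sketch is given below (the study uses the full 3-D scheme on a voxel human model; here the material coefficients are collapsed into generic constants ce and ch, and one common sign convention is assumed):

    ```cuda
    // Leapfrog update of a 1-D FDTD grid with E and H on staggered points.
    // ce and ch bundle the usual dt/(eps*dx) and dt/(mu*dx) factors.
    __global__ void updateE(float* ez, const float* hy, int n, float ce)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= 1 && i < n)
            ez[i] += ce * (hy[i] - hy[i - 1]);
    }

    __global__ void updateH(float* hy, const float* ez, int n, float ch)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n - 1)
            hy[i] += ch * (ez[i + 1] - ez[i]);
    }
    ```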

  2. CUDA Fortran acceleration for the finite-difference time-domain method

    NASA Astrophysics Data System (ADS)

    Hadi, Mohammed F.; Esmaeili, Seyed A.

    2013-05-01

    A detailed description of programming the three-dimensional finite-difference time-domain (FDTD) method to run on graphical processing units (GPUs) using CUDA Fortran is presented. Two FDTD-to-CUDA thread-block mapping designs are investigated and their performances compared. Comparative assessment of trade-offs between GPU's shared memory and L1 cache is also discussed. This presentation is for the benefit of FDTD programmers who work exclusively with Fortran and are reluctant to port their codes to C in order to utilize GPU computing. The derived CUDA Fortran code is compared with an optimized CPU version that runs on a workstation-class CPU to present a realistic GPU to CPU run time comparison and thus help in making better informed investment decisions on FDTD code redesigns and equipment upgrades. All analyses are mirrored with CUDA C simulations to put in perspective the present state of CUDA Fortran development.

  3. Near-Zero Emissions Oxy-Combustion Flue Gas Purification

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Minish Shah; Nich Degenstein; Monica Zanfir

    2012-06-30

    The objectives of this project were to carry out an experimental program to enable development and design of a near zero emissions (NZE) CO2 processing unit (CPU) for oxy-combustion plants burning high and low sulfur coals and to perform a commercial viability assessment. The NZE CPU was proposed to produce high purity CO2 from the oxy-combustion flue gas, to achieve a >95% CO2 capture rate, and to achieve near zero atmospheric emissions of criteria pollutants. Two SOx/NOx removal technologies were proposed depending on the SOx levels in the flue gas. The activated carbon process was proposed for power plants burning low sulfur coal and the sulfuric acid process was proposed for power plants burning high sulfur coal. For plants burning high sulfur coal, the sulfuric acid process would convert SOx and NOx into commercial grade sulfuric and nitric acid by-products, thus reducing operating costs associated with SOx/NOx removal. For plants burning low sulfur coal, investment in separate FGD and SCR equipment for producing high purity CO2 would not be needed. To achieve high CO2 capture rates, a hybrid process that combines a cold box and VPSA (vacuum pressure swing adsorption) was proposed. In the proposed hybrid process, up to 90% of the CO2 in the cold box vent stream would be recovered by CO2 VPSA and then recycled and mixed with the flue gas stream upstream of the compressor. The overall recovery from the process would be >95%. The activated carbon process was able to achieve simultaneous SOx and NOx removal in a single step. The removal efficiencies were >99.9% for SOx and >98% for NOx, thus exceeding the performance targets of >99% and >95%, respectively. The process was also found to be suitable for power plants burning both low and high sulfur coals. The sulfuric acid process did not meet the performance expectations. Although it could achieve high SOx (>99%) and NOx (>90%) removal efficiencies, it could not produce by-product sulfuric and nitric acids that meet the commercial product specifications. The sulfuric acid would have to be disposed of by neutralization, thus lowering the value of the technology to the same level as that of the activated carbon process. Therefore, it was decided to discontinue any further efforts on the sulfuric acid process. Because of encouraging results on the activated carbon process, it was decided to add a new subtask on testing this process in a dual bed continuous unit. A 40-day-long continuous operation test confirmed the excellent SOx/NOx removal efficiencies achieved in the batch operation. This test also indicated the need for further efforts on optimization of the adsorption-regeneration cycle to maintain the long-term activity of the activated carbon material at a higher level. The VPSA process was tested in a pilot unit. It achieved CO2 recovery of >95% and CO2 purity of >80% (by vol.) from simulated cold box feed streams. The overall CO2 recovery from the cold box VPSA hybrid process was projected to be >99% for plants with low air ingress (2%) and >97% for plants with high air ingress (10%). Economic analysis was performed to assess the value of the NZE CPU. The advantage of the NZE CPU over a conventional CPU is only apparent when CO2 capture and avoided costs are compared. For greenfield plants, the cost of avoided CO2 and the cost of captured CO2 are generally about 11-14% lower using the NZE CPU compared to using a conventional CPU. For older plants with high air intrusion, the cost of avoided CO2 and captured CO2 are about 18-24% lower using the NZE CPU. Lower capture costs for the NZE CPU are due to lower capital investment in FGD/SCR and higher CO2 capture efficiency. In summary, as a result of this project, we have developed one technology option for the NZE CPU based on the activated carbon process and the cold box-VPSA hybrid process. This technology is projected to work for both low and high sulfur coal plants. The NZE CPU technology is projected to achieve near zero stack emissions, produce high purity CO2 relatively free of trace impurities, and achieve a ~99% CO2 capture rate while lowering the CO2 capture costs.

  4. Radiation shielding evaluation of the BNCT treatment room at THOR: a TORT-coupled MCNP Monte Carlo simulation study.

    PubMed

    Chen, A Y; Liu, Y-W H; Sheu, R J

    2008-01-01

    This study investigates the radiation shielding design of the treatment room for boron neutron capture therapy at the Tsing Hua Open-pool Reactor using the "TORT-coupled MCNP" method. With this method, the computational efficiency is improved significantly, by two to three orders of magnitude compared to the analog Monte Carlo MCNP calculation. This makes the calculation feasible using a single CPU in less than 1 day. Further optimization of the photon weight windows leads to an additional 50-75% improvement in the overall computational efficiency.

  5. Unraveling Network-induced Memory Contention: Deeper Insights with Machine Learning

    DOE PAGES

    Groves, Taylor Liles; Grant, Ryan; Gonzales, Aaron; ...

    2017-11-21

    Remote Direct Memory Access (RDMA) is expected to be an integral communication mechanism for future exascale systems, enabling asynchronous data transfers so that applications may fully utilize CPU resources while simultaneously sharing data amongst remote nodes. We examine Network-induced Memory Contention (NiMC) on InfiniBand networks. We expose the interactions between RDMA, main memory and cache when applications and out-of-band services compete for memory resources. We then explore NiMC's resulting impact on application-level performance. For a range of hardware technologies and HPC workloads, we quantify NiMC and show that NiMC's impact grows with scale, resulting in up to 3X performance degradation at scales as small as 8K processes, even in applications that previously have been shown to be performance resilient in the presence of noise. In addition, this work examines the problem of predicting NiMC's impact on applications by leveraging machine learning and easily accessible performance counters. This approach provides additional insights about the root cause of NiMC and facilitates dynamic selection of potential solutions. Finally, we evaluated three potential techniques to reduce NiMC's impact, namely hardware offloading, core reservation and network throttling.

  6. Unraveling Network-induced Memory Contention: Deeper Insights with Machine Learning

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Groves, Taylor Liles; Grant, Ryan; Gonzales, Aaron

    Remote Direct Memory Access (RDMA) is expected to be an integral communication mechanism for future exascale systems, enabling asynchronous data transfers so that applications may fully utilize CPU resources while simultaneously sharing data amongst remote nodes. We examine Network-induced Memory Contention (NiMC) on InfiniBand networks. We expose the interactions between RDMA, main memory and cache when applications and out-of-band services compete for memory resources. We then explore NiMC's resulting impact on application-level performance. For a range of hardware technologies and HPC workloads, we quantify NiMC and show that NiMC's impact grows with scale, resulting in up to 3X performance degradation at scales as small as 8K processes, even in applications that previously have been shown to be performance resilient in the presence of noise. In addition, this work examines the problem of predicting NiMC's impact on applications by leveraging machine learning and easily accessible performance counters. This approach provides additional insights about the root cause of NiMC and facilitates dynamic selection of potential solutions. Finally, we evaluated three potential techniques to reduce NiMC's impact, namely hardware offloading, core reservation and network throttling.

  7. Application of Intel Many Integrated Core (MIC) accelerators to the Pleim-Xiu land surface scheme

    NASA Astrophysics Data System (ADS)

    Huang, Melin; Huang, Bormin; Huang, Allen H.

    2015-10-01

    The land-surface model (LSM) is one of the physics components in the Weather Research and Forecasting (WRF) model. The LSM takes atmospheric information from the surface layer scheme, radiative forcing from the radiation scheme, and precipitation forcing from the microphysics and convective schemes, together with internal information on the land's state variables and land-surface properties. The LSM provides heat and moisture fluxes over land points and sea-ice points. The Pleim-Xiu (PX) scheme is one such LSM. The PX LSM features three pathways for moisture fluxes: evapotranspiration, soil evaporation, and evaporation from wet canopies. To accelerate the computation of this scheme, we employ the Intel Xeon Phi Many Integrated Core (MIC) architecture, a many-core coprocessor designed for efficient parallelization and vectorization. Our results show that the MIC-based optimization of this scheme running on a Xeon Phi coprocessor 7120P improves performance by 2.3x and 11.7x compared with the original code running on one CPU socket (eight cores) and on one CPU core of an Intel Xeon E5-2670, respectively.

  8. An MPI-CUDA approach for hypersonic flows with detailed state-to-state air kinetics using a GPU cluster

    NASA Astrophysics Data System (ADS)

    Bonelli, Francesco; Tuttafesta, Michele; Colonna, Gianpiero; Cutrone, Luigi; Pascazio, Giuseppe

    2017-10-01

    This paper describes the most advanced results obtained in the context of fluid dynamic simulations of high-enthalpy flows using detailed state-to-state air kinetics. Thermochemical non-equilibrium, typical of supersonic and hypersonic flows, was modeled using both the accurate state-to-state approach and the multi-temperature model proposed by Park. The accuracy of the two thermochemical non-equilibrium models was assessed by comparing the results with experimental findings, showing better predictions by the state-to-state approach. To overcome the huge computational cost of the state-to-state model, a multiple-node GPU implementation, based on an MPI-CUDA approach, was employed, and a comprehensive code performance analysis is presented. Both the pure MPI-CPU and the MPI-CUDA implementations exhibit excellent scalability. GPUs outperform CPUs especially when the state-to-state approach is employed, with speed-ups of a single GPU with respect to a single-core CPU larger than 100 for both one MPI process and multiple MPI processes.

  9. An Adaptive Priority Tuning System for Optimized Local CPU Scheduling using BOINC Clients

    NASA Astrophysics Data System (ADS)

    Mnaouer, Adel B.; Ragoonath, Colin

    2010-11-01

    Volunteer Computing (VC) is a distributed computing model which utilizes idle CPU cycles from computing resources donated by volunteers who are connected through the Internet to form a very large-scale, loosely coupled high performance computing environment. Distributed volunteer computing environments such as the BOINC framework are concerned mainly with the efficient scheduling of the available resources to the applications which require them. The BOINC framework thus contains a number of scheduling policies/algorithms, both on the server side and on the client, which work together to maximize use of the available resources and to provide a degree of QoS in an environment which is highly volatile. This paper focuses on the BOINC client and introduces an adaptive priority tuning client-side middleware application which improves the execution times of Work Units (WUs) while maintaining an acceptable Maximum Response Time (MRT) for the end user. We have conducted extensive experiments with the proposed system and the results show a clear speedup of BOINC applications using our optimized middleware compared to running with the original BOINC client.

  10. Fog computing job scheduling optimization based on bees swarm

    NASA Astrophysics Data System (ADS)

    Bitam, Salim; Zeadally, Sherali; Mellouk, Abdelhamid

    2018-04-01

    Fog computing is a new computing architecture, composed of a set of near-user edge devices called fog nodes, which collaborate in order to perform computational services such as running applications, storing large amounts of data, and transmitting messages. Fog computing extends cloud computing by deploying digital resources at the premises of mobile users. In this new paradigm, management and operating functions, such as job scheduling, aim at providing high-performance, cost-effective services requested by mobile users and executed by fog nodes. We propose a new bio-inspired optimization approach called the Bees Life Algorithm (BLA) to address the job scheduling problem in the fog computing environment. Our proposed approach is based on the optimized distribution of a set of tasks among all the fog computing nodes. The objective is to find an optimal tradeoff between the CPU execution time and the allocated memory required by fog computing services established by mobile users. Our empirical performance evaluation results demonstrate that the proposed approach outperforms traditional particle swarm optimization and genetic algorithms in terms of CPU execution time and allocated memory.
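    The scheduling objective described above, a weighted trade-off between CPU execution time and allocated memory, can be illustrated with a small sketch. The Python snippet below is a hypothetical toy (invented node speeds, task sizes and weights) and uses a plain random search over task-to-node assignments rather than the authors' Bees Life Algorithm; it is meant only to show the kind of fitness function such a scheduler optimizes.

        import random

        # Hypothetical fog nodes: (CPU speed in MIPS, available memory in MB).
        NODES = [(2000, 512), (3500, 1024), (1500, 256), (4000, 2048)]
        # Hypothetical tasks: (length in million instructions, memory demand in MB).
        TASKS = [(8000, 128), (12000, 256), (5000, 64), (20000, 512), (9000, 128)]

        def fitness(assignment, w_time=0.5, w_mem=0.5):
            """Weighted trade-off between execution time and memory load (lower is better)."""
            node_time = [0.0] * len(NODES)
            node_mem = [0.0] * len(NODES)
            for task_idx, node_idx in enumerate(assignment):
                length_mi, mem_mb = TASKS[task_idx]
                node_time[node_idx] += length_mi / NODES[node_idx][0]
                node_mem[node_idx] += mem_mb
            time_term = max(node_time)  # busiest node's total execution time
            mem_term = max(m / NODES[i][1] for i, m in enumerate(node_mem))  # worst memory load fraction
            return w_time * time_term + w_mem * mem_term

        def random_search(iterations=2000, seed=0):
            """Stand-in optimizer: try random task-to-node assignments, keep the best one."""
            rng = random.Random(seed)
            best, best_fit = None, float("inf")
            for _ in range(iterations):
                candidate = [rng.randrange(len(NODES)) for _ in TASKS]
                value = fitness(candidate)
                if value < best_fit:
                    best, best_fit = candidate, value
            return best, best_fit

        assignment, value = random_search()
        print("best assignment:", assignment, "fitness:", round(value, 3))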

  11. Optimizing SIEM Throughput on the Cloud Using Parallelization.

    PubMed

    Alam, Masoom; Ihsan, Asif; Khan, Muazzam A; Javaid, Qaisar; Khan, Abid; Manzoor, Jawad; Akhundzada, Adnan; Khan, Muhammad Khurram; Farooq, Sajid

    2016-01-01

    Processing large amounts of data in real time to identify security issues poses several performance challenges, especially when hardware infrastructure is limited. Managed Security Service Providers (MSSP), mostly hosting their applications on the Cloud, receive events at a very high rate that varies from a few hundred to a couple of thousand events per second (EPS). It is critical to process this data efficiently, so that attacks can be identified quickly and the necessary response can be initiated. This paper evaluates the performance of a security framework, OSTROM, built on the Esper complex event processing (CEP) engine, under parallel and non-parallel computational frameworks. We explain three architectures under which Esper can be used to process events. We investigated the effect on throughput, memory and CPU usage in each configuration setting. The results indicate that the performance of the engine is limited by the number of events coming in rather than the queries being processed. The architecture where 1/4th of the total events are submitted to each instance and all the queries are processed by all the units shows the best results in terms of throughput, memory and CPU usage.

  12. Thermal Hotspots in CPU Die and Its Future Architecture

    NASA Astrophysics Data System (ADS)

    Wang, Jian; Hu, Fu-Yuan

    Owing to increasing core frequency and chip integration and the limited die dimensions, power densities in the CPU chip have been increasing rapidly. The high on-chip temperatures caused by these power densities threaten the processor's performance and the chip's reliability. This paper analyzes the thermal hotspots in the die and their properties. A new arrangement of the functional units in the die, a distributed hot-unit architecture, is suggested to cope with the problem of high power density in future processor chips.

  13. Restricted Collision List method for faster Direct Simulation Monte-Carlo (DSMC) collisions

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Macrossan, Michael N., E-mail: m.macrossan@uq.edu.au

    The ‘Restricted Collision List’ (RCL) method for speeding up the calculation of DSMC Variable Soft Sphere (VSS) collisions, with Borgnakke–Larsen (BL) energy exchange, is presented. The method cuts down considerably on the number of random collision parameters which must be calculated (deflection and azimuthal angles, and the BL energy exchange factors). A relatively short list of these parameters is generated and the parameters required in any cell are selected from this list. The list is regenerated at intervals approximately equal to the smallest mean collision time in the flow, and the chance of any particle re-using the same collision parameters in two successive collisions is negligible. The results using this method are indistinguishable from those obtained with standard DSMC. The CPU time saving depends on how much of a DSMC calculation is devoted to collisions and how much is devoted to other tasks, such as moving particles and calculating particle interactions with flow boundaries. For 1-dimensional calculations of flow in a tube, the new method saves 20% of the CPU time per collision for VSS scattering with no energy exchange. With RCL applied to rotational energy exchange, the CPU saving can be greater; for small values of the rotational collision number, for which most collisions involve some rotational energy exchange, the CPU time may be reduced by 50% or more.
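    As described above, the RCL method draws deflection angles, azimuthal angles and BL exchange factors from a short pre-generated list instead of sampling fresh values for every collision, and regenerates the list roughly once per smallest mean collision time. The Python sketch below is a simplified, hypothetical illustration of that bookkeeping; the list size, regeneration interval, number of collisions per step and the sampling formulas are placeholders, not the paper's values.

        import math
        import random

        rng = random.Random(42)
        LIST_SIZE = 1000            # placeholder length of the restricted collision list

        def draw_parameters():
            """One set of random collision parameters (simplified placeholder sampling)."""
            cos_chi = 2.0 * rng.random() - 1.0   # deflection angle cosine
            phi = 2.0 * math.pi * rng.random()   # azimuthal angle
            bl_factor = rng.random()             # Borgnakke-Larsen energy-exchange factor
            return cos_chi, phi, bl_factor

        def regenerate_list():
            return [draw_parameters() for _ in range(LIST_SIZE)]

        def collide(collision_list):
            """Pick parameters from the restricted list instead of sampling them afresh."""
            cos_chi, phi, bl_factor = collision_list[rng.randrange(LIST_SIZE)]
            # ... apply VSS scattering and BL energy exchange to the colliding pair here ...
            return cos_chi, phi, bl_factor

        # Main loop: regenerate the list roughly once per smallest mean collision time.
        STEPS_PER_REGENERATION = 10          # placeholder: step count spanning that interval
        collision_list = regenerate_list()
        for step in range(100):
            if step % STEPS_PER_REGENERATION == 0:
                collision_list = regenerate_list()
            for _ in range(500):             # placeholder number of collisions this step
                collide(collision_list)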

  14. GeantV: from CPU to accelerators

    NASA Astrophysics Data System (ADS)

    Amadio, G.; Ananya, A.; Apostolakis, J.; Arora, A.; Bandieramonte, M.; Bhattacharyya, A.; Bianchini, C.; Brun, R.; Canal, P.; Carminati, F.; Duhem, L.; Elvira, D.; Gheata, A.; Gheata, M.; Goulas, I.; Iope, R.; Jun, S.; Lima, G.; Mohanty, A.; Nikitina, T.; Novak, M.; Pokorski, W.; Ribon, A.; Sehgal, R.; Shadura, O.; Vallecorsa, S.; Wenzel, S.; Zhang, Y.

    2016-10-01

    The GeantV project aims to research and develop the next-generation simulation software describing the passage of particles through matter. While modern CPU architectures are being targeted first, resources such as GPGPUs, Intel Xeon Phi, Atom or ARM cannot be ignored anymore by HEP CPU-bound applications. The proof-of-concept GeantV prototype has been engineered mainly for CPUs with vector units, but we have foreseen from the early stages a bridge to arbitrary accelerators. A software layer consisting of architecture/technology-specific backends currently supports this concept. This approach allows us not only to abstract out basic types such as scalar/vector, but also to formalize generic computation kernels using, transparently, library- or device-specific constructs based on Vc, CUDA, Cilk+ or Intel intrinsics. While the main goal of this approach is portable performance, as a bonus it insulates the core application and algorithms from the technology layer. This keeps our application maintainable in the long term and resilient to changes on the backend side. The paper presents the first results of basket-based GeantV geometry navigation on the Intel Xeon Phi KNC architecture. We present the scalability and vectorization study, conducted using Intel performance tools, as well as our preliminary conclusions on the use of accelerators for GeantV transport. We also describe the current work and preliminary results for using the GeantV transport kernel on GPUs.
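    The backend idea described above, generic kernels written once against an abstract scalar/vector type, can be illustrated with a small analogue. The sketch below is hypothetical Python (the real GeantV backends are C++ template layers over Vc, CUDA and intrinsics): the same kernel source runs unchanged on a scalar backend and on a NumPy array backend.

        import math
        import numpy as np

        class ScalarBackend:
            """Operates on one value at a time."""
            @staticmethod
            def sqrt(x):
                return math.sqrt(x)

        class VectorBackend:
            """Operates on whole arrays (stand-in for a SIMD or accelerator backend)."""
            @staticmethod
            def sqrt(x):
                return np.sqrt(x)

        def distance_kernel(backend, x, y, z):
            """Generic kernel written once: distance from the origin, for any backend."""
            return backend.sqrt(x * x + y * y + z * z)

        # Scalar use of the kernel.
        print(distance_kernel(ScalarBackend, 1.0, 2.0, 2.0))            # 3.0
        # Vectorized use of the very same kernel source.
        xs, ys, zs = np.array([1.0, 0.0]), np.array([2.0, 3.0]), np.array([2.0, 4.0])
        print(distance_kernel(VectorBackend, xs, ys, zs))               # [3. 5.]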

  15. CPU/GPU Computing for an Implicit Multi-Block Compressible Navier-Stokes Solver on Heterogeneous Platform

    NASA Astrophysics Data System (ADS)

    Deng, Liang; Bai, Hanli; Wang, Fang; Xu, Qingxin

    2016-06-01

    CPU/GPU computing allows scientists to tremendously accelerate their numerical codes. In this paper, we port and optimize a double precision alternating direction implicit (ADI) solver for the three-dimensional compressible Navier-Stokes equations from our in-house Computational Fluid Dynamics (CFD) software onto a heterogeneous platform. First, we implement a full GPU version of the ADI solver to remove redundant data transfers between CPU and GPU, and then design two fine-grain schemes, namely "one-thread-one-point" and "one-thread-one-line", to maximize performance. Second, we present a dual-level parallelization scheme using a CPU/GPU collaborative model to exploit the computational resources of both the multi-core CPUs and the many-core GPUs within the heterogeneous platform. Finally, considering that memory on a single node becomes inadequate when the simulation size grows, we present a tri-level hybrid programming pattern, MPI-OpenMP-CUDA, that merges fine-grain parallelism using OpenMP and CUDA threads with coarse-grain parallelism using MPI for inter-node communication. We also propose a strategy to overlap computation with communication using advanced features of CUDA and MPI programming. We obtain a speedup of 6.0 for the ADI solver on one Tesla M2050 GPU compared with two Xeon X5670 CPUs. Scalability tests show that our implementation can offer significant performance improvement on a heterogeneous platform.

  16. Bayer image parallel decoding based on GPU

    NASA Astrophysics Data System (ADS)

    Hu, Rihui; Xu, Zhiyong; Wei, Yuxing; Sun, Shaohua

    2012-11-01

    In photoelectric tracking systems, Bayer images are traditionally decoded on the CPU. However, this becomes too slow when the images are large, for example 2K×2K×16 bit. In order to accelerate Bayer image decoding, this paper introduces a parallel speedup method for NVIDIA Graphics Processing Units (GPUs) supporting the CUDA architecture. The decoding procedure can be divided into three parts: a serial part, a task-parallel part, and a data-parallel part comprising inverse quantization, the inverse discrete wavelet transform (IDWT) and image post-processing. To reduce execution time, the task-parallel part is optimized with OpenMP techniques. The data-parallel part gains efficiency by executing on the GPU as a CUDA parallel program. The optimization techniques include instruction optimization, shared-memory access optimization, coalesced memory access optimization and texture memory optimization. In particular, the IDWT is significantly sped up by rewriting the two-dimensional (2D) serial IDWT as a 1D parallel IDWT. In experiments with a 1K×1K×16 bit Bayer image, the data-parallel part is more than 10 times faster than the CPU-based implementation. Finally, a CPU+GPU heterogeneous decompression system was designed. The experimental results show that it achieves a 3 to 5 times speed increase compared to the serial CPU method.

  17. Efficient Scalable Median Filtering Using Histogram-Based Operations.

    PubMed

    Green, Oded

    2018-05-01

    Median filtering is a smoothing technique for noise removal in images. While there are various implementations of median filtering for a single-core CPU, there are few implementations for accelerators and multi-core systems. Many parallel implementations of median filtering use a sorting algorithm for rearranging the values within a filtering window and taking the median of the sorted values. While using sorting algorithms allows for simple parallel implementations, the cost of the sorting becomes prohibitive as the filtering windows grow. This makes such algorithms, sequential and parallel alike, inefficient. In this work, we introduce the first software parallel median filtering that is non-sorting-based. The new algorithm uses efficient histogram-based operations. These reduce the computational requirements of the new algorithm while also accessing the image fewer times. We show an implementation of our algorithm for both the CPU and NVIDIA's CUDA-supported graphics processing unit (GPU). The new algorithm is compared with several other leading CPU and GPU implementations. The CPU implementation shows near-perfect linear scaling on a quad-core system. The GPU implementation is several orders of magnitude faster than the other GPU implementations for mid-size median filters. For small kernels, comparison-based approaches are preferable as fewer operations are required. Lastly, the new algorithm is open-source and can be found in the OpenCV library.
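    The key idea the abstract refers to, locating the median from a window histogram by a cumulative count instead of sorting, can be sketched in a few lines. The code below is a plain scalar illustration for 8-bit 1D data in the spirit of the classic sliding-window histogram (Huang-style) median; it is not the paper's parallel 2D algorithm, and the clamped edge handling is an arbitrary choice.

        import numpy as np

        def sliding_median_1d(signal, radius):
            """Median filter an 8-bit 1D signal with a (2*radius+1) window using a histogram."""
            signal = np.asarray(signal, dtype=np.uint8)
            n = len(signal)
            window = 2 * radius + 1
            half = window // 2 + 1                 # rank of the median within the window
            out = np.empty(n, dtype=np.uint8)

            # Build the initial histogram for the first window (edges clamped).
            hist = np.zeros(256, dtype=np.int32)
            for k in range(-radius, radius + 1):
                hist[signal[min(max(k, 0), n - 1)]] += 1

            for i in range(n):
                # Locate the median by walking the cumulative histogram.
                count = 0
                for value in range(256):
                    count += hist[value]
                    if count >= half:
                        out[i] = value
                        break
                # Slide the window: remove the leftmost sample, add the next one (clamped).
                leaving = signal[min(max(i - radius, 0), n - 1)]
                entering = signal[min(max(i + radius + 1, 0), n - 1)]
                hist[leaving] -= 1
                hist[entering] += 1
            return out

        noisy = np.array([10, 200, 12, 11, 13, 250, 12, 10], dtype=np.uint8)
        print(sliding_median_1d(noisy, radius=1))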

  18. AMITIS: A 3D GPU-Based Hybrid-PIC Model for Space and Plasma Physics

    NASA Astrophysics Data System (ADS)

    Fatemi, Shahab; Poppe, Andrew R.; Delory, Gregory T.; Farrell, William M.

    2017-05-01

    We have developed, for the first time, an advanced modeling infrastructure in space simulations (AMITIS) with an embedded three-dimensional self-consistent grid-based hybrid model of plasma (kinetic ions and fluid electrons) that runs entirely on graphics processing units (GPUs). The model uses NVIDIA GPUs and their associated parallel computing platform, CUDA, developed for general purpose processing on GPUs. The model uses a single CPU-GPU pair, where the CPU transfers data between the system and GPU memory, executes CUDA kernels, and writes simulation outputs on the disk. All computations, including moving particles, calculating macroscopic properties of particles on a grid, and solving hybrid model equations are processed on a single GPU. We explain various computing kernels within AMITIS and compare their performance with an already existing well-tested hybrid model of plasma that runs in parallel using multi-CPU platforms. We show that AMITIS runs ∼10 times faster than the parallel CPU-based hybrid model. We also introduce an implicit solver for computation of Faraday’s Equation, resulting in an explicit-implicit scheme for the hybrid model equation. We show that the proposed scheme is stable and accurate. We examine the AMITIS energy conservation and show that the energy is conserved with an error < 0.2% after 500,000 timesteps, even when a very low number of particles per cell is used.

  19. Use of a graphics processing unit (GPU) to facilitate real-time 3D graphic presentation of the patient skin-dose distribution during fluoroscopic interventional procedures

    PubMed Central

    Rana, Vijay; Rudin, Stephen; Bednarek, Daniel R.

    2012-01-01

    We have developed a dose-tracking system (DTS) that calculates the radiation dose to the patient’s skin in real-time by acquiring exposure parameters and imaging-system-geometry from the digital bus on a Toshiba Infinix C-arm unit. The cumulative dose values are then displayed as a color map on an OpenGL-based 3D graphic of the patient for immediate feedback to the interventionalist. Determination of those elements on the surface of the patient 3D-graphic that intersect the beam and calculation of the dose for these elements in real time demands fast computation. Reducing the size of the elements results in more computation load on the computer processor and therefore a tradeoff occurs between the resolution of the patient graphic and the real-time performance of the DTS. The speed of the DTS for calculating dose to the skin is limited by the central processing unit (CPU) and can be improved by using the parallel processing power of a graphics processing unit (GPU). Here, we compare the performance speed of GPU-based DTS software to that of the current CPU-based software as a function of the resolution of the patient graphics. Results show a tremendous improvement in speed using the GPU. While an increase in the spatial resolution of the patient graphics resulted in slowing down the computational speed of the DTS on the CPU, the speed of the GPU-based DTS was hardly affected. This GPU-based DTS can be a powerful tool for providing accurate, real-time feedback about patient skin-dose to physicians while performing interventional procedures. PMID:24027616

  20. GPU based framework for geospatial analyses

    NASA Astrophysics Data System (ADS)

    Cosmin Sandric, Ionut; Ionita, Cristian; Dardala, Marian; Furtuna, Titus

    2017-04-01

    Parallel processing on multiple CPU cores is already used at large scale in geocomputing, but parallel processing on graphics cards is just at the beginning. Being able to use a simple laptop with a dedicated graphics card for advanced and very fast geocomputation is an advantage that every scientist wants to have. The need for high-speed computation in the geosciences has increased in the last 10 years, mostly due to the growth of the available datasets. These datasets are becoming more and more detailed and hence require more space to store and more time to process. Distributed computation on multicore CPUs and GPUs plays an important role by processing these big datasets in small parts. This way of computing speeds up the process because, instead of using just one process for each dataset, the user can employ all the cores of a CPU or up to hundreds of cores of a GPU. The framework provides the end user with standalone tools for morphometric analyses at multiple scales. An important part of the framework is dedicated to uncertainty propagation in geospatial analyses. The uncertainty may come from data collection, may be induced by the model, or may have other sources. These uncertainties play an important role when the spatial delineation of a phenomenon is modelled. Uncertainty propagation is implemented inside the GPU framework using Monte Carlo simulations. The GPU framework with its standalone tools proved to be a reliable tool for modelling complex natural phenomena. The framework is based on NVIDIA CUDA technology and is written in the C++ programming language. The source code will be available on GitHub at https://github.com/sandricionut/GeoRsGPU Acknowledgement: GPU framework for geospatial analysis, Young Researchers Grant (ICUB-University of Bucharest) 2016, director Ionut Sandric
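    Monte Carlo uncertainty propagation of the kind mentioned above can be sketched simply: perturb the input raster with a noise model, recompute the derived quantity for each realization, and summarize the spread per cell. The NumPy example below is a hypothetical, CPU-only illustration with a synthetic elevation grid, Gaussian noise and a simple slope operator; it is not the GeoRsGPU implementation.

        import numpy as np

        rng = np.random.default_rng(0)

        # Synthetic 100 x 100 elevation grid (metres): a gentle ramp plus a bump.
        y, x = np.mgrid[0:100, 0:100]
        dem = 0.5 * x + 20.0 * np.exp(-((x - 50) ** 2 + (y - 50) ** 2) / 200.0)

        def slope_deg(z, cell=10.0):
            """Slope in degrees from finite differences (cell size in metres)."""
            dzdy, dzdx = np.gradient(z, cell)
            return np.degrees(np.arctan(np.hypot(dzdx, dzdy)))

        # Monte Carlo propagation: perturb the DEM with Gaussian noise (sigma = 1 m),
        # recompute the slope for each realization, then summarize mean and spread.
        n_realizations = 200
        sigma_m = 1.0
        stack = np.empty((n_realizations,) + dem.shape)
        for i in range(n_realizations):
            stack[i] = slope_deg(dem + rng.normal(0.0, sigma_m, size=dem.shape))

        slope_mean = stack.mean(axis=0)
        slope_std = stack.std(axis=0)     # per-cell uncertainty of the derived slope
        print("mean slope (deg):", round(float(slope_mean.mean()), 2))
        print("typical per-cell std (deg):", round(float(slope_std.mean()), 2))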

  1. Use of a graphics processing unit (GPU) to facilitate real-time 3D graphic presentation of the patient skin-dose distribution during fluoroscopic interventional procedures.

    PubMed

    Rana, Vijay; Rudin, Stephen; Bednarek, Daniel R

    2012-02-23

    We have developed a dose-tracking system (DTS) that calculates the radiation dose to the patient's skin in real-time by acquiring exposure parameters and imaging-system-geometry from the digital bus on a Toshiba Infinix C-arm unit. The cumulative dose values are then displayed as a color map on an OpenGL-based 3D graphic of the patient for immediate feedback to the interventionalist. Determination of those elements on the surface of the patient 3D-graphic that intersect the beam and calculation of the dose for these elements in real time demands fast computation. Reducing the size of the elements results in more computation load on the computer processor and therefore a tradeoff occurs between the resolution of the patient graphic and the real-time performance of the DTS. The speed of the DTS for calculating dose to the skin is limited by the central processing unit (CPU) and can be improved by using the parallel processing power of a graphics processing unit (GPU). Here, we compare the performance speed of GPU-based DTS software to that of the current CPU-based software as a function of the resolution of the patient graphics. Results show a tremendous improvement in speed using the GPU. While an increase in the spatial resolution of the patient graphics resulted in slowing down the computational speed of the DTS on the CPU, the speed of the GPU-based DTS was hardly affected. This GPU-based DTS can be a powerful tool for providing accurate, real-time feedback about patient skin-dose to physicians while performing interventional procedures.

  2. GPU acceleration of Runge Kutta-Fehlberg and its comparison with Dormand-Prince method

    NASA Astrophysics Data System (ADS)

    Seen, Wo Mei; Gobithaasan, R. U.; Miura, Kenjiro T.

    2014-07-01

    There has been a significant reduction in processing time and a speedup of performance in computer graphics with the emergence of Graphics Processing Units (GPUs). GPUs have been developed to surpass the Central Processing Unit (CPU) in terms of performance and processing speed. This evolution has opened up a new area of computing and research where the highly parallel GPU is used for non-graphical algorithms. Physical simulations and modelling can be accelerated through General Purpose Graphics Processing Unit (GPGPU) and Compute Unified Device Architecture (CUDA) implementations. These phenomena can be represented with mathematical models in the form of Ordinary Differential Equations (ODEs), which capture the rate of change of dependent variables with respect to independent variables. ODEs are numerically integrated over time in order to simulate these behaviours. The classical Runge-Kutta (RK) scheme is the common method used to numerically solve ODEs. The Runge-Kutta-Fehlberg (RKF) scheme was developed specifically to provide an estimate of the principal local truncation error at each step, known as the embedded estimate technique. This paper delves into the implementation of the RKF scheme for GPU devices and compares its results with the Dormand-Prince method. A pseudo code is developed to show the implementation in detail, so that practitioners can understand the data allocation on the GPU, the formation of RKF kernels and the flow of data to/from the GPU and CPU upon RKF kernel evaluation. The pseudo code is then written in the C language and two ODE models are executed to show the achievable speedup compared to the CPU implementation. The accuracy and efficiency of the proposed implementation method are discussed in the final section of this paper.
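    The embedded-estimate idea mentioned above, two Runge-Kutta results of different order built from the same stage evaluations so that their difference estimates the local truncation error, is shown below with the simplest embedded pair (Heun-Euler, orders 2 and 1) rather than the full Fehlberg 4(5) tableau. This is a minimal CPU-side sketch with an assumed test ODE, tolerance and step-size rule; it is not the paper's GPU pseudo code.

        def heun_euler_step(f, t, y, h):
            """One embedded step: 2nd-order Heun result plus a 1st-order Euler error estimate."""
            k1 = f(t, y)
            k2 = f(t + h, y + h * k1)
            y_high = y + 0.5 * h * (k1 + k2)    # order 2 (Heun)
            y_low = y + h * k1                  # order 1 (Euler), reuses k1
            return y_high, abs(y_high - y_low)  # difference estimates the local truncation error

        def integrate(f, t0, y0, t_end, h=0.1, tol=1e-4):
            """Adaptive integration: shrink/grow the step from the embedded error estimate."""
            t, y = t0, y0
            while t < t_end:
                h = min(h, t_end - t)
                y_new, err = heun_euler_step(f, t, y, h)
                if err <= tol:                  # accept the step
                    t, y = t + h, y_new
                # standard step-size update for an order-1 embedded error estimate
                h *= min(2.0, max(0.2, 0.9 * (tol / max(err, 1e-16)) ** 0.5))
            return y

        # Test ODE y' = -y, y(0) = 1; the exact solution at t = 1 is exp(-1) ~ 0.3679.
        print(integrate(lambda t, y: -y, 0.0, 1.0, 1.0))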

  3. Coding for parallel execution of hardware-in-the-loop millimeter-wave scene generation models on multicore SIMD processor architectures

    NASA Astrophysics Data System (ADS)

    Olson, Richard F.

    2013-05-01

    Rendering of point-scatterer-based radar scenes for millimeter wave (mmW) seeker tests in real-time hardware-in-the-loop (HWIL) scene generation requires efficient algorithms and vector-friendly computer architectures for complex signal synthesis. New processor technology from Intel implements an extended 256-bit vector SIMD instruction set (AVX, AVX2) in a multi-core CPU design providing peak execution rates of hundreds of GigaFLOPS (GFLOPS) on one chip. Real-world mmW scene generation code can approach peak SIMD execution rates only after careful algorithm and source code design. An effective software design will maintain high computing intensity, emphasizing register-to-register SIMD arithmetic operations over data movement between CPU caches or off-chip memories. Engineers at the U.S. Army Aviation and Missile Research, Development and Engineering Center (AMRDEC) applied two basic parallel coding methods to assess new 256-bit SIMD multi-core architectures for mmW scene generation in HWIL. These include the use of POSIX threads built on vector library functions and more portable, high-level parallel code based on compiler technology (e.g. OpenMP pragmas and SIMD autovectorization). Since CPU technology is rapidly advancing toward high processor core counts and TeraFLOPS peak SIMD execution rates, it is imperative that coding methods be identified which produce efficient and maintainable parallel code. This paper describes the algorithms used in point-scatterer target model rendering, the parallelization of those algorithms, and the execution performance achieved on an AVX multi-core machine using the two basic parallel coding methods. The paper concludes with estimates for scale-up performance on upcoming multi-core technology.

  4. Algorithms of GPU-enabled reactive force field (ReaxFF) molecular dynamics.

    PubMed

    Zheng, Mo; Li, Xiaoxia; Guo, Li

    2013-04-01

    Reactive force field (ReaxFF), a recent and novel bond order potential, allows for reactive molecular dynamics (ReaxFF MD) simulations for modeling larger and more complex molecular systems involving chemical reactions when compared with computation-intensive quantum mechanical methods. However, ReaxFF MD can be approximately 10-50 times slower than classical MD due to its explicit modeling of bond forming and breaking, the dynamic charge equilibration at each time-step, and its time-step being one order of magnitude smaller than that of classical MD, all of which pose significant computational challenges in reaching spatio-temporal scales of nanometers and nanoseconds. Recent advances in graphics processing units (GPUs) provide not only highly favorable performance for GPU-enabled MD programs compared with CPU implementations, but also an opportunity to cope with the computing power and memory demands imposed on computer hardware by ReaxFF MD. In this paper, we present the algorithms of GMD-Reax, the first GPU-enabled ReaxFF MD program with significantly improved performance surpassing CPU implementations on desktop workstations. The performance of GMD-Reax has been benchmarked on a PC equipped with a NVIDIA C2050 GPU for coal pyrolysis simulation systems with atoms ranging from 1378 to 27,283. GMD-Reax achieved speedups as high as 12 times over Duin et al.'s FORTRAN codes in LAMMPS on 8 CPU cores and 6 times over the LAMMPS C codes based on PuReMD, in terms of the simulation time per time-step averaged over 100 steps. GMD-Reax could be used as a new and efficient computational tool for exploring very complex molecular reactions via ReaxFF MD simulation on desktop workstations. Copyright © 2013 Elsevier Inc. All rights reserved.

  5. Cloud based intelligent system for delivering health care as a service.

    PubMed

    Kaur, Pankaj Deep; Chana, Inderveer

    2014-01-01

    The promising potential of cloud computing and its convergence with technologies such as mobile computing, wireless networks, and sensor technologies allows for the creation and delivery of newer types of cloud services. In this paper, we advocate the use of cloud computing for the creation and management of cloud based health care services. As a representative case study, we design a Cloud Based Intelligent Health Care Service (CBIHCS) that performs real time monitoring of user health data for diagnosis of chronic illness such as diabetes. Advanced body sensor components are utilized to gather user specific health data and store it in cloud based storage repositories for subsequent analysis and classification. In addition, infrastructure level mechanisms are proposed to provide dynamic resource elasticity for CBIHCS. Experimental results demonstrate that a classification accuracy of 92.59% is achieved with our prototype system and that the predicted patterns of CPU usage offer better opportunities for adaptive resource elasticity. Copyright © 2013 Elsevier Ireland Ltd. All rights reserved.

  6. JAliEn - A new interface between the AliEn jobs and the central services

    NASA Astrophysics Data System (ADS)

    Grigoras, A. G.; Grigoras, C.; Pedreira, M. M.; Saiz, P.; Schreiner, S.

    2014-06-01

    Since the ALICE experiment began data taking in early 2010, the amount of end user jobs on the AliEn Grid has increased significantly. Presently 1/3 of the 40K CPU cores available to ALICE are occupied by jobs submitted by about 400 distinct users, individually or in organized analysis trains. The overall stability of the AliEn middleware has been excellent throughout the 3 years of running, but the massive amount of end-user analysis and its specific requirements and load has revealed a few components which can be improved. One of them is the interface between users and the central AliEn services (catalogue, job submission system), which we are currently re-implementing in Java. The interface provides a persistent connection with enhanced data and job submission authenticity. In this paper we describe the architecture of the new interface, the ROOT binding which enables the use of a single interface in addition to the standard UNIX-like access shell, and the new security-related features.

  7. A fast parallel clustering algorithm for molecular simulation trajectories.

    PubMed

    Zhao, Yutong; Sheong, Fu Kit; Sun, Jian; Sander, Pedro; Huang, Xuhui

    2013-01-15

    We implemented a GPU-powered parallel k-centers algorithm to perform clustering on the conformations of molecular dynamics (MD) simulations. The algorithm is up to two orders of magnitude faster than the CPU implementation. We tested our algorithm on four protein MD simulation datasets ranging from the small Alanine Dipeptide to a 370-residue Maltose Binding Protein (MBP). It is capable of grouping 250,000 conformations of the MBP into 4000 clusters within 40 seconds. To achieve this, we effectively parallelized the code on the GPU and utilized the triangle inequality of metric spaces. Furthermore, the algorithm's running time is linear with respect to the number of cluster centers. In addition, we found the triangle inequality to be less effective in higher dimensions and provide a mathematical rationale. Finally, using Alanine Dipeptide as an example, we show a strong correlation between cluster populations resulting from the k-centers algorithm and the underlying density. Copyright © 2012 Wiley Periodicals, Inc.
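    The triangle-inequality shortcut mentioned above can be sketched as follows: keep each point's distance to its assigned center, and when a new center c is added, recompute d(x, c) only if d(assigned(x), c) < 2 * d(x, assigned(x)), since otherwise c cannot be closer. The NumPy sketch below is a serial toy on random Euclidean vectors; it is not the paper's GPU/RMSD implementation, and the data sizes are invented.

        import numpy as np

        def k_centers(points, k, seed=0):
            """Greedy k-centers with a triangle-inequality skip on distance updates."""
            rng = np.random.default_rng(seed)
            n = len(points)
            centers = [rng.integers(n)]                 # first center chosen at random
            d_near = np.linalg.norm(points - points[centers[0]], axis=1)
            assign = np.zeros(n, dtype=int)             # index (into centers) of each point's center
            skipped = 0

            for _ in range(1, k):
                new = int(np.argmax(d_near))            # farthest point becomes the next center
                centers.append(new)
                c_to_new = np.linalg.norm(points[[centers[i] for i in assign]] - points[new], axis=1)
                # Triangle inequality: if d(old_center, new_center) >= 2*d(x, old_center),
                # the new center cannot be closer, so skip recomputing d(x, new_center).
                must_check = c_to_new < 2.0 * d_near
                skipped += int((~must_check).sum())
                d_new = np.linalg.norm(points[must_check] - points[new], axis=1)
                closer = d_new < d_near[must_check]
                idx = np.where(must_check)[0][closer]
                d_near[idx] = d_new[closer]
                assign[idx] = len(centers) - 1
            return centers, assign, skipped

        pts = np.random.default_rng(1).normal(size=(2000, 10))
        centers, assign, skipped = k_centers(pts, k=20)
        print("centers:", len(centers), "distance computations skipped:", skipped)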

  8. Unique Effects of Acute Aripiprazole Treatment on the Dopamine D2 Receptor Downstream cAMP-PKA and Akt-GSK3β Signalling Pathways in Rats

    PubMed Central

    Pan, Bo; Chen, Jiezhong; Lian, Jiamei; Huang, Xu-Feng; Deng, Chao

    2015-01-01

    Aripiprazole is a widely used antipsychotic drug with therapeutic effects on both positive and negative symptoms of schizophrenia, and reduced side-effects. Although aripiprazole was developed as a dopamine D2 receptor (D2R) partial agonist, all other D2R partial agonists that aimed to mimic aripiprazole failed to exert therapeutic effects in the clinic. The present in vivo study aimed to investigate the effects of aripiprazole on the D2R downstream cAMP-PKA and Akt-GSK3β signalling pathways in comparison with a D2R antagonist, haloperidol, and a D2R partial agonist, bifeprunox. Rats were injected once with aripiprazole (0.75mg/kg, i.p.), bifeprunox (0.8mg/kg, i.p.), haloperidol (0.1mg/kg, i.p.) or vehicle. Five brain regions – the prefrontal cortex (PFC), nucleus accumbens (NAc), caudate putamen (CPu), ventral tegmental area (VTA) and substantia nigra (SN) – were collected. The protein levels of PKA, Akt and GSK3β were measured by Western blotting; the cAMP levels were examined by ELISA tests. The results showed that aripiprazole had acute effects on PKA expression similar to those of haloperidol, but not bifeprunox, in the CPu and VTA. Additionally, aripiprazole was able to increase the phosphorylation of GSK3β in the PFC, NAc, CPu and SN, which could not be achieved by bifeprunox or haloperidol. These results suggest that acute treatment with aripiprazole had effects on the cAMP-PKA and Akt-GSK3β signalling pathways that differ from those of haloperidol and bifeprunox in these brain areas. This study further indicated that, by comparison with bifeprunox, the unique pharmacological profile of aripiprazole may be attributed to its relatively lower intrinsic activity at D2R. PMID:26162083

  9. RooStatsCms: A tool for analysis modelling, combination and statistical studies

    NASA Astrophysics Data System (ADS)

    Piparo, D.; Schott, G.; Quast, G.

    2010-04-01

    RooStatsCms is an object oriented statistical framework based on the RooFit technology. Its scope is to allow the modelling, statistical analysis and combination of multiple search channels for new phenomena in High Energy Physics. It provides a variety of methods described in literature implemented as classes, whose design is oriented to the execution of multiple CPU intensive jobs on batch systems or on the Grid.

  10. Physician discretion is safe and may lower stress test utilization in emergency department chest pain unit patients.

    PubMed

    Napoli, Anthony M; Arrighi, James A; Siket, Matthew S; Gibbs, Frantz J

    2012-03-01

    Chest pain unit (CPU) observation with defined stress utilization protocols is a common management option for low-risk emergency department patients. We sought to evaluate the safety of a CPU staffed jointly by emergency medicine and cardiology. A prospective observational trial of consecutive patients admitted to an emergency department CPU was conducted. A standard 6-hour observation protocol was followed by cardiology consultation, with stress utilization largely at the consultant's discretion. Included patients were at low/intermediate risk by American Heart Association criteria, had nondiagnostic electrocardiograms, and had a normal initial troponin. Excluded were patients with an acute comorbidity, age >75, a history of coronary artery disease, or a coexistent problem restricting 24-hour observation. The primary outcome was 30-day major adverse cardiovascular events, defined as death, nonfatal acute myocardial infarction, revascularization, or out-of-hospital cardiac arrest. A total of 1063 patients were enrolled over 8 months. The mean age of the patients was 52.8 ± 11.8 years, and 51% (95% CI, 48-54) were female. The mean thrombolysis in myocardial infarction and Diamond & Forrester scores were 0.6% (95% CI, 0.51-0.62) and 33% (95% CI, 31-35), respectively. In all, 51% (95% CI, 48-54) received stress testing (52% nuclear stress, 39% stress echocardiogram, 5% exercise, 4% other). In all, 0.9% of patients (n = 10, 95% CI, 0.4-1.5) were diagnosed with a non-ST elevation myocardial infarction and 2.2% (n = 23, 95% CI, 1.3-3) with acute coronary syndrome. There was 1 case (95% CI, 0%-0.3%) of a 30-day major adverse cardiovascular event. The 51% stress test utilization rate was lower than the range reported in previous CPU studies (P < 0.05). Joint emergency medicine and cardiology management of patients within a CPU protocol is safe, efficacious, and may safely reduce stress testing rates.

  11. 3D Kirchhoff depth migration algorithm: A new scalable approach for parallelization on multicore CPU based cluster

    NASA Astrophysics Data System (ADS)

    Rastogi, Richa; Londhe, Ashutosh; Srivastava, Abhishek; Sirasala, Kirannmayi M.; Khonde, Kiran

    2017-03-01

    In this article, a new scalable 3D Kirchhoff depth migration algorithm is presented for a state-of-the-art multicore CPU based cluster. Parallelization of 3D Kirchhoff depth migration is challenging due to its high demands on compute time, memory, storage and I/O, along with the need for their effective management. The most resource intensive modules of the algorithm are the traveltime calculations and the migration summation, which exhibit an inherent trade-off between compute time and other resources. The parallelization strategy of the algorithm largely depends on the storage of the calculated traveltimes and the mechanism for feeding them to the migration process. The presented work is an extension of our previous work, wherein a 3D Kirchhoff depth migration application for a multicore CPU based parallel system had been developed. Recently, we have worked on improving the parallel performance of this application by re-designing the parallelization approach. The new algorithm is capable of efficiently migrating both prestack and poststack 3D data. It exhibits flexibility for migrating a large number of traces within the available node memory and with minimal requirements for storage, I/O and inter-node communication. The resultant application is tested using 3D Overthrust data on PARAM Yuva II, which is a Xeon E5-2670 based multicore CPU cluster with 16 cores/node and 64 GB shared memory. Parallel performance of the algorithm is studied using different numerical experiments and the scalability results show a striking improvement over its previous version. An impressive 49.05X speedup with 76.64% efficiency is achieved for 3D prestack data and a 32.00X speedup with 50.00% efficiency for 3D poststack data, using 64 nodes. The results also demonstrate the effectiveness and robustness of the improved algorithm, with high scalability and efficiency on a multicore CPU cluster.
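    The migration summation module described above amounts to, for each image point, summing the recorded samples found at the total source-to-image-point-to-receiver traveltime on every trace. The NumPy fragment below is a heavily simplified, constant-velocity 2D zero-offset illustration of that summation loop with synthetic geometry and random stand-in data; it is not the article's parallel 3D algorithm.

        import numpy as np

        # Synthetic acquisition: sources/receivers on the surface (z = 0), constant velocity.
        velocity = 2000.0           # m/s
        dt = 0.002                  # s, sample interval
        n_traces, n_samples = 50, 500
        src_x = rcv_x = np.linspace(0.0, 2450.0, n_traces)   # zero-offset geometry for brevity
        data = np.random.default_rng(0).normal(size=(n_traces, n_samples))  # stand-in traces

        # Image grid.
        xs = np.linspace(0.0, 2450.0, 100)
        zs = np.linspace(100.0, 1500.0, 80)
        image = np.zeros((len(zs), len(xs)))

        def traveltime(sx, x, z):
            """One-way traveltime from a surface point sx to the image point (x, z)."""
            return np.hypot(x - sx, z) / velocity

        # Kirchhoff summation: for each image point, add the sample found at the
        # two-way traveltime (source leg + receiver leg) on every trace.
        for iz, z in enumerate(zs):
            for ix, x in enumerate(xs):
                t_total = traveltime(src_x, x, z) + traveltime(rcv_x, x, z)
                samples = np.rint(t_total / dt).astype(int)
                valid = samples < n_samples
                image[iz, ix] = data[np.arange(n_traces)[valid], samples[valid]].sum()

        print("image shape:", image.shape, "peak |amplitude|:", round(float(np.abs(image).max()), 2))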

  12. Aligner optimization increases accuracy and decreases compute times in multi-species sequence data.

    PubMed

    Robinson, Kelly M; Hawkins, Aziah S; Santana-Cruz, Ivette; Adkins, Ricky S; Shetty, Amol C; Nagaraj, Sushma; Sadzewicz, Lisa; Tallon, Luke J; Rasko, David A; Fraser, Claire M; Mahurkar, Anup; Silva, Joana C; Dunning Hotopp, Julie C

    2017-09-01

    As sequencing technologies have evolved, the tools to analyze these sequences have made similar advances. However, for multi-species samples, we observed important and adverse differences in alignment specificity and computation time for bwa-mem (Burrows-Wheeler aligner-maximum exact matches) relative to bwa-aln. Therefore, we sought to optimize bwa-mem for alignment of data from multi-species samples in order to reduce alignment time and increase the specificity of alignments. In the multi-species cases examined, there was one majority member (i.e. Plasmodium falciparum or Brugia malayi) and one minority member (i.e. human or the Wolbachia endosymbiont wBm) of the sequence data. Increasing bwa-mem seed length from the default value reduced the number of read pairs from the majority sequence member that incorrectly aligned to the reference genome of the minority sequence member. Combining both source genomes into a single reference genome increased the specificity of mapping, while also reducing the central processing unit (CPU) time. In Plasmodium, at a seed length of 18 nt, 24.1% of reads mapped to the human genome using 1.7±0.1 CPU hours, while 83.6% of reads mapped to the Plasmodium genome using 0.2±0.0 CPU hours (total: 107.7% reads mapping in 1.9±0.1 CPU hours). In contrast, 97.1% of the reads mapped to a combined Plasmodium-human reference in only 0.7±0.0 CPU hours. Overall, the results suggest that combining all references into a single reference database and using a 23 nt seed length reduces the computational time, while maximizing specificity. Similar results were found for simulated sequence reads from a mock metagenomic data set. We found similar improvements to computation time in a publicly available human-only data set.

  13. SU-E-T-423: Fast Photon Convolution Calculation with a 3D-Ideal Kernel On the GPU

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Moriya, S; Sato, M; Tachibana, H

    Purpose: The calculation time is a trade-off for improving the accuracy of convolution dose calculation with fine calculation spacing of the KERMA kernel. We investigated accelerating the convolution calculation using an ideal kernel on graphics processing units (GPUs). Methods: The calculation was performed on AMD graphics hardware (dual FirePro D700) and our algorithm was implemented using Aparapi, which converts Java bytecode to OpenCL. The dose calculation process was separated into TERMA and KERMA steps, and the dose deposited at each coordinate (x, y, z) was determined in the process. In the dose calculation running on the central processing unit (CPU), an Intel Xeon E5, the calculation loops were performed over all calculation points. In the GPU computation, all of the calculation processes for the points were sent to the GPU and the computation was multi-threaded. In this study, the dose calculation was performed in a water-equivalent homogeneous phantom with 150{sup 3} voxels (2 mm calculation grid), and the calculation speed on the GPU was compared to that on the CPU along with the accuracy of the PDD. Results: The calculation times for the GPU and the CPU were 3.3 sec and 4.4 hours, respectively. The calculation speed on the GPU was 4800 times faster than that on the CPU. The PDD curve for the GPU perfectly matched that for the CPU. Conclusion: The convolution calculation with the ideal kernel on the GPU was clinically acceptable in terms of calculation time and may be more accurate in inhomogeneous regions. Intensity modulated arc therapy needs dose calculations for different gantry angles at many control points. Thus, it would be more practical for the kernel to use a coarser spacing if the calculation is faster while keeping accuracy similar to a current treatment planning system.

  14. CoNNeCT Baseband Processor Module Boot Code SoftWare (BCSW)

    NASA Technical Reports Server (NTRS)

    Yamamoto, Clifford K.; Orozco, David S.; Byrne, D. J.; Allen, Steven J.; Sahasrabudhe, Adit; Lang, Minh

    2012-01-01

    This software provides essential startup and initialization routines for the CoNNeCT baseband processor module (BPM) hardware upon power-up. A command and data handling (C&DH) interface is provided via 1553 and diagnostic serial interfaces to invoke operational, reconfiguration, and test commands within the code. The BCSW has features unique to the hardware it is responsible for managing. In this case, the CoNNeCT BPM is configured with an updated CPU (Atmel AT697 SPARC processor) and a unique set of memory and I/O peripherals that require customized software to operate. These features include configuration of new AT697 registers, interfacing to a new HouseKeeper with a flash controller interface, a new dual Xilinx configuration/scrub interface, and an updated 1553 remote terminal (RT) core. The BCSW is intended to provide a "safe" mode for the BPM when initially powered on or when an unexpected trap occurs, causing the processor to reset. The BCSW allows the 1553 bus controller in the spacecraft or payload controller to operate the BPM over 1553 to upload code; upload Xilinx bit files; perform rudimentary tests; read, write, and copy the non-volatile flash memory; and configure the Xilinx interface. Commands also exist over 1553 to cause the CPU to jump or call a specified address to begin execution of user-supplied code. This may be in the form of a real-time operating system, test routine, or specific application code to run on the BPM.

  15. Acceleration of stable TTI P-wave reverse-time migration with GPUs

    NASA Astrophysics Data System (ADS)

    Kim, Youngseo; Cho, Yongchae; Jang, Ugeun; Shin, Changsoo

    2013-03-01

    When a pseudo-acoustic TTI (tilted transversely isotropic) coupled wave equation is used to implement reverse-time migration (RTM), shear wave energy is significantly included in the migration image. Because anisotropy has intrinsic elastic characteristics, coupling P-wave and S-wave modes in the pseudo-acoustic wave equation is inevitable. In RTM with only primary energy or the P-wave mode in seismic data, the S-wave energy is regarded as noise for the migration image. To solve this problem, we derive a pure P-wave equation for TTI media that excludes the S-wave energy. Additionally, we apply the rapid expansion method (REM) based on a Chebyshev expansion and a pseudo-spectral method (PSM) to calculate spatial derivatives in the wave equation. When REM is incorporated with the PSM for the spatial derivatives, wavefields with high numerical accuracy can be obtained without grid dispersion when performing numerical wave modeling. Another problem in the implementation of TTI RTM is that wavefields in an area with high gradients of dip or azimuth angles can be blown up in the progression of the forward and backward algorithms of the RTM. We stabilize the wavefields by applying a spatial-frequency domain high-cut filter when calculating the spatial derivatives using the PSM. In addition, to increase performance speed, the graphic processing unit (GPU) architecture is used instead of traditional CPU architecture. To confirm the degree of acceleration compared to the CPU version on our RTM, we then analyze the performance measurements according to the number of GPUs employed.

  16. Jobs masonry in LHCb with elastic Grid Jobs

    NASA Astrophysics Data System (ADS)

    Stagni, F.; Charpentier, Ph

    2015-12-01

    In any distributed computing infrastructure, a job is normally forbidden to run for an indefinite amount of time. This limitation is implemented using different technologies, the most common one being the CPU time limit implemented by batch queues. It is therefore important to have a good estimate of how much CPU work a job will require: otherwise, it might be killed by the batch system, or by whatever system is controlling the jobs' execution. In many modern interwares, the jobs are actually executed by pilot jobs, which can use the whole available time to run multiple consecutive jobs. If at some point the available time in a pilot is too short for the execution of any job, the pilot must be released, even though the time could have been used efficiently by a shorter job. Within LHCbDIRAC, the LHCb extension of the DIRAC interware, we developed a simple way to fully exploit the computing capabilities available to a pilot, even for resources with limited time capabilities, by adding elasticity to production Monte Carlo (MC) simulation jobs. With our approach, independently of the time available, LHCbDIRAC will always have the possibility to execute an MC job whose length is adapted to the available amount of time: therefore the same job, running on different computing resources with different time limits, will produce different amounts of events. The decision on the number of events to be produced is made just in time at the start of the job, when the capabilities of the resource are known. In order to know how many events an MC job will be instructed to produce, LHCbDIRAC simply requires three values: the CPU-work per event for that type of job, the power of the machine it is running on, and the time left for the job before being killed. Knowing these values, we can estimate the number of events the job will be able to simulate with the available CPU time. This paper demonstrates that, using this simple but effective solution, LHCb manages to make more efficient use of the available resources, and that it can easily use new types of resources. An example is represented by resources provided by batch queues, where low-priority MC jobs can be used as "masonry" jobs in multi-job pilots. A second example is represented by opportunistic resources with limited available time.
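    The just-in-time sizing decision described above is simple arithmetic: divide the CPU work available to the pilot (time left multiplied by machine power) by the CPU work needed per event. The snippet below is a minimal sketch of that estimate; the safety margin, units and example numbers are hypothetical and are not LHCbDIRAC's actual parameters.

        def events_to_produce(cpu_work_per_event, machine_power, time_left_s, safety=0.9):
            """Estimate how many MC events fit in the remaining CPU time of a pilot.

            cpu_work_per_event : CPU work needed per event (e.g. benchmark units * seconds / event)
            machine_power      : power of the worker node (same benchmark units)
            time_left_s        : seconds left before the job would be killed
            safety             : hypothetical margin to avoid overshooting the limit
            """
            available_work = machine_power * time_left_s * safety
            return max(0, int(available_work // cpu_work_per_event))

        # Example: 500 work-units*s per event, a 10 work-unit slot, and 6 hours left in the queue.
        print(events_to_produce(cpu_work_per_event=500.0,
                                machine_power=10.0,
                                time_left_s=6 * 3600))   # -> 388 events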

  17. Personal Computer and Workstation Operating Systems Tutorial

    DTIC Science & Technology

    1994-03-01

    to a RAM area where it is executed by the CPU. The program consists of instructions that perform operations on data. The CPU will perform two basic...memory to improve system performance. More often the user will buy a new fixed disk so the computer will hold more programs internally. The trend today...MHZ. Another way to view how fast the information is going into the register is in a time domain rather than a frequency domain knowing that time and

  18. Finite difference numerical method for the superlattice Boltzmann transport equation and case comparison of CPU(C) and GPU(CUDA) implementations

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Priimak, Dmitri

    2014-12-01

    We present a finite difference numerical algorithm for solving two dimensional spatially homogeneous Boltzmann transport equation which describes electron transport in a semiconductor superlattice subject to crossed time dependent electric and constant magnetic fields. The algorithm is implemented both in C language targeted to CPU and in CUDA C language targeted to commodity NVidia GPU. We compare performances and merits of one implementation versus another and discuss various software optimisation techniques.

  19. The development of an interim generalized gate logic software simulator

    NASA Technical Reports Server (NTRS)

    Mcgough, J. G.; Nemeroff, S.

    1985-01-01

    A proof-of-concept computer program called IGGLOSS (Interim Generalized Gate Logic Software Simulator) was developed and is discussed. The simulator engine was designed to perform stochastic estimation of self-test coverage (fault-detection latency times) of digital computers or systems. A major attribute of IGGLOSS is its high-speed simulation: 9.5 million gates/CPU second for nonfaulted circuits and 4.4 million gates/CPU second for faulted circuits on a VAX 11/780 host computer.

  20. Autonomic Recovery: HyperCheck: A Hardware-Assisted Integrity Monitor

    DTIC Science & Technology

    2013-08-01

    system (OS). HyperCheck leverages the CPU System Management Mode (SMM), present in x86 systems, to securely generate and transmit the full state of the...HyperCheck harnesses the CPU System Management Mode (SMM) which is present in all x86 commodity systems to create a snapshot view of the current state of the...protect the software above it. Our assumptions are that the attacker does not have physical access to the machine and that the SMM BIOS is locked and

  1. cuBLASTP: Fine-Grained Parallelization of Protein Sequence Search on CPU+GPU.

    PubMed

    Zhang, Jing; Wang, Hao; Feng, Wu-Chun

    2017-01-01

    BLAST, short for Basic Local Alignment Search Tool, is a ubiquitous tool used in the life sciences for pairwise sequence search. However, with the advent of next-generation sequencing (NGS), whether at the outset or downstream from NGS, the exponential growth of sequence databases is outstripping our ability to analyze the data. While recent studies have utilized the graphics processing unit (GPU) to speed up the BLAST algorithm for searching protein sequences (i.e., BLASTP), these studies use coarse-grained parallelism, where one sequence alignment is mapped to only one thread. Such an approach does not efficiently utilize the capabilities of a GPU, particularly due to the irregularity of BLASTP in both execution paths and memory-access patterns. To address the above shortcomings, we present a fine-grained approach to parallelize BLASTP, where each individual phase of sequence search is mapped to many threads on a GPU. This approach, which we refer to as cuBLASTP, reorders data-access patterns and reduces divergent branches of the most time-consuming phases (i.e., hit detection and ungapped extension). In addition, cuBLASTP optimizes the remaining phases (i.e., gapped extension and alignment with traceback) on a multicore CPU and overlaps their execution with the phases running on the GPU.

  2. Analysis of Multivariate Experimental Data Using A Simplified Regression Model Search Algorithm

    NASA Technical Reports Server (NTRS)

    Ulbrich, Norbert M.

    2013-01-01

    A new regression model search algorithm was developed that may be applied to both general multivariate experimental data sets and wind tunnel strain-gage balance calibration data. The algorithm is a simplified version of a more complex algorithm that was originally developed for the NASA Ames Balance Calibration Laboratory. The new algorithm performs regression model term reduction to prevent overfitting of data. It has the advantage that it needs only about one tenth of the original algorithm's CPU time for the completion of a regression model search. In addition, extensive testing showed that the prediction accuracy of math models obtained from the simplified algorithm is similar to the prediction accuracy of math models obtained from the original algorithm. The simplified algorithm, however, cannot guarantee that search constraints related to a set of statistical quality requirements are always satisfied in the optimized regression model. Therefore, the simplified algorithm is not intended to replace the original algorithm. Instead, it may be used to generate an alternate optimized regression model of experimental data whenever the application of the original search algorithm fails or requires too much CPU time. Data from a machine calibration of NASA's MK40 force balance is used to illustrate the application of the new search algorithm.
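
    As a generic illustration of regression model term reduction (not the NASA algorithm itself), the sketch below performs backward elimination with NumPy least squares: starting from all candidate terms, it repeatedly drops the term whose removal increases the residual sum of squares the least, until a target term count is reached. The column indices and stopping rule are illustrative only.

      import numpy as np

      def fit_rss(X, y, cols):
          coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
          resid = y - X[:, cols] @ coef
          return float(resid @ resid)

      def backward_eliminate(X, y, min_terms=3):
          cols = list(range(X.shape[1]))
          while len(cols) > min_terms:
              # drop the term whose removal gives the smallest residual sum of squares
              rss, drop = min(
                  (fit_rss(X, y, [c for c in cols if c != cand]), cand) for cand in cols
              )
              cols.remove(drop)
          return cols

      # toy usage: 5 candidate terms, only the first two truly matter
      rng = np.random.default_rng(0)
      X = rng.normal(size=(200, 5))
      y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.normal(size=200)
      print(backward_eliminate(X, y, min_terms=2))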

  3. Analysis of Multivariate Experimental Data Using A Simplified Regression Model Search Algorithm

    NASA Technical Reports Server (NTRS)

    Ulbrich, Norbert Manfred

    2013-01-01

    A new regression model search algorithm was developed in 2011 that may be used to analyze both general multivariate experimental data sets and wind tunnel strain-gage balance calibration data. The new algorithm is a simplified version of a more complex search algorithm that was originally developed at the NASA Ames Balance Calibration Laboratory. The new algorithm has the advantage that it needs only about one tenth of the original algorithm's CPU time for the completion of a search. In addition, extensive testing showed that the prediction accuracy of math models obtained from the simplified algorithm is similar to the prediction accuracy of math models obtained from the original algorithm. The simplified algorithm, however, cannot guarantee that search constraints related to a set of statistical quality requirements are always satisfied in the optimized regression models. Therefore, the simplified search algorithm is not intended to replace the original search algorithm. Instead, it may be used to generate an alternate optimized regression model of experimental data whenever the application of the original search algorithm either fails or requires too much CPU time. Data from a machine calibration of NASA's MK40 force balance is used to illustrate the application of the new regression model search algorithm.

  4. libgapmis: extending short-read alignments

    PubMed Central

    2013-01-01

    Background A wide variety of short-read alignment programmes have been published recently to tackle the problem of mapping millions of short reads to a reference genome, focusing on different aspects of the procedure such as time and memory efficiency, sensitivity, and accuracy. These tools allow for a small number of mismatches in the alignment; however, their ability to allow for gaps varies greatly, with many performing poorly or not allowing them at all. The seed-and-extend strategy is applied in most short-read alignment programmes. After aligning a substring of the reference sequence against the high-quality prefix of a short read--the seed--an important problem is to find the best possible alignment between a substring of the reference sequence succeeding and the remaining suffix of low quality of the read--extend. The fact that the reads are rather short and that the gap occurrence frequency observed in various studies is rather low suggest that aligning (parts of) those reads with a single gap is in fact desirable. Results In this article, we present libgapmis, a library for extending pairwise short-read alignments. Apart from the standard CPU version, it includes ultrafast SSE- and GPU-based implementations. libgapmis is based on an algorithm computing a modified version of the traditional dynamic-programming matrix for sequence alignment. Extensive experimental results demonstrate that the functions of the CPU version provided in this library accelerate the computations by a factor of 20 compared to other programmes. The analogous SSE- and GPU-based implementations accelerate the computations by a factor of 6 and 11, respectively, compared to the CPU version. The library also provides the user the flexibility to split the read into fragments, based on the observed gap occurrence frequency and the length of the read, thereby allowing for a variable, but bounded, number of gaps in the alignment. Conclusions We present libgapmis, a library for extending pairwise short-read alignments. We show that libgapmis is better-suited and more efficient than existing algorithms for this task. The importance of our contribution is underlined by the fact that the provided functions may be seamlessly integrated into any short-read alignment pipeline. The open-source code of libgapmis is available at http://www.exelixis-lab.org/gapmis. PMID:24564250

  5. The Application of Virtex-II Pro FPGA in High-Speed Image Processing Technology of Robot Vision Sensor

    NASA Astrophysics Data System (ADS)

    Ren, Y. J.; Zhu, J. G.; Yang, X. Y.; Ye, S. H.

    2006-10-01

    The Virtex-II Pro FPGA is applied to the vision sensor tracking system of an IRB2400 robot. The hardware platform, which undertakes the task of improving SNR and compressing data, is constructed around the high-speed image processing of the FPGA. The lower-level image-processing algorithm is realized by combining the FPGA with the embedded CPU, and image processing is accelerated by this combination. The embedded CPU also makes it easy to realize the interface logic design. Some key techniques are presented in the text, such as the read-write process, template matching and convolution, and some modules are simulated as well. Finally, a comparison is carried out among modules implemented with this design, with a PC, and with a DSP. Because the core of the high-speed image-processing system is an FPGA whose function can be conveniently updated, the measurement system is, to a degree, intelligent.

  6. Study of data I/O performance on distributed disk system in mask data preparation

    NASA Astrophysics Data System (ADS)

    Ohara, Shuichiro; Odaira, Hiroyuki; Chikanaga, Tomoyuki; Hamaji, Masakazu; Yoshioka, Yasuharu

    2010-09-01

    Data volume is getting larger every day in Mask Data Preparation (MDP), while faster data handling is always required. An MDP flow typically introduces a Distributed Processing (DP) system to meet this demand, because using hundreds of CPUs is a reasonable solution. However, even if the number of CPUs is increased, the throughput may saturate because hard disk I/O and network speeds can become bottlenecks. MDP therefore needs to invest heavily not only in hundreds of CPUs but also in the storage and network devices that make the throughput faster. NCS introduces a new distributed processing system called "NDE", a distributed disk system that raises throughput without a large investment because it is designed to use multiple conventional hard drives appropriately over the network. In this paper, NCS studies I/O performance with the OASIS® data format on NDE, which contributes to realizing high throughput.

  7. A study of the relationship between the performance and dependability of a fault-tolerant computer

    NASA Technical Reports Server (NTRS)

    Goswami, Kumar K.

    1994-01-01

    This thesis studies the relationship by creating a tool (FTAPE) that integrates a high-stress workload generator with fault injection, and by using the tool to evaluate system performance under error conditions. The workloads are composed of processes formed from atomic components that represent CPU, memory, and I/O activity. The fault injector is software-implemented and is capable of injecting faults into any memory-addressable location, including special registers and caches. This tool has been used to study a Tandem Integrity S2 computer. Workloads with varying numbers of processes and varying compositions of CPU, memory, and I/O activity are first characterized in terms of performance. Then faults are injected into these workloads. The results show that as the number of concurrent processes increases, the mean fault latency initially increases due to increased contention for the CPU. However, for even higher numbers of processes (more than 3), the mean latency decreases because long-latency faults are paged out before they can be activated.

  8. Toward Fast and Accurate Binding Affinity Prediction with pmemdGTI: An Efficient Implementation of GPU-Accelerated Thermodynamic Integration.

    PubMed

    Lee, Tai-Sung; Hu, Yuan; Sherborne, Brad; Guo, Zhuyan; York, Darrin M

    2017-07-11

    We report the implementation of the thermodynamic integration method on the pmemd module of the AMBER 16 package on GPUs (pmemdGTI). The pmemdGTI code typically delivers over 2 orders of magnitude of speed-up relative to a single CPU core for the calculation of ligand-protein binding affinities with no statistically significant numerical differences and thus provides a powerful new tool for drug discovery applications.
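
    For reference, the working equation of thermodynamic integration in its standard textbook form (not quoted from the paper) is

      \Delta G \;=\; \int_{0}^{1}
        \left\langle \frac{\partial H(\lambda)}{\partial \lambda} \right\rangle_{\lambda} d\lambda
        \;\approx\; \sum_{k} w_{k}
        \left\langle \frac{\partial H(\lambda)}{\partial \lambda} \right\rangle_{\lambda_{k}},

    where λ couples the initial and final states of the transformation and the ensemble averages at a set of λ_k windows are combined with quadrature weights w_k; the reported GPU acceleration applies to the sampling of these averages.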

  9. Early modulation by the dopamine D4 receptor of morphine-induced changes in the opioid peptide systems in the rat caudate putamen.

    PubMed

    Gago, Belén; Fuxe, Kjell; Brené, Stefan; Díaz-Cabiale, Zaida; Reina-Sánchez, María Dolores; Suárez-Boomgaard, Diana; Roales-Buján, Ruth; Valderrama-Carvajal, Alejandra; de la Calle, Adelaida; Rivera, Alicia

    2013-12-01

    The peptides dynorphin and enkephalin modulate many physiological processes, such as motor activity and the control of mood and motivation. Their expression in the caudate putamen (CPu) is regulated by dopamine and opioid receptors. The current work was designed to explore the early effects of the acute activation of D4 and/or μ opioid receptors by the agonists PD168,077 and morphine, respectively, on the regulation of the expression of these opioid peptides in the rat CPu, on transcription factors linked to them, and on the expression of μ opioid receptors. In situ hybridization experiments showed that acute treatment with morphine (10 mg/kg) decreased both enkephalin and dynorphin mRNA levels in the CPu after 30 min, but PD168,077 (1 mg/kg) did not modify their expression. Coadministration of the two agonists demonstrated that PD168,077 counteracted the morphine-induced changes and even increased enkephalin mRNA levels. The immunohistochemistry studies showed that morphine administration also increased striatal μ opioid receptor immunoreactivity but reduced P-CREB expression, effects that were blocked by the PD168,077-induced activation of D4 receptors. The current results present evidence of functional D4 -μ opioid receptor interactions, with consequences for the opioid peptide mRNA levels in the rat CPu, contributing to the integration of DA and opioid peptide signaling. Copyright © 2013 Wiley Periodicals, Inc.

  10. Efficient and portable acceleration of quantum chemical many-body methods in mixed floating point precision using OpenACC compiler directives

    NASA Astrophysics Data System (ADS)

    Eriksen, Janus J.

    2017-09-01

    It is demonstrated how the non-proprietary OpenACC standard of compiler directives may be used to compactly and efficiently accelerate the rate-determining steps of two of the most routinely applied many-body methods of electronic structure theory, namely the second-order Møller-Plesset (MP2) model in its resolution-of-the-identity approximated form and the (T) triples correction to the coupled cluster singles and doubles model (CCSD(T)). By means of compute directives as well as the use of optimised device math libraries, the operations involved in the energy kernels have been ported to graphics processing unit (GPU) accelerators, and the associated data transfers correspondingly optimised to such a degree that the final implementations (using either double and/or single precision arithmetics) are capable of scaling to as large systems as allowed for by the capacity of the host central processing unit (CPU) main memory. The performance of the hybrid CPU/GPU implementations is assessed through calculations on test systems of alanine amino acid chains using one-electron basis sets of increasing size (ranging from double- to pentuple-ζ quality). For all but the smallest problem sizes of the present study, the optimised accelerated codes (using a single multi-core CPU host node in conjunction with six GPUs) are found to be capable of reducing the total time-to-solution by at least an order of magnitude over optimised, OpenMP-threaded CPU-only reference implementations.

  11. GeantV: From CPU to accelerators

    DOE PAGES

    Amadio, G.; Ananya, A.; Apostolakis, J.; ...

    2016-01-01

    The GeantV project aims to research and develop the next-generation simulation software describing the passage of particles through matter. While modern CPU architectures are being targeted first, resources such as GPGPUs, Intel® Xeon Phi, Atom or ARM cannot be ignored anymore by HEP CPU-bound applications. The proof-of-concept GeantV prototype has been mainly engineered for CPUs having vector units, but we have foreseen from early stages a bridge to arbitrary accelerators. A software layer consisting of architecture/technology-specific backends currently supports this concept. This approach allows us to abstract out basic types such as scalar/vector, and also to formalize generic computation kernels that transparently use library- or device-specific constructs based on Vc, CUDA, Cilk+ or Intel intrinsics. While the main goal of this approach is portable performance, as a bonus it comes with the insulation of the core application and algorithms from the technology layer. This allows our application to be maintainable in the long term and versatile to changes on the backend side. The paper presents the first results of basket-based GeantV geometry navigation on the Intel® Xeon Phi KNC architecture. We present the scalability and vectorization study, conducted using Intel performance tools, as well as our preliminary conclusions on the use of accelerators for GeantV transport. Lastly, we also describe the current work and preliminary results for using the GeantV transport kernel on GPUs.

  12. Methamphetamine-induced stereotypy correlates negatively with patch-enhanced prodynorphin and arc mRNA expression in the rat caudate putamen: the role of mu opioid receptor activation.

    PubMed

    Horner, Kristen A; Noble, Erika S; Gilbert, Yamiece E

    2010-06-01

    Amphetamines induce stereotypy, which correlates with patch-enhanced c-Fos expression in the patch compartment of the caudate putamen (CPu). Methamphetamine (METH) treatment also induces patch-enhanced expression of prodynorphin (PD), arc and zif/268 in the CPu. Whether patch-enhanced activation of any of these genes correlates with METH-induced stereotypy is unknown, and the factors that contribute to this pattern of expression are poorly understood. Activation of mu opioid receptors, which are expressed by the neurons of the patch compartment, may underlie METH-induced patch-enhanced gene expression and stereotypy. The current study examined whether striatal mu opioid receptor blockade altered METH-induced stereotypy and patch-enhanced gene expression, and if there was a correlation between the two responses. Animals were intrastriatally infused with the mu antagonist CTAP (10 microg/microl), treated with METH (7.5 mg/kg, s.c.), placed in activity chambers for 3h, and then sacrificed. CTAP pretreatment attenuated METH-induced increases in PD, arc and zif/268 mRNA expression and significantly reduced METH-induced stereotypy. Patch-enhanced PD and arc mRNA expression in the dorsolateral CPu correlated negatively with METH-induced stereotypy. These data indicate that mu opioid receptor activation contributes to METH-induced gene expression in the CPu and stereotypy, and that patch-enhanced PD and arc expression may be a homeostatic response to METH treatment. Copyright 2010 Elsevier Inc. All rights reserved.

  13. Methamphetamine-Induced Stereotypy Correlates Negatively with Patch-Enhanced Prodynorphin and ARC mRNA Expression in the Rat Caudate Putamen: The Role of Mu Opioid Receptor Activation

    PubMed Central

    Horner, Kristen A.; Noble, Erika S.; Gilbert, Yamiece E.

    2010-01-01

    Amphetamines induce stereotypy, which correlates with patch-enhanced c-Fos expression in the patch compartment of the caudate putamen (CPu). Methamphetamine (METH) treatment also induces patch-enhanced expression of prodynorphin (PD), arc and zif/268 in the CPu. Whether patch-enhanced activation of any of these genes correlates with METH-induced stereotypy is unknown, and the factors that contribute to this pattern of expression are poorly understood. Activation of mu opioid receptors, which are expressed by the neurons of the patch compartment, may underlie METH-induced patch-enhanced gene expression and stereotypy. The current study examined whether striatal mu opioid receptor blockade altered METH-induced stereotypy and patch-enhanced gene expression, and if there was a correlation between the two responses. Animals were intrastriatally infused with the mu antagonist CTAP (10 μg/μl), treated with METH (7.5 mg/kg, s.c.), placed in activity chambers for 3h, and then sacrificed. CTAP pretreatment attenuated METH-induced increases in PD, arc and zif/268 mRNA expression and significantly reduced METH-induced stereotypy. Patch-enhanced PD and arc mRNA expression in the dorsolateral CPu correlated negatively with METH-induced stereotypy. These data indicate that mu opioid receptor activation contributes to METH-induced gene expression in the CPu and stereotypy, and that patch-enhanced PD and arc expression may be a homeostatic response to METH treatment. PMID:20298714

  14. Duct flow nonuniformities for Space Shuttle Main Engine (SSME)

    NASA Technical Reports Server (NTRS)

    1987-01-01

    A three-duct Space Shuttle Main Engine (SSME) Hot Gas Manifold geometry code was developed for use. The methodology of the program is described, recommendations on its implementation made, and an input guide, input deck listing, and a source code listing provided. The code listing is strewn with an abundance of comments to assist the user in following its development and logic. A working source deck will be provided. A thorough analysis was made of the proper boundary conditions and chemistry kinetics necessary for an accurate computational analysis of the flow environment in the SSME fuel side preburner chamber during the initial startup transient. Pertinent results were presented to facilitate incorporation of these findings into an appropriate CFD code. The computation must be a turbulent computation, since the flow field turbulent mixing will have a profound effect on the chemistry. Because of the additional equations demanded by the chemistry model it is recommended that for expediency a simple algebraic mixing length model be adopted. Performing this computation for all or selected time intervals of the startup time will require an abundance of computer CPU time regardless of the specific CFD code selected.

  15. Accelerating image recognition on mobile devices using GPGPU

    NASA Astrophysics Data System (ADS)

    Bordallo López, Miguel; Nykänen, Henri; Hannuksela, Jari; Silvén, Olli; Vehviläinen, Markku

    2011-01-01

    The future multi-modal user interfaces of battery-powered mobile devices are expected to require computationally costly image analysis techniques. Graphics Processing Units are very well suited for parallel processing, and the addition of programmable stages and high-precision arithmetic provides opportunities to implement complete, energy-efficient algorithms. At the moment the first mobile graphics accelerators with programmable pipelines are available, enabling the GPGPU implementation of several image processing algorithms. In this context, we consider a face tracking approach that uses efficient gray-scale invariant texture features and boosting. The solution is based on Local Binary Pattern (LBP) features and makes use of the GPU in the pre-processing and feature extraction phase. We have implemented a series of image processing techniques in the shader language of OpenGL ES 2.0, compiled them for a mobile graphics processing unit and performed tests on a mobile application processor platform (OMAP3530). In our contribution, we describe the challenges of designing on a mobile platform, present the performance achieved and provide measurement results for the actual power consumption in comparison to using the CPU (ARM) on the same platform.
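
    For readers unfamiliar with the descriptor, the following is a minimal NumPy sketch of the basic 8-neighbour LBP operator: each neighbour is thresholded against the centre pixel and the results are packed into an 8-bit code. It is the textbook operator only and does not reproduce the paper's OpenGL ES 2.0 shader implementation.

      import numpy as np

      def lbp8(img):
          # img: 2D uint8 grayscale image; returns the LBP code of every interior pixel
          c = img[1:-1, 1:-1]
          offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
                     (1, 1), (1, 0), (1, -1), (0, -1)]
          code = np.zeros_like(c, dtype=np.uint8)
          for bit, (dy, dx) in enumerate(offsets):
              neighbour = img[1 + dy: img.shape[0] - 1 + dy,
                              1 + dx: img.shape[1] - 1 + dx]
              code |= ((neighbour >= c).astype(np.uint8) << bit)
          return code

      gray = np.random.randint(0, 256, (240, 320), dtype=np.uint8)
      codes = lbp8(gray)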

  16. First Update of the Criteria for Certification of Chest Pain Units in Germany: Facelift or New Model?

    PubMed

    Breuckmann, Frank; Rassaf, Tienush

    2016-03-01

    In an effort to provide a systematic and specific standard of care for patients with acute chest pain, the German Cardiac Society introduced criteria for certification of specialized chest pain units (CPUs) in 2008, which have been replaced by a recent update published in 2015. We reviewed the development of CPU establishment in Germany during the past 7 years and compared and commented on the current update of the certification criteria. As of October 2015, 228 CPUs in Germany have been successfully certified by the German Cardiac Society; 300 CPUs are needed for full coverage, closing gaps in rural regions. Current changes to the criteria mainly affect guideline-adherent adaptions of diagnostic work-ups, therapeutic strategies, risk stratification, in-hospital timing and education, and quality measures, whereas the overall structure remains unchanged. Benchmarking by participation in the German CPU registry is encouraged. Even though the history is short, the concept of certified CPUs in Germany is accepted and successful, as underlined by its recent implementation in national and international guidelines. First registry data demonstrated a high standard of quality of care. The current update provides rational adaptions to new guidelines and developments without raising the bar for successful certification. A periodic release of fast-track updates with shorter time frames and an increase of minimum requirements should be considered.

  17. Computer Health Score

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    The algorithm develops a single health score for office computers, today just Windows, but we plan to extend this to Apple computers. The score is derived from various parameters, including: CPU utilization; memory utilization; various error logs; disk problems; and disk write queue length. It then uses a weighting scheme to balance these parameters and provide an overall health score. By using these parameters, we are not just assessing the theoretical performance of the components of the computer; rather, we are using actual performance metrics that are selected to be a more realistic representation of the experience of the person using the computer. This includes compensating for the nature of their use. If there are two identical computers and the user of one places heavy demands on their computer compared with the user of the second computer, the former will have a lower health score. This allows us to provide a 'fit for purpose' score tailored to the assigned user. This is very helpful data to inform managers when individual computers need to be replaced. Additionally, it provides specific information that can facilitate the fixing of the computer, to extend its useful lifetime. This presents direct financial savings, time savings for users transferring from one computer to the next, and better environmental stewardship.
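
    A minimal sketch of such a weighting scheme is shown below; the metric names, normalisation and weights are hypothetical stand-ins, not the ones used by the actual tool.

      # Illustrative weighted health score: combine normalized "badness" metrics
      # (0 = healthy, 1 = worst) into a single 0-100 score.  The metric names and
      # weights are hypothetical.
      WEIGHTS = {
          "cpu_utilization":  0.25,   # fraction of time the CPU is saturated
          "memory_pressure":  0.25,   # fraction of RAM in use at sample time
          "error_log_rate":   0.20,   # error-log entries per day, normalized
          "disk_problems":    0.15,   # SMART / filesystem issues, normalized
          "disk_write_queue": 0.15,   # average write-queue length, normalized
      }

      def health_score(metrics):
          """metrics: dict of metric name -> normalized badness in [0, 1]."""
          penalty = sum(WEIGHTS[name] * min(max(value, 0.0), 1.0)
                        for name, value in metrics.items())
          return round(100.0 * (1.0 - penalty), 1)

      print(health_score({
          "cpu_utilization": 0.6, "memory_pressure": 0.4,
          "error_log_rate": 0.1, "disk_problems": 0.0, "disk_write_queue": 0.2,
      }))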

  18. Digital image processing using parallel computing based on CUDA technology

    NASA Astrophysics Data System (ADS)

    Skirnevskiy, I. P.; Pustovit, A. V.; Abdrashitova, M. O.

    2017-01-01

    This article describes the expediency of using a graphics processing unit (GPU) for big data processing in the context of digital image processing. It provides a short description of parallel computing technology and its usage in different areas, a definition of image noise, and a brief overview of some noise removal algorithms. It also describes some basic requirements that should be met by a noise removal algorithm applied to computed tomography projections. It provides a comparison of the performance with and without the GPU, as well as with different percentages of CPU and GPU usage.

  19. Rapid Monte Carlo simulation of detector DQE(f)

    PubMed Central

    Star-Lack, Josh; Sun, Mingshan; Meyer, Andre; Morf, Daniel; Constantin, Dragos; Fahrig, Rebecca; Abel, Eric

    2014-01-01

    Purpose: Performance optimization of indirect x-ray detectors requires proper characterization of both ionizing (gamma) and optical photon transport in a heterogeneous medium. As the tool of choice for modeling detector physics, Monte Carlo methods have failed to gain traction as a design utility, due mostly to excessive simulation times and a lack of convenient simulation packages. The most important figure-of-merit in assessing detector performance is the detective quantum efficiency (DQE), for which most of the computational burden has traditionally been associated with the determination of the noise power spectrum (NPS) from an ensemble of flood images, each conventionally having 10^7 − 10^9 detected gamma photons. In this work, the authors show that the idealized conditions inherent in a numerical simulation allow for a dramatic reduction in the number of gamma and optical photons required to accurately predict the NPS. Methods: The authors derived an expression for the mean squared error (MSE) of a simulated NPS when computed using the International Electrotechnical Commission-recommended technique based on taking the 2D Fourier transform of flood images. It is shown that the MSE is inversely proportional to the number of flood images, and is independent of the input fluence provided that the input fluence is above a minimal value that avoids biasing the estimate. The authors then propose to further lower the input fluence so that each event creates a point-spread function rather than a flood field. The authors use this finding as the foundation for a novel algorithm in which the characteristic MTF(f), NPS(f), and DQE(f) curves are simultaneously generated from the results of a single run. The authors also investigate lowering the number of optical photons used in a scintillator simulation to further increase efficiency. Simulation results are compared with measurements performed on a Varian AS1000 portal imager, and with a previously published simulation performed using clinical fluence levels. Results: On the order of only 10–100 gamma photons per flood image were required to be detected to avoid biasing the NPS estimate. This allowed for a factor of 10^7 reduction in fluence compared to clinical levels with no loss of accuracy. An optimal signal-to-noise ratio (SNR) was achieved by increasing the number of flood images from a typical value of 100 up to 500, thereby illustrating the importance of flood image quantity over the number of gammas per flood. For the point-spread ensemble technique, an additional 2× reduction in the number of incident gammas was realized. As a result, when modeling gamma transport in a thick pixelated array, the simulation time was reduced from 2.5 × 10^6 CPU min if using clinical fluence levels to 3.1 CPU min if using optimized fluence levels while also producing a higher SNR. The AS1000 DQE(f) simulation entailing both optical and radiative transport matched experimental results to within 11%, and required 14.5 min to complete on a single CPU. Conclusions: The authors demonstrate the feasibility of accurately modeling x-ray detector DQE(f) with completion times on the order of several minutes using a single CPU. Convenience of simulation can be achieved using GEANT4 which offers both gamma and optical photon transport capabilities. PMID:24593734
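
    For context, the three characteristic curves are linked by the standard (IEC-style) frequency-domain relation, and the derivation summarised above implies the following scaling for the NPS estimate; neither expression is quoted verbatim from the paper:

      \mathrm{DQE}(f) \;=\; \frac{\bar{d}^{\,2}\,\mathrm{MTF}^{2}(f)}{\bar{q}\,\mathrm{NPS}(f)},
      \qquad
      \mathrm{MSE}\!\left[\widehat{\mathrm{NPS}}(f)\right] \;\propto\; \frac{1}{N_{\mathrm{flood}}},

    where d̄ is the mean detector signal, q̄ the incident photon fluence, and N_flood the number of flood images in the ensemble.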

  20. Rapid Monte Carlo simulation of detector DQE(f)

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Star-Lack, Josh, E-mail: josh.starlack@varian.com; Sun, Mingshan; Abel, Eric

    2014-03-15

    Purpose: Performance optimization of indirect x-ray detectors requires proper characterization of both ionizing (gamma) and optical photon transport in a heterogeneous medium. As the tool of choice for modeling detector physics, Monte Carlo methods have failed to gain traction as a design utility, due mostly to excessive simulation times and a lack of convenient simulation packages. The most important figure-of-merit in assessing detector performance is the detective quantum efficiency (DQE), for which most of the computational burden has traditionally been associated with the determination of the noise power spectrum (NPS) from an ensemble of flood images, each conventionally having 10^7 − 10^9 detected gamma photons. In this work, the authors show that the idealized conditions inherent in a numerical simulation allow for a dramatic reduction in the number of gamma and optical photons required to accurately predict the NPS. Methods: The authors derived an expression for the mean squared error (MSE) of a simulated NPS when computed using the International Electrotechnical Commission-recommended technique based on taking the 2D Fourier transform of flood images. It is shown that the MSE is inversely proportional to the number of flood images, and is independent of the input fluence provided that the input fluence is above a minimal value that avoids biasing the estimate. The authors then propose to further lower the input fluence so that each event creates a point-spread function rather than a flood field. The authors use this finding as the foundation for a novel algorithm in which the characteristic MTF(f), NPS(f), and DQE(f) curves are simultaneously generated from the results of a single run. The authors also investigate lowering the number of optical photons used in a scintillator simulation to further increase efficiency. Simulation results are compared with measurements performed on a Varian AS1000 portal imager, and with a previously published simulation performed using clinical fluence levels. Results: On the order of only 10–100 gamma photons per flood image were required to be detected to avoid biasing the NPS estimate. This allowed for a factor of 10^7 reduction in fluence compared to clinical levels with no loss of accuracy. An optimal signal-to-noise ratio (SNR) was achieved by increasing the number of flood images from a typical value of 100 up to 500, thereby illustrating the importance of flood image quantity over the number of gammas per flood. For the point-spread ensemble technique, an additional 2× reduction in the number of incident gammas was realized. As a result, when modeling gamma transport in a thick pixelated array, the simulation time was reduced from 2.5 × 10^6 CPU min if using clinical fluence levels to 3.1 CPU min if using optimized fluence levels while also producing a higher SNR. The AS1000 DQE(f) simulation entailing both optical and radiative transport matched experimental results to within 11%, and required 14.5 min to complete on a single CPU. Conclusions: The authors demonstrate the feasibility of accurately modeling x-ray detector DQE(f) with completion times on the order of several minutes using a single CPU. Convenience of simulation can be achieved using GEANT4 which offers both gamma and optical photon transport capabilities.

  1. A hybrid short read mapping accelerator

    PubMed Central

    2013-01-01

    Background The rapid growth of short read datasets poses a new challenge to the short read mapping problem in terms of sensitivity and execution speed. Existing methods often use a restrictive error model for computing the alignments to improve speed, whereas more flexible error models are generally too slow for large-scale applications. A number of short read mapping software tools have been proposed. However, designs based on hardware are relatively rare. Field programmable gate arrays (FPGAs) have been successfully used in a number of specific application areas, such as the DSP and communications domains due to their outstanding parallel data processing capabilities, making them a competitive platform to solve problems that are “inherently parallel”. Results We present a hybrid system for short read mapping utilizing both FPGA-based hardware and CPU-based software. The computation intensive alignment and the seed generation operations are mapped onto an FPGA. We present a computationally efficient, parallel block-wise alignment structure (Align Core) to approximate the conventional dynamic programming algorithm. The performance is compared to the multi-threaded CPU-based GASSST and BWA software implementations. For single-end alignment, our hybrid system achieves faster processing speed than GASSST (with a similar sensitivity) and BWA (with a higher sensitivity); for pair-end alignment, our design achieves a slightly worse sensitivity than that of BWA but has a higher processing speed. Conclusions This paper shows that our hybrid system can effectively accelerate the mapping of short reads to a reference genome based on the seed-and-extend approach. The performance comparison to the GASSST and BWA software implementations under different conditions shows that our hybrid design achieves a high degree of sensitivity and requires less overall execution time with only modest FPGA resource utilization. Our hybrid system design also shows that the performance bottleneck for the short read mapping problem can be changed from the alignment stage to the seed generation stage, which provides an additional requirement for the future development of short read aligners. PMID:23441908

  2. Quantitative mouse brain phenotyping based on single and multispectral MR protocols

    PubMed Central

    Badea, Alexandra; Gewalt, Sally; Avants, Brian B.; Cook, James J.; Johnson, G. Allan

    2013-01-01

    Sophisticated image analysis methods have been developed for the human brain, but such tools still need to be adapted and optimized for quantitative small animal imaging. We propose a framework for quantitative anatomical phenotyping in mouse models of neurological and psychiatric conditions. The framework encompasses an atlas space, image acquisition protocols, and software tools to register images into this space. We show that a suite of segmentation tools (Avants, Epstein et al., 2008) designed for human neuroimaging can be incorporated into a pipeline for segmenting mouse brain images acquired with multispectral magnetic resonance imaging (MR) protocols. We present a flexible approach for segmenting such hyperimages, optimizing registration, and identifying optimal combinations of image channels for particular structures. Brain imaging with T1, T2* and T2 contrasts yielded accuracy in the range of 83% for hippocampus and caudate putamen (Hc and CPu), but only 54% in white matter tracts, and 44% for the ventricles. The addition of diffusion tensor parameter images improved accuracy for large gray matter structures (by >5%), white matter (10%), and ventricles (15%). The use of Markov random field segmentation further improved overall accuracy in the C57BL/6 strain by 6%; so Dice coefficients for Hc and CPu reached 93%, for white matter 79%, for ventricles 68%, and for substantia nigra 80%. We demonstrate the segmentation pipeline for the widely used C57BL/6 strain, and two test strains (BXD29, APP/TTA). This approach appears promising for characterizing temporal changes in mouse models of human neurological and psychiatric conditions, and may provide anatomical constraints for other preclinical imaging, e.g. fMRI and molecular imaging. This is the first demonstration that multiple MR imaging modalities combined with multivariate segmentation methods lead to significant improvements in anatomical segmentation in the mouse brain. PMID:22836174
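
    The accuracy percentages quoted here are Dice overlap coefficients; for an automatic label set A and a reference label set B,

      D(A,B) \;=\; \frac{2\,\lvert A \cap B \rvert}{\lvert A \rvert + \lvert B \rvert},

    so D = 1 (100%) indicates perfect overlap and D = 0 indicates disjoint segmentations.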

  3. Airloads on Bluff Bodies, with Application to the Rotor-Induced Downloads on Tilt-Rotor Aircraft.

    DTIC Science & Technology

    1983-09-01

    interference aerodynamics would be ...tion on hover performance (Ref. 11), to study the two-dimensional section characteristics of a wing in the wake of a...resources for large numbers of vortices; a typical case requires 10-15 min CPU time on the Ames Cray 1S computer. Figure 6 shows a typical result. Here...CPU time per case on a Prime 550 computer to converge to a steady solution; this would be equivalent to one or two seconds on

  4. The METAL System. Volume I and Volume II. Appendices.

    DTIC Science & Technology

    1981-01-01

    demands, and fair CPU time were measured. The fair measure reported here includes the pure CPU time plus a pro-rated portion of the time consumed by the...syntactic class of the form matched. NO = noun VB = verb OTR = other part of speech Although the above feature is not used by the system at present...indicate the syntactic class of the form matched. NO = noun other than gerund ("content", "dark", "African") INF = infinitive ("direct", "equal", "content

  5. Advanced Edit System.

    DTIC Science & Technology

    1983-01-01

    MFR Model Computer Subsystem 1. Cabinet 0, PDP-11/70 CPU with 11/70 CPU, and Floating point processor DEC 11/79-UK 2. Cabinet 1, with SDLC ... software T-square. o Unit lock causes a user-defined roundoff factor to be applied to all points selected with the cursor. o Grid lock...

  6. Optimization of Selected Remote Sensing Algorithms for Embedded NVIDIA Kepler GPU Architecture

    NASA Technical Reports Server (NTRS)

    Riha, Lubomir; Le Moigne, Jacqueline; El-Ghazawi, Tarek

    2015-01-01

    This paper evaluates the potential of the embedded Graphics Processing Unit in Nvidia's Tegra K1 for onboard processing. The performance is compared to a general-purpose multi-core CPU and a full-fledged GPU accelerator. This study uses two algorithms: Wavelet Spectral Dimension Reduction of Hyperspectral Imagery and the Automated Cloud-Cover Assessment (ACCA) Algorithm. The Tegra K1 achieved 51 for the ACCA algorithm and 20 for the dimension reduction algorithm, as compared to the performance of a high-end 8-core server Intel Xeon CPU with 13.5 times higher power consumption.

  7. Stress myocardial perfusion imaging in the emergency department--new techniques for speed and diagnostic accuracy.

    PubMed

    Harrison, Sheri D; Harrison, Mark A; Duvall, W Lane

    2012-05-01

    Emergency room evaluations of patients presenting with chest pain continue to rise, and these evaluations, which often include cardiac imaging, are an increasing area of resource utilization in the current health system. Myocardial perfusion imaging from the emergency department remains a vital component of the diagnosis or exclusion of coronary artery disease as the etiology of chest pain. Recent advances in camera technology and changes to the imaging protocols have allowed MPI to become a more efficient way of providing this diagnostic information. Compared with conventional SPECT, new high-efficiency CZT cameras provide a 3-5-fold increase in photon sensitivity, a 1.65-fold improvement in energy resolution and a 1.7-2.5-fold increase in spatial resolution. With stress-only imaging, rest images are eliminated if stress images are normal, as they provide no additional prognostic or diagnostic value, and cancelling the rest images would shorten the length of the test, which is of particular importance to the ED population. The rapid but accurate triage of patients in an ED CPU is essential to their care, and stress-only imaging and new CZT cameras allow for shorter test time, lower radiation doses and lower costs while demonstrating good clinical outcomes. These changes to nuclear stress testing can allow for faster throughput of patients through the emergency department while providing a safe and efficient evaluation of chest pain.

  8. A Reliability-Based Particle Filter for Humanoid Robot Self-Localization in RoboCup Standard Platform League

    PubMed Central

    Sánchez, Eduardo Munera; Alcobendas, Manuel Muñoz; Noguera, Juan Fco. Blanes; Gilabert, Ginés Benet; Simó Ten, José E.

    2013-01-01

    This paper deals with the problem of humanoid robot localization and proposes a new method for position estimation that has been developed for the RoboCup Standard Platform League environment. Firstly, a complete vision system has been implemented in the Nao robot platform that enables the detection of relevant field markers. The detection of field markers provides some estimation of distances for the current robot position. To reduce errors in these distance measurements, extrinsic and intrinsic camera calibration procedures have been developed and described. To validate the localization algorithm, experiments covering many of the typical situations that arise during RoboCup games have been developed: ranging from degradation in position estimation to total loss of position (due to falls, ‘kidnapped robot’, or penalization). The self-localization method developed is based on the classical particle filter algorithm. The main contribution of this work is a new particle selection strategy. Our approach reduces the CPU computing time required for each iteration and so eases the limited resource availability problem that is common in robot platforms such as Nao. The experimental results show the quality of the new algorithm in terms of localization and CPU time consumption. PMID:24193098
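
    For orientation, the sketch below shows a generic particle-filter update with systematic resampling triggered by a low effective sample size. It illustrates only the standard selection step; the reliability-based particle selection strategy contributed by the paper is not reproduced here, and the motion and likelihood models are left as user-supplied callables.

      import numpy as np

      def resample_systematic(particles, weights, rng):
          # particles: (n, d) ndarray of states; weights: normalized (n,) ndarray
          n = len(particles)
          positions = (rng.random() + np.arange(n)) / n
          cumulative = np.cumsum(weights)
          idx = np.searchsorted(cumulative, positions)
          return particles[idx]

      def pf_step(particles, weights, motion, measurement_likelihood, rng):
          particles = motion(particles, rng)                    # predict
          weights = weights * measurement_likelihood(particles) # weight by observation
          weights /= weights.sum()                              # normalize
          if 1.0 / np.sum(weights ** 2) < len(particles) / 2:   # low effective sample size
              particles = resample_systematic(particles, weights, rng)
              weights = np.full(len(particles), 1.0 / len(particles))
          return particles, weights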

  9. Optimizing SIEM Throughput on the Cloud Using Parallelization

    PubMed Central

    Alam, Masoom; Ihsan, Asif; Javaid, Qaisar; Khan, Abid; Manzoor, Jawad; Akhundzada, Adnan; Khan, M Khurram; Farooq, Sajid

    2016-01-01

    Processing large amounts of data in real time to identify security issues poses several performance challenges, especially when hardware infrastructure is limited. Managed Security Service Providers (MSSPs), mostly hosting their applications on the Cloud, receive events at a very high rate that varies from a few hundred to a couple of thousand events per second (EPS). It is critical to process this data efficiently, so that attacks can be identified quickly and the necessary response initiated. This paper evaluates the performance of a security framework, OSTROM, built on the Esper complex event processing (CEP) engine under parallel and non-parallel computational frameworks. We explain three architectures under which Esper can be used to process events. We investigated the effect on throughput, memory and CPU usage in each configuration setting. The results indicate that the performance of the engine is limited by the number of events coming in rather than by the queries being processed. The architecture where 1/4th of the total events are submitted to each instance and all the queries are processed by all the units shows the best results in terms of throughput, memory and CPU usage. PMID:27851762

  10. Efficient molecular dynamics simulations with many-body potentials on graphics processing units

    NASA Astrophysics Data System (ADS)

    Fan, Zheyong; Chen, Wei; Vierimaa, Ville; Harju, Ari

    2017-09-01

    Graphics processing units have been extensively used to accelerate classical molecular dynamics simulations. However, there is much less progress on the acceleration of force evaluations for many-body potentials compared to pairwise ones. In the conventional force evaluation algorithm for many-body potentials, the force, virial stress, and heat current for a given atom are accumulated within different loops, which could result in write conflicts between different threads in a CUDA kernel. In this work, we provide a new force evaluation algorithm, which is based on an explicit pairwise force expression for many-body potentials derived recently (Fan et al., 2015). In our algorithm, the force, virial stress, and heat current for a given atom can be accumulated within a single thread and are free of write conflicts. We discuss the formulations and algorithms and evaluate their performance. A new open-source code, GPUMD, is developed based on the proposed formulations. For the Tersoff many-body potential, the double precision performance of GPUMD using a Tesla K40 card is equivalent to that of the LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator) molecular dynamics code running with about 100 CPU cores (Intel Xeon CPU X5670 @ 2.93 GHz).
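
    The write-conflict-free accumulation can be pictured with the following pure-Python sketch: each atom gathers contributions over its own neighbour list, so on the GPU one thread per atom owns all of its accumulations. The pair force used here is a placeholder soft repulsion, not the explicit pairwise form of the Tersoff potential used by GPUMD.

      import numpy as np

      def pair_force(r_ij, cutoff=2.5):
          # placeholder repulsive pair force on atom i, directed along r_ij = r_i - r_j
          r = np.linalg.norm(r_ij)
          return (r_ij / r) / r**2 if r < cutoff else np.zeros(3)

      def gather_forces(positions, neighbours):
          # gather formulation: "thread" i accumulates only into forces[i],
          # so no two threads ever write to the same location
          forces = np.zeros_like(positions)          # positions: (n, 3) ndarray
          for i in range(len(positions)):
              for j in neighbours[i]:                # neighbours: list of index lists
                  forces[i] += pair_force(positions[i] - positions[j])
          return forces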

  11. Prestack depth migration for complex 2D structure using phase-screen propagators

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Roberts, P.; Huang, Lian-Jie; Burch, C.

    1997-11-01

    We present results for the phase-screen propagator method applied to prestack depth migration of the Marmousi synthetic data set. The data were migrated as individual common-shot records and the resulting partial images were superposed to obtain the final complete image. Tests were performed to determine the minimum number of frequency components required to achieve the best quality image, and this in turn provided estimates of the minimum computing time. Running on a single-processor SUN SPARC Ultra I, high quality images were obtained in as little as 8.7 CPU hours and adequate images were obtained in as little as 4.4 CPU hours. Different methods were tested for choosing the reference velocity used for the background phase-shift operation and for defining the slowness perturbation screens. Although the depths of some of the steeply dipping, high-contrast features were shifted slightly, the overall image quality was fairly insensitive to the choice of the reference velocity. Our tests show the phase-screen method to be a reliable and fast algorithm for imaging complex geologic structures, at least for complex 2D synthetic data where the velocity model is known.

  12. High Performance GPU-Based Fourier Volume Rendering.

    PubMed

    Abdellah, Marwan; Eldeib, Ayman; Sharawi, Amr

    2015-01-01

    Fourier volume rendering (FVR) is a significant visualization technique that has been used widely in digital radiography. As a result of its O(N^2 log N) time complexity, it provides a faster alternative to spatial-domain volume rendering algorithms, which are O(N^3) in computational complexity. Relying on the Fourier projection-slice theorem, this technique operates on the spectral representation of a 3D volume instead of processing its spatial representation to generate attenuation-only projections that look like X-ray radiographs. Due to the rapid evolution of its underlying architecture, the graphics processing unit (GPU) became an attractive, competent platform that can deliver giant raw computational power compared to the central processing unit (CPU) on a per-dollar basis. The introduction of the compute unified device architecture (CUDA) technology enables embarrassingly parallel algorithms to run efficiently on CUDA-capable GPU architectures. In this work, a high-performance GPU-accelerated implementation of the FVR pipeline on CUDA-enabled GPUs is presented. The proposed implementation can achieve a speed-up of 117x compared to a single-threaded hybrid implementation that uses the CPU and GPU together, by taking advantage of executing the rendering pipeline entirely on recent GPU architectures.
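
    The renderer rests on the Fourier projection-slice theorem, which in its standard 3D form (not quoted from the paper) states that the 2D Fourier transform of a parallel projection of the volume equals the central plane of the volume's 3D Fourier transform perpendicular to the viewing direction:

      \mathcal{F}_{2}\bigl\{ P_{\hat{\mathbf{n}}} \bigr\}(k_u,k_v)
        \;=\; F(\mathbf{k})\bigr|_{\mathbf{k}\cdot\hat{\mathbf{n}}=0},
      \qquad
      F(\mathbf{k}) \;=\; \mathcal{F}_{3}\{ f \}(\mathbf{k}),

    so each view reduces to a slice extraction plus a 2D inverse FFT, which is where the O(N^2 log N) versus O(N^3) advantage over ray casting comes from.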

  13. fast_protein_cluster: parallel and optimized clustering of large-scale protein modeling data.

    PubMed

    Hung, Ling-Hong; Samudrala, Ram

    2014-06-15

    fast_protein_cluster is a fast, parallel and memory efficient package used to cluster 60 000 sets of protein models (with up to 550 000 models per set) generated by the Nutritious Rice for the World project. fast_protein_cluster is an optimized and extensible toolkit that supports Root Mean Square Deviation after optimal superposition (RMSD) and Template Modeling score (TM-score) as metrics. RMSD calculations using a laptop CPU are 60× faster than qcprot and 3× faster than current graphics processing unit (GPU) implementations. New GPU code further increases the speed of RMSD and TM-score calculations. fast_protein_cluster provides novel k-means and hierarchical clustering methods that are up to 250× and 2000× faster, respectively, than Clusco, and identify significantly more accurate models than Spicker and Clusco. fast_protein_cluster is written in C++ using OpenMP for multi-threading support. Custom streaming Single Instruction Multiple Data (SIMD) extensions and advanced vector extension intrinsics code accelerate CPU calculations, and OpenCL kernels support AMD and Nvidia GPUs. fast_protein_cluster is available under the M.I.T. license. (http://software.compbio.washington.edu/fast_protein_cluster) © The Author 2014. Published by Oxford University Press.

  14. Toshiba TDF-500 High Resolution Viewing And Analysis System

    NASA Astrophysics Data System (ADS)

    Roberts, Barry; Kakegawa, M.; Nishikawa, M.; Oikawa, D.

    1988-06-01

    A high resolution, operator interactive, medical viewing and analysis system has been developed by Toshiba and Bio-Imaging Research. This system provides many advanced features including high resolution displays, a very large image memory and advanced image processing capability. In particular, the system provides CRT frame buffers capable of update in one frame period, an array processor capable of image processing at operator interactive speeds, and a memory system capable of updating multiple frame buffers at frame rates whilst supporting multiple array processors. The display system provides 1024 x 1536 display resolution at 40Hz frame and 80Hz field rates. In particular, the ability to provide whole or partial update of the screen at the scanning rate is a key feature. This allows multiple viewports or windows in the display buffer with both fixed and cine capability. To support image processing features such as windowing, pan, zoom, minification, filtering, ROI analysis, multiplanar and 3D reconstruction, a high performance CPU is integrated into the system. This CPU is an array processor capable of up to 400 million instructions per second. To support the multiple viewer and array processors' instantaneous high memory bandwidth requirement, an ultra fast memory system is used. This memory system has a bandwidth capability of 400MB/sec and a total capacity of 256MB. This bandwidth is more than adequate to support several high resolution CRT's and also the fast processing unit. This fully integrated approach allows effective real time image processing. The integrated design of viewing system, memory system and array processor are key to the imaging system. It is the intention to describe the architecture of the image system in this paper.

  15. Managing Contention and Timing Constraints in a Real-Time Database System

    DTIC Science & Technology

    1995-01-01

    In order to realize many of these goals, StarBase is constructed on top of RT-Mach, a real-time operating system developed at Carnegie Mellon...University [11]. StarBase differs from previous RT-DBMS work [1, 2, 3] in that a) it relies on a real-time operating system which provides priority...CPU and resource scheduling provided by the underlying real-time operating system. Issues of data contention are dealt with by use of a priority

  16. Parallel-vector out-of-core equation solver for computational mechanics

    NASA Technical Reports Server (NTRS)

    Qin, J.; Agarwal, T. K.; Storaasli, O. O.; Nguyen, D. T.; Baddourah, M. A.

    1993-01-01

    A parallel/vector out-of-core equation solver is developed for shared-memory computers, such as the Cray Y-MP machine. The input/output (I/O) time is reduced by using the asynchronous BUFFER IN and BUFFER OUT, which can be executed simultaneously with the CPU instructions. The parallel and vector capability provided by the supercomputers is also exploited to enhance performance. Numerical applications in large-scale structural analysis are given to demonstrate the efficiency of the present out-of-core solver.

  17. Towards Availability and Maintainability Benchmarks: A Case Study of Software RAID Systems

    DTIC Science & Technology

    2001-01-01

    on recent outages of big e-commerce providers and the major business impact of those outages is staggering; furthermore, several of those outages...uses one or more clients to generate a realistic, statistically reproducible web workload; its workload models what might be seen on a busy major server...the amount of dynamic content from 30% to 1% to keep the disks busy and to avoid saturating the CPU. This restriction was necessary because we used

  18. Single-chip microcomputer application in high-altitude balloon orientation system

    NASA Technical Reports Server (NTRS)

    Lim, T. S.; Ehrmann, C. H.; Allison, S. R.

    1980-01-01

    This paper describes the application of a single-chip microcomputer in a high-altitude balloon instrumentation system. The system, consisting of a magnetometer, a stepping motor, a microcomputer and a gray code shaft encoder, is used to provide an orientation reference to point a scientific instrument at an object in space. The single-chip microcomputer, Intel's 8748, consisting of a CPU, program memory, data memory and I/O ports, is used to control the orientation of the system.

  19. A 32-bit NMOS microprocessor with a large register file

    NASA Astrophysics Data System (ADS)

    Sherburne, R. W., Jr.; Katevenis, M. G. H.; Patterson, D. A.; Sequin, C. H.

    1984-10-01

    Two scaled versions of a 32-bit NMOS reduced instruction set computer CPU, called RISC II, have been implemented on two different processing lines using the simple Mead and Conway layout rules with lambda values of 2 and 1.5 microns (corresponding to drawn gate lengths of 4 and 3 microns), respectively. The design utilizes a small set of simple instructions in conjunction with a large register file in order to provide high performance. This approach has resulted in two surprisingly powerful single-chip processors.

  20. Advances in Mechanisms Supporting Data Collection on Future Force Networks: Product Manager C4ISR On-the-Move

    DTIC Science & Technology

    2008-12-01

    for Layer 3 data capture: NetPoll ncap tget Monitor session Radio System switch router User App interface box GPS This model applies to most fixed...developed a lightweight, custom implementation, termed ncap. As described in Section 3.1, the Ground Truth System provides a linkage between host...computer CPU time and GPS time, and ncap leverages this to perform highly precise (<1 msec) time tagging of offered and received packets. Such

  1. INFN-Pisa scientific computation environment (GRID, HPC and Interactive Analysis)

    NASA Astrophysics Data System (ADS)

    Arezzini, S.; Carboni, A.; Caruso, G.; Ciampa, A.; Coscetti, S.; Mazzoni, E.; Piras, S.

    2014-06-01

    The INFN-Pisa Tier2 infrastructure is described, optimized not only for GRID CPU and storage access, but also for more interactive use of the resources in order to provide good solutions for the final data-analysis step. The Data Center, equipped with about 6700 production cores, permits the use of modern analysis techniques realized via advanced statistical tools (like RooFit and RooStat) implemented on multicore systems. In particular, POSIX file storage access integrated with standard SRM access is provided. The unified storage infrastructure, based on GPFS and Xrootd and used both as the SRM data repository and for interactive POSIX access, is therefore described. Such a common infrastructure gives users transparent access to the Tier2 data for their interactive analysis. The organization of a specialized many-core CPU facility devoted to interactive analysis is also described, along with a login mechanism integrated with INFN-AAI (the national INFN infrastructure) to extend site access and use to a geographically distributed community. This infrastructure also serves a national computing facility used by the INFN theoretical community, enabling a synergic use of computing and storage resources. Our center, initially developed for the HEP community, is now growing and also includes fully integrated HPC resources. In recent years a cluster facility (1000 cores, parallel use via InfiniBand connection) has been installed and managed, and we are now updating this facility to provide resources for all the intermediate-level HPC computing needs of the national INFN theoretical community.

  2. SU-F-BRD-02: Application of ARCHERRT-- A GPU-Based Monte Carlo Dose Engine for Radiation Therapy -- to Tomotherapy and Patient-Independent IMRT

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Su, L; Du, X; Liu, T

    Purpose: As a module of ARCHER -- Accelerated Radiation-transport Computations in Heterogeneous EnviRonments -- ARCHERRT is designed for RadioTherapy (RT) dose calculation. This paper describes the application of ARCHERRT to patient-dependent TomoTherapy and patient-independent IMRT. It also conducts a 'fair' comparison of different GPUs and a multicore CPU. Methods: The source input used for patient-dependent TomoTherapy is a phase space file (PSF) generated from the optimized plan. For patient-independent IMRT, an open-field PSF is used for the different cases, and the intensity modulation is simulated by a fluence map. The GEANT4 code is used as the benchmark. DVH and gamma index tests are employed to evaluate the accuracy of the ARCHERRT code. Some previous studies reported misleading speedups by comparing GPU code with serial CPU code. To perform a fairer comparison, we wrote multi-threaded code with OpenMP to fully exploit the computing potential of the CPU. The hardware involved in this study comprises a 6-core Intel E5-2620 CPU, 6 NVIDIA M2090 GPUs, a K20 GPU and a K40 GPU. Results: Dosimetric results from ARCHERRT and GEANT4 show good agreement. The 2%/2mm gamma test pass rates for the different clinical cases are 97.2% to 99.7%. A single M2090 GPU needs 50-79 seconds for the simulation to achieve a statistical error of 1% in the PTV. The K40 card is about 1.7-1.8 times faster than the M2090 card. Using 6 M2090 cards, the simulation can be finished in about 10 seconds. For comparison, the Intel E5-2620 needs 507-879 seconds for the same simulation. Conclusion: We successfully applied ARCHERRT to TomoTherapy and patient-independent IMRT, and conducted a fair comparison between GPU and CPU performance. The ARCHERRT code is both accurate and efficient and may be used towards clinical applications.
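
    As an illustration of the kind of multi-threaded CPU baseline the authors argue for, the sketch below parallelizes a Monte Carlo history loop with OpenMP so that a GPU-versus-CPU comparison uses all CPU cores rather than a single thread. This is not the ARCHERRT code: simulate_one_history() is a hypothetical placeholder for particle transport, and the per-thread tallies and seeds are illustrative assumptions.

      // Minimal sketch of an OpenMP CPU baseline for a Monte Carlo dose loop.
      #include <omp.h>
      #include <cstddef>
      #include <cstdint>
      #include <random>
      #include <vector>

      static void simulate_one_history(std::vector<double>& dose, std::mt19937_64& rng) {
          // Placeholder for real particle transport: deposit unit energy in a random voxel.
          std::uniform_int_distribution<std::size_t> voxel(0, dose.size() - 1);
          dose[voxel(rng)] += 1.0;
      }

      std::vector<double> run_histories(std::int64_t n_histories, std::size_t n_voxels) {
          std::vector<double> dose(n_voxels, 0.0);
          #pragma omp parallel
          {
              std::vector<double> local(n_voxels, 0.0);            // per-thread tally
              std::mt19937_64 rng(1234u + omp_get_thread_num());   // independent RNG streams
              #pragma omp for schedule(static)
              for (std::int64_t i = 0; i < n_histories; ++i)
                  simulate_one_history(local, rng);
              #pragma omp critical                                  // merge tallies once per thread
              for (std::size_t v = 0; v < n_voxels; ++v) dose[v] += local[v];
          }
          return dose;
      }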

  3. Collaborating CPU and GPU for large-scale high-order CFD simulations with complex grids on the TianHe-1A supercomputer

    NASA Astrophysics Data System (ADS)

    Xu, Chuanfu; Deng, Xiaogang; Zhang, Lilun; Fang, Jianbin; Wang, Guangxue; Jiang, Yi; Cao, Wei; Che, Yonggang; Wang, Yongxian; Wang, Zhenghua; Liu, Wei; Cheng, Xinghua

    2014-12-01

    Programming and optimizing complex, real-world CFD codes on current many-core accelerated HPC systems is very challenging, especially when collaborating CPUs and accelerators to fully tap the potential of heterogeneous systems. In this paper, with a tri-level hybrid and heterogeneous programming model using MPI + OpenMP + CUDA, we port and optimize our high-order multi-block structured CFD software HOSTA on the GPU-accelerated TianHe-1A supercomputer. HOSTA adopts two self-developed high-order compact finite difference schemes, WCNS and HDCS, that can simulate flows with complex geometries. We present a dual-level parallelization scheme for efficient multi-block computation on GPUs and perform particular kernel optimizations for high-order CFD schemes. The GPU-only approach achieves a speedup of about 1.3 when comparing one Tesla M2050 GPU with two Xeon X5670 CPUs. To achieve a greater speedup, we collaborate CPU and GPU for HOSTA instead of using a naive GPU-only approach. We present a novel scheme to balance the loads between the memory-poor GPU and the memory-rich CPU. Taking CPU and GPU load balance into account, we improve the maximum simulation problem size per TianHe-1A node for HOSTA by 2.3×, while the collaborative approach improves the performance by around 45% compared to the GPU-only approach. Further, to scale HOSTA on TianHe-1A, we propose a gather/scatter optimization to minimize PCI-e data transfer times for the ghost and singularity data of 3D grid blocks, and we overlap the collaborative computation and communication as far as possible using advanced CUDA and MPI features. Scalability tests show that HOSTA can achieve a parallel efficiency of above 60% on 1024 TianHe-1A nodes. With our method, we have successfully simulated an EET high-lift airfoil configuration containing 800M cells and China's large civil airplane configuration containing 150M cells. To the best of our knowledge, these are the largest-scale CPU-GPU collaborative simulations that solve realistic CFD problems with both complex configurations and high-order schemes.
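
    A hedged sketch of the static load-balancing idea described above: grid cells (or blocks) are split between CPU and GPU in proportion to their measured per-cell throughputs, so neither device waits long for the other. The function name and the notion of a fixed throughput ratio are illustrative assumptions, not details taken from HOSTA, which must additionally respect the smaller GPU memory when choosing the split.

      #include <cstddef>
      #include <utility>

      // Split n_cells between GPU and CPU proportionally to measured throughput
      // (cells processed per second, both assumed > 0). Returns {gpu_cells, cpu_cells}.
      std::pair<std::size_t, std::size_t>
      split_workload(std::size_t n_cells, double gpu_cells_per_s, double cpu_cells_per_s) {
          const double total = gpu_cells_per_s + cpu_cells_per_s;
          const std::size_t gpu_cells =
              static_cast<std::size_t>(n_cells * (gpu_cells_per_s / total));
          return {gpu_cells, n_cells - gpu_cells};
      }

    In a heterogeneous run, each MPI rank would apply such a ratio to its own blocks and re-measure periodically as the flow solution evolves.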

  4. CMSA: a heterogeneous CPU/GPU computing system for multiple similar RNA/DNA sequence alignment.

    PubMed

    Chen, Xi; Wang, Chen; Tang, Shanjiang; Yu, Ce; Zou, Quan

    2017-06-24

    The multiple sequence alignment (MSA) is a classic and powerful technique for sequence analysis in bioinformatics. With the rapid growth of biological datasets, MSA parallelization becomes necessary to keep its running time at an acceptable level. Although there is a lot of work on MSA problems, existing approaches are either insufficient or contain implicit assumptions that limit their generality. First, the characteristics of users' sequences, including the size of the dataset and the lengths of the sequences, can take arbitrary values and are generally unknown before submission, which is unfortunately ignored by previous work. Second, the center star strategy is suited for aligning similar sequences, but its first stage, center sequence selection, is highly time-consuming and requires further optimization. Moreover, given a heterogeneous CPU/GPU platform, prior studies consider MSA parallelization on GPU devices only, leaving the CPUs idle during the computation. Co-run computation, however, can maximize the utilization of the computing resources by enabling the workload computation on both CPU and GPU simultaneously. This paper presents CMSA, a robust and efficient MSA system for large-scale datasets on the heterogeneous CPU/GPU platform. It performs and optimizes multiple sequence alignment automatically for users' submitted sequences without any assumptions. CMSA adopts the co-run computation model so that both CPU and GPU devices are fully utilized. Moreover, CMSA proposes an improved center star strategy that reduces the time complexity of its center sequence selection process from O(mn²) to O(mn). The experimental results show that CMSA achieves up to an 11× speedup and outperforms the state-of-the-art software. CMSA focuses on multiple similar RNA/DNA sequence alignment and proposes a novel bitmap-based algorithm to improve the center star strategy. We can conclude that harvesting the high performance of modern GPUs is a promising approach to accelerate multiple sequence alignment. Besides, adopting the co-run computation model can maximize the entire system utilization significantly. The source code is available at https://github.com/wangvsa/CMSA.
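
    For context, the classical center star strategy picks as center the sequence whose total distance to all others is minimal, and then aligns every other sequence against it. The sketch below shows only that selection step, with a crude placeholder distance function; CMSA's actual contribution is a much faster, bitmap-based selection, which is not reproduced here.

      #include <cstddef>
      #include <cstdlib>
      #include <limits>
      #include <string>
      #include <vector>

      // Placeholder distance; a real implementation would use edit distance,
      // banded alignment, or CMSA's bitmap-based similarity measure.
      static long distance(const std::string& a, const std::string& b) {
          return std::labs(static_cast<long>(a.size()) - static_cast<long>(b.size()));
      }

      // Classical center selection: O(m^2) distance evaluations for m sequences.
      std::size_t select_center(const std::vector<std::string>& seqs) {
          std::size_t best = 0;
          long best_sum = std::numeric_limits<long>::max();
          for (std::size_t i = 0; i < seqs.size(); ++i) {
              long sum = 0;
              for (std::size_t j = 0; j < seqs.size(); ++j)
                  if (i != j) sum += distance(seqs[i], seqs[j]);
              if (sum < best_sum) { best_sum = sum; best = i; }
          }
          return best;   // index of the center sequence
      }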

  5. Many-integrated core (MIC) technology for accelerating Monte Carlo simulation of radiation transport: A study based on the code DPM

    NASA Astrophysics Data System (ADS)

    Rodriguez, M.; Brualla, L.

    2018-04-01

    Monte Carlo simulation of radiation transport is computationally demanding when reasonably low statistical uncertainties of the estimated quantities are required, and therefore it can benefit to a large extent from high-performance computing. This work is aimed at assessing the performance of the first generation of the many-integrated core (MIC) Xeon Phi coprocessor with respect to that of a CPU consisting of two 12-core Xeon processors in Monte Carlo simulation of coupled electron-photon showers. The comparison was twofold: first, through a suite of basic tests, including parallel versions of the random number generators Mersenne Twister and a modified implementation of RANECU, intended to establish a baseline comparison between both devices; second, through the pDPM code developed in this work. pDPM is a parallel version of the Dose Planning Method (DPM) program for fast Monte Carlo simulation of radiation transport in voxelized geometries. A variety of techniques aimed at obtaining good scalability on the Xeon Phi were implemented in pDPM. Maximum scalabilities of 84.2× and 107.5× were obtained on the Xeon Phi for simulations of electron and photon beams, respectively. Nevertheless, in none of the tests involving radiation transport did the Xeon Phi perform better than the CPU. The disadvantage of the Xeon Phi with respect to the CPU is due to the low performance of a single core of the former: a single core of the Xeon Phi was more than 10 times less efficient than a single core of the CPU for all radiation transport simulations.

  6. An efficient implementation of 3D high-resolution imaging for large-scale seismic data with GPU/CPU heterogeneous parallel computing

    NASA Astrophysics Data System (ADS)

    Xu, Jincheng; Liu, Wei; Wang, Jin; Liu, Linong; Zhang, Jianfeng

    2018-02-01

    De-absorption pre-stack time migration (QPSTM) compensates for the absorption and dispersion of seismic waves by introducing an effective Q parameter, thereby making it an effective tool for 3D, high-resolution imaging of seismic data. Although the optimal aperture obtained via stationary-phase migration reduces the computational cost of 3D QPSTM and yields 3D stationary-phase QPSTM, the associated computational efficiency is still the main problem in the processing of 3D, high-resolution images for real large-scale seismic data. In the current paper, we propose a division method for large-scale, 3D seismic data to optimize the performance of stationary-phase QPSTM on clusters of graphics processing units (GPU). We then design an imaging-point parallel strategy to achieve optimal parallel computing performance, and adopt an asynchronous double-buffering scheme with multiple streams to perform the GPU/CPU parallel computing. Moreover, several key optimization strategies for computation and storage based on the compute unified device architecture (CUDA) were adopted to accelerate the 3D stationary-phase QPSTM algorithm. Compared with the initial GPU code, the implementation of the key optimization steps, including thread optimization, shared memory optimization, register optimization and special function units (SFU), greatly improved the efficiency. A numerical example employing real large-scale, 3D seismic data showed that our scheme is nearly 80 times faster than the CPU-QPSTM algorithm. Our GPU/CPU heterogeneous parallel computing framework significantly reduces the computational cost and facilitates 3D high-resolution imaging for large-scale seismic data.
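
    The asynchronous double-buffering idea mentioned above can be expressed generically: while one buffer of data is being processed, the next buffer is loaded in the background, so transfer and computation overlap. The sketch below uses plain C++ threads via std::async rather than CUDA streams, and load_block/migrate_block are hypothetical placeholders for trace input (or host-to-device transfer) and the migration computation.

      #include <cstddef>
      #include <future>
      #include <vector>

      using Block = std::vector<float>;

      Block load_block(int index) {            // placeholder: pretend to read a block of traces
          return Block(1 << 20, static_cast<float>(index));
      }

      void migrate_block(const Block& data) {  // placeholder: pretend to run the migration kernel
          double sum = 0.0;
          for (float v : data) sum += v;
          (void)sum;
      }

      void process_all_blocks(int n_blocks) {
          if (n_blocks <= 0) return;
          Block current = load_block(0);
          for (int i = 0; i < n_blocks; ++i) {
              std::future<Block> next;
              if (i + 1 < n_blocks)            // start fetching the next block in the background
                  next = std::async(std::launch::async, load_block, i + 1);
              migrate_block(current);          // compute on the current buffer
              if (i + 1 < n_blocks)
                  current = next.get();        // swap buffers for the next iteration
          }
      }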

  7. A fast three-dimensional gamma evaluation using a GPU utilizing texture memory for on-the-fly interpolations.

    PubMed

    Persoon, Lucas C G G; Podesta, Mark; van Elmpt, Wouter J C; Nijsten, Sebastiaan M J J G; Verhaegen, Frank

    2011-07-01

    A widely accepted method to quantify differences in dose distributions is the gamma (γ) evaluation. Currently, almost all gamma implementations utilize the central processing unit (CPU). Recently, the graphics processing unit (GPU) has become a powerful platform for specific computing tasks. In this study, we describe the implementation of a 3D gamma evaluation using a GPU to improve calculation time. The gamma evaluation algorithm was implemented on an NVIDIA Tesla C2050 GPU using the compute unified device architecture (CUDA). First, several cubic virtual phantoms were simulated. These phantoms were tested with varying dose cube sizes and set-ups, introducing artificial dose differences. Second, to show applicability in clinical practice, five patient cases have been evaluated using the 3D dose distribution from a treatment planning system as the reference and the delivered dose determined during treatment as the comparison. A calculation time comparison between the CPU and GPU was made with varying thread-block sizes, including the option of using texture or global memory. A GPU over CPU speed-up of 66 ± 12 was achieved for the virtual phantoms. For the patient cases, a speed-up of 57 ± 15 using the GPU was obtained. A thread-block size of 16 × 16 performed best in all cases. The use of texture memory improved the total calculation time, especially when interpolation was applied. Differences between the CPU and GPU gammas were negligible. The GPU and its features, such as texture memory, decreased the calculation time for gamma evaluations considerably without loss of accuracy.
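
    The underlying γ computation is simple enough to state directly: for each reference point, one searches nearby evaluation points for the smallest combined dose-difference/distance-to-agreement metric, with criteria such as 3%/3 mm entering through ΔD and Δd. The brute-force 1D sketch below is only meant to make the definition concrete; the paper's contribution is the 3D GPU implementation with texture-memory interpolation, which is not attempted here.

      #include <cmath>
      #include <cstddef>
      #include <vector>

      // Brute-force 1D gamma index for dose arrays sampled on a uniform grid.
      // delta_d: dose criterion (same units as dose); delta_r: distance criterion (mm).
      std::vector<double> gamma_1d(const std::vector<double>& ref,
                                   const std::vector<double>& eval,
                                   double spacing_mm, double delta_d, double delta_r) {
          std::vector<double> gamma(ref.size(), 0.0);
          for (std::size_t i = 0; i < ref.size(); ++i) {
              double best = 1e30;
              for (std::size_t j = 0; j < eval.size(); ++j) {
                  const double dr = (static_cast<double>(j) - static_cast<double>(i)) * spacing_mm;
                  const double dd = eval[j] - ref[i];
                  const double g2 = (dr * dr) / (delta_r * delta_r) +
                                    (dd * dd) / (delta_d * delta_d);
                  if (g2 < best) best = g2;
              }
              gamma[i] = std::sqrt(best);      // gamma <= 1 means the point passes
          }
          return gamma;
      }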

  8. Research on fast Fourier transforms algorithm of huge remote sensing image technology with GPU and partitioning technology.

    PubMed

    Yang, Xue; Li, Xue-You; Li, Jia-Guo; Ma, Jun; Zhang, Li; Yang, Jan; Du, Quan-Ye

    2014-02-01

    The fast Fourier transform (FFT) is a basic approach to remote sensing image processing. With the improvement of remote sensing image capture capabilities, featuring hyperspectral, high spatial resolution and high temporal resolution data, efficiently processing huge remote sensing images with FFT technology has become a critical step and a research hot spot in current image processing technology. The FFT algorithm, one of the basic algorithms of image processing, can be used for stripe noise removal, image compression, image registration, etc., when processing remote sensing images. CUFFT is the GPU-based FFT algorithm library, while FFTW is an FFT algorithm library developed for the CPU on the PC platform and is currently the fastest CPU-based FFT function library. However, both methods share a common problem: once the available device or host memory is smaller than the image, out-of-memory errors or memory overflow occur when they are used to compute the FFT of the image. To address this problem, a Huge Remote Fast Fourier Transform (HRFFT) algorithm based on the GPU and partitioning technology is proposed in this paper. By improving the FFT algorithm in the CUFFT function library, the problem of out-of-memory errors and memory overflow is solved. Moreover, the method is validated by experiments with CCD images from the HJ-1A satellite. When applied to practical image processing, it improves the effect of the image processing and speeds up the computation, which saves time and achieves sound results.
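
    The partitioning idea can be sketched with the row-column decomposition of a 2D FFT: the row pass can be performed block by block so that only one block of rows needs to reside in memory at a time (the column pass is handled analogously after a transpose, omitted here). The sketch uses FFTW's CPU interface for brevity; block sizing and file I/O are left out and would be chosen to fit the available memory, and the same blockwise structure applies to batched CUFFT plans on the GPU. This is not the HRFFT algorithm itself.

      #include <fftw3.h>
      #include <complex>
      #include <cstddef>
      #include <vector>

      // Row pass of a 2D FFT over one block of image rows (row-major storage).
      // std::complex<double> is bit-compatible with fftw_complex.
      void fft_row_block(std::vector<std::complex<double>>& block, int rows, int cols) {
          for (int r = 0; r < rows; ++r) {
              fftw_complex* row = reinterpret_cast<fftw_complex*>(
                  block.data() + static_cast<std::size_t>(r) * cols);
              fftw_plan plan = fftw_plan_dft_1d(cols, row, row,
                                                FFTW_FORWARD, FFTW_ESTIMATE);
              fftw_execute(plan);              // in-place transform of one row
              fftw_destroy_plan(plan);         // a production version would reuse one batched plan
          }
      }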

  9. GPU: the biggest key processor for AI and parallel processing

    NASA Astrophysics Data System (ADS)

    Baji, Toru

    2017-07-01

    Two types of processors exist in the market. One is the conventional CPU and the other is the Graphics Processing Unit (GPU). A typical CPU is composed of 1 to 8 cores, while a GPU has thousands of cores. The CPU is good for sequential processing, while the GPU is good at accelerating software with heavy parallel execution. The GPU was initially dedicated to 3D graphics. However, from 2006, when GPUs started to adopt general-purpose cores, it was noticed that this architecture can be used as a general-purpose massively parallel processor. NVIDIA developed a software framework, Compute Unified Device Architecture (CUDA), that makes it possible to easily program the GPU for these applications. With CUDA, GPUs started to be used widely in workstations and supercomputers. Recently, two key technologies are highlighted in the industry: Artificial Intelligence (AI) and autonomous driving cars. AI requires massive parallel operations to train many-layer neural networks. With the CPU alone, it was impossible to finish the training in a practical time. The latest multi-GPU system with P100 makes it possible to finish the training in a few hours. For autonomous driving cars, TOPS-class performance is required to implement perception, localization and path-planning processing, and again an SoC with an integrated GPU will play a key role there. In this paper, the evolution of the GPU, which is one of the biggest commercial devices requiring state-of-the-art fabrication technology, will be introduced, along with an overview of the key GPU-demanding applications described above.

  10. Performance measurements of the first RAID prototype

    NASA Technical Reports Server (NTRS)

    Chervenak, Ann L.

    1990-01-01

    The performance of RAID the First, a prototype Redundant Array of Inexpensive Disks (RAID), is examined. A hierarchy of bottlenecks was discovered in the system that limits overall performance. The most serious is memory system contention on the Sun 4/280 host CPU, which limits array bandwidth to 2.3 MBytes/sec. The array performs more successfully on small random operations, achieving nearly 300 I/Os per second before the Sun 4/280 becomes CPU limited. Other bottlenecks in the system are the VME backplane, bandwidth on the disk controller, and overheads associated with the SCSI protocol. All are examined in detail. The main conclusion is that to achieve the potential bandwidth of arrays, more powerful CPUs alone will not suffice. Just as important are adequate host memory bandwidth and support for high bandwidth on disk controllers. Current disk controllers are more often designed to achieve large numbers of small random operations rather than high bandwidth. Operating systems also need to change to support high bandwidth from disk arrays. In particular, they should transfer data in larger blocks and should support asynchronous I/O to improve sequential write performance.

  11. Fast simulation of Proton Induced X-Ray Emission Tomography using CUDA

    NASA Astrophysics Data System (ADS)

    Beasley, D. G.; Marques, A. C.; Alves, L. C.; da Silva, R. C.

    2013-07-01

    A new 3D Proton Induced X-Ray Emission Tomography (PIXE-T) and Scanning Transmission Ion Microscopy Tomography (STIM-T) simulation software has been developed in Java; it uses NVIDIA's Compute Unified Device Architecture (CUDA) to calculate the X-ray attenuation for large detector areas. A challenge with PIXE-T is to get sufficient counts while retaining a small beam spot size; therefore, a high geometric efficiency is required. However, as the detector solid angle increases, the calculations required for accurate reconstruction of the data increase substantially. To overcome this limitation, the CUDA parallel computing platform was used, which enables general-purpose programming of NVIDIA graphics processing units (GPUs) to perform computations traditionally handled by the central processing unit (CPU). For simulation performance evaluation, the results of a CPU- and a CUDA-based simulation of a phantom are presented. Furthermore, a comparison with the simulation code in the PIXE-Tomography reconstruction software DISRA (A. Sakellariou, D.N. Jamieson, G.J.F. Legge, 2001) is also shown. Compared to a CPU implementation, the CUDA-based simulation is approximately 30× faster.

  12. Research on control law accelerator of digital signal process chip TMS320F28035 for real-time data acquisition and processing

    NASA Astrophysics Data System (ADS)

    Zhao, Shuangle; Zhang, Xueyi; Sun, Shengli; Wang, Xudong

    2017-08-01

    TI C2000 series digital signal processor (DSP) chips have been widely used in electrical engineering, measurement and control, communications and other professional fields; the TMS320F28035 is one of the most representative members of the family. A DSP program typically needs both data acquisition and data processing, and if only ordinary C or assembly language programming is used, the program runs sequentially, so the analogue-to-digital (AD) converter cannot acquire data in real time and a lot of data are often missed. The control law accelerator (CLA) coprocessor can run in parallel with the main central processing unit (CPU), its clock frequency is the same as that of the main CPU, and it supports floating-point operations. Therefore, the CLA coprocessor is used in the program: the CLA core is responsible for data processing, while the main CPU is responsible for the AD conversion. The advantage of this method is that it reduces the data processing time and achieves real-time data acquisition.

  13. Design Alternatives to Improve Access Time Performance of Disk Drives Under DOS and UNIX

    NASA Astrophysics Data System (ADS)

    Hospodor, Andy

    For the past 25 years, improvements in CPU performance have overshadowed improvements in the access time performance of disk drives. CPU performance has been slanted towards greater instruction execution rates, measured in millions of instructions per second (MIPS). However, the slant for performance of disk storage has been towards capacity and corresponding increased storage densities. The IBM PC, introduced in 1982, processed only a fraction of a MIP. Follow-on CPUs, such as the 80486 and 80586, sported 5-10 MIPS by 1992. Single user PCs and workstations, with one CPU and one disk drive, became the dominant application, as implied by their production volumes. However, disk drives did not enjoy a corresponding improvement in access time performance, although the potential still exists. The time to access a disk drive improves (decreases) in two ways: by altering the mechanical properties of the drive or by adding cache to the drive. This paper explores the improvement to access time performance of disk drives using cache, prefetch, faster rotation rates, and faster seek acceleration.
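
    The access-time components discussed here combine in a simple way: expected access time ≈ average seek time + average rotational latency (half a revolution) + transfer time, and a cache hit skips the mechanical part entirely. The sketch below evaluates that textbook model for illustrative drive parameters; the specific numbers are assumptions, not figures from the dissertation.

      #include <cstdio>

      // Simple expected-access-time model for a disk drive with a read cache.
      // Cache hits are treated as having negligible service time.
      double avg_access_ms(double seek_ms, double rpm, double transfer_ms, double hit_rate) {
          const double half_rev_ms = 0.5 * 60000.0 / rpm;              // average rotational latency
          const double mechanical  = seek_ms + half_rev_ms + transfer_ms;
          return (1.0 - hit_rate) * mechanical;
      }

      int main() {
          // Illustrative parameters: 12 ms seek, 5400 RPM, 1 ms transfer, 30% cache hit rate.
          std::printf("expected access time: %.2f ms\n",
                      avg_access_ms(12.0, 5400.0, 1.0, 0.30));
          return 0;
      }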

  14. Optimizing Tensor Contraction Expressions for Hybrid CPU-GPU Execution

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ma, Wenjing; Krishnamoorthy, Sriram; Villa, Oreste

    2013-03-01

    Tensor contractions are generalized multidimensional matrix multiplication operations that widely occur in quantum chemistry. Efficient execution of tensor contractions on Graphics Processing Units (GPUs) requires several challenges to be addressed, including index permutation and small dimension sizes that reduce thread-block utilization. Moreover, to apply the same optimizations to various expressions, we need a code generation tool. In this paper, we present our approach to automatically generate CUDA code to execute tensor contractions on GPUs, including the management of data movement between CPU and GPU. To evaluate our tool, GPU-enabled code is generated for the most expensive contractions in CCSD(T), a key coupled cluster method, and incorporated into NWChem, a popular computational chemistry suite. For this method, we demonstrate a speedup of over a factor of 8.4 using one GPU (instead of one core per node) and of over 2.6 when utilizing the entire system with a hybrid CPU+GPU solution using 2 GPUs and 5 cores (instead of 7 cores per node). Finally, we analyze the implementation behavior on future GPU systems.
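
    For readers unfamiliar with the terminology, a tensor contraction generalizes matrix multiplication by summing over one or more shared indices; code generators such as the one described map each contraction to an index permutation followed by a GEMM call. The naive loop version below, for a hypothetical contraction C(i,j) = Σ_{k,l} A(i,k,l)·B(k,l,j), only illustrates the operation being generated, not the generated CUDA code itself.

      #include <cstddef>
      #include <vector>

      // Naive contraction C(i,j) = sum_{k,l} A(i,k,l) * B(k,l,j), row-major storage.
      void contract(const std::vector<double>& A, const std::vector<double>& B,
                    std::vector<double>& C,
                    std::size_t I, std::size_t K, std::size_t L, std::size_t J) {
          C.assign(I * J, 0.0);
          for (std::size_t i = 0; i < I; ++i)
              for (std::size_t k = 0; k < K; ++k)
                  for (std::size_t l = 0; l < L; ++l) {
                      const double a = A[(i * K + k) * L + l];
                      for (std::size_t j = 0; j < J; ++j)
                          C[i * J + j] += a * B[(k * L + l) * J + j];
                  }
          // Equivalently: flatten (k,l) into one index of length K*L and call a
          // (I x KL) * (KL x J) matrix multiplication (GEMM) on the reshaped arrays.
      }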

  15. Massively parallel algorithm and implementation of RI-MP2 energy calculation for peta-scale many-core supercomputers.

    PubMed

    Katouda, Michio; Naruse, Akira; Hirano, Yukihiko; Nakajima, Takahito

    2016-11-15

    A new parallel algorithm and its implementation for the RI-MP2 energy calculation utilizing peta-flop-class many-core supercomputers are presented. Some improvements from the previous algorithm (J. Chem. Theory Comput. 2013, 9, 5373) have been performed: (1) a dual-level hierarchical parallelization scheme that enables the use of more than 10,000 Message Passing Interface (MPI) processes and (2) a new data communication scheme that reduces network communication overhead. A multi-node and multi-GPU implementation of the present algorithm is presented for calculations on a central processing unit (CPU)/graphics processing unit (GPU) hybrid supercomputer. Benchmark results of the new algorithm and its implementation using the K computer (CPU clustering system) and TSUBAME 2.5 (CPU/GPU hybrid system) demonstrate high efficiency. The peak performance of 3.1 PFLOPS is attained using 80,199 nodes of the K computer. The peak performance of the multi-node and multi-GPU implementation is 514 TFLOPS using 1349 nodes and 4047 GPUs of TSUBAME 2.5. © 2016 Wiley Periodicals, Inc.

  16. Heterogeneous compute in computer vision: OpenCL in OpenCV

    NASA Astrophysics Data System (ADS)

    Gasparakis, Harris

    2014-02-01

    We explore the relevance of Heterogeneous System Architecture (HSA) in Computer Vision, both as a long term vision and as a near term emerging reality via the recently ratified OpenCL 2.0 Khronos standard. After a brief review of OpenCL 1.2 and 2.0, including HSA features such as Shared Virtual Memory (SVM) and platform atomics, we identify what genres of Computer Vision workloads stand to benefit by leveraging those features, and we suggest a new mental framework that replaces GPU compute with hybrid HSA APU compute. As a case in point, we discuss, in some detail, popular object recognition algorithms (part-based models), emphasizing the interplay and concurrent collaboration between the GPU and CPU. We conclude by describing how OpenCL has been incorporated in OpenCV, a popular open source computer vision library, emphasizing recent work on the Transparent API, to appear in OpenCV 3.0, which unifies the native CPU and OpenCL execution paths under a single API, allowing the same code to execute either on the CPU or on an OpenCL-enabled device, without even recompiling.
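
    The Transparent API mentioned above is exposed through cv::UMat: the same calls run on the CPU, or are dispatched to an OpenCL device when one is available, without code changes. A minimal sketch follows; the file names and filter choices are arbitrary assumptions for illustration.

      #include <opencv2/opencv.hpp>

      int main() {
          // cv::UMat asks OpenCV to use OpenCL when available; otherwise it
          // transparently falls back to the ordinary CPU code paths.
          cv::UMat src, gray, edges;
          cv::imread("input.png", cv::IMREAD_COLOR).copyTo(src);
          if (src.empty()) return 1;

          cv::cvtColor(src, gray, cv::COLOR_BGR2GRAY);
          cv::GaussianBlur(gray, gray, cv::Size(5, 5), 1.5);
          cv::Canny(gray, edges, 50, 100);

          cv::imwrite("edges.png", edges);     // results copy back to the host on demand
          return 0;
      }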

  17. Heterogeneous CPU-GPU moving targets detection for UAV video

    NASA Astrophysics Data System (ADS)

    Li, Maowen; Tang, Linbo; Han, Yuqi; Yu, Chunlei; Zhang, Chao; Fu, Huiquan

    2017-07-01

    Moving target detection is gaining popularity in civilian and military applications. On some motion-detection monitoring platforms, low-resolution stationary cameras are being replaced by moving HD cameras carried by UAVs. The pixels belonging to moving targets in the HD video taken by a UAV are always a small minority, and the background of the frame is usually moving because of the motion of the UAV. The high computational cost of detection algorithms prevents them from running in real time at the full resolution of the frame. Hence, to solve the problem of moving target detection in UAV video, we propose a heterogeneous CPU-GPU moving target detection algorithm. More specifically, we use background registration to eliminate the impact of the moving background and frame differencing to detect small moving targets. In order to achieve real-time processing, we design a heterogeneous CPU-GPU framework for our method. The experimental results show that our method can detect the main moving targets in HD video taken by a UAV, and the average processing time is 52.16 ms per frame, which is fast enough to solve the problem.

  18. An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU

    NASA Astrophysics Data System (ADS)

    Lyakh, Dmitry I.

    2015-04-01

    An efficient parallel tensor transpose algorithm is suggested for shared-memory computing units, namely, multicore CPU, Intel Xeon Phi, and NVidia GPU. The algorithm operates on dense tensors (multidimensional arrays) and is based on the optimization of cache utilization on x86 CPU and the use of shared memory on NVidia GPU. From the applied side, the ultimate goal is to minimize the overhead encountered in the transformation of tensor contractions into matrix multiplications in computer implementations of advanced methods of quantum many-body theory (e.g., in electronic structure theory and nuclear physics). A particular accent is made on higher-dimensional tensors that typically appear in the so-called multireference correlated methods of electronic structure theory. Depending on tensor dimensionality, the presented optimized algorithms can achieve an order of magnitude speedup on x86 CPUs and 2-3 times speedup on NVidia Tesla K20X GPU with respect to the naïve scattering algorithm (no memory access optimization). The tensor transpose routines developed in this work have been incorporated into a general-purpose tensor algebra library (TAL-SH).
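
    The cache-optimization idea behind the CPU version is the familiar blocked (tiled) transpose: elements are moved tile by tile so that both the read and the write streams stay within cache. A 2D sketch is shown below; the library itself handles arbitrary-rank tensors and GPU shared memory, which this does not attempt to reproduce, and the tile size of 64 is an arbitrary illustrative choice.

      #include <algorithm>
      #include <cstddef>
      #include <vector>

      // Blocked out-of-place transpose: B(j,i) = A(i,j), row-major, tile x tile blocks.
      void transpose_blocked(const std::vector<double>& A, std::vector<double>& B,
                             std::size_t rows, std::size_t cols, std::size_t tile = 64) {
          B.resize(rows * cols);
          for (std::size_t ii = 0; ii < rows; ii += tile)
              for (std::size_t jj = 0; jj < cols; jj += tile)
                  for (std::size_t i = ii; i < std::min(ii + tile, rows); ++i)
                      for (std::size_t j = jj; j < std::min(jj + tile, cols); ++j)
                          B[j * rows + i] = A[i * cols + j];
      }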

  19. Iterative Usage of Fixed and Random Effect Models for Powerful and Efficient Genome-Wide Association Studies

    PubMed Central

    Liu, Xiaolei; Huang, Meng; Fan, Bin; Buckler, Edward S.; Zhang, Zhiwu

    2016-01-01

    False positives in a Genome-Wide Association Study (GWAS) can be effectively controlled by a fixed effect and random effect Mixed Linear Model (MLM) that incorporates population structure and kinship among individuals to adjust association tests on markers; however, the adjustment also compromises true positives. The modified MLM method, the Multiple Loci Linear Mixed Model (MLMM), incorporates multiple markers simultaneously as covariates in a stepwise MLM to partially remove the confounding between testing markers and kinship. To completely eliminate the confounding, we divided MLMM into two parts, a Fixed Effect Model (FEM) and a Random Effect Model (REM), and use them iteratively. FEM contains the testing markers, one at a time, and multiple associated markers as covariates to control false positives. To avoid the model over-fitting problem in FEM, the associated markers are estimated in REM by using them to define kinship. The P values of the testing markers and the associated markers are unified at each iteration. We named the new method Fixed and random model Circulating Probability Unification (FarmCPU). Both real and simulated data analyses demonstrated that FarmCPU improves statistical power compared to current methods. Additional benefits include an efficient computing time that is linear in both the number of individuals and the number of markers. Now, a dataset with half a million individuals and half a million markers can be analyzed within three days. PMID:26828793

  20. Low-power grating detection system chip for high-speed low-cost length and angle precision measurement

    NASA Astrophysics Data System (ADS)

    Hou, Ligang; Luo, Rengui; Wu, Wuchen

    2006-11-01

    This paper presents a low-power grating detection chip (EYAS) for precision length and angle measurement. Traditional grating detection methods, such as resistor-chain subdivision or phase-locked subdivision circuits, are difficult to design and tune, and the need for an additional CPU for control and display makes their implementation more complex and costly. Traditional methods also suffer from low sampling speed because of the complex subdivision circuits and CPU software compensation. EYAS is an application specific integrated circuit (ASIC). It integrates a micro controller unit (MCU), a power management unit (PMU), LCD controllers, a keyboard interface, a grating detection unit and other peripherals. Working at 10 MHz, EYAS affords a 5 MHz internal sampling rate and can handle a 1.25 MHz orthogonal signal from the grating sensor. Through a simple keyboard control interface, the sensor parameters, data processing and system working mode can be configured. Two LCD controllers, supporting dot-matrix or segment LCDs, form the output interface. The PMU switches the system between working and standby modes using clock gating to save power. In test mode (where system activity is more frequent than in real-world use) EYAS consumes 0.9 mW, and 0.2 mW in real-world use. EYAS achieves the full grating detection system functionality, with high-speed orthogonal signal handling, on a single chip with very low power consumption.

  1. Multi-GPU Jacobian accelerated computing for soft-field tomography.

    PubMed

    Borsic, A; Attardo, E A; Halter, R J

    2012-10-01

    Image reconstruction in soft-field tomography is based on an inverse problem formulation, where a forward model is fitted to the data. In medical applications, where the anatomy presents complex shapes, it is common to use finite element models (FEMs) to represent the volume of interest and solve a partial differential equation that models the physics of the system. Over the last decade, there has been a shifting interest from 2D modeling to 3D modeling, as the underlying physics of most problems are 3D. Although the increased computational power of modern computers allows working with much larger FEM models, the computational time required to reconstruct 3D images on a fine 3D FEM model can be significant, on the order of hours. For example, in electrical impedance tomography (EIT) applications using a dense 3D FEM mesh with half a million elements, a single reconstruction iteration takes approximately 15-20 min with optimized routines running on a modern multi-core PC. It is desirable to accelerate image reconstruction to enable researchers to more easily and rapidly explore data and reconstruction parameters. Furthermore, providing high-speed reconstructions is essential for some promising clinical application of EIT. For 3D problems, 70% of the computing time is spent building the Jacobian matrix, and 25% of the time in forward solving. In this work, we focus on accelerating the Jacobian computation by using single and multiple GPUs. First, we discuss an optimized implementation on a modern multi-core PC architecture and show how computing time is bounded by the CPU-to-memory bandwidth; this factor limits the rate at which data can be fetched by the CPU. Gains associated with the use of multiple CPU cores are minimal, since data operands cannot be fetched fast enough to saturate the processing power of even a single CPU core. GPUs have much faster memory bandwidths compared to CPUs and better parallelism. We are able to obtain acceleration factors of 20 times on a single NVIDIA S1070 GPU, and of 50 times on four GPUs, bringing the Jacobian computing time for a fine 3D mesh from 12 min to 14 s. We regard this as an important step toward gaining interactive reconstruction times in 3D imaging, particularly when coupled in the future with acceleration of the forward problem. While we demonstrate results for EIT, these results apply to any soft-field imaging modality where the Jacobian matrix is computed with the adjoint method.

  2. Multi-GPU Jacobian Accelerated Computing for Soft Field Tomography

    PubMed Central

    Borsic, A.; Attardo, E. A.; Halter, R. J.

    2012-01-01

    Image reconstruction in soft-field tomography is based on an inverse problem formulation, where a forward model is fitted to the data. In medical applications, where the anatomy presents complex shapes, it is common to use Finite Element Models (FEMs) to represent the volume of interest and to solve a partial differential equation that models the physics of the system. Over the last decade, there has been a shifting interest from 2D modeling to 3D modeling, as the underlying physics of most problems are three-dimensional. Though the increased computational power of modern computers allows working with much larger FEM models, the computational time required to reconstruct 3D images on a fine 3D FEM model can be significant, on the order of hours. For example, in Electrical Impedance Tomography (EIT) applications using a dense 3D FEM mesh with half a million elements, a single reconstruction iteration takes approximately 15 to 20 minutes with optimized routines running on a modern multi-core PC. It is desirable to accelerate image reconstruction to enable researchers to more easily and rapidly explore data and reconstruction parameters. Further, providing high-speed reconstructions is essential for some promising clinical applications of EIT. For 3D problems, 70% of the computing time is spent building the Jacobian matrix, and 25% of the time in forward solving. In the present work, we focus on accelerating the Jacobian computation by using single and multiple GPUs. First, we discuss an optimized implementation on a modern multi-core PC architecture and show how computing time is bounded by the CPU-to-memory bandwidth; this factor limits the rate at which data can be fetched by the CPU. Gains associated with the use of multiple CPU cores are minimal, since data operands cannot be fetched fast enough to saturate the processing power of even a single CPU core. GPUs have much faster memory bandwidths compared to CPUs and better parallelism. We are able to obtain acceleration factors of 20 times on a single NVIDIA S1070 GPU, and of 50 times on 4 GPUs, bringing the Jacobian computing time for a fine 3D mesh from 12 minutes to 14 seconds. We regard this as an important step towards gaining interactive reconstruction times in 3D imaging, particularly when coupled in the future with acceleration of the forward problem. While we demonstrate results for Electrical Impedance Tomography, these results apply to any soft-field imaging modality where the Jacobian matrix is computed with the Adjoint Method. PMID:23010857

  3. Inexact adaptive Newton methods

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Bertiger, W.I.; Kelsey, F.J.

    1985-02-01

    The Inexact Adaptive Newton method (IAN) is a modification of the Adaptive Implicit Method (AIM) with improved Newton convergence. Both methods simplify the Jacobian at each time step by zeroing coefficients in regions where saturations are changing slowly. The methods differ in how the diagonal block terms are treated. On test problems with up to 3,000 cells, IAN consistently saves approximately 30% of the CPU time when compared to the fully implicit method. AIM shows similar savings on some problems, but takes as much CPU time as fully implicit on other test problems due to poor Newton convergence.

  4. A numerical code for the simulation of non-equilibrium chemically reacting flows on hybrid CPU-GPU clusters

    NASA Astrophysics Data System (ADS)

    Kudryavtsev, Alexey N.; Kashkovsky, Alexander V.; Borisov, Semyon P.; Shershnev, Anton A.

    2017-10-01

    In the present work a computer code RCFS for the numerical simulation of chemically reacting compressible flows on hybrid CPU/GPU supercomputers is developed. It solves the 3D unsteady Euler equations for multispecies chemically reacting flows in general curvilinear coordinates using shock-capturing TVD schemes. Time advancement is carried out using explicit Runge-Kutta TVD schemes. The program implementation uses the CUDA application programming interface to perform GPU computations. Data are distributed between GPUs via a domain decomposition technique. The developed code is verified on a number of test cases, including supersonic flow over a cylinder.

  5. Application of queueing models to multiprogrammed computer systems operating in a time-critical environment

    NASA Technical Reports Server (NTRS)

    Eckhardt, D. E., Jr.

    1979-01-01

    A model of a central processor (CPU) which services background applications in the presence of time-critical activity is presented. The CPU is viewed as an M/M/1 queueing system subject to periodic interrupts by a deterministic, time-critical process. The Laplace transform of the distribution of service times for the background applications is developed. The use of state-of-the-art queueing models for studying the background processing capability of time-critical computer systems is discussed, and the results of a model validation study which support this application of queueing models are presented.
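
    As a reminder of the baseline quantities for the underlying M/M/1 queue (before the periodic, deterministic interrupts considered in the paper are taken into account), with arrival rate \lambda, service rate \mu and utilization \rho = \lambda/\mu < 1, the standard results are:

      \[
        \rho = \frac{\lambda}{\mu}, \qquad
        L = \frac{\rho}{1-\rho}, \qquad
        W = \frac{1}{\mu-\lambda}, \qquad
        W_q = \frac{\rho}{\mu-\lambda},
      \]

    where L is the mean number of background jobs in the system, W the mean time in system and W_q the mean waiting time. These are quoted here only for orientation; the paper's contribution is the modified service-time distribution seen by background work when the CPU is periodically preempted.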

  6. Upwind relaxation methods for the Navier-Stokes equations using inner iterations

    NASA Technical Reports Server (NTRS)

    Taylor, Arthur C., III; Ng, Wing-Fai; Walters, Robert W.

    1992-01-01

    A subsonic and a supersonic problem are respectively treated by an upwind line-relaxation algorithm for the Navier-Stokes equations using inner iterations to accelerate steady-state solution convergence and thereby minimize CPU time. While the ability of the inner iterative procedure to mimic the quadratic convergence of the direct solver method is attested to in both test problems, some of the nonquadratic inner iterative results are noted to have been more efficient than the quadratic. In the more successful, supersonic test case, inner iteration required only about 65 percent of the line-relaxation method-entailed CPU time.

  7. Application of graphics processing units to search pipelines for gravitational waves from coalescing binaries of compact objects

    NASA Astrophysics Data System (ADS)

    Chung, Shin Kee; Wen, Linqing; Blair, David; Cannon, Kipp; Datta, Amitava

    2010-07-01

    We report a novel application of a graphics processing unit (GPU) for the purpose of accelerating the search pipelines for gravitational waves from coalescing binaries of compact objects. A speed-up of 16-fold in total has been achieved with an NVIDIA GeForce 8800 Ultra GPU card compared with one core of a 2.5 GHz Intel Q9300 central processing unit (CPU). We show that substantial improvements are possible and discuss the reduction in CPU count required for the detection of inspiral sources afforded by the use of GPUs.

  8. A 3D front tracking method on a CPU/GPU system

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Bo, Wurigen; Grove, John

    2011-01-21

    We describe the method to port a sequential 3D interface tracking code to a GPU with CUDA. The interface is represented as a triangular mesh. Interface geometry properties and point propagation are performed on a GPU. Interface mesh adaptation is performed on a CPU. The convergence of the method is assessed from the test problems with given velocity fields. Performance results show overall speedups from 11 to 14 for the test problems under mesh refinement. We also briefly describe our ongoing work to couple the interface tracking method with a hydro solver.

  9. Dosimetric comparison of helical tomotherapy treatment plans for total marrow irradiation created using GPU and CPU dose calculation engines.

    PubMed

    Nalichowski, Adrian; Burmeister, Jay

    2013-07-01

    To compare optimization characteristics, plan quality, and treatment delivery efficiency between total marrow irradiation (TMI) plans using the new TomoTherapy graphics processing unit (GPU) based dose engine and the CPU/cluster based dose engine. Five TMI plans created on an anthropomorphic phantom were optimized and calculated with both dose engines. The planning target volume (PTV) included all the bones from head to mid femur except for the upper extremities. The evaluated organs at risk (OAR) consisted of lung, liver, heart, kidneys, and brain. The following treatment parameters were used to generate the TMI plans: field widths of 2.5 and 5 cm, modulation factors of 2 and 2.5, and a pitch of either 0.287 or 0.43. The optimization parameters were chosen based on the PTV and OAR priorities, and the plans were optimized with a fixed number of iterations. The PTV constraint was selected to ensure that at least 95% of the PTV received the prescription dose. The plans were evaluated based on D80 and D50 (dose to 80% and 50% of the OAR volume, respectively) and hotspot volumes within the PTVs. Gamma indices (Γ) were also used to compare planar dose distributions between the two modalities. The optimization and dose calculation times were compared between the two systems, and the treatment delivery times were also evaluated. The results showed very good dosimetric agreement between the GPU and CPU calculated plans for all of the evaluated planning parameters, indicating that both systems converge on nearly identical plans. All D80 and D50 parameters varied by less than 3% of the prescription dose, with an average difference of 0.8%. A gamma analysis Γ(3%, 3 mm) of each GPU plan against the baseline CPU plan resulted in over 90% of calculated voxels satisfying the Γ < 1 criterion; the average fraction of voxels meeting the Γ < 1 criterion over all the plans was 97%. In terms of dose optimization/calculation efficiency, there was a 20-fold reduction in planning time with the new GPU system: the average optimization/dose calculation time was 579 min with the traditional CPU/cluster based system versus 26.8 min with the GPU based system. There was no difference in the calculated treatment delivery time per fraction; beam-on time varied based on field width and pitch and ranged between 15 and 28 min. The TomoTherapy GPU based dose engine is capable of calculating TMI treatment plans with plan quality nearly identical to plans calculated using the traditional CPU/cluster based system, while significantly reducing the time required for optimization and dose calculation.

  10. Fast in-memory elastic full-waveform inversion using consumer-grade GPUs

    NASA Astrophysics Data System (ADS)

    Sivertsen Bergslid, Tore; Birger Raknes, Espen; Arntsen, Børge

    2017-04-01

    Full-waveform inversion (FWI) is a technique to estimate subsurface properties by using the recorded waveform produced by a seismic source and applying inverse theory. This is done through an iterative optimization procedure, where each iteration requires solving the wave equation many times, then trying to minimize the difference between the modeled and the measured seismic data. Having to model many of these seismic sources per iteration means that this is a highly computationally demanding procedure, which usually involves writing a lot of data to disk. We have written code that does forward modeling and inversion entirely in memory. A typical HPC cluster has many more CPUs than GPUs. Since FWI involves modeling many seismic sources per iteration, the obvious approach is to parallelize the code on a source-by-source basis, where each core of the CPU performs one modeling, and do all modelings simultaneously. With this approach, the GPU is already at a major disadvantage in pure numbers. Fortunately, GPUs can more than make up for this hardware disadvantage by performing each modeling much faster than a CPU. Another benefit of parallelizing each individual modeling is that it lets each modeling use a lot more RAM. If one node has 128 GB of RAM and 20 CPU cores, each modeling can use only 6.4 GB RAM if one is running the node at full capacity with source-by-source parallelization on the CPU. A parallelized per-source code using GPUs can use 64 GB RAM per modeling. Whenever a modeling uses more RAM than is available and has to start using regular disk space the runtime increases dramatically, due to slow file I/O. The extremely high computational speed of the GPUs combined with the large amount of RAM available for each modeling lets us do high frequency FWI for fairly large models very quickly. For a single modeling, our GPU code outperforms the single-threaded CPU-code by a factor of about 75. Successful inversions have been run on data with frequencies up to 40 Hz for a model of 2001 by 600 grid points with 5 m grid spacing and 5000 time steps, in less than 2.5 minutes per source. In practice, using 15 nodes (30 GPUs) to model 101 sources, each iteration took approximately 9 minutes. For reference, the same inversion run with our CPU code uses two hours per iteration. This was done using only a very simple wavefield interpolation technique, saving every second timestep. Using a more sophisticated checkpointing or wavefield reconstruction method would allow us to increase this model size significantly. Our results show that ordinary gaming GPUs are a viable alternative to the expensive professional GPUs often used today, when performing large scale modeling and inversion in geophysics.

  11. Changes of brain monoamine levels and physiological indexes during heat acclimation in rats.

    PubMed

    Nakagawa, Hikaru; Matsumura, Takeru; Suzuki, Kota; Ninomiya, Chisa; Ishiwata, Takayuki

    2016-05-01

    Brain monoamines, such as noradrenaline (NA), dopamine (DA), and serotonin (5-HT), regulate many important physiological functions including thermoregulation. The purpose of this study was to clarify changes in NA, DA, and 5-HT levels in several brain regions in response to heat acclimation while also recording body temperature (Tb), heart rate (HR), and locomotor activity (Act). Rats were exposed to a heated environment (32°C) for 3h (3H), 1 day (1D), 7 days, 14 days (14D), 21 days, or 28 days (28D). After heat exposure, each of the following brain regions were immediately extracted and homogenized: the caudate putamen (CPu), preoptic area (PO), dorsomedial hypothalamus (DMH), frontal cortex (FC), and hippocampus (Hip). NA, DA, and 5-HT levels in the extract were measured by high performance liquid chromatography. Although Tb increased immediately after heat exposure, it decreased about 14D later. HR was maintained at a low level throughout heat exposure, and Act tended to increase near the end of heat exposure. After 3H, we observed a marked increase in NA level in the CPu. Although this response vanished after 1D, the level increased again after 28D. DA level in the CPu decreased significantly from 1D to 28D. 5-HT level in the PO and DMH decreased from 1D to 14D. It returned to control levels after 28D with increment of DA level. 5-HT level in the FC decreased at the start of heat exposure, but recovered after 28D; a time point at which DA level also increased. Monoamine levels in the Hip were unchanged after early heat exposure, but both 5-HT and DA levels increased after 28D. These results provide definitive evidence of changes in monoamines in individual brain regions involved in thermoregulation and behavioral, cognitive, and memory function during both acute and chronic heat exposure. Copyright © 2016 Elsevier Ltd. All rights reserved.

  12. Multi-Threaded Algorithms for GPGPU in the ATLAS High Level Trigger

    NASA Astrophysics Data System (ADS)

    Conde Muíño, P.; ATLAS Collaboration

    2017-10-01

    General purpose Graphics Processing Units (GPGPU) are being evaluated for possible future inclusion in an upgraded ATLAS High Level Trigger farm. We have developed a demonstrator including GPGPU implementations of Inner Detector and Muon tracking and Calorimeter clustering within the ATLAS software framework. ATLAS is a general purpose particle physics experiment located on the LHC collider at CERN. The ATLAS Trigger system consists of two levels, with Level-1 implemented in hardware and the High Level Trigger implemented in software running on a farm of commodity CPUs. The High Level Trigger reduces the trigger rate from the 100 kHz Level-1 acceptance rate to 1.5 kHz for recording, requiring an average per-event processing time of ∼250 ms for this task. The selection in the high level trigger is based on reconstructing tracks in the Inner Detector and Muon Spectrometer and clusters of energy deposited in the Calorimeter. Performing this reconstruction within the available farm resources presents a significant challenge that will increase with future LHC upgrades. During the LHC data taking period starting in 2021, luminosity will reach up to three times the original design value; it will increase further to 7.5 times the design value in 2026 following LHC and ATLAS upgrades. Corresponding improvements in the speed of the reconstruction code will be needed to provide the required trigger selection power within affordable computing resources. Key factors determining the potential benefit of including GPGPUs as part of the HLT processor farm are: the relative speed of the CPU and GPGPU algorithm implementations; the relative execution times of the GPGPU algorithms and the serial code remaining on the CPU; the number of GPGPUs required; and the relative financial cost of the selected GPGPUs. We give a brief overview of the algorithms implemented and present new measurements that compare the performance of various configurations exploiting GPGPU cards.

  13. Design, Results, Evolution and Status of the ATLAS Simulation at Point1 Project

    NASA Astrophysics Data System (ADS)

    Ballestrero, S.; Batraneanu, S. M.; Brasolin, F.; Contescu, C.; Fazio, D.; Di Girolamo, A.; Lee, C. J.; Pozo Astigarraga, M. E.; Scannicchio, D. A.; Sedov, A.; Twomey, M. S.; Wang, F.; Zaytsev, A.

    2015-12-01

    During the LHC Long Shutdown 1 (LS1) period, which started in 2013, the Simulation at Point1 (Sim@P1) project takes advantage, in an opportunistic way, of the TDAQ (Trigger and Data Acquisition) HLT (High-Level Trigger) farm of the ATLAS experiment. This farm provides more than 1300 compute nodes, which are particularly suited for running event generation and Monte Carlo production jobs that are mostly CPU and not I/O bound. It is capable of running up to 2700 Virtual Machines (VMs), each with 8 CPU cores, for a total of up to 22000 parallel jobs. This contribution gives a review of the design, the results, and the evolution of the Sim@P1 project, which operates a large-scale OpenStack based virtualized platform deployed on top of the ATLAS TDAQ HLT farm computing resources. During LS1, Sim@P1 was one of the most productive ATLAS sites: it delivered more than 33 million CPU-hours and generated more than 1.1 billion Monte Carlo events. The design aspects are presented: the virtualization platform exploited by Sim@P1 avoids interference with TDAQ operations and guarantees the security and the usability of the ATLAS private network, while the cloud mechanism allows the separation of the needed support at both the infrastructural (hardware, virtualization layer) and logical (Grid site support) levels. This paper focuses on the operational aspects of such a large system during the upcoming LHC Run 2 period: simple, reliable, and efficient tools are needed to quickly switch from Sim@P1 to TDAQ mode and back, to exploit the resources when they are not used for data acquisition, even for short periods. The evolution of the central OpenStack infrastructure is described, as it was upgraded from the Folsom to the Icehouse release, including the scalability issues that were addressed.

  14. The Photon Shell Game and the Quantum von Neumann Architecture with Superconducting Circuits

    NASA Astrophysics Data System (ADS)

    Mariantoni, Matteo

    2012-02-01

    Superconducting quantum circuits have made significant advances over the past decade, allowing more complex and integrated circuits that perform with good fidelity. We have recently implemented a machine comprising seven quantum channels, with three superconducting resonators, two phase qubits, and two zeroing registers. I will explain the design and operation of this machine, first showing how a single microwave photon | 1 > can be prepared in one resonator and coherently transferred between the three resonators. I will also show how more exotic states such as double photon states | 2 > and superposition states | 0 > + | 1 > can be shuffled among the resonators as well [1]. I will then demonstrate how this machine can be used as the quantum-mechanical analog of the von Neumann computer architecture, which for a classical computer comprises a central processing unit and a memory holding both instructions and data. The quantum version comprises a quantum central processing unit (quCPU) that exchanges data with a quantum random-access memory (quRAM) integrated on one chip, with instructions stored on a classical computer. I will also present a proof-of-concept demonstration of a code that involves all seven quantum elements: (1) preparing an entangled state in the quCPU, (2) writing it to the quRAM, (3) preparing a second state in the quCPU, (4) zeroing it, and (5) reading out the first state stored in the quRAM [2]. Finally, I will demonstrate that the quantum von Neumann machine provides one unit cell of a two-dimensional qubit-resonator array that can be used for surface code quantum computing. This will allow the realization of a scalable, fault-tolerant quantum processor with the most forgiving error rates to date. [1] M. Mariantoni et al., Nature Physics 7, 287-293 (2011). [2] M. Mariantoni et al., Science 334, 61-65 (2011).

  15. GPU-based cone beam computed tomography.

    PubMed

    Noël, Peter B; Walczak, Alan M; Xu, Jinhui; Corso, Jason J; Hoffmann, Kenneth R; Schafer, Sebastian

    2010-06-01

    The use of cone beam computed tomography (CBCT) is growing in the clinical arena due to its ability to provide 3D information during interventions, its high diagnostic quality (sub-millimeter resolution), and its short scanning times (60 s). In many situations, the short scanning time of CBCT is followed by a time-consuming 3D reconstruction. The standard reconstruction algorithm for CBCT data is the filtered backprojection, which for a volume of size 256³ takes up to 25 min on a standard system. Recent developments in the area of Graphics Processing Units (GPUs) make it possible to have access to high-performance computing solutions at a low cost, allowing their use in many scientific problems. We have implemented an algorithm for 3D reconstruction of CBCT data using the Compute Unified Device Architecture (CUDA) provided by NVIDIA (NVIDIA Corporation, Santa Clara, California), which was executed on an NVIDIA GeForce GTX 280. Our implementation results in improved reconstruction times from minutes, and perhaps hours, to a matter of seconds, while also giving the clinician the ability to view 3D volumetric data at higher resolutions. We evaluated our implementation on ten clinical data sets and one phantom data set to observe if differences occur between CPU and GPU-based reconstructions. By using our approach, the computation time for 256³ is reduced from 25 min on the CPU to 3.2 s on the GPU. The GPU reconstruction time for 512³ volumes is 8.5 s. Copyright 2009 Elsevier Ireland Ltd. All rights reserved.

  16. Programming and Runtime Support to Blaze FPGA Accelerator Deployment at Datacenter Scale

    PubMed Central

    Huang, Muhuan; Wu, Di; Yu, Cody Hao; Fang, Zhenman; Interlandi, Matteo; Condie, Tyson; Cong, Jason

    2017-01-01

    With the end of CPU core scaling due to dark silicon limitations, customized accelerators on FPGAs have gained increased attention in modern datacenters due to their lower power, high performance and energy efficiency. Evidenced by Microsoft’s FPGA deployment in its Bing search engine and Intel’s $16.7 billion acquisition of Altera, integrating FPGAs into datacenters is considered one of the most promising approaches to sustain future datacenter growth. However, it is quite challenging for existing big data computing systems—like Apache Spark and Hadoop—to access the performance and energy benefits of FPGA accelerators. In this paper we design and implement Blaze to provide programming and runtime support for enabling easy and efficient deployments of FPGA accelerators in datacenters. In particular, Blaze abstracts FPGA accelerators as a service (FaaS) and provides a set of clean programming APIs for big data processing applications to easily utilize those accelerators. Our Blaze runtime implements an FaaS framework to efficiently share FPGA accelerators among multiple heterogeneous threads on a single node, and extends Hadoop YARN with accelerator-centric scheduling to efficiently share them among multiple computing tasks in the cluster. Experimental results using four representative big data applications demonstrate that Blaze greatly reduces the programming efforts to access FPGA accelerators in systems like Apache Spark and YARN, and improves the system throughput by 1.7 × to 3× (and energy efficiency by 1.5× to 2.7×) compared to a conventional CPU-only cluster. PMID:28317049

  17. Programming and Runtime Support to Blaze FPGA Accelerator Deployment at Datacenter Scale.

    PubMed

    Huang, Muhuan; Wu, Di; Yu, Cody Hao; Fang, Zhenman; Interlandi, Matteo; Condie, Tyson; Cong, Jason

    2016-10-01

    With the end of CPU core scaling due to dark silicon limitations, customized accelerators on FPGAs have gained increased attention in modern datacenters due to their lower power, high performance and energy efficiency. Evidenced by Microsoft's FPGA deployment in its Bing search engine and Intel's $16.7 billion acquisition of Altera, integrating FPGAs into datacenters is considered one of the most promising approaches to sustain future datacenter growth. However, it is quite challenging for existing big data computing systems-like Apache Spark and Hadoop-to access the performance and energy benefits of FPGA accelerators. In this paper we design and implement Blaze to provide programming and runtime support for enabling easy and efficient deployments of FPGA accelerators in datacenters. In particular, Blaze abstracts FPGA accelerators as a service (FaaS) and provides a set of clean programming APIs for big data processing applications to easily utilize those accelerators. Our Blaze runtime implements an FaaS framework to efficiently share FPGA accelerators among multiple heterogeneous threads on a single node, and extends Hadoop YARN with accelerator-centric scheduling to efficiently share them among multiple computing tasks in the cluster. Experimental results using four representative big data applications demonstrate that Blaze greatly reduces the programming efforts to access FPGA accelerators in systems like Apache Spark and YARN, and improves the system throughput by 1.7 × to 3× (and energy efficiency by 1.5× to 2.7×) compared to a conventional CPU-only cluster.

  18. Real-time image processing for non-contact monitoring of dynamic displacements using smartphone technologies

    NASA Astrophysics Data System (ADS)

    Min, Jae-Hong; Gelo, Nikolas J.; Jo, Hongki

    2016-04-01

    The newly developed smartphone application, named RINO, in this study allows measuring absolute dynamic displacements and processing them in real time using state-of-the-art smartphone technologies, such as a high-performance graphics processing unit (GPU), in addition to the already powerful CPU and memory, an embedded high-speed/high-resolution camera, and open-source computer vision libraries. A carefully designed color-patterned target and a user-adjustable crop filter enable accurate and fast image processing, allowing up to 240 fps for complete displacement calculation and real-time display. The performance of the developed smartphone application is experimentally validated, showing accuracy comparable with that of a conventional laser displacement sensor.

  19. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Messer, Bronson; Harris, James A; Parete-Koon, Suzanne T

    We describe recent development work on the core-collapse supernova code CHIMERA. CHIMERA has consumed more than 100 million CPU-hours on Oak Ridge Leadership Computing Facility (OLCF) platforms in the past 3 years, ranking it among the most important applications at the OLCF. Most of the work described has been focused on exploiting the multicore nature of the current platform (Jaguar) via, e.g., multithreading using OpenMP. In addition, we have begun a major effort to marshal the computational power of GPUs with CHIMERA. The impending upgrade of Jaguar to Titan, a 20+ PF machine with an NVIDIA GPU on many nodes, makes this work essential.
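
    The abstract mentions loop-level OpenMP multithreading to exploit multicore nodes; the fragment below is only a generic illustration of that pattern, with a hypothetical update_zone work function standing in for the per-zone physics. It is not CHIMERA code.

      // Generic OpenMP loop-threading pattern (illustrative only).
      #include <cstddef>
      #include <vector>

      void update_zone(std::vector<double>& state, std::size_t zone) {
          state[zone] *= 1.0001;   // stand-in for per-zone physics work
      }

      void sweep(std::vector<double>& state) {
          // Each zone is independent, so the outer loop can be threaded.
          #pragma omp parallel for schedule(dynamic)
          for (long zone = 0; zone < static_cast<long>(state.size()); ++zone) {
              update_zone(state, static_cast<std::size_t>(zone));
          }
      }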

  20. An efficient compression scheme for bitmap indices

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Wu, Kesheng; Otoo, Ekow J.; Shoshani, Arie

    2004-04-13

    When using an out-of-core indexing method to answer a query, it is generally assumed that the I/O cost dominates the overall query response time. Because of this, most research on indexing methods concentrates on reducing the sizes of indices. For bitmap indices, compression has been used for this purpose. However, in most cases, operations on these compressed bitmaps, mostly bitwise logical operations such as AND, OR, and NOT, spend more time in the CPU than in I/O. To speed up these operations, a number of specialized bitmap compression schemes have been developed; the best known of which is the byte-aligned bitmap code (BBC). They are usually faster in performing logical operations than general-purpose compression schemes, but the time spent in the CPU still dominates the total query response time. To reduce the query response time, we designed a CPU-friendly scheme named the word-aligned hybrid (WAH) code. In this paper, we prove that the sizes of WAH compressed bitmap indices are about two words per row for a large range of attributes. This size is smaller than typical sizes of commonly used indices, such as a B-tree. Therefore, WAH compressed indices are appropriate not only for low-cardinality attributes but also for high-cardinality attributes. In the worst case, the time to operate on compressed bitmaps is proportional to the total size of the bitmaps involved. The total size of the bitmaps required to answer a query on one attribute is proportional to the number of hits. These indicate that WAH compressed bitmap indices are optimal. To verify their effectiveness, we generated bitmap indices for four different datasets and measured the response time of many range queries. Tests confirm that sizes of compressed bitmap indices are indeed smaller than B-tree indices, and query processing with WAH compressed indices is much faster than with BBC compressed indices, projection indices and B-tree indices. In addition, we also verified that the average query response time is proportional to the index size. This indicates that the compressed bitmap indices are efficient for very large datasets.
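
    A compact sketch of the word-aligned encoding idea follows, under an assumed 32-bit layout: literal words carry 31 bitmap bits (MSB = 0), while fill words set the MSB, carry the fill bit in the next bit, and count 31-bit groups in the low 30 bits. Names are illustrative and counter overflow is ignored; this is not the production WAH implementation. Logical AND/OR then proceed word by word over such code words, which is what keeps the CPU cost low.

      // Simplified WAH-style encoder (illustrative only).
      #include <cstdint>
      #include <vector>

      constexpr std::uint32_t kAllOnes31 = 0x7FFFFFFFu;

      // 'groups' holds the bitmap chopped into 31-bit groups (top bit of each is 0).
      std::vector<std::uint32_t> wah_encode(const std::vector<std::uint32_t>& groups) {
          std::vector<std::uint32_t> out;
          for (std::uint32_t g : groups) {
              bool zeroFill = (g == 0), oneFill = (g == kAllOnes31);
              if (zeroFill || oneFill) {
                  std::uint32_t fill = 0x80000000u | (oneFill ? 0x40000000u : 0u);
                  // Extend the previous fill word if it has the same fill bit.
                  if (!out.empty() && (out.back() & 0xC0000000u) == fill) {
                      ++out.back();          // low 30 bits count the 31-bit groups
                  } else {
                      out.push_back(fill | 1u);
                  }
              } else {
                  out.push_back(g);          // literal word, MSB already 0
              }
          }
          return out;
      }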

  1. PODIO: An Event-Data-Model Toolkit for High Energy Physics Experiments

    NASA Astrophysics Data System (ADS)

    Gaede, F.; Hegner, B.; Mato, P.

    2017-10-01

    PODIO is a C++ library that supports the automatic creation of event data models (EDMs) and efficient I/O code for HEP experiments. It is developed as a new EDM Toolkit for future particle physics experiments in the context of the AIDA2020 EU programme. Experience from LHC and the linear collider community shows that existing solutions partly suffer from overly complex data models with deep object-hierarchies or unfavorable I/O performance. The PODIO project was created in order to address these problems. PODIO is based on the idea of employing plain-old-data (POD) data structures wherever possible, while avoiding deep object-hierarchies and virtual inheritance. At the same time it provides the necessary high-level interface towards the developer physicist, such as the support for inter-object relations and automatic memory-management, as well as a Python interface. To simplify the creation of efficient data models PODIO employs code generation from a simple yaml-based markup language. In addition, it was developed with concurrency in mind in order to support the use of modern CPU features, for example giving basic support for vectorization techniques.
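
    To make the POD-plus-handle layering concrete, here is a hand-written sketch of the kind of structure such code generation produces conceptually: a flat, trivially copyable data struct with relations stored as indices, wrapped by a thin physicist-facing handle. The names are invented for illustration and do not match PODIO's generated classes.

      // Illustrative POD data layer + thin handle (not PODIO-generated code).
      #include <cstddef>
      #include <vector>

      struct HitData {              // plain-old-data payload, trivially serializable
          float x, y, z;
          float energy;
          int   clusterIndex;       // relation kept as an index, not a pointer
      };

      class Hit {                   // lightweight user-level handle over the POD storage
      public:
          Hit(std::vector<HitData>* store, std::size_t i) : store_(store), i_(i) {}
          float energy() const { return (*store_)[i_].energy; }
          void  setEnergy(float e) { (*store_)[i_].energy = e; }
      private:
          std::vector<HitData>* store_;
          std::size_t i_;
      };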

  2. JBioWH: an open-source Java framework for bioinformatics data integration

    PubMed Central

    Vera, Roberto; Perez-Riverol, Yasset; Perez, Sonia; Ligeti, Balázs; Kertész-Farkas, Attila; Pongor, Sándor

    2013-01-01

    The Java BioWareHouse (JBioWH) project is an open-source platform-independent programming framework that allows a user to build his/her own integrated database from the most popular data sources. JBioWH can be used for intensive querying of multiple data sources and the creation of streamlined task-specific data sets on local PCs. JBioWH is based on a MySQL relational database scheme and includes JAVA API parser functions for retrieving data from 20 public databases (e.g. NCBI, KEGG, etc.). It also includes a client desktop application for (non-programmer) users to query data. In addition, JBioWH can be tailored for use in specific circumstances, including the handling of massive queries for high-throughput analyses or CPU intensive calculations. The framework is provided with complete documentation and application examples and it can be downloaded from the Project Web site at http://code.google.com/p/jbiowh. A MySQL server is available for demonstration purposes at hydrax.icgeb.trieste.it:3307. Database URL: http://code.google.com/p/jbiowh PMID:23846595

  3. JBioWH: an open-source Java framework for bioinformatics data integration.

    PubMed

    Vera, Roberto; Perez-Riverol, Yasset; Perez, Sonia; Ligeti, Balázs; Kertész-Farkas, Attila; Pongor, Sándor

    2013-01-01

    The Java BioWareHouse (JBioWH) project is an open-source platform-independent programming framework that allows a user to build his/her own integrated database from the most popular data sources. JBioWH can be used for intensive querying of multiple data sources and the creation of streamlined task-specific data sets on local PCs. JBioWH is based on a MySQL relational database scheme and includes JAVA API parser functions for retrieving data from 20 public databases (e.g. NCBI, KEGG, etc.). It also includes a client desktop application for (non-programmer) users to query data. In addition, JBioWH can be tailored for use in specific circumstances, including the handling of massive queries for high-throughput analyses or CPU intensive calculations. The framework is provided with complete documentation and application examples and it can be downloaded from the Project Web site at http://code.google.com/p/jbiowh. A MySQL server is available for demonstration purposes at hydrax.icgeb.trieste.it:3307. Database URL: http://code.google.com/p/jbiowh.

  4. Lens-free imaging-based low-cost microsensor for in-line wear debris detection in lube oils

    NASA Astrophysics Data System (ADS)

    Mabe, Jon; Zubia, Joseba; Gorritxategi, Eneko

    2017-02-01

    The current paper describes the application of lens-free imaging principles to the detection and classification of wear debris in lubricant oils. The potential benefits brought by lens-free microscopy techniques in terms of resolution, depth of field and active area have been tailored to develop a microsensor for the in-line monitoring of wear debris in oils used in lubricated or hydraulic machines such as gearboxes, actuators and engines. The current work presents a laboratory test-bench used for evaluating the optical performance of the lens-free approach applied to wear particle detection in oil samples. Additionally, the current prototype sensor is presented, which integrates an LED light source, a CMOS imager, an embedded CPU, the measurement cell and the appropriate optical components for setting up the lens-free system. The imaging performance is quantified using micro-structured samples, as well as by imaging real used lubricant oils. Probing a large volume with a decent 2D spatial resolution, this lens-free microsensor can provide a powerful tool at very low cost for in-line wear debris monitoring.

  5. Towards more stable operation of the Tokyo Tier2 center

    NASA Astrophysics Data System (ADS)

    Nakamura, T.; Mashimo, T.; Matsui, N.; Sakamoto, H.; Ueda, I.

    2014-06-01

    The Tokyo Tier2 center, which is located at the International Center for Elementary Particle Physics (ICEPP) at the University of Tokyo, was established as a regional analysis center in Japan for the ATLAS experiment. Official operation within WLCG started in 2007 after several years of development beginning in 2002. In December 2012, we replaced almost all hardware in the third system upgrade to cope with the ever-growing data of the ATLAS experiment. The number of CPU cores was increased by a factor of two (9984 cores in total), and the performance of each CPU core improved by 20% according to the HEPSPEC06 benchmark in 32-bit compile mode. The score is estimated as 18.03 (SL6) per core using the Intel Xeon E5-2680 at 2.70 GHz. Since all worker nodes have a 16-core configuration, 624 blade servers were deployed in total. They are connected to 6.7 PB of disk storage through a non-blocking 10 Gbps internal network backbone built on two central network switches (NetIron MLXe-32). The disk storage consists of 102 RAID6 disk arrays (Infortrend DS S24F-G2840-4C16DO0) served by an equal number of 1U file servers with 8G-FC connections to maximize the file transfer throughput per unit of storage capacity. As of February 2013, 2560 CPU cores and 2.00 PB of disk storage had already been deployed for WLCG. Currently, the remaining non-grid resources, both CPUs and disk storage, are used as dedicated resources for data analysis by the ATLAS Japan collaborators. Since all hardware in the non-grid resources shares the same architecture as the Tier2 resources, it can be migrated to the Tier2 as extra resources on demand of the ATLAS experiment in the future. In addition to the upgrade of computing resources, we expect improved wide area network connectivity. Thanks to the Japanese NREN (NII), another 10 Gbps trans-Pacific line from Japan to Washington will become available in addition to the existing two 10 Gbps lines (Tokyo to New York and Tokyo to Los Angeles). The new line will be connected to LHCONE to further improve connectivity. In this circumstance, we are working toward further stable operation. For instance, we have newly introduced GPFS (IBM) for the non-grid disk storage, while Disk Pool Manager (DPM) continues to be used for the Tier2 disk storage as in the previous system. Since the number of files stored in a DPM pool will grow with the total amount of data, developing a stable database configuration is one of the crucial issues, as is scalability. We have started studies on the performance of asynchronous database replication so that we can take daily full backups. In this report, we introduce several improvements in the performance and stability of our new system and the possibility of further improving local I/O performance in multi-core worker nodes. We also present the status of the wide area network connectivity from Japan to the US and/or EU with LHCONE.

  6. Using Intel's Knight Landing Processor to Accelerate Global Nested Air Quality Prediction Modeling System (GNAQPMS) Model

    NASA Astrophysics Data System (ADS)

    Wang, H.; Chen, H.; Chen, X.; Wu, Q.; Wang, Z.

    2016-12-01

    The Global Nested Air Quality Prediction Modeling System for Hg (GNAQPMS-Hg) is a global chemical transport model coupled with a Hg transport module to investigate mercury pollution. In this study, we present our work porting the GNAQPMS model to the Intel Xeon Phi processor Knights Landing (KNL) to accelerate the model. KNL is the second-generation product adopting the Many Integrated Core (MIC) architecture. Compared with the first-generation Knights Corner (KNC), KNL has many new hardware features; in particular, it can be used as a standalone processor as well as a coprocessor alongside another CPU. Using the VTune profiling tool, the high-overhead modules in the GNAQPMS model were identified, including the CBM-Z gas chemistry, the advection and convection modules, and the wet deposition module. These high-overhead modules were accelerated by optimizing the code and using new features of KNL. The following optimization measures were taken: (1) changing the pure MPI parallel mode to a hybrid parallel mode with MPI and OpenMP; (2) vectorizing the code to use the 512-bit wide vector computation units; (3) reducing unnecessary memory accesses and calculations; (4) reducing thread local storage (TLS) for common variables in each OpenMP thread in CBM-Z; and (5) changing global communication from file writing and reading to MPI functions. After optimization, the performance of GNAQPMS is greatly increased on both the CPU and KNL platforms: single-node tests showed that the optimized version has a 2.6x speedup on a two-socket CPU platform and a 3.3x speedup on a one-socket KNL platform compared with the baseline version, which means the KNL platform has a 1.29x speedup compared with the two-socket CPU platform.
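
    A generic sketch of optimization (1), the hybrid MPI + OpenMP pattern, is given below; the inner loop is written to be auto-vectorizable for wide vector units (optimization 2) and the global reduction is done through MPI calls rather than interface files (optimization 5). All sizes, names and the chemistry stand-in are illustrative assumptions, not GNAQPMS code.

      // Hybrid MPI + OpenMP skeleton (illustrative only).
      #include <cstdio>
      #include <mpi.h>
      #include <vector>

      int main(int argc, char** argv) {
          int provided = 0;
          MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

          int rank = 0;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          const int nCells = 1 << 20;                 // cells owned by this rank (toy size)
          std::vector<double> conc(nCells, 1.0), rate(nCells, 0.5);

          // Thread-level parallelism across cells; the body is contiguous and
          // branch-free, so the compiler can vectorize it for wide vector units.
          #pragma omp parallel for simd
          for (int i = 0; i < nCells; ++i) {
              conc[i] += 0.1 * rate[i];               // stand-in for a chemistry update
          }

          double local = 0.0, global = 0.0;
          for (int i = 0; i < nCells; ++i) local += conc[i];
          // Global communication via MPI instead of writing/reading interface files.
          MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
          if (rank == 0) std::printf("global sum = %f\n", global);

          MPI_Finalize();
          return 0;
      }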

  7. Evaluation of user input methods for manipulating a tablet personal computer in sterile techniques.

    PubMed

    Yamada, Akira; Komatsu, Daisuke; Suzuki, Takeshi; Kurozumi, Masahiro; Fujinaga, Yasunari; Ueda, Kazuhiko; Kadoya, Masumi

    2017-02-01

    To determine a quick and accurate user input method for manipulating tablet personal computers (PCs) in sterile techniques. We evaluated three different manipulation methods, (1) Computer mouse and sterile system drape, (2) Fingers and sterile system drape, and (3) Digitizer stylus and sterile ultrasound probe cover with a pinhole, in terms of the central processing unit (CPU) performance, manipulation performance, and contactlessness. A significant decrease in CPU score ([Formula: see text]) and an increase in CPU temperature ([Formula: see text]) were observed when a system drape was used. The respective mean times taken to select a target image from an image series (ST) and the mean times for measuring points on an image (MT) were [Formula: see text] and [Formula: see text] s for the computer mouse method, [Formula: see text] and [Formula: see text] s for the finger method, and [Formula: see text] and [Formula: see text] s for the digitizer stylus method, respectively. The ST for the finger method was significantly longer than for the digitizer stylus method ([Formula: see text]). The MT for the computer mouse method was significantly longer than for the digitizer stylus method ([Formula: see text]). The mean success rate for measuring points on an image was significantly lower for the finger method when the diameter of the target was equal to or smaller than 8 mm than for the other methods. No significant difference in the adenosine triphosphate amount at the surface of the tablet PC was observed before, during, or after manipulation via the digitizer stylus method while wearing starch-powdered sterile gloves ([Formula: see text]). Quick and accurate manipulation of tablet PCs in sterile techniques without CPU load is feasible using a digitizer stylus and sterile ultrasound probe cover with a pinhole.

  8. Revisiting Molecular Dynamics on a CPU/GPU system: Water Kernel and SHAKE Parallelization.

    PubMed

    Ruymgaart, A Peter; Elber, Ron

    2012-11-13

    We report Graphics Processing Unit (GPU) and OpenMP parallel implementations of water-specific force calculations and of bond constraints for use in Molecular Dynamics simulations. We focus on a typical laboratory computing environment in which a CPU with a few cores is attached to a GPU. We discuss the design of the code in detail and illustrate performance comparable to highly optimized codes such as GROMACS. Besides speed, our code shows excellent energy conservation. Utilization of water-specific lists allows the efficient calculation of non-bonded interactions that include water molecules and results in a speed-up factor of more than 40 on the GPU compared to code optimized on a single CPU core for systems larger than 20,000 atoms. This is up four-fold from the factor of 10 reported in our initial GPU implementation, which did not include a water-specific code. Another optimization is the implementation of constrained dynamics entirely on the GPU. The routine, which enforces constraints of all bonds, runs in parallel on multiple OpenMP cores or entirely on the GPU. It is based on a Conjugate Gradient solution of the Lagrange multipliers (CG SHAKE). The GPU implementation is partially in double precision and requires no communication with the CPU during the execution of the SHAKE algorithm. The (parallel) implementation of SHAKE allows an increase of the time step to 2.0 fs while maintaining excellent energy conservation. Interestingly, CG SHAKE is faster than the usual bond relaxation algorithm even on a single core if high accuracy is expected. The significant speedup of the optimized components transfers the computational bottleneck of the MD calculation to the reciprocal part of the Particle Mesh Ewald (PME) sum.

  9. Computing the Density Matrix in Electronic Structure Theory on Graphics Processing Units.

    PubMed

    Cawkwell, M J; Sanville, E J; Mniszewski, S M; Niklasson, Anders M N

    2012-11-13

    The self-consistent solution of a Schrödinger-like equation for the density matrix is a critical and computationally demanding step in quantum-based models of interatomic bonding. This step was tackled historically via the diagonalization of the Hamiltonian. We have investigated the performance and accuracy of the second-order spectral projection (SP2) algorithm for the computation of the density matrix via a recursive expansion of the Fermi operator in a series of generalized matrix-matrix multiplications. We demonstrate that owing to its simplicity, the SP2 algorithm [Niklasson, A. M. N. Phys. Rev. B 2002, 66, 155115] is exceptionally well suited to implementation on graphics processing units (GPUs). The performance in double and single precision arithmetic of a hybrid GPU/central processing unit (CPU) implementation and a full GPU implementation of the SP2 algorithm exceeds that of a CPU-only implementation of the SP2 algorithm and traditional matrix diagonalization when the dimensions of the matrices exceed about 2000 × 2000. Padding schemes for arrays allocated in the GPU memory that optimize the performance of the CUBLAS implementations of the level 3 BLAS DGEMM and SGEMM subroutines for generalized matrix-matrix multiplications are described in detail. The analysis of the relative performance of the hybrid CPU/GPU and full GPU implementations indicates that the transfer of arrays between the GPU and CPU constitutes only a small fraction of the total computation time. The errors measured in the self-consistent density matrices computed using the SP2 algorithm are generally smaller than those measured in matrices computed via diagonalization. Furthermore, the errors in the density matrices computed using the SP2 algorithm do not exhibit any dependence on system size, whereas the errors increase linearly with the number of orbitals when diagonalization is employed.
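
    A dense-matrix toy sketch of the SP2 recursion itself follows (plain triple loops instead of the CUBLAS GEMM calls the work relies on): X is initialized as a linear rescaling of H onto [0, 1] and repeatedly replaced by X² or 2X − X², whichever moves trace(X) toward the electron count. The spectral bounds and the convergence tolerance are assumed inputs; this is only an illustration of the recursion, not the GPU code described above.

      // Toy SP2 purification sketch (illustrative only).
      #include <cmath>
      #include <vector>

      using Matrix = std::vector<double>;   // n x n, row-major

      Matrix multiply(const Matrix& a, const Matrix& b, int n) {
          Matrix c(n * n, 0.0);
          for (int i = 0; i < n; ++i)
              for (int k = 0; k < n; ++k) {
                  double aik = a[i * n + k];
                  for (int j = 0; j < n; ++j) c[i * n + j] += aik * b[k * n + j];
              }
          return c;
      }

      double trace(const Matrix& a, int n) {
          double t = 0.0;
          for (int i = 0; i < n; ++i) t += a[i * n + i];
          return t;
      }

      // H: Hamiltonian, [emin, emax]: bounds on its spectrum, ne: occupied states.
      Matrix sp2_density(const Matrix& H, int n, double emin, double emax,
                         int ne, int maxIter = 50) {
          Matrix X(n * n);
          // X0 = (emax*I - H) / (emax - emin) maps the spectrum onto [0, 1].
          for (int i = 0; i < n; ++i)
              for (int j = 0; j < n; ++j)
                  X[i * n + j] = ((i == j ? emax : 0.0) - H[i * n + j]) / (emax - emin);

          for (int it = 0; it < maxIter; ++it) {
              Matrix X2 = multiply(X, X, n);
              double t = trace(X, n), t2 = trace(X2, n);
              // Pick the polynomial that brings trace(X) closer to ne.
              if (std::fabs(t2 - ne) <= std::fabs(2.0 * t - t2 - ne)) {
                  X = X2;                                              // X <- X^2
              } else {
                  for (int k = 0; k < n * n; ++k) X[k] = 2.0 * X[k] - X2[k];  // X <- 2X - X^2
              }
              if (std::fabs(trace(X, n) - ne) < 1e-9) break;
          }
          return X;
      }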

  10. Feasibility and Safety of Evaluating Patients with Prior Coronary Artery Disease Using an Accelerated Diagnostic Algorithm in a Chest Pain Unit

    PubMed Central

    Goldkorn, Ronen; Goitein, Orly; Ben-Zekery, Sagit; Shlomo, Nir; Narodetsky, Michael; Livne, Moran; Sabbag, Avi; Asher, Elad; Matetzky, Shlomi

    2016-01-01

    An accelerated diagnostic protocol for evaluating low-risk patients with acute chest pain in a cardiologist-based chest pain unit (CPU) is widely employed today. However, limited data exist regarding the feasibility of such an algorithm for patients with a history of prior coronary artery disease (CAD). The aim of the current study was to assess the feasibility and safety of evaluating patients with a history of prior CAD using an accelerated diagnostic protocol. We evaluated 1,220 consecutive patients presenting with acute chest pain and hospitalized in our CPU. Patients were stratified according to whether or not they had a history of prior CAD. The primary composite outcome was defined as a composite of readmission due to chest pain, acute coronary syndrome, coronary revascularization, or death during a 60-day follow-up period. Overall, 268 (22%) patients had a history of prior CAD. Non-invasive evaluation was performed in 1,112 (91%) patients. While patients with a history of prior CAD had more comorbidities, the two study groups were similar regarding rates of hospitalization (9% vs. 13%, p = 0.08), coronary angiography (13% vs. 11%, p = 0.41), and revascularization (6.5% vs. 5.7%, p = 0.8) performed during CPU evaluation. At 60 days, the primary endpoint was observed in 12 (1.6%) and 6 (3.2%) patients without and with a history of prior CAD, respectively (p = 0.836). No mortalities were recorded. To conclude, patients with a history of prior CAD can be expeditiously and safely evaluated using an accelerated diagnostic protocol in a CPU, with outcomes not differing from patients without such a history. PMID:27669521

  11. Auditory stimulation by exposure to melodic music increases dopamine and serotonin activities in rat forebrain areas linked to reward and motor control.

    PubMed

    Moraes, Michele M; Rabelo, Patrícia C R; Pinto, Valéria A; Pires, Washington; Wanner, Samuel P; Szawka, Raphael E; Soares, Danusa D

    2018-04-23

    Listening to melodic music is regarded as a non-pharmacological intervention that ameliorates various disease symptoms, likely by changing the activity of brain monoaminergic systems. Here, we investigated the effects of exposure to melodic music on the concentrations of dopamine (DA), serotonin (5-HT) and their respective metabolites in the caudate-putamen (CPu) and nucleus accumbens (NAcc), areas linked to reward and motor control. Male adult Wistar rats were randomly assigned to a control group or a group exposed to music. The music group was submitted to 8 music sessions [Mozart's Sonata for Two Pianos (K. 448) at an average sound pressure of 65 dB]. The control rats were handled in the same way but were not exposed to music. Immediately after the last exposure or control session, the rats were euthanized, and their brains were quickly removed to analyze the concentrations of 5-HT, DA, 5-hydroxyindoleacetic acid (5-HIAA) and 3,4-dihydroxyphenylacetic acid (DOPAC) in the CPu and NAcc. Auditory stimuli affected the monoaminergic system in these two brain structures. In the CPu, auditory stimuli increased the concentrations of DA and 5-HIAA but did not change the DOPAC or 5-HT levels. In the NAcc, music markedly increased the DOPAC/DA ratio, suggesting an increase in DA turnover. Our data indicate that auditory stimuli, such as exposure to melodic music, increase DA levels and the release of 5-HT in the CPu as well as DA turnover in the NAcc, suggesting that the music had a direct impact on monoamine activity in these brain areas. Copyright © 2018 Elsevier B.V. All rights reserved.

  12. Optimal endothelialisation of a new compliant poly(carbonate-urea)urethane vascular graft with effect of physiological shear stress.

    PubMed

    Salacinski, H J; Tai, N R; Punshon, G; Giudiceandrea, A; Hamilton, G; Seifalian, A M

    2000-10-01

    To define the optimal seeding conditions of a new stress-free poly(carbonate-urea)urethane (CPU) graft with compliance similar to that of human artery, with a honeycomb structure engineered during the manufacturing process to enhance adhesion and growth of endothelial cells. ¹¹¹Indium-oxine radiolabeled human umbilical vein endothelial cells (HUVEC) were seeded onto CPU grafts at (a) concentrations from 2-24×10⁵ cells/cm² and (b) incubated for 0.5, 1, 2, 4 and 6 h. Following incubation, graft segments were subjected to three washing/gamma counting procedures and scanning electron microscopy (SEM). Cell viability was measured using a modified Alamar Blue™ assay. To test physiological retention, a pulsatile flow phantom was used to subject optimally seeded (16×10⁵ cells/cm², 4 h) CPU grafts to arterial shear stress for 6 h with real-time acquisition of scintigraphic images of seeded grafts using a nuclear medicine gamma camera system. A seeding efficiency of 54±13% after three washes was achieved using 16×10⁵ cells/cm². Similarly, in SEM micrographs a seeding density of 16×10⁵ cells/cm² resulted in a confluent monolayer. Seeded CPU segments incubated for 4 h exhibited significantly higher resistance to wash-off than segments incubated for 30 min (p < 0.05). Exposure of seeded grafts to pulsatile shear stress resulted in some cell loss, with 67±3% of cells adherent following 6 h of perfusion with ongoing metabolic activity. Thus, optimal conditions were 16×10⁵ cells/cm² at 4 h. The optimal seeding conditions have been defined for a "tissue-engineered" vascular graft which allow complete endothelialisation and high cell-to-substrate strength that resists hydrodynamic stress. Copyright 2000 Harcourt Publishers Ltd.

  13. Parallel Algorithm for GPU Processing; for use in High Speed Machine Vision Sensing of Cotton Lint Trash.

    PubMed

    Pelletier, Mathew G

    2008-02-08

    One of the main hurdles standing in the way of optimal cleaning of cotton lint is the lack of sensing systems that can react fast enough to provide the control system with real-time information as to the level of trash contamination of the cotton lint. This research examines the use of programmable graphics processing units (GPU) as an alternative to the PC's traditional use of the central processing unit (CPU). The use of the GPU, as an alternative computation platform, allowed the machine vision system to gain a significant improvement in processing time. By improving the processing time, this research seeks to address the lack of availability of rapid trash sensing systems and thus alleviate a situation in which the current systems view the cotton lint either well before, or after, the cotton is cleaned. This extended lag/lead time that is currently imposed on the cotton trash cleaning control systems is what is responsible for system operators utilizing a very large dead-band safety buffer in order to ensure that the cotton lint is not under-cleaned. Unfortunately, the utilization of a large dead-band buffer results in the majority of the cotton lint being over-cleaned, which in turn causes lint fiber damage as well as significant losses of the valuable lint due to the excessive use of cleaning machinery. This research estimates that upwards of a 30% reduction in lint loss could be gained through the use of a trash sensor tightly coupled to the cleaning machinery control systems. This research seeks to improve processing times through the development of a new algorithm for cotton trash sensing that allows for implementation on a highly parallel architecture. Additionally, by moving the new parallel algorithm onto an alternative computing platform, the graphics processing unit (GPU), for processing of the cotton trash images, a speed-up of over 6.5 times over optimized code running on the PC's central processing unit (CPU) was gained. The new parallel algorithm operating on the GPU was able to process a 1024x1024 image in less than 17 ms. At this improved speed, the image processing system's performance should now be sufficient to provide a system capable of real-time feedback control in tight cooperation with the cleaning equipment.
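
    As a generic illustration of why this workload parallelizes so well, the sketch below classifies each pixel of a grayscale lint image independently by threshold; on a GPU each pixel would map to its own thread. The threshold rule and all names are assumptions for illustration, not the paper's algorithm.

      // Data-parallel per-pixel classification sketch (illustrative only).
      #include <cstddef>
      #include <cstdint>
      #include <vector>

      std::size_t classify_trash(const std::vector<std::uint8_t>& image,
                                 std::vector<std::uint8_t>& mask,
                                 std::uint8_t threshold) {
          mask.resize(image.size());
          std::size_t trash = 0;
          #pragma omp parallel for reduction(+:trash)
          for (long i = 0; i < static_cast<long>(image.size()); ++i) {
              bool isTrash = image[i] < threshold;   // dark speck on bright lint
              mask[i] = isTrash ? 255 : 0;
              trash += isTrash ? 1 : 0;
          }
          return trash;                              // number of trash pixels
      }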

  14. Research on phone contacts online status based on mobile cloud computing

    NASA Astrophysics Data System (ADS)

    Wang, Wen-jing; Ge, Wei

    2013-03-01

    Because of the limited storage space and CPU processing power of mobile phones, it is difficult to realize complex applications on them. With the development of cloud computing, however, the computing and storage can be placed in the cloud to provide users with rich cloud services, and helping users complete various functions through the browser has become the trend for future mobile communication. This article takes the online status of mobile phone contacts as an example to analyze the development and application of mobile cloud computing.

  15. Fault Tolerant Microcontroller for the Configurable Fault Tolerant Processor

    DTIC Science & Technology

    2008-09-01

    many others come to mind I also wish to thank Jan Gray for providing an excellent System-on-a-Chip that formed a core component of this thesis...developed by Jan Gray as documented in his "Building a RISC CPU and System-on-a-Chip in an FPGA" series of articles that was published in Circuit Cellar...those detailed by Jan Gray in his "Getting Started with the XSOC Project v0.93" [16]. The XSOC distribution is available at <http://www.fpgacpu.org

  16. Closed Loop Real-Time Evaluation of Missile Guidance and Control Components: Simulator Hardware/Software Characteristics and Use

    DTIC Science & Technology

    1974-08-01

    Node Control Logic; 2.16 Pitch Channel Frequency Response; 2.17 Yaw Channel Frequency Response; 2.18 Analog Computer Mechanization of... TABLE I Elements of the Sigma 5 Digital Computer System (Xerox Model Number, Performance Characteristics, MIOP Channel Description)...transfer control signals to or from the CPU. The MIOP can handle up to 32 I/O channels each operating simultaneously, provided the overall data

  17. Sequence search on a supercomputer.

    PubMed

    Gotoh, O; Tagashira, Y

    1986-01-10

    A set of programs was developed for searching nucleic acid and protein sequence data bases for sequences similar to a given sequence. The programs, written in FORTRAN 77, were optimized for vector processing on a Hitachi S810-20 supercomputer. A search of a 500-residue protein sequence against the entire PIR data base Ver. 1.0 (1) (0.5 M residues) is carried out in a CPU time of 45 sec. About 4 min is required for an exhaustive search of a 1500-base nucleotide sequence against all mammalian sequences (1.2M bases) in Genbank Ver. 29.0. The CPU time is reduced to about a quarter with a faster version.

  18. A CPU benchmark for protein crystallographic refinement.

    PubMed

    Bourne, P E; Hendrickson, W A

    1990-01-01

    The CPU time required to complete a cycle of restrained least-squares refinement of a protein structure from X-ray crystallographic data using the FORTRAN codes PROTIN and PROLSQ is reported for 48 different processors, ranging from single-user workstations to supercomputers. Sequential, vector, VLIW, multiprocessor, and RISC hardware architectures are compared using both a small and a large protein structure. Representative compile times for each hardware type are also given, and the improvement in run-time when coding for a specific hardware architecture is considered. The benchmarks involve scalar integer and vector floating point arithmetic and are representative of the calculations performed in many scientific disciplines.

  19. Adaptive real-time methodology for optimizing energy-efficient computing

    DOEpatents

    Hsu, Chung-Hsing [Los Alamos, NM; Feng, Wu-Chun [Blacksburg, VA

    2011-06-28

    Dynamic voltage and frequency scaling (DVFS) is an effective way to reduce energy and power consumption in microprocessor units. Current implementations of DVFS suffer from inaccurate modeling of power requirements and usage, and from inaccurate characterization of the relationships between the applicable variables. A system and method is proposed that adjusts CPU frequency and voltage based on run-time calculations of the workload processing time, as well as a calculation of performance sensitivity with respect to CPU frequency. The system and method are processor independent, and can be applied to either an entire system as a unit, or individually to each process running on a system.
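
    A hedged sketch of the kind of decision rule described follows: estimate the workload's sensitivity to CPU frequency from timed runs, then pick the lowest available frequency whose predicted slowdown stays within a bound. The linear performance model T(f) = T(fmax)·(beta·fmax/f + 1 − beta) and all names are illustrative assumptions, not the patented method itself.

      // Frequency-selection sketch for a DVFS-style controller (illustrative only).
      #include <algorithm>
      #include <vector>

      // beta in [0,1]: 1 = fully CPU-bound, 0 = runtime independent of frequency.
      double pick_frequency(const std::vector<double>& freqs,   // available freqs, MHz
                            double fmax, double beta, double maxSlowdown) {
          double best = fmax;
          for (double f : freqs) {
              double slowdown = beta * (fmax / f - 1.0);          // predicted extra runtime
              if (slowdown <= maxSlowdown && f < best) best = f;   // lowest acceptable freq
          }
          return best;
      }

      // Estimate beta from two timed runs of the same workload at frequencies f1 > f2.
      double estimate_beta(double t1, double f1, double t2, double f2) {
          // Solve t2/t1 = beta * f1/f2 + (1 - beta) for beta, clamped to [0, 1].
          double beta = (t2 / t1 - 1.0) / (f1 / f2 - 1.0);
          return std::clamp(beta, 0.0, 1.0);
      }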

  20. Comparison of Conjugate Gradient Density Matrix Search and Chebyshev Expansion Methods for Avoiding Diagonalization in Large-Scale Electronic Structure Calculations

    NASA Technical Reports Server (NTRS)

    Bates, Kevin R.; Daniels, Andrew D.; Scuseria, Gustavo E.

    1998-01-01

    We report a comparison of two linear-scaling methods which avoid the diagonalization bottleneck of traditional electronic structure algorithms. The Chebyshev expansion method (CEM) is implemented for carbon tight-binding calculations of large systems and its memory and timing requirements compared to those of our previously implemented conjugate gradient density matrix search (CG-DMS). Benchmark calculations are carried out on icosahedral fullerenes from C60 to C8640 and the linear scaling memory and CPU requirements of the CEM demonstrated. We show that the CPU requisites of the CEM and CG-DMS are similar for calculations with comparable accuracy.

  1. Input/output behavior of supercomputing applications

    NASA Technical Reports Server (NTRS)

    Miller, Ethan L.

    1991-01-01

    The collection and analysis of supercomputer I/O traces and their use in a collection of buffering and caching simulations are described. This serves two purposes. First, it gives a model of how individual applications running on supercomputers request file system I/O, allowing system designers to optimize I/O hardware and file system algorithms to that model. Second, the buffering simulations show what resources are needed to maximize the CPU utilization of a supercomputer given a very bursty I/O request rate. By using read-ahead and write-behind in a large solid-state disk, one or two applications were sufficient to fully utilize a Cray Y-MP CPU.

  2. Exploiting graphics processing units for computational biology and bioinformatics.

    PubMed

    Payne, Joshua L; Sinnott-Armstrong, Nicholas A; Moore, Jason H

    2010-09-01

    Advances in the video gaming industry have led to the production of low-cost, high-performance graphics processing units (GPUs) that possess more memory bandwidth and computational capability than central processing units (CPUs), the standard workhorses of scientific computing. With the recent release of general-purpose GPUs and NVIDIA's GPU programming language, CUDA, graphics engines are being adopted widely in scientific computing applications, particularly in the fields of computational biology and bioinformatics. The goal of this article is to concisely present an introduction to GPU hardware and programming, aimed at the computational biologist or bioinformaticist. To this end, we discuss the primary differences between GPU and CPU architecture, introduce the basics of the CUDA programming language, and discuss important CUDA programming practices, such as the proper use of coalesced reads, data types, and memory hierarchies. We highlight each of these topics in the context of computing the all-pairs distance between instances in a dataset, a common procedure in numerous disciplines of scientific computing. We conclude with a runtime analysis of the GPU and CPU implementations of the all-pairs distance calculation. We show our final GPU implementation to outperform the CPU implementation by a factor of 1700.
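
    A reference CPU version of the running example, the all-pairs Euclidean distance computation, is sketched below; every (i, j) pair is independent, which is the property that lets a CUDA version assign pairs (or rows) to GPU threads. The data layout and names are assumptions for the sketch.

      // All-pairs Euclidean distance on the CPU (illustrative reference version).
      #include <cmath>
      #include <cstddef>
      #include <vector>

      // data: n instances of dim features, row-major; returns an n x n distance matrix.
      std::vector<double> all_pairs(const std::vector<double>& data,
                                    std::size_t n, std::size_t dim) {
          std::vector<double> dist(n * n, 0.0);
          for (std::size_t i = 0; i < n; ++i)
              for (std::size_t j = i + 1; j < n; ++j) {
                  double s = 0.0;
                  for (std::size_t k = 0; k < dim; ++k) {
                      double d = data[i * dim + k] - data[j * dim + k];
                      s += d * d;
                  }
                  dist[i * n + j] = dist[j * n + i] = std::sqrt(s);  // symmetric
              }
          return dist;
      }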

  3. Compiler-based code generation and autotuning for geometric multigrid on GPU-accelerated supercomputers

    DOE PAGES

    Basu, Protonu; Williams, Samuel; Van Straalen, Brian; ...

    2017-04-05

    GPUs, with their high bandwidths and computational capabilities, are an increasingly popular target for scientific computing. Unfortunately, to date, harnessing the power of the GPU has required use of a GPU-specific programming model like CUDA, OpenCL, or OpenACC. Thus, in order to deliver portability across CPU-based and GPU-accelerated supercomputers, programmers are forced to write and maintain two versions of their applications or frameworks. In this paper, we explore the use of a compiler-based autotuning framework based on CUDA-CHiLL to deliver not only portability, but also performance portability across CPU- and GPU-accelerated platforms for the geometric multigrid linear solvers found in many scientific applications. We also show that with autotuning we can attain near Roofline (a performance bound for a computation and target architecture) performance across the key operations in the miniGMG benchmark for both CPU- and GPU-based architectures as well as for multiple stencil discretizations and smoothers. We show that our technology is readily interoperable with MPI, resulting in performance at scale equal to that obtained via a hand-optimized MPI+CUDA implementation.

  4. An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU

    DOE PAGES

    Lyakh, Dmitry I.

    2015-01-05

    An efficient parallel tensor transpose algorithm is suggested for shared-memory computing units, namely, multicore CPU, Intel Xeon Phi, and NVidia GPU. The algorithm operates on dense tensors (multidimensional arrays) and is based on the optimization of cache utilization on x86 CPU and the use of shared memory on NVidia GPU. From the applied side, the ultimate goal is to minimize the overhead encountered in the transformation of tensor contractions into matrix multiplications in computer implementations of advanced methods of quantum many-body theory (e.g., in electronic structure theory and nuclear physics). A particular accent is made on higher-dimensional tensors that typically appear in the so-called multireference correlated methods of electronic structure theory. Depending on tensor dimensionality, the presented optimized algorithms can achieve an order of magnitude speedup on x86 CPUs and 2-3 times speedup on NVidia Tesla K20X GPU with respect to the naïve scattering algorithm (no memory access optimization). Furthermore, the tensor transpose routines developed in this work have been incorporated into a general-purpose tensor algebra library (TAL-SH).
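
    As a simplified illustration of the cache-utilization idea on x86, here is the rank-2 special case, a tile-blocked transpose: copying in small tiles keeps both the read and the write streams in cache instead of striding through one of them. The actual library handles arbitrary-rank permutations and GPU shared memory, which this sketch does not attempt; the tile size is an assumption.

      // Tile-blocked 2D transpose (illustrative rank-2 special case).
      #include <algorithm>
      #include <cstddef>
      #include <vector>

      void transpose_blocked(const std::vector<double>& in, std::vector<double>& out,
                             std::size_t rows, std::size_t cols, std::size_t tile = 64) {
          out.resize(rows * cols);
          for (std::size_t ib = 0; ib < rows; ib += tile)
              for (std::size_t jb = 0; jb < cols; jb += tile)
                  for (std::size_t i = ib; i < std::min(ib + tile, rows); ++i)
                      for (std::size_t j = jb; j < std::min(jb + tile, cols); ++j)
                          out[j * rows + i] = in[i * cols + j];   // tile-local transpose
      }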

  5. Purine 5′,8-cyclo-2′-deoxynucleoside lesions in irradiated DNA

    NASA Astrophysics Data System (ADS)

    Chatgilialoglu, Chryssostomos; Krokidis, Marios G.; Papadopoulos, Kyriakos; Terzidis, Michael A.

    2016-11-01

    Having gained their place among the smallest bulky DNA lesions recognized by the nucleotide excision repair (NER) enzymes, purine 5′,8-cyclo-2′-deoxynucleosides (5′,8-cPu) are attracting increasing interest in the field of genome integrity in health and disease. Exclusively generated by one of the most harmful reactive oxygen species, the hydroxyl radical, 5′,8-cPu can also provide highly valuable information regarding the oxidative status near the region where the genetic information is stored. Herein, we have collected the most recently reported biological studies, focusing on the repair mechanism of these lesions and their biological significance, particularly in transcription. The LC-MS/MS quantification protocols that have appeared in the literature are discussed in detail, along with the reported values for the four 5′,8-cPu lesions produced by in vitro γ-radiolysis experiments with calf thymus DNA. Mechanistic insights into the formation of the purine 5′,8-cyclo-2′-deoxynucleosides and their chemical stability are also given in light of their potential to be utilized as DNA biomarkers of oxidative stress.

  6. GPU-based stochastic-gradient optimization for non-rigid medical image registration in time-critical applications

    NASA Astrophysics Data System (ADS)

    Bhosale, Parag; Staring, Marius; Al-Ars, Zaid; Berendsen, Floris F.

    2018-03-01

    Currently, non-rigid image registration algorithms are too computationally intensive to use in time-critical applications. Existing implementations that focus on speed typically address this by either parallelization on GPU-hardware, or by introducing methodically novel techniques into CPU-oriented algorithms. Stochastic gradient descent (SGD) optimization and variations thereof have proven to drastically reduce the computational burden for CPU-based image registration, but have not been successfully applied in GPU hardware due to its stochastic nature. This paper proposes 1) NiftyRegSGD, a SGD optimization for the GPU-based image registration tool NiftyReg, 2) random chunk sampler, a new random sampling strategy that better utilizes the memory bandwidth of GPU hardware. Experiments have been performed on 3D lung CT data of 19 patients, which compared NiftyRegSGD (with and without random chunk sampler) with CPU-based elastix Fast Adaptive SGD (FASGD) and NiftyReg. The registration runtime was 21.5s, 4.4s and 2.8s for elastix-FASGD, NiftyRegSGD without, and NiftyRegSGD with random chunk sampling, respectively, while similar accuracy was obtained. Our method is publicly available at https://github.com/SuperElastix/NiftyRegSGD.
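
    A minimal sketch of the random-chunk idea: rather than gathering scattered random voxels each SGD iteration, draw one contiguous block of linear voxel indices so GPU memory accesses stay coalesced. The chunk size, RNG choice and names are assumptions for illustration, not NiftyRegSGD code.

      // Random contiguous chunk selection (illustrative only; assumes chunk <= nVoxels).
      #include <cstddef>
      #include <random>
      #include <utility>

      // Returns the half-open index range [start, start + chunk) within the volume.
      std::pair<std::size_t, std::size_t> random_chunk(std::size_t nVoxels,
                                                       std::size_t chunk,
                                                       std::mt19937& rng) {
          std::uniform_int_distribution<std::size_t> pick(0, nVoxels - chunk);
          std::size_t start = pick(rng);
          return {start, start + chunk};
      }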

  7. Implementation of GPU accelerated SPECT reconstruction with Monte Carlo-based scatter correction.

    PubMed

    Bexelius, Tobias; Sohlberg, Antti

    2018-06-01

    Statistical SPECT reconstruction can be very time-consuming especially when compensations for collimator and detector response, attenuation, and scatter are included in the reconstruction. This work proposes an accelerated SPECT reconstruction algorithm based on graphics processing unit (GPU) processing. Ordered subset expectation maximization (OSEM) algorithm with CT-based attenuation modelling, depth-dependent Gaussian convolution-based collimator-detector response modelling, and Monte Carlo-based scatter compensation was implemented using OpenCL. The OpenCL implementation was compared against the existing multi-threaded OSEM implementation running on a central processing unit (CPU) in terms of scatter-to-primary ratios, standardized uptake values (SUVs), and processing speed using mathematical phantoms and clinical multi-bed bone SPECT/CT studies. The difference in scatter-to-primary ratios, visual appearance, and SUVs between GPU and CPU implementations was minor. On the other hand, at its best, the GPU implementation was noticed to be 24 times faster than the multi-threaded CPU version on a normal 128 × 128 matrix size 3 bed bone SPECT/CT data set when compensations for collimator and detector response, attenuation, and scatter were included. GPU SPECT reconstructions show great promise as an every day clinical reconstruction tool.
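
    For reference, the subset update at the core of OSEM (the MLEM update restricted to a subset S_b of projection bins) has the textbook form below; the notation is assumed here (a_{ij} is the system matrix element coupling voxel j to bin i, y_i the measured counts). The implementation described above additionally folds CT-based attenuation, the collimator-detector response model, and the Monte Carlo scatter estimate into the forward and back projections.

      x_j^{(k+1)} = \frac{x_j^{(k)}}{\sum_{i \in S_b} a_{ij}} \sum_{i \in S_b} a_{ij} \, \frac{y_i}{\sum_{l} a_{il} \, x_l^{(k)}}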

  8. Compiler-based code generation and autotuning for geometric multigrid on GPU-accelerated supercomputers

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Basu, Protonu; Williams, Samuel; Van Straalen, Brian

    GPUs, with their high bandwidths and computational capabilities, are an increasingly popular target for scientific computing. Unfortunately, to date, harnessing the power of the GPU has required use of a GPU-specific programming model like CUDA, OpenCL, or OpenACC. Thus, in order to deliver portability across CPU-based and GPU-accelerated supercomputers, programmers are forced to write and maintain two versions of their applications or frameworks. In this paper, we explore the use of a compiler-based autotuning framework based on CUDA-CHiLL to deliver not only portability, but also performance portability across CPU- and GPU-accelerated platforms for the geometric multigrid linear solvers found in many scientific applications. We also show that with autotuning we can attain near Roofline (a performance bound for a computation and target architecture) performance across the key operations in the miniGMG benchmark for both CPU- and GPU-based architectures as well as for multiple stencil discretizations and smoothers. We show that our technology is readily interoperable with MPI, resulting in performance at scale equal to that obtained via a hand-optimized MPI+CUDA implementation.

  9. GPU color space conversion

    NASA Astrophysics Data System (ADS)

    Chase, Patrick; Vondran, Gary

    2011-01-01

    Tetrahedral interpolation is commonly used to implement continuous color space conversions from sparse 3D and 4D lookup tables. We investigate the implementation and optimization of tetrahedral interpolation algorithms for GPUs, and compare to the best known CPU implementations as well as to a well known GPU-based trilinear implementation. We show that a $500 NVIDIA GTX-580 GPU is 3x faster than a $1000 Intel Core i7 980X CPU for 3D interpolation, and 9x faster for 4D interpolation. Performance-relevant GPU attributes are explored including thread scheduling, local memory characteristics, global memory hierarchy, and cache behaviors. We consider existing tetrahedral interpolation algorithms and tune based on the structure and branching capabilities of current GPUs. Global memory performance is improved by reordering and expanding the lookup table to ensure optimal access behaviors. Per multiprocessor local memory is exploited to implement optimally coalesced global memory accesses, and local memory addressing is optimized to minimize bank conflicts. We explore the impacts of lookup table density upon computation and memory access costs. Also presented are CPU-based 3D and 4D interpolators, using SSE vector operations that are faster than any previously published solution.
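
    For reference, the standard 3D tetrahedral interpolation step reads one cube cell of the lookup table and blends only four of its corners: the fractional coordinates are sorted to pick one of the six tetrahedra, and the weights are the gaps between the sorted fractions. The scalar sketch below uses assumed names and is not the paper's GPU or SSE code.

      // Standard 3D tetrahedral interpolation within one lattice cell (scalar sketch).
      #include <algorithm>
      #include <array>

      // c[i][j][k] holds the lattice value at the cell corner offset (i, j, k) in {0,1}.
      double tetra_interp(const double c[2][2][2], double fx, double fy, double fz) {
          struct Axis { double f; int dx, dy, dz; };
          std::array<Axis, 3> ax = {{ {fx, 1, 0, 0}, {fy, 0, 1, 0}, {fz, 0, 0, 1} }};
          // Largest fractional part first: this selects the tetrahedron.
          std::sort(ax.begin(), ax.end(),
                    [](const Axis& a, const Axis& b) { return a.f > b.f; });

          int x = 0, y = 0, z = 0;
          double result = (1.0 - ax[0].f) * c[0][0][0];
          double prev = ax[0].f;
          for (int s = 0; s < 3; ++s) {
              x += ax[s].dx; y += ax[s].dy; z += ax[s].dz;     // walk toward (1,1,1)
              double next = (s + 1 < 3) ? ax[s + 1].f : 0.0;
              result += (prev - next) * c[x][y][z];            // weight = gap between sorted fractions
              prev = next;
          }
          return result;
      }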

  10. GNAQPMS v1.1: accelerating the Global Nested Air Quality Prediction Modeling System (GNAQPMS) on Intel Xeon Phi processors

    NASA Astrophysics Data System (ADS)

    Wang, Hui; Chen, Huansheng; Wu, Qizhong; Lin, Junmin; Chen, Xueshun; Xie, Xinwei; Wang, Rongrong; Tang, Xiao; Wang, Zifa

    2017-08-01

    The Global Nested Air Quality Prediction Modeling System (GNAQPMS) is the global version of the Nested Air Quality Prediction Modeling System (NAQPMS), which is a multi-scale chemical transport model used for air quality forecast and atmospheric environmental research. In this study, we present the porting and optimisation of GNAQPMS on a second-generation Intel Xeon Phi processor, codenamed Knights Landing (KNL). Compared with the first-generation Xeon Phi coprocessor (codenamed Knights Corner, KNC), KNL has many new hardware features such as a bootable processor, high-performance in-package memory and ISA compatibility with Intel Xeon processors. In particular, we describe the five optimisations we applied to the key modules of GNAQPMS, including the CBM-Z gas-phase chemistry, advection, convection and wet deposition modules. These optimisations work well on both the KNL 7250 processor and the Intel Xeon E5-2697 V4 processor. They include (1) updating the pure Message Passing Interface (MPI) parallel mode to the hybrid parallel mode with MPI and OpenMP in the emission, advection, convection and gas-phase chemistry modules; (2) fully employing the 512 bit wide vector processing units (VPUs) on the KNL platform; (3) reducing unnecessary memory access to improve cache efficiency; (4) reducing the thread local storage (TLS) in the CBM-Z gas-phase chemistry module to improve its OpenMP performance; and (5) changing the global communication from writing/reading interface files to MPI functions to improve the performance and the parallel scalability. These optimisations greatly improved the GNAQPMS performance. The same optimisations also work well for the Intel Xeon Broadwell processor, specifically E5-2697 v4. Compared with the baseline version of GNAQPMS, the optimised version was 3.51 × faster on KNL and 2.77 × faster on the CPU. Moreover, the optimised version ran at 26 % lower average power on KNL than on the CPU. With the combined performance and energy improvement, the KNL platform was 37.5 % more efficient on power consumption compared with the CPU platform. The optimisations also enabled much further parallel scalability on both the CPU cluster and the KNL cluster scaled to 40 CPU nodes and 30 KNL nodes, with a parallel efficiency of 70.4 and 42.2 %, respectively.

  11. Performance Analysis of the NAS Y-MP Workload

    NASA Technical Reports Server (NTRS)

    Bergeron, Robert J.; Kutler, Paul (Technical Monitor)

    1997-01-01

    This paper describes the performance characteristics of the computational workloads on the NAS Cray Y-MP machines, a Y-MP 832 and later a Y-MP 8128. Hardware measurements indicated that the Y-MP workload performance matured over time, ultimately sustaining an average throughput of 0.8 GFLOPS and a vector operation fraction of 87%. The measurements also revealed an operation rate exceeding 1 per clock period, a well-balanced architecture featuring a strong utilization of vector functional units, and an efficient memory organization. Introduction of the larger memory 8128 increased throughput by allowing a more efficient utilization of CPUs. Throughput also depended on the metering of the batch queues; low-idle Saturday workloads required a buffer of small jobs to prevent memory starvation of the CPU. UNICOS required about 7% of total CPU time to service the 832 workloads; this overhead decreased to 5% for the 8128 workloads. While most of the system time went to service I/O requests, efficient scheduling prevented excessive idle due to I/O wait. System measurements disclosed no obvious bottlenecks in the response of the machine and UNICOS to the workloads. In most cases, Cray-provided software tools were quite sufficient for measuring the performance of both the machine and operating system.

  12. A fast - Monte Carlo toolkit on GPU for treatment plan dose recalculation in proton therapy

    NASA Astrophysics Data System (ADS)

    Senzacqua, M.; Schiavi, A.; Patera, V.; Pioli, S.; Battistoni, G.; Ciocca, M.; Mairani, A.; Magro, G.; Molinelli, S.

    2017-10-01

    In the context of particle therapy a crucial role is played by Treatment Planning Systems (TPSs), tools aimed at computing and optimizing the treatment plan. Nowadays one of the major issues related to TPSs in particle therapy is the large CPU time needed. We developed a software toolkit (FRED) for reducing dose recalculation time by exploiting Graphics Processing Unit (GPU) hardware. Thanks to their high parallelization capability, GPUs significantly reduce the computation time, by up to a factor of 100 with respect to standard software running on a CPU. The transport of proton beams in the patient is accurately described through Monte Carlo methods. The physical processes reproduced are: multiple Coulomb scattering, energy straggling and nuclear interactions of protons with the main nuclei composing the biological tissues. The FRED toolkit does not rely on the water-equivalent translation of tissues, but exploits the Computed Tomography anatomical information by reconstructing and simulating the atomic composition of each crossed tissue. FRED can be used as an efficient tool for dose recalculation on the day of the treatment. In fact it can provide, in about one minute on standard hardware, the dose map obtained by combining the treatment plan, earlier computed by the TPS, and the current patient anatomic arrangement.

  13. GPUs for statistical data analysis in HEP: a performance study of GooFit on GPUs vs. RooFit on CPUs

    NASA Astrophysics Data System (ADS)

    Pompili, Alexis; Di Florio, Adriano; CMS Collaboration

    2016-10-01

    In order to test the computing capabilities of GPUs with respect to traditional CPU cores a high-statistics toy Monte Carlo technique has been implemented both in ROOT/RooFit and GooFit frameworks with the purpose to estimate the statistical significance of the structure observed by CMS close to the kinematical boundary of the J/ψϕ invariant mass in the three-body decay B+ → J/ψϕK+. GooFit is a data analysis open tool under development that interfaces ROOT/RooFit to CUDA platform on nVidia GPU. The optimized GooFit application running on GPUs hosted by servers in the Bari Tier2 provides striking speed-up performances with respect to the RooFit application parallelised on multiple CPUs by means of PROOF-Lite tool. The considerable resulting speed-up, evident when comparing concurrent GooFit processes allowed by CUDA Multi Process Service and a RooFit/PROOF-Lite process with multiple CPU workers, is presented and discussed in detail. By means of GooFit it has also been possible to explore the behaviour of a likelihood ratio test statistic in different situations in which the Wilks Theorem may apply or does not apply because its regularity conditions are not satisfied.

  14. Statistical significance estimation of a signal within the GooFit framework on GPUs

    NASA Astrophysics Data System (ADS)

    Cristella, Leonardo; Di Florio, Adriano; Pompili, Alexis

    2017-03-01

    In order to test the computing capabilities of GPUs with respect to traditional CPU cores, a high-statistics toy Monte Carlo technique has been implemented both in the ROOT/RooFit and GooFit frameworks with the purpose of estimating the statistical significance of the structure observed by CMS close to the kinematical boundary of the J/ψϕ invariant mass in the three-body decay B+ → J/ψϕK+. GooFit is an open data analysis tool under development that interfaces ROOT/RooFit to the CUDA platform on NVIDIA GPUs. The optimized GooFit application running on GPUs hosted by servers in the Bari Tier2 provides striking speed-up performances with respect to the RooFit application parallelised on multiple CPUs by means of the PROOF-Lite tool. The considerable resulting speed-up, evident when comparing concurrent GooFit processes allowed by the CUDA Multi Process Service and a RooFit/PROOF-Lite process with multiple CPU workers, is presented and discussed in detail. By means of GooFit it has also been possible to explore the behaviour of a likelihood ratio test statistic in different situations in which Wilks' theorem may or may not apply because its regularity conditions are not satisfied.

  15. Performance studies of GooFit on GPUs vs RooFit on CPUs while estimating the statistical significance of a new physical signal

    NASA Astrophysics Data System (ADS)

    Di Florio, Adriano

    2017-10-01

    In order to test the computing capabilities of GPUs with respect to traditional CPU cores, a high-statistics toy Monte Carlo technique has been implemented both in the ROOT/RooFit and GooFit frameworks with the purpose of estimating the statistical significance of the structure observed by CMS close to the kinematical boundary of the J/ψϕ invariant mass in the three-body decay B+ → J/ψϕK+. GooFit is an open data analysis tool under development that interfaces ROOT/RooFit to the CUDA platform on NVIDIA GPUs. The optimized GooFit application running on GPUs hosted by servers in the Bari Tier2 provides striking speed-up performances with respect to the RooFit application parallelised on multiple CPUs by means of the PROOF-Lite tool. The considerable resulting speed-up, evident when comparing concurrent GooFit processes allowed by the CUDA Multi Process Service and a RooFit/PROOF-Lite process with multiple CPU workers, is presented and discussed in detail. By means of GooFit it has also been possible to explore the behaviour of a likelihood ratio test statistic in different situations in which Wilks' theorem may or may not apply because its regularity conditions are not satisfied.

  16. A C++11 implementation of arbitrary-rank tensors for high-performance computing

    NASA Astrophysics Data System (ADS)

    Aragón, Alejandro M.

    2014-06-01

    This article discusses an efficient implementation of tensors of arbitrary rank by using some of the idioms introduced by the recently published C++ ISO Standard (C++11). With the aim of providing a basic building block for high-performance computing, a single Array class template is carefully crafted, from which vectors, matrices, and even higher-order tensors can be created. An expression template facility is also built around the array class template to provide convenient mathematical syntax. As a result, by using templates, an extra high-level layer is added to the C++ language when dealing with algebraic objects and their operations, without compromising performance. The implementation is tested running on both CPU and GPU.

  17. A C++11 implementation of arbitrary-rank tensors for high-performance computing

    NASA Astrophysics Data System (ADS)

    Aragón, Alejandro M.

    2014-11-01

    This article discusses an efficient implementation of tensors of arbitrary rank by using some of the idioms introduced by the recently published C++ ISO Standard (C++11). With the aim of providing a basic building block for high-performance computing, a single Array class template is carefully crafted, from which vectors, matrices, and even higher-order tensors can be created. An expression template facility is also built around the array class template to provide convenient mathematical syntax. As a result, by using templates, an extra high-level layer is added to the C++ language when dealing with algebraic objects and their operations, without compromising performance. The implementation is tested running on both CPU and GPU.

  18. Encryption and decryption algorithm using algebraic matrix approach

    NASA Astrophysics Data System (ADS)

    Thiagarajan, K.; Balasubramanian, P.; Nagaraj, J.; Padmashree, J.

    2018-04-01

    Cryptographic algorithms provide security of data against attacks during encryption and decryption. However, they are computationally intensive processes that consume a large amount of CPU time and space during encryption and decryption. The goal of this paper is to study an encryption and decryption algorithm and to find the space complexity of the encrypted and decrypted data produced by the algorithm. In this paper, we encrypt and decrypt the message using a key with the help of a cyclic square matrix; the approach is applicable to any number of words, including texts with many characters and long words. We also discuss the time complexity of the algorithm. The proposed algorithm is simple, yet the process is difficult to break.
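
    The abstract does not give the exact construction, so the sketch below only illustrates the general idea of encrypting fixed-size blocks of character codes with an invertible cyclic (circulant permutation) square matrix and decrypting with its inverse. The block size and the plain character encoding are assumptions chosen for illustration, not the authors' scheme.

```python
import numpy as np

N = 5  # assumed block size

# Cyclic permutation matrix: multiplying a block by it rotates the characters.
C = np.roll(np.eye(N, dtype=int), 1, axis=1)
C_inv = C.T  # the inverse of a permutation matrix is its transpose

def to_blocks(text):
    codes = [ord(ch) for ch in text]
    codes += [32] * (-len(codes) % N)       # pad with spaces to a multiple of N
    return np.array(codes).reshape(-1, N)

def encrypt(text):
    return to_blocks(text) @ C              # each row (block) is cyclically shifted

def decrypt(blocks):
    codes = (blocks @ C_inv).ravel()
    return "".join(chr(c) for c in codes).rstrip()

cipher = encrypt("CPU TIME AND SPACE")
print(decrypt(cipher))                      # -> "CPU TIME AND SPACE"
```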

  19. Parallel k-means++ for Multiple Shared-Memory Architectures

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Mackey, Patrick S.; Lewis, Robert R.

    2016-09-22

    In recent years k-means++ has become a popular initialization technique for improved k-means clustering. To date, most of the work done to improve its performance has involved parallelizing algorithms that are only approximations of k-means++. In this paper we present a parallelization of the exact k-means++ algorithm, with a proof of its correctness. We develop implementations for three distinct shared-memory architectures: multicore CPU, high performance GPU, and the massively multithreaded Cray XMT platform. We demonstrate the scalability of the algorithm on each platform. In addition we present a visual approach for showing which platform performed k-means++ the fastest for varying data sizes.
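
    For orientation, here is a minimal single-node sketch of the exact k-means++ seeding step. The inner distance update is the part that a parallel implementation distributes across CPU cores, GPU threads, or XMT threads; it appears here as one vectorized NumPy pass over all points. This is a plain re-statement of the standard algorithm, not the authors' implementation.

```python
import numpy as np

def kmeans_pp_init(points, k, rng=None):
    """Exact k-means++ seeding: each new center is drawn with probability
    proportional to the squared distance to the nearest center chosen so far."""
    rng = np.random.default_rng(rng)
    n = points.shape[0]
    centers = [points[rng.integers(n)]]
    # d2[i] = squared distance from point i to its nearest chosen center
    d2 = np.sum((points - centers[0]) ** 2, axis=1)
    for _ in range(1, k):
        probs = d2 / d2.sum()
        idx = rng.choice(n, p=probs)
        centers.append(points[idx])
        # Data-parallel step: update every point's distance in one pass.
        d2 = np.minimum(d2, np.sum((points - points[idx]) ** 2, axis=1))
    return np.array(centers)

data = np.random.default_rng(0).normal(size=(10_000, 2))
seeds = kmeans_pp_init(data, k=8, rng=1)
```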

  20. Comparison of computation time and image quality between full-parallax 4G-pixels CGHs calculated by the point cloud and polygon-based method

    NASA Astrophysics Data System (ADS)

    Nakatsuji, Noriaki; Matsushima, Kyoji

    2017-03-01

    Full-parallax high-definition CGHs composed of more than a billion pixels have so far been created only by the polygon-based method because of its high performance. However, GPUs now allow us to generate CGHs much faster with the point-cloud method. In this paper, we measure the computation time of object fields for full-parallax high-definition CGHs, which are composed of 4 billion pixels and reconstruct the same scene, by using the point-cloud method on a GPU and the polygon-based method on a CPU. In addition, we compare the optical and simulated reconstructions of CGHs created by these techniques to verify the image quality.

  1. Intensity-based segmentation and visualization of cells in 3D microscopic images using the GPU

    NASA Astrophysics Data System (ADS)

    Kang, Mi-Sun; Lee, Jeong-Eom; Jeon, Woong-ki; Choi, Heung-Kook; Kim, Myoung-Hee

    2013-02-01

    3D microscopy images contain an astronomical amount of data, rendering 3D microscopy image processing time-consuming and laborious on a central processing unit (CPU). To solve these problems, many people crop a region of interest (ROI) of the input image to a small size. Although this reduces cost and time, there are drawbacks at the image processing level, e.g., the selected ROI strongly depends on the user and there is a loss of original image information. To mitigate these problems, we developed a 3D microscopy image processing tool on a graphics processing unit (GPU). Our tool provides efficient and varied automatic thresholding methods to achieve intensity-based segmentation of 3D microscopy images. Users can select the algorithm to be applied. Further, the image processing tool provides visualization of segmented volume data and can set the scale, translation, and other view parameters using a keyboard and mouse. However, the rapidly visualized 3D objects still need to be analyzed to obtain information useful to biologists. To analyze 3D microscopic images, we need quantitative data of the images. Therefore, we label the segmented 3D objects within all 3D microscopic images and obtain quantitative information on each labeled object. This information can be used as a classification feature. A user can select the object to be analyzed. Our tool allows the selected object to be displayed in a new window, and hence, more details of the object can be observed. Finally, we validate the effectiveness of our tool by comparing CPU and GPU processing times under matched specifications and configurations.

  2. The Transition to a Many-core World

    NASA Astrophysics Data System (ADS)

    Mattson, T. G.

    2012-12-01

    The need to increase performance within a fixed energy budget has pushed the computer industry to many core processors. This is grounded in the physics of computing and is not a trend that will just go away. It is hard to overestimate the profound impact of many-core processors on software developers. Virtually every facet of the software development process will need to change to adapt to these new processors. In this talk, we will look at many-core hardware and consider its evolution from a perspective grounded in the CPU. We will show that the number of cores will inevitably increase, but in addition, a quest to maximize performance per watt will push these cores to be heterogeneous. We will show that the inevitable result of these changes is a computing landscape where the distinction between the CPU and the GPU is blurred. We will then consider the much more pressing problem of software in a many core world. Writing software for heterogeneous many core processors is well beyond the ability of current programmers. One solution is to support a software development process where programmer teams are split into two distinct groups: a large group of domain-expert productivity programmers and much smaller team of computer-scientist efficiency programmers. The productivity programmers work in terms of high level frameworks to express the concurrency in their problems while avoiding any details for how that concurrency is exploited. The second group, the efficiency programmers, map applications expressed in terms of these frameworks onto the target many-core system. In other words, we can solve the many-core software problem by creating a software infrastructure that only requires a small subset of programmers to become master parallel programmers. This is different from the discredited dream of automatic parallelism. Note that productivity programmers still need to define the architecture of their software in a way that exposes the concurrency inherent in their problem. We submit that domain-expert programmers understand "what is concurrent". The parallel programming problem emerges from the complexity of "how that concurrency is utilized" on real hardware. The research described in this talk was carried out in collaboration with the ParLab at UC Berkeley. We use a design pattern language to define the high level frameworks exposed to domain-expert, productivity programmers. We then use tools from the SEJITS project (Selective Embedded Just-In-Time Specializers) to build the software transformation tool chains that turn these framework-oriented designs into highly efficient code. The final ingredient is a software platform to serve as a target for these tools. One such platform is the OpenCL industry standard for programming heterogeneous systems. We will briefly describe OpenCL and show how it provides a vendor-neutral software target for current and future many core systems: CPU-based, GPU-based, and heterogeneous combinations of the two.

  3. An evaluation of superminicomputers for thermal analysis

    NASA Technical Reports Server (NTRS)

    Storaasli, O. O.; Vidal, J. B.; Jones, G. K.

    1962-01-01

    The feasibility and cost effectiveness of solving thermal analysis problems on superminicomputers is demonstrated. Conventional thermal analysis and the changing computer environment, computer hardware and software used, six thermal analysis test problems, performance of superminicomputers (CPU time, accuracy, turnaround, and cost) and comparison with large computers are considered. Although the CPU times for superminicomputers were 15 to 30 times greater than the fastest mainframe computer, the minimum cost to obtain the solutions on superminicomputers was from 11 percent to 59 percent of the cost of mainframe solutions. The turnaround (elapsed) time is highly dependent on the computer load, but for large problems, superminicomputers produced results in less elapsed time than a typically loaded mainframe computer.

  4. A new nonlinear conjugate gradient coefficient under strong Wolfe-Powell line search

    NASA Astrophysics Data System (ADS)

    Mohamed, Nur Syarafina; Mamat, Mustafa; Rivaie, Mohd

    2017-08-01

    A nonlinear conjugate gradient (CG) method plays an important role in solving large-scale unconstrained optimization problems. This method is widely used due to its simplicity, and it is known to possess the sufficient descent condition and global convergence properties. In this paper, a new nonlinear CG coefficient βk is presented, employing the strong Wolfe-Powell inexact line search. The performance of the new βk is tested based on the number of iterations and central processing unit (CPU) time by using MATLAB software with an Intel Core i7-3470 CPU processor. Numerical experimental results show that the new βk converges rapidly compared to other classical CG methods.
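
    The new βk itself is not reproduced in the abstract, so the sketch below shows only the surrounding machinery: a nonlinear CG iteration driven by the classical Fletcher-Reeves coefficient and a simplified strong Wolfe-Powell acceptance test, with the iteration count and CPU time recorded as in the paper's comparison. The line-search constants (c1, c2), the backtracking scheme, and the Rosenbrock test function are standard textbook choices assumed here, not the authors' setup.

```python
import time
import numpy as np

def strong_wolfe(f, grad, x, d, c1=1e-4, c2=0.1, alpha=1.0, shrink=0.5, max_iter=30):
    """Simplified backtracking that accepts the first step satisfying both
    strong Wolfe-Powell conditions (a production code would bracket and zoom)."""
    f0, g0 = f(x), grad(x) @ d
    for _ in range(max_iter):
        x_new = x + alpha * d
        if (f(x_new) <= f0 + c1 * alpha * g0 and
                abs(grad(x_new) @ d) <= c2 * abs(g0)):
            return alpha
        alpha *= shrink
    return alpha  # fall back to the last trial step

def cg_fletcher_reeves(f, grad, x0, tol=1e-6, max_iter=500):
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    d = -g
    t0 = time.perf_counter()
    for k in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        alpha = strong_wolfe(f, grad, x, d)
        x = x + alpha * d
        g_new = grad(x)
        beta = (g_new @ g_new) / (g @ g)    # classical Fletcher-Reeves coefficient
        d = -g_new + beta * d
        if g_new @ d >= 0:                  # safeguard: restart if not a descent direction
            d = -g_new
        g = g_new
    return x, k, time.perf_counter() - t0

# Example: the 2-D Rosenbrock function
f = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
grad = lambda x: np.array([-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
                           200 * (x[1] - x[0]**2)])
x_star, iters, cpu_s = cg_fletcher_reeves(f, grad, [-1.2, 1.0])
print(iters, cpu_s, x_star)
```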

  5. Hypermatrix scheme for finite element systems on CDC STAR-100 computer

    NASA Technical Reports Server (NTRS)

    Noor, A. K.; Voigt, S. J.

    1975-01-01

    A study is made of the adaptation of the hypermatrix (block matrix) scheme for solving large systems of finite element equations to the CDC STAR-100 computer. Discussion is focused on the organization of the hypermatrix computation using Cholesky decomposition and the mode of storage of the different submatrices to take advantage of the STAR pipeline (streaming) capability. Consideration is also given to the associated data handling problems and the means of balancing the I/O and CPU times in the solution process. Numerical examples are presented showing the anticipated gain in CPU speed over the CDC 6600 to be obtained by using the proposed algorithms on the STAR computer.

  6. Adaptive real-time methodology for optimizing energy-efficient computing

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Hsu, Chung-Hsing; Feng, Wu-Chun

    Dynamic voltage and frequency scaling (DVFS) is an effective way to reduce energy and power consumption in microprocessor units. Current implementations of DVFS suffer from inaccurate modeling of power requirements and usage, and from inaccurate characterization of the relationships between the applicable variables. A system and method are proposed that adjust CPU frequency and voltage based on run-time calculations of the workload processing time, as well as a calculation of performance sensitivity with respect to CPU frequency. The system and method are processor independent, and can be applied either to an entire system as a unit, or individually to each process running on a system.
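
    A toy control loop in the spirit of the description above: estimate how sensitive the measured processing time is to CPU frequency, then pick the lowest frequency whose predicted slowdown stays within a bound. The frequency list, the slowdown bound, the sensitivity value, and set_cpu_frequency() are hypothetical placeholders; a real DVFS implementation would act through an OS governor interface rather than this sketch.

```python
FREQS_GHZ = [1.2, 1.6, 2.0, 2.4, 2.8]   # assumed available P-states
MAX_SLOWDOWN = 1.05                     # tolerate at most 5% longer processing time

def set_cpu_frequency(f_ghz):
    """Hypothetical actuator standing in for a real frequency-scaling interface."""
    print(f"requesting {f_ghz} GHz")

def choose_frequency(t_measured, f_current, sensitivity):
    """Pick the lowest frequency whose predicted runtime stays within MAX_SLOWDOWN
    of the predicted runtime at the highest frequency.
    sensitivity ~ fraction of runtime that scales with 1/frequency
    (0 = memory/I/O bound, 1 = fully CPU bound)."""
    def predict(f):
        return t_measured * ((1.0 - sensitivity) + sensitivity * f_current / f)
    budget = MAX_SLOWDOWN * predict(max(FREQS_GHZ))
    for f in sorted(FREQS_GHZ):          # lowest (most energy-efficient) first
        if predict(f) <= budget:
            return f
    return max(FREQS_GHZ)

f_next = choose_frequency(t_measured=10.0, f_current=2.8, sensitivity=0.3)
set_cpu_frequency(f_next)
```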

  7. Convolution of large 3D images on GPU and its decomposition

    NASA Astrophysics Data System (ADS)

    Karas, Pavel; Svoboda, David

    2011-12-01

    In this article, we propose a method for computing the convolution of large 3D images. The convolution is performed in the frequency domain using the convolution theorem. The algorithm is accelerated on a graphics card by means of the CUDA parallel computing model. The convolution is decomposed in the frequency domain using the decimation-in-frequency algorithm. We pay attention to keeping our approach efficient in terms of both time and memory consumption, and also in terms of memory transfers between CPU and GPU, which have a significant influence on the overall computational time. We also study the implementation on multiple GPUs and compare the results between the multi-GPU and multi-CPU implementations.
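
    As a compact reference for what the GPU pipeline computes, the snippet below performs 3-D convolution via the convolution theorem: transform both arrays, multiply pointwise, and transform back. This NumPy version runs on the CPU and deliberately ignores the tiling/decimation-in-frequency decomposition and CPU-GPU transfers that the paper is actually about; the array sizes are arbitrary.

```python
import numpy as np

def fft_convolve3d(image, kernel):
    """Circular 3-D convolution via the convolution theorem.
    For linear convolution, both arrays would first be zero-padded to
    shape image.shape + kernel.shape - 1."""
    shape = image.shape
    K = np.fft.rfftn(kernel, s=shape)   # kernel is implicitly zero-padded to `shape`
    I = np.fft.rfftn(image, s=shape)
    return np.fft.irfftn(I * K, s=shape)

rng = np.random.default_rng(0)
vol = rng.random((64, 64, 64))
psf = np.zeros((5, 5, 5))
psf[2, 2, 2] = 1.0                      # delta kernel for a quick sanity check
out = fft_convolve3d(vol, psf)
# A delta at index (2, 2, 2) circularly shifts the volume by that offset.
assert np.allclose(np.roll(vol, (2, 2, 2), axis=(0, 1, 2)), out)
```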

  8. Real time display Fourier-domain OCT using multi-thread parallel computing with data vectorization

    NASA Astrophysics Data System (ADS)

    Eom, Tae Joong; Kim, Hoon Seop; Kim, Chul Min; Lee, Yeung Lak; Choi, Eun-Seo

    2011-03-01

    We demonstrate a real-time display of processed OCT images using multi-thread parallel computing with a quad-core CPU of a personal computer. The data of each A-line are treated as one vector to maximize the data transfer rate between the CPU cores and the image data stored in RAM. A display rate of 29.9 frames/sec for processed OCT data (4096 FFT size x 500 A-scans) is achieved in our system using a wavelength-swept source with a 52-kHz sweep frequency. The data processing times of the OCT image and a Doppler OCT image with a 4-time average are 23.8 msec and 91.4 msec, respectively.

  9. Graphics Processing Unit–Enhanced Genetic Algorithms for Solving the Temporal Dynamics of Gene Regulatory Networks

    PubMed Central

    García-Calvo, Raúl; Guisado, JL; Diaz-del-Rio, Fernando; Córdoba, Antonio; Jiménez-Morales, Francisco

    2018-01-01

    Understanding the regulation of gene expression is one of the key problems in current biology. A promising method for that purpose is the determination of the temporal dynamics between known initial and ending network states, by using simple acting rules. The huge amount of rule combinations and the nonlinear inherent nature of the problem make genetic algorithms an excellent candidate for finding optimal solutions. As this is a computationally intensive problem that needs long runtimes in conventional architectures for realistic network sizes, it is fundamental to accelerate this task. In this article, we study how to develop efficient parallel implementations of this method for the fine-grained parallel architecture of graphics processing units (GPUs) using the compute unified device architecture (CUDA) platform. An exhaustive and methodical study of various parallel genetic algorithm schemes—master-slave, island, cellular, and hybrid models, and various individual selection methods (roulette, elitist)—is carried out for this problem. Several procedures that optimize the use of the GPU’s resources are presented. We conclude that the implementation that produces better results (both from the performance and the genetic algorithm fitness perspectives) is simulating a few thousands of individuals grouped in a few islands using elitist selection. This model comprises 2 mighty factors for discovering the best solutions: finding good individuals in a short number of generations, and introducing genetic diversity via a relatively frequent and numerous migration. As a result, we have even found the optimal solution for the analyzed gene regulatory network (GRN). In addition, a comparative study of the performance obtained by the different parallel implementations on GPU versus a sequential application on CPU is carried out. In our tests, a multifold speedup was obtained for our optimized parallel implementation of the method on medium class GPU over an equivalent sequential single-core implementation running on a recent Intel i7 CPU. This work can provide useful guidance to researchers in biology, medicine, or bioinformatics in how to take advantage of the parallelization on massively parallel devices and GPUs to apply novel metaheuristic algorithms powered by nature for real-world applications (like the method to solve the temporal dynamics of GRNs). PMID:29662297

  10. A Joint Method of Envelope Inversion Combined with Hybrid-domain Full Waveform Inversion

    NASA Astrophysics Data System (ADS)

    CUI, C.; Hou, W.

    2017-12-01

    Full waveform inversion (FWI) aims to construct high-precision subsurface models by fully using the information in seismic records, including amplitude, travel time, phase and so on. However, high non-linearity and the absence of low-frequency information in seismic data lead to the well-known cycle-skipping problem and make the inversion fall easily into local minima. In addition, 3D inversion methods based on the acoustic approximation ignore elastic effects in the real seismic wavefield, making inversion harder. As a result, the accuracy of the final inversion result relies heavily on the quality of the initial model. In order to improve the stability and quality of inversion results, multi-scale inversion, which reconstructs the subsurface model from low to high frequencies, is applied. However, the absence of very low frequencies (< 3 Hz) in field data remains a bottleneck in FWI. By extracting ultra-low-frequency data from field data, envelope inversion is able to recover a low-wavenumber model with a demodulation operator (envelope operator), even though such low-frequency data do not really exist in the field data. To improve the efficiency and viability of the inversion, in this study we propose a joint method of envelope inversion combined with hybrid-domain FWI. First, we developed 3D elastic envelope inversion, and the misfit function and the corresponding gradient operator were derived. Then we performed hybrid-domain FWI with the envelope inversion result as the initial model, which provides the low-wavenumber component of the model. Here, forward modeling is implemented in the time domain and inversion in the frequency domain. To accelerate the inversion, we adopt CPU/GPU heterogeneous computing techniques with two levels of parallelism. In the first level, the inversion tasks are decomposed and assigned to each computation node by shot number. In the second level, GPU multithreaded programming is used for the computation tasks in each node, including forward modeling, envelope extraction, DFT (discrete Fourier transform) calculation and gradient calculation. Numerical tests demonstrated that the combined envelope inversion + hybrid-domain FWI obtains a much more faithful and accurate result than conventional hybrid-domain FWI, and the CPU/GPU heterogeneous parallel computation substantially improves performance.
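
    For readers unfamiliar with the demodulation step, the envelope operator applied to each trace is typically the magnitude of the analytic signal; a one-line SciPy version is shown below on a synthetic trace. The synthetic wavelet and sampling parameters are illustrative assumptions, and the real workflow would apply this per trace inside the GPU kernels described above.

```python
import numpy as np
from scipy.signal import hilbert

dt, nt = 0.002, 1000                       # assumed 2 ms sampling, 2 s trace
t = np.arange(nt) * dt
# Synthetic band-limited trace: a 30 Hz oscillation under a Gaussian envelope
trace = np.exp(-((t - 1.0) / 0.1) ** 2) * np.cos(2 * np.pi * 30 * t)

envelope = np.abs(hilbert(trace))          # demodulation: removes the 30 Hz carrier

# The envelope varies far more slowly than the trace itself, which is why it
# supplies the low-wavenumber information that envelope inversion exploits.
print(envelope.max(), envelope.argmax() * dt)
```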

  11. Graphics Processing Unit-Enhanced Genetic Algorithms for Solving the Temporal Dynamics of Gene Regulatory Networks.

    PubMed

    García-Calvo, Raúl; Guisado, J L; Diaz-Del-Rio, Fernando; Córdoba, Antonio; Jiménez-Morales, Francisco

    2018-01-01

    Understanding the regulation of gene expression is one of the key problems in current biology. A promising method for that purpose is the determination of the temporal dynamics between known initial and ending network states, by using simple acting rules. The huge amount of rule combinations and the nonlinear inherent nature of the problem make genetic algorithms an excellent candidate for finding optimal solutions. As this is a computationally intensive problem that needs long runtimes in conventional architectures for realistic network sizes, it is fundamental to accelerate this task. In this article, we study how to develop efficient parallel implementations of this method for the fine-grained parallel architecture of graphics processing units (GPUs) using the compute unified device architecture (CUDA) platform. An exhaustive and methodical study of various parallel genetic algorithm schemes-master-slave, island, cellular, and hybrid models, and various individual selection methods (roulette, elitist)-is carried out for this problem. Several procedures that optimize the use of the GPU's resources are presented. We conclude that the implementation that produces better results (both from the performance and the genetic algorithm fitness perspectives) is simulating a few thousands of individuals grouped in a few islands using elitist selection. This model comprises 2 mighty factors for discovering the best solutions: finding good individuals in a short number of generations, and introducing genetic diversity via a relatively frequent and numerous migration. As a result, we have even found the optimal solution for the analyzed gene regulatory network (GRN). In addition, a comparative study of the performance obtained by the different parallel implementations on GPU versus a sequential application on CPU is carried out. In our tests, a multifold speedup was obtained for our optimized parallel implementation of the method on medium class GPU over an equivalent sequential single-core implementation running on a recent Intel i7 CPU. This work can provide useful guidance to researchers in biology, medicine, or bioinformatics in how to take advantage of the parallelization on massively parallel devices and GPUs to apply novel metaheuristic algorithms powered by nature for real-world applications (like the method to solve the temporal dynamics of GRNs).

  12. Acoustic reverse-time migration using GPU card and POSIX thread based on the adaptive optimal finite-difference scheme and the hybrid absorbing boundary condition

    NASA Astrophysics Data System (ADS)

    Cai, Xiaohui; Liu, Yang; Ren, Zhiming

    2018-06-01

    Reverse-time migration (RTM) is a powerful tool for imaging geologically complex structures such as steep-dip and subsalt structures. However, its implementation is quite computationally expensive. Recently, as a low-cost solution, the graphics processing unit (GPU) was introduced to improve the efficiency of RTM. In this paper, we develop three ameliorative strategies to implement RTM on GPU cards. First, given the high accuracy and efficiency of the adaptive optimal finite-difference (FD) method based on least squares (LS) on the central processing unit (CPU), we study the optimal LS-based FD method on the GPU. Second, we extend the CPU-based hybrid absorbing boundary condition (ABC) to a GPU-based one by addressing two issues that arise when the former is introduced to the GPU card: excessive run time and chaotic threads. Third, for large-scale data, a combined strategy of optimal checkpointing and efficient boundary storage is introduced to trade off memory against recomputation. To save the time of communication between host and disk, portable operating system interface (POSIX) threads are utilized to employ another CPU core at the checkpoints. Applications of the three strategies on a GPU with the compute unified device architecture (CUDA) programming language in RTM demonstrate their efficiency and validity.

  13. Fast CPU-based Monte Carlo simulation for radiotherapy dose calculation.

    PubMed

    Ziegenhein, Peter; Pirner, Sven; Ph Kamerling, Cornelis; Oelfke, Uwe

    2015-08-07

    Monte Carlo (MC) simulations are considered to be the most accurate method for calculating dose distributions in radiotherapy. Their clinical application, however, is still limited by the long runtimes conventional implementations of MC algorithms require to deliver sufficiently accurate results on high-resolution imaging data. In order to overcome this obstacle we developed the software package PhiMC, which is capable of computing precise dose distributions in a sub-minute time frame by leveraging the potential of modern many- and multi-core CPU-based computers. PhiMC is based on the well-verified dose planning method (DPM). We could demonstrate that PhiMC delivers dose distributions which are in excellent agreement with DPM. The multi-core implementation of PhiMC scales well between different computer architectures and achieves a speed-up of up to 37× compared to the original DPM code executed on a modern system. Furthermore, we could show that our CPU-based implementation on a modern workstation is between 1.25× and 1.95× faster than a well-known GPU implementation of the same simulation method on an NVIDIA Tesla C2050. Since CPUs can work with several hundred GB of RAM, the typical GPU memory limitation does not apply to our implementation, and high-resolution clinical plans can be calculated.

  14. Derivative free Davidon-Fletcher-Powell (DFP) for solving symmetric systems of nonlinear equations

    NASA Astrophysics Data System (ADS)

    Mamat, M.; Dauda, M. K.; Mohamed, M. A. bin; Waziri, M. Y.; Mohamad, F. S.; Abdullah, H.

    2018-03-01

    Research problems arising from the work of engineers, economists, modellers, industry, computing, and scientists are mostly nonlinear in nature, and numerical solutions to such systems are widely applied in those areas of mathematics. Over the years there has been significant theoretical study to develop methods for solving such systems; despite these efforts, the methods developed still have deficiencies. As a contribution to solving systems of the form F(x) = 0, x ∈ Rn, a derivative-free method via the classical Davidon-Fletcher-Powell (DFP) update is presented. This is achieved by simply approximating the inverse Hessian matrix Q_{k+1}^{-1} by θ_k I. The modified method satisfies the descent condition and possesses local superlinear convergence properties. Interestingly, without computing any derivative, the proposed method never failed to converge throughout the numerical experiments. The output is based on the number of iterations and CPU time; different initial starting points were used to solve 40 benchmark test problems. With the aid of the squared-norm merit function and a derivative-free line search technique, the approach yields a method for solving symmetric systems of nonlinear equations that is capable of significantly reducing the CPU time and number of iterations compared to its counterparts. A comparison between the proposed method and the classical DFP update was made, and the proposed method was found to be the top performer, outperforming the existing method in almost all cases. In terms of number of iterations, out of the 40 problems solved, the proposed method solved 38 successfully (95%) while the classical DFP solved 2 problems (5%). In terms of CPU time, the proposed method solved 29 of the 40 problems (72.5%) successfully, whereas the classical DFP solved 11 (27.5%). The method is valid in terms of derivation, reliable in terms of number of iterations, and accurate in terms of CPU time; thus it is suitable and achieves the objective.
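
    Since the paper's exact update is only summarized here, the sketch below shows the general shape of such a scheme: approximate the inverse matrix by a scalar multiple of the identity, take the step d_k = -θ_k F(x_k), and accept it with a derivative-free backtracking test on the squared-norm merit function ||F(x)||². The Barzilai-Borwein-style choice of θ_k, the test constants, and the example system are assumptions for illustration, not the authors' formula.

```python
import numpy as np

def solve_symmetric_system(F, x0, tol=1e-8, max_iter=500):
    """Derivative-free iteration for F(x) = 0: the inverse matrix is approximated
    by theta_k * I, and steps are accepted via a backtracking test on the
    squared-norm merit function ||F(x)||^2."""
    x = np.asarray(x0, dtype=float)
    Fx = F(x)
    theta = 1.0
    for k in range(max_iter):
        if np.linalg.norm(Fx) < tol:
            break
        d = -theta * Fx
        alpha = 1.0
        # Derivative-free line search: shrink until the merit function decreases enough.
        while np.sum(F(x + alpha * d) ** 2) > (1.0 - 1e-4 * alpha) * np.sum(Fx ** 2):
            alpha *= 0.5
            if alpha < 1e-12:
                break
        s = alpha * d
        x_new = x + s
        F_new = F(x_new)
        y = F_new - Fx
        # Barzilai-Borwein-style scalar (an assumed choice, not the paper's theta_k)
        if y @ y > 1e-16:
            theta = abs(s @ y) / (y @ y)
        x, Fx = x_new, F_new
    return x, k

# Example: a small system with a symmetric Jacobian
F = lambda x: np.array([x[0] + x[1] + x[0]**3 - 3.0,
                        x[0] + 2.0 * x[1] - 2.0])
root, iters = solve_symmetric_system(F, [2.0, 2.0])
print(root, iters, F(root))
```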

  15. Aripiprazole Increases the PKA Signalling and Expression of the GABAA Receptor and CREB1 in the Nucleus Accumbens of Rats.

    PubMed

    Pan, Bo; Lian, Jiamei; Huang, Xu-Feng; Deng, Chao

    2016-05-01

    The GABAA receptor is implicated in the pathophysiology of schizophrenia and regulated by PKA signalling. Current antipsychotics bind with D2-like receptors, but not the GABAA receptor. The cAMP-responsive element-binding protein 1 (CREB1) is also associated with PKA signalling and may be related to the positive symptoms of schizophrenia. This study investigated the effects of antipsychotics in modulating D2-mediated PKA signalling and its downstream GABAA receptors and CREB1. Rats were treated orally with aripiprazole (0.75 mg/kg, ter in die (t.i.d.)), bifeprunox (0.8 mg/kg, t.i.d.), haloperidol (0.1 mg/kg, t.i.d.) or vehicle for 1 week. The levels of PKA-Cα and p-PKA in the prefrontal cortex (PFC), nucleus accumbens (NAc) and caudate putamen (CPu) were detected by Western blots. The mRNA levels of Gabrb1, Gabrb2, Gabrb3 and Creb1, and their protein expression were measured by qRT-PCR and Western blots, respectively. Aripiprazole elevated the levels of p-PKA and the ratio of p-PKA/PKA in the NAc, but not the PFC and CPu. Correlated with this elevated PKA signalling, aripiprazole elevated the mRNA and protein expression of the GABAA (β-1) receptor and CREB1 in the NAc. While haloperidol elevated the levels of p-PKA and the ratio of p-PKA/PKA in both NAc and CPu, it only tended to increase the expression of the GABAA (β-1) receptor and CREB1 in the NAc, but not the CPu. Bifeprunox had no effects on PKA signalling in these brain regions. These results suggest that aripiprazole has selective effects on upregulating the GABAA (β-1) receptor and CREB1 in the NAc, probably via activating PKA signalling.

  16. Widespread Increases in Malondialdehyde Immunoreactivity in Dopamine-Rich and Dopamine-Poor Regions of Rat Brain Following Multiple, High Doses of Methamphetamine

    PubMed Central

    Horner, Kristen A.; Gilbert, Yamiece E.; Cline, Susan D.

    2011-01-01

    Treatment with multiple high doses of methamphetamine (METH) can induce oxidative damage, including dopamine (DA)-mediated reactive oxygen species (ROS) formation, which may contribute to the neurotoxic damage of monoamine neurons and long-term depletion of DA in the caudate putamen (CPu) and substantia nigra pars compacta (SNpc). Malondialdehyde (MDA), a product of lipid peroxidation by ROS, is commonly used as a marker of oxidative damage and treatment with multiple high doses of METH increases MDA reactivity in the CPu of humans and experimental animals. Recent data indicate that MDA itself may contribute to the destruction of DA neurons, as MDA causes the accumulation of toxic intermediates of DA metabolism via its chemical modification of the enzymes necessary for the breakdown of DA. However, it has been shown that in human METH abusers there is also increased MDA reactivity in the frontal cortex, which receives relatively fewer DA afferents than the CPu. These data suggest that METH may induce neuronal damage regardless of the regional density of DA or origin of DA input. The goal of the current study was to examine the modification of proteins by MDA in the DA-rich nigrostriatal and mesoaccumbal systems, as well as the less DA-dense cortex and hippocampus following a neurotoxic regimen of METH treatment. Animals were treated with METH (10 mg/kg) every 2 h for 6 h, sacrificed 1 week later, and examined using immunocytochemistry for changes in MDA-adducted proteins. Multiple, high doses of METH significantly increased MDA immunoreactivity (MDA-ir) in the CPu, SNpc, cortex, and hippocampus. Multiple METH administration also increased MDA-ir in the ventral tegmental area and nucleus accumbens. Our data indicate that multiple METH treatment can induce persistent and widespread neuronal damage that may not necessarily be limited to the nigrostriatal DA system. PMID:21602916

  17. Short Message Service (SMS) Command and Control (C2) Awareness in Android-based Smartphones Using Kernel-Level Auditing

    DTIC Science & Technology

    2012-06-14

    Display: 480 x 800 pixels (3.7 inches); CPU: Qualcomm QSD8250 1 GHz; Memory (internal): 512 MB RAM / 512 MB ROM; Kernel version: 2.6.35.7-ge0fb012. Figure 3.5: HTC ... development and writing). The MSM kernel provided by the AOSP and compatible with the HTC Nexus One's motherboard and Qualcomm chipset is used for this ... building the kernel is having the prebuilt toolchains and the right kernel for the hardware. Many HTC products use Qualcomm processors which use the

  18. The interactive astronomical data analysis facility - image enhancement techniques to Comet Halley

    NASA Astrophysics Data System (ADS)

    Klinglesmith, D. A.

    1981-10-01

    A PDP 11/40 computer is at the heart of a general-purpose interactive data analysis facility designed to permit easy access to data in both visual imagery and graphic representations. The major components consist of: the 11/40 CPU and 256 K bytes of 16-bit memory; two TU10 tape drives; 20 million bytes of disk storage; three user terminals; and the COMTAL image processing display system. The application of image enhancement techniques to two sequences of photographs of Comet Halley taken in Egypt in 1910 provides evidence for eruptions from the comet's nucleus.

  19. A Big RISC

    DTIC Science & Technology

    1983-07-18

    architecture. Design, performance, and cost of BRISC are presented. Performance is shown to be better than high-end mainframes such as the IBM 3081 and Amdahl 470V/8 on integer benchmarks written in C, Pascal and LISP. The cost, conservatively estimated to be $132,400, is about the same as a high-end minicomputer such as the VAX-11/780. BRISC has a CPU cycle time of 46 ns, providing a RISC I instruction execution rate of greater than 15 MIPS. BRISC is designed with a Structured Computer Aided Logic Design System (SCALD) by Valid Logic Systems. An evaluation of the utility of

  20. SU-G-TeP1-15: Toward a Novel GPU Accelerated Deterministic Solution to the Linear Boltzmann Transport Equation

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Yang, R; Fallone, B; Cross Cancer Institute, Edmonton, AB

    Purpose: To develop a Graphic Processor Unit (GPU) accelerated deterministic solution to the Linear Boltzmann Transport Equation (LBTE) for accurate dose calculations in radiotherapy (RT). A deterministic solution yields the potential for major speed improvements due to the sparse matrix-vector and vector-vector multiplications and would thus be of benefit to RT. Methods: In order to leverage the massively parallel architecture of GPUs, the first order LBTE was reformulated as a second order self-adjoint equation using the Least Squares Finite Element Method (LSFEM). This produces a symmetric positive-definite matrix which is efficiently solved using a parallelized conjugate gradient (CG) solver. The LSFEM formalism is applied in space, discrete ordinates is applied in angle, and the Multigroup method is applied in energy. The final linear system of equations produced is tightly coupled in space and angle. Our code written in CUDA-C was benchmarked on an Nvidia GeForce TITAN-X GPU against an Intel i7-6700K CPU. A spatial mesh of 30,950 tetrahedral elements was used with an S4 angular approximation. Results: To avoid repeating a full computationally intensive finite element matrix assembly at each Multigroup energy, a novel mapping algorithm was developed which minimized the operations required at each energy. Additionally, a parallelized memory mapping for the kronecker product between the sparse spatial and angular matrices, including Dirichlet boundary conditions, was created. Atomicity is preserved by graph-coloring overlapping nodes into separate kernel launches. The one-time mapping calculations for matrix assembly, kronecker product, and boundary condition application took 452±1ms on GPU. Matrix assembly for 16 energy groups took 556±3s on CPU, and 358±2ms on GPU using the mappings developed. The CG solver took 93±1s on CPU, and 468±2ms on GPU. Conclusion: Three computationally intensive subroutines in deterministically solving the LBTE have been formulated on GPU, resulting in two orders of magnitude speedup. Funding support from Natural Sciences and Engineering Research Council and Alberta Innovates Health Solutions. Dr. Fallone is a co-founder and CEO of MagnetTx Oncology Solutions (under discussions to license Alberta bi-planar linac MR for commercialization).
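
    The final solve described above is a standard conjugate-gradient iteration on a symmetric positive-definite sparse system; a CPU-side reference version using SciPy sparse arithmetic is sketched below, since the paper's contribution is performing the assembly and these same sparse matrix-vector products in CUDA. The tridiagonal test matrix is an arbitrary SPD example, not an LBTE discretization.

```python
import numpy as np
import scipy.sparse as sp

def conjugate_gradient(A, b, tol=1e-8, max_iter=2000):
    """Plain CG for a symmetric positive-definite system A x = b."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p                      # the sparse mat-vec is the GPU-friendly kernel
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# SPD test matrix: a 1-D Laplacian (tridiagonal), standing in for the LSFEM system
n = 200
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)
x = conjugate_gradient(A, b)
print(np.linalg.norm(A @ x - b))
```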

  1. ATLAS@Home: Harnessing Volunteer Computing for HEP

    NASA Astrophysics Data System (ADS)

    Adam-Bourdarios, C.; Cameron, D.; Filipčič, A.; Lancon, E.; Wu, W.; ATLAS Collaboration

    2015-12-01

    A recent common theme in HEP computing is the exploitation of opportunistic resources in order to provide the maximum statistics possible for Monte Carlo simulation. Volunteer computing has been used over the last few years in many other scientific fields and by CERN itself to run simulations of the LHC beams. The ATLAS@Home project was started to allow volunteers to run simulations of collisions in the ATLAS detector. So far many thousands of members of the public have signed up to contribute their spare CPU cycles for ATLAS, and there is potential for volunteer computing to provide a significant fraction of ATLAS computing resources. Here we describe the design of the project, the lessons learned so far and the future plans.

  2. jSPyDB, an open source database-independent tool for data management

    NASA Astrophysics Data System (ADS)

    Pierro, Giuseppe Antonio; Cavallari, Francesca; Di Guida, Salvatore; Innocente, Vincenzo

    2011-12-01

    Nowadays, the number of commercial tools available for accessing databases, built on Java or .Net, is increasing. However, many of these applications have several drawbacks: usually they are not open-source, they provide interfaces only to a specific kind of database, and they are platform-dependent and very CPU- and memory-consuming. jSPyDB is a free web-based tool written using Python and Javascript. It relies on jQuery and Python libraries, and is intended to provide a simple handler to different database technologies inside a local web browser. Such a tool, exploiting fast access libraries such as SQLAlchemy, is easy to install and configure. The design of this tool envisages three layers. The front-end client side in the local web browser communicates with a backend server. Only the server is able to connect to the different databases for the purposes of performing data definition and manipulation. The server makes the data available to the client, so that the user can display and handle them safely. Moreover, thanks to jQuery libraries, this tool supports export of data in different formats, such as XML and JSON. Finally, by using a set of pre-defined functions, users are allowed to create their customized views for better data visualization. In this way, we optimize the performance of database servers by avoiding short connections and concurrent sessions. In addition, security is enforced since we do not provide users the possibility to directly execute any SQL statement.

  3. Graphics processing unit (GPU)-based computation of heat conduction in thermally anisotropic solids

    NASA Astrophysics Data System (ADS)

    Nahas, C. A.; Balasubramaniam, Krishnan; Rajagopal, Prabhu

    2013-01-01

    Numerical modeling of anisotropic media is a computationally intensive task since it brings additional complexity to the field problem in that the physical properties are different in different directions. Largely used in the aerospace industry because of their lightweight nature, composite materials are a very good example of thermally anisotropic media. With advancements in video gaming technology, parallel processors are much cheaper today and accessibility to higher-end graphics processing devices has increased dramatically over the past couple of years. Since these massively parallel GPUs are very good at handling floating-point arithmetic, they provide a new platform for engineers and scientists to accelerate their numerical models using commodity hardware. In this paper we implement a parallel finite difference model of thermal diffusion through anisotropic media using NVIDIA CUDA (Compute Unified Device Architecture). We use the NVIDIA GeForce GTX 560 Ti as our primary computing device, which consists of 384 CUDA cores clocked at 1645 MHz, with a standard desktop PC as the host platform. We compare the results against a standard CPU implementation for accuracy and speed and draw implications for simulation using the GPU paradigm.
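
    As a CPU reference for the kind of kernel such a paper moves to CUDA, the snippet below advances a 2-D explicit finite-difference model of heat conduction with direction-dependent diffusivities (a diagonal conductivity tensor). Grid size, diffusivities, boundary treatment, and time step are illustrative assumptions, not the paper's setup.

```python
import numpy as np

nx, ny = 256, 256
alpha_x, alpha_y = 1.0, 0.25        # assumed thermal diffusivities along x and y
dx = dy = 1.0
dt = 0.2 * min(dx**2 / alpha_x, dy**2 / alpha_y)   # conservative explicit stability limit

T = np.zeros((nx, ny))
T[nx // 2, ny // 2] = 1000.0        # hot spot in the middle

for _ in range(500):
    # Second differences along each axis; boundaries stay fixed (Dirichlet).
    d2x = T[2:, 1:-1] - 2 * T[1:-1, 1:-1] + T[:-2, 1:-1]
    d2y = T[1:-1, 2:] - 2 * T[1:-1, 1:-1] + T[1:-1, :-2]
    T[1:-1, 1:-1] += dt * (alpha_x * d2x / dx**2 + alpha_y * d2y / dy**2)

# The hot spot spreads into an ellipse elongated along x, the more conductive axis.
print(T[nx // 2 - 5:nx // 2 + 6, ny // 2].round(2))
```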

  4. Performance implications from sizing a VM on multi-core systems: A data analytic application's view

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Lim, Seung-Hwan; Horey, James L; Begoli, Edmon

    In this paper, we present a quantitative performance analysis of data analytics applications running on multi-core virtual machines. Such environments form the core of cloud computing. In addition, data analytics applications, such as Cassandra and Hadoop, are becoming increasingly popular on cloud computing platforms. This convergence necessitates a better understanding of the performance and cost implications of such hybrid systems. For example, the very first step in hosting applications in virtualized environments requires the user to configure the number of virtual processors and the size of memory. To understand the performance implications of this step, we benchmarked three Yahoo Cloud Serving Benchmark (YCSB) workloads in a virtualized multi-core environment. Our measurements indicate that the performance of Cassandra for YCSB workloads does not heavily depend on the processing capacity of a system, while the size of the data set is critical to performance relative to allocated memory. We also identified a strong relationship between the running time of workloads and various hardware events (last level cache loads, misses, and CPU migrations). From this analysis, we provide several suggestions to improve the performance of data analytics applications running on cloud computing environments.

  5. Accelerating the Original Profile Kernel.

    PubMed

    Hamp, Tobias; Goldberg, Tatyana; Rost, Burkhard

    2013-01-01

    One of the most accurate multi-class protein classification systems continues to be the profile-based SVM kernel introduced by the Leslie group. Unfortunately, its CPU requirements render it too slow for practical applications of large-scale classification tasks. Here, we introduce several software improvements that enable significant acceleration. Using various non-redundant data sets, we demonstrate that our new implementation reaches a maximal speed-up as high as 14-fold for calculating the same kernel matrix. Some predictions are over 200 times faster, rendering the kernel possibly the top contender in terms of speed/performance ratio. Additionally, we explain how to parallelize various computations and provide an integrative program that reduces creating a production-quality classifier to a single program call. The new implementation is available as a Debian package under a free academic license and does not depend on commercial software. For non-Debian based distributions, the source package ships with a traditional Makefile-based installer. Download and installation instructions can be found at https://rostlab.org/owiki/index.php/Fast_Profile_Kernel. Bugs and other issues may be reported at https://rostlab.org/bugzilla3/enter_bug.cgi?product=fastprofkernel.

  6. omniClassifier: a Desktop Grid Computing System for Big Data Prediction Modeling

    PubMed Central

    Phan, John H.; Kothari, Sonal; Wang, May D.

    2016-01-01

    Robust prediction models are important for numerous science, engineering, and biomedical applications. However, best-practice procedures for optimizing prediction models can be computationally complex, especially when choosing models from among hundreds or thousands of parameter choices. Computational complexity has further increased with the growth of data in these fields, concurrent with the era of “Big Data”. Grid computing is a potential solution to the computational challenges of Big Data. Desktop grid computing, which uses idle CPU cycles of commodity desktop machines, coupled with commercial cloud computing resources can enable research labs to gain easier and more cost effective access to vast computing resources. We have developed omniClassifier, a multi-purpose prediction modeling application that provides researchers with a tool for conducting machine learning research within the guidelines of recommended best-practices. omniClassifier is implemented as a desktop grid computing system using the Berkeley Open Infrastructure for Network Computing (BOINC) middleware. In addition to describing implementation details, we use various gene expression datasets to demonstrate the potential scalability of omniClassifier for efficient and robust Big Data prediction modeling. A prototype of omniClassifier can be accessed at http://omniclassifier.bme.gatech.edu/. PMID:27532062

  7. CMS Connect

    NASA Astrophysics Data System (ADS)

    Balcas, J.; Bockelman, B.; Gardner, R., Jr.; Hurtado Anampa, K.; Jayatilaka, B.; Aftab Khan, F.; Lannon, K.; Larson, K.; Letts, J.; Marra Da Silva, J.; Mascheroni, M.; Mason, D.; Perez-Calero Yzquierdo, A.; Tiradani, A.

    2017-10-01

    The CMS experiment collects and analyzes large amounts of data coming from high energy particle collisions produced by the Large Hadron Collider (LHC) at CERN. This involves a huge amount of real and simulated data processing that needs to be handled in batch-oriented platforms. The CMS Global Pool of computing resources provide +100K dedicated CPU cores and another 50K to 100K CPU cores from opportunistic resources for these kind of tasks and even though production and event processing analysis workflows are already managed by existing tools, there is still a lack of support to submit final stage condor-like analysis jobs familiar to Tier-3 or local Computing Facilities users into these distributed resources in an integrated (with other CMS services) and friendly way. CMS Connect is a set of computing tools and services designed to augment existing services in the CMS Physics community focusing on these kind of condor analysis jobs. It is based on the CI-Connect platform developed by the Open Science Grid and uses the CMS GlideInWMS infrastructure to transparently plug CMS global grid resources into a virtual pool accessed via a single submission machine. This paper describes the specific developments and deployment of CMS Connect beyond the CI-Connect platform in order to integrate the service with CMS specific needs, including specific Site submission, accounting of jobs and automated reporting to standard CMS monitoring resources in an effortless way to their users.

  8. Porting AMG2013 to Heterogeneous CPU+GPU Nodes

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Samfass, Philipp

    LLNL's future advanced technology system SIERRA will feature heterogeneous compute nodes that consist of IBM POWER9 CPUs and NVIDIA Volta GPUs. Conceptually, the motivation for such an architecture is quite straightforward: While GPUs are optimized for throughput on massively parallel workloads, CPUs strive to minimize latency for rather sequential operations. Yet, making optimal use of heterogeneous architectures raises new challenges for the development of scalable parallel software, e.g., with respect to work distribution. Porting LLNL's parallel numerical libraries to upcoming heterogeneous CPU+GPU architectures is therefore a critical factor for ensuring LLNL's future success in fulfilling its national mission. One of these libraries, called HYPRE, provides parallel solvers and preconditioners for large, sparse linear systems of equations. In the context of this internship project, I consider AMG2013, which is a proxy application for major parts of HYPRE that implements a benchmark for setting up and solving different systems of linear equations. In the following, I describe in detail how I ported multiple parts of AMG2013 to the GPU (Section 2) and present results for different experiments that demonstrate a successful parallel implementation on the heterogeneous machines surface and ray (Section 3). In Section 4, I give guidelines on how my code should be used. Finally, I conclude and give an outlook for future work (Section 5).

  9. Efficient calculation of many-body induced electrostatics in molecular systems

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    McLaughlin, Keith, E-mail: kmclaugh@mail.usf.edu; Cioce, Christian R.; Pham, Tony

    Potential energy functions including many-body polarization are in widespread use in simulations of aqueous and biological systems, metal-organics, molecular clusters, and other systems where electronically induced redistribution of charge among local atomic sites is of importance. The polarization interactions, treated here via the methods of Thole and Applequist, while long-ranged, can be computed for moderate-sized periodic systems with extremely high accuracy by extending Ewald summation to the induced fields as demonstrated by Nymand, Sala, and others. These full Ewald polarization calculations, however, are expensive and often limited to very small systems, particularly in Monte Carlo simulations, which may require energy evaluation over several hundred-thousand configurations. For such situations, it shall be shown that sufficiently accurate computation of the polarization energy can be produced in a fraction of the central processing unit (CPU) time by neglecting the long-range extension to the induced fields while applying the long-range treatments of Ewald or Wolf to the static fields; these methods, denoted Ewald E-Static and Wolf E-Static (WES), respectively, provide an effective means to obtain polarization energies for intermediate and large systems including those with several thousand polarizable sites in a fraction of the CPU time. Furthermore, we shall demonstrate a means to optimize the damping for WES calculations via extrapolation from smaller trial systems.

  10. Heterogeneous real-time computing in radio astronomy

    NASA Astrophysics Data System (ADS)

    Ford, John M.; Demorest, Paul; Ransom, Scott

    2010-07-01

    Modern computer architectures suited for general purpose computing are often not the best choice for either I/O-bound or compute-bound problems. Sometimes the best choice is not to choose a single architecture, but to take advantage of the best characteristics of different computer architectures to solve your problems. This paper examines the tradeoffs between using computer systems based on the ubiquitous X86 Central Processing Units (CPU's), Field Programmable Gate Array (FPGA) based signal processors, and Graphical Processing Units (GPU's). We will show how a heterogeneous system can be produced that blends the best of each of these technologies into a real-time signal processing system. FPGA's tightly coupled to analog-to-digital converters connect the instrument to the telescope and supply the first level of computing to the system. These FPGA's are coupled to other FPGA's to continue to provide highly efficient processing power. Data is then packaged up and shipped over fast networks to a cluster of general purpose computers equipped with GPU's, which are used for floating-point intensive computation. Finally, the data is handled by the CPU and written to disk, or further processed. Each of the elements in the system has been chosen for its specific characteristics and the role it can play in creating a system that does the most for the least, in terms of power, space, and money.

  11. Vector computer memory bank contention

    NASA Technical Reports Server (NTRS)

    Bailey, D. H.

    1985-01-01

    A number of vector supercomputers feature very large memories. Unfortunately the large capacity memory chips that are used in these computers are much slower than the fast central processing unit (CPU) circuitry. As a result, memory bank reservation times (in CPU ticks) are much longer than on previous generations of computers. A consequence of these long reservation times is that memory bank contention is sharply increased, resulting in significantly lowered performance rates. The phenomenon of memory bank contention in vector computers is analyzed using both a Markov chain model and a Monte Carlo simulation program. The results of this analysis indicate that future generations of supercomputers must either employ much faster memory chips or else feature very large numbers of independent memory banks.
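
    A toy Monte Carlo in the spirit of the contention analysis described above: references are issued one per clock tick to a uniformly random bank, and a reference counts as a conflict whenever its bank is still within its reservation (busy) time. The bank counts, reservation time, and uniform-access assumption are illustrative only, not the paper's model or parameters.

```python
import numpy as np

def contention_rate(n_banks, busy_ticks, n_refs=200_000, seed=0):
    """Fraction of references that hit a bank still busy from an earlier access."""
    rng = np.random.default_rng(seed)
    free_at = np.zeros(n_banks)     # tick at which each bank becomes free again
    stalls = 0
    for tick in range(n_refs):
        bank = rng.integers(n_banks)
        if free_at[bank] > tick:
            stalls += 1             # conflict: the bank is still reserved
        # The access (re)occupies the bank for `busy_ticks` after it is free.
        free_at[bank] = max(free_at[bank], tick) + busy_ticks
    return stalls / n_refs

# Longer reservation times or fewer banks drive the conflict rate up sharply.
for banks in (16, 64, 256):
    print(banks, "banks:", round(contention_rate(banks, busy_ticks=8), 3))
```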

  12. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Li, C.; Yu, G.; Wang, K.

    The physical designs of new-concept reactors, which have complex structures, various materials and neutronic energy spectra, have greatly raised the requirements on calculation methods and the corresponding computing hardware. Along with widely used parallel algorithms, heterogeneous platform architectures have been introduced into numerical computations in reactor physics. Because of their natural parallel characteristics, CPU-FPGA architectures are often used to accelerate numerical computation. This paper studies the application and features of this kind of heterogeneous platform used in numerical calculations of reactor physics through practical examples. After the designed neutron diffusion module based on the CPU-FPGA architecture achieves a speed-up factor of 11.2, it is proved to be feasible to apply this kind of heterogeneous platform to reactor physics. (authors)

  13. Introduction of Parallel GPGPU Acceleration Algorithms for the Solution of Radiative Transfer

    NASA Technical Reports Server (NTRS)

    Godoy, William F.; Liu, Xu

    2011-01-01

    General-purpose computing on graphics processing units (GPGPU) is a recent technique that allows the parallel graphics processing unit (GPU) to accelerate calculations performed sequentially by the central processing unit (CPU). To introduce GPGPU to radiative transfer, the Gauss-Seidel solution of the well-known expressions for 1-D and 3-D homogeneous, isotropic media is selected as a test case. Different algorithms are introduced to balance memory and GPU-CPU communication, critical aspects of GPGPU. Results show that speed-ups of one to two orders of magnitude are obtained when compared to sequential solutions. The underlying value of GPGPU is its potential for extension to other radiative solvers (e.g., Monte Carlo, discrete ordinates) with a minimal learning curve.
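
    For readers unfamiliar with the baseline method, a generic Gauss-Seidel iteration looks like the following; this is the textbook scheme for a linear system A x = b, not the specific radiative-transfer discretization used in the paper.

      # Textbook Gauss-Seidel iteration for A x = b; each unknown is updated in place
      # using the most recent values of the others, which is what makes naive
      # parallelization awkward and motivates reordered/parallel GPGPU variants.
      import numpy as np

      def gauss_seidel(A, b, tol=1e-10, max_iter=10_000):
          n = len(b)
          x = np.zeros(n)
          for _ in range(max_iter):
              x_old = x.copy()
              for i in range(n):
                  s = A[i, :i] @ x[:i] + A[i, i+1:] @ x[i+1:]
                  x[i] = (b[i] - s) / A[i, i]
              if np.linalg.norm(x - x_old, np.inf) < tol:
                  break
          return x

      # Small diagonally dominant test problem.
      A = np.array([[ 4.0, -1.0,  0.0],
                    [-1.0,  4.0, -1.0],
                    [ 0.0, -1.0,  4.0]])
      b = np.array([1.0, 2.0, 3.0])
      print(gauss_seidel(A, b), np.linalg.solve(A, b))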

  14. Parallel Scaling Characteristics of Selected NERSC User Project Codes

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Skinner, David; Verdier, Francesca; Anand, Harsh

    This report documents parallel scaling characteristics of NERSC user project codes between Fiscal Year 2003 and the first half of Fiscal Year 2004 (Oct 2002-March 2004). The codes analyzed cover 60% of all the CPU hours delivered during that time frame on seaborg, a 6080-CPU IBM SP and the largest parallel computer at NERSC. The scale of the workload, in terms of concurrency and problem size, is analyzed. Drawing on batch queue logs, performance data, and feedback from researchers, we detail the motivations, benefits, and challenges of implementing highly parallel scientific codes on current NERSC High Performance Computing systems. An evaluation and outlook of the NERSC workload for Allocation Year 2005 is presented.

  15. An emulator for minimizing computer resources for finite element analysis

    NASA Technical Reports Server (NTRS)

    Melosh, R.; Utku, S.; Islam, M.; Salama, M.

    1984-01-01

    A computer code, SCOPE, has been developed for predicting the computer resources required for a given analysis code, computer hardware, and structural problem. The cost of running the code is a small fraction (about 3 percent) of the cost of performing the actual analysis. However, its accuracy in predicting the CPU and I/O resources depends intrinsically on the accuracy of calibration data that must be developed once for the computer hardware and the finite element analysis code of interest. Testing of the SCOPE code on the AMDAHL 470 V/8 computer and the ELAS finite element analysis program indicated small I/O errors (3.2 percent), larger CPU errors (17.8 percent), and negligible total errors (1.5 percent).
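
    A minimal sketch of the general idea of calibration-based prediction follows; the polynomial model form and the numbers are invented for illustration and do not reproduce SCOPE's actual cost model.

      # Fit a cheap cost model to a few calibration runs, then predict CPU seconds
      # for a larger problem.  Model form and data are illustrative only.
      import numpy as np

      # Calibration data: problem size (e.g. degrees of freedom) vs. measured CPU seconds.
      sizes   = np.array([1_000, 2_000, 4_000, 8_000], dtype=float)
      cpu_sec = np.array([0.9,   2.1,   5.0,   12.5])

      # Assume cost ~ a*n*log(n) + b*n + c and fit by linear least squares.
      X = np.column_stack([sizes * np.log(sizes), sizes, np.ones_like(sizes)])
      coeffs, *_ = np.linalg.lstsq(X, cpu_sec, rcond=None)

      def predict(n):
          return coeffs @ np.array([n * np.log(n), n, 1.0])

      print(f"predicted CPU seconds for n=32000: {predict(32_000.0):.1f}")

    As the abstract notes, the accuracy of any such emulator hinges on the quality of the calibration data gathered for the specific hardware and analysis code.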

  16. New Focal Plane Array Controller for the Instruments of the Subaru Telescope

    NASA Astrophysics Data System (ADS)

    Nakaya, Hidehiko; Komiyama, Yutaka; Miyazaki, Satoshi; Yamashita, Takuya; Yagi, Masafumi; Sekiguchi, Maki

    2006-03-01

    We have developed a next-generation data acquisition system, MESSIA5 (Modularized Extensible System for Image Acquisition), which comprises the digital part of a focal plane array controller. The new data acquisition system was constructed based on a 64 bit, 66 MHz PCI (peripheral component interconnect) bus architecture and runs on an x86 CPU computer with (non-real-time) Linux. The system, including the CPU board, is placed at the telescope focus, and standard gigabit Ethernet is adopted for the data transfer, as opposed to a dedicated fiber link. During the summer of 2002, we installed the new system for the first time on the Subaru prime-focus camera Suprime-Cam and successfully improved the observing performance.

  17. Vector computer memory bank contention

    NASA Technical Reports Server (NTRS)

    Bailey, David H.

    1987-01-01

    A number of vector supercomputers feature very large memories. Unfortunately the large capacity memory chips that are used in these computers are much slower than the fast central processing unit (CPU) circuitry. As a result, memory bank reservation times (in CPU ticks) are much longer than on previous generations of computers. A consequence of these long reservation times is that memory bank contention is sharply increased, resulting in significantly lowered performance rates. The phenomenon of memory bank contention in vector computers is analyzed using both a Markov chain model and a Monte Carlo simulation program. The results of this analysis indicate that future generations of supercomputers must either employ much faster memory chips or else feature very large numbers of independent memory banks.

  18. Modeling CANDU-6 liquid zone controllers for effects of thorium-based fuels

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    St-Aubin, E.; Marleau, G.

    2012-07-01

    We use the DRAGON code to model the CANDU-6 liquid zone controllers and evaluate the effects of thorium-based fuels on their incremental cross sections and reactivity worth. We optimize both the numerical quadrature and the spatial discretization of the 2D cell models in order to provide accurate fuel properties for the 3D liquid zone controller supercell models. We propose a low-computer-cost, parameterized, pseudo-exact 3D cluster geometry modeling approach that avoids tracking issues on small external surfaces. This methodology provides consistent incremental cross sections and reactivity worths when the thickness of the buffer region is reduced. When compared with an approximate annular geometry representation of the fuel and coolant region, we observe that the cluster description of the fuel bundles in the supercell models does not considerably increase the precision of the results while substantially increasing the CPU time. In addition, this comparison shows that it is imperative to finely describe the liquid zone controller geometry, since it has a strong impact on the incremental cross sections. This paper also shows that the liquid zone controller reactivity worth is greatly decreased in the presence of thorium-based fuels compared to the reference natural uranium fuel, since the fission and the fast-to-thermal scattering incremental cross sections are higher for the new fuels. (authors)
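
    For orientation, the reactivity worth of a device of this kind is conventionally defined from the effective multiplication factors computed with the device inserted and withdrawn (standard definitions, not equations taken from the paper):

      \rho = \frac{k_{\mathrm{eff}} - 1}{k_{\mathrm{eff}}}, \qquad
      \Delta\rho_{\mathrm{LZC}} = \rho_{\mathrm{in}} - \rho_{\mathrm{out}} = \frac{1}{k_{\mathrm{out}}} - \frac{1}{k_{\mathrm{in}}},

    with incremental cross sections likewise defined as differences, \Delta\Sigma = \Sigma_{\mathrm{in}} - \Sigma_{\mathrm{out}}, between homogenized supercell properties computed with and without the device.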

  19. Using Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data

    NASA Astrophysics Data System (ADS)

    O'Connor, A. S.; Justice, B.; Harris, A. T.

    2013-12-01

    Graphics Processing Units (GPUs) are high-performance multiple-core processors capable of very high computational speeds and large data throughput. Modern GPUs are inexpensive and widely available commercially. These are general-purpose parallel processors with support for a variety of programming interfaces, including industry-standard languages such as C. GPU implementations of algorithms that are well suited for parallel processing can often achieve speedups of several orders of magnitude over optimized CPU codes. Significant improvements in speed for imagery orthorectification, atmospheric correction, target detection, and image transformations like Independent Components Analysis (ICA) have been achieved using GPU-based implementations. Additional optimizations, when factored in with GPU processing capabilities, can provide a 50x - 100x reduction in the time required to process large imagery. Exelis Visual Information Solutions (VIS) has implemented a CUDA-based GPU processing framework for accelerating ENVI and IDL processes that can best take advantage of parallelization. Testing performed by Exelis VIS shows that orthorectification of a WorldView-1 35,000 x 35,000 pixel image can take as long as two hours; with GPU orthorectification, the same process takes three minutes. By speeding up image processing, imagery can be used by first responders and by scientists making rapid discoveries with near-real-time data, and data centers gain an operational means to quickly process and disseminate data.
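
    A rough illustration of why this workload maps well to GPUs is sketched below: the core of orthorectification-style processing is a per-pixel lookup and interpolation, so every output pixel is independent. The mapping grid, array shapes, and function names are invented for this toy example and are not Exelis VIS code.

      # Toy bilinear resampling kernel: each output pixel is an independent
      # interpolation into the source image, i.e. embarrassingly parallel work.
      import numpy as np

      def bilinear_resample(src, map_rows, map_cols):
          """Sample `src` at fractional coordinates (map_rows, map_cols) with bilinear weights."""
          r0 = np.clip(np.floor(map_rows).astype(int), 0, src.shape[0] - 2)
          c0 = np.clip(np.floor(map_cols).astype(int), 0, src.shape[1] - 2)
          dr, dc = map_rows - r0, map_cols - c0
          return ((1 - dr) * (1 - dc) * src[r0, c0]     + (1 - dr) * dc * src[r0, c0 + 1] +
                  dr       * (1 - dc) * src[r0 + 1, c0] + dr       * dc * src[r0 + 1, c0 + 1])

      # Toy example: warp a small random "image" with an invented sub-pixel shift.
      src = np.random.rand(512, 512).astype(np.float32)
      rows, cols = np.meshgrid(np.linspace(0, 510, 400), np.linspace(0, 510, 400), indexing="ij")
      out = bilinear_resample(src, rows + 0.3, cols + 0.7)
      print(out.shape)

    The same pixel-independent structure is what a GPU implementation exploits: one thread per output pixel, with array libraries offering NumPy-like interfaces (CuPy, for example) able to run code of this shape on the GPU with few changes.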

  20. The growth of the UniTree mass storage system at the NASA Center for Computational Sciences

    NASA Technical Reports Server (NTRS)

    Tarshish, Adina; Salmon, Ellen

    1993-01-01

    In October 1992, the NASA Center for Computational Sciences made its Convex-based UniTree system generally available to users. The ensuing months saw the growth of near-online data from nil to nearly three terabytes, a doubling of the number of CPU's on the facility's Cray YMP (the primary data source for UniTree), and the necessity for an aggressive regimen for repacking sparse tapes and hierarchical 'vaulting' of old files to freestanding tape. Connectivity was enhanced as well with the addition of UltraNet HiPPI. This paper describes the increasing demands placed on the storage system's performance and throughput that resulted from the significant augmentation of compute-server processor power and network speed.
