total cpu time: Topics by Science.gov

Sample records for total cpu time

Hybrid Computational Architecture for Multi-Scale Modeling of Materials and Devices

DTIC Science & Technology

2016-01-03

Equivalent: Total Number: Sub Contractors (DD882) Names of Faculty Supported Names of Under Graduate students supported Names of Personnel receiving masters...GHz, 20 cores (40 with hyper-threading ( HT )) Single node performance Node # of cores Total CPU time User CPU time System CPU time Elapsed time...INTEL20 40 (with HT ) 534.785 529.984 4.800 541.179 20 468.873 466.119 2.754 476.878 10 671.798 669.653 2.145 680.510 8 772.269 770.256 2.013
The Creation of a CPU Timer for High Fidelity Programs

NASA Technical Reports Server (NTRS)

Dick, Aidan A.

2011-01-01

Using C and C++ programming languages, a tool was developed that measures the efficiency of a program by recording the amount of CPU time that various functions consume. By inserting the tool between lines of code in the program, one can receive a detailed report of the absolute and relative time consumption associated with each section. After adapting the generic tool for a high-fidelity launch vehicle simulation program called MAVERIC, the components of a frequently used function called "derivatives ( )" were measured. Out of the 34 sub-functions in "derivatives ( )", it was found that the top 8 sub-functions made up 83.1% of the total time spent. In order to decrease the overall run time of MAVERIC, a launch vehicle simulation program, a change was implemented in the sub-function "Event_Controller ( )". Reformatting "Event_Controller ( )" led to a 36.9% decrease in the total CPU time spent by that sub-function, and a 3.2% decrease in the total CPU time spent by the overarching function "derivatives ( )".
Spectrum Savings from High Performance Recording and Playback Onboard the Test Article

DTIC Science & Technology

2013-02-20

execute within a Windows 7 environment, and data is recorded on SSDs. The underlying database is implemented using MySQL . Figure 1 illustrates the... MySQL database. This is effectively the time at which the recorded data are available for retransmission. CPU and Memory utilization were collected...17.7% MySQL avg. 3.9% EQDR Total avg. 21.6% Table 1 CPU Utilization with260 Mbits/sec Load The difference between the total System CPU (27.8
Symptoms of problematic cellular phone use, functional impairment and its association with depression among adolescents in Southern Taiwan.

PubMed

Yen, Cheng-Fang; Tang, Tze-Chun; Yen, Ju-Yu; Lin, Huang-Chi; Huang, Chi-Fen; Liu, Shu-Chun; Ko, Chih-Hung

2009-08-01

The aims of this study were: (1) to examine the prevalence of symptoms of problematic cellular phone use (CPU); (2) to examine the associations between the symptoms of problematic CPU, functional impairment caused by CPU and the characteristics of CPU; (3) to establish the optimal cut-off point of the number of symptoms for functional impairment caused by CPU; and (4) to examine the association between problematic CPU and depression in adolescents. A total of 10,191 adolescent students in Southern Taiwan were recruited into this study. Participants' self-reported symptoms of problematic CPU and functional impairments caused by CPU were collected. The associations of symptoms of problematic CPU with functional impairments and with the characteristics of CPU were examined. The cut-off point of the number of symptoms for functional impairment was also determined. The association between problematic CPU and depression was examined by logistic regression analysis. The results indicated that the symptoms of problematic CPU were prevalent in adolescents. The adolescents who had any one of the symptoms of problematic CPU were more likely to report at least one dimension of functional impairment caused by CPU, called more on cellular phones, sent more text messages, or spent more time and higher fees on CPU. Having four or more symptoms of problematic CPU had the highest potential to differentiate between the adolescents with and without functional impairment caused by CPU. Adolescents who had significant depression were more likely to have four or more symptoms of problematic CPU. The results of this study may provide a basis for detecting symptoms of problematic CPU in adolescents.
Efficient methods for implementation of multi-level nonrigid mass-preserving image registration on GPUs and multi-threaded CPUs.

PubMed

Ellingwood, Nathan D; Yin, Youbing; Smith, Matthew; Lin, Ching-Long

2016-04-01

Faster and more accurate methods for registration of images are important for research involved in conducting population-based studies that utilize medical imaging, as well as improvements for use in clinical applications. We present a novel computation- and memory-efficient multi-level method on graphics processing units (GPU) for performing registration of two computed tomography (CT) volumetric lung images. We developed a computation- and memory-efficient Diffeomorphic Multi-level B-Spline Transform Composite (DMTC) method to implement nonrigid mass-preserving registration of two CT lung images on GPU. The framework consists of a hierarchy of B-Spline control grids of increasing resolution. A similarity criterion known as the sum of squared tissue volume difference (SSTVD) was adopted to preserve lung tissue mass. The use of SSTVD consists of the calculation of the tissue volume, the Jacobian, and their derivatives, which makes its implementation on GPU challenging due to memory constraints. The use of the DMTC method enabled reduced computation and memory storage of variables with minimal communication between GPU and Central Processing Unit (CPU) due to ability to pre-compute values. The method was assessed on six healthy human subjects. Resultant GPU-generated displacement fields were compared against the previously validated CPU counterpart fields, showing good agreement with an average normalized root mean square error (nRMS) of 0.044±0.015. Runtime and performance speedup are compared between single-threaded CPU, multi-threaded CPU, and GPU algorithms. Best performance speedup occurs at the highest resolution in the GPU implementation for the SSTVD cost and cost gradient computations, with a speedup of 112 times that of the single-threaded CPU version and 11 times over the twelve-threaded version when considering average time per iteration using a Nvidia Tesla K20X GPU. The proposed GPU-based DMTC method outperforms its multi-threaded CPU version in terms of runtime. Total registration time reduced runtime to 2.9min on the GPU version, compared to 12.8min on twelve-threaded CPU version and 112.5min on a single-threaded CPU. Furthermore, the GPU implementation discussed in this work can be adapted for use of other cost functions that require calculation of the first derivatives. Copyright © 2015 Elsevier Ireland Ltd. All rights reserved.
32 CFR 701.53 - FOIA fee schedule.

Code of Federal Regulations, 2014 CFR

2014-07-01

... human time) and machine time. (1) Human time. Human time is all the time spent by humans performing the...) Machine time. Machine time involves only direct costs of the central processing unit (CPU), input/output... exist to calculate CPU time, no machine costs can be passed on to the requester. When CPU calculations...
32 CFR 701.53 - FOIA fee schedule.

Code of Federal Regulations, 2012 CFR

2012-07-01

... human time) and machine time. (1) Human time. Human time is all the time spent by humans performing the...) Machine time. Machine time involves only direct costs of the central processing unit (CPU), input/output... exist to calculate CPU time, no machine costs can be passed on to the requester. When CPU calculations...
32 CFR 701.53 - FOIA fee schedule.

Code of Federal Regulations, 2013 CFR

2013-07-01

... human time) and machine time. (1) Human time. Human time is all the time spent by humans performing the...) Machine time. Machine time involves only direct costs of the central processing unit (CPU), input/output... exist to calculate CPU time, no machine costs can be passed on to the requester. When CPU calculations...
Dosimetric comparison of helical tomotherapy treatment plans for total marrow irradiation created using GPU and CPU dose calculation engines.

PubMed

Nalichowski, Adrian; Burmeister, Jay

2013-07-01

To compare optimization characteristics, plan quality, and treatment delivery efficiency between total marrow irradiation (TMI) plans using the new TomoTherapy graphic processing unit (GPU) based dose engine and CPU/cluster based dose engine. Five TMI plans created on an anthropomorphic phantom were optimized and calculated with both dose engines. The planning treatment volume (PTV) included all the bones from head to mid femur except for upper extremities. Evaluated organs at risk (OAR) consisted of lung, liver, heart, kidneys, and brain. The following treatment parameters were used to generate the TMI plans: field widths of 2.5 and 5 cm, modulation factors of 2 and 2.5, and pitch of either 0.287 or 0.43. The optimization parameters were chosen based on the PTV and OAR priorities and the plans were optimized with a fixed number of iterations. The PTV constraint was selected to ensure that at least 95% of the PTV received the prescription dose. The plans were evaluated based on D80 and D50 (dose to 80% and 50% of the OAR volume, respectively) and hotspot volumes within the PTVs. Gamma indices (Γ) were also used to compare planar dose distributions between the two modalities. The optimization and dose calculation times were compared between the two systems. The treatment delivery times were also evaluated. The results showed very good dosimetric agreement between the GPU and CPU calculated plans for any of the evaluated planning parameters indicating that both systems converge on nearly identical plans. All D80 and D50 parameters varied by less than 3% of the prescription dose with an average difference of 0.8%. A gamma analysis Γ(3%, 3 mm) < 1 of the GPU plan resulted in over 90% of calculated voxels satisfying Γ < 1 criterion as compared to baseline CPU plan. The average number of voxels meeting the Γ < 1 criterion for all the plans was 97%. In terms of dose optimization/calculation efficiency, there was a 20-fold reduction in planning time with the new GPU system. The average optimization/dose calculation time utilizing the traditional CPU/cluster based system was 579 vs 26.8 min for the GPU based system. There was no difference in the calculated treatment delivery time per fraction. Beam-on time varied based on field width and pitch and ranged between 15 and 28 min. The TomoTherapy GPU based dose engine is capable of calculating TMI treatment plans with plan quality nearly identical to plans calculated using the traditional CPU/cluster based system, while significantly reducing the time required for optimization and dose calculation.
Aligner optimization increases accuracy and decreases compute times in multi-species sequence data.

PubMed

Robinson, Kelly M; Hawkins, Aziah S; Santana-Cruz, Ivette; Adkins, Ricky S; Shetty, Amol C; Nagaraj, Sushma; Sadzewicz, Lisa; Tallon, Luke J; Rasko, David A; Fraser, Claire M; Mahurkar, Anup; Silva, Joana C; Dunning Hotopp, Julie C

2017-09-01

As sequencing technologies have evolved, the tools to analyze these sequences have made similar advances. However, for multi-species samples, we observed important and adverse differences in alignment specificity and computation time for bwa- mem (Burrows-Wheeler aligner-maximum exact matches) relative to bwa-aln. Therefore, we sought to optimize bwa-mem for alignment of data from multi-species samples in order to reduce alignment time and increase the specificity of alignments. In the multi-species cases examined, there was one majority member (i.e. Plasmodium falciparum or Brugia malayi ) and one minority member (i.e. human or the Wolbachia endosymbiont w Bm) of the sequence data. Increasing bwa-mem seed length from the default value reduced the number of read pairs from the majority sequence member that incorrectly aligned to the reference genome of the minority sequence member. Combining both source genomes into a single reference genome increased the specificity of mapping, while also reducing the central processing unit (CPU) time. In Plasmodium , at a seed length of 18 nt, 24.1 % of reads mapped to the human genome using 1.7±0.1 CPU hours, while 83.6 % of reads mapped to the Plasmodium genome using 0.2±0.0 CPU hours (total: 107.7 % reads mapping; in 1.9±0.1 CPU hours). In contrast, 97.1 % of the reads mapped to a combined Plasmodium- human reference in only 0.7±0.0 CPU hours. Overall, the results suggest that combining all references into a single reference database and using a 23 nt seed length reduces the computational time, while maximizing specificity. Similar results were found for simulated sequence reads from a mock metagenomic data set. We found similar improvements to computation time in a publicly available human-only data set.
Accelerating Spaceborne SAR Imaging Using Multiple CPU/GPU Deep Collaborative Computing

PubMed Central

Zhang, Fan; Li, Guojun; Li, Wei; Hu, Wei; Hu, Yuxin

2016-01-01

With the development of synthetic aperture radar (SAR) technologies in recent years, the huge amount of remote sensing data brings challenges for real-time imaging processing. Therefore, high performance computing (HPC) methods have been presented to accelerate SAR imaging, especially the GPU based methods. In the classical GPU based imaging algorithm, GPU is employed to accelerate image processing by massive parallel computing, and CPU is only used to perform the auxiliary work such as data input/output (IO). However, the computing capability of CPU is ignored and underestimated. In this work, a new deep collaborative SAR imaging method based on multiple CPU/GPU is proposed to achieve real-time SAR imaging. Through the proposed tasks partitioning and scheduling strategy, the whole image can be generated with deep collaborative multiple CPU/GPU computing. In the part of CPU parallel imaging, the advanced vector extension (AVX) method is firstly introduced into the multi-core CPU parallel method for higher efficiency. As for the GPU parallel imaging, not only the bottlenecks of memory limitation and frequent data transferring are broken, but also kinds of optimized strategies are applied, such as streaming, parallel pipeline and so on. Experimental results demonstrate that the deep CPU/GPU collaborative imaging method enhances the efficiency of SAR imaging on single-core CPU by 270 times and realizes the real-time imaging in that the imaging rate outperforms the raw data generation rate. PMID:27070606
Accelerating Spaceborne SAR Imaging Using Multiple CPU/GPU Deep Collaborative Computing.

PubMed

Zhang, Fan; Li, Guojun; Li, Wei; Hu, Wei; Hu, Yuxin

2016-04-07

With the development of synthetic aperture radar (SAR) technologies in recent years, the huge amount of remote sensing data brings challenges for real-time imaging processing. Therefore, high performance computing (HPC) methods have been presented to accelerate SAR imaging, especially the GPU based methods. In the classical GPU based imaging algorithm, GPU is employed to accelerate image processing by massive parallel computing, and CPU is only used to perform the auxiliary work such as data input/output (IO). However, the computing capability of CPU is ignored and underestimated. In this work, a new deep collaborative SAR imaging method based on multiple CPU/GPU is proposed to achieve real-time SAR imaging. Through the proposed tasks partitioning and scheduling strategy, the whole image can be generated with deep collaborative multiple CPU/GPU computing. In the part of CPU parallel imaging, the advanced vector extension (AVX) method is firstly introduced into the multi-core CPU parallel method for higher efficiency. As for the GPU parallel imaging, not only the bottlenecks of memory limitation and frequent data transferring are broken, but also kinds of optimized strategies are applied, such as streaming, parallel pipeline and so on. Experimental results demonstrate that the deep CPU/GPU collaborative imaging method enhances the efficiency of SAR imaging on single-core CPU by 270 times and realizes the real-time imaging in that the imaging rate outperforms the raw data generation rate.
Dynamic Quantum Allocation and Swap-Time Variability in Time-Sharing Operating Systems.

ERIC Educational Resources Information Center

Bhat, U. Narayan; Nance, Richard E.

The effects of dynamic quantum allocation and swap-time variability on central processing unit (CPU) behavior are investigated using a model that allows both quantum length and swap-time to be state-dependent random variables. Effective CPU utilization is defined to be the proportion of a CPU busy period that is devoted to program processing, i.e.…
A fast three-dimensional gamma evaluation using a GPU utilizing texture memory for on-the-fly interpolations.

PubMed

Persoon, Lucas C G G; Podesta, Mark; van Elmpt, Wouter J C; Nijsten, Sebastiaan M J J G; Verhaegen, Frank

2011-07-01

A widely accepted method to quantify differences in dose distributions is the gamma (gamma) evaluation. Currently, almost all gamma implementations utilize the central processing unit (CPU). Recently, the graphics processing unit (GPU) has become a powerful platform for specific computing tasks. In this study, we describe the implementation of a 3D gamma evaluation using a GPU to improve calculation time. The gamma evaluation algorithm was implemented on an NVIDIA Tesla C2050 GPU using the compute unified device architecture (CUDA). First, several cubic virtual phantoms were simulated. These phantoms were tested with varying dose cube sizes and set-ups, introducing artificial dose differences. Second, to show applicability in clinical practice, five patient cases have been evaluated using the 3D dose distribution from a treatment planning system as the reference and the delivered dose determined during treatment as the comparison. A calculation time comparison between the CPU and GPU was made with varying thread-block sizes including the option of using texture or global memory. A GPU over CPU speed-up of 66 +/- 12 was achieved for the virtual phantoms. For the patient cases, a speed-up of 57 +/- 15 using the GPU was obtained. A thread-block size of 16 x 16 performed best in all cases. The use of texture memory improved the total calculation time, especially when interpolation was applied. Differences between the CPU and GPU gammas were negligible. The GPU and its features, such as texture memory, decreased the calculation time for gamma evaluations considerably without loss of accuracy.
Massively parallel data processing for quantitative total flow imaging with optical coherence microscopy and tomography

NASA Astrophysics Data System (ADS)

Sylwestrzak, Marcin; Szlag, Daniel; Marchand, Paul J.; Kumar, Ashwin S.; Lasser, Theo

2017-08-01

We present an application of massively parallel processing of quantitative flow measurements data acquired using spectral optical coherence microscopy (SOCM). The need for massive signal processing of these particular datasets has been a major hurdle for many applications based on SOCM. In view of this difficulty, we implemented and adapted quantitative total flow estimation algorithms on graphics processing units (GPU) and achieved a 150 fold reduction in processing time when compared to a former CPU implementation. As SOCM constitutes the microscopy counterpart to spectral optical coherence tomography (SOCT), the developed processing procedure can be applied to both imaging modalities. We present the developed DLL library integrated in MATLAB (with an example) and have included the source code for adaptations and future improvements. Catalogue identifier: AFBT_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AFBT_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: GNU GPLv3 No. of lines in distributed program, including test data, etc.: 913552 No. of bytes in distributed program, including test data, etc.: 270876249 Distribution format: tar.gz Programming language: CUDA/C, MATLAB. Computer: Intel x64 CPU, GPU supporting CUDA technology. Operating system: 64-bit Windows 7 Professional. Has the code been vectorized or parallelized?: Yes, CPU code has been vectorized in MATLAB, CUDA code has been parallelized. RAM: Dependent on users parameters, typically between several gigabytes and several tens of gigabytes Classification: 6.5, 18. Nature of problem: Speed up of data processing in optical coherence microscopy Solution method: Utilization of GPU for massively parallel data processing Additional comments: Compiled DLL library with source code and documentation, example of utilization (MATLAB script with raw data) Running time: 1,8 s for one B-scan (150 × faster in comparison to the CPU data processing time)
The association between problematic cellular phone use and risky behaviors and low self-esteem among Taiwanese adolescents.

PubMed

Yang, Yuan-Sheng; Yen, Ju-Yu; Ko, Chih-Hung; Cheng, Chung-Ping; Yen, Cheng-Fang

2010-04-28

Cellular phone use (CPU) is an important part of life for many adolescents. However, problematic CPU may complicate physiological and psychological problems. The aim of our study was to examine the associations between problematic CPU and a series of risky behaviors and low self-esteem in Taiwanese adolescents. A total of 11,111 adolescent students in Southern Taiwan were randomly selected into this study. We used the Problematic Cellular Phone Use Questionnaire to identify the adolescents with problematic CPU. Meanwhile, a series of risky behaviors and self-esteem were evaluated. Multilevel logistic regression analyses were employed to examine the associations between problematic CPU and risky behaviors and low self-esteem regarding gender and age. The results indicated that positive associations were found between problematic CPU and aggression, insomnia, smoking cigarettes, suicidal tendencies, and low self-esteem in all groups with different sexes and ages. However, gender and age differences existed in the associations between problematic CPU and suspension from school, criminal records, tattooing, short nocturnal sleep duration, unprotected sex, illicit drugs use, drinking alcohol and chewing betel nuts. There were positive associations between problematic CPU and a series of risky behaviors and low self-esteem in Taiwanese adolescents. It is worthy for parents and mental health professionals to pay attention to adolescents' problematic CPU.
Effect of Fiber Orientation on Dynamic Compressive Properties of an Ultra-High Performance Concrete

DTIC Science & Technology

2017-08-01

measurements for LSFfiberOrient function for multiple cores. Elapsed time is the total time taken to run ; CPU time is the number of cores times the...Superscripts Maximum value during a test Measured value from a calibration run ...movement left or right. Before cutting, the Cor-Tuf Baseline beam was placed on the table and squared with the blade . The blade was then moved into
An efficient compression scheme for bitmap indices

DOE Office of Scientific and Technical Information (OSTI.GOV)

Wu, Kesheng; Otoo, Ekow J.; Shoshani, Arie

2004-04-13

When using an out-of-core indexing method to answer a query, it is generally assumed that the I/O cost dominates the overall query response time. Because of this, most research on indexing methods concentrate on reducing the sizes of indices. For bitmap indices, compression has been used for this purpose. However, in most cases, operations on these compressed bitmaps, mostly bitwise logical operations such as AND, OR, and NOT, spend more time in CPU than in I/O. To speedup these operations, a number of specialized bitmap compression schemes have been developed; the best known of which is the byte-aligned bitmap codemore » (BBC). They are usually faster in performing logical operations than the general purpose compression schemes, but, the time spent in CPU still dominates the total query response time. To reduce the query response time, we designed a CPU-friendly scheme named the word-aligned hybrid (WAH) code. In this paper, we prove that the sizes of WAH compressed bitmap indices are about two words per row for large range of attributes. This size is smaller than typical sizes of commonly used indices, such as a B-tree. Therefore, WAH compressed indices are not only appropriate for low cardinality attributes but also for high cardinality attributes.In the worst case, the time to operate on compressed bitmaps is proportional to the total size of the bitmaps involved. The total size of the bitmaps required to answer a query on one attribute is proportional to the number of hits. These indicate that WAH compressed bitmap indices are optimal. To verify their effectiveness, we generated bitmap indices for four different datasets and measured the response time of many range queries. Tests confirm that sizes of compressed bitmap indices are indeed smaller than B-tree indices, and query processing with WAH compressed indices is much faster than with BBC compressed indices, projection indices and B-tree indices. In addition, we also verified that the average query response time is proportional to the index size. This indicates that the compressed bitmap indices are efficient for very large datasets.« less
Association between problematic cellular phone use and suicide: the moderating effect of family function and depression.

PubMed

Wang, Peng-Wei; Liu, Tai-Ling; Ko, Chih-Hung; Lin, Huang-Chi; Huang, Mei-Feng; Yeh, Yi-Chun; Yen, Cheng-Fang

2014-02-01

Suicidal ideation and attempt among adolescents are risk factors for eventual completed suicide. Cellular phone use (CPU) has markedly changed the everyday lives of adolescents. Issues about how cellular phone use relates to adolescent mental health, such as suicidal ideation and attempts, are important because of the high rate of cellular phone usage among children in that age group. This study explored the association between problematic CPU and suicidal ideation and attempts among adolescents and investigated how family function and depression influence the association between problematic CPU and suicidal ideation and attempts. A total of 5051 (2872 girls and 2179 boys) adolescents who owned at least one cellular phone completed the research questionnaires. We collected data on participants' CPU and suicidal behavior (ideation and attempts) during the past month as well as information on family function and history of depression. Five hundred thirty-two adolescents (10.54%) had problematic CPU. The rates of suicidal ideation were 23.50% and 11.76% in adolescents with problematic CPU and without problematic CPU, respectively. The rates of suicidal attempts in both groups were 13.70% and 5.45%, respectively. Family function, but not depression, had a moderating effect on the association between problematic CPU and suicidal ideation and attempt. This study highlights the association between problematic CPU and suicidal ideation as well as attempts and indicates that good family function may have a more significant role on reducing the risks of suicidal ideation and attempts in adolescents with problematic CPU than in those without problematic CPU. © 2014.
The association between problematic cellular phone use and risky behaviors and low self-esteem among Taiwanese adolescents

PubMed Central

2010-01-01

Background Cellular phone use (CPU) is an important part of life for many adolescents. However, problematic CPU may complicate physiological and psychological problems. The aim of our study was to examine the associations between problematic CPU and a series of risky behaviors and low self-esteem in Taiwanese adolescents. Methods A total of 11,111 adolescent students in Southern Taiwan were randomly selected into this study. We used the Problematic Cellular Phone Use Questionnaire to identify the adolescents with problematic CPU. Meanwhile, a series of risky behaviors and self-esteem were evaluated. Multilevel logistic regression analyses were employed to examine the associations between problematic CPU and risky behaviors and low self-esteem regarding gender and age. Results The results indicated that positive associations were found between problematic CPU and aggression, insomnia, smoking cigarettes, suicidal tendencies, and low self-esteem in all groups with different sexes and ages. However, gender and age differences existed in the associations between problematic CPU and suspension from school, criminal records, tattooing, short nocturnal sleep duration, unprotected sex, illicit drugs use, drinking alcohol and chewing betel nuts. Conclusions There were positive associations between problematic CPU and a series of risky behaviors and low self-esteem in Taiwanese adolescents. It is worthy for parents and mental health professionals to pay attention to adolescents' problematic CPU. PMID:20426807

OpenMP GNU and Intel Fortran programs for solving the time-dependent Gross-Pitaevskii equation

NASA Astrophysics Data System (ADS)

Young-S., Luis E.; Muruganandam, Paulsamy; Adhikari, Sadhan K.; Lončar, Vladimir; Vudragović, Dušan; Balaž, Antun

2017-11-01

We present Open Multi-Processing (OpenMP) version of Fortran 90 programs for solving the Gross-Pitaevskii (GP) equation for a Bose-Einstein condensate in one, two, and three spatial dimensions, optimized for use with GNU and Intel compilers. We use the split-step Crank-Nicolson algorithm for imaginary- and real-time propagation, which enables efficient calculation of stationary and non-stationary solutions, respectively. The present OpenMP programs are designed for computers with multi-core processors and optimized for compiling with both commercially-licensed Intel Fortran and popular free open-source GNU Fortran compiler. The programs are easy to use and are elaborated with helpful comments for the users. All input parameters are listed at the beginning of each program. Different output files provide physical quantities such as energy, chemical potential, root-mean-square sizes, densities, etc. We also present speedup test results for new versions of the programs. Program files doi:http://dx.doi.org/10.17632/y8zk3jgn84.2 Licensing provisions: Apache License 2.0 Programming language: OpenMP GNU and Intel Fortran 90. Computer: Any multi-core personal computer or workstation with the appropriate OpenMP-capable Fortran compiler installed. Number of processors used: All available CPU cores on the executing computer. Journal reference of previous version: Comput. Phys. Commun. 180 (2009) 1888; ibid.204 (2016) 209. Does the new version supersede the previous version?: Not completely. It does supersede previous Fortran programs from both references above, but not OpenMP C programs from Comput. Phys. Commun. 204 (2016) 209. Nature of problem: The present Open Multi-Processing (OpenMP) Fortran programs, optimized for use with commercially-licensed Intel Fortran and free open-source GNU Fortran compilers, solve the time-dependent nonlinear partial differential (GP) equation for a trapped Bose-Einstein condensate in one (1d), two (2d), and three (3d) spatial dimensions for six different trap symmetries: axially and radially symmetric traps in 3d, circularly symmetric traps in 2d, fully isotropic (spherically symmetric) and fully anisotropic traps in 2d and 3d, as well as 1d traps, where no spatial symmetry is considered. Solution method: We employ the split-step Crank-Nicolson algorithm to discretize the time-dependent GP equation in space and time. The discretized equation is then solved by imaginary- or real-time propagation, employing adequately small space and time steps, to yield the solution of stationary and non-stationary problems, respectively. Reasons for the new version: Previously published Fortran programs [1,2] have now become popular tools [3] for solving the GP equation. These programs have been translated to the C programming language [4] and later extended to the more complex scenario of dipolar atoms [5]. Now virtually all computers have multi-core processors and some have motherboards with more than one physical computer processing unit (CPU), which may increase the number of available CPU cores on a single computer to several tens. The C programs have been adopted to be very fast on such multi-core modern computers using general-purpose graphic processing units (GPGPU) with Nvidia CUDA and computer clusters using Message Passing Interface (MPI) [6]. Nevertheless, previously developed Fortran programs are also commonly used for scientific computation and most of them use a single CPU core at a time in modern multi-core laptops, desktops, and workstations. Unless the Fortran programs are made aware and capable of making efficient use of the available CPU cores, the solution of even a realistic dynamical 1d problem, not to mention the more complicated 2d and 3d problems, could be time consuming using the Fortran programs. Previously, we published auto-parallel Fortran programs [2] suitable for Intel (but not GNU) compiler for solving the GP equation. Hence, a need for the full OpenMP version of the Fortran programs to reduce the execution time cannot be overemphasized. To address this issue, we provide here such OpenMP Fortran programs, optimized for both Intel and GNU Fortran compilers and capable of using all available CPU cores, which can significantly reduce the execution time. Summary of revisions: Previous Fortran programs [1] for solving the time-dependent GP equation in 1d, 2d, and 3d with different trap symmetries have been parallelized using the OpenMP interface to reduce the execution time on multi-core processors. There are six different trap symmetries considered, resulting in six programs for imaginary-time propagation and six for real-time propagation, totaling to 12 programs included in BEC-GP-OMP-FOR software package. All input data (number of atoms, scattering length, harmonic oscillator trap length, trap anisotropy, etc.) are conveniently placed at the beginning of each program, as before [2]. Present programs introduce a new input parameter, which is designated by Number_of_Threads and defines the number of CPU cores of the processor to be used in the calculation. If one sets the value 0 for this parameter, all available CPU cores will be used. For the most efficient calculation it is advisable to leave one CPU core unused for the background system's jobs. For example, on a machine with 20 CPU cores such that we used for testing, it is advisable to use up to 19 CPU cores. However, the total number of used CPU cores can be divided into more than one job. For instance, one can run three simulations simultaneously using 10, 4, and 5 CPU cores, respectively, thus totaling to 19 used CPU cores on a 20-core computer. The Fortran source programs are located in the directory src, and can be compiled by the make command using the makefile in the root directory BEC-GP-OMP-FOR of the software package. The examples of produced output files can be found in the directory output, although some large density files are omitted, to save space. The programs calculate the values of actually used dimensionless nonlinearities from the physical input parameters, where the input parameters correspond to the identical nonlinearity values as in the previously published programs [1], so that the output files of the old and new programs can be directly compared. The output files are conveniently named such that their contents can be easily identified, following the naming convention introduced in Ref. [2]. For example, a file named -out.txt, where is a name of the individual program, represents the general output file containing input data, time and space steps, nonlinearity, energy and chemical potential, and was named fort.7 in the old Fortran version of programs [1]. A file named -den.txt is the output file with the condensate density, which had the names fort.3 and fort.4 in the old Fortran version [1] for imaginary- and real-time propagation programs, respectively. Other possible density outputs, such as the initial density, are commented out in the programs to have a simpler set of output files, but users can uncomment and re-enable them, if needed. In addition, there are output files for reduced (integrated) 1d and 2d densities for different programs. In the real-time programs there is also an output file reporting the dynamics of evolution of root-mean-square sizes after a perturbation is introduced. The supplied real-time programs solve the stationary GP equation, and then calculate the dynamics. As the imaginary-time programs are more accurate than the real-time programs for the solution of a stationary problem, one can first solve the stationary problem using the imaginary-time programs, adapt the real-time programs to read the pre-calculated wave function and then study the dynamics. In that case the parameter NSTP in the real-time programs should be set to zero and the space mesh and nonlinearity parameters should be identical in both programs. The reader is advised to consult our previous publication where a complete description of the output files is given [2]. A readme.txt file, included in the root directory, explains the procedure to compile and run the programs. We tested our programs on a workstation with two 10-core Intel Xeon E5-2650 v3 CPUs. The parameters used for testing are given in sample input files, provided in the corresponding directory together with the programs. In Table 1 we present wall-clock execution times for runs on 1, 6, and 19 CPU cores for programs compiled using Intel and GNU Fortran compilers. The corresponding columns "Intel speedup" and "GNU speedup" give the ratio of wall-clock execution times of runs on 1 and 19 CPU cores, and denote the actual measured speedup for 19 CPU cores. In all cases and for all numbers of CPU cores, although the GNU Fortran compiler gives excellent results, the Intel Fortran compiler turns out to be slightly faster. Note that during these tests we always ran only a single simulation on a workstation at a time, to avoid any possible interference issues. Therefore, the obtained wall-clock times are more reliable than the ones that could be measured with two or more jobs running simultaneously. We also studied the speedup of the programs as a function of the number of CPU cores used. The performance of the Intel and GNU Fortran compilers is illustrated in Fig. 1, where we plot the speedup and actual wall-clock times as functions of the number of CPU cores for 2d and 3d programs. We see that the speedup increases monotonically with the number of CPU cores in all cases and has large values (between 10 and 14 for 3d programs) for the maximal number of cores. This fully justifies the development of OpenMP programs, which enable much faster and more efficient solving of the GP equation. However, a slow saturation in the speedup with the further increase in the number of CPU cores is observed in all cases, as expected. The speedup tends to increase for programs in higher dimensions, as they become more complex and have to process more data. This is why the speedups of the supplied 2d and 3d programs are larger than those of 1d programs. Also, for a single program the speedup increases with the size of the spatial grid, i.e., with the number of spatial discretization points, since this increases the amount of calculations performed by the program. To demonstrate this, we tested the supplied real2d-th program and varied the number of spatial discretization points NX=NY from 20 to 1000. The measured speedup obtained when running this program on 19 CPU cores as a function of the number of discretization points is shown in Fig. 2. The speedup first increases rapidly with the number of discretization points and eventually saturates. Additional comments: Example inputs provided with the programs take less than 30 minutes to run on a workstation with two Intel Xeon E5-2650 v3 processors (2 QPI links, 10 CPU cores, 25 MB cache, 2.3 GHz).



      
      Study on efficiency of time computation in x-ray imaging simulation base on Monte Carlo algorithm using graphics processing unit
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Setiani, Tia Dwi, E-mail: tiadwisetiani@gmail.com; Suprijadi; Nuclear Physics and Biophysics Reaserch Division, Faculty of Mathematics and Natural Sciences, Institut Teknologi Bandung Jalan Ganesha 10 Bandung, 40132
         
         Monte Carlo (MC) is one of the powerful techniques for simulation in x-ray imaging. MC method can simulate the radiation transport within matter with high accuracy and provides a natural way to simulate radiation transport in complex systems. One of the codes based on MC algorithm that are widely used for radiographic images simulation is MC-GPU, a codes developed by Andrea Basal. This study was aimed to investigate the time computation of x-ray imaging simulation in GPU (Graphics Processing Unit) compared to a standard CPU (Central Processing Unit). Furthermore, the effect of physical parameters to the quality of radiographic imagesmore » and the comparison of image quality resulted from simulation in the GPU and CPU are evaluated in this paper. The simulations were run in CPU which was simulated in serial condition, and in two GPU with 384 cores and 2304 cores. In simulation using GPU, each cores calculates one photon, so, a large number of photon were calculated simultaneously. Results show that the time simulations on GPU were significantly accelerated compared to CPU. The simulations on the 2304 core of GPU were performed about 64 -114 times faster than on CPU, while the simulation on the 384 core of GPU were performed about 20 – 31 times faster than in a single core of CPU. Another result shows that optimum quality of images from the simulation was gained at the history start from 10{sup 8} and the energy from 60 Kev to 90 Kev. Analyzed by statistical approach, the quality of GPU and CPU images are relatively the same.« less
      

      
      GPU-Acceleration of Sequence Homology Searches with Database Subsequence Clustering.
      PubMed
      Suzuki, Shuji; Kakuta, Masanori; Ishida, Takashi; Akiyama, Yutaka
         2016-01-01
         Sequence homology searches are used in various fields and require large amounts of computation time, especially for metagenomic analysis, owing to the large number of queries and the database size. To accelerate computing analyses, graphics processing units (GPUs) are widely used as a low-cost, high-performance computing platform. Therefore, we mapped the time-consuming steps involved in GHOSTZ, which is a state-of-the-art homology search algorithm for protein sequences, onto a GPU and implemented it as GHOSTZ-GPU. In addition, we optimized memory access for GPU calculations and for communication between the CPU and GPU. As per results of the evaluation test involving metagenomic data, GHOSTZ-GPU with 12 CPU threads and 1 GPU was approximately 3.0- to 4.1-fold faster than GHOSTZ with 12 CPU threads. Moreover, GHOSTZ-GPU with 12 CPU threads and 3 GPUs was approximately 5.8- to 7.7-fold faster than GHOSTZ with 12 CPU threads.
      

      
      File Usage Analysis and Resource Usage Prediction: a Measurement-Based Study. Ph.D. Thesis
      NASA Technical Reports Server (NTRS)
      Devarakonda, Murthy V.-S.
         1987-01-01
         A probabilistic scheme was developed to predict process resource usage in UNIX. Given the identity of the program being run, the scheme predicts CPU time, file I/O, and memory requirements of a process at the beginning of its life. The scheme uses a state-transition model of the program's resource usage in its past executions for prediction. The states of the model are the resource regions obtained from an off-line cluster analysis of processes run on the system. The proposed method is shown to work on data collected from a VAX 11/780 running 4.3 BSD UNIX. The results show that the predicted values correlate well with the actual. The coefficient of correlation between the predicted and actual values of CPU time is 0.84. Errors in prediction are mostly small. Some 82% of errors in CPU time prediction are less than 0.5 standard deviations of process CPU time.
      

      
      Predictability of process resource usage - A measurement-based study on UNIX
      NASA Technical Reports Server (NTRS)
      Devarakonda, Murthy V.; Iyer, Ravishankar K.
         1989-01-01
         A probabilistic scheme is developed to predict process resource usage in UNIX. Given the identity of the program being run, the scheme predicts CPU time, file I/O, and memory requirements of a process at the beginning of its life. The scheme uses a state-transition model of the program's resource usage in its past executions for prediction. The states of the model are the resource regions obtained from an off-line cluster analysis of processes run on the system. The proposed method is shown to work on data collected from a VAX 11/780 running 4.3 BSD UNIX. The results show that the predicted values correlate well with the actual. The correlation coefficient betweeen the predicted and actual values of CPU time is 0.84. Errors in prediction are mostly small. Some 82 percent of errors in CPU time prediction are less than 0.5 standard deviations of process CPU time.
      

      
      Predictability of process resource usage: A measurement-based study of UNIX
      NASA Technical Reports Server (NTRS)
      Devarakonda, Murthy V.; Iyer, Ravishankar K.
         1987-01-01
         A probabilistic scheme is developed to predict process resource usage in UNIX. Given the identity of the program being run, the scheme predicts CPU time, file I/O, and memory requirements of a process at the beginning of its life. The scheme uses a state-transition model of the program's resource usage in its past executions for prediction. The states of the model are the resource regions obtained from an off-line cluster analysis of processes run on the system. The proposed method is shown to work on data collected from a VAX 11/780 running 4.3 BSD UNIX. The results show that the predicted values correlate well with the actual. The correlation coefficient between the predicted and actual values of CPU time is 0.84. Errors in prediction are mostly small. Some 82% of errors in CPU time prediction are less than 0.5 standard deviations of process CPU time.
      

      
      A Spiking Neural Simulator Integrating Event-Driven and Time-Driven Computation Schemes Using Parallel CPU-GPU Co-Processing: A Case Study.
      PubMed
      Naveros, Francisco; Luque, Niceto R; Garrido, Jesús A; Carrillo, Richard R; Anguita, Mancia; Ros, Eduardo
         2015-07-01
         Time-driven simulation methods in traditional CPU architectures perform well and precisely when simulating small-scale spiking neural networks. Nevertheless, they still have drawbacks when simulating large-scale systems. Conversely, event-driven simulation methods in CPUs and time-driven simulation methods in graphic processing units (GPUs) can outperform CPU time-driven methods under certain conditions. With this performance improvement in mind, we have developed an event-and-time-driven spiking neural network simulator suitable for a hybrid CPU-GPU platform. Our neural simulator is able to efficiently simulate bio-inspired spiking neural networks consisting of different neural models, which can be distributed heterogeneously in both small layers and large layers or subsystems. For the sake of efficiency, the low-activity parts of the neural network can be simulated in CPU using event-driven methods while the high-activity subsystems can be simulated in either CPU (a few neurons) or GPU (thousands or millions of neurons) using time-driven methods. In this brief, we have undertaken a comparative study of these different simulation methods. For benchmarking the different simulation methods and platforms, we have used a cerebellar-inspired neural-network model consisting of a very dense granular layer and a Purkinje layer with a smaller number of cells (according to biological ratios). Thus, this cerebellar-like network includes a dense diverging neural layer (increasing the dimensionality of its internal representation and sparse coding) and a converging neural layer (integration) similar to many other biologically inspired and also artificial neural networks.
      

      
      On the cost of approximating and recognizing a noise perturbed straight line or a quadratic curve segment in the plane. [central processing units
      NASA Technical Reports Server (NTRS)
      Cooper, D. B.; Yalabik, N.
         1975-01-01
         Approximation of noisy data in the plane by straight lines or elliptic or single-branch hyperbolic curve segments arises in pattern recognition, data compaction, and other problems. The efficient search for and approximation of data by such curves were examined. Recursive least-squares linear curve-fitting was used, and ellipses and hyperbolas are parameterized as quadratic functions in x and y. The error minimized by the algorithm is interpreted, and central processing unit (CPU) times for estimating parameters for fitting straight lines and quadratic curves were determined and compared. CPU time for data search was also determined for the case of straight line fitting. Quadratic curve fitting is shown to require about six times as much CPU time as does straight line fitting, and curves relating CPU time and fitting error were determined for straight line fitting. Results are derived on early sequential determination of whether or not the underlying curve is a straight line.
      

      
      Accelerating moderately stiff chemical kinetics in reactive-flow simulations using GPUs
      NASA Astrophysics Data System (ADS)
      Niemeyer, Kyle E.; Sung, Chih-Jen
         2014-01-01
         The chemical kinetics ODEs arising from operator-split reactive-flow simulations were solved on GPUs using explicit integration algorithms. Nonstiff chemical kinetics of a hydrogen oxidation mechanism (9 species and 38 irreversible reactions) were computed using the explicit fifth-order Runge-Kutta-Cash-Karp method, and the GPU-accelerated version performed faster than single- and six-core CPU versions by factors of 126 and 25, respectively, for 524,288 ODEs. Moderately stiff kinetics, represented with mechanisms for hydrogen/carbon-monoxide (13 species and 54 irreversible reactions) and methane (53 species and 634 irreversible reactions) oxidation, were computed using the stabilized explicit second-order Runge-Kutta-Chebyshev (RKC) algorithm. The GPU-based RKC implementation demonstrated an increase in performance of nearly 59 and 10 times, for problem sizes consisting of 262,144 ODEs and larger, than the single- and six-core CPU-based RKC algorithms using the hydrogen/carbon-monoxide mechanism. With the methane mechanism, RKC-GPU performed more than 65 and 11 times faster, for problem sizes consisting of 131,072 ODEs and larger, than the single- and six-core RKC-CPU versions, and up to 57 times faster than the six-core CPU-based implicit VODE algorithm on 65,536 ODEs. In the presence of more severe stiffness, such as ethylene oxidation (111 species and 1566 irreversible reactions), RKC-GPU performed more than 17 times faster than RKC-CPU on six cores for 32,768 ODEs and larger, and at best 4.5 times faster than VODE on six CPU cores for 65,536 ODEs. With a larger time step size, RKC-GPU performed at best 2.5 times slower than six-core VODE for 8192 ODEs and larger. Therefore, the need for developing new strategies for integrating stiff chemistry on GPUs was discussed.
      

      
      Novel hybrid GPU-CPU implementation of parallelized Monte Carlo parametric expectation maximization estimation method for population pharmacokinetic data analysis.
      PubMed
      Ng, C M
         2013-10-01
         The development of a population PK/PD model, an essential component for model-based drug development, is both time- and labor-intensive. A graphical-processing unit (GPU) computing technology has been proposed and used to accelerate many scientific computations. The objective of this study was to develop a hybrid GPU-CPU implementation of parallelized Monte Carlo parametric expectation maximization (MCPEM) estimation algorithm for population PK data analysis. A hybrid GPU-CPU implementation of the MCPEM algorithm (MCPEMGPU) and identical algorithm that is designed for the single CPU (MCPEMCPU) were developed using MATLAB in a single computer equipped with dual Xeon 6-Core E5690 CPU and a NVIDIA Tesla C2070 GPU parallel computing card that contained 448 stream processors. Two different PK models with rich/sparse sampling design schemes were used to simulate population data in assessing the performance of MCPEMCPU and MCPEMGPU. Results were analyzed by comparing the parameter estimation and model computation times. Speedup factor was used to assess the relative benefit of parallelized MCPEMGPU over MCPEMCPU in shortening model computation time. The MCPEMGPU consistently achieved shorter computation time than the MCPEMCPU and can offer more than 48-fold speedup using a single GPU card. The novel hybrid GPU-CPU implementation of parallelized MCPEM algorithm developed in this study holds a great promise in serving as the core for the next-generation of modeling software for population PK/PD analysis.
      

      
      GPU Optimizations for a Production Molecular Docking Code*
      PubMed Central
      Landaverde, Raphael; Herbordt, Martin C.
         2015-01-01
         Modeling molecular docking is critical to both understanding life processes and designing new drugs. In previous work we created the first published GPU-accelerated docking code (PIPER) which achieved a roughly 5× speed-up over a contemporaneous 4 core CPU. Advances in GPU architecture and in the CPU code, however, have since reduced this relalative performance by a factor of 10. In this paper we describe the upgrade of GPU PIPER. This required an entire rewrite, including algorithm changes and moving most remaining non-accelerated CPU code onto the GPU. The result is a 7× improvement in GPU performance and a 3.3× speedup over the CPU-only code. We find that this difference in time is almost entirely due to the difference in run times of the 3D FFT library functions on CPU (MKL) and GPU (cuFFT), respectively. The GPU code has been integrated into the ClusPro docking server which has over 4000 active users. PMID:26594667
      

      
      GPU-Acceleration of Sequence Homology Searches with Database Subsequence Clustering
      PubMed Central
      Suzuki, Shuji; Kakuta, Masanori; Ishida, Takashi; Akiyama, Yutaka
         2016-01-01
         Sequence homology searches are used in various fields and require large amounts of computation time, especially for metagenomic analysis, owing to the large number of queries and the database size. To accelerate computing analyses, graphics processing units (GPUs) are widely used as a low-cost, high-performance computing platform. Therefore, we mapped the time-consuming steps involved in GHOSTZ, which is a state-of-the-art homology search algorithm for protein sequences, onto a GPU and implemented it as GHOSTZ-GPU. In addition, we optimized memory access for GPU calculations and for communication between the CPU and GPU. As per results of the evaluation test involving metagenomic data, GHOSTZ-GPU with 12 CPU threads and 1 GPU was approximately 3.0- to 4.1-fold faster than GHOSTZ with 12 CPU threads. Moreover, GHOSTZ-GPU with 12 CPU threads and 3 GPUs was approximately 5.8- to 7.7-fold faster than GHOSTZ with 12 CPU threads. PMID:27482905
      

      
      GPU Optimizations for a Production Molecular Docking Code.
      PubMed
      Landaverde, Raphael; Herbordt, Martin C
         2014-09-01
         Modeling molecular docking is critical to both understanding life processes and designing new drugs. In previous work we created the first published GPU-accelerated docking code (PIPER) which achieved a roughly 5× speed-up over a contemporaneous 4 core CPU. Advances in GPU architecture and in the CPU code, however, have since reduced this relalative performance by a factor of 10. In this paper we describe the upgrade of GPU PIPER. This required an entire rewrite, including algorithm changes and moving most remaining non-accelerated CPU code onto the GPU. The result is a 7× improvement in GPU performance and a 3.3× speedup over the CPU-only code. We find that this difference in time is almost entirely due to the difference in run times of the 3D FFT library functions on CPU (MKL) and GPU (cuFFT), respectively. The GPU code has been integrated into the ClusPro docking server which has over 4000 active users.
      

      
      The Research and Test of Fast Radio Burst Real-time Search Algorithm Based on GPU Acceleration
      NASA Astrophysics Data System (ADS)
      Wang, J.; Chen, M. Z.; Pei, X.; Wang, Z. Q.
         2017-03-01
         In order to satisfy the research needs of Nanshan 25 m radio telescope of Xinjiang Astronomical Observatory (XAO) and study the key technology of the planned QiTai radio Telescope (QTT), the receiver group of XAO studied the GPU (Graphics Processing Unit) based real-time FRB searching algorithm which developed from the original FRB searching algorithm based on CPU (Central Processing Unit), and built the FRB real-time searching system. The comparison of the GPU system and the CPU system shows that: on the basis of ensuring the accuracy of the search, the speed of the GPU accelerated algorithm is improved by 35-45 times compared with the CPU algorithm.
      

      
      Efficient and portable acceleration of quantum chemical many-body methods in mixed floating point precision using OpenACC compiler directives
      NASA Astrophysics Data System (ADS)
      Eriksen, Janus J.
         2017-09-01
         It is demonstrated how the non-proprietary OpenACC standard of compiler directives may be used to compactly and efficiently accelerate the rate-determining steps of two of the most routinely applied many-body methods of electronic structure theory, namely the second-order Møller-Plesset (MP2) model in its resolution-of-the-identity approximated form and the (T) triples correction to the coupled cluster singles and doubles model (CCSD(T)). By means of compute directives as well as the use of optimised device math libraries, the operations involved in the energy kernels have been ported to graphics processing unit (GPU) accelerators, and the associated data transfers correspondingly optimised to such a degree that the final implementations (using either double and/or single precision arithmetics) are capable of scaling to as large systems as allowed for by the capacity of the host central processing unit (CPU) main memory. The performance of the hybrid CPU/GPU implementations is assessed through calculations on test systems of alanine amino acid chains using one-electron basis sets of increasing size (ranging from double- to pentuple-ζ quality). For all but the smallest problem sizes of the present study, the optimised accelerated codes (using a single multi-core CPU host node in conjunction with six GPUs) are found to be capable of reducing the total time-to-solution by at least an order of magnitude over optimised, OpenMP-threaded CPU-only reference implementations.
      

      
      Symplectic multi-particle tracking on GPUs
      NASA Astrophysics Data System (ADS)
      Liu, Zhicong; Qiang, Ji
         2018-05-01
         A symplectic multi-particle tracking model is implemented on the Graphic Processing Units (GPUs) using the Compute Unified Device Architecture (CUDA) language. The symplectic tracking model can preserve phase space structure and reduce non-physical effects in long term simulation, which is important for beam property evaluation in particle accelerators. Though this model is computationally expensive, it is very suitable for parallelization and can be accelerated significantly by using GPUs. In this paper, we optimized the implementation of the symplectic tracking model on both single GPU and multiple GPUs. Using a single GPU processor, the code achieves a factor of 2-10 speedup for a range of problem sizes compared with the time on a single state-of-the-art Central Processing Unit (CPU) node with similar power consumption and semiconductor technology. It also shows good scalability on a multi-GPU cluster at Oak Ridge Leadership Computing Facility. In an application to beam dynamics simulation, the GPU implementation helps save more than a factor of two total computing time in comparison to the CPU implementation.
      

      
      SU-E-J-91: FFT Based Medical Image Registration Using a Graphics Processing Unit (GPU).
      PubMed
      Luce, J; Hoggarth, M; Lin, J; Block, A; Roeske, J
         2012-06-01
         To evaluate the efficiency gains obtained from using a Graphics Processing Unit (GPU) to perform a Fourier Transform (FT) based image registration. Fourier-based image registration involves obtaining the FT of the component images, and analyzing them in Fourier space to determine the translations and rotations of one image set relative to another. An important property of FT registration is that by enlarging the images (adding additional pixels), one can obtain translations and rotations with sub-pixel resolution. The expense, however, is an increased computational time. GPUs may decrease the computational time associated with FT image registration by taking advantage of their parallel architecture to perform matrix computations much more efficiently than a Central Processor Unit (CPU). In order to evaluate the computational gains produced by a GPU, images with known translational shifts were utilized. A program was written in the Interactive Data Language (IDL; Exelis, Boulder, CO) to performCPU-based calculations. Subsequently, the program was modified using GPU bindings (Tech-X, Boulder, CO) to perform GPU-based computation on the same system. Multiple image sizes were used, ranging from 256×256 to 2304×2304. The time required to complete the full algorithm by the CPU and GPU were benchmarked and the speed increase was defined as the ratio of the CPU-to-GPU computational time. The ratio of the CPU-to- GPU time was greater than 1.0 for all images, which indicates the GPU is performing the algorithm faster than the CPU. The smallest improvement, a 1.21 ratio, was found with the smallest image size of 256×256, and the largest speedup, a 4.25 ratio, was observed with the largest image size of 2304×2304. GPU programming resulted in a significant decrease in computational time associated with a FT image registration algorithm. The inclusion of the GPU may provide near real-time, sub-pixel registration capability. © 2012 American Association of Physicists in Medicine.
      

      
      A Hybrid CPU/GPU Pattern-Matching Algorithm for Deep Packet Inspection
      PubMed Central
      Chen, Yaw-Chung
         2015-01-01
         The large quantities of data now being transferred via high-speed networks have made deep packet inspection indispensable for security purposes. Scalable and low-cost signature-based network intrusion detection systems have been developed for deep packet inspection for various software platforms. Traditional approaches that only involve central processing units (CPUs) are now considered inadequate in terms of inspection speed. Graphic processing units (GPUs) have superior parallel processing power, but transmission bottlenecks can reduce optimal GPU efficiency. In this paper we describe our proposal for a hybrid CPU/GPU pattern-matching algorithm (HPMA) that divides and distributes the packet-inspecting workload between a CPU and GPU. All packets are initially inspected by the CPU and filtered using a simple pre-filtering algorithm, and packets that might contain malicious content are sent to the GPU for further inspection. Test results indicate that in terms of random payload traffic, the matching speed of our proposed algorithm was 3.4 times and 2.7 times faster than those of the AC-CPU and AC-GPU algorithms, respectively. Further, HPMA achieved higher energy efficiency than the other tested algorithms. PMID:26437335
      

      
      A Hybrid CPU/GPU Pattern-Matching Algorithm for Deep Packet Inspection.
      PubMed
      Lee, Chun-Liang; Lin, Yi-Shan; Chen, Yaw-Chung
         2015-01-01
         The large quantities of data now being transferred via high-speed networks have made deep packet inspection indispensable for security purposes. Scalable and low-cost signature-based network intrusion detection systems have been developed for deep packet inspection for various software platforms. Traditional approaches that only involve central processing units (CPUs) are now considered inadequate in terms of inspection speed. Graphic processing units (GPUs) have superior parallel processing power, but transmission bottlenecks can reduce optimal GPU efficiency. In this paper we describe our proposal for a hybrid CPU/GPU pattern-matching algorithm (HPMA) that divides and distributes the packet-inspecting workload between a CPU and GPU. All packets are initially inspected by the CPU and filtered using a simple pre-filtering algorithm, and packets that might contain malicious content are sent to the GPU for further inspection. Test results indicate that in terms of random payload traffic, the matching speed of our proposed algorithm was 3.4 times and 2.7 times faster than those of the AC-CPU and AC-GPU algorithms, respectively. Further, HPMA achieved higher energy efficiency than the other tested algorithms.
      

      
      GPU Accelerated Clustering for Arbitrary Shapes in Geoscience Data
      NASA Astrophysics Data System (ADS)
      Pankratius, V.; Gowanlock, M.; Rude, C. M.; Li, J. D.
         2016-12-01
         Clustering algorithms have become a vital component in intelligent systems for geoscience that helps scientists discover and track phenomena of various kinds. Here, we outline advances in Density-Based Spatial Clustering of Applications with Noise (DBSCAN) which detects clusters of arbitrary shape that are common in geospatial data. In particular, we propose a hybrid CPU-GPU implementation of DBSCAN and highlight new optimization approaches on the GPU that allows clustering detection in parallel while optimizing data transport during CPU-GPU interactions. We employ an efficient batching scheme between the host and GPU such that limited GPU memory is not prohibitive when processing large and/or dense datasets. To minimize data transfer overhead, we estimate the total workload size and employ an execution that generates optimized batches that will not overflow the GPU buffer. This work is demonstrated on space weather Total Electron Content (TEC) datasets containing over 5 million measurements from instruments worldwide, and allows scientists to spot spatially coherent phenomena with ease. Our approach is up to 30 times faster than a sequential implementation and therefore accelerates discoveries in large datasets. We acknowledge support from NSF ACI-1442997.


   
       
            
              
          

«

1
      2
      3
   4
      5
      »

          
        

           
           
             
               
      
      Clinical implementation of a GPU-based simplified Monte Carlo method for a treatment planning system of proton beam therapy.
      PubMed
      Kohno, R; Hotta, K; Nishioka, S; Matsubara, K; Tansho, R; Suzuki, T
         2011-11-21
         We implemented the simplified Monte Carlo (SMC) method on graphics processing unit (GPU) architecture under the computer-unified device architecture platform developed by NVIDIA. The GPU-based SMC was clinically applied for four patients with head and neck, lung, or prostate cancer. The results were compared to those obtained by a traditional CPU-based SMC with respect to the computation time and discrepancy. In the CPU- and GPU-based SMC calculations, the estimated mean statistical errors of the calculated doses in the planning target volume region were within 0.5% rms. The dose distributions calculated by the GPU- and CPU-based SMCs were similar, within statistical errors. The GPU-based SMC showed 12.30-16.00 times faster performance than the CPU-based SMC. The computation time per beam arrangement using the GPU-based SMC for the clinical cases ranged 9-67 s. The results demonstrate the successful application of the GPU-based SMC to a clinical proton treatment planning.
      

      
      The Performance of the NAS HSPs in 1st Half of 1994
      NASA Technical Reports Server (NTRS)
      Bergeron, Robert J.; Walter, Howard (Technical Monitor)
         1995-01-01
         During the first six months of 1994, the NAS (National Airspace System) 16-CPU Y-MP C90 Von Neumann (VN) delivered an average throughput of 4.045 GFLOPS while the ACSF (Aeronautics Consolidated Supercomputer Facility) 8-CPU Y-MP C90 Eagle averaged 1.658 GFLOPS. The VN rate represents a machine efficiency of 26.3% whereas the Eagle rate corresponds to a machine efficiency of 21.6%. VN displayed a greater efficiency than Eagle primarily because the stronger workload demand for its CPU cycles allowed it to devote more time to user programs and less time to idle. An additional factor increasing VN efficiency was the ability of the UNICOS 8.0 Operating System to deliver a larger fraction of CPU time to user programs. Although measurements indicate increasing vector length for both workloads, insufficient vector lengths continue to hinder HSP (High Speed Processor) performance. To improve HSP performance, NAS should continue to encourage the HSP users to modify their codes to increase program vector length.
      

      
      Static and Dynamic Frequency Scaling on Multicore CPUs
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Bao, Wenlei; Hong, Changwan; Chunduri, Sudheer
         2016-12-28
         Dynamic voltage and frequency scaling (DVFS) adapts CPU power consumption by modifying a processor’s operating frequency (and the associated voltage). Typical approaches employing DVFS involve default strategies such as running at the lowest or the highest frequency, or observing the CPU’s runtime behavior and dynamically adapting the voltage/frequency configuration based on CPU usage. In this paper, we argue that many previous approaches suffer from inherent limitations, such as not account- ing for processor-specific impact of frequency changes on energy for different workload types. We first propose a lightweight runtime-based approach to automatically adapt the frequency based on the CPU workload,more » that is agnostic of the processor characteristics. We then show that further improvements can be achieved for affine kernels in the application, using a compile-time characterization instead of run-time monitoring to select the frequency and number of CPU cores to use. Our framework relies on a one-time energy characterization of CPU-specific DVFS profiles followed by a compile-time categorization of loop-based code segments in the application. These are combined to determine a priori of the frequency and the number of cores to use to execute the application so as to optimize energy or energy-delay product, outperforming runtime approach. Extensive evaluation on 60 benchmarks and five multi-core CPUs show that our approach systematically outperforms the powersave Linux governor, while improving overall performance.« less
      

      
      Characterization and referral patterns of ST-elevation myocardial infarction patients admitted to chest pain units rather than directly to catherization laboratories. Data from the German Chest Pain Unit Registry.
      PubMed
      Schmidt, Frank P; Perne, Andrea; Hochadel, Matthias; Giannitsis, Evangelos; Darius, Harald; Maier, Lars S; Schmitt, Claus; Heusch, Gerd; Voigtländer, Thomas; Mudra, Harald; Gori, Tommaso; Senges, Jochen; Münzel, Thomas
         2017-03-15
         Direct transfer to the catheterization laboratory for primary percutaneous coronary intervention (PCI) is standard of care for patients with ST-segment elevation myocardial infarction (STEMI). Nevertheless, a significant number of STEMI-patients are initially treated in chest pain units (CPUs) of admitting hospitals. Thus, it is important to characterize these patients and to define why an important deviation from recommended clinical pathways occurs and in particular to quantify the impact of deviation on critical time intervals. 1679 STEMI patients admitted to a CPU in the period from 2010 to 2015 were enrolled in the German CPU registry (8.5% of 19,666). 55.9% of the patients were delivered by an emergency medical system (EMS), 16.1% transferred from other hospitals and 15.2% referred by a general practitioner (GP). 12.7% were self-referrals. 55% did not get a pre-hospital ECG. Compared to the EMS, referral by GPs markedly delayed critical time intervals while a pre-hospital ECG demonstrating ST-segment elevation reduced door-to-balloon time. When compared to STEMI patients (n=21,674) enrolled in the ALKK-registry, CPU-STEMI patients had a lower risk profile, their treatment in the CPU was guideline-conform and in-hospital mortality was low (1.5%). CPU-STEMI patients represent a numerically significant group because a pre-hospital ECG was not documented. Treatment in the CPU is guideline-conform and the intra-hospital mortality is low. The lack of a pre-hospital ECG and admission via the GP substantially delay critical time intervals suggesting that in patients with symptoms suggestive an ACS, the EMS should be contacted and not the GP. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
      

      
      GPU-Q-J, a fast method for calculating root mean square deviation (RMSD) after optimal superposition
      PubMed Central
      
         2011-01-01
         Background Calculation of the root mean square deviation (RMSD) between the atomic coordinates of two optimally superposed structures is a basic component of structural comparison techniques. We describe a quaternion based method, GPU-Q-J, that is stable with single precision calculations and suitable for graphics processor units (GPUs). The application was implemented on an ATI 4770 graphics card in C/C++ and Brook+ in Linux where it was 260 to 760 times faster than existing unoptimized CPU methods. Source code is available from the Compbio website http://software.compbio.washington.edu/misc/downloads/st_gpu_fit/ or from the author LHH. Findings The Nutritious Rice for the World Project (NRW) on World Community Grid predicted de novo, the structures of over 62,000 small proteins and protein domains returning a total of 10 billion candidate structures. Clustering ensembles of structures on this scale requires calculation of large similarity matrices consisting of RMSDs between each pair of structures in the set. As a real-world test, we calculated the matrices for 6 different ensembles from NRW. The GPU method was 260 times faster that the fastest existing CPU based method and over 500 times faster than the method that had been previously used. Conclusions GPU-Q-J is a significant advance over previous CPU methods. It relieves a major bottleneck in the clustering of large numbers of structures for NRW. It also has applications in structure comparison methods that involve multiple superposition and RMSD determination steps, particularly when such methods are applied on a proteome and genome wide scale. PMID:21453553
      

      
      Simulation Testing of Embedded Flight Software
      NASA Technical Reports Server (NTRS)
      Shahabuddin, Mohammad; Reinholtz, William
         2004-01-01
         Virtual Real Time (VRT) is a computer program for testing embedded flight software by computational simulation in a workstation, in contradistinction to testing it in its target central processing unit (CPU). The disadvantages of testing in the target CPU include the need for an expensive test bed, the necessity for testers and programmers to take turns using the test bed, and the lack of software tools for debugging in a real-time environment. By virtue of its architecture, most of the flight software of the type in question is amenable to development and testing on workstations, for which there is an abundance of commercially available debugging and analysis software tools. Unfortunately, the timing of a workstation differs from that of a target CPU in a test bed. VRT, in conjunction with closed-loop simulation software, provides a capability for executing embedded flight software on a workstation in a close-to-real-time environment. A scale factor is used to convert between execution time in VRT on a workstation and execution on a target CPU. VRT includes high-resolution operating- system timers that enable the synchronization of flight software with simulation software and ground software, all running on different workstations.
      

      
      Use of general purpose graphics processing units with MODFLOW
      USGS Publications Warehouse
      Hughes, Joseph D.; White, Jeremy T.
         2013-01-01
         To evaluate the use of general-purpose graphics processing units (GPGPUs) to improve the performance of MODFLOW, an unstructured preconditioned conjugate gradient (UPCG) solver has been developed. The UPCG solver uses a compressed sparse row storage scheme and includes Jacobi, zero fill-in incomplete, and modified-incomplete lower-upper (LU) factorization, and generalized least-squares polynomial preconditioners. The UPCG solver also includes options for sequential and parallel solution on the central processing unit (CPU) using OpenMP. For simulations utilizing the GPGPU, all basic linear algebra operations are performed on the GPGPU; memory copies between the central processing unit CPU and GPCPU occur prior to the first iteration of the UPCG solver and after satisfying head and flow criteria or exceeding a maximum number of iterations. The efficiency of the UPCG solver for GPGPU and CPU solutions is benchmarked using simulations of a synthetic, heterogeneous unconfined aquifer with tens of thousands to millions of active grid cells. Testing indicates GPGPU speedups on the order of 2 to 8, relative to the standard MODFLOW preconditioned conjugate gradient (PCG) solver, can be achieved when (1) memory copies between the CPU and GPGPU are optimized, (2) the percentage of time performing memory copies between the CPU and GPGPU is small relative to the calculation time, (3) high-performance GPGPU cards are utilized, and (4) CPU-GPGPU combinations are used to execute sequential operations that are difficult to parallelize. Furthermore, UPCG solver testing indicates GPGPU speedups exceed parallel CPU speedups achieved using OpenMP on multicore CPUs for preconditioners that can be easily parallelized.
      

      
      Memory interface simulator: A computer design aid
      NASA Technical Reports Server (NTRS)
      Taylor, D. S.; Williams, T.; Weatherbee, J. E.
         1972-01-01
         Results are presented of a study conducted with a digital simulation model being used in the design of the Automatically Reconfigurable Modular Multiprocessor System (ARMMS), a candidate computer system for future manned and unmanned space missions. The model simulates the activity involved as instructions are fetched from random access memory for execution in one of the system central processing units. A series of model runs measured instruction execution time under various assumptions pertaining to the CPU's and the interface between the CPU's and RAM. Design tradeoffs are presented in the following areas: Bus widths, CPU microprogram read only memory cycle time, multiple instruction fetch, and instruction mix.
      

      
      Polydrug use among college students in Brazil: a nationwide survey.
      PubMed
      Oliveira, Lúcio Garcia de; Alberghini, Denis Guilherme; Santos, Bernardo dos; Andrade, Arthur Guerra de
         2013-01-01
         To estimate the frequency of polydrug use (alcohol and illicit drugs) among college students and its associations with gender and age group. A nationwide sample of 12,544 college students was asked to complete a questionnaire on their use of drugs according to three time parameters (lifetime, past 12 months, and last 30 days). The co-use of drugs was investigated as concurrent polydrug use (CPU) and simultaneous polydrug use (SPU), a subcategory of CPU that involves the use of drugs at the same time or in close temporal proximity. Almost 26% of college students reported having engaged in CPU in the past 12 months. Among these students, 37% had engaged in SPU. In the past 30 days, 17% college students had engaged in CPU. Among these, 35% had engaged in SPU. Marijuana was the illicit drug mostly frequently used with alcohol (either as CPU or SPU), especially among males. Among females, the most commonly reported combination was alcohol and prescribed medications. A high proportion of Brazilian college students may be engaging in polydrug use. College administrators should keep themselves informed to be able to identify such use and to develop educational interventions to prevent such behavior.
      

      
      Optimum element density studies for finite-element thermal analysis of hypersonic aircraft structures
      NASA Technical Reports Server (NTRS)
      Ko, William L.; Olona, Timothy; Muramoto, Kyle M.
         1990-01-01
         Different finite element models previously set up for thermal analysis of the space shuttle orbiter structure are discussed and their shortcomings identified. Element density criteria are established for the finite element thermal modelings of space shuttle orbiter-type large, hypersonic aircraft structures. These criteria are based on rigorous studies on solution accuracies using different finite element models having different element densities set up for one cell of the orbiter wing. Also, a method for optimization of the transient thermal analysis computer central processing unit (CPU) time is discussed. Based on the newly established element density criteria, the orbiter wing midspan segment was modeled for the examination of thermal analysis solution accuracies and the extent of computation CPU time requirements. The results showed that the distributions of the structural temperatures and the thermal stresses obtained from this wing segment model were satisfactory and the computation CPU time was at the acceptable level. The studies offered the hope that modeling the large, hypersonic aircraft structures using high-density elements for transient thermal analysis is possible if a CPU optimization technique was used.
      

      
      Parallelized computation for computer simulation of electrocardiograms using personal computers with multi-core CPU and general-purpose GPU.
      PubMed
      Shen, Wenfeng; Wei, Daming; Xu, Weimin; Zhu, Xin; Yuan, Shizhong
         2010-10-01
         Biological computations like electrocardiological modelling and simulation usually require high-performance computing environments. This paper introduces an implementation of parallel computation for computer simulation of electrocardiograms (ECGs) in a personal computer environment with an Intel CPU of Core (TM) 2 Quad Q6600 and a GPU of Geforce 8800GT, with software support by OpenMP and CUDA. It was tested in three parallelization device setups: (a) a four-core CPU without a general-purpose GPU, (b) a general-purpose GPU plus 1 core of CPU, and (c) a four-core CPU plus a general-purpose GPU. To effectively take advantage of a multi-core CPU and a general-purpose GPU, an algorithm based on load-prediction dynamic scheduling was developed and applied to setting (c). In the simulation with 1600 time steps, the speedup of the parallel computation as compared to the serial computation was 3.9 in setting (a), 16.8 in setting (b), and 20.0 in setting (c). This study demonstrates that a current PC with a multi-core CPU and a general-purpose GPU provides a good environment for parallel computations in biological modelling and simulation studies. Copyright 2010 Elsevier Ireland Ltd. All rights reserved.
      

      
      Hypoxia/oxidative stress alters the pharmacokinetics of CPU86017-RS through mitochondrial dysfunction and NADPH oxidase activation.
      PubMed
      Gao, Jie; Ding, Xuan-sheng; Zhang, Yu-mao; Dai, De-zai; Liu, Mei; Zhang, Can; Dai, Yin
         2013-12-01
         Hypoxia/oxidative stress can alter the pharmacokinetics (PK) of CPU86017-RS, a novel antiarrhythmic agent. The aim of this study was to investigate the mechanisms underlying the alteration of PK of CPU86017-RS by hypoxia/oxidative stress. Male SD rats exposed to normal or intermittent hypoxia (10% O2) were administered CPU86017-RS (20, 40 or 80 mg/kg, ig) for 8 consecutive days. The PK parameters of CPU86017-RS were examined on d 8. In a separate set of experiments, female SD rats were injected with isoproterenol (ISO) for 5 consecutive days to induce a stress-related status, then CPU86017-RS (80 mg/kg, ig) was administered, and the tissue distributions were examined. The levels of Mn-SOD (manganese containing superoxide dismutase), endoplasmic reticulum (ER) stress sensor proteins (ATF-6, activating transcription factor 6 and PERK, PRK-like ER kinase) and activation of NADPH oxidase (NOX) were detected with Western blotting. Rat liver microsomes were incubated under N2 for in vitro study. The Cmax, t1/2, MRT (mean residence time) and AUC (area under the curve) of CPU86017-RS were significantly increased in the hypoxic rats receiving the 3 different doses of CPU86017-RS. The hypoxia-induced alteration of PK was associated with significantly reduced Mn-SOD level, and increased ATF-6, PERK and NOX levels. In ISO-treated rats, the distributions of CPU86017-RS in plasma, heart, kidney, and liver were markedly increased, and NOX levels in heart, kidney, and liver were significantly upregulated. Co-administration of the NOX blocker apocynin eliminated the abnormalities in the PK and tissue distributions of CPU86017-RS induced by hypoxia/oxidative stress. The metabolism of CPU86017-RS in the N2-treated liver microsomes was significantly reduced, addition of N-acetylcysteine (NAC), but not vitamin C, effectively reversed this change. The altered PK and metabolism of CPU86017-RS induced by hypoxia/oxidative stress are produced by mitochondrial abnormalities, NOX activation and ER stress; these abnormalities are significantly alleviated by apocynin or NAC.
      

      
      Performance analysis of the FDTD method applied to holographic volume gratings: Multi-core CPU versus GPU computing
      NASA Astrophysics Data System (ADS)
      Francés, J.; Bleda, S.; Neipp, C.; Márquez, A.; Pascual, I.; Beléndez, A.
         2013-03-01
         The finite-difference time-domain method (FDTD) allows electromagnetic field distribution analysis as a function of time and space. The method is applied to analyze holographic volume gratings (HVGs) for the near-field distribution at optical wavelengths. Usually, this application requires the simulation of wide areas, which implies more memory and time processing. In this work, we propose a specific implementation of the FDTD method including several add-ons for a precise simulation of optical diffractive elements. Values in the near-field region are computed considering the illumination of the grating by means of a plane wave for different angles of incidence and including absorbing boundaries as well. We compare the results obtained by FDTD with those obtained using a matrix method (MM) applied to diffraction gratings. In addition, we have developed two optimized versions of the algorithm, for both CPU and GPU, in order to analyze the improvement of using the new NVIDIA Fermi GPU architecture versus highly tuned multi-core CPU as a function of the size simulation. In particular, the optimized CPU implementation takes advantage of the arithmetic and data transfer streaming SIMD (single instruction multiple data) extensions (SSE) included explicitly in the code and also of multi-threading by means of OpenMP directives. A good agreement between the results obtained using both FDTD and MM methods is obtained, thus validating our methodology. Moreover, the performance of the GPU is compared to the SSE+OpenMP CPU implementation, and it is quantitatively determined that a highly optimized CPU program can be competitive for a wider range of simulation sizes, whereas GPU computing becomes more powerful for large-scale simulations.
      

      
      Accelerating next generation sequencing data analysis with system level optimizations.
      PubMed
      Kathiresan, Nagarajan; Temanni, Ramzi; Almabrazi, Hakeem; Syed, Najeeb; Jithesh, Puthen V; Al-Ali, Rashid
         2017-08-22
         Next generation sequencing (NGS) data analysis is highly compute intensive. In-memory computing, vectorization, bulk data transfer, CPU frequency scaling are some of the hardware features in the modern computing architectures. To get the best execution time and utilize these hardware features, it is necessary to tune the system level parameters before running the application. We studied the GATK-HaplotypeCaller which is part of common NGS workflows, that consume more than 43% of the total execution time. Multiple GATK 3.x versions were benchmarked and the execution time of HaplotypeCaller was optimized by various system level parameters which included: (i) tuning the parallel garbage collection and kernel shared memory to simulate in-memory computing, (ii) architecture-specific tuning in the PairHMM library for vectorization, (iii) including Java 1.8 features through GATK source code compilation and building a runtime environment for parallel sorting and bulk data transfer (iv) the default 'on-demand' mode of CPU frequency is over-clocked by using 'performance-mode' to accelerate the Java multi-threads. As a result, the HaplotypeCaller execution time was reduced by 82.66% in GATK 3.3 and 42.61% in GATK 3.7. Overall, the execution time of NGS pipeline was reduced to 70.60% and 34.14% for GATK 3.3 and GATK 3.7 respectively.
      

      
      Assessment of Linear Finite-Difference Poisson-Boltzmann Solvers
      PubMed Central
      Wang, Jun; Luo, Ray
         2009-01-01
         CPU time and memory usage are two vital issues that any numerical solvers for the Poisson-Boltzmann equation have to face in biomolecular applications. In this study we systematically analyzed the CPU time and memory usage of five commonly used finite-difference solvers with a large and diversified set of biomolecular structures. Our comparative analysis shows that modified incomplete Cholesky conjugate gradient and geometric multigrid are the most efficient in the diversified test set. For the two efficient solvers, our test shows that their CPU times increase approximately linearly with the numbers of grids. Their CPU times also increase almost linearly with the negative logarithm of the convergence criterion at very similar rate. Our comparison further shows that geometric multigrid performs better in the large set of tested biomolecules. However, modified incomplete Cholesky conjugate gradient is superior to geometric multigrid in molecular dynamics simulations of tested molecules. We also investigated other significant components in numerical solutions of the Poisson-Boltzmann equation. It turns out that the time-limiting step is the free boundary condition setup for the linear systems for the selected proteins if the electrostatic focusing is not used. Thus, development of future numerical solvers for the Poisson-Boltzmann equation should balance all aspects of the numerical procedures in realistic biomolecular applications. PMID:20063271
      

      
      Exact diagonalization of quantum lattice models on coprocessors
      NASA Astrophysics Data System (ADS)
      Siro, T.; Harju, A.
         2016-10-01
         We implement the Lanczos algorithm on an Intel Xeon Phi coprocessor and compare its performance to a multi-core Intel Xeon CPU and an NVIDIA graphics processor. The Xeon and the Xeon Phi are parallelized with OpenMP and the graphics processor is programmed with CUDA. The performance is evaluated by measuring the execution time of a single step in the Lanczos algorithm. We study two quantum lattice models with different particle numbers, and conclude that for small systems, the multi-core CPU is the fastest platform, while for large systems, the graphics processor is the clear winner, reaching speedups of up to 7.6 compared to the CPU. The Xeon Phi outperforms the CPU with sufficiently large particle number, reaching a speedup of 2.5.
      

      
      Instrumentation complex for Langley Research Center's National Transonic Facility
      NASA Technical Reports Server (NTRS)
      Russell, C. H.; Bryant, C. S.
         1977-01-01
         The instrumentation discussed in the present paper was developed to ensure reliable operation for a 2.5-meter cryogenic high-Reynolds-number fan-driven transonic wind tunnel. It will incorporate four CPU's and associated analog and digital input/output equipment, necessary for acquiring research data, controlling the tunnel parameters, and monitoring the process conditions. Connected in a multipoint distributed network, the CPU's will support data base management and processing; research measurement data acquisition and display; process monitoring; and communication control. The design will allow essential processes to continue, in the case of major hardware failures, by switching input/output equipment to alternate CPU's and by eliminating nonessential functions. It will also permit software modularization by CPU activity and thereby reduce complexity and development time.
      

      
      Accelerating Smith-Waterman Alignment for Protein Database Search Using Frequency Distance Filtration Scheme Based on CPU-GPU Collaborative System.
      PubMed
      Liu, Yu; Hong, Yang; Lin, Chun-Yuan; Hung, Che-Lun
         2015-01-01
         The Smith-Waterman (SW) algorithm has been widely utilized for searching biological sequence databases in bioinformatics. Recently, several works have adopted the graphic card with Graphic Processing Units (GPUs) and their associated CUDA model to enhance the performance of SW computations. However, these works mainly focused on the protein database search by using the intertask parallelization technique, and only using the GPU capability to do the SW computations one by one. Hence, in this paper, we will propose an efficient SW alignment method, called CUDA-SWfr, for the protein database search by using the intratask parallelization technique based on a CPU-GPU collaborative system. Before doing the SW computations on GPU, a procedure is applied on CPU by using the frequency distance filtration scheme (FDFS) to eliminate the unnecessary alignments. The experimental results indicate that CUDA-SWfr runs 9.6 times and 96 times faster than the CPU-based SW method without and with FDFS, respectively.
      

      
      Dense GPU-enhanced surface reconstruction from stereo endoscopic images for intraoperative registration.
      PubMed
      Rohl, Sebastian; Bodenstedt, Sebastian; Suwelack, Stefan; Dillmann, Rudiger; Speidel, Stefanie; Kenngott, Hannes; Muller-Stich, Beat P
         2012-03-01
         In laparoscopic surgery, soft tissue deformations substantially change the surgical site, thus impeding the use of preoperative planning during intraoperative navigation. Extracting depth information from endoscopic images and building a surface model of the surgical field-of-view is one way to represent this constantly deforming environment. The information can then be used for intraoperative registration. Stereo reconstruction is a typical problem within computer vision. However, most of the available methods do not fulfill the specific requirements in a minimally invasive setting such as the need of real-time performance, the problem of view-dependent specular reflections and large curved areas with partly homogeneous or periodic textures and occlusions. In this paper, the authors present an approach toward intraoperative surface reconstruction based on stereo endoscopic images. The authors describe our answer to this problem through correspondence analysis, disparity correction and refinement, 3D reconstruction, point cloud smoothing and meshing. Real-time performance is achieved by implementing the algorithms on the gpu. The authors also present a new hybrid cpu-gpu algorithm that unifies the advantages of the cpu and the gpu version. In a comprehensive evaluation using in vivo data, in silico data from the literature and virtual data from a newly developed simulation environment, the cpu, the gpu, and the hybrid cpu-gpu versions of the surface reconstruction are compared to a cpu and a gpu algorithm from the literature. The recommended approach toward intraoperative surface reconstruction can be conducted in real-time depending on the image resolution (20 fps for the gpu and 14fps for the hybrid cpu-gpu version on resolution of 640 × 480). It is robust to homogeneous regions without texture, large image changes, noise or errors from camera calibration, and it reconstructs the surface down to sub millimeter accuracy. In all the experiments within the simulation environment, the mean distance to ground truth data is between 0.05 and 0.6 mm for the hybrid cpu-gpu version. The hybrid cpu-gpu algorithm shows a much more superior performance than its cpu and gpu counterpart (mean distance reduction 26% and 45%, respectively, for the experiments in the simulation environment). The recommended approach for surface reconstruction is fast, robust, and accurate. It can represent changes in the intraoperative environment and can be used to adapt a preoperative model within the surgical site by registration of these two models.
      

      
      Computing the Density Matrix in Electronic Structure Theory on Graphics Processing Units.
      PubMed
      Cawkwell, M J; Sanville, E J; Mniszewski, S M; Niklasson, Anders M N
         2012-11-13
         The self-consistent solution of a Schrödinger-like equation for the density matrix is a critical and computationally demanding step in quantum-based models of interatomic bonding. This step was tackled historically via the diagonalization of the Hamiltonian. We have investigated the performance and accuracy of the second-order spectral projection (SP2) algorithm for the computation of the density matrix via a recursive expansion of the Fermi operator in a series of generalized matrix-matrix multiplications. We demonstrate that owing to its simplicity, the SP2 algorithm [Niklasson, A. M. N. Phys. Rev. B2002, 66, 155115] is exceptionally well suited to implementation on graphics processing units (GPUs). The performance in double and single precision arithmetic of a hybrid GPU/central processing unit (CPU) and full GPU implementation of the SP2 algorithm exceed those of a CPU-only implementation of the SP2 algorithm and traditional matrix diagonalization when the dimensions of the matrices exceed about 2000 × 2000. Padding schemes for arrays allocated in the GPU memory that optimize the performance of the CUBLAS implementations of the level 3 BLAS DGEMM and SGEMM subroutines for generalized matrix-matrix multiplications are described in detail. The analysis of the relative performance of the hybrid CPU/GPU and full GPU implementations indicate that the transfer of arrays between the GPU and CPU constitutes only a small fraction of the total computation time. The errors measured in the self-consistent density matrices computed using the SP2 algorithm are generally smaller than those measured in matrices computed via diagonalization. Furthermore, the errors in the density matrices computed using the SP2 algorithm do not exhibit any dependence of system size, whereas the errors increase linearly with the number of orbitals when diagonalization is employed.
      

        
       
          

«

1
      2
      3
   4
      5
      »

          
        

     

   

   
       
            
              
          

«

2
      3
      4
   5
      6
      »

          
        

           
           
             
               
      
      Fast polyenergetic forward projection for image formation using OpenCL on a heterogeneous parallel computing platform.
      PubMed
      Zhou, Lili; Clifford Chao, K S; Chang, Jenghwa
         2012-11-01
         Simulated projection images of digital phantoms constructed from CT scans have been widely used for clinical and research applications but their quality and computation speed are not optimal for real-time comparison with the radiography acquired with an x-ray source of different energies. In this paper, the authors performed polyenergetic forward projections using open computing language (OpenCL) in a parallel computing ecosystem consisting of CPU and general purpose graphics processing unit (GPGPU) for fast and realistic image formation. The proposed polyenergetic forward projection uses a lookup table containing the NIST published mass attenuation coefficients (μ∕ρ) for different tissue types and photon energies ranging from 1 keV to 20 MeV. The CT images of interested sites are first segmented into different tissue types based on the CT numbers and converted to a three-dimensional attenuation phantom by linking each voxel to the corresponding tissue type in the lookup table. The x-ray source can be a radioisotope or an x-ray generator with a known spectrum described as weight w(n) for energy bin E(n). The Siddon method is used to compute the x-ray transmission line integral for E(n) and the x-ray fluence is the weighted sum of the exponential of line integral for all energy bins with added Poisson noise. To validate this method, a digital head and neck phantom constructed from the CT scan of a Rando head phantom was segmented into three (air, gray∕white matter, and bone) regions for calculating the polyenergetic projection images for the Mohan 4 MV energy spectrum. To accelerate the calculation, the authors partitioned the workloads using the task parallelism and data parallelism and scheduled them in a parallel computing ecosystem consisting of CPU and GPGPU (NVIDIA Tesla C2050) using OpenCL only. The authors explored the task overlapping strategy and the sequential method for generating the first and subsequent DRRs. A dispatcher was designed to drive the high-degree parallelism of the task overlapping strategy. Numerical experiments were conducted to compare the performance of the OpenCL∕GPGPU-based implementation with the CPU-based implementation. The projection images were similar to typical portal images obtained with a 4 or 6 MV x-ray source. For a phantom size of 512 × 512 × 223, the time for calculating the line integrals for a 512 × 512 image panel was 16.2 ms on GPGPU for one energy bin in comparison to 8.83 s on CPU. The total computation time for generating one polyenergetic projection image of 512 × 512 was 0.3 s (141 s for CPU). The relative difference between the projection images obtained with the CPU-based and OpenCL∕GPGPU-based implementations was on the order of 10(-6) and was virtually indistinguishable. The task overlapping strategy was 5.84 and 1.16 times faster than the sequential method for the first and the subsequent digitally reconstruction radiographies, respectively. The authors have successfully built digital phantoms using anatomic CT images and NIST μ∕ρ tables for simulating realistic polyenergetic projection images and optimized the processing speed with parallel computing using GPGPU∕OpenCL-based implementation. The computation time was fast (0.3 s per projection image) enough for real-time IGRT (image-guided radiotherapy) applications.
      

      
      Wang-Landau sampling: Saving CPU time
      NASA Astrophysics Data System (ADS)
      Ferreira, L. S.; Jorge, L. N.; Leão, S. A.; Caparica, A. A.
         2018-04-01
         In this work we propose an improvement to the Wang-Landau (WL) method that allows an economy in CPU time of about 60% leading to the same results with the same accuracy. We used the 2D Ising model to show that one can initiate all WL simulations using the outputs of an advanced WL level from a previous simulation. We showed that up to the seventh WL level (f6) the simulations are not biased yet and can proceed to any value that the simulation from the very beginning would reach. As a result the initial WL levels can be simulated just once. It was also observed that the saving in CPU time is larger for larger lattice sizes, exactly where the computational cost is considerable. We carried out high-resolution simulations beginning initially from the first WL level (f0) and another beginning from the eighth WL level (f7) using all the data at the end of the previous level and showed that the results for the critical temperature Tc and the critical static exponents β and γ coincide within the error bars. Finally we applied the same procedure to the 1/2-spin Baxter-Wu model and the economy in CPU time was of about 64%.
      

      
      Evaluation of the CPU time for solving the radiative transfer equation with high-order resolution schemes applying the normalized weighting-factor method
      NASA Astrophysics Data System (ADS)
      Xamán, J.; Zavala-Guillén, I.; Hernández-López, I.; Uriarte-Flores, J.; Hernández-Pérez, I.; Macías-Melo, E. V.; Aguilar-Castro, K. M.
         2018-03-01
         In this paper, we evaluated the convergence rate (CPU time) of a new mathematical formulation for the numerical solution of the radiative transfer equation (RTE) with several High-Order (HO) and High-Resolution (HR) schemes. In computational fluid dynamics, this procedure is known as the Normalized Weighting-Factor (NWF) method and it is adopted here. The NWF method is used to incorporate the high-order resolution schemes in the discretized RTE. The NWF method is compared, in terms of computer time needed to obtain a converged solution, with the widely used deferred-correction (DC) technique for the calculations of a two-dimensional cavity with emitting-absorbing-scattering gray media using the discrete ordinates method. Six parameters, viz. the grid size, the order of quadrature, the absorption coefficient, the emissivity of the boundary surface, the under-relaxation factor, and the scattering albedo are considered to evaluate ten schemes. The results showed that using the DC method, in general, the scheme that had the lowest CPU time is the SOU. In contrast, with the results of theDC procedure the CPU time for DIAMOND and QUICK schemes using the NWF method is shown to be, between the 3.8 and 23.1% faster and 12.6 and 56.1% faster, respectively. However, the other schemes are more time consuming when theNWFis used instead of the DC method. Additionally, a second test case was presented and the results showed that depending on the problem under consideration, the NWF procedure may be computationally faster or slower that the DC method. As an example, the CPU time for QUICK and SMART schemes are 61.8 and 203.7%, respectively, slower when the NWF formulation is used for the second test case. Finally, future researches to explore the computational cost of the NWF method in more complex problems are required.
      

      
      Stability and Scalability of the CMS Global Pool: Pushing HTCondor and GlideinWMS to New Limits
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Balcas, J.; Bockelman, B.; Hufnagel, D.
         
         The CMS Global Pool, based on HTCondor and glideinWMS, is the main computing resource provisioning system for all CMS workflows, including analysis, Monte Carlo production, and detector data reprocessing activities. The total resources at Tier-1 and Tier-2 grid sites pledged to CMS exceed 100,000 CPU cores, while another 50,000 to 100,000 CPU cores are available opportunistically, pushing the needs of the Global Pool to higher scales each year. These resources are becoming more diverse in their accessibility and configuration over time. Furthermore, the challenge of stably running at higher and higher scales while introducing new modes of operation such asmore » multi-core pilots, as well as the chaotic nature of physics analysis workflows, places huge strains on the submission infrastructure. This paper details some of the most important challenges to scalability and stability that the CMS Global Pool has faced since the beginning of the LHC Run II and how they were overcome.« less
      

      
      Stability and scalability of the CMS Global Pool: Pushing HTCondor and glideinWMS to new limits
      NASA Astrophysics Data System (ADS)
      Balcas, J.; Bockelman, B.; Hufnagel, D.; Hurtado Anampa, K.; Aftab Khan, F.; Larson, K.; Letts, J.; Marra da Silva, J.; Mascheroni, M.; Mason, D.; Perez-Calero Yzquierdo, A.; Tiradani, A.
         2017-10-01
         The CMS Global Pool, based on HTCondor and glideinWMS, is the main computing resource provisioning system for all CMS workflows, including analysis, Monte Carlo production, and detector data reprocessing activities. The total resources at Tier-1 and Tier-2 grid sites pledged to CMS exceed 100,000 CPU cores, while another 50,000 to 100,000 CPU cores are available opportunistically, pushing the needs of the Global Pool to higher scales each year. These resources are becoming more diverse in their accessibility and configuration over time. Furthermore, the challenge of stably running at higher and higher scales while introducing new modes of operation such as multi-core pilots, as well as the chaotic nature of physics analysis workflows, places huge strains on the submission infrastructure. This paper details some of the most important challenges to scalability and stability that the CMS Global Pool has faced since the beginning of the LHC Run II and how they were overcome.
      

      
      Raspberry Pi Eclipse Experiments
      NASA Astrophysics Data System (ADS)
      Chizek Frouard, Malynda
         2018-01-01
         The 21 August 2017 solar eclipse was an excellent opportunity for electronics and science enthusiasts to collect data during a fascinating phenomenon. With my recent personal interest in Raspberry Pis, I thought measuring how much the temperature and illuminance changes during a total solar eclipse would be fun and informational.Previous observations of total solar eclipses have remarked on the temperature drop during totality. Illuminance (ambient light) varies over 7 orders of magnitude from day to night and is highly dependent on relative positions of Sun, Earth, and Moon. I wondered whether totality was really as dark as night.Using a Raspberry Pi Zero W, a Pimoroni Enviro pHAT, and a portable USB charger, I collected environmental temperature; CPU temperature (because the environmental temperature sensor sat very near the CPU on the Raspberry Pi); barometric pressure; ambient light; R, G, and B colors; and x, y, and z acceleration (for marking times when I moved the sensor) data at a ~15 second cadence starting at about 5 am until 1:30 pm from my eclipse observation site in Glendo, WY. Totality occurred from 11:45 to 11:47 am, lasting about 2 minutes and 30 seconds.The Raspberry Pi recorded a >20 degree F drop in temperature during the eclipse, and the illuminance during totality was equivalent to twilight measurements earlier in the day. A limitation in the ambient light sensor prevented accurate measurements of broad daylight and most of the partial phase of the eclipse, but an alternate ambient light sensor combined with the Raspberry Pi setup would make this a cost-efficient set-up for illuminance studies.I will present data from the ambient light sensor, temperature sensor, and color sensor, noting caveats from my experiments, lessons learned for next time, and suggestions for anyone who wants to perform similar experiments for themselves or with a classroom.
      

      
      On localization attacks against cloud infrastructure
      NASA Astrophysics Data System (ADS)
      Ge, Linqiang; Yu, Wei; Sistani, Mohammad Ali
         2013-05-01
         One of the key characteristics of cloud computing is the device and location independence that enables the user to access systems regardless of their location. Because cloud computing is heavily based on sharing resource, it is vulnerable to cyber attacks. In this paper, we investigate a localization attack that enables the adversary to leverage central processing unit (CPU) resources to localize the physical location of server used by victims. By increasing and reducing CPU usage through the malicious virtual machine (VM), the response time from the victim VM will increase and decrease correspondingly. In this way, by embedding the probing signal into the CPU usage and correlating the same pattern in the response time from the victim VM, the adversary can find the location of victim VM. To determine attack accuracy, we investigate features in both the time and frequency domains. We conduct both theoretical and experimental study to demonstrate the effectiveness of such an attack.
      

      
      CUDA Fortran acceleration for the finite-difference time-domain method
      NASA Astrophysics Data System (ADS)
      Hadi, Mohammed F.; Esmaeili, Seyed A.
         2013-05-01
         A detailed description of programming the three-dimensional finite-difference time-domain (FDTD) method to run on graphical processing units (GPUs) using CUDA Fortran is presented. Two FDTD-to-CUDA thread-block mapping designs are investigated and their performances compared. Comparative assessment of trade-offs between GPU's shared memory and L1 cache is also discussed. This presentation is for the benefit of FDTD programmers who work exclusively with Fortran and are reluctant to port their codes to C in order to utilize GPU computing. The derived CUDA Fortran code is compared with an optimized CPU version that runs on a workstation-class CPU to present a realistic GPU to CPU run time comparison and thus help in making better informed investment decisions on FDTD code redesigns and equipment upgrades. All analyses are mirrored with CUDA C simulations to put in perspective the present state of CUDA Fortran development.
      

      
      A Fast and Accurate Algorithm for l1 Minimization Problems in Compressive Sampling (Preprint)
      DTIC Science & Technology
      
         2013-01-22
         However, updating uk+1 via the formulation of Step 2 in Algorithm 1 can be implemented through the use of the component-wise Gauss - Seidel iteration which...may accelerate the rate of convergence of the algorithm and therefore reduce the total CPU-time consumed. The efficiency of component-wise Gauss - Seidel ...Micchelli, L. Shen, and Y. Xu, A proximity algorithm accelerated by Gauss - Seidel iterations for L1/TV denoising models, Inverse Problems, 28 (2012), p
      

      
      RRTMGP: A High-Performance Broadband Radiation Code for the Next Decade
      DTIC Science & Technology
      
         2014-09-30
         Hardware counters were used to measure several performance metrics, including the number of double-precision (DP) floating- point operations ( FLOPs ...0.2 DP FLOPs per CPU cycle. Experience with production science code is that it is possible to achieve execution rates in the range of 0.5 to 1.0...DP FLOPs per cycle. Looking at the ratio of vectorized DP FLOPs to total DP FLOPs we see (Figure PROF) that for most of the execution time the
      

      
      A report documenting the completion of the Los Alamos National Laboratory portion of the ASC level II milestone ""Visualization on the supercomputing platform
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Ahrens, James P; Patchett, John M; Lo, Li - Ta
         2011-01-24
         This report provides documentation for the completion of the Los Alamos portion of the ASC Level II 'Visualization on the Supercomputing Platform' milestone. This ASC Level II milestone is a joint milestone between Sandia National Laboratory and Los Alamos National Laboratory. The milestone text is shown in Figure 1 with the Los Alamos portions highlighted in boldfaced text. Visualization and analysis of petascale data is limited by several factors which must be addressed as ACES delivers the Cielo platform. Two primary difficulties are: (1) Performance of interactive rendering, which is the most computationally intensive portion of the visualization process. Formore » terascale platforms, commodity clusters with graphics processors (GPUs) have been used for interactive rendering. For petascale platforms, visualization and rendering may be able to run efficiently on the supercomputer platform itself. (2) I/O bandwidth, which limits how much information can be written to disk. If we simply analyze the sparse information that is saved to disk we miss the opportunity to analyze the rich information produced every timestep by the simulation. For the first issue, we are pursuing in-situ analysis, in which simulations are coupled directly with analysis libraries at runtime. This milestone will evaluate the visualization and rendering performance of current and next generation supercomputers in contrast to GPU-based visualization clusters, and evaluate the perfromance of common analysis libraries coupled with the simulation that analyze and write data to disk during a running simulation. This milestone will explore, evaluate and advance the maturity level of these technologies and their applicability to problems of interest to the ASC program. In conclusion, we improved CPU-based rendering performance by a a factor of 2-10 times on our tests. In addition, we evaluated CPU and CPU-based rendering performance. We encourage production visualization experts to consider using CPU-based rendering solutions when it is appropriate. For example, on remote supercomputers CPU-based rendering can offer a means of viewing data without having to offload the data or geometry onto a CPU-based visualization system. In terms of comparative performance of the CPU and CPU we believe that further optimizations of the performance of both CPU or CPU-based rendering are possible. The simulation community is currently confronting this reality as they work to port their simulations to different hardware architectures. What is interesting about CPU rendering of massive datasets is that for part two decades CPU performance has significantly outperformed CPU-based systems. Based on our advancements, evaluations and explorations we believe that CPU-based rendering has returned as one viable option for the visualization of massive datasets.« less
      

      
      Application of queueing models to multiprogrammed computer systems operating in a time-critical environment
      NASA Technical Reports Server (NTRS)
      Eckhardt, D. E., Jr.
         1979-01-01
         A model of a central processor (CPU) which services background applications in the presence of time critical activity is presented. The CPU is viewed as an M/M/1 queueing system subject to periodic interrupts by deterministic, time critical process. The Laplace transform of the distribution of service times for the background applications is developed. The use of state of the art queueing models for studying the background processing capability of time critical computer systems is discussed and the results of a model validation study which support this application of queueing models are presented.
      

      
      An efficient mixed-precision, hybrid CPU-GPU implementation of a nonlinearly implicit one-dimensional particle-in-cell algorithm
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Chen, Guangye; Chacon, Luis; Barnes, Daniel C
         2012-01-01
         Recently, a fully implicit, energy- and charge-conserving particle-in-cell method has been developed for multi-scale, full-f kinetic simulations [G. Chen, et al., J. Comput. Phys. 230, 18 (2011)]. The method employs a Jacobian-free Newton-Krylov (JFNK) solver and is capable of using very large timesteps without loss of numerical stability or accuracy. A fundamental feature of the method is the segregation of particle orbit integrations from the field solver, while remaining fully self-consistent. This provides great flexibility, and dramatically improves the solver efficiency by reducing the degrees of freedom of the associated nonlinear system. However, it requires a particle push per nonlinearmore » residual evaluation, which makes the particle push the most time-consuming operation in the algorithm. This paper describes a very efficient mixed-precision, hybrid CPU-GPU implementation of the implicit PIC algorithm. The JFNK solver is kept on the CPU (in double precision), while the inherent data parallelism of the particle mover is exploited by implementing it in single-precision on a graphics processing unit (GPU) using CUDA. Performance-oriented optimizations, with the aid of an analytical performance model, the roofline model, are employed. Despite being highly dynamic, the adaptive, charge-conserving particle mover algorithm achieves up to 300 400 GOp/s (including single-precision floating-point, integer, and logic operations) on a Nvidia GeForce GTX580, corresponding to 20 25% absolute GPU efficiency (against the peak theoretical performance) and 50-70% intrinsic efficiency (against the algorithm s maximum operational throughput, which neglects all latencies). This is about 200-300 times faster than an equivalent serial CPU implementation. When the single-precision GPU particle mover is combined with a double-precision CPU JFNK field solver, overall performance gains 100 vs. the double-precision CPU-only serial version are obtained, with no apparent loss of robustness or accuracy when applied to a challenging long-time scale ion acoustic wave simulation.« less
      

      
      The VLBA correlator: Real-time in the distributed era
      NASA Technical Reports Server (NTRS)
      Wells, D. C.
         1992-01-01
         The correlator is the signal processing engine of the Very Long Baseline Array (VLBA). Radio signals are recorded on special wideband (128 Mb/s) digital recorders at the 10 telescopes, with sampling times controlled by hydrogen maser clocks. The magnetic tapes are shipped to the Array Operations Center in Socorro, New Mexico, where they are played back simultaneously into the correlator. Real-time software and firmware controls the playback drives to achieve synchronization, compute models of the wavefront delay, control the numerous modules of the correlator, and record FITS files of the fringe visibilities at the back-end of the correlator. In addition to the more than 3000 custom VLSI chips which handle the massive data flow of the signal processing, the correlator contains a total of more than 100 programmable computers, 8-, 16- and 32-bit CPUs. Code is downloaded into front-end CPU's dependent on operating mode. Low-level code is assembly language, high-level code is C running under a RT OS. We use VxWorks on Motorola MVME147 CPU's. Code development is on a complex of SPARC workstations connected to the RT CPU's by Ethernet. The overall management of the correlation process is dependent on a database management system. We use Ingres running on a Sparcstation-2. We transfer logging information from the database of the VLBA Monitor and Control System to our database using Ingres/NET. Job scripts are computed and are transferred to the real-time computers using NFS, and correlation job execution logs and status flow back by the route. Operator status and control displays use windows on workstations, interfaced to the real-time processes by network protocols. The extensive network protocol support provided by VxWorks is invaluable. The VLBA Correlator's dependence on network protocols is an example of the radical transformation of the real-time world over the past five years. Real-time is becoming more like conventional computing. Paradoxically, 'conventional' computing is also adopting practices from the real-time world: semaphores, shared memory, light-weight threads, and concurrency. This appears to be a convergence of thinking.
      

      
      Saving time and energy with oversubscription and semi-direct Møller-Plesset second order perturbation methods.
      PubMed
      Fought, Ellie L; Sundriyal, Vaibhav; Sosonkina, Masha; Windus, Theresa L
         2017-04-30
         In this work, the effect of oversubscription is evaluated, via calling 2n, 3n, or 4n processes for n physical cores, on semi-direct MP2 energy and gradient calculations and RI-MP2 energy calculations with the cc-pVTZ basis using NWChem. Results indicate that on both Intel and AMD platforms, oversubscription reduces total time to solution on average for semi-direct MP2 energy calculations by 25-45% and reduces total energy consumed by the CPU and DRAM on average by 10-15% on the Intel platform. Semi-direct gradient time to solution is shortened on average by 8-15% and energy consumption is decreased by 5-10%. Linear regression analysis shows a strong correlation between time to solution and total energy consumed. Oversubscribing during RI-MP2 calculations results in performance degradations of 30-50% at the 4n level. © 2017 Wiley Periodicals, Inc. © 2017 Wiley Periodicals, Inc.
      

      
      A Reliability-Based Particle Filter for Humanoid Robot Self-Localization in RoboCup Standard Platform League
      PubMed Central
      Sánchez, Eduardo Munera; Alcobendas, Manuel Muñoz; Noguera, Juan Fco. Blanes; Gilabert, Ginés Benet; Simó Ten, José E.
         2013-01-01
         This paper deals with the problem of humanoid robot localization and proposes a new method for position estimation that has been developed for the RoboCup Standard Platform League environment. Firstly, a complete vision system has been implemented in the Nao robot platform that enables the detection of relevant field markers. The detection of field markers provides some estimation of distances for the current robot position. To reduce errors in these distance measurements, extrinsic and intrinsic camera calibration procedures have been developed and described. To validate the localization algorithm, experiments covering many of the typical situations that arise during RoboCup games have been developed: ranging from degradation in position estimation to total loss of position (due to falls, ‘kidnapped robot’, or penalization). The self-localization method developed is based on the classical particle filter algorithm. The main contribution of this work is a new particle selection strategy. Our approach reduces the CPU computing time required for each iteration and so eases the limited resource availability problem that is common in robot platforms such as Nao. The experimental results show the quality of the new algorithm in terms of localization and CPU time consumption. PMID:24193098
      

      
      Performance Analysis of the NAS Y-MP Workload
      NASA Technical Reports Server (NTRS)
      Bergeron, Robert J.; Kutler, Paul (Technical Monitor)
         1997-01-01
         This paper describes the performance characteristics of the computational workloads on the NAS Cray Y-MP machines, a Y-MP 832 and later a Y-MP 8128. Hardware measurements indicated that the Y-MP workload performance matured over time, ultimately sustaining an average throughput of 0.8 GFLOPS and a vector operation fraction of 87%. The measurements also revealed an operation rate exceeding 1 per clock period, a well-balanced architecture featuring a strong utilization of vector functional units, and an efficient memory organization. Introduction of the larger memory 8128 increased throughput by allowing a more efficient utilization of CPUs. Throughput also depended on the metering of the batch queues; low-idle Saturday workloads required a buffer of small jobs to prevent memory starvation of the CPU. UNICOS required about 7% of total CPU time to service the 832 workloads; this overhead decreased to 5% for the 8128 workloads. While most of the system time went to service I/O requests, efficient scheduling prevented excessive idle due to I/O wait. System measurements disclosed no obvious bottlenecks in the response of the machine and UNICOS to the workloads. In most cases, Cray-provided software tools were- quite sufficient for measuring the performance of both the machine and operating, system.
      

      
      Self-organized neural maps of human protein sequences.
      PubMed Central
      Ferrán, E. A.; Pflugfelder, B.; Ferrara, P.
         1994-01-01
         We have recently described a method based on artificial neural networks to cluster protein sequences into families. The network was trained with Kohonen's unsupervised learning algorithm using, as inputs, the matrix patterns derived from the dipeptide composition of the proteins. We present here a large-scale application of that method to classify the 1,758 human protein sequences stored in the SwissProt database (release 19.0), whose lengths are greater than 50 amino acids. In the final 2-dimensional topologically ordered map of 15 x 15 neurons, proteins belonging to known families were associated with the same neuron or with neighboring ones. Also, as an attempt to reduce the time-consuming learning procedure, we compared 2 learning protocols: one of 500 epochs (100 SUN CPU-hours [CPU-h]), and another one of 30 epochs (6.7 CPU-h). A further reduction of learning-computing time, by a factor of about 3.3, with similar protein clustering results, was achieved using a matrix of 11 x 11 components to represent the sequences. Although network training is time consuming, the classification of a new protein in the final ordered map is very fast (14.6 CPU-seconds). We also show a comparison between the artificial neural network approach and conventional methods of biosequence analysis. PMID:8019421
      

      
      The METAL System. Volume I and Volume II. Appendices.
      DTIC Science & Technology
      
         1981-01-01
         demands , and fair CPU time were measured. The fair measure reported here includes the pure CPU time plus a pro-rated portion of the time consumed by the...syntactic class or the form matched . NO = noun VB = verb OTR = other part of speech IT-12 Although the above feature is not used by the system at present...indicate the syntactic class of the form matched . NO = noun other than gerund ("content", "dark", "African") INF = infinitive ("direct", "equal", "content
      

      
      Reaching multi-nanosecond timescales in combined QM/MM molecular dynamics simulations through parallel horsetail sampling.
      PubMed
      Martins-Costa, Marilia T C; Ruiz-López, Manuel F
         2017-04-15
         We report an enhanced sampling technique that allows to reach the multi-nanosecond timescale in quantum mechanics/molecular mechanics molecular dynamics simulations. The proposed technique, called horsetail sampling, is a specific type of multiple molecular dynamics approach exhibiting high parallel efficiency. It couples a main simulation with a large number of shorter trajectories launched on independent processors at periodic time intervals. The technique is applied to study hydrogen peroxide at the water liquid-vapor interface, a system of considerable atmospheric relevance. A total simulation time of a little more than 6 ns has been attained for a total CPU time of 5.1 years representing only about 20 days of wall-clock time. The discussion of the results highlights the strong influence of the solvation effects at the interface on the structure and the electronic properties of the solute. © 2017 Wiley Periodicals, Inc. © 2017 Wiley Periodicals, Inc.
      

        
       
          

«

2
      3
      4
   5
      6
      »

          
        

     

   

   
       
            
              
          

«

3
      4
      5
   6
      7
      »

          
        

           
           
             
               
      
      High performance computing for deformable image registration: towards a new paradigm in adaptive radiotherapy.
      PubMed
      Samant, Sanjiv S; Xia, Junyi; Muyan-Ozcelik, Pinar; Owens, John D
         2008-08-01
         The advent of readily available temporal imaging or time series volumetric (4D) imaging has become an indispensable component of treatment planning and adaptive radiotherapy (ART) at many radiotherapy centers. Deformable image registration (DIR) is also used in other areas of medical imaging, including motion corrected image reconstruction. Due to long computation time, clinical applications of DIR in radiation therapy and elsewhere have been limited and consequently relegated to offline analysis. With the recent advances in hardware and software, graphics processing unit (GPU) based computing is an emerging technology for general purpose computation, including DIR, and is suitable for highly parallelized computing. However, traditional general purpose computation on the GPU is limited because the constraints of the available programming platforms. As well, compared to CPU programming, the GPU currently has reduced dedicated processor memory, which can limit the useful working data set for parallelized processing. We present an implementation of the demons algorithm using the NVIDIA 8800 GTX GPU and the new CUDA programming language. The GPU performance will be compared with single threading and multithreading CPU implementations on an Intel dual core 2.4 GHz CPU using the C programming language. CUDA provides a C-like language programming interface, and allows for direct access to the highly parallel compute units in the GPU. Comparisons for volumetric clinical lung images acquired using 4DCT were carried out. Computation time for 100 iterations in the range of 1.8-13.5 s was observed for the GPU with image size ranging from 2.0 x 10(6) to 14.2 x 10(6) pixels. The GPU registration was 55-61 times faster than the CPU for the single threading implementation, and 34-39 times faster for the multithreading implementation. For CPU based computing, the computational time generally has a linear dependence on image size for medical imaging data. Computational efficiency is characterized in terms of time per megapixels per iteration (TPMI) with units of seconds per megapixels per iteration (or spmi). For the demons algorithm, our CPU implementation yielded largely invariant values of TPMI. The mean TPMIs were 0.527 spmi and 0.335 spmi for the single threading and multithreading cases, respectively, with <2% variation over the considered image data range. For GPU computing, we achieved TPMI =0.00916 spmi with 3.7% variation, indicating optimized memory handling under CUDA. The paradigm of GPU based real-time DIR opens up a host of clinical applications for medical imaging.
      

      
      A CPU/MIC Collaborated Parallel Framework for GROMACS on Tianhe-2 Supercomputer.
      PubMed
      Peng, Shaoliang; Yang, Shunyun; Su, Wenhe; Zhang, Xiaoyu; Zhang, Tenglilang; Liu, Weiguo; Zhao, Xingming
         2017-06-16
         Molecular Dynamics (MD) is the simulation of the dynamic behavior of atoms and molecules. As the most popular software for molecular dynamics, GROMACS cannot work on large-scale data because of limit computing resources. In this paper, we propose a CPU and Intel® Xeon Phi Many Integrated Core (MIC) collaborated parallel framework to accelerate GROMACS using the offload mode on a MIC coprocessor, with which the performance of GROMACS is improved significantly, especially with the utility of Tianhe-2 supercomputer. Furthermore, we optimize GROMACS so that it can run on both the CPU and MIC at the same time. In addition, we accelerate multi-node GROMACS so that it can be used in practice. Benchmarking on real data, our accelerated GROMACS performs very well and reduces computation time significantly. Source code: https://github.com/tianhe2/gromacs-mic.
      

      
      Design and implementation of a UNIX based distributed computing system
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Love, J.S.; Michael, M.W.
         1994-12-31
         We have designed, implemented, and are running a corporate-wide distributed processing batch queue on a large number of networked workstations using the UNIX{reg_sign} operating system. Atlas Wireline researchers and scientists have used the system for over a year. The large increase in available computer power has greatly reduced the time required for nuclear and electromagnetic tool modeling. Use of remote distributed computing has simultaneously reduced computation costs and increased usable computer time. The system integrates equipment from different manufacturers, using various CPU architectures, distinct operating system revisions, and even multiple processors per machine. Various differences between the machines have tomore » be accounted for in the master scheduler. These differences include shells, command sets, swap spaces, memory sizes, CPU sizes, and OS revision levels. Remote processing across a network must be performed in a manner that is seamless from the users` perspective. The system currently uses IBM RISC System/6000{reg_sign}, SPARCstation{sup TM}, HP9000s700, HP9000s800, and DEC Alpha AXP{sup TM} machines. Each CPU in the network has its own speed rating, allowed working hours, and workload parameters. The system if designed so that all of the computers in the network can be optimally scheduled without adversely impacting the primary users of the machines. The increase in the total usable computational capacity by means of distributed batch computing can change corporate computing strategy. The integration of disparate computer platforms eliminates the need to buy one type of computer for computations, another for graphics, and yet another for day-to-day operations. It might be possible, for example, to meet all research and engineering computing needs with existing networked computers.« less
      

      
      Design Alternatives to Improve Access Time Performance of Disk Drives Under DOS and UNIX
      NASA Astrophysics Data System (ADS)
      Hospodor, Andy
         
         For the past 25 years, improvements in CPU performance have overshadowed improvements in the access time performance of disk drives. CPU performance has been slanted towards greater instruction execution rates, measured in millions of instructions per second (MIPS). However, the slant for performance of disk storage has been towards capacity and corresponding increased storage densities. The IBM PC, introduced in 1982, processed only a fraction of a MIP. Follow-on CPUs, such as the 80486 and 80586, sported 5-10 MIPS by 1992. Single user PCs and workstations, with one CPU and one disk drive, became the dominant application, as implied by their production volumes. However, disk drives did not enjoy a corresponding improvement in access time performance, although the potential still exists. The time to access a disk drive improves (decreases) in two ways: by altering the mechanical properties of the drive or by adding cache to the drive. This paper explores the improvement to access time performance of disk drives using cache, prefetch, faster rotation rates, and faster seek acceleration.
      

      
      Caffe con Troll: Shallow Ideas to Speed Up Deep Learning
      PubMed Central
      Hadjis, Stefan; Abuzaid, Firas; Zhang, Ce; Ré, Christopher
         2016-01-01
         We present Caffe con Troll (CcT), a fully compatible end-to-end version of the popular framework Caffe with rebuilt internals. We built CcT to examine the performance characteristics of training and deploying general-purpose convolutional neural networks across different hardware architectures. We find that, by employing standard batching optimizations for CPU training, we achieve a 4.5× throughput improvement over Caffe on popular networks like CaffeNet. Moreover, with these improvements, the end-to-end training time for CNNs is directly proportional to the FLOPS delivered by the CPU, which enables us to efficiently train hybrid CPU-GPU systems for CNNs. PMID:27314106
      

      
      Caffe con Troll: Shallow Ideas to Speed Up Deep Learning.
      PubMed
      Hadjis, Stefan; Abuzaid, Firas; Zhang, Ce; Ré, Christopher
         2015-01-01
         We present Caffe con Troll (CcT), a fully compatible end-to-end version of the popular framework Caffe with rebuilt internals. We built CcT to examine the performance characteristics of training and deploying general-purpose convolutional neural networks across different hardware architectures. We find that, by employing standard batching optimizations for CPU training, we achieve a 4.5× throughput improvement over Caffe on popular networks like CaffeNet. Moreover, with these improvements, the end-to-end training time for CNNs is directly proportional to the FLOPS delivered by the CPU, which enables us to efficiently train hybrid CPU-GPU systems for CNNs.
      

      
      Techniques for increasing the efficiency of Earth gravity calculations for precision orbit determination
      NASA Technical Reports Server (NTRS)
      Smith, R. L.; Lyubomirsky, A. S.
         1981-01-01
         Two techniques were analyzed. The first is a representation using Chebyshev expansions in three-dimensional cells. The second technique employs a temporary file for storing the components of the nonspherical gravity force. Computer storage requirements and relative CPU time requirements are presented. The Chebyshev gravity representation can provide a significant reduction in CPU time in precision orbit calculations, but at the cost of a large amount of direct-access storage space, which is required for a global model.
      

      
      Personal Computer and Workstation Operating Systems Tutorial
      DTIC Science & Technology
      
         1994-03-01
         to a RAM area where it is executed by the CPU. The program consists of instructions that perform operations on data. The CPU will perform two basic...memory to improve system performance. More often the user will buy a new fixed disk so the computer will hold more programs internally. The trend today...MHZ. Another way to view how fast the information is going into the register is in a time domain rather than a frequency domain knowing that time and
      

      
      Lossless data compression for improving the performance of a GPU-based beamformer.
      PubMed
      Lok, U-Wai; Fan, Gang-Wei; Li, Pai-Chi
         2015-04-01
         The powerful parallel computation ability of a graphics processing unit (GPU) makes it feasible to perform dynamic receive beamforming However, a real time GPU-based beamformer requires high data rate to transfer radio-frequency (RF) data from hardware to software memory, as well as from central processing unit (CPU) to GPU memory. There are data compression methods (e.g. Joint Photographic Experts Group (JPEG)) available for the hardware front end to reduce data size, alleviating the data transfer requirement of the hardware interface. Nevertheless, the required decoding time may even be larger than the transmission time of its original data, in turn degrading the overall performance of the GPU-based beamformer. This article proposes and implements a lossless compression-decompression algorithm, which enables in parallel compression and decompression of data. By this means, the data transfer requirement of hardware interface and the transmission time of CPU to GPU data transfers are reduced, without sacrificing image quality. In simulation results, the compression ratio reached around 1.7. The encoder design of our lossless compression approach requires low hardware resources and reasonable latency in a field programmable gate array. In addition, the transmission time of transferring data from CPU to GPU with the parallel decoding process improved by threefold, as compared with transferring original uncompressed data. These results show that our proposed lossless compression plus parallel decoder approach not only mitigate the transmission bandwidth requirement to transfer data from hardware front end to software system but also reduce the transmission time for CPU to GPU data transfer. © The Author(s) 2014.
      

      
      Benchmarking worker nodes using LHCb productions and comparing with HEPSpec06
      NASA Astrophysics Data System (ADS)
      Charpentier, P.
         2017-10-01
         In order to estimate the capabilities of a computing slot with limited processing time, it is necessary to know with a rather good precision its “power”. This allows for example pilot jobs to match a task for which the required CPU-work is known, or to define the number of events to be processed knowing the CPU-work per event. Otherwise one always has the risk that the task is aborted because it exceeds the CPU capabilities of the resource. It also allows a better accounting of the consumed resources. The traditional way the CPU power is estimated in WLCG since 2007 is using the HEP-Spec06 benchmark (HS06) suite that was verified at the time to scale properly with a set of typical HEP applications. However, the hardware architecture of processors has evolved, all WLCG experiments moved to using 64-bit applications and use different compilation flags from those advertised for running HS06. It is therefore interesting to check the scaling of HS06 with the HEP applications. For this purpose, we have been using CPU intensive massive simulation productions from the LHCb experiment and compared their event throughput to the HS06 rating of the worker nodes. We also compared it with a much faster benchmark script that is used by the DIRAC framework used by LHCb for evaluating at run time the performance of the worker nodes. This contribution reports on the finding of these comparisons: the main observation is that the scaling with HS06 is no longer fulfilled, while the fast benchmarks have a better scaling but are less precise. One can also clearly see that some hardware or software features when enabled on the worker nodes may enhance their performance beyond expectation from either benchmark, depending on external factors.
      

      
      Application of graphics processing units to search pipelines for gravitational waves from coalescing binaries of compact objects
      NASA Astrophysics Data System (ADS)
      Chung, Shin Kee; Wen, Linqing; Blair, David; Cannon, Kipp; Datta, Amitava
         2010-07-01
         We report a novel application of a graphics processing unit (GPU) for the purpose of accelerating the search pipelines for gravitational waves from coalescing binaries of compact objects. A speed-up of 16-fold in total has been achieved with an NVIDIA GeForce 8800 Ultra GPU card compared with one core of a 2.5 GHz Intel Q9300 central processing unit (CPU). We show that substantial improvements are possible and discuss the reduction in CPU count required for the detection of inspiral sources afforded by the use of GPUs.
      

      
      Performance of the OVERFLOW-MLP and LAURA-MLP CFD Codes on the NASA Ames 512 CPU Origin System
      NASA Technical Reports Server (NTRS)
      Taft, James R.
         2000-01-01
         The shared memory Multi-Level Parallelism (MLP) technique, developed last year at NASA Ames has been very successful in dramatically improving the performance of important NASA CFD codes. This new and very simple parallel programming technique was first inserted into the OVERFLOW production CFD code in FY 1998. The OVERFLOW-MLP code's parallel performance scaled linearly to 256 CPUs on the NASA Ames 256 CPU Origin 2000 system (steger). Overall performance exceeded 20.1 GFLOP/s, or about 4.5x the performance of a dedicated 16 CPU C90 system. All of this was achieved without any major modification to the original vector based code. The OVERFLOW-MLP code is now in production on the inhouse Origin systems as well as being used offsite at commercial aerospace companies. Partially as a result of this work, NASA Ames has purchased a new 512 CPU Origin 2000 system to further test the limits of parallel performance for NASA codes of interest. This paper presents the performance obtained from the latest optimization efforts on this machine for the LAURA-MLP and OVERFLOW-MLP codes. The Langley Aerothermodynamics Upwind Relaxation Algorithm (LAURA) code is a key simulation tool in the development of the next generation shuttle, interplanetary reentry vehicles, and nearly all "X" plane development. This code sustains about 4-5 GFLOP/s on a dedicated 16 CPU C90. At this rate, expected workloads would require over 100 C90 CPU years of computing over the next few calendar years. It is not feasible to expect that this would be affordable or available to the user community. Dramatic performance gains on cheaper systems are needed. This code is expected to be perhaps the largest consumer of NASA Ames compute cycles per run in the coming year.The OVERFLOW CFD code is extensively used in the government and commercial aerospace communities to evaluate new aircraft designs. It is one of the largest consumers of NASA supercomputing cycles and large simulations of highly resolved full aircraft are routinely undertaken. Typical large problems might require 100s of Cray C90 CPU hours to complete. The dramatic performance gains with the 256 CPU steger system are exciting. Obtaining results in hours instead of months is revolutionizing the way in which aircraft manufacturers are looking at future aircraft simulation work. Figure 2 below is a current state of the art plot of OVERFLOW-MLP performance on the 512 CPU Lomax system. As can be seen, the chart indicates that OVERFLOW-MLP continues to scale linearly with CPU count up to 512 CPUs on a large 35 million point full aircraft RANS simulation. At this point performance is such that a fully converged simulation of 2500 time steps is completed in less than 2 hours of elapsed time. Further work over the next few weeks will improve the performance of this code even further.The LAURA code has been converted to the MLP format as well. This code is currently being optimized for the 512 CPU system. Performance statistics indicate that the goal of 100 GFLOP/s will be achieved by year's end. This amounts to 20x the 16 CPU C90 result and strongly demonstrates the viability of the new parallel systems rapidly solving very large simulations in a production environment.
      

      
      Inexact adaptive Newton methods
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Bertiger, W.I.; Kelsey, F.J.
         1985-02-01
         The Inexact Adaptive Newton method (IAN) is a modification of the Adaptive Implicit Method/sup 1/ (AIM) with improved Newton convergence. Both methods simplify the Jacobian at each time step by zeroing coefficients in regions where saturations are changing slowly. The methods differ in how the diagonal block terms are treated. On test problems with up to 3,000 cells, IAN consistently saves approximately 30% of the CPU time when compared to the fully implicit method. AIM shows similar savings on some problems, but takes as much CPU time as fully implicit on other test problems due to poor Newton convergence.
      

      
      Semiempirical Quantum Chemical Calculations Accelerated on a Hybrid Multicore CPU-GPU Computing Platform.
      PubMed
      Wu, Xin; Koslowski, Axel; Thiel, Walter
         2012-07-10
         In this work, we demonstrate that semiempirical quantum chemical calculations can be accelerated significantly by leveraging the graphics processing unit (GPU) as a coprocessor on a hybrid multicore CPU-GPU computing platform. Semiempirical calculations using the MNDO, AM1, PM3, OM1, OM2, and OM3 model Hamiltonians were systematically profiled for three types of test systems (fullerenes, water clusters, and solvated crambin) to identify the most time-consuming sections of the code. The corresponding routines were ported to the GPU and optimized employing both existing library functions and a GPU kernel that carries out a sequence of noniterative Jacobi transformations during pseudodiagonalization. The overall computation times for single-point energy calculations and geometry optimizations of large molecules were reduced by one order of magnitude for all methods, as compared to runs on a single CPU core.
      

      
      A GPU-based calculation using the three-dimensional FDTD method for electromagnetic field analysis.
      PubMed
      Nagaoka, Tomoaki; Watanabe, Soichi
         2010-01-01
         Numerical simulations with the numerical human model using the finite-difference time domain (FDTD) method have recently been performed frequently in a number of fields in biomedical engineering. However, the FDTD calculation runs too slowly. We focus, therefore, on general purpose programming on the graphics processing unit (GPGPU). The three-dimensional FDTD method was implemented on the GPU using Compute Unified Device Architecture (CUDA). In this study, we used the NVIDIA Tesla C1060 as a GPGPU board. The performance of the GPU is evaluated in comparison with the performance of a conventional CPU and a vector supercomputer. The results indicate that three-dimensional FDTD calculations using a GPU can significantly reduce run time in comparison with that using a conventional CPU, even a native GPU implementation of the three-dimensional FDTD method, while the GPU/CPU speed ratio varies with the calculation domain and thread block size.
      

      
      Research on control law accelerator of digital signal process chip TMS320F28035 for real-time data acquisition and processing
      NASA Astrophysics Data System (ADS)
      Zhao, Shuangle; Zhang, Xueyi; Sun, Shengli; Wang, Xudong
         2017-08-01
         TI C2000 series digital signal process (DSP) chip has been widely used in electrical engineering, measurement and control, communications and other professional fields, DSP TMS320F28035 is one of the most representative of a kind. When using the DSP program, need data acquisition and data processing, and if the use of common mode C or assembly language programming, the program sequence, analogue-to-digital (AD) converter cannot be real-time acquisition, often missing a lot of data. The control low accelerator (CLA) processor can run in parallel with the main central processing unit (CPU), and the frequency is consistent with the main CPU, and has the function of floating point operations. Therefore, the CLA coprocessor is used in the program, and the CLA kernel is responsible for data processing. The main CPU is responsible for the AD conversion. The advantage of this method is to reduce the time of data processing and realize the real-time performance of data acquisition.
      

      
      Restricted Collision List method for faster Direct Simulation Monte-Carlo (DSMC) collisions
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Macrossan, Michael N., E-mail: m.macrossan@uq.edu.au
         
         The ‘Restricted Collision List’ (RCL) method for speeding up the calculation of DSMC Variable Soft Sphere collisions, with Borgnakke–Larsen (BL) energy exchange, is presented. The method cuts down considerably on the number of random collision parameters which must be calculated (deflection and azimuthal angles, and the BL energy exchange factors). A relatively short list of these parameters is generated and the parameters required in any cell are selected from this list. The list is regenerated at intervals approximately equal to the smallest mean collision time in the flow, and the chance of any particle re-using the same collision parameters inmore » two successive collisions is negligible. The results using this method are indistinguishable from those obtained with standard DSMC. The CPU time saving depends on how much of a DSMC calculation is devoted to collisions and how much is devoted to other tasks, such as moving particles and calculating particle interactions with flow boundaries. For 1-dimensional calculations of flow in a tube, the new method saves 20% of the CPU time per collision for VSS scattering with no energy exchange. With RCL applied to rotational energy exchange, the CPU saving can be greater; for small values of the rotational collision number, for which most collisions involve some rotational energy exchange, the CPU may be reduced by 50% or more.« less
      

      
      Bayer image parallel decoding based on GPU
      NASA Astrophysics Data System (ADS)
      Hu, Rihui; Xu, Zhiyong; Wei, Yuxing; Sun, Shaohua
         2012-11-01
         In the photoelectrical tracking system, Bayer image is decompressed in traditional method, which is CPU-based. However, it is too slow when the images become large, for example, 2K×2K×16bit. In order to accelerate the Bayer image decoding, this paper introduces a parallel speedup method for NVIDA's Graphics Processor Unit (GPU) which supports CUDA architecture. The decoding procedure can be divided into three parts: the first is serial part, the second is task-parallelism part, and the last is data-parallelism part including inverse quantization, inverse discrete wavelet transform (IDWT) as well as image post-processing part. For reducing the execution time, the task-parallelism part is optimized by OpenMP techniques. The data-parallelism part could advance its efficiency through executing on the GPU as CUDA parallel program. The optimization techniques include instruction optimization, shared memory access optimization, the access memory coalesced optimization and texture memory optimization. In particular, it can significantly speed up the IDWT by rewriting the 2D (Tow-dimensional) serial IDWT into 1D parallel IDWT. Through experimenting with 1K×1K×16bit Bayer image, data-parallelism part is 10 more times faster than CPU-based implementation. Finally, a CPU+GPU heterogeneous decompression system was designed. The experimental result shows that it could achieve 3 to 5 times speed increase compared to the CPU serial method.
      

      
      SU-E-T-423: Fast Photon Convolution Calculation with a 3D-Ideal Kernel On the GPU
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Moriya, S; Sato, M; Tachibana, H
         
         Purpose: The calculation time is a trade-off for improving the accuracy of convolution dose calculation with fine calculation spacing of the KERMA kernel. We investigated to accelerate the convolution calculation using an ideal kernel on the Graphic Processing Units (GPU). Methods: The calculation was performed on the AMD graphics hardware of Dual FirePro D700 and our algorithm was implemented using the Aparapi that convert Java bytecode to OpenCL. The process of dose calculation was separated with the TERMA and KERMA steps. The dose deposited at the coordinate (x, y, z) was determined in the process. In the dose calculation runningmore » on the central processing unit (CPU) of Intel Xeon E5, the calculation loops were performed for all calculation points. On the GPU computation, all of the calculation processes for the points were sent to the GPU and the multi-thread computation was done. In this study, the dose calculation was performed in a water equivalent homogeneous phantom with 150{sup 3} voxels (2 mm calculation grid) and the calculation speed on the GPU to that on the CPU and the accuracy of PDD were compared. Results: The calculation time for the GPU and the CPU were 3.3 sec and 4.4 hour, respectively. The calculation speed for the GPU was 4800 times faster than that for the CPU. The PDD curve for the GPU was perfectly matched to that for the CPU. Conclusion: The convolution calculation with the ideal kernel on the GPU was clinically acceptable for time and may be more accurate in an inhomogeneous region. Intensity modulated arc therapy needs dose calculations for different gantry angles at many control points. Thus, it would be more practical that the kernel uses a coarse spacing technique if the calculation is faster while keeping the similar accuracy to a current treatment planning system.« less
      

      
      An evaluation of superminicomputers for thermal analysis
      NASA Technical Reports Server (NTRS)
      Storaasli, O. O.; Vidal, J. B.; Jones, G. K.
         1962-01-01
         The feasibility and cost effectiveness of solving thermal analysis problems on superminicomputers is demonstrated. Conventional thermal analysis and the changing computer environment, computer hardware and software used, six thermal analysis test problems, performance of superminicomputers (CPU time, accuracy, turnaround, and cost) and comparison with large computers are considered. Although the CPU times for superminicomputers were 15 to 30 times greater than the fastest mainframe computer, the minimum cost to obtain the solutions on superminicomputers was from 11 percent to 59 percent of the cost of mainframe solutions. The turnaround (elapsed) time is highly dependent on the computer load, but for large problems, superminicomputers produced results in less elapsed time than a typically loaded mainframe computer.
      

        
       
          

«

3
      4
      5
   6
      7
      »

          
        

     

   

   
       
            
              
          

«

4
      5
      6
   7
      8
      »

          
        

           
           
             
               
      
      Optimizing SIEM Throughput on the Cloud Using Parallelization.
      PubMed
      Alam, Masoom; Ihsan, Asif; Khan, Muazzam A; Javaid, Qaisar; Khan, Abid; Manzoor, Jawad; Akhundzada, Adnan; Khan, Muhammad Khurram; Farooq, Sajid
         2016-01-01
         Processing large amounts of data in real time for identifying security issues pose several performance challenges, especially when hardware infrastructure is limited. Managed Security Service Providers (MSSP), mostly hosting their applications on the Cloud, receive events at a very high rate that varies from a few hundred to a couple of thousand events per second (EPS). It is critical to process this data efficiently, so that attacks could be identified quickly and necessary response could be initiated. This paper evaluates the performance of a security framework OSTROM built on the Esper complex event processing (CEP) engine under a parallel and non-parallel computational framework. We explain three architectures under which Esper can be used to process events. We investigated the effect on throughput, memory and CPU usage in each configuration setting. The results indicate that the performance of the engine is limited by the number of events coming in rather than the queries being processed. The architecture where 1/4th of the total events are submitted to each instance and all the queries are processed by all the units shows best results in terms of throughput, memory and CPU usage.
      

      
      Airloads on Bluff Bodies, with Application to the Rotor-Induced Downloads on Tilt-Rotor Aircraft.
      DTIC Science & Technology
      
         1983-09-01
         interference aerodynamics would be tion on hover performance (Ref. (11). to study the two-dimensional sec- tion characteristics of a wing in the wake of a...resources for large numbers of vortices; a typical case requires 10-15 min CPU time on the Ames Cray IS computer. Figure 6 shows a typical result. Here...CPU time per case on a Prime 550UPPER SURFACE (WINDWARD) computer to converge to a steady solution; this would be equivalent to one or two seconds on
      

      
      Comparing the Consumption of CPU Hours with Scientific Output for the Extreme Science and Engineering Discovery Environment (XSEDE).
      PubMed
      Knepper, Richard; Börner, Katy
         2016-01-01
         This paper presents the results of a study that compares resource usage with publication output using data about the consumption of CPU cycles from the Extreme Science and Engineering Discovery Environment (XSEDE) and resulting scientific publications for 2,691 institutions/teams. Specifically, the datasets comprise a total of 5,374,032,696 central processing unit (CPU) hours run in XSEDE during July 1, 2011 to August 18, 2015 and 2,882 publications that cite the XSEDE resource. Three types of studies were conducted: a geospatial analysis of XSEDE providers and consumers, co-authorship network analysis of XSEDE publications, and bi-modal network analysis of how XSEDE resources are used by different research fields. Resulting visualizations show that a diverse set of consumers make use of XSEDE resources, that users of XSEDE publish together frequently, and that the users of XSEDE with the highest resource usage tend to be "traditional" high-performance computing (HPC) community members from astronomy, atmospheric science, physics, chemistry, and biology.
      

      
      A CPU benchmark for protein crystallographic refinement.
      PubMed
      Bourne, P E; Hendrickson, W A
         1990-01-01
         The CPU time required to complete a cycle of restrained least-squares refinement of a protein structure from X-ray crystallographic data using the FORTRAN codes PROTIN and PROLSQ are reported for 48 different processors, ranging from single-user workstations to supercomputers. Sequential, vector, VLIW, multiprocessor, and RISC hardware architectures are compared using both a small and a large protein structure. Representative compile times for each hardware type are also given, and the improvement in run-time when coding for a specific hardware architecture considered. The benchmarks involve scalar integer and vector floating point arithmetic and are representative of the calculations performed in many scientific disciplines.
      

      
      Adaptive real-time methodology for optimizing energy-efficient computing
      DOEpatents
      Hsu, Chung-Hsing [Los Alamos, NM; Feng, Wu-Chun [Blacksburg, VA
         2011-06-28
         Dynamic voltage and frequency scaling (DVFS) is an effective way to reduce energy and power consumption in microprocessor units. Current implementations of DVFS suffer from inaccurate modeling of power requirements and usage, and from inaccurate characterization of the relationships between the applicable variables. A system and method is proposed that adjusts CPU frequency and voltage based on run-time calculations of the workload processing time, as well as a calculation of performance sensitivity with respect to CPU frequency. The system and method are processor independent, and can be applied to either an entire system as a unit, or individually to each process running on a system.
      

      
      Real time display Fourier-domain OCT using multi-thread parallel computing with data vectorization
      NASA Astrophysics Data System (ADS)
      Eom, Tae Joong; Kim, Hoon Seop; Kim, Chul Min; Lee, Yeung Lak; Choi, Eun-Seo
         2011-03-01
         We demonstrate a real-time display of processed OCT images using multi-thread parallel computing with a quad-core CPU of a personal computer. The data of each A-line are treated as one vector to maximize the data translation rate between the cores of the CPU and RAM stored image data. A display rate of 29.9 frames/sec for processed OCT data (4096 FFT-size x 500 A-scans) is achieved in our system using a wavelength swept source with 52-kHz swept frequency. The data processing times of the OCT image and a Doppler OCT image with a 4-time average are 23.8 msec and 91.4 msec.
      

      
      GPU based contouring method on grid DEM data
      NASA Astrophysics Data System (ADS)
      Tan, Liheng; Wan, Gang; Li, Feng; Chen, Xiaohui; Du, Wenlong
         2017-08-01
         This paper presents a novel method to generate contour lines from grid DEM data based on the programmable GPU pipeline. The previous contouring approaches often use CPU to construct a finite element mesh from the raw DEM data, and then extract contour segments from the elements. They also need a tracing or sorting strategy to generate the final continuous contours. These approaches can be heavily CPU-costing and time-consuming. Meanwhile the generated contours would be unsmooth if the raw data is sparsely distributed. Unlike the CPU approaches, we employ the GPU's vertex shader to generate a triangular mesh with arbitrary user-defined density, in which the height of each vertex is calculated through a third-order Cardinal spline function. Then in the same frame, segments are extracted from the triangles by the geometry shader, and translated to the CPU-side with an internal order in the GPU's transform feedback stage. Finally we propose a "Grid Sorting" algorithm to achieve the continuous contour lines by travelling the segments only once. Our method makes use of multiple stages of GPU pipeline for computation, which can generate smooth contour lines, and is significantly faster than the previous CPU approaches. The algorithm can be easily implemented with OpenGL 3.3 API or higher on consumer-level PCs.
      

      
      High-throughput sequence alignment using Graphics Processing Units
      PubMed Central
      Schatz, Michael C; Trapnell, Cole; Delcher, Arthur L; Varshney, Amitabh
         2007-01-01
         Background The recent availability of new, less expensive high-throughput DNA sequencing technologies has yielded a dramatic increase in the volume of sequence data that must be analyzed. These data are being generated for several purposes, including genotyping, genome resequencing, metagenomics, and de novo genome assembly projects. Sequence alignment programs such as MUMmer have proven essential for analysis of these data, but researchers will need ever faster, high-throughput alignment tools running on inexpensive hardware to keep up with new sequence technologies. Results This paper describes MUMmerGPU, an open-source high-throughput parallel pairwise local sequence alignment program that runs on commodity Graphics Processing Units (GPUs) in common workstations. MUMmerGPU uses the new Compute Unified Device Architecture (CUDA) from nVidia to align multiple query sequences against a single reference sequence stored as a suffix tree. By processing the queries in parallel on the highly parallel graphics card, MUMmerGPU achieves more than a 10-fold speedup over a serial CPU version of the sequence alignment kernel, and outperforms the exact alignment component of MUMmer on a high end CPU by 3.5-fold in total application time when aligning reads from recent sequencing projects using Solexa/Illumina, 454, and Sanger sequencing technologies. Conclusion MUMmerGPU is a low cost, ultra-fast sequence alignment program designed to handle the increasing volume of data produced by new, high-throughput sequencing technologies. MUMmerGPU demonstrates that even memory-intensive applications can run significantly faster on the relatively low-cost GPU than on the CPU. PMID:18070356
      

      
      Upwind relaxation methods for the Navier-Stokes equations using inner iterations
      NASA Technical Reports Server (NTRS)
      Taylor, Arthur C., III; Ng, Wing-Fai; Walters, Robert W.
         1992-01-01
         A subsonic and a supersonic problem are respectively treated by an upwind line-relaxation algorithm for the Navier-Stokes equations using inner iterations to accelerate steady-state solution convergence and thereby minimize CPU time. While the ability of the inner iterative procedure to mimic the quadratic convergence of the direct solver method is attested to in both test problems, some of the nonquadratic inner iterative results are noted to have been more efficient than the quadratic. In the more successful, supersonic test case, inner iteration required only about 65 percent of the line-relaxation method-entailed CPU time.
      

      
      Method and apparatus for measuring spatial uniformity of radiation
      DOEpatents
      Field, Halden
         2002-01-01
         A method and apparatus for measuring the spatial uniformity of the intensity of a radiation beam from a radiation source based on a single sampling time and/or a single pulse of radiation. The measuring apparatus includes a plurality of radiation detectors positioned on planar mounting plate to form a radiation receiving area that has a shape and size approximating the size and shape of the cross section of the radiation beam. The detectors concurrently receive portions of the radiation beam and transmit electrical signals representative of the intensity of impinging radiation to a signal processor circuit connected to each of the detectors and adapted to concurrently receive the electrical signals from the detectors and process with a central processing unit (CPU) the signals to determine intensities of the radiation impinging at each detector location. The CPU displays the determined intensities and relative intensity values corresponding to each detector location to an operator of the measuring apparatus on an included data display device. Concurrent sampling of each detector is achieved by connecting to each detector a sample and hold circuit that is configured to track the signal and store it upon receipt of a "capture" signal. A switching device then selectively retrieves the signals and transmits the signals to the CPU through a single analog to digital (A/D) converter. The "capture" signal. is then removed from the sample-and-hold circuits. Alternatively, concurrent sampling is achieved by providing an A/D converter for each detector, each of which transmits a corresponding digital signal to the CPU. The sampling or reading of the detector signals can be controlled by the CPU or level-detection and timing circuit.
      

      
      CUDASW++ 3.0: accelerating Smith-Waterman protein database search by coupling CPU and GPU SIMD instructions.
      PubMed
      Liu, Yongchao; Wirawan, Adrianto; Schmidt, Bertil
         2013-04-04
         The maximal sensitivity for local alignments makes the Smith-Waterman algorithm a popular choice for protein sequence database search based on pairwise alignment. However, the algorithm is compute-intensive due to a quadratic time complexity. Corresponding runtimes are further compounded by the rapid growth of sequence databases. We present CUDASW++ 3.0, a fast Smith-Waterman protein database search algorithm, which couples CPU and GPU SIMD instructions and carries out concurrent CPU and GPU computations. For the CPU computation, this algorithm employs SSE-based vector execution units as accelerators. For the GPU computation, we have investigated for the first time a GPU SIMD parallelization, which employs CUDA PTX SIMD video instructions to gain more data parallelism beyond the SIMT execution model. Moreover, sequence alignment workloads are automatically distributed over CPUs and GPUs based on their respective compute capabilities. Evaluation on the Swiss-Prot database shows that CUDASW++ 3.0 gains a performance improvement over CUDASW++ 2.0 up to 2.9 and 3.2, with a maximum performance of 119.0 and 185.6 GCUPS, on a single-GPU GeForce GTX 680 and a dual-GPU GeForce GTX 690 graphics card, respectively. In addition, our algorithm has demonstrated significant speedups over other top-performing tools: SWIPE and BLAST+. CUDASW++ 3.0 is written in CUDA C++ and PTX assembly languages, targeting GPUs based on the Kepler architecture. This algorithm obtains significant speedups over its predecessor: CUDASW++ 2.0, by benefiting from the use of CPU and GPU SIMD instructions as well as the concurrent execution on CPUs and GPUs. The source code and the simulated data are available at http://cudasw.sourceforge.net.
      

      
      Evaluation of patients with methamphetamine- and cocaine-related chest pain in a chest pain observation unit.
      PubMed
      Diercks, Deborah B; Kirk, J Douglas; Turnipseed, Samuel D; Amsterdam, Ezra A
         2007-12-01
         Risk of acute coronary events in patients with methamphetamine and cocaine intoxication has been described. Little is known about the need for additional evaluation in these patients who do not have evidence of myocardial infarction after the initial emergency department evaluation. We herein describe our experience with these patients in a chest pain unit (CPU) and the rate of cardiac-related chest pain in this group. Retrospective analysis of patients evaluated in our CPU from January 1, 2000 to December 16, 2004 with a history of chest pain. Patients who had a positive urine toxicologic screen for methamphetamine or cocaine were included. No patients had ECG or cardiac injury marker evidence of myocardial infarction or ischemia during the initial emergency department evaluation. A diagnosis of cardiac-related chest pain was based upon positive diagnostic testing (exercise stress testing, nuclear perfusion imaging, stress echocardiography, or coronary artery stenosis >70%). During the study period, 4568 patients were evaluated in the CPU. A total of 1690 (37%) of patients admitted to the CPU underwent urine toxicologic testing. The result of urine toxicologic test was positive for cocaine or methamphetamine in 224 (5%). In the 2871 patients who underwent diagnostic testing for coronary artery disease (CAD), 401 (14%) were found to have positive results. There was no difference in the prevalence of CAD between those with positive result for toxicology screens (26/156, 17%) and those without (375/2715, 13%, RR 1.2, 95% CI 0.8-1.7). These findings suggest a relatively high rate of CAD in patients with methamphetamine and cocaine use evaluated in a CPU.
      

      
      Performance and accuracy of criticality calculations performed using WARP – A framework for continuous energy Monte Carlo neutron transport in general 3D geometries on GPUs
      DOE PAGES
      Bergmann, Ryan M.; Rowland, Kelly L.; Radnović, Nikola; ...
         2017-05-01
         In this companion paper to "Algorithmic Choices in WARP - A Framework for Continuous Energy Monte Carlo Neutron Transport in General 3D Geometries on GPUs" (doi:10.1016/j.anucene.2014.10.039), the WARP Monte Carlo neutron transport framework for graphics processing units (GPUs) is benchmarked against production-level central processing unit (CPU) Monte Carlo neutron transport codes for both performance and accuracy. We compare neutron flux spectra, multiplication factors, runtimes, speedup factors, and costs of various GPU and CPU platforms running either WARP, Serpent 2.1.24, or MCNP 6.1. WARP compares well with the results of the production-level codes, and it is shown that on the newestmore » hardware considered, GPU platforms running WARP are between 0.8 to 7.6 times as fast as CPU platforms running production codes. Also, the GPU platforms running WARP were between 15% and 50% as expensive to purchase and between 80% to 90% as expensive to operate as equivalent CPU platforms performing at an equal simulation rate.« less
      

      
      Performance and accuracy of criticality calculations performed using WARP – A framework for continuous energy Monte Carlo neutron transport in general 3D geometries on GPUs
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Bergmann, Ryan M.; Rowland, Kelly L.; Radnović, Nikola
         
         In this companion paper to "Algorithmic Choices in WARP - A Framework for Continuous Energy Monte Carlo Neutron Transport in General 3D Geometries on GPUs" (doi:10.1016/j.anucene.2014.10.039), the WARP Monte Carlo neutron transport framework for graphics processing units (GPUs) is benchmarked against production-level central processing unit (CPU) Monte Carlo neutron transport codes for both performance and accuracy. We compare neutron flux spectra, multiplication factors, runtimes, speedup factors, and costs of various GPU and CPU platforms running either WARP, Serpent 2.1.24, or MCNP 6.1. WARP compares well with the results of the production-level codes, and it is shown that on the newestmore » hardware considered, GPU platforms running WARP are between 0.8 to 7.6 times as fast as CPU platforms running production codes. Also, the GPU platforms running WARP were between 15% and 50% as expensive to purchase and between 80% to 90% as expensive to operate as equivalent CPU platforms performing at an equal simulation rate.« less
      

      
      Adaptive real-time methodology for optimizing energy-efficient computing
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Hsu, Chung-Hsing; Feng, Wu-Chun
         
         Dynamic voltage and frequency scaling (DVFS) is an effective way to reduce energy and power consumption in microprocessor units. Current implementations of DVFS suffer from inaccurate modeling of power requirements and usage, and from inaccurate characterization of the relationships between the applicable variables. A system and method is proposed that adjusts CPU frequency and voltage based on run-time calculations of the workload processing time, as well as a calculation of performance sensitivity with respect to CPU frequency. The system and method are processor independent, and can be applied to either an entire system as a unit, or individually to eachmore » process running on a system.« less
      

      
      AMITIS: A 3D GPU-Based Hybrid-PIC Model for Space and Plasma Physics
      NASA Astrophysics Data System (ADS)
      Fatemi, Shahab; Poppe, Andrew R.; Delory, Gregory T.; Farrell, William M.
         2017-05-01
         We have developed, for the first time, an advanced modeling infrastructure in space simulations (AMITIS) with an embedded three-dimensional self-consistent grid-based hybrid model of plasma (kinetic ions and fluid electrons) that runs entirely on graphics processing units (GPUs). The model uses NVIDIA GPUs and their associated parallel computing platform, CUDA, developed for general purpose processing on GPUs. The model uses a single CPU-GPU pair, where the CPU transfers data between the system and GPU memory, executes CUDA kernels, and writes simulation outputs on the disk. All computations, including moving particles, calculating macroscopic properties of particles on a grid, and solving hybrid model equations are processed on a single GPU. We explain various computing kernels within AMITIS and compare their performance with an already existing well-tested hybrid model of plasma that runs in parallel using multi-CPU platforms. We show that AMITIS runs ∼10 times faster than the parallel CPU-based hybrid model. We also introduce an implicit solver for computation of Faraday’s Equation, resulting in an explicit-implicit scheme for the hybrid model equation. We show that the proposed scheme is stable and accurate. We examine the AMITIS energy conservation and show that the energy is conserved with an error < 0.2% after 500,000 timesteps, even when a very low number of particles per cell is used.
      

      
      Enhanced round robin CPU scheduling with burst time based time quantum
      NASA Astrophysics Data System (ADS)
      Indusree, J. R.; Prabadevi, B.
         2017-11-01
         Process scheduling is a very important functionality of Operating system. The main-known process-scheduling algorithms are First Come First Serve (FCFS) algorithm, Round Robin (RR) algorithm, Priority scheduling algorithm and Shortest Job First (SJF) algorithm. Compared to its peers, Round Robin (RR) algorithm has the advantage that it gives fair share of CPU to the processes which are already in the ready-queue. The effectiveness of the RR algorithm greatly depends on chosen time quantum value. Through this research paper, we are proposing an enhanced algorithm called Enhanced Round Robin with Burst-time based Time Quantum (ERRBTQ) process scheduling algorithm which calculates time quantum as per the burst-time of processes already in ready queue. The experimental results and analysis of ERRBTQ algorithm clearly indicates the improved performance when compared with conventional RR and its variants.
      

      
      Benchmarking hardware architecture candidates for the NFIRAOS real-time controller
      NASA Astrophysics Data System (ADS)
      Smith, Malcolm; Kerley, Dan; Herriot, Glen; Véran, Jean-Pierre
         2014-07-01
         As a part of the trade study for the Narrow Field Infrared Adaptive Optics System, the adaptive optics system for the Thirty Meter Telescope, we investigated the feasibility of performing real-time control computation using a Linux operating system and Intel Xeon E5 CPUs. We also investigated a Xeon Phi based architecture which allows higher levels of parallelism. This paper summarizes both the CPU based real-time controller architecture and the Xeon Phi based RTC. The Intel Xeon E5 CPU solution meets the requirements and performs the computation for one AO cycle in an average of 767 microseconds. The Xeon Phi solution did not meet the 1200 microsecond time requirement and also suffered from unpredictable execution times. More detailed benchmark results are reported for both architectures.
      

      
      Comparing the Consumption of CPU Hours with Scientific Output for the Extreme Science and Engineering Discovery Environment (XSEDE)
      PubMed Central
      Börner, Katy
         2016-01-01
         This paper presents the results of a study that compares resource usage with publication output using data about the consumption of CPU cycles from the Extreme Science and Engineering Discovery Environment (XSEDE) and resulting scientific publications for 2,691 institutions/teams. Specifically, the datasets comprise a total of 5,374,032,696 central processing unit (CPU) hours run in XSEDE during July 1, 2011 to August 18, 2015 and 2,882 publications that cite the XSEDE resource. Three types of studies were conducted: a geospatial analysis of XSEDE providers and consumers, co-authorship network analysis of XSEDE publications, and bi-modal network analysis of how XSEDE resources are used by different research fields. Resulting visualizations show that a diverse set of consumers make use of XSEDE resources, that users of XSEDE publish together frequently, and that the users of XSEDE with the highest resource usage tend to be “traditional” high-performance computing (HPC) community members from astronomy, atmospheric science, physics, chemistry, and biology. PMID:27310174
      

      
      Optimizing SIEM Throughput on the Cloud Using Parallelization
      PubMed Central
      Alam, Masoom; Ihsan, Asif; Javaid, Qaisar; Khan, Abid; Manzoor, Jawad; Akhundzada, Adnan; Khan, M Khurram; Farooq, Sajid
         2016-01-01
         Processing large amounts of data in real time for identifying security issues pose several performance challenges, especially when hardware infrastructure is limited. Managed Security Service Providers (MSSP), mostly hosting their applications on the Cloud, receive events at a very high rate that varies from a few hundred to a couple of thousand events per second (EPS). It is critical to process this data efficiently, so that attacks could be identified quickly and necessary response could be initiated. This paper evaluates the performance of a security framework OSTROM built on the Esper complex event processing (CEP) engine under a parallel and non-parallel computational framework. We explain three architectures under which Esper can be used to process events. We investigated the effect on throughput, memory and CPU usage in each configuration setting. The results indicate that the performance of the engine is limited by the number of events coming in rather than the queries being processed. The architecture where 1/4th of the total events are submitted to each instance and all the queries are processed by all the units shows best results in terms of throughput, memory and CPU usage. PMID:27851762
      

        
       
          

«

4
      5
      6
   7
      8
      »

          
        

     

   

   
       
            
              
          

«

5
      6
      7
   8
      9
      »

          
        

           
           
             
               
      
      Double dissociation of the anterior and posterior dorsomedial caudate-putamen in the acquisition and expression of associative learning with the nicotine stimulus.
      PubMed
      Charntikov, Sergios; Pittenger, Steven T; Swalve, Natashia; Li, Ming; Bevins, Rick A
         2017-07-15
         Tobacco use is the leading cause of preventable deaths worldwide. This habit is not only debilitating to individual users but also to those around them (second-hand smoking). Nicotine is the main addictive component of tobacco products and is a moderate stimulant and a mild reinforcer. Importantly, besides its unconditional effects, nicotine also has conditioned stimulus effects that may contribute to the tenacity of the smoking habit. Because the neurobiological substrates underlying these processes are virtually unexplored, the present study investigated the functional involvement of the dorsomedial caudate putamen (dmCPu) in learning processes with nicotine as an interoceptive stimulus. Rats were trained using the discriminated goal-tracking task where nicotine injections (0.4 mg/kg; SC), on some days, were paired with intermittent (36 per session) sucrose deliveries; sucrose was not available on interspersed saline days. Pre-training excitotoxic or post-training transient lesions of anterior or posterior dmCPu were used to elucidate the role of these areas in acquisition or expression of associative learning with nicotine stimulus. Pre-training lesion of p-dmCPu inhibited acquisition while post-training lesions of p-dmCPu attenuated the expression of associative learning with the nicotine stimulus. On the other hand, post-training lesions of a-dmCPu evoked nicotine-like responding following saline treatment indicating the role of this area in disinhibition of learned motor behaviors. These results, for the first time, show functionally distinct involvement of a- and p-dmCPu in various stages of associative learning using nicotine stimulus and provide an initial account of neural plasticity underlying these learning processes. Copyright © 2017 Elsevier Ltd. All rights reserved.
      

      
      On the Finite Element Implementation of the Generalized Method of Cells Micromechanics Constitutive Model
      NASA Technical Reports Server (NTRS)
      Wilt, T. E.
         1995-01-01
         The Generalized Method of Cells (GMC), a micromechanics based constitutive model, is implemented into the finite element code MARC using the user subroutine HYPELA. Comparisons in terms of transverse deformation response, micro stress and strain distributions, and required CPU time are presented for GMC and finite element models of fiber/matrix unit cell. GMC is shown to provide comparable predictions of the composite behavior and requires significantly less CPU time as compared to a finite element analysis of the unit cell. Details as to the organization of the HYPELA code are provided with the actual HYPELA code included in the appendix.
      

      
      Sequence search on a supercomputer.
      PubMed
      Gotoh, O; Tagashira, Y
         1986-01-10
         A set of programs was developed for searching nucleic acid and protein sequence data bases for sequences similar to a given sequence. The programs, written in FORTRAN 77, were optimized for vector processing on a Hitachi S810-20 supercomputer. A search of a 500-residue protein sequence against the entire PIR data base Ver. 1.0 (1) (0.5 M residues) is carried out in a CPU time of 45 sec. About 4 min is required for an exhaustive search of a 1500-base nucleotide sequence against all mammalian sequences (1.2M bases) in Genbank Ver. 29.0. The CPU time is reduced to about a quarter with a faster version.
      

      
      Finite difference numerical method for the superlattice Boltzmann transport equation and case comparison of CPU(C) and GPU(CUDA) implementations
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Priimak, Dmitri
         2014-12-01
         We present a finite difference numerical algorithm for solving two dimensional spatially homogeneous Boltzmann transport equation which describes electron transport in a semiconductor superlattice subject to crossed time dependent electric and constant magnetic fields. The algorithm is implemented both in C language targeted to CPU and in CUDA C language targeted to commodity NVidia GPU. We compare performances and merits of one implementation versus another and discuss various software optimisation techniques.
      

      
      The development of an interim generalized gate logic software simulator
      NASA Technical Reports Server (NTRS)
      Mcgough, J. G.; Nemeroff, S.
         1985-01-01
         A proof-of-concept computer program called IGGLOSS (Interim Generalized Gate Logic Software Simulator) was developed and is discussed. The simulator engine was designed to perform stochastic estimation of self test coverage (fault-detection latency times) of digital computers or systems. A major attribute of the IGGLOSS is its high-speed simulation: 9.5 x 1,000,000 gates/cpu sec for nonfaulted circuits and 4.4 x 1,000,000 gates/cpu sec for faulted circuits on a VAX 11/780 host computer.
      

      
      Acoustic reverse-time migration using GPU card and POSIX thread based on the adaptive optimal finite-difference scheme and the hybrid absorbing boundary condition
      NASA Astrophysics Data System (ADS)
      Cai, Xiaohui; Liu, Yang; Ren, Zhiming
         2018-06-01
         Reverse-time migration (RTM) is a powerful tool for imaging geologically complex structures such as steep-dip and subsalt. However, its implementation is quite computationally expensive. Recently, as a low-cost solution, the graphic processing unit (GPU) was introduced to improve the efficiency of RTM. In the paper, we develop three ameliorative strategies to implement RTM on GPU card. First, given the high accuracy and efficiency of the adaptive optimal finite-difference (FD) method based on least squares (LS) on central processing unit (CPU), we study the optimal LS-based FD method on GPU. Second, we develop the CPU-based hybrid absorbing boundary condition (ABC) to the GPU-based one by addressing two issues of the former when introduced to GPU card: time-consuming and chaotic threads. Third, for large-scale data, the combinatorial strategy for optimal checkpointing and efficient boundary storage is introduced for the trade-off between memory and recomputation. To save the time of communication between host and disk, the portable operating system interface (POSIX) thread is utilized to create the other CPU core at the checkpoints. Applications of the three strategies on GPU with the compute unified device architecture (CUDA) programming language in RTM demonstrate their efficiency and validity.
      

      
      High-performance computing on GPUs for resistivity logging of oil and gas wells
      NASA Astrophysics Data System (ADS)
      Glinskikh, V.; Dudaev, A.; Nechaev, O.; Surodina, I.
         2017-10-01
         We developed and implemented into software an algorithm for high-performance simulation of electrical logs from oil and gas wells using high-performance heterogeneous computing. The numerical solution of the 2D forward problem is based on the finite-element method and the Cholesky decomposition for solving a system of linear algebraic equations (SLAE). Software implementations of the algorithm used the NVIDIA CUDA technology and computing libraries are made, allowing us to perform decomposition of SLAE and find its solution on central processor unit (CPU) and graphics processor unit (GPU). The calculation time is analyzed depending on the matrix size and number of its non-zero elements. We estimated the computing speed on CPU and GPU, including high-performance heterogeneous CPU-GPU computing. Using the developed algorithm, we simulated resistivity data in realistic models.
      

      
      Multidisciplinary Simulation Acceleration using Multiple Shared-Memory Graphical Processing Units
      NASA Astrophysics Data System (ADS)
      Kemal, Jonathan Yashar
         
         For purposes of optimizing and analyzing turbomachinery and other designs, the unsteady Favre-averaged flow-field differential equations for an ideal compressible gas can be solved in conjunction with the heat conduction equation. We solve all equations using the finite-volume multiple-grid numerical technique, with the dual time-step scheme used for unsteady simulations. Our numerical solver code targets CUDA-capable Graphical Processing Units (GPUs) produced by NVIDIA. Making use of MPI, our solver can run across networked compute notes, where each MPI process can use either a GPU or a Central Processing Unit (CPU) core for primary solver calculations. We use NVIDIA Tesla C2050/C2070 GPUs based on the Fermi architecture, and compare our resulting performance against Intel Zeon X5690 CPUs. Solver routines converted to CUDA typically run about 10 times faster on a GPU for sufficiently dense computational grids. We used a conjugate cylinder computational grid and ran a turbulent steady flow simulation using 4 increasingly dense computational grids. Our densest computational grid is divided into 13 blocks each containing 1033x1033 grid points, for a total of 13.87 million grid points or 1.07 million grid points per domain block. To obtain overall speedups, we compare the execution time of the solver's iteration loop, including all resource intensive GPU-related memory copies. Comparing the performance of 8 GPUs to that of 8 CPUs, we obtain an overall speedup of about 6.0 when using our densest computational grid. This amounts to an 8-GPU simulation running about 39.5 times faster than running than a single-CPU simulation.
      

      
      Analysis OpenMP performance of AMD and Intel architecture for breaking waves simulation using MPS
      NASA Astrophysics Data System (ADS)
      Alamsyah, M. N. A.; Utomo, A.; Gunawan, P. H.
         2018-03-01
         Simulation of breaking waves by using Navier-Stokes equation via moving particle semi-implicit method (MPS) over close domain is given. The results show the parallel computing on multicore architecture using OpenMP platform can reduce the computational time almost half of the serial time. Here, the comparison using two computer architectures (AMD and Intel) are performed. The results using Intel architecture is shown better than AMD architecture in CPU time. However, in efficiency, the computer with AMD architecture gives slightly higher than the Intel. For the simulation by 1512 number of particles, the CPU time using Intel and AMD are 12662.47 and 28282.30 respectively. Moreover, the efficiency using similar number of particles, AMD obtains 50.09 % and Intel up to 49.42 %.
      

      
      GPU Computing in Bayesian Inference of Realized Stochastic Volatility Model
      NASA Astrophysics Data System (ADS)
      Takaishi, Tetsuya
         2015-01-01
         The realized stochastic volatility (RSV) model that utilizes the realized volatility as additional information has been proposed to infer volatility of financial time series. We consider the Bayesian inference of the RSV model by the Hybrid Monte Carlo (HMC) algorithm. The HMC algorithm can be parallelized and thus performed on the GPU for speedup. The GPU code is developed with CUDA Fortran. We compare the computational time in performing the HMC algorithm on GPU (GTX 760) and CPU (Intel i7-4770 3.4GHz) and find that the GPU can be up to 17 times faster than the CPU. We also code the program with OpenACC and find that appropriate coding can achieve the similar speedup with CUDA Fortran.
      

      
      Shadow: Running Tor in a Box for Accurate and Efficient Experimentation
      DTIC Science & Technology
      
         2011-09-23
         Modeling the speed of a target CPU is done by running an OpenSSL [31] speed test on a real CPU of that type. This provides us with the raw CPU processing...rate, but we are also interested in the processing speed of an application. By running application 5 benchmarks on the same CPU as the OpenSSL speed test...simulation, saving CPU cy- cles on our simulation host machine. Shadow removes cryptographic processing by preloading the main OpenSSL [31] functions used
      

      
      Multi-GPU and multi-CPU accelerated FDTD scheme for vibroacoustic applications
      NASA Astrophysics Data System (ADS)
      Francés, J.; Otero, B.; Bleda, S.; Gallego, S.; Neipp, C.; Márquez, A.; Beléndez, A.
         2015-06-01
         The Finite-Difference Time-Domain (FDTD) method is applied to the analysis of vibroacoustic problems and to study the propagation of longitudinal and transversal waves in a stratified media. The potential of the scheme and the relevance of each acceleration strategy for massively computations in FDTD are demonstrated in this work. In this paper, we propose two new specific implementations of the bi-dimensional scheme of the FDTD method using multi-CPU and multi-GPU, respectively. In the first implementation, an open source message passing interface (OMPI) has been included in order to massively exploit the resources of a biprocessor station with two Intel Xeon processors. Moreover, regarding CPU code version, the streaming SIMD extensions (SSE) and also the advanced vectorial extensions (AVX) have been included with shared memory approaches that take advantage of the multi-core platforms. On the other hand, the second implementation called the multi-GPU code version is based on Peer-to-Peer communications available in CUDA on two GPUs (NVIDIA GTX 670). Subsequently, this paper presents an accurate analysis of the influence of the different code versions including shared memory approaches, vector instructions and multi-processors (both CPU and GPU) and compares them in order to delimit the degree of improvement of using distributed solutions based on multi-CPU and multi-GPU. The performance of both approaches was analysed and it has been demonstrated that the addition of shared memory schemes to CPU computing improves substantially the performance of vector instructions enlarging the simulation sizes that use efficiently the cache memory of CPUs. In this case GPU computing is slightly twice times faster than the fine tuned CPU version in both cases one and two nodes. However, for massively computations explicit vector instructions do not worth it since the memory bandwidth is the limiting factor and the performance tends to be the same than the sequential version with auto-vectorisation and also shared memory approach. In this scenario GPU computing is the best option since it provides a homogeneous behaviour. More specifically, the speedup of GPU computing achieves an upper limit of 12 for both one and two GPUs, whereas the performance reaches peak values of 80 GFlops and 146 GFlops for the performance for one GPU and two GPUs respectively. Finally, the method is applied to an earth crust profile in order to demonstrate the potential of our approach and the necessity of applying acceleration strategies in these type of applications.
      

      
      Heterogeneous CPU-GPU moving targets detection for UAV video
      NASA Astrophysics Data System (ADS)
      Li, Maowen; Tang, Linbo; Han, Yuqi; Yu, Chunlei; Zhang, Chao; Fu, Huiquan
         2017-07-01
         Moving targets detection is gaining popularity in civilian and military applications. On some monitoring platform of motion detection, some low-resolution stationary cameras are replaced by moving HD camera based on UAVs. The pixels of moving targets in the HD Video taken by UAV are always in a minority, and the background of the frame is usually moving because of the motion of UAVs. The high computational cost of the algorithm prevents running it at higher resolutions the pixels of frame. Hence, to solve the problem of moving targets detection based UAVs video, we propose a heterogeneous CPU-GPU moving target detection algorithm for UAV video. More specifically, we use background registration to eliminate the impact of the moving background and frame difference to detect small moving targets. In order to achieve the effect of real-time processing, we design the solution of heterogeneous CPU-GPU framework for our method. The experimental results show that our method can detect the main moving targets from the HD video taken by UAV, and the average process time is 52.16ms per frame which is fast enough to solve the problem.
      

      
      An emulator for minimizing computer resources for finite element analysis
      NASA Technical Reports Server (NTRS)
      Melosh, R.; Utku, S.; Islam, M.; Salama, M.
         1984-01-01
         A computer code, SCOPE, has been developed for predicting the computer resources required for a given analysis code, computer hardware, and structural problem. The cost of running the code is a small fraction (about 3 percent) of the cost of performing the actual analysis. However, its accuracy in predicting the CPU and I/O resources depends intrinsically on the accuracy of calibration data that must be developed once for the computer hardware and the finite element analysis code of interest. Testing of the SCOPE code on the AMDAHL 470 V/8 computer and the ELAS finite element analysis program indicated small I/O errors (3.2 percent), larger CPU errors (17.8 percent), and negligible total errors (1.5 percent).
      

      
      Accelerated event-by-event Monte Carlo microdosimetric calculations of electrons and protons tracks on a multi-core CPU and a CUDA-enabled GPU.
      PubMed
      Kalantzis, Georgios; Tachibana, Hidenobu
         2014-01-01
         For microdosimetric calculations event-by-event Monte Carlo (MC) methods are considered the most accurate. The main shortcoming of those methods is the extensive requirement for computational time. In this work we present an event-by-event MC code of low projectile energy electron and proton tracks for accelerated microdosimetric MC simulations on a graphic processing unit (GPU). Additionally, a hybrid implementation scheme was realized by employing OpenMP and CUDA in such a way that both GPU and multi-core CPU were utilized simultaneously. The two implementation schemes have been tested and compared with the sequential single threaded MC code on the CPU. Performance comparison was established on the speed-up for a set of benchmarking cases of electron and proton tracks. A maximum speedup of 67.2 was achieved for the GPU-based MC code, while a further improvement of the speedup up to 20% was achieved for the hybrid approach. The results indicate the capability of our CPU-GPU implementation for accelerated MC microdosimetric calculations of both electron and proton tracks without loss of accuracy. Copyright © 2013 Elsevier Ireland Ltd. All rights reserved.
      

      
      CMSA: a heterogeneous CPU/GPU computing system for multiple similar RNA/DNA sequence alignment.
      PubMed
      Chen, Xi; Wang, Chen; Tang, Shanjiang; Yu, Ce; Zou, Quan
         2017-06-24
         The multiple sequence alignment (MSA) is a classic and powerful technique for sequence analysis in bioinformatics. With the rapid growth of biological datasets, MSA parallelization becomes necessary to keep its running time in an acceptable level. Although there are a lot of work on MSA problems, their approaches are either insufficient or contain some implicit assumptions that limit the generality of usage. First, the information of users' sequences, including the sizes of datasets and the lengths of sequences, can be of arbitrary values and are generally unknown before submitted, which are unfortunately ignored by previous work. Second, the center star strategy is suited for aligning similar sequences. But its first stage, center sequence selection, is highly time-consuming and requires further optimization. Moreover, given the heterogeneous CPU/GPU platform, prior studies consider the MSA parallelization on GPU devices only, making the CPUs idle during the computation. Co-run computation, however, can maximize the utilization of the computing resources by enabling the workload computation on both CPU and GPU simultaneously. This paper presents CMSA, a robust and efficient MSA system for large-scale datasets on the heterogeneous CPU/GPU platform. It performs and optimizes multiple sequence alignment automatically for users' submitted sequences without any assumptions. CMSA adopts the co-run computation model so that both CPU and GPU devices are fully utilized. Moreover, CMSA proposes an improved center star strategy that reduces the time complexity of its center sequence selection process from O(mn 2 ) to O(mn). The experimental results show that CMSA achieves an up to 11× speedup and outperforms the state-of-the-art software. CMSA focuses on the multiple similar RNA/DNA sequence alignment and proposes a novel bitmap based algorithm to improve the center star strategy. We can conclude that harvesting the high performance of modern GPU is a promising approach to accelerate multiple sequence alignment. Besides, adopting the co-run computation model can maximize the entire system utilization significantly. The source code is available at https://github.com/wangvsa/CMSA .
      

      
      Fast multipurpose Monte Carlo simulation for proton therapy using multi- and many-core CPU architectures.
      PubMed
      Souris, Kevin; Lee, John Aldo; Sterpin, Edmond
         2016-04-01
         Accuracy in proton therapy treatment planning can be improved using Monte Carlo (MC) simulations. However the long computation time of such methods hinders their use in clinical routine. This work aims to develop a fast multipurpose Monte Carlo simulation tool for proton therapy using massively parallel central processing unit (CPU) architectures. A new Monte Carlo, called MCsquare (many-core Monte Carlo), has been designed and optimized for the last generation of Intel Xeon processors and Intel Xeon Phi coprocessors. These massively parallel architectures offer the flexibility and the computational power suitable to MC methods. The class-II condensed history algorithm of MCsquare provides a fast and yet accurate method of simulating heavy charged particles such as protons, deuterons, and alphas inside voxelized geometries. Hard ionizations, with energy losses above a user-specified threshold, are simulated individually while soft events are regrouped in a multiple scattering theory. Elastic and inelastic nuclear interactions are sampled from ICRU 63 differential cross sections, thereby allowing for the computation of prompt gamma emission profiles. MCsquare has been benchmarked with the gate/geant4 Monte Carlo application for homogeneous and heterogeneous geometries. Comparisons with gate/geant4 for various geometries show deviations within 2%-1 mm. In spite of the limited memory bandwidth of the coprocessor simulation time is below 25 s for 10(7) primary 200 MeV protons in average soft tissues using all Xeon Phi and CPU resources embedded in a single desktop unit. MCsquare exploits the flexibility of CPU architectures to provide a multipurpose MC simulation tool. Optimized code enables the use of accurate MC calculation within a reasonable computation time, adequate for clinical practice. MCsquare also simulates prompt gamma emission and can thus be used also for in vivo range verification.
      

      
      Fast Simulation of Dynamic Ultrasound Images Using the GPU.
      PubMed
      Storve, Sigurd; Torp, Hans
         2017-10-01
         Simulated ultrasound data is a valuable tool for development and validation of quantitative image analysis methods in echocardiography. Unfortunately, simulation time can become prohibitive for phantoms consisting of a large number of point scatterers. The COLE algorithm by Gao et al. is a fast convolution-based simulator that trades simulation accuracy for improved speed. We present highly efficient parallelized CPU and GPU implementations of the COLE algorithm with an emphasis on dynamic simulations involving moving point scatterers. We argue that it is crucial to minimize the amount of data transfers from the CPU to achieve good performance on the GPU. We achieve this by storing the complete trajectories of the dynamic point scatterers as spline curves in the GPU memory. This leads to good efficiency when simulating sequences consisting of a large number of frames, such as B-mode and tissue Doppler data for a full cardiac cycle. In addition, we propose a phase-based subsample delay technique that efficiently eliminates flickering artifacts seen in B-mode sequences when COLE is used without enough temporal oversampling. To assess the performance, we used a laptop computer and a desktop computer, each equipped with a multicore Intel CPU and an NVIDIA GPU. Running the simulator on a high-end TITAN X GPU, we observed two orders of magnitude speedup compared to the parallel CPU version, three orders of magnitude speedup compared to simulation times reported by Gao et al. in their paper on COLE, and a speedup of 27000 times compared to the multithreaded version of Field II, using numbers reported in a paper by Jensen. We hope that by releasing the simulator as an open-source project we will encourage its use and further development.
      

      
      CUDA-based high-performance computing of the S-BPF algorithm with no-waiting pipelining
      NASA Astrophysics Data System (ADS)
      Deng, Lin; Yan, Bin; Chang, Qingmei; Han, Yu; Zhang, Xiang; Xi, Xiaoqi; Li, Lei
         2015-10-01
         The backprojection-filtration (BPF) algorithm has become a good solution for local reconstruction in cone-beam computed tomography (CBCT). However, the reconstruction speed of BPF is a severe limitation for clinical applications. The selective-backprojection filtration (S-BPF) algorithm is developed to improve the parallel performance of BPF by selective backprojection. Furthermore, the general-purpose graphics processing unit (GP-GPU) is a popular tool for accelerating the reconstruction. Much work has been performed aiming for the optimization of the cone-beam back-projection. As the cone-beam back-projection process becomes faster, the data transportation holds a much bigger time proportion in the reconstruction than before. This paper focuses on minimizing the total time in the reconstruction with the S-BPF algorithm by hiding the data transportation among hard disk, CPU and GPU. And based on the analysis of the S-BPF algorithm, some strategies are implemented: (1) the asynchronous calls are used to overlap the implemention of CPU and GPU, (2) an innovative strategy is applied to obtain the DBP image to hide the transport time effectively, (3) two streams for data transportation and calculation are synchronized by the cudaEvent in the inverse of finite Hilbert transform on GPU. Our main contribution is a smart reconstruction of the S-BPF algorithm with GPU's continuous calculation and no data transportation time cost. a 5123 volume is reconstructed in less than 0.7 second on a single Tesla-based K20 GPU from 182 views projection with 5122 pixel per projection. The time cost of our implementation is about a half of that without the overlap behavior.
      

      
      3D Kirchhoff depth migration algorithm: A new scalable approach for parallelization on multicore CPU based cluster
      NASA Astrophysics Data System (ADS)
      Rastogi, Richa; Londhe, Ashutosh; Srivastava, Abhishek; Sirasala, Kirannmayi M.; Khonde, Kiran
         2017-03-01
         In this article, a new scalable 3D Kirchhoff depth migration algorithm is presented on state of the art multicore CPU based cluster. Parallelization of 3D Kirchhoff depth migration is challenging due to its high demand of compute time, memory, storage and I/O along with the need of their effective management. The most resource intensive modules of the algorithm are traveltime calculations and migration summation which exhibit an inherent trade off between compute time and other resources. The parallelization strategy of the algorithm largely depends on the storage of calculated traveltimes and its feeding mechanism to the migration process. The presented work is an extension of our previous work, wherein a 3D Kirchhoff depth migration application for multicore CPU based parallel system had been developed. Recently, we have worked on improving parallel performance of this application by re-designing the parallelization approach. The new algorithm is capable to efficiently migrate both prestack and poststack 3D data. It exhibits flexibility for migrating large number of traces within the available node memory and with minimal requirement of storage, I/O and inter-node communication. The resultant application is tested using 3D Overthrust data on PARAM Yuva II, which is a Xeon E5-2670 based multicore CPU cluster with 16 cores/node and 64 GB shared memory. Parallel performance of the algorithm is studied using different numerical experiments and the scalability results show striking improvement over its previous version. An impressive 49.05X speedup with 76.64% efficiency is achieved for 3D prestack data and 32.00X speedup with 50.00% efficiency for 3D poststack data, using 64 nodes. The results also demonstrate the effectiveness and robustness of the improved algorithm with high scalability and efficiency on a multicore CPU cluster.
      

        
       
          

«

5
      6
      7
   8
      9
      »

          
        

     

   

   
       
            
              
          

«

6
      7
      8
   9
      10
      »

          
        

           
           
             
               
      
      A Framework for Batched and GPU-Resident Factorization Algorithms Applied to Block Householder Transformations
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Dong, Tingzing Tim; Tomov, Stanimire Z; Luszczek, Piotr R
         
         As modern hardware keeps evolving, an increasingly effective approach to developing energy efficient and high-performance solvers is to design them to work on many small size and independent problems. Many applications already need this functionality, especially for GPUs, which are currently known to be about four to five times more energy efficient than multicore CPUs. We describe the development of one-sided factorizations that work for a set of small dense matrices in parallel, and we illustrate our techniques on the QR factorization based on Householder transformations. We refer to this mode of operation as a batched factorization. Our approach ismore » based on representing the algorithms as a sequence of batched BLAS routines for GPU-only execution. This is in contrast to the hybrid CPU-GPU algorithms that rely heavily on using the multicore CPU for specific parts of the workload. But for a system to benefit fully from the GPU's significantly higher energy efficiency, avoiding the use of the multicore CPU must be a primary design goal, so the system can rely more heavily on the more efficient GPU. Additionally, this will result in the removal of the costly CPU-to-GPU communication. Furthermore, we do not use a single symmetric multiprocessor(on the GPU) to factorize a single problem at a time. We illustrate how our performance analysis, and the use of profiling and tracing tools, guided the development and optimization of our batched factorization to achieve up to a 2-fold speedup and a 3-fold energy efficiency improvement compared to our highly optimized batched CPU implementations based on the MKL library(when using two sockets of Intel Sandy Bridge CPUs). Compared to a batched QR factorization featured in the CUBLAS library for GPUs, we achieved up to 5x speedup on the K40 GPU.« less
      

      
      A parallel method of atmospheric correction for multispectral high spatial resolution remote sensing images
      NASA Astrophysics Data System (ADS)
      Zhao, Shaoshuai; Ni, Chen; Cao, Jing; Li, Zhengqiang; Chen, Xingfeng; Ma, Yan; Yang, Leiku; Hou, Weizhen; Qie, Lili; Ge, Bangyu; Liu, Li; Xing, Jin
         2018-03-01
         The remote sensing image is usually polluted by atmosphere components especially like aerosol particles. For the quantitative remote sensing applications, the radiative transfer model based atmospheric correction is used to get the reflectance with decoupling the atmosphere and surface by consuming a long computational time. The parallel computing is a solution method for the temporal acceleration. The parallel strategy which uses multi-CPU to work simultaneously is designed to do atmospheric correction for a multispectral remote sensing image. The parallel framework's flow and the main parallel body of atmospheric correction are described. Then, the multispectral remote sensing image of the Chinese Gaofen-2 satellite is used to test the acceleration efficiency. When the CPU number is increasing from 1 to 8, the computational speed is also increasing. The biggest acceleration rate is 6.5. Under the 8 CPU working mode, the whole image atmospheric correction costs 4 minutes.
      

      
      Evaluating Mobile Graphics Processing Units (GPUs) for Real-Time Resource Constrained Applications
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Meredith, J; Conger, J; Liu, Y
         2005-11-11
         Modern graphics processing units (GPUs) can provide tremendous performance boosts for some applications beyond what a single CPU can accomplish, and their performance is growing at a rate faster than CPUs as well. Mobile GPUs available for laptops have the small form factor and low power requirements suitable for use in embedded processing. We evaluated several desktop and mobile GPUs and CPUs on traditional and non-traditional graphics tasks, as well as on the most time consuming pieces of a full hyperspectral imaging application. Accuracy remained high despite small differences in arithmetic operations like rounding. Performance improvements are summarized here relativemore » to a desktop Pentium 4 CPU.« less
      

      
      Convolution of large 3D images on GPU and its decomposition
      NASA Astrophysics Data System (ADS)
      Karas, Pavel; Svoboda, David
         2011-12-01
         In this article, we propose a method for computing convolution of large 3D images. The convolution is performed in a frequency domain using a convolution theorem. The algorithm is accelerated on a graphic card by means of the CUDA parallel computing model. Convolution is decomposed in a frequency domain using the decimation in frequency algorithm. We pay attention to keeping our approach efficient in terms of both time and memory consumption and also in terms of memory transfers between CPU and GPU which have a significant inuence on overall computational time. We also study the implementation on multiple GPUs and compare the results between the multi-GPU and multi-CPU implementations.
      

      
      CPU-GPU hybrid accelerating the Zuker algorithm for RNA secondary structure prediction applications.
      PubMed
      Lei, Guoqing; Dou, Yong; Wan, Wen; Xia, Fei; Li, Rongchun; Ma, Meng; Zou, Dan
         2012-01-01
         Prediction of ribonucleic acid (RNA) secondary structure remains one of the most important research areas in bioinformatics. The Zuker algorithm is one of the most popular methods of free energy minimization for RNA secondary structure prediction. Thus far, few studies have been reported on the acceleration of the Zuker algorithm on general-purpose processors or on extra accelerators such as Field Programmable Gate-Array (FPGA) and Graphics Processing Units (GPU). To the best of our knowledge, no implementation combines both CPU and extra accelerators, such as GPUs, to accelerate the Zuker algorithm applications. In this paper, a CPU-GPU hybrid computing system that accelerates Zuker algorithm applications for RNA secondary structure prediction is proposed. The computing tasks are allocated between CPU and GPU for parallel cooperate execution. Performance differences between the CPU and the GPU in the task-allocation scheme are considered to obtain workload balance. To improve the hybrid system performance, the Zuker algorithm is optimally implemented with special methods for CPU and GPU architecture. Speedup of 15.93× over optimized multi-core SIMD CPU implementation and performance advantage of 16% over optimized GPU implementation are shown in the experimental results. More than 14% of the sequences are executed on CPU in the hybrid system. The system combining CPU and GPU to accelerate the Zuker algorithm is proven to be promising and can be applied to other bioinformatics applications.
      

      
      General approach to boat simulation in virtual reality systems
      NASA Astrophysics Data System (ADS)
      Aranov, Vladislav Y.; Belyaev, Sergey Y.
         2002-02-01
         The paper is dedicated to real time simulation of sport boats, particularly a kayak and high-speed skimming boat, for training goals. This training is issue of the day, since kayaking and riding a high-speed skimming boat are both extreme sports. Participating in such types of competitions puts sportsmen into danger, particularly due to rapids, waterfalls, different water streams, and other obstacles. In order to make the simulation realistic, it is necessary to calculate data for at least 30 frames per second. These calculations may take not more than 5% CPU time, because very time-consuming 3D rendering process takes the rest - 95% CPU time. This paper describes an approach for creating minimal boat simulator models that satisfy the mentioned requirements. Besides, this approach can be used for other watercraft models of this kind.
      

      
      A fast sequence assembly method based on compressed data structures.
      PubMed
      Liang, Peifeng; Zhang, Yancong; Lin, Kui; Hu, Jinglu
         2014-01-01
         Assembling a large genome using next generation sequencing reads requires large computer memory and a long execution time. To reduce these requirements, a memory and time efficient assembler is presented from applying FM-index in JR-Assembler, called FMJ-Assembler, where FM stand for FMR-index derived from the FM-index and BWT and J for jumping extension. The FMJ-Assembler uses expanded FM-index and BWT to compress data of reads to save memory and jumping extension method make it faster in CPU time. An extensive comparison of the FMJ-Assembler with current assemblers shows that the FMJ-Assembler achieves a better or comparable overall assembly quality and requires lower memory use and less CPU time. All these advantages of the FMJ-Assembler indicate that the FMJ-Assembler will be an efficient assembly method in next generation sequencing technology.
      

      
      Jobs masonry in LHCb with elastic Grid Jobs
      NASA Astrophysics Data System (ADS)
      Stagni, F.; Charpentier, Ph
         2015-12-01
         In any distributed computing infrastructure, a job is normally forbidden to run for an indefinite amount of time. This limitation is implemented using different technologies, the most common one being the CPU time limit implemented by batch queues. It is therefore important to have a good estimate of how much CPU work a job will require: otherwise, it might be killed by the batch system, or by whatever system is controlling the jobs’ execution. In many modern interwares, the jobs are actually executed by pilot jobs, that can use the whole available time in running multiple consecutive jobs. If at some point the available time in a pilot is too short for the execution of any job, it should be released, while it could have been used efficiently by a shorter job. Within LHCbDIRAC, the LHCb extension of the DIRAC interware, we developed a simple way to fully exploit computing capabilities available to a pilot, even for resources with limited time capabilities, by adding elasticity to production MonteCarlo (MC) simulation jobs. With our approach, independently of the time available, LHCbDIRAC will always have the possibility to execute a MC job, whose length will be adapted to the available amount of time: therefore the same job, running on different computing resources with different time limits, will produce different amounts of events. The decision on the number of events to be produced is made just in time at the start of the job, when the capabilities of the resource are known. In order to know how many events a MC job will be instructed to produce, LHCbDIRAC simply requires three values: the CPU-work per event for that type of job, the power of the machine it is running on, and the time left for the job before being killed. Knowing these values, we can estimate the number of events the job will be able to simulate with the available CPU time. This paper will demonstrate that, using this simple but effective solution, LHCb manages to make a more efficient use of the available resources, and that it can easily use new types of resources. An example is represented by resources provided by batch queues, where low-priority MC jobs can be used as "masonry" jobs in multi-jobs pilots. A second example is represented by opportunistic resources with limited available time.
      

      
      Optimization of Selected Remote Sensing Algorithms for Embedded NVIDIA Kepler GPU Architecture
      NASA Technical Reports Server (NTRS)
      Riha, Lubomir; Le Moigne, Jacqueline; El-Ghazawi, Tarek
         2015-01-01
         This paper evaluates the potential of embedded Graphic Processing Units in the Nvidias Tegra K1 for onboard processing. The performance is compared to a general purpose multi-core CPU and full fledge GPU accelerator. This study uses two algorithms: Wavelet Spectral Dimension Reduction of Hyperspectral Imagery and Automated Cloud-Cover Assessment (ACCA) Algorithm. Tegra K1 achieved 51 for ACCA algorithm and 20 for the dimension reduction algorithm, as compared to the performance of the high-end 8-core server Intel Xeon CPU with 13.5 times higher power consumption.
      

      
      Physician discretion is safe and may lower stress test utilization in emergency department chest pain unit patients.
      PubMed
      Napoli, Anthony M; Arrighi, James A; Siket, Matthew S; Gibbs, Frantz J
         2012-03-01
         Chest pain unit (CPU) observation with defined stress utilization protocols is a common management option for low-risk emergency department patients. We sought to evaluate the safety of a joint emergency medicine and cardiology staffed CPU. Prospective observational trial of consecutive patients admitted to an emergency department CPU was conducted. A standard 6-hour observation protocol was followed by cardiology consultation and stress utilization largely at their discretion. Included patients were at low/intermediate risk by the American Heart Association, had nondiagnostic electrocardiograms, and a normal initial troponin. Excluded patients were those with an acute comorbidity, age >75, and a history of coronary artery disease, or had a coexistent problem restricting 24-hour observation. Primary outcome was 30-day major adverse cardiovascular events-defined as death, nonfatal acute myocardial infarction, revascularization, or out-of-hospital cardiac arrest. A total of 1063 patients were enrolled over 8 months. The mean age of the patients was 52.8 ± 11.8 years, and 51% (95% confidence interval [CI], 48-54) were female. The mean thrombolysis in myocardial infarction and Diamond & Forrester scores were 0.6% (95% CI, 0.51-0.62) and 33% (95% CI, 31-35), respectively. In all, 51% (95% CI, 48-54) received stress testing (52% nuclear stress, 39% stress echocardiogram, 5% exercise, 4% other). In all, 0.9% patients (n = 10, 95% CI, 0.4-1.5) were diagnosed with a non-ST elevation myocardial infarction and 2.2% (n = 23, 95% CI, 1.3-3) with acute coronary syndrome. There was 1 (95% CI, 0%-0.3%) case of a 30-day major adverse cardiovascular events. The 51% stress test utilization rate was less than the range reported in previous CPU studies (P < 0.05). Joint emergency medicine and cardiology management of patients within a CPU protocol is safe, efficacious, and may safely reduce stress testing rates.
      

      
      Real-time image reconstruction and display system for MRI using a high-speed personal computer.
      PubMed
      Haishi, T; Kose, K
         1998-09-01
         A real-time NMR image reconstruction and display system was developed using a high-speed personal computer and optimized for the 32-bit multitasking Microsoft Windows 95 operating system. The system was operated at various CPU clock frequencies by changing the motherboard clock frequency and the processor/bus frequency ratio. When the Pentium CPU was used at the 200 MHz clock frequency, the reconstruction time for one 128 x 128 pixel image was 48 ms and that for the image display on the enlarged 256 x 256 pixel window was about 8 ms. NMR imaging experiments were performed with three fast imaging sequences (FLASH, multishot EPI, and one-shot EPI) to demonstrate the ability of the real-time system. It was concluded that in most cases, high-speed PC would be the best choice for the image reconstruction and display system for real-time MRI. Copyright 1998 Academic Press.
      

      
      QR-decomposition based SENSE reconstruction using parallel architecture.
      PubMed
      Ullah, Irfan; Nisar, Habab; Raza, Haseeb; Qasim, Malik; Inam, Omair; Omer, Hammad
         2018-04-01
         Magnetic Resonance Imaging (MRI) is a powerful medical imaging technique that provides essential clinical information about the human body. One major limitation of MRI is its long scan time. Implementation of advance MRI algorithms on a parallel architecture (to exploit inherent parallelism) has a great potential to reduce the scan time. Sensitivity Encoding (SENSE) is a Parallel Magnetic Resonance Imaging (pMRI) algorithm that utilizes receiver coil sensitivities to reconstruct MR images from the acquired under-sampled k-space data. At the heart of SENSE lies inversion of a rectangular encoding matrix. This work presents a novel implementation of GPU based SENSE algorithm, which employs QR decomposition for the inversion of the rectangular encoding matrix. For a fair comparison, the performance of the proposed GPU based SENSE reconstruction is evaluated against single and multicore CPU using openMP. Several experiments against various acceleration factors (AFs) are performed using multichannel (8, 12 and 30) phantom and in-vivo human head and cardiac datasets. Experimental results show that GPU significantly reduces the computation time of SENSE reconstruction as compared to multi-core CPU (approximately 12x speedup) and single-core CPU (approximately 53x speedup) without any degradation in the quality of the reconstructed images. Copyright © 2018 Elsevier Ltd. All rights reserved.
      

      
      Energy consumption optimization of the total-FETI solver by changing the CPU frequency
      NASA Astrophysics Data System (ADS)
      Horak, David; Riha, Lubomir; Sojka, Radim; Kruzik, Jakub; Beseda, Martin; Cermak, Martin; Schuchart, Joseph
         2017-07-01
         The energy consumption of supercomputers is one of the critical problems for the upcoming Exascale supercomputing era. The awareness of power and energy consumption is required on both software and hardware side. This paper deals with the energy consumption evaluation of the Finite Element Tearing and Interconnect (FETI) based solvers of linear systems, which is an established method for solving real-world engineering problems. We have evaluated the effect of the CPU frequency on the energy consumption of the FETI solver using a linear elasticity 3D cube synthetic benchmark. In this problem, we have evaluated the effect of frequency tuning on the energy consumption of the essential processing kernels of the FETI method. The paper provides results for two types of frequency tuning: (1) static tuning and (2) dynamic tuning. For static tuning experiments, the frequency is set before execution and kept constant during the runtime. For dynamic tuning, the frequency is changed during the program execution to adapt the system to the actual needs of the application. The paper shows that static tuning brings up 12% energy savings when compared to default CPU settings (the highest clock rate). The dynamic tuning improves this further by up to 3%.
      

      
      GPU Linear Algebra Libraries and GPGPU Programming for Accelerating MOPAC Semiempirical Quantum Chemistry Calculations.
      PubMed
      Maia, Julio Daniel Carvalho; Urquiza Carvalho, Gabriel Aires; Mangueira, Carlos Peixoto; Santana, Sidney Ramos; Cabral, Lucidio Anjos Formiga; Rocha, Gerd B
         2012-09-11
         In this study, we present some modifications in the semiempirical quantum chemistry MOPAC2009 code that accelerate single-point energy calculations (1SCF) of medium-size (up to 2500 atoms) molecular systems using GPU coprocessors and multithreaded shared-memory CPUs. Our modifications consisted of using a combination of highly optimized linear algebra libraries for both CPU (LAPACK and BLAS from Intel MKL) and GPU (MAGMA and CUBLAS) to hasten time-consuming parts of MOPAC such as the pseudodiagonalization, full diagonalization, and density matrix assembling. We have shown that it is possible to obtain large speedups just by using CPU serial linear algebra libraries in the MOPAC code. As a special case, we show a speedup of up to 14 times for a methanol simulation box containing 2400 atoms and 4800 basis functions, with even greater gains in performance when using multithreaded CPUs (2.1 times in relation to the single-threaded CPU code using linear algebra libraries) and GPUs (3.8 times). This degree of acceleration opens new perspectives for modeling larger structures which appear in inorganic chemistry (such as zeolites and MOFs), biochemistry (such as polysaccharides, small proteins, and DNA fragments), and materials science (such as nanotubes and fullerenes). In addition, we believe that this parallel (GPU-GPU) MOPAC code will make it feasible to use semiempirical methods in lengthy molecular simulations using both hybrid QM/MM and QM/QM potentials.
      

      
      Symptoms of Problematic Cellular Phone Use, Functional Impairment and Its Association with Depression among Adolescents in Southern Taiwan
      ERIC Educational Resources Information Center
      Yen, Cheng-Fang; Tang, Tze-Chun; Yen, Ju-Yu; Lin, Huang-Chi; Huang, Chi-Fen; Liu, Shu-Chun; Ko, Chih-Hung
         2009-01-01
         The aims of this study were: (1) to examine the prevalence of symptoms of problematic cellular phone use (CPU); (2) to examine the associations between the symptoms of problematic CPU, functional impairment caused by CPU and the characteristics of CPU; (3) to establish the optimal cut-off point of the number of symptoms for functional impairment…
      

      
      Derivative free Davidon-Fletcher-Powell (DFP) for solving symmetric systems of nonlinear equations
      NASA Astrophysics Data System (ADS)
      Mamat, M.; Dauda, M. K.; Mohamed, M. A. bin; Waziri, M. Y.; Mohamad, F. S.; Abdullah, H.
         2018-03-01
         Research from the work of engineers, economist, modelling, industry, computing, and scientist are mostly nonlinear equations in nature. Numerical solution to such systems is widely applied in those areas of mathematics. Over the years, there has been significant theoretical study to develop methods for solving such systems, despite these efforts, unfortunately the methods developed do have deficiency. In a contribution to solve systems of the form F(x) = 0, x ∈ Rn , a derivative free method via the classical Davidon-Fletcher-Powell (DFP) update is presented. This is achieved by simply approximating the inverse Hessian matrix with {Q}k+1-1 to θkI. The modified method satisfied the descent condition and possess local superlinear convergence properties. Interestingly, without computing any derivative, the proposed method never fail to converge throughout the numerical experiments. The output is based on number of iterations and CPU time, different initial starting points were used on a solve 40 benchmark test problems. With the aid of the squared norm merit function and derivative-free line search technique, the approach yield a method of solving symmetric systems of nonlinear equations that is capable of significantly reducing the CPU time and number of iteration, as compared to its counterparts. A comparison between the proposed method and classical DFP update were made and found that the proposed methodis the top performer and outperformed the existing method in almost all the cases. In terms of number of iterations, out of the 40 problems solved, the proposed method solved 38 successfully, (95%) while classical DFP solved 2 problems (i.e. 05%). In terms of CPU time, the proposed method solved 29 out of the 40 problems given, (i.e.72.5%) successfully whereas classical DFP solves 11 (27.5%). The method is valid in terms of derivation, reliable in terms of number of iterations and accurate in terms of CPU time. Thus, suitable and achived the objective.
      

      
      Synthesis and characterization of conductive, biodegradable, elastomeric polyurethanes for biomedical applications.
      PubMed
      Xu, Cancan; Yepez, Gerardo; Wei, Zi; Liu, Fuqiang; Bugarin, Alejandro; Hong, Yi
         2016-09-01
         Biodegradable conductive polymers are currently of significant interest in tissue repair and regeneration, drug delivery, and bioelectronics. However, biodegradable materials exhibiting both conductive and elastic properties have rarely been reported to date. To that end, an electrically conductive polyurethane (CPU) was synthesized from polycaprolactone diol, hexadiisocyanate, and aniline trimer and subsequently doped with (1S)-(+)-10-camphorsulfonic acid (CSA). All CPU films showed good elasticity within a 30% strain range. The electrical conductivity of the CPU films, as enhanced with increasing amounts of CSA, ranged from 2.7 ± 0.9 × 10(-10) to 4.4 ± 0.6 × 10(-7) S/cm in a dry state and 4.2 ± 0.5 × 10(-8) to 7.3 ± 1.5 × 10(-5) S/cm in a wet state. The redox peaks of a CPU1.5 film (molar ratio CSA:aniline trimer = 1.5:1) in the cyclic voltammogram confirmed the desired good electroactivity. The doped CPU film exhibited good electrical stability (87% of initial conductivity after 150 hours charge) as measured in a cell culture medium. The degradation rates of CPU films increased with increasing CSA content in both phosphate-buffered solution (PBS) and lipase/PBS solutions. After 7 days of enzymatic degradation, the conductivity of all CSA-doped CPU films had decreased to that of the undoped CPU film. Mouse 3T3 fibroblasts proliferated and spread on all CPU films. This developed biodegradable CPU with good elasticity, electrical stability, and biocompatibility may find potential applications in tissue engineering, smart drug release, and electronics. © 2016 Wiley Periodicals, Inc. J Biomed Mater Res Part A: 104A: 2305-2314, 2016. © 2016 Wiley Periodicals, Inc.
      

      
      General-purpose interface bus for multiuser, multitasking computer system
      NASA Technical Reports Server (NTRS)
      Generazio, Edward R.; Roth, Don J.; Stang, David B.
         1990-01-01
         The architecture of a multiuser, multitasking, virtual-memory computer system intended for the use by a medium-size research group is described. There are three central processing units (CPU) in the configuration, each with 16 MB memory, and two 474 MB hard disks attached. CPU 1 is designed for data analysis and contains an array processor for fast-Fourier transformations. In addition, CPU 1 shares display images viewed with the image processor. CPU 2 is designed for image analysis and display. CPU 3 is designed for data acquisition and contains 8 GPIB channels and an analog-to-digital conversion input/output interface with 16 channels. Up to 9 users can access the third CPU simultaneously for data acquisition. Focus is placed on the optimization of hardware interfaces and software, facilitating instrument control, data acquisition, and processing.
      

      
      CPU-GPU hybrid accelerating the Zuker algorithm for RNA secondary structure prediction applications
      PubMed Central
      
         2012-01-01
         Background Prediction of ribonucleic acid (RNA) secondary structure remains one of the most important research areas in bioinformatics. The Zuker algorithm is one of the most popular methods of free energy minimization for RNA secondary structure prediction. Thus far, few studies have been reported on the acceleration of the Zuker algorithm on general-purpose processors or on extra accelerators such as Field Programmable Gate-Array (FPGA) and Graphics Processing Units (GPU). To the best of our knowledge, no implementation combines both CPU and extra accelerators, such as GPUs, to accelerate the Zuker algorithm applications. Results In this paper, a CPU-GPU hybrid computing system that accelerates Zuker algorithm applications for RNA secondary structure prediction is proposed. The computing tasks are allocated between CPU and GPU for parallel cooperate execution. Performance differences between the CPU and the GPU in the task-allocation scheme are considered to obtain workload balance. To improve the hybrid system performance, the Zuker algorithm is optimally implemented with special methods for CPU and GPU architecture. Conclusions Speedup of 15.93× over optimized multi-core SIMD CPU implementation and performance advantage of 16% over optimized GPU implementation are shown in the experimental results. More than 14% of the sequences are executed on CPU in the hybrid system. The system combining CPU and GPU to accelerate the Zuker algorithm is proven to be promising and can be applied to other bioinformatics applications. PMID:22369626
      

      
      Increases in cytoplasmic dopamine compromise the normal resistance of the nucleus accumbens to methamphetamine neurotoxicity
      PubMed Central
      Thomas, David M.; Francescutti-Verbeem, Dina M.; Kuhnt, Donald M.
         2016-01-01
         Methamphetamine (METH) is a neurotoxic drug of abuse that damages the dopamine (DA) neuronal system in a highly delimited manner. The brain structure most affected by METH is the caudate–putamen (CPu) where long-term DA depletion and microglial activation are most evident. Even damage within the CPu is remarkably heterogenous with lateral and ventral aspects showing the greatest deficits. The nucleus accumbens (NAc) is largely spared of the damage that accompanies binge METH intoxication. Increases in cytoplasmic DA produced by reserpine, L-DOPA or clorgyline prior to METH uncover damage in the NAc as evidenced by microglial activation and depletion of DA, tyrosine hydroxylase (TH), and the DA transporter. These effects do not occur in the NAc after treatment with METH alone. In contrast to the CPu where DA, TH, and DA transporter levels remain depleted chronically, DA nerve ending alterations in the NAc show a partial recovery over time. None of the treatments that enhance METH toxicity in the NAc and CPu lead to losses of TH protein or DA cell bodies in the substantia nigra or the ventral tegmentum. These data show that increases in cytoplasmic DA dramatically broaden the neurotoxic profile of METH to include brain structures not normally targeted for damage by METH alone. The resistance of the NAc to METH-induced neurotoxicity and its ability to recover reveal a fundamentally different neuroplasticity by comparison to the CPu. Recruitment of the NAc as a target of METH neurotoxicity by alterations in DA homeostasis is significant in light of the important roles played by this brain structure. PMID:19457119
      

        
       
          

«

6
      7
      8
   9
      10
      »

          
        

     

   

   
       
            
              
          

«

7
      8
      9
   10
      11
      »

          
        

           
           
             
               
      
      Increases in cytoplasmic dopamine compromise the normal resistance of the nucleus accumbens to methamphetamine neurotoxicity.
      PubMed
      Thomas, David M; Francescutti-Verbeem, Dina M; Kuhn, Donald M
         2009-06-01
         Methamphetamine (METH) is a neurotoxic drug of abuse that damages the dopamine (DA) neuronal system in a highly delimited manner. The brain structure most affected by METH is the caudate-putamen (CPu) where long-term DA depletion and microglial activation are most evident. Even damage within the CPu is remarkably heterogenous with lateral and ventral aspects showing the greatest deficits. The nucleus accumbens (NAc) is largely spared of the damage that accompanies binge METH intoxication. Increases in cytoplasmic DA produced by reserpine, L-DOPA or clorgyline prior to METH uncover damage in the NAc as evidenced by microglial activation and depletion of DA, tyrosine hydroxylase (TH), and the DA transporter. These effects do not occur in the NAc after treatment with METH alone. In contrast to the CPu where DA, TH, and DA transporter levels remain depleted chronically, DA nerve ending alterations in the NAc show a partial recovery over time. None of the treatments that enhance METH toxicity in the NAc and CPu lead to losses of TH protein or DA cell bodies in the substantia nigra or the ventral tegmentum. These data show that increases in cytoplasmic DA dramatically broaden the neurotoxic profile of METH to include brain structures not normally targeted for damage by METH alone. The resistance of the NAc to METH-induced neurotoxicity and its ability to recover reveal a fundamentally different neuroplasticity by comparison to the CPu. Recruitment of the NAc as a target of METH neurotoxicity by alterations in DA homeostasis is significant in light of the important roles played by this brain structure.
      

      
      Analysis of cache for streaming tape drive
      NASA Technical Reports Server (NTRS)
      Chinnaswamy, V.
         1993-01-01
         A tape subsystem consists of a controller and a tape drive. Tapes are used for backup, data interchange, and software distribution. The backup operation is addressed. During a backup operation, data is read from disk, processed in CPU, and then sent to tape. The processing speeds of a disk subsystem, CPU, and a tape subsystem are likely to be different. A powerful CPU can read data from a fast disk, process it, and supply the data to the tape subsystem at a faster rate than the tape subsystem can handle. On the other hand, a slow disk drive and a slow CPU may not be able to supply data fast enough to keep a tape drive busy all the time. The backup process may supply data to tape drive in bursts. Each burst may be followed by an idle period. Depending on the nature of the file distribution in the disk, the input stream to the tape subsystem may vary significantly during backup. To compensate for these differences and optimize the utilization of a tape subsystem, a cache or buffer is introduced in the tape controller. Most of the tape drives today are streaming tape drives. A streaming tape drive goes into reposition when there is no data from the controller. Once the drive goes into reposition, the controller can receive data, but it cannot supply data to the tape drive until the drive completes its reposition. A controller can also receive data from the host and send data to the tape drive at the same time. The relationship of cache size, host transfer rate, drive transfer rate, reposition, and ramp up times for optimal performance of the tape subsystem are investigated. Formulas developed will also show the advantages of cache watermarks to increase the streaming time of the tape drive, maximum loss due to insufficient cache, tradeoffs between cache and reposition times and the effectiveness of cache on a streaming tape drive due to idle times or interruptions due in host transfers. Several mathematical formulas are developed to predict the performance of the tape drive. Some examples are given illustrating the usefulness of these formulas. Finally, a summary and some conclusions are provided.
      

      
      An efficient implementation of 3D high-resolution imaging for large-scale seismic data with GPU/CPU heterogeneous parallel computing
      NASA Astrophysics Data System (ADS)
      Xu, Jincheng; Liu, Wei; Wang, Jin; Liu, Linong; Zhang, Jianfeng
         2018-02-01
         De-absorption pre-stack time migration (QPSTM) compensates for the absorption and dispersion of seismic waves by introducing an effective Q parameter, thereby making it an effective tool for 3D, high-resolution imaging of seismic data. Although the optimal aperture obtained via stationary-phase migration reduces the computational cost of 3D QPSTM and yields 3D stationary-phase QPSTM, the associated computational efficiency is still the main problem in the processing of 3D, high-resolution images for real large-scale seismic data. In the current paper, we proposed a division method for large-scale, 3D seismic data to optimize the performance of stationary-phase QPSTM on clusters of graphics processing units (GPU). Then, we designed an imaging point parallel strategy to achieve an optimal parallel computing performance. Afterward, we adopted an asynchronous double buffering scheme for multi-stream to perform the GPU/CPU parallel computing. Moreover, several key optimization strategies of computation and storage based on the compute unified device architecture (CUDA) were adopted to accelerate the 3D stationary-phase QPSTM algorithm. Compared with the initial GPU code, the implementation of the key optimization steps, including thread optimization, shared memory optimization, register optimization and special function units (SFU), greatly improved the efficiency. A numerical example employing real large-scale, 3D seismic data showed that our scheme is nearly 80 times faster than the CPU-QPSTM algorithm. Our GPU/CPU heterogeneous parallel computing framework significant reduces the computational cost and facilitates 3D high-resolution imaging for large-scale seismic data.
      

      
      GPU-accelerated automatic identification of robust beam setups for proton and carbon-ion radiotherapy
      NASA Astrophysics Data System (ADS)
      Ammazzalorso, F.; Bednarz, T.; Jelen, U.
         2014-03-01
         We demonstrate acceleration on graphic processing units (GPU) of automatic identification of robust particle therapy beam setups, minimizing negative dosimetric effects of Bragg peak displacement caused by treatment-time patient positioning errors. Our particle therapy research toolkit, RobuR, was extended with OpenCL support and used to implement calculation on GPU of the Port Homogeneity Index, a metric scoring irradiation port robustness through analysis of tissue density patterns prior to dose optimization and computation. Results were benchmarked against an independent native CPU implementation. Numerical results were in agreement between the GPU implementation and native CPU implementation. For 10 skull base cases, the GPU-accelerated implementation was employed to select beam setups for proton and carbon ion treatment plans, which proved to be dosimetrically robust, when recomputed in presence of various simulated positioning errors. From the point of view of performance, average running time on the GPU decreased by at least one order of magnitude compared to the CPU, rendering the GPU-accelerated analysis a feasible step in a clinical treatment planning interactive session. In conclusion, selection of robust particle therapy beam setups can be effectively accelerated on a GPU and become an unintrusive part of the particle therapy treatment planning workflow. Additionally, the speed gain opens new usage scenarios, like interactive analysis manipulation (e.g. constraining of some setup) and re-execution. Finally, through OpenCL portable parallelism, the new implementation is suitable also for CPU-only use, taking advantage of multiple cores, and can potentially exploit types of accelerators other than GPUs.
      

      
      Improving the performance of heterogeneous multi-core processors by modifying the cache coherence protocol
      NASA Astrophysics Data System (ADS)
      Fang, Juan; Hao, Xiaoting; Fan, Qingwen; Chang, Zeqing; Song, Shuying
         2017-05-01
         In the Heterogeneous multi-core architecture, CPU and GPU processor are integrated on the same chip, which poses a new challenge to the last-level cache management. In this architecture, the CPU application and the GPU application execute concurrently, accessing the last-level cache. CPU and GPU have different memory access characteristics, so that they have differences in the sensitivity of last-level cache (LLC) capacity. For many CPU applications, a reduced share of the LLC could lead to significant performance degradation. On the contrary, GPU applications can tolerate increase in memory access latency when there is sufficient thread-level parallelism. Taking into account the GPU program memory latency tolerance characteristics, this paper presents a method that let GPU applications can access to memory directly, leaving lots of LLC space for CPU applications, in improving the performance of CPU applications and does not affect the performance of GPU applications. When the CPU application is cache sensitive, and the GPU application is insensitive to the cache, the overall performance of the system is improved significantly.
      

      
      GPU-based stochastic-gradient optimization for non-rigid medical image registration in time-critical applications
      NASA Astrophysics Data System (ADS)
      Bhosale, Parag; Staring, Marius; Al-Ars, Zaid; Berendsen, Floris F.
         2018-03-01
         Currently, non-rigid image registration algorithms are too computationally intensive to use in time-critical applications. Existing implementations that focus on speed typically address this by either parallelization on GPU-hardware, or by introducing methodically novel techniques into CPU-oriented algorithms. Stochastic gradient descent (SGD) optimization and variations thereof have proven to drastically reduce the computational burden for CPU-based image registration, but have not been successfully applied in GPU hardware due to its stochastic nature. This paper proposes 1) NiftyRegSGD, a SGD optimization for the GPU-based image registration tool NiftyReg, 2) random chunk sampler, a new random sampling strategy that better utilizes the memory bandwidth of GPU hardware. Experiments have been performed on 3D lung CT data of 19 patients, which compared NiftyRegSGD (with and without random chunk sampler) with CPU-based elastix Fast Adaptive SGD (FASGD) and NiftyReg. The registration runtime was 21.5s, 4.4s and 2.8s for elastix-FASGD, NiftyRegSGD without, and NiftyRegSGD with random chunk sampling, respectively, while similar accuracy was obtained. Our method is publicly available at https://github.com/SuperElastix/NiftyRegSGD.
      

      
      Implementation of GPU accelerated SPECT reconstruction with Monte Carlo-based scatter correction.
      PubMed
      Bexelius, Tobias; Sohlberg, Antti
         2018-06-01
         Statistical SPECT reconstruction can be very time-consuming especially when compensations for collimator and detector response, attenuation, and scatter are included in the reconstruction. This work proposes an accelerated SPECT reconstruction algorithm based on graphics processing unit (GPU) processing. Ordered subset expectation maximization (OSEM) algorithm with CT-based attenuation modelling, depth-dependent Gaussian convolution-based collimator-detector response modelling, and Monte Carlo-based scatter compensation was implemented using OpenCL. The OpenCL implementation was compared against the existing multi-threaded OSEM implementation running on a central processing unit (CPU) in terms of scatter-to-primary ratios, standardized uptake values (SUVs), and processing speed using mathematical phantoms and clinical multi-bed bone SPECT/CT studies. The difference in scatter-to-primary ratios, visual appearance, and SUVs between GPU and CPU implementations was minor. On the other hand, at its best, the GPU implementation was noticed to be 24 times faster than the multi-threaded CPU version on a normal 128 × 128 matrix size 3 bed bone SPECT/CT data set when compensations for collimator and detector response, attenuation, and scatter were included. GPU SPECT reconstructions show great promise as an every day clinical reconstruction tool.
      

      
      Near-realtime simulations of biolelectric activity in small mammalian hearts using graphical processing units
      PubMed Central
      Vigmond, Edward J.; Boyle, Patrick M.; Leon, L. Joshua; Plank, Gernot
         2014-01-01
         Simulations of cardiac bioelectric phenomena remain a significant challenge despite continual advancements in computational machinery. Spanning large temporal and spatial ranges demands millions of nodes to accurately depict geometry, and a comparable number of timesteps to capture dynamics. This study explores a new hardware computing paradigm, the graphics processing unit (GPU), to accelerate cardiac models, and analyzes results in the context of simulating a small mammalian heart in real time. The ODEs associated with membrane ionic flow were computed on traditional CPU and compared to GPU performance, for one to four parallel processing units. The scalability of solving the PDE responsible for tissue coupling was examined on a cluster using up to 128 cores. Results indicate that the GPU implementation was between 9 and 17 times faster than the CPU implementation and scaled similarly. Solving the PDE was still 160 times slower than real time. PMID:19964295
      

      
      Vector computer memory bank contention
      NASA Technical Reports Server (NTRS)
      Bailey, D. H.
         1985-01-01
         A number of vector supercomputers feature very large memories. Unfortunately the large capacity memory chips that are used in these computers are much slower than the fast central processing unit (CPU) circuitry. As a result, memory bank reservation times (in CPU ticks) are much longer than on previous generations of computers. A consequence of these long reservation times is that memory bank contention is sharply increased, resulting in significantly lowered performance rates. The phenomenon of memory bank contention in vector computers is analyzed using both a Markov chain model and a Monte Carlo simulation program. The results of this analysis indicate that future generations of supercomputers must either employ much faster memory chips or else feature very large numbers of independent memory banks.
      

      
      New Focal Plane Array Controller for the Instruments of the Subaru Telescope
      NASA Astrophysics Data System (ADS)
      Nakaya, Hidehiko; Komiyama, Yutaka; Miyazaki, Satoshi; Yamashita, Takuya; Yagi, Masafumi; Sekiguchi, Maki
         2006-03-01
         We have developed a next-generation data acquisition system, MESSIA5 (Modularized Extensible System for Image Acquisition), which comprises the digital part of a focal plane array controller. The new data acquisition system was constructed based on a 64 bit, 66 MHz PCI (peripheral component interconnect) bus architecture and runs on an x86 CPU computer with (non-real-time) Linux. The system, including the CPU board, is placed at the telescope focus, and standard gigabit Ethernet is adopted for the data transfer, as opposed to a dedicated fiber link. During the summer of 2002, we installed the new system for the first time on the Subaru prime-focus camera Suprime-Cam and successfully improved the observing performance.
      

      
      Vector computer memory bank contention
      NASA Technical Reports Server (NTRS)
      Bailey, David H.
         1987-01-01
         A number of vector supercomputers feature very large memories. Unfortunately the large capacity memory chips that are used in these computers are much slower than the fast central processing unit (CPU) circuitry. As a result, memory bank reservation times (in CPU ticks) are much longer than on previous generations of computers. A consequence of these long reservation times is that memory bank contention is sharply increased, resulting in significantly lowered performance rates. The phenomenon of memory bank contention in vector computers is analyzed using both a Markov chain model and a Monte Carlo simulation program. The results of this analysis indicate that future generations of supercomputers must either employ much faster memory chips or else feature very large numbers of independent memory banks.
      

      
      A numerical code for the simulation of non-equilibrium chemically reacting flows on hybrid CPU-GPU clusters
      NASA Astrophysics Data System (ADS)
      Kudryavtsev, Alexey N.; Kashkovsky, Alexander V.; Borisov, Semyon P.; Shershnev, Anton A.
         2017-10-01
         In the present work a computer code RCFS for numerical simulation of chemically reacting compressible flows on hybrid CPU/GPU supercomputers is developed. It solves 3D unsteady Euler equations for multispecies chemically reacting flows in general curvilinear coordinates using shock-capturing TVD schemes. Time advancement is carried out using the explicit Runge-Kutta TVD schemes. Program implementation uses CUDA application programming interface to perform GPU computations. Data between GPUs is distributed via domain decomposition technique. The developed code is verified on the number of test cases including supersonic flow over a cylinder.
      

      
      Newmark local time stepping on high-performance computing architectures
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Rietmann, Max, E-mail: max.rietmann@erdw.ethz.ch; Institute of Geophysics, ETH Zurich; Grote, Marcus, E-mail: marcus.grote@unibas.ch
         
         In multi-scale complex media, finite element meshes often require areas of local refinement, creating small elements that can dramatically reduce the global time-step for wave-propagation problems due to the CFL condition. Local time stepping (LTS) algorithms allow an explicit time-stepping scheme to adapt the time-step to the element size, allowing near-optimal time-steps everywhere in the mesh. We develop an efficient multilevel LTS-Newmark scheme and implement it in a widely used continuous finite element seismic wave-propagation package. In particular, we extend the standard LTS formulation with adaptations to continuous finite element methods that can be implemented very efficiently with very strongmore » element-size contrasts (more than 100x). Capable of running on large CPU and GPU clusters, we present both synthetic validation examples and large scale, realistic application examples to demonstrate the performance and applicability of the method and implementation on thousands of CPU cores and hundreds of GPUs.« less
      

      
      Generalized conjugate-gradient methods for the Navier-Stokes equations
      NASA Technical Reports Server (NTRS)
      Ajmani, Kumud; Ng, Wing-Fai; Liou, Meng-Sing
         1991-01-01
         A generalized conjugate-gradient method is used to solve the two-dimensional, compressible Navier-Stokes equations of fluid flow. The equations are discretized with an implicit, upwind finite-volume formulation. Preconditioning techniques are incorporated into the new solver to accelerate convergence of the overall iterative method. The superiority of the new solver is demonstrated by comparisons with a conventional line Gauss-Siedel Relaxation solver. Computational test results for transonic flow (trailing edge flow in a transonic turbine cascade) and hypersonic flow (M = 6.0 shock-on-shock phenoena on a cylindrical leading edge) are presented. When applied to the transonic cascade case, the new solver is 4.4 times faster in terms of number of iterations and 3.1 times faster in terms of CPU time than the Relaxation solver. For the hypersonic shock case, the new solver is 3.0 times faster in terms of number of iterations and 2.2 times faster in terms of CPU time than the Relaxation solver.
      

      
      CPU-GPU mixed implementation of virtual node method for real-time interactive cutting of deformable objects using OpenCL.
      PubMed
      Jia, Shiyu; Zhang, Weizhong; Yu, Xiaokang; Pan, Zhenkuan
         2015-09-01
         Surgical simulators need to simulate interactive cutting of deformable objects in real time. The goal of this work was to design an interactive cutting algorithm that eliminates traditional cutting state classification and can work simultaneously with real-time GPU-accelerated deformation without affecting its numerical stability. A modified virtual node method for cutting is proposed. Deformable object is modeled as a real tetrahedral mesh embedded in a virtual tetrahedral mesh, and the former is used for graphics rendering and collision, while the latter is used for deformation. Cutting algorithm first subdivides real tetrahedrons to eliminate all face and edge intersections, then splits faces, edges and vertices along cutting tool trajectory to form cut surfaces. Next virtual tetrahedrons containing more than one connected real tetrahedral fragments are duplicated, and connectivity between virtual tetrahedrons is updated. Finally, embedding relationship between real and virtual tetrahedral meshes is updated. Co-rotational linear finite element method is used for deformation. Cutting and collision are processed by CPU, while deformation is carried out by GPU using OpenCL. Efficiency of GPU-accelerated deformation algorithm was tested using block models with varying numbers of tetrahedrons. Effectiveness of our cutting algorithm under multiple cuts and self-intersecting cuts was tested using a block model and a cylinder model. Cutting of a more complex liver model was performed, and detailed performance characteristics of cutting, deformation and collision were measured and analyzed. Our cutting algorithm can produce continuous cut surfaces when traditional minimal element creation algorithm fails. Our GPU-accelerated deformation algorithm remains stable with constant time step under multiple arbitrary cuts and works on both NVIDIA and AMD GPUs. GPU-CPU speed ratio can be as high as 10 for models with 80,000 tetrahedrons. Forty to sixty percent real-time performance and 100-200 Hz simulation rate are achieved for the liver model with 3,101 tetrahedrons. Major bottlenecks for simulation efficiency are cutting, collision processing and CPU-GPU data transfer. Future work needs to improve on these areas.
      

      
      Exploring the use of I/O nodes for computation in a MIMD multiprocessor
      NASA Technical Reports Server (NTRS)
      Kotz, David; Cai, Ting
         1995-01-01
         As parallel systems move into the production scientific-computing world, the emphasis will be on cost-effective solutions that provide high throughput for a mix of applications. Cost effective solutions demand that a system make effective use of all of its resources. Many MIMD multiprocessors today, however, distinguish between 'compute' and 'I/O' nodes, the latter having attached disks and being dedicated to running the file-system server. This static division of responsibilities simplifies system management but does not necessarily lead to the best performance in workloads that need a different balance of computation and I/O. Of course, computational processes sharing a node with a file-system service may receive less CPU time, network bandwidth, and memory bandwidth than they would on a computation-only node. In this paper we begin to examine this issue experimentally. We found that high performance I/O does not necessarily require substantial CPU time, leaving plenty of time for application computation. There were some complex file-system requests, however, which left little CPU time available to the application. (The impact on network and memory bandwidth still needs to be determined.) For applications (or users) that cannot tolerate an occasional interruption, we recommend that they continue to use only compute nodes. For tolerant applications needing more cycles than those provided by the compute nodes, we recommend that they take full advantage of both compute and I/O nodes for computation, and that operating systems should make this possible.
      

      
      System for processing an encrypted instruction stream in hardware
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Griswold, Richard L.; Nickless, William K.; Conrad, Ryan C.
         
         A system and method of processing an encrypted instruction stream in hardware is disclosed. Main memory stores the encrypted instruction stream and unencrypted data. A central processing unit (CPU) is operatively coupled to the main memory. A decryptor is operatively coupled to the main memory and located within the CPU. The decryptor decrypts the encrypted instruction stream upon receipt of an instruction fetch signal from a CPU core. Unencrypted data is passed through to the CPU core without decryption upon receipt of a data fetch signal.
      

      
      Execution of a parallel edge-based Navier-Stokes solver on commodity graphics processor units
      NASA Astrophysics Data System (ADS)
      Corral, Roque; Gisbert, Fernando; Pueblas, Jesus
         2017-02-01
         The implementation of an edge-based three-dimensional Reynolds Average Navier-Stokes solver for unstructured grids able to run on multiple graphics processing units (GPUs) is presented. Loops over edges, which are the most time-consuming part of the solver, have been written to exploit the massively parallel capabilities of GPUs. Non-blocking communications between parallel processes and between the GPU and the central processor unit (CPU) have been used to enhance code scalability. The code is written using a mixture of C++ and OpenCL, to allow the execution of the source code on GPUs. The Message Passage Interface (MPI) library is used to allow the parallel execution of the solver on multiple GPUs. A comparative study of the solver parallel performance is carried out using a cluster of CPUs and another of GPUs. It is shown that a single GPU is up to 64 times faster than a single CPU core. The parallel scalability of the solver is mainly degraded due to the loss of computing efficiency of the GPU when the size of the case decreases. However, for large enough grid sizes, the scalability is strongly improved. A cluster featuring commodity GPUs and a high bandwidth network is ten times less costly and consumes 33% less energy than a CPU-based cluster with an equivalent computational power.
      

      
      A comparison of native GPU computing versus OpenACC for implementing flow-routing algorithms in hydrological applications
      NASA Astrophysics Data System (ADS)
      Rueda, Antonio J.; Noguera, José M.; Luque, Adrián
         2016-02-01
         In recent years GPU computing has gained wide acceptance as a simple low-cost solution for speeding up computationally expensive processing in many scientific and engineering applications. However, in most cases accelerating a traditional CPU implementation for a GPU is a non-trivial task that requires a thorough refactorization of the code and specific optimizations that depend on the architecture of the device. OpenACC is a promising technology that aims at reducing the effort required to accelerate C/C++/Fortran code on an attached multicore device. Virtually with this technology the CPU code only has to be augmented with a few compiler directives to identify the areas to be accelerated and the way in which data has to be moved between the CPU and GPU. Its potential benefits are multiple: better code readability, less development time, lower risk of errors and less dependency on the underlying architecture and future evolution of the GPU technology. Our aim with this work is to evaluate the pros and cons of using OpenACC against native GPU implementations in computationally expensive hydrological applications, using the classic D8 algorithm of O'Callaghan and Mark for river network extraction as case-study. We implemented the flow accumulation step of this algorithm in CPU, using OpenACC and two different CUDA versions, comparing the length and complexity of the code and its performance with different datasets. We advance that although OpenACC can not match the performance of a CUDA optimized implementation (×3.5 slower in average), it provides a significant performance improvement against a CPU implementation (×2-6) with by far a simpler code and less implementation effort.
      

      
      Algorithms of GPU-enabled reactive force field (ReaxFF) molecular dynamics.
      PubMed
      Zheng, Mo; Li, Xiaoxia; Guo, Li
         2013-04-01
         Reactive force field (ReaxFF), a recent and novel bond order potential, allows for reactive molecular dynamics (ReaxFF MD) simulations for modeling larger and more complex molecular systems involving chemical reactions when compared with computation intensive quantum mechanical methods. However, ReaxFF MD can be approximately 10-50 times slower than classical MD due to its explicit modeling of bond forming and breaking, the dynamic charge equilibration at each time-step, and its one order smaller time-step than the classical MD, all of which pose significant computational challenges in simulation capability to reach spatio-temporal scales of nanometers and nanoseconds. The very recent advances of graphics processing unit (GPU) provide not only highly favorable performance for GPU enabled MD programs compared with CPU implementations but also an opportunity to manage with the computing power and memory demanding nature imposed on computer hardware by ReaxFF MD. In this paper, we present the algorithms of GMD-Reax, the first GPU enabled ReaxFF MD program with significantly improved performance surpassing CPU implementations on desktop workstations. The performance of GMD-Reax has been benchmarked on a PC equipped with a NVIDIA C2050 GPU for coal pyrolysis simulation systems with atoms ranging from 1378 to 27,283. GMD-Reax achieved speedups as high as 12 times faster than Duin et al.'s FORTRAN codes in Lammps on 8 CPU cores and 6 times faster than the Lammps' C codes based on PuReMD in terms of the simulation time per time-step averaged over 100 steps. GMD-Reax could be used as a new and efficient computational tool for exploiting very complex molecular reactions via ReaxFF MD simulation on desktop workstations. Copyright © 2013 Elsevier Inc. All rights reserved.
      

        
       
          

«

7
      8
      9
   10
      11
      »

          
        

     

   

   
       
            
              
          

«

8
      9
      10
   11
      12
      »

          
        

           
           
             
               
      
      A survey of CPU-GPU heterogeneous computing techniques
      DOE PAGES
      Mittal, Sparsh; Vetter, Jeffrey S.
         2015-07-04
         As both CPU and GPU become employed in a wide range of applications, it has been acknowledged that both of these processing units (PUs) have their unique features and strengths and hence, CPU-GPU collaboration is inevitable to achieve high-performance computing. This has motivated significant amount of research on heterogeneous computing techniques, along with the design of CPU-GPU fused chips and petascale heterogeneous supercomputers. In this paper, we survey heterogeneous computing techniques (HCTs) such as workload-partitioning which enable utilizing both CPU and GPU to improve performance and/or energy efficiency. We review heterogeneous computing approaches at runtime, algorithm, programming, compiler and applicationmore » level. Further, we review both discrete and fused CPU-GPU systems; and discuss benchmark suites designed for evaluating heterogeneous computing systems (HCSs). Furthermore, we believe that this paper will provide insights into working and scope of applications of HCTs to researchers and motivate them to further harness the computational powers of CPUs and GPUs to achieve the goal of exascale performance.« less
      

      
      A survey of CPU-GPU heterogeneous computing techniques
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Mittal, Sparsh; Vetter, Jeffrey S.
         
         As both CPU and GPU become employed in a wide range of applications, it has been acknowledged that both of these processing units (PUs) have their unique features and strengths and hence, CPU-GPU collaboration is inevitable to achieve high-performance computing. This has motivated significant amount of research on heterogeneous computing techniques, along with the design of CPU-GPU fused chips and petascale heterogeneous supercomputers. In this paper, we survey heterogeneous computing techniques (HCTs) such as workload-partitioning which enable utilizing both CPU and GPU to improve performance and/or energy efficiency. We review heterogeneous computing approaches at runtime, algorithm, programming, compiler and applicationmore » level. Further, we review both discrete and fused CPU-GPU systems; and discuss benchmark suites designed for evaluating heterogeneous computing systems (HCSs). Furthermore, we believe that this paper will provide insights into working and scope of applications of HCTs to researchers and motivate them to further harness the computational powers of CPUs and GPUs to achieve the goal of exascale performance.« less
      

      
      Evaluation of user input methods for manipulating a tablet personal computer in sterile techniques.
      PubMed
      Yamada, Akira; Komatsu, Daisuke; Suzuki, Takeshi; Kurozumi, Masahiro; Fujinaga, Yasunari; Ueda, Kazuhiko; Kadoya, Masumi
         2017-02-01
         To determine a quick and accurate user input method for manipulating tablet personal computers (PCs) in sterile techniques. We evaluated three different manipulation methods, (1) Computer mouse and sterile system drape, (2) Fingers and sterile system drape, and (3) Digitizer stylus and sterile ultrasound probe cover with a pinhole, in terms of the central processing unit (CPU) performance, manipulation performance, and contactlessness. A significant decrease in CPU score ([Formula: see text]) and an increase in CPU temperature ([Formula: see text]) were observed when a system drape was used. The respective mean times taken to select a target image from an image series (ST) and the mean times for measuring points on an image (MT) were [Formula: see text] and [Formula: see text] s for the computer mouse method, [Formula: see text] and [Formula: see text] s for the finger method, and [Formula: see text] and [Formula: see text] s for the digitizer stylus method, respectively. The ST for the finger method was significantly longer than for the digitizer stylus method ([Formula: see text]). The MT for the computer mouse method was significantly longer than for the digitizer stylus method ([Formula: see text]). The mean success rate for measuring points on an image was significantly lower for the finger method when the diameter of the target was equal to or smaller than 8 mm than for the other methods. No significant difference in the adenosine triphosphate amount at the surface of the tablet PC was observed before, during, or after manipulation via the digitizer stylus method while wearing starch-powdered sterile gloves ([Formula: see text]). Quick and accurate manipulation of tablet PCs in sterile techniques without CPU load is feasible using a digitizer stylus and sterile ultrasound probe cover with a pinhole.
      

      
      Fast multipurpose Monte Carlo simulation for proton therapy using multi- and many-core CPU architectures
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Souris, Kevin, E-mail: kevin.souris@uclouvain.be; Lee, John Aldo; Sterpin, Edmond
         2016-04-15
         Purpose: Accuracy in proton therapy treatment planning can be improved using Monte Carlo (MC) simulations. However the long computation time of such methods hinders their use in clinical routine. This work aims to develop a fast multipurpose Monte Carlo simulation tool for proton therapy using massively parallel central processing unit (CPU) architectures. Methods: A new Monte Carlo, called MCsquare (many-core Monte Carlo), has been designed and optimized for the last generation of Intel Xeon processors and Intel Xeon Phi coprocessors. These massively parallel architectures offer the flexibility and the computational power suitable to MC methods. The class-II condensed history algorithmmore » of MCsquare provides a fast and yet accurate method of simulating heavy charged particles such as protons, deuterons, and alphas inside voxelized geometries. Hard ionizations, with energy losses above a user-specified threshold, are simulated individually while soft events are regrouped in a multiple scattering theory. Elastic and inelastic nuclear interactions are sampled from ICRU 63 differential cross sections, thereby allowing for the computation of prompt gamma emission profiles. MCsquare has been benchmarked with the GATE/GEANT4 Monte Carlo application for homogeneous and heterogeneous geometries. Results: Comparisons with GATE/GEANT4 for various geometries show deviations within 2%–1 mm. In spite of the limited memory bandwidth of the coprocessor simulation time is below 25 s for 10{sup 7} primary 200 MeV protons in average soft tissues using all Xeon Phi and CPU resources embedded in a single desktop unit. Conclusions: MCsquare exploits the flexibility of CPU architectures to provide a multipurpose MC simulation tool. Optimized code enables the use of accurate MC calculation within a reasonable computation time, adequate for clinical practice. MCsquare also simulates prompt gamma emission and can thus be used also for in vivo range verification.« less
      

      
      Continuous piecewise-linear, reduced-order electrochemical model for lithium-ion batteries in real-time applications
      NASA Astrophysics Data System (ADS)
      Farag, Mohammed; Fleckenstein, Matthias; Habibi, Saeid
         2017-02-01
         Model-order reduction and minimization of the CPU run-time while maintaining the model accuracy are critical requirements for real-time implementation of lithium-ion electrochemical battery models. In this paper, an isothermal, continuous, piecewise-linear, electrode-average model is developed by using an optimal knot placement technique. The proposed model reduces the univariate nonlinear function of the electrode's open circuit potential dependence on the state of charge to continuous piecewise regions. The parameterization experiments were chosen to provide a trade-off between extensive experimental characterization techniques and purely identifying all parameters using optimization techniques. The model is then parameterized in each continuous, piecewise-linear, region. Applying the proposed technique cuts down the CPU run-time by around 20%, compared to the reduced-order, electrode-average model. Finally, the model validation against real-time driving profiles (FTP-72, WLTP) demonstrates the ability of the model to predict the cell voltage accurately with less than 2% error.
      

      
      The Effect of Multigrid Parameters in a 3D Heat Diffusion Equation
      NASA Astrophysics Data System (ADS)
      Oliveira, F. De; Franco, S. R.; Pinto, M. A. Villela
         2018-02-01
         The aim of this paper is to reduce the necessary CPU time to solve the three-dimensional heat diffusion equation using Dirichlet boundary conditions. The finite difference method (FDM) is used to discretize the differential equations with a second-order accuracy central difference scheme (CDS). The algebraic equations systems are solved using the lexicographical and red-black Gauss-Seidel methods, associated with the geometric multigrid method with a correction scheme (CS) and V-cycle. Comparisons are made between two types of restriction: injection and full weighting. The used prolongation process is the trilinear interpolation. This work is concerned with the study of the influence of the smoothing value (v), number of mesh levels (L) and number of unknowns (N) on the CPU time, as well as the analysis of algorithm complexity.
      

      
      An incomplete assembly with thresholding algorithm for systems of reaction-diffusion equations in three space dimensions IAT for reaction-diffusion systems
      NASA Astrophysics Data System (ADS)
      Moore, Peter K.
         2003-07-01
         Solving systems of reaction-diffusion equations in three space dimensions can be prohibitively expensive both in terms of storage and CPU time. Herein, I present a new incomplete assembly procedure that is designed to reduce storage requirements. Incomplete assembly is analogous to incomplete factorization in that only a fixed number of nonzero entries are stored per row and a drop tolerance is used to discard small values. The algorithm is incorporated in a finite element method-of-lines code and tested on a set of reaction-diffusion systems. The effect of incomplete assembly on CPU time and storage and on the performance of the temporal integrator DASPK, algebraic solver GMRES and preconditioner ILUT is studied.
      

      
      Analysis and improvements of Adaptive Particle Refinement (APR) through CPU time, accuracy and robustness considerations
      NASA Astrophysics Data System (ADS)
      Chiron, L.; Oger, G.; de Leffe, M.; Le Touzé, D.
         2018-02-01
         While smoothed-particle hydrodynamics (SPH) simulations are usually performed using uniform particle distributions, local particle refinement techniques have been developed to concentrate fine spatial resolutions in identified areas of interest. Although the formalism of this method is relatively easy to implement, its robustness at coarse/fine interfaces can be problematic. Analysis performed in [16] shows that the radius of refined particles should be greater than half the radius of unrefined particles to ensure robustness. In this article, the basics of an Adaptive Particle Refinement (APR) technique, inspired by AMR in mesh-based methods, are presented. This approach ensures robustness with alleviated constraints. Simulations applying the new formalism proposed achieve accuracy comparable to fully refined spatial resolutions, together with robustness, low CPU times and maintained parallel efficiency.
      

      
      Implementation of the DAST ARW II control laws using an 8086 microprocessor and an 8087 floating-point coprocessor. [drones for aeroelasticity research
      NASA Technical Reports Server (NTRS)
      Kelly, G. L.; Berthold, G.; Abbott, L.
         1982-01-01
         A 5 MHZ single-board microprocessor system which incorporates an 8086 CPU and an 8087 Numeric Data Processor is used to implement the control laws for the NASA Drones for Aerodynamic and Structural Testing, Aeroelastic Research Wing II. The control laws program was executed in 7.02 msec, with initialization consuming 2.65 msec and the control law loop 4.38 msec. The software emulator execution times for these two tasks were 36.67 and 61.18, respectively, for a total of 97.68 msec. The space, weight and cost reductions achieved in the present, aircraft control application of this combination of a 16-bit microprocessor with an 80-bit floating point coprocessor may be obtainable in other real time control applications.
      

      
      The applicability of turbulence models to aerodynamic and propulsion flowfields at McDonnell-Douglas Aerospace
      NASA Technical Reports Server (NTRS)
      Kral, Linda D.; Ladd, John A.; Mani, Mori
         1995-01-01
         The objective of this viewgraph presentation is to evaluate turbulence models for integrated aircraft components such as the forebody, wing, inlet, diffuser, nozzle, and afterbody. The one-equation models have replaced the algebraic models as the baseline turbulence models. The Spalart-Allmaras one-equation model consistently performs better than the Baldwin-Barth model, particularly in the log-layer and free shear layers. Also, the Sparlart-Allmaras model is not grid dependent like the Baldwin-Barth model. No general turbulence model exists for all engineering applications. The Spalart-Allmaras one-equation model and the Chien k-epsilon models are the preferred turbulence models. Although the two-equation models often better predict the flow field, they may take from two to five times the CPU time. Future directions are in further benchmarking the Menter blended k-w/k-epsilon and algorithmic improvements to reduce CPU time of the two-equation model.
      

      
      GPU Particle Tracking and MHD Simulations with Greatly Enhanced Computational Speed
      NASA Astrophysics Data System (ADS)
      Ziemba, T.; O'Donnell, D.; Carscadden, J.; Cash, M.; Winglee, R.; Harnett, E.
         2008-12-01
         GPUs are intrinsically highly parallelized systems that provide more than an order of magnitude computing speed over a CPU based systems, for less cost than a high end-workstation. Recent advancements in GPU technologies allow for full IEEE float specifications with performance up to several hundred GFLOPs per GPU, and new software architectures have recently become available to ease the transition from graphics based to scientific applications. This allows for a cheap alternative to standard supercomputing methods and should increase the time to discovery. 3-D particle tracking and MHD codes have been developed using NVIDIA's CUDA and have demonstrated speed up of nearly a factor of 20 over equivalent CPU versions of the codes. Such a speed up enables new applications to develop, including real time running of radiation belt simulations and real time running of global magnetospheric simulations, both of which could provide important space weather prediction tools.
      

      
      RTOS kernel in portable electrocardiograph
      NASA Astrophysics Data System (ADS)
      Centeno, C. A.; Voos, J. A.; Riva, G. G.; Zerbini, C.; Gonzalez, E. A.
         2011-12-01
         This paper presents the use of a Real Time Operating System (RTOS) on a portable electrocardiograph based on a microcontroller platform. All medical device digital functions are performed by the microcontroller. The electrocardiograph CPU is based on the 18F4550 microcontroller, in which an uCOS-II RTOS can be embedded. The decision associated with the kernel use is based on its benefits, the license for educational use and its intrinsic time control and peripherals management. The feasibility of its use on the electrocardiograph is evaluated based on the minimum memory requirements due to the kernel structure. The kernel's own tools were used for time estimation and evaluation of resources used by each process. After this feasibility analysis, the migration from cyclic code to a structure based on separate processes or tasks able to synchronize events is used; resulting in an electrocardiograph running on one Central Processing Unit (CPU) based on RTOS.
      

      
      Efficient Scalable Median Filtering Using Histogram-Based Operations.
      PubMed
      Green, Oded
         2018-05-01
         Median filtering is a smoothing technique for noise removal in images. While there are various implementations of median filtering for a single-core CPU, there are few implementations for accelerators and multi-core systems. Many parallel implementations of median filtering use a sorting algorithm for rearranging the values within a filtering window and taking the median of the sorted value. While using sorting algorithms allows for simple parallel implementations, the cost of the sorting becomes prohibitive as the filtering windows grow. This makes such algorithms, sequential and parallel alike, inefficient. In this work, we introduce the first software parallel median filtering that is non-sorting-based. The new algorithm uses efficient histogram-based operations. These reduce the computational requirements of the new algorithm while also accessing the image fewer times. We show an implementation of our algorithm for both the CPU and NVIDIA's CUDA supported graphics processing unit (GPU). The new algorithm is compared with several other leading CPU and GPU implementations. The CPU implementation has near perfect linear scaling with a speedup on a quad-core system. The GPU implementation is several orders of magnitude faster than the other GPU implementations for mid-size median filters. For small kernels, and , comparison-based approaches are preferable as fewer operations are required. Lastly, the new algorithm is open-source and can be found in the OpenCV library.
      

      
      SU-E-T-422: Fast Analytical Beamlet Optimization for Volumetric Intensity-Modulated Arc Therapy
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Chan, Kenny S K; Lee, Louis K Y; Xing, L
         2015-06-15
         Purpose: To implement a fast optimization algorithm on CPU/GPU heterogeneous computing platform and to obtain an optimal fluence for a given target dose distribution from the pre-calculated beamlets in an analytical approach. Methods: The 2D target dose distribution was modeled as an n-dimensional vector and estimated by a linear combination of independent basis vectors. The basis set was composed of the pre-calculated beamlet dose distributions at every 6 degrees of gantry angle and the cost function was set as the magnitude square of the vector difference between the target and the estimated dose distribution. The optimal weighting of the basis,more » which corresponds to the optimal fluence, was obtained analytically by the least square method. Those basis vectors with a positive weighting were selected for entering into the next level of optimization. Totally, 7 levels of optimization were implemented in the study.Ten head-and-neck and ten prostate carcinoma cases were selected for the study and mapped to a round water phantom with a diameter of 20cm. The Matlab computation was performed in a heterogeneous programming environment with Intel i7 CPU and NVIDIA Geforce 840M GPU. Results: In all selected cases, the estimated dose distribution was in a good agreement with the given target dose distribution and their correlation coefficients were found to be in the range of 0.9992 to 0.9997. Their root-mean-square error was monotonically decreasing and converging after 7 cycles of optimization. The computation took only about 10 seconds and the optimal fluence maps at each gantry angle throughout an arc were quickly obtained. Conclusion: An analytical approach is derived for finding the optimal fluence for a given target dose distribution and a fast optimization algorithm implemented on the CPU/GPU heterogeneous computing environment greatly reduces the optimization time.« less
      

      
      47 CFR 15.102 - CPU boards and power supplies used in personal computers.
      Code of Federal Regulations, 2013 CFR
      
         2013-10-01
         ... computers. 15.102 Section 15.102 Telecommunication FEDERAL COMMUNICATIONS COMMISSION GENERAL RADIO FREQUENCY DEVICES Unintentional Radiators § 15.102 CPU boards and power supplies used in personal computers. (a... modifications that must be made to a personal computer, peripheral device, CPU board or power supply during...
      

      
      47 CFR 15.102 - CPU boards and power supplies used in personal computers.
      Code of Federal Regulations, 2011 CFR
      
         2011-10-01
         ... computers. 15.102 Section 15.102 Telecommunication FEDERAL COMMUNICATIONS COMMISSION GENERAL RADIO FREQUENCY DEVICES Unintentional Radiators § 15.102 CPU boards and power supplies used in personal computers. (a... modifications that must be made to a personal computer, peripheral device, CPU board or power supply during...
      

      
      47 CFR 15.102 - CPU boards and power supplies used in personal computers.
      Code of Federal Regulations, 2010 CFR
      
         2010-10-01
         ... computers. 15.102 Section 15.102 Telecommunication FEDERAL COMMUNICATIONS COMMISSION GENERAL RADIO FREQUENCY DEVICES Unintentional Radiators § 15.102 CPU boards and power supplies used in personal computers. (a... modifications that must be made to a personal computer, peripheral device, CPU board or power supply during...
      

      
      47 CFR 15.102 - CPU boards and power supplies used in personal computers.
      Code of Federal Regulations, 2014 CFR
      
         2014-10-01
         ... computers. 15.102 Section 15.102 Telecommunication FEDERAL COMMUNICATIONS COMMISSION GENERAL RADIO FREQUENCY DEVICES Unintentional Radiators § 15.102 CPU boards and power supplies used in personal computers. (a... modifications that must be made to a personal computer, peripheral device, CPU board or power supply during...
      

      
      47 CFR 15.102 - CPU boards and power supplies used in personal computers.
      Code of Federal Regulations, 2012 CFR
      
         2012-10-01
         ... computers. 15.102 Section 15.102 Telecommunication FEDERAL COMMUNICATIONS COMMISSION GENERAL RADIO FREQUENCY DEVICES Unintentional Radiators § 15.102 CPU boards and power supplies used in personal computers. (a... modifications that must be made to a personal computer, peripheral device, CPU board or power supply during...
      

      
      Online performance evaluation of RAID 5 using CPU utilization
      NASA Astrophysics Data System (ADS)
      Jin, Hai; Yang, Hua; Zhang, Jiangling
         1998-09-01
         Redundant arrays of independent disks (RAID) technology is the efficient way to solve the bottleneck problem between CPU processing ability and I/O subsystem. For the system point of view, the most important metric of on line performance is the utilization of CPU. This paper first employs the way to calculate the CPU utilization of system connected with RAID level 5 using statistic average method. From the simulation results of CPU utilization of system connected with RAID level 5 subsystem can we see that using multiple disks as an array to access data in parallel is the efficient way to enhance the on-line performance of disk storage system. USing high-end disk drivers to compose the disk array is the key to enhance the on-line performance of system.
      

        
       
          

«

8
      9
      10
   11
      12
      »

          
        

     

   

   
       
            
              
          

«

9
      10
      11
   12
      13
      »

          
        

           
           
             
               
      
      Compute-unified device architecture implementation of a block-matching algorithm for multiple graphical processing unit cards
      PubMed Central
      Massanes, Francesc; Cadennes, Marie; Brankov, Jovan G.
         2012-01-01
         In this paper we describe and evaluate a fast implementation of a classical block matching motion estimation algorithm for multiple Graphical Processing Units (GPUs) using the Compute Unified Device Architecture (CUDA) computing engine. The implemented block matching algorithm (BMA) uses summed absolute difference (SAD) error criterion and full grid search (FS) for finding optimal block displacement. In this evaluation we compared the execution time of a GPU and CPU implementation for images of various sizes, using integer and non-integer search grids. The results show that use of a GPU card can shorten computation time by a factor of 200 times for integer and 1000 times for a non-integer search grid. The additional speedup for non-integer search grid comes from the fact that GPU has built-in hardware for image interpolation. Further, when using multiple GPU cards, the presented evaluation shows the importance of the data splitting method across multiple cards, but an almost linear speedup with a number of cards is achievable. In addition we compared execution time of the proposed FS GPU implementation with two existing, highly optimized non-full grid search CPU based motion estimations methods, namely implementation of the Pyramidal Lucas Kanade Optical flow algorithm in OpenCV and Simplified Unsymmetrical multi-Hexagon search in H.264/AVC standard. In these comparisons, FS GPU implementation still showed modest improvement even though the computational complexity of FS GPU implementation is substantially higher than non-FS CPU implementation. We also demonstrated that for an image sequence of 720×480 pixels in resolution, commonly used in video surveillance, the proposed GPU implementation is sufficiently fast for real-time motion estimation at 30 frames-per-second using two NVIDIA C1060 Tesla GPU cards. PMID:22347787
      

      
      Compute-unified device architecture implementation of a block-matching algorithm for multiple graphical processing unit cards.
      PubMed
      Massanes, Francesc; Cadennes, Marie; Brankov, Jovan G
         2011-07-01
         In this paper we describe and evaluate a fast implementation of a classical block matching motion estimation algorithm for multiple Graphical Processing Units (GPUs) using the Compute Unified Device Architecture (CUDA) computing engine. The implemented block matching algorithm (BMA) uses summed absolute difference (SAD) error criterion and full grid search (FS) for finding optimal block displacement. In this evaluation we compared the execution time of a GPU and CPU implementation for images of various sizes, using integer and non-integer search grids.The results show that use of a GPU card can shorten computation time by a factor of 200 times for integer and 1000 times for a non-integer search grid. The additional speedup for non-integer search grid comes from the fact that GPU has built-in hardware for image interpolation. Further, when using multiple GPU cards, the presented evaluation shows the importance of the data splitting method across multiple cards, but an almost linear speedup with a number of cards is achievable.In addition we compared execution time of the proposed FS GPU implementation with two existing, highly optimized non-full grid search CPU based motion estimations methods, namely implementation of the Pyramidal Lucas Kanade Optical flow algorithm in OpenCV and Simplified Unsymmetrical multi-Hexagon search in H.264/AVC standard. In these comparisons, FS GPU implementation still showed modest improvement even though the computational complexity of FS GPU implementation is substantially higher than non-FS CPU implementation.We also demonstrated that for an image sequence of 720×480 pixels in resolution, commonly used in video surveillance, the proposed GPU implementation is sufficiently fast for real-time motion estimation at 30 frames-per-second using two NVIDIA C1060 Tesla GPU cards.
      

      
      Comparison of Conjugate Gradient Density Matrix Search and Chebyshev Expansion Methods for Avoiding Diagonalization in Large-Scale Electronic Structure Calculations
      NASA Technical Reports Server (NTRS)
      Bates, Kevin R.; Daniels, Andrew D.; Scuseria, Gustavo E.
         1998-01-01
         We report a comparison of two linear-scaling methods which avoid the diagonalization bottleneck of traditional electronic structure algorithms. The Chebyshev expansion method (CEM) is implemented for carbon tight-binding calculations of large systems and its memory and timing requirements compared to those of our previously implemented conjugate gradient density matrix search (CG-DMS). Benchmark calculations are carried out on icosahedral fullerenes from C60 to C8640 and the linear scaling memory and CPU requirements of the CEM demonstrated. We show that the CPU requisites of the CEM and CG-DMS are similar for calculations with comparable accuracy.
      

      
      GPU accelerated Monte-Carlo simulation of SEM images for metrology
      NASA Astrophysics Data System (ADS)
      Verduin, T.; Lokhorst, S. R.; Hagen, C. W.
         2016-03-01
         In this work we address the computation times of numerical studies in dimensional metrology. In particular, full Monte-Carlo simulation programs for scanning electron microscopy (SEM) image acquisition are known to be notoriously slow. Our quest in reducing the computation time of SEM image simulation has led us to investigate the use of graphics processing units (GPUs) for metrology. We have succeeded in creating a full Monte-Carlo simulation program for SEM images, which runs entirely on a GPU. The physical scattering models of this GPU simulator are identical to a previous CPU-based simulator, which includes the dielectric function model for inelastic scattering and also refinements for low-voltage SEM applications. As a case study for the performance, we considered the simulated exposure of a complex feature: an isolated silicon line with rough sidewalls located on a at silicon substrate. The surface of the rough feature is decomposed into 408 012 triangles. We have used an exposure dose of 6 mC/cm2, which corresponds to 6 553 600 primary electrons on average (Poisson distributed). We repeat the simulation for various primary electron energies, 300 eV, 500 eV, 800 eV, 1 keV, 3 keV and 5 keV. At first we run the simulation on a GeForce GTX480 from NVIDIA. The very same simulation is duplicated on our CPU-based program, for which we have used an Intel Xeon X5650. Apart from statistics in the simulation, no difference is found between the CPU and GPU simulated results. The GTX480 generates the images (depending on the primary electron energy) 350 to 425 times faster than a single threaded Intel X5650 CPU. Although this is a tremendous speedup, we actually have not reached the maximum throughput because of the limited amount of available memory on the GTX480. Nevertheless, the speedup enables the fast acquisition of simulated SEM images for metrology. We now have the potential to investigate case studies in CD-SEM metrology, which otherwise would take unreasonable amounts of computation time.
      

      
      The GPU implementation of micro - Doppler period estimation
      NASA Astrophysics Data System (ADS)
      Yang, Liyuan; Wang, Junling; Bi, Ran
         2018-03-01
         Aiming at the problem that the computational complexity and the deficiency of real-time of the wideband radar echo signal, a program is designed to improve the performance of real-time extraction of micro-motion feature in this paper based on the CPU-GPU heterogeneous parallel structure. Firstly, we discuss the principle of the micro-Doppler effect generated by the rolling of the scattering points on the orbiting satellite, analyses how to use Kalman filter to compensate the translational motion of tumbling satellite and how to use the joint time-frequency analysis and inverse Radon transform to extract the micro-motion features from the echo after compensation. Secondly, the advantages of GPU in terms of real-time processing and the working principle of CPU-GPU heterogeneous parallelism are analysed, and a program flow based on GPU to extract the micro-motion feature from the radar echo signal of rolling satellite is designed. At the end of the article the results of extraction are given to verify the correctness of the program and algorithm.
      

      
      High-speed zero-copy data transfer for DAQ applications
      NASA Astrophysics Data System (ADS)
      Pisani, Flavio; Cámpora Pérez, Daniel Hugo; Neufeld, Niko
         2015-05-01
         The LHCb Data Acquisition (DAQ) will be upgraded in 2020 to a trigger-free readout. In order to achieve this goal we will need to connect around 500 nodes with a total network capacity of 32 Tb/s. To get such an high network capacity we are testing zero-copy technology in order to maximize the theoretical link throughput without adding excessive CPU and memory bandwidth overhead, leaving free resources for data processing resulting in less power, space and money used for the same result. We develop a modular test application which can be used with different transport layers. For the zero-copy implementation we choose the OFED IBVerbs API because it can provide low level access and high throughput. We present throughput and CPU usage measurements of 40 GbE solutions using Remote Direct Memory Access (RDMA), for several network configurations to test the scalability of the system.
      

      
      A GPU accelerated and error-controlled solver for the unbounded Poisson equation in three dimensions
      NASA Astrophysics Data System (ADS)
      Exl, Lukas
         2017-12-01
         An efficient solver for the three dimensional free-space Poisson equation is presented. The underlying numerical method is based on finite Fourier series approximation. While the error of all involved approximations can be fully controlled, the overall computation error is driven by the convergence of the finite Fourier series of the density. For smooth and fast-decaying densities the proposed method will be spectrally accurate. The method scales with O(N log N) operations, where N is the total number of discretization points in the Cartesian grid. The majority of the computational costs come from fast Fourier transforms (FFT), which makes it ideal for GPU computation. Several numerical computations on CPU and GPU validate the method and show efficiency and convergence behavior. Tests are performed using the Vienna Scientific Cluster 3 (VSC3). A free MATLAB implementation for CPU and GPU is provided to the interested community.
      

      
      Calculations of 3D compressible flows using an efficient low diffusion upwind scheme
      NASA Astrophysics Data System (ADS)
      Hu, Zongjun; Zha, Gecheng
         2005-01-01
         A newly suggested E-CUSP upwind scheme is employed for the first time to calculate 3D flows of propulsion systems. The E-CUSP scheme contains the total energy in the convective vector and is fully consistent with the characteristic directions. The scheme is proved to have low diffusion and high CPU efficiency. The computed cases in this paper include a transonic nozzle with circular-to-rectangular cross-section, a transonic duct with shock wave/turbulent boundary layer interaction, and a subsonic 3D compressor cascade. The computed results agree well with the experiments. The new scheme is proved to be accurate, efficient and robust for the 3D calculations of the flows in this paper.
      

      
      Development and evaluation of a Fault-Tolerant Multiprocessor (FTMP) computer. Volume 3: FTMP test and evaluation
      NASA Technical Reports Server (NTRS)
      Lala, J. H.; Smith, T. B., III
         1983-01-01
         The experimental test and evaluation of the Fault-Tolerant Multiprocessor (FTMP) is described. Major objectives of this exercise include expanding validation envelope, building confidence in the system, revealing any weaknesses in the architectural concepts and in their execution in hardware and software, and in general, stressing the hardware and software. To this end, pin-level faults were injected into one LRU of the FTMP and the FTMP response was measured in terms of fault detection, isolation, and recovery times. A total of 21,055 stuck-at-0, stuck-at-1 and invert-signal faults were injected in the CPU, memory, bus interface circuits, Bus Guardian Units, and voters and error latches. Of these, 17,418 were detected. At least 80 percent of undetected faults are estimated to be on unused pins. The multiprocessor identified all detected faults correctly and recovered successfully in each case. Total recovery time for all faults averaged a little over one second. This can be reduced to half a second by including appropriate self-tests.
      

      
      Agglomeration Multigrid for an Unstructured-Grid Flow Solver
      NASA Technical Reports Server (NTRS)
      Frink, Neal; Pandya, Mohagna J.
         2004-01-01
         An agglomeration multigrid scheme has been implemented into the sequential version of the NASA code USM3Dns, tetrahedral cell-centered finite volume Euler/Navier-Stokes flow solver. Efficiency and robustness of the multigrid-enhanced flow solver have been assessed for three configurations assuming an inviscid flow and one configuration assuming a viscous fully turbulent flow. The inviscid studies include a transonic flow over the ONERA M6 wing and a generic business jet with flow-through nacelles and a low subsonic flow over a high-lift trapezoidal wing. The viscous case includes a fully turbulent flow over the RAE 2822 rectangular wing. The multigrid solutions converged with 12%-33% of the Central Processing Unit (CPU) time required by the solutions obtained without multigrid. For all of the inviscid cases, multigrid in conjunction with an explicit time-stepping scheme performed the best with regard to the run time memory and CPU time requirements. However, for the viscous case multigrid had to be used with an implicit backward Euler time-stepping scheme that increased the run time memory requirement by 22% as compared to the run made without multigrid.
      

      
      Multi-GPU Jacobian accelerated computing for soft-field tomography.
      PubMed
      Borsic, A; Attardo, E A; Halter, R J
         2012-10-01
         Image reconstruction in soft-field tomography is based on an inverse problem formulation, where a forward model is fitted to the data. In medical applications, where the anatomy presents complex shapes, it is common to use finite element models (FEMs) to represent the volume of interest and solve a partial differential equation that models the physics of the system. Over the last decade, there has been a shifting interest from 2D modeling to 3D modeling, as the underlying physics of most problems are 3D. Although the increased computational power of modern computers allows working with much larger FEM models, the computational time required to reconstruct 3D images on a fine 3D FEM model can be significant, on the order of hours. For example, in electrical impedance tomography (EIT) applications using a dense 3D FEM mesh with half a million elements, a single reconstruction iteration takes approximately 15-20 min with optimized routines running on a modern multi-core PC. It is desirable to accelerate image reconstruction to enable researchers to more easily and rapidly explore data and reconstruction parameters. Furthermore, providing high-speed reconstructions is essential for some promising clinical application of EIT. For 3D problems, 70% of the computing time is spent building the Jacobian matrix, and 25% of the time in forward solving. In this work, we focus on accelerating the Jacobian computation by using single and multiple GPUs. First, we discuss an optimized implementation on a modern multi-core PC architecture and show how computing time is bounded by the CPU-to-memory bandwidth; this factor limits the rate at which data can be fetched by the CPU. Gains associated with the use of multiple CPU cores are minimal, since data operands cannot be fetched fast enough to saturate the processing power of even a single CPU core. GPUs have much faster memory bandwidths compared to CPUs and better parallelism. We are able to obtain acceleration factors of 20 times on a single NVIDIA S1070 GPU, and of 50 times on four GPUs, bringing the Jacobian computing time for a fine 3D mesh from 12 min to 14 s. We regard this as an important step toward gaining interactive reconstruction times in 3D imaging, particularly when coupled in the future with acceleration of the forward problem. While we demonstrate results for EIT, these results apply to any soft-field imaging modality where the Jacobian matrix is computed with the adjoint method.
      

      
      Multi-GPU Jacobian Accelerated Computing for Soft Field Tomography
      PubMed Central
      Borsic, A.; Attardo, E. A.; Halter, R. J.
         2012-01-01
         Image reconstruction in soft-field tomography is based on an inverse problem formulation, where a forward model is fitted to the data. In medical applications, where the anatomy presents complex shapes, it is common to use Finite Element Models to represent the volume of interest and to solve a partial differential equation that models the physics of the system. Over the last decade, there has been a shifting interest from 2D modeling to 3D modeling, as the underlying physics of most problems are three-dimensional. Though the increased computational power of modern computers allows working with much larger FEM models, the computational time required to reconstruct 3D images on a fine 3D FEM model can be significant, on the order of hours. For example, in Electrical Impedance Tomography applications using a dense 3D FEM mesh with half a million elements, a single reconstruction iteration takes approximately 15 to 20 minutes with optimized routines running on a modern multi-core PC. It is desirable to accelerate image reconstruction to enable researchers to more easily and rapidly explore data and reconstruction parameters. Further, providing high-speed reconstructions are essential for some promising clinical application of EIT. For 3D problems 70% of the computing time is spent building the Jacobian matrix, and 25% of the time in forward solving. In the present work, we focus on accelerating the Jacobian computation by using single and multiple GPUs. First, we discuss an optimized implementation on a modern multi-core PC architecture and show how computing time is bounded by the CPU-to-memory bandwidth; this factor limits the rate at which data can be fetched by the CPU. Gains associated with use of multiple CPU cores are minimal, since data operands cannot be fetched fast enough to saturate the processing power of even a single CPU core. GPUs have a much faster memory bandwidths compared to CPUs and better parallelism. We are able to obtain acceleration factors of 20 times on a single NVIDIA S1070 GPU, and of 50 times on 4 GPUs, bringing the Jacobian computing time for a fine 3D mesh from 12 minutes to 14 seconds. We regard this as an important step towards gaining interactive reconstruction times in 3D imaging, particularly when coupled in the future with acceleration of the forward problem. While we demonstrate results for Electrical Impedance Tomography, these results apply to any soft-field imaging modality where the Jacobian matrix is computed with the Adjoint Method. PMID:23010857
      

      
      Use of a graphics processing unit (GPU) to facilitate real-time 3D graphic presentation of the patient skin-dose distribution during fluoroscopic interventional procedures
      PubMed Central
      Rana, Vijay; Rudin, Stephen; Bednarek, Daniel R.
         2012-01-01
         We have developed a dose-tracking system (DTS) that calculates the radiation dose to the patient’s skin in real-time by acquiring exposure parameters and imaging-system-geometry from the digital bus on a Toshiba Infinix C-arm unit. The cumulative dose values are then displayed as a color map on an OpenGL-based 3D graphic of the patient for immediate feedback to the interventionalist. Determination of those elements on the surface of the patient 3D-graphic that intersect the beam and calculation of the dose for these elements in real time demands fast computation. Reducing the size of the elements results in more computation load on the computer processor and therefore a tradeoff occurs between the resolution of the patient graphic and the real-time performance of the DTS. The speed of the DTS for calculating dose to the skin is limited by the central processing unit (CPU) and can be improved by using the parallel processing power of a graphics processing unit (GPU). Here, we compare the performance speed of GPU-based DTS software to that of the current CPU-based software as a function of the resolution of the patient graphics. Results show a tremendous improvement in speed using the GPU. While an increase in the spatial resolution of the patient graphics resulted in slowing down the computational speed of the DTS on the CPU, the speed of the GPU-based DTS was hardly affected. This GPU-based DTS can be a powerful tool for providing accurate, real-time feedback about patient skin-dose to physicians while performing interventional procedures. PMID:24027616
      

      
      Real-time unmanned aircraft systems surveillance video mosaicking using GPU
      NASA Astrophysics Data System (ADS)
      Camargo, Aldo; Anderson, Kyle; Wang, Yi; Schultz, Richard R.; Fevig, Ronald A.
         2010-04-01
         Digital video mosaicking from Unmanned Aircraft Systems (UAS) is being used for many military and civilian applications, including surveillance, target recognition, border protection, forest fire monitoring, traffic control on highways, monitoring of transmission lines, among others. Additionally, NASA is using digital video mosaicking to explore the moon and planets such as Mars. In order to compute a "good" mosaic from video captured by a UAS, the algorithm must deal with motion blur, frame-to-frame jitter associated with an imperfectly stabilized platform, perspective changes as the camera tilts in flight, as well as a number of other factors. The most suitable algorithms use SIFT (Scale-Invariant Feature Transform) to detect the features consistent between video frames. Utilizing these features, the next step is to estimate the homography between two consecutives video frames, perform warping to properly register the image data, and finally blend the video frames resulting in a seamless video mosaick. All this processing takes a great deal of resources of resources from the CPU, so it is almost impossible to compute a real time video mosaic on a single processor. Modern graphics processing units (GPUs) offer computational performance that far exceeds current CPU technology, allowing for real-time operation. This paper presents the development of a GPU-accelerated digital video mosaicking implementation and compares it with CPU performance. Our tests are based on two sets of real video captured by a small UAS aircraft; one video comes from Infrared (IR) and Electro-Optical (EO) cameras. Our results show that we can obtain a speed-up of more than 50 times using GPU technology, so real-time operation at a video capture of 30 frames per second is feasible.
      

      
      Use of a graphics processing unit (GPU) to facilitate real-time 3D graphic presentation of the patient skin-dose distribution during fluoroscopic interventional procedures.
      PubMed
      Rana, Vijay; Rudin, Stephen; Bednarek, Daniel R
         2012-02-23
         We have developed a dose-tracking system (DTS) that calculates the radiation dose to the patient's skin in real-time by acquiring exposure parameters and imaging-system-geometry from the digital bus on a Toshiba Infinix C-arm unit. The cumulative dose values are then displayed as a color map on an OpenGL-based 3D graphic of the patient for immediate feedback to the interventionalist. Determination of those elements on the surface of the patient 3D-graphic that intersect the beam and calculation of the dose for these elements in real time demands fast computation. Reducing the size of the elements results in more computation load on the computer processor and therefore a tradeoff occurs between the resolution of the patient graphic and the real-time performance of the DTS. The speed of the DTS for calculating dose to the skin is limited by the central processing unit (CPU) and can be improved by using the parallel processing power of a graphics processing unit (GPU). Here, we compare the performance speed of GPU-based DTS software to that of the current CPU-based software as a function of the resolution of the patient graphics. Results show a tremendous improvement in speed using the GPU. While an increase in the spatial resolution of the patient graphics resulted in slowing down the computational speed of the DTS on the CPU, the speed of the GPU-based DTS was hardly affected. This GPU-based DTS can be a powerful tool for providing accurate, real-time feedback about patient skin-dose to physicians while performing interventional procedures.
      

      
      Fast-GPU-PCC: A GPU-Based Technique to Compute Pairwise Pearson's Correlation Coefficients for Time Series Data-fMRI Study.
      PubMed
      Eslami, Taban; Saeed, Fahad
         2018-04-20
         Functional magnetic resonance imaging (fMRI) is a non-invasive brain imaging technique, which has been regularly used for studying brain’s functional activities in the past few years. A very well-used measure for capturing functional associations in brain is Pearson’s correlation coefficient. Pearson’s correlation is widely used for constructing functional network and studying dynamic functional connectivity of the brain. These are useful measures for understanding the effects of brain disorders on connectivities among brain regions. The fMRI scanners produce huge number of voxels and using traditional central processing unit (CPU)-based techniques for computing pairwise correlations is very time consuming especially when large number of subjects are being studied. In this paper, we propose a graphics processing unit (GPU)-based algorithm called Fast-GPU-PCC for computing pairwise Pearson’s correlation coefficient. Based on the symmetric property of Pearson’s correlation, this approach returns N ( N − 1 ) / 2 correlation coefficients located at strictly upper triangle part of the correlation matrix. Storing correlations in a one-dimensional array with the order as proposed in this paper is useful for further usage. Our experiments on real and synthetic fMRI data for different number of voxels and varying length of time series show that the proposed approach outperformed state of the art GPU-based techniques as well as the sequential CPU-based versions. We show that Fast-GPU-PCC runs 62 times faster than CPU-based version and about 2 to 3 times faster than two other state of the art GPU-based methods.
      

      
      Multigrid direct numerical simulation of the whole process of flow transition in 3-D boundary layers
      NASA Technical Reports Server (NTRS)
      Liu, Chaoqun; Liu, Zhining
         1993-01-01
         A new technology was developed in this study which provides a successful numerical simulation of the whole process of flow transition in 3-D boundary layers, including linear growth, secondary instability, breakdown, and transition at relatively low CPU cost. Most other spatial numerical simulations require high CPU cost and blow up at the stage of flow breakdown. A fourth-order finite difference scheme on stretched and staggered grids, a fully implicit time marching technique, a semi-coarsening multigrid based on the so-called approximate line-box relaxation, and a buffer domain for the outflow boundary conditions were all used for high-order accuracy, good stability, and fast convergence. A new fine-coarse-fine grid mapping technique was developed to keep the code running after the laminar flow breaks down. The computational results are in good agreement with linear stability theory, secondary instability theory, and some experiments. The cost for a typical case with 162 x 34 x 34 grid is around 2 CRAY-YMP CPU hours for 10 T-S periods.
      

      
      An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU
      NASA Astrophysics Data System (ADS)
      Lyakh, Dmitry I.
         2015-04-01
         An efficient parallel tensor transpose algorithm is suggested for shared-memory computing units, namely, multicore CPU, Intel Xeon Phi, and NVidia GPU. The algorithm operates on dense tensors (multidimensional arrays) and is based on the optimization of cache utilization on x86 CPU and the use of shared memory on NVidia GPU. From the applied side, the ultimate goal is to minimize the overhead encountered in the transformation of tensor contractions into matrix multiplications in computer implementations of advanced methods of quantum many-body theory (e.g., in electronic structure theory and nuclear physics). A particular accent is made on higher-dimensional tensors that typically appear in the so-called multireference correlated methods of electronic structure theory. Depending on tensor dimensionality, the presented optimized algorithms can achieve an order of magnitude speedup on x86 CPUs and 2-3 times speedup on NVidia Tesla K20X GPU with respect to the naïve scattering algorithm (no memory access optimization). The tensor transpose routines developed in this work have been incorporated into a general-purpose tensor algebra library (TAL-SH).
      

      
      Short-term dopaminergic regulation of GABA release in dopamine deafferented caudate-putamen is not directly associated with glutamic acid decarboxylase gene expression.
      PubMed
      O'Connor, W T; Lindefors, N; Brené, S; Herrera-Marschitz, M; Persson, H; Ungerstedt, U
         1991-07-08
         In vivo microdialysis and in situ hybridization were combined to study dopaminergic regulation of gamma-amino butyric acid (GABA) neurons in rat caudate-putamen (CPu). Potassium-stimulated GABA release in CPu was elevated following a dopamine deafferentation. Local perfusion with exogenous dopamine (50 microM) for 3 h via the microdialysis probe attenuated the potassium-stimulated increase in extracellular GABA in CPu. Expression of glutamic acid decarboxylase (GAD) mRNA was also increased in the dopamine deafferented CPu. However, local perfusion with dopamine had no significant attenuating effect on the increased GAD mRNA expression. These findings indicate that dopaminergic regulation of GABA neurons in the dopamine deafferented CPu includes both a short-term effect at the level of GABA release independent of changes in GAD mRNA expression and a long-term modulation at the level of GAD gene expression.
      

      
      Multi-GPU implementation of a VMAT treatment plan optimization algorithm.
      PubMed
      Tian, Zhen; Peng, Fei; Folkerts, Michael; Tan, Jun; Jia, Xun; Jiang, Steve B
         2015-06-01
         Volumetric modulated arc therapy (VMAT) optimization is a computationally challenging problem due to its large data size, high degrees of freedom, and many hardware constraints. High-performance graphics processing units (GPUs) have been used to speed up the computations. However, GPU's relatively small memory size cannot handle cases with a large dose-deposition coefficient (DDC) matrix in cases of, e.g., those with a large target size, multiple targets, multiple arcs, and/or small beamlet size. The main purpose of this paper is to report an implementation of a column-generation-based VMAT algorithm, previously developed in the authors' group, on a multi-GPU platform to solve the memory limitation problem. While the column-generation-based VMAT algorithm has been previously developed, the GPU implementation details have not been reported. Hence, another purpose is to present detailed techniques employed for GPU implementation. The authors also would like to utilize this particular problem as an example problem to study the feasibility of using a multi-GPU platform to solve large-scale problems in medical physics. The column-generation approach generates VMAT apertures sequentially by solving a pricing problem (PP) and a master problem (MP) iteratively. In the authors' method, the sparse DDC matrix is first stored on a CPU in coordinate list format (COO). On the GPU side, this matrix is split into four submatrices according to beam angles, which are stored on four GPUs in compressed sparse row format. Computation of beamlet price, the first step in PP, is accomplished using multi-GPUs. A fast inter-GPU data transfer scheme is accomplished using peer-to-peer access. The remaining steps of PP and MP problems are implemented on CPU or a single GPU due to their modest problem scale and computational loads. Barzilai and Borwein algorithm with a subspace step scheme is adopted here to solve the MP problem. A head and neck (H&N) cancer case is then used to validate the authors' method. The authors also compare their multi-GPU implementation with three different single GPU implementation strategies, i.e., truncating DDC matrix (S1), repeatedly transferring DDC matrix between CPU and GPU (S2), and porting computations involving DDC matrix to CPU (S3), in terms of both plan quality and computational efficiency. Two more H&N patient cases and three prostate cases are used to demonstrate the advantages of the authors' method. The authors' multi-GPU implementation can finish the optimization process within ∼ 1 min for the H&N patient case. S1 leads to an inferior plan quality although its total time was 10 s shorter than the multi-GPU implementation due to the reduced matrix size. S2 and S3 yield the same plan quality as the multi-GPU implementation but take ∼4 and ∼6 min, respectively. High computational efficiency was consistently achieved for the other five patient cases tested, with VMAT plans of clinically acceptable quality obtained within 23-46 s. Conversely, to obtain clinically comparable or acceptable plans for all six of these VMAT cases that the authors have tested in this paper, the optimization time needed in a commercial TPS system on CPU was found to be in an order of several minutes. The results demonstrate that the multi-GPU implementation of the authors' column-generation-based VMAT optimization can handle the large-scale VMAT optimization problem efficiently without sacrificing plan quality. The authors' study may serve as an example to shed some light on other large-scale medical physics problems that require multi-GPU techniques.
      

        
       
          

«

9
      10
      11
   12
      13
      »

          
        

     

   

   
       
            
              
          

«

10
      11
      12
   13
      14
      »

          
        

           
           
             
               
      
      The DISTO data acquisition system at SATURNE
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Balestra, F.; Bedfer, Y.; Bertini, R.
         1998-06-01
         The DISTO collaboration has built a large-acceptance magnetic spectrometer designed to provide broad kinematic coverage of multiparticle final states produced in pp scattering. The spectrometer has been installed in the polarized proton beam of the Saturne accelerator in Saclay to study polarization observables in the {rvec p}p {yields} pK{sup +}{rvec Y} (Y = {Lambda}, {Sigma}{sup 0} or Y{sup *}) reaction and vector meson production ({psi}, {omega} and {rho}) in pp collisions. The data acquisition system is based on a VME 68030 CPU running the OS/9 operating system, housed in a single VME crate together with the CAMAC interface, the triplemore » port ECL memories, and four RISC R3000 CPU. The digitization of signals from the detectors is made by PCOS III and FERA front-end electronics. Data of several events belonging to a single Saturne extraction are stored in VME triple-port ECL memories using a hardwired fast sequencer. The buffer, optionally filtered by the RISC R3000 CPU, is recorded on a DLT cassette by DAQ CPU using the on-board SCSI interface during the acceleration cycle. Two UNIX workstations are connected to the VME CPUs through a fast parallel bus and the Local Area Network. They analyze a subset of events for on-line monitoring. The data acquisition system is able to read and record 3,500 ev/burst in the present configuration with a dead time of 15%.« less
      

      
      SEMICONDUCTOR INTEGRATED CIRCUITS: A quasi-3-dimensional simulation method for a high-voltage level-shifting circuit structure
      NASA Astrophysics Data System (ADS)
      Jizhi, Liu; Xingbi, Chen
         2009-12-01
         A new quasi-three-dimensional (quasi-3D) numeric simulation method for a high-voltage level-shifting circuit structure is proposed. The performances of the 3D structure are analyzed by combining some 2D device structures; the 2D devices are in two planes perpendicular to each other and to the surface of the semiconductor. In comparison with Davinci, the full 3D device simulation tool, the quasi-3D simulation method can give results for the potential and current distribution of the 3D high-voltage level-shifting circuit structure with appropriate accuracy and the total CPU time for simulation is significantly reduced. The quasi-3D simulation technique can be used in many cases with advantages such as saving computing time, making no demands on the high-end computer terminals, and being easy to operate.
      

      
      Accelerating Biomedical Signal Processing Using GPU: A Case Study of Snore Sound Feature Extraction.
      PubMed
      Guo, Jian; Qian, Kun; Zhang, Gongxuan; Xu, Huijie; Schuller, Björn
         2017-12-01
         The advent of 'Big Data' and 'Deep Learning' offers both, a great challenge and a huge opportunity for personalised health-care. In machine learning-based biomedical data analysis, feature extraction is a key step for 'feeding' the subsequent classifiers. With increasing numbers of biomedical data, extracting features from these 'big' data is an intensive and time-consuming task. In this case study, we employ a Graphics Processing Unit (GPU) via Python to extract features from a large corpus of snore sound data. Those features can subsequently be imported into many well-known deep learning training frameworks without any format processing. The snore sound data were collected from several hospitals (20 subjects, with 770-990 MB per subject - in total 17.20 GB). Experimental results show that our GPU-based processing significantly speeds up the feature extraction phase, by up to seven times, as compared to the previous CPU system.
      

      
      A new nonlinear conjugate gradient coefficient under strong Wolfe-Powell line search
      NASA Astrophysics Data System (ADS)
      Mohamed, Nur Syarafina; Mamat, Mustafa; Rivaie, Mohd
         2017-08-01
         A nonlinear conjugate gradient method (CG) plays an important role in solving a large-scale unconstrained optimization problem. This method is widely used due to its simplicity. The method is known to possess sufficient descend condition and global convergence properties. In this paper, a new nonlinear of CG coefficient βk is presented by employing the Strong Wolfe-Powell inexact line search. The new βk performance is tested based on number of iterations and central processing unit (CPU) time by using MATLAB software with Intel Core i7-3470 CPU processor. Numerical experimental results show that the new βk converge rapidly compared to other classical CG method.
      

      
      Hypermatrix scheme for finite element systems on CDC STAR-100 computer
      NASA Technical Reports Server (NTRS)
      Noor, A. K.; Voigt, S. J.
         1975-01-01
         A study is made of the adaptation of the hypermatrix (block matrix) scheme for solving large systems of finite element equations to the CDC STAR-100 computer. Discussion is focused on the organization of the hypermatrix computation using Cholesky decomposition and the mode of storage of the different submatrices to take advantage of the STAR pipeline (streaming) capability. Consideration is also given to the associated data handling problems and the means of balancing the I/Q and cpu times in the solution process. Numerical examples are presented showing anticipated gain in cpu speed over the CDC 6600 to be obtained by using the proposed algorithms on the STAR computer.
      

      
      An efficient implementation of semi-numerical computation of the Hartree-Fock exchange on the Intel Phi processor
      NASA Astrophysics Data System (ADS)
      Liu, Fenglai; Kong, Jing
         2018-07-01
         Unique technical challenges and their solutions for implementing semi-numerical Hartree-Fock exchange on the Phil Processor are discussed, especially concerning the single- instruction-multiple-data type of processing and small cache size. Benchmark calculations on a series of buckyball molecules with various Gaussian basis sets on a Phi processor and a six-core CPU show that the Phi processor provides as much as 12 times of speedup with large basis sets compared with the conventional four-center electron repulsion integration approach performed on the CPU. The accuracy of the semi-numerical scheme is also evaluated and found to be comparable to that of the resolution-of-identity approach.
      

      
      Real-Time Agent-Based Modeling Simulation with in-situ Visualization of Complex Biological Systems: A Case Study on Vocal Fold Inflammation and Healing.
      PubMed
      Seekhao, Nuttiiya; Shung, Caroline; JaJa, Joseph; Mongeau, Luc; Li-Jessen, Nicole Y K
         2016-05-01
         We present an efficient and scalable scheme for implementing agent-based modeling (ABM) simulation with In Situ visualization of large complex systems on heterogeneous computing platforms. The scheme is designed to make optimal use of the resources available on a heterogeneous platform consisting of a multicore CPU and a GPU, resulting in minimal to no resource idle time. Furthermore, the scheme was implemented under a client-server paradigm that enables remote users to visualize and analyze simulation data as it is being generated at each time step of the model. Performance of a simulation case study of vocal fold inflammation and wound healing with 3.8 million agents shows 35× and 7× speedup in execution time over single-core and multi-core CPU respectively. Each iteration of the model took less than 200 ms to simulate, visualize and send the results to the client. This enables users to monitor the simulation in real-time and modify its course as needed.
      

      
      Fast in-memory elastic full-waveform inversion using consumer-grade GPUs
      NASA Astrophysics Data System (ADS)
      Sivertsen Bergslid, Tore; Birger Raknes, Espen; Arntsen, Børge
         2017-04-01
         Full-waveform inversion (FWI) is a technique to estimate subsurface properties by using the recorded waveform produced by a seismic source and applying inverse theory. This is done through an iterative optimization procedure, where each iteration requires solving the wave equation many times, then trying to minimize the difference between the modeled and the measured seismic data. Having to model many of these seismic sources per iteration means that this is a highly computationally demanding procedure, which usually involves writing a lot of data to disk. We have written code that does forward modeling and inversion entirely in memory. A typical HPC cluster has many more CPUs than GPUs. Since FWI involves modeling many seismic sources per iteration, the obvious approach is to parallelize the code on a source-by-source basis, where each core of the CPU performs one modeling, and do all modelings simultaneously. With this approach, the GPU is already at a major disadvantage in pure numbers. Fortunately, GPUs can more than make up for this hardware disadvantage by performing each modeling much faster than a CPU. Another benefit of parallelizing each individual modeling is that it lets each modeling use a lot more RAM. If one node has 128 GB of RAM and 20 CPU cores, each modeling can use only 6.4 GB RAM if one is running the node at full capacity with source-by-source parallelization on the CPU. A parallelized per-source code using GPUs can use 64 GB RAM per modeling. Whenever a modeling uses more RAM than is available and has to start using regular disk space the runtime increases dramatically, due to slow file I/O. The extremely high computational speed of the GPUs combined with the large amount of RAM available for each modeling lets us do high frequency FWI for fairly large models very quickly. For a single modeling, our GPU code outperforms the single-threaded CPU-code by a factor of about 75. Successful inversions have been run on data with frequencies up to 40 Hz for a model of 2001 by 600 grid points with 5 m grid spacing and 5000 time steps, in less than 2.5 minutes per source. In practice, using 15 nodes (30 GPUs) to model 101 sources, each iteration took approximately 9 minutes. For reference, the same inversion run with our CPU code uses two hours per iteration. This was done using only a very simple wavefield interpolation technique, saving every second timestep. Using a more sophisticated checkpointing or wavefield reconstruction method would allow us to increase this model size significantly. Our results show that ordinary gaming GPUs are a viable alternative to the expensive professional GPUs often used today, when performing large scale modeling and inversion in geophysics.
      

      
      Two-dimensional Euler and Navier-Stokes Time accurate simulations of fan rotor flows
      NASA Technical Reports Server (NTRS)
      Boretti, A. A.
         1990-01-01
         Two numerical methods are presented which describe the unsteady flow field in the blade-to-blade plane of an axial fan rotor. These methods solve the compressible, time-dependent, Euler and the compressible, turbulent, time-dependent, Navier-Stokes conservation equations for mass, momentum, and energy. The Navier-Stokes equations are written in Favre-averaged form and are closed with an approximate two-equation turbulence model with low Reynolds number and compressibility effects included. The unsteady aerodynamic component is obtained by superposing inflow or outflow unsteadiness to the steady conditions through time-dependent boundary conditions. The integration in space is performed by using a finite volume scheme, and the integration in time is performed by using k-stage Runge-Kutta schemes, k = 2,5. The numerical integration algorithm allows the reduction of the computational cost of an unsteady simulation involving high frequency disturbances in both CPU time and memory requirements. Less than 200 sec of CPU time are required to advance the Euler equations in a computational grid made up of about 2000 grid during 10,000 time steps on a CRAY Y-MP computer, with a required memory of less than 0.3 megawords.
      

      
      Developing infrared array controller with software real time operating system
      NASA Astrophysics Data System (ADS)
      Sako, Shigeyuki; Miyata, Takashi; Nakamura, Tomohiko; Motohara, Kentaro; Uchimoto, Yuka Katsuno; Onaka, Takashi; Kataza, Hirokazu
         2008-07-01
         Real-time capabilities are required for a controller of a large format array to reduce a dead-time attributed by readout and data transfer. The real-time processing has been achieved by dedicated processors including DSP, CPLD, and FPGA devices. However, the dedicated processors have problems with memory resources, inflexibility, and high cost. Meanwhile, a recent PC has sufficient resources of CPUs and memories to control the infrared array and to process a large amount of frame data in real-time. In this study, we have developed an infrared array controller with a software real-time operating system (RTOS) instead of the dedicated processors. A Linux PC equipped with a RTAI extension and a dual-core CPU is used as a main computer, and one of the CPU cores is allocated to the real-time processing. A digital I/O board with DMA functions is used for an I/O interface. The signal-processing cores are integrated in the OS kernel as a real-time driver module, which is composed of two virtual devices of the clock processor and the frame processor tasks. The array controller with the RTOS realizes complicated operations easily, flexibly, and at a low cost.
      

      
      GPU accelerated generation of digitally reconstructed radiographs for 2-D/3-D image registration.
      PubMed
      Dorgham, Osama M; Laycock, Stephen D; Fisher, Mark H
         2012-09-01
         Recent advances in programming languages for graphics processing units (GPUs) provide developers with a convenient way of implementing applications which can be executed on the CPU and GPU interchangeably. GPUs are becoming relatively cheap, powerful, and widely available hardware components, which can be used to perform intensive calculations. The last decade of hardware performance developments shows that GPU-based computation is progressing significantly faster than CPU-based computation, particularly if one considers the execution of highly parallelisable algorithms. Future predictions illustrate that this trend is likely to continue. In this paper, we introduce a way of accelerating 2-D/3-D image registration by developing a hybrid system which executes on the CPU and utilizes the GPU for parallelizing the generation of digitally reconstructed radiographs (DRRs). Based on the advancements of the GPU over the CPU, it is timely to exploit the benefits of many-core GPU technology by developing algorithms for DRR generation. Although some previous work has investigated the rendering of DRRs using the GPU, this paper investigates approximations which reduce the computational overhead while still maintaining a quality consistent with that needed for 2-D/3-D registration with sufficient accuracy to be clinically acceptable in certain applications of radiation oncology. Furthermore, by comparing implementations of 2-D/3-D registration on the CPU and GPU, we investigate current performance and propose an optimal framework for PC implementations addressing the rigid registration problem. Using this framework, we are able to render DRR images from a 256×256×133 CT volume in ~24 ms using an NVidia GeForce 8800 GTX and in ~2 ms using NVidia GeForce GTX 580. In addition to applications requiring fast automatic patient setup, these levels of performance suggest image-guided radiation therapy at video frame rates is technically feasible using relatively low cost PC architecture.
      

      
      Reduced order model based on principal component analysis for process simulation and optimization
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Lang, Y.; Malacina, A.; Biegler, L.
         2009-01-01
         It is well-known that distributed parameter computational fluid dynamics (CFD) models provide more accurate results than conventional, lumped-parameter unit operation models used in process simulation. Consequently, the use of CFD models in process/equipment co-simulation offers the potential to optimize overall plant performance with respect to complex thermal and fluid flow phenomena. Because solving CFD models is time-consuming compared to the overall process simulation, we consider the development of fast reduced order models (ROMs) based on CFD results to closely approximate the high-fidelity equipment models in the co-simulation. By considering process equipment items with complicated geometries and detailed thermodynamic property models,more » this study proposes a strategy to develop ROMs based on principal component analysis (PCA). Taking advantage of commercial process simulation and CFD software (for example, Aspen Plus and FLUENT), we are able to develop systematic CFD-based ROMs for equipment models in an efficient manner. In particular, we show that the validity of the ROM is more robust within well-sampled input domain and the CPU time is significantly reduced. Typically, it takes at most several CPU seconds to evaluate the ROM compared to several CPU hours or more to solve the CFD model. Two case studies, involving two power plant equipment examples, are described and demonstrate the benefits of using our proposed ROM methodology for process simulation and optimization.« less
      

      
      A parallel algorithm for the two-dimensional time fractional diffusion equation with implicit difference method.
      PubMed
      Gong, Chunye; Bao, Weimin; Tang, Guojian; Jiang, Yuewen; Liu, Jie
         2014-01-01
         It is very time consuming to solve fractional differential equations. The computational complexity of two-dimensional fractional differential equation (2D-TFDE) with iterative implicit finite difference method is O(M(x)M(y)N(2)). In this paper, we present a parallel algorithm for 2D-TFDE and give an in-depth discussion about this algorithm. A task distribution model and data layout with virtual boundary are designed for this parallel algorithm. The experimental results show that the parallel algorithm compares well with the exact solution. The parallel algorithm on single Intel Xeon X5540 CPU runs 3.16-4.17 times faster than the serial algorithm on single CPU core. The parallel efficiency of 81 processes is up to 88.24% compared with 9 processes on a distributed memory cluster system. We do think that the parallel computing technology will become a very basic method for the computational intensive fractional applications in the near future.
      

      
      Real-time autocorrelator for fluorescence correlation spectroscopy based on graphical-processor-unit architecture: method, implementation, and comparative studies
      NASA Astrophysics Data System (ADS)
      Laracuente, Nicholas; Grossman, Carl
         2013-03-01
         We developed an algorithm and software to calculate autocorrelation functions from real-time photon-counting data using the fast, parallel capabilities of graphical processor units (GPUs). Recent developments in hardware and software have allowed for general purpose computing with inexpensive GPU hardware. These devices are more suited for emulating hardware autocorrelators than traditional CPU-based software applications by emphasizing parallel throughput over sequential speed. Incoming data are binned in a standard multi-tau scheme with configurable points-per-bin size and are mapped into a GPU memory pattern to reduce time-expensive memory access. Applications include dynamic light scattering (DLS) and fluorescence correlation spectroscopy (FCS) experiments. We ran the software on a 64-core graphics pci card in a 3.2 GHz Intel i5 CPU based computer running Linux. FCS measurements were made on Alexa-546 and Texas Red dyes in a standard buffer (PBS). Software correlations were compared to hardware correlator measurements on the same signals. Supported by HHMI and Swarthmore College
      

      
      SU-E-J-60: Efficient Monte Carlo Dose Calculation On CPU-GPU Heterogeneous Systems
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Xiao, K; Chen, D. Z; Hu, X. S
         
         Purpose: It is well-known that the performance of GPU-based Monte Carlo dose calculation implementations is bounded by memory bandwidth. One major cause of this bottleneck is the random memory writing patterns in dose deposition, which leads to several memory efficiency issues on GPU such as un-coalesced writing and atomic operations. We propose a new method to alleviate such issues on CPU-GPU heterogeneous systems, which achieves overall performance improvement for Monte Carlo dose calculation. Methods: Dose deposition is to accumulate dose into the voxels of a dose volume along the trajectories of radiation rays. Our idea is to partition this proceduremore » into the following three steps, which are fine-tuned for CPU or GPU: (1) each GPU thread writes dose results with location information to a buffer on GPU memory, which achieves fully-coalesced and atomic-free memory transactions; (2) the dose results in the buffer are transferred to CPU memory; (3) the dose volume is constructed from the dose buffer on CPU. We organize the processing of all radiation rays into streams. Since the steps within a stream use different hardware resources (i.e., GPU, DMA, CPU), we can overlap the execution of these steps for different streams by pipelining. Results: We evaluated our method using a Monte Carlo Convolution Superposition (MCCS) program and tested our implementation for various clinical cases on a heterogeneous system containing an Intel i7 quad-core CPU and an NVIDIA TITAN GPU. Comparing with a straightforward MCCS implementation on the same system (using both CPU and GPU for radiation ray tracing), our method gained 2-5X speedup without losing dose calculation accuracy. Conclusion: The results show that our new method improves the effective memory bandwidth and overall performance for MCCS on the CPU-GPU systems. Our proposed method can also be applied to accelerate other Monte Carlo dose calculation approaches. This research was supported in part by NSF under Grants CCF-1217906, and also in part by a research contract from the Sandia National Laboratories.« less
      

      
      Exploring compression techniques for ROOT IO
      NASA Astrophysics Data System (ADS)
      Zhang, Z.; Bockelman, B.
         2017-10-01
         ROOT provides an flexible format used throughout the HEP community. The number of use cases - from an archival data format to end-stage analysis - has required a number of tradeoffs to be exposed to the user. For example, a high “compression level” in the traditional DEFLATE algorithm will result in a smaller file (saving disk space) at the cost of slower decompression (costing CPU time when read). At the scale of the LHC experiment, poor design choices can result in terabytes of wasted space or wasted CPU time. We explore and attempt to quantify some of these tradeoffs. Specifically, we explore: the use of alternate compressing algorithms to optimize for read performance; an alternate method of compressing individual events to allow efficient random access; and a new approach to whole-file compression. Quantitative results are given, as well as guidance on how to make compression decisions for different use cases.
      

      
      Simulating electron wave dynamics in graphene superlattices exploiting parallel processing advantages
      NASA Astrophysics Data System (ADS)
      Rodrigues, Manuel J.; Fernandes, David E.; Silveirinha, Mário G.; Falcão, Gabriel
         2018-01-01
         This work introduces a parallel computing framework to characterize the propagation of electron waves in graphene-based nanostructures. The electron wave dynamics is modeled using both "microscopic" and effective medium formalisms and the numerical solution of the two-dimensional massless Dirac equation is determined using a Finite-Difference Time-Domain scheme. The propagation of electron waves in graphene superlattices with localized scattering centers is studied, and the role of the symmetry of the microscopic potential in the electron velocity is discussed. The computational methodologies target the parallel capabilities of heterogeneous multi-core CPU and multi-GPU environments and are built with the OpenCL parallel programming framework which provides a portable, vendor agnostic and high throughput-performance solution. The proposed heterogeneous multi-GPU implementation achieves speedup ratios up to 75x when compared to multi-thread and multi-core CPU execution, reducing simulation times from several hours to a couple of minutes.
      

      
      Fast data reconstructed method of Fourier transform imaging spectrometer based on multi-core CPU
      NASA Astrophysics Data System (ADS)
      Yu, Chunchao; Du, Debiao; Xia, Zongze; Song, Li; Zheng, Weijian; Yan, Min; Lei, Zhenggang
         2017-10-01
         Imaging spectrometer can gain two-dimensional space image and one-dimensional spectrum at the same time, which shows high utility in color and spectral measurements, the true color image synthesis, military reconnaissance and so on. In order to realize the fast reconstructed processing of the Fourier transform imaging spectrometer data, the paper designed the optimization reconstructed algorithm with OpenMP parallel calculating technology, which was further used for the optimization process for the HyperSpectral Imager of `HJ-1' Chinese satellite. The results show that the method based on multi-core parallel computing technology can control the multi-core CPU hardware resources competently and significantly enhance the calculation of the spectrum reconstruction processing efficiency. If the technology is applied to more cores workstation in parallel computing, it will be possible to complete Fourier transform imaging spectrometer real-time data processing with a single computer.
      

      
      A proximity algorithm accelerated by Gauss-Seidel iterations for L1/TV denoising models
      NASA Astrophysics Data System (ADS)
      Li, Qia; Micchelli, Charles A.; Shen, Lixin; Xu, Yuesheng
         2012-09-01
         Our goal in this paper is to improve the computational performance of the proximity algorithms for the L1/TV denoising model. This leads us to a new characterization of all solutions to the L1/TV model via fixed-point equations expressed in terms of the proximity operators. Based upon this observation we develop an algorithm for solving the model and establish its convergence. Furthermore, we demonstrate that the proposed algorithm can be accelerated through the use of the componentwise Gauss-Seidel iteration so that the CPU time consumed is significantly reduced. Numerical experiments using the proposed algorithm for impulsive noise removal are included, with a comparison to three recently developed algorithms. The numerical results show that while the proposed algorithm enjoys a high quality of the restored images, as the other three known algorithms do, it performs significantly better in terms of computational efficiency measured in the CPU time consumed.
      

      
      Classification of hyperspectral imagery using MapReduce on a NVIDIA graphics processing unit (Conference Presentation)
      NASA Astrophysics Data System (ADS)
      Ramirez, Andres; Rahnemoonfar, Maryam
         2017-04-01
         A hyperspectral image provides multidimensional figure rich in data consisting of hundreds of spectral dimensions. Analyzing the spectral and spatial information of such image with linear and non-linear algorithms will result in high computational time. In order to overcome this problem, this research presents a system using a MapReduce-Graphics Processing Unit (GPU) model that can help analyzing a hyperspectral image through the usage of parallel hardware and a parallel programming model, which will be simpler to handle compared to other low-level parallel programming models. Additionally, Hadoop was used as an open-source version of the MapReduce parallel programming model. This research compared classification accuracy results and timing results between the Hadoop and GPU system and tested it against the following test cases: the CPU and GPU test case, a CPU test case and a test case where no dimensional reduction was applied.
      

        
       
          

«

10
      11
      12
   13
      14
      »

          
        

     

   

   
       
            
              
          

«

11
      12
      13
   14
      15
      »

          
        

           
           
             
               
      
      Vectorization of a particle code used in the simulation of rarefied hypersonic flow
      NASA Technical Reports Server (NTRS)
      Baganoff, D.
         1990-01-01
         A limitation of the direct simulation Monte Carlo (DSMC) method is that it does not allow efficient use of vector architectures that predominate in current supercomputers. Consequently, the problems that can be handled are limited to those of one- and two-dimensional flows. This work focuses on a reformulation of the DSMC method with the objective of designing a procedure that is optimized to the vector architectures found on machines such as the Cray-2. In addition, it focuses on finding a better balance between algorithmic complexity and the total number of particles employed in a simulation so that the overall performance of a particle simulation scheme can be greatly improved. Simulations of the flow about a 3D blunt body are performed with 10 to the 7th particles and 4 x 10 to the 5th mesh cells. Good statistics are obtained with time averaging over 800 time steps using 4.5 h of Cray-2 single-processor CPU time.
      

      
      Accelerated Monte Carlo Simulation on the Chemical Stage in Water Radiolysis using GPU
      PubMed Central
      Tian, Zhen; Jiang, Steve B.; Jia, Xun
         2018-01-01
         The accurate simulation of water radiolysis is an important step to understand the mechanisms of radiobiology and quantitatively test some hypotheses regarding radiobiological effects. However, the simulation of water radiolysis is highly time consuming, taking hours or even days to be completed by a conventional CPU processor. This time limitation hinders cell-level simulations for a number of research studies. We recently initiated efforts to develop gMicroMC, a GPU-based fast microscopic MC simulation package for water radiolysis. The first step of this project focused on accelerating the simulation of the chemical stage, the most time consuming stage in the entire water radiolysis process. A GPU-friendly parallelization strategy was designed to address the highly correlated many-body simulation problem caused by the mutual competitive chemical reactions between the radiolytic molecules. Two cases were tested, using a 750 keV electron and a 5 MeV proton incident in pure water, respectively. The time-dependent yields of all the radiolytic species during the chemical stage were used to evaluate the accuracy of the simulation. The relative differences between our simulation and the Geant4-DNA simulation were on average 5.3% and 4.4% for the two cases. Our package, executed on an Nvidia Titan black GPU card, successfully completed the chemical stage simulation of the two cases within 599.2 s and 489.0 s. As compared with Geant4-DNA that was executed on an Intel i7-5500U CPU processor and needed 28.6 h and 26.8 h for the two cases using a single CPU core, our package achieved a speed-up factor of 171.1-197.2. PMID:28323637
      

      
      Quasi-elastic light scattering: Signal storage, correlation, and spectrum analysis under control of an 8-bit microprocessor
      NASA Astrophysics Data System (ADS)
      Glatter, Otto; Fuchs, Heribert; Jorde, Christian; Eigner, Wolf-Dieter
         1987-03-01
         The microprocessor of an 8-bit PC system is used as a central control unit for the acquisition and evaluation of data from quasi-elastic light scattering experiments. Data are sampled with a width of 8 bits under control of the CPU. This limits the minimum sample time to 20 μs. Shorter sample times would need a direct memory access channel. The 8-bit CPU can address a 64-kbyte RAM without additional paging. Up to 49 000 sample points can be measured without interruption. After storage, a correlation function or a power spectrum can be calculated from such a primary data set. Furthermore access is provided to the primary data for stability control, statistical tests, and for comparison of different evaluation methods for the same experiment. A detailed analysis of the signal (histogram) and of the effect of overflows is possible and shows that the number of pulses but not the number of overflows determines the error in the result. The correlation function can be computed with reasonable accuracy from data with a mean pulse rate greater than one, the power spectrum needs a three times higher pulse rate for convergence. The statistical accuracy of the results from 49 000 sample points is of the order of a few percent. Additional averages are necessary to improve their quality. The hardware extensions for the PC system are inexpensive. The main disadvantage of the present system is the high minimum sampling time of 20 μs and the fact that the correlogram or the power spectrum cannot be computed on-line as it can be done with hardware correlators or spectrum analyzers. These shortcomings and the storage size restrictions can be removed with a faster 16/32-bit CPU.
      

      
      Accelerated Monte Carlo simulation on the chemical stage in water radiolysis using GPU
      NASA Astrophysics Data System (ADS)
      Tian, Zhen; Jiang, Steve B.; Jia, Xun
         2017-04-01
         The accurate simulation of water radiolysis is an important step to understand the mechanisms of radiobiology and quantitatively test some hypotheses regarding radiobiological effects. However, the simulation of water radiolysis is highly time consuming, taking hours or even days to be completed by a conventional CPU processor. This time limitation hinders cell-level simulations for a number of research studies. We recently initiated efforts to develop gMicroMC, a GPU-based fast microscopic MC simulation package for water radiolysis. The first step of this project focused on accelerating the simulation of the chemical stage, the most time consuming stage in the entire water radiolysis process. A GPU-friendly parallelization strategy was designed to address the highly correlated many-body simulation problem caused by the mutual competitive chemical reactions between the radiolytic molecules. Two cases were tested, using a 750 keV electron and a 5 MeV proton incident in pure water, respectively. The time-dependent yields of all the radiolytic species during the chemical stage were used to evaluate the accuracy of the simulation. The relative differences between our simulation and the Geant4-DNA simulation were on average 5.3% and 4.4% for the two cases. Our package, executed on an Nvidia Titan black GPU card, successfully completed the chemical stage simulation of the two cases within 599.2 s and 489.0 s. As compared with Geant4-DNA that was executed on an Intel i7-5500U CPU processor and needed 28.6 h and 26.8 h for the two cases using a single CPU core, our package achieved a speed-up factor of 171.1-197.2.
      

      
      Accelerated Monte Carlo simulation on the chemical stage in water radiolysis using GPU.
      PubMed
      Tian, Zhen; Jiang, Steve B; Jia, Xun
         2017-04-21
         The accurate simulation of water radiolysis is an important step to understand the mechanisms of radiobiology and quantitatively test some hypotheses regarding radiobiological effects. However, the simulation of water radiolysis is highly time consuming, taking hours or even days to be completed by a conventional CPU processor. This time limitation hinders cell-level simulations for a number of research studies. We recently initiated efforts to develop gMicroMC, a GPU-based fast microscopic MC simulation package for water radiolysis. The first step of this project focused on accelerating the simulation of the chemical stage, the most time consuming stage in the entire water radiolysis process. A GPU-friendly parallelization strategy was designed to address the highly correlated many-body simulation problem caused by the mutual competitive chemical reactions between the radiolytic molecules. Two cases were tested, using a 750 keV electron and a 5 MeV proton incident in pure water, respectively. The time-dependent yields of all the radiolytic species during the chemical stage were used to evaluate the accuracy of the simulation. The relative differences between our simulation and the Geant4-DNA simulation were on average 5.3% and 4.4% for the two cases. Our package, executed on an Nvidia Titan black GPU card, successfully completed the chemical stage simulation of the two cases within 599.2 s and 489.0 s. As compared with Geant4-DNA that was executed on an Intel i7-5500U CPU processor and needed 28.6 h and 26.8 h for the two cases using a single CPU core, our package achieved a speed-up factor of 171.1-197.2.
      

      
      hybrid\\scriptsize{{MANTIS}}: a CPU-GPU Monte Carlo method for modeling indirect x-ray detectors with columnar scintillators
      NASA Astrophysics Data System (ADS)
      Sharma, Diksha; Badal, Andreu; Badano, Aldo
         2012-04-01
         The computational modeling of medical imaging systems often requires obtaining a large number of simulated images with low statistical uncertainty which translates into prohibitive computing times. We describe a novel hybrid approach for Monte Carlo simulations that maximizes utilization of CPUs and GPUs in modern workstations. We apply the method to the modeling of indirect x-ray detectors using a new and improved version of the code \\scriptsize{{MANTIS}}, an open source software tool used for the Monte Carlo simulations of indirect x-ray imagers. We first describe a GPU implementation of the physics and geometry models in fast\\scriptsize{{DETECT}}2 (the optical transport model) and a serial CPU version of the same code. We discuss its new features like on-the-fly column geometry and columnar crosstalk in relation to the \\scriptsize{{MANTIS}} code, and point out areas where our model provides more flexibility for the modeling of realistic columnar structures in large area detectors. Second, we modify \\scriptsize{{PENELOPE}} (the open source software package that handles the x-ray and electron transport in \\scriptsize{{MANTIS}}) to allow direct output of location and energy deposited during x-ray and electron interactions occurring within the scintillator. This information is then handled by optical transport routines in fast\\scriptsize{{DETECT}}2. A load balancer dynamically allocates optical transport showers to the GPU and CPU computing cores. Our hybrid\\scriptsize{{MANTIS}} approach achieves a significant speed-up factor of 627 when compared to \\scriptsize{{MANTIS}} and of 35 when compared to the same code running only in a CPU instead of a GPU. Using hybrid\\scriptsize{{MANTIS}}, we successfully hide hours of optical transport time by running it in parallel with the x-ray and electron transport, thus shifting the computational bottleneck from optical to x-ray transport. The new code requires much less memory than \\scriptsize{{MANTIS}} and, as a result, allows us to efficiently simulate large area detectors.
      

      
      Synthesis and Characterization of Biodegradable Polyurethane for Hypopharyngeal Tissue Engineering
      PubMed Central
      Shen, Zhisen; Lu, Dakai; Li, Qun; Zhang, Zongyong
         2015-01-01
         Biodegradable crosslinked polyurethane (cPU) was synthesized using polyethylene glycol (PEG), L-lactide (L-LA), and hexamethylene diisocyanate (HDI), with iron acetylacetonate (Fe(acac)3) as the catalyst and PEG as the extender. Chemical components of the obtained polymers were characterized by FTIR spectroscopy, 1H NMR spectra, and Gel Permeation Chromatography (GPC). The thermodynamic properties, mechanical behaviors, surface hydrophilicity, degradability, and cytotoxicity were tested via differential scanning calorimetry (DSC), tensile tests, contact angle measurements, and cell culture. The results show that the synthesized cPU possessed good flexibility with quite low glass transition temperature (T g, −22°C) and good wettability. Water uptake measured as high as 229.7 ± 18.7%. These properties make cPU a good candidate material for engineering soft tissues such as the hypopharynx. In vitro and in vivo tests showed that cPU has the ability to support the growth of human hypopharyngeal fibroblasts and angiogenesis was observed around cPU after it was implanted subcutaneously in SD rats. PMID:25839041
      

      
      Is our medical school socially accountable? The case of Faculty of Medicine, Suez Canal University.
      PubMed
      Hosny, Somaya; Ghaly, Mona; Boelen, Charles
         2015-04-01
         Faculty of Medicine, Suez Canal University (FOM/SCU) was established as community oriented school with innovative educational strategies. Social accountability represents the commitment of the medical school towards the community it serves. To assess FOM/SCU compliance to social accountability using the "Conceptualization, Production, Usability" (CPU) model. FOM/SCU's practice was reviewed against CPU model parameters. CPU consists of three domains, 11 sections and 31 parameters. Data were collected through unstructured interviews with the main stakeholders and documents review since 2005 to 2013. FOM/SCU shows general compliance to the three domains of the CPU. Very good compliance was shown to the "P" domain of the model through FOM/SCU's innovative educational system, students and faculty members. More work is needed on the "C" and "U" domains. FOM/SCU complies with many parameters of the CPU model; however, more work should be accomplished to comply with some items in the C and U domains so that FOM/SCU can be recognized as a proactive socially accountable school.
      

      
      Synthesis and characterization of biodegradable polyurethane for hypopharyngeal tissue engineering.
      PubMed
      Shen, Zhisen; Lu, Dakai; Li, Qun; Zhang, Zongyong; Zhu, Yabin
         2015-01-01
         Biodegradable crosslinked polyurethane (cPU) was synthesized using polyethylene glycol (PEG), L-lactide (L-LA), and hexamethylene diisocyanate (HDI), with iron acetylacetonate (Fe(acac)3) as the catalyst and PEG as the extender. Chemical components of the obtained polymers were characterized by FTIR spectroscopy, (1)H NMR spectra, and Gel Permeation Chromatography (GPC). The thermodynamic properties, mechanical behaviors, surface hydrophilicity, degradability, and cytotoxicity were tested via differential scanning calorimetry (DSC), tensile tests, contact angle measurements, and cell culture. The results show that the synthesized cPU possessed good flexibility with quite low glass transition temperature (T g , -22°C) and good wettability. Water uptake measured as high as 229.7 ± 18.7%. These properties make cPU a good candidate material for engineering soft tissues such as the hypopharynx. In vitro and in vivo tests showed that cPU has the ability to support the growth of human hypopharyngeal fibroblasts and angiogenesis was observed around cPU after it was implanted subcutaneously in SD rats.
      

      
      FastGCN: A GPU Accelerated Tool for Fast Gene Co-Expression Networks
      PubMed Central
      Liang, Meimei; Zhang, Futao; Jin, Gulei; Zhu, Jun
         2015-01-01
         Gene co-expression networks comprise one type of valuable biological networks. Many methods and tools have been published to construct gene co-expression networks; however, most of these tools and methods are inconvenient and time consuming for large datasets. We have developed a user-friendly, accelerated and optimized tool for constructing gene co-expression networks that can fully harness the parallel nature of GPU (Graphic Processing Unit) architectures. Genetic entropies were exploited to filter out genes with no or small expression changes in the raw data preprocessing step. Pearson correlation coefficients were then calculated. After that, we normalized these coefficients and employed the False Discovery Rate to control the multiple tests. At last, modules identification was conducted to construct the co-expression networks. All of these calculations were implemented on a GPU. We also compressed the coefficient matrix to save space. We compared the performance of the GPU implementation with those of multi-core CPU implementations with 16 CPU threads, single-thread C/C++ implementation and single-thread R implementation. Our results show that GPU implementation largely outperforms single-thread C/C++ implementation and single-thread R implementation, and GPU implementation outperforms multi-core CPU implementation when the number of genes increases. With the test dataset containing 16,000 genes and 590 individuals, we can achieve greater than 63 times the speed using a GPU implementation compared with a single-thread R implementation when 50 percent of genes were filtered out and about 80 times the speed when no genes were filtered out. PMID:25602758
      

      
      FastGCN: a GPU accelerated tool for fast gene co-expression networks.
      PubMed
      Liang, Meimei; Zhang, Futao; Jin, Gulei; Zhu, Jun
         2015-01-01
         Gene co-expression networks comprise one type of valuable biological networks. Many methods and tools have been published to construct gene co-expression networks; however, most of these tools and methods are inconvenient and time consuming for large datasets. We have developed a user-friendly, accelerated and optimized tool for constructing gene co-expression networks that can fully harness the parallel nature of GPU (Graphic Processing Unit) architectures. Genetic entropies were exploited to filter out genes with no or small expression changes in the raw data preprocessing step. Pearson correlation coefficients were then calculated. After that, we normalized these coefficients and employed the False Discovery Rate to control the multiple tests. At last, modules identification was conducted to construct the co-expression networks. All of these calculations were implemented on a GPU. We also compressed the coefficient matrix to save space. We compared the performance of the GPU implementation with those of multi-core CPU implementations with 16 CPU threads, single-thread C/C++ implementation and single-thread R implementation. Our results show that GPU implementation largely outperforms single-thread C/C++ implementation and single-thread R implementation, and GPU implementation outperforms multi-core CPU implementation when the number of genes increases. With the test dataset containing 16,000 genes and 590 individuals, we can achieve greater than 63 times the speed using a GPU implementation compared with a single-thread R implementation when 50 percent of genes were filtered out and about 80 times the speed when no genes were filtered out.
      

      
      Preliminary Study of Image Reconstruction Algorithm on a Digital Signal Processor
      DTIC Science & Technology
      
         2014-03-01
         5.2 Comparison of CPU-GPU, CPU-FPGA, and CPU-DSP Designs The work for implementing VHDL description of the back-projection algorithm on a physical...FPGA was not complete. Hence, the DSP implementation results are compared with the simulated results for the VHDL design. Simulating VHDL provides an...rather than at the software level. Depending on an application’s characteristics, FPGA implementations can provide a significant performance
      

      
      Purple L1 Milestone Review Panel TotalView Debugger Functionality and Performance for ASC Purple
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Wolfe, M
         2006-12-12
         ASC code teams require a robust software debugging tool to help developers quickly find bugs in their codes and get their codes running. Development debugging commonly runs up to 512 processes. Production jobs run up to full ASC Purple scale, and at times require introspection while running. Developers want a debugger that runs on all their development and production platforms and that works with all compilers and runtimes used with ASC codes. The TotalView Multiprocess Debugger made by Etnus was specified for ASC Purple to address this needed capability. The ASC Purple environment builds on the environment seen by TotalViewmore » on ASCI White. The debugger must now operate with the Power5 CPU, Federation switch, AIX 5.3 operating system including large pages, IBM compilers 7 and 9, POE 4.2 parallel environment, and rs6000 SLURM resource manager. Users require robust, basic debugger functionality with acceptable performance at development debugging scale. A TotalView installation must be provided at the beginning of the early user access period that meets these requirements. A functional enhancement, fast conditional data watchpoints, and a scalability enhancement, capability up to 8192 processes, are to be demonstrated.« less
      

      
      Performance and scalability of Fourier domain optical coherence tomography acceleration using graphics processing units.
      PubMed
      Li, Jian; Bloch, Pavel; Xu, Jing; Sarunic, Marinko V; Shannon, Lesley
         2011-05-01
         Fourier domain optical coherence tomography (FD-OCT) provides faster line rates, better resolution, and higher sensitivity for noninvasive, in vivo biomedical imaging compared to traditional time domain OCT (TD-OCT). However, because the signal processing for FD-OCT is computationally intensive, real-time FD-OCT applications demand powerful computing platforms to deliver acceptable performance. Graphics processing units (GPUs) have been used as coprocessors to accelerate FD-OCT by leveraging their relatively simple programming model to exploit thread-level parallelism. Unfortunately, GPUs do not "share" memory with their host processors, requiring additional data transfers between the GPU and CPU. In this paper, we implement a complete FD-OCT accelerator on a consumer grade GPU/CPU platform. Our data acquisition system uses spectrometer-based detection and a dual-arm interferometer topology with numerical dispersion compensation for retinal imaging. We demonstrate that the maximum line rate is dictated by the memory transfer time and not the processing time due to the GPU platform's memory model. Finally, we discuss how the performance trends of GPU-based accelerators compare to the expected future requirements of FD-OCT data rates.
      

      
      Source parameter inversion of compound earthquakes on GPU/CPU hybrid platform
      NASA Astrophysics Data System (ADS)
      Wang, Y.; Ni, S.; Chen, W.
         2012-12-01
         Source parameter of earthquakes is essential problem in seismology. Accurate and timely determination of the earthquake parameters (such as moment, depth, strike, dip and rake of fault planes) is significant for both the rupture dynamics and ground motion prediction or simulation. And the rupture process study, especially for the moderate and large earthquakes, is essential as the more detailed kinematic study has became the routine work of seismologists. However, among these events, some events behave very specially and intrigue seismologists. These earthquakes usually consist of two similar size sub-events which occurred with very little time interval, such as mb4.5 Dec.9, 2003 in Virginia. The studying of these special events including the source parameter determination of each sub-events will be helpful to the understanding of earthquake dynamics. However, seismic signals of two distinctive sources are mixed up bringing in the difficulty of inversion. As to common events, the method(Cut and Paste) has been proven effective for resolving source parameters, which jointly use body wave and surface wave with independent time shift and weights. CAP could resolve fault orientation and focal depth using a grid search algorithm. Based on this method, we developed an algorithm(MUL_CAP) to simultaneously acquire parameters of two distinctive events. However, the simultaneous inversion of both sub-events make the computation very time consuming, so we develop a hybrid GPU and CPU version of CAP(HYBRID_CAP) to improve the computation efficiency. Thanks to advantages on multiple dimension storage and processing in GPU, we obtain excellent performance of the revised code on GPU-CPU combined architecture and the speedup factors can be as high as 40x-90x compared to classical cap on traditional CPU architecture.As the benchmark, we take the synthetics as observation and inverse the source parameters of two given sub-events and the inversion results are very consistent with the true parameters. For the events in Virginia, USA on 9 Dec, 2003, we re-invert source parameters and detailed analysis of regional waveform indicates that Virginia earthquake included two sub-events which are Mw4.05 and Mw4.25 at the same depth of 10km with focal mechanism of strike65/dip32/rake135, which are consistent with previous study. Moreover, compared to traditional two-source model method, MUL_CAP is more automatic with no need for human intervention.
      

      
      Many-integrated core (MIC) technology for accelerating Monte Carlo simulation of radiation transport: A study based on the code DPM
      NASA Astrophysics Data System (ADS)
      Rodriguez, M.; Brualla, L.
         2018-04-01
         Monte Carlo simulation of radiation transport is computationally demanding to obtain reasonably low statistical uncertainties of the estimated quantities. Therefore, it can benefit in a large extent from high-performance computing. This work is aimed at assessing the performance of the first generation of the many-integrated core architecture (MIC) Xeon Phi coprocessor with respect to that of a CPU consisting of a double 12-core Xeon processor in Monte Carlo simulation of coupled electron-photonshowers. The comparison was made twofold, first, through a suite of basic tests including parallel versions of the random number generators Mersenne Twister and a modified implementation of RANECU. These tests were addressed to establish a baseline comparison between both devices. Secondly, through the p DPM code developed in this work. p DPM is a parallel version of the Dose Planning Method (DPM) program for fast Monte Carlo simulation of radiation transport in voxelized geometries. A variety of techniques addressed to obtain a large scalability on the Xeon Phi were implemented in p DPM. Maximum scalabilities of 84 . 2 × and 107 . 5 × were obtained in the Xeon Phi for simulations of electron and photon beams, respectively. Nevertheless, in none of the tests involving radiation transport the Xeon Phi performed better than the CPU. The disadvantage of the Xeon Phi with respect to the CPU owes to the low performance of the single core of the former. A single core of the Xeon Phi was more than 10 times less efficient than a single core of the CPU for all radiation transport simulations.
      

      
      Research on fast Fourier transforms algorithm of huge remote sensing image technology with GPU and partitioning technology.
      PubMed
      Yang, Xue; Li, Xue-You; Li, Jia-Guo; Ma, Jun; Zhang, Li; Yang, Jan; Du, Quan-Ye
         2014-02-01
         Fast Fourier transforms (FFT) is a basic approach to remote sensing image processing. With the improvement of capacity of remote sensing image capture with the features of hyperspectrum, high spatial resolution and high temporal resolution, how to use FFT technology to efficiently process huge remote sensing image becomes the critical step and research hot spot of current image processing technology. FFT algorithm, one of the basic algorithms of image processing, can be used for stripe noise removal, image compression, image registration, etc. in processing remote sensing image. CUFFT function library is the FFT algorithm library based on CPU and FFTW. FFTW is a FFT algorithm developed based on CPU in PC platform, and is currently the fastest CPU based FFT algorithm function library. However there is a common problem that once the available memory or memory is less than the capacity of image, there will be out of memory or memory overflow when using the above two methods to realize image FFT arithmetic. To address this problem, a CPU and partitioning technology based Huge Remote Fast Fourier Transform (HRFFT) algorithm is proposed in this paper. By improving the FFT algorithm in CUFFT function library, the problem of out of memory and memory overflow is solved. Moreover, this method is proved rational by experiment combined with the CCD image of HJ-1A satellite. When applied to practical image processing, it improves effect of the image processing, speeds up the processing, which saves the time of computation and achieves sound result.
      

      
      GPU: the biggest key processor for AI and parallel processing
      NASA Astrophysics Data System (ADS)
      Baji, Toru
         2017-07-01
         Two types of processors exist in the market. One is the conventional CPU and the other is Graphic Processor Unit (GPU). Typical CPU is composed of 1 to 8 cores while GPU has thousands of cores. CPU is good for sequential processing, while GPU is good to accelerate software with heavy parallel executions. GPU was initially dedicated for 3D graphics. However from 2006, when GPU started to apply general-purpose cores, it was noticed that this architecture can be used as a general purpose massive-parallel processor. NVIDIA developed a software framework Compute Unified Device Architecture (CUDA) that make it possible to easily program the GPU for these application. With CUDA, GPU started to be used in workstations and supercomputers widely. Recently two key technologies are highlighted in the industry. The Artificial Intelligence (AI) and Autonomous Driving Cars. AI requires a massive parallel operation to train many-layers of neural networks. With CPU alone, it was impossible to finish the training in a practical time. The latest multi-GPU system with P100 makes it possible to finish the training in a few hours. For the autonomous driving cars, TOPS class of performance is required to implement perception, localization, path planning processing and again SoC with integrated GPU will play a key role there. In this paper, the evolution of the GPU which is one of the biggest commercial devices requiring state-of-the-art fabrication technology will be introduced. Also overview of the GPU demanding key application like the ones described above will be introduced.
      

      
      Parallel Scaling Characteristics of Selected NERSC User ProjectCodes
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Skinner, David; Verdier, Francesca; Anand, Harsh
         
         This report documents parallel scaling characteristics of NERSC user project codes between Fiscal Year 2003 and the first half of Fiscal Year 2004 (Oct 2002-March 2004). The codes analyzed cover 60% of all the CPU hours delivered during that time frame on seaborg, a 6080 CPU IBM SP and the largest parallel computer at NERSC. The scale in terms of concurrency and problem size of the workload is analyzed. Drawing on batch queue logs, performance data and feedback from researchers we detail the motivations, benefits, and challenges of implementing highly parallel scientific codes on current NERSC High Performance Computing systems.more » An evaluation and outlook of the NERSC workload for Allocation Year 2005 is presented.« less
      

      
      Event- and Time-Driven Techniques Using Parallel CPU-GPU Co-processing for Spiking Neural Networks
      PubMed Central
      Naveros, Francisco; Garrido, Jesus A.; Carrillo, Richard R.; Ros, Eduardo; Luque, Niceto R.
         2017-01-01
         Modeling and simulating the neural structures which make up our central neural system is instrumental for deciphering the computational neural cues beneath. Higher levels of biological plausibility usually impose higher levels of complexity in mathematical modeling, from neural to behavioral levels. This paper focuses on overcoming the simulation problems (accuracy and performance) derived from using higher levels of mathematical complexity at a neural level. This study proposes different techniques for simulating neural models that hold incremental levels of mathematical complexity: leaky integrate-and-fire (LIF), adaptive exponential integrate-and-fire (AdEx), and Hodgkin-Huxley (HH) neural models (ranged from low to high neural complexity). The studied techniques are classified into two main families depending on how the neural-model dynamic evaluation is computed: the event-driven or the time-driven families. Whilst event-driven techniques pre-compile and store the neural dynamics within look-up tables, time-driven techniques compute the neural dynamics iteratively during the simulation time. We propose two modifications for the event-driven family: a look-up table recombination to better cope with the incremental neural complexity together with a better handling of the synchronous input activity. Regarding the time-driven family, we propose a modification in computing the neural dynamics: the bi-fixed-step integration method. This method automatically adjusts the simulation step size to better cope with the stiffness of the neural model dynamics running in CPU platforms. One version of this method is also implemented for hybrid CPU-GPU platforms. Finally, we analyze how the performance and accuracy of these modifications evolve with increasing levels of neural complexity. We also demonstrate how the proposed modifications which constitute the main contribution of this study systematically outperform the traditional event- and time-driven techniques under increasing levels of neural complexity. PMID:28223930
      

        
       
          

«

11
      12
      13
   14
      15
      »

          
        

     

   

   
       
            
              
          

«

12
      13
      14
   15
      16
      »

          
        

           
           
             
               
      
      An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU
      DOE PAGES
      Lyakh, Dmitry I.
         2015-01-05
         An efficient parallel tensor transpose algorithm is suggested for shared-memory computing units, namely, multicore CPU, Intel Xeon Phi, and NVidia GPU. The algorithm operates on dense tensors (multidimensional arrays) and is based on the optimization of cache utilization on x86 CPU and the use of shared memory on NVidia GPU. From the applied side, the ultimate goal is to minimize the overhead encountered in the transformation of tensor contractions into matrix multiplications in computer implementations of advanced methods of quantum many-body theory (e.g., in electronic structure theory and nuclear physics). A particular accent is made on higher-dimensional tensors that typicallymore » appear in the so-called multireference correlated methods of electronic structure theory. Depending on tensor dimensionality, the presented optimized algorithms can achieve an order of magnitude speedup on x86 CPUs and 2-3 times speedup on NVidia Tesla K20X GPU with respect to the na ve scattering algorithm (no memory access optimization). Furthermore, the tensor transpose routines developed in this work have been incorporated into a general-purpose tensor algebra library (TAL-SH).« less
      

      
      Development of the Large-Scale Statistical Analysis System of Satellites Observations Data with Grid Datafarm Architecture
      NASA Astrophysics Data System (ADS)
      Yamamoto, K.; Murata, K.; Kimura, E.; Honda, R.
         2006-12-01
         In the Solar-Terrestrial Physics (STP) field, the amount of satellite observation data has been increasing every year. It is necessary to solve the following three problems to achieve large-scale statistical analyses of plenty of data. (i) More CPU power and larger memory and disk size are required. However, total powers of personal computers are not enough to analyze such amount of data. Super-computers provide a high performance CPU and rich memory area, but they are usually separated from the Internet or connected only for the purpose of programming or data file transfer. (ii) Most of the observation data files are managed at distributed data sites over the Internet. Users have to know where the data files are located. (iii) Since no common data format in the STP field is available now, users have to prepare reading program for each data by themselves. To overcome the problems (i) and (ii), we constructed a parallel and distributed data analysis environment based on the Gfarm reference implementation of the Grid Datafarm architecture. The Gfarm shares both computational resources and perform parallel distributed processings. In addition, the Gfarm provides the Gfarm filesystem which can be as virtual directory tree among nodes. The Gfarm environment is composed of three parts; a metadata server to manage distributed files information, filesystem nodes to provide computational resources and a client to throw a job into metadata server and manages data processing schedulings. In the present study, both data files and data processes are parallelized on the Gfarm with 6 file system nodes: CPU clock frequency of each node is Pentium V 1GHz, 256MB memory and40GB disk. To evaluate performances of the present Gfarm system, we scanned plenty of data files, the size of which is about 300MB for each, in three processing methods: sequential processing in one node, sequential processing by each node and parallel processing by each node. As a result, in comparison between the number of files and the elapsed time, parallel and distributed processing shorten the elapsed time to 1/5 than sequential processing. On the other hand, sequential processing times were shortened in another experiment, whose file size is smaller than 100KB. In this case, the elapsed time to scan one file is within one second. It implies that disk swap took place in case of parallel processing by each node. We note that the operation became unstable when the number of the files exceeded 1000. To overcome the problem (iii), we developed an original data class. This class supports our reading of data files with various data formats since it converts them into an original data format since it defines schemata for every type of data and encapsulates the structure of data files. In addition, since this class provides a function of time re-sampling, users can easily convert multiple data (array) with different time resolution into the same time resolution array. Finally, using the Gfarm, we achieved a high performance environment for large-scale statistical data analyses. It should be noted that the present method is effective only when one data file size is large enough. At present, we are restructuring the new Gfarm environment with 8 nodes: CPU is Athlon 64 x2 Dual Core 2GHz, 2GB memory and 1.2TB disk (using RAID0) for each node. Our original class is to be implemented on the new Gfarm environment. In the present talk, we show the latest results with applying the present system for data analyses with huge number of satellite observation data files.
      

      
      Application of high-performance computing to numerical simulation of human movement
      NASA Technical Reports Server (NTRS)
      Anderson, F. C.; Ziegler, J. M.; Pandy, M. G.; Whalen, R. T.
         1995-01-01
         We have examined the feasibility of using massively-parallel and vector-processing supercomputers to solve large-scale optimization problems for human movement. Specifically, we compared the computational expense of determining the optimal controls for the single support phase of gait using a conventional serial machine (SGI Iris 4D25), a MIMD parallel machine (Intel iPSC/860), and a parallel-vector-processing machine (Cray Y-MP 8/864). With the human body modeled as a 14 degree-of-freedom linkage actuated by 46 musculotendinous units, computation of the optimal controls for gait could take up to 3 months of CPU time on the Iris. Both the Cray and the Intel are able to reduce this time to practical levels. The optimal solution for gait can be found with about 77 hours of CPU on the Cray and with about 88 hours of CPU on the Intel. Although the overall speeds of the Cray and the Intel were found to be similar, the unique capabilities of each machine are better suited to different portions of the computational algorithm used. The Intel was best suited to computing the derivatives of the performance criterion and the constraints whereas the Cray was best suited to parameter optimization of the controls. These results suggest that the ideal computer architecture for solving very large-scale optimal control problems is a hybrid system in which a vector-processing machine is integrated into the communication network of a MIMD parallel machine.
      

      
      Disk-based k-mer counting on a PC
      PubMed Central
      
         2013-01-01
         Background The k-mer counting problem, which is to build the histogram of occurrences of every k-symbol long substring in a given text, is important for many bioinformatics applications. They include developing de Bruijn graph genome assemblers, fast multiple sequence alignment and repeat detection. Results We propose a simple, yet efficient, parallel disk-based algorithm for counting k-mers. Experiments show that it usually offers the fastest solution to the considered problem, while demanding a relatively small amount of memory. In particular, it is capable of counting the statistics for short-read human genome data, in input gzipped FASTQ file, in less than 40 minutes on a PC with 16 GB of RAM and 6 CPU cores, and for long-read human genome data in less than 70 minutes. On a more powerful machine, using 32 GB of RAM and 32 CPU cores, the tasks are accomplished in less than half the time. No other algorithm for most tested settings of this problem and mammalian-size data can accomplish this task in comparable time. Our solution also belongs to memory-frugal ones; most competitive algorithms cannot efficiently work on a PC with 16 GB of memory for such massive data. Conclusions By making use of cheap disk space and exploiting CPU and I/O parallelism we propose a very competitive k-mer counting procedure, called KMC. Our results suggest that judicious resource management may allow to solve at least some bioinformatics problems with massive data on a commodity personal computer. PMID:23679007
      

      
      Rt-Space: A Real-Time Stochastically-Provisioned Adaptive Container Environment
      DTIC Science & Technology
      
         2017-08-04
         SECURITY CLASSIFICATION OF: This project was directed at component-based soft real- time (SRT) systems implemented on multicore platforms. To facilitate...upon average-case or near- average-case task execution times . The main intellectual contribution of this project was the development of methods for...allocating CPU time to components and associated analysis for validating SRT correctness. 1. REPORT DATE (DD-MM-YYYY) 4. TITLE AND SUBTITLE 13
      

      
      Acceleration for 2D time-domain elastic full waveform inversion using a single GPU card
      NASA Astrophysics Data System (ADS)
      Jiang, Jinpeng; Zhu, Peimin
         2018-05-01
         Full waveform inversion (FWI) is a challenging procedure due to the high computational cost related to the modeling, especially for the elastic case. The graphics processing unit (GPU) has become a popular device for the high-performance computing (HPC). To reduce the long computation time, we design and implement the GPU-based 2D elastic FWI (EFWI) in time domain using a single GPU card. We parallelize the forward modeling and gradient calculations using the CUDA programming language. To overcome the limitation of relatively small global memory on GPU, the boundary saving strategy is exploited to reconstruct the forward wavefield. Moreover, the L-BFGS optimization method used in the inversion increases the convergence of the misfit function. A multiscale inversion strategy is performed in the workflow to obtain the accurate inversion results. In our tests, the GPU-based implementations using a single GPU device achieve >15 times speedup in forward modeling, and about 12 times speedup in gradient calculation, compared with the eight-core CPU implementations optimized by OpenMP. The test results from the GPU implementations are verified to have enough accuracy by comparing the results obtained from the CPU implementations.
      

      
      Fidelity Optimization of Microprocessor System Simulations.
      DTIC Science & Technology
      
         1981-03-01
         effort feasible in terms of required CPU time would be to employ a separate clock with an artificially compressed time base in the serial...RETURN ILINCR -NU𔃾OPS D.% PROt.ESSING 900 IF IIERP2.NF.41 GO TO 1000 IFRCOD - L CALL VAIRCO 1A(61,NUMVALLEPCOOl IEPRZ -IEACCO IF hEARR .GT. 01 RETURN I
      

      
      New core-reflector boundary conditions for transient nodal reactor calculations
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Lee, E.K.; Kim, C.H.; Joo, H.K.
         1995-09-01
         New core-reflector boundary conditions designed for the exclusion of the reflector region in transient nodal reactor calculations are formulated. Spatially flat frequency approximations for the temporal neutron behavior and two types of transverse leakage approximations in the reflector region are introduced to solve the transverse-integrated time-dependent one-dimensional diffusion equation and then to obtain relationships between net current and flux at the core-reflector interfaces. To examine the effectiveness of new core-reflector boundary conditions in transient nodal reactor computations, nodal expansion method (NEM) computations with and without explicit representation of the reflector are performed for Laboratorium fuer Reaktorregelung und Anlagen (LRA) boilingmore » water reactor (BWR) and Nuclear Energy Agency Committee on Reactor Physics (NEACRP) pressurized water reactor (PWR) rod ejection kinetics benchmark problems. Good agreement between two NEM computations is demonstrated in all the important transient parameters of two benchmark problems. A significant amount of CPU time saving is also demonstrated with the boundary condition model with transverse leakage (BCMTL) approximations in the reflector region. In the three-dimensional LRA BWR, the BCMTL and the explicit reflector model computations differ by {approximately}4% in transient peak power density while the BCMTL results in >40% of CPU time saving by excluding both the axial and the radial reflector regions from explicit computational nodes. In the NEACRP PWR problem, which includes six different transient cases, the largest difference is 24.4% in the transient maximum power in the one-node-per-assembly B1 transient results. This difference in the transient maximum power of the B1 case is shown to reduce to 11.7% in the four-node-per-assembly computations. As for the computing time, BCMTL is shown to reduce the CPU time >20% in all six transient cases of the NEACRP PWR.« less
      

      
      Ground Shock Effects from Accidental Explosions
      DTIC Science & Technology
      
         1976-11-01
         1,200 P0 A = V P cp 8 Horizontal Dh = Dv tannin " 1 (cp/U)] Vh = Vv tan [sin" 1 (cp/U)] \\ - \\ tanfainŕ (cp/U)] For tan sin (c /U...explosive are not included in the present analysis . This effect will limit the credibility of the direct- induced ground shock predictions, but if the... analysis . Dr. D. R. Richmond of Lovelace Foundation provided data on human shock tolerances. 26 REFERENCES 1. "Structures to Resist the Effects of
      

      
      GPU-Accelerated Voxelwise Hepatic Perfusion Quantification
      PubMed Central
      Wang, H; Cao, Y
         2012-01-01
         Voxelwise quantification of hepatic perfusion parameters from dynamic contrast enhanced (DCE) imaging greatly contributes to assessment of liver function in response to radiation therapy. However, the efficiency of the estimation of hepatic perfusion parameters voxel-by-voxel in the whole liver using a dual-input single-compartment model requires substantial improvement for routine clinical applications. In this paper, we utilize the parallel computation power of a graphics processing unit (GPU) to accelerate the computation, while maintaining the same accuracy as the conventional method. Using CUDA-GPU, the hepatic perfusion computations over multiple voxels are run across the GPU blocks concurrently but independently. At each voxel, non-linear least squares fitting the time series of the liver DCE data to the compartmental model is distributed to multiple threads in a block, and the computations of different time points are performed simultaneously and synchronically. An efficient fast Fourier transform in a block is also developed for the convolution computation in the model. The GPU computations of the voxel-by-voxel hepatic perfusion images are compared with ones by the CPU using the simulated DCE data and the experimental DCE MR images from patients. The computation speed is improved by 30 times using a NVIDIA Tesla C2050 GPU compared to a 2.67 GHz Intel Xeon CPU processor. To obtain liver perfusion maps with 626400 voxels in a patient’s liver, it takes 0.9 min with the GPU-accelerated voxelwise computation, compared to 110 min with the CPU, while both methods result in perfusion parameters differences less than 10−6. The method will be useful for generating liver perfusion images in clinical settings. PMID:22892645
      

      
      Dust Dynamics in Protoplanetary Disks: Parallel Computing with PVM
      NASA Astrophysics Data System (ADS)
      de La Fuente Marcos, Carlos; Barge, Pierre; de La Fuente Marcos, Raúl
         2002-03-01
         We describe a parallel version of our high-order-accuracy particle-mesh code for the simulation of collisionless protoplanetary disks. We use this code to carry out a massively parallel, two-dimensional, time-dependent, numerical simulation, which includes dust particles, to study the potential role of large-scale, gaseous vortices in protoplanetary disks. This noncollisional problem is easy to parallelize on message-passing multicomputer architectures. We performed the simulations on a cache-coherent nonuniform memory access Origin 2000 machine, using both the parallel virtual machine (PVM) and message-passing interface (MPI) message-passing libraries. Our performance analysis suggests that, for our problem, PVM is about 25% faster than MPI. Using PVM and MPI made it possible to reduce CPU time and increase code performance. This allows for simulations with a large number of particles (N ~ 105-106) in reasonable CPU times. The performances of our implementation of the pa! rallel code on an Origin 2000 supercomputer are presented and discussed. They exhibit very good speedup behavior and low load unbalancing. Our results confirm that giant gaseous vortices can play a dominant role in giant planet formation.
      

      
      SU-F-BRD-02: Application of ARCHERRT-- A GPU-Based Monte Carlo Dose Engine for Radiation Therapy -- to Tomotherapy and Patient-Independent IMRT
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Su, L; Du, X; Liu, T
         
         Purpose: As a module of ARCHER -- Accelerated Radiation-transport Computations in Heterogeneous EnviRonments, ARCHER{sub RT} is designed for RadioTherapy (RT) dose calculation. This paper describes the application of ARCHERRT on patient-dependent TomoTherapy and patient-independent IMRT. It also conducts a 'fair' comparison of different GPUs and multicore CPU. Methods: The source input used for patient-dependent TomoTherapy is phase space file (PSF) generated from optimized plan. For patient-independent IMRT, the open filed PSF is used for different cases. The intensity modulation is simulated by fluence map. The GEANT4 code is used as benchmark. DVH and gamma index test are employed to evaluatemore » the accuracy of ARCHER{sub RT} code. Some previous studies reported misleading speedups by comparing GPU code with serial CPU code. To perform a fairer comparison, we write multi-thread code with OpenMP to fully exploit computing potential of CPU. The hardware involved in this study are a 6-core Intel E5-2620 CPU and 6 NVIDIA M2090 GPUs, a K20 GPU and a K40 GPU. Results: Dosimetric results from ARCHER{sub RT} and GEANT4 show good agreement. The 2%/2mm gamma test pass rates for different clinical cases are 97.2% to 99.7%. A single M2090 GPU needs 50~79 seconds for the simulation to achieve a statistical error of 1% in the PTV. The K40 card is about 1.7∼1.8 times faster than M2090 card. Using 6 M2090 card, the simulation can be finished in about 10 seconds. For comparison, Intel E5-2620 needs 507∼879 seconds for the same simulation. Conclusion: We successfully applied ARCHER{sub RT} to Tomotherapy and patient-independent IMRT, and conducted a fair comparison between GPU and CPU performance. The ARCHER{sub RT} code is both accurate and efficient and may be used towards clinical applications.« less
      

      
      Collaborating CPU and GPU for large-scale high-order CFD simulations with complex grids on the TianHe-1A supercomputer
      NASA Astrophysics Data System (ADS)
      Xu, Chuanfu; Deng, Xiaogang; Zhang, Lilun; Fang, Jianbin; Wang, Guangxue; Jiang, Yi; Cao, Wei; Che, Yonggang; Wang, Yongxian; Wang, Zhenghua; Liu, Wei; Cheng, Xinghua
         2014-12-01
         Programming and optimizing complex, real-world CFD codes on current many-core accelerated HPC systems is very challenging, especially when collaborating CPUs and accelerators to fully tap the potential of heterogeneous systems. In this paper, with a tri-level hybrid and heterogeneous programming model using MPI + OpenMP + CUDA, we port and optimize our high-order multi-block structured CFD software HOSTA on the GPU-accelerated TianHe-1A supercomputer. HOSTA adopts two self-developed high-order compact definite difference schemes WCNS and HDCS that can simulate flows with complex geometries. We present a dual-level parallelization scheme for efficient multi-block computation on GPUs and perform particular kernel optimizations for high-order CFD schemes. The GPU-only approach achieves a speedup of about 1.3 when comparing one Tesla M2050 GPU with two Xeon X5670 CPUs. To achieve a greater speedup, we collaborate CPU and GPU for HOSTA instead of using a naive GPU-only approach. We present a novel scheme to balance the loads between the store-poor GPU and the store-rich CPU. Taking CPU and GPU load balance into account, we improve the maximum simulation problem size per TianHe-1A node for HOSTA by 2.3×, meanwhile the collaborative approach can improve the performance by around 45% compared to the GPU-only approach. Further, to scale HOSTA on TianHe-1A, we propose a gather/scatter optimization to minimize PCI-e data transfer times for ghost and singularity data of 3D grid blocks, and overlap the collaborative computation and communication as far as possible using some advanced CUDA and MPI features. Scalability tests show that HOSTA can achieve a parallel efficiency of above 60% on 1024 TianHe-1A nodes. With our method, we have successfully simulated an EET high-lift airfoil configuration containing 800M cells and China's large civil airplane configuration containing 150M cells. To our best knowledge, those are the largest-scale CPU-GPU collaborative simulations that solve realistic CFD problems with both complex configurations and high-order schemes.
      

      
      Examining the architecture of cellular computing through a comparative study with a computer
      PubMed Central
      Wang, Degeng; Gribskov, Michael
         2005-01-01
         The computer and the cell both use information embedded in simple coding, the binary software code and the quadruple genomic code, respectively, to support system operations. A comparative examination of their system architecture as well as their information storage and utilization schemes is performed. On top of the code, both systems display a modular, multi-layered architecture, which, in the case of a computer, arises from human engineering efforts through a combination of hardware implementation and software abstraction. Using the computer as a reference system, a simplistic mapping of the architectural components between the two is easily detected. This comparison also reveals that a cell abolishes the software–hardware barrier through genomic encoding for the constituents of the biochemical network, a cell's ‘hardware’ equivalent to the computer central processing unit (CPU). The information loading (gene expression) process acts as a major determinant of the encoded constituent's abundance, which, in turn, often determines the ‘bandwidth’ of a biochemical pathway. Cellular processes are implemented in biochemical pathways in parallel manners. In a computer, on the other hand, the software provides only instructions and data for the CPU. A process represents just sequentially ordered actions by the CPU and only virtual parallelism can be implemented through CPU time-sharing. Whereas process management in a computer may simply mean job scheduling, coordinating pathway bandwidth through the gene expression machinery represents a major process management scheme in a cell. In summary, a cell can be viewed as a super-parallel computer, which computes through controlled hardware composition. While we have, at best, a very fragmented understanding of cellular operation, we have a thorough understanding of the computer throughout the engineering process. The potential utilization of this knowledge to the benefit of systems biology is discussed. PMID:16849179
      

      
      Examining the architecture of cellular computing through a comparative study with a computer.
      PubMed
      Wang, Degeng; Gribskov, Michael
         2005-06-22
         The computer and the cell both use information embedded in simple coding, the binary software code and the quadruple genomic code, respectively, to support system operations. A comparative examination of their system architecture as well as their information storage and utilization schemes is performed. On top of the code, both systems display a modular, multi-layered architecture, which, in the case of a computer, arises from human engineering efforts through a combination of hardware implementation and software abstraction. Using the computer as a reference system, a simplistic mapping of the architectural components between the two is easily detected. This comparison also reveals that a cell abolishes the software-hardware barrier through genomic encoding for the constituents of the biochemical network, a cell's "hardware" equivalent to the computer central processing unit (CPU). The information loading (gene expression) process acts as a major determinant of the encoded constituent's abundance, which, in turn, often determines the "bandwidth" of a biochemical pathway. Cellular processes are implemented in biochemical pathways in parallel manners. In a computer, on the other hand, the software provides only instructions and data for the CPU. A process represents just sequentially ordered actions by the CPU and only virtual parallelism can be implemented through CPU time-sharing. Whereas process management in a computer may simply mean job scheduling, coordinating pathway bandwidth through the gene expression machinery represents a major process management scheme in a cell. In summary, a cell can be viewed as a super-parallel computer, which computes through controlled hardware composition. While we have, at best, a very fragmented understanding of cellular operation, we have a thorough understanding of the computer throughout the engineering process. The potential utilization of this knowledge to the benefit of systems biology is discussed.
      

      
      Revisiting Molecular Dynamics on a CPU/GPU system: Water Kernel and SHAKE Parallelization.
      PubMed
      Ruymgaart, A Peter; Elber, Ron
         2012-11-13
         We report Graphics Processing Unit (GPU) and Open-MP parallel implementations of water-specific force calculations and of bond constraints for use in Molecular Dynamics simulations. We focus on a typical laboratory computing-environment in which a CPU with a few cores is attached to a GPU. We discuss in detail the design of the code and we illustrate performance comparable to highly optimized codes such as GROMACS. Beside speed our code shows excellent energy conservation. Utilization of water-specific lists allows the efficient calculations of non-bonded interactions that include water molecules and results in a speed-up factor of more than 40 on the GPU compared to code optimized on a single CPU core for systems larger than 20,000 atoms. This is up four-fold from a factor of 10 reported in our initial GPU implementation that did not include a water-specific code. Another optimization is the implementation of constrained dynamics entirely on the GPU. The routine, which enforces constraints of all bonds, runs in parallel on multiple Open-MP cores or entirely on the GPU. It is based on Conjugate Gradient solution of the Lagrange multipliers (CG SHAKE). The GPU implementation is partially in double precision and requires no communication with the CPU during the execution of the SHAKE algorithm. The (parallel) implementation of SHAKE allows an increase of the time step to 2.0fs while maintaining excellent energy conservation. Interestingly, CG SHAKE is faster than the usual bond relaxation algorithm even on a single core if high accuracy is expected. The significant speedup of the optimized components transfers the computational bottleneck of the MD calculation to the reciprocal part of Particle Mesh Ewald (PME).
      

      
      Radiation hardened microprocessor for small payloads
      NASA Technical Reports Server (NTRS)
      Shah, Ravi
         1993-01-01
         The RH-3000 program is developing a rad-hard space qualified 32-bit MIPS R-3000 RISC processor under the Naval Research Lab sponsorship. In addition, under IR&D Harris is developing RHC-3000 for embedded control applications where low cost and radiation tolerance are primary concerns. The development program leverages heavily from commercial development of the MIPS R-3000. The commercial R-3000 has a large installed user base and several foundry partners are currently producing a wide variety of R-3000 derivative products. One of the MIPS derivative products, the LR33000 from LSI Logic, was used as the basis for the design of the RH-3000 chipset. The RH-3000 chipset consists of three core chips and two support chips. The core chips include the CPU, which is the R-3000 integer unit and the FPA/MD chip pair, which performs the R-3010 floating point functions. The two support whips contain all the support functions required for fault tolerance support, real-time support, memory management, timers, and other functions. The Harris development effort had first passed silicon success in June, 1992 with the first rad-hard 32-bit RH-3000 CPU chip. The CPU device is 30 kgates, has a 508 mil by 503 mil die size and is fabricated at Harris Semiconductor on the rad-hard CMOS Silicon on Sapphire (SOS) process. The CPU device successfully passed tesing against 600,000 test vectors derived directly on the LSI/MIPS test suite and has been operational as a single board computer running C code for the past year. In addition, the RH-3000 program has developed the methodology for converting commercially developed designs utilizing logic synthesis techniques based on a combination of VHDK and schematic data bases.
      

      
      Using all of your CPU's in HIPE
      NASA Astrophysics Data System (ADS)
      Jacobson, J. D.; Fadda, D.
         2012-09-01
         Modern computer architectures increasingly feature multi-core CPU's. For example, the MacbookPro features the Intel quad-core i7 processors. Through the use of hyper-threading, where each core can execute two threads simultaneously, the quad-core i7 can support eight simultaneous processing threads. All this on your laptop! This CPU power can now be put into service by scientists to perform data reduction tasks, but only if the software has been designed to take advantage of the multiple processor architectures. Up to now, software written for Herschel data reduction (HIPE), written in Jython and JAVA, is single-threaded and can only utilize a single processor. Users of HIPE do not get any advantage from the additional processors. Why not put all of the CPU resources to work reducing your data? We present a multi-threaded software application that corrects long-term transients in the signal from the PACS unchopped spectroscopy line scan mode. In this poster, we present a multi-threaded software framework to achieve performance improvements from parallel execution. We will show how a task to correct transients in the PACS Spectroscopy Pipeline for the un-chopped line scan mode, has been threaded. This computation-intensive task uses either a one-parameter or a three parameter exponential function, to characterize the transient. The task uses a JAVA implementation of Minpack, translated from the C (Moshier) and IDL (Markwardt) by the authors, to optimize the correction parameters. We also explain how to determine if a task can benefit from threading (Amdahl's Law), and if it is safe to thread. The design and implementation, using the JAVA concurrency package completions service is described. Pitfalls, timing bugs, thread safety, resource control, testing and performance improvements are described and plotted.
      

      
      Optimal endothelialisation of a new compliant poly(carbonate-urea)urethane vascular graft with effect of physiological shear stress.
      PubMed
      Salacinski, H J; Tai, N R; Punshon, G; Giudiceandrea, A; Hamilton, G; Seifalian, A M
         2000-10-01
         to define the optimal seeding conditions of a new stress free poly(carbonate-urea)urethane (CPU) graft with compliance similar to that of human artery with honeycomb structure engineered during the manufacturing process to enhance adhesion and growth of endothelial cells. (111)Indium-oxine radiolabeled human umbilical vein endothelial cells (HUVEC) were seeded onto CPU grafts at (a) concentrations from 2-24x10(5)cells/cm(2)and (b) incubated for 0.5, 1, 2, 4 and 6 h. Following incubation, graft segments were subjected to three washing/gamma counting procedures and scanning electron microscopy (SEM). Cell viability was measured using a modified Alamar blue(TM)assay. To test physiological retention a pulsatile flow phantom was used to subject optimally seeded (16x10(5), 4 h) CPU grafts to arterial shear stress for 6 h with real time acquisition of scintigraphic images of seeded grafts using a nuclear medicine gamma camera system. the seeding efficiency of 54+/-13% post three washes was achieved using 16x10(5)cells/cm(2). Similarly in SEM micrographs a seeding density of 16x10(5)cells/cm(2)resulted in a confluent monolayer. Seeded CPU segments incubated for 4 h exhibited significantly higher resistance to wash-off than segments incubated for 30 min (p <0.05). Exposure of seeded grafts to pulsatile shear stress resulted in some cell loss with 67+/-3% of cells adherent following 6 h of perfusion with ongoing metabolic activity. Thus, optimal conditions were 16x10(5)cells/cm(2)at 4 h. the optimal seeding conditions have been defined for "tissue-engineered" vascular graft which allow complete endothelialisation and high cell-to-substrate strength that resists hydrodynamic stress. Copyright 2000 Harcourt Publishers Ltd.
      

      
      Machine learning based job status prediction in scientific clusters
      DOE PAGES
      Yoo, Wucherl; Sim, Alex; Wu, Kesheng
         2016-09-01
         Large high-performance computing systems are built with increasing number of components with more CPU cores, more memory, and more storage space. At the same time, scientific applications have been growing in complexity. Together, they are leading to more frequent unsuccessful job statuses on HPC systems. From measured job statuses, 23.4% of CPU time was spent to the unsuccessful jobs. Here, we set out to study whether these unsuccessful job statuses could be anticipated from known job characteristics. To explore this possibility, we have developed a job status prediction method for the execution of jobs on scientific clusters. The Random Forestsmore » algorithm was applied to extract and characterize the patterns of unsuccessful job statuses. Experimental results show that our method can predict the unsuccessful job statuses from the monitored ongoing job executions in 99.8% the cases with 83.6% recall and 94.8% precision. Lastly, this prediction accuracy can be sufficiently high that it can be used to mitigation procedures of predicted failures.« less
      

        
       
          

«

12
      13
      14
   15
      16
      »

          
        

     

   

   
       
            
              
          

«

13
      14
      15
   16
      17
      »

          
        

           
           
             
               
      
      RXIO: Design and implementation of high performance RDMA-capable GridFTP
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Tian, Yuan; Yu, Weikuan; Vetter, Jeffrey S.
         2011-12-21
         For its low-latency, high bandwidth, and low CPU utilization, Remote Direct Memory Access (RDMA) has established itself as an effective data movement technology in many networking environments. However, the transport protocols of grid run-time systems, such as GridFTP in Globus, are not yet capable of utilizing RDMA. In this study, we examine the architecture of GridFTP for the feasibility of enabling RDMA. An RDMA-capable XIO (RXIO) framework is designed and implemented to extend its XIO system and match the characteristics of RDMA. Our experimental results demonstrate that RDMA can significantly improve the performance of GridFTP, reducing the latency by 32%more » and increasing the bandwidth by more than three times. In achieving such performance improvements, RDMA dramatically cuts down CPU utilization of GridFTP clients and servers. In conclusion, these results demonstrate that RXIO can effectively exploit the benefits of RDMA for GridFTP. It offers a good prototype to further leverage GridFTP on wide-area RDMA networks.« less
      

      
      Machine learning based job status prediction in scientific clusters
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Yoo, Wucherl; Sim, Alex; Wu, Kesheng
         
         Large high-performance computing systems are built with increasing number of components with more CPU cores, more memory, and more storage space. At the same time, scientific applications have been growing in complexity. Together, they are leading to more frequent unsuccessful job statuses on HPC systems. From measured job statuses, 23.4% of CPU time was spent to the unsuccessful jobs. Here, we set out to study whether these unsuccessful job statuses could be anticipated from known job characteristics. To explore this possibility, we have developed a job status prediction method for the execution of jobs on scientific clusters. The Random Forestsmore » algorithm was applied to extract and characterize the patterns of unsuccessful job statuses. Experimental results show that our method can predict the unsuccessful job statuses from the monitored ongoing job executions in 99.8% the cases with 83.6% recall and 94.8% precision. Lastly, this prediction accuracy can be sufficiently high that it can be used to mitigation procedures of predicted failures.« less
      

      
      The density matrix renormalization group algorithm on kilo-processor architectures: Implementation and trade-offs
      NASA Astrophysics Data System (ADS)
      Nemes, Csaba; Barcza, Gergely; Nagy, Zoltán; Legeza, Örs; Szolgay, Péter
         2014-06-01
         In the numerical analysis of strongly correlated quantum lattice models one of the leading algorithms developed to balance the size of the effective Hilbert space and the accuracy of the simulation is the density matrix renormalization group (DMRG) algorithm, in which the run-time is dominated by the iterative diagonalization of the Hamilton operator. As the most time-dominant step of the diagonalization can be expressed as a list of dense matrix operations, the DMRG is an appealing candidate to fully utilize the computing power residing in novel kilo-processor architectures. In the paper a smart hybrid CPU-GPU implementation is presented, which exploits the power of both CPU and GPU and tolerates problems exceeding the GPU memory size. Furthermore, a new CUDA kernel has been designed for asymmetric matrix-vector multiplication to accelerate the rest of the diagonalization. Besides the evaluation of the GPU implementation, the practical limits of an FPGA implementation are also discussed.
      

      
      Computer hardware for radiologists: Part I
      PubMed Central
      Indrajit, IK; Alam, A
         2010-01-01
         Computers are an integral part of modern radiology practice. They are used in different radiology modalities to acquire, process, and postprocess imaging data. They have had a dramatic influence on contemporary radiology practice. Their impact has extended further with the emergence of Digital Imaging and Communications in Medicine (DICOM), Picture Archiving and Communication System (PACS), Radiology information system (RIS) technology, and Teleradiology. A basic overview of computer hardware relevant to radiology practice is presented here. The key hardware components in a computer are the motherboard, central processor unit (CPU), the chipset, the random access memory (RAM), the memory modules, bus, storage drives, and ports. The personnel computer (PC) has a rectangular case that contains important components called hardware, many of which are integrated circuits (ICs). The fiberglass motherboard is the main printed circuit board and has a variety of important hardware mounted on it, which are connected by electrical pathways called “buses”. The CPU is the largest IC on the motherboard and contains millions of transistors. Its principal function is to execute “programs”. A Pentium® 4 CPU has transistors that execute a billion instructions per second. The chipset is completely different from the CPU in design and function; it controls data and interaction of buses between the motherboard and the CPU. Memory (RAM) is fundamentally semiconductor chips storing data and instructions for access by a CPU. RAM is classified by storage capacity, access speed, data rate, and configuration. PMID:21042437
      

      
      Fast and high-order numerical algorithms for the solution of multidimensional nonlinear fractional Ginzburg-Landau equation
      NASA Astrophysics Data System (ADS)
      Mohebbi, Akbar
         2018-02-01
         In this paper we propose two fast and accurate numerical methods for the solution of multidimensional space fractional Ginzburg-Landau equation (FGLE). In the presented methods, to avoid solving a nonlinear system of algebraic equations and to increase the accuracy and efficiency of method, we split the complex problem into simpler sub-problems using the split-step idea. For a homogeneous FGLE, we propose a method which has fourth-order of accuracy in time component and spectral accuracy in space variable and for nonhomogeneous one, we introduce another scheme based on the Crank-Nicolson approach which has second-order of accuracy in time variable. Due to using the Fourier spectral method for fractional Laplacian operator, the resulting schemes are fully diagonal and easy to code. Numerical results are reported in terms of accuracy, computational order and CPU time to demonstrate the accuracy and efficiency of the proposed methods and to compare the results with the analytical solutions. The results show that the present methods are accurate and require low CPU time. It is illustrated that the numerical results are in good agreement with the theoretical ones.
      

      
      Far-field radiation patterns of aperture antennas by the Winograd Fourier transform algorithm
      NASA Technical Reports Server (NTRS)
      Heisler, R.
         1978-01-01
         A more time-efficient algorithm for computing the discrete Fourier transform, the Winograd Fourier transform (WFT), is described. The WFT algorithm is compared with other transform algorithms. Results indicate that the WFT algorithm in antenna analysis appears to be a very successful application. Significant savings in cpu time will improve the computer turn around time and circumvent the need to resort to weekend runs.
      

      
      Remote Control and Data Acquisition: A Case Study
      NASA Technical Reports Server (NTRS)
      DeGennaro, Alfred J.; Wilkinson, R. Allen
         2000-01-01
         This paper details software tools developed to remotely command experimental apparatus, and to acquire and visualize the associated data in soft real time. The work was undertaken because commercial products failed to meet the needs. This work has identified six key factors intrinsic to development of quality research laboratory software. Capabilities include access to all new instrument functions without any programming or dependence on others to write drivers or virtual instruments, simple full screen text-based experiment configuration and control user interface, months of continuous experiment run-times, order of 1% CPU load for condensed matter physics experiment described here, very little imposition of software tool choices on remote users, and total remote control from anywhere in the world over the Internet or from home on a 56 Kb modem as if the user is sitting in the laboratory. This work yielded a set of simple robust tools that are highly reliable, resource conserving, extensible, and versatile, with a uniform simple interface.
      

      
      Measurements of neuron soma size and density in rat dorsal striatum, nucleus accumbens core and nucleus accumbens shell: differences between striatal region and brain hemisphere, but not sex.
      PubMed
      Meitzen, John; Pflepsen, Kelsey R; Stern, Christopher M; Meisel, Robert L; Mermelstein, Paul G
         2011-01-07
         Both hemispheric bias and sex differences exist in striatal-mediated behaviors and pathologies. The extent to which these dimorphisms can be attributed to an underlying neuroanatomical difference is unclear. We therefore quantified neuron soma size and density in the dorsal striatum (CPu) as well as the core (AcbC) and shell (AcbS) subregions of the nucleus accumbens to determine whether these anatomical measurements differ by region, hemisphere, or sex in adult Sprague-Dawley rats. Neuron soma size was larger in the CPu than the AcbC or AcbS. Neuron density was greatest in the AcbS, intermediate in the AcbC, and least dense in the CPu. CPu neuron density was greater in the left in comparison to the right hemisphere. No attribute was sexually dimorphic. These results provide the first evidence that hemispheric bias in the striatum and striatal-mediated behaviors can be attributed to a lateralization in neuronal density within the CPu. In contrast, sexual dimorphisms appear mediated by factors other than gross anatomical differences. Copyright © 2010 Elsevier Ireland Ltd. All rights reserved.
      

      
      48 CFR 252.204-7011 - Alternative Line Item Structure.
      Code of Federal Regulations, 2011 CFR
      
         2011-10-01
         ... Unit Unit price Amount 0001 Computer, Desktop with CPU, Monitor, Keyboard and Mouse 20 EA Alternative... Unit Unit Price Amount 0001 Computer, Desktop with CPU, Keyboard and Mouse 20 EA 0002 Monitor 20 EA...
      

      
      48 CFR 252.204-7011 - Alternative Line Item Structure.
      Code of Federal Regulations, 2014 CFR
      
         2014-10-01
         ... Unit Unit price Amount 0001 Computer, Desktop with CPU, Monitor, Keyboard and Mouse 20 EA Alternative... Unit Unit Price Amount 0001 Computer, Desktop with CPU, Keyboard and Mouse 20 EA 0002 Monitor 20 EA...
      

      
      48 CFR 252.204-7011 - Alternative Line Item Structure.
      Code of Federal Regulations, 2012 CFR
      
         2012-10-01
         ... Unit Unit price Amount 0001 Computer, Desktop with CPU, Monitor, Keyboard and Mouse 20 EA Alternative... Unit Unit Price Amount 0001 Computer, Desktop with CPU, Keyboard and Mouse 20 EA 0002 Monitor 20 EA...
      

      
      48 CFR 252.204-7011 - Alternative Line Item Structure.
      Code of Federal Regulations, 2013 CFR
      
         2013-10-01
         ... Unit Unit price Amount 0001 Computer, Desktop with CPU, Monitor, Keyboard and Mouse 20 EA Alternative... Unit Unit Price Amount 0001 Computer, Desktop with CPU, Keyboard and Mouse 20 EA 0002 Monitor 20 EA...
      

      
      Toward GPGPU accelerated human electromechanical cardiac simulations
      PubMed Central
      Vigueras, Guillermo; Roy, Ishani; Cookson, Andrew; Lee, Jack; Smith, Nicolas; Nordsletten, David
         2014-01-01
         In this paper, we look at the acceleration of weakly coupled electromechanics using the graphics processing unit (GPU). Specifically, we port to the GPU a number of components of Heart—a CPU-based finite element code developed for simulating multi-physics problems. On the basis of a criterion of computational cost, we implemented on the GPU the ODE and PDE solution steps for the electrophysiology problem and the Jacobian and residual evaluation for the mechanics problem. Performance of the GPU implementation is then compared with single core CPU (SC) execution as well as multi-core CPU (MC) computations with equivalent theoretical performance. Results show that for a human scale left ventricle mesh, GPU acceleration of the electrophysiology problem provided speedups of 164 × compared with SC and 5.5 times compared with MC for the solution of the ODE model. Speedup of up to 72 × compared with SC and 2.6 × compared with MC was also observed for the PDE solve. Using the same human geometry, the GPU implementation of mechanics residual/Jacobian computation provided speedups of up to 44 × compared with SC and 2.0 × compared with MC. © 2013 The Authors. International Journal for Numerical Methods in Biomedical Engineering published by John Wiley & Sons, Ltd. PMID:24115492
      

      
      Fast CPU-based Monte Carlo simulation for radiotherapy dose calculation.
      PubMed
      Ziegenhein, Peter; Pirner, Sven; Ph Kamerling, Cornelis; Oelfke, Uwe
         2015-08-07
         Monte-Carlo (MC) simulations are considered to be the most accurate method for calculating dose distributions in radiotherapy. Its clinical application, however, still is limited by the long runtimes conventional implementations of MC algorithms require to deliver sufficiently accurate results on high resolution imaging data. In order to overcome this obstacle we developed the software-package PhiMC, which is capable of computing precise dose distributions in a sub-minute time-frame by leveraging the potential of modern many- and multi-core CPU-based computers. PhiMC is based on the well verified dose planning method (DPM). We could demonstrate that PhiMC delivers dose distributions which are in excellent agreement to DPM. The multi-core implementation of PhiMC scales well between different computer architectures and achieves a speed-up of up to 37[Formula: see text] compared to the original DPM code executed on a modern system. Furthermore, we could show that our CPU-based implementation on a modern workstation is between 1.25[Formula: see text] and 1.95[Formula: see text] faster than a well-known GPU implementation of the same simulation method on a NVIDIA Tesla C2050. Since CPUs work on several hundreds of GB RAM the typical GPU memory limitation does not apply for our implementation and high resolution clinical plans can be calculated.
      

      
      A Study on the Effectiveness of Lockup-Free Caches for a Reduced Instruction Set Computer (RISC) Processor
      DTIC Science & Technology
      
         1992-09-01
         to acquire or develop effective simulation tools to observe the behavior of a RISC implementation as it executes different types of programs . We choose...Performance Computer performance is measured by the amount of the time required to execute a program . Performance encompasses two types of time, elapsed time...and CPU time. Elapsed time is the time required to execute a program from start to finish. It includes latency of input/output activities such as
      

      
      Accelerating Large Scale Image Analyses on Parallel, CPU-GPU Equipped Systems
      PubMed Central
      Teodoro, George; Kurc, Tahsin M.; Pan, Tony; Cooper, Lee A.D.; Kong, Jun; Widener, Patrick; Saltz, Joel H.
         2014-01-01
         The past decade has witnessed a major paradigm shift in high performance computing with the introduction of accelerators as general purpose processors. These computing devices make available very high parallel computing power at low cost and power consumption, transforming current high performance platforms into heterogeneous CPU-GPU equipped systems. Although the theoretical performance achieved by these hybrid systems is impressive, taking practical advantage of this computing power remains a very challenging problem. Most applications are still deployed to either GPU or CPU, leaving the other resource under- or un-utilized. In this paper, we propose, implement, and evaluate a performance aware scheduling technique along with optimizations to make efficient collaborative use of CPUs and GPUs on a parallel system. In the context of feature computations in large scale image analysis applications, our evaluations show that intelligently co-scheduling CPUs and GPUs can significantly improve performance over GPU-only or multi-core CPU-only approaches. PMID:25419545
      

      
      Large-scale ground motion simulation using GPGPU
      NASA Astrophysics Data System (ADS)
      Aoi, S.; Maeda, T.; Nishizawa, N.; Aoki, T.
         2012-12-01
         Huge computation resources are required to perform large-scale ground motion simulations using 3-D finite difference method (FDM) for realistic and complex models with high accuracy. Furthermore, thousands of various simulations are necessary to evaluate the variability of the assessment caused by uncertainty of the assumptions of the source models for future earthquakes. To conquer the problem of restricted computational resources, we introduced the use of GPGPU (General purpose computing on graphics processing units) which is the technique of using a GPU as an accelerator of the computation which has been traditionally conducted by the CPU. We employed the CPU version of GMS (Ground motion Simulator; Aoi et al., 2004) as the original code and implemented the function for GPU calculation using CUDA (Compute Unified Device Architecture). GMS is a total system for seismic wave propagation simulation based on 3-D FDM scheme using discontinuous grids (Aoi&Fujiwara, 1999), which includes the solver as well as the preprocessor tools (parameter generation tool) and postprocessor tools (filter tool, visualization tool, and so on). The computational model is decomposed in two horizontal directions and each decomposed model is allocated to a different GPU. We evaluated the performance of our newly developed GPU version of GMS on the TSUBAME2.0 which is one of the Japanese fastest supercomputer operated by the Tokyo Institute of Technology. First we have performed a strong scaling test using the model with about 22 million grids and achieved 3.2 and 7.3 times of the speed-up by using 4 and 16 GPUs. Next, we have examined a weak scaling test where the model sizes (number of grids) are increased in proportion to the degree of parallelism (number of GPUs). The result showed almost perfect linearity up to the simulation with 22 billion grids using 1024 GPUs where the calculation speed reached to 79.7 TFlops and about 34 times faster than the CPU calculation using the same number of cores. Finally, we applied GPU calculation to the simulation of the 2011 Tohoku-oki earthquake. The model was constructed using a slip model from inversion of strong motion data (Suzuki et al., 2012), and a geological- and geophysical-based velocity structure model comprising all the Tohoku and Kanto regions as well as the large source area, which consists of about 1.9 billion grids. The overall characteristics of observed velocity seismograms for a longer period than range of 8 s were successfully reproduced (Maeda et al., 2012 AGU meeting). The turn around time for 50 thousand-step calculation (which correspond to 416 s in seismograph) using 100 GPUs was 52 minutes which is fairly short, especially considering this is the performance for the realistic and complex model.
      

      
      Rapid and semi-analytical design and simulation of a toroidal magnet made with YBCO and MgB 2 superconductors
      DOE PAGES
      Dimitrov, I. K.; Zhang, X.; Solovyov, V. F.; ...
         2015-07-07
         Recent advances in second-generation (YBCO) high-temperature superconducting wire could potentially enable the design of super high performance energy storage devices that combine the high energy density of chemical storage with the high power of superconducting magnetic storage. However, the high aspect ratio and the considerable filament size of these wires require the concomitant development of dedicated optimization methods that account for the critical current density in type-II superconductors. In this study, we report on the novel application and results of a CPU-efficient semianalytical computer code based on the Radia 3-D magnetostatics software package. Our algorithm is used to simulate andmore » optimize the energy density of a superconducting magnetic energy storage device model, based on design constraints, such as overall size and number of coils. The rapid performance of the code is pivoted on analytical calculations of the magnetic field based on an efficient implementation of the Biot-Savart law for a large variety of 3-D “base” geometries in the Radia package. The significantly reduced CPU time and simple data input in conjunction with the consideration of realistic input variables, such as material-specific, temperature, and magnetic-field-dependent critical current densities, have enabled the Radia-based algorithm to outperform finite-element approaches in CPU time at the same accuracy levels. Comparative simulations of MgB 2 and YBCO-based devices are performed at 4.2 K, in order to ascertain the realistic efficiency of the design configurations.« less
      

      
      Robotic goalie with 3 ms reaction time at 4% CPU load using event-based dynamic vision sensor
      PubMed Central
      Delbruck, Tobi; Lang, Manuel
         2013-01-01
         Conventional vision-based robotic systems that must operate quickly require high video frame rates and consequently high computational costs. Visual response latencies are lower-bound by the frame period, e.g., 20 ms for 50 Hz frame rate. This paper shows how an asynchronous neuromorphic dynamic vision sensor (DVS) silicon retina is used to build a fast self-calibrating robotic goalie, which offers high update rates and low latency at low CPU load. Independent and asynchronous per pixel illumination change events from the DVS signify moving objects and are used in software to track multiple balls. Motor actions to block the most “threatening” ball are based on measured ball positions and velocities. The goalie also sees its single-axis goalie arm and calibrates the motor output map during idle periods so that it can plan open-loop arm movements to desired visual locations. Blocking capability is about 80% for balls shot from 1 m from the goal even with the fastest-shots, and approaches 100% accuracy when the ball does not beat the limits of the servo motor to move the arm to the necessary position in time. Running with standard USB buses under a standard preemptive multitasking operating system (Windows), the goalie robot achieves median update rates of 550 Hz, with latencies of 2.2 ± 2 ms from ball movement to motor command at a peak CPU load of less than 4%. Practical observations and measurements of USB device latency are provided1. PMID:24311999
      

      
      Brain and behaviour phenotyping of a mouse model of neurofibromatosis type-1: an MRI/DTI study on social cognition.
      PubMed
      Petrella, L I; Cai, Y; Sereno, J V; Gonçalves, S I; Silva, A J; Castelo-Branco, M
         2016-09-01
         Neurofibromatosis type-1 (NF1) is a common neurogenetic disorder and an important cause of intellectual disability. Brain-behaviour associations can be examined in vivo using morphometric magnetic resonance imaging (MRI) and diffusion tensor imaging (DTI) to study brain structure. Here, we studied structural and behavioural phenotypes in heterozygous Nf1 mice (Nf1(+/-) ) using T2-weighted imaging MRI and DTI, with a focus on social recognition deficits. We found that Nf1(+/-) mice have larger volumes than wild-type (WT) mice in regions of interest involved in social cognition, the prefrontal cortex (PFC) and the caudate-putamen (CPu). Higher diffusivity was found across a distributed network of cortical and subcortical brain regions, within and beyond these regions. Significant differences were observed for the social recognition test. Most importantly, significant structure-function correlations were identified concerning social recognition performance and PFC volumes in Nf1(+/-) mice. Analyses of spatial learning corroborated the previously known deficits in the mutant mice, as corroborated by platform crossings, training quadrant time and average proximity measures. Moreover, linear discriminant analysis of spatial performance identified 2 separate sub-groups in Nf1(+/-) mice. A significant correlation between quadrant time and CPu volumes was found specifically for the sub-group of Nf1(+/-) mice with lower spatial learning performance, suggesting additional evidence for reorganization of this region. We found strong evidence that social and spatial cognition deficits can be associated with PFC/CPu structural changes and reorganization in NF1. © 2016 John Wiley & Sons Ltd and International Behavioural and Neural Genetics Society.
      

        
       
          

«

13
      14
      15
   16
      17
      »

          
        

     

   

   
       
            
              
          

«

14
      15
      16
   17
      18
      »

          
        

           
           
             
               
      
      SU-E-T-493: Accelerated Monte Carlo Methods for Photon Dosimetry Using a Dual-GPU System and CUDA.
      PubMed
      Liu, T; Ding, A; Xu, X
         2012-06-01
         To develop a Graphics Processing Unit (GPU) based Monte Carlo (MC) code that accelerates dose calculations on a dual-GPU system. We simulated a clinical case of prostate cancer treatment. A voxelized abdomen phantom derived from 120 CT slices was used containing 218×126×60 voxels, and a GE LightSpeed 16-MDCT scanner was modeled. A CPU version of the MC code was first developed in C++ and tested on Intel Xeon X5660 2.8GHz CPU, then it was translated into GPU version using CUDA C 4.1 and run on a dual Tesla m 2 090 GPU system. The code was featured with automatic assignment of simulation task to multiple GPUs, as well as accurate calculation of energy- and material- dependent cross-sections. Double-precision floating point format was used for accuracy. Doses to the rectum, prostate, bladder and femoral heads were calculated. When running on a single GPU, the MC GPU code was found to be ×19 times faster than the CPU code and ×42 times faster than MCNPX. These speedup factors were doubled on the dual-GPU system. The dose Result was benchmarked against MCNPX and a maximum difference of 1% was observed when the relative error is kept below 0.1%. A GPU-based MC code was developed for dose calculations using detailed patient and CT scanner models. Efficiency and accuracy were both guaranteed in this code. Scalability of the code was confirmed on the dual-GPU system. © 2012 American Association of Physicists in Medicine.
      

      
      GPU acceleration towards real-time image reconstruction in 3D tomographic diffractive microscopy
      NASA Astrophysics Data System (ADS)
      Bailleul, J.; Simon, B.; Debailleul, M.; Liu, H.; Haeberlé, O.
         2012-06-01
         Phase microscopy techniques regained interest in allowing for the observation of unprepared specimens with excellent temporal resolution. Tomographic diffractive microscopy is an extension of holographic microscopy which permits 3D observations with a finer resolution than incoherent light microscopes. Specimens are imaged by a series of 2D holograms: their accumulation progressively fills the range of frequencies of the specimen in Fourier space. A 3D inverse FFT eventually provides a spatial image of the specimen. Consequently, acquisition then reconstruction are mandatory to produce an image that could prelude real-time control of the observed specimen. The MIPS Laboratory has built a tomographic diffractive microscope with an unsurpassed 130nm resolution but a low imaging speed - no less than one minute. Afterwards, a high-end PC reconstructs the 3D image in 20 seconds. We now expect an interactive system providing preview images during the acquisition for monitoring purposes. We first present a prototype implementing this solution on CPU: acquisition and reconstruction are tied in a producer-consumer scheme, sharing common data into CPU memory. Then we present a prototype dispatching some reconstruction tasks to GPU in order to take advantage of SIMDparallelization for FFT and higher bandwidth for filtering operations. The CPU scheme takes 6 seconds for a 3D image update while the GPU scheme can go down to 2 or > 1 seconds depending on the GPU class. This opens opportunities for 4D imaging of living organisms or crystallization processes. We also consider the relevance of GPU for 3D image interaction in our specific conditions.
      

      
      Multi-Threaded Algorithms for GPGPU in the ATLAS High Level Trigger
      NASA Astrophysics Data System (ADS)
      Conde Muíño, P.; ATLAS Collaboration
         2017-10-01
         General purpose Graphics Processor Units (GPGPU) are being evaluated for possible future inclusion in an upgraded ATLAS High Level Trigger farm. We have developed a demonstrator including GPGPU implementations of Inner Detector and Muon tracking and Calorimeter clustering within the ATLAS software framework. ATLAS is a general purpose particle physics experiment located on the LHC collider at CERN. The ATLAS Trigger system consists of two levels, with Level-1 implemented in hardware and the High Level Trigger implemented in software running on a farm of commodity CPU. The High Level Trigger reduces the trigger rate from the 100 kHz Level-1 acceptance rate to 1.5 kHz for recording, requiring an average per-event processing time of ∼ 250 ms for this task. The selection in the high level trigger is based on reconstructing tracks in the Inner Detector and Muon Spectrometer and clusters of energy deposited in the Calorimeter. Performing this reconstruction within the available farm resources presents a significant challenge that will increase significantly with future LHC upgrades. During the LHC data taking period starting in 2021, luminosity will reach up to three times the original design value. Luminosity will increase further to 7.5 times the design value in 2026 following LHC and ATLAS upgrades. Corresponding improvements in the speed of the reconstruction code will be needed to provide the required trigger selection power within affordable computing resources. Key factors determining the potential benefit of including GPGPU as part of the HLT processor farm are: the relative speed of the CPU and GPGPU algorithm implementations; the relative execution times of the GPGPU algorithms and serial code remaining on the CPU; the number of GPGPU required, and the relative financial cost of the selected GPGPU. We give a brief overview of the algorithms implemented and present new measurements that compare the performance of various configurations exploiting GPGPU cards.
      

      
      WARP3D-Release 10.8: Dynamic Nonlinear Analysis of Solids using a Preconditioned Conjugate Gradient Software Architecture
      NASA Technical Reports Server (NTRS)
      Koppenhoefer, Kyle C.; Gullerud, Arne S.; Ruggieri, Claudio; Dodds, Robert H., Jr.; Healy, Brian E.
         1998-01-01
         This report describes theoretical background material and commands necessary to use the WARP3D finite element code. WARP3D is under continuing development as a research code for the solution of very large-scale, 3-D solid models subjected to static and dynamic loads. Specific features in the code oriented toward the investigation of ductile fracture in metals include a robust finite strain formulation, a general J-integral computation facility (with inertia, face loading), an element extinction facility to model crack growth, nonlinear material models including viscoplastic effects, and the Gurson-Tver-gaard dilatant plasticity model for void growth. The nonlinear, dynamic equilibrium equations are solved using an incremental-iterative, implicit formulation with full Newton iterations to eliminate residual nodal forces. The history integration of the nonlinear equations of motion is accomplished with Newmarks Beta method. A central feature of WARP3D involves the use of a linear-preconditioned conjugate gradient (LPCG) solver implemented in an element-by-element format to replace a conventional direct linear equation solver. This software architecture dramatically reduces both the memory requirements and CPU time for very large, nonlinear solid models since formation of the assembled (dynamic) stiffness matrix is avoided. Analyses thus exhibit the numerical stability for large time (load) steps provided by the implicit formulation coupled with the low memory requirements characteristic of an explicit code. In addition to the much lower memory requirements of the LPCG solver, the CPU time required for solution of the linear equations during each Newton iteration is generally one-half or less of the CPU time required for a traditional direct solver. All other computational aspects of the code (element stiffnesses, element strains, stress updating, element internal forces) are implemented in the element-by- element, blocked architecture. This greatly improves vectorization of the code on uni-processor hardware and enables straightforward parallel-vector processing of element blocks on multi-processor hardware.
      

      
      Managing Contention and Timing Constraints in a Real-Time Database System
      DTIC Science & Technology
      
         1995-01-01
         In order to realize many of these goals, StarBase is constructed on top of RT-Mach, a real - time operating system developed at Carnegie Mellon...University [ll]. StarBase differs from previous RT-DBMS work [l, 2, 31 in that a) it relies on a real - time operating system which provides priority...CPU and resource scheduling pro- vided by tlhe underlying real - time operating system . Issues of data contention are dealt with by use of a priority
      

      
      Multi-GPU implementation of a VMAT treatment plan optimization algorithm
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Tian, Zhen, E-mail: Zhen.Tian@UTSouthwestern.edu, E-mail: Xun.Jia@UTSouthwestern.edu, E-mail: Steve.Jiang@UTSouthwestern.edu; Folkerts, Michael; Tan, Jun
         
         Purpose: Volumetric modulated arc therapy (VMAT) optimization is a computationally challenging problem due to its large data size, high degrees of freedom, and many hardware constraints. High-performance graphics processing units (GPUs) have been used to speed up the computations. However, GPU’s relatively small memory size cannot handle cases with a large dose-deposition coefficient (DDC) matrix in cases of, e.g., those with a large target size, multiple targets, multiple arcs, and/or small beamlet size. The main purpose of this paper is to report an implementation of a column-generation-based VMAT algorithm, previously developed in the authors’ group, on a multi-GPU platform tomore » solve the memory limitation problem. While the column-generation-based VMAT algorithm has been previously developed, the GPU implementation details have not been reported. Hence, another purpose is to present detailed techniques employed for GPU implementation. The authors also would like to utilize this particular problem as an example problem to study the feasibility of using a multi-GPU platform to solve large-scale problems in medical physics. Methods: The column-generation approach generates VMAT apertures sequentially by solving a pricing problem (PP) and a master problem (MP) iteratively. In the authors’ method, the sparse DDC matrix is first stored on a CPU in coordinate list format (COO). On the GPU side, this matrix is split into four submatrices according to beam angles, which are stored on four GPUs in compressed sparse row format. Computation of beamlet price, the first step in PP, is accomplished using multi-GPUs. A fast inter-GPU data transfer scheme is accomplished using peer-to-peer access. The remaining steps of PP and MP problems are implemented on CPU or a single GPU due to their modest problem scale and computational loads. Barzilai and Borwein algorithm with a subspace step scheme is adopted here to solve the MP problem. A head and neck (H and N) cancer case is then used to validate the authors’ method. The authors also compare their multi-GPU implementation with three different single GPU implementation strategies, i.e., truncating DDC matrix (S1), repeatedly transferring DDC matrix between CPU and GPU (S2), and porting computations involving DDC matrix to CPU (S3), in terms of both plan quality and computational efficiency. Two more H and N patient cases and three prostate cases are used to demonstrate the advantages of the authors’ method. Results: The authors’ multi-GPU implementation can finish the optimization process within ∼1 min for the H and N patient case. S1 leads to an inferior plan quality although its total time was 10 s shorter than the multi-GPU implementation due to the reduced matrix size. S2 and S3 yield the same plan quality as the multi-GPU implementation but take ∼4 and ∼6 min, respectively. High computational efficiency was consistently achieved for the other five patient cases tested, with VMAT plans of clinically acceptable quality obtained within 23–46 s. Conversely, to obtain clinically comparable or acceptable plans for all six of these VMAT cases that the authors have tested in this paper, the optimization time needed in a commercial TPS system on CPU was found to be in an order of several minutes. Conclusions: The results demonstrate that the multi-GPU implementation of the authors’ column-generation-based VMAT optimization can handle the large-scale VMAT optimization problem efficiently without sacrificing plan quality. The authors’ study may serve as an example to shed some light on other large-scale medical physics problems that require multi-GPU techniques.« less
      

      
      A graphics-card implementation of Monte-Carlo simulations for cosmic-ray transport
      NASA Astrophysics Data System (ADS)
      Tautz, R. C.
         2016-05-01
         A graphics card implementation of a test-particle simulation code is presented that is based on the CUDA extension of the C/C++ programming language. The original CPU version has been developed for the calculation of cosmic-ray diffusion coefficients in artificial Kolmogorov-type turbulence. In the new implementation, the magnetic turbulence generation, which is the most time-consuming part, is separated from the particle transport and is performed on a graphics card. In this article, the modification of the basic approach of integrating test particle trajectories to employ the SIMD (single instruction, multiple data) model is presented and verified. The efficiency of the new code is tested and several language-specific accelerating factors are discussed. For the example of isotropic magnetostatic turbulence, sample results are shown and a comparison to the results of the CPU implementation is performed.
      

      
      Mitigating the Insider Threat with High-Dimensional Anomaly Detection
      DTIC Science & Technology
      
         2004-12-01
         a more serious attack. Various systems such as NSM [56], GrIDS [57], snort [58], Emerald [59], and Spice [60] generate alerts for portscan...reboot etc. The user measurements include the user profiles such as time of login , duration of user session, cumulative CPU time, names of files...already been implemented in a real-time system for information retrieval [3]. A technique developed at SRI in the Emerald system [22] uses historical
      

      
      Meeting the Challenge of Distributed Real-Time & Embedded (DRE) Systems
      DTIC Science & Technology
      
         2012-05-10
         IP RTOS Middleware Middleware Services DRE Applications Operating Sys & Protocols Hardware & Networks Middleware Middleware Services DRE...Services COTS & standards-based middleware, language, OS , network, & hardware platforms • Real-time CORBA (TAO) middleware • ADAPTIVE Communication...SPLs) F-15 product variant A/V 8-B product variant F/A 18 product variant UCAV product variant Software Produce-Line Hardware (CPU, Memory, I/O) OS
      

      
      Parallel hyperbolic PDE simulation on clusters: Cell versus GPU
      NASA Astrophysics Data System (ADS)
      Rostrup, Scott; De Sterck, Hans
         2010-12-01
         Increasingly, high-performance computing is looking towards data-parallel computational devices to enhance computational performance. Two technologies that have received significant attention are IBM's Cell Processor and NVIDIA's CUDA programming model for graphics processing unit (GPU) computing. In this paper we investigate the acceleration of parallel hyperbolic partial differential equation simulation on structured grids with explicit time integration on clusters with Cell and GPU backends. The message passing interface (MPI) is used for communication between nodes at the coarsest level of parallelism. Optimizations of the simulation code at the several finer levels of parallelism that the data-parallel devices provide are described in terms of data layout, data flow and data-parallel instructions. Optimized Cell and GPU performance are compared with reference code performance on a single x86 central processing unit (CPU) core in single and double precision. We further compare the CPU, Cell and GPU platforms on a chip-to-chip basis, and compare performance on single cluster nodes with two CPUs, two Cell processors or two GPUs in a shared memory configuration (without MPI). We finally compare performance on clusters with 32 CPUs, 32 Cell processors, and 32 GPUs using MPI. Our GPU cluster results use NVIDIA Tesla GPUs with GT200 architecture, but some preliminary results on recently introduced NVIDIA GPUs with the next-generation Fermi architecture are also included. This paper provides computational scientists and engineers who are considering porting their codes to accelerator environments with insight into how structured grid based explicit algorithms can be optimized for clusters with Cell and GPU accelerators. It also provides insight into the speed-up that may be gained on current and future accelerator architectures for this class of applications. Program summaryProgram title: SWsolver Catalogue identifier: AEGY_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEGY_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: GPL v3 No. of lines in distributed program, including test data, etc.: 59 168 No. of bytes in distributed program, including test data, etc.: 453 409 Distribution format: tar.gz Programming language: C, CUDA Computer: Parallel Computing Clusters. Individual compute nodes may consist of x86 CPU, Cell processor, or x86 CPU with attached NVIDIA GPU accelerator. Operating system: Linux Has the code been vectorised or parallelized?: Yes. Tested on 1-128 x86 CPU cores, 1-32 Cell Processors, and 1-32 NVIDIA GPUs. RAM: Tested on Problems requiring up to 4 GB per compute node. Classification: 12 External routines: MPI, CUDA, IBM Cell SDK Nature of problem: MPI-parallel simulation of Shallow Water equations using high-resolution 2D hyperbolic equation solver on regular Cartesian grids for x86 CPU, Cell Processor, and NVIDIA GPU using CUDA. Solution method: SWsolver provides 3 implementations of a high-resolution 2D Shallow Water equation solver on regular Cartesian grids, for CPU, Cell Processor, and NVIDIA GPU. Each implementation uses MPI to divide work across a parallel computing cluster. Additional comments: Sub-program numdiff is used for the test run.
      

      
      GPU implementation of the linear scaling three dimensional fragment method for large scale electronic structure calculations
      NASA Astrophysics Data System (ADS)
      Jia, Weile; Wang, Jue; Chi, Xuebin; Wang, Lin-Wang
         2017-02-01
         LS3DF, namely linear scaling three-dimensional fragment method, is an efficient linear scaling ab initio total energy electronic structure calculation code based on a divide-and-conquer strategy. In this paper, we present our GPU implementation of the LS3DF code. Our test results show that the GPU code can calculate systems with about ten thousand atoms fully self-consistently in the order of 10 min using thousands of computing nodes. This makes the electronic structure calculations of 10,000-atom nanosystems routine work. This speed is 4.5-6 times faster than the CPU calculations using the same number of nodes on the Titan machine in the Oak Ridge leadership computing facility (OLCF). Such speedup is achieved by (a) carefully re-designing of the computationally heavy kernels; (b) redesign of the communication pattern for heterogeneous supercomputers.
      

      
      Interactivity vs. fairness in networked linux systems
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Wu, Wenji; Crawford, Matt; /Fermilab
         
         In general, the Linux 2.6 scheduler can ensure fairness and provide excellent interactive performance at the same time. However, our experiments and mathematical analysis have shown that the current Linux interactivity mechanism tends to incorrectly categorize non-interactive network applications as interactive, which can lead to serious fairness or starvation issues. In the extreme, a single process can unjustifiably obtain up to 95% of the CPU! The root cause is due to the facts that: (1) network packets arrive at the receiver independently and discretely, and the 'relatively fast' non-interactive network process might frequently sleep to wait for packet arrival. Thoughmore » each sleep lasts for a very short period of time, the wait-for-packet sleeps occur so frequently that they lead to interactive status for the process. (2) The current Linux interactivity mechanism provides the possibility that a non-interactive network process could receive a high CPU share, and at the same time be incorrectly categorized as 'interactive.' In this paper, we propose and test a possible solution to address the interactivity vs. fairness problems. Experiment results have proved the effectiveness of the proposed solution.« less
      

      
      GPU-Meta-Storms: computing the structure similarities among massive amount of microbial community samples using GPU.
      PubMed
      Su, Xiaoquan; Wang, Xuetao; Jing, Gongchao; Ning, Kang
         2014-04-01
         The number of microbial community samples is increasing with exponential speed. Data-mining among microbial community samples could facilitate the discovery of valuable biological information that is still hidden in the massive data. However, current methods for the comparison among microbial communities are limited by their ability to process large amount of samples each with complex community structure. We have developed an optimized GPU-based software, GPU-Meta-Storms, to efficiently measure the quantitative phylogenetic similarity among massive amount of microbial community samples. Our results have shown that GPU-Meta-Storms would be able to compute the pair-wise similarity scores for 10 240 samples within 20 min, which gained a speed-up of >17 000 times compared with single-core CPU, and >2600 times compared with 16-core CPU. Therefore, the high-performance of GPU-Meta-Storms could facilitate in-depth data mining among massive microbial community samples, and make the real-time analysis and monitoring of temporal or conditional changes for microbial communities possible. GPU-Meta-Storms is implemented by CUDA (Compute Unified Device Architecture) and C++. Source code is available at http://www.computationalbioenergy.org/meta-storms.html.
      

      
      Hotspot detection using image pattern recognition based on higher-order local auto-correlation
      NASA Astrophysics Data System (ADS)
      Maeda, Shimon; Matsunawa, Tetsuaki; Ogawa, Ryuji; Ichikawa, Hirotaka; Takahata, Kazuhiro; Miyairi, Masahiro; Kotani, Toshiya; Nojima, Shigeki; Tanaka, Satoshi; Nakagawa, Kei; Saito, Tamaki; Mimotogi, Shoji; Inoue, Soichi; Nosato, Hirokazu; Sakanashi, Hidenori; Kobayashi, Takumi; Murakawa, Masahiro; Higuchi, Tetsuya; Takahashi, Eiichi; Otsu, Nobuyuki
         2011-04-01
         Below 40nm design node, systematic variation due to lithography must be taken into consideration during the early stage of design. So far, litho-aware design using lithography simulation models has been widely applied to assure that designs are printed on silicon without any error. However, the lithography simulation approach is very time consuming, and under time-to-market pressure, repetitive redesign by this approach may result in the missing of the market window. This paper proposes a fast hotspot detection support method by flexible and intelligent vision system image pattern recognition based on Higher-Order Local Autocorrelation. Our method learns the geometrical properties of the given design data without any defects as normal patterns, and automatically detects the design patterns with hotspots from the test data as abnormal patterns. The Higher-Order Local Autocorrelation method can extract features from the graphic image of design pattern, and computational cost of the extraction is constant regardless of the number of design pattern polygons. This approach can reduce turnaround time (TAT) dramatically only on 1CPU, compared with the conventional simulation-based approach, and by distributed processing, this has proven to deliver linear scalability with each additional CPU.
      

      
      Intact intracortical microstimulation (ICMS) representations of rostral and caudal forelimb areas in rats with quinolinic acid lesions of the medial or lateral caudate-putamen in an animal model of Huntington's disease.
      PubMed
      Karl, Jenni M; Sacrey, Lori-Ann R; McDonald, Robert J; Whishaw, Ian Q
         2008-09-05
         Neurotoxic, cell-specific lesions of the rat caudate-putamen (CPu) have been proposed as a model of human Huntington's disease and as such impair performance on many motor tasks, including skilled forelimbs tasks such as reaching for food. Because the CPu and motor cortex share reciprocal connections, it has been proposed that the motor deficits are due in part to a secondary disruption of motor cortex. The purpose of the present study was to examine the functionality of the motor cortex using intracortical microstimulation (ICMS) following neurotoxic lesions of the CPu. ICMS maps have been shown to be sensitive indicators of motor skill, cortical injury, learning, and experience. Long-evans hooded rats received a sham, a medial, or a lateral CPu lesion using the neurotoxin, quinolinic acid (2,3-pyridinedicarboxylic acid). Two weeks later the motor cortex was stimulated under light ketamine anesthesia. Neither lateral nor medial lesions of the CPu altered the stimulation threshold for eliciting forelimb movements, the type of movements elicited, or the size of the rostral forelimb (RFA) and caudal forelimb areas (CFA) from which movements were elicited. The preservation of ICMS forelimb movement representations (the forelimb map) in rats with cell-specific CPu lesions suggests motor impairments following lesions of the lateral striatum are not due to the disruption of the motor map. Therefore, the impairments that follow striatal cell loss are due either to alterations in circuitry that is independent of motor cortex or to alterations in circuitry afferent to the motor cortex projections.
      

      
      Energy Efficiency Evaluation and Benchmarking of AFRL’s Condor High Performance Computer
      DTIC Science & Technology
      
         2011-08-01
         AUG 2011 2. REPORT TYPE CONFERENCE PAPER (Post Print) 3. DATES COVERED (From - To) JAN 2011 – JUN 2011 4 . TITLE AND SUBTITLE ENERGY EFFICIENCY...1716 Sony PlayStation 3s (PS3s), adding up to a total of 69,940 cores and a theoretical peak performance of 500 TFLOPS. There are 84 subcluster head...Thus, a critical component to achieving maximum performance is to find the optimum division of processing load between the CPU and GPU. 4 The
      

      
      A Framework to Analyze the Performance of Load Balancing Schemes for Ensembles of Stochastic Simulations
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Ahn, Tae-Hyuk; Sandu, Adrian; Watson, Layne T.
         2015-08-01
         Ensembles of simulations are employed to estimate the statistics of possible future states of a system, and are widely used in important applications such as climate change and biological modeling. Ensembles of runs can naturally be executed in parallel. However, when the CPU times of individual simulations vary considerably, a simple strategy of assigning an equal number of tasks per processor can lead to serious work imbalances and low parallel efficiency. This paper presents a new probabilistic framework to analyze the performance of dynamic load balancing algorithms for ensembles of simulations where many tasks are mapped onto each processor, andmore » where the individual compute times vary considerably among tasks. Four load balancing strategies are discussed: most-dividing, all-redistribution, random-polling, and neighbor-redistribution. Simulation results with a stochastic budding yeast cell cycle model are consistent with the theoretical analysis. It is especially significant that there is a provable global decrease in load imbalance for the local rebalancing algorithms due to scalability concerns for the global rebalancing algorithms. The overall simulation time is reduced by up to 25 %, and the total processor idle time by 85 %.« less
      

      
      Collateral projections of nucleus raphe dorsalis neurones to the caudate-putamen and region around the nucleus raphe magnus and nucleus reticularis gigantocellularis pars alpha in the rat.
      PubMed
      Li, Y Q; Kaneko, T; Mizuno, N
         2001-02-16
         It was examined whether or not the nucleus raphe dorsalis (RD) neurons projecting to the caudate-putamen (CPu) might also project to the motor-controlling region around the nucleus raphe magnus (NRM) and nucleus reticularis gigantocellularis pars alpha (Gia) in the rat. Single RD neurons projecting to the CPu and NRM/Gia by way of axon collaterals were identified by the retrograde double-labeling method with fluorescent dyes, Fast Blue and Diamidino Yellow, which were injected respectively into the CPu and NRM/Gia. Then, serotonin (5-HT)-like immunoreactivity of the double-labeled RD neurons was examined immunohistochemically; approximately 60% of the double-labeled RD neurons showed 5-HT-like immunoreactivity. The results indicated that some of serotonergic and non-serotonergic RD neurons might control motor functions simultaneously at the levels of the CPu and NRM/Gia by way of axon collaterals.
      

      
      An evaluation of superminicomputers for thermal analysis
      NASA Technical Reports Server (NTRS)
      Storaasli, O. O.; Vidal, J. B.; Jones, G. K.
         1982-01-01
         The use of superminicomputers for solving a series of increasingly complex thermal analysis problems is investigated. The approach involved (1) installation and verification of the SPAR thermal analyzer software on superminicomputers at Langley Research Center and Goddard Space Flight Center, (2) solution of six increasingly complex thermal problems on this equipment, and (3) comparison of solution (accuracy, CPU time, turnaround time, and cost) with solutions on large mainframe computers.
      

      
      Real Time Control of the SSC String Magnets
      NASA Astrophysics Data System (ADS)
      Calvo, O.; Flora, R.; MacPherson, M.
         1987-08-01
         The system described in this paper, called SECAR, was designed to control the excitation of a test string of magnets for the proposed Superconducting Super Collider (SSC) and will be used to upgrade the present Tevatron Excitation, Control and Regulation (TECAR) hardware and software . It resides in a VME crate and is controlled by a 68020/68881 based CPU running the application software under a real time operating system named VRTX.
      

        
       
          

«

14
      15
      16
   17
      18
      »

          
        

     

   

   
       
            
              
          

«

15
      16
      17
   18
      19
      »

          
        

           
           
             
               
      
      Real time control of the SSC string magnets
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Calvo, O.; Flora, R.; MacPherson, M.
         1987-08-01
         The system described in this paper, called SECAR, was designed to control the excitation of a test string of magnets for the proposed Superconducting Super Collider (SSC) and will be used to upgrade the present Tevatron Excitation, Control and Regulation (TECAR) hardware and software. It resides in a VME orate and is controlled by a 68020/68881 based CPU running the application software under a real time operating system named VRTX.
      

      
      Adaptive Multilevel Middleware for Object Systems
      DTIC Science & Technology
      
         2006-12-01
         the system at the system-call level or using the CORBA-standard Extensible Transport Framework ( ETF ). Transparent insertion is highly desirable from an...often as it needs to. This is remedied by using the real-time scheduling class in a stock Linux kernel. We used schedsetscheduler system call (with...real-time scheduling class (SCHEDFIFO) for all the ML-NFD programs, later experiments with CPU load indicate that a stock Linux kernel is not
      

      
      GPU-based prompt gamma ray imaging from boron neutron capture therapy.
      PubMed
      Yoon, Do-Kun; Jung, Joo-Young; Jo Hong, Key; Sil Lee, Keum; Suk Suh, Tae
         2015-01-01
         The purpose of this research is to perform the fast reconstruction of a prompt gamma ray image using a graphics processing unit (GPU) computation from boron neutron capture therapy (BNCT) simulations. To evaluate the accuracy of the reconstructed image, a phantom including four boron uptake regions (BURs) was used in the simulation. After the Monte Carlo simulation of the BNCT, the modified ordered subset expectation maximization reconstruction algorithm using the GPU computation was used to reconstruct the images with fewer projections. The computation times for image reconstruction were compared between the GPU and the central processing unit (CPU). Also, the accuracy of the reconstructed image was evaluated by a receiver operating characteristic (ROC) curve analysis. The image reconstruction time using the GPU was 196 times faster than the conventional reconstruction time using the CPU. For the four BURs, the area under curve values from the ROC curve were 0.6726 (A-region), 0.6890 (B-region), 0.7384 (C-region), and 0.8009 (D-region). The tomographic image using the prompt gamma ray event from the BNCT simulation was acquired using the GPU computation in order to perform a fast reconstruction during treatment. The authors verified the feasibility of the prompt gamma ray image reconstruction using the GPU computation for BNCT simulations.
      

      
      The cell adhesion molecule L1 regulates the expression of choline acetyltransferase and the development of septal cholinergic neurons
      PubMed Central
      Cui, Xuezhi; Weng, Ying-Qi; Frappé, Isabelle; Burgess, Alison; Girão da Cruz, M Teresa; Schachner, Melitta; Aubert, Isabelle
         2011-01-01
         Mutations in the L1 gene cause severe brain malformations and mental retardation. We investigated the potential roles of L1 in the regulation of choline acetyltransferase (ChAT) and in the development of septal cholinergic neurons, which are known to project to the hippocampus and play key roles in cognitive functions. Using stereological approaches, we detected significantly fewer ChAT-positive cholinergic neurons in the medial septum and vertical limb of the diagonal band of Broca (MS/VDB) of 2-week-old L1-deficient mice compared to wild-type littermates (1644 ± 137 vs. 2051 ± 165, P = 0.038). ChAT protein levels in the septum were 53% lower in 2-week-old L1-deficient mice compared to wild-type littermates. ChAT activity in the septum was significantly reduced in L1-deficient mice compared to wild-type littermates at 1 (34%) and 2 (40%) weeks of age. In vitro, increasing doses of L1-Fc induced ChAT activity in septal neurons with a significant linear trend (*P = 0.0065). At 4 weeks of age in the septum and at all time points investigated in the caudate-putamen (CPu), the number of ChAT-positive neurons and the levels of ChAT activity were not statistically different between L1-deficient mice and wild-type littermates. The total number of cells positive for the neuronal nuclear antigen (NeuN) in the MS/VDB and CPu was not statistically different in L1-deficient mice compared to wild-type littermates, and comparable expression of the cell cycle marker Ki67 was observed. Our results indicate that L1 is required for the timely maturation of septal cholinergic neurons and that L1 promotes the expression and activity of ChAT in septal neurons. PMID:22399087
      

      
      Machine-Aided Indexing of Technical Literature
      ERIC Educational Resources Information Center
      Klingbiel, Paul H.
         1973-01-01
         To index at the Defense Documentation Center (DDC), an automated system must choose single words or phrases rapidly and economically. Automation of DDC's indexing has been machine-aided from its inception. A machine-aided indexing system is described that indexes one million words of text per hour of CPU time. (22 references) (Author/SJ)
      

      
      An Adaptive Priority Tuning System for Optimized Local CPU Scheduling using BOINC Clients
      NASA Astrophysics Data System (ADS)
      Mnaouer, Adel B.; Ragoonath, Colin
         2010-11-01
         Volunteer Computing (VC) is a Distributed Computing model which utilizes idle CPU cycles from computing resources donated by volunteers who are connected through the Internet to form a very large-scale, loosely coupled High Performance Computing environment. Distributed Volunteer Computing environments such as the BOINC framework is concerned mainly with the efficient scheduling of the available resources to the applications which require them. The BOINC framework thus contains a number of scheduling policies/algorithms both on the server-side and on the client which work together to maximize the available resources and to provide a degree of QoS in an environment which is highly volatile. This paper focuses on the BOINC client and introduces an adaptive priority tuning client side middleware application which improves the execution times of Work Units (WUs) while maintaining an acceptable Maximum Response Time (MRT) for the end user. We have conducted extensive experimentation of the proposed system and the results show clear speedup of BOINC applications using our optimized middleware as opposed to running using the original BOINC client.
      

      
      Fog computing job scheduling optimization based on bees swarm
      NASA Astrophysics Data System (ADS)
      Bitam, Salim; Zeadally, Sherali; Mellouk, Abdelhamid
         2018-04-01
         Fog computing is a new computing architecture, composed of a set of near-user edge devices called fog nodes, which collaborate together in order to perform computational services such as running applications, storing an important amount of data, and transmitting messages. Fog computing extends cloud computing by deploying digital resources at the premise of mobile users. In this new paradigm, management and operating functions, such as job scheduling aim at providing high-performance, cost-effective services requested by mobile users and executed by fog nodes. We propose a new bio-inspired optimization approach called Bees Life Algorithm (BLA) aimed at addressing the job scheduling problem in the fog computing environment. Our proposed approach is based on the optimized distribution of a set of tasks among all the fog computing nodes. The objective is to find an optimal tradeoff between CPU execution time and allocated memory required by fog computing services established by mobile users. Our empirical performance evaluation results demonstrate that the proposal outperforms the traditional particle swarm optimization and genetic algorithm in terms of CPU execution time and allocated memory.
      

      
      Transient dynamics capability at Sandia National Laboratories
      NASA Technical Reports Server (NTRS)
      Attaway, Steven W.; Biffle, Johnny H.; Sjaardema, G. D.; Heinstein, M. W.; Schoof, L. A.
         1993-01-01
         A brief overview of the transient dynamics capabilities at Sandia National Laboratories, with an emphasis on recent new developments and current research is presented. In addition, the Sandia National Laboratories (SNL) Engineering Analysis Code Access System (SEACAS), which is a collection of structural and thermal codes and utilities used by analysts at SNL, is described. The SEACAS system includes pre- and post-processing codes, analysis codes, database translation codes, support libraries, Unix shell scripts for execution, and an installation system. SEACAS is used at SNL on a daily basis as a production, research, and development system for the engineering analysts and code developers. Over the past year, approximately 190 days of CPU time were used by SEACAS codes on jobs running from a few seconds up to two and one-half days of CPU time. SEACAS is running on several different systems at SNL including Cray Unicos, Hewlett Packard PH-UX, Digital Equipment Ultrix, and Sun SunOS. An overview of SEACAS, including a short description of the codes in the system, are presented. Abstracts and references for the codes are listed at the end of the report.
      

      
      Clinical applications of advanced rotational radiation therapy
      NASA Astrophysics Data System (ADS)
      Nalichowski, Adrian
         
         Purpose: With a fast adoption of emerging technologies, it is critical to fully test and understand its limits and capabilities. In this work we investigate new graphic processing unit (GPU) based treatment planning algorithm and its applications in helical tomotherapy dose delivery. We explore the limits of the system by applying it to challenging clinical cases of total marrow irradiation (TMI) and stereotactic radiosurgery (SRS). We also analyze the feasibility of alternative fractionation schemes for total body irradiation (TBI) and TMI based on reported historical data on lung dose and interstitial pneumonitis (IP) incidence rates. Methods and Materials: An anthropomorphic phantom was used to create TMI plans using the new GPU based treatment planning system and the existing CPU cluster based system. Optimization parameters were selected based on clinically used values for field width, modulation factor and pitch. Treatment plans were also created on Eclipse treatment planning system (Varian Medical Systems Inc, Palo Alto, CA) using volumetric modulated arc therapy (VMAT) for dose delivery on IX treatment unit. A retrospective review was performed of 42 publications that reported IP rates along with lung dose, fractionation regimen, dose rate and chemotherapy. The analysis consisted of nearly thirty two hundred patients and 34 unique radiation regimens. Multivariate logistic regression was performed to determine parameters associated with IP and establish does response function. Results: The results showed very good dosimetric agreement between the GPU and CPU calculated plans. The results from SBRT study show that GPU planning system can maintain 90% target coverage while meeting all the constraints of RTOG 0631 protocol. Beam on time for Tomotherapy and flattening filter free RapidArc was much faster than for Vero or Cyberknife. Retrospective data analysis showed that lung dose and Cyclophosphomide (Cy) are both predictors of IP in TBI/TMI treatments. The dose rate was not found to be an independent risk factor for IP. The model failed to establish accurate dose response function, but the discrete data indicated a radiation dose threshold of 7.6Gy (EQD2_repair) and 120 mg/kg of Cy below which no IP cases were reported. Conclusion: The TomoTherapy GPU based dose engine is capable of calculating TMI treatment plans with plan quality nearly identical to plans calculated using the traditional CPU/cluster based system, while significantly reducing the time required for optimization and dose calculation. The new system was able to achieve more uniform dose distribution throughout the target volume and steeper dose fall off, resulting in superior OAR sparing when compared to Eclipse treatment planning system for VMAT delivery. The machine optimization parameters tested for TMI cases provide a comprehensive overview of the capabilities of the treatment planning station and associated helical delivery system. The new system also proved to be dosimetrically compatible with other leading modalities for treatments of small and complicated target volumes and was even superior when treatment delivery times were compared. These finding demonstrate that the advanced treatment planning and delivery system from TomoTherapy is well suitable for treatments of complicated cases such as TMI and SRS and it's often dosimetrically and/or logistically superior to other modalities. The new planning system can easily meet the constraint of threshold lung dose established in this study. The results presented here on the capabilities of Tomotherapy and on the identified lung dose threshold provide an opportunity to explore alternative fractionation schemes without sacrificing target coverage or lung toxicity. (Abstract shortened by ProQuest.).
      

      
      Effect of acute and continuous morphine treatment on transcription factor expression in subregions of the rat caudate putamen. Marked modulation by D4 receptor activation.
      PubMed
      Gago, Belén; Suárez-Boomgaard, Diana; Fuxe, Kjell; Brené, Stefan; Reina-Sánchez, María Dolores; Rodríguez-Pérez, Luis M; Agnati, Luigi F; de la Calle, Adelaida; Rivera, Alicia
         2011-08-17
         Acute administration of the dopamine D(4) receptor (D(4)R) agonist PD168,077 induces a down-regulation of the μ opioid receptor (MOR) in the striosomal compartment of the rat caudate putamen (CPu), suggesting a striosomal D(4)R/MOR receptor interaction in line with their high co-distribution in this brain subregion. The present work was designed to explore if a D(4)R/MOR receptor interaction also occurs in the modulation of the expression pattern of several transcription factors in striatal subregions that play a central role in drug addiction. Thus, c-Fos, FosB/ΔFosB and P-CREB immunoreactive profiles were quantified in the rat CPu after either acute or continuous (6-day) administration of morphine and/or PD168,077. Acute and continuous administration of morphine induced different patterns of expression of these transcription factors, effects that were time-course and region dependent and fully blocked by PD168,077 co-administration. Moreover, this effect of the D(4)R agonist was counteracted by the D(4)R antagonist L745,870. Interestingly, at some time-points, combined treatment with morphine and PD168,077 substantially increased c-Fos, FosB/ΔFosB and P-CREB expression. The results of this study give indications for a general antagonistic D(4)R/MOR receptor interaction at the level of transcription factors. The change in the transcription factor expression by D(4)R/MOR interactions in turn suggests a modulation of neuronal activity in the CPu that could be of relevance for drug addiction. Copyright © 2011 Elsevier B.V. All rights reserved.
      

      
      P-Hint-Hunt: a deep parallelized whole genome DNA methylation detection tool.
      PubMed
      Peng, Shaoliang; Yang, Shunyun; Gao, Ming; Liao, Xiangke; Liu, Jie; Yang, Canqun; Wu, Chengkun; Yu, Wenqiang
         2017-03-14
         The increasing studies have been conducted using whole genome DNA methylation detection as one of the most important part of epigenetics research to find the significant relationships among DNA methylation and several typical diseases, such as cancers and diabetes. In many of those studies, mapping the bisulfite treated sequence to the whole genome has been the main method to study DNA cytosine methylation. However, today's relative tools almost suffer from inaccuracies and time-consuming problems. In our study, we designed a new DNA methylation prediction tool ("Hint-Hunt") to solve the problem. By having an optimal complex alignment computation and Smith-Waterman matrix dynamic programming, Hint-Hunt could analyze and predict the DNA methylation status. But when Hint-Hunt tried to predict DNA methylation status with large-scale dataset, there are still slow speed and low temporal-spatial efficiency problems. In order to solve the problems of Smith-Waterman dynamic programming and low temporal-spatial efficiency, we further design a deep parallelized whole genome DNA methylation detection tool ("P-Hint-Hunt") on Tianhe-2 (TH-2) supercomputer. To the best of our knowledge, P-Hint-Hunt is the first parallel DNA methylation detection tool with a high speed-up to process large-scale dataset, and could run both on CPU and Intel Xeon Phi coprocessors. Moreover, we deploy and evaluate Hint-Hunt and P-Hint-Hunt on TH-2 supercomputer in different scales. The experimental results illuminate our tools eliminate the deviation caused by bisulfite treatment in mapping procedure and the multi-level parallel program yields a 48 times speed-up with 64 threads. P-Hint-Hunt gain a deep acceleration on CPU and Intel Xeon Phi heterogeneous platform, which gives full play of the advantages of multi-cores (CPU) and many-cores (Phi).
      

      
      GPU acceleration of Runge Kutta-Fehlberg and its comparison with Dormand-Prince method
      NASA Astrophysics Data System (ADS)
      Seen, Wo Mei; Gobithaasan, R. U.; Miura, Kenjiro T.
         2014-07-01
         There is a significant reduction of processing time and speedup of performance in computer graphics with the emergence of Graphic Processing Units (GPUs). GPUs have been developed to surpass Central Processing Unit (CPU) in terms of performance and processing speed. This evolution has opened up a new area in computing and researches where highly parallel GPU has been used for non-graphical algorithms. Physical or phenomenal simulations and modelling can be accelerated through General Purpose Graphic Processing Units (GPGPU) and Compute Unified Device Architecture (CUDA) implementations. These phenomena can be represented with mathematical models in the form of Ordinary Differential Equations (ODEs) which encompasses the gist of change rate between independent and dependent variables. ODEs are numerically integrated over time in order to simulate these behaviours. The classical Runge-Kutta (RK) scheme is the common method used to numerically solve ODEs. The Runge Kutta Fehlberg (RKF) scheme has been specially developed to provide an estimate of the principal local truncation error at each step, known as embedding estimate technique. This paper delves into the implementation of RKF scheme for GPU devices and compares its result with Dorman Prince method. A pseudo code is developed to show the implementation in detail. Hence, practitioners will be able to understand the data allocation in GPU, formation of RKF kernels and the flow of data to/from GPU-CPU upon RKF kernel evaluation. The pseudo code is then written in C Language and two ODE models are executed to show the achievable speedup as compared to CPU implementation. The accuracy and efficiency of the proposed implementation method is discussed in the final section of this paper.
      

      
      Efficient spares matrix multiplication scheme for the CYBER 203
      NASA Technical Reports Server (NTRS)
      Lambiotte, J. J., Jr.
         1984-01-01
         This work has been directed toward the development of an efficient algorithm for performing this computation on the CYBER-203. The desire to provide software which gives the user the choice between the often conflicting goals of minimizing central processing (CPU) time or storage requirements has led to a diagonal-based algorithm in which one of three types of storage is selected for each diagonal. For each storage type, an initialization sub-routine estimates the CPU and storage requirements based upon results from previously performed numerical experimentation. These requirements are adjusted by weights provided by the user which reflect the relative importance the user places on the resources. The three storage types employed were chosen to be efficient on the CYBER-203 for diagonals which are sparse, moderately sparse, or dense; however, for many densities, no diagonal type is most efficient with respect to both resource requirements. The user-supplied weights dictate the choice.
      

      
      GPU accelerated implementation of NCI calculations using promolecular density.
      PubMed
      Rubez, Gaëtan; Etancelin, Jean-Matthieu; Vigouroux, Xavier; Krajecki, Michael; Boisson, Jean-Charles; Hénon, Eric
         2017-05-30
         The NCI approach is a modern tool to reveal chemical noncovalent interactions. It is particularly attractive to describe ligand-protein binding. A custom implementation for NCI using promolecular density is presented. It is designed to leverage the computational power of NVIDIA graphics processing unit (GPU) accelerators through the CUDA programming model. The code performances of three versions are examined on a test set of 144 systems. NCI calculations are particularly well suited to the GPU architecture, which reduces drastically the computational time. On a single compute node, the dual-GPU version leads to a 39-fold improvement for the biggest instance compared to the optimal OpenMP parallel run (C code, icc compiler) with 16 CPU cores. Energy consumption measurements carried out on both CPU and GPU NCI tests show that the GPU approach provides substantial energy savings. © 2017 Wiley Periodicals, Inc. © 2017 Wiley Periodicals, Inc.
      

      
      Using Intel Xeon Phi to accelerate the WRF TEMF planetary boundary layer scheme
      NASA Astrophysics Data System (ADS)
      Mielikainen, Jarno; Huang, Bormin; Huang, Allen
         2014-05-01
         The Weather Research and Forecasting (WRF) model is designed for numerical weather prediction and atmospheric research. The WRF software infrastructure consists of several components such as dynamic solvers and physics schemes. Numerical models are used to resolve the large-scale flow. However, subgrid-scale parameterizations are for an estimation of small-scale properties (e.g., boundary layer turbulence and convection, clouds, radiation). Those have a significant influence on the resolved scale due to the complex nonlinear nature of the atmosphere. For the cloudy planetary boundary layer (PBL), it is fundamental to parameterize vertical turbulent fluxes and subgrid-scale condensation in a realistic manner. A parameterization based on the Total Energy - Mass Flux (TEMF) that unifies turbulence and moist convection components produces a better result that the other PBL schemes. For that reason, the TEMF scheme is chosen as the PBL scheme we optimized for Intel Many Integrated Core (MIC), which ushers in a new era of supercomputing speed, performance, and compatibility. It allows the developers to run code at trillions of calculations per second using the familiar programming model. In this paper, we present our optimization results for TEMF planetary boundary layer scheme. The optimizations that were performed were quite generic in nature. Those optimizations included vectorization of the code to utilize vector units inside each CPU. Furthermore, memory access was improved by scalarizing some of the intermediate arrays. The results show that the optimization improved MIC performance by 14.8x. Furthermore, the optimizations increased CPU performance by 2.6x compared to the original multi-threaded code on quad core Intel Xeon E5-2603 running at 1.8 GHz. Compared to the optimized code running on a single CPU socket the optimized MIC code is 6.2x faster.
      

      
      The Effect of NUMA Tunings on CPU Performance
      NASA Astrophysics Data System (ADS)
      Hollowell, Christopher; Caramarcu, Costin; Strecker-Kellogg, William; Wong, Antonio; Zaytsev, Alexandr
         2015-12-01
         Non-Uniform Memory Access (NUMA) is a memory architecture for symmetric multiprocessing (SMP) systems where each processor is directly connected to separate memory. Indirect access to other CPU's (remote) RAM is still possible, but such requests are slower as they must also pass through that memory's controlling CPU. In concert with a NUMA-aware operating system, the NUMA hardware architecture can help eliminate the memory performance reductions generally seen in SMP systems when multiple processors simultaneously attempt to access memory. The x86 CPU architecture has supported NUMA for a number of years. Modern operating systems such as Linux support NUMA-aware scheduling, where the OS attempts to schedule a process to the CPU directly attached to the majority of its RAM. In Linux, it is possible to further manually tune the NUMA subsystem using the numactl utility. With the release of Red Hat Enterprise Linux (RHEL) 6.3, the numad daemon became available in this distribution. This daemon monitors a system's NUMA topology and utilization, and automatically makes adjustments to optimize locality. As the number of cores in x86 servers continues to grow, efficient NUMA mappings of processes to CPUs/memory will become increasingly important. This paper gives a brief overview of NUMA, and discusses the effects of manual tunings and numad on the performance of the HEPSPEC06 benchmark, and ATLAS software.
      

      
      Accelerating finite-rate chemical kinetics with coprocessors: Comparing vectorization methods on GPUs, MICs, and CPUs
      NASA Astrophysics Data System (ADS)
      Stone, Christopher P.; Alferman, Andrew T.; Niemeyer, Kyle E.
         2018-05-01
         Accurate and efficient methods for solving stiff ordinary differential equations (ODEs) are a critical component of turbulent combustion simulations with finite-rate chemistry. The ODEs governing the chemical kinetics at each mesh point are decoupled by operator-splitting allowing each to be solved concurrently. An efficient ODE solver must then take into account the available thread and instruction-level parallelism of the underlying hardware, especially on many-core coprocessors, as well as the numerical efficiency. A stiff Rosenbrock and a nonstiff Runge-Kutta ODE solver are both implemented using the single instruction, multiple thread (SIMT) and single instruction, multiple data (SIMD) paradigms within OpenCL. Both methods solve multiple ODEs concurrently within the same instruction stream. The performance of these parallel implementations was measured on three chemical kinetic models of increasing size across several multicore and many-core platforms. Two separate benchmarks were conducted to clearly determine any performance advantage offered by either method. The first benchmark measured the run-time of evaluating the right-hand-side source terms in parallel and the second benchmark integrated a series of constant-pressure, homogeneous reactors using the Rosenbrock and Runge-Kutta solvers. The right-hand-side evaluations with SIMD parallelism on the host multicore Xeon CPU and many-core Xeon Phi co-processor performed approximately three times faster than the baseline multithreaded C++ code. The SIMT parallel model on the host and Phi was 13%-35% slower than the baseline while the SIMT model on the NVIDIA Kepler GPU provided approximately the same performance as the SIMD model on the Phi. The runtimes for both ODE solvers decreased significantly with the SIMD implementations on the host CPU (2.5-2.7 ×) and Xeon Phi coprocessor (4.7-4.9 ×) compared to the baseline parallel code. The SIMT implementations on the GPU ran 1.5-1.6 times faster than the baseline multithreaded CPU code; however, this was significantly slower than the SIMD versions on the host CPU or the Xeon Phi. The performance difference between the three platforms was attributed to thread divergence caused by the adaptive step-sizes within the ODE integrators. Analysis showed that the wider vector width of the GPU incurs a higher level of divergence than the narrower Sandy Bridge or Xeon Phi. The significant performance improvement provided by the SIMD parallel strategy motivates further research into more ODE solver methods that are both SIMD-friendly and computationally efficient.
      

      
      Using SimCPU in Cooperative Learning Laboratories.
      ERIC Educational Resources Information Center
      Lin, Janet Mei-Chuen; Wu, Cheng-Chih; Liu, Hsi-Jen
         1999-01-01
         Reports research findings of an experimental design in which cooperative-learning strategies were applied to closed-lab instruction of computing concepts. SimCPU, a software package specially designed for closed-lab usage was used by 171 high school students of four classes. Results showed that collaboration enhanced learning and that blending…
      

      
      Deterministic Stress Modeling of Hot Gas Segregation in a Turbine
      NASA Technical Reports Server (NTRS)
      Busby, Judy; Sondak, Doug; Staubach, Brent; Davis, Roger
         1998-01-01
         Simulation of unsteady viscous turbomachinery flowfields is presently impractical as a design tool due to the long run times required. Designers rely predominantly on steady-state simulations, but these simulations do not account for some of the important unsteady flow physics. Unsteady flow effects can be modeled as source terms in the steady flow equations. These source terms, referred to as Lumped Deterministic Stresses (LDS), can be used to drive steady flow solution procedures to reproduce the time-average of an unsteady flow solution. The goal of this work is to investigate the feasibility of using inviscid lumped deterministic stresses to model unsteady combustion hot streak migration effects on the turbine blade tip and outer air seal heat loads using a steady computational approach. The LDS model is obtained from an unsteady inviscid calculation. The LDS model is then used with a steady viscous computation to simulate the time-averaged viscous solution. Both two-dimensional and three-dimensional applications are examined. The inviscid LDS model produces good results for the two-dimensional case and requires less than 10% of the CPU time of the unsteady viscous run. For the three-dimensional case, the LDS model does a good job of reproducing the time-averaged viscous temperature migration and separation as well as heat load on the outer air seal at a CPU cost that is 25% of that of an unsteady viscous computation.
      

      
      Advances in Mechanisms Supporting Data Collection on Future Force Networks: Product Manager C4ISR On-the-Move
      DTIC Science & Technology
      
         2008-12-01
         for Layer 3 data capture: NetPoll ncap tget Monitor session Radio System switch router User App interface box GPS This model applies to most fixed...developed a lightweight, custom implementation, termed ncap . As described in Section 3.1, the Ground Truth System provides a linkage between host...computer CPU time and GPS time, and ncap leverages this to perform highly precise (əmsec) time tagging of offered and received packets. Such
      

        
       
          

«

15
      16
      17
   18
      19
      »

          
        

     

   

   
       
            
              
          

«

16
      17
      18
   19
      20
      »

          
        

           
           
             
               
      
      Petabyte Class Storage at Jefferson Lab (CEBAF)
      NASA Technical Reports Server (NTRS)
      Chambers, Rita; Davis, Mark
         1996-01-01
         By 1997, the Thomas Jefferson National Accelerator Facility will collect over one Terabyte of raw information per day of Accelerator operation from three concurrently operating Experimental Halls. When post-processing is included, roughly 250 TB of raw and formatted experimental data will be generated each year. By the year 2000, a total of one Petabyte will be stored on-line. Critical to the experimental program at Jefferson Lab (JLab) is the networking and computational capability to collect, store, retrieve, and reconstruct data on this scale. The design criteria include support of a raw data stream of 10-12 MB/second from Experimental Hall B, which will operate the CEBAF (Continuous Electron Beam Accelerator Facility) Large Acceptance Spectrometer (CLAS). Keeping up with this data stream implies design strategies that provide storage guarantees during accelerator operation, minimize the number of times data is buffered allow seamless access to specific data sets for the researcher, synchronize data retrievals with the scheduling of postprocessing calculations on the data reconstruction CPU farms, as well as support the site capability to perform data reconstruction and reduction at the same overall rate at which new data is being collected. The current implementation employs state-of-the-art StorageTek Redwood tape drives and robotics library integrated with the Open Storage Manager (OSM) Hierarchical Storage Management software (Computer Associates, International), the use of Fibre Channel RAID disks dual-ported between Sun Microsystems SMP servers, and a network-based interface to a 10,000 SPECint92 data processing CPU farm. Issues of efficiency, scalability, and manageability will become critical to meet the year 2000 requirements for a Petabyte of near-line storage interfaced to over 30,000 SPECint92 of data processing power.
      

      
      Toshiba TDF-500 High Resolution Viewing And Analysis System
      NASA Astrophysics Data System (ADS)
      Roberts, Barry; Kakegawa, M.; Nishikawa, M.; Oikawa, D.
         1988-06-01
         A high resolution, operator interactive, medical viewing and analysis system has been developed by Toshiba and Bio-Imaging Research. This system provides many advanced features including high resolution displays, a very large image memory and advanced image processing capability. In particular, the system provides CRT frame buffers capable of update in one frame period, an array processor capable of image processing at operator interactive speeds, and a memory system capable of updating multiple frame buffers at frame rates whilst supporting multiple array processors. The display system provides 1024 x 1536 display resolution at 40Hz frame and 80Hz field rates. In particular, the ability to provide whole or partial update of the screen at the scanning rate is a key feature. This allows multiple viewports or windows in the display buffer with both fixed and cine capability. To support image processing features such as windowing, pan, zoom, minification, filtering, ROI analysis, multiplanar and 3D reconstruction, a high performance CPU is integrated into the system. This CPU is an array processor capable of up to 400 million instructions per second. To support the multiple viewer and array processors' instantaneous high memory bandwidth requirement, an ultra fast memory system is used. This memory system has a bandwidth capability of 400MB/sec and a total capacity of 256MB. This bandwidth is more than adequate to support several high resolution CRT's and also the fast processing unit. This fully integrated approach allows effective real time image processing. The integrated design of viewing system, memory system and array processor are key to the imaging system. It is the intention to describe the architecture of the image system in this paper.
      

      
      Genetic Dissection of End-Use Quality Traits in Adapted Soft White Winter Wheat
      PubMed Central
      Jernigan, Kendra L.; Godoy, Jayfred V.; Huang, Meng; Zhou, Yao; Morris, Craig F.; Garland-Campbell, Kimberly A.; Zhang, Zhiwu; Carter, Arron H.
         2018-01-01
         Soft white wheat is used in domestic and foreign markets for various end products requiring specific quality profiles. Phenotyping for end-use quality traits can be costly, time-consuming and destructive in nature, so it is advantageous to use molecular markers to select experimental lines with superior traits. An association mapping panel of 469 soft white winter wheat cultivars and advanced generation breeding lines was developed from regional breeding programs in the U.S. Pacific Northwest. This panel was genotyped on a wheat-specific 90 K iSelect single nucleotide polymorphism (SNP) chip. A total of 15,229 high quality SNPs were selected and combined with best linear unbiased predictions (BLUPs) from historical phenotypic data of the genotypes in the panel. Genome-wide association mapping was conducted using the Fixed and random model Circulating Probability Unification (FarmCPU). A total of 105 significant marker-trait associations were detected across 19 chromosomes. Potentially new loci for total flour yield, lactic acid solvent retention capacity, flour sodium dodecyl sulfate sedimentation and flour swelling volume were also detected. Better understanding of the genetic factors impacting end-use quality enable breeders to more effectively discard poor quality germplasm and increase frequencies of favorable end-use quality alleles in their breeding populations. PMID:29593752
      

      
      A GPU-based symmetric non-rigid image registration method in human lung.
      PubMed
      Haghighi, Babak; D Ellingwood, Nathan; Yin, Youbing; Hoffman, Eric A; Lin, Ching-Long
         2018-03-01
         Quantitative computed tomography (QCT) of the lungs plays an increasing role in identifying sub-phenotypes of pathologies previously lumped into broad categories such as chronic obstructive pulmonary disease and asthma. Methods for image matching and linking multiple lung volumes have proven useful in linking structure to function and in the identification of regional longitudinal changes. Here, we seek to improve the accuracy of image matching via the use of a symmetric multi-level non-rigid registration employing an inverse consistent (IC) transformation whereby images are registered both in the forward and reverse directions. To develop the symmetric method, two similarity measures, the sum of squared intensity difference (SSD) and the sum of squared tissue volume difference (SSTVD), were used. The method is based on a novel generic mathematical framework to include forward and backward transformations, simultaneously, eliminating the need to compute the inverse transformation. Two implementations were used to assess the proposed method: a two-dimensional (2-D) implementation using synthetic examples with SSD, and a multi-core CPU and graphics processing unit (GPU) implementation with SSTVD for three-dimensional (3-D) human lung datasets (six normal adults studied at total lung capacity (TLC) and functional residual capacity (FRC)). Success was evaluated in terms of the IC transformation consistency serving to link TLC to FRC. 2-D registration on synthetic images, using both symmetric and non-symmetric SSD methods, and comparison of displacement fields showed that the symmetric method gave a symmetrical grid shape and reduced IC errors, with the mean values of IC errors decreased by 37%. Results for both symmetric and non-symmetric transformations of human datasets showed that the symmetric method gave better results for IC errors in all cases, with mean values of IC errors for the symmetric method lower than the non-symmetric methods using both SSD and SSTVD. The GPU version demonstrated an average of 43 times speedup and ~5.2 times speedup over the single-threaded and 12-threaded CPU versions, respectively. Run times with the GPU were as fast as 2 min. The symmetric method improved the inverse consistency, aiding the use of image registration in the QCT-based evaluation of the lung.
      

      
      Tactical Operations Analysis Support Facility.
      DTIC Science & Technology
      
         1981-05-01
         Punch/Reader 2 DMC-11AR DDCMP Micro Processor 2 DMC-11DA Network Link Line Unit 2 DL-11E Async Serial Line Interface 4 Intel IN-1670 448K Words MOS Memory...86 5.3 VIRTUAL PROCESSORS - VAX-11/750 ........................... 89 5.4 A RELATIONAL DATA MANAGEMENT SYSTEM - ORACLE...Central Processing Unit (CPU) is a 16 bit processor for high-speed, real time applications, and for large multi-user, multi- task, time shared
      

      
      Fast, large-scale hologram calculation in wavelet domain
      NASA Astrophysics Data System (ADS)
      Shimobaba, Tomoyoshi; Matsushima, Kyoji; Takahashi, Takayuki; Nagahama, Yuki; Hasegawa, Satoki; Sano, Marie; Hirayama, Ryuji; Kakue, Takashi; Ito, Tomoyoshi
         2018-04-01
         We propose a large-scale hologram calculation using WAvelet ShrinkAge-Based superpositIon (WASABI), a wavelet transform-based algorithm. An image-type hologram calculated using the WASABI method is printed on a glass substrate with the resolution of 65 , 536 × 65 , 536 pixels and a pixel pitch of 1 μm. The hologram calculation time amounts to approximately 354 s on a commercial CPU, which is approximately 30 times faster than conventional methods.
      

      
      ELT-scale Adaptive Optics real-time control with thes Intel Xeon Phi Many Integrated Core Architecture
      NASA Astrophysics Data System (ADS)
      Jenkins, David R.; Basden, Alastair; Myers, Richard M.
         2018-05-01
         We propose a solution to the increased computational demands of Extremely Large Telescope (ELT) scale adaptive optics (AO) real-time control with the Intel Xeon Phi Knights Landing (KNL) Many Integrated Core (MIC) Architecture. The computational demands of an AO real-time controller (RTC) scale with the fourth power of telescope diameter and so the next generation ELTs require orders of magnitude more processing power for the RTC pipeline than existing systems. The Xeon Phi contains a large number (≥64) of low power x86 CPU cores and high bandwidth memory integrated into a single socketed server CPU package. The increased parallelism and memory bandwidth are crucial to providing the performance for reconstructing wavefronts with the required precision for ELT scale AO. Here, we demonstrate that the Xeon Phi KNL is capable of performing ELT scale single conjugate AO real-time control computation at over 1.0kHz with less than 20μs RMS jitter. We have also shown that with a wavefront sensor camera attached the KNL can process the real-time control loop at up to 966Hz, the maximum frame-rate of the camera, with jitter remaining below 20μs RMS. Future studies will involve exploring the use of a cluster of Xeon Phis for the real-time control of the MCAO and MOAO regimes of AO. We find that the Xeon Phi is highly suitable for ELT AO real time control.
      

      
      Japanese Ubiquotous Network Project: Ubila
      NASA Astrophysics Data System (ADS)
      Ohashi, Masayoshi
         
         Recently, the advent of sophisticated technologies has stimulated ambient paradigms that may include high-performance CPU, compact real-time operating systems, a variety of devices/sensors, low power and high-speed radio communications, and in particular, third generation mobile phones. In addition, due to the spread of broadband ccess networks, various ubiquitous terminals and sensors can be connected closely.
      

      
      A new approach to flow simulation in highly heterogeneous porous media
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Rame, M.; Killough, J.E.
         
         In this paper, applications are presented for a new numerical method - operator splittings on multiple grids (OSMG) - devised for simulations in heterogeneous porous media. A coarse-grid, finite-element pressure solver is interfaced with a fine-grid timestepping scheme. The CPU time for the pressure solver is greatly reduced and concentration fronts have minimal numerical dispersion.
      

      
      An Analysis of CONUS Based Deployment of Pseudolites for Positioning, Navigation and Timing (PNT) Systems
      DTIC Science & Technology
      
         2015-09-17
         Geostationary Satellite cMateriel» Geostationary Satellite::Re-ceive Antennas cMBI!t’iel» Geostationary •Fiba OptioC.bl.h Satellite::CPU...8217 cRsdio Ftequency Signal» .Ra v dio Fre-queno; OBI» •Fiber OplicCsbl•• cMstMiel» Geostationary Satell ite:: Transmitters cMste
      

      
      Energy Conservation Using Dynamic Voltage Frequency Scaling for Computational Cloud
      PubMed Central
      Florence, A. Paulin; Shanthi, V.; Simon, C. B. Sunil
         2016-01-01
         Cloud computing is a new technology which supports resource sharing on a “Pay as you go” basis around the world. It provides various services such as SaaS, IaaS, and PaaS. Computation is a part of IaaS and the entire computational requests are to be served efficiently with optimal power utilization in the cloud. Recently, various algorithms are developed to reduce power consumption and even Dynamic Voltage and Frequency Scaling (DVFS) scheme is also used in this perspective. In this paper we have devised methodology which analyzes the behavior of the given cloud request and identifies the associated type of algorithm. Once the type of algorithm is identified, using their asymptotic notations, its time complexity is calculated. Using best fit strategy the appropriate host is identified and the incoming job is allocated to the victimized host. Using the measured time complexity the required clock frequency of the host is measured. According to that CPU frequency is scaled up or down using DVFS scheme, enabling energy to be saved up to 55% of total Watts consumption. PMID:27239551
      

      
      Energy Conservation Using Dynamic Voltage Frequency Scaling for Computational Cloud.
      PubMed
      Florence, A Paulin; Shanthi, V; Simon, C B Sunil
         2016-01-01
         Cloud computing is a new technology which supports resource sharing on a "Pay as you go" basis around the world. It provides various services such as SaaS, IaaS, and PaaS. Computation is a part of IaaS and the entire computational requests are to be served efficiently with optimal power utilization in the cloud. Recently, various algorithms are developed to reduce power consumption and even Dynamic Voltage and Frequency Scaling (DVFS) scheme is also used in this perspective. In this paper we have devised methodology which analyzes the behavior of the given cloud request and identifies the associated type of algorithm. Once the type of algorithm is identified, using their asymptotic notations, its time complexity is calculated. Using best fit strategy the appropriate host is identified and the incoming job is allocated to the victimized host. Using the measured time complexity the required clock frequency of the host is measured. According to that CPU frequency is scaled up or down using DVFS scheme, enabling energy to be saved up to 55% of total Watts consumption.
      

      
      CPU SIM: A Computer Simulator for Use in an Introductory Computer Organization-Architecture Class.
      ERIC Educational Resources Information Center
      Skrein, Dale
         1994-01-01
         CPU SIM, an interactive low-level computer simulation package that runs on the Macintosh computer, is described. The program is designed for instructional use in the first or second year of undergraduate computer science, to teach various features of typical computer organization through hands-on exercises. (MSE)
      

      
      Combustion Power Unit--400: CPU-400.
      ERIC Educational Resources Information Center
      Combustion Power Co., Palo Alto, CA.
         
         Aerospace technology may have led to a unique basic unit for processing solid wastes and controlling pollution. The Combustion Power Unit--400 (CPU-400) is designed as a turboelectric generator plant that will use municipal solid wastes as fuel. The baseline configuration is a modular unit that is designed to utilize 400 tons of refuse per day…
      

      
      An efficient and robust algorithm for two dimensional time dependent incompressible Navier-Stokes equations: High Reynolds number flows
      NASA Technical Reports Server (NTRS)
      Goodrich, John W.
         1991-01-01
         An algorithm is presented for unsteady two-dimensional incompressible Navier-Stokes calculations. This algorithm is based on the fourth order partial differential equation for incompressible fluid flow which uses the streamfunction as the only dependent variable. The algorithm is second order accurate in both time and space. It uses a multigrid solver at each time step. It is extremely efficient with respect to the use of both CPU time and physical memory. It is extremely robust with respect to Reynolds number.
      

      
      Particle-in-Cell laser-plasma simulation on Xeon Phi coprocessors
      NASA Astrophysics Data System (ADS)
      Surmin, I. A.; Bastrakov, S. I.; Efimenko, E. S.; Gonoskov, A. A.; Korzhimanov, A. V.; Meyerov, I. B.
         2016-05-01
         This paper concerns the development of a high-performance implementation of the Particle-in-Cell method for plasma simulation on Intel Xeon Phi coprocessors. We discuss the suitability of the method for Xeon Phi architecture and present our experience in the porting and optimization of the existing parallel Particle-in-Cell code PICADOR. Direct porting without code modification gives performance on Xeon Phi close to that of an 8-core CPU on a benchmark problem with 50 particles per cell. We demonstrate step-by-step optimization techniques, such as improving data locality, enhancing parallelization efficiency and vectorization leading to an overall 4.2 × speedup on CPU and 7.5 × on Xeon Phi compared to the baseline version. The optimized version achieves 16.9 ns per particle update on an Intel Xeon E5-2660 CPU and 9.3 ns per particle update on an Intel Xeon Phi 5110P. For a real problem of laser ion acceleration in targets with surface grating, where a large number of macroparticles per cell is required, the speedup of Xeon Phi compared to CPU is 1.6 ×.
      

      
      Software Defined Radio with Parallelized Software Architecture
      NASA Technical Reports Server (NTRS)
      Heckler, Greg
         2013-01-01
         This software implements software-defined radio procession over multicore, multi-CPU systems in a way that maximizes the use of CPU resources in the system. The software treats each processing step in either a communications or navigation modulator or demodulator system as an independent, threaded block. Each threaded block is defined with a programmable number of input or output buffers; these buffers are implemented using POSIX pipes. In addition, each threaded block is assigned a unique thread upon block installation. A modulator or demodulator system is built by assembly of the threaded blocks into a flow graph, which assembles the processing blocks to accomplish the desired signal processing. This software architecture allows the software to scale effortlessly between single CPU/single-core computers or multi-CPU/multi-core computers without recompilation. NASA spaceflight and ground communications systems currently rely exclusively on ASICs or FPGAs. This software allows low- and medium-bandwidth (100 bps to approx.50 Mbps) software defined radios to be designed and implemented solely in C/C++ software, while lowering development costs and facilitating reuse and extensibility.
      

      
      Software Defined Radio with Parallelized Software Architecture
      NASA Technical Reports Server (NTRS)
      Heckler, Greg
         2013-01-01
         This software implements software-defined radio procession over multi-core, multi-CPU systems in a way that maximizes the use of CPU resources in the system. The software treats each processing step in either a communications or navigation modulator or demodulator system as an independent, threaded block. Each threaded block is defined with a programmable number of input or output buffers; these buffers are implemented using POSIX pipes. In addition, each threaded block is assigned a unique thread upon block installation. A modulator or demodulator system is built by assembly of the threaded blocks into a flow graph, which assembles the processing blocks to accomplish the desired signal processing. This software architecture allows the software to scale effortlessly between single CPU/single-core computers or multi-CPU/multi-core computers without recompilation. NASA spaceflight and ground communications systems currently rely exclusively on ASICs or FPGAs. This software allows low- and medium-bandwidth (100 bps to .50 Mbps) software defined radios to be designed and implemented solely in C/C++ software, while lowering development costs and facilitating reuse and extensibility.
      

      
      Work stealing for GPU-accelerated parallel programs in a global address space framework: WORK STEALING ON GPU-ACCELERATED SYSTEMS
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Arafat, Humayun; Dinan, James; Krishnamoorthy, Sriram
         
         Task parallelism is an attractive approach to automatically load balance the computation in a parallel system and adapt to dynamism exhibited by parallel systems. Exploiting task parallelism through work stealing has been extensively studied in shared and distributed-memory contexts. In this paper, we study the design of a system that uses work stealing for dynamic load balancing of task-parallel programs executed on hybrid distributed-memory CPU-graphics processing unit (GPU) systems in a global-address space framework. We take into account the unique nature of the accelerator model employed by GPUs, the significant performance difference between GPU and CPU execution as a functionmore » of problem size, and the distinct CPU and GPU memory domains. We consider various alternatives in designing a distributed work stealing algorithm for CPU-GPU systems, while taking into account the impact of task distribution and data movement overheads. These strategies are evaluated using microbenchmarks that capture various execution configurations as well as the state-of-the-art CCSD(T) application module from the computational chemistry domain.« less
      

      
      Work stealing for GPU-accelerated parallel programs in a global address space framework
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Arafat, Humayun; Dinan, James; Krishnamoorthy, Sriram
         
         Task parallelism is an attractive approach to automatically load balance the computation in a parallel system and adapt to dynamism exhibited by parallel systems. Exploiting task parallelism through work stealing has been extensively studied in shared and distributed-memory contexts. In this paper, we study the design of a system that uses work stealing for dynamic load balancing of task-parallel programs executed on hybrid distributed-memory CPU-graphics processing unit (GPU) systems in a global-address space framework. We take into account the unique nature of the accelerator model employed by GPUs, the significant performance difference between GPU and CPU execution as a functionmore » of problem size, and the distinct CPU and GPU memory domains. We consider various alternatives in designing a distributed work stealing algorithm for CPU-GPU systems, while taking into account the impact of task distribution and data movement overheads. These strategies are evaluated using microbenchmarks that capture various execution configurations as well as the state-of-the-art CCSD(T) application module from the computational chemistry domain« less
      

        
       
          

«

16
      17
      18
   19
      20
      »

          
        

     

   

   
       
            
              
          

«

17
      18
      19
   20
      21
      »

          
        

           
           
             
               
      
      Optimizing legacy molecular dynamics software with directive-based offload
      NASA Astrophysics Data System (ADS)
      Michael Brown, W.; Carrillo, Jan-Michael Y.; Gavhane, Nitin; Thakkar, Foram M.; Plimpton, Steven J.
         2015-10-01
         Directive-based programming models are one solution for exploiting many-core coprocessors to increase simulation rates in molecular dynamics. They offer the potential to reduce code complexity with offload models that can selectively target computations to run on the CPU, the coprocessor, or both. In this paper, we describe modifications to the LAMMPS molecular dynamics code to enable concurrent calculations on a CPU and coprocessor. We demonstrate that standard molecular dynamics algorithms can run efficiently on both the CPU and an x86-based coprocessor using the same subroutines. As a consequence, we demonstrate that code optimizations for the coprocessor also result in speedups on the CPU; in extreme cases up to 4.7X. We provide results for LAMMPS benchmarks and for production molecular dynamics simulations using the Stampede hybrid supercomputer with both Intel® Xeon Phi™ coprocessors and NVIDIA GPUs. The optimizations presented have increased simulation rates by over 2X for organic molecules and over 7X for liquid crystals on Stampede. The optimizations are available as part of the "Intel package" supplied with LAMMPS.
      

      
      Real-time global illumination on mobile device
      NASA Astrophysics Data System (ADS)
      Ahn, Minsu; Ha, Inwoo; Lee, Hyong-Euk; Kim, James D. K.
         2014-02-01
         We propose a novel method for real-time global illumination on mobile devices. Our approach is based on instant radiosity, which uses a sequence of virtual point lights in order to represent the e ect of indirect illumination. Our rendering process consists of three stages. With the primary light, the rst stage generates a local illumination with the shadow map on GPU The second stage of the global illumination uses the re ective shadow map on GPU and generates the sequence of virtual point lights on CPU. Finally, we use the splatting method of Dachsbacher et al 1 and add the indirect illumination to the local illumination on GPU. With the limited computing resources in mobile devices, a small number of virtual point lights are allowed for real-time rendering. Our approach uses the multi-resolution sampling method with 3D geometry and attributes simultaneously and reduce the total number of virtual point lights. We also use the hybrid strategy, which collaboratively combines the CPUs and GPUs available in a mobile SoC due to the limited computing resources in mobile devices. Experimental results demonstrate the global illumination performance of the proposed method.
      

      
      Validation of GPU based TomoTherapy dose calculation engine.
      PubMed
      Chen, Quan; Lu, Weiguo; Chen, Yu; Chen, Mingli; Henderson, Douglas; Sterpin, Edmond
         2012-04-01
         The graphic processing unit (GPU) based TomoTherapy convolution/superposition(C/S) dose engine (GPU dose engine) achieves a dramatic performance improvement over the traditional CPU-cluster based TomoTherapy dose engine (CPU dose engine). Besides the architecture difference between the GPU and CPU, there are several algorithm changes from the CPU dose engine to the GPU dose engine. These changes made the GPU dose slightly different from the CPU-cluster dose. In order for the commercial release of the GPU dose engine, its accuracy has to be validated. Thirty eight TomoTherapy phantom plans and 19 patient plans were calculated with both dose engines to evaluate the equivalency between the two dose engines. Gamma indices (Γ) were used for the equivalency evaluation. The GPU dose was further verified with the absolute point dose measurement with ion chamber and film measurements for phantom plans. Monte Carlo calculation was used as a reference for both dose engines in the accuracy evaluation in heterogeneous phantom and actual patients. The GPU dose engine showed excellent agreement with the current CPU dose engine. The majority of cases had over 99.99% of voxels with Γ(1%, 1 mm) < 1. The worst case observed in the phantom had 0.22% voxels violating the criterion. In patient cases, the worst percentage of voxels violating the criterion was 0.57%. For absolute point dose verification, all cases agreed with measurement to within ±3% with average error magnitude within 1%. All cases passed the acceptance criterion that more than 95% of the pixels have Γ(3%, 3 mm) < 1 in film measurement, and the average passing pixel percentage is 98.5%-99%. The GPU dose engine also showed similar degree of accuracy in heterogeneous media as the current TomoTherapy dose engine. It is verified and validated that the ultrafast TomoTherapy GPU dose engine can safely replace the existing TomoTherapy cluster based dose engine without degradation in dose accuracy.
      

      
      FPT- FORTRAN PROGRAMMING TOOLS FOR THE DEC VAX
      NASA Technical Reports Server (NTRS)
      Ragosta, A. E.
         1994-01-01
         The FORTRAN Programming Tools (FPT) are a series of tools used to support the development and maintenance of FORTRAN 77 source codes. Included are a debugging aid, a CPU time monitoring program, source code maintenance aids, print utilities, and a library of useful, well-documented programs. These tools assist in reducing development time and encouraging high quality programming. Although intended primarily for FORTRAN programmers, some of the tools can be used on data files and other programming languages. BUGOUT is a series of FPT programs that have proven very useful in debugging a particular kind of error and in optimizing CPU-intensive codes. The particular type of error is the illegal addressing of data or code as a result of subtle FORTRAN errors that are not caught by the compiler or at run time. A TRACE option also allows the programmer to verify the execution path of a program. The TIME option assists the programmer in identifying the CPU-intensive routines in a program to aid in optimization studies. Program coding, maintenance, and print aids available in FPT include: routines for building standard format subprogram stubs; cleaning up common blocks and NAMELISTs; removing all characters after column 72; displaying two files side by side on a VT-100 terminal; creating a neat listing of a FORTRAN source code including a Table of Contents, an Index, and Page Headings; converting files between VMS internal format and standard carriage control format; changing text strings in a file without using EDT; and replacing tab characters with spaces. The library of useful, documented programs includes the following: time and date routines; a string categorization routine; routines for converting between decimal, hex, and octal; routines to delay process execution for a specified time; a Gaussian elimination routine for solving a set of simultaneous linear equations; a curve fitting routine for least squares fit to polynomial, exponential, and sinusoidal forms (with a screen-oriented editor); a cubic spline fit routine; a screen-oriented array editor; routines to support parsing; and various terminal support routines. These FORTRAN programming tools are written in FORTRAN 77 and ASSEMBLER for interactive and batch execution. FPT is intended for implementation on DEC VAX series computers operating under VMS. This collection of tools was developed in 1985.
      

      
      GPU-based cone beam computed tomography.
      PubMed
      Noël, Peter B; Walczak, Alan M; Xu, Jinhui; Corso, Jason J; Hoffmann, Kenneth R; Schafer, Sebastian
         2010-06-01
         The use of cone beam computed tomography (CBCT) is growing in the clinical arena due to its ability to provide 3D information during interventions, its high diagnostic quality (sub-millimeter resolution), and its short scanning times (60 s). In many situations, the short scanning time of CBCT is followed by a time-consuming 3D reconstruction. The standard reconstruction algorithm for CBCT data is the filtered backprojection, which for a volume of size 256(3) takes up to 25 min on a standard system. Recent developments in the area of Graphic Processing Units (GPUs) make it possible to have access to high-performance computing solutions at a low cost, allowing their use in many scientific problems. We have implemented an algorithm for 3D reconstruction of CBCT data using the Compute Unified Device Architecture (CUDA) provided by NVIDIA (NVIDIA Corporation, Santa Clara, California), which was executed on a NVIDIA GeForce GTX 280. Our implementation results in improved reconstruction times from minutes, and perhaps hours, to a matter of seconds, while also giving the clinician the ability to view 3D volumetric data at higher resolutions. We evaluated our implementation on ten clinical data sets and one phantom data set to observe if differences occur between CPU and GPU-based reconstructions. By using our approach, the computation time for 256(3) is reduced from 25 min on the CPU to 3.2 s on the GPU. The GPU reconstruction time for 512(3) volumes is 8.5 s. Copyright 2009 Elsevier Ireland Ltd. All rights reserved.
      

      
      Using a pruned, nondirect product basis in conjunction with the multi-configuration time-dependent Hartree (MCTDH) method
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Wodraszka, Robert, E-mail: Robert.Wodraszka@chem.queensu.ca; Carrington, Tucker, E-mail: Tucker.Carrington@queensu.ca
         
         In this paper, we propose a pruned, nondirect product multi-configuration time dependent Hartree (MCTDH) method for solving the Schrödinger equation. MCTDH uses optimized 1D basis functions, called single particle functions, but the size of the standard direct product MCTDH basis scales exponentially with D, the number of coordinates. We compare the pruned approach to standard MCTDH calculations for basis sizes small enough that the latter are possible and demonstrate that pruning the basis reduces the CPU cost of computing vibrational energy levels of acetonitrile (D = 12) by more than two orders of magnitude. Using the pruned method, it ismore » possible to do calculations with larger bases, for which the cost of standard MCTDH calculations is prohibitive. Pruning the basis complicates the evaluation of matrix-vector products. In this paper, they are done term by term for a sum-of-products Hamiltonian. When no attempt is made to exploit the fact that matrices representing some of the factors of a term are identity matrices, one needs only to carefully constrain indices. In this paper, we develop new ideas that make it possible to further reduce the CPU time by exploiting identity matrices.« less
      

      
      A fast - Monte Carlo toolkit on GPU for treatment plan dose recalculation in proton therapy
      NASA Astrophysics Data System (ADS)
      Senzacqua, M.; Schiavi, A.; Patera, V.; Pioli, S.; Battistoni, G.; Ciocca, M.; Mairani, A.; Magro, G.; Molinelli, S.
         2017-10-01
         In the context of the particle therapy a crucial role is played by Treatment Planning Systems (TPSs), tools aimed to compute and optimize the tratment plan. Nowadays one of the major issues related to the TPS in particle therapy is the large CPU time needed. We developed a software toolkit (FRED) for reducing dose recalculation time by exploiting Graphics Processing Units (GPU) hardware. Thanks to their high parallelization capability, GPUs significantly reduce the computation time, up to factor 100 respect to a standard CPU running software. The transport of proton beams in the patient is accurately described through Monte Carlo methods. Physical processes reproduced are: Multiple Coulomb Scattering, energy straggling and nuclear interactions of protons with the main nuclei composing the biological tissues. FRED toolkit does not rely on the water equivalent translation of tissues, but exploits the Computed Tomography anatomical information by reconstructing and simulating the atomic composition of each crossed tissue. FRED can be used as an efficient tool for dose recalculation, on the day of the treatment. In fact it can provide in about one minute on standard hardware the dose map obtained combining the treatment plan, earlier computed by the TPS, and the current patient anatomic arrangement.
      

      
      Numerical study of the effects of icing on viscous flow over wings
      NASA Technical Reports Server (NTRS)
      Sankar, L. N.
         1994-01-01
         An improved hybrid method for computing unsteady compressible viscous flows is presented. This method divides the computational domain into two zones. In the outer zone, the unsteady full-potential equation (FPE) is solved. In the inner zone, the Navier-Stokes equations are solved using a diagonal form of an alternating-direction implicit (ADI) approximate factorization procedure. The two zones are tightly coupled so that steady and unsteady flows may be efficiently solved. Characteristic-based viscous/inviscid interface boundary conditions are employed to avoid spurious reflections at that interface. The resulting CPU times are less than 60 percent of that required for a full-blown Navier-Stokes analysis for steady flow applications and about 60 percent of the Navier-Stokes CPU times for unsteady flows in non-vector processing machines. Applications of the method are presented for a rectangular NACA 0012 wing in low subsonic steady flow at moderate and high angles of attack, and for an F-5 wing in steady and unsteady subsonic and transonic flows. Steady surface pressures are in very good agreement with experimental data and are essentially identical to Navier-Stokes predictions. Density contours show that shocks cross the viscous/inviscid interface smoothly, so that the accuracy of full Navier-Stokes equations can be retained with a significant savings in computational time.
      

      
      Planning for distributed workflows: constraint-based coscheduling of computational jobs and data placement in distributed environments
      NASA Astrophysics Data System (ADS)
      Makatun, Dzmitry; Lauret, Jérôme; Rudová, Hana; Šumbera, Michal
         2015-05-01
         When running data intensive applications on distributed computational resources long I/O overheads may be observed as access to remotely stored data is performed. Latencies and bandwidth can become the major limiting factor for the overall computation performance and can reduce the CPU/WallTime ratio to excessive IO wait. Reusing the knowledge of our previous research, we propose a constraint programming based planner that schedules computational jobs and data placements (transfers) in a distributed environment in order to optimize resource utilization and reduce the overall processing completion time. The optimization is achieved by ensuring that none of the resources (network links, data storages and CPUs) are oversaturated at any moment of time and either (a) that the data is pre-placed at the site where the job runs or (b) that the jobs are scheduled where the data is already present. Such an approach eliminates the idle CPU cycles occurring when the job is waiting for the I/O from a remote site and would have wide application in the community. Our planner was evaluated and simulated based on data extracted from log files of batch and data management systems of the STAR experiment. The results of evaluation and estimation of performance improvements are discussed in this paper.
      

      
      A GPU-Accelerated Approach for Feature Tracking in Time-Varying Imagery Datasets.
      PubMed
      Peng, Chao; Sahani, Sandip; Rushing, John
         2017-10-01
         We propose a novel parallel connected component labeling (CCL) algorithm along with efficient out-of-core data management to detect and track feature regions of large time-varying imagery datasets. Our approach contributes to the big data field with parallel algorithms tailored for GPU architectures. We remove the data dependency between frames and achieve pixel-level parallelism. Due to the large size, the entire dataset cannot fit into cached memory. Frames have to be streamed through the memory hierarchy (disk to CPU main memory and then to GPU memory), partitioned, and processed as batches, where each batch is small enough to fit into the GPU. To reconnect the feature regions that are separated due to data partitioning, we present a novel batch merging algorithm to extract the region connection information across multiple batches in a parallel fashion. The information is organized in a memory-efficient structure and supports fast indexing on the GPU. Our experiment uses a commodity workstation equipped with a single GPU. The results show that our approach can efficiently process a weather dataset composed of terabytes of time-varying radar images. The advantages of our approach are demonstrated by comparing to the performance of an efficient CPU cluster implementation which is being used by the weather scientists.
      

      
      Automatic optimization of well locations in a North Sea fractured chalk reservoir using a front tracking reservoir simulator
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Rian, D.T.; Hage, A.
         1994-12-31
         A numerical simulator is often used as a reservoir management tool. One of its main purposes is to aid in the evaluation of number of wells, well locations and start time for wells. Traditionally, the optimization of a field development is done by a manual trial and error process. In this paper, an example of an automated technique is given. The core in the automization process is the reservoir simulator Frontline. Frontline is based on front tracking techniques, which makes it fast and accurate compared to traditional finite difference simulators. Due to its CPU-efficiency the simulator has been coupled withmore » an optimization module, which enables automatic optimization of location of wells, number of wells and start-up times. The simulator was used as an alternative method in the evaluation of waterflooding in a North Sea fractured chalk reservoir. Since Frontline, in principle, is 2D, Buckley-Leverett pseudo functions were used to represent the 3rd dimension. The area full field simulation model was run with up to 25 wells for 20 years in less than one minute of Vax 9000 CPU-time. The automatic Frontline evaluation indicated that a peripheral waterflood could double incremental recovery compared to a central pattern drive.« less
      

      
      Acceleration of discrete stochastic biochemical simulation using GPGPU.
      PubMed
      Sumiyoshi, Kei; Hirata, Kazuki; Hiroi, Noriko; Funahashi, Akira
         2015-01-01
         For systems made up of a small number of molecules, such as a biochemical network in a single cell, a simulation requires a stochastic approach, instead of a deterministic approach. The stochastic simulation algorithm (SSA) simulates the stochastic behavior of a spatially homogeneous system. Since stochastic approaches produce different results each time they are used, multiple runs are required in order to obtain statistical results; this results in a large computational cost. We have implemented a parallel method for using SSA to simulate a stochastic model; the method uses a graphics processing unit (GPU), which enables multiple realizations at the same time, and thus reduces the computational time and cost. During the simulation, for the purpose of analysis, each time course is recorded at each time step. A straightforward implementation of this method on a GPU is about 16 times faster than a sequential simulation on a CPU with hybrid parallelization; each of the multiple simulations is run simultaneously, and the computational tasks within each simulation are parallelized. We also implemented an improvement to the memory access and reduced the memory footprint, in order to optimize the computations on the GPU. We also implemented an asynchronous data transfer scheme to accelerate the time course recording function. To analyze the acceleration of our implementation on various sizes of model, we performed SSA simulations on different model sizes and compared these computation times to those for sequential simulations with a CPU. When used with the improved time course recording function, our method was shown to accelerate the SSA simulation by a factor of up to 130.
      

      
      Acceleration of discrete stochastic biochemical simulation using GPGPU
      PubMed Central
      Sumiyoshi, Kei; Hirata, Kazuki; Hiroi, Noriko; Funahashi, Akira
         2015-01-01
         For systems made up of a small number of molecules, such as a biochemical network in a single cell, a simulation requires a stochastic approach, instead of a deterministic approach. The stochastic simulation algorithm (SSA) simulates the stochastic behavior of a spatially homogeneous system. Since stochastic approaches produce different results each time they are used, multiple runs are required in order to obtain statistical results; this results in a large computational cost. We have implemented a parallel method for using SSA to simulate a stochastic model; the method uses a graphics processing unit (GPU), which enables multiple realizations at the same time, and thus reduces the computational time and cost. During the simulation, for the purpose of analysis, each time course is recorded at each time step. A straightforward implementation of this method on a GPU is about 16 times faster than a sequential simulation on a CPU with hybrid parallelization; each of the multiple simulations is run simultaneously, and the computational tasks within each simulation are parallelized. We also implemented an improvement to the memory access and reduced the memory footprint, in order to optimize the computations on the GPU. We also implemented an asynchronous data transfer scheme to accelerate the time course recording function. To analyze the acceleration of our implementation on various sizes of model, we performed SSA simulations on different model sizes and compared these computation times to those for sequential simulations with a CPU. When used with the improved time course recording function, our method was shown to accelerate the SSA simulation by a factor of up to 130. PMID:25762936
      

      
      Implementation of ADI: Schemes on MIMD parallel computers
      NASA Technical Reports Server (NTRS)
      Vanderwijngaart, Rob F.
         1993-01-01
         In order to simulate the effects of the impingement of hot exhaust jets of High Performance Aircraft on landing surfaces a multi-disciplinary computation coupling flow dynamics to heat conduction in the runway needs to be carried out. Such simulations, which are essentially unsteady, require very large computational power in order to be completed within a reasonable time frame of the order of an hour. Such power can be furnished by the latest generation of massively parallel computers. These remove the bottleneck of ever more congested data paths to one or a few highly specialized central processing units (CPU's) by having many off-the-shelf CPU's work independently on their own data, and exchange information only when needed. During the past year the first phase of this project was completed, in which the optimal strategy for mapping an ADI-algorithm for the three dimensional unsteady heat equation to a MIMD parallel computer was identified. This was done by implementing and comparing three different domain decomposition techniques that define the tasks for the CPU's in the parallel machine. These implementations were done for a Cartesian grid and Dirichlet boundary conditions. The most promising technique was then used to implement the heat equation solver on a general curvilinear grid with a suite of nontrivial boundary conditions. Finally, this technique was also used to implement the Scalar Penta-diagonal (SP) benchmark, which was taken from the NAS Parallel Benchmarks report. All implementations were done in the programming language C on the Intel iPSC/860 computer.
      

      
      Invasive treatment of NSTEMI patients in German Chest Pain Units - Evidence for a treatment paradox.
      PubMed
      Schmidt, Frank P; Schmitt, Claus; Hochadel, Matthias; Giannitsis, Evangelos; Darius, Harald; Maier, Lars S; Schmitt, Claus; Heusch, Gerd; Voigtländer, Thomas; Mudra, Harald; Gori, Tommaso; Senges, Jochen; Münzel, Thomas
         2018-03-15
         Patients with non ST-segment elevation myocardial infarction (NSTEMI) represent the largest fraction of patients with acute coronary syndrome in German Chest Pain units. Recent evidence on early vs. selective percutaneous coronary intervention (PCI) is ambiguous with respect to effects on mortality, myocardial infarction (MI) and recurrent angina. With the present study we sought to investigate the prognostic impact of PCI and its timing in German Chest Pain Unit (CPU) NSTEMI patients. Data from 1549 patients whose leading diagnosis was NSTEMI were retrieved from the German CPU registry for the interval between 3/2010 and 3/2014. Follow-up was available at median of 167days after discharge. The patients were grouped into a higher (Group A) and lower risk group (Group B) according to GRACE score and additional criteria on admission. Group A had higher Killip classes, higher BNP levels, reduced EF and significant more triple vessel disease (p<0.001). Surprisingly, patients in group A less frequently received early diagnostic catheterization and PCI. While conservative management did not affect prognosis in Group B, higher-risk CPU-NSTEMI patients without PCI had a significantly worse survival. The present results reveal a substantial treatment gap in higher-risk NSTEMI patients in German Chest Pain Units. This treatment paradox may worsen prognosis in patients who could derive the largest benefit from early revascularization. Copyright © 2017 The Authors. Published by Elsevier B.V. All rights reserved.
      

      
      High Capacity Single Table Performance Design Using Partitioning in Oracle or PostgreSQL
      DTIC Science & Technology
      
         2012-03-01
         Indicators ( KPIs ) 13  5.  Conclusion 14  List of Symbols, Abbreviations, and Acronyms 15  Distribution List 16 iv List of Figures Figure 1. Oracle...Figure 7. Time to seek and return one record. 4. Additional Key Performance Indicators ( KPIs ) In addition to pure response time, there are other...Laboratory ASM Automatic Storage Management CPU central processing unit I/O input/output KPIs key performance indicators OS operating system
      

      
      A mass, momentum, and energy conserving, fully implicit, scalable algorithm for the multi-dimensional, multi-species Rosenbluth-Fokker-Planck equation
      NASA Astrophysics Data System (ADS)
      Taitano, W. T.; Chacón, L.; Simakov, A. N.; Molvig, K.
         2015-09-01
         In this study, we demonstrate a fully implicit algorithm for the multi-species, multidimensional Rosenbluth-Fokker-Planck equation which is exactly mass-, momentum-, and energy-conserving, and which preserves positivity. Unlike most earlier studies, we base our development on the Rosenbluth (rather than Landau) form of the Fokker-Planck collision operator, which reduces complexity while allowing for an optimal fully implicit treatment. Our discrete conservation strategy employs nonlinear constraints that force the continuum symmetries of the collision operator to be satisfied upon discretization. We converge the resulting nonlinear system iteratively using Jacobian-free Newton-Krylov methods, effectively preconditioned with multigrid methods for efficiency. Single- and multi-species numerical examples demonstrate the advertised accuracy properties of the scheme, and the superior algorithmic performance of our approach. In particular, the discretization approach is numerically shown to be second-order accurate in time and velocity space and to exhibit manifestly positive entropy production. That is, H-theorem behavior is indicated for all the examples we have tested. The solution approach is demonstrated to scale optimally with respect to grid refinement (with CPU time growing linearly with the number of mesh points), and timestep (showing very weak dependence of CPU time with time-step size). As a result, the proposed algorithm delivers several orders-of-magnitude speedup vs. explicit algorithms.
      

      
      GPU-based prompt gamma ray imaging from boron neutron capture therapy
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Yoon, Do-Kun; Jung, Joo-Young; Suk Suh, Tae, E-mail: suhsanta@catholic.ac.kr
         
         Purpose: The purpose of this research is to perform the fast reconstruction of a prompt gamma ray image using a graphics processing unit (GPU) computation from boron neutron capture therapy (BNCT) simulations. Methods: To evaluate the accuracy of the reconstructed image, a phantom including four boron uptake regions (BURs) was used in the simulation. After the Monte Carlo simulation of the BNCT, the modified ordered subset expectation maximization reconstruction algorithm using the GPU computation was used to reconstruct the images with fewer projections. The computation times for image reconstruction were compared between the GPU and the central processing unit (CPU).more » Also, the accuracy of the reconstructed image was evaluated by a receiver operating characteristic (ROC) curve analysis. Results: The image reconstruction time using the GPU was 196 times faster than the conventional reconstruction time using the CPU. For the four BURs, the area under curve values from the ROC curve were 0.6726 (A-region), 0.6890 (B-region), 0.7384 (C-region), and 0.8009 (D-region). Conclusions: The tomographic image using the prompt gamma ray event from the BNCT simulation was acquired using the GPU computation in order to perform a fast reconstruction during treatment. The authors verified the feasibility of the prompt gamma ray image reconstruction using the GPU computation for BNCT simulations.« less
      

      
      TU-FG-BRB-07: GPU-Based Prompt Gamma Ray Imaging From Boron Neutron Capture Therapy
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Kim, S; Suh, T; Yoon, D
         
         Purpose: The purpose of this research is to perform the fast reconstruction of a prompt gamma ray image using a graphics processing unit (GPU) computation from boron neutron capture therapy (BNCT) simulations. Methods: To evaluate the accuracy of the reconstructed image, a phantom including four boron uptake regions (BURs) was used in the simulation. After the Monte Carlo simulation of the BNCT, the modified ordered subset expectation maximization reconstruction algorithm using the GPU computation was used to reconstruct the images with fewer projections. The computation times for image reconstruction were compared between the GPU and the central processing unit (CPU).more » Also, the accuracy of the reconstructed image was evaluated by a receiver operating characteristic (ROC) curve analysis. Results: The image reconstruction time using the GPU was 196 times faster than the conventional reconstruction time using the CPU. For the four BURs, the area under curve values from the ROC curve were 0.6726 (A-region), 0.6890 (B-region), 0.7384 (C-region), and 0.8009 (D-region). Conclusion: The tomographic image using the prompt gamma ray event from the BNCT simulation was acquired using the GPU computation in order to perform a fast reconstruction during treatment. The authors verified the feasibility of the prompt gamma ray reconstruction using the GPU computation for BNCT simulations.« less
      

      
      On the performance of large Gaussian basis sets for the computation of total atomization energies
      NASA Technical Reports Server (NTRS)
      Martin, J. M. L.
         1992-01-01
         The total atomization energies of a number of molecules have been computed using an augmented coupled-cluster method and (5s4p3d2f1g) and 4s3p2d1f) atomic natural orbital (ANO) basis sets, as well as the correlation consistent valence triple zeta plus polarization (cc-pVTZ) correlation consistent valence quadrupole zeta plus polarization (cc-pVQZ) basis sets. The performance of ANO and correlation consistent basis sets is comparable throughout, although the latter can result in significant CPU time savings. Whereas the inclusion of g functions has significant effects on the computed Sigma D(e) values, chemical accuracy is still not reached for molecules involving multiple bonds. A Gaussian-1 (G) type correction lowers the error, but not much beyond the accuracy of the G1 model itself. Using separate corrections for sigma bonds, pi bonds, and valence pairs brings down the mean absolute error to less than 1 kcal/mol for the spdf basis sets, and about 0.5 kcal/mol for the spdfg basis sets. Some conclusions on the success of the Gaussian-1 and Gaussian-2 models are drawn.
      

        
       
          

«

17
      18
      19
   20
      21
      »

          
        

     

   

   
       
            
              
          

«

18
      19
      20
   21
      22
      »

          
        

           
           
             
               
      
      Evaluating Academic Journals Using Impact Factor and Local Citation Score
      ERIC Educational Resources Information Center
      Chung, Hye-Kyung
         2007-01-01
         This study presents a method for journal collection evaluation using citation analysis. Cost-per-use (CPU) for each title is used to measure cost-effectiveness with higher CPU scores indicating cost-effective titles. Use data are based on the impact factor and locally collected citation score of each title and is compared to the cost of managing…
      

      
      Emissivity of Rocket Plume Particulates
      DTIC Science & Technology
      
         1992-09-01
         V. EXPERIMENTAL RESULTS ........ ............... 29 VI. CONCLUSIONS AND RECOMMENDATIONS .... ........ 32 APPENDIX A. CATS -E SOFTWARE...interfaced through the CATS E Thermal Analysis software, which is MS-DOS based, and can be run on any 28b or higher CPU. This system allows real-time...body source to establish the parameters required by the CATS program for proper microscope/scanner interface. A complete description of microscope
      

      
      Vulnerability Model. A Simulation System for Assessing Damage Resulting from Marine Spills
      DTIC Science & Technology
      
         1975-06-01
         used and the scenario simulated. The test runs were made on an IBM 360/65 computer. Running times were generally between 15 and 35 CPU seconds...fect filrthcr north. A petroleum tank-truck operation was located within 600 feet Of L𔃻:- stock pond on which the crude oil had dammred itp . At 5 A-M
      

      
      Algorithms and Application of Sparse Matrix Assembly and Equation Solvers for Aeroacoustics
      NASA Technical Reports Server (NTRS)
      Watson, W. R.; Nguyen, D. T.; Reddy, C. J.; Vatsa, V. N.; Tang, W. H.
         2001-01-01
         An algorithm for symmetric sparse equation solutions on an unstructured grid is described. Efficient, sequential sparse algorithms for degree-of-freedom reordering, supernodes, symbolic/numerical factorization, and forward backward solution phases are reviewed. Three sparse algorithms for the generation and assembly of symmetric systems of matrix equations are presented. The accuracy and numerical performance of the sequential version of the sparse algorithms are evaluated over the frequency range of interest in a three-dimensional aeroacoustics application. Results show that the solver solutions are accurate using a discretization of 12 points per wavelength. Results also show that the first assembly algorithm is impractical for high-frequency noise calculations. The second and third assembly algorithms have nearly equal performance at low values of source frequencies, but at higher values of source frequencies the third algorithm saves CPU time and RAM. The CPU time and the RAM required by the second and third assembly algorithms are two orders of magnitude smaller than that required by the sparse equation solver. A sequential version of these sparse algorithms can, therefore, be conveniently incorporated into a substructuring for domain decomposition formulation to achieve parallel computation, where different substructures are handles by different parallel processors.
      

      
      High-Speed Particle-in-Cell Simulation Parallelized with Graphic Processing Units for Low Temperature Plasmas for Material Processing
      NASA Astrophysics Data System (ADS)
      Hur, Min Young; Verboncoeur, John; Lee, Hae June
         2014-10-01
         Particle-in-cell (PIC) simulations have high fidelity in the plasma device requiring transient kinetic modeling compared with fluid simulations. It uses less approximation on the plasma kinetics but requires many particles and grids to observe the semantic results. It means that the simulation spends lots of simulation time in proportion to the number of particles. Therefore, PIC simulation needs high performance computing. In this research, a graphic processing unit (GPU) is adopted for high performance computing of PIC simulation for low temperature discharge plasmas. GPUs have many-core processors and high memory bandwidth compared with a central processing unit (CPU). NVIDIA GeForce GPUs were used for the test with hundreds of cores which show cost-effective performance. PIC code algorithm is divided into two modules which are a field solver and a particle mover. The particle mover module is divided into four routines which are named move, boundary, Monte Carlo collision (MCC), and deposit. Overall, the GPU code solves particle motions as well as electrostatic potential in two-dimensional geometry almost 30 times faster than a single CPU code. This work was supported by the Korea Institute of Science Technology Information.
      

      
      An efficient sparse matrix multiplication scheme for the CYBER 205 computer
      NASA Technical Reports Server (NTRS)
      Lambiotte, Jules J., Jr.
         1988-01-01
         This paper describes the development of an efficient algorithm for computing the product of a matrix and vector on a CYBER 205 vector computer. The desire to provide software which allows the user to choose between the often conflicting goals of minimizing central processing unit (CPU) time or storage requirements has led to a diagonal-based algorithm in which one of four types of storage is selected for each diagonal. The candidate storage types employed were chosen to be efficient on the CYBER 205 for diagonals which have nonzero structure which is dense, moderately sparse, very sparse and short, or very sparse and long; however, for many densities, no diagonal type is most efficient with respect to both resource requirements, and a trade-off must be made. For each diagonal, an initialization subroutine estimates the CPU time and storage required for each storage type based on results from previously performed numerical experimentation. These requirements are adjusted by weights provided by the user which reflect the relative importance the user places on the two resources. The adjusted resource requirements are then compared to select the most efficient storage and computational scheme.
      

      
      Design of a memory-access controller with 3.71-times-enhanced energy efficiency for Internet-of-Things-oriented nonvolatile microcontroller unit
      NASA Astrophysics Data System (ADS)
      Natsui, Masanori; Hanyu, Takahiro
         2018-04-01
         In realizing a nonvolatile microcontroller unit (MCU) for sensor nodes in Internet-of-Things (IoT) applications, it is important to solve the data-transfer bottleneck between the central processing unit (CPU) and the nonvolatile memory constituting the MCU. As one circuit-oriented approach to solving this problem, we propose a memory access minimization technique for magnetoresistive-random-access-memory (MRAM)-embedded nonvolatile MCUs. In addition to multiplexing and prefetching of memory access, the proposed technique realizes efficient instruction fetch by eliminating redundant memory access while considering the code length of the instruction to be fetched and the transition of the memory address to be accessed. As a result, the performance of the MCU can be improved while relaxing the performance requirement for the embedded MRAM, and compact and low-power implementation can be performed as compared with the conventional cache-based one. Through the evaluation using a system consisting of a general purpose 32-bit CPU and embedded MRAM, it is demonstrated that the proposed technique increases the peak efficiency of the system up to 3.71 times, while a 2.29-fold area reduction is achieved compared with the cache-based one.
      

      
      Towards more stable operation of the Tokyo Tier2 center
      NASA Astrophysics Data System (ADS)
      Nakamura, T.; Mashimo, T.; Matsui, N.; Sakamoto, H.; Ueda, I.
         2014-06-01
         The Tokyo Tier2 center, which is located at the International Center for Elementary Particle Physics (ICEPP) in the University of Tokyo, was established as a regional analysis center in Japan for the ATLAS experiment. The official operation with WLCG was started in 2007 after the several years development since 2002. In December 2012, we have replaced almost all hardware as the third system upgrade to deal with analysis for further growing data of the ATLAS experiment. The number of CPU cores are increased by factor of two (9984 cores in total), and the performance of individual CPU core is improved by 20% according to the HEPSPEC06 benchmark test at 32bit compile mode. The score is estimated as 18.03 (SL6) per core by using Intel Xeon E5-2680 2.70 GHz. Since all worker nodes are made by 16 CPU cores configuration, we deployed 624 blade servers in total. They are connected to 6.7 PB of disk storage system with non-blocking 10 Gbps internal network backbone by using two center network switches (NetIron MLXe-32). The disk storage is made by 102 of RAID6 disk arrays (Infortrend DS S24F-G2840-4C16DO0) and served by equivalent number of 1U file servers with 8G-FC connection to maximize the file transfer throughput per storage capacity. As of February 2013, 2560 CPU cores and 2.00 PB of disk storage have already been deployed for WLCG. Currently, the remaining non-grid resources for both CPUs and disk storage are used as dedicated resources for the data analysis by the ATLAS Japan collaborators. Since all hardware in the non-grid resources are made by same architecture with Tier2 resource, they will be able to be migrated as the Tier2 extra resource on demand of the ATLAS experiment in the future. In addition to the upgrade of computing resources, we expect the improvement of connectivity on the wide area network. Thanks to the Japanese NREN (NII), another 10 Gbps trans-Pacific line from Japan to Washington will be available additionally with existing two 10 Gbps lines (Tokyo to New York and Tokyo to Los Angeles). The new line will be connected to LHCONE for the more improvement of the connectivity. In this circumstance, we are working for the further stable operation. For instance, we have newly introduced GPFS (IBM) for the non-grid disk storage, while Disk Pool Manager (DPM) are continued to be used as Tier2 disk storage from the previous system. Since the number of files stored in a DPM pool will be increased with increasing the total amount of data, the development of stable database configuration is one of the crucial issues as well as scalability. We have started some studies on the performance of asynchronous database replication so that we can take daily full backup. In this report, we would like to introduce several improvements in terms of the performances and stability of our new system and possibility of the further improvement of local I/O performance in the multi-core worker node. We also present the status of the wide area network connectivity from Japan to US and/or EU with LHCONE.
      

      
      GPU-accelerated iterative reconstruction for limited-data tomography in CBCT systems.
      PubMed
      de Molina, Claudia; Serrano, Estefania; Garcia-Blas, Javier; Carretero, Jesus; Desco, Manuel; Abella, Monica
         2018-05-15
         Standard cone-beam computed tomography (CBCT) involves the acquisition of at least 360 projections rotating through 360 degrees. Nevertheless, there are cases in which only a few projections can be taken in a limited angular span, such as during surgery, where rotation of the source-detector pair is limited to less than 180 degrees. Reconstruction of limited data with the conventional method proposed by Feldkamp, Davis and Kress (FDK) results in severe artifacts. Iterative methods may compensate for the lack of data by including additional prior information, although they imply a high computational burden and memory consumption. We present an accelerated implementation of an iterative method for CBCT following the Split Bregman formulation, which reduces computational time through GPU-accelerated kernels. The implementation enables the reconstruction of large volumes (>1024 3 pixels) using partitioning strategies in forward- and back-projection operations. We evaluated the algorithm on small-animal data for different scenarios with different numbers of projections, angular span, and projection size. Reconstruction time varied linearly with the number of projections and quadratically with projection size but remained almost unchanged with angular span. Forward- and back-projection operations represent 60% of the total computational burden. Efficient implementation using parallel processing and large-memory management strategies together with GPU kernels enables the use of advanced reconstruction approaches which are needed in limited-data scenarios. Our GPU implementation showed a significant time reduction (up to 48 ×) compared to a CPU-only implementation, resulting in a total reconstruction time from several hours to few minutes.
      

      
      Evaluation of the Monotonic Lagrangian Grid and Lat-Long Grid for Air Traffic Management
      NASA Technical Reports Server (NTRS)
      Kaplan, Carolyn; Dahm, Johann; Oran, Elaine; Alexandrov, Natalia; Boris, Jay
         2011-01-01
         The Air Traffic Monotonic Lagrangian Grid (ATMLG) is used to simulate a 24 hour period of air traffic flow in the National Airspace System (NAS). During this time period, there are 41,594 flights over the United States, and the flight plan information (departure and arrival airports and times, and waypoints along the way) are obtained from an Federal Aviation Administration (FAA) Enhanced Traffic Management System (ETMS) dataset. Two simulation procedures are tested and compared: one based on the Monotonic Lagrangian Grid (MLG), and the other based on the stationary Latitude-Longitude (Lat- Long) grid. Simulating one full day of air traffic over the United States required the following amounts of CPU time on a single processor of an SGI Altix: 88 s for the MLG method, and 163 s for the Lat-Long grid method. We present a discussion of the amount of CPU time required for each of the simulation processes (updating aircraft trajectories, sorting, conflict detection and resolution, etc.), and show that the main advantage of the MLG method is that it is a general sorting algorithm that can sort on multiple properties. We discuss how many MLG neighbors must be considered in the separation assurance procedure in order to ensure a five-mile separation buffer between aircraft, and we investigate the effect of removing waypoints from aircraft trajectories. When aircraft choose their own trajectory, there are more flights with shorter duration times and fewer CD&R maneuvers, resulting in significant fuel savings.
      

      
      Pipelined CPU Design with FPGA in Teaching Computer Architecture
      ERIC Educational Resources Information Center
      Lee, Jong Hyuk; Lee, Seung Eun; Yu, Heon Chang; Suh, Taeweon
         2012-01-01
         This paper presents a pipelined CPU design project with a field programmable gate array (FPGA) system in a computer architecture course. The class project is a five-stage pipelined 32-bit MIPS design with experiments on the Altera DE2 board. For proper scheduling, milestones were set every one or two weeks to help students complete the project on…
      

      
      Optimizing legacy molecular dynamics software with directive-based offload
      DOE PAGES
      Michael Brown, W.; Carrillo, Jan-Michael Y.; Gavhane, Nitin; ...
         2015-05-14
         The directive-based programming models are one solution for exploiting many-core coprocessors to increase simulation rates in molecular dynamics. They offer the potential to reduce code complexity with offload models that can selectively target computations to run on the CPU, the coprocessor, or both. In our paper, we describe modifications to the LAMMPS molecular dynamics code to enable concurrent calculations on a CPU and coprocessor. We also demonstrate that standard molecular dynamics algorithms can run efficiently on both the CPU and an x86-based coprocessor using the same subroutines. As a consequence, we demonstrate that code optimizations for the coprocessor also resultmore » in speedups on the CPU; in extreme cases up to 4.7X. We provide results for LAMMAS benchmarks and for production molecular dynamics simulations using the Stampede hybrid supercomputer with both Intel (R) Xeon Phi (TM) coprocessors and NVIDIA GPUs: The optimizations presented have increased simulation rates by over 2X for organic molecules and over 7X for liquid crystals on Stampede. The optimizations are available as part of the "Intel package" supplied with LAMMPS. (C) 2015 Elsevier B.V. All rights reserved.« less
      

      
      A novel heterogeneous algorithm to simulate multiphase flow in porous media on multicore CPU-GPU systems
      NASA Astrophysics Data System (ADS)
      McClure, J. E.; Prins, J. F.; Miller, C. T.
         2014-07-01
         Multiphase flow implementations of the lattice Boltzmann method (LBM) are widely applied to the study of porous medium systems. In this work, we construct a new variant of the popular "color" LBM for two-phase flow in which a three-dimensional, 19-velocity (D3Q19) lattice is used to compute the momentum transport solution while a three-dimensional, seven velocity (D3Q7) lattice is used to compute the mass transport solution. Based on this formulation, we implement a novel heterogeneous GPU-accelerated algorithm in which the mass transport solution is computed by multiple shared memory CPU cores programmed using OpenMP while a concurrent solution of the momentum transport is performed using a GPU. The heterogeneous solution is demonstrated to provide speedup of 2.6 × as compared to multi-core CPU solution and 1.8 × compared to GPU solution due to concurrent utilization of both CPU and GPU bandwidths. Furthermore, we verify that the proposed formulation provides an accurate physical representation of multiphase flow processes and demonstrate that the approach can be applied to perform heterogeneous simulations of two-phase flow in porous media using a typical GPU-accelerated workstation.
      

      
      Musrfit-Real Time Parameter Fitting Using GPUs
      NASA Astrophysics Data System (ADS)
      Locans, Uldis; Suter, Andreas
         
         High transverse field μSR (HTF-μSR) experiments typically lead to a rather large data sets, since it is necessary to follow the high frequencies present in the positron decay histograms. The analysis of these data sets can be very time consuming, usually due to the limited computational power of the hardware. To overcome the limited computing resources rotating reference frame transformation (RRF) is often used to reduce the data sets that need to be handled. This comes at a price typically the μSR community is not aware of: (i) due to the RRF transformation the fitting parameter estimate is of poorer precision, i.e., more extended expensive beamtime is needed. (ii) RRF introduces systematic errors which hampers the statistical interpretation of χ2 or the maximum log-likelihood. We will briefly discuss these issues in a non-exhaustive practical way. The only and single purpose of the RRF transformation is the sluggish computer power. Therefore during this work GPU (Graphical Processing Units) based fitting was developed which allows to perform real-time full data analysis without RRF. GPUs have become increasingly popular in scientific computing in recent years. Due to their highly parallel architecture they provide the opportunity to accelerate many applications with considerably less costs than upgrading the CPU computational power. With the emergence of frameworks such as CUDA and OpenCL these devices have become more easily programmable. During this work GPU support was added to Musrfit- a data analysis framework for μSR experiments. The new fitting algorithm uses CUDA or OpenCL to offload the most time consuming parts of the calculations to Nvidia or AMD GPUs. Using the current CPU implementation in Musrfit parameter fitting can take hours for certain data sets while the GPU version can allow to perform real-time data analysis on the same data sets. This work describes the challenges that arise in adding the GPU support to t as well as results obtained using the GPU version. The speedups using the GPU were measured comparing to the CPU implementation. Two different GPUs were used for the comparison — high end Nvidia Tesla K40c GPU designed for HPC applications and AMD Radeon R9 390× GPU designed for gaming industry.
      

      
      Bifrost: a Modular Python/C++ Framework for Development of High-Throughput Data Analysis Pipelines
      NASA Astrophysics Data System (ADS)
      Cranmer, Miles; Barsdell, Benjamin R.; Price, Danny C.; Garsden, Hugh; Taylor, Gregory B.; Dowell, Jayce; Schinzel, Frank; Costa, Timothy; Greenhill, Lincoln J.
         2017-01-01
         Large radio interferometers have data rates that render long-term storage of raw correlator data infeasible, thus motivating development of real-time processing software. For high-throughput applications, processing pipelines are challenging to design and implement. Motivated by science efforts with the Long Wavelength Array, we have developed Bifrost, a novel Python/C++ framework that eases the development of high-throughput data analysis software by packaging algorithms as black box processes in a directed graph. This strategy to modularize code allows astronomers to create parallelism without code adjustment. Bifrost uses CPU/GPU ’circular memory’ data buffers that enable ready introduction of arbitrary functions into the processing path for ’streams’ of data, and allow pipelines to automatically reconfigure in response to astrophysical transient detection or input of new observing settings. We have deployed and tested Bifrost at the latest Long Wavelength Array station, in Sevilleta National Wildlife Refuge, NM, where it handles throughput exceeding 10 Gbps per CPU core.
      

      
      FPGA-based prototype storage system with phase change memory
      NASA Astrophysics Data System (ADS)
      Li, Gezi; Chen, Xiaogang; Chen, Bomy; Li, Shunfen; Zhou, Mi; Han, Wenbing; Song, Zhitang
         2016-10-01
         With the ever-increasing amount of data being stored via social media, mobile telephony base stations, and network devices etc. the database systems face severe bandwidth bottlenecks when moving vast amounts of data from storage to the processing nodes. At the same time, Storage Class Memory (SCM) technologies such as Phase Change Memory (PCM) with unique features like fast read access, high density, non-volatility, byte-addressability, positive response to increasing temperature, superior scalability, and zero standby leakage have changed the landscape of modern computing and storage systems. In such a scenario, we present a storage system called FLEET which can off-load partial or whole SQL queries to the storage engine from CPU. FLEET uses an FPGA rather than conventional CPUs to implement the off-load engine due to its highly parallel nature. We have implemented an initial prototype of FLEET with PCM-based storage. The results demonstrate that significant performance and CPU utilization gains can be achieved by pushing selected query processing components inside in PCM-based storage.
      

      
      Optimizing a mobile robot control system using GPU acceleration
      NASA Astrophysics Data System (ADS)
      Tuck, Nat; McGuinness, Michael; Martin, Fred
         2012-01-01
         This paper describes our attempt to optimize a robot control program for the Intelligent Ground Vehicle Competition (IGVC) by running computationally intensive portions of the system on a commodity graphics processing unit (GPU). The IGVC Autonomous Challenge requires a control program that performs a number of different computationally intensive tasks ranging from computer vision to path planning. For the 2011 competition our Robot Operating System (ROS) based control system would not run comfortably on the multicore CPU on our custom robot platform. The process of profiling the ROS control program and selecting appropriate modules for porting to run on a GPU is described. A GPU-targeting compiler, Bacon, is used to speed up development and help optimize the ported modules. The impact of the ported modules on overall performance is discussed. We conclude that GPU optimization can free a significant amount of CPU resources with minimal effort for expensive user-written code, but that replacing heavily-optimized library functions is more difficult, and a much less efficient use of time.
      

      
      Cross-Identification of Astronomical Catalogs on Multiple GPUs
      NASA Astrophysics Data System (ADS)
      Lee, M. A.; Budavári, T.
         2013-10-01
         One of the most fundamental problems in observational astronomy is the cross-identification of sources. Observations are made in different wavelengths, at different times, and from different locations and instruments, resulting in a large set of independent observations. The scientific outcome is often limited by our ability to quickly perform meaningful associations between detections. The matching, however, is difficult scientifically, statistically, as well as computationally. The former two require detailed physical modeling and advanced probabilistic concepts; the latter is due to the large volumes of data and the problem's combinatorial nature. In order to tackle the computational challenge and to prepare for future surveys, whose measurements will be exponentially increasing in size past the scale of feasible CPU-based solutions, we developed a new implementation which addresses the issue by performing the associations on multiple Graphics Processing Units (GPUs). Our implementation utilizes up to 6 GPUs in combination with the Thrust library to achieve an over 40x speed up verses the previous best implementation running on a multi-CPU SQL Server.
      

      
      GPU accelerated manifold correction method for spinning compact binaries
      NASA Astrophysics Data System (ADS)
      Ran, Chong-xi; Liu, Song; Zhong, Shuang-ying
         2018-04-01
         The graphics processing unit (GPU) acceleration of the manifold correction algorithm based on the compute unified device architecture (CUDA) technology is designed to simulate the dynamic evolution of the Post-Newtonian (PN) Hamiltonian formulation of spinning compact binaries. The feasibility and the efficiency of parallel computation on GPU have been confirmed by various numerical experiments. The numerical comparisons show that the accuracy on GPU execution of manifold corrections method has a good agreement with the execution of codes on merely central processing unit (CPU-based) method. The acceleration ability when the codes are implemented on GPU can increase enormously through the use of shared memory and register optimization techniques without additional hardware costs, implying that the speedup is nearly 13 times as compared with the codes executed on CPU for phase space scan (including 314 × 314 orbits). In addition, GPU-accelerated manifold correction method is used to numerically study how dynamics are affected by the spin-induced quadrupole-monopole interaction for black hole binary system.
      

      
      Portable implementation model for CFD simulations. Application to hybrid CPU/GPU supercomputers
      NASA Astrophysics Data System (ADS)
      Oyarzun, Guillermo; Borrell, Ricard; Gorobets, Andrey; Oliva, Assensi
         2017-10-01
         Nowadays, high performance computing (HPC) systems experience a disruptive moment with a variety of novel architectures and frameworks, without any clarity of which one is going to prevail. In this context, the portability of codes across different architectures is of major importance. This paper presents a portable implementation model based on an algebraic operational approach for direct numerical simulation (DNS) and large eddy simulation (LES) of incompressible turbulent flows using unstructured hybrid meshes. The strategy proposed consists in representing the whole time-integration algorithm using only three basic algebraic operations: sparse matrix-vector product, a linear combination of vectors and dot product. The main idea is based on decomposing the nonlinear operators into a concatenation of two SpMV operations. This provides high modularity and portability. An exhaustive analysis of the proposed implementation for hybrid CPU/GPU supercomputers has been conducted with tests using up to 128 GPUs. The main objective consists in understanding the challenges of implementing CFD codes on new architectures.
      

        
       
          

«

18
      19
      20
   21
      22
      »

          
        

     

   

   
       
            
              
          

«

19
      20
      21
   22
      23
      »

          
        

           
           
             
               
      
      An efficient solver for large structured eigenvalue problems in relativistic quantum chemistry
      NASA Astrophysics Data System (ADS)
      Shiozaki, Toru
         2017-01-01
         We report an efficient program for computing the eigenvalues and symmetry-adapted eigenvectors of very large quaternionic (or Hermitian skew-Hamiltonian) matrices, using which structure-preserving diagonalisation of matrices of dimension N > 10, 000 is now routine on a single computer node. Such matrices appear frequently in relativistic quantum chemistry owing to the time-reversal symmetry. The implementation is based on a blocked version of the Paige-Van Loan algorithm, which allows us to use the Level 3 BLAS subroutines for most of the computations. Taking advantage of the symmetry, the program is faster by up to a factor of 2 than state-of-the-art implementations of complex Hermitian diagonalisation; diagonalising a 12, 800 × 12, 800 matrix took 42.8 (9.5) and 85.6 (12.6) minutes with 1 CPU core (16 CPU cores) using our symmetry-adapted solver and Intel Math Kernel Library's ZHEEV that is not structure-preserving, respectively. The source code is publicly available under the FreeBSD licence.
      

      
      Characteristic-based algorithms for flows in thermo-chemical nonequilibrium
      NASA Technical Reports Server (NTRS)
      Walters, Robert W.; Cinnella, Pasquale; Slack, David C.; Halt, David
         1990-01-01
         A generalized finite-rate chemistry algorithm with Steger-Warming, Van Leer, and Roe characteristic-based flux splittings is presented in three-dimensional generalized coordinates for the Navier-Stokes equations. Attention is placed on convergence to steady-state solutions with fully coupled chemistry. Time integration schemes including explicit m-stage Runge-Kutta, implicit approximate-factorization, relaxation and LU decomposition are investigated and compared in terms of residual reduction per unit of CPU time. Practical issues such as code vectorization and memory usage on modern supercomputers are discussed.
      

      
      High thermal conductivity liquid metal pad for heat dissipation in electronic devices
      NASA Astrophysics Data System (ADS)
      Lin, Zuoye; Liu, Huiqiang; Li, Qiuguo; Liu, Han; Chu, Sheng; Yang, Yuhua; Chu, Guang
         2018-05-01
         Novel thermal interface materials using Ag-doped Ga-based liquid metal were proposed for heat dissipation of electronic packaging and precision equipment. On one hand, the viscosity and fluidity of liquid metal was controlled to prevent leakage; on the other hand, the thermal conductivity of the Ga-based liquid metal was increased up to 46 W/mK by incorporating Ag nanoparticles. A series of experiments were performed to evaluate the heat dissipation performance on a CPU of smart-phone. The results demonstrated that the Ag-doped Ga-based liquid metal pad can effectively decrease the CPU temperature and change the heat flow path inside the smart-phone. To understand the heat flow path from CPU to screen through the interface material, heat dissipation mechanism was simulated and discussed.
      

      
      Dynamic modelling and estimation of the error due to asynchronism in a redundant asynchronous multiprocessor system
      NASA Technical Reports Server (NTRS)
      Huynh, Loc C.; Duval, R. W.
         1986-01-01
         The use of Redundant Asynchronous Multiprocessor System to achieve ultrareliable Fault Tolerant Control Systems shows great promise. The development has been hampered by the inability to determine whether differences in the outputs of redundant CPU's are due to failures or to accrued error built up by slight differences in CPU clock intervals. This study derives an analytical dynamic model of the difference between redundant CPU's due to differences in their clock intervals and uses this model with on-line parameter identification to idenitify the differences in the clock intervals. The ability of this methodology to accurately track errors due to asynchronisity generate an error signal with the effect of asynchronisity removed and this signal may be used to detect and isolate actual system failures.
      

      
      IGA-ADS: Isogeometric analysis FEM using ADS solver
      NASA Astrophysics Data System (ADS)
      Łoś, Marcin M.; Woźniak, Maciej; Paszyński, Maciej; Lenharth, Andrew; Hassaan, Muhamm Amber; Pingali, Keshav
         2017-08-01
         In this paper we present a fast explicit solver for solution of non-stationary problems using L2 projections with isogeometric finite element method. The solver has been implemented within GALOIS framework. It enables parallel multi-core simulations of different time-dependent problems, in 1D, 2D, or 3D. We have prepared the solver framework in a way that enables direct implementation of the selected PDE and corresponding boundary conditions. In this paper we describe the installation, implementation of exemplary three PDEs, and execution of the simulations on multi-core Linux cluster nodes. We consider three case studies, including heat transfer, linear elasticity, as well as non-linear flow in heterogeneous media. The presented package generates output suitable for interfacing with Gnuplot and ParaView visualization software. The exemplary simulations show near perfect scalability on Gilbert shared-memory node with four Intel® Xeon® CPU E7-4860 processors, each possessing 10 physical cores (for a total of 40 cores).
      

      
      Efficient Implementations of the Quadrature-Free Discontinuous Galerkin Method
      NASA Technical Reports Server (NTRS)
      Lockard, David P.; Atkins, Harold L.
         1999-01-01
         The efficiency of the quadrature-free form of the dis- continuous Galerkin method in two dimensions, and briefly in three dimensions, is examined. Most of the work for constant-coefficient, linear problems involves the volume and edge integrations, and the transformation of information from the volume to the edges. These operations can be viewed as matrix-vector multiplications. Many of the matrices are sparse as a result of symmetry, and blocking and specialized multiplication routines are used to account for the sparsity. By optimizing these operations, a 35% reduction in total CPU time is achieved. For nonlinear problems, the calculation of the flux becomes dominant because of the cost associated with polynomial products and inversion. This component of the work can be reduced by up to 75% when the products are approximated by truncating terms. Because the cost is high for nonlinear problems on general elements, it is suggested that simplified physics and the most efficient element types be used over most of the domain.
      

      
      Study of Wearable Knee Assistive Instruments for Walk Rehabilitation
      NASA Astrophysics Data System (ADS)
      Zhu, Yong; Nakamura, Masahiro; Ito, Noritaka; Fujimoto, Hiroshi; Horikuchi, Kenichi; Wakabayashi, Shojiro; Takahashi, Rei; Terada, Hidetsugu; Haro, Hirotaka
         
         A wearable Knee Assistive Instrument for the walk rehabilitation was newly developed. Especially, this system aimed at supporting the rehabilitation for the post-TKA (Total Knee Arthroplasty) which is a popular surgery for aging people. This system consisted of an assisting mechanism for the knee joint, a hip joint support system and a foot pressure sensor system. The driving system of this robot consisted of a CPU board which generated the walking pattern, a Li-ion battery, DC motors with motor drivers, contact sensors to detect the state of foot and potentiometers to detect the hip joint angle. The control method was proposed to reproduce complex motion of knee joint as much as possible, and to increase hip or knee flexion angle. Especially, this method used the timing that heel left from the floor. This method included that the lower limb was raised to prevent a subject's fall. Also, the prototype of knee assisting system was tested. It was confirmed that the assisting system is useful.
      

      
      Storage element performance optimization for CMS analysis jobs
      NASA Astrophysics Data System (ADS)
      Behrmann, G.; Dahlblom, J.; Guldmyr, J.; Happonen, K.; Lindén, T.
         2012-12-01
         Tier-2 computing sites in the Worldwide Large Hadron Collider Computing Grid (WLCG) host CPU-resources (Compute Element, CE) and storage resources (Storage Element, SE). The vast amount of data that needs to processed from the Large Hadron Collider (LHC) experiments requires good and efficient use of the available resources. Having a good CPU efficiency for the end users analysis jobs requires that the performance of the storage system is able to scale with I/O requests from hundreds or even thousands of simultaneous jobs. In this presentation we report on the work on improving the SE performance at the Helsinki Institute of Physics (HIP) Tier-2 used for the Compact Muon Experiment (CMS) at the LHC. Statistics from CMS grid jobs are collected and stored in the CMS Dashboard for further analysis, which allows for easy performance monitoring by the sites and by the CMS collaboration. As part of the monitoring framework CMS uses the JobRobot which sends every four hours 100 analysis jobs to each site. CMS also uses the HammerCloud tool for site monitoring and stress testing and it has replaced the JobRobot. The performance of the analysis workflow submitted with JobRobot or HammerCloud can be used to track the performance due to site configuration changes, since the analysis workflow is kept the same for all sites and for months in time. The CPU efficiency of the JobRobot jobs at HIP was increased approximately by 50 % to more than 90 %, by tuning the SE and by improvements in the CMSSW and dCache software. The performance of the CMS analysis jobs improved significantly too. Similar work has been done on other CMS Tier-sites, since on average the CPU efficiency for CMSSW jobs has increased during 2011. Better monitoring of the SE allows faster detection of problems, so that the performance level can be kept high. The next storage upgrade at HIP consists of SAS disk enclosures which can be stress tested on demand with HammerCloud workflows, to make sure that the I/O-performance is good.
      

      
      Collaborating CPU and GPU for large-scale high-order CFD simulations with complex grids on the TianHe-1A supercomputer
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Xu, Chuanfu, E-mail: xuchuanfu@nudt.edu.cn; Deng, Xiaogang; Zhang, Lilun
         
         Programming and optimizing complex, real-world CFD codes on current many-core accelerated HPC systems is very challenging, especially when collaborating CPUs and accelerators to fully tap the potential of heterogeneous systems. In this paper, with a tri-level hybrid and heterogeneous programming model using MPI + OpenMP + CUDA, we port and optimize our high-order multi-block structured CFD software HOSTA on the GPU-accelerated TianHe-1A supercomputer. HOSTA adopts two self-developed high-order compact definite difference schemes WCNS and HDCS that can simulate flows with complex geometries. We present a dual-level parallelization scheme for efficient multi-block computation on GPUs and perform particular kernel optimizations formore » high-order CFD schemes. The GPU-only approach achieves a speedup of about 1.3 when comparing one Tesla M2050 GPU with two Xeon X5670 CPUs. To achieve a greater speedup, we collaborate CPU and GPU for HOSTA instead of using a naive GPU-only approach. We present a novel scheme to balance the loads between the store-poor GPU and the store-rich CPU. Taking CPU and GPU load balance into account, we improve the maximum simulation problem size per TianHe-1A node for HOSTA by 2.3×, meanwhile the collaborative approach can improve the performance by around 45% compared to the GPU-only approach. Further, to scale HOSTA on TianHe-1A, we propose a gather/scatter optimization to minimize PCI-e data transfer times for ghost and singularity data of 3D grid blocks, and overlap the collaborative computation and communication as far as possible using some advanced CUDA and MPI features. Scalability tests show that HOSTA can achieve a parallel efficiency of above 60% on 1024 TianHe-1A nodes. With our method, we have successfully simulated an EET high-lift airfoil configuration containing 800M cells and China's large civil airplane configuration containing 150M cells. To our best knowledge, those are the largest-scale CPU–GPU collaborative simulations that solve realistic CFD problems with both complex configurations and high-order schemes.« less
      

      
      Multiband Radio Frequency Interconnect (MRFI) Technology For Next Generation Mobile/Airborne Computing Systems
      DTIC Science & Technology
      
         2017-02-01
         enable high scalability and reconfigurability for inter-CPU/Memory communications with an increased number of communication channels in frequency ...interconnect technology (MRFI) to enable high scalability and re-configurability for inter-CPU/Memory communications with an increased number of communication ...testing in the University of California, Los Angeles (UCLA) Center for High Frequency Electronics, and Dr. Afshin Momtaz at Broadcom Corporation for
      

      
      LU Factorization with Partial Pivoting for a Multi-CPU, Multi-GPU Shared Memory System
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Kurzak, Jakub; Luszczek, Pitior; Faverge, Mathieu
         2012-03-01
         LU factorization with partial pivoting is a canonical numerical procedure and the main component of the High Performance LINPACK benchmark. This article presents an implementation of the algorithm for a hybrid, shared memory, system with standard CPU cores and GPU accelerators. Performance in excess of one TeraFLOPS is achieved using four AMD Magny Cours CPUs and four NVIDIA Fermi GPUs.
      

      
      Real-Time Ada Problem Solution Study
      DTIC Science & Technology
      
         1989-03-24
         been performed, there is a larger base of information concerning standards and guidelines for Ada usage, as well "lessons learned ". A number of...the target machine and operate in conjunction with the application programs, they also require system resources (CPU,memory). The utilization of...Transporter-Consumer 1694 154 6. Producer-Transpt-Buffer- Transp -Consumer 2248 204 7. Relay 906 82 8. Conditional Entry - no rendezvous 170 15
      

      
      Surrogate Structures for Computationally Expensive Optimization Problems With CPU-Time Correlated Functions
      DTIC Science & Technology
      
         2007-06-01
         xc)−∇2g(x̃c)](x− xc). The second transformation is a space mapping function P that handles the change in variable dimensions (see Bandler et al. [11...17(2):188–217, 2004. 11. Bandler, J. W., Q. Cheng, S. Dakroury, A. S. Mohamed, M.H. Bakr, K. Madsen, J. Søndergaard. “ Space Mapping : The State of
      

      
      GPU-accelerated compressed-sensing (CS) image reconstruction in chest digital tomosynthesis (CDT) using CUDA programming
      NASA Astrophysics Data System (ADS)
      Choi, Sunghoon; Lee, Haenghwa; Lee, Donghoon; Choi, Seungyeon; Shin, Jungwook; Jang, Woojin; Seo, Chang-Woo; Kim, Hee-Joung
         2017-03-01
         A compressed-sensing (CS) technique has been rapidly applied in medical imaging field for retrieving volumetric data from highly under-sampled projections. Among many variant forms, CS technique based on a total-variation (TV) regularization strategy shows fairly reasonable results in cone-beam geometry. In this study, we implemented the TV-based CS image reconstruction strategy in our prototype chest digital tomosynthesis (CDT) R/F system. Due to the iterative nature of time consuming processes in solving a cost function, we took advantage of parallel computing using graphics processing units (GPU) by the compute unified device architecture (CUDA) programming to accelerate our algorithm. In order to compare the algorithmic performance of our proposed CS algorithm, conventional filtered back-projection (FBP) and simultaneous algebraic reconstruction technique (SART) reconstruction schemes were also studied. The results indicated that the CS produced better contrast-to-noise ratios (CNRs) in the physical phantom images (Teflon region-of-interest) by factors of 3.91 and 1.93 than FBP and SART images, respectively. The resulted human chest phantom images including lung nodules with different diameters also showed better visual appearance in the CS images. Our proposed GPU-accelerated CS reconstruction scheme could produce volumetric data up to 80 times than CPU programming. Total elapsed time for producing 50 coronal planes with 1024×1024 image matrix using 41 projection views were 216.74 seconds for proposed CS algorithms on our GPU programming, which could match the clinically feasible time ( 3 min). Consequently, our results demonstrated that the proposed CS method showed a potential of additional dose reduction in digital tomosynthesis with reasonable image quality in a fast time.
      

      
      cellGPU: Massively parallel simulations of dynamic vertex models
      NASA Astrophysics Data System (ADS)
      Sussman, Daniel M.
         2017-10-01
         Vertex models represent confluent tissue by polygonal or polyhedral tilings of space, with the individual cells interacting via force laws that depend on both the geometry of the cells and the topology of the tessellation. This dependence on the connectivity of the cellular network introduces several complications to performing molecular-dynamics-like simulations of vertex models, and in particular makes parallelizing the simulations difficult. cellGPU addresses this difficulty and lays the foundation for massively parallelized, GPU-based simulations of these models. This article discusses its implementation for a pair of two-dimensional models, and compares the typical performance that can be expected between running cellGPU entirely on the CPU versus its performance when running on a range of commercial and server-grade graphics cards. By implementing the calculation of topological changes and forces on cells in a highly parallelizable fashion, cellGPU enables researchers to simulate time- and length-scales previously inaccessible via existing single-threaded CPU implementations. Program Files doi:http://dx.doi.org/10.17632/6j2cj29t3r.1 Licensing provisions: MIT Programming language: CUDA/C++ Nature of problem: Simulations of off-lattice "vertex models" of cells, in which the interaction forces depend on both the geometry and the topology of the cellular aggregate. Solution method: Highly parallelized GPU-accelerated dynamical simulations in which the force calculations and the topological features can be handled on either the CPU or GPU. Additional comments: The code is hosted at https://gitlab.com/dmsussman/cellGPU, with documentation additionally maintained at http://dmsussman.gitlab.io/cellGPUdocumentation
      

      
      GENIE: a software package for gene-gene interaction analysis in genetic association studies using multiple GPU or CPU cores.
      PubMed
      Chikkagoudar, Satish; Wang, Kai; Li, Mingyao
         2011-05-26
         Gene-gene interaction in genetic association studies is computationally intensive when a large number of SNPs are involved. Most of the latest Central Processing Units (CPUs) have multiple cores, whereas Graphics Processing Units (GPUs) also have hundreds of cores and have been recently used to implement faster scientific software. However, currently there are no genetic analysis software packages that allow users to fully utilize the computing power of these multi-core devices for genetic interaction analysis for binary traits. Here we present a novel software package GENIE, which utilizes the power of multiple GPU or CPU processor cores to parallelize the interaction analysis. GENIE reads an entire genetic association study dataset into memory and partitions the dataset into fragments with non-overlapping sets of SNPs. For each fragment, GENIE analyzes: 1) the interaction of SNPs within it in parallel, and 2) the interaction between the SNPs of the current fragment and other fragments in parallel. We tested GENIE on a large-scale candidate gene study on high-density lipoprotein cholesterol. Using an NVIDIA Tesla C1060 graphics card, the GPU mode of GENIE achieves a speedup of 27 times over its single-core CPU mode run. GENIE is open-source, economical, user-friendly, and scalable. Since the computing power and memory capacity of graphics cards are increasing rapidly while their cost is going down, we anticipate that GENIE will achieve greater speedups with faster GPU cards. Documentation, source code, and precompiled binaries can be downloaded from http://www.cceb.upenn.edu/~mli/software/GENIE/.
      

      
      Region-specific expression of brain-derived neurotrophic factor splice variants in morphine conditioned place preference in mice.
      PubMed
      Meng, Min; Zhao, Xinhan; Dang, Yonghui; Ma, Jingyuan; Li, Lixu; Gu, Shanzhi
         2013-06-26
         It is well established that brain-derived neurotrophic factor (BDNF) plays a pivotal role in brain plasticity-related processes, such as learning, memory and drug addiction. However, changes in expression of BDNF splice variants after acquisition, extinction and reinstatement of cue-elicited morphine seeking behavior have not yet been investigated. Real-time PCR was used to assess BDNF splice variants (I, II, IV and VI) in various brain regions during acquisition, extinction and reinstatement of morphine-conditioned place preference (CPP) in mice. Repeated morphine injections (10mg/kg, i.p.) increased expression of BDNF splice variants II, IV and VI in the hippocampus, caudate putamen (CPu) and nucleus accumbens (NAcc). Levels of BDNF splice variants decreased after extinction training and continued to decrease during reinstatement induced by a morphine priming injection (10mg/kg, i.p.). However, after reinstatement induced by exposure to 6 min of forced swimming (FS), expression of BDNF splice variants II, IV and VI was increased in the hippocampus, CPu, NAcc and prefrontal cortex (PFC). After reinstatement induced by 40 min of restraint, expression of BDNF splice variants was increased in PFC. These results show that exposure to either morphine or acute stress can induce reinstatement of drug-seeking, but expression of BDNF splice variants is differentially affected by chronic morphine and acute stress. Furthermore, BDNF splice variants II, IV and VI may play a role in learning and memory for morphine addiction in the hippocampus, CPu and NAcc. Crown Copyright © 2013. Published by Elsevier B.V. All rights reserved.
      

      
      GENIE: a software package for gene-gene interaction analysis in genetic association studies using multiple GPU or CPU cores
      PubMed Central
      
         2011-01-01
         Background Gene-gene interaction in genetic association studies is computationally intensive when a large number of SNPs are involved. Most of the latest Central Processing Units (CPUs) have multiple cores, whereas Graphics Processing Units (GPUs) also have hundreds of cores and have been recently used to implement faster scientific software. However, currently there are no genetic analysis software packages that allow users to fully utilize the computing power of these multi-core devices for genetic interaction analysis for binary traits. Findings Here we present a novel software package GENIE, which utilizes the power of multiple GPU or CPU processor cores to parallelize the interaction analysis. GENIE reads an entire genetic association study dataset into memory and partitions the dataset into fragments with non-overlapping sets of SNPs. For each fragment, GENIE analyzes: 1) the interaction of SNPs within it in parallel, and 2) the interaction between the SNPs of the current fragment and other fragments in parallel. We tested GENIE on a large-scale candidate gene study on high-density lipoprotein cholesterol. Using an NVIDIA Tesla C1060 graphics card, the GPU mode of GENIE achieves a speedup of 27 times over its single-core CPU mode run. Conclusions GENIE is open-source, economical, user-friendly, and scalable. Since the computing power and memory capacity of graphics cards are increasing rapidly while their cost is going down, we anticipate that GENIE will achieve greater speedups with faster GPU cards. Documentation, source code, and precompiled binaries can be downloaded from http://www.cceb.upenn.edu/~mli/software/GENIE/. PMID:21615923
      

      
      Nucleus Accumbens Invulnerability to Methamphetamine Neurotoxicity
      PubMed Central
      Kuhn, Donald M.; Angoa-Pérez, Mariana; Thomas, David M.
         2016-01-01
         Methamphetamine (Meth) is a neurotoxic drug of abuse that damages neurons and nerve endings throughout the central nervous system. Emerging studies of human Meth addicts using both postmortem analyses of brain tissue and noninvasive imaging studies of intact brains have confirmed that Meth causes persistent structural abnormalities. Animal and human studies have also defined a number of significant functional problems and comorbid psychiatric disorders associated with long-term Meth abuse. This review summarizes the salient features of Meth-induced neurotoxicity with a focus on the dopamine (DA) neuronal system. DA nerve endings in the caudate-putamen (CPu) are damaged by Meth in a highly delimited manner. Even within the CPu, damage is remarkably heterogeneous, with ventral and lateral aspects showing the greatest deficits. The nucleus accumbens (NAc) is largely spared the damage that accompanies binge Meth intoxication, but relatively subtle changes in the disposition of DA in its nerve endings can lead to dramatic increases in Meth-induced toxicity in the CPu and overcome the normal resistance of the NAc to damage. In contrast to the CPu, where DA neuronal deficiencies are persistent, alterations in the NAc show a partial recovery. Animal models have been indispensable in studies of the causes and consequences of Meth neurotoxicity and in the development of new therapies. This research has shown that increases in cytoplasmic DA dramatically broaden the neurotoxic profile of Meth to include brain structures not normally targeted for damage. The resistance of the NAc to Meth-induced neurotoxicity and its ability to recover reveal a fundamentally different neuroplasticity by comparison to the CPu. Recruitment of the NAc as a target of Meth neurotoxicity by alterations in DA homeostasis is significant in light of the numerous important roles played by this brain structure. PMID:23382149
      

      
      Nucleus accumbens invulnerability to methamphetamine neurotoxicity.
      PubMed
      Kuhn, Donald M; Angoa-Pérez, Mariana; Thomas, David M
         2011-01-01
         Methamphetamine (Meth) is a neurotoxic drug of abuse that damages neurons and nerve endings throughout the central nervous system. Emerging studies of human Meth addicts using both postmortem analyses of brain tissue and noninvasive imaging studies of intact brains have confirmed that Meth causes persistent structural abnormalities. Animal and human studies have also defined a number of significant functional problems and comorbid psychiatric disorders associated with long-term Meth abuse. This review summarizes the salient features of Meth-induced neurotoxicity with a focus on the dopamine (DA) neuronal system. DA nerve endings in the caudate-putamen (CPu) are damaged by Meth in a highly delimited manner. Even within the CPu, damage is remarkably heterogeneous, with ventral and lateral aspects showing the greatest deficits. The nucleus accumbens (NAc) is largely spared the damage that accompanies binge Meth intoxication, but relatively subtle changes in the disposition of DA in its nerve endings can lead to dramatic increases in Meth-induced toxicity in the CPu and overcome the normal resistance of the NAc to damage. In contrast to the CPu, where DA neuronal deficiencies are persistent, alterations in the NAc show a partial recovery. Animal models have been indispensable in studies of the causes and consequences of Meth neurotoxicity and in the development of new therapies. This research has shown that increases in cytoplasmic DA dramatically broaden the neurotoxic profile of Meth to include brain structures not normally targeted for damage. The resistance of the NAc to Meth-induced neurotoxicity and its ability to recover reveal a fundamentally different neuroplasticity by comparison to the CPu. Recruitment of the NAc as a target of Meth neurotoxicity by alterations in DA homeostasis is significant in light of the numerous important roles played by this brain structure.
      

        
       
          

«

19
      20
      21
   22
      23
      »

          
        

     

   

   
       
            
              
          

«

20
      21
      22
   23
      24
      »

          
        

           
           
             
               
      
      Design, Results, Evolution and Status of the ATLAS Simulation at Point1 Project
      NASA Astrophysics Data System (ADS)
      Ballestrero, S.; Batraneanu, S. M.; Brasolin, F.; Contescu, C.; Fazio, D.; Di Girolamo, A.; Lee, C. J.; Pozo Astigarraga, M. E.; Scannicchio, D. A.; Sedov, A.; Twomey, M. S.; Wang, F.; Zaytsev, A.
         2015-12-01
         During the LHC Long Shutdown 1 (LSI) period, that started in 2013, the Simulation at Point1 (Sim@P1) project takes advantage, in an opportunistic way, of the TDAQ (Trigger and Data Acquisition) HLT (High-Level Trigger) farm of the ATLAS experiment. This farm provides more than 1300 compute nodes, which are particularly suited for running event generation and Monte Carlo production jobs that are mostly CPU and not I/O bound. It is capable of running up to 2700 Virtual Machines (VMs) each with 8 CPU cores, for a total of up to 22000 parallel jobs. This contribution gives a review of the design, the results, and the evolution of the Sim@P1 project, operating a large scale OpenStack based virtualized platform deployed on top of the ATLAS TDAQ HLT farm computing resources. During LS1, Sim@P1 was one of the most productive ATLAS sites: it delivered more than 33 million CPU-hours and it generated more than 1.1 billion Monte Carlo events. The design aspects are presented: the virtualization platform exploited by Sim@P1 avoids interferences with TDAQ operations and it guarantees the security and the usability of the ATLAS private network. The cloud mechanism allows the separation of the needed support on both infrastructural (hardware, virtualization layer) and logical (Grid site support) levels. This paper focuses on the operational aspects of such a large system during the upcoming LHC Run 2 period: simple, reliable, and efficient tools are needed to quickly switch from Sim@P1 to TDAQ mode and back, to exploit the resources when they are not used for the data acquisition, even for short periods. The evolution of the central OpenStack infrastructure is described, as it was upgraded from Folsom to the Icehouse release, including the scalability issues addressed.
      

      
      10 Management Controller for Time and Space Partitioning Architectures
      NASA Astrophysics Data System (ADS)
      Lachaize, Jerome; Deredempt, Marie-Helene; Galizzi, Julien
         2015-09-01
         The Integrated Modular Avionics (IMA) has been industrialized in aeronautical domain to enable the independent qualification of different application softwares from different suppliers on the same generic computer, this latter computer being a single terminal in a deterministic network. This concept allowed to distribute efficiently and transparently the different applications across the network, sizing accurately the HW equipments to embed on the aircraft, through the configuration of the virtual computers and the virtual network. , This concept has been studied for space domain and requirements issued [D04],[D05]. Experiments in the space domain have been done, for the computer level, through ESA and CNES initiatives [D02] [D03]. One possible IMA implementation may use Time and Space Partitioning (TSP) technology. Studies on Time and Space Partitioning [D02] for controlling resources access such as CPU and memories and studies on hardware/software interface standardization [D01] showed that for space domain technologies where I/O components (or IP) do not cover advanced features such as buffering, descriptors or virtualization, CPU overhead in terms of performances is mainly due to shared interface management in the execution platform, and to the high frequency of I/O accesses, these latter leading to an important number of context switches. This paper will present a solution to reduce this execution overhead with an open, modular and configurable controller.
      

      
      Benchmark measurements and calculations of a 3-dimensional neutron streaming experiment
      NASA Astrophysics Data System (ADS)
      Barnett, D. A., Jr.
         1991-02-01
         An experimental assembly known as the Dog-Legged Void assembly was constructed to measure the effect of neutron streaming in iron and void regions. The primary purpose of the measurements was to provide benchmark data against which various neutron transport calculation tools could be compared. The measurements included neutron flux spectra at four places and integral measurements at two places in the iron streaming path as well as integral measurements along several axial traverses. These data have been used in the verification of Oak Ridge National Laboratory's three-dimensional discrete ordinates code, TORT. For a base case calculation using one-half inch mesh spacing, finite difference spatial differencing, an S(sub 16) quadrature and P(sub 1) cross sections in the MUFT multigroup structure, the calculated solution agreed to within 18 percent with the spectral measurements and to within 24 percent of the integral measurements. Variations on the base case using a fewgroup energy structure and P(sub 1) and P(sub 3) cross sections showed similar agreement. Calculations using a linear nodal spatial differencing scheme and fewgroup cross sections also showed similar agreement. For the same mesh size, the nodal method was seen to require 2.2 times as much CPU time as the finite difference method. A nodal calculation using a typical mesh spacing of 2 inches, which had approximately 32 times fewer mesh cells than the base case, agreed with the measurements to within 34 percent and yet required on 8 percent of the CPU time.
      

      
      Multi-GPU Accelerated Admittance Method for High-Resolution Human Exposure Evaluation.
      PubMed
      Xiong, Zubiao; Feng, Shi; Kautz, Richard; Chandra, Sandeep; Altunyurt, Nevin; Chen, Ji
         2015-12-01
         A multi-graphics processing unit (GPU) accelerated admittance method solver is presented for solving the induced electric field in high-resolution anatomical models of human body when exposed to external low-frequency magnetic fields. In the solver, the anatomical model is discretized as a three-dimensional network of admittances. The conjugate orthogonal conjugate gradient (COCG) iterative algorithm is employed to take advantage of the symmetric property of the complex-valued linear system of equations. Compared against the widely used biconjugate gradient stabilized method, the COCG algorithm can reduce the solving time by 3.5 times and reduce the storage requirement by about 40%. The iterative algorithm is then accelerated further by using multiple NVIDIA GPUs. The computations and data transfers between GPUs are overlapped in time by using asynchronous concurrent execution design. The communication overhead is well hidden so that the acceleration is nearly linear with the number of GPU cards. Numerical examples show that our GPU implementation running on four NVIDIA Tesla K20c cards can reach 90 times faster than the CPU implementation running on eight CPU cores (two Intel Xeon E5-2603 processors). The implemented solver is able to solve large dimensional problems efficiently. A whole adult body discretized in 1-mm resolution can be solved in just several minutes. The high efficiency achieved makes it practical to investigate human exposure involving a large number of cases with a high resolution that meets the requirements of international dosimetry guidelines.
      

      
      Software Techniques for Non-Von Neumann Architectures
      DTIC Science & Technology
      
         1990-01-01
         Commtopo programmable Benes net.; hypercubic lattice for QCD Control CENTRALIZED Assign STATIC Memory :SHARED Synch UNIVERSAL Max-cpu 566 Proessor...boards (each = 4 floating point units, 2 multipliers) Cpu-size 32-bit floating point chips Perform 11.4 Gflops Market quantum chromodynamics ( QCD ...functions there should exist a capability to define hierarchies and lattices of complex objects. A complex object can be made up of a set of simple objects
      

      
      Visual Media Reasoning - Terrain-based Geolocation
      DTIC Science & Technology
      
         2015-06-01
         the drawings, specifications, or other data does not license the holder or any other person or corporation ; or convey any rights or permission to...3.4 Alternative Metric Investigation This section describes a graphics processor unit (GPU) based implementation in the NVIDIA CUDA programming...utilizing 2 concurrent CPU cores, each controlling a single Nvidia C2075 Tesla Fermi CUDA card. Figure 22 shows a comparison of the CPU and the GPU powered
      

      
      A CFD Heterogeneous Parallel Solver Based on Collaborating CPU and GPU
      NASA Astrophysics Data System (ADS)
      Lai, Jianqi; Tian, Zhengyu; Li, Hua; Pan, Sha
         2018-03-01
         Since Graphic Processing Unit (GPU) has a strong ability of floating-point computation and memory bandwidth for data parallelism, it has been widely used in the areas of common computing such as molecular dynamics (MD), computational fluid dynamics (CFD) and so on. The emergence of compute unified device architecture (CUDA), which reduces the complexity of compiling program, brings the great opportunities to CFD. There are three different modes for parallel solution of NS equations: parallel solver based on CPU, parallel solver based on GPU and heterogeneous parallel solver based on collaborating CPU and GPU. As we can see, GPUs are relatively rich in compute capacity but poor in memory capacity and the CPUs do the opposite. We need to make full use of the GPUs and CPUs, so a CFD heterogeneous parallel solver based on collaborating CPU and GPU has been established. Three cases are presented to analyse the solver’s computational accuracy and heterogeneous parallel efficiency. The numerical results agree well with experiment results, which demonstrate that the heterogeneous parallel solver has high computational precision. The speedup on a single GPU is more than 40 for laminar flow, it decreases for turbulent flow, but it still can reach more than 20. What’s more, the speedup increases as the grid size becomes larger.
      

      
      The BlueGene/L supercomputer
      NASA Astrophysics Data System (ADS)
      Bhanota, Gyan; Chen, Dong; Gara, Alan; Vranas, Pavlos
         2003-05-01
         The architecture of the BlueGene/L massively parallel supercomputer is described. Each computing node consists of a single compute ASIC plus 256 MB of external memory. The compute ASIC integrates two 700 MHz PowerPC 440 integer CPU cores, two 2.8 Gflops floating point units, 4 MB of embedded DRAM as cache, a memory controller for external memory, six 1.4 Gbit/s bi-directional ports for a 3-dimensional torus network connection, three 2.8 Gbit/s bi-directional ports for connecting to a global tree network and a Gigabit Ethernet for I/O. 65,536 of such nodes are connected into a 3-d torus with a geometry of 32×32×64. The total peak performance of the system is 360 Teraflops and the total amount of memory is 16 TeraBytes.
      

      
      A quantitative comparison of numerical methods for the compressible Euler equations: fifth-order WENO and piecewise-linear Godunov
      NASA Astrophysics Data System (ADS)
      Greenough, J. A.; Rider, W. J.
         2004-05-01
         A numerical study is undertaken comparing a fifth-order version of the weighted essentially non-oscillatory numerical (WENO5) method to a modern piecewise-linear, second-order, version of Godunov's (PLMDE) method for the compressible Euler equations. A series of one-dimensional test problems are examined beginning with classical linear problems and ending with complex shock interactions. The problems considered are: (1) linear advection of a Gaussian pulse in density, (2) Sod's shock tube problem, (3) the "peak" shock tube problem, (4) a version of the Shu and Osher shock entropy wave interaction and (5) the Woodward and Colella interacting shock wave problem. For each problem and method, run times, density error norms and convergence rates are reported for each method as produced from a common code test-bed. The linear problem exhibits the advertised convergence rate for both methods as well as the expected large disparity in overall error levels; WENO5 has the smaller errors and an enormous advantage in overall efficiency (in accuracy per unit CPU time). For the nonlinear problems with discontinuities, however, we generally see both first-order self-convergence of error as compared to an exact solution, or when an analytic solution is not available, a converged solution generated on an extremely fine grid. The overall comparison of error levels shows some variation from problem to problem. For Sod's shock tube, PLMDE has nearly half the error, while on the peak problem the errors are nearly the same. For the interacting blast wave problem the two methods again produce a similar level of error with a slight edge for the PLMDE. On the other hand, for the Shu-Osher problem, the errors are similar on the coarser grids, but favors WENO by a factor of nearly 1.5 on the finer grids used. In all cases holding mesh resolution constant though, PLMDE is less costly in terms of CPU time by approximately a factor of 6. If the CPU cost is taken as fixed, that is run times are equal for both numerical methods, then PLMDE uniformly produces lower errors than WENO for the fixed computation cost on the test problems considered here.
      

      
      A Distributed GPU-Based Framework for Real-Time 3D Volume Rendering of Large Astronomical Data Cubes
      NASA Astrophysics Data System (ADS)
      Hassan, A. H.; Fluke, C. J.; Barnes, D. G.
         2012-05-01
         We present a framework to volume-render three-dimensional data cubes interactively using distributed ray-casting and volume-bricking over a cluster of workstations powered by one or more graphics processing units (GPUs) and a multi-core central processing unit (CPU). The main design target for this framework is to provide an in-core visualization solution able to provide three-dimensional interactive views of terabyte-sized data cubes. We tested the presented framework using a computing cluster comprising 64 nodes with a total of 128GPUs. The framework proved to be scalable to render a 204GB data cube with an average of 30 frames per second. Our performance analyses also compare the use of NVIDIA Tesla 1060 and 2050GPU architectures and the effect of increasing the visualization output resolution on the rendering performance. Although our initial focus, as shown in the examples presented in this work, is volume rendering of spectral data cubes from radio astronomy, we contend that our approach has applicability to other disciplines where close to real-time volume rendering of terabyte-order three-dimensional data sets is a requirement.
      

      
      Meshless method for solving fixed boundary problem of plasma equilibrium
      NASA Astrophysics Data System (ADS)
      Imazawa, Ryota; Kawano, Yasunori; Itami, Kiyoshi
         2015-07-01
         This study solves the Grad-Shafranov equation with a fixed plasma boundary by utilizing a meshless method for the first time. Previous studies have utilized a finite element method (FEM) to solve an equilibrium inside the fixed separatrix. In order to avoid difficulties of FEM (such as mesh problem, difficulty of coding, expensive calculation cost), this study focuses on the meshless methods, especially RBF-MFS and KANSA's method to solve the fixed boundary problem. The results showed that CPU time of the meshless methods was ten to one hundred times shorter than that of FEM to obtain the same accuracy.
      

      
      High Performance Computing Assets for Ocean Acoustics Research
      DTIC Science & Technology
      
         2016-11-18
         independently on processing units with access to a typically available amount of memory, say 16 or 32 gigabytes. Our models require each processor to...allow results to be obtained with limited amounts of memory available to individual processing units (with no time frame for successful completion...put into use. One file server computer to store simulation output has also been purchased. The first workstation has 28 CPU cores, dual- thread , (56
      

      
      Mitigating Distributed Denial of Service Attacks with Multi-Protocol Label Switching-Traffic Engineering (MPLS-TE)
      DTIC Science & Technology
      
         2009-03-01
         time and the router CPU loads are comparable to those reported by two former NPS theses that examined alternative solutions based on BGP blackhole ...routing. 15. NUMBER OF PAGES 135 14. SUBJECT TERMS Traffic Engineering, Distributed Denial of Service Attacks, Sinkhole Routing, Blackhole Routing...alternative solutions based on BGP blackhole routing. vi THIS PAGE INTENTIONALLY LEFT BLANK vii TABLE OF CONTENTS I. INTRODUCTION
      

      
      Digital image processing for the earth resources technology satellite data.
      NASA Technical Reports Server (NTRS)
      Will, P. M.; Bakis, R.; Wesley, M. A.
         1972-01-01
         This paper discusses the problems of digital processing of the large volumes of multispectral image data that are expected to be received from the ERTS program. Correction of geometric and radiometric distortions are discussed and a byte oriented implementation is proposed. CPU timing estimates are given for a System/360 Model 67, and show that a processing throughput of 1000 image sets per week is feasible.
      

      
      Integrals for IBS and beam cooling
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Burov, A.; /Fermilab
         
         Simulation of beam cooling usually requires performing certain integral transformations every time step or so, which is a significant burden on the CPU. Examples are the dispersion integrals (Hilbert transforms) in the stochastic cooling, wake fields and IBS integrals. An original method is suggested for fast and sufficiently accurate computation of the integrals. This method is applied for the dispersion integral. Some methodical aspects of the IBS analysis are discussed.
      

      
      Integrals for IBS and Beam Cooling
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Burov, A.
         
         Simulation of beam cooling usually requires performing certain integral transformations every time step or so, which is a significant burden on the CPU. Examples are the dispersion integrals (Hilbert transforms) in the stochastic cooling, wake fields and IBS integrals. An original method is suggested for fast and sufficiently accurate computation of the integrals. This method is applied for the dispersion integral. Some methodical aspects of the IBS analysis are discussed.
      

      
      Accelerating statistical image reconstruction algorithms for fan-beam x-ray CT using cloud computing
      NASA Astrophysics Data System (ADS)
      Srivastava, Somesh; Rao, A. Ravishankar; Sheinin, Vadim
         2011-03-01
         Statistical image reconstruction algorithms potentially offer many advantages to x-ray computed tomography (CT), e.g. lower radiation dose. But, their adoption in practical CT scanners requires extra computation power, which is traditionally provided by incorporating additional computing hardware (e.g. CPU-clusters, GPUs, FPGAs etc.) into a scanner. An alternative solution is to access the required computation power over the internet from a cloud computing service, which is orders-of-magnitude more cost-effective. This is because users only pay a small pay-as-you-go fee for the computation resources used (i.e. CPU time, storage etc.), and completely avoid purchase, maintenance and upgrade costs. In this paper, we investigate the benefits and shortcomings of using cloud computing for statistical image reconstruction. We parallelized the most time-consuming parts of our application, the forward and back projectors, using MapReduce, the standard parallelization library on clouds. From preliminary investigations, we found that a large speedup is possible at a very low cost. But, communication overheads inside MapReduce can limit the maximum speedup, and a better MapReduce implementation might become necessary in the future. All the experiments for this paper, including development and testing, were completed on the Amazon Elastic Compute Cloud (EC2) for less than $20.
      

      
      CPU time optimization and precise adjustment of the Geant4 physics parameters for a VARIAN 2100 C/D gamma radiotherapy linear accelerator simulation using GAMOS.
      PubMed
      Arce, Pedro; Lagares, Juan Ignacio
         2018-01-25
         We have verified the GAMOS/Geant4 simulation model of a 6 MV VARIAN Clinac 2100 C/D linear accelerator by the procedure of adjusting the initial beam parameters to fit the percentage depth dose and cross-profile dose experimental data at different depths in a water phantom. Thanks to the use of a wide range of field sizes, from 2  ×  2 cm 2 to 40  ×  40 cm 2 , a small phantom voxel size and high statistics, fine precision in the determination of the beam parameters has been achieved. This precision has allowed us to make a thorough study of the different physics models and parameters that Geant4 offers. The three Geant4 electromagnetic physics sets of models, i.e. Standard, Livermore and Penelope, have been compared to the experiment, testing the four different models of angular bremsstrahlung distributions as well as the three available multiple-scattering models, and optimizing the most relevant Geant4 electromagnetic physics parameters. Before the fitting, a comprehensive CPU time optimization has been done, using several of the Geant4 efficiency improvement techniques plus a few more developed in GAMOS.
      

      
      Analysis of Multivariate Experimental Data Using A Simplified Regression Model Search Algorithm
      NASA Technical Reports Server (NTRS)
      Ulbrich, Norbert M.
         2013-01-01
         A new regression model search algorithm was developed that may be applied to both general multivariate experimental data sets and wind tunnel strain-gage balance calibration data. The algorithm is a simplified version of a more complex algorithm that was originally developed for the NASA Ames Balance Calibration Laboratory. The new algorithm performs regression model term reduction to prevent overfitting of data. It has the advantage that it needs only about one tenth of the original algorithm's CPU time for the completion of a regression model search. In addition, extensive testing showed that the prediction accuracy of math models obtained from the simplified algorithm is similar to the prediction accuracy of math models obtained from the original algorithm. The simplified algorithm, however, cannot guarantee that search constraints related to a set of statistical quality requirements are always satisfied in the optimized regression model. Therefore, the simplified algorithm is not intended to replace the original algorithm. Instead, it may be used to generate an alternate optimized regression model of experimental data whenever the application of the original search algorithm fails or requires too much CPU time. Data from a machine calibration of NASA's MK40 force balance is used to illustrate the application of the new search algorithm.
      

      
      A Subsonic Aircraft Design Optimization With Neural Network and Regression Approximators
      NASA Technical Reports Server (NTRS)
      Patnaik, Surya N.; Coroneos, Rula M.; Guptill, James D.; Hopkins, Dale A.; Haller, William J.
         2004-01-01
         The Flight-Optimization-System (FLOPS) code encountered difficulty in analyzing a subsonic aircraft. The limitation made the design optimization problematic. The deficiencies have been alleviated through use of neural network and regression approximations. The insight gained from using the approximators is discussed in this paper. The FLOPS code is reviewed. Analysis models are developed and validated for each approximator. The regression method appears to hug the data points, while the neural network approximation follows a mean path. For an analysis cycle, the approximate model required milliseconds of central processing unit (CPU) time versus seconds by the FLOPS code. Performance of the approximators was satisfactory for aircraft analysis. A design optimization capability has been created by coupling the derived analyzers to the optimization test bed CometBoards. The approximators were efficient reanalysis tools in the aircraft design optimization. Instability encountered in the FLOPS analyzer was eliminated. The convergence characteristics were improved for the design optimization. The CPU time required to calculate the optimum solution, measured in hours with the FLOPS code was reduced to minutes with the neural network approximation and to seconds with the regression method. Generation of the approximators required the manipulation of a very large quantity of data. Design sensitivity with respect to the bounds of aircraft constraints is easily generated.
      

        
       
          

«

20
      21
      22
   23
      24
      »

          
        

     

   

   
       
            
              
          

«

21
      22
      23
   24
      25
      »

          
        

           
           
             
               
      
      Analysis of Multivariate Experimental Data Using A Simplified Regression Model Search Algorithm
      NASA Technical Reports Server (NTRS)
      Ulbrich, Norbert Manfred
         2013-01-01
         A new regression model search algorithm was developed in 2011 that may be used to analyze both general multivariate experimental data sets and wind tunnel strain-gage balance calibration data. The new algorithm is a simplified version of a more complex search algorithm that was originally developed at the NASA Ames Balance Calibration Laboratory. The new algorithm has the advantage that it needs only about one tenth of the original algorithm's CPU time for the completion of a search. In addition, extensive testing showed that the prediction accuracy of math models obtained from the simplified algorithm is similar to the prediction accuracy of math models obtained from the original algorithm. The simplified algorithm, however, cannot guarantee that search constraints related to a set of statistical quality requirements are always satisfied in the optimized regression models. Therefore, the simplified search algorithm is not intended to replace the original search algorithm. Instead, it may be used to generate an alternate optimized regression model of experimental data whenever the application of the original search algorithm either fails or requires too much CPU time. Data from a machine calibration of NASA's MK40 force balance is used to illustrate the application of the new regression model search algorithm.
      

      
      Three-Dimensional Nacelle Aeroacoustics Code With Application to Impedance Education
      NASA Technical Reports Server (NTRS)
      Watson, Willie R.
         2000-01-01
         A three-dimensional nacelle acoustics code that accounts for uniform mean flow and variable surface impedance liners is developed. The code is linked to a commercial version of the NASA-developed General Purpose Solver (for solution of linear systems of equations) in order to obtain the capability to study high frequency waves that may require millions of grid points for resolution. Detailed, single-processor statistics for the performance of the solver in rigid and soft-wall ducts are presented. Over the range of frequencies of current interest in nacelle liner research, noise attenuation levels predicted from the code were in excellent agreement with those predicted from mode theory. The equation solver is memory efficient, requiring only a small fraction of the memory available on modern computers. As an application, the code is combined with an optimization algorithm and used to reduce the impedance spectrum of a ceramic liner. The primary problem with using the code to perform optimization studies at frequencies above I1kHz is the excessive CPU time (a major portion of which is matrix assembly). The research recommends that research be directed toward development of a rapid sparse assembler and exploitation of the multiprocessor capability of the solver to further reduce CPU time.
      

      
      GPU accelerated real-time confocal fluorescence lifetime imaging microscopy (FLIM) based on the analog mean-delay (AMD) method
      PubMed Central
      Kim, Byungyeon; Park, Byungjun; Lee, Seungrag; Won, Youngjae
         2016-01-01
         We demonstrated GPU accelerated real-time confocal fluorescence lifetime imaging microscopy (FLIM) based on the analog mean-delay (AMD) method. Our algorithm was verified for various fluorescence lifetimes and photon numbers. The GPU processing time was faster than the physical scanning time for images up to 800 × 800, and more than 149 times faster than a single core CPU. The frame rate of our system was demonstrated to be 13 fps for a 200 × 200 pixel image when observing maize vascular tissue. This system can be utilized for observing dynamic biological reactions, medical diagnosis, and real-time industrial inspection. PMID:28018724
      

      
      Patterns and Practices for Future Architectures
      DTIC Science & Technology
      
         2014-08-01
         14. SUBJECT TERMS computing architecture, graph algorithms, high-performance computing, big data , GPU 15. NUMBER OF PAGES 44 16. PRICE CODE 17...at Vertex 1 6 Figure 4: Data Structures Created by Kernel 1 of Single CPU, List Implementation Using the Graph in the Example from Section 1.2 9...Figure 5: Kernel 2 of Graph500 BFS Reference Implementation: Single CPU, List 10 Figure 6: Data Structures for Sequential CSR Algorithm 12 Figure 7
      

      
      Conversion of Mass Storage Hierarchy in an IBM Computer Network
      DTIC Science & Technology
      
         1989-03-01
         storage devices GUIDE IBM users’ group for DOS operating systems IBM International Business Machines IBM 370/145 CPU introduced in 1970 IBM 370/168 CPU...February 12, 1985, Information Systems Group, International Business Machines Corporation. "IBM 3090 Processor Complex" and 񓼪 Mass Storage System...34 Mainframe Journal, pp. 15-26, 64-65, Dallas, Texas, September-October 1987. 3. International Business Machines Corporation, Introduction to IBM 3S80 Storage
      

      
      Storage strategies of eddy-current FE-BI model for GPU implementation
      NASA Astrophysics Data System (ADS)
      Bardel, Charles; Lei, Naiguang; Udpa, Lalita
         2013-01-01
         In the past few years graphical processing units (GPUs) have shown tremendous improvements in computational throughput over standard CPU architecture. However, this comes at the cost of restructuring the algorithms to meet the strengths and drawbacks of this GPU architecture. A major drawback is the state of limited memory, and hence storage of FE stiffness matrices on the GPU is important. In contrast to storage on CPU the GPU storage format has significant influence on the overall performance. This paper presents an investigation of a storage strategy in the implementation of a two-dimensional finite element-boundary integral (FE-BI) model for Eddy current NDE applications, on GPU architecture. Specifically, the high dimensional matrices are manipulated by examining the matrix structure and optimally splitting into structurally independent component matrices for efficient storage and retrieval of each component. Results obtained using the proposed approach are compared to those of conventional CPU implementation for validating the method.
      

      
      Multigroup Monte Carlo on GPUs: Comparison of history- and event-based algorithms
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Hamilton, Steven P.; Slattery, Stuart R.; Evans, Thomas M.
         
         This article presents an investigation of the performance of different multigroup Monte Carlo transport algorithms on GPUs with a discussion of both history-based and event-based approaches. Several algorithmic improvements are introduced for both approaches. By modifying the history-based algorithm that is traditionally favored in CPU-based MC codes to occasionally filter out dead particles to reduce thread divergence, performance exceeds that of either the pure history-based or event-based approaches. The impacts of several algorithmic choices are discussed, including performance studies on Kepler and Pascal generation NVIDIA GPUs for fixed source and eigenvalue calculations. Single-device performance equivalent to 20–40 CPU cores onmore » the K40 GPU and 60–80 CPU cores on the P100 GPU is achieved. Last, in addition, nearly perfect multi-device parallel weak scaling is demonstrated on more than 16,000 nodes of the Titan supercomputer.« less
      

      
      Multigroup Monte Carlo on GPUs: Comparison of history- and event-based algorithms
      DOE PAGES
      Hamilton, Steven P.; Slattery, Stuart R.; Evans, Thomas M.
         2017-12-22
         This article presents an investigation of the performance of different multigroup Monte Carlo transport algorithms on GPUs with a discussion of both history-based and event-based approaches. Several algorithmic improvements are introduced for both approaches. By modifying the history-based algorithm that is traditionally favored in CPU-based MC codes to occasionally filter out dead particles to reduce thread divergence, performance exceeds that of either the pure history-based or event-based approaches. The impacts of several algorithmic choices are discussed, including performance studies on Kepler and Pascal generation NVIDIA GPUs for fixed source and eigenvalue calculations. Single-device performance equivalent to 20–40 CPU cores onmore » the K40 GPU and 60–80 CPU cores on the P100 GPU is achieved. Last, in addition, nearly perfect multi-device parallel weak scaling is demonstrated on more than 16,000 nodes of the Titan supercomputer.« less
      

      
      Near-Zero Emissions Oxy-Combustion Flue Gas Purification
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Minish Shah; Nich Degenstein; Monica Zanfir
         2012-06-30
         The objectives of this project were to carry out an experimental program to enable development and design of near zero emissions (NZE) CO{sub 2} processing unit (CPU) for oxy-combustion plants burning high and low sulfur coals and to perform commercial viability assessment. The NZE CPU was proposed to produce high purity CO{sub 2} from the oxycombustion flue gas, to achieve > 95% CO{sub 2} capture rate and to achieve near zero atmospheric emissions of criteria pollutants. Two SOx/NOx removal technologies were proposed depending on the SOx levels in the flue gas. The activated carbon process was proposed for power plantsmore » burning low sulfur coal and the sulfuric acid process was proposed for power plants burning high sulfur coal. For plants burning high sulfur coal, the sulfuric acid process would convert SOx and NOx in to commercial grade sulfuric and nitric acid by-products, thus reducing operating costs associated with SOx/NOx removal. For plants burning low sulfur coal, investment in separate FGD and SCR equipment for producing high purity CO{sub 2} would not be needed. To achieve high CO{sub 2} capture rates, a hybrid process that combines cold box and VPSA (vacuum pressure swing adsorption) was proposed. In the proposed hybrid process, up to 90% of CO{sub 2} in the cold box vent stream would be recovered by CO{sub 2} VPSA and then it would be recycled and mixed with the flue gas stream upstream of the compressor. The overall recovery from the process will be > 95%. The activated carbon process was able to achieve simultaneous SOx and NOx removal in a single step. The removal efficiencies were >99.9% for SOx and >98% for NOx, thus exceeding the performance targets of >99% and >95%, respectively. The process was also found to be suitable for power plants burning both low and high sulfur coals. Sulfuric acid process did not meet the performance expectations. Although it could achieve high SOx (>99%) and NOx (>90%) removal efficiencies, it could not produce by-product sulfuric and nitric acids that meet the commercial product specifications. The sulfuric acid will have to be disposed of by neutralization, thus lowering the value of the technology to same level as that of the activated carbon process. Therefore, it was decided to discontinue any further efforts on sulfuric acid process. Because of encouraging results on the activated carbon process, it was decided to add a new subtask on testing this process in a dual bed continuous unit. A 40 days long continuous operation test confirmed the excellent SOx/NOx removal efficiencies achieved in the batch operation. This test also indicated the need for further efforts on optimization of adsorption-regeneration cycle to maintain long term activity of activated carbon material at a higher level. The VPSA process was tested in a pilot unit. It achieved CO{sub 2} recovery of > 95% and CO{sub 2} purity of >80% (by vol.) from simulated cold box feed streams. The overall CO{sub 2} recovery from the cold box VPSA hybrid process was projected to be >99% for plants with low air ingress (2%) and >97% for plants with high air ingress (10%). Economic analysis was performed to assess value of the NZE CPU. The advantage of NZE CPU over conventional CPU is only apparent when CO{sub 2} capture and avoided costs are compared. For greenfield plants, cost of avoided CO{sub 2} and cost of captured CO{sub 2} are generally about 11-14% lower using the NZE CPU compared to using a conventional CPU. For older plants with high air intrusion, the cost of avoided CO{sub 2} and capture CO{sub 2} are about 18-24% lower using the NZE CPU. Lower capture costs for NZE CPU are due to lower capital investment in FGD/SCR and higher CO{sub 2} capture efficiency. In summary, as a result of this project, we now have developed one technology option for NZE CPU based on the activated carbon process and coldbox-VPSA hybrid process. This technology is projected to work for both low and high sulfur coal plants. The NZE CPU technology is projected to achieve near zero stack emissions, produce high purity CO{sub 2} relatively free of trace impurities and achieve ~99% CO{sub 2} capture rate while lowering the CO{sub 2} capture costs.« less
      

      
      Dynamic Allocation of SPM Based on Time-Slotted Cache Conflict Graph for System Optimization
      NASA Astrophysics Data System (ADS)
      Wu, Jianping; Ling, Ming; Zhang, Yang; Mei, Chen; Wang, Huan
         
         This paper proposes a novel dynamic Scratch-pad Memory allocation strategy to optimize the energy consumption of the memory sub-system. Firstly, the whole program execution process is sliced into several time slots according to the temporal dimension; thereafter, a Time-Slotted Cache Conflict Graph (TSCCG) is introduced to model the behavior of Data Cache (D-Cache) conflicts within each time slot. Then, Integer Nonlinear Programming (INP) is implemented, which can avoid time-consuming linearization process, to select the most profitable data pages. Virtual Memory System (VMS) is adopted to remap those data pages, which will cause severe Cache conflicts within a time slot, to SPM. In order to minimize the swapping overhead of dynamic SPM allocation, a novel SPM controller with a tightly coupled DMA is introduced to issue the swapping operations without CPU's intervention. Last but not the least, this paper discusses the fluctuation of system energy profit based on different MMU page size as well as the Time Slot duration quantitatively. According to our design space exploration, the proposed method can optimize all of the data segments, including global data, heap and stack data in general, and reduce the total energy consumption by 27.28% on average, up to 55.22% with a marginal performance promotion. And comparing to the conventional static CCG (Cache Conflicts Graph), our approach can obtain 24.7% energy profit on average, up to 30.5% with a sight boost in performance.
      

      
      An adaptive time-stepping strategy for solving the phase field crystal model
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Zhang, Zhengru, E-mail: zrzhang@bnu.edu.cn; Ma, Yuan, E-mail: yuner1022@gmail.com; Qiao, Zhonghua, E-mail: zqiao@polyu.edu.hk
         2013-09-15
         In this work, we will propose an adaptive time step method for simulating the dynamics of the phase field crystal (PFC) model. The numerical simulation of the PFC model needs long time to reach steady state, and then large time-stepping method is necessary. Unconditionally energy stable schemes are used to solve the PFC model. The time steps are adaptively determined based on the time derivative of the corresponding energy. It is found that the use of the proposed time step adaptivity cannot only resolve the steady state solution, but also the dynamical development of the solution efficiently and accurately. Themore » numerical experiments demonstrate that the CPU time is significantly saved for long time simulations.« less
      

      
      Parallel Computer System for 3D Visualization Stereo on GPU
      NASA Astrophysics Data System (ADS)
      Al-Oraiqat, Anas M.; Zori, Sergii A.
         2018-03-01
         This paper proposes the organization of a parallel computer system based on Graphic Processors Unit (GPU) for 3D stereo image synthesis. The development is based on the modified ray tracing method developed by the authors for fast search of tracing rays intersections with scene objects. The system allows significant increase in the productivity for the 3D stereo synthesis of photorealistic quality. The generalized procedure of 3D stereo image synthesis on the Graphics Processing Unit/Graphics Processing Clusters (GPU/GPC) is proposed. The efficiency of the proposed solutions by GPU implementation is compared with single-threaded and multithreaded implementations on the CPU. The achieved average acceleration in multi-thread implementation on the test GPU and CPU is about 7.5 and 1.6 times, respectively. Studying the influence of choosing the size and configuration of the computational Compute Unified Device Archi-tecture (CUDA) network on the computational speed shows the importance of their correct selection. The obtained experimental estimations can be significantly improved by new GPUs with a large number of processing cores and multiprocessors, as well as optimized configuration of the computing CUDA network.
      

      
      Soft-error tolerance and energy consumption evaluation of embedded computer with magnetic random access memory in practical systems using computer simulations
      NASA Astrophysics Data System (ADS)
      Nebashi, Ryusuke; Sakimura, Noboru; Sugibayashi, Tadahiko
         2017-08-01
         We evaluated the soft-error tolerance and energy consumption of an embedded computer with magnetic random access memory (MRAM) using two computer simulators. One is a central processing unit (CPU) simulator of a typical embedded computer system. We simulated the radiation-induced single-event-upset (SEU) probability in a spin-transfer-torque MRAM cell and also the failure rate of a typical embedded computer due to its main memory SEU error. The other is a delay tolerant network (DTN) system simulator. It simulates the power dissipation of wireless sensor network nodes of the system using a revised CPU simulator and a network simulator. We demonstrated that the SEU effect on the embedded computer with 1 Gbit MRAM-based working memory is less than 1 failure in time (FIT). We also demonstrated that the energy consumption of the DTN sensor node with MRAM-based working memory can be reduced to 1/11. These results indicate that MRAM-based working memory enhances the disaster tolerance of embedded computers.
      

      
      
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Lyakh, Dmitry I.
         
         An efficient parallel tensor transpose algorithm is suggested for shared-memory computing units, namely, multicore CPU, Intel Xeon Phi, and NVidia GPU. The algorithm operates on dense tensors (multidimensional arrays) and is based on the optimization of cache utilization on x86 CPU and the use of shared memory on NVidia GPU. From the applied side, the ultimate goal is to minimize the overhead encountered in the transformation of tensor contractions into matrix multiplications in computer implementations of advanced methods of quantum many-body theory (e.g., in electronic structure theory and nuclear physics). A particular accent is made on higher-dimensional tensors that typicallymore » appear in the so-called multireference correlated methods of electronic structure theory. Depending on tensor dimensionality, the presented optimized algorithms can achieve an order of magnitude speedup on x86 CPUs and 2-3 times speedup on NVidia Tesla K20X GPU with respect to the na ve scattering algorithm (no memory access optimization). Furthermore, the tensor transpose routines developed in this work have been incorporated into a general-purpose tensor algebra library (TAL-SH).« less
      

      
      MLP: A Parallel Programming Alternative to MPI for New Shared Memory Parallel Systems
      NASA Technical Reports Server (NTRS)
      Taft, James R.
         1999-01-01
         Recent developments at the NASA AMES Research Center's NAS Division have demonstrated that the new generation of NUMA based Symmetric Multi-Processing systems (SMPs), such as the Silicon Graphics Origin 2000, can successfully execute legacy vector oriented CFD production codes at sustained rates far exceeding processing rates possible on dedicated 16 CPU Cray C90 systems. This high level of performance is achieved via shared memory based Multi-Level Parallelism (MLP). This programming approach, developed at NAS and outlined below, is distinct from the message passing paradigm of MPI. It offers parallelism at both the fine and coarse grained level, with communication latencies that are approximately 50-100 times lower than typical MPI implementations on the same platform. Such latency reductions offer the promise of performance scaling to very large CPU counts. The method draws on, but is also distinct from, the newly defined OpenMP specification, which uses compiler directives to support a limited subset of multi-level parallel operations. The NAS MLP method is general, and applicable to a large class of NASA CFD codes.
      

      
      Accelerating Monte Carlo simulations of photon transport in a voxelized geometry using a massively parallel graphics processing unit.
      PubMed
      Badal, Andreu; Badano, Aldo
         2009-11-01
         It is a known fact that Monte Carlo simulations of radiation transport are computationally intensive and may require long computing times. The authors introduce a new paradigm for the acceleration of Monte Carlo simulations: The use of a graphics processing unit (GPU) as the main computing device instead of a central processing unit (CPU). A GPU-based Monte Carlo code that simulates photon transport in a voxelized geometry with the accurate physics models from PENELOPE has been developed using the CUDATM programming model (NVIDIA Corporation, Santa Clara, CA). An outline of the new code and a sample x-ray imaging simulation with an anthropomorphic phantom are presented. A remarkable 27-fold speed up factor was obtained using a GPU compared to a single core CPU. The reported results show that GPUs are currently a good alternative to CPUs for the simulation of radiation transport. Since the performance of GPUs is currently increasing at a faster pace than that of CPUs, the advantages of GPU-based software are likely to be more pronounced in the future.
      

      
      Modeling and Simulation of the Economics of Mining in the Bitcoin Market.
      PubMed
      Cocco, Luisanna; Marchesi, Michele
         2016-01-01
         In January 3, 2009, Satoshi Nakamoto gave rise to the "Bitcoin Blockchain", creating the first block of the chain hashing on his computer's central processing unit (CPU). Since then, the hash calculations to mine Bitcoin have been getting more and more complex, and consequently the mining hardware evolved to adapt to this increasing difficulty. Three generations of mining hardware have followed the CPU's generation. They are GPU's, FPGA's and ASIC's generations. This work presents an agent-based artificial market model of the Bitcoin mining process and of the Bitcoin transactions. The goal of this work is to model the economy of the mining process, starting from GPU's generation, the first with economic significance. The model reproduces some "stylized facts" found in real-time price series and some core aspects of the mining business. In particular, the computational experiments performed can reproduce the unit root property, the fat tail phenomenon and the volatility clustering of Bitcoin price series. In addition, under proper assumptions, they can reproduce the generation of Bitcoins, the hashing capability, the power consumption, and the mining hardware and electrical energy expenditures of the Bitcoin network.
      

      
      Speeding up tsunami wave propagation modeling
      NASA Astrophysics Data System (ADS)
      Lavrentyev, Mikhail; Romanenko, Alexey
         2014-05-01
         Trans-oceanic wave propagation is one of the most time/CPU consuming parts of the tsunami modeling process. The so-called Method Of Splitting Tsunami (MOST) software package, developed at PMEL NOAA USA (Pacific Marine Environmental Laboratory of the National Oceanic and Atmospheric Administration, USA), is widely used to evaluate the tsunami parameters. However, it takes time to simulate trans-ocean wave propagation, that is up to 5 hours CPU time to "drive" the wave from Chili (epicenter) to the coast of Japan (even using a rather coarse computational mesh). Accurate wave height prediction requires fine meshes which leads to dramatic increase in time for simulation. Computation time is among the critical parameter as it takes only about 20 minutes for tsunami wave to approach the coast of Japan after earthquake at Japan trench or Sagami trench (as it was after the Great East Japan Earthquake on March 11, 2011). MOST solves numerically the hyperbolic system for three unknown functions, namely velocity vector and wave height (shallow water approximation). The system could be split into two independent systems by orthogonal directions (splitting method). Each system can be treated independently. This calculation scheme is well suited for SIMD architecture and GPUs as well. We performed adaptation of MOST package to GPU. Several numerical tests showed 40x performance gain for NVIDIA Tesla C2050 GPU vs. single core of Intel i7 processor. Results of numerical experiments were compared with other available simulation data. Calculation results, obtained at GPU, differ from the reference ones by 10^-3 cm of the wave height simulating 24 hours wave propagation. This allows us to speak about possibility to develop real-time system for evaluating tsunami danger.
      

      
      A time-efficient implementation of Extended Kalman Filter for sequential orbit determination and a case study for onboard application
      NASA Astrophysics Data System (ADS)
      Tang, Jingshi; Wang, Haihong; Chen, Qiuli; Chen, Zhonggui; Zheng, Jinjun; Cheng, Haowen; Liu, Lin
         2018-07-01
         Onboard orbit determination (OD) is often used in space missions, with which mission support can be partially accomplished autonomously, with less dependency on ground stations. In major Global Navigation Satellite Systems (GNSS), inter-satellite link is also an essential upgrade in the future generations. To serve for autonomous operation, sequential OD method is crucial to provide real-time or near real-time solutions. The Extended Kalman Filter (EKF) is an effective and convenient sequential estimator that is widely used in onboard application. The filter requires the solutions of state transition matrix (STM) and the process noise transition matrix, which are always obtained by numerical integration. However, numerically integrating the differential equations is a CPU intensive process and consumes a large portion of the time in EKF procedures. In this paper, we present an implementation that uses the analytical solutions of these transition matrices to replace the numerical calculations. This analytical implementation is demonstrated and verified using a fictitious constellation based on selected medium Earth orbit (MEO) and inclined Geosynchronous orbit (IGSO) satellites. We show that this implementation performs effectively and converges quickly, steadily and accurately in the presence of considerable errors in the initial values, measurements and force models. The filter is able to converge within 2-4 h of flight time in our simulation. The observation residual is consistent with simulated measurement error, which is about a few centimeters in our scenarios. Compared to results implemented with numerically integrated STM, the analytical implementation shows results with consistent accuracy, while it takes only about half the CPU time to filter a 10-day measurement series. The future possible extensions are also discussed to fit in various missions.
      

      
      Impact of Machine Virtualization on Timing Precision for Performance-critical Tasks
      NASA Astrophysics Data System (ADS)
      Karpov, Kirill; Fedotova, Irina; Siemens, Eduard
         2017-07-01
         In this paper we present a measurement study to characterize the impact of hardware virtualization on basic software timing, as well as on precise sleep operations of an operating system. We investigated how timer hardware is shared among heavily CPU-, I/O- and Network-bound tasks on a virtual machine as well as on the host machine. VMware ESXi and QEMU/KVM have been chosen as commonly used examples of hypervisor- and host-based models. Based on statistical parameters of retrieved distributions, our results provide a very good estimation of timing behavior. It is essential for real-time and performance-critical applications such as image processing or real-time control.
      

        
       
          

«

21
      22
      23
   24
      25
      »

          
        

     

   

   
       
            
              
          

«

21
      22
      23
      24
   25
      »

          
        

           
           
             
               
      
      GPU based framework for geospatial analyses
      NASA Astrophysics Data System (ADS)
      Cosmin Sandric, Ionut; Ionita, Cristian; Dardala, Marian; Furtuna, Titus
         2017-04-01
         Parallel processing on multiple CPU cores is already used at large scale in geocomputing, but parallel processing on graphics cards is just at the beginning. Being able to use an simple laptop with a dedicated graphics card for advanced and very fast geocomputation is an advantage that each scientist wants to have. The necessity to have high speed computation in geosciences has increased in the last 10 years, mostly due to the increase in the available datasets. These datasets are becoming more and more detailed and hence they require more space to store and more time to process. Distributed computation on multicore CPU's and GPU's plays an important role by processing one by one small parts from these big datasets. These way of computations allows to speed up the process, because instead of using just one process for each dataset, the user can use all the cores from a CPU or up to hundreds of cores from GPU The framework provide to the end user a standalone tools for morphometry analyses at multiscale level. An important part of the framework is dedicated to uncertainty propagation in geospatial analyses. The uncertainty may come from the data collection or may be induced by the model or may have an infinite sources. These uncertainties plays important roles when a spatial delineation of the phenomena is modelled. Uncertainty propagation is implemented inside the GPU framework using Monte Carlo simulations. The GPU framework with the standalone tools proved to be a reliable tool for modelling complex natural phenomena The framework is based on NVidia Cuda technology and is written in C++ programming language. The code source will be available on github at https://github.com/sandricionut/GeoRsGPU Acknowledgement: GPU framework for geospatial analysis, Young Researchers Grant (ICUB-University of Bucharest) 2016, director Ionut Sandric
      

      
      Coding for parallel execution of hardware-in-the-loop millimeter-wave scene generation models on multicore SIMD processor architectures
      NASA Astrophysics Data System (ADS)
      Olson, Richard F.
         2013-05-01
         Rendering of point scatterer based radar scenes for millimeter wave (mmW) seeker tests in real-time hardware-in-the-loop (HWIL) scene generation requires efficient algorithms and vector-friendly computer architectures for complex signal synthesis. New processor technology from Intel implements an extended 256-bit vector SIMD instruction set (AVX, AVX2) in a multi-core CPU design providing peak execution rates of hundreds of GigaFLOPS (GFLOPS) on one chip. Real world mmW scene generation code can approach peak SIMD execution rates only after careful algorithm and source code design. An effective software design will maintain high computing intensity emphasizing register-to-register SIMD arithmetic operations over data movement between CPU caches or off-chip memories. Engineers at the U.S. Army Aviation and Missile Research, Development and Engineering Center (AMRDEC) applied two basic parallel coding methods to assess new 256-bit SIMD multi-core architectures for mmW scene generation in HWIL. These include use of POSIX threads built on vector library functions and more portable, highlevel parallel code based on compiler technology (e.g. OpenMP pragmas and SIMD autovectorization). Since CPU technology is rapidly advancing toward high processor core counts and TeraFLOPS peak SIMD execution rates, it is imperative that coding methods be identified which produce efficient and maintainable parallel code. This paper describes the algorithms used in point scatterer target model rendering, the parallelization of those algorithms, and the execution performance achieved on an AVX multi-core machine using the two basic parallel coding methods. The paper concludes with estimates for scale-up performance on upcoming multi-core technology.
      

      
      Thermal Hotspots in CPU Die and It's Future Architecture
      NASA Astrophysics Data System (ADS)
      Wang, Jian; Hu, Fu-Yuan
         
         Owing to the increasing core frequency and chip integration and the limited die dimension, the power densities in CPU chip have been increasing fastly. The high temperature on chip resulted by power densities threats the processor's performance and chip's reliability. This paper analyzed the thermal hotspots in die and their properties. A new architecture of function units in die - - hot units distributed architecture is suggested to cope with the problems of high power densities for future processor chip.
      

      
      GeantV: from CPU to accelerators
      NASA Astrophysics Data System (ADS)
      Amadio, G.; Ananya, A.; Apostolakis, J.; Arora, A.; Bandieramonte, M.; Bhattacharyya, A.; Bianchini, C.; Brun, R.; Canal, P.; Carminati, F.; Duhem, L.; Elvira, D.; Gheata, A.; Gheata, M.; Goulas, I.; Iope, R.; Jun, S.; Lima, G.; Mohanty, A.; Nikitina, T.; Novak, M.; Pokorski, W.; Ribon, A.; Sehgal, R.; Shadura, O.; Vallecorsa, S.; Wenzel, S.; Zhang, Y.
         2016-10-01
         The GeantV project aims to research and develop the next-generation simulation software describing the passage of particles through matter. While the modern CPU architectures are being targeted first, resources such as GPGPU, Intel© Xeon Phi, Atom or ARM cannot be ignored anymore by HEP CPU-bound applications. The proof of concept GeantV prototype has been mainly engineered for CPU's having vector units but we have foreseen from early stages a bridge to arbitrary accelerators. A software layer consisting of architecture/technology specific backends supports currently this concept. This approach allows to abstract out the basic types such as scalar/vector but also to formalize generic computation kernels using transparently library or device specific constructs based on Vc, CUDA, Cilk+ or Intel intrinsics. While the main goal of this approach is portable performance, as a bonus, it comes with the insulation of the core application and algorithms from the technology layer. This allows our application to be long term maintainable and versatile to changes at the backend side. The paper presents the first results of basket-based GeantV geometry navigation on the Intel© Xeon Phi KNC architecture. We present the scalability and vectorization study, conducted using Intel performance tools, as well as our preliminary conclusions on the use of accelerators for GeantV transport. We also describe the current work and preliminary results for using the GeantV transport kernel on GPUs.
      

      
      Cpu/gpu Computing for AN Implicit Multi-Block Compressible Navier-Stokes Solver on Heterogeneous Platform
      NASA Astrophysics Data System (ADS)
      Deng, Liang; Bai, Hanli; Wang, Fang; Xu, Qingxin
         2016-06-01
         CPU/GPU computing allows scientists to tremendously accelerate their numerical codes. In this paper, we port and optimize a double precision alternating direction implicit (ADI) solver for three-dimensional compressible Navier-Stokes equations from our in-house Computational Fluid Dynamics (CFD) software on heterogeneous platform. First, we implement a full GPU version of the ADI solver to remove a lot of redundant data transfers between CPU and GPU, and then design two fine-grain schemes, namely “one-thread-one-point” and “one-thread-one-line”, to maximize the performance. Second, we present a dual-level parallelization scheme using the CPU/GPU collaborative model to exploit the computational resources of both multi-core CPUs and many-core GPUs within the heterogeneous platform. Finally, considering the fact that memory on a single node becomes inadequate when the simulation size grows, we present a tri-level hybrid programming pattern MPI-OpenMP-CUDA that merges fine-grain parallelism using OpenMP and CUDA threads with coarse-grain parallelism using MPI for inter-node communication. We also propose a strategy to overlap the computation with communication using the advanced features of CUDA and MPI programming. We obtain speedups of 6.0 for the ADI solver on one Tesla M2050 GPU in contrast to two Xeon X5670 CPUs. Scalability tests show that our implementation can offer significant performance improvement on heterogeneous platform.
      

      
      Finding Dantzig Selectors with a Proximity Operator based Fixed-point Algorithm
      DTIC Science & Technology
      
         2014-11-01
         experiments showed that this method usually outperforms the method in [2] in terms of CPU time while producing solutions of comparable quality. The... method proposed in [19]. To alleviate the difficulty caused by the subprob- lem without a closed form solution , a linearized ADM was proposed for the...a closed form solution , but the β-related subproblem does not and is solved approximately by using the nonmonotone gradient method in [18]. The
      

      
      Development of Real-Time Image and In Situ Data Analysis at Sea
      DTIC Science & Technology
      
         1991-10-16
         for continuous capture from multiple satellites. The Blackhole System is the analysis machine used either by researchers to process/analyze their...Orbital Tracker and the antenna subsystem was overhauled. THE BLACKHOLE ANALYSIS SYSTEM A new HP9000/350 workstation was installed at SSOC to perform...L 4)L Scripps Satellite Oceanography Center Blackhole System Diagram (Analysis Machine) HP 350 Workstation Motorola 68020 CPU 2 - 512 MB hard disks
      

      
      Joint Services Electronics Program. Appendix
      DTIC Science & Technology
      
         1992-11-01
         the accu- clude surface waves, creeping waves, multiple racy, convergence, and CPU times for the MM diffractions, shadowing effects , etc. A second ad...Method which is an approximation to the true current J Jn= A /m on the strip. The next section will discuss the - computation of the far zone...to the cavity (0 part of the incident plane wave captured by interior E•,. After a background discussion of the aperture at the open end is divided
      

      
      Integrated DoD Voice and Data Networks and Ground Packet Radio Technology
      DTIC Science & Technology
      
         1976-08-01
         as the traffic requirement level increases. Moreover, the satellite switch selection problem is only meaningful over a limited traffic range. When...5: CPU TIMES VS. NUMBER OF SWITCHES SATELLITE SWITCH SELECTION ALGORITHM Computer Used: PDP-10 ♦O’S" means 0 minutes and 5 seconds. 5.30...Saturation Algorithm for Topo\\ogical Design of Parket-Switched Communications Networks," National Te3 ecommunications Conference Proceed- ings, San
      

      
      DDN (Defense Data Network) Protocol Implementations and Vendors Guide
      DTIC Science & Technology
      
         1989-02-01
         Announcement 286-259 6/16/86 MACHINE-TYPE/CPU: IBM RT/PC O/S: AIX DISTRIBUTOR: 1. IBM Marketing 2. IBM Authorized VAR’s 3. Authorized Personal Computer...Vendors Guide 12. PERSONAL AUTHOR(S) Dorio, Nan; Johnson, Marlyn; Lederman. Sol; Redfield, Elizabeth; Ward, Carol 13a. TYPE OF REPORT 13b. TIME COVERED 114...documentation, contact person , and distributor. The fourth section describes analysis tools. It includes information about network analysis products
      

      
      Accelerating Molecular Dynamic Simulation on Graphics Processing Units
      PubMed Central
      Friedrichs, Mark S.; Eastman, Peter; Vaidyanathan, Vishal; Houston, Mike; Legrand, Scott; Beberg, Adam L.; Ensign, Daniel L.; Bruns, Christopher M.; Pande, Vijay S.
         2009-01-01
         We describe a complete implementation of all-atom protein molecular dynamics running entirely on a graphics processing unit (GPU), including all standard force field terms, integration, constraints, and implicit solvent. We discuss the design of our algorithms and important optimizations needed to fully take advantage of a GPU. We evaluate its performance, and show that it can be more than 700 times faster than a conventional implementation running on a single CPU core. PMID:19191337
      

      
      Neural Network and Regression Approximations in High Speed Civil Transport Aircraft Design Optimization
      NASA Technical Reports Server (NTRS)
      Patniak, Surya N.; Guptill, James D.; Hopkins, Dale A.; Lavelle, Thomas M.
         1998-01-01
         Nonlinear mathematical-programming-based design optimization can be an elegant method. However, the calculations required to generate the merit function, constraints, and their gradients, which are frequently required, can make the process computational intensive. The computational burden can be greatly reduced by using approximating analyzers derived from an original analyzer utilizing neural networks and linear regression methods. The experience gained from using both of these approximation methods in the design optimization of a high speed civil transport aircraft is the subject of this paper. The Langley Research Center's Flight Optimization System was selected for the aircraft analysis. This software was exercised to generate a set of training data with which a neural network and a regression method were trained, thereby producing the two approximating analyzers. The derived analyzers were coupled to the Lewis Research Center's CometBoards test bed to provide the optimization capability. With the combined software, both approximation methods were examined for use in aircraft design optimization, and both performed satisfactorily. The CPU time for solution of the problem, which had been measured in hours, was reduced to minutes with the neural network approximation and to seconds with the regression method. Instability encountered in the aircraft analysis software at certain design points was also eliminated. On the other hand, there were costs and difficulties associated with training the approximating analyzers. The CPU time required to generate the input-output pairs and to train the approximating analyzers was seven times that required for solution of the problem.
      

      
      Design of high-performance parallelized gene predictors in MATLAB.
      PubMed
      Rivard, Sylvain Robert; Mailloux, Jean-Gabriel; Beguenane, Rachid; Bui, Hung Tien
         2012-04-10
         This paper proposes a method of implementing parallel gene prediction algorithms in MATLAB. The proposed designs are based on either Goertzel's algorithm or on FFTs and have been implemented using varying amounts of parallelism on a central processing unit (CPU) and on a graphics processing unit (GPU). Results show that an implementation using a straightforward approach can require over 4.5 h to process 15 million base pairs (bps) whereas a properly designed one could perform the same task in less than five minutes. In the best case, a GPU implementation can yield these results in 57 s. The present work shows how parallelism can be used in MATLAB for gene prediction in very large DNA sequences to produce results that are over 270 times faster than a conventional approach. This is significant as MATLAB is typically overlooked due to its apparent slow processing time even though it offers a convenient environment for bioinformatics. From a practical standpoint, this work proposes two strategies for accelerating genome data processing which rely on different parallelization mechanisms. Using a CPU, the work shows that direct access to the MEX function increases execution speed and that the PARFOR construct should be used in order to take full advantage of the parallelizable Goertzel implementation. When the target is a GPU, the work shows that data needs to be segmented into manageable sizes within the GFOR construct before processing in order to minimize execution time.
      

      
      On- versus off-hour care for patients with non-ST-segment elevation myocardial infarction in Germany : Exemplary results within the chest pain unit concept.
      PubMed
      Breuckmann, F; Remberg, F; Böse, D; Waltenberger, J; Fischer, D; Rassaf, T
         2016-12-01
         The aim of this study was to analyze differences in the timing of invasive management of patients with high-risk acute coronary syndrome without persistent ST-segment elevation (hr-NSTE-ACS) or myocardial infarction without persistent ST-segment elevation (NSTEMI) between on- and off-hours in a German chest pain unit (CPU). We retrospectively enrolled 160 NSTEMI patients in the study, who were admitted to two German CPUs in 2013. Patients presenting on weekdays between 8 a.m. and 6 p.m. were compared with patients presenting during off-hours. Data analysis included time intervals from admission to invasive management (goals: for hr-NSTE-ACS, <2 h; for NSTEMI, <24 h) and the resulting guideline adherence. Guideline-adherent timing of an invasive strategy did not differ significantly between the on-hour (6.5 h [3.0-22.0 h], 79.9 %) and off-hour groups (10.5 h [2.0-20.0 h], 75.3 %; p = 0.94), without additional significant differences between admissions during off-hours Monday to Thursday and weekends (10.0 h [2.0-19.0 h], 75.6 % vs. 7.5 h [2.0-20.0 h], 76.2 %; p = 0.96). Our exemplary experience in two different German CPUs demonstrates adequate timing of coronary catheterization in over 75 % of cases, irrespective of admission during on- or off-hours. Nationwide validation of our findings by the German CPU registry is mandatory.
      

      
      A Specialized Diacylglycerol Acyltransferase Contributes to the Extreme Medium-Chain Fatty Acid Content of Cuphea Seed Oil1[OPEN
      PubMed Central
      Iskandarov, Umidjon; Silva, Jillian E.; Andersson, Mariette
         2017-01-01
         Seed oils of many Cuphea sp. contain >90% of medium-chain fatty acids, such as decanoic acid (10:0). These seed oils, which are among the most compositionally variant in the plant kingdom, arise from specialized fatty acid biosynthetic enzymes and specialized acyltransferases. These include lysophosphatidic acid acyltransferases (LPAT) and diacylglycerol acyltransferases (DGAT) that are required for successive acylation of medium-chain fatty acids in the sn-2 and sn-3 positions of seed triacylglycerols (TAGs). Here we report the identification of a cDNA for a DGAT1-type enzyme, designated CpuDGAT1, from the transcriptome of C. avigera var pulcherrima developing seeds. Microsomes of camelina (Camelina sativa) seeds engineered for CpuDGAT1 expression displayed DGAT activity with 10:0-CoA and the diacylglycerol didecanoyl, that was approximately 4-fold higher than that in camelina seed microsomes lacking CpuDGAT1. In addition, coexpression in camelina seeds of CpuDGAT1 with a C. viscosissima FatB thioesterase (CvFatB1) that generates 10:0 resulted in TAGs with nearly 15 mol % of 10:0. More strikingly, expression of CpuDGAT1 and CvFatB1 with the previously described CvLPAT2, a 10:0-CoA-specific Cuphea LPAT, increased 10:0 amounts to 25 mol % in camelina seed TAG. These TAGs contained up to 40 mol % 10:0 in the sn-2 position, nearly double the amounts obtained from coexpression of CvFatB1 and CvLPAT2 alone. Although enriched in diacylglycerol, 10:0 was not detected in phosphatidylcholine in these seeds. These findings are consistent with channeling of 10:0 into TAG through the combined activities of specialized LPAT and DGAT activities and demonstrate the biotechnological use of these enzymes to generate 10:0-rich seed oils. PMID:28325847
      

      
      A Specialized Diacylglycerol Acyltransferase Contributes to the Extreme Medium-Chain Fatty Acid Content of Cuphea Seed Oil.
      PubMed
      Iskandarov, Umidjon; Silva, Jillian E; Kim, Hae Jin; Andersson, Mariette; Cahoon, Rebecca E; Mockaitis, Keithanne; Cahoon, Edgar B
         2017-05-01
         Seed oils of many Cuphea sp. contain >90% of medium-chain fatty acids, such as decanoic acid (10:0). These seed oils, which are among the most compositionally variant in the plant kingdom, arise from specialized fatty acid biosynthetic enzymes and specialized acyltransferases. These include lysophosphatidic acid acyltransferases (LPAT) and diacylglycerol acyltransferases (DGAT) that are required for successive acylation of medium-chain fatty acids in the sn -2 and sn -3 positions of seed triacylglycerols (TAGs). Here we report the identification of a cDNA for a DGAT1-type enzyme, designated CpuDGAT1, from the transcriptome of C. avigera var pulcherrima developing seeds. Microsomes of camelina ( Camelina sativa ) seeds engineered for CpuDGAT1 expression displayed DGAT activity with 10:0-CoA and the diacylglycerol didecanoyl, that was approximately 4-fold higher than that in camelina seed microsomes lacking CpuDGAT1. In addition, coexpression in camelina seeds of CpuDGAT1 with a C. viscosissima FatB thioesterase (CvFatB1) that generates 10:0 resulted in TAGs with nearly 15 mol % of 10:0. More strikingly, expression of CpuDGAT1 and CvFatB1 with the previously described CvLPAT2, a 10:0-CoA-specific Cuphea LPAT, increased 10:0 amounts to 25 mol % in camelina seed TAG. These TAGs contained up to 40 mol % 10:0 in the sn -2 position, nearly double the amounts obtained from coexpression of CvFatB1 and CvLPAT2 alone. Although enriched in diacylglycerol, 10:0 was not detected in phosphatidylcholine in these seeds. These findings are consistent with channeling of 10:0 into TAG through the combined activities of specialized LPAT and DGAT activities and demonstrate the biotechnological use of these enzymes to generate 10:0-rich seed oils. © 2017 American Society of Plant Biologists. All Rights Reserved.
      

      
      Methamphetamine-induced neurotoxicity is attenuated in transgenic mice with a null mutation for interleukin-6.
      PubMed
      Ladenheim, B; Krasnova, I N; Deng, X; Oyler, J M; Polettini, A; Moran, T H; Huestis, M A; Cadet, J L
         2000-12-01
         Increasing evidence implicates apoptosis as a major mechanism of cell death in methamphetamine (METH) neurotoxicity. The involvement of a neuroimmune component in apoptotic cell death after injury or chemical damage suggests that cytokines may play a role in METH effects. In the present study, we examined if the absence of IL-6 in knockout (IL-6-/-) mice could provide protection against METH-induced neurotoxicity. Administration of METH resulted in a significant reduction of [(125)I]RTI-121-labeled dopamine transporters in the caudate-putamen (CPu) and cortex as well as depletion of dopamine in the CPu and frontal cortex of wild-type mice. However, these METH-induced effects were significantly attenuated in IL-6-/- animals. METH also caused a decrease in serotonin levels in the CPu and hippocampus of wild-type mice, but no reduction was observed in IL-6-/- animals. Moreover, METH induced decreases in [(125)I]RTI-55-labeled serotonin transporters in the hippocampal CA3 region and in the substantia nigra-reticulata but increases in serotonin transporters in the CPu and cingulate cortex in wild-type animals, all of which were attenuated in IL-6-/- mice. Additionally, METH caused increased gliosis in the CPu and cortices of wild-type mice as measured by [(3)H]PK-11195 binding; this gliotic response was almost completely inhibited in IL-6-/- animals. There was also significant protection against METH-induced DNA fragmentation, measured by the number of terminal deoxynucleotidyl transferase-mediated dUTP nick-end-labeled (TUNEL) cells in the cortices. The protective effects against METH toxicity observed in the IL-6-/- mice were not caused by differences in temperature elevation or in METH accumulation in wild-type and mutant animals. Therefore, these observations support the proposition that IL-6 may play an important role in the neurotoxicity of METH.
      

      
      Structure of the airflow above surface waves
      NASA Astrophysics Data System (ADS)
      Buckley, Marc; Veron, Fabrice
         2016-04-01
         Weather, climate and upper ocean patterns are controlled by the exchanges of momentum, heat, mass, and energy across the ocean surface. These fluxes are, in turn, influenced by the small-scale physics at the wavy air-sea interface. We present laboratory measurements of the fine-scale airflow structure above waves, achieved in over 15 different wind-wave conditions, with wave ages Cp/u* ranging from 1.4 to 66.7 (where Cp is the peak phase speed of the waves, and u* the air friction velocity). The experiments were performed in the large (42-m long) wind-wave-current tank at University of Delaware's Air-Sea Interaction laboratory (USA). A combined Particle Image Velocimetry and Laser Induced Fluorescence system was specifically developed for this study, and provided two-dimensional airflow velocity measurement as low as 100 um above the air-water interface. Starting at very low wind speeds (U10~2m/s), we directly observe coherent turbulent structures within the buffer and logarithmic layers of the airflow above the air-water interface, whereby low horizontal velocity air is ejected away from the surface, and higher velocity fluid is swept downward. Wave phase coherent quadrant analysis shows that such turbulent momentum flux events are wave-phase dependent. Airflow separation events are directly observed over young wind waves (Cp/u*<3.7) and counted using measured vorticity and surface viscous stress criteria. Detached high spanwise vorticity layers cause intense wave-coherent turbulence downwind of wave crests, as shown by wave-phase averaging of turbulent momentum fluxes. Mean wave-coherent airflow motions and fluxes also show strong phase-locked patterns, including a sheltering effect, upwind of wave crests over old mechanically generated swells (Cp/u*=31.7), and downwind of crests over young wind waves (Cp/u*=3.7). Over slightly older wind waves (Cp/u* = 6.5), the measured wave-induced airflow perturbations are qualitatively consistent with linear critical layer theory.
      

      
      Embedded real-time operating system micro kernel design
      NASA Astrophysics Data System (ADS)
      Cheng, Xiao-hui; Li, Ming-qiang; Wang, Xin-zheng
         2005-12-01
         Embedded systems usually require a real-time character. Base on an 8051 microcontroller, an embedded real-time operating system micro kernel is proposed consisting of six parts, including a critical section process, task scheduling, interruption handle, semaphore and message mailbox communication, clock managent and memory managent. Distributed CPU and other resources are among tasks rationally according to the importance and urgency. The design proposed here provides the position, definition, function and principle of micro kernel. The kernel runs on the platform of an ATMEL AT89C51 microcontroller. Simulation results prove that the designed micro kernel is stable and reliable and has quick response while operating in an application system.
      

      
      Real-time image processing for non-contact monitoring of dynamic displacements using smartphone technologies
      NASA Astrophysics Data System (ADS)
      Min, Jae-Hong; Gelo, Nikolas J.; Jo, Hongki
         2016-04-01
         The newly developed smartphone application, named RINO, in this study allows measuring absolute dynamic displacements and processing them in real time using state-of-the-art smartphone technologies, such as high-performance graphics processing unit (GPU), in addition to already powerful CPU and memories, embedded high-speed/ resolution camera, and open-source computer vision libraries. A carefully designed color-patterned target and user-adjustable crop filter enable accurate and fast image processing, allowing up to 240fps for complete displacement calculation and real-time display. The performances of the developed smartphone application are experimentally validated, showing comparable accuracy with those of conventional laser displacement sensor.
      

        
       
          

«

21
      22
      23
      24
   25
      »

          
        

     

   

   
       
            
              
          

«

21
      22
      23
      24
      25
   »

          
        

           
           
             
               
      
      Convolutional Neural Network on Embedded Linux(trademark) System-on-Chip: A Methodology and Performance Benchmark
      DTIC Science & Technology
      
         2016-05-01
         A9 CPU and 15 W for the i7 CPU. A method of accelerating this computation is by using a customized hardware unit called a field- programmable gate...implementation of custom logic to accelerate com- putational workloads. This FPGA fabric, in addition to the standard programmable logic, contains 220...chip; field- programmable gate array Daniel Gebhardt U U U U 18 (619) 553-2786 INITIAL DISTRIBUTION 84300 Library (2) 85300 Archive/Stock (1
      

      
      Convolutional Neural Network on Embedded Linux System-on-Chip: A Methodology and Performance Benchmark
      DTIC Science & Technology
      
         2016-05-01
         A9 CPU and 15 W for the i7 CPU. A method of accelerating this computation is by using a customized hardware unit called a field- programmable gate...implementation of custom logic to accelerate com- putational workloads. This FPGA fabric, in addition to the standard programmable logic, contains 220...chip; field- programmable gate array Daniel Gebhardt U U U U 18 (619) 553-2786 INITIAL DISTRIBUTION 84300 Library (2) 85300 Archive/Stock (1
      

      
      An Xdata Architecture for Federated Graph Models and Multi-tier Asymmetric Computing
      DTIC Science & Technology
      
         2014-01-01
         Wikipedia, a scale-free random graph (kron), Akamai trace route data, Bitcoin transaction data, and a Twitter follower network. We present results for...3x (SSSP on a random graph) and nearly 300x (Akamai and Bitcoin ) over the CPU performance of a well-known and widely deployed CPU-based graph...provided better throughput for smaller frontiers such as roadmaps or the Bitcoin data set. In our work, we have focused on two-phase kernels, but it
      

      
      Autonomic Recovery: HyperCheck: A Hardware-Assisted Integrity Monitor
      DTIC Science & Technology
      
         2013-08-01
         system (OS). HyperCheck leverages the CPU System Management Mode ( SMM ), present in x86 systems, to securely generate and transmit the full state of the...HyperCheck harnesses the CPU System Management Mode ( SMM ) which is present in all x86 commodity systems to create a snapshot view of the current state of the...protect the software above it. Our assumptions are that the attacker does not have physical access to the machine and that the SMM BIOS is locked and
      

      
      Bridging FPGA and GPU technologies for AO real-time control
      NASA Astrophysics Data System (ADS)
      Perret, Denis; Lainé, Maxime; Bernard, Julien; Gratadour, Damien; Sevin, Arnaud
         2016-07-01
         Our team has developed a common environment for high performance simulations and real-time control of AO systems based on the use of Graphics Processors Units in the context of the COMPASS project. Such a solution, based on the ability of the real time core in the simulation to provide adequate computing performance, limits the cost of developing AO RTC systems and makes them more scalable. A code developed and validated in the context of the simulation may be injected directly into the system and tested on sky. Furthermore, the use of relatively low cost components also offers significant advantages for the system hardware platform. However, the use of GPUs in an AO loop comes with drawbacks: the traditional way of offloading computation from CPU to GPUs - involving multiple copies and unacceptable overhead in kernel launching - is not well suited in a real time context. This last application requires the implementation of a solution enabling direct memory access (DMA) to the GPU memory from a third party device, bypassing the operating system. This allows this device to communicate directly with the real-time core of the simulation feeding it with the WFS camera pixel stream. We show that DMA between a custom FPGA-based frame-grabber and a computation unit (GPU, FPGA, or Coprocessor such as Xeon-phi) across PCIe allows us to get latencies compatible with what will be needed on ELTs. As a fine-grained synchronization mechanism is not yet made available by GPU vendors, we propose the use of memory polling to avoid interrupts handling and involvement of a CPU. Network and Vision protocols are handled by the FPGA-based Network Interface Card (NIC). We present the results we obtained on a complete AO loop using camera and deformable mirror simulators.
      

      
      Parallel Task Management Library for MARTe
      NASA Astrophysics Data System (ADS)
      Valcarcel, Daniel F.; Alves, Diogo; Neto, Andre; Reux, Cedric; Carvalho, Bernardo B.; Felton, Robert; Lomas, Peter J.; Sousa, Jorge; Zabeo, Luca
         2014-06-01
         The Multithreaded Application Real-Time executor (MARTe) is a real-time framework with increasing popularity and support in the thermonuclear fusion community. It allows modular code to run in a multi-threaded environment leveraging on the current multi-core processor (CPU) technology. One application that relies on the MARTe framework is the Joint European Torus (JET) tokamak WAll Load Limiter System (WALLS). It calculates and monitors the temperature on metal tiles and plasma facing components (PFCs) that can melt or flake if their temperature gets too high when exposed to power loads. One of the main time consuming tasks in WALLS is the calculation of thermal diffusion models in real-time. These models tend to be described by very large state-space models thus making them perfect candidates for parallelisation. MARTe's traditional approach for task parallelisation is to split the problem into several Real-Time Threads, each responsible for a self-contained sequential execution of an input-to-output chain. This is usually possible, but it might not always be practical for algorithmic or technical reasons. Also, it might not be easily scalable with an increase in the number of available CPU cores. The WorkLibrary introduces a “GPU-like approach” of splitting work among the available cores of modern CPUs that is (i) straightforward to use in an application, (ii) scalable with the availability of cores and all of this (iii) without rewriting or recompiling the source code. The first part of this article explains the motivation behind the library, its architecture and implementation. The second part presents a real application for WALLS, a parallel version of a large state-space model describing the 2D thermal diffusion on a JET tile.
      

      
      Towards real-time medical diagnostics using hyperspectral imaging technology
      NASA Astrophysics Data System (ADS)
      Bjorgan, Asgeir; Randeberg, Lise L.
         2015-07-01
         Hyperspectral imaging provides non-contact, high resolution spectral images which has a substantial diagnostic potential. This can be used for e.g. diagnosis and early detection of arthritis in finger joints. Processing speed is currently a limitation for clinical use of the technique. A real-time system for analysis and visualization using GPU processing and threaded CPU processing is presented. Images showing blood oxygenation, blood volume fraction and vessel enhanced images are among the data calculated in real-time. This study shows the potential of real-time processing in this context. A combination of the processing modules will be used in detection of arthritic finger joints from hyperspectral reflectance and transmittance data.
      

      
      Development of a chest digital tomosynthesis R/F system and implementation of low-dose GPU-accelerated compressed sensing (CS) image reconstruction.
      PubMed
      Choi, Sunghoon; Lee, Haenghwa; Lee, Donghoon; Choi, Seungyeon; Lee, Chang-Lae; Kwon, Woocheol; Shin, Jungwook; Seo, Chang-Woo; Kim, Hee-Joung
         2018-05-01
         This work describes the hardware and software developments of a prototype chest digital tomosynthesis (CDT) R/F system. The purpose of this study was to validate the developed system for its possible clinical application on low-dose chest tomosynthesis imaging. The prototype CDT R/F system was operated by carefully controlling the electromechanical subsystems through a synchronized interface. Once a command signal was delivered by the user, a tomosynthesis sweep started to acquire 81 projection views (PVs) in a limited angular range of ±20°. Among the full projection dataset of 81 images, several sets of 21 (quarter view) and 41 (half view) images with equally spaced angle steps were selected to represent a sparse view condition. GPU-accelerated and total-variation (TV) regularization strategy-based compressed sensing (CS) image reconstruction was implemented. The imaged objects were a flat-field using a copper filter to measure the noise power spectrum (NPS), a Catphan ® CTP682 quality assurance (QA) phantom to measure a task-based modulation transfer function (MTF T ask ) of three different cylinders' edge, and an anthropomorphic chest phantom with inserted lung nodules. The authors also verified the accelerated computing power over CPU programming by checking the elapsed time required for the CS method. The resultant absorbed and effective doses that were delivered to the chest phantom from two-view digital radiographic projections, helical computed tomography (CT), and the prototype CDT system were compared. The prototype CDT system was successfully operated, showing little geometric error with fast rise and fall times of R/F x-ray pulse less than 2 and 10 ms, respectively. The in-plane NPS presented essential symmetric patterns as predicted by the central slice theorem. The NPS images from 21 PVs were provided quite different pattern against 41 and 81 PVs due to aliased noise. The voxel variance values which summed all NPS intensities were inversely proportional to the number of PVs, and the CS method gave much lower voxel variance by the factors of 3.97-6.43 and 2.28-3.36 compared to filtered backprojection (FBP) and 20 iterations of simultaneous algebraic reconstruction technique (SART). The spatial frequencies of the f 50 at which the MTF T ask reduced to 50% were 1.50, 1.55, and 1.67 cycles/mm for FBP, SART, and CS methods, respectively, in the case of Bone 20% cylinder using 41 views. A variety of ranges of TV reconstruction parameters were implemented during the CS method and we could observe that the NPS and MTF T ask preserved best when the regularization and TV smoothing parameters α and τ were in a range of 0.001-0.1. For the chest phantom data, the signal difference to noise ratios (SDNRs) were higher in the proposed CS scheme images than in the FBP and SART, showing the enhanced rate of 1.05-1.43 for half view imaging. The total averaged reconstruction time during 20 iterations of the CS scheme was 124.68 s, which could match-up a clinically feasible time (<3 min). This computing time represented an enhanced speed 386 times greater than CPU programming. The total amounts of estimated effective doses were 0.12, 0.53 (half view), and 2.56 mSv for two-view radiographs, the prototype CDT system, and helical CT, respectively, showing 4.49 times higher than conventional radiography and 4.83 times lower than a CT exam, respectively. The current work describes the development and performance assessment of both hardware and software for tomosynthesis applications. The authors observed reasonable outcomes by showing a potential for low-dose application in CDT imaging using GPU acceleration. © 2018 American Association of Physicists in Medicine.
      

      
      New computing systems and their impact on structural analysis and design
      NASA Technical Reports Server (NTRS)
      Noor, Ahmed K.
         1989-01-01
         A review is given of the recent advances in computer technology that are likely to impact structural analysis and design. The computational needs for future structures technology are described. The characteristics of new and projected computing systems are summarized. Advances in programming environments, numerical algorithms, and computational strategies for new computing systems are reviewed, and a novel partitioning strategy is outlined for maximizing the degree of parallelism. The strategy is designed for computers with a shared memory and a small number of powerful processors (or a small number of clusters of medium-range processors). It is based on approximating the response of the structure by a combination of symmetric and antisymmetric response vectors, each obtained using a fraction of the degrees of freedom of the original finite element model. The strategy was implemented on the CRAY X-MP/4 and the Alliant FX/8 computers. For nonlinear dynamic problems on the CRAY X-MP with four CPUs, it resulted in an order of magnitude reduction in total analysis time, compared with the direct analysis on a single-CPU CRAY X-MP machine.
      

      
      Transonic Navier-Stokes wing solution using a zonal approach. Part 1: Solution methodology and code validation
      NASA Technical Reports Server (NTRS)
      Flores, J.; Gundy, K.; Gundy, K.; Gundy, K.; Gundy, K.; Gundy, K.
         1986-01-01
         A fast diagonalized Beam-Warming algorithm is coupled with a zonal approach to solve the three-dimensional Euler/Navier-Stokes equations. The computer code, called Transonic Navier-Stokes (TNS), uses a total of four zones for wing configurations (or can be extended to complete aircraft configurations by adding zones). In the inner blocks near the wing surface, the thin-layer Navier-Stokes equations are solved, while in the outer two blocks the Euler equations are solved. The diagonal algorithm yields a speedup of as much as a factor of 40 over the original algorithm/zonal method code. The TNS code, in addition, has the capability to model wind tunnel walls. Transonic viscous solutions are obtained on a 150,000-point mesh for a NACA 0012 wing. A three-order-of-magnitude drop in the L2-norm of the residual requires approximately 500 iterations, which takes about 45 min of CPU time on a Cray-XMP processor. Simulations are also conducted for a different geometrical wing called WING C. All cases show good agreement with experimental data.
      

      
      VAXCMS - VAX CONTINUOUS MONITORING SYSTEM, VERSION 2.2
      NASA Technical Reports Server (NTRS)
      Farkas, L.
         1994-01-01
         The VAX Continuous Monitoring System (VAXCMS) was developed at NASA Headquarters to aid system managers in monitoring the performance of VAX systems through the generation of graphic images which summarize trends in performance metrics over time. Since its initial development, VAXCMS has been extensively modified at the NASA Lewis Research Center. Data is produced by utilizing the VMS MONITOR utility to collect the performance data, and then feeding the data through custom-developed linkages to the Computer Associates' TELL-A-GRAF computer graphics software to generate the chart images for analysis by the system manager. The VMS ACCOUNTING utility is also utilized to gather interactive process information. The charts that are generated by VAXCMS are: 1) CPU modes for each node over the most recent four month period 2) CPU modes for the cluster as a whole using a weighted average of all the nodes in the cluster based on processing power 3) Percent of primary memory in use for each node over the most recent four month period 4) Interactive processes for all nodes over the most recent four month period 5) Daily, weekly, and monthly, performance summaries for CPU modes, percent of primary memory in use, and page fault rates for each node 6) Daily disk I/O performance data plotting Average Disk I/O Response Time based on I/O Operation Rate and Queue Length. VAXCMS is written in DCL and VAX FORTRAN for use with DEC VAX series computers running VMS 5.1 or later. This program requires the TELL-A-GRAF graphics package in order to generate plots of system data. A FORTRAN compiler is required. The standard distribution medium for VAXCMS is a 9-track 1600 BPI magnetic tape in DEC VAX BACKUP format. It is also available on a TK50 tape cartridge in DEC VAX BACKUP format. An electronic copy of the documentation in ASCII format is included on the distribution medium. Portions of this code are copyrighted by Mr. David Lavery and are distributed with his permission. These portions of the code may not be redistributed commercially.
      

      
      Effective electron-density map improvement and structure validation on a Linux multi-CPU web cluster: The TB Structural Genomics Consortium Bias Removal Web Service.
      PubMed
      Reddy, Vinod; Swanson, Stanley M; Segelke, Brent; Kantardjieff, Katherine A; Sacchettini, James C; Rupp, Bernhard
         2003-12-01
         Anticipating a continuing increase in the number of structures solved by molecular replacement in high-throughput crystallography and drug-discovery programs, a user-friendly web service for automated molecular replacement, map improvement, bias removal and real-space correlation structure validation has been implemented. The service is based on an efficient bias-removal protocol, Shake&wARP, and implemented using EPMR and the CCP4 suite of programs, combined with various shell scripts and Fortran90 routines. The service returns improved maps, converted data files and real-space correlation and B-factor plots. User data are uploaded through a web interface and the CPU-intensive iteration cycles are executed on a low-cost Linux multi-CPU cluster using the Condor job-queuing package. Examples of map improvement at various resolutions are provided and include model completion and reconstruction of absent parts, sequence correction, and ligand validation in drug-target structures.
      

      
      The Application of Virtex-II Pro FPGA in High-Speed Image Processing Technology of Robot Vision Sensor
      NASA Astrophysics Data System (ADS)
      Ren, Y. J.; Zhu, J. G.; Yang, X. Y.; Ye, S. H.
         2006-10-01
         The Virtex-II Pro FPGA is applied to the vision sensor tracking system of IRB2400 robot. The hardware platform, which undertakes the task of improving SNR and compressing data, is constructed by using the high-speed image processing of FPGA. The lower level image-processing algorithm is realized by combining the FPGA frame and the embedded CPU. The velocity of image processing is accelerated due to the introduction of FPGA and CPU. The usage of the embedded CPU makes it easily to realize the logic design of interface. Some key techniques are presented in the text, such as read-write process, template matching, convolution, and some modules are simulated too. In the end, the compare among the modules using this design, using the PC computer and using the DSP, is carried out. Because the high-speed image processing system core is a chip of FPGA, the function of which can renew conveniently, therefore, to a degree, the measure system is intelligent.
      

      
      Study of data I/O performance on distributed disk system in mask data preparation
      NASA Astrophysics Data System (ADS)
      Ohara, Shuichiro; Odaira, Hiroyuki; Chikanaga, Tomoyuki; Hamaji, Masakazu; Yoshioka, Yasuharu
         2010-09-01
         Data volume is getting larger every day in Mask Data Preparation (MDP). In the meantime, faster data handling is always required. MDP flow typically introduces Distributed Processing (DP) system to realize the demand because using hundreds of CPU is a reasonable solution. However, even if the number of CPU were increased, the throughput might be saturated because hard disk I/O and network speeds could be bottlenecks. So, MDP needs to invest a lot of money to not only hundreds of CPU but also storage and a network device which make the throughput faster. NCS would like to introduce new distributed processing system which is called "NDE". NDE could be a distributed disk system which makes the throughput faster without investing a lot of money because it is designed to use multiple conventional hard drives appropriately over network. NCS studies I/O performance with OASIS® data format on NDE which contributes to realize the high throughput in this paper.
      

      
      A study of the relationship between the performance and dependability of a fault-tolerant computer
      NASA Technical Reports Server (NTRS)
      Goswami, Kumar K.
         1994-01-01
         This thesis studies the relationship by creating a tool (FTAPE) that integrates a high stress workload generator with fault injection and by using the tool to evaluate system performance under error conditions. The workloads are comprised of processes which are formed from atomic components that represent CPU, memory, and I/O activity. The fault injector is software-implemented and is capable of injecting any memory addressable location, including special registers and caches. This tool has been used to study a Tandem Integrity S2 Computer. Workloads with varying numbers of processes and varying compositions of CPU, memory, and I/O activity are first characterized in terms of performance. Then faults are injected into these workloads. The results show that as the number of concurrent processes increases, the mean fault latency initially increases due to increased contention for the CPU. However, for even higher numbers of processes (less than 3 processes), the mean latency decreases because long latency faults are paged out before they can be activated.
      

      
      Computational Approaches to Simulation and Optimization of Global Aircraft Trajectories
      NASA Technical Reports Server (NTRS)
      Ng, Hok Kwan; Sridhar, Banavar
         2016-01-01
         This study examines three possible approaches to improving the speed in generating wind-optimal routes for air traffic at the national or global level. They are: (a) using the resources of a supercomputer, (b) running the computations on multiple commercially available computers and (c) implementing those same algorithms into NASAs Future ATM Concepts Evaluation Tool (FACET) and compares those to a standard implementation run on a single CPU. Wind-optimal aircraft trajectories are computed using global air traffic schedules. The run time and wait time on the supercomputer for trajectory optimization using various numbers of CPUs ranging from 80 to 10,240 units are compared with the total computational time for running the same computation on a single desktop computer and on multiple commercially available computers for potential computational enhancement through parallel processing on the computer clusters. This study also re-implements the trajectory optimization algorithm for further reduction of computational time through algorithm modifications and integrates that with FACET to facilitate the use of the new features which calculate time-optimal routes between worldwide airport pairs in a wind field for use with existing FACET applications. The implementations of trajectory optimization algorithms use MATLAB, Python, and Java programming languages. The performance evaluations are done by comparing their computational efficiencies and based on the potential application of optimized trajectories. The paper shows that in the absence of special privileges on a supercomputer, a cluster of commercially available computers provides a feasible approach for national and global air traffic system studies.
      

      
      Preconditioned conjugate-gradient methods for low-speed flow calculations
      NASA Technical Reports Server (NTRS)
      Ajmani, Kumud; Ng, Wing-Fai; Liou, Meng-Sing
         1993-01-01
         An investigation is conducted into the viability of using a generalized Conjugate Gradient-like method as an iterative solver to obtain steady-state solutions of very low-speed fluid flow problems. Low-speed flow at Mach 0.1 over a backward-facing step is chosen as a representative test problem. The unsteady form of the two dimensional, compressible Navier-Stokes equations is integrated in time using discrete time-steps. The Navier-Stokes equations are cast in an implicit, upwind finite-volume, flux split formulation. The new iterative solver is used to solve a linear system of equations at each step of the time-integration. Preconditioning techniques are used with the new solver to enhance the stability and convergence rate of the solver and are found to be critical to the overall success of the solver. A study of various preconditioners reveals that a preconditioner based on the Lower-Upper Successive Symmetric Over-Relaxation iterative scheme is more efficient than a preconditioner based on Incomplete L-U factorizations of the iteration matrix. The performance of the new preconditioned solver is compared with a conventional Line Gauss-Seidel Relaxation (LGSR) solver. Overall speed-up factors of 28 (in terms of global time-steps required to converge to a steady-state solution) and 20 (in terms of total CPU time on one processor of a CRAY-YMP) are found in favor of the new preconditioned solver, when compared with the LGSR solver.
      

      
      Preconditioned Conjugate Gradient methods for low speed flow calculations
      NASA Technical Reports Server (NTRS)
      Ajmani, Kumud; Ng, Wing-Fai; Liou, Meng-Sing
         1993-01-01
         An investigation is conducted into the viability of using a generalized Conjugate Gradient-like method as an iterative solver to obtain steady-state solutions of very low-speed fluid flow problems. Low-speed flow at Mach 0.1 over a backward-facing step is chosen as a representative test problem. The unsteady form of the two dimensional, compressible Navier-Stokes equations are integrated in time using discrete time-steps. The Navier-Stokes equations are cast in an implicit, upwind finite-volume, flux split formulation. The new iterative solver is used to solve a linear system of equations at each step of the time-integration. Preconditioning techniques are used with the new solver to enhance the stability and the convergence rate of the solver and are found to be critical to the overall success of the solver. A study of various preconditioners reveals that a preconditioner based on the lower-upper (L-U)-successive symmetric over-relaxation iterative scheme is more efficient than a preconditioner based on incomplete L-U factorizations of the iteration matrix. The performance of the new preconditioned solver is compared with a conventional line Gauss-Seidel relaxation (LGSR) solver. Overall speed-up factors of 28 (in terms of global time-steps required to converge to a steady-state solution) and 20 (in terms of total CPU time on one processor of a CRAY-YMP) are found in favor of the new preconditioned solver, when compared with the LGSR solver.
      

      
      Full waveform time domain solutions for source and induced magnetotelluric and controlled-source electromagnetic fields using quasi-equivalent time domain decomposition and GPU parallelization
      NASA Astrophysics Data System (ADS)
      Imamura, N.; Schultz, A.
         2015-12-01
         Recently, a full waveform time domain solution has been developed for the magnetotelluric (MT) and controlled-source electromagnetic (CSEM) methods. The ultimate goal of this approach is to obtain a computationally tractable direct waveform joint inversion for source fields and earth conductivity structure in three and four dimensions. This is desirable on several grounds, including the improved spatial resolving power expected from use of a multitude of source illuminations of non-zero wavenumber, the ability to operate in areas of high levels of source signal spatial complexity and non-stationarity, etc. This goal would not be obtainable if one were to adopt the finite difference time-domain (FDTD) approach for the forward problem. This is particularly true for the case of MT surveys, since an enormous number of degrees of freedom are required to represent the observed MT waveforms across the large frequency bandwidth. It means that for FDTD simulation, the smallest time steps should be finer than that required to represent the highest frequency, while the number of time steps should also cover the lowest frequency. This leads to a linear system that is computationally burdensome to solve. We have implemented our code that addresses this situation through the use of a fictitious wave domain method and GPUs to speed up the computation time. We also substantially reduce the size of the linear systems by applying concepts from successive cascade decimation, through quasi-equivalent time domain decomposition. By combining these refinements, we have made good progress toward implementing the core of a full waveform joint source field/earth conductivity inverse modeling method. From results, we found the use of previous generation of CPU/GPU speeds computations by an order of magnitude over a parallel CPU only approach. In part, this arises from the use of the quasi-equivalent time domain decomposition, which shrinks the size of the linear system dramatically.
      

      
      Requirements Analysis for Large Ada Programs: Lessons Learned on CCPDS- R
      DTIC Science & Technology
      
         1989-12-01
         when the design had matured and This approach was not optimal from the formal the SRS role was to be the tester’s contract, implemen- testing and...on the software development CPU processing load. These constraints primar- process is the necessity to include sufficient testing ily affect algorithm...allocations and timing requirements are by-products of the software design process when multiple CSCls are a P R StrR eSOFTWARE ENGINEERING executed within
      

        
       
          

«

21
      22
      23
      24
      25
   »

          
        

     

   

   Some links on this page may take you to non-federal websites. Their policies may differ from this site.