Sample records for quad-core Intel Xeon

  1. Accelerating the Pace of Protein Functional Annotation With Intel Xeon Phi Coprocessors.

    PubMed

    Feinstein, Wei P; Moreno, Juana; Jarrell, Mark; Brylinski, Michal

    2015-06-01

    Intel Xeon Phi is a new addition to the family of powerful parallel accelerators. The range of its potential applications in computationally driven research is broad; however, at present, the repository of scientific codes is still relatively limited. In this study, we describe the development and benchmarking of a parallel version of eFindSite, a structural bioinformatics algorithm for the prediction of ligand-binding sites in proteins. Implemented for the Intel Xeon Phi platform, the parallelization of the structure alignment portion of eFindSite using pragma-based OpenMP brings about the desired performance improvements, which scale well with the number of computing cores. Compared to a serial version, the parallel code runs 11.8 and 10.1 times faster on the CPU and the coprocessor, respectively; when both resources are utilized simultaneously, the speedup is 17.6. For example, ligand-binding predictions for 501 benchmarking proteins are completed in 2.1 hours on a single Stampede node equipped with the Intel Xeon Phi card compared to 3.1 hours without the accelerator and 36.8 hours required by a serial version. In addition to the satisfactory parallel performance, porting existing scientific codes to the Intel Xeon Phi architecture is relatively straightforward with a short development time due to the support of common parallel programming models by the coprocessor. The parallel version of eFindSite is freely available to the academic community at www.brylinski.org/efindsite.
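
    A minimal sketch of the pragma-based OpenMP pattern described above, parallelizing independent structure alignments across cores. The types and the alignment kernel are hypothetical stand-ins, not the published eFindSite code; compiled with OpenMP enabled, the same loop serves the CPU and, under the classic Intel toolchain, the coprocessor.

    ```cpp
    // Hedged sketch of pragma-based OpenMP parallelism over independent
    // structure alignments. TemplateProtein, Alignment and align_to_target()
    // are hypothetical stand-ins, not the eFindSite API.
    #include <cmath>
    #include <vector>

    struct TemplateProtein { std::vector<float> coords; };
    struct Alignment { double score = 0.0; };

    // Stand-in for the serial structure-alignment kernel.
    Alignment align_to_target(const TemplateProtein& t) {
        Alignment a;
        for (float c : t.coords) a.score += std::sqrt(std::fabs(c));
        return a;
    }

    std::vector<Alignment> align_all(const std::vector<TemplateProtein>& templates) {
        std::vector<Alignment> results(templates.size());
        // Alignments are independent; dynamic scheduling balances the widely
        // varying per-template cost across CPU or coprocessor cores.
        #pragma omp parallel for schedule(dynamic)
        for (long i = 0; i < static_cast<long>(templates.size()); ++i)
            results[i] = align_to_target(templates[i]);
        return results;
    }
    ```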

  2. Investigating the Use of the Intel Xeon Phi for Event Reconstruction

    NASA Astrophysics Data System (ADS)

    Sherman, Keegan; Gilfoyle, Gerard

    2014-09-01

The physics goal of Jefferson Lab is to understand how quarks and gluons form nuclei, and the lab is being upgraded to a higher, 12-GeV beam energy. The new CLAS12 detector in Hall B will collect 5-10 terabytes of data per day and will require considerable computing resources. We are investigating tools, such as the Intel Xeon Phi, to speed up the event reconstruction. The Kalman Filter is one of the methods being studied. It is a linear algebra algorithm that estimates the state of a system by combining existing data and predictions of those measurements. The tools required to apply this technique (e.g., matrix multiplication, matrix inversion) are being written using C++ intrinsics for Intel's Xeon Phi Coprocessor, which uses the Many Integrated Cores (MIC) architecture. The Intel MIC is a new high-performance chip that connects to a host machine through the PCIe bus and is built to run highly vectorized and parallelized code, making it a well-suited device for applications such as the Kalman Filter. Our tests of the MIC-optimized algorithms needed for the filter show significant increases in speed. For example, matrix multiplication of 5x5 matrices on the MIC was able to run up to 69 times faster than the host core.
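
    The kind of small fixed-size kernel the abstract mentions can be sketched portably as follows. The authors used hand-written MIC intrinsics; this version instead hints vectorization with OpenMP SIMD, which is an assumption for illustration, not their code.

    ```cpp
    // Portable sketch of the 5x5 matrix multiply at the core of a Kalman
    // filter update. Vectorizing across the inner index j keeps loads and
    // stores unit-stride, which is what the vector units want; the paper's
    // hand-written IMCI intrinsics are not reproduced here.
    #include <cstddef>

    constexpr std::size_t N = 5;

    void matmul5x5(const float A[N][N], const float B[N][N], float C[N][N]) {
        for (std::size_t i = 0; i < N; ++i) {
            float row[N] = {0.0f};
            for (std::size_t k = 0; k < N; ++k) {
                const float a = A[i][k];
                // Unit-stride access over j lets the compiler vectorize.
                #pragma omp simd
                for (std::size_t j = 0; j < N; ++j)
                    row[j] += a * B[k][j];
            }
            for (std::size_t j = 0; j < N; ++j) C[i][j] = row[j];
        }
    }
    ```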

  3. ELT-scale Adaptive Optics real-time control with the Intel Xeon Phi Many Integrated Core Architecture

    NASA Astrophysics Data System (ADS)

    Jenkins, David R.; Basden, Alastair; Myers, Richard M.

    2018-05-01

    We propose a solution to the increased computational demands of Extremely Large Telescope (ELT) scale adaptive optics (AO) real-time control with the Intel Xeon Phi Knights Landing (KNL) Many Integrated Core (MIC) Architecture. The computational demands of an AO real-time controller (RTC) scale with the fourth power of telescope diameter and so the next generation ELTs require orders of magnitude more processing power for the RTC pipeline than existing systems. The Xeon Phi contains a large number (≥64) of low power x86 CPU cores and high bandwidth memory integrated into a single socketed server CPU package. The increased parallelism and memory bandwidth are crucial to providing the performance for reconstructing wavefronts with the required precision for ELT scale AO. Here, we demonstrate that the Xeon Phi KNL is capable of performing ELT scale single conjugate AO real-time control computation at over 1.0kHz with less than 20μs RMS jitter. We have also shown that with a wavefront sensor camera attached the KNL can process the real-time control loop at up to 966Hz, the maximum frame-rate of the camera, with jitter remaining below 20μs RMS. Future studies will involve exploring the use of a cluster of Xeon Phis for the real-time control of the MCAO and MOAO regimes of AO. We find that the Xeon Phi is highly suitable for ELT AO real time control.

  4. Revisiting Intel Xeon Phi optimization of Thompson cloud microphysics scheme in Weather Research and Forecasting (WRF) model

    NASA Astrophysics Data System (ADS)

    Mielikainen, Jarno; Huang, Bormin; Huang, Allen

    2015-10-01

The Thompson cloud microphysics scheme is a sophisticated cloud microphysics scheme in the Weather Research and Forecasting (WRF) model. The scheme is very suitable for massively parallel computation, as there are no interactions among horizontal grid points. Compared to the earlier microphysics schemes, the Thompson scheme incorporates a large number of improvements. Thus, we have optimized the speed of this important part of WRF. Intel Many Integrated Core (MIC) ushers in a new era of supercomputing speed, performance, and compatibility. It allows developers to run code at trillions of calculations per second using a familiar programming model. In this paper, we present our results of optimizing the Thompson microphysics scheme on Intel Many Integrated Core (MIC) hardware. The Intel Xeon Phi coprocessor is the first product based on the Intel MIC architecture, and it consists of up to 61 cores connected by a high-performance on-die bidirectional interconnect. The coprocessor supports all important Intel development tools, so the development environment is a familiar one to a vast number of CPU developers. However, getting maximum performance out of the MIC requires some novel optimization techniques. New optimizations for an updated Thompson scheme are discussed in this paper. The optimizations improved the performance of the original Thompson code on the Xeon Phi 7120P by a factor of 1.8x. Furthermore, the same optimizations improved the performance of the Thompson scheme on a dual-socket configuration of eight-core Intel Xeon E5-2670 CPUs by a factor of 1.8x compared to the original Thompson code.

  5. Parallel Mutual Information Based Construction of Genome-Scale Networks on the Intel® Xeon Phi™ Coprocessor.

    PubMed

    Misra, Sanchit; Pamnany, Kiran; Aluru, Srinivas

    2015-01-01

    Construction of whole-genome networks from large-scale gene expression data is an important problem in systems biology. While several techniques have been developed, most cannot handle network reconstruction at the whole-genome scale, and the few that can, require large clusters. In this paper, we present a solution on the Intel Xeon Phi coprocessor, taking advantage of its multi-level parallelism including many x86-based cores, multiple threads per core, and vector processing units. We also present a solution on the Intel® Xeon® processor. Our solution is based on TINGe, a fast parallel network reconstruction technique that uses mutual information and permutation testing for assessing statistical significance. We demonstrate the first ever inference of a plant whole genome regulatory network on a single chip by constructing a 15,575 gene network of the plant Arabidopsis thaliana from 3,137 microarray experiments in only 22 minutes. In addition, our optimization for parallelizing mutual information computation on the Intel Xeon Phi coprocessor holds out lessons that are applicable to other domains.
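
    A minimal sketch of histogram-based mutual information with a permutation test for significance, the statistical core that TINGe-style network reconstruction parallelizes. The binning convention and the p-value estimator here are illustrative assumptions, not the TINGe implementation.

    ```cpp
    // Toy mutual-information + permutation-test kernel. Inputs x and y are
    // expression profiles assumed already discretized into bins [0, bins).
    #include <algorithm>
    #include <cmath>
    #include <random>
    #include <vector>

    double mutual_information(const std::vector<int>& x,
                              const std::vector<int>& y, int bins) {
        const double n = static_cast<double>(x.size());
        std::vector<double> cx(bins, 0), cy(bins, 0), cxy(bins * bins, 0);
        for (std::size_t i = 0; i < x.size(); ++i) {
            cx[x[i]] += 1; cy[y[i]] += 1; cxy[x[i] * bins + y[i]] += 1;
        }
        double mi = 0.0;  // MI in nats from joint and marginal counts
        for (int a = 0; a < bins; ++a)
            for (int b = 0; b < bins; ++b)
                if (cxy[a * bins + b] > 0)
                    mi += (cxy[a * bins + b] / n) *
                          std::log((cxy[a * bins + b] * n) / (cx[a] * cy[b]));
        return mi;
    }

    // Significance: fraction of shuffled trials whose MI reaches the observed.
    double permutation_pvalue(std::vector<int> x, const std::vector<int>& y,
                              int bins, int trials, std::mt19937& rng) {
        const double observed = mutual_information(x, y, bins);
        int hits = 0;
        for (int t = 0; t < trials; ++t) {
            std::shuffle(x.begin(), x.end(), rng);  // break the x-y pairing
            if (mutual_information(x, y, bins) >= observed) ++hits;
        }
        return static_cast<double>(hits) / trials;
    }
    ```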

  6. Vectorization for Molecular Dynamics on Intel Xeon Phi Coprocessors

    NASA Astrophysics Data System (ADS)

    Yi, Hongsuk

    2014-03-01

Many modern processors are capable of exploiting data-level parallelism through the use of single instruction multiple data (SIMD) execution. The new Intel Xeon Phi coprocessor supports 512-bit vector registers for high performance computing. In this paper, we have developed a hierarchical parallelization scheme for accelerated molecular dynamics simulations with the Tersoff potential for covalently bonded solid crystals on Intel Xeon Phi coprocessor systems. The scheme exploits multiple levels of parallelism, combining tightly coupled thread-level and task-level parallelism with the 512-bit vector registers. The simulation results show that the parallel performance of the SIMD implementation on the Xeon Phi is clearly superior to that on the x86 CPU architecture.

  7. Intel Xeon Phi accelerated Weather Research and Forecasting (WRF) Goddard microphysics scheme

    NASA Astrophysics Data System (ADS)

    Mielikainen, J.; Huang, B.; Huang, A. H.-L.

    2014-12-01

The Weather Research and Forecasting (WRF) model is a numerical weather prediction system designed to serve both atmospheric research and operational forecasting needs. WRF development is done in collaboration around the globe, and WRF is used by academic atmospheric scientists, weather forecasters at operational centers, and others. WRF contains several physics components, of which the most time-consuming is the microphysics. One microphysics scheme is the Goddard cloud microphysics scheme, a sophisticated cloud microphysics scheme in the WRF model. The Goddard microphysics scheme is very suitable for massively parallel computation, as there are no interactions among horizontal grid points. Compared to the earlier microphysics schemes, the Goddard scheme incorporates a large number of improvements. Thus, we have optimized the Goddard scheme code. In this paper, we present our results of optimizing the Goddard microphysics scheme on Intel Many Integrated Core (MIC) hardware. The Intel Xeon Phi coprocessor is the first product based on the Intel MIC architecture, and it consists of up to 61 cores connected by a high-performance on-die bidirectional interconnect. The Intel MIC is capable of executing a full operating system and entire programs, rather than just kernels as a GPU does. The MIC coprocessor supports all important Intel development tools, so the development environment is one familiar to a vast number of CPU developers. However, getting maximum performance out of the MIC requires some novel optimization techniques, which are discussed in this paper. The results show that the optimizations improved the performance of the Goddard microphysics scheme on the Xeon Phi 7120P by a factor of 4.7x. In addition, the optimizations reduced the Goddard microphysics scheme's share of the total WRF processing time from 20.0% to 7.5%. Furthermore, the same optimizations…

  8. Performance Evaluation of NWChem Ab-Initio Molecular Dynamics (AIMD) Simulations on the Intel® Xeon Phi™ Processor

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Bylaska, Eric J.; Jacquelin, Mathias; De Jong, Wibe A.

    2017-10-20

Ab-initio Molecular Dynamics (AIMD) methods are an important class of algorithms, as they enable scientists to understand the chemistry and dynamics of molecular and condensed phase systems while retaining a first-principles-based description of their interactions. Many-core architectures such as the Intel® Xeon Phi™ processor are an interesting and promising target for these algorithms, as they can provide the computational power that is needed to solve interesting problems in chemistry. In this paper, we describe the efforts of refactoring the existing AIMD plane-wave method of NWChem from an MPI-only implementation to a scalable, hybrid code that employs MPI and OpenMP to exploit the capabilities of current and future many-core architectures. We describe the optimizations required to get close to optimal performance for the multiplication of the tall-and-skinny matrices that form the core of the computational algorithm. We present strong scaling results on the complete AIMD simulation for a test case that simulates 256 water molecules and that strong-scales well on a cluster of 1024 nodes of Intel Xeon Phi processors. We compare the performance obtained with a cluster of dual-socket Intel® Xeon® E5-2698 v3 processors.
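
    The tall-and-skinny products mentioned above have the shape C = A^T B with A and B of size n x k, where n (plane-wave grid points) vastly exceeds k (orbitals). Below is a hedged sketch of one standard threading strategy, parallelizing the long dimension with per-thread partial sums; it is an illustration of the problem shape, not the NWChem kernel.

    ```cpp
    // C = A^T * B for row-major tall-and-skinny A, B (n x k, n >> k).
    // Threads split the long dimension n; each accumulates a private k x k
    // partial product, combined at the end.
    #include <omp.h>
    #include <vector>

    void tall_skinny_atb(const double* A, const double* B, double* C,
                         long n, int k) {
        std::vector<double> acc(static_cast<std::size_t>(k) * k, 0.0);
        #pragma omp parallel
        {
            std::vector<double> local(static_cast<std::size_t>(k) * k, 0.0);
            #pragma omp for nowait
            for (long i = 0; i < n; ++i)          // long dimension: parallel
                for (int a = 0; a < k; ++a) {
                    const double aia = A[i * k + a];
                    for (int b = 0; b < k; ++b)   // short dimensions: serial
                        local[a * k + b] += aia * B[i * k + b];
                }
            #pragma omp critical                  // combine thread partials
            for (int j = 0; j < k * k; ++j) acc[j] += local[j];
        }
        for (int j = 0; j < k * k; ++j) C[j] = acc[j];
    }
    ```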

  9. Performance tuning Weather Research and Forecasting (WRF) Goddard longwave radiative transfer scheme on Intel Xeon Phi

    NASA Astrophysics Data System (ADS)

    Mielikainen, Jarno; Huang, Bormin; Huang, Allen H.

    2015-10-01

The next-generation mesoscale numerical weather prediction system, the Weather Research and Forecasting (WRF) model, is designed for dual use in forecasting and research. WRF offers multiple physics options that can be combined in any way. One of the physics options is radiance computation. The major source of energy for the earth's climate is solar radiation, so it is imperative to accurately model the horizontal and vertical distribution of the heating. The Goddard radiative transfer model includes the absorption due to water vapor, ozone, oxygen, carbon dioxide, clouds and aerosols. The model computes the interactions among the absorption and scattering by clouds, aerosols, molecules and the surface. Finally, fluxes are integrated over the entire longwave spectrum. In this paper, we present our results of optimizing the Goddard longwave radiative transfer scheme on Intel Many Integrated Core (MIC) hardware. The Intel Xeon Phi coprocessor is the first product based on the Intel MIC architecture, and it consists of up to 61 cores connected by a high-performance on-die bidirectional interconnect. The coprocessor supports all important Intel development tools, so the development environment is a familiar one to a vast number of CPU developers. However, getting maximum performance out of the MIC requires some novel optimization techniques, which are discussed in this paper. The optimizations improved the performance of the original Goddard longwave radiative transfer scheme on the Xeon Phi 7120P by a factor of 2.2x. Furthermore, the same optimizations improved the performance of the scheme on a dual-socket configuration of eight-core Intel Xeon E5-2670 CPUs by a factor of 2.1x compared to the original code.

  10. Evaluation of the Intel Xeon Phi Co-processor to accelerate the sensitivity map calculation for PET imaging

    NASA Astrophysics Data System (ADS)

    Dey, T.; Rodrigue, P.

    2015-07-01

We aim to evaluate the Intel Xeon Phi coprocessor for the acceleration of 3D Positron Emission Tomography (PET) image reconstruction. We focus on the sensitivity map calculation as one computationally intensive part of PET image reconstruction, since it is a promising candidate for acceleration with the Many Integrated Core (MIC) architecture of the Xeon Phi. The computation of the voxels in the field of view (FoV) can be done in parallel, and the 10³ to 10⁴ samples needed to calculate the detection probability of each voxel can take advantage of vectorization. We use the ray tracing kernels of the Embree project to calculate the hit points of the sample rays with the detector; in a second step, the sum of the radiological path, taking attenuation into account, is determined. The core components are implemented using the Intel SPMD Program Compiler (ISPC) to enable a portable implementation with efficient vectorization on both the Xeon Phi and the host platform. On the Xeon Phi, the calculation of the radiological path is also implemented in hardware-specific intrinsic instructions (so-called 'intrinsics') to allow manually optimized vectorization. For parallelization, both OpenMP and ISPC tasking (based on pthreads) are evaluated. Our implementation achieved a scalability factor of 0.90 on the Xeon Phi coprocessor (model 5110P) with 60 cores at 1 GHz. Only minor differences were found between parallelization with OpenMP and the ISPC tasking feature. The implementation using intrinsics was found to be about 12% faster than the portable ISPC version. With this version, a speedup of 1.43 was achieved on the Xeon Phi coprocessor compared to the host system (HP SL250s Gen8) equipped with two Xeon (E5-2670) CPUs, each with 8 cores at 2.6 to 3.3 GHz. Using a second Xeon Phi card, the speedup could be further increased to 2.77. No significant differences were found between the results of the different Xeon Phi and host implementations. The examination…

  11. GW Calculations of Materials on the Intel Xeon-Phi Architecture

    NASA Astrophysics Data System (ADS)

    Deslippe, Jack; da Jornada, Felipe H.; Vigil-Fowler, Derek; Biller, Ariel; Chelikowsky, James R.; Louie, Steven G.

Intel Xeon-Phi processors are expected to power a large number of High-Performance Computing (HPC) systems around the United States and the world in the near future. We evaluate the ability of GW and prerequisite Density Functional Theory (DFT) calculations for materials to utilize the Xeon-Phi architecture. We describe the optimization process and the performance improvements achieved. We find that the GW method, like other higher-level many-body methods beyond standard local/semilocal approximations to Kohn-Sham DFT, is particularly well suited for many-core architectures due to the ability to exploit a large amount of parallelism over plane-waves, band-pairs and frequencies. Support provided by the SciDAC program, Department of Energy, Office of Science, Advanced Scientific Computing Research and Basic Energy Sciences. Grant Numbers DE-SC0008877 (Austin) and DE-AC02-05CH11231 (LBNL).

  12. Time-efficient simulations of tight-binding electronic structures with Intel Xeon Phi™ many-core processors

    NASA Astrophysics Data System (ADS)

    Ryu, Hoon; Jeong, Yosang; Kang, Ji-Hoon; Cho, Kyu Nam

    2016-12-01

Modelling of multi-million-atom semiconductor structures is important, as it not only predicts the properties of physically realizable novel materials but can also accelerate advanced device designs. This work elaborates a new Technology Computer-Aided Design (TCAD) tool for nanoelectronics modelling, which uses a sp3d5s* tight-binding approach to describe multi-million-atom structures and simulates their electronic structures with high performance computing (HPC), including atomic effects such as alloy and dopant disorders. Named the Quantum simulation tool for Advanced Nanoscale Devices (Q-AND), the tool shows good scalability on traditional multi-core HPC clusters, implying a strong capability for large-scale electronic structure simulations, with particularly remarkable performance enhancement on the latest clusters of Intel Xeon Phi™ coprocessors. A review of a recent modelling study conducted to understand an experimental work on highly phosphorus-doped silicon nanowires is presented to demonstrate the utility of Q-AND. Having been developed via an Intel Parallel Computing Center project, Q-AND will be opened to the public to establish a sound framework for nanoelectronics modelling with advanced many-core HPC clusters. With details of the development methodology and an exemplary study of dopant electronics, this work presents a practical guideline for TCAD development to researchers in the field of computational nanoelectronics.

  13. Using Intel Xeon Phi to accelerate the WRF TEMF planetary boundary layer scheme

    NASA Astrophysics Data System (ADS)

    Mielikainen, Jarno; Huang, Bormin; Huang, Allen

    2014-05-01

The Weather Research and Forecasting (WRF) model is designed for numerical weather prediction and atmospheric research. The WRF software infrastructure consists of several components, such as dynamic solvers and physics schemes. Numerical models are used to resolve the large-scale flow, while subgrid-scale parameterizations estimate small-scale properties (e.g., boundary layer turbulence and convection, clouds, radiation), which have a significant influence on the resolved scale due to the complex nonlinear nature of the atmosphere. For the cloudy planetary boundary layer (PBL), it is fundamental to parameterize vertical turbulent fluxes and subgrid-scale condensation in a realistic manner. A parameterization based on the Total Energy - Mass Flux (TEMF) approach, which unifies the turbulence and moist convection components, produces better results than the other PBL schemes. For that reason, the TEMF scheme was chosen as the PBL scheme to optimize for Intel Many Integrated Core (MIC), which ushers in a new era of supercomputing speed, performance, and compatibility, allowing developers to run code at trillions of calculations per second using a familiar programming model. In this paper, we present our optimization results for the TEMF planetary boundary layer scheme. The optimizations performed were quite generic in nature: the code was vectorized to utilize the vector units inside each CPU, and memory access was improved by scalarizing some of the intermediate arrays, as illustrated below. The results show that the optimizations improved MIC performance by 14.8x. Furthermore, the optimizations increased CPU performance by 2.6x compared to the original multi-threaded code on a quad-core Intel Xeon E5-2603 running at 1.8 GHz. Compared to the optimized code running on a single CPU socket, the optimized MIC code is 6.2x faster.
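
    The scalarization optimization mentioned above can be illustrated as follows: an intermediate array that only carries a value within one loop iteration is replaced by a scalar that stays in a register, removing memory traffic. The variable names are hypothetical, not from the TEMF source.

    ```cpp
    // Before/after illustration of scalarizing an intermediate array.
    #include <cmath>

    void column_before(const float* in, float* out, int nz, float* tmp) {
        for (int k = 0; k < nz; ++k) tmp[k] = std::sqrt(in[k]);  // stores to memory
        for (int k = 0; k < nz; ++k) out[k] = 0.5f * tmp[k];     // reloads it
    }

    void column_after(const float* in, float* out, int nz) {
        for (int k = 0; k < nz; ++k) {
            const float t = std::sqrt(in[k]);  // lives in a register instead
            out[k] = 0.5f * t;
        }
    }
    ```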

  14. Optimizing the updated Goddard shortwave radiation Weather Research and Forecasting (WRF) scheme for Intel Many Integrated Core (MIC) architecture

    NASA Astrophysics Data System (ADS)

    Mielikainen, Jarno; Huang, Bormin; Huang, Allen H.-L.

    2015-05-01

Intel Many Integrated Core (MIC) ushers in a new era of supercomputing speed, performance, and compatibility. It allows developers to run code at trillions of calculations per second using a familiar programming model. In this paper, we present our results of optimizing the updated Goddard shortwave radiation Weather Research and Forecasting (WRF) scheme on Intel Many Integrated Core (MIC) hardware. The Intel Xeon Phi coprocessor is the first product based on the Intel MIC architecture, and it consists of up to 61 cores connected by a high-performance on-die bidirectional interconnect. The coprocessor supports all important Intel development tools, so the development environment is a familiar one to a vast number of CPU developers. However, getting maximum performance out of the Xeon Phi requires some novel optimization techniques, which are discussed in this paper. The results show that the optimizations improved the performance of the original code on the Xeon Phi 7120P by a factor of 1.3x.

  15. Extension of the AMBER molecular dynamics software to Intel's Many Integrated Core (MIC) architecture

    NASA Astrophysics Data System (ADS)

    Needham, Perri J.; Bhuiyan, Ashraf; Walker, Ross C.

    2016-04-01

We present an implementation of explicit solvent particle mesh Ewald (PME) classical molecular dynamics (MD) within the PMEMD molecular dynamics engine, which forms part of the AMBER v14 MD software package, that makes use of Intel Xeon Phi coprocessors by offloading portions of the PME direct summation and neighbor list build to the coprocessor. We refer to this implementation as pmemd MIC offload and in this paper present the technical details of the algorithm, including basic models for MPI and OpenMP configuration, and analyze the resulting performance. The algorithm provides the best performance improvement for large systems (>400,000 atoms), achieving a ~35% performance improvement for satellite tobacco mosaic virus (1,067,095 atoms) when two Intel E5-2697 v2 processors (2 × 12 cores, 30M cache, 2.7 GHz) are coupled to an Intel Xeon Phi coprocessor (Model 7120P, 1.238/1.333 GHz, 61 cores). The implementation utilizes a two-fold decomposition strategy: spatial decomposition using an MPI library and thread-based decomposition using OpenMP. We also present compiler optimization settings that improve performance on Intel Xeon processors, while retaining simulation accuracy.
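
    A hedged sketch of the offload pattern described above, written with the Intel compiler's legacy offload pragmas (it compiles only with that toolchain and a physical coprocessor). The kernel body and the fixed host/coprocessor work split are placeholders, not pmemd's actual decomposition, which additionally overlaps host and coprocessor work asynchronously.

    ```cpp
    // Legacy Intel offload sketch: ship part of a direct-space sum to the
    // Xeon Phi, compute the remainder on the host.
    #include <cstddef>

    // Compiled for both host and coprocessor.
    __attribute__((target(mic)))
    void direct_sum_range(const double* x, double* f,
                          std::size_t lo, std::size_t hi) {
        for (std::size_t i = lo; i < hi; ++i) f[i] = 0.5 * x[i];  // stand-in kernel
    }

    void compute_direct_sum(const double* x, double* f, std::size_t n,
                            std::size_t split /* atoms given to the MIC */) {
        // 'in'/'out' clauses manage the PCIe transfers; pmemd overlaps this
        // with host work via asynchronous signal/wait clauses, omitted here.
        #pragma offload target(mic:0) in(x : length(n)) out(f : length(split))
        direct_sum_range(x, f, 0, split);

        direct_sum_range(x, f, split, n);  // remainder on the host CPU
    }
    ```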

  16. Cognitive Medical Wireless Testbed System (COMWITS)

    DTIC Science & Technology

    2016-11-01

Scientific Progress: This testbed merges two ARO grants. The hardware includes an Intel Xeon Processor E5-1650 v3 (6C, 3.5 GHz, Turbo, HT, 15M, 140W), an Intel Core i7-3770 (3.4 GHz Quad Core, 77W), and dual Intel Xeon …

  17. Evaluating the transport layer of the ALFA framework for the Intel® Xeon Phi™ Coprocessor

    NASA Astrophysics Data System (ADS)

    Santogidis, Aram; Hirstius, Andreas; Lalis, Spyros

    2015-12-01

The ALFA framework supports the software development of major High Energy Physics experiments. As part of our research effort to optimize the transport layer of ALFA, we focus on profiling its data transfer performance for inter-node communication on the Intel Xeon Phi Coprocessor. In this article we present the collected performance measurements together with the related analysis of the results. The optimization opportunities discovered help us formulate future plans for enabling high-performance data transfer for ALFA on the Intel Xeon Phi architecture.

  18. Optimizing Performance of Combustion Chemistry Solvers on Intel's Many Integrated Core (MIC) Architectures

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sitaraman, Hariswaran; Grout, Ray W

This work investigates novel algorithm designs and optimization techniques for restructuring chemistry integrators in zero- and multi-dimensional combustion solvers, which can then be effectively used on the emerging generation of Intel's Many Integrated Core/Xeon Phi processors. These processors offer increased computing performance via a large number of lightweight cores at relatively lower clock speeds compared to the traditional processors (e.g., Intel Sandy Bridge/Ivy Bridge) used in current supercomputers. This style of processor can be productively used for chemistry integrators, which form a costly part of computational combustion codes, in spite of the relatively lower clock speeds. Performance commensurate with traditional processors is achieved here through the combination of careful memory layout, exposing multiple levels of fine-grain parallelism, and extensive use of vendor-supported libraries (Cilk Plus and the Math Kernel Library). Important optimization techniques for efficient memory usage and vectorization have been identified and quantified. These optimizations resulted in a speed-up by a factor of ~3 using the Intel 2013 compiler and ~1.5 using the Intel 2017 compiler for large chemical mechanisms, compared to the unoptimized version on the Intel Xeon Phi. The strategies, especially with respect to memory usage and vectorization, should also be beneficial for general-purpose computational fluid dynamics codes.
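
    The careful-memory-layout point above is commonly realized by converting array-of-structures (AoS) state into structure-of-arrays (SoA), so that the rate loops get unit-stride, vectorizable access. A small illustration under assumed names follows; the toy rate law is ours, not the authors' solver code.

    ```cpp
    // AoS vs. SoA layout for a batch of grid cells' chemical state.
    #include <cstddef>
    #include <vector>

    struct CellAoS { double T, Y0, Y1, Y2; };  // one cell, fields interleaved

    struct StateSoA {                           // each field contiguous instead
        std::vector<double> T, Y0, Y1, Y2;
    };

    void rates_aos(const std::vector<CellAoS>& c, std::vector<double>& w, double k) {
        for (std::size_t i = 0; i < c.size(); ++i)
            w[i] = k * c[i].Y0 * c[i].Y1;       // stride-4 loads hinder vector units
    }

    void rates_soa(const StateSoA& s, std::vector<double>& w, double k) {
        const std::size_t n = s.Y0.size();
        #pragma omp simd
        for (std::size_t i = 0; i < n; ++i)     // unit stride in every stream
            w[i] = k * s.Y0[i] * s.Y1[i];       // toy bimolecular rate
    }
    ```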

  19. Application of Intel Many Integrated Core (MIC) accelerators to the Pleim-Xiu land surface scheme

    NASA Astrophysics Data System (ADS)

    Huang, Melin; Huang, Bormin; Huang, Allen H.

    2015-10-01

The land-surface model (LSM) is one physics process in the Weather Research and Forecasting (WRF) model. The LSM combines atmospheric information from the surface layer scheme, radiative forcing from the radiation scheme, and precipitation forcing from the microphysics and convective schemes with internal information on the land's state variables and land-surface properties, in order to provide heat and moisture fluxes over land points and sea-ice points. The Pleim-Xiu (PX) scheme is one such LSM. The PX LSM features three pathways for moisture fluxes: evapotranspiration, soil evaporation, and evaporation from wet canopies. To accelerate this scheme, we employ the Intel Xeon Phi Many Integrated Core (MIC) architecture, whose efficient parallelization and vectorization are essential for this purpose. Our results show that the MIC-based optimization of this scheme running on a Xeon Phi coprocessor 7120P improves performance by 2.3x and 11.7x compared to the original code running on one CPU socket (eight cores) and on one CPU core of an Intel Xeon E5-2670, respectively.

  20. Optimizing the Betts-Miller-Janjic cumulus parameterization with Intel Many Integrated Core (MIC) architecture

    NASA Astrophysics Data System (ADS)

    Huang, Melin; Huang, Bormin; Huang, Allen H.-L.

    2015-10-01

The schemes of cumulus parameterization are responsible for the sub-grid-scale effects of convective and/or shallow clouds, and are intended to represent vertical fluxes due to unresolved updrafts and downdrafts and compensating motion outside the clouds. Some schemes additionally provide cloud and precipitation field tendencies in the convective column, and momentum tendencies due to convective transport of momentum. The schemes all provide the convective component of surface rainfall. Betts-Miller-Janjic (BMJ) is one scheme that fulfills such purposes in the Weather Research and Forecasting (WRF) model. The National Centers for Environmental Prediction (NCEP) has worked to optimize the BMJ scheme for operational application. As there are no interactions among horizontal grid points, this scheme is very suitable for parallel computation. The Intel Xeon Phi Many Integrated Core (MIC) architecture, with its efficient parallelization and vectorization capabilities, allows us to optimize the BMJ scheme. Compared to the original code running on one CPU socket (eight cores) and on one CPU core of an Intel Xeon E5-2670, the MIC-based optimization of this scheme running on a Xeon Phi coprocessor 7120P improves performance by 2.4x and 17.0x, respectively.

  21. Initial results on computational performance of Intel Many Integrated Core (MIC) architecture: implementation of the Weather Research and Forecasting (WRF) Purdue-Lin microphysics scheme

    NASA Astrophysics Data System (ADS)

    Mielikainen, Jarno; Huang, Bormin; Huang, Allen H.

    2014-10-01

The Purdue-Lin scheme is a relatively sophisticated microphysics scheme in the Weather Research and Forecasting (WRF) model. The scheme includes six classes of hydrometeors: water vapor, cloud water, rain, cloud ice, snow and graupel. The scheme is very suitable for massively parallel computation, as there are no interactions among horizontal grid points. In this paper, we accelerate the Purdue-Lin scheme using Intel Many Integrated Core (MIC) hardware. The Intel Xeon Phi is a high-performance coprocessor consisting of up to 61 cores, connected to a CPU via the PCI Express (PCIe) bus. We discuss in detail the code optimization issues encountered while tuning the Purdue-Lin microphysics Fortran code for the Xeon Phi. In particular, getting good performance required utilizing multiple cores, the wide vector operations, and efficient use of memory. The results show that the optimizations improved the performance of the original code on the Xeon Phi 5110P by a factor of 4.2x. Furthermore, the same optimizations improved performance on an Intel Xeon E5-2603 CPU by a factor of 1.2x compared to the original code.

  22. Heterogeneous High Throughput Scientific Computing with APM X-Gene and Intel Xeon Phi

    NASA Astrophysics Data System (ADS)

    Abdurachmanov, David; Bockelman, Brian; Elmer, Peter; Eulisse, Giulio; Knight, Robert; Muzaffar, Shahzad

    2015-05-01

Electrical power requirements will be a constraint on the future growth of Distributed High Throughput Computing (DHTC) as used by High Energy Physics. Performance-per-watt is a critical metric for the evaluation of computer architectures for cost-efficient computing. Additionally, future performance growth will come from heterogeneous, many-core, and high computing density platforms with specialized processors. In this paper, we examine the Intel Xeon Phi Many Integrated Cores (MIC) co-processor and Applied Micro X-Gene ARMv8 64-bit low-power server system-on-a-chip (SoC) solutions for scientific computing applications. We report our experience on software porting, performance and energy efficiency and evaluate the potential for use of such technologies in the context of distributed computing systems such as the Worldwide LHC Computing Grid (WLCG).

  23. HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi

    DOE PAGES

    Dongarra, Jack; Gates, Mark; Haidar, Azzam; ...

    2015-01-01

This paper presents the design and implementation of several fundamental dense linear algebra (DLA) algorithms for multicore systems with Intel Xeon Phi coprocessors. In particular, we consider algorithms for solving linear systems. Further, we give an overview of the MAGMA MIC library, an open-source, high-performance library that incorporates the developments presented here and, more broadly, provides the DLA functionality equivalent to that of the popular LAPACK library while targeting heterogeneous architectures that feature a mix of multicore CPUs and coprocessors. The LAPACK compliance simplifies the use of the MAGMA MIC library in applications, while providing them with portably performant DLA. High performance is obtained through the use of the high-performance BLAS, hardware-specific tuning, and a hybridization methodology whereby we split the algorithm into computational tasks of various granularities. Execution of those tasks is properly scheduled over the heterogeneous hardware by minimizing data movements and mapping algorithmic requirements to the architectural strengths of the various heterogeneous hardware components. Our methodology and programming techniques are incorporated into the MAGMA MIC API, which abstracts the application developer from the specifics of the Xeon Phi architecture and is therefore applicable to algorithms beyond the scope of DLA.

  24. Heterogeneous high throughput scientific computing with APM X-Gene and Intel Xeon Phi

    DOE PAGES

    Abdurachmanov, David; Bockelman, Brian; Elmer, Peter; ...

    2015-05-22

Electrical power requirements will be a constraint on the future growth of Distributed High Throughput Computing (DHTC) as used by High Energy Physics. Performance-per-watt is a critical metric for the evaluation of computer architectures for cost-efficient computing. Additionally, future performance growth will come from heterogeneous, many-core, and high computing density platforms with specialized processors. In this paper, we examine the Intel Xeon Phi Many Integrated Cores (MIC) co-processor and Applied Micro X-Gene ARMv8 64-bit low-power server system-on-a-chip (SoC) solutions for scientific computing applications. As a result, we report our experience on software porting, performance and energy efficiency and evaluate the potential for use of such technologies in the context of distributed computing systems such as the Worldwide LHC Computing Grid (WLCG).

  25. Optimizing meridional advection of the Advanced Research WRF (ARW) dynamics for Intel Xeon Phi coprocessor

    NASA Astrophysics Data System (ADS)

    Mielikainen, Jarno; Huang, Bormin; Huang, Allen H.-L.

    2015-05-01

The most widely used community weather forecast and research model in the world is the Weather Research and Forecasting (WRF) model. Two distinct varieties of WRF exist. The one we are interested in, the Advanced Research WRF (ARW), is an experimental, advanced research version featuring very high resolution. The WRF Nonhydrostatic Mesoscale Model (WRF-NMM) has been designed for forecasting operations. WRF consists of dynamics code and several physics modules. The WRF-ARW core is based on an Eulerian solver for the fully compressible nonhydrostatic equations. In this paper, we optimize a meridional (north-south direction) advection subroutine for the Intel Xeon Phi coprocessor. Advection is one of the most time-consuming routines in the ARW dynamics core. It advances the explicit perturbation horizontal momentum equations by adding in the large-timestep tendency along with the small-timestep pressure gradient tendency. We describe the challenges we met during the development of a high-speed dynamics code subroutine for the MIC architecture, and discuss the lessons learned from the code optimization process. The results show that the optimizations improved the performance of the original code on the Xeon Phi 7120P by a factor of 1.2x.

  26. Efficient irregular wavefront propagation algorithms on Intel® Xeon Phi™.

    PubMed

    Gomes, Jeremias M; Teodoro, George; de Melo, Alba; Kong, Jun; Kurc, Tahsin; Saltz, Joel H

    2015-10-01

We investigate the execution of the Irregular Wavefront Propagation Pattern (IWPP), a fundamental computing structure used in several image analysis operations, on the Intel® Xeon Phi™ co-processor. An efficient implementation of IWPP on the Xeon Phi is a challenging problem because of IWPP's irregularity and the use of atomic instructions in the original IWPP algorithm to resolve race conditions. On the Xeon Phi, the use of SIMD and vectorization instructions is critical to attain high performance. However, SIMD atomic instructions are not supported. Therefore, we propose a new IWPP algorithm that can take advantage of the supported SIMD instruction set. We also evaluate an alternate storage container (priority queue) to track active elements in the wavefront in an effort to improve the parallel algorithm efficiency. The new IWPP algorithm is evaluated with Morphological Reconstruction and Imfill operations as use cases. Our results show performance improvements of up to 5.63× on top of the original IWPP due to vectorization. Moreover, the new IWPP achieves speedups of 45.7× and 1.62×, respectively, as compared to efficient CPU and GPU implementations.
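
    For reference, the scalar queue-based IWPP that the paper reformulates can be sketched in one dimension as below (morphological reconstruction by dilation). This toy version exhibits exactly the serial dependencies and scattered updates that make a SIMD reformulation nontrivial; it is illustrative, not the paper's algorithm.

    ```cpp
    // Toy 1-D irregular wavefront propagation: active pixels propagate
    // values to neighbors, and changed neighbors rejoin the wavefront.
    #include <algorithm>
    #include <queue>
    #include <vector>

    void reconstruct_1d(std::vector<int>& marker, const std::vector<int>& mask) {
        // Assumes marker[i] <= mask[i] everywhere (reconstruction by dilation).
        std::queue<std::size_t> wave;
        for (std::size_t i = 0; i < marker.size(); ++i) wave.push(i);
        const long offs[2] = {-1, 1};  // 1-D neighborhood
        while (!wave.empty()) {
            const std::size_t i = wave.front(); wave.pop();
            for (long d : offs) {
                const long j = static_cast<long>(i) + d;
                if (j < 0 || j >= static_cast<long>(marker.size())) continue;
                // Propagate the marker value, clamped by the mask.
                const int cand = std::min(marker[i], mask[j]);
                if (cand > marker[j]) {            // neighbor grew: re-queue it
                    marker[j] = cand;
                    wave.push(static_cast<std::size_t>(j));
                }
            }
        }
    }
    ```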

  27. Efficient irregular wavefront propagation algorithms on Intel® Xeon Phi™

    PubMed Central

    Gomes, Jeremias M.; Teodoro, George; de Melo, Alba; Kong, Jun; Kurc, Tahsin; Saltz, Joel H.

    2016-01-01

    We investigate the execution of the Irregular Wavefront Propagation Pattern (IWPP), a fundamental computing structure used in several image analysis operations, on the Intel® Xeon Phi™ co-processor. An efficient implementation of IWPP on the Xeon Phi is a challenging problem because of IWPP’s irregularity and the use of atomic instructions in the original IWPP algorithm to resolve race conditions. On the Xeon Phi, the use of SIMD and vectorization instructions is critical to attain high performance. However, SIMD atomic instructions are not supported. Therefore, we propose a new IWPP algorithm that can take advantage of the supported SIMD instruction set. We also evaluate an alternate storage container (priority queue) to track active elements in the wavefront in an effort to improve the parallel algorithm efficiency. The new IWPP algorithm is evaluated with Morphological Reconstruction and Imfill operations as use cases. Our results show performance improvements of up to 5.63× on top of the original IWPP due to vectorization. Moreover, the new IWPP achieves speedups of 45.7× and 1.62×, respectively, as compared to efficient CPU and GPU implementations. PMID:27298591

  28. Particle-in-Cell laser-plasma simulation on Xeon Phi coprocessors

    NASA Astrophysics Data System (ADS)

    Surmin, I. A.; Bastrakov, S. I.; Efimenko, E. S.; Gonoskov, A. A.; Korzhimanov, A. V.; Meyerov, I. B.

    2016-05-01

This paper concerns the development of a high-performance implementation of the Particle-in-Cell method for plasma simulation on Intel Xeon Phi coprocessors. We discuss the suitability of the method for the Xeon Phi architecture and present our experience in the porting and optimization of the existing parallel Particle-in-Cell code PICADOR. Direct porting without code modification gives performance on the Xeon Phi close to that of an 8-core CPU on a benchmark problem with 50 particles per cell. We demonstrate step-by-step optimization techniques, such as improving data locality, enhancing parallelization efficiency and vectorization, leading to an overall 4.2× speedup on the CPU and 7.5× on the Xeon Phi compared to the baseline version. The optimized version achieves 16.9 ns per particle update on an Intel Xeon E5-2660 CPU and 9.3 ns per particle update on an Intel Xeon Phi 5110P. For a real problem of laser ion acceleration in targets with surface grating, where a large number of macroparticles per cell is required, the speedup of the Xeon Phi compared to the CPU is 1.6×.

  29. Does the Intel Xeon Phi processor fit HEP workloads?

    NASA Astrophysics Data System (ADS)

    Nowak, A.; Bitzes, G.; Dotti, A.; Lazzaro, A.; Jarp, S.; Szostek, P.; Valsan, L.; Botezatu, M.; Leduc, J.

    2014-06-01

This paper summarizes five years of CERN openlab's efforts focused on the Intel Xeon Phi co-processor, from the time of its inception to its public release. We consider the architecture of the device vis-à-vis the characteristics of HEP software and identify key opportunities for HEP processing, as well as scaling limitations. We report on improvements and speedups linked to parallelization and vectorization on benchmarks involving software frameworks such as Geant4 and ROOT. Finally, we extrapolate current software and hardware trends and project them onto accelerators of the future, with the specifics of offline and online HEP processing in mind.

  30. Evaluation of the Intel Xeon Phi 7120 and NVIDIA K80 as accelerators for two-dimensional panel codes

    PubMed Central

    2017-01-01

    To optimize the geometry of airfoils for a specific application is an important engineering problem. In this context genetic algorithms have enjoyed some success as they are able to explore the search space without getting stuck in local optima. However, these algorithms require the computation of aerodynamic properties for a significant number of airfoil geometries. Consequently, for low-speed aerodynamics, panel methods are most often used as the inner solver. In this paper we evaluate the performance of such an optimization algorithm on modern accelerators (more specifically, the Intel Xeon Phi 7120 and the NVIDIA K80). For that purpose, we have implemented an optimized version of the algorithm on the CPU and Xeon Phi (based on OpenMP, vectorization, and the Intel MKL library) and on the GPU (based on CUDA and the MAGMA library). We present timing results for all codes and discuss the similarities and differences between the three implementations. Overall, we observe a speedup of approximately 2.5 for adding an Intel Xeon Phi 7120 to a dual socket workstation and a speedup between 3.4 and 3.8 for adding a NVIDIA K80 to a dual socket workstation. PMID:28582389

  31. Evaluation of the Intel Xeon Phi 7120 and NVIDIA K80 as accelerators for two-dimensional panel codes.

    PubMed

    Einkemmer, Lukas

    2017-01-01

    To optimize the geometry of airfoils for a specific application is an important engineering problem. In this context genetic algorithms have enjoyed some success as they are able to explore the search space without getting stuck in local optima. However, these algorithms require the computation of aerodynamic properties for a significant number of airfoil geometries. Consequently, for low-speed aerodynamics, panel methods are most often used as the inner solver. In this paper we evaluate the performance of such an optimization algorithm on modern accelerators (more specifically, the Intel Xeon Phi 7120 and the NVIDIA K80). For that purpose, we have implemented an optimized version of the algorithm on the CPU and Xeon Phi (based on OpenMP, vectorization, and the Intel MKL library) and on the GPU (based on CUDA and the MAGMA library). We present timing results for all codes and discuss the similarities and differences between the three implementations. Overall, we observe a speedup of approximately 2.5 for adding an Intel Xeon Phi 7120 to a dual socket workstation and a speedup between 3.4 and 3.8 for adding a NVIDIA K80 to a dual socket workstation.

  32. Intel Many Integrated Core (MIC) architecture optimization strategies for a memory-bound Weather Research and Forecasting (WRF) Goddard microphysics scheme

    NASA Astrophysics Data System (ADS)

    Mielikainen, Jarno; Huang, Bormin; Huang, Allen H.

    2014-10-01

The Goddard cloud microphysics scheme is a sophisticated cloud microphysics scheme in the Weather Research and Forecasting (WRF) model. WRF is a widely used weather prediction system, and its development is done in collaboration around the globe. The Goddard microphysics scheme is very suitable for massively parallel computation, as there are no interactions among horizontal grid points. Compared to the earlier microphysics schemes, the Goddard scheme incorporates a large number of improvements. Thus, we have optimized the code of this important part of WRF. In this paper, we present our results of optimizing the Goddard microphysics scheme on Intel Many Integrated Core (MIC) hardware. The Intel Xeon Phi coprocessor is the first product based on the Intel MIC architecture, and it consists of up to 61 cores connected by a high-performance on-die bidirectional interconnect. The Intel MIC is capable of executing a full operating system and entire programs, rather than just kernels as GPUs do. The MIC coprocessor supports all important Intel development tools, so the development environment is a familiar one to a vast number of CPU developers. However, getting maximum performance out of the MIC requires some novel optimization techniques, which are discussed in this paper. The results show that the optimizations improved the performance of the original code on the Xeon Phi 7120P by a factor of 4.7x. Furthermore, the same optimizations improved performance on a dual-socket Intel Xeon E5-2670 system by a factor of 2.8x compared to the original code.

  33. An efficient MPI/OpenMP parallelization of the Hartree–Fock–Roothaan method for the first generation of Intel® Xeon Phi™ processor architecture

    DOE PAGES

    Mironov, Vladimir; Moskovsky, Alexander; D’Mello, Michael; ...

    2017-10-04

The Hartree-Fock (HF) method in the quantum chemistry package GAMESS represents one of the most irregular algorithms in computation today. Major steps in the calculation are the irregular computation of electron repulsion integrals (ERIs) and the building of the Fock matrix. These are the central components of the main Self-Consistent Field (SCF) loop, the key hotspot in Electronic Structure (ES) codes. By threading the MPI ranks in the official release of the GAMESS code, we not only speed up the main SCF loop (4x to 6x for large systems), but also achieve a significant (>2x) reduction in the overall memory footprint. These improvements are a direct consequence of memory access optimizations within the MPI ranks. We benchmark our implementation against the official release of the GAMESS code on an Intel® Xeon Phi™ supercomputer. Scaling numbers are reported on up to 7,680 cores on Intel Xeon Phi coprocessors.

  34. GNAQPMS v1.1: accelerating the Global Nested Air Quality Prediction Modeling System (GNAQPMS) on Intel Xeon Phi processors

    NASA Astrophysics Data System (ADS)

    Wang, Hui; Chen, Huansheng; Wu, Qizhong; Lin, Junmin; Chen, Xueshun; Xie, Xinwei; Wang, Rongrong; Tang, Xiao; Wang, Zifa

    2017-08-01

The Global Nested Air Quality Prediction Modeling System (GNAQPMS) is the global version of the Nested Air Quality Prediction Modeling System (NAQPMS), a multi-scale chemical transport model used for air quality forecasts and atmospheric environmental research. In this study, we present the porting and optimisation of GNAQPMS on a second-generation Intel Xeon Phi processor, codenamed Knights Landing (KNL). Compared with the first-generation Xeon Phi coprocessor (codenamed Knights Corner, KNC), KNL has many new hardware features, such as a bootable processor, high-performance in-package memory and ISA compatibility with Intel Xeon processors. In particular, we describe the five optimisations we applied to the key modules of GNAQPMS, including the CBM-Z gas-phase chemistry, advection, convection and wet deposition modules. These optimisations work well on both the KNL 7250 processor and the Intel Xeon E5-2697 v4 processor. They include (1) updating the pure Message Passing Interface (MPI) parallel mode to the hybrid parallel mode with MPI and OpenMP in the emission, advection, convection and gas-phase chemistry modules; (2) fully employing the 512-bit-wide vector processing units (VPUs) on the KNL platform; (3) reducing unnecessary memory access to improve cache efficiency; (4) reducing the thread local storage (TLS) in the CBM-Z gas-phase chemistry module to improve its OpenMP performance; and (5) changing the global communication from writing/reading interface files to MPI functions to improve the performance and the parallel scalability. These optimisations greatly improved the GNAQPMS performance, and the same optimisations also work well for the Intel Xeon Broadwell processor, specifically the E5-2697 v4. Compared with the baseline version of GNAQPMS, the optimised version was 3.51× faster on KNL and 2.77× faster on the CPU. Moreover, the optimised version ran at 26% lower average power on KNL than on the CPU. With the combined performance and energy…
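
    Optimisation (1) above, moving from pure MPI to hybrid MPI + OpenMP, follows the usual pattern of fewer ranks with threads filling each node's cores. A minimal self-contained sketch follows; the loop body is a stand-in, not GNAQPMS code.

    ```cpp
    // Hybrid MPI + OpenMP skeleton: threads parallelize work inside each
    // rank, and MPI communication stays between ranks only.
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        int provided = 0;  // request thread support alongside MPI
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local = 0.0;
        #pragma omp parallel for reduction(+:local)  // threads within the rank
        for (int i = 0; i < 1000000; ++i)
            local += 1e-6;                           // stand-in grid work

        double total = 0.0;
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) std::printf("total = %f\n", total);
        MPI_Finalize();
        return 0;
    }
    ```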

  35. LTE-Enhanced Cognitive Radio Network Testbed (LTE-CORNET)

    DTIC Science & Technology

    2016-11-01

Testbed hardware includes an Intel Core i7-3770 (3.4 GHz Quad Core, 77W) and dual Intel Xeon E5-2695 v4 processors (18C, 2.1 GHz, 3.3 GHz Turbo, 2400 MHz, 45 MB, 120W).

  36. Acceleration of Cherenkov angle reconstruction with the new Intel Xeon/FPGA compute platform for the particle identification in the LHCb Upgrade

    NASA Astrophysics Data System (ADS)

    Faerber, Christian

    2017-10-01

The LHCb experiment at the LHC will upgrade its detector by 2018/2019 to a 'triggerless' readout scheme, where all the readout electronics and several sub-detector parts will be replaced. The new readout electronics will be able to read out the detector at 40 MHz. This increases the data bandwidth from the detector down to the Event Filter farm to 40 Tbit/s, which also has to be processed to select the interesting proton-proton collisions for later storage. The architecture of a computing farm that can process this amount of data as efficiently as possible is a challenging task, and several compute accelerator technologies are being considered for use inside the new Event Filter farm. In the high performance computing sector, more and more FPGA compute accelerators are used to improve compute performance and reduce power consumption (e.g., in the Microsoft Catapult project and the Bing search engine). For the LHCb upgrade, the use of an experimental FPGA-accelerated computing platform in the Event Building or in the Event Filter farm is likewise being considered and therefore tested. This platform from Intel hosts a general CPU and a high-performance FPGA linked via a high-speed link, which for this platform is a QPI link; an accelerator is implemented on the FPGA. The system used is a two-socket platform from Intel with a Xeon CPU and an FPGA. The FPGA has cache-coherent memory access to the main memory of the server and can collaborate with the CPU. As a first step, a computing-intensive algorithm to reconstruct Cherenkov angles for the LHCb RICH particle identification was successfully ported in Verilog to the Intel Xeon/FPGA platform and accelerated by a factor of 35. The same algorithm was then ported to the Intel Xeon/FPGA platform with OpenCL, and the implementation work and the performance are compared. Another FPGA accelerator, the Nallatech 385 PCIe accelerator with the same Stratix V FPGA, was also tested for performance. The results show that the Intel…

  37. Implementation of High-Order Multireference Coupled-Cluster Methods on Intel Many Integrated Core Architecture.

    PubMed

    Aprà, E; Kowalski, K

    2016-03-08

In this paper we discuss the implementation of the multireference coupled-cluster formalism with singles, doubles, and noniterative triples (MRCCSD(T)), which is capable of taking advantage of the processing power of the Intel Xeon Phi coprocessor. We discuss the integration of two levels of parallelism underlying the MRCCSD(T) implementation with computational kernels designed to offload the computationally intensive parts of the MRCCSD(T) formalism to Intel Xeon Phi coprocessors. Special attention is given to the enhancement of the parallel performance by task reordering, which has improved load balancing in the noniterative part of the MRCCSD(T) calculations. We also discuss aspects regarding efficient optimization and vectorization strategies.

  38. Implementation of 5-layer thermal diffusion scheme in weather research and forecasting model with Intel Many Integrated Cores

    NASA Astrophysics Data System (ADS)

    Huang, Melin; Huang, Bormin; Huang, Allen H.

    2014-10-01

For weather forecasting and research, the Weather Research and Forecasting (WRF) model has been developed, consisting of several components such as dynamic solvers and physical simulation modules. WRF includes several Land-Surface Models (LSMs). The LSMs use atmospheric information, the radiative and precipitation forcing from the surface layer scheme, the radiation scheme, and the microphysics/convective scheme, together with the land's state variables and land-surface properties, to provide heat and moisture fluxes over land and sea-ice points. The WRF 5-layer thermal diffusion simulation is an LSM based on the MM5 5-layer soil temperature model, with an energy budget that includes radiation, sensible, and latent heat flux. The WRF LSMs are very suitable for massively parallel computation, as there are no interactions among horizontal grid points. The efficient parallelization and vectorization features of the Intel Many Integrated Core (MIC) architecture allow us to optimize this WRF 5-layer thermal diffusion scheme. In this work, we present the computing performance results for this scheme on the Intel MIC architecture. Our results show that the MIC-based optimization improved the performance of the first version of the multi-threaded code on the Xeon Phi 5110P by a factor of 2.1x. Correspondingly, the same CPU-based optimizations improved the performance on an Intel Xeon E5-2603 by a factor of 1.6x as compared to the first version of the multi-threaded code.

  39. Reducing adaptive optics latency using Xeon Phi many-core processors

    NASA Astrophysics Data System (ADS)

    Barr, David; Basden, Alastair; Dipper, Nigel; Schwartz, Noah

    2015-11-01

    The next generation of Extremely Large Telescopes (ELTs) for astronomy will rely heavily on the performance of their adaptive optics (AO) systems. Real-time control is at the heart of the critical technologies that will enable telescopes to deliver the best possible science and will require a very significant extrapolation from current AO hardware existing for 4-10 m telescopes. Investigating novel real-time computing architectures and testing their eligibility against anticipated challenges is one of the main priorities of technology development for the ELTs. This paper investigates the suitability of the Intel Xeon Phi, which is a commercial off-the-shelf hardware accelerator. We focus on wavefront reconstruction performance, implementing a straightforward matrix-vector multiplication (MVM) algorithm. We present benchmarking results of the Xeon Phi on a real-time Linux platform, both as a standalone processor and integrated into an existing real-time controller (RTC). Performance of single and multiple Xeon Phis are investigated. We show that this technology has the potential of greatly reducing the mean latency and variations in execution time (jitter) of large AO systems. We present both a detailed performance analysis of the Xeon Phi for a typical E-ELT first-light instrument along with a more general approach that enables us to extend to any AO system size. We show that systematic and detailed performance analysis is an essential part of testing novel real-time control hardware to guarantee optimal science results.
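
    The benchmark pattern described above, a wavefront reconstruction modeled as y = Mx and timed over many loop iterations to extract mean latency and RMS jitter, can be sketched as follows. The matrix dimensions and the timing harness are illustrative assumptions, not the paper's E-ELT configuration.

    ```cpp
    // MVM reconstruction kernel plus a simple latency/jitter harness.
    #include <chrono>
    #include <cmath>
    #include <cstdio>
    #include <vector>

    void mvm(const std::vector<float>& M, const std::vector<float>& x,
             std::vector<float>& y, int rows, int cols) {
        #pragma omp parallel for
        for (int i = 0; i < rows; ++i) {
            float acc = 0.0f;
            #pragma omp simd reduction(+:acc)
            for (int j = 0; j < cols; ++j)
                acc += M[static_cast<std::size_t>(i) * cols + j] * x[j];
            y[i] = acc;
        }
    }

    int main() {
        const int rows = 5000, cols = 10000;  // assumed AO system size
        std::vector<float> M(static_cast<std::size_t>(rows) * cols, 1e-4f);
        std::vector<float> x(cols, 1.0f), y(rows);
        std::vector<double> t;
        for (int it = 0; it < 1000; ++it) {   // repeated closed-loop iterations
            auto t0 = std::chrono::steady_clock::now();
            mvm(M, x, y, rows, cols);
            auto t1 = std::chrono::steady_clock::now();
            t.push_back(std::chrono::duration<double, std::micro>(t1 - t0).count());
        }
        double mean = 0, var = 0;             // mean latency and RMS jitter
        for (double v : t) mean += v;
        mean /= t.size();
        for (double v : t) var += (v - mean) * (v - mean);
        std::printf("mean %.1f us, RMS jitter %.1f us\n",
                    mean, std::sqrt(var / t.size()));
        return 0;
    }
    ```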

  20. Efficient sparse matrix-matrix multiplication for computing periodic responses by shooting method on Intel Xeon Phi

    NASA Astrophysics Data System (ADS)

    Stoykov, S.; Atanassov, E.; Margenov, S.

    2016-10-01

    Many scientific applications involve sparse or dense matrix operations, such as solving linear systems, matrix-matrix products, and eigensolvers. In structural nonlinear dynamics, the computation of periodic responses and the determination of the stability of the solution are of primary interest. The shooting method is widely used for obtaining periodic responses of nonlinear systems. The method involves simultaneous operations with sparse and dense matrices. One of the computationally expensive operations in the method is the multiplication of sparse by dense matrices. In the current work, a new algorithm for sparse matrix by dense matrix products is presented. The algorithm takes into account the structure of the sparse matrix, which is obtained by space discretization of the nonlinear Mindlin's plate equation of motion by the finite element method. The algorithm is developed to use the vector engine of Intel Xeon Phi coprocessors. It is compared with the standard sparse matrix by dense matrix algorithm and the one provided by Intel MKL, and it is shown that by considering the properties of the sparse matrix, better algorithms can be developed.
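
    For context, the generic CSR-by-dense-matrix baseline that such a structure-aware algorithm is compared against looks roughly like this (a sketch, not the authors' algorithm):

      // Baseline sketch: C (n x m) = A (n x n, CSR) * B (n x m, row-major).
      // Looping over B's columns innermost keeps unit-stride access that
      // the Xeon Phi vector units can exploit.
      void spmm_csr(const int* rowptr, const int* colidx, const double* val,
                    const double* B, double* C, int n, int m) {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i) {
          double* Ci = C + (long)i * m;
          for (int j = 0; j < m; ++j) Ci[j] = 0.0;
          for (int p = rowptr[i]; p < rowptr[i+1]; ++p) {
            const double a = val[p];
            const double* Bk = B + (long)colidx[p] * m;
            #pragma omp simd
            for (int j = 0; j < m; ++j)
              Ci[j] += a * Bk[j];
          }
        }
      }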

  1. Discrete Particle Model for Porous Media Flow using OpenFOAM at Intel Xeon Phi Coprocessors

    NASA Astrophysics Data System (ADS)

    Shang, Zhi; Nandakumar, Krishnaswamy; Liu, Honggao; Tyagi, Mayank; Lupo, James A.; Thompson, Karsten

    2015-11-01

    The discrete particle model (DPM) in OpenFOAM was used to study turbulent solid particle suspension flows through the porous media of a natural dual-permeability rock. The 2D and 3D pore geometries of the porous media were generated by sphere packing with a radius ratio of 3. The porosity is about 38%, the same as the natural dual-permeability rock. In the 2D case, the mesh reaches 5 million cells with 1 million solid particles, and in the 3D case, the mesh exceeds 10 million cells with 5 million solid particles. The solid particle sizes follow a Gaussian distribution from 20 μm to 180 μm with a mean of 100 μm. Through the numerical simulations, not only was the HPC performance studied using Intel Xeon Phi coprocessors, but the flow behavior of large-scale solid suspension flows in porous media was also examined. The authors would like to thank the support by IPCC@LSU-Intel Parallel Computing Center (LSU # Y1SY1-1) and the HPC resources at Louisiana State University (http://www.hpc.lsu.edu).

  2. Evaluation of the Xeon Phi processor as a technology for the acceleration of real-time control in high-order adaptive optics systems

    NASA Astrophysics Data System (ADS)

    Barr, David; Basden, Alastair; Dipper, Nigel; Schwartz, Noah; Vick, Andy; Schnetler, Hermine

    2014-08-01

    We present wavefront reconstruction acceleration of high-order AO systems using an Intel Xeon Phi processor. The Xeon Phi is a coprocessor providing many integrated cores and designed for accelerating compute intensive, numerical codes. Unlike other accelerator technologies, it allows virtually unchanged C/C++ to be recompiled to run on the Xeon Phi, giving the potential of making development, upgrade and maintenance faster and less complex. We benchmark the Xeon Phi in the context of AO real-time control by running a matrix vector multiply (MVM) algorithm. We investigate variability in execution time and demonstrate a substantial speed-up in loop frequency. We examine the integration of a Xeon Phi into an existing RTC system and show that performance improvements can be achieved with limited development effort.

  3. Performance Evaluation of an Intel Haswell- and Ivy Bridge-Based Supercomputer Using Scientific and Engineering Applications

    NASA Technical Reports Server (NTRS)

    Saini, Subhash; Hood, Robert T.; Chang, Johnny; Baron, John

    2016-01-01

    We present a performance evaluation conducted on a production supercomputer of the Intel Xeon Processor E5-2680v3, a twelve-core implementation of the fourth-generation Haswell architecture, and compare it with the Intel Xeon Processor E5-2680v2, an Ivy Bridge implementation of the third-generation Sandy Bridge architecture. Several new architectural features have been incorporated in Haswell, including improvements in all levels of the memory hierarchy as well as improvements to vector instructions and power management. We critically evaluate these new features of Haswell and compare with Ivy Bridge using several low-level benchmarks, including a subset of HPCC and HPCG, and four full-scale scientific and engineering applications. We also present a model that predicts the performance of HPCG and Cart3D within 5%, and Overflow within 10% accuracy.

  4. Acceleration of Monte Carlo simulation of photon migration in complex heterogeneous media using Intel many-integrated core architecture.

    PubMed

    Gorshkov, Anton V; Kirillin, Mikhail Yu

    2015-08-01

    Over the past two decades, the Monte Carlo technique has become a gold standard in the simulation of light propagation in turbid media, including biotissues. Technological solutions provide further advances of this technique. The Intel Xeon Phi coprocessor is a new type of accelerator for highly parallel general-purpose computing, which allows the execution of a wide range of applications without substantial code modification. We present a technical approach of porting our previously developed Monte Carlo (MC) code for the simulation of light transport in tissues to the Intel Xeon Phi coprocessor. We show that employing the accelerator reduces the computational time of MC simulations, with a speed-up comparable to a GPU. We demonstrate the performance of the developed code for the simulation of light transport in the human head and the determination of the measurement volume in near-infrared spectroscopy brain sensing.

  5. Evaluating the networking characteristics of the Cray XC-40 Intel Knights Landing-based Cori supercomputer at NERSC

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Doerfler, Douglas; Austin, Brian; Cook, Brandon

    There are many potential issues associated with deploying the Intel Xeon Phi™ (code-named Knights Landing [KNL]) manycore processor in a large-scale supercomputer. One in particular is the ability to fully utilize the high-speed communications network, given that the serial performance of a Xeon Phi™ core is a fraction of a Xeon® core. In this paper, we take a look at the trade-offs associated with allocating enough cores to fully utilize the Aries high-speed network versus cores dedicated to computation, e.g., the trade-off between MPI and OpenMP. In addition, we evaluate new features of Cray MPI in support of KNL, such as internode optimizations. We also evaluate one-sided programming models such as Unified Parallel C. We quantify the impact of the above trade-offs and features using a suite of National Energy Research Scientific Computing Center applications.
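
    The MPI-versus-OpenMP trade-off is typically swept by keeping ranks-per-node times threads-per-rank equal to the available cores and varying the split. A minimal hybrid skeleton of that experiment (illustrative only):

      // Hybrid skeleton: vary ranks-per-node vs. OMP_NUM_THREADS so that
      // ranks * threads matches the KNL core (or hardware-thread) count.
      #include <mpi.h>
      #include <omp.h>
      #include <cstdio>

      int main(int argc, char** argv) {
        int provided, rank, nranks;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);
        #pragma omp parallel
        {
          #pragma omp master
          std::printf("rank %d/%d running %d threads\n",
                      rank, nranks, omp_get_num_threads());
        }
        // ... compute with OpenMP; communicate from the master thread ...
        MPI_Finalize();
        return 0;
      }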

  6. Application of Intel Many Integrated Core (MIC) architecture to the Yonsei University planetary boundary layer scheme in Weather Research and Forecasting model

    NASA Astrophysics Data System (ADS)

    Huang, Melin; Huang, Bormin; Huang, Allen H.

    2014-10-01

    The Weather Research and Forecasting (WRF) model provides operational services worldwide in many areas and is linked to our daily activities, in particular during severe weather events. The Yonsei University (YSU) scheme is one of the planetary boundary layer (PBL) models in WRF. The PBL is responsible for vertical sub-grid-scale fluxes due to eddy transports in the whole atmospheric column; it determines the flux profiles within the well-mixed boundary layer and the stable layer, and thus provides atmospheric tendencies of temperature, moisture (including clouds), and horizontal momentum in the entire atmospheric column. The YSU scheme is very suitable for massively parallel computation as there are no interactions among horizontal grid points. To accelerate the computation of the YSU scheme, we employ the Intel Many Integrated Core (MIC) architecture, which offers efficient parallelization and vectorization. Our results show that the MIC-based optimization improved the performance of the first version of the multi-threaded code on the Xeon Phi 5110P by a factor of 2.4x. Furthermore, the same CPU-based optimizations improved the performance on the Intel Xeon E5-2603 by a factor of 1.6x as compared to the first version of the multi-threaded code.

  7. List-mode PET image reconstruction for motion correction using the Intel XEON PHI co-processor

    NASA Astrophysics Data System (ADS)

    Ryder, W. J.; Angelis, G. I.; Bashar, R.; Gillam, J. E.; Fulton, R.; Meikle, S.

    2014-03-01

    List-mode image reconstruction with motion correction is computationally expensive, as it requires the projection of hundreds of millions of rays through a 3D array. To decrease reconstruction time it is possible to use symmetric multiprocessing computers or graphics processing units. The former can have high financial costs, while the latter can require refactoring of algorithms. The Xeon Phi is a new co-processor card with a Many Integrated Core architecture that can run 4 multiple-instruction, multiple-data threads per core, with each thread having a 512-bit single-instruction, multiple-data vector register. Thus, it is possible to run in the region of 220 threads simultaneously. The aim of this study was to investigate whether the Xeon Phi co-processor card is a viable alternative to an x86 Linux server for accelerating list-mode PET image reconstruction for motion correction. An existing list-mode image reconstruction algorithm with motion correction was ported to run on the Xeon Phi coprocessor, with the multi-threading implemented using pthreads. There were no differences between images reconstructed using the Phi co-processor card and images reconstructed using the same algorithm run on a Linux server. However, it was found that the reconstruction runtimes were 3 times greater for the Phi than for the server. A new version of the image reconstruction algorithm was developed in C++ using OpenMP for multi-threading, and the Phi runtimes decreased to 1.67 times that of the host Linux server. Data transfer from the host to the co-processor card was found to be a rate-limiting step; this needs to be carefully considered in order to maximize runtime speeds. When considering the purchase price of a Linux workstation with a Xeon Phi co-processor card versus a top-of-the-range Linux server, the former is a cost-effective computation resource for list-mode image reconstruction. A multi-Phi workstation could be a viable alternative to cluster computers at a lower cost for medical imaging

  8. An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU

    NASA Astrophysics Data System (ADS)

    Lyakh, Dmitry I.

    2015-04-01

    An efficient parallel tensor transpose algorithm is suggested for shared-memory computing units, namely, multicore CPU, Intel Xeon Phi, and NVidia GPU. The algorithm operates on dense tensors (multidimensional arrays) and is based on the optimization of cache utilization on x86 CPUs and the use of shared memory on NVidia GPUs. From the applied side, the ultimate goal is to minimize the overhead encountered in the transformation of tensor contractions into matrix multiplications in computer implementations of advanced methods of quantum many-body theory (e.g., in electronic structure theory and nuclear physics). Particular emphasis is placed on higher-dimensional tensors, which typically appear in the so-called multireference correlated methods of electronic structure theory. Depending on tensor dimensionality, the presented optimized algorithms can achieve an order of magnitude speedup on x86 CPUs and a 2-3 times speedup on an NVidia Tesla K20X GPU with respect to the naïve scattering algorithm (no memory access optimization). The tensor transpose routines developed in this work have been incorporated into a general-purpose tensor algebra library (TAL-SH).
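
    In the simplest two-index case, the cache-utilization idea that the paper generalizes to higher-dimensional tensors is blocked transposition; a minimal sketch (tile size is an illustrative choice):

      // Blocked 2D transpose: B = A^T for an (n x m) row-major A.
      // Small tiles keep both the reads of A and the writes of B within
      // cache, which is the optimization the naive scatter version lacks.
      const int TILE = 32;
      void transpose(const double* A, double* B, int n, int m) {
        #pragma omp parallel for collapse(2)
        for (int ib = 0; ib < n; ib += TILE)
          for (int jb = 0; jb < m; jb += TILE)
            for (int i = ib; i < ib + TILE && i < n; ++i)
              for (int j = jb; j < jb + TILE && j < m; ++j)
                B[(long)j * n + i] = A[(long)i * m + j];
      }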

  9. Accelerating 3D Elastic Wave Equations on Knights Landing based Intel Xeon Phi processors

    NASA Astrophysics Data System (ADS)

    Sourouri, Mohammed; Birger Raknes, Espen

    2017-04-01

    In advanced imaging methods like reverse-time migration (RTM) and full waveform inversion (FWI), the elastic wave equation (EWE) is numerically solved many times to create the seismic image or the elastic parameter model update. Thus, it is essential to optimize the solution time for solving the EWE, as this will have a major impact on the total computational cost of running RTM or FWI. From a computational point of view, applications implementing EWEs face two major challenges. The first challenge is the amount of memory-bound computation involved, while the second challenge is the execution of such computations over very large datasets. So far, multi-core processors have not been able to tackle these two challenges, which eventually led to the adoption of accelerators such as Graphics Processing Units (GPUs). Compared to conventional CPUs, GPUs are densely populated with many floating-point units and fast memory, a type of architecture that has proven to map well to many scientific computations. Despite their architectural advantages, full-scale adoption of accelerators has yet to materialize. First, accelerators require a significant programming effort imposed by programming models such as CUDA or OpenCL. Second, accelerators come with a limited amount of memory, which also requires explicit data transfers between the CPU and the accelerator over the slow PCI bus. The second generation of the Xeon Phi processor, based on the Knights Landing (KNL) architecture, promises the computational capabilities of an accelerator but requires the same programming effort as traditional multi-core processors. The high computational performance is realized through many integrated cores (the number of cores, tiles, and memory varies with the model) organized in tiles that are connected via a 2D mesh-based interconnect. Contrary to accelerators, KNL is a self-hosted system, meaning explicit data transfers over the PCI bus are no longer required. However, like most

  10. Nonlinear Wave Simulation on the Xeon Phi Knights Landing Processor

    NASA Astrophysics Data System (ADS)

    Hristov, Ivan; Goranov, Goran; Hristova, Radoslava

    2018-02-01

    We consider a standing-wave simulation that is interesting from a computational point of view, obtained by solving coupled 2D perturbed sine-Gordon equations. We make an OpenMP realization that exploits both the thread and SIMD levels of parallelism. We test the OpenMP program on two energy-equivalent Intel architectures: 2× Xeon E5-2695 v2 processors (code-named "Ivy Bridge-EP") in the HybriLIT cluster, and a Xeon Phi 7250 processor (code-named "Knights Landing", KNL). The results show 2 times better performance on the KNL processor.
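
    The thread-plus-SIMD decomposition described above maps naturally onto an explicit finite-difference update of the sine-Gordon field. A single-equation sketch (the coupling and perturbation terms of the actual model are omitted; names are illustrative):

      // One explicit time step of u_tt = u_xx + u_yy - sin(u) on an
      // (nx x ny) interior grid: threads over rows, SIMD across columns.
      #include <cmath>
      void step(const double* um, const double* u, double* up,
                int nx, int ny, double dt, double h) {
        const double r = dt * dt / (h * h);
        #pragma omp parallel for
        for (int i = 1; i < nx - 1; ++i) {
          #pragma omp simd
          for (int j = 1; j < ny - 1; ++j) {
            const long c = (long)i * ny + j;
            const double lap = u[c-ny] + u[c+ny] + u[c-1] + u[c+1] - 4.0*u[c];
            up[c] = 2.0*u[c] - um[c] + r*lap - dt*dt*std::sin(u[c]);
          }
        }
      }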

  11. An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU

    DOE PAGES

    Lyakh, Dmitry I.

    2015-01-05

    An efficient parallel tensor transpose algorithm is suggested for shared-memory computing units, namely, multicore CPU, Intel Xeon Phi, and NVidia GPU. The algorithm operates on dense tensors (multidimensional arrays) and is based on the optimization of cache utilization on x86 CPU and the use of shared memory on NVidia GPU. From the applied side, the ultimate goal is to minimize the overhead encountered in the transformation of tensor contractions into matrix multiplications in computer implementations of advanced methods of quantum many-body theory (e.g., in electronic structure theory and nuclear physics). A particular accent is made on higher-dimensional tensors that typically appear in the so-called multireference correlated methods of electronic structure theory. Depending on tensor dimensionality, the presented optimized algorithms can achieve an order of magnitude speedup on x86 CPUs and 2-3 times speedup on NVidia Tesla K20X GPU with respect to the naïve scattering algorithm (no memory access optimization). Furthermore, the tensor transpose routines developed in this work have been incorporated into a general-purpose tensor algebra library (TAL-SH).

  12. Case for a field-programmable gate array multicore hybrid machine for an image-processing application

    NASA Astrophysics Data System (ADS)

    Rakvic, Ryan N.; Ives, Robert W.; Lira, Javier; Molina, Carlos

    2011-01-01

    General purpose computer designers have recently begun adding cores to their processors in order to increase performance. For example, Intel has adopted a homogeneous quad-core processor as a base for general purpose computing. PlayStation3 (PS3) game consoles contain a multicore heterogeneous processor known as the Cell, which is designed to perform complex image processing algorithms at a high level. Can modern image-processing algorithms utilize these additional cores? On the other hand, modern advancements in configurable hardware, most notably field-programmable gate arrays (FPGAs) have created an interesting question for general purpose computer designers. Is there a reason to combine FPGAs with multicore processors to create an FPGA multicore hybrid general purpose computer? Iris matching, a repeatedly executed portion of a modern iris-recognition algorithm, is parallelized on an Intel-based homogeneous multicore Xeon system, a heterogeneous multicore Cell system, and an FPGA multicore hybrid system. Surprisingly, the cheaper PS3 slightly outperforms the Intel-based multicore on a core-for-core basis. However, both multicore systems are beaten by the FPGA multicore hybrid system by >50%.

  13. Enhancing Image Processing Performance for PCID in a Heterogeneous Network of Multi-core Processors

    DTIC Science & Technology

    2009-09-01

    TFLOPS of Playstation 3 (PS3) nodes with IBM Cell Broadband Engine multi-cores and 15 dual-quad Xeon head nodes. The interconnect fabric includes...

  14. Parallel algorithms for large-scale biological sequence alignment on Xeon-Phi based clusters.

    PubMed

    Lan, Haidong; Chan, Yuandong; Xu, Kai; Schmidt, Bertil; Peng, Shaoliang; Liu, Weiguo

    2016-07-19

    Computing alignments between two or more sequences is a common operation frequently performed in computational molecular biology. The continuing growth of biological sequence databases establishes the need for their efficient parallel implementation on modern accelerators. This paper presents new approaches to high-performance biological sequence database scanning with the Smith-Waterman algorithm and to the first stage of progressive multiple sequence alignment based on the ClustalW heuristic on a Xeon Phi-based compute cluster. Our approach uses a three-level parallelization scheme to take full advantage of the compute power available on this type of architecture: cluster-level data parallelism, thread-level coarse-grained parallelism, and vector-level fine-grained parallelism. Furthermore, we re-organize the sequence datasets and use Xeon Phi shuffle operations to improve I/O efficiency. Evaluations show that our method achieves a peak overall performance of up to 220 GCUPS for scanning real protein sequence databanks on a single node consisting of two Intel E5-2620 CPUs and two Intel Xeon Phi 7110P cards. It also exhibits good scalability in terms of sequence length and size, and number of compute nodes, for both database scanning and multiple sequence alignment. Furthermore, the achieved performance is highly competitive with optimized Xeon Phi and GPU implementations. Our implementation is available at https://github.com/turbo0628/LSDBS-mpi.

  15. Accelerating gravitational microlensing simulations using the Xeon Phi coprocessor

    NASA Astrophysics Data System (ADS)

    Chen, B.; Kantowski, R.; Dai, X.; Baron, E.; Van der Mark, P.

    2017-04-01

    Recently Graphics Processing Units (GPUs) have been used to speed up very CPU-intensive gravitational microlensing simulations. In this work, we use the Xeon Phi coprocessor to accelerate such simulations and compare its performance on a microlensing code with that of NVIDIA's GPUs. For the selected set of parameters evaluated in our experiment, we find that the speedup by Intel's Knights Corner coprocessor is comparable to that by NVIDIA's Fermi family of GPUs with compute capability 2.0, but less significant than GPUs with higher compute capabilities such as the Kepler. However, the very recently released second generation Xeon Phi, Knights Landing, is about 5.8 times faster than the Knights Corner, and about 2.9 times faster than the Kepler GPU used in our simulations. We conclude that the Xeon Phi is a very promising alternative to GPUs for modern high performance microlensing simulations.

  16. Deploying electromagnetic particle-in-cell (EM-PIC) codes on Xeon Phi accelerators boards

    NASA Astrophysics Data System (ADS)

    Fonseca, Ricardo

    2014-10-01

    The complexity of the phenomena involved in several relevant plasma physics scenarios, where highly nonlinear and kinetic processes dominate, makes purely theoretical descriptions impossible. Further understanding of these scenarios requires detailed numerical modeling, but fully relativistic particle-in-cell codes such as OSIRIS are computationally intensive. The quest towards exaflop computer systems has led to the development of HPC systems based on add-on accelerator cards, such as GPGPUs and, more recently, the Xeon Phi accelerators that power the current number 1 system in the world. These cards, also referred to as Intel Many Integrated Core (MIC) architecture, offer peak theoretical performances of >1 TFlop/s for general-purpose calculations in a single board, and are receiving significant attention as an attractive alternative to CPUs for plasma modeling. In this work we report on our efforts towards the deployment of an EM-PIC code on a Xeon Phi architecture system. We will focus on the parallelization and vectorization strategies followed, and present a detailed evaluation of code performance in comparison with the CPU code.

  17. Optimizing zonal advection of the Advanced Research WRF (ARW) dynamics for Intel MIC

    NASA Astrophysics Data System (ADS)

    Mielikainen, Jarno; Huang, Bormin; Huang, Allen H.

    2014-10-01

    The Weather Research and Forecast (WRF) model is the most widely used community weather forecast and research model in the world. There are two distinct varieties of WRF. The Advanced Research WRF (ARW) is an experimental, advanced research version featuring very high resolution. The WRF Nonhydrostatic Mesoscale Model (WRF-NMM) has been designed for forecasting operations. WRF consists of dynamics code and several physics modules. The WRF-ARW core is based on an Eulerian solver for the fully compressible nonhydrostatic equations. In this paper, we use the Intel Many Integrated Core (MIC) architecture to substantially increase the performance of a zonal advection subroutine, one of the most time-consuming routines in the ARW dynamics core. Advection advances the explicit perturbation horizontal momentum equations by adding in the large-timestep tendency along with the small-timestep pressure gradient tendency. We describe the challenges we met during the development of a high-speed dynamics code subroutine for the MIC architecture. Furthermore, lessons learned from the code optimization process are discussed. The results show that the optimizations improved the performance of the original code on the Xeon Phi 5110P by a factor of 2.4x.

  18. MIC-SVM: Designing A Highly Efficient Support Vector Machine For Advanced Modern Multi-Core and Many-Core Architectures

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    You, Yang; Song, Shuaiwen; Fu, Haohuan

    2014-08-16

    Support Vector Machine (SVM) has been widely used in data-mining and Big Data applications as modern commercial databases start to attach increasing importance to analytic capabilities. In recent years, SVM was adapted to the field of High Performance Computing for power/performance prediction, auto-tuning, and runtime scheduling. However, even at the risk of losing prediction accuracy due to insufficient runtime information, researchers can only afford to apply offline model training to avoid significant runtime training overhead. To address the challenges above, we designed and implemented MIC-SVM, a highly efficient parallel SVM for x86-based multi-core and many-core architectures, such as the Intel Ivy Bridge CPUs and the Intel Xeon Phi coprocessor (MIC).

  19. Closeout Report ARRA supplement to DE-FG02-08ER41546, 03/15/2010 to 03/14/2011 - Advanced Transfer Map Methods for the Description of Particle Beam Dynamics

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Berz, Martin; Makino, Kyoko

    The ARRA funds were utilized to acquire a cluster of high-performance computers, consisting of one Altus 2804 server based on quad AMD Opteron 6174 12C processors, with four 2.2 GHz nodes of 12 cores each resulting in 48 directly usable cores, as well as a Relion 1751 server using an Intel Xeon X5677 with four 3.46 GHz cores supporting 8 threads. Both systems run the Unix flavor CentOS, which is designed for use without the need for frequent updates, greatly enhancing reliability. The systems are used to operate our COSY INFINITY environment, which supports MPI parallelization. The units arrived at MSU in September 2010 and were taken into operation shortly thereafter.

  20. Performance of an MPI-only semiconductor device simulator on a quad socket/quad core InfiniBand platform.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Shadid, John Nicolas; Lin, Paul Tinphone

    2009-01-01

    This preliminary study considers the scaling and performance of a finite element (FE) semiconductor device simulator on a capacity cluster with 272 compute nodes based on a homogeneous multicore node architecture utilizing 16 cores. The inter-node communication backbone for this Tri-Lab Linux Capacity Cluster (TLCC) machine is comprised of an InfiniBand interconnect. The nonuniform memory access (NUMA) nodes consist of 2.2 GHz quad socket/quad core AMD Opteron processors. The performance results for this study are obtained with a FE semiconductor device simulation code (Charon) that is based on a fully-coupled Newton-Krylov solver with domain decomposition and multilevel preconditioners. Scaling and multicore performance results are presented for large-scale problems of 100+ million unknowns on up to 4096 cores. A parallel scaling comparison is also presented with the Cray XT3/4 Red Storm capability platform. The results indicate that an MPI-only programming model for utilizing the multicore nodes is reasonably efficient on all 16 cores per compute node. However, the results also indicated that the multilevel preconditioner, which is critical for large-scale capability type simulations, scales better on the Red Storm machine than the TLCC machine.

  1. Exact diagonalization of quantum lattice models on coprocessors

    NASA Astrophysics Data System (ADS)

    Siro, T.; Harju, A.

    2016-10-01

    We implement the Lanczos algorithm on an Intel Xeon Phi coprocessor and compare its performance to a multi-core Intel Xeon CPU and an NVIDIA graphics processor. The Xeon and the Xeon Phi are parallelized with OpenMP and the graphics processor is programmed with CUDA. The performance is evaluated by measuring the execution time of a single step in the Lanczos algorithm. We study two quantum lattice models with different particle numbers, and conclude that for small systems, the multi-core CPU is the fastest platform, while for large systems, the graphics processor is the clear winner, reaching speedups of up to 7.6 compared to the CPU. The Xeon Phi outperforms the CPU with sufficiently large particle number, reaching a speedup of 2.5.
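
    A single Lanczos step of the kind being timed is one sparse matrix-vector product plus a few BLAS-1 operations. A compact sketch, assuming CSR storage for the Hamiltonian (names illustrative):

      // One Lanczos iteration: w = H*v_j - beta_j*v_{j-1};
      // alpha_j = <v_j, w>;  w -= alpha_j*v_j;  returns beta_{j+1} = ||w||.
      #include <cmath>
      double lanczos_step(const int* rowptr, const int* colidx,
                          const double* val, const double* vprev,
                          const double* v, double* w, double beta,
                          int n, double* alpha_out) {
        double alpha = 0.0;
        #pragma omp parallel for reduction(+:alpha)
        for (int i = 0; i < n; ++i) {
          double s = 0.0;
          for (int p = rowptr[i]; p < rowptr[i+1]; ++p)
            s += val[p] * v[colidx[p]];        // sparse mat-vec row
          w[i] = s - beta * vprev[i];
          alpha += v[i] * w[i];
        }
        double nrm2 = 0.0;
        #pragma omp parallel for reduction(+:nrm2)
        for (int i = 0; i < n; ++i) {
          w[i] -= alpha * v[i];
          nrm2 += w[i] * w[i];
        }
        *alpha_out = alpha;
        return std::sqrt(nrm2);
      }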

  2. MSTor: A program for calculating partition functions, free energies, enthalpies, entropies, and heat capacities of complex molecules including torsional anharmonicity

    NASA Astrophysics Data System (ADS)

    Zheng, Jingjing; Mielke, Steven L.; Clarkson, Kenneth L.; Truhlar, Donald G.

    2012-08-01

    We present a Fortran program package, MSTor, which calculates partition functions and thermodynamic functions of complex molecules involving multiple torsional motions by the recently proposed MS-T method. This method interpolates between the local harmonic approximation in the low-temperature limit, and the limit of free internal rotation of all torsions at high temperature. The program can also carry out calculations in the multiple-structure local harmonic approximation. The program package also includes six utility codes that can be used as stand-alone programs to calculate reduced moment of inertia matrices by the method of Kilpatrick and Pitzer, to generate conformational structures, to calculate, either analytically or by Monte Carlo sampling, volumes for torsional subdomains defined by Voronoi tessellation of the conformational subspace, to generate template input files, and to calculate one-dimensional torsional partition functions using the torsional eigenvalue summation method. Catalogue identifier: AEMF_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEMF_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 77 434 No. of bytes in distributed program, including test data, etc.: 3 264 737 Distribution format: tar.gz Programming language: Fortran 90, C, and Perl Computer: Itasca (HP Linux cluster, each node has two-socket, quad-core 2.8 GHz Intel Xeon X5560 “Nehalem EP” processors), Calhoun (SGI Altix XE 1300 cluster, each node containing two quad-core 2.66 GHz Intel Xeon “Clovertown”-class processors sharing 16 GB of main memory), Koronis (Altix UV 1000 server with 190 6-core Intel Xeon X7542 “Westmere” processors at 2.66 GHz), Elmo (Sun Fire X4600 Linux cluster with AMD Opteron cores), and Mac Pro (two 2.8 GHz Quad-core Intel Xeon

  3. Evaluating Multi-core Architectures through Accelerating the Three-Dimensional Lax–Wendroff Correction

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    You, Yang; Fu, Haohuan; Song, Shuaiwen

    2014-07-18

    Wave propagation forward modeling is a widely used computational method in oil and gas exploration. The iterative stencil loops in such problems have broad applications in scientific computing. However, executing such loops can be highly time-consuming, which greatly limits an application's performance and power efficiency. In this paper, we accelerate the forward modeling technique on the latest multi-core and many-core architectures such as Intel Sandy Bridge CPUs, the NVIDIA Fermi C2070 GPU, the NVIDIA Kepler K20x GPU, and the Intel Xeon Phi co-processor. For the GPU platforms, we propose two parallel strategies to explore the performance optimization opportunities for our stencil kernels. For Sandy Bridge CPUs and MIC, we also employ various optimization techniques in order to achieve the best performance.

  4. TH-A-19A-08: Intel Xeon Phi Implementation of a Fast Multi-Purpose Monte Carlo Simulation for Proton Therapy

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Souris, K; Lee, J; Sterpin, E

    2014-06-15

    Purpose: Recent studies have demonstrated the capability of graphics processing units (GPUs) to compute dose distributions using Monte Carlo (MC) methods within clinical time constraints. However, GPUs have a rigid vectorial architecture that favors the implementation of simplified particle transport algorithms, adapted to specific tasks. Our new, fast, and multipurpose MC code, named MCsquare, runs on Intel Xeon Phi coprocessors. This technology offers 60 independent cores, and therefore more flexibility to implement fast and yet generic MC functionalities, such as prompt gamma simulations. Methods: MCsquare implements several models and hence allows users to make their own trade-off between speed and accuracy. A 200 MeV proton beam is simulated in a heterogeneous phantom using Geant4 and two configurations of MCsquare. The first one is the most conservative and accurate. The method of fictitious interactions handles the interfaces, and secondary charged particles emitted in nuclear interactions are fully simulated. The second, faster configuration simplifies interface crossings and simulates only secondary protons after nuclear interaction events. Integral depth-dose and transversal profiles are compared to those of Geant4. Moreover, the production profile of prompt gammas is compared to PENH results. Results: Integral depth-dose and transversal profiles computed by MCsquare and Geant4 are within 3%. The production of secondaries from nuclear interactions is slightly inaccurate at interfaces for the fastest configuration of MCsquare, but this is unlikely to have any clinical impact. The computation time varies between 90 seconds for the most conservative settings and merely 59 seconds in the fastest configuration. Finally, prompt gamma profiles are also in very good agreement with PENH results. Conclusion: Our new, fast, and multi-purpose Monte Carlo code simulates prompt gammas and calculates dose distributions in less than a minute, which complies with

  5. Enhancing Image Processing Performance for PCID in a Heterogeneous Network of Multi-core Processors

    NASA Astrophysics Data System (ADS)

    Linderman, R.; Spetka, S.; Fitzgerald, D.; Emeny, S.

    The Physically-Constrained Iterative Deconvolution (PCID) image deblurring code is being ported to heterogeneous networks of multi-core systems, including Intel Xeons and IBM Cell Broadband Engines. This paper reports results from experiments using the JAWS supercomputer at MHPCC (60 TFLOPS of dual-dual Xeon nodes linked with Infiniband) and the Cell Cluster at AFRL in Rome, NY. The Cell Cluster has 52 TFLOPS of Playstation 3 (PS3) nodes with IBM Cell Broadband Engine multi-cores and 15 dual-quad Xeon head nodes. The interconnect fabric includes Infiniband, 10 Gigabit Ethernet and 1 Gigabit Ethernet to each of the 336 PS3s. The results compare approaches to parallelizing FFT executions across the Xeons and the Cell's Synergistic Processing Elements (SPEs) for frame-level image processing. The experiments included Intel's Performance Primitives and Math Kernel Library, FFTW3.2, and Carnegie Mellon's SPIRAL. Optimization of FFTs in the PCID code led to a decrease in relative processing time for FFTs. Profiling PCID version 6.2, about one year ago, showed the 13 functions that accounted for the highest percentage of processing were all FFT processing functions. They accounted for over 88% of processing time in one run on Xeons. FFT optimizations led to improvement in the current PCID version 8.0. A recent profile showed that only two of the 19 functions with the highest processing time were FFT processing functions. Timing measurements showed that FFT processing for PCID version 8.0 has been reduced to less than 19% of overall processing time. We are working toward a goal of scaling to 200-400 cores per job (1-2 imagery frames/core). Running a pair of cores on each set of frames reduces latency by implementing parallel FFT processing. Our current results show scaling well out to 100 pairs of cores. These results support the next higher level of parallelism in PCID, where groups of several hundred frames each producing one resolved image are sent to cliques of several

  6. Parallelized Kalman-Filter-Based Reconstruction of Particle Tracks on Many-Core Architectures

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Cerati, Giuseppe; Elmer, Peter; Krutelyov, Slava

    Faced with physical and energy density limitations on clock speed, contemporary microprocessor designers have increasingly turned to on-chip parallelism for performance gains. Examples include the Intel Xeon Phi, GPGPUs, and similar technologies. Algorithms should accordingly be designed with ample amounts of fine-grained parallelism if they are to realize the full performance of the hardware. This requirement can be challenging for algorithms that are naturally expressed as a sequence of small-matrix operations, such as the Kalman filter methods widely in use in high-energy physics experiments. In the High-Luminosity Large Hadron Collider (HL-LHC), for example, one of the dominant computational problems is expected to be finding and fitting charged-particle tracks during event reconstruction; today, the most common track-finding methods are those based on the Kalman filter. Experience at the LHC, both in the trigger and offline, has shown that these methods are robust and provide high physics performance. Previously we reported the significant parallel speedups that resulted from our efforts to adapt Kalman-filter-based tracking to many-core architectures such as Intel Xeon Phi. Here we report on how effectively those techniques can be applied to more realistic detector configurations and event complexity.
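
    The fine-grained parallelism called for above is commonly exposed by batching the small-matrix algebra across many track candidates, so SIMD lanes run across tracks rather than within one tiny matrix. An illustrative sketch (a 6x6 state dimension and structure-of-arrays layout are assumptions, not the production tracking code):

      // Batched C = A*B over many tracks, with matrices stored
      // structure-of-arrays style: element (i,j) of track t lives at
      // (i*N + j)*ntracks + t, so consecutive tracks are contiguous.
      const int N = 6;                       // track-state dimension (assumed)
      void batched_mul(const float* A, const float* B, float* C, int ntracks) {
        #pragma omp parallel for
        for (int t0 = 0; t0 < ntracks; t0 += 16) {   // one chunk per vector
          for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j) {
              #pragma omp simd
              for (int t = t0; t < t0 + 16 && t < ntracks; ++t) {
                float s = 0.0f;
                for (int k = 0; k < N; ++k)
                  s += A[((long)(i*N + k)) * ntracks + t] *
                       B[((long)(k*N + j)) * ntracks + t];
                C[((long)(i*N + j)) * ntracks + t] = s;
              }
            }
        }
      }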

  7. Multi-threaded ATLAS simulation on Intel Knights Landing processors

    NASA Astrophysics Data System (ADS)

    Farrell, Steven; Calafiura, Paolo; Leggett, Charles; Tsulaia, Vakhtang; Dotti, Andrea; ATLAS Collaboration

    2017-10-01

    The Knights Landing (KNL) release of the Intel Many Integrated Core (MIC) Xeon Phi line of processors is a potential game changer for HEP computing. With 72 cores and deep vector registers, the KNL cards promise significant performance benefits for highly-parallel, compute-heavy applications. Cori, the newest supercomputer at the National Energy Research Scientific Computing Center (NERSC), was delivered to its users in two phases with the first phase online at the end of 2015 and the second phase now online at the end of 2016. Cori Phase 2 is based on the KNL architecture and contains over 9000 compute nodes with 96GB DDR4 memory. ATLAS simulation with the multithreaded Athena Framework (AthenaMT) is a good potential use-case for the KNL architecture and supercomputers like Cori. ATLAS simulation jobs have a high ratio of CPU computation to disk I/O and have been shown to scale well in multi-threading and across many nodes. In this paper we will give an overview of the ATLAS simulation application with details on its multi-threaded design. Then, we will present a performance analysis of the application on KNL devices and compare it to a traditional x86 platform to demonstrate the capabilities of the architecture and evaluate the benefits of utilizing KNL platforms like Cori for ATLAS production.

  8. Parallelized Kalman-Filter-Based Reconstruction of Particle Tracks on Many-Core Processors and GPUs

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Cerati, Giuseppe; Elmer, Peter; Krutelyov, Slava

    2017-01-01

    For over a decade now, physical and energy constraints have limited clock speed improvements in commodity microprocessors. Instead, chipmakers have been pushed into producing lower-power, multi-core processors such as Graphical Processing Units (GPU), ARM CPUs, and Intel MICs. Broad-based efforts from manufacturers and developers have been devoted to making these processors user-friendly enough to perform general computations. However, extracting performance from a larger number of cores, as well as specialized vector or SIMD units, requires special care in algorithm design and code optimization. One of the most computationally challenging problems in high-energy particle experiments is finding and fitting the charged-particle tracks during event reconstruction. This is expected to become by far the dominant problem at the High-Luminosity Large Hadron Collider (HL-LHC), for example. Today the most common track finding methods are those based on the Kalman filter. Experience with Kalman techniques on real tracking detector systems has shown that they are robust and provide high physics performance. This is why they are currently in use at the LHC, both in the trigger and offline. Previously we reported on the significant parallel speedups that resulted from our investigations to adapt Kalman filters to track fitting and track building on Intel Xeon and Xeon Phi. Here, we discuss our progress toward the understanding of these processors and the new developments to port the Kalman filter to NVIDIA GPUs.

  9. Parallelized Kalman-Filter-Based Reconstruction of Particle Tracks on Many-Core Processors and GPUs

    NASA Astrophysics Data System (ADS)

    Cerati, Giuseppe; Elmer, Peter; Krutelyov, Slava; Lantz, Steven; Lefebvre, Matthieu; Masciovecchio, Mario; McDermott, Kevin; Riley, Daniel; Tadel, Matevž; Wittich, Peter; Würthwein, Frank; Yagil, Avi

    2017-08-01

    For over a decade now, physical and energy constraints have limited clock speed improvements in commodity microprocessors. Instead, chipmakers have been pushed into producing lower-power, multi-core processors such as Graphical Processing Units (GPU), ARM CPUs, and Intel MICs. Broad-based efforts from manufacturers and developers have been devoted to making these processors user-friendly enough to perform general computations. However, extracting performance from a larger number of cores, as well as specialized vector or SIMD units, requires special care in algorithm design and code optimization. One of the most computationally challenging problems in high-energy particle experiments is finding and fitting the charged-particle tracks during event reconstruction. This is expected to become by far the dominant problem at the High-Luminosity Large Hadron Collider (HL-LHC), for example. Today the most common track finding methods are those based on the Kalman filter. Experience with Kalman techniques on real tracking detector systems has shown that they are robust and provide high physics performance. This is why they are currently in use at the LHC, both in the trigger and offline. Previously we reported on the significant parallel speedups that resulted from our investigations to adapt Kalman filters to track fitting and track building on Intel Xeon and Xeon Phi. Here, we discuss our progress toward the understanding of these processors and the new developments to port the Kalman filter to NVIDIA GPUs.

  10. Roofline Analysis in the Intel® Advisor to Deliver Optimized Performance for applications on Intel® Xeon Phi™ Processor

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Koskela, Tuomas S.; Lobet, Mathieu; Deslippe, Jack

    In this session we show, in two case studies, how the roofline feature of Intel Advisor has been utilized to optimize the performance of kernels of the XGC1 and PICSAR codes in preparation for the Intel Knights Landing architecture. The impact of the implemented optimizations and the benefits of using the automatic roofline feature of Intel Advisor to study performance of large applications will be presented. This demonstrates an effective optimization strategy that has enabled these science applications to achieve up to 4.6 times speed-up and prepare for future exascale architectures. # Goal/Relevance of Session The roofline model [1,2] is a powerful tool for analyzing the performance of applications with respect to the theoretical peak achievable on a given computer architecture. It allows one to graphically represent the performance of an application in terms of operational intensity, i.e. the ratio of flops performed and bytes moved from memory, in order to guide optimization efforts. Given the scale and complexity of modern science applications, it can often be a tedious task for the user to perform the analysis on the level of functions or loops to identify where performance gains can be made. With new Intel tools, it is now possible to automate this task, as well as base the estimates of peak performance on measurements rather than vendor specifications. The goal of this session is to demonstrate how the roofline feature of Intel Advisor can be used to balance memory vs. computation related optimization efforts and effectively identify performance bottlenecks. A series of typical optimization techniques: cache blocking, structure refactoring, data alignment, and vectorization, illustrated by the kernel cases, will be addressed. # Description of the codes ## XGC1 The XGC1 code [3] is a magnetic fusion Particle-In-Cell code that uses an unstructured mesh for its Poisson solver that allows it to accurately resolve the edge plasma of a magnetic fusion device
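
    The roofline bound itself is one line of arithmetic: attainable performance is the smaller of the compute peak and the product of operational intensity and memory bandwidth. A toy evaluation (the peak and bandwidth figures are placeholders, not measured values):

      // Roofline bound: P = min(peak, AI * BW), with AI in flop/byte.
      #include <algorithm>
      #include <cstdio>

      double roofline(double peak_gflops, double bw_gbs, double ai) {
        return std::min(peak_gflops, ai * bw_gbs);
      }

      int main() {
        // Illustrative numbers only. A stream-like triad (2 flops per
        // 24 bytes moved, AI ~ 0.083) is bandwidth-bound; a blocked
        // matrix multiply with AI ~ 10 can approach the compute peak.
        std::printf("triad: %.0f GF/s\n", roofline(2000.0, 400.0, 2.0/24.0));
        std::printf("dgemm: %.0f GF/s\n", roofline(2000.0, 400.0, 10.0));
        return 0;
      }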

  11. Peregrine System | High-Performance Computing | NREL

    Science.gov Websites

    Peregrine provides both short-term and longer-term (/projects) storage; the /home, /nopt, and /projects file systems are mounted on all nodes. Compute nodes are based on Intel Xeon E5-2670 "Sandy Bridge" processors with 64 GB of memory.

  12. The parallel algorithm for the 2D discrete wavelet transform

    NASA Astrophysics Data System (ADS)

    Barina, David; Najman, Pavel; Kleparnik, Petr; Kula, Michal; Zemcik, Pavel

    2018-04-01

    The discrete wavelet transform can be found at the heart of many image-processing algorithms. Until now, the transform on general-purpose processors (CPUs) was mostly computed using a separable lifting scheme. As the lifting scheme consists of a small number of operations, it is preferred for processing using single-core CPUs. However, considering parallel processing using multi-core processors, this scheme is inappropriate due to its large number of steps. On such architectures, the number of steps corresponds to the number of points that represent an exchange of data. Consequently, these points often form a performance bottleneck. Our approach appropriately rearranges the calculations inside the transform and thereby reduces the number of steps. In other words, we propose a new scheme that is friendly to parallel environments. When evaluating on multi-core CPUs, we consistently outperform the original lifting scheme. The evaluation was performed on 61-core Intel Xeon Phi and 8-core Intel Xeon processors.
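
    For reference, one level of the separable lifting scheme being improved upon consists of a predict pass followed by an update pass, each a synchronization point when parallelized naively. A single-row sketch for the CDF 5/3 wavelet (boundary handling simplified):

      // One 1D lifting level (CDF 5/3): x of even length n is split into
      // approximation s[] and detail d[] halves. Each of the two passes
      // below is one of the "steps" that force data exchange when the
      // scheme is parallelized naively.
      void lift53(const float* x, float* s, float* d, int n) {
        const int h = n / 2;
        for (int i = 0; i < h; ++i) {        // predict: d = odd - P(even)
          float xl = x[2*i];
          float xr = (2*i + 2 < n) ? x[2*i + 2] : x[2*i];
          d[i] = x[2*i + 1] - 0.5f * (xl + xr);
        }
        for (int i = 0; i < h; ++i) {        // update: s = even + U(d)
          float dl = (i > 0) ? d[i - 1] : d[0];
          s[i] = x[2*i] + 0.25f * (dl + d[i]);
        }
      }

    Rows of an image are mutually independent, so a multi-core implementation parallelizes across them; the contribution of the paper is rearranging such passes to reduce the number of data-exchange points.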

  13. Using the Intel Math Kernel Library on Peregrine | High-Performance

    Science.gov Websites

    Learn how to use the Intel Math Kernel Library (MKL) with Peregrine system software. Core math functions in MKL include BLAS, LAPACK, ScaLAPACK, sparse solvers, and fast Fourier transforms.
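
    As a minimal example of calling one of these core functions from C++, a double-precision matrix multiply through the CBLAS interface (sizes illustrative; link against MKL with the link line appropriate for the system):

      // C = alpha*A*B + beta*C via MKL's CBLAS dgemm.
      #include <mkl.h>
      #include <vector>

      int main() {
        const int n = 512;
        std::vector<double> A(n * n, 1.0), B(n * n, 2.0), C(n * n, 0.0);
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n,
                    1.0, A.data(), n,
                         B.data(), n,
                    0.0, C.data(), n);
        return 0;
      }

    With the Intel compiler, linking typically amounts to adding -mkl (or -qmkl on newer versions) to the compile line.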

  14. Theorem Proving in Intel Hardware Design

    NASA Technical Reports Server (NTRS)

    O'Leary, John

    2009-01-01

    For the past decade, a framework combining model checking (symbolic trajectory evaluation) and higher-order logic theorem proving has been in production use at Intel. Our tools and methodology have been used to formally verify execution cluster functionality (including floating-point operations) for a number of Intel products, including the Pentium® 4 and Core™ i7 processors. Hardware verification in 2009 is much more challenging than it was in 1999 - today's CPU chip designs contain many processor cores and significant firmware content. This talk will attempt to distill the lessons learned over the past ten years, discuss how they apply to today's problems, and outline some future directions.

  15. Scaling Support Vector Machines On Modern HPC Platforms

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    You, Yang; Fu, Haohuan; Song, Shuaiwen

    2015-02-01

    We designed and implemented MIC-SVM, a highly efficient parallel SVM for x86 based multicore and many-core architectures, such as the Intel Ivy Bridge CPUs and Intel Xeon Phi co-processor (MIC). We propose various novel analysis methods and optimization techniques to fully utilize the multilevel parallelism provided by these architectures and serve as general optimization methods for other machine learning tools.

  16. Fast multipurpose Monte Carlo simulation for proton therapy using multi- and many-core CPU architectures.

    PubMed

    Souris, Kevin; Lee, John Aldo; Sterpin, Edmond

    2016-04-01

    Accuracy in proton therapy treatment planning can be improved using Monte Carlo (MC) simulations. However, the long computation time of such methods hinders their use in clinical routine. This work aims to develop a fast multipurpose Monte Carlo simulation tool for proton therapy using massively parallel central processing unit (CPU) architectures. A new Monte Carlo code, called MCsquare (many-core Monte Carlo), has been designed and optimized for the last generation of Intel Xeon processors and Intel Xeon Phi coprocessors. These massively parallel architectures offer the flexibility and the computational power suitable to MC methods. The class-II condensed history algorithm of MCsquare provides a fast and yet accurate method of simulating heavy charged particles such as protons, deuterons, and alphas inside voxelized geometries. Hard ionizations, with energy losses above a user-specified threshold, are simulated individually, while soft events are regrouped in a multiple scattering theory. Elastic and inelastic nuclear interactions are sampled from ICRU 63 differential cross sections, thereby allowing for the computation of prompt gamma emission profiles. MCsquare has been benchmarked with the gate/geant4 Monte Carlo application for homogeneous and heterogeneous geometries. Comparisons with gate/geant4 for various geometries show deviations within 2%-1 mm. In spite of the limited memory bandwidth of the coprocessor, simulation time is below 25 s for 10^7 primary 200 MeV protons in average soft tissues using all Xeon Phi and CPU resources embedded in a single desktop unit. MCsquare exploits the flexibility of CPU architectures to provide a multipurpose MC simulation tool. Optimized code enables the use of accurate MC calculation within a reasonable computation time, adequate for clinical practice. MCsquare also simulates prompt gamma emission and can thus be used for in vivo range verification.

  17. Fast multipurpose Monte Carlo simulation for proton therapy using multi- and many-core CPU architectures

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Souris, Kevin, E-mail: kevin.souris@uclouvain.be; Lee, John Aldo; Sterpin, Edmond

    2016-04-15

    Purpose: Accuracy in proton therapy treatment planning can be improved using Monte Carlo (MC) simulations. However, the long computation time of such methods hinders their use in clinical routine. This work aims to develop a fast multipurpose Monte Carlo simulation tool for proton therapy using massively parallel central processing unit (CPU) architectures. Methods: A new Monte Carlo code, called MCsquare (many-core Monte Carlo), has been designed and optimized for the last generation of Intel Xeon processors and Intel Xeon Phi coprocessors. These massively parallel architectures offer the flexibility and the computational power suitable to MC methods. The class-II condensed history algorithm of MCsquare provides a fast and yet accurate method of simulating heavy charged particles such as protons, deuterons, and alphas inside voxelized geometries. Hard ionizations, with energy losses above a user-specified threshold, are simulated individually, while soft events are regrouped in a multiple scattering theory. Elastic and inelastic nuclear interactions are sampled from ICRU 63 differential cross sections, thereby allowing for the computation of prompt gamma emission profiles. MCsquare has been benchmarked with the GATE/GEANT4 Monte Carlo application for homogeneous and heterogeneous geometries. Results: Comparisons with GATE/GEANT4 for various geometries show deviations within 2%–1 mm. In spite of the limited memory bandwidth of the coprocessor, simulation time is below 25 s for 10^7 primary 200 MeV protons in average soft tissues using all Xeon Phi and CPU resources embedded in a single desktop unit. Conclusions: MCsquare exploits the flexibility of CPU architectures to provide a multipurpose MC simulation tool. Optimized code enables the use of accurate MC calculation within a reasonable computation time, adequate for clinical practice. MCsquare also simulates prompt gamma emission and can thus be used for in vivo range verification.

  18. High-performance sparse matrix-matrix products on Intel KNL and multicore architectures

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Nagasaka, Y; Matsuoka, S; Azad, A

    Sparse matrix-matrix multiplication (SpGEMM) is a computational primitive that is widely used in areas ranging from traditional numerical applications to recent big data analysis and machine learning. Although many SpGEMM algorithms have been proposed, hardware-specific optimizations for multi- and many-core processors are lacking, and a detailed analysis of their performance under various use cases and matrices is not available. We first identify and mitigate multiple bottlenecks with memory management and thread scheduling on Intel Xeon Phi (Knights Landing or KNL). Specifically targeting multi- and many-core processors, we develop a hash-table-based algorithm and optimize a heap-based shared-memory SpGEMM algorithm. We examine their performance together with other publicly available codes. Different from the literature, our evaluation also includes use cases that are representative of real graph algorithms, such as multi-source breadth-first search or triangle counting. Our hash-table and heap-based algorithms show significant speedups over existing libraries in the majority of cases, while different algorithms dominate the other scenarios depending on matrix size, sparsity, compression factor, and operation type. We distill these in-depth evaluation results into a recipe for choosing the best SpGEMM algorithm for a target scenario. A critical finding is that hash-table-based SpGEMM gets a significant performance boost if the nonzeros are not required to be sorted within each row of the output matrix.
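
    The hash-table-based approach accumulates each output row in a table keyed by column index. A simplified sketch using a standard container in place of the paper's tuned open-addressing table (CSR inputs; names illustrative):

      // Sketch of hash-accumulated SpGEMM, C = A*B, one output row at a
      // time (Gustavson's algorithm). A real implementation would use a
      // tuned open-addressing table and two passes to build CSR output.
      #include <unordered_map>
      #include <vector>

      void spgemm_row(int i,
                      const std::vector<int>& Arp, const std::vector<int>& Aci,
                      const std::vector<double>& Av,
                      const std::vector<int>& Brp, const std::vector<int>& Bci,
                      const std::vector<double>& Bv,
                      std::unordered_map<int, double>& row) {
        row.clear();
        for (int p = Arp[i]; p < Arp[i + 1]; ++p) {
          const int k = Aci[p];
          const double a = Av[p];
          for (int q = Brp[k]; q < Brp[k + 1]; ++q)
            row[Bci[q]] += a * Bv[q];   // hash accumulation of C(i, :)
        }
      }

    Note how the accumulated row comes out unsorted; the finding quoted above is that skipping the per-row sort is precisely where the hash variant gains its edge.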

  19. WinHPC System Configuration | High-Performance Computing | NREL

    Science.gov Websites

    CPUs with 48GB of memory. Node 04 has dual Intel Xeon E5530 CPUs with 24GB of memory. Nodes 05-20 have dual AMD Opteron 2374 HE CPUs with 16GB of memory. Nodes 21-30 have been decommissioned. Nodes 31-35 have dual Intel Xeon X5675 CPUs with 48GB of memory. Nodes 36-37 have dual Intel Xeon E5-2680 CPUs with

  20. Performance optimization of Qbox and WEST on Intel Knights Landing

    NASA Astrophysics Data System (ADS)

    Zheng, Huihuo; Knight, Christopher; Galli, Giulia; Govoni, Marco; Gygi, Francois

    We present the optimization of the electronic structure codes Qbox and WEST targeting the Intel® Xeon Phi™ processor, code-named Knights Landing (KNL). Qbox is an ab initio molecular dynamics code based on plane-wave density functional theory (DFT), and WEST is a post-DFT code for excited-state calculations within many-body perturbation theory. Both Qbox and WEST employ highly scalable algorithms which enable accurate large-scale electronic structure calculations on leadership-class supercomputer platforms beyond 100,000 cores, such as Mira and Theta at the Argonne Leadership Computing Facility. In this work, features of the KNL architecture (e.g. hierarchical memory) are explored to achieve higher performance in key algorithms of the Qbox and WEST codes and to develop a road-map for further development targeting next-generation computing architectures. In particular, the optimizations of the Qbox and WEST codes on the KNL platform will target efficient large-scale electronic structure calculations of nanostructured materials exhibiting complex structures and the prediction of their electronic and thermal properties for use in solar and thermal energy conversion devices. This work was supported by MICCoM, as part of the Comp. Mats. Sci. Program funded by the U.S. DOE, Office of Sci., BES, MSE Division. This research used resources of the ALCF, which is a DOE Office of Sci. User Facility under Contract DE-AC02-06CH11357.

  1. Interactive high-resolution isosurface ray casting on multicore processors.

    PubMed

    Wang, Qin; JaJa, Joseph

    2008-01-01

    We present a new method for the interactive rendering of isosurfaces using ray casting on multi-core processors. The method consists of a combination of an object-order traversal that coarsely identifies possible candidate 3D data blocks for each small set of contiguous pixels, and an isosurface ray casting strategy tailored for the resulting limited-size lists of candidate 3D data blocks. While static screen partitioning is widely used in the literature, our scheme performs dynamic allocation of groups of ray casting tasks to ensure almost equal loads among the different threads running on the cores while maintaining spatial locality. We also make careful use of the memory management environment commonly present in multi-core processors. We test our system on a two-processor Clovertown platform, each processor being a quad-core 1.86-GHz Intel Xeon, on a number of widely different benchmarks. The detailed experimental results show that our system is efficient and scalable, and achieves high cache performance and excellent load balancing, resulting in an overall performance that is superior to any of the previous algorithms. In fact, we achieve interactive isosurface rendering on a 1024² screen for all the datasets tested, up to the maximum size of the main memory of our platform.
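
    The dynamic allocation scheme can be pictured as threads pulling small groups of contiguous screen tiles from a shared atomic counter (a minimal sketch with hypothetical names, not the authors' implementation):

        #include <algorithm>
        #include <atomic>
        #include <thread>
        #include <vector>

        void renderTile(int /*tile*/) { /* per-tile isosurface ray casting */ }

        void renderAll(int nTiles, int groupSize, int nThreads) {
            std::atomic<int> next{0};
            std::vector<std::thread> pool;
            for (int t = 0; t < nThreads; ++t)
                pool.emplace_back([&] {
                    for (;;) {
                        const int first = next.fetch_add(groupSize);
                        if (first >= nTiles) break;            // no work left
                        const int last = std::min(first + groupSize, nTiles);
                        for (int i = first; i < last; ++i)
                            renderTile(i);  // contiguous tiles keep locality
                    }
                });
            for (auto& th : pool) th.join();
        }

    Small groups balance the load across threads, while keeping each group contiguous preserves the spatial locality the abstract emphasizes.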

  2. Using OpenMP vs. Threading Building Blocks for Medical Imaging on Multi-cores

    NASA Astrophysics Data System (ADS)

    Kegel, Philipp; Schellmann, Maraike; Gorlatch, Sergei

    We compare two parallel programming approaches for multi-core systems: the well-known OpenMP and the recently introduced Threading Building Blocks (TBB) library by Intel®. The comparison is made using the parallelization of a real-world numerical algorithm for medical imaging. We develop several parallel implementations, and compare them w.r.t. programming effort, programming style and abstraction, and runtime performance. We show that TBB requires a considerable program re-design, whereas with OpenMP simple compiler directives are sufficient. While TBB appears to be less appropriate for parallelizing existing implementations, it fosters a good programming style and higher abstraction level for newly developed parallel programs. Our experimental measurements on a dual quad-core system demonstrate that OpenMP slightly outperforms TBB in our implementation.
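
    The stylistic difference the authors describe can be seen on a toy loop (a generic sketch, not the imaging code): OpenMP annotates an existing loop with a directive, while TBB requires restructuring the body into a lambda passed to parallel_for:

        #include <cstddef>
        #include <vector>
        #include <tbb/blocked_range.h>
        #include <tbb/parallel_for.h>

        void processPixel(std::vector<float>& img, std::size_t i) { img[i] *= 2.0f; }

        void withOpenMP(std::vector<float>& img) {
            #pragma omp parallel for  // a single directive on the existing loop
            for (long i = 0; i < static_cast<long>(img.size()); ++i)
                processPixel(img, static_cast<std::size_t>(i));
        }

        void withTBB(std::vector<float>& img) {
            tbb::parallel_for(tbb::blocked_range<std::size_t>(0, img.size()),
                [&](const tbb::blocked_range<std::size_t>& r) {
                    for (std::size_t i = r.begin(); i != r.end(); ++i)
                        processPixel(img, i);  // body moved into a lambda
                });
        }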

  3. High performance in silico virtual drug screening on many-core processors.

    PubMed

    McIntosh-Smith, Simon; Price, James; Sessions, Richard B; Ibarra, Amaurys A

    2015-05-01

    Drug screening is an important part of the drug development pipeline for the pharmaceutical industry. Traditional, lab-based methods are increasingly being augmented with computational methods, ranging from simple molecular similarity searches through more complex pharmacophore matching to more computationally intensive approaches, such as molecular docking. The latter simulates the binding of drug molecules to their targets, typically protein molecules. In this work, we describe BUDE, the Bristol University Docking Engine, which has been ported to the OpenCL industry standard parallel programming language in order to exploit the performance of modern many-core processors. Our highly optimized OpenCL implementation of BUDE sustains 1.43 TFLOP/s on a single Nvidia GTX 680 GPU, or 46% of peak performance. BUDE also exploits OpenCL to deliver effective performance portability across a broad spectrum of different computer architectures from different vendors, including GPUs from Nvidia and AMD, Intel's Xeon Phi and multi-core CPUs with SIMD instruction sets.

  4. A polyphase filter for many-core architectures

    NASA Astrophysics Data System (ADS)

    Adámek, K.; Novotný, J.; Armour, W.

    2016-07-01

    In this article we discuss our implementation of a polyphase filter for real-time data processing in radio astronomy. The polyphase filter is a standard tool in digital signal processing and as such a well-established algorithm. We describe in detail our implementation of the polyphase filter algorithm and its behaviour on three generations of NVIDIA GPU cards (Fermi, Kepler, Maxwell) and on the Intel Xeon CPU and Xeon Phi (Knights Corner) platforms. All of our implementations aim to exploit the potential for data reuse that the algorithm offers. Our GPU implementations explore two different methods for achieving this: the first makes use of the L1/texture cache, the second uses shared memory. We discuss the usability of each of our implementations along with their behaviour. We measure performance in execution time, which is a critical factor for real-time systems; we also present results in terms of bandwidth (GB/s), compute (GFLOP/s) and type conversions (GTc/s). We include a presentation of our results in terms of the sample rate which can be processed in real time by a chosen platform, which more intuitively describes the expected performance in a signal processing setting. Our findings show that, for the GPUs considered, the performance of our polyphase filter when using lower-precision input data is limited by type conversions rather than device bandwidth. We compare these results to an implementation on the Xeon Phi. We show that our Xeon Phi implementation has a performance that is 1.5× to 1.92× greater than our CPU implementation, but that this is still not sufficient to compete with the performance of the GPUs. We conclude with a comparison of our best performing code to two other implementations of the polyphase filter, showing that our implementation is faster in nearly all cases. This work forms part of the Astro-Accelerate project, a many-core accelerated real-time data processing library for digital signal processing of time-domain radio astronomy data.
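
    The FIR front end of a polyphase filter bank can be sketched as follows (the FFT stage that normally follows is omitted; the coefficient layout and all names are illustrative assumptions, not the Astro-Accelerate code):

        #include <cstddef>
        #include <vector>

        // x: input samples, length >= (nSpectra + T - 1) * C
        // h: filter taps laid out as h[t * C + c] for C channels, T taps
        // returns y[s * C + c], the filtered sample of spectrum s, channel c
        std::vector<float> pfbFir(const std::vector<float>& x,
                                  const std::vector<float>& h,
                                  std::size_t C, std::size_t T,
                                  std::size_t nSpectra) {
            std::vector<float> y(nSpectra * C, 0.0f);
            for (std::size_t s = 0; s < nSpectra; ++s)
                for (std::size_t c = 0; c < C; ++c) {
                    float acc = 0.0f;
                    for (std::size_t t = 0; t < T; ++t)
                        acc += x[(s + t) * C + c] * h[t * C + c];
                    y[s * C + c] = acc;  // neighbouring spectra reuse most of x:
                }                        // the data reuse the abstract exploits
            return y;
        }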

  5. Multi-Kepler GPU vs. multi-Intel MIC for spin systems simulations

    NASA Astrophysics Data System (ADS)

    Bernaschi, M.; Bisson, M.; Salvadore, F.

    2014-10-01

    We present and compare the performance of two many-core architectures, the NVIDIA Kepler and the Intel MIC, both in a single system and in cluster configuration, for the simulation of spin systems. As a benchmark we consider the time required to update a single spin of the 3D Heisenberg spin glass model by using the over-relaxation algorithm. We also present data for a traditional high-end multi-core architecture, the Intel Sandy Bridge. The results show that although basically the same code can be used on the two Intel architectures, the performance of the Intel MIC changes dramatically depending on (apparently) minor details. Another issue is that to obtain reasonable scalability with the Intel Phi coprocessor (Phi is the coprocessor that implements the MIC architecture) in a cluster configuration, it is necessary to use the so-called offload mode, which reduces the performance of the single system. As for the GPU, the Kepler architecture offers a clear advantage with respect to the previous Fermi architecture while maintaining exactly the same source code. Scalability of the multi-GPU implementation remains very good when using the CPU as a communication co-processor of the GPU. All source codes are provided for inspection and for double-checking the results.

  6. Real-time dedispersion for fast radio transient surveys, using auto tuning on many-core accelerators

    NASA Astrophysics Data System (ADS)

    Sclocco, A.; van Leeuwen, J.; Bal, H. E.; van Nieuwpoort, R. V.

    2016-01-01

    Dedispersion, the removal of deleterious smearing of impulsive signals by the interstellar matter, is one of the most intensive processing steps in any radio survey for pulsars and fast transients. We here present a study of the parallelization of this algorithm on many-core accelerators, including GPUs from AMD and NVIDIA, and the Intel Xeon Phi. We find that dedispersion is inherently memory-bound. Even in a perfect scenario, hardware limitations keep the arithmetic intensity low, thus limiting performance. We next exploit auto-tuning to adapt dedispersion to different accelerators, observations, and even telescopes. We demonstrate that the optimal settings differ between observational setups, and that auto-tuning significantly improves performance. This impacts time-domain surveys from Apertif to SKA.
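
    The auto-tuning loop amounts to timing a kernel over a set of candidate configurations and keeping the fastest (a generic sketch; the kernel and its parameter are placeholders, not the authors' tuner):

        #include <chrono>
        #include <vector>

        void dedisperseKernel(int /*threadsPerItem*/) { /* tunable kernel launch */ }

        int autoTune(const std::vector<int>& candidates) {
            using clk = std::chrono::steady_clock;
            int best = candidates.front();
            auto bestTime = clk::duration::max();
            for (int c : candidates) {
                const auto t0 = clk::now();
                dedisperseKernel(c);        // in practice, average several runs
                const auto dt = clk::now() - t0;
                if (dt < bestTime) { bestTime = dt; best = c; }
            }
            return best;  // the winner differs per device, observation, telescope
        }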

  7. OpenMP GNU and Intel Fortran programs for solving the time-dependent Gross-Pitaevskii equation

    NASA Astrophysics Data System (ADS)

    Young-S., Luis E.; Muruganandam, Paulsamy; Adhikari, Sadhan K.; Lončar, Vladimir; Vudragović, Dušan; Balaž, Antun

    2017-11-01

    description of the output files is given [2]. A readme.txt file, included in the root directory, explains the procedure to compile and run the programs. We tested our programs on a workstation with two 10-core Intel Xeon E5-2650 v3 CPUs. The parameters used for testing are given in sample input files, provided in the corresponding directory together with the programs. In Table 1 we present wall-clock execution times for runs on 1, 6, and 19 CPU cores for programs compiled using Intel and GNU Fortran compilers. The corresponding columns "Intel speedup" and "GNU speedup" give the ratio of wall-clock execution times of runs on 1 and 19 CPU cores, and denote the actual measured speedup for 19 CPU cores. In all cases and for all numbers of CPU cores, although the GNU Fortran compiler gives excellent results, the Intel Fortran compiler turns out to be slightly faster. Note that during these tests we always ran only a single simulation on a workstation at a time, to avoid any possible interference issues. Therefore, the obtained wall-clock times are more reliable than the ones that could be measured with two or more jobs running simultaneously. We also studied the speedup of the programs as a function of the number of CPU cores used. The performance of the Intel and GNU Fortran compilers is illustrated in Fig. 1, where we plot the speedup and actual wall-clock times as functions of the number of CPU cores for 2d and 3d programs. We see that the speedup increases monotonically with the number of CPU cores in all cases and has large values (between 10 and 14 for 3d programs) for the maximal number of cores. This fully justifies the development of OpenMP programs, which enable much faster and more efficient solving of the GP equation. However, a slow saturation in the speedup with the further increase in the number of CPU cores is observed in all cases, as expected. The speedup tends to increase for programs in higher dimensions, as they become more complex and have to process more

  8. A comparison of SuperLU solvers on the intel MIC architecture

    NASA Astrophysics Data System (ADS)

    Tuncel, Mehmet; Duran, Ahmet; Celebi, M. Serdar; Akaydin, Bora; Topkaya, Figen O.

    2016-10-01

    In many science and engineering applications, problems may result in solving a sparse linear system AX=B. For example, SuperLU_MCDT, a linear solver, was used for the large penta-diagonal matrices arising from 2D problems and the hepta-diagonal matrices from 3D problems in incompressible blood flow simulation (see [1]). It is important to test the status and potential improvements of state-of-the-art solvers on new technologies. In this work, sequential, multithreaded and distributed versions of the SuperLU solvers (see [2]) are examined on Intel Xeon Phi coprocessors using the offload programming model at the EURORA cluster of CINECA in Italy. We consider a portfolio of test matrices containing patterned matrices from UFMM ([3]) and randomly located matrices. This architecture can benefit from high parallelism and large vectors. We find that the sequential SuperLU gained up to a 45% performance improvement from offload programming, depending on the sparse matrix type and the size of transferred and processed data.
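
    For reference, the offload programming model mentioned above has roughly this shape with the Intel compiler's explicit-offload pragmas (a generic sketch, not the SuperLU code; the array names are placeholders):

        // Requires an offload-capable Intel compiler; runs the loop on
        // coprocessor 0, copying a in and b back out over PCIe.
        void scaleOnMic(const double* a, double* b, int n) {
            #pragma offload target(mic:0) in(a:length(n)) out(b:length(n))
            {
                #pragma omp parallel for
                for (int i = 0; i < n; ++i)
                    b[i] = 2.0 * a[i];
            }
        }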

  9. Application Performance Analysis and Efficient Execution on Systems with multi-core CPUs, GPUs and MICs: A Case Study with Microscopy Image Analysis

    PubMed Central

    Teodoro, George; Kurc, Tahsin; Andrade, Guilherme; Kong, Jun; Ferreira, Renato; Saltz, Joel

    2015-01-01

    We carry out a comparative performance study of multi-core CPUs, GPUs and Intel Xeon Phi (Many Integrated Core-MIC) with a microscopy image analysis application. We experimentally evaluate the performance of computing devices on core operations of the application. We correlate the observed performance with the characteristics of computing devices and data access patterns, computation complexities, and parallelization forms of the operations. The results show a significant variability in the performance of operations with respect to the device used. The performances of operations with regular data access are comparable or sometimes better on a MIC than that on a GPU. GPUs are more efficient than MICs for operations that access data irregularly, because of the lower bandwidth of the MIC for random data accesses. We propose new performance-aware scheduling strategies that consider variabilities in operation speedups. Our scheduling strategies significantly improve application performance compared to classic strategies in hybrid configurations. PMID:28239253

  10. Performance Study of Monte Carlo Codes on Xeon Phi Coprocessors — Testing MCNP 6.1 and Profiling ARCHER Geometry Module on the FS7ONNi Problem

    NASA Astrophysics Data System (ADS)

    Liu, Tianyu; Wolfe, Noah; Lin, Hui; Zieb, Kris; Ji, Wei; Caracappa, Peter; Carothers, Christopher; Xu, X. George

    2017-09-01

    This paper contains two parts revolving around Monte Carlo transport simulation on Intel Many Integrated Core coprocessors (MIC, also known as Xeon Phi). (1) MCNP 6.1 was recompiled into multithreading (OpenMP) and multiprocessing (MPI) forms, respectively, without modification to the source code. The new codes were tested on a 60-core 5110P MIC. The test case was FS7ONNi, a radiation shielding problem used in MCNP's verification and validation suite. It was observed that both codes became slower on the MIC than on a 6-core X5650 CPU, by a factor of 4 for the MPI code and, abnormally, 20 for the OpenMP code, and both exhibited limited strong-scaling capability. (2) We have recently added a Constructive Solid Geometry (CSG) module to our ARCHER code to provide better support for geometry modelling in radiation shielding simulation. The functions of this module are frequently called in the particle random walk process. To identify the performance bottleneck we developed a CSG proxy application and profiled the code using the geometry data from FS7ONNi. The profiling data showed that the code was primarily memory-latency bound on the MIC. This study suggests that despite the low initial porting effort, Monte Carlo codes do not naturally lend themselves to the MIC platform, just as with GPUs, and that the memory latency problem needs to be addressed in order to achieve a decent performance gain.

  11. Parallel Application Performance on Two Generations of Intel Xeon HPC Platforms

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Chang, Christopher H.; Long, Hai; Sides, Scott

    2015-10-15

    Two next-generation node configurations hosting the Haswell microarchitecture were tested with a suite of microbenchmarks and application examples, and compared with a current Ivy Bridge production node on NREL's Peregrine high-performance computing cluster. A primary conclusion from this study is that the additional cores are of little value to individual task performance: limitations to application parallelism, or resource contention among concurrently running but independent tasks, limit effective utilization of the added cores. Hyperthreading generally impacts throughput negatively, but can improve performance in the absence of detailed attention to runtime workflow configuration. The observations offer some guidance for procurement of future HPC systems at NREL. First, raw core count must be balanced with available resources, particularly memory bandwidth. Balance-of-system will determine value more than processor capability alone. Second, hyperthreading continues to be largely irrelevant to the workloads that are commonly seen, and were tested here, at NREL. Finally, perhaps the most impactful enhancement to productivity might occur through enabling multiple concurrent jobs per node. Given the right type and size of workload, more may be achieved by doing many slow things at once than fast things in order.

  12. Semiconductor Ion Implanters

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    MacKinnon, Barry A.; Ruffell, John P.

    In 1953 the Raytheon CK722 transistor was priced at $7.60. Based upon this, an Intel Xeon Quad Core processor containing 820,000,000 transistors should list at $6.2 billion. Particle accelerator technology plays an important part in the remarkable story of why that Intel product can be purchased today for a few hundred dollars. Most people of the mid twentieth century would be astonished at the ubiquity of semiconductors in the products we now buy and use every day. Though relatively expensive in the nineteen fifties they now exist in a wide range of items from high-end multicore microprocessors like the Intel product to disposable items containing 'only' hundreds or thousands like RFID chips and talking greeting cards. This historical development has been fueled by continuous advancement of the several individual technologies involved in the production of semiconductor devices including Ion Implantation and the charged particle beamlines at the heart of implant machines. In the course of its 40 year development, the worldwide implanter industry has reached annual sales levels around $2B, installed thousands of dedicated machines and directly employs thousands of workers. It represents in all these measures, as much and possibly more than any other industrial application of particle accelerator technology. This presentation discusses the history of implanter development. It touches on some of the people involved and on some of the developmental changes and challenges imposed as the requirements of the semiconductor industry evolved.

  13. MILC Code Performance on High End CPU and GPU Supercomputer Clusters

    NASA Astrophysics Data System (ADS)

    DeTar, Carleton; Gottlieb, Steven; Li, Ruizi; Toussaint, Doug

    2018-03-01

    With recent developments in parallel supercomputing architecture, many core, multi-core, and GPU processors are now commonplace, resulting in more levels of parallelism, memory hierarchy, and programming complexity. It has been necessary to adapt the MILC code to these new processors starting with NVIDIA GPUs, and more recently, the Intel Xeon Phi processors. We report on our efforts to port and optimize our code for the Intel Knights Landing architecture. We consider performance of the MILC code with MPI and OpenMP, and optimizations with QOPQDP and QPhiX. For the latter approach, we concentrate on the staggered conjugate gradient and gauge force. We also consider performance on recent NVIDIA GPUs using the QUDA library.

  14. Benchmarking hardware architecture candidates for the NFIRAOS real-time controller

    NASA Astrophysics Data System (ADS)

    Smith, Malcolm; Kerley, Dan; Herriot, Glen; Véran, Jean-Pierre

    2014-07-01

    As a part of the trade study for the Narrow Field Infrared Adaptive Optics System, the adaptive optics system for the Thirty Meter Telescope, we investigated the feasibility of performing real-time control computation using a Linux operating system and Intel Xeon E5 CPUs. We also investigated a Xeon Phi based architecture which allows higher levels of parallelism. This paper summarizes both the CPU based real-time controller architecture and the Xeon Phi based RTC. The Intel Xeon E5 CPU solution meets the requirements and performs the computation for one AO cycle in an average of 767 microseconds. The Xeon Phi solution did not meet the 1200 microsecond time requirement and also suffered from unpredictable execution times. More detailed benchmark results are reported for both architectures.

  15. 77 FR 65582 - Quad Graphics, Inc., Including Workers Whose Wages Were Reported Under Quad Graphics Printing...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2012-10-29

    ...., Including Workers Whose Wages Were Reported Under Quad Graphics Printing Corp. and Quad Logistics Services... Logistics Services. The intent of the Department's certification is to include all workers of the subject... were reported under Quad Graphics Printing Corp. and Quad Logistics Services (TA-W-73,441H), who became...

  16. A programming framework for data streaming on the Xeon Phi

    NASA Astrophysics Data System (ADS)

    Chapeland, S.; ALICE Collaboration

    2017-10-01

    ALICE (A Large Ion Collider Experiment) is the dedicated heavy-ion detector studying the physics of strongly interacting matter and the quark-gluon plasma at the CERN LHC (Large Hadron Collider). After the second long shut-down of the LHC, the ALICE detector will be upgraded to cope with an interaction rate of 50 kHz in Pb-Pb collisions, producing in the online computing system (O2) a sustained throughput of 3.4 TB/s. This data will be processed on the fly so that the stream to permanent storage does not exceed 90 GB/s peak, the raw data being discarded. In the context of assessing different computing platforms for the O2 system, we have developed a framework for the Intel Xeon Phi processors (MIC). It provides the components to build a processing pipeline streaming the data from the PC memory to a pool of permanent threads running on the MIC, and back to the host after processing. It is based on explicit offloading mechanisms (data transfer, asynchronous tasks) and basic building blocks (FIFOs, memory pools, C++11 threads). The user only needs to implement the processing method to be run on the MIC. We present in this paper the architecture, implementation, and performance of this system.
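
    The host-side building blocks named above (FIFOs, C++11 threads) can be sketched as a minimal blocking queue; the real framework adds memory pools and the explicit offload to the MIC:

        #include <condition_variable>
        #include <mutex>
        #include <queue>
        #include <utility>

        template <typename T>
        class Fifo {
            std::queue<T> q_;
            std::mutex m_;
            std::condition_variable cv_;
        public:
            void push(T v) {
                { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(v)); }
                cv_.notify_one();  // wake one consumer stage
            }
            T pop() {  // blocks until a data block is available
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return !q_.empty(); });
                T v = std::move(q_.front());
                q_.pop();
                return v;
            }
        };

    Chaining two such FIFOs around a pool of persistent worker threads yields the memory-to-MIC-to-host pipeline the abstract describes.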

  17. Underwater Threat Source Localization: Processing Sensor Network TDOAs with a Terascale Optical Core Device

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Barhen, Jacob; Imam, Neena

    2007-01-01

    Revolutionary computing technologies are defined in terms of technological breakthroughs, which leapfrog over near-term projected advances in conventional hardware and software to produce paradigm shifts in computational science. For underwater threat source localization using information provided by a dynamical sensor network, one of the most promising computational advances builds upon the emergence of digital optical-core devices. In this article, we present initial results of sensor network calculations that focus on the concept of signal wavefront time-difference-of-arrival (TDOA). The corresponding algorithms are implemented on the EnLight processing platform recently introduced by Lenslet Laboratories. This tera-scale digital optical core processor is optimized for array operations, which it performs in a fixed-point-arithmetic architecture. Our results (i) illustrate the ability to reach the required accuracy in the TDOA computation, and (ii) demonstrate that a considerable speed-up can be achieved when using the EnLight 64a prototype processor as compared to a dual Intel Xeon™ processor.

  18. Radiation Failures in Intel 14nm Microprocessors

    NASA Technical Reports Server (NTRS)

    Bossev, Dobrin P.; Duncan, Adam R.; Gadlage, Matthew J.; Roach, Austin H.; Kay, Matthew J.; Szabo, Carl; Berger, Tammy J.; York, Darin A.; Williams, Aaron; LaBel, K.

    2016-01-01

    In this study the 14 nm Intel Broadwell 5th-generation Core series 5005U-i3 and 5200U-i5 were mounted on Dell Inspiron laptops, MSI Cubi and Gigabyte Brix barebones and tested with Windows 8 and CentOS 7 at idle. Heavy-ion-induced hard and catastrophic failures do not appear to be related to the Intel 14nm Tri-Gate FinFET process. They originate from a small (9 μm × 140 μm) area on the 32nm planar PCH die, not the CPU as initially speculated. The hard failures seem to be due to a SEE, but the exact physical mechanism has yet to be identified. Some possibilities include latch-ups, charge ion trapping or implantation, ion channels, or a combination of those (in biased conditions). The mechanism of the catastrophic failures seems related to the presence of electric power (1.05V core voltage). The 1064 nm laser mimics ionizing radiation and induces soft and hard failures as a direct result of electron-hole pair production, not heat. The 14nm FinFET process continues to look promising for space radiation environments.

  19. Modern multicore and manycore architectures: Modelling, optimisation and benchmarking a multiblock CFD code

    NASA Astrophysics Data System (ADS)

    Hadade, Ioan; di Mare, Luca

    2016-08-01

    Modern multicore and manycore processors exhibit multiple levels of parallelism through a wide range of architectural features such as SIMD for data parallel execution or threads for core parallelism. The exploitation of multi-level parallelism is therefore crucial for achieving superior performance on current and future processors. This paper presents the performance tuning of a multiblock CFD solver on Intel SandyBridge and Haswell multicore CPUs and the Intel Xeon Phi Knights Corner coprocessor. Code optimisations have been applied on two computational kernels exhibiting different computational patterns: the update of flow variables and the evaluation of the Roe numerical fluxes. We discuss at great length the code transformations required for achieving efficient SIMD computations for both kernels across the selected devices including SIMD shuffles and transpositions for flux stencil computations and global memory transformations. Core parallelism is expressed through threading based on a number of domain decomposition techniques together with optimisations pertaining to alleviating NUMA effects found in multi-socket compute nodes. Results are correlated with the Roofline performance model in order to assert their efficiency for each distinct architecture. We report significant speedups for single thread execution across both kernels: 2-5X on the multicore CPUs and 14-23X on the Xeon Phi coprocessor. Computations at full node and chip concurrency deliver a factor of three speedup on the multicore processors and up to 24X on the Xeon Phi manycore coprocessor.
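
    The two levels of parallelism discussed here combine naturally in a loop such as the following simplified flow-variable update (illustrative only; the paper's kernels add explicit shuffles/intrinsics and NUMA-aware decomposition on top of this):

        #include <cstddef>
        #include <vector>

        void updateFlow(std::vector<double>& u, const std::vector<double>& flux,
                        double dtOverDx) {
            const long n = static_cast<long>(u.size());
            // threads across cells (core parallelism), SIMD lanes within each
            // thread's chunk (data parallelism)
            #pragma omp parallel for simd
            for (long i = 1; i < n - 1; ++i)
                u[i] -= dtOverDx * (flux[i] - flux[i - 1]);
        }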

  20. Exploiting MIC architectures for the simulation of channeling of charged particles in crystals

    NASA Astrophysics Data System (ADS)

    Bagli, Enrico; Karpusenko, Vadim

    2016-08-01

    Coherent effects of ultra-relativistic particles in crystals are an area of science under development. DYNECHARM++ is a toolkit for the simulation of coherent interactions between high-energy charged particles and complex crystal structures. The particle trajectory in a crystal is computed through numerical integration of the equation of motion. The code was revised and improved in order to exploit parallelization across multiple cores and vectorization of single instructions on multiple data. An Intel Xeon Phi card was adopted for the performance measurements. The computation time was shown to scale linearly as a function of the number of physical and virtual cores. By enabling the compiler's auto-vectorization flag, a threefold speedup was obtained. The performance of the card was compared to that of a dual Xeon system.

  1. Computational multicore on two-layer 1D shallow water equations for erodible dambreak

    NASA Astrophysics Data System (ADS)

    Simanjuntak, C. A.; Bagustara, B. A. R. H.; Gunawan, P. H.

    2018-03-01

    The simulation of an erodible dambreak using the two-layer shallow water equations and the SCHR scheme is elaborated in this paper. The results show that the two-layer SWE model is in good agreement with the experimental data obtained at the Université catholique de Louvain in Louvain-la-Neuve. Moreover, results for the parallel algorithm on multicore architectures are given. They show that Computer I, with an Intel® Core™ i5-2500 quad-core CPU, has the best performance in accelerating the computation, while Computer III, with an AMD A6-5200 APU quad-core processor, is observed to have higher speedup and efficiency. The speedup and efficiency of Computer III with 3200 grid points are 3.72 and 92.9%, respectively.

  2. Modeling & Analysis of Multicore Architectures for Embedded SIGINT Applications

    DTIC Science & Technology

    2015-03-01

    Device                     Cores    Clock (MHz)   Power (W)   Peak GFLOP/s   GFLOP/s/W
    NVIDIA Kepler K20 [7][8]   2496     706           225         3520           15.6
    Intel Xeon Phi 5110P [9]   60       1050          225         1010           4.5
    Adapteva Epiphany [10]     16-4K    800           0.270       19             70.4

    ...Cortex A15 and a Kepler GPU with 192 "CUDA" cores, and is more comparable as an HPEEC platform than Tesla series GPUs, such as the NVIDIA C2075 and K20

  3. Kalman Filter Tracking on Parallel Architectures

    NASA Astrophysics Data System (ADS)

    Cerati, Giuseppe; Elmer, Peter; Lantz, Steven; McDermott, Kevin; Riley, Dan; Tadel, Matevž; Wittich, Peter; Würthwein, Frank; Yagil, Avi

    2015-12-01

    Power density constraints are limiting the performance improvements of modern CPUs. To address this we have seen the introduction of lower-power, multi-core processors, but the future will be even more exciting. In order to stay within the power density limits but still obtain Moore's Law performance/price gains, it will be necessary to parallelize algorithms to exploit larger numbers of lightweight cores and specialized functions like large vector units. Example technologies today include Intel's Xeon Phi and GPGPUs. Track finding and fitting is one of the most computationally challenging problems for event reconstruction in particle physics. At the High Luminosity LHC, for example, this will be by far the dominant problem. The need for greater parallelism has driven investigations of very different track finding techniques including Cellular Automata or returning to Hough Transform. The most common track finding techniques in use today are however those based on the Kalman Filter [2]. Significant experience has been accumulated with these techniques on real tracking detector systems, both in the trigger and offline. They are known to provide high physics performance, are robust and are exactly those being used today for the design of the tracking system for HL-LHC. Our previous investigations showed that, using optimized data structures, track fitting with Kalman Filter can achieve large speedup both with Intel Xeon and Xeon Phi. We report here our further progress towards an end-to-end track reconstruction algorithm fully exploiting vectorization and parallelization techniques in a realistic simulation setup.

  4. Many-integrated core (MIC) technology for accelerating Monte Carlo simulation of radiation transport: A study based on the code DPM

    NASA Astrophysics Data System (ADS)

    Rodriguez, M.; Brualla, L.

    2018-04-01

    Monte Carlo simulation of radiation transport is computationally demanding if reasonably low statistical uncertainties of the estimated quantities are to be obtained. Therefore, it can benefit to a large extent from high-performance computing. This work is aimed at assessing the performance of the first generation of the many-integrated-core architecture (MIC) Xeon Phi coprocessor with respect to that of a CPU consisting of two 12-core Xeon processors in Monte Carlo simulation of coupled electron-photon showers. The comparison was made twofold: first, through a suite of basic tests, including parallel versions of the random number generators Mersenne Twister and a modified implementation of RANECU, addressed to establish a baseline comparison between both devices; second, through the pDPM code developed in this work. pDPM is a parallel version of the Dose Planning Method (DPM) program for fast Monte Carlo simulation of radiation transport in voxelized geometries. A variety of techniques aimed at obtaining large scalability on the Xeon Phi were implemented in pDPM. Maximum scalabilities of 84.2× and 107.5× were obtained on the Xeon Phi for simulations of electron and photon beams, respectively. Nevertheless, in none of the tests involving radiation transport did the Xeon Phi perform better than the CPU. The disadvantage of the Xeon Phi with respect to the CPU stems from the low performance of the former's single core: a single core of the Xeon Phi was more than 10 times less efficient than a single core of the CPU for all radiation transport simulations.

  5. GeantV: from CPU to accelerators

    NASA Astrophysics Data System (ADS)

    Amadio, G.; Ananya, A.; Apostolakis, J.; Arora, A.; Bandieramonte, M.; Bhattacharyya, A.; Bianchini, C.; Brun, R.; Canal, P.; Carminati, F.; Duhem, L.; Elvira, D.; Gheata, A.; Gheata, M.; Goulas, I.; Iope, R.; Jun, S.; Lima, G.; Mohanty, A.; Nikitina, T.; Novak, M.; Pokorski, W.; Ribon, A.; Sehgal, R.; Shadura, O.; Vallecorsa, S.; Wenzel, S.; Zhang, Y.

    2016-10-01

    The GeantV project aims to research and develop the next-generation simulation software describing the passage of particles through matter. While modern CPU architectures are being targeted first, resources such as GPGPUs, the Intel® Xeon Phi, Atom or ARM cannot be ignored anymore by HEP CPU-bound applications. The proof-of-concept GeantV prototype has been mainly engineered for CPUs with vector units, but we have foreseen from the early stages a bridge to arbitrary accelerators. A software layer consisting of architecture/technology-specific backends currently supports this concept. This approach makes it possible to abstract out basic types such as scalar/vector, and also to formalize generic computation kernels transparently using library- or device-specific constructs based on Vc, CUDA, Cilk+ or Intel intrinsics. While the main goal of this approach is portable performance, as a bonus it comes with the insulation of the core application and algorithms from the technology layer. This keeps our application maintainable in the long term and versatile to changes on the backend side. The paper presents the first results of basket-based GeantV geometry navigation on the Intel® Xeon Phi KNC architecture. We present the scalability and vectorization study, conducted using Intel performance tools, as well as our preliminary conclusions on the use of accelerators for GeantV transport. We also describe the current work and preliminary results for using the GeantV transport kernel on GPUs.

  6. GeantV: From CPU to accelerators

    DOE PAGES

    Amadio, G.; Ananya, A.; Apostolakis, J.; ...

    2016-01-01

    The GeantV project aims to research and develop the next-generation simulation software describing the passage of particles through matter. While modern CPU architectures are being targeted first, resources such as GPGPUs, the Intel® Xeon Phi, Atom or ARM cannot be ignored anymore by HEP CPU-bound applications. The proof-of-concept GeantV prototype has been mainly engineered for CPUs with vector units, but we have foreseen from the early stages a bridge to arbitrary accelerators. A software layer consisting of architecture/technology-specific backends currently supports this concept. This approach makes it possible to abstract out basic types such as scalar/vector, and also to formalize generic computation kernels transparently using library- or device-specific constructs based on Vc, CUDA, Cilk+ or Intel intrinsics. While the main goal of this approach is portable performance, as a bonus it comes with the insulation of the core application and algorithms from the technology layer. This keeps our application maintainable in the long term and versatile to changes on the backend side. The paper presents the first results of basket-based GeantV geometry navigation on the Intel® Xeon Phi KNC architecture. We present the scalability and vectorization study, conducted using Intel performance tools, as well as our preliminary conclusions on the use of accelerators for GeantV transport. Lastly, we also describe the current work and preliminary results for using the GeantV transport kernel on GPUs.

  7. Optimization of Selected Remote Sensing Algorithms for Embedded NVIDIA Kepler GPU Architecture

    NASA Technical Reports Server (NTRS)

    Riha, Lubomir; Le Moigne, Jacqueline; El-Ghazawi, Tarek

    2015-01-01

    This paper evaluates the potential of the embedded graphics processing unit in NVIDIA's Tegra K1 for onboard processing. The performance is compared to a general-purpose multi-core CPU and a full-fledged GPU accelerator. This study uses two algorithms: Wavelet Spectral Dimension Reduction of Hyperspectral Imagery and the Automated Cloud-Cover Assessment (ACCA) Algorithm. The Tegra K1 achieved 51 for the ACCA algorithm and 20 for the dimension reduction algorithm, as compared to the performance of the high-end 8-core server Intel Xeon CPU, which has 13.5 times higher power consumption.

  8. Chapter 13. Exploring Use of the Reserved Core

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Holmen, John; Humphrey, Alan; Berzins, Martin

    2015-07-29

    In this chapter, we illustrate the benefits of thinking in terms of thread management techniques when using a centralized scheduler model along with interoperability of MPI and PThreads. This is facilitated through an exploration of thread placement strategies for an algorithm modeling radiative heat transfer, with special attention to the 61st core. This algorithm plays a key role within the Uintah Computational Framework (UCF) and current efforts taking place at the University of Utah to model next-generation, large-scale clean coal boilers. In such simulations, this algorithm models the dominant form of heat transfer and consumes a large portion of compute time. Exemplified by a real-world example, this chapter presents our early efforts in porting a key portion of a scalability-centric codebase to the Intel Xeon Phi coprocessor. Specifically, this chapter presents results from our experiments profiling the native execution of a reverse Monte-Carlo ray-tracing-based radiation model on a single coprocessor. These results demonstrate that our fastest run configurations utilized the 61st core and that performance was not profoundly impacted when explicitly oversubscribing the coprocessor operating system thread. Additionally, this chapter presents a portion of the radiation model source code, a MIC-centric UCF cross-compilation example, and a less conventional thread management technique for developers utilizing the PThreads threading model.

  9. A fast CT reconstruction scheme for a general multi-core PC.

    PubMed

    Zeng, Kai; Bai, Erwei; Wang, Ge

    2007-01-01

    Expensive computational cost is a severe limitation in CT reconstruction for clinical applications that need real-time feedback. A primary example is bolus-chasing computed tomography (CT) angiography (BCA) that we have been developing for the past several years. To accelerate the reconstruction process using the filtered backprojection (FBP) method, specialized hardware or graphics cards can be used. However, specialized hardware is expensive and not flexible. The graphics processing unit (GPU) in a current graphics card can only reconstruct images at reduced precision and is not easy to program. In this paper, an acceleration scheme is proposed based on a multi-core PC. In the proposed scheme, several techniques are integrated, including utilization of geometric symmetry, optimization of data structures, single-instruction multiple-data (SIMD) processing, multithreaded computation, and the Intel C++ compiler. Our scheme maintains the original precision and involves no data exchange between the GPU and CPU. The merits of our scheme are demonstrated in numerical experiments against the traditional implementation. Our scheme achieves a speedup of about 40, which can be further improved severalfold using the latest quad-core processors.

  10. A Fast CT Reconstruction Scheme for a General Multi-Core PC

    PubMed Central

    Zeng, Kai; Bai, Erwei; Wang, Ge

    2007-01-01

    Expensive computational cost is a severe limitation in CT reconstruction for clinical applications that need real-time feedback. A primary example is bolus-chasing computed tomography (CT) angiography (BCA) that we have been developing for the past several years. To accelerate the reconstruction process using the filtered backprojection (FBP) method, specialized hardware or graphics cards can be used. However, specialized hardware is expensive and not flexible. The graphics processing unit (GPU) in a current graphics card can only reconstruct images at reduced precision and is not easy to program. In this paper, an acceleration scheme is proposed based on a multi-core PC. In the proposed scheme, several techniques are integrated, including utilization of geometric symmetry, optimization of data structures, single-instruction multiple-data (SIMD) processing, multithreaded computation, and the Intel C++ compiler. Our scheme maintains the original precision and involves no data exchange between the GPU and CPU. The merits of our scheme are demonstrated in numerical experiments against the traditional implementation. Our scheme achieves a speedup of about 40, which can be further improved severalfold using the latest quad-core processors. PMID:18256731

  11. Parallelized computation for computer simulation of electrocardiograms using personal computers with multi-core CPU and general-purpose GPU.

    PubMed

    Shen, Wenfeng; Wei, Daming; Xu, Weimin; Zhu, Xin; Yuan, Shizhong

    2010-10-01

    Biological computations like electrocardiological modelling and simulation usually require high-performance computing environments. This paper introduces an implementation of parallel computation for computer simulation of electrocardiograms (ECGs) in a personal computer environment with an Intel Core™2 Quad Q6600 CPU and a GeForce 8800GT GPU, with software support by OpenMP and CUDA. It was tested in three parallelization device setups: (a) a four-core CPU without a general-purpose GPU, (b) a general-purpose GPU plus 1 core of CPU, and (c) a four-core CPU plus a general-purpose GPU. To effectively take advantage of a multi-core CPU and a general-purpose GPU, an algorithm based on load-prediction dynamic scheduling was developed and applied to setup (c). In the simulation with 1600 time steps, the speedup of the parallel computation as compared to the serial computation was 3.9 in setup (a), 16.8 in setup (b), and 20.0 in setup (c). This study demonstrates that a current PC with a multi-core CPU and a general-purpose GPU provides a good environment for parallel computations in biological modelling and simulation studies. Copyright 2010 Elsevier Ireland Ltd. All rights reserved.
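
    The load-prediction idea can be sketched as splitting each batch of work in proportion to the devices' most recently observed throughputs (our own illustrative reduction, not the paper's scheduler):

        #include <cstddef>

        struct Split { std::size_t cpuPart, gpuPart; };

        // cpuRate/gpuRate: work items per second measured on the previous batch
        Split predictSplit(std::size_t total, double cpuRate, double gpuRate) {
            const double gpuShare = gpuRate / (cpuRate + gpuRate);
            const std::size_t gpu =
                static_cast<std::size_t>(gpuShare * total + 0.5);
            return { total - gpu, gpu };  // re-measured and re-split each step
        }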

  12. Capture and surveillance of quad-bike (ATV)-related injuries in administrative data collections.

    PubMed

    Mitchell, Rebecca J; Grzebieta, Raphael; Rechnitzer, George

    2016-09-01

    Identifying quad-bike-related injuries in administrative data collections can be problematic. This study sought to determine whether quad-bike-related injuries could be identified in routinely collected administrative data collections in New South Wales (NSW), Australia, and to determine the information recorded according to World Health Organization (WHO) injury surveillance guidelines that could assist injury prevention efforts. Five routinely collected administrative data collections in NSW in the period 2000-2012 were reviewed. The WHO core minimum data items recorded in each of the five data collections ranged from 37.5% to 75.0%. Age and sex of the injured individual were the only data items that were recorded in all data collections. The data collections did not contain detailed information on the circumstances of quad bike incidents. Major improvements are needed in the information collected in these data-sets, if their value is to be increased and used for injury prevention purposes.

  13. High Performance Computing and Visualization Infrastructure for Simultaneous Parallel Computing and Parallel Visualization Research

    DTIC Science & Technology

    2016-11-09

    Hardware configuration (from the award documentation): (2) Intel Xeon E5-2680 v3 2.5GHz, 30M Cache, 9.60GT/s QPI, Turbo, HT, 12C/24T (120W); Broadcom 5720 QP 1Gb Network Daughter Card

  14. Kalman Filter Tracking on Parallel Architectures

    NASA Astrophysics Data System (ADS)

    Cerati, Giuseppe; Elmer, Peter; Krutelyov, Slava; Lantz, Steven; Lefebvre, Matthieu; McDermott, Kevin; Riley, Daniel; Tadel, Matevž; Wittich, Peter; Würthwein, Frank; Yagil, Avi

    2016-11-01

    Power density constraints are limiting the performance improvements of modern CPUs. To address this we have seen the introduction of lower-power, multi-core processors such as GPGPU, ARM and Intel MIC. In order to achieve the theoretical performance gains of these processors, it will be necessary to parallelize algorithms to exploit larger numbers of lightweight cores and specialized functions like large vector units. Track finding and fitting is one of the most computationally challenging problems for event reconstruction in particle physics. At the High-Luminosity Large Hadron Collider (HL-LHC), for example, this will be by far the dominant problem. The need for greater parallelism has driven investigations of very different track finding techniques such as Cellular Automata or Hough Transforms. The most common track finding techniques in use today, however, are those based on a Kalman filter approach. Significant experience has been accumulated with these techniques on real tracking detector systems, both in the trigger and offline. They are known to provide high physics performance, are robust, and are in use today at the LHC. Given the utility of the Kalman filter in track finding, we have begun to port these algorithms to parallel architectures, namely Intel Xeon and Xeon Phi. We report here on our progress towards an end-to-end track reconstruction algorithm fully exploiting vectorization and parallelization techniques in a simplified experimental environment.
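
    At its core, the per-layer arithmetic of a Kalman filter is small and regular, which is what makes it amenable to vectorization across many track candidates at once. A one-dimensional sketch of one predict/update step (illustrative only; real trackers propagate small state vectors and covariance matrices):

        struct State { double x; double p; };  // estimate and its variance

        State kalmanStep(State s, double q, double z, double r) {
            // predict: trivial motion model; variance grows by process noise q
            const double xPred = s.x;
            const double pPred = s.p + q;
            // update: blend prediction with measurement z of variance r
            const double k = pPred / (pPred + r);  // Kalman gain
            return { xPred + k * (z - xPred), (1.0 - k) * pPred };
        }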

  15. Genten: Software for Generalized Tensor Decompositions v. 1.0.0

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Phipps, Eric T.; Kolda, Tamara G.; Dunlavy, Daniel

    Tensors, or multidimensional arrays, are a powerful mathematical means of describing multiway data. This software provides computational means for decomposing or approximating a given tensor in terms of smaller tensors of lower dimension, focusing on decomposition of large, sparse tensors. These techniques have applications in many scientific areas, including signal processing, linear algebra, computer vision, numerical analysis, data mining, graph analysis, neuroscience and more. The software is designed to take advantage of the parallelism present in emerging computer architectures such as multi-core CPUs, many-core accelerators such as the Intel Xeon Phi, and computation-oriented GPUs to enable efficient processing of large tensors.

  16. Curiosity Quad

    NASA Image and Video Library

    2012-08-09

    This image shows the quadrangle where NASA's Curiosity rover landed, within the expansive Gale Crater. The mission science team has divided the landing region into several square quadrangles, or quads, of interest, each about 1 mile (1.3 kilometers) wide.

  17. Using Intel's Knight Landing Processor to Accelerate Global Nested Air Quality Prediction Modeling System (GNAQPMS) Model

    NASA Astrophysics Data System (ADS)

    Wang, H.; Chen, H.; Chen, X.; Wu, Q.; Wang, Z.

    2016-12-01

    The Global Nested Air Quality Prediction Modeling System for Hg (GNAQPMS-Hg) is a global chemical transport model coupled with a mercury transport module to investigate mercury pollution. In this study, we present our work of porting the GNAQPMS model to the Intel Xeon Phi processor Knights Landing (KNL) to accelerate the model. KNL is the second-generation product of the Many Integrated Core (MIC) architecture. Compared with the first-generation Knights Corner (KNC), KNL has new hardware features and can be used as a standalone processor as well as a coprocessor alongside another CPU. Using the VTune tool, the high-overhead modules in the GNAQPMS model were identified, including the CBMZ gas chemistry, the advection and convection module, and the wet deposition module. These modules were accelerated by optimizing the code and using new features of KNL. The following optimization measures were taken: 1) changing the pure-MPI parallel mode to a hybrid MPI/OpenMP mode; 2) vectorizing the code to use the 512-bit-wide vector units; 3) reducing unnecessary memory accesses and computation; 4) reducing Thread Local Storage (TLS) for common variables within each OpenMP thread in CBMZ; 5) switching global communication from file writing and reading to MPI functions. After optimization, the performance of GNAQPMS increased greatly on both the CPU and KNL platforms: single-node tests showed that the optimized version achieved a 2.6x speedup on a two-socket CPU platform and a 3.3x speedup on a single-socket KNL platform compared with the baseline code, meaning the KNL delivered a 1.29x speedup over the two-socket CPU platform.
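
    The first optimization measure, moving from pure MPI to hybrid MPI/OpenMP, has this generic shape (a sketch, not the GNAQPMS source):

        #include <mpi.h>
        #include <omp.h>
        #include <cstdio>

        int main(int argc, char** argv) {
            int provided = 0;
            // funneled: only the master thread makes MPI calls
            MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
            int rank = 0;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            #pragma omp parallel
            {
                // each rank now exploits intra-node parallelism with threads
                std::printf("rank %d, thread %d of %d\n", rank,
                            omp_get_thread_num(), omp_get_num_threads());
            }
            MPI_Finalize();
            return 0;
        }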

  18. Analysis and Implementation of Particle-to-Particle (P2P) Graphics Processor Unit (GPU) Kernel for Black-Box Adaptive Fast Multipole Method

    DTIC Science & Technology

    2015-06-01

    5110P and 16 dx360M4 nodes each with one NVIDIA Kepler K20M/K40M GPU. Each node contained dual Intel Xeon E5-2670 (Sandy Bridge) central processing...kernel and as such does not employ multiple processors. This work makes use of a single processing core and a single NVIDIA Kepler K40 GK110...bandwidth (2 × 16 slot), 7.877 GFloat/s; Kepler K40 peak, 4,290 × 1 billion floating-point operations (GFLOPs), and 288 GB/s Kepler K40 memory

  19. Quantum Chemical Calculations Using Accelerators: Migrating Matrix Operations to the NVIDIA Kepler GPU and the Intel Xeon Phi.

    PubMed

    Leang, Sarom S; Rendell, Alistair P; Gordon, Mark S

    2014-03-11

    Increasingly, modern computer systems comprise a multicore general-purpose processor augmented with a number of special-purpose devices or accelerators connected via an external interface such as a PCI bus. The NVIDIA Kepler Graphical Processing Unit (GPU) and the Intel Phi are two examples of such accelerators. Accelerators offer peak performances that can be well above those of the host processor. How to exploit this heterogeneous environment for legacy application codes is not, however, straightforward. This paper considers how matrix operations in typical quantum chemical calculations can be migrated to the GPU and Phi systems. Double-precision general matrix multiply operations are endemic in electronic structure calculations, especially methods that include electron correlation, such as density functional theory, second-order perturbation theory, and coupled cluster theory. The use of approaches that automatically determine whether to use the host or an accelerator, based on problem size, is explored, with computations occurring on the accelerator and/or the host. For data transfers over PCIe, the GPU provides the best overall performance for data sizes up to 4096 MB, with consistent upload and download rates between 5-5.6 GB/s and 5.4-6.3 GB/s, respectively. The GPU outperforms the Phi for both square and nonsquare matrix multiplications.
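
    The size-based dispatch the authors explore reduces to a crossover test (a sketch with placeholder names and an uncalibrated threshold; the device path here is a stub where the offloaded call would go):

        #include <cstddef>

        // naive host fallback, standing in for an optimized BLAS call
        void dgemmHost(const double* a, const double* b, double* c, std::size_t n) {
            for (std::size_t i = 0; i < n; ++i)
                for (std::size_t j = 0; j < n; ++j) {
                    double s = 0.0;
                    for (std::size_t k = 0; k < n; ++k)
                        s += a[i * n + k] * b[k * n + j];
                    c[i * n + j] = s;
                }
        }

        // stand-in for the accelerator path (transfer + offload would go here)
        void dgemmDevice(const double* a, const double* b, double* c, std::size_t n) {
            dgemmHost(a, b, c, n);
        }

        void dgemmAuto(const double* a, const double* b, double* c, std::size_t n) {
            const std::size_t crossover = 1024;  // calibrated per system in practice
            if (n < crossover)
                dgemmHost(a, b, c, n);   // small problems: PCIe transfer dominates
            else
                dgemmDevice(a, b, c, n); // large problems: device amortizes transfer
        }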

  20. Quad Charts in the Classroom to Reinforce Technical Communication Fundamentals

    ERIC Educational Resources Information Center

    Ford, Julie Dyke; Wei, Tie

    2015-01-01

    Quad charts are a genre frequently used in scientific and technical environments, yet little prior work has evaluated their potential for reinforcing technical communication fundamentals. This article provides background information about quad charts and notes the benefits of implementing quad charts in the classroom. In particular, introducing…

  1. Australian quad bike fatalities: what is the economic cost?

    PubMed

    Lower, Tony; Pollock, Kirrily; Herde, Emily

    2013-04-01

    To determine the economic costs associated with all quad bike-related fatalities in Australia, 2001 to 2010, a human capital approach was used to establish the economic costs of quad bike-related fatalities to the Australian economy. The model included estimates of loss of earnings due to premature death and direct costs based on coronial records for ambulance, police, hospital, premature funeral, coronial and work safety authority investigation, and death compensation costs. All costs were expressed in 2010 dollars. The estimated total economic cost associated with quad bike fatalities over this period was $288.1 million, with an average cost for each fatality of $2.3 million. When assessing the average cost of incidents between age cohorts, those aged 25-34 years had the lowest number of fatalities but the highest average cost ($4.2 million). Quad bike fatalities have a significant and increasing economic impact on Australian society. Implications: Given the high cost to society, interventions to address quad bike fatalities have the potential to be highly cost-effective. Such interventions should focus on design approaches to improve the safety of quad bikes in terms of stability and protection in the event of a rollover. Additionally, relevant policy (e.g. no children under 16 years riding quads, no passengers) and intervention approaches (e.g. training and use of helmets) must also support the design modifications. © 2013 The Authors. ANZJPH © 2013 Public Health Association of Australia.

  2. Accuracy of the QUAD4 thick shell element

    NASA Technical Reports Server (NTRS)

    Case, William R.; Bowles, Tiffany D.; Croft, Alicia K.; Mcginnis, Mark A.

    1990-01-01

    The accuracy of the relatively new QUAD4 thick shell element is assessed via comparison with a theoretical solution for thick homogeneous and honeycomb flat simply supported plates under the action of a uniform pressure load. The theoretical thick plate solution is based on the theory developed by Reissner and includes the effects of transverse shear flexibility, which are not included in thin plate solutions based on Kirchhoff plate theory. In addition, the QUAD4 is assessed using a set of finite element test problems developed by the MacNeal-Schwendler Corp. (MSC). A comparison of the COSMIC QUAD4 element with those from MSC and Universal Analytics, Inc. (UAI) for these test problems is presented. The current COSMIC QUAD4 element is shown to agree closely with both the theoretical solutions and the two commercial versions of NASTRAN it was compared against.

  3. Impact of a vaccination programme in children vaccinated with ProQuad, and ProQuad-specific effectiveness against varicella in the Veneto region of Italy.

    PubMed

    Giaquinto, Carlo; Gabutti, Giovanni; Baldo, Vincenzo; Villa, Marco; Tramontan, Lara; Raccanello, Nadia; Russo, Francesca; Poma, Chiara; Scamarcia, Antonio; Cantarutti, Luigi; Lundin, Rebecca; Perinetti, Emilia; Cornen, Xavier; Thomas, Stéphane; Ballandras, Céline; Souverain, Audrey; Hartwig, Susanne

    2018-03-05

    Monovalent varicella vaccines have been available in the Veneto Region of Italy since 2004. In 2006, a single vaccine dose was added to the immunisation calendar for children aged 14 months. ProQuad®, a quadrivalent measles-mumps-rubella-varicella vaccine, was introduced in May 2007 and used, among other varicella vaccines, until October 2008. This study aimed to evaluate the effectiveness of a single dose of ProQuad, and the population impact of a vaccination program (VP) against varicella of any severity, in children who received a first dose of ProQuad at 14 months of age in the Veneto Region. METHODS: All children born in 2006/2007, i.e., eligible for varicella vaccination after ProQuad was introduced, were retrospectively followed through individual-level data linkage between the Pedianet database (varicella cases) and the Regional Immunization Database (vaccination status). The direct effectiveness of ProQuad was estimated from the incidence rate of varicella in ProQuad-vaccinated children aged < 6 years compared to children with no varicella vaccination from the same birth cohort. The impact of the VP on varicella was measured by comparing children eligible for the VP to an unvaccinated historical cohort from 1997/1998. The vaccine impact measures were: total effect (the combined effect of ProQuad vaccination and being covered by the Veneto VP); indirect effect (the effect of the VP on unvaccinated individuals); and overall effect (the effect of the VP on varicella in the entire population of the Veneto Region, regardless of vaccination status). The adjusted direct effectiveness of ProQuad was 94%. The total, indirect, and overall effects were 97%, 43%, and 90%, respectively. These are the first results on the effectiveness and impact of ProQuad against varicella; the data confirm its high effectiveness, based on immunological correlates for protection. Direct effectiveness is our only ProQuad-specific measure; all impact

  4. Cosmological Parameters from the QUAD CMB Polarization Experiment

    NASA Astrophysics Data System (ADS)

    Castro, P. G.; Ade, P.; Bock, J.; Bowden, M.; Brown, M. L.; Cahill, G.; Church, S.; Culverhouse, T.; Friedman, R. B.; Ganga, K.; Gear, W. K.; Gupta, S.; Hinderks, J.; Kovac, J.; Lange, A. E.; Leitch, E.; Melhuish, S. J.; Memari, Y.; Murphy, J. A.; Orlando, A.; Pryke, C.; Schwarz, R.; O'Sullivan, C.; Piccirillo, L.; Rajguru, N.; Rusholme, B.; Taylor, A. N.; Thompson, K. L.; Turner, A. H.; Wu, E. Y. S.; Zemcov, M.; QUaD Collaboration

    2009-08-01

    In this paper, we present a parameter estimation analysis of the polarization and temperature power spectra from the second and third season of observations with the QUaD experiment. QUaD has for the first time detected multiple acoustic peaks in the E-mode polarization spectrum with high significance. Although QUaD-only parameter constraints are not competitive with previous results for the standard six-parameter ΛCDM cosmology, they do allow meaningful polarization-only parameter analyses for the first time. In a standard six-parameter ΛCDM analysis, we find the QUaD TT power spectrum to be in good agreement with previous results. However, the QUaD polarization data show some tension with ΛCDM. The origin of this 1σ-2σ tension remains unclear, and may point to new physics, residual systematics, or simple random chance. We also combine QUaD with the five-year WMAP data set and the SDSS luminous red galaxies 4th data release power spectrum, and extend our analysis to constrain individual isocurvature mode fractions, constraining cold dark matter density, αcdmi < 0.11 (95% confidence limit (CL)), neutrino density, αndi < 0.26 (95% CL), and neutrino velocity, αnvi < 0.23 (95% CL), modes. Our analysis sets a benchmark for future polarization experiments.

  5. Electromagnetic scattering calculations on the Intel Touchstone Delta

    NASA Technical Reports Server (NTRS)

    Cwik, Tom; Patterson, Jean; Scott, David

    1992-01-01

    During the first year's operation of the Intel Touchstone Delta system, software which solves the electric field integral equations for fields scattered from arbitrarily shaped objects has been transferred to the Delta. To fully realize the Delta's resources, an out-of-core dense matrix solution algorithm that utilizes some or all of the 90 Gbyte of concurrent file system (CFS) has been used. The largest calculation completed to date computes the fields scattered from a perfectly conducting sphere modeled by 48,672 unknown functions, resulting in a complex-valued dense matrix needing 37.9 Gbyte of storage (48,672 squared entries at 16 bytes per double-precision complex value). The out-of-core LU matrix factorization algorithm was executed in 8.25 h at a rate of 10.35 Gflops. Total time to complete the calculation was 19.7 h; the additional time was used to compute the 48,672 x 48,672 matrix entries, solve the system for a given excitation, and compute observable quantities. The calculation was performed in 64-bit precision.

  6. Compositional Verification with Abstraction, Learning, and SAT Solving

    DTIC Science & Technology

    2015-05-01

    arithmetic, and bit-vectors (currently, via bit-blasting). The front-end is based on an existing tool called UFO [8] which converts C programs to the Horn...supports propositional logic, linear arithmetic, and bit-vectors (via bit-blasting). The front-end is based on the tool UFO [8]. It encodes safety of...tool UFO [8]. The encoding in Horn-SMT only uses the theory of Linear Rational Arithmetic. All experiments were carried out on an Intel Core™2 Quad

  7. Intel NX to PVM 3.2 message passing conversion library

    NASA Technical Reports Server (NTRS)

    Arthur, Trey; Nelson, Michael L.

    1993-01-01

    NASA Langley Research Center has developed a library that allows Intel NX message passing codes to be executed under the more popular and widely supported Parallel Virtual Machine (PVM) message passing library. PVM was developed at Oak Ridge National Labs and has become the de facto standard for message passing. This library will allow the many programs that were developed on the Intel iPSC/860 or Intel Paragon in a Single Program Multiple Data (SPMD) design to be ported to the numerous architectures that PVM (version 3.2) supports. Also, the library adds global operations capability to PVM. A familiarity with Intel NX and PVM message passing is assumed.
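
    To illustrate the kind of mapping such a conversion library performs, here is a minimal sketch of NX-style send/receive shims over the PVM 3 API. This is not the Langley library's code; the node-to-task-id table (nx_tid) and the simplified void signatures are assumptions for illustration.

        // nx2pvm_sketch.cpp -- illustrative NX-to-PVM shim (not the NASA library).
        // Assumption: nx_tid[] maps an NX node number to a PVM task id at startup.
        #include <pvm3.h>

        extern int nx_tid[];  // hypothetical node-number -> PVM task-id table

        // Intel NX csend(): send len bytes of buf to node, tagged with type.
        void csend(long type, char* buf, long len, long node, long /*ptype*/) {
            pvm_initsend(PvmDataRaw);           // raw encoding (homogeneous cluster)
            pvm_pkbyte(buf, (int)len, 1);       // pack the message body as bytes
            pvm_send(nx_tid[node], (int)type);  // NX message type becomes PVM tag
        }

        // Intel NX crecv(): blocking receive of a message with matching type.
        void crecv(long typesel, char* buf, long len) {
            pvm_recv(-1, (int)typesel);         // -1 = any sender; tag = NX type
            pvm_upkbyte(buf, (int)len, 1);      // unpack body into caller's buffer
        }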

  8. Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Williams, Samuel; Oliker, Leonid; Vuduc, Richard

    2008-10-16

    We are witnessing a dramatic change in computer architecture due to the multicore paradigm shift, as every electronic device from cell phones to supercomputers confronts parallelism of unprecedented scale. To fully unleash the potential of these systems, the HPC community must develop multicore-specific optimization methodologies for important scientific computations. In this work, we examine sparse matrix-vector multiply (SpMV) - one of the most heavily used kernels in scientific computing - across a broad spectrum of multicore designs. Our experimental platform includes the homogeneous AMD quad-core, AMD dual-core, and Intel quad-core designs, the heterogeneous STI Cell, as well as one of the first scientific studies of the highly multithreaded Sun Victoria Falls (a Niagara2 SMP). We present several optimization strategies especially effective for the multicore environment, and demonstrate significant performance improvements compared to existing state-of-the-art serial and parallel SpMV implementations. Additionally, we present key insights into the architectural trade-offs of leading multicore design strategies, in the context of demanding memory-bound numerical algorithms.
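
    For readers unfamiliar with the kernel under study, a minimal OpenMP version of SpMV over the standard compressed sparse row (CSR) layout is sketched below; the paper's tuned implementations go far beyond this baseline.

        // csr_spmv.cpp -- baseline CSR sparse matrix-vector multiply with OpenMP.
        // A generic sketch of the studied kernel, not the paper's optimized code.
        // Compile with: g++ -O2 -fopenmp csr_spmv.cpp
        #include <cstdio>
        #include <vector>

        // y = A*x, with A in compressed sparse row (CSR) format.
        void spmv_csr(int nrows, const std::vector<int>& rowptr,
                      const std::vector<int>& colidx, const std::vector<double>& val,
                      const std::vector<double>& x, std::vector<double>& y) {
            #pragma omp parallel for schedule(static)
            for (int i = 0; i < nrows; ++i) {
                double sum = 0.0;
                for (int k = rowptr[i]; k < rowptr[i + 1]; ++k)
                    sum += val[k] * x[colidx[k]];  // gather from x: irregular access
                y[i] = sum;
            }
        }

        int main() {
            // 3x3 example: [[4,0,1],[0,3,0],[2,0,5]]
            std::vector<int>    rowptr = {0, 2, 3, 5};
            std::vector<int>    colidx = {0, 2, 1, 0, 2};
            std::vector<double> val    = {4, 1, 3, 2, 5};
            std::vector<double> x = {1, 1, 1}, y(3);
            spmv_csr(3, rowptr, colidx, val, x, y);
            std::printf("%g %g %g\n", y[0], y[1], y[2]);  // prints: 5 3 7
        }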

  9. 40 CFR 81.102 - Metropolitan Quad Cities Interstate Air Quality Control Region.

    Code of Federal Regulations, 2011 CFR

    2011-07-01

    ... 40 Protection of Environment 17 2011-07-01 2011-07-01 false Metropolitan Quad Cities Interstate... Designation of Air Quality Control Regions § 81.102 Metropolitan Quad Cities Interstate Air Quality Control Region. The Metropolitan Quad Cities Interstate Air Quality Control Region (Illinois-Iowa) consists of...

  10. 40 CFR 81.102 - Metropolitan Quad Cities Interstate Air Quality Control Region.

    Code of Federal Regulations, 2010 CFR

    2010-07-01

    ... 40 Protection of Environment 17 2010-07-01 2010-07-01 false Metropolitan Quad Cities Interstate... Designation of Air Quality Control Regions § 81.102 Metropolitan Quad Cities Interstate Air Quality Control Region. The Metropolitan Quad Cities Interstate Air Quality Control Region (Illinois-Iowa) consists of...

  11. Accelerating Climate Simulations Through Hybrid Computing

    NASA Technical Reports Server (NTRS)

    Zhou, Shujia; Sinno, Scott; Cruz, Carlos; Purcell, Mark

    2009-01-01

    Unconventional multi-core processors (e.g., IBM Cell B/E and NVIDIA GPU) have emerged as accelerators in climate simulation. However, climate models typically run on parallel computers with conventional processors (e.g., Intel and AMD) using MPI. Connecting accelerators to this architecture efficiently and easily becomes a critical issue. When using MPI for connection, we identified two challenges: (1) an identical MPI implementation is required in both systems, and (2) existing MPI code must be modified to accommodate the accelerators. In response, we have extended and deployed IBM Dynamic Application Virtualization (DAV) in a hybrid computing prototype system (one blade with two Intel quad-core processors, two IBM QS22 Cell blades, connected with InfiniBand), allowing for seamlessly offloading compute-intensive functions to remote, heterogeneous accelerators in a scalable, load-balanced manner. Currently, a climate solar radiation model running with multiple MPI processes has been offloaded to multiple Cell blades with approximately 10% network overhead.

  12. QUAD+ BWR Fuel Assembly demonstration program at Browns Ferry plant

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Doshi, P.K.; Mayhue, L.T.; Robert, J.T.

    1984-04-01

    The QUAD+ fuel assembly is an improved BWR fuel assembly designed and manufactured by Westinghouse Electric Corporation. The design features a water cross separating four fuel minibundles in an integral channel. A demonstration program for this fuel design is planned for late 1984 in cycle 6 of Browns Ferry 2, a TVA plant. Objectives for the design of the QUAD+ demonstration assemblies are compatibility in performance and transparency in safety analysis with the feed fuel. These objectives are met. Inspections of the QUAD+ demonstration assemblies are planned at each refueling outage.

  13. Thread-Level Parallelization and Optimization of NWChem for the Intel MIC Architecture

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Shan, Hongzhang; Williams, Samuel; Jong, Wibe de

    In the multicore era it was possible to exploit the increase in on-chip parallelism by simply running multiple MPI processes per chip. Unfortunately, manycore processors' greatly increased thread- and data-level parallelism coupled with a reduced memory capacity demand an altogether different approach. In this paper we explore augmenting two NWChem modules, triples correction of the CCSD(T) and Fock matrix construction, with OpenMP in order that they might run efficiently on future manycore architectures. As the next NERSC machine will be a self-hosted Intel MIC (Xeon Phi) based supercomputer, we leverage an existing MIC testbed at NERSC to evaluate our experiments. In order to proxy the fact that future MIC machines will not have a host processor, we run all of our experiments in native mode. We found that while straightforward application of OpenMP to the deep loop nests associated with the tensor contractions of CCSD(T) was sufficient in attaining high performance, significant effort was required to safely and efficiently thread the TEXAS integral package when constructing the Fock matrix. Ultimately, our new MPI+OpenMP hybrid implementations attain up to 65x better performance for the triples part of the CCSD(T), due in large part to the fact that the limited on-card memory restricts the existing MPI implementation to a single process per card. Additionally, we obtain up to 1.6x better performance on Fock matrix construction when compared with the best MPI implementations running multiple processes per card.
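
    The "straightforward application of OpenMP to the deep loop nests" mentioned above typically looks like the following sketch, which threads a toy tensor-contraction nest via loop collapsing; it is illustrative, not NWChem source.

        // ccsd_t_loops.cpp -- threading a deep contraction loop nest with OpenMP,
        // in the spirit of the CCSD(T) triples discussion; a toy kernel only.
        #include <vector>

        // Toy contraction: C[i,j] += sum_k A[i,k] * B[k,j] on n x n tiles.
        void contract(int n, const double* A, const double* B, double* C) {
            // collapse(2) exposes n*n independent iterations to the thread team.
            #pragma omp parallel for collapse(2) schedule(static)
            for (int i = 0; i < n; ++i)
                for (int j = 0; j < n; ++j) {
                    double sum = 0.0;
                    for (int k = 0; k < n; ++k)   // innermost loop left for SIMD
                        sum += A[i * n + k] * B[k * n + j];
                    C[i * n + j] += sum;
                }
        }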

  14. Thread-level parallelization and optimization of NWChem for the Intel MIC architecture

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Shan, Hongzhang; Williams, Samuel; de Jong, Wibe

    In the multicore era it was possible to exploit the increase in on-chip parallelism by simply running multiple MPI processes per chip. Unfortunately, manycore processors' greatly increased thread- and data-level parallelism coupled with a reduced memory capacity demand an altogether different approach. In this paper we explore augmenting two NWChem modules, triples correction of the CCSD(T) and Fock matrix construction, with OpenMP in order that they might run efficiently on future manycore architectures. As the next NERSC machine will be a self-hosted Intel MIC (Xeon Phi) based supercomputer, we leverage an existing MIC testbed at NERSC to evaluate our experiments. In order to proxy the fact that future MIC machines will not have a host processor, we run all of our experiments in native mode. We found that while straightforward application of OpenMP to the deep loop nests associated with the tensor contractions of CCSD(T) was sufficient in attaining high performance, significant effort was required to safely and efficiently thread the TEXAS integral package when constructing the Fock matrix. Ultimately, our new MPI+OpenMP hybrid implementations attain up to 65× better performance for the triples part of the CCSD(T), due in large part to the fact that the limited on-card memory restricts the existing MPI implementation to a single process per card. Additionally, we obtain up to 1.6× better performance on Fock matrix constructions when compared with the best MPI implementations running multiple processes per card.

  15. Comparing performance of many-core CPUs and GPUs for static and motion compensated reconstruction of C-arm CT data.

    PubMed

    Hofmann, Hannes G; Keck, Benjamin; Rohkohl, Christopher; Hornegger, Joachim

    2011-01-01

    Interventional reconstruction of 3-D volumetric data from C-arm CT projections is a computationally demanding task. Hardware optimization is not an option but mandatory for interventional image processing and, in particular, for image reconstruction due to the high demands on performance. Several groups have published fast analytical 3-D reconstruction on highly parallel hardware such as GPUs to mitigate this issue. The authors show that the performance of modern CPU-based systems is in the same order as current GPUs for static 3-D reconstruction and outperforms them for a recent motion compensated (3-D+time) image reconstruction algorithm. This work investigates two algorithms: static 3-D reconstruction as well as a recent motion compensated algorithm. The evaluation was performed using a standardized reconstruction benchmark, RabbitCT, to obtain comparable results, plus two additional clinical data sets. The authors demonstrate for a parametric B-spline motion estimation scheme that the derivative computation, which requires many write operations to memory, performs poorly on the GPU and can highly benefit from modern CPU architectures with large caches. Moreover, on a 32-core Intel Xeon server system, the authors achieve linear scaling with the number of cores used and reconstruction times almost in the same range as current GPUs. Algorithmic innovations in the field of motion compensated image reconstruction may lead to a shift back to CPUs in the future. For analytical 3-D reconstruction, the authors show that the gap between GPUs and CPUs has become smaller. It can be performed in less than 20 s (on-the-fly) using a 32-core server.

  16. Image compression using quad-tree coding with morphological dilation

    NASA Astrophysics Data System (ADS)

    Wu, Jiaji; Jiang, Weiwei; Jiao, Licheng; Wang, Lei

    2007-11-01

    In this paper, we propose a new algorithm that integrates a morphological dilation operation into quad-tree coding, so that each technique compensates for the other's drawbacks. The new algorithm can not only quickly find the seed significant coefficient for dilation but also break the block-boundary limit of quad-tree coding. We also make full use of both within-subband and cross-subband correlation to avoid the expensive cost of representing insignificant coefficients. Experimental results show that our algorithm outperforms SPECK and SPIHT. Without using any arithmetic coding, our algorithm achieves good performance with low computational cost, making it well suited to mobile devices and scenarios with strict real-time requirements.
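
    As background on the quad-tree half of this pairing, the sketch below shows the recursive significance test that lets a coder dismiss an entire insignificant block with a single symbol; the paper's dilation step and subband modelling are not reproduced here.

        // quadtree_sig.cpp -- recursive quad-tree significance test, the generic
        // building block of quad-tree coders; illustrative, not the paper's coder.
        #include <cmath>
        #include <vector>

        // True if any coefficient in the size x size block at (r, c) reaches the
        // threshold; an insignificant block can be coded with a single 0 bit.
        bool significant(const std::vector<std::vector<double>>& coef,
                         int r, int c, int size, double threshold) {
            if (size == 1) return std::fabs(coef[r][c]) >= threshold;
            int h = size / 2;  // otherwise recurse into the four quadrants
            return significant(coef, r,     c,     h, threshold) ||
                   significant(coef, r,     c + h, h, threshold) ||
                   significant(coef, r + h, c,     h, threshold) ||
                   significant(coef, r + h, c + h, h, threshold);
        }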

  17. IntellEditS: intelligent learning-based editor of segmentations.

    PubMed

    Harrison, Adam P; Birkbeck, Neil; Sofka, Michal

    2013-01-01

    Automatic segmentation techniques, despite demonstrating excellent overall accuracy, can often produce inaccuracies in local regions. As a result, correcting segmentations remains an important task that is often laborious, especially when done manually for 3D datasets. This work presents a powerful tool called Intelligent Learning-Based Editor of Segmentations (IntellEditS) that minimizes user effort and further improves segmentation accuracy. The tool partners interactive learning with an energy-minimization approach to editing. Based on interactive user input, a discriminative classifier is trained and applied to the edited 3D region to produce soft voxel labeling. The labels are integrated into a novel energy functional along with the existing segmentation and image data. Unlike the state of the art, IntellEditS is designed to correct segmentation results represented not only as masks but also as meshes. In addition, IntellEditS accepts intuitive boundary-based user interactions. The versatility and performance of IntellEditS are demonstrated on both MRI and CT datasets consisting of varied anatomical structures and resolutions.

  18. Wilson and Domainwall Kernels on Oakforest-PACS

    NASA Astrophysics Data System (ADS)

    Kanamori, Issaku; Matsufuru, Hideo

    2018-03-01

    We report the performance of Wilson and Domainwall kernels on a new Intel Xeon Phi Knights Landing based machine named Oakforest-PACS, which is co-hosted by the University of Tokyo and the University of Tsukuba and is currently the fastest machine in Japan. This machine uses Intel Omni-Path for the internode network. We compare the performance of several types of implementation, including one that makes use of the Grid library. The code is incorporated into the code set Bridge++.

  19. Spectral-element simulation of two-dimensional elastic wave propagation in fully heterogeneous media on a GPU cluster

    NASA Astrophysics Data System (ADS)

    Rudianto, Indra; Sudarmaji

    2018-04-01

    We present an implementation of the spectral-element method for simulation of two-dimensional elastic wave propagation in fully heterogeneous media. We have incorporated most realistic geological features in the model, including surface topography, curved layer interfaces, and 2-D wave-speed heterogeneity. To accommodate such complexity, we use an unstructured quadrilateral meshing technique. The simulation was performed on a GPU cluster consisting of 24 Intel Xeon CPU cores and 4 NVIDIA Quadro graphics cards, using a CUDA and MPI implementation. We speed up the computation by a factor of about 5 compared to the MPI-only version, and by a factor of about 40 compared to the serial implementation.

  20. All-quad meshing without cleanup

    DOE PAGES

    Rushdi, Ahmad A.; Mitchell, Scott A.; Mahmoud, Ahmed H.; ...

    2016-08-22

    Here, we present an all-quad meshing algorithm for general domains. We start with a strongly balanced quadtree. In contrast to snapping the quadtree corners onto the geometric domain boundaries, we move them away from the geometry. Then we intersect the moved grid with the geometry. The resulting polygons are converted into quads with midpoint subdivision, as sketched below. Moving away avoids creating any flat angles, either at a quadtree corner or at a geometry–quadtree intersection. We are able to handle two-sided domains, and more complex topologies than prior methods. The algorithm is provably correct and robust in practice. It is cleanup-free, meaning we have angle and edge length bounds without the use of any pillowing, swapping, or smoothing. Thus, our simple algorithm is fast and predictable. This paper has better quality bounds, and the algorithm is demonstrated over more complex domains, than our prior version.
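
    The midpoint-subdivision step referenced above is standard: an n-sided polygon is split into n quads by connecting its centroid to the midpoints of its edges. A minimal sketch (convex polygons assumed; not the authors' code):

        // midpoint_quads.cpp -- midpoint subdivision of a polygon into quads:
        // each quad joins an edge midpoint, a vertex, the next edge midpoint,
        // and the centroid. Illustrative sketch for convex polygons.
        #include <array>
        #include <vector>

        struct Pt { double x, y; };
        using Quad = std::array<Pt, 4>;

        std::vector<Quad> midpoint_subdivide(const std::vector<Pt>& poly) {
            const int n = (int)poly.size();
            Pt c{0.0, 0.0};                        // centroid of the polygon
            for (const Pt& p : poly) { c.x += p.x / n; c.y += p.y / n; }
            std::vector<Quad> quads;
            for (int i = 0; i < n; ++i) {
                const Pt& prev = poly[(i + n - 1) % n];
                const Pt& v    = poly[i];
                const Pt& next = poly[(i + 1) % n];
                Pt m_in {(prev.x + v.x) / 2, (prev.y + v.y) / 2};  // incoming edge midpoint
                Pt m_out{(v.x + next.x) / 2, (v.y + next.y) / 2};  // outgoing edge midpoint
                quads.push_back(Quad{m_in, v, m_out, c});          // one quad per vertex
            }
            return quads;
        }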

  1. All-quad meshing without cleanup

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Rushdi, Ahmad A.; Mitchell, Scott A.; Mahmoud, Ahmed H.

    Here, we present an all-quad meshing algorithm for general domains. We start with a strongly balanced quadtree. In contrast to snapping the quadtree corners onto the geometric domain boundaries, we move them away from the geometry. Then we intersect the moved grid with the geometry. The resulting polygons are converted into quads with midpoint subdivision. Moving away avoids creating any flat angles, either at a quadtree corner or at a geometry–quadtree intersection. We are able to handle two-sided domains, and more complex topologies than prior methods. The algorithm is provably correct and robust in practice. It is cleanup-free, meaning we have angle and edge length bounds without the use of any pillowing, swapping, or smoothing. Thus, our simple algorithm is fast and predictable. This paper has better quality bounds, and the algorithm is demonstrated over more complex domains, than our prior version.

  2. FY17 NIF Performance Quad Campaign: laser performance results and conclusions

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Di Nicola, J. M.; Mennerat, G.; Widmayer, G.

    The FY17 NIF Performance Quad Campaign exercised a single quad of NIF (Q45T) at elevated energy to assess the impact of recent improvements to the infrared (1ω) and ultraviolet (3ω) section of the laser on integrated performance.

  3. QuadBase2: web server for multiplexed guanine quadruplex mining and visualization

    PubMed Central

    Dhapola, Parashar; Chowdhury, Shantanu

    2016-01-01

    DNA guanine quadruplexes or G4s are non-canonical DNA secondary structures which affect genomic processes like replication, transcription and recombination. G4s are computationally identified by specific nucleotide motifs which are also called putative G4 (PG4) motifs. Despite the general relevance of these structures, there is currently no tool available that can allow batch queries and genome-wide analysis of these motifs in a user-friendly interface. QuadBase2 (quadbase.igib.res.in) presents a completely reinvented web server version of previously published QuadBase database. QuadBase2 enables users to mine PG4 motifs in up to 178 eukaryotes through the EuQuad module. This module interfaces with Ensembl Compara database, to allow users mine PG4 motifs in the orthologues of genes of interest across eukaryotes. PG4 motifs can be mined across genes and their promoter sequences in 1719 prokaryotes through ProQuad module. This module includes a feature that allows genome-wide mining of PG4 motifs and their visualization as circular histograms. TetraplexFinder, the module for mining PG4 motifs in user-provided sequences is now capable of handling up to 20 MB of data. QuadBase2 is a comprehensive PG4 motif mining tool that further expands the configurations and algorithms for mining PG4 motifs in a user-friendly way. PMID:27185890
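
    PG4 motifs of the kind these modules mine are conventionally matched with a regular expression over four G-runs; the sketch below uses the common default pattern (four runs of three or more guanines separated by loops of one to seven bases), which may differ from QuadBase2's exact configurable parameters.

        // pg4_scan.cpp -- canonical putative G-quadruplex (PG4) motif scan using
        // the common G{3+}N{1-7} x4 pattern; parameters are generic defaults.
        #include <cstdio>
        #include <regex>
        #include <string>

        int main() {
            // Four runs of >= 3 G's separated by 1-7 base loops.
            std::regex pg4("G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}");
            std::string seq = "TTGGGAGGGTGGGAAGGGCT";
            std::smatch m;
            if (std::regex_search(seq, m, pg4))
                std::printf("PG4 motif at offset %ld: %s\n",
                            (long)m.position(0), m.str(0).c_str());
        }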

  4. Design of a Compact Quad-Channel Diplexer

    NASA Astrophysics Data System (ADS)

    Xu, Jin

    2016-01-01

    This paper presents a compact quad-channel diplexer that uses two asymmetrical-coupling shorted-stub-loaded stepped-impedance resonator (SSLSIR) dual-band bandpass filters (DB-BPFs) to replace the two single-band BPFs in a traditional BPF-based diplexer. Part of its impedance matching circuit is implemented with a three-element lowpass T-network to acquire the desired phase shift. Detailed design procedures are given to guide the diplexer design. The fabricated quad-channel diplexer occupies a compact circuit area of 0.168λg×0.136λg. High band-to-band isolation and wide stopband performance are achieved. Good agreement is shown between the simulated and measured results.

  5. Intel Teach to the Future: A Partnership for Professional Development.

    ERIC Educational Resources Information Center

    Metcalf, Teri; Jolly, Deborah

    This paper describes a public/private partnership program designed to provide staff development to help classroom teachers integrate technology in the curriculum by using the train-the-trainer model. The Intel[R] Teach to the Future Project was developed by Intel[R] in collaboration with other public and private sector partners, and has been…

  6. Multi-Core Processor Memory Contention Benchmark Analysis Case Study

    NASA Technical Reports Server (NTRS)

    Simon, Tyler; McGalliard, James

    2009-01-01

    Multi-core processors dominate current mainframe, server, and high performance computing (HPC) systems. This paper provides synthetic kernel and natural benchmark results from an HPC system at the NASA Goddard Space Flight Center that illustrate the performance impacts of multi-core (dual- and quad-core) vs. single core processor systems. Analysis of processor design, application source code, and synthetic and natural test results all indicate that multi-core processors can suffer from significant memory subsystem contention compared to similar single-core processors.
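
    A synthetic kernel of the kind used to expose memory subsystem contention is a bandwidth-bound streaming loop run across all cores; a generic STREAM-style triad (not the paper's benchmark suite) is sketched below. On contended multi-core chips, the aggregate GB/s stops scaling once the shared memory controller saturates.

        // triad_contention.cpp -- STREAM-style triad to stress shared memory
        // bandwidth across cores; generic sketch, not the paper's benchmarks.
        // Compile with: g++ -O2 -fopenmp triad_contention.cpp
        #include <cstdio>
        #include <omp.h>
        #include <vector>

        int main() {
            const long n = 1 << 24;  // 128 MiB per array: far larger than cache
            std::vector<double> a(n), b(n, 1.0), c(n, 2.0);
            double t0 = omp_get_wtime();
            #pragma omp parallel for schedule(static)
            for (long i = 0; i < n; ++i)
                a[i] = b[i] + 3.0 * c[i];   // 3 memory streams, trivial compute
            double t = omp_get_wtime() - t0;
            std::printf("%.2f GB/s\n", 3.0 * n * sizeof(double) / t / 1e9);
        }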

  7. Benchmarking the QUAD4/TRIA3 element

    NASA Technical Reports Server (NTRS)

    Pitrof, Stephen M.; Venkayya, Vipperla B.

    1993-01-01

    The QUAD4 and TRIA3 elements are the primary plate/shell elements in NASTRAN. These elements enable the user to analyze thin plate/shell structures for membrane, bending and shear phenomena. They are also very new elements in the NASTRAN library. These elements are extremely versatile and constitute a substantially enhanced analysis capability in NASTRAN. However, with the versatility comes the burden of understanding a myriad of modeling implications and their effect on accuracy and analysis quality. The validity of many aspects of these elements was established through a series of benchmark problem results and comparison with those available in the literature and obtained from other programs like MSC/NASTRAN and CSAR/NASTRAN. Nevertheless, such a comparison is never complete because of the new and creative use of these elements in complex modeling situations. One of the important features of the QUAD4 and TRIA3 elements is the offset capability, which allows the midsurface of the plate to be noncoincident with the surface of the grid points. None of the previous elements, with the exception of the bar (beam), has this capability. The offset capability played a crucial role in the design of the QUAD4 and TRIA3 elements. It allowed modeling of layered composites, laminated plates and sandwich plates with metal and composite face sheets. Even though the basic implementation of the offset capability was found to be sound in previous applications, there is some uncertainty in relatively simple applications. The main purpose of this paper is to test the integrity of the offset capability and provide guidelines for its effective use. For the purpose of simplicity, references in this paper to the QUAD4 element also include the TRIA3 element.

  8. IGA-ADS: Isogeometric analysis FEM using ADS solver

    NASA Astrophysics Data System (ADS)

    Łoś, Marcin M.; Woźniak, Maciej; Paszyński, Maciej; Lenharth, Andrew; Hassaan, Muhamm Amber; Pingali, Keshav

    2017-08-01

    In this paper we present a fast explicit solver for the solution of non-stationary problems using L2 projections with the isogeometric finite element method. The solver has been implemented within the GALOIS framework. It enables parallel multi-core simulations of different time-dependent problems, in 1D, 2D, or 3D. We have prepared the solver framework in a way that enables direct implementation of the selected PDE and corresponding boundary conditions. In this paper we describe the installation, the implementation of three exemplary PDEs, and the execution of the simulations on multi-core Linux cluster nodes. We consider three case studies, including heat transfer, linear elasticity, as well as non-linear flow in heterogeneous media. The presented package generates output suitable for interfacing with Gnuplot and ParaView visualization software. The exemplary simulations show near-perfect scalability on a Gilbert shared-memory node with four Intel® Xeon® CPU E7-4860 processors, each possessing 10 physical cores (for a total of 40 cores).

  9. Performance of a plasma fluid code on the Intel parallel computers

    NASA Technical Reports Server (NTRS)

    Lynch, V. E.; Carreras, B. A.; Drake, J. B.; Leboeuf, J. N.; Liewer, P.

    1992-01-01

    One approach to improving the real-time efficiency of plasma turbulence calculations is to use a parallel algorithm. A parallel algorithm for plasma turbulence calculations was tested on the Intel iPSC/860 hypercube and the Touchstone Delta machine. Using the 128 processors of the Intel iPSC/860 hypercube, a factor of 5 improvement over a single-processor CRAY-2 is obtained. For the Touchstone Delta machine, the corresponding improvement factor is 16. For plasma edge turbulence calculations, an extrapolation of the present results to the Intel (sigma) machine gives an improvement factor close to 64 over the single-processor CRAY-2.

  10. NASA Center for Climate Simulation (NCCS) Presentation

    NASA Technical Reports Server (NTRS)

    Webster, William P.

    2012-01-01

    The NASA Center for Climate Simulation (NCCS) offers integrated supercomputing, visualization, and data interaction technologies to enhance NASA's weather and climate prediction capabilities. It serves hundreds of users at NASA Goddard Space Flight Center, as well as other NASA centers, laboratories, and universities across the US. Over the past year, NCCS has continued expanding its data-centric computing environment to meet the increasingly data-intensive challenges of climate science. We doubled our Discover supercomputer's peak performance to more than 800 teraflops by adding 7,680 Intel Xeon Sandy Bridge processor-cores and most recently 240 Intel Xeon Phi Many Integrated Core (MIC) co-processors. A supercomputing-class analysis system named Dali gives users rapid access to their data on Discover and high-performance software including the Ultra-scale Visualization Climate Data Analysis Tools (UV-CDAT), with interfaces from user desktops and a 17- by 6-foot visualization wall. NCCS also is exploring highly efficient climate data services and management with a new MapReduce/Hadoop cluster while augmenting its data distribution to the science community. Using NCCS resources, NASA completed its modeling contributions to the Intergovernmental Panel on Climate Change (IPCC) Fifth Assessment Report this summer as part of the ongoing Coupled Model Intercomparison Project Phase 5 (CMIP5). Ensembles of simulations run on Discover reached back to the year 1000 to test model accuracy and projected climate change through the year 2300 based on four different scenarios of greenhouse gases, aerosols, and land use. The data resulting from several thousand IPCC/CMIP5 simulations, as well as a variety of other simulation, reanalysis, and observation datasets, are available to scientists and decision makers through an enhanced NCCS Earth System Grid Federation Gateway. Worldwide downloads have totaled over 110 terabytes of data.

  11. P-Hint-Hunt: a deep parallelized whole genome DNA methylation detection tool.

    PubMed

    Peng, Shaoliang; Yang, Shunyun; Gao, Ming; Liao, Xiangke; Liu, Jie; Yang, Canqun; Wu, Chengkun; Yu, Wenqiang

    2017-03-14

    A growing number of studies use whole-genome DNA methylation detection, one of the most important parts of epigenetics research, to find significant relationships between DNA methylation and several typical diseases, such as cancers and diabetes. In many of those studies, mapping bisulfite-treated sequence to the whole genome has been the main method of studying DNA cytosine methylation. However, today's tools commonly suffer from inaccuracy and long running times. In our study, we designed a new DNA methylation prediction tool ("Hint-Hunt") to solve this problem. Using an optimal complex alignment computation and Smith-Waterman matrix dynamic programming, Hint-Hunt can analyze and predict DNA methylation status. But when Hint-Hunt is applied to large-scale datasets, slow speed and low temporal-spatial efficiency remain problems. To address the limitations of Smith-Waterman dynamic programming and the low temporal-spatial efficiency, we further designed a deeply parallelized whole-genome DNA methylation detection tool ("P-Hint-Hunt") for the Tianhe-2 (TH-2) supercomputer. To the best of our knowledge, P-Hint-Hunt is the first parallel DNA methylation detection tool with a high speed-up for processing large-scale datasets, and it can run both on CPUs and on Intel Xeon Phi coprocessors. Moreover, we deploy and evaluate Hint-Hunt and P-Hint-Hunt on the TH-2 supercomputer at different scales. The experimental results show that our tools eliminate the deviation caused by bisulfite treatment in the mapping procedure and that the multi-level parallel program yields a 48-fold speed-up with 64 threads. P-Hint-Hunt gains a deep acceleration on the heterogeneous CPU and Intel Xeon Phi platform, which gives full play to the advantages of multi-core (CPU) and many-core (Phi) processors.
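
    The Smith-Waterman dynamic programming mentioned above is the standard local-alignment recurrence; a minimal scoring-only version with a linear gap penalty is sketched below (toy parameters; not Hint-Hunt's implementation).

        // sw_sketch.cpp -- minimal Smith-Waterman local alignment score with a
        // linear gap penalty; the generic DP, not Hint-Hunt's optimized code.
        #include <algorithm>
        #include <cstdio>
        #include <string>
        #include <vector>

        int smith_waterman(const std::string& a, const std::string& b,
                           int match = 2, int mismatch = -1, int gap = -2) {
            std::vector<std::vector<int>> H(a.size() + 1,
                                            std::vector<int>(b.size() + 1, 0));
            int best = 0;
            for (size_t i = 1; i <= a.size(); ++i)
                for (size_t j = 1; j <= b.size(); ++j) {
                    int s = (a[i - 1] == b[j - 1]) ? match : mismatch;
                    H[i][j] = std::max({0, H[i - 1][j - 1] + s,  // (mis)match
                                        H[i - 1][j] + gap,       // deletion
                                        H[i][j - 1] + gap});     // insertion
                    best = std::max(best, H[i][j]);              // local max
                }
            return best;
        }

        int main() { std::printf("%d\n", smith_waterman("ACACACTA", "AGCACACA")); }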

  12. Preparing for Exascale: Towards convection-permitting, global atmospheric simulations with the Model for Prediction Across Scales (MPAS)

    NASA Astrophysics Data System (ADS)

    Heinzeller, Dominikus; Duda, Michael G.; Kunstmann, Harald

    2017-04-01

    With strong financial and political support from national and international initiatives, exascale computing is projected for the end of this decade. Energy requirements and physical limitations imply the use of accelerators and scaling out to orders of magnitude more cores than today to achieve this milestone. In order to fully exploit the capabilities of these exascale computing systems, existing applications need to undergo significant development. The Model for Prediction Across Scales (MPAS) is a novel set of Earth system simulation components and consists of an atmospheric core, an ocean core, a land-ice core and a sea-ice core. Its distinct features are the use of unstructured Voronoi meshes and C-grid discretisation to address the shortcomings of global models on regular grids and of limited-area models nested in a forcing data set, with respect to parallel scalability, numerical accuracy and physical consistency. Here, we present work towards the application of the atmospheric core (MPAS-A) on current and future high performance computing systems for problems at extreme scale. In particular, we address the issue of massively parallel I/O by extending the model to support the highly scalable SIONlib library. Using global uniform meshes with a convection-permitting resolution of 2-3 km, we demonstrate the ability of MPAS-A to scale out to half a million cores while maintaining a high parallel efficiency. We also demonstrate the potential benefit of a hybrid parallelisation of the code (MPI/OpenMP) on the latest generation of Intel's Many Integrated Core architecture, the Intel Xeon Phi Knights Landing.
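
    The hybrid MPI/OpenMP model referred to here places a few MPI ranks per node and fans each rank out into OpenMP threads; the minimal generic skeleton is shown below (not MPAS code).

        // hybrid_hello.cpp -- skeleton of the hybrid MPI/OpenMP execution model:
        // MPI ranks across nodes, OpenMP threads within each rank. Generic sketch.
        // Compile with: mpicxx -fopenmp hybrid_hello.cpp
        #include <cstdio>
        #include <mpi.h>
        #include <omp.h>

        int main(int argc, char** argv) {
            int provided, rank;
            // FUNNELED: only the master thread of each rank makes MPI calls.
            MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            #pragma omp parallel
            std::printf("rank %d: thread %d of %d\n",
                        rank, omp_get_thread_num(), omp_get_num_threads());
            MPI_Finalize();
        }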

  13. EUV mask pilot line at Intel Corporation

    NASA Astrophysics Data System (ADS)

    Stivers, Alan R.; Yan, Pei-Yang; Zhang, Guojing; Liang, Ted; Shu, Emily Y.; Tejnil, Edita; Lieberman, Barry; Nagpal, Rajesh; Hsia, Kangmin; Penn, Michael; Lo, Fu-Chang

    2004-12-01

    The introduction of extreme ultraviolet (EUV) lithography into high volume manufacturing requires the development of a new mask technology. In support of this, Intel Corporation has established a pilot line devoted to encountering and eliminating barriers to manufacturability of EUV masks. It concentrates on EUV-specific process modules and makes use of the captive standard photomask fabrication capability of Intel Corporation. The goal of the pilot line is to accelerate EUV mask development to intersect the 32nm technology node. This requires EUV mask technology to be comparable to standard photomask technology by the beginning of the silicon wafer process development phase for that technology node. The pilot line embodies Intel's strategy to lead EUV mask development in the areas of the mask patterning process, mask fabrication tools, the starting material (blanks) and the understanding of process interdependencies. The patterning process includes all steps from blank defect inspection through final pattern inspection and repair. We have specified and ordered the EUV-specific tools and most will be installed in 2004. We have worked with International Sematech and others to provide for the next generation of EUV-specific mask tools. Our process of record is run repeatedly to ensure its robustness. This primes the supply chain and collects information needed for blank improvement.

  14. Performance of GeantV EM Physics Models

    NASA Astrophysics Data System (ADS)

    Amadio, G.; Ananya, A.; Apostolakis, J.; Aurora, A.; Bandieramonte, M.; Bhattacharyya, A.; Bianchini, C.; Brun, R.; Canal, P.; Carminati, F.; Cosmo, G.; Duhem, L.; Elvira, D.; Folger, G.; Gheata, A.; Gheata, M.; Goulas, I.; Iope, R.; Jun, S. Y.; Lima, G.; Mohanty, A.; Nikitina, T.; Novak, M.; Pokorski, W.; Ribon, A.; Seghal, R.; Shadura, O.; Vallecorsa, S.; Wenzel, S.; Zhang, Y.

    2017-10-01

    The recent progress in parallel hardware architectures with deeper vector pipelines or many-cores technologies brings opportunities for HEP experiments to take advantage of SIMD and SIMT computing models. Launched in 2013, the GeantV project studies performance gains in propagating multiple particles in parallel, improving instruction throughput and data locality in HEP event simulation on modern parallel hardware architecture. Due to the complexity of geometry description and physics algorithms of a typical HEP application, performance analysis is indispensable in identifying factors limiting parallel execution. In this report, we will present design considerations and preliminary computing performance of GeantV physics models on coprocessors (Intel Xeon Phi and NVidia GPUs) as well as on mainstream CPUs.

  15. Electromagnetic Physics Models for Parallel Computing Architectures

    NASA Astrophysics Data System (ADS)

    Amadio, G.; Ananya, A.; Apostolakis, J.; Aurora, A.; Bandieramonte, M.; Bhattacharyya, A.; Bianchini, C.; Brun, R.; Canal, P.; Carminati, F.; Duhem, L.; Elvira, D.; Gheata, A.; Gheata, M.; Goulas, I.; Iope, R.; Jun, S. Y.; Lima, G.; Mohanty, A.; Nikitina, T.; Novak, M.; Pokorski, W.; Ribon, A.; Seghal, R.; Shadura, O.; Vallecorsa, S.; Wenzel, S.; Zhang, Y.

    2016-10-01

    The recent emergence of hardware architectures characterized by many-core or accelerated processors has opened new opportunities for concurrent programming models taking advantage of both SIMD and SIMT architectures. GeantV, a next generation detector simulation, has been designed to exploit both the vector capability of mainstream CPUs and multi-threading capabilities of coprocessors including NVidia GPUs and Intel Xeon Phi. The characteristics of these architectures are very different in terms of the vectorization depth and type of parallelization needed to achieve optimal performance. In this paper we describe implementation of electromagnetic physics models developed for parallel computing architectures as a part of the GeantV project. Results of preliminary performance evaluation and physics validation are presented as well.

  16. MC64-ClustalWP2: A Highly-Parallel Hybrid Strategy to Align Multiple Sequences in Many-Core Architectures

    PubMed Central

    Díaz, David; Esteban, Francisco J.; Hernández, Pilar; Caballero, Juan Antonio; Guevara, Antonio

    2014-01-01

    We have developed MC64-ClustalWP2 as a new implementation of the Clustal W algorithm, integrating a novel parallelization strategy and significantly increasing the performance when aligning long sequences in architectures with many cores. It must be stressed that in such a process, the detailed analysis of both the software and hardware features and peculiarities is of paramount importance to reveal key points to exploit and optimize the full potential of parallelism in many-core CPU systems. The new parallelization approach has focused on the most time-consuming stages of this algorithm. In particular, the so-called progressive alignment has drastically improved in performance, due to a fine-grained approach where the forward and backward loops were unrolled and parallelized. Another key approach has been the implementation of the new algorithm in a hybrid-computing system, integrating both an Intel Xeon multi-core CPU and a Tilera Tile64 many-core card. A comparison with other Clustal W implementations reveals the high performance of the new algorithm and strategy in many-core CPU architectures, in a scenario where the sequences to align are relatively long (more than 10 kb) and, hence, many-core GPU hardware cannot be used. Thus, MC64-ClustalWP2 runs multiple alignments more than 18x faster than the original Clustal W algorithm, and more than 7x faster than the best x86 parallel implementation to date, being publicly available through a web service. Besides, these developments have been deployed in cost-effective personal computers and should be useful for life-science researchers, including the identification of identities and differences for mutation/polymorphism analyses, biodiversity and evolutionary studies and for the development of molecular markers for paternity testing, germplasm management and protection, to assist breeding, illegal traffic control, fraud prevention and for the protection of the intellectual property (identification

  17. GeauxDock: Accelerating Structure-Based Virtual Screening with Heterogeneous Computing

    PubMed Central

    Fang, Ye; Ding, Yun; Feinstein, Wei P.; Koppelman, David M.; Moreno, Juana; Jarrell, Mark; Ramanujam, J.; Brylinski, Michal

    2016-01-01

    Computational modeling of drug binding to proteins is an integral component of direct drug design. Particularly, structure-based virtual screening is often used to perform large-scale modeling of putative associations between small organic molecules and their pharmacologically relevant protein targets. Because of a large number of drug candidates to be evaluated, an accurate and fast docking engine is a critical element of virtual screening. Consequently, highly optimized docking codes are of paramount importance for the effectiveness of virtual screening methods. In this communication, we describe the implementation, tuning and performance characteristics of GeauxDock, a recently developed molecular docking program. GeauxDock is built upon the Monte Carlo algorithm and features a novel scoring function combining physics-based energy terms with statistical and knowledge-based potentials. Developed specifically for heterogeneous computing platforms, the current version of GeauxDock can be deployed on modern, multi-core Central Processing Units (CPUs) as well as massively parallel accelerators, Intel Xeon Phi and NVIDIA Graphics Processing Unit (GPU). First, we carried out a thorough performance tuning of the high-level framework and the docking kernel to produce a fast serial code, which was then ported to shared-memory multi-core CPUs yielding a near-ideal scaling. Further, using Xeon Phi gives 1.9× performance improvement over a dual 10-core Xeon CPU, whereas the best GPU accelerator, GeForce GTX 980, achieves a speedup as high as 3.5×. On that account, GeauxDock can take advantage of modern heterogeneous architectures to considerably accelerate structure-based virtual screening applications. GeauxDock is open-sourced and publicly available at www.brylinski.org/geauxdock and https://figshare.com/articles/geauxdock_tar_gz/3205249. PMID:27420300
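
    At the core of a Monte Carlo docking engine like the one described is the Metropolis acceptance rule applied to each trial pose; a minimal sketch follows, where the score change stands in for GeauxDock's combined physics/knowledge-based potential.

        // metropolis_sketch.cpp -- Metropolis acceptance step for Monte Carlo
        // pose sampling; generic sketch, not GeauxDock's tuned kernels.
        #include <cmath>
        #include <random>

        // Accept or reject a trial pose whose score changed by dE at temperature T.
        bool accept_move(double dE, double T, std::mt19937& rng) {
            if (dE <= 0.0) return true;            // improved score: always accept
            std::uniform_real_distribution<double> u(0.0, 1.0);
            return u(rng) < std::exp(-dE / T);     // worse score: Boltzmann test
        }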

  18. GeauxDock: Accelerating Structure-Based Virtual Screening with Heterogeneous Computing.

    PubMed

    Fang, Ye; Ding, Yun; Feinstein, Wei P; Koppelman, David M; Moreno, Juana; Jarrell, Mark; Ramanujam, J; Brylinski, Michal

    2016-01-01

    Computational modeling of drug binding to proteins is an integral component of direct drug design. Particularly, structure-based virtual screening is often used to perform large-scale modeling of putative associations between small organic molecules and their pharmacologically relevant protein targets. Because of a large number of drug candidates to be evaluated, an accurate and fast docking engine is a critical element of virtual screening. Consequently, highly optimized docking codes are of paramount importance for the effectiveness of virtual screening methods. In this communication, we describe the implementation, tuning and performance characteristics of GeauxDock, a recently developed molecular docking program. GeauxDock is built upon the Monte Carlo algorithm and features a novel scoring function combining physics-based energy terms with statistical and knowledge-based potentials. Developed specifically for heterogeneous computing platforms, the current version of GeauxDock can be deployed on modern, multi-core Central Processing Units (CPUs) as well as massively parallel accelerators, Intel Xeon Phi and NVIDIA Graphics Processing Unit (GPU). First, we carried out a thorough performance tuning of the high-level framework and the docking kernel to produce a fast serial code, which was then ported to shared-memory multi-core CPUs yielding a near-ideal scaling. Further, using Xeon Phi gives 1.9× performance improvement over a dual 10-core Xeon CPU, whereas the best GPU accelerator, GeForce GTX 980, achieves a speedup as high as 3.5×. On that account, GeauxDock can take advantage of modern heterogeneous architectures to considerably accelerate structure-based virtual screening applications. GeauxDock is open-sourced and publicly available at www.brylinski.org/geauxdock and https://figshare.com/articles/geauxdock_tar_gz/3205249.

  19. Novel quad-band terahertz metamaterial absorber based on single pattern U-shaped resonator

    NASA Astrophysics Data System (ADS)

    Wang, Ben-Xin; Wang, Gui-Zhen

    2017-03-01

    A novel quad-band terahertz metamaterial absorber using four different modes of a single-pattern resonator is demonstrated. Near-perfect absorption is realized at four distinct frequencies. Near-field distributions of the four modes are provided to reveal the physical picture of the multiple-band absorption. Unlike most previous quad-band absorbers, which typically require four or more patterns, the designed absorber has only one resonant structure, which is simpler than previous works. The presented quad-band absorber has potential applications in biological sensing, medical imaging, and material detection.

  20. Analysis OpenMP performance of AMD and Intel architecture for breaking waves simulation using MPS

    NASA Astrophysics Data System (ADS)

    Alamsyah, M. N. A.; Utomo, A.; Gunawan, P. H.

    2018-03-01

    A simulation of breaking waves using the Navier-Stokes equations via the moving particle semi-implicit (MPS) method over a closed domain is given. The results show that parallel computing on a multicore architecture using the OpenMP platform can reduce the computational time to almost half of the serial time. Here, a comparison of two computer architectures (AMD and Intel) is performed. The Intel architecture shows a better CPU time than AMD; in efficiency, however, the AMD architecture scores slightly higher than Intel. For the simulation with 1512 particles, the CPU times using Intel and AMD are 12662.47 and 28282.30 s, respectively, while the parallel efficiency reaches 50.09% on AMD and up to 49.42% on Intel.
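
    For reference, the efficiency figures quoted here follow the standard definitions: speedup S = T_serial / T_parallel and parallel efficiency E = S / p for p cores. A tiny worked example with hypothetical timings:

        // scaling.cpp -- standard speedup/efficiency definitions behind such
        // comparisons; the timings and core count below are hypothetical.
        #include <cstdio>

        int main() {
            const double t_serial = 100.0, t_parallel = 26.0;  // hypothetical seconds
            const int p = 4;                                   // hypothetical cores
            const double S = t_serial / t_parallel;            // speedup
            const double E = S / p;                            // parallel efficiency
            std::printf("S = %.2f, E = %.1f%%\n", S, 100.0 * E);  // S = 3.85, E = 96.2%
        }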

  1. GPU/MIC Acceleration of the LHC High Level Trigger to Extend the Physics Reach at the LHC

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Halyo, Valerie; Tully, Christopher

    The quest for rare new physics phenomena leads the PI [3] to propose evaluation of coprocessors based on Graphics Processing Units (GPUs) and the Intel Many Integrated Core (MIC) architecture for integration into the trigger system at the LHC. This will require development of a new massively parallel implementation of the well known Combinatorial Track Finder, which uses the Kalman Filter to accelerate processing of data from the silicon pixel and microstrip detectors and reconstruct the trajectories of all charged particles down to momenta of 100 MeV. It is expected to run at least one order of magnitude faster than an equivalent algorithm on a quad-core CPU for extreme pileup scenarios of 100 interactions per bunch crossing. The new tracking algorithms will be developed and optimized separately on the GPU and Intel MIC and then evaluated against each other for performance and power efficiency. The results will be used to project the cost of the proposed hardware architectures for the HLT server farm, taking into account the long term projections of the main vendors in the market (AMD, Intel, and NVIDIA) over the next 10 years. Extensive experience and familiarity of the PI with the LHC tracker and trigger requirements led to the development of a complementary tracking algorithm that is described in [arxiv: 1305.4855], [arxiv: 1309.6275] and preliminary results accepted to JINST.

  2. Reliability testing of ultra-low noise InGaAs quad photoreceivers

    NASA Astrophysics Data System (ADS)

    Joshi, Abhay M.; Datta, Shubhashish; Prasad, Narasimha; Sivertz, Michael

    2018-02-01

    We have developed ultra-low noise quadrant InGaAs photoreceivers for multiple applications ranging from Laser Interferometric Gravitational Wave Detection to 3D Wind Profiling. Devices with diameters of 0.5 mm, 1 mm, and 2 mm were processed, with the nominal capacitance of a single quadrant of a 1 mm quad photodiode being 2.5 pF. The 1 mm diameter InGaAs quad photoreceivers, using low-noise, bipolar-input OpAmp circuitry, exhibit an equivalent input noise per quadrant of <1.7 pA/√Hz in the 2 to 20 MHz frequency range. The InGaAs quad photoreceivers have undergone the following reliability tests: 30 MeV proton radiation up to a total ionizing dose (TID) of 50 krad, mechanical shock, and sinusoidal vibration.

  3. Optimizing legacy molecular dynamics software with directive-based offload

    NASA Astrophysics Data System (ADS)

    Michael Brown, W.; Carrillo, Jan-Michael Y.; Gavhane, Nitin; Thakkar, Foram M.; Plimpton, Steven J.

    2015-10-01

    Directive-based programming models are one solution for exploiting many-core coprocessors to increase simulation rates in molecular dynamics. They offer the potential to reduce code complexity with offload models that can selectively target computations to run on the CPU, the coprocessor, or both. In this paper, we describe modifications to the LAMMPS molecular dynamics code to enable concurrent calculations on a CPU and coprocessor. We demonstrate that standard molecular dynamics algorithms can run efficiently on both the CPU and an x86-based coprocessor using the same subroutines. As a consequence, we demonstrate that code optimizations for the coprocessor also result in speedups on the CPU; in extreme cases up to 4.7X. We provide results for LAMMPS benchmarks and for production molecular dynamics simulations using the Stampede hybrid supercomputer with both Intel® Xeon Phi™ coprocessors and NVIDIA GPUs. The optimizations presented have increased simulation rates by over 2X for organic molecules and over 7X for liquid crystals on Stampede. The optimizations are available as part of the "Intel package" supplied with LAMMPS.
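
    Directive-based offload of the sort described marks a compute region so it can execute on the coprocessor while the same subroutine remains valid host code. The sketch below uses standard OpenMP target directives rather than the Intel-specific offload pragmas of the LAMMPS "Intel package"; the toy force kernel is an assumption for illustration.

        // offload_sketch.cpp -- directive-based offload of a force loop, shown
        // with standard OpenMP target directives (the LAMMPS Intel package uses
        // Intel's offload pragmas); toy kernel, not LAMMPS code.
        #include <cmath>

        void compute_forces(const double* x, double* f, int n) {
            // Copy x to the device, run the loop there, copy f back. With no
            // device present, OpenMP falls back to host execution.
            #pragma omp target teams distribute parallel for \
                    map(to: x[0:n]) map(from: f[0:n])
            for (int i = 0; i < n; ++i)
                f[i] = -2.0 * x[i] * std::exp(-x[i] * x[i]);  // toy force law
        }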

  4. Charge line quad pulser

    DOEpatents

    Booth, R.

    1996-10-08

    A quartet of parallel coupled planar triodes is removably mounted in a quadrahedron shaped PCB structure. Releasable brackets and flexible means attached to each triode socket make triode cathode and grid contact with respective conductive coatings on the PCB and a detachable cylindrical conductive element enclosing and contacting the triode anodes jointly permit quick and easy replacement of faulty triodes. By such orientation, the quad pulser can convert a relatively low and broad pulse into a very high and narrow pulse. 16 figs.

  5. Charge line quad pulser

    DOEpatents

    Booth, Rex

    1996-01-01

    A quartet of parallel coupled planar triodes is removably mounted in a quadrahedron shaped PCB structure. Releasable brackets and flexible means attached to each triode socket make triode cathode and grid contact with respective conductive coatings on the PCB and a detachable cylindrical conductive element enclosing and contacting the triode anodes jointly permit quick and easy replacement of faulty triodes. By such orientation, the quad pulser can convert a relatively low and broad pulse into a very high and narrow pulse.

  6. 75 FR 21353 - Intel Corporation, Fab 20 Division, Including On-Site Leased Workers From Volt Technical...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2010-04-23

    ... DEPARTMENT OF LABOR Employment and Training Administration [TA-W-73,642] Intel Corporation, Fab 20... of Intel Corporation, Fab 20 Division, including on-site leased workers of Volt Technical Resources... Precision, Inc. were employed on-site at the Hillsboro, Oregon location of Intel Corporation, Fab 20...

  7. Evaluation of an Adaptive Automation Trigger Based on Task Performance, Priority, and Frequency

    DTIC Science & Technology

    2013-06-01

    with dual Intel® Xeon® CPU X5550 processors @ 2.67 GHz each, 12.0 GB RAM, and a 1.5 GB PCIe NVIDIA Quadro FX 4800 graphics card (Microsoft...Cole Publishing Company. Miller, C. A., & Parasuraman, R. (2007). Designing for flexible interaction between humans and automation: Delegation

  8. A task-based parallelism and vectorized approach to 3D Method of Characteristics (MOC) reactor simulation for high performance computing architectures

    NASA Astrophysics Data System (ADS)

    Tramm, John R.; Gunow, Geoffrey; He, Tim; Smith, Kord S.; Forget, Benoit; Siegel, Andrew R.

    2016-05-01

    In this study we present and analyze a formulation of the 3D Method of Characteristics (MOC) technique applied to the simulation of full core nuclear reactors. Key features of the algorithm include a task-based parallelism model that allows independent MOC tracks to be assigned to threads dynamically, ensuring load balancing, and a wide vectorizable inner loop that takes advantage of modern SIMD computer architectures. The algorithm is implemented in a set of highly optimized proxy applications in order to investigate its performance characteristics on CPU, GPU, and Intel Xeon Phi architectures. Speed, power, and hardware cost efficiencies are compared. Additionally, performance bottlenecks are identified for each architecture in order to determine the prospects for continued scalability of the algorithm on next generation HPC architectures.
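
    The combination the abstract describes (dynamically scheduled tracks for load balance plus a wide vectorizable inner loop) maps naturally onto OpenMP, as in the toy sweep below; the data layout and physics are illustrative, not the paper's proxy applications, and real codes guard the flux accumulation with atomics or reductions.

        // moc_sketch.cpp -- task-style MOC sweep: tracks are handed to threads
        // dynamically, and the energy-group loop inside each segment is SIMD
        // vectorized. Toy data layout; not the paper's proxy applications.
        #include <array>
        #include <cmath>
        #include <vector>

        constexpr int G = 64;                        // energy groups (SIMD target)
        struct Segment { double length; int region; };
        struct Track   { std::vector<Segment> segs; };

        void sweep(std::vector<Track>& tracks,
                   const std::vector<std::array<double, G>>& sigma_t,
                   std::vector<std::array<double, G>>& scalar_flux) {
            // Dynamic schedule: track lengths vary widely, so static would imbalance.
            #pragma omp parallel for schedule(dynamic)
            for (long t = 0; t < (long)tracks.size(); ++t) {
                std::array<double, G> psi;
                psi.fill(1.0);                       // toy incoming angular flux
                for (const Segment& s : tracks[t].segs) {
                    const std::array<double, G>& sig = sigma_t[s.region];
                    std::array<double, G>& phi = scalar_flux[s.region];
                    #pragma omp simd                 // wide vectorizable inner loop
                    for (int g = 0; g < G; ++g) {
                        double dpsi = psi[g] * (1.0 - std::exp(-sig[g] * s.length));
                        phi[g] += dpsi;              // real codes: atomic/reduction
                        psi[g] -= dpsi;
                    }
                }
            }
        }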

  9. First experience of vectorizing electromagnetic physics models for detector simulation

    NASA Astrophysics Data System (ADS)

    Amadio, G.; Apostolakis, J.; Bandieramonte, M.; Bianchini, C.; Bitzes, G.; Brun, R.; Canal, P.; Carminati, F.; de Fine Licht, J.; Duhem, L.; Elvira, D.; Gheata, A.; Jun, S. Y.; Lima, G.; Novak, M.; Presbyterian, M.; Shadura, O.; Seghal, R.; Wenzel, S.

    2015-12-01

    The recent emergence of hardware architectures characterized by many-core or accelerated processors has opened new opportunities for concurrent programming models taking advantage of both SIMD and SIMT architectures. The GeantV vector prototype for detector simulations has been designed to exploit both the vector capability of mainstream CPUs and multi-threading capabilities of coprocessors including NVidia GPUs and Intel Xeon Phi. The characteristics of these architectures are very different in terms of the vectorization depth, parallelization needed to achieve optimal performance or memory access latency and speed. An additional challenge is to avoid the code duplication often inherent to supporting heterogeneous platforms. In this paper we present the first experience of vectorizing electromagnetic physics models developed for the GeantV project.

  10. Modeling of Radiotherapy Linac Source Terms Using ARCHER Monte Carlo Code: Performance Comparison for GPU and MIC Parallel Computing Devices

    NASA Astrophysics Data System (ADS)

    Lin, Hui; Liu, Tianyu; Su, Lin; Bednarz, Bryan; Caracappa, Peter; Xu, X. George

    2017-09-01

    Monte Carlo (MC) simulation is well recognized as the most accurate method for radiation dose calculations. For radiotherapy applications, accurate modelling of the source term, i.e. the clinical linear accelerator, is critical to the simulation. The purpose of this paper is to perform source modelling and examine the accuracy and performance of the models on Intel Many Integrated Core coprocessors (aka Xeon Phi) and NVIDIA GPUs using ARCHER, and to explore potential optimization methods. Phase-space-based source modelling has been implemented. Good agreement was found in a tomotherapy prostate patient case and a TrueBeam breast case. In terms of performance, the whole simulation for the prostate plan and the breast plan took about 173 s and 73 s, respectively, with 1% statistical error.

  11. Electromagnetic physics models for parallel computing architectures

    DOE PAGES

    Amadio, G.; Ananya, A.; Apostolakis, J.; ...

    2016-11-21

    The recent emergence of hardware architectures characterized by many-core or accelerated processors has opened new opportunities for concurrent programming models taking advantage of both SIMD and SIMT architectures. GeantV, a next generation detector simulation, has been designed to exploit both the vector capability of mainstream CPUs and multi-threading capabilities of coprocessors including NVidia GPUs and Intel Xeon Phi. The characteristics of these architectures are very different in terms of the vectorization depth and type of parallelization needed to achieve optimal performance. In this paper we describe implementation of electromagnetic physics models developed for parallel computing architectures as a part of the GeantV project. Finally, the results of preliminary performance evaluation and physics validation are presented as well.

  12. Toward performance portability of the Albany finite element analysis code using the Kokkos library

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Demeshko, Irina; Watkins, Jerry; Tezaur, Irina K.

    Performance portability on heterogeneous high-performance computing (HPC) systems is a major challenge faced today by code developers: parallel code needs to be executed correctly as well as with high performance on machines with different architectures, operating systems, and software libraries. The finite element method (FEM) is a popular and flexible method for discretizing partial differential equations arising in a wide variety of scientific, engineering, and industrial applications that require HPC. This paper presents some preliminary results pertaining to our development of a performance-portable implementation of the FEM-based Albany code. Performance portability is achieved using the Kokkos library. We present performance results for the Aeras global atmosphere dynamical core module in Albany. Finally, numerical experiments show that our single code implementation gives reasonable performance across three multicore/many-core architectures: NVIDIA Graphics Processing Units (GPUs), Intel Xeon Phis, and multicore CPUs.

  13. Toward performance portability of the Albany finite element analysis code using the Kokkos library

    DOE PAGES

    Demeshko, Irina; Watkins, Jerry; Tezaur, Irina K.; ...

    2018-02-05

    Performance portability on heterogeneous high-performance computing (HPC) systems is a major challenge faced today by code developers: parallel code needs to be executed correctly as well as with high performance on machines with different architectures, operating systems, and software libraries. The finite element method (FEM) is a popular and flexible method for discretizing partial differential equations arising in a wide variety of scientific, engineering, and industrial applications that require HPC. This paper presents some preliminary results pertaining to our development of a performance-portable implementation of the FEM-based Albany code. Performance portability is achieved using the Kokkos library. We present performance results for the Aeras global atmosphere dynamical core module in Albany. Finally, numerical experiments show that our single code implementation gives reasonable performance across three multicore/many-core architectures: NVIDIA Graphics Processing Units (GPUs), Intel Xeon Phis, and multicore CPUs.

  14. CUDA-based real time surgery simulation.

    PubMed

    Liu, Youquan; De, Suvranu

    2008-01-01

    In this paper we present a general software platform that enables real-time surgery simulation on the newly available compute unified device architecture (CUDA) from NVIDIA. CUDA-enabled GPUs harness the power of 128 processors which allow data-parallel computations. Compared to previous GPGPU approaches, CUDA is significantly more flexible, with a C language interface. We report the implementation of both collision detection and consequent deformation computation algorithms. Our test results indicate that CUDA enables a twenty-fold speedup for collision detection and about a fifteen-fold speedup for deformation computation on an Intel Core 2 Quad 2.66 GHz machine with a GeForce 8800 GTX.

  15. A High Performance Computing Framework for Physics-based Modeling and Simulation of Military Ground Vehicles

    DTIC Science & Technology

    2011-03-25

    number one and Nebulae at number three. Both systems rely on GPU co-processing and use Intel Xeon processors and NVIDIA Tesla C2050 GPUs. In spite of a theoretical peak capability of almost 3 Petaflop/s, Nebulae clocked at 1.271 PFlop/s when running the Linpack benchmark, which puts it

  16. Acceleration of boundary element method for linear elasticity

    NASA Astrophysics Data System (ADS)

    Zapletal, Jan; Merta, Michal; Čermák, Martin

    2017-07-01

    In this work we describe the accelerated assembly of system matrices for the boundary element method using the Intel Xeon Phi coprocessors. We present a model problem, provide a brief overview of its discretization and acceleration of the system matrices assembly using the coprocessors, and test the accelerated version using a numerical benchmark.

  17. Comparative Performance Analysis of Intel Xeon Phi, GPU, and CPU: A Case Study from Microscopy Image Analysis

    PubMed Central

    Teodoro, George; Kurc, Tahsin; Kong, Jun; Cooper, Lee; Saltz, Joel

    2014-01-01

    We study and characterize the performance of operations in an important class of applications on GPUs and Many Integrated Core (MIC) architectures. Our work is motivated by applications that analyze low-dimensional spatial datasets captured by high-resolution sensors, such as image datasets obtained from whole slide tissue specimens using microscopy scanners. Common operations in these applications involve the detection and extraction of objects (object segmentation), the computation of features of each extracted object (feature computation), and characterization of objects based on these features (object classification). In this work, we identify the data access and computation patterns of operations in the object segmentation and feature computation categories. We systematically implement and evaluate the performance of these operations on modern CPUs, GPUs, and MIC systems for a microscopy image analysis application. Our results show that the performance on a MIC of operations that perform regular data access is comparable to, or sometimes better than, that on a GPU. On the other hand, GPUs are significantly more efficient than MICs for operations that access data irregularly. This is a result of the low performance of MICs when it comes to random data access. We have also examined the coordinated use of MICs and CPUs. Our experiments show that using a performance-aware task scheduling strategy for application operations improves performance by about 1.29× over a first-come-first-served strategy. This allows applications to obtain high performance efficiency on CPU-MIC systems: the example application attained an efficiency of 84% on 192 nodes (3072 CPU cores and 192 MICs). PMID:25419088

  18. Lower-limb reconstruction with chimeric flaps: The quad flap.

    PubMed

    Azouz, Solomon M; Castel, Nikki A; Vijayasekaran, Aparna; Rebecca, Alanna M; Lettieri, Salvatore C

    2018-05-07

    Early soft-tissue coverage is critical for treating traumatic open lower-extremity wounds. As free-flap reconstruction evolves, injuries once thought to be nonreconstructable are being salvaged. Free-tissue transfer is imperative when there is extensive dead space or exposure of vital structures such as bone, tendon, nerves, or blood vessels. We describe 2 cases of lower-extremity crush injuries salvaged with the quad flap. This novel flap consists of parascapular, scapular, serratus, and latissimus dorsi free flaps in combination on one pedicle. This flap provides the large amount of soft-tissue coverage necessary to cover substantial defects from skin degloving, tibia and fibula fractures, and soft-tissue loss. In case 1, a 51-year-old woman was struck by an automobile and sustained bilateral tibia and fibula fractures, a crush degloving injury of the left leg, and a right forefoot traumatic amputation. She underwent reconstruction with a contralateral quad free flap. In case 2, a 53-year-old man sustained a right tibia plateau fracture with large soft-tissue defects from a motorcycle accident. He had a crush degloving injury of the entire anterolateral compartment over the distal and lower third of the right leg. The large soft-tissue defect was reconstructed with a contralateral quad flap. In both cases, the donor site was closed primarily and without early flap failures. There was one surgical complication, an abscess in case 2; the patient was taken back to the operating room for débridement of necrotic tissue. There have been no long-term complications in either case. Both patients achieved adequate soft-tissue coverage, avoided amputation, and had satisfactory aesthetic and functional outcomes. With appropriate surgical technique and patient selection, the quad-flap technique is promising for reconstructing the lower extremity. © 2018 Wiley Periodicals, Inc.

  19. Quad City Intersection Traffic Accident Study: 1993 Data

    DOT National Transportation Integrated Search

    1996-03-01

    Accident information is an important factor from which to work towards the regional Transportation System Management (TSM) objective of improving the safety of the local transportation system. The 1993 Quad City Intersection Traffic Accident Re...

  20. 75 FR 12729 - Foreign-Trade Zone 133-Quad-Cities, Iowa/Illinois; Application for Expansion

    Federal Register 2010, 2011, 2012, 2013, 2014

    2010-03-17

    ... DEPARTMENT OF COMMERCE Foreign-Trade Zones Board [Docket 15-2010] Foreign-Trade Zone 133--Quad... Zones Board (the Board) by the Quad-City Foreign Trade Zone, Inc., grantee of FTZ 133, requesting authority to expand the zone within the Davenport-Rock Island-Moline Customs and Border Protection port of...

  1. Generic accelerated sequence alignment in SeqAn using vectorization and multi-threading.

    PubMed

    Rahn, René; Budach, Stefan; Costanza, Pascal; Ehrhardt, Marcel; Hancox, Jonny; Reinert, Knut

    2018-05-03

    Pairwise sequence alignment is undoubtedly a central tool in many bioinformatics analyses. In this paper, we present a generically accelerated module for pairwise sequence alignments applicable to a broad range of applications. In our module, we unified the standard dynamic programming kernel used for pairwise sequence alignments and extended it with a generalized inter-sequence vectorization layout, such that many alignments can be computed simultaneously by exploiting SIMD (Single Instruction Multiple Data) instructions of modern processors. We then extended the module by adding two layers of thread-level parallelization, where we a) distribute many independent alignments on multiple threads and b) inherently parallelize a single alignment computation using a work-stealing approach producing a dynamic wavefront progressing along the minor diagonal. We evaluated our alignment vectorization and parallelization on different processors, including the newest Intel® Xeon® (Skylake) and Intel® Xeon Phi™ (KNL) processors, and on a range of use cases. The instruction set AVX512-BW (Byte and Word), available on Skylake processors, can genuinely improve the performance of vectorized alignments. We could run single alignments 1600 times faster on the Xeon Phi™ and 1400 times faster on the Xeon® than executing them with our previous sequential alignment module. The module is programmed in C++ using the SeqAn (Reinert et al., 2017) library and distributed with version 2.4 under the BSD license. We support the SSE4, AVX2, and AVX512 instruction sets and include UME::SIMD, a SIMD-instruction wrapper library, to extend our module for further instruction sets. We thoroughly test all alignment components with all major C++ compilers on various platforms. rene.rahn@fu-berlin.de.
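
    To make the inter-sequence vectorization layout concrete, here is a minimal C++ sketch (our illustration, not SeqAn code): one cell from each of W independent alignments occupies one SIMD lane, so a single pass over the lanes advances W dynamic-programming matrices at once. The scoring is reduced to match/gap costs, traceback is omitted, and all names are assumptions.

      // Sketch of inter-sequence vectorization: W alignments advance together,
      // one per SIMD lane. Simplified linear gap costs; not SeqAn's kernel.
      #include <algorithm>
      #include <cstdint>

      constexpr int W = 8;  // lanes, e.g. 8 x 32-bit scores in one AVX2 register

      // One DP cell update for W alignments at once; with -O2 (or an explicit
      // "#pragma omp simd") a compiler can map the lane loop onto SIMD code.
      void dp_step(const int32_t* diag, const int32_t* up, const int32_t* left,
                   const int32_t* match_score, int32_t gap_cost, int32_t* out) {
          for (int lane = 0; lane < W; ++lane) {
              int32_t best = diag[lane] + match_score[lane];     // (mis)match
              best = std::max(best, up[lane]   - gap_cost);      // gap in seq A
              best = std::max(best, left[lane] - gap_cost);      // gap in seq B
              out[lane] = best;
          }
      }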

  2. Simulating Hydrologic Flow and Reactive Transport with PFLOTRAN and PETSc on Emerging Fine-Grained Parallel Computer Architectures

    NASA Astrophysics Data System (ADS)

    Mills, R. T.; Rupp, K.; Smith, B. F.; Brown, J.; Knepley, M.; Zhang, H.; Adams, M.; Hammond, G. E.

    2017-12-01

    As the high-performance computing community pushes towards the exascale horizon, power and heat considerations have driven the increasing importance and prevalence of fine-grained parallelism in new computer architectures. High-performance computing centers have become increasingly reliant on GPGPU accelerators and "manycore" processors such as the Intel Xeon Phi line, and 512-bit SIMD registers have even been introduced in the latest generation of Intel's mainstream Xeon server processors. The high degree of fine-grained parallelism and more complicated memory hierarchy considerations of such "manycore" processors present several challenges to existing scientific software. Here, we consider how the massively parallel, open-source hydrologic flow and reactive transport code PFLOTRAN - and the underlying Portable, Extensible Toolkit for Scientific Computation (PETSc) library on which it is built - can best take advantage of such architectures. We will discuss some key features of these novel architectures and our code optimizations and algorithmic developments targeted at them, and present experiences drawn from working with a wide range of PFLOTRAN benchmark problems on these architectures.

  3. High-performance 3D compressive sensing MRI reconstruction.

    PubMed

    Kim, Daehyun; Trzasko, Joshua D; Smelyanskiy, Mikhail; Haider, Clifton R; Manduca, Armando; Dubey, Pradeep

    2010-01-01

    Compressive Sensing (CS) is a nascent sampling and reconstruction paradigm that describes how sparse or compressible signals can be accurately approximated using many fewer samples than traditionally believed. In magnetic resonance imaging (MRI), where scan duration is directly proportional to the number of acquired samples, CS has the potential to dramatically decrease scan time. However, the computationally expensive nature of CS reconstructions has so far precluded their use in routine clinical practice; instead, more-easily generated but lower-quality images continue to be used. We investigate the development and optimization of a proven inexact quasi-Newton CS reconstruction algorithm on several modern parallel architectures, including CPUs, GPUs, and Intel's Many Integrated Core (MIC) architecture. Our (optimized) baseline implementation on a quad-core Core i7 is able to reconstruct a 256 × 160 × 80 volume of the neurovasculature from an 8-channel, 10× undersampled data set within 56 seconds, which is already a significant improvement over existing implementations. The latest six-core Core i7 reduces the reconstruction time further to 32 seconds. Moreover, we show that the CS algorithm benefits from modern throughput-oriented architectures. Specifically, our CUDA-based implementation on an NVIDIA GTX480 reconstructs the same dataset in 16 seconds, while Intel's Knights Ferry (KNF), based on the MIC architecture, reduces the time further to 12 seconds. This level of performance allows the neurovascular dataset to be reconstructed within a clinically viable time.

  4. Performance Evaluation of Supercomputers using HPCC and IMB Benchmarks

    NASA Technical Reports Server (NTRS)

    Saini, Subhash; Ciotti, Robert; Gunney, Brian T. N.; Spelce, Thomas E.; Koniges, Alice; Dossa, Don; Adamidis, Panagiotis; Rabenseifner, Rolf; Tiyyagura, Sunil R.; Mueller, Matthias

    2006-01-01

    The HPC Challenge (HPCC) benchmark suite and the Intel MPI Benchmark (IMB) are used to compare and evaluate the combined performance of processor, memory subsystem and interconnect fabric of five leading supercomputers: SGI Altix BX2, Cray X1, Cray Opteron Cluster, Dell Xeon cluster, and NEC SX-8. These five systems use five different networks (SGI NUMALINK4, Cray network, Myrinet, InfiniBand, and NEC IXS). The complete set of HPCC benchmarks is run on each of these systems. Additionally, we present Intel MPI Benchmarks (IMB) results to study the performance of 11 MPI communication functions on these systems.

  5. Kalman filter tracking on parallel architectures

    NASA Astrophysics Data System (ADS)

    Cerati, G.; Elmer, P.; Krutelyov, S.; Lantz, S.; Lefebvre, M.; McDermott, K.; Riley, D.; Tadel, M.; Wittich, P.; Wurthwein, F.; Yagil, A.

    2017-10-01

    We report on the progress of our studies towards a Kalman filter track reconstruction algorithm with optimal performance on manycore architectures. The combinatorial structure of these algorithms is not immediately compatible with an efficient SIMD (or SIMT) implementation; the challenge for us is to recast the existing software so it can readily generate hundreds of shared-memory threads that exploit the underlying instruction set of modern processors. We show how the data and associated tasks can be organized in a way that is conducive to both multithreading and vectorization. We demonstrate very good performance on Intel Xeon and Xeon Phi architectures, as well as promising first results on Nvidia GPUs.
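
    A hedged illustration of the data reorganization such work relies on (not the authors' code): storing track parameters in structure-of-arrays form so that the same update applied to many tracks maps onto SIMD lanes. The member names and the straight-line propagation step are invented for the example.

      // Structure-of-arrays track storage: one array per parameter rather than
      // one struct per track, so identical per-track updates vectorize cleanly.
      #include <cstddef>
      #include <vector>

      struct TracksSoA {
          std::vector<float> x, y, tx, ty;  // positions and slopes, one entry per track
      };

      void propagate(TracksSoA& trk, float dz) {
          const std::size_t n = trk.x.size();
          // Unit-stride loads and stores over each parameter array, unlike the
          // strided access an array-of-structs layout would produce.
          #pragma omp simd
          for (std::size_t i = 0; i < n; ++i) {
              trk.x[i] += trk.tx[i] * dz;
              trk.y[i] += trk.ty[i] * dz;
          }
      }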

  6. Assessment of vertical changes during maxillary expansion using quad helix or bonded rapid maxillary expander.

    PubMed

    Conroy-Piskai, Cara; Galang-Boquiren, Maria Therese S; Obrez, Ales; Viana, Maria Grace Costa; Oppermann, Nelson; Sanchez, Flavio; Edgren, Bradford; Kusnoto, Budi

    2016-11-01

    To determine if there is a significantly different effect on vertical changes during phase I palatal expansion treatment using a quad helix and a bonded rapid maxillary expander in growing skeletal Class I and Class II patients. This retrospective study looked at 2 treatment groups, a quad helix group and a bonded rapid maxillary expander group, before treatment (T1) and at the completion of phase I treatment (T2). Each treatment group was compared to an untreated predicted growth model. Lateral cephalograms at T1 and T2 were traced and analyzed for changes in vertical dimension. No differences were found between the treatment groups at T1, but significant differences at T2 were found for convexity, lower facial height, total facial height, facial axis, and Frankfort Mandibular Plane Angle (FMA) variables. A comparison of treatment groups at T2 to their respective untreated predicted growth models found a significant difference for the lower facial height variable in the quad helix group and for the upper first molar to palatal plane (U6-PP) variable in the bonded expander group. Overall, both the quad helix expander and the bonded rapid maxillary expander showed minimal vertical changes during palatal expansion treatment. The differences at T2 suggested that the quad helix expander had more control over skeletal vertical measurements. When comparing treatment results to untreated predicted growth values, the quad helix expander appeared to better maintain lower facial height and the bonded rapid maxillary expander appeared to better maintain the maxillary first molar vertical height.

  7. The impact of ΛCDM substructure and baryon-dark matter transition on the image positions of quad galaxy lenses

    NASA Astrophysics Data System (ADS)

    Gomer, Matthew R.; Williams, Liliya L. R.

    2018-04-01

    The positions of multiple images in galaxy lenses are related to the galaxy mass distribution. Smooth elliptical mass profiles were previously shown to be inadequate in reproducing the quad population. In this paper, we explore the deviations from such smooth elliptical mass distributions. Unlike most other work, we use a model-free approach based on the relative polar image angles of quads, and their position in 3D space with respect to the fundamental surface of quads (FSQ). The FSQ is defined by quads produced by elliptical lenses. We have generated thousands of quads from synthetic populations of lenses with substructure consistent with Lambda cold dark matter (ΛCDM) simulations, and found that such perturbations are not sufficient to match the observed distribution of quads relative to the FSQ. The result is unchanged even when subhalo masses are increased by a factor of 10, and the most optimistic lensing selection bias is applied. We then produce quads from galaxies created using two components, representing baryons and dark matter. The transition from the mass being dominated by baryons in inner radii to being dominated by dark matter in outer radii can carry with it asymmetries, which would affect relative image angles. We run preliminary experiments using lenses with two elliptical mass components with non-identical axial ratios and position angles, perturbations from ellipticity in the form of non-zero Fourier coefficients a4 and a6, and artificially offset ellipse centres as a proxy for asymmetry at image radii. We show that combination of these effects is a promising way of accounting for quad population properties. We conclude that the quad population provides a unique and sensitive tool for constraining detailed mass distribution in the centres of galaxies.

  8. Large Scale GW Calculations on the Cori System

    NASA Astrophysics Data System (ADS)

    Deslippe, Jack; Del Ben, Mauro; da Jornada, Felipe; Canning, Andrew; Louie, Steven

    The NERSC Cori system, powered by 9000+ Intel Xeon Phi processors, represents one of the largest HPC systems for open science in the United States and the world. We discuss the optimization of the GW methodology for this system, including both node-level and system-scale optimizations. We highlight multiple large-scale (thousands of atoms) case studies and discuss both absolute application performance and comparisons to calculations on more traditional HPC architectures. We find that the GW method is particularly well suited for many-core architectures due to the ability to exploit a large amount of parallelism across many layers of the system. This work was supported by the U.S. Department of Energy, Office of Science, Basic Energy Sciences, Materials Sciences and Engineering Division, as part of the Computational Materials Sciences Program.

  9. Static analysis of the hull plate using the finite element method

    NASA Astrophysics Data System (ADS)

    Ion, A.

    2015-11-01

    This paper presents the static analysis of a container ship's construction at two levels: the first at the girder/hull-plate level, and the second over the entire strength hull of the vessel. This article describes the work for the static analysis of a hull plate using the software package ANSYS Mechanical 14.5. The program is run on a computer with four Intel Xeon X5260 processors at 3.33 GHz and 32 GB of installed memory. In terms of software, the shared-memory parallel version of ANSYS runs ANSYS across multiple cores on an SMP system, while the distributed-memory parallel version (Distributed ANSYS) runs ANSYS across multiple processors on SMP or DMP systems.

  10. Scalable Algorithms for Clustering Large Geospatiotemporal Data Sets on Manycore Architectures

    NASA Astrophysics Data System (ADS)

    Mills, R. T.; Hoffman, F. M.; Kumar, J.; Sreepathi, S.; Sripathi, V.

    2016-12-01

    The increasing availability of high-resolution geospatiotemporal data sets from sources such as observatory networks, remote sensing platforms, and computational Earth system models has opened new possibilities for knowledge discovery using data sets fused from disparate sources. Traditional algorithms and computing platforms are impractical for the analysis and synthesis of data sets of this size; however, new algorithmic approaches that can effectively utilize the complex memory hierarchies and the extremely high levels of available parallelism in state-of-the-art high-performance computing platforms can enable such analysis. We describe a massively parallel implementation of accelerated k-means clustering and some optimizations to boost computational intensity and utilization of wide SIMD lanes on state-of-the-art multi- and manycore processors, including the second-generation Intel Xeon Phi ("Knights Landing") processor based on the Intel Many Integrated Core (MIC) architecture, which introduces several new features, including an on-package high-bandwidth memory. We also analyze the code in the context of a few practical applications to the analysis of climatic and remotely-sensed vegetation phenology data sets, and speculate on some of the new applications that such scalable analysis methods may enable.
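
    As a sketch of the kind of kernel such optimizations target (our illustration, not the authors' code), the point-to-centroid distance evaluation at the heart of k-means can be arranged so the innermost loop over coordinates is unit-stride and vectorizable while the loop over points is threaded; all names and the row-major layout are assumptions.

      // Illustrative k-means assignment kernel: threaded over points, with a
      // vectorizable reduction over coordinates. Not the authors' implementation.
      #include <cfloat>
      #include <cstddef>
      #include <vector>

      // points: n x d row-major, centroids: k x d row-major, assign: n labels.
      void assign_clusters(const std::vector<float>& points, std::size_t n,
                           const std::vector<float>& centroids, std::size_t k,
                           std::size_t d, std::vector<int>& assign) {
          #pragma omp parallel for
          for (long i = 0; i < (long)n; ++i) {
              float best = FLT_MAX;
              int bestc = 0;
              for (std::size_t c = 0; c < k; ++c) {
                  float dist = 0.0f;
                  // Unit-stride accesses map onto wide SIMD lanes (e.g. AVX-512).
                  #pragma omp simd reduction(+:dist)
                  for (std::size_t j = 0; j < d; ++j) {
                      float diff = points[i*d + j] - centroids[c*d + j];
                      dist += diff * diff;
                  }
                  if (dist < best) { best = dist; bestc = (int)c; }
              }
              assign[i] = bestc;
          }
      }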

  11. GPU accelerated simulations of 3D deterministic particle transport using discrete ordinates method

    NASA Astrophysics Data System (ADS)

    Gong, Chunye; Liu, Jie; Chi, Lihua; Huang, Haowei; Fang, Jingyue; Gong, Zhenghu

    2011-07-01

    The Graphics Processing Unit (GPU), originally developed for real-time, high-definition 3D graphics in computer games, now offers substantial computing power for scientific applications. The basis of particle transport simulation is the time-dependent, multi-group, inhomogeneous Boltzmann transport equation. The numerical solution to the Boltzmann equation involves the discrete ordinates (Sn) method and the procedure of source iteration. In this paper, we present a GPU-accelerated simulation of one-energy-group, time-independent, deterministic discrete ordinates particle transport in 3D Cartesian geometry (Sweep3D). The performance of the GPU simulations is reported for simulations with vacuum boundary conditions. The relative advantages and disadvantages of the GPU implementation, multi-GPU simulation, programming effort, and code portability are also discussed. The results show that the overall performance speedup of one NVIDIA Tesla M2050 GPU ranges from 2.56 compared with one Intel Xeon X5670 chip to 8.14 compared with one Intel Core Q6600 chip with no flux fixup. The simulation with flux fixup on one M2050 is 1.23 times faster than on one X5670.

  12. Optimizing legacy molecular dynamics software with directive-based offload

    DOE PAGES

    Michael Brown, W.; Carrillo, Jan-Michael Y.; Gavhane, Nitin; ...

    2015-05-14

    The directive-based programming models are one solution for exploiting many-core coprocessors to increase simulation rates in molecular dynamics. They offer the potential to reduce code complexity with offload models that can selectively target computations to run on the CPU, the coprocessor, or both. In our paper, we describe modifications to the LAMMPS molecular dynamics code to enable concurrent calculations on a CPU and coprocessor. We also demonstrate that standard molecular dynamics algorithms can run efficiently on both the CPU and an x86-based coprocessor using the same subroutines. As a consequence, we demonstrate that code optimizations for the coprocessor also result in speedups on the CPU; in extreme cases up to 4.7X. We provide results for LAMMPS benchmarks and for production molecular dynamics simulations using the Stampede hybrid supercomputer with both Intel (R) Xeon Phi (TM) coprocessors and NVIDIA GPUs. The optimizations presented have increased simulation rates by over 2X for organic molecules and over 7X for liquid crystals on Stampede. The optimizations are available as part of the "Intel package" supplied with LAMMPS. (C) 2015 Elsevier B.V. All rights reserved.
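
    The flavor of directive-based offload can be sketched as follows, using portable OpenMP target directives as a stand-in for the Intel-specific offload pragmas the paper's LAMMPS package employs; the kernel body is a placeholder, not a real potential.

      // Directive-based offload sketch with portable OpenMP "target" directives.
      // The same loop runs on the host if the target directive is removed.
      #include <cstddef>

      void toy_force(const double* x, double* f, std::size_t n) {
          // Map inputs to the device, compute there, copy results back.
          #pragma omp target teams distribute parallel for \
              map(to: x[0:n]) map(from: f[0:n])
          for (long i = 0; i < (long)n; ++i) {
              f[i] = -x[i];  // placeholder interaction, not a real potential
          }
      }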

  13. SU-E-T-37: A GPU-Based Pencil Beam Algorithm for Dose Calculations in Proton Radiation Therapy

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kalantzis, G; Leventouri, T; Tachibana, H

    Purpose: Recent developments in radiation therapy have been focused on applications of charged particles, especially protons. Over the years several dose calculation methods have been proposed in proton therapy. A common characteristic of all these methods is their extensive computational burden. In the current study we present for the first time, to our best knowledge, a GPU-based PBA for proton dose calculations in Matlab. Methods: In the current study we employed an analytical expression for the proton depth-dose distribution. The central-axis term is taken from the broad-beam central-axis depth dose in water modified by an inverse-square correction, while the distribution of the off-axis term was considered Gaussian. The serial code was implemented in MATLAB and was launched on a desktop with a quad-core Intel Xeon X5550 at 2.67 GHz with 8 GB of RAM. For the parallelization on the GPU, the parallel computing toolbox was employed and the code was launched on a GTX 770 with Kepler architecture. The performance comparison was established on the speedup factors. Results: The performance of the GPU code was evaluated for three different energies: low (50 MeV), medium (100 MeV) and high (150 MeV). Four square fields were selected for each energy, and the dose calculations were performed with both the serial and parallel codes for a homogeneous water phantom with size 300×300×300 mm3. The resolution of the PBs was set to 1.0 mm. The maximum speedup of ∼127 was achieved for the highest energy and the largest field size. Conclusion: A GPU-based PB algorithm for proton dose calculations in Matlab was presented. A maximum speedup of ∼127 was achieved. Future directions of the current work include extension of our method for dose calculation in heterogeneous phantoms.
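
    The dose model sketched above (a central-axis depth-dose term with an inverse-square correction, multiplied by a Gaussian off-axis term) can be written compactly. The following C++ fragment is an illustrative reading of that description, with all parameter names assumed and the clinical parameterization simplified.

      // Illustrative reading of the stated dose model: broad-beam central-axis
      // depth dose, an inverse-square correction for source distance, and a
      // normalized Gaussian off-axis term. All parameters are assumed.
      #include <cmath>

      double pencil_beam_dose(double x, double y, double z,
                              double cax_dose,  // central-axis depth dose at depth z
                              double ssd,       // source-to-surface distance
                              double sigma) {   // lateral Gaussian spread at depth z
          const double pi = 3.14159265358979323846;
          double inv_sq = (ssd * ssd) / ((ssd + z) * (ssd + z));
          double r2 = x * x + y * y;
          double off_axis = std::exp(-r2 / (2.0 * sigma * sigma))
                          / (2.0 * pi * sigma * sigma);
          return cax_dose * inv_sq * off_axis;
      }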

  14. Quad-rotor flight path energy optimization

    NASA Astrophysics Data System (ADS)

    Kemper, Edward

    Quad-rotor unmanned aerial vehicles (UAVs) have been a popular area of research and development in the last decade, especially with the advent of affordable microcontrollers like the MSP 430 and the Raspberry Pi. Path-energy optimization is an area that is well developed for linear systems. In this thesis, this idea of path-energy optimization is extended to the nonlinear model of the quad-rotor UAV. The classical optimization technique is adapted to the nonlinear model that is derived for the problem at hand, yielding a set of partial differential equations and boundary value conditions to solve these equations. Then, different techniques to implement energy optimization algorithms are tested using simulations in Python. First, a purely nonlinear approach is used. This method is shown to be computationally intensive, with no practical solution available in a reasonable amount of time. Second, heuristic techniques to minimize the energy of the flight path are tested, using Ziegler-Nichols' proportional integral derivative (PID) controller tuning technique. Finally, a brute-force look-up-table-based PID controller is used. Simulation results of the heuristic method show that both reliable control of the system and path-energy optimization are achieved in a reasonable amount of time.
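
    As a sketch of the heuristic layer described above (not code from the thesis), a discrete PID controller with gains set from the classic Ziegler-Nichols rules might look as follows; the ultimate gain Ku and oscillation period Tu are assumed to have been measured beforehand.

      // Minimal discrete PID with classic Ziegler-Nichols tuning. Illustrative
      // only; gains and signal names are assumptions, not thesis values.
      struct PID {
          double kp, ki, kd;
          double integral = 0.0, prev_err = 0.0;

          // Classic ZN PID rules: Kp = 0.6 Ku, Ki = 1.2 Ku/Tu, Kd = 0.075 Ku Tu.
          static PID tune(double Ku, double Tu) {
              return {0.6 * Ku, 1.2 * Ku / Tu, 0.075 * Ku * Tu};
          }

          double step(double setpoint, double measured, double dt) {
              double err = setpoint - measured;
              integral += err * dt;
              double deriv = (err - prev_err) / dt;
              prev_err = err;
              return kp * err + ki * integral + kd * deriv;
          }
      };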

  15. Parallel multireference configuration interaction calculations on mini-β-carotenes and β-carotene

    NASA Astrophysics Data System (ADS)

    Kleinschmidt, Martin; Marian, Christel M.; Waletzke, Mirko; Grimme, Stefan

    2009-01-01

    We present a parallelized version of a direct selecting multireference configuration interaction (MRCI) code [S. Grimme and M. Waletzke, J. Chem. Phys. 111, 5645 (1999)]. The program can be run either in ab initio mode or as a semiempirical procedure combined with density functional theory (DFT/MRCI). We have investigated the efficiency of the parallelization in case studies on carotenoids and porphyrins. The performance is found to depend heavily on the cluster architecture. While the speed-up on the older Intel Netburst technology is close to linear for up to 12-16 processes, our results indicate that it is not favorable to use all cores of modern Intel Dual Core or Quad Core processors simultaneously for memory-intensive tasks. Due to saturation of the memory bandwidth, we recommend running less demanding tasks on the latter architectures in parallel to two (Dual Core) or four (Quad Core) MRCI processes per node. The DFT/MRCI branch has been employed to study the low-lying singlet and triplet states of mini-n-β-carotenes (n = 3, 5, 7, 9) and β-carotene (n = 11) at the geometries of the ground state, the first excited triplet state, and the optically bright singlet state. The order of states depends heavily on the conjugation length and the nuclear geometry. The B1u+ state constitutes the S1 state in the vertical absorption spectrum of mini-3-β-carotene but switches order with the 2 A1g- state upon excited-state relaxation. In the longer carotenes, near degeneracy or even root flipping between the B1u+ and B1u- states is observed, whereas the 3 A1g- state is found to remain energetically above the optically bright B1u+ state at all nuclear geometries investigated here. The DFT/MRCI method is seen to underestimate the absolute excitation energies of the longer mini-β-carotenes but the energy gaps between the excited states are reproduced well. In addition to singlet data, triplet-triplet absorption energies are presented. For β-carotene, where these transition

  16. Parallel multireference configuration interaction calculations on mini-beta-carotenes and beta-carotene.

    PubMed

    Kleinschmidt, Martin; Marian, Christel M; Waletzke, Mirko; Grimme, Stefan

    2009-01-28

    We present a parallelized version of a direct selecting multireference configuration interaction (MRCI) code [S. Grimme and M. Waletzke, J. Chem. Phys. 111, 5645 (1999)]. The program can be run either in ab initio mode or as a semiempirical procedure combined with density functional theory (DFT/MRCI). We have investigated the efficiency of the parallelization in case studies on carotenoids and porphyrins. The performance is found to depend heavily on the cluster architecture. While the speed-up on the older Intel Netburst technology is close to linear for up to 12-16 processes, our results indicate that it is not favorable to use all cores of modern Intel Dual Core or Quad Core processors simultaneously for memory-intensive tasks. Due to saturation of the memory bandwidth, we recommend running less demanding tasks on the latter architectures in parallel to two (Dual Core) or four (Quad Core) MRCI processes per node. The DFT/MRCI branch has been employed to study the low-lying singlet and triplet states of mini-n-beta-carotenes (n = 3, 5, 7, 9) and beta-carotene (n = 11) at the geometries of the ground state, the first excited triplet state, and the optically bright singlet state. The order of states depends heavily on the conjugation length and the nuclear geometry. The 1Bu+ state constitutes the S1 state in the vertical absorption spectrum of mini-3-beta-carotene but switches order with the 2 1Ag- state upon excited-state relaxation. In the longer carotenes, near degeneracy or even root flipping between the 1Bu+ and 1Bu- states is observed, whereas the 3 1Ag- state is found to remain energetically above the optically bright 1Bu+ state at all nuclear geometries investigated here. The DFT/MRCI method is seen to underestimate the absolute excitation energies of the longer mini-beta-carotenes but the energy gaps between the excited states are reproduced well. In addition to singlet data, triplet-triplet absorption energies are

  17. Physical investigation of a quad confinement plasma source

    NASA Astrophysics Data System (ADS)

    Knoll, Aaron; Lucca Fabris, Andrea; Young, Christopher; Cappelli, Mark

    2016-10-01

    Quad magnetic confinement plasma sources are novel magnetized DC discharges suitable for applications in a broad range of fields, particularly space propulsion, plasma etching and deposition. These sources contain a square discharge channel with magnetic cusps at the four lateral walls, enhancing plasma confinement and electron residence time inside the device. The magnetic field topology is manipulated using four independent electromagnets on each edge of the channel, tuning the properties of the generated plasma. We characterize the plasma ejected from the quad confinement sources using a combination of traditional electrostatic probes and non-intrusive laser-based diagnostics. Measurements show a strong ion acceleration layer located 8 cm downstream of the exit plane, beyond the extent of the magnetic field. The ion velocity field is investigated with different magnetic configurations, demonstrating how ion trajectories may be manipulated. C.Y. acknowledges support from the DOE NNSA Stewardship Science Graduate Fellowship under contract DE-FC52-08NA28752.

  18. The development and validation of the Questionnaire on Anticipated Discrimination (QUAD).

    PubMed

    Gabbidon, Jheanell; Brohan, Elaine; Clement, Sarah; Henderson, R Claire; Thornicroft, Graham

    2013-11-07

    The anticipation of mental health-related discrimination is common amongst people with mental health problems and can have serious adverse effects. This study aimed to develop and validate a measure assessing the extent to which people with mental health problems anticipate that they will personally experience discrimination across a range of contexts. The items and format for the Questionnaire on Anticipated Discrimination (QUAD) were developed from previous versions of the Discrimination and Stigma Scale (DISC), focus groups and cognitive debriefing interviews which were used to further refine the content and format. The resulting provisional version of the QUAD was completed by 117 service users in an online survey and reliability, validity, precision and acceptability were assessed. A final version of the scale was agreed and analyses re-run using the online survey data and data from an independent sample to report the psychometric properties of the finalised scale. The provisional version of the QUAD had 17 items, good internal consistency (alpha = 0.86) and adequate convergent validity as supported by the significant positive correlations with the Stigma Scale (SS) (r = 0.40, p < 0.001) and the Internalised Stigma of Mental Illness Scale (ISMI) (r = 0.40, p < 0.001). Three items were removed due to low endorsements, high inter-correlation or conceptual concerns. The finalised 14 item QUAD had good internal consistency (alpha = 0.86), good test re-test reliability (ρ(c) = 0.81) and adequate convergent validity: correlations with the ISMI (r = 0.45, p < 0.001) and with the SS (r = 0.39, p < 0.001). Reading ease scores indicated good acceptability for general adult populations. Cross-replication in an independent sample further indicated good internal consistency (alpha = 0.88), adequate convergent validity and revealed two factors summarised by institutions/services and interpersonal/professional relationships. The QUAD expanded upon previous versions of the

  19. Connecting Effective Instruction and Technology. Intel-elebration: Safari.

    ERIC Educational Resources Information Center

    Burton, Larry D.; Prest, Sharon

    Intel-ebration is an attempt to integrate the following research-based instructional frameworks and strategies: (1) dimensions of learning; (2) multiple intelligences; (3) thematic instruction; (4) cooperative learning; (5) project-based learning; and (6) instructional technology. This paper presents a thematic unit on safari, using the…

  20. IntellWheels: modular development platform for intelligent wheelchairs.

    PubMed

    Braga, Rodrigo Antonio Marques; Petry, Marcelo; Reis, Luis Paulo; Moreira, António Paulo

    2011-01-01

    Intelligent wheelchairs (IWs) can become an important solution to the challenge of assisting individuals who have disabilities and are thus unable to perform their daily activities using classic powered wheelchairs. This article describes the concept and design of IntellWheels, a modular platform to facilitate the development of IWs through a multiagent system paradigm. In fact, modularity is achieved not only in the software perspective, but also through a generic hardware framework that was designed to fit, in a straightforward manner, almost any commercial powered wheelchair. Experimental results demonstrate the successful integration of all modules in the platform, providing safe motion to the IW. Furthermore, the results achieved with a prototype running in autonomous mode in simulated and mixed-reality environments also demonstrate the potential of our approach. Although some future research is still necessary to fully accomplish our objectives, preliminary tests have shown that IntellWheels will effectively reduce users' limitations, offering them a much more independent life.

  1. Wilson Dslash Kernel From Lattice QCD Optimization

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Joo, Balint; Smelyanskiy, Mikhail; Kalamkar, Dhiraj D.

    2015-07-01

    Lattice Quantum Chromodynamics (LQCD) is a numerical technique used for calculations in Theoretical Nuclear and High Energy Physics. LQCD is traditionally one of the first applications ported to many new high performance computing architectures, and indeed LQCD practitioners have been known to design and build custom LQCD computers. Lattice QCD kernels are frequently used as benchmarks (e.g. 168.wupwise in the SPEC suite) and are generally well understood, and as such are ideal to illustrate several optimization techniques. In this chapter we detail our work in optimizing the Wilson-Dslash kernels for the Intel Xeon Phi; however, as we will show, the technique gives excellent performance on the regular Xeon architecture as well.

  2. Full cycle trigonometric function on Intel Quartus II Verilog

    NASA Astrophysics Data System (ADS)

    Mustapha, Muhazam; Zulkarnain, Nur Antasha

    2018-02-01

    This paper discusses an improvement on previous research into hardware-based trigonometric calculations. The tangent function is also implemented to complete the set, and the number of bits has been extended for each trigonometric function. The functions have been simulated using Quartus II, and the results are compared to the previous work. The design is based on RTL due to its resource-efficient nature. In the first stage, a technology-independent test bench simulation was conducted on ModelSim, whose convenience in capturing simulation data allows accuracy information to be obtained. In the second stage, Intel/Altera Quartus II is used to simulate on a technology-dependent platform, particularly the one belonging to Intel/Altera itself. Real data on the number of logic elements used and the propagation delay have also been obtained.

  3. Trusted Computing Technologies, Intel Trusted Execution Technology.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Guise, Max Joseph; Wendt, Jeremy Daniel

    2011-01-01

    We describe the current state-of-the-art in Trusted Computing Technologies, focusing mainly on Intel's Trusted Execution Technology (TXT). This document is based on existing documentation and tests of two existing TXT-based systems: Intel's Trusted Boot and Invisible Things Lab's Qubes OS. We describe what features are lacking in current implementations, describe what a mature system could provide, and present a list of developments to watch. Critical systems perform operation-critical computations on high-importance data. In such systems, the inputs, computation steps, and outputs may be highly sensitive. Sensitive components must be protected from both unauthorized release and unauthorized alteration: unauthorized users should not access the sensitive input and sensitive output data, nor be able to alter them; the computation contains intermediate data with the same requirements, and executes algorithms that the unauthorized should not be able to know or alter. Due to various system requirements, such critical systems are frequently built from commercial hardware, employ commercial software, and require network access. These hardware, software, and network system components increase the risk that sensitive input data, computation, and output data may be compromised.

  4. Optimizing Excited-State Electronic-Structure Codes for Intel Knights Landing: A Case Study on the BerkeleyGW Software

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Deslippe, Jack; da Jornada, Felipe H.; Vigil-Fowler, Derek

    2016-10-06

    We profile and optimize calculations performed with the BerkeleyGW code on the Xeon Phi architecture. BerkeleyGW depends both on hand-tuned critical kernels as well as on BLAS and FFT libraries. We describe the optimization process and performance improvements achieved. We discuss a layered parallelization strategy to take advantage of vector, thread and node-level parallelism. We discuss locality changes (including the consequence of the lack of L3 cache) and effective use of the on-package high-bandwidth memory. We show preliminary results on Knights Landing, including a roofline study of code performance before and after a number of optimizations. We find that the GW method is particularly well-suited for many-core architectures due to the ability to exploit a large amount of parallelism over plane-wave components, band-pairs, and frequencies.

  5. Performance of VPIC on Trinity

    NASA Astrophysics Data System (ADS)

    Nystrom, W. D.; Bergen, B.; Bird, R. F.; Bowers, K. J.; Daughton, W. S.; Guo, F.; Li, H.; Nam, H. A.; Pang, X.; Rust, W. N., III; Wohlbier, J.; Yin, L.; Albright, B. J.

    2016-10-01

    Trinity is a new major DOE computing resource that is going through final acceptance testing at Los Alamos National Laboratory. Trinity has several new and unique architectural features, including two compute partitions, one with dual-socket Intel Haswell Xeon compute nodes and one with Intel Knights Landing (KNL) Xeon Phi compute nodes. Additional unique features include the use of on-package high-bandwidth memory (HBM) for the KNL nodes, the ability to configure the KNL nodes with respect to HBM model and on-die network topology in a variety of operational modes at run time, and the use of solid-state storage via burst buffer technology to reduce the time required to perform I/O. An effort is in progress to port and optimize VPIC to Trinity and evaluate its performance. Because VPIC was recently released as Open Source, it is being used as part of acceptance testing for Trinity and is participating in the Trinity Open Science Program, which has resulted in excellent collaboration activities with both Cray and Intel. Results of this work will be presented on the performance of VPIC on both the Haswell and KNL partitions, for both single-node runs and runs at scale. Work performed under the auspices of the U.S. Dept. of Energy by the Los Alamos National Security, LLC Los Alamos National Laboratory under contract DE-AC52-06NA25396 and supported by the LANL LDRD program.

  6. Preliminary study of ground handling characteristics of Buoyant Quad Rotor (BQR) vehicles

    NASA Technical Reports Server (NTRS)

    Browning, R. G. E.

    1980-01-01

    A preliminary investigation of mooring concepts appropriate for heavy lift buoyant quad rotor (BQR) vehicles was performed. A review of the evolution of ground handling systems and procedures for all airship types is presented to ensure that appropriate consideration is given to past experiences. Two buoyant quad rotor designs are identified and described. An analysis of wind loads on a moored airship and the effects of these loads on vehicle design is provided. Four mooring concepts are assessed with respect to the airship design, wind loads and mooring site considerations. Basing requirements and applicability of expeditionary mooring at various operational scenarios are addressed.

  7. Stabilization and control of quad-rotor helicopter using a smartphone device

    NASA Astrophysics Data System (ADS)

    Desai, Alok; Lee, Dah-Jye; Moore, Jason; Chang, Yung-Ping

    2013-01-01

    In recent years, autonomous micro-unmanned aerial vehicles (micro-UAVs), or more specifically hovering micro-UAVs, have proven suitable for many promising applications such as unknown environment exploration and search and rescue operations. The early versions of UAVs had no on-board control capabilities, and were difficult to control manually from a ground station. Many UAVs now are equipped with on-board control systems that reduce the amount of control required from the ground-station operator. However, the limitations on payload, power consumption and control without human interference remain the biggest challenges. This paper proposes to use a smartphone as the sole computational device to stabilize and control a quad-rotor. The goal is to use the readily available sensors in a smartphone such as the GPS, the accelerometer, the rate-gyros, and the camera to support vision-related tasks such as flight stabilization, estimation of the height above ground, target tracking, obstacle detection, and surveillance. We use a quad-rotor platform that has been built in the Robotic Vision Lab at Brigham Young University for our development and experiments. An Android smartphone is connected through the USB port to an external hardware that has a microprocessor and circuitries to generate pulse-width modulation signals to control the brushless servomotors on the quad-rotor. The high-resolution camera on the smartphone is used to detect and track features to maintain a desired altitude level. The vision algorithms implemented include template matching, Harris feature detector, RANSAC similarity-constrained homography, and color segmentation. Other sensors are used to control yaw, pitch, and roll of the quad-rotor. This smartphone-based system is able to stabilize and control micro-UAVs and is ideal for micro-UAVs that have size, weight, and power limitations.

  8. Development of an Effective System Identification and Control Capability for Quad-copter UAVs

    NASA Astrophysics Data System (ADS)

    Wei, Wei

    In recent years, with the promise of extensive commercial applications, the popularity of Unmanned Aerial Vehicles (UAVs) has dramatically increased, as witnessed by publications and mushrooming research and educational programs. Over the years, multi-copter aircraft have been chosen as a viable configuration for small-scale VTOL UAVs in the form of quad-copters, hexa-copters and octo-copters. Compared to the single main rotor configuration such as the conventional helicopter, multi-copter airframes require a simpler feedback control system and fewer mechanical parts. These characteristics make these UAV platforms, such as the quad-copter, which is the main emphasis of this dissertation, a rugged and competitive candidate for many applications in both military and civil areas. Because of its configuration and relative size, the small-scale quad-copter UAV system is inherently very unstable. In order to develop an effective control system through simulation techniques, obtaining an accurate dynamic model of a given quad-copter is imperative. Moreover, given the anticipated stringent safety requirements, fault tolerance will be a crucial component of UAV certification. Accurate dynamic modeling and control of this class of UAV is an enabling technology and is imperative for future commercial applications. In this work, the dynamic model of a quad-copter system in hover flight was identified using frequency-domain system identification techniques. A new and unique experimental system, data acquisition and processing procedure was developed catering specifically to the class of electric-powered multi-copter UAV systems. The Comprehensive Identification from FrEquency Responses (CIFER®) software package, developed by the US Army Aviation Development Directorate (AFDD), was utilized along with flight tests to develop dynamic models of the quad-copter system. A new set of flight tests was conducted and the predictive capability of the dynamic models was successfully validated

  9. Task Parallel Incomplete Cholesky Factorization using 2D Partitioned-Block Layout

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kim, Kyungjoo; Rajamanickam, Sivasankaran; Stelle, George Widgery

    We introduce a task-parallel algorithm for sparse incomplete Cholesky factorization that utilizes a 2D sparse partitioned-block layout of a matrix. Our factorization algorithm follows the idea of algorithms-by-blocks by using the block layout. The algorithms-by-blocks approach induces a task graph for the factorization. These tasks are inter-related through their data dependences in the factorization algorithm. To process the tasks on various manycore architectures in a portable manner, we also present a portable tasking API that incorporates different tasking backends and device-specific features using an open-source framework for manycore platforms, i.e., Kokkos. A performance evaluation is presented on both Intel Sandybridge and Xeon Phi platforms for matrices from the University of Florida sparse matrix collection to illustrate the merits of the proposed task-based factorization. Experimental results demonstrate that our task-parallel implementation delivers about a 26.6x speedup (geometric mean) over single-threaded incomplete Cholesky-by-blocks and a 19.2x speedup over serial Cholesky (which carries no tasking overhead), using 56 threads on the Intel Xeon Phi processor for sparse matrices arising from various application problems.
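
    The algorithms-by-blocks idea can be illustrated with a minimal sketch (ours, using OpenMP tasks in place of the paper's Kokkos-based tasking API): each block operation becomes a task, and depend clauses encode the inter-task data dependences the block layout induces, leaving scheduling to the runtime.

      // Algorithms-by-blocks sketch of a blocked Cholesky loop nest. The
      // trailing-matrix GEMM updates are omitted for brevity, and the block
      // kernels are stubs; this is not the paper's implementation.
      struct Block { /* dense tile of the partitioned-block matrix */ };
      static void potrf(Block&) {}               // factor diagonal block (stub)
      static void trsm(const Block&, Block&) {}  // triangular solve (stub)
      static void syrk(const Block&, Block&) {}  // rank-k update (stub)

      void chol_by_blocks(Block* A, int nb) {    // A: nb x nb blocks, row-major
          #pragma omp parallel
          #pragma omp single
          for (int k = 0; k < nb; ++k) {
              #pragma omp task depend(inout: A[k*nb + k])
              potrf(A[k*nb + k]);
              for (int i = k + 1; i < nb; ++i) {
                  #pragma omp task depend(in: A[k*nb + k]) depend(inout: A[i*nb + k])
                  trsm(A[k*nb + k], A[i*nb + k]);
                  #pragma omp task depend(in: A[i*nb + k]) depend(inout: A[i*nb + i])
                  syrk(A[i*nb + k], A[i*nb + i]);
              }
          }
      }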

  10. (Re)engineering Earth System Models to Expose Greater Concurrency for Ultrascale Computing: Practice, Experience, and Musings

    NASA Astrophysics Data System (ADS)

    Mills, R. T.

    2014-12-01

    As the high performance computing (HPC) community pushes towards the exascale horizon, the importance and prevalence of fine-grained parallelism in new computer architectures is increasing. This is perhaps most apparent in the proliferation of so-called "accelerators" such as the Intel Xeon Phi or NVIDIA GPGPUs, but the trend also holds for CPUs, where serial performance has grown slowly and effective use of hardware threads and vector units are becoming increasingly important to realizing high performance. This has significant implications for weather, climate, and Earth system modeling codes, many of which display impressive scalability across MPI ranks but take relatively little advantage of threading and vector processing. In addition to increasing parallelism, next generation codes will also need to address increasingly deep hierarchies for data movement: NUMA/cache levels, on node vs. off node, local vs. wide neighborhoods on the interconnect, and even in the I/O system. We will discuss some approaches (grounded in experiences with the Intel Xeon Phi architecture) for restructuring Earth science codes to maximize concurrency across multiple levels (vectors, threads, MPI ranks), and also discuss some novel approaches for minimizing expensive data movement/communication.

  11. Scaling Deep Learning Workloads: NVIDIA DGX-1/Pascal and Intel Knights Landing

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Gawande, Nitin A.; Landwehr, Joshua B.; Daily, Jeffrey A.

    Deep Learning (DL) algorithms have become ubiquitous in data analytics. As a result, major computing vendors --- including NVIDIA, Intel, AMD and IBM --- have architectural road-maps influenced by DL workloads. Furthermore, several vendors have recently advertised new computing products as accelerating DL workloads. Unfortunately, it is difficult for data scientists to quantify the potential of these different products. This paper provides a performance and power analysis of important DL workloads on two major parallel architectures: NVIDIA DGX-1 (eight Pascal P100 GPUs interconnected with NVLink) and Intel Knights Landing (KNL) CPUs interconnected with Intel Omni-Path. Our evaluation consists of a cross section of convolutional neural net workloads: CifarNet, CaffeNet, AlexNet and GoogleNet topologies using the Cifar10 and ImageNet datasets. The workloads are vendor-optimized for each architecture. Our analysis indicates that although GPUs provide the highest overall raw performance, the gap can close for some convolutional networks, and KNL can be competitive when considering performance/watt. Furthermore, NVLink is critical to GPU scaling.

  12. Ultra-low noise large-area InGaAs quad photoreceiver with low crosstalk for laser interferometry space antenna

    NASA Astrophysics Data System (ADS)

    Joshi, Abhay; Datta, Shubhashish; Rue, Jim; Livas, Jeffrey; Silverberg, Robert; Guzman Cervantes, Felipe

    2012-07-01

    Quad photoreceivers, namely a 2 x 2 array of p-i-n photodiodes followed by a transimpedance amplifier (TIA) per diode, are required as the front-end photonic sensors in several applications relying on free-space propagation with position and direction sensing capability, such as long baseline interferometry, free-space optical communication, and biomedical imaging. It is desirable to increase the active area of quad photoreceivers (and photodiodes) to enhance the link gain, and therefore sensitivity, of the system. However, the resulting increase in the photodiode capacitance reduces the photoreceiver's bandwidth and adds to the excess system noise. As a result, the noise performance of the front-end quad photoreceiver has a direct impact on the sensitivity of the overall system. One such particularly challenging application is the space-based detection of gravitational waves by measuring distance at 1064 nm wavelength with ~ 10 pm/√Hz accuracy over a baseline of millions of kilometers. We present a 1 mm diameter quad photoreceiver having an equivalent input current noise density of < 1.7 pA/√Hz per quadrant in the 2 MHz to 20 MHz frequency range. This performance is primarily enabled by a rad-hard-by-design dual-depletion-region InGaAs quad photodiode having 2.5 pF capacitance per quadrant. Moreover, the quad photoreceiver demonstrates a crosstalk of < -45 dB between the neighboring quadrants, which ensures an uncorrected direction sensing resolution of < 50 nrad. The sources of this primarily capacitive crosstalk are presented.

  13. A CPU/MIC Collaborated Parallel Framework for GROMACS on Tianhe-2 Supercomputer.

    PubMed

    Peng, Shaoliang; Yang, Shunyun; Su, Wenhe; Zhang, Xiaoyu; Zhang, Tenglilang; Liu, Weiguo; Zhao, Xingming

    2017-06-16

    Molecular Dynamics (MD) is the simulation of the dynamic behavior of atoms and molecules. As the most popular software for molecular dynamics, GROMACS cannot work on large-scale data because of limited computing resources. In this paper, we propose a CPU and Intel® Xeon Phi Many Integrated Core (MIC) collaborated parallel framework to accelerate GROMACS using the offload mode on a MIC coprocessor, with which the performance of GROMACS is improved significantly, especially on the Tianhe-2 supercomputer. Furthermore, we optimize GROMACS so that it can run on both the CPU and MIC at the same time. In addition, we accelerate multi-node GROMACS so that it can be used in practice. Benchmarking on real data, our accelerated GROMACS performs very well and reduces computation time significantly. Source code: https://github.com/tianhe2/gromacs-mic.
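
    The offload mode mentioned above follows the Intel compiler's offload pragma pattern; a hedged sketch with an invented stand-in kernel (the real GROMACS force loops are far more involved) might look like this:

```cpp
// Sketch of the Intel Language Extensions for Offload pattern used for MIC
// coprocessors. The kernel body and array names here are invented stand-ins.
#include <omp.h>

void nonbonded_kernel(const float* x, float* f, int n) {
    // Ship x to the coprocessor, run the loop there, copy f back.
    #pragma offload target(mic:0) in(x : length(n)) inout(f : length(n))
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        f[i] += 0.5f * x[i] * x[i];   // stand-in for the real force computation
}
```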

  14. Wetland Mapping with Quad-Pol Data Acquired during Tandem-X Science Phase

    NASA Astrophysics Data System (ADS)

    Mleczko, M.; Mroz, M.; Fitrzyk, M.

    2016-06-01

    The aim of this study was to exploit fully polarimetric SAR data acquired during the TanDEM-X Science Phase (2014/2015) over the herbaceous wetlands of the Biebrza National Park (BbNP) in north-eastern Poland for mapping seasonally flooded grasslands and permanent natural vegetation associations. The main goal of this work was to estimate the advantage of fully polarimetric radar images (QuadPol) over alternative polarization (AltPol) modes. The methodology consisted of processing several data subsets: polarimetric decompositions of complex quad-pol datasets, classification of multitemporal backscattering images, complementing backscattering images with Shannon entropy, and exploitation of interferometric coherence from tandem operations. In each case the multidimensional stack of images was classified using the ISODATA unsupervised clustering algorithm. With 6 QUAD-POL TSX/TDX acquisitions it was possible to correctly distinguish 5 thematic classes related to their water regime: permanent water bodies, temporarily flooded areas, wet grasslands, dry grasslands and common reed. This last class could be distinguished from deciduous forest only with the Yamaguchi four-component decomposition. The interferometric coherence calculated for tandem pairs turned out to be less efficient than expected for this wetland mapping.

  15. Singularities of the quad curl problem

    NASA Astrophysics Data System (ADS)

    Nicaise, Serge

    2018-04-01

    We consider the quad curl problem in smooth and non-smooth domains of the space. We first give an augmented variational formulation equivalent to the one from [25] if the datum is divergence-free. We describe the singularities of the variational space, which correspond to those of the Maxwell system with perfectly conducting boundary conditions. The edge and corner singularities of the solution of the corresponding boundary value problem with smooth data are also characterized. We finally obtain some regularity results for the variational solution.
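
    For orientation, a commonly used strong form of the quad curl boundary value problem (an assumed statement for illustration, not quoted from the paper or from [25]) reads:

```latex
% A commonly used strong form of the quad-curl problem (assumed statement):
\begin{aligned}
(\nabla\times)^{4}\,\mathbf{u} &= \mathbf{f}, & \nabla\cdot\mathbf{u} &= 0
  && \text{in } \Omega,\\
\mathbf{u}\times\mathbf{n} &= \mathbf{0}, & (\nabla\times\mathbf{u})\times\mathbf{n} &= \mathbf{0}
  && \text{on } \partial\Omega.
\end{aligned}
```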

  16. TRAC-BF1 thermal-hydraulic, ANSYS stress analysis for core shroud cracking phenomena

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Shoop, U.; Feltus, M.A.; Baratta, A.J.

    1996-12-31

    The U.S. Nuclear Regulatory Commission issued Generic Letter 94-03, informing all licensees about the intergranular stress corrosion cracking (IGSCC) of core shrouds found in both Dresden Unit 1 and Quad Cities Unit 1. The letter directed all licensees to perform safety analyses of their boiling water reactor (BWR) units. Two transients of special concern for the core shroud safety analysis are the main steam line break (MSLB) and the recirculation line break.

  17. Proton Irradiation of the 16GB Intel Optane SSD

    NASA Technical Reports Server (NTRS)

    Wyrwas, E. J.

    2017-01-01

    The purpose of this test is to assess the single event effects (SEE) and radiation susceptibility of the Intel Optane Memory device (SSD) containing the 3D Xpoint phase change memory (PCM) technology. This test is supported by the NASA Electronics Parts and Packaging Program (NEPP).

  18. Estimation of winter wheat canopy nitrogen density at different growth stages based on Multi-LUT approach

    NASA Astrophysics Data System (ADS)

    Li, Zhenhai; Li, Na; Li, Zhenhong; Wang, Jianwen; Liu, Chang

    2017-10-01

    Rapid real-time monitoring of wheat nitrogen (N) status is crucial for precision N management during wheat growth. In this study, a Multi Lookup Table (Multi-LUT) approach based on N-PROSAIL model parameters set at different growth stages was constructed to estimate canopy N density (CND) in winter wheat. The results showed that the estimated CND was in line with measured CND, with determination coefficient (R2) and root mean square error (RMSE) values of 0.80 and 1.16 g m⁻², respectively. The time required to estimate one sample was only 6 ms on a test machine with an Intel(R) Core(TM) i5-2430 quad-core CPU at 2.40 GHz. These results confirmed the potential of the Multi-LUT approach for CND retrieval in winter wheat at different growth stages and under variable climatic conditions.
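
    A lookup-table retrieval of this kind typically scans precomputed model simulations for the best spectral match to the measurement; a minimal sketch under that assumption (not the authors' code; all names are invented):

```cpp
// Minimal sketch of a lookup-table retrieval step (assumed mechanics): pick
// the LUT entry whose simulated reflectance best matches the measurement,
// then return the canopy N density stored with that entry.
#include <cmath>
#include <limits>
#include <vector>

struct LutEntry {
    std::vector<double> reflectance;  // simulated by a model such as N-PROSAIL
    double cnd;                       // canopy N density associated with entry
};

double retrieve_cnd(const std::vector<LutEntry>& lut,
                    const std::vector<double>& measured) {
    double best_cost = std::numeric_limits<double>::max();
    double best_cnd = 0.0;
    for (const auto& e : lut) {
        double cost = 0.0;            // RMSE between simulated and measured spectra
        for (size_t b = 0; b < measured.size(); ++b) {
            double d = e.reflectance[b] - measured[b];
            cost += d * d;
        }
        cost = std::sqrt(cost / measured.size());
        if (cost < best_cost) { best_cost = cost; best_cnd = e.cnd; }
    }
    return best_cnd;
}
```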

  19. Usefulness of a Rugby-shaped hohlraum in a Laser MégaJoule (LMJ) 40-quad configuration

    NASA Astrophysics Data System (ADS)

    Malinie, G.; Vandenboomgaerde, M.; Bastian, J.; Galmiche, D.; Laffite, S.; Liberatore, S.

    2007-11-01

    The LMJ setup will consist of 60 quads in a 3-cone configuration, at angles 33.2°, 49° and 59.5°. First ignition attempts in indirect drive are planned to be made on the way to the completion of the full facility, with only 40 quads in a 2-cone configuration, at angles 33.2° and 49°. By analytic considerations, we show that in a 40-quad configuration, the angular location of the hohlraum outer irradiating ring, as seen from the capsule, must be closer to the laser entrance hole than with the full LMJ. The use of a Rugby-shaped hohlraum instead of a cylinder therefore makes it possible to keep a correct symmetry while reducing the wall surface, which improves the global energetic efficiency of the target. Simplified 2D numerical simulations of Rugby hohlraums are presented, achieving a yield of about 30 MJ with our 1.215 mm-radius, CH-uniform-ablator capsule. These results suggest this kind of hohlraum might be an interesting candidate for 40-quad ignition experiments. Work on optimizing the present design and refining the numerical simulations is currently being pursued.

  20. The Vacuum-Compacted Regolith Gripping Mechanism and Unmanned Flights via Quad-Rotors

    NASA Technical Reports Server (NTRS)

    Scott, Rollin L.

    2014-01-01

    During the course of the Kennedy Space Center Summer Internship, two main experiments were performed: the Vacuum-Compacted Regolith Gripping Mechanism and Unmanned Flights via Quad-copters. The objectives of the Vacuum-Compacted Regolith Gripping Mechanism, often abbreviated as the Granular Gripper, are to exhibit space technology, such as a soft robotic hand, lift different apparatuses used to excavate regolith, and conserve energy while executing its intended task. The project is being conducted to test how much weight the Granular Gripper can hold. With the use of an Animatronic Robotic Hand, Arduino Uno, and other components, the system was calibrated before conducting the intended weight test. The maximum weight each finger could hold with the servos running, in the order of pinky, ring, middle, and index fingers, was: 1.340 N, 1.456 N, 0.9579 N, and 1.358 N. Using the small vacuum pump system, the maximum weight each finger could hold, in the same order, was: 4.076 N, 6.159 N, 5.454 N, and 4.052 N. The maximum torques on each of the fingers when the servos were running, in the same respective order, were: 0.0777 Nm, 0.0533 Nm, 0.0648 Nm, and 0.0532 Nm. The maximum torques on the individual fingers, when the small vacuum pump was in effect, in the same order as above, were: 0.2318 Nm, 0.3032 Nm, 0.2741 Nm, and 0.1618 Nm. In testing all the fingers with the servos running, the total weight was 5.112 N and the maximum torque on all the fingers was 0.2515 Nm. However, when the small vacuum pump system was used, the total weight was 19.741 N and the maximum torque on all the fingers was 0.9713 Nm. The conclusion drawn was that using the small vacuum pump system proved nearly 4 times more effective when testing how much weight the hand could hold. The resistance provided by the compacted sand in the glove allowed more weight to be held by the hand and glove. Also, when the servos turned off and the hand still retaining its

  1. Use of computer modeling to investigate a dynamic interaction problem in the Skylab TACS quad-valve package

    NASA Technical Reports Server (NTRS)

    Hesser, R. J.; Gershman, R.

    1975-01-01

    A valve opening-response problem encountered during development of a control valve for the Skylab thruster attitude control system (TACS) is described. The problem involved effects of dynamic interaction among valves in the quad-redundant valve package. Also described is a detailed computer simulation of the quad-valve package which was helpful in resolving the problem.

  2. Utilising the Intel RealSense Camera for Measuring Health Outcomes in Clinical Research.

    PubMed

    Siena, Francesco Luke; Byrom, Bill; Watts, Paul; Breedon, Philip

    2018-02-05

    Applications utilising 3D camera technologies for the measurement of health outcomes in the health and wellness sector continue to expand. The Intel® RealSense™ is one of the leading 3D depth-sensing cameras currently available on the market and lends itself to use in many applications, including robotics, automation, and medical systems. One of the most prominent areas is the production of interactive solutions for rehabilitation, which includes gait analysis and facial tracking. Advancements in depth camera technology have resulted in a noticeable increase in the integration of these technologies into portable platforms, suggesting significant future potential for pervasive in-clinic and field-based health assessment solutions. This paper reviews the Intel RealSense technology's technical capabilities, discusses its application to clinical research, and includes examples where the Intel RealSense camera range has been used for the measurement of health outcomes. This review supports the use of the technology to develop robust, objective movement and mobility-based endpoints to enable accurate tracking of the effects of treatment interventions in clinical trials.

  3. Successful outcome of modified quad surgical procedure in preteen and teen patients with brachial plexus birth palsy.

    PubMed

    Nath, Rahul K; Somasundaram, Chandra

    2012-01-01

    To evaluate the outcome of the modified Quad procedure in preteen and teen patients with brachial plexus birth palsy. We have previously demonstrated a significant improvement in shoulder abduction resulting from the modified Quad procedure in children (mean age 2.5 years; range, 0.5-9 years) with obstetric brachial plexus injury. We describe in this report the outcome of 16 patients (6 girls and 10 boys; 7 preteen and 9 teen) who have undergone the modified Quad procedure for the correction of shoulder function, specifically abduction. The patients underwent transfer of the latissimus dorsi and teres major muscles, release of contractures of the subscapularis, pectoralis major and minor, and axillary nerve decompression and neurolysis (the modified Quad procedure). Mean age of these patients at surgery was 13.5 years (range, 10.1-17.9 years). The mean preoperative total Mallet score was 14.8 (range, 10-20), and active abduction was 84° (range, 20°-140°). At a mean follow-up of 1.5 years, the mean postoperative total Mallet score increased to 19.7 (range, 13-25, P < .0001), and the mean active abduction improved to 132° (range, 40°-180°, P < .0003). The modified Quad procedure greatly improves not only active abduction but also other shoulder functions in preteen and teen patients, as this outcome is the combined result of decompression and neurolysis of the axillary nerve and the release of the contracted internal rotators of the shoulder.

  4. Shape Memory Alloy Isolation Valves: Public Quad Chart

    DTIC Science & Technology

    2017-05-12

    [Only fragments of this DTIC quad chart survived extraction. Recoverable details: briefing charts covering 12 April 2017 - 12 May 2017, by William Hargus; shape memory alloy isolation valves are proposed as an intrinsically safe isolation system for long-lived spacecraft (15+ yrs), increasing lifetime more than 5x over the state of the art (SOTA).]

  5. A GPU Parallelization of the Absolute Nodal Coordinate Formulation for Applications in Flexible Multibody Dynamics

    DTIC Science & Technology

    2012-02-17

    [Only fragments of this DTIC report survived extraction. Recoverable details: the work uses NVIDIA's general-purpose parallel programming model for GPU computing, and the GPU implementations were run on an Intel Nehalem Xeon E5520 2.26 GHz processor with an NVIDIA Tesla C2070 graphics card.]

  6. Minimization of color shift generated in RGBW quad structure.

    NASA Astrophysics Data System (ADS)

    Kim, Hong Chul; Yun, Jae Kyeong; Baek, Heume-Il; Kim, Ki Duk; Oh, Eui Yeol; Chung, In Jae

    2005-03-01

    The purpose of RGBW quad structure technology is to realize higher brightness than that of a normal panel (RGB stripe structure) by adding a white sub-pixel to the existing RGB stripe structure. However, there is a side effect called 'color shift' that results from the increased brightness. This side effect degrades general color characteristics through changes of hue, brightness and saturation as compared with the existing RGB stripe structure. In particular, skin-tone colors tend to get darker in contrast to a normal panel. We have tried to minimize color shift through the use of a LUT (Look Up Table) for linear arithmetic processing of input data, data bit expansion to 12-bit to minimize arithmetic tolerance, and a brightness weight of the white sub-pixel on each R, G, B pixel. The objective of this study is to minimize the Δu'v' value (commonly used to represent a color difference) between the RGB stripe structure and the RGBW quad structure and keep it below the 0.01 level (from an existing 0.02 or higher), using the Macbeth ColorChecker as a general reference for color characteristics.
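
    As context, the usual starting point for driving an RGBW panel is to extract the white channel from the RGB input; a generic sketch (illustrative only, not the paper's LUT-based pipeline):

```cpp
// Generic RGB-to-RGBW conversion sketch (illustrative; real panels apply
// per-channel brightness weights and 12-bit LUTs on top of this step).
#include <algorithm>
#include <array>

std::array<float, 4> rgb_to_rgbw(float r, float g, float b) {
    float w = std::min({r, g, b});    // white sub-pixel drive level
    // Subtracting w from each channel keeps the hue of the original color.
    return {r - w, g - w, b - w, w};
}
```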

  7. Comparisons between Intel 386 and i486 microprocessors

    NASA Technical Reports Server (NTRS)

    Liu, Yuan-Kwei

    1989-01-01

    A quick and preliminary comparison is made between the Intel 386 and i486 microprocessors. The following topics are discussed: the i486 key elements, comparison of instruction set architecture, the i486 on-chip cache characteristics, the i486 multiprocessor support, comparison of performance, comparison of power consumption, comparison of radiation hardening potential, and recommendations for the Space Station Freedom (SSF) Data Management System (DMS).

  8. Diode-quad bridge circuit means

    NASA Technical Reports Server (NTRS)

    Harrison, D. R.; Dimeff, J. (Inventor)

    1975-01-01

    Diode-quad bridge circuit means is described for use as a transducer circuit or as a discriminator circuit. It includes: (1) a diode bridge having first, second, third, and fourth bridge terminals consecutively coupled together by four diodes polarized in circulating relationship; (2) a first impedance connected between the second bridge terminal and a circuit ground; (3) a second impedance connected between the fourth bridge terminal and the circuit ground; (4) a signal source having a first source terminal capacitively coupled to the first and third bridge terminals, and a second source terminal connected to the circuit ground; and (5) an output terminal coupled to the first bridge terminal and at which an output signal may be taken.

  9. Comparative Performance Analysis of Coarse Solvers for Algebraic Multigrid on Multicore and Manycore Architectures

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Druinsky, Alex; Ghysels, Pieter; Li, Xiaoye S.

    In this paper, we study the performance of a two-level algebraic-multigrid algorithm, with a focus on the impact of the coarse-grid solver on performance. We consider two algorithms for solving the coarse-space systems: the preconditioned conjugate gradient method and a new robust HSS-embedded low-rank sparse-factorization algorithm. Our test data comes from the SPE Comparative Solution Project for oil-reservoir simulations. We contrast the performance of our code on one 12-core socket of a Cray XC30 machine with performance on a 60-core Intel Xeon Phi coprocessor. To obtain top performance, we optimized the code to take full advantage of fine-grained parallelism and made it thread-friendly for high thread counts. We also developed a bounds-and-bottlenecks performance model of the solver, which we used to guide the optimization effort, and carried out performance tuning in the solver's large parameter space. As a result, significant speedups were obtained on both machines.
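
    The first coarse-grid solver named above, the preconditioned conjugate gradient method, follows a standard iteration; a minimal dense sketch with a Jacobi preconditioner assumed for illustration (the paper's actual solver and preconditioner differ):

```cpp
// Minimal preconditioned conjugate gradient sketch (Jacobi preconditioner
// assumed); solves A x = b for a symmetric positive-definite matrix A.
#include <cmath>
#include <vector>
using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

static double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

static Vec matvec(const Mat& A, const Vec& x) {
    Vec y(x.size(), 0.0);
    for (size_t i = 0; i < A.size(); ++i)
        for (size_t j = 0; j < x.size(); ++j) y[i] += A[i][j] * x[j];
    return y;
}

Vec pcg(const Mat& A, const Vec& b, int maxit = 1000, double tol = 1e-10) {
    size_t n = b.size();
    Vec x(n, 0.0), r = b, z(n), p(n);
    for (size_t i = 0; i < n; ++i) z[i] = r[i] / A[i][i];  // Jacobi: M = diag(A)
    p = z;
    double rz = dot(r, z);
    for (int k = 0; k < maxit && std::sqrt(dot(r, r)) > tol; ++k) {
        Vec Ap = matvec(A, p);
        double alpha = rz / dot(p, Ap);            // step length
        for (size_t i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        for (size_t i = 0; i < n; ++i) z[i] = r[i] / A[i][i];
        double rz_new = dot(r, z);
        for (size_t i = 0; i < n; ++i) p[i] = z[i] + (rz_new / rz) * p[i];
        rz = rz_new;
    }
    return x;
}
```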

  10. Earth system modelling on system-level heterogeneous architectures: EMAC (version 2.42) on the Dynamical Exascale Entry Platform (DEEP)

    NASA Astrophysics Data System (ADS)

    Christou, Michalis; Christoudias, Theodoros; Morillo, Julián; Alvarez, Damian; Merx, Hendrik

    2016-09-01

    We examine an alternative approach to heterogeneous cluster computing in the many-core era for Earth system models, using the European Centre for Medium-Range Weather Forecasts Hamburg (ECHAM)/Modular Earth Submodel System (MESSy) Atmospheric Chemistry (EMAC) model as a pilot application on the Dynamical Exascale Entry Platform (DEEP). A set of interconnected autonomous coprocessors, called the Booster, complements a conventional HPC Cluster and increases its computing performance, offering extra flexibility to expose multiple levels of parallelism and achieve better scalability. The EMAC atmospheric chemistry code (Module Efficiently Calculating the Chemistry of the Atmosphere (MECCA)) was taskified with an offload mechanism implemented using OmpSs directives. The model was ported to the MareNostrum 3 supercomputer to allow testing with Intel Xeon Phi accelerators on a production-size machine. The changes proposed in this paper are expected to contribute to the eventual adoption of the Cluster-Booster division and Many Integrated Core (MIC) accelerated architectures in presently available implementations of Earth system models, towards exploiting the potential of a fully Exascale-capable platform.

  11. Traditional Tracking with Kalman Filter on Parallel Architectures

    NASA Astrophysics Data System (ADS)

    Cerati, Giuseppe; Elmer, Peter; Lantz, Steven; MacNeill, Ian; McDermott, Kevin; Riley, Dan; Tadel, Matevž; Wittich, Peter; Würthwein, Frank; Yagil, Avi

    2015-05-01

    Power density constraints are limiting the performance improvements of modern CPUs. To address this, we have seen the introduction of lower-power, multi-core processors, but the future will be even more exciting. In order to stay within the power density limits but still obtain Moore's Law performance/price gains, it will be necessary to parallelize algorithms to exploit larger numbers of lightweight cores and specialized functions like large vector units. Example technologies today include Intel's Xeon Phi and GPGPUs. Track finding and fitting is one of the most computationally challenging problems for event reconstruction in particle physics. At the High Luminosity LHC, for example, this will be by far the dominant problem. The most common track-finding techniques in use today, however, are those based on the Kalman Filter. Significant experience has been accumulated with these techniques on real tracking detector systems, both in the trigger and offline. We report the results of our investigations into the potential and limitations of these algorithms on the new parallel hardware.
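
    For reference, one Kalman filter predict/update cycle for a toy one-dimensional constant-velocity track model looks as follows (an illustrative sketch; production track fitters use larger state vectors and detector-specific measurement models, and the noise values below are assumptions):

```cpp
// One Kalman filter predict/update cycle for a 1-D constant-velocity model.
// State is [position, velocity]; the measurement observes position only.
struct Kalman1D {
    double pos = 0.0, vel = 0.0;            // state estimate
    double p00 = 1.0, p01 = 0.0,            // 2x2 state covariance P
           p10 = 0.0, p11 = 1.0;
    double q = 1e-4;                        // process noise (assumed value)
    double r = 1e-2;                        // measurement noise (assumed value)

    void predict(double dt) {
        pos += dt * vel;                    // x <- F x with F = [[1,dt],[0,1]]
        double n00 = p00 + dt * (p10 + p01) + dt * dt * p11 + q;
        double n01 = p01 + dt * p11;
        double n10 = p10 + dt * p11;
        p00 = n00; p01 = n01; p10 = n10; p11 += q;   // P <- F P F^T + Q
    }

    void update(double z) {                 // measurement model H = [1, 0]
        double s = p00 + r;                 // innovation covariance
        double k0 = p00 / s, k1 = p10 / s;  // Kalman gain K = P H^T / s
        double innov = z - pos;             // innovation (residual)
        pos += k0 * innov;
        vel += k1 * innov;
        double n00 = (1 - k0) * p00, n01 = (1 - k0) * p01;
        double n10 = p10 - k1 * p00, n11 = p11 - k1 * p01;
        p00 = n00; p01 = n01; p10 = n10; p11 = n11;  // P <- (I - K H) P
    }
};
```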

  12. Spectral Element Method for the Simulation of Unsteady Compressible Flows

    NASA Technical Reports Server (NTRS)

    Diosady, Laslo Tibor; Murman, Scott M.

    2013-01-01

    This work uses a discontinuous-Galerkin spectral-element method (DGSEM) to solve the compressible Navier-Stokes equations [1-3]. The inviscid flux is computed using the approximate Riemann solver of Roe [4]. The viscous fluxes are computed using the second form of Bassi and Rebay (BR2) [5] in a manner consistent with the spectral-element approximation. The method of lines with the classical 4th-order explicit Runge-Kutta scheme is used for time integration. Results for polynomial orders up to p = 15 (16th order) are presented. The code is parallelized using the Message Passing Interface (MPI). The computations presented in this work are performed using the Sandy Bridge nodes of the NASA Pleiades supercomputer at NASA Ames Research Center. Each Sandy Bridge node consists of 2 eight-core Intel Xeon E5-2670 processors with a clock speed of 2.6 GHz and 2 GB of memory per core. On a Sandy Bridge node the Tau Benchmark [6] runs in a time of 7.6 s.
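
    The classical 4th-order Runge-Kutta scheme named above advances the solution in four stages; a generic sketch, with the DGSEM residual abstracted as a callback f (an illustration, not the paper's code):

```cpp
// Classical 4th-order Runge-Kutta step: u_{n+1} = u_n + dt/6 (k1+2k2+2k3+k4).
#include <functional>
#include <vector>
using State = std::vector<double>;

State rk4_step(const std::function<State(const State&)>& f,
               const State& u, double dt) {
    auto axpy = [](const State& a, double s, const State& b) {
        State r(a.size());
        for (size_t i = 0; i < a.size(); ++i) r[i] = a[i] + s * b[i];
        return r;                           // a + s*b
    };
    State k1 = f(u);
    State k2 = f(axpy(u, 0.5 * dt, k1));
    State k3 = f(axpy(u, 0.5 * dt, k2));
    State k4 = f(axpy(u, dt, k3));
    State out(u.size());
    for (size_t i = 0; i < u.size(); ++i)
        out[i] = u[i] + dt / 6.0 * (k1[i] + 2.0 * k2[i] + 2.0 * k3[i] + k4[i]);
    return out;
}
```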

  13. A heterogeneous computing accelerated SCE-UA global optimization method using OpenMP, OpenCL, CUDA, and OpenACC.

    PubMed

    Kan, Guangyuan; He, Xiaoyan; Ding, Liuqian; Li, Jiren; Liang, Ke; Hong, Yang

    2017-10-01

    The shuffled complex evolution optimization developed at the University of Arizona (SCE-UA) has been successfully applied for many years in various kinds of scientific and engineering optimization applications, such as hydrological model parameter calibration. The algorithm possesses good global optimality, convergence stability and robustness. However, benchmark and real-world applications reveal the poor computational efficiency of the SCE-UA. This research aims at the parallelization and acceleration of the SCE-UA method based on powerful heterogeneous computing technology. The parallel SCE-UA is implemented on an Intel Xeon multi-core CPU (by using OpenMP and OpenCL) and an NVIDIA Tesla many-core GPU (by using OpenCL, CUDA, and OpenACC). The serial and parallel SCE-UA were tested on the Griewank benchmark function. Comparison results indicate the parallel SCE-UA significantly improves computational efficiency compared to the original serial version. The OpenCL implementation obtains the best overall acceleration results, albeit with the most complex source code. The parallel SCE-UA has bright prospects for application to real-world problems.
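
    The Griewank benchmark function used for testing, together with one plausible OpenMP parallelization of the population evaluation, can be sketched as follows (the complex-shuffling logic of SCE-UA itself is omitted):

```cpp
// Griewank function: f(x) = 1 + sum(x_i^2/4000) - prod(cos(x_i/sqrt(i))),
// with a global minimum of 0 at the origin. Evaluation of a population of
// candidate solutions is embarrassingly parallel, hence the OpenMP loop.
#include <cmath>
#include <omp.h>
#include <vector>

double griewank(const std::vector<double>& x) {
    double sum = 0.0, prod = 1.0;
    for (size_t i = 0; i < x.size(); ++i) {
        sum += x[i] * x[i] / 4000.0;
        prod *= std::cos(x[i] / std::sqrt(double(i + 1)));
    }
    return 1.0 + sum - prod;
}

void evaluate_population(const std::vector<std::vector<double>>& pop,
                         std::vector<double>& fitness) {
    #pragma omp parallel for
    for (long p = 0; p < (long)pop.size(); ++p)
        fitness[p] = griewank(pop[p]);
}
```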

  14. A Micro-Force Sensor with Slotted-Quad-Beam Structure for Measuring the Friction in MEMS Bearings

    PubMed Central

    Liu, Huan; Yang, Shuming; Zhao, Yulong; Jiang, Zhuangde; Liu, Yan; Tian, Bian

    2013-01-01

    Presented here is a slotted-quad-beam structure sensor for the measurement of friction in micro bearings. Stress concentration slots are incorporated into a conventional quad-beam structure to improve the sensitivity of force measurements. A performance comparison between the quad-beam structure sensor and the slotted-quad-beam structure sensor is performed by theoretical modeling and finite element (FE) analysis. A hollow stainless steel probe is attached to the mesa of the sensor chip by a tailor-made organic glass fixture. For overload protection of the fragile beams, a glass wafer is bonded onto the bottom of the sensor chip to limit the displacement of the mesa. The calibration of the packaged device is experimentally performed with a tri-dimensional positioning stage, a precision piezoelectric ceramic and an electronic analytical balance, which indicates its favorable sensitivity and overload protection. To verify the potential of the proposed sensor for micro friction measurement, a measurement platform is established. The output of the sensor reflects the friction of the bearing resulting from dry friction and solid lubrication. The results accord with the theoretical modeling and demonstrate that the sensor has potential application in measuring the micro friction force during the stable stage in MEMS machines. PMID:24084112

  15. Global synchronization algorithms for the Intel iPSC/860

    NASA Technical Reports Server (NTRS)

    Seidel, Steven R.; Davis, Mark A.

    1992-01-01

    In a distributed memory multicomputer that has no global clock, global processor synchronization can only be achieved through software. Global synchronization algorithms are used in tridiagonal systems solvers, CFD codes, sequence comparison algorithms, and sorting algorithms. They are also useful for event simulation, debugging, and for solving mutual exclusion problems. For the Intel iPSC/860 in particular, global synchronization can be used to ensure the most effective use of the communication network for operations such as the shift, where each processor in a one-dimensional array or ring concurrently sends a message to its right (or left) neighbor. Three global synchronization algorithms are considered for the iPSC/860: the gsync() primitive provided by Intel, the PICL primitive sync0(), and a new recursive doubling synchronization (RDS) algorithm. The performance of these algorithms is compared to the performance predicted by communication models of both the long and forced message protocols. Measurements of the cost of shift operations preceded by global synchronization show that the RDS algorithm always synchronizes the nodes more precisely and costs only slightly more than the other two algorithms.
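
    A recursive doubling barrier of the kind the RDS algorithm implements can be sketched with MPI (a hedged illustration assuming a power-of-two number of ranks; the iPSC/860's actual message protocols differ):

```cpp
// Recursive doubling barrier sketch: at stage k, each rank exchanges a
// zero-byte message with rank XOR 2^k. Assumes size is a power of two.
#include <mpi.h>

void rds_barrier(MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    char sdummy = 0, rdummy = 0;     // zero-length payloads
    for (int mask = 1; mask < size; mask <<= 1) {
        int partner = rank ^ mask;   // stage partner
        MPI_Sendrecv(&sdummy, 0, MPI_BYTE, partner, 0,
                     &rdummy, 0, MPI_BYTE, partner, 0,
                     comm, MPI_STATUS_IGNORE);
    }
    // After log2(size) exchanges every rank has (transitively) heard from
    // every other rank, so all ranks have reached the barrier.
}
```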

  16. Scaling deep learning workloads: NVIDIA DGX-1/Pascal and Intel Knights Landing

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Gawande, Nitin A.; Landwehr, Joshua B.; Daily, Jeffrey A.

    Deep Learning (DL) algorithms have become ubiquitous in data analytics. As a result, major computing vendors --- including NVIDIA, Intel, AMD, and IBM --- have architectural road-maps influenced by DL workloads. Furthermore, several vendors have recently advertised new computing products as accelerating large DL workloads. Unfortunately, it is difficult for data scientists to quantify the potential of these different products. This paper provides a performance and power analysis of important DL workloads on two major parallel architectures: NVIDIA DGX-1 (eight Pascal P100 GPUs interconnected with NVLink) and Intel Knights Landing (KNL) CPUs interconnected with Intel Omni-Path or Cray Aries. Our evaluation consists of a cross section of convolutional neural net workloads: CifarNet, AlexNet, GoogLeNet, and ResNet50 topologies using the Cifar10 and ImageNet datasets. The workloads are vendor-optimized for each architecture. Our analysis indicates that although GPUs provide the highest overall performance, the gap can close for some convolutional networks; and the KNL can be competitive in performance/watt. We find that NVLink facilitates scaling efficiency on GPUs. However, its importance is heavily dependent on neural network architecture. Furthermore, for weak-scaling --- sometimes encouraged by restricted GPU memory --- NVLink is less important.

  17. Successful Outcome of Modified Quad Surgical Procedure in Preteen and Teen Patients with Brachial Plexus Birth Palsy

    PubMed Central

    Nath, Rahul K.; Somasundaram, Chandra

    2012-01-01

    Objective: To evaluate the outcome of the modified Quad procedure in preteen and teen patients with brachial plexus birth palsy. Background: We have previously demonstrated a significant improvement in shoulder abduction resulting from the modified Quad procedure in children (mean age 2.5 years; range, 0.5–9 years) with obstetric brachial plexus injury. Methods: We describe in this report the outcome of 16 patients (6 girls and 10 boys; 7 preteen and 9 teen) who have undergone the modified Quad procedure for the correction of shoulder function, specifically abduction. The patients underwent transfer of the latissimus dorsi and teres major muscles, release of contractures of the subscapularis, pectoralis major and minor, and axillary nerve decompression and neurolysis (the modified Quad procedure). Mean age of these patients at surgery was 13.5 years (range, 10.1–17.9 years). Results: The mean preoperative total Mallet score was 14.8 (range, 10–20), and active abduction was 84° (range, 20°–140°). At a mean follow-up of 1.5 years, the mean postoperative total Mallet score increased to 19.7 (range, 13–25, P < .0001), and the mean active abduction improved to 132° (range, 40°–180°, P < .0003). Conclusion: The modified Quad procedure greatly improves not only active abduction but also other shoulder functions in preteen and teen patients, as this outcome is the combined result of decompression and neurolysis of the axillary nerve and the release of the contracted internal rotators of the shoulder. PMID:23308301

  18. Using implicit association tests in age-heterogeneous samples: The importance of cognitive abilities and quad model processes.

    PubMed

    Wrzus, Cornelia; Egloff, Boris; Riediger, Michaela

    2017-08-01

    Implicit association tests (IATs) are increasingly used to indirectly assess people's traits, attitudes, or other characteristics. In addition to measuring traits or attitudes, IAT scores also reflect differences in cognitive abilities because scores are based on reaction times (RTs) and errors. As cognitive abilities change with age, questions arise concerning the use and interpretation of IATs for people of different ages. To address these questions, the current study examined how cognitive abilities and cognitive processes (i.e., quad model parameters) contribute to IAT results in a large age-heterogeneous sample. Participants (N = 549; 51% female) in an age-stratified sample (range = 12-88 years) completed different IATs and 2 tasks to assess cognitive processing speed and verbal ability. From the IAT data, D2-scores were computed based on RTs, and quad process parameters (activation of associations, overcoming bias, detection, guessing) were estimated from individual error rates. IAT scores and all quad processes except guessing varied substantially with age. The quad processes AC (activation of associations) and D (detection) predicted D2-scores of the content-specific IAT. Importantly, the effects of cognitive abilities and quad processes on IAT scores were not significantly moderated by participants' age. These findings suggest that IATs are suitable for age-heterogeneous studies from adolescence to old age when IATs are constructed and analyzed appropriately, for example with D-scores and process parameters. We offer further insight into how D-scoring controls for method effects in IATs and what IAT scores capture in addition to implicit representations of characteristics. (PsycINFO Database Record (c) 2017 APA, all rights reserved).

  19. Recent Performance Results of VPIC on Trinity

    NASA Astrophysics Data System (ADS)

    Nystrom, W. D.; Bergen, B.; Bird, R. F.; Bowers, K. J.; Daughton, W. S.; Guo, F.; Le, A.; Li, H.; Nam, H.; Pang, X.; Stark, D. J.; Rust, W. N., III; Yin, L.; Albright, B. J.

    2017-10-01

    Trinity is a new DOE compute resource now in production at Los Alamos National Laboratory. Trinity has several new and unique features, including two compute partitions, one with dual-socket Intel Haswell Xeon compute nodes and one with Intel Knights Landing (KNL) Xeon Phi compute nodes; use of on-package high bandwidth memory (HBM) for KNL nodes; the ability to configure KNL nodes with respect to HBM model and on-die network topology in a variety of operational modes at run time; and use of solid state storage via burst buffer technology to reduce the time required to perform I/O. An effort is in progress to optimize VPIC on Trinity by taking advantage of these new architectural features. Results will be presented on the performance of VPIC on the Haswell and KNL partitions for single node runs and runs at scale. Results include use of burst buffers at scale to optimize I/O, comparison of strategies for using MPI and threads, performance benefits of using HBM, and the effectiveness of using intrinsics for vectorization. Work performed under auspices of U.S. Dept. of Energy by Los Alamos National Security, LLC Los Alamos National Laboratory under contract DE-AC52-06NA25396 and supported by LANL LDRD program.
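
    The use of intrinsics for vectorization mentioned above typically looks like the following AVX-512 sketch on KNL (illustrative only; VPIC's actual particle-push kernels are far larger):

```cpp
// AVX-512 SAXPY-style kernel using intrinsics: y = a*x + y, 16 floats per
// iteration. Assumes n is a multiple of 16 and pointers are 64-byte aligned.
#include <immintrin.h>

void saxpy_avx512(float a, const float* x, float* y, int n) {
    __m512 va = _mm512_set1_ps(a);             // broadcast scalar to 16 lanes
    for (int i = 0; i < n; i += 16) {
        __m512 vx = _mm512_load_ps(x + i);
        __m512 vy = _mm512_load_ps(y + i);
        vy = _mm512_fmadd_ps(va, vx, vy);      // fused multiply-add per lane
        _mm512_store_ps(y + i, vy);
    }
}
```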

  20. FPGA Online Tracking Algorithm for the PANDA Straw Tube Tracker

    NASA Astrophysics Data System (ADS)

    Liang, Yutie; Ye, Hua; Galuska, Martin J.; Gessler, Thomas; Kuhn, Wolfgang; Lange, Jens Soren; Wagner, Milan N.; Liu, Zhen'an; Zhao, Jingzhou

    2017-06-01

    A novel FPGA-based online tracking algorithm for helix track reconstruction in a solenoidal field, developed for the PANDA spectrometer, is described. Employing the Straw Tube Tracker detector with 4636 straw tubes, the algorithm includes a complex track finder and a track fitter. Implemented in VHDL, the algorithm is tested on a Xilinx Virtex-4 FX60 FPGA chip with different types of events at different event rates. A processing time of 7 μs per event for an average of 6 charged tracks is obtained. The momentum resolution is about 3% (4%) for p_t (p_z) at 1 GeV/c. Compared to the algorithm running on a CPU chip (a single core of an Intel Xeon E5520 at 2.26 GHz), an improvement of 3 orders of magnitude in processing time is obtained. The algorithm can handle severe overlapping of events, which is typical for interaction rates above 10 MHz.

  1. Parallelization of the preconditioned IDR solver for modern multicore computer systems

    NASA Astrophysics Data System (ADS)

    Bessonov, O. A.; Fedoseyev, A. I.

    2012-10-01

    This paper presents the analysis, parallelization and optimization approach for the large sparse matrix solver CNSPACK on modern multicore microprocessors. CNSPACK is an advanced solver successfully used for the coupled solution of stiff problems arising in multiphysics applications such as CFD, semiconductor transport, and kinetic and quantum problems. It employs an iterative IDR algorithm with ILU preconditioning (of user-chosen order). CNSPACK has been used successfully during the last decade for solving problems in several application areas, including fluid dynamics and semiconductor device simulation. However, there has been a dramatic change in processor architectures and computer system organization in recent years. Because of this, performance criteria and methods have been revisited, and the solver and preconditioner have been parallelized using the OpenMP environment. Results of the successful implementation of efficient parallelization are presented for modern advanced computer systems (Intel Core i7-9xx or two-processor Xeon 55xx/56xx).

  2. Optimization of sparse matrix-vector multiplication on emerging multicore platforms

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Williams, Samuel; Oliker, Leonid; Vuduc, Richard

    2007-01-01

    We are witnessing a dramatic change in computer architecture due to the multicore paradigm shift, as every electronic device from cell phones to supercomputers confronts parallelism of unprecedented scale. To fully unleash the potential of these systems, the HPC community must develop multicore specific optimization methodologies for important scientific computations. In this work, we examine sparse matrix-vector multiply (SpMV) - one of the most heavily used kernels in scientific computing - across a broad spectrum of multicore designs. Our experimental platform includes the homogeneous AMD dual-core and Intel quad-core designs, the heterogeneous STI Cell, as well as the first scientific study of the highly multithreaded Sun Niagara2. We present several optimization strategies especially effective for the multicore environment, and demonstrate significant performance improvements compared to existing state-of-the-art serial and parallel SpMV implementations. Additionally, we present key insights into the architectural tradeoffs of leading multicore design strategies, in the context of demanding memory-bound numerical algorithms.
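
    As a baseline for the optimizations the study explores, the SpMV kernel in its standard compressed sparse row (CSR) form with a straightforward OpenMP row parallelization is (a sketch; the paper's tuned variants go well beyond this):

```cpp
// Baseline CSR sparse matrix-vector multiply y = A*x with OpenMP.
// rowptr has nrows+1 entries; colind/vals hold the nonzeros row by row.
#include <omp.h>

void spmv_csr(int nrows, const int* rowptr, const int* colind,
              const double* vals, const double* x, double* y) {
    #pragma omp parallel for schedule(dynamic, 64)
    for (int i = 0; i < nrows; ++i) {
        double sum = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; ++k)
            sum += vals[k] * x[colind[k]];   // gather from x via column indices
        y[i] = sum;
    }
}
```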

  3. Analytical Performance Modeling and Validation of Intel’s Xeon Phi Architecture

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Chunduri, Sudheer; Balaprakash, Prasanna; Morozov, Vitali

    Modeling the performance of scientific applications on emerging hardware plays a central role in achieving extreme-scale computing goals. Analytical models that capture the interaction between applications and hardware characteristics are attractive because even a reasonably accurate model can be useful for performance tuning before the hardware is made available. In this paper, we develop a hardware model for Intel's second-generation Xeon Phi architecture code-named Knights Landing (KNL) for the SKOPE framework. We validate the KNL hardware model by projecting the performance of mini-benchmarks and application kernels. The results show that our KNL model can project the performance with prediction errors of 10% to 20%. The hardware model also provides informative recommendations for code transformations and tuning.

  4. Treatment effects of quad-helix on the eruption pattern of maxillary second molars.

    PubMed

    Kobayashi, Yoshiki; Shundo, Isao; Endo, Toshiya

    2012-07-01

    To evaluate the effects of quad-helix treatment on the eruption pattern of maxillary second molars in patients with maxillary incisor crowding. The lateral cephalograms of 40 consecutively treated patients in the early mixed-dentition group (treatment group) were examined in comparison with those of the same number of untreated patients with a similar form of malocclusion (control group). The cephalograms of the treated patients were taken at the start (T0) and at the end (T1) of treatment, and those of the untreated patients were taken at about the same times as T0 and T1. The mean ages at T0 and T1 in the two groups were about the same. Distal tipping and movement and impeded extrusion of the maxillary first molars were notable in the treatment group compared with the control group. The actual treatment changes with the use of the quad-helix were expressed as distal tipping and impeded vertical eruption of the maxillary second molars. The more the maxillary first molars were tipped distally and the less the maxillary first molars extruded, the more the vertical eruption of the maxillary second molars was impeded. Quad-helix treatment gives rise to spontaneous distal tipping and impeded vertical eruption of the maxillary second molars.

  5. Communication overhead on the Intel iPSC-860 hypercube

    NASA Technical Reports Server (NTRS)

    Bokhari, Shahid H.

    1990-01-01

    Experiments were conducted on the Intel iPSC-860 hypercube in order to evaluate the overhead of interprocessor communication. It is demonstrated that: (1) contrary to popular belief, the distance between two communicating processors has a significant impact on communication time, (2) edge contention can increase communication time by a factor of more than 7, and (3) node contention has no measurable impact.

  6. Eigenvalue computations with the QUAD4 consistent-mass matrix

    NASA Technical Reports Server (NTRS)

    Butler, Thomas A.

    1990-01-01

    The NASTRAN user has the option of using either a lumped-mass matrix or a consistent- (coupled-) mass matrix with the QUAD4 shell finite element. At the Sixteenth NASTRAN Users' Colloquium (1988), Melvyn Marcus and associates of the David Taylor Research Center summarized a study comparing the results of the QUAD4 element with results of other NASTRAN shell elements for a cylindrical-shell modal analysis. Results of this study, in which both the lumped-and consistent-mass matrix formulations were used, implied that the consistent-mass matrix yielded poor results. In an effort to further evaluate the consistent-mass matrix, a study was performed using both a cylindrical-shell geometry and a flat-plate geometry. Modal parameters were extracted for several modes for both geometries leading to some significant conclusions. First, there do not appear to be any fundamental errors associated with the consistent-mass matrix. However, its accuracy is quite different for the two different geometries studied. The consistent-mass matrix yields better results for the flat-plate geometry and the lumped-mass matrix seems to be the better choice for cylindrical-shell geometries.
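
    For reference, the distinction under study is visible already in the textbook two-node bar element (standard forms given for illustration, not taken from the paper):

```latex
% Consistent vs. lumped mass matrix for a two-node bar element of density
% \rho, cross-section A and length L (textbook forms):
M_{\text{consistent}} = \frac{\rho A L}{6}
\begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix},
\qquad
M_{\text{lumped}} = \frac{\rho A L}{2}
\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}.
```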

  7. Scaling Deep Learning workloads: NVIDIA DGX-1/Pascal and Intel Knights Landing

    DOE PAGES

    Gawande, Nitin A.; Daily, Jeff A.; Siegel, Charles; ...

    2018-05-05

    Deep Learning (DL) algorithms have become ubiquitous in data analytics. As a result, major computing vendors—including NVIDIA, Intel, AMD, and IBM—have architectural road maps influenced by DL workloads. Furthermore, several vendors have recently advertised new computing products as accelerating large DL workloads. Unfortunately, it is difficult for data scientists to quantify the potential of these different products. Here, this article provides a performance and power analysis of important DL workloads on two major parallel architectures: NVIDIA DGX-1 (eight Pascal P100 GPUs interconnected with NVLink) and Intel Knights Landing (KNL) CPUs interconnected with Intel Omni-Path or Cray Aries. Our evaluation consists of a cross section of convolutional neural net workloads: CifarNet, AlexNet, GoogLeNet, and ResNet50 topologies using the Cifar10 and ImageNet datasets. The workloads are vendor-optimized for each architecture. We use sequentially equivalent implementations to maintain iso-accuracy between parallel and sequential DL models. Our analysis indicates that although GPUs provide the highest overall performance, the gap can close for some convolutional networks; and the KNL can be competitive in performance/watt. We find that NVLink facilitates scaling efficiency on GPUs. However, its importance is heavily dependent on neural network architecture. Furthermore, for weak-scaling—sometimes encouraged by restricted GPU memory—NVLink is less important.

  8. Scaling Deep Learning workloads: NVIDIA DGX-1/Pascal and Intel Knights Landing

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Gawande, Nitin A.; Daily, Jeff A.; Siegel, Charles

    Deep Learning (DL) algorithms have become ubiquitous in data analytics. As a result, major computing vendors—including NVIDIA, Intel, AMD, and IBM—have architectural road maps influenced by DL workloads. Furthermore, several vendors have recently advertised new computing products as accelerating large DL workloads. Unfortunately, it is difficult for data scientists to quantify the potential of these different products. Here, this article provides a performance and power analysis of important DL workloads on two major parallel architectures: NVIDIA DGX-1 (eight Pascal P100 GPUs interconnected with NVLink) and Intel Knights Landing (KNL) CPUs interconnected with Intel Omni-Path or Cray Aries. Our evaluation consists of a cross section of convolutional neural net workloads: CifarNet, AlexNet, GoogLeNet, and ResNet50 topologies using the Cifar10 and ImageNet datasets. The workloads are vendor-optimized for each architecture. We use sequentially equivalent implementations to maintain iso-accuracy between parallel and sequential DL models. Our analysis indicates that although GPUs provide the highest overall performance, the gap can close for some convolutional networks; and the KNL can be competitive in performance/watt. We find that NVLink facilitates scaling efficiency on GPUs. However, its importance is heavily dependent on neural network architecture. Furthermore, for weak-scaling—sometimes encouraged by restricted GPU memory—NVLink is less important.

  9. An Approach to Quad Meshing Based On Cross Valued Maps and the Ginzburg-Landau Theory

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Viertel, Ryan; Osting, Braxton

    2017-08-01

    A generalization of vector fields, referred to as N-direction fields or cross fields when N=4, has been recently introduced and studied for geometry processing, with applications in quadrilateral (quad) meshing, texture mapping, and parameterization. We make the observation that cross field design for two-dimensional quad meshing is related to the well-known Ginzburg-Landau problem from mathematical physics. This identification yields a variety of theoretical tools for efficiently computing boundary-aligned quad meshes, with provable guarantees on the resulting mesh, for example, the number of mesh defects and bounds on the defect locations. The procedure for generating the quad mesh is to (i) find a complex-valued "representation" field that minimizes the Dirichlet energy subject to a boundary constraint, (ii) convert the representation field into a boundary-aligned, smooth cross field, (iii) use separatrices of the cross field to partition the domain into four sided regions, and (iv) mesh each of these four-sided regions using standard techniques. Under certain assumptions on the geometry of the domain, we prove that this procedure can be used to produce a cross field whose separatrices partition the domain into four sided regions. To solve the energy minimization problem for the representation field, we use an extension of the Merriman-Bence-Osher (MBO) threshold dynamics method, originally conceived as an algorithm to simulate motion by mean curvature, to minimize the Ginzburg-Landau energy for the optimal representation field. Lastly, we demonstrate the method on a variety of test domains.
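
    The Ginzburg-Landau energy referred to above has the familiar form (a standard statement with an assumed normalization, given here for orientation):

```latex
% Standard Ginzburg-Landau energy for a complex-valued field u on a domain
% \Omega, with small parameter \epsilon (assumed normalization):
E_\epsilon(u) = \int_\Omega |\nabla u|^2
  + \frac{1}{2\epsilon^2}\left(1 - |u|^2\right)^2 \, dx
```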

  10. 75 FR 75706 - Dresden Nuclear Power Station, Units 2 and 3 and Quad Cities Nuclear Power Station, Unit Nos. 1...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2010-12-06

    ...- 2010-0373] Dresden Nuclear Power Station, Units 2 and 3 and Quad Cities Nuclear Power Station, Unit Nos... and DPR-25 for Dresden Nuclear Power Station, Units 2 and 3, respectively, located in Grundy County, Illinois, and to Renewed Facility Operating License Nos. DPR-29 and DPR-30 for Quad Cities Nuclear Power...

  11. Exploring synchrotron radiation capabilities: The ALS-Intel CRADA

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Gozzo, F.; Cossy-Favre, A; Trippleet, B.

    1997-04-01

    Synchrotron radiation spectroscopy and spectromicroscopy were applied, at the Advanced Light Source, to the analysis of materials and problems of interest to the commercial semiconductor industry. The authors discuss some of the results obtained at the ALS using existing capabilities, in particular the small spot ultra-ESCA instrument on beamline 7.0 and the AMS (Applied Material Science) endstation on beamline 9.3.2. The continuing trend towards smaller feature size and increased performance for semiconductor components has driven the semiconductor industry to invest in the development of sophisticated and complex instrumentation for the characterization of microstructures. Among the crucial milestones established by the Semiconductor Industry Association are the needs for high quality, defect free and extremely clean silicon wafers, very thin gate oxides, lithographies near 0.1 micron and advanced material interconnect structures. The requirements of future generations cannot be met with current industrial technologies. The purpose of the ALS-Intel CRADA (Cooperative Research And Development Agreement) is to explore, compare and improve the utility of synchrotron-based techniques for practical analysis of substrates of interest to semiconductor chip manufacturing. The first phase of the CRADA project consisted of exploring existing ALS capabilities and techniques on some problems of interest. Some of the preliminary results obtained on Intel samples are discussed here.

  12. a Novel Two-Component Decomposition for Co-Polar Channels of GF-3 Quad-Pol Data

    NASA Astrophysics Data System (ADS)

    Kwok, E.; Li, C. H.; Zhao, Q. H.; Li, Y.

    2018-04-01

    Polarimetric target decomposition theory is the most dynamic and exploratory research area in the field of PolSAR. However, most target decomposition methods are based on fully polarimetric (quad-pol) data and seldom utilize dual-polar data. Given this, we propose a novel two-component decomposition method for the co-polar channels of GF-3 quad-pol data. This method decomposes the data into two scattering contributions, surface and double bounce, in the dual co-polar channels. To solve this underdetermined problem, a criterion for determining the model is proposed. The criterion, which originates from the H/α decomposition, can be named the second-order averaged scattering angle, and we also put forward an alternative parameter for it. To validate the effectiveness of the proposed decomposition, Liaodong Bay is selected as the research area. The area is located in northeastern China, hosts various wetland resources, and exhibits sea ice in winter. The study data are GF-3 quad-pol data, acquired by China's first C-band polarimetric synthetic aperture radar (PolSAR) satellite. The dependencies between the features of the proposed algorithm and comparison decompositions (Pauli decomposition, An&Yang decomposition, Yamaguchi S4R decomposition) were investigated in the study. Through several aspects of the experimental discussion, we can draw the following conclusions: the proposed algorithm may be suitable for special scenes with low vegetation coverage, or with low vegetation in the non-growing season; and the proposed decomposition features, using only co-polar data, are highly correlated with the corresponding comparison decomposition features computed from quad-polarization data. Moreover, they could serve as input to subsequent classification or parameter inversion.

  13. QUAD fever: beware of non-infectious fever in high spinal cord injuries.

    PubMed

    Goyal, Jyoti; Jha, Rakesh; Bhatia, Paramjeet; Mani, Raj Kumar

    2017-06-18

    A case of cervical spinal cord injury and quadriparesis with prolonged fever is described. Initially, the patient received treatment for a well-documented catheter-related bloodstream infection. High spiking fever returned and persisted with no obvious evidence of infection. The usual non-infectious causes were also carefully excluded. QUAD fever, or fever due to spinal cord injury itself, was considered. The pathogenetic basis of QUAD fever is unclear but could be attributed to autonomic dysfunction and temperature dysregulation. Awareness of this little known condition could help in avoiding unnecessary antimicrobial therapy and in more accurate prognostication. Unlike several previously reported cases that ended fatally, the present case ran a relatively benign course. The spectrum of presentations may therefore be broader than hitherto appreciated. © BMJ Publishing Group Ltd (unless otherwise stated in the text of the article) 2017. All rights reserved. No commercial use is permitted unless otherwise expressly granted.

  14. A parallel algorithm for the two-dimensional time fractional diffusion equation with implicit difference method.

    PubMed

    Gong, Chunye; Bao, Weimin; Tang, Guojian; Jiang, Yuewen; Liu, Jie

    2014-01-01

    Solving fractional differential equations is very time consuming. The computational complexity of the two-dimensional time fractional diffusion equation (2D-TFDE) with an iterative implicit finite difference method is O(M_x M_y N^2). In this paper, we present a parallel algorithm for the 2D-TFDE and give an in-depth discussion of this algorithm. A task distribution model and a data layout with virtual boundaries are designed for this parallel algorithm. The experimental results show that the parallel algorithm agrees well with the exact solution. The parallel algorithm on a single Intel Xeon X5540 CPU runs 3.16-4.17 times faster than the serial algorithm on a single CPU core. The parallel efficiency of 81 processes is up to 88.24% compared with 9 processes on a distributed memory cluster system. We believe that parallel computing technology will become a basic method for computationally intensive fractional applications in the near future.

  15. Application of the multireference equation of motion coupled cluster method, including spin-orbit coupling, to the atomic spectra of Cr, Mn, Fe and Co

    NASA Astrophysics Data System (ADS)

    Liu, Zhebing; Huntington, Lee M. J.; Nooijen, Marcel

    2015-10-01

    The recently introduced multireference equation of motion (MR-EOM) approach is combined with a simple treatment of spin-orbit coupling, as implemented in the ORCA program. The resulting multireference equation of motion spin-orbit coupling (MR-EOM-SOC) approach is applied to the first-row transition metal atoms Cr, Mn, Fe and Co, for which experimental data are readily available. Using the MR-EOM-SOC approach, the splittings in each L-S multiplet can be accurately assessed (root mean square (RMS) errors of about 70 cm⁻¹). The RMS errors for J-specific excitation energies range from 414 to 783 cm⁻¹ and are comparable to previously reported J-averaged MR-EOM results using the ACESII program. The MR-EOM approach is highly efficient. A typical MR-EOM calculation of a full spin-orbit spectrum takes about 2 CPU hours on a single processor of a 12-core node, consisting of Intel Xeon 2.93 GHz CPUs with 12.3 MB of shared cache memory.

  16. Single event effect testing of the Intel 80386 family and the 80486 microprocessor

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Moran, A.; LaBel, K.; Gates, M.

    The authors present single event effect test results for the Intel 80386 microprocessor, the 80387 coprocessor, the 82380 peripheral device, and the 80486 microprocessor. Both single event upset and latchup conditions were monitored.

  17. Porting plasma physics simulation codes to modern computing architectures using the libmrc framework

    NASA Astrophysics Data System (ADS)

    Germaschewski, Kai; Abbott, Stephen

    2015-11-01

    Available computing power has continued to grow exponentially even after single-core performance saturated in the last decade. The increase has since been driven by more parallelism, both using more cores and having more parallelism in each core, e.g. in GPUs and Intel Xeon Phi. Adapting existing plasma physics codes is challenging, in particular as there is no single programming model that covers current and future architectures. We will introduce the open-source libmrc framework that has been used to modularize and port three plasma physics codes: the extended MHD code MRCv3 with implicit time integration and curvilinear grids; the OpenGGCM global magnetosphere model; and the particle-in-cell code PSC. libmrc consolidates basic functionality needed for simulations based on structured grids (I/O, load balancing, time integrators), and also introduces a parallel object model that makes it possible to maintain multiple implementations of computational kernels, on e.g. conventional processors and GPUs. It handles data layout conversions and enables us to port performance-critical parts of a code to a new architecture step-by-step, while the rest of the code can remain unchanged. We will show examples of the performance gains and some physics applications.

  18. QuadCam - A Quadruple Polarimetric Camera for Space Situational Awareness

    NASA Astrophysics Data System (ADS)

    Skuljan, J.

    A specialised quadruple polarimetric camera for space situational awareness, QuadCam, has been built at the Defence Technology Agency (DTA), New Zealand, as part of a collaboration with the Defence Science and Technology Laboratory (Dstl), United Kingdom. The design was based on a similar system originally developed at Dstl, with some significant modifications for improved performance. The system is made up of four identical CCD cameras looking in the same direction, but each in a different plane of polarisation, at 0, 45, 90 and 135 degrees with respect to the reference plane. A standard set of Stokes parameters can be derived from the four images in order to describe the state of polarisation of an object captured in the field of view. The modified design of the DTA QuadCam makes use of four small Raspberry Pi computers, so that each camera is controlled by its own computer in order to speed up the readout process and ensure that the four individual frames are taken simultaneously (to within 100-200 microseconds). In addition, new firmware was requested from the camera manufacturer so that an output signal is generated to indicate the state of the camera shutter. A specialised GPS unit (also developed at DTA) is then used to monitor the shutter signals from the four cameras and record the actual time of exposure to an accuracy of about 100 microseconds. This makes the system well suited for the observation of fast-moving objects in low Earth orbit (LEO). The QuadCam is currently mounted on a Paramount MEII robotic telescope mount at the newly built DTA space situational awareness observatory located on the Whangaparaoa Peninsula near Auckland, New Zealand. The system will be used for tracking satellites in low Earth orbit as well as in the geostationary belt. The performance of the camera has been evaluated and a series of test images has been collected in order to derive the polarimetric signatures for selected satellites.
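
    The linear Stokes parameters follow from the four polarised intensities in the standard way (textbook relations, stated here for reference rather than quoted from the abstract):

```latex
% Linear Stokes parameters from intensities measured at 0, 45, 90 and 135
% degrees, and the derived degree of linear polarisation:
I = I_{0} + I_{90}, \qquad Q = I_{0} - I_{90}, \qquad U = I_{45} - I_{135},
\qquad \mathrm{DoLP} = \frac{\sqrt{Q^{2} + U^{2}}}{I}.
```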

  19. Towards a harmonised approach to reducing quad-related fatal injuries in Australia and New Zealand: a cross-sectional comparative analysis.

    PubMed

    Lilley, Rebbecca; Lower, Tony; Davie, Gabrielle

    2017-10-01

    This study compares the patterns of quad-related fatal injuries between Australia and New Zealand (NZ). Fatal injuries from July 2007 to June 2012 involving a quad (quad bike or all-terrain vehicle) were identified from coronial files. Data described the socio-demographic, injury, vehicle and environment factors associated with incidents. Injury patterns were compared between countries. A total of 101 quad-related fatalities were identified: 69 in Australia and 32 in NZ (7.3 and 8.0 annual fatalities per 100,000 vehicles). Of these, 95 closed cases were examined in detail and factors in common included fatalities occurring mainly in males, on farms, involving a rollover and resulting in crush injuries to the head and thorax. Helmet use and alcohol/drug involvement were infrequent. Differences were observed with regard to age, season of fatal incident and the presence of a slope. Fatality patterns are broadly similar. The few differences could be attributed to differing agricultural commodity mix, demographics and topography. This study's findings support harmonised cross-country injury prevention efforts primarily focused on safe design and engineering principles to reduce this injury burden. © 2017 The Authors.

  20. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Trędak, Przemysław; Rudnicki, Witold R.; Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, ul. Pawińskiego 5a, 02-106 Warsaw

    The second-generation Reactive Bond Order (REBO) empirical potential is commonly used to accurately model a wide range of hydrocarbon materials. It is also extensible to other atom types and interactions. The REBO potential assumes a complex multi-body interaction model that is difficult to represent efficiently in the SIMD or SIMT programming model. Hence, despite its importance, no efficient GPGPU implementation had been developed for this potential. Here we present a detailed description of a highly efficient GPGPU implementation of a molecular dynamics algorithm using the REBO potential. The presented algorithm takes advantage of rarely used properties of the SIMT architecture of a modern GPU to solve the difficult synchronization issues that arise in computations of the multi-body potential. Techniques developed for this problem may also be used to achieve efficient solutions of different problems. The performance of the proposed algorithm is assessed using a range of model systems and compared to the highly optimized CPU implementation (both single-core and OpenMP) available in the LAMMPS package. These experiments show up to a 6x improvement in force computation time using a single processor of an NVIDIA Tesla K80 compared to a high-end 16-core Intel Xeon processor.

  1. Efficient implementation of the many-body Reactive Bond Order (REBO) potential on GPU

    NASA Astrophysics Data System (ADS)

    Trędak, Przemysław; Rudnicki, Witold R.; Majewski, Jacek A.

    2016-09-01

    The second-generation Reactive Bond Order (REBO) empirical potential is commonly used to accurately model a wide range of hydrocarbon materials. It is also extensible to other atom types and interactions. The REBO potential assumes a complex multi-body interaction model that is difficult to represent efficiently in the SIMD or SIMT programming model. Hence, despite its importance, no efficient GPGPU implementation had been developed for this potential. Here we present a detailed description of a highly efficient GPGPU implementation of a molecular dynamics algorithm using the REBO potential. The presented algorithm takes advantage of rarely used properties of the SIMT architecture of a modern GPU to solve the difficult synchronization issues that arise in computations of the multi-body potential. Techniques developed for this problem may also be used to achieve efficient solutions of different problems. The performance of the proposed algorithm is assessed using a range of model systems and compared to the highly optimized CPU implementation (both single-core and OpenMP) available in the LAMMPS package. These experiments show up to a 6x improvement in force computation time using a single processor of an NVIDIA Tesla K80 compared to a high-end 16-core Intel Xeon processor.
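
    The synchronization issue at the heart of such a port can be illustrated without any REBO detail: when bonds are processed in parallel, bonds sharing an atom race on that atom's force accumulator. A toy Python sketch follows; the arrays and the placeholder unit force are hypothetical and deliberately not the REBO functional form.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    positions = rng.random((6, 3))
    bonds = [(0, 1), (2, 1), (3, 4)]        # bonds (0,1) and (2,1) share atom 1
    forces = np.zeros_like(positions)

    for i, j in bonds:                      # on a GPU, each bond maps to a thread
        r = positions[j] - positions[i]
        f = r / np.linalg.norm(r)           # placeholder unit force, not REBO
        # Two threads handling bonds that share an atom would race on that
        # atom's accumulator; SIMT code needs atomics or a conflict-free
        # ordering, which is exactly what the paper's implementation addresses.
        np.add.at(forces, i, f)
        np.add.at(forces, j, -f)
    ```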

  2. Hypofractionated Palliative Radiotherapy with Concurrent Radiosensitizing Chemotherapy for Advanced Head and Neck Cancer Using the "QUAD-SHOT Regimen".

    PubMed

    Gamez, Mauricio E; Agarwal, Manuj; Hu, Kenneth S; Lukens, John N; Harrison, Louis B

    2017-02-01

    To analyze the outcomes of the hypofractionated palliative radiotherapy regimen "QUAD-Shot" with concurrent radiosensitizing chemotherapy for advanced head and neck cancer, we analyzed twenty-one patients with newly diagnosed or recurrent head and neck cancer treated with palliative hypofractionated concurrent chemoradiation using the QUAD-Shot regimen. All patients received at least one cycle of RT, with sixteen patients (76%) completing all three cycles. 85.7% of patients had an objective response to therapy, with five patients (23.8%) demonstrating complete response (CR) and thirteen patients (61.9%) demonstrating partial response (PR). Palliation of symptoms was achieved in all (100%) of the sixteen patients who completed the three cycles. Median overall survival and median progression-free survival were 7 and 4 months, respectively. QUAD-Shot palliative radiation therapy coupled with radiosensitizing chemotherapy is efficacious and well tolerated in patients with newly diagnosed or recurrent head and neck cancer not amenable to curative therapy. Copyright© 2017, International Institute of Anticancer Research (Dr. George J. Delinasios), All rights reserved.

  3. Using all of your CPU's in HIPE

    NASA Astrophysics Data System (ADS)

    Jacobson, J. D.; Fadda, D.

    2012-09-01

    Modern computer architectures increasingly feature multi-core CPUs. For example, the MacBook Pro features the Intel quad-core i7 processor. Through the use of hyper-threading, where each core can execute two threads simultaneously, the quad-core i7 can support eight simultaneous processing threads. All this on your laptop! This CPU power can now be put into service by scientists to perform data reduction tasks, but only if the software has been designed to take advantage of multiple-processor architectures. Up to now, the Herschel data reduction software (HIPE), written in Jython and Java, has been single-threaded and can only utilize a single processor. Users of HIPE do not get any advantage from the additional processors. Why not put all of the CPU resources to work reducing your data? In this poster, we present a multi-threaded software framework to achieve performance improvements from parallel execution, and show how a task that corrects long-term transients in the signal from the PACS unchopped spectroscopy line scan mode has been threaded. This computation-intensive task uses either a one-parameter or a three-parameter exponential function to characterize the transient. The task uses a Java implementation of Minpack, translated from the C (Moshier) and IDL (Markwardt) versions by the authors, to optimize the correction parameters. We also explain how to determine whether a task can benefit from threading (Amdahl's Law) and whether it is safe to thread. The design and implementation using the Java concurrency package's completion service is described. Pitfalls, timing bugs, thread safety, resource control, testing and performance improvements are described and plotted.
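
    The Amdahl's Law check the authors mention is a one-line calculation. A small sketch follows (in Python rather than the poster's Java/Jython, purely for illustration); the 90% figure is an assumed example, not a measurement from the poster.

    ```python
    def amdahl_speedup(parallel_fraction, n_threads):
        """Upper bound on speedup when only part of a task can be threaded."""
        serial = 1.0 - parallel_fraction
        return 1.0 / (serial + parallel_fraction / n_threads)

    # A task that is 90% parallelizable tops out near 4.7x on the eight
    # hyper-threads of a quad-core i7, no matter how well it is threaded:
    print(amdahl_speedup(0.90, 8))   # ~4.71
    ```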

  4. OpenMP-accelerated SWAT simulation using Intel C and FORTRAN compilers: Development and benchmark

    NASA Astrophysics Data System (ADS)

    Ki, Seo Jin; Sugimura, Tak; Kim, Albert S.

    2015-02-01

    We developed a practical method to accelerate execution of the Soil and Water Assessment Tool (SWAT) using open (free) computational resources. The SWAT source code (rev 622) was recompiled using a non-commercial Intel FORTRAN compiler on the Ubuntu 12.04 LTS Linux platform and newly named iOMP-SWAT in this study. The GNU utilities make, gprof, and diff were used to develop the iOMP-SWAT package, profile memory usage, and verify that parallel and serial simulations produced identical results. Among the 302 SWAT subroutines, the slowest routines were identified using GNU gprof and later modified using the Open Multi-Processing (OpenMP) library on an 8-core shared memory system. In addition, a C wrapping function was used to rapidly set large arrays to zero by cross-compiling with the original SWAT FORTRAN package. A universal speedup ratio of 2.3 was achieved using input data sets with a large number of hydrological response units. As we specifically focus on acceleration of a single SWAT run, the use of iOMP-SWAT for parameter calibrations will significantly improve the performance of SWAT optimization.

  5. Comparison of experimental results of a Quad-CZT array detector, a NaI(Tl), a LaBr3(Ce), and a HPGe for safeguards applications

    NASA Astrophysics Data System (ADS)

    Kwak, S.-W.; Choi, J.; Park, S. S.; Ahn, S. H.; Park, J. S.; Chung, H.

    2017-11-01

    A compound semiconductor detector, CdTe (or CdZnTe), has been used in various areas including nuclear safeguards applications. To address its critical drawback, low detection efficiency, which leads to long measurement times, a Quad-CZT array-based gamma-ray spectrometer was developed in our previous study by combining four individual CZT detectors. We have re-designed this Quad-CZT array system to make it simpler and more compact for use as a hand-held gamma-ray detector. This paper aims to compare the improved Quad-CZT array system with the traditional gamma-ray spectrometers (NaI(Tl), LaBr3(Ce), HPGe), which are currently the most commonly used detectors for verification of nuclear materials. Nuclear materials in different physical forms in a nuclear facility in Korea were measured with the Quad-CZT array system and the existing gamma-ray detectors. For measurements of UO2 pellets and powders, and fresh fuel rods, the Quad-CZT array system turned out to be superior to the NaI(Tl) and LaBr3(Ce). For measurements of UF6 cylinders with a thick wall, the Quad-CZT array system and HPGe gave similar accuracy under the same measurement time. From the results of the field tests conducted, we conclude that the improved Quad-CZT array system could be used as an alternative to HPGes and scintillation detectors for the purpose of increasing the effectiveness and efficiency of safeguards applications. This is the first paper employing a multi-element CZT array detector for measurement of nuclear materials—particularly uranium in a UF6 cylinder—in a real nuclear facility. The present work also suggests that the multi-CZT array system described in this study would be one promising way to address a serious weakness of CZT-based radiation detection.

  6. Adaptive sliding mode control for finite-time stability of quad-rotor UAVs with parametric uncertainties.

    PubMed

    Mofid, Omid; Mobayen, Saleh

    2018-01-01

    Adaptive control methods are developed for the stability and tracking control of flight systems in the presence of parametric uncertainties. This paper offers a design technique for adaptive sliding mode control (ASMC) for finite-time stabilization of unmanned aerial vehicle (UAV) systems with parametric uncertainties. Applying the Lyapunov stability concept and the idea of finite-time convergence, the recommended control method guarantees that the states of the quad-rotor UAV converge to the origin at a finite-time convergence rate. Furthermore, an adaptive tuning scheme is proposed to estimate the unknown parameters of the quad-rotor UAV at any moment. Finally, simulation results are presented to demonstrate the effectiveness of the proposed technique compared to previous methods. Copyright © 2017 ISA. Published by Elsevier Ltd. All rights reserved.

  7. Limited Human Factors Assessment of the QuadGard Limb Protection System: U.S. Marine Corps Systems Command Limb Protection Program Overview (QuadGard Phases 4 and 5 Production Designs)

    DTIC Science & Technology

    2011-09-01

    in calculating the ergonomics associated with ballistic protection. MARCORSYSCOM established three design requirements: (1) system compatibility...knob. The Velcro disengaged, as designed, to allow the wearer unimpeded leg movement. The control knob is used to adjust the driver's seat height...QuadGard Phases IV and V Production Designs) by Richard S. Bruno, ARL-TR-5656, September 2011

  8. High-precision real-time 3D shape measurement based on a quad-camera system

    NASA Astrophysics Data System (ADS)

    Tao, Tianyang; Chen, Qian; Feng, Shijie; Hu, Yan; Zhang, Minliang; Zuo, Chao

    2018-01-01

    Phase-shifting profilometry (PSP) based 3D shape measurement is well established in various applications due to its high accuracy, simple implementation, and robustness to environmental illumination and surface texture. In PSP, higher depth resolution generally requires a higher fringe density in the projected patterns, which, in turn, leads to severe phase ambiguities that must be resolved with additional information from phase coding and/or geometric constraints. However, in order to guarantee the reliability of phase unwrapping, available techniques are usually accompanied by an increased number of patterns, a reduced fringe amplitude, and complicated post-processing algorithms. In this work, we demonstrate that by using a quad-camera multi-view fringe projection system and carefully arranging the relative spatial positions between the cameras and the projector, it becomes possible to completely eliminate the phase ambiguities of conventional high-fringe-density three-step PSP patterns without projecting any additional patterns or embedding any auxiliary signals. Benefiting from the position-optimized quad-camera system, stereo phase unwrapping can be performed efficiently and reliably through flexible phase consistency checks. Besides, the redundant information from multiple phase consistency checks is fully used through a weighted phase difference scheme to further enhance the reliability of phase unwrapping. This paper explains the 3D measurement principle and the basic design of the quad-camera system, and finally demonstrates that, in a large measurement volume of 200 mm × 200 mm × 400 mm, the resultant dynamic 3D sensing system can realize real-time 3D reconstruction at 60 frames per second with a depth precision of 50 μm.
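
    For reference, the wrapped phase of the conventional three-step PSP patterns the system builds on is recovered per pixel by the standard arctangent combination. A short NumPy sketch follows, assuming three co-registered fringe images shifted by 2π/3; this is the textbook three-step formula, not the paper's full pipeline.

    ```python
    import numpy as np

    def wrapped_phase_3step(i1, i2, i3):
        """Wrapped phase from three fringe images phase-shifted by 2*pi/3.

        The result is ambiguous modulo 2*pi; in the paper, the quad-camera
        geometry resolves this ambiguity without extra projected patterns.
        """
        return np.arctan2(np.sqrt(3.0) * (i1 - i3), 2.0 * i2 - i1 - i3)
    ```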

  9. DD-αAMG on QPACE 3

    NASA Astrophysics Data System (ADS)

    Georg, Peter; Richtmann, Daniel; Wettig, Tilo

    2018-03-01

    We describe our experience porting the Regensburg implementation of the DD-αAMG solver from QPACE 2 to QPACE 3. We first review how the code was ported from the first generation Intel Xeon Phi processor (Knights Corner) to its successor (Knights Landing). We then describe the modifications in the communication library necessitated by the switch from InfiniBand to Omni-Path. Finally, we present the performance of the code on a single processor as well as the scaling on many nodes, where in both cases the speedup factor is close to the theoretical expectations.

  10. Baryonic and mesonic 3-point functions with open spin indices

    NASA Astrophysics Data System (ADS)

    Bali, Gunnar S.; Collins, Sara; Gläßle, Benjamin; Heybrock, Simon; Korcyl, Piotr; Löffler, Marius; Rödl, Rudolf; Schäfer, Andreas

    2018-03-01

    We have implemented a new way of computing three-point correlation functions. It is based on a factorization of the entire correlation function into two parts which are evaluated with open spin (and, to some extent, flavor) indices. This allows us to estimate the two contributions simultaneously for many different initial and final states and momenta, with little computational overhead. We explain this factorization as well as its efficient implementation in a new library which has been written to provide the necessary functionality on modern parallel architectures and CPUs, including Intel's Xeon Phi series.

  11. Least square based sliding mode control for a quad-rotor helicopter and energy saving by chattering reduction

    NASA Astrophysics Data System (ADS)

    Sumantri, Bambang; Uchiyama, Naoki; Sano, Shigenori

    2016-01-01

    In this paper, a new control structure for a quad-rotor helicopter that employs the least squares method is introduced. The proposed algorithm solves the overdetermined problem of the control input for the translational motion of a quad-rotor helicopter, allowing all six degrees of freedom to be considered when calculating the control input. A sliding mode controller is applied to achieve robust tracking and stabilization. A saturation function is designed around a boundary layer to reduce the chattering phenomenon that is a common problem in sliding mode control, and an integral sliding surface is designed to improve tracking performance. The energy saving effect of the chattering reduction is also evaluated. First, the dynamics of the quad-rotor helicopter are derived via the Newton-Euler formulation for a rigid body. Second, a constant-plus-proportional reaching law is introduced to increase the reaching rate of the sliding mode controller. Global stability of the proposed control strategy is guaranteed based on Lyapunov stability theory. Finally, the robustness and effectiveness of the proposed control system are demonstrated experimentally under wind gusts and compared with a regular sliding mode controller, a proportional-derivative controller, and a proportional-integral-derivative controller.
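
    The chattering-reduction idea is the replacement of the discontinuous sign function by a saturation function inside a boundary layer. A minimal single-axis Python sketch follows; the gains, the surface definition, and the function names are illustrative assumptions, not the paper's tuned design.

    ```python
    import numpy as np

    def sat(s, phi):
        """Boundary-layer saturation: replaces sign(s) to suppress chattering."""
        return np.clip(s / phi, -1.0, 1.0)

    def switching_input(e, e_dot, e_int, lam=2.0, k_i=0.5, eta=1.5, phi=0.1):
        # Integral sliding surface s and a smoothed switching term; the gains
        # here are placeholders, not values from the paper's experiments.
        s = e_dot + lam * e + k_i * e_int
        return -eta * sat(s, phi)
    ```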

  12. High performance in silico virtual drug screening on many-core processors

    PubMed Central

    Price, James; Sessions, Richard B; Ibarra, Amaurys A

    2015-01-01

    Drug screening is an important part of the drug development pipeline for the pharmaceutical industry. Traditional, lab-based methods are increasingly being augmented with computational methods, ranging from simple molecular similarity searches through more complex pharmacophore matching to more computationally intensive approaches, such as molecular docking. The latter simulates the binding of drug molecules to their targets, typically protein molecules. In this work, we describe BUDE, the Bristol University Docking Engine, which has been ported to the OpenCL industry standard parallel programming language in order to exploit the performance of modern many-core processors. Our highly optimized OpenCL implementation of BUDE sustains 1.43 TFLOP/s on a single Nvidia GTX 680 GPU, or 46% of peak performance. BUDE also exploits OpenCL to deliver effective performance portability across a broad spectrum of different computer architectures from different vendors, including GPUs from Nvidia and AMD, Intel’s Xeon Phi and multi-core CPUs with SIMD instruction sets. PMID:25972727

  13. 76 FR 19746 - Expansion of Foreign-Trade Zone 133; Quad-Cities, IL/IA

    Federal Register 2010, 2011, 2012, 2013, 2014

    2011-04-08

    ... DEPARTMENT OF COMMERCE Foreign-Trade Zones Board [Order No. 1749] Expansion of Foreign-Trade Zone 133; Quad-Cities, IL/IA Pursuant to its authority under the Foreign-Trade Zones Act of June 18, 1934, as amended (19 U.S.C. 81a-81u), the Foreign-Trade Zones Board (the Board) adopts the following Order...

  14. Recent UAS Developments: VTOL HQ-series Shipboard Recovery and Autonomous Monitoring with MicroQuads

    NASA Astrophysics Data System (ADS)

    Wardell, L. J.; Farber, A. M.; Douglas, J.

    2017-12-01

    Ocean research would benefit from reliable shipboard launch and recovery of small-class UAS. A vertical take-off and landing (VTOL) system reduces the equipment footprint by removing the need for launchers or recovery systems. The HQ-60 (Latitude Engineering) has demonstrated reliable ship take-off and recovery on a 10 ft x 10 ft area on the R/V Falkor (Schmidt Ocean Institute) and other research vessels. The HQ-60 recently set a record for the longest time aloft for a VTOL aircraft, flying nearly 22.5 hours non-stop. To support close-range research, autonomous MicroQuads that "perch" in a protective box, which also recharges the aircraft and transmits the data, are in development. Recent MicroQuad work on developing high-resolution (<1 cm) DEMs using on-board cameras has yielded promising results for surface change detection. Recent USDA development targeted erosion monitoring with this system. The latest updates and testing results for both systems will be presented.

  15. Matrix multiplication on the Intel Touchstone Delta

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Huss-Lederman, S.; Jacobson, E.M.; Tsao, A.

    1993-12-31

    Matrix multiplication is a key primitive in block matrix algorithms such as those found in LAPACK. We present results from our study of matrix multiplication algorithms on the Intel Touchstone Delta, a distributed memory message-passing architecture with a two-dimensional mesh topology. We obtain an implementation that uses communication primitives highly suited to the Delta and exploits the single-node assembly-coded matrix multiplication. Our algorithm is completely general, able to deal with arbitrary mesh aspect ratios and matrix dimensions, and has achieved parallel efficiency of 86% with overall peak performance in excess of 8 Gflops on 256 nodes for an 8800 × 8800 matrix. We describe our algorithm design and implementation, and present performance results that demonstrate scalability and robust behavior over varying mesh topologies.
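
    The Delta implementation itself handles arbitrary mesh aspect ratios and is not reproduced here. As a simpler illustration of block matrix multiplication on a square 2D mesh, here is a serial Python emulation of Cannon's algorithm, a classical relative of such schemes; it assumes the matrix dimension is divisible by the mesh size.

    ```python
    import numpy as np

    def cannon_matmul(a, b, p):
        """Serial emulation of Cannon's algorithm on a p x p process mesh."""
        n = a.shape[0]
        blk = n // p                      # assumes n is divisible by p
        A = [[a[i*blk:(i+1)*blk, j*blk:(j+1)*blk] for j in range(p)] for i in range(p)]
        B = [[b[i*blk:(i+1)*blk, j*blk:(j+1)*blk] for j in range(p)] for i in range(p)]
        C = [[np.zeros((blk, blk)) for _ in range(p)] for _ in range(p)]
        A = [row[i:] + row[:i] for i, row in enumerate(A)]             # skew row i left by i
        B = [[B[(i + j) % p][j] for j in range(p)] for i in range(p)]  # skew column j up by j
        for _ in range(p):
            for i in range(p):
                for j in range(p):
                    C[i][j] += A[i][j] @ B[i][j]                       # local block product
            A = [row[1:] + row[:1] for row in A]                       # circular shift left
            B = [[B[(i + 1) % p][j] for j in range(p)] for i in range(p)]  # circular shift up
        return np.block(C)

    a, b = np.random.rand(8, 8), np.random.rand(8, 8)
    assert np.allclose(cannon_matmul(a, b, 4), a @ b)
    ```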

  16. Mechanical evaluation of quad-helix appliance made of low-nickel stainless steel wire.

    PubMed

    dos Santos, Rogério Lacerda; Pithon, Matheus Melo

    2013-01-01

    The objective of this study was to test the hypothesis that there is no difference in mechanical behavior between stainless steel and low-nickel stainless steel wires. The force, resilience, and elastic modulus produced by quad-helix appliances made of 0.032-inch and 0.036-inch wires were evaluated. Sixty quad-helix appliances were made, thirty for each type of alloy and fifteen for each wire thickness (0.032-in and 0.036-in). All the archwires were subjected to a mechanical compression test using an EMIC DL-10000 machine simulating activations of 4, 6, 9, and 12 mm. Analysis of variance (ANOVA) with multiple comparisons and Tukey's test were used (p < 0.05) to assess force, resilience, and elastic modulus. Statistically significant differences in the forces generated, resilience, and elastic modulus were found between the 0.032-in and 0.036-in thicknesses (p < 0.05). Appliances made of low-nickel stainless steel alloy had force, resilience, and elastic modulus similar to those made of stainless steel alloy.

  17. Multi-Core Programming Design Patterns: Stream Processing Algorithms for Dynamic Scene Perceptions

    DTIC Science & Technology

    2014-05-01

    processor developed by IBM and other companies, incorporates the POWER5 processor as the Power Processor Element (PPE), one of the early general...deliver a power-efficient single-precision peak performance of more than 256 GFlops. Substantially more raw power became available later, when NVIDIA ...algorithms, including IBM's Cell/B.E., GPUs from NVIDIA and AMD, and many-core CPUs from Intel. The vast growth of digital video content has been a

  18. Experimental investigation of a quad-rotor biplane micro air vehicle

    NASA Astrophysics Data System (ADS)

    Bogdanowicz, Christopher Michael

    Micro air vehicles are expected to perform demanding missions requiring efficient operation in both hover and forward flight. This thesis discusses the development of a hybrid air vehicle which seamlessly combines both flight capabilities: hover and high-speed forward flight. It is the quad-rotor biplane, which weighs 240 grams and consists of four propellers with wings arranged in a biplane configuration. The performance of the vehicle system was investigated in conditions representative of flight through a series of wind tunnel experiments. These studies provided an understanding of propeller-wing interaction effects and a system trim analysis, which showed that a maximum speed of 11 m/s and a cruise speed of 4 m/s were achievable, and that the cruise power is approximately one-third of the hover power. Free flight testing of the vehicle successfully highlighted its ability to achieve equilibrium transition flight. Key design parameters were experimentally investigated to understand their effect on overall performance. It was found that a trade-off between efficiency and compactness affects the final choice of the design. Design improvements have allowed for decreases in vehicle weight and ground footprint, while increasing structural soundness. Numerous vehicle designs, models, and flight tests have proven system scalability as well as versatility, including an upscaled model to be utilized in an extensive commercial package delivery system. Overall, the quad-rotor biplane is proven to be an efficient and effective multi-role vehicle.

  19. 78 FR 57921 - Patch International, Inc., QuadTech International, Inc., Strategic Resources, Ltd., and Virtual...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2013-09-20

    ... SECURITIES AND EXCHANGE COMMISSION [File No. 500-1] Patch International, Inc., QuadTech International, Inc., Strategic Resources, Ltd., and Virtual Medical Centre, Inc.; Order of Suspension of Trading... lack of current and accurate information concerning the securities of Virtual Medical Centre, Inc...

  20. Efficient Calculation of Exact Exchange Within the Quantum Espresso Software Package

    NASA Astrophysics Data System (ADS)

    Barnes, Taylor; Kurth, Thorsten; Carrier, Pierre; Wichmann, Nathan; Prendergast, David; Kent, Paul; Deslippe, Jack

    Accurate simulation of condensed matter at the nanoscale requires careful treatment of the exchange interaction between electrons. In the context of plane-wave DFT, these interactions are typically represented through the use of approximate functionals. Greater accuracy can often be obtained through the use of functionals that incorporate some fraction of exact exchange; however, evaluation of the exact exchange potential is often prohibitively expensive. We present an improved algorithm for the parallel computation of exact exchange in Quantum Espresso, an open-source software package for plane-wave DFT simulation. Through the use of aggressive load balancing and on-the-fly transformation of internal data structures, our code exhibits speedups of approximately an order of magnitude for practical calculations. Additional optimizations are presented targeting the many-core Intel Xeon Phi "Knights Landing" architecture, which largely powers NERSC's new Cori system. We demonstrate the successful application of the code to difficult problems, including simulation of water at a platinum interface and computation of the X-ray absorption spectra of transition metal oxides.

  1. Examining Troughs in the Mass Distribution of All Theoretically Possible Tryptic Peptides

    PubMed Central

    Nefedov, Alexey V.; Mitra, Indranil; Brasier, Allan R.; Sadygov, Rovshan G.

    2011-01-01

    This work describes the mass distribution of all theoretically possible tryptic peptides made of the 20 amino acids, up to a mass of 3 kDa, with a resolution of 0.001 Da. We characterize regions between the peaks of the distribution, including gaps (forbidden zones) and low-populated areas (quiet zones). We show how the gaps shrink over the mass range, and when they completely disappear. We demonstrate that peptide compositions in quiet zones are less diverse than those in the peaks of the distribution, and that by eliminating certain types of unrealistic compositions the gaps in the distribution may be increased. The mass distribution is generated using a parallel implementation of a recursive procedure that enumerates all amino acid compositions. It allows us to enumerate all compositions of tryptic peptides below 3 kDa in 48 minutes using a computer cluster with 12 Intel Xeon X5650 CPUs (72 cores). The results of this work can be used to facilitate protein identification and mass defect labeling in mass spectrometry-based proteomics experiments. PMID:21780838
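
    The recursive enumeration idea can be sketched at toy scale. The following Python mock uses a four-residue alphabet and a small mass cutoff as assumptions; the real computation covers all 20 amino acids up to 3 kDa and runs in parallel.

    ```python
    # Toy residue masses (monoisotopic, Da); the real enumeration uses all 20.
    RESIDUES = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276}

    def compositions(mass_cutoff, residues=RESIDUES):
        """Recursively yield residue count vectors with total mass <= cutoff."""
        keys = list(residues)

        def rec(i, mass_left, counts):
            if i == len(keys):
                yield dict(zip(keys, counts))
                return
            step = residues[keys[i]]
            n = 0
            while n * step <= mass_left:
                yield from rec(i + 1, mass_left - n * step, counts + [n])
                n += 1

        yield from rec(0, mass_cutoff, [])

    print(sum(1 for _ in compositions(300.0)))   # compositions under 300 Da
    ```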

  2. Speeding up spin-component-scaled third-order perturbation theory with the chain of spheres approximation: the COSX-SCS-MP3 method

    NASA Astrophysics Data System (ADS)

    Izsák, Róbert; Neese, Frank

    2013-07-01

    The 'chain of spheres' approximation, developed earlier for the efficient evaluation of the self-consistent field exchange term, is introduced here into the evaluation of the external exchange term of higher order correlation methods. Its performance is studied in the specific case of the spin-component-scaled third-order Møller-Plesset perturbation (SCS-MP3) theory. The results indicate that the approximation performs excellently in terms of both computer time and achievable accuracy. Significant speedups over a conventional method are obtained for larger systems and basis sets. Owing to this development, SCS-MP3 calculations on molecules of the size of penicillin (42 atoms) with a polarised triple-zeta basis set can be performed in ∼3 hours using 16 cores of an Intel Xeon E7-8837 processor with a 2.67 GHz clock speed, which represents a speedup by a factor of 8-9 compared to the previously most efficient algorithm. Thus, the increased accuracy offered by SCS-MP3 can now be explored for at least medium-sized molecules.

  3. 76 FR 14099 - Quad Tech, Inc., Sussex, WI; Notice of Affirmative Determination Regarding Application for...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2011-03-15

    ... dated February 7, 2011, a worker requested administrative reconsideration of the negative determination regarding workers' eligibility to apply for Trade Adjustment Assistance (TAA) applicable to workers and former workers of Quad Tech, Inc., Sussex, Wisconsin (TA-W-73,441A) (subject firm). The determination was...

  4. Why K-12 IT Managers and Administrators Are Embracing the Intel-Based Mac

    ERIC Educational Resources Information Center

    Technology & Learning, 2007

    2007-01-01

    Over the past year, Apple has dramatically increased its share of the school computer marketplace--especially in the category of notebook computers. A recent study conducted by Grunwald Associates and Rockman et al. reports that one of the major reasons for this growth is Apple's introduction of the Intel processor to the entire line of Mac…

  5. 75 FR 48338 - Intel Corporation; Analysis of Proposed Consent Order to Aid Public Comment

    Federal Register 2010, 2011, 2012, 2013, 2014

    2010-08-10

    ... integrated into chipsets as well as discrete graphics cards. NVIDIA has been at the forefront of developing... to connect peripheral products such as discrete GPUs to the CPU. A bus is a connection point between... platform. Intel's commitment to maintain an open PCIe bus will provide discrete graphics manufacturers...

  6. Decryption-decompression of AES protected ZIP files on GPUs

    NASA Astrophysics Data System (ADS)

    Duong, Tan Nhat; Pham, Phong Hong; Nguyen, Duc Huu; Nguyen, Thuy Thanh; Le, Hung Duc

    2011-10-01

    AES is a strong encryption system, so decryption-decompression of AES-encrypted ZIP files requires very large computing power together with techniques for reducing the password space. This makes implementing such techniques on common computing systems impractical. In [1], we reduced the original, very large password search space to a much smaller one guaranteed to contain the correct password. Based on this reduced set of passwords, in this paper we parallelize decryption, decompression and plain-text recognition for encrypted ZIP files using CUDA computing technology on NVIDIA GeForce GTX 295 graphics cards to find the correct password. The experimental results show that the speed of decrypting, decompressing, recognizing plain text and finding the original password increases by a factor of about 45 to 180 (depending on the number of GPUs) compared to sequential execution on an Intel Core 2 Quad Q8400 at 2.66 GHz. These results demonstrate the potential applicability of GPUs in this cryptanalysis field.

  7. Feasibility study on proposed Amtrak service from Chicago, to Iowa City, Iowa via Quad Cities : executive summary.

    DOT National Transportation Integrated Search

    2008-04-18

    Soon after the Illinois Department of Transportation (Ill. DOT) requested Amtrak to conduct a feasibility study on proposed Amtrak service between Chicago and the Illinois Quad Cities, the Iowa Department of Transportation (Iowa DOT) ...

  8. Personal supercomputing by using transputer and Intel 80860 in plasma engineering

    NASA Astrophysics Data System (ADS)

    Ido, S.; Aoki, K.; Ishine, M.; Kubota, M.

    1992-09-01

    A transputer (T800) or the 64-bit RISC Intel 80860 (i860) added to a personal computer can be used as an accelerator. When 32-bit T800s in a parallel system or 64-bit i860s are used, scientific calculations are carried out several tens of times faster than on commonly used 32-bit personal computers or UNIX workstations. Benchmark tests and examples of physical simulations using T800s and the i860 are reported.

  9. Newsgroups, Activist Publics, and Corporate Apologia: The Case of Intel and Its Pentium Chip.

    ERIC Educational Resources Information Center

    Hearit, Keith Michael

    1999-01-01

    Applies J. Grunig's theory of publics to the phenomenon of Internet newsgroups using the case of the flawed Intel Pentium chip. Argues that technology facilitates the rapid movement of publics from the theoretical construct stage to the active stage. Illustrates some of the difficulties companies face in establishing their identity in cyberspace.…

  10. A High Performance Block Eigensolver for Nuclear Configuration Interaction Calculations

    DOE PAGES

    Aktulga, Hasan Metin; Afibuzzaman, Md.; Williams, Samuel; ...

    2017-06-01

    As on-node parallelism increases and the performance gap between the processor and the memory system widens, achieving high performance in large-scale scientific applications requires an architecture-aware design of algorithms and solvers. We focus on the eigenvalue problem arising in nuclear Configuration Interaction (CI) calculations, where a few extreme eigenpairs of a sparse symmetric matrix are needed. Here, we consider a block iterative eigensolver whose main computational kernels are the multiplication of a sparse matrix with multiple vectors (SpMM), and tall-skinny matrix operations. We then present techniques to significantly improve the SpMM and the transpose operation SpMM^T by using the compressed sparse blocks (CSB) format. We achieve 3-4× speedup on the requisite operations over good implementations with the commonly used compressed sparse row (CSR) format. We develop a performance model that allows us to correctly estimate the performance of our SpMM kernel implementations, and we identify cache bandwidth as a potential performance bottleneck beyond DRAM. We also analyze and optimize the performance of LOBPCG kernels (inner product and linear combinations on multiple vectors) and show up to 15× speedup over using high performance BLAS libraries for these operations. The resulting high performance LOBPCG solver achieves 1.4× to 1.8× speedup over the existing Lanczos solver on a series of CI computations on high-end multicore architectures (Intel Xeons). We also analyze the performance of our techniques on an Intel Xeon Phi Knights Corner (KNC) processor.
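
    The SpMM operation itself, independent of the CSB storage tricks, is just a sparse matrix applied to a block of vectors. A minimal SciPy sketch of the kernel being optimized follows; the sizes and density are illustrative assumptions.

    ```python
    import numpy as np
    import scipy.sparse as sp

    # SpMM: one sparse symmetric matrix times a block of vectors. Sweeping the
    # matrix once for all vectors is what makes the block solver profitable;
    # the CSB format pushes the same idea further for SpMM and its transpose.
    n, nvec = 10_000, 8                       # illustrative sizes
    A = sp.random(n, n, density=1e-3, format="csr")
    A = A + A.T                               # symmetric, as in the CI problem
    V = np.random.rand(n, nvec)
    W = A @ V                                 # all 8 vectors in one matrix sweep
    ```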

  11. A High Performance Block Eigensolver for Nuclear Configuration Interaction Calculations

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Aktulga, Hasan Metin; Afibuzzaman, Md.; Williams, Samuel

    As on-node parallelism increases and the performance gap between the processor and the memory system widens, achieving high performance in large-scale scientific applications requires an architecture-aware design of algorithms and solvers. We focus on the eigenvalue problem arising in nuclear Configuration Interaction (CI) calculations, where a few extreme eigenpairs of a sparse symmetric matrix are needed. Here, we consider a block iterative eigensolver whose main computational kernels are the multiplication of a sparse matrix with multiple vectors (SpMM), and tall-skinny matrix operations. We then present techniques to significantly improve the SpMM and the transpose operation SpMM^T by using the compressed sparse blocks (CSB) format. We achieve 3-4× speedup on the requisite operations over good implementations with the commonly used compressed sparse row (CSR) format. We develop a performance model that allows us to correctly estimate the performance of our SpMM kernel implementations, and we identify cache bandwidth as a potential performance bottleneck beyond DRAM. We also analyze and optimize the performance of LOBPCG kernels (inner product and linear combinations on multiple vectors) and show up to 15× speedup over using high performance BLAS libraries for these operations. The resulting high performance LOBPCG solver achieves 1.4× to 1.8× speedup over the existing Lanczos solver on a series of CI computations on high-end multicore architectures (Intel Xeons). We also analyze the performance of our techniques on an Intel Xeon Phi Knights Corner (KNC) processor.

  12. "Sleep out on the Quad": An Opportunity for Experiential Education and Servant Based Leadership

    ERIC Educational Resources Information Center

    Johnson, Kristen A.; Grazulis, Jessica; White, Joshua K.

    2014-01-01

    One of the purposes of higher education is to promote citizenship and social justice. This article explores the way Aurora University is helping students to discover what matters through the impact of "Sleep Out on the Quad," which is an event that uses multiple pedagogies to impact student learning of homelessness. It serves as a call…

  13. Parallel Climate Data Assimilation PSAS Package Achieves 18 GFLOPs on 512-Node Intel Paragon

    NASA Technical Reports Server (NTRS)

    Ding, H. Q.; Chan, C.; Gennery, D. B.; Ferraro, R. D.

    1995-01-01

    Several algorithms were added to the Physical-space Statistical Analysis System (PSAS) from Goddard, which assimilates observational weather data by correcting for different levels of uncertainty about the data and different locations for mobile observation platforms. The new algorithms and use of the 512-node Intel Paragon allowed a hundred-fold decrease in processing time.

  14. Aging in the three-dimensional random-field Ising model

    NASA Astrophysics Data System (ADS)

    von Ohr, Sebastian; Manssen, Markus; Hartmann, Alexander K.

    2017-07-01

    We studied the nonequilibrium aging behavior of the random-field Ising model in three dimensions for various values of the disorder strength. This allowed us to investigate how the aging behavior changes across the ferromagnetic-paramagnetic phase transition. We investigated a large system size of N = 256^3 spins and up to 10^8 Monte Carlo sweeps. To reach these necessary long simulation times, we employed an implementation running on Intel Xeon Phi coprocessors, reaching single-spin-flip times as short as 6 ps. We measured typical correlation functions in space and time to extract a growing length scale and corresponding exponents.
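
    A single-spin-flip Metropolis sweep for the 3D RFIM is simple to state. The tiny Python sketch below shows the update rule with coupling J = 1 and illustrative parameter values; it is of course many orders of magnitude slower than the paper's 6 ps/flip coprocessor code.

    ```python
    import numpy as np

    L, T, h_sigma = 8, 3.0, 1.0               # tiny lattice, illustrative values
    rng = np.random.default_rng(0)
    spins = rng.choice([-1, 1], size=(L, L, L))
    fields = rng.normal(0.0, h_sigma, size=(L, L, L))   # quenched random fields
    offsets = np.vstack((np.eye(3, dtype=int), -np.eye(3, dtype=int)))  # 6 neighbors

    def sweep(spins, fields, T):
        """One Metropolis sweep of the 3D RFIM with J = 1, periodic bounds."""
        for idx in np.ndindex(spins.shape):
            nn = sum(spins[tuple((np.array(idx) + d) % L)] for d in offsets)
            dE = 2.0 * spins[idx] * (nn + fields[idx])   # cost of flipping spin idx
            if dE <= 0 or rng.random() < np.exp(-dE / T):
                spins[idx] *= -1
    ```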

  15. Employing OpenCL to Accelerate Ab Initio Calculations on Graphics Processing Units.

    PubMed

    Kussmann, Jörg; Ochsenfeld, Christian

    2017-06-13

    We present an extension of our graphics processing units (GPU)-accelerated quantum chemistry package to employ OpenCL compute kernels, which can be executed on a wide range of computing devices like CPUs, Intel Xeon Phi, and AMD GPUs. Here, we focus on the use of AMD GPUs and discuss differences as compared to CUDA-based calculations on NVIDIA GPUs. First illustrative timings are presented for hybrid density functional theory calculations using serial as well as parallel compute environments. The results show that AMD GPUs are as fast or faster than comparable NVIDIA GPUs and provide a viable alternative for quantum chemical applications.

  16. DBPQL: A view-oriented query language for the Intel Data Base Processor

    NASA Technical Reports Server (NTRS)

    Fishwick, P. A.

    1983-01-01

    An interactive query language (DBPQL) for the Intel Data Base Processor (DBP) is defined. DBPQL includes a parser generator package which permits the analyst to easily create and manipulate the query statement syntax and semantics. The prototype language, DBPQL, includes trace and performance commands to aid the analyst when implementing new commands and analyzing the execution characteristics of the DBP. The DBPQL grammar file and associated key procedures are included as an appendix to this report.

  17. Game-Based Experiential Learning in Online Management Information Systems Classes Using Intel's IT Manager 3

    ERIC Educational Resources Information Center

    Bliemel, Michael; Ali-Hassan, Hossam

    2014-01-01

    For several years, we used Intel's flash-based game "IT Manager 3: Unseen Forces" as an experiential learning tool, where students had to act as a manager making real-time prioritization decisions about repairing computer problems, training and upgrading systems with better technologies as well as managing increasing numbers of technical…

  18. Parity violation constraints using cosmic microwave background polarization spectra from 2006 and 2007 observations by the QUaD polarimeter.

    PubMed

    Wu, E Y S; Ade, P; Bock, J; Bowden, M; Brown, M L; Cahill, G; Castro, P G; Church, S; Culverhouse, T; Friedman, R B; Ganga, K; Gear, W K; Gupta, S; Hinderks, J; Kovac, J; Lange, A E; Leitch, E; Melhuish, S J; Memari, Y; Murphy, J A; Orlando, A; Piccirillo, L; Pryke, C; Rajguru, N; Rusholme, B; Schwarz, R; O'Sullivan, C; Taylor, A N; Thompson, K L; Turner, A H; Zemcov, M

    2009-04-24

    We constrain parity-violating interactions to the surface of last scattering using spectra from the QUaD experiment's second and third seasons of observations by searching for a possible systematic rotation of the polarization directions of cosmic microwave background photons. We measure the rotation angle due to such a possible "cosmological birefringence" to be 0.55 degrees +/-0.82 degrees (random) +/-0.5 degrees (systematic) using QUaD's 100 and 150 GHz temperature-curl and gradient-curl spectra over the multipole range 200

  19. Positive Peer-Pressured Productivity (P-QUAD): Novel Use of Increased Transparency and a Weighted Lottery to Increase a Division's Academic Output.

    PubMed

    Pitt, Michael B; Furnival, Ronald A; Zhang, Lei; Weber-Main, Anne M; Raymond, Nancy C; Jacob, Abraham K

    2017-03-01

    To evaluate a dual-incentive model combining positive peer pressure, through increased transparency of peers' academic work, with a weighted lottery in which entries are earned based on degree of productivity, we developed a dual-incentive peer mentoring model, Positive Peer-Pressured Productivity (P-QUAD), for faculty in the Pediatric Hospital Medicine Division at the University of Minnesota Masonic Children's Hospital. This model provided relative value-based incentives, with points assigned to different scholarly activities (e.g., 1 point for abstract submission, 2 points for poster presentation, 3 points for oral presentation, etc.). These points translated into lottery tickets for a semi-annual drawing for monetary prizes. Productivity in the P-QUAD year was compared with the preintervention year for each faculty member. Fifteen (83%) of 18 eligible faculty members participated. Overall annual productivity per faculty member, as measured by total P-QUAD score, increased from a median of 3 (interquartile range [IQR] 0-14) in the preintervention year to 4 (IQR 0-27) in the P-QUAD year (P = .051). Submissions and acceptances increased in all categories except posters, which were unchanged. Annual abstract submissions per faculty member significantly increased from a median of 1 (IQR 0-2) to 2 (IQR 0-2; P = .047). Seventy-three percent (8 of 11) of post-survey respondents indicated that the financial incentive motivated them to submit academic work; 100% indicated that increased awareness of their peers' work was a motivator. The combination of increased awareness of peers' academic productivity and a weighted lottery financial incentive appears to be a useful model for stimulating academic productivity in early-career faculty. Copyright © 2016 Academic Pediatric Association. Published by Elsevier Inc. All rights reserved.
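
    The weighted-lottery mechanics reduce to drawing with point-proportional weights. A minimal Python sketch follows; the point totals are hypothetical, and random.choices does the weighting.

    ```python
    import random

    # Hypothetical point totals; tickets are proportional to points earned.
    faculty_points = {"A": 3, "B": 10, "C": 0, "D": 6}

    eligible = [name for name, pts in faculty_points.items() if pts > 0]
    weights = [faculty_points[name] for name in eligible]
    winner = random.choices(eligible, weights=weights, k=1)[0]
    print(winner)   # "B" wins most often, but every point-holder has a chance
    ```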

  20. Immunogenicity and safety of a combined measles, mumps, rubella and varicella live vaccine (ProQuad ®) administered concomitantly with a booster dose of a hexavalent vaccine in 12-23-month-old infants.

    PubMed

    Deichmann, Klaus A; Ferrera, Giuseppe; Tran, Clément; Thomas, Stéphane; Eymin, Cécile; Baudin, Martine

    2015-05-11

    Concomitant administration of vaccines can facilitate vaccination uptake, provided that no clinically significant effect on either vaccine is identified. We investigated the concomitant administration, during the second year of life, of one dose of the combined measles, mumps, rubella and varicella vaccine (ProQuad(®)) with a booster dose of a hexavalent vaccine. In this multicentre, open-label study, participants were randomized to 3 groups: Group 1, concomitant administration of one dose of ProQuad(®) and a booster of hexavalent vaccine; Group 2, one dose of ProQuad(®) alone; Group 3, a booster dose of hexavalent vaccine alone. Two serum samples were collected, within 7 days prior to vaccination and Days 42-56 post-vaccination for antibody testing. Antibody response rates to measles, mumps, rubella, varicella, hepatitis B and Haemophilus influenzae type b following concomitant administration of ProQuad(®) and hexavalent vaccine were non-inferior compared with those following the individual vaccines. Antibody response rates to these antigens were all >95% in all groups. Antibody titres for the pertussis antigens following concomitant administration were also non-inferior to those following the individual vaccines. Antibody titres for the other valences were numerically comparable between groups with the exception of hepatitis B, Haemophilus influenzae type b, tetanus and poliomyelitis, which were higher in the concomitant than in the non-concomitant groups. The safety profiles of each vaccination regimen were comparable, with the exception of solicited ProQuad(®)-related injection-site reactions (Days 0-4), which occurred more frequently in the concomitant than in the non-concomitant groups. These immunogenicity data support the concomitant administration of ProQuad(®) with a hexavalent vaccine. The safety profile of concomitant ProQuad(®) and hexavalent vaccination was also in line with that of the individual Summaries of Product Characteristics. Copyright

  1. Implementation of molecular dynamics and its extensions with the coarse-grained UNRES force field on massively parallel systems; towards millisecond-scale simulations of protein structure, dynamics, and thermodynamics

    PubMed Central

    Liwo, Adam; Ołdziej, Stanisław; Czaplewski, Cezary; Kleinerman, Dana S.; Blood, Philip; Scheraga, Harold A.

    2010-01-01

    We report the implementation of our united-residue UNRES force field for simulations of protein structure and dynamics with massively parallel architectures. In addition to coarse-grained parallelism already implemented in our previous work, in which each conformation was treated by a different task, we introduce a fine-grained level in which energy and gradient evaluation are split between several tasks. The Message Passing Interface (MPI) libraries have been utilized to construct the parallel code. The parallel performance of the code has been tested on a professional Beowulf cluster (Xeon Quad Core), a Cray XT3 supercomputer, and two IBM BlueGene/P supercomputers with canonical and replica-exchange molecular dynamics. With IBM BlueGene/P, about 50 % efficiency and 120-fold speed-up of the fine-grained part was achieved for a single trajectory of a 767-residue protein with use of 256 processors/trajectory. Because of averaging over the fast degrees of freedom, UNRES provides an effective 1000-fold speed-up compared to the experimental time scale and, therefore, enables us to effectively carry out millisecond-scale simulations of proteins with 500 and more amino-acid residues in days of wall-clock time. PMID:20305729

  2. Development of High Performance Composite Foam Insulation with Vacuum Insulation Cores

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Biswas, Kaushik; Desjarlais, Andre Omer; SmithPhD, Douglas

    Development of a high-performance thermal insulation (thermal resistance or R-value per inch of R-12 hr-ft2-°F/Btu-in or greater), with twice the thermal resistance of state-of-the-art commercial insulation materials (R6/inch for foam insulation), promises a transformational impact in the area of building insulation. In 2010, in the US, the building envelope-related primary energy consumption was 15.6 quads, of which 5.75 quads were due to opaque wall and roof sections; the total US consumption (building, industrial and transportation) was 98 quads. In other words, the wall and roof contribution was almost 6% of the entire US primary energy consumption. Building energy modeling analyses have shown that adding insulation to increase the R-value of the external walls of residential buildings by R10-20 (hr-ft2-°F/Btu) can yield savings of 38-50% in wall-generated heating and cooling loads. Adding R20 will require substantial thicknesses of current commercial insulation materials, often requiring significant (and sometimes cost-prohibitive) alterations to existing buildings. This article describes the development of a next-generation composite insulation with a target thermal resistance of R25 for a 2-inch-thick board (R12/inch or higher). The composite insulation will contain vacuum insulation cores, which are nominally R35-40/inch, encapsulated in polyisocyanurate foam. A recently developed variant of vacuum insulation, called modified atmosphere insulation (MAI), was used in this research. Some background information on the thermal performance and distinguishing features of MAI is provided. Technical details of the composite insulation development and manufacturing, as well as laboratory evaluation of prototype insulation boards, are presented.

  3. SMIF capability at Intel Mask Operation improves yield

    NASA Astrophysics Data System (ADS)

    Dam, Thuc H.; Pekny, Matt; Millino, Jim; Luu, Gibson; Melwani, Nitesh; Venkatramani, Aparna; Tavassoli, Malahat

    2003-08-01

    At Intel Mask Operations (IMO), Standard Mechanical Interface (SMIF) processing has been employed to reduce environmental particle contamination from manual handling-related activities. SMIF handling entails automated robotic transfers of photoblanks/reticles between SMIF pods, whereas conventional handling used manual pick transfers of masks between SMIF pods with intermediate storage in Toppan compacts. The SMIF-enabled units in IMO's process line included: (1) coater, (2) exposure, (3) developer, (4) dry etcher, and (5) inspection. Each unit is equipped with an automated I/O port, an environmentally enclosed processing chamber, and SMIF pods. Yield metrics were used to demonstrate the effectiveness and advantages of SMIF processing compared to manual processing. The areas of focus in this paper were blank resist coating, binary front-end reticle processing and 2nd-level PSM reticle processing. Results obtained from the investigation showed yield improvements in these areas.

  4. Locality Aware Concurrent Start for Stencil Applications

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Shrestha, Sunil; Gao, Guang R.; Manzano Franco, Joseph B.

    Stencil computations are at the heart of many physical simulations used in scientific codes, and there exists a plethora of optimization efforts for this family of computations. Among these techniques, tiling techniques that allow concurrent start have proven very efficient in providing better performance for these critical kernels. Nevertheless, with many-core designs being the norm, these optimization techniques might not be able to fully exploit locality (both spatial and temporal) on multiple levels of the memory hierarchy without compromising parallelism. It is no longer true that the machine can be seen as a homogeneous collection of nodes with caches, main memory and an interconnect network. New architectural designs exhibit complex groupings of nodes, cores, threads, caches and memory connected by an ever-evolving network-on-chip design. These new designs may benefit greatly from carefully crafted schedules and groupings that encourage parallel actors (i.e. threads, cores or nodes) to be aware of the computational history of other actors in close proximity. In this paper, we provide an efficient tiling technique that allows hierarchical concurrent start for memory-hierarchy-aware tile groups. Each execution schedule and tile shape exploits the available parallelism, load balance and locality present in the given applications. We demonstrate our technique on the Intel Xeon Phi architecture with selected and representative stencil kernels, showing improvements ranging from 5.58% to 31.17% over existing state-of-the-art techniques.
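
    The paper's hierarchical tile shapes and schedules are not reproduced here, but the baseline idea, that a tiled traversal changes only the visit order (and hence locality), not the result, can be sketched for a 2D 5-point Jacobi sweep; tile size and grid are illustrative assumptions.

    ```python
    import numpy as np

    def jacobi_tiled(u, tile=64):
        """One 5-point Jacobi sweep over a 2D grid, traversed in square tiles.

        Tiling changes only the visit order and cache reuse, not the result;
        the paper's contribution is shaping and scheduling such tiles
        hierarchically so that nearby threads share warm caches.
        """
        out = u.copy()
        n, m = u.shape
        for i0 in range(1, n - 1, tile):
            for j0 in range(1, m - 1, tile):
                i1, j1 = min(i0 + tile, n - 1), min(j0 + tile, m - 1)
                out[i0:i1, j0:j1] = 0.25 * (u[i0-1:i1-1, j0:j1] + u[i0+1:i1+1, j0:j1]
                                            + u[i0:i1, j0-1:j1-1] + u[i0:i1, j0+1:j1+1])
        return out
    ```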

  5. A radarsat-2 quad-polarized time series for monitoring crop and soil conditions in Barrax, Spain

    USDA-ARS?s Scientific Manuscript database

    The European Space Agency (ESA) along with multiple university and agency investigators joined to conduct the AgriSAR Campaign in 2009. The main objective was to analyze a dense time series of RADARSAT-2 quad-pol data to define and quantify the performance of Sentinel-1 and other future ESA C-Band ...

  6. Student Intern Freed Competes at Intel ISEF, Two Others Awarded at Local Science Fair | Poster

    Cancer.gov

    Class of 2014–2015 Werner H. Kirsten (WHK) student intern Rebecca “Natasha” Freed earned a fourth-place award in biochemistry at the 2015 Intel International Science and Engineering Fair (ISEF), the largest high school science research competition in the world, according to the Society for Science & the Public’s website. Freed described the event as “transformative

  7. Toward a High-Efficient Utilization of Solar Radiation by Quad-Band Solar Spectral Splitting.

    PubMed

    Cao, Feng; Huang, Yi; Tang, Lu; Sun, Tianyi; Boriskina, Svetlana V; Chen, Gang; Ren, Zhifeng

    2016-12-01

    The promising quad-band solar spectral splitter combines the properties of an optical filter and a spectrally selective solar thermal absorber: it can direct the PV band to PV modules and absorb thermal-band energy for a thermal process with low thermal losses. It provides a new strategy for spectral splitting and offers potential routes for hybrid PVT system design. © 2016 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.

  8. Process characteristics and design methods for a 300 deg quad OP amp

    NASA Technical Reports Server (NTRS)

    Beasom, J. D.; Patterson, R. B., III

    1981-01-01

    The results of process characterization, circuit design, and reliability studies for the development of a quad OP amplifier intended for use up to 300 °C are presented. A dielectrically isolated complementary vertical bipolar process was chosen to fabricate the amplifier in order to eliminate isolation leakage and the possibility of latch-up. Characterization of NPN and PNP junctions showed them to be suitable for use up to 300 °C. Interconnect reliability was predicted to be greater than four years mean time between failures. Parasitic MOS formation was eliminated by isolation of each device.

  9. Oil detection in RADARSAT-2 quad-polarization imagery: implications for ScanSAR performance

    NASA Astrophysics Data System (ADS)

    Cheng, Angela; Arkett, Matt; Zagon, Tom; De Abreu, Roger; Mueller, Derek; Vachon, Paris; Wolfe, John

    2011-11-01

    Environment Canada's Integrated Satellite Tracking of Pollution (ISTOP) program uses RADARSAT-2 data to vector pollution surveillance assets to areas where oil discharges/spills are suspected in support of enforcement and/or cleanup efforts. RADARSAT-2's new imaging capabilities and ground system promise significant improvements in ISTOP's ability to detect and report on oil pollution. Of specific interest is the potential of dual-polarization ScanSAR data acquired with VV polarization to improve the detection of oil pollution compared to data acquired with HH polarization, and with VH polarization to concurrently detect ship targets. A series of 101 RADARSAT-2 fine quad images were acquired over Coal Oil Point, near Santa Barbara, California, where a seep field naturally releases hydrocarbons. The oil and gas releases in this region are visible on the sea surface and have been well documented, allowing for the remote sensing of a constant source of oil at a fixed location. Although the make-up of the oil seep field could be different from that of oil spills, it provides a representative target that can be routinely imaged under a variety of wind conditions. Results derived from the fine quad imagery with a lower noise floor were adjusted to mimic the noise floor limitations of ScanSAR. In this study it was found that VV performed better than HH for oil detection, especially at higher incidence angles.

  10. Innovative HPC architectures for the study of planetary plasma environments

    NASA Astrophysics Data System (ADS)

    Amaya, Jorge; Wolf, Anna; Lembège, Bertrand; Zitz, Anke; Alvarez, Damian; Lapenta, Giovanni

    2016-04-01

    DEEP-ER is a European Commission-funded project that is developing a new type of High Performance Computing architecture. The revolutionary system is currently used by KU Leuven to study the effects of the solar wind on the global environments of the Earth and Mercury. The new architecture combines the versatility of Intel Xeon computing nodes with the power of the upcoming Intel Xeon Phi accelerators. Contrary to classical heterogeneous HPC architectures, where it is customary to find CPUs and accelerators in the same computing nodes, in the DEEP-ER system the CPU nodes are grouped together (Cluster) independently from the accelerator nodes (Booster). The system is equipped with a state-of-the-art interconnection network, highly scalable and fast I/O, and a resiliency system for failure recovery. The final objective of the project is to introduce a scalable system that can be used to create the next generation of exascale supercomputers. The code iPic3D from KU Leuven is being adapted to this new architecture. This particle-in-cell code can now compute the electromagnetic fields on the Cluster side while the particles are moved on the Booster side. Using fast and scalable Xeon Phi accelerators in the Booster, we can introduce many more particles per cell in the simulation than is possible in the current generation of HPC systems, allowing us to compute fully kinetic plasmas with very low interpolation noise. The system will be used to perform fully kinetic, low-noise, 3D simulations of the interaction of the solar wind with the magnetospheres of the Earth and Mercury. Preliminary simulations have been performed at other HPC centers in order to compare the results across different systems. In this presentation we show the complexity of the plasma flow around the planets, including the development of hydrodynamic instabilities at the flanks, the presence of the collision-less shock, the magnetosheath, the magnetopause, reconnection zones, the formation of the

  11. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ghysels, Pieter; Li, Xiaoye S.; Rouet, Francois -Henry

    Here, we present a sparse linear system solver that is based on a multifrontal variant of Gaussian elimination and exploits low-rank approximation of the resulting dense frontal matrices. We use hierarchically semiseparable (HSS) matrices, which have low-rank off-diagonal blocks, to approximate the frontal matrices. For HSS matrix construction, a randomized sampling algorithm is used together with interpolative decompositions. The combination of the randomized compression with a fast ULV HSS factorization leads to a solver with lower computational complexity than the standard multifrontal method for many applications, resulting in speedups of up to 7-fold for problems in our test suite. The implementation targets many-core systems by using task parallelism with dynamic runtime scheduling. Numerical experiments show performance improvements over state-of-the-art sparse direct solvers. The implementation achieves high performance and good scalability on a range of modern shared-memory parallel systems, including the Intel Xeon Phi (MIC). The code is part of a software package called STRUMPACK - STRUctured Matrices PACKage, which also has a distributed-memory component for dense rank-structured matrices.
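
    To make the randomized-sampling step concrete, the sketch below multiplies a dense block A by a Gaussian random test matrix, the core operation used to capture the (numerically low-rank) range of a block before compression; it is a minimal stand-in, not STRUMPACK's actual API, and the matrix layout and function name are assumptions.

    // Minimal sketch of randomized range sampling: Y = A * Omega, where
    // Omega is an n x k Gaussian test matrix with k much smaller than n.
    // The columns of Y then (approximately) span the range of A.
    #include <vector>
    #include <random>

    std::vector<double> sample_range(const std::vector<double>& A,  // m x n, row-major
                                     int m, int n, int k) {
        std::mt19937 gen(42);
        std::normal_distribution<double> gauss(0.0, 1.0);
        std::vector<double> Omega(n * k), Y(m * k, 0.0);
        for (double& x : Omega) x = gauss(gen);         // random test matrix
        for (int i = 0; i < m; ++i)                     // Y = A * Omega
            for (int j = 0; j < n; ++j)
                for (int c = 0; c < k; ++c)
                    Y[i * k + c] += A[i * n + j] * Omega[j * k + c];
        return Y;
    }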

  12. High-performance modeling of plasma-based acceleration and laser-plasma interactions

    NASA Astrophysics Data System (ADS)

    Vay, Jean-Luc; Blaclard, Guillaume; Godfrey, Brendan; Kirchen, Manuel; Lee, Patrick; Lehe, Remi; Lobet, Mathieu; Vincenti, Henri

    2016-10-01

    Large-scale numerical simulations are essential to the design of plasma-based accelerators and laser-plasma interactions for ultra-high intensity (UHI) physics. The electromagnetic Particle-In-Cell (PIC) approach is the method of choice for self-consistent simulations, as it is based on first principles, captures all kinetic effects, and also scales favorably to many cores on supercomputers. The standard PIC algorithm relies on second-order finite-difference discretization of the Maxwell and Newton-Lorentz equations. We present here novel formulations, based on very high-order pseudo-spectral Maxwell solvers, which enable near-total elimination of the numerical Cherenkov instability and increased accuracy over the standard PIC method for standard laboratory-frame and Lorentz-boosted-frame simulations. We also present the latest implementations in the PIC modules Warp-PICSAR and FBPIC on the Intel Xeon Phi and GPU architectures. Examples of applications will be given on the simulation of laser-plasma accelerators and high-harmonic generation with plasma mirrors. Work supported by US-DOE Contracts DE-AC02-05CH11231 and by the European Commission through the Marie Skłodowska-Curie fellowship PICSSAR Grant Number 624543. This work used resources of NERSC.
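
    The accelerator-friendly part of a PIC step is the particle push, which is cheap per particle and highly parallel. The sketch below shows a deliberately minimal 1D electrostatic push with a nearest-cell field gather; production codes such as Warp-PICSAR use the Boris rotation with full E and B interpolation, and all names here are illustrative.

    // Minimal sketch of a 1D electrostatic particle push (leapfrog style).
    // Assumes particles remain inside the grid; no boundary handling.
    #include <vector>
    #include <cstddef>

    struct Particle { double x, v; };

    void push_particles(std::vector<Particle>& parts,
                        const std::vector<double>& E,   // field on a 1D grid
                        double dx, double dt, double qm /* charge/mass */) {
        #pragma omp parallel for
        for (std::size_t p = 0; p < parts.size(); ++p) {
            const int cell = static_cast<int>(parts[p].x / dx);  // nearest-cell gather
            parts[p].v += qm * E[cell] * dt;                     // accelerate
            parts[p].x += parts[p].v * dt;                       // drift
        }
    }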

  13. Elastic Cloud Computing Architecture and System for Heterogeneous Spatiotemporal Computing

    NASA Astrophysics Data System (ADS)

    Shi, X.

    2017-10-01

    Spatiotemporal computation implements a variety of different algorithms. When big data are involved, a desktop computer or standalone application may not be able to complete the computation task due to limited memory and computing power. Now that a variety of hardware accelerators and computing platforms are available to improve the performance of geocomputation, different algorithms may behave differently on different computing infrastructures and platforms. Some are perfect for implementation on a cluster of graphics processing units (GPUs), while GPUs may not be useful for certain kinds of spatiotemporal computation. The same holds for utilizing a cluster of Intel's many-integrated-core (MIC) or Xeon Phi processors, as well as Hadoop or Spark platforms, to handle big spatiotemporal data. Furthermore, considering the energy efficiency requirement in general computation, a Field Programmable Gate Array (FPGA) may be the better solution when its computational performance is similar to or better than that of GPUs and MICs. It is expected that an elastic cloud computing architecture and system that integrates all of GPUs, MICs, and FPGAs could be developed and deployed to support spatiotemporal computing over heterogeneous data types and computational problems.

  14. An Efficient Multicore Implementation of a Novel HSS-Structured Multifrontal Solver Using Randomized Sampling

    DOE PAGES

    Ghysels, Pieter; Li, Xiaoye S.; Rouet, Francois -Henry; ...

    2016-10-27

    Here, we present a sparse linear system solver that is based on a multifrontal variant of Gaussian elimination and exploits low-rank approximation of the resulting dense frontal matrices. We use hierarchically semiseparable (HSS) matrices, which have low-rank off-diagonal blocks, to approximate the frontal matrices. For HSS matrix construction, a randomized sampling algorithm is used together with interpolative decompositions. The combination of the randomized compression with a fast ULV HSS factorization leads to a solver with lower computational complexity than the standard multifrontal method for many applications, resulting in speedups of up to 7-fold for problems in our test suite. The implementation targets many-core systems by using task parallelism with dynamic runtime scheduling. Numerical experiments show performance improvements over state-of-the-art sparse direct solvers. The implementation achieves high performance and good scalability on a range of modern shared-memory parallel systems, including the Intel Xeon Phi (MIC). The code is part of a software package called STRUMPACK - STRUctured Matrices PACKage, which also has a distributed-memory component for dense rank-structured matrices.

  15. Python in the NERSC Exascale Science Applications Program for Data

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ronaghi, Zahra; Thomas, Rollin; Deslippe, Jack

    We describe a new effort at the National Energy Research Scientific Computing Center (NERSC) in performance analysis and optimization of scientific Python applications targeting the Intel Xeon Phi (Knights Landing, KNL) many-core architecture. The Python-centered work outlined here is part of a larger effort called the NERSC Exascale Science Applications Program (NESAP) for Data. NESAP for Data focuses on applications that process and analyze high-volume, high-velocity data sets from experimental/observational science (EOS) facilities supported by the US Department of Energy Office of Science. We present three case study applications from NESAP for Data that use Python. These codes vary in terms of “Python purity” from applications developed in pure Python to ones that use Python mainly as a convenience layer for scientists without expertise in lower-level programming languages like C, C++ or Fortran. The science case, requirements, constraints, algorithms, and initial performance optimizations for each code are discussed. Our goal with this paper is to contribute to the larger conversation around the role of Python in high-performance computing today and tomorrow, highlighting areas for future work and emerging best practices.

  16. Neutron Particle Effects on a Quad-Redundant Flight Control Computer

    NASA Technical Reports Server (NTRS)

    Eure, Kenneth; Belcastro, Celeste M.; Gray, W Steven; Gonzalex, Oscar

    2003-01-01

    This paper describes a single-event upset experiment performed at the Los Alamos National Laboratory. A closed-loop control system consisting of a Quad-Redundant Flight Control Computer (FCC) and a B737 simulator was operated while the FCC was exposed to a neutron beam. The purpose of this test was to analyze the effects of neutron bombardment on avionics control systems operating at altitudes where neutron strikes are probable. The neutron energy spectrum produced at the Los Alamos National Laboratory is similar in shape to the spectrum of atmospheric neutrons but much more intense. The higher intensity results in accelerated life tests that are representative of the actual neutron radiation that a FCC may receive over a period of years.

  17. Performance Analysis of GFDL's GCM Line-By-Line Radiative Transfer Model on GPU and MIC Architectures

    NASA Astrophysics Data System (ADS)

    Menzel, R.; Paynter, D.; Jones, A. L.

    2017-12-01

    Due to their relatively low computational cost, radiative transfer models in global climate models (GCMs) run on traditional CPU architectures generally consist of shortwave and longwave parameterizations over a small number of wavelength bands. With the rise of newer GPU and MIC architectures, however, the performance of high-resolution line-by-line radiative transfer models may soon approach that of the physical parameterizations currently employed in GCMs. Here we present an analysis of the current performance of a new line-by-line radiative transfer model under development at GFDL. Although originally designed to specifically exploit GPU architectures through the use of CUDA, the radiative transfer model has recently been extended to include OpenMP in an effort to also effectively target MIC architectures such as Intel's Xeon Phi. Using input data provided by the upcoming Radiative Forcing Model Intercomparison Project (RFMIP, as part of CMIP6), we compare model results and performance data for various model configurations and spectral resolutions run on both GPU and Intel Knights Landing architectures to analogous runs of the standard Oxford Reference Forward Model on traditional CPUs.

  18. Speeding-up Bioinformatics Algorithms with Heterogeneous Architectures: Highly Heterogeneous Smith-Waterman (HHeterSW).

    PubMed

    Gálvez, Sergio; Ferusic, Adis; Esteban, Francisco J; Hernández, Pilar; Caballero, Juan A; Dorado, Gabriel

    2016-10-01

    The Smith-Waterman algorithm has great sensitivity when used for biological sequence-database searches, but at the expense of high computing-power requirements. To overcome this problem, there are implementations in the literature that exploit the different hardware architectures available in a standard PC, such as the GPU, CPU, and coprocessors. We introduce an application that splits the original database-search problem into smaller parts, resolves each of them by executing the most efficient implementation of the Smith-Waterman algorithm on a different hardware architecture, and finally unifies the generated results. Using non-overlapping hardware allows simultaneous execution, and up to a 2.58-fold performance gain when compared with any other algorithm for searching sequence databases. Even the performance of the popular BLAST heuristic is exceeded in 78% of the tests. The application has been tested with standard hardware: an Intel i7-4820K CPU, Intel Xeon Phi 31S1P coprocessors, and nVidia GeForce GTX 960 graphics cards. An important increase in performance has been obtained in a wide range of situations, effectively exploiting the available hardware.
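
    The key scheduling idea, splitting the database in proportion to each device's throughput so that all devices finish at roughly the same time, can be sketched in a few lines; the GCUPS figures and function name below are illustrative assumptions, not part of the published application.

    // Minimal sketch: divide num_sequences among devices proportionally
    // to their measured throughput (GCUPS), giving the remainder to the
    // last device so the total is preserved.
    #include <vector>
    #include <numeric>
    #include <cstddef>

    std::vector<int> split_database(int num_sequences,
                                    const std::vector<double>& gcups) {
        const double total = std::accumulate(gcups.begin(), gcups.end(), 0.0);
        std::vector<int> share(gcups.size(), 0);
        int assigned = 0;
        for (std::size_t d = 0; d < gcups.size(); ++d) {
            share[d] = static_cast<int>(num_sequences * gcups[d] / total);
            assigned += share[d];
        }
        share.back() += num_sequences - assigned;  // hand out the remainder
        return share;
    }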

  19. SWPS3 - fast multi-threaded vectorized Smith-Waterman for IBM Cell/B.E. and x86/SSE2.

    PubMed

    Szalkowski, Adam; Ledergerber, Christian; Krähenbühl, Philipp; Dessimoz, Christophe

    2008-10-29

    We present swps3, a vectorized implementation of the Smith-Waterman local alignment algorithm optimized for both the Cell/BE and x86 architectures. The paper describes swps3 and compares its performance with several other implementations. Our benchmarking results show that swps3 is currently the fastest implementation of a vectorized Smith-Waterman on the Cell/BE, outperforming the only other known implementation by a factor of at least 4: on a Playstation 3, it achieves up to 8.0 billion cell-updates per second (GCUPS). Using the SSE2 instruction set, a quad-core Intel Pentium can reach 15.7 GCUPS. We also show that swps3 on this CPU is faster than a recent GPU implementation. Finally, we note that under some circumstances, alignments are computed at roughly the same speed as BLAST, a heuristic method. The Cell/BE can be a powerful platform to align biological sequences. Besides, the performance gap between exact and heuristic methods has almost disappeared, especially for long protein sequences.
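
    For reference, the recurrence that swps3 vectorizes is the classic Smith-Waterman cell update; the scalar sketch below uses linear (not affine) gap penalties to stay short, so it is a simplified model of what the SSE2/Cell kernels compute, with assumed example scores.

    // Scalar Smith-Waterman local alignment score with linear gap costs.
    // H(i,j) = max(0, H(i-1,j-1)+s, H(i-1,j)+gap, H(i,j-1)+gap)
    #include <string>
    #include <vector>
    #include <algorithm>

    int smith_waterman(const std::string& a, const std::string& b,
                       int match = 2, int mismatch = -1, int gap = -2) {
        std::vector<std::vector<int>> H(a.size() + 1,
                                        std::vector<int>(b.size() + 1, 0));
        int best = 0;
        for (std::size_t i = 1; i <= a.size(); ++i)
            for (std::size_t j = 1; j <= b.size(); ++j) {
                const int s = (a[i - 1] == b[j - 1]) ? match : mismatch;
                H[i][j] = std::max({0, H[i - 1][j - 1] + s,
                                    H[i - 1][j] + gap, H[i][j - 1] + gap});
                best = std::max(best, H[i][j]);
            }
        return best;  // score of the best local alignment
    }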

  20. SWPS3 – fast multi-threaded vectorized Smith-Waterman for IBM Cell/B.E. and x86/SSE2

    PubMed Central

    Szalkowski, Adam; Ledergerber, Christian; Krähenbühl, Philipp; Dessimoz, Christophe

    2008-01-01

    Background We present swps3, a vectorized implementation of the Smith-Waterman local alignment algorithm optimized for both the Cell/BE and x86 architectures. The paper describes swps3 and compares its performance with several other implementations. Findings Our benchmarking results show that swps3 is currently the fastest implementation of a vectorized Smith-Waterman on the Cell/BE, outperforming the only other known implementation by a factor of at least 4: on a Playstation 3, it achieves up to 8.0 billion cell-updates per second (GCUPS). Using the SSE2 instruction set, a quad-core Intel Pentium can reach 15.7 GCUPS. We also show that swps3 on this CPU is faster than a recent GPU implementation. Finally, we note that under some circumstances, alignments are computed at roughly the same speed as BLAST, a heuristic method. Conclusion The Cell/BE can be a powerful platform to align biological sequences. Besides, the performance gap between exact and heuristic methods has almost disappeared, especially for long protein sequences. PMID:18959793

  1. 75 FR 70691 - World Color Mt. Morris, IL LLC, Premedia Chicago Division, Currently Known as Quad/Graphics, Inc...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2010-11-18

    ... DEPARTMENT OF LABOR Employment and Training Administration [TA-W-74,142] World Color Mt. Morris, IL LLC, Premedia Chicago Division, Currently Known as Quad/Graphics, Inc., Including On-Site Leased Workers From Creative Group and Creative Circle, Schaumburg, IL; Amended Certification Regarding Eligibility To Apply for Worker Adjustment Assistance...

  2. A 'Quad-Disc' static pressure probe for measurement in adverse atmospheres - With a comparative review of static pressure probe designs

    NASA Astrophysics Data System (ADS)

    Nishiyama, Randall T.; Bedard, Alfred J., Jr.

    1991-09-01

    There are many areas of need for accurate measurements of atmospheric static pressure. These include observations of surface meteorology, airport altimeter settings, pressure distributions around buildings, moving measurement platforms, as well as basic measurements of fluctuating pressures in turbulence. Most of these observations require long-term observations in adverse environments (e.g., rain, dust, or snow). Currently, many pressure measurements are made, of necessity, within buildings, thus involving potential errors of several millibars in mean pressure during moderate winds, accompanied by large fluctuating pressures induced by the structure. In response to these needs, a 'Quad-Disk' pressure probe for continuous, outdoor monitoring purposes was designed which is inherently weather-protected. This Quad-Disk probe has the desirable features of omnidirectional response and small error in pitch. A review of past static pressure probes contrasts design approaches and capabilities.

  3. Transitioning to Intel-based Linux Servers in the Payload Operations Integration Center

    NASA Technical Reports Server (NTRS)

    Guillebeau, P. L.

    2004-01-01

    The MSFC Payload Operations Integration Center (POIC) is the focal point for International Space Station (ISS) payload operations. The POIC contains the facilities, hardware, software and communication interfaces necessary to support payload operations. ISS ground system support for processing and display of real-time spacecraft telemetry and command data has been operational for several years. The hardware components were reaching end of life, and vendor costs were increasing while ISS budgets were becoming severely constrained. It has therefore been necessary to migrate the Unix portions of our ground systems to commodity-priced Intel-based Linux servers. The overall migration to Intel-based Linux servers in the control center involves changes to the hardware architecture, including networks, data storage, and highly available resources; this paper concentrates on the Linux migration implementation for the software portion of our ground system. The migration began with 3.5 million lines of code running on Unix platforms, with separate servers for telemetry, command, payload information management systems, web, system control, remote server interface and databases. The Intel-based system is scheduled to be available for initial operational use by August 2004. This paper addresses the Linux migration study approach, including the proof of concept, the criticality of customer buy-in, the need for a smooth transition, and the importance of beginning with POSIX-compliant code. It focuses on the development approach, explaining the software lifecycle, and covers other aspects of development including phased implementation, interim milestones, and metrics measurement and reporting mechanisms. It also addresses the testing approach at all levels, including development, development integration, IV&V, user beta testing and acceptance testing. Test results, including performance numbers compared with the Unix servers, are included.

  4. Achieving high performance on the Intel Paragon

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Greenberg, D.S.; Maccabe, B.; Riesen, R.

    1993-11-01

    When presented with a new supercomputer most users will first ask “How much faster will my applications run?” and then add a fearful “How much effort will it take me to convert to the new machine?” This paper describes some lessons learned at Sandia while asking these questions about the new 1800+ node Intel Paragon. The authors conclude that the operating system is crucial both to achieving high performance and to allowing easy conversion from previous parallel implementations to a new machine. Using the Sandia/UNM Operating System (SUNMOS) they were able to port an LU factorization of dense matrices from the nCUBE2 to the Paragon and achieve 92% scaled speed-up on 1024 nodes. Thus a 44,000 by 44,000 matrix, which had required over 10 hours on the previous machine, was completed in less than half an hour at a rate of over 40 GFLOPS. Two keys to achieving such high performance were the small size of SUNMOS (less than 256 kbytes) and the ability to send large messages with very low overhead.

  5. Fast 2D FWI on a multi and many-cores workstation.

    NASA Astrophysics Data System (ADS)

    Thierry, Philippe; Donno, Daniela; Noble, Mark

    2014-05-01

    Following the introduction of x86 co-processors (Xeon Phi) and the performance increase of standard 2-socket workstations using the latest 12-core E5-v2 x86-64 CPUs, we present here an MPI + OpenMP implementation of an acoustic 2D FWI (full waveform inversion) code which runs simultaneously on the CPUs and on the co-processors installed in a workstation. The main advantage of running a 2D FWI on a workstation is the ability to quickly evaluate new features such as more complicated wave equations, new cost functions, finite-difference stencils or boundary conditions. Since the co-processor is made of 61 in-order x86 cores, each of them supporting up to 4 threads, this many-core device can be seen as a shared-memory SMP (symmetric multiprocessing) machine with its own IP address. Depending on the vendor, a single workstation can handle several co-processors, turning the workstation into a personal cluster under the desk. The original Fortran 90 CPU version of the 2D FWI code is simply recompiled to get a Xeon Phi x86 binary. This multi- and many-core configuration uses standard compilers and the associated MPI and math libraries under Linux; therefore, the cost of code development remains constant while computation time improves. We choose to implement the code in the so-called symmetric mode to fully use the capacity of the workstation, but we also evaluate the scalability of the code in native mode (i.e., running only on the co-processor) thanks to the Linux ssh and NFS capabilities. The usual care is taken with optimization and SIMD vectorization to ensure optimal performance, and to analyze the application's performance and bottlenecks on both platforms. The 2D FWI implementation uses finite-difference time-domain forward modeling and a quasi-Newton (L-BFGS algorithm) optimization scheme for the model parameter updates. Parallelization is achieved through standard MPI shot-gather distribution and OpenMP for domain decomposition within the co-processor. Taking advantage of the 16
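
    The parallel layout described above, MPI across shot gathers and OpenMP within the domain, can be outlined as follows; model_shot() is a hypothetical stand-in for the finite-difference forward solver, and the shot count is an assumed example.

    // Minimal sketch: round-robin distribution of shots over MPI ranks,
    // with OpenMP parallelism inside each shot's grid update.
    #include <mpi.h>
    #include <vector>
    #include <cstddef>

    void model_shot(int /*shot*/) {               // hypothetical FD kernel
        std::vector<double> grid(1 << 20, 0.0);
        #pragma omp parallel for
        for (std::size_t i = 1; i + 1 < grid.size(); ++i)
            grid[i] = 0.5 * (grid[i - 1] + grid[i + 1]);  // placeholder stencil
    }

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0, size = 1;
        const int num_shots = 128;                // assumed shot count
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        for (int shot = rank; shot < num_shots; shot += size)
            model_shot(shot);                     // each rank takes its share
        MPI_Finalize();
        return 0;
    }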

  6. Spaceborne Hybrid Quad-Pol SAR Range Ambiguity Analysis and Simulations

    NASA Astrophysics Data System (ADS)

    Yang, Shilin; Li, Yang; Zhang, Jingjing; Hong, Wen

    2014-11-01

    The higher levels of range ambiguities in the cross-polarized measurement channels are the primary limitations for matched quad-pol (e.g., HH, VV, VH, and HV) spaceborne synthetic aperture radar (SAR) systems. These ambiguities severely constrain the useful range of incident angles and the swath widths, particularly at larger incidence. Adopting a hybrid-polarimetric architecture can remarkably reduce these ambiguities. In this paper, we analyse and develop the expression of the range ambiguity to signal ratio (RASR) in the hybrid-polarimetric architecture. Simulations are made to verify this novel architecture’s advantage in the improvement of range ambiguities. The system operating parameters are derived from NASA’s DESDynI mission. In addition, we use the second-order moments of polarimetric covariance matrices to characterize the target or environment more precisely.

  7. Spaceborne Hybrid Quad-Pol SAR Range Ambiguity Analysis and Simulations

    NASA Astrophysics Data System (ADS)

    Yang, Shilin; Li, Yang; Zhang, Jingjing; Hong, Wen

    2014-11-01

    The higher levels of range ambiguities in the cross-polarized measurement channels are the primary limitations for matched quad-pol (e.g., HH, VV, VH, and HV) spaceborne synthetic aperture radar (SAR) systems. These ambiguities severely constrain the useful range of incident angles and the swath widths, particularly at larger incidence. Adopting a hybrid-polarimetric architecture can remarkably reduce these ambiguities. In this paper, we analyse and develop the expression of the range ambiguity to signal ratio (RASR) in the hybrid-polarimetric architecture. Simulations are made to verify this novel architecture's advantage in the improvement of range ambiguities. The system operating parameters are derived from NASA's DESDynI mission. In addition, we use the second-order moments of polarimetric covariance matrices to characterize the target or environment more precisely.

  8. Towards Highly Scalable Ab Initio Molecular Dynamics (AIMD) Simulations on the Intel Knights Landing Manycore Processor

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Jacquelin, Mathias; De Jong, Wibe A.; Bylaska, Eric J.

    2017-07-03

    The Ab Initio Molecular Dynamics (AIMD) method allows scientists to treat the dynamics of molecular and condensed phase systems while retaining a first-principles-based description of their interactions. This extremely important method has tremendous computational requirements, because the electronic Schrödinger equation, approximated using Kohn-Sham Density Functional Theory (DFT), is solved at every time step. With the advent of manycore architectures, application developers have a significant amount of processing power within each compute node that can only be exploited through massive parallelism. A compute-intensive application such as AIMD forms a good candidate to leverage this processing power. In this paper, we focus on adding thread-level parallelism to the plane wave DFT methodology implemented in NWChem. Through a careful optimization of tall-skinny matrix products, which are at the heart of the Lagrange multiplier and nonlocal pseudopotential kernels, as well as 3D FFTs, our OpenMP implementation delivers excellent strong scaling on the latest Intel Knights Landing (KNL) processor. We assess the efficiency of our Lagrange multiplier kernels by building a Roofline model of the platform, and verify that our implementation is close to the roofline for various problem sizes. Finally, we present strong scaling results on the complete AIMD simulation for a 64 water molecule test case, which scales up to all 68 cores of the Knights Landing processor.
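
    A tall-skinny product of the kind highlighted above, C = A^T A with A of size N x k and N much larger than k, parallelizes naturally over the long dimension; the sketch below shows the thread-level structure with per-thread partial blocks, while the NWChem kernels add cache blocking, SIMD, and much more.

    // Minimal sketch: threaded C = A^T * A for a tall-skinny, row-major A.
    // Each thread accumulates a private k x k block, then blocks are merged.
    #include <vector>

    std::vector<double> ata(const std::vector<double>& A, long N, int k) {
        std::vector<double> C(k * k, 0.0);
        #pragma omp parallel
        {
            std::vector<double> local(k * k, 0.0);
            #pragma omp for nowait
            for (long r = 0; r < N; ++r)
                for (int i = 0; i < k; ++i)
                    for (int j = 0; j < k; ++j)
                        local[i * k + j] += A[r * k + i] * A[r * k + j];
            #pragma omp critical
            for (int e = 0; e < k * k; ++e) C[e] += local[e];  // merge blocks
        }
        return C;
    }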

  9. Exploring Machine Learning Techniques For Dynamic Modeling on Future Exascale Systems

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Song, Shuaiwen; Tallent, Nathan R.; Vishnu, Abhinav

    2013-09-23

    Future exascale systems must be optimized for both power and performance at scale in order to achieve DOE’s goal of a sustained petaflop within 20 Megawatts by 2022 [1]. The massive parallelism of future systems combined with complex memory hierarchies will form a barrier to efficient application and architecture design. These challenges are exacerbated with emerging complex architectures such as GPGPUs and Intel Xeon Phi, as parallelism increases by orders of magnitude and system power consumption can easily triple or quadruple. Therefore, we need techniques that can reduce the search space for optimization, isolate power-performance bottlenecks, identify root causes of software/hardware inefficiency, and effectively direct runtime scheduling.

  10. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Zhang, Yao; Balaprakash, Prasanna; Meng, Jiayuan

    We present Raexplore, a performance modeling framework for architecture exploration. Raexplore enables rapid, automated, and systematic search of the architecture design space by combining hardware counter-based performance characterization and analytical performance modeling. We demonstrate Raexplore for two recent manycore processors, the IBM BlueGene/Q compute chip and the Intel Xeon Phi, targeting a set of scientific applications. Our framework is able to capture complex interactions between architectural components including instruction pipeline, cache, and memory, and to achieve a 3–22% error for same-architecture and cross-architecture performance predictions. Furthermore, we apply our framework to assess the two processors, and discover and evaluate a list of architectural scaling options for future processor designs.

  11. A Quad-Cantilevered Plate micro-sensor for intracranial pressure measurement.

    PubMed

    Lalkov, Vasko; Qasaimeh, Mohammad A

    2017-07-01

    This paper proposes a new design for a pressure-sensing micro-plate platform that brings higher sensitivity to a pressure sensor based on a piezoresistive MEMS sensing mechanism. The proposed design is composed of a suspended plate having four stepped cantilever beams connected to its corners, and is thus termed a Quad-Cantilevered Plate (QCP). Finite element analysis was performed to determine the optimal design for sensitivity and structural stability under a range of applied forces. Furthermore, a piezoresistive analysis was performed to calculate sensor sensitivity. Both the maximum stress and the change in resistance of the piezoresistor associated with the QCP were found to be higher compared to previously published designs, and linearly related to the applied pressure, as desired. The QCP therefore demonstrates greater sensitivity, and could potentially be used as an efficient pressure sensor for intracranial pressure measurement.

  12. Classification Comparisons Between Compact Polarimetric and Quad-Pol SAR Imagery

    NASA Astrophysics Data System (ADS)

    Souissi, Boularbah; Doulgeris, Anthony P.; Eltoft, Torbjørn

    2015-04-01

    Recent interest in dual-pol SAR systems has led to a novel approach, the so-called compact polarimetric (CP) imaging mode, which attempts to reconstruct fully polarimetric information based on a few simple assumptions. In this work, the CP image is simulated from the full quad-pol (QP) image. We present here an initial comparison of the polarimetric information content between the QP and CP imaging modes. The analysis of multi-look polarimetric covariance matrix data uses an automated statistical clustering method based upon the expectation maximization (EM) algorithm for finite mixture modeling, using the complex Wishart probability density function. Our results show that there are some differing characteristics between the QP and CP modes. The classification is demonstrated using E-SAR and Radarsat-2 polarimetric SAR images acquired over DLR Oberpfaffenhofen in Germany and Algiers in Algeria, respectively.

  13. Parallelization of a Monte Carlo particle transport simulation code

    NASA Astrophysics Data System (ADS)

    Hadjidoukas, P.; Bousis, C.; Emfietzoglou, D.

    2010-05-01

    We have developed a high performance version of the Monte Carlo particle transport simulation code MC4. The original application code, developed in Visual Basic for Applications (VBA) for Microsoft Excel, was first rewritten in the C programming language to improve code portability. Several pseudo-random number generators have also been integrated and studied. The new MC4 version was then parallelized for shared and distributed-memory multiprocessor systems using the Message Passing Interface. Two parallel pseudo-random number generator libraries (SPRNG and DCMT) have been seamlessly integrated. The performance speedup of parallel MC4 has been studied on a variety of parallel computing architectures, including an Intel Xeon server with 4 dual-core processors, a Sun cluster consisting of 16 nodes of 2 dual-core AMD Opteron processors, and a 200 dual-processor HP cluster. For large problem sizes, limited only by the physical memory of the multiprocessor server, the speedup results are almost linear on all systems. We have validated the parallel implementation against the serial VBA and C implementations using the same random number generator. Our experimental results on the transport and energy loss of electrons in a water medium show that the serial and parallel codes are equivalent in accuracy. The present improvements allow for the study of higher particle energies with the use of more accurate physical models, and improve statistics, as more particle tracks can be simulated in a low response time.
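
    The parallelization strategy summarized above, independent particle histories per rank with decorrelated random streams and a final reduction, can be sketched as below; a seeded std::mt19937_64 stands in for the SPRNG/DCMT streams the paper actually uses, and the history count and tally are illustrative.

    // Minimal sketch: MPI Monte Carlo with one RNG stream per rank and a
    // final sum reduction of the tallies on rank 0.
    #include <mpi.h>
    #include <random>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0, size = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const long histories = 10000000;             // assumed workload
        std::mt19937_64 rng(12345u + rank);          // per-rank seed
        std::uniform_real_distribution<double> u(0.0, 1.0);

        double local = 0.0, global = 0.0;
        for (long h = rank; h < histories; h += size)
            local += u(rng);                         // placeholder for one history

        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        MPI_Finalize();
        return 0;
    }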

  14. Viscoelastic Finite Difference Modeling Using Graphics Processing Units

    NASA Astrophysics Data System (ADS)

    Fabien-Ouellet, G.; Gloaguen, E.; Giroux, B.

    2014-12-01

    Full waveform seismic modeling requires a huge amount of computing power that still challenges today's technology. This limits the applicability of powerful processing approaches in seismic exploration, such as full-waveform inversion. This paper explores the use of Graphics Processing Units (GPUs) to compute a time-based finite-difference solution to the viscoelastic wave equation. The aim is to investigate whether the adoption of GPU technology can significantly reduce the computing time of simulations. The code presented herein is based on the freely accessible 2D software of Bohlen (2002), provided under the GNU General Public License. The implementation uses a second-order centered difference scheme to approximate time derivatives, and staggered-grid schemes with centered differences of order 2, 4, 6, 8, and 12 for spatial derivatives. The code is fully parallel and is written using the Message Passing Interface (MPI), and it thus supports simulations of vast seismic models on a cluster of CPUs. To port the code of Bohlen (2002) to GPUs, the OpenCL framework was chosen for its ability to work on both CPUs and GPUs and its adoption by most GPU manufacturers. In our implementation, OpenCL works in conjunction with MPI, which allows computations on a cluster of GPUs for large-scale model simulations. We tested our code for model sizes between 100² and 6000² elements. The comparison shows a decrease in computation time of more than two orders of magnitude between the GPU implementation run on an AMD Radeon HD 7950 and the CPU implementation run on a 2.26 GHz Intel Xeon Quad-Core. The speed-up varies depending on the order of the finite difference approximation and generally increases for higher orders. Increasing speed-ups are also obtained for increasing model size, which can be explained by kernel overheads and delays introduced by memory transfers to and from the GPU through the PCI-E bus. Those tests indicate that the GPU memory size
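
    To give a feel for the staggered-grid updates being ported, the sketch below advances one particle-velocity component from the stress gradient at second order; in the GPU version the same loop body would become an OpenCL kernel with one work-item per grid point. The full viscoelastic scheme updates several stress and memory-variable fields per step, and all names here are illustrative.

    // Minimal sketch: 2nd-order staggered-grid velocity update,
    // vx += dt/(rho*dx) * d(sxx)/dx. Grid is nx x nz, row-major with
    // stride nz; boundary handling is omitted.
    #include <vector>

    void update_velocity(std::vector<float>& vx, const std::vector<float>& sxx,
                         int nx, int nz, float dt_rho_dx /* dt/(rho*dx) */) {
        for (int i = 1; i < nx; ++i)
            for (int k = 0; k < nz; ++k) {
                const int idx = i * nz + k;
                vx[idx] += dt_rho_dx * (sxx[idx] - sxx[idx - nz]);  // d(sxx)/dx
            }
    }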

  15. Student Intern Ben Freed Competes as Finalist in Intel STS Competition, Three Other Interns Named Semifinalists | Poster

    Cancer.gov

    By Ashley DeVine, Staff Writer Werner H. Kirsten (WHK) student intern Ben Freed was one of 40 finalists to compete in the Intel Science Talent Search (STS) in Washington, DC, in March. “It was seven intense days of interacting with amazing judges and incredibly smart and interesting students. We met President Obama, and then the MIT astronomy lab named minor planets after each of us,” Freed said of the competition.

  16. Efficient molecular dynamics simulations with many-body potentials on graphics processing units

    NASA Astrophysics Data System (ADS)

    Fan, Zheyong; Chen, Wei; Vierimaa, Ville; Harju, Ari

    2017-09-01

    Graphics processing units have been used extensively to accelerate classical molecular dynamics simulations. However, there has been much less progress on accelerating force evaluations for many-body potentials compared to pairwise ones. In the conventional force evaluation algorithm for many-body potentials, the force, virial stress, and heat current for a given atom are accumulated within different loops, which can result in write conflicts between different threads in a CUDA kernel. In this work, we provide a new force evaluation algorithm, based on an explicit pairwise force expression for many-body potentials derived recently (Fan et al., 2015). In our algorithm, the force, virial stress, and heat current for a given atom can be accumulated within a single thread and are free of write conflicts. We discuss the formulations and algorithms and evaluate their performance. A new open-source code, GPUMD, is developed based on the proposed formulations. For the Tersoff many-body potential, the double precision performance of GPUMD using a Tesla K40 card is equivalent to that of the LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator) molecular dynamics code running with about 100 CPU cores (Intel Xeon CPU X5670 @ 2.93 GHz).
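
    The write-conflict-free idea is that, given an explicit pairwise expression for the many-body force, each thread gathers all contributions into its own atom instead of scattering into neighbors; the sketch below shows that gather pattern with OpenMP threads in place of CUDA threads, and pair_force() is a hypothetical placeholder for the potential-specific expression of Fan et al. (2015).

    // Minimal sketch: one writer per atom. Contributions are gathered from
    // the neighbor list, never scattered, so no atomics are needed.
    #include <vector>
    #include <cstddef>

    double pair_force(std::size_t /*i*/, int /*j*/) {  // hypothetical F_ij
        return 1e-3;                                   // placeholder value
    }

    void accumulate_forces(std::vector<double>& f,
                           const std::vector<std::vector<int>>& neighbors) {
        #pragma omp parallel for
        for (std::size_t i = 0; i < f.size(); ++i) {
            double fi = 0.0;
            for (int j : neighbors[i])
                fi += pair_force(i, j);   // gather contributions for atom i
            f[i] = fi;                    // single writer per location
        }
    }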

  17. GPU Lossless Hyperspectral Data Compression System for Space Applications

    NASA Technical Reports Server (NTRS)

    Keymeulen, Didier; Aranki, Nazeeh; Hopson, Ben; Kiely, Aaron; Klimesh, Matthew; Benkrid, Khaled

    2012-01-01

    On-board lossless hyperspectral data compression reduces data volume in order to meet the limited downlink capabilities of NASA and DoD missions. At JPL, a novel, adaptive and predictive technique for lossless compression of hyperspectral data, named the Fast Lossless (FL) algorithm, was recently developed. This technique uses an adaptive filtering method and achieves state-of-the-art performance in both compression effectiveness and low complexity. Because of its outstanding performance and suitability for real-time onboard hardware implementation, the FL compressor is being formalized as the emerging CCSDS Standard for Lossless Multispectral & Hyperspectral image compression. The FL compressor is well-suited for parallel hardware implementation. A GPU hardware implementation was developed for FL targeting the current state-of-the-art GPUs from NVIDIA. The GPU implementation on an NVIDIA GeForce GTX 580 achieves a throughput performance of 583.08 Mbits/sec (44.85 MSamples/sec) and an acceleration of at least 6 times over a software implementation running on a 3.47 GHz single-core Intel Xeon processor. This paper describes the design and implementation of the FL algorithm on the GPU. The massively parallel implementation will in the future provide a fast and practical real-time solution for airborne and space applications.

  18. Optimization of the coherence function estimation for multi-core central processing unit

    NASA Astrophysics Data System (ADS)

    Cheremnov, A. G.; Faerman, V. A.; Avramchuk, V. S.

    2017-02-01

    The paper considers the use of parallel processing on a multi-core central processing unit to optimize the evaluation of the coherence function arising in digital signal processing. The coherence function, along with other methods of spectral analysis, is commonly used for vibration diagnosis of rotating machinery and its individual units. An algorithm is given for evaluating the function for signals represented by digital samples. The algorithm is analyzed with respect to its software implementation and computational problems. Optimization measures are described, including algorithmic, architectural and compiler optimization, and their results are assessed for multi-core processors from different manufacturers. The speed-up of parallel execution with respect to sequential execution was studied, and results are presented for Intel Core i7-4720HQ and AMD FX-9590 processors. The results show the comparatively high efficiency of the optimization measures taken. In particular, acceleration indicators and average CPU utilization were significantly improved, showing a high degree of parallelism in the constructed computation functions. The developed software underwent state registration and will be used as part of a software and hardware solution for rotating machinery fault diagnosis and pipeline leak location with the acoustic correlation method.
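
    For reference, the magnitude-squared coherence being evaluated is C_xy(f) = |G_xy(f)|^2 / (G_xx(f) G_yy(f)), computed from cross- and auto-spectra averaged over segments; the sketch below shows just that final per-frequency loop, which parallelizes trivially, and assumes the spectra have already been estimated.

    // Minimal sketch: magnitude-squared coherence from averaged spectra.
    // std::norm(z) returns |z|^2.
    #include <vector>
    #include <complex>
    #include <cstddef>

    std::vector<double> coherence(const std::vector<std::complex<double>>& Gxy,
                                  const std::vector<double>& Gxx,
                                  const std::vector<double>& Gyy) {
        std::vector<double> C(Gxy.size());
        #pragma omp parallel for
        for (std::size_t f = 0; f < Gxy.size(); ++f)
            C[f] = std::norm(Gxy[f]) / (Gxx[f] * Gyy[f]);
        return C;
    }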

  19. An Approach to Speed up Single-Frequency PPP Convergence with Quad-Constellation GNSS and GIM.

    PubMed

    Cai, Changsheng; Gong, Yangzhao; Gao, Yang; Kuang, Cuilin

    2017-06-06

    The single-frequency precise point positioning (PPP) technique has attracted increasing attention due to its high accuracy and low cost. However, a very long convergence time, normally a few hours, is required in order to achieve a positioning accuracy level of a few centimeters. In this study, an approach is proposed to accelerate the single-frequency PPP convergence by combining quad-constellation global navigation satellite system (GNSS) and global ionospheric map (GIM) data. In this proposed approach, the GPS, GLONASS, BeiDou, and Galileo observations are directly used in an uncombined observation model, and as a result the ionospheric and hardware delay (IHD) can be estimated together as a single unknown parameter. The IHD values acquired from the GIM product and the multi-GNSS differential code bias (DCB) product are then utilized as pseudo-observables of the IHD parameter in the observation model. A time-varying weighting scheme is also proposed for the pseudo-observables, to gradually decrease their contribution to the position solutions during the convergence period. To evaluate the proposed approach, datasets from twelve Multi-GNSS Experiment (MGEX) stations on seven consecutive days are processed and analyzed. The numerical results indicate that single-frequency PPP with quad-constellation GNSS and GIM data is able to reduce the convergence time by 56%, 47%, and 41% in the east, north, and up directions compared to GPS-only single-frequency PPP.
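
    One simple way to realize the time-varying weighting described above is to let the variance of the GIM pseudo-observable grow with filter run time, so its weight decays smoothly during convergence; the sketch below is an assumed exponential de-weighting, with constants that are illustrative rather than the paper's values.

    // Minimal sketch: variance of the GIM ionospheric pseudo-observable as a
    // function of time since filter start; weight = 1/variance decays.
    #include <cmath>

    double gim_pseudo_obs_variance(double t_seconds) {
        const double sigma0 = 0.3;     // initial GIM sigma (m), assumed
        const double tau    = 600.0;   // de-weighting time constant (s), assumed
        const double sigma  = sigma0 * std::exp(t_seconds / tau);
        return sigma * sigma;          // larger variance => smaller weight
    }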

  20. An Approach to Speed up Single-Frequency PPP Convergence with Quad-Constellation GNSS and GIM

    PubMed Central

    Cai, Changsheng; Gong, Yangzhao; Gao, Yang; Kuang, Cuilin

    2017-01-01

    The single-frequency precise point positioning (PPP) technique has attracted increasing attention due to its high accuracy and low cost. However, a very long convergence time, normally a few hours, is required in order to achieve a positioning accuracy level of a few centimeters. In this study, an approach is proposed to accelerate the single-frequency PPP convergence by combining quad-constellation global navigation satellite system (GNSS) and global ionospheric map (GIM) data. In this proposed approach, the GPS, GLONASS, BeiDou, and Galileo observations are directly used in an uncombined observation model, and as a result the ionospheric and hardware delay (IHD) can be estimated together as a single unknown parameter. The IHD values acquired from the GIM product and the multi-GNSS differential code bias (DCB) product are then utilized as pseudo-observables of the IHD parameter in the observation model. A time-varying weighting scheme is also proposed for the pseudo-observables, to gradually decrease their contribution to the position solutions during the convergence period. To evaluate the proposed approach, datasets from twelve Multi-GNSS Experiment (MGEX) stations on seven consecutive days are processed and analyzed. The numerical results indicate that single-frequency PPP with quad-constellation GNSS and GIM data is able to reduce the convergence time by 56%, 47%, and 41% in the east, north, and up directions compared to GPS-only single-frequency PPP. PMID:28587305

  1. Wheelchair users' experience of non-adapted and adapted clothes during sailing, quad rugby or wheel-walking.

    PubMed

    Kratz, G; Söderback, I; Guidetti, S; Hultling, C; Rykatkin, T; Söderström, M

    1997-01-01

    The purpose of the present quasi-experimental post-test-design study was to compare 32 wheelchair users' (mostly para/tetraplegics) experience of wearing specially adapted clothes and non-adapted clothes for sailing, quad rugby or wheel-walking. Four existing assessment instruments were used: the Klein-Bell Activities of Daily Living Scale; a two-part Basic Information Questionnaire eliciting experience of effort, comfort and feeling of physical condition; the Experience Sampling Form for investigating the individuals' attitudes in terms of involvement and affective and activity mood states, and the Occupational Therapy Assessment of Leisure Time interview framework for collecting data about experience of leisure time. The wheelchair users all associated significantly greater comfort with use of the adapted clothes and, particularly the 'sailors', better physical condition. Overall, significantly greater involvement and more positive affect states were associated with the adapted clothes than with conventional garments, and mood state changed for the better. The wheelchair users set a higher priority upon work or leisure activities than upon independence in activities of daily living, and for this reason the Klein-Bell ratings showed great variation between the 'sailors' and the 'quad rugby players' (range 57%-93%), though these groups demonstrated more independence than the 'wheel-walkers'. The results of the study confirm the value of adapting sportswear for handicapped people. Such adaptations should also be of benefit for other activities than those studied.

  2. Many-core computing for space-based stereoscopic imaging

    NASA Astrophysics Data System (ADS)

    McCall, Paul; Torres, Gildo; LeGrand, Keith; Adjouadi, Malek; Liu, Chen; Darling, Jacob; Pernicka, Henry

    The potential benefits of using parallel computing in real-time visual-based satellite proximity operations missions are investigated. Improvements in performance and relative navigation solutions over single thread systems can be achieved through multi- and many-core computing. Stochastic relative orbit determination methods benefit from the higher measurement frequencies, allowing them to more accurately determine the associated statistical properties of the relative orbital elements. More accurate orbit determination can lead to reduced fuel consumption and extended mission capabilities and duration. Inherent to the process of stereoscopic image processing is the difficulty of loading, managing, parsing, and evaluating large amounts of data efficiently, which may result in delays or highly time consuming processes for single (or few) processor systems or platforms. In this research we utilize the Single-Chip Cloud Computer (SCC), a fully programmable 48-core experimental processor, created by Intel Labs as a platform for many-core software research, provided with a high-speed on-chip network for sharing information along with advanced power management technologies and support for message-passing. The results from utilizing the SCC platform for the stereoscopic image processing application are presented in the form of Performance, Power, Energy, and Energy-Delay-Product (EDP) metrics. Also, a comparison between the SCC results and those obtained from executing the same application on a commercial PC are presented, showing the potential benefits of utilizing the SCC in particular, and any many-core platforms in general for real-time processing of visual-based satellite proximity operations missions.

  3. Evaluation of the Intel RealSense SR300 camera for image-guided interventions and application in vertebral level localization

    NASA Astrophysics Data System (ADS)

    House, Rachael; Lasso, Andras; Harish, Vinyas; Baum, Zachary; Fichtinger, Gabor

    2017-03-01

    PURPOSE: Optical pose tracking of medical instruments is often used in image-guided interventions. Unfortunately, compared to commonly used computing devices, optical trackers tend to be large, heavy, and expensive devices. Compact 3D vision systems, such as Intel RealSense cameras, can capture 3D pose information at several magnitudes lower cost, size, and weight. We propose to use the Intel SR300 device for applications where it is not practical or feasible to use conventional trackers and where limited range and tracking accuracy are acceptable. We also put forward a vertebral level localization application utilizing the SR300 to reduce the risk of wrong-level surgery. METHODS: The SR300 was utilized as an object tracker by extending the PLUS toolkit to support data collection from RealSense cameras. The accuracy of the camera was tested by comparing it to a high-accuracy optical tracker. CT images of a lumbar spine phantom were obtained and used to create a 3D model in 3D Slicer. The SR300 was used to obtain a surface model of the phantom. Markers were attached to the phantom and a pointer and tracked using the Intel RealSense SDK's built-in object tracking feature. 3D Slicer was used to align the CT image with the phantom using landmark registration and to display the CT image overlaid on the optical image. RESULTS: Accuracy testing of the camera yielded a median position error of 3.3 mm (95th percentile 6.7 mm) and an orientation error of 1.6° (95th percentile 4.3°) in a 20x16x10 cm workspace, constantly maintaining proper marker orientation. The model and surface aligned correctly, demonstrating the vertebral level localization application. CONCLUSION: The SR300 may be usable for pose tracking in medical procedures where limited accuracy is acceptable. Initial results suggest the SR300 is suitable for vertebral level localization.

  4. Beyond core count: a look at new mainstream computing platforms for HEP workloads

    NASA Astrophysics Data System (ADS)

    Szostek, P.; Nowak, A.; Bitzes, G.; Valsan, L.; Jarp, S.; Dotti, A.

    2014-06-01

    As Moore's Law continues to deliver more and more transistors, the mainstream processor industry is preparing to expand its investments in areas other than simple core count. These new interests include deep integration of on-chip components, advanced vector units, memory, cache and interconnect technologies. We examine these moving trends with parallelized and vectorized High Energy Physics workloads in mind. In particular, we report on practical experience resulting from experiments with scalable HEP benchmarks on the Intel "Ivy Bridge-EP" and "Haswell" processor families. In addition, we examine the benefits of the new "Haswell" microarchitecture and its impact on multiple facets of HEP software. Finally, we report on the power efficiency of new systems.

  5. High spectral resolution lidar based on quad mach zehnder interferometer for aerosols and wind measurements on board space missions

    NASA Astrophysics Data System (ADS)

    Mariscal, Jean-François; Bruneau, Didier; Pelon, Jacques; Van Haecke, Mathilde; Blouzon, Frédéric; Montmessin, Franck; Chepfer, Hélène

    2018-04-01

    We present the measurement principle and the optical design of a Quad Mach-Zehnder (QMZ) interferometer as an HSRL technique, allowing simultaneous measurements of particle backscattering and wind velocity. Key features of this concept are that it operates with a multimodal laser and does not require any frequency stabilization. These features are especially relevant for space applications, for which a high technology readiness level is required.

  6. Evaluation of the OpenCL AES Kernel using the Intel FPGA SDK for OpenCL

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Jin, Zheming; Yoshii, Kazutomo; Finkel, Hal

    The OpenCL standard is an open programming model for accelerating algorithms on heterogeneous computing systems. OpenCL extends the C-based programming language for developing portable codes on different platforms such as CPUs, graphics processing units (GPUs), Digital Signal Processors (DSPs) and Field Programmable Gate Arrays (FPGAs). The Intel FPGA SDK for OpenCL is a suite of tools that allows developers to abstract away the complex FPGA-based development flow for a high-level software development flow. Users can focus on the design of hardware-accelerated kernel functions in OpenCL and then direct the tools to generate the low-level FPGA implementations. The approach makes FPGA-based development more accessible to software users as the need for hybrid computing using CPUs and FPGAs increases. It can also significantly reduce hardware development time, as users can evaluate different ideas with a high-level language without deep FPGA domain knowledge. In this report, we evaluate the performance of the AES kernel using the Intel FPGA SDK for OpenCL and a Nallatech 385A FPGA board. Compared to the M506 module, the board provides more hardware resources for a larger design exploration space. The kernel performance is measured with the compute kernel throughput, an upper bound to the FPGA throughput. The report presents the experimental results in detail. The Appendix lists the kernel source code.

  7. On the classification of mixed floating pollutants on the Yellow Sea of China by using a quad-polarized SAR image

    NASA Astrophysics Data System (ADS)

    Wang, Xiaochen; Shao, Yun; Tian, Wei; Li, Kun

    2018-06-01

    This study explored different methodologies, using a C-band RADARSAT-2 quad-polarized Synthetic Aperture Radar (SAR) image acquired over China's Yellow Sea, to investigate polarization decomposition parameters for identifying mixed floating pollutants against a complex ocean background. It was found that a single polarization decomposition did not meet the demands of detecting and classifying multiple floating pollutants, even with a quad-polarized SAR image. Furthermore, considering that the Yamaguchi decomposition is sensitive to vegetation and the algal variety Enteromorpha prolifera, while the H/A/alpha decomposition is sensitive to oil spills, a combination of parameters deduced from these two decompositions was proposed for marine environmental monitoring of mixed floating sea-surface pollutants. A combination of volume scattering, surface scattering, and scattering entropy was the best indicator for classifying mixed floating pollutants against a complex ocean background. The Kappa coefficients for Enteromorpha prolifera and oil spills were 0.7514 and 0.8470, respectively, evidence that the composite polarized parameters based on quad-polarized SAR imagery proposed in this research are an effective monitoring method for complex marine pollution.

  8. Design of a Miniaturized Langmuir Plasma Probe for the QuadSat/PnP

    NASA Astrophysics Data System (ADS)

    Landavazo, M.; Jorgensen, A. M.; Del Barga, C.; Ferguson, D.; Guillette, D.; Huynh, A.; Klepper, J.; Kuker, J.; Lyke, J. C.; Marohn, B.; Mason, J.; Quiroga, J.; Ravindran, V.; Yelton, C.; Zagrai, A. N.; Zufelt, B.

    2011-12-01

    We have developed a miniaturized Langmuir plasma probe for measuring plasma density in low-earth orbit. Measuring plasma density in the upper ionosphere is important as a diagnostic for the rest of the ionosphere and as an input to space weather forecasting models. Developing miniaturized instrumentation allows easier deployment of a large number of small satellites for monitoring space weather. Our instrument was designed for the Swedish QuadSat/PnP, with the following constraints: a volume constraint of 5x5x1.25 cm for the electronics enclosure, a mass budget of 100 g, and a power budget of 0.5 W. We met the volume and mass constraints and were able to use less power than budgeted, only 0.25 W. We designed the probe for a bias range of +/-15 V and current measurements in the 1 nA to 1 mA range (6 orders of magnitude). The necessary voltages of +/-15 V and 3.3 V were generated on board from a single 5 V supply. The electronics suite is based on carefully selected yet affordable commercial components that exhibit low noise, low leakage currents and low power consumption. Size constraints and low-noise and low-leakage requirements called for a carefully designed four-layer PCB with a properly guarded current path using surface-mount components on both sides. An ultra-low-power microcontroller handles instrument functionality and is fully controllable over I2C using SPA-1 space plug-and-play. We elected for a probe that is deployed at launch, which required careful design to survive launch vibrations while staying within the mass budget. The QuadSat/PnP has not been launched at the time of writing. We will present details of the instrument design and initial calibration data.

  9. Cache Sharing and Isolation Tradeoffs in Multicore Mixed-Criticality Systems

    DTIC Science & Technology

    2015-05-01

    of lockdown registers, to provide way-based partitioning. These alternatives are illustrated in Fig. 1 with respect to a quad-core ARM Cortex A9 ... presented a cache-partitioning scheme that allows multiple tasks to share the same cache partition on a single processor (as we do for Level-A and ... sets and determined the fraction that were schedulable on our target hardware platform, the quad-core ARM Cortex A9 machine mentioned earlier, the LLC

  10. Parallelization of MRCI based on hole-particle symmetry.

    PubMed

    Suo, Bing; Zhai, Gaohong; Wang, Yubin; Wen, Zhenyi; Hu, Xiangqian; Li, Lemin

    2005-01-15

    The parallel implementation of a multireference configuration interaction program based on hole-particle symmetry is described. The platform used to implement the parallelization is an Intel-architecture cluster consisting of 12 nodes, each of which is equipped with two 2.4-GHz Xeon processors, 3-GB memory, and a 36-GB disk, connected by a Gigabit Ethernet switch. The dependence of speedup on molecular symmetries and task granularities is discussed. Test calculations show that the speedup when the number of nodes is doubled is about 1.9 (for C1 and Cs), 1.65 (for C2v), and 1.55 (for D2h). The largest calculation performed on this cluster involves 5.6 x 10^8 CSFs.

  11. Student Intern Freed Competes at Intel ISEF, Two Others Awarded at Local Science Fair | Poster

    Cancer.gov

    Class of 2014–2015 Werner H. Kirsten (WHK) student intern Rebecca “Natasha” Freed earned a fourth-place award in biochemistry at the 2015 Intel International Science and Engineering Fair (ISEF), the largest high school science research competition in the world, according to the Society for Science & the Public’s website. Freed described the event as a “transformative experience,” where she was able to present her research to “experts, including Nobel laureates, as well as members of the general community and, of course, to [other students].”

  12. Student Intern Ben Freed Competes as Finalist in Intel STS Competition, Three Other Interns Named Semifinalists | Poster

    Cancer.gov

    By Ashley DeVine, Staff Writer Werner H. Kirsten (WHK) student intern Ben Freed was one of 40 finalists to compete in the Intel Science Talent Search (STS) in Washington, DC, in March. “It was seven intense days of interacting with amazing judges and incredibly smart and interesting students. We met President Obama, and then the MIT astronomy lab named minor planets after each of us,” Freed said of the competition.

  13. Applications Performance on NAS Intel Paragon XP/S - 15#

    NASA Technical Reports Server (NTRS)

    Saini, Subhash; Simon, Horst D.; Copper, D. M. (Technical Monitor)

    1994-01-01

    The Numerical Aerodynamic Simulation (NAS) Systems Division received an Intel Touchstone Sigma prototype model Paragon XP/S-15 in February 1993. The i860 XP microprocessor, with an integrated floating point unit and operating in dual-instruction mode, gives a peak performance of 75 million floating point operations per second (MFLOPS) for 64-bit floating point arithmetic. It is used in the Paragon XP/S-15 which has been installed at NAS, NASA Ames Research Center. The NAS Paragon has 208 nodes and its peak performance is 15.6 GFLOPS. Here, we report on early experience using the Paragon XP/S-15. We have tested its performance using both kernels and applications of interest to NAS. We have measured the performance of BLAS 1, 2, and 3, both assembly-coded and Fortran-coded, on the NAS Paragon XP/S-15. Furthermore, we have investigated the performance of a single-node one-dimensional FFT, a distributed two-dimensional FFT, and a distributed three-dimensional FFT. Finally, we measured the performance of the NAS Parallel Benchmarks (NPB) on the Paragon and compared it with the performance obtained on other highly parallel machines, such as the CM-5, CRAY T3D, IBM SP1, etc. In particular, we investigated the following issues, which can strongly affect the performance of the Paragon: a. Impact of the operating system: Intel currently uses as a default the OSF/1 AD operating system from the Open Software Foundation (OSF). Paging of the OSF server at 22 MB, done to make more memory available for the application, degrades performance. We found that when the limit of 26 MB per node out of the 32 MB available is reached, the application is paged out of main memory using virtual memory. When the application starts paging, the performance is considerably reduced. We found that dynamic memory allocation can help application performance under certain circumstances. b. Impact of data cache on the i860 XP: We measured the performance of the BLAS both assembly coded and Fortran

  14. Accuracy of a real-time surgical navigation system for the placement of quad zygomatic implants in the severe atrophic maxilla: A pilot clinical study.

    PubMed

    Hung, Kuo-Feng; Wang, Feng; Wang, Hao-Wei; Zhou, Wen-Jie; Huang, Wei; Wu, Yi-Qun

    2017-06-01

    A real-time surgical navigation system potentially increases the accuracy when used for quad zygomatic implant placement. To evaluate the accuracy of a real-time surgical navigation system when used for quad zygomatic implant placement. Patients with severely atrophic maxillae were prospectively recruited. Four trajectories for implants were planned, and zygomatic implants were placed using a real-time surgical navigation system. The planned-placed distance deviations at entry points (entry deviation) and exit points (exit deviation), and the angle deviation of axes (angle deviation), were measured on fused operation images. The differences of all the deviations between different groups, classified based on the lengths and locations of implants, were analysed. A P value of < 0.05 indicated statistical significance. Forty zygomatic implants were placed as planned in 10 patients. The entry deviation, exit deviation, and angle deviation were 1.35 ± 0.75 mm, 2.15 ± 0.95 mm, and 2.05 ± 1.02 degrees, respectively. The differences of all deviations were not significant, irrespective of the lengths (P = .259, .158, and .914, respectively) or locations of the placed implants (P = .698, .072, and .602, respectively). A real-time surgical navigation system used for the placement of quad zygomatic implants demonstrated a high level of accuracy with only minimal planned-placed deviations, irrespective of the lengths or locations of the implants. © 2017 Wiley Periodicals, Inc.

  15. Influence of Alveolar Bone Defects on the Stress Distribution in Quad Zygomatic Implant-Supported Maxillary Prosthesis.

    PubMed

    Duan, Yuanyuan; Chandran, Ravi; Cherry, Denise

    The purpose of this study was to create three-dimensional composite models of quad zygomatic implant-supported maxillary prostheses with a variety of alveolar bone defects around implant sites, and to investigate the stress distribution in the surrounding bone using the finite element analysis (FEA) method. Three-dimensional models of titanium zygomatic implants, maxillary prostheses, and human skulls were created and assembled using Mimics based on microcomputed tomography and cone beam computed tomography images. A variety of additional bone defects were created at the locations of four zygomatic implants to simulate multiple clinical scenarios. The volume meshes were created and exported into FEA software. Material properties were assigned respectively for all the structures, and von Mises stress data were collected and plotted in the postprocessing module. The maximum stress in the surrounding bone was located in the crestal bone around zygomatic implants. The maximum stress in the prostheses was located at the angled area of the implant-abutment connection. The model with anterior defects had a higher peak stress value than the model with posterior defects. All the models with additional bone defects had higher maximum stress values than the control model without additional bone loss. Additional alveolar bone loss has a negative influence on the stress concentration in the surrounding bone of quad zygomatic implant-supported prostheses. More care should be taken if these additional bone defects are at the sites of anterior zygomatic implants.

  16. Cache Sharing and Isolation Tradeoffs in Multicore Mixed-Criticality Systems

    DTIC Science & Technology

    2015-05-01

    form of lockdown registers, to provide way-based partitioning. These alternatives are illustrated in Fig. 1 with respect to a quad-core ARM Cortex A9 ... processor (as we do for Level-A and -B tasks), but they did not consider MC systems. Altmeyer et al. [1] considered uniprocessor scheduling on a system with a ... framework. We randomly generated task sets and determined the fraction that were schedulable on our target hardware platform, the quad-core ARM Cortex A9

  17. A performance study of sparse Cholesky factorization on INTEL iPSC/860

    NASA Technical Reports Server (NTRS)

    Zubair, M.; Ghose, M.

    1992-01-01

    The problem of Cholesky factorization of a sparse matrix has been very well investigated on sequential machines. A number of efficient codes exist for factorizing large unstructured sparse matrices. However, there is a lack of such efficient codes on parallel machines in general, and distributed machines in particular. Some of the issues that are critical to the implementation of sparse Cholesky factorization on a distributed memory parallel machine are ordering, partitioning and mapping, load balancing, and ordering of various tasks within a processor. Here, we focus on the effect of various partitioning schemes on the performance of sparse Cholesky factorization on the Intel iPSC/860. Also, a new partitioning heuristic for structured as well as unstructured sparse matrices is proposed, and its performance is compared with other schemes.
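
    As a fixed point of reference for the discussion above, a minimal serial, dense Cholesky factorization in C is sketched below; the paper itself concerns the far harder sparse, distributed-memory case, where ordering and partitioning dominate performance:

      /* Reference dense, serial Cholesky factorization A = L L^T, included
       * only to fix notation; the lower triangle of A (row-major, n x n)
       * is overwritten by L. Returns -1 if A is not positive definite. */
      #include <math.h>

      int cholesky(double *A, int n)
      {
          for (int j = 0; j < n; j++) {
              double d = A[j*n + j];
              for (int k = 0; k < j; k++)
                  d -= A[j*n + k] * A[j*n + k];
              if (d <= 0.0) return -1;          /* not positive definite */
              A[j*n + j] = sqrt(d);
              for (int i = j + 1; i < n; i++) { /* column j below the diagonal */
                  double s = A[i*n + j];
                  for (int k = 0; k < j; k++)
                      s -= A[i*n + k] * A[j*n + k];
                  A[i*n + j] = s / A[j*n + j];
              }
          }
          return 0;
      }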

  18. Modeling Surface Roughness to Estimate Surface Moisture Using Radarsat-2 Quad Polarimetric SAR Data

    NASA Astrophysics Data System (ADS)

    Nurtyawan, R.; Saepuloh, A.; Budiharto, A.; Wikantika, K.

    2016-08-01

    Microwave backscattering from the earth's surface depends on several parameters such as surface roughness and the dielectric constant of surface materials. These two parameters, related to water content and porosity, are crucial for estimating soil moisture. Soil moisture is an important parameter for ecological study and also a factor in maintaining the energy balance of the land surface and atmosphere. Direct roughness measurements over a large area require extra time and cost. The heterogeneity of roughness scales in applications such as hydrology, climate, and ecology is a problem that can lead to modeling inaccuracies. In this study, we modeled surface roughness using Radarsat-2 quad Polarimetric Synthetic Aperture Radar (PolSAR) data. Statistical approaches to field roughness measurements were used to generate an appropriate roughness model. The modeling uses a physical SAR approach that predicts the radar backscattering coefficient in terms of the radar configuration (wavelength, polarization, and incidence angle) and soil parameters (surface roughness and dielectric constant). The surface roughness value is calculated using a modified version of the Campbell and Shepard (1996) model. The modification was applied by incorporating the backscattering coefficients (σ°) of the quad polarizations HH, HV, and VV. To obtain an empirical surface roughness model from SAR backscattering intensity, we used forty-five sample points from field roughness measurements. We selected paddy fields in Indramayu district, West Java, Indonesia as the study area. This area was selected due to the intensive decrease in rice productivity in the Northern Coast region of West Java. A third-degree polynomial is the most suitable data fit, with a coefficient of determination (R2) of about 0.82 and an RMSE of about 1.18 cm. Therefore, this model is used as the basis to generate the map of surface roughness.
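
    A sketch of how such a fitted third-degree polynomial would be evaluated is given below; the coefficients are hypothetical placeholders, since the paper's fitted values are not reproduced here:

      /* Evaluating a fitted third-degree polynomial mapping a SAR
       * backscattering coefficient (dB) to surface roughness (cm), as in
       * the empirical model above. The coefficients c[] are hypothetical
       * placeholders, not the paper's fitted values. */
      double roughness_from_sigma0(double sigma0_db)
      {
          /* hypothetical c0 + c1*x + c2*x^2 + c3*x^3 */
          const double c[4] = { 2.10, -0.35, 0.021, -0.0004 };
          double x = sigma0_db, y = c[3];
          /* Horner's rule for numerically stable evaluation */
          y = y * x + c[2];
          y = y * x + c[1];
          y = y * x + c[0];
          return y;
      }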

  19. 45-110 GHz Quad-Ridge Horn With Stable Gain and Symmetric Beam

    NASA Astrophysics Data System (ADS)

    Manafi, Sara; Al-Tarifi, Muhannad; Filipovic, Dejan S.

    2017-09-01

    A quad-ridge horn antenna with stabilized gain and minimum difference between E- and H-plane half-power beamwidths (HPBWs) is demonstrated for operation over the 45-110 GHz bandwidth. Multistep flaring and corrugations on a finite ground plane are applied to obtain stable radiation patterns with 16-dBi minimum gain over the entire range. The computational studies are validated through measurements of a 3-D printed prototype using the direct metal laser sintering (DMLS) process. Accurate fabrication, with an achieved surface roughness of < 1.7 μm, is verified with a digital microscope. The obtained gain variation, VSWR, and HPBW variation with rotation and over the 45-110 GHz bandwidth are below 1.7 dB, 1.7:1, and 9°, respectively. This work demonstrates that DMLS is a viable fabrication process for wideband horn antennas at millimeter-wave frequencies.

  20. Development of seismic tomography software for hybrid supercomputers

    NASA Astrophysics Data System (ADS)

    Nikitin, Alexandr; Serdyukov, Alexandr; Duchkov, Anton

    2015-04-01

    Seismic tomography is a technique for computing a velocity model of a geologic structure from the first-arrival travel times of seismic waves. The technique is used in the processing of regional and global seismic data, in seismic exploration for prospecting of mineral and hydrocarbon deposits, and in seismic engineering for monitoring the condition of engineering structures and the surrounding host medium. As a consequence of the development of seismic monitoring systems and the increasing volume of seismic data, there is a growing need for new, more effective computational algorithms for seismic tomography applications, with improved performance, accuracy, and resolution. To achieve this goal, it is necessary to use modern high performance computing systems, such as supercomputers with hybrid architecture that use not only CPUs, but also accelerators and co-processors for computation. The goal of this research is the development of parallel seismic tomography algorithms and a software package for such systems, to be used in processing large volumes of seismic data (hundreds of gigabytes and more). These algorithms and the software package will be optimized for the most common computing devices used in modern hybrid supercomputers, such as Intel Xeon CPUs, NVIDIA Tesla accelerators, and Intel Xeon Phi co-processors. In this work, the following general scheme of seismic tomography is utilized. Using an eikonal equation solver, arrival times of seismic waves are computed based on an assumed velocity model of the geologic structure being analyzed. To solve the linearized inverse problem, a tomographic matrix is computed that connects model adjustments with travel time residuals, and the resulting system of linear equations is regularized and solved to adjust the model. The effectiveness of parallel implementations of existing algorithms on the target architectures is considered. During the first stage of this work, algorithms were developed for execution on
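
    A minimal sketch of the model-update step in this scheme, assuming a dense tomographic matrix G and simple gradient descent on a Tikhonov-regularized least-squares objective (an actual package would use more sophisticated regularized solvers):

      /* Given a tomographic matrix G (n_rays x n_cells, row-major) linking
       * model adjustments dm to travel time residuals dt, take gradient
       * steps on ||G dm - dt||^2 + lambda ||dm||^2. Dimensions, step size,
       * and iteration count are illustrative assumptions. */
      #include <stdlib.h>

      void tomo_update(const double *G, const double *dt, double *dm,
                       int n_rays, int n_cells, double lambda,
                       double step, int iters)
      {
          double *r = malloc(n_rays * sizeof *r);
          for (int it = 0; it < iters; it++) {
              /* residuals r = G*dm - dt */
              for (int i = 0; i < n_rays; i++) {
                  r[i] = -dt[i];
                  for (int k = 0; k < n_cells; k++)
                      r[i] += G[i*n_cells + k] * dm[k];
              }
              /* gradient step: dm -= step * (G^T r + lambda*dm) */
              for (int j = 0; j < n_cells; j++) {
                  double g = lambda * dm[j];
                  for (int i = 0; i < n_rays; i++)
                      g += G[i*n_cells + j] * r[i];
                  dm[j] -= step * g;
              }
          }
          free(r);
      }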

  1. Quad-phased data mining modeling for dementia diagnosis.

    PubMed

    Bang, Sunjoo; Son, Sangjoon; Roh, Hyunwoong; Lee, Jihye; Bae, Sungyun; Lee, Kyungwon; Hong, Changhyung; Shin, Hyunjung

    2017-05-18

    The number of people with dementia is increasing as populations age worldwide. Therefore, there is a variety of research on improving the dementia diagnosis process in the field of computer-aided diagnosis (CAD) technology. The most significant issue is that the evaluation process performed by physicians, which is based on medical information about patients and questionnaires from their guardians, is time consuming, subjective, and prone to error. This problem can be addressed by an overall data mining model that supports the intuitive decisions of clinicians. Therefore, in this paper we propose a quad-phased data mining model consisting of 4 modules. In the Proposer Module, significant diagnostic criteria are selected that are effective for diagnostics. Then, in the Predictor Module, a model is constructed to predict and diagnose dementia based on a machine learning algorithm. To help clinical physicians better understand the results of the predictive model, the Descriptor Module interprets the causes of the diagnoses by profiling patient groups. Lastly, the Visualization Module provides visualization to effectively explore the characteristics of patient groups. The proposed model is applied to the CREDOS study, which contains clinical data collected from 37 university-affiliated hospitals in the Republic of Korea from 2005 to 2013. This research provides an intelligent system enabling intuitive collaboration between a CAD system and physicians. The improved evaluation process can also effectively reduce time and cost for clinicians and patients.

  2. Accelerated Application Development: The ORNL Titan Experience

    DOE PAGES

    Joubert, Wayne; Archibald, Richard K.; Berrill, Mark A.; ...

    2015-05-09

    The use of computational accelerators such as NVIDIA GPUs and Intel Xeon Phi processors is now widespread in the high performance computing community, with many applications delivering impressive performance gains. However, programming these systems for high performance, performance portability and software maintainability has been a challenge. In this paper we discuss experiences porting applications to the Titan system. Titan, which began planning in 2009 and was deployed for general use in 2013, was the first multi-petaflop system based on accelerator hardware. To ready applications for accelerated computing, a preparedness effort was undertaken prior to delivery of Titan. In this paper we report experiences and lessons learned from this process and describe how users are currently making use of computational accelerators on Titan.

  3. Accelerated application development: The ORNL Titan experience

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Joubert, Wayne; Archibald, Rick; Berrill, Mark

    2015-08-01

    The use of computational accelerators such as NVIDIA GPUs and Intel Xeon Phi processors is now widespread in the high performance computing community, with many applications delivering impressive performance gains. However, programming these systems for high performance, performance portability and software maintainability has been a challenge. In this paper we discuss experiences porting applications to the Titan system. Titan, which began planning in 2009 and was deployed for general use in 2013, was the first multi-petaflop system based on accelerator hardware. To ready applications for accelerated computing, a preparedness effort was undertaken prior to delivery of Titan. In this paper we report experiences and lessons learned from this process and describe how users are currently making use of computational accelerators on Titan.

  4. PyPWA: A partial-wave/amplitude analysis software framework

    NASA Astrophysics Data System (ADS)

    Salgado, Carlos

    2016-05-01

    The PyPWA project aims to develop a software framework for Partial Wave and Amplitude Analysis of data, providing the user with software tools to identify resonances from multi-particle final states in photoproduction. Most of the code is written in Python. The software is divided into two main branches: one general shell where amplitude parameters (or any parametric model) are estimated from the data. This branch also includes software to produce simulated data sets using the fitted amplitudes. A second branch contains a specific realization of the isobar model (with room to include Deck-type and other isobar model extensions) to perform PWA with an interface into the computing resources at Jefferson Lab. We are currently implementing parallelism and vectorization using Intel's Xeon Phi family of coprocessors.

  5. Scalability of a Low-Cost Multi-Teraflop Linux Cluster for High-End Classical Atomistic and Quantum Mechanical Simulations

    NASA Technical Reports Server (NTRS)

    Kikuchi, Hideaki; Kalia, Rajiv K.; Nakano, Aiichiro; Vashishta, Priya; Shimojo, Fuyuki; Saini, Subhash

    2003-01-01

    Scalability of a low-cost, Intel Xeon-based, multi-Teraflop Linux cluster is tested for two high-end scientific applications: classical atomistic simulation based on the molecular dynamics method and quantum mechanical calculation based on density functional theory. These scalable parallel applications use space-time multiresolution algorithms and feature computational-space decomposition, wavelet-based adaptive load balancing, and space-filling-curve-based data compression for scalable I/O. Comparative performance tests are performed on a 1,024-processor Linux cluster and a conventional higher-end parallel supercomputer, a 1,184-processor IBM SP4. The results show that the performance of the Linux cluster is comparable to that of the SP4. We also study various effects, such as the sharing of memory and L2 cache among processors, on the performance.

  6. Early Experiences Writing Performance Portable OpenMP 4 Codes

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Joubert, Wayne; Hernandez, Oscar R

    In this paper, we evaluate the recently available directives in OpenMP 4 to parallelize a computational kernel using both the traditional shared memory approach and the newer accelerator targeting capabilities. In addition, we explore various transformations that attempt to increase application performance portability, and examine the expressiveness and performance implications of using these approaches. For example, we want to understand if the target map directives in OpenMP 4 improve data locality when mapped to a shared memory system, as opposed to the traditional first-touch policy approach in traditional OpenMP. To that end, we use recent Cray and Intel compilers to measure the performance variations of a simple application kernel when executed on the OLCF's Titan supercomputer with NVIDIA GPUs and the Beacon system with Intel Xeon Phi accelerators attached. To better understand these trade-offs, we compare our results from traditional OpenMP shared memory implementations to the newer accelerator programming model when it is used to target both the CPU and an attached heterogeneous device. We believe the results and lessons learned as presented in this paper will be useful to the larger user community by providing guidelines that can assist programmers in the development of performance portable code.
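
    A minimal example of the two styles being compared, assuming an OpenMP 4 compiler; the kernel is illustrative, not the one benchmarked in the paper:

      /* The same loop written with traditional shared-memory OpenMP and
       * with the OpenMP 4 target/map offload directives discussed above. */
      #include <stdio.h>

      #define N 1000000

      int main(void)
      {
          static float x[N], y[N];
          for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

          /* traditional shared-memory parallel for */
          #pragma omp parallel for
          for (int i = 0; i < N; i++)
              y[i] += 2.0f * x[i];

          /* OpenMP 4 offload: map clauses control data movement to the
           * device (GPU or Xeon Phi); how they behave when mapped back to
           * a shared-memory host is one question the paper examines */
          #pragma omp target map(to: x[0:N]) map(tofrom: y[0:N])
          #pragma omp teams distribute parallel for
          for (int i = 0; i < N; i++)
              y[i] += 2.0f * x[i];

          printf("y[0] = %f\n", y[0]);
          return 0;
      }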

  7. Optimization of atmospheric transport models on HPC platforms

    NASA Astrophysics Data System (ADS)

    de la Cruz, Raúl; Folch, Arnau; Farré, Pau; Cabezas, Javier; Navarro, Nacho; Cela, José María

    2016-12-01

    The performance and scalability of atmospheric transport models on high performance computing environments is often far from optimal for multiple reasons including, for example, sequential input and output, synchronous communications, workload imbalance, memory access latency, or lack of task overlapping. We investigate how different software optimizations and porting to non-general-purpose hardware architectures improve code scalability and execution times considering, as an example, the FALL3D volcanic ash transport model. To this end, we implement the FALL3D model equations in the WARIS framework, a software designed from scratch to solve different geoscience problems in a parallel and efficient way on a wide variety of architectures. In addition, we consider further improvements in WARIS such as hybrid MPI-OMP parallelization, spatial blocking, auto-tuning, and thread affinity. Considering all these aspects together, the FALL3D execution times for a realistic test case running on general-purpose cluster architectures (Intel Sandy Bridge) decrease by a factor of between 7 and 40 depending on the grid resolution. Finally, we port the application to Intel Xeon Phi (MIC) and NVIDIA GPU (CUDA) accelerator-based architectures and compare performance, cost and power consumption on all the architectures. Implications for time-constrained operational model configurations are discussed.
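
    Of the optimizations listed, spatial blocking is the easiest to illustrate; a minimal sketch on a generic 2-D stencil sweep, where tile sizes are illustrative assumptions and this is not FALL3D code:

      /* Spatial (cache) blocking on a generic 2-D stencil sweep: iterating
       * over tiles that fit in cache improves locality. TI/TJ are
       * illustrative tile sizes that would be auto-tuned in practice. */
      #define TI 64
      #define TJ 64

      void stencil_blocked(const double *in, double *out, int nx, int ny)
      {
          for (int jj = 1; jj < ny - 1; jj += TJ) {
              int jmax = jj + TJ < ny - 1 ? jj + TJ : ny - 1;
              for (int ii = 1; ii < nx - 1; ii += TI) {
                  int imax = ii + TI < nx - 1 ? ii + TI : nx - 1;
                  for (int j = jj; j < jmax; j++)       /* sweep one tile */
                      for (int i = ii; i < imax; i++)
                          out[j*nx + i] = 0.25 * (in[j*nx + i-1] + in[j*nx + i+1]
                                                + in[(j-1)*nx + i] + in[(j+1)*nx + i]);
              }
          }
      }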

  8. Optimizing Approximate Weighted Matching on Nvidia Kepler K40

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Naim, Md; Manne, Fredrik; Halappanavar, Mahantesh

    Matching is a fundamental graph problem with numerous applications in science and engineering. While algorithms for computing optimal matchings are difficult to parallelize, approximation algorithms on the other hand generally compute high quality solutions and are amenable to parallelization. In this paper, we present efficient implementations of the current best algorithm for half-approximate weighted matching, the Suitor algorithm, on the Nvidia Kepler K40 platform. We develop four variants of the algorithm that exploit hardware features to address key challenges for a GPU implementation. We also experiment with different combinations of work assigned to a warp. Using an exhaustive set of 269 inputs, we demonstrate that the new implementation outperforms the previous best GPU algorithm by 10x to 100x for over 100 instances, and from 100x to 1000x for 15 instances. We also demonstrate up to 20x speedup relative to 2 threads, and up to 5x relative to 16 threads, on an Intel Xeon platform with 16 cores for the same algorithm. The new algorithms and implementations provided in this paper will have a direct impact on several applications that repeatedly use matching as a key compute kernel. Further, the algorithm designs and insights provided in this paper will benefit other researchers implementing graph algorithms on modern GPU architectures.

  9. Efficient Approximation Algorithms for Weighted $b$-Matching

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Khan, Arif; Pothen, Alex; Mostofa Ali Patwary, Md.

    2016-01-01

    We describe a half-approximation algorithm, b-Suitor, for computing a b-Matching of maximum weight in a graph with weights on the edges. b-Matching is a generalization of the well-known Matching problem in graphs, where the objective is to choose a subset M of edges in the graph such that at most a specified number b(v) of edges in M are incident on each vertex v. Subject to this restriction we maximize the sum of the weights of the edges in M. We prove that the b-Suitor algorithm computes the same b-Matching as the one obtained by the greedy algorithm for the problem. We implement the algorithm on serial and shared-memory parallel processors, and compare its performance against a collection of approximation algorithms that have been proposed for the Matching problem. Our results show that the b-Suitor algorithm outperforms the Greedy and Locally Dominant edge algorithms by one to two orders of magnitude on a serial processor. The b-Suitor algorithm has a high degree of concurrency, and it scales well up to 240 threads on a shared memory multiprocessor. The b-Suitor algorithm outperforms the Locally Dominant edge algorithm by a factor of fourteen on 16 cores of an Intel Xeon multiprocessor.
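
    Since b-Suitor provably computes the same b-Matching as the greedy algorithm, the greedy procedure makes a compact reference; a minimal serial sketch in C, with illustrative edge and graph types:

      /* Greedy half-approximation for weighted b-Matching: scan edges in
       * order of decreasing weight and keep an edge if both endpoints
       * still have remaining capacity b(v). Types are illustrative. */
      #include <stdlib.h>

      typedef struct { int u, v; double w; } edge;

      static int by_weight_desc(const void *a, const void *b)
      {
          double wa = ((const edge *)a)->w, wb = ((const edge *)b)->w;
          return (wa < wb) - (wa > wb);   /* heavier edges first */
      }

      /* cap[v] holds b(v) and is consumed; sel[i] marks whether the i-th
       * edge (in sorted order) is in the matching */
      void greedy_b_matching(edge *E, int m, int *cap, int *sel)
      {
          qsort(E, m, sizeof *E, by_weight_desc);
          for (int i = 0; i < m; i++) {
              sel[i] = 0;
              if (cap[E[i].u] > 0 && cap[E[i].v] > 0) {
                  sel[i] = 1;
                  cap[E[i].u]--;
                  cap[E[i].v]--;
              }
          }
      }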

  10. Evaluation of the Intel iWarp parallel processor for space flight applications

    NASA Technical Reports Server (NTRS)

    Hine, Butler P., III; Fong, Terrence W.

    1993-01-01

    The potential of a DARPA-sponsored advanced processor, the Intel iWarp, for use in future SSF Data Management Systems (DMS) upgrades is evaluated through integration into the Ames DMS testbed and applications testing. The iWarp is a distributed, parallel computing system well suited for high performance computing applications such as matrix operations and image processing. The system architecture is modular, supports systolic and message-based computation, and is capable of providing massive computational power in a low-cost, low-power package. As a consequence, the iWarp offers significant potential for advanced space-based computing. This research seeks to determine the iWarp's suitability as a processing device for space missions. In particular, the project focuses on evaluating the ease of integrating the iWarp into the SSF DMS baseline architecture and the iWarp's ability to support computationally stressing applications representative of SSF tasks.

  11. SpaceCubeX: A Framework for Evaluating Hybrid Multi-Core CPU FPGA DSP Architectures

    NASA Technical Reports Server (NTRS)

    Schmidt, Andrew G.; Weisz, Gabriel; French, Matthew; Flatley, Thomas; Villalpando, Carlos Y.

    2017-01-01

    The SpaceCubeX project is motivated by the need for high performance, modular, and scalable on-board processing to help scientists answer critical 21st century questions about global climate change, air quality, ocean health, and ecosystem dynamics, while adding new capabilities such as low-latency data products for extreme event warnings. These goals translate into on-board processing throughput requirements that are on the order of 100-1,000 times greater than those of previous Earth Science missions for standard processing, compression, storage, and downlink operations. To study possible future architectures to achieve these performance requirements, the SpaceCubeX project provides an evolvable testbed and framework that enables a focused design space exploration of candidate hybrid CPU/FPGA/DSP processing architectures. The framework includes ArchGen, an architecture generator tool populated with candidate architecture components, performance models, and IP cores, that allows an end user to specify the type, number, and connectivity of a hybrid architecture. The framework requires minimal extensions to integrate new processors, such as the anticipated High Performance Spaceflight Computer (HPSC), reducing the time to initiate benchmarking by months. To evaluate the framework, we leverage a wide suite of high performance embedded computing benchmarks and Earth science scenarios to ensure robust architecture characterization. We report on our project's Year 1 efforts and demonstrate the capabilities across four simulation testbed models: a baseline SpaceCube 2.0 system, a dual ARM A9 processor system, a hybrid quad ARM A53 and FPGA system, and a hybrid quad ARM A53 and DSP system.

  12. Computing effective properties of random heterogeneous materials on heterogeneous parallel processors

    NASA Astrophysics Data System (ADS)

    Leidi, Tiziano; Scocchi, Giulio; Grossi, Loris; Pusterla, Simone; D'Angelo, Claudio; Thiran, Jean-Philippe; Ortona, Alberto

    2012-11-01

    In recent decades, finite element (FE) techniques have been extensively used for predicting effective properties of random heterogeneous materials. In the case of very complex microstructures, the choice of numerical methods for the solution of this problem can offer some advantages over classical analytical approaches, and it allows the use of digital images obtained from real material samples (e.g., using computed tomography). On the other hand, having a large number of elements is often necessary for properly describing complex microstructures, ultimately leading to extremely time-consuming computations and high memory requirements. With the final objective of reducing these limitations, we improved an existing freely available FE code for the computation of effective conductivity (electrical and thermal) of microstructure digital models. To allow execution on hardware combining multi-core CPUs and a GPU, we first translated the original algorithm from Fortran to C, and we subdivided it into software components. Then, we enhanced the C version of the algorithm for parallel processing with heterogeneous processors. With the goal of maximizing the obtained performance and limiting resource consumption, we utilized a software architecture based on stream processing, event-driven scheduling, and dynamic load balancing. The parallel processing version of the algorithm has been validated using a simple microstructure consisting of a single sphere located at the centre of a cubic box, yielding consistent results. Finally, the code was used for the calculation of the effective thermal conductivity of a digital model of a real sample (a ceramic foam obtained using X-ray computed tomography). On a computer equipped with dual hexa-core Intel Xeon X5670 processors and an NVIDIA Tesla C2050, the parallel application version features a near-linear speed-up progression when using only the CPU cores. It executes more than 20 times faster when additionally using the GPU.

  13. Plasma Science and Applications at the Intel Science Fair: A Retrospective

    NASA Astrophysics Data System (ADS)

    Berry, Lee

    2009-11-01

    For the past five years, the Coalition for Plasma Science (CPS) has presented an award for a plasma project at the Intel International Science and Engineering Fair (ISEF). Eligible projects have ranged from grape-based plasma production in a microwave oven to observation of the effects of viscosity in a fluid model of quark-gluon plasma. Most projects have been aimed at applications, including fusion, thrusters, lighting, materials processing, and GPS improvements. However diagnostics (spectroscopy), technology (magnets), and theory (quark-gluon plasmas) have also been represented. All of the CPS award-winning projects so far have been based on experiments, with two awards going to women students and three to men. Since the award was initiated, both the number and quality of plasma projects has increased. The CPS expects this trend to continue, and looks forward to continuing its work with students who are excited about the possibilities of plasma. You too can share this excitement by judging at the 2010 fair in San Jose on May 11-12.

  14. Fine-grained parallelism accelerating for RNA secondary structure prediction with pseudoknots based on FPGA.

    PubMed

    Xia, Fei; Jin, Guoqing

    2014-06-01

    PKNOTS is one of the most famous benchmark programs and has been widely used to predict RNA secondary structure including pseudoknots. It adopts the standard four-dimensional (4D) dynamic programming (DP) method and is the basis of many variants and improved algorithms. Unfortunately, the O(N^6) computing requirements and complicated data dependencies greatly limit the usefulness of the PKNOTS package given the explosion in gene database size. In this paper, we present a fine-grained parallel PKNOTS package and prototype system for accelerating the RNA folding application based on an FPGA chip. We adopted a series of storage optimization strategies to resolve the "memory wall" problem. We aggressively exploit parallel computing strategies to improve computational efficiency. We also propose several methods that collectively reduce the storage requirements for FPGA on-chip memory. To the best of our knowledge, our design is the first FPGA implementation for accelerating the 4D DP problem for RNA folding including pseudoknots. The experimental results show a factor of more than 50x average speedup over the PKNOTS-1.08 software running on a PC platform with an Intel Core2 Quad Q9400 CPU for input RNA sequences. Moreover, the power consumption of our FPGA accelerator is only about 50% of that of the general-purpose microprocessor.
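
    To illustrate the interval-DP dependencies that make such recurrences hard to parallelize, a minimal sketch of the much simpler O(N^3) Nussinov base-pairing recurrence is shown below; this is a didactic relative of, not a substitute for, the 4D DP in PKNOTS:

      /* Nussinov-style interval DP maximizing base pairs: M(i,j) depends on
       * all shorter intervals inside [i,j], the dependency pattern that
       * (in its 4-D pseudoknot form) PKNOTS parallelizes on FPGA.
       * M is an n*n array pre-zeroed by the caller; n >= 1. */
      static int pair_ok(char a, char b)
      {
          return (a=='A' && b=='U') || (a=='U' && b=='A') ||
                 (a=='G' && b=='C') || (a=='C' && b=='G');
      }

      int nussinov(const char *s, int n, int *M)
      {
          for (int len = 1; len < n; len++)        /* interval span */
              for (int i = 0; i + len < n; i++) {
                  int j = i + len;
                  int best = M[i*n + (j-1)];       /* j left unpaired */
                  for (int k = i; k < j; k++)      /* pair j with k */
                      if (pair_ok(s[k], s[j])) {
                          int left  = (k > i)     ? M[i*n + (k-1)]     : 0;
                          int inner = (k + 1 < j) ? M[(k+1)*n + (j-1)] : 0;
                          if (left + inner + 1 > best)
                              best = left + inner + 1;
                      }
                  M[i*n + j] = best;
              }
          return M[n - 1];                         /* M(0, n-1) */
      }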

  15. Probabilistic Assessment of Soil Moisture using C-band Quad-polarized Remote Sensing Data from RISAT1

    NASA Astrophysics Data System (ADS)

    Pal, Manali; Suman, Mayank; Das, Sarit Kumar; Maity, Rajib

    2017-04-01

    Information on the spatio-temporal distribution of surface Soil Moisture Content (SMC) is essential in several hydrological, meteorological, and agricultural applications. Active microwave remote sensing data have become increasingly important for large-scale estimation of surface SMC because of their ability to monitor the spatial and temporal variation of surface SMC at regional, continental, and global scales at reasonably fine spatial and temporal resolution. The use of Synthetic Aperture Radar (SAR) has high potential for catchment-scale applications due to its high spatial resolution (~10-20 m), both for vegetated and bare soil surfaces, as well as its all-weather, day-and-night characteristics. However, one prime disadvantage of SAR is that its signal responds to SMC along with Land Use Land Cover (LULC) and surface roughness conditions, making the retrieval of SMC from SAR data an "ill-posed" problem. Moreover, the quantification of uncertainty due to inappropriate surface roughness characterization, soil texture, inversion techniques, etc., even in the latest established retrieval methods, is little explored. This paper reports a recently developed method to estimate surface SMC with a probabilistic assessment of the uncertainty associated with the estimation (Pal et al., 2016). Quad-polarized SAR data from Radar Imaging Satellite 1 (RISAT1), launched in 2012 by the Indian Space Research Organisation (ISRO), and information on LULC regarding bareland and vegetated land (<30 cm height) are used in the estimation, exploiting the potential of multivariate probabilistic assessment through copulas. The salient features of the study are: 1) development of a combined index to understand the role of all the quad-polarized backscattering coefficients and soil texture information in SMC estimation; 2) applicability of the model for different incidence angles using the normalized incidence angle theory proposed by Zribi et al. (2005); and 3) assessment of the uncertainty range of the

  16. Parallelizing ATLAS Reconstruction and Simulation: Issues and Optimization Solutions for Scaling on Multi- and Many-CPU Platforms

    NASA Astrophysics Data System (ADS)

    Leggett, C.; Binet, S.; Jackson, K.; Levinthal, D.; Tatarkhanov, M.; Yao, Y.

    2011-12-01

    Thermal limitations have forced CPU manufacturers to shift from simply increasing clock speeds to improve processor performance to producing chip designs with multi- and many-core architectures. Further, the cores themselves can run multiple threads with a zero-overhead context switch, allowing low-level resource sharing (Intel Hyperthreading). To maximize bandwidth and minimize memory latency, memory access has become non-uniform (NUMA). As manufacturers add more cores to each chip, a careful understanding of the underlying architecture is required in order to fully utilize the available resources. We present AthenaMP and the ATLAS event loop manager, the driver of the simulation and reconstruction engines, which have been rewritten to make use of multiple cores by means of event-based parallelism and final-stage I/O synchronization. However, initial studies on 8 and 16 core Intel architectures have shown marked non-linearities as parallel process counts increase, with as much as 30% reductions in event throughput in some scenarios. Since the Intel Nehalem architecture (both Gainestown and Westmere) will be the most common choice for the next round of hardware procurements, an understanding of these scaling issues is essential. Using hardware-based event counters and Intel's Performance Tuning Utility, we have studied the performance bottlenecks at the hardware level and discovered optimization schemes to maximize processor throughput. We have also produced optimization mechanisms, common to all large experiments, that address the extreme nature of today's HEP code, which, due to its size, places huge burdens on the memory infrastructure of today's processors.

  17. Revised Masses and Densities of the Planets around Kepler-10

    NASA Astrophysics Data System (ADS)

    Weiss, Lauren M.; Rogers, Leslie A.; Isaacson, Howard T.; Agol, Eric; Marcy, Geoffrey W.; Rowe, Jason F.; Kipping, David; Fulton, Benjamin J.; Lissauer, Jack J.; Howard, Andrew W.; Fabrycky, Daniel

    2016-03-01

    Determining which small exoplanets have stony-iron compositions is necessary for quantifying the occurrence of such planets and for understanding the physics of planet formation. Kepler-10 hosts the stony-iron world Kepler-10b, and also contains what has been reported to be the largest solid silicate-ice planet, Kepler-10c. Using 220 radial velocities (RVs), including 72 precise RVs from Keck-HIRES of which 20 are new from 2014 to 2015, and 17 quarters of Kepler photometry, we obtain the most complete picture of the Kepler-10 system to date. We find that Kepler-10b (Rp = 1.47 R⊕) has mass 3.72 ± 0.42 M⊕ and density 6.46 ± 0.73 g cm^-3. Modeling the interior of Kepler-10b as an iron core overlaid with a silicate mantle, we find that the iron core constitutes 0.17 ± 0.11 of the planet mass. For Kepler-10c (Rp = 2.35 R⊕) we measure mass 13.98 ± 1.79 M⊕ and density 5.94 ± 0.76 g cm^-3, significantly lower than the mass computed in Dumusque et al. (17.2 ± 1.9 M⊕). Our mass measurement of Kepler-10c rules out a pure stony-iron composition. Internal compositional modeling reveals that at least 10% of the radius of Kepler-10c is a volatile envelope composed of hydrogen-helium (0.2% of the mass, 16% of the radius) or super-ionic water (28% of the mass, 29% of the radius). However, we note that analysis of only HIRES data yields a higher mass for planet b and a lower mass for planet c than does analysis of the HARPS-N data alone, with the mass estimates for Kepler-10c being formally inconsistent at the 3σ level. Moreover, dividing the data for each instrument into two parts also leads to somewhat inconsistent measurements for the mass of planet c derived from each observatory. Together, this suggests that time-correlated noise is present and that the uncertainties in the masses of the planets (especially planet c) likely

  18. Human Factors Assessment of the U.S. Naval Research Laboratory Limb Protection Program (QuadGard Phase 3 Pre-Pilot Production Design)

    DTIC Science & Technology

    2011-09-01

    ... the user's hand must be inserted toward the elbow area of the QuadGard III sleeve. This issue was overcome by equipment familiarization and practice ...

  19. Hexa Helix: Modified Quad Helix Appliance to Correct Anterior and Posterior Crossbites in Mixed Dentition

    PubMed Central

    Yaseen, Syed Mohammed; Acharya, Ravindranath

    2012-01-01

    Among the commonly encountered dental irregularities which constitute developing malocclusion is the crossbite. During primary and mixed dentition phase, the crossbite is seen very often and if left untreated during these phases then a simple problem may be transformed into a more complex problem. Different techniques have been used to correct anterior and posterior crossbites in mixed dentition. This case report describes the use of hexa helix, a modified version of quad helix for the management of anterior crossbite and bilateral posterior crossbite in early mixed dentition. Correction was achieved within 15 weeks with no damage to the tooth or the marginal periodontal tissue. The procedure is a simple and effective method for treating anterior and bilateral posterior crossbites simultaneously. PMID:23119188

  20. Measurements of the LHCb software stack on the ARM architecture

    NASA Astrophysics Data System (ADS)

    Vijay Kartik, S.; Couturier, Ben; Clemencic, Marco; Neufeld, Niko

    2014-06-01

    The ARM architecture is a power-efficient design that is used in most processors in mobile devices all around the world today since they provide reasonable compute performance per watt. The current LHCb software stack is designed (and thus expected) to build and run on machines with the x86/x86_64 architecture. This paper outlines the process of measuring the performance of the LHCb software stack on the ARM architecture - specifically, the ARMv7 architecture on Cortex-A9 processors from NVIDIA and on full-fledged ARM servers with chipsets from Calxeda - and makes comparisons with the performance on x86_64 architectures on the Intel Xeon L5520/X5650 and AMD Opteron 6272. The paper emphasises the aspects of performance per core with respect to the power drawn by the compute nodes for the given performance - this ensures a fair real-world comparison with much more 'powerful' Intel/AMD processors. The comparisons of these real workloads in the context of LHCb are also complemented with the standard synthetic benchmarks HEPSPEC and Coremark. The pitfalls and solutions for the non-trivial task of porting the source code to build for the ARMv7 instruction set are presented. The specific changes in the build process needed for ARM-specific portions of the software stack are described, to serve as pointers for further attempts taken up by other groups in this direction. Cases where architecture-specific tweaks at the assembler level (both in ROOT and the LHCb software stack) were needed for a successful compile are detailed - these cases are good indicators of where/how the software stack as well as the build system can be made more portable and multi-arch friendly. The experience gained from the tasks described in this paper is intended to i) assist in making an informed choice about ARM-based server solutions as a feasible low-power alternative to the current compute nodes, and ii) revisit the software design and build system for portability and generic improvements.

  1. Lattice QCD Calculations in Nuclear Physics towards the Exascale

    NASA Astrophysics Data System (ADS)

    Joo, Balint

    2017-01-01

    The combination of algorithmic advances and new highly parallel computing architectures is enabling lattice QCD calculations to tackle ever more complex problems in nuclear physics. In this talk I will review some computational challenges that are encountered in large scale cold nuclear physics campaigns such as those in hadron spectroscopy calculations. I will discuss progress in addressing these with algorithmic improvements such as multi-grid solvers and software for recent hardware architectures such as GPUs and Intel Xeon Phi (Knights Landing). Finally, I will highlight some current topics for research and development as we head towards the Exascale era. This material is funded by the U.S. Department of Energy, Office of Science, Offices of Nuclear Physics, High Energy Physics and Advanced Scientific Computing Research, as well as the Office of Nuclear Physics under contract DE-AC05-06OR23177.

  2. Communication overhead on the Intel Paragon, IBM SP2 and Meiko CS-2

    NASA Technical Reports Server (NTRS)

    Bokhari, Shahid H.

    1995-01-01

    Interprocessor communication overhead is a crucial measure of the power of parallel computing systems; its impact can severely limit the performance of parallel programs. This report presents measurements of communication overhead on three contemporary commercial multicomputer systems: the Intel Paragon, the IBM SP2 and the Meiko CS-2. In each case the time to communicate between processors is presented as a function of message length. The time for global synchronization and memory access is discussed. The performance of these machines in emulating hypercubes and executing random pairwise exchanges is also investigated. It is shown that the interprocessor communication time depends heavily on the specific communication pattern required. These observations contradict the commonly held belief that communication overhead on contemporary machines is independent of the placement of tasks on processors. The information presented in this report permits the evaluation of the efficiency of parallel algorithm implementations against standard baselines.
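
    Measurements of this kind are typically obtained with a ping-pong microbenchmark; a minimal MPI sketch, assuming two ranks:

      /* Ping-pong microbenchmark: time to exchange a message between two
       * ranks as a function of message length. Run with two ranks, e.g.
       * mpirun -np 2 ./pingpong. Message sizes and rep count are
       * illustrative choices. */
      #include <mpi.h>
      #include <stdio.h>
      #include <stdlib.h>

      int main(int argc, char **argv)
      {
          int rank;
          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          for (int len = 1; len <= 1 << 20; len <<= 1) {
              char *buf = calloc(len, 1);
              const int reps = 100;
              MPI_Barrier(MPI_COMM_WORLD);
              double t0 = MPI_Wtime();
              for (int r = 0; r < reps; r++) {
                  if (rank == 0) {
                      MPI_Send(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                      MPI_Recv(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                               MPI_STATUS_IGNORE);
                  } else if (rank == 1) {
                      MPI_Recv(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                               MPI_STATUS_IGNORE);
                      MPI_Send(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
                  }
              }
              double t = (MPI_Wtime() - t0) / (2.0 * reps); /* one-way time */
              if (rank == 0)
                  printf("%8d bytes: %10.2f us\n", len, t * 1e6);
              free(buf);
          }
          MPI_Finalize();
          return 0;
      }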

  3. Heterogeneous computing architecture for fast detection of SNP-SNP interactions.

    PubMed

    Sluga, Davor; Curk, Tomaz; Zupan, Blaz; Lotric, Uros

    2014-06-25

    The extent of data in a typical genome-wide association study (GWAS) poses considerable computational challenges to software tools for gene-gene interaction discovery. Exhaustive evaluation of all interactions among hundreds of thousands to millions of single nucleotide polymorphisms (SNPs) may require weeks or even months of computation. Massively parallel hardware within a modern Graphic Processing Unit (GPU) and Many Integrated Core (MIC) coprocessors can shorten the run time considerably. While the utility of GPU-based implementations in bioinformatics has been well studied, MIC architecture has been introduced only recently and may provide a number of comparative advantages that have yet to be explored and tested. We have developed a heterogeneous, GPU and Intel MIC-accelerated software module for SNP-SNP interaction discovery to replace the previously single-threaded computational core in the interactive web-based data exploration program SNPsyn. We report on differences between these two modern massively parallel architectures and their software environments. Their utility resulted in an order of magnitude shorter execution times when compared to the single-threaded CPU implementation. GPU implementation on a single Nvidia Tesla K20 runs twice as fast as that for the MIC architecture-based Xeon Phi P5110 coprocessor, but also requires considerably more programming effort. General purpose GPUs are a mature platform with large amounts of computing power capable of tackling inherently parallel problems, but can prove demanding for the programmer. On the other hand, the new MIC architecture, albeit lacking in performance, reduces the programming effort and makes up for it with a more general architecture suitable for a wider range of problems.
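
    The computational core being accelerated is an exhaustive pairwise scan; a minimal serial/OpenMP sketch in C, with a hypothetical placeholder scoring function standing in for the interaction score SNPsyn actually computes:

      /* Exhaustive pairwise SNP-SNP scan: evaluate an interaction score
       * for every pair of SNPs, parallelized with OpenMP. The score below
       * is a hypothetical placeholder, not SNPsyn's actual measure. */
      #include <omp.h>

      /* placeholder score: fraction of samples with identical genotypes */
      static double pair_score(const unsigned char *a, const unsigned char *b,
                               int n_samples)
      {
          int agree = 0;
          for (int s = 0; s < n_samples; s++)
              agree += (a[s] == b[s]);
          return (double)agree / n_samples;
      }

      void scan_pairs(const unsigned char **geno, int n_snps, int n_samples,
                      double *best, int *bi, int *bj)
      {
          *best = -1.0;
          #pragma omp parallel for schedule(dynamic)
          for (int i = 0; i < n_snps; i++)
              for (int j = i + 1; j < n_snps; j++) {
                  double s = pair_score(geno[i], geno[j], n_samples);
                  #pragma omp critical       /* keep the best pair found */
                  if (s > *best) { *best = s; *bi = i; *bj = j; }
              }
      }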

  4. Heterogeneous computing architecture for fast detection of SNP-SNP interactions

    PubMed Central

    2014-01-01

    Background The extent of data in a typical genome-wide association study (GWAS) poses considerable computational challenges to software tools for gene-gene interaction discovery. Exhaustive evaluation of all interactions among hundreds of thousands to millions of single nucleotide polymorphisms (SNPs) may require weeks or even months of computation. Massively parallel hardware within a modern Graphic Processing Unit (GPU) and Many Integrated Core (MIC) coprocessors can shorten the run time considerably. While the utility of GPU-based implementations in bioinformatics has been well studied, MIC architecture has been introduced only recently and may provide a number of comparative advantages that have yet to be explored and tested. Results We have developed a heterogeneous, GPU and Intel MIC-accelerated software module for SNP-SNP interaction discovery to replace the previously single-threaded computational core in the interactive web-based data exploration program SNPsyn. We report on differences between these two modern massively parallel architectures and their software environments. Their utility resulted in an order of magnitude shorter execution times when compared to the single-threaded CPU implementation. GPU implementation on a single Nvidia Tesla K20 runs twice as fast as that for the MIC architecture-based Xeon Phi P5110 coprocessor, but also requires considerably more programming effort. Conclusions General purpose GPUs are a mature platform with large amounts of computing power capable of tackling inherently parallel problems, but can prove demanding for the programmer. On the other hand, the new MIC architecture, albeit lacking in performance, reduces the programming effort and makes up for it with a more general architecture suitable for a wider range of problems. PMID:24964802

  5. Accelerating finite-rate chemical kinetics with coprocessors: Comparing vectorization methods on GPUs, MICs, and CPUs

    NASA Astrophysics Data System (ADS)

    Stone, Christopher P.; Alferman, Andrew T.; Niemeyer, Kyle E.

    2018-05-01

    Accurate and efficient methods for solving stiff ordinary differential equations (ODEs) are a critical component of turbulent combustion simulations with finite-rate chemistry. The ODEs governing the chemical kinetics at each mesh point are decoupled by operator-splitting allowing each to be solved concurrently. An efficient ODE solver must then take into account the available thread and instruction-level parallelism of the underlying hardware, especially on many-core coprocessors, as well as the numerical efficiency. A stiff Rosenbrock and a nonstiff Runge-Kutta ODE solver are both implemented using the single instruction, multiple thread (SIMT) and single instruction, multiple data (SIMD) paradigms within OpenCL. Both methods solve multiple ODEs concurrently within the same instruction stream. The performance of these parallel implementations was measured on three chemical kinetic models of increasing size across several multicore and many-core platforms. Two separate benchmarks were conducted to clearly determine any performance advantage offered by either method. The first benchmark measured the run-time of evaluating the right-hand-side source terms in parallel and the second benchmark integrated a series of constant-pressure, homogeneous reactors using the Rosenbrock and Runge-Kutta solvers. The right-hand-side evaluations with SIMD parallelism on the host multicore Xeon CPU and many-core Xeon Phi co-processor performed approximately three times faster than the baseline multithreaded C++ code. The SIMT parallel model on the host and Phi was 13%-35% slower than the baseline while the SIMT model on the NVIDIA Kepler GPU provided approximately the same performance as the SIMD model on the Phi. The runtimes for both ODE solvers decreased significantly with the SIMD implementations on the host CPU (2.5-2.7 ×) and Xeon Phi coprocessor (4.7-4.9 ×) compared to the baseline parallel code. The SIMT implementations on the GPU ran 1.5-1.6 times faster than the baseline
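
    As a reference for the nonstiff half of the solver pair, a minimal classic fourth-order Runge-Kutta step in C is sketched below; the paper's implementations batch many such systems concurrently across SIMD lanes or SIMT threads:

      /* One classic RK4 step for y' = f(t, y), y of dimension n. Caller
       * supplies scratch arrays k1..k4 and tmp of length n; ctx carries
       * user state (e.g., thermochemical data) to the right-hand side. */
      void rk4_step(int n, double t, double h, double *y,
                    void (*f)(double t, const double *y, double *dy, void *ctx),
                    void *ctx, double *k1, double *k2, double *k3, double *k4,
                    double *tmp)
      {
          f(t, y, k1, ctx);
          for (int i = 0; i < n; i++) tmp[i] = y[i] + 0.5 * h * k1[i];
          f(t + 0.5 * h, tmp, k2, ctx);
          for (int i = 0; i < n; i++) tmp[i] = y[i] + 0.5 * h * k2[i];
          f(t + 0.5 * h, tmp, k3, ctx);
          for (int i = 0; i < n; i++) tmp[i] = y[i] + h * k3[i];
          f(t + h, tmp, k4, ctx);
          for (int i = 0; i < n; i++)     /* weighted combination */
              y[i] += h / 6.0 * (k1[i] + 2.0 * k2[i] + 2.0 * k3[i] + k4[i]);
      }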

  6. Comparative evaluation of molar distalization therapy with erupted second molar: Segmented versus Quad Pendulum appliance.

    PubMed

    Caprioglio, Alberto; Cozzani, Mauro; Fontana, Mattia

    2014-01-01

    There are controversial opinions about the effect of erupted second molars on distalization of the first molars. Most of the distalizing devices are anchored on the first molars, without including the second molars; thus, the differences between sequentially distalizing the maxillary molars (second molar followed by the first molar) and distalizing the second and first molars together are not clear. The aim of the study was to compare sequential versus simultaneous molar distalization therapy with erupted second molars using two different modified Pendulum appliances followed by fixed appliances. The treatment sample consisted of 35 class II malocclusion subjects, divided in two groups: group 1 consisted of 24 patients (13 males and 11 females) with a mean pre-treatment age of 12.9 years, treated with the Segmented Pendulum (SP) and fixed appliances; group 2 consisted of 11 patients (6 males and 5 females) with a mean pre-treatment age of 13.2 years, treated with the Quad Pendulum (QP) and fixed appliances. Lateral cephalograms were obtained before treatment (T1), at the end of distalization (T2), and at the end of orthodontic fixed appliance therapy (T3). A Student t test was used to identify significant between-group differences from T1 to T2, T2 to T3, and T1 to T3. QP and SP were equally effective in distalizing maxillary molars (3.5 and 4 mm, respectively) between T1 and T2; however, the maxillary first molar showed less distal tipping (4.6° vs. 9.6°) and more extrusion (1.1 vs. 0.2 mm) in the QP group than in the SP group, as did the vertical facial dimension, which increased more in the QP group (1.2°) than in the SP group (0.7°). At T3, the QP group maintained a greater increase in lower anterior facial height and molar extrusion, and a greater decrease in overbite, than the SP group. The Quad Pendulum seems to produce a greater increase in vertical dimension and molar extrusion than the Segmented Pendulum.

  7. The CPS Plasma Award at the Intel Science and Engineering Fair

    NASA Astrophysics Data System (ADS)

    Berry, Lee

    2012-10-01

    For the past eight years, the Coalition for Plasma Science (CPS) has presented an award for a plasma project at the Intel International Science and Engineering Fair (ISEF). We reported on the first five years of this award at the 2009 DPP Symposium. Pulsed neutron-producing experiments are a recurring topic, with the efforts now turning to applications. The most recent award, at the Pittsburgh ISEF this past May, was given for analysis of data from Brookhaven's Relativistic Heavy Ion Collider. The effort had the goal of understanding the fluid properties of the quark-gluon plasma. All of the CPS award-winning projects so far have been based on experiments, with four awards going to women students and four to men. In 2009 we noted that the number and quality of projects was improving. Since then, as we predicted (and hoped for), that trend has continued. The CPS looks forward to continuing its work with students who are excited about the possibilities of plasma. You too can share this excitement by judging at the 2013 fair in Phoenix on May 12-17. Information may be obtained by emailing cps@plasmacoalition.org.

  8. InSAR Unwrapping Error Correction Based on Quasi-Accurate Detection of Gross Errors (QUAD)

    NASA Astrophysics Data System (ADS)

    Kang, Y.; Zhao, C. Y.; Zhang, Q.; Yang, C. S.

    2018-04-01

    Unwrapping errors are common in InSAR processing and can seriously degrade the accuracy of the monitoring results. Based on a gross error correction method, quasi-accurate detection (QUAD), a method for automatic correction of unwrapping errors is established in this paper. The method identifies and corrects unwrapping errors by establishing a functional model between the true errors and the interferograms. The basic principle and processing steps are presented. The method is then compared with the L1-norm method on simulated data. Results show that both methods can effectively suppress unwrapping errors when the ratio of unwrapping errors is low, and that the two methods complement each other when the ratio is relatively high. Finally, real SAR data are used to test the phase unwrapping error correction. Results show that the new method corrects phase unwrapping errors successfully in practical applications.

  9. Deployment of the OSIRIS EM-PIC code on the Intel Knights Landing architecture

    NASA Astrophysics Data System (ADS)

    Fonseca, Ricardo

    2017-10-01

    Electromagnetic particle-in-cell (EM-PIC) codes such as OSIRIS have found widespread use in modelling the highly nonlinear and kinetic processes that occur in several relevant plasma physics scenarios, ranging from astrophysical settings to high-intensity laser plasma interaction. Being computationally intensive, these codes require large scale HPC systems and a continuous effort in adapting the algorithm to new hardware and computing paradigms. In this work, we report on our efforts in deploying the OSIRIS code on the new Intel Knights Landing (KNL) architecture. Unlike the previous generation (Knights Corner), these boards are standalone systems and introduce several new features, including the new AVX-512 instructions and on-package MCDRAM. We will focus on the parallelization and vectorization strategies followed, as well as memory management, and present a detailed evaluation of code performance in comparison with the CPU code. This work was partially supported by Fundação para a Ciência e a Tecnologia (FCT), Portugal, through Grant No. PTDC/FIS-PLA/2940/2014.

  10. Modeling high-temperature superconductors and metallic alloys on the Intel iPSC/860

    NASA Astrophysics Data System (ADS)

    Geist, G. A.; Peyton, B. W.; Shelton, W. A.; Stocks, G. M.

    Oak Ridge National Laboratory has embarked on several computational Grand Challenges, which require the close cooperation of physicists, mathematicians, and computer scientists. One of these projects is the determination of the material properties of alloys from first principles and, in particular, the electronic structure of high-temperature superconductors. While the present focus of the project is on superconductivity, the approach is general enough to permit study of other properties of metallic alloys such as strength and magnetic properties. This paper describes the progress to date on this project. We include a description of a self-consistent KKR-CPA method, the parallelization of the model, and the incorporation of a dynamic load balancing scheme into the algorithm. We also describe the development and performance of a consolidated KKR-CPA code capable of running on CRAYs, workstations, and several parallel computers without source code modification. Performance of this code on the Intel iPSC/860 is also compared to a CRAY 2, CRAY YMP, and several workstations. Finally, some density-of-states calculations for two perovskite superconductors are given.

  11. Balancing Contention and Synchronization on the Intel Paragon

    NASA Technical Reports Server (NTRS)

    Bokhari, Shahid H.; Nicol, David M.

    1996-01-01

    The Intel Paragon is a mesh-connected distributed memory parallel computer. It uses an oblivious and deterministic message routing algorithm: this permits us to develop highly optimized schedules for frequently needed communication patterns. The complete exchange is one such pattern. Several approaches are available for carrying it out on the mesh. We study an algorithm developed by Scott. This algorithm assumes that a communication link can carry one message at a time and that a node can only transmit one message at a time. It requires global synchronization to enforce a schedule of transmissions. Unfortunately, global synchronization has substantial overhead on the Paragon. At the same time, the powerful interconnection mechanism of this machine permits 2 or 3 messages to share a communication link with minor overhead, and can also overlap multiple message transmissions from the same node to some extent. We develop a generalization of Scott's algorithm that executes the complete exchange with a prescribed contention. Schedules that incur greater contention require fewer synchronization steps. This permits us to trade off contention against synchronization overhead. We describe the performance of this algorithm and compare it with Scott's original algorithm as well as with a naive algorithm that does not take the interconnection structure into account. The bounded-contention algorithm is always better than Scott's algorithm and outperforms the naive algorithm for all but the smallest message sizes. The naive algorithm fails to work on meshes larger than 12 x 12. These results show that due consideration of processor interconnect and machine performance parameters is necessary to obtain peak performance from the Paragon and its successor mesh machines.
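
    To make the scheduling trade-off concrete, the sketch below shows the simplest synchronized complete-exchange schedule, a pairwise (XOR) pattern for a power-of-two number of ranks, written against MPI rather than the Paragon's native primitives. It is an illustration of a round-by-round schedule, not Scott's mesh-specific algorithm or the bounded-contention generalization; the block size is an arbitrary choice.

```cpp
#include <mpi.h>
#include <vector>

// Complete exchange via a pairwise (XOR) schedule: in round s, every rank
// swaps one block with rank ^ s.  Each MPI_Sendrecv is a synchronized
// round; merging rounds (more messages per link, fewer synchronizations)
// is the contention/synchronization trade-off discussed above.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);   // assumed to be a power of two

    const int block = 1024;                 // doubles sent to each partner
    std::vector<double> sendbuf(static_cast<size_t>(block) * size, rank);
    std::vector<double> recvbuf(static_cast<size_t>(block) * size);

    for (int s = 1; s < size; ++s) {
        const int partner = rank ^ s;
        MPI_Sendrecv(&sendbuf[partner * block], block, MPI_DOUBLE, partner, 0,
                     &recvbuf[partner * block], block, MPI_DOUBLE, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
}
```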

  12. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Allada, Veerendra; Benjegerdes, Troy; Bode, Brett

    Commodity clusters augmented with application accelerators are evolving as competitive high performance computing systems. The Graphical Processing Unit (GPU), with a very high arithmetic density and performance per price ratio, is a good platform for scientific application acceleration. In addition to the interconnect bottlenecks among the cluster compute nodes, the cost of memory copies between the host and the GPU device has to be carefully amortized to improve the overall efficiency of the application. Scientific applications also rely on efficient implementation of the Basic Linear Algebra Subprograms (BLAS), among which the General Matrix Multiply (GEMM) is considered the workhorse subroutine. In this paper, they study the performance of the memory copies and GEMM subroutines that are critical to port computational chemistry algorithms to GPU clusters. To that end, a benchmark based on the NetPIPE framework is developed to evaluate the latency and bandwidth of the memory copies between the host and the GPU device. The performance of the single and double precision GEMM subroutines from the NVIDIA CUBLAS 2.0 library is studied. The results have been compared with those of the BLAS routines from the Intel Math Kernel Library (MKL) to understand the computational trade-offs. The test bed is an Intel Xeon cluster equipped with NVIDIA Tesla GPUs.
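
    A NetPIPE-style probe of the host-to-device copy cost described above can be sketched with the CUDA runtime API. The buffer size, repetition count, and use of pinned memory are illustrative assumptions, not details from the paper:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Times repeated host-to-device copies from a pinned buffer and reports
// the sustained bandwidth; pageable memory would typically be slower.
int main() {
    const size_t bytes = 64 << 20;   // 64 MiB per copy (illustrative)
    const int reps = 20;
    void *hbuf = nullptr, *dbuf = nullptr;
    cudaMallocHost(&hbuf, bytes);    // pinned host memory
    cudaMalloc(&dbuf, bytes);

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventRecord(t0);
    for (int i = 0; i < reps; ++i)
        cudaMemcpy(dbuf, hbuf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    std::printf("H2D bandwidth: %.2f GB/s\n",
                reps * bytes / (ms * 1.0e-3) / 1.0e9);

    cudaFree(dbuf);
    cudaFreeHost(hbuf);
    return 0;
}
```

    The same event-timing skeleton extends naturally to bracketing a GEMM call, which is how copy cost and compute throughput can be weighed against each other.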

  13. Recent advances in PC-Linux systems for electronic structure computations by optimized compilers and numerical libraries.

    PubMed

    Yu, Jen-Shiang K; Yu, Chin-Hui

    2002-01-01

    One of the most frequently used packages for electronic structure research, GAUSSIAN 98, is compiled on Linux systems with various hardware configurations, including AMD Athlon (with the "Thunderbird" core), AthlonMP, and AthlonXP (with the "Palomino" core) systems as well as Intel Pentium 4 (with the "Willamette" core) machines. The default PGI FORTRAN compiler (pgf77) and the Intel FORTRAN compiler (ifc) are respectively employed with different architectural optimization options to compile GAUSSIAN 98 and test the performance improvement. In addition to the BLAS library included in revision A.11 of this package, the Automatically Tuned Linear Algebra Software (ATLAS) library is linked against the binary executables to improve the performance. Various Hartree-Fock, density-functional theory, and MP2 calculations are done for benchmarking purposes. It is found that the combination of ifc with the ATLAS library gives the best performance for GAUSSIAN 98 on all of these PC-Linux computers, including AMD and Intel CPUs. Even on AMD systems, the Intel FORTRAN compiler invariably produces binaries with better performance than pgf77. The enhancement provided by the ATLAS library is more significant for post-Hartree-Fock calculations. The performance on a single CPU is potentially as good as that on an Alpha 21264A workstation or an SGI supercomputer. The SPECfp2000 floating-point benchmark scores show trends similar to the GAUSSIAN 98 results.

  14. Big Data, Deep Learning and Tianhe-2 at Sun Yat-Sen University, Guangzhou

    NASA Astrophysics Data System (ADS)

    Yuen, D. A.; Dzwinel, W.; Liu, J.; Zhang, K.

    2014-12-01

    In this decade the big data revolution has permeated many fields, ranging from financial transactions to medical surveys and scientific endeavors, because of the big opportunities people see ahead. What to do with all this data remains an intriguing question. This is where computer scientists, together with applied mathematicians, have made significant inroads in developing deep learning techniques for unraveling new relationships among different variables by means of correlation analysis and data-assimilation methods. Deep learning and big data taken together pose a grand challenge for high-performance computing, demanding both ultrafast speed and large memory. The Tianhe-2, recently installed at Sun Yat-Sen University in Guangzhou, is well positioned to take up this challenge because it is currently the world's fastest computer at 34 Petaflops. Each compute node of Tianhe-2 has two Intel Xeon E5-2600 CPUs and three Xeon Phi accelerators. The Tianhe-2 has a very large fast RAM of 88 Gigabytes on each node, and the system has a total memory of 1,375 Terabytes. All of these technical features will allow very high-dimensional (more than 10) problems in deep learning to be explored carefully on the Tianhe-2. Problems in seismology which can be solved include three-dimensional seismic wave simulations of the whole Earth with a few km resolution and the recognition of new phases in seismic waveforms from assemblages of large data sets.

  15. IR radiation characteristics and operating range research for a quad-rotor unmanned aircraft vehicle.

    PubMed

    Gong, Mali; Guo, Rui; He, Sifeng; Wang, Wei

    2016-11-01

    The security threats caused by multi-rotor unmanned aircraft vehicles (UAVs) are serious, especially in public places. To detect and control multi-rotor UAVs, knowledge of their IR characteristics is necessary. The IR characteristics of a typical commercial quad-rotor UAV are investigated in this paper through thermal imaging with an IR camera. Combining the 3D geometry and IR images of the UAV, a 3D IR characteristics model is established so that the radiant power from different views can be obtained. An estimate of the operating range for detecting the UAV is calculated theoretically, using the signal-to-noise ratio as the criterion. Field experiments are implemented with an uncooled IR camera at an ambient temperature of 12°C against a uniform background. For the front view, the operating range is about 150 m, which is close to the simulation result of 170 m.

  16. Parasail: SIMD C library for global, semi-global, and local pairwise sequence alignments

    DOE PAGES

    Daily, Jeffrey A.

    2016-02-10

    Sequence alignment algorithms are a key component of many bioinformatics applications. Though various fast Smith-Waterman local sequence alignment implementations have been developed for x86 CPUs, most are embedded into larger database search tools. In addition, fast implementations of Needleman-Wunsch global sequence alignment and its semi-global variants are not as widespread. This article presents the first software library for local, global, and semi-global pairwise intra-sequence alignments and improves the performance of previous intra-sequence implementations. As a result, a faster intra-sequence pairwise alignment implementation is described and benchmarked. Using a 375 residue query sequence, a speed of 136 billion cell updates per second (GCUPS) was achieved on a dual Intel Xeon E5-2670 12-core processor system, the highest reported for an implementation based on Farrar's 'striped' approach. When using only a single thread, parasail was 1.7 times faster than Rognes's SWIPE. For many score matrices, parasail is faster than BLAST. The software library is designed for 64-bit Linux, OS X, or Windows on processors with SSE2, SSE41, or AVX2. Source code is available from https://github.com/jeffdaily/parasail under the Battelle BSD-style license. In conclusion, applications that require optimal alignment scores could benefit from the improved performance. For the first time, SIMD global, semi-global, and local alignments are available in a stand-alone C library.
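
    The GCUPS figure quoted above is straightforward arithmetic: a query of length m aligned against n database residues fills an m x n dynamic-programming matrix, so GCUPS = m*n / (seconds * 1e9). A tiny helper, with illustrative numbers rather than the paper's actual database size and timing:

```cpp
// Giga cell updates per second for a length-m query scanned against
// n database residues in t seconds.
double gcups(double m, double n, double t) {
    return m * n / t / 1.0e9;
}

// Illustrative check (numbers assumed, not from the paper):
// gcups(375, 1.8e12, 5000) ~= 135, i.e. a 375-residue query against
// ~1.8e12 residues completed in ~5000 s corresponds to ~135 GCUPS.
```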

  17. Parasail: SIMD C library for global, semi-global, and local pairwise sequence alignments

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Daily, Jeffrey A.

    Sequence alignment algorithms are a key component of many bioinformatics applications. Though various fast Smith-Waterman local sequence alignment implementations have been developed for x86 CPUs, most are embedded into larger database search tools. In addition, fast implementations of Needleman-Wunsch global sequence alignment and its semi-global variants are not as widespread. This article presents the first software library for local, global, and semi-global pairwise intra-sequence alignments and improves the performance of previous intra-sequence implementations. As a result, a faster intra-sequence pairwise alignment implementation is described and benchmarked. Using a 375 residue query sequence, a speed of 136 billion cell updates per second (GCUPS) was achieved on a dual Intel Xeon E5-2670 12-core processor system, the highest reported for an implementation based on Farrar's 'striped' approach. When using only a single thread, parasail was 1.7 times faster than Rognes's SWIPE. For many score matrices, parasail is faster than BLAST. The software library is designed for 64-bit Linux, OS X, or Windows on processors with SSE2, SSE41, or AVX2. Source code is available from https://github.com/jeffdaily/parasail under the Battelle BSD-style license. In conclusion, applications that require optimal alignment scores could benefit from the improved performance. For the first time, SIMD global, semi-global, and local alignments are available in a stand-alone C library.

  18. The Impact of IBM Cell Technology on the Programming Paradigm in the Context of Computer Systems for Climate and Weather Models

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Zhou, Shujia; Duffy, Daniel; Clune, Thomas

    The call for ever-increasing model resolutions and physical processes in climate and weather models demands a continual increase in computing power. The IBM Cell processor's order-of-magnitude peak performance increase over conventional processors makes it very attractive to fulfill this requirement. However, the Cell's characteristics, 256KB local memory per SPE and the new low-level communication mechanism, make it very challenging to port an application. As a trial, we selected the solar radiation component of the NASA GEOS-5 climate model, which: (1) is representative of column physics components (half the total computational time), (2) has an extremely high computational intensity: the ratio of computational load to main memory transfers, and (3) exhibits embarrassingly parallel column computations. In this paper, we converted the baseline code (single-precision Fortran) to C and ported it to an IBM BladeCenter QS20. For performance, we manually SIMDize four independent columns and include several unrolling optimizations. Our results show that when compared with the baseline implementation running on one core of Intel's Xeon Woodcrest, Dempsey, and Itanium2, the Cell is approximately 8.8x, 11.6x, and 12.8x faster, respectively. Our preliminary analysis shows that the Cell can also accelerate the dynamics component (~25% of total computational time). We believe these dramatic performance improvements make the Cell processor very competitive as an accelerator.

  19. Multi-GPU Accelerated Admittance Method for High-Resolution Human Exposure Evaluation.

    PubMed

    Xiong, Zubiao; Feng, Shi; Kautz, Richard; Chandra, Sandeep; Altunyurt, Nevin; Chen, Ji

    2015-12-01

    A multi-graphics processing unit (GPU) accelerated admittance method solver is presented for solving the induced electric field in high-resolution anatomical models of the human body when exposed to external low-frequency magnetic fields. In the solver, the anatomical model is discretized as a three-dimensional network of admittances. The conjugate orthogonal conjugate gradient (COCG) iterative algorithm is employed to take advantage of the symmetric property of the complex-valued linear system of equations. Compared against the widely used biconjugate gradient stabilized method, the COCG algorithm reduces the solving time by 3.5 times and the storage requirement by about 40%. The iterative algorithm is then accelerated further by using multiple NVIDIA GPUs. The computations and data transfers between GPUs are overlapped in time by using an asynchronous concurrent execution design. The communication overhead is well hidden, so that the acceleration is nearly linear in the number of GPU cards. Numerical examples show that our GPU implementation running on four NVIDIA Tesla K20c cards runs up to 90 times faster than the CPU implementation running on eight CPU cores (two Intel Xeon E5-2603 processors). The implemented solver is able to solve large-dimensional problems efficiently. A whole adult body discretized at 1-mm resolution can be solved in just a few minutes. The high efficiency achieved makes it practical to investigate human exposure involving a large number of cases with a high resolution that meets the requirements of international dosimetry guidelines.
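
    The COCG method mentioned above is ordinary CG with the conjugated inner product replaced by the unconjugated bilinear form x^T y, which is what makes it applicable to complex symmetric (rather than Hermitian) systems and halves the matrix-vector products relative to BiCGSTAB. A minimal single-threaded sketch, assuming a caller-supplied operator and a zero initial guess; the multi-GPU overlapping described in the paper is omitted:

```cpp
#include <complex>
#include <cstddef>
#include <vector>

using cd  = std::complex<double>;
using vec = std::vector<cd>;

// Unconjugated bilinear form x^T y used by COCG (note: no std::conj).
cd dotu(const vec& x, const vec& y) {
    cd s{0.0, 0.0};
    for (std::size_t i = 0; i < x.size(); ++i) s += x[i] * y[i];
    return s;
}

// COCG for a complex symmetric system A x = b.  matvec(p, q) must set
// q = A p (e.g. the admittance-network operator); x must enter zeroed.
template <class MatVec>
void cocg(MatVec matvec, const vec& b, vec& x, int iters) {
    const std::size_t n = b.size();
    vec r = b, p = b, q(n);
    cd rho = dotu(r, r);
    for (int k = 0; k < iters; ++k) {
        matvec(p, q);
        const cd alpha = rho / dotu(p, q);
        for (std::size_t i = 0; i < n; ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * q[i];
        }
        const cd rho_next = dotu(r, r);
        const cd beta = rho_next / rho;
        for (std::size_t i = 0; i < n; ++i) p[i] = r[i] + beta * p[i];
        rho = rho_next;
    }
}
```

    One matrix-vector product per iteration, versus two for BiCGSTAB, is one reason such symmetry-exploiting methods win on both time and storage.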

  20. Transient Solid Dynamics Simulations on the Sandia/Intel Teraflop Computer

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Attaway, S.; Brown, K.; Gardner, D.

    1997-12-31

    Transient solid dynamics simulations are among the most widely used engineering calculations. Industrial applications include vehicle crashworthiness studies, metal forging, and powder compaction prior to sintering. These calculations are also critical to defense applications including safety studies and weapons simulations. The practical importance of these calculations and their computational intensiveness make them natural candidates for parallelization. This has proved to be difficult, and existing implementations fail to scale to more than a few dozen processors. In this paper we describe our parallelization of PRONTO, Sandia's transient solid dynamics code, via a novel algorithmic approach that utilizes multiple decompositions for different key segments of the computations, including the material contact calculation. This latter calculation is notoriously difficult to perform well in parallel, because it involves dynamically changing geometry, global searches for elements in contact, and unstructured communications among the compute nodes. Our approach scales to at least 3600 compute nodes of the Sandia/Intel Teraflop computer (the largest set of nodes to which we have had access to date) on problems involving millions of finite elements. On this machine we can simulate models using more than ten million elements in a few tenths of a second per timestep, and solve problems more than 3000 times faster than a single processor Cray Jedi.

  1. a High-Density Electron Beam and Quad-Scan Measurements at Pleiades Thomson X-Ray Source

    NASA Astrophysics Data System (ADS)

    Lim, J. K.; Rosenzweig, J. B.; Anderson, S. G.; Tremaine, A. M.

    2007-09-01

    Recent developments in photo-cathode injector technology have greatly enhanced the beam quality necessary for the creation of high density/high brightness electron beam sources. In the Thomson backscattering x-ray experiment, an electron beam spot under 20 microns is needed at the interaction point with a high-intensity laser in order to produce a large x-ray flux. This has been demonstrated successfully at PLEIADES at Lawrence Livermore National Laboratory. For this Thomson backscattering experiment, we employed an asymmetric triplet of high-remanence permanent-magnet quadrupoles (PMQs) to produce smaller electron beams. Utilizing a highly efficient optical transition radiation (OTR) beam-spot imaging technique and varying the electron focal spot sizes enabled a quadrupole scan at the interaction zone. Comparisons between Twiss parameters obtained upstream and those deduced from the PMQ scan are presented in this report.

  2. a High-Density Electron Beam and Quad-Scan Measurements at Pleiades Thomson X-Ray Source

    NASA Astrophysics Data System (ADS)

    Lim, J. K.; Rosenzweig, J. B.; Anderson, S. G.; Tremaine, A. M.

    Recent developments in photo-cathode injector technology have greatly enhanced the beam quality necessary for the creation of high density/high brightness electron beam sources. In the Thomson backscattering x-ray experiment, an electron beam spot under 20 microns is needed at the interaction point with a high-intensity laser in order to produce a large x-ray flux. This has been demonstrated successfully at PLEIADES at Lawrence Livermore National Laboratory. For this Thomson backscattering experiment, we employed an asymmetric triplet of high-remanence permanent-magnet quadrupoles (PMQs) to produce smaller electron beams. Utilizing a highly efficient optical transition radiation (OTR) beam-spot imaging technique and varying the electron focal spot sizes enabled a quadrupole scan at the interaction zone. Comparisons between Twiss parameters obtained upstream and those deduced from the PMQ scan are presented in this report.

  3. Design of Quad-Band Terahertz Metamaterial Absorber Using a Perforated Rectangular Resonator for Sensing Applications.

    PubMed

    Xie, Qin; Dong, Guangxi; Wang, Ben-Xin; Huang, Wei-Qing

    2018-05-08

    A quad-band terahertz absorber with a single-sized metamaterial design, formed by a perforated rectangular resonator on a gold substrate with a dielectric gap in between, is investigated. The designed metamaterial structure enables four absorption peaks, of which the first three have large absorption coefficients while the last possesses a high Q (quality factor) value of 98.33. The underlying physical mechanisms of these peaks are explored; it is found that their near-field distributions differ. Moreover, the figure of merit (FOM) of the last absorption peak reaches 101.67, which is much higher than that of the first three absorption modes and even than the absorption bands of other works operating at terahertz frequencies. The designed device, with multiple-band absorption and a high FOM, could provide numerous potential applications in terahertz technology-related fields.

  4. Design of Quad-Band Terahertz Metamaterial Absorber Using a Perforated Rectangular Resonator for Sensing Applications

    NASA Astrophysics Data System (ADS)

    Xie, Qin; Dong, Guangxi; Wang, Ben-Xin; Huang, Wei-Qing

    2018-05-01

    A quad-band terahertz absorber with a single-sized metamaterial design, formed by a perforated rectangular resonator on a gold substrate with a dielectric gap in between, is investigated. The designed metamaterial structure enables four absorption peaks, of which the first three have large absorption coefficients while the last possesses a high Q (quality factor) value of 98.33. The underlying physical mechanisms of these peaks are explored; it is found that their near-field distributions differ. Moreover, the figure of merit (FOM) of the last absorption peak reaches 101.67, which is much higher than that of the first three absorption modes and even than the absorption bands of other works operating at terahertz frequencies. The designed device, with multiple-band absorption and a high FOM, could provide numerous potential applications in terahertz technology-related fields.

  5. (U) Status of Trinity and Crossroads Systems

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Archer, Billy Joe; Lujan, James Westley; Hemmert, K. S.

    2017-01-10

    (U) This paper provides a general overview of current and future plans for the Advanced Simulation and Computing (ASC) Advanced Technology (AT) systems fielded by the New Mexico Alliance for Computing at Extreme Scale (ACES), a collaboration between Los Alamos National Laboratory and Sandia National Laboratories. Additionally, this paper touches on research into technology beyond traditional CMOS. The status of Trinity, ASC's first AT system, and Crossroads, anticipated to succeed Trinity as the third AT system in 2020, is presented, along with initial performance studies of the Intel Knights Landing Xeon Phi processors introduced on Trinity. The challenges and opportunities for our production simulation codes on AT systems are also discussed. Trinity and Crossroads are a joint procurement by ACES and Lawrence Berkeley National Laboratory as part of the Alliance for application Performance at EXtreme scale (APEX), http://apex.lanl.gov.

  6. Computational performance of a smoothed particle hydrodynamics simulation for shared-memory parallel computing

    NASA Astrophysics Data System (ADS)

    Nishiura, Daisuke; Furuichi, Mikito; Sakaguchi, Hide

    2015-09-01

    The computational performance of a smoothed particle hydrodynamics (SPH) simulation is investigated for three types of current shared-memory parallel computer devices: many integrated core (MIC) processors, graphics processing units (GPUs), and multi-core CPUs. We are especially interested in efficient shared-memory allocation methods for each chipset, because the efficient data access patterns differ between compute unified device architecture (CUDA) programming for GPUs and OpenMP programming for MIC processors and multi-core CPUs. We first introduce several parallel implementation techniques for the SPH code, and then examine these on our target computer architectures to determine the most effective algorithms for each processor unit. In addition, we evaluate the effective computing performance and power efficiency of the SPH simulation on each architecture, as these are critical metrics for overall performance in a multi-device environment. In our benchmark test, the GPU is found to produce the best arithmetic performance as a standalone device unit, and gives the most efficient power consumption. The multi-core CPU obtains the most effective computing performance. The computational speed of the Xeon Phi MIC processor approached that of two Xeon CPUs. This indicates that using MICs is an attractive choice for existing SPH codes on multi-core CPUs parallelized by OpenMP, as it gains computational acceleration without the need for significant changes to the source code.
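
    The OpenMP side of the shared-memory pattern above can be illustrated with a density summation in which each thread writes only rho[i] for its own particles, so no locking is required. This is a schematic 1D sketch with a naive O(N^2) neighbor scan and a truncated kernel; a real SPH code uses cell or neighbor lists and the full kernel support:

```cpp
#include <cmath>
#include <vector>

// Schematic 1D SPH density summation.  Each OpenMP thread owns a range of
// i and writes only rho[i], so the loop is race-free by construction.
void density(const std::vector<double>& x, const std::vector<double>& m,
             std::vector<double>& rho, double h) {
    const long n = static_cast<long>(x.size());
    const double norm = 2.0 / (3.0 * h);   // illustrative normalization
    #pragma omp parallel for schedule(dynamic, 64)
    for (long i = 0; i < n; ++i) {
        double sum = 0.0;
        for (long j = 0; j < n; ++j) {
            const double q = std::fabs(x[i] - x[j]) / h;
            if (q < 1.0)                   // kernel support truncated at q=1
                sum += m[j] * norm * (1.0 - 1.5 * q * q + 0.75 * q * q * q);
        }
        rho[i] = sum;
    }
}
```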

  7. Evaluation of CHO Benchmarks on the Arria 10 FPGA using Intel FPGA SDK for OpenCL

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Jin, Zheming; Yoshii, Kazutomo; Finkel, Hal

    The OpenCL standard is an open programming model for accelerating algorithms on heterogeneous computing systems. OpenCL extends the C-based programming language for developing portable codes on different platforms such as CPUs, graphics processing units (GPUs), digital signal processors (DSPs) and field programmable gate arrays (FPGAs). The Intel FPGA SDK for OpenCL is a suite of tools that allows developers to abstract away the complex FPGA-based development flow for a high-level software development flow. Users can focus on the design of hardware-accelerated kernel functions in OpenCL and then direct the tools to generate the low-level FPGA implementations. The approach makes FPGA-based development more accessible to software users as the need for hybrid computing using CPUs and FPGAs increases. It can also significantly reduce hardware development time, as users can evaluate different ideas with a high-level language without deep FPGA domain knowledge. Benchmarking an OpenCL-based framework is an effective way of analyzing system performance by studying the execution of the benchmark applications. CHO is a suite of benchmark applications that provides support for OpenCL [1]. The authors presented CHO as an OpenCL port of the CHStone benchmark. Using the Altera OpenCL (AOCL) compiler to synthesize the benchmark applications, they listed the resource usage and performance of each kernel that can be successfully synthesized by the compiler. In this report, we evaluate the resource usage and performance of the CHO benchmark applications using the Intel FPGA SDK for OpenCL and a Nallatech 385A FPGA board that features an Arria 10 FPGA device. The focus of the report is to gain a better understanding of the resource usage and performance of the kernel implementations using Arria 10 FPGA devices compared to Stratix V FPGA devices. In addition, we also gain knowledge about the limitations of the current compiler when it fails to synthesize a benchmark.

  8. CHARACTERIZATION OF THE MILLIMETER-WAVE POLARIZATION OF CENTAURUS A WITH QUaD

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Zemcov, M.; Bock, J.; Leitch, E.

    2010-02-20

    Centaurus (Cen) A represents one of the best candidates for an isolated, compact, highly polarized source that is bright at typical cosmic microwave background (CMB) experiment frequencies. We present measurements of the 4° x 2° region centered on Cen A with QUaD, a CMB polarimeter whose absolute polarization angle is known to an accuracy of 0.5°. Simulations are performed to assess the effect of misestimation of the instrumental parameters on the final measurement, and systematic errors due to the field's background structure and temporal variability from Cen A's nuclear region are determined. The total (Q, U) of the inner lobe region is (1.00 ± 0.07 (stat.) ± 0.04 (sys.), -1.72 ± 0.06 ± 0.05) Jy at 100 GHz and (0.80 ± 0.06 ± 0.06, -1.40 ± 0.07 ± 0.08) Jy at 150 GHz, leading to polarization angles and total errors of -30.0° ± 1.1° and -29.1° ± 1.7°. These measurements will allow the use of Cen A as a polarized calibration source for future millimeter experiments.
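
    As a rough consistency check on the quoted angles (sign conventions differ between experiments, so this is a sanity check, not the collaboration's pipeline), the polarization angle follows from the Stokes parameters as

\[
\psi = \tfrac{1}{2}\,\operatorname{atan2}(U,\,Q),
\qquad
\psi_{100} = \tfrac{1}{2}\operatorname{atan2}(-1.72,\,1.00) \approx -30.0^{\circ},
\]

    reproducing the quoted 100 GHz value; the 150 GHz fluxes give roughly -30.1°, consistent with the quoted -29.1° within its ±1.7° total error.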

  9. View Factor and Radiation-Hydrodynamic Simulations of Gas-Filled Outer-Quad-Only Hohlraums at the National Ignition Facility

    NASA Astrophysics Data System (ADS)

    Young, Christopher; Meezan, Nathan; Landen, Otto

    2017-10-01

    A cylindrical National Ignition Facility hohlraum irradiated exclusively by NOVA-like outer quads (44.5° and 50° beams) is proposed to minimize laser plasma interaction (LPI) losses and avoid problems with propagating the inner (23.5° and 30°) beams. Symmetry and drive are controlled by shortening the hohlraum, using a smaller laser entrance hole (LEH), beam phasing the 44.5° and 50° beams, and correcting the remaining P4 asymmetry with a capsule shim. Ensembles of time-resolved view factor simulations help narrow the design space of the new configuration, with fine tuning provided by the radiation-hydrodynamic code HYDRA. Prepared by LLNL under Contract DE-AC52-07NA27344.

  10. Accelerating Astronomy & Astrophysics in the New Era of Parallel Computing: GPUs, Phi and Cloud Computing

    NASA Astrophysics Data System (ADS)

    Ford, Eric B.; Dindar, Saleh; Peters, Jorg

    2015-08-01

    The realism of astrophysical simulations and statistical analyses of astronomical data is set by the available computational resources. Thus, astronomers and astrophysicists are constantly pushing the limits of computational capabilities. For decades, astronomers benefited from massive improvements in computational power that were driven primarily by increasing clock speeds and required relatively little attention to the details of the computational hardware. For nearly a decade, increases in computational capabilities have come primarily from increasing the degree of parallelism, rather than increasing clock speeds. Further increases in computational capabilities will likely be led by many-core architectures such as Graphical Processing Units (GPUs) and Intel Xeon Phi. Successfully harnessing these new architectures requires significantly more understanding of the hardware architecture, cache hierarchy, compiler capabilities and network characteristics. I will provide an astronomer's overview of the opportunities and challenges provided by modern many-core architectures and elastic cloud computing. The primary goal is to help an astronomical audience understand what types of problems are likely to yield more than an order of magnitude speed-up and which problems are unlikely to parallelize sufficiently efficiently to be worth the development time and/or costs. I will draw on my experience leading a team in developing the Swarm-NG library for parallel integration of large ensembles of small n-body systems on GPUs, as well as several smaller software projects. I will share lessons learned from collaborating with computer scientists, including both technical and soft skills. Finally, I will discuss the challenges of training the next generation of astronomers to be proficient in this new era of high-performance computing, drawing on experience teaching a graduate class on High-Performance Scientific Computing for Astrophysics and organizing a 2014 advanced summer school.

  11. RTEMS SMP and MTAPI for Efficient Multi-Core Space Applications on LEON3/LEON4 Processors

    NASA Astrophysics Data System (ADS)

    Cederman, Daniel; Hellstrom, Daniel; Sherrill, Joel; Bloom, Gedare; Patte, Mathieu; Zulianello, Marco

    2015-09-01

    This paper presents the final result of a European Space Agency (ESA) activity aimed at improving the software support for LEON processors used in SMP configurations. One of the benefits of using a multicore system in an SMP configuration is that in many instances it is possible to better utilize the available processing resources by load balancing between cores. This, however, comes at the cost of having to synchronize operations between cores, leading to increased complexity. While in an AMP system one can use multiple instances of operating systems that are only uni-processor capable, an SMP system requires the operating system to be written to support multicore systems. In this activity we have improved and extended the SMP support of the RTEMS real-time operating system and ensured that it fully supports the multicore capable LEON processors. The targeted hardware in the activity has been the GR712RC, a dual-core LEON3FT processor, and the functional prototype of ESA's Next Generation Multiprocessor (NGMP), a quad core LEON4 processor. The final version of the NGMP is now available as a product under the name GR740. An implementation of the Multicore Task Management API (MTAPI) has been developed as part of this activity to aid in the parallelization of applications for RTEMS SMP. It allows for simplified development of parallel applications using the task-based programming model. An existing space application, the Gaia Video Processing Unit, has been ported to RTEMS SMP using the MTAPI implementation to demonstrate the feasibility and usefulness of multicore processors for space payload software. The activity is funded by ESA under contract 4000108560/13/NL/JK. Gedare Bloom is supported in part by NSF CNS-0934725.

  12. Parallel computation for biological sequence comparison: comparing a portable model to the native model for the Intel Hypercube.

    PubMed Central

    Nadkarni, P. M.; Miller, P. L.

    1991-01-01

    A parallel program for inter-database sequence comparison was developed on the Intel Hypercube using two models of parallel programming. One version was built using machine-specific Hypercube parallel programming commands. The other version was built using Linda, a machine-independent parallel programming language. The two versions of the program provide a case study comparing these two approaches to parallelization in an important biological application area. Benchmark tests with both programs gave comparable results with a small number of processors. As the number of processors was increased, the Linda version was somewhat less efficient. The Linda version was also run without change on Network Linda, a virtual parallel machine running on a network of desktop workstations. PMID:1807632

  13. Parallel computation for biological sequence comparison: comparing a portable model to the native model for the Intel Hypercube.

    PubMed

    Nadkarni, P M; Miller, P L

    1991-01-01

    A parallel program for inter-database sequence comparison was developed on the Intel Hypercube using two models of parallel programming. One version was built using machine-specific Hypercube parallel programming commands. The other version was built using Linda, a machine-independent parallel programming language. The two versions of the program provide a case study comparing these two approaches to parallelization in an important biological application area. Benchmark tests with both programs gave comparable results with a small number of processors. As the number of processors was increased, the Linda version was somewhat less efficient. The Linda version was also run without change on Network Linda, a virtual parallel machine running on a network of desktop workstations.

  14. Multi-Pivot Quicksort: an Experiment with Single, Dual, Triple, Quad, and Penta-Pivot Quicksort Algorithms in Python

    NASA Astrophysics Data System (ADS)

    Budiman, M. A.; Zamzami, E. M.; Rachmawati, D.

    2017-03-01

    Dual-pivot quicksort, which was proposed by Yaroslavsky, has been experimentally proven to be more efficient than the classical single-pivot quicksort under the Java Virtual Machine [6]. Moreover, Kushagra, López-Ortiz, and Munro [4] have shown that triple-pivot quicksort runs 7-8% faster than dual-pivot quicksort in C, mutatis mutandis. In this research, we implement and experiment with single, dual, triple, quad, and penta-pivot quicksort algorithms in Python. Our experimental results are as follows. First, the quicksort with a single pivot is the slowest among the five variants. Second, at least up to five (penta) pivots, the more pivots a quicksort algorithm uses, the faster it runs. Third, the speed gain from adding more pivots tends to diminish gradually.
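
    For reference, the dual-pivot scheme the other variants generalize partitions each range around two pivots p <= q into three parts (< p, between, > q). A compact sketch in C++ (for brevity, rather than the Python used in the study):

```cpp
#include <algorithm>
#include <vector>

// Yaroslavsky-style dual-pivot quicksort: partition around p <= q into
// three ranges, then recurse on each.
void dp_qsort(std::vector<int>& a, int lo, int hi) {
    if (lo >= hi) return;
    if (a[lo] > a[hi]) std::swap(a[lo], a[hi]);
    const int p = a[lo], q = a[hi];
    int lt = lo + 1, gt = hi - 1, i = lo + 1;
    while (i <= gt) {
        if (a[i] < p) {                     // belongs in the left part
            std::swap(a[i], a[lt]); ++i; ++lt;
        } else if (a[i] > q) {              // belongs in the right part
            while (a[gt] > q && i < gt) --gt;
            std::swap(a[i], a[gt]); --gt;
            if (a[i] < p) { std::swap(a[i], a[lt]); ++lt; }
            ++i;
        } else {                            // p <= a[i] <= q: middle part
            ++i;
        }
    }
    std::swap(a[lo], a[--lt]);              // place pivot p
    std::swap(a[hi], a[++gt]);              // place pivot q
    dp_qsort(a, lo, lt - 1);
    dp_qsort(a, lt + 1, gt - 1);
    dp_qsort(a, gt + 1, hi);
}
```

    The triple-, quad-, and penta-pivot variants extend this to four, five, and six partitions per pass, a gain the literature attributes largely to better cache behavior rather than fewer comparisons.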

  15. Real-time polarization-sensitive optical coherence tomography data processing with parallel computing

    PubMed Central

    Liu, Gangjun; Zhang, Jun; Yu, Lingfeng; Xie, Tuqiang; Chen, Zhongping

    2010-01-01

    With the increase of the A-line speed of optical coherence tomography (OCT) systems, real-time processing of acquired data has become a bottleneck. The shared-memory parallel computing technique is used to process OCT data in real time. The real-time processing power of a quad-core personal computer (PC) is analyzed. It is shown that the quad-core PC can provide real-time OCT data processing of more than 80K A-lines per second. A real-time, fiber-based, swept source polarization-sensitive OCT system with 20K A-line speed is demonstrated with this technique. Real-time 2D and 3D polarization-sensitive imaging of chicken muscle and pig tendon is also demonstrated. PMID:19904337

  16. Flow Characteristics and Robustness of an Inclined Quad-vortex Range Hood

    PubMed Central

    CHEN, Jia-Kun; HUANG, Rong Fung

    2014-01-01

    A novel design of range hood, which was termed the inclined quad-vortex (IQV) range hood, was examined for its flow and containment leakage characteristics under the influence of a plate sweeping across the hood face. A flow visualization technique was used to unveil the flow behavior. Three characteristic flow modes were observed: convex, straight, and concave modes. A tracer gas detection method using sulfur hexafluoride (SF6) was employed to measure the containment leakage levels. The results were compared with the test data reported previously in the literature for a conventional range hood and an inclined air curtain (IAC) range hood. The leakage SF6 concentration of the IQV range hood under the influence of the plate sweeping was 0.039 ppm at a suction flow rate of 9.4 m3/min. The leakage concentration of the conventional range hood was 0.768 ppm at a suction flow rate of 15.0 m3/min. For the IAC range hood, the leakage concentration was 0.326 ppm at a suction flow rate of 10.9 m3/min. The IQV range hood presented a significantly lower leakage level at a smaller suction flow rate than the conventional and IAC range hoods due to its aerodynamic design for flow behavior. PMID:24583513

  17. Face classification using electronic synapses

    NASA Astrophysics Data System (ADS)

    Yao, Peng; Wu, Huaqiang; Gao, Bin; Eryilmaz, Sukru Burc; Huang, Xueyao; Zhang, Wenqiang; Zhang, Qingtian; Deng, Ning; Shi, Luping; Wong, H.-S. Philip; Qian, He

    2017-05-01

    Conventional hardware platforms consume a huge amount of energy for cognitive learning due to the data movement between the processor and the off-chip memory. Brain-inspired device technologies using analogue weight storage make it possible to complete cognitive tasks more efficiently. Here we present an analogue non-volatile resistive memory (an electronic synapse) with foundry-friendly materials. The device shows bidirectional continuous weight modulation behaviour. Grey-scale face classification is experimentally demonstrated using an integrated 1024-cell array with parallel online training. The energy consumption within the analogue synapses for each iteration is 1,000 × (20 ×) lower compared to an implementation using an Intel Xeon Phi processor with off-chip memory (with hypothetical on-chip digital resistive random access memory). The accuracy on test sets is close to the result using a central processing unit. These experimental results consolidate the feasibility of the analogue synaptic array and pave the way toward building an energy-efficient, large-scale neuromorphic system.

  18. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Shipman, Galen M.

    These are the slides for a presentation on programming models in HPC, given at the Los Alamos National Laboratory's Parallel Computing Summer School. The following topics are covered: Flynn's Taxonomy of computer architectures; single instruction single data; single instruction multiple data; multiple instruction multiple data; address space organization; definition of Trinity (Intel Xeon Phi is a MIMD architecture); single program multiple data; multiple program multiple data; ExMatEx workflow overview; definition of a programming model, programming languages, runtime systems; programming models and environments; MPI (Message Passing Interface); OpenMP; Kokkos (Performance Portable Thread-Parallel Programming Model); Kokkos abstractions, patterns, policies, and spaces; RAJA, a systematic approach to node-level portability and tuning; overview of the Legion Programming Model; mapping tasks and data to hardware resources; interoperability: supporting task-level models; Legion S3D execution and performance details; workflow, integration of external resources into the programming model.

  19. Evaluation of stochastic algorithms for financial mathematics problems from point of view of energy-efficiency

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Atanassov, E.; Dimitrov, D.; Gurov, T.

    2015-10-28

    The recent developments in the area of high-performance computing are driven not only by the desire for ever higher performance but also by the rising costs of electricity. The use of various types of accelerators like GPUs, Intel Xeon Phi has become mainstream and many algorithms and applications have been ported to make use of them where available. In Financial Mathematics the question of optimal use of computational resources should also take into account the limitations on space, because in many use cases the servers are deployed close to the exchanges. In this work we evaluate various algorithms for option pricing that we have implemented for different target architectures in terms of their energy and space efficiency. Since it has been established that low-discrepancy sequences may be better than pseudorandom numbers for these types of algorithms, we also test the Sobol and Halton sequences. We present the raw results, the computed metrics and conclusions from our tests.

  20. Kernel optimization for short-range molecular dynamics

    NASA Astrophysics Data System (ADS)

    Hu, Changjun; Wang, Xianmeng; Li, Jianjiang; He, Xinfu; Li, Shigang; Feng, Yangde; Yang, Shaofeng; Bai, He

    2017-02-01

    To optimize short-range force computations in Molecular Dynamics (MD) simulations, multi-threading and SIMD optimizations are presented in this paper. For multi-threading optimization, a Partition-and-Separate-Calculation (PSC) method is designed to avoid the write conflicts caused by using Newton's third law. Serial bottlenecks are eliminated with no additional memory usage. The method is implemented using the OpenMP model. Furthermore, the PSC method is employed on Intel Xeon Phi coprocessors in both native and offload models. We also evaluate the performance of the PSC method under different thread affinities on the MIC architecture. For SIMD execution, we analyze the performance impact of the "if-clause" of the cutoff radius check in the PSC method. The experimental results show that our PSC method is relatively more efficient compared to some traditional methods. In double precision, our 256-bit SIMD implementation is about 3 times faster than the scalar version.
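
    The write conflicts the PSC method eliminates arise because Newton's third law turns each pair computation into a scatter to two atoms' forces. One generic way to make such updates race-free without per-thread force buffers, shown below with hypothetical names, is a two-phase decomposition into strips at least one cutoff radius wide; this illustrates the class of idea, not the authors' exact PSC scheme (the pair search and force law are placeholders):

```cpp
#include <vector>

struct Strip { std::vector<int> atoms; };  // atoms binned by x coordinate

// Pairs owned by strip s touch only atoms in strips s and s+1.  Running
// all even strips in phase 0 and all odd strips in phase 1 guarantees no
// two concurrent threads write the same atom's force, even with the
// Newton's-third-law update fx[j] -= f.
void compute_forces(std::vector<Strip>& strips, std::vector<double>& fx) {
    const int ns = static_cast<int>(strips.size());
    for (int phase = 0; phase < 2; ++phase) {
        #pragma omp parallel for schedule(dynamic)
        for (int s = phase; s < ns; s += 2) {
            for (int i : strips[s].atoms) {
                for (int j : strips[s].atoms)            // intra-strip pairs
                    if (j > i) { fx[i] += 1.0; fx[j] -= 1.0; }
                if (s + 1 < ns)
                    for (int j : strips[s + 1].atoms) {  // cross-strip pairs
                        fx[i] += 1.0; fx[j] -= 1.0;
                    }
            }
        }
    }
}
```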

  1. Face classification using electronic synapses.

    PubMed

    Yao, Peng; Wu, Huaqiang; Gao, Bin; Eryilmaz, Sukru Burc; Huang, Xueyao; Zhang, Wenqiang; Zhang, Qingtian; Deng, Ning; Shi, Luping; Wong, H-S Philip; Qian, He

    2017-05-12

    Conventional hardware platforms consume a huge amount of energy for cognitive learning due to the data movement between the processor and the off-chip memory. Brain-inspired device technologies using analogue weight storage make it possible to complete cognitive tasks more efficiently. Here we present an analogue non-volatile resistive memory (an electronic synapse) with foundry-friendly materials. The device shows bidirectional continuous weight modulation behaviour. Grey-scale face classification is experimentally demonstrated using an integrated 1024-cell array with parallel online training. The energy consumption within the analogue synapses for each iteration is 1,000 × (20 ×) lower compared to an implementation using an Intel Xeon Phi processor with off-chip memory (with hypothetical on-chip digital resistive random access memory). The accuracy on test sets is close to the result using a central processing unit. These experimental results consolidate the feasibility of the analogue synaptic array and pave the way toward building an energy-efficient, large-scale neuromorphic system.

  2. Understanding Portability of a High-Level Programming Model on Contemporary Heterogeneous Architectures

    DOE PAGES

    Sabne, Amit J.; Sakdhnagool, Putt; Lee, Seyong; ...

    2015-07-13

    Accelerator-based heterogeneous computing is gaining momentum in the high-performance computing arena. However, the increased complexity of heterogeneous architectures demands more generic, high-level programming models. OpenACC is one such attempt to tackle this problem. Although the abstraction provided by OpenACC offers productivity, it raises questions concerning both functional and performance portability. In this article, the authors propose HeteroIR, a high-level, architecture-independent intermediate representation, to map high-level programming models, such as OpenACC, to heterogeneous architectures. They present a compiler approach that translates OpenACC programs into HeteroIR and accelerator kernels to obtain OpenACC functional portability. They then evaluate the performance portability obtained by OpenACC with their approach on 12 OpenACC programs on Nvidia CUDA, AMD GCN, and Intel Xeon Phi architectures. They study the effects of various compiler optimizations and OpenACC program settings on these architectures to provide insights into the achieved performance portability.

  3. Accelerating a three-dimensional eco-hydrological cellular automaton on GPGPU with OpenCL

    NASA Astrophysics Data System (ADS)

    Senatore, Alfonso; D'Ambrosio, Donato; De Rango, Alessio; Rongo, Rocco; Spataro, William; Straface, Salvatore; Mendicino, Giuseppe

    2016-10-01

    This work presents an effective implementation of a numerical model for complete eco-hydrological Cellular Automata modeling on Graphical Processing Units (GPUs) with OpenCL (Open Computing Language) for heterogeneous computation (i.e., on CPUs and/or GPUs). Different types of parallel implementations were carried out (e.g., use of fast local memory, loop unrolling, etc.), showing increasing performance improvements in terms of speedup and also adopting some original optimization strategies. Moreover, numerical analysis of the results (i.e., comparison of CPU and GPU outcomes in terms of rounding errors) has proven satisfactory. Experiments were carried out on a workstation with two CPUs (Intel Xeon E5440 at 2.83GHz), one AMD R9 280X GPU and one NVIDIA Tesla K20c GPU. Results have been extremely positive, but further testing should be performed to assess the functionality of the adopted strategies on other complete models and their ability to fruitfully exploit parallel system resources.

  4. Evaluation of stochastic algorithms for financial mathematics problems from point of view of energy-efficiency

    NASA Astrophysics Data System (ADS)

    Atanassov, E.; Dimitrov, D.; Gurov, T.

    2015-10-01

    The recent developments in the area of high-performance computing are driven not only by the desire for ever higher performance but also by the rising costs of electricity. The use of various types of accelerators like GPUs, Intel Xeon Phi has become mainstream and many algorithms and applications have been ported to make use of them where available. In Financial Mathematics the question of optimal use of computational resources should also take into account the limitations on space, because in many use cases the servers are deployed close to the exchanges. In this work we evaluate various algorithms for option pricing that we have implemented for different target architectures in terms of their energy and space efficiency. Since it has been established that low-discrepancy sequences may be better than pseudorandom numbers for these types of algorithms, we also test the Sobol and Halton sequences. We present the raw results, the computed metrics and conclusions from our tests.

  5. Error analysis for creating 3D face templates based on cylindrical quad-tree structure

    NASA Astrophysics Data System (ADS)

    Gutfeter, Weronika

    2015-09-01

    The development of new biometric algorithms parallels advances in the technology of sensing devices. Some of the limitations of current face recognition systems may be eliminated by integrating 3D sensors into these systems. Depth-sensing devices can capture the spatial structure of the face in addition to its texture and color. Such data, however, is usually very voluminous and requires a large amount of computing resources to process (face scans obtained with typical depth cameras contain more than 150,000 points per face). That is why defining efficient data structures for processing spatial images is crucial for the further development of 3D face recognition methods. The concept described in this work fulfills these demands. A modification of the quad-tree structure was chosen because it can easily be transformed into lower-dimensional data structures and maintains spatial relations between data points. We are able to interpret the data stored in the tree as a pyramid of features, which allows us to analyze face images using the coarse-to-fine strategy often exploited in biometric recognition systems.

  6. Introducing Argonne’s Theta Supercomputer

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    None

    Theta, the Argonne Leadership Computing Facility’s (ALCF) new Intel-Cray supercomputer, is officially open to the research community. Theta’s massively parallel, many-core architecture puts the ALCF on the path to Aurora, the facility’s future Intel-Cray system. Capable of nearly 10 quadrillion calculations per second, Theta enables researchers to break new ground in scientific investigations that range from modeling the inner workings of the brain to developing new materials for renewable energy applications.

  7. Faster Smith-Waterman database searches with inter-sequence SIMD parallelisation.

    PubMed

    Rognes, Torbjørn

    2011-06-01

    The Smith-Waterman algorithm for local sequence alignment is more sensitive than heuristic methods for database searching, but also more time-consuming. The fastest approach to parallelisation with SIMD technology has previously been described by Farrar in 2007. The aim of this study was to explore whether further speed could be gained by other approaches to parallelisation. A faster approach and implementation is described and benchmarked. In the new tool SWIPE, residues from sixteen different database sequences are compared in parallel to one query residue. Using a 375 residue query sequence a speed of 106 billion cell updates per second (GCUPS) was achieved on a dual Intel Xeon X5650 six-core processor system, which is over six times more rapid than software based on Farrar's 'striped' approach. SWIPE was about 2.5 times faster when the programs used only a single thread. For shorter queries, the increase in speed was larger. SWIPE was about twice as fast as BLAST when using the BLOSUM50 score matrix, while BLAST was about twice as fast as SWIPE for the BLOSUM62 matrix. The software is designed for 64 bit Linux on processors with SSSE3. Source code is available from http://dna.uio.no/swipe/ under the GNU Affero General Public License. Efficient parallelisation using SIMD on standard hardware makes it possible to run Smith-Waterman database searches more than six times faster than before. The approach described here could significantly widen the potential application of Smith-Waterman searches. Other applications that require optimal local alignment scores could also benefit from improved performance.
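
    The inter-sequence layout SWIPE introduced, sixteen database sequences per SSE vector with one residue per 8-bit lane, can be sketched with a gap-free scoring loop. This shows only the data layout and the biased-unsigned saturation trick for max(0, score + sub); SWIPE's real kernel also carries gap-open/extend state, and the profile layout here is an assumption:

```cpp
#include <emmintrin.h>   // SSE2
#include <cstdint>

// Gap-free scores for 16 database sequences against one query, one
// sequence per 8-bit lane.  Scores are biased-unsigned so that
// max(0, score + sub) becomes saturating-add then saturating-subtract.
__m128i gapless_scores(const std::uint8_t* sub_biased, int qlen, int bias) {
    const __m128i vbias = _mm_set1_epi8(static_cast<char>(bias));
    __m128i score = _mm_setzero_si128();
    __m128i best  = _mm_setzero_si128();
    for (int k = 0; k < qlen; ++k) {
        // 16 biased substitution scores: query residue k versus the current
        // residue of each sequence (layout precomputed by the caller).
        const __m128i sub = _mm_loadu_si128(
            reinterpret_cast<const __m128i*>(sub_biased + 16 * k));
        score = _mm_adds_epu8(score, sub);    // score + sub + bias, saturating
        score = _mm_subs_epu8(score, vbias);  // remove bias, clamp at zero
        best  = _mm_max_epu8(best, score);    // running per-lane maxima
    }
    return best;   // 16 best gap-free scores, one per lane
}
```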

  8. Rapid insights from remote sensing in the geosciences

    NASA Astrophysics Data System (ADS)

    Plaza, Antonio

    2015-03-01

    The growing availability of capacity computing for atomistic materials modeling has encouraged the use of high-accuracy computationally intensive interatomic potentials, such as SNAP. These potentials also happen to scale well on petascale computing platforms. SNAP has a very general form and uses machine-learning techniques to reproduce the energies, forces, and stress tensors of a large set of small configurations of atoms, which are obtained using high-accuracy quantum electronic structure (QM) calculations. The local environment of each atom is characterized by a set of bispectrum components of the local neighbor density projected on to a basis of hyperspherical harmonics in four dimensions. The computational cost per atom is much greater than that of simpler potentials such as Lennard-Jones or EAM, while the communication cost remains modest. We discuss a variety of strategies for implementing SNAP in the LAMMPS molecular dynamics package. We present scaling results obtained running SNAP on three different classes of machine: a conventional Intel Xeon CPU cluster; the Titan GPU-based system; and the combined Sequoia and Vulcan BlueGene/Q.

  9. A Survey of Recent MARTe Based Systems

    NASA Astrophysics Data System (ADS)

    Neto, André C.; Alves, Diogo; Boncagni, Luca; Carvalho, Pedro J.; Valcarcel, Daniel F.; Barbalace, Antonio; De Tommasi, Gianmaria; Fernandes, Horácio; Sartori, Filippo; Vitale, Enzo; Vitelli, Riccardo; Zabeo, Luca

    2011-08-01

    The Multithreaded Application Real-Time executor (MARTe) is a data-driven framework environment for the development and deployment of real-time control algorithms. The main ideas which led to the present version of the framework were to standardize the development of real-time control systems, while providing a set of strictly bounded standard interfaces to the outside world and also accommodating a collection of facilities which promote the speed and ease of development, commissioning and deployment of such systems. At the core of every MARTe-based application is a set of independent inter-communicating software blocks, named Generic Application Modules (GAM), orchestrated by a real-time scheduler. The platform independence of its core library provides MARTe with the necessary robustness and flexibility for conveniently testing applications in different environments, including non-real-time operating systems. MARTe is already being used in several machines, each with its own peculiarities regarding hardware interfacing, supervisory control configuration, operating system and target control application. This paper presents and compares the most recent results of systems using MARTe: the JET Vertical Stabilization system, which uses the Real Time Application Interface (RTAI) operating system on Intel multi-core processors; the COMPASS plasma control system, driven by Linux RT, also on Intel multi-core processors; ISTTOK real-time tomography equilibrium reconstruction, which shares the same support configuration as COMPASS; JET error field correction coils based on VME, PowerPC and VxWorks; and the FTU LH reflected power system running on VME and Intel with RTAI.

  10. Analysis of the Intel 386 and i486 microprocessors for the Space Station Freedom Data Management System

    NASA Technical Reports Server (NTRS)

    Liu, Yuan-Kwei

    1991-01-01

    The feasibility is analyzed of upgrading the Intel 386 microprocessor, which has been proposed as the baseline processor for the Space Station Freedom (SSF) Data Management System (DMS), to the more advanced i486 microprocessors. The items compared between the two processors include the instruction set architecture, power consumption, the MIL-STD-883C Class S (Space) qualification schedule, and performance. The advantages of the i486 over the 386 are (1) lower power consumption; and (2) higher floating point performance. The i486 on-chip cache does not have parity check or error detection and correction circuitry. The i486 with on-chip cache disabled, however, has lower integer performance than the 386 without cache, which is the current DMS design choice. Adding cache to the 386/387 DX memory hierarchy appears to be the most beneficial change to the current DMS design at this time.

  11. Analysis of the Intel 386 and i486 microprocessors for the Space Station Freedom Data Management System

    NASA Technical Reports Server (NTRS)

    Liu, Yuan-Kwei

    1991-01-01

    The feasibility is analyzed of upgrading the Intel 386 microprocessor, which has been proposed as the baseline processor for the Space Station Freedom (SSF) Data Management System (DMS), to the more advanced i486 microprocessors. The items compared between the two processors include the instruction set architecture, power consumption, the MIL-STD-883C Class S (Space) qualification schedule, and performance. The advantages of the i486 over the 386 are (1) lower power consumption; and (2) higher floating point performance. The i486 on-chip cache does not have parity check or error detection and correction circuitry. The i486 with on-chip cache disabled, however, has lower integer performance than the 386 without cache, which is the current DMS design choice. Adding cache to the 386/387 DX memory hierarchy appears to be the most beneficial change to the current DMS design at this time.

  12. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Cohen, J; Dossa, D; Gokhale, M

    Micro X7DBE Xeon Dual Socket Blackford Server Motherboard; 2 Intel Xeon Dual-Core 2.66 GHz processors; 1 GB DDR2 PC2-5300 RAM (2 x 512); 80 GB Hard Drive (Seagate SATA II Barracuda). The Fusion board is presently capable of 4X in a PCIe slot. The image resampling benchmark was run on a dual Xeon workstation with NVIDIA graphics card (see Chapter 5 for full specification). An XtremeData Opteron+FPGA was used for the language classification application. We observed that these benchmarks are not uniformly I/O intensive. The only benchmark that showed greater than 50% of the time in I/O was the graph algorithm when it accessed data files over NFS. When local disk was used, the graph benchmark spent at most 40% of its time in I/O. The other benchmarks were CPU dominated. The image resampling benchmark and language classification showed order-of-magnitude speedups over software by using co-processor technology to offload the CPU-intensive kernels. Our experiments to date suggest that emerging hardware technologies offer significant benefit to boosting the performance of data-intensive algorithms. Using GPU and FPGA co-processors, we were able to improve performance by more than an order of magnitude on the benchmark algorithms, eliminating the processor bottleneck of CPU-bound tasks. Experiments with a prototype solid state nonvolatile memory available today show 10X better throughput on random reads than disk, with a 2X speedup on a graph processing benchmark when compared to the use of local SATA disk.

  13. A hybrid algorithm for parallel molecular dynamics simulations

    NASA Astrophysics Data System (ADS)

    Mangiardi, Chris M.; Meyer, R.

    2017-10-01

    This article describes algorithms for the hybrid parallelization and SIMD vectorization of molecular dynamics simulations with short-range forces. The parallelization method combines domain decomposition with a thread-based parallelization approach. The goal of the work is to enable efficient simulations of very large (tens of millions of atoms) and inhomogeneous systems on many-core processors with hundreds or thousands of cores and SIMD units with large vector sizes. In order to test the efficiency of the method, simulations of a variety of configurations with up to 74 million atoms have been performed. Results are shown that were obtained on multi-core systems with Sandy Bridge and Haswell processors as well as systems with Xeon Phi many-core processors.

  14. Core-core and core-valence correlation

    NASA Technical Reports Server (NTRS)

    Bauschlicher, Charles W., Jr.; Langhoff, Stephen R.; Taylor, Peter R.

    1988-01-01

    The effect of (1s) core correlation on properties and energy separations was analyzed using full configuration-interaction (FCI) calculations. The Be 1S - 1P, the C 3P - 5S, and the CH+ 1Sigma+ - 1Pi separations, as well as the CH+ spectroscopic constants, dipole moment, and 1Sigma+ - 1Pi transition dipole moment, were studied. The results of the FCI calculations are compared to those obtained using approximate methods. In addition, the generation of atomic natural orbital (ANO) basis sets, as a method for contracting a primitive basis set for both valence and core correlation, is discussed. When both core-core and core-valence correlation are included in the calculation, no suitable truncated CI approach consistently reproduces the FCI, and contraction of the basis set is very difficult. If the (nearly constant) core-core correlation is eliminated, and only the core-valence correlation is included, CASSCF/MRCI approaches reproduce the FCI results and basis set contraction is significantly easier.

  15. Parallel spatial direct numerical simulations on the Intel iPSC/860 hypercube

    NASA Technical Reports Server (NTRS)

    Joslin, Ronald D.; Zubair, Mohammad

    1993-01-01

    The implementation and performance of a parallel spatial direct numerical simulation (PSDNS) approach on the Intel iPSC/860 hypercube is documented. The direct numerical simulation approach is used to compute spatially evolving disturbances associated with the laminar-to-turbulent transition in boundary-layer flows. The feasibility of using the PSDNS on the hypercube to perform transition studies is examined. The results indicate that the direct numerical simulation approach can effectively be parallelized on a distributed-memory parallel machine. By increasing the number of processors, nearly ideal linear speedups are achieved with nonoptimized routines; slower-than-linear speedups are achieved with optimized (machine-dependent library) routines, because the Fast Fourier Transform (FFT) routine dominates the computational cost and itself shows less than ideal speedups. However, with the machine-dependent routines the total computational cost decreases by a factor of 4 to 5 compared with standard FORTRAN routines. The computational cost increases linearly with spanwise, wall-normal, and streamwise grid refinements. The hypercube with 32 processors was estimated to require approximately twice the amount of Cray supercomputer single-processor time to complete a comparable simulation; however, it is estimated that a subgrid-scale model, which reduces the required number of grid points and turns the simulation into a large-eddy simulation (PSLES), would reduce the computational cost and memory requirements by a factor of 10 over the PSDNS. This PSLES implementation would enable transition simulations on the hypercube at a reasonable computational cost.

  16. Space shuttle engineering and operations support. Isolation between the S-band quad antenna and the S-band payload antenna. Engineering systems analysis

    NASA Technical Reports Server (NTRS)

    Lindsey, J. F.

    1976-01-01

    The isolation between the upper S-band quad antenna and the S-band payload antenna on the shuttle orbiter is calculated using a combination of plane-surface and curved-surface theories along with worst-case values. A minimum isolation of 60 dB is predicted based on recent antenna pattern data, antenna locations on the orbiter, curvature effects, dielectric covering effects and edge effects of the payload bay. The calculated value of 60 dB is significantly greater than the baseline value of 40 dB. Use of the new value will result in the design of smaller, lighter weight and less expensive filters for the S-band transponder and the S-band payload interrogator.

  17. Optimization of high count rate event counting detector with Microchannel Plates and quad Timepix readout

    NASA Astrophysics Data System (ADS)

    Tremsin, A. S.; Vallerga, J. V.; McPhate, J. B.; Siegmund, O. H. W.

    2015-07-01

    Many high resolution event counting devices process one event at a time and cannot register simultaneous events. In this article a frame-based readout event counting detector consisting of a pair of Microchannel Plates and a quad Timepix readout is described. More than 10^4 simultaneous events can be detected with a spatial resolution of 55 μm, while >10^3 simultaneous events can be detected with <10 μm spatial resolution when event centroiding is implemented. The fast readout electronics is capable of processing >1200 frames/sec, while the global count rate of the detector can exceed 5×10^8 particles/s when no timing information on every particle is required. For the first generation Timepix readout, the timing resolution is limited by the Timepix clock to 10-20 ns. Optimization of the MCP gain, rear field voltage and Timepix threshold levels is crucial for the device performance, and that is the main subject of this article. These devices can be very attractive for applications where photon/electron/ion/neutron counting with high spatial and temporal resolution is required, such as energy resolved neutron imaging, Time of Flight experiments in lidar applications, experiments on photoelectron spectroscopy and many others.
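
    Event centroiding, as mentioned above, amounts to a center-of-mass calculation over each event's pixel footprint, which refines the native 55 μm pixel pitch to sub-pixel coordinates. A minimal sketch with a hypothetical 3×3 footprint of pixel counts:

        import numpy as np

        def centroid(cluster):
            # Center-of-mass centroid of one event footprint (a small 2D array
            # of pixel counts), giving sub-pixel (x, y) coordinates.
            counts = np.asarray(cluster, dtype=float)
            total = counts.sum()
            ys, xs = np.indices(counts.shape)
            return (xs * counts).sum() / total, (ys * counts).sum() / total

        # A hypothetical 3x3 footprint: charge shared slightly toward the right.
        print(centroid([[0, 1, 0],
                        [1, 5, 3],
                        [0, 1, 0]]))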

  18. LHCb Kalman Filter cross architecture studies

    NASA Astrophysics Data System (ADS)

    Cámpora Pérez, Daniel Hugo

    2017-10-01

    The 2020 upgrade of the LHCb detector will vastly increase the rate of collisions the Online system needs to process in software, in order to filter events in real time. 30 million collisions per second will pass through a selection chain, where each step is executed conditionally on acceptance by the prior steps. The Kalman Filter is a fit applied to all reconstructed tracks which, due to its time characteristics and early execution in the selection chain, consumes 40% of the whole reconstruction time in the current trigger software. This makes the Kalman Filter a time-critical component as the LHCb trigger evolves into a full software trigger in the Upgrade. I present a new Kalman Filter algorithm for LHCb that can efficiently make use of any kind of SIMD processor, and its design is explained in depth. Performance benchmarks are compared between a variety of hardware architectures, including x86_64, Power8 and the Intel Xeon Phi accelerator, and the suitability of these architectures for efficiently performing the LHCb reconstruction process is determined.
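
    For readers unfamiliar with the fit, the linear-algebra kernel a Kalman track fit repeats per measurement is the predict/update pair sketched below. This is a generic 1D constant-velocity toy, not the LHCb implementation; all matrices and noise levels are illustrative.

        import numpy as np

        F = np.array([[1.0, 1.0], [0.0, 1.0]])   # state transition (pos, vel)
        H = np.array([[1.0, 0.0]])               # we only measure position
        Q = 1e-4 * np.eye(2)                     # process noise covariance
        R = np.array([[0.25]])                   # measurement noise covariance

        x = np.array([0.0, 0.0])                 # state estimate
        P = np.eye(2)                            # state covariance

        rng = np.random.default_rng(1)
        for k in range(20):
            z = 0.8 * k + rng.normal(scale=0.5)  # noisy position measurement
            # Predict
            x = F @ x
            P = F @ P @ F.T + Q
            # Update
            S = H @ P @ H.T + R                  # innovation covariance
            K = P @ H.T @ np.linalg.inv(S)       # Kalman gain
            x = x + K @ (np.array([z]) - H @ x)
            P = (np.eye(2) - K @ H) @ P

        print("estimated velocity:", x[1])       # should approach the true 0.8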

  19. PERI - Auto-tuning Memory Intensive Kernels for Multicore

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Bailey, David H; Williams, Samuel; Datta, Kaushik

    2008-06-24

    We present an auto-tuning approach to optimize application performance on emerging multicore architectures. The methodology extends the idea of search-based performance optimizations, popular in linear algebra and FFT libraries, to application-specific computational kernels. Our work applies this strategy to Sparse Matrix Vector Multiplication (SpMV), the explicit heat equation PDE on a regular grid (Stencil), and a lattice Boltzmann application (LBMHD). We explore one of the broadest sets of multicore architectures in the HPC literature, including the Intel Xeon Clovertown, AMD Opteron Barcelona, Sun Victoria Falls, and the Sony-Toshiba-IBM (STI) Cell. Rather than hand-tuning each kernel for each system, we develop a code generator for each kernel that allows us to identify a highly optimized version for each platform, while amortizing the human programming effort. Results show that our auto-tuned kernel applications often achieve a better than 4X improvement compared with the original code. Additionally, we analyze a Roofline performance model for each platform to reveal hardware bottlenecks and software challenges for future multicore systems and applications.
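
    Search-based auto-tuning in miniature: generate candidate kernel variants, time each on the target machine, and keep the fastest. The sketch below tunes a single parameter (the block size of a hypothetical blocked transpose), whereas the tuners described above search much larger spaces of code variants.

        import time
        import numpy as np

        def blocked_transpose(a, bs):
            # Candidate kernel variant: transpose in bs x bs cache blocks.
            n = a.shape[0]
            out = np.empty_like(a)
            for i in range(0, n, bs):
                for j in range(0, n, bs):
                    out[j:j+bs, i:i+bs] = a[i:i+bs, j:j+bs].T
            return out

        def autotune(n=2048, candidates=(16, 32, 64, 128, 256)):
            # Time every candidate block size on this machine, keep the fastest.
            a = np.random.default_rng(0).normal(size=(n, n))
            timings = {}
            for bs in candidates:
                t0 = time.perf_counter()
                blocked_transpose(a, bs)
                timings[bs] = time.perf_counter() - t0
            return min(timings, key=timings.get), timings

        best, timings = autotune()
        print("best block size on this machine:", best)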

  20. A derivation and scalable implementation of the synchronous parallel kinetic Monte Carlo method for simulating long-time dynamics

    NASA Astrophysics Data System (ADS)

    Byun, Hye Suk; El-Naggar, Mohamed Y.; Kalia, Rajiv K.; Nakano, Aiichiro; Vashishta, Priya

    2017-10-01

    Kinetic Monte Carlo (KMC) simulations are used to study long-time dynamics of a wide variety of systems. Unfortunately, the conventional KMC algorithm is not scalable to larger systems, since its time scale is inversely proportional to the simulated system size. A promising approach to resolving this issue is the synchronous parallel KMC (SPKMC) algorithm, which makes the time scale size-independent. This paper introduces a formal derivation of the SPKMC algorithm based on local transition-state and time-dependent Hartree approximations, as well as its scalable parallel implementation based on a dual linked-list cell method. The resulting algorithm has achieved a weak-scaling parallel efficiency of 0.935 on 1024 Intel Xeon processors for simulating biological electron transfer dynamics in a 4.2 billion-heme system, as well as decent strong-scaling parallel efficiency. The parallel code has been used to simulate a lattice of cytochrome complexes on a bacterial-membrane nanowire, and it is broadly applicable to other problems such as computational synthesis of new materials.
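
    The serial kernel that SPKMC parallelizes is the residence-time (BKL/Gillespie) step: choose an event with probability proportional to its rate and advance the clock by an exponentially distributed increment. A minimal sketch with hypothetical rates, which also shows why the conventional time step shrinks as the system, and hence the total rate, grows:

        import numpy as np

        def kmc_step(rates, rng):
            # One residence-time KMC step: event i fires with probability
            # rates[i]/R_total; the clock advances by an Exp(R_total) draw.
            # As the system grows, R_total grows and the mean step 1/R_total
            # shrinks -- the size-scaling problem SPKMC removes.
            total = rates.sum()
            event = rng.choice(len(rates), p=rates / total)
            dt = rng.exponential(1.0 / total)
            return event, dt

        rng = np.random.default_rng(2)
        rates = np.array([1.0, 0.5, 0.1, 2.0])   # hypothetical transition rates
        t = 0.0
        for _ in range(5):
            event, dt = kmc_step(rates, rng)
            t += dt
            print(f"t = {t:.3f}: fired event {event}")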

  1. Numerical solution of the Navier-Stokes equations by discontinuous Galerkin method

    NASA Astrophysics Data System (ADS)

    Krasnov, M. M.; Kuchugov, P. A.; E Ladonkina, M.; E Lutsky, A.; Tishkin, V. F.

    2017-02-01

    Detailed unstructured grids and numerical methods of high accuracy are frequently used in the numerical simulation of gasdynamic flows in areas with complex geometry. The Galerkin method with discontinuous basis functions, or Discontinuous Galerkin Method (DGM), works well in dealing with such problems. This approach offers a number of advantages inherent to both finite-element and finite-difference approximations. Moreover, the present paper shows that DGM schemes can be viewed as an extension of the Godunov method to piecewise-polynomial functions. As is known, DGM involves significant computational complexity, and this brings up the question of ensuring the most effective use of all the computational capacity available. In order to speed up the calculations, the operator programming method has been applied while creating the computational module. This approach makes possible compact encoding of mathematical formulas and facilitates the porting of programs to parallel architectures, such as NVidia CUDA and Intel Xeon Phi. With the software package based on DGM, numerical simulations of supersonic flow past solid bodies have been carried out. The numerical results are in good agreement with the experimental ones.

  2. CUDA-based acceleration of collateral filtering in brain MR images

    NASA Astrophysics Data System (ADS)

    Li, Cheng-Yuan; Chang, Herng-Hua

    2017-02-01

    Image denoising is one of the fundamental and essential tasks within image processing. In medical imaging, finding an effective algorithm that can remove random noise in MR images is important. This paper proposes an effective noise reduction method for brain magnetic resonance (MR) images. Our approach is based on the collateral filter, which is a more powerful method than the bilateral filter in many cases. However, the computation of the collateral filter algorithm is quite time-consuming. To solve this problem, we improved the collateral filter algorithm with parallel computing using a GPU. We adopted CUDA, an application programming interface for GPUs by NVIDIA, to accelerate the computation. Our experimental evaluation on an Intel Xeon CPU E5-2620 v3 at 2.40 GHz with an NVIDIA Tesla K40c GPU indicated that the proposed implementation runs dramatically faster than the traditional collateral filter. We believe that the proposed framework has established a general blueprint for achieving fast and robust filtering in a wide variety of medical image denoising applications.
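
    The abstract names the bilateral filter as the baseline the collateral filter builds on. As a point of reference only (the collateral filter itself is not sketched here), a brute-force bilateral filter follows; each output pixel is computed independently, which is exactly the per-pixel parallelism a CUDA port exploits. Parameters and sizes are illustrative.

        import numpy as np

        def bilateral(img, radius=2, sigma_s=1.5, sigma_r=0.1):
            # Each output pixel is a weighted mean of its neighbors, with
            # weights combining spatial distance and intensity difference.
            h, w = img.shape
            pad = np.pad(img, radius, mode='reflect')
            out = np.zeros_like(img)
            dy, dx = np.mgrid[-radius:radius+1, -radius:radius+1]
            w_s = np.exp(-(dx**2 + dy**2) / (2 * sigma_s**2))   # spatial weights
            for i in range(h):
                for j in range(w):
                    patch = pad[i:i+2*radius+1, j:j+2*radius+1]
                    w_r = np.exp(-(patch - img[i, j])**2 / (2 * sigma_r**2))
                    weights = w_s * w_r                          # range weights
                    out[i, j] = (weights * patch).sum() / weights.sum()
            return out

        noisy = np.random.default_rng(3).normal(0.5, 0.05, size=(64, 64))
        print("variance before/after:", noisy.var(), bilateral(noisy).var())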

  3. Methods for compressible fluid simulation on GPUs using high-order finite differences

    NASA Astrophysics Data System (ADS)

    Pekkilä, Johannes; Väisälä, Miikka S.; Käpylä, Maarit J.; Käpylä, Petri J.; Anjum, Omer

    2017-08-01

    We focus on implementing and optimizing a sixth-order finite-difference solver for simulating compressible fluids on a GPU using third-order Runge-Kutta integration. Since graphics processing units perform well in data-parallel tasks, this makes them an attractive platform for fluid simulation. However, high-order stencil computation is memory-intensive with respect to both main memory and the caches of the GPU. We present two approaches for simulating compressible fluids using 55-point and 19-point stencils. We seek to reduce the requirements for memory bandwidth and cache size in our methods by using cache blocking and decomposing a latency-bound kernel into several bandwidth-bound kernels. Our fastest implementation is bandwidth-bound and integrates 343 million grid points per second on a Tesla K40t GPU, achieving a 3.6× speedup over a comparable hydrodynamics solver benchmarked on two Intel Xeon E5-2690v3 processors. Our alternative GPU implementation is latency-bound and achieves the rate of 168 million updates per second.
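
    A 1D toy with the same building blocks, assuming the standard sixth-order central-difference coefficients and the third-order strong-stability-preserving Runge-Kutta scheme; the solver's 55- and 19-point stencils are 3D generalisations of this 7-point stencil.

        import numpy as np

        def dudx6(u, h):
            # Standard sixth-order central first derivative, periodic grid.
            return (-np.roll(u, 3) + 9 * np.roll(u, 2) - 45 * np.roll(u, 1)
                    + 45 * np.roll(u, -1) - 9 * np.roll(u, -2) + np.roll(u, -3)) / (60 * h)

        # Advect u_t = -c u_x with SSP third-order Runge-Kutta.
        n, c = 256, 1.0
        x = np.linspace(0.0, 2 * np.pi, n, endpoint=False)
        h, u = x[1] - x[0], np.sin(x)
        dt = 0.4 * h / c
        rhs = lambda v: -c * dudx6(v, h)
        steps = int(round(2 * np.pi / (c * dt)))        # roughly one period
        for _ in range(steps):
            u1 = u + dt * rhs(u)
            u2 = 0.75 * u + 0.25 * (u1 + dt * rhs(u1))
            u = u / 3 + 2.0 / 3.0 * (u2 + dt * rhs(u2))
        print("max error vs exact solution:", np.abs(u - np.sin(x - c * dt * steps)).max())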

  4. A biomolecular electrostatics solver using Python, GPUs and boundary elements that can handle solvent-filled cavities and Stern layers.

    PubMed

    Cooper, Christopher D; Bardhan, Jaydeep P; Barba, L A

    2014-03-01

    The continuum theory applied to biomolecular electrostatics leads to an implicit-solvent model governed by the Poisson-Boltzmann equation. Solvers relying on a boundary integral representation typically do not consider features like solvent-filled cavities or ion-exclusion (Stern) layers, due to the added difficulty of treating multiple boundary surfaces. This has hindered meaningful comparisons with volume-based methods, and the effects on accuracy of including these features have remained unknown. This work presents a solver called PyGBe that uses a boundary-element formulation and can handle multiple interacting surfaces. It was used to study the effects of solvent-filled cavities and Stern layers on the accuracy of calculating solvation energy and binding energy of proteins, using the well-known APBS finite-difference code for comparison. The results suggest that if the required accuracy for an application allows errors larger than about 2% in solvation energy, then the simpler, single-surface model can be used. When calculating binding energies, the need for a multi-surface model is problem-dependent, becoming more critical when ligand and receptor are of comparable size. Comparing with the APBS solver, the boundary-element solver is faster when the accuracy requirements are higher. The cross-over point for the PyGBe code is on the order of 1-2% error, when running on one GPU card (NVIDIA Tesla C2075), compared with APBS running on six Intel Xeon CPU cores. PyGBe achieves algorithmic acceleration of the boundary element method using a treecode, and hardware acceleration using GPUs via PyCuda from a user-visible code that is all Python. The code is open-source under the MIT license.

  5. Spectral-element Seismic Wave Propagation on CUDA/OpenCL Hardware Accelerators

    NASA Astrophysics Data System (ADS)

    Peter, D. B.; Videau, B.; Pouget, K.; Komatitsch, D.

    2015-12-01

    Seismic wave propagation codes are essential tools to investigate a variety of wave phenomena in the Earth. Furthermore, they can now be used for seismic full-waveform inversions in regional- and global-scale adjoint tomography. Although these seismic wave propagation solvers are crucial ingredients for improving the resolution of tomographic images, to answer important questions about the nature of Earth's internal processes and subsurface structure, their practical application is often limited due to high computational costs. They thus need high-performance computing (HPC) facilities to improve the current state of knowledge. At present, numerous large HPC systems embed many-core architectures such as graphics processing units (GPUs) to enhance numerical performance. Such hardware accelerators can be programmed using either the CUDA programming environment or the OpenCL language standard. CUDA software development targets NVIDIA graphics cards, while OpenCL has been adopted by additional hardware accelerators, e.g., AMD graphics cards, ARM-based processors and Intel Xeon Phi coprocessors. For seismic wave propagation simulations using the open-source spectral-element code package SPECFEM3D_GLOBE, we incorporated an automatic source-to-source code generation tool (BOAST) which allows us to use meta-programming of all computational kernels for forward and adjoint runs. Using our BOAST kernels, we generate optimized source code for both CUDA and OpenCL languages within the source code package. Thus, seismic wave simulations are now able to fully utilize CUDA and OpenCL hardware accelerators. We show benchmarks of forward seismic wave propagation simulations using SPECFEM3D_GLOBE on CUDA/OpenCL GPUs, validating results and comparing performance for different simulations and hardware usages.

  6. Parasail: SIMD C library for global, semi-global, and local pairwise sequence alignments.

    PubMed

    Daily, Jeff

    2016-02-10

    Sequence alignment algorithms are a key component of many bioinformatics applications. Though various fast Smith-Waterman local sequence alignment implementations have been developed for x86 CPUs, most are embedded into larger database search tools. In addition, fast implementations of Needleman-Wunsch global sequence alignment and its semi-global variants are not as widespread. This article presents the first software library for local, global, and semi-global pairwise intra-sequence alignments and improves the performance of previous intra-sequence implementations. A faster intra-sequence local pairwise alignment implementation is described and benchmarked, including new global and semi-global variants. Using a 375 residue query sequence a speed of 136 billion cell updates per second (GCUPS) was achieved on a dual Intel Xeon E5-2670 24-core processor system, the highest reported for an implementation based on Farrar's 'striped' approach. Rognes's SWIPE optimal database search application is still generally the fastest available at 1.2 to at best 2.4 times faster than Parasail for sequences shorter than 500 amino acids. However, Parasail was faster for longer sequences. For global alignments, Parasail's prefix scan implementation is generally the fastest, faster even than Farrar's 'striped' approach, however the opal library is faster for single-threaded applications. The software library is designed for 64 bit Linux, OS X, or Windows on processors with SSE2, SSE41, or AVX2. Source code is available from https://github.com/jeffdaily/parasail under the Battelle BSD-style license. Applications that require optimal alignment scores could benefit from the improved performance. For the first time, SIMD global, semi-global, and local alignments are available in a stand-alone C library.

  7. Faster Smith-Waterman database searches with inter-sequence SIMD parallelisation

    PubMed Central

    2011-01-01

    Background The Smith-Waterman algorithm for local sequence alignment is more sensitive than heuristic methods for database searching, but also more time-consuming. The fastest approach to parallelisation with SIMD technology has previously been described by Farrar in 2007. The aim of this study was to explore whether further speed could be gained by other approaches to parallelisation. Results A faster approach and implementation is described and benchmarked. In the new tool SWIPE, residues from sixteen different database sequences are compared in parallel to one query residue. Using a 375 residue query sequence a speed of 106 billion cell updates per second (GCUPS) was achieved on a dual Intel Xeon X5650 six-core processor system, which is over six times more rapid than software based on Farrar's 'striped' approach. SWIPE was about 2.5 times faster when the programs used only a single thread. For shorter queries, the increase in speed was larger. SWIPE was about twice as fast as BLAST when using the BLOSUM50 score matrix, while BLAST was about twice as fast as SWIPE for the BLOSUM62 matrix. The software is designed for 64 bit Linux on processors with SSSE3. Source code is available from http://dna.uio.no/swipe/ under the GNU Affero General Public License. Conclusions Efficient parallelisation using SIMD on standard hardware makes it possible to run Smith-Waterman database searches more than six times faster than before. The approach described here could significantly widen the potential application of Smith-Waterman searches. Other applications that require optimal local alignment scores could also benefit from improved performance. PMID:21631914

  8. Operational Level 2 Data Processing System for the JEM/SMILES

    NASA Astrophysics Data System (ADS)

    Takahashi, C.; Mitsuda, C.; Suzuki, M.; Iwata, Y.; Horikawa, M.; Matsumoto, T.; Hayashi, H.; Imai, K.; Sano, T.; Takayanagi, M.

    2009-12-01

    the DPS-L2 along with the details on its algorithm and performance. The retrieval process consists of two parts: the forward model, which computes radiative transfer, and the inverse model, which deduces atmospheric states. Since the forward model must provide the most accurate basis for results and be implemented under limited computing resources, the forward model algorithm for an operational system has to be accurate and fast. Hence, the algorithm is improved (1) by designing accurate instrument functions such as the instrumental field of view (FOV), sideband rejection ratio of sideband separator, and spectral responses of acousto-optic spectrometer (AOS) and (2) by optimizing radiative transfer calculation. We have achieved that the accuracy of this algorithm is better than 1%, and the processing time for single-scan spectra is less than 1 min with 8 parallel processing using a 3.16-GHz Quad-Core Intel Xeon processor.

  9. Multiparametric fat-water separation method for fast chemical-shift imaging guidance of thermal therapies.

    PubMed

    Lin, Jonathan S; Hwang, Ken-Pin; Jackson, Edward F; Hazle, John D; Stafford, R Jason; Taylor, Brian A

    2013-10-01

    .980 ± 0.004, and 0.941 ± 0.002 for DSC, sensitivity, and specificity, respectively). Temperature uncertainties, based on PRF uncertainties from a 5 × 5-voxel ROI, were 0.342 and 0.351°C for pure and mixed fat/water regions, respectively. Algorithm speed was tested using 25 × 25-voxel and whole image ROIs containing both fat and water, resulting in average processing times per acquisition of 2.00 ± 0.07 s and 146 ± 1 s, respectively, using uncompiled MATLAB scripts running on a shared CPU server with eight Intel Xeon(TM) E5640 quad-core processors (2.66 GHz, 12 MB cache) and 12 GB RAM. Results from both the mathematical and physical phantom suggest the k-means-based classification algorithm could be useful for rapid, dynamic imaging in an ROI for thermal interventions. Successful separation of fat/water information would aid in reducing errors from the nontemperature sensitive fat PRF, as well as potentially facilitate using fat as an internal reference for PRF shift thermometry when appropriate. Additionally, the T1-W or R2* signals may be used for monitoring temperature in surrounding adipose tissue.

  10. DESDynI Quad First Stage Processor - A Four Channel Digitizer and Digital Beam Forming Processor

    NASA Technical Reports Server (NTRS)

    Chuang, Chung-Lun; Shaffer, Scott; Smythe, Robert; Niamsuwan, Noppasin; Li, Samuel; Liao, Eric; Lim, Chester; Morfopolous, Arin; Veilleux, Louise

    2013-01-01

    The proposed Deformation, Eco-Systems, and Dynamics of Ice Radar (DESDynI-R) L-band SAR instrument employs multiple digital channels to optimize resolution while keeping a large swath on a single pass. High-speed digitization with very fine synchronization and digital beam forming are necessary in order to facilitate this new technique. The Quad First Stage Processor (qFSP) was developed to achieve both the processing performance and the digitizing fidelity needed to accomplish this sweeping SAR technique. The qFSP utilizes high-precision, high-speed analog-to-digital converters (ADCs), each with a finely adjustable clock distribution network, to digitize the channels at the fidelity necessary to allow for digital beam forming. The Xilinx Virtex-5 FX130T part handles the processing to digitally calibrate each channel as well as filter and beam form the receive signals. Demonstrating the digital processing required for digital beam forming and digital calibration is instrumental to the viability of the proposed DESDynI instrument. The qFSP development brings this implementation to Technology Readiness Level (TRL) 6. This paper will detail the design and development of the prototype qFSP as well as the preliminary results from hardware tests.

  11. Optimization of the Brillouin operator on the KNL architecture

    NASA Astrophysics Data System (ADS)

    Dürr, Stephan

    2018-03-01

    Experiences with optimizing the matrix-times-vector application of the Brillouin operator on the Intel KNL processor are reported. Without adjustments to the memory layout, performance figures of 360 Gflop/s in single and 270 Gflop/s in double precision are observed. This is with Nc = 3 colors, Nv = 12 right-hand sides, Nthr = 256 threads, on lattices of size 32^3 × 64, using exclusively OMP pragmas. Interestingly, the same routine performs quite well on Intel Core i7 architectures, too. Some observations on the much harder Wilson fermion matrix-times-vector optimization problem are added.

  12. GPU-accelerated algorithms for many-particle continuous-time quantum walks

    NASA Astrophysics Data System (ADS)

    Piccinini, Enrico; Benedetti, Claudia; Siloi, Ilaria; Paris, Matteo G. A.; Bordone, Paolo

    2017-06-01

    Many-particle continuous-time quantum walks (CTQWs) represent a resource for several tasks in quantum technology, including quantum search algorithms and universal quantum computation. In order to design and implement CTQWs in a realistic scenario, one needs effective simulation tools for Hamiltonians that take into account static noise and fluctuations in the lattice, i.e. Hamiltonians containing stochastic terms. To this aim, we suggest a parallel algorithm based on the Taylor series expansion of the evolution operator, and compare its performance with that of algorithms based on the exact diagonalization of the Hamiltonian or a 4th order Runge-Kutta integration. We prove that both the Taylor-series expansion and Runge-Kutta algorithms are reliable and have a low computational cost, the Taylor-series expansion showing the additional advantage of a memory allocation not depending on the precision of calculation. Both algorithms are also highly parallelizable within the SIMT paradigm, and are thus suitable for GPGPU computing. In turn, we have benchmarked 4 NVIDIA GPUs and 3 quad-core Intel CPUs for a 2-particle system over lattices of increasing dimension, showing that the speedup provided by GPU computing, with respect to the OPENMP parallelization, lies in the range between 8x and (more than) 20x, depending on the frequency of post-processing. GPU-accelerated codes thus allow one to overcome concerns about the execution time, and make possible simulations with many interacting particles on large lattices, with the only limit being the memory available on the device.
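
    The Taylor-series propagation favoured above needs only repeated matrix-vector products, and its storage does not grow with the requested precision. A minimal dense sketch for a single walker on a 5-site chain; the many-particle and stochastic-Hamiltonian machinery of the paper is omitted.

        import numpy as np

        def evolve_taylor(H, psi, dt, order=12):
            # exp(-i H dt) |psi> via a truncated Taylor series; each term
            # costs one matrix-vector product, and memory use is fixed.
            out, term = psi.copy(), psi.copy()
            for k in range(1, order + 1):
                term = (-1j * dt / k) * (H @ term)
                out += term
            return out

        n = 5                                    # single walker on a 5-site chain
        H = -(np.eye(n, k=1) + np.eye(n, k=-1)).astype(complex)
        psi = np.zeros(n, dtype=complex)
        psi[2] = 1.0                             # start on the middle site
        psi = evolve_taylor(H, psi, dt=0.5)
        print("norm after step (should stay ~1):", np.linalg.norm(psi))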

  13. Of Ivory and Smurfs: Loxodontan MapReduce Experiments for Web Search

    DTIC Science & Technology

    2009-11-01

    i.e., index construction may involve multiple flushes to local disk and on-disk merge sorts outside of MapReduce). Once the local indexes have been...contained 198 cores, which, with current dual-processor quad-core configurations, could fit into 25 machines—a far more modest cluster with today's...significant impact on effectiveness. Our simple pruning technique was performed at query time and hence could be adapted to query-dependent

  14. Efficient Multicriteria Protein Structure Comparison on Modern Processor Architectures

    PubMed Central

    Manolakos, Elias S.

    2015-01-01

    Fast increasing computational demand for all-to-all protein structures comparison (PSC) is a result of three confounding factors: rapidly expanding structural proteomics databases, high computational complexity of pairwise protein comparison algorithms, and the trend in the domain towards using multiple criteria for protein structures comparison (MCPSC) and combining results. We have developed a software framework that exploits many-core and multicore CPUs to implement efficient parallel MCPSC in modern processors based on three popular PSC methods, namely, TMalign, CE, and USM. We evaluate and compare the performance and efficiency of the two parallel MCPSC implementations using Intel's experimental many-core Single-Chip Cloud Computer (SCC) as well as Intel's Core i7 multicore processor. We show that the 48-core SCC is more efficient than the latest generation Core i7, achieving a speedup factor of 42 (efficiency of 0.9), making many-core processors an exciting emerging technology for large-scale structural proteomics. We compare and contrast the performance of the two processors on several datasets and also show that MCPSC outperforms its component methods in grouping related domains, achieving a high F-measure of 0.91 on the benchmark CK34 dataset. The software implementation for protein structure comparison using the three methods and combined MCPSC, along with the developed underlying rckskel algorithmic skeletons library, is available via GitHub. PMID:26605332

  15. Efficient Multicriteria Protein Structure Comparison on Modern Processor Architectures.

    PubMed

    Sharma, Anuj; Manolakos, Elias S

    2015-01-01

    Fast increasing computational demand for all-to-all protein structures comparison (PSC) is a result of three confounding factors: rapidly expanding structural proteomics databases, high computational complexity of pairwise protein comparison algorithms, and the trend in the domain towards using multiple criteria for protein structures comparison (MCPSC) and combining results. We have developed a software framework that exploits many-core and multicore CPUs to implement efficient parallel MCPSC in modern processors based on three popular PSC methods, namely, TMalign, CE, and USM. We evaluate and compare the performance and efficiency of the two parallel MCPSC implementations using Intel's experimental many-core Single-Chip Cloud Computer (SCC) as well as Intel's Core i7 multicore processor. We show that the 48-core SCC is more efficient than the latest generation Core i7, achieving a speedup factor of 42 (efficiency of 0.9), making many-core processors an exciting emerging technology for large-scale structural proteomics. We compare and contrast the performance of the two processors on several datasets and also show that MCPSC outperforms its component methods in grouping related domains, achieving a high F-measure of 0.91 on the benchmark CK34 dataset. The software implementation for protein structure comparison using the three methods and combined MCPSC, along with the developed underlying rckskel algorithmic skeletons library, is available via GitHub.

  16. Polarimetric scattering model for estimation of above ground biomass of multilayer vegetation using ALOS-PALSAR quad-pol data

    NASA Astrophysics Data System (ADS)

    Sai Bharadwaj, P.; Kumar, Shashi; Kushwaha, S. P. S.; Bijker, Wietske

    Forests are important biomes covering a major part of the vegetation on the Earth, and as such account for seventy percent of the carbon present in living beings. A forest's above ground biomass (AGB) is considered an important parameter for the estimation of global carbon content. In the present study, quad-pol ALOS-PALSAR data was used to estimate the AGB of the Dudhwa National Park, India. For this purpose, polarimetric decomposition components and an Extended Water Cloud Model (EWCM) were used. The PolSAR data orientation angle shifts were compensated for before the polarimetric decomposition. The scattering components obtained from the polarimetric decomposition were used in the Water Cloud Model (WCM), which was extended to account for higher-order interactions such as double-bounce scattering. The parameters of the EWCM were retrieved using the field measurements and the decomposition components. Finally, the relationship between the estimated AGB and measured AGB was assessed: the coefficient of determination (R^2) and root mean square error (RMSE) were 0.4341 and 119 t/ha, respectively.

  17. NPS-NRL-Rice-UIUC Collaboration on Navy Atmosphere-Ocean Coupled Models on Many-Core Computer Architectures Annual Report

    DTIC Science & Technology

    2014-09-30

    portability is difficult to achieve on future supercomputers that use various types of accelerators (GPUs, Xeon Phi, SIMD units, etc.). All of these...bottlenecks of NUMA. For example, in the CG code the state vector was originally stored as q(1:Nvar, 1:Npoin), where Nvar is the number of...a Global Grid Point (GGP) storage. On the other hand, in the DG code the state vector is typically stored as q(1:Nvar, 1:Npts, 1:Nelem), where

  18. Development of a highly maneuverable unmanned underwater vehicle on the basis of quad-copter dynamics

    NASA Astrophysics Data System (ADS)

    Amin, Osman Md; Karim, Md. Arshadul; Saad, Abdullah His

    2017-12-01

    At present, research on unmanned underwater vehicles (UUVs) has become a significant and familiar topic for researchers from various engineering fields. UUVs are mainly of two types: AUVs (Autonomous Underwater Vehicles) and ROVs (Remotely Operated Vehicles). A significant number of research papers on UUVs have been published, but very few researchers emphasize ease of maneuvering and control. Maneuvering is important for an underwater vehicle in avoiding obstacles, installing underwater piping systems, searching undersea resources, underwater mine disposal operations, oceanographic surveys, etc. A team from the Dept. of Naval Architecture & Marine Engineering of MIST has undertaken a project to design a highly maneuverable unmanned underwater vehicle on the basis of quad-copter dynamics. The main objective of the research is to develop a control system for a UUV able to maneuver the vehicle in six DOF (Degrees of Freedom) with great ease. For this purpose we are not only focusing on controllability but also designing an efficient hull with minimal drag force and an optimized propeller using CFD techniques. Motors were selected on the basis of the simulated thrust generated by propellers in the ANSYS Fluent software module. Settings of control parameters to carry out different types of maneuvering, such as hovering, spiral, one-point rotation about the centroid, gliding, rolling, drifting and zigzag motions, are explained briefly at the end.

  19. Parallel Density-Based Clustering for Discovery of Ionospheric Phenomena

    NASA Astrophysics Data System (ADS)

    Pankratius, V.; Gowanlock, M.; Blair, D. M.

    2015-12-01

    Ionospheric total electron content maps derived from global networks of dual-frequency GPS receivers can reveal a plethora of ionospheric features in real-time and are key to space weather studies and natural hazard monitoring. However, growing data volumes from expanding sensor networks are making manual exploratory studies challenging. As the community is heading towards Big Data ionospheric science, automation and Computer-Aided Discovery become indispensable tools for scientists. One problem of machine learning methods is that they require domain-specific adaptations in order to be effective and useful for scientists. Addressing this problem, our Computer-Aided Discovery approach allows scientists to express various physical models as well as perturbation ranges for parameters. The search space is explored through an automated system and parallel processing of batched workloads, which finds corresponding matches and similarities in empirical data. We discuss density-based clustering as a particular method we employ in this process. Specifically, we adapt Density-Based Spatial Clustering of Applications with Noise (DBSCAN). This algorithm groups geospatial data points based on density. Clusters of points can be of arbitrary shape, and the number of clusters is not predetermined by the algorithm; only two input parameters need to be specified: (1) a distance threshold, (2) a minimum number of points within that threshold. We discuss an implementation of DBSCAN for batched workloads that is amenable to parallelization on manycore architectures such as Intel's Xeon Phi accelerator with 60+ general-purpose cores. This manycore parallelization can cluster large volumes of ionospheric total electronic content data quickly. Potential applications for cluster detection include the visualization, tracing, and examination of traveling ionospheric disturbances or other propagating phenomena. Acknowledgments. We acknowledge support from NSF ACI-1442997 (PI V. Pankratius).
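
    A minimal DBSCAN sketch matching the description above: two input parameters, arbitrary cluster shapes, and a cluster count determined by the data. This is the textbook serial algorithm, not the batched manycore variant discussed in the abstract; the demo data are synthetic.

        import numpy as np

        def dbscan(points, eps, min_pts):
            # labels[i] is a cluster id, or -1 for noise/unclaimed points.
            n = len(points)
            dists = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
            neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
            labels = np.full(n, -1)
            cluster = 0
            for i in range(n):
                if labels[i] != -1 or len(neighbors[i]) < min_pts:
                    continue                       # claimed, or not a core point
                labels[i] = cluster
                stack = list(neighbors[i])
                while stack:                       # grow the cluster density-reachably
                    j = stack.pop()
                    if labels[j] == -1:
                        labels[j] = cluster
                        if len(neighbors[j]) >= min_pts:
                            stack.extend(neighbors[j])
                cluster += 1
            return labels

        rng = np.random.default_rng(4)
        pts = np.vstack([rng.normal([0, 0], 0.2, size=(30, 2)),
                         rng.normal([3, 3], 0.2, size=(30, 2)),
                         rng.uniform(-1, 4, size=(5, 2))])
        labels = dbscan(pts, eps=0.5, min_pts=5)
        print("clusters:", labels.max() + 1, "noise points:", int((labels == -1).sum()))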

  20. egs_brachy: a versatile and fast Monte Carlo code for brachytherapy

    NASA Astrophysics Data System (ADS)

    Chamberland, Marc J. P.; Taylor, Randle E. P.; Rogers, D. W. O.; Thomson, Rowan M.

    2016-12-01

    egs_brachy is a versatile and fast Monte Carlo (MC) code for brachytherapy applications. It is based on the EGSnrc code system, enabling simulation of photons and electrons. Complex geometries are modelled using the EGSnrc C++ class library and egs_brachy includes a library of geometry models for many brachytherapy sources, in addition to eye plaques and applicators. Several simulation efficiency enhancing features are implemented in the code. egs_brachy is benchmarked by comparing TG-43 source parameters of three source models to previously published values. 3D dose distributions calculated with egs_brachy are also compared to ones obtained with the BrachyDose code. Well-defined simulations are used to characterize the effectiveness of many efficiency improving techniques, both as an indication of the usefulness of each technique and to find optimal strategies. Efficiencies and calculation times are characterized through single source simulations and simulations of idealized and typical treatments using various efficiency improving techniques. In general, egs_brachy shows agreement within uncertainties with previously published TG-43 source parameter values. 3D dose distributions from egs_brachy and BrachyDose agree at the sub-percent level. Efficiencies vary with radionuclide and source type, number of sources, phantom media, and voxel size. The combined effects of efficiency-improving techniques in egs_brachy lead to short calculation times: simulations approximating prostate and breast permanent implant (both with (2 mm)^3 voxels) and eye plaque (with (1 mm)^3 voxels) treatments take between 13 and 39 s, on a single 2.5 GHz Intel Xeon E5-2680 v3 processor core, to achieve 2% average statistical uncertainty on doses within the PTV. egs_brachy will be released as free and open source software to the research community.

  1. A biomolecular electrostatics solver using Python, GPUs and boundary elements that can handle solvent-filled cavities and Stern layers

    NASA Astrophysics Data System (ADS)

    Cooper, Christopher D.; Bardhan, Jaydeep P.; Barba, L. A.

    2014-03-01

    The continuum theory applied to biomolecular electrostatics leads to an implicit-solvent model governed by the Poisson-Boltzmann equation. Solvers relying on a boundary integral representation typically do not consider features like solvent-filled cavities or ion-exclusion (Stern) layers, due to the added difficulty of treating multiple boundary surfaces. This has hindered meaningful comparisons with volume-based methods, and the effects on accuracy of including these features has remained unknown. This work presents a solver called PyGBe that uses a boundary-element formulation and can handle multiple interacting surfaces. It was used to study the effects of solvent-filled cavities and Stern layers on the accuracy of calculating solvation energy and binding energy of proteins, using the well-known APBS finite-difference code for comparison. The results suggest that if required accuracy for an application allows errors larger than about 2% in solvation energy, then the simpler, single-surface model can be used. When calculating binding energies, the need for a multi-surface model is problem-dependent, becoming more critical when ligand and receptor are of comparable size. Comparing with the APBS solver, the boundary-element solver is faster when the accuracy requirements are higher. The cross-over point for the PyGBe code is on the order of 1-2% error, when running on one GPU card (NVIDIA Tesla C2075), compared with APBS running on six Intel Xeon CPU cores. PyGBe achieves algorithmic acceleration of the boundary element method using a treecode, and hardware acceleration using GPUs via PyCuda from a user-visible code that is all Python. The code is open-source under MIT license.

  2. egs_brachy: a versatile and fast Monte Carlo code for brachytherapy.

    PubMed

    Chamberland, Marc J P; Taylor, Randle E P; Rogers, D W O; Thomson, Rowan M

    2016-12-07

    egs_brachy is a versatile and fast Monte Carlo (MC) code for brachytherapy applications. It is based on the EGSnrc code system, enabling simulation of photons and electrons. Complex geometries are modelled using the EGSnrc C++ class library and egs_brachy includes a library of geometry models for many brachytherapy sources, in addition to eye plaques and applicators. Several simulation efficiency enhancing features are implemented in the code. egs_brachy is benchmarked by comparing TG-43 source parameters of three source models to previously published values. 3D dose distributions calculated with egs_brachy are also compared to ones obtained with the BrachyDose code. Well-defined simulations are used to characterize the effectiveness of many efficiency improving techniques, both as an indication of the usefulness of each technique and to find optimal strategies. Efficiencies and calculation times are characterized through single source simulations and simulations of idealized and typical treatments using various efficiency improving techniques. In general, egs_brachy shows agreement within uncertainties with previously published TG-43 source parameter values. 3D dose distributions from egs_brachy and BrachyDose agree at the sub-percent level. Efficiencies vary with radionuclide and source type, number of sources, phantom media, and voxel size. The combined effects of efficiency-improving techniques in egs_brachy lead to short calculation times: simulations approximating prostate and breast permanent implant (both with (2 mm)^3 voxels) and eye plaque (with (1 mm)^3 voxels) treatments take between 13 and 39 s, on a single 2.5 GHz Intel Xeon E5-2680 v3 processor core, to achieve 2% average statistical uncertainty on doses within the PTV. egs_brachy will be released as free and open source software to the research community.

  3. Optimization of a Lattice Boltzmann Computation on State-of-the-Art Multicore Platforms

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Williams, Samuel; Carter, Jonathan; Oliker, Leonid

    2009-04-10

    We present an auto-tuning approach to optimize application performance on emerging multicore architectures. The methodology extends the idea of search-based performance optimizations, popular in linear algebra and FFT libraries, to application-specific computational kernels. Our work applies this strategy to a lattice Boltzmann application (LBMHD) that historically has made poor use of scalar microprocessors due to its complex data structures and memory access patterns. We explore one of the broadest sets of multicore architectures in the HPC literature, including the Intel Xeon E5345 (Clovertown), AMD Opteron 2214 (Santa Rosa), AMD Opteron 2356 (Barcelona), Sun T5140 T2+ (Victoria Falls), as well as a QS20 IBM Cell Blade. Rather than hand-tuning LBMHD for each system, we develop a code generator that allows us to identify a highly optimized version for each platform, while amortizing the human programming effort. Results show that our auto-tuned LBMHD application achieves up to a 15x improvement compared with the original code at a given concurrency. Additionally, we present detailed analysis of each optimization, which reveals surprising hardware bottlenecks and software challenges for future multicore systems and applications.

  4. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Lyakh, Dmitry I.

    An efficient parallel tensor transpose algorithm is suggested for shared-memory computing units, namely, multicore CPU, Intel Xeon Phi, and NVidia GPU. The algorithm operates on dense tensors (multidimensional arrays) and is based on the optimization of cache utilization on x86 CPU and the use of shared memory on NVidia GPU. From the applied side, the ultimate goal is to minimize the overhead encountered in the transformation of tensor contractions into matrix multiplications in computer implementations of advanced methods of quantum many-body theory (e.g., in electronic structure theory and nuclear physics). A particular accent is made on higher-dimensional tensors that typically appear in the so-called multireference correlated methods of electronic structure theory. Depending on tensor dimensionality, the presented optimized algorithms can achieve an order of magnitude speedup on x86 CPUs and 2-3 times speedup on NVidia Tesla K20X GPU with respect to the naïve scattering algorithm (no memory access optimization). Furthermore, the tensor transpose routines developed in this work have been incorporated into a general-purpose tensor algebra library (TAL-SH).
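
    The baseline against which the optimized algorithms above are measured is a naïve scattering transpose: walk the input linearly and scatter each element to its permuted position, with no attention to memory access patterns. A reference sketch of that baseline (the permutation chosen here is hypothetical), checked against NumPy's built-in transpose:

        import numpy as np
        from itertools import product

        def transpose_naive(a, perm):
            # Scattering tensor transpose: element at index idx lands at the
            # permuted index (idx[perm[0]], idx[perm[1]], ...).
            out = np.empty([a.shape[p] for p in perm], dtype=a.dtype)
            for idx in product(*map(range, a.shape)):
                out[tuple(idx[p] for p in perm)] = a[idx]
            return out

        a = np.arange(2 * 3 * 4 * 5).reshape(2, 3, 4, 5)
        perm = (3, 1, 0, 2)               # hypothetical permutation from a contraction
        assert np.array_equal(transpose_naive(a, perm),
                              np.ascontiguousarray(a.transpose(perm)))
        print("permuted shape:", transpose_naive(a, perm).shape)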

  5. Toward Exascale Earthquake Ground Motion Simulations for Near-Fault Engineering Analysis

    DOE PAGES

    Johansen, Hans; Rodgers, Arthur; Petersson, N. Anders; ...

    2017-09-01

    Modernizing SW4 for massively parallel time-domain simulations of earthquake ground motions in 3D earth models increases resolution and provides ground motion estimates for critical infrastructure risk evaluations. Simulations of ground motions from large (M ≥ 7.0) earthquakes require domains on the order of 100 to 500 km and spatial granularity on the order of 1 to 5 m, resulting in hundreds of billions of grid points. Surface-focused structured mesh refinement (SMR) allows for more constant grid-points-per-wavelength scaling in typical Earth models, where wavespeeds increase with depth. In fact, SMR allows simulations to double the frequency content relative to a fixed-grid calculation on a given resource. The authors report improvements to the SW4 algorithm developed while porting the code to the Cori Phase 2 (Intel Xeon Phi) systems at the National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory. As a result, investigations of the performance of the innermost loop of the calculations found that reorganizing the order of operations can improve performance for massive problems.

  6. Toward Exascale Earthquake Ground Motion Simulations for Near-Fault Engineering Analysis

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Johansen, Hans; Rodgers, Arthur; Petersson, N. Anders

    Modernizing SW4 for massively parallel time-domain simulations of earthquake ground motions in 3D earth models increases resolution and provides ground motion estimates for critical infrastructure risk evaluations. Simulations of ground motions from large (M ≥ 7.0) earthquakes require domains on the order of 100 to 500 km and spatial granularity on the order of 1 to 5 m, resulting in hundreds of billions of grid points. Surface-focused structured mesh refinement (SMR) allows for more constant grid-points-per-wavelength scaling in typical Earth models, where wavespeeds increase with depth. In fact, SMR allows simulations to double the frequency content relative to a fixed-grid calculation on a given resource. The authors report improvements to the SW4 algorithm developed while porting the code to the Cori Phase 2 (Intel Xeon Phi) systems at the National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory. As a result, investigations of the performance of the innermost loop of the calculations found that reorganizing the order of operations can improve performance for massive problems.

  7. SoAx: A generic C++ Structure of Arrays for handling particles in HPC codes

    NASA Astrophysics Data System (ADS)

    Homann, Holger; Laenen, Francois

    2018-03-01

    The numerical study of physical problems often requires integrating the dynamics of a large number of particles evolving according to a given set of equations. Particles are characterized by the information they carry, such as an identity, a position, and other properties. Generally speaking, there are two different possibilities for handling particles in high performance computing (HPC) codes. The concept of an Array of Structures (AoS) is in the spirit of the object-oriented programming (OOP) paradigm in that the particle information is implemented as a structure. Here, an object (realization of the structure) represents one particle and a set of many particles is stored in an array. In contrast, using the concept of a Structure of Arrays (SoA), a single structure holds several arrays, each representing one property (such as the identity) of the whole set of particles. The AoS approach is often implemented in HPC codes due to its handiness and flexibility. For a class of problems, however, it is known that the performance of SoA is much better than that of AoS. We confirm this observation for our particle problem. Using a benchmark we show that on modern Intel Xeon processors the SoA implementation is typically several times faster than the AoS one. On Intel's MIC co-processors the performance gap even attains a factor of ten. The same is true for GPU computing, using both computational and multi-purpose GPUs. Combining performance and handiness, we present the library SoAx, which has optimal performance (on CPUs, MICs, and GPUs) while providing the same handiness as AoS. For this, SoAx uses modern C++ design techniques such as template metaprogramming that allow code to be generated automatically for user-defined heterogeneous data structures.
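
    The layout contrast is easy to reproduce: in NumPy, AoS corresponds to a structured record array (fields interleaved per particle) and SoA to one contiguous array per field. A small timing sketch; results vary by machine and the field names are illustrative.

        import time
        import numpy as np

        n = 2_000_000

        # AoS: one record per particle (id, x, y, z interleaved in memory).
        aos = np.zeros(n, dtype=[('id', 'i8'), ('x', 'f8'), ('y', 'f8'), ('z', 'f8')])

        # SoA: one contiguous array per property.
        soa = {k: np.zeros(n) for k in ('x', 'y', 'z')}

        def update_aos(p, dt=1e-3):
            p['x'] += dt    # strided access: steps over every 32-byte record
            p['y'] += dt
            p['z'] += dt

        def update_soa(p, dt=1e-3):
            for k in p:     # unit-stride, vectorizer- and cache-friendly
                p[k] += dt

        for name, fn, data in [("AoS", update_aos, aos), ("SoA", update_soa, soa)]:
            t0 = time.perf_counter()
            fn(data)
            print(name, f"{time.perf_counter() - t0:.4f} s")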

  8. Time-domain seismic modeling in viscoelastic media for full waveform inversion on heterogeneous computing platforms with OpenCL

    NASA Astrophysics Data System (ADS)

    Fabien-Ouellet, Gabriel; Gloaguen, Erwan; Giroux, Bernard

    2017-03-01

    Full Waveform Inversion (FWI) aims at recovering the elastic parameters of the Earth by matching recordings of the ground motion with the direct solution of the wave equation. Modeling the wave propagation for realistic scenarios is computationally intensive, which limits the applicability of FWI. The current hardware evolution brings increasing parallel computing power that can speed up the computations in FWI. However, to take advantage of the diversity of parallel architectures presently available, new programming approaches are required. In this work, we explore the use of OpenCL to develop a portable code that can take advantage of the many parallel processor architectures now available. We present a program called SeisCL for 2D and 3D viscoelastic FWI in the time domain. The code computes the forward and adjoint wavefields using finite differences and outputs the gradient of the misfit function given by the adjoint state method. To demonstrate the code's portability across architectures, the performance of SeisCL is tested on three different devices: Intel CPUs, NVidia GPUs, and the Intel Xeon Phi. Results show that the use of GPUs with OpenCL can speed up the computations by nearly two orders of magnitude over a single-threaded application on the CPU. Although OpenCL allows code portability, we show that some device-specific optimization is still required to get the best performance out of a specific architecture. Using OpenCL in conjunction with MPI allows the domain decomposition of large models on several devices located on different nodes of a cluster. For large enough models, the speedup of the domain decomposition varies quasi-linearly with the number of devices. Finally, we investigate two different approaches to computing the gradient by the adjoint state method and show the significant advantages of using OpenCL for FWI.
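
    OpenCL's portability rests on a single host API that discovers CPUs, GPUs, and accelerators such as the Xeon Phi alike. The host-side sketch below is generic (it is not taken from SeisCL and assumes only an OpenCL SDK providing CL/cl.h); it enumerates every device the runtime exposes, the first step before compiling kernels for them.

      #include <CL/cl.h>
      #include <cstdio>
      #include <vector>

      int main() {
          cl_uint np = 0;
          clGetPlatformIDs(0, nullptr, &np);            // count platforms
          std::vector<cl_platform_id> plats(np);
          clGetPlatformIDs(np, plats.data(), nullptr);
          for (cl_platform_id p : plats) {
              cl_uint nd = 0;
              if (clGetDeviceIDs(p, CL_DEVICE_TYPE_ALL, 0, nullptr, &nd) != CL_SUCCESS)
                  continue;
              std::vector<cl_device_id> devs(nd);
              clGetDeviceIDs(p, CL_DEVICE_TYPE_ALL, nd, devs.data(), nullptr);
              for (cl_device_id d : devs) {             // CPU, GPU, or accelerator
                  char name[256] = {0};
                  cl_uint cu = 0;
                  clGetDeviceInfo(d, CL_DEVICE_NAME, sizeof(name), name, nullptr);
                  clGetDeviceInfo(d, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(cu), &cu, nullptr);
                  std::printf("%s: %u compute units\n", name, cu);
              }
          }
          return 0;
      }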

  9. Focusing on Mathematical Knowledge: The Impact of Content-Intensive Teacher Professional Development. NCEE 2016-4010

    ERIC Educational Resources Information Center

    Garet, Michael S.; Heppen, Jessica B.; Walters, Kirk; Parkinson, Julia; Smith, Toni M.; Song, Mengli; Garrett, Rachel; Yang, Rui; Borman, Geoffrey D.

    2016-01-01

    This report examines the impact of content-intensive Professional Development (PD) on teachers' math content knowledge, their instructional practice, and their students' achievement. The study's PD had three components, totaling 93 hours. The core of the PD was "Intel Math," an intensive 80-hour workshop delivered in summer 2013 that…

  10. Body of Knowledge (BOK) for Leadless Quad Flat No-Lead/bottom Termination Components (QFN/BTC) Package Trends and Reliability

    NASA Technical Reports Server (NTRS)

    Ghaffarian, Reza

    2014-01-01

    Bottom terminated components and quad flat no-lead (BTC/QFN) packages have been extensively used by commercial industry for more than a decade. Cost and performance advantages and the closeness of the packages to the boards make them especially unique for radio frequency (RF) applications. A number of high-reliability parts are now available in this style of package configuration. This report presents a summary of literature surveyed and provides a body of knowledge (BOK) gathered on the status of BTC/QFN and their advanced versions of multi-row QFN (MRQFN) packaging technologies. The report provides a comprehensive review of packaging trends and specifications on design, assembly, and reliability. Emphasis is placed on assembly reliability and associated key design and process parameters because they show lower life than standard leaded package assembly under thermal cycling exposures. Inspection of hidden solder joints for assuring quality is challenging and is similar to ball grid arrays (BGAs). Understanding the key BTC/QFN technology trends, applications, processing parameters, workmanship defects, and reliability behavior is important when judiciously selecting and narrowing the follow-on packages for evaluation and testing, as well as for low-risk insertion in high-reliability applications.

  11. Body of Knowledge (BOK) for Leadless Quad Flat No-Lead/Bottom Termination Components (QFN/BTC) Package Trends and Reliability

    NASA Technical Reports Server (NTRS)

    Ghaffarian, Reza

    2014-01-01

    Bottom terminated components and quad flat no-lead (BTC/QFN) packages have been extensively used by commercial industry for more than a decade. Cost and performance advantages and the closeness of the packages to the boards make them especially unique for radio frequency (RF) applications. A number of high-reliability parts are now available in this style of package configuration. This report presents a summary of literature surveyed and provides a body of knowledge (BOK) gathered on the status of BTC/QFN and their advanced versions of multi-row QFN (MRQFN) packaging technologies. The report provides a comprehensive review of packaging trends and specifications on design, assembly, and reliability. Emphasis is placed on assembly reliability and associated key design and process parameters because they show lower life than standard leaded package assembly under thermal cycling exposures. Inspection of hidden solder joints for assuring quality is challenging and is similar to ball grid arrays (BGAs). Understanding the key BTC/QFN technology trends, applications, processing parameters, workmanship defects, and reliability behavior is important when judiciously selecting and narrowing the follow-on packages for evaluation and testing, as well as for low-risk insertion in high-reliability applications.

  12. Core Hunter 3: flexible core subset selection.

    PubMed

    De Beukelaer, Herman; Davenport, Guy F; Fack, Veerle

    2018-05-31

    Core collections provide genebank curators and plant breeders with a way to reduce the size of their collections and populations, while minimizing the impact on genetic diversity and allele frequency. Many methods have been proposed to generate core collections, often using distance metrics to quantify the similarity of two accessions, based on genetic marker data or phenotypic traits. Core Hunter is a multi-purpose core subset selection tool that uses local search algorithms to generate subsets relying on one or more metrics, including several distance metrics and allelic richness. In version 3 of Core Hunter (CH3) we have incorporated two new, improved methods for summarizing distances to quantify the diversity or representativeness of the core collection. A comparison of CH3 and Core Hunter 2 (CH2) showed that these new metrics can be effectively optimized with less complex algorithms than those used in CH2. CH3 is more effective at maximizing the improved diversity metric than CH2, still ensures a high average and minimum distance, and is faster for large datasets. Using CH3, a simple stochastic hill-climber is able to find highly diverse core collections, and the more advanced parallel tempering algorithm further increases the quality of the core and further reduces variability across independent samples. We also evaluate the ability of CH3 to simultaneously maximize diversity, and either representativeness or allelic richness, and compare the results with those of the GDOpt and SimEli methods. CH3 can sample cores as representative as those of GDOpt, which was specifically designed for this purpose, and is able to construct cores that are simultaneously more diverse and either more representative or higher in allelic richness than those obtained by SimEli. In version 3, Core Hunter has been updated to include two new core subset selection metrics that construct cores for representativeness or diversity, with improved performance. It combines and outperforms the
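
    As a toy illustration of the stochastic hill-climber idea (this is not Core Hunter's implementation; the random distance matrix, core size, and step count are invented for the sketch), the C++ program below repeatedly swaps one accession in and out of a candidate core and keeps any swap that raises the average pairwise distance.

      #include <iostream>
      #include <random>
      #include <utility>
      #include <vector>

      // Average pairwise distance of the accessions listed in 'core'.
      double avgDist(const std::vector<std::vector<double>>& D,
                     const std::vector<int>& core) {
          double s = 0;
          const int n = (int)core.size();
          for (int i = 0; i < n; ++i)
              for (int j = i + 1; j < n; ++j) s += D[core[i]][core[j]];
          return s / (n * (n - 1) / 2.0);
      }

      int main() {
          const int N = 100, K = 10, STEPS = 20000;
          std::mt19937 rng(42);
          std::uniform_real_distribution<double> u(0.0, 1.0);
          // Made-up symmetric distances standing in for marker-based ones.
          std::vector<std::vector<double>> D(N, std::vector<double>(N, 0.0));
          for (int i = 0; i < N; ++i)
              for (int j = i + 1; j < N; ++j) D[i][j] = D[j][i] = u(rng);
          // Start with the first K accessions; the rest form the pool.
          std::vector<int> core, pool;
          for (int i = 0; i < N; ++i) (i < K ? core : pool).push_back(i);
          double best = avgDist(D, core);
          for (int s = 0; s < STEPS; ++s) {
              int a = (int)(rng() % K), b = (int)(rng() % pool.size());
              std::swap(core[a], pool[b]);       // propose a single swap
              double v = avgDist(D, core);
              if (v > best) best = v;            // keep improving moves
              else std::swap(core[a], pool[b]);  // otherwise undo
          }
          std::cout << "average pairwise distance of core: " << best << "\n";
          return 0;
      }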

  13. Experimental demonstration of iterative post-equalization algorithm for 37.5-Gbaud PM-16QAM quad-carrier Terabit superchannel.

    PubMed

    Jia, Zhensheng; Chien, Hung-Chang; Cai, Yi; Yu, Jianjun; Zhang, Chengliang; Li, Junjie; Ma, Yiran; Shang, Dongdong; Zhang, Qi; Shi, Sheping; Wang, Huitao

    2015-02-09

    We experimentally demonstrate a quad-carrier 1-Tb/s solution with a 37.5-GBaud PM-16QAM signal over a 37.5-GHz optical grid at 6.7 b/s/Hz net spectral efficiency. Digital Nyquist pulse shaping at the transmitter and post-equalization at the receiver are employed to mitigate joint inter-symbol-interference (ISI) and inter-channel-interference (ICI) impairments. The post-equalization algorithms consist of a one-sample-per-symbol decision-directed least mean square (DD-LMS) adaptive filter, a digital post filter, maximum likelihood sequence estimation (MLSE), and an iterative process among them. By combining these algorithms, an improvement of as much as 4 dB in OSNR (0.1 nm) at the SD-FEC limit (Q² = 6.25 dB, corresponding to BER = 2.0e-2) is obtained compared to omitting the post-equalization process, and transmission over an 820-km EDFA-only standard single-mode fiber (SSMF) link is achieved for two 1.2-Tb/s signals with the average Q² factor larger than 6.5 dB for all sub-channels. Additionally, 50-GBaud 16QAM operating at 1.28 samples/symbol in a DAC is also investigated, and successful transmission over a 410-km SSMF link is achieved at a 62.5-GHz optical grid.
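
    For a flavor of the DD-LMS stage alone (a scalar toy, not the paper's receiver chain: the channel taps, step size, and filter length are invented here), the C++ sketch below equalizes 16QAM symbols at one sample per symbol, slicing the filter output and adapting the taps toward the sliced decision so that no training sequence is needed.

      #include <complex>
      #include <cstdio>
      #include <random>
      #include <vector>

      using cd = std::complex<double>;

      // Nearest 16QAM point: each rail is sliced to {-3, -1, 1, 3}.
      cd decide(cd z) {
          auto rail = [](double v) {
              return v < -2 ? -3.0 : v < 0 ? -1.0 : v < 2 ? 1.0 : 3.0;
          };
          return {rail(z.real()), rail(z.imag())};
      }

      int main() {
          const int NTAPS = 7, N = 20000;
          const double mu = 1e-3;                    // LMS step size (invented)
          std::mt19937 rng(1);
          std::uniform_int_distribution<int> pick(0, 3);
          const double lv[4] = {-3, -1, 1, 3};

          // Random 16QAM symbols through a mild, invented ISI channel.
          std::vector<cd> tx(N), rx(N, 0.0);
          for (cd& s : tx) s = {lv[pick(rng)], lv[pick(rng)]};
          const cd h[3] = {1.0, {0.2, 0.1}, {0.05, -0.02}};
          for (int n = 0; n < N; ++n)
              for (int k = 0; k <= std::min(n, 2); ++k) rx[n] += h[k] * tx[n - k];

          // Decision-directed LMS: the error is taken against the slicer output.
          std::vector<cd> w(NTAPS, 0.0);
          w[0] = 1.0;                                // pass-through initialization
          int errFirst = 0, errRest = 0;
          for (int n = NTAPS; n < N; ++n) {
              cd y = 0.0;
              for (int k = 0; k < NTAPS; ++k) y += w[k] * rx[n - k];
              cd d = decide(y), e = d - y;
              for (int k = 0; k < NTAPS; ++k) w[k] += mu * e * std::conj(rx[n - k]);
              (n < N / 10 ? errFirst : errRest) += (d != tx[n]);
          }
          std::printf("symbol errors, first 10%%: %d, remainder: %d\n",
                      errFirst, errRest);
          return 0;
      }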

  14. New Dimensions in Microarchitecture Harnessing 3D Integration Technologies (BRIEFING CHARTS)

    DTIC Science & Technology

    2007-03-06

    [Briefing-chart fragments: quad-core bandwidth and latency boundaries; general-purpose processor loads trade off between latency-limited and bandwidth-limited regimes; definitions No = number of circuits at 1 V, do = circuit delay at 1 V; from the "3D Integration" Special Topic Session, W. Haensch, ISSCC '07, 2/07; DARPA MTS, March 6, 2007.]

  15. Stereoscopic-3D display design: a new paradigm with Intel Adaptive Stable Image Technology [IA-SIT]

    NASA Astrophysics Data System (ADS)

    Jain, Sunil

    2012-03-01

    Stereoscopic-3D (S3D) proliferation on personal computers (PC) is mired by several technical and business challenges: a) viewing discomfort due to cross-talk amongst stereo images; b) high system cost; and c) restricted content availability. Users expect S3D visual quality to be better than, or at least equal to, what they are used to enjoying on 2D in terms of resolution, pixel density, color, and interactivity. Intel Adaptive Stable Image Technology (IA-SIT) is a foundational technology, successfully developed to resolve S3D system design challenges and deliver high-quality 3D visualization at PC price points. Combined optimizations in the display driver, panel timing firmware, backlight hardware, eyewear optical stack, and sync mechanism can help accomplish this goal. Agnostic to refresh rate, IA-SIT will scale with the shrinking of display transistors and improvements in liquid crystal and LED materials. Industry could profusely benefit from the following calls to action: 1) Adopt the 'IA-SIT S3D Mode' in panel specs (via VESA) to help panel makers monetize S3D; 2) Adopt the 'IA-SIT Eyewear Universal Optical Stack' and algorithm (via CEA) to help PC peripheral makers develop stylish glasses; 3) Adopt the 'IA-SIT Real Time Profile' for sub-100 µs latency control (via BT SIG) to extend BT into S3D; and 4) Adopt the 'IA-SIT Architecture' for monitors and TVs to monetize via PC attach.

  16. Core-to-core uniformity improvement in multi-core fiber Bragg gratings

    NASA Astrophysics Data System (ADS)

    Lindley, Emma; Min, Seong-Sik; Leon-Saval, Sergio; Cvetojevic, Nick; Jovanovic, Nemanja; Bland-Hawthorn, Joss; Lawrence, Jon; Gris-Sanchez, Itandehui; Birks, Tim; Haynes, Roger; Haynes, Dionne

    2014-07-01

    Multi-core fiber Bragg gratings (MCFBGs) will be a valuable tool not only in communications but also in various astronomical, sensing, and industrial applications. In this paper we address some of the technical challenges of fabricating effective multi-core gratings by simulating improvements to the writing method. These methods allow a system designed for inscribing single-core fibers to cope with MCFBG fabrication with only minor, passive changes to the writing process. Using a capillary tube polished on one side, the field entering the fiber was flattened, which improved the coverage and uniformity of all cores.

  17. Automated Creation of Labeled Pointcloud Datasets in Support of Machine-Learning Based Perception

    DTIC Science & Technology

    2017-12-01

    computationally intensive 3D vector math and took more than ten seconds to segment a single LIDAR frame from the HDL-32e with the Dell XPS15 9650’s Intel...Core i7 CPU. Depth Clustering avoids the computationally intensive 3D vector math of Euclidean Clustering-based DON segmentation and, instead

  18. Computational algorithms for simulations in atmospheric optics.

    PubMed

    Konyaev, P A; Lukin, V P

    2016-04-20

    A computer simulation technique for atmospheric and adaptive optics based on parallel programming is discussed. A parallel propagation algorithm is designed and a modified spectral-phase method for computer generation of 2D time-variant random fields is developed. Temporal power spectra of Laguerre-Gaussian beam fluctuations are considered as an example to illustrate the applications discussed. Implementation of the proposed algorithms using Intel MKL and IPP libraries and NVIDIA CUDA technology is shown to be very fast and accurate. The hardware system for the computer simulation is an off-the-shelf desktop with an Intel Core i7-4790K CPU operating at a turbo frequency of up to 5 GHz and an NVIDIA GeForce GTX-960 graphics accelerator with 1024 processors at 1.5 GHz.

  19. Navier-Stokes Aerodynamic Simulation of the V-22 Osprey on the Intel Paragon MPP

    NASA Technical Reports Server (NTRS)

    Vadyak, Joseph; Shrewsbury, George E.; Narramore, Jim C.; Montry, Gary; Holst, Terry; Kwak, Dochan (Technical Monitor)

    1995-01-01

    The paper will describe the development of a general three-dimensional multiple-grid-zone Navier-Stokes flowfield simulation program (ENS3D-MPP) designed for efficient execution on the Intel Paragon Massively Parallel Processor (MPP) supercomputer, and the subsequent application of this method to the prediction of the viscous flowfield about the V-22 Osprey tiltrotor vehicle. The flowfield simulation code solves the thin-layer or full Navier-Stokes equations for viscous flow modeling, or the Euler equations for inviscid flow modeling, on a structured multi-zone mesh. In the present paper only viscous simulations will be shown. The governing difference equations are solved using a time-marching implicit approximate factorization method with either TVD upwind or central differencing used for the convective terms and central differencing used for the viscous diffusion terms. Steady-state or time-accurate solutions can be calculated. The present paper will focus on steady-state applications, although time-accurate solution analysis is the ultimate goal of this effort. Laminar viscosity is calculated using Sutherland's law, and the Baldwin-Lomax two-layer algebraic turbulence model is used to compute the eddy viscosity. The simulation method uses an arbitrary-block, curvilinear grid topology. An automatic grid adaption scheme is incorporated which concentrates grid points in regions of high density gradient. A variety of user-specified boundary conditions are available. This paper will present the application of the scalable and superscalable versions to the steady-state viscous flow analysis of the V-22 Osprey using a multiple-zone global mesh. The mesh consists of a series of sheared Cartesian grid blocks with polar grids embedded within to better simulate the wing-tip-mounted nacelle. MPP solutions will be shown in comparison to equivalent Cray C-90 results and also in comparison to experimental data. Discussions on meshing considerations, wall clock execution time

  20. Core-core and core-valence correlation energy atomic and molecular benchmarks for Li through Ar

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ranasinghe, Duminda S.; Frisch, Michael J.; Petersson, George A., E-mail: gpetersson@wesleyan.edu

    2015-12-07

    We have established benchmark core-core, core-valence, and valence-valence absolute coupled-cluster single double (triple) correlation energies (±0.1%) for 210 species covering the first- and second-rows of the periodic table. These species provide 194 energy differences (±0.03 mE_h) including ionization potentials, electron affinities, and total atomization energies. These results can be used for calibration of less expensive methodologies for practical routine determination of core-core and core-valence correlation energies.

  1. Composite Cores

    NASA Technical Reports Server (NTRS)

    1990-01-01

    Spang & Company's new configuration of converter transformer cores is a composite of gapped and ungapped cores assembled together in concentric relationship. The net effect of the composite design is to combine the protection from saturation offered by the gapped core with the lower magnetizing requirement of the ungapped core. The uncut core functions under normal operating conditions and the cut core takes over during abnormal operation to prevent power surges and their potentially destructive effect on transistors. Principal customers are aerospace and defense manufacturers. Cores also have applicability in commercial products where precise power regulation is required, as in the power supplies for large mainframe computers.

  2. Techniques and Tools for Estimating Ionospheric Effects in Interferometric and Polarimetric SAR Data

    NASA Technical Reports Server (NTRS)

    Rosen, P.; Lavalle, M.; Pi, X.; Buckley, S.; Szeliga, W.; Zebker, H.; Gurrola, E.

    2011-01-01

    The InSAR Scientific Computing Environment (ISCE) is a flexible, extensible software tool designed for the end-to-end processing and analysis of synthetic aperture radar data. ISCE inherits the core of the ROI_PAC interferometric tool, but contains improvements at all levels of the radar processing chain, including a modular and extensible architecture, a new focusing approach, better geocoding of the data, handling of multi-polarization data, radiometric calibration, and estimation and correction of ionospheric effects. In this paper we describe the characteristics of ISCE with emphasis on the ionospheric modules. To detect ionospheric anomalies, ISCE implements the Faraday rotation method using quad-polarimetric images, and the split-spectrum technique using interferometric single-, dual- and quad-polarimetric images. The ability to generate co-registered time series of quad-polarimetric images makes ISCE also an ideal tool to be used for polarimetric-interferometric radar applications.

  3. Above-ground biomass and carbon estimates of Shorea robusta and Tectona grandis forests using QuadPOL ALOS PALSAR data

    NASA Astrophysics Data System (ADS)

    Behera, M. D.; Tripathi, P.; Mishra, B.; Kumar, Shashi; Chitale, V. S.; Behera, Soumit K.

    2016-01-01

    Mechanisms to mitigate climate change in tropical countries such as India require information on forest structural components, i.e., biomass and carbon, for conservation steps to be implemented successfully. The present study focuses on investigating the potential use of one-time, QuadPOL ALOS PALSAR L-band 25 m data to estimate above-ground biomass (AGB) using a water cloud model (WCM) in a wildlife sanctuary in India. A significant correlation was obtained between the SAR-derived backscatter coefficient (σ°) and the field-measured AGB, with the maximum coefficient of determination for cross-polarized (HV) σ° for Shorea robusta; the weakest correlation was observed with co-polarized (HH) σ° for Tectona grandis forests. The biomass of S. robusta and that of T. grandis were estimated on the basis of field-measured data at 444.7 ± 170.4 Mg/ha and 451 ± 179.4 Mg/ha, respectively. The mean biomass values estimated using the WCM varied between 562 and 660 Mg/ha for S. robusta and between 590 and 710 Mg/ha for T. grandis using various polarized data. Our results highlight the efficacy of one-time, fully polarized PALSAR data for biomass and carbon estimates in a dense forest.

  4. Optimizing performance by improving core stability and core strength.

    PubMed

    Hibbs, Angela E; Thompson, Kevin G; French, Duncan; Wrigley, Allan; Spears, Iain

    2008-01-01

    Core stability and core strength have been subject to research since the early 1980s. Research has highlighted benefits of training these processes for people with back pain and for carrying out everyday activities. However, less research has been performed on the benefits of core training for elite athletes and how this training should be carried out to optimize sporting performance. Many elite athletes undertake core stability and core strength training as part of their training programme, despite contradictory findings and conclusions as to their efficacy. This is mainly due to the lack of a gold standard method for measuring core stability and strength when performing everyday tasks and sporting movements. A further confounding factor is that because of the differing demands on the core musculature during everyday activities (low load, slow movements) and sporting activities (high load, resisted, dynamic movements), research performed in the rehabilitation sector cannot be applied to the sporting environment and, subsequently, data regarding core training programmes and their effectiveness on sporting performance are lacking. There are many articles in the literature that promote core training programmes and exercises for performance enhancement without providing a strong scientific rationale of their effectiveness, especially in the sporting sector. In the rehabilitation sector, improvements in lower back injuries have been reported by improving core stability. Few studies have observed any performance enhancement in sporting activities despite observing improvements in core stability and core strength following a core training programme. A clearer understanding of the roles that specific muscles have during core stability and core strength exercises would enable more functional training programmes to be implemented, which may result in a more effective transfer of these skills to actual sporting activities.

  5. Performance of VPIC on Sequoia

    NASA Astrophysics Data System (ADS)

    Nystrom, William

    2014-10-01

    Sequoia is a major DOE computing resource that is characteristic of future resources in that it has many threads per compute node (64), and its individual processor cores are simpler and less powerful than cores on previous processors like Intel's Sandy Bridge or AMD's Opteron. An effort is in progress to port VPIC to the Blue Gene/Q architecture of Sequoia and evaluate its performance. Results of this work will be presented on the single-node performance of VPIC as well as multi-node scaling.

  6. Hybrid Computational Architecture for Multi-Scale Modeling of Materials and Devices

    DTIC Science & Technology

    2016-01-03

    [Report excerpt; administrative DD882 form fields omitted.] Single-node performance on an Intel node with 20 cores (40 with hyper-threading, HT):

      # of cores   Total CPU time   User CPU time   System CPU time   Elapsed time
      40 (HT)      534.785          529.984         4.800             541.179
      20           468.873          466.119         2.754             476.878
      10           671.798          669.653         2.145             680.510
      8            772.269          770.256         2.013             [truncated]

  7. A MAGNETOHYDRODYNAMIC MODEL OF THE M87 JET. I. SUPERLUMINAL KNOT EJECTIONS FROM HST-1 AS TRAILS OF QUAD RELATIVISTIC MHD SHOCKS

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Nakamura, Masanori; Garofalo, David; Meier, David L., E-mail: nakamura@stsci.ed, E-mail: david.a.garofalo@jpl.nasa.go, E-mail: david.l.meier@jpl.nasa.go

    2010-10-01

    This is the first in a series of papers that introduces a new paradigm for understanding the jet in M87: a collimated relativistic flow in which strong magnetic fields play a dominant dynamical role. Here, we focus on the flow downstream of HST-1, an essentially stationary flaring feature that ejects trails of superluminal components. We propose that these components are quad relativistic magnetohydrodynamic shock fronts (forward/reverse fast and slow modes) in a narrow jet with a helically twisted magnetic structure, and we demonstrate the properties of such shocks with simple one-dimensional numerical simulations. Quasi-periodic ejections of similar component trails may be responsible for the M87 jet substructures observed further downstream on 10²-10³ pc scales. This new paradigm requires the assimilation of some new concepts into the astrophysical jet community, particularly the behavior of slow/fast-mode waves/shocks and of current-driven helical kink instabilities. However, the prospects of these ideas applying to a large number of other jet systems may make this worth the effort.

  8. GPU-based ultra-fast dose calculation using a finite size pencil beam model.

    PubMed

    Gu, Xuejun; Choi, Dongju; Men, Chunhua; Pan, Hubert; Majumdar, Amitava; Jiang, Steve B

    2009-10-21

    Online adaptive radiation therapy (ART) is an attractive concept that promises the ability to deliver an optimal treatment in response to the inter-fraction variability in patient anatomy. However, it has yet to be realized due to technical limitations. Fast dose deposit coefficient calculation is a critical component of the online planning process that is required for plan optimization of intensity-modulated radiation therapy (IMRT). Computer graphics processing units (GPUs) are well suited to provide the requisite fast performance for the data-parallel nature of dose calculation. In this work, we develop a dose calculation engine based on a finite-size pencil beam (FSPB) algorithm and a GPU parallel computing framework. The developed framework can accommodate any FSPB model. We test our implementation in the case of a water phantom and the case of a prostate cancer patient with varying beamlet and voxel sizes. All testing scenarios achieved speedup ranging from 200 to 400 times when using a NVIDIA Tesla C1060 card in comparison with a 2.27 GHz Intel Xeon CPU. The computational time for calculating dose deposition coefficients for a nine-field prostate IMRT plan with this new framework is less than 1 s. This indicates that the GPU-based FSPB algorithm is well suited for online re-planning for adaptive radiotherapy.

  9. Leveraging FPGAs for Accelerating Short Read Alignment.

    PubMed

    Arram, James; Kaplan, Thomas; Luk, Wayne; Jiang, Peiyong

    2017-01-01

    One of the key challenges facing genomics today is how to efficiently analyze the massive amounts of data produced by next-generation sequencing platforms. With general-purpose computing systems struggling to address this challenge, specialized processors such as the Field-Programmable Gate Array (FPGA) are receiving growing interest. The means by which to leverage this technology for accelerating genomic data analysis is however largely unexplored. In this paper, we present a runtime reconfigurable architecture for accelerating short read alignment using FPGAs. This architecture exploits the reconfigurability of FPGAs to allow the development of fast yet flexible alignment designs. We apply this architecture to develop an alignment design which supports exact and approximate alignment with up to two mismatches. Our design is based on the FM-index, with optimizations to improve the alignment performance. In particular, the n-step FM-index, index oversampling, a seed-and-compare stage, and bi-directional backtracking are included. Our design is implemented and evaluated on a 1U Maxeler MPC-X2000 dataflow node with eight Altera Stratix-V FPGAs. Measurements show that our design is 28 times faster than Bowtie2 running with 16 threads on dual Intel Xeon E5-2640 CPUs, and nine times faster than Soap3-dp running on an NVIDIA Tesla C2070 GPU.
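
    The FM-index backward search at the heart of such aligners can be illustrated in plain scalar C++. The sketch below is a naive toy, nothing like the paper's FPGA design: the BWT is built by sorting rotations and rank is a linear scan, but it shows how the suffix-array interval shrinks one pattern character at a time.

      #include <algorithm>
      #include <iostream>
      #include <map>
      #include <string>
      #include <vector>

      // BWT via sorted rotations; text must end with a unique smallest '$'.
      std::string bwt(const std::string& t) {
          const size_t n = t.size();
          std::vector<size_t> rot(n);
          for (size_t i = 0; i < n; ++i) rot[i] = i;
          std::sort(rot.begin(), rot.end(), [&](size_t a, size_t b) {
              for (size_t k = 0; k < n; ++k)
                  if (t[(a + k) % n] != t[(b + k) % n])
                      return t[(a + k) % n] < t[(b + k) % n];
              return false;
          });
          std::string last(n, ' ');
          for (size_t i = 0; i < n; ++i) last[i] = t[(rot[i] + n - 1) % n];
          return last;
      }

      // C[c] = number of characters in the text strictly smaller than c.
      std::map<char, int> buildC(const std::string& L) {
          std::map<char, int> cnt, C;
          for (char c : L) ++cnt[c];
          int sum = 0;
          for (auto& [c, k] : cnt) { C[c] = sum; sum += k; }
          return C;
      }

      // Backward search: refine the interval [sp, ep) from the last
      // pattern character to the first; its width is the match count.
      int countOccurrences(const std::string& L, std::map<char, int>& C,
                           const std::string& pat) {
          auto rank = [&](char c, int i) {   // occurrences of c in L[0, i)
              return (int)std::count(L.begin(), L.begin() + i, c);
          };
          int sp = 0, ep = (int)L.size();
          for (auto it = pat.rbegin(); it != pat.rend(); ++it) {
              if (!C.count(*it)) return 0;
              sp = C[*it] + rank(*it, sp);
              ep = C[*it] + rank(*it, ep);
              if (sp >= ep) return 0;
          }
          return ep - sp;
      }

      int main() {
          const std::string L = bwt("GATTACA$");
          auto C = buildC(L);
          std::cout << countOccurrences(L, C, "TA") << "\n";  // prints 1
          std::cout << countOccurrences(L, C, "A") << "\n";   // prints 3
          return 0;
      }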

  10. Flow characteristics and spillage mechanisms of an inclined quad-vortex range hood subject to influence from draft.

    PubMed

    Huang, Rong Fung; Chen, Jia-Kun; Lin, Jyun-Hua

    2015-01-01

    The flow and spillage characteristics of an inclined quad-vortex (IQV) range hood subject to the influence of drafts from various directions were studied. The laser-assisted smoke flow visualization technique was used to reveal the flow characteristics, and the tracer-gas (sulfur hexafluoride) concentration detection method was used to indicate the quantitative values of the capture efficiency of the hood. It was found that the leakage mechanisms of the IQV range hood are closely related to the flow characteristics. A critical draft velocity of about 0.5 m/s and a critical face velocity of about 0.25 m/s for the IQV range hood were found. When the IQV range hood was influenced by a draft with a velocity larger than the critical draft velocity, the spillage of pollutants became significant and the pollutant spillage rate increased with increasing draft velocity. At draft velocities less than or equal to the critical value, no containment leakages induced by the turbulence diffusion, reverse flow, or boundary-layer separation were observed, and the capture efficiency was about 100%. The IQV range hood exhibited a high ability to resist the influences of lateral and frontal drafts. The capture efficiency of the IQV range hood operated at a suction flow rate of 5 to 9 m³/min is higher than that of the conventional range hood operated at 11 to 15 m³/min.

  11. Evaluating and optimizing the NERSC workload on Knights Landing

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Barnes, T; Cook, B; Deslippe, J

    2017-01-30

    NERSC has partnered with 20 representative application teams to evaluate performance on the Xeon-Phi Knights Landing architecture and develop an application-optimization strategy for the greater NERSC workload on the recently installed Cori system. In this article, we present early case studies and summarized results from a subset of the 20 applications, highlighting the impact of important architecture differences between the Xeon-Phi and traditional Xeon processors. We summarize the status of the applications and describe the broader optimization strategy that has taken shape.

  12. Evaluating and Optimizing the NERSC Workload on Knights Landing

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Barnes, Taylor; Cook, Brandon; Doerfler, Douglas

    2016-01-01

    NERSC has partnered with 20 representative application teams to evaluate performance on the Xeon-Phi Knights Landing architecture and develop an application-optimization strategy for the greater NERSC workload on the recently installed Cori system. In this article, we present early case studies and summarized results from a subset of the 20 applications, highlighting the impact of important architecture differences between the Xeon-Phi and traditional Xeon processors. We summarize the status of the applications and describe the broader optimization strategy that has taken shape.

  13. Accelerating Climate and Weather Simulations through Hybrid Computing

    NASA Technical Reports Server (NTRS)

    Zhou, Shujia; Cruz, Carlos; Duffy, Daniel; Tucker, Robert; Purcell, Mark

    2011-01-01

    Unconventional multi- and many-core processors (e.g. IBM (R) Cell B.E.(TM) and NVIDIA (R) GPU) have emerged as effective accelerators in trial climate and weather simulations. Yet these climate and weather models typically run on parallel computers with conventional processors (e.g. Intel, AMD, and IBM) using Message Passing Interface. To address challenges involved in efficiently and easily connecting accelerators to parallel computers, we investigated using IBM's Dynamic Application Virtualization (TM) (IBM DAV) software in a prototype hybrid computing system with representative climate and weather model components. The hybrid system comprises two Intel blades and two IBM QS22 Cell B.E. blades, connected with both InfiniBand(R) (IB) and 1-Gigabit Ethernet. The system significantly accelerates a solar radiation model component by offloading compute-intensive calculations to the Cell blades. Systematic tests show that IBM DAV can seamlessly offload compute-intensive calculations from Intel blades to Cell B.E. blades in a scalable, load-balanced manner. However, noticeable communication overhead was observed, mainly due to IP over the IB protocol. Full utilization of IB Sockets Direct Protocol and the lower latency production version of IBM DAV will reduce this overhead.

  14. Massive parallelization of a 3D finite difference electromagnetic forward solution using domain decomposition methods on multiple CUDA enabled GPUs

    NASA Astrophysics Data System (ADS)

    Schultz, A.

    2010-12-01

    We describe our ongoing efforts to achieve massive parallelization on a novel hybrid GPU testbed machine currently configured with 12 Intel Westmere Xeon CPU cores (or 24 parallel computational threads) with 96 GB DDR3 system memory, 4 GPU subsystems which in aggregate contain 960 NVidia Tesla GPU cores with 16 GB dedicated DDR3 GPU memory, and a second interleaved bank of 4 GPU subsystems containing in aggregate 1792 NVidia Fermi GPU cores with 12 GB dedicated DDR5 GPU memory. We are applying domain decomposition methods to a modified version of Weiss' (2001) 3D frequency-domain full-physics EM finite difference code, an open source GPL-licensed f90 code available for download from www.OpenEM.org. This will be the core of a new hybrid 3D inversion that parallelizes frequencies across CPUs and individual forward solutions across GPUs. We describe progress made in modifying the code to use direct solvers in GPU cores dedicated to each small subdomain, iteratively improving the solution by matching adjacent subdomain boundary solutions, rather than iterative Krylov-space sparse solvers as currently applied to the whole domain.

  15. Parallelization of combinatorial search when solving knapsack optimization problem on computing systems based on multicore processors

    NASA Astrophysics Data System (ADS)

    Rahman, P. A.

    2018-05-01

    This scientific paper deals with a model of the knapsack optimization problem and a method for solving it based on directed combinatorial search in the boolean space. The author's specialized mathematical model for decomposing the search zone into separate search spheres, and the algorithm for distributing the search spheres across the cores of a multi-core processor, are also discussed. The paper also provides an example of decomposing the search zone into several search spheres and distributing them across the cores of a quad-core processor. Finally, the author's formula for estimating the theoretical maximum computational acceleration achievable by parallelizing the search zone into search spheres over an unlimited number of processor cores is also given.
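
    The decomposition pattern itself is easy to sketch with standard C++ threads (a generic sketch, not the author's code: the items, capacity, and the even split of the bit-mask range into "search spheres" are invented here). Each thread exhaustively searches its own slice of the boolean space and the per-thread bests are merged at the end; with p cores and a balanced split the acceleration approaches p, which is the spirit of the theoretical-maximum estimate mentioned above.

      #include <algorithm>
      #include <cstdint>
      #include <iostream>
      #include <thread>
      #include <vector>

      struct Item { int w, v; };   // weight, value

      // Exhaustive search over item subsets encoded as masks in [lo, hi).
      long long bestInRange(const std::vector<Item>& items,
                            uint64_t lo, uint64_t hi, int cap) {
          long long best = 0;
          for (uint64_t m = lo; m < hi; ++m) {
              int w = 0; long long v = 0;
              for (size_t i = 0; i < items.size(); ++i)
                  if (m >> i & 1) { w += items[i].w; v += items[i].v; }
              if (w <= cap && v > best) best = v;
          }
          return best;
      }

      int main() {
          // Invented 16-item instance; 2^16 candidate subsets in total.
          std::vector<Item> items = {{3,4},{4,5},{2,3},{5,8},{1,1},{6,9},{4,4},{3,2},
                                     {2,2},{5,6},{7,11},{1,2},{3,3},{4,7},{2,1},{6,8}};
          const int cap = 20;
          const unsigned T =
              std::max(1u, std::thread::hardware_concurrency()); // e.g. 4 on a quad-core
          const uint64_t total = 1ull << items.size();
          std::vector<long long> best(T, 0);
          std::vector<std::thread> pool;
          for (unsigned t = 0; t < T; ++t)
              pool.emplace_back([&, t] {       // one "search sphere" per core
                  best[t] = bestInRange(items, total * t / T,
                                        total * (t + 1) / T, cap);
              });
          for (std::thread& th : pool) th.join();
          std::cout << "best value: "
                    << *std::max_element(best.begin(), best.end()) << "\n";
          return 0;
      }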

  16. User and Performance Impacts from Franklin Upgrades

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    He, Yun

    2009-05-10

    The NERSC flagship computer, the Cray XT4 system "Franklin", has gone through three major upgrades during the past year: a quad-core upgrade, a CLE 2.1 upgrade, and an IO upgrade. In this paper, we discuss various aspects of the user impacts of these upgrades, such as user access, user environment, and user issues. The performance impacts on the kernel benchmarks and selected application benchmarks will also be presented.

  17. 34. DESPATCH CORE OVENS, GREY IRON FOUNDRY CORE ROOM, BAKES ...

    Library of Congress Historic Buildings Survey, Historic Engineering Record, Historic Landscapes Survey

    34. DESPATCH CORE OVENS, GREY IRON FOUNDRY CORE ROOM, BAKES CORES THAT ARE NOT MADE ON HEATED OR COLD BOX CORE MACHINES, TO SET BINDING AGENTS MIXED WITH THE SAND CREATING CORES HARD ENOUGH TO WITHSTAND THE FLOW OF MOLTEN IRON INSIDE A MOLD. - Stockham Pipe & Fittings Company, Grey Iron Foundry, 4000 Tenth Avenue North, Birmingham, Jefferson County, AL

  18. Core Formation Process and Light Elements in the Planetary Core

    NASA Astrophysics Data System (ADS)

    Ohtani, E.; Sakairi, T.; Watanabe, K.; Kamada, S.; Sakamaki, T.; Hirao, N.

    2015-12-01

    Si, O, and S are major candidates for light elements in the planetary core. In the early stage of planetary formation, core formation started by percolation of the metallic liquid through the silicate matrix, because Fe-S-O and Fe-S-Si eutectic temperatures are significantly lower than the solidus of the silicates. Therefore, in the early stage of accretion of the planets, the eutectic liquid with S enrichment was formed and separated into the core by percolation. The major light element in the core at this stage will be sulfur. The internal pressure and temperature increased with the growth of the planets, and the metal component depleted in S was molten. The metallic melt contained both Si and O at high pressure in the deep magma ocean in the later stage. Thus, the core contains S, Si, and O in this stage of core formation. Partitioning experiments between solid and liquid metals indicate that S is partitioned into the liquid metal, whereas O is only weakly partitioned into the liquid. Partitioning of Si changes with the metallic iron phases, i.e., fcc iron-alloy coexisting with the metallic liquid below 30 GPa is depleted in Si, whereas hcp-Fe alloy above 30 GPa coexisting with the liquid favors Si. This contrast in Si partitioning provides a remarkable difference in compositions of the solid inner core and liquid outer core among different terrestrial planets. Our melting experiments on the Fe-S-Si and Fe-O-S systems at high pressure indicate the core adiabats in small planets, Mercury and Mars, are greater than the slope of the solidus and liquidus curves of these systems. Thus, in these planets, the core crystallized at the top of the liquid core and 'snowing core' formation occurred during crystallization. The solid inner core is depleted in both Si and S whereas the liquid outer core is relatively enriched in Si and S in these planets. On the other hand, the core adiabats in large planets, Earth and Venus, are smaller than the solidus and liquidus curves of the systems. The

  19. How cores grow by pebble accretion. I. Direct core growth

    NASA Astrophysics Data System (ADS)

    Brouwers, M. G.; Vazan, A.; Ormel, C. W.

    2018-03-01

    Context. Planet formation by pebble accretion is an alternative to planetesimal-driven core accretion. In this scenario, planets grow by the accretion of cm- to m-sized pebbles instead of km-sized planetesimals. One of the main differences with planetesimal-driven core accretion is the increased thermal ablation experienced by pebbles. This can provide early enrichment to the planet's envelope, which influences its subsequent evolution and changes the process of core growth. Aims: We aim to predict core masses and envelope compositions of planets that form by pebble accretion and compare mass deposition of pebbles to planetesimals. Specifically, we calculate the core mass where pebbles completely evaporate and are absorbed before reaching the core, which signifies the end of direct core growth. Methods: We model the early growth of a protoplanet by calculating the structure of its envelope, taking into account the fate of impacting pebbles or planetesimals. The region where high-Z material can exist in vapor form is determined by the temperature-dependent vapor pressure. We include enrichment effects by locally modifying the mean molecular weight of the envelope. Results: In the pebble case, three phases of core growth can be identified. In the first phase (Mcore < 0.23-0.39 M⊕), pebbles impact the core without significant ablation. During the second phase (Mcore < 0.5M⊕), ablation becomes increasingly severe. A layer of high-Z vapor starts to form around the core that absorbs a small fraction of the ablated mass. The rest of the material either rains out to the core or instead mixes outwards, slowing core growth. In the third phase (Mcore > 0.5M⊕), the high-Z inner region expands outwards, absorbing an increasing fraction of the ablated material as vapor. Rainout ends before the core mass reaches 0.6 M⊕, terminating direct core growth. In the case of icy H2O pebbles, this happens before 0.1 M⊕. Conclusions: Our results indicate that pebble accretion can

  20. A simple integrative method for presenting head-contingent motion parallax and disparity cues on intel x86 processor-based machines.

    PubMed

    Szatmary, J; Hadani, I; Julesz, B

    1997-01-01

    Rogers and Graham (1979) developed a system to show that head-movement-contingent motion parallax produces monocular depth perception in random dot patterns. Their display system comprised an oscilloscope driven by function generators or a special graphics board that triggered the X and Y deflection of the raster scan signal. Replication of this system required costly hardware that is no longer on the market. In this paper the Rogers-Graham method is reproduced with an Intel-processor-based, IBM PC-compatible machine at no additional hardware cost. An adapted joystick sampled through the standard game port can serve as a provisional head-movement sensor. Monitor resolution for displaying motion is effectively enhanced 16 times by the use of anti-aliasing, enabling the display of thousands of random dots in real time with a refresh rate of 60 Hz or above. A color monitor enables the use of the anaglyph method, thus combining stereoscopic and monocular parallax on a single display without loss of speed. The power of this system is demonstrated by a psychophysical measurement in which subjects nulled head-movement-contingent illusory parallax, evoked by a static stereogram, with real parallax. The amount of real parallax required to null the illusory stereoscopic parallax monotonically increased with disparity.