Sample records for ghz intel p4-processor

  1. Cognitive Medical Wireless Testbed System (COMWITS)

    DTIC Science & Technology

    2016-11-01

    Number: ...... ...... Sub Contractors (DD882) Names of other research staff Inventions (DD882) Scientific Progress This testbed merges two ARO grants...bit 64 bit CPU Intel Xeon Processor E5-1650v3 (6C, 3.5 GHz, Turbo, HT , 15M, 140W) Intel Core i7-3770 (3.4 GHz Quad Core, 77W) Dual Intel Xeon

  2. Extension of the AMBER molecular dynamics software to Intel's Many Integrated Core (MIC) architecture

    NASA Astrophysics Data System (ADS)

    Needham, Perri J.; Bhuiyan, Ashraf; Walker, Ross C.

    2016-04-01

    We present an implementation of explicit solvent particle mesh Ewald (PME) classical molecular dynamics (MD) within the PMEMD molecular dynamics engine, which forms part of the AMBER v14 MD software package. The implementation makes use of Intel Xeon Phi coprocessors by offloading portions of the PME direct summation and neighbor list build to the coprocessor. We refer to this implementation as pmemd MIC offload and in this paper present the technical details of the algorithm, including basic models for MPI and OpenMP configuration, and analyze the resultant performance. The algorithm provides the best performance improvement for large systems (>400,000 atoms), achieving a ∼35% performance improvement for satellite tobacco mosaic virus (1,067,095 atoms) when 2 Intel E5-2697 v2 processors (2 × 12 cores, 30M cache, 2.7 GHz) are coupled to an Intel Xeon Phi coprocessor (Model 7120P-1.238/1.333 GHz, 61 cores). The implementation utilizes a two-fold decomposition strategy: spatial decomposition using an MPI library and thread-based decomposition using OpenMP. We also present compiler optimization settings that improve the performance on Intel Xeon processors, while retaining simulation accuracy.

  3. Computational algorithms for simulations in atmospheric optics.

    PubMed

    Konyaev, P A; Lukin, V P

    2016-04-20

    A computer simulation technique for atmospheric and adaptive optics based on parallel programming is discussed. A parallel propagation algorithm is designed and a modified spectral-phase method for computer generation of 2D time-variant random fields is developed. Temporal power spectra of Laguerre-Gaussian beam fluctuations are considered as an example to illustrate the applications discussed. Implementation of the proposed algorithms using Intel MKL and IPP libraries and NVIDIA CUDA technology is shown to be very fast and accurate. The hardware system for the computer simulation is an off-the-shelf desktop with an Intel Core i7-4790K CPU operating at a turbo-speed frequency up to 5 GHz and an NVIDIA GeForce GTX-960 graphics accelerator with 1024 1.5 GHz processors.

  4. Evaluation of an Adaptive Automation Trigger Based on Task Performance, Priority, and Frequency

    DTIC Science & Technology

    2013-06-01

    with dual Intel® Xeon® CPU X5550 processors @ 2.67 GHz each, 12.0 GB RAM, and a 1.5 GB PCIe nVidia Quadro FX 4800 graphics card (Microsoft...Cole Publishing Company. Miller, C. A., & Parasuraman, R. (2007). Designing for flexible interaction between humans and automation: Delegation

  5. LTE-Enhanced Cognitive Radio Network Testbed (LTE-CORNET)

    DTIC Science & Technology

    2016-11-01

    4 PERCENT_SUPPORTEDNAME FTE Equivalent: Total Number: Sub Contractors (DD882) Names of Personnel receiving masters degrees Names of personnel...Turbo, HT , 15M, 140W) Intel Core i7-3770 (3.4 GHz Quad Core, 77W) Dual Intel Xeon E5-2695 v4 (18C, 2.1GHz, 3.3GHz Turbo, 2400MHz, 45MB, 120W

  6. A GPU Parallelization of the Absolute Nodal Coordinate Formulation for Applications in Flexible Multibody Dynamics

    DTIC Science & Technology

    2012-02-17

    to be solved. Disclaimer: Reference herein to any specific commercial company, product, process, or service by trade name, trademark...data processing rather than data caching and control flow. To make use of this computational power, NVIDIA introduced a general purpose parallel...GPU implementations were run on an Intel Nehalem Xeon E5520 2.26 GHz processor with an NVIDIA Tesla C2070 graphics card for varying numbers of

  7. Spectral Element Method for the Simulation of Unsteady Compressible Flows

    NASA Technical Reports Server (NTRS)

    Diosady, Laslo Tibor; Murman, Scott M.

    2013-01-01

    This work uses a discontinuous-Galerkin spectral-element method (DGSEM) to solve the compressible Navier-Stokes equations [1-3]. The inviscid flux is computed using the approximate Riemann solver of Roe [4]. The viscous fluxes are computed using the second form of Bassi and Rebay (BR2) [5] in a manner consistent with the spectral-element approximation. The method of lines with the classical 4th-order explicit Runge-Kutta scheme is used for time integration. Results for polynomial orders up to p = 15 (16th order) are presented. The code is parallelized using the Message Passing Interface (MPI). The computations presented in this work are performed using the Sandy Bridge nodes of the NASA Pleiades supercomputer at NASA Ames Research Center. Each Sandy Bridge node consists of 2 eight-core Intel Xeon E5-2670 processors with a clock speed of 2.6 GHz and 2 GB of memory per core. On a Sandy Bridge node the Tau Benchmark [6] runs in a time of 7.6 s.
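
    The time integrator named in this abstract, the classical 4th-order explicit Runge-Kutta scheme, can be sketched in a few lines. This is only an illustration under stated assumptions: the function `rk4_step` and the toy right-hand side `decay` are hypothetical names, and the toy system du/dt = -u stands in for the semi-discrete DGSEM residual, which the abstract does not give.

```python
import math

def rk4_step(rhs, t, u, dt):
    """Advance the state vector u by one classical RK4 step of size dt.

    rhs(t, u) returns du/dt as a list; in a method-of-lines solver it would
    be the spatial residual (here it is a hypothetical stand-in).
    """
    def axpy(a, x, y):
        # Return y + a * x, element by element.
        return [yi + a * xi for xi, yi in zip(x, y)]

    k1 = rhs(t, u)
    k2 = rhs(t + dt / 2, axpy(dt / 2, k1, u))
    k3 = rhs(t + dt / 2, axpy(dt / 2, k2, u))
    k4 = rhs(t + dt, axpy(dt, k3, u))
    # Weighted average of the four stage slopes: (k1 + 2 k2 + 2 k3 + k4) / 6.
    return [
        ui + dt / 6 * (a + 2 * b + 2 * c + d)
        for ui, a, b, c, d in zip(u, k1, k2, k3, k4)
    ]

def decay(t, u):
    # Toy right-hand side (assumption): du/dt = -u, exact solution exp(-t).
    return [-ui for ui in u]

def integrate(rhs, u0, dt, steps):
    """Apply rk4_step repeatedly, starting from t = 0."""
    t, u = 0.0, list(u0)
    for _ in range(steps):
        u = rk4_step(rhs, t, u, dt)
        t += dt
    return u

if __name__ == "__main__":
    u_end = integrate(decay, [1.0], dt=0.1, steps=10)
    print("RK4 result:", u_end[0], "exact:", math.exp(-1.0))
```

    With ten steps of size 0.1 the toy problem agrees with exp(-1) to better than 1e-6, which is consistent with the fourth-order accuracy of the scheme.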

  8. Investigating the Naval Logistics Role in Humanitarian Assistance Activities

    DTIC Science & Technology

    2015-03-01

    transportation means. E. BASE CASE RESULTS The computations were executed on a MacBook Pro, 3 GHz Intel Core i7-4578U processor with 8 GB. The...MacBook Pro was partitioned to also contain a Windows 7, 64-bit operating system. The computations were run in the Windows 7 operating system using the...it impacts the types of metamodels that can be developed as a result of data farming (Lucas et al., 2015). Using a metamodel, one can closely

  9. Analysis and Implementation of Particle-to-Particle (P2P) Graphics Processor Unit (GPU) Kernel for Black-Box Adaptive Fast Multipole Method

    DTIC Science & Technology

    2015-06-01

    5110P and 16 dx360M4 nodes each with one NVIDIA Kepler K20M/K40M GPU. Each node contained dual Intel Xeon E5-2670 (Sandy Bridge) central processing...kernel and as such does not employ multiple processors. This work makes use of a single processing core and a single NVIDIA Kepler K40 GK110...bandwidth (2 × 16 slot), 7.877 GFloat/s; Kepler K40 peak, 4,290 × 1 billion floating-point operations (GFLOPs), and 288 GB/s Kepler K40 memory

  10. Fast generation of computer-generated hologram by graphics processing unit

    NASA Astrophysics Data System (ADS)

    Matsuda, Sho; Fujii, Tomohiko; Yamaguchi, Takeshi; Yoshikawa, Hiroshi

    2009-02-01

    A cylindrical hologram is well known to be viewable in 360 deg. This hologram depends on high pixel resolution. Therefore, a Computer-Generated Cylindrical Hologram (CGCH) requires a huge amount of calculation. In our previous research, we used a look-up table method for fast calculation with an Intel Pentium 4 2.8 GHz. It took 480 hours to calculate a high resolution CGCH (504,000 x 63,000 pixels, with an average of 27,000 object points). To improve the quality of the CGCH reconstructed image, the fringe pattern requires higher spatial frequency and resolution. Therefore, to increase the calculation speed, we have to change the calculation method. In this paper, to reduce the calculation time of the CGCH (912,000 x 108,000 pixels), we employ a Graphics Processing Unit (GPU). It took 4,406 hours to calculate the high resolution CGCH on a Xeon 3.4 GHz. Since a GPU has many streaming processors and a parallel processing structure, it works as a high performance parallel processor. In addition, a GPU gives maximum performance on 2-dimensional data and streaming data. Recently, GPUs can be utilized for general purposes (GPGPU). For example, NVIDIA's GeForce 7 series became a programmable processor with the Cg programming language. The subsequent GeForce 8 series has CUDA as a software development kit made by NVIDIA. Theoretically, the calculation ability of the GPU is announced as 500 GFLOPS. From the experimental result, we have achieved 47 times faster calculation compared with our previous work, which used a CPU. Therefore, the CGCH can be generated in 95 hours. So, the total time is 110 hours to calculate and print the CGCH.
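
    The timing figures quoted in this abstract can be cross-checked with simple arithmetic. The sketch below assumes that the 4,406-hour Xeon run is the CPU baseline for the 47x claim, which the abstract implies but does not state outright; the function names are illustrative only.

```python
CPU_HOURS = 4406.0      # Xeon 3.4 GHz run quoted in the abstract
QUOTED_SPEEDUP = 47.0   # "47 times faster"
QUOTED_GPU_HOURS = 95.0 # "can be generated in 95 hours"

def hours_after_speedup(baseline_hours, speedup):
    """Run time implied by a given speedup over the baseline."""
    return baseline_hours / speedup

def implied_speedup(baseline_hours, accelerated_hours):
    """Speedup implied by two measured run times."""
    return baseline_hours / accelerated_hours

if __name__ == "__main__":
    print("hours implied by 47x:", round(hours_after_speedup(CPU_HOURS, QUOTED_SPEEDUP), 1))
    print("speedup implied by 95 h:", round(implied_speedup(CPU_HOURS, QUOTED_GPU_HOURS), 1))
```

    Under that assumption the two quoted numbers agree to within rounding: a 47x speedup implies about 93.7 hours, and 95 hours implies a speedup of about 46.4.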

  11. MSTor: A program for calculating partition functions, free energies, enthalpies, entropies, and heat capacities of complex molecules including torsional anharmonicity

    NASA Astrophysics Data System (ADS)

    Zheng, Jingjing; Mielke, Steven L.; Clarkson, Kenneth L.; Truhlar, Donald G.

    2012-08-01

    We present a Fortran program package, MSTor, which calculates partition functions and thermodynamic functions of complex molecules involving multiple torsional motions by the recently proposed MS-T method. This method interpolates between the local harmonic approximation in the low-temperature limit, and the limit of free internal rotation of all torsions at high temperature. The program can also carry out calculations in the multiple-structure local harmonic approximation. The program package also includes six utility codes that can be used as stand-alone programs to calculate reduced moment of inertia matrices by the method of Kilpatrick and Pitzer, to generate conformational structures, to calculate, either analytically or by Monte Carlo sampling, volumes for torsional subdomains defined by Voronoi tessellation of the conformational subspace, to generate template input files, and to calculate one-dimensional torsional partition functions using the torsional eigenvalue summation method. Catalogue identifier: AEMF_v1_0 Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEMF_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 77 434 No. of bytes in distributed program, including test data, etc.: 3 264 737 Distribution format: tar.gz Programming language: Fortran 90, C, and Perl Computer: Itasca (HP Linux cluster, each node has two-socket, quad-core 2.8 GHz Intel Xeon X5560 “Nehalem EP” processors), Calhoun (SGI Altix XE 1300 cluster, each node containing two quad-core 2.66 GHz Intel Xeon “Clovertown”-class processors sharing 16 GB of main memory), Koronis (Altix UV 1000 server with 190 6-core Intel Xeon X7542 “Westmere” processors at 2.66 GHz), Elmo (Sun Fire X4600 Linux cluster with AMD Opteron cores), and Mac Pro (two 2.8 GHz Quad-core Intel Xeon processors) Operating system: Linux/Unix/Mac OS RAM: 2 Mbytes Classification: 16.3, 16.12, 23 Nature of problem: Calculation of the partition functions and thermodynamic functions (standard-state energy, enthalpy, entropy, and free energy as functions of temperatures) of complex molecules involving multiple torsional motions. Solution method: The multi-structural approximation with torsional anharmonicity (MS-T). The program also provides results for the multi-structural local harmonic approximation [1]. Restrictions: There is no limit on the number of torsions that can be included in either the Voronoi calculation or the full MS-T calculation. In practice, the range of problems that can be addressed with the present method consists of all multi-torsional problems for which one can afford to calculate all the conformations and their frequencies. Unusual features: The method can be applied to transition states as well as stable molecules. The program package also includes the hull program for the calculation of Voronoi volumes and six utility codes that can be used as stand-alone programs to calculate reduced moment-of-inertia matrices by the method of Kilpatrick and Pitzer, to generate conformational structures, to calculate, either analytically or by Monte Carlo sampling, volumes for torsional subdomains defined by Voronoi tessellation of the conformational subspace, to generate template input files, and to calculate one-dimensional torsional partition functions using the torsional eigenvalue summation method. Additional comments: The program package includes a manual, installation script, and input and output files for a test suite. Running time: There are 24 test runs. The running time of the test runs on a single processor of the Itasca computer is less than 2 seconds. References: [1] J. Zheng, T. Yu, E. Papajak, I.M. Alecu, S.L. Mielke, D.G. Truhlar, Practical methods for including torsional anharmonicity in thermochemical calculations of complex molecules: The internal-coordinate multi-structural approximation, Phys. Chem. Chem. Phys. 13 (2011) 10885-10907.

  12. GNAQPMS v1.1: accelerating the Global Nested Air Quality Prediction Modeling System (GNAQPMS) on Intel Xeon Phi processors

    NASA Astrophysics Data System (ADS)

    Wang, Hui; Chen, Huansheng; Wu, Qizhong; Lin, Junmin; Chen, Xueshun; Xie, Xinwei; Wang, Rongrong; Tang, Xiao; Wang, Zifa

    2017-08-01

    The Global Nested Air Quality Prediction Modeling System (GNAQPMS) is the global version of the Nested Air Quality Prediction Modeling System (NAQPMS), which is a multi-scale chemical transport model used for air quality forecast and atmospheric environmental research. In this study, we present the porting and optimisation of GNAQPMS on a second-generation Intel Xeon Phi processor, codenamed Knights Landing (KNL). Compared with the first-generation Xeon Phi coprocessor (codenamed Knights Corner, KNC), KNL has many new hardware features such as a bootable processor, high-performance in-package memory and ISA compatibility with Intel Xeon processors. In particular, we describe the five optimisations we applied to the key modules of GNAQPMS, including the CBM-Z gas-phase chemistry, advection, convection and wet deposition modules. These optimisations work well on both the KNL 7250 processor and the Intel Xeon E5-2697 V4 processor. They include (1) updating the pure Message Passing Interface (MPI) parallel mode to the hybrid parallel mode with MPI and OpenMP in the emission, advection, convection and gas-phase chemistry modules; (2) fully employing the 512 bit wide vector processing units (VPUs) on the KNL platform; (3) reducing unnecessary memory access to improve cache efficiency; (4) reducing the thread local storage (TLS) in the CBM-Z gas-phase chemistry module to improve its OpenMP performance; and (5) changing the global communication from writing/reading interface files to MPI functions to improve the performance and the parallel scalability. These optimisations greatly improved the GNAQPMS performance. The same optimisations also work well for the Intel Xeon Broadwell processor, specifically E5-2697 v4. Compared with the baseline version of GNAQPMS, the optimised version was 3.51 × faster on KNL and 2.77 × faster on the CPU. Moreover, the optimised version ran at 26 % lower average power on KNL than on the CPU. 
    With the combined performance and energy improvement, the KNL platform was 37.5 % more efficient on power consumption compared with the CPU platform. The optimisations also enabled much further parallel scalability on both the CPU cluster and the KNL cluster scaled to 40 CPU nodes and 30 KNL nodes, with a parallel efficiency of 70.4 and 42.2 %, respectively.

  13. Implementation of the DPM Monte Carlo code on a parallel architecture for treatment planning applications.

    PubMed

    Tyagi, Neelam; Bose, Abhijit; Chetty, Indrin J

    2004-09-01

    We have parallelized the Dose Planning Method (DPM), a Monte Carlo code optimized for radiotherapy class problems, on distributed-memory processor architectures using the Message Passing Interface (MPI). Parallelization has been investigated on a variety of parallel computing architectures at the University of Michigan-Center for Advanced Computing, with respect to efficiency and speedup as a function of the number of processors. We have integrated the parallel pseudo random number generator from the Scalable Parallel Pseudo-Random Number Generator (SPRNG) library to run with the parallel DPM. The Intel cluster consisting of 800 MHz Intel Pentium III processors shows an almost linear speedup up to 32 processors for simulating 1 × 10^8 or more particles. The speedup results are nearly linear on an Athlon cluster (up to 24 processors based on availability), which consists of 1.8 GHz+ Advanced Micro Devices (AMD) Athlon processors, on increasing the problem size up to 8 × 10^8 histories. For a smaller number of histories (1 × 10^8) the reduction of efficiency with the Athlon cluster (down to 83.9% with 24 processors) occurs because the processing time required to simulate 1 × 10^8 histories is less than the time associated with interprocessor communication. A similar trend was seen with the Opteron Cluster (consisting of 1400 MHz, 64-bit AMD Opteron processors) on increasing the problem size. Because of the 64-bit architecture, Opteron processors are capable of storing and processing instructions at a faster rate and hence are faster as compared to the 32-bit Athlon processors. We have validated our implementation with an in-phantom dose calculation study using a parallel pencil monoenergetic electron beam of 20 MeV energy. The phantom consists of layers of water, lung, bone, aluminum, and titanium. The agreement in the central axis depth dose curves and profiles at different depths shows that the serial and parallel codes are equivalent in accuracy.
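
    The efficiency and speedup figures in this abstract are related by the usual definitions, speedup = T(1) / T(N) and efficiency = speedup / N. The abstract does not spell out its definitions, so the sketch below assumes these standard ones; the function names and the sample timings in the demo are illustrative, not taken from the paper.

```python
def speedup(serial_time, parallel_time):
    """Speedup S = T(1) / T(N)."""
    return serial_time / parallel_time

def efficiency(speedup_value, n_processors):
    """Parallel efficiency E = S / N (1.0 means perfectly linear scaling)."""
    return speedup_value / n_processors

def speedup_from_efficiency(efficiency_value, n_processors):
    """Invert E = S / N to recover the speedup from a quoted efficiency."""
    return efficiency_value * n_processors

if __name__ == "__main__":
    # Quoted in the abstract: efficiency down to 83.9% with 24 Athlon processors.
    s = speedup_from_efficiency(0.839, 24)
    print(f"implied speedup on 24 processors: {s:.1f}")
    # Hypothetical timings, for illustration only.
    print(f"efficiency: {efficiency(speedup(100.0, 5.0), 24):.3f}")
```

    Under these definitions, the quoted 83.9% efficiency on 24 processors corresponds to a speedup of roughly 20 rather than the ideal 24.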

  14. Architectural Specialization for Inter-Iteration Loop Dependence Patterns

    DTIC Science & Technology

    2015-10-01

    Architectural Specialization for Inter-Iteration Loop Dependence Patterns Christopher Batten Computer Systems Laboratory School of Electrical and...Trends in Computer Architecture Transistors (Thousands) Frequency (MHz) Typical Power (W) MIPS R2K Intel P4 DEC Alpha 21264 Data collected by M...Tasks per Joule) Simple Processor Design Power Constraint High-Performance Architectures Embedded Architectures Design Performance

  15. Theorem Proving in Intel Hardware Design

    NASA Technical Reports Server (NTRS)

    O'Leary, John

    2009-01-01

    For the past decade, a framework combining model checking (symbolic trajectory evaluation) and higher-order logic theorem proving has been in production use at Intel. Our tools and methodology have been used to formally verify execution cluster functionality (including floating-point operations) for a number of Intel products, including the Pentium® 4 and Core™ i7 processors. Hardware verification in 2009 is much more challenging than it was in 1999: today's CPU chip designs contain many processor cores and significant firmware content. This talk will attempt to distill the lessons learned over the past ten years, discuss how they apply to today's problems, and outline some future directions.

  16. Performance of a plasma fluid code on the Intel parallel computers

    NASA Technical Reports Server (NTRS)

    Lynch, V. E.; Carreras, B. A.; Drake, J. B.; Leboeuf, J. N.; Liewer, P.

    1992-01-01

    One approach to improving the real-time efficiency of plasma turbulence calculations is to use a parallel algorithm. A parallel algorithm for plasma turbulence calculations was tested on the Intel iPSC/860 hypercube and the Touchstone Delta machine. Using the 128 processors of the Intel iPSC/860 hypercube, a factor of 5 improvement over a single-processor CRAY-2 is obtained. For the Touchstone Delta machine, the corresponding improvement factor is 16. For plasma edge turbulence calculations, an extrapolation of the present results to the Intel Sigma machine gives an improvement factor close to 64 over the single-processor CRAY-2.

  17. Optimizing Performance of Combustion Chemistry Solvers on Intel's Many Integrated Core (MIC) Architectures

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sitaraman, Hariswaran; Grout, Ray W

    This work investigates novel algorithm designs and optimization techniques for restructuring chemistry integrators in zero and multidimensional combustion solvers, which can then be effectively used on the emerging generation of Intel's Many Integrated Core/Xeon Phi processors. These processors offer increased computing performance via large number of lightweight cores at relatively lower clock speeds compared to traditional processors (e.g. Intel Sandybridge/Ivybridge) used in current supercomputers. This style of processor can be productively used for chemistry integrators that form a costly part of computational combustion codes, in spite of their relatively lower clock speeds. Performance commensurate with traditional processors is achieved here through the combination of careful memory layout, exposing multiple levels of fine grain parallelism and through extensive use of vendor supported libraries (Cilk Plus and Math Kernel Libraries). Important optimization techniques for efficient memory usage and vectorization have been identified and quantified. These optimizations resulted in a factor of ~ 3 speed-up using Intel 2013 compiler and ~ 1.5 using Intel 2017 compiler for large chemical mechanisms compared to the unoptimized version on the Intel Xeon Phi. The strategies, especially with respect to memory usage and vectorization, should also be beneficial for general purpose computational fluid dynamics codes.

  18. Real-time SHVC software decoding with multi-threaded parallel processing

    NASA Astrophysics Data System (ADS)

    Gudumasu, Srinivas; He, Yuwen; Ye, Yan; He, Yong; Ryu, Eun-Seok; Dong, Jie; Xiu, Xiaoyu

    2014-09-01

    This paper proposes a parallel decoding framework for scalable HEVC (SHVC). Various optimization technologies are implemented on the basis of SHVC reference software SHM-2.0 to achieve real-time decoding speed for the two layer spatial scalability configuration. SHVC decoder complexity is analyzed with profiling information. The decoding process at each layer and the up-sampling process are designed in parallel and scheduled by a high level application task manager. Within each layer, multi-threaded decoding is applied to accelerate the layer decoding speed. Entropy decoding, reconstruction, and in-loop processing are pipeline designed with multiple threads based on groups of coding tree units (CTU). A group of CTUs is treated as a processing unit in each pipeline stage to achieve a better trade-off between parallelism and synchronization. Motion compensation, inverse quantization, and inverse transform modules are further optimized with SSE4 SIMD instructions. Simulations on a desktop with an Intel i7 processor 2600 running at 3.4 GHz show that the parallel SHVC software decoder is able to decode 1080p spatial 2x at up to 60 fps (frames per second) and 1080p spatial 1.5x at up to 50 fps for those bitstreams generated with SHVC common test conditions in the JCT-VC standardization group. The decoding performance at various bitrates with different optimization technologies and different numbers of threads is compared in terms of decoding speed and resource usage, including processor and memory.

  19. Static analysis of the hull plate using the finite element method

    NASA Astrophysics Data System (ADS)

    Ion, A.

    2015-11-01

    This paper aims at presenting the static analysis for two levels of a container ship's construction as follows: the first level is at the girder / hull plate and the second level is conducted at the entire strength hull of the vessel. This article will describe the work for the static analysis of a hull plate. We shall use the software package ANSYS Mechanical 14.5. The program is run on a computer with four Intel Xeon X5260 CPU processors at 3.33 GHz, 32 GB memory installed. In terms of software, the shared memory parallel version of ANSYS refers to running ANSYS across multiple cores on a SMP system. The distributed memory parallel version of ANSYS (Distributed ANSYS) refers to running ANSYS across multiple processors on SMP systems or DMP systems.

  20. CUDA-based real time surgery simulation.

    PubMed

    Liu, Youquan; De, Suvranu

    2008-01-01

    In this paper we present a general software platform that enables real time surgery simulation on the newly available compute unified device architecture (CUDA) from NVIDIA. CUDA-enabled GPUs harness the power of 128 processors which allow data parallel computations. Compared to the previous GPGPU, it is significantly more flexible with a C language interface. We report implementation of both collision detection and consequent deformation computation algorithms. Our test results indicate that CUDA enables a twenty times speedup for collision detection and about fifteen times speedup for deformation computation on an Intel Core 2 Quad 2.66 GHz machine with GeForce 8800 GTX.

  1. Tactical Operations Analysis Support Facility.

    DTIC Science & Technology

    1981-05-01

    Punch/Reader 2 DMC-11AR DDCMP Micro Processor 2 DMC-11DA Network Link Line Unit 2 DL-11E Async Serial Line Interface 4 Intel IN-1670 448K Words MOS Memory...86 5.3 VIRTUAL PROCESSORS - VAX-11/750 ........................... 89 5.4 A RELATIONAL DATA MANAGEMENT SYSTEM - ORACLE...Central Processing Unit (CPU) is a 16 bit processor for high-speed, real time applications, and for large multi-user, multi- task, time shared

  2. Cost/Performance Ratio Achieved by Using a Commodity-Based Cluster

    NASA Technical Reports Server (NTRS)

    Lopez, Isaac

    2001-01-01

    Researchers at the NASA Glenn Research Center acquired a commodity cluster based on Intel Corporation processors to compare its performance with a traditional UNIX cluster in the execution of aeropropulsion applications. Since the cost differential of the clusters was significant, a cost/performance ratio was calculated. After executing a propulsion application on both clusters, the researchers demonstrated a 9.4 cost/performance ratio in favor of the Intel-based cluster. These researchers utilize the Aeroshark cluster as one of the primary testbeds for developing NPSS parallel application codes and system software. The Aeroshark cluster provides 64 Intel Pentium II 400-MHz processors, housed in 32 nodes. Recently, APNASA, a code developed by a Government/industry team for the design and analysis of turbomachinery systems, was used for a simulation on Glenn's Aeroshark cluster.

  3. Does the Intel Xeon Phi processor fit HEP workloads?

    NASA Astrophysics Data System (ADS)

    Nowak, A.; Bitzes, G.; Dotti, A.; Lazzaro, A.; Jarp, S.; Szostek, P.; Valsan, L.; Botezatu, M.; Leduc, J.

    2014-06-01

    This paper summarizes the five years of CERN openlab's efforts focused on the Intel Xeon Phi co-processor, from the time of its inception to public release. We consider the architecture of the device vis a vis the characteristics of HEP software and identify key opportunities for HEP processing, as well as scaling limitations. We report on improvements and speedups linked to parallelization and vectorization on benchmarks involving software frameworks such as Geant4 and ROOT. Finally, we extrapolate current software and hardware trends and project them onto accelerators of the future, with the specifics of offline and online HEP processing in mind.

  4. Interactive high-resolution isosurface ray casting on multicore processors.

    PubMed

    Wang, Qin; JaJa, Joseph

    2008-01-01

    We present a new method for the interactive rendering of isosurfaces using ray casting on multi-core processors. This method consists of a combination of an object-order traversal that coarsely identifies possible candidate 3D data blocks for each small set of contiguous pixels, and an isosurface ray casting strategy tailored for the resulting limited-size lists of candidate 3D data blocks. While static screen partitioning is widely used in the literature, our scheme performs dynamic allocation of groups of ray casting tasks to ensure almost equal loads among the different threads running on multi-cores while maintaining spatial locality. We also make careful use of the memory management environment commonly present in multi-core processors. We test our system on a two-processor Clovertown platform, each consisting of a Quad-Core 1.86-GHz Intel Xeon Processor, for a number of widely different benchmarks. The detailed experimental results show that our system is efficient and scalable, and achieves high cache performance and excellent load balancing, resulting in an overall performance that is superior to any of the previous algorithms. In fact, we achieve an interactive isosurface rendering on a 1024^2 screen for all the datasets tested up to the maximum size of the main memory of our platform.

  5. Benchmarking GNU Radio Kernels and Multi-Processor Scheduling

    DTIC Science & Technology

    2013-01-14

    AMD E350 APU , comparable to Atom • ARM Cortex A8 running on a Gumstix Overo on an Ettus USRP E110 The general testing procedure consists of • Build...Intel Atom, and the AMD E350 APU . 3.2 Multi-Processor Scheduling Figure 1: GFLOPs per second through an FFT array on an Intel i7. Example output from

  6. Optimizing the updated Goddard shortwave radiation Weather Research and Forecasting (WRF) scheme for Intel Many Integrated Core (MIC) architecture

    NASA Astrophysics Data System (ADS)

    Mielikainen, Jarno; Huang, Bormin; Huang, Allen H.-L.

    2015-05-01

    Intel Many Integrated Core (MIC) ushers in a new era of supercomputing speed, performance, and compatibility. It allows developers to run code at trillions of calculations per second using a familiar programming model. In this paper, we present our results of optimizing the updated Goddard shortwave radiation Weather Research and Forecasting (WRF) scheme on Intel Many Integrated Core Architecture (MIC) hardware. The Intel Xeon Phi coprocessor is the first product based on Intel MIC architecture, and it consists of up to 61 cores connected by a high performance on-die bidirectional interconnect. The coprocessor supports all important Intel development tools. Thus, the development environment is a familiar one to a vast number of CPU developers. However, getting maximum performance out of Xeon Phi requires some novel optimization techniques. Those optimization techniques are discussed in this paper. The results show that the optimizations improved performance of the original code on Xeon Phi 7120P by a factor of 1.3x.

  7. The Acceleration of Structural Microarchitectural Simulation via Scheduling

    DTIC Science & Technology

    2006-11-01

    193 viii List of Tables 1.1 Size of Intel® Processors...Table 1.1 shows the total and estimated non-cache transistor counts in succeeding generations of Intel® microprocessors. (Cache array transistors are...Intel486™ 1989 1,200,000 800,000 Intel® Pentium® 1993 3,100,000 2,300,000 Intel® Pentium® II 1997 7,500,000 5,500,000 Intel® Pentium® III 1999

  8. A scalable PC-based parallel computer for lattice QCD

    NASA Astrophysics Data System (ADS)

    Fodor, Z.; Katz, S. D.; Pappa, G.

    2003-05-01

    A PC-based parallel computer for medium/large scale lattice QCD simulations is suggested. The Eötvös Univ., Inst. Theor. Phys. cluster consists of 137 Intel P4-1.7GHz nodes. Gigabit Ethernet cards are used for nearest neighbor communication in a two-dimensional mesh. The sustained performance for dynamical staggered (Wilson) quarks on large lattices is around 70 (110) GFlops. The exceptional price/performance ratio is below $1/Mflop.
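
    The cluster figures in this abstract lend themselves to a quick back-of-the-envelope check. The sketch below derives the sustained performance per node and the cost ceiling implied by the $1/Mflop bound. The abstract does not say which of the two sustained figures the price/performance ratio refers to, so both are shown; the names used are illustrative.

```python
NODES = 137                                            # Intel P4-1.7GHz nodes
SUSTAINED_GFLOPS = {"staggered": 70.0, "Wilson": 110.0}
PRICE_PER_MFLOP_BOUND_USD = 1.0                        # "below $1/Mflop"

def per_node_gflops(total_gflops, nodes=NODES):
    """Average sustained performance per node."""
    return total_gflops / nodes

def cost_upper_bound_usd(total_gflops, price_per_mflop=PRICE_PER_MFLOP_BOUND_USD):
    """Cost ceiling implied by a price/performance bound (1 GFlops = 1000 Mflops)."""
    return total_gflops * 1000.0 * price_per_mflop

if __name__ == "__main__":
    for action, gflops in SUSTAINED_GFLOPS.items():
        print(
            f"{action}: {per_node_gflops(gflops):.2f} GFlops per node, "
            f"cost below ${cost_upper_bound_usd(gflops):,.0f}"
        )
```

    This gives roughly 0.5 GFlops per node for staggered quarks and 0.8 GFlops per node for Wilson quarks, with a total cost ceiling of $70,000 or $110,000 depending on which sustained figure the bound is measured against.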

  9. Performance Evaluation of NWChem Ab-Initio Molecular Dynamics (AIMD) Simulations on the Intel® Xeon Phi™ Processor

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Bylaska, Eric J.; Jacquelin, Mathias; De Jong, Wibe A.

    2017-10-20

    Ab-initio Molecular Dynamics (AIMD) methods are an important class of algorithms, as they enable scientists to understand the chemistry and dynamics of molecular and condensed phase systems while retaining a first-principles-based description of their interactions. Many-core architectures such as the Intel® Xeon Phi™ processor are an interesting and promising target for these algorithms, as they can provide the computational power that is needed to solve interesting problems in chemistry. In this paper, we describe the efforts of refactoring the existing AIMD plane-wave method of NWChem from an MPI-only implementation to a scalable, hybrid code that employs MPI and OpenMP to exploit the capabilities of current and future many-core architectures. We describe the optimizations required to get close to optimal performance for the multiplication of the tall-and-skinny matrices that form the core of the computational algorithm. We present strong scaling results on the complete AIMD simulation for a test case that simulates 256 water molecules and that strong-scales well on a cluster of 1024 nodes of Intel Xeon Phi processors. We compare the performance obtained with a cluster of dual-socket Intel® Xeon® E5-2698v3 processors.

  10. Real-time autocorrelator for fluorescence correlation spectroscopy based on graphical-processor-unit architecture: method, implementation, and comparative studies

    NASA Astrophysics Data System (ADS)

    Laracuente, Nicholas; Grossman, Carl

    2013-03-01

    We developed an algorithm and software to calculate autocorrelation functions from real-time photon-counting data using the fast, parallel capabilities of graphical processor units (GPUs). Recent developments in hardware and software have allowed for general purpose computing with inexpensive GPU hardware. These devices are more suited for emulating hardware autocorrelators than traditional CPU-based software applications by emphasizing parallel throughput over sequential speed. Incoming data are binned in a standard multi-tau scheme with configurable points-per-bin size and are mapped into a GPU memory pattern to reduce time-expensive memory access. Applications include dynamic light scattering (DLS) and fluorescence correlation spectroscopy (FCS) experiments. We ran the software on a 64-core graphics PCI card in a 3.2 GHz Intel i5 CPU-based computer running Linux. FCS measurements were made on Alexa-546 and Texas Red dyes in a standard buffer (PBS). Software correlations were compared to hardware correlator measurements on the same signals. Supported by HHMI and Swarthmore College.
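
    The multi-tau binning mentioned in this abstract can be illustrated with a small offline sketch. This is not the authors' GPU implementation: the function name, the default parameters, and the choice to process a finished list of counts (rather than a real-time stream) are assumptions made for illustration. Each level evaluates a fixed number of lags and then halves the time resolution by summing adjacent bins, which gives a quasi-logarithmic lag grid.

```python
def multi_tau_autocorr(counts, points_per_level=8, levels=3):
    """Normalized autocorrelation on a multi-tau (quasi-logarithmic) lag grid.

    Illustrative offline sketch; names and defaults are hypothetical.
    Returns (lags, g2), with lags in units of the original bin width.
    """
    x = [float(c) for c in counts]
    lags, g2 = [], []
    bin_width = 1
    for level in range(levels):
        # Level 0 covers lags 1..m. Later levels cover only the upper half,
        # because the lower half was already measured at finer resolution.
        first = 1 if level == 0 else points_per_level // 2 + 1
        for k in range(first, points_per_level + 1):
            n = len(x) - k
            if n <= 0:
                break
            mean_a = sum(x[:n]) / n
            mean_b = sum(x[k:]) / n
            if mean_a == 0.0 or mean_b == 0.0:
                continue  # normalization undefined for an empty signal
            corr = sum(x[i] * x[i + k] for i in range(n)) / n
            lags.append(k * bin_width)
            g2.append(corr / (mean_a * mean_b))
        # Halve the time resolution: sum adjacent pairs of bins.
        x = [x[i] + x[i + 1] for i in range(0, len(x) - 1, 2)]
        bin_width *= 2
    return lags, g2
```

    With the defaults, the lag grid is 1 to 8 in steps of 1, then 10 to 16 in steps of 2, then 20 to 32 in steps of 4, so the number of correlation channels grows only logarithmically with the longest lag.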

  11. A new parallel-vector finite element analysis software on distributed-memory computers

    NASA Technical Reports Server (NTRS)

    Qin, Jiangning; Nguyen, Duc T.

    1993-01-01

    A new parallel-vector finite element analysis software package MPFEA (Massively Parallel-vector Finite Element Analysis) is developed for large-scale structural analysis on massively parallel computers with distributed-memory. MPFEA is designed for parallel generation and assembly of the global finite element stiffness matrices as well as parallel solution of the simultaneous linear equations, since these are often the major time-consuming parts of a finite element analysis. Block-skyline storage scheme along with vector-unrolling techniques are used to enhance the vector performance. Communications among processors are carried out concurrently with arithmetic operations to reduce the total execution time. Numerical results on the Intel iPSC/860 computers (such as the Intel Gamma with 128 processors and the Intel Touchstone Delta with 512 processors) are presented, including an aircraft structure and some very large truss structures, to demonstrate the efficiency and accuracy of MPFEA.

  12. Case for a field-programmable gate array multicore hybrid machine for an image-processing application

    NASA Astrophysics Data System (ADS)

    Rakvic, Ryan N.; Ives, Robert W.; Lira, Javier; Molina, Carlos

    2011-01-01

    General purpose computer designers have recently begun adding cores to their processors in order to increase performance. For example, Intel has adopted a homogeneous quad-core processor as a base for general purpose computing. PlayStation3 (PS3) game consoles contain a multicore heterogeneous processor known as the Cell, which is designed to perform complex image processing algorithms at a high level. Can modern image-processing algorithms utilize these additional cores? On the other hand, modern advancements in configurable hardware, most notably field-programmable gate arrays (FPGAs) have created an interesting question for general purpose computer designers. Is there a reason to combine FPGAs with multicore processors to create an FPGA multicore hybrid general purpose computer? Iris matching, a repeatedly executed portion of a modern iris-recognition algorithm, is parallelized on an Intel-based homogeneous multicore Xeon system, a heterogeneous multicore Cell system, and an FPGA multicore hybrid system. Surprisingly, the cheaper PS3 slightly outperforms the Intel-based multicore on a core-for-core basis. However, both multicore systems are beaten by the FPGA multicore hybrid system by >50%.

  13. DBPQL: A view-oriented query language for the Intel Data Base Processor

    NASA Technical Reports Server (NTRS)

    Fishwick, P. A.

    1983-01-01

    An interactive query language (DBPQL) for the Intel Data Base Processor (DBP) is defined. DBPQL includes a parser generator package which permits the analyst to easily create and manipulate the query statement syntax and semantics. The prototype language, DBPQL, includes trace and performance commands to aid the analyst when implementing new commands and analyzing the execution characteristics of the DBP. The DBPQL grammar file and associated key procedures are included as an appendix to this report.

  14. Evaluation of the Intel Xeon Phi Co-processor to accelerate the sensitivity map calculation for PET imaging

    NASA Astrophysics Data System (ADS)

    Dey, T.; Rodrigue, P.

    2015-07-01

    We aim to evaluate the Intel Xeon Phi coprocessor for acceleration of 3D Positron Emission Tomography (PET) image reconstruction. We focus on the sensitivity map calculation as one computationally intensive part of PET image reconstruction, since it is a promising candidate for acceleration with the Many Integrated Core (MIC) architecture of the Xeon Phi. The computation of the voxels in the field of view (FoV) can be done in parallel and the 10^3 to 10^4 samples needed to calculate the detection probability of each voxel can take advantage of vectorization. We use the ray tracing kernels of the Embree project to calculate the hit points of the sample rays with the detector and in a second step the sum of the radiological path taking into account attenuation is determined. The core components are implemented using the Intel single instruction multiple data compiler (ISPC) to enable a portable implementation showing efficient vectorization on both the Xeon Phi and the host platform. On the Xeon Phi, the calculation of the radiological path is also implemented in hardware specific intrinsic instructions (so-called 'intrinsics') to allow manually-optimized vectorization. For parallelization, both OpenMP and ISPC tasking (based on pthreads) are evaluated. Our implementation achieved a scalability factor of 0.90 on the Xeon Phi coprocessor (model 5110P) with 60 cores at 1 GHz. Only minor differences were found between parallelization with OpenMP and the ISPC tasking feature. The implementation using intrinsics was found to be about 12% faster than the portable ISPC version. With this version, a speedup of 1.43 was achieved on the Xeon Phi coprocessor compared to the host system (HP SL250s Gen8) equipped with two Xeon (E5-2670) CPUs, with 8 cores at 2.6 to 3.3 GHz each. Using a second Xeon Phi card the speedup could be further increased to 2.77. No significant differences were found between the results of the different Xeon Phi and the host implementations. The examination showed that a reasonable speedup of sensitivity map calculation could be achieved on the Xeon Phi either by a portable or a hardware specific implementation.
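
    The two speedups quoted in this abstract (1.43 with one card, 2.77 with two) can be combined into a rough multi-card scaling figure. The metric below, the ratio of the measured two-card speedup to the ideal of twice the one-card speedup, is an assumption made for illustration. It is not the "scalability factor of 0.90" reported by the authors, whose definition is not given in the abstract.

```python
def scaling_efficiency(speedup_one_card, speedup_n_cards, n_cards):
    """Measured n-card speedup divided by the ideal n * one-card speedup.

    Hypothetical helper; this metric is not defined in the abstract.
    """
    return speedup_n_cards / (n_cards * speedup_one_card)


# Speedups relative to the dual-Xeon host, as quoted in the abstract.
speedup_one_card = 1.43
speedup_two_cards = 2.77

relative_gain = speedup_two_cards / speedup_one_card
efficiency = scaling_efficiency(speedup_one_card, speedup_two_cards, n_cards=2)
```

    Under this assumed metric, the second card adds a factor of about 1.94, which corresponds to roughly 97% of ideal two-card scaling.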

  15. SPP: A data base processor data communications protocol

    NASA Technical Reports Server (NTRS)

    Fishwick, P. A.

    1983-01-01

    The design and implementation of a data communications protocol for the Intel Data Base Processor (DBP) is defined. The protocol is termed SPP (Service Port Protocol) since it enables data transfer between the host computer and the DBP service port. The protocol implementation is extensible in that it is explicitly layered and the protocol functionality is hierarchically organized. Extensive trace and performance capabilities have been supplied with the protocol software to permit optional efficient monitoring of the data transfer between the host and the Intel data base processor. Machine independence was considered to be an important attribute during the design and implementation of SPP. The protocol source is fully commented and is included in Appendix A of this report.

  16. MILC Code Performance on High End CPU and GPU Supercomputer Clusters

    NASA Astrophysics Data System (ADS)

    DeTar, Carleton; Gottlieb, Steven; Li, Ruizi; Toussaint, Doug

    2018-03-01

    With recent developments in parallel supercomputing architecture, many core, multi-core, and GPU processors are now commonplace, resulting in more levels of parallelism, memory hierarchy, and programming complexity. It has been necessary to adapt the MILC code to these new processors starting with NVIDIA GPUs, and more recently, the Intel Xeon Phi processors. We report on our efforts to port and optimize our code for the Intel Knights Landing architecture. We consider performance of the MILC code with MPI and OpenMP, and optimizations with QOPQDP and QPhiX. For the latter approach, we concentrate on the staggered conjugate gradient and gauge force. We also consider performance on recent NVIDIA GPUs using the QUDA library.

  17. JPRS Report, Science & Technology, Europe.

    DTIC Science & Technology

    1991-04-30

    processor in collaboration with Intel. The processor, christened Touchstone, will be used as the core of a parallel computer with 2,000 processors. One of...ELECTRONIQUE HEBDO in French 24 Jan 91 pp 14-15 [Article by Claire Remy: "Everything Set for Neural Signal Processors" first paragraph is ELECTRONIQUE...paving the way for neural signal processors in so doing. The principal advantage of this specific circuit over a neuromimetic software program is

  18. Exact diagonalization of quantum lattice models on coprocessors

    NASA Astrophysics Data System (ADS)

    Siro, T.; Harju, A.

    2016-10-01

    We implement the Lanczos algorithm on an Intel Xeon Phi coprocessor and compare its performance to a multi-core Intel Xeon CPU and an NVIDIA graphics processor. The Xeon and the Xeon Phi are parallelized with OpenMP and the graphics processor is programmed with CUDA. The performance is evaluated by measuring the execution time of a single step in the Lanczos algorithm. We study two quantum lattice models with different particle numbers, and conclude that for small systems, the multi-core CPU is the fastest platform, while for large systems, the graphics processor is the clear winner, reaching speedups of up to 7.6 compared to the CPU. The Xeon Phi outperforms the CPU with sufficiently large particle number, reaching a speedup of 2.5.
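
    Since the benchmark in this abstract times a single Lanczos step, a plain sketch of one such step may help. It uses a small dense symmetric matrix stored as nested lists, which is an assumption made for readability: the paper's Hamiltonians and their OpenMP and CUDA kernels are not reproduced here, and the function names are hypothetical. In a typical implementation the matrix-vector product is the most expensive part of the step.

```python
import math


def matvec(h, v):
    """Dense matrix-vector product (stand-in for the Hamiltonian kernel)."""
    return [sum(h_ij * v_j for h_ij, v_j in zip(row, v)) for row in h]


def lanczos_step(h, v_prev, v_curr, beta_curr):
    """One Lanczos step for a symmetric matrix h.

    Returns (alpha, beta_next, v_next), where alpha and beta_next are the new
    diagonal and off-diagonal entries of the tridiagonal matrix.
    """
    w = matvec(h, v_curr)
    w = [w_i - beta_curr * p_i for w_i, p_i in zip(w, v_prev)]
    alpha = sum(w_i * c_i for w_i, c_i in zip(w, v_curr))
    w = [w_i - alpha * c_i for w_i, c_i in zip(w, v_curr)]
    beta_next = math.sqrt(sum(w_i * w_i for w_i in w))
    if beta_next == 0.0:
        # The Krylov space is exhausted; there is no further basis vector.
        return alpha, 0.0, [0.0] * len(w)
    return alpha, beta_next, [w_i / beta_next for w_i in w]
```

    Starting from a normalized vector with `v_prev` set to zeros and `beta_curr` set to 0, repeated calls build the tridiagonal matrix whose extreme eigenvalues approximate those of `h`.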

  19. Astrophysical N-body Simulations Using Hierarchical Tree Data Structures

    NASA Astrophysics Data System (ADS)

    Warren, M. S.; Salmon, J. K.

    The authors report on recent large astrophysical N-body simulations executed on the Intel Touchstone Delta system. They review the astrophysical motivation and the numerical techniques and discuss steps taken to parallelize these simulations. The methods scale as O(N log N), for large values of N, and also scale linearly with the number of processors. The performance sustained for a duration of 67 h, was between 5.1 and 5.4 Gflop/s on a 512-processor system.

  20. Discrete sensitivity derivatives of the Navier-Stokes equations with a parallel Krylov solver

    NASA Technical Reports Server (NTRS)

    Ajmani, Kumud; Taylor, Arthur C., III

    1994-01-01

    This paper solves an 'incremental' form of the sensitivity equations derived by differentiating the discretized thin-layer Navier Stokes equations with respect to certain design variables of interest. The equations are solved with a parallel, preconditioned Generalized Minimal RESidual (GMRES) solver on a distributed-memory architecture. The 'serial' sensitivity analysis code is parallelized by using the Single Program Multiple Data (SPMD) programming model, domain decomposition techniques, and message-passing tools. Sensitivity derivatives are computed for low and high Reynolds number flows over a NACA 1406 airfoil on a 32-processor Intel Hypercube, and found to be identical to those computed on a single-processor Cray Y-MP. It is estimated that the parallel sensitivity analysis code has to be run on 40-50 processors of the Intel Hypercube in order to match the single-processor processing time of a Cray Y-MP.

  1. Digital Hardware Architecture Implementation

    DTIC Science & Technology

    1993-02-15

    of micro - MOTOROLA 63.7 50MHZ 64 BIT 2092 N/A processors during quarterly re- INTEL 42 50MHz 64 BIT 1092 N/A views and monthly reports. The 186o XP...27 3.2.1 Signal Processor (SP) Analysis...31 3.2.1.11 MasPar Software Statements ........................................................ 32 3.2.2 Data Processor

  2. Performance Evaluation of Supercomputers using HPCC and IMB Benchmarks

    NASA Technical Reports Server (NTRS)

    Saini, Subhash; Ciotti, Robert; Gunney, Brian T. N.; Spelce, Thomas E.; Koniges, Alice; Dossa, Don; Adamidis, Panagiotis; Rabenseifner, Rolf; Tiyyagura, Sunil R.; Mueller, Matthias

    2006-01-01

    The HPC Challenge (HPCC) benchmark suite and the Intel MPI Benchmark (IMB) are used to compare and evaluate the combined performance of processor, memory subsystem and interconnect fabric of five leading supercomputers - SGI Altix BX2, Cray X1, Cray Opteron Cluster, Dell Xeon cluster, and NEC SX-8. These five systems use five different networks (SGI NUMALINK4, Cray network, Myrinet, InfiniBand, and NEC IXS). The complete set of HPCC benchmarks is run on each of these systems. Additionally, we present Intel MPI Benchmarks (IMB) results to study the performance of 11 MPI communication functions on these systems.

  3. Performance Evaluation of an Intel Haswell- and Ivy Bridge-Based Supercomputer Using Scientific and Engineering Applications

    NASA Technical Reports Server (NTRS)

    Saini, Subhash; Hood, Robert T.; Chang, Johnny; Baron, John

    2016-01-01

    We present a performance evaluation conducted on a production supercomputer of the Intel Xeon Processor E5-2680v3, a twelve-core implementation of the fourth-generation Haswell architecture, and compare it with Intel Xeon Processor E5-2680v2, an Ivy Bridge implementation of the third-generation Sandy Bridge architecture. Several new architectural features have been incorporated in Haswell including improvements in all levels of the memory hierarchy as well as improvements to vector instructions and power management. We critically evaluate these new features of Haswell and compare with Ivy Bridge using several low-level benchmarks including a subset of HPCC, HPCG and four full-scale scientific and engineering applications. We also present a model to predict the performance of HPCG and Cart3D within 5%, and Overflow within 10% accuracy.

  4. Hot Chips and Hot Interconnects for High End Computing Systems

    NASA Technical Reports Server (NTRS)

    Saini, Subhash

    2005-01-01

    I will discuss several processors: 1. The Cray proprietary processor used in the Cray X1; 2. The IBM Power 3 and Power 4 used in an IBM SP 3 and IBM SP 4 systems; 3. The Intel Itanium and Xeon, used in the SGI Altix systems and clusters respectively; 4. IBM System-on-a-Chip used in IBM BlueGene/L; 5. HP Alpha EV68 processor used in DOE ASCI Q cluster; 6. SPARC64 V processor, which is used in the Fujitsu PRIMEPOWER HPC2500; 7. An NEC proprietary processor, which is used in NEC SX-6/7; 8. Power 4+ processor, which is used in Hitachi SR11000; 9. NEC proprietary processor, which is used in Earth Simulator. The IBM POWER5 and Red Storm Computing Systems will also be discussed. The architectures of these processors will first be presented, followed by interconnection networks and a description of high-end computer systems based on these processors and networks. The performance of various hardware/programming model combinations will then be compared, based on latest NAS Parallel Benchmark results (MPI, OpenMP/HPF and hybrid (MPI + OpenMP)). The tutorial will conclude with a discussion of general trends in the field of high performance computing (quantum computing, DNA computing, cellular engineering, and neural networks).

  5. Parallel community climate model: Description and user's guide

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Drake, J.B.; Flanery, R.E.; Semeraro, B.D.

    This report gives an overview of a parallel version of the NCAR Community Climate Model, CCM2, implemented for MIMD massively parallel computers using a message-passing programming paradigm. The parallel implementation was developed on an Intel iPSC/860 with 128 processors and on the Intel Delta with 512 processors, and the initial target platform for the production version of the code is the Intel Paragon with 2048 processors. Because the implementation uses standard, portable message-passing libraries, the code has been easily ported to other multiprocessors supporting a message-passing programming paradigm. The parallelization strategy used is to decompose the problem domain into geographical patches and assign each processor the computation associated with a distinct subset of the patches. With this decomposition, the physics calculations involve only grid points and data local to a processor and are performed in parallel. Using parallel algorithms developed for the semi-Lagrangian transport, the fast Fourier transform and the Legendre transform, both physics and dynamics are computed in parallel with minimal data movement and modest change to the original CCM2 source code. Sequential or parallel history tapes are written and input files (in history tape format) are read sequentially by the parallel code to promote compatibility with production use of the model on other computer systems. A validation exercise has been performed with the parallel code and is detailed along with some performance numbers on the Intel Paragon and the IBM SP2. A discussion of reproducibility of results is included. A user's guide for the PCCM2 version 2.1 on the various parallel machines completes the report. Procedures for compilation, setup and execution are given. A discussion of code internals is included for those who may wish to modify and use the program in their own research.
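
    The patch-based decomposition described in this abstract can be sketched as follows. The grid size, the patch dimensions, the round-robin assignment rule, and both function names are hypothetical choices made for illustration. The abstract does not say how PCCM2 sizes its patches or maps them to processors.

```python
def make_patches(n_lat, n_lon, patch_lat, patch_lon):
    """Split an n_lat x n_lon grid into rectangular patches.

    Each patch is (lat_start, lat_stop, lon_start, lon_stop), stop exclusive.
    """
    patches = []
    for lat0 in range(0, n_lat, patch_lat):
        for lon0 in range(0, n_lon, patch_lon):
            patches.append((lat0, min(lat0 + patch_lat, n_lat),
                            lon0, min(lon0 + patch_lon, n_lon)))
    return patches


def assign_patches(patches, n_procs):
    """Give each processor a distinct subset of patches (round-robin here)."""
    owned = {rank: [] for rank in range(n_procs)}
    for index, patch in enumerate(patches):
        owned[index % n_procs].append(patch)
    return owned
```

    Because every grid point falls in exactly one patch and every patch has exactly one owner, column-local physics can run on each processor without touching data held elsewhere.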

  6. Initial Performance Results on IBM POWER6

    NASA Technical Reports Server (NTRS)

    Saini, Subhash; Talcott, Dale; Jespersen, Dennis; Djomehri, Jahed; Jin, Haoqiang; Mehrotra, Piyush

    2008-01-01

    The POWER5+ processor has a faster memory bus than that of the previous generation POWER5 processor (533 MHz vs. 400 MHz), but the measured per-core memory bandwidth of the latter is better than that of the former (5.7 GB/s vs. 4.3 GB/s). The reason for this is that in the POWER5+, the two cores on the chip share the L2 cache, L3 cache and memory bus. The memory controller is also on the chip and is shared by the two cores. This serializes the path to memory. For consistently good performance on a wide range of applications, the performance of the processor, the memory subsystem, and the interconnects (both latency and bandwidth) should be balanced. Recognizing this, IBM has designed the POWER6 processor so as to avoid the bottlenecks due to the L2 cache, memory controller and buffer chips of the POWER5+. Unlike the POWER5+, each core in the POWER6 has its own L2 cache (4 MB - double that of the POWER5+), memory controller and buffer chips. Each core in the POWER6 runs at 4.7 GHz instead of 1.9 GHz in POWER5+. In this paper, we evaluate the performance of a dual-core POWER6 based IBM p6-570 system, and we compare its performance with that of a dual-core POWER5+ based IBM p575+ system. In this evaluation, we have used the High-Performance Computing Challenge (HPCC) benchmarks, NAS Parallel Benchmarks (NPB), and four real-world applications--three from computational fluid dynamics and one from climate modeling.

  7. High Performance Computing and Visualization Infrastructure for Simultaneous Parallel Computing and Parallel Visualization Research

    DTIC Science & Technology

    2016-11-09

    Total Number: Sub Contractors (DD882) Names of Personnel receiving masters degrees Names of personnel receiving PHDs Names of other research staff...Broadcom 5720 QP 1Gb Network Daughter Card (2) Intel Xeon E5-2680 v3 2.5GHz, 30M Cache, 9.60GT/s QPI, Turbo, HT , 12C/24T (120W...Broadcom 5720 QP 1Gb Network Daughter Card (2) Intel Xeon E5-2680 v3 2.5GHz, 30M Cache, 9.60GT/s QPI, Turbo, HT , 12C/24T (120W

  8. Testing the Tester: Lessons Learned During the Testing of a State-of-the-Art Commercial 14nm Processor Under Proton Irradiation

    NASA Technical Reports Server (NTRS)

    Szabo, Carl M., Jr.; Duncan, Adam R.; Label, Kenneth A.

    2017-01-01

    Testing of an Intel 14nm desktop processor was conducted under proton irradiation. We share lessons learned, demonstrating that complex devices beget further complex challenges requiring practical and theoretical investigative expertise to solve.

  9. Why K-12 IT Managers and Administrators Are Embracing the Intel-Based Mac

    ERIC Educational Resources Information Center

    Technology & Learning, 2007

    2007-01-01

    Over the past year, Apple has dramatically increased its share of the school computer marketplace--especially in the category of notebook computers. A recent study conducted by Grunwald Associates and Rockman et al. reports that one of the major reasons for this growth is Apple's introduction of the Intel processor to the entire line of Mac…

  10. The parallel algorithm for the 2D discrete wavelet transform

    NASA Astrophysics Data System (ADS)

    Barina, David; Najman, Pavel; Kleparnik, Petr; Kula, Michal; Zemcik, Pavel

    2018-04-01

    The discrete wavelet transform can be found at the heart of many image-processing algorithms. Until now, the transform on general-purpose processors (CPUs) was mostly computed using a separable lifting scheme. As the lifting scheme consists of a small number of operations, it is preferred for processing using single-core CPUs. However, considering parallel processing using multi-core processors, this scheme is inappropriate due to a large number of steps. On such architectures, the number of steps corresponds to the number of points that represent the exchange of data. Consequently, these points often form a performance bottleneck. Our approach appropriately rearranges calculations inside the transform, and thereby reduces the number of steps. In other words, we propose a new scheme that is friendly to parallel environments. When evaluating on multi-core CPUs, we consistently outperform the original lifting scheme. The evaluation was performed on 61-core Intel Xeon Phi and 8-core Intel Xeon processors.
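
    To make the notion of lifting "steps" concrete, here is a minimal sketch of one level of a 1D lifting transform. The abstract does not name a wavelet, so the reversible CDF 5/3 (LeGall) filter with mirrored boundaries is assumed here, and the function names are hypothetical. This shows the baseline scheme the abstract argues against, not the authors' rearranged scheme. The update step needs the full output of the predict step, which is the kind of dependency that forces a data exchange between steps.

```python
def lift_53_forward(x):
    """One level of the integer CDF 5/3 lifting transform (even length only).

    Returns (s, d): low-pass and high-pass coefficients.
    """
    n = len(x)
    assert n >= 2 and n % 2 == 0, "sketch handles even-length signals only"
    half = n // 2
    even, odd = x[0::2], x[1::2]
    # Step 1 (predict): each odd sample minus the mean of its even neighbors.
    d = []
    for i in range(half):
        right = even[i + 1] if i + 1 < half else even[i]  # mirrored boundary
        d.append(odd[i] - (even[i] + right) // 2)
    # Step 2 (update): needs all of d, so it cannot start before step 1 ends.
    s = []
    for i in range(half):
        left = d[i - 1] if i > 0 else d[0]  # mirrored boundary
        s.append(even[i] + (left + d[i] + 2) // 4)
    return s, d


def lift_53_inverse(s, d):
    """Undo lift_53_forward by running the steps in reverse order."""
    half = len(s)
    even = []
    for i in range(half):
        left = d[i - 1] if i > 0 else d[0]
        even.append(s[i] - (left + d[i] + 2) // 4)
    x = []
    for i in range(half):
        right = even[i + 1] if i + 1 < half else even[i]
        x.append(even[i])
        x.append(d[i] + (even[i] + right) // 2)
    return x
```

    For a 2D image the separable variant applies these two steps along rows and then along columns, so one decomposition level already involves four dependent steps.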

  11. Accelerating Climate Simulations Through Hybrid Computing

    NASA Technical Reports Server (NTRS)

    Zhou, Shujia; Sinno, Scott; Cruz, Carlos; Purcell, Mark

    2009-01-01

    Unconventional multi-core processors (e.g., IBM Cell B/E and NVIDIA GPU) have emerged as accelerators in climate simulation. However, climate models typically run on parallel computers with conventional processors (e.g., Intel and AMD) using MPI. Connecting accelerators to this architecture efficiently and easily becomes a critical issue. When using MPI for connection, we identified two challenges: (1) identical MPI implementation is required in both systems, and (2) existing MPI code must be modified to accommodate the accelerators. In response, we have extended and deployed IBM Dynamic Application Virtualization (DAV) in a hybrid computing prototype system (one blade with two Intel quad-core processors, two IBM QS22 Cell blades, connected with Infiniband), allowing for seamlessly offloading compute-intensive functions to remote, heterogeneous accelerators in a scalable, load-balanced manner. Currently, a climate solar radiation model running with multiple MPI processes has been offloaded to multiple Cell blades with approx. 10% network overhead.

  12. Thermo-mechanical properties of carbon nanotubes and applications in thermal management

    NASA Astrophysics Data System (ADS)

    Nguyen, Manh Hong; Thang Bui, Hung; Trinh Pham, Van; Phan, Ngoc Hong; Nguyen, Tuan Hong; Chuc Nguyen, Van; Quang Le, Dinh; Khoi Phan, Hong; Phan, Ngoc Minh

    2016-06-01

    Thanks to their very high thermal conductivity, high Young’s modulus and unique tensile strength, carbon nanotubes (CNTs) have become one of the most suitable nano additives for heat conductive materials. In this work, we present results obtained for the synthesis of heat conductive materials containing CNT based thermal greases, nanoliquids and lubricating oils. These synthesized heat conductive materials were applied to thermal management for high power electronic devices (CPUs, LEDs) and internal combustion engines. The simulation and experimental results on thermal greases for an Intel Pentium IV processor showed that the thermal conductivity of greases increased 1.4 times and the saturation temperature of the CPU decreased by 5 °C by using thermal grease containing 2 wt% CNTs. Nanoliquids containing CNT based distilled water/ethylene glycol were successfully applied in heat dissipation for an Intel Core i5 processor and a 450 W floodlight LED. The experimental results showed that the saturation temperature of the Intel Core i5 processor and the 450 W floodlight LED decreased by about 6 °C and 3.5 °C, respectively, when using nanoliquids containing 1 g l⁻¹ of CNTs. The CNTs were also effectively utilized as additive materials for the synthesis of lubricating oils to improve the thermal conductivity, heat dissipation efficiency and performance efficiency of engines. The experimental results show that the thermal conductivity of lubricating oils increased by 12.5%, the engine saved 15% fuel consumption, and the longevity of the lubricating oil increased up to 20 000 km by using 0.1% vol. CNTs in the lubricating oils. All above results have confirmed the tremendous application potential of heat conductive materials containing CNTs in thermal management for high power electronic devices, internal combustion engines and other high power apparatus.

  13. A data base processor semantics specification package

    NASA Technical Reports Server (NTRS)

    Fishwick, P. A.

    1983-01-01

    A Semantics Specification Package (DBPSSP) for the Intel Data Base Processor (DBP) is defined. DBPSSP serves as a collection of cross assembly tools that allow the analyst to assemble request blocks on the host computer for passage to the DBP. The assembly tools discussed in this report may be effectively used in conjunction with a DBP compatible data communications protocol to form a query processor, precompiler, or file management system for the database processor. The source modules representing the components of DBPSSP are fully commented and included.

  14. Scalability of a Low-Cost Multi-Teraflop Linux Cluster for High-End Classical Atomistic and Quantum Mechanical Simulations

    NASA Technical Reports Server (NTRS)

    Kikuchi, Hideaki; Kalia, Rajiv K.; Nakano, Aiichiro; Vashishta, Priya; Shimojo, Fuyuki; Saini, Subhash

    2003-01-01

    Scalability of a low-cost, Intel Xeon-based, multi-Teraflop Linux cluster is tested for two high-end scientific applications: Classical atomistic simulation based on the molecular dynamics method and quantum mechanical calculation based on the density functional theory. These scalable parallel applications use space-time multiresolution algorithms and feature computational-space decomposition, wavelet-based adaptive load balancing, and spacefilling-curve-based data compression for scalable I/O. Comparative performance tests are performed on a 1,024-processor Linux cluster and a conventional higher-end parallel supercomputer, 1,184-processor IBM SP4. The results show that the performance of the Linux cluster is comparable to that of the SP4. We also study various effects, such as the sharing of memory and L2 cache among processors, on the performance.

  15. Parallelizing ATLAS Reconstruction and Simulation: Issues and Optimization Solutions for Scaling on Multi- and Many-CPU Platforms

    NASA Astrophysics Data System (ADS)

    Leggett, C.; Binet, S.; Jackson, K.; Levinthal, D.; Tatarkhanov, M.; Yao, Y.

    2011-12-01

    Thermal limitations have forced CPU manufacturers to shift from simply increasing clock speeds to improve processor performance, to producing chip designs with multi- and many-core architectures. Further, the cores themselves can run multiple threads as a zero overhead context switch allowing low level resource sharing (Intel Hyperthreading). To maximize bandwidth and minimize memory latency, memory access has become non uniform (NUMA). As manufacturers add more cores to each chip, a careful understanding of the underlying architecture is required in order to fully utilize the available resources. We present AthenaMP and the Atlas event loop manager, the driver of the simulation and reconstruction engines, which have been rewritten to make use of multiple cores, by means of event based parallelism, and final stage I/O synchronization. However, initial studies on 8 and 16 core Intel architectures have shown marked non-linearities as parallel process counts increase, with as much as 30% reductions in event throughput in some scenarios. Since the Intel Nehalem architecture (both Gainestown and Westmere) will be the most common choice for the next round of hardware procurements, an understanding of these scaling issues is essential. Using hardware based event counters and Intel's Performance Tuning Utility, we have studied the performance bottlenecks at the hardware level, and discovered optimization schemes to maximize processor throughput. We have also produced optimization mechanisms, common to all large experiments, that address the extreme nature of today's HEP code, which due to its size, places huge burdens on the memory infrastructure of today's processors.

  16. Scaling Support Vector Machines On Modern HPC Platforms

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    You, Yang; Fu, Haohuan; Song, Shuaiwen

    2015-02-01

    We designed and implemented MIC-SVM, a highly efficient parallel SVM for x86 based multicore and many-core architectures, such as the Intel Ivy Bridge CPUs and Intel Xeon Phi co-processor (MIC). We propose various novel analysis methods and optimization techniques to fully utilize the multilevel parallelism provided by these architectures and serve as general optimization methods for other machine learning tools.

  17. Communication overhead on the Intel iPSC-860 hypercube

    NASA Technical Reports Server (NTRS)

    Bokhari, Shahid H.

    1990-01-01

    Experiments were conducted on the Intel iPSC-860 hypercube in order to evaluate the overhead of interprocessor communication. It is demonstrated that: (1) contrary to popular belief, the distance between two communicating processors has a significant impact on communication time, (2) edge contention can increase communication time by a factor of more than 7, and (3) node contention has no measurable impact.

  18. Multibus-based parallel processor for simulation

    NASA Technical Reports Server (NTRS)

    Ogrady, E. P.; Wang, C.-H.

    1983-01-01

    A Multibus-based parallel processor simulation system is described. The system is intended to serve as a vehicle for gaining hands-on experience, testing system and application software, and evaluating parallel processor performance during development of a larger system based on the horizontal/vertical-bus interprocessor communication mechanism. The prototype system consists of up to seven Intel iSBC 86/12A single-board computers which serve as processing elements, a multiple transmission controller (MTC) designed to support system operation, and an Intel Model 225 Microcomputer Development System which serves as the user interface and input/output processor. All components are interconnected by a Multibus/IEEE 796 bus. An important characteristic of the system is that it provides a mechanism for a processing element to broadcast data to other selected processing elements. This parallel transfer capability is provided through the design of the MTC and a minor modification to the iSBC 86/12A board. The operation of the MTC, the basic hardware-level operation of the system, and pertinent details about the iSBC 86/12A and the Multibus are described.

  19. Design and Demonstration of a 30 GHz 16-bit Superconductor RSFQ Microprocessor

    DTIC Science & Technology

    2015-03-10

    for Public Release; Distribution Unlimited Final Report: Design and Demonstration of a 30 GHz 16-bit Superconductor RSFQ Microprocessor The views...P.O. Box 12211 Research Triangle Park, NC 27709-2211 Superconductor technology, RSFQ, RQL, processor design, arithmetic units, high-performance...Demonstration of a 30 GHz 16-bit Superconductor RSFQ Microprocessor Report Title The major objective of the project was to design and demonstrate operation

  20. Generic accelerated sequence alignment in SeqAn using vectorization and multi-threading.

    PubMed

    Rahn, René; Budach, Stefan; Costanza, Pascal; Ehrhardt, Marcel; Hancox, Jonny; Reinert, Knut

    2018-05-03

    Pairwise sequence alignment is undoubtedly a central tool in many bioinformatics analyses. In this paper, we present a generically accelerated module for pairwise sequence alignments applicable for a broad range of applications. In our module, we unified the standard dynamic programming kernel used for pairwise sequence alignments and extended it with a generalized inter-sequence vectorization layout, such that many alignments can be computed simultaneously by exploiting SIMD (Single Instruction Multiple Data) instructions of modern processors. We then extended the module by adding two layers of thread-level parallelization, where we a) distribute many independent alignments on multiple threads and b) inherently parallelize a single alignment computation using a work stealing approach producing a dynamic wavefront progressing along the minor diagonal. We evaluated our alignment vectorization and parallelization on different processors, including the newest Intel® Xeon® (Skylake) and Intel® Xeon Phi™ (KNL) processors, and use cases. The instruction set AVX512-BW (Byte and Word), available on Skylake processors, can genuinely improve the performance of vectorized alignments. We could run single alignments 1600 times faster on the Xeon Phi™ and 1400 times faster on the Xeon® than executing them with our previous sequential alignment module. The module is programmed in C++ using the SeqAn (Reinert et al., 2017) library and distributed with version 2.4 under the BSD license. We support SSE4, AVX2, AVX512 instructions and included UME::SIMD, a SIMD-instruction wrapper library, to extend our module for further instruction sets. We thoroughly test all alignment components with all major C++ compilers on various platforms.
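
    The wavefront idea mentioned above rests on a data dependency that a short sketch can make concrete: in the standard dynamic programming recurrence, every cell on one anti-diagonal depends only on cells of the two previous anti-diagonals, so the cells of a diagonal can be computed independently. The code below is plain, sequential Python for a global alignment score with illustrative scoring values. It is not SeqAn's implementation and uses no SIMD or threads, and the function names are hypothetical.

```python
def nw_score_rowwise(a: str, b: str, match=1, mismatch=-1, gap=-1) -> int:
    """Reference global alignment score, filled row by row."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[len(b)]


def nw_score_wavefront(a: str, b: str, match=1, mismatch=-1, gap=-1) -> int:
    """Same score, but filled anti-diagonal by anti-diagonal (i + j == d)."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        H[i][0] = i * gap
    for j in range(1, m + 1):
        H[0][j] = j * gap
    for d in range(2, n + m + 1):
        # All cells in this inner loop are mutually independent; a parallel
        # implementation could hand them to different threads or SIMD lanes.
        for i in range(max(1, d - m), min(n, d - 1) + 1):
            j = d - i
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
    return H[n][m]


if __name__ == "__main__":
    x, y = "GATTACA", "GCATGCU"
    print(nw_score_rowwise(x, y), nw_score_wavefront(x, y))
```

    Both functions fill the same matrix; only the traversal order differs, which is what makes the anti-diagonal order suitable for parallel execution.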

  1. Parallel computation for biological sequence comparison: comparing a portable model to the native model for the Intel Hypercube.

    PubMed

    Nadkarni, P M; Miller, P L

    1991-01-01

    A parallel program for inter-database sequence comparison was developed on the Intel Hypercube using two models of parallel programming. One version was built using machine-specific Hypercube parallel programming commands. The other version was built using Linda, a machine-independent parallel programming language. The two versions of the program provide a case study comparing these two approaches to parallelization in an important biological application area. Benchmark tests with both programs gave comparable results with a small number of processors. As the number of processors was increased, the Linda version was somewhat less efficient. The Linda version was also run without change on Network Linda, a virtual parallel machine running on a network of desktop workstations.

  2. European Scientific Notes. Volume 35, Number 12,

    DTIC Science & Technology

    1981-12-31

    been redesigned to work A. Osorio, which was organized some 3 with the Intel 8085 microprocessor, it years ago and contains about half of the has the...operational set. attempt to derive a set of invariants MOISE is based on the Intel 8085A upon which virtually speaker-invariant microprocessor, and...FACILITY software interface; a Research Signal Processor (RSP) using reduced computational It has been IBM International’s complexity algorithms for

  3. NearFar: A computer program for nearside farside decomposition of heavy-ion elastic scattering amplitude

    NASA Astrophysics Data System (ADS)

    Cha, Moon Hoe

    2007-02-01

    The NearFar program is a package for carrying out an interactive nearside-farside decomposition of heavy-ion elastic scattering amplitude. The program is implemented in Java to perform numerical operations on the nearside and farside angular distributions. It contains a graphical display interface for the numerical results. A test run has been applied to the elastic O16+Si28 scattering at E=1503 MeV. Program summary. Title of program: NearFar. Catalogue identifier: ADYP_v1_0. Program summary URL: http://cpc.cs.qub.ac.uk/summaries/ADYP_v1_0. Program obtainable from: CPC Program Library, Queen's University of Belfast, N. Ireland. Licensing provisions: none. Computers: designed for any machine capable of running Java, developed on PC-Pentium-4. Operating systems under which the program has been tested: Microsoft Windows XP (Home Edition). Program language used: Java. Number of bits in a word: 64. Memory required to execute with typical data: case dependent. No. of lines in distributed program, including test data, etc.: 3484. Number of bytes in distributed program, including test data, etc.: 142 051. Distribution format: tar.gz. Other software required: A Java runtime interpreter, or the Java Development Kit, version 5.0. Nature of physical problem: Interactive nearside-farside decomposition of heavy-ion elastic scattering amplitude. Method of solution: The user must supply an external data file or PPSM parameters, which calculate theoretical values of the quantities to be decomposed. Typical running time: Problem dependent. In a test run, it is about 35 s on a 2.40 GHz Intel P4-processor machine.

  4. Protein structure database search and evolutionary classification.

    PubMed

    Yang, Jinn-Moon; Tung, Chi-Hua

    2006-01-01

    As more protein structures become available and structural genomics efforts provide structural models in a genome-wide strategy, there is a growing need for fast and accurate methods for discovering homologous proteins and evolutionary classifications of newly determined structures. We have developed 3D-BLAST, in part, to address these issues. 3D-BLAST is as fast as BLAST and calculates the statistical significance (E-value) of an alignment to indicate the reliability of the prediction. Using this method, we first identified 23 states of the structural alphabet that represent pattern profiles of the backbone fragments and then used them to represent protein structure databases as structural alphabet sequence databases (SADB). Our method enhanced BLAST as a search method, using a new structural alphabet substitution matrix (SASM) to find the longest common substructures with high-scoring structured segment pairs from an SADB database. Using personal computers with Intel Pentium4 (2.8 GHz) processors, our method searched more than 10 000 protein structures in 1.3 s and achieved a good agreement with search results from detailed structure alignment methods. [3D-BLAST is available at http://3d-blast.life.nctu.edu.tw].

  5. Nonlinear Wave Simulation on the Xeon Phi Knights Landing Processor

    NASA Astrophysics Data System (ADS)

    Hristov, Ivan; Goranov, Goran; Hristova, Radoslava

    2018-02-01

    We consider a standing wave simulation that is interesting from a computational point of view, obtained by solving coupled 2D perturbed Sine-Gordon equations. We make an OpenMP realization which exploits both thread and SIMD levels of parallelism. We test the OpenMP program on two different energy-equivalent Intel architectures: 2× Xeon E5-2695 v2 processors (code-named "Ivy Bridge-EP") in the Hybrilit cluster, and a Xeon Phi 7250 processor (code-named "Knights Landing", KNL). The results show 2 times better performance on the KNL processor.

  6. Dynamic overset grid communication on distributed memory parallel processors

    NASA Technical Reports Server (NTRS)

    Barszcz, Eric; Weeratunga, Sisira K.; Meakin, Robert L.

    1993-01-01

    A parallel distributed memory implementation of intergrid communication for dynamic overset grids is presented. Included are discussions of various options considered during development. Results are presented comparing an Intel iPSC/860 to a single processor Cray Y-MP. Results for grids in relative motion show the iPSC/860 implementation to be faster than the Cray implementation.

  7. Web interfaces to relational databases

    NASA Technical Reports Server (NTRS)

    Carlisle, W. H.

    1996-01-01

    This report describes a project to extend the capabilities of a Virtual Research Center (VRC) for NASA's Advanced Concepts Office. The work was performed as part of NASA's 1995 Summer Faculty Fellowship program and involved the development of a prototype component of the VRC: a database system that provides data creation and access services within a room of the VRC. In support of VRC development, NASA has assembled a laboratory containing the variety of equipment expected to be used by scientists within the VRC. This laboratory consists of the major hardware platforms (SUN, Intel, and Motorola processors) and their most common operating systems (UNIX, Windows NT, Windows for Workgroups, and Macintosh). The SPARC 20 runs SUN Solaris 2.4; an Intel Pentium runs Windows NT and is installed on a different network from the other machines in the laboratory; a Pentium PC runs Windows for Workgroups; two Intel 386 machines run Windows 3.1; and finally, a PowerMacintosh and a Macintosh IIsi run MacOS.

  8. High-performance reconfigurable hardware architecture for restricted Boltzmann machines.

    PubMed

    Ly, Daniel Le; Chow, Paul

    2010-11-01

    Despite the popularity and success of neural networks in research, the number of resulting commercial or industrial applications has been limited. A primary cause for this lack of adoption is that neural networks are usually implemented as software running on general-purpose processors. Hence, a hardware implementation that can exploit the inherent parallelism in neural networks is desired. This paper investigates how the restricted Boltzmann machine (RBM), which is a popular type of neural network, can be mapped to a high-performance hardware architecture on field-programmable gate array (FPGA) platforms. The proposed modular framework is designed to reduce the time complexity of the computations through heavily customized hardware engines. A method to partition large RBMs into smaller congruent components is also presented, allowing the distribution of one RBM across multiple FPGA resources. The framework is tested on a platform of four Xilinx Virtex II-Pro XC2VP70 FPGAs running at 100 MHz through a variety of different configurations. The maximum performance was obtained by instantiating an RBM of 256 × 256 nodes distributed across four FPGAs, which resulted in a computational speed of 3.13 billion connection-updates-per-second and a speedup of 145-fold over an optimized C program running on a 2.8-GHz Intel processor.

  9. Perfmon2: a leap forward in performance monitoring

    NASA Astrophysics Data System (ADS)

    Jarp, S.; Jurga, R.; Nowak, A.

    2008-07-01

    This paper describes the software component, perfmon2, that is about to be added to the Linux kernel as the standard interface to the Performance Monitoring Unit (PMU) on common processors, including x86 (AMD and Intel), Sun SPARC, MIPS, IBM Power and Intel Itanium. It also describes a set of tools for doing performance monitoring in practice and details how the CERN openlab team has participated in the testing and development of these tools.

  10. Global synchronization algorithms for the Intel iPSC/860

    NASA Technical Reports Server (NTRS)

    Seidel, Steven R.; Davis, Mark A.

    1992-01-01

    In a distributed memory multicomputer that has no global clock, global processor synchronization can only be achieved through software. Global synchronization algorithms are used in tridiagonal systems solvers, CFD codes, sequence comparison algorithms, and sorting algorithms. They are also useful for event simulation, debugging, and for solving mutual exclusion problems. For the Intel iPSC/860 in particular, global synchronization can be used to ensure the most effective use of the communication network for operations such as the shift, where each processor in a one-dimensional array or ring concurrently sends a message to its right (or left) neighbor. Three global synchronization algorithms are considered for the iPSC/860: the gsync() primitive provided by Intel, the PICL primitive sync0(), and a new recursive doubling synchronization (RDS) algorithm. The performance of these algorithms is compared to the performance predicted by communication models of both the long and forced message protocols. Measurements of the cost of shift operations preceded by global synchronization show that the RDS algorithm always synchronizes the nodes more precisely and costs only slightly more than the other two algorithms.
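
    The abstract does not spell out the RDS algorithm, so the following is only a sketch of the generic recursive doubling exchange pattern, assuming a power-of-two node count: in round k each node exchanges a message with the node whose label differs in bit k, and after log2(P) rounds every node has (transitively) heard from every other node. The function names are hypothetical, and the simulation tracks only who has heard from whom, not timing.

```python
def recursive_doubling_rounds(num_nodes: int) -> list[list[tuple[int, int]]]:
    """Per-round (node, partner) pairs; partner differs from node in one bit."""
    if num_nodes < 1 or num_nodes & (num_nodes - 1):
        raise ValueError("num_nodes must be a power of two")
    rounds = []
    step = 1
    while step < num_nodes:
        rounds.append([(r, r ^ step) for r in range(num_nodes)])
        step <<= 1
    return rounds


def simulate_sync(num_nodes: int) -> list[set[int]]:
    """Return, for each node, the set of nodes it has heard from after all rounds."""
    heard_from = [{r} for r in range(num_nodes)]
    for pairs in recursive_doubling_rounds(num_nodes):
        # Exchanges within a round happen concurrently, so merge from a snapshot.
        snapshot = [set(s) for s in heard_from]
        for node, partner in pairs:
            heard_from[node] = snapshot[node] | snapshot[partner]
    return heard_from


if __name__ == "__main__":
    print(len(recursive_doubling_rounds(8)))
    print(all(s == set(range(8)) for s in simulate_sync(8)))
```

    Under these assumptions a node can leave the barrier once its set contains every node, which takes log2(P) rounds rather than P - 1 point-to-point messages.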

  11. Users manual for the Chameleon parallel programming tools

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Gropp, W.; Smith, B.

    1993-06-01

    Message passing is a common method for writing programs for distributed-memory parallel computers. Unfortunately, the lack of a standard for message passing has hampered the construction of portable and efficient parallel programs. In an attempt to remedy this problem, a number of groups have developed their own message-passing systems, each with its own strengths and weaknesses. Chameleon is a second-generation system of this type. Rather than replacing these existing systems, Chameleon is meant to supplement them by providing a uniform way to access many of these systems. Chameleon's goals are to (a) be very lightweight (low overhead), (b) be highly portable, and (c) help standardize program startup and the use of emerging message-passing operations such as collective operations on subsets of processors. Chameleon also provides a way to port programs written using PICL or Intel NX message passing to other systems, including collections of workstations. Chameleon is tracking the Message-Passing Interface (MPI) draft standard and will provide both an MPI implementation and an MPI transport layer. Chameleon provides support for heterogeneous computing by using p4 and PVM. Chameleon's support for homogeneous computing includes the portable libraries p4, PICL, and PVM and vendor-specific implementation for Intel NX, IBM EUI (SP-1), and Thinking Machines CMMD (CM-5). Support for Ncube and PVM 3.x is also under development.

  12. Parallel computation for biological sequence comparison: comparing a portable model to the native model for the Intel Hypercube.

    PubMed Central

    Nadkarni, P. M.; Miller, P. L.

    1991-01-01

    A parallel program for inter-database sequence comparison was developed on the Intel Hypercube using two models of parallel programming. One version was built using machine-specific Hypercube parallel programming commands. The other version was built using Linda, a machine-independent parallel programming language. The two versions of the program provide a case study comparing these two approaches to parallelization in an important biological application area. Benchmark tests with both programs gave comparable results with a small number of processors. As the number of processors was increased, the Linda version was somewhat less efficient. The Linda version was also run without change on Network Linda, a virtual parallel machine running on a network of desktop workstations. PMID:1807632

  13. Multi-Core Programming Design Patterns: Stream Processing Algorithms for Dynamic Scene Perceptions

    DTIC Science & Technology

    2014-05-01

    processor developed by IBM and other companies, incorporates the POWER5 processor as the Power Processor Element (PPE), one of the early general...deliver a power-efficient single-precision peak performance of more than 256 GFlops. Substantially more raw power became available later, when nVIDIA ...algorithms, including IBM’s Cell/B.E., GPUs from NVidia and AMD and many-core CPUs from Intel. The vast growth of digital video content has been a

  14. batman: BAsic Transit Model cAlculatioN in Python

    NASA Astrophysics Data System (ADS)

    Kreidberg, Laura

    2015-11-01

    I introduce batman, a Python package for modeling exoplanet transit light curves. The batman package supports calculation of light curves for any radially symmetric stellar limb darkening law, using a new integration algorithm for models that cannot be quickly calculated analytically. The code uses C extension modules to speed up model calculation and is parallelized with OpenMP. For a typical light curve with 100 data points in transit, batman can calculate one million quadratic limb-darkened models in 30 seconds with a single 1.7 GHz Intel Core i5 processor. The same calculation takes seven minutes using the four-parameter nonlinear limb darkening model (computed to 1 ppm accuracy). Maximum truncation error for integrated models is an input parameter that can be set as low as 0.001 ppm, ensuring that the community is prepared for the precise transit light curves we anticipate measuring with upcoming facilities. The batman package is open source and publicly available at https://github.com/lkreidberg/batman .

  15. GPU-based streaming architectures for fast cone-beam CT image reconstruction and demons deformable registration.

    PubMed

    Sharp, G C; Kandasamy, N; Singh, H; Folkert, M

    2007-10-07

    This paper shows how to significantly accelerate cone-beam CT reconstruction and 3D deformable image registration using the stream-processing model. We describe data-parallel designs for the Feldkamp, Davis and Kress (FDK) reconstruction algorithm, and the demons deformable registration algorithm, suitable for use on a commodity graphics processing unit. The streaming versions of these algorithms are implemented using the Brook programming environment and executed on an NVidia 8800 GPU. Performance results using CT data of a preserved swine lung indicate that the GPU-based implementations of the FDK and demons algorithms achieve a substantial speedup--up to 80 times for FDK and 70 times for demons when compared to an optimized reference implementation on a 2.8 GHz Intel processor. In addition, the accuracy of the GPU-based implementations was found to be excellent. Compared with CPU-based implementations, the RMS differences were less than 0.1 Hounsfield unit for reconstruction and less than 0.1 mm for deformable registration.

  16. Efficacy of Code Optimization on Cache-Based Processors

    NASA Technical Reports Server (NTRS)

    VanderWijngaart, Rob F.; Saphir, William C.; Chancellor, Marisa K. (Technical Monitor)

    1997-01-01

    In this paper a number of techniques for improving the cache performance of a representative piece of numerical software are presented. Target machines are popular processors from several vendors: MIPS R5000 (SGI Indy), MIPS R8000 (SGI PowerChallenge), MIPS R10000 (SGI Origin), DEC Alpha EV4 + EV5 (Cray T3D & T3E), IBM RS6000 (SP Wide-node), Intel PentiumPro (Ames' Whitney), Sun UltraSparc (NERSC's NOW). The optimizations all attempt to increase the locality of memory accesses, but they meet with rather varied and often counterintuitive success on the different computing platforms. We conclude that it may be genuinely impossible to obtain portable performance on the current generation of cache-based machines. At the least, it appears that the performance of modern commodity processors cannot be described with parameters defining the cache alone.

  17. FPGA Online Tracking Algorithm for the PANDA Straw Tube Tracker

    NASA Astrophysics Data System (ADS)

    Liang, Yutie; Ye, Hua; Galuska, Martin J.; Gessler, Thomas; Kuhn, Wolfgang; Lange, Jens Soren; Wagner, Milan N.; Liu, Zhen'an; Zhao, Jingzhou

    2017-06-01

    A novel FPGA-based online tracking algorithm for helix track reconstruction in a solenoidal field, developed for the PANDA spectrometer, is described. Employing the Straw Tube Tracker detector with 4636 straw tubes, the algorithm includes a complex track finder and a track fitter. Implemented in VHDL, the algorithm is tested on a Xilinx Virtex-4 FX60 FPGA chip with different types of events, at different event rates. A processing time of 7 $\mu$s per event for an average of 6 charged tracks is obtained. The momentum resolution is about 3% (4%) for $p_t$ ($p_z$) at 1 GeV/c. Compared with the algorithm running on a CPU chip (single core Intel Xeon E5520 at 2.26 GHz), an improvement of 3 orders of magnitude in processing time is obtained. The algorithm can handle severe overlapping of events, which is typical for interaction rates above 10 MHz.

  18. Feasibility study of microprocessor systems suitable for use in developing a real-time for the 4.75 GHz scatterometer

    NASA Technical Reports Server (NTRS)

    1977-01-01

    A class of signal processors suitable for the reduction of radar scatterometer data in real time was developed. The systems were applied to the reduction of single polarized 13.3 GHz scatterometer data and provided a real time output of radar scattering coefficient as a function of incident angle. It was proposed that a system for processing of C band radar data be constructed to support scatterometer system currently under development. The establishment of a feasible design approach to the development of this processor system utilizing microprocessor technology was emphasized.

  19. Vectorization for Molecular Dynamics on Intel Xeon Phi Coprocessors

    NASA Astrophysics Data System (ADS)

    Yi, Hongsuk

    2014-03-01

    Many modern processors are capable of exploiting data-level parallelism through the use of single instruction multiple data (SIMD) execution. The new Intel Xeon Phi coprocessor supports 512-bit vector registers for high-performance computing. In this paper, we have developed a hierarchical parallelization scheme for accelerated molecular dynamics simulations with the Tersoff potentials for covalently bonded solid crystals on Intel Xeon Phi coprocessor systems. The scheme exploits multi-level parallelism, combining tightly coupled thread-level and task-level parallelism with 512-bit vector registers. The simulation results show that the parallel performance of the SIMD implementations on Xeon Phi is evidently superior to that on the x86 CPU architecture.

  20. Efficient Multicriteria Protein Structure Comparison on Modern Processor Architectures

    PubMed Central

    Manolakos, Elias S.

    2015-01-01

    Fast increasing computational demand for all-to-all protein structures comparison (PSC) is a result of three confounding factors: rapidly expanding structural proteomics databases, high computational complexity of pairwise protein comparison algorithms, and the trend in the domain towards using multiple criteria for protein structures comparison (MCPSC) and combining results. We have developed a software framework that exploits many-core and multicore CPUs to implement efficient parallel MCPSC in modern processors based on three popular PSC methods, namely, TMalign, CE, and USM. We evaluate and compare the performance and efficiency of the two parallel MCPSC implementations using Intel's experimental many-core Single-Chip Cloud Computer (SCC) as well as Intel's Core i7 multicore processor. We show that the 48-core SCC is more efficient than the latest generation Core i7, achieving a speedup factor of 42 (efficiency of 0.9), making many-core processors an exciting emerging technology for large-scale structural proteomics. We compare and contrast the performance of the two processors on several datasets and also show that MCPSC outperforms its component methods in grouping related domains, achieving a high F-measure of 0.91 on the benchmark CK34 dataset. The software implementation for protein structure comparison using the three methods and combined MCPSC, along with the developed underlying rckskel algorithmic skeletons library, is available via GitHub. PMID:26605332
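
    The efficiency figure quoted above is consistent with the usual definition of parallel efficiency as speedup divided by the number of cores. The short check below assumes that all 48 SCC cores are counted in the denominator, which the abstract does not state explicitly; the function name is hypothetical.

```python
def parallel_efficiency(speedup: float, cores: int) -> float:
    """Parallel efficiency defined as speedup per core."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return speedup / cores


if __name__ == "__main__":
    eff = parallel_efficiency(42, 48)  # values quoted in the abstract
    print(f"{eff:.3f}")  # 0.875, i.e. roughly 0.9 when rounded to one decimal
```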

  1. Efficient Multicriteria Protein Structure Comparison on Modern Processor Architectures.

    PubMed

    Sharma, Anuj; Manolakos, Elias S

    2015-01-01

    Fast increasing computational demand for all-to-all protein structures comparison (PSC) is a result of three confounding factors: rapidly expanding structural proteomics databases, high computational complexity of pairwise protein comparison algorithms, and the trend in the domain towards using multiple criteria for protein structures comparison (MCPSC) and combining results. We have developed a software framework that exploits many-core and multicore CPUs to implement efficient parallel MCPSC in modern processors based on three popular PSC methods, namely, TMalign, CE, and USM. We evaluate and compare the performance and efficiency of the two parallel MCPSC implementations using Intel's experimental many-core Single-Chip Cloud Computer (SCC) as well as Intel's Core i7 multicore processor. We show that the 48-core SCC is more efficient than the latest generation Core i7, achieving a speedup factor of 42 (efficiency of 0.9), making many-core processors an exciting emerging technology for large-scale structural proteomics. We compare and contrast the performance of the two processors on several datasets and also show that MCPSC outperforms its component methods in grouping related domains, achieving a high F-measure of 0.91 on the benchmark CK34 dataset. The software implementation for protein structure comparison using the three methods and combined MCPSC, along with the developed underlying rckskel algorithmic skeletons library, is available via GitHub.

  2. Heterogeneous High Throughput Scientific Computing with APM X-Gene and Intel Xeon Phi

    NASA Astrophysics Data System (ADS)

    Abdurachmanov, David; Bockelman, Brian; Elmer, Peter; Eulisse, Giulio; Knight, Robert; Muzaffar, Shahzad

    2015-05-01

    Electrical power requirements will be a constraint on the future growth of Distributed High Throughput Computing (DHTC) as used by High Energy Physics. Performance-per-watt is a critical metric for the evaluation of computer architectures for cost-efficient computing. Additionally, future performance growth will come from heterogeneous, many-core, and high computing density platforms with specialized processors. In this paper, we examine the Intel Xeon Phi Many Integrated Cores (MIC) co-processor and Applied Micro X-Gene ARMv8 64-bit low-power server system-on-a-chip (SoC) solutions for scientific computing applications. We report our experience on software porting, performance and energy efficiency and evaluate the potential for use of such technologies in the context of distributed computing systems such as the Worldwide LHC Computing Grid (WLCG).

  3. Simulating Hydrologic Flow and Reactive Transport with PFLOTRAN and PETSc on Emerging Fine-Grained Parallel Computer Architectures

    NASA Astrophysics Data System (ADS)

    Mills, R. T.; Rupp, K.; Smith, B. F.; Brown, J.; Knepley, M.; Zhang, H.; Adams, M.; Hammond, G. E.

    2017-12-01

    As the high-performance computing community pushes towards the exascale horizon, power and heat considerations have driven the increasing importance and prevalence of fine-grained parallelism in new computer architectures. High-performance computing centers have become increasingly reliant on GPGPU accelerators and "manycore" processors such as the Intel Xeon Phi line, and 512-bit SIMD registers have even been introduced in the latest generation of Intel's mainstream Xeon server processors. The high degree of fine-grained parallelism and more complicated memory hierarchy considerations of such "manycore" processors present several challenges to existing scientific software. Here, we consider how the massively parallel, open-source hydrologic flow and reactive transport code PFLOTRAN - and the underlying Portable, Extensible Toolkit for Scientific Computation (PETSc) library on which it is built - can best take advantage of such architectures. We will discuss some key features of these novel architectures and our code optimizations and algorithmic developments targeted at them, and present experiences drawn from working with a wide range of PFLOTRAN benchmark problems on these architectures.

  4. Accelerating Climate and Weather Simulations through Hybrid Computing

    NASA Technical Reports Server (NTRS)

    Zhou, Shujia; Cruz, Carlos; Duffy, Daniel; Tucker, Robert; Purcell, Mark

    2011-01-01

    Unconventional multi- and many-core processors (e.g. IBM (R) Cell B.E.(TM) and NVIDIA (R) GPU) have emerged as effective accelerators in trial climate and weather simulations. Yet these climate and weather models typically run on parallel computers with conventional processors (e.g. Intel, AMD, and IBM) using Message Passing Interface. To address challenges involved in efficiently and easily connecting accelerators to parallel computers, we investigated using IBM's Dynamic Application Virtualization (TM) (IBM DAV) software in a prototype hybrid computing system with representative climate and weather model components. The hybrid system comprises two Intel blades and two IBM QS22 Cell B.E. blades, connected with both InfiniBand(R) (IB) and 1-Gigabit Ethernet. The system significantly accelerates a solar radiation model component by offloading compute-intensive calculations to the Cell blades. Systematic tests show that IBM DAV can seamlessly offload compute-intensive calculations from Intel blades to Cell B.E. blades in a scalable, load-balanced manner. However, noticeable communication overhead was observed, mainly due to IP over the IB protocol. Full utilization of IB Sockets Direct Protocol and the lower latency production version of IBM DAV will reduce this overhead.

  5. Parallel spatial direct numerical simulations on the Intel iPSC/860 hypercube

    NASA Technical Reports Server (NTRS)

    Joslin, Ronald D.; Zubair, Mohammad

    1993-01-01

    The implementation and performance of a parallel spatial direct numerical simulation (PSDNS) approach on the Intel iPSC/860 hypercube are documented. The direct numerical simulation approach is used to compute spatially evolving disturbances associated with the laminar-to-turbulent transition in boundary-layer flows. The feasibility of using the PSDNS on the hypercube to perform transition studies is examined. The results indicate that the direct numerical simulation approach can effectively be parallelized on a distributed-memory parallel machine. By increasing the number of processors, nearly ideal linear speedups are achieved with nonoptimized routines; slower than linear speedups are achieved with optimized (machine-dependent library) routines. This slower than linear speedup results because the Fast Fourier Transform (FFT) routine dominates the computational cost and because the routine exhibits less than ideal speedups. However, with the machine-dependent routines the total computational cost decreases by a factor of 4 to 5 compared with standard FORTRAN routines. The computational cost increases linearly with spanwise, wall-normal, and streamwise grid refinements. The hypercube with 32 processors was estimated to require approximately twice the amount of Cray supercomputer single processor time to complete a comparable simulation; however, it is estimated that a subgrid-scale model, which reduces the required number of grid points and becomes a large-eddy simulation (PSLES), would reduce the computational cost and memory requirements by a factor of 10 over the PSDNS. This PSLES implementation would enable transition simulations on the hypercube at a reasonable computational cost.

  6. An efficient MPI/OpenMP parallelization of the Hartree–Fock–Roothaan method for the first generation of Intel® Xeon Phi™ processor architecture

    DOE PAGES

    Mironov, Vladimir; Moskovsky, Alexander; D’Mello, Michael; ...

    2017-10-04

    The Hartree-Fock (HF) method in the quantum chemistry package GAMESS represents one of the most irregular algorithms in computation today. Major steps in the calculation are the irregular computation of electron repulsion integrals (ERIs) and the building of the Fock matrix. These are the central components of the main Self Consistent Field (SCF) loop, the key hotspot in Electronic Structure (ES) codes. By threading the MPI ranks in the official release of the GAMESS code, we not only speed up the main SCF loop (4x to 6x for large systems), but also achieve a significant (>2x) reduction in the overall memory footprint. These improvements are a direct consequence of memory access optimizations within the MPI ranks. We benchmark our implementation against the official release of the GAMESS code on the Intel® Xeon Phi™ supercomputer. Here, scaling numbers are reported on up to 7,680 cores on Intel Xeon Phi coprocessors.

  7. Federal Plan for High-End Computing. Report of the High-End Computing Revitalization Task Force (HECRTF)

    DTIC Science & Technology

    2004-07-01

    steadily for the past fifteen years, while memory latency and bandwidth have improved much more slowly. For example, Intel processor clock rates have... processor and memory performance) all greatly restrict the ability to achieve high levels of performance for science, engineering, and national...sub-nuclear distances. Guide experiments to identify transition from quantum chromodynamics to quark-gluon plasma. Accelerator Physics Accurate

  8. Communication overhead on the Intel Paragon, IBM SP2 and Meiko CS-2

    NASA Technical Reports Server (NTRS)

    Bokhari, Shahid H.

    1995-01-01

    Interprocessor communication overhead is a crucial measure of the power of parallel computing systems: its impact can severely limit the performance of parallel programs. This report presents measurements of communication overhead on three contemporary commercial multicomputer systems: the Intel Paragon, the IBM SP2 and the Meiko CS-2. In each case the time to communicate between processors is presented as a function of message length. The time for global synchronization and memory access is discussed. The performance of these machines in emulating hypercubes and executing random pairwise exchanges is also investigated. It is shown that the interprocessor communication time depends heavily on the specific communication pattern required. These observations contradict the commonly held belief that communication overhead on contemporary machines is independent of the placement of tasks on processors. The information presented in this report permits the evaluation of the efficiency of parallel algorithm implementations against standard baselines.

  9. Heterogeneous high throughput scientific computing with APM X-Gene and Intel Xeon Phi

    DOE PAGES

    Abdurachmanov, David; Bockelman, Brian; Elmer, Peter; ...

    2015-05-22

    Electrical power requirements will be a constraint on the future growth of Distributed High Throughput Computing (DHTC) as used by High Energy Physics. Performance-per-watt is a critical metric for the evaluation of computer architectures for cost-efficient computing. Additionally, future performance growth will come from heterogeneous, many-core, and high computing density platforms with specialized processors. In this paper, we examine the Intel Xeon Phi Many Integrated Cores (MIC) co-processor and Applied Micro X-Gene ARMv8 64-bit low-power server system-on-a-chip (SoC) solutions for scientific computing applications. As a result, we report our experience on software porting, performance and energy efficiency and evaluate the potential for use of such technologies in the context of distributed computing systems such as the Worldwide LHC Computing Grid (WLCG).

  10. Evaluation of the Intel iWarp parallel processor for space flight applications

    NASA Technical Reports Server (NTRS)

    Hine, Butler P., III; Fong, Terrence W.

    1993-01-01

    The potential of a DARPA-sponsored advanced processor, the Intel iWarp, for use in future SSF Data Management Systems (DMS) upgrades is evaluated through integration into the Ames DMS testbed and applications testing. The iWarp is a distributed, parallel computing system well suited for high performance computing applications such as matrix operations and image processing. The system architecture is modular, supports systolic and message-based computation, and is capable of providing massive computational power in a low-cost, low-power package. As a consequence, the iWarp offers significant potential for advanced space-based computing. This research seeks to determine the iWarp's suitability as a processing device for space missions. In particular, the project focuses on evaluating the ease of integrating the iWarp into the SSF DMS baseline architecture and the iWarp's ability to support computationally stressing applications representative of SSF tasks.

  11. Analysis of the Intel 386 and i486 microprocessors for the Space Station Freedom Data Management System

    NASA Technical Reports Server (NTRS)

    Liu, Yuan-Kwei

    1991-01-01

    The feasibility is analyzed of upgrading the Intel 386 microprocessor, which has been proposed as the baseline processor for the Space Station Freedom (SSF) Data Management System (DMS), to the more advanced i486 microprocessors. The items compared between the two processors include the instruction set architecture, power consumption, the MIL-STD-883C Class S (Space) qualification schedule, and performance. The advantages of the i486 over the 386 are (1) lower power consumption; and (2) higher floating point performance. The i486 on-chip cache does not have parity check or error detection and correction circuitry. The i486 with on-chip cache disabled, however, has lower integer performance than the 386 without cache, which is the current DMS design choice. Adding cache to the 386/387 DX memory hierarchy appears to be the most beneficial change to the current DMS design at this time.

  12. Analysis of the Intel 386 and i486 microprocessors for the Space Station Freedom Data Management System

    NASA Technical Reports Server (NTRS)

    Liu, Yuan-Kwei

    1991-01-01

    The feasibility is analyzed of upgrading the Intel 386 microprocessor, which has been proposed as the baseline processor for the Space Station Freedom (SSF) Data Management System (DMS), to the more advanced i486 microprocessors. The items compared between the two processors include the instruction set architecture, power consumption, the MIL-STD-883C Class S (Space) qualification schedule, and performance. The advantages of the i486 over the 386 are (1) lower power consumption; and (2) higher floating point performance. The i486 on-chip cache does not have parity check or error detection and correction circuitry. The i486 with on-chip cache disabled, however, has lower integer performance than the 386 without cache, which is the current DMS design choice. Adding cache to the 386/387 DX memory hierarchy appears to be the most beneficial change to the current DMS design at this time.

  13. Preliminary Radiation Testing of a State-of-the-Art Commercial 14nm CMOS Processor - System-on-a-Chip

    NASA Technical Reports Server (NTRS)

    Szabo, Carl M., Jr.; Duncan, Adam; LaBel, Kenneth A.; Kay, Matt; Bruner, Pat; Krzesniak, Mike; Dong, Lei

    2015-01-01

    Hardness assurance test results of Intel state-of-the-art 14nm Broadwell U-series processor System-on-a-Chip (SoC) for total dose are presented, along with first-look exploratory results from trials at a medical proton facility. Test method builds upon previous efforts by utilizing commercial laptop motherboards and software stress applications as opposed to more traditional automated test equipment (ATE).

  14. A machine-learning approach for computation of fractional flow reserve from coronary computed tomography.

    PubMed

    Itu, Lucian; Rapaka, Saikiran; Passerini, Tiziano; Georgescu, Bogdan; Schwemmer, Chris; Schoebinger, Max; Flohr, Thomas; Sharma, Puneet; Comaniciu, Dorin

    2016-07-01

    Fractional flow reserve (FFR) is a functional index quantifying the severity of coronary artery lesions and is clinically obtained using an invasive, catheter-based measurement. Recently, physics-based models have shown great promise in being able to noninvasively estimate FFR from patient-specific anatomical information, e.g., obtained from computed tomography scans of the heart and the coronary arteries. However, these models have high computational demand, limiting their clinical adoption. In this paper, we present a machine-learning-based model for predicting FFR as an alternative to physics-based approaches. The model is trained on a large database of synthetically generated coronary anatomies, where the target values are computed using the physics-based model. The trained model predicts FFR at each point along the centerline of the coronary tree, and its performance was assessed by comparing the predictions against physics-based computations and against invasively measured FFR for 87 patients and 125 lesions in total. Correlation between machine-learning and physics-based predictions was excellent (0.9994, P < 0.001), and no systematic bias was found in Bland-Altman analysis: mean difference was -0.00081 ± 0.0039. Invasive FFR ≤ 0.80 was found in 38 lesions out of 125 and was predicted by the machine-learning algorithm with a sensitivity of 81.6%, a specificity of 83.9%, and an accuracy of 83.2%. The correlation was 0.729 (P < 0.001). Compared with the physics-based computation, average execution time was reduced by more than 80 times, leading to near real-time assessment of FFR. Average execution time went down from 196.3 ± 78.5 s for the CFD model to ∼2.4 ± 0.44 s for the machine-learning model on a workstation with 3.4-GHz Intel i7 8-core processor. Copyright © 2016 the American Physiological Society.
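    The diagnostic figures quoted in this abstract can be cross-checked with a few lines of arithmetic. The sketch below is only a consistency check: the abstract gives 38 positive lesions out of 125 and the three percentages, while the integer true-positive and true-negative counts are an assumption, taken as the nearest integers that reproduce the reported rates. The function name `diagnostic_metrics` is illustrative.

```python
def diagnostic_metrics(true_pos, false_neg, true_neg, false_pos):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    accuracy = (true_pos + true_neg) / (true_pos + false_neg + true_neg + false_pos)
    return sensitivity, specificity, accuracy


# From the abstract: 125 lesions, 38 of them with invasive FFR <= 0.80.
total_lesions = 125
positives = 38
negatives = total_lesions - positives  # 87

# Assumption: counts inferred by rounding the reported rates to whole lesions.
true_pos = round(0.816 * positives)   # 31
true_neg = round(0.839 * negatives)   # 73

sens, spec, acc = diagnostic_metrics(
    true_pos, positives - true_pos, true_neg, negatives - true_neg
)
print(f"sensitivity={sens:.1%} specificity={spec:.1%} accuracy={acc:.1%}")
```

    Under that assumption the three reported percentages are mutually consistent: 31 of 38 positives and 73 of 87 negatives give 104 correct classifications out of 125.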

  15. An efficient implementation of semi-numerical computation of the Hartree-Fock exchange on the Intel Phi processor

    NASA Astrophysics Data System (ADS)

    Liu, Fenglai; Kong, Jing

    2018-07-01

    Unique technical challenges and their solutions for implementing semi-numerical Hartree-Fock exchange on the Phi processor are discussed, especially concerning the single-instruction-multiple-data type of processing and small cache size. Benchmark calculations on a series of buckyball molecules with various Gaussian basis sets on a Phi processor and a six-core CPU show that the Phi processor provides as much as a 12-fold speedup with large basis sets compared with the conventional four-center electron repulsion integration approach performed on the CPU. The accuracy of the semi-numerical scheme is also evaluated and found to be comparable to that of the resolution-of-identity approach.

  16. Multi-threaded ATLAS simulation on Intel Knights Landing processors

    NASA Astrophysics Data System (ADS)

    Farrell, Steven; Calafiura, Paolo; Leggett, Charles; Tsulaia, Vakhtang; Dotti, Andrea; ATLAS Collaboration

    2017-10-01

    The Knights Landing (KNL) release of the Intel Many Integrated Core (MIC) Xeon Phi line of processors is a potential game changer for HEP computing. With 72 cores and deep vector registers, the KNL cards promise significant performance benefits for highly-parallel, compute-heavy applications. Cori, the newest supercomputer at the National Energy Research Scientific Computing Center (NERSC), was delivered to its users in two phases with the first phase online at the end of 2015 and the second phase now online at the end of 2016. Cori Phase 2 is based on the KNL architecture and contains over 9000 compute nodes with 96GB DDR4 memory. ATLAS simulation with the multithreaded Athena Framework (AthenaMT) is a good potential use-case for the KNL architecture and supercomputers like Cori. ATLAS simulation jobs have a high ratio of CPU computation to disk I/O and have been shown to scale well in multi-threading and across many nodes. In this paper we will give an overview of the ATLAS simulation application with details on its multi-threaded design. Then, we will present a performance analysis of the application on KNL devices and compare it to a traditional x86 platform to demonstrate the capabilities of the architecture and evaluate the benefits of utilizing KNL platforms like Cori for ATLAS production.

  17. Quantum Chemical Calculations Using Accelerators: Migrating Matrix Operations to the NVIDIA Kepler GPU and the Intel Xeon Phi.

    PubMed

    Leang, Sarom S; Rendell, Alistair P; Gordon, Mark S

    2014-03-11

    Increasingly, modern computer systems comprise a multicore general-purpose processor augmented with a number of special purpose devices or accelerators connected via an external interface such as a PCI bus. The NVIDIA Kepler Graphical Processing Unit (GPU) and the Intel Phi are two examples of such accelerators. Accelerators offer peak performances that can be well above those of the host processor. How to exploit this heterogeneous environment for legacy application codes is not, however, straightforward. This paper considers how matrix operations in typical quantum chemical calculations can be migrated to the GPU and Phi systems. Double precision general matrix multiply operations are endemic in electronic structure calculations, especially methods that include electron correlation, such as density functional theory, second order perturbation theory, and coupled cluster theory. The use of approaches that automatically determine whether to use the host or an accelerator, based on problem size, is explored, with computations that are occurring on the accelerator and/or the host. For data-transfers over PCI-e, the GPU provides the best overall performance for data sizes up to 4096 MB with consistent upload and download rates between 5-5.6 GB/s and 5.4-6.3 GB/s, respectively. The GPU outperforms the Phi for both square and nonsquare matrix multiplications.

  18. N-body simulation for self-gravitating collisional systems with a new SIMD instruction set extension to the x86 architecture, Advanced Vector eXtensions

    NASA Astrophysics Data System (ADS)

    Tanikawa, Ataru; Yoshikawa, Kohji; Okamoto, Takashi; Nitadori, Keigo

    2012-02-01

    We present a high-performance N-body code for self-gravitating collisional systems accelerated with the aid of a new SIMD instruction set extension of the x86 architecture: Advanced Vector eXtensions (AVX), an enhanced version of the Streaming SIMD Extensions (SSE). With one processor core of an Intel Core i7-2600 processor (8 MB cache and 3.40 GHz) based on the Sandy Bridge micro-architecture, we implemented a fourth-order Hermite scheme with individual timestep scheme (Makino and Aarseth, 1992), and achieved the performance of ∼20 giga floating point number operations per second (GFLOPS) for double-precision accuracy, which is two times and five times higher than that of the previously developed code implemented with the SSE instructions (Nitadori et al., 2006b), and that of a code implemented without any explicit use of SIMD instructions with the same processor core, respectively. We have parallelized the code by using the so-called NINJA scheme (Nitadori et al., 2006a), and achieved ∼90 GFLOPS for a system containing more than N = 8192 particles with 8 MPI processes on four cores. We expect to achieve about 10 tera FLOPS (TFLOPS) for a self-gravitating collisional system with N ∼ 10^5 on massively parallel systems with at most 800 cores with Sandy Bridge micro-architecture. This performance will be comparable to that of Graphic Processing Unit (GPU) cluster systems, such as the one with about 200 Tesla C1070 GPUs (Spurzem et al., 2010). This paper offers an alternative to collisional N-body simulations with GRAPEs and GPUs.

  19. Preliminary Radiation Testing of a State-of-the-Art Commercial 14nm CMOS Processor/System-on-a-Chip

    NASA Technical Reports Server (NTRS)

    Szabo, Carl M., Jr.; Duncan, Adam; LaBel, Kenneth A.; Kay, Matt; Bruner, Pat; Krzesniak, Mike; Dong, Lei

    2015-01-01

    Hardness assurance test results of Intel state-of-the-art 14nm “Broadwell” U-series processor / System-on-a-Chip (SoC) for total ionizing dose (TID) are presented, along with exploratory results from trials at a medical proton facility. Test method builds upon previous efforts [1] by utilizing commercial laptop motherboards and software stress applications as opposed to more traditional automated test equipment (ATE).

  20. A trick to improve the efficiency of generating unweighted B_c events from BCVEGPY

    NASA Astrophysics Data System (ADS)

    Wang, Xian-You; Wu, Xing-Gang

    2012-02-01

    In the present paper, we provide an addendum to improve the efficiency of generating unweighted events within the PYTHIA environment for the generator BCVEGPY2.1 [C.H. Chang, J.X. Wang, X.G. Wu, Comput. Phys. Commun. 174 (2006) 241]. This trick is helpful for experimental simulation. Moreover, the BCVEGPY output has also been improved, i.e. one Les Houches Event common block has been added so as to generate a standard Les Houches Event file that contains the information of the generated B_c meson and the accompanying partons, which can be more conveniently used for further simulation.

    New version program summary

    Title of program: BCVEGPY2.1a
    Catalogue identifier: ADTJ_v2_2
    Program summary URL: http://cpc.cs.qub.ac.uk/summaries/ADTJ_v2_2.html
    Program obtained from: CPC Program Library, Queen's University, Belfast, N. Ireland
    Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html
    No. of lines in distributed program, including test data, etc.: 166 133
    No. of bytes in distributed program, including test data, etc.: 1 655 390
    Distribution format: tar.gz
    Programming language used: FORTRAN 77/90
    Computer: Any LINUX-based PC with FORTRAN 77 or FORTRAN 90 and the GNU C compiler as well
    Operating systems: LINUX
    RAM: About 2.0 MB
    Classification: 11.2, 11.5
    Catalogue identifier of previous version: ADTJ_v2_1
    Reference in CPC: Comput. Phys. Commun. 175 (2006) 624
    Does the new version supersede the old program: No
    Nature of physical problem: Hadronic production of the B_c meson and its excited states
    Method of solution: To generate weighted and unweighted B_c events within the PYTHIA environment effectively.
    Restrictions on the complexity of the problem: Hadronic production of (cb̄)-quarkonium via the gluon-gluon fusion mechanism is given by the 'complete calculation approach'. The simulation of B_c events is done within the PYTHIA environment.
    Reasons for new version: As more and more data are accumulated at the large hadronic collider, it will be possible to make precise studies of B_c meson properties, such as its lifetime, mass spectrum, etc. BCVEGPY has been adopted by several experimental groups due to its high efficiency in comparison to that of PYTHIA. However, generating unweighted events with the PYTHIA inner mechanism as programmed in the previous version is still time-consuming. So it would be helpful to improve the efficiency of generating unweighted events within PYTHIA. Moreover, it would be better to use a uniform and standard output format for further detector simulation.
    Typical running time: Typical running time is machine and user-parameter dependent.
      I) To generate 10^6 weighted S-wave (cb̄)-quarkonium events (IDWTUP = 3), it will take about 40 minutes on a 1.8 GHz Intel P4-processor machine.
      II) To generate unweighted S-wave (cb̄)-quarkonium events with the PYTHIA inner structure (IDWTUP = 1), it will take about 20 hours on a 1.8 GHz Intel P4-processor machine to generate 1000 events.
      III) To generate 10^6 unweighted S-wave (cb̄)-quarkonium events with the present trick (IDWTUP = 1), it will take 17 hours on a 3.16 GHz Intel E8500 processor machine.
      Moreover, it can be found that the running time for P-wave (cb̄)-quarkonium production is about two times longer than for S-wave production under the same conditions.
    Keywords: Event generator; Hadronic production; B_c meson; Unweighted events
    Summary of revisions: (1) The generator BCVEGPY [1-3] has been programmed to generate B_c events under the PYTHIA environment [4], and has been frequently adopted for theoretical and experimental studies, e.g. Refs. [5-18]. It is found that each experimental group has its own simulation software architecture, and users spend a lot of time writing an interface to implement BCVEGPY in their own software. So it would be better to supply a standard output. The LHE format has become a standard format [19], proposed to store process and event information from matrix-element-based generators. Users can pass this parton-level information to general event generators like PYTHIA and HERWIG [20] for further simulation. For this purpose, we add two common blocks in genevent.F. One common block is called bcvegpy_pyupin and the other one is write_lhe. The bcvegpy_pyupin, which is similar to the PYUPIN subroutine in PYTHIA, stores the initialization information in the HEPRUP common block:

          INTEGER MAXPUP
          PARAMETER (MAXPUP = 100)
          INTEGER IDBMUP,PDFGUP,PDFSUP,IDWTUP,NPRUP,LPRUP
          DOUBLE PRECISION EBMUP,XSECUP,XERRUP,XMAXUP
          COMMON/HEPRUP/IDBMUP(2),EBMUP(2),PDFGUP(2),PDFSUP(2),
         &IDWTUP,NPRUP,XSECUP(MAXPUP),XERRUP(MAXPUP),
         &XMAXUP(MAXPUP),LPRUP(MAXPUP)

    The write_lhe, which is similar to the PYUPEV subroutine in PYTHIA, stores the information of each separate event in the HEPEUP common block:

          INTEGER MAXNUP
          PARAMETER (MAXNUP = 500)
          INTEGER NUP,IDPRUP,IDUP,ISTUP,MOTHUP,ICOLUP
          DOUBLE PRECISION XWGTUP,SCALUP,AQEDUP,AQCDUP,PUP,VTIMUP,
         &SPINUP
          COMMON/HEPEUP/NUP,IDPRUP,XWGTUP,SCALUP,AQEDUP,AQCDUP,
         &IDUP(MAXNUP),ISTUP(MAXNUP),MOTHUP(2,MAXNUP),
         &ICOLUP(2,MAXNUP),PUP(5,MAXNUP),VTIMUP(MAXNUP),
         &SPINUP(MAXNUP)
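    To illustrate how the HEPEUP fields listed in this record map onto the event block of a Les Houches Event file, here is a minimal Python sketch. It is not the BCVEGPY write_lhe routine, which is Fortran and is not reproduced in the abstract. The class names, the function `format_lhe_event`, the number formatting, and the two-gluon demo record with placeholder momenta are all assumptions made for illustration; only the field names and their order follow the common block above and the standard LHE event layout.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Particle:
    idup: int                     # IDUP: PDG particle code
    istup: int                    # ISTUP: status (-1 incoming, +1 outgoing)
    mothup: Tuple[int, int]       # MOTHUP(1:2): indices of the mothers
    icolup: Tuple[int, int]       # ICOLUP(1:2): colour / anticolour tags
    pup: Tuple[float, float, float, float, float]  # PUP(1:5): px, py, pz, E, m
    vtimup: float = 0.0           # VTIMUP: invariant lifetime
    spinup: float = 9.0           # SPINUP: 9 conventionally means "unknown"


@dataclass
class Event:
    idprup: int                   # IDPRUP: process identifier
    xwgtup: float                 # XWGTUP: event weight
    scalup: float                 # SCALUP: event scale in GeV
    aqedup: float                 # AQEDUP: QED coupling used
    aqcdup: float                 # AQCDUP: QCD coupling used
    particles: List[Particle]     # NUP is taken as len(particles)


def format_lhe_event(event: Event) -> str:
    """Render one event as an <event> block: a header line, then one line per particle."""
    lines = ["<event>"]
    lines.append(
        f"{len(event.particles)} {event.idprup} {event.xwgtup:.6e} "
        f"{event.scalup:.6e} {event.aqedup:.6e} {event.aqcdup:.6e}"
    )
    for p in event.particles:
        momentum = " ".join(f"{x:.6e}" for x in p.pup)
        lines.append(
            f"{p.idup} {p.istup} {p.mothup[0]} {p.mothup[1]} "
            f"{p.icolup[0]} {p.icolup[1]} {momentum} "
            f"{p.vtimup:.6e} {p.spinup:.6e}"
        )
    lines.append("</event>")
    return "\n".join(lines)


# Hypothetical demo record: two incoming gluons with placeholder values,
# not a physical BCVEGPY event.
demo = Event(
    idprup=1, xwgtup=1.0, scalup=100.0, aqedup=0.0078, aqcdup=0.118,
    particles=[
        Particle(21, -1, (0, 0), (501, 502), (0.0, 0.0, 100.0, 100.0, 0.0)),
        Particle(21, -1, (0, 0), (502, 501), (0.0, 0.0, -100.0, 100.0, 0.0)),
    ],
)
print(format_lhe_event(demo))
```

    Each particle line carries 13 values, matching the per-particle arrays in HEPEUP: IDUP, ISTUP, two MOTHUP entries, two ICOLUP entries, five PUP entries, VTIMUP and SPINUP.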

  1. Multi-Kepler GPU vs. multi-Intel MIC for spin systems simulations

    NASA Astrophysics Data System (ADS)

    Bernaschi, M.; Bisson, M.; Salvadore, F.

    2014-10-01

    We present and compare the performances of two many-core architectures: the Nvidia Kepler and the Intel MIC both in a single system and in cluster configuration for the simulation of spin systems. As a benchmark we consider the time required to update a single spin of the 3D Heisenberg spin glass model by using the Over-relaxation algorithm. We present data also for a traditional high-end multi-core architecture: the Intel Sandy Bridge. The results show that although on the two Intel architectures it is possible to use basically the same code, the performances of an Intel MIC change dramatically depending on (apparently) minor details. Another issue is that to obtain a reasonable scalability with the Intel Phi coprocessor (Phi is the coprocessor that implements the MIC architecture) in a cluster configuration it is necessary to use the so-called offload mode which reduces the performances of the single system. As to the GPU, the Kepler architecture offers a clear advantage with respect to the previous Fermi architecture maintaining exactly the same source code. Scalability of the multi-GPU implementation remains very good by using the CPU as a communication co-processor of the GPU. All source codes are provided for inspection and for double-checking the results.

  2. High performance computing environment for multidimensional image analysis

    PubMed Central

    Rao, A Ravishankar; Cecchi, Guillermo A; Magnasco, Marcelo

    2007-01-01

    Background: The processing of images acquired through microscopy is a challenging task due to the large size of datasets (several gigabytes) and the fast turnaround time required. If the throughput of the image processing stage is significantly increased, it can have a major impact in microscopy applications. Results: We present a high performance computing (HPC) solution to this problem. This involves decomposing the spatial 3D image into segments that are assigned to unique processors, and matched to the 3D torus architecture of the IBM Blue Gene/L machine. Communication between segments is restricted to the nearest neighbors. When running on a 2 GHz Intel CPU, the task of 3D median filtering on a typical 256 megabyte dataset takes two and a half hours, whereas by using 1024 nodes of Blue Gene, this task can be performed in 18.8 seconds, a 478× speedup. Conclusion: Our parallel solution dramatically improves the performance of image processing, feature extraction and 3D reconstruction tasks. This increased throughput permits biologists to conduct unprecedented large scale experiments with massive datasets. PMID: 17634099
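    The speedup quoted in this abstract follows directly from the two run times it reports. The short Python check below uses only those two figures; the variable names are illustrative, and no parallel efficiency is derived because the serial and parallel runs used different processors.

```python
# Run times taken from the abstract.
serial_seconds = 2.5 * 3600      # "two and a half hours" on the single Intel CPU
parallel_seconds = 18.8          # 1024 Blue Gene nodes

speedup = serial_seconds / parallel_seconds
print(f"speedup = {speedup:.1f}x")   # about 478.7, reported as 478x
```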

  3. High performance computing environment for multidimensional image analysis.

    PubMed

    Rao, A Ravishankar; Cecchi, Guillermo A; Magnasco, Marcelo

    2007-07-10

    The processing of images acquired through microscopy is a challenging task due to the large size of datasets (several gigabytes) and the fast turnaround time required. If the throughput of the image processing stage is significantly increased, it can have a major impact in microscopy applications. We present a high performance computing (HPC) solution to this problem. This involves decomposing the spatial 3D image into segments that are assigned to unique processors, and matched to the 3D torus architecture of the IBM Blue Gene/L machine. Communication between segments is restricted to the nearest neighbors. When running on a 2 GHz Intel CPU, the task of 3D median filtering on a typical 256 megabyte dataset takes two and a half hours, whereas by using 1024 nodes of Blue Gene, this task can be performed in 18.8 seconds, a 478x speedup. Our parallel solution dramatically improves the performance of image processing, feature extraction and 3D reconstruction tasks. This increased throughput permits biologists to conduct unprecedented large scale experiments with massive datasets.

  4. Speeding up spin-component-scaled third-order pertubation theory with the chain of spheres approximation: the COSX-SCS-MP3 method

    NASA Astrophysics Data System (ADS)

    Izsák, Róbert; Neese, Frank

    2013-07-01

    The 'chain of spheres' approximation, developed earlier for the efficient evaluation of the self-consistent field exchange term, is introduced here into the evaluation of the external exchange term of higher order correlation methods. Its performance is studied in the specific case of the spin-component-scaled third-order Møller-Plesset perturbation (SCS-MP3) theory. The results indicate that the approximation performs excellently in terms of both computer time and achievable accuracy. Significant speedups over a conventional method are obtained for larger systems and basis sets. Owing to this development, SCS-MP3 calculations on molecules of the size of penicillin (42 atoms) with a polarised triple-zeta basis set can be performed in ∼3 hours using 16 cores of an Intel Xeon E7-8837 processor with a 2.67 GHz clock speed, which represents a speedup by a factor of 8-9 compared to the previously most efficient algorithm. Thus, the increased accuracy offered by SCS-MP3 can now be explored for at least medium-sized molecules.

  5. Application of the multireference equation of motion coupled cluster method, including spin-orbit coupling, to the atomic spectra of Cr, Mn, Fe and Co

    NASA Astrophysics Data System (ADS)

    Liu, Zhebing; Huntington, Lee M. J.; Nooijen, Marcel

    2015-10-01

    The recently introduced multireference equation of motion (MR-EOM) approach is combined with a simple treatment of spin-orbit coupling, as implemented in the ORCA program. The resulting multireference equation of motion spin-orbit coupling (MR-EOM-SOC) approach is applied to the first-row transition metal atoms Cr, Mn, Fe and Co, for which experimental data are readily available. Using the MR-EOM-SOC approach, the splittings in each L-S multiplet can be accurately assessed (root mean square (RMS) errors of about 70 cm⁻¹). The RMS errors for J-specific excitation energies range from 414 to 783 cm⁻¹ and are comparable to previously reported J-averaged MR-EOM results using the ACESII program. The MR-EOM approach is highly efficient. A typical MR-EOM calculation of a full spin-orbit spectrum takes about 2 CPU hours on a single processor of a 12-core node, consisting of Intel XEON 2.93 GHz CPUs with 12.3 MB of shared cache memory.

  6. Old PCs: Upgrade or Abandon?

    ERIC Educational Resources Information Center

    Perez, Ernest

    1997-01-01

    Examines the practical realities of upgrading Intel personal computers in libraries, considering budgets and technical personnel availability. Highlights include adding RAM; putting in faster processor chips, including clock multipliers; new hard disks; CD-ROM speed; motherboards and interface cards; cost limits and economic factors; and…

  7. Optimization of the Brillouin operator on the KNL architecture

    NASA Astrophysics Data System (ADS)

    Dürr, Stephan

    2018-03-01

    Experiences with optimizing the matrix-times-vector application of the Brillouin operator on the Intel KNL processor are reported. Without adjustments to the memory layout, performance figures of 360 Gflop/s in single and 270 Gflop/s in double precision are observed. This is with Nc = 3 colors, Nv = 12 right-hand-sides, Nthr = 256 threads, on lattices of size 32^3 × 64, using exclusively OMP pragmas. Interestingly, the same routine performs quite well on Intel Core i7 architectures, too. Some observations on the much harder Wilson fermion matrix-times-vector optimization problem are added.

  8. Performance of VPIC on Sequoia

    NASA Astrophysics Data System (ADS)

    Nystrom, William

    2014-10-01

    Sequoia is a major DOE computing resource which is characteristic of future resources in that it has many threads per compute node, 64, and the individual processor cores are simpler and less powerful than cores on previous processors like Intel's Sandy Bridge or AMD's Opteron. An effort is in progress to port VPIC to the Blue Gene Q architecture of Sequoia and evaluate its performance. Results of this work will be presented on single node performance of VPIC as well as multi-node scaling.

  9. OpenMP Performance on the Columbia Supercomputer

    NASA Technical Reports Server (NTRS)

    Haoqiang, Jin; Hood, Robert

    2005-01-01

    This presentation discusses the Columbia World Class Supercomputer, one of the world's fastest supercomputers, providing 61 TFLOPS (as of 10/20/04). It was conceived, designed, built, and deployed in just 120 days as a 20-node supercomputer built on proven 512-processor nodes. It is the largest SGI system in the world, with over 10,000 Intel Itanium 2 processors, and provides the largest node size incorporating commodity parts (512) and the largest shared-memory environment (2048); with 88% efficiency, it tops the scalar systems on the Top500 list.

  10. High-Level Data-Abstraction System

    NASA Technical Reports Server (NTRS)

    Fishwick, P. A.

    1986-01-01

    Communication with data-base processor flexible and efficient. High Level Data Abstraction (HILDA) system is three-layer system supporting data-abstraction features of Intel data-base processor (DBP). Purpose of HILDA establishment of flexible method of efficiently communicating with DBP. Power of HILDA lies in its extensibility with regard to syntax and semantic changes. HILDA's high-level query language readily modified. Offers powerful potential to computer sites where DBP attached to DEC VAX-series computer. HILDA system written in Pascal and FORTRAN 77 for interactive execution.

  11. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Zhang, Yao; Balaprakash, Prasanna; Meng, Jiayuan

    We present Raexplore, a performance modeling framework for architecture exploration. Raexplore enables rapid, automated, and systematic search of architecture design space by combining hardware counter-based performance characterization and analytical performance modeling. We demonstrate Raexplore for two recent manycore processors, the IBM Blue Gene/Q compute chip and the Intel Xeon Phi, targeting a set of scientific applications. Our framework is able to capture complex interactions between architectural components including instruction pipeline, cache, and memory, and to achieve a 3–22% error for same-architecture and cross-architecture performance predictions. Furthermore, we apply our framework to assess the two processors, and discover and evaluate a list of architectural scaling options for future processor designs.

  12. Riding the Technology Wave.

    ERIC Educational Resources Information Center

    Malan, Pierre

    This paper presents an overview of information technology development. The first section sets the scene, comparing the first WAN (Wide Area Network) and Intel processor to current technology. The birth of the microcomputer is described in the second section, including historical background on semiconductors, microprocessors, and the microcomputer.…

  13. Digital Circuit Analysis Using an 8080 Processor.

    ERIC Educational Resources Information Center

    Greco, John; Stern, Kenneth

    1983-01-01

    Presents the essentials of a program written in Intel 8080 assembly language for the steady state analysis of a combinatorial logic gate circuit. Program features and potential modifications are considered. For example, the program could also be extended to include clocked/unclocked sequential circuits. (JN)

  14. SAD-Based Stereo Matching Using FPGAs

    NASA Astrophysics Data System (ADS)

    Ambrosch, Kristian; Humenberger, Martin; Kubinger, Wilfried; Steininger, Andreas

    In this chapter we present a field-programmable gate array (FPGA) based stereo matching architecture. This architecture uses the sum of absolute differences (SAD) algorithm and is targeted at automotive and robotics applications. The disparity maps are calculated using 450×375 input images and a disparity range of up to 150 pixels. We discuss two different implementation approaches for the SAD and analyze their resource usage. Furthermore, block sizes ranging from 3×3 up to 11×11 and their impact on the consumed logic elements as well as on the disparity map quality are discussed. The stereo matching architecture enables a frame rate of up to 600 fps by calculating the data in a highly parallel and pipelined fashion. This way, a software solution optimized by using Intel's Open Source Computer Vision Library running on an Intel Pentium 4 with 3 GHz clock frequency is outperformed by a factor of 400.
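
    The SAD cost named in the abstract above can be illustrated in software. The following is a minimal pure-Python sketch, not the authors' FPGA design: the function names, the toy image generator, and the small block size and disparity range are all assumptions chosen for illustration.

```python
def sad(left, right, y, x_left, x_right, half):
    """Sum of absolute differences between two (2*half+1)-square blocks."""
    total = 0
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            total += abs(left[y + dy][x_left + dx] - right[y + dy][x_right + dx])
    return total


def disparity_at(left, right, y, x, half=1, max_disp=4):
    """Return the disparity d in [0, max_disp] that minimizes the SAD cost.

    The block centered at (y, x) in the left image is compared with the
    block centered at (y, x - d) in the right image.
    """
    best_d, best_cost = 0, None
    for d in range(max_disp + 1):
        if x - d - half < 0:  # candidate block would leave the image
            break
        cost = sad(left, right, y, x, x - d, half)
        if best_cost is None or cost < best_cost:
            best_d, best_cost = d, cost
    return best_d


def make_pair(width=12, height=5, shift=2):
    """Toy rectified pair (hypothetical data): left is right shifted by `shift` pixels."""
    right = [[(7 * x * x + 3 * y) % 31 for x in range(width)] for y in range(height)]
    left = [[right[y][max(x - shift, 0)] for x in range(width)] for y in range(height)]
    return left, right


left_img, right_img = make_pair()
print(disparity_at(left_img, right_img, y=2, x=6))
```

    Every candidate disparity is evaluated independently of the others, which is the property a parallel, pipelined hardware implementation can exploit; this sequential loop only shows the arithmetic.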

  15. On-chip programmable ultra-wideband microwave photonic phase shifter and true time delay unit.

    PubMed

    Burla, Maurizio; Cortés, Luis Romero; Li, Ming; Wang, Xu; Chrostowski, Lukas; Azaña, José

    2014-11-01

    We proposed and experimentally demonstrated an ultra-broadband on-chip microwave photonic processor that can operate both as RF phase shifter (PS) and true-time-delay (TTD) line, with continuous tuning. The processor is based on a silicon dual-phase-shifted waveguide Bragg grating (DPS-WBG) realized with a CMOS compatible process. We experimentally demonstrated the generation of delay up to 19.4 ps over 10 GHz instantaneous bandwidth and a phase shift of approximately 160° over the bandwidth 22-29 GHz. The available RF measurement setup ultimately limits the phase shifting demonstration as the device is capable of providing up to 300° phase shift for RF frequencies over a record bandwidth approaching 1 THz.

  16. Parallelized Kalman-Filter-Based Reconstruction of Particle Tracks on Many-Core Processors and GPUs

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Cerati, Giuseppe; Elmer, Peter; Krutelyov, Slava

    2017-01-01

    For over a decade now, physical and energy constraints have limited clock speed improvements in commodity microprocessors. Instead, chipmakers have been pushed into producing lower-power, multi-core processors such as Graphical Processing Units (GPU), ARM CPUs, and Intel MICs. Broad-based efforts from manufacturers and developers have been devoted to making these processors user-friendly enough to perform general computations. However, extracting performance from a larger number of cores, as well as specialized vector or SIMD units, requires special care in algorithm design and code optimization. One of the most computationally challenging problems in high-energy particle experiments is finding and fitting the charged-particle tracks during event reconstruction. This is expected to become by far the dominant problem at the High-Luminosity Large Hadron Collider (HL-LHC), for example. Today the most common track finding methods are those based on the Kalman filter. Experience with Kalman techniques on real tracking detector systems has shown that they are robust and provide high physics performance. This is why they are currently in use at the LHC, both in the trigger and offline. Previously we reported on the significant parallel speedups that resulted from our investigations to adapt Kalman filters to track fitting and track building on Intel Xeon and Xeon Phi. Here, we discuss our progress toward the understanding of these processors and the new developments to port the Kalman filter to NVIDIA GPUs.

  17. Parallelized Kalman-Filter-Based Reconstruction of Particle Tracks on Many-Core Processors and GPUs

    NASA Astrophysics Data System (ADS)

    Cerati, Giuseppe; Elmer, Peter; Krutelyov, Slava; Lantz, Steven; Lefebvre, Matthieu; Masciovecchio, Mario; McDermott, Kevin; Riley, Daniel; Tadel, Matevž; Wittich, Peter; Würthwein, Frank; Yagil, Avi

    2017-08-01

    For over a decade now, physical and energy constraints have limited clock speed improvements in commodity microprocessors. Instead, chipmakers have been pushed into producing lower-power, multi-core processors such as Graphical Processing Units (GPU), ARM CPUs, and Intel MICs. Broad-based efforts from manufacturers and developers have been devoted to making these processors user-friendly enough to perform general computations. However, extracting performance from a larger number of cores, as well as specialized vector or SIMD units, requires special care in algorithm design and code optimization. One of the most computationally challenging problems in high-energy particle experiments is finding and fitting the charged-particle tracks during event reconstruction. This is expected to become by far the dominant problem at the High-Luminosity Large Hadron Collider (HL-LHC), for example. Today the most common track finding methods are those based on the Kalman filter. Experience with Kalman techniques on real tracking detector systems has shown that they are robust and provide high physics performance. This is why they are currently in use at the LHC, both in the trigger and offline. Previously we reported on the significant parallel speedups that resulted from our investigations to adapt Kalman filters to track fitting and track building on Intel Xeon and Xeon Phi. Here, we discuss our progress toward the understanding of these processors and the new developments to port the Kalman filter to NVIDIA GPUs.
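
    For readers unfamiliar with the Kalman filter mentioned above, the predict/update cycle can be sketched in its simplest scalar form. This is a toy illustration under stated assumptions (a one-dimensional, constant state with process noise `q` and measurement noise `r`, all names hypothetical); real track fitting uses multi-dimensional state vectors and covariance matrices, and nothing here reflects the authors' code.

```python
def kalman_step(x, p, z, q, r):
    """One predict/update cycle of a scalar Kalman filter.

    x, p: previous state estimate and its variance
    z:    new measurement
    q, r: process and measurement noise variances (assumed known)
    """
    # Predict: the state model is "unchanged plus noise", so only the variance grows.
    p_pred = p + q
    # Update: blend prediction and measurement using the Kalman gain.
    gain = p_pred / (p_pred + r)
    x_new = x + gain * (z - x)
    p_new = (1.0 - gain) * p_pred
    return x_new, p_new


def kalman_filter(measurements, x0, p0, q, r):
    """Run kalman_step over a sequence and return all (estimate, variance) pairs."""
    x, p = x0, p0
    history = []
    for z in measurements:
        x, p = kalman_step(x, p, z, q, r)
        history.append((x, p))
    return history


for estimate, variance in kalman_filter([1.0, 1.0, 1.0], x0=0.0, p0=1.0, q=0.0, r=1.0):
    print(round(estimate, 4), round(variance, 4))
```

    Each step touches only a handful of small quantities, so in a tracking setting many track candidates can in principle be advanced side by side in vector lanes or GPU threads.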

  18. Implementation and Optimization of miniGMG - a Compact Geometric Multigrid Benchmark

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Williams, Samuel; Kalamkar, Dhiraj; Singh, Amik

    2012-12-01

    Multigrid methods are widely used to accelerate the convergence of iterative solvers for linear systems used in a number of different application areas. In this report, we describe miniGMG, our compact geometric multigrid benchmark designed to proxy the multigrid solves found in AMR applications. We explore optimization techniques for geometric multigrid on existing and emerging multicore systems including the Opteron-based Cray XE6, Intel Sandy Bridge and Nehalem-based Infiniband clusters, as well as manycore-based architectures including NVIDIA's Fermi and Kepler GPUs and Intel's Knights Corner (KNC) co-processor. This report examines a variety of novel techniques including communication-aggregation, threaded wavefront-based DRAM communication-avoiding, dynamic threading decisions, SIMDization, and fusion of operators. We quantify performance through each phase of the V-cycle for both single-node and distributed-memory experiments and provide detailed analysis for each class of optimization. Results show our optimizations yield significant speedups across a variety of subdomain sizes while simultaneously demonstrating the potential of multi- and manycore processors to dramatically accelerate single-node performance. However, our analysis also indicates that improvements in networks and communication will be essential to reap the potential of manycore processors in large-scale multigrid calculations.

  19. Using Intel's Knight Landing Processor to Accelerate Global Nested Air Quality Prediction Modeling System (GNAQPMS) Model

    NASA Astrophysics Data System (ADS)

    Wang, H.; Chen, H.; Chen, X.; Wu, Q.; Wang, Z.

    2016-12-01

    The Global Nested Air Quality Prediction Modeling System for Hg (GNAQPMS-Hg) is a global chemical transport model coupled with an Hg transport module to investigate mercury pollution. In this study, we present our work porting the GNAQPMS model to the Intel Xeon Phi processor, Knights Landing (KNL), to accelerate the model. KNL is the second-generation product adopting the Many Integrated Core (MIC) architecture. Compared with the first-generation Knights Corner (KNC), KNL has more new hardware features, and it can be used as a standalone processor as well as a coprocessor alongside another CPU. Using the VTune tool, the high-overhead modules in the GNAQPMS model were identified, including the CBMZ gas chemistry, the advection and convection module, and the wet deposition module. These high-overhead modules were accelerated by optimizing the code and using new techniques of KNL. The following optimization measures were taken: 1) changing the pure MPI parallel mode to a hybrid parallel mode with MPI and OpenMP; 2) vectorizing the code to use the 512-bit wide vector computation unit; 3) reducing unnecessary memory access and calculation; 4) reducing Thread Local Storage (TLS) for common variables with each OpenMP thread in CBMZ; 5) changing the way of global communication from file writing and reading to MPI functions. After optimization, the performance of GNAQPMS is greatly increased on both the CPU and KNL platforms. The single-node test showed that the optimized version has a 2.6x speedup on the two-socket CPU platform and a 3.3x speedup on the one-socket KNL platform compared with the baseline version of the code, which means that KNL has a 1.29x speedup when compared with the two-socket CPU platform.

  20. Closeout Report ARRA supplement to DE-FG02-08ER41546, 03/15/2010 to 03/14/2011 - Advanced Transfer Map Methods for the Description of Particle Beam Dynamics

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Berz, Martin; Makino, Kyoko

    The ARRA funds were utilized to acquire a cluster of high performance computers, consisting of one Altus 2804 Server based on a Quad AMD Opteron 6174 12C with 4 2.2 GHz nodes of 12 cores each, resulting in 48 directly usable cores; as well as a Relion 1751 Server using an Intel Xeon X5677 consisting of 4 3.46 GHz cores supporting 8 threads. Both systems run the Unix flavor CentOS, which is designed for use without need of updates, which greatly enhances their reliability. The systems are used to operate our COSY INFINITY environment which supports MPI parallelization. The units arrived at MSU in September 2010, and were taken into operation shortly thereafter.

  1. Parallel Mutual Information Based Construction of Genome-Scale Networks on the Intel® Xeon Phi™ Coprocessor.

    PubMed

    Misra, Sanchit; Pamnany, Kiran; Aluru, Srinivas

    2015-01-01

    Construction of whole-genome networks from large-scale gene expression data is an important problem in systems biology. While several techniques have been developed, most cannot handle network reconstruction at the whole-genome scale, and the few that can, require large clusters. In this paper, we present a solution on the Intel Xeon Phi coprocessor, taking advantage of its multi-level parallelism including many x86-based cores, multiple threads per core, and vector processing units. We also present a solution on the Intel® Xeon® processor. Our solution is based on TINGe, a fast parallel network reconstruction technique that uses mutual information and permutation testing for assessing statistical significance. We demonstrate the first ever inference of a plant whole genome regulatory network on a single chip by constructing a 15,575 gene network of the plant Arabidopsis thaliana from 3,137 microarray experiments in only 22 minutes. In addition, our optimization for parallelizing mutual information computation on the Intel Xeon Phi coprocessor holds out lessons that are applicable to other domains.
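
    The combination of mutual information and permutation testing described above can be sketched for a single gene pair. This is a simplified illustration, not TINGe itself: it swaps in a plain plug-in (histogram) estimator on already-discretized values, and the function names, trial count, and seed are assumptions.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in mutual information (in bits) of two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x,y) * log2(p(x,y) / (p(x) * p(y))), written in terms of counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def permutation_p_value(xs, ys, trials=200, seed=0):
    """Estimate how often shuffling ys yields MI at least as large as observed."""
    rng = random.Random(seed)
    observed = mutual_information(xs, ys)
    shuffled = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if mutual_information(xs, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (trials + 1)


expr_a = [0, 1] * 8          # hypothetical discretized expression levels
expr_b = list(expr_a)        # perfectly dependent on expr_a
print(mutual_information(expr_a, expr_b))
print(permutation_p_value(expr_a, expr_b))
```

    A whole-genome network repeats this computation for every pair of genes; the pairs are independent of one another, which is what makes the problem amenable to many cores and threads.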

  2. Parallel volume ray-casting for unstructured-grid data on distributed-memory architectures

    NASA Technical Reports Server (NTRS)

    Ma, Kwan-Liu

    1995-01-01

    As computing technology continues to advance, computational modeling of scientific and engineering problems produces data of increasing complexity: large in size and unstructured in shape. Volume visualization of such data is a challenging problem. This paper proposes a distributed parallel solution that makes ray-casting volume rendering of unstructured-grid data practical. Both the data and the rendering process are distributed among processors. At each processor, ray-casting of local data is performed independent of the other processors. The global image composing processes, which require inter-processor communication, are overlapped with the local ray-casting processes to achieve maximum parallel efficiency. This algorithm differs from previous ones in four ways: it is completely distributed, less view-dependent, reasonably scalable, and flexible. Without using dynamic load balancing, test results on the Intel Paragon using from two to 128 processors show, on average, about 60% parallel efficiency.

  3. Quark structure of static correlators in high temperature QCD

    NASA Astrophysics Data System (ADS)

    Bernard, Claude; DeGrand, Thomas A.; DeTar, Carleton; Gottlieb, Steven; Krasnitz, A.; Ogilvie, Michael C.; Sugar, R. L.; Toussaint, D.

    1992-07-01

    We present results of numerical simulations of quantum chromodynamics at finite temperature with two flavors of Kogut-Susskind quarks on the Intel iPSC/860 parallel processor. We investigate the properties of the objects whose exchange gives static screening lengths by reconstructing their correlated quark-antiquark structure.

  4. Ghost writer | ASCR Discovery

    Science.gov Websites

    the one illustrated here, the outer membrane protein OprF of Pseudomonas aeruginosa in its -1990s, NWChem was designed to run on networked processors, as in an HPC system, using one-sided communication, says Jeff Hammond of Intel Corp.'s Parallel Computing Laboratory. In one-sided communication, a

  5. DD-αAMG on QPACE 3

    NASA Astrophysics Data System (ADS)

    Georg, Peter; Richtmann, Daniel; Wettig, Tilo

    2018-03-01

    We describe our experience porting the Regensburg implementation of the DD-αAMG solver from QPACE 2 to QPACE 3. We first review how the code was ported from the first generation Intel Xeon Phi processor (Knights Corner) to its successor (Knights Landing). We then describe the modifications in the communication library necessitated by the switch from InfiniBand to Omni-Path. Finally, we present the performance of the code on a single processor as well as the scaling on many nodes, where in both cases the speedup factor is close to the theoretical expectations.

  6. The Department of Defense Superconductivity Research and Development (DSRD) Options. A Study of Possible Directions for Exploitation of Superconductivity in Military Applications.

    DTIC Science & Technology

    1987-07-01

    transmission lines Low-noise mm wave detectors, mixers and amplifiers Multi-GHz chirp transform processors High performance small antenna arrays Multi-GHz A/D...attractive alternative. The overall advantages for HTS mm wave receivers are very low quantum-limited noise, wide bandwidth, low electrical power...0 0 3 2 1 6 6.3A 0 0 0 2 -3 S Total 2 2 4 S 4 17 116 10, ELF Communication (far term). Extremely low frequency communication via magnetic wave has

  7. Emerging Radio and Manet Technology Study: Research Support for a Survey of State-of-the-art Commercial and Military Hardware/Software for Mobile Ad Hoc Networks

    DTIC Science & Technology

    2014-10-01

    44 Table 19: Raspberry Pi Information...boards – These are single board devices targeted to education and embedding, the best known being the Raspberry Pi; and 3. Development boards – These...popular, as it has high performance processor (perhaps 4 times the power of a Raspberry Pi) with dual core processors running at 1.6 GHz and the cost is

  8. A High Performance Block Eigensolver for Nuclear Configuration Interaction Calculations

    DOE PAGES

    Aktulga, Hasan Metin; Afibuzzaman, Md.; Williams, Samuel; ...

    2017-06-01

    As on-node parallelism increases and the performance gap between the processor and the memory system widens, achieving high performance in large-scale scientific applications requires an architecture-aware design of algorithms and solvers. We focus on the eigenvalue problem arising in nuclear Configuration Interaction (CI) calculations, where a few extreme eigenpairs of a sparse symmetric matrix are needed. Here, we consider a block iterative eigensolver whose main computational kernels are the multiplication of a sparse matrix with multiple vectors (SpMM), and tall-skinny matrix operations. We then present techniques to significantly improve the SpMM and the transpose operation SpMM^T by using the compressed sparse blocks (CSB) format. We achieve 3-4× speedup on the requisite operations over good implementations with the commonly used compressed sparse row (CSR) format. We develop a performance model that allows us to correctly estimate the performance of our SpMM kernel implementations, and we identify cache bandwidth as a potential performance bottleneck beyond DRAM. We also analyze and optimize the performance of LOBPCG kernels (inner product and linear combinations on multiple vectors) and show up to 15× speedup over using high performance BLAS libraries for these operations. The resulting high performance LOBPCG solver achieves 1.4× to 1.8× speedup over the existing Lanczos solver on a series of CI computations on high-end multicore architectures (Intel Xeons). We also analyze the performance of our techniques on an Intel Xeon Phi Knights Corner (KNC) processor.
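
    To make the SpMM kernel concrete, here is a minimal pure-Python sketch of the CSR baseline that the abstract compares against: a sparse matrix in compressed sparse row form multiplied by a block of dense vectors. It does not implement the CSB format or any of the authors' optimizations; the array names follow common CSR convention and the example matrix is made up.

```python
def csr_spmm(indptr, indices, data, dense, n_vectors):
    """Multiply a CSR sparse matrix by a dense block of vectors.

    indptr, indices, data: standard CSR arrays of the sparse matrix A
    dense: row-major list of rows, one row per column of A, each with n_vectors entries
    Returns A @ dense as a list of rows.
    """
    n_rows = len(indptr) - 1
    out = [[0.0] * n_vectors for _ in range(n_rows)]
    for i in range(n_rows):
        row_out = out[i]
        # Nonzeros of row i live in positions indptr[i] .. indptr[i+1]-1.
        for k in range(indptr[i], indptr[i + 1]):
            a = data[k]
            row_in = dense[indices[k]]
            # Each nonzero is reused across all vectors in the block.
            for j in range(n_vectors):
                row_out[j] += a * row_in[j]
    return out


# A = [[2, 0, 1],
#      [0, 3, 0]]
indptr, indices, data = [0, 2, 3], [0, 2, 1], [2.0, 1.0, 3.0]
block = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(csr_spmm(indptr, indices, data, block, n_vectors=2))
```

    Because each loaded nonzero is applied to every vector in the block, SpMM performs more arithmetic per byte of matrix data than a single sparse matrix-vector product, which is one reason block eigensolvers suit memory-bound machines.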

  9. A High Performance Block Eigensolver for Nuclear Configuration Interaction Calculations

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Aktulga, Hasan Metin; Afibuzzaman, Md.; Williams, Samuel

    As on-node parallelism increases and the performance gap between the processor and the memory system widens, achieving high performance in large-scale scientific applications requires an architecture-aware design of algorithms and solvers. We focus on the eigenvalue problem arising in nuclear Configuration Interaction (CI) calculations, where a few extreme eigenpairs of a sparse symmetric matrix are needed. Here, we consider a block iterative eigensolver whose main computational kernels are the multiplication of a sparse matrix with multiple vectors (SpMM), and tall-skinny matrix operations. We then present techniques to significantly improve the SpMM and the transpose operation SpMM^T by using the compressed sparse blocks (CSB) format. We achieve 3-4× speedup on the requisite operations over good implementations with the commonly used compressed sparse row (CSR) format. We develop a performance model that allows us to correctly estimate the performance of our SpMM kernel implementations, and we identify cache bandwidth as a potential performance bottleneck beyond DRAM. We also analyze and optimize the performance of LOBPCG kernels (inner product and linear combinations on multiple vectors) and show up to 15× speedup over using high performance BLAS libraries for these operations. The resulting high performance LOBPCG solver achieves 1.4× to 1.8× speedup over the existing Lanczos solver on a series of CI computations on high-end multicore architectures (Intel Xeons). We also analyze the performance of our techniques on an Intel Xeon Phi Knights Corner (KNC) processor.

  10. Hybrid Computational Architecture for Multi-Scale Modeling of Materials and Devices

    DTIC Science & Technology

    2016-01-03

    Equivalent: Total Number: Sub Contractors (DD882) Names of Faculty Supported Names of Under Graduate students supported Names of Personnel receiving masters...GHz, 20 cores (40 with hyper-threading (HT))

    Single node performance

    Node     # of cores    Total CPU time  User CPU time  System CPU time  Elapsed time
    INTEL20  40 (with HT)  534.785         529.984        4.800            541.179
             20            468.873         466.119        2.754            476.878
             10            671.798         669.653        2.145            680.510
             8             772.269         770.256        2.013

  11. Simulation of Fault Tolerance in a Hypercube Arrangement of Discrete Processors.

    DTIC Science & Technology

    1987-12-01

    Geometric Properties; Binary Properties; Intel Hypercube Hardware Arrangement; IV. Cube-Connected... Properties of the CCC; CCC Redundancy; V. Re-Configurable Cube-Connected Cycles; Global... List of Figures: Figure 1: Hypercubes of Different Dimensions; Figure 2: Hypercube Properties

  12. Peregrine System | High-Performance Computing | NREL

    Science.gov Websites

    ) and longer-term (/projects) storage. These file systems are mounted on all nodes. Peregrine has three -2670 Xeon processors and 64 GB of memory. In addition to mounting the /home, /nopt, /projects and # cores/node Memory/node Peak (DP) performance per node 88 Intel Xeon E5-2670 "Sandy Bridge" 8

  13. A Survey of Recent MARTe Based Systems

    NASA Astrophysics Data System (ADS)

    Neto, André C.; Alves, Diogo; Boncagni, Luca; Carvalho, Pedro J.; Valcarcel, Daniel F.; Barbalace, Antonio; De Tommasi, Gianmaria; Fernandes, Horácio; Sartori, Filippo; Vitale, Enzo; Vitelli, Riccardo; Zabeo, Luca

    2011-08-01

    The Multithreaded Application Real-Time executor (MARTe) is a data driven framework environment for the development and deployment of real-time control algorithms. The main ideas which led to the present version of the framework were to standardize the development of real-time control systems, while providing a set of strictly bounded standard interfaces to the outside world and also accommodating a collection of facilities which promote the speed and ease of development, commissioning and deployment of such systems. At the core of every MARTe based application, is a set of independent inter-communicating software blocks, named Generic Application Modules (GAM), orchestrated by a real-time scheduler. The platform independence of its core library provides MARTe the necessary robustness and flexibility for conveniently testing applications in different environments including non-real-time operating systems. MARTe is already being used in several machines, each with its own peculiarities regarding hardware interfacing, supervisory control configuration, operating system and target control application. This paper presents and compares the most recent results of systems using MARTe: the JET Vertical Stabilization system, which uses the Real Time Application Interface (RTAI) operating system on Intel multi-core processors; the COMPASS plasma control system, driven by Linux RT also on Intel multi-core processors; ISTTOK real-time tomography equilibrium reconstruction which shares the same support configuration of COMPASS; JET error field correction coils based on VME, PowerPC and VxWorks; FTU LH reflected power system running on VME, Intel with RTAI.

  14. Analysis of EDP performance

    NASA Technical Reports Server (NTRS)

    1994-01-01

    The objective of this contract was the investigation of the potential performance gains that would result from an upgrade of the Space Station Freedom (SSF) Data Management System (DMS) Embedded Data Processor (EDP) '386' design with the Intel Pentium (registered trade-mark of Intel Corp.) '586' microprocessor. The Pentium ('586') is the latest member of the industry standard Intel X86 family of CISC (Complex Instruction Set Computer) microprocessors. This contract was scheduled to run in parallel with an internal IBM Federal Systems Company (FSC) Internal Research and Development (IR&D) task that had the goal to generate a baseline flight design for an upgraded EDP using the Pentium. This final report summarizes the activities performed in support of Contract NAS2-13758. Our plan was to baseline performance analyses and measurements on the latest state-of-the-art commercially available Pentium processor, representative of the proposed space station design, and then phase to an IBM capital funded breadboard version of the flight design (if available from IR&D and Space Station work) for additional evaluation of results. Unfortunately, the phase-over to the flight design breadboard did not take place, since the IBM Data Management System (DMS) for the Space Station Freedom was terminated by NASA before the referenced capital funded EDP breadboard could be completed. The baseline performance analyses and measurements, however, were successfully completed, as planned, on the commercial Pentium hardware. The results of those analyses, evaluations, and measurements are presented in this final report.

  15. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sancho Pitarch, Jose Carlos; Kerbyson, Darren; Lang, Mike

    Increasing the core-count on current and future processors is posing critical challenges to the memory subsystem to efficiently handle concurrent memory requests. The current trend to cope with this challenge is to increase the number of memory channels available to the processor's memory controller. In this paper we investigate the effectiveness of this approach on the performance of parallel scientific applications. Specifically, we explore the trade-off between employing multiple memory channels per memory controller and the use of multiple memory controllers. Experiments conducted on two current state-of-the-art multicore processors, a 6-core AMD Istanbul and a 4-core Intel Nehalem-EP, for a wide range of production applications show that there is a diminishing return when increasing the number of memory channels per memory controller. In addition, we show that this performance degradation can be efficiently addressed by increasing the ratio of memory controllers to channels while keeping the number of memory channels constant. Significant performance improvements, up to 28%, can be achieved in this scheme when using two memory controllers, each with one channel, compared with one controller with two memory channels.

  16. Study of Thread Level Parallelism in a Video Encoding Application for Chip Multiprocessor Design

    NASA Astrophysics Data System (ADS)

    Debes, Eric; Kaine, Greg

    2002-11-01

    In media applications there is a high level of available thread level parallelism (TLP). In this paper we study the intra TLP in a video encoder. We show that a well-distributed highly optimized encoder running on a symmetric multiprocessor (SMP) system can run 3.2 times faster on a 4-way SMP machine than on a single processor. The multithreaded encoder running on an SMP system is then used to understand the requirements of a chip multiprocessor (CMP) architecture, which is one possible architectural direction to better exploit TLP. In the framework of this study, we use a software approach to evaluate the dataflow between processors for the video encoder running on an SMP system. An estimation of the dataflow is done with L2 cache miss event counters using the Intel® VTune™ performance analyzer. The experimental measurements are compared to theoretical results.
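
    The reported 3.2-fold speedup on a 4-way machine can be put in perspective with two standard figures of merit. The sketch below computes parallel efficiency and the Karp-Flatt experimentally determined serial fraction; neither metric is quoted in the abstract, so their use here is an assumption made purely for illustration.

```python
def parallel_efficiency(speedup, processors):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / processors


def karp_flatt_serial_fraction(speedup, processors):
    """Karp-Flatt metric: the serial fraction implied by a measured speedup."""
    return (1.0 / speedup - 1.0 / processors) / (1.0 - 1.0 / processors)


s, p = 3.2, 4  # figures taken from the abstract above
print(f"efficiency      = {parallel_efficiency(s, p):.3f}")
print(f"serial fraction = {karp_flatt_serial_fraction(s, p):.4f}")
```

    With these inputs the efficiency is 0.8, and the implied serial fraction is roughly 8%, which lumps together truly sequential code and overheads such as synchronization and inter-processor data traffic.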

  17. Towards Highly Scalable Ab Initio Molecular Dynamics (AIMD) Simulations on the Intel Knights Landing Manycore Processor

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Jacquelin, Mathias; De Jong, Wibe A.; Bylaska, Eric J.

    2017-07-03

    The Ab Initio Molecular Dynamics (AIMD) method allows scientists to treat the dynamics of molecular and condensed phase systems while retaining a first-principles-based description of their interactions. This extremely important method has tremendous computational requirements, because the electronic Schrödinger equation, approximated using Kohn-Sham Density Functional Theory (DFT), is solved at every time step. With the advent of manycore architectures, application developers have a significant amount of processing power within each compute node that can only be exploited through massive parallelism. A compute intensive application such as AIMD forms a good candidate to leverage this processing power. In this paper, we focus on adding thread level parallelism to the plane wave DFT methodology implemented in NWChem. Through a careful optimization of tall-skinny matrix products, which are at the heart of the Lagrange multiplier and nonlocal pseudopotential kernels, as well as 3D FFTs, our OpenMP implementation delivers excellent strong scaling on the latest Intel Knights Landing (KNL) processor. We assess the efficiency of our Lagrange multiplier kernels by building a Roofline model of the platform, and verify that our implementation is close to the roofline for various problem sizes. Finally, we present strong scaling results on the complete AIMD simulation for a 64 water molecules test case, that scales up to all 68 cores of the Knights Landing processor.
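
    The Roofline model referred to above bounds attainable performance by the lesser of peak compute throughput and memory bandwidth times arithmetic intensity. The sketch below shows that bound in a few lines; the peak and bandwidth values are placeholders chosen for easy arithmetic, not measured Knights Landing figures, and the function names are hypothetical.

```python
def roofline_gflops(peak_gflops, bandwidth_gbs, intensity_flops_per_byte):
    """Attainable performance under the basic Roofline model."""
    return min(peak_gflops, bandwidth_gbs * intensity_flops_per_byte)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity at which a kernel stops being bandwidth-bound."""
    return peak_gflops / bandwidth_gbs


PEAK = 1000.0      # GFLOP/s, placeholder value
BANDWIDTH = 100.0  # GB/s, placeholder value

for intensity in (0.5, 2.0, 10.0, 50.0):
    bound = roofline_gflops(PEAK, BANDWIDTH, intensity)
    print(f"intensity {intensity:5.1f} flop/byte -> at most {bound:7.1f} GFLOP/s")
print("ridge point:", ridge_point(PEAK, BANDWIDTH), "flop/byte")
```

    A kernel whose measured throughput sits near this bound for its arithmetic intensity is using the machine about as well as the model allows, which is the sense in which an implementation is said to be close to the roofline.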

  18. PSsolver: A Maple implementation to solve first order ordinary differential equations with Liouvillian solutions

    NASA Astrophysics Data System (ADS)

    Avellar, J.; Duarte, L. G. S.; da Mota, L. A. C. P.

    2012-10-01

    We present a set of software routines in Maple 14 for solving first order ordinary differential equations (FOODEs). The package implements the Prelle-Singer method in its original form together with its extension to include integrating factors in terms of elementary functions. The package also presents a theoretical extension to deal with all FOODEs presenting Liouvillian solutions. Applications to ODEs taken from standard references show that it solves ODEs which remain unsolved using Maple's standard ODE solution routines. New version program summary Program title: PSsolver Catalogue identifier: ADPR_v2_0 Program summary URL: http://cpc.cs.qub.ac.uk/summaries/ADPR_v2_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 2302 No. of bytes in distributed program, including test data, etc.: 31962 Distribution format: tar.gz Programming language: Maple 14 (also tested using Maple 15 and 16). Computer: Intel Pentium Processor P6000, 1.86 GHz. Operating system: Windows 7. RAM: 4 GB DDR3 Memory Classification: 4.3. Catalogue identifier of previous version: ADPR_v1_0 Journal reference of previous version: Comput. Phys. Comm. 144 (2002) 46 Does the new version supersede the previous version?: Yes Nature of problem: Symbolic solution of first order differential equations via the Prelle-Singer method. Solution method: The method of solution is based on the standard Prelle-Singer method, with extensions for the cases when the FOODE contains elementary functions. Additionally, an extension of our own which solves FOODEs with Liouvillian solutions is included. Reasons for new version: The program was not running anymore due to changes in the latest versions of Maple. Additionally, we corrected/changed some bugs/details that were hampering the smoother functioning of the routines. 
Summary of revisions: • Over time, many Maple commands were deprecated. To make the program run with the newer versions, we have checked and changed some of those. For instance, the command sum had changed, and some program lines were substituted so that the package works properly. • In the old version the user had to supply the degree of the Darboux polynomials to be determined. In the present version the user can set the degree by typing Deg = number in the command call (e.g., PSsolve(ode, Deg = 3); telling the command PSsolve that it must use Darboux polynomials of degree up to three). If the user does not specify the degree, the routines use degree 1 as the default. Restrictions: If the integrating factor for the FOODE under consideration has factors of high degree in the dependent and independent variables and in the elementary functions appearing in the FOODE, the package may spend a long time finding the solution. Also, when dealing with FOODEs containing elementary functions, it is essential that the algebraic dependency between them is recognized. If that does not happen, our program can miss some solutions. Unusual features: Our implementation of the Prelle-Singer approach not only solves FOODEs, but can also be used as a research tool that allows the user to follow all the steps of the procedure. For example, the Darboux polynomials (eigenpolynomials) of the D-operator associated with a FOODE (see Section 4) can be calculated. In addition, our package is successful in solving FOODEs that were not solved by some of the most commonly available solvers. Finally, our package implements a theoretical extension (for details, see [1,2]) to the original Prelle-Singer approach that enhances its scope, allowing it to tackle some FOODEs whose solutions involve non-elementary Liouvillian functions. 
Running time: This depends strongly on the FOODE, but is usually under 2 seconds when running our 'arena' test file: the nonlinear FOODEs presented in the book by Kamke [3]. These times were obtained using an Intel Pentium Processor P6000, 1.86 GHz, with 4 GB RAM. References: [1] M. Singer, Liouvillian first integrals of differential equations, Trans. Amer. Math. Soc. 333 (1992) 673-688. [2] L.G.S. Duarte, S.E.S. Duarte, L.A.C.P. da Mota, J.E.F. Skea, A method to tackle first order ordinary differential equations with Liouvillian functions in the solution, J. Phys. A: Math. Gen. 35 (17) (2002) 3899-3910. [3] E. Kamke, Differentialgleichungen: Lösungsmethoden und Lösungen, Chelsea Publishing Co., New York, 1959.

  19. A sweep algorithm for massively parallel simulation of circuit-switched networks

    NASA Technical Reports Server (NTRS)

    Gaujal, Bruno; Greenberg, Albert G.; Nicol, David M.

    1992-01-01

    A new massively parallel algorithm is presented for simulating large asymmetric circuit-switched networks, controlled by a randomized-routing policy that includes trunk-reservation. A single instruction multiple data (SIMD) implementation is described, and corresponding experiments on a 16384 processor MasPar parallel computer are reported. A multiple instruction multiple data (MIMD) implementation is also described, and corresponding experiments on an Intel iPSC/860 parallel computer, using 16 processors, are reported. By exploiting parallelism, our algorithm increases the possible execution rate of such complex simulations by as much as an order of magnitude.

  20. A high-speed digital signal processor for atmospheric radar, part 7.3A

    NASA Technical Reports Server (NTRS)

    Brosnahan, J. W.; Woodard, D. M.

    1984-01-01

    The Model SP-320 device is a monolithic realization of a complex general purpose signal processor, incorporating such features as a 32-bit ALU, a 16-bit x 16-bit combinatorial multiplier, and a 16-bit barrel shifter. The SP-320 is designed to operate as a slave processor to a host general purpose computer in applications such as coherent integration of a radar return signal in multiple ranges, or dedicated FFT processing. Presently available is an I/O module conforming to the Intel Multichannel interface standard; other I/O modules will be designed to meet specific user requirements. The main processor board includes input and output FIFO (First In First Out) memories, both with depths of 4096 W, to permit asynchronous operation between the source of data and the host computer. This design permits burst data rates in excess of 5 MW/s.

  1. Parallelization of a Monte Carlo particle transport simulation code

    NASA Astrophysics Data System (ADS)

    Hadjidoukas, P.; Bousis, C.; Emfietzoglou, D.

    2010-05-01

    We have developed a high performance version of the Monte Carlo particle transport simulation code MC4. The original application code, developed in Visual Basic for Applications (VBA) for Microsoft Excel, was first rewritten in the C programming language to improve code portability. Several pseudo-random number generators have also been integrated and studied. The new MC4 version was then parallelized for shared and distributed-memory multiprocessor systems using the Message Passing Interface. Two parallel pseudo-random number generator libraries (SPRNG and DCMT) have been seamlessly integrated. The performance speedup of parallel MC4 has been studied on a variety of parallel computing architectures including an Intel Xeon server with 4 dual-core processors, a Sun cluster consisting of 16 nodes of 2 dual-core AMD Opteron processors and a 200 dual-processor HP cluster. For large problem sizes, which are limited only by the physical memory of the multiprocessor server, the speedup results are almost linear on all systems. We have validated the parallel implementation against the serial VBA and C implementations using the same random number generator. Our experimental results on the transport and energy loss of electrons in a water medium show that the serial and parallel codes are equivalent in accuracy. The present improvements allow the study of higher particle energies with more accurate physical models, and improve statistics because more particle tracks can be simulated with low response time.

  2. Baseband processor development/test performance for 30/20 GHz SS-TDMA communication system

    NASA Technical Reports Server (NTRS)

    Brown, L.; Sabourin, D.; Attwood, S.

    1984-01-01

    The baseband processor (BBP) development for the 30/20 GHz Satellite Communication System is described. The SS-TDMA concept for future satellite communications is reviewed, describing the overall system, the satellite payload, and the frequency plan. A brief general description of the BBP is given, and the proof-of-concept model of the BBP is summarized. Key technologies and custom LSI developed for the BBP are listed. Finally, key technology developments and test data are reported for the BBP.

  3. Applications and development of communication models for the touchstone GAMMA and DELTA prototypes

    NASA Technical Reports Server (NTRS)

    Seidel, Steven R.

    1993-01-01

    The goal of this project was to develop models of the interconnection networks of the Intel iPSC/860 and DELTA multicomputers to guide the design of efficient algorithms for interprocessor communication in problems that commonly occur in CFD codes and other applications. Interprocessor communication costs of codes for message-passing architectures such as the iPSC/860 and DELTA significantly affect the level of performance that can be obtained from those machines. This project addressed several specific problems in the achievement of efficient communication on the Intel iPSC/860 hypercube and DELTA mesh. In particular, an efficient global processor synchronization algorithm was developed for the iPSC/860 and numerous broadcast algorithms were designed for the DELTA.

  4. Modern multicore and manycore architectures: Modelling, optimisation and benchmarking a multiblock CFD code

    NASA Astrophysics Data System (ADS)

    Hadade, Ioan; di Mare, Luca

    2016-08-01

    Modern multicore and manycore processors exhibit multiple levels of parallelism through a wide range of architectural features such as SIMD for data parallel execution or threads for core parallelism. The exploitation of multi-level parallelism is therefore crucial for achieving superior performance on current and future processors. This paper presents the performance tuning of a multiblock CFD solver on Intel SandyBridge and Haswell multicore CPUs and the Intel Xeon Phi Knights Corner coprocessor. Code optimisations have been applied on two computational kernels exhibiting different computational patterns: the update of flow variables and the evaluation of the Roe numerical fluxes. We discuss at great length the code transformations required for achieving efficient SIMD computations for both kernels across the selected devices including SIMD shuffles and transpositions for flux stencil computations and global memory transformations. Core parallelism is expressed through threading based on a number of domain decomposition techniques together with optimisations pertaining to alleviating NUMA effects found in multi-socket compute nodes. Results are correlated with the Roofline performance model in order to assert their efficiency for each distinct architecture. We report significant speedups for single thread execution across both kernels: 2-5X on the multicore CPUs and 14-23X on the Xeon Phi coprocessor. Computations at full node and chip concurrency deliver a factor of three speedup on the multicore processors and up to 24X on the Xeon Phi manycore coprocessor.

  5. Parallel computing of physical maps--a comparative study in SIMD and MIMD parallelism.

    PubMed

    Bhandarkar, S M; Chirravuri, S; Arnold, J

    1996-01-01

    Ordering clones from a genomic library into physical maps of whole chromosomes presents a central computational problem in genetics. Chromosome reconstruction via clone ordering is usually isomorphic to the NP-complete Optimal Linear Arrangement problem. Parallel SIMD and MIMD algorithms for simulated annealing based on Markov chain distribution are proposed and applied to the problem of chromosome reconstruction via clone ordering. Perturbation methods and problem-specific annealing heuristics are proposed and described. The SIMD algorithms are implemented on a 2048 processor MasPar MP-2 system which is an SIMD 2-D toroidal mesh architecture whereas the MIMD algorithms are implemented on an 8 processor Intel iPSC/860 which is an MIMD hypercube architecture. A comparative analysis of the various SIMD and MIMD algorithms is presented in which the convergence, speedup, and scalability characteristics of the various algorithms are analyzed and discussed. On a fine-grained, massively parallel SIMD architecture with a low synchronization overhead such as the MasPar MP-2, a parallel simulated annealing algorithm based on multiple periodically interacting searches performs the best. For a coarse-grained MIMD architecture with high synchronization overhead such as the Intel iPSC/860, a parallel simulated annealing algorithm based on multiple independent searches yields the best results. In either case, distribution of clonal data across multiple processors is shown to exacerbate the tendency of the parallel simulated annealing algorithm to get trapped in a local optimum.
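
    The "multiple independent searches" strategy named in this abstract can be illustrated with a small sketch. It is not the authors' algorithm: the cost function (a toy linear-arrangement cost on a path graph), the swap move, the cooling schedule, and all function names are assumptions chosen for illustration, and the searches run one after another here rather than on separate processors.

```python
import math
import random


def arrangement_cost(order, edges):
    """Sum of |position(u) - position(v)| over all edges (toy linear-arrangement cost)."""
    pos = {node: i for i, node in enumerate(order)}
    return sum(abs(pos[u] - pos[v]) for u, v in edges)


def anneal(nodes, edges, seed, steps=2000, t0=2.0, cooling=0.995):
    """One simulated-annealing search with its own random stream (hypothetical parameters)."""
    rng = random.Random(seed)
    order = list(nodes)
    rng.shuffle(order)
    cost = arrangement_cost(order, edges)
    best_order, best_cost = order[:], cost
    temp = t0
    for _ in range(steps):
        # Perturbation: swap two randomly chosen positions.
        i, j = rng.sample(range(len(order)), 2)
        order[i], order[j] = order[j], order[i]
        new_cost = arrangement_cost(order, edges)
        delta = new_cost - cost
        # Metropolis rule: always accept improvements, sometimes accept worse moves.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cost = new_cost
            if cost < best_cost:
                best_order, best_cost = order[:], cost
        else:
            order[i], order[j] = order[j], order[i]  # undo the rejected swap
        temp *= cooling
    return best_order, best_cost


def independent_searches(nodes, edges, seeds):
    """Run one search per seed with no interaction, then keep the best result."""
    results = [anneal(nodes, edges, seed) for seed in seeds]
    return min(results, key=lambda result: result[1])


nodes = list(range(8))
edges = [(i, i + 1) for i in range(7)]  # path graph; no ordering can cost less than 7
best_order, best_cost = independent_searches(nodes, edges, seeds=range(4))
print(best_order, best_cost)
```

    Because the searches never exchange state, the only coordination is the final reduction over the per-search results, which is consistent with the abstract's observation that this variant suits machines with high synchronization overhead.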

  6. Equation solvers for distributed-memory computers

    NASA Technical Reports Server (NTRS)

    Storaasli, Olaf O.

    1994-01-01

    A large number of scientific and engineering problems require the rapid solution of large systems of simultaneous equations. The performance of parallel computers in this area now dwarfs traditional vector computers by nearly an order of magnitude. This talk describes the major issues involved in parallel equation solvers with particular emphasis on the Intel Paragon, IBM SP-1 and SP-2 processors.

  7. A High Performance Computing Framework for Physics-based Modeling and Simulation of Military Ground Vehicles

    DTIC Science & Technology

    2011-03-25

    number one and Nebulae at number three. Both systems rely on GPU co-processing and use Intel Xeon processors cards and NVIDIA Tesla C2050 GPUs. In...spite of a theoretical peak capability of almost 3 Petaflop/s, Nebulae clocked at 1.271 PFlop/s when running the Linpack benchmark, which puts it

  8. JPRS Report Science & Technology Europe.

    DTIC Science & Technology

    1992-09-17

    9 Jul 92] 48 HERA Project Gets Green Light for Quark Structure Analysis [Duesseldorf VDI NACHRICHTEN, 12 Jul 92] .... 48 TELECOMMUNICATIONS...communicating with the control station. The demonstrator is the product of research performed at the Robot and Artificial Intelligence Unit of...from the microphones, speedometers, or tachometers. Each board is linked to a Motorola DSP [digital signal processor]. Although the system has been

  9. Scalable Algorithms for Clustering Large Geospatiotemporal Data Sets on Manycore Architectures

    NASA Astrophysics Data System (ADS)

    Mills, R. T.; Hoffman, F. M.; Kumar, J.; Sreepathi, S.; Sripathi, V.

    2016-12-01

    The increasing availability of high-resolution geospatiotemporal data sets from sources such as observatory networks, remote sensing platforms, and computational Earth system models has opened new possibilities for knowledge discovery using data sets fused from disparate sources. Traditional algorithms and computing platforms are impractical for the analysis and synthesis of data sets of this size; however, new algorithmic approaches that can effectively utilize the complex memory hierarchies and the extremely high levels of available parallelism in state-of-the-art high-performance computing platforms can enable such analysis. We describe a massively parallel implementation of accelerated k-means clustering and some optimizations to boost computational intensity and utilization of wide SIMD lanes on state-of-the art multi- and manycore processors, including the second-generation Intel Xeon Phi ("Knights Landing") processor based on the Intel Many Integrated Core (MIC) architecture, which includes several new features, including an on-package high-bandwidth memory. We also analyze the code in the context of a few practical applications to the analysis of climatic and remotely-sensed vegetation phenology data sets, and speculate on some of the new applications that such scalable analysis methods may enable.

  10. Parallel protein secondary structure prediction based on neural networks.

    PubMed

    Zhong, Wei; Altun, Gulsah; Tian, Xinmin; Harrison, Robert; Tai, Phang C; Pan, Yi

    2004-01-01

    Protein secondary structure prediction has a fundamental influence on today's bioinformatics research. In this work, binary and tertiary classifiers for protein secondary structure prediction are implemented on the Denoeux belief neural network (DBNN) architecture. The hydrophobicity matrix, orthogonal matrix, BLOSUM62 and PSSM (position specific scoring matrix) are tested separately as encoding schemes for DBNN. The experimental results contribute to the design of new encoding schemes. A new binary classifier for Helix versus not Helix (~H) for DBNN produces a prediction accuracy of 87% when PSSM is used for the input profile. The performance of the DBNN binary classifier is comparable to that of the best other prediction methods. The good test results for binary classifiers open a new approach for protein structure prediction with neural networks. Because training the neural networks is time consuming, Pthread and OpenMP are employed to parallelize DBNN on the hyperthreading-enabled Intel architecture. The speedup for 16 Pthreads is 4.9 and the speedup for 16 OpenMP threads is 4 on the 4-processor shared memory architecture. The speedups of both OpenMP and Pthread are superior to those reported in other research. With the new parallel training algorithm, thousands of amino acids can be processed in a reasonable amount of time. Our research also shows that hyperthreading technology for the Intel architecture is efficient for parallel biological algorithms.
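
    The speedup figures quoted above can be turned into parallel efficiencies with a one-line calculation. The sketch below only uses the numbers stated in the abstract (speedups of 4.9 and 4 with 16 threads on 4 processors); the choice to normalize both by thread count and by processor count is an assumption made here for illustration, since the abstract does not say which baseline the authors prefer.

```python
def parallel_efficiency(speedup, workers):
    """Efficiency = speedup divided by the number of workers it is normalized against."""
    return speedup / workers


# Figures taken from the abstract: 16 threads on a 4-processor shared memory machine.
THREADS = 16
PROCESSORS = 4

pthread_eff_threads = parallel_efficiency(4.9, THREADS)    # per software thread
openmp_eff_threads = parallel_efficiency(4.0, THREADS)     # per software thread
pthread_eff_procs = parallel_efficiency(4.9, PROCESSORS)   # per physical processor
openmp_eff_procs = parallel_efficiency(4.0, PROCESSORS)    # per physical processor

print(f"Pthread: {pthread_eff_threads:.3f} per thread, {pthread_eff_procs:.3f} per processor")
print(f"OpenMP:  {openmp_eff_threads:.3f} per thread, {openmp_eff_procs:.3f} per processor")
```

    Normalized per processor, the Pthread figure exceeds 1, which is one way to read the abstract's claim that hyperthreading helps; normalized per thread, both values are far below 1 because 16 threads share 4 processors.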

  11. Simultaneous Range-Velocity Processing and SNR Analysis of AFIT’s Random Noise Radar

    DTIC Science & Technology

    2012-03-22

    reducing the overall processing time. Two computers, equipped with NVIDIA® GPUs, were used to process the collected data. The specifications for each...gather the results back to the CPU. Another company, AccelerEyes®, has developed a product called Jacket® that claims to be better than the parallel...Number of Processing Cores 4 8 Processor Speed 3.33 GHz 3.07 GHz Installed Memory 48 GB 48 GB GPU Make NVIDIA NVIDIA GPU Model Tesla 1060 Tesla C2070 GPU

  12. Radar systems for a polar mission, volume 3, appendices A-D, S, T

    NASA Technical Reports Server (NTRS)

    Moore, R. K.; Claassen, J. P.; Erickson, R. L.; Fong, R. K. T.; Hanson, B. C.; Komen, M. J.; Mcmillan, S. B.; Parashar, S. K.

    1976-01-01

    Success is reported in the radar monitoring of such features of sea ice as concentration, floe size, leads and other water openings, drift, topographic features such as pressure ridges and hummocks, fractures, and a qualitative indication of age and thickness. Scatterometer measurements made north of Alaska show a good correlation of the scattering coefficient with apparent thickness as deduced from ice type analysis of stereo aerial photography. Indications are that frequencies from 9 GHz upward seem to be better for sea ice radar purposes than the information gathered at 0.4 GHz by a scatterometer. Some information indicates that 1 GHz is useful, but not as useful as higher frequencies. Either form of like-polarization can be used and it appears that cross-polarization may be more useful for thickness measurement. Resolution requirements have not been fully established, but most of the systems in use have had poorer resolution than 20 meters. The radar return from sea ice is found to be much different from that from lake ice. Methods to decrease side lobe levels of the Fresnel zone-plate processor and to decrease the memory requirements of a synthetic radar processor are discussed.

  13. GPU Lossless Hyperspectral Data Compression System for Space Applications

    NASA Technical Reports Server (NTRS)

    Keymeulen, Didier; Aranki, Nazeeh; Hopson, Ben; Kiely, Aaron; Klimesh, Matthew; Benkrid, Khaled

    2012-01-01

    On-board lossless hyperspectral data compression reduces data volume in order to meet NASA and DoD limited downlink capabilities. At JPL, a novel, adaptive and predictive technique for lossless compression of hyperspectral data, named the Fast Lossless (FL) algorithm, was recently developed. This technique uses an adaptive filtering method and achieves state-of-the-art performance in both compression effectiveness and low complexity. Because of its outstanding performance and suitability for real-time onboard hardware implementation, the FL compressor is being formalized as the emerging CCSDS Standard for Lossless Multispectral & Hyperspectral image compression. The FL compressor is well-suited for parallel hardware implementation. A GPU hardware implementation was developed for FL targeting the current state-of-the-art GPUs from NVIDIA(Trademark). The GPU implementation on a NVIDIA(Trademark) GeForce(Trademark) GTX 580 achieves a throughput performance of 583.08 Mbits/sec (44.85 MSamples/sec) and an acceleration of at least 6 times a software implementation running on a 3.47 GHz single core Intel(Trademark) Xeon(Trademark) processor. This paper describes the design and implementation of the FL algorithm on the GPU. The massively parallel implementation will provide in the future a fast and practical real-time solution for airborne and space applications.

  14. Impacts of the IBM Cell Processor to Support Climate Models

    NASA Technical Reports Server (NTRS)

    Zhou, Shujia; Duffy, Daniel; Clune, Tom; Suarez, Max; Williams, Samuel; Halem, Milt

    2008-01-01

    NASA is interested in the performance and cost benefits of adapting its applications to the IBM Cell processor. However, its 256KB local memory per SPE and the new communication mechanism make it very challenging to port an application. We selected the solar radiation component of the NASA GEOS-5 climate model, which: (1) is representative of column physics (approximately 50% computational time), (2) has a high computational load relative to transferring data from and to main memory, (3) performs independent calculations across multiple columns. We converted the baseline code (single-precision, Fortran) to C, ported it by manually SIMDizing 4 independent columns, and found that a Cell with 8 SPEs can process 2274 columns per second. Compared with the baseline results, the Cell is approximately 5.2X, approximately 8.2X, approximately 15.1X faster than a core on Intel Woodcrest, Dempsey, and Itanium2, respectively. We believe this dramatic performance improvement makes a hybrid cluster with Cell and traditional nodes competitive.

  15. The Fluke Security Project

    DTIC Science & Technology

    2000-04-01

    be an extension of Utah’s nascent Quarks system, oriented to closely coupled cluster environments. However, the grant did not actually begin until... Intel x86, implemented ten virtual machine monitors and servers, including a virtual memory manager, a checkpointer, a process manager, a file server...Fluke, we developed a novel hierarchical processor scheduling framework called CPU inheritance scheduling [5]. This is a framework for scheduling

  16. Adaptive Command and Control of Theater Air Power

    DTIC Science & Technology

    1997-06-01

    Industries, Citicorp, Coca-Cola, Honda, and Intel corporations practice similar techniques 19 Notes as cited in Thomas Petzinger, Jr., “The Front Lines...before the leap to personal computers and word processors occurred. Finally, anticipation takes place as the stock market adjusts current prices...Leading Marines. January 1995. Fleet Marine Force Manual 1-1. Campaigning. January 1990. Gell-Mann, Murray, The Quark and the Jaguar: Adventures

  17. Hydraulic Universal Display Processor System (HUDPS).

    DTIC Science & Technology

    1981-11-21

    emphasis on smart alphanumeric devices in Task II. Volatile and non-volatile memory components were utilized along with the Intel 8748 microprocessor...system. 1.2 TASK II Fault display methods for ground support personnel were investigated during Phase II with emphasis on smart alphanumeric devices...CONSIDERATIONS Methods of display fault indication for ground support personnel have been investigated with emphasis on "smart" alphanumeric devices

  18. A performance study of sparse Cholesky factorization on INTEL iPSC/860

    NASA Technical Reports Server (NTRS)

    Zubair, M.; Ghose, M.

    1992-01-01

    The problem of Cholesky factorization of a sparse matrix has been very well investigated on sequential machines. A number of efficient codes exist for factorizing large unstructured sparse matrices. However, there is a lack of such efficient codes on parallel machines in general, and distributed machines in particular. Some of the issues that are critical to the implementation of sparse Cholesky factorization on a distributed memory parallel machine are ordering, partitioning and mapping, load balancing, and ordering of various tasks within a processor. Here, we focus on the effect of various partitioning schemes on the performance of sparse Cholesky factorization on the Intel iPSC/860. Also, a new partitioning heuristic for structured as well as unstructured sparse matrices is proposed, and its performance is compared with other schemes.

  19. Roofline Analysis in the Intel® Advisor to Deliver Optimized Performance for applications on Intel® Xeon Phi™ Processor

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Koskela, Tuomas S.; Lobet, Mathieu; Deslippe, Jack

    In this session we show, in two case studies, how the roofline feature of Intel Advisor has been utilized to optimize the performance of kernels of the XGC1 and PICSAR codes in preparation for Intel Knights Landing architecture. The impact of the implemented optimizations and the benefits of using the automatic roofline feature of Intel Advisor to study performance of large applications will be presented. This demonstrates an effective optimization strategy that has enabled these science applications to achieve up to 4.6 times speed-up and prepare for future exascale architectures. # Goal/Relevance of Session The roofline model [1,2] is a powerful tool for analyzing the performance of applications with respect to the theoretical peak achievable on a given computer architecture. It allows one to graphically represent the performance of an application in terms of operational intensity, i.e. the ratio of flops performed and bytes moved from memory in order to guide optimization efforts. Given the scale and complexity of modern science applications, it can often be a tedious task for the user to perform the analysis on the level of functions or loops to identify where performance gains can be made. With new Intel tools, it is now possible to automate this task, as well as base the estimates of peak performance on measurements rather than vendor specifications. The goal of this session is to demonstrate how the roofline feature of Intel Advisor can be used to balance memory vs. computation related optimization efforts and effectively identify performance bottlenecks. A series of typical optimization techniques: cache blocking, structure refactoring, data alignment, and vectorization illustrated by the kernel cases will be addressed. # Description of the codes ## XGC1 The XGC1 code [3] is a magnetic fusion Particle-In-Cell code that uses an unstructured mesh for its Poisson solver that allows it to accurately resolve the edge plasma of a magnetic fusion device. 
After recent optimizations to its collision kernel [4], most of the computing time is spent in the electron push (pushe) kernel, where these optimization efforts have been focused. The kernel code scaled well with MPI+OpenMP but had almost no automatic compiler vectorization, in part due to indirect memory addresses and in part due to low trip counts of low-level loops that would be candidates for vectorization. Particle blocking and sorting have been implemented to increase trip counts of low-level loops and improve memory locality, and OpenMP directives have been added to vectorize compute-intensive loops that were identified by Advisor. The optimizations have improved the performance of the pushe kernel 2x on Haswell processors and 1.7x on KNL. The KNL node-for-node performance has been brought to within 30% of a NERSC Cori phase I Haswell node and we expect to bridge this gap by reducing the memory footprint of compute intensive routines to improve cache reuse. ## PICSAR PICSAR is a Fortran/Python high-performance Particle-In-Cell library targeting MIC architectures, first designed to be coupled with the PIC code WARP for the simulation of laser-matter interaction and particle accelerators. PICSAR also contains a FORTRAN stand-alone kernel for performance studies and benchmarks. An MPI domain decomposition is used between NUMA domains and a tile decomposition (cache-blocking) handled by OpenMP has been added for shared-memory parallelism and better cache management. The so-called current deposition and field gathering steps that compose the PIC time loop constitute major hotspots that have been rewritten to enable more efficient vectorization. Particle communications between tiles and MPI domains have been merged and parallelized. All considered, these improvements provide speedups of 3.1 for order 1 and 4.6 for order 3 interpolation shape factors on KNL configured in SNC4 quadrant flat mode. 
Performance is similar between a node of Cori phase 1 and KNL at order 1 and better on KNL by a factor of 1.6 at order 3 with the considered test case (homogeneous thermal plasma).
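
    The roofline bound described in this abstract, attainable performance as a function of operational intensity (flops performed per byte moved from memory), can be written in a few lines. The sketch below is a generic illustration of the model, not Intel Advisor's implementation; the peak and bandwidth values are hypothetical placeholders and do not describe KNL, Haswell, or any other real machine.

```python
def operational_intensity(flops, bytes_moved):
    """Flops performed per byte moved from memory."""
    return flops / bytes_moved


def attainable_gflops(peak_gflops, bandwidth_gbs, intensity):
    """Roofline bound: the lower of the compute roof and the memory roof."""
    return min(peak_gflops, bandwidth_gbs * intensity)


# Hypothetical machine parameters, chosen only to make the arithmetic easy to follow.
PEAK_GFLOPS = 1000.0    # compute roof
BANDWIDTH_GBS = 100.0   # memory bandwidth

# Intensity at which the two roofs meet; kernels below it are memory-bound.
ridge_point = PEAK_GFLOPS / BANDWIDTH_GBS

# Hypothetical kernel: 2e9 flops while moving 1e9 bytes.
kernel_oi = operational_intensity(flops=2.0e9, bytes_moved=1.0e9)
kernel_bound = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GBS, kernel_oi)

print(f"ridge point: {ridge_point} flops/byte")
print(f"kernel intensity: {kernel_oi} flops/byte, bound: {kernel_bound} GFLOP/s")
```

    A kernel whose intensity lies below the ridge point is limited by memory traffic, so optimizations such as cache blocking, which raise intensity, are the ones the model suggests trying first; above the ridge point, vectorization and other compute-side changes matter more.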

  20. Numerical performance and throughput benchmark for electronic structure calculations in PC-Linux systems with new architectures, updated compilers, and libraries.

    PubMed

    Yu, Jen-Shiang K; Hwang, Jenn-Kang; Tang, Chuan Yi; Yu, Chin-Hui

    2004-01-01

    A number of recently released numerical libraries including Automatically Tuned Linear Algebra Subroutines (ATLAS) library, Intel Math Kernel Library (MKL), GOTO numerical library, and AMD Core Math Library (ACML) for AMD Opteron processors, are linked against the executables of the Gaussian 98 electronic structure calculation package, which is compiled by updated versions of Fortran compilers such as Intel Fortran compiler (ifc/efc) 7.1 and PGI Fortran compiler (pgf77/pgf90) 5.0. The ifc 7.1 delivers about 3% of improvement on 32-bit machines compared to the former version 6.0. Performance improved from pgf77 3.3 to 5.0 is also around 3% when utilizing the original unmodified optimization options of the compiler enclosed in the software. Nevertheless, if extensive compiler tuning options are used, the speed can be further accelerated to about 25%. The performances of these fully optimized numerical libraries are similar. The double-precision floating-point (FP) instruction sets (SSE2) are also functional on AMD Opteron processors operated in 32-bit compilation, and Intel Fortran compiler has performed better optimization. Hardware-level tuning is able to improve memory bandwidth by adjusting the DRAM timing, and the efficiency in the CL2 mode is further accelerated by 2.6% compared to that of the CL2.5 mode. The FP throughput is measured by simultaneous execution of two identical copies of each of the test jobs. Resultant performance impact suggests that IA64 and AMD64 architectures are able to fulfill significantly higher throughput than the IA32, which is consistent with the SpecFPrate2000 benchmarks.

  1. On extending parallelism to serial simulators

    NASA Technical Reports Server (NTRS)

    Nicol, David; Heidelberger, Philip

    1994-01-01

    This paper describes an approach to discrete event simulation modeling that appears to be effective for developing portable and efficient parallel execution of models of large distributed systems and communication networks. In this approach, the modeler develops submodels using an existing sequential simulation modeling tool, using the full expressive power of the tool. A set of modeling language extensions permits automatically synchronized communication between submodels; however, the automation requires that any such communication must take a nonzero amount of simulation time. Within this modeling paradigm, a variety of conservative synchronization protocols can transparently support conservative execution of submodels on potentially different processors. A specific implementation of this approach, U.P.S. (Utilitarian Parallel Simulator), is described, along with performance results on the Intel Paragon.

  2. Beyond core count: a look at new mainstream computing platforms for HEP workloads

    NASA Astrophysics Data System (ADS)

    Szostek, P.; Nowak, A.; Bitzes, G.; Valsan, L.; Jarp, S.; Dotti, A.

    2014-06-01

    As Moore's Law continues to deliver more and more transistors, the mainstream processor industry is preparing to expand its investments in areas other than simple core count. These new interests include deep integration of on-chip components, advanced vector units, memory, cache and interconnect technologies. We examine these moving trends with parallelized and vectorized High Energy Physics workloads in mind. In particular, we report on practical experience resulting from experiments with scalable HEP benchmarks on the Intel "Ivy Bridge-EP" and "Haswell" processor families. In addition, we examine the benefits of the new "Haswell" microarchitecture and its impact on multiple facets of HEP software. Finally, we report on the power efficiency of new systems.

  3. Nyquist-WDM filter shaping with a high-resolution colorless photonic spectral processor.

    PubMed

    Sinefeld, David; Ben-Ezra, Shalva; Marom, Dan M

    2013-09-01

    We employ a spatial-light-modulator-based colorless photonic spectral processor with a spectral addressability of 100 MHz along 100 GHz bandwidth, for multichannel, high-resolution reshaping of Gaussian channel response to square-like shape, compatible with Nyquist WDM requirements.

  4. Applications Performance on NAS Intel Paragon XP/S-15

    NASA Technical Reports Server (NTRS)

    Saini, Subhash; Simon, Horst D.; Copper, D. M. (Technical Monitor)

    1994-01-01

    The Numerical Aerodynamic Simulation (NAS) Systems Division received an Intel Touchstone Sigma prototype model Paragon XP/S-15 in February 1993. The i860 XP microprocessor, with an integrated floating point unit and operating in dual-instruction mode, gives a peak performance of 75 million floating point operations per second (MFLOPS) for 64-bit floating point arithmetic. It is used in the Paragon XP/S-15, which has been installed at NAS, NASA Ames Research Center. The NAS Paragon has 208 nodes and its peak performance is 15.6 GFLOPS. Here, we report on early experience using the Paragon XP/S-15. We have tested its performance using both kernels and applications of interest to NAS. We have measured the performance of BLAS 1, 2 and 3, both assembly-coded and Fortran-coded, on the NAS Paragon XP/S-15. Furthermore, we have investigated the performance of a single-node one-dimensional FFT, a distributed two-dimensional FFT and a distributed three-dimensional FFT. Finally, we measured the performance of the NAS Parallel Benchmarks (NPB) on the Paragon and compared it with the performance obtained on other highly parallel machines, such as the CM-5, CRAY T3D, IBM SP1, etc. In particular, we investigated the following issues, which can strongly affect the performance of the Paragon: a. Impact of the operating system: Intel currently uses as a default the OSF/1 AD operating system from the Open Software Foundation. The paging of the Open Software Foundation (OSF) server at 22 MB to make more memory available for the application degrades the performance. We found that when the limit of 26 MB per node out of 32 MB available is reached, the application is paged out of main memory using virtual memory. When the application starts paging, the performance is considerably reduced. We found that dynamic memory allocation can help application performance under certain circumstances. b. Impact of data cache on the i860/XP: We measured the performance of the BLAS, both assembly-coded and Fortran-coded. We found that the measured performance of assembly-coded BLAS is much less than what the memory bandwidth limitation would predict. The influence of data cache on different sizes of vectors is also investigated using one-dimensional FFTs. c. Impact of processor layout: There are several different ways processors can be laid out within the two-dimensional grid of processors on the Paragon. We have used the FFT example to investigate performance differences based on processor layout.
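    The machine's quoted peak follows from the per-node figure. As a quick arithmetic check, taking the per-node peak to be 75 million floating point operations per second and the node count to be 208 (the helper name is introduced here for illustration):

```python
# Aggregate peak performance from the per-node peak quoted in the abstract.
NODES = 208
PEAK_PER_NODE_MFLOPS = 75.0


def aggregate_peak_gflops(nodes, per_node_mflops):
    """Sum the per-node peaks and convert from MFLOPS to GFLOPS."""
    return nodes * per_node_mflops / 1000.0


peak = aggregate_peak_gflops(NODES, PEAK_PER_NODE_MFLOPS)
print(f"{peak:.1f} GFLOPS")
```

    The result matches the 15.6 GFLOPS stated for the 208-node system. This is a theoretical ceiling; the paging and cache effects described in the abstract concern how far measured performance falls below it.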

  5. Dense and Sparse Matrix Operations on the Cell Processor

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Williams, Samuel W.; Shalf, John; Oliker, Leonid

    2005-05-01

    The slowing pace of commodity microprocessor performance improvements combined with ever-increasing chip power demands has become of utmost concern to computational scientists. Therefore, the high performance computing community is examining alternative architectures that address the limitations of modern superscalar designs. In this work, we examine STI's forthcoming Cell processor: a novel, low-power architecture that combines a PowerPC core with eight independent SIMD processing units coupled with a software-controlled memory to offer high FLOP/s/Watt. Since neither Cell hardware nor cycle-accurate simulators are currently publicly available, we develop an analytic framework to predict Cell performance on dense and sparse matrix operations, using a variety of algorithmic approaches. Results demonstrate Cell's potential to deliver more than an order of magnitude better GFLOP/s per watt performance, when compared with the Intel Itanium2 and Cray X1 processors.

  6. Spectral-element simulation of two-dimensional elastic wave propagation in fully heterogeneous media on a GPU cluster

    NASA Astrophysics Data System (ADS)

    Rudianto, Indra; Sudarmaji

    2018-04-01

    We present an implementation of the spectral-element method for simulation of two-dimensional elastic wave propagation in fully heterogeneous media. We have incorporated most realistic geological features in the model, including surface topography, curved layer interfaces, and 2-D wave-speed heterogeneity. To accommodate such complexity, we use an unstructured quadrilateral meshing technique. The simulation was performed on a GPU cluster consisting of 24 Intel Xeon CPU cores and 4 NVIDIA Quadro graphics cards, using a CUDA and MPI implementation. We speed up the computation by a factor of about 5 compared to the MPI-only implementation, and by a factor of about 40 compared to the serial implementation.
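    The two reported speedups together imply a third figure that the abstract does not state. A minimal sketch, assuming both factors were measured on the same problem against the same GPU run (the function name is illustrative):

```python
# Speedups quoted in the abstract, both relative to the CUDA+MPI run.
GPU_VS_SERIAL = 40.0     # "a factor of about 40 compared to Serial"
GPU_VS_MPI_ONLY = 5.0    # "a factor of about 5 compared to MPI only"


def implied_speedup(total_vs_baseline, total_vs_intermediate):
    """Speedup of the intermediate configuration over the baseline."""
    return total_vs_baseline / total_vs_intermediate


mpi_only_vs_serial = implied_speedup(GPU_VS_SERIAL, GPU_VS_MPI_ONLY)
print(f"MPI-only over serial: about {mpi_only_vs_serial:.0f}x")
```

    Because the abstract gives both factors only approximately, the derived MPI-only figure of about 8 is a rough estimate rather than a measured result.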

  7. Transient Solid Dynamics Simulations on the Sandia/Intel Teraflop Computer

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Attaway, S.; Brown, K.; Gardner, D.

    1997-12-31

    Transient solid dynamics simulations are among the most widely used engineering calculations. Industrial applications include vehicle crashworthiness studies, metal forging, and powder compaction prior to sintering. These calculations are also critical to defense applications including safety studies and weapons simulations. The practical importance of these calculations and their computational intensiveness make them natural candidates for parallelization. This has proved to be difficult, and existing implementations fail to scale to more than a few dozen processors. In this paper we describe our parallelization of PRONTO, Sandia's transient solid dynamics code, via a novel algorithmic approach that utilizes multiple decompositions for different key segments of the computations, including the material contact calculation. This latter calculation is notoriously difficult to perform well in parallel, because it involves dynamically changing geometry, global searches for elements in contact, and unstructured communications among the compute nodes. Our approach scales to at least 3600 compute nodes of the Sandia/Intel Teraflop computer (the largest set of nodes to which we have had access to date) on problems involving millions of finite elements. On this machine we can simulate models using more than ten million elements in a few tenths of a second per timestep, and solve problems more than 3000 times faster than a single processor Cray Jedi.

  8. A simple integrative method for presenting head-contingent motion parallax and disparity cues on intel x86 processor-based machines.

    PubMed

    Szatmary, J; Hadani, I; Julesz, B

    1997-01-01

    Rogers and Graham (1979) developed a system to show that head-movement-contingent motion parallax produces monocular depth perception in random dot patterns. Their display system comprised an oscilloscope driven by function generators or a special graphics board that triggered the X and Y deflection of the raster scan signal. Replication of this system required costly hardware that is no longer on the market. In this paper the Rogers-Graham method is reproduced with an Intel processor based IBM PC compatible machine with no additional hardware cost. An adapted joystick sampled through the standard game-port can serve as a provisional head-movement sensor. Monitor resolution for displaying motion is effectively enhanced 16 times by the use of anti-aliasing, enabling the display of thousands of random dots in real-time with a refresh rate of 60 Hz or above. A color monitor enables the use of the anaglyph method, thus combining stereoscopic and monocular parallax on a single display without the loss of speed. The power of this system is demonstrated by a psychophysical measurement in which subjects nulled head-movement-contingent illusory parallax, evoked by a static stereogram, with real parallax. The amount of real parallax required to null the illusory stereoscopic parallax monotonically increased with disparity.

  9. Portable parallel stochastic optimization for the design of aeropropulsion components

    NASA Technical Reports Server (NTRS)

    Sues, Robert H.; Rhodes, G. S.

    1994-01-01

    This report presents the results of Phase 1 research to develop a methodology for performing large-scale Multi-disciplinary Stochastic Optimization (MSO) for the design of aerospace systems ranging from aeropropulsion components to complete aircraft configurations. The current research recognizes that such design optimization problems are computationally expensive, and require the use of either massively parallel or multiple-processor computers. The methodology also recognizes that many operational and performance parameters are uncertain, and that uncertainty must be considered explicitly to achieve optimum performance and cost. The objective of this Phase 1 research was to initialize the development of an MSO methodology that is portable to a wide variety of hardware platforms, while achieving efficient, large-scale parallelism when multiple processors are available. The first effort in the project was a literature review of available computer hardware, as well as a review of portable, parallel programming environments. The second effort was to implement the MSO methodology for a problem using the portable parallel programming environment Parallel Virtual Machine (PVM). The third and final effort was to demonstrate the example on a variety of computers, including a distributed-memory multiprocessor, a distributed-memory network of workstations, and a single-processor workstation. Results indicate the MSO methodology can be applied effectively to large-scale aerospace design problems. Nearly perfect linear speedup was demonstrated for computation of optimization sensitivity coefficients on both a 128-node distributed-memory multiprocessor (the Intel iPSC/860) and a network of workstations (speedups of almost 19 times achieved for 20 workstations). Very high parallel efficiencies (75 percent for 31 processors and 60 percent for 50 processors) were also achieved for computation of aerodynamic influence coefficients on the Intel multiprocessor. Finally, the multi-level parallelization strategy that will be needed for large-scale MSO problems was demonstrated to be highly efficient. The same parallel code instructions were used on both platforms, demonstrating portability. There are many applications for which MSO can be applied, including NASA's High-Speed Civil Transport and advanced propulsion systems. The use of MSO will reduce design and development time and testing costs dramatically.
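    The speedup and efficiency figures in this abstract are related by the usual definition, efficiency = speedup / number of processors. The short sketch below converts between the two for the quoted values; it assumes that standard definition, which the abstract does not spell out, and treats "almost 19 times" as 19.

```python
def parallel_efficiency(speedup, processors):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / processors


def speedup_from_efficiency(efficiency, processors):
    """Speedup implied by a given parallel efficiency."""
    return efficiency * processors


# Network of workstations: "almost 19 times" on 20 workstations.
workstation_efficiency = parallel_efficiency(19.0, 20)

# Aerodynamic influence coefficients: efficiencies quoted for 31 and 50 processors.
speedup_31 = speedup_from_efficiency(0.75, 31)
speedup_50 = speedup_from_efficiency(0.60, 50)

print(f"workstations: {workstation_efficiency:.0%} efficient")
print(f"31 processors: {speedup_31:.2f}x, 50 processors: {speedup_50:.1f}x")
```

    Read this way, the drop from 75 to 60 percent efficiency still corresponds to a higher absolute speedup on 50 processors than on 31.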

  10. Evaluating Multi-core Architectures through Accelerating the Three-Dimensional Lax–Wendroff Correction

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    You, Yang; Fu, Haohuan; Song, Shuaiwen

    2014-07-18

    Wave propagation forward modeling is a widely used computational method in oil and gas exploration. The iterative stencil loops in such problems have broad applications in scientific computing. However, executing such loops can be highly time-consuming, which greatly limits application performance and power efficiency. In this paper, we accelerate the forward modeling technique on the latest multi-core and many-core architectures such as Intel Sandy Bridge CPUs, NVIDIA Fermi C2070 GPU, NVIDIA Kepler K20x GPU, and the Intel Xeon Phi Co-processor. For the GPU platforms, we propose two parallel strategies to explore the performance optimization opportunities for our stencil kernels. For Sandy Bridge CPUs and MIC, we also employ various optimization techniques in order to achieve the best.

  11. GW Calculations of Materials on the Intel Xeon-Phi Architecture

    NASA Astrophysics Data System (ADS)

    Deslippe, Jack; da Jornada, Felipe H.; Vigil-Fowler, Derek; Biller, Ariel; Chelikowsky, James R.; Louie, Steven G.

    Intel Xeon-Phi processors are expected to power a large number of High-Performance Computing (HPC) systems around the United States and the world in the near future. We evaluate the ability of GW and prerequisite Density Functional Theory (DFT) calculations for materials to utilize the Xeon-Phi architecture. We describe the optimization process and performance improvements achieved. We find that the GW method, like other higher level Many-Body methods beyond standard local/semilocal approximations to Kohn-Sham DFT, is particularly well suited for many-core architectures due to the ability to exploit a large amount of parallelism over plane-waves, band-pairs and frequencies. Support provided by the SCIDAC program, Department of Energy, Office of Science, Advanced Scientific Computing Research and Basic Energy Sciences. Grant Numbers DE-SC0008877 (Austin) and DE-AC02-05CH11231 (LBNL).

  12. Advanced electronics for the CTF MEG system.

    PubMed

    McCubbin, J; Vrba, J; Spear, P; McKenzie, D; Willis, R; Loewen, R; Robinson, S E; Fife, A A

    2004-11-30

    Development of the CTF MEG system has been advanced with the introduction of a computer processing cluster between the data acquisition electronics and the host computer. The advent of fast processors, memory, and network interfaces has made this innovation feasible for large data streams at high sampling rates. We have implemented tasks including anti-alias filter, sample rate decimation, higher gradient balancing, crosstalk correction, and optional filters with a cluster consisting of 4 dual Intel Xeon processors operating on up to 275 channel MEG systems at 12 kHz sample rate. The architecture is expandable with additional processors to implement advanced processing tasks which may include e.g., continuous head localization/motion correction, optional display filters, coherence calculations, or real time synthetic channels (via beamformer). We also describe an electronics configuration upgrade to provide operator console access to the peripheral interface features such as analog signal and trigger I/O. This allows remote location of the acoustically noisy electronics cabinet and fitting of the cabinet with doors for improved EMI shielding. Finally, we present the latest performance results available for the CTF 275 channel MEG system including an unshielded SEF (median nerve electrical stimulation) measurement enhanced by application of an adaptive beamformer technique (SAM) which allows recognition of the nominal 20-ms response in the unaveraged signal.

  13. Benchmarking and tuning the MILC code on clusters and supercomputers

    NASA Astrophysics Data System (ADS)

    Gottlieb, Steven

    2002-03-01

    Recently, we have benchmarked and tuned the MILC code on a number of architectures including Intel Itanium and Pentium IV (PIV), dual-CPU Athlon, and the latest Compaq Alpha nodes. Results will be presented for many of these, and we shall discuss some simple code changes that can result in a very dramatic speedup of the KS conjugate gradient on processors with more advanced memory systems such as PIV, IBM SP and Alpha.

  15. Turbo Pascal Implementation of a Distributed Processing Network of MS-DOS Microcomputers Connected in a Master-Slave Configuration

    DTIC Science & Technology

    1989-12-01

    Interrupt Procedures ....... 29 13. Support for a Larger Memory Model ................ 29 C. IMPLEMENTATION ........................................ 29...describe the programmer’s model of the hardware utilized in the microcomputers and interrupt driven serial communication considerations. Chapter III...Central Processor Unit The programming model of Table 2.1 is common to the Intel 8088, 8086 and 80x86 series of microprocessors used in the IBM PC/AT

  16. Cell-NPE (Numerical Performance Evaluation): Programming the IBM Cell Broadband Engine -- A General Parallelization Strategy

    DTIC Science & Technology

    2008-04-01

    Space GmbH as follows: B. TECHNICAL PROPOSAL/DESCRIPTION OF WORK Cell: A Revolutionary High Performance Computing Platform On 29 June 2005 [1...IBM has announced that it has partnered with Mercury Computer Systems, a maker of specialized computers. The Cell chip provides massive floating-point...the computing industry away from the traditional processor technology dominated by Intel. While in the past, the development of computing power has

  17. Integrated 3-D vision system for autonomous vehicles

    NASA Astrophysics Data System (ADS)

    Hou, Kun M.; Shawky, Mohamed; Tu, Xiaowei

    1992-03-01

    Nowadays, autonomous vehicles have become a multidisciplinary field, whose evolution is taking advantage of recent technological progress in computer architectures. As development tools have become more sophisticated, the trend is toward more specialized, or even dedicated, architectures. In this paper, we focus on a parallel vision subsystem integrated in the overall system architecture. The system modules work in parallel, communicating through a hierarchical blackboard, an extension of the 'tuple space' from LINDA concepts, where they may exchange data or synchronization messages. The general-purpose processing elements have different capabilities: they are built around 40 MHz Intel i860 RISC processors for high-level processing and pipelined systolic array processors based on PLAs or FPGAs for low-level processing.

  18. A 2.4-GHz Energy-Efficient Transmitter for Wireless Medical Applications.

    PubMed

    Qi Zhang; Peng Feng; Zhiqing Geng; Xiaozhou Yan; Nanjian Wu

    2011-02-01

    A 2.4-GHz energy-efficient transmitter (TX) for wireless medical applications is presented in this paper. It consists of four blocks: a phase-locked loop (PLL) synthesizer with a direct frequency presetting technique, a class-B power amplifier, a digital processor, and nonvolatile memory (NVM). The frequency presetting technique can accurately preset the carrier frequency of the voltage-controlled oscillator and reduce the lock-in time of the PLL synthesizer, further increasing the data rate of communication with low power consumption. The digital processor automatically compensates preset frequency variation with process, voltage, and temperature. The NVM stores the presetting signals and calibration data so that the TX can avoid the repetitive calibration process and save energy in practical applications. The design is implemented in a 0.18-μm radio-frequency complementary metal-oxide semiconductor process and the active area is 1.3 mm². The TX achieves 0-dBm output power with a maximum data rate of 4 Mb/s/2 Mb/s and dissipates 2.7-mA/5.4-mA current from a 1.8-V power supply for on-off keying/frequency-shift keying modulation, respectively. The corresponding energy efficiency is 1.2 nJ/b·mW and 4.8 nJ/b·mW when normalized to the transmitting power.
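    The quoted energy-efficiency figures can be reproduced from the supply voltage, current and data rate. The sketch below assumes energy per bit is supply power divided by data rate, then normalized by the transmit power in mW (0 dBm is 1 mW); the function names are introduced here for illustration.

```python
SUPPLY_V = 1.8          # supply voltage from the abstract
OUTPUT_POWER_DBM = 0.0  # transmit power from the abstract


def dbm_to_mw(dbm):
    """Convert a power level in dBm to milliwatts."""
    return 10.0 ** (dbm / 10.0)


def energy_per_bit_nj(current_ma, data_rate_mbps, supply_v=SUPPLY_V):
    """Supply energy per transmitted bit; mW divided by Mb/s gives nJ/b."""
    power_mw = current_ma * supply_v
    return power_mw / data_rate_mbps


tx_power_mw = dbm_to_mw(OUTPUT_POWER_DBM)

# On-off keying: 2.7 mA at 4 Mb/s. Frequency-shift keying: 5.4 mA at 2 Mb/s.
ook = energy_per_bit_nj(2.7, 4.0) / tx_power_mw
fsk = energy_per_bit_nj(5.4, 2.0) / tx_power_mw

print(f"OOK: {ook:.3f} nJ/b per mW, FSK: {fsk:.2f} nJ/b per mW")
```

    The computed values, about 1.2 and 4.9 nJ/b·mW, agree with the abstract's 1.2 and 4.8 to within rounding.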

  19. HORN-6 special-purpose clustered computing system for electroholography.

    PubMed

    Ichihashi, Yasuyuki; Nakayama, Hirotaka; Ito, Tomoyoshi; Masuda, Nobuyuki; Shimobaba, Tomoyoshi; Shiraki, Atsushi; Sugie, Takashige

    2009-08-03

    We developed the HORN-6 special-purpose computer for holography. We designed and constructed the HORN-6 board to handle an object image composed of one million points and constructed a cluster system composed of 16 HORN-6 boards. Using this HORN-6 cluster system, we succeeded in creating a computer-generated hologram of a three-dimensional image composed of 1,000,000 points at a rate of 1 frame per second, and a computer-generated hologram of an image composed of 100,000 points at a rate of 10 frames per second, which is near video rate, when the size of a computer-generated hologram is 1,920 x 1,080. The calculation speed is approximately 4,600 times faster than that of a personal computer with an Intel 3.4-GHz Pentium 4 CPU.
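    The two reported operating points describe the same underlying throughput. A rough sketch, assuming every object point contributes to every hologram pixel (the usual cost model for point-source computer-generated holograms, though the abstract does not state it):

```python
HOLOGRAM_PIXELS = 1920 * 1080   # hologram size quoted in the abstract


def point_pixel_rate(object_points, frames_per_second, pixels=HOLOGRAM_PIXELS):
    """Point-to-pixel contributions evaluated per second under the assumed cost model."""
    return object_points * pixels * frames_per_second


rate_1m_points = point_pixel_rate(1_000_000, 1)    # 1,000,000 points at 1 frame/s
rate_100k_points = point_pixel_rate(100_000, 10)   # 100,000 points at 10 frames/s

print(f"{rate_1m_points:.3e} contributions per second")
```

    Both configurations come to roughly 2 × 10¹² point-pixel contributions per second, which is why a tenfold reduction in points yields a tenfold increase in frame rate.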

  20. mm_par2.0: An object-oriented molecular dynamics simulation program parallelized using a hierarchical scheme with MPI and OPENMP

    NASA Astrophysics Data System (ADS)

    Oh, Kwang Jin; Kang, Ji Hoon; Myung, Hun Joo

    2012-02-01

    We have revised a general purpose parallel molecular dynamics simulation program, mm_par, using object-oriented programming. We parallelized the revised version using a hierarchical scheme in order to utilize more processors for a given system size. The benchmark result will be presented here.
    New version program summary
    Program title: mm_par2.0
    Catalogue identifier: ADXP_v2_0
    Program summary URL: http://cpc.cs.qub.ac.uk/summaries/ADXP_v2_0.html
    Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
    Licensing provisions: Standard CPC license, http://cpc.cs.qub.ac.uk/licence/licence.html
    No. of lines in distributed program, including test data, etc.: 2 390 858
    No. of bytes in distributed program, including test data, etc.: 25 068 310
    Distribution format: tar.gz
    Programming language: C++
    Computer: Any system operated by Linux or Unix
    Operating system: Linux
    Classification: 7.7
    External routines: We provide wrappers for FFTW [1], Intel MKL library [2] FFT routine, and Numerical recipes [3] FFT, random number generator, and eigenvalue solver routines, SPRNG [4] random number generator, Mersenne Twister [5] random number generator, space filling curve routine.
    Catalogue identifier of previous version: ADXP_v1_0
    Journal reference of previous version: Comput. Phys. Comm. 174 (2006) 560
    Does the new version supersede the previous version?: Yes
    Nature of problem: Structural, thermodynamic, and dynamical properties of fluids and solids from microscopic scales to mesoscopic scales.
    Solution method: Molecular dynamics simulation in NVE, NVT, and NPT ensemble, Langevin dynamics simulation, dissipative particle dynamics simulation.
    Reasons for new version: First, object-oriented programming has been used, which is known to be open for extension and closed for modification. It is also known to be better for maintenance. Second, version 1.0 was based on atom decomposition and domain decomposition scheme [6] for parallelization. However, atom decomposition is not popular due to its poor scalability. On the other hand, the domain decomposition scheme is better for scalability. It still has a limitation in utilizing a large number of cores on recent petascale computers due to the requirement that the domain size is larger than the potential cutoff distance. To go beyond such a limitation, a hierarchical parallelization scheme has been adopted in this new version and implemented using MPI [7] and OPENMP [8].
    Summary of revisions: (1) Object-oriented programming has been used. (2) A hierarchical parallelization scheme has been adopted. (3) SPME routine has been fully parallelized with parallel 3D FFT using volumetric decomposition scheme [9]. K.J.O. thanks Mr. Seung Min Lee for useful discussion on programming and debugging.
    Running time: Running time depends on system size and methods used. For a test system containing a protein (PDB id: 5DHFR) with CHARMM22 force field [10] and 7023 TIP3P [11] waters in a simulation box having dimension 62.23 Å×62.23 Å×62.23 Å, the benchmark results are given in Fig. 1. Here the potential cutoff distance was set to 12 Å and the switching function was applied from 10 Å for the force calculation in real space. For the SPME [12] calculation, K1, K2, and K3 were set to 64 and the interpolation order was set to 4. To do the fast Fourier transform, we used Intel MKL library. All bonds including hydrogen atoms were constrained using SHAKE/RATTLE algorithms [13,14]. The code was compiled using Intel compiler version 11.1 and mvapich2 version 1.5. Fig. 2 shows performance gains from using the CUDA-enabled version [15] of mm_par for 5DHFR simulation in water on Intel Core2Quad 2.83 GHz and GeForce GTX 580. Even though mm_par2.0 is not ported yet for GPU, its performance data would be useful to estimate mm_par2.0 performance on GPU.
    Fig. 1: Timing results for 1000 MD steps. 1, 2, 4, and 8 in the figure mean the number of OPENMP threads.
    Fig. 2: Timing results for 1000 MD steps from double precision simulation on CPU, single precision simulation on GPU, and double precision simulation on GPU.
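    The domain-size limitation that motivates the hierarchical scheme can be made concrete with the benchmark numbers. The following is a minimal sketch, assuming a cubic box split into equal cubic domains whose edge must be at least the cutoff distance; it is an illustration of the constraint, not code from mm_par2.0.

```python
BOX_EDGE_ANGSTROM = 62.23   # simulation box edge from the benchmark
CUTOFF_ANGSTROM = 12.0      # potential cutoff distance from the benchmark


def max_domains_per_axis(box_edge, cutoff):
    """Largest number of equal slabs per axis whose width is still at least the cutoff."""
    return int(box_edge // cutoff)


per_axis = max_domains_per_axis(BOX_EDGE_ANGSTROM, CUTOFF_ANGSTROM)
max_domains = per_axis ** 3            # one MPI process per domain
domain_edge = BOX_EDGE_ANGSTROM / per_axis

print(f"{per_axis} domains per axis, {max_domains} in total, edge {domain_edge:.2f} A")
```

    Under these assumptions pure domain decomposition tops out at 125 processes for this system, so additional cores can only be used by adding a second level of parallelism inside each domain, which is the role the abstract assigns to OPENMP threads.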

  1. Hardware description ADSP-21020 40-bit floating point DSP as designed in a remotely controlled digital CW Doppler radar

    NASA Astrophysics Data System (ADS)

    Morrison, R. E.; Robinson, S. H.

    A continuous wave Doppler radar system has been designed which is portable, easily deployed, and remotely controlled. The heart of this system is a DSP/control board using Analog Devices ADSP-21020 40-bit floating point digital signal processor (DSP) microprocessor. Two 18-bit audio A/D converters provide digital input to the DSP/controller board for near real time target detection. Program memory for the DSP is dual ported with an Intel 87C51 microcontroller allowing DSP code to be up-loaded or down-loaded from a central controlling computer. The 87C51 provides overall system control for the remote radar and includes a time-of-day/day-of-year real time clock, system identification (ID) switches, and input/output (I/O) expansion by an Intel 82C55 I/O expander.

  2. Parallelization of MRCI based on hole-particle symmetry.

    PubMed

    Suo, Bing; Zhai, Gaohong; Wang, Yubin; Wen, Zhenyi; Hu, Xiangqian; Li, Lemin

    2005-01-15

    The parallel implementation of a multireference configuration interaction program based on the hole-particle symmetry is described. The platform used to implement the parallelization is an Intel-architecture cluster consisting of 12 nodes, each equipped with two 2.4-GHz Xeon processors, 3 GB of memory, and a 36-GB disk, connected by a Gigabit Ethernet switch. The dependence of speedup on molecular symmetries and task granularities is discussed. Test calculations show that the scaling with the number of nodes is about 1.9 (for C1 and Cs), 1.65 (for C2v), and 1.55 (for D2h) when the number of nodes is doubled. The largest calculation performed on this cluster involves 5.6 × 10⁸ CSFs.
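    A per-doubling scaling factor can be turned into a projected speedup for a given node count. The sketch below assumes the factor stays constant across successive doublings, which the abstract does not claim, so the projections are illustrative only.

```python
import math

# Speedup gained each time the node count is doubled, by point group (from the abstract).
DOUBLING_FACTOR = {"C1/Cs": 1.9, "C2v": 1.65, "D2h": 1.55}


def projected_speedup(doubling_factor, nodes):
    """Speedup over one node if every doubling multiplies throughput by the same factor."""
    return doubling_factor ** math.log2(nodes)


def efficiency_per_doubling(doubling_factor):
    """Fraction of the ideal factor of 2 retained at each doubling."""
    return doubling_factor / 2.0


for group, factor in DOUBLING_FACTOR.items():
    print(f"{group}: {projected_speedup(factor, 8):.2f}x on 8 nodes, "
          f"{efficiency_per_doubling(factor):.1%} per doubling")
```

    The projection makes the symmetry dependence visible: under this assumption, the low-symmetry cases retain most of the ideal speedup, while the D2h case loses ground at every doubling.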

  3. Domain Wall Fermion Inverter on Pentium 4

    NASA Astrophysics Data System (ADS)

    Pochinsky, Andrew

    2005-03-01

    A highly optimized domain wall fermion inverter has been developed as part of the SciDAC lattice initiative. By designing the code to minimize memory bus traffic, it achieves high cache reuse and performance in excess of 2 GFlops for out of L2 cache problem sizes on a GigE cluster with 2.66 GHz Xeon processors. The code uses the SciDAC QMP communication library.

  4. QCD thermodynamics with two flavors of quarks

    NASA Astrophysics Data System (ADS)

    MIMD lattice Computations (MILC) Collaboration

    We present results of numerical simulations of quantum chromodynamics at finite temperature on the Intel iPSC/860 parallel processor. We performed calculations with two flavors of Kogut-Susskind quarks and of Wilson quarks on 6 × 12³ lattices in order to study the crossover from the low temperature hadronic regime to the high temperature regime. We investigate the properties of the objects whose exchange gives static screening lengths by reconstructing their correlated quark-antiquark structure.

  5. Kalman Filter Tracking on Parallel Architectures

    NASA Astrophysics Data System (ADS)

    Cerati, Giuseppe; Elmer, Peter; Krutelyov, Slava; Lantz, Steven; Lefebvre, Matthieu; McDermott, Kevin; Riley, Daniel; Tadel, Matevž; Wittich, Peter; Würthwein, Frank; Yagil, Avi

    2016-11-01

    Power density constraints are limiting the performance improvements of modern CPUs. To address this we have seen the introduction of lower-power, multi-core processors such as GPGPU, ARM and Intel MIC. In order to achieve the theoretical performance gains of these processors, it will be necessary to parallelize algorithms to exploit larger numbers of lightweight cores and specialized functions like large vector units. Track finding and fitting is one of the most computationally challenging problems for event reconstruction in particle physics. At the High-Luminosity Large Hadron Collider (HL-LHC), for example, this will be by far the dominant problem. The need for greater parallelism has driven investigations of very different track finding techniques such as Cellular Automata or Hough Transforms. The most common track finding techniques in use today, however, are those based on a Kalman filter approach. Significant experience has been accumulated with these techniques on real tracking detector systems, both in the trigger and offline. They are known to provide high physics performance, are robust, and are in use today at the LHC. Given the utility of the Kalman filter in track finding, we have begun to port these algorithms to parallel architectures, namely Intel Xeon and Xeon Phi. We report here on our progress towards an end-to-end track reconstruction algorithm fully exploiting vectorization and parallelization techniques in a simplified experimental environment.
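    For readers unfamiliar with the technique, the core of a Kalman filter is a predict step followed by a measurement update. The sketch below is a deliberately simplified scalar version with a static state model; it is not the authors' track-fitting code, which works with multi-parameter track states and vectorized matrix operations, and all names and values are illustrative.

```python
def kalman_step(x, p, z, q, r):
    """One predict-and-update cycle for a scalar state.

    x, p: current estimate and its variance
    z:    new measurement
    q, r: process-noise and measurement-noise variances
    """
    # Predict: the state model is static, so only the uncertainty grows.
    x_pred = x
    p_pred = p + q
    # Update: blend prediction and measurement using the Kalman gain.
    gain = p_pred / (p_pred + r)
    x_new = x_pred + gain * (z - x_pred)
    p_new = (1.0 - gain) * p_pred
    return x_new, p_new


def filter_hits(hits, x0=0.0, p0=1.0, q=0.0, r=1.0):
    """Run the filter over a sequence of measurements and return the final state."""
    x, p = x0, p0
    for z in hits:
        x, p = kalman_step(x, p, z, q, r)
    return x, p


estimate, variance = filter_hits([2.0, 2.2, 1.9, 2.1])
print(f"estimate {estimate:.3f}, variance {variance:.3f}")
```

    Each hit is processed independently of later ones, which is why many track candidates can be advanced in lockstep; that property is what makes the approach a candidate for the vector units and many-core hardware the abstract discusses.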

  6. Sci-Thur PM – Brachytherapy 01: Fast brachytherapy dose calculations: Characterization of egs-brachy features to enhance simulation efficiency

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Chamberland, Marc; Taylor, Randle E.P.; Rogers, Da

    2016-08-15

    Purpose: egs-brachy is a fast, new EGSnrc user-code for brachytherapy applications. This study characterizes egs-brachy features that enhance simulation efficiency. Methods: Calculations are performed to characterize efficiency gains from various features. Simulations include radionuclide and miniature x-ray tube sources in water phantoms and idealized prostate, breast, and eye plaque treatments. Features characterized include voxel indexing of sources to reduce boundary checks during radiation transport, scoring collision kerma via tracklength estimator, recycling photons emitted from sources, and using phase space data to initiate simulations. Bremsstrahlung cross section enhancement (BCSE), uniform bremsstrahlung splitting (UBS), and Russian Roulette (RR) are considered for electronic brachytherapy. Results: Efficiency is enhanced by a factor of up to 300 using tracklength versus interaction scoring of collision kerma and by up to 2.7 and 2.6 using phase space sources and particle recycling respectively compared to simulations in which particles are initiated within sources. On a single 2.5 GHz Intel Xeon E5-2680 processor core, simulations approximating prostate and breast permanent implant ((2 mm)³ voxels) and eye plaque ((1 mm)³) treatments take as little as 9 s (prostate, eye) and up to 31 s (breast) to achieve 2% statistical uncertainty on doses within the PTV. For electronic brachytherapy, BCSE, UBS, and RR enhance efficiency by a factor >2000 compared to a factor of >10⁴ using a phase space source. Conclusion: egs-brachy features provide substantial efficiency gains, resulting in calculation times sufficiently fast for full Monte Carlo simulations for routine brachytherapy treatment planning.

  7. Underwater Threat Source Localization: Processing Sensor Network TDOAs with a Terascale Optical Core Device

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Barhen, Jacob; Imam, Neena

    2007-01-01

    Revolutionary computing technologies are defined in terms of technological breakthroughs, which leapfrog over near-term projected advances in conventional hardware and software to produce paradigm shifts in computational science. For underwater threat source localization using information provided by a dynamical sensor network, one of the most promising computational advances builds upon the emergence of digital optical-core devices. In this article, we present initial results of sensor network calculations that focus on the concept of signal wavefront time-difference-of-arrival (TDOA). The corresponding algorithms are implemented on the EnLight processing platform recently introduced by Lenslet Laboratories. This tera-scale digital optical core processor is optimized for array operations, which it performs in a fixed-point-arithmetic architecture. Our results (i) illustrate the ability to reach the required accuracy in the TDOA computation, and (ii) demonstrate that a considerable speed-up can be achieved when using the EnLight 64a prototype processor as compared to a dual Intel Xeon(TM) processor.
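
    The time-difference-of-arrival (TDOA) quantity mentioned in this record can be illustrated with a small Python sketch. It is only the textbook geometric definition, not the EnLight implementation: the function name tdoa, the 2D sensor coordinates, and the nominal sound speed of 1500 m/s in sea water are assumptions made for the example.

```python
import math

SOUND_SPEED_M_PER_S = 1500.0  # assumed nominal speed of sound in sea water


def tdoa(source, sensor_a, sensor_b, c=SOUND_SPEED_M_PER_S):
    """Arrival time at sensor_a minus arrival time at sensor_b, in seconds."""
    return (math.dist(source, sensor_a) - math.dist(source, sensor_b)) / c


# Hypothetical geometry in metres: the source is 5000 m from sensor A
# and 1500 m from sensor B.
source = (0.0, 0.0)
sensor_a = (3000.0, 4000.0)
sensor_b = (1500.0, 0.0)

delta_t = tdoa(source, sensor_a, sensor_b)
print("TDOA (s):", delta_t)
```

    Localization works in the opposite direction: given measured TDOAs for several sensor pairs, one searches for the source position whose predicted differences match them.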

  8. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Newman, G.A.; Commer, M.

    Three-dimensional (3D) geophysical imaging is now receiving considerable attention for electrical conductivity mapping of potential offshore oil and gas reservoirs. The imaging technology employs controlled source electromagnetic (CSEM) and magnetotelluric (MT) fields and treats geological media exhibiting transverse anisotropy. Moreover, when combined with established seismic methods, direct imaging of reservoir fluids is possible. Because of the size of the 3D conductivity imaging problem, strategies are required exploiting computational parallelism and optimal meshing. The algorithm thus developed has been shown to scale to tens of thousands of processors. In one imaging experiment, 32,768 tasks/processors on the IBM Watson Research Blue Gene/L supercomputer were successfully utilized. Over a 24 hour period we were able to image a large scale field data set that previously required over four months of processing time on distributed clusters based on Intel or AMD processors utilizing 1024 tasks on an InfiniBand fabric. Electrical conductivity imaging using massively parallel computational resources produces results that cannot be obtained otherwise and are consistent with timeframes required for practical exploration problems.

  9. OpenMP GNU and Intel Fortran programs for solving the time-dependent Gross-Pitaevskii equation

    NASA Astrophysics Data System (ADS)

    Young-S., Luis E.; Muruganandam, Paulsamy; Adhikari, Sadhan K.; Lončar, Vladimir; Vudragović, Dušan; Balaž, Antun

    2017-11-01

    We present an Open Multi-Processing (OpenMP) version of the Fortran 90 programs for solving the Gross-Pitaevskii (GP) equation for a Bose-Einstein condensate in one, two, and three spatial dimensions, optimized for use with GNU and Intel compilers. We use the split-step Crank-Nicolson algorithm for imaginary- and real-time propagation, which enables efficient calculation of stationary and non-stationary solutions, respectively. The present OpenMP programs are designed for computers with multi-core processors and optimized for compiling with both the commercially-licensed Intel Fortran and the popular free open-source GNU Fortran compilers. The programs are easy to use and are elaborated with helpful comments for the users. All input parameters are listed at the beginning of each program. Different output files provide physical quantities such as energy, chemical potential, root-mean-square sizes, densities, etc. We also present speedup test results for new versions of the programs. Program files doi:http://dx.doi.org/10.17632/y8zk3jgn84.2 Licensing provisions: Apache License 2.0 Programming language: OpenMP GNU and Intel Fortran 90. Computer: Any multi-core personal computer or workstation with the appropriate OpenMP-capable Fortran compiler installed. Number of processors used: All available CPU cores on the executing computer. Journal reference of previous version: Comput. Phys. Commun. 180 (2009) 1888; ibid. 204 (2016) 209. Does the new version supersede the previous version?: Not completely. It does supersede previous Fortran programs from both references above, but not OpenMP C programs from Comput. Phys. Commun. 204 (2016) 209. 
Nature of problem: The present Open Multi-Processing (OpenMP) Fortran programs, optimized for use with commercially-licensed Intel Fortran and free open-source GNU Fortran compilers, solve the time-dependent nonlinear partial differential (GP) equation for a trapped Bose-Einstein condensate in one (1d), two (2d), and three (3d) spatial dimensions for six different trap symmetries: axially and radially symmetric traps in 3d, circularly symmetric traps in 2d, fully isotropic (spherically symmetric) and fully anisotropic traps in 2d and 3d, as well as 1d traps, where no spatial symmetry is considered. Solution method: We employ the split-step Crank-Nicolson algorithm to discretize the time-dependent GP equation in space and time. The discretized equation is then solved by imaginary- or real-time propagation, employing adequately small space and time steps, to yield the solution of stationary and non-stationary problems, respectively. Reasons for the new version: Previously published Fortran programs [1,2] have now become popular tools [3] for solving the GP equation. These programs have been translated to the C programming language [4] and later extended to the more complex scenario of dipolar atoms [5]. Now virtually all computers have multi-core processors and some have motherboards with more than one physical central processing unit (CPU), which may increase the number of available CPU cores on a single computer to several tens. The C programs have been adapted to be very fast on such multi-core modern computers using general-purpose graphic processing units (GPGPU) with Nvidia CUDA and computer clusters using Message Passing Interface (MPI) [6]. Nevertheless, previously developed Fortran programs are also commonly used for scientific computation and most of them use a single CPU core at a time in modern multi-core laptops, desktops, and workstations. 
Unless the Fortran programs are made aware and capable of making efficient use of the available CPU cores, the solution of even a realistic dynamical 1d problem, not to mention the more complicated 2d and 3d problems, could be time consuming using the Fortran programs. Previously, we published auto-parallel Fortran programs [2] suitable for the Intel (but not GNU) compiler for solving the GP equation. Hence, a need for the full OpenMP version of the Fortran programs to reduce the execution time cannot be overemphasized. To address this issue, we provide here such OpenMP Fortran programs, optimized for both Intel and GNU Fortran compilers and capable of using all available CPU cores, which can significantly reduce the execution time. Summary of revisions: Previous Fortran programs [1] for solving the time-dependent GP equation in 1d, 2d, and 3d with different trap symmetries have been parallelized using the OpenMP interface to reduce the execution time on multi-core processors. There are six different trap symmetries considered, resulting in six programs for imaginary-time propagation and six for real-time propagation, for a total of 12 programs included in the BEC-GP-OMP-FOR software package. All input data (number of atoms, scattering length, harmonic oscillator trap length, trap anisotropy, etc.) are conveniently placed at the beginning of each program, as before [2]. Present programs introduce a new input parameter, which is designated by Number_of_Threads and defines the number of CPU cores of the processor to be used in the calculation. If one sets the value 0 for this parameter, all available CPU cores will be used. For the most efficient calculation it is advisable to leave one CPU core unused for the system's background jobs. For example, on a machine with 20 CPU cores such as the one we used for testing, it is advisable to use up to 19 CPU cores. However, the total number of used CPU cores can be divided into more than one job. 
For instance, one can run three simulations simultaneously using 10, 4, and 5 CPU cores, respectively, for a total of 19 CPU cores in use on a 20-core computer. The Fortran source programs are located in the directory src, and can be compiled by the make command using the makefile in the root directory BEC-GP-OMP-FOR of the software package. The examples of produced output files can be found in the directory output, although some large density files are omitted, to save space. The programs calculate the values of actually used dimensionless nonlinearities from the physical input parameters, where the input parameters correspond to the identical nonlinearity values as in the previously published programs [1], so that the output files of the old and new programs can be directly compared. The output files are conveniently named such that their contents can be easily identified, following the naming convention introduced in Ref. [2]. For example, a file named <code>-out.txt, where <code> is the name of the individual program, represents the general output file containing input data, time and space steps, nonlinearity, energy and chemical potential, and was named fort.7 in the old Fortran version of programs [1]. A file named <code>-den.txt is the output file with the condensate density, which had the names fort.3 and fort.4 in the old Fortran version [1] for imaginary- and real-time propagation programs, respectively. Other possible density outputs, such as the initial density, are commented out in the programs to have a simpler set of output files, but users can uncomment and re-enable them, if needed. In addition, there are output files for reduced (integrated) 1d and 2d densities for different programs. In the real-time programs there is also an output file reporting the dynamics of evolution of root-mean-square sizes after a perturbation is introduced. The supplied real-time programs solve the stationary GP equation, and then calculate the dynamics. 
As the imaginary-time programs are more accurate than the real-time programs for the solution of a stationary problem, one can first solve the stationary problem using the imaginary-time programs, adapt the real-time programs to read the pre-calculated wave function and then study the dynamics. In that case the parameter NSTP in the real-time programs should be set to zero and the space mesh and nonlinearity parameters should be identical in both programs. The reader is advised to consult our previous publication where a complete description of the output files is given [2]. A readme.txt file, included in the root directory, explains the procedure to compile and run the programs. We tested our programs on a workstation with two 10-core Intel Xeon E5-2650 v3 CPUs. The parameters used for testing are given in sample input files, provided in the corresponding directory together with the programs. In Table 1 we present wall-clock execution times for runs on 1, 6, and 19 CPU cores for programs compiled using Intel and GNU Fortran compilers. The corresponding columns "Intel speedup" and "GNU speedup" give the ratio of wall-clock execution times of runs on 1 and 19 CPU cores, and denote the actual measured speedup for 19 CPU cores. In all cases and for all numbers of CPU cores, although the GNU Fortran compiler gives excellent results, the Intel Fortran compiler turns out to be slightly faster. Note that during these tests we always ran only a single simulation on a workstation at a time, to avoid any possible interference issues. Therefore, the obtained wall-clock times are more reliable than the ones that could be measured with two or more jobs running simultaneously. We also studied the speedup of the programs as a function of the number of CPU cores used. The performance of the Intel and GNU Fortran compilers is illustrated in Fig. 1, where we plot the speedup and actual wall-clock times as functions of the number of CPU cores for 2d and 3d programs. 
We see that the speedup increases monotonically with the number of CPU cores in all cases and has large values (between 10 and 14 for 3d programs) for the maximal number of cores. This fully justifies the development of OpenMP programs, which enable much faster and more efficient solving of the GP equation. However, a slow saturation in the speedup with the further increase in the number of CPU cores is observed in all cases, as expected. The speedup tends to increase for programs in higher dimensions, as they become more complex and have to process more data. This is why the speedups of the supplied 2d and 3d programs are larger than those of 1d programs. Also, for a single program the speedup increases with the size of the spatial grid, i.e., with the number of spatial discretization points, since this increases the amount of calculations performed by the program. To demonstrate this, we tested the supplied real2d-th program and varied the number of spatial discretization points NX=NY from 20 to 1000. The measured speedup obtained when running this program on 19 CPU cores as a function of the number of discretization points is shown in Fig. 2. The speedup first increases rapidly with the number of discretization points and eventually saturates. Additional comments: Example inputs provided with the programs take less than 30 minutes to run on a workstation with two Intel Xeon E5-2650 v3 processors (2 QPI links, 10 CPU cores, 25 MB cache, 2.3 GHz).
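
    The split-step Crank-Nicolson idea named in this record can be sketched in a few lines of serial Python. This is not the BEC-GP-OMP-FOR code: it is a minimal 1d imaginary-time example under assumed conventions (dimensionless equation with kinetic term -(1/2) d^2/dx^2, trap x^2/2, wave function normalized to 1, zero boundary values), and the names thomas and ground_state_energy as well as the grid and step sizes are chosen only for illustration.

```python
import math


def thomas(lower, diag, upper, rhs):
    """Solve a tridiagonal system with constant bands (Thomas algorithm)."""
    n = len(rhs)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = upper / diag
    d[0] = rhs[0] / diag
    for i in range(1, n):
        denom = diag - lower * c[i - 1]
        c[i] = upper / denom
        d[i] = (rhs[i] - lower * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def ground_state_energy(g=0.0, n=201, x_max=10.0, dt=0.005, steps=1000):
    """Imaginary-time split-step Crank-Nicolson for a 1d GP equation.

    Assumed form: d(psi)/d(tau) = [ (1/2) d2/dx2 - x^2/2 - g psi^2 ] psi.
    Returns the energy per particle of the converged state.
    """
    dx = 2.0 * x_max / (n - 1)
    x = [-x_max + i * dx for i in range(n)]
    v = [0.5 * xi * xi for xi in x]
    psi = [math.exp(-xi * xi) for xi in x]      # arbitrary even initial guess
    r = dt / (4.0 * dx * dx)                    # Crank-Nicolson coefficient

    for _ in range(steps):
        # Step 1: potential and nonlinear terms, integrated exactly.
        psi = [p * math.exp(-dt * (vi + g * p * p)) for p, vi in zip(psi, v)]
        # Step 2: kinetic term, (I - r*D) psi_new = (I + r*D) psi_old,
        # where D is the second-difference stencil [1, -2, 1].
        rhs = []
        for i in range(n):
            left = psi[i - 1] if i > 0 else 0.0
            right = psi[i + 1] if i < n - 1 else 0.0
            rhs.append(psi[i] + r * (left - 2.0 * psi[i] + right))
        psi = thomas(-r, 1.0 + 2.0 * r, -r, rhs)
        # Imaginary-time propagation does not conserve the norm: restore it.
        norm = math.sqrt(sum(p * p for p in psi) * dx)
        psi = [p / norm for p in psi]

    kinetic = 0.5 * sum(((psi[i + 1] - psi[i]) / dx) ** 2 for i in range(n - 1)) * dx
    potential = sum((vi + 0.5 * g * p * p) * p * p for p, vi in zip(psi, v)) * dx
    return kinetic + potential


energy_linear = ground_state_energy(g=0.0)
energy_repulsive = ground_state_energy(g=1.0)
print("g = 0:", energy_linear)
print("g = 1:", energy_repulsive)
```

    With g = 0 the problem reduces to the harmonic oscillator, whose ground-state energy in these units is 1/2, which gives a simple sanity check. In the published programs the loops over grid points are the natural place for OpenMP parallelism; this sketch makes no attempt to reproduce that.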

  10. Computational multicore on two-layer 1D shallow water equations for erodible dambreak

    NASA Astrophysics Data System (ADS)

    Simanjuntak, C. A.; Bagustara, B. A. R. H.; Gunawan, P. H.

    2018-03-01

    The simulation of an erodible dambreak using two-layer shallow water equations (SWE) and the SCHR scheme is elaborated in this paper. The results show that the two-layer SWE model is in good agreement with the experimental data obtained at the Université Catholique de Louvain in Louvain-la-Neuve. Moreover, results for the parallel algorithm on multicore architectures are given. They show that Computer I, with an Intel(R) Core(TM) i5-2500 quad-core CPU, has the best performance in accelerating the computational time, whereas Computer III, with an AMD A6-5200 quad-core APU, is observed to have higher speedup and efficiency. The speedup and efficiency of Computer III for a grid size of 3200 are 3.716050530 times and 92.9%, respectively.
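
    The two figures quoted for Computer III are linked by the usual definition of parallel efficiency, speedup divided by the number of cores. The short Python check below assumes that definition and assumes that all four cores of the quad-core APU were used; the function name parallel_efficiency is illustrative.

```python
def parallel_efficiency(speedup, cores):
    """Parallel efficiency as the fraction of ideal (linear) speedup achieved."""
    return speedup / cores


reported_speedup = 3.716050530   # value quoted in the record
cores = 4                        # assumed: all cores of the quad-core APU

efficiency = parallel_efficiency(reported_speedup, cores)
print(f"efficiency = {efficiency:.1%}")
```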

  11. A Parallel Pipelined Renderer for the Time-Varying Volume Data

    NASA Technical Reports Server (NTRS)

    Chiueh, Tzi-Cker; Ma, Kwan-Liu

    1997-01-01

    This paper presents a strategy for efficiently rendering time-varying volume data sets on a distributed-memory parallel computer. Time-varying volume data take large storage space and visualizing them requires reading large files continuously or periodically throughout the course of the visualization process. Instead of using all the processors to collectively render one volume at a time, a pipelined rendering process is formed by partitioning processors into groups to render multiple volumes concurrently. In this way, the overall rendering time may be greatly reduced because the pipelined rendering tasks are overlapped with the I/O required to load each volume into a group of processors; moreover, parallelization overhead may be reduced as a result of partitioning the processors. We modify an existing parallel volume renderer to exploit various levels of rendering parallelism and to study how the partitioning of processors may lead to optimal rendering performance. Two factors which are important to the overall execution time are resource utilization efficiency and pipeline startup latency. The optimal partitioning configuration is the one that balances these two factors. Tests on Intel Paragon computers show that in general optimal partitionings do exist for a given rendering task and result in a 40-50% saving in overall rendering time.

  12. Real-time trajectory optimization on parallel processors

    NASA Technical Reports Server (NTRS)

    Psiaki, Mark L.

    1993-01-01

    A parallel algorithm has been developed for rapidly solving trajectory optimization problems. The goal of the work has been to develop an algorithm that is suitable to do real-time, on-line optimal guidance through repeated solution of a trajectory optimization problem. The algorithm has been developed on an INTEL iPSC/860 message passing parallel processor. It uses a zero-order-hold discretization of a continuous-time problem and solves the resulting nonlinear programming problem using a custom-designed augmented Lagrangian nonlinear programming algorithm. The algorithm achieves parallelism of function, derivative, and search direction calculations through the principle of domain decomposition applied along the time axis. It has been encoded and tested on 3 example problems: the Goddard problem, the acceleration-limited, planar minimum-time to the origin problem, and a National Aerospace Plane minimum-fuel ascent guidance problem. Execution times as fast as 118 sec of wall clock time have been achieved for a 128-stage Goddard problem solved on 32 processors. A 32-stage minimum-time problem has been solved in 151 sec on 32 processors. A 32-stage National Aerospace Plane problem required 2 hours when solved on 32 processors. A speed-up factor of 7.2 has been achieved by using 32 nodes instead of 1 node to solve a 64-stage Goddard problem.
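
    The zero-order-hold discretization mentioned in this record holds the control constant over each stage, which turns a continuous-time problem into a finite sequence of stage updates. The Python sketch below shows this for a one-dimensional double integrator (position, velocity, commanded acceleration), a system chosen here only because its exact stage update is simple; it is not taken from the paper, and the names zoh_step and simulate are illustrative.

```python
def zoh_step(pos, vel, accel, dt):
    """Exact update of a double integrator with accel held constant over dt."""
    new_pos = pos + vel * dt + 0.5 * accel * dt * dt
    new_vel = vel + accel * dt
    return new_pos, new_vel


def simulate(controls, dt, pos=0.0, vel=0.0):
    """Apply one held control value per stage and return the final state."""
    for u in controls:
        pos, vel = zoh_step(pos, vel, u, dt)
    return pos, vel


# Ten stages of 0.1 s each with unit acceleration, starting from rest.
final_pos, final_vel = simulate([1.0] * 10, 0.1)
print(final_pos, final_vel)
```

    After discretization the unknowns are the per-stage control values, so the problem becomes a finite-dimensional nonlinear program whose stages can be grouped along the time axis, which is the decomposition the abstract describes.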

  13. GPU Computing in Bayesian Inference of Realized Stochastic Volatility Model

    NASA Astrophysics Data System (ADS)

    Takaishi, Tetsuya

    2015-01-01

    The realized stochastic volatility (RSV) model, which utilizes the realized volatility as additional information, has been proposed to infer the volatility of financial time series. We consider the Bayesian inference of the RSV model by the Hybrid Monte Carlo (HMC) algorithm. The HMC algorithm can be parallelized and thus performed on the GPU for speedup. The GPU code is developed with CUDA Fortran. We compare the computational time in performing the HMC algorithm on a GPU (GTX 760) and a CPU (Intel i7-4770 3.4 GHz) and find that the GPU can be up to 17 times faster than the CPU. We also code the program with OpenACC and find that appropriate coding can achieve a speedup similar to that of CUDA Fortran.
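
    For readers unfamiliar with the Hybrid Monte Carlo (HMC) algorithm named in this record, the following serial Python sketch shows its two ingredients, a leapfrog trajectory and a Metropolis accept/reject step. The target is a one-dimensional standard normal distribution, chosen purely as a toy; it is not the RSV posterior, and the function name hmc_sample, the step size, and the trajectory length are assumptions for the example.

```python
import math
import random


def hmc_sample(n_samples, step=0.2, n_leapfrog=8, seed=1):
    """HMC for a standard normal target, U(q) = q^2 / 2, so dU/dq = q."""
    rng = random.Random(seed)
    q = 0.0
    samples = []
    for _ in range(n_samples):
        p = rng.gauss(0.0, 1.0)              # refresh the momentum
        q_new, p_new = q, p
        # Leapfrog integration: half kick, alternating drifts and kicks, half kick.
        p_new -= 0.5 * step * q_new
        for i in range(n_leapfrog):
            q_new += step * p_new
            if i < n_leapfrog - 1:
                p_new -= step * q_new
        p_new -= 0.5 * step * q_new
        # Metropolis test on the change in total energy H = U + p^2 / 2.
        h_old = 0.5 * q * q + 0.5 * p * p
        h_new = 0.5 * q_new * q_new + 0.5 * p_new * p_new
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            q = q_new
        samples.append(q)
    return samples


samples = hmc_sample(4000)
mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
print("mean:", mean, "variance:", variance)
```

    In a high-dimensional model the gradient and leapfrog updates act on every latent variable at once, which is the part that lends itself to GPU parallelization.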

  14. Planning assistance for the 30/20 GHz program, volume 2

    NASA Technical Reports Server (NTRS)

    Al-Kinani, G.; Frankfort, M.; Kaushal, D.; Markham, R.; Siperko, C.; Wall, M.

    1981-01-01

    In the baseline concept development the communications payload on Flight 1 was specified to consist of on-board trunking and emergency communications systems (ECS). On Flight 2 the communications payloads consisted of trunking and CPS on-board systems, the CPS capability replacing the Flight 1 ECS. No restriction was placed on the launch vehicle size. Constraints placed on multiple concept development effort were that launch vehicle size for Concept 1 was restricted to SUSS-D and for Concept 2 a SUSS-A. The design concept development was based on satisfying the baseline requirements set forth in the SOW for a single demonstration flight system. Key constraints on contractors were cost and launch vehicle size. Five major areas of new technology development were reviewed: (1) 30 GHz low noise receivers; (2) 20 GHz Power Amplifiers; (3) SS-TDMA switch; (4) Baseband Processor; (5) Multibeam Antennas.

  15. Performance Evaluation of Synthetic Benchmarks and Image Processing (IP) Kernels on Intel and PowerPC Processors

    DTIC Science & Technology

    2013-08-01

    2006 Linux Q1 2005 Pentium D (830) 3 2/2 2511 1148 3617 Windows Vista Q2 2005 Pentium D (830) 3 2/2 2938 1155 3556 Windows XP Q2 2005 PowerPC 970MP 2...1 3734 3439 1304 Cell Broadband Engine 3.2 1/1 0.207 2006 239 441 Pentium D (830) 3 2/2 2 3617 2511 1148 Pentium D (830) 3 2/2 2 3556 2938 1155

  16. Multivariate statistical analysis of low-voltage EDS spectrum images

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Anderson, I.M.

    1998-03-01

    Whereas energy-dispersive X-ray spectrometry (EDS) has been used for compositional analysis in the scanning electron microscope for 30 years, the benefits of using low operating voltages for such analyses have been explored only during the last few years. This paper couples low-voltage EDS with two other emerging areas of characterization: spectrum imaging and multivariate statistical analysis. The specimen analyzed for this study was a finished Intel Pentium processor, with the polyimide protective coating stripped off to expose the final active layers.

  17. Building Columbia from the SysAdmin View

    NASA Technical Reports Server (NTRS)

    Chan, David

    2005-01-01

    Project Columbia was built at NASA Ames Research Center in partnership with SGI and Intel. Columbia consists of 20 512-processor Altix machines with 440 TB of storage and achieved 51.87 TeraFLOPS, ranking second fastest on the TOP500 list at SuperComputing 2004. Columbia was delivered, installed and put into production in 3 months. On average, a new Columbia node was brought into production in less than a week. Columbia's configuration, installation, and future plans will be discussed.

  18. Command/response protocols and concurrent software

    NASA Technical Reports Server (NTRS)

    Bynum, W. L.

    1987-01-01

    A version of the program to control the parallel jaw gripper is documented. The parallel jaw end-effector hardware and the Intel 8031 processor that is used to control the end-effector are briefly described. A general overview of the controller program is given, and a complete description of the program's structure and design is included. There are three appendices: a memory map of the on-chip RAM, a cross-reference listing of the self-scheduling routines, and a summary of the top-level and monitor commands.

  19. SC'11 Poster: A Highly Efficient MGPT Implementation for LAMMPS; with Strong Scaling

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Oppelstrup, T; Stukowski, A; Marian, J

    2011-12-07

    The MGPT potential has been implemented as a drop-in package for the general molecular dynamics code LAMMPS. We implement an improved communication scheme that shrinks the communication layer thickness and improves load balancing. This results in unprecedented strong scaling, with speedup continuing beyond 1/8 atom/core. In addition, we have optimized the small matrix linear algebra with generic blocking (for all processors) and specific SIMD intrinsics for vectorization on Intel, AMD, and BlueGene CPUs.

  20. Investigating the Importance of Stereo Displays for Helicopter Landing Simulation

    DTIC Science & Technology

    2016-08-11

    visualization. The two instances of X Plane® were implemented using two separate PCs, each incorporating Intel i7 processors and Nvidia Quadro K4200... Nvidia GeForce GTX 680 graphics card was used to administer the stereo acuity and fusion range tests. The tests were displayed on an Asus VG278HE 3D...monitor with 1920x1080 pixels that was compatible with Nvidia 3D Vision2 and that used active shutter glasses. At a 1-m viewing distance, the

  1. Parallel algorithms for simulating continuous time Markov chains

    NASA Technical Reports Server (NTRS)

    Nicol, David M.; Heidelberger, Philip

    1992-01-01

    We have previously shown that the mathematical technique of uniformization can serve as the basis of synchronization for the parallel simulation of continuous-time Markov chains. This paper reviews the basic method and compares five different methods based on uniformization, evaluating their strengths and weaknesses as a function of problem characteristics. The methods vary in their use of optimism, logical aggregation, communication management, and adaptivity. Performance evaluation is conducted on the Intel Touchstone Delta multiprocessor, using up to 256 processors.
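
    Uniformization, the technique this record builds on, replaces the state-dependent jump times of a continuous-time Markov chain with events from a single Poisson process whose rate is at least the largest exit rate; at each event the chain either makes a real transition or stays put (a pseudo-event). The serial Python sketch below illustrates this for a two-state chain and compares the estimate with the known closed-form answer. It is not one of the paper's parallel methods: the function name simulate_uniformized, the rates, and the choice of uniformization rate are assumptions for the example.

```python
import math
import random


def simulate_uniformized(rate_01, rate_10, t_end, rng):
    """State at time t_end of a two-state CTMC started in state 0."""
    lam = 1.5 * max(rate_01, rate_10)   # any rate >= the largest exit rate works
    state, t = 0, 0.0
    while True:
        t += rng.expovariate(lam)       # next event of the uniformizing Poisson process
        if t > t_end:
            return state
        jump_rate = rate_01 if state == 0 else rate_10
        if rng.random() < jump_rate / lam:
            state = 1 - state           # real transition
        # otherwise: pseudo-event, the state is unchanged


rate_01, rate_10, t_end = 2.0, 1.0, 1.0
rng = random.Random(2024)
runs = 20_000
estimate = sum(simulate_uniformized(rate_01, rate_10, t_end, rng) for _ in range(runs)) / runs

# Closed-form probability of being in state 1 at t_end for a two-state chain.
total = rate_01 + rate_10
exact = rate_01 / total * (1.0 - math.exp(-total * t_end))
print("estimate:", estimate, "exact:", exact)
```

    Because the event times come from one Poisson process that does not depend on the state, they can be generated ahead of the state updates, which is what makes the approach attractive as a synchronization basis for parallel simulation.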

  2. Pentium Pro inside. 1; A treecode at 430 Gigaflops on ASCI Red

    NASA Technical Reports Server (NTRS)

    Warren, M. S.; Becker, D. J.; Sterling, T.; Salmon, J. K.; Goda, M. P.

    1997-01-01

    As an entry for the 1997 Gordon Bell performance prize, we present results from two methods of solving the gravitational N-body problem on the Intel Teraflops system at Sandia National Laboratories (ASCI Red). The first method, an O(N^2) algorithm, obtained 635 Gigaflops for a 1 million particle problem on 6800 Pentium Pro processors. The second solution method, a tree-code which scales as O(N log N), sustained 170 Gigaflops over a continuous 9.4 hour period on 4096 processors, integrating the motion of 322 million mutually interacting particles in a cosmology simulation, while saving over 100 Gigabytes of raw data. Additionally, the tree-code sustained 430 Gigaflops on 6800 processors for the first 5 time-steps of that simulation. This tree-code solution is approximately 10^5 times more efficient than the O(N^2) algorithm for this problem. As an entry for the 1997 Gordon Bell price/performance prize, we present two calculations from the disciplines of astrophysics and fluid dynamics. The simulations were performed on two 16 Pentium Pro processor Beowulf-class computers (Loki and Hyglac) constructed entirely from commodity personal computer technology, at a cost of roughly $50k each in September, 1996. The price of an equivalent system in August 1997 is less than $30k. At Los Alamos, Loki performed a gravitational tree-code N-body simulation of galaxy formation using 9.75 million particles, which sustained an average of 879 Mflops over a ten day period, and produced roughly 10 Gbytes of raw data.

  3. Time-efficient simulations of tight-binding electronic structures with Intel Xeon Phi(TM) many-core processors

    NASA Astrophysics Data System (ADS)

    Ryu, Hoon; Jeong, Yosang; Kang, Ji-Hoon; Cho, Kyu Nam

    2016-12-01

    Modelling of multi-million atomic semiconductor structures is important as it not only predicts properties of physically realizable novel materials, but can accelerate advanced device designs. This work elaborates a new Technology-Computer-Aided-Design (TCAD) tool for nanoelectronics modelling, which uses a sp3d5s* tight-binding approach to describe multi-million atomic structures, and simulates electronic structures with high performance computing (HPC), including atomic effects such as alloy and dopant disorders. Named Quantum simulation tool for Advanced Nanoscale Devices (Q-AND), the tool shows good scalability on traditional multi-core HPC clusters, implying a strong capability for large-scale electronic structure simulations, with a particularly remarkable performance enhancement on the latest clusters of Intel Xeon Phi(TM) coprocessors. A review of a recent modelling study, conducted to understand an experimental work on highly phosphorus-doped silicon nanowires, is presented to demonstrate the utility of Q-AND. Having been developed via an Intel Parallel Computing Center project, Q-AND will be open to the public to establish a sound framework of nanoelectronics modelling with advanced HPC clusters of a many-core base. With details of the development methodology and an exemplary study of dopant electronics, this work will present a practical guideline for TCAD development to researchers in the field of computational nanoelectronics.

  4. Parallel algorithms for quantum chemistry. I. Integral transformations on a hypercube multiprocessor

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Whiteside, R.A.; Binkley, J.S.; Colvin, M.E.

    1987-02-15

    For many years it has been recognized that fundamental physical constraints such as the speed of light will limit the ultimate speed of single processor computers to less than about three billion floating point operations per second (3 GFLOPS). This limitation is becoming increasingly restrictive as commercially available machines are now within an order of magnitude of this asymptotic limit. A natural way to avoid this limit is to harness together many processors to work on a single computational problem. In principle, these parallel processing computers have speeds limited only by the number of processors one chooses to acquire. The usefulness of potentially unlimited processing speed to a computationally intensive field such as quantum chemistry is obvious. If these methods are to be applied to significantly larger chemical systems, parallel schemes will have to be employed. For this reason we have developed distributed-memory algorithms for a number of standard quantum chemical methods. We are currently implementing these on a 32 processor Intel hypercube. In this paper we present our algorithm and benchmark results for one of the bottleneck steps in quantum chemical calculations: the four index integral transformation.
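
    The four-index integral transformation named at the end of this record converts two-electron integrals from an atomic-orbital basis to a molecular-orbital basis. The serial Python sketch below, which is not the paper's hypercube algorithm, contrasts the direct eight-loop formula (cost growing as N^8) with the standard sequence of four one-index "quarter" transformations (cost growing as N^5) and shows that both give the same numbers. The random input data and the names transform_naive, quarter_step, and transform_quarters are assumptions for illustration.

```python
import itertools
import random


def transform_naive(ao, c):
    """Direct formula: sum over all four atomic-orbital indices at once."""
    n = len(c)
    rng_n = range(n)
    mo = {}
    for p, q, r, s in itertools.product(rng_n, repeat=4):
        total = 0.0
        for m, v, l, g in itertools.product(rng_n, repeat=4):
            total += c[m][p] * c[v][q] * c[l][r] * c[g][s] * ao[m, v, l, g]
        mo[p, q, r, s] = total
    return mo


def quarter_step(t, c, axis):
    """Transform only the index at position `axis` of the four-index array t."""
    n = len(c)
    out = {}
    for idx in itertools.product(range(n), repeat=4):
        total = 0.0
        for m in range(n):
            src = list(idx)
            src[axis] = m
            total += c[m][idx[axis]] * t[tuple(src)]
        out[idx] = total
    return out


def transform_quarters(ao, c):
    """Four successive one-index transformations."""
    t = ao
    for axis in range(4):
        t = quarter_step(t, c, axis)
    return t


n = 3                                   # tiny basis so the N^8 version stays cheap
rng = random.Random(7)
ao = {idx: rng.uniform(-1.0, 1.0) for idx in itertools.product(range(n), repeat=4)}
c = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]

mo_naive = transform_naive(ao, c)
mo_fast = transform_quarters(ao, c)
max_diff = max(abs(mo_naive[k] - mo_fast[k]) for k in mo_naive)
print("largest difference:", max_diff)
```

    A distributed-memory version has to decide how the four-index array is split across processors between the quarter steps; that data-movement question is the subject of the paper and is not addressed here.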

  5. Novel Optical Processor for Phased Array Antenna.

    DTIC Science & Technology

    1992-10-20

    parallel glass slide into the signal beam optical loop. The parallel glass acts like a variable phase shifter to the signal beam simulating phase drift...A list of possible designs are given as follows , _ _ Velocity fa (100dB/cm) Lumit Wavelength I M2I1 TeO2 Longi 4.2 /m/ns about 3 GHz 1.4 4m 34 Fast...subject to achievable acoustic frequency, the preferred materials are the slow shear wave in TeO2 , the fast shear wave in TeO2 or the shear waves in

  6. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Carver, R; Popple, R; Benhabib, S

    Purpose: To evaluate the accuracy of electron dose distribution calculated by the Varian Eclipse electron Monte Carlo (eMC) algorithm for use with recent commercially available bolus electron conformal therapy (ECT). Methods: eMC-calculated electron dose distributions for bolus ECT have been compared to those previously measured for cylindrical phantoms (retromolar trigone and nose), whose axial cross sections were based on the mid-PTV CT anatomy for each site. The phantoms consisted of SR4 muscle substitute, SR4 bone substitute, and air. The bolus ECT treatment plans were imported into the Eclipse treatment planning system and calculated using the maximum allowable histories (2×10^9), resulting in a statistical error of <0.2%. Smoothing was not used for these calculations. Differences between eMC-calculated and measured dose distributions were evaluated in terms of absolute dose difference as well as distance to agreement (DTA). Results: Results from the eMC for the retromolar trigone phantom showed 89% (41/46) of dose points within 3% dose difference or 3 mm DTA. There was an average dose difference of −0.12% with a standard deviation of 2.56%. Results for the nose phantom showed 95% (54/57) of dose points within 3% dose difference or 3 mm DTA. There was an average dose difference of 1.12% with a standard deviation of 3.03%. Dose calculation times for the retromolar trigone and nose treatment plans were 15 min and 22 min, respectively, using 16 processors (Intel Xeon E5-2690, 2.9 GHz) on a Varian Eclipse framework agent server (FAS). Results of this study were consistent with those previously reported for accuracy of the eMC electron dose algorithm and for the .decimal, Inc. pencil beam redefinition algorithm used to plan the bolus. Conclusion: These results show that the accuracy of the Eclipse eMC algorithm is suitable for clinical implementation of bolus ECT.

  7. High-performance sparse matrix-matrix products on Intel KNL and multicore architectures

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Nagasaka, Y; Matsuoka, S; Azad, A

    Sparse matrix-matrix multiplication (SpGEMM) is a computational primitive that is widely used in areas ranging from traditional numerical applications to recent big data analysis and machine learning. Although many SpGEMM algorithms have been proposed, hardware specific optimizations for multi- and many-core processors are lacking and a detailed analysis of their performance under various use cases and matrices is not available. We first identify and mitigate multiple bottlenecks with memory management and thread scheduling on Intel Xeon Phi (Knights Landing or KNL). Specifically targeting multi- and many-core processors, we develop a hash-table-based algorithm and optimize a heap-based shared-memory SpGEMM algorithm. We examine their performance together with other publicly available codes. Different from the literature, our evaluation also includes use cases that are representative of real graph algorithms, such as multi-source breadth-first search or triangle counting. Our hash-table and heap-based algorithms show significant speedups over libraries in the majority of the cases, while different algorithms dominate the other scenarios with different matrix size, sparsity, compression factor and operation type. We wrap up in-depth evaluation results and make a recipe to give the best SpGEMM algorithm for a target scenario. A critical finding is that hash-table-based SpGEMM gets a significant performance boost if the nonzeros are not required to be sorted within each row of the output matrix.
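
    The row-wise hash-accumulator idea mentioned in the abstract above can be sketched as follows. This is a minimal single-threaded illustration, not the authors' implementation: the function name `spgemm_hash`, the dict-of-dicts matrix layout, and the `sort_output` flag are assumptions made for this example. It only shows why skipping the per-row sort saves work.

```python
def spgemm_hash(a_rows, b_rows, sort_output=False):
    """Multiply two sparse matrices stored as {row: {col: value}} dicts.

    Hypothetical helper for illustration only. Each output row is
    accumulated in a hash table (a Python dict) keyed by column index.
    """
    c_rows = {}
    for i, a_row in a_rows.items():
        acc = {}  # hash-table accumulator for output row i
        for k, a_ik in a_row.items():
            # Scale row k of B by A[i, k] and merge it into the accumulator.
            for j, b_kj in b_rows.get(k, {}).items():
                acc[j] = acc.get(j, 0.0) + a_ik * b_kj
        if acc:
            # Sorting the column indices is an optional extra pass per row.
            c_rows[i] = dict(sorted(acc.items())) if sort_output else acc
    return c_rows


A = {0: {0: 1.0, 1: 2.0}, 1: {1: 3.0}}
B = {0: {1: 4.0}, 1: {0: 5.0, 1: 6.0}}
C = spgemm_hash(A, B)
```

    With `sort_output=False` the columns of each output row appear in insertion order, which is the unsorted case the abstract reports as faster.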

  8. A Parallel Rendering Algorithm for MIMD Architectures

    NASA Technical Reports Server (NTRS)

    Crockett, Thomas W.; Orloff, Tobias

    1991-01-01

    Applications such as animation and scientific visualization demand high performance rendering of complex three dimensional scenes. To deliver the necessary rendering rates, highly parallel hardware architectures are required. The challenge is then to design algorithms and software which effectively use the hardware parallelism. A rendering algorithm targeted to distributed memory MIMD architectures is described. For maximum performance, the algorithm exploits both object-level and pixel-level parallelism. The behavior of the algorithm is examined both analytically and experimentally. Its performance for large numbers of processors is found to be limited primarily by communication overheads. An experimental implementation for the Intel iPSC/860 shows increasing performance from 1 to 128 processors across a wide range of scene complexities. It is shown that minimal modifications to the algorithm will adapt it for use on shared memory architectures as well.

  9. Evaluation of the Xeon phi processor as a technology for the acceleration of real-time control in high-order adaptive optics systems

    NASA Astrophysics Data System (ADS)

    Barr, David; Basden, Alastair; Dipper, Nigel; Schwartz, Noah; Vick, Andy; Schnetler, Hermine

    2014-08-01

    We present wavefront reconstruction acceleration of high-order AO systems using an Intel Xeon Phi processor. The Xeon Phi is a coprocessor providing many integrated cores and designed for accelerating compute intensive, numerical codes. Unlike other accelerator technologies, it allows virtually unchanged C/C++ to be recompiled to run on the Xeon Phi, giving the potential of making development, upgrade and maintenance faster and less complex. We benchmark the Xeon Phi in the context of AO real-time control by running a matrix vector multiply (MVM) algorithm. We investigate variability in execution time and demonstrate a substantial speed-up in loop frequency. We examine the integration of a Xeon Phi into an existing RTC system and show that performance improvements can be achieved with limited development effort.

  10. Online Mapping and Perception Algorithms for Multi-robot Teams Operating in Urban Environments

    DTIC Science & Technology

    2015-01-01

    each method on a 2.53 GHz Intel i5 laptop. All our algorithms are hand-optimized, implemented in Java and single threaded. To determine which algorithm...approach would be to label all the pixels in the image with an x, y, z point. However, the angular resolution of the camera is finer than that of the...edge criterion. That is, each edge is either present or absent. In [42], edge existence is further screened by a fixed threshold for angular

  11. Evaluating the networking characteristics of the Cray XC-40 Intel Knights Landing-based Cori supercomputer at NERSC

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Doerfler, Douglas; Austin, Brian; Cook, Brandon

    There are many potential issues associated with deploying the Intel Xeon Phi™ (code named Knights Landing [KNL]) manycore processor in a large-scale supercomputer. One in particular is the ability to fully utilize the high-speed communications network, given that the serial performance of a Xeon Phi™ core is a fraction of a Xeon® core. In this paper, we take a look at the trade-offs associated with allocating enough cores to fully utilize the Aries high-speed network versus cores dedicated to computation, e.g., the trade-off between MPI and OpenMP. In addition, we evaluate new features of Cray MPI in support of KNL, such as internode optimizations. We also evaluate one-sided programming models such as Unified Parallel C. We quantify the impact of the above trade-offs and features using a suite of National Energy Research Scientific Computing Center applications.

  12. Parallel computing on Unix workstation arrays

    NASA Astrophysics Data System (ADS)

    Reale, F.; Bocchino, F.; Sciortino, S.

    1994-12-01

    We have tested arrays of general-purpose Unix workstations used as MIMD systems for massive parallel computations. In particular we have solved numerically a demanding test problem with a 2D hydrodynamic code, generally developed to study astrophysical flows, by executing it on arrays either of DECstations 5000/200 on Ethernet LAN, or of DECstations 3000/400, equipped with powerful Alpha processors, on FDDI LAN. The code is appropriate for data-domain decomposition, and we have used a library for parallelization previously developed in our Institute, and easily extended to work on Unix workstation arrays by using the PVM software toolset. We have compared the parallel efficiencies obtained on arrays of several processors to those obtained on a dedicated MIMD parallel system, namely a Meiko Computing Surface (CS-1), equipped with Intel i860 processors. We discuss the feasibility of using non-dedicated parallel systems and conclude that the convenience depends essentially on the size of the computational domain as compared to the relative processor power and network bandwidth. We point out that for future perspectives a parallel development of processor and network technology is important, and that the software still offers great opportunities of improvement, especially in terms of latency times in the message-passing protocols. In conditions of significant gain in terms of speedup, such workstation arrays represent a cost-effective approach to massive parallel computations.

  13. A multi-satellite orbit determination problem in a parallel processing environment

    NASA Technical Reports Server (NTRS)

    Deakyne, M. S.; Anderle, R. J.

    1988-01-01

    The Engineering Orbit Analysis Unit at GE Valley Forge used an Intel Hypercube Parallel Processor to investigate the performance and gain experience of parallel processors with a multi-satellite orbit determination problem. A general study was selected in which major blocks of computation for the multi-satellite orbit computations were used as units to be assigned to the various processors on the Hypercube. Problems encountered or successes achieved in addressing the orbit determination problem would be more likely to be transferable to other parallel processors. The prime objective was to study the algorithm to allow processing of observations later in time than those employed in the state update. Expertise in ephemeris determination was exploited in addressing these problems and the facility used to bring a realism to the study which would highlight the problems which may not otherwise be anticipated. Secondary objectives were to gain experience of a non-trivial problem in a parallel processor environment, to explore the necessary interplay of serial and parallel sections of the algorithm in terms of timing studies, to explore the granularity (coarse vs. fine grain) to discover the granularity limit above which there would be a risk of starvation where the majority of nodes would be idle or under the limit where the overhead associated with splitting the problem may require more work and communication time than is useful.

  14. Input-independent, Scalable and Fast String Matching on the Cray XMT

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Villa, Oreste; Chavarría-Miranda, Daniel; Maschhoff, Kristyn J

    2009-05-25

    String searching is at the core of many security and network applications like search engines, intrusion detection systems, virus scanners and spam filters. The growing size of on-line content and the increasing wire speeds push the need for fast, and often real-time, string searching solutions. For these conditions, many software implementations (if not all) targeting conventional cache-based microprocessors do not perform well. They either exhibit overall low performance or exhibit highly variable performance depending on the types of inputs. For this reason, real-time state of the art solutions rely on the use of either custom hardware or Field-Programmable Gate Arrays (FPGAs) at the expense of overall system flexibility and programmability. This paper presents a software based implementation of the Aho-Corasick string searching algorithm on the Cray XMT multithreaded shared memory machine. Our solution relies on the particular features of the XMT architecture and on several algorithmic strategies: it is fast, scalable and its performance is virtually content-independent. On a 128-processor Cray XMT, it reaches a scanning speed of ≈ 28 Gbps with a performance variability below 10%. In the 10 Gbps performance range, variability is below 2.5%. By comparison, an Intel dual-socket, 8-core system running at 2.66 GHz achieves a peak performance which varies from 500 Mbps to 10 Gbps depending on the type of input and dictionary size.
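
    For readers unfamiliar with the Aho-Corasick algorithm named in the abstract above, the following is a minimal serial sketch of the textbook version (trie plus failure links). It is not the Cray XMT implementation described in the paper; the function names `build_automaton` and `search` and the list-based state tables are assumptions made for this example.

```python
from collections import deque


def build_automaton(patterns):
    """Build trie transitions, failure links and output lists."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # patterns recognised on entering each state
    for pat in patterns:
        s = 0
        for ch in pat:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(pat)
    # Breadth-first pass: depth-1 states keep failure link 0.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            # Inherit matches that end at the failure target.
            out[t] = out[t] + out[fail[t]]
    return goto, fail, out


def search(text, automaton):
    """Return (start_index, pattern) for every match in text."""
    goto, fail, out = automaton
    s = 0
    hits = []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for pat in out[s]:
            hits.append((i - len(pat) + 1, pat))
    return hits


automaton = build_automaton(["he", "she", "his", "hers"])
hits = search("ushers", automaton)
```

    The text is scanned once regardless of how many patterns are in the dictionary, which is the property that makes the algorithm attractive for intrusion detection and similar workloads.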

  15. Design of a modified adaptive neuro fuzzy inference system classifier for medical diagnosis of Pima Indians Diabetes

    NASA Astrophysics Data System (ADS)

    Sagir, Abdu Masanawa; Sathasivam, Saratha

    2017-08-01

    Medical diagnosis is the process of determining which disease or medical condition explains a person's determinable signs and symptoms. Diagnosis of most diseases is very expensive as many tests are required for predictions. This paper aims to introduce an improved hybrid approach for training the adaptive network based fuzzy inference system with a Modified Levenberg-Marquardt algorithm using an analytical derivation scheme for computation of the Jacobian matrix. The goal is to investigate how certain diseases are affected by patient's characteristics and measurement such as abnormalities or a decision about presence or absence of a disease. To achieve an accurate diagnosis at this complex stage of symptom analysis, the physician may need an efficient diagnosis system to classify and predict patient condition by using an adaptive neuro fuzzy inference system (ANFIS) pre-processed by grid partitioning. The proposed hybridised intelligent system was tested with the Pima Indian Diabetes dataset obtained from the University of California at Irvine's (UCI) machine learning repository. The proposed method's performance was evaluated based on training and test datasets. In addition, an attempt was made to specify the effectiveness of the performance by measuring total accuracy, sensitivity and specificity. The proposed method achieves superior performance when compared to the conventional ANFIS based gradient descent algorithm and some related existing methods. The software used for the implementation is MATLAB R2014a (version 8.3), executed on a PC with an Intel Pentium IV E7400 processor with 2.80 GHz speed and 2.0 GB of RAM.

  16. Massively parallel electrical conductivity imaging of the subsurface: Applications to hydrocarbon exploration

    NASA Astrophysics Data System (ADS)

    Newman, Gregory A.; Commer, Michael

    2009-07-01

    Three-dimensional (3D) geophysical imaging is now receiving considerable attention for electrical conductivity mapping of potential offshore oil and gas reservoirs. The imaging technology employs controlled source electromagnetic (CSEM) and magnetotelluric (MT) fields and treats geological media exhibiting transverse anisotropy. Moreover when combined with established seismic methods, direct imaging of reservoir fluids is possible. Because of the size of the 3D conductivity imaging problem, strategies are required exploiting computational parallelism and optimal meshing. The algorithm thus developed has been shown to scale to tens of thousands of processors. In one imaging experiment, 32,768 tasks/processors on the IBM Watson Research Blue Gene/L supercomputer were successfully utilized. Over a 24 hour period we were able to image a large scale field data set that previously required over four months of processing time on distributed clusters based on Intel or AMD processors utilizing 1024 tasks on an InfiniBand fabric. Electrical conductivity imaging using massively parallel computational resources produces results that cannot be obtained otherwise and are consistent with timeframes required for practical exploration problems.

  17. Multiphase complete exchange on Paragon, SP2 and CS-2

    NASA Technical Reports Server (NTRS)

    Bokhari, Shahid H.

    1995-01-01

    The overhead of interprocessor communication is a major factor in limiting the performance of parallel computer systems. The complete exchange is the severest communication pattern in that it requires each processor to send a distinct message to every other processor. This pattern is at the heart of many important parallel applications. On hypercubes, multiphase complete exchange has been developed and shown to provide optimal performance over varying message sizes. Most commercial multicomputer systems do not have a hypercube interconnect. However, they use special purpose hardware and dedicated communication processors to achieve very high performance communication and can be made to emulate the hypercube quite well. Multiphase complete exchange has been implemented on three contemporary parallel architectures: the Intel Paragon, IBM SP2 and Meiko CS-2. The essential features of these machines are described and their basic interprocessor communication overheads are discussed. The performance of multiphase complete exchange is evaluated on each machine. It is shown that the theoretical ideas developed for hypercubes are also applicable in practice to these machines and that multiphase complete exchange can lead to major savings in execution time over traditional solutions.
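
    The complete exchange pattern described in the abstract above can be illustrated with the classic direct pairwise schedule for hypercubes, in which processor i exchanges one block with processor i XOR step at each of the P−1 steps. This is a minimal serial simulation under stated assumptions: the function name `complete_exchange_xor` and the nested-list data layout are invented for the example, and it shows only the direct single-phase schedule, not the multiphase algorithm evaluated in the paper.

```python
def complete_exchange_xor(send):
    """Simulate a complete exchange among p = 2**d processors.

    send[i][j] is the block processor i holds for processor j.
    Returns recv, where recv[j][i] is the block j received from i.
    """
    p = len(send)
    assert p & (p - 1) == 0, "processor count must be a power of two"
    recv = [[None] * p for _ in range(p)]
    for i in range(p):
        recv[i][i] = send[i][i]  # the local block needs no communication
    for step in range(1, p):
        # In each step every processor is paired with exactly one partner.
        for i in range(p):
            partner = i ^ step
            recv[partner][i] = send[i][partner]
    return recv


blocks = [[(src, dst) for dst in range(4)] for src in range(4)]
result = complete_exchange_xor(blocks)
```

    Because XOR with a fixed nonzero `step` is a pairing of the processors, no processor sends or receives more than one block per step.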

  18. Intel Many Integrated Core (MIC) architecture optimization strategies for a memory-bound Weather Research and Forecasting (WRF) Goddard microphysics scheme

    NASA Astrophysics Data System (ADS)

    Mielikainen, Jarno; Huang, Bormin; Huang, Allen H.

    2014-10-01

    The Goddard cloud microphysics scheme is a sophisticated cloud microphysics scheme in the Weather Research and Forecasting (WRF) model. The WRF is a widely used weather prediction system. Its development is a collaborative effort around the globe. The Goddard microphysics scheme is very suitable for massively parallel computation as there are no interactions among horizontal grid points. Compared to the earlier microphysics schemes, the Goddard scheme incorporates a large number of improvements. Thus, we have optimized the code of this important part of WRF. In this paper, we present our results of optimizing the Goddard microphysics scheme on Intel Many Integrated Core Architecture (MIC) hardware. The Intel Xeon Phi coprocessor is the first product based on Intel MIC architecture, and it consists of up to 61 cores connected by a high performance on-die bidirectional interconnect. The Intel MIC is capable of executing a full operating system and entire programs rather than just kernels as GPUs do. The MIC coprocessor supports all important Intel development tools. Thus, the development environment is a familiar one to a vast number of CPU developers. However, getting maximum performance out of MICs will require using some novel optimization techniques. Those optimization techniques are discussed in this paper. The results show that the optimizations improved performance of the original code on Xeon Phi 7120P by a factor of 4.7x. Furthermore, the same optimizations improved performance on a dual socket Intel Xeon E5-2670 system by a factor of 2.8x compared to the original code.

  19. Noiseless coding for the Gamma Ray spectrometer

    NASA Technical Reports Server (NTRS)

    Rice, R.; Lee, J. J.

    1985-01-01

    The payload of several future unmanned space missions will include a sophisticated gamma ray spectrometer. Severely constrained data rates during certain portions of these missions could limit the possible science return from this instrument. This report investigates the application of universal noiseless coding techniques to represent gamma ray spectrometer data more efficiently without any loss in data integrity. Performance results demonstrate compression factors from 2.5:1 to 20:1 in comparison to a standard representation. Feasibility was also demonstrated by implementing a microprocessor breadboard coder/decoder using an Intel 8086 processor.

  20. 30/20 GHz communications systems baseband processor development

    NASA Astrophysics Data System (ADS)

    Brown, L.; Sabourin, D.; Stilwell, J.; McCallister, R.; Borota, M.

    The architecture and system design concepts for a commercial satellite communications system planned for the 1990's have been developed. The system provides data communications between the individual users via trunking and customer premise service terminals utilizing a central switching satellite operating in a time-division multiple-access mode. Baseband processing is employed to route and control traffic on an individual message basis while providing significant advantages in improved link margins and system flexibility. Key technology developments required to prove the flight readiness of the baseband processor design are being verified in the baseband processor proof-of-concept model described herein.

  1. 30/20 GHz communications systems baseband processor development

    NASA Technical Reports Server (NTRS)

    Brown, L.; Sabourin, D.; Stilwell, J.; Mccallister, R.; Borota, M.

    1982-01-01

    The architecture and system design concepts for a commercial satellite communications system planned for the 1990's have been developed. The system provides data communications between the individual users via trunking and customer premise service terminals utilizing a central switching satellite operating in a time-division multiple-access mode. Baseband processing is employed to route and control traffic on an individual message basis while providing significant advantages in improved link margins and system flexibility. Key technology developments required to prove the flight readiness of the baseband processor design are being verified in the baseband processor proof-of-concept model described herein.

  2. Evaluation of the OpenCL AES Kernel using the Intel FPGA SDK for OpenCL

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Jin, Zheming; Yoshii, Kazutomo; Finkel, Hal

    The OpenCL standard is an open programming model for accelerating algorithms on heterogeneous computing systems. OpenCL extends the C-based programming language for developing portable codes on different platforms such as CPU, Graphics processing units (GPUs), Digital Signal Processors (DSPs) and Field Programmable Gate Arrays (FPGAs). The Intel FPGA SDK for OpenCL is a suite of tools that allows developers to abstract away the complex FPGA-based development flow for a high-level software development flow. Users can focus on the design of hardware-accelerated kernel functions in OpenCL and then direct the tools to generate the low-level FPGA implementations. The approach makes the FPGA-based development more accessible to software users as the needs for hybrid computing using CPUs and FPGAs are increasing. It can also significantly reduce the hardware development time as users can evaluate different ideas with high-level language without deep FPGA domain knowledge. In this report, we evaluate the performance of the kernel using the Intel FPGA SDK for OpenCL and Nallatech 385A FPGA board. Compared to the M506 module, the board provides more hardware resources for a larger design exploration space. The kernel performance is measured with the compute kernel throughput, an upper bound to the FPGA throughput. The report presents the experimental results in detail. The Appendix lists the kernel source code.

  3. SETI prototype system for NASA's Sky Survey microwave observing project - A progress report

    NASA Technical Reports Server (NTRS)

    Klein, M. J.; Gulkis, S.; Wilck, H. C.

    1990-01-01

    Two complementary search strategies, a Targeted Search and a Sky Survey, are part of NASA's SETI microwave observing project scheduled to begin in October of 1992. The current progress in the development of hardware and software elements of the JPL Sky Survey data processing system are presented. While the Targeted Search stresses sensitivity allowing the detection of either continuous or pulsed signals over the 1-3 GHz frequency range, the Sky Survey gives up sensitivity to survey the 99 percent of the sky that is not covered by the Targeted Search. The Sky Survey spans a larger frequency range from 1-10 GHz. The two searches will deploy special-purpose digital signal processing equipment designed and built to automate the observing and data processing activities. A two-million channel digital wideband spectrum analyzer and a signal processor system will serve as a prototype for the SETI Sky Survey processor. The design will permit future expansion to meet the SETI requirement that the processor concurrently search for left and right circularly polarized signals.

  4. Face classification using electronic synapses

    NASA Astrophysics Data System (ADS)

    Yao, Peng; Wu, Huaqiang; Gao, Bin; Eryilmaz, Sukru Burc; Huang, Xueyao; Zhang, Wenqiang; Zhang, Qingtian; Deng, Ning; Shi, Luping; Wong, H.-S. Philip; Qian, He

    2017-05-01

    Conventional hardware platforms consume a huge amount of energy for cognitive learning due to the data movement between the processor and the off-chip memory. Brain-inspired device technologies using analogue weight storage allow cognitive tasks to be completed more efficiently. Here we present an analogue non-volatile resistive memory (an electronic synapse) with foundry friendly materials. The device shows bidirectional continuous weight modulation behaviour. Grey-scale face classification is experimentally demonstrated using an integrated 1024-cell array with parallel online training. The energy consumption within the analogue synapses for each iteration is 1,000 × (20 ×) lower compared to an implementation using Intel Xeon Phi processor with off-chip memory (with hypothetical on-chip digital resistive random access memory). The accuracy on test sets is close to the result using a central processing unit. These experimental results consolidate the feasibility of analogue synaptic array and pave the way toward building an energy efficient and large-scale neuromorphic system.

  5. Face classification using electronic synapses.

    PubMed

    Yao, Peng; Wu, Huaqiang; Gao, Bin; Eryilmaz, Sukru Burc; Huang, Xueyao; Zhang, Wenqiang; Zhang, Qingtian; Deng, Ning; Shi, Luping; Wong, H-S Philip; Qian, He

    2017-05-12

    Conventional hardware platforms consume a huge amount of energy for cognitive learning due to the data movement between the processor and the off-chip memory. Brain-inspired device technologies using analogue weight storage allow cognitive tasks to be completed more efficiently. Here we present an analogue non-volatile resistive memory (an electronic synapse) with foundry friendly materials. The device shows bidirectional continuous weight modulation behaviour. Grey-scale face classification is experimentally demonstrated using an integrated 1024-cell array with parallel online training. The energy consumption within the analogue synapses for each iteration is 1,000 × (20 ×) lower compared to an implementation using Intel Xeon Phi processor with off-chip memory (with hypothetical on-chip digital resistive random access memory). The accuracy on test sets is close to the result using a central processing unit. These experimental results consolidate the feasibility of analogue synaptic array and pave the way toward building an energy efficient and large-scale neuromorphic system.

  6. Parallelization of the preconditioned IDR solver for modern multicore computer systems

    NASA Astrophysics Data System (ADS)

    Bessonov, O. A.; Fedoseyev, A. I.

    2012-10-01

    This paper presents the analysis, parallelization and optimization approach for the large sparse matrix solver CNSPACK for modern multicore microprocessors. CNSPACK is an advanced solver successfully used for coupled solution of stiff problems arising in multiphysics applications such as CFD, semiconductor transport, kinetic and quantum problems. It employs the iterative IDR algorithm with ILU preconditioning (user chosen ILU preconditioning order). CNSPACK has been successfully used during the last decade for solving problems in several application areas, including fluid dynamics and semiconductor device simulation. However, there was a dramatic change in processor architectures and computer system organization in recent years. Due to this, performance criteria and methods have been revisited, together with the parallelization of the solver and preconditioner using the OpenMP environment. Results of the successful implementation for efficient parallelization are presented for the most advanced computer systems (Intel Core i7-9xx or two-processor Xeon 55xx/56xx).

  7. Thread-Level Parallelization and Optimization of NWChem for the Intel MIC Architecture

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Shan, Hongzhang; Williams, Samuel; Jong, Wibe de

    In the multicore era it was possible to exploit the increase in on-chip parallelism by simply running multiple MPI processes per chip. Unfortunately, manycore processors' greatly increased thread- and data-level parallelism coupled with a reduced memory capacity demand an altogether different approach. In this paper we explore augmenting two NWChem modules, triples correction of the CCSD(T) and Fock matrix construction, with OpenMP in order that they might run efficiently on future manycore architectures. As the next NERSC machine will be a self-hosted Intel MIC (Xeon Phi) based supercomputer, we leverage an existing MIC testbed at NERSC to evaluate our experiments. In order to proxy the fact that future MIC machines will not have a host processor, we run all of our experiments in native mode. We found that while straightforward application of OpenMP to the deep loop nests associated with the tensor contractions of CCSD(T) was sufficient in attaining high performance, significant effort was required to safely and efficiently thread the TEXAS integral package when constructing the Fock matrix. Ultimately, our new MPI+OpenMP hybrid implementations attain up to 65x better performance for the triples part of the CCSD(T) due in large part to the fact that the limited on-card memory limits the existing MPI implementation to a single process per card. Additionally, we obtain up to 1.6x better performance on Fock matrix constructions when compared with the best MPI implementations running multiple processes per card.

  8. Thread-level parallelization and optimization of NWChem for the Intel MIC architecture

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Shan, Hongzhang; Williams, Samuel; de Jong, Wibe

    In the multicore era it was possible to exploit the increase in on-chip parallelism by simply running multiple MPI processes per chip. Unfortunately, manycore processors' greatly increased thread- and data-level parallelism coupled with a reduced memory capacity demand an altogether different approach. In this paper we explore augmenting two NWChem modules, triples correction of the CCSD(T) and Fock matrix construction, with OpenMP in order that they might run efficiently on future manycore architectures. As the next NERSC machine will be a self-hosted Intel MIC (Xeon Phi) based supercomputer, we leverage an existing MIC testbed at NERSC to evaluate our experiments. In order to proxy the fact that future MIC machines will not have a host processor, we run all of our experiments in native mode. We found that while straightforward application of OpenMP to the deep loop nests associated with the tensor contractions of CCSD(T) was sufficient in attaining high performance, significant effort was required to safely and efficiently thread the TEXAS integral package when constructing the Fock matrix. Ultimately, our new MPI+OpenMP hybrid implementations attain up to 65× better performance for the triples part of the CCSD(T) due in large part to the fact that the limited on-card memory limits the existing MPI implementation to a single process per card. Additionally, we obtain up to 1.6× better performance on Fock matrix constructions when compared with the best MPI implementations running multiple processes per card.

  9. Massively Parallel Simulations of Diffusion in Dense Polymeric Structures

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Faulon, Jean-Loup; Wilcox, R.T.

    1997-11-01

    An original computational technique to generate close-to-equilibrium dense polymeric structures is proposed. Diffusion of small gases is studied on the equilibrated structures using massively parallel molecular dynamics simulations running on the Intel Teraflops (9216 Pentium Pro processors) and Intel Paragon (1840 processors). Compared to the current state-of-the-art equilibration methods this new technique appears to be faster by some orders of magnitude. The main advantage of the technique is that one can circumvent the bottlenecks in configuration space that inhibit relaxation in molecular dynamics simulations. The technique is based on the fact that tetravalent atoms (such as carbon and silicon) fit in the center of a regular tetrahedron and that regular tetrahedrons can be used to mesh the three-dimensional space. Thus, the problem of polymer equilibration described by continuous equations in molecular dynamics is reduced to a discrete problem where solutions are approximated by simple algorithms. Practical modeling applications include the construction of butyl rubber and ethylene-propylene-dimer-monomer (EPDM) models for oxygen and water diffusion calculations. Butyl and EPDM are used in O-ring systems and serve as sealing joints in many manufactured objects. Diffusion coefficients of small gases have been measured experimentally on both polymeric systems, and in general the diffusion coefficients in EPDM are an order of magnitude larger than in butyl. In order to better understand the diffusion phenomena, 10,000-atom models were generated and equilibrated for butyl and EPDM. The models were submitted to a massively parallel molecular dynamics simulation to monitor the trajectories of the diffusing species.

  10. Parallel hyperspectral compressive sensing method on GPU

    NASA Astrophysics Data System (ADS)

    Bernabé, Sergio; Martín, Gabriel; Nascimento, José M. P.

    2015-10-01

    Remote hyperspectral sensors collect large amounts of data per flight, usually with low spatial resolution. It is known that the bandwidth of the connection between the satellite/airborne platform and the ground station is limited, thus an onboard compression method is desirable to reduce the amount of data to be transmitted. This paper presents a parallel implementation of a compressive sensing method, called parallel hyperspectral coded aperture (P-HYCA), for graphics processing units (GPU) using the compute unified device architecture (CUDA). This method takes into account two main properties of hyperspectral datasets, namely the high correlation existing among the spectral bands and the generally low number of endmembers needed to explain the data, which largely reduces the number of measurements necessary to correctly reconstruct the original data. Experimental results obtained using synthetic and real hyperspectral datasets on two different GPU architectures by NVIDIA, GeForce GTX 590 and GeForce GTX TITAN, reveal that the use of GPUs can provide real-time compressive sensing performance. The achieved speedup is up to 20 times when compared with the processing time of HYCA running on one core of the Intel i7-2600 CPU (3.4 GHz), with 16 Gbyte memory.

  11. Efficiently Distributing Component-Based Applications Across Wide-Area Environments

    DTIC Science & Technology

    2002-01-01

    Oracle 8.1.7 Enterprise Edition), each running on a dedicated 1GHz dual-processor Pentium III workstation. For the RUBiS tests, we used a MySQL 4.0.12...a variety of sophisticated network-accessible services such as e-mail, banking, on-line shopping, entertainment, and serving as a data exchange...Beans Catalog Handles read-only queries to product database Customer Serves as a façade to Order and Account Stateful Session Beans ShoppingCart

  12. List-mode PET image reconstruction for motion correction using the Intel XEON PHI co-processor

    NASA Astrophysics Data System (ADS)

    Ryder, W. J.; Angelis, G. I.; Bashar, R.; Gillam, J. E.; Fulton, R.; Meikle, S.

    2014-03-01

    List-mode image reconstruction with motion correction is computationally expensive, as it requires projection of hundreds of millions of rays through a 3D array. To decrease reconstruction time it is possible to use symmetric multiprocessing computers or graphics processing units. The former can have high financial costs, while the latter can require refactoring of algorithms. The Xeon Phi is a new co-processor card with a Many Integrated Core architecture that can run 4 multiple-instruction, multiple-data threads per core, with each thread having a 512-bit single-instruction, multiple-data vector register. Thus, it is possible to run in the region of 220 threads simultaneously. The aim of this study was to investigate whether the Xeon Phi co-processor card is a viable alternative to an x86 Linux server for accelerating list-mode PET image reconstruction for motion correction. An existing list-mode image reconstruction algorithm with motion correction was ported to run on the Xeon Phi co-processor with the multi-threading implemented using pthreads. There were no differences between images reconstructed using the Phi co-processor card and images reconstructed using the same algorithm run on a Linux server. However, it was found that the reconstruction runtimes were 3 times greater for the Phi than the server. A new version of the image reconstruction algorithm was developed in C++ using OpenMP for multi-threading and the Phi runtimes decreased to 1.67 times that of the host Linux server. Data transfer from the host to the co-processor card was found to be a rate-limiting step; this needs to be carefully considered in order to maximize runtime speeds. When considering the purchase price of a Linux workstation with a Xeon Phi co-processor card and a top-of-the-range Linux server, the former is a cost-effective computation resource for list-mode image reconstruction. A multi-Phi workstation could be a viable alternative to cluster computers at a lower cost for medical imaging applications.

  13. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Cohen, J; Dossa, D; Gokhale, M

    Critical data science applications requiring frequent access to storage perform poorly on today's computing architectures. This project addresses efficient computation of data-intensive problems in national security and basic science by exploring, advancing, and applying a new form of computing called storage-intensive supercomputing (SISC). Our goal is to enable applications that simply cannot run on current systems, and, for a broad range of data-intensive problems, to deliver an order of magnitude improvement in price/performance over today's data-intensive architectures. This technical report documents much of the work done under LDRD 07-ERD-063 Storage Intensive Supercomputing during the period 05/07-09/07. The following chapters describe: (1) a new file I/O monitoring tool iotrace developed to capture the dynamic I/O profiles of Linux processes; (2) an out-of-core graph benchmark for level-set expansion of scale-free graphs; (3) an entity extraction benchmark consisting of a pipeline of eight components; and (4) an image resampling benchmark drawn from the SWarp program in the LSST data processing pipeline. The performance of the graph and entity extraction benchmarks was measured in three different scenarios: data sets residing on the NFS file server and accessed over the network; data sets stored on local disk; and data sets stored on the Fusion I/O parallel NAND Flash array. The image resampling benchmark compared the performance of software-only and GPU-accelerated implementations. In addition to the work reported here, an additional text processing application was developed that used an FPGA to accelerate n-gram profiling for language classification. The n-gram application will be presented at SC07 at the High Performance Reconfigurable Computing Technologies and Applications Workshop. The graph and entity extraction benchmarks were run on a Supermicro server housing the NAND Flash 40GB parallel disk array, the Fusion-io. The Fusion system specs are as follows: SuperMicro X7DBE Xeon Dual Socket Blackford Server Motherboard; 2 Intel Xeon Dual-Core 2.66 GHz processors; 1 GB DDR2 PC2-5300 RAM (2 x 512); 80GB Hard Drive (Seagate SATA II Barracuda). The Fusion board is presently capable of 4X in a PCIe slot. The image resampling benchmark was run on a dual Xeon workstation with NVIDIA graphics card (see Chapter 5 for full specification). An XtremeData Opteron+FPGA was used for the language classification application. We observed that these benchmarks are not uniformly I/O intensive. The only benchmark that spent greater than 50% of its time in I/O was the graph algorithm when it accessed data files over NFS. When local disk was used, the graph benchmark spent at most 40% of its time in I/O. The other benchmarks were CPU dominated. The image resampling benchmark and language classification showed order of magnitude speedup over software by using co-processor technology to offload the CPU-intensive kernels. Our experiments to date suggest that emerging hardware technologies offer significant benefit in boosting the performance of data-intensive algorithms. Using GPU and FPGA co-processors, we were able to improve performance by more than an order of magnitude on the benchmark algorithms, eliminating the processor bottleneck of CPU-bound tasks. Experiments with a prototype solid state nonvolatile memory available today show 10X better throughput on random reads than disk, with a 2X speedup on a graph processing benchmark when compared to the use of local SATA disk.

  14. Efficient irregular wavefront propagation algorithms on Intel® Xeon Phi™

    PubMed Central

    Gomes, Jeremias M.; Teodoro, George; de Melo, Alba; Kong, Jun; Kurc, Tahsin; Saltz, Joel H.

    2016-01-01

    We investigate the execution of the Irregular Wavefront Propagation Pattern (IWPP), a fundamental computing structure used in several image analysis operations, on the Intel® Xeon Phi™ co-processor. An efficient implementation of IWPP on the Xeon Phi is a challenging problem because of IWPP’s irregularity and the use of atomic instructions in the original IWPP algorithm to resolve race conditions. On the Xeon Phi, the use of SIMD and vectorization instructions is critical to attain high performance. However, SIMD atomic instructions are not supported. Therefore, we propose a new IWPP algorithm that can take advantage of the supported SIMD instruction set. We also evaluate an alternate storage container (priority queue) to track active elements in the wavefront in an effort to improve the parallel algorithm efficiency. The new IWPP algorithm is evaluated with Morphological Reconstruction and Imfill operations as use cases. Our results show performance improvements of up to 5.63× on top of the original IWPP due to vectorization. Moreover, the new IWPP achieves speedups of 45.7× and 1.62×, respectively, as compared to efficient CPU and GPU implementations. PMID:27298591

  15. Efficient irregular wavefront propagation algorithms on Intel® Xeon Phi™.

    PubMed

    Gomes, Jeremias M; Teodoro, George; de Melo, Alba; Kong, Jun; Kurc, Tahsin; Saltz, Joel H

    2015-10-01

    We investigate the execution of the Irregular Wavefront Propagation Pattern (IWPP), a fundamental computing structure used in several image analysis operations, on the Intel® Xeon Phi™ co-processor. An efficient implementation of IWPP on the Xeon Phi is a challenging problem because of IWPP's irregularity and the use of atomic instructions in the original IWPP algorithm to resolve race conditions. On the Xeon Phi, the use of SIMD and vectorization instructions is critical to attain high performance. However, SIMD atomic instructions are not supported. Therefore, we propose a new IWPP algorithm that can take advantage of the supported SIMD instruction set. We also evaluate an alternate storage container (priority queue) to track active elements in the wavefront in an effort to improve the parallel algorithm efficiency. The new IWPP algorithm is evaluated with Morphological Reconstruction and Imfill operations as use cases. Our results show performance improvements of up to 5.63× on top of the original IWPP due to vectorization. Moreover, the new IWPP achieves speedups of 45.7× and 1.62×, respectively, as compared to efficient CPU and GPU implementations.

  16. Implementations of BLAST for parallel computers.

    PubMed

    Jülich, A

    1995-02-01

    The BLAST sequence comparison programs have been ported to a variety of parallel computers: the shared memory machine Cray Y-MP 8/864 and the distributed memory architectures Intel iPSC/860 and nCUBE. Additionally, the programs were ported to run on workstation clusters. We explain the parallelization techniques and consider the pros and cons of these methods. The BLAST programs are very well suited for parallelization for a moderate number of processors. We illustrate our results using the program blastp as an example. As input data for blastp, a 799 residue protein query sequence and the protein database PIR were used.

  17. Real-time operating system for selected Intel processors

    NASA Technical Reports Server (NTRS)

    Pool, W. R.

    1980-01-01

    The rationale for system development is given along with reasons for not using vendor supplied operating systems. Although many system design and performance goals were dictated by problems with vendor supplied systems, other goals surfaced as a result of a design for a custom system able to span multiple projects. System development and management problems and areas that required redesign or major code changes for system implementation are examined as well as the relative successes of the initial projects. A generic description of the actual project is provided and the ongoing support requirements and future plans are discussed.

  18. Computer Simulation Performed for Columbia Project Cooling System

    NASA Technical Reports Server (NTRS)

    Ahmad, Jasim

    2005-01-01

    This demo shows a high-fidelity simulation of the air flow in the main computer room housing the Columbia (10,024 Intel Itanium processors) system. The simulation assessed the performance of the cooling system, identified deficiencies, and recommended modifications to eliminate them. It used two in-house software packages on NAS supercomputers: Chimera Grid Tools to generate a geometric model of the computer room, and the OVERFLOW-2 code for fluid and thermal simulation. This state-of-the-art technology can be easily extended to provide a general capability for air flow analyses of any modern computer room.

  19. Scalability and Portability of Two Parallel Implementations of ADI

    NASA Technical Reports Server (NTRS)

    Phung, Thanh; VanderWijngaart, Rob F.

    1994-01-01

    Two domain decompositions for the implementation of the NAS Scalar Penta-diagonal Parallel Benchmark on MIMD systems are investigated, namely transposition and multi-partitioning. Hardware platforms considered are the Intel iPSC/860 and Paragon XP/S-15, and clusters of SGI workstations on ethernet, communicating through PVM. It is found that the multi-partitioning strategy offers the kind of coarse granularity that allows scaling up to hundreds of processors on a massively parallel machine. Moreover, efficiency is retained when the code is ported verbatim (save message passing syntax) to a PVM environment on a modest size cluster of workstations.

  20. On-board processing concepts for future satellite communications systems

    NASA Technical Reports Server (NTRS)

    Brandon, W. T. (Editor); White, B. E. (Editor)

    1980-01-01

    The initial definition of on-board processing for an advanced satellite communications system to service domestic markets in the 1990's is discussed. An exemplar system with both RF on-board switching and demodulation/remodulation baseband processing is used to identify important issues related to system implementation, cost, and technology development. Analyses of spectrum-efficient modulation, coding, and system control techniques are summarized. Implementations for an RF switch and baseband processor are described. Among the major conclusions listed is the need for high gain satellites capable of handling tens of simultaneous beams for the efficient reuse of the 2.5 GHz 30/20 frequency band. Several scanning beams are recommended in addition to the fixed beams. Low power solid state 20 GHz GaAs FET power amplifiers in the 5W range and a general purpose digital baseband processor with gigahertz logic speeds and megabits of memory are also recommended.

  1. The ACTS Flight System - Cost-Effective Advanced Communications Technology. [Advanced Communication Technology Satellite]

    NASA Technical Reports Server (NTRS)

    Holmes, W. M., Jr.; Beck, G. A.

    1984-01-01

    The multibeam communications package (MCP) for the Advanced Communications Technology Satellite (ACTS) to be STS-launched by NASA in 1988 for experimental demonstration of satellite-switched TDMA (at 220 Mbit/sec) and baseband-processor signal routing (at 110 or 27.5 Mbit/sec) is characterized. The developmental history of the ACTS, the program definition, and the spacecraft-bus and MCP parameters are reviewed and illustrated with drawings, block diagrams, and maps of the coverage plan. Advanced features of the MCP include 4.5-dB-noise-figure 30-GHz FET amplifiers and 20-GHz TWTA transmitters which provide either 40-W or 8-W RF output, depending on rain conditions. The technologies being tested in ACTS can give frequency-reuse factors as high as 20, thus greatly expanding the orbit/spectrum resources available for U.S. communications use.

  2. Intel Xeon Phi accelerated Weather Research and Forecasting (WRF) Goddard microphysics scheme

    NASA Astrophysics Data System (ADS)

    Mielikainen, J.; Huang, B.; Huang, A. H.-L.

    2014-12-01

    The Weather Research and Forecasting (WRF) model is a numerical weather prediction system designed to serve both atmospheric research and operational forecasting needs. WRF development is done in collaboration around the globe. Furthermore, the WRF is used by academic atmospheric scientists, weather forecasters at the operational centers and so on. The WRF contains several physics components. The most time consuming one is the microphysics. One microphysics scheme is the Goddard cloud microphysics scheme, a sophisticated cloud microphysics scheme in the WRF model. The Goddard microphysics scheme is very suitable for massively parallel computation as there are no interactions among horizontal grid points. Compared to the earlier microphysics schemes, the Goddard scheme incorporates a large number of improvements. Thus, we have optimized the Goddard scheme code. In this paper, we present our results of optimizing the Goddard microphysics scheme on Intel Many Integrated Core Architecture (MIC) hardware. The Intel Xeon Phi coprocessor is the first product based on Intel MIC architecture, and it consists of up to 61 cores connected by a high performance on-die bidirectional interconnect. The Intel MIC is capable of executing a full operating system and entire programs rather than just kernels as the GPU does. The MIC coprocessor supports all important Intel development tools. Thus, the development environment is one familiar to a vast number of CPU developers. However, getting maximum performance out of MICs requires some novel optimization techniques. Those optimization techniques are discussed in this paper. The results show that the optimizations improved performance of the Goddard microphysics scheme on Xeon Phi 7120P by a factor of 4.7×. In addition, the optimizations reduced the Goddard microphysics scheme's share of the total WRF processing time from 20.0 to 7.5%. Furthermore, the same optimizations improved performance on Intel Xeon E5-2670 by a factor of 2.8× compared to the original code.

  3. HeinzelCluster: accelerated reconstruction for FORE and OSEM3D.

    PubMed

    Vollmar, S; Michel, C; Treffert, J T; Newport, D F; Casey, M; Knöss, C; Wienhard, K; Liu, X; Defrise, M; Heiss, W D

    2002-08-07

    Using iterative three-dimensional (3D) reconstruction techniques for reconstruction of positron emission tomography (PET) is not feasible on most single-processor machines due to the excessive computing time needed, especially so for the large sinogram sizes of our high-resolution research tomograph (HRRT). In our first approach to speed up reconstruction time we transform the 3D scan into the format of a two-dimensional (2D) scan with sinograms that can be reconstructed independently using Fourier rebinning (FORE) and a fast 2D reconstruction method. On our dedicated reconstruction cluster (seven four-processor systems, Intel PIII@700 MHz, switched fast ethernet and Myrinet, Windows NT Server), we process these 2D sinograms in parallel. We have achieved a speedup > 23 using 26 processors and also compared results for different communication methods (RPC, Syngo, Myrinet GM). The other approach is to parallelize OSEM3D (implementation of C Michel), which has produced the best results for HRRT data so far and is more suitable for an adequate treatment of the sinogram gaps that result from the detector geometry of the HRRT. We have implemented two levels of parallelization for our dedicated cluster (a shared memory fine-grain level on each node utilizing all four processors and a coarse-grain level allowing for 15 nodes), reducing the time for one core iteration from over 7 h to about 35 min.
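
    The two figures quoted in this abstract can be turned into a parallel efficiency and a speedup with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and because the abstract gives bounds ("> 23", "over 7 h", "about 35 min"), the results are rough lower bounds rather than measured values.

```python
def parallel_efficiency(speedup: float, processors: int) -> float:
    """Fraction of ideal linear speedup that was actually achieved."""
    return speedup / processors


def speedup_from_times(before_min: float, after_min: float) -> float:
    """Speedup implied by a runtime before and after parallelization."""
    return before_min / after_min


# FORE + 2D reconstruction: speedup > 23 on 26 processors (from the abstract).
fore_efficiency = parallel_efficiency(23, 26)

# OSEM3D: one core iteration went from over 7 h to about 35 min.
osem_speedup = speedup_from_times(7 * 60, 35)

print(f"FORE parallel efficiency: at least {fore_efficiency:.1%}")
print(f"OSEM3D iteration speedup: roughly {osem_speedup:.0f}x or more")
```

    Read this way, the FORE approach stays close to linear scaling, while the OSEM3D figure describes the combined effect of both parallelization levels.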

  4. Commercial counterboard for 10 ns software correlator for photon and fluorescence correlation spectroscopy.

    PubMed

    Molteni, Matteo; Ferri, Fabio

    2016-11-01

    A 10 ns time resolution, multi-tau software correlator, capable of computing simultaneous autocorrelation (A-A, B-B) and cross (A-B) correlation functions at count rates up to ∼10 MHz, with no data loss, has been developed in LabVIEW and C++ by using the National Instruments timer/counterboard (NI PCIe-6612) and a fast Personal Computer (PC) (Intel Core i7-4790 Processor, 3.60 GHz). The correlator works by using two algorithms: for large lag times (τ ≳ 1 μs), a classical time-mode scheme, based on the measure of the number of pulses per time interval, is used; for τ ≲ 1 μs, by contrast, a photon-mode (PM) scheme is adopted and the correlation function is retrieved from the sequence of the photon arrival times. Single auto- and cross-correlation functions can be processed online in full real time up to count rates of ∼1.8 MHz and ∼1.2 MHz, respectively. Two autocorrelation functions (A-A, B-B) and one cross-correlation function (A-B) can be simultaneously processed in full real time only up to count rates of ∼750 kHz. At higher count rates, the online processing takes place in a delayed modality, but with no data loss. When tested with simulated correlation data and latex sphere solutions, the overall performance of the correlator appears to be comparable with that of commercial hardware correlators, but with several nontrivial advantages related to its flexibility, low cost, and easy adaptability to future developments of PC and data acquisition technology.
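
    The time-mode scheme mentioned above, which correlates the number of pulses counted per time interval, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses plain linear lags instead of a multi-tau lag ladder, the function name `time_mode_autocorrelation` is hypothetical, and it assumes `max_lag` is smaller than the number of bins and that the mean count is nonzero.

```python
def time_mode_autocorrelation(counts, max_lag):
    """Normalized autocorrelation of binned photon counts.

    For each lag k (in bins), returns
        g(k) = <n_i * n_{i+k}> / (<n_i> * <n_{i+k}>),
    where the averages run over the overlapping part of the record.
    """
    n = len(counts)
    g = []
    for lag in range(1, max_lag + 1):
        head = counts[: n - lag]   # n_i
        tail = counts[lag:]        # n_{i+k}
        pairs = len(head)
        mean_product = sum(a * b for a, b in zip(head, tail)) / pairs
        mean_head = sum(head) / pairs
        mean_tail = sum(tail) / pairs
        g.append(mean_product / (mean_head * mean_tail))
    return g


# A perfectly steady count rate is uncorrelated: g(k) = 1 at every lag.
steady = time_mode_autocorrelation([2] * 10, max_lag=3)

# Counts that alternate 0, 2, 0, 2, ... are anticorrelated at lag 1
# and fully correlated at lag 2.
alternating = time_mode_autocorrelation([0, 2] * 4, max_lag=2)

print(steady)
print(alternating)
```

    A multi-tau correlator would additionally coarsen the bin width as the lag grows, so that a wide range of lag times is covered with a modest number of channels.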

  5. Abstract: Inference and Interval Estimation for Indirect Effects With Latent Variable Models.

    PubMed

    Falk, Carl F; Biesanz, Jeremy C

    2011-11-30

    Models specifying indirect effects (or mediation) and structural equation modeling are both popular in the social sciences. Yet relatively little research has compared methods that test for indirect effects among latent variables and provided precise estimates of the effectiveness of different methods. This simulation study provides an extensive comparison of methods for constructing confidence intervals and for making inferences about indirect effects with latent variables. We compared the percentile (PC) bootstrap, bias-corrected (BC) bootstrap, bias-corrected accelerated (BCa) bootstrap, likelihood-based confidence intervals (Neale & Miller, 1997), partial posterior predictive (Biesanz, Falk, and Savalei, 2010), and joint significance tests based on Wald tests or likelihood ratio tests. All models included three reflective latent variables representing the independent, dependent, and mediating variables. The design included the following fully crossed conditions: (a) sample size: 100, 200, and 500; (b) number of indicators per latent variable: 3 versus 5; (c) reliability per set of indicators: .7 versus .9; and (d) 16 different path combinations for the indirect effect (α = 0, .14, .39, or .59; and β = 0, .14, .39, or .59). Simulations were performed using a WestGrid cluster of 1680 3.06 GHz Intel Xeon processors running R and OpenMx. Results based on 1,000 replications per cell and 2,000 resamples per bootstrap method indicated that the BC and BCa bootstrap methods have inflated Type I error rates. Likelihood-based confidence intervals and the PC bootstrap emerged as methods that adequately control Type I error and have good coverage rates.
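
    The size of the fully crossed design described above follows directly from the listed factor levels. The sketch below enumerates the cells; the variable names are hypothetical, and treating a cell as a "null" condition when α × β = 0 is an interpretive assumption about how Type I error was assessed, not something stated in the abstract.

```python
from itertools import product

# Factor levels as listed in the abstract.
sample_sizes = (100, 200, 500)
indicators_per_latent = (3, 5)
reliabilities = (0.7, 0.9)
path_values = (0.0, 0.14, 0.39, 0.59)  # used for both alpha and beta

# Every combination of the factors is one design cell.
cells = [
    {"n": n, "indicators": k, "reliability": r, "alpha": a, "beta": b}
    for n, k, r, a, b in product(
        sample_sizes, indicators_per_latent, reliabilities,
        path_values, path_values,
    )
]

# Assumption: cells with alpha * beta == 0 have no true indirect effect.
null_cells = [c for c in cells if c["alpha"] * c["beta"] == 0]

replications_per_cell = 1_000
total_datasets = len(cells) * replications_per_cell

print(len(cells), len(null_cells), total_datasets)
```

    With 2,000 resamples per bootstrap method on top of each simulated dataset, the total workload makes the use of a large cluster easy to understand.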

  6. Commercial counterboard for 10 ns software correlator for photon and fluorescence correlation spectroscopy

    NASA Astrophysics Data System (ADS)

    Molteni, Matteo; Ferri, Fabio

    2016-11-01

    A 10 ns time resolution, multi-tau software correlator, capable of computing simultaneous autocorrelation (A-A, B-B) and cross (A-B) correlation functions at count rates up to ∼10 MHz, with no data loss, has been developed in LabVIEW and C++ by using the National Instruments timer/counterboard (NI PCIe-6612) and a fast Personal Computer (PC) (Intel Core i7-4790 Processor, 3.60 GHz). The correlator works by using two algorithms: for large lag times (τ ≳ 1 μs), a classical time-mode scheme, based on the measure of the number of pulses per time interval, is used; for τ ≲ 1 μs, by contrast, a photon-mode (PM) scheme is adopted and the correlation function is retrieved from the sequence of the photon arrival times. Single auto- and cross-correlation functions can be processed online in full real time up to count rates of ∼1.8 MHz and ∼1.2 MHz, respectively. Two autocorrelation functions (A-A, B-B) and one cross-correlation function (A-B) can be simultaneously processed in full real time only up to count rates of ∼750 kHz. At higher count rates, the online processing takes place in a delayed modality, but with no data loss. When tested with simulated correlation data and latex sphere solutions, the overall performance of the correlator appears to be comparable with that of commercial hardware correlators, but with several nontrivial advantages related to its flexibility, low cost, and easy adaptability to future developments of PC and data acquisition technology.

  7. Intelligence system based classification approach for medical disease diagnosis

    NASA Astrophysics Data System (ADS)

    Sagir, Abdu Masanawa; Sathasivam, Saratha

    2017-08-01

    The prediction of breast cancer in women who have no signs or symptoms of the disease, as well as of survivability after undergoing certain surgery, has been a challenging problem for medical researchers. The decision about the presence or absence of disease depends more on the physician's intuition, experience, and skill in comparing current indicators with previous ones than on the knowledge-rich data hidden in a database. This is a crucial and challenging task. The goal is to predict patient condition by using an adaptive neuro-fuzzy inference system (ANFIS) pre-processed by grid partitioning. To achieve an accurate diagnosis at this complex stage of symptom analysis, the physician may need an efficient diagnosis system. A framework is described for designing and evaluating the classification performance of two discrete ANFIS systems with hybrid learning algorithms, combining least-squares estimates with modified Levenberg-Marquardt and gradient descent algorithms, that physicians can use to accelerate the diagnosis process. The proposed method's performance was evaluated on training and test datasets using the mammographic mass and Haberman's survival datasets obtained from the University of California at Irvine (UCI) machine learning repository. The robustness of the performance, measured by total accuracy, sensitivity, and specificity, is examined. The proposed method achieves superior performance compared to the conventional ANFIS based on the gradient descent algorithm and to some related existing methods. The software used for the implementation is MATLAB R2014a (version 8.3), executed on a PC with an Intel Pentium IV E7400 processor with 2.80 GHz speed and 2.0 GB of RAM.

  8. Coupled circuit numerical analysis of eddy currents in an open MRI system.

    PubMed

    Akram, Md Shahadat Hossain; Terada, Yasuhiko; Keiichiro, Ishi; Kose, Katsumi

    2014-08-01

    We performed a new coupled circuit numerical simulation of eddy currents in an open compact magnetic resonance imaging (MRI) system. Following the coupled circuit approach, the conducting structures were divided into subdomains along the length (or width) and the thickness, and by implementing coupled circuit concepts we have simulated transient responses of eddy currents for subdomains in different locations. We implemented the Eigen matrix technique to solve the network of coupled differential equations to speed up our simulation program. On the other hand, to compute the coupling relations between the biplanar gradient coil and any other conducting structure, we implemented the solid angle form of Ampere's law. We have also calculated the solid angle for three dimensions to compute inductive couplings in any subdomain of the conducting structures. Details of the temporal and spatial distribution of the eddy currents were then implemented in the secondary magnetic field calculation by the Biot-Savart law. On a desktop computer (Programming platform: Wolfram Mathematica 8.0®, Processor: Intel(R) Core(TM)2 Duo E7500 @ 2.93 GHz; OS: Windows 7 Professional; Memory (RAM): 4.00 GB), it took less than 3 min to simulate the entire calculation of eddy currents and fields, and approximately 6 min for the X-gradient coil. The results are given in the time-space domain for both the direct and the cross-terms of the eddy current magnetic fields generated by the Z-gradient coil. We have also conducted free induction decay (FID) experiments of eddy fields using a nuclear magnetic resonance (NMR) probe to verify our simulation results. The simulation results were found to be in good agreement with the experimental results. In this study we have also conducted simulations for transient and spatial responses of the secondary magnetic field induced by the X-gradient coil. Our approach is fast and has much less computational complexity than the conventional electromagnetic numerical simulation methods. Copyright © 2014 Elsevier Inc. All rights reserved.

  9. Millimeter-wave passive ultra-compact imaging technology for synthetic vision & mobile platforms

    NASA Technical Reports Server (NTRS)

    Olsen, Randall

    1996-01-01

    Substantial technical progress was made on all of the three high-risk subsystems of this program. The subsystems include dielectric antenna, G-band receiver, and electro-optic image processor. Progress is approximately on-schedule for both the receiver and the electro-optic processor development, while greater than anticipated challenges have been discovered in the dielectric antenna development. Much of the information in this report was covered in greater detail in the One-Year Review Meeting held at TTC on 22 February 1996. The performance goals of the dielectric antenna project are: Scan Angle -- 20 deg. desired; Loss -- 6 dB end to end (3 dB average); Frequency -- 206-218 GHz (6% bandwidth); Beam width -- 0.25 deg.; and Length -- 12 inches. The scan angle requirement was chosen to satisfy the needs of aircraft pilots. This requirement, coupled with the presently limited bandwidth processors (1 GHz state-of-the-art and 12 GHz in development in this program), forces the antenna to be dielectric (high scan angle air-filled waveguide-based antennas would be too lossy and their performance would vary too much as a function of frequency). A high dielectric constant (e.g., 10) was initially chosen for the dielectric material. This choice led to the following fabrication challenges: total thickness variation (TTV) tolerance is 1 micrometer; coupler spacing tolerance is 1 micrometer; width tolerance is larger, but unknown; and the surfaces must have a mirror finish. Also of importance is the difficulty in obtaining raw materials that satisfy the overall length requirement of 12 inches while simultaneously satisfying the above specifications.

  10. Operational Level 2 Data Processing System for the JEM/SMILES

    NASA Astrophysics Data System (ADS)

    Takahashi, C.; Mitsuda, C.; Suzuki, M.; Iwata, Y.; Horikawa, M.; Matsumoto, T.; Hayashi, H.; Imai, K.; Sano, T.; Takayanagi, M.

    2009-12-01

    To measure the thermal emission from stratospheric minor species with high sensitivity, the Superconducting Submillimeter-wave Limb-Emission Sounder (SMILES) aboard the Japanese Experiment Module (JEM) of the International Space Station (ISS) carries 4 K cooled Superconductor-Insulator-Superconductor (SIS) mixers. The major feature of SMILES is its highly sensitive measurement capability, with a system noise temperature below 700 K. SMILES measures the atmospheric limb emission from stratospheric minor constituents in three submillimeter-wave bands: band A (624.3-625.5 GHz), band B (625.1-626.3 GHz), and band C (649.1-650.3 GHz). The target species are O3, ClO, HCl, HNO3, HOCl, CH3CN, HO2, BrO, and O3 isotopes (18OOO, 17OOO, and O17OO). Since a complete vertical scan takes 53 s and the orbital period of the ISS is approximately 93 min, there are approximately 105 scans per orbit and 1630 scans per day. There are 68 individual limb rays in a single scan, and the nominal altitude coverage is from 10 to 60 km. The spatial coverage is near-global (38S - 65N). It is expected that it will be possible to make measurements within the elongated polar vortex in the Northern Hemisphere. As a part of the ground system for SMILES, a level 2 data processing system (DPS-L2) has been developed; it is designed as highly portable software running on a general Linux operating system. It retrieves the density distributions of the target species (level 2 data) from calibrated spectra (level 1B data) in near-real-time. The level 2 data are converted into the HDF-EOS format and are distributed to users, together with ancillary data on the SMILES status, through a web server. To support the analysis of the level 2 data, it is implemented on Gfdnavi (geophysical fluid data navigator), a suite of software that facilitates databasing, analysis, data publication, and visualization of geophysical fluid data.
    This paper presents the development of the DPS-L2 along with the details of its algorithm and performance. The retrieval process consists of two parts: the forward model, which computes radiative transfer, and the inverse model, which deduces atmospheric states. Since the forward model must provide the most accurate basis for results and be implemented under limited computing resources, the forward model algorithm for an operational system has to be accurate and fast. Hence, the algorithm is improved (1) by designing accurate instrument functions such as the instrumental field of view (FOV), the sideband rejection ratio of the sideband separator, and the spectral responses of the acousto-optic spectrometer (AOS), and (2) by optimizing the radiative transfer calculation. The accuracy of this algorithm is better than 1%, and the processing time for single-scan spectra is less than 1 min with 8-way parallel processing on a 3.16-GHz Quad-Core Intel Xeon processor.
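
    The scan counts quoted in this abstract can be checked with a few lines of arithmetic. The sketch below uses only the figures given above (53 s per scan, a 93 min orbit, 68 limb rays per scan); the variable names are illustrative, and the limb-rays-per-day value is derived here and is not stated in the source.

```python
# Back-of-the-envelope check of the SMILES scan counts quoted in the abstract.
SCAN_SECONDS = 53.0     # duration of one complete vertical scan
ORBIT_MINUTES = 93.0    # approximate ISS orbital period
RAYS_PER_SCAN = 68      # individual limb rays in a single scan

scans_per_orbit = ORBIT_MINUTES * 60.0 / SCAN_SECONDS
orbits_per_day = 24.0 * 60.0 / ORBIT_MINUTES
scans_per_day = scans_per_orbit * orbits_per_day   # same as 86400 / 53
rays_per_day = scans_per_day * RAYS_PER_SCAN       # derived here, not from the source

print(f"scans per orbit: {scans_per_orbit:.1f}")
print(f"scans per day:   {scans_per_day:.0f}")
print(f"limb rays/day:   {rays_per_day:.0f}")
```

    Rounded, the first two values agree with the roughly 105 scans per orbit and 1630 scans per day given in the abstract.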

  11. Acceleration of spiking neural network based pattern recognition on NVIDIA graphics processors.

    PubMed

    Han, Bing; Taha, Tarek M

    2010-04-01

    There is currently a strong push in the research community to develop biological scale implementations of neuron based vision models. Systems at this scale are computationally demanding and generally utilize more accurate neuron models, such as the Izhikevich and the Hodgkin-Huxley models, in favor of the more popular integrate and fire model. We examine the feasibility of using graphics processing units (GPUs) to accelerate a spiking neural network based character recognition network to enable such large scale systems. Two versions of the network utilizing the Izhikevich and Hodgkin-Huxley models are implemented. Three NVIDIA general-purpose (GP) GPU platforms are examined, including the GeForce 9800 GX2, the Tesla C1060, and the Tesla S1070. Our results show that the GPGPUs can provide significant speedup over conventional processors. In particular, the fastest GPGPU utilized, the Tesla S1070, provided a speedup of 5.6 and 84.4 over highly optimized implementations on the fastest central processing unit (CPU) tested, a quadcore 2.67 GHz Xeon processor, for the Izhikevich and the Hodgkin-Huxley models, respectively. The CPU implementation utilized all four cores and the vector data parallelism offered by the processor. The results indicate that GPUs are well suited for this application domain.

  12. The Intelcities Community of Practice: The Capacity-Building, Co-Design, Evaluation, and Monitoring of E-Government Services

    ERIC Educational Resources Information Center

    Deakin, Mark; Lombardi, Patrizia; Cooper, Ian

    2011-01-01

    The paper examines the IntelCities Community of Practice (CoP) supporting the development of the organization's capacity-building, co-design, monitoring, and evaluation of e-government services. It begins by outlining the IntelCities CoP and goes on to set out the integrated model of electronically enhanced government (e-government) services…

  13. Measurements of the LHCb software stack on the ARM architecture

    NASA Astrophysics Data System (ADS)

    Vijay Kartik, S.; Couturier, Ben; Clemencic, Marco; Neufeld, Niko

    2014-06-01

    The ARM architecture is a power-efficient design that is used in most processors in mobile devices all around the world today since they provide reasonable compute performance per watt. The current LHCb software stack is designed (and thus expected) to build and run on machines with the x86/x86_64 architecture. This paper outlines the process of measuring the performance of the LHCb software stack on the ARM architecture - specifically, the ARMv7 architecture on Cortex-A9 processors from NVIDIA and on full-fledged ARM servers with chipsets from Calxeda - and makes comparisons with the performance on x86_64 architectures on the Intel Xeon L5520/X5650 and AMD Opteron 6272. The paper emphasises the aspects of performance per core with respect to the power drawn by the compute nodes for the given performance - this ensures a fair real-world comparison with much more 'powerful' Intel/AMD processors. The comparisons of these real workloads in the context of LHCb are also complemented with the standard synthetic benchmarks HEPSPEC and Coremark. The pitfalls and solutions for the non-trivial task of porting the source code to build for the ARMv7 instruction set are presented. The specific changes in the build process needed for ARM-specific portions of the software stack are described, to serve as pointers for further attempts taken up by other groups in this direction. Cases where architecture-specific tweaks at the assembler level (both in ROOT and the LHCb software stack) were needed for a successful compile are detailed - these cases are good indicators of where/how the software stack as well as the build system can be made more portable and multi-arch friendly. The experience gained from the tasks described in this paper is intended to i) assist in making an informed choice about ARM-based server solutions as a feasible low-power alternative to the current compute nodes, and ii) revisit the software design and build system for portability and generic improvements.

  14. NASA Center for Climate Simulation (NCCS) Presentation

    NASA Technical Reports Server (NTRS)

    Webster, William P.

    2012-01-01

    The NASA Center for Climate Simulation (NCCS) offers integrated supercomputing, visualization, and data interaction technologies to enhance NASA's weather and climate prediction capabilities. It serves hundreds of users at NASA Goddard Space Flight Center, as well as other NASA centers, laboratories, and universities across the US. Over the past year, NCCS has continued expanding its data-centric computing environment to meet the increasingly data-intensive challenges of climate science. We doubled our Discover supercomputer's peak performance to more than 800 teraflops by adding 7,680 Intel Xeon Sandy Bridge processor-cores and most recently 240 Intel Xeon Phi Many Integrated Core (MIC) co-processors. A supercomputing-class analysis system named Dali gives users rapid access to their data on Discover and high-performance software including the Ultra-scale Visualization Climate Data Analysis Tools (UV-CDAT), with interfaces from user desktops and a 17- by 6-foot visualization wall. NCCS also is exploring highly efficient climate data services and management with a new MapReduce/Hadoop cluster while augmenting its data distribution to the science community. Using NCCS resources, NASA completed its modeling contributions to the Intergovernmental Panel on Climate Change (IPCC) Fifth Assessment Report this summer as part of the ongoing Coupled Model Intercomparison Project Phase 5 (CMIP5). Ensembles of simulations run on Discover reached back to the year 1000 to test model accuracy and projected climate change through the year 2300 based on four different scenarios of greenhouse gases, aerosols, and land use. The data resulting from several thousand IPCC/CMIP5 simulations, as well as a variety of other simulation, reanalysis, and observation datasets, are available to scientists and decision makers through an enhanced NCCS Earth System Grid Federation Gateway. Worldwide downloads have totaled over 110 terabytes of data.

  15. SoAx: A generic C++ Structure of Arrays for handling particles in HPC codes

    NASA Astrophysics Data System (ADS)

    Homann, Holger; Laenen, Francois

    2018-03-01

    The numerical study of physical problems often requires integrating the dynamics of a large number of particles evolving according to a given set of equations. Particles are characterized by the information they carry, such as an identity, a position, or other properties. There are generally speaking two different possibilities for handling particles in high performance computing (HPC) codes. The concept of an Array of Structures (AoS) is in the spirit of the object-oriented programming (OOP) paradigm in that the particle information is implemented as a structure. Here, an object (realization of the structure) represents one particle and a set of many particles is stored in an array. In contrast, using the concept of a Structure of Arrays (SoA), a single structure holds several arrays each representing one property (such as the identity) of the whole set of particles. The AoS approach is often implemented in HPC codes due to its handiness and flexibility. For a class of problems, however, it is known that the performance of SoA is much better than that of AoS. We confirm this observation for our particle problem. Using a benchmark we show that on modern Intel Xeon processors the SoA implementation is typically several times faster than the AoS one. On Intel's MIC co-processors the performance gap even attains a factor of ten. The same is true for GPU computing, using both computational and multi-purpose GPUs. Combining performance and handiness, we present the library SoAx that has optimal performance (on CPUs, MICs, and GPUs) while providing the same handiness as AoS. For this, SoAx uses modern C++ design techniques such as template metaprogramming, which allows code to be generated automatically for user-defined heterogeneous data structures.
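
    The two layouts described in this abstract can be illustrated with a small sketch. SoAx itself is a C++ library; the Python below is not its API and makes no performance claim. It only shows how the same particle data is organized as an Array of Structures versus a Structure of Arrays. The class names, fields, and update rule are assumptions made for the example.

```python
from array import array
from dataclasses import dataclass


@dataclass
class Particle:
    """AoS element: one object holds every property of one particle."""
    ident: int
    x: float
    v: float


class ParticlesSoA:
    """SoA container: one contiguous array per property, for all particles."""

    def __init__(self, n: int) -> None:
        self.ident = array("q", range(n))
        self.x = array("d", [0.0] * n)
        self.v = array("d", [float(i) for i in range(n)])


def advance_aos(particles: list, dt: float) -> None:
    # Each iteration touches a whole particle object.
    for p in particles:
        p.x += p.v * dt


def advance_soa(particles: ParticlesSoA, dt: float) -> None:
    # Each iteration touches only the two arrays the update needs.
    x, v = particles.x, particles.v
    for i in range(len(x)):
        x[i] += v[i] * dt


n = 4
aos = [Particle(ident=i, x=0.0, v=float(i)) for i in range(n)]
soa = ParticlesSoA(n)
advance_aos(aos, dt=0.5)
advance_soa(soa, dt=0.5)
print([p.x for p in aos])
print(list(soa.x))
```

    Both layouts hold the same information and produce the same positions; the difference lies in which values sit next to each other in memory, which is what the benchmark in the abstract measures.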

  16. Automated tracking of whiskers in videos of head fixed rodents.

    PubMed

    Clack, Nathan G; O'Connor, Daniel H; Huber, Daniel; Petreanu, Leopoldo; Hires, Andrew; Peron, Simon; Svoboda, Karel; Myers, Eugene W

    2012-01-01

    We have developed software for fully automated tracking of vibrissae (whiskers) in high-speed videos (>500 Hz) of head-fixed, behaving rodents trimmed to a single row of whiskers. Performance was assessed against a manually curated dataset consisting of 1.32 million video frames comprising 4.5 million whisker traces. The current implementation detects whiskers with a recall of 99.998% and identifies individual whiskers with 99.997% accuracy. The average processing rate for these images was 8 Mpx/s/cpu (2.6 GHz Intel Core2, 2 GB RAM). This translates to 35 processed frames per second for a 640 px×352 px video of 4 whiskers. The speed and accuracy achieved enables quantitative behavioral studies where the analysis of millions of video frames is required. We used the software to analyze the evolving whisking strategies as mice learned a whisker-based detection task over the course of 6 days (8148 trials, 25 million frames) and measure the forces at the sensory follicle that most underlie haptic perception.
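
    The frame rate quoted in this abstract follows directly from the pixel throughput and the frame size. The sketch below reproduces that conversion; it assumes "Mpx" means 10^6 pixels, and the total-time estimate at the end further assumes that every frame in the 1.32-million-frame dataset has the 640 px x 352 px size, which the source does not state.

```python
# Convert the quoted pixel throughput into frames per second.
PIXELS_PER_SECOND = 8e6        # 8 Mpx/s per CPU (assuming 1 Mpx = 10**6 px)
WIDTH, HEIGHT = 640, 352       # frame size quoted in the abstract

pixels_per_frame = WIDTH * HEIGHT
frames_per_second = PIXELS_PER_SECOND / pixels_per_frame

# Hypothetical extrapolation: single-CPU time for 1.32 million frames,
# assuming all frames have the size above (not stated in the source).
DATASET_FRAMES = 1.32e6
hours_for_dataset = DATASET_FRAMES / frames_per_second / 3600.0

print(f"pixels per frame:  {pixels_per_frame}")
print(f"frames per second: {frames_per_second:.1f}")
print(f"hours for dataset: {hours_for_dataset:.1f}")
```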

  17. Automated Tracking of Whiskers in Videos of Head Fixed Rodents

    PubMed Central

    Clack, Nathan G.; O'Connor, Daniel H.; Huber, Daniel; Petreanu, Leopoldo; Hires, Andrew; Peron, Simon; Svoboda, Karel; Myers, Eugene W.

    2012-01-01

    We have developed software for fully automated tracking of vibrissae (whiskers) in high-speed videos (>500 Hz) of head-fixed, behaving rodents trimmed to a single row of whiskers. Performance was assessed against a manually curated dataset consisting of 1.32 million video frames comprising 4.5 million whisker traces. The current implementation detects whiskers with a recall of 99.998% and identifies individual whiskers with 99.997% accuracy. The average processing rate for these images was 8 Mpx/s/cpu (2.6 GHz Intel Core2, 2 GB RAM). This translates to 35 processed frames per second for a 640 px×352 px video of 4 whiskers. The speed and accuracy achieved enables quantitative behavioral studies where the analysis of millions of video frames is required. We used the software to analyze the evolving whisking strategies as mice learned a whisker-based detection task over the course of 6 days (8148 trials, 25 million frames) and measure the forces at the sensory follicle that most underlie haptic perception. PMID:22792058

  18. Development of a P-I-N HgCdTe photomixer for laser heterodyne spectrometry

    NASA Technical Reports Server (NTRS)

    Bratt, Peter R.

    1987-01-01

    An improved HgCdTe photomixer technology was demonstrated employing a p-i-n photodiode structure. The i-region was near intrinsic n-type HgCdTe; the n-region was formed by B+ ion implantation; and the p-region was formed either by a shallow Au diffusion or by a Pt Schottky barrier. Experimental devices in a back-side illuminated mesa diode configuration were fabricated, tested, and delivered. The best photomixer was packaged in a 24-hour LN2 dewar along with a cooled GaAs FET preamplifier. Testing was performed by mixing black-body radiation with a CO2 laser beam and measuring the IF signal, noise, and signal-to-noise ratio in the GHz frequency range. Signal bandwidth for this photomixer was 1.3 GHz. The heterodyne NEP was 4.4 x 10 to the -20 W/Hz out to 1 GHz increasing to 8.6 x 10 to the -10 W/Hz at 2 GHz. Other photomixers delivered on this program had heterodyne NEPs at 1 GHz ranging from 8 x 10 to the -20 to 4.4 x 10 to the -19 W/Hz and NEP bandwidths from 2 to 4 GHz.

  19. A SOPC-BASED Evaluation of AES for 2.4 GHz Wireless Network

    NASA Astrophysics Data System (ADS)

    Ken, Cai; Xiaoying, Liang

    In modern systems, data security is needed more than ever before and many cryptographic algorithms are utilized for security services. Wireless Sensor Networks (WSN) is an example of such technologies. In this paper an innovative SOPC-based approach for the security services evaluation in WSN is proposed that addresses the issues of scalability, flexible performance, and silicon efficiency for the hardware acceleration of encryption system. The design includes a Nios II processor together with custom designed modules for the Advanced Encryption Standard (AES) which has become the default choice for various security services in numerous applications. The objective of this mechanism is to present an efficient hardware realization of AES using very high speed integrated circuit hardware description language (Verilog HDL) and expand the usability for various applications. As compared to traditional custom processor design, the mechanism provides a very broad range of cost/performance points.

  20. FLY MPI-2: a parallel tree code for LSS

    NASA Astrophysics Data System (ADS)

    Becciani, U.; Comparato, M.; Antonuccio-Delogu, V.

    2006-04-01

    New version program summary
    Program title: FLY 3.1
    Catalogue identifier: ADSC_v2_0
    Licensing provisions: yes
    Program summary URL: http://cpc.cs.qub.ac.uk/summaries/ADSC_v2_0
    Program obtainable from: CPC Program Library, Queen's University of Belfast, N. Ireland
    No. of lines in distributed program, including test data, etc.: 158 172
    No. of bytes in distributed program, including test data, etc.: 4 719 953
    Distribution format: tar.gz
    Programming language: Fortran 90, C
    Computer: Beowulf cluster, PC, MPP systems
    Operating system: Linux, Aix
    RAM: 100M words
    Catalogue identifier of previous version: ADSC_v1_0
    Journal reference of previous version: Comput. Phys. Comm. 155 (2003) 159
    Does the new version supersede the previous version?: yes
    Nature of problem: FLY is a parallel collisionless N-body code for the calculation of the gravitational force
    Solution method: FLY is based on the hierarchical oct-tree domain decomposition introduced by Barnes and Hut (1986)
    Reasons for the new version: The new version of FLY is implemented by using the MPI-2 standard: the distributed version 3.1 was developed by using the MPICH2 library on a PC Linux cluster. Today the FLY performance allows us to consider the FLY code among the most powerful parallel codes for tree N-body simulations. Another important new feature regards the availability of an interface with hydrodynamical Paramesh based codes. Simulations must follow a box large enough to accurately represent the power spectrum of fluctuations on very large scales so that we may hope to compare them meaningfully with real data. The number of particles then sets the mass resolution of the simulation, which we would like to make as fine as possible. The idea to build an interface between two codes, that have different and complementary cosmological tasks, allows us to execute complex cosmological simulations with FLY, specialized for DM evolution, and a code specialized for hydrodynamical components that uses a Paramesh block structure.
    Summary of revisions: The parallel communication schema was totally changed. The new version adopts the MPICH2 library. Now FLY can be executed on all Unix systems having an MPI-2 standard library. The main data structure, is declared in a module procedure of FLY (fly_h.F90 routine). FLY creates the MPI Window object for one-sided communication for all the shared arrays, with a call like the following:
        CALL MPI_WIN_CREATE(POS, SIZE, REAL8, MPI_INFO_NULL, MPI_COMM_WORLD, WIN_POS, IERR)
    the following main window objects are created: win_pos, win_vel, win_acc: particles positions velocities and accelerations, win_pos_cell, win_mass_cell, win_quad, win_subp, win_grouping: cells positions, masses, quadrupole momenta, tree structure and grouping cells. Other windows are created for dynamic load balance and global counters.
    Restrictions: The program uses the leapfrog integrator schema, but could be changed by the user.
    Unusual features: FLY uses the MPI-2 standard: the MPICH2 library on Linux systems was adopted. To run this version of FLY the working directory must be shared among all the processors that execute FLY.
    Additional comments: Full documentation for the program is included in the distribution in the form of a README file, a User Guide and a Reference manuscript.
    Running time: IBM Linux Cluster 1350, 512 nodes with 2 processors for each node and 2 GB RAM for each processor, at Cineca, was adopted to make performance tests. Processor type: Intel Xeon Pentium IV 3.0 GHz and 512 KB cache (128 nodes have Nocona processors). Internal Network: Myricom LAN Card "C" Version and "D" Version. Operating System: Linux SuSE SLES 8. The code was compiled using the mpif90 compiler version 8.1 and with basic optimization options in order to have performances that could be useful compared with other generic clusters
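
    The program summary states that FLY uses a leapfrog integrator. As an illustration of that scheme only (this is not FLY's Fortran code, and the source does not say which leapfrog variant FLY uses), the sketch below applies the kick-drift-kick form to a single particle in a harmonic potential. The function names, the test problem, and the step size are assumptions chosen for the example.

```python
import math


def leapfrog(x, v, accel, dt, steps):
    """Advance position x and velocity v with kick-drift-kick leapfrog."""
    a = accel(x)
    for _ in range(steps):
        v += 0.5 * dt * a      # half kick with the current acceleration
        x += dt * v            # full drift with the half-step velocity
        a = accel(x)           # acceleration at the new position
        v += 0.5 * dt * a      # second half kick
    return x, v


def harmonic_accel(x):
    """Hypothetical test force: unit-mass, unit-frequency harmonic oscillator."""
    return -x


def energy(x, v):
    return 0.5 * (v * v + x * x)


x0, v0 = 1.0, 0.0
dt = 0.01
steps = round(2.0 * math.pi / dt)          # roughly one oscillation period
x1, v1 = leapfrog(x0, v0, harmonic_accel, dt, steps)
energy_drift = abs(energy(x1, v1) - energy(x0, v0))

print(f"x after ~one period: {x1:.6f}")
print(f"energy drift:        {energy_drift:.2e}")
```

    In a tree code the acceleration would come from the oct-tree force evaluation instead of the analytic force used here.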

  1. Application of graphics processing units to search pipelines for gravitational waves from coalescing binaries of compact objects

    NASA Astrophysics Data System (ADS)

    Chung, Shin Kee; Wen, Linqing; Blair, David; Cannon, Kipp; Datta, Amitava

    2010-07-01

    We report a novel application of a graphics processing unit (GPU) for the purpose of accelerating the search pipelines for gravitational waves from coalescing binaries of compact objects. A speed-up of 16-fold in total has been achieved with an NVIDIA GeForce 8800 Ultra GPU card compared with one core of a 2.5 GHz Intel Q9300 central processing unit (CPU). We show that substantial improvements are possible and discuss the reduction in CPU count required for the detection of inspiral sources afforded by the use of GPUs.

  2. PFLOTRAN User Manual: A Massively Parallel Reactive Flow and Transport Model for Describing Surface and Subsurface Processes

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Lichtner, Peter C.; Hammond, Glenn E.; Lu, Chuan

    PFLOTRAN solves a system of generally nonlinear partial differential equations describing multi-phase, multicomponent and multiscale reactive flow and transport in porous materials. The code is designed to run on massively parallel computing architectures as well as workstations and laptops (e.g. Hammond et al., 2011). Parallelization is achieved through domain decomposition using the PETSc (Portable Extensible Toolkit for Scientific Computation) libraries for the parallelization framework (Balay et al., 1997). PFLOTRAN has been developed from the ground up for parallel scalability and has been run on up to 2^18 processor cores with problem sizes up to 2 billion degrees of freedom. Written in object oriented Fortran 90, the code requires the latest compilers compatible with Fortran 2003. At the time of this writing this requires gcc 4.7.x, Intel 12.1.x and PGI compilers. As a requirement of running problems with a large number of degrees of freedom, PFLOTRAN allows reading input data that is too large to fit into memory allotted to a single processor core. The current limitation to the problem size PFLOTRAN can handle is the limitation of the HDF5 file format used for parallel IO to 32 bit integers. Noting that 2^32 = 4,294,967,296, this gives an estimate of the maximum problem size that can be currently run with PFLOTRAN. Hopefully this limitation will be remedied in the near future.

  3. Initial results on computational performance of Intel Many Integrated Core (MIC) architecture: implementation of the Weather and Research Forecasting (WRF) Purdue-Lin microphysics scheme

    NASA Astrophysics Data System (ADS)

    Mielikainen, Jarno; Huang, Bormin; Huang, Allen H.

    2014-10-01

    The Purdue-Lin scheme is a relatively sophisticated microphysics scheme in the Weather Research and Forecasting (WRF) model. The scheme includes six classes of hydrometeors: water vapor, cloud water, rain, cloud ice, snow and graupel. The scheme is very suitable for massively parallel computation as there are no interactions among horizontal grid points. In this paper, we accelerate the Purdue-Lin scheme using Intel Many Integrated Core Architecture (MIC) hardware. The Intel Xeon Phi is a high performance coprocessor consisting of up to 61 cores. The Xeon Phi is connected to a CPU via the PCI Express (PCIe) bus. In this paper, we will discuss in detail the code optimization issues encountered while tuning the Purdue-Lin microphysics Fortran code for Xeon Phi. In particular, getting good performance required utilizing multiple cores and the wide vector operations, and making efficient use of memory. The results show that the optimizations improved performance of the original code on Xeon Phi 5110P by a factor of 4.2x. Furthermore, the same optimizations improved performance on Intel Xeon E5-2603 CPU by a factor of 1.2x compared to the original code.

  4. USC orthogonal multiprocessor for image processing with neural networks

    NASA Astrophysics Data System (ADS)

    Hwang, Kai; Panda, Dhabaleswar K.; Haddadi, Navid

    1990-07-01

    This paper presents the architectural features and imaging applications of the Orthogonal MultiProcessor (OMP) system, which is under construction at the University of Southern California with research funding from NSF and assistance from several industrial partners. The prototype OMP is being built with 16 Intel i860 RISC microprocessors and 256 parallel memory modules using custom-designed spanning buses, which are 2-D interleaved and orthogonally accessed without conflicts. The 16-processor OMP prototype is targeted to achieve 430 MIPS and 600 Mflops, which have been verified by simulation experiments based on the design parameters used. The prototype OMP machine will be initially applied for image processing, computer vision, and neural network simulation applications. We summarize important vision and imaging algorithms that can be restructured with neural network models. These algorithms can efficiently run on the OMP hardware with linear speedup. The ultimate goal is to develop a high-performance Visual Computer (Viscom) for integrated low- and high-level image processing and vision tasks.

  5. Concurrent and Accurate Short Read Mapping on Multicore Processors.

    PubMed

    Martínez, Héctor; Tárraga, Joaquín; Medina, Ignacio; Barrachina, Sergio; Castillo, Maribel; Dopazo, Joaquín; Quintana-Ortí, Enrique S

    2015-01-01

    We introduce a parallel aligner with a work-flow organization for fast and accurate mapping of RNA sequences on servers equipped with multicore processors. Our software, HPG Aligner SA (an open-source application available at http://www.opencb.org), exploits a suffix array to rapidly map a large fraction of the RNA fragments (reads), and leverages the accuracy of the Smith-Waterman algorithm to deal with conflictive reads. The aligner is enhanced with a careful strategy to detect splice junctions based on an adaptive division of RNA reads into small segments (or seeds), which are then mapped onto a number of candidate alignment locations, providing crucial information for the successful alignment of the complete reads. The experimental results on a platform with Intel multicore technology report the parallel performance of HPG Aligner SA, on RNA reads of 100-400 nucleotides, which excels in execution time/sensitivity to state-of-the-art aligners such as TopHat 2+Bowtie 2, MapSplice, and STAR.

  6. High performance in silico virtual drug screening on many-core processors.

    PubMed

    McIntosh-Smith, Simon; Price, James; Sessions, Richard B; Ibarra, Amaurys A

    2015-05-01

    Drug screening is an important part of the drug development pipeline for the pharmaceutical industry. Traditional, lab-based methods are increasingly being augmented with computational methods, ranging from simple molecular similarity searches through more complex pharmacophore matching to more computationally intensive approaches, such as molecular docking. The latter simulates the binding of drug molecules to their targets, typically protein molecules. In this work, we describe BUDE, the Bristol University Docking Engine, which has been ported to the OpenCL industry standard parallel programming language in order to exploit the performance of modern many-core processors. Our highly optimized OpenCL implementation of BUDE sustains 1.43 TFLOP/s on a single Nvidia GTX 680 GPU, or 46% of peak performance. BUDE also exploits OpenCL to deliver effective performance portability across a broad spectrum of different computer architectures from different vendors, including GPUs from Nvidia and AMD, Intel's Xeon Phi and multi-core CPUs with SIMD instruction sets.

  7. The cost of conservative synchronization in parallel discrete event simulations

    NASA Technical Reports Server (NTRS)

    Nicol, David M.

    1990-01-01

    The performance of a synchronous conservative parallel discrete-event simulation protocol is analyzed. The class of simulation models considered is oriented around a physical domain and possesses a limited ability to predict future behavior. A stochastic model is used to show that as the volume of simulation activity in the model increases relative to a fixed architecture, the complexity of the average per-event overhead due to synchronization, event list manipulation, lookahead calculations, and processor idle time approaches the complexity of the average per-event overhead of a serial simulation. The method is therefore within a constant factor of optimal. The analysis demonstrates that on large problems--those for which parallel processing is ideally suited--there is often enough parallel workload so that processors are not usually idle. The viability of the method is also demonstrated empirically, showing how good performance is achieved on large problems using a thirty-two node Intel iPSC/2 distributed memory multiprocessor.

  8. Programmable optical processor chips: toward photonic RF filters with DSP-level flexibility and MHz-band selectivity

    NASA Astrophysics Data System (ADS)

    Xie, Yiwei; Geng, Zihan; Zhuang, Leimeng; Burla, Maurizio; Taddei, Caterina; Hoekman, Marcel; Leinse, Arne; Roeloffzen, Chris G. H.; Boller, Klaus-J.; Lowery, Arthur J.

    2017-12-01

    Integrated optical signal processors have been identified as a powerful engine for optical processing of microwave signals. They enable wideband and stable signal processing operations on miniaturized chips with ultimate control precision. As a promising application, such processors enable photonic implementations of reconfigurable radio frequency (RF) filters with wide design flexibility, large bandwidth, and high-frequency selectivity. This is a key technology for photonic-assisted RF front ends that opens a path to overcoming the bandwidth limitation of current digital electronics. Here, the recent progress of integrated optical signal processors for implementing such RF filters is reviewed. We highlight the use of a low-loss, high-index-contrast stoichiometric silicon nitride waveguide which promises to serve as a practical material platform for realizing high-performance optical signal processors and points toward photonic RF filters with digital signal processing (DSP)-level flexibility, hundreds-GHz bandwidth, MHz-band frequency selectivity, and full system integration on a chip scale.

  9. Advanced Communications Architecture Demonstration Made Significant Progress

    NASA Technical Reports Server (NTRS)

    Carek, David Andrew

    2004-01-01

    Simulation for a ground station located at 44.5 deg latitude. The Advanced Communications Architecture Demonstration (ACAD) is a concept architecture to provide high-rate Ka-band (27-GHz) direct-to-ground delivery of payload data from the International Space Station. This new concept in delivering data from the space station targets scientific experiments that buffer data onboard. The concept design provides a method to augment the current downlink capability through the Tracking and Data Relay Satellite System (TDRSS) Ku-band (15-GHz) communications system. The ACAD concept pushes the limits of technology in high-rate data communications for space-qualified systems. Research activities are ongoing in examining the various aspects of high-rate communications systems including: (1) link budget parametric analyses, (2) antenna configuration trade studies, (3) orbital simulations (see the preceding figure), (4) optimization of ground station contact time (see the following graph), (5) processor and storage architecture definition, and (6) protocol evaluations and dependencies.

  10. Fast multipurpose Monte Carlo simulation for proton therapy using multi- and many-core CPU architectures.

    PubMed

    Souris, Kevin; Lee, John Aldo; Sterpin, Edmond

    2016-04-01

    Accuracy in proton therapy treatment planning can be improved using Monte Carlo (MC) simulations. However the long computation time of such methods hinders their use in clinical routine. This work aims to develop a fast multipurpose Monte Carlo simulation tool for proton therapy using massively parallel central processing unit (CPU) architectures. A new Monte Carlo, called MCsquare (many-core Monte Carlo), has been designed and optimized for the last generation of Intel Xeon processors and Intel Xeon Phi coprocessors. These massively parallel architectures offer the flexibility and the computational power suitable to MC methods. The class-II condensed history algorithm of MCsquare provides a fast and yet accurate method of simulating heavy charged particles such as protons, deuterons, and alphas inside voxelized geometries. Hard ionizations, with energy losses above a user-specified threshold, are simulated individually while soft events are regrouped in a multiple scattering theory. Elastic and inelastic nuclear interactions are sampled from ICRU 63 differential cross sections, thereby allowing for the computation of prompt gamma emission profiles. MCsquare has been benchmarked with the gate/geant4 Monte Carlo application for homogeneous and heterogeneous geometries. Comparisons with gate/geant4 for various geometries show deviations within 2%-1 mm. In spite of the limited memory bandwidth of the coprocessor, simulation time is below 25 s for 10^7 primary 200 MeV protons in average soft tissues using all Xeon Phi and CPU resources embedded in a single desktop unit. MCsquare exploits the flexibility of CPU architectures to provide a multipurpose MC simulation tool. Optimized code enables the use of accurate MC calculation within a reasonable computation time, adequate for clinical practice. MCsquare also simulates prompt gamma emission and can thus be used also for in vivo range verification.

  11. SU-G-BRA-05: Application of a Feature-Based Tracking Algorithm to KV X-Ray Fluoroscopic Images Toward Marker-Less Real-Time Tumor Tracking

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Nakamura, M; Matsuo, Y; Mukumoto, N

    Purpose: To detect target position on kV X-ray fluoroscopic images using a feature-based tracking algorithm, Accelerated-KAZE (AKAZE), for markerless real-time tumor tracking (RTTT). Methods: Twelve lung cancer patients treated with RTTT on the Vero4DRT (Mitsubishi Heavy Industries, Japan, and Brainlab AG, Feldkirchen, Germany) were enrolled in this study. Respiratory tumor movement was greater than 10 mm. Three to five fiducial markers were implanted around the lung tumor transbronchially for each patient. Before beam delivery, external infrared (IR) markers and the fiducial markers were monitored for 20 to 40 s with the IR camera every 16.7 ms and with an orthogonal kV x-ray imaging subsystem every 80 or 160 ms, respectively. Target positions derived from the fiducial markers were determined on the orthogonal kV x-ray images, which were used as the ground truth in this study. Meanwhile, tracking positions were identified by AKAZE. Among the many feature points, AKAZE found high-quality feature points through sequential cross-check and distance-check between two consecutive images. Then, these 2D positional data were converted to the 3D positional data by a transformation matrix with a predefined calibration parameter. Root mean square error (RMSE) was calculated to evaluate the difference between 3D tracking and target positions. A total of 393 frames was analyzed. The experiment was conducted on a personal computer with 16 GB RAM, Intel Core i7-2600, 3.4 GHz processor. Results: Reproducibility of the target position during the same respiratory phase was 0.6 +/− 0.6 mm (range, 0.1–3.3 mm). Mean +/− SD of the RMSEs was 0.3 +/− 0.2 mm (range, 0.0–1.0 mm). Median computation time per frame was 179 msec (range, 154–247 msec). Conclusion: AKAZE successfully and quickly detected the target position on kV X-ray fluoroscopic images. Initial results indicate that the differences between 3D tracking and target position would be clinically acceptable.
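
The RMSE evaluation described in this abstract can be illustrated with a minimal sketch. The abstract does not say whether the error was computed per axis or as a Euclidean distance, so the Euclidean form below is an assumption, and the function name `rmse_3d` and the sample coordinates are hypothetical.

```python
import math


def rmse_3d(tracked, target):
    """Root mean square error between paired 3D positions (same units, e.g. mm).

    Assumption: the error per frame is the Euclidean distance between the
    tracked position and the reference (target) position.
    """
    if len(tracked) != len(target) or not tracked:
        raise ValueError("need two non-empty sequences of equal length")
    total = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(tracked, target):
        # Squared Euclidean distance for this frame.
        total += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    # Mean over frames, then square root.
    return math.sqrt(total / len(tracked))


# Made-up positions in mm, for illustration only (not data from the study).
tracked = [(0.0, 0.0, 0.0), (1.0, 2.0, 2.0)]
target = [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0)]
error_mm = rmse_3d(tracked, target)
```

With these two illustrative frames, the second frame is off by 2 mm along one axis, so the result is the square root of 2, about 1.41 mm.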

  12. Scalable Adaptive Graphics Environment (SAGE) Software for the Visualization of Large Data Sets on a Video Wall

    NASA Technical Reports Server (NTRS)

    Jedlovec, Gary; Srikishen, Jayanthi; Edwards, Rita; Cross, David; Welch, Jon; Smith, Matt

    2013-01-01

    The use of collaborative scientific visualization systems for the analysis, visualization, and sharing of "big data" available from new high resolution remote sensing satellite sensors or four-dimensional numerical model simulations is propelling the wider adoption of ultra-resolution tiled display walls interconnected by high speed networks. These systems require a globally connected and well-integrated operating environment that provides persistent visualization and collaboration services. This abstract and subsequent presentation describes a new collaborative visualization system installed for NASA's Short-term Prediction Research and Transition (SPoRT) program at Marshall Space Flight Center and its use for Earth science applications. The system consists of a 3 x 4 array of 1920 x 1080 pixel thin bezel video monitors mounted on a wall in a scientific collaboration lab. The monitors are physically and virtually integrated into a 14' x 7' for video display. The display of scientific data on the video wall is controlled by a single Alienware Aurora PC with a 2nd Generation Intel Core 4.1 GHz processor, 32 GB memory, and an AMD Fire Pro W600 video card with 6 mini display port connections. Six mini display-to-dual DVI cables are used to connect the 12 individual video monitors. The open source Scalable Adaptive Graphics Environment (SAGE) windowing and media control framework, running on top of the Ubuntu 12 Linux operating system, allows several users to simultaneously control the display and storage of high resolution still and moving graphics in a variety of formats, on tiled display walls of any size. The Ubuntu operating system supports the open source Scalable Adaptive Graphics Environment (SAGE) software which provides a common environment, or framework, enabling its users to access, display and share a variety of data-intensive information. This information can be digital-cinema animations, high-resolution images, high-definition video-teleconferences, presentation slides, documents, spreadsheets or laptop screens. SAGE is cross-platform, community-driven, open-source visualization and collaboration middleware that utilizes shared national and international cyberinfrastructure for the advancement of scientific research and education.

  13. Scalable Adaptive Graphics Environment (SAGE) Software for the Visualization of Large Data Sets on a Video Wall

    NASA Astrophysics Data System (ADS)

    Jedlovec, G.; Srikishen, J.; Edwards, R.; Cross, D.; Welch, J. D.; Smith, M. R.

    2013-12-01

    The use of collaborative scientific visualization systems for the analysis, visualization, and sharing of 'big data' available from new high resolution remote sensing satellite sensors or four-dimensional numerical model simulations is propelling the wider adoption of ultra-resolution tiled display walls interconnected by high speed networks. These systems require a globally connected and well-integrated operating environment that provides persistent visualization and collaboration services. This abstract and subsequent presentation describes a new collaborative visualization system installed for NASA's Short-term Prediction Research and Transition (SPoRT) program at Marshall Space Flight Center and its use for Earth science applications. The system consists of a 3 x 4 array of 1920 x 1080 pixel thin bezel video monitors mounted on a wall in a scientific collaboration lab. The monitors are physically and virtually integrated into a 14' x 7' for video display. The display of scientific data on the video wall is controlled by a single Alienware Aurora PC with a 2nd Generation Intel Core 4.1 GHz processor, 32 GB memory, and an AMD Fire Pro W600 video card with 6 mini display port connections. Six mini display-to-dual DVI cables are used to connect the 12 individual video monitors. The open source Scalable Adaptive Graphics Environment (SAGE) windowing and media control framework, running on top of the Ubuntu 12 Linux operating system, allows several users to simultaneously control the display and storage of high resolution still and moving graphics in a variety of formats, on tiled display walls of any size. The Ubuntu operating system supports the open source Scalable Adaptive Graphics Environment (SAGE) software which provides a common environment, or framework, enabling its users to access, display and share a variety of data-intensive information. This information can be digital-cinema animations, high-resolution images, high-definition video-teleconferences, presentation slides, documents, spreadsheets or laptop screens. SAGE is cross-platform, community-driven, open-source visualization and collaboration middleware that utilizes shared national and international cyberinfrastructure for the advancement of scientific research and education.

  14. Efficient implementation of a 3-dimensional ADI method on the iPSC/860

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Van der Wijngaart, R.F.

    1993-12-31

    A comparison is made between several domain decomposition strategies for the solution of three-dimensional partial differential equations on a MIMD distributed memory parallel computer. The grids used are structured, and the numerical algorithm is ADI. Important implementation issues regarding load balancing, storage requirements, network latency, and overlap of computations and communications are discussed. Results of the solution of the three-dimensional heat equation on the Intel iPSC/860 are presented for the three most viable methods. It is found that the Bruno-Cappello decomposition delivers optimal computational speed through an almost complete elimination of processor idle time, while providing good memory efficiency.
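
Each ADI sweep reduces a multidimensional implicit step to many independent tridiagonal systems along one grid direction, which is why the choice of domain decomposition matters so much. The sketch below shows only the serial kernel, the Thomas algorithm for one such line solve. It is an illustration under stated assumptions, not the paper's implementation: the function name `solve_tridiagonal`, the coefficient value `r = 0.5`, and the three-point example are all hypothetical.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system with the Thomas algorithm.

    lower[i] multiplies x[i-1] (lower[0] is ignored) and
    upper[i] multiplies x[i+1] (upper[-1] is ignored).
    Assumes the system is diagonally dominant, so no pivoting is needed.
    """
    n = len(diag)
    c = [0.0] * n  # modified upper-diagonal coefficients
    d = [0.0] * n  # modified right-hand side

    # Forward elimination.
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom

    # Back substitution.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


# One implicit line solve of the form -r*u[i-1] + (1 + 2r)*u[i] - r*u[i+1] = rhs[i],
# with r = 0.5 chosen only for illustration.
r = 0.5
n = 3
lower = [-r] * n
diag = [1.0 + 2.0 * r] * n
upper = [-r] * n
rhs = [1.0, 2.0, 5.0]
solution = solve_tridiagonal(lower, diag, upper, rhs)
```

In a distributed-memory setting the forward and backward passes carry a dependency along the sweep direction, so a decomposition that cuts a line across processors must either pipeline the solve or redistribute the data; that trade-off is what the compared strategies address.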

  15. Kalman filter tracking on parallel architectures

    NASA Astrophysics Data System (ADS)

    Cerati, G.; Elmer, P.; Krutelyov, S.; Lantz, S.; Lefebvre, M.; McDermott, K.; Riley, D.; Tadel, M.; Wittich, P.; Wurthwein, F.; Yagil, A.

    2017-10-01

    We report on the progress of our studies towards a Kalman filter track reconstruction algorithm with optimal performance on manycore architectures. The combinatorial structure of these algorithms is not immediately compatible with an efficient SIMD (or SIMT) implementation; the challenge for us is to recast the existing software so it can readily generate hundreds of shared-memory threads that exploit the underlying instruction set of modern processors. We show how the data and associated tasks can be organized in a way that is conducive to both multithreading and vectorization. We demonstrate very good performance on Intel Xeon and Xeon Phi architectures, as well as promising first results on Nvidia GPUs.
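
To make the vectorization point concrete, the sketch below applies a scalar Kalman measurement update to a batch of independent track candidates stored as parallel lists (a structure-of-arrays layout). Every candidate runs the same arithmetic, which is the property a SIMD implementation exploits. This is a simplified illustration and not the authors' code: real track fits use multi-dimensional states and covariance matrices, and the names `kalman_update` and `kalman_update_batch` are hypothetical.

```python
def kalman_update(x, p, z, r):
    """One scalar Kalman measurement update.

    x: prior state estimate, p: prior variance,
    z: measurement, r: measurement noise variance.
    """
    k = p / (p + r)          # Kalman gain
    x_new = x + k * (z - x)  # corrected state
    p_new = (1.0 - k) * p    # reduced variance
    return x_new, p_new


def kalman_update_batch(xs, ps, zs, rs):
    """Apply the same update to many independent candidates.

    The inputs are parallel lists (structure of arrays), so the loop body
    is identical for every element and has no cross-element dependency.
    """
    new_xs, new_ps = [], []
    for x, p, z, r in zip(xs, ps, zs, rs):
        x_new, p_new = kalman_update(x, p, z, r)
        new_xs.append(x_new)
        new_ps.append(p_new)
    return new_xs, new_ps


# Illustrative values only.
states, variances = kalman_update_batch(
    xs=[0.0, 10.0], ps=[1.0, 4.0], zs=[2.0, 10.0], rs=[1.0, 4.0]
)
```

The combinatorial part of track building, where candidates branch and are pruned, is what breaks this regular pattern, which is the difficulty the abstract refers to.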

  16. A low-cost vector processor boosting compute-intensive image processing operations

    NASA Technical Reports Server (NTRS)

    Adorf, Hans-Martin

    1992-01-01

    Low-cost vector processing (VP) is within reach of everyone seriously engaged in scientific computing. The advent of affordable add-on VP-boards for standard workstations complemented by mathematical/statistical libraries is beginning to impact compute-intensive tasks such as image processing. A case in point is the restoration of distorted images from the Hubble Space Telescope. A low-cost implementation is presented of the standard Tarasko-Richardson-Lucy restoration algorithm on an Intel i860-based VP-board which is seamlessly interfaced to a commercial, interactive image processing system. First experience is reported (including some benchmarks for standalone FFTs) and some conclusions are drawn.

  17. A parallel algorithm for switch-level timing simulation on a hypercube multiprocessor

    NASA Technical Reports Server (NTRS)

    Rao, Hariprasad Nannapaneni

    1989-01-01

    The parallel approach to speeding up simulation is studied, specifically the simulation of digital LSI MOS circuitry on the Intel iPSC/2 hypercube. The simulation algorithm is based on RSIM, an event driven switch-level simulator that incorporates a linear transistor model for simulating digital MOS circuits. Parallel processing techniques based on the concepts of Virtual Time and rollback are utilized so that portions of the circuit may be simulated on separate processors, in parallel for as large an increase in speed as possible. A partitioning algorithm is also developed in order to subdivide the circuit for parallel processing.

  18. I.T.S. camera deployment and systems integration P.I.N. 4ITV.09.121 : evaluation

    DOT National Transportation Integrated Search

    2006-08-01

    The Monroe County Department of Transportation (MCDOT) has built upon the success of its first computerized traffic signal system. The system has been operational since 1985. In the late 1990s the Department initiated a project to integrate Intell...

  19. Accelerated Monte Carlo Simulation on the Chemical Stage in Water Radiolysis using GPU

    PubMed Central

    Tian, Zhen; Jiang, Steve B.; Jia, Xun

    2018-01-01

    The accurate simulation of water radiolysis is an important step to understand the mechanisms of radiobiology and quantitatively test some hypotheses regarding radiobiological effects. However, the simulation of water radiolysis is highly time consuming, taking hours or even days to be completed by a conventional CPU processor. This time limitation hinders cell-level simulations for a number of research studies. We recently initiated efforts to develop gMicroMC, a GPU-based fast microscopic MC simulation package for water radiolysis. The first step of this project focused on accelerating the simulation of the chemical stage, the most time consuming stage in the entire water radiolysis process. A GPU-friendly parallelization strategy was designed to address the highly correlated many-body simulation problem caused by the mutual competitive chemical reactions between the radiolytic molecules. Two cases were tested, using a 750 keV electron and a 5 MeV proton incident in pure water, respectively. The time-dependent yields of all the radiolytic species during the chemical stage were used to evaluate the accuracy of the simulation. The relative differences between our simulation and the Geant4-DNA simulation were on average 5.3% and 4.4% for the two cases. Our package, executed on an Nvidia Titan black GPU card, successfully completed the chemical stage simulation of the two cases within 599.2 s and 489.0 s. As compared with Geant4-DNA that was executed on an Intel i7-5500U CPU processor and needed 28.6 h and 26.8 h for the two cases using a single CPU core, our package achieved a speed-up factor of 171.1-197.2. PMID:28323637
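
The reported speed-up can be checked as a ratio of run times once both are in the same unit. The sketch below recomputes it from the rounded timings quoted in the abstract; the function name `speedup` is hypothetical, and pairing the first CPU time with the first GPU time follows the abstract's "respectively" ordering, which is an assumption.

```python
def speedup(cpu_hours, gpu_seconds):
    """Ratio of CPU run time to GPU run time, with the CPU time converted to seconds."""
    return cpu_hours * 3600.0 / gpu_seconds


# Timings as quoted in the abstract (rounded values).
case_1 = speedup(28.6, 599.2)  # 750 keV electron case
case_2 = speedup(26.8, 489.0)  # 5 MeV proton case
```

The quotients come out at roughly 172 and 197, close to but not exactly the reported 171.1-197.2, presumably because the published timings are rounded.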

  20. Accelerated Monte Carlo simulation on the chemical stage in water radiolysis using GPU

    NASA Astrophysics Data System (ADS)

    Tian, Zhen; Jiang, Steve B.; Jia, Xun

    2017-04-01

    The accurate simulation of water radiolysis is an important step to understand the mechanisms of radiobiology and quantitatively test some hypotheses regarding radiobiological effects. However, the simulation of water radiolysis is highly time consuming, taking hours or even days to be completed by a conventional CPU processor. This time limitation hinders cell-level simulations for a number of research studies. We recently initiated efforts to develop gMicroMC, a GPU-based fast microscopic MC simulation package for water radiolysis. The first step of this project focused on accelerating the simulation of the chemical stage, the most time consuming stage in the entire water radiolysis process. A GPU-friendly parallelization strategy was designed to address the highly correlated many-body simulation problem caused by the mutual competitive chemical reactions between the radiolytic molecules. Two cases were tested, using a 750 keV electron and a 5 MeV proton incident in pure water, respectively. The time-dependent yields of all the radiolytic species during the chemical stage were used to evaluate the accuracy of the simulation. The relative differences between our simulation and the Geant4-DNA simulation were on average 5.3% and 4.4% for the two cases. Our package, executed on an Nvidia Titan black GPU card, successfully completed the chemical stage simulation of the two cases within 599.2 s and 489.0 s. As compared with Geant4-DNA that was executed on an Intel i7-5500U CPU processor and needed 28.6 h and 26.8 h for the two cases using a single CPU core, our package achieved a speed-up factor of 171.1-197.2.

  1. Accelerated Monte Carlo simulation on the chemical stage in water radiolysis using GPU.

    PubMed

    Tian, Zhen; Jiang, Steve B; Jia, Xun

    2017-04-21

    The accurate simulation of water radiolysis is an important step to understand the mechanisms of radiobiology and quantitatively test some hypotheses regarding radiobiological effects. However, the simulation of water radiolysis is highly time consuming, taking hours or even days to be completed by a conventional CPU processor. This time limitation hinders cell-level simulations for a number of research studies. We recently initiated efforts to develop gMicroMC, a GPU-based fast microscopic MC simulation package for water radiolysis. The first step of this project focused on accelerating the simulation of the chemical stage, the most time consuming stage in the entire water radiolysis process. A GPU-friendly parallelization strategy was designed to address the highly correlated many-body simulation problem caused by the mutual competitive chemical reactions between the radiolytic molecules. Two cases were tested, using a 750 keV electron and a 5 MeV proton incident in pure water, respectively. The time-dependent yields of all the radiolytic species during the chemical stage were used to evaluate the accuracy of the simulation. The relative differences between our simulation and the Geant4-DNA simulation were on average 5.3% and 4.4% for the two cases. Our package, executed on an Nvidia Titan black GPU card, successfully completed the chemical stage simulation of the two cases within 599.2 s and 489.0 s. As compared with Geant4-DNA that was executed on an Intel i7-5500U CPU processor and needed 28.6 h and 26.8 h for the two cases using a single CPU core, our package achieved a speed-up factor of 171.1-197.2.

  2. Task Parallel Incomplete Cholesky Factorization using 2D Partitioned-Block Layout

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kim, Kyungjoo; Rajamanickam, Sivasankaran; Stelle, George Widgery

    We introduce a task-parallel algorithm for sparse incomplete Cholesky factorization that utilizes a 2D sparse partitioned-block layout of a matrix. Our factorization algorithm follows the idea of algorithms-by-blocks by using the block layout. The algorithm-by-blocks approach induces a task graph for the factorization. These tasks are inter-related to each other through their data dependences in the factorization algorithm. To process the tasks on various manycore architectures in a portable manner, we also present a portable tasking API that incorporates different tasking backends and device-specific features using an open-source framework for manycore platforms, i.e., Kokkos. A performance evaluation is presented on both Intel Sandybridge and Xeon Phi platforms for matrices from the University of Florida sparse matrix collection to illustrate merits of the proposed task-based factorization. Experimental results demonstrate that our task-parallel implementation delivers about 26.6x speedup (geometric mean) over single-threaded incomplete Cholesky-by-blocks and 19.2x speedup over serial Cholesky performance, which does not carry tasking overhead, using 56 threads on the Intel Xeon Phi processor for sparse matrices arising from various application problems.

  3. Particle-in-Cell laser-plasma simulation on Xeon Phi coprocessors

    NASA Astrophysics Data System (ADS)

    Surmin, I. A.; Bastrakov, S. I.; Efimenko, E. S.; Gonoskov, A. A.; Korzhimanov, A. V.; Meyerov, I. B.

    2016-05-01

    This paper concerns the development of a high-performance implementation of the Particle-in-Cell method for plasma simulation on Intel Xeon Phi coprocessors. We discuss the suitability of the method for Xeon Phi architecture and present our experience in the porting and optimization of the existing parallel Particle-in-Cell code PICADOR. Direct porting without code modification gives performance on Xeon Phi close to that of an 8-core CPU on a benchmark problem with 50 particles per cell. We demonstrate step-by-step optimization techniques, such as improving data locality, enhancing parallelization efficiency and vectorization leading to an overall 4.2 × speedup on CPU and 7.5 × on Xeon Phi compared to the baseline version. The optimized version achieves 16.9 ns per particle update on an Intel Xeon E5-2660 CPU and 9.3 ns per particle update on an Intel Xeon Phi 5110P. For a real problem of laser ion acceleration in targets with surface grating, where a large number of macroparticles per cell is required, the speedup of Xeon Phi compared to CPU is 1.6 ×.

  4. Active Acoustics using Bellhop-DRDC: Run Time Tests and Suggested Configurations for a Tracking Exercise in Shallow Scotian Waters

    DTIC Science & Technology

    2005-05-01

    simulée d’essai pour obtenir les diagrammes de perte de transmission et de réverbération pour 18 éléments (une source, un réseau remorqué et 16 bouées...were recorded using a 1.5GHz Pentium 4 processor. The test results indicate that the Bellhop program runs fast enough to provide the required acoustic...was determined that the Bellhop program will be fast enough for these clients. Future Plans It is intended to integrate further enhancements that

  5. EVN e-VLBI detections of MAXI J1659-152

    NASA Astrophysics Data System (ADS)

    Paragi, Z.; van der Horst, A. J.; Granot, J.; Taylor, G. B.; Kouveliotou, C.; Garrett, M. A.; Wijers, R. A. M. J.; Ramirez-Ruiz, E.; Kuulkers, E.; Gehrels, N.; Woods, P. M.

    2010-10-01

    We observed MAXI J1659-152 (Negoro et al. 2010, ATel #2873; Mangano et al. 2010, GCN #11296) following its sub-millimeter and centimeter radio detections (de Ugarte Postigo et al. 2010, GCN #11304; van der Horst et al. 2010, ATel #2874) with the European VLBI Network (EVN) in real-time e-VLBI mode on 30 September 2010, from 13:30 to 18:30 UT at 4.9 GHz. The participating telescopes were Cambridge, Effelsberg, Jodrell Bank (MkII), Hartebeesthoek, Medicina, Onsala, Torun and Westerbork sending data at a rate of ~1024 Mbps to the EVN Data Processor at JIVE.

  6. egs_brachy: a versatile and fast Monte Carlo code for brachytherapy

    NASA Astrophysics Data System (ADS)

    Chamberland, Marc J. P.; Taylor, Randle E. P.; Rogers, D. W. O.; Thomson, Rowan M.

    2016-12-01

    egs_brachy is a versatile and fast Monte Carlo (MC) code for brachytherapy applications. It is based on the EGSnrc code system, enabling simulation of photons and electrons. Complex geometries are modelled using the EGSnrc C++ class library and egs_brachy includes a library of geometry models for many brachytherapy sources, in addition to eye plaques and applicators. Several simulation efficiency enhancing features are implemented in the code. egs_brachy is benchmarked by comparing TG-43 source parameters of three source models to previously published values. 3D dose distributions calculated with egs_brachy are also compared to ones obtained with the BrachyDose code. Well-defined simulations are used to characterize the effectiveness of many efficiency improving techniques, both as an indication of the usefulness of each technique and to find optimal strategies. Efficiencies and calculation times are characterized through single source simulations and simulations of idealized and typical treatments using various efficiency improving techniques. In general, egs_brachy shows agreement within uncertainties with previously published TG-43 source parameter values. 3D dose distributions from egs_brachy and BrachyDose agree at the sub-percent level. Efficiencies vary with radionuclide and source type, number of sources, phantom media, and voxel size. The combined effects of efficiency-improving techniques in egs_brachy lead to short calculation times: simulations approximating prostate and breast permanent implant (both with (2 mm)^3 voxels) and eye plaque (with (1 mm)^3 voxels) treatments take between 13 and 39 s, on a single 2.5 GHz Intel Xeon E5-2680 v3 processor core, to achieve 2% average statistical uncertainty on doses within the PTV. egs_brachy will be released as free and open source software to the research community.

  7. Development of an adjoint sensitivity field-based treatment-planning technique for the use of newly designed directional LDR sources in brachytherapy.

    PubMed

    Chaswal, V; Thomadsen, B R; Henderson, D L

    2012-02-21

    The development and application of an automated 3D greedy heuristic (GH) optimization algorithm utilizing the adjoint sensitivity fields for treatment planning to assess the advantage of directional interstitial prostate brachytherapy is presented. Directional and isotropic dose kernels generated using Monte Carlo simulations based on Best Industries model 2301 I-125 source are utilized for treatment planning. The newly developed GH algorithm is employed for optimization of the treatment plans for seven interstitial prostate brachytherapy cases using mixed sources (directional brachytherapy) and using only isotropic sources (conventional brachytherapy). All treatment plans resulted in V100 > 98% and D90 > 45 Gy for the target prostate region. For the urethra region, the D10(Ur), D90(Ur) and V150(Ur) and for the rectum region the V100cc, D2cc, D90(Re) and V90(Re) all are reduced significantly when mixed sources brachytherapy is used employing directional sources. The simulations demonstrated that the use of directional sources in the low dose-rate (LDR) brachytherapy of the prostate clearly benefits in sparing the urethra and the rectum sensitive structures from overdose. The time taken for a conventional treatment plan is less than three seconds, while the time taken for a mixed source treatment plan is less than nine seconds, as tested on an Intel Core2 Duo 2.2 GHz processor with 1GB RAM. The new 3D GH algorithm is successful in generating a feasible LDR brachytherapy treatment planning solution with an extra degree of freedom, i.e. directionality in very little time.

  8. Development of an adjoint sensitivity field-based treatment-planning technique for the use of newly designed directional LDR sources in brachytherapy

    NASA Astrophysics Data System (ADS)

    Chaswal, V.; Thomadsen, B. R.; Henderson, D. L.

    2012-02-01

    The development and application of an automated 3D greedy heuristic (GH) optimization algorithm utilizing the adjoint sensitivity fields for treatment planning to assess the advantage of directional interstitial prostate brachytherapy is presented. Directional and isotropic dose kernels generated using Monte Carlo simulations based on Best Industries model 2301 I-125 source are utilized for treatment planning. The newly developed GH algorithm is employed for optimization of the treatment plans for seven interstitial prostate brachytherapy cases using mixed sources (directional brachytherapy) and using only isotropic sources (conventional brachytherapy). All treatment plans resulted in V100 > 98% and D90 > 45 Gy for the target prostate region. For the urethra region, the D10Ur, D90Ur and V150Ur and for the rectum region the V100cc, D2cc, D90Re and V90Re all are reduced significantly when mixed sources brachytherapy is used employing directional sources. The simulations demonstrated that the use of directional sources in the low dose-rate (LDR) brachytherapy of the prostate clearly benefits in sparing the urethra and the rectum sensitive structures from overdose. The time taken for a conventional treatment plan is less than three seconds, while the time taken for a mixed source treatment plan is less than nine seconds, as tested on an Intel Core2 Duo 2.2 GHz processor with 1GB RAM. The new 3D GH algorithm is successful in generating a feasible LDR brachytherapy treatment planning solution with an extra degree of freedom, i.e. directionality in very little time.

  9. GPU-Accelerated Voxelwise Hepatic Perfusion Quantification

    PubMed Central

    Wang, H; Cao, Y

    2012-01-01

    Voxelwise quantification of hepatic perfusion parameters from dynamic contrast enhanced (DCE) imaging greatly contributes to assessment of liver function in response to radiation therapy. However, the efficiency of the estimation of hepatic perfusion parameters voxel-by-voxel in the whole liver using a dual-input single-compartment model requires substantial improvement for routine clinical applications. In this paper, we utilize the parallel computation power of a graphics processing unit (GPU) to accelerate the computation, while maintaining the same accuracy as the conventional method. Using CUDA-GPU, the hepatic perfusion computations over multiple voxels are run across the GPU blocks concurrently but independently. At each voxel, non-linear least squares fitting of the time series of the liver DCE data to the compartmental model is distributed to multiple threads in a block, and the computations of different time points are performed simultaneously and synchronously. An efficient fast Fourier transform in a block is also developed for the convolution computation in the model. The GPU computations of the voxel-by-voxel hepatic perfusion images are compared with ones by the CPU using the simulated DCE data and the experimental DCE MR images from patients. The computation speed is improved by 30 times using an NVIDIA Tesla C2050 GPU compared to a 2.67 GHz Intel Xeon CPU processor. To obtain liver perfusion maps with 626400 voxels in a patient’s liver, it takes 0.9 min with the GPU-accelerated voxelwise computation, compared to 110 min with the CPU, while both methods result in perfusion parameter differences of less than 10^−6. The method will be useful for generating liver perfusion images in clinical settings. PMID:22892645

  10. egs_brachy: a versatile and fast Monte Carlo code for brachytherapy.

    PubMed

    Chamberland, Marc J P; Taylor, Randle E P; Rogers, D W O; Thomson, Rowan M

    2016-12-07

    egs_brachy is a versatile and fast Monte Carlo (MC) code for brachytherapy applications. It is based on the EGSnrc code system, enabling simulation of photons and electrons. Complex geometries are modelled using the EGSnrc C++ class library and egs_brachy includes a library of geometry models for many brachytherapy sources, in addition to eye plaques and applicators. Several simulation efficiency enhancing features are implemented in the code. egs_brachy is benchmarked by comparing TG-43 source parameters of three source models to previously published values. 3D dose distributions calculated with egs_brachy are also compared to ones obtained with the BrachyDose code. Well-defined simulations are used to characterize the effectiveness of many efficiency improving techniques, both as an indication of the usefulness of each technique and to find optimal strategies. Efficiencies and calculation times are characterized through single source simulations and simulations of idealized and typical treatments using various efficiency improving techniques. In general, egs_brachy shows agreement within uncertainties with previously published TG-43 source parameter values. 3D dose distributions from egs_brachy and BrachyDose agree at the sub-percent level. Efficiencies vary with radionuclide and source type, number of sources, phantom media, and voxel size. The combined effects of efficiency-improving techniques in egs_brachy lead to short calculation times: simulations approximating prostate and breast permanent implant (both with (2 mm)^3 voxels) and eye plaque (with (1 mm)^3 voxels) treatments take between 13 and 39 s, on a single 2.5 GHz Intel Xeon E5-2680 v3 processor core, to achieve 2% average statistical uncertainty on doses within the PTV. egs_brachy will be released as free and open source software to the research community.

  11. SeqBox: RNAseq/ChIPseq reproducible analysis on a consumer game computer.

    PubMed

    Beccuti, Marco; Cordero, Francesca; Arigoni, Maddalena; Panero, Riccardo; Amparore, Elvio G; Donatelli, Susanna; Calogero, Raffaele A

    2018-03-01

    Short reads sequencing technology has been used for more than a decade now. However, the analysis of RNAseq and ChIPseq data is still computationally demanding and simple access to raw data does not guarantee reproducibility of results between laboratories. To address these two aspects, we developed SeqBox, a cheap, efficient and reproducible RNAseq/ChIPseq hardware/software solution based on the NUC6I7KYK mini-PC (an Intel consumer game computer with a fast processor and a high performance SSD disk) and the Docker container platform. In SeqBox the analysis of RNAseq and ChIPseq data is supported by a friendly GUI. This gives scientists with or without scripting experience access to fast and reproducible analysis. Docker container images, docker4seq package and the GUI are available at http://www.bioinformatica.unito.it/reproducibile.bioinformatics.html. Contact: beccuti@di.unito.it. Supplementary data are available at Bioinformatics online. © The Author(s) 2017. Published by Oxford University Press.

  12. A network control concept for the 30/20 GHz communication system baseband processor

    NASA Technical Reports Server (NTRS)

    Sabourin, D. J.; Hay, R. E.

    1982-01-01

    The architecture and system design for a satellite-switched TDMA communication system employing on-board processing was developed by Motorola for NASA's Lewis Research Center. The system design is based on distributed processing techniques that provide extreme flexibility in the selection of a network control protocol without impacting the satellite or ground terminal hardware. A network control concept that includes system synchronization and allows burst synchronization to occur within the system operational requirement is described. This concept integrates the tracking and control links with the communication links via the baseband processor, resulting in an autonomous system operational approach.

  13. RASSP signal processing architectures

    NASA Astrophysics Data System (ADS)

    Shirley, Fred; Bassett, Bob; Letellier, J. P.

    1995-06-01

    The rapid prototyping of application specific signal processors (RASSP) program is an ARPA/tri-service effort to dramatically improve the process by which complex digital systems, particularly embedded signal processors, are specified, designed, documented, manufactured, and supported. The domain of embedded signal processing was chosen because it is important to a variety of military and commercial applications as well as for the challenge it presents in terms of complexity and performance demands. The principal effort is being performed by two major contractors, Lockheed Sanders (Nashua, NH) and Martin Marietta (Camden, NJ). For both, improvements in methodology are to be exercised and refined through the performance of individual 'Demonstration' efforts. The Lockheed Sanders Demonstration effort is to develop an infrared search and track (IRST) processor. In addition, both contractors' results are being measured by a series of externally administered (by Lincoln Labs) six-month Benchmark programs that measure process improvement as a function of time. The first two Benchmark programs are designing and implementing a synthetic aperture radar (SAR) processor. Our demonstration team is using commercially available VME modules from Mercury Computer to assemble a multiprocessor system scalable from one to hundreds of Intel i860 microprocessors. Custom modules for the sensor interface and display driver are also being developed. This system implements either proprietary or Navy owned algorithms to perform the compute-intensive IRST function in real time in an avionics environment. Our Benchmark team is designing custom modules using commercially available processor chip sets, communication submodules, and reconfigurable logic devices. One of the modules contains multiple vector processors optimized for fast Fourier transform processing. Another module is a fiberoptic interface that accepts high-rate input data from the sensors and provides video-rate output data to a display. This paper discusses the impact of simulation on choosing signal processing algorithms and architectures, drawing from the experiences of the Demonstration and Benchmark inter-company teams at Lockheed Sanders, Motorola, Hughes, and ISX.

  14. A new nonlinear conjugate gradient coefficient under strong Wolfe-Powell line search

    NASA Astrophysics Data System (ADS)

    Mohamed, Nur Syarafina; Mamat, Mustafa; Rivaie, Mohd

    2017-08-01

    The nonlinear conjugate gradient (CG) method plays an important role in solving large-scale unconstrained optimization problems. The method is widely used due to its simplicity, and it is known to possess the sufficient descent condition and global convergence properties. In this paper, a new nonlinear CG coefficient β_k is presented by employing the strong Wolfe-Powell inexact line search. The performance of the new β_k is tested based on the number of iterations and central processing unit (CPU) time, using MATLAB software on an Intel Core i7-3470 CPU processor. Numerical experimental results show that the new β_k converges rapidly compared to other classical CG methods.
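
    As an illustration of the strong Wolfe-Powell inexact line search named in the abstract above, the following is a minimal sketch that only checks whether a trial step length satisfies the two standard conditions (sufficient decrease and strong curvature). The function names, the test function and the constants c1 = 1e-4 and c2 = 0.1 are assumptions chosen for illustration; the abstract does not give the new β_k formula or the parameters used, so neither is reproduced here.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def satisfies_strong_wolfe(f, grad, x, d, alpha, c1=1e-4, c2=0.1):
    """Return True if step length alpha along direction d satisfies the
    strong Wolfe-Powell conditions with 0 < c1 < c2 < 1 (values assumed)."""
    g0_d = dot(grad(x), d)  # directional derivative at x
    x_new = [xi + alpha * di for xi, di in zip(x, d)]
    # Sufficient decrease (Armijo) condition.
    sufficient_decrease = f(x_new) <= f(x) + c1 * alpha * g0_d
    # Strong curvature condition on the new directional derivative.
    curvature = abs(dot(grad(x_new), d)) <= c2 * abs(g0_d)
    return sufficient_decrease and curvature


# Hypothetical one-dimensional test problem: f(x) = x^2.
def quad(x):
    return x[0] ** 2


def quad_grad(x):
    return [2.0 * x[0]]


if __name__ == "__main__":
    x0, d0 = [1.0], [-2.0]  # steepest-descent direction at x0
    for alpha in (0.01, 0.5, 1.0):
        print(alpha, satisfies_strong_wolfe(quad, quad_grad, x0, d0, alpha))
```

    In this toy case the step 0.5 lands exactly on the minimizer and is accepted, while 0.01 fails the curvature condition and 1.0 fails the sufficient decrease condition.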

  15. A microprocessor based anti-aliasing filter for a PCM system

    NASA Technical Reports Server (NTRS)

    Morrow, D. C.; Sandlin, D. R.

    1984-01-01

    Described is the design and evaluation of a microprocessor based digital filter. The filter was made to investigate the feasibility of a digital replacement for the analog pre-sampling filters used in telemetry systems at the NASA Ames-Dryden Flight Research Facility (DFRF). The digital filter will utilize an Intel 2920 Analog Signal Processor (ASP) chip. Testing includes measurements of: (1) the filter frequency response and, (2) the filter signal resolution. The evaluation of the digital filter was made on the basis of circuit size, projected environmental stability and filter resolution. The 2920 based digital filter was found to meet or exceed the pre-sampling filter specifications for limited signal resolution applications.

  16. Noiseless coding for the magnetometer

    NASA Technical Reports Server (NTRS)

    Rice, Robert F.; Lee, Jun-Ji

    1987-01-01

    Future unmanned space missions will continue to seek a full understanding of magnetic fields throughout the solar system. Severely constrained data rates during certain portions of these missions could limit the possible science return. This publication investigates the application of universal noiseless coding techniques to more efficiently represent magnetometer data without any loss in data integrity. Performance results indicated that compression factors of 2:1 to 6:1 can be expected. Feasibility for general deep space application was demonstrated by implementing a microprocessor breadboard coder/decoder using the Intel 8086 processor. The Comet Rendezvous Asteroid Flyby mission will incorporate these techniques in a buffer feedback, rate-controlled configuration. The characteristics of this system are discussed.

  17. Electromagnetic Physics Models for Parallel Computing Architectures

    NASA Astrophysics Data System (ADS)

    Amadio, G.; Ananya, A.; Apostolakis, J.; Aurora, A.; Bandieramonte, M.; Bhattacharyya, A.; Bianchini, C.; Brun, R.; Canal, P.; Carminati, F.; Duhem, L.; Elvira, D.; Gheata, A.; Gheata, M.; Goulas, I.; Iope, R.; Jun, S. Y.; Lima, G.; Mohanty, A.; Nikitina, T.; Novak, M.; Pokorski, W.; Ribon, A.; Seghal, R.; Shadura, O.; Vallecorsa, S.; Wenzel, S.; Zhang, Y.

    2016-10-01

    The recent emergence of hardware architectures characterized by many-core or accelerated processors has opened new opportunities for concurrent programming models taking advantage of both SIMD and SIMT architectures. GeantV, a next generation detector simulation, has been designed to exploit both the vector capability of mainstream CPUs and multi-threading capabilities of coprocessors including NVidia GPUs and Intel Xeon Phi. The characteristics of these architectures are very different in terms of the vectorization depth and type of parallelization needed to achieve optimal performance. In this paper we describe implementation of electromagnetic physics models developed for parallel computing architectures as a part of the GeantV project. Results of preliminary performance evaluation and physics validation are presented as well.

  18. Comparative study of FDMA, TDMA and hybrid 30/20 GHz satellite communications systems for small users

    NASA Technical Reports Server (NTRS)

    Berk, G.; Jean, P. N.; Rotholz, E.

    1982-01-01

    This study compares several satellite uplink and downlink accessing schemes for a Customer Premises Service. Four conceptual system designs are presented: Satellite-Routed FDMA, Frequency-Routed TDMA, Satellite-Switched TDMA, and Processor-Routed TDMA, operating in the 30/20 GHz band. The designs are compared on the basis of estimated satellite weight, power consumption, and cost. The system capacities are analyzed for a fixed multibeam coverage of CONUS. Analysis shows that the system capacity is limited by the available satellite resources and by the terminal size and cost.

  19. High performance ultrasonic field simulation on complex geometries

    NASA Astrophysics Data System (ADS)

    Chouh, H.; Rougeron, G.; Chatillon, S.; Iehl, J. C.; Farrugia, J. P.; Ostromoukhov, V.

    2016-02-01

    Ultrasonic field simulation is a key ingredient for the design of new testing methods as well as a crucial step for NDT inspection simulation. As presented in a previous paper [1], CEA-LIST has worked on the acceleration of these simulations focusing on simple geometries (planar interfaces, isotropic materials). In this context, significant accelerations were achieved on multicore processors and GPUs (Graphics Processing Units), bringing the execution time of realistic computations into the 0.1 s range. In this paper, we present recent work that aims at similar performance on a wider range of configurations. We adapted the physical model used by the CIVA platform to design and implement a new algorithm providing a fast ultrasonic field simulation that yields nearly interactive results for complex cases. The improvements over the CIVA pencil-tracing method include adaptive strategies for pencil subdivisions to achieve a good refinement of the sensor geometry while keeping a reasonable number of ray-tracing operations. Also, interpolation of the times of flight was used to avoid time-consuming computations in the impulse response reconstruction stage. To achieve the best performance, our algorithm runs on multi-core superscalar CPUs and uses high performance specialized libraries such as Intel Embree for ray-tracing, Intel MKL for signal processing and Intel TBB for parallelization. We validated the simulation results by comparing them to the ones produced by CIVA on identical test configurations including mono-element and multiple-element transducers, homogeneous, meshed 3D CAD specimens, isotropic and anisotropic materials and wave paths that can involve several interactions with interfaces. We show performance results on complete simulations that achieve computation times in the 1 s range.

  20. Performance tuning of N-body codes on modern microprocessors: I. Direct integration with a hermite scheme on x86_64 architecture

    NASA Astrophysics Data System (ADS)

    Nitadori, Keigo; Makino, Junichiro; Hut, Piet

    2006-12-01

    The main performance bottleneck of gravitational N-body codes is the force calculation between two particles. We have succeeded in speeding up this pair-wise force calculation by factors between 2 and 10, depending on the code and the processor on which the code is run. These speed-ups were obtained by writing highly fine-tuned code for x86_64 microprocessors. Any existing N-body code, running on these chips, can easily incorporate our assembly code programs. In the current paper, we present an outline of our overall approach, which we illustrate with one specific example: the use of a Hermite scheme for a direct N^2 type integration on a single 2.0 GHz Athlon 64 processor, for which we obtain an effective performance of 4.05 Gflops, for double-precision accuracy. In subsequent papers, we will discuss other variations, including the combinations of N log N codes, single-precision implementations, and performance on other microprocessors.
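
    To make the pair-wise force bottleneck concrete, here is a minimal, unoptimized sketch of a direct N^2 acceleration loop in plain Python. It is not the authors' tuned x86_64 assembly and it omits the jerk term that a Hermite scheme also needs; units with G = 1, the softening parameter eps and the function name are assumptions for illustration.

```python
import math


def accelerations(pos, mass, eps=0.0):
    """Direct-summation gravitational accelerations (G = 1, assumed units).

    pos  : list of [x, y, z] positions
    mass : list of particle masses
    eps  : optional softening length (0.0 gives pure Newtonian gravity)
    The double loop visits every ordered pair, hence O(N^2) work.
    """
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dx = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = dx[0] ** 2 + dx[1] ** 2 + dx[2] ** 2 + eps * eps
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            for k in range(3):
                acc[i][k] += mass[j] * dx[k] * inv_r3
    return acc


if __name__ == "__main__":
    # Two unit masses one length unit apart attract each other equally.
    print(accelerations([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]], [1.0, 1.0]))
```

    The inner loop body is the part that such tuning efforts target, since it runs N(N-1) times per force evaluation.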

  1. Modeling & Analysis of Multicore Architectures for Embedded SIGINT Applications

    DTIC Science & Technology

    2015-03-01

    NVIDIA Kepler K20 [7][8] 2496e 706 225 3520 15.6 Intel Xeon Phi 5110P [9] 60 1050 225 1010 4.5 Adapteva Epiphany [10] 16 – 4K 800 0.270 19 70.4...Cortex A15 and a Kepler GPU with 192 “CUDA” cores, and is more comparable as an HPEEC platform than Tesla series GPUs, such as the NVIDIA C2075 and K20

  2. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Trędak, Przemysław, E-mail: przemyslaw.tredak@fuw.edu.pl; Rudnicki, Witold R.; Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, ul. Pawińskiego 5a, 02-106 Warsaw

    The second generation Reactive Bond Order (REBO) empirical potential is commonly used to accurately model a wide range of hydrocarbon materials. It is also extensible to other atom types and interactions. The REBO potential assumes a complex multi-body interaction model that is difficult to represent efficiently in the SIMD or SIMT programming model. Hence, despite its importance, no efficient GPGPU implementation has been developed for this potential. Here we present a detailed description of a highly efficient GPGPU implementation of a molecular dynamics algorithm using the REBO potential. The presented algorithm takes advantage of rarely used properties of the SIMT architecture of a modern GPU to solve difficult synchronization issues that arise in computations of multi-body potentials. Techniques developed for this problem may also be used to achieve efficient solutions of different problems. The performance of the proposed algorithm is assessed using a range of model systems. It is compared to a highly optimized CPU implementation (both single core and OpenMP) available in the LAMMPS package. These experiments show up to 6x improvement in forces computation time using a single processor of the NVIDIA Tesla K80 compared to a high end 16-core Intel Xeon processor.

  3. Parallel halftoning technique using dot diffusion optimization

    NASA Astrophysics Data System (ADS)

    Molina-Garcia, Javier; Ponomaryov, Volodymyr I.; Reyes-Reyes, Rogelio; Cruz-Ramos, Clara

    2017-05-01

    In this paper, a novel approach to halftoning is proposed and implemented for images obtained by the Dot Diffusion (DD) method. The designed technique is based on an optimization of the so-called class matrix used in the DD algorithm; it consists of generating new versions of the class matrix that have no barons or near-barons, in order to minimize inconsistencies during the distribution of the error. The proposed class matrices have different properties, and each is designed for one of two applications: those where inverse halftoning is necessary, and those where it is not required. The proposed method has been implemented on a GPU (NVIDIA GeForce GTX 750 Ti) and on multicore processors (AMD FX(tm)-6300 Six-Core Processor and Intel Core i5-4200U), using CUDA and OpenCV on a PC running Linux. Experimental results have shown that the novel framework generates good-quality halftone images and inverse halftone images. The simulation results using parallel architectures have demonstrated the efficiency of the novel technique when it is implemented in real-time processing.

  4. Development of seismic tomography software for hybrid supercomputers

    NASA Astrophysics Data System (ADS)

    Nikitin, Alexandr; Serdyukov, Alexandr; Duchkov, Anton

    2015-04-01

    Seismic tomography is a technique used for computing a velocity model of a geologic structure from first arrival travel times of seismic waves. The technique is used in processing of regional and global seismic data, in seismic exploration for prospecting and exploration of mineral and hydrocarbon deposits, and in seismic engineering for monitoring the condition of engineering structures and the surrounding host medium. As a consequence of the development of seismic monitoring systems and the increasing volume of seismic data, there is a growing need for new, more effective computational algorithms for use in seismic tomography applications with improved performance, accuracy and resolution. To achieve this goal, it is necessary to use modern high performance computing systems, such as supercomputers with hybrid architecture that use not only CPUs, but also accelerators and co-processors for computation. The goal of this research is the development of parallel seismic tomography algorithms and a software package for such systems, to be used in processing of large volumes of seismic data (hundreds of gigabytes and more). These algorithms and the software package will be optimized for the most common computing devices used in modern hybrid supercomputers, such as Intel Xeon CPUs, NVIDIA Tesla accelerators and Intel Xeon Phi co-processors. In this work, the following general scheme of seismic tomography is utilized. Using the eikonal equation solver, arrival times of seismic waves are computed based on an assumed velocity model of the geologic structure being analyzed. In order to solve the linearized inverse problem, a tomographic matrix is computed that connects model adjustments with travel time residuals, and the resulting system of linear equations is regularized and solved to adjust the model. The effectiveness of parallel implementations of existing algorithms on target architectures is considered. During the first stage of this work, algorithms were developed for execution on supercomputers using multicore CPUs only, with preliminary performance tests showing good parallel efficiency on large numerical grids. Porting of the algorithms to hybrid supercomputers is currently ongoing.

  5. Using Intel Xeon Phi to accelerate the WRF TEMF planetary boundary layer scheme

    NASA Astrophysics Data System (ADS)

    Mielikainen, Jarno; Huang, Bormin; Huang, Allen

    2014-05-01

    The Weather Research and Forecasting (WRF) model is designed for numerical weather prediction and atmospheric research. The WRF software infrastructure consists of several components such as dynamic solvers and physics schemes. Numerical models are used to resolve the large-scale flow. However, subgrid-scale parameterizations are used to estimate small-scale properties (e.g., boundary layer turbulence and convection, clouds, radiation). Those have a significant influence on the resolved scale due to the complex nonlinear nature of the atmosphere. For the cloudy planetary boundary layer (PBL), it is fundamental to parameterize vertical turbulent fluxes and subgrid-scale condensation in a realistic manner. A parameterization based on the Total Energy - Mass Flux (TEMF) that unifies turbulence and moist convection components produces a better result than the other PBL schemes. For that reason, the TEMF scheme is chosen as the PBL scheme we optimized for Intel Many Integrated Core (MIC), which ushers in a new era of supercomputing speed, performance, and compatibility. It allows the developers to run code at trillions of calculations per second using the familiar programming model. In this paper, we present our optimization results for the TEMF planetary boundary layer scheme. The optimizations that were performed were quite generic in nature. Those optimizations included vectorization of the code to utilize vector units inside each CPU. Furthermore, memory access was improved by scalarizing some of the intermediate arrays. The results show that the optimization improved MIC performance by 14.8x. Furthermore, the optimizations increased CPU performance by 2.6x compared to the original multi-threaded code on a quad-core Intel Xeon E5-2603 running at 1.8 GHz. Compared to the optimized code running on a single CPU socket, the optimized MIC code is 6.2x faster.

  6. Using Coarrays to Parallelize Legacy Fortran Applications: Strategy and Case Study

    DOE PAGES

    Radhakrishnan, Hari; Rouson, Damian W. I.; Morris, Karla; ...

    2015-01-01

    This paper summarizes a strategy for parallelizing a legacy Fortran 77 program using the object-oriented (OO) and coarray features that entered Fortran in the 2003 and 2008 standards, respectively. OO programming (OOP) facilitates the construction of an extensible suite of model-verification and performance tests that drive the development. Coarray parallel programming facilitates a rapid evolution from a serial application to a parallel application capable of running on multicore processors and many-core accelerators in shared and distributed memory. We delineate 17 code modernization steps used to refactor and parallelize the program and study the resulting performance. Our initial studies were done using the Intel Fortran compiler on a 32-core shared memory server. Scaling behavior was very poor, and profile analysis using TAU showed that the bottleneck in the performance was due to our implementation of a collective, sequential summation procedure. We were able to improve the scalability and achieve nearly linear speedup by replacing the sequential summation with a parallel, binary tree algorithm. We also tested the Cray compiler, which provides its own collective summation procedure. Intel provides no collective reductions. With Cray, the program shows linear speedup even in distributed-memory execution. We anticipate similar results with other compilers once they support the new collective procedures proposed for Fortran 2015.
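
    The idea behind replacing a sequential summation with a binary tree reduction can be sketched as follows. This is a serial Python simulation, not coarray Fortran and not the authors' code: each list entry stands in for the partial sum held by one image (process), and the function name is hypothetical. The point it illustrates is that the number of combining rounds grows as log2 of the number of images rather than linearly.

```python
def tree_sum(partials):
    """Combine per-image partial sums pairwise, doubling the stride each round.

    Returns (total, rounds). In a real parallel run, all additions within
    one round could proceed concurrently on different images.
    """
    vals = list(partials)
    n = len(vals)
    stride = 1
    rounds = 0
    while stride < n:
        # Every 2*stride-th slot absorbs its partner 'stride' slots away.
        for i in range(0, n - stride, 2 * stride):
            vals[i] += vals[i + stride]
        stride *= 2
        rounds += 1
    return (vals[0] if n else 0), rounds


if __name__ == "__main__":
    total, rounds = tree_sum(range(1, 33))  # 32 partial sums, one per image
    print(total, rounds)
```

    With 32 partial sums the tree needs 5 rounds, whereas a sequential accumulation on one image performs 31 dependent additions.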

  7. Modeling a Million-Node Slim Fly Network Using Parallel Discrete-Event Simulation

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Wolfe, Noah; Carothers, Christopher; Mubarak, Misbah

    As supercomputers close in on exascale performance, the increased number of processors and processing power translates to an increased demand on the underlying network interconnect. The Slim Fly network topology, a new low-diameter and low-latency interconnection network, is gaining interest as one possible solution for next-generation supercomputing interconnect systems. In this paper, we present a high-fidelity Slim Fly flit-level model leveraging the Rensselaer Optimistic Simulation System (ROSS) and Co-Design of Exascale Storage (CODES) frameworks. We validate our Slim Fly model with the Kathareios et al. Slim Fly model results provided at moderately sized network scales. We further scale the model size up to an unprecedented 1 million compute nodes; and through visualization of network simulation metrics such as link bandwidth, packet latency, and port occupancy, we get an insight into the network behavior at the million-node scale. We also show linear strong scaling of the Slim Fly model on an Intel cluster achieving a peak event rate of 36 million events per second using 128 MPI tasks to process 7 billion events. Detailed analysis of the underlying discrete-event simulation performance shows that a million-node Slim Fly model simulation can execute in 198 seconds on the Intel cluster.

  8. Reducing adaptive optics latency using Xeon Phi many-core processors

    NASA Astrophysics Data System (ADS)

    Barr, David; Basden, Alastair; Dipper, Nigel; Schwartz, Noah

    2015-11-01

    The next generation of Extremely Large Telescopes (ELTs) for astronomy will rely heavily on the performance of their adaptive optics (AO) systems. Real-time control is at the heart of the critical technologies that will enable telescopes to deliver the best possible science and will require a very significant extrapolation from current AO hardware existing for 4-10 m telescopes. Investigating novel real-time computing architectures and testing their eligibility against anticipated challenges is one of the main priorities of technology development for the ELTs. This paper investigates the suitability of the Intel Xeon Phi, which is a commercial off-the-shelf hardware accelerator. We focus on wavefront reconstruction performance, implementing a straightforward matrix-vector multiplication (MVM) algorithm. We present benchmarking results of the Xeon Phi on a real-time Linux platform, both as a standalone processor and integrated into an existing real-time controller (RTC). The performance of single and multiple Xeon Phis is investigated. We show that this technology has the potential of greatly reducing the mean latency and variations in execution time (jitter) of large AO systems. We present both a detailed performance analysis of the Xeon Phi for a typical E-ELT first-light instrument and a more general approach that enables us to extend to any AO system size. We show that systematic and detailed performance analysis is an essential part of testing novel real-time control hardware to guarantee optimal science results.
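
    For readers unfamiliar with MVM wavefront reconstruction, the core operation is a single matrix-vector product per frame: a precomputed control matrix maps the vector of measured wavefront-sensor slopes to actuator commands. The plain-Python sketch below shows only that operation; the names and the tiny 2x3 matrix are hypothetical, and a real-time implementation would use vectorized, multi-threaded kernels rather than interpreted loops.

```python
def reconstruct(control_matrix, slopes):
    """Return actuator commands as control_matrix (rows) times slopes.

    control_matrix : list of rows, one row per actuator
    slopes         : list of wavefront-sensor slope measurements
    """
    if any(len(row) != len(slopes) for row in control_matrix):
        raise ValueError("each matrix row must match the slope vector length")
    return [sum(r * s for r, s in zip(row, slopes)) for row in control_matrix]


if __name__ == "__main__":
    # Hypothetical system: 2 actuators driven by 3 slope measurements.
    R = [[1.0, 0.0, 2.0],
         [0.5, 1.0, 0.0]]
    print(reconstruct(R, [2.0, 4.0, 1.0]))
```

    The cost per frame is one multiply-add per matrix entry, so the work scales with the product of the actuator and slope counts, which is why latency and jitter become the figures of merit for large systems.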

  9. A computational system for lattice QCD with overlap Dirac quarks

    NASA Astrophysics Data System (ADS)

    Chiu, Ting-Wai; Hsieh, Tung-Han; Huang, Chao-Hsi; Huang, Tsung-Ren

    2003-05-01

    We outline the essential features of a Linux PC cluster which is now being developed at National Taiwan University, and discuss how to optimize its hardware and software for lattice QCD with overlap Dirac quarks. At present, the cluster consists of 30 nodes, with each node consisting of one Pentium 4 processor (1.6/2.0 GHz), one Gbyte of PC800 RDRAM, one 40/80 Gbyte hard disk, and a network card. The speed of this system is estimated to be 30 Gflops, and its price/performance ratio is better than $1.0/Mflops for 64-bit (double precision) computations in quenched lattice QCD with overlap Dirac quarks.

  10. Revisiting Intel Xeon Phi optimization of Thompson cloud microphysics scheme in Weather Research and Forecasting (WRF) model

    NASA Astrophysics Data System (ADS)

    Mielikainen, Jarno; Huang, Bormin; Huang, Allen

    2015-10-01

    The Thompson cloud microphysics scheme is a sophisticated cloud microphysics scheme in the Weather Research and Forecasting (WRF) model. The scheme is very suitable for massively parallel computation as there are no interactions among horizontal grid points. Compared to the earlier microphysics schemes, the Thompson scheme incorporates a large number of improvements. Thus, we have optimized the speed of this important part of WRF. Intel Many Integrated Core (MIC) ushers in a new era of supercomputing speed, performance, and compatibility. It allows the developers to run code at trillions of calculations per second using the familiar programming model. In this paper, we present our results of optimizing the Thompson microphysics scheme on Intel Many Integrated Core Architecture (MIC) hardware. The Intel Xeon Phi coprocessor is the first product based on Intel MIC architecture, and it consists of up to 61 cores connected by a high performance on-die bidirectional interconnect. The coprocessor supports all important Intel development tools. Thus, the development environment is a familiar one to a vast number of CPU developers. However, getting maximum performance out of MICs will require using some novel optimization techniques. New optimizations for an updated Thompson scheme are discussed in this paper. The optimizations improved the performance of the original Thompson code on Xeon Phi 7120P by a factor of 1.8x. Furthermore, the same optimizations improved the performance of the Thompson scheme on a dual socket configuration of eight-core Intel Xeon E5-2670 CPUs by a factor of 1.8x compared to the original Thompson code.

  11. Better than $1/Mflops sustained: a scalable PC-based parallel computer for lattice QCD

    NASA Astrophysics Data System (ADS)

    Fodor, Zoltán; Katz, Sándor D.; Papp, Gábor

    2003-05-01

    We study the feasibility of a PC-based parallel computer for medium to large scale lattice QCD simulations. The Eötvös Univ., Inst. Theor. Phys. cluster consists of 137 Intel P4-1.7GHz nodes with 512 MB RDRAM. The 32-bit, single precision sustained performance for dynamical QCD without communication is 1510 Mflops/node with Wilson and 970 Mflops/node with staggered fermions. This gives a total performance of 208 Gflops for Wilson and 133 Gflops for staggered QCD, respectively (for 64-bit applications the performance is approximately halved). The novel feature of our system is its communication architecture. In order to have a scalable, cost-effective machine we use Gigabit Ethernet cards for nearest-neighbor communications in a two-dimensional mesh. This type of communication is cost effective (only 30% of the hardware costs are spent on the communication). According to our benchmark measurements this type of communication results in around 40% communication time fraction for lattices up to 48^3·96 in full QCD simulations. The price/sustained-performance ratio for full QCD is better than $1/Mflops for Wilson (and around $1.5/Mflops for staggered) quarks for practically any lattice size which can fit in our parallel computer. The communication software is freely available upon request for non-profit organizations.
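
    The aggregate figures quoted above follow from multiplying the per-node rate by the node count. The short sketch below redoes that arithmetic with the numbers from the abstract; the function name is hypothetical. The staggered product rounds to the quoted 133 Gflops, while the Wilson product comes out near 207 Gflops, slightly below the quoted 208, which presumably reflects rounding of the per-node figure.

```python
def total_gflops(nodes, mflops_per_node):
    """Aggregate performance in Gflops, ignoring communication overhead."""
    return nodes * mflops_per_node / 1000.0


if __name__ == "__main__":
    NODES = 137  # node count from the abstract
    print("Wilson:    %.2f Gflops" % total_gflops(NODES, 1510))
    print("staggered: %.2f Gflops" % total_gflops(NODES, 970))
```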

  12. Optimizing the Betts-Miller-Janjic cumulus parameterization with Intel Many Integrated Core (MIC) architecture

    NASA Astrophysics Data System (ADS)

    Huang, Melin; Huang, Bormin; Huang, Allen H.-L.

    2015-10-01

    The schemes of cumulus parameterization are responsible for the sub-grid-scale effects of convective and/or shallow clouds, and intended to represent vertical fluxes due to unresolved updrafts and downdrafts and compensating motion outside the clouds. Some schemes additionally provide cloud and precipitation field tendencies in the convective column, and momentum tendencies due to convective transport of momentum. The schemes all provide the convective component of surface rainfall. Betts-Miller-Janjic (BMJ) is one scheme to fulfill such purposes in the weather research and forecast (WRF) model. National Centers for Environmental Prediction (NCEP) has tried to optimize the BMJ scheme for operational application. As there are no interactions among horizontal grid points, this scheme is very suitable for parallel computation. With the advantage of Intel Xeon Phi Many Integrated Core (MIC) architecture, efficient parallelization and vectorization essentials, it allows us to optimize the BMJ scheme. If compared to the original code respectively running on one CPU socket (eight cores) and on one CPU core with Intel Xeon E5-2670, the MIC-based optimization of this scheme running on Xeon Phi coprocessor 7120P improves the performance by 2.4x and 17.0x, respectively.

  13. Optimizing zonal advection of the Advanced Research WRF (ARW) dynamics for Intel MIC

    NASA Astrophysics Data System (ADS)

    Mielikainen, Jarno; Huang, Bormin; Huang, Allen H.

    2014-10-01

    The Weather Research and Forecast (WRF) model is the most widely used community weather forecast and research model in the world. There are two distinct varieties of WRF. The Advanced Research WRF (ARW) is an experimental, advanced research version featuring very high resolution. The WRF Nonhydrostatic Mesoscale Model (WRF-NMM) has been designed for forecasting operations. WRF consists of dynamics code and several physics modules. The WRF-ARW core is based on an Eulerian solver for the fully compressible nonhydrostatic equations. In this paper, we use the Intel Many Integrated Core (MIC) architecture to substantially increase the performance of a zonal advection subroutine. It is one of the most time-consuming routines in the ARW dynamics core. Advection advances the explicit perturbation horizontal momentum equations by adding in the large-timestep tendency along with the small-timestep pressure gradient tendency. We describe the challenges we met during the development of a high-speed dynamics code subroutine for the MIC architecture. Furthermore, lessons learned from the code optimization process are discussed. The results show that the optimizations improved performance of the original code on Xeon Phi 5110P by a factor of 2.4x.

  14. Connecting Effective Instruction and Technology. Intel-elebration: Safari.

    ERIC Educational Resources Information Center

    Burton, Larry D.; Prest, Sharon

    Intel-ebration is an attempt to integrate the following research-based instructional frameworks and strategies: (1) dimensions of learning; (2) multiple intelligences; (3) thematic instruction; (4) cooperative learning; (5) project-based learning; and (6) instructional technology. This paper presents a thematic unit on safari, using the…

  15. Precision studies of the NNLO DGLAP evolution at the LHC with Candia

    NASA Astrophysics Data System (ADS)

    Cafarella, Alessandro; Corianò, Claudio; Guzzi, Marco

    2008-11-01

    We summarize the theoretical approach to the solution of the NNLO DGLAP equations using methods based on the logarithmic expansions in x-space and their implementation into the C program CANDIA 1.0. We present the various options implemented in the program and discuss the different solutions. The user can choose the order of the evolution, the type of the solution, which can be either exact or truncated, and the evolution either with a fixed or a varying flavor number, implemented in the varying-flavor-number scheme (VFNS). The renormalization and factorization scale dependencies are treated separately. In the non-singlet sector the program implements an exact NNLO solution.

    Program summary
    Program title: CANDIA
    Catalogue identifier: AEBK_v1_0
    Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEBK_v1_0.html
    Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
    Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html
    No. of lines in distributed program, including test data, etc.: 101 376
    No. of bytes in distributed program, including test data, etc.: 5 865 234
    Distribution format: tar.gz
    Programming language: C and Fortran
    Computer: All
    Operating system: Linux
    RAM: In the given examples, it ranges from 4 to 490 MB
    Classification: 11.1, 11.5
    Nature of problem: The program provided here solves the DGLAP evolution equations for the parton distribution functions up to NNLO.
    Solution method: The algorithm implemented is based on the theory of the logarithmic expansions in Bjorken x-space.
    Additional comments: To be sure of getting the latest version of the program, the authors suggest downloading the code from their official CANDIA website (http://www.le.infn.it/candia).
    Running time: In the given examples, it ranges from 1 to 40 minutes. The jobs have been executed on an Intel Core 2 Duo T7250 CPU at 2 GHz with a 64 bit Linux kernel. The test run script included in the package contains 5 sample runs and may take a number of hours to process, depending on the speed of the processor used and the size of the available RAM.

  16. Application of a distributed network in computational fluid dynamic simulations

    NASA Technical Reports Server (NTRS)

    Deshpande, Manish; Feng, Jinzhang; Merkle, Charles L.; Deshpande, Ashish

    1994-01-01

    A general-purpose 3-D, incompressible Navier-Stokes algorithm is implemented on a network of concurrently operating workstations using parallel virtual machine (PVM) and compared with its performance on a CRAY Y-MP and on an Intel iPSC/860. The problem is relatively computationally intensive, and has a communication structure based primarily on nearest-neighbor communication, making it ideally suited to message passing. Such problems are frequently encountered in computational fluid dynamics (CFD), and their solution is increasingly in demand. The communication structure is explicitly coded in the implementation to fully exploit the regularity in message passing in order to produce a near-optimal solution. Results are presented for various grid sizes using up to eight processors.

  17. Accelerated Application Development: The ORNL Titan Experience

    DOE PAGES

    Joubert, Wayne; Archibald, Richard K.; Berrill, Mark A.; ...

    2015-05-09

    The use of computational accelerators such as NVIDIA GPUs and Intel Xeon Phi processors is now widespread in the high performance computing community, with many applications delivering impressive performance gains. However, programming these systems for high performance, performance portability and software maintainability has been a challenge. In this paper we discuss experiences porting applications to the Titan system. Titan, which began planning in 2009 and was deployed for general use in 2013, was the first multi-petaflop system based on accelerator hardware. To ready applications for accelerated computing, a preparedness effort was undertaken prior to delivery of Titan. In this paper we report experiences and lessons learned from this process and describe how users are currently making use of computational accelerators on Titan.

  18. Accelerated application development: The ORNL Titan experience

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Joubert, Wayne; Archibald, Rick; Berrill, Mark

    2015-08-01

    The use of computational accelerators such as NVIDIA GPUs and Intel Xeon Phi processors is now widespread in the high performance computing community, with many applications delivering impressive performance gains. However, programming these systems for high performance, performance portability and software maintainability has been a challenge. In this paper we discuss experiences porting applications to the Titan system. Titan, which began planning in 2009 and was deployed for general use in 2013, was the first multi-petaflop system based on accelerator hardware. To ready applications for accelerated computing, a preparedness effort was undertaken prior to delivery of Titan. In this paper we report experiences and lessons learned from this process and describe how users are currently making use of computational accelerators on Titan.

  19. Electromagnetic physics models for parallel computing architectures

    DOE PAGES

    Amadio, G.; Ananya, A.; Apostolakis, J.; ...

    2016-11-21

    The recent emergence of hardware architectures characterized by many-core or accelerated processors has opened new opportunities for concurrent programming models taking advantage of both SIMD and SIMT architectures. GeantV, a next generation detector simulation, has been designed to exploit both the vector capability of mainstream CPUs and multi-threading capabilities of coprocessors including NVidia GPUs and Intel Xeon Phi. The characteristics of these architectures are very different in terms of the vectorization depth and type of parallelization needed to achieve optimal performance. In this paper we describe implementation of electromagnetic physics models developed for parallel computing architectures as a part of the GeantV project. Finally, the results of preliminary performance evaluation and physics validation are presented as well.

  20. First experience of vectorizing electromagnetic physics models for detector simulation

    NASA Astrophysics Data System (ADS)

    Amadio, G.; Apostolakis, J.; Bandieramonte, M.; Bianchini, C.; Bitzes, G.; Brun, R.; Canal, P.; Carminati, F.; de Fine Licht, J.; Duhem, L.; Elvira, D.; Gheata, A.; Jun, S. Y.; Lima, G.; Novak, M.; Presbyterian, M.; Shadura, O.; Seghal, R.; Wenzel, S.

    2015-12-01

    The recent emergence of hardware architectures characterized by many-core or accelerated processors has opened new opportunities for concurrent programming models taking advantage of both SIMD and SIMT architectures. The GeantV vector prototype for detector simulations has been designed to exploit both the vector capability of mainstream CPUs and multi-threading capabilities of coprocessors including NVidia GPUs and Intel Xeon Phi. The characteristics of these architectures are very different in terms of the vectorization depth, parallelization needed to achieve optimal performance or memory access latency and speed. An additional challenge is to avoid the code duplication often inherent to supporting heterogeneous platforms. In this paper we present the first experience of vectorizing electromagnetic physics models developed for the GeantV project.

  1. Fast multipurpose Monte Carlo simulation for proton therapy using multi- and many-core CPU architectures

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Souris, Kevin, E-mail: kevin.souris@uclouvain.be; Lee, John Aldo; Sterpin, Edmond

    2016-04-15

    Purpose: Accuracy in proton therapy treatment planning can be improved using Monte Carlo (MC) simulations. However, the long computation time of such methods hinders their use in clinical routine. This work aims to develop a fast multipurpose Monte Carlo simulation tool for proton therapy using massively parallel central processing unit (CPU) architectures. Methods: A new Monte Carlo, called MCsquare (many-core Monte Carlo), has been designed and optimized for the last generation of Intel Xeon processors and Intel Xeon Phi coprocessors. These massively parallel architectures offer the flexibility and the computational power suitable to MC methods. The class-II condensed history algorithm of MCsquare provides a fast and yet accurate method of simulating heavy charged particles such as protons, deuterons, and alphas inside voxelized geometries. Hard ionizations, with energy losses above a user-specified threshold, are simulated individually while soft events are regrouped in a multiple scattering theory. Elastic and inelastic nuclear interactions are sampled from ICRU 63 differential cross sections, thereby allowing for the computation of prompt gamma emission profiles. MCsquare has been benchmarked with the GATE/GEANT4 Monte Carlo application for homogeneous and heterogeneous geometries. Results: Comparisons with GATE/GEANT4 for various geometries show deviations within 2%–1 mm. In spite of the limited memory bandwidth of the coprocessor, simulation time is below 25 s for 10^7 primary 200 MeV protons in average soft tissues using all Xeon Phi and CPU resources embedded in a single desktop unit. Conclusions: MCsquare exploits the flexibility of CPU architectures to provide a multipurpose MC simulation tool. Optimized code enables the use of accurate MC calculation within a reasonable computation time, adequate for clinical practice. MCsquare also simulates prompt gamma emission and can thus be used also for in vivo range verification.

  2. Saving time and energy with oversubscription and semi-direct Møller-Plesset second order perturbation methods.

    PubMed

    Fought, Ellie L; Sundriyal, Vaibhav; Sosonkina, Masha; Windus, Theresa L

    2017-04-30

    In this work, the effect of oversubscription is evaluated, via calling 2n, 3n, or 4n processes for n physical cores, on semi-direct MP2 energy and gradient calculations and RI-MP2 energy calculations with the cc-pVTZ basis using NWChem. Results indicate that on both Intel and AMD platforms, oversubscription reduces total time to solution on average for semi-direct MP2 energy calculations by 25-45% and reduces total energy consumed by the CPU and DRAM on average by 10-15% on the Intel platform. Semi-direct gradient time to solution is shortened on average by 8-15% and energy consumption is decreased by 5-10%. Linear regression analysis shows a strong correlation between time to solution and total energy consumed. Oversubscribing during RI-MP2 calculations results in performance degradations of 30-50% at the 4n level. © 2017 Wiley Periodicals, Inc.

  3. Enhancing the Radio Astronomy Capabilities at NASA's Deep Space Network

    NASA Astrophysics Data System (ADS)

    Lazio, Joseph; Teitelbaum, Lawrence; Franco, Manuel M.; Garcia-Miro, Cristina; Horiuchi, Shinji; Jacobs, Christopher; Kuiper, Thomas; Majid, Walid

    2015-08-01

    NASA's Deep Space Network (DSN) is well known for its role in commanding and communicating with spacecraft across the solar system that produce a steady stream of new discoveries in Astrophysics, Heliophysics, and Planetary Science. Equipped with a number of large antennas distributed across the world, the DSN also has a history of contributing to a number of leading radio astronomical projects. This paper summarizes a number of enhancements that are being implemented currently and that are aimed at increasing its capabilities to engage in a wide range of science observations. These enhancements include:

    * A dual-beam system operating between 18 and 27 GHz (~ 1 cm) capable of conducting a variety of molecular line observations, searches for pulsars in the Galactic center, and continuum flux density (photometry) of objects such as nearby protoplanetary disks
    * Enhanced spectroscopy and pulsar processing backends for use at 1.4--1.9 GHz (20 cm), 18--27 GHz (1 cm), and 38--50 GHz (0.7 cm)
    * The DSN Transient Observatory (DTN), an automated, non-invasive backend for transient searching
    * Larger bandwidths (>= 0.5 GHz) for pulsar searching and timing
    * Improved data rates (2048 Mbps) and better instrumental response for very long baseline interferometric (VLBI) observations with the new DSN VLBI processor (DVP), which is providing unprecedented sensitivity for maintenance of the International Celestial Reference Frame (ICRF) and development of future versions

    One of the results of these improvements is that the 70 m Deep Space Station 43 (DSS-43, Tidbinbilla antenna) is now the most sensitive radio antenna in the southern hemisphere. Proposals to use these systems are accepted from the international community.

    Part of this research was carried out at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics & Space Administration.

  4. Using all of your CPU's in HIPE

    NASA Astrophysics Data System (ADS)

    Jacobson, J. D.; Fadda, D.

    2012-09-01

    Modern computer architectures increasingly feature multi-core CPU's. For example, the MacbookPro features the Intel quad-core i7 processors. Through the use of hyper-threading, where each core can execute two threads simultaneously, the quad-core i7 can support eight simultaneous processing threads. All this on your laptop! This CPU power can now be put into service by scientists to perform data reduction tasks, but only if the software has been designed to take advantage of the multiple processor architectures. Up to now, software written for Herschel data reduction (HIPE), written in Jython and JAVA, is single-threaded and can only utilize a single processor. Users of HIPE do not get any advantage from the additional processors. Why not put all of the CPU resources to work reducing your data? We present a multi-threaded software application that corrects long-term transients in the signal from the PACS unchopped spectroscopy line scan mode. In this poster, we present a multi-threaded software framework to achieve performance improvements from parallel execution. We will show how a task to correct transients in the PACS Spectroscopy Pipeline for the un-chopped line scan mode, has been threaded. This computation-intensive task uses either a one-parameter or a three parameter exponential function, to characterize the transient. The task uses a JAVA implementation of Minpack, translated from the C (Moshier) and IDL (Markwardt) by the authors, to optimize the correction parameters. We also explain how to determine if a task can benefit from threading (Amdahl's Law), and if it is safe to thread. The design and implementation, using the JAVA concurrency package completions service is described. Pitfalls, timing bugs, thread safety, resource control, testing and performance improvements are described and plotted.
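
    The Amdahl's Law check mentioned in this abstract can be illustrated with a short calculation. The sketch below is not taken from HIPE; the function name and the sample parallel fractions are assumptions chosen for illustration. It computes the upper bound on speedup for a task in which only part of the runtime can be threaded, using the eight hardware threads of the quad-core, hyper-threaded i7 described above.

```python
def amdahl_speedup(parallel_fraction: float, n_threads: int) -> float:
    """Upper bound on speedup when only `parallel_fraction` of the runtime scales.

    Amdahl's Law: S = 1 / ((1 - p) + p / n), where p is the fraction of the
    single-threaded runtime that can be spread over n threads.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if n_threads < 1:
        raise ValueError("n_threads must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_threads)


if __name__ == "__main__":
    # Hypothetical parallel fractions; 8 threads = 4 cores x 2 hyper-threads.
    for p in (0.50, 0.90, 0.99):
        print(f"p = {p:.2f}: speedup bound on 8 threads = {amdahl_speedup(p, 8):.2f}")
```

    The bound makes the point of the abstract concrete: a task benefits from threading only if most of its runtime sits in the part that can run concurrently.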

  5. An FPGA computing demo core for space charge simulation

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Wu, Jinyuan; Huang, Yifei; /Fermilab

    2009-01-01

    In accelerator physics, space charge simulation requires a large amount of computing power. In a particle system, each calculation requires time/resource consuming operations such as multiplications, divisions, and square roots. Because of the flexibility of field programmable gate arrays (FPGAs), we implemented this task with efficient use of the available computing resources and completely eliminated non-calculating operations that are indispensable in regular micro-processors (e.g. instruction fetch, instruction decoding, etc.). We designed and tested a 16-bit demo core for computing Coulomb's force in an Altera Cyclone II FPGA device. To save resources, the inverse square-root cube operation in our design is computed using a memory look-up table addressed with nine to ten most significant non-zero bits. At 200 MHz internal clock, our demo core reaches a throughput of 200 M pairs/s/core, faster than a typical 2 GHz micro-processor by about a factor of 10. Temperature and power consumption of FPGAs were also lower than those of micro-processors. Fast and convenient, FPGAs can serve as alternatives to time-consuming micro-processors for space charge simulation.
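
    The look-up-table idea in this abstract can be modelled in software. The following is a simplified floating-point sketch, not the authors' 16-bit FPGA design: the 9-bit address width is one of the two widths the abstract mentions, while the bin-centre table entries, the function names, and the use of a floating-point power of two for the exponent scaling are all assumptions made for illustration. The table is addressed by the leading one of the squared distance plus the bits that follow it, and returns an approximation of r^-3.

```python
ADDR_BITS = 9                      # leading one plus the 8 bits that follow it
_FRAC_BITS = ADDR_BITS - 1
_MASK = (1 << _FRAC_BITS) - 1

# Table of m**-1.5 for mantissas m in [1, 2), sampled at bin centres.
TABLE = [(1.0 + (i + 0.5) / (1 << _FRAC_BITS)) ** -1.5
         for i in range(1 << _FRAC_BITS)]


def inv_r_cubed(r2: int) -> float:
    """Approximate r2**-1.5 (i.e. 1/r^3) for a positive integer squared distance."""
    if r2 <= 0:
        raise ValueError("squared distance must be positive")
    e = r2.bit_length() - 1        # position of the most significant non-zero bit
    # Keep only the _FRAC_BITS bits that follow the leading one.
    if e >= _FRAC_BITS:
        frac = (r2 >> (e - _FRAC_BITS)) & _MASK
    else:
        frac = (r2 << (_FRAC_BITS - e)) & _MASK
    # r2 = m * 2**e, so r2**-1.5 = m**-1.5 * 2**(-1.5 * e).
    return TABLE[frac] * 2.0 ** (-1.5 * e)


def coulomb_force_2d(q1: float, q2: float, dx: int, dy: int) -> tuple:
    """Force on charge 1 from charge 2 with the Coulomb constant set to 1.

    dx, dy are integer components of (position 1 - position 2).
    """
    scale = q1 * q2 * inv_r_cubed(dx * dx + dy * dy)
    return (scale * dx, scale * dy)
```

    With an 8-bit fractional address the relative error of this model stays near 0.3%, which suggests why a small table can replace a divider and a square-root unit when moderate precision is acceptable.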

  6. Conceptual definition of a high voltage power supply test facility

    NASA Technical Reports Server (NTRS)

    Biess, John J.; Chu, Teh-Ming; Stevens, N. John

    1989-01-01

    NASA Lewis Research Center is presently developing a 60 GHz traveling wave tube for satellite cross-link communications. The operating voltage for this new tube is - 20 kV. There is concern about the high voltage insulation system and NASA is planning a space station high voltage experiment that will demonstrate both the 60 GHz communications and high voltage electronics technology. The experiment interfaces, requirements, conceptual design, technology issues and safety issues are determined. A block diagram of the high voltage power supply test facility was generated. It includes the high voltage power supply, the 60 GHz traveling wave tube, the communications package, the antenna package, a high voltage diagnostics package and a command and data processor system. The interfaces with the space station and the attached payload accommodations equipment were determined. A brief description of the different subsystems and a discussion of the technology development needs are presented.

  7. Multichannel photonic Hilbert transformers based on complex modulated integrated Bragg gratings.

    PubMed

    Cheng, Rui; Chrostowski, Lukas

    2018-03-01

    Multichannel photonic Hilbert transformers (MPHTs) are reported. The devices are based on single compact spiral integrated Bragg gratings on silicon with coupling coefficients precisely modulated by the phase of each grating period. MPHTs with up to nine wavelength channels and a single-channel bandwidth of up to ∼625  GHz are achieved. The potential of the devices for multichannel single-sideband signal generation is suggested. The work offers a new possibility of utilizing wavelength as an extra degree of freedom in designing radio-frequency photonic signal processors. Such multichannel processors are expected to possess improved capacities and a potential to greatly benefit current widespread wavelength division multiplexed systems.

  8. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Szabo, Levente; Koniorczyk, Matyas; Adam, Peter

    We consider the entanglement manipulation capabilities of the universal covariant quantum cloner or quantum processor circuit for quantum bits. We investigate its use for cloning a member of a bipartite or a genuine tripartite entangled state of quantum bits. We find that for bipartite pure entangled states a nontrivial behavior of concurrence appears, while for GHZ entangled states a possibility of the partial extraction of bipartite entanglement can be achieved.

  9. A derivation and scalable implementation of the synchronous parallel kinetic Monte Carlo method for simulating long-time dynamics

    NASA Astrophysics Data System (ADS)

    Byun, Hye Suk; El-Naggar, Mohamed Y.; Kalia, Rajiv K.; Nakano, Aiichiro; Vashishta, Priya

    2017-10-01

    Kinetic Monte Carlo (KMC) simulations are used to study long-time dynamics of a wide variety of systems. Unfortunately, the conventional KMC algorithm is not scalable to larger systems, since its time scale is inversely proportional to the simulated system size. A promising approach to resolving this issue is the synchronous parallel KMC (SPKMC) algorithm, which makes the time scale size-independent. This paper introduces a formal derivation of the SPKMC algorithm based on local transition-state and time-dependent Hartree approximations, as well as its scalable parallel implementation based on a dual linked-list cell method. The resulting algorithm has achieved a weak-scaling parallel efficiency of 0.935 on 1024 Intel Xeon processors for simulating biological electron transfer dynamics in a 4.2 billion-heme system, as well as decent strong-scaling parallel efficiency. The parallel code has been used to simulate a lattice of cytochrome complexes on a bacterial-membrane nanowire, and it is broadly applicable to other problems such as computational synthesis of new materials.
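
    The two scaling metrics quoted in this abstract follow standard definitions, sketched below. The function names and the timings in the usage lines are hypothetical and are not data from the paper; only the definitions are standard. Weak scaling keeps the work per processor fixed while the processor count grows, and strong scaling keeps the total problem size fixed.

```python
def weak_scaling_efficiency(t_ref: float, t_p: float) -> float:
    """Weak scaling: work per processor is fixed, so the ideal runtime is constant.

    t_ref is the runtime on the reference processor count, t_p the runtime on
    the larger count with a proportionally larger problem.
    """
    return t_ref / t_p


def strong_scaling_efficiency(t_ref: float, p_ref: int, t_p: float, p: int) -> float:
    """Strong scaling: total problem size is fixed, so the ideal runtime falls as 1/p."""
    return (t_ref * p_ref) / (t_p * p)


if __name__ == "__main__":
    # Hypothetical timings in seconds, chosen only to illustrate the formulas.
    print(weak_scaling_efficiency(t_ref=93.5, t_p=100.0))
    print(strong_scaling_efficiency(t_ref=100.0, p_ref=1, t_p=30.0, p=4))
```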

  10. Effective correlator for RadioAstron project

    NASA Astrophysics Data System (ADS)

    Sergeev, Sergey

    This paper presents the implementation of a software FX correlator for Very Long Baseline Interferometry, adapted for the RadioAstron project. The software correlator is implemented for heterogeneous computing systems using graphics accelerators. It is shown that graphics hardware is highly efficient for the interferometry task. The host processor of the heterogeneous computing system forms the data flow for the graphics accelerators, the number of which corresponds to the number of frequency channels. For the RadioAstron project there are seven such channels. Each accelerator computes the correlation matrix for all baselines for a single frequency channel. The input data are converted to floating-point format and corrected by the corresponding delay function, and the entire correlation matrix is computed simultaneously. The correlation matrix is calculated using the sliding Fourier transform. Because the problem maps well onto the architecture of graphics accelerators, a single Kepler-platform processor achieves performance on this task comparable to that of a four-node Intel-based computing cluster. The task scales successfully not only to a large number of graphics accelerators, but also to a large number of nodes with multiple accelerators.
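
    The structure of an FX correlator (Fourier transform first, then cross-multiplication) can be sketched in a few lines of NumPy. This is a CPU illustration under stated assumptions and not the RadioAstron code: it uses non-overlapping FFT blocks rather than the sliding Fourier transform named in the abstract, it omits the delay correction, and the function name `fx_correlate` is hypothetical.

```python
import numpy as np


def fx_correlate(signals: np.ndarray, n_fft: int) -> np.ndarray:
    """Averaged cross-spectra for every station pair.

    signals: real or complex array of shape (n_stations, n_samples).
    Returns an array of shape (n_fft, n_stations, n_stations) whose entry
    [f, i, j] is the block average of S_i[f] * conj(S_j[f]).
    """
    n_stations, n_samples = signals.shape
    n_blocks = n_samples // n_fft
    if n_blocks == 0:
        raise ValueError("need at least n_fft samples per station")
    blocks = signals[:, : n_blocks * n_fft].reshape(n_stations, n_blocks, n_fft)
    # "F" step: Fourier transform every block of every station.
    spectra = np.fft.fft(blocks, axis=-1)
    # "X" step: multiply each station's spectrum by the conjugate of every other's
    # and average over blocks, giving one correlation matrix per spectral bin.
    return np.einsum("ibf,jbf->fij", spectra, np.conj(spectra)) / n_blocks


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vis = fx_correlate(rng.standard_normal((3, 1024)), n_fft=64)
    print(vis.shape)
```

    Each spectral bin yields an independent Hermitian matrix, which is consistent with the abstract's description of computing the whole correlation matrix at once on a data-parallel device.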

  11. Optimization of the coherence function estimation for multi-core central processing unit

    NASA Astrophysics Data System (ADS)

    Cheremnov, A. G.; Faerman, V. A.; Avramchuk, V. S.

    2017-02-01

    The paper considers the use of parallel processing on multi-core central processing units to optimize evaluation of the coherence function in digital signal processing. The coherence function, along with other methods of spectral analysis, is commonly used for vibration diagnosis of rotating machinery and its individual components. An algorithm is given for evaluating the function for signals represented by digital samples. The algorithm is analyzed with respect to its software implementation and computational problems. Optimization measures are described, including algorithmic, architecture and compiler optimization, and their results are assessed for multi-core processors from different manufacturers. The speedup of parallel execution relative to sequential execution was studied, and results are presented for Intel Core i7-4720HQ and AMD FX-9590 processors. The results show comparatively high efficiency of the optimization measures taken. In particular, acceleration indicators and average CPU utilization have been significantly improved, showing a high degree of parallelism of the constructed calculating functions. The developed software underwent state registration and will be used as a part of a software and hardware solution for rotating machinery fault diagnosis and pipeline leak location with the acoustic correlation method.
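
    For readers unfamiliar with the quantity being optimized, the sketch below estimates the magnitude-squared coherence of two sampled signals by averaging spectra over segments. It is a generic textbook estimator and not the authors' algorithm; the segment length, the Hann window, the absence of overlap, and the function name are assumptions.

```python
import numpy as np


def coherence(x, y, n_fft: int = 256) -> np.ndarray:
    """Magnitude-squared coherence |Pxy|^2 / (Pxx * Pyy) per frequency bin.

    Spectra are averaged over non-overlapping, Hann-windowed segments.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n_blocks = min(len(x), len(y)) // n_fft
    if n_blocks < 2:
        # With a single segment the estimate is identically 1 and carries no information.
        raise ValueError("need at least two segments of length n_fft")
    window = np.hanning(n_fft)
    xs = x[: n_blocks * n_fft].reshape(n_blocks, n_fft) * window
    ys = y[: n_blocks * n_fft].reshape(n_blocks, n_fft) * window
    X = np.fft.rfft(xs, axis=1)
    Y = np.fft.rfft(ys, axis=1)
    p_xx = np.mean(np.abs(X) ** 2, axis=0)      # auto-spectrum of x
    p_yy = np.mean(np.abs(Y) ** 2, axis=0)      # auto-spectrum of y
    p_xy = np.mean(X * np.conj(Y), axis=0)      # cross-spectrum
    denom = p_xx * p_yy
    result = np.zeros_like(denom)
    np.divide(np.abs(p_xy) ** 2, denom, out=result, where=denom > 0)
    return result
```

    The per-segment transforms are independent of one another, which is one plausible place for the multi-core parallelism the abstract describes, although the paper's actual decomposition is not stated here.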

  12. Efficient implementation of the many-body Reactive Bond Order (REBO) potential on GPU

    NASA Astrophysics Data System (ADS)

    Trędak, Przemysław; Rudnicki, Witold R.; Majewski, Jacek A.

    2016-09-01

    The second generation Reactive Bond Order (REBO) empirical potential is commonly used to accurately model a wide range of hydrocarbon materials. It is also extensible to other atom types and interactions. The REBO potential assumes a complex multi-body interaction model that is difficult to represent efficiently in the SIMD or SIMT programming model. Hence, despite its importance, no efficient GPGPU implementation has been developed for this potential. Here we present a detailed description of a highly efficient GPGPU implementation of a molecular dynamics algorithm using the REBO potential. The presented algorithm takes advantage of rarely used properties of the SIMT architecture of a modern GPU to solve difficult synchronization issues that arise in computations of multi-body potentials. Techniques developed for this problem may also be used to achieve efficient solutions of different problems. The performance of the proposed algorithm is assessed using a range of model systems. It is compared to the highly optimized CPU implementation (both single core and OpenMP) available in the LAMMPS package. These experiments show up to 6x improvement in forces computation time using a single processor of the NVIDIA Tesla K80 compared to a high-end 16-core Intel Xeon processor.

  13. The Impact of IBM Cell Technology on the Programming Paradigm in the Context of Computer Systems for Climate and Weather Models

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Zhou, Shujia; Duffy, Daniel; Clune, Thomas

    The call for ever-increasing model resolutions and physical processes in climate and weather models demands a continual increase in computing power. The IBM Cell processor's order-of-magnitude peak performance increase over conventional processors makes it very attractive to fulfill this requirement. However, the Cell's characteristics, 256KB local memory per SPE and the new low-level communication mechanism, make it very challenging to port an application. As a trial, we selected the solar radiation component of the NASA GEOS-5 climate model, which: (1) is representative of column physics components (half the total computational time), (2) has an extremely high computational intensity: the ratio of computational load to main memory transfers, and (3) exhibits embarrassingly parallel column computations. In this paper, we converted the baseline code (single-precision Fortran) to C and ported it to an IBM BladeCenter QS20. For performance, we manually SIMDize four independent columns and include several unrolling optimizations. Our results show that when compared with the baseline implementation running on one core of Intel's Xeon Woodcrest, Dempsey, and Itanium2, the Cell is approximately 8.8x, 11.6x, and 12.8x faster, respectively. Our preliminary analysis shows that the Cell can also accelerate the dynamics component (~25 percent of the total computational time). We believe these dramatic performance improvements make the Cell processor very competitive as an accelerator.

  14. Digital 8-DPSK Modem For Trellis-Coded Communication

    NASA Technical Reports Server (NTRS)

    Jedrey, T. C.; Lay, N. E.; Rafferty, W.

    1989-01-01

    Digital real-time modem processes octuple differential-phase-shift-keyed trellis-coded modulation. Intended for use in communicating data at rate up to 4.8 kb/s in land-mobile satellite channel (Rician fading) of 5-kHz bandwidth at carrier frequency of 1 to 2 GHz. Modulator and demodulator contain digital signal processors performing modem functions. Design flexible in that functions altered via software. Modem successfully tested and evaluated in both laboratory and field experiments, including recent full-scale satellite experiment. In all cases, modem performed within 1 dB of theory. Other communication systems benefitting from this type of modem include land mobile (without satellites), paging, digitized voice, and frequency-modulation subcarrier data broadcasting.

  15. Application of Intel Many Integrated Core (MIC) architecture to the Yonsei University planetary boundary layer scheme in Weather Research and Forecasting model

    NASA Astrophysics Data System (ADS)

    Huang, Melin; Huang, Bormin; Huang, Allen H.

    2014-10-01

    The Weather Research and Forecasting (WRF) model provides operational services worldwide in many areas and is linked to our daily activities, in particular during severe weather events. The Yonsei University (YSU) scheme is one of the planetary boundary layer (PBL) models in WRF. The PBL is responsible for vertical sub-grid-scale fluxes due to eddy transports in the whole atmospheric column, determines the flux profiles within the well-mixed boundary layer and the stable layer, and thus provides atmospheric tendencies of temperature, moisture (including clouds), and horizontal momentum in the entire atmospheric column. The YSU scheme is very suitable for massively parallel computation as there are no interactions among horizontal grid points. To accelerate the computation process of the YSU scheme, we employ the Intel Many Integrated Core (MIC) Architecture, as it is a multiprocessor computer structure with merits of efficient parallelization and vectorization essentials. Our results show that the MIC-based optimization improved the performance of the first version of multi-threaded code on Xeon Phi 5110P by a factor of 2.4x. Furthermore, the same CPU-based optimizations improved the performance on Intel Xeon E5-2603 by a factor of 1.6x as compared to the first version of multi-threaded code.

  16. The GOODS-N Jansky VLA 10 GHz Pilot Survey: Sizes of Star-forming μJy Radio Sources

    NASA Astrophysics Data System (ADS)

    Murphy, Eric J.; Momjian, Emmanuel; Condon, James J.; Chary, Ranga-Ram; Dickinson, Mark; Inami, Hanae; Taylor, Andrew R.; Weiner, Benjamin J.

    2017-04-01

    Our sensitive ($\sigma_{\rm n} \approx 572\,{\rm nJy\,beam^{-1}}$), high-resolution (FWHM $\theta_{1/2} = 0.22^{\prime\prime} \approx 2\,{\rm kpc}$ at $z \gtrsim 1$), 10 GHz image covering a single Karl G. Jansky Very Large Array (VLA) primary beam (FWHM $\Theta_{1/2} \approx 4.25^{\prime}$) in the GOODS-N field contains 32 sources with $S_{\rm p} \gtrsim 2\,\mu{\rm Jy\,beam^{-1}}$ and optical and/or near-infrared (OIR) counterparts. Most are about as large as the star-forming regions that power them. Their median FWHM major axis is $\langle\theta_{\rm M}\rangle = 167 \pm 32\,{\rm mas} \approx 1.2 \pm 0.28\,{\rm kpc}$, with rms scatter $\approx 91\,{\rm mas} \approx 0.79\,{\rm kpc}$. In units of the effective radius $r_{\rm e}$ that encloses half their flux, these radio sizes are $\langle r_{\rm e}\rangle \approx 69 \pm 13\,{\rm mas} \approx 509 \pm 114\,{\rm pc}$, with rms scatter $\approx 38\,{\rm mas} \approx 324\,{\rm pc}$. These sizes are smaller than those measured at lower radio frequencies, but agree with dust emission sizes measured at mm/sub-mm wavelengths and extinction-corrected Hα sizes. We made a low-resolution ($\theta_{1/2} = 1.0^{\prime\prime}$) image with $\approx 10\times$ better brightness sensitivity, in order to detect extended sources and measure matched-resolution spectral indices $\alpha_{1.4\,{\rm GHz}}^{10\,{\rm GHz}}$. It contains six new sources with $S_{\rm p} \gtrsim 3.9\,\mu{\rm Jy\,beam^{-1}}$ and OIR counterparts. The median redshift of all 38 sources is $\langle z\rangle = 1.24 \pm 0.15$. The 19 sources with 1.4 GHz counterparts have a median spectral index of $\langle\alpha_{1.4\,{\rm GHz}}^{10\,{\rm GHz}}\rangle = -0.74 \pm 0.10$, with rms scatter $\approx 0.35$. Including upper limits on $\alpha$ for sources not detected at 1.4 GHz flattens the median to $\langle\alpha_{1.4\,{\rm GHz}}^{10\,{\rm GHz}}\rangle \gtrsim -0.61$, suggesting that the μJy radio sources at higher redshifts, and hence those selected at higher rest-frame frequencies, may have flatter spectra. If the non-thermal spectral index is $\alpha_{\rm NT} \approx -0.85$, the median thermal fraction of sources selected at median rest-frame frequency $\approx 20\,{\rm GHz}$ is $\gtrsim 48\%$.
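
    A two-point spectral index of the kind reported in this abstract can be computed as follows. The sketch assumes the power-law convention S ∝ ν^α, which matches the negative values quoted above; the function names and the flux densities in the usage lines are hypothetical and are not measurements from the survey.

```python
import math


def spectral_index(s1: float, nu1: float, s2: float, nu2: float) -> float:
    """Two-point spectral index alpha, assuming S proportional to nu**alpha.

    s1, s2 are flux densities in the same units; nu1, nu2 are the matching
    frequencies in the same units (for example 1.4 and 10 GHz).
    """
    if min(s1, s2, nu1, nu2) <= 0 or nu1 == nu2:
        raise ValueError("fluxes and frequencies must be positive and nu1 != nu2")
    return math.log(s1 / s2) / math.log(nu1 / nu2)


def extrapolate_flux(s_ref: float, nu_ref: float, nu: float, alpha: float) -> float:
    """Flux density at frequency nu for a pure power law through (nu_ref, s_ref)."""
    return s_ref * (nu / nu_ref) ** alpha


if __name__ == "__main__":
    # Hypothetical source: 100 uJy at 1.4 GHz with alpha = -0.74.
    s10 = extrapolate_flux(100.0, 1.4, 10.0, -0.74)
    print(f"S(10 GHz) = {s10:.1f} uJy, alpha = {spectral_index(100.0, 1.4, s10, 10.0):.2f}")
```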

  17. Cache Energy Optimization Techniques For Modern Processors

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Mittal, Sparsh

    2013-01-01

    Modern multicore processors are employing large last-level caches, for example Intel's E7-8800 processor uses 24MB L3 cache. Further, with each CMOS technology generation, leakage energy has been dramatically increasing and hence, leakage energy is expected to become a major source of energy dissipation, especially in last-level caches (LLCs). The conventional schemes of cache energy saving either aim at saving dynamic energy or are based on properties specific to first-level caches, and thus these schemes have limited utility for last-level caches. Further, several other techniques require offline profiling or per-application tuning and hence are not suitable for product systems. In this book, we present novel cache leakage energy saving schemes for single-core and multicore systems; desktop, QoS, real-time and server systems. Also, we present cache energy saving techniques for caches designed with both conventional SRAM devices and emerging non-volatile devices such as STT-RAM (spin-torque transfer RAM). We present software-controlled, hardware-assisted techniques which use dynamic cache reconfiguration to configure the cache to the most energy efficient configuration while keeping the performance loss bounded. To profile and test a large number of potential configurations, we utilize low-overhead, micro-architecture components, which can be easily integrated into modern processor chips. We adopt a system-wide approach to save energy to ensure that cache reconfiguration does not increase energy consumption of other components of the processor. We have compared our techniques with state-of-the-art techniques and have found that our techniques outperform them in terms of energy efficiency and other relevant metrics. The techniques presented in this book have important applications in improving energy-efficiency of higher-end embedded, desktop, QoS, real-time, server processors and multitasking systems.
    This book is intended to be a valuable guide for both newcomers and veterans in the field of cache power management. It will help graduate students, CAD tool developers and designers in understanding the need of energy efficiency in modern computing systems. Further, it will be useful for researchers in gaining insights into algorithms and techniques for micro-architectural and system-level energy optimization using dynamic cache reconfiguration. We sincerely believe that the "food for thought" presented in this book will inspire the readers to develop even better ideas for designing "green" processors of tomorrow.

  18. Performance optimization of Qbox and WEST on Intel Knights Landing

    NASA Astrophysics Data System (ADS)

    Zheng, Huihuo; Knight, Christopher; Galli, Giulia; Govoni, Marco; Gygi, Francois

    We present the optimization of electronic structure codes Qbox and WEST targeting the Intel® Xeon Phi™ processor, codenamed Knights Landing (KNL). Qbox is an ab-initio molecular dynamics code based on plane wave density functional theory (DFT) and WEST is a post-DFT code for excited state calculations within many-body perturbation theory. Both Qbox and WEST employ highly scalable algorithms which enable accurate large-scale electronic structure calculations on leadership class supercomputer platforms beyond 100,000 cores, such as Mira and Theta at the Argonne Leadership Computing Facility. In this work, features of the KNL architecture (e.g. hierarchical memory) are explored to achieve higher performance in key algorithms of the Qbox and WEST codes and to develop a road-map for further development targeting next-generation computing architectures. In particular, the optimizations of the Qbox and WEST codes on the KNL platform will target efficient large-scale electronic structure calculations of nanostructured materials exhibiting complex structures and prediction of their electronic and thermal properties for use in solar and thermal energy conversion devices. This work was supported by MICCoM, as part of Comp. Mats. Sci. Program funded by the U.S. DOE, Office of Sci., BES, MSE Division. This research used resources of the ALCF, which is a DOE Office of Sci. User Facility under Contract DE-AC02-06CH11357.

  19. Large Scale GW Calculations on the Cori System

    NASA Astrophysics Data System (ADS)

    Deslippe, Jack; Del Ben, Mauro; da Jornada, Felipe; Canning, Andrew; Louie, Steven

    The NERSC Cori system, powered by 9000+ Intel Xeon-Phi processors, represents one of the largest HPC systems for open-science in the United States and the world. We discuss the optimization of the GW methodology for this system, including both node level and system-scale optimizations. We highlight multiple large scale (thousands of atoms) case studies and discuss both absolute application performance and comparison to calculations on more traditional HPC architectures. We find that the GW method is particularly well suited for many-core architectures due to the ability to exploit a large amount of parallelism across many layers of the system. This work was supported by the U.S. Department of Energy, Office of Science, Basic Energy Sciences, Materials Sciences and Engineering Division, as part of the Computational Materials Sciences Program.

  20. QCD thermodynamics with two flavors at Nt=6

    NASA Astrophysics Data System (ADS)

    Bernard, Claude; Ogilvie, Michael C.; Degrand, Thomas A.; Detar, Carleton; Gottlieb, Steven; Krasnitz, Alex; Sugar, R. L.; Toussaint, D.

    1992-05-01

    The first results of numerical simulations of quantum chromodynamics on the Intel iPSC/860 parallel processor are presented. We performed calculations with two flavors of Kogut-Susskind quarks at Nt=6 with masses of 0.15T and 0.075T (0.025 and 0.0125 in lattice units) in order to locate the crossover from the low-temperature regime of ordinary hadronic matter to the high-temperature chirally symmetric regime. As with other recent two-flavor simulations, these calculations are insufficient to distinguish between a rapid crossover and a true phase transition. The phase transition is either absent or feeble at this quark mass. An improved estimate of the crossover temperature in physical units is given and results are presented for the hadronic screening lengths in both the high- and low-temperature regimes.

  1. Best kept secrets ... Source Data Systems, Inc. (SDS).

    PubMed

    Andrew, W F

    1991-03-01

    The SDS/MEDNET system is a cost-effective option for small- to medium-size hospitals (up to 400 beds). The parameter-driven system lets users control operations with only occasional SDS assistance. A full application set, available for modular selection to reduce upfront costs while facilitating steady growth and protecting client investment, is adaptable to multi-facility environments. The industry-standard, Intel-based multi-user processors, network communications and protocols assure high efficiency, low-cost solutions independent of any one hardware vendor. Sustained growth in both client base and product offerings points to a high level of responsiveness and healthcare industry commitment. Corporate emphasis on user involvement and open systems integration assures clients of leading-edge capabilities. SDS/MEDNET will be a strong contender in selected marketing environments.

  2. Compiling global name-space programs for distributed execution

    NASA Technical Reports Server (NTRS)

    Koelbel, Charles; Mehrotra, Piyush

    1990-01-01

    Distributed memory machines do not provide hardware support for a global address space. Thus programmers are forced to partition the data across the memories of the architecture and use explicit message passing to communicate data between processors. The compiler support required to allow programmers to express their algorithms using a global name-space is examined. A general method is presented for the analysis of a high level source program and its translation to a set of independently executing tasks communicating via messages. If the compiler has enough information, this translation can be carried out at compile-time. Otherwise run-time code is generated to implement the required data movement. The analysis required in both situations is described and the performance of the generated code on the Intel iPSC/2 is presented.

  3. HANSF 1.3 Users Manual FAI/98-40-R2 Hanford Spent Nuclear Fuel (SNF) Safety Analysis Model [SEC 1 and 2

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    DUNCAN, D.R.

    The HANSF analysis tool is an integrated model considering phenomena inside a multi-canister overpack (MCO) spent nuclear fuel container such as fuel oxidation, convective and radiative heat transfer, and the potential for fission product release. This manual reflects the HANSF version 1.3.2, a revised version of 1.3.1. HANSF 1.3.2 was written to correct minor errors and to allow modeling of condensate flow on the MCO inner surface. HANSF 1.3.2 is intended for use on personal computers such as IBM-compatible machines with Intel processors running under Lahey TI or digital Visual FORTRAN, Version 6.0, but this does not preclude operation in other environments.

  4. Evaluation of CHO Benchmarks on the Arria 10 FPGA using Intel FPGA SDK for OpenCL

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Jin, Zheming; Yoshii, Kazutomo; Finkel, Hal

    The OpenCL standard is an open programming model for accelerating algorithms on heterogeneous computing system. OpenCL extends the C-based programming language for developing portable codes on different platforms such as CPU, Graphics processing units (GPUs), Digital Signal Processors (DSPs) and Field Programmable Gate Arrays (FPGAs). The Intel FPGA SDK for OpenCL is a suite of tools that allows developers to abstract away the complex FPGA-based development flow for a high-level software development flow. Users can focus on the design of hardware-accelerated kernel functions in OpenCL and then direct the tools to generate the low-level FPGA implementations. The approach makes the FPGA-based development more accessible to software users as the needs for hybrid computing using CPUs and FPGAs are increasing. It can also significantly reduce the hardware development time as users can evaluate different ideas with high-level language without deep FPGA domain knowledge. Benchmarking of OpenCL-based framework is an effective way for analyzing the performance of system by studying the execution of the benchmark applications. CHO is a suite of benchmark applications that provides support for OpenCL [1]. The authors presented CHO as an OpenCL port of the CHStone benchmark. Using Altera OpenCL (AOCL) compiler to synthesize the benchmark applications, they listed the resource usage and performance of each kernel that can be successfully synthesized by the compiler. In this report, we evaluate the resource usage and performance of the CHO benchmark applications using the Intel FPGA SDK for OpenCL and Nallatech 385A FPGA board that features an Arria 10 FPGA device. The focus of the report is to have a better understanding of the resource usage and performance of the kernel implementations using Arria-10 FPGA devices compared to Stratix-5 FPGA devices.
In addition, we also gain knowledge about the limitations of the current compiler when it fails to synthesize a benchmark application.

  5. Scaling Deep Learning Workloads: NVIDIA DGX-1/Pascal and Intel Knights Landing

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Gawande, Nitin A.; Landwehr, Joshua B.; Daily, Jeffrey A.

    Deep Learning (DL) algorithms have become ubiquitous in data analytics. As a result, major computing vendors --- including NVIDIA, Intel, AMD and IBM --- have architectural road-maps influenced by DL workloads. Furthermore, several vendors have recently advertised new computing products as accelerating DL workloads. Unfortunately, it is difficult for data scientists to quantify the potential of these different products. This paper provides a performance and power analysis of important DL workloads on two major parallel architectures: NVIDIA DGX-1 (eight Pascal P100 GPUs interconnected with NVLink) and Intel Knights Landing (KNL) CPUs interconnected with Intel Omni-Path. Our evaluation consists of a cross section of convolutional neural net workloads: CifarNet, CaffeNet, AlexNet and GoogleNet topologies using the Cifar10 and ImageNet datasets. The workloads are vendor optimized for each architecture. GPUs provide the highest overall raw performance. Our analysis indicates that although GPUs provide the highest overall performance, the gap can close for some convolutional networks; and KNL can be competitive when considering performance/watt. Furthermore, NVLink is critical to GPU scaling.

  6. Phantom-GRAPE: Numerical software library to accelerate collisionless N-body simulation with SIMD instruction set on x86 architecture

    NASA Astrophysics Data System (ADS)

    Tanikawa, Ataru; Yoshikawa, Kohji; Nitadori, Keigo; Okamoto, Takashi

    2013-02-01

    We have developed a numerical software library for collisionless N-body simulations named "Phantom-GRAPE" which highly accelerates force calculations among particles by use of a new SIMD instruction set extension to the x86 architecture, Advanced Vector eXtensions (AVX), an enhanced version of the Streaming SIMD Extensions (SSE). In our library, not only the Newton's forces, but also central forces with an arbitrary shape f(r), which has a finite cutoff radius rcut (i.e. f(r)=0 at r>rcut), can be quickly computed. In computing such central forces with an arbitrary force shape f(r), we refer to a pre-calculated look-up table. We also present a new scheme to create the look-up table whose binning is optimal to keep good accuracy in computing forces and whose size is small enough to avoid cache misses. Using an Intel Core i7-2600 processor, we measure the performance of our library for both of the Newton's forces and the arbitrarily shaped central forces. In the case of Newton's forces, we achieve 2×10⁹ interactions per second with one processor core (or 75 GFLOPS if we count 38 operations per interaction), which is 20 times higher than the performance of an implementation without any explicit use of SIMD instructions, and 2 times higher than that with the SSE instructions. With four processor cores, we obtain the performance of 8×10⁹ interactions per second (or 300 GFLOPS). In the case of the arbitrarily shaped central forces, we can calculate 1×10⁹ and 4×10⁹ interactions per second with one and four processor cores, respectively. The performance with one processor core is 6 times and 2 times higher than those of the implementations without any use of SIMD instructions and with the SSE instructions. These performances depend only weakly on the number of particles, irrespective of the force shape. This is in good contrast with the fact that the performance of force calculations accelerated by graphics processing units (GPUs) depends strongly on the number of particles.
Substantially weak dependence of the performance on the number of particles is suitable to collisionless N-body simulations, since these simulations are usually performed with sophisticated N-body solvers such as Tree- and TreePM-methods combined with an individual timestep scheme. We conclude that collisionless N-body simulations accelerated with our library have significant advantage over those accelerated by GPUs, especially on massively parallel environments.
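
    The GFLOPS figures quoted in this abstract follow from the interaction rates and the stated count of 38 operations per interaction. A minimal sketch of that arithmetic is given below; the function name `gflops` is a hypothetical helper introduced here, not part of Phantom-GRAPE.

```python
# Hypothetical helper (not part of the Phantom-GRAPE library): convert an
# interaction rate into GFLOPS using the abstract's 38 operations per interaction.
def gflops(interactions_per_second, flops_per_interaction=38):
    return interactions_per_second * flops_per_interaction / 1e9

one_core = gflops(2e9)    # 76 GFLOPS, which the abstract rounds to 75
four_cores = gflops(8e9)  # 304 GFLOPS, which the abstract rounds to 300

print(one_core, four_cores)
```

    The small gap between 76 and the quoted 75 GFLOPS presumably reflects rounding of the measured interaction rate to one significant figure.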

  7. High performance computing for deformable image registration: towards a new paradigm in adaptive radiotherapy.

    PubMed

    Samant, Sanjiv S; Xia, Junyi; Muyan-Ozcelik, Pinar; Owens, John D

    2008-08-01

    The advent of readily available temporal imaging or time series volumetric (4D) imaging has become an indispensable component of treatment planning and adaptive radiotherapy (ART) at many radiotherapy centers. Deformable image registration (DIR) is also used in other areas of medical imaging, including motion corrected image reconstruction. Due to long computation time, clinical applications of DIR in radiation therapy and elsewhere have been limited and consequently relegated to offline analysis. With the recent advances in hardware and software, graphics processing unit (GPU) based computing is an emerging technology for general purpose computation, including DIR, and is suitable for highly parallelized computing. However, traditional general purpose computation on the GPU is limited because of the constraints of the available programming platforms. As well, compared to CPU programming, the GPU currently has reduced dedicated processor memory, which can limit the useful working data set for parallelized processing. We present an implementation of the demons algorithm using the NVIDIA 8800 GTX GPU and the new CUDA programming language. The GPU performance will be compared with single threading and multithreading CPU implementations on an Intel dual core 2.4 GHz CPU using the C programming language. CUDA provides a C-like language programming interface, and allows for direct access to the highly parallel compute units in the GPU. Comparisons for volumetric clinical lung images acquired using 4DCT were carried out. Computation time for 100 iterations in the range of 1.8-13.5 s was observed for the GPU with image size ranging from 2.0 × 10⁶ to 14.2 × 10⁶ pixels. The GPU registration was 55-61 times faster than the CPU for the single threading implementation, and 34-39 times faster for the multithreading implementation. For CPU based computing, the computational time generally has a linear dependence on image size for medical imaging data.
Computational efficiency is characterized in terms of time per megapixels per iteration (TPMI) with units of seconds per megapixels per iteration (or spmi). For the demons algorithm, our CPU implementation yielded largely invariant values of TPMI. The mean TPMIs were 0.527 spmi and 0.335 spmi for the single threading and multithreading cases, respectively, with <2% variation over the considered image data range. For GPU computing, we achieved TPMI = 0.00916 spmi with 3.7% variation, indicating optimized memory handling under CUDA. The paradigm of GPU based real-time DIR opens up a host of clinical applications for medical imaging.
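
    The TPMI metric defined above can be checked against the timings quoted in the abstract. The sketch below assumes that the shortest time (1.8 s) belongs to the smallest image and the longest time (13.5 s) to the largest, which the abstract implies but does not state; the function name `tpmi` is introduced here for illustration.

```python
def tpmi(seconds, megapixels, iterations):
    """Time per megapixel per iteration, in spmi."""
    return seconds / (megapixels * iterations)

# Assumed pairing of the range endpoints quoted in the abstract.
gpu_small = tpmi(1.8, 2.0, 100)    # about 0.0090 spmi
gpu_large = tpmi(13.5, 14.2, 100)  # about 0.0095 spmi

# Ratios of the mean CPU TPMI values to the mean GPU TPMI value.
speedup_single = 0.527 / 0.00916   # about 57.5, inside the reported 55-61 range
speedup_multi = 0.335 / 0.00916    # about 36.6, inside the reported 34-39 range

print(gpu_small, gpu_large, speedup_single, speedup_multi)
```

    Both endpoint estimates lie within a few percent of the reported mean of 0.00916 spmi, and the ratios of mean TPMI values reproduce the quoted speedup ranges.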

  8. MC-TESTER: a universal tool for comparisons of Monte Carlo predictions for particle decays in high energy physics

    NASA Astrophysics Data System (ADS)

    Golonka, P.; Pierzchała, T.; Wąs, Z.

    2004-02-01

    Theoretical predictions in high energy physics are routinely provided in the form of Monte Carlo generators. Comparisons of predictions from different programs and/or different initialization set-ups are often necessary. MC-TESTER can be used for such tests of decays of intermediate states (particles or resonances) in a semi-automated way. Our test consists of two steps. Different Monte Carlo programs are run; events with decays of a chosen particle are searched, decay trees are analyzed and appropriate information is stored. Then, at the analysis step, a list of all found decay modes is defined and branching ratios are calculated for both runs. Histograms of all scalar Lorentz-invariant masses constructed from the decay products are plotted and compared for each decay mode found in both runs. For each plot a measure of the difference of the distributions is calculated and its maximal value over all histograms for each decay channel is printed in a summary table. As an example of MC-TESTER application, we include a test with the τ lepton decay Monte Carlo generators, TAUOLA and PYTHIA. The HEPEVT (or LUJETS) common block is used as exclusive source of information on the generated events.

    Program summary
    Title of the program: MC-TESTER, version 1.1
    Catalogue identifier: ADSM
    Program summary URL: http://cpc.cs.qub.ac.uk/summaries/ADSM
    Program obtainable from: CPC Program Library, Queen's University of Belfast, N. Ireland
    Computer: PC, two Intel Xeon 2.0 GHz processors, 512MB RAM
    Operating system: Linux Red Hat 6.1, 7.2, and also 8.0
    Programming language used: C++, FORTRAN77: gcc 2.96 or 2.95.2 (also 3.2) compiler suite with g++ and g77
    Size of the package: 7.3 MB directory including example programs (2 MB compressed distribution archive), without ROOT libraries (additional 43 MB).
    No. of bytes in distributed program, including test data, etc.: 2 024 425
    Distribution format: tar gzip file
    Additional disk space required: Depends on the analyzed particle: 40 MB in the case of τ lepton decays (30 decay channels, 594 histograms, 82-pages booklet).
    Keywords: particle physics, decay simulation, Monte Carlo methods, invariant mass distributions, programs comparison
    Nature of the physical problem: The decays of individual particles are well defined modules of a typical Monte Carlo program chain in high energy physics. A fast, semi-automatic way of comparing results from different programs is often desirable, for the development of new programs, to check correctness of the installations or for discussion of uncertainties.
    Method of solution: A typical HEP Monte Carlo program stores the generated events in the event records such as HEPEVT or PYJETS. MC-TESTER scans, event by event, the contents of the record and searches for the decays of the particle under study. The list of the found decay modes is successively incremented and histograms of all invariant masses which can be calculated from the momenta of the particle decay products are defined and filled. The outputs from the two runs of distinct programs can be later compared. A booklet of comparisons is created: for every decay channel, all histograms present in the two outputs are plotted and parameter quantifying shape difference is calculated. Its maximum over every decay channel is printed in the summary table.
    Restrictions on the complexity of the problem: For a list of limitations see Section 6.
    Typical running time: Varies substantially with the analyzed decay particle. On a PC/Linux with 2.0 GHz processors MC-TESTER increases the run time of the τ-lepton Monte Carlo program TAUOLA by 4.0 seconds for every 100 000 analyzed events (generation itself takes 26 seconds). The analysis step takes 13 seconds; ? processing takes additionally 10 seconds. Generation step runs may be executed simultaneously on multi-processor machines.
    Accessibility: web page: http://cern.ch/Piotr.Golonka/MC/MC-TESTER e-mails: Piotr.Golonka@CERN.CH, T.Pierzchala@friend.phys.us.edu.pl, Zbigniew.Was@CERN.CH.
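
    The quantity histogrammed in this record, the invariant mass of a group of decay products, can be sketched as follows. This is an illustrative Python sketch and not MC-TESTER code (which is C++/FORTRAN77); it assumes that "all invariant masses" means one mass per subset of two or more decay products, and the function names are hypothetical.

```python
import itertools
import math

def invariant_mass(particles):
    """Invariant mass of a set of four-momenta (E, px, py, pz) in natural units."""
    e = sum(p[0] for p in particles)
    px = sum(p[1] for p in particles)
    py = sum(p[2] for p in particles)
    pz = sum(p[3] for p in particles)
    m2 = e * e - (px * px + py * py + pz * pz)
    return math.sqrt(max(m2, 0.0))  # clamp small negative round-off to zero

def all_invariant_masses(products):
    """Map each subset of two or more decay products (by index) to its invariant mass."""
    masses = {}
    for k in range(2, len(products) + 1):
        for combo in itertools.combinations(range(len(products)), k):
            masses[combo] = invariant_mass([products[i] for i in combo])
    return masses

# Toy event: two back-to-back massless particles plus one particle at rest.
products = [(1.0, 0.0, 0.0, 1.0), (1.0, 0.0, 0.0, -1.0), (2.0, 0.0, 0.0, 0.0)]
masses = all_invariant_masses(products)
```

    For a decay with n products this yields 2^n - n - 1 masses per event, which suggests why the number of histograms grows quickly with the multiplicity of the decay channel.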

  9. Latency-Information Theory: The Mathematical-Physical Theory of Communication-Observation

    DTIC Science & Technology

    2010-01-01

    Werner Heisenberg of quantum mechanics; 3) the source-entropy and channel-capacity lossless performance bounds of Claude Shannon that guide...through noisy intel-space channels, and where the physical time-dislocations of intel-space exhibit a passing of time Heisenberg information...life-space sensor, and where the physical time- dislocations of life-space exhibit a passing of time Heisenberg information-uncertainty; and 4

  10. Performance tuning Weather Research and Forecasting (WRF) Goddard longwave radiative transfer scheme on Intel Xeon Phi

    NASA Astrophysics Data System (ADS)

    Mielikainen, Jarno; Huang, Bormin; Huang, Allen H.

    2015-10-01

    The next-generation mesoscale numerical weather prediction system, the Weather Research and Forecasting (WRF) model, is designed for dual use in forecasting and research. WRF offers multiple physics options that can be combined in any way. One of the physics options is radiance computation. The major source of energy for the earth's climate is solar radiation. Thus, it is imperative to accurately model horizontal and vertical distribution of the heating. Goddard solar radiative transfer model includes the absorption due to water vapor, ozone, oxygen, carbon dioxide, clouds and aerosols. The model computes the interactions among the absorption and scattering by clouds, aerosols, molecules and surface. Finally, fluxes are integrated over the entire longwave spectrum. In this paper, we present our results of optimizing the Goddard longwave radiative transfer scheme on Intel Many Integrated Core Architecture (MIC) hardware. The Intel Xeon Phi coprocessor is the first product based on Intel MIC architecture, and it consists of up to 61 cores connected by a high performance on-die bidirectional interconnect. The coprocessor supports all important Intel development tools. Thus, the development environment is a familiar one to a vast number of CPU developers. However, getting maximum performance out of MICs requires some novel optimization techniques. Those optimization techniques are discussed in this paper. The optimizations improved the performance of the original Goddard longwave radiative transfer scheme on Xeon Phi 7120P by a factor of 2.2x. Furthermore, the same optimizations improved the performance of the Goddard longwave radiative transfer scheme on a dual socket configuration of eight core Intel Xeon E5-2670 CPUs by a factor of 2.1x compared to the original Goddard longwave radiative transfer scheme code.

  11. Area-Efficient 60 GHz +18.9 dBm Power Amplifier with On-Chip Four-Way Parallel Power Combiner in 65-nm CMOS

    NASA Astrophysics Data System (ADS)

    Farahabadi, Payam Masoumi; Basaligheh, Ali; Saffari, Parvaneh; Moez, Kambiz

    2017-06-01

    This paper presents a compact 60-GHz power amplifier utilizing a four-way on-chip parallel power combiner and splitter. The proposed topology provides the capability of combining the output power of four individual power amplifier cores in a compact die area. Each power amplifier core consists of a three-stage common-source amplifier with transformer-coupled impedance matching networks. Fabricated in a 65-nm CMOS process, the 0.19-mm² power amplifier has a measured gain at 60 GHz of 18.8 and 15 dB with 1.4 and 1.0 V supplies, respectively. A three-decibel bandwidth of 4 GHz and a P1dB of 16.9 dBm are measured while consuming 424 mW from a 1.4-V supply. A maximum saturated output power of 18.3 dBm is measured with a 15.9% peak power added efficiency at 60 GHz. The measured insertion loss is 1.9 dB at 60 GHz. The proposed power amplifier achieves the highest power density (power/area) compared to the reported 60-GHz CMOS power amplifiers in 65 nm or older CMOS technologies.

  12. A simple GPU-accelerated two-dimensional MUSCL-Hancock solver for ideal magnetohydrodynamics

    NASA Astrophysics Data System (ADS)

    Bard, Christopher M.; Dorelli, John C.

    2014-02-01

    We describe our experience using NVIDIA's CUDA (Compute Unified Device Architecture) C programming environment to implement a two-dimensional second-order MUSCL-Hancock ideal magnetohydrodynamics (MHD) solver on a GTX 480 Graphics Processing Unit (GPU). Taking a simple approach in which the MHD variables are stored exclusively in the global memory of the GTX 480 and accessed in a cache-friendly manner (without further optimizing memory access by, for example, staging data in the GPU's faster shared memory), we achieved a maximum speed-up of ≈126 for a 1024² grid relative to the sequential C code running on a single Intel Nehalem (2.8 GHz) core. This speedup is consistent with simple estimates based on the known floating point performance, memory throughput and parallel processing capacity of the GTX 480.

  13. Fine-grained parallelism accelerating for RNA secondary structure prediction with pseudoknots based on FPGA.

    PubMed

    Xia, Fei; Jin, Guoqing

    2014-06-01

    PKNOTS is a well-known benchmark program that has been widely used to predict RNA secondary structure including pseudoknots. It adopts the standard four-dimensional (4D) dynamic programming (DP) method and is the basis of many variants and improved algorithms. Unfortunately, the O(N⁶) computing requirements and complicated data dependency greatly limit the usefulness of the PKNOTS package given the explosion in gene database size. In this paper, we present a fine-grained parallel PKNOTS package and prototype system for accelerating RNA folding applications based on an FPGA chip. We adopted a series of storage optimization strategies to resolve the "Memory Wall" problem. We aggressively exploit parallel computing strategies to improve computational efficiency. We also propose several methods that collectively reduce the storage requirements for FPGA on-chip memory. To the best of our knowledge, our design is the first FPGA implementation for accelerating the 4D DP problem for RNA folding applications including pseudoknots. The experimental results show a factor of more than 50x average speedup over the PKNOTS-1.08 software running on a PC platform with Intel Core2 Q9400 Quad CPU for input RNA sequences. Moreover, the power consumption of our FPGA accelerator is only about 50% of that of the general-purpose microprocessors.

  14. Effects of 2.4 GHz radiofrequency radiation emitted from Wi-Fi equipment on microRNA expression in brain tissue.

    PubMed

    Dasdag, Suleyman; Akdag, Mehmet Zulkuf; Erdal, Mehmet Emin; Erdal, Nurten; Ay, Ozlem Izci; Ay, Mustafa Ertan; Yilmaz, Senay Gorucu; Tasdelen, Bahar; Yegin, Korkut

    2015-07-01

    MicroRNAs (miRNA) play a paramount role in growth, differentiation, proliferation and cell death by suppressing one or more target genes. However, their interaction with radiofrequencies is still unknown. The aim of this study was to investigate the long-term effects of radiofrequency radiation emitted from a Wireless Fidelity (Wi-Fi) system on some of the miRNA in brain tissue. The study was carried out on 16 Wistar Albino adult male rats divided into two groups: sham (n = 8) and exposure (n = 8). Rats in the exposure group were exposed to 2.4 GHz radiofrequency (RF) radiation for 24 hours a day for 12 months (one year). The same procedure was applied to the rats in the sham group except the Wi-Fi system was turned off. Immediately after the last exposure, rats were sacrificed and their brains were removed. miR-9-5p, miR-29a-3p, miR-106b-5p, miR-107, miR-125a-3p in brain were investigated in detail. The results revealed that long-term exposure to 2.4 GHz Wi-Fi radiation can alter expression of some of the miRNAs such as miR-106b-5p (adj p* = 0.010) and miR-107 (adj p* = 0.005). We observed that miR-107 expression is 3.3 times and miR-106b-5p expression is 3.65 times lower in the exposure group than in the control group. However, miR-9-5p, miR-29a-3p and miR-125a-3p levels in brain were not altered. Long-term exposure to 2.4 GHz RF may lead to adverse effects such as neurodegenerative diseases originating from the alteration of some miRNA expression, and more studies should be devoted to the effects of RF radiation on miRNA expression levels.

  15. A distributed microcomputer-controlled system for data acquisition and power spectral analysis of EEG.

    PubMed

    Vo, T D; Dwyer, G; Szeto, H H

    1986-04-01

    A relatively powerful and inexpensive microcomputer-based system for the spectral analysis of the EEG is presented. High resolution and speed are achieved with the use of recently available large-scale integrated circuit technology with enhanced functionality (INTEL Math co-processors 8087) which can perform transcendental functions rapidly. The versatility of the system is achieved with a hardware organization that has distributed data acquisition capability performed by the use of a microprocessor-based analog to digital converter with large resident memory (Cyborg ISAAC-2000). Compiled BASIC programs and assembly language subroutines perform on-line or off-line the fast Fourier transform and spectral analysis of the EEG which is stored as soft as well as hard copy. Some results obtained from test application of the entire system in animal studies are presented.

  16. A comparative study of serial and parallel aeroelastic computations of wings

    NASA Technical Reports Server (NTRS)

    Byun, Chansup; Guruswamy, Guru P.

    1994-01-01

    A procedure for computing the aeroelasticity of wings on parallel multiple-instruction, multiple-data (MIMD) computers is presented. In this procedure, fluids are modeled using Euler equations, and structures are modeled using modal or finite element equations. The procedure is designed in such a way that each discipline can be developed and maintained independently by using a domain decomposition approach. In the present parallel procedure, each computational domain is scalable. A parallel integration scheme is used to compute aeroelastic responses by solving fluid and structural equations concurrently. The computational efficiency issues of parallel integration of both fluid and structural equations are investigated in detail. This approach, which reduces the total computational time by a factor of almost 2, is demonstrated for a typical aeroelastic wing by using various numbers of processors on the Intel iPSC/860.

  17. Instrumentation, performance visualization, and debugging tools for multiprocessors

    NASA Technical Reports Server (NTRS)

    Yan, Jerry C.; Fineman, Charles E.; Hontalas, Philip J.

    1991-01-01

    The need for computing power has forced a migration from serial computation on a single processor to parallel processing on multiprocessor architectures. However, without effective means to monitor (and visualize) program execution, debugging and tuning parallel programs become intractably difficult as program complexity increases with the number of processors. Research on performance evaluation tools for multiprocessors is being carried out at ARC. Besides investigating new techniques for instrumenting, monitoring, and presenting the state of parallel program execution in a coherent and user-friendly manner, prototypes of software tools are being incorporated into the run-time environments of various hardware testbeds to evaluate their impact on user productivity. Our current tool set, the Ames Instrumentation Systems (AIMS), incorporates features from various software systems developed in academia and industry. The execution of FORTRAN programs on the Intel iPSC/860 can be automatically instrumented and monitored. Performance data collected in this manner can be displayed graphically on workstations supporting X-Windows. We have successfully compared various parallel algorithms for computational fluid dynamics (CFD) applications in collaboration with scientists from the Numerical Aerodynamic Simulation Systems Division. By performing these comparisons, we show that performance monitors and debuggers such as AIMS are practical and can illuminate the complex dynamics that occur within parallel programs.

  18. High Performance Distributed Computing in a Supercomputer Environment: Computational Services and Applications Issues

    NASA Technical Reports Server (NTRS)

    Kramer, Williams T. C.; Simon, Horst D.

    1994-01-01

    This tutorial aims to be a practical guide for the uninitiated to the main topics and themes of high-performance computing (HPC), with particular emphasis on distributed computing. The intent is first to provide some guidance and directions in the rapidly growing field of scientific computing using both massively parallel and traditional supercomputers. Because of their considerable potential computational power, loosely or tightly coupled clusters of workstations are increasingly considered as a third alternative to both the more conventional supercomputers based on a small number of powerful vector processors and massively parallel processors. Even though many research issues concerning the effective use of workstation clusters and their integration into a large scale production facility are still unresolved, such clusters are already used for production computing. In this tutorial we will draw on the unique experience gained at the NAS facility at NASA Ames Research Center. Over the last five years at NAS, massively parallel supercomputers such as the Connection Machines CM-2 and CM-5 from Thinking Machines Corporation and the iPSC/860 (Touchstone Gamma Machine) and Paragon Machines from Intel were used in a production supercomputer center alongside traditional vector supercomputers such as the Cray Y-MP and C90.

  19. Efficient Approximation Algorithms for Weighted $b$-Matching

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Khan, Arif; Pothen, Alex; Mostofa Ali Patwary, Md.

    2016-01-01

    We describe a half-approximation algorithm, b-Suitor, for computing a b-Matching of maximum weight in a graph with weights on the edges. b-Matching is a generalization of the well-known Matching problem in graphs, where the objective is to choose a subset of M edges in the graph such that at most a specified number b(v) of edges in M are incident on each vertex v. Subject to this restriction we maximize the sum of the weights of the edges in M. We prove that the b-Suitor algorithm computes the same b-Matching as the one obtained by the greedy algorithm for the problem. We implement the algorithm on serial and shared-memory parallel processors, and compare its performance against a collection of approximation algorithms that have been proposed for the Matching problem. Our results show that the b-Suitor algorithm outperforms the Greedy and Locally Dominant edge algorithms by one to two orders of magnitude on a serial processor. The b-Suitor algorithm has a high degree of concurrency, and it scales well up to 240 threads on a shared memory multiprocessor. The b-Suitor algorithm outperforms the Locally Dominant edge algorithm by a factor of fourteen on 16 cores of an Intel Xeon multiprocessor.
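
    The greedy algorithm that the abstract uses as its reference point can be sketched as follows. This is a minimal serial illustration of greedy b-Matching, not the b-Suitor algorithm itself; the function name `greedy_b_matching` and the example graph are assumptions made for illustration.

```python
def greedy_b_matching(edges, b):
    """Greedy half-approximation for maximum-weight b-Matching.

    edges: list of (u, v, weight) tuples (hypothetical input format).
    b:     dict mapping each vertex to its capacity b(v).
    """
    remaining = dict(b)  # capacity still available at each vertex
    matching = []
    # Consider edges from heaviest to lightest.
    for u, v, w in sorted(edges, key=lambda e: e[2], reverse=True):
        # Keep the edge only if both endpoints still have capacity.
        if u != v and remaining[u] > 0 and remaining[v] > 0:
            matching.append((u, v, w))
            remaining[u] -= 1
            remaining[v] -= 1
    return matching


# Illustrative graph: vertex "b" may be matched twice, the others once.
edges = [("a", "b", 5.0), ("b", "c", 4.0), ("c", "d", 3.0),
         ("a", "d", 2.0), ("a", "c", 1.0)]
capacity = {"a": 1, "b": 2, "c": 1, "d": 1}
matching = greedy_b_matching(edges, capacity)
total_weight = sum(w for _, _, w in matching)
```

    The sort makes this version inherently sequential, which is the kind of limitation a more concurrent formulation has to work around.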

  20. A customizable system for real-time image processing using the Blackfin DSProcessor and the MicroC/OS-II real-time kernel

    NASA Astrophysics Data System (ADS)

    Coffey, Stephen; Connell, Joseph

    2005-06-01

    This paper presents a development platform for real-time image processing based on the ADSP-BF533 Blackfin processor and the MicroC/OS-II real-time operating system (RTOS). MicroC/OS-II is a completely portable, ROMable, pre-emptive, real-time kernel. The Blackfin Digital Signal Processors (DSPs), incorporating the Analog Devices/Intel Micro Signal Architecture (MSA), are a broad family of 16-bit fixed-point products with a dual Multiply Accumulate (MAC) core. In addition, they have a rich instruction set with variable instruction length and both DSP and MCU functionality thus making them ideal for media based applications. Using the MicroC/OS-II for task scheduling and management, the proposed system can capture and process raw RGB data from any standard 8-bit greyscale image sensor in soft real-time and then display the processed result using a simple PC graphical user interface (GUI). Additionally, the GUI allows configuration of the image capture rate and the system and core DSP clock rates thereby allowing connectivity to a selection of image sensors and memory devices. The GUI also allows selection from a set of image processing algorithms based in the embedded operating system.

  1. A compositional reservoir simulator on distributed memory parallel computers

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Rame, M.; Delshad, M.

    1995-12-31

    This paper presents the application of distributed memory parallel computers to field scale reservoir simulations using a parallel version of UTCHEM, The University of Texas Chemical Flooding Simulator. The model is a general purpose highly vectorized chemical compositional simulator that can simulate a wide range of displacement processes at both field and laboratory scales. The original simulator was modified to run on both distributed memory parallel machines (Intel iPSC/860 and Delta, Connection Machine 5, Kendall Square 1 and 2, and CRAY T3D) and a cluster of workstations. A domain decomposition approach has been taken towards parallelization of the code. A portion of the discrete reservoir model is assigned to each processor by a set-up routine that attempts a data layout as even as possible from the load-balance standpoint. Each of these subdomains is extended so that data can be shared between adjacent processors for stencil computation. The added routines that make parallel execution possible are written in a modular fashion that makes porting to new parallel platforms straightforward. Results of the distributed memory computing performance of the parallel simulator are presented for field scale applications such as tracer flood and polymer flood. A comparison of the wall-clock times for the same problems on a vector supercomputer is also presented.

  2. Tomographic image reconstruction using the cell broadband engine (CBE) general purpose hardware

    NASA Astrophysics Data System (ADS)

    Knaup, Michael; Steckmann, Sven; Bockenbach, Olivier; Kachelrieß, Marc

    2007-02-01

    Tomographic image reconstruction, such as the reconstruction of CT projection values, of tomosynthesis data, PET or SPECT events, is computationally very demanding. In filtered backprojection as well as in iterative reconstruction schemes, the most time-consuming steps are forward- and backprojection, which are often limited by the memory bandwidth. Recently, a novel general purpose architecture optimized for distributed computing became available: the Cell Broadband Engine (CBE). Its eight synergistic processing elements (SPEs) currently allow for a theoretical performance of 192 GFlops (3 GHz, 8 units, 4 floats per vector, 2 instructions, multiply and add, per clock). To maximize image reconstruction speed we modified our parallel-beam and perspective backprojection algorithms, which are highly optimized for standard PCs, and optimized the code for the CBE processor [1-3]. In addition, we implemented an optimized perspective forwardprojection on the CBE which allows us to perform statistical image reconstructions like the ordered subset convex (OSC) algorithm [4]. Performance was measured using simulated data with 512 projections per rotation and 512^2 detector elements. The data were backprojected into an image of 512^3 voxels using our PC-based approaches and the new CBE-based algorithms. Both the PC and the CBE timings were scaled to a 3 GHz clock frequency. On the CBE, we obtain total reconstruction times of 4.04 s for the parallel backprojection, 13.6 s for the perspective backprojection and 192 s for a complete OSC reconstruction, consisting of one initial Feldkamp reconstruction, followed by 4 OSC iterations.
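
    The 192 GFlops figure quoted above is simply the product of the four factors given in parentheses. A short check of that arithmetic, with a hypothetical helper name `peak_gflops`:

```python
def peak_gflops(clock_ghz, units, lanes_per_vector, flops_per_lane_per_cycle):
    """Theoretical peak = clock rate x units x vector lanes x flops per lane per cycle."""
    return clock_ghz * units * lanes_per_vector * flops_per_lane_per_cycle


# Values from the abstract: 3 GHz, 8 SPEs, 4 floats per vector,
# 2 operations (multiply and add) per clock.
cbe_peak = peak_gflops(3.0, 8, 4, 2)
print(f"CBE theoretical peak: {cbe_peak:.0f} GFlops")
```

    This is an upper bound only; the abstract itself notes that forward- and backprojection are often limited by memory bandwidth rather than arithmetic throughput.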

  3. Scalability of Parallel Spatial Direct Numerical Simulations on Intel Hypercube and IBM SP1 and SP2

    NASA Technical Reports Server (NTRS)

    Joslin, Ronald D.; Hanebutte, Ulf R.; Zubair, Mohammad

    1995-01-01

    The implementation and performance of a parallel spatial direct numerical simulation (PSDNS) approach on the Intel iPSC/860 hypercube and IBM SP1 and SP2 parallel computers is documented. Spatially evolving disturbances associated with the laminar-to-turbulent transition in boundary-layer flows are computed with the PSDNS code. The feasibility of using the PSDNS to perform transition studies on these computers is examined. The results indicate that the PSDNS approach can effectively be parallelized on a distributed-memory parallel machine by remapping the distributed data structure during the course of the calculation. Scalability information is provided to estimate computational costs to match the actual costs relative to changes in the number of grid points. By increasing the number of processors, slower than linear speedups are achieved with optimized (machine-dependent library) routines. This slower than linear speedup results because the computational cost is dominated by the FFT routine, which yields less than ideal speedups. By using appropriate compile options and optimized library routines on the SP1, the serial code achieves 52-56 Mflops on a single node of the SP1 (45 percent of theoretical peak performance). The actual performance of the PSDNS code on the SP1 is evaluated with a "real world" simulation that consists of 1.7 million grid points. One time step of this simulation is calculated on eight nodes of the SP1 in the same time as required by a Cray Y/MP supercomputer. For the same simulation, 32 nodes of the SP1 and SP2 are required to reach the performance of a Cray C-90. A 32 node SP1 (SP2) configuration is 2.9 (4.6) times faster than a Cray Y/MP for this simulation, while the hypercube is roughly 2 times slower than the Y/MP for this application. KEY WORDS: Spatial direct numerical simulations; incompressible viscous flows; spectral methods; finite differences; parallel computing.

  4. Kalman Filter Tracking on Parallel Architectures

    NASA Astrophysics Data System (ADS)

    Cerati, Giuseppe; Elmer, Peter; Lantz, Steven; McDermott, Kevin; Riley, Dan; Tadel, Matevž; Wittich, Peter; Würthwein, Frank; Yagil, Avi

    2015-12-01

    Power density constraints are limiting the performance improvements of modern CPUs. To address this we have seen the introduction of lower-power, multi-core processors, but the future will be even more exciting. In order to stay within the power density limits but still obtain Moore's Law performance/price gains, it will be necessary to parallelize algorithms to exploit larger numbers of lightweight cores and specialized functions like large vector units. Example technologies today include Intel's Xeon Phi and GPGPUs. Track finding and fitting is one of the most computationally challenging problems for event reconstruction in particle physics. At the High Luminosity LHC, for example, this will be by far the dominant problem. The need for greater parallelism has driven investigations of very different track finding techniques including Cellular Automata or returning to Hough Transform. The most common track finding techniques in use today are however those based on the Kalman Filter [2]. Significant experience has been accumulated with these techniques on real tracking detector systems, both in the trigger and offline. They are known to provide high physics performance, are robust and are exactly those being used today for the design of the tracking system for HL-LHC. Our previous investigations showed that, using optimized data structures, track fitting with Kalman Filter can achieve large speedup both with Intel Xeon and Xeon Phi. We report here our further progress towards an end-to-end track reconstruction algorithm fully exploiting vectorization and parallelization techniques in a realistic simulation setup.

  5. Computing Maximum Cardinality Matchings in Parallel on Bipartite Graphs via Tree-Grafting

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Azad, Ariful; Buluç, Aydın; Pothen, Alex

    It is difficult to obtain high performance when computing matchings on parallel processors because matching algorithms explicitly or implicitly search for paths in the graph, and when these paths become long, there is little concurrency. In spite of this limitation, we present a new algorithm and its shared-memory parallelization that achieves good performance and scalability in computing maximum cardinality matchings in bipartite graphs. This algorithm searches for augmenting paths via specialized breadth-first searches (BFS) from multiple source vertices, hence creating more parallelism than single source algorithms. Algorithms that employ multiple-source searches cannot discard a search tree once no augmenting path is discovered from the tree, unlike algorithms that rely on single-source searches. We describe a novel tree-grafting method that eliminates most of the redundant edge traversals resulting from this property of multiple-source searches. We also employ the recent direction-optimizing BFS algorithm as a subroutine to discover augmenting paths faster. Our algorithm compares favorably with the current best algorithms in terms of the number of edges traversed, the average augmenting path length, and the number of iterations. Here, we provide a proof of correctness for our algorithm. Our NUMA-aware implementation is scalable to 80 threads of an Intel multiprocessor and to 240 threads on an Intel Knights Corner coprocessor. On average, our parallel algorithm runs an order of magnitude faster than the fastest algorithms available. The performance improvement is more significant on graphs with small matching number.
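
    For readers unfamiliar with augmenting paths, the sketch below shows the classical single-source approach that the abstract contrasts with its multiple-source method. It is a serial, depth-first illustration (chosen for brevity), not the tree-grafting algorithm; the function name and the adjacency-list input format are assumptions.

```python
def max_bipartite_matching(adj, n_right):
    """Maximum cardinality matching by repeated single-source augmenting-path search.

    adj:     adj[u] lists the right-side vertices adjacent to left vertex u.
    n_right: number of right-side vertices.
    Returns (matching size, match_right) where match_right[v] is the left
    vertex matched to v, or -1 if v is unmatched.
    """
    match_right = [-1] * n_right

    def try_augment(u, visited):
        # Look for an augmenting path starting at left vertex u.
        for v in adj[u]:
            if v in visited:
                continue
            visited.add(v)
            # v is free, or its current partner can be re-matched elsewhere.
            if match_right[v] == -1 or try_augment(match_right[v], visited):
                match_right[v] = u
                return True
        return False

    size = 0
    for u in range(len(adj)):
        if try_augment(u, set()):
            size += 1
    return size, match_right


# Illustrative graph with 3 left and 3 right vertices.
size, match_right = max_bipartite_matching([[0, 1], [0], [1, 2]], 3)
```

    Each search here starts from one vertex and its visited set is discarded afterwards, which is what limits concurrency when paths grow long.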

  6. Computing Maximum Cardinality Matchings in Parallel on Bipartite Graphs via Tree-Grafting

    DOE PAGES

    Azad, Ariful; Buluç, Aydın; Pothen, Alex

    2016-03-24

    It is difficult to obtain high performance when computing matchings on parallel processors because matching algorithms explicitly or implicitly search for paths in the graph, and when these paths become long, there is little concurrency. In spite of this limitation, we present a new algorithm and its shared-memory parallelization that achieves good performance and scalability in computing maximum cardinality matchings in bipartite graphs. This algorithm searches for augmenting paths via specialized breadth-first searches (BFS) from multiple source vertices, hence creating more parallelism than single source algorithms. Algorithms that employ multiple-source searches cannot discard a search tree once no augmenting path is discovered from the tree, unlike algorithms that rely on single-source searches. We describe a novel tree-grafting method that eliminates most of the redundant edge traversals resulting from this property of multiple-source searches. We also employ the recent direction-optimizing BFS algorithm as a subroutine to discover augmenting paths faster. Our algorithm compares favorably with the current best algorithms in terms of the number of edges traversed, the average augmenting path length, and the number of iterations. Here, we provide a proof of correctness for our algorithm. Our NUMA-aware implementation is scalable to 80 threads of an Intel multiprocessor and to 240 threads on an Intel Knights Corner coprocessor. On average, our parallel algorithm runs an order of magnitude faster than the fastest algorithms available. The performance improvement is more significant on graphs with small matching number.

  7. Semiconductor Ion Implanters

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    MacKinnon, Barry A.; Ruffell, John P.

    In 1953 the Raytheon CK722 transistor was priced at $7.60. Based upon this, an Intel Xeon Quad Core processor containing 820,000,000 transistors should list at $6.2 billion. Particle accelerator technology plays an important part in the remarkable story of why that Intel product can be purchased today for a few hundred dollars. Most people of the mid twentieth century would be astonished at the ubiquity of semiconductors in the products we now buy and use every day. Though relatively expensive in the nineteen fifties they now exist in a wide range of items from high-end multicore microprocessors like the Intel product to disposable items containing 'only' hundreds or thousands like RFID chips and talking greeting cards. This historical development has been fueled by continuous advancement of the several individual technologies involved in the production of semiconductor devices including Ion Implantation and the charged particle beamlines at the heart of implant machines. In the course of its 40 year development, the worldwide implanter industry has reached annual sales levels around $2B, installed thousands of dedicated machines and directly employs thousands of workers. It represents in all these measures, as much and possibly more than any other industrial application of particle accelerator technology. This presentation discusses the history of implanter development. It touches on some of the people involved and on some of the developmental changes and challenges imposed as the requirements of the semiconductor industry evolved.

  8. Monolithic integration of an InP-based 4 × 25 GHz photodiode array to an O-band arrayed waveguide grating demultiplexer

    NASA Astrophysics Data System (ADS)

    Ye, Han; Han, Qin; Lv, Qianqian; Pan, Pan; An, Junming; Yang, Xiaohong

    2017-12-01

    We demonstrate the monolithic integration of a uni-traveling carrier photodiode array with a 4 channel, O-band arrayed waveguide grating demultiplexer on the InP platform by the selective area growth technique. An extended coupling layer at the butt-joint is adopted to ensure both good fabrication compatibility and high photodiode quantum efficiency of 77%. The fabricated integrated chip exhibits a uniform bandwidth over 25 GHz for each channel and a crosstalk below -22 dB.

  9. Parasail: SIMD C library for global, semi-global, and local pairwise sequence alignments.

    PubMed

    Daily, Jeff

    2016-02-10

    Sequence alignment algorithms are a key component of many bioinformatics applications. Though various fast Smith-Waterman local sequence alignment implementations have been developed for x86 CPUs, most are embedded into larger database search tools. In addition, fast implementations of Needleman-Wunsch global sequence alignment and its semi-global variants are not as widespread. This article presents the first software library for local, global, and semi-global pairwise intra-sequence alignments and improves the performance of previous intra-sequence implementations. A faster intra-sequence local pairwise alignment implementation is described and benchmarked, including new global and semi-global variants. Using a 375 residue query sequence, a speed of 136 billion cell updates per second (GCUPS) was achieved on a dual Intel Xeon E5-2670 24-core processor system, the highest reported for an implementation based on Farrar's 'striped' approach. Rognes's SWIPE optimal database search application is still generally the fastest available at 1.2 to at best 2.4 times faster than Parasail for sequences shorter than 500 amino acids. However, Parasail was faster for longer sequences. For global alignments, Parasail's prefix scan implementation is generally the fastest, faster even than Farrar's 'striped' approach; however, the opal library is faster for single-threaded applications. The software library is designed for 64 bit Linux, OS X, or Windows on processors with SSE2, SSE4.1, or AVX2. Source code is available from https://github.com/jeffdaily/parasail under the Battelle BSD-style license. Applications that require optimal alignment scores could benefit from the improved performance. For the first time, SIMD global, semi-global, and local alignments are available in a stand-alone C library.
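
    As a point of reference for the "cell updates" being counted, the sketch below computes a Smith-Waterman local alignment score with a plain scalar loop and converts a cell count into GCUPS. It is not Parasail's API and does not use the striped or prefix-scan SIMD layouts; the linear gap penalty, the scoring values, and both function names are simplifying assumptions.

```python
def smith_waterman_score(query, target, match=2, mismatch=-1, gap=-2):
    """Best local alignment score under a linear gap penalty (scalar reference)."""
    prev = [0] * (len(target) + 1)  # previous row of the dynamic-programming table
    best = 0
    for qc in query:
        curr = [0]
        for j, tc in enumerate(target, start=1):
            diag = prev[j - 1] + (match if qc == tc else mismatch)
            # One "cell update": take the best of restart, diagonal, and two gaps.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def gcups(query_len, database_residues, seconds):
    """Billions of table cells updated per second."""
    return query_len * database_residues / seconds / 1e9


score = smith_waterman_score("GGACGTGG", "TTACGTTT")
```

    Every query residue is compared against every database residue, so the cell count is the product of the two lengths; that product is the numerator of the GCUPS figure.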

  10. Evolution of tripartite entangled states in a decohering environment and their experimental protection using dynamical decoupling

    NASA Astrophysics Data System (ADS)

    Singh, Harpreet; Arvind; Dorai, Kavita

    2018-02-01

    We embarked upon the task of experimental protection of different classes of tripartite entangled states, namely, the maximally entangled Greenberger-Horne-Zeilinger (GHZ) and W states and the tripartite entangled state called the W W̄ state, using dynamical decoupling. The states were created on a three-qubit NMR quantum information processor and allowed to evolve in the naturally noisy NMR environment. Tripartite entanglement was monitored at each time instant during state evolution, using negativity as an entanglement measure. It was found that the W state is most robust while the GHZ-type states are most fragile against the natural decoherence present in the NMR system. The W W̄ state, which is in the GHZ class yet stores entanglement in a manner akin to the W state, surprisingly turned out to be more robust than the GHZ state. The experimental data were best modeled by considering the main noise channel to be an uncorrelated phase damping channel acting independently on each qubit, along with a generalized amplitude damping channel. Using dynamical decoupling, we were able to achieve a significant protection of entanglement for GHZ states. There was a marginal improvement in the state fidelity for the W state (which is already robust against natural system decoherence), while the W W̄ state showed a significant improvement in fidelity and protection against decoherence.

  11. High-speed assembly language (80386/80387) programming for laser spectra scan control and data acquisition providing improved resolution water vapor spectroscopy

    NASA Technical Reports Server (NTRS)

    Allen, Robert J.

    1988-01-01

    An assembly language program using the Intel 80386 CPU and 80387 math co-processor chips was written to increase the speed of data gathering and processing, and provide control of a scanning CW ring dye laser system. This laser system is used in high resolution (better than 0.001 cm-1) water vapor spectroscopy experiments. Laser beam power is sensed at the input and output of white cells and the output of a Fabry-Perot. The assembly language subroutine is called from Basic, acquires the data and performs various calculations at rates more than 150 times faster than could be achieved in the higher level language. The width of output control pulses generated in assembly language is 3 to 4 microsecs as compared to 2 to 3.7 millisecs for those generated in Basic (about 500 to 1000 times faster). Included are a block diagram and brief description of the spectroscopy experiment, a flow diagram of the Basic and assembly language programs, listing of the programs, scope photographs of the computer generated 5-volt pulses used for control and timing analysis, and representative water spectrum curves obtained using these programs.

  12. A rapid calculation system for tsunami propagation in Japan by using the AQUA-MT/CMT solutions

    NASA Astrophysics Data System (ADS)

    Nakamura, T.; Suzuki, W.; Yamamoto, N.; Kimura, H.; Takahashi, N.

    2017-12-01

    We developed a rapid calculation system for geodetic deformations and tsunami propagation in and around Japan. The system automatically conducts forward calculations by using point source parameters estimated by the AQUA system (Matsumura et al., 2006), which analyzes the magnitude, hypocenter, and moment tensor of an event occurring in Japan within 3 minutes of the origin time at the earliest. An optimized calculation code developed by Nakamura and Baba (2016) is employed for the calculations on our computer server with 12 core processors of Intel Xeon 2.60 GHz. Assuming a homogeneous fault slip in a single fault plane as the source fault, the developed system calculates each geodetic deformation and tsunami propagation by numerically solving the 2D linear long-wave equations for a grid interval of 1 arc-min from two fault orientations simultaneously; i.e., one fault plane and its conjugate fault plane. Because fault models based on moment tensor analyses of event data are used, the system appropriately evaluates tsunami propagation even for unexpected events such as normal faulting in the subduction zone. This differs from the evaluation of tsunami arrivals and heights from a pre-calculated database that uses fault models assuming typical types of faulting in anticipated source areas (e.g., Tatehata, 1998; Titov et al., 2005; Yamamoto et al., 2016). Through complete automation from event detection to the output of graphical figures, the calculation results are available via e-mail and web site within 4 minutes of the origin time at the earliest. For moderate-sized events such as M5 to 6 events, the system helps us to rapidly investigate whether amplitudes of tsunamis at nearshore and offshore stations exceed the noise level, and to easily identify actual tsunamis at the stations by comparing them with the synthetic waveforms obtained. In the case of using source models investigated from GNSS data, such evaluations may be difficult because of the low resolution of sources due to a low signal to noise ratio at land stations. For large to huge events in offshore areas, the developed system may be useful for deciding whether to start or stop preparations and precautions against tsunami arrivals, because calculation results including arrival times and heights of initial and maximum waves can be rapidly available before their arrivals at coastal areas.

  13. Simulations of dusty plasmas using a special-purpose computer system designed for gravitational N-body problems

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Yamamoto, K.; Mizuno, Y.; Hibino, S.

    2006-01-15

    Simulations of dusty plasmas were performed using GRAPE-6, a special-purpose computer designed for gravitational N-body problems. The collective behavior of dust particles, which are injected into the plasma, was studied by means of three-dimensional computer simulations. As an example of a dusty plasma simulation, experiments on Coulomb crystals in plasmas are simulated. Formation of a quasi-two-dimensional Coulomb crystal has been observed under typical laboratory conditions. Another example was to simulate movement of dust particles in plasmas under microgravity conditions. Fully three-dimensional spherical structures of dust clouds have been observed. For the simulation of a dusty plasma in microgravity with 3x10^4 particles, GRAPE-6 can perform the whole operation 1000 times faster than by using a Pentium 4 1.6 GHz processor.

  14. Optimizing the inner loop of the gravitational force interaction on modern processors

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Warren, Michael S

    2010-12-08

    We have achieved superior performance on multiple generations of the fastest supercomputers in the world with our hashed oct-tree N-body code (HOT), spanning almost two decades and garnering multiple Gordon Bell Prizes for significant achievement in parallel processing. Execution time for our N-body code is largely influenced by the force calculation in the inner loop. Improvements to the inner loop using SSE3 instructions have enabled the calculation of over 200 million gravitational interactions per second per processor on a 2.6 GHz Opteron, for a computational rate of over 7 Gflops in single precision (70% of peak). We obtain optimal performance on some processors (including the Cell) by decomposing the reciprocal square root function required for a gravitational interaction into a table lookup, Chebychev polynomial interpolation, and Newton-Raphson iteration, using the algorithm of Karp. By unrolling the loop by a factor of six, and using SPU intrinsics to compute on vectors, we obtain performance of over 16 Gflops on a single Cell SPE. Aggregated over the 8 SPEs on a Cell processor, the overall performance is roughly 130 Gflops. In comparison, the ordinary C version of our inner loop only obtains 1.6 Gflops per SPE with the spuxlc compiler.
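
    The Newton-Raphson stage of such a reciprocal square root can be sketched as below. This is a simplified illustration and not Karp's algorithm as used in HOT: the table lookup and Chebychev interpolation are replaced here by a crude power-of-two initial guess (an assumption made for brevity), so more iterations are needed than a good seed would require. The function name `rsqrt_newton` is hypothetical.

```python
import math


def rsqrt_newton(x, iterations=6):
    """Approximate 1/sqrt(x) using only multiplies and adds in the refinement loop."""
    if x <= 0.0:
        raise ValueError("x must be positive")
    # Crude seed: with x = m * 2**e and 0.5 <= m < 1, start from 2**(-(e // 2)).
    # This keeps x * y * y within [0.5, 2), inside the convergence region.
    _, exponent = math.frexp(x)
    y = math.ldexp(1.0, -(exponent // 2))
    for _ in range(iterations):
        # Newton-Raphson step for f(y) = 1/y**2 - x; no division or sqrt needed.
        y = y * (1.5 - 0.5 * x * y * y)
    return y


# Usage: the 1/r**3 factor of a gravitational interaction from a squared distance.
r_squared = 7.0
inv_r = rsqrt_newton(r_squared)
inv_r_cubed = inv_r * inv_r * inv_r
```

    Each iteration roughly squares the relative error, which is why a better seed translates directly into fewer refinement steps.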

  15. Interactive collision detection for deformable models using streaming AABBs.

    PubMed

    Zhang, Xinyu; Kim, Young J

    2007-01-01

    We present an interactive and accurate collision detection algorithm for deformable, polygonal objects based on the streaming computational model. Our algorithm can detect all possible pairwise primitive-level intersections between two severely deforming models at highly interactive rates. In our streaming computational model, we consider a set of axis aligned bounding boxes (AABBs) that bound each of the given deformable objects as an input stream and perform massively-parallel pairwise overlap tests on the incoming streams. As a result, we are able to prevent performance stalls in the streaming pipeline that can be caused by the expensive indexing mechanism required by bounding volume hierarchy-based streaming algorithms. At runtime, as the underlying models deform over time, we employ a novel streaming algorithm to update the geometric changes in the AABB streams. Moreover, in order to get only the computed result (i.e., collision results between AABBs) without reading back the entire output streams, we propose a streaming en/decoding strategy that can be performed in a hierarchical fashion. After determining overlapped AABBs, we perform primitive-level (e.g., triangle) intersection checking on a serial computational model such as CPUs. We implemented the entire pipeline of our algorithm using off-the-shelf graphics processors (GPUs), such as nVIDIA GeForce 7800 GTX, for streaming computations, and Intel Dual Core 3.4 GHz processors for serial computations. We benchmarked our algorithm with different models of varying complexities, ranging from 15K up to 50K triangles, under various deformation motions, and the timings were obtained as 30 to 100 FPS depending on the complexity of models and their relative configurations. Finally, we made comparisons with a well-known GPU-based collision detection algorithm, CULLIDE [4], and observed about three times performance improvement over the earlier approach. We also made comparisons with a SW-based AABB culling algorithm [2] and observed about two times improvement.
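
    The pairwise AABB overlap test at the core of this pipeline is easy to state in serial form. The sketch below is an illustration only, not the authors' GPU streaming implementation; the `AABB` type, the helper names, and the example triangles are assumptions.

```python
from itertools import product
from typing import NamedTuple, Tuple


class AABB(NamedTuple):
    lo: Tuple[float, float, float]  # minimum corner
    hi: Tuple[float, float, float]  # maximum corner


def aabb_of_triangle(triangle):
    """Smallest axis aligned box containing the three vertices."""
    xs, ys, zs = zip(*triangle)
    return AABB((min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs)))


def overlaps(a, b):
    """Two boxes overlap only if their intervals overlap on every axis."""
    return all(a.lo[k] <= b.hi[k] and b.lo[k] <= a.hi[k] for k in range(3))


def overlapping_pairs(boxes_a, boxes_b):
    """All-pairs test; each pair is independent, which suits data-parallel hardware."""
    return [(i, j)
            for (i, a), (j, b) in product(enumerate(boxes_a), enumerate(boxes_b))
            if overlaps(a, b)]


model_a = [((0, 0, 0), (1, 0, 0), (0, 1, 0))]
model_b = [((0.5, 0.5, -1), (0.5, 0.5, 1), (2, 2, 0)),
           ((5, 5, 5), (6, 5, 5), (5, 6, 5))]
pairs = overlapping_pairs([aabb_of_triangle(t) for t in model_a],
                          [aabb_of_triangle(t) for t in model_b])
```

    Only the surviving pairs would then go on to exact triangle-level intersection checks.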

  16. 75 FR 21353 - Intel Corporation, Fab 20 Division, Including On-Site Leased Workers From Volt Technical...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2010-04-23

    ... DEPARTMENT OF LABOR Employment and Training Administration [TA-W-73,642] Intel Corporation, Fab 20... of Intel Corporation, Fab 20 Division, including on-site leased workers of Volt Technical Resources... Precision, Inc. were employed on-site at the Hillsboro, Oregon location of Intel Corporation, Fab 20...

  17. A Simple GPU-Accelerated Two-Dimensional MUSCL-Hancock Solver for Ideal Magnetohydrodynamics

    NASA Technical Reports Server (NTRS)

    Bard, Christopher; Dorelli, John C.

    2013-01-01

    We describe our experience using NVIDIA's CUDA (Compute Unified Device Architecture) C programming environment to implement a two-dimensional second-order MUSCL-Hancock ideal magnetohydrodynamics (MHD) solver on a GTX 480 Graphics Processing Unit (GPU). Taking a simple approach in which the MHD variables are stored exclusively in the global memory of the GTX 480 and accessed in a cache-friendly manner (without further optimizing memory access by, for example, staging data in the GPU's faster shared memory), we achieved a maximum speed-up of approximately 126 for a 1024^2 grid relative to the sequential C code running on a single Intel Nehalem (2.8 GHz) core. This speedup is consistent with simple estimates based on the known floating point performance, memory throughput and parallel processing capacity of the GTX 480.

  18. A Linux PC Cluster for Lattice QCD with Exact Chiral Symmetry

    NASA Astrophysics Data System (ADS)

    Chiu, Ting-Wai; Hsieh, Tung-Han; Huang, Chao-Hsi; Huang, Tsung-Ren

    A computational system for lattice QCD with overlap Dirac quarks is described. The platform is a home-made Linux PC cluster, built with off-the-shelf components. At present the system consists of 64 nodes, with each node consisting of one Pentium 4 processor (1.6/2.0/2.5 GHz), one Gbyte of PC800/1066 RDRAM, one 40/80/120 Gbyte hard disk, and a network card. The computationally intensive parts of our program are written in SSE2 code. The speed of our system is estimated to be 70 Gflops, and its price/performance ratio is better than $1.0/Mflops for 64-bit (double precision) computations in quenched QCD. We discuss how to optimize its hardware and software for computing propagators of overlap Dirac quarks.

  19. How Managers' everyday decisions create or destroy your company's strategy.

    PubMed

    Bower, Joseph L; Gilbert, Clark G

    2007-02-01

    Senior executives have long been frustrated by the disconnection between the plans and strategies they devise and the actual behavior of the managers throughout the company. This article approaches the problem from the ground up, recognizing that every time a manager allocates resources, that decision moves the company either into or out of alignment with its announced strategy. A well-known story--Intel's exit from the memory business--illustrates this point. When discussing what businesses Intel should be in, Andy Grove asked Gordon Moore what they would do if Intel were a company that they had just acquired. When Moore answered, "Get out of memory," they decided to do just that. It turned out, though, that Intel's revenues from memory were by this time only 4% of total sales. Intel's lower-level managers had already exited the business. What Intel hadn't done was to shut down the flow of research funding into memory (which was still eating up one-third of all research expenditures); nor had the company announced its exit to the outside world. Because divisional and operating managers-as well as customers and capital markets-have such a powerful impact on the realized strategy of the firm, senior management might consider focusing less on the company's formal strategy and more on the processes by which the company allocates resources. Top managers must know the track record of the people who are making resource allocation proposals; recognize the strategic issues at stake; reach down to operational managers to work across division lines; frame resource questions to reflect the corporate perspective, especially when large sums of money are involved and conditions are highly uncertain; and create a new context that allows top executives to circumvent the regular resource allocation process when necessary.

  20. Using the Intel Math Kernel Library on Peregrine | High-Performance Computing | NREL

    Science.gov Websites

    Learn how to use the Intel Math Kernel Library (MKL) with Peregrine system software. MKL architectures. Core math functions in MKL include BLAS, LAPACK, ScaLAPACK, sparse solvers, fast Fourier

  1. Enhanced tactical radar correlator (ETRAC): true interoperability of the 1990s

    NASA Astrophysics Data System (ADS)

    Guillen, Frank J.

    1994-10-01

    The enhanced tactical radar correlator (ETRAC) system is under development at Westinghouse Electric Corporation for the Army Space Program Office (ASPO). ETRAC is a real-time synthetic aperture radar (SAR) processing system that provides tactical IMINT to the corps commander. It features an open architecture comprised of ruggedized commercial-off-the-shelf (COTS), UNIX based workstations and processors. The architecture features the DoD common SAR processor (CSP), a multisensor computing platform to accommodate a variety of current and future imaging needs. ETRAC's principal functions include: (1) Mission planning and control -- ETRAC provides mission planning and control for the U-2R and ASARS-2 sensor, including capability for auto replanning, retasking, and immediate spot. (2) Image formation -- the image formation processor (IFP) provides the CPU intensive processing capability to produce real-time imagery for all ASARS imaging modes of operation. (3) Image exploitation -- two exploitation workstations are provided for first-phase image exploitation, manipulation, and annotation. Products include INTEL reports, annotated NITF SID imagery, high resolution hard copy prints and targeting data. ETRAC is transportable via two C-130 aircraft, with autonomous drive on/off capability for high mobility. Other autonomous capabilities include rapid setup/tear down, extended stand-alone support, internal environmental control units (ECUs) and power generation. ETRAC's mission is to provide the Army field commander with accurate, reliable, and timely imagery intelligence derived from collections made by the ASARS-2 sensor, located on-board the U-2R aircraft. To accomplish this mission, ETRAC receives video phase history (VPH) directly from the U-2R aircraft and converts it in real time into soft copy imagery for immediate exploitation and dissemination to the tactical users.

  2. Coding for parallel execution of hardware-in-the-loop millimeter-wave scene generation models on multicore SIMD processor architectures

    NASA Astrophysics Data System (ADS)

    Olson, Richard F.

    2013-05-01

    Rendering of point scatterer based radar scenes for millimeter wave (mmW) seeker tests in real-time hardware-in-the-loop (HWIL) scene generation requires efficient algorithms and vector-friendly computer architectures for complex signal synthesis. New processor technology from Intel implements an extended 256-bit vector SIMD instruction set (AVX, AVX2) in a multi-core CPU design providing peak execution rates of hundreds of GigaFLOPS (GFLOPS) on one chip. Real world mmW scene generation code can approach peak SIMD execution rates only after careful algorithm and source code design. An effective software design will maintain high computing intensity emphasizing register-to-register SIMD arithmetic operations over data movement between CPU caches or off-chip memories. Engineers at the U.S. Army Aviation and Missile Research, Development and Engineering Center (AMRDEC) applied two basic parallel coding methods to assess new 256-bit SIMD multi-core architectures for mmW scene generation in HWIL. These include use of POSIX threads built on vector library functions and more portable, high-level parallel code based on compiler technology (e.g. OpenMP pragmas and SIMD autovectorization). Since CPU technology is rapidly advancing toward high processor core counts and TeraFLOPS peak SIMD execution rates, it is imperative that coding methods be identified which produce efficient and maintainable parallel code. This paper describes the algorithms used in point scatterer target model rendering, the parallelization of those algorithms, and the execution performance achieved on an AVX multi-core machine using the two basic parallel coding methods. The paper concludes with estimates for scale-up performance on upcoming multi-core technology.

  3. Scaling deep learning workloads: NVIDIA DGX-1/Pascal and Intel Knights Landing

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Gawande, Nitin A.; Landwehr, Joshua B.; Daily, Jeffrey A.

    Deep Learning (DL) algorithms have become ubiquitous in data analytics. As a result, major computing vendors --- including NVIDIA, Intel, AMD, and IBM --- have architectural road-maps influenced by DL workloads. Furthermore, several vendors have recently advertised new computing products as accelerating large DL workloads. Unfortunately, it is difficult for data scientists to quantify the potential of these different products. This paper provides a performance and power analysis of important DL workloads on two major parallel architectures: NVIDIA DGX-1 (eight Pascal P100 GPUs interconnected with NVLink) and Intel Knights Landing (KNL) CPUs interconnected with Intel Omni-Path or Cray Aries. Our evaluation consists of a cross section of convolutional neural net workloads: CifarNet, AlexNet, GoogLeNet, and ResNet50 topologies using the Cifar10 and ImageNet datasets. The workloads are vendor-optimized for each architecture. Our analysis indicates that although GPUs provide the highest overall performance, the gap can close for some convolutional networks; and the KNL can be competitive in performance/watt. We find that NVLink facilitates scaling efficiency on GPUs. However, its importance is heavily dependent on neural network architecture. Furthermore, for weak-scaling --- sometimes encouraged by restricted GPU memory --- NVLink is less important.

  4. Design and fabrication of pHEMT MMIC switches for IEEE 802.11.a/b/g WLAN applications

    NASA Astrophysics Data System (ADS)

    Mun, Jae Kyoung; Ji, Hong Gu; Ahn, Hyokyun; Kim, Haecheon; Park, Chong-Ook

    2005-08-01

    In this paper, we propose a channel structure for a promising switch pHEMT with excellent isolation characteristics based on the distribution of electric field intensity beneath the Schottky contact in the transistor. Using the proposed device channel structure, SPST and SPDT switches were designed and fabricated, applicable to 2.4 GHz and 5.8 GHz WLAN systems. We discuss the relationship between dc characteristics and switch parameters in this paper in detail. The developed SPST switch exhibits a low insertion loss of 0.26 dB and a high isolation of 34.3 dB with a control voltage of 0 V/-3 V at 5.8 GHz. The SPDT also shows a good performance of 0.85 dB insertion loss and 31.5 dB isolation under the same conditions. The measured power-handling capability at 2.4 GHz reveals that the SPDT has an output power of 27 dBm at the 1 dB compression point and a third-order intercept point of more than 46 dBm.

  5. Eliminating livelock by assigning the same priority state to each message that is input into a flushable routing system during N time intervals

    DOEpatents

    Faber, V.

    1994-11-29

    Livelock-free message routing is provided in a network of interconnected nodes that is flushable in time T. An input message processor generates sequences of at least N time intervals, each of duration T. An input register provides for receiving and holding each input message, where the message is assigned a priority state p during an nth one of the N time intervals. At each of the network nodes a message processor reads the assigned priority state and awards priority to messages with priority state (p-1) during an nth time interval and to messages with priority state p during an (n+1) th time interval. The messages that are awarded priority are output on an output path toward the addressed output message processor. Thus, no message remains in the network for a time longer than T. 4 figures.

  6. A real-time robot arm collision avoidance system

    NASA Technical Reports Server (NTRS)

    Shaffer, Clifford A.; Herb, Gregory M.

    1992-01-01

    A data structure and update algorithm are presented for a prototype real-time collision avoidance safety system simulating a multirobot workspace. The data structure is a variant of the octree, which serves as a spatial index. An octree recursively decomposes 3D space into eight equal cubic octants until each octant meets some decomposition criteria. The N-objects octree, which indexes a collection of 3D primitive solids is used. These primitives make up the two (seven-degrees-of-freedom) robot arms and workspace modeled by the system. As robot arms move, the octree is updated to reflect their changed positions. During most update cycles, any given primitive does not change which octree nodes it is in. Thus, modification to the octree is rarely required. Cycle time for interpreting current arm joint angles, updating the octree to reflect new positions, and detecting/reporting imminent collisions averages 30 ms on an Intel 80386 processor running at 20 MHz.
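
    The recursive octant decomposition described above can be sketched in a few lines. This is an illustrative point-location routine only, not the N-objects octree of the paper: the names `octant_index`, `child_center`, and `locate` are hypothetical, and a fixed depth replaces the paper's decomposition criteria.

```python
def octant_index(point, center):
    """Index 0-7 of the child octant containing point (one bit per axis)."""
    x, y, z = point
    cx, cy, cz = center
    return int(x >= cx) | (int(y >= cy) << 1) | (int(z >= cz) << 2)

def child_center(center, half, index):
    """Center of child octant `index` of a cube with the given center and half-width."""
    q = half / 2
    return tuple(c + (q if (index >> axis) & 1 else -q)
                 for axis, c in enumerate(center))

def locate(point, center, half, depth):
    """Path of octant indices from the root cube down to `depth` levels."""
    path = []
    for _ in range(depth):
        i = octant_index(point, center)
        path.append(i)
        center = child_center(center, half, i)
        half /= 2
    return path

# A small move that stays inside the same leaf leaves the path unchanged,
# so no octree modification would be needed for that update cycle.
before = locate((3.0, 1.0, 1.0), (0.0, 0.0, 0.0), 4.0, 2)
after = locate((3.1, 1.1, 0.9), (0.0, 0.0, 0.0), 4.0, 2)
```

    Comparing `before` and `after` mirrors the observation in the abstract that, during most update cycles, a primitive stays in the same octree nodes.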

  7. (U) Status of Trinity and Crossroads Systems

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Archer, Billy Joe; Lujan, James Westley; Hemmert, K. S.

    2017-01-10

    (U) This paper provides a general overview of current and future plans for the Advanced Simulation and Computing (ASC) Advanced Technology (AT) systems fielded by the New Mexico Alliance for Computing at Extreme Scale (ACES), a collaboration between Los Alamos Laboratory and Sandia National Laboratories. Additionally, this paper touches on research of technology beyond traditional CMOS. The status of Trinity, ASC's first AT system, and Crossroads, anticipated to succeed Trinity as the third AT system in 2020, will be presented, along with initial performance studies of the Intel Knights Landing Xeon Phi processors, introduced on Trinity. The challenges and opportunities for our production simulation codes on AT systems will also be discussed. Trinity and Crossroads are a joint procurement by ACES and Lawrence Berkeley Laboratory as part of the Alliance for application Performance at EXtreme scale (APEX) http://apex.lanl.gov.

  8. Microcomputer control soft tube measuring-testing instrument

    NASA Astrophysics Data System (ADS)

    Zhou, Yanzhou; Jiang, Xiu-Zhen; Wang, Wen-Yi

    1993-09-01

    Soft tubes are key, easily damaged parts used in large numbers by vehicles in transportation. For a long time the tubes were measured and tested by hand. In cooperation with the Harbin Railway Bureau, we have recently developed a new automatic measuring and testing instrument. This paper presents the instrument's structure, properties, and measuring principle in detail. The center of the system is an Intel 80C31 single-chip processor, which collects and processes data and displays the results on an LED display. It also controls the electromagnetic valves and motors. Five soft tubes are measured and tested at the same time, and the whole process is completed automatically. Measures against electromagnetic interference are adopted in both hardware and software, so the performance of the instrument is improved significantly. Over the long run the instrument is reliable and practical, and it solves a difficult problem in railway transportation.

  9. Fabrication of Circuit QED Quantum Processors, Part 2: Advanced Semiconductor Manufacturing Perspectives

    NASA Astrophysics Data System (ADS)

    Michalak, D. J.; Bruno, A.; Caudillo, R.; Elsherbini, A. A.; Falcon, J. A.; Nam, Y. S.; Poletto, S.; Roberts, J.; Thomas, N. K.; Yoscovits, Z. R.; Dicarlo, L.; Clarke, J. S.

    Experimental quantum computing is rapidly approaching the integration of sufficient numbers of quantum bits for interesting applications, but many challenges still remain. These challenges include: realization of an extensible design for large array scale up, sufficient material process control, and discovery of integration schemes compatible with industrial 300 mm fabrication. We present recent developments in extensible circuits with vertical delivery. Toward the goal of developing a high-volume manufacturing process, we will present recent results on a new Josephson junction process that is compatible with current tooling. We will then present the improvements in NbTiN material uniformity that typical 300 mm fabrication tooling can provide. While initial results on few-qubit systems are encouraging, advanced processing control is expected to deliver the improvements in qubit uniformity, coherence time, and control required for larger systems. Research funded by Intel Corporation.

  10. Automatic generation of efficient array redistribution routines for distributed memory multicomputers

    NASA Technical Reports Server (NTRS)

    Ramaswamy, Shankar; Banerjee, Prithviraj

    1994-01-01

    Appropriate data distribution has been found to be critical for obtaining good performance on Distributed Memory Multicomputers like the CM-5, Intel Paragon and IBM SP-1. It has also been found that some programs need to change their distributions during execution for better performance (redistribution). This work focuses on automatically generating efficient routines for redistribution. We present a new mathematical representation for regular distributions called PITFALLS and then discuss algorithms for redistribution based on this representation. One of the significant contributions of this work is being able to handle arbitrary source and target processor sets while performing redistribution. Another important contribution is the ability to handle an arbitrary number of dimensions for the array involved in the redistribution in a scalable manner. Our implementation of these techniques is based on an MPI-like communication library. The results presented show the low overheads for our redistribution algorithm as compared to naive runtime methods.
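
    To make the redistribution problem concrete, the sketch below computes, for a one-dimensional array, which elements each source processor must send to each target processor when moving between two block-cyclic distributions. It is a deliberately naive element-by-element method of the kind the abstract uses as a baseline, not the PITFALLS representation; the names `owner` and `send_sets` and the example sizes are assumptions made for illustration.

```python
def owner(i, block, nprocs):
    """Processor owning element i under a block-cyclic distribution."""
    return (i // block) % nprocs

def send_sets(n, src_block, src_procs, dst_block, dst_procs):
    """Map (source proc, target proc) -> list of element indices to transfer."""
    sets = {}
    for i in range(n):
        key = (owner(i, src_block, src_procs), owner(i, dst_block, dst_procs))
        sets.setdefault(key, []).append(i)
    return sets

# 8 elements: blocks of 2 on 2 processors -> blocks of 1 on 4 processors.
plan = send_sets(8, src_block=2, src_procs=2, dst_block=1, dst_procs=4)
```

    The loop visits every element, so its cost grows with the array size; avoiding that per-element work is the kind of overhead a closed-form representation of regular distributions is meant to remove.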

  11. A Parallel Multigrid Solver for Viscous Flows on Anisotropic Structured Grids

    NASA Technical Reports Server (NTRS)

    Prieto, Manuel; Montero, Ruben S.; Llorente, Ignacio M.; Bushnell, Dennis M. (Technical Monitor)

    2001-01-01

    This paper presents an efficient parallel multigrid solver for speeding up the computation of a 3-D model that treats the flow of a viscous fluid over a flat plate. The main interest of this simulation lies in exhibiting some basic difficulties that prevent optimal multigrid efficiencies from being achieved. As the computing platform, we have used Coral, a Beowulf-class system based on Intel Pentium processors and equipped with GigaNet cLAN and switched Fast Ethernet networks. Our study not only examines the scalability of the solver but also includes a performance evaluation of Coral where the investigated solver has been used to compare several of its design choices, namely, the interconnection network (GigaNet versus switched Fast-Ethernet) and the node configuration (dual nodes versus single nodes). As a reference, the performance results have been compared with those obtained with the NAS-MG benchmark.

  12. RFI in the 0.5 to 10.8 GHz Band at the Allen Telescope Array

    NASA Astrophysics Data System (ADS)

    Backus, Peter R.; Kilsdonk, T. N.; Allen Telescope Array Team

    2007-05-01

    Thanks to funding from the Paul G. Allen Foundation (and other philanthropic supporters) for the technology development and first phase of construction, the first 42 elements of the Allen Telescope Array (ATA-42) are being commissioned for rapid surveys of the astrophysical and technological sky. Because of the innovative design of this array that will eventually include 350 elements, traditional radio astronomy and SETI are enabled simultaneously 24x7. The array has been designed to provide an optimal snapshot image of a very large field of view and simultaneously, 16 (dual polarization) phased beams within the field of view to be analyzed by a suite of backend processors. Four independent 100 MHz bands may be tuned anywhere within the instantaneous receiver bandwidth from 0.5 to 11.2 GHz. One key to the success of rapid surveys for astrophysical or technological signals is a quiet background. This poster presents the results of initial surveys with 6.1 meter dishes at high-spectral-resolution of the background spectrum from 0.5 to 10.8 GHz at the Hat Creek Radio Observatory, where the ATA is being constructed, and compares it with the background spectrum from 1.2-3 GHz at other observatories where SETI observations have been conducted within the past 11 years.

  13. InP Devices For Millimeter-Wave Monolithic Circuits

    NASA Astrophysics Data System (ADS)

    Binari, S. C.; Neidert, R. E.; Dietrich, H. B.

    1989-11-01

    High efficiency, mm-wave operation has been obtained from lateral transferred-electron devices (TEDs) designed with a high resistivity region located near the cathode contact. At 29.9 GHz, a CW power output of 29.1 mW with a conversion efficiency of 6.7% has been achieved with cavity-tuned discrete devices. This result represents the highest power output and efficiency of a lateral TED in this frequency range. The lateral devices also had a CW power output of 0.4 mW at 98.5 GHz and 0.9 mW at 75.2 GHz. In addition, a monolithic oscillator incorporating the lateral TED has been demonstrated at 79.9 GHz. InP Schottky-barrier diodes have been fabricated using selective MeV ion implantation into semi-insulating InP substrates. Using Si implantation with energies of up to 6.0 MeV, n+ layers as deep as 3 μm with peak carrier concentrations of 2 × 10^18 cm^-3 have been obtained. These devices have been evaluated as mixers and detectors at 94 GHz and have demonstrated a conversion loss of 7.6 dB and a zero-bias detector sensitivity as high as 400 mV/mW.

  14. Parallel hyperspectral image reconstruction using random projections

    NASA Astrophysics Data System (ADS)

    Sevilla, Jorge; Martín, Gabriel; Nascimento, José M. P.

    2016-10-01

    Spaceborne sensor systems are characterized by scarce onboard computing and storage resources and by communication links with reduced bandwidth. Random projection techniques have been demonstrated to be an effective and very lightweight way to reduce the number of measurements in hyperspectral data, which reduces the data to be transmitted to the Earth station. However, the reconstruction of the original data from the random projections may be computationally expensive. SpeCA is a blind hyperspectral reconstruction technique that exploits the fact that hyperspectral vectors often belong to a low dimensional subspace. SpeCA has shown promising results in the task of recovering hyperspectral data from a reduced number of random measurements. In this manuscript we focus on the implementation of the SpeCA algorithm for graphics processing units (GPU) using the compute unified device architecture (CUDA). Experimental results obtained using synthetic and real hyperspectral datasets on an NVIDIA GeForce GTX 980 GPU reveal that the use of GPUs can provide real-time reconstruction. The achieved speedup is up to 22 times compared with the processing time of SpeCA running on one core of the Intel i7-4790K CPU (3.4 GHz) with 32 Gbyte of memory.

  15. Recent developments using TowerJazz SiGe BiCMOS platform for mmWave and THz applications

    NASA Astrophysics Data System (ADS)

    Kar-Roy, Arjun; Howard, David; Preisler, Edward J.; Racanelli, Marco

    2013-05-01

    In this paper, we report on the highest speed 240GHz/340GHz FT/FMAX NPN which is now available for product designs in the SBC18H4 process variant of TowerJazz's mature 0.18μm SBC18 silicon germanium (SiGe) BiCMOS technology platform. NFMIN of ~2dB at 50GHz has been obtained with these NPNs. We also describe the integration of earlier generation NPNs with FT/FMAX of 240GHz/280GHz into SBC13H3, a 0.13μm SiGe BiCMOS technology platform. Next, we detail the integration of the deep silicon via (DSV), through silicon via (TSV), high-resistivity substrate, sub-field stitching and hybrid-stitching capability into the 0.18μm SBC18 technology platform to enable higher performance and highly integrated product designs. The integration of SBC18H3 into a thick-film SOI substrate, with essentially unchanged FT and FMAX, is also described. We also report on recent circuit demonstrations using the SBC18H3 platform: (1) a 4-element phased-array 70-100GHz broadband transmit and receive chip with flat saturated power greater than 5dBm and conversion gain of 33dB; (2) a fully integrated W-band 9-element phase-controllable array with responsivity of 800MV/W and a receiver NETD of 0.45K with 20ms integration time; (3) a 16-element 4x4 phased-array transmitter with scanning in both the E- and H-planes with maximum EIRP of 23-25 dBm at 100-110GHz; (4) a power efficient 200GHz VCO with -7.25dBm output power and tuning range of 3.5%; and (5) a 320GHz 16-element imaging receiver array with responsivity of 18KV/W at 315GHz, a 3dB bandwidth of 25GHz and a low NEP of 34pW/Hz^(1/2). Wafer-scale large-die implementation of the phased-arrays and mmWave imagers using stitching in TowerJazz SBC18 process are also discussed.

  16. Accelerating Monte Carlo simulations with an NVIDIA ® graphics processor

    NASA Astrophysics Data System (ADS)

    Martinsen, Paul; Blaschke, Johannes; Künnemeyer, Rainer; Jordan, Robert

    2009-10-01

    Modern graphics cards, commonly used in desktop computers, have evolved beyond a simple interface between processor and display to incorporate sophisticated calculation engines that can be applied to general purpose computing. The Monte Carlo algorithm for modelling photon transport in turbid media has been implemented on an NVIDIA ® 8800 GT graphics card using the CUDA toolkit. The Monte Carlo method relies on following the trajectory of millions of photons through the sample, often taking hours or days to complete. The graphics-processor implementation, processing roughly 110 million scattering events per second, was found to run more than 70 times faster than a similar, single-threaded implementation on a 2.67 GHz desktop computer. Program summary. Program title: Phoogle-C/Phoogle-G Catalogue identifier: AEEB_v1_0 Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEEB_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 51 264 No. of bytes in distributed program, including test data, etc.: 2 238 805 Distribution format: tar.gz Programming language: C++ Computer: Designed for Intel PCs. Phoogle-G requires a NVIDIA graphics card with support for CUDA 1.1 Operating system: Windows XP Has the code been vectorised or parallelized?: Phoogle-G is written for SIMD architectures RAM: 1 GB Classification: 21.1 External routines: Charles Karney Random number library. Microsoft Foundation Class library. NVIDIA CUDA library [1]. Nature of problem: The Monte Carlo technique is an effective algorithm for exploring the propagation of light in turbid media. However, accurate results require tracing the path of many photons within the media. The independence of photons naturally lends the Monte Carlo technique to implementation on parallel architectures. 
Generally, parallel computing can be expensive, but recent advances in consumer grade graphics cards have opened the possibility of high-performance desktop parallel-computing. Solution method: In this pair of programmes we have implemented the Monte Carlo algorithm described by Prahl et al. [2] for photon transport in infinite scattering media to compare the performance of two readily accessible architectures: a standard desktop PC and a consumer grade graphics card from NVIDIA. Restrictions: The graphics card implementation uses single precision floating point numbers for all calculations. Only photon transport from an isotropic point-source is supported. The graphics-card version has no user interface. The simulation parameters must be set in the source code. The desktop version has a simple user interface; however some properties can only be accessed through an ActiveX client (such as Matlab). Additional comments: The random number library used has a LGPL (http://www.gnu.org/copyleft/lesser.html) licence. Running time: Runtime can range from minutes to months depending on the number of photons simulated and the optical properties of the medium. References: [1] http://www.nvidia.com/object/cuda_home.html. [2] S. Prahl, M. Keijzer, S.L. Jacques, A. Welch, SPIE Institute Series 5 (1989) 102.
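
    The kind of photon random walk summarized above can be sketched as follows. This is a simplified single-threaded illustration, not the Phoogle code or the exact algorithm of Prahl et al.: it assumes isotropic scattering, uses a plain weight cutoff instead of any variance-reduction scheme, and the names `random_direction`, `trace_photon`, `mu_a`, and `mu_s` (absorption and scattering coefficients) are hypothetical.

```python
import math
import random

def random_direction(rng):
    """Uniformly distributed unit vector (isotropic emission or scattering)."""
    cos_t = 2.0 * rng.random() - 1.0
    sin_t = math.sqrt(1.0 - cos_t * cos_t)
    phi = 2.0 * math.pi * rng.random()
    return (sin_t * math.cos(phi), sin_t * math.sin(phi), cos_t)

def trace_photon(mu_a, mu_s, rng, min_weight=1e-4):
    """Follow one photon packet from an isotropic point source at the origin.

    Returns (number of scattering events, final position)."""
    mu_t = mu_a + mu_s                 # total interaction coefficient
    albedo = mu_s / mu_t               # fraction of weight surviving each event
    pos = [0.0, 0.0, 0.0]
    weight = 1.0
    events = 0
    direction = random_direction(rng)
    while weight > min_weight:
        # Exponentially distributed free path; 1 - random() lies in (0, 1].
        step = -math.log(1.0 - rng.random()) / mu_t
        pos = [p + step * d for p, d in zip(pos, direction)]
        weight *= albedo               # deposit the absorbed fraction
        direction = random_direction(rng)
        events += 1
    return events, pos

rng = random.Random(0)
events, final_pos = trace_photon(mu_a=1.0, mu_s=9.0, rng=rng)
```

    Each call to `trace_photon` is independent of every other call, which is the property the abstract relies on when mapping photons to parallel hardware.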

  17. Integrated amateur band and ultra-wide band monopole antenna with multiple band-notched

    NASA Astrophysics Data System (ADS)

    Srivastava, Kunal; Kumar, Ashwani; Kanaujia, B. K.; Dwari, Santanu

    2018-05-01

    This paper presents an integrated amateur band and ultra-wide band (UWB) monopole antenna with multiple band-notched characteristics. It is designed to avoid potential interference at the frequencies 3.99 GHz (3.83 GHz-4.34 GHz), 4.86 GHz (4.48 GHz-5.63 GHz), 7.20 GHz (6.10 GHz-7.55 GHz) and 8.0 GHz (7.62 GHz-8.47 GHz) with VSWR 4.9, 11.5, 6.4 and 5.3, respectively. Equivalent parallel resonant circuits are presented for each band-notched frequency of the antenna. The antenna operates in the amateur band at 1.2 GHz (1.05 GHz-1.3 GHz) and in the UWB band from 3.2 GHz-13.9 GHz. Different substrates are used to verify the working of the proposed antenna. An integrated GSM band from 0.6 GHz to 1.8 GHz can also be achieved by changing the radius of the radiating patch. The antenna gain varies from 1.4 dBi to 9.8 dBi. Measured results are presented to validate the antenna performance.

  18. Analysis OpenMP performance of AMD and Intel architecture for breaking waves simulation using MPS

    NASA Astrophysics Data System (ADS)

    Alamsyah, M. N. A.; Utomo, A.; Gunawan, P. H.

    2018-03-01

    A simulation of breaking waves using the Navier-Stokes equations via the moving particle semi-implicit (MPS) method over a closed domain is presented. The results show that parallel computing on a multicore architecture using OpenMP can reduce the computational time to almost half of the serial time. Two computer architectures (AMD and Intel) are compared. The Intel architecture performs better than the AMD architecture in CPU time. In efficiency, however, the computer with the AMD architecture scores slightly higher than the Intel one. For the simulation with 1512 particles, the CPU times using Intel and AMD are 12662.47 and 28282.30, respectively. For a similar number of particles, AMD obtains an efficiency of 50.09% and Intel up to 49.42%.
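
    For readers unfamiliar with the metrics, the standard definitions of speedup and parallel efficiency can be written out as below. The abstract does not state the serial times or thread counts, so the timing values in the example are purely hypothetical; only the two CPU times in the ratio are taken from the abstract.

```python
def speedup(t_serial, t_parallel):
    """Ratio of serial to parallel run time."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, n_threads):
    """Speedup divided by the number of threads (1.0 means ideal scaling)."""
    return speedup(t_serial, t_parallel) / n_threads

# Hypothetical illustration: halving the run time on 4 threads.
example_eff = efficiency(t_serial=100.0, t_parallel=50.0, n_threads=4)

# Ratio of the two CPU times reported in the abstract (AMD / Intel).
cpu_time_ratio = 28282.30 / 12662.47
```

    The ratio of the reported CPU times is about 2.23, which is how the two platforms can differ widely in run time while showing nearly equal efficiency.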

  19. 8.64-11.62 GHz CMOS VCO and divider in a zero-IF 802.11a/b/g WLAN and Bluetooth application

    NASA Astrophysics Data System (ADS)

    Yu, Sun; Niansong, Mei; Bo, Lu; Yumei, Huang; Zhiliang, Hong

    2010-10-01

    A fully integrated VCO and divider implemented in SMIC 0.13-μm RFCMOS 1P8M technology with a 1.2 V supply voltage is presented. The VCO tunes from 8.64 to 11.62 GHz, and the quadrature LO signals for 802.11a WLAN in the 5.8 GHz band or for 802.11b/g WLAN and Bluetooth in the 2.4 GHz band are obtained through frequency division by 2 or 4, respectively. A 6-bit switched-capacitor array is applied for precise tuning of all necessary frequency bands. The test results show that the VCO has a phase noise of -113 dBc at a 1 MHz offset from the 5.5 GHz carrier obtained by dividing the VCO output by two, and that the VCO core consumes 3.72 mW. The figure-of-merit for the tuning range (FOMT) of the VCO is -192.6 dBc/Hz.
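
    The frequency plan in this abstract is simple arithmetic, worked through below as a small sketch. The helper name `divided_band` is hypothetical; the tuning range and division ratios are the ones stated in the abstract.

```python
def divided_band(f_low_ghz, f_high_ghz, ratio):
    """LO band (GHz) obtained by dividing a VCO tuning range by an integer ratio."""
    return (f_low_ghz / ratio, f_high_ghz / ratio)

VCO_BAND_GHZ = (8.64, 11.62)

lo_div2 = divided_band(*VCO_BAND_GHZ, 2)   # candidate LO band for 802.11a
lo_div4 = divided_band(*VCO_BAND_GHZ, 4)   # candidate LO band for 802.11b/g and Bluetooth

print(lo_div2)
print(lo_div4)
```

    Division by 2 gives 4.32 to 5.81 GHz, which contains both the 5.5 GHz carrier used for the phase-noise measurement and 5.8 GHz; division by 4 gives 2.16 to 2.905 GHz, which contains 2.4 GHz.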

  20. 40 CFR 721.3760 - Fluorene-containing diaromatic amines.

    Code of Federal Regulations, 2010 CFR

    2010-07-01

    ... amines (PMN P-88-998 and P-88-999) are subject to reporting under this section for the significant new... water. Requirements as specified in § 721.90 (a)(4), (b)(4), and (c)(4) (where n = 1). (ii) [Reserved...), (c), and (k) are applicable to manufacturers, importers, and processors of this substance. (2...

  1. 40 CFR 721.6070 - Alkyl phosphonate ammonium salts.

    Code of Federal Regulations, 2010 CFR

    2010-07-01

    ... salts (PMNs P-93-725 and P-93-726) are subject to reporting under this section for the significant new... water. Requirements as specified in § 721.90 (a)(4), (b)(4), and (c)(4) (where N = 400 ppb). (b...), and (k) are applicable to manufacturers, importers, and processors of this substance. (2) Limitations...

  2. A 75 GHz regenerative dynamic frequency divider with active transformer using InGaAs/InP HBT technology

    NASA Astrophysics Data System (ADS)

    Wang, Xi; Zhang, Bichan; Zhao, Hua; Su, Yongbo; Muhammad, Asif; Guo, Dong; Jin, Zhi

    2017-08-01

    This letter presents a high speed 2:1 regenerative dynamic frequency divider with an active transformer fabricated in 0.7 μm InP DHBT technology with $f_{T}$ of 165 GHz and $f_{\max}$ of 230 GHz. The circuit includes a two-stage active transformer, input buffer, divider core and output buffer. The core part of the frequency divider is composed of a double-balanced active mixer (widely known as the Gilbert cell) and a regenerative feedback loop. The two-stage active transformer contributes positive gain and greatly improves the phase difference. Compared with a passive transformer, the active one occupies a much smaller chip area. The area of the chip is only 469 × 414 μm² and it consumes a total DC power of only 94.6 mW from a single -4.8 V DC supply. The measured results show that the divider achieves an operating frequency bandwidth from 75 to 80 GHz, and delivers a maximum output power of -23 dBm at 37.5 GHz with a 0 dBm input signal at 75 GHz.

  3. VizieR Online Data Catalog: Panchromatic observations of PTF11qcj (Corsi+, 2014)

    NASA Astrophysics Data System (ADS)

    Corsi, A.; Ofek, E. O.; Gal-Yam, A.; Frail, D. A.; Kulkarni, S. R.; Fox, D. B.; Kasliwal, M. M.; Sullivan, M.; Horesh, A.; Carpenter, J.; Maguire, K.; Arcavi, I.; Cenko, S. B.; Cao, Y.; Mooley, K.; Pan, Y.-C.; Sesar, B.; Sternberg, A.; Xu, D.; Bersier, D.; James, P.; Bloom, J. S.; Nugent, P. E.

    2016-02-01

    On 2011 November 1, we discovered PTF11qcj in an R-band image from the 48 inch Samuel Oschin telescope at Palomar Observatory (P48), which is routinely used by the Palomar Transient Factory (PTF). Subsequent observations with the P48 were conducted with the Mould-R and Gunn-g filters. Photometry (Table2) was performed relative to the SDSS r-band and g-band magnitudes of stars in the field. Multi-color optical (gri) light curves were also obtained using the Palomar 60 inch telescope (P60) and the RATCAM optical imager on the robotic 2m Liverpool Telescope (LT) located at the Roque de Los Muchachos Observatory on La Palma. On 2011 November 15, we started a long-term monitoring campaign of PTF11qcj (along with calibrators J1327+4326 and 3C 286) with the Karl G. Jansky Very Large Array (VLA; http://public.nrao.edu/telescopes/vla) in its D, DnC, C, CnB, and A configurations, under our Target of Opportunity programs (VLA/11A-227, VLA/11B-034, VLA/11B-247, VLA/12B-195; PI: A. Corsi). The light curves of PTF11qcj at frequencies of 2.5GHz, 3.5GHz, 5GHz, 7.4GHz, 13.5GHz, 16GHz are reported in Table3. We also observed the field of PTF11qcj (together with the test calibrator J1203+480) using the Combined Array for Research in Millimeter-wave Astronomy (CARMA; http://www.mmarray.org/), at a frequency of 93GHz. The data collected on 2011 November 19 and 2011 November 26 (CARMA program no. c0857; PI: A. Horesh) both resulted in a detection of PTF11qcj (Table3). We have carried out an X-ray monitoring campaign of PTF11qcj with Chandra and Swift. All our Swift-XRT observations yielded non-detections (see Table 4), while Chandra detected PTF11qcj in three epochs (DDT proposals nos. 501793, 501794, 501797; PI: A. Corsi). The results of our X-ray follow-up are reported in Table4. We observed the position of PTF11qcj with Spitzer on two epochs (on 2012 March 28.747 and 2012 June 25.643; Table5; DDT proposal no. 31731; PI: A. Corsi).
On 2012 March 28 (Table5), we also observed the field of PTF11qcj in Ks-band with the Palomar 200 inch telescope (P200). (4 data files).

  4. Application of high-performance computing to numerical simulation of human movement

    NASA Technical Reports Server (NTRS)

    Anderson, F. C.; Ziegler, J. M.; Pandy, M. G.; Whalen, R. T.

    1995-01-01

    We have examined the feasibility of using massively-parallel and vector-processing supercomputers to solve large-scale optimization problems for human movement. Specifically, we compared the computational expense of determining the optimal controls for the single support phase of gait using a conventional serial machine (SGI Iris 4D25), a MIMD parallel machine (Intel iPSC/860), and a parallel-vector-processing machine (Cray Y-MP 8/864). With the human body modeled as a 14 degree-of-freedom linkage actuated by 46 musculotendinous units, computation of the optimal controls for gait could take up to 3 months of CPU time on the Iris. Both the Cray and the Intel are able to reduce this time to practical levels. The optimal solution for gait can be found with about 77 hours of CPU on the Cray and with about 88 hours of CPU on the Intel. Although the overall speeds of the Cray and the Intel were found to be similar, the unique capabilities of each machine are better suited to different portions of the computational algorithm used. The Intel was best suited to computing the derivatives of the performance criterion and the constraints whereas the Cray was best suited to parameter optimization of the controls. These results suggest that the ideal computer architecture for solving very large-scale optimal control problems is a hybrid system in which a vector-processing machine is integrated into the communication network of a MIMD parallel machine.

  5. Evaluation of the Intel Xeon Phi 7120 and NVIDIA K80 as accelerators for two-dimensional panel codes

    PubMed Central

    2017-01-01

    To optimize the geometry of airfoils for a specific application is an important engineering problem. In this context genetic algorithms have enjoyed some success as they are able to explore the search space without getting stuck in local optima. However, these algorithms require the computation of aerodynamic properties for a significant number of airfoil geometries. Consequently, for low-speed aerodynamics, panel methods are most often used as the inner solver. In this paper we evaluate the performance of such an optimization algorithm on modern accelerators (more specifically, the Intel Xeon Phi 7120 and the NVIDIA K80). For that purpose, we have implemented an optimized version of the algorithm on the CPU and Xeon Phi (based on OpenMP, vectorization, and the Intel MKL library) and on the GPU (based on CUDA and the MAGMA library). We present timing results for all codes and discuss the similarities and differences between the three implementations. Overall, we observe a speedup of approximately 2.5 for adding an Intel Xeon Phi 7120 to a dual socket workstation and a speedup between 3.4 and 3.8 for adding a NVIDIA K80 to a dual socket workstation. PMID:28582389

  6. Evaluation of the Intel Xeon Phi 7120 and NVIDIA K80 as accelerators for two-dimensional panel codes.

    PubMed

    Einkemmer, Lukas

    2017-01-01

    To optimize the geometry of airfoils for a specific application is an important engineering problem. In this context genetic algorithms have enjoyed some success as they are able to explore the search space without getting stuck in local optima. However, these algorithms require the computation of aerodynamic properties for a significant number of airfoil geometries. Consequently, for low-speed aerodynamics, panel methods are most often used as the inner solver. In this paper we evaluate the performance of such an optimization algorithm on modern accelerators (more specifically, the Intel Xeon Phi 7120 and the NVIDIA K80). For that purpose, we have implemented an optimized version of the algorithm on the CPU and Xeon Phi (based on OpenMP, vectorization, and the Intel MKL library) and on the GPU (based on CUDA and the MAGMA library). We present timing results for all codes and discuss the similarities and differences between the three implementations. Overall, we observe a speedup of approximately 2.5 for adding an Intel Xeon Phi 7120 to a dual socket workstation and a speedup between 3.4 and 3.8 for adding a NVIDIA K80 to a dual socket workstation.

  7. Requirements for benchmarking personal image retrieval systems

    NASA Astrophysics Data System (ADS)

    Bouguet, Jean-Yves; Dulong, Carole; Kozintsev, Igor; Wu, Yi

    2006-01-01

    It is now common to have accumulated tens of thousands of personal pictures. Efficient access to that many pictures can only be achieved with a robust image retrieval system. This application is of high interest to Intel processor architects. It is highly compute intensive, and could motivate end users to upgrade their personal computers to the next generations of processors. A key question is how to assess the robustness of a personal image retrieval system. Personal image databases are very different from the digital libraries that have been used by many Content Based Image Retrieval Systems [1]. For example, a personal image database has a lot of pictures of people, but of a small set of different people: typically family, relatives, and friends. Pictures are taken in a limited set of places like home, work, school, and vacation destinations. The most frequent queries are searches for people and for places. These attributes, and many others, affect how a personal image retrieval system should be benchmarked, and benchmarks need to be different from existing ones based on art images or medical images, for example. The attributes of the data set do not change the list of components needed for the benchmarking of such systems as specified in [2]: data sets, query tasks, ground truth, evaluation measures, and benchmarking events. This paper proposes a way to build these components to be representative of personal image databases and of the corresponding usage models.

  8. A communication-avoiding, hybrid-parallel, rank-revealing orthogonalization method.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Hoemmen, Mark

    2010-11-01

    Orthogonalization consumes much of the run time of many iterative methods for solving sparse linear systems and eigenvalue problems. Commonly used algorithms, such as variants of Gram-Schmidt or Householder QR, have performance dominated by communication. Here, 'communication' includes both data movement between the CPU and memory, and messages between processors in parallel. Our Tall Skinny QR (TSQR) family of algorithms requires asymptotically fewer messages between processors and data movement between CPU and memory than typical orthogonalization methods, yet achieves the same accuracy as Householder QR factorization. Furthermore, in block orthogonalizations, TSQR is faster and more accurate than existing approaches for orthogonalizing the vectors within each block ('normalization'). TSQR's rank-revealing capability also makes it useful for detecting deflation in block iterative methods, for which existing approaches sacrifice performance, accuracy, or both. We have implemented a version of TSQR that exploits both distributed-memory and shared-memory parallelism, and supports real and complex arithmetic. Our implementation is optimized for the case of orthogonalizing a small number (5-20) of very long vectors. The shared-memory parallel component uses Intel's Threading Building Blocks, though its modular design supports other shared-memory programming models as well, including computation on the GPU. Our implementation achieves speedups of 2 times or more over competing orthogonalizations. It is available now in the development branch of the Trilinos software package, and will be included in the 10.8 release.
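
    The idea behind TSQR can be illustrated with a serial, two-level sketch in NumPy. This is not the Trilinos implementation described in the abstract: it assumes a flat (one-level) reduction, runs the per-block factorizations sequentially, and has no rank-revealing step. The function name `tsqr` and the block count are illustrative choices.

```python
import numpy as np


def tsqr(a: np.ndarray, n_blocks: int = 4):
    """Two-level TSQR of a tall matrix; returns (Q, R) with a = Q @ R.

    Each row block must have at least as many rows as `a` has columns.
    """
    n = a.shape[1]
    blocks = np.array_split(a, n_blocks, axis=0)
    # Stage 1: independent reduced QR of every row block.
    # In a parallel setting each block would live on its own processor.
    local = [np.linalg.qr(b) for b in blocks]
    # Stage 2: one small QR of the stacked n-by-n R factors.
    q2, r = np.linalg.qr(np.vstack([r_i for _, r_i in local]))
    # Combine: block i of Q is Q1_i times the matching n-row slice of Q2.
    q_parts = [q1 @ q2[i * n:(i + 1) * n, :] for i, (q1, _) in enumerate(local)]
    return np.vstack(q_parts), r


rng = np.random.default_rng(0)
a = rng.standard_normal((1000, 8))   # a few very long vectors
q, r = tsqr(a)
print(np.allclose(q @ r, a), np.allclose(q.T @ q, np.eye(8)))
```

    Only the small R factors need to be exchanged between blocks, which is where the reduction in communication comes from.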

  9. Balancing Contention and Synchronization on the Intel Paragon

    NASA Technical Reports Server (NTRS)

    Bokhari, Shahid H.; Nicol, David M.

    1996-01-01

    The Intel Paragon is a mesh-connected distributed memory parallel computer. It uses an oblivious and deterministic message routing algorithm: this permits us to develop highly optimized schedules for frequently needed communication patterns. The complete exchange is one such pattern. Several approaches are available for carrying it out on the mesh. We study an algorithm developed by Scott. This algorithm assumes that a communication link can carry one message at a time and that a node can only transmit one message at a time. It requires global synchronization to enforce a schedule of transmissions. Unfortunately global synchronization has substantial overhead on the Paragon. At the same time the powerful interconnection mechanism of this machine permits 2 or 3 messages to share a communication link with minor overhead. It can also overlap multiple message transmissions from the same node to some extent. We develop a generalization of Scott's algorithm that executes complete exchange with a prescribed contention. Schedules that incur greater contention require fewer synchronization steps. This permits us to trade off contention against synchronization overhead. We describe the performance of this algorithm and compare it with Scott's original algorithm as well as with a naive algorithm that does not take interconnection structure into account. The bounded-contention algorithm is always better than Scott's algorithm and outperforms the naive algorithm for all but the smallest message sizes. The naive algorithm fails to work on meshes larger than 12 x 12. These results show that due consideration of processor interconnect and machine performance parameters is necessary to obtain peak performance from the Paragon and its successor mesh machines.

  10. Frequency Dependence of Single-event Upset in Advanced Commercial PowerPC Microprocessors

    NASA Technical Reports Server (NTRS)

    Irom, Frokh; Farmanesh, Farhad F.; Swift, Gary M.; Johnston, Allen H.

    2004-01-01

    This paper examines single-event upsets in advanced commercial SOI microprocessors in a dynamic mode, studying SEU sensitivity of General Purpose Registers (GPRs) with clock frequency. Results are presented for SOI processors with feature sizes of 0.18 microns and two different core voltages. Single-event upset from heavy ions is measured for advanced commercial microprocessors in a dynamic mode with clock frequency up to 1GHz. Frequency and core voltage dependence of single-event upsets in registers is discussed.

  11. Homemade Buckeye-Pi: A Learning Many-Node Platform for High-Performance Parallel Computing

    NASA Astrophysics Data System (ADS)

    Amooie, M. A.; Moortgat, J.

    2017-12-01

    We report on the "Buckeye-Pi" cluster, the supercomputer developed in The Ohio State University School of Earth Sciences from 128 inexpensive Raspberry Pi (RPi) 3 Model B single-board computers. Each RPi is equipped with a fast quad-core 1.2 GHz ARMv8 64-bit processor, 1 GB of RAM, and a 32 GB microSD card for local storage. The cluster therefore has a total RAM of 128 GB distributed over the individual nodes and a flash capacity of 4 TB with 512 processors, while it benefits from low power consumption, easy portability, and low total cost. The cluster uses the Message Passing Interface protocol to manage the communications between nodes. These features render our platform the most powerful RPi supercomputer to date and suitable for educational applications in high-performance computing (HPC) and handling of large datasets. In particular, we use the Buckeye-Pi to implement optimized parallel codes in our in-house simulator for subsurface media flows with the goal of achieving a massively parallelized scalable code. We present benchmarking results for the computational performance across various numbers of RPi nodes. We believe our project could inspire scientists and students to consider the proposed unconventional cluster architecture as a mainstream and feasible learning platform for challenging engineering and scientific problems.
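
    The cluster totals quoted in the abstract follow directly from the per-node figures. The sketch below reproduces that arithmetic; the variable names are illustrative, and the 4 TB figure assumes 1 TB = 1024 GB.

```python
NODES = 128              # Raspberry Pi 3 Model B boards
RAM_GB_PER_NODE = 1
FLASH_GB_PER_NODE = 32   # microSD card per board
CORES_PER_NODE = 4       # quad-core processor

total_ram_gb = NODES * RAM_GB_PER_NODE
total_flash_tb = NODES * FLASH_GB_PER_NODE / 1024
total_cores = NODES * CORES_PER_NODE

print(total_ram_gb, "GB RAM,", total_flash_tb, "TB flash,", total_cores, "cores")
```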

  12. Intel NX to PVM 3.2 message passing conversion library

    NASA Technical Reports Server (NTRS)

    Arthur, Trey; Nelson, Michael L.

    1993-01-01

    NASA Langley Research Center has developed a library that allows Intel NX message passing codes to be executed under the more popular and widely supported Parallel Virtual Machine (PVM) message passing library. PVM was developed at Oak Ridge National Labs and has become the de facto standard for message passing. This library will allow the many programs that were developed on the Intel iPSC/860 or Intel Paragon in a Single Program Multiple Data (SPMD) design to be ported to the numerous architectures that PVM (version 3.2) supports. Also, the library adds global operations capability to PVM. A familiarity with Intel NX and PVM message passing is assumed.

  13. Ada Compiler Validation Summary Report: Certificate Number: 940325S1. 11352 DDC-I DACS Sun SPARC/Solaris to Pentium PM Bare Ada Cross Compiler System, Version 4.6.4 Sun SPARCclassic = Intel Pentium (Operated as Bare Machine) Based in Xpress Desktop (Intel Product Number: XBASE6E4F-B)

    DTIC Science & Technology

    1994-03-25

    Technology Building 225, Room A266 Gaithersburg, Maryland 20899 U.S.A. Ada Validation Organization Ada Joint Program Office De & Software David R. Basel...Standards and Technology Building 225, Room A266 Gaithersburg, Maryland 20899 U.S.A. Ada Joint Program Office Director, Computer & Software David R. ...characters, a bar ("|") is written in the 16th position and the rest of the characters are not printed. * The place of the definition, i.e., a line

  14. Summary of Documentation for DYNA3D-ParaDyn's Software Quality Assurance Regression Test Problems

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Zywicz, Edward

    The Software Quality Assurance (SQA) regression test suite for DYNA3D (Zywicz and Lin, 2015) and ParaDyn (DeGroot, et al., 2015) currently contains approximately 600 problems divided into 21 suites, and is a required component of ParaDyn’s SQA plan (Ferencz and Oliver, 2013). The regression suite allows developers to ensure that software modifications do not unintentionally alter the code response. The entire regression suite is run prior to permanently incorporating any software modification or addition. When code modifications alter test problem results, the specific cause must be determined and fully understood before the software changes and revised test answers can be incorporated. The regression suite is executed on LLNL platforms using a Python script and an associated data file. The user specifies the DYNA3D or ParaDyn executable, number of processors to use, test problems to run, and other options to the script. The data file details how each problem and its answer extraction scripts are executed. For each problem in the regression suite there exists an input deck, an eight-processor partition file, an answer file, and various extraction scripts. These scripts assemble a temporary answer file in a specific format from the simulation results. The temporary and stored answer files are compared to a specific level of numerical precision, and when differences are detected the test problem is flagged as failed. Presently, numerical results are stored and compared to 16 digits. At this accuracy level different processor types, compilers, number of partitions, etc. impact the results to various degrees. Thus, for consistency purposes the regression suite is run with ParaDyn using 8 processors on machines with a specific processor type (currently the Intel Xeon E5530 processor). For non-parallel regression problems, i.e., the two XFEM problems, DYNA3D is used instead.
When environments or platforms change, executables using the current source code and the new resource are created and the regression suite is run. If differences in answers arise, the new answers are retained provided that the differences are inconsequential. This bootstrap approach allows the test suite answers to evolve in a controlled manner with a high level of confidence. Developers also run the entire regression suite with (serial) DYNA3D. While these results normally differ from the stored (parallel) answers, abnormal termination or wildly different values are strong indicators of potential issues.
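
    The comparison of stored and temporary answers to a fixed number of digits can be sketched as follows. This is an illustrative guess at the idea only, not the LLNL script: the function names are hypothetical, answers are assumed to be plain lists of floats, and "16 digits" is interpreted as 16 significant digits.

```python
def same_to_digits(a: float, b: float, digits: int = 16) -> bool:
    """True when a and b print identically at `digits` significant digits."""
    fmt = f"{{:.{digits - 1}e}}"   # e.g. digits=16 -> "{:.15e}"
    return fmt.format(a) == fmt.format(b)


def compare_answers(stored: list[float], fresh: list[float],
                    digits: int = 16) -> bool:
    """Pass only when every value in the two answer sets agrees."""
    return len(stored) == len(fresh) and all(
        same_to_digits(s, f, digits) for s, f in zip(stored, fresh)
    )


stored = [1.0, 2.5e-3]
fresh = [1.0 + 1e-12, 2.5e-3]           # tiny platform-dependent drift
print(compare_answers(stored, fresh))            # strict: 16 digits
print(compare_answers(stored, fresh, digits=6))  # relaxed: 6 digits
```

    A check this strict flags differences caused by nothing more than a change of compiler or processor, which is consistent with the abstract's remark that the suite is run on a single processor type.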

  15. EMCS Installation Follow-Up Study. Volume 2.

    DTIC Science & Technology

    1984-03-01

    software is currently being developed by Power. 4. The system includes an equation processor which provides arithmetic, logical, and timing...packages are actually written using the equation processor. 5. The system colorgraphics capability was demonstrated. The system can be operated...extremely important. The AsteroAI drawings were reviewed by the Navy only on a cursory basis. This proved to be inadequate. 15. The specifications

  16. Multiphase complete exchange on a circuit switched hypercube

    NASA Technical Reports Server (NTRS)

    Bokhari, Shahid H.

    1991-01-01

    On a distributed memory parallel computer, the complete exchange (all-to-all personalized) communication pattern requires each of $n$ processors to send a different block of data to each of the remaining $n - 1$ processors. This pattern is at the heart of many important algorithms, most notably the matrix transpose. For a circuit switched hypercube of dimension $d$ ($n = 2^{d}$), two algorithms for achieving complete exchange are known. These are (1) the Standard Exchange approach that employs $d$ transmissions of size $2^{d-1}$ blocks each and is useful for small block sizes, and (2) the Optimal Circuit Switched algorithm that employs $2^{d} - 1$ transmissions of 1 block each and is best for large block sizes. A unified multiphase algorithm is described that includes these two algorithms as special cases. The complete exchange on a hypercube of dimension $d$ and block size $m$ is achieved by carrying out $k$ partial exchanges on subcubes of dimension $d_i$, where $\sum_{i=1}^{k} d_i = d$, with effective block size $m_i = m\,2^{d-d_i}$. When $k = d$ and all $d_i = 1$, this corresponds to algorithm (1) above. For the case of $k = 1$ and $d_i = d$, this becomes the circuit switched algorithm (2). Changing the subcube dimensions $d_i$ varies the effective block size and permits a compromise between the data permutation and block transmission overhead of (1) and the startup overhead of (2). For a hypercube of dimension $d$, the number of possible combinations of subcubes is $p(d)$, the number of partitions of the integer $d$. This is an exponential but very slowly growing function and it is feasible to search over these partitions to discover the best combination for a given message size. The approach was analyzed for, and implemented on, the Intel iPSC-860 circuit switched hypercube. Measurements show good agreement with predictions and demonstrate that the multiphase approach can substantially improve performance for block sizes in the 0 to 160 byte range.
This range, which corresponds to 0 to 40 floating point numbers per processor, is commonly encountered in practical numeric applications. The multiphase technique is applicable to all circuit-switched hypercubes that use the common e-cube routing strategy.
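
    The search space described above, the partitions of the hypercube dimension, is small enough to enumerate directly. The sketch below lists the candidate subcube dimensions and the effective block size of each phase using the relation stated in the abstract (block size m times 2 to the power d minus the subcube dimension). It does not include the paper's timing model, and the function names are illustrative.

```python
def partitions(d, max_part=None):
    """Yield every partition of the integer d as a non-increasing tuple."""
    if max_part is None:
        max_part = d
    if d == 0:
        yield ()
        return
    for first in range(min(d, max_part), 0, -1):
        for rest in partitions(d - first, first):
            yield (first,) + rest


def effective_block_sizes(d, m, dims):
    """Effective block size m * 2**(d - d_i) for each phase of the exchange."""
    return [m * 2 ** (d - d_i) for d_i in dims]


d, m = 4, 1   # hypothetical example: 16-node hypercube, unit block size
for dims in partitions(d):
    print(dims, effective_block_sizes(d, m, dims))
```

    The partition (4,) corresponds to algorithm (2) and (1, 1, 1, 1) to algorithm (1); the remaining partitions are the intermediate multiphase schedules.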

  17. Olympus propagation studies in the US: Receiver development and the data acquisition system

    NASA Technical Reports Server (NTRS)

    Mckeeman, John C.

    1990-01-01

    Virginia Tech has developed two types of receivers to monitor the Olympus beacons, as well as a custom data acquisition system to store and display propagation data. Each of the receiver designs uses new hybrid analog/digital techniques. The data acquisition system uses a stand alone processor to collect and format the data for display and subsequent processing. The launch of the Olympus satellite with its coherent beacons offers new opportunities to study propagation effects at 12.5, 20, and 30 GHz. At Virginia Tech, the satellite is at 14 degrees in elevation, which allows us to measure low elevation angle effects. However, to make these measurements, a very accurate and stable measurement system is required. Virginia Tech has constructed a complex receiving system which monitors the Olympus beacons and all parameters associated with propagation research. In the current configuration, researchers have developed a receiver which frequency locks to the less fade susceptible 12.5 GHz beacon. Since all beacons on the satellite are driven from a single master oscillator, drift in the 12.5 GHz beacon implies corresponding drifts in the 20, and 30 GHz beacons. The receivers for the 20 and 30 GHz systems derive their frequency locking information from the 12.5 GHz system. This widens the dynamic range of the receivers and allows the receivers to maintain lock in severe fade conditions. In addition to monitoring the beacons, the sky noise is monitored with radiometers at each frequency. The radiometer output is used to set the clear air level for each beacon measurement. Researchers also measure the rain rate with several tipping bucket rain gauges placed along the propagation path.

  18. Hardware simulator for optical correlation spectroscopy with Gaussian statistics and arbitrary correlation functions.

    PubMed

    Molteni, Matteo; Weigel, Udo M; Remiro, Francisco; Durduran, Turgut; Ferri, Fabio

    2014-11-17

    We present a new hardware simulator (HS) for characterization, testing and benchmarking of digital correlators used in various optical correlation spectroscopy experiments where the photon statistics is Gaussian and the corresponding time correlation function can have any arbitrary shape. Starting from the HS developed in [Rev. Sci. Instrum. 74, 4273 (2003)], and using the same I/O board (PCI-6534 National Instrument) mounted on a modern PC (Intel Core i7-CPU, 3.07GHz, 12GB RAM), we have realized an instrument capable of delivering continuous streams of TTL pulses over two channels, with a time resolution of Δt = 50ns, up to a maximum count rate of 〈I〉 ∼ 5MHz. Pulse streams, typically detected in dynamic light scattering and diffuse correlation spectroscopy experiments were generated and measured with a commercial hardware correlator obtaining measured correlation functions that match accurately the expected ones.

  19. Estimation of winter wheat canopy nitrogen density at different growth stages based on Multi-LUT approach

    NASA Astrophysics Data System (ADS)

    Li, Zhenhai; Li, Na; Li, Zhenhong; Wang, Jianwen; Liu, Chang

    2017-10-01

    Rapid real-time monitoring of wheat nitrogen (N) status is crucial for precision N management during wheat growth. In this study, a Multi Lookup Table (Multi-LUT) approach based on N-PROSAIL model parameter settings at different growth stages was constructed to estimate canopy N density (CND) in winter wheat. The results showed that the estimated CND was in line with the measured CND, with a determination coefficient (R²) and corresponding root mean square error (RMSE) of 0.80 and 1.16 g m⁻², respectively. Estimating one sample took only 6 ms on a test machine with an Intel(R) Core(TM) i5-2430 @2.40GHz quad-core CPU. These results confirmed the potential of the Multi-LUT approach for CND retrieval in winter wheat at different growth stages and under variable climatic conditions.

  20. A Commodity Computing Cluster

    NASA Astrophysics Data System (ADS)

    Teuben, P. J.; Wolfire, M. G.; Pound, M. W.; Mundy, L. G.

    We have assembled a cluster of Intel-Pentium based PCs running Linux to compute a large set of Photodissociation Region (PDR) and Dust Continuum models. For various reasons the cluster is heterogeneous, currently ranging from a single Pentium-II 333 MHz to dual Pentium-III 450 MHz CPU machines. Although this will be sufficient for our "embarrassingly parallelizable problem", it may present some challenges for as yet unplanned future use. In addition, the cluster was used to construct a MIRIAD benchmark, and compared to equivalent Ultra-Sparc based workstations. Currently the cluster consists of 8 machines, 14 CPUs, 50GB of disk-space, and a total peak speed of 5.83 GHz, or about 1.5 Gflops. The total cost of this cluster has been about $12,000, including all cabling, networking equipment, rack, and a CD-R backup system. The URL for this project is http://dustem.astro.umd.edu.

  1. Applications of surface acoustic and shallow bulk acoustic wave devices

    NASA Astrophysics Data System (ADS)

    Campbell, Colin K.

    1989-10-01

    Surface acoustic wave (SAW) device coverage includes delay lines and filters operating at selected frequencies in the range from about 10 MHz to 11 GHz; modeling with single-crystal piezoelectrics and layered structures; resonators and low-loss filters; comb filters and multiplexers; antenna duplexers; harmonic devices; chirp filters for pulse compression; coding with fixed and programmable transversal filters; Barker and quadraphase coding; adaptive filters; acoustic and acoustoelectric convolvers and correlators for radar, spread spectrum, and packet radio; acoustooptic processors for Bragg modulation and spectrum analysis; real-time Fourier-transform and cepstrum processors for radar and sonar; compressive receivers; Nyquist filters for microwave digital radio; clock-recovery filters for fiber communications; fixed-, tunable-, and multimode oscillators and frequency synthesizers; acoustic charge transport; and other SAW devices for signal processing on gallium arsenide. Shallow bulk acoustic wave device applications include gigahertz delay lines, surface-transverse-wave resonators employing energy-trapping gratings, and oscillators with enhanced performance and capability.

  2. Intercalation of Li Ions into a Graphite Anode Material: Molecular Dynamics Simulations

    NASA Astrophysics Data System (ADS)

    Abou Hamad, Ibrahim; Novotny, Mark

    2008-03-01

    Large-scale molecular dynamics simulations of the anode half-cell of a lithium-ion battery are presented. The model system is composed of an anode represented by a stack of graphite sheets, an electrolyte of ethylene carbonate and propylene carbonate molecules, and lithium and hexafluorophosphate ions. The simulations are done in the NVT ensemble and at room temperature. One charging scheme explored is normal charging in which intercalation is enhanced by electric charges on the graphitic sheets. The second charging mechanism has an external applied oscillatory electric field of amplitude A and frequency f. The simulations were performed on 2.6 GHz Opteron processors, using 160 processors at a time. Our simulation results show an improvement in the intercalation time of the lithium ions for the second charging mechanism. The dependence of the intercalation time on A and f will be discussed.

  3. Image segmentation based upon topological operators: real-time implementation case study

    NASA Astrophysics Data System (ADS)

    Mahmoudi, R.; Akil, M.

    2009-02-01

    In many image processing applications, thinning and crest restoring are of great interest. The recommended algorithms for these procedures are those able to act directly on grayscale images while preserving topology. However, their high time consumption remains the major disadvantage in choosing them. In this paper we present an efficient hardware implementation on a RISC processor of two powerful thinning and crest-restoring algorithms developed by our team. The proposed implementation improves execution time. A segmentation chain applied to medical imaging serves as a concrete example to illustrate the improvements brought by the optimization techniques at both the algorithmic and architectural levels. In particular, using the SSE instruction set on X86_32 processors (PIV 3.06 GHz) yields the best performance for real-time processing: a rate of 33 images (512*512) per second is assured.

  4. Testing the monogamy relations via rank-2 mixtures

    NASA Astrophysics Data System (ADS)

    Jung, Eylee; Park, DaeKil

    2016-10-01

    We introduce two tangle-based four-party entanglement measures $t_1$ and $t_2$, and two negativity-based measures $n_1$ and $n_2$, which are derived from the monogamy relations. These measures are computed explicitly for three four-qubit maximally entangled states and W states. We also compute these measures for the rank-2 mixture $\rho_4 = p\,|\mathrm{GHZ}_4\rangle\langle\mathrm{GHZ}_4| + (1-p)\,|W_4\rangle\langle W_4|$ by finding the corresponding optimal decompositions. It turns out that $t_1(\rho_4)$ is trivial and the corresponding optimal decomposition is equal to the spectral decomposition. Probably, this triviality is a sign of the fact that the corresponding monogamy inequality is not sufficiently tight. We fail to compute $t_2(\rho_4)$ due to the difficulty in the calculation of the residual entanglement. The negativity-based measures $n_1(\rho_4)$ and $n_2(\rho_4)$ are explicitly computed and the corresponding optimal decompositions are also derived explicitly.
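
    For readers who want to experiment with the rank-2 mixture, the sketch below builds it numerically. It assumes the standard four-qubit definitions, GHZ as the equal superposition of 0000 and 1111 and W as the equal superposition of the four single-excitation basis states, which the abstract does not spell out. It does not compute the entanglement measures themselves, and the helper names are illustrative.

```python
import numpy as np


def basis(bits: str) -> np.ndarray:
    """Computational basis vector for a bit string such as '0101'."""
    v = np.zeros(2 ** len(bits))
    v[int(bits, 2)] = 1.0
    return v


# Assumed standard definitions of the four-qubit GHZ and W states.
ghz4 = (basis("0000") + basis("1111")) / np.sqrt(2)
w4 = (basis("0001") + basis("0010") + basis("0100") + basis("1000")) / 2.0


def rho4(p: float) -> np.ndarray:
    """Mixture p |GHZ4><GHZ4| + (1 - p) |W4><W4| as a 16 x 16 matrix."""
    return p * np.outer(ghz4, ghz4) + (1 - p) * np.outer(w4, w4)


r = rho4(0.3)
eigvals = np.sort(np.linalg.eigvalsh(r))[::-1]
print(np.trace(r), np.linalg.matrix_rank(r), eigvals[:2])
```

    Because the two states are orthogonal, the nonzero eigenvalues are simply p and 1 - p, so the defining mixture is already the spectral decomposition.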

  5. VizieR Online Data Catalog: Ultra-compact HII regions & methanol masers. I. (Hu+, 2016)

    NASA Astrophysics Data System (ADS)

    Hu, B.; Menten, K. M.; Wu, Y.; Bartkiewicz, A.; Rygl, K.; Reid, M. J.; Urquhart, J. S.; Zheng, X.

    2017-03-01

    372 unique targets were selected from the following methanol maser surveys: the Methanol Multi-Beam catalog (MMB; Caswell & Breen 2010MNRAS.407.2599C; Green+ 2010-2012, VIII/96), the Arecibo Methanol Maser Galactic Plane Survey (AMGPS; Pandian+ 2011ApJ...730...55P), the Torun catalog of 6.7GHz methanol masers (Szymczak+ 2012, J/AN/333/634), and other individual observations of known 6.7GHz methanol masers or MSFRs (Caswell+ 1995MNRAS.272...96C; Walsh+ 1997, J/MNRAS/291/261; 1998, J/MNRAS/301/640; Xu+ 2008A&A...485..729X; Caswell 2009, J/other/PASA/26.454). The observations were conducted with the VLA in C-configuration using five sessions from 2012 February 28 to April 16. Spectral line data used 2048 channels across 8MHz, yielding a channel spacing of 3.90625kHz at the central frequency of 6.6685192GHz and a velocity resolution of 0.176km/s. The continuum observations employed two 1GHz sub-bands from 4.9840 to 6.0080GHz (the low band) and from 6.6245 to 7.6485GHz (the high band) and each sub-band was divided into 16 channels. (4 data files).
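
    The spectral-line figures quoted in this record can be checked with a short calculation. The sketch below assumes the usual radio Doppler relation (velocity width = c × channel width / line frequency); the function names are illustrative and do not come from the catalog.

```python
# Illustrative check of the channel spacing and velocity resolution
# quoted in the record. Function names are hypothetical.
C_KM_S = 299_792.458  # speed of light in km/s


def channel_spacing_hz(bandwidth_hz: float, n_channels: int) -> float:
    """Width of one spectral channel for an evenly divided band."""
    return bandwidth_hz / n_channels


def velocity_resolution_km_s(delta_f_hz: float, line_freq_hz: float) -> float:
    """Velocity width of one channel (radio Doppler approximation)."""
    return C_KM_S * delta_f_hz / line_freq_hz


spacing = channel_spacing_hz(8e6, 2048)                   # 8 MHz over 2048 channels
dv = velocity_resolution_km_s(spacing, 6.6685192e9)       # at 6.6685192 GHz

print(f"channel spacing: {spacing / 1e3} kHz")
print(f"velocity resolution: {dv:.3f} km/s")
```

    Under these assumptions the values agree with the 3.90625 kHz spacing and 0.176 km/s resolution stated above.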

  6. Face Detection and Modeling for Recognition

    DTIC Science & Technology

    2002-01-01

    gistered range and color images. 16 Figure 1.12. System diagr...ith and without the transform are shown. For each example, the images shown in the first column are skin regions...software/products/perflib/ipl/index.htm>. [187] Intel Open Source Computer Vision Library, <http://developer.intel.com/software/opensource/cvfl/opencv

  7. Scaling Deep Learning workloads: NVIDIA DGX-1/Pascal and Intel Knights Landing

    DOE PAGES

    Gawande, Nitin A.; Daily, Jeff A.; Siegel, Charles; ...

    2018-05-05

    Deep Learning (DL) algorithms have become ubiquitous in data analytics. As a result, major computing vendors, including NVIDIA, Intel, AMD, and IBM, have architectural road maps influenced by DL workloads. Furthermore, several vendors have recently advertised new computing products as accelerating large DL workloads. Unfortunately, it is difficult for data scientists to quantify the potential of these different products. Here, this article provides a performance and power analysis of important DL workloads on two major parallel architectures: NVIDIA DGX-1 (eight Pascal P100 GPUs interconnected with NVLink) and Intel Knights Landing (KNL) CPUs interconnected with Intel Omni-Path or Cray Aries. Our evaluation consists of a cross section of convolutional neural net workloads: CifarNet, AlexNet, GoogLeNet, and ResNet50 topologies using the Cifar10 and ImageNet datasets. The workloads are vendor-optimized for each architecture. We use sequentially equivalent implementations to maintain iso-accuracy between parallel and sequential DL models. Our analysis indicates that although GPUs provide the highest overall performance, the gap can close for some convolutional networks; and the KNL can be competitive in performance/watt. We find that NVLink facilitates scaling efficiency on GPUs. However, its importance is heavily dependent on neural network architecture. Furthermore, for weak-scaling, sometimes encouraged by restricted GPU memory, NVLink is less important.

  8. Scaling Deep Learning workloads: NVIDIA DGX-1/Pascal and Intel Knights Landing

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Gawande, Nitin A.; Daily, Jeff A.; Siegel, Charles

    Deep Learning (DL) algorithms have become ubiquitous in data analytics. As a result, major computing vendors, including NVIDIA, Intel, AMD, and IBM, have architectural road maps influenced by DL workloads. Furthermore, several vendors have recently advertised new computing products as accelerating large DL workloads. Unfortunately, it is difficult for data scientists to quantify the potential of these different products. Here, this article provides a performance and power analysis of important DL workloads on two major parallel architectures: NVIDIA DGX-1 (eight Pascal P100 GPUs interconnected with NVLink) and Intel Knights Landing (KNL) CPUs interconnected with Intel Omni-Path or Cray Aries. Our evaluation consists of a cross section of convolutional neural net workloads: CifarNet, AlexNet, GoogLeNet, and ResNet50 topologies using the Cifar10 and ImageNet datasets. The workloads are vendor-optimized for each architecture. We use sequentially equivalent implementations to maintain iso-accuracy between parallel and sequential DL models. Our analysis indicates that although GPUs provide the highest overall performance, the gap can close for some convolutional networks; and the KNL can be competitive in performance/watt. We find that NVLink facilitates scaling efficiency on GPUs. However, its importance is heavily dependent on neural network architecture. Furthermore, for weak-scaling, sometimes encouraged by restricted GPU memory, NVLink is less important.

  9. Second year technical report on-board processing for future satellite communications systems

    NASA Technical Reports Server (NTRS)

    Brandon, W. T.; Green, W. K.; Hoffman, M.; Jean, P. N.; Neal, W. R.; White, B. E.

    1980-01-01

    Advanced baseband and microwave switching techniques for large domestic communications satellites operating in the 30/20 GHz frequency bands are discussed. The nominal baseband processor throughput is one million packets per second (1.6 Gb/s) from one thousand T1 carrier rate customer premises terminals. A frequency reuse factor of sixteen is assumed by using 16 spot antenna beams with the same 100 MHz bandwidth per beam and a modulation with a one b/s per Hz bandwidth efficiency. Eight of the beams are fixed on major metropolitan areas and eight are scanning beams which periodically cover the remainder of the U.S. under dynamic control. User signals are regenerated (demodulated/remodulated) and message packages are reformatted on board. Frequency division multiple access and time division multiplex are employed on the uplinks and downlinks, respectively, for terminals within the coverage area and dwell interval of a scanning beam. Link establishment and packet routing protocols are defined. Also described is a detailed design of a separate 100 x 100 microwave switch capable of handling nonregenerated signals occupying the remaining 2.4 GHz bandwidth with 60 dB of isolation, at an estimated weight and power consumption of approximately 400 kg and 100 W, respectively.
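
    The throughput figure in this record follows from the stated link parameters. The sketch below reproduces that arithmetic; the bits-per-packet value and the comparison against the standard T1 rate of 1.544 Mb/s are derived here for illustration and are not stated in the report.

```python
# Illustrative arithmetic for the baseband processor throughput.
# Variable names are hypothetical; T1_RATE_BPS is the standard T1 line
# rate, which the record does not quote explicitly.
N_BEAMS = 16                 # frequency reuse factor (spot beams)
BANDWIDTH_HZ = 100e6         # bandwidth per beam
EFFICIENCY_BPS_PER_HZ = 1.0  # modulation bandwidth efficiency
PACKETS_PER_S = 1e6          # nominal packet throughput
N_TERMINALS = 1000           # customer premises terminals
T1_RATE_BPS = 1.544e6        # assumption: standard T1 carrier rate

throughput_bps = N_BEAMS * BANDWIDTH_HZ * EFFICIENCY_BPS_PER_HZ
bits_per_packet = throughput_bps / PACKETS_PER_S
terminal_demand_bps = N_TERMINALS * T1_RATE_BPS

print(f"aggregate throughput: {throughput_bps / 1e9} Gb/s")
print(f"implied packet size:  {bits_per_packet} bits")
print(f"1000 T1 terminals:    {terminal_demand_bps / 1e9} Gb/s")
```

    On these assumptions, 16 beams at 100 MHz and one b/s per Hz give the quoted 1.6 Gb/s, which slightly exceeds the combined rate of one thousand T1 terminals.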

  10. Second year technical report on-board processing for future satellite communications systems

    NASA Astrophysics Data System (ADS)

    Brandon, W. T.; Green, W. K.; Hoffman, M.; Jean, P. N.; Neal, W. R.; White, B. E.

    1980-10-01

    Advanced baseband and microwave switching techniques for large domestic communications satellites operating in the 30/20 GHz frequency bands are discussed. The nominal baseband processor throughput is one million packets per second (1.6 Gb/s) from one thousand T1 carrier rate customer premises terminals. A frequency reuse factor of sixteen is assumed by using 16 spot antenna beams with the same 100 MHz bandwidth per beam and a modulation with a one b/s per Hz bandwidth efficiency. Eight of the beams are fixed on major metropolitan areas and eight are scanning beams which periodically cover the remainder of the U.S. under dynamic control. User signals are regenerated (demodulated/remodulated) and message packages are reformatted on board. Frequency division multiple access and time division multiplex are employed on the uplinks and downlinks, respectively, for terminals within the coverage area and dwell interval of a scanning beam. Link establishment and packet routing protocols are defined. Also described is a detailed design of a separate 100 x 100 microwave switch capable of handling nonregenerated signals occupying the remaining 2.4 GHz bandwidth with 60 dB of isolation, at an estimated weight and power consumption of approximately 400 kg and 100 W, respectively.

  11. A Simple, Scalable, Script-based Science Processor

    NASA Technical Reports Server (NTRS)

    Lynnes, Christopher

    2004-01-01

    The production of Earth Science data from orbiting spacecraft is an activity that takes place 24 hours a day, 7 days a week. At the Goddard Earth Sciences Distributed Active Archive Center (GES DAAC), this results in as many as 16,000 program executions each day, far too many to be run by human operators. In fact, when the Moderate Resolution Imaging Spectroradiometer (MODIS) was launched aboard the Terra spacecraft in 1999, the automated commercial system for running science processing was able to manage no more than 4,000 executions per day. Consequently, the GES DAAC developed a lightweight system based on the popular Perl scripting language, named the Simple, Scalable, Script-based Science Processor (S4P). S4P automates science processing, allowing operators to focus on the rare problems occurring from anomalies in data or algorithms. S4P has been reused in several systems ranging from routine processing of MODIS data to data mining and is publicly available from NASA.

  12. A design of 30/20 GHz flight communications experiment for NASA. [satellite and earth segments for high data rate commercial service

    NASA Technical Reports Server (NTRS)

    Kawamoto, Y.

    1982-01-01

    The objective of the 30/20 GHz Flight Experiment System is to develop the required technology and to experiment with the communication technique for an operational communication satellite system. The system uses polarization, spatial, and frequency isolations to maximize the spectrum utilization. The key spacecraft technologies required for the concept are the scan beam antenna, the baseband processor, the IF switch matrix, TWTA, SSPA, and LNA. The spacecraft communication payload information will be telemetered and monitored closely so that these technologies and performances can be verified. Two types of services, a trunk service and a customer premise service, are demonstrated in the system. Many experiments associated with these services, such as synchronization, demand assignment, link control, and network control will be performed to provide important information on the operational aspect of the system.

  13. Communication-Driven Codesign for Multiprocessor Systems

    DTIC Science & Technology

    2004-01-01

    processors, FPGA or ASIC subsystems, mi- croprocessors, and microcontrollers. When a processor is embedded within a SLOT architecture, one or more...Broderson, Low-power CMOS digital design, IEEE Journal of Solid-State Circuits 27 (1992), no. 4, 473–484. [25] L. Chao and E. Sha , Scheduling data-flow...1997), 239– 256 . [82] P. K. Murthy, E. G. Cohen, and S. Rowland, System Canvas: A new design en- vironment for embedded DSP and telecommunications

  14. A programming framework for data streaming on the Xeon Phi

    NASA Astrophysics Data System (ADS)

    Chapeland, S.; ALICE Collaboration

    2017-10-01

    ALICE (A Large Ion Collider Experiment) is the dedicated heavy-ion detector studying the physics of strongly interacting matter and the quark-gluon plasma at the CERN LHC (Large Hadron Collider). After the second long shut-down of the LHC, the ALICE detector will be upgraded to cope with an interaction rate of 50 kHz in Pb-Pb collisions, producing in the online computing system (O2) a sustained throughput of 3.4 TB/s. This data will be processed on the fly so that the stream to permanent storage does not exceed 90 GB/s peak, the raw data being discarded. In the context of assessing different computing platforms for the O2 system, we have developed a framework for the Intel Xeon Phi processors (MIC). It provides the components to build a processing pipeline streaming the data from the PC memory to a pool of permanent threads running on the MIC, and back to the host after processing. It is based on explicit offloading mechanisms (data transfer, asynchronous tasks) and basic building blocks (FIFOs, memory pools, C++11 threads). The user only needs to implement the processing method to be run on the MIC. We present in this paper the architecture, implementation, and performance of this system.

  15. Real-time simulator for designing electron dual scattering foil systems.

    PubMed

    Carver, Robert L; Hogstrom, Kenneth R; Price, Michael J; LeBlanc, Justin D; Pitcher, Garrett M

    2014-11-08

    The purpose of this work was to develop a user friendly, accurate, real-time computer simulator to facilitate the design of dual foil scattering systems for electron beams on radiotherapy accelerators. The simulator allows for a relatively quick, initial design that can be refined and verified with subsequent Monte Carlo (MC) calculations and measurements. The simulator also is a powerful educational tool. The simulator consists of an analytical algorithm for calculating electron fluence and X-ray dose and a graphical user interface (GUI) C++ program. The algorithm predicts electron fluence using Fermi-Eyges multiple Coulomb scattering theory with the reduced Gaussian formalism for scattering powers. The simulator also estimates central-axis and off-axis X-ray dose arising from the dual foil system. Once the geometry of the accelerator is specified, the simulator allows the user to continuously vary primary scattering foil material and thickness, secondary scattering foil material and Gaussian shape (thickness and sigma), and beam energy. The off-axis electron relative fluence or total dose profile and central-axis X-ray dose contamination are computed and displayed in real time. The simulator was validated by comparison of off-axis electron relative fluence and X-ray percent dose profiles with those calculated using EGSnrc MC. Over the energy range 7-20 MeV, using present foils on an Elekta radiotherapy accelerator, the simulator was able to reproduce MC profiles to within 2% out to 20 cm from the central axis. The central-axis X-ray percent dose predictions matched measured data to within 0.5%. The calculation time was approximately 100 ms using a single Intel 2.93 GHz processor, which allows for real-time variation of foil geometrical parameters using slider bars. 
This work demonstrates how the user-friendly GUI and real-time nature of the simulator make it an effective educational tool for gaining a better understanding of the effects that various system parameters have on a relative dose profile. This work also demonstrates a method for using the simulator as a design tool for creating custom dual scattering foil systems in the clinical range of beam energies (6-20 MeV). 

  16. CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment

    PubMed Central

    Manavski, Svetlin A; Valle, Giorgio

    2008-01-01

    Background Searching for similarities in protein and DNA databases has become a routine procedure in Molecular Biology. The Smith-Waterman algorithm has been available for more than 25 years. It is based on a dynamic programming approach that explores all the possible alignments between two sequences; as a result it returns the optimal local alignment. Unfortunately, the computational cost is very high, requiring a number of operations proportional to the product of the length of two sequences. Furthermore, the exponential growth of protein and DNA databases makes the Smith-Waterman algorithm unrealistic for searching similarities in large sets of sequences. For these reasons heuristic approaches such as those implemented in FASTA and BLAST tend to be preferred, allowing faster execution times at the cost of reduced sensitivity. The main motivation of our work is to exploit the huge computational power of commonly available graphic cards, to develop high performance solutions for sequence alignment. Results In this paper we present what we believe is the fastest solution of the exact Smith-Waterman algorithm running on commodity hardware. It is implemented in the recently released CUDA programming environment by NVidia. CUDA allows direct access to the hardware primitives of the last-generation Graphics Processing Units (GPU) G80. Speeds of more than 3.5 GCUPS (Giga Cell Updates Per Second) are achieved on a workstation running two GeForce 8800 GTX. Exhaustive tests have been done to compare our implementation to SSEARCH and BLAST, running on a 3 GHz Intel Pentium IV processor. Our solution was also compared to a recently published GPU implementation and to a Single Instruction Multiple Data (SIMD) solution. These tests show that our implementation performs from 2 to 30 times faster than any other previous attempt available on commodity hardware. 
Conclusions The results show that graphic cards are now sufficiently advanced to be used as efficient hardware accelerators for sequence alignment. Their performance is better than any alternative available on commodity hardware platforms. The solution presented in this paper allows large scale alignments to be performed at low cost, using the exact Smith-Waterman algorithm instead of the largely adopted heuristic approaches. PMID:18387198
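
    The dynamic programming recurrence behind the Smith-Waterman algorithm described in this record can be sketched in a few lines. This is a plain CPU illustration with a linear gap penalty, not the authors' CUDA implementation; the function name and scoring parameters are assumptions chosen for the example.

```python
# Minimal sketch of Smith-Waterman local alignment scoring.
# Assumptions: linear gap penalty and simple match/mismatch scores;
# the paper's GPU implementation and scoring scheme may differ.
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -1) -> int:
    """Return the best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)  # previous row of the score matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTT", "GGACGGG"))
```

    Each of the len(a) × len(b) matrix cells is updated once, which is the cost proportional to the product of the sequence lengths noted above and the quantity counted by the cell-updates-per-second (CUPS) metric.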

  17. Studies of Current Induced Magnetization reversal and generation of GHz radiation in magnetic nanopillars

    NASA Astrophysics Data System (ADS)

    Alhajdarwish, Mustafa Yousef

    This thesis describes studies of two phenomena: Current-Induced Magnetization Switching (CIMS), and Current-Induced Generation of GHz Radiation. The CIMS part contains results of measurements of current-perpendicular-to-plane (CPP) magnetoresistance (MR) and CIMS behavior on Ferromagnetic/Nonmetal/Ferromagnetic (F1/N/F2) nanopillars. Judicious combinations of F1 and F2 metals with different bulk scattering asymmetries, and with F1/N and N/F2 interfaces having different interfacial scattering asymmetries, are shown to be able to controllably, and independently, 'invert' both the CPP-MR and the CIMS. In 'normal' CPP-MR, R(AP) > R(P), where R(AP) and R(P) are the nanopillar resistances for the anti-parallel (AP) and parallel (P) orientations of the F1 and F2 magnetic moments. In 'inverse' CPP-MR, R(P) > R(AP). In 'normal' CIMS, positive current switches the nanopillar from the P to the AP state. In 'inverse' CIMS, positive current switches the nanopillar from AP to P. All four possible combinations of CPP-MR and CIMS---(a) 'normal'-'normal', (b) 'normal'-'inverse', (c) 'inverse'-'normal', and (d) 'inverse'-'inverse'---are shown and explained. These results rule out the self-Oersted field as the switching source, since the direction of that field is independent of the bulk or interfacial scattering asymmetries. Successful use of impurities to reverse the bulk scattering asymmetry shows the importance of scattering off of impurities within the bulk F1 and F2 metals---i.e. that the transport must be treated as 'diffusive' rather than 'ballistic'. 
The GHz studies consist of five parts: (1) designing a sample geometry that allows reliable measurements; (2) making nanopillar samples with this geometry; (3) constructing a system for measuring frequencies up to 12 GHz and measuring current-driven GHz radiation data with it; (4) showing 'scaling' behavior of GHz data with the critical fields and currents for nominally identical (but actually slightly different) samples, and justifying such scaling; and (5) designing and constructing a system for frequency domain studies up to 40 GHz and for time domain studies.

  18. Bent-tailed radio sources in the Australia Telescope Large Area Survey of the Chandra Deep Field South

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Dehghan, S.; Johnston-Hollitt, M.; Franzen, T. M. O.

    2014-11-01

    Using the 1.4 GHz Australia Telescope Large Area Survey, supplemented by the 1.4 GHz Very Large Array images, we undertook a search for bent-tailed (BT) radio galaxies in the Chandra Deep Field South. Here we present a catalog of 56 detections, which include 45 BT sources, 4 diffuse low-surface-brightness objects (1 relic, 2 halos, and 1 unclassified object), and a further 7 complex, multi-component sources. We report BT sources with rest-frame powers in the range $10^{22} \le P_{1.4\,\mathrm{GHz}} \le 10^{26}$ W Hz$^{-1}$, with redshifts up to 2 and linear extents from tens of kiloparsecs up to about 1 Mpc. This is the first systematic study of such sources down to such low powers and high redshifts and demonstrates the complementary nature of searches in deep, limited area surveys as compared to shallower, large surveys. Of the sources presented here, one is the most distant BT source yet detected at a redshift of 2.1688. Two of the sources are found to be associated with known clusters: a wide-angle tail source in A3141 and a putative radio relic which appears at the infall region between the galaxy group MZ 00108 and the galaxy cluster AMPCC 40. Further observations are required to confirm the relic detection, which, if successful, would demonstrate this to be the least powerful relic yet seen with $P_{1.4\,\mathrm{GHz}} = 9 \times 10^{22}$ W Hz$^{-1}$. Using these data, we predict future 1.4 GHz all-sky surveys with a resolution of ∼10 arcsec and a sensitivity of 10 μJy will detect of the order of 560,000 extended low-surface-brightness radio sources of which 440,000 will have a BT morphology.

  19. SWARM: A 32 GHz Correlator and VLBI Beamformer for the Submillimeter Array

    NASA Astrophysics Data System (ADS)

    Primiani, Rurik A.; Young, Kenneth H.; Young, André; Patel, Nimesh; Wilson, Robert W.; Vertatschitsch, Laura; Chitwood, Billie B.; Srinivasan, Ranjani; MacMahon, David; Weintroub, Jonathan

    2016-03-01

    A 32 GHz bandwidth VLBI-capable correlator and phased array has been designed and deployed at the Smithsonian Astrophysical Observatory’s Submillimeter Array (SMA). The SMA Wideband Astronomical ROACH2 Machine (SWARM) integrates two instruments: a correlator with 140 kHz spectral resolution across its full 32 GHz band, used for connected interferometric observations, and a phased array summer used when the SMA participates as a station in the Event Horizon Telescope (EHT) very long baseline interferometry (VLBI) array. For each SWARM quadrant, Reconfigurable Open Architecture Computing Hardware (ROACH2) units shared under open-source from the Collaboration for Astronomy Signal Processing and Electronics Research (CASPER) are equipped with a pair of ultra-fast analog-to-digital converters (ADCs), a field programmable gate array (FPGA) processor, and eight 10 Gigabit Ethernet (GbE) ports. A VLBI data recorder interface designated the SWARM digital back end, or SDBE, is implemented with a ninth ROACH2 per quadrant, feeding four Mark6 VLBI recorders with an aggregate recording rate of 64 Gbps. This paper describes the design and implementation of SWARM, as well as its deployment at SMA with reference to verification and science data.

  20. The effect of disinfectants on dimensional stability of addition and condensation silicone impressions.

    PubMed

    Sinobad, Tamara; Obradović-Djuricić, Kosovka; Nikolić, Zoran; Dodić, Slobodan; Lazić, Vojkan; Sinobad, Vladimir; Jesenko-Rokvić, Aleksandra

    2014-03-01

    Dimensional stability and accuracy of an impression after chemical disinfection by immersion in disinfectants are crucial for the accuracy of final prosthetic restorations. The aim of this study was to assess the deformation of addition and condensation silicone impressions after disinfection in antimicrobial solutions. A total of 120 impressions were made on the model of the upper arch representing three full metal-ceramic crown preparations. Four impression materials were used: two condensation silicones (Oranwash L - Zhermack and Xantopren L Blue - Heraeus Kulzer) and two addition silicones (Elite H-D + regular body - Zhermack and Flexitime correct flow - Heraeus Kulzer). After removal from the model the impressions were immediately immersed in the appropriate disinfectant (glutaraldehyde, benzalkonium chloride - Sterigum and 5.25% NaOCl) for a period of 10 min. The control group consisted of samples that were not treated with disinfectant solution. Consecutive measurements of identical impressions were made with a Canon G9 camera (12 megapixels, 2 fps, 6x/24x) and automated with an Asus Lamborghini VX-2R computer (Intel C2D 2.4 GHz) using the Remote Capture software package, so that time-dependent series of images of the same impression were obtained. The dimensional changes of all the samples were significant both as a function of time and the applied disinfectant. The results show significant differences in the dimensional changes between the group of condensation silicones and the group of addition silicones for the same time and the same applied disinfectant (p = 0.026, F = 3.95). The greatest dimensional changes of addition and condensation silicone impressions appear in the first hour after their separation from the model.

  1. GeoSAR: A Radar Terrain Mapping System for the New Millennium

    NASA Technical Reports Server (NTRS)

    Thompson, Thomas; vanZyl, Jakob; Hensley, Scott; Reis, James; Munjy, Riadh; Burton, John; Yoha, Robert

    2000-01-01

    GeoSAR (Geographic Synthetic Aperture Radar) is a new 3 year effort to build a unique, dual-frequency, airborne Interferometric SAR for mapping of terrain. This is being pursued via a Consortium of the Jet Propulsion Laboratory (JPL), Calgis, Inc., and the California Department of Conservation. The airborne portion of this system will operate on a Calgis Gulfstream-II aircraft outfitted with P- and X-band Interferometric SARs. The ground portions of this system will be a suite of Flight Planning Software, an IFSAR Processor and a Radar-GIS Workstation. The airborne P-band and X-band radars will be constructed by JPL with the goal of obtaining foliage penetration at the longer P-band wavelengths. The P-band and X-band radar will operate at frequencies of 350 MHz and 9.71 GHz with bandwidths of either 80 or 160 MHz. The airborne radars will be complemented with an airborne laser system for measuring antenna positions. Aircraft flight lines and radar operating instructions will be computed with the Flight Planning Software. The ground processing will be a two-step process. First, the raw radar data will be processed into radar images and interferometer derived Digital Elevation Models (DEMs). Second, these radar images and DEMs will be processed with a Radar GIS Workstation which performs processes such as Projection Transformations, Registration, Geometric Adjustment, Mosaicking, Merging and Database Management. JPL will construct the IFSAR Processor and Calgis, Inc. will construct the Radar GIS Workstation. The GeoSAR Project was underway in November 1996 with a goal of having the radars and laser systems fully integrated onto the Calgis Gulfstream-II aircraft in early 1999. Then, Engineering Checkout and Calibration-Characterization Flights will be conducted through November 1999. The system will be completed at the end of 1999 and ready for routine operations in the year 2000.

  2. ISFET-based sensor signal processor chip design for environment monitoring applications

    NASA Astrophysics Data System (ADS)

    Chung, Wen-Yaw; Yang, Chung-Huang; Wang, Ming-Ga

    2004-12-01

    In recent years, Ion-Sensitive Field Effect Transistor (ISFET) based transducers have found valuable applications in physiological data acquisition and environment monitoring. This paper presents a mixed-mode ASIC design for potentiometric ISFET-based bio-chemical sensor applications including H+ sensing and a hand-held pH meter. To suit battery power, the proposed system, consisting of low-voltage (3 V) analog front-end readout circuits and a digital processor, has been developed and fabricated in a 0.5 μm double-poly double-metal CMOS technology. To ensure that the correct pH value can be measured, two-point calibration circuitry based on the response of standard pH 4 and pH 7 buffer solutions has been implemented using algorithmic state machine hardware algorithms. The measurement accuracy of the chip is 10 bits, and over the measured range from pH 2 to pH 12 the readings agree with ideal values to within 0.1 pH. For homeland environmental applications, the system provides rapid, easy-to-use, and cost-effective on-site testing of the quality of water, such as drinking water, ground water and river water. Compared to a commercial hand-held pH meter, the processor has potential use in battery-operated and portable devices for environmental monitoring applications.

  3. Observational Approach to Molecular Cloud Evolution with the Submillimeter-Wave CI Lines

    NASA Astrophysics Data System (ADS)

    Oka, T.; Yamamoto, S.

    Neutral carbon atoms (CI) play important roles both in chemistry and cooling processes of interstellar molecular clouds. It is thus crucial to explore its large area distribution to obtain information on formation processes and thermal balance of molecular clouds. However, observations of the submillimeter-wave CI lines have been limited to small areas around some representative objects. We have constructed a 1.2 m submillimeter-wave telescope at the summit of Mt. Fuji. The telescope was designed for the exclusive use of surveying molecular clouds in two submillimeter-wave CI lines, $^3P_1$-$^3P_0$ (492 GHz) and $^3P_2$-$^3P_1$ (809 GHz), of atomic carbon. A superconductor-insulator-superconductor (SIS) mixer receiver was equipped on the Nasmyth focus of the telescope. The receiver noise temperatures [Trx(DSB)] are 300 K and 1000 K for the 492 GHz and the 809 GHz mixers, respectively. The intermediate frequency is centered at 2 GHz, having a 700 MHz bandwidth. An acousto-optical spectrometer (AOS) with 1024 channel outputs is used as a receiver backend. The telescope was installed at Nishi-yasugawara (alt. 3725 m), which is 200 m north of the highest peak, Kengamine (3776 m), in July 1998. It has been operated successfully during 4 observing seasons, remotely from the Hongo campus of the University of Tokyo. We have already observed more than 40 square degrees of the sky with the CI 492 GHz line. The distribution of CI emission is found to be different from those of the $^{13}$CO or C$^{18}$O emission in some clouds. These differences are discussed in relation to formation processes of molecular clouds.

  4. Application of Intel Many Integrated Core (MIC) accelerators to the Pleim-Xiu land surface scheme

    NASA Astrophysics Data System (ADS)

    Huang, Melin; Huang, Bormin; Huang, Allen H.

    2015-10-01

    The land-surface model (LSM) is one physics process in the weather research and forecast (WRF) model. The LSM includes atmospheric information from the surface layer scheme, radiative forcing from the radiation scheme, and precipitation forcing from the microphysics and convective schemes, together with internal information on the land's state variables and land-surface properties. The LSM is to provide heat and moisture fluxes over land points and sea-ice points. The Pleim-Xiu (PX) scheme is one LSM. The PX LSM features three pathways for moisture fluxes: evapotranspiration, soil evaporation, and evaporation from wet canopies. To accelerate the computation process of this scheme, we employ Intel Xeon Phi Many Integrated Core (MIC) Architecture as it is a multiprocessor computer structure with merits of efficient parallelization and vectorization essentials. Our results show that the MIC-based optimization of this scheme running on Xeon Phi coprocessor 7120P improves the performance by 2.3x and 11.7x as compared to the original code respectively running on one CPU socket (eight cores) and on one CPU core with Intel Xeon E5-2670.
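
    The two speedup figures in this record can be combined to estimate how well the original code scaled on the host CPU. The sketch below assumes both speedups are measured against the same Xeon Phi runtime, which the abstract implies but does not state; the derived numbers are illustrative and are not reported by the authors.

```python
# Illustrative arithmetic relating the two quoted speedups.
# Assumption: both figures are relative to the same MIC runtime.
SPEEDUP_VS_ONE_CORE = 11.7    # MIC vs. one Xeon E5-2670 core
SPEEDUP_VS_ONE_SOCKET = 2.3   # MIC vs. one socket (eight cores)
CORES_PER_SOCKET = 8

# If T_core = 11.7 * T_mic and T_socket = 2.3 * T_mic, then the
# eight-core run is T_core / T_socket times faster than one core.
implied_socket_speedup = SPEEDUP_VS_ONE_CORE / SPEEDUP_VS_ONE_SOCKET
parallel_efficiency = implied_socket_speedup / CORES_PER_SOCKET

print(f"implied 8-core speedup over 1 core: {implied_socket_speedup:.2f}x")
print(f"implied parallel efficiency:        {parallel_efficiency:.0%}")
```

    Under that assumption, the eight-core socket runs the original code roughly five times faster than a single core.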

  5. Miniature Packaging Concept for LNAs in the 200-300 GHz Range

    NASA Technical Reports Server (NTRS)

    Samoska, Lorene; Fung, Andy; Varonen, Mikko; Lin, Robert; Peralta, Alejandro; Soria, Mary; Lee, Choonsup; Padmanabhan, Sharmila; Sarkozy, Stephen; Lai, Richard

    2016-01-01

    In this work, we describe new miniaturized low noise amplifier modules which we developed for incorporation in small-scale satellites or Cubesats, and which exhibit similar or better performance compared to previously reported LNAs in the literature. We have targeted the WR4 (170-260 GHz) and WR3 (220-325 GHz) waveguide bands for the module development. The modules include two different methods of E-plane probes which have been developed for low loss, and stability at high frequencies. MMIC LNAs were also developed for these frequency ranges and fabricated in Northrop Grumman Corporation's 35 nm InP HEMT technology, and we have experimentally verified that noise performance is lower than reported in prior work. The best results include a miniature LNA module with 550 K noise at 224 GHz, and a wideband LNA module with 15 dB gain from 230-280 GHz.

  6. InP MMIC Chip Set for Power Sources Covering 80-170 GHz

    NASA Technical Reports Server (NTRS)

    Ngo, Catherine

    2001-01-01

    We will present a Monolithic Millimeter-wave Integrated Circuit (MMIC) chip set which provides high output-power sources for driving diode frequency multipliers into the terahertz range. The chip set was fabricated at HRL Laboratories using a 0.1-micrometer gate-length InAlAs/InGaAs/InP high electron mobility transistor (HEMT) process, and features transistors with an f_max above 600 GHz. The HRL InP HEMT process has already demonstrated amplifiers in the 60-200 GHz range. In this paper, these high frequency HEMTs form the basis for power sources up to 170 GHz. A number of state-of-the-art InP HEMT MMICs will be presented. These include voltage-controlled and fixed-tuned oscillators, power amplifiers, and an active doubler. We will first discuss an 80 GHz voltage-controlled oscillator with 5 GHz of tunability and at least 17 mW of output power, as well as a 120 GHz oscillator providing 7 mW of output power. In addition, we will present results of a power amplifier which covers the full WR10 waveguide band (75-110 GHz), and provides 40-50 mW of output power. Furthermore, we will present an active doubler at 164 GHz providing 8% bandwidth, 3 mW of output power, and an unprecedented 2 dB of conversion loss for an InP HEMT MMIC at this frequency. Finally, we will demonstrate a power amplifier to cover 140-170 GHz with 15-25 mW of output power and 8 dB gain. These components can form a power source in the 155-165 GHz range by cascading the 80 GHz oscillator, W-band power amplifier, 164 GHz active doubler and final 140-170 GHz power amplifier for a stable, compact local oscillator subsystem, which could be used for atmospheric science or astrophysics radiometers.

  7. Implementation of 5-layer thermal diffusion scheme in weather research and forecasting model with Intel Many Integrated Cores

    NASA Astrophysics Data System (ADS)

    Huang, Melin; Huang, Bormin; Huang, Allen H.

    2014-10-01

    For weather forecasting and research, the Weather Research and Forecasting (WRF) model has been developed, consisting of several components such as dynamic solvers and physical simulation modules. WRF includes several Land-Surface Models (LSMs). The LSMs use atmospheric information, the radiative and precipitation forcing from the surface layer scheme, the radiation scheme, and the microphysics/convective scheme all together with the land's state variables and land-surface properties, to provide heat and moisture fluxes over land and sea-ice points. The WRF 5-layer thermal diffusion simulation is an LSM based on the MM5 5-layer soil temperature model with an energy budget that includes radiation, sensible, and latent heat flux. The WRF LSMs are very suitable for massively parallel computation as there are no interactions among horizontal grid points. The features, efficient parallelization and vectorization essentials, of Intel Many Integrated Core (MIC) architecture allow us to optimize this WRF 5-layer thermal diffusion scheme. In this work, we present the results of the computing performance on this scheme with Intel MIC architecture. Our results show that the MIC-based optimization improved the performance of the first version of multi-threaded code on Xeon Phi 5110P by a factor of 2.1x. Accordingly, the same CPU-based optimizations improved the performance on Intel Xeon E5-2603 by a factor of 1.6x as compared to the first version of multi-threaded code.

  8. Integrated Advanced Microwave Sounding Unit-A (AMSU-A). Engineering Test Report: METSAT A1 Signal Processor, (P/N 1331670-2, S/N F05)

    NASA Technical Reports Server (NTRS)

    Lund, D.

    1998-01-01

    This report presents a description of the tests performed, and the test data, for the A1 METSAT Signal Processor Assembly P/N 1331670-2, S/N F05. The assembly was tested in accordance with AE-26754, "METSAT Signal Processor Scan Drive and Integration Procedure." The objective is to demonstrate functionality of the signal processor prior to instrument integration.

  9. Ada Compiler Validation Summary Report: Certificate Number: 901212I1. 11120 Tartan Inc., Tartan Ada VMS/960MC Version 4.0 VAXstation 3100 = Intel ICE960/25 on an VMS 5.2 Intel EXV80960MC Board

    DTIC Science & Technology

    1991-01-09

    5.2 (Target), 90121211 .11120 6. AUTHOR(S) IABG-AVFT IOttobrunn, Federal Republic of Germany 7 PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES) N-1...FEDERAL REPUBLIC OF GERMANY 9 SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES) 10. SPONSORING/MONITORING AGENCY Ada Joint Program Office REPORT NUMBER...Ada implementation for which validation status is realized. Host Computer A computer system where Ada source programs are transformed System into

  10. Parasail: SIMD C library for global, semi-global, and local pairwise sequence alignments

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Daily, Jeffrey A.

    Sequence alignment algorithms are a key component of many bioinformatics applications. Though various fast Smith-Waterman local sequence alignment implementations have been developed for x86 CPUs, most are embedded into larger database search tools. In addition, fast implementations of Needleman-Wunsch global sequence alignment and its semi-global variants are not as widespread. This article presents the first software library for local, global, and semi-global pairwise intra-sequence alignments and improves the performance of previous intra-sequence implementations. As a result, a faster intra-sequence pairwise alignment implementation is described and benchmarked. Using a 375 residue query sequence a speed of 136 billion cell updates per second (GCUPS) was achieved on a dual Intel Xeon E5-2670 12-core processor system, the highest reported for an implementation based on Farrar's 'striped' approach. When using only a single thread, parasail was 1.7 times faster than Rognes's SWIPE. For many score matrices, parasail is faster than BLAST. The software library is designed for 64 bit Linux, OS X, or Windows on processors with SSE2, SSE4.1, or AVX2. Source code is available from https://github.com/jeffdaily/parasail under the Battelle BSD-style license. In conclusion, applications that require optimal alignment scores could benefit from the improved performance. For the first time, SIMD global, semi-global, and local alignments are available in a stand-alone C library.

  11. Efficient Execution of Microscopy Image Analysis on CPU, GPU, and MIC Equipped Cluster Systems.

    PubMed

    Andrade, G; Ferreira, R; Teodoro, George; Rocha, Leonardo; Saltz, Joel H; Kurc, Tahsin

    2014-10-01

    High performance computing is experiencing a major paradigm shift with the introduction of accelerators, such as graphics processing units (GPUs) and Intel Xeon Phi (MIC). These processors have made available a tremendous computing power at low cost, and are transforming machines into hybrid systems equipped with CPUs and accelerators. Although these systems can deliver a very high peak performance, making full use of their resources in real-world applications is a complex problem. Most current applications deployed to these machines are still being executed in a single processor, leaving other devices underutilized. In this paper we explore a scenario in which applications are composed of hierarchical data flow tasks which are allocated to nodes of a distributed memory machine in coarse-grain, but each of them may be composed of several finer-grain tasks which can be allocated to different devices within the node. We propose and implement novel performance aware scheduling techniques that can be used to allocate tasks to devices. We evaluate our techniques using a pathology image analysis application used to investigate brain cancer morphology, and our experimental evaluation shows that the proposed scheduling strategies significantly outperform other efficient scheduling techniques, such as Heterogeneous Earliest Finish Time (HEFT), in cooperative executions using CPUs, GPUs, and MICs. We also experimentally show that our strategies are less sensitive to inaccuracy in the scheduling input data and that the performance gains are maintained as the application scales.

  12. Efficient Execution of Microscopy Image Analysis on CPU, GPU, and MIC Equipped Cluster Systems

    PubMed Central

    Andrade, G.; Ferreira, R.; Teodoro, George; Rocha, Leonardo; Saltz, Joel H.; Kurc, Tahsin

    2015-01-01

    High performance computing is experiencing a major paradigm shift with the introduction of accelerators, such as graphics processing units (GPUs) and Intel Xeon Phi (MIC). These processors have made available a tremendous computing power at low cost, and are transforming machines into hybrid systems equipped with CPUs and accelerators. Although these systems can deliver a very high peak performance, making full use of their resources in real-world applications is a complex problem. Most current applications deployed to these machines are still being executed in a single processor, leaving other devices underutilized. In this paper we explore a scenario in which applications are composed of hierarchical data flow tasks which are allocated to nodes of a distributed memory machine in coarse-grain, but each of them may be composed of several finer-grain tasks which can be allocated to different devices within the node. We propose and implement novel performance aware scheduling techniques that can be used to allocate tasks to devices. We evaluate our techniques using a pathology image analysis application used to investigate brain cancer morphology, and our experimental evaluation shows that the proposed scheduling strategies significantly outperform other efficient scheduling techniques, such as Heterogeneous Earliest Finish Time (HEFT), in cooperative executions using CPUs, GPUs, and MICs. We also experimentally show that our strategies are less sensitive to inaccuracy in the scheduling input data and that the performance gains are maintained as the application scales. PMID:26640423

  13. Parasail: SIMD C library for global, semi-global, and local pairwise sequence alignments

    DOE PAGES

    Daily, Jeffrey A.

    2016-02-10

    Sequence alignment algorithms are a key component of many bioinformatics applications. Though various fast Smith-Waterman local sequence alignment implementations have been developed for x86 CPUs, most are embedded into larger database search tools. In addition, fast implementations of Needleman-Wunsch global sequence alignment and its semi-global variants are not as widespread. This article presents the first software library for local, global, and semi-global pairwise intra-sequence alignments and improves the performance of previous intra-sequence implementations. As a result, a faster intra-sequence pairwise alignment implementation is described and benchmarked. Using a 375 residue query sequence a speed of 136 billion cell updates per second (GCUPS) was achieved on a dual Intel Xeon E5-2670 12-core processor system, the highest reported for an implementation based on Farrar's 'striped' approach. When using only a single thread, parasail was 1.7 times faster than Rognes's SWIPE. For many score matrices, parasail is faster than BLAST. The software library is designed for 64 bit Linux, OS X, or Windows on processors with SSE2, SSE4.1, or AVX2. Source code is available from https://github.com/jeffdaily/parasail under the Battelle BSD-style license. In conclusion, applications that require optimal alignment scores could benefit from the improved performance. For the first time, SIMD global, semi-global, and local alignments are available in a stand-alone C library.
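
    As a rough illustration of the Smith-Waterman local alignment and the GCUPS metric mentioned in this record, the following is a minimal scalar sketch in Python. It is not parasail's API and not Farrar's striped SIMD layout; the function names, the linear gap penalty, and the scoring values (match=2, mismatch=-1, gap=-2) are assumptions chosen only for the example.

```python
def smith_waterman_score(query, target, match=2, mismatch=-1, gap=-2):
    """Best local alignment score with a linear gap penalty (illustrative values)."""
    prev = [0] * (len(target) + 1)  # previous row of the dynamic-programming table
    best = 0
    for q in query:
        curr = [0]  # column 0 is always zero for local alignment
        for j, t in enumerate(target, start=1):
            diag = prev[j - 1] + (match if q == t else mismatch)
            up = prev[j] + gap        # gap in the target
            left = curr[j - 1] + gap  # gap in the query
            cell = max(0, diag, up, left)  # the floor at 0 makes the alignment local
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


def gcups(query_len, target_len, seconds):
    """Billions of table cells updated per second for one query/target pair."""
    return query_len * target_len / seconds / 1e9
```

    Each table cell depends only on its left, upper, and upper-left neighbours, which is the property that SIMD implementations exploit to update many cells at once.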

  14. Many-core computing for space-based stereoscopic imaging

    NASA Astrophysics Data System (ADS)

    McCall, Paul; Torres, Gildo; LeGrand, Keith; Adjouadi, Malek; Liu, Chen; Darling, Jacob; Pernicka, Henry

    The potential benefits of using parallel computing in real-time visual-based satellite proximity operations missions are investigated. Improvements in performance and relative navigation solutions over single thread systems can be achieved through multi- and many-core computing. Stochastic relative orbit determination methods benefit from the higher measurement frequencies, allowing them to more accurately determine the associated statistical properties of the relative orbital elements. More accurate orbit determination can lead to reduced fuel consumption and extended mission capabilities and duration. Inherent to the process of stereoscopic image processing is the difficulty of loading, managing, parsing, and evaluating large amounts of data efficiently, which may result in delays or highly time consuming processes for single (or few) processor systems or platforms. In this research we utilize the Single-Chip Cloud Computer (SCC), a fully programmable 48-core experimental processor, created by Intel Labs as a platform for many-core software research, provided with a high-speed on-chip network for sharing information along with advanced power management technologies and support for message-passing. The results from utilizing the SCC platform for the stereoscopic image processing application are presented in the form of Performance, Power, Energy, and Energy-Delay-Product (EDP) metrics. Also, a comparison between the SCC results and those obtained from executing the same application on a commercial PC is presented, showing the potential benefits of utilizing the SCC in particular, and many-core platforms in general, for real-time processing of visual-based satellite proximity operations missions.

  15. IGA-ADS: Isogeometric analysis FEM using ADS solver

    NASA Astrophysics Data System (ADS)

    Łoś, Marcin M.; Woźniak, Maciej; Paszyński, Maciej; Lenharth, Andrew; Hassaan, Muhamm Amber; Pingali, Keshav

    2017-08-01

    In this paper we present a fast explicit solver for solution of non-stationary problems using L2 projections with isogeometric finite element method. The solver has been implemented within the GALOIS framework. It enables parallel multi-core simulations of different time-dependent problems, in 1D, 2D, or 3D. We have prepared the solver framework in a way that enables direct implementation of the selected PDE and corresponding boundary conditions. In this paper we describe the installation, implementation of three exemplary PDEs, and execution of the simulations on multi-core Linux cluster nodes. We consider three case studies, including heat transfer, linear elasticity, as well as non-linear flow in heterogeneous media. The presented package generates output suitable for interfacing with Gnuplot and ParaView visualization software. The exemplary simulations show near perfect scalability on the Gilbert shared-memory node with four Intel® Xeon® CPU E7-4860 processors, each possessing 10 physical cores (for a total of 40 cores).

  16. LHCb Kalman Filter cross architecture studies

    NASA Astrophysics Data System (ADS)

    Cámpora Pérez, Daniel Hugo

    2017-10-01

    The 2020 upgrade of the LHCb detector will vastly increase the rate of collisions the Online system needs to process in software, in order to filter events in real time. 30 million collisions per second will pass through a selection chain, where each step is executed conditional to its prior acceptance. The Kalman Filter is a fit applied to all reconstructed tracks which, due to its time characteristics and early execution in the selection chain, consumes 40% of the whole reconstruction time in the current trigger software. This makes the Kalman Filter a time-critical component as the LHCb trigger evolves into a full software trigger in the Upgrade. I present a new Kalman Filter algorithm for LHCb that can efficiently make use of any kind of SIMD processor, and its design is explained in depth. Performance benchmarks are compared between a variety of hardware architectures, including x86_64 and Power8, and the Intel Xeon Phi accelerator, and the suitability of said architectures to efficiently perform the LHCb Reconstruction process is determined.

  17. Traditional Tracking with Kalman Filter on Parallel Architectures

    NASA Astrophysics Data System (ADS)

    Cerati, Giuseppe; Elmer, Peter; Lantz, Steven; MacNeill, Ian; McDermott, Kevin; Riley, Dan; Tadel, Matevž; Wittich, Peter; Würthwein, Frank; Yagil, Avi

    2015-05-01

    Power density constraints are limiting the performance improvements of modern CPUs. To address this, we have seen the introduction of lower-power, multi-core processors, but the future will be even more exciting. In order to stay within the power density limits but still obtain Moore's Law performance/price gains, it will be necessary to parallelize algorithms to exploit larger numbers of lightweight cores and specialized functions like large vector units. Example technologies today include Intel's Xeon Phi and GPGPUs. Track finding and fitting is one of the most computationally challenging problems for event reconstruction in particle physics. At the High Luminosity LHC, for example, this will be by far the dominant problem. The most common track finding techniques in use today are however those based on the Kalman Filter. Significant experience has been accumulated with these techniques on real tracking detector systems, both in the trigger and offline. We report the results of our investigations into the potential and limitations of these algorithms on the new parallel hardware.

  18. MPIGeneNet: Parallel Calculation of Gene Co-Expression Networks on Multicore Clusters.

    PubMed

    Gonzalez-Dominguez, Jorge; Martin, Maria J

    2017-10-10

    In this work we present MPIGeneNet, a parallel tool that applies Pearson's correlation and Random Matrix Theory to construct gene co-expression networks. It is based on the state-of-the-art sequential tool RMTGeneNet, which provides networks with high robustness and sensitivity at the expense of relatively long runtimes for large scale input datasets. MPIGeneNet returns the same results as RMTGeneNet but improves the memory management, reduces the I/O cost, and accelerates the two most computationally demanding steps of co-expression network construction by exploiting the compute capabilities of common multicore CPU clusters. Our performance evaluation on two different systems using three typical input datasets shows that MPIGeneNet is significantly faster than RMTGeneNet. As an example, our tool is up to 175.41 times faster on a cluster with eight nodes, each one containing two 12-core Intel Haswell processors. Source code of MPIGeneNet, as well as a reference manual, are available at https://sourceforge.net/projects/mpigenenet/.
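
    To make the Pearson-correlation step of co-expression network construction concrete, here is a minimal serial sketch in Python. It is not MPIGeneNet or RMTGeneNet code: the function names and the toy expression values are hypothetical, and the fixed cutoff of 0.9 merely stands in for the threshold that those tools derive with Random Matrix Theory, which is not shown here.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator


def coexpression_edges(expression, threshold):
    """Return gene pairs whose absolute correlation reaches the given threshold."""
    edges = set()
    for g, h in combinations(sorted(expression), 2):
        if abs(pearson(expression[g], expression[h])) >= threshold:
            edges.add((g, h))
    return edges


# Hypothetical toy data: four genes measured in four samples.
toy = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [1.0, 3.0, 2.0, 4.0],
}
edges = coexpression_edges(toy, threshold=0.9)
```

    The all-pairs loop is quadratic in the number of genes and every pair is independent, which is why this step is a natural candidate for distribution across cluster nodes.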

  19. Fabrication of Circuit QED Quantum Processors, Part 1: Extensible Footprint for a Superconducting Surface Code

    NASA Astrophysics Data System (ADS)

    Bruno, A.; Michalak, D. J.; Poletto, S.; Clarke, J. S.; Dicarlo, L.

    Large-scale quantum computation hinges on the ability to preserve and process quantum information with higher fidelity by increasing redundancy in a quantum error correction code. We present the realization of a scalable footprint for superconducting surface code based on planar circuit QED. We developed a tileable unit cell for surface code with all I/O routed vertically by means of superconducting through-silicon vias (TSVs). We address some of the challenges encountered during the fabrication and assembly of these chips, such as the quality of etch of the TSV, the uniformity of the ALD TiN coating conformal to the TSV, and the reliability of superconducting indium contact between the chips and PCB. We compare measured performance to a detailed list of specifications required for the realization of quantum fault tolerance. Our demonstration using centimeter-scale chips can accommodate the 50 qubits needed to target the experimental demonstration of small-distance logical qubits. Research funded by Intel Corporation and IARPA.

  20. Army/NASA small turboshaft engine digital controls research program

    NASA Technical Reports Server (NTRS)

    Sellers, J. F.; Baez, A. N.

    1981-01-01

    The emphasis of a program to conduct digital controls research for small turboshaft engines is on engine test evaluation of advanced control logic using a flexible microprocessor based digital control system designed specifically for research on advanced control logic. Control software is stored in programmable memory. New control algorithms may be stored in a floppy disk and loaded directly into memory. This feature facilitates comparative evaluation of different advanced control modes. The central processor in the digital control is an Intel 8086 16 bit microprocessor. Control software is programmed in assembly language. Software checkout is accomplished prior to engine test by connecting the digital control to a real time hybrid computer simulation of the engine. The engine currently installed in the facility has a hydromechanical control modified to allow electrohydraulic fuel metering and VG actuation by the digital control. Simulation results are presented which show that the modern control reduces the transient rotor speed droop caused by unanticipated load changes such as cyclic pitch or wind gust transients.

  1. Parallelized reliability estimation of reconfigurable computer networks

    NASA Technical Reports Server (NTRS)

    Nicol, David M.; Das, Subhendu; Palumbo, Dan

    1990-01-01

    A parallelized system, ASSURE, for computing the reliability of embedded avionics flight control systems which are able to reconfigure themselves in the event of failure is described. ASSURE accepts a grammar that describes a reliability semi-Markov state-space. From this it creates a parallel program that simultaneously generates and analyzes the state-space, placing upper and lower bounds on the probability of system failure. ASSURE is implemented on a 32-node Intel iPSC/860, and has achieved high processor efficiencies on real problems. Through a combination of improved algorithms, exploitation of parallelism, and use of an advanced microprocessor architecture, ASSURE has reduced the execution time on substantial problems by a factor of one thousand over previous workstation implementations. Furthermore, ASSURE's parallel execution rate on the iPSC/860 is an order of magnitude faster than its serial execution rate on a Cray-2 supercomputer. While dynamic load balancing is necessary for ASSURE's good performance, it is needed only infrequently; the particular method of load balancing used does not substantially affect performance.

  2. Methods for compressible fluid simulation on GPUs using high-order finite differences

    NASA Astrophysics Data System (ADS)

    Pekkilä, Johannes; Väisälä, Miikka S.; Käpylä, Maarit J.; Käpylä, Petri J.; Anjum, Omer

    2017-08-01

    We focus on implementing and optimizing a sixth-order finite-difference solver for simulating compressible fluids on a GPU using third-order Runge-Kutta integration. Since graphics processing units perform well in data-parallel tasks, this makes them an attractive platform for fluid simulation. However, high-order stencil computation is memory-intensive with respect to both main memory and the caches of the GPU. We present two approaches for simulating compressible fluids using 55-point and 19-point stencils. We seek to reduce the requirements for memory bandwidth and cache size in our methods by using cache blocking and decomposing a latency-bound kernel into several bandwidth-bound kernels. Our fastest implementation is bandwidth-bound and integrates 343 million grid points per second on a Tesla K40t GPU, achieving a 3.6× speedup over a comparable hydrodynamics solver benchmarked on two Intel Xeon E5-2690v3 processors. Our alternative GPU implementation is latency-bound and achieves the rate of 168 million updates per second.
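
    For readers unfamiliar with high-order stencils, the sketch below shows the standard sixth-order central difference for a first derivative on a periodic 1D grid, written in plain Python. It is only an illustration of the stencil idea, not the authors' GPU kernels; the function name and the periodic boundary treatment are assumptions made for the example.

```python
import math

# Standard sixth-order central-difference weights for the first derivative,
# applied to the grid points at offsets -3, -2, -1, 0, +1, +2, +3.
COEFFS = (-1 / 60, 3 / 20, -3 / 4, 0.0, 3 / 4, -3 / 20, 1 / 60)


def ddx_sixth_order(f, h):
    """Approximate df/dx on a periodic grid with spacing h."""
    n = len(f)
    out = []
    for i in range(n):
        total = 0.0
        for offset, coeff in zip(range(-3, 4), COEFFS):
            total += coeff * f[(i + offset) % n]  # wrap around for periodicity
        out.append(total / h)
    return out


# Usage: differentiate sin(x) sampled on 64 points over one period.
n = 64
h = 2 * math.pi / n
samples = [math.sin(i * h) for i in range(n)]
derivative = ddx_sixth_order(samples, h)
```

    Each output point reads six neighbours along one axis, so in three dimensions the number of memory reads per update grows quickly, which is the memory pressure the abstract refers to.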

  3. A fully reconfigurable photonic integrated signal processor

    NASA Astrophysics Data System (ADS)

    Liu, Weilin; Li, Ming; Guzzon, Robert S.; Norberg, Erik J.; Parker, John S.; Lu, Mingzhi; Coldren, Larry A.; Yao, Jianping

    2016-03-01

    Photonic signal processing has been considered a solution to overcome the inherent electronic speed limitations. Over the past few years, an impressive range of photonic integrated signal processors have been proposed, but they usually offer limited reconfigurability, a feature highly needed for the implementation of large-scale general-purpose photonic signal processors. Here, we report and experimentally demonstrate a fully reconfigurable photonic integrated signal processor based on an InP-InGaAsP material system. The proposed photonic signal processor is capable of performing reconfigurable signal processing functions including temporal integration, temporal differentiation and Hilbert transformation. The reconfigurability is achieved by controlling the injection currents to the active components of the signal processor. Our demonstration suggests great potential for chip-scale fully programmable all-optical signal processing.

  4. muBLASTP: database-indexed protein sequence search on multicore CPUs.

    PubMed

    Zhang, Jing; Misra, Sanchit; Wang, Hao; Feng, Wu-Chun

    2016-11-04

    The Basic Local Alignment Search Tool (BLAST) is a fundamental program in the life sciences that searches databases for sequences that are most similar to a query sequence. Currently, the BLAST algorithm utilizes a query-indexed approach. Although many approaches suggest that sequence search with a database index can achieve much higher throughput (e.g., BLAT, SSAHA, and CAFE), they cannot deliver the same level of sensitivity as the query-indexed BLAST, i.e., NCBI BLAST, or they can only support nucleotide sequence search, e.g., MegaBLAST. Due to different challenges and characteristics between query indexing and database indexing, the existing techniques for query-indexed search cannot be used in database-indexed search. muBLASTP, a novel database-indexed BLAST for protein sequence search, delivers hits identical to those returned by NCBI BLAST. On Intel Haswell multicore CPUs, for a single query, the single-threaded muBLASTP achieves up to a 4.41-fold speedup for alignment stages, and up to a 1.75-fold end-to-end speedup over single-threaded NCBI BLAST. For a batch of queries, the multithreaded muBLASTP achieves up to a 5.7-fold speedup for alignment stages, and up to a 4.56-fold end-to-end speedup over multithreaded NCBI BLAST. With a newly designed index structure for the protein database and associated optimizations in the BLASTP algorithm, we re-factored the BLASTP algorithm for modern multicore processors, achieving much higher throughput with an acceptable memory footprint for the database index.

  5. Comparison of neuronal spike exchange methods on a Blue Gene/P supercomputer.

    PubMed

    Hines, Michael; Kumar, Sameer; Schürmann, Felix

    2011-01-01

    For neural network simulations on parallel machines, interprocessor spike communication can be a significant portion of the total simulation time. The performance of several spike exchange methods using a Blue Gene/P (BG/P) supercomputer has been tested with 8-128 K cores using randomly connected networks of up to 32 M cells with 1 k connections per cell and 4 M cells with 10 k connections per cell, i.e., on the order of 4·10^10 connections (K is 1024, M is 1024^2, and k is 1000). The spike exchange methods used are the standard Message Passing Interface (MPI) collective, MPI_Allgather, and several variants of the non-blocking Multisend method either implemented via non-blocking MPI_Isend, or exploiting the possibility of very low overhead direct memory access (DMA) communication available on the BG/P. In all cases, the worst performing method was that using MPI_Isend due to the high overhead of initiating a spike communication. The two best performing methods (the persistent Multisend method using the Record-Replay feature of the Deep Computing Messaging Framework DCMF_Multicast; and a two-phase multisend in which a DCMF_Multicast is used to first send to a subset of phase one destination cores, which then pass it on to their subset of phase two destination cores) had similar performance with very low overhead for the initiation of spike communication. Departure from ideal scaling for the Multisend methods is almost completely due to load imbalance caused by the large variation in number of cells that fire on each processor in the interval between synchronization. Spike exchange time itself is negligible since transmission overlaps with computation and is handled by a DMA controller. We conclude that ideal performance scaling will be ultimately limited by imbalance between incoming processor spikes between synchronization intervals. Thus, counterintuitively, maximization of load balance requires that the distribution of cells on processors should not reflect neural net architecture but be randomly distributed so that sets of cells which are burst firing together should be on different processors with their targets on as large a set of processors as possible.

  6. International Conference on Indium Phosphide and Related Materials, Held in Cape Cod, Massachusetts, on 11 - 15 May 1997.

    DTIC Science & Technology

    1998-01-14

    runaway cells are very uniform across the wafer. On-wafer active load- causing the so-called current collapse. Using a Au air- pull measurement was...Input Power [dBm] support and encouragement. References Fig. 4: On-wafer load-pull measurement at 9 GHz. [1] P. M. Asbeck, M. C. F. Chang, J. A...Measured Load Pull Characteristics of the 0.15 µm x 300 µm GaInAs/InP HEMT at 7 GHz. 160 exceeded 830 mS/mm for > 0.5V. The 140 small-signal output

  7. Recent advances in PC-Linux systems for electronic structure computations by optimized compilers and numerical libraries.

    PubMed

    Yu, Jen-Shiang K; Yu, Chin-Hui

    2002-01-01

    One of the most frequently used packages for electronic structure research, GAUSSIAN 98, is compiled on Linux systems with various hardware configurations, including AMD Athlon (with the "Thunderbird" core), AthlonMP, and AthlonXP (with the "Palomino" core) systems as well as the Intel Pentium 4 (with the "Willamette" core) machines. The default PGI FORTRAN compiler (pgf77) and the Intel FORTRAN compiler (ifc) are respectively employed with different architectural optimization options to compile GAUSSIAN 98 and test the performance improvement. In addition to the BLAS library included in revision A.11 of this package, the Automatically Tuned Linear Algebra Software (ATLAS) library is linked against the binary executables to improve the performance. Various Hartree-Fock, density-functional theories, and the MP2 calculations are done for benchmarking purposes. It is found that the combination of ifc with ATLAS library gives the best performance for GAUSSIAN 98 on all of these PC-Linux computers, including AMD and Intel CPUs. Even on AMD systems, the Intel FORTRAN compiler invariably produces binaries with better performance than pgf77. The enhancement provided by the ATLAS library is more significant for post-Hartree-Fock calculations. The performance on one single CPU is potentially as good as that on an Alpha 21264A workstation or an SGI supercomputer. The floating-point marks by SpecFP2000 have similar trends to the results of GAUSSIAN 98 package.

  8. New On-board Microprocessors

    NASA Astrophysics Data System (ADS)

    Weigand, R.

    Two new processor devices have been developed for use on board spacecraft. An 8-bit 8032 microcontroller targets typical controlling applications in instruments and sub-systems, or could be used as a main processor on small satellites, whereas the LEON 32-bit SPARC processor can be used for high-performance controlling and data processing tasks. The ADV80S32 is fully compliant with the Intel 80x1 architecture and instruction set, extended by additional peripherals, 512 bytes of on-chip RAM and a bootstrap PROM, which allows downloading the application software using the CCSDS PacketWire protocol. The memory controller provides a de-multiplexed address/data bus, and allows access to up to 16 MB of data and 8 MB of program RAM. The peripherals have been designed for the specific needs of a spacecraft, such as serial interfaces compatible with RS232, PacketWire and TTC-B-01, counters/timers for extended duration and a CRC calculation unit accelerating the CCSDS TM/TC protocol. The 0.5 µm Atmel manufacturing technology (MG2RT) provides latch-up and total dose immunity; SEU fault immunity is implemented by using SEU-hardened flip-flops and EDAC protection of internal and external memories. The maximum clock frequency of 20 MHz allows a processing power of 3 MIPS. Engineering samples are available. For SW development, various SW packages for the 8051 architecture are on the market. The LEON processor implements a 32-bit SPARC V8 architecture, including all the multiply and divide instructions, complemented by a floating-point unit (FPU). It includes several standard peripherals, such as timers/watchdog, interrupt controller, UARTs, parallel I/Os and a memory controller, allowing the use of 8-, 16- and 32-bit PROM, SRAM or memory-mapped I/O. With separate on-chip instruction and data caches, almost one instruction per clock cycle can be reached in some applications. A 33-MHz 32-bit PCI master/target interface and a PCI arbiter allow the device to be operated in a plug-in card (for SW development on a PC, etc.) or used as a PCI master controller in an on-board system. Advanced SEU fault tolerance is introduced by design, using triple modular redundancy (TMR) flip-flops for all registers and EDAC protection for all memories. The device will be manufactured in a radiation-hard Atmel 0.25 µm technology, targeting a 100 MHz processor clock frequency. The non-fault-tolerant LEON processor VHDL model is available as free source code, and the SPARC architecture is a well-known industry standard. Therefore, know-how, software tools and operating systems are widely available.

  9. Intel Teach to the Future: A Partnership for Professional Development.

    ERIC Educational Resources Information Center

    Metcalf, Teri; Jolly, Deborah

    This paper describes a public/private partnership program designed to provide staff development to help classroom teachers integrate technology in the curriculum by using the train-the-trainer model. The Intel® Teach to the Future Project was developed by Intel® in collaboration with other public and private sector partners, and has been…

  10. Scalable and portable visualization of large atomistic datasets

    NASA Astrophysics Data System (ADS)

    Sharma, Ashish; Kalia, Rajiv K.; Nakano, Aiichiro; Vashishta, Priya

    2004-10-01

    A scalable and portable code named Atomsviewer has been developed to interactively visualize a large atomistic dataset consisting of up to a billion atoms. The code uses a hierarchical view frustum-culling algorithm based on the octree data structure to efficiently remove atoms outside of the user's field-of-view. Probabilistic and depth-based occlusion-culling algorithms then select atoms that have a high probability of being visible. Finally, a multiresolution algorithm is used to render the selected subset of visible atoms at varying levels of detail. Atomsviewer is written in C++ and OpenGL, and it has been tested on a number of architectures including Windows, Macintosh, and SGI. Atomsviewer has been used to visualize tens of millions of atoms on a standard desktop computer and, in its parallel version, up to a billion atoms. Program summary. Title of program: Atomsviewer; Catalogue identifier: ADUM; Program summary URL: http://cpc.cs.qub.ac.uk/summaries/ADUM; Program obtainable from: CPC Program Library, Queen's University of Belfast, N. Ireland; Computer for which the program is designed and others on which it has been tested: 2.4 GHz Pentium 4/Xeon processor, professional graphics card; Apple G4 (867 MHz)/G5, professional graphics card; Operating systems under which the program has been tested: Windows 2000/XP, Mac OS 10.2/10.3, SGI IRIX 6.5; Programming languages used: C++, C and OpenGL; Memory required to execute with typical data: 1 gigabyte of RAM; High speed storage required: 60 gigabytes; No. of lines in the distributed program including test data, etc.: 550 241; No. of bytes in the distributed program including test data, etc.: 6 258 245; Number of bits in a word: Arbitrary; Number of processors used: 1; Has the code been vectorized or parallelized: No; Distribution format: tar gzip file; Nature of physical problem: Scientific visualization of atomic systems; Method of solution: Rendering of atoms using computer graphic techniques, culling algorithms for data minimization, and levels-of-detail for minimal rendering; Restrictions on the complexity of the problem: None; Typical running time: The program is interactive in its execution; Unusual features of the program: None; References: The conceptual foundation and subsequent implementation of the algorithms are found in [A. Sharma, A. Nakano, R.K. Kalia, P. Vashishta, S. Kodiyalam, P. Miller, W. Zhao, X.L. Liu, T.J. Campbell, A. Haas, Presence—Teleoperators and Virtual Environments 12 (1) (2003)].

  11. Transformation of two and three-dimensional regions by elliptic systems

    NASA Technical Reports Server (NTRS)

    Mastin, C. Wayne

    1994-01-01

    Several reports are attached to this document which contain the results of our research at the end of this contract period. Three of the reports deal with our work on generating surface grids. One is a preprint of a paper which will appear in the journal Applied Mathematics and Computation. Another is the abstract from a dissertation which has been prepared by Ahmed Khamayseh, a graduate student who has been supported by this grant for the last two years. The last report on surface grids is the extended abstract of a paper to be presented at the 14th IMACS World Congress in July. This report contains results on conformal mappings of surfaces, which are closely related to elliptic methods for surface grid generation. A preliminary report is included on new methods for dealing with block interfaces in multiblock grid systems. The development work is complete and the methods will eventually be incorporated into the National Grid Project (NGP) grid generation code. Thus, the attached report contains only a simple grid system which was used to test the algorithms to prove that the concepts are sound. These developments will greatly aid grid control when using elliptic systems and prevent unwanted grid movement. The last report is a brief summary of some timings that were obtained when the multiblock grid generation code was run on the Intel iPSC/860 hypercube. Since most of the data in a grid code is local to a particular block, only a small fraction of the total data must be passed between processors. The data is also distributed among the processors so that the total size of the grid can be increased along with the number of processors. This work is only in a preliminary stage. However, one of the ERC graduate students has taken an interest in the project and is presently extending these results as a part of his master's thesis.

  12. Computing effective properties of random heterogeneous materials on heterogeneous parallel processors

    NASA Astrophysics Data System (ADS)

    Leidi, Tiziano; Scocchi, Giulio; Grossi, Loris; Pusterla, Simone; D'Angelo, Claudio; Thiran, Jean-Philippe; Ortona, Alberto

    2012-11-01

    In recent decades, finite element (FE) techniques have been extensively used for predicting effective properties of random heterogeneous materials. In the case of very complex microstructures, the choice of numerical methods for the solution of this problem can offer some advantages over classical analytical approaches, and it allows the use of digital images obtained from real material samples (e.g., using computed tomography). On the other hand, having a large number of elements is often necessary for properly describing complex microstructures, ultimately leading to extremely time-consuming computations and high memory requirements. With the final objective of reducing these limitations, we improved an existing freely available FE code for the computation of effective conductivity (electrical and thermal) of microstructure digital models. To allow execution on hardware combining multi-core CPUs and a GPU, we first translated the original algorithm from Fortran to C, and we subdivided it into software components. Then, we enhanced the C version of the algorithm for parallel processing with heterogeneous processors. With the goal of maximizing the obtained performances and limiting resource consumption, we utilized a software architecture based on stream processing, event-driven scheduling, and dynamic load balancing. The parallel processing version of the algorithm has been validated using a simple microstructure consisting of a single sphere located at the centre of a cubic box, yielding consistent results. Finally, the code was used for the calculation of the effective thermal conductivity of a digital model of a real sample (a ceramic foam obtained using X-ray computed tomography). On a computer equipped with dual hexa-core Intel Xeon X5670 processors and an NVIDIA Tesla C2050, the parallel application version features near to linear speed-up progression when using only the CPU cores. It executes more than 20 times faster when additionally using the GPU.

  13. 50 CFR 660.160 - Catcher/processor (C/P) Coop Program.

    Code of Federal Regulations, 2010 CFR

    2010-10-01

    ... Coop Program, or the Shorebased IFQ Program. As determined necessary by the Regional Administrator... combination. [Reserved] (4) Appeals. [Reserved] (5) Fees. The Regional Administrator is authorized to charge... entry permit owner in the NMFS permit database. (ii) Qualifying criteria for C/P endorsement. In order...

  14. Communication-Avoiding Parallel Recursive Algorithms for Matrix Multiplication

    DTIC Science & Technology

    2013-05-17

    cost recurrence is $F_{UM}(n, P) = 15\left(\frac{n^2}{4P}\right) + F_{UM}\left(\frac{n}{2}, \frac{P}{7}\right)$ with base case $F_{UM}(n, 1) = c_s n^{\omega_0} - 5n^2$, where $c_s$ is the constant of Strassen-Winograd...message varies according to the recursion depth, and is the number of words a processor owns of any $S_i$, $T_i$, or $Q_i$, namely $\frac{n^2}{4P}$ words. ¹ If one does not...recurrence for the entire UM scheme: $W_{UM}(n, P) = 36\,\frac{n^2}{4P} + W_{UM}\left(\frac{n}{2}, \frac{P}{7}\right)$, $S_{UM}(n, P) = 36 + S_{UM}\left(\frac{n}{2}, \frac{P}{7}\right)$ with base case $S_{UM}(n, 1) = W_{UM}(n, 1$
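
    The flop-count recurrence quoted in this excerpt can be evaluated numerically. The following is a minimal sketch, not code from the report: the function name `f_um` is hypothetical, and the defaults c_s = 6 and ω0 = log2 7 are the usual Strassen-Winograd values, which the excerpt itself does not state. It also assumes that P is a power of 7.

```python
import math


def f_um(n: float, p: int, c_s: float = 6.0, omega0: float = math.log2(7)) -> float:
    """Evaluate F(n, P) = 15 * n^2 / (4P) + F(n/2, P/7), with F(n, 1) = c_s * n^omega0 - 5 n^2.

    c_s and omega0 are assumed values; p must be a power of 7.
    """
    if p == 1:
        # Base case: sequential cost on a single processor.
        return c_s * n ** omega0 - 5 * n ** 2
    if p % 7 != 0:
        raise ValueError("p must be a power of 7")
    # One recursion step: matrix additions on n^2 / (4P) local words, then recurse.
    return 15 * n ** 2 / (4 * p) + f_um(n / 2, p // 7, c_s, omega0)


if __name__ == "__main__":
    n = 4096
    for p in (1, 7, 49, 343):
        print(p, f_um(n, p), f_um(n, 1) / p)
```

    Under these assumptions the recurrence unrolls to F(n, P) = F(n, 1) / P, so the two printed columns agree up to rounding: the arithmetic work is divided evenly among the processors.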

  15. Development of a wireless system for auditory neuroscience.

    PubMed

    Lukes, A J; Lear, A T; Snider, R K

    2001-01-01

    In order to study how the auditory cortex extracts communication sounds in a realistic acoustic environment, a wireless system is being developed that will transmit acoustic as well as neural signals. The miniature transmitter will be capable of transmitting two acoustic signals with 37.5 kHz bandwidths (75 kHz sample rate) and 56 neural signals with bandwidths of 9.375 kHz (18.75 kHz sample rate). These signals will be time-division multiplexed into one high bandwidth signal with a 1.2 MHz sample rate. This high bandwidth signal will then be frequency modulated onto a 2.4 GHz carrier, which resides in the industrial, scientific, and medical (ISM) band that is designed for low-power short-range wireless applications. On the receiver side, the signal will be demodulated from the 2.4 GHz carrier and then digitized by an analog-to-digital (A/D) converter. The acoustic and neural signals will be digitally demultiplexed from the multiplexed signal into their respective channels. Oversampling (20 MHz) will allow the reconstruction of the multiplexing clock by a digital signal processor (DSP) that will perform frame and bit synchronization. A frame is a subset of the signal that contains all the channels, and several channels tied high and low will signal the start of a frame. This technological development will bring two benefits to auditory neuroscience. It will allow simultaneous recording of many neurons, which will permit studies of population codes. It will also allow neural functions to be determined in higher auditory areas by correlating neural and acoustic signals without a priori knowledge of the necessary stimuli.
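
    The channel counts and sample rates in this abstract can be checked against the stated 1.2 MHz multiplexed rate. The sketch below is only that arithmetic; the names are hypothetical, and it assumes that each sample occupies one time slot and that the frame rate equals the slowest channel's sample rate (18.75 kHz), neither of which the abstract states.

```python
# (channel count, sample rate in kHz), taken from the abstract
ACOUSTIC = (2, 75.0)
NEURAL = (56, 18.75)


def aggregate_rate_khz(*groups):
    """Total samples per millisecond across all channel groups."""
    return sum(count * rate for count, rate in groups)


def slots_per_frame(frame_rate_khz, *groups):
    """Time slots needed per frame if every sample takes one slot (assumption)."""
    return sum(count * rate / frame_rate_khz for count, rate in groups)


if __name__ == "__main__":
    print(aggregate_rate_khz(ACOUSTIC, NEURAL), "kHz")
    print(slots_per_frame(18.75, ACOUSTIC, NEURAL), "slots per frame")
```

    The two acoustic channels contribute 150 kHz and the 56 neural channels 1050 kHz, which together give the 1.2 MHz quoted above. How the frame-marker channels fit into that budget is not stated in the abstract and is not modeled here.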

  16. 47 CFR 25.136 - Licensing provisions for user transceivers in the 1.6/2.4 GHz, 1.5/1.6 GHz, and 2 GHz Mobile...

    Code of Federal Regulations, 2011 CFR

    2011-10-01

    ... Applications and Licenses Earth Stations § 25.136 Licensing provisions for user transceivers in the 1.6/2.4 GHz... specified in § 25.213, earth stations operating in the 1.6/2.4 GHz and 1.5/1.6 GHz Mobile Satellite Services... aircraft unless the earth station has a direct physical connection to the aircraft cabin or cockpit...

  17. 47 CFR 25.136 - Licensing provisions for user transceivers in the 1.6/2.4 GHz, 1.5/1.6 GHz, and 2 GHz Mobile...

    Code of Federal Regulations, 2012 CFR

    2012-10-01

    ... Applications and Licenses Earth Stations § 25.136 Licensing provisions for user transceivers in the 1.6/2.4 GHz... specified in § 25.213, earth stations operating in the 1.6/2.4 GHz and 1.5/1.6 GHz Mobile Satellite Services... aircraft unless the earth station has a direct physical connection to the aircraft cabin or cockpit...

  18. 47 CFR 25.136 - Licensing provisions for user transceivers in the 1.6/2.4 GHz, 1.5/1.6 GHz, and 2 GHz Mobile...

    Code of Federal Regulations, 2013 CFR

    2013-10-01

    ... Applications and Licenses Earth Stations § 25.136 Licensing provisions for user transceivers in the 1.6/2.4 GHz... specified in § 25.213, earth stations operating in the 1.6/2.4 GHz and 1.5/1.6 GHz Mobile-Satellite Services... aircraft unless the earth station has a direct physical connection to the aircraft cabin or cockpit...

  19. Simulation on change of generic satellite radar cross section via artificially created plasma sprays

    NASA Astrophysics Data System (ADS)

    Chung, Shen Shou Max; Chuang, Yu-Chou

    2016-06-01

    Recent advancements in antisatellite missile technologies have proven the effectiveness of such attacks, and the vulnerability of satellites in such exercises inspires a new paradigm in RF Stealth techniques suitable for satellites. In this paper we examine the possibility of using artificially created plasma sprays on the surface of the satellite’s main body to alter its radar cross section (RCS). First, we briefly review past research related to RF Stealth using plasma. Next, we discuss the physics between electromagnetic waves and plasma, and the RCS number game in RF Stealth design. A comparison of RCS in a generic satellite and a more complicated model is made to illustrate the effect of the RCS number game, and its meaning for a simulation model. We also run a comparison between finite-difference-time-domain (FDTD) and multilevel fast multipole method (MLFMM) codes, and find the RCS results are very close. We then compare the RCS of the generic satellite and the plasma-covered satellite. The incident radar wave is a differentiated Gaussian monopulse, with 3 dB bandwidth between 1.2 GHz and 4 GHz, and we simulate three kinds of plasma density, with a characteristic plasma frequency ω_P = 0.1, 1, and 10 GHz. The electron-neutral collision frequency ν_en is set at 0.01 GHz. We found that the RCS of the plasma-covered satellite is not necessarily smaller than that of the original satellite. When ω_P is 0.1 GHz, the plasma spray behaves like a dielectric, and there is minor reduction in the RCS. When ω_P is 1 GHz, the X-Y cut RCS increases. When ω_P is 10 GHz, the plasma behaves more like a metal to the radar wave, and a stronger dependence of RCS on frequency appears. Therefore, using plasma as an RCS adjustment tool requires careful fine-tuning of plasma density and shape in order to achieve the so-called plasma stealth effect.
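
    The dielectric-like and metal-like regimes described in this abstract can be illustrated with the textbook cold-plasma (Drude) permittivity. This is a rough sketch and not the paper's FDTD or MLFMM model. It assumes an unmagnetized plasma, an exp(jωt) time convention, that the quoted plasma frequencies are ordinary frequencies in GHz, and that the collision frequency is a rate in s^-1; the function names are hypothetical.

```python
import math

E_CHARGE = 1.602176634e-19   # elementary charge, C
E_MASS = 9.1093837015e-31    # electron mass, kg
EPS0 = 8.8541878128e-12      # vacuum permittivity, F/m


def electron_density(fp_hz: float) -> float:
    """Electron density (m^-3) whose plasma frequency is fp_hz."""
    omega_p = 2 * math.pi * fp_hz
    return EPS0 * E_MASS * omega_p ** 2 / E_CHARGE ** 2


def drude_permittivity(f_hz: float, fp_hz: float, nu_hz: float) -> complex:
    """Relative permittivity 1 - wp^2 / (w * (w - j*nu)) of a cold collisional plasma."""
    omega = 2 * math.pi * f_hz
    omega_p = 2 * math.pi * fp_hz
    return 1 - omega_p ** 2 / (omega * (omega - 1j * nu_hz))


if __name__ == "__main__":
    # 2 GHz lies inside the 1.2-4 GHz band of the incident pulse.
    for fp_ghz in (0.1, 1.0, 10.0):
        eps = drude_permittivity(2e9, fp_ghz * 1e9, 0.01e9)
        n_e = electron_density(fp_ghz * 1e9)
        print(f"f_p = {fp_ghz} GHz: n_e = {n_e:.3g} m^-3, Re(eps_r) = {eps.real:.3g}")
```

    Under these assumptions the real part of the permittivity at 2 GHz stays just below 1 for the 0.1 GHz plasma and is strongly negative for the 10 GHz plasma, which is qualitatively in line with the dielectric-like and metal-like behavior the abstract reports.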

  20. System calibration of the 1.4 GHz and 5 GHz radiometers for soil moisture remote sensing

    NASA Technical Reports Server (NTRS)

    Wang, J.; Shiue, J.; Gould, W.; Fuchs, J.; Hirschmann, E.; Glazar, W.

    1980-01-01

    Two microwave radiometers at the frequencies of 1.4 GHz and 5 GHz were mounted on a mobile tower and used for a remote sensing of soil moisture experiment at a Beltsville Agriculture Research Center test site. The experiment was performed in October 1979 over both bare field and fields covered with grass, soybean, and corn. The calibration procedure for the radiometer systems which forms the basis of obtaining the final radiometric data product is described. It is estimated from the calibration results that the accuracy of the 1.4 GHz radiometric measurements is about ±3 K. The measured 5 GHz brightness temperatures over bare fields with moisture content greater than 10 percent by dry weight are about 8 K lower than those taken simultaneously at 1.4 GHz. This could be due to either (1) a 5 GHz antenna side lobe seeing the cold brightness of the sky, or (2) the thermal microwave emission from a soil being less sensitive to surface roughness at 5 GHz than at 1.4 GHz.

  1. Three MMIC Amplifiers for the 120-to-200 GHz Frequency Band

    NASA Technical Reports Server (NTRS)

    Samoska, Lorene; Schmitz, Adele

    2009-01-01

    Closely following the development reported in the immediately preceding article, three new monolithic microwave integrated circuit (MMIC) amplifiers that would operate in the 120-to-200-GHz frequency band have been designed and are under construction at this writing. The active devices in these amplifiers are InP high-electron-mobility transistors (HEMTs). These amplifiers (see figure) are denoted the LSLNA150, the LSA200, and the LSA185, respectively. Like the amplifiers reported in the immediately preceding article, the LSLNA150 (1) is intended to be a prototype of low-noise amplifiers (LNAs) to be incorporated into spaceborne instruments for sensing cosmic microwave background radiation and (2) has potential for terrestrial use in electronic test equipment, passive millimeter-wave imaging systems, radar receivers, communication receivers, and systems for detecting hidden weapons. The HEMTs in this amplifier were fabricated according to 0.08-µm design rules of a commercial product line of InP HEMT MMICs at HRL Laboratories, LLC, with a gate geometry of 2 fingers, each 15 µm wide. On the basis of computational simulations, this amplifier is designed to afford at least 15 dB of gain, with a noise figure of no more than about 6 dB, at frequencies from 120 to 160 GHz. The measured results of the amplifier are shown next to the chip photo, with a gain of 16 dB at 150 GHz. Noise figure work is ongoing. The LSA200 and the LSA185 are intended to be prototypes of transmitting power amplifiers for use at frequencies between about 180 and about 200 GHz. These amplifiers have also been fabricated according to rules of the aforesaid commercial product line of InP HEMT MMICs, except that the HEMTs in these amplifiers are characterized by a gate geometry of 4 fingers, each 37 µm wide. The measured peak performance of the LSA200 is characterized by a gain of about 1.4 dB at a frequency of 190 GHz; the measured peak performance of the LSA185 is characterized by a gain of about 2.7 dB at a frequency of 181 GHz. The measured gain results of each chip are shown next to their respective photos.

  2. Accelerating a three-dimensional eco-hydrological cellular automaton on GPGPU with OpenCL

    NASA Astrophysics Data System (ADS)

    Senatore, Alfonso; D'Ambrosio, Donato; De Rango, Alessio; Rongo, Rocco; Spataro, William; Straface, Salvatore; Mendicino, Giuseppe

    2016-10-01

    This work presents an effective implementation of a numerical model for complete eco-hydrological Cellular Automata modeling on Graphical Processing Units (GPU) with OpenCL (Open Computing Language) for heterogeneous computation (i.e., on CPUs and/or GPUs). Different types of parallel implementations were carried out (e.g., use of fast local memory, loop unrolling, etc.), showing increasing performance improvements in terms of speedup, and some original optimization strategies were also adopted. Moreover, numerical analysis of the results (i.e., comparison of CPU and GPU outcomes in terms of rounding errors) has proven to be satisfactory. Experiments were carried out on a workstation with two CPUs (Intel Xeon E5440 at 2.83 GHz), one AMD R9 280X GPU and one NVIDIA Tesla K20c GPU. Results have been extremely positive, but further testing should be performed to assess the functionality of the adopted strategies on other complete models and their ability to fruitfully exploit parallel systems resources.

  3. Decryption-decompression of AES protected ZIP files on GPUs

    NASA Astrophysics Data System (ADS)

    Duong, Tan Nhat; Pham, Phong Hong; Nguyen, Duc Huu; Nguyen, Thuy Thanh; Le, Hung Duc

    2011-10-01

    AES is a strong encryption system, so decryption-decompression of AES-encrypted ZIP files requires very large computing power and techniques for reducing the password space. This makes such techniques impractical to implement on common computing systems. In [1], we reduced the original very large password search space to a much smaller one that is certain to contain the correct password. Based on this reduced set of passwords, in this paper we parallelize decryption, decompression and plain-text recognition for encrypted ZIP files using CUDA computing technology on NVIDIA GeForce GTX295 graphics cards to find the correct password. The experimental results show that the speed of decrypting, decompressing, recognizing plain text and finding the original password increases by about 45 to 180 times (depending on the number of GPUs) compared to sequential execution on the Intel Core 2 Quad Q8400 2.66 GHz. These results demonstrate the potential applicability of GPUs in this cryptanalysis field.

  4. CUDA-based acceleration of collateral filtering in brain MR images

    NASA Astrophysics Data System (ADS)

    Li, Cheng-Yuan; Chang, Herng-Hua

    2017-02-01

    Image denoising is one of the fundamental and essential tasks within image processing. In medical imaging, finding an effective algorithm that can remove random noise in MR images is important. This paper proposes an effective noise reduction method for brain magnetic resonance (MR) images. Our approach is based on the collateral filter, which is a more powerful method than the bilateral filter in many cases. However, the computation of the collateral filter algorithm is quite time-consuming. To solve this problem, we improved the collateral filter algorithm with parallel computing using a GPU. We adopted CUDA, an application programming interface for GPUs by NVIDIA, to accelerate the computation. Our experimental evaluation on an Intel Xeon CPU E5-2620 v3 2.40 GHz with an NVIDIA Tesla K40c GPU indicated that the proposed implementation runs dramatically faster than the traditional collateral filter. We believe that the proposed framework has established a general blueprint for achieving fast and robust filtering in a wide variety of medical image denoising applications.

  5. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Madduri, Kamesh; Im, Eun-Jin; Ibrahim, Khaled Z.

    The next decade of high-performance computing (HPC) systems will see a rapid evolution and divergence of multi- and manycore architectures as power and cooling constraints limit increases in microprocessor clock speeds. Understanding efficient optimization methodologies on diverse multicore designs in the context of demanding numerical methods is one of the greatest challenges faced today by the HPC community. In this paper, we examine the efficient multicore optimization of GTC, a petascale gyrokinetic toroidal fusion code for studying plasma microturbulence in tokamak devices. For GTC’s key computational components (charge deposition and particle push), we explore efficient parallelization strategies across a broad range of emerging multicore designs, including the recently-released Intel Nehalem-EX, the AMD Opteron Istanbul, and the highly multithreaded Sun UltraSparc T2+. We also present the first study on tuning gyrokinetic particle-in-cell (PIC) algorithms for graphics processors, using the NVIDIA C2050 (Fermi). Our work discusses several novel optimization approaches for gyrokinetic PIC, including mixed-precision computation, particle binning and decomposition strategies, grid replication, SIMDized atomic floating-point operations, and effective GPU texture memory utilization. Overall, we achieve significant performance improvements of 1.3–4.7× on these complex PIC kernels, despite the inherent challenges of data dependency and locality. Finally, our work also points to several architectural and programming features that could significantly enhance PIC performance and productivity on next-generation architectures.

  6. Short Message Service (SMS) Command and Control (C2) Awareness in Android-based Smartphones Using Kernel-Level Auditing

    DTIC Science & Technology

    2012-06-14

    Display 480 x 800 pixels (3.7 inches) CPU Qualcomm QSD8250 1GHz Memory (internal) 512MB RAM / 512 MB ROM Kernel version 2.6.35.7-ge0fb012 Figure 3.5: HTC...development and writing). The 34 MSM kernel provided by the AOSP and compatible with the HTC Nexus One’s motherboard and Qualcomm chipset, is used for this...building the kernel is having the prebuilt toolchains and the right kernel for the hardware. Many HTC products use Qualcomm processors which uses the

  7. Algorithm theoretical baseline for formaldehyde retrievals from S5P TROPOMI and from the QA4ECV project

    NASA Astrophysics Data System (ADS)

    De Smedt, Isabelle; Theys, Nicolas; Yu, Huan; Danckaert, Thomas; Lerot, Christophe; Compernolle, Steven; Van Roozendael, Michel; Richter, Andreas; Hilboll, Andreas; Peters, Enno; Pedergnana, Mattia; Loyola, Diego; Beirle, Steffen; Wagner, Thomas; Eskes, Henk; van Geffen, Jos; Folkert Boersma, Klaas; Veefkind, Pepijn

    2018-04-01

    On board the Copernicus Sentinel-5 Precursor (S5P) platform, the TROPOspheric Monitoring Instrument (TROPOMI) is a double-channel, nadir-viewing grating spectrometer measuring solar back-scattered earthshine radiances in the ultraviolet, visible, near-infrared, and shortwave infrared with global daily coverage. In the ultraviolet range, its spectral resolution and radiometric performance are equivalent to those of its predecessor OMI, but its horizontal resolution at true nadir is improved by an order of magnitude. This paper introduces the formaldehyde (HCHO) tropospheric vertical column retrieval algorithm implemented in the S5P operational processor and comprehensively describes its various retrieval steps. Furthermore, algorithmic improvements developed in the framework of the EU FP7-project QA4ECV are described for future updates of the processor. Detailed error estimates are discussed in the light of Copernicus user requirements and needs for validation are highlighted. Finally, verification results based on the application of the algorithm to OMI measurements are presented, demonstrating the performances expected for TROPOMI.

  8. 47 CFR 27.806 - 1.4 GHz service licenses subject to competitive bidding.

    Code of Federal Regulations, 2011 CFR

    2011-10-01

    ... 47 Telecommunication 2 2011-10-01 2011-10-01 false 1.4 GHz service licenses subject to competitive bidding. 27.806 Section 27.806 Telecommunication FEDERAL COMMUNICATIONS COMMISSION (CONTINUED) COMMON CARRIER SERVICES MISCELLANEOUS WIRELESS COMMUNICATIONS SERVICES 1.4 GHz Band § 27.806 1.4 GHz service...

  9. 47 CFR 27.806 - 1.4 GHz service licenses subject to competitive bidding.

    Code of Federal Regulations, 2010 CFR

    2010-10-01

    ... 47 Telecommunication 2 2010-10-01 2010-10-01 false 1.4 GHz service licenses subject to competitive bidding. 27.806 Section 27.806 Telecommunication FEDERAL COMMUNICATIONS COMMISSION (CONTINUED) COMMON CARRIER SERVICES MISCELLANEOUS WIRELESS COMMUNICATIONS SERVICES 1.4 GHz Band § 27.806 1.4 GHz service...

  10. 77 FR 45503 - 4.9 GHz Band

    Federal Register 2010, 2011, 2012, 2013, 2014

    2012-08-01

    ... Docket No. 06-150; FCC 12-61] 4.9 GHz Band AGENCY: Federal Communications Commission. ACTION: Final rule... that exempted 4940-4990 MHz (4.9 GHz) band applicants from certified frequency coordination. Next, the Commission corrects the bandwidth of Channel 14 in the 4.9 GHz band plan from five megahertz to one megahertz...

  11. Templated synthesis of highly ordered mesoporous cobalt ferrite and its microwave absorption properties

    NASA Astrophysics Data System (ADS)

    Li, Guo-Min; Wang, Lian-Cheng; Xu, Yao

    2014-08-01

    Based on the nanocasting strategy, highly ordered mesoporous CoFe2O4 is synthesized via the ‘two-solvent’ impregnation method using a mesoporous SBA-15 template. An ordered two-dimensional (P6mm) structure is preserved for the CoFe2O4/SBA-15 composite after the nanocasting. After the SBA-15 template is dissolved by NaOH solution, a mesoporous structure composed of aligned nanoparticles can be obtained, and the P6mm structure of the parent SBA-15 is preserved. With a high specific surface area (above 90 m²/g) and ferromagnetic behavior, the obtained material shows potential in lightweight microwave absorption applications. The minimum reflection loss (RL) can reach -18 dB at about 16 GHz with a thickness of 2 mm and the corresponding absorption bandwidth is 4.5 GHz.

  12. Automatic film processors' quality control test in Greek military hospitals.

    PubMed

    Lymberis, C; Efstathopoulos, E P; Manetou, A; Poudridis, G

    1993-04-01

    The two major military radiology installations (Athens, Greece) using a total of 15 automatic film processors were assessed using the 21-step-wedge method. The results of quality control in all these processors are presented. The parameters measured under actual working conditions were base and fog, contrast and speed. Base and fog as well as speed displayed large variations with average values generally higher than acceptable, whilst contrast displayed greater stability. Developer temperature was measured daily during the test and was found to be outside the film manufacturers' recommended limits in nine of the 15 processors. In only one processor did film passing time vary on a day-to-day basis, and this was due to maloperation. The developer pH test was not part of the daily monitoring; it was performed every 5 days for each film processor, and the values were found to be in the range 9-12. Ten of the 15 processors presented pH values outside the limits specified by the film manufacturers.

  13. Optimal processor assignment for pipeline computations

    NASA Technical Reports Server (NTRS)

    Nicol, David M.; Simha, Rahul; Choudhury, Alok N.; Narahari, Bhagirath

    1991-01-01

    The availability of large scale multitasked parallel architectures introduces the following processor assignment problem for pipelined computations. Given a set of tasks and their precedence constraints, along with their experimentally determined individual response times for different processor sizes, find an assignment of processors to tasks. Two objectives are of interest: minimal response given a throughput requirement, and maximal throughput given a response time requirement. These assignment problems differ considerably from the classical mapping problem in which several tasks share a processor; instead, it is assumed that a large number of processors are to be assigned to a relatively small number of tasks. Efficient assignment algorithms were developed for different classes of task structures. For a p-processor system and a series-parallel precedence graph with n constituent tasks, an O(np²) algorithm is provided that finds the optimal assignment for the response time optimization problem; the assignment optimizing the constrained throughput is found in O(np² log p) time. Special cases of linear, independent, and tree graphs are also considered.
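
    To make the assignment problem concrete, the following minimal sketch treats only the simplest case: a linear pipeline in which throughput is limited by the slowest stage. It is an illustrative dynamic program written for this listing, not the authors' algorithm. The function name `min_bottleneck` and the table layout `times[i][k-1]` (response time of task i on k processors) are assumptions.

```python
from typing import Sequence


def min_bottleneck(times: Sequence[Sequence[float]], p: int) -> float:
    """Smallest achievable slowest-stage time for a linear pipeline.

    times[i][k-1] is the measured response time of task i on k processors.
    Each task gets at least one processor and at most p are used in total.
    Returns infinity when there are more tasks than processors.
    """
    inf = float("inf")
    # best[q]: minimal bottleneck for the tasks seen so far using at most q processors.
    best = [0.0] * (p + 1)
    for task_times in times:
        new = [inf] * (p + 1)
        for q in range(1, p + 1):
            for k in range(1, min(q, len(task_times)) + 1):
                # Give k processors to this task, q - k to the earlier ones.
                candidate = max(task_times[k - 1], best[q - k])
                if candidate < new[q]:
                    new[q] = candidate
        best = new
    return best[p]


if __name__ == "__main__":
    table = [[8, 4, 2], [6, 3, 2]]  # hypothetical timings for two tasks
    print(min_bottleneck(table, 4))
    print(min_bottleneck(table, 5))
```

    The triple loop over tasks, processor budgets and per-task allocations costs O(np²) time for n tasks and p processors. Maximal throughput for this simplified pipeline is the reciprocal of the returned bottleneck time.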

  14. 47 CFR 25.136 - Licensing provisions for user transceivers in the 1.6/2.4 GHz, 1.5/1.6 GHz, and 2 GHz Mobile...

    Code of Federal Regulations, 2010 CFR

    2010-10-01

    .../2.4 GHz Mobile-Satellite Service or 2 GHz Mobile-Satellite Service may not be operated on civil... rules and regulations in this Part and the applicable engineering standards. Prior to engaging in such...

  15. IntellEditS: intelligent learning-based editor of segmentations.

    PubMed

    Harrison, Adam P; Birkbeck, Neil; Sofka, Michal

    2013-01-01

    Automatic segmentation techniques, despite demonstrating excellent overall accuracy, can often produce inaccuracies in local regions. As a result, correcting segmentations remains an important task that is often laborious, especially when done manually for 3D datasets. This work presents a powerful tool called Intelligent Learning-Based Editor of Segmentations (IntellEditS) that minimizes user effort and further improves segmentation accuracy. The tool partners interactive learning with an energy-minimization approach to editing. Based on interactive user input, a discriminative classifier is trained and applied to the edited 3D region to produce soft voxel labeling. The labels are integrated into a novel energy functional along with the existing segmentation and image data. Unlike the state of the art, IntellEditS is designed to correct segmentation results represented not only as masks but also as meshes. In addition, IntellEditS accepts intuitive boundary-based user interactions. The versatility and performance of IntellEditS are demonstrated on both MRI and CT datasets consisting of varied anatomical structures and resolutions.

  16. Evaluation of pH monitoring as a method of processor control.

    PubMed

    Stears, J G; Gray, J E; Winkler, N T

    1979-01-01

    Sensitometry and pH values of the developer solution were compared in controlled over-replenishment, developer depletion, fixer contamination experiments, and on a daily quality control basis. The purpose of these comparisons was to evaluate the potential of pH monitoring as a method of processor control, or a supplement to sensitometry as a method of quality control. Reasonable correlation was found between pH values and film density in two of the three experiments but little or no correlation was found in the third experiment and on a day-to-day basis. The conclusion drawn from these comparisons is that pH monitoring has several limitations which render it unsuitable as a method of daily processor quality control as either a primary or supplementary technique. Sensitometry takes into account all the variables encountered in film processing and is the clear method of choice for processor quality control.

  17. Miniature MMIC Low Mass/Power Radiometer Modules for the 180 GHz GeoSTAR Array

    NASA Technical Reports Server (NTRS)

    Kangaslahti, Pekka; Tanner, Alan; Pukala, David; Lambrigtsen, Bjorn; Lim, Boon; Mei, Xiaobing; Lai, Richard

    2010-01-01

    We have developed and demonstrated miniature 180 GHz Monolithic Microwave Integrated Circuit (MMIC) radiometer modules that have low noise temperature, low mass and low power consumption. These modules will enable the Geostationary Synthetic Thinned Aperture Radiometer (GeoSTAR) of the Precipitation and All-weather Temperature and Humidity (PATH) Mission for atmospheric temperature and humidity profiling. The GeoSTAR instrument has an array of hundreds of receivers. Technology that was developed included Indium Phosphide (InP) MMIC Low Noise Amplifiers (LNAs), second harmonic MMIC mixers and I-Q mixers, surface mount Multi-Chip Module (MCM) packages at 180 GHz, and an interferometric array at 180 GHz. A complete MMIC chip set for the 180 GHz receiver modules (LNAs and I-Q second harmonic mixer) was developed. The MMIC LNAs had more than 50% lower noise temperature (NT = 300 K) than the previous state of the art, and the MMIC I-Q mixers demonstrated low LO power (3 dBm). Two lots of MMIC wafers were processed with very high DC transconductance of up to 2800 mS/mm for the 35 nm gate length devices. Based on these MMICs, a 180 GHz Multichip Module was developed that had a factor of 100 lower mass/volume (16 x 18 x 4.5 mm^3, 3 g) than previous generation 180 GHz receivers.

  18. Vertical InAs nanowire wrap gate transistors with f(t) > 7 GHz and f(max) > 20 GHz.

    PubMed

    Egard, M; Johansson, S; Johansson, A-C; Persson, K-M; Dey, A W; Borg, B M; Thelander, C; Wernersson, L-E; Lind, E

    2010-03-10

    In this letter we report on high-frequency measurements on vertically standing III-V nanowire wrap-gate MOSFETs (metal-oxide-semiconductor field-effect transistors). The nanowire transistors are fabricated from InAs nanowires that are epitaxially grown on a semi-insulating InP substrate. All three terminals of the MOSFETs are defined by wrap around contacts. This makes it possible to perform high-frequency measurements on the vertical InAs MOSFETs. We present S-parameter measurements performed on a matrix consisting of 70 InAs nanowire MOSFETs, which have a gate length of about 100 nm. The highest unity current gain cutoff frequency, f(t), extracted from these measurements is 7.4 GHz and the maximum frequency of oscillation, f(max), is higher than 20 GHz. This demonstrates that this is a viable technique for fabricating high-frequency integrated circuits consisting of vertical nanowires.

  19. Dual band multi frequency rectangular patch microstrip antenna with flyswatter shaped slot for wireless systems

    NASA Astrophysics Data System (ADS)

    Bhardwaj, Dheeraj; Saraswat, Shriti; Gulati, Gitansh; Shekhar, Snehanshu; Joshi, Kanika; Sharma, Komal

    2016-03-01

    In this paper a dual band planar antenna is proposed for IEEE 802.16 WiMAX/IEEE 802.11 WLAN/4.9 GHz public safety applications. The antenna provides a bandwidth of 560 MHz (3.37-3.93 GHz) for WLAN and WiMAX and 372 MHz (4.82-5.192 GHz) for 4.9 GHz public safety applications and radio astronomy services (4.8-4.94 GHz). The proposed antenna consists of a single microstrip patch reactively loaded with three identical steps positioned in a zig-zag manner towards the radiating edges of the patch. The characteristics of the coaxially fed patch antenna (radiation pattern, antenna gain, antenna directivity, current distribution, S11) have been investigated. The antenna design is primarily focused on achieving dual band operation.
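
    As a quick arithmetic check of the two bands quoted above, the short sketch below recomputes each bandwidth from its band edges and also reports the fractional bandwidth, taken here as bandwidth divided by the arithmetic centre frequency. The fractional values are not stated in the abstract; the helper name `band_stats` is hypothetical.

```python
def band_stats(f_low_mhz, f_high_mhz):
    """Return (bandwidth in MHz, fractional bandwidth in percent)."""
    bandwidth = f_high_mhz - f_low_mhz
    center = (f_low_mhz + f_high_mhz) / 2
    return bandwidth, 100.0 * bandwidth / center


# Band edges in MHz, taken from the abstract.
bands = {
    "WLAN/WiMAX": (3370, 3930),
    "4.9 GHz public safety": (4820, 5192),
}

for name, (low, high) in bands.items():
    bw, frac = band_stats(low, high)
    print(f"{name}: {bw} MHz, {frac:.1f}% fractional bandwidth")
```

    Working in integer megahertz avoids floating-point rounding when subtracting the band edges.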

  20. A contribution to the design of wideband tunable second harmonic mode millimeter-wave InP-TED oscillators above 110 GHz

    NASA Astrophysics Data System (ADS)

    Rydberg, Anders

    1990-03-01

    Second harmonic InP-TED oscillators are investigated for frequencies above 110 GHz using different mounts and TEDs. It is found that state-of-the-art output powers of more than 2 mW, comparable to those of Schottky-varactor multipliers, can be generated above 190 GHz by reducing the capsule parasitics. Output power is observed at frequencies up to 216 GHz. The tuning range above 110 GHz is found to be more than 40 percent. Using theoretical waveguide models, the tuning behavior of the oscillators is also investigated.
