intel p4-processor machine: Topics by Science.gov

Sample records for intel p4-processor machine

Performance of a plasma fluid code on the Intel parallel computers

NASA Technical Reports Server (NTRS)

Lynch, V. E.; Carreras, B. A.; Drake, J. B.; Leboeuf, J. N.; Liewer, P.

1992-01-01

One approach to improving the real-time efficiency of plasma turbulence calculations is to use a parallel algorithm. A parallel algorithm for plasma turbulence calculations was tested on the Intel iPSC/860 hypercube and the Touchtone Delta machine. Using the 128 processors of the Intel iPSC/860 hypercube, a factor of 5 improvement over a single-processor CRAY-2 is obtained. For the Touchtone Delta machine, the corresponding improvement factor is 16. For plasma edge turbulence calculations, an extrapolation of the present results to the Intel (sigma) machine gives an improvement factor close to 64 over the single-processor CRAY-2.
Scaling Support Vector Machines On Modern HPC Platforms

DOE Office of Scientific and Technical Information (OSTI.GOV)

You, Yang; Fu, Haohuan; Song, Shuaiwen

2015-02-01

We designed and implemented MIC-SVM, a highly efficient parallel SVM for x86 based multicore and many-core architectures, such as the Intel Ivy Bridge CPUs and Intel Xeon Phi co-processor (MIC). We propose various novel analysis methods and optimization techniques to fully utilize the multilevel parallelism provided by these architectures and serve as general optimization methods for other machine learning tools.
Parallel computation for biological sequence comparison: comparing a portable model to the native model for the Intel Hypercube.

PubMed

Nadkarni, P M; Miller, P L

1991-01-01

A parallel program for inter-database sequence comparison was developed on the Intel Hypercube using two models of parallel programming. One version was built using machine-specific Hypercube parallel programming commands. The other version was built using Linda, a machine-independent parallel programming language. The two versions of the program provide a case study comparing these two approaches to parallelization in an important biological application area. Benchmark tests with both programs gave comparable results with a small number of processors. As the number of processors was increased, the Linda version was somewhat less efficient. The Linda version was also run without change on Network Linda, a virtual parallel machine running on a network of desktop workstations.
Parallel computation for biological sequence comparison: comparing a portable model to the native model for the Intel Hypercube.

PubMed Central

Nadkarni, P. M.; Miller, P. L.

1991-01-01

A parallel program for inter-database sequence comparison was developed on the Intel Hypercube using two models of parallel programming. One version was built using machine-specific Hypercube parallel programming commands. The other version was built using Linda, a machine-independent parallel programming language. The two versions of the program provide a case study comparing these two approaches to parallelization in an important biological application area. Benchmark tests with both programs gave comparable results with a small number of processors. As the number of processors was increased, the Linda version was somewhat less efficient. The Linda version was also run without change on Network Linda, a virtual parallel machine running on a network of desktop workstations. PMID:1807632
SPP: A data base processor data communications protocol

NASA Technical Reports Server (NTRS)

Fishwick, P. A.

1983-01-01

The design and implementation of a data communications protocol for the Intel Data Base Processor (DBP) is defined. The protocol is termed SPP (Service Port Protocol) since it enables data transfer between the host computer and the DBP service port. The protocol implementation is extensible in that it is explicitly layered and the protocol functionality is hierarchically organized. Extensive trace and performance capabilities have been supplied with the protocol software to permit optional efficient monitoring of the data transfer between the host and the Intel data base processor. Machine independence was considered to be an important attribute during the design and implementation of SPP. The protocol source is fully commented and is included in Appendix A of this report.
Web interfaces to relational databases

NASA Technical Reports Server (NTRS)

Carlisle, W. H.

1996-01-01

This reports on a project to extend the capabilities of a Virtual Research Center (VRC) for NASA's Advanced Concepts Office. The work was performed as part of NASA's 1995 Summer Faculty Fellowship program and involved the development of a prototype component of the VRC - a database system that provides data creation and access services within a room of the VRC. In support of VRC development, NASA has assembled a laboratory containing the variety of equipment expected to be used by scientists within the VRC. This laboratory consists of the major hardware platforms, SUN, Intel, and Motorola processors and their most common operating systems UNIX, Windows NT, Windows for Workgroups, and Macintosh. The SPARC 20 runs SUN Solaris 2.4, an Intel Pentium runs Windows NT and is installed on a different network from the other machines in the laboratory, a Pentium PC runs Windows for Workgroups, two Intel 386 machines run Windows 3.1, and finally, a PowerMacintosh and a Macintosh IIsi run MacOS.
Case for a field-programmable gate array multicore hybrid machine for an image-processing application

NASA Astrophysics Data System (ADS)

Rakvic, Ryan N.; Ives, Robert W.; Lira, Javier; Molina, Carlos

2011-01-01

General purpose computer designers have recently begun adding cores to their processors in order to increase performance. For example, Intel has adopted a homogeneous quad-core processor as a base for general purpose computing. PlayStation3 (PS3) game consoles contain a multicore heterogeneous processor known as the Cell, which is designed to perform complex image processing algorithms at a high level. Can modern image-processing algorithms utilize these additional cores? On the other hand, modern advancements in configurable hardware, most notably field-programmable gate arrays (FPGAs) have created an interesting question for general purpose computer designers. Is there a reason to combine FPGAs with multicore processors to create an FPGA multicore hybrid general purpose computer? Iris matching, a repeatedly executed portion of a modern iris-recognition algorithm, is parallelized on an Intel-based homogeneous multicore Xeon system, a heterogeneous multicore Cell system, and an FPGA multicore hybrid system. Surprisingly, the cheaper PS3 slightly outperforms the Intel-based multicore on a core-for-core basis. However, both multicore systems are beaten by the FPGA multicore hybrid system by >50%.
Efficacy of Code Optimization on Cache-Based Processors

NASA Technical Reports Server (NTRS)

VanderWijngaart, Rob F.; Saphir, William C.; Chancellor, Marisa K. (Technical Monitor)

1997-01-01

In this paper a number of techniques for improving the cache performance of a representative piece of numerical software is presented. Target machines are popular processors from several vendors: MIPS R5000 (SGI Indy), MIPS R8000 (SGI PowerChallenge), MIPS R10000 (SGI Origin), DEC Alpha EV4 + EV5 (Cray T3D & T3E), IBM RS6000 (SP Wide-node), Intel PentiumPro (Ames' Whitney), Sun UltraSparc (NERSC's NOW). The optimizations all attempt to increase the locality of memory accesses. But they meet with rather varied and often counterintuitive success on the different computing platforms. We conclude that it may be genuinely impossible to obtain portable performance on the current generation of cache-based machines. At the least, it appears that the performance of modern commodity processors cannot be described with parameters defining the cache alone.
Parallel spatial direct numerical simulations on the Intel iPSC/860 hypercube

NASA Technical Reports Server (NTRS)

Joslin, Ronald D.; Zubair, Mohammad

1993-01-01

The implementation and performance of a parallel spatial direct numerical simulation (PSDNS) approach on the Intel iPSC/860 hypercube is documented. The direct numerical simulation approach is used to compute spatially evolving disturbances associated with the laminar-to-turbulent transition in boundary-layer flows. The feasibility of using the PSDNS on the hypercube to perform transition studies is examined. The results indicate that the direct numerical simulation approach can effectively be parallelized on a distributed-memory parallel machine. By increasing the number of processors nearly ideal linear speedups are achieved with nonoptimized routines; slower than linear speedups are achieved with optimized (machine dependent library) routines. This slower than linear speedup results because the Fast Fourier Transform (FFT) routine dominates the computational cost and because the routine indicates less than ideal speedups. However with the machine-dependent routines the total computational cost decreases by a factor of 4 to 5 compared with standard FORTRAN routines. The computational cost increases linearly with spanwise wall-normal and streamwise grid refinements. The hypercube with 32 processors was estimated to require approximately twice the amount of Cray supercomputer single processor time to complete a comparable simulation; however it is estimated that a subgrid-scale model which reduces the required number of grid points and becomes a large-eddy simulation (PSLES) would reduce the computational cost and memory requirements by a factor of 10 over the PSDNS. This PSLES implementation would enable transition simulations on the hypercube at a reasonable computational cost.
Communication overhead on the Intel Paragon, IBM SP2 and Meiko CS-2

NASA Technical Reports Server (NTRS)

Bokhari, Shahid H.

1995-01-01

Interprocessor communication overhead is a crucial measure of the power of parallel computing systems-its impact can severely limit the performance of parallel programs. This report presents measurements of communication overhead on three contemporary commercial multicomputer systems: the Intel Paragon, the IBM SP2 and the Meiko CS-2. In each case the time to communicate between processors is presented as a function of message length. The time for global synchronization and memory access is discussed. The performance of these machines in emulating hypercubes and executing random pairwise exchanges is also investigated. It is shown that the interprocessor communication time depends heavily on the specific communication pattern required. These observations contradict the commonly held belief that communication overhead on contemporary machines is independent of the placement of tasks on processors. The information presented in this report permits the evaluation of the efficiency of parallel algorithm implementations against standard baselines.
A performance study of sparse Cholesky factorization on INTEL iPSC/860

NASA Technical Reports Server (NTRS)

Zubair, M.; Ghose, M.

1992-01-01

The problem of Cholesky factorization of a sparse matrix has been very well investigated on sequential machines. A number of efficient codes exist for factorizing large unstructured sparse matrices. However, there is a lack of such efficient codes on parallel machines in general, and distributed machines in particular. Some of the issues that are critical to the implementation of sparse Cholesky factorization on a distributed memory parallel machine are ordering, partitioning and mapping, load balancing, and ordering of various tasks within a processor. Here, we focus on the effect of various partitioning schemes on the performance of sparse Cholesky factorization on the Intel iPSC/860. Also, a new partitioning heuristic for structured as well as unstructured sparse matrices is proposed, and its performance is compared with other schemes.
Parallel community climate model: Description and user`s guide

DOE Office of Scientific and Technical Information (OSTI.GOV)

Drake, J.B.; Flanery, R.E.; Semeraro, B.D.

This report gives an overview of a parallel version of the NCAR Community Climate Model, CCM2, implemented for MIMD massively parallel computers using a message-passing programming paradigm. The parallel implementation was developed on an Intel iPSC/860 with 128 processors and on the Intel Delta with 512 processors, and the initial target platform for the production version of the code is the Intel Paragon with 2048 processors. Because the implementation uses a standard, portable message-passing libraries, the code has been easily ported to other multiprocessors supporting a message-passing programming paradigm. The parallelization strategy used is to decompose the problem domain intomore » geographical patches and assign each processor the computation associated with a distinct subset of the patches. With this decomposition, the physics calculations involve only grid points and data local to a processor and are performed in parallel. Using parallel algorithms developed for the semi-Lagrangian transport, the fast Fourier transform and the Legendre transform, both physics and dynamics are computed in parallel with minimal data movement and modest change to the original CCM2 source code. Sequential or parallel history tapes are written and input files (in history tape format) are read sequentially by the parallel code to promote compatibility with production use of the model on other computer systems. A validation exercise has been performed with the parallel code and is detailed along with some performance numbers on the Intel Paragon and the IBM SP2. A discussion of reproducibility of results is included. A user`s guide for the PCCM2 version 2.1 on the various parallel machines completes the report. Procedures for compilation, setup and execution are given. A discussion of code internals is included for those who may wish to modify and use the program in their own research.« less
Analysis and Implementation of Particle-to-Particle (P2P) Graphics Processor Unit (GPU) Kernel for Black-Box Adaptive Fast Multipole Method

DTIC Science & Technology

2015-06-01

5110P and 16 dx360M4 nodes each with one NVIDIA Kepler K20M/K40M GPU. Each node contained dual Intel Xeon E5-2670 (Sandy Bridge) central processing...kernel and as such does not employ multiple processors. This work makes use of a single processing core and a single NVIDIA Kepler K40 GK110...bandwidth (2 × 16 slot), 7.877 GFloat/s; Kepler K40 peak, 4,290 × 1 billion floating-point operations (GFLOPs), and 288 GB/s Kepler K40 memory
Cognitive Medical Wireless Testbed System (COMWITS)

DTIC Science & Technology

2016-11-01

Number: ...... ...... Sub Contractors (DD882) Names of other research staff Inventions (DD882) Scientific Progress This testbed merges two ARO grants...bit 64 bit CPU Intel Xeon Processor E5-1650v3 (6C, 3.5 GHz, Turbo, HT , 15M, 140W) Intel Core i7-3770 (3.4 GHz Quad Core, 77W) Dual Intel Xeon
GNAQPMS v1.1: accelerating the Global Nested Air Quality Prediction Modeling System (GNAQPMS) on Intel Xeon Phi processors

NASA Astrophysics Data System (ADS)

Wang, Hui; Chen, Huansheng; Wu, Qizhong; Lin, Junmin; Chen, Xueshun; Xie, Xinwei; Wang, Rongrong; Tang, Xiao; Wang, Zifa

2017-08-01

The Global Nested Air Quality Prediction Modeling System (GNAQPMS) is the global version of the Nested Air Quality Prediction Modeling System (NAQPMS), which is a multi-scale chemical transport model used for air quality forecast and atmospheric environmental research. In this study, we present the porting and optimisation of GNAQPMS on a second-generation Intel Xeon Phi processor, codenamed Knights Landing (KNL). Compared with the first-generation Xeon Phi coprocessor (codenamed Knights Corner, KNC), KNL has many new hardware features such as a bootable processor, high-performance in-package memory and ISA compatibility with Intel Xeon processors. In particular, we describe the five optimisations we applied to the key modules of GNAQPMS, including the CBM-Z gas-phase chemistry, advection, convection and wet deposition modules. These optimisations work well on both the KNL 7250 processor and the Intel Xeon E5-2697 V4 processor. They include (1) updating the pure Message Passing Interface (MPI) parallel mode to the hybrid parallel mode with MPI and OpenMP in the emission, advection, convection and gas-phase chemistry modules; (2) fully employing the 512 bit wide vector processing units (VPUs) on the KNL platform; (3) reducing unnecessary memory access to improve cache efficiency; (4) reducing the thread local storage (TLS) in the CBM-Z gas-phase chemistry module to improve its OpenMP performance; and (5) changing the global communication from writing/reading interface files to MPI functions to improve the performance and the parallel scalability. These optimisations greatly improved the GNAQPMS performance. The same optimisations also work well for the Intel Xeon Broadwell processor, specifically E5-2697 v4. Compared with the baseline version of GNAQPMS, the optimised version was 3.51 × faster on KNL and 2.77 × faster on the CPU. Moreover, the optimised version ran at 26 % lower average power on KNL than on the CPU. With the combined performance and energy improvement, the KNL platform was 37.5 % more efficient on power consumption compared with the CPU platform. The optimisations also enabled much further parallel scalability on both the CPU cluster and the KNL cluster scaled to 40 CPU nodes and 30 KNL nodes, with a parallel efficiency of 70.4 and 42.2 %, respectively.
Users manual for the Chameleon parallel programming tools

DOE Office of Scientific and Technical Information (OSTI.GOV)

Gropp, W.; Smith, B.

1993-06-01

Message passing is a common method for writing programs for distributed-memory parallel computers. Unfortunately, the lack of a standard for message passing has hampered the construction of portable and efficient parallel programs. In an attempt to remedy this problem, a number of groups have developed their own message-passing systems, each with its own strengths and weaknesses. Chameleon is a second-generation system of this type. Rather than replacing these existing systems, Chameleon is meant to supplement them by providing a uniform way to access many of these systems. Chameleon`s goals are to (a) be very lightweight (low over-head), (b) be highlymore » portable, and (c) help standardize program startup and the use of emerging message-passing operations such as collective operations on subsets of processors. Chameleon also provides a way to port programs written using PICL or Intel NX message passing to other systems, including collections of workstations. Chameleon is tracking the Message-Passing Interface (MPI) draft standard and will provide both an MPI implementation and an MPI transport layer. Chameleon provides support for heterogeneous computing by using p4 and PVM. Chameleon`s support for homogeneous computing includes the portable libraries p4, PICL, and PVM and vendor-specific implementation for Intel NX, IBM EUI (SP-1), and Thinking Machines CMMD (CM-5). Support for Ncube and PVM 3.x is also under development.« less
Architectural Specialization for Inter-Iteration Loop Dependence Patterns

DTIC Science & Technology

2015-10-01

Architectural Specialization for Inter-Iteration Loop Dependence Patterns Christopher Batten Computer Systems Laboratory School of Electrical and...Trends in Computer Architecture Transistors (Thousands) Frequency (MHz) Typical Power (W) MIPS R2K Intel P4 DEC Alpha 21264 Data collected by M...T as ks p er Jo ule ) Simple Processor Design Power Constraint High-Performance Architectures Embedded Architectures Design Performance
Theorem Proving in Intel Hardware Design

NASA Technical Reports Server (NTRS)

O'Leary, John

2009-01-01

For the past decade, a framework combining model checking (symbolic trajectory evaluation) and higher-order logic theorem proving has been in production use at Intel. Our tools and methodology have been used to formally verify execution cluster functionality (including floating-point operations) for a number of Intel products, including the Pentium(Registered TradeMark)4 and Core(TradeMark)i7 processors. Hardware verification in 2009 is much more challenging than it was in 1999 - today s CPU chip designs contain many processor cores and significant firmware content. This talk will attempt to distill the lessons learned over the past ten years, discuss how they apply to today s problems, outline some future directions.
Optimizing Performance of Combustion Chemistry Solvers on Intel's Many Integrated Core (MIC) Architectures

DOE Office of Scientific and Technical Information (OSTI.GOV)

Sitaraman, Hariswaran; Grout, Ray W

This work investigates novel algorithm designs and optimization techniques for restructuring chemistry integrators in zero and multidimensional combustion solvers, which can then be effectively used on the emerging generation of Intel's Many Integrated Core/Xeon Phi processors. These processors offer increased computing performance via large number of lightweight cores at relatively lower clock speeds compared to traditional processors (e.g. Intel Sandybridge/Ivybridge) used in current supercomputers. This style of processor can be productively used for chemistry integrators that form a costly part of computational combustion codes, in spite of their relatively lower clock speeds. Performance commensurate with traditional processors is achieved heremore » through the combination of careful memory layout, exposing multiple levels of fine grain parallelism and through extensive use of vendor supported libraries (Cilk Plus and Math Kernel Libraries). Important optimization techniques for efficient memory usage and vectorization have been identified and quantified. These optimizations resulted in a factor of ~ 3 speed-up using Intel 2013 compiler and ~ 1.5 using Intel 2017 compiler for large chemical mechanisms compared to the unoptimized version on the Intel Xeon Phi. The strategies, especially with respect to memory usage and vectorization, should also be beneficial for general purpose computational fluid dynamics codes.« less
The Fluke Security Project

DTIC Science & Technology

2000-04-01

be an extension of Utah’s nascent Quarks system, oriented to closely coupled cluster environments. However, the grant did not actually begin until... Intel x86, implemented ten virtual machine monitors and servers, including a virtual memory manager, a checkpointer, a process manager, a file server...Fluke, we developed a novel hierarchical processor scheduling frame- work called CPU inheritance scheduling [5]. This is a framework for scheduling

Multiphase complete exchange on Paragon, SP2 and CS-2

NASA Technical Reports Server (NTRS)

Bokhari, Shahid H.

1995-01-01

The overhead of interprocessor communication is a major factor in limiting the performance of parallel computer systems. The complete exchange is the severest communication pattern in that it requires each processor to send a distinct message to every other processor. This pattern is at the heart of many important parallel applications. On hypercubes, multiphase complete exchange has been developed and shown to provide optimal performance over varying message sizes. Most commercial multicomputer systems do not have a hypercube interconnect. However, they use special purpose hardware and dedicated communication processors to achieve very high performance communication and can be made to emulate the hypercube quite well. Multiphase complete exchange has been implemented on three contemporary parallel architectures: the Intel Paragon, IBM SP2 and Meiko CS-2. The essential features of these machines are described and their basic interprocessor communication overheads are discussed. The performance of multiphase complete exchange is evaluated on each machine. It is shown that the theoretical ideas developed for hypercubes are also applicable in practice to these machines and that multiphase complete exchange can lead to major savings in execution time over traditional solutions.
A simple integrative method for presenting head-contingent motion parallax and disparity cues on intel x86 processor-based machines.

PubMed

Szatmary, J; Hadani, I; Julesz, B

1997-01-01

Rogers and Graham (1979) developed a system to show that head-movement-contingent motion parallax produces monocular depth perception in random dot patterns. Their display system comprised an oscilloscope driven by function generators or a special graphics board that triggered the X and Y deflection of the raster scan signal. Replication of this system required costly hardware that is no longer on the market. In this paper the Rogers-Graham method is reproduced with an Intel processor based IBM PC compatible machine with no additional hardware cost. An adapted joystick sampled through the standard game-port can serve as a provisional head-movement sensor. Monitor resolution for displaying motion is effectively enhanced 16 times by the use of anti-aliasing, enabling the display of thousands of random dots in real-time with a refresh rate of 60 Hz or above. A color monitor enables the use of the anaglyph method, thus combining stereoscopic and monocular parallax on a single display without the loss of speed. The power of this system is demonstrated by a psychophysical measurement in which subjects nulled head-movement-contingent illusory parallax, evoked by a static stereogram, with real parallax. The amount of real parallax required to null the illusory stereoscopic parallax monotonically increased with disparity.
Extension of the AMBER molecular dynamics software to Intel's Many Integrated Core (MIC) architecture

NASA Astrophysics Data System (ADS)

Needham, Perri J.; Bhuiyan, Ashraf; Walker, Ross C.

2016-04-01

We present an implementation of explicit solvent particle mesh Ewald (PME) classical molecular dynamics (MD) within the PMEMD molecular dynamics engine, that forms part of the AMBER v14 MD software package, that makes use of Intel Xeon Phi coprocessors by offloading portions of the PME direct summation and neighbor list build to the coprocessor. We refer to this implementation as pmemd MIC offload and in this paper present the technical details of the algorithm, including basic models for MPI and OpenMP configuration, and analyze the resultant performance. The algorithm provides the best performance improvement for large systems (>400,000 atoms), achieving a ∼35% performance improvement for satellite tobacco mosaic virus (1,067,095 atoms) when 2 Intel E5-2697 v2 processors (2 ×12 cores, 30M cache, 2.7 GHz) are coupled to an Intel Xeon Phi coprocessor (Model 7120P-1.238/1.333 GHz, 61 cores). The implementation utilizes a two-fold decomposition strategy: spatial decomposition using an MPI library and thread-based decomposition using OpenMP. We also present compiler optimization settings that improve the performance on Intel Xeon processors, while retaining simulation accuracy.
Study of Thread Level Parallelism in a Video Encoding Application for Chip Multiprocessor Design

NASA Astrophysics Data System (ADS)

Debes, Eric; Kaine, Greg

2002-11-01

In media applications there is a high level of available thread level parallelism (TLP). In this paper we study the intra TLP in a video encoder. We show that a well-distributed highly optimized encoder running on a symmetric multiprocessor (SMP) system can run 3.2 faster on a 4-way SMP machine than on a single processor. The multithreaded encoder running on an SMP system is then used to understand the requirements of a chip multiprocessor (CMP) architecture, which is one possible architectural direction to better exploit TLP. In the framework of this study, we use a software approach to evaluate the dataflow between processors for the video encoder running on an SMP system. An estimation of the dataflow is done with L2 cache miss event counters using Intel® VTuneTM performance analyzer. The experimental measurements are compared to theoretical results.
Applications and development of communication models for the touchstone GAMMA and DELTA prototypes

NASA Technical Reports Server (NTRS)

Seidel, Steven R.

1993-01-01

The goal of this project was to develop models of the interconnection networks of the Intel iPSC/860 and DELTA multicomputers to guide the design of efficient algorithms for interprocessor communication in problems that commonly occur in CFD codes and other applications. Interprocessor communication costs of codes for message-passing architectures such as the iPSC/860 and DELTA significantly affect the level of performance that can be obtained from those machines. This project addressed several specific problems in the achievement of efficient communication on the Intel iPSC/860 hypercube and DELTA mesh. In particular, an efficient global processor synchronization algorithm was developed for the iPSC/860 and numerous broadcast algorithms were designed for the DELTA.
Tactical Operations Analysis Support Facility.

DTIC Science & Technology

1981-05-01

Punch/Reader 2 DMC-11AR DDCMP Micro Processor 2 DMC-11DA Network Link Line Unit 2 DL-11E Async Serial Line Interface 4 Intel IN-1670 448K Words MOS Memory...86 5.3 VIRTUAL PROCESSORS - VAX-11/750 ........................... 89 5.4 A RELATIONAL DATA MANAGEMENT SYSTEM - ORACLE...Central Processing Unit (CPU) is a 16 bit processor for high-speed, real time applications, and for large multi-user, multi- task, time shared
Cost/Performance Ratio Achieved by Using a Commodity-Based Cluster

NASA Technical Reports Server (NTRS)

Lopez, Isaac

2001-01-01

Researchers at the NASA Glenn Research Center acquired a commodity cluster based on Intel Corporation processors to compare its performance with a traditional UNIX cluster in the execution of aeropropulsion applications. Since the cost differential of the clusters was significant, a cost/performance ratio was calculated. After executing a propulsion application on both clusters, the researchers demonstrated a 9.4 cost/performance ratio in favor of the Intel-based cluster. These researchers utilize the Aeroshark cluster as one of the primary testbeds for developing NPSS parallel application codes and system software. The Aero-shark cluster provides 64 Intel Pentium II 400-MHz processors, housed in 32 nodes. Recently, APNASA - a code developed by a Government/industry team for the design and analysis of turbomachinery systems was used for a simulation on Glenn's Aeroshark cluster.
Does the Intel Xeon Phi processor fit HEP workloads?

NASA Astrophysics Data System (ADS)

Nowak, A.; Bitzes, G.; Dotti, A.; Lazzaro, A.; Jarp, S.; Szostek, P.; Valsan, L.; Botezatu, M.; Leduc, J.

2014-06-01

This paper summarizes the five years of CERN openlab's efforts focused on the Intel Xeon Phi co-processor, from the time of its inception to public release. We consider the architecture of the device vis a vis the characteristics of HEP software and identify key opportunities for HEP processing, as well as scaling limitations. We report on improvements and speedups linked to parallelization and vectorization on benchmarks involving software frameworks such as Geant4 and ROOT. Finally, we extrapolate current software and hardware trends and project them onto accelerators of the future, with the specifics of offline and online HEP processing in mind.
Building Columbia from the SysAdmin View

NASA Technical Reports Server (NTRS)

Chan, David

2005-01-01

Project Columbia was built at NASA Ames Research Center in partnership with SGI and Intel. Columbia consists of 20 512 processor Altix machines with 440TB of storage and achieved 51.87 TeraPlops to be ranked the second fastest on the top 500 at SuperComputing 2004. Columbia was delivered, installed and put into production in 3 months. On average, a new Columbia node was brought into production in less than a week. Columbia's configuration, installation, and future plans will be discussed.
Benchmarking GNU Radio Kernels and Multi-Processor Scheduling

DTIC Science & Technology

2013-01-14

AMD E350 APU , comparable to Atom • ARM Cortex A8 running on a Gumstix Overo on an Ettus USRP E110 The general testing procedure consists of • Build...Intel Atom, and the AMD E350 APU . 3.2 Multi-Processor Scheduling Figure 1: GFLOPs per second through an FFT array on an Intel i7. Example output from
Optimizing the updated Goddard shortwave radiation Weather Research and Forecasting (WRF) scheme for Intel Many Integrated Core (MIC) architecture

NASA Astrophysics Data System (ADS)

Mielikainen, Jarno; Huang, Bormin; Huang, Allen H.-L.

2015-05-01

Intel Many Integrated Core (MIC) ushers in a new era of supercomputing speed, performance, and compatibility. It allows the developers to run code at trillions of calculations per second using the familiar programming model. In this paper, we present our results of optimizing the updated Goddard shortwave radiation Weather Research and Forecasting (WRF) scheme on Intel Many Integrated Core Architecture (MIC) hardware. The Intel Xeon Phi coprocessor is the first product based on Intel MIC architecture, and it consists of up to 61 cores connected by a high performance on-die bidirectional interconnect. The co-processor supports all important Intel development tools. Thus, the development environment is familiar one to a vast number of CPU developers. Although, getting a maximum performance out of Xeon Phi will require using some novel optimization techniques. Those optimization techniques are discusses in this paper. The results show that the optimizations improved performance of the original code on Xeon Phi 7120P by a factor of 1.3x.
The Acceleration of Structural Microarchitectural Simulation via Scheduling

DTIC Science & Technology

2006-11-01

193 viii List of Tables 1.1 Size of Intel R ©Processors...Table 1.1 shows the total and estimated non-cache transistor counts in succeeding generations of Intel R ©microprocessors. (Cache array transistors are...Intel486TM 1989 1,200,000 800,000 Intel R ©Pentium R © 1993 3,100,000 2,300,000 Intel R ©Pentium R ©II 1997 7,500,000 5,500,000 Intel R ©Pentium R ©III 1999
A Survey of Recent MARTe Based Systems

NASA Astrophysics Data System (ADS)

Neto, André C.; Alves, Diogo; Boncagni, Luca; Carvalho, Pedro J.; Valcarcel, Daniel F.; Barbalace, Antonio; De Tommasi, Gianmaria; Fernandes, Horácio; Sartori, Filippo; Vitale, Enzo; Vitelli, Riccardo; Zabeo, Luca

2011-08-01

The Multithreaded Application Real-Time executor (MARTe) is a data driven framework environment for the development and deployment of real-time control algorithms. The main ideas which led to the present version of the framework were to standardize the development of real-time control systems, while providing a set of strictly bounded standard interfaces to the outside world and also accommodating a collection of facilities which promote the speed and ease of development, commissioning and deployment of such systems. At the core of every MARTe based application, is a set of independent inter-communicating software blocks, named Generic Application Modules (GAM), orchestrated by a real-time scheduler. The platform independence of its core library provides MARTe the necessary robustness and flexibility for conveniently testing applications in different environments including non-real-time operating systems. MARTe is already being used in several machines, each with its own peculiarities regarding hardware interfacing, supervisory control configuration, operating system and target control application. This paper presents and compares the most recent results of systems using MARTe: the JET Vertical Stabilization system, which uses the Real Time Application Interface (RTAI) operating system on Intel multi-core processors; the COMPASS plasma control system, driven by Linux RT also on Intel multi-core processors; ISTTOK real-time tomography equilibrium reconstruction which shares the same support configuration of COMPASS; JET error field correction coils based on VME, PowerPC and VxWorks; FTU LH reflected power system running on VME, Intel with RTAI.
Performance Evaluation of NWChem Ab-Initio Molecular Dynamics (AIMD) Simulations on the Intel® Xeon Phi™ Processor

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bylaska, Eric J.; Jacquelin, Mathias; De Jong, Wibe A.

2017-10-20

Ab-initio Molecular Dynamics (AIMD) methods are an important class of algorithms, as they enable scientists to understand the chemistry and dynamics of molecular and condensed phase systems while retaining a first-principles-based description of their interactions. Many-core architectures such as the Intel® Xeon Phi™ processor are an interesting and promising target for these algorithms, as they can provide the computational power that is needed to solve interesting problems in chemistry. In this paper, we describe the efforts of refactoring the existing AIMD plane-wave method of NWChem from an MPI-only implementation to a scalable, hybrid code that employs MPI and OpenMP tomore » exploit the capabilities of current and future many-core architectures. We describe the optimizations required to get close to optimal performance for the multiplication of the tall-and-skinny matrices that form the core of the computational algorithm. We present strong scaling results on the complete AIMD simulation for a test case that simulates 256 water molecules and that strong-scales well on a cluster of 1024 nodes of Intel Xeon Phi processors. We compare the performance obtained with a cluster of dual-socket Intel® Xeon® E5–2698v3 processors.« less
Application of high-performance computing to numerical simulation of human movement

NASA Technical Reports Server (NTRS)

Anderson, F. C.; Ziegler, J. M.; Pandy, M. G.; Whalen, R. T.

1995-01-01

We have examined the feasibility of using massively-parallel and vector-processing supercomputers to solve large-scale optimization problems for human movement. Specifically, we compared the computational expense of determining the optimal controls for the single support phase of gait using a conventional serial machine (SGI Iris 4D25), a MIMD parallel machine (Intel iPSC/860), and a parallel-vector-processing machine (Cray Y-MP 8/864). With the human body modeled as a 14 degree-of-freedom linkage actuated by 46 musculotendinous units, computation of the optimal controls for gait could take up to 3 months of CPU time on the Iris. Both the Cray and the Intel are able to reduce this time to practical levels. The optimal solution for gait can be found with about 77 hours of CPU on the Cray and with about 88 hours of CPU on the Intel. Although the overall speeds of the Cray and the Intel were found to be similar, the unique capabilities of each machine are better suited to different portions of the computational algorithm used. The Intel was best suited to computing the derivatives of the performance criterion and the constraints whereas the Cray was best suited to parameter optimization of the controls. These results suggest that the ideal computer architecture for solving very large-scale optimal control problems is a hybrid system in which a vector-processing machine is integrated into the communication network of a MIMD parallel machine.
A new parallel-vector finite element analysis software on distributed-memory computers

NASA Technical Reports Server (NTRS)

Qin, Jiangning; Nguyen, Duc T.

1993-01-01

A new parallel-vector finite element analysis software package MPFEA (Massively Parallel-vector Finite Element Analysis) is developed for large-scale structural analysis on massively parallel computers with distributed-memory. MPFEA is designed for parallel generation and assembly of the global finite element stiffness matrices as well as parallel solution of the simultaneous linear equations, since these are often the major time-consuming parts of a finite element analysis. Block-skyline storage scheme along with vector-unrolling techniques are used to enhance the vector performance. Communications among processors are carried out concurrently with arithmetic operations to reduce the total execution time. Numerical results on the Intel iPSC/860 computers (such as the Intel Gamma with 128 processors and the Intel Touchstone Delta with 512 processors) are presented, including an aircraft structure and some very large truss structures, to demonstrate the efficiency and accuracy of MPFEA.
High Performance Distributed Computing in a Supercomputer Environment: Computational Services and Applications Issues

NASA Technical Reports Server (NTRS)

Kramer, Williams T. C.; Simon, Horst D.

1994-01-01

This tutorial proposes to be a practical guide for the uninitiated to the main topics and themes of high-performance computing (HPC), with particular emphasis to distributed computing. The intent is first to provide some guidance and directions in the rapidly increasing field of scientific computing using both massively parallel and traditional supercomputers. Because of their considerable potential computational power, loosely or tightly coupled clusters of workstations are increasingly considered as a third alternative to both the more conventional supercomputers based on a small number of powerful vector processors, as well as high massively parallel processors. Even though many research issues concerning the effective use of workstation clusters and their integration into a large scale production facility are still unresolved, such clusters are already used for production computing. In this tutorial we will utilize the unique experience made at the NAS facility at NASA Ames Research Center. Over the last five years at NAS massively parallel supercomputers such as the Connection Machines CM-2 and CM-5 from Thinking Machines Corporation and the iPSC/860 (Touchstone Gamma Machine) and Paragon Machines from Intel were used in a production supercomputer center alongside with traditional vector supercomputers such as the Cray Y-MP and C90.
A machine-learning approach for computation of fractional flow reserve from coronary computed tomography.

PubMed

Itu, Lucian; Rapaka, Saikiran; Passerini, Tiziano; Georgescu, Bogdan; Schwemmer, Chris; Schoebinger, Max; Flohr, Thomas; Sharma, Puneet; Comaniciu, Dorin

2016-07-01

Fractional flow reserve (FFR) is a functional index quantifying the severity of coronary artery lesions and is clinically obtained using an invasive, catheter-based measurement. Recently, physics-based models have shown great promise in being able to noninvasively estimate FFR from patient-specific anatomical information, e.g., obtained from computed tomography scans of the heart and the coronary arteries. However, these models have high computational demand, limiting their clinical adoption. In this paper, we present a machine-learning-based model for predicting FFR as an alternative to physics-based approaches. The model is trained on a large database of synthetically generated coronary anatomies, where the target values are computed using the physics-based model. The trained model predicts FFR at each point along the centerline of the coronary tree, and its performance was assessed by comparing the predictions against physics-based computations and against invasively measured FFR for 87 patients and 125 lesions in total. Correlation between machine-learning and physics-based predictions was excellent (0.9994, P < 0.001), and no systematic bias was found in Bland-Altman analysis: mean difference was -0.00081 ± 0.0039. Invasive FFR ≤ 0.80 was found in 38 lesions out of 125 and was predicted by the machine-learning algorithm with a sensitivity of 81.6%, a specificity of 83.9%, and an accuracy of 83.2%. The correlation was 0.729 (P < 0.001). Compared with the physics-based computation, average execution time was reduced by more than 80 times, leading to near real-time assessment of FFR. Average execution time went down from 196.3 ± 78.5 s for the CFD model to ∼2.4 ± 0.44 s for the machine-learning model on a workstation with 3.4-GHz Intel i7 8-core processor. Copyright © 2016 the American Physiological Society.
Thread-Level Parallelization and Optimization of NWChem for the Intel MIC Architecture

DOE Office of Scientific and Technical Information (OSTI.GOV)

Shan, Hongzhang; Williams, Samuel; Jong, Wibe de

In the multicore era it was possible to exploit the increase in on-chip parallelism by simply running multiple MPI processes per chip. Unfortunately, manycore processors' greatly increased thread- and data-level parallelism coupled with a reduced memory capacity demand an altogether different approach. In this paper we explore augmenting two NWChem modules, triples correction of the CCSD(T) and Fock matrix construction, with OpenMP in order that they might run efficiently on future manycore architectures. As the next NERSC machine will be a self-hosted Intel MIC (Xeon Phi) based supercomputer, we leverage an existing MIC testbed at NERSC to evaluate our experiments.more » In order to proxy the fact that future MIC machines will not have a host processor, we run all of our experiments in tt native mode. We found that while straightforward application of OpenMP to the deep loop nests associated with the tensor contractions of CCSD(T) was sufficient in attaining high performance, significant effort was required to safely and efficiently thread the TEXAS integral package when constructing the Fock matrix. Ultimately, our new MPI OpenMP hybrid implementations attain up to 65x better performance for the triples part of the CCSD(T) due in large part to the fact that the limited on-card memory limits the existing MPI implementation to a single process per card. Additionally, we obtain up to 1.6x better performance on Fock matrix constructions when compared with the best MPI implementations running multiple processes per card.« less
Thread-level parallelization and optimization of NWChem for the Intel MIC architecture

DOE Office of Scientific and Technical Information (OSTI.GOV)

Shan, Hongzhang; Williams, Samuel; de Jong, Wibe

In the multicore era it was possible to exploit the increase in on-chip parallelism by simply running multiple MPI processes per chip. Unfortunately, manycore processors' greatly increased thread- and data-level parallelism coupled with a reduced memory capacity demand an altogether different approach. In this paper we explore augmenting two NWChem modules, triples correction of the CCSD(T) and Fock matrix construction, with OpenMP in order that they might run efficiently on future manycore architectures. As the next NERSC machine will be a self-hosted Intel MIC (Xeon Phi) based supercomputer, we leverage an existing MIC testbed at NERSC to evaluate our experiments.more » In order to proxy the fact that future MIC machines will not have a host processor, we run all of our experiments in native mode. We found that while straightforward application of OpenMP to the deep loop nests associated with the tensor contractions of CCSD(T) was sufficient in attaining high performance, significant e ort was required to safely and efeciently thread the TEXAS integral package when constructing the Fock matrix. Ultimately, our new MPI+OpenMP hybrid implementations attain up to 65× better performance for the triples part of the CCSD(T) due in large part to the fact that the limited on-card memory limits the existing MPI implementation to a single process per card. Additionally, we obtain up to 1.6× better performance on Fock matrix constructions when compared with the best MPI implementations running multiple processes per card.« less

Transient Solid Dynamics Simulations on the Sandia/Intel Teraflop Computer

DOE Office of Scientific and Technical Information (OSTI.GOV)

Attaway, S.; Brown, K.; Gardner, D.

1997-12-31

Transient solid dynamics simulations are among the most widely used engineering calculations. Industrial applications include vehicle crashworthiness studies, metal forging, and powder compaction prior to sintering. These calculations are also critical to defense applications including safety studies and weapons simulations. The practical importance of these calculations and their computational intensiveness make them natural candidates for parallelization. This has proved to be difficult, and existing implementations fail to scale to more than a few dozen processors. In this paper we describe our parallelization of PRONTO, Sandia`s transient solid dynamics code, via a novel algorithmic approach that utilizes multiple decompositions for differentmore » key segments of the computations, including the material contact calculation. This latter calculation is notoriously difficult to perform well in parallel, because it involves dynamically changing geometry, global searches for elements in contact, and unstructured communications among the compute nodes. Our approach scales to at least 3600 compute nodes of the Sandia/Intel Teraflop computer (the largest set of nodes to which we have had access to date) on problems involving millions of finite elements. On this machine we can simulate models using more than ten- million elements in a few tenths of a second per timestep, and solve problems more than 3000 times faster than a single processor Cray Jedi.« less
Implementations of BLAST for parallel computers.

PubMed

Jülich, A

1995-02-01

The BLAST sequence comparison programs have been ported to a variety of parallel computers-the shared memory machine Cray Y-MP 8/864 and the distributed memory architectures Intel iPSC/860 and nCUBE. Additionally, the programs were ported to run on workstation clusters. We explain the parallelization techniques and consider the pros and cons of these methods. The BLAST programs are very well suited for parallelization for a moderate number of processors. We illustrate our results using the program blastp as an example. As input data for blastp, a 799 residue protein query sequence and the protein database PIR were used.
CUDA-based real time surgery simulation.

PubMed

Liu, Youquan; De, Suvranu

2008-01-01

In this paper we present a general software platform that enables real time surgery simulation on the newly available compute unified device architecture (CUDA)from NVIDIA. CUDA-enabled GPUs harness the power of 128 processors which allow data parallel computations. Compared to the previous GPGPU, it is significantly more flexible with a C language interface. We report implementation of both collision detection and consequent deformation computation algorithms. Our test results indicate that the CUDA enables a twenty times speedup for collision detection and about fifteen times speedup for deformation computation on an Intel Core 2 Quad 2.66 GHz machine with GeForce 8800 GTX.
Scalability and Portability of Two Parallel Implementations of ADI

NASA Technical Reports Server (NTRS)

Phung, Thanh; VanderWijngaart, Rob F.

1994-01-01

Two domain decompositions for the implementation of the NAS Scalar Penta-diagonal Parallel Benchmark on MIMD systems are investigated, namely transposition and multi-partitioning. Hardware platforms considered are the Intel iPSC/860 and Paragon XP/S-15, and clusters of SGI workstations on ethernet, communicating through PVM. It is found that the multi-partitioning strategy offers the kind of coarse granularity that allows scaling up to hundreds of processors on a massively parallel machine. Moreover, efficiency is retained when the code is ported verbatim (save message passing syntax) to a PVM environment on a modest size cluster of workstations.
DBPQL: A view-oriented query language for the Intel Data Base Processor

NASA Technical Reports Server (NTRS)

Fishwick, P. A.

1983-01-01

An interactive query language (BDPQL) for the Intel Data Base Processor (DBP) is defined. DBPQL includes a parser generator package which permits the analyst to easily create and manipulate the query statement syntax and semantics. The prototype language, DBPQL, includes trace and performance commands to aid the analyst when implementing new commands and analyzing the execution characteristics of the DBP. The DBPQL grammar file and associated key procedures are included as an appendix to this report.
MILC Code Performance on High End CPU and GPU Supercomputer Clusters

NASA Astrophysics Data System (ADS)

DeTar, Carleton; Gottlieb, Steven; Li, Ruizi; Toussaint, Doug

2018-03-01

With recent developments in parallel supercomputing architecture, many core, multi-core, and GPU processors are now commonplace, resulting in more levels of parallelism, memory hierarchy, and programming complexity. It has been necessary to adapt the MILC code to these new processors starting with NVIDIA GPUs, and more recently, the Intel Xeon Phi processors. We report on our efforts to port and optimize our code for the Intel Knights Landing architecture. We consider performance of the MILC code with MPI and OpenMP, and optimizations with QOPQDP and QPhiX. For the latter approach, we concentrate on the staggered conjugate gradient and gauge force. We also consider performance on recent NVIDIA GPUs using the QUDA library.
Numerical performance and throughput benchmark for electronic structure calculations in PC-Linux systems with new architectures, updated compilers, and libraries.

PubMed

Yu, Jen-Shiang K; Hwang, Jenn-Kang; Tang, Chuan Yi; Yu, Chin-Hui

2004-01-01

A number of recently released numerical libraries including Automatically Tuned Linear Algebra Subroutines (ATLAS) library, Intel Math Kernel Library (MKL), GOTO numerical library, and AMD Core Math Library (ACML) for AMD Opteron processors, are linked against the executables of the Gaussian 98 electronic structure calculation package, which is compiled by updated versions of Fortran compilers such as Intel Fortran compiler (ifc/efc) 7.1 and PGI Fortran compiler (pgf77/pgf90) 5.0. The ifc 7.1 delivers about 3% of improvement on 32-bit machines compared to the former version 6.0. Performance improved from pgf77 3.3 to 5.0 is also around 3% when utilizing the original unmodified optimization options of the compiler enclosed in the software. Nevertheless, if extensive compiler tuning options are used, the speed can be further accelerated to about 25%. The performances of these fully optimized numerical libraries are similar. The double-precision floating-point (FP) instruction sets (SSE2) are also functional on AMD Opteron processors operated in 32-bit compilation, and Intel Fortran compiler has performed better optimization. Hardware-level tuning is able to improve memory bandwidth by adjusting the DRAM timing, and the efficiency in the CL2 mode is further accelerated by 2.6% compared to that of the CL2.5 mode. The FP throughput is measured by simultaneous execution of two identical copies of each of the test jobs. Resultant performance impact suggests that IA64 and AMD64 architectures are able to fulfill significantly higher throughput than the IA32, which is consistent with the SpecFPrate2000 benchmarks.
Parallel algorithms for quantum chemistry. I. Integral transformations on a hypercube multiprocessor

DOE Office of Scientific and Technical Information (OSTI.GOV)

Whiteside, R.A.; Binkley, J.S.; Colvin, M.E.

1987-02-15

For many years it has been recognized that fundamental physical constraints such as the speed of light will limit the ultimate speed of single processor computers to less than about three billion floating point operations per second (3 GFLOPS). This limitation is becoming increasingly restrictive as commercially available machines are now within an order of magnitude of this asymptotic limit. A natural way to avoid this limit is to harness together many processors to work on a single computational problem. In principle, these parallel processing computers have speeds limited only by the number of processors one chooses to acquire. Themore » usefulness of potentially unlimited processing speed to a computationally intensive field such as quantum chemistry is obvious. If these methods are to be applied to significantly larger chemical systems, parallel schemes will have to be employed. For this reason we have developed distributed-memory algorithms for a number of standard quantum chemical methods. We are currently implementing these on a 32 processor Intel hypercube. In this paper we present our algorithm and benchmark results for one of the bottleneck steps in quantum chemical calculations: the four index integral transformation.« less
JPRS Report, Science & Technology, Europe.

DTIC Science & Technology

1991-04-30

processor in collaboration with Intel . The processor , christened Touchstone, will be used as the core of a parallel computer with 2,000 processors . One of...ELECTRONIQUE HEBDO in French 24 Jan 91 pp 14-15 [Article by Claire Remy: "Everything Set for Neural Signal Processors " first paragraph is ELECTRONIQUE...paving the way for neural signal processors in so doing. The principal advantage of this specific circuit over a neuromimetic software program is
Exact diagonalization of quantum lattice models on coprocessors

NASA Astrophysics Data System (ADS)

Siro, T.; Harju, A.

2016-10-01

We implement the Lanczos algorithm on an Intel Xeon Phi coprocessor and compare its performance to a multi-core Intel Xeon CPU and an NVIDIA graphics processor. The Xeon and the Xeon Phi are parallelized with OpenMP and the graphics processor is programmed with CUDA. The performance is evaluated by measuring the execution time of a single step in the Lanczos algorithm. We study two quantum lattice models with different particle numbers, and conclude that for small systems, the multi-core CPU is the fastest platform, while for large systems, the graphics processor is the clear winner, reaching speedups of up to 7.6 compared to the CPU. The Xeon Phi outperforms the CPU with sufficiently large particle number, reaching a speedup of 2.5.
Astrophysical N-body Simulations Using Hierarchical Tree Data Structures

NASA Astrophysics Data System (ADS)

Warren, M. S.; Salmon, J. K.

The authors report on recent large astrophysical N-body simulations executed on the Intel Touchstone Delta system. They review the astrophysical motivation and the numerical techniques and discuss steps taken to parallelize these simulations. The methods scale as O(N log N), for large values of N, and also scale linearly with the number of processors. The performance sustained for a duration of 67 h, was between 5.1 and 5.4 Gflop/s on a 512-processor system.
Discrete sensitivity derivatives of the Navier-Stokes equations with a parallel Krylov solver

NASA Technical Reports Server (NTRS)

Ajmani, Kumud; Taylor, Arthur C., III

1994-01-01

This paper solves an 'incremental' form of the sensitivity equations derived by differentiating the discretized thin-layer Navier Stokes equations with respect to certain design variables of interest. The equations are solved with a parallel, preconditioned Generalized Minimal RESidual (GMRES) solver on a distributed-memory architecture. The 'serial' sensitivity analysis code is parallelized by using the Single Program Multiple Data (SPMD) programming model, domain decomposition techniques, and message-passing tools. Sensitivity derivatives are computed for low and high Reynolds number flows over a NACA 1406 airfoil on a 32-processor Intel Hypercube, and found to be identical to those computed on a single-processor Cray Y-MP. It is estimated that the parallel sensitivity analysis code has to be run on 40-50 processors of the Intel Hypercube in order to match the single-processor processing time of a Cray Y-MP.
Digital Hardware Architecture Implementation

DTIC Science & Technology

1993-02-15

of micro - MOTOROLA 63.7 50MHZ 64 BIT 2092 N/A processors during quarterly re- INTEL 42 50MHz 64 BIT 1092 N/A views and monthly reports. The 186o XP...27 3.2.1 Signal Processor (SP) Analysis...31 3.2.1.11 MasPar Software Statements ........................................................ 32 3.2.2 Data Processor
Performance Evaluation of Supercomputers using HPCC and IMB Benchmarks

NASA Technical Reports Server (NTRS)

Saini, Subhash; Ciotti, Robert; Gunney, Brian T. N.; Spelce, Thomas E.; Koniges, Alice; Dossa, Don; Adamidis, Panagiotis; Rabenseifner, Rolf; Tiyyagura, Sunil R.; Mueller, Matthias;

2006-01-01

The HPC Challenge (HPCC) benchmark suite and the Intel MPI Benchmark (IMB) are used to compare and evaluate the combined performance of processor, memory subsystem and interconnect fabric of five leading supercomputers - SGI Altix BX2, Cray XI, Cray Opteron Cluster, Dell Xeon cluster, and NEC SX-8. These five systems use five different networks (SGI NUMALINK4, Cray network, Myrinet, InfiniBand, and NEC IXS). The complete set of HPCC benchmarks are run on each of these systems. Additionally, we present Intel MPI Benchmarks (IMB) results to study the performance of 11 MPI communication functions on these systems.

Performance Evaluation of an Intel Haswell- and Ivy Bridge-Based Supercomputer Using Scientific and Engineering Applications

NASA Technical Reports Server (NTRS)

Saini, Subhash; Hood, Robert T.; Chang, Johnny; Baron, John

2016-01-01

We present a performance evaluation conducted on a production supercomputer of the Intel Xeon Processor E5- 2680v3, a twelve-core implementation of the fourth-generation Haswell architecture, and compare it with Intel Xeon Processor E5-2680v2, an Ivy Bridge implementation of the third-generation Sandy Bridge architecture. Several new architectural features have been incorporated in Haswell including improvements in all levels of the memory hierarchy as well as improvements to vector instructions and power management. We critically evaluate these new features of Haswell and compare with Ivy Bridge using several low-level benchmarks including subset of HPCC, HPCG and four full-scale scientific and engineering applications. We also present a model to predict the performance of HPCG and Cart3D within 5%, and Overflow within 10% accuracy.
Hot Chips and Hot Interconnects for High End Computing Systems

NASA Technical Reports Server (NTRS)

Saini, Subhash

2005-01-01

I will discuss several processors: 1. The Cray proprietary processor used in the Cray X1; 2. The IBM Power 3 and Power 4 used in an IBM SP 3 and IBM SP 4 systems; 3. The Intel Itanium and Xeon, used in the SGI Altix systems and clusters respectively; 4. IBM System-on-a-Chip used in IBM BlueGene/L; 5. HP Alpha EV68 processor used in DOE ASCI Q cluster; 6. SPARC64 V processor, which is used in the Fujitsu PRIMEPOWER HPC2500; 7. An NEC proprietary processor, which is used in NEC SX-6/7; 8. Power 4+ processor, which is used in Hitachi SR11000; 9. NEC proprietary processor, which is used in Earth Simulator. The IBM POWER5 and Red Storm Computing Systems will also be discussed. The architectures of these processors will first be presented, followed by interconnection networks and a description of high-end computer systems based on these processors and networks. The performance of various hardware/programming model combinations will then be compared, based on latest NAS Parallel Benchmark results (MPI, OpenMP/HPF and hybrid (MPI + OpenMP). The tutorial will conclude with a discussion of general trends in the field of high performance computing, (quantum computing, DNA computing, cellular engineering, and neural networks).
Portable parallel stochastic optimization for the design of aeropropulsion components

NASA Technical Reports Server (NTRS)

Sues, Robert H.; Rhodes, G. S.

1994-01-01

This report presents the results of Phase 1 research to develop a methodology for performing large-scale Multi-disciplinary Stochastic Optimization (MSO) for the design of aerospace systems ranging from aeropropulsion components to complete aircraft configurations. The current research recognizes that such design optimization problems are computationally expensive, and require the use of either massively parallel or multiple-processor computers. The methodology also recognizes that many operational and performance parameters are uncertain, and that uncertainty must be considered explicitly to achieve optimum performance and cost. The objective of this Phase 1 research was to initialize the development of an MSO methodology that is portable to a wide variety of hardware platforms, while achieving efficient, large-scale parallelism when multiple processors are available. The first effort in the project was a literature review of available computer hardware, as well as review of portable, parallel programming environments. The first effort was to implement the MSO methodology for a problem using the portable parallel programming language, Parallel Virtual Machine (PVM). The third and final effort was to demonstrate the example on a variety of computers, including a distributed-memory multiprocessor, a distributed-memory network of workstations, and a single-processor workstation. Results indicate the MSO methodology can be well-applied towards large-scale aerospace design problems. Nearly perfect linear speedup was demonstrated for computation of optimization sensitivity coefficients on both a 128-node distributed-memory multiprocessor (the Intel iPSC/860) and a network of workstations (speedups of almost 19 times achieved for 20 workstations). Very high parallel efficiencies (75 percent for 31 processors and 60 percent for 50 processors) were also achieved for computation of aerodynamic influence coefficients on the Intel. Finally, the multi-level parallelization strategy that will be needed for large-scale MSO problems was demonstrated to be highly efficient. The same parallel code instructions were used on both platforms, demonstrating portability. There are many applications for which MSO can be applied, including NASA's High-Speed-Civil Transport, and advanced propulsion systems. The use of MSO will reduce design and development time and testing costs dramatically.
High-performance sparse matrix-matrix products on Intel KNL and multicore architectures

DOE Office of Scientific and Technical Information (OSTI.GOV)

Nagasaka, Y; Matsuoka, S; Azad, A

Sparse matrix-matrix multiplication (SpGEMM) is a computational primitive that is widely used in areas ranging from traditional numerical applications to recent big data analysis and machine learning. Although many SpGEMM algorithms have been proposed, hardware specific optimizations for multi- and many-core processors are lacking and a detailed analysis of their performance under various use cases and matrices is not available. We firstly identify and mitigate multiple bottlenecks with memory management and thread scheduling on Intel Xeon Phi (Knights Landing or KNL). Specifically targeting multi- and many-core processors, we develop a hash-table-based algorithm and optimize a heap-based shared-memory SpGEMM algorithm. Wemore » examine their performance together with other publicly available codes. Different from the literature, our evaluation also includes use cases that are representative of real graph algorithms, such as multi-source breadth-first search or triangle counting. Our hash-table and heap-based algorithms are showing significant speedups from libraries in the majority of the cases while different algorithms dominate the other scenarios with different matrix size, sparsity, compression factor and operation type. We wrap up in-depth evaluation results and make a recipe to give the best SpGEMM algorithm for target scenario. A critical finding is that hash-table-based SpGEMM gets a significant performance boost if the nonzeros are not required to be sorted within each row of the output matrix.« less
NearFar: A computer program for nearside farside decomposition of heavy-ion elastic scattering amplitude

NASA Astrophysics Data System (ADS)

Cha, Moon Hoe

2007-02-01

The NearFar program is a package for carrying out an interactive nearside-farside decomposition of heavy-ion elastic scattering amplitude. The program is implemented in Java to perform numerical operations on the nearside and farside angular distributions. It contains a graphical display interface for the numerical results. A test run has been applied to the elastic O16+Si28 scattering at E=1503 MeV. Program summaryTitle of program: NearFar Catalogue identifier: ADYP_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/ADYP_v1_0 Program obtainable from: CPC Program Library, Queen's University of Belfast, N. Ireland Licensing provisions: none Computers: designed for any machine capable of running Java, developed on PC-Pentium-4 Operating systems under which the program has been tested: Microsoft Windows XP (Home Edition) Program language used: Java Number of bits in a word: 64 Memory required to execute with typical data: case dependent No. of lines in distributed program, including test data, etc.: 3484 Number of bytes distributed program, including test data, etc.: 142 051 Distribution format: tar.gz Other software required: A Java runtime interpreter, or the Java Development Kit, version 5.0 Nature of physical problem: Interactive nearside-farside decomposition of heavy-ion elastic scattering amplitude. Method of solution: The user must supply a external data file or PPSM parameters which calculates theoretical values of the quantities to be decomposed. Typical running time: Problem dependent. In a test run, it is about 35 s on a 2.40 GHz Intel P4-processor machine.
USC orthogonal multiprocessor for image processing with neural networks

NASA Astrophysics Data System (ADS)

Hwang, Kai; Panda, Dhabaleswar K.; Haddadi, Navid

1990-07-01

This paper presents the architectural features and imaging applications of the Orthogonal MultiProcessor (OMP) system, which is under construction at the University of Southern California with research funding from NSF and assistance from several industrial partners. The prototype OMP is being built with 16 Intel i860 RISC microprocessors and 256 parallel memory modules using custom-designed spanning buses, which are 2-D interleaved and orthogonally accessed without conflicts. The 16-processor OMP prototype is targeted to achieve 430 MIPS and 600 Mflops, which have been verified by simulation experiments based on the design parameters used. The prototype OMP machine will be initially applied for image processing, computer vision, and neural network simulation applications. We summarize important vision and imaging algorithms that can be restructured with neural network models. These algorithms can efficiently run on the OMP hardware with linear speedup. The ultimate goal is to develop a high-performance Visual Computer (Viscom) for integrated low- and high-level image processing and vision tasks.

Applications Performance on NAS Intel Paragon XP/S - 15#

NASA Technical Reports Server (NTRS)

Saini, Subhash; Simon, Horst D.; Copper, D. M. (Technical Monitor)

1994-01-01

The Numerical Aerodynamic Simulation (NAS) Systems Division received an Intel Touchstone Sigma prototype model Paragon XP/S- 15 in February, 1993. The i860 XP microprocessor with an integrated floating point unit and operating in dual -instruction mode gives peak performance of 75 million floating point operations (NIFLOPS) per second for 64 bit floating point arithmetic. It is used in the Paragon XP/S-15 which has been installed at NAS, NASA Ames Research Center. The NAS Paragon has 208 nodes and its peak performance is 15.6 GFLOPS. Here, we will report on early experience using the Paragon XP/S- 15. We have tested its performance using both kernels and applications of interest to NAS. We have measured the performance of BLAS 1, 2 and 3 both assembly-coded and Fortran coded on NAS Paragon XP/S- 15. Furthermore, we have investigated the performance of a single node one-dimensional FFT, a distributed two-dimensional FFT and a distributed three-dimensional FFT Finally, we measured the performance of NAS Parallel Benchmarks (NPB) on the Paragon and compare it with the performance obtained on other highly parallel machines, such as CM-5, CRAY T3D, IBM SP I, etc. In particular, we investigated the following issues, which can strongly affect the performance of the Paragon: a. Impact of the operating system: Intel currently uses as a default an operating system OSF/1 AD from the Open Software Foundation. The paging of Open Software Foundation (OSF) server at 22 MB to make more memory available for the application degrades the performance. We found that when the limit of 26 NIB per node out of 32 MB available is reached, the application is paged out of main memory using virtual memory. When the application starts paging, the performance is considerably reduced. We found that dynamic memory allocation can help applications performance under certain circumstances. b. Impact of data cache on the i860/XP: We measured the performance of the BLAS both assembly coded and Fortran coded. We found that the measured performance of assembly-coded BLAS is much less than what memory bandwidth limitation would predict. The influence of data cache on different sizes of vectors is also investigated using one-dimensional FFTs. c. Impact of processor layout: There are several different ways processors can be laid out within the two-dimensional grid of processors on the Paragon. We have used the FFT example to investigate performance differences based on processors layout.
Testing the Tester: Lessons Learned During the Testing of a State-of-the-Art Commercial 14nm Processor Under Proton Irradiation

NASA Technical Reports Server (NTRS)

Szabo, Carl M., Jr.; Duncan, Adam R.; Label, Kenneth A.

2017-01-01

Testing of an Intel 14nm desktop processor was conducted under proton irradiation. We share lessons learned, demonstrating that complex devices beget further complex challenges requiring practical and theoretical investigative expertise to solve.
Balancing Contention and Synchronization on the Intel Paragon

NASA Technical Reports Server (NTRS)

Bokhari, Shahid H.; Nicol, David M.

1996-01-01

The Intel Paragon is a mesh-connected distributed memory parallel computer. It uses an oblivious and deterministic message routing algorithm: this permits us to develop highly optimized schedules for frequently needed communication patterns. The complete exchange is one such pattern. Several approaches are available for carrying it out on the mesh. We study an algorithm developed by Scott. This algorithm assumes that a communication link can carry one message at a time and that a node can only transmit one message at a time. It requires global synchronization to enforce a schedule of transmissions. Unfortunately global synchronization has substantial overhead on the Paragon. At the same time the powerful interconnection mechanism of this machine permits 2 or 3 messages to share a communication link with minor overhead. It can also overlap multiple message transmission from the same node to some extent. We develop a generalization of Scott's algorithm that executes complete exchange with a prescribed contention. Schedules that incur greater contention require fewer synchronization steps. This permits us to tradeoff contention against synchronization overhead. We describe the performance of this algorithm and compare it with Scott's original algorithm as well as with a naive algorithm that does not take interconnection structure into account. The Bounded contention algorithm is always better than Scott's algorithm and outperforms the naive algorithm for all but the smallest message sizes. The naive algorithm fails to work on meshes larger than 12 x 12. These results show that due consideration of processor interconnect and machine performance parameters is necessary to obtain peak performance from the Paragon and its successor mesh machines.
Why K-12 IT Managers and Administrators Are Embracing the Intel-Based Mac

ERIC Educational Resources Information Center

Technology & Learning, 2007

2007-01-01

Over the past year, Apple has dramatically increased its share of the school computer marketplace--especially in the category of notebook computers. A recent study conducted by Grunwald Associates and Rockman et al. reports that one of the major reasons for this growth is Apple's introduction of the Intel processor to the entire line of Mac…
The parallel algorithm for the 2D discrete wavelet transform

NASA Astrophysics Data System (ADS)

Barina, David; Najman, Pavel; Kleparnik, Petr; Kula, Michal; Zemcik, Pavel

2018-04-01

The discrete wavelet transform can be found at the heart of many image-processing algorithms. Until now, the transform on general-purpose processors (CPUs) was mostly computed using a separable lifting scheme. As the lifting scheme consists of a small number of operations, it is preferred for processing using single-core CPUs. However, considering a parallel processing using multi-core processors, this scheme is inappropriate due to a large number of steps. On such architectures, the number of steps corresponds to the number of points that represent the exchange of data. Consequently, these points often form a performance bottleneck. Our approach appropriately rearranges calculations inside the transform, and thereby reduces the number of steps. In other words, we propose a new scheme that is friendly to parallel environments. When evaluating on multi-core CPUs, we consistently overcome the original lifting scheme. The evaluation was performed on 61-core Intel Xeon Phi and 8-core Intel Xeon processors.
Accelerating Climate Simulations Through Hybrid Computing

NASA Technical Reports Server (NTRS)

Zhou, Shujia; Sinno, Scott; Cruz, Carlos; Purcell, Mark

2009-01-01

Unconventional multi-core processors (e.g., IBM Cell B/E and NYIDIDA GPU) have emerged as accelerators in climate simulation. However, climate models typically run on parallel computers with conventional processors (e.g., Intel and AMD) using MPI. Connecting accelerators to this architecture efficiently and easily becomes a critical issue. When using MPI for connection, we identified two challenges: (1) identical MPI implementation is required in both systems, and; (2) existing MPI code must be modified to accommodate the accelerators. In response, we have extended and deployed IBM Dynamic Application Virtualization (DAV) in a hybrid computing prototype system (one blade with two Intel quad-core processors, two IBM QS22 Cell blades, connected with Infiniband), allowing for seamlessly offloading compute-intensive functions to remote, heterogeneous accelerators in a scalable, load-balanced manner. Currently, a climate solar radiation model running with multiple MPI processes has been offloaded to multiple Cell blades with approx.10% network overhead.
High-performance reconfigurable hardware architecture for restricted Boltzmann machines.

PubMed

Ly, Daniel Le; Chow, Paul

2010-11-01

Despite the popularity and success of neural networks in research, the number of resulting commercial or industrial applications has been limited. A primary cause for this lack of adoption is that neural networks are usually implemented as software running on general-purpose processors. Hence, a hardware implementation that can exploit the inherent parallelism in neural networks is desired. This paper investigates how the restricted Boltzmann machine (RBM), which is a popular type of neural network, can be mapped to a high-performance hardware architecture on field-programmable gate array (FPGA) platforms. The proposed modular framework is designed to reduce the time complexity of the computations through heavily customized hardware engines. A method to partition large RBMs into smaller congruent components is also presented, allowing the distribution of one RBM across multiple FPGA resources. The framework is tested on a platform of four Xilinx Virtex II-Pro XC2VP70 FPGAs running at 100 MHz through a variety of different configurations. The maximum performance was obtained by instantiating an RBM of 256 × 256 nodes distributed across four FPGAs, which resulted in a computational speed of 3.13 billion connection-updates-per-second and a speedup of 145-fold over an optimized C program running on a 2.8-GHz Intel processor.
A compositional reservoir simulator on distributed memory parallel computers

DOE Office of Scientific and Technical Information (OSTI.GOV)

Rame, M.; Delshad, M.

1995-12-31

This paper presents the application of distributed memory parallel computes to field scale reservoir simulations using a parallel version of UTCHEM, The University of Texas Chemical Flooding Simulator. The model is a general purpose highly vectorized chemical compositional simulator that can simulate a wide range of displacement processes at both field and laboratory scales. The original simulator was modified to run on both distributed memory parallel machines (Intel iPSC/960 and Delta, Connection Machine 5, Kendall Square 1 and 2, and CRAY T3D) and a cluster of workstations. A domain decomposition approach has been taken towards parallelization of the code. Amore » portion of the discrete reservoir model is assigned to each processor by a set-up routine that attempts a data layout as even as possible from the load-balance standpoint. Each of these subdomains is extended so that data can be shared between adjacent processors for stencil computation. The added routines that make parallel execution possible are written in a modular fashion that makes the porting to new parallel platforms straight forward. Results of the distributed memory computing performance of Parallel simulator are presented for field scale applications such as tracer flood and polymer flood. A comparison of the wall-clock times for same problems on a vector supercomputer is also presented.« less
Thermo-mechanical properties of carbon nanotubes and applications in thermal management

NASA Astrophysics Data System (ADS)

Nguyen, Manh Hong; Thang Bui, Hung; Trinh Pham, Van; Phan, Ngoc Hong; Nguyen, Tuan Hong; Chuc Nguyen, Van; Quang Le, Dinh; Khoi Phan, Hong; Phan, Ngoc Minh

2016-06-01

Thanks to their very high thermal conductivity, high Young’s modulus and unique tensile strength, carbon nanotubes (CNTs) have become one of the most suitable nano additives for heat conductive materials. In this work, we present results obtained for the synthesis of heat conductive materials containing CNT based thermal greases, nanoliquids and lubricating oils. These synthesized heat conductive materials were applied to thermal management for high power electronic devices (CPUs, LEDs) and internal combustion engines. The simulation and experimental results on thermal greases for an Intel Pentium IV processor showed that the thermal conductivity of greases increases 1.4 times and the saturation temperature of the CPU decreased by 5 °C by using thermal grease containing 2 wt% CNTs. Nanoliquids containing CNT based distilled water/ethylene glycol were successfully applied in heat dissipation for an Intel Core i5 processor and a 450 W floodlight LED. The experimental results showed that the saturation temperature of the Intel Core i5 processor and the 450 W floodlight LED decreased by about 6 °C and 3.5 °C, respectively, when using nanoliquids containing 1 g l-1 of CNTs. The CNTs were also effectively utilized additive materials for the synthesis of lubricating oils to improve the thermal conductivity, heat dissipation efficiency and performance efficiency of engines. The experimental results show that the thermal conductivity of lubricating oils increased by 12.5%, the engine saved 15% fuel consumption, and the longevity of the lubricating oil increased up to 20 000 km by using 0.1% vol. CNTs in the lubricating oils. All above results have confirmed the tremendous application potential of heat conductive materials containing CNTs in thermal management for high power electronic devices, internal combustion engines and other high power apparatus.
A data base processor semantics specification package

NASA Technical Reports Server (NTRS)

Fishwick, P. A.

1983-01-01

A Semantics Specification Package (DBPSSP) for the Intel Data Base Processor (DBP) is defined. DBPSSP serves as a collection of cross assembly tools that allow the analyst to assemble request blocks on the host computer for passage to the DBP. The assembly tools discussed in this report may be effectively used in conjunction with a DBP compatible data communications protocol to form a query processor, precompiler, or file management system for the database processor. The source modules representing the components of DBPSSP are fully commented and included.
Scalability of a Low-Cost Multi-Teraflop Linux Cluster for High-End Classical Atomistic and Quantum Mechanical Simulations

NASA Technical Reports Server (NTRS)

Kikuchi, Hideaki; Kalia, Rajiv K.; Nakano, Aiichiro; Vashishta, Priya; Shimojo, Fuyuki; Saini, Subhash

2003-01-01

Scalability of a low-cost, Intel Xeon-based, multi-Teraflop Linux cluster is tested for two high-end scientific applications: Classical atomistic simulation based on the molecular dynamics method and quantum mechanical calculation based on the density functional theory. These scalable parallel applications use space-time multiresolution algorithms and feature computational-space decomposition, wavelet-based adaptive load balancing, and spacefilling-curve-based data compression for scalable I/O. Comparative performance tests are performed on a 1,024-processor Linux cluster and a conventional higher-end parallel supercomputer, 1,184-processor IBM SP4. The results show that the performance of the Linux cluster is comparable to that of the SP4. We also study various effects, such as the sharing of memory and L2 cache among processors, on the performance.
Parallelizing ATLAS Reconstruction and Simulation: Issues and Optimization Solutions for Scaling on Multi- and Many-CPU Platforms

NASA Astrophysics Data System (ADS)

Leggett, C.; Binet, S.; Jackson, K.; Levinthal, D.; Tatarkhanov, M.; Yao, Y.

2011-12-01

Thermal limitations have forced CPU manufacturers to shift from simply increasing clock speeds to improve processor performance, to producing chip designs with multi- and many-core architectures. Further the cores themselves can run multiple threads as a zero overhead context switch allowing low level resource sharing (Intel Hyperthreading). To maximize bandwidth and minimize memory latency, memory access has become non uniform (NUMA). As manufacturers add more cores to each chip, a careful understanding of the underlying architecture is required in order to fully utilize the available resources. We present AthenaMP and the Atlas event loop manager, the driver of the simulation and reconstruction engines, which have been rewritten to make use of multiple cores, by means of event based parallelism, and final stage I/O synchronization. However, initial studies on 8 andl6 core Intel architectures have shown marked non-linearities as parallel process counts increase, with as much as 30% reductions in event throughput in some scenarios. Since the Intel Nehalem architecture (both Gainestown and Westmere) will be the most common choice for the next round of hardware procurements, an understanding of these scaling issues is essential. Using hardware based event counters and Intel's Performance Tuning Utility, we have studied the performance bottlenecks at the hardware level, and discovered optimization schemes to maximize processor throughput. We have also produced optimization mechanisms, common to all large experiments, that address the extreme nature of today's HEP code, which due to it's size, places huge burdens on the memory infrastructure of today's processors.
Virtual Machine Language 2.1

NASA Technical Reports Server (NTRS)

Riedel, Joseph E.; Grasso, Christopher A.

2012-01-01

VML (Virtual Machine Language) is an advanced computing environment that allows spacecraft to operate using mechanisms ranging from simple, time-oriented sequencing to advanced, multicomponent reactive systems. VML has developed in four evolutionary stages. VML 0 is a core execution capability providing multi-threaded command execution, integer data types, and rudimentary branching. VML 1 added named parameterized procedures, extensive polymorphism, data typing, branching, looping issuance of commands using run-time parameters, and named global variables. VML 2 added for loops, data verification, telemetry reaction, and an open flight adaptation architecture. VML 2.1 contains major advances in control flow capabilities for executable state machines. On the resource requirements front, VML 2.1 features a reduced memory footprint in order to fit more capability into modestly sized flight processors, and endian-neutral data access for compatibility with Intel little-endian processors. Sequence packaging has been improved with object-oriented programming constructs and the use of implicit (rather than explicit) time tags on statements. Sequence event detection has been significantly enhanced with multi-variable waiting, which allows a sequence to detect and react to conditions defined by complex expressions with multiple global variables. This multi-variable waiting serves as the basis for implementing parallel rule checking, which in turn, makes possible executable state machines. The new state machine feature in VML 2.1 allows the creation of sophisticated autonomous reactive systems without the need to develop expensive flight software. Users specify named states and transitions, along with the truth conditions required, before taking transitions. Transitions with the same signal name allow separate state machines to coordinate actions: the conditions distributed across all state machines necessary to arm a particular signal are evaluated, and once found true, that signal is raised. The selected signal then causes all identically named transitions in all present state machines to be taken simultaneously. VML 2.1 has relevance to all potential space missions, both manned and unmanned. It was under consideration for use on Orion.
Communication overhead on the Intel iPSC-860 hypercube

NASA Technical Reports Server (NTRS)

Bokhari, Shahid H.

1990-01-01

Experiments were conducted on the Intel iPSC-860 hypercube in order to evaluate the overhead of interprocessor communication. It is demonstrated that: (1) contrary to popular belief, the distance between two communicating processors has a significant impact on communication time, (2) edge contention can increase communication time by a factor of more than 7, and (3) node contention has no measurable impact.
Application of a distributed network in computational fluid dynamic simulations

NASA Technical Reports Server (NTRS)

Deshpande, Manish; Feng, Jinzhang; Merkle, Charles L.; Deshpande, Ashish

1994-01-01

A general-purpose 3-D, incompressible Navier-Stokes algorithm is implemented on a network of concurrently operating workstations using parallel virtual machine (PVM) and compared with its performance on a CRAY Y-MP and on an Intel iPSC/860. The problem is relatively computationally intensive, and has a communication structure based primarily on nearest-neighbor communication, making it ideally suited to message passing. Such problems are frequently encountered in computational fluid dynamics (CDF), and their solution is increasingly in demand. The communication structure is explicitly coded in the implementation to fully exploit the regularity in message passing in order to produce a near-optimal solution. Results are presented for various grid sizes using up to eight processors.
Efficient Execution of Microscopy Image Analysis on CPU, GPU, and MIC Equipped Cluster Systems.

PubMed

Andrade, G; Ferreira, R; Teodoro, George; Rocha, Leonardo; Saltz, Joel H; Kurc, Tahsin

2014-10-01

High performance computing is experiencing a major paradigm shift with the introduction of accelerators, such as graphics processing units (GPUs) and Intel Xeon Phi (MIC). These processors have made available a tremendous computing power at low cost, and are transforming machines into hybrid systems equipped with CPUs and accelerators. Although these systems can deliver a very high peak performance, making full use of its resources in real-world applications is a complex problem. Most current applications deployed to these machines are still being executed in a single processor, leaving other devices underutilized. In this paper we explore a scenario in which applications are composed of hierarchical data flow tasks which are allocated to nodes of a distributed memory machine in coarse-grain, but each of them may be composed of several finer-grain tasks which can be allocated to different devices within the node. We propose and implement novel performance aware scheduling techniques that can be used to allocate tasks to devices. We evaluate our techniques using a pathology image analysis application used to investigate brain cancer morphology, and our experimental evaluation shows that the proposed scheduling strategies significantly outperforms other efficient scheduling techniques, such as Heterogeneous Earliest Finish Time - HEFT, in cooperative executions using CPUs, GPUs, and MICs. We also experimentally show that our strategies are less sensitive to inaccuracy in the scheduling input data and that the performance gains are maintained as the application scales.
Efficient Execution of Microscopy Image Analysis on CPU, GPU, and MIC Equipped Cluster Systems

PubMed Central

Andrade, G.; Ferreira, R.; Teodoro, George; Rocha, Leonardo; Saltz, Joel H.; Kurc, Tahsin

2015-01-01

High performance computing is experiencing a major paradigm shift with the introduction of accelerators, such as graphics processing units (GPUs) and Intel Xeon Phi (MIC). These processors have made available a tremendous computing power at low cost, and are transforming machines into hybrid systems equipped with CPUs and accelerators. Although these systems can deliver a very high peak performance, making full use of its resources in real-world applications is a complex problem. Most current applications deployed to these machines are still being executed in a single processor, leaving other devices underutilized. In this paper we explore a scenario in which applications are composed of hierarchical data flow tasks which are allocated to nodes of a distributed memory machine in coarse-grain, but each of them may be composed of several finer-grain tasks which can be allocated to different devices within the node. We propose and implement novel performance aware scheduling techniques that can be used to allocate tasks to devices. We evaluate our techniques using a pathology image analysis application used to investigate brain cancer morphology, and our experimental evaluation shows that the proposed scheduling strategies significantly outperforms other efficient scheduling techniques, such as Heterogeneous Earliest Finish Time - HEFT, in cooperative executions using CPUs, GPUs, and MICs. We also experimentally show that our strategies are less sensitive to inaccuracy in the scheduling input data and that the performance gains are maintained as the application scales. PMID:26640423
Multibus-based parallel processor for simulation

NASA Technical Reports Server (NTRS)

Ogrady, E. P.; Wang, C.-H.

1983-01-01

A Multibus-based parallel processor simulation system is described. The system is intended to serve as a vehicle for gaining hands-on experience, testing system and application software, and evaluating parallel processor performance during development of a larger system based on the horizontal/vertical-bus interprocessor communication mechanism. The prototype system consists of up to seven Intel iSBC 86/12A single-board computers which serve as processing elements, a multiple transmission controller (MTC) designed to support system operation, and an Intel Model 225 Microcomputer Development System which serves as the user interface and input/output processor. All components are interconnected by a Multibus/IEEE 796 bus. An important characteristic of the system is that it provides a mechanism for a processing element to broadcast data to other selected processing elements. This parallel transfer capability is provided through the design of the MTC and a minor modification to the iSBC 86/12A board. The operation of the MTC, the basic hardware-level operation of the system, and pertinent details about the iSBC 86/12A and the Multibus are described.
HeinzelCluster: accelerated reconstruction for FORE and OSEM3D.

PubMed

Vollmar, S; Michel, C; Treffert, J T; Newport, D F; Casey, M; Knöss, C; Wienhard, K; Liu, X; Defrise, M; Heiss, W D

2002-08-07

Using iterative three-dimensional (3D) reconstruction techniques for reconstruction of positron emission tomography (PET) is not feasible on most single-processor machines due to the excessive computing time needed, especially so for the large sinogram sizes of our high-resolution research tomograph (HRRT). In our first approach to speed up reconstruction time we transform the 3D scan into the format of a two-dimensional (2D) scan with sinograms that can be reconstructed independently using Fourier rebinning (FORE) and a fast 2D reconstruction method. On our dedicated reconstruction cluster (seven four-processor systems, Intel PIII@700 MHz, switched fast ethernet and Myrinet, Windows NT Server), we process these 2D sinograms in parallel. We have achieved a speedup > 23 using 26 processors and also compared results for different communication methods (RPC, Syngo, Myrinet GM). The other approach is to parallelize OSEM3D (implementation of C Michel), which has produced the best results for HRRT data so far and is more suitable for an adequate treatment of the sinogram gaps that result from the detector geometry of the HRRT. We have implemented two levels of parallelization for four dedicated cluster (a shared memory fine-grain level on each node utilizing all four processors and a coarse-grain level allowing for 15 nodes) reducing the time for one core iteration from over 7 h to about 35 min.
Generic accelerated sequence alignment in SeqAn using vectorization and multi-threading.

PubMed

Rahn, René; Budach, Stefan; Costanza, Pascal; Ehrhardt, Marcel; Hancox, Jonny; Reinert, Knut

2018-05-03

Pairwise sequence alignment is undoubtedly a central tool in many bioinformatics analyses. In this paper, we present a generically accelerated module for pairwise sequence alignments applicable for a broad range of applications. In our module, we unified the standard dynamic programming kernel used for pairwise sequence alignments and extended it with a generalized inter-sequence vectorization layout, such that many alignments can be computed simultaneously by exploiting SIMD (Single Instruction Multiple Data) instructions of modern processors. We then extended the module by adding two layers of thread-level parallelization, where we a) distribute many independent alignments on multiple threads and b) inherently parallelize a single alignment computation using a work stealing approach producing a dynamic wavefront progressing along the minor diagonal. We evaluated our alignment vectorization and parallelization on different processors, including the newest Intel® Xeon® (Skylake) and Intel® Xeon Phi™ (KNL) processors, and use cases. The instruction set AVX512-BW (Byte and Word), available on Skylake processors, can genuinely improve the performance of vectorized alignments. We could run single alignments 1600 times faster on the Xeon Phi™ and 1400 times faster on the Xeon® than executing them with our previous sequential alignment module. The module is programmed in C++ using the SeqAn (Reinert et al., 2017) library and distributed with version 2.4. under the BSD license. We support SSE4, AVX2, AVX512 instructions and included UME::SIMD, a SIMD-instruction wrapper library, to extend our module for further instruction sets. We thoroughly test all alignment components with all major C++ compilers on various platforms. rene.rahn@fu-berlin.de.

European Scientific Notes. Volume 35, Number 12,

DTIC Science & Technology

1981-12-31

been redesigned to work A. Osorio, which was organized some 3 with the Intel 8085 microprocessor, it years ago and contains about half of the has the...operational set. attempt to derive a set of invariants MOISE is based on the Intel 8085A upon which virtually speaker-invariant microprocessor, and...FACILITY software interface; a Research Signal Processor (RSP) using reduced computational It has been IBM International’s complexity algorithms for
Semiconductor Ion Implanters

DOE Office of Scientific and Technical Information (OSTI.GOV)

MacKinnon, Barry A.; Ruffell, John P.

In 1953 the Raytheon CK722 transistor was priced at $7.60. Based upon this, an Intel Xeon Quad Core processor containing 820,000,000 transistors should list at $6.2 billion. Particle accelerator technology plays an important part in the remarkable story of why that Intel product can be purchased today for a few hundred dollars. Most people of the mid twentieth century would be astonished at the ubiquity of semiconductors in the products we now buy and use every day. Though relatively expensive in the nineteen fifties they now exist in a wide range of items from high-end multicore microprocessors like the Intelmore » product to disposable items containing 'only' hundreds or thousands like RFID chips and talking greeting cards. This historical development has been fueled by continuous advancement of the several individual technologies involved in the production of semiconductor devices including Ion Implantation and the charged particle beamlines at the heart of implant machines. In the course of its 40 year development, the worldwide implanter industry has reached annual sales levels around $2B, installed thousands of dedicated machines and directly employs thousands of workers. It represents in all these measures, as much and possibly more than any other industrial application of particle accelerator technology. This presentation discusses the history of implanter development. It touches on some of the people involved and on some of the developmental changes and challenges imposed as the requirements of the semiconductor industry evolved.« less
Nonlinear Wave Simulation on the Xeon Phi Knights Landing Processor

NASA Astrophysics Data System (ADS)

Hristov, Ivan; Goranov, Goran; Hristova, Radoslava

2018-02-01

We consider an interesting from computational point of view standing wave simulation by solving coupled 2D perturbed Sine-Gordon equations. We make an OpenMP realization which explores both thread and SIMD levels of parallelism. We test the OpenMP program on two different energy equivalent Intel architectures: 2× Xeon E5-2695 v2 processors, (code-named "Ivy Bridge-EP") in the Hybrilit cluster, and Xeon Phi 7250 processor (code-named "Knights Landing" (KNL). The results show 2 times better performance on KNL processor.
Dynamic overset grid communication on distributed memory parallel processors

NASA Technical Reports Server (NTRS)

Barszcz, Eric; Weeratunga, Sisira K.; Meakin, Robert L.

1993-01-01

A parallel distributed memory implementation of intergrid communication for dynamic overset grids is presented. Included are discussions of various options considered during development. Results are presented comparing an Intel iPSC/860 to a single processor Cray Y-MP. Results for grids in relative motion show the iPSC/860 implementation to be faster than the Cray implementation.
Perfmon2: a leap forward in performance monitoring

NASA Astrophysics Data System (ADS)

Jarp, S.; Jurga, R.; Nowak, A.

2008-07-01

This paper describes the software component, perfmon2, that is about to be added to the Linux kernel as the standard interface to the Performance Monitoring Unit (PMU) on common processors, including x86 (AMD and Intel), Sun SPARC, MIPS, IBM Power and Intel Itanium. It also describes a set of tools for doing performance monitoring in practice and details how the CERN openlab team has participated in the testing and development of these tools.
Global synchronization algorithms for the Intel iPSC/860

NASA Technical Reports Server (NTRS)

Seidel, Steven R.; Davis, Mark A.

1992-01-01

In a distributed memory multicomputer that has no global clock, global processor synchronization can only be achieved through software. Global synchronization algorithms are used in tridiagonal systems solvers, CFD codes, sequence comparison algorithms, and sorting algorithms. They are also useful for event simulation, debugging, and for solving mutual exclusion problems. For the Intel iPSC/860 in particular, global synchronization can be used to ensure the most effective use of the communication network for operations such as the shift, where each processor in a one-dimensional array or ring concurrently sends a message to its right (or left) neighbor. Three global synchronization algorithms are considered for the iPSC/860: the gysnc() primitive provided by Intel, the PICL primitive sync0(), and a new recursive doubling synchronization (RDS) algorithm. The performance of these algorithms is compared to the performance predicted by communication models of both the long and forced message protocols. Measurements of the cost of shift operations preceded by global synchronization show that the RDS algorithm always synchronizes the nodes more precisely and costs only slightly more than the other two algorithms.
Preconditioned implicit solvers for the Navier-Stokes equations on distributed-memory machines

NASA Technical Reports Server (NTRS)

Ajmani, Kumud; Liou, Meng-Sing; Dyson, Rodger W.

1994-01-01

The GMRES method is parallelized, and combined with local preconditioning to construct an implicit parallel solver to obtain steady-state solutions for the Navier-Stokes equations of fluid flow on distributed-memory machines. The new implicit parallel solver is designed to preserve the convergence rate of the equivalent 'serial' solver. A static domain-decomposition is used to partition the computational domain amongst the available processing nodes of the parallel machine. The SPMD (Single-Program Multiple-Data) programming model is combined with message-passing tools to develop the parallel code on a 32-node Intel Hypercube and a 512-node Intel Delta machine. The implicit parallel solver is validated for internal and external flow problems, and is found to compare identically with flow solutions obtained on a Cray Y-MP/8. A peak computational speed of 2300 MFlops/sec has been achieved on 512 nodes of the Intel Delta machine,k for a problem size of 1024 K equations (256 K grid points).
Compiling global name-space programs for distributed execution

NASA Technical Reports Server (NTRS)

Koelbel, Charles; Mehrotra, Piyush

1990-01-01

Distributed memory machines do not provide hardware support for a global address space. Thus programmers are forced to partition the data across the memories of the architecture and use explicit message passing to communicate data between processors. The compiler support required to allow programmers to express their algorithms using a global name-space is examined. A general method is presented for analysis of a high level source program and along with its translation to a set of independently executing tasks communicating via messages. If the compiler has enough information, this translation can be carried out at compile-time. Otherwise run-time code is generated to implement the required data movement. The analysis required in both situations is described and the performance of the generated code on the Intel iPSC/2 is presented.
HANSF 1.3 Users Manual FAI/98-40-R2 Hanford Spent Nuclear Fuel (SNF) Safety Analysis Model [SEC 1 and 2

DOE Office of Scientific and Technical Information (OSTI.GOV)

DUNCAN, D.R.

The HANSF analysis tool is an integrated model considering phenomena inside a multi-canister overpack (MCO) spent nuclear fuel container such as fuel oxidation, convective and radiative heat transfer, and the potential for fission product release. This manual reflects the HANSF version 1.3.2, a revised version of 1.3.1. HANSF 1.3.2 was written to correct minor errors and to allow modeling of condensate flow on the MCO inner surface. HANSF 1.3.2 is intended for use on personal computers such as IBM-compatible machines with Intel processors running under Lahey TI or digital Visual FORTRAN, Version 6.0, but this does not preclude operation inmore » other environments.« less
Multi-Core Programming Design Patterns: Stream Processing Algorithms for Dynamic Scene Perceptions

DTIC Science & Technology

2014-05-01

processor developed by IBM and other companies , incorpo- rates the verb—POWER5— processor as the Power Processor Element (PPE), one of the early general...deliver an power efficient single-precision peak performance of more than 256 GFlops. Substantially more raw power became available later, when nVIDIA ...algorithms, including IBM’s Cell/B.E., GPUs from NVidia and AMD and many-core CPUs from Intel.27 The vast growth of digital video content has been a
A trick to improve the efficiency of generating unweighted B events from BCVEGPY

NASA Astrophysics Data System (ADS)

Wang, Xian-You; Wu, Xing-Gang

2012-02-01

In the present paper, we provide an addendum to improve the efficiency of generating unweighted events within PYTHIA environment for the generator BCVEGPY2.1 [C.H. Chang, J.X. Wang, X.G. Wu, Comput. Phys. Commun. 174 (2006) 241]. This trick is helpful for experimental simulation. Moreover, the BCVEGPY output has also been improved, i.e. one Les Houches Event common block has been added so as to generate a standard Les Houches Event file that contains the information of the generated B meson and the accompanying partons, which can be more conveniently used for further simulation. New version program summaryTitle of program: BCVEGPY2.1a Catalogue identifier: ADTJ_v2_2 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/ADTJ_v2_2.html Program obtained from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 166 133 No. of bytes in distributed program, including test data, etc.: 1 655 390 Distribution format: tar.gz Programming language used: FORTRAN 77/90 Computer: Any LINUX based on PC with FORTRAN 77 or FORTRAN 90 and GNU C compiler as well Operating systems: LINUX RAM: About 2.0 MB Classification: 11.2, 11.5 Catalogue identifier of previous version: ADTJ_v2_1 Reference in CPC: Comput. Phys. Commun. 175 (2006) 624 Does the new version supersede the old program: No Nature of physical problem: Hadronic Production of B meson and its excited states Method of solution: To generate weighted and unweighted B events within PYTHIA environment effectively. Restrictions on the complexity of the problem: Hadronic production of ( cb¯)-quarkonium via the gluon-gluon fusion mechanism are given by the 'complete calculation approach'. The simulation of B events is done within PYTHIA environment. Reasons for new version: More and more data are accumulated at the large hadronic collider, it would be possible to make precise studies on B meson properties, such as its lifetime, mass spectrum and etc. The BCVEGPY has been adopted by several experimental groups due to its high efficiency in comparison to that of PYTHIA. However, to generate unweighted events with PYTHIA inner mechanism as programmed by the previous version is still time-consuming. So it would be helpful to improve the efficiency for generating unweighted events within PYTHIA. Moreover, it would be better to use an uniform and standard output format for further detector simulation. Typical running time: Typical running time is machine and user-parameters dependent. I) To generate 10 6 weighted S-wave ( cb¯)-quarkonium events (IDWTUP = 3), it will take about 40 minutes on a 1.8 GHz Intel P4-processor machine. II) To generate unweighted S-wave ( cb¯)-quarkonium events with PYTHIA inner structure (IDWTUP = 1), it will take about 20 hour on a 1.8 GHz Intel P4-processor machine to generate 1000 events. III) To generate 10 6 unweighted S-wave ( cb¯)-quarkonium events with the present trick (IDWTUP = 1), it will take 17 hour on a 3.16 Hz Intel E8500 processor machine. Moreover, it can be found that the running time for the P-wave ( cb¯)-quarkonium production is about two times longer than the case of S-wave production under the same conditions. Keywords: Event generator; Hadronic production; B meson; Unweighted events Summary of revisions: (1) The generator BCVEGPY [1-3] has been programmed to generate B events under PYTHIA environment [4], which has been frequently adopted for theoretical and experimental studies, e.g. Refs. [5-18]. It is found that each experimental group shall have its own simulation software architecture, and the users will spend a lot of time to write an interface so as to implement BCVEGPY into their own software. So it would be better to supply a standard output. The LHE format becomes a standard format [19], which is proposed to store process and event information from the matrix-element-based generators. The users can pass these parton-level information to the general event generators like PYTHIA and HERWIG [20] for further simulation. For such purpose, we add two common blocks in genevent.F. One common block is called as bcvegpy_pyupin and the other one is write_lhe. The bcvegpy_pyupin, which is similar to PYUPIN subroutine in PYTHIA, stores the initialization information in the HEPRUP common block. INTEGER MAXPUP PARAMETER (MAXPUP = 100) INTEGER IDBMUP,PDFGUP,PDFSUP,IDWTUP,NPRUP,LPRUP DOUBLE PRECISION EBMUP,XSECUP,XERRUP,XMAXUP COMMON/HEPRUP/IDBMUP(2),EBMUP(2),PDFGUP(2),PDFSUP(2), &IDWTUP,NPRUP,XSECUP(MAXPUP),XERRUP(MAXPUP), &XMAXUP(MAXPUP),LPRUP(MAXPUP) The write_lhe, which is similar to PYUPEV subroutine in pythia, stores the information of each separate event in the HEPEUP common block. INTEGER MAXNUP PARAMETER (MAXNUP = 500) INTEGER NUP,IDPRUP,IDUP,ISTUP,MOTHUP,ICOLUP DOUBLE PRECISION XWGTUP,SCALUP,AQEDUP,AQCDUP,PUP,VTIMUP, &SPINUP COMMON/HEPEUP/NUP,IDPRUP,XWGTUP,SCALUP,AQEDUP,AQCDUP, &IDUP(MAXNUP),ISTUP(MAXNUP),MOTHUP(2,MAXNUP), &ICOLUP(2,MAXNUP),PUP(5,MAXNUP),VTIMUP(MAXNUP), &SPINUP(MAXNUP)
Vectorization for Molecular Dynamics on Intel Xeon Phi Corpocessors

NASA Astrophysics Data System (ADS)

Yi, Hongsuk

2014-03-01

Many modern processors are capable of exploiting data-level parallelism through the use of single instruction multiple data (SIMD) execution. The new Intel Xeon Phi coprocessor supports 512 bit vector registers for the high performance computing. In this paper, we have developed a hierarchical parallelization scheme for accelerated molecular dynamics simulations with the Terfoff potentials for covalent bond solid crystals on Intel Xeon Phi coprocessor systems. The scheme exploits multi-level parallelism computing. We combine thread-level parallelism using a tightly coupled thread-level and task-level parallelism with 512-bit vector register. The simulation results show that the parallel performance of SIMD implementations on Xeon Phi is apparently superior to their x86 CPU architecture.
Scalability of Parallel Spatial Direct Numerical Simulations on Intel Hypercube and IBM SP1 and SP2

NASA Technical Reports Server (NTRS)

Joslin, Ronald D.; Hanebutte, Ulf R.; Zubair, Mohammad

1995-01-01

The implementation and performance of a parallel spatial direct numerical simulation (PSDNS) approach on the Intel iPSC/860 hypercube and IBM SP1 and SP2 parallel computers is documented. Spatially evolving disturbances associated with the laminar-to-turbulent transition in boundary-layer flows are computed with the PSDNS code. The feasibility of using the PSDNS to perform transition studies on these computers is examined. The results indicate that PSDNS approach can effectively be parallelized on a distributed-memory parallel machine by remapping the distributed data structure during the course of the calculation. Scalability information is provided to estimate computational costs to match the actual costs relative to changes in the number of grid points. By increasing the number of processors, slower than linear speedups are achieved with optimized (machine-dependent library) routines. This slower than linear speedup results because the computational cost is dominated by FFT routine, which yields less than ideal speedups. By using appropriate compile options and optimized library routines on the SP1, the serial code achieves 52-56 M ops on a single node of the SP1 (45 percent of theoretical peak performance). The actual performance of the PSDNS code on the SP1 is evaluated with a "real world" simulation that consists of 1.7 million grid points. One time step of this simulation is calculated on eight nodes of the SP1 in the same time as required by a Cray Y/MP supercomputer. For the same simulation, 32-nodes of the SP1 and SP2 are required to reach the performance of a Cray C-90. A 32 node SP1 (SP2) configuration is 2.9 (4.6) times faster than a Cray Y/MP for this simulation, while the hypercube is roughly 2 times slower than the Y/MP for this application. KEY WORDS: Spatial direct numerical simulations; incompressible viscous flows; spectral methods; finite differences; parallel computing.
Efficient Multicriteria Protein Structure Comparison on Modern Processor Architectures

PubMed Central

Manolakos, Elias S.

2015-01-01

Fast increasing computational demand for all-to-all protein structures comparison (PSC) is a result of three confounding factors: rapidly expanding structural proteomics databases, high computational complexity of pairwise protein comparison algorithms, and the trend in the domain towards using multiple criteria for protein structures comparison (MCPSC) and combining results. We have developed a software framework that exploits many-core and multicore CPUs to implement efficient parallel MCPSC in modern processors based on three popular PSC methods, namely, TMalign, CE, and USM. We evaluate and compare the performance and efficiency of the two parallel MCPSC implementations using Intel's experimental many-core Single-Chip Cloud Computer (SCC) as well as Intel's Core i7 multicore processor. We show that the 48-core SCC is more efficient than the latest generation Core i7, achieving a speedup factor of 42 (efficiency of 0.9), making many-core processors an exciting emerging technology for large-scale structural proteomics. We compare and contrast the performance of the two processors on several datasets and also show that MCPSC outperforms its component methods in grouping related domains, achieving a high F-measure of 0.91 on the benchmark CK34 dataset. The software implementation for protein structure comparison using the three methods and combined MCPSC, along with the developed underlying rckskel algorithmic skeletons library, is available via GitHub. PMID:26605332
Efficient Multicriteria Protein Structure Comparison on Modern Processor Architectures.

PubMed

Sharma, Anuj; Manolakos, Elias S

2015-01-01

Fast increasing computational demand for all-to-all protein structures comparison (PSC) is a result of three confounding factors: rapidly expanding structural proteomics databases, high computational complexity of pairwise protein comparison algorithms, and the trend in the domain towards using multiple criteria for protein structures comparison (MCPSC) and combining results. We have developed a software framework that exploits many-core and multicore CPUs to implement efficient parallel MCPSC in modern processors based on three popular PSC methods, namely, TMalign, CE, and USM. We evaluate and compare the performance and efficiency of the two parallel MCPSC implementations using Intel's experimental many-core Single-Chip Cloud Computer (SCC) as well as Intel's Core i7 multicore processor. We show that the 48-core SCC is more efficient than the latest generation Core i7, achieving a speedup factor of 42 (efficiency of 0.9), making many-core processors an exciting emerging technology for large-scale structural proteomics. We compare and contrast the performance of the two processors on several datasets and also show that MCPSC outperforms its component methods in grouping related domains, achieving a high F-measure of 0.91 on the benchmark CK34 dataset. The software implementation for protein structure comparison using the three methods and combined MCPSC, along with the developed underlying rckskel algorithmic skeletons library, is available via GitHub.
Heterogeneous High Throughput Scientific Computing with APM X-Gene and Intel Xeon Phi

NASA Astrophysics Data System (ADS)

Abdurachmanov, David; Bockelman, Brian; Elmer, Peter; Eulisse, Giulio; Knight, Robert; Muzaffar, Shahzad

2015-05-01

Electrical power requirements will be a constraint on the future growth of Distributed High Throughput Computing (DHTC) as used by High Energy Physics. Performance-per-watt is a critical metric for the evaluation of computer architectures for cost- efficient computing. Additionally, future performance growth will come from heterogeneous, many-core, and high computing density platforms with specialized processors. In this paper, we examine the Intel Xeon Phi Many Integrated Cores (MIC) co-processor and Applied Micro X-Gene ARMv8 64-bit low-power server system-on-a-chip (SoC) solutions for scientific computing applications. We report our experience on software porting, performance and energy efficiency and evaluate the potential for use of such technologies in the context of distributed computing systems such as the Worldwide LHC Computing Grid (WLCG).
An Analysis of Hardware-Assisted Virtual Machine Based Rootkits

DTIC Science & Technology

2014-06-01

certain aspects of TPM implementation just to name a few. HyperWall is an architecture proposed by Szefer and Lee to protect guest VMs from...DISTRIBUTION CODE 13. ABSTRACT (maximum 200 words) The use of virtual machine (VM) technology has expanded rapidly since AMD and Intel implemented ...Intel VT-x implementations of Blue Pill to identify commonalities in the respective versions’ attack methodologies from both a functional and technical
Simulating Hydrologic Flow and Reactive Transport with PFLOTRAN and PETSc on Emerging Fine-Grained Parallel Computer Architectures

NASA Astrophysics Data System (ADS)

Mills, R. T.; Rupp, K.; Smith, B. F.; Brown, J.; Knepley, M.; Zhang, H.; Adams, M.; Hammond, G. E.

2017-12-01

As the high-performance computing community pushes towards the exascale horizon, power and heat considerations have driven the increasing importance and prevalence of fine-grained parallelism in new computer architectures. High-performance computing centers have become increasingly reliant on GPGPU accelerators and "manycore" processors such as the Intel Xeon Phi line, and 512-bit SIMD registers have even been introduced in the latest generation of Intel's mainstream Xeon server processors. The high degree of fine-grained parallelism and more complicated memory hierarchy considerations of such "manycore" processors present several challenges to existing scientific software. Here, we consider how the massively parallel, open-source hydrologic flow and reactive transport code PFLOTRAN - and the underlying Portable, Extensible Toolkit for Scientific Computation (PETSc) library on which it is built - can best take advantage of such architectures. We will discuss some key features of these novel architectures and our code optimizations and algorithmic developments targeted at them, and present experiences drawn from working with a wide range of PFLOTRAN benchmark problems on these architectures.
Accelerating Climate and Weather Simulations through Hybrid Computing

NASA Technical Reports Server (NTRS)

Zhou, Shujia; Cruz, Carlos; Duffy, Daniel; Tucker, Robert; Purcell, Mark

2011-01-01

Unconventional multi- and many-core processors (e.g. IBM (R) Cell B.E.(TM) and NVIDIA (R) GPU) have emerged as effective accelerators in trial climate and weather simulations. Yet these climate and weather models typically run on parallel computers with conventional processors (e.g. Intel, AMD, and IBM) using Message Passing Interface. To address challenges involved in efficiently and easily connecting accelerators to parallel computers, we investigated using IBM's Dynamic Application Virtualization (TM) (IBM DAV) software in a prototype hybrid computing system with representative climate and weather model components. The hybrid system comprises two Intel blades and two IBM QS22 Cell B.E. blades, connected with both InfiniBand(R) (IB) and 1-Gigabit Ethernet. The system significantly accelerates a solar radiation model component by offloading compute-intensive calculations to the Cell blades. Systematic tests show that IBM DAV can seamlessly offload compute-intensive calculations from Intel blades to Cell B.E. blades in a scalable, load-balanced manner. However, noticeable communication overhead was observed, mainly due to IP over the IB protocol. Full utilization of IB Sockets Direct Protocol and the lower latency production version of IBM DAV will reduce this overhead.
An efficient MPI/OpenMP parallelization of the Hartree–Fock–Roothaan method for the first generation of Intel® Xeon Phi™ processor architecture

DOE PAGES

Mironov, Vladimir; Moskovsky, Alexander; D’Mello, Michael; ...

2017-10-04

The Hartree-Fock (HF) method in the quantum chemistry package GAMESS represents one of the most irregular algorithms in computation today. Major steps in the calculation are the irregular computation of electron repulsion integrals (ERIs) and the building of the Fock matrix. These are the central components of the main Self Consistent Field (SCF) loop, the key hotspot in Electronic Structure (ES) codes. By threading the MPI ranks in the official release of the GAMESS code, we not only speed up the main SCF loop (4x to 6x for large systems), but also achieve a significant (>2x) reduction in the overallmore » memory footprint. These improvements are a direct consequence of memory access optimizations within the MPI ranks. We benchmark our implementation against the official release of the GAMESS code on the Intel R Xeon PhiTM supercomputer. Here, scaling numbers are reported on up to 7,680 cores on Intel Xeon Phi coprocessors.« less

Implementation of the DPM Monte Carlo code on a parallel architecture for treatment planning applications.

PubMed

Tyagi, Neelam; Bose, Abhijit; Chetty, Indrin J

2004-09-01

We have parallelized the Dose Planning Method (DPM), a Monte Carlo code optimized for radiotherapy class problems, on distributed-memory processor architectures using the Message Passing Interface (MPI). Parallelization has been investigated on a variety of parallel computing architectures at the University of Michigan-Center for Advanced Computing, with respect to efficiency and speedup as a function of the number of processors. We have integrated the parallel pseudo random number generator from the Scalable Parallel Pseudo-Random Number Generator (SPRNG) library to run with the parallel DPM. The Intel cluster consisting of 800 MHz Intel Pentium III processor shows an almost linear speedup up to 32 processors for simulating 1 x 10(8) or more particles. The speedup results are nearly linear on an Athlon cluster (up to 24 processors based on availability) which consists of 1.8 GHz+ Advanced Micro Devices (AMD) Athlon processors on increasing the problem size up to 8 x 10(8) histories. For a smaller number of histories (1 x 10(8)) the reduction of efficiency with the Athlon cluster (down to 83.9% with 24 processors) occurs because the processing time required to simulate 1 x 10(8) histories is less than the time associated with interprocessor communication. A similar trend was seen with the Opteron Cluster (consisting of 1400 MHz, 64-bit AMD Opteron processors) on increasing the problem size. Because of the 64-bit architecture Opteron processors are capable of storing and processing instructions at a faster rate and hence are faster as compared to the 32-bit Athlon processors. We have validated our implementation with an in-phantom dose calculation study using a parallel pencil monoenergetic electron beam of 20 MeV energy. The phantom consists of layers of water, lung, bone, aluminum, and titanium. The agreement in the central axis depth dose curves and profiles at different depths shows that the serial and parallel codes are equivalent in accuracy.
Federal Plan for High-End Computing. Report of the High-End Computing Revitalization Task Force (HECRTF)

DTIC Science & Technology

2004-07-01

steadily for the past fifteen years, while memory latency and bandwidth have improved much more slowly. For example, Intel processor clock rates38 have... processor and memory performance) all greatly restrict the ability to achieve high levels of performance for science, engineering, and national...sub-nuclear distances. Guide experiments to identify transition from quantum chromodynamics to quark -gluon plasma. Accelerator Physics Accurate
Wilson and Domainwall Kernels on Oakforest-PACS

NASA Astrophysics Data System (ADS)

Kanamori, Issaku; Matsufuru, Hideo

2018-03-01

We report the performance of Wilson and Domainwall Kernels on a new Intel Xeon Phi Knights Landing based machine named Oakforest-PACS, which is co-hosted by University of Tokyo and Tsukuba University and is currently fastest in Japan. This machine uses Intel Omni-Path for the internode network. We compare performance with several types of implementation including that makes use of the Grid library. The code is incorporated with the code set Bridge++.
Intel NX to PVM 3.2 message passing conversion library

NASA Technical Reports Server (NTRS)

Arthur, Trey; Nelson, Michael L.

1993-01-01

NASA Langley Research Center has developed a library that allows Intel NX message passing codes to be executed under the more popular and widely supported Parallel Virtual Machine (PVM) message passing library. PVM was developed at Oak Ridge National Labs and has become the defacto standard for message passing. This library will allow the many programs that were developed on the Intel iPSC/860 or Intel Paragon in a Single Program Multiple Data (SPMD) design to be ported to the numerous architectures that PVM (version 3.2) supports. Also, the library adds global operations capability to PVM. A familiarity with Intel NX and PVM message passing is assumed.
Heterogeneous high throughput scientific computing with APM X-Gene and Intel Xeon Phi

DOE PAGES

Abdurachmanov, David; Bockelman, Brian; Elmer, Peter; ...

2015-05-22

Electrical power requirements will be a constraint on the future growth of Distributed High Throughput Computing (DHTC) as used by High Energy Physics. Performance-per-watt is a critical metric for the evaluation of computer architectures for cost- efficient computing. Additionally, future performance growth will come from heterogeneous, many-core, and high computing density platforms with specialized processors. In this paper, we examine the Intel Xeon Phi Many Integrated Cores (MIC) co-processor and Applied Micro X-Gene ARMv8 64-bit low-power server system-on-a-chip (SoC) solutions for scientific computing applications. As a result, we report our experience on software porting, performance and energy efficiency and evaluatemore » the potential for use of such technologies in the context of distributed computing systems such as the Worldwide LHC Computing Grid (WLCG).« less
Evaluation of the Intel iWarp parallel processor for space flight applications

NASA Technical Reports Server (NTRS)

Hine, Butler P., III; Fong, Terrence W.

1993-01-01

The potential of a DARPA-sponsored advanced processor, the Intel iWarp, for use in future SSF Data Management Systems (DMS) upgrades is evaluated through integration into the Ames DMS testbed and applications testing. The iWarp is a distributed, parallel computing system well suited for high performance computing applications such as matrix operations and image processing. The system architecture is modular, supports systolic and message-based computation, and is capable of providing massive computational power in a low-cost, low-power package. As a consequence, the iWarp offers significant potential for advanced space-based computing. This research seeks to determine the iWarp's suitability as a processing device for space missions. In particular, the project focuses on evaluating the ease of integrating the iWarp into the SSF DMS baseline architecture and the iWarp's ability to support computationally stressing applications representative of SSF tasks.
Analysis of the Intel 386 and i486 microprocessors for the Space Station Freedom Data Management System

NASA Technical Reports Server (NTRS)

Liu, Yuan-Kwei

1991-01-01

The feasibility is analyzed of upgrading the Intel 386 microprocessor, which has been proposed as the baseline processor for the Space Station Freedom (SSF) Data Management System (DMS), to the more advanced i486 microprocessors. The items compared between the two processors include the instruction set architecture, power consumption, the MIL-STD-883C Class S (Space) qualification schedule, and performance. The advantages of the i486 over the 386 are (1) lower power consumption; and (2) higher floating point performance. The i486 on-chip cache does not have parity check or error detection and correction circuitry. The i486 with on-chip cache disabled, however, has lower integer performance than the 386 without cache, which is the current DMS design choice. Adding cache to the 386/386 DX memory hierachy appears to be the most beneficial change to the current DMS design at this time.
Analysis of the Intel 386 and i486 microprocessors for the Space Station Freedom Data Management System

NASA Technical Reports Server (NTRS)

Liu, Yuan-Kwei

1991-01-01

The feasibility is analyzed of upgrading the Intel 386 microprocessor, which has been proposed as the baseline processor for the Space Station Freedom (SSF) Data Management System (DMS), to the more advanced i486 microprocessors. The items compared between the two processors include the instruction set architecture, power consumption, the MIL-STD-883C Class S (Space) qualification schedule, and performance. The advantages of the i486 over the 386 are (1) lower power consumption; and (2) higher floating point performance. The i486 on-chip cache does not have parity check or error detection and correction circuitry. The i486 with on-chip cache disabled, however, has lower integer performance than the 386 without cache, which is the current DMS design choice. Adding cache to the 386/387 DX memory hierarchy appears to be the most beneficial change to the current DMS design at this time.
Measurements of the LHCb software stack on the ARM architecture

NASA Astrophysics Data System (ADS)

Vijay Kartik, S.; Couturier, Ben; Clemencic, Marco; Neufeld, Niko

2014-06-01

The ARM architecture is a power-efficient design that is used in most processors in mobile devices all around the world today since they provide reasonable compute performance per watt. The current LHCb software stack is designed (and thus expected) to build and run on machines with the x86/x86_64 architecture. This paper outlines the process of measuring the performance of the LHCb software stack on the ARM architecture - specifically, the ARMv7 architecture on Cortex-A9 processors from NVIDIA and on full-fledged ARM servers with chipsets from Calxeda - and makes comparisons with the performance on x86_64 architectures on the Intel Xeon L5520/X5650 and AMD Opteron 6272. The paper emphasises the aspects of performance per core with respect to the power drawn by the compute nodes for the given performance - this ensures a fair real-world comparison with much more 'powerful' Intel/AMD processors. The comparisons of these real workloads in the context of LHCb are also complemented with the standard synthetic benchmarks HEPSPEC and Coremark. The pitfalls and solutions for the non-trivial task of porting the source code to build for the ARMv7 instruction set are presented. The specific changes in the build process needed for ARM-specific portions of the software stack are described, to serve as pointers for further attempts taken up by other groups in this direction. Cases where architecture-specific tweaks at the assembler lever (both in ROOT and the LHCb software stack) were needed for a successful compile are detailed - these cases are good indicators of where/how the software stack as well as the build system can be made more portable and multi-arch friendly. The experience gained from the tasks described in this paper are intended to i) assist in making an informed choice about ARM-based server solutions as a feasible low-power alternative to the current compute nodes, and ii) revisit the software design and build system for portability and generic improvements.
Parallel algorithms for boundary value problems

NASA Technical Reports Server (NTRS)

Lin, Avi

1990-01-01

A general approach to solve boundary value problems numerically in a parallel environment is discussed. The basic algorithm consists of two steps: the local step where all the P available processors work in parallel, and the global step where one processor solves a tridiagonal linear system of the order P. The main advantages of this approach are two fold. First, this suggested approach is very flexible, especially in the local step and thus the algorithm can be used with any number of processors and with any of the SIMD or MIMD machines. Secondly, the communication complexity is very small and thus can be used as easily with shared memory machines. Several examples for using this strategy are discussed.
Preliminary Radiation Testing of a State-of-the-Art Commercial 14nm CMOS Processor - System-on-a-Chip

NASA Technical Reports Server (NTRS)

Szabo, Carl M., Jr.; Duncan, Adam; LaBel, Kenneth A.; Kay, Matt; Bruner, Pat; Krzesniak, Mike; Dong, Lei

2015-01-01

Hardness assurance test results of Intel state-of-the-art 14nm Broadwell U-series processor System-on-a-Chip (SoC) for total dose are presented, along with first-look exploratory results from trials at a medical proton facility. Test method builds upon previous efforts by utilizing commercial laptop motherboards and software stress applications as opposed to more traditional automated test equipment (ATE).
An efficient implementation of semi-numerical computation of the Hartree-Fock exchange on the Intel Phi processor

NASA Astrophysics Data System (ADS)

Liu, Fenglai; Kong, Jing

2018-07-01

Unique technical challenges and their solutions for implementing semi-numerical Hartree-Fock exchange on the Phil Processor are discussed, especially concerning the single- instruction-multiple-data type of processing and small cache size. Benchmark calculations on a series of buckyball molecules with various Gaussian basis sets on a Phi processor and a six-core CPU show that the Phi processor provides as much as 12 times of speedup with large basis sets compared with the conventional four-center electron repulsion integration approach performed on the CPU. The accuracy of the semi-numerical scheme is also evaluated and found to be comparable to that of the resolution-of-identity approach.
Ada Compiler Validation Summary Report: Certificate Number: 940325S1. 11352 DDC-I DACS Sun SPARC/Solaries to Pentium PM Bare Ada Cross Compiler System, Version 4.6.4 Sun SPARCclassic = Intel Pentium (Operated as Bare Machine) Based in Xpress Desktop (Intel Product Number: XBASE6E4F-B)

DTIC Science & Technology

1994-03-25

Technology Building 225, Room A266 Gait•--eburg, Maryland 20899 U.S.A. Ada Von Ogan~ztionAda Jointt Program Office De & Software David R . Basel...Standards and Technology Building 225, Room A266 Gaithersburg, Maryland 20899 U.S.A. azi Ada Joint Program office Directoz’,’Coputer & Softvare David R ...characters, a bar (" r ) is written in the 16th position and the rest of the characters ame not prined. "* The place of the definition, i.e.. a line
Multi-threaded ATLAS simulation on Intel Knights Landing processors

NASA Astrophysics Data System (ADS)

Farrell, Steven; Calafiura, Paolo; Leggett, Charles; Tsulaia, Vakhtang; Dotti, Andrea; ATLAS Collaboration

2017-10-01

The Knights Landing (KNL) release of the Intel Many Integrated Core (MIC) Xeon Phi line of processors is a potential game changer for HEP computing. With 72 cores and deep vector registers, the KNL cards promise significant performance benefits for highly-parallel, compute-heavy applications. Cori, the newest supercomputer at the National Energy Research Scientific Computing Center (NERSC), was delivered to its users in two phases with the first phase online at the end of 2015 and the second phase now online at the end of 2016. Cori Phase 2 is based on the KNL architecture and contains over 9000 compute nodes with 96GB DDR4 memory. ATLAS simulation with the multithreaded Athena Framework (AthenaMT) is a good potential use-case for the KNL architecture and supercomputers like Cori. ATLAS simulation jobs have a high ratio of CPU computation to disk I/O and have been shown to scale well in multi-threading and across many nodes. In this paper we will give an overview of the ATLAS simulation application with details on its multi-threaded design. Then, we will present a performance analysis of the application on KNL devices and compare it to a traditional x86 platform to demonstrate the capabilities of the architecture and evaluate the benefits of utilizing KNL platforms like Cori for ATLAS production.
Quantum Chemical Calculations Using Accelerators: Migrating Matrix Operations to the NVIDIA Kepler GPU and the Intel Xeon Phi.

PubMed

Leang, Sarom S; Rendell, Alistair P; Gordon, Mark S

2014-03-11

Increasingly, modern computer systems comprise a multicore general-purpose processor augmented with a number of special purpose devices or accelerators connected via an external interface such as a PCI bus. The NVIDIA Kepler Graphical Processing Unit (GPU) and the Intel Phi are two examples of such accelerators. Accelerators offer peak performances that can be well above those of the host processor. How to exploit this heterogeneous environment for legacy application codes is not, however, straightforward. This paper considers how matrix operations in typical quantum chemical calculations can be migrated to the GPU and Phi systems. Double precision general matrix multiply operations are endemic in electronic structure calculations, especially methods that include electron correlation, such as density functional theory, second order perturbation theory, and coupled cluster theory. The use of approaches that automatically determine whether to use the host or an accelerator, based on problem size, is explored, with computations that are occurring on the accelerator and/or the host. For data-transfers over PCI-e, the GPU provides the best overall performance for data sizes up to 4096 MB with consistent upload and download rates between 5-5.6 GB/s and 5.4-6.3 GB/s, respectively. The GPU outperforms the Phi for both square and nonsquare matrix multiplications.
Software architecture for time-constrained machine vision applications

NASA Astrophysics Data System (ADS)

Usamentiaga, Rubén; Molleda, Julio; García, Daniel F.; Bulnes, Francisco G.

2013-01-01

Real-time image and video processing applications require skilled architects, and recent trends in the hardware platform make the design and implementation of these applications increasingly complex. Many frameworks and libraries have been proposed or commercialized to simplify the design and tuning of real-time image processing applications. However, they tend to lack flexibility, because they are normally oriented toward particular types of applications, or they impose specific data processing models such as the pipeline. Other issues include large memory footprints, difficulty for reuse, and inefficient execution on multicore processors. We present a novel software architecture for time-constrained machine vision applications that addresses these issues. The architecture is divided into three layers. The platform abstraction layer provides a high-level application programming interface for the rest of the architecture. The messaging layer provides a message-passing interface based on a dynamic publish/subscribe pattern. A topic-based filtering in which messages are published to topics is used to route the messages from the publishers to the subscribers interested in a particular type of message. The application layer provides a repository for reusable application modules designed for machine vision applications. These modules, which include acquisition, visualization, communication, user interface, and data processing, take advantage of the power of well-known libraries such as OpenCV, Intel IPP, or CUDA. Finally, the proposed architecture is applied to a real machine vision application: a jam detector for steel pickling lines.
Preliminary Radiation Testing of a State-of-the-Art Commercial 14nm CMOS Processor/System-on-a-Chip

NASA Technical Reports Server (NTRS)

Szabo, Carl M., Jr.; Duncan, Adam; LaBel, Kenneth A.; Kay, Matt; Bruner, Pat; Krzesniak, Mike; Dong, Lei

2015-01-01

Hardness assurance test results of Intel state-of-the-art 14nm “Broadwell” U-series processor / System-on-a-Chip (SoC) for total ionizing dose (TID) are presented, along with exploratory results from trials at a medical proton facility. Test method builds upon previous efforts [1] by utilizing commercial laptop motherboards and software stress applications as opposed to more traditional automated test equipment (ATE).
The Portals 4.0 network programming interface.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Barrett, Brian W.; Brightwell, Ronald Brian; Pedretti, Kevin

2012-11-01

This report presents a specification for the Portals 4.0 network programming interface. Portals 4.0 is intended to allow scalable, high-performance network communication between nodes of a parallel computing system. Portals 4.0 is well suited to massively parallel processing and embedded systems. Portals 4.0 represents an adaption of the data movement layer developed for massively parallel processing platforms, such as the 4500-node Intel TeraFLOPS machine. Sandias Cplant cluster project motivated the development of Version 3.0, which was later extended to Version 3.3 as part of the Cray Red Storm machine and XT line. Version 4.0 is targeted to the next generationmore » of machines employing advanced network interface architectures that support enhanced offload capabilities.« less
Multi-Kepler GPU vs. multi-Intel MIC for spin systems simulations

NASA Astrophysics Data System (ADS)

Bernaschi, M.; Bisson, M.; Salvadore, F.

2014-10-01

We present and compare the performances of two many-core architectures: the Nvidia Kepler and the Intel MIC both in a single system and in cluster configuration for the simulation of spin systems. As a benchmark we consider the time required to update a single spin of the 3D Heisenberg spin glass model by using the Over-relaxation algorithm. We present data also for a traditional high-end multi-core architecture: the Intel Sandy Bridge. The results show that although on the two Intel architectures it is possible to use basically the same code, the performances of a Intel MIC change dramatically depending on (apparently) minor details. Another issue is that to obtain a reasonable scalability with the Intel Phi coprocessor (Phi is the coprocessor that implements the MIC architecture) in a cluster configuration it is necessary to use the so-called offload mode which reduces the performances of the single system. As to the GPU, the Kepler architecture offers a clear advantage with respect to the previous Fermi architecture maintaining exactly the same source code. Scalability of the multi-GPU implementation remains very good by using the CPU as a communication co-processor of the GPU. All source codes are provided for inspection and for double-checking the results.
System-Level Virtualization for High Performance Computing

DOE Office of Scientific and Technical Information (OSTI.GOV)

Vallee, Geoffroy R; Naughton, III, Thomas J; Engelmann, Christian

2008-01-01

System-level virtualization has been a research topic since the 70's but regained popularity during the past few years because of the availability of efficient solution such as Xen and the implementation of hardware support in commodity processors (e.g. Intel-VT, AMD-V). However, a majority of system-level virtualization projects is guided by the server consolidation market. As a result, current virtualization solutions appear to not be suitable for high performance computing (HPC) which is typically based on large-scale systems. On another hand there is significant interest in exploiting virtual machines (VMs) within HPC for a number of other reasons. By virtualizing themore » machine, one is able to run a variety of operating systems and environments as needed by the applications. Virtualization allows users to isolate workloads, improving security and reliability. It is also possible to support non-native environments and/or legacy operating environments through virtualization. In addition, it is possible to balance work loads, use migration techniques to relocate applications from failing machines, and isolate fault systems for repair. This document presents the challenges for the implementation of a system-level virtualization solution for HPC. It also presents a brief survey of the different approaches and techniques to address these challenges.« less

DOE Office of Scientific and Technical Information (OSTI.GOV)

Barrett, Brian; Brightwell, Ronald B.; Grant, Ryan

This report presents a specification for the Portals 4 networ k programming interface. Portals 4 is intended to allow scalable, high-performance network communication betwee n nodes of a parallel computing system. Portals 4 is well suited to massively parallel processing and embedded syste ms. Portals 4 represents an adaption of the data movement layer developed for massively parallel processing platfor ms, such as the 4500-node Intel TeraFLOPS machine. Sandia's Cplant cluster project motivated the development of Version 3.0, which was later extended to Version 3.3 as part of the Cray Red Storm machine and XT line. Version 4 is tarmore » geted to the next generation of machines employing advanced network interface architectures that support enh anced offload capabilities.« less
Old PCs: Upgrade or Abandon?

ERIC Educational Resources Information Center

Perez, Ernest

1997-01-01

Examines the practical realities of upgrading Intel personal computers in libraries, considering budgets and technical personnel availability. Highlights include adding RAM; putting in faster processor chips, including clock multipliers; new hard disks; CD-ROM speed; motherboards and interface cards; cost limits and economic factors; and…
Optimization of the Brillouin operator on the KNL architecture

NASA Astrophysics Data System (ADS)

Dürr, Stephan

2018-03-01

Experiences with optimizing the matrix-times-vector application of the Brillouin operator on the Intel KNL processor are reported. Without adjustments to the memory layout, performance figures of 360 Gflop/s in single and 270 Gflop/s in double precision are observed. This is with Nc = 3 colors, Nv = 12 right-hand-sides, Nthr = 256 threads, on lattices of size 323 × 64, using exclusively OMP pragmas. Interestingly, the same routine performs quite well on Intel Core i7 architectures, too. Some observations on the much harderWilson fermion matrix-times-vector optimization problem are added.
Performance of VPIC on Sequoia

NASA Astrophysics Data System (ADS)

Nystrom, William

2014-10-01

Sequoia is a major DOE computing resource which is characteristic of future resources in that it has many threads per compute node, 64, and the individual processor cores are simpler and less powerful than cores on previous processors like Intel's Sandy Bridge or AMD's Opteron. An effort is in progress to port VPIC to the Blue Gene Q architecture of Sequoia and evaluate its performance. Results of this work will be presented on single node performance of VPIC as well as multi-node scaling.
OpenMP Performance on the Columbia Supercomputer

NASA Technical Reports Server (NTRS)

Haoqiang, Jin; Hood, Robert

2005-01-01

This presentation discusses Columbia World Class Supercomputer which is one of the world's fastest supercomputers providing 61 TFLOPs (10/20/04). Conceived, designed, built, and deployed in just 120 days. A 20-node supercomputer built on proven 512-processor nodes. The largest SGI system in the world with over 10,000 Intel Itanium 2 processors and provides the largest node size incorporating commodity parts (512) and the largest shared-memory environment (2048) with 88% efficiency tops the scalar systems on the Top500 list.
High-Level Data-Abstraction System

NASA Technical Reports Server (NTRS)

Fishwick, P. A.

1986-01-01

Communication with data-base processor flexible and efficient. High Level Data Abstraction (HILDA) system is three-layer system supporting data-abstraction features of Intel data-base processor (DBP). Purpose of HILDA establishment of flexible method of efficiently communicating with DBP. Power of HILDA lies in its extensibility with regard to syntax and semantic changes. HILDA's high-level query language readily modified. Offers powerful potential to computer sites where DBP attached to DEC VAX-series computer. HILDA system written in Pascal and FORTRAN 77 for interactive execution.
DOE Office of Scientific and Technical Information (OSTI.GOV)

Zhang, Yao; Balaprakash, Prasanna; Meng, Jiayuan

We present Raexplore, a performance modeling framework for architecture exploration. Raexplore enables rapid, automated, and systematic search of architecture design space by combining hardware counter-based performance characterization and analytical performance modeling. We demonstrate Raexplore for two recent manycore processors IBM Blue- Gene/Q compute chip and Intel Xeon Phi, targeting a set of scientific applications. Our framework is able to capture complex interactions between architectural components including instruction pipeline, cache, and memory, and to achieve a 3–22% error for same-architecture and cross-architecture performance predictions. Furthermore, we apply our framework to assess the two processors, and discover and evaluate a list ofmore » architectural scaling options for future processor designs.« less
Experiences and results multitasking a hydrodynamics code on global and local memory machines

DOE Office of Scientific and Technical Information (OSTI.GOV)

Mandell, D.

1987-01-01

A one-dimensional, time-dependent Lagrangian hydrodynamics code using a Godunov solution method has been multitasked for the Cray X-MP/48, the Intel iPSC hypercube, the Alliant FX series and the IBM RP3 computers. Actual multitasking results have been obtained for the Cray, Intel and Alliant computers and simulated results were obtained for the Cray and RP3 machines. The differences in the methods required to multitask on each of the machines is discussed. Results are presented for a sample problem involving a shock wave moving down a channel. Comparisons are made between theoretical speedups, predicted by Amdahl's law, and the actual speedups obtained.more » The problems of debugging on the different machines are also described.« less
Riding the Technology Wave.

ERIC Educational Resources Information Center

Malan, Pierre

This paper presents an overview of information technology development. The first section sets the scene, comparing the first WAN (Wide Area Network) and Intel processor to current technology. The birth of the microcomputer is described in the second section, including historical background on semiconductors, microprocessors, and the microcomputer.…
Digital Circuit Analysis Using an 8080 Processor.

ERIC Educational Resources Information Center

Greco, John; Stern, Kenneth

1983-01-01

Presents the essentials of a program written in Intel 8080 assembly language for the steady state analysis of a combinatorial logic gate circuit. Program features and potential modifications are considered. For example, the program could also be extended to include clocked/unclocked sequential circuits. (JN)
The portals 4.0.1 network programming interface.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Barrett, Brian W.; Brightwell, Ronald Brian; Pedretti, Kevin

2013-04-01

This report presents a specification for the Portals 4.0 network programming interface. Portals 4.0 is intended to allow scalable, high-performance network communication between nodes of a parallel computing system. Portals 4.0 is well suited to massively parallel processing and embedded systems. Portals 4.0 represents an adaption of the data movement layer developed for massively parallel processing platforms, such as the 4500-node Intel TeraFLOPS machine. Sandias Cplant cluster project motivated the development of Version 3.0, which was later extended to Version 3.3 as part of the Cray Red Storm machine and XT line. Version 4.0 is targeted to the next generationmore » of machines employing advanced network interface architectures that support enhanced offload capabilities. 3« less
Parallelized Kalman-Filter-Based Reconstruction of Particle Tracks on Many-Core Processors and GPUs

DOE Office of Scientific and Technical Information (OSTI.GOV)

Cerati, Giuseppe; Elmer, Peter; Krutelyov, Slava

2017-01-01

For over a decade now, physical and energy constraints have limited clock speed improvements in commodity microprocessors. Instead, chipmakers have been pushed into producing lower-power, multi-core processors such as Graphical Processing Units (GPU), ARM CPUs, and Intel MICs. Broad-based efforts from manufacturers and developers have been devoted to making these processors user-friendly enough to perform general computations. However, extracting performance from a larger number of cores, as well as specialized vector or SIMD units, requires special care in algorithm design and code optimization. One of the most computationally challenging problems in high-energy particle experiments is finding and fitting the charged-particlemore » tracks during event reconstruction. This is expected to become by far the dominant problem at the High-Luminosity Large Hadron Collider (HL-LHC), for example. Today the most common track finding methods are those based on the Kalman filter. Experience with Kalman techniques on real tracking detector systems has shown that they are robust and provide high physics performance. This is why they are currently in use at the LHC, both in the trigger and offine. Previously we reported on the significant parallel speedups that resulted from our investigations to adapt Kalman filters to track fitting and track building on Intel Xeon and Xeon Phi. Here, we discuss our progresses toward the understanding of these processors and the new developments to port the Kalman filter to NVIDIA GPUs.« less
Parallelized Kalman-Filter-Based Reconstruction of Particle Tracks on Many-Core Processors and GPUs

NASA Astrophysics Data System (ADS)

Cerati, Giuseppe; Elmer, Peter; Krutelyov, Slava; Lantz, Steven; Lefebvre, Matthieu; Masciovecchio, Mario; McDermott, Kevin; Riley, Daniel; Tadel, Matevž; Wittich, Peter; Würthwein, Frank; Yagil, Avi

2017-08-01

For over a decade now, physical and energy constraints have limited clock speed improvements in commodity microprocessors. Instead, chipmakers have been pushed into producing lower-power, multi-core processors such as Graphical Processing Units (GPU), ARM CPUs, and Intel MICs. Broad-based efforts from manufacturers and developers have been devoted to making these processors user-friendly enough to perform general computations. However, extracting performance from a larger number of cores, as well as specialized vector or SIMD units, requires special care in algorithm design and code optimization. One of the most computationally challenging problems in high-energy particle experiments is finding and fitting the charged-particle tracks during event reconstruction. This is expected to become by far the dominant problem at the High-Luminosity Large Hadron Collider (HL-LHC), for example. Today the most common track finding methods are those based on the Kalman filter. Experience with Kalman techniques on real tracking detector systems has shown that they are robust and provide high physics performance. This is why they are currently in use at the LHC, both in the trigger and offine. Previously we reported on the significant parallel speedups that resulted from our investigations to adapt Kalman filters to track fitting and track building on Intel Xeon and Xeon Phi. Here, we discuss our progresses toward the understanding of these processors and the new developments to port the Kalman filter to NVIDIA GPUs.
Implementation and Optimization of miniGMG - a Compact Geometric Multigrid Benchmark

DOE Office of Scientific and Technical Information (OSTI.GOV)

Williams, Samuel; Kalamkar, Dhiraj; Singh, Amik

2012-12-01

Multigrid methods are widely used to accelerate the convergence of iterative solvers for linear systems used in a number of different application areas. In this report, we describe miniGMG, our compact geometric multigrid benchmark designed to proxy the multigrid solves found in AMR applications. We explore optimization techniques for geometric multigrid on existing and emerging multicore systems including the Opteron-based Cray XE6, Intel Sandy Bridge and Nehalem-based Infiniband clusters, as well as manycore-based architectures including NVIDIA's Fermi and Kepler GPUs and Intel's Knights Corner (KNC) co-processor. This report examines a variety of novel techniques including communication-aggregation, threaded wavefront-based DRAM communication-avoiding,more » dynamic threading decisions, SIMDization, and fusion of operators. We quantify performance through each phase of the V-cycle for both single-node and distributed-memory experiments and provide detailed analysis for each class of optimization. Results show our optimizations yield significant speedups across a variety of subdomain sizes while simultaneously demonstrating the potential of multi- and manycore processors to dramatically accelerate single-node performance. However, our analysis also indicates that improvements in networks and communication will be essential to reap the potential of manycore processors in large-scale multigrid calculations.« less
Using Intel's Knight Landing Processor to Accelerate Global Nested Air Quality Prediction Modeling System (GNAQPMS) Model

NASA Astrophysics Data System (ADS)

Wang, H.; Chen, H.; Chen, X.; Wu, Q.; Wang, Z.

2016-12-01

The Global Nested Air Quality Prediction Modeling System for Hg (GNAQPMS-Hg) is a global chemical transport model coupled Hg transport module to investigate the mercury pollution. In this study, we present our work of transplanting the GNAQPMS model on Intel Xeon Phi processor, Knights Landing (KNL) to accelerate the model. KNL is the second-generation product adopting Many Integrated Core Architecture (MIC) architecture. Compared with the first generation Knight Corner (KNC), KNL has more new hardware features, that it can be used as unique processor as well as coprocessor with other CPU. According to the Vtune tool, the high overhead modules in GNAQPMS model have been addressed, including CBMZ gas chemistry, advection and convection module, and wet deposition module. These high overhead modules were accelerated by optimizing code and using new techniques of KNL. The following optimized measures was done: 1) Changing the pure MPI parallel mode to hybrid parallel mode with MPI and OpenMP; 2.Vectorizing the code to using the 512-bit wide vector computation unit. 3. Reducing unnecessary memory access and calculation. 4. Reducing Thread Local Storage (TLS) for common variables with each OpenMP thread in CBMZ. 5. Changing the way of global communication from files writing and reading to MPI functions. After optimization, the performance of GNAQPMS is greatly increased both on CPU and KNL platform, the single-node test showed that optimized version has 2.6x speedup on two sockets CPU platform and 3.3x speedup on one socket KNL platform compared with the baseline version code, which means the KNL has 1.29x speedup when compared with 2 sockets CPU platform.
Parallel Mutual Information Based Construction of Genome-Scale Networks on the Intel® Xeon Phi™ Coprocessor.

PubMed

Misra, Sanchit; Pamnany, Kiran; Aluru, Srinivas

2015-01-01

Construction of whole-genome networks from large-scale gene expression data is an important problem in systems biology. While several techniques have been developed, most cannot handle network reconstruction at the whole-genome scale, and the few that can, require large clusters. In this paper, we present a solution on the Intel Xeon Phi coprocessor, taking advantage of its multi-level parallelism including many x86-based cores, multiple threads per core, and vector processing units. We also present a solution on the Intel® Xeon® processor. Our solution is based on TINGe, a fast parallel network reconstruction technique that uses mutual information and permutation testing for assessing statistical significance. We demonstrate the first ever inference of a plant whole genome regulatory network on a single chip by constructing a 15,575 gene network of the plant Arabidopsis thaliana from 3,137 microarray experiments in only 22 minutes. In addition, our optimization for parallelizing mutual information computation on the Intel Xeon Phi coprocessor holds out lessons that are applicable to other domains.
Real-time SHVC software decoding with multi-threaded parallel processing

NASA Astrophysics Data System (ADS)

Gudumasu, Srinivas; He, Yuwen; Ye, Yan; He, Yong; Ryu, Eun-Seok; Dong, Jie; Xiu, Xiaoyu

2014-09-01

This paper proposes a parallel decoding framework for scalable HEVC (SHVC). Various optimization technologies are implemented on the basis of SHVC reference software SHM-2.0 to achieve real-time decoding speed for the two layer spatial scalability configuration. SHVC decoder complexity is analyzed with profiling information. The decoding process at each layer and the up-sampling process are designed in parallel and scheduled by a high level application task manager. Within each layer, multi-threaded decoding is applied to accelerate the layer decoding speed. Entropy decoding, reconstruction, and in-loop processing are pipeline designed with multiple threads based on groups of coding tree units (CTU). A group of CTUs is treated as a processing unit in each pipeline stage to achieve a better trade-off between parallelism and synchronization. Motion compensation, inverse quantization, and inverse transform modules are further optimized with SSE4 SIMD instructions. Simulations on a desktop with an Intel i7 processor 2600 running at 3.4 GHz show that the parallel SHVC software decoder is able to decode 1080p spatial 2x at up to 60 fps (frames per second) and 1080p spatial 1.5x at up to 50 fps for those bitstreams generated with SHVC common test conditions in the JCT-VC standardization group. The decoding performance at various bitrates with different optimization technologies and different numbers of threads are compared in terms of decoding speed and resource usage, including processor and memory.
Parallel volume ray-casting for unstructured-grid data on distributed-memory architectures

NASA Technical Reports Server (NTRS)

Ma, Kwan-Liu

1995-01-01

As computing technology continues to advance, computational modeling of scientific and engineering problems produces data of increasing complexity: large in size and unstructured in shape. Volume visualization of such data is a challenging problem. This paper proposes a distributed parallel solution that makes ray-casting volume rendering of unstructured-grid data practical. Both the data and the rendering process are distributed among processors. At each processor, ray-casting of local data is performed independent of the other processors. The global image composing processes, which require inter-processor communication, are overlapped with the local ray-casting processes to achieve maximum parallel efficiency. This algorithm differs from previous ones in four ways: it is completely distributed, less view-dependent, reasonably scalable, and flexible. Without using dynamic load balancing, test results on the Intel Paragon using from two to 128 processors show, on average, about 60% parallel efficiency.
Quark structure of static correlators in high temperature QCD

NASA Astrophysics Data System (ADS)

Bernard, Claude; DeGrand, Thomas A.; DeTar, Carleton; Gottlieb, Steven; Krasnitz, A.; Ogilvie, Michael C.; Sugar, R. L.; Toussaint, D.

1992-07-01

We present results of numerical simulations of quantum chromodynamics at finite temperature with two flavors of Kogut-Susskind quarks on the Intel iPSC/860 parallel processor. We investigate the properties of the objects whose exchange gives static screening lengths by reconstructing their correlated quark-antiquark structure.
Ghost writer | ASCR Discovery

Science.gov Websites

the one illustrated here, the outer membrane protein OprF of Pseudomonas aeruginosa in its -1990s, NWChem was designed to run on networked processors, as in an HPC system, using one-sided communication, says Jeff Hammond of Intel Corp.'s Parallel Computing Laboratory. In one-sided communication, a

DD-αAMG on QPACE 3

NASA Astrophysics Data System (ADS)

Georg, Peter; Richtmann, Daniel; Wettig, Tilo

2018-03-01

We describe our experience porting the Regensburg implementation of the DD-αAMG solver from QPACE 2 to QPACE 3. We first review how the code was ported from the first generation Intel Xeon Phi processor (Knights Corner) to its successor (Knights Landing). We then describe the modifications in the communication library necessitated by the switch from InfiniBand to Omni-Path. Finally, we present the performance of the code on a single processor as well as the scaling on many nodes, where in both cases the speedup factor is close to the theoretical expectations.
A High Performance Block Eigensolver for Nuclear Configuration Interaction Calculations

DOE PAGES

Aktulga, Hasan Metin; Afibuzzaman, Md.; Williams, Samuel; ...

2017-06-01

As on-node parallelism increases and the performance gap between the processor and the memory system widens, achieving high performance in large-scale scientific applications requires an architecture-aware design of algorithms and solvers. We focus on the eigenvalue problem arising in nuclear Configuration Interaction (CI) calculations, where a few extreme eigenpairs of a sparse symmetric matrix are needed. Here, we consider a block iterative eigensolver whose main computational kernels are the multiplication of a sparse matrix with multiple vectors (SpMM), and tall-skinny matrix operations. We then present techniques to significantly improve the SpMM and the transpose operation SpMM T by using themore » compressed sparse blocks (CSB) format. We achieve 3-4× speedup on the requisite operations over good implementations with the commonly used compressed sparse row (CSR) format. We develop a performance model that allows us to correctly estimate the performance of our SpMM kernel implementations, and we identify cache bandwidth as a potential performance bottleneck beyond DRAM. We also analyze and optimize the performance of LOBPCG kernels (inner product and linear combinations on multiple vectors) and show up to 15× speedup over using high performance BLAS libraries for these operations. The resulting high performance LOBPCG solver achieves 1.4× to 1.8× speedup over the existing Lanczos solver on a series of CI computations on high-end multicore architectures (Intel Xeons). We also analyze the performance of our techniques on an Intel Xeon Phi Knights Corner (KNC) processor.« less
A High Performance Block Eigensolver for Nuclear Configuration Interaction Calculations

DOE Office of Scientific and Technical Information (OSTI.GOV)

Aktulga, Hasan Metin; Afibuzzaman, Md.; Williams, Samuel

As on-node parallelism increases and the performance gap between the processor and the memory system widens, achieving high performance in large-scale scientific applications requires an architecture-aware design of algorithms and solvers. We focus on the eigenvalue problem arising in nuclear Configuration Interaction (CI) calculations, where a few extreme eigenpairs of a sparse symmetric matrix are needed. Here, we consider a block iterative eigensolver whose main computational kernels are the multiplication of a sparse matrix with multiple vectors (SpMM), and tall-skinny matrix operations. We then present techniques to significantly improve the SpMM and the transpose operation SpMM T by using themore » compressed sparse blocks (CSB) format. We achieve 3-4× speedup on the requisite operations over good implementations with the commonly used compressed sparse row (CSR) format. We develop a performance model that allows us to correctly estimate the performance of our SpMM kernel implementations, and we identify cache bandwidth as a potential performance bottleneck beyond DRAM. We also analyze and optimize the performance of LOBPCG kernels (inner product and linear combinations on multiple vectors) and show up to 15× speedup over using high performance BLAS libraries for these operations. The resulting high performance LOBPCG solver achieves 1.4× to 1.8× speedup over the existing Lanczos solver on a series of CI computations on high-end multicore architectures (Intel Xeons). We also analyze the performance of our techniques on an Intel Xeon Phi Knights Corner (KNC) processor.« less
A Simple and Affordable TTL Processor for the Classroom

ERIC Educational Resources Information Center

Feinberg, Dave

2007-01-01

This paper presents a simple 4 bit computer processor design that may be built using TTL chips for less than $65. In addition to describing the processor itself in detail, we discuss our experience using the laboratory kit and its associated machine instruction set to teach computer architecture to high school students. (Contains 3 figures and 5…
Spectral Element Method for the Simulation of Unsteady Compressible Flows

NASA Technical Reports Server (NTRS)

Diosady, Laslo Tibor; Murman, Scott M.

2013-01-01

This work uses a discontinuous-Galerkin spectral-element method (DGSEM) to solve the compressible Navier-Stokes equations [1{3]. The inviscid ux is computed using the approximate Riemann solver of Roe [4]. The viscous fluxes are computed using the second form of Bassi and Rebay (BR2) [5] in a manner consistent with the spectral-element approximation. The method of lines with the classical 4th-order explicit Runge-Kutta scheme is used for time integration. Results for polynomial orders up to p = 15 (16th order) are presented. The code is parallelized using the Message Passing Interface (MPI). The computations presented in this work are performed using the Sandy Bridge nodes of the NASA Pleiades supercomputer at NASA Ames Research Center. Each Sandy Bridge node consists of 2 eight-core Intel Xeon E5-2670 processors with a clock speed of 2.6Ghz and 2GB per core memory. On a Sandy Bridge node the Tau Benchmark [6] runs in a time of 7.6s.
The Berkeley Out-of-Order Machine (BOOM): An Industry-Competitive, Synthesizable, Parameterized RISC-V Processor

DTIC Science & Technology

2015-06-13

The Berkeley Out-of-Order Machine (BOOM): An Industry- Competitive, Synthesizable, Parameterized RISC-V Processor Christopher Celio David A...Synthesizable, Parameterized RISC-V Processor Christopher Celio, David Patterson, and Krste Asanović University of California, Berkeley, California 94720...Order Machine BOOM is a synthesizable, parameterized, superscalar out- of-order RISC-V core designed to serve as the prototypical baseline processor
Using the GeoFEST Faulted Region Simulation System

NASA Technical Reports Server (NTRS)

Parker, Jay W.; Lyzenga, Gregory A.; Donnellan, Andrea; Judd, Michele A.; Norton, Charles D.; Baker, Teresa; Tisdale, Edwin R.; Li, Peggy

2004-01-01

GeoFEST (the Geophysical Finite Element Simulation Tool) simulates stress evolution, fault slip and plastic/elastic processes in realistic materials, and so is suitable for earthquake cycle studies in regions such as Southern California. Many new capabilities and means of access for GeoFEST are now supported. New abilities include MPI-based cluster parallel computing using automatic PYRAMID/Parmetis-based mesh partitioning, automatic mesh generation for layered media with rectangular faults, and results visualization that is integrated with remote sensing data. The parallel GeoFEST application has been successfully run on over a half-dozen computers, including Intel Xeon clusters, Itanium II and Altix machines, and the Apple G5 cluster. It is not separately optimized for different machines, but relies on good domain partitioning for load-balance and low communication, and careful writing of the parallel diagonally preconditioned conjugate gradient solver to keep communication overhead low. Demonstrated thousand-step solutions for over a million finite elements on 64 processors require under three hours, and scaling tests show high efficiency when using more than (order of) 4000 elements per processor. The source code and documentation for GeoFEST is available at no cost from Open Channel Foundation. In addition GeoFEST may be used through a browser-based portal environment available to approved users. That environment includes semi-automated geometry creation and mesh generation tools, GeoFEST, and RIVA-based visualization tools that include the ability to generate a flyover animation showing deformations and topography. Work is in progress to support simulation of a region with several faults using 16 million elements, using a strain energy metric to adapt the mesh to faithfully represent the solution in a region of widely varying strain.
Simulation of Fault Tolerance in a Hypercube Arrangement of Discrete Processors.

DTIC Science & Technology

1987-12-01

Geometric Properties .................... 22 Binary Properties ....................... 26 Intel Hypercube Hardware Arrangement ... 28 IV. Cube-Connected... Properties of the CCC..............35 CCC Redundancy............................... 38 iii 6L V. Re-Configurable Cube-Connected Cycles ....... 40 Global...o........ 74 iv List of Figures Page Figure 1: Hypercubes of Different Dimensions ......... 21 Figure 2: Hypercube Properties
Evaluation of an Adaptive Automation Trigger Based on Task Performance, Priority, and Frequency

DTIC Science & Technology

2013-06-01

with dual Intel ® Xeon ® CPU x5550 processors @ 2.67 GHz each, 12.0 GB RAM, and a 1.5 GB PCIe nVidia Quadro FX 4800 graphics card (Microsoft...Cole Publishing Company . Miller, C. A., & Parasuraman, R. (2007). Designing for flexible interaction between humans and automation: Delegation
Peregrine System | High-Performance Computing | NREL

Science.gov Websites

) and longer-term (/projects) storage. These file systems are mounted on all nodes. Peregrine has three -2670 Xeon processors and 64 GB of memory. In addition to mounting the /home, /nopt, /projects and # cores/node Memory/node Peak (DP) performance per node 88 Intel Xeon E5-2670 "Sandy Bridge" 8
Analysis of EDP performance

NASA Technical Reports Server (NTRS)

1994-01-01

The objective of this contract was the investigation of the potential performance gains that would result from an upgrade of the Space Station Freedom (SSF) Data Management System (DMS) Embedded Data Processor (EDP) '386' design with the Intel Pentium (registered trade-mark of Intel Corp.) '586' microprocessor. The Pentium ('586') is the latest member of the industry standard Intel X86 family of CISC (Complex Instruction Set Computer) microprocessors. This contract was scheduled to run in parallel with an internal IBM Federal Systems Company (FSC) Internal Research and Development (IR&D) task that had the goal to generate a baseline flight design for an upgraded EDP using the Pentium. This final report summarizes the activities performed in support of Contract NAS2-13758. Our plan was to baseline performance analyses and measurements on the latest state-of-the-art commercially available Pentium processor, representative of the proposed space station design, and then phase to an IBM capital funded breadboard version of the flight design (if available from IR&D and Space Station work) for additional evaluation of results. Unfortunately, the phase-over to the flight design breadboard did not take place, since the IBM Data Management System (DMS) for the Space Station Freedom was terminated by NASA before the referenced capital funded EDP breadboard could be completed. The baseline performance analyses and measurements, however, were successfully completed, as planned, on the commercial Pentium hardware. The results of those analyses, evaluations, and measurements are presented in this final report.
DOE Office of Scientific and Technical Information (OSTI.GOV)

Sancho Pitarch, Jose Carlos; Kerbyson, Darren; Lang, Mike

Increasing the core-count on current and future processors is posing critical challenges to the memory subsystem to efficiently handle concurrent memory requests. The current trend to cope with this challenge is to increase the number of memory channels available to the processor's memory controller. In this paper we investigate the effectiveness of this approach on the performance of parallel scientific applications. Specifically, we explore the trade-off between employing multiple memory channels per memory controller and the use of multiple memory controllers. Experiments conducted on two current state-of-the-art multicore processors, a 6-core AMD Istanbul and a 4-core Intel Nehalem-EP, for amore » wide range of production applications shows that there is a diminishing return when increasing the number of memory channels per memory controller. In addition, we show that this performance degradation can be efficiently addressed by increasing the ratio of memory controllers to channels while keeping the number of memory channels constant. Significant performance improvements can be achieved in this scheme, up to 28%, in the case of using two memory controllers with each with one channel compared with one controller with two memory channels.« less
Towards Highly Scalable Ab Initio Molecular Dynamics (AIMD) Simulations on the Intel Knights Landing Manycore Processor

DOE Office of Scientific and Technical Information (OSTI.GOV)

Jacquelin, Mathias; De Jong, Wibe A.; Bylaska, Eric J.

2017-07-03

The Ab Initio Molecular Dynamics (AIMD) method allows scientists to treat the dynamics of molecular and condensed phase systems while retaining a first-principles-based description of their interactions. This extremely important method has tremendous computational requirements, because the electronic Schr¨odinger equation, approximated using Kohn-Sham Density Functional Theory (DFT), is solved at every time step. With the advent of manycore architectures, application developers have a significant amount of processing power within each compute node that can only be exploited through massive parallelism. A compute intensive application such as AIMD forms a good candidate to leverage this processing power. In this paper, wemore » focus on adding thread level parallelism to the plane wave DFT methodology implemented in NWChem. Through a careful optimization of tall-skinny matrix products, which are at the heart of the Lagrange multiplier and nonlocal pseudopotential kernels, as well as 3D FFTs, our OpenMP implementation delivers excellent strong scaling on the latest Intel Knights Landing (KNL) processor. We assess the efficiency of our Lagrange multiplier kernels by building a Roofline model of the platform, and verify that our implementation is close to the roofline for various problem sizes. Finally, we present strong scaling results on the complete AIMD simulation for a 64 water molecules test case, that scales up to all 68 cores of the Knights Landing processor.« less
A sweep algorithm for massively parallel simulation of circuit-switched networks

NASA Technical Reports Server (NTRS)

Gaujal, Bruno; Greenberg, Albert G.; Nicol, David M.

1992-01-01

A new massively parallel algorithm is presented for simulating large asymmetric circuit-switched networks, controlled by a randomized-routing policy that includes trunk-reservation. A single instruction multiple data (SIMD) implementation is described, and corresponding experiments on a 16384 processor MasPar parallel computer are reported. A multiple instruction multiple data (MIMD) implementation is also described, and corresponding experiments on an Intel IPSC/860 parallel computer, using 16 processors, are reported. By exploiting parallelism, our algorithm increases the possible execution rate of such complex simulations by as much as an order of magnitude.
A high-speed digital signal processor for atmospheric radar, part 7.3A

NASA Technical Reports Server (NTRS)

Brosnahan, J. W.; Woodard, D. M.

1984-01-01

The Model SP-320 device is a monolithic realization of a complex general purpose signal processor, incorporating such features as a 32-bit ALU, a 16-bit x 16-bit combinatorial multiplier, and a 16-bit barrel shifter. The SP-320 is designed to operate as a slave processor to a host general purpose computer in applications such as coherent integration of a radar return signal in multiple ranges, or dedicated FFT processing. Presently available is an I/O module conforming to the Intel Multichannel interface standard; other I/O modules will be designed to meet specific user requirements. The main processor board includes input and output FIFO (First In First Out) memories, both with depths of 4096 W, to permit asynchronous operation between the source of data and the host computer. This design permits burst data rates in excess of 5 MW/s.
Parallelization of a Monte Carlo particle transport simulation code

NASA Astrophysics Data System (ADS)

Hadjidoukas, P.; Bousis, C.; Emfietzoglou, D.

2010-05-01

We have developed a high performance version of the Monte Carlo particle transport simulation code MC4. The original application code, developed in Visual Basic for Applications (VBA) for Microsoft Excel, was first rewritten in the C programming language for improving code portability. Several pseudo-random number generators have been also integrated and studied. The new MC4 version was then parallelized for shared and distributed-memory multiprocessor systems using the Message Passing Interface. Two parallel pseudo-random number generator libraries (SPRNG and DCMT) have been seamlessly integrated. The performance speedup of parallel MC4 has been studied on a variety of parallel computing architectures including an Intel Xeon server with 4 dual-core processors, a Sun cluster consisting of 16 nodes of 2 dual-core AMD Opteron processors and a 200 dual-processor HP cluster. For large problem size, which is limited only by the physical memory of the multiprocessor server, the speedup results are almost linear on all systems. We have validated the parallel implementation against the serial VBA and C implementations using the same random number generator. Our experimental results on the transport and energy loss of electrons in a water medium show that the serial and parallel codes are equivalent in accuracy. The present improvements allow for studying of higher particle energies with the use of more accurate physical models, and improve statistics as more particles tracks can be simulated in low response time.
Northeast Parallel Architectures Center (NPAC)

DTIC Science & Technology

1992-07-01

Computational Techniques: Mapping receptor units to processors , using NEWS communication to model interaction in the inhibitory field Goal of the Research...algorithms for classical problems to take advantage of multiple processors . Experiments in probability that have been too time consuming on serial...machine and achieved speedups of 4 to 5 times with 11 processors . It is believed that a slightly better speedup is achievable. In the case of stuck
ISFET-based sensor signal processor chip design for environment monitoring applications

NASA Astrophysics Data System (ADS)

Chung, Wen-Yaw; Yang, Chung-Huang; Wang, Ming-Ga

2004-12-01

In recent years Ion-Sensitive Field Effect Transistor (ISFET) based transducers create valuable applications in physiological data acquisition and environment monitoring. This paper presents a mixed-mode ASIC design for potentiometric ISFET-based bio-chemical sensor applications including H+ sensing and hand-held pH meter. For battery power consideration, the proposed system consists of low voltage (3V) analog front-end readout circuits and digital processor has been developed and fabricated in a 0.5mm double-poly double-metal CMOS technology. To assure that the correct pH value can be measured, the two-point calibration circuitry based on the response of standard pH4 and pH7 buffer solution has been implemented by using algorithmic state machine hardware algorithms. The measurement accuracy of the chip is 10 bits and the measured range between pH 2 to pH 12 compared to ideal values is within the accuracy of 0.1pH. For homeland environmental applications, the system provide rapid, easy to use, and cost-effective on-site testing on the quality of water, such as drinking water, ground water and river water. The processor has a potential usage in battery-operated and portable devices in environmental monitoring applications compared to commercial hand-held pH meter.
Three-Dimensional High-Lift Analysis Using a Parallel Unstructured Multigrid Solver

NASA Technical Reports Server (NTRS)

Mavriplis, Dimitri J.

1998-01-01

A directional implicit unstructured agglomeration multigrid solver is ported to shared and distributed memory massively parallel machines using the explicit domain-decomposition and message-passing approach. Because the algorithm operates on local implicit lines in the unstructured mesh, special care is required in partitioning the problem for parallel computing. A weighted partitioning strategy is described which avoids breaking the implicit lines across processor boundaries, while incurring minimal additional communication overhead. Good scalability is demonstrated on a 128 processor SGI Origin 2000 machine and on a 512 processor CRAY T3E machine for reasonably fine grids. The feasibility of performing large-scale unstructured grid calculations with the parallel multigrid algorithm is demonstrated by computing the flow over a partial-span flap wing high-lift geometry on a highly resolved grid of 13.5 million points in approximately 4 hours of wall clock time on the CRAY T3E.
Parallel Algorithms for Computer Vision

DTIC Science & Technology

1990-04-01

NA86-1, Thinking Machines Corporation, Cambridge, MA, December 1986. [43] J. Little, G. Blelloch, and T. Cass. How to program the connection machine for... to program the connection machine for computer vision. In Proc. Workshop on Comp. Architecture for Pattern Analysis and Machine Intell., 1987. [92] J...In Proceedings of SPIE Conf. on Advances in Intelligent Robotics Systems, Bellingham, VA, 1987. SPIE. [91] J. Little, G. Blelloch, and T. Cass. How

Computational algorithms for simulations in atmospheric optics.

PubMed

Konyaev, P A; Lukin, V P

2016-04-20

A computer simulation technique for atmospheric and adaptive optics based on parallel programing is discussed. A parallel propagation algorithm is designed and a modified spectral-phase method for computer generation of 2D time-variant random fields is developed. Temporal power spectra of Laguerre-Gaussian beam fluctuations are considered as an example to illustrate the applications discussed. Implementation of the proposed algorithms using Intel MKL and IPP libraries and NVIDIA CUDA technology is shown to be very fast and accurate. The hardware system for the computer simulation is an off-the-shelf desktop with an Intel Core i7-4790K CPU operating at a turbo-speed frequency up to 5 GHz and an NVIDIA GeForce GTX-960 graphics accelerator with 1024 1.5 GHz processors.
Coding for parallel execution of hardware-in-the-loop millimeter-wave scene generation models on multicore SIMD processor architectures

NASA Astrophysics Data System (ADS)

Olson, Richard F.

2013-05-01

Rendering of point scatterer based radar scenes for millimeter wave (mmW) seeker tests in real-time hardware-in-the-loop (HWIL) scene generation requires efficient algorithms and vector-friendly computer architectures for complex signal synthesis. New processor technology from Intel implements an extended 256-bit vector SIMD instruction set (AVX, AVX2) in a multi-core CPU design providing peak execution rates of hundreds of GigaFLOPS (GFLOPS) on one chip. Real world mmW scene generation code can approach peak SIMD execution rates only after careful algorithm and source code design. An effective software design will maintain high computing intensity emphasizing register-to-register SIMD arithmetic operations over data movement between CPU caches or off-chip memories. Engineers at the U.S. Army Aviation and Missile Research, Development and Engineering Center (AMRDEC) applied two basic parallel coding methods to assess new 256-bit SIMD multi-core architectures for mmW scene generation in HWIL. These include use of POSIX threads built on vector library functions and more portable, highlevel parallel code based on compiler technology (e.g. OpenMP pragmas and SIMD autovectorization). Since CPU technology is rapidly advancing toward high processor core counts and TeraFLOPS peak SIMD execution rates, it is imperative that coding methods be identified which produce efficient and maintainable parallel code. This paper describes the algorithms used in point scatterer target model rendering, the parallelization of those algorithms, and the execution performance achieved on an AVX multi-core machine using the two basic parallel coding methods. The paper concludes with estimates for scale-up performance on upcoming multi-core technology.
Modern multicore and manycore architectures: Modelling, optimisation and benchmarking a multiblock CFD code

NASA Astrophysics Data System (ADS)

Hadade, Ioan; di Mare, Luca

2016-08-01

Modern multicore and manycore processors exhibit multiple levels of parallelism through a wide range of architectural features such as SIMD for data parallel execution or threads for core parallelism. The exploitation of multi-level parallelism is therefore crucial for achieving superior performance on current and future processors. This paper presents the performance tuning of a multiblock CFD solver on Intel SandyBridge and Haswell multicore CPUs and the Intel Xeon Phi Knights Corner coprocessor. Code optimisations have been applied on two computational kernels exhibiting different computational patterns: the update of flow variables and the evaluation of the Roe numerical fluxes. We discuss at great length the code transformations required for achieving efficient SIMD computations for both kernels across the selected devices including SIMD shuffles and transpositions for flux stencil computations and global memory transformations. Core parallelism is expressed through threading based on a number of domain decomposition techniques together with optimisations pertaining to alleviating NUMA effects found in multi-socket compute nodes. Results are correlated with the Roofline performance model in order to assert their efficiency for each distinct architecture. We report significant speedups for single thread execution across both kernels: 2-5X on the multicore CPUs and 14-23X on the Xeon Phi coprocessor. Computations at full node and chip concurrency deliver a factor of three speedup on the multicore processors and up to 24X on the Xeon Phi manycore coprocessor.
Parallel computing of physical maps--a comparative study in SIMD and MIMD parallelism.

PubMed

Bhandarkar, S M; Chirravuri, S; Arnold, J

1996-01-01

Ordering clones from a genomic library into physical maps of whole chromosomes presents a central computational problem in genetics. Chromosome reconstruction via clone ordering is usually isomorphic to the NP-complete Optimal Linear Arrangement problem. Parallel SIMD and MIMD algorithms for simulated annealing based on Markov chain distribution are proposed and applied to the problem of chromosome reconstruction via clone ordering. Perturbation methods and problem-specific annealing heuristics are proposed and described. The SIMD algorithms are implemented on a 2048 processor MasPar MP-2 system which is an SIMD 2-D toroidal mesh architecture whereas the MIMD algorithms are implemented on an 8 processor Intel iPSC/860 which is an MIMD hypercube architecture. A comparative analysis of the various SIMD and MIMD algorithms is presented in which the convergence, speedup, and scalability characteristics of the various algorithms are analyzed and discussed. On a fine-grained, massively parallel SIMD architecture with a low synchronization overhead such as the MasPar MP-2, a parallel simulated annealing algorithm based on multiple periodically interacting searches performs the best. For a coarse-grained MIMD architecture with high synchronization overhead such as the Intel iPSC/860, a parallel simulated annealing algorithm based on multiple independent searches yields the best results. In either case, distribution of clonal data across multiple processors is shown to exacerbate the tendency of the parallel simulated annealing algorithm to get trapped in a local optimum.
Equation solvers for distributed-memory computers

NASA Technical Reports Server (NTRS)

Storaasli, Olaf O.

1994-01-01

A large number of scientific and engineering problems require the rapid solution of large systems of simultaneous equations. The performance of parallel computers in this area now dwarfs traditional vector computers by nearly an order of magnitude. This talk describes the major issues involved in parallel equation solvers with particular emphasis on the Intel Paragon, IBM SP-1 and SP-2 processors.
A High Performance Computing Framework for Physics-based Modeling and Simulation of Military Ground Vehicles

DTIC Science & Technology

2011-03-25

number one and Nebulae at number three. Both systems rely on GPU co-processing and use Intel Xeon processors cards and NVIDIA Tesla C2050 GPUs. In...spite of a theoretical peak capability of almost 3 Petaflop/s, Nebulae clocked at 1.271 PFlop/s when running the Linpack benchmark, which puts it
JPRS Report Science & Technology Europe.

DTIC Science & Technology

1992-09-17

9 Jul 92] 48 HERA Project Gets Green Light for Quark Structure Analysis [DuesseldorfVDI NACHRICHTEN, 12 Jul 92] .... 48 TELECOMMUNICATIONS...communicating with the control station. The demonstrator is the product of research performed at the Robot and Artificial Intel - ligence Unit of...from the microphones, speedometers, or tachometers. Each board is linked to a Motorola DSP [digital signal processor ]. Although the system has been
Scalable Algorithms for Clustering Large Geospatiotemporal Data Sets on Manycore Architectures

NASA Astrophysics Data System (ADS)

Mills, R. T.; Hoffman, F. M.; Kumar, J.; Sreepathi, S.; Sripathi, V.

2016-12-01

The increasing availability of high-resolution geospatiotemporal data sets from sources such as observatory networks, remote sensing platforms, and computational Earth system models has opened new possibilities for knowledge discovery using data sets fused from disparate sources. Traditional algorithms and computing platforms are impractical for the analysis and synthesis of data sets of this size; however, new algorithmic approaches that can effectively utilize the complex memory hierarchies and the extremely high levels of available parallelism in state-of-the-art high-performance computing platforms can enable such analysis. We describe a massively parallel implementation of accelerated k-means clustering and some optimizations to boost computational intensity and utilization of wide SIMD lanes on state-of-the art multi- and manycore processors, including the second-generation Intel Xeon Phi ("Knights Landing") processor based on the Intel Many Integrated Core (MIC) architecture, which includes several new features, including an on-package high-bandwidth memory. We also analyze the code in the context of a few practical applications to the analysis of climatic and remotely-sensed vegetation phenology data sets, and speculate on some of the new applications that such scalable analysis methods may enable.
Ada Compiler Validation Summary Report: Certificate Number: 940325S1. 11348 DDC-I, DACS Sun SPARC/Solaris to 80386 PM Bare Ada Cross Compiler System, Version 4.6.4 Sun SPARCclassic = Intel iSBC 386/116 (Bare Machine)

DTIC Science & Technology

1994-03-25

digits than SYSTEM.MAXDIGITS: C24113L..Y (14 tests) C35705L..Y (14 tests) C35706L..Y (14 tests) C35707L..Y (14 tests) 2-1 C35708L..Y (14 tests) C35802L...MACHINE CODETYPE : REGISTERTYPE MANTISSADOC : 31 A-2 MAX_- DIGITS : 15 MAX_-INT : 9223372036854775807 MAX_-INTPLUS_1 : 9223372036854775808 MIN_ INT...words. A libary task is formed when a task object is declared at the outermost level of a package. Library tasks ame created and activated during the
Parallel protein secondary structure prediction based on neural networks.

PubMed

Zhong, Wei; Altun, Gulsah; Tian, Xinmin; Harrison, Robert; Tai, Phang C; Pan, Yi

2004-01-01

Protein secondary structure prediction has a fundamental influence on today's bioinformatics research. In this work, binary and tertiary classifiers of protein secondary structure prediction are implemented on Denoeux belief neural network (DBNN) architecture. Hydrophobicity matrix, orthogonal matrix, BLOSUM62 and PSSM (position specific scoring matrix) are experimented separately as the encoding schemes for DBNN. The experimental results contribute to the design of new encoding schemes. New binary classifier for Helix versus not Helix ( approximately H) for DBNN produces prediction accuracy of 87% when PSSM is used for the input profile. The performance of DBNN binary classifier is comparable to other best prediction methods. The good test results for binary classifiers open a new approach for protein structure prediction with neural networks. Due to the time consuming task of training the neural networks, Pthread and OpenMP are employed to parallelize DBNN in the hyperthreading enabled Intel architecture. Speedup for 16 Pthreads is 4.9 and speedup for 16 OpenMP threads is 4 in the 4 processors shared memory architecture. Both speedup performance of OpenMP and Pthread is superior to that of other research. With the new parallel training algorithm, thousands of amino acids can be processed in reasonable amount of time. Our research also shows that hyperthreading technology for Intel architecture is efficient for parallel biological algorithms.
Single bus star connected reluctance drive and method

DOEpatents

Fahimi, Babak; Shamsi, Pourya

2016-05-10

A system and methods for operating a switched reluctance machine includes a controller, an inverter connected to the controller and to the switched reluctance machine, a hysteresis control connected to the controller and to the inverter, a set of sensors connected to the switched reluctance machine and to the controller, the switched reluctance machine further including a set of phases the controller further comprising a processor and a memory connected to the processor, wherein the processor programmed to execute a control process and a generation process.
On the relationship between parallel computation and graph embedding

DOE Office of Scientific and Technical Information (OSTI.GOV)

Gupta, A.K.

1989-01-01

The problem of efficiently simulating an algorithm designed for an n-processor parallel machine G on an m-processor parallel machine H with n > m arises when parallel algorithms designed for an ideal size machine are simulated on existing machines which are of a fixed size. The author studies this problem when every processor of H takes over the function of a number of processors in G, and he phrases the simulation problem as a graph embedding problem. New embeddings presented address relevant issues arising from the parallel computation environment. The main focus centers around embedding complete binary trees into smaller-sizedmore » binary trees, butterflies, and hypercubes. He also considers simultaneous embeddings of r source machines into a single hypercube. Constant factors play a crucial role in his embeddings since they are not only important in practice but also lead to interesting theoretical problems. All of his embeddings minimize dilation and load, which are the conventional cost measures in graph embeddings and determine the maximum amount of time required to simulate one step of G on H. His embeddings also optimize a new cost measure called ({alpha},{beta})-utilization which characterizes how evenly the processors of H are used by the processors of G. Ideally, the utilization should be balanced (i.e., every processor of H simulates at most (n/m) processors of G) and the ({alpha},{beta})-utilization measures how far off from a balanced utilization the embedding is. He presents embeddings for the situation when some processors of G have different capabilities (e.g. memory or I/O) than others and the processors with different capabilities are to be distributed uniformly among the processors of H. Placing such conditions on an embedding results in an increase in some of the cost measures.« less
Secure and Efficient Regression Analysis Using a Hybrid Cryptographic Framework: Development and Evaluation

PubMed Central

Jiang, Xiaoqian; Aziz, Md Momin Al; Wang, Shuang; Mohammed, Noman

2018-01-01

Background Machine learning is an effective data-driven tool that is being widely used to extract valuable patterns and insights from data. Specifically, predictive machine learning models are very important in health care for clinical data analysis. The machine learning algorithms that generate predictive models often require pooling data from different sources to discover statistical patterns or correlations among different attributes of the input data. The primary challenge is to fulfill one major objective: preserving the privacy of individuals while discovering knowledge from data. Objective Our objective was to develop a hybrid cryptographic framework for performing regression analysis over distributed data in a secure and efficient way. Methods Existing secure computation schemes are not suitable for processing the large-scale data that are used in cutting-edge machine learning applications. We designed, developed, and evaluated a hybrid cryptographic framework, which can securely perform regression analysis, a fundamental machine learning algorithm using somewhat homomorphic encryption and a newly introduced secure hardware component of Intel Software Guard Extensions (Intel SGX) to ensure both privacy and efficiency at the same time. Results Experimental results demonstrate that our proposed method provides a better trade-off in terms of security and efficiency than solely secure hardware-based methods. Besides, there is no approximation error. Computed model parameters are exactly similar to plaintext results. Conclusions To the best of our knowledge, this kind of secure computation model using a hybrid cryptographic framework, which leverages both somewhat homomorphic encryption and Intel SGX, is not proposed or evaluated to this date. Our proposed framework ensures data security and computational efficiency at the same time. PMID:29506966
Secure and Efficient Regression Analysis Using a Hybrid Cryptographic Framework: Development and Evaluation.

PubMed

Sadat, Md Nazmus; Jiang, Xiaoqian; Aziz, Md Momin Al; Wang, Shuang; Mohammed, Noman

2018-03-05

Machine learning is an effective data-driven tool that is being widely used to extract valuable patterns and insights from data. Specifically, predictive machine learning models are very important in health care for clinical data analysis. The machine learning algorithms that generate predictive models often require pooling data from different sources to discover statistical patterns or correlations among different attributes of the input data. The primary challenge is to fulfill one major objective: preserving the privacy of individuals while discovering knowledge from data. Our objective was to develop a hybrid cryptographic framework for performing regression analysis over distributed data in a secure and efficient way. Existing secure computation schemes are not suitable for processing the large-scale data that are used in cutting-edge machine learning applications. We designed, developed, and evaluated a hybrid cryptographic framework, which can securely perform regression analysis, a fundamental machine learning algorithm using somewhat homomorphic encryption and a newly introduced secure hardware component of Intel Software Guard Extensions (Intel SGX) to ensure both privacy and efficiency at the same time. Experimental results demonstrate that our proposed method provides a better trade-off in terms of security and efficiency than solely secure hardware-based methods. Besides, there is no approximation error. Computed model parameters are exactly similar to plaintext results. To the best of our knowledge, this kind of secure computation model using a hybrid cryptographic framework, which leverages both somewhat homomorphic encryption and Intel SGX, is not proposed or evaluated to this date. Our proposed framework ensures data security and computational efficiency at the same time. ©Md Nazmus Sadat, Xiaoqian Jiang, Md Momin Al Aziz, Shuang Wang, Noman Mohammed. Originally published in JMIR Medical Informatics (http://medinform.jmir.org), 05.03.2018.
Introduction to Parallel Computing

DTIC Science & Technology

1992-05-01

Instruction Stream, Multiple Data Stream Machines .................... 19 2.4 Networks of M achines...independent memory units and connecting them to the processors by an interconnection network . Many different interconnection schemes have been considered, and...connected to the same processor at the same time. Crossbar switching networks are still too expensive to be practical for connecting large numbers of
Impacts of the IBM Cell Processor to Support Climate Models

NASA Technical Reports Server (NTRS)

Zhou, Shujia; Duffy, Daniel; Clune, Tom; Suarez, Max; Williams, Samuel; Halem, Milt

2008-01-01

NASA is interested in the performance and cost benefits for adapting its applications to the IBM Cell processor. However, its 256KB local memory per SPE and the new communication mechanism, make it very challenging to port an application. We selected the solar radiation component of the NASA GEOS-5 climate model, which: (1) is representative of column physics (approximately 50% computational time), (2) has a high computational load relative to transferring data from and to main memory, (3) performs independent calculations across multiple columns. We converted the baseline code (single-precision, Fortran) to C and ported it with manually SIMDizing 4 independent columns and found that a Cell with 8 SPEs can process 2274 columns per second. Compared with the baseline results, the Cell is approximately 5.2X, approximately 8.2X, approximately 15.1X faster than a core on Intel Woodcrest, Dempsey, and Itanium2, respectively. We believe this dramatic performance improvement makes a hybrid cluster with Cell and traditional nodes competitive.
Recent advances in PC-Linux systems for electronic structure computations by optimized compilers and numerical libraries.

PubMed

Yu, Jen-Shiang K; Yu, Chin-Hui

2002-01-01

One of the most frequently used packages for electronic structure research, GAUSSIAN 98, is compiled on Linux systems with various hardware configurations, including AMD Athlon (with the "Thunderbird" core), AthlonMP, and AthlonXP (with the "Palomino" core) systems as well as the Intel Pentium 4 (with the "Willamette" core) machines. The default PGI FORTRAN compiler (pgf77) and the Intel FORTRAN compiler (ifc) are respectively employed with different architectural optimization options to compile GAUSSIAN 98 and test the performance improvement. In addition to the BLAS library included in revision A.11 of this package, the Automatically Tuned Linear Algebra Software (ATLAS) library is linked against the binary executables to improve the performance. Various Hartree-Fock, density-functional theories, and the MP2 calculations are done for benchmarking purposes. It is found that the combination of ifc with ATLAS library gives the best performance for GAUSSIAN 98 on all of these PC-Linux computers, including AMD and Intel CPUs. Even on AMD systems, the Intel FORTRAN compiler invariably produces binaries with better performance than pgf77. The enhancement provided by the ATLAS library is more significant for post-Hartree-Fock calculations. The performance on one single CPU is potentially as good as that on an Alpha 21264A workstation or an SGI supercomputer. The floating-point marks by SpecFP2000 have similar trends to the results of GAUSSIAN 98 package.
Adaptive Command and Control of Theater Air Power

DTIC Science & Technology

1997-06-01

Industries, Citicorp, Coca-Cola, Honda, and Intel corporations practice similar techniques 19 Notes as cited in Thomas Petzinger, Jr., “The Front Lines...before the leap to personal computers and word processors occurred. Finally, anticipation takes place as the stock market adjusts current prices...Leading Marines. January 1995. Fleet Marine Force Manual 1-1. Campaigning. January 1990. Gell-Mann, Murray, The Quark and the Jaguar: Adventures
Hydraulic Universal Display Processor System (HUDPS).

DTIC Science & Technology

1981-11-21

emphasis on smart alphanumeric devices in Task II. Volatile and non-volatile memory components were utilized along with the Intel 8748 microprocessor...system. 1.2 TASK 11 Fault display methods for ground support personnel were investigated during Phase II with emphasis on smart alphanumeric devices...CONSIDERATIONS Methods of display fault indication for ground support personnel have been investigated with emphasis on " smart " alphanumeric devices
A Portable Parallel Implementation of the U.S. Navy Layered Ocean Model

DTIC Science & Technology

1995-01-01

Wallcraft, PhD (I.C. 1981) Planning Systems Inc. & P. R. Moore, PhD (Camb. 1971) IC Dept. Math. DR Moore 1° Encontro de Metodos Numericos...Kendall Square, Hypercube, D R Moore 1 ° Encontro de Metodos Numericos para Equacöes de Derivadas Parciais A. J. Wallcraft IC Mathematics...chips: Chips Machine DEC Alpha CrayT3D/E SUN Sparc Fujitsu AP1000 Intel 860 Paragon D R Moore 1° Encontro de Metodos Numericos para Equacöes

Implementation and Characterization of Three-Dimensional Particle-in-Cell Codes on Multiple-Instruction-Multiple-Data Massively Parallel Supercomputers

NASA Technical Reports Server (NTRS)

Lyster, P. M.; Liewer, P. C.; Decyk, V. K.; Ferraro, R. D.

1995-01-01

A three-dimensional electrostatic particle-in-cell (PIC) plasma simulation code has been developed on coarse-grain distributed-memory massively parallel computers with message passing communications. Our implementation is the generalization to three-dimensions of the general concurrent particle-in-cell (GCPIC) algorithm. In the GCPIC algorithm, the particle computation is divided among the processors using a domain decomposition of the simulation domain. In a three-dimensional simulation, the domain can be partitioned into one-, two-, or three-dimensional subdomains ("slabs," "rods," or "cubes") and we investigate the efficiency of the parallel implementation of the push for all three choices. The present implementation runs on the Intel Touchstone Delta machine at Caltech; a multiple-instruction-multiple-data (MIMD) parallel computer with 512 nodes. We find that the parallel efficiency of the push is very high, with the ratio of communication to computation time in the range 0.3%-10.0%. The highest efficiency (> 99%) occurs for a large, scaled problem with 64(sup 3) particles per processing node (approximately 134 million particles of 512 nodes) which has a push time of about 250 ns per particle per time step. We have also developed expressions for the timing of the code which are a function of both code parameters (number of grid points, particles, etc.) and machine-dependent parameters (effective FLOP rate, and the effective interprocessor bandwidths for the communication of particles and grid points). These expressions can be used to estimate the performance of scaled problems--including those with inhomogeneous plasmas--to other parallel machines once the machine-dependent parameters are known.
Roofline Analysis in the Intel® Advisor to Deliver Optimized Performance for applications on Intel® Xeon Phi™ Processor

DOE Office of Scientific and Technical Information (OSTI.GOV)

Koskela, Tuomas S.; Lobet, Mathieu; Deslippe, Jack

In this session we show, in two case studies, how the roofline feature of Intel Advisor has been utilized to optimize the performance of kernels of the XGC1 and PICSAR codes in preparation for Intel Knights Landing architecture. The impact of the implemented optimizations and the benefits of using the automatic roofline feature of Intel Advisor to study performance of large applications will be presented. This demonstrates an effective optimization strategy that has enabled these science applications to achieve up to 4.6 times speed-up and prepare for future exascale architectures. # Goal/Relevance of Session The roofline model [1,2] is amore » powerful tool for analyzing the performance of applications with respect to the theoretical peak achievable on a given computer architecture. It allows one to graphically represent the performance of an application in terms of operational intensity, i.e. the ratio of flops performed and bytes moved from memory in order to guide optimization efforts. Given the scale and complexity of modern science applications, it can often be a tedious task for the user to perform the analysis on the level of functions or loops to identify where performance gains can be made. With new Intel tools, it is now possible to automate this task, as well as base the estimates of peak performance on measurements rather than vendor specifications. The goal of this session is to demonstrate how the roofline feature of Intel Advisor can be used to balance memory vs. computation related optimization efforts and effectively identify performance bottlenecks. A series of typical optimization techniques: cache blocking, structure refactoring, data alignment, and vectorization illustrated by the kernel cases will be addressed. # Description of the codes ## XGC1 The XGC1 code [3] is a magnetic fusion Particle-In-Cell code that uses an unstructured mesh for its Poisson solver that allows it to accurately resolve the edge plasma of a magnetic fusion device. After recent optimizations to its collision kernel [4], most of the computing time is spent in the electron push (pushe) kernel, where these optimization efforts have been focused. The kernel code scaled well with MPI+OpenMP but had almost no automatic compiler vectorization, in part due to indirect memory addresses and in part due to low trip counts of low-level loops that would be candidates for vectorization. Particle blocking and sorting have been implemented to increase trip counts of low-level loops and improve memory locality, and OpenMP directives have been added to vectorize compute-intensive loops that were identified by Advisor. The optimizations have improved the performance of the pushe kernel 2x on Haswell processors and 1.7x on KNL. The KNL node-for-node performance has been brought to within 30% of a NERSC Cori phase I Haswell node and we expect to bridge this gap by reducing the memory footprint of compute intensive routines to improve cache reuse. ## PICSAR is a Fortran/Python high-performance Particle-In-Cell library targeting at MIC architectures first designed to be coupled with the PIC code WARP for the simulation of laser-matter interaction and particle accelerators. PICSAR also contains a FORTRAN stand-alone kernel for performance studies and benchmarks. A MPI domain decomposition is used between NUMA domains and a tile decomposition (cache-blocking) handled by OpenMP has been added for shared-memory parallelism and better cache management. The so-called current deposition and field gathering steps that compose the PIC time loop constitute major hotspots that have been rewritten to enable more efficient vectorization. Particle communications between tiles and MPI domain has been merged and parallelized. All considered, these improvements provide speedups of 3.1 for order 1 and 4.6 for order 3 interpolation shape factors on KNL configured in SNC4 quadrant flat mode. Performance is similar between a node of cori phase 1 and KNL at order 1 and better on KNL by a factor 1.6 at order 3 with the considered test case (homogeneous thermal plasma).« less
High performance computing environment for multidimensional image analysis

PubMed Central

Rao, A Ravishankar; Cecchi, Guillermo A; Magnasco, Marcelo

2007-01-01

Background The processing of images acquired through microscopy is a challenging task due to the large size of datasets (several gigabytes) and the fast turnaround time required. If the throughput of the image processing stage is significantly increased, it can have a major impact in microscopy applications. Results We present a high performance computing (HPC) solution to this problem. This involves decomposing the spatial 3D image into segments that are assigned to unique processors, and matched to the 3D torus architecture of the IBM Blue Gene/L machine. Communication between segments is restricted to the nearest neighbors. When running on a 2 Ghz Intel CPU, the task of 3D median filtering on a typical 256 megabyte dataset takes two and a half hours, whereas by using 1024 nodes of Blue Gene, this task can be performed in 18.8 seconds, a 478× speedup. Conclusion Our parallel solution dramatically improves the performance of image processing, feature extraction and 3D reconstruction tasks. This increased throughput permits biologists to conduct unprecedented large scale experiments with massive datasets. PMID:17634099
High performance computing environment for multidimensional image analysis.

PubMed

Rao, A Ravishankar; Cecchi, Guillermo A; Magnasco, Marcelo

2007-07-10

The processing of images acquired through microscopy is a challenging task due to the large size of datasets (several gigabytes) and the fast turnaround time required. If the throughput of the image processing stage is significantly increased, it can have a major impact in microscopy applications. We present a high performance computing (HPC) solution to this problem. This involves decomposing the spatial 3D image into segments that are assigned to unique processors, and matched to the 3D torus architecture of the IBM Blue Gene/L machine. Communication between segments is restricted to the nearest neighbors. When running on a 2 Ghz Intel CPU, the task of 3D median filtering on a typical 256 megabyte dataset takes two and a half hours, whereas by using 1024 nodes of Blue Gene, this task can be performed in 18.8 seconds, a 478x speedup. Our parallel solution dramatically improves the performance of image processing, feature extraction and 3D reconstruction tasks. This increased throughput permits biologists to conduct unprecedented large scale experiments with massive datasets.
Demonstration of a full volume 3D pre-stack depth migration in the Garden Banks area using massively parallel processor (MPP) technology

DOE Office of Scientific and Technical Information (OSTI.GOV)

Solano, M.; Chang, H.; VanDyke, J.

1996-12-31

This paper describes the implementation and results of portable, production-scale 3D Pre-stack Kirchhoff depth migration software. Full volume pre-stack imaging was applied to a six million trace (46.9 Gigabyte) data set from a subsalt play in the Garden Banks area in the Gulf of Mexico. The velocity model building and updating, were derived using image depth gathers and an image-driven strategy. After three velocity iterations, depth migrated sections revealed drilling targets that were not visible in the conventional 3D post-stack time migrated data set. As expected from the implementation of the migration algorithm, it was found that amplitudes are wellmore » preserved and anomalies associated with known reservoirs, conform to petrophysical predictions. Image gathers for velocity analysis and the final depth migrated volume, were generated on an 1824 node Intel Paragon at Sandia National Laboratories. The code has been successfully ported to a CRAY (T3D) and Unix workstation Parallel Virtual Machine environments (PVM).« less
Parallelized direct execution simulation of message-passing parallel programs

NASA Technical Reports Server (NTRS)

Dickens, Phillip M.; Heidelberger, Philip; Nicol, David M.

1994-01-01

As massively parallel computers proliferate, there is growing interest in findings ways by which performance of massively parallel codes can be efficiently predicted. This problem arises in diverse contexts such as parallelizing computers, parallel performance monitoring, and parallel algorithm development. In this paper we describe one solution where one directly executes the application code, but uses a discrete-event simulator to model details of the presumed parallel machine such as operating system and communication network behavior. Because this approach is computationally expensive, we are interested in its own parallelization specifically the parallelization of the discrete-event simulator. We describe methods suitable for parallelized direct execution simulation of message-passing parallel programs, and report on the performance of such a system, Large Application Parallel Simulation Environment (LAPSE), we have built on the Intel Paragon. On all codes measured to date, LAPSE predicts performance well typically within 10 percent relative error. Depending on the nature of the application code, we have observed low slowdowns (relative to natively executing code) and high relative speedups using up to 64 processors.
On extending parallelism to serial simulators

NASA Technical Reports Server (NTRS)

Nicol, David; Heidelberger, Philip

1994-01-01

This paper describes an approach to discrete event simulation modeling that appears to be effective for developing portable and efficient parallel execution of models of large distributed systems and communication networks. In this approach, the modeler develops submodels using an existing sequential simulation modeling tool, using the full expressive power of the tool. A set of modeling language extensions permit automatically synchronized communication between submodels; however, the automation requires that any such communication must take a nonzero amount off simulation time. Within this modeling paradigm, a variety of conservative synchronization protocols can transparently support conservative execution of submodels on potentially different processors. A specific implementation of this approach, U.P.S. (Utilitarian Parallel Simulator), is described, along with performance results on the Intel Paragon.
MIC-SVM: Designing A Highly Efficient Support Vector Machine For Advanced Modern Multi-Core and Many-Core Architectures

DOE Office of Scientific and Technical Information (OSTI.GOV)

You, Yang; Song, Shuaiwen; Fu, Haohuan

2014-08-16

Support Vector Machine (SVM) has been widely used in data-mining and Big Data applications as modern commercial databases start to attach an increasing importance to the analytic capabilities. In recent years, SVM was adapted to the field of High Performance Computing for power/performance prediction, auto-tuning, and runtime scheduling. However, even at the risk of losing prediction accuracy due to insufficient runtime information, researchers can only afford to apply offline model training to avoid significant runtime training overhead. To address the challenges above, we designed and implemented MICSVM, a highly efficient parallel SVM for x86 based multi-core and many core architectures,more » such as the Intel Ivy Bridge CPUs and Intel Xeon Phi coprocessor (MIC).« less
Beyond core count: a look at new mainstream computing platforms for HEP workloads

NASA Astrophysics Data System (ADS)

Szostek, P.; Nowak, A.; Bitzes, G.; Valsan, L.; Jarp, S.; Dotti, A.

2014-06-01

As Moore's Law continues to deliver more and more transistors, the mainstream processor industry is preparing to expand its investments in areas other than simple core count. These new interests include deep integration of on-chip components, advanced vector units, memory, cache and interconnect technologies. We examine these moving trends with parallelized and vectorized High Energy Physics workloads in mind. In particular, we report on practical experience resulting from experiments with scalable HEP benchmarks on the Intel "Ivy Bridge-EP" and "Haswell" processor families. In addition, we examine the benefits of the new "Haswell" microarchitecture and its impact on multiple facets of HEP software. Finally, we report on the power efficiency of new systems.
Buffered coscheduling for parallel programming and enhanced fault tolerance

DOEpatents

Petrini, Fabrizio [Los Alamos, NM; Feng, Wu-chun [Los Alamos, NM

2006-01-31

A computer implemented method schedules processor jobs on a network of parallel machine processors or distributed system processors. Control information communications generated by each process performed by each processor during a defined time interval is accumulated in buffers, where adjacent time intervals are separated by strobe intervals for a global exchange of control information. A global exchange of the control information communications at the end of each defined time interval is performed during an intervening strobe interval so that each processor is informed by all of the other processors of the number of incoming jobs to be received by each processor in a subsequent time interval. The buffered coscheduling method of this invention also enhances the fault tolerance of a network of parallel machine processors or distributed system processors
Scheduling for Parallel Supercomputing: A Historical Perspective of Achievable Utilization

NASA Technical Reports Server (NTRS)

Jones, James Patton; Nitzberg, Bill

1999-01-01

The NAS facility has operated parallel supercomputers for the past 11 years, including the Intel iPSC/860, Intel Paragon, Thinking Machines CM-5, IBM SP-2, and Cray Origin 2000. Across this wide variety of machine architectures, across a span of 10 years, across a large number of different users, and through thousands of minor configuration and policy changes, the utilization of these machines shows three general trends: (1) scheduling using a naive FIFO first-fit policy results in 40-60% utilization, (2) switching to the more sophisticated dynamic backfilling scheduling algorithm improves utilization by about 15 percentage points (yielding about 70% utilization), and (3) reducing the maximum allowable job size further increases utilization. Most surprising is the consistency of these trends. Over the lifetime of the NAS parallel systems, we made hundreds, perhaps thousands, of small changes to hardware, software, and policy, yet, utilization was affected little. In particular these results show that the goal of achieving near 100% utilization while supporting a real parallel supercomputing workload is unrealistic.
Dense and Sparse Matrix Operations on the Cell Processor

DOE Office of Scientific and Technical Information (OSTI.GOV)

Williams, Samuel W.; Shalf, John; Oliker, Leonid

2005-05-01

The slowing pace of commodity microprocessor performance improvements combined with ever-increasing chip power demands has become of utmost concern to computational scientists. Therefore, the high performance computing community is examining alternative architectures that address the limitations of modern superscalar designs. In this work, we examine STI's forthcoming Cell processor: a novel, low-power architecture that combines a PowerPC core with eight independent SIMD processing units coupled with a software-controlled memory to offer high FLOP/s/Watt. Since neither Cell hardware nor cycle-accurate simulators are currently publicly available, we develop an analytic framework to predict Cell performance on dense and sparse matrix operations, usingmore » a variety of algorithmic approaches. Results demonstrate Cell's potential to deliver more than an order of magnitude better GFLOP/s per watt performance, when compared with the Intel Itanium2 and Cray X1 processors.« less
Spectral-element simulation of two-dimensional elastic wave propagation in fully heterogeneous media on a GPU cluster

NASA Astrophysics Data System (ADS)

Rudianto, Indra; Sudarmaji

2018-04-01

We present an implementation of the spectral-element method for simulation of two-dimensional elastic wave propagation in fully heterogeneous media. We have incorporated most of realistic geological features in the model, including surface topography, curved layer interfaces, and 2-D wave-speed heterogeneity. To accommodate such complexity, we use an unstructured quadrilateral meshing technique. Simulation was performed on a GPU cluster, which consists of 24 core processors Intel Xeon CPU and 4 NVIDIA Quadro graphics cards using CUDA and MPI implementation. We speed up the computation by a factor of about 5 compared to MPI only, and by a factor of about 40 compared to Serial implementation.
Evaluating Multi-core Architectures through Accelerating the Three-Dimensional Lax–Wendroff Correction

DOE Office of Scientific and Technical Information (OSTI.GOV)

You, Yang; Fu, Haohuan; Song, Shuaiwen

2014-07-18

Wave propagation forward modeling is a widely used computational method in oil and gas exploration. The iterative stencil loops in such problems have broad applications in scientific computing. However, executing such loops can be highly time time-consuming, which greatly limits application’s performance and power efficiency. In this paper, we accelerate the forward modeling technique on the latest multi-core and many-core architectures such as Intel Sandy Bridge CPUs, NVIDIA Fermi C2070 GPU, NVIDIA Kepler K20x GPU, and the Intel Xeon Phi Co-processor. For the GPU platforms, we propose two parallel strategies to explore the performance optimization opportunities for our stencil kernels.more » For Sandy Bridge CPUs and MIC, we also employ various optimization techniques in order to achieve the best.« less
GW Calculations of Materials on the Intel Xeon-Phi Architecture

NASA Astrophysics Data System (ADS)

Deslippe, Jack; da Jornada, Felipe H.; Vigil-Fowler, Derek; Biller, Ariel; Chelikowsky, James R.; Louie, Steven G.

Intel Xeon-Phi processors are expected to power a large number of High-Performance Computing (HPC) systems around the United States and the world in the near future. We evaluate the ability of GW and pre-requisite Density Functional Theory (DFT) calculations for materials on utilizing the Xeon-Phi architecture. We describe the optimization process and performance improvements achieved. We find that the GW method, like other higher level Many-Body methods beyond standard local/semilocal approximations to Kohn-Sham DFT, is particularly well suited for many-core architectures due to the ability to exploit a large amount of parallelism over plane-waves, band-pairs and frequencies. Support provided by the SCIDAC program, Department of Energy, Office of Science, Advanced Scientic Computing Research and Basic Energy Sciences. Grant Numbers DE-SC0008877 (Austin) and DE-AC02-05CH11231 (LBNL).
Finite element computation on nearest neighbor connected machines

NASA Technical Reports Server (NTRS)

Mcaulay, A. D.

1984-01-01

Research aimed at faster, more cost effective parallel machines and algorithms for improving designer productivity with finite element computations is discussed. A set of 8 boards, containing 4 nearest neighbor connected arrays of commercially available floating point chips and substantial memory, are inserted into a commercially available machine. One-tenth Mflop (64 bit operation) processors provide an 89% efficiency when solving the equations arising in a finite element problem for a single variable regular grid of size 40 by 40 by 40. This is approximately 15 to 20 times faster than a much more expensive machine such as a VAX 11/780 used in double precision. The efficiency falls off as faster or more processors are envisaged because communication times become dominant. A novel successive overrelaxation algorithm which uses cyclic reduction in order to permit data transfer and computation to overlap in time is proposed.
Parallel implementation of D-Phylo algorithm for maximum likelihood clusters.

PubMed

Malik, Shamita; Sharma, Dolly; Khatri, Sunil Kumar

2017-03-01

This study explains a newly developed parallel algorithm for phylogenetic analysis of DNA sequences. The newly designed D-Phylo is a more advanced algorithm for phylogenetic analysis using maximum likelihood approach. The D-Phylo while misusing the seeking capacity of k -means keeps away from its real constraint of getting stuck at privately conserved motifs. The authors have tested the behaviour of D-Phylo on Amazon Linux Amazon Machine Image(Hardware Virtual Machine)i2.4xlarge, six central processing unit, 122 GiB memory, 8 × 800 Solid-state drive Elastic Block Store volume, high network performance up to 15 processors for several real-life datasets. Distributing the clusters evenly on all the processors provides us the capacity to accomplish a near direct speed if there should arise an occurrence of huge number of processors.
Advanced electronics for the CTF MEG system.

PubMed

McCubbin, J; Vrba, J; Spear, P; McKenzie, D; Willis, R; Loewen, R; Robinson, S E; Fife, A A

2004-11-30

Development of the CTF MEG system has been advanced with the introduction of a computer processing cluster between the data acquisition electronics and the host computer. The advent of fast processors, memory, and network interfaces has made this innovation feasible for large data streams at high sampling rates. We have implemented tasks including anti-alias filter, sample rate decimation, higher gradient balancing, crosstalk correction, and optional filters with a cluster consisting of 4 dual Intel Xeon processors operating on up to 275 channel MEG systems at 12 kHz sample rate. The architecture is expandable with additional processors to implement advanced processing tasks which may include e.g., continuous head localization/motion correction, optional display filters, coherence calculations, or real time synthetic channels (via beamformer). We also describe an electronics configuration upgrade to provide operator console access to the peripheral interface features such as analog signal and trigger I/O. This allows remote location of the acoustically noisy electronics cabinet and fitting of the cabinet with doors for improved EMI shielding. Finally, we present the latest performance results available for the CTF 275 channel MEG system including an unshielded SEF (median nerve electrical stimulation) measurement enhanced by application of an adaptive beamformer technique (SAM) which allows recognition of the nominal 20-ms response in the unaveraged signal.
Benchmarking and tuning the MILC code on clusters and supercomputers

NASA Astrophysics Data System (ADS)

Gottlieb, Steven

2002-03-01

Recently, we have benchmarked and tuned the MILC code on a number of architectures including Intel Itanium and Pentium IV (PIV), dual-CPU Athlon, and the latest Compaq Alpha nodes. Results will be presented for many of these, and we shall discuss some simple code changes that can result in a very dramatic speedup of the KS conjugate gradient on processors with more advanced memory systems such as PIV, IBM SP and Alpha.
Benchmarking and tuning the MILC code on clusters and supercomputers

NASA Astrophysics Data System (ADS)

Gottlieb, Steven

Recently, we have benchmarked and tuned the MILC code on a number of architectures including Intel Itanium and Pentium IV (PIV), dual-CPU Athlon, and the latest Compaq Alpha nodes. Results will be presented for many of these, and we shall discuss some simple code changes that can result in a very dramatic speedup of the KS conjugate gradient on processors with more advanced memory systems such as PIV, IBM SP and Alpha.

Turbo Pascal Implementation of a Distributed Processing Network of MS-DOS Microcomputers Connected in a Master-Slave Configuration

DTIC Science & Technology

1989-12-01

Interrupt Procedures ....... 29 13. Support for a Larger Memory Model ................ 29 C. IMPLEMENTATION ........................................ 29...describe the programmer’s model of the hardware utilized in the microcomputers and interrupt driven serial communication considerations. Chapter III...Central Processor Unit The programming model of Table 2.1 is common to the Intel 8088, 8086 and 80x86 series of microprocessors used in the IBM PC/AT
A GPU Parallelization of the Absolute Nodal Coordinate Formulation for Applications in Flexible Multibody Dynamics

DTIC Science & Technology

2012-02-17

to be solved. Disclaimer: Reference herein to any specific commercial company , product, process, or service by trade name, trademark...data processing rather than data caching and control flow. To make use of this computational power, NVIDIA introduced a general purpose parallel...GPU implementations were run on an Intel Nehalem Xeon E5520 2.26GHz processor with an NVIDIA Tesla C2070 graphics card for varying numbers of
Cell-NPE (Numerical Performance Evaluation): Programming the IBM Cell Broadband Engine -- A General Parallelization Strategy

DTIC Science & Technology

2008-04-01

Space GmbH as follows: B. TECHNICAL PRPOPOSA/DESCRIPTION OF WORK Cell: A Revolutionary High Performance Computing Platform On 29 June 2005 [1...IBM has announced that is has partnered with Mercury Computer Systems, a maker of specialized computers . The Cell chip provides massive floating-point...the computing industry away from the traditional processor technology dominated by Intel. While in the past, the development of computing power has
The Fortran-P Translator: Towards Automatic Translation of Fortran 77 Programs for Massively Parallel Processors

DOE PAGES

O'keefe, Matthew; Parr, Terence; Edgar, B. Kevin; ...

1995-01-01

Massively parallel processors (MPPs) hold the promise of extremely high performance that, if realized, could be used to study problems of unprecedented size and complexity. One of the primary stumbling blocks to this promise has been the lack of tools to translate application codes to MPP form. In this article we show how applications codes written in a subset of Fortran 77, called Fortran-P, can be translated to achieve good performance on several massively parallel machines. This subset can express codes that are self-similar, where the algorithm applied to the global data domain is also applied to each subdomain. Wemore » have found many codes that match the Fortran-P programming style and have converted them using our tools. We believe a self-similar coding style will accomplish what a vectorizable style has accomplished for vector machines by allowing the construction of robust, user-friendly, automatic translation systems that increase programmer productivity and generate fast, efficient code for MPPs.« less
Research in Parallel Algorithms and Software for Computational Aerosciences

NASA Technical Reports Server (NTRS)

Domel, Neal D.

1996-01-01

Phase I is complete for the development of a Computational Fluid Dynamics parallel code with automatic grid generation and adaptation for the Euler analysis of flow over complex geometries. SPLITFLOW, an unstructured Cartesian grid code developed at Lockheed Martin Tactical Aircraft Systems, has been modified for a distributed memory/massively parallel computing environment. The parallel code is operational on an SGI network, Cray J90 and C90 vector machines, SGI Power Challenge, and Cray T3D and IBM SP2 massively parallel machines. Parallel Virtual Machine (PVM) is the message passing protocol for portability to various architectures. A domain decomposition technique was developed which enforces dynamic load balancing to improve solution speed and memory requirements. A host/node algorithm distributes the tasks. The solver parallelizes very well, and scales with the number of processors. Partially parallelized and non-parallelized tasks consume most of the wall clock time in a very fine grain environment. Timing comparisons on a Cray C90 demonstrate that Parallel SPLITFLOW runs 2.4 times faster on 8 processors than its non-parallel counterpart autotasked over 8 processors.
Research in Parallel Algorithms and Software for Computational Aerosciences

NASA Technical Reports Server (NTRS)

Domel, Neal D.

1996-01-01

Phase 1 is complete for the development of a computational fluid dynamics CFD) parallel code with automatic grid generation and adaptation for the Euler analysis of flow over complex geometries. SPLITFLOW, an unstructured Cartesian grid code developed at Lockheed Martin Tactical Aircraft Systems, has been modified for a distributed memory/massively parallel computing environment. The parallel code is operational on an SGI network, Cray J90 and C90 vector machines, SGI Power Challenge, and Cray T3D and IBM SP2 massively parallel machines. Parallel Virtual Machine (PVM) is the message passing protocol for portability to various architectures. A domain decomposition technique was developed which enforces dynamic load balancing to improve solution speed and memory requirements. A host/node algorithm distributes the tasks. The solver parallelizes very well, and scales with the number of processors. Partially parallelized and non-parallelized tasks consume most of the wall clock time in a very fine grain environment. Timing comparisons on a Cray C90 demonstrate that Parallel SPLITFLOW runs 2.4 times faster on 8 processors than its non-parallel counterpart autotasked over 8 processors.
Integrated 3-D vision system for autonomous vehicles

NASA Astrophysics Data System (ADS)

Hou, Kun M.; Shawky, Mohamed; Tu, Xiaowei

1992-03-01

Nowadays, autonomous vehicles have become a multidiscipline field. Its evolution is taking advantage of the recent technological progress in computer architectures. As the development tools became more sophisticated, the trend is being more specialized, or even dedicated architectures. In this paper, we will focus our interest on a parallel vision subsystem integrated in the overall system architecture. The system modules work in parallel, communicating through a hierarchical blackboard, an extension of the 'tuple space' from LINDA concepts, where they may exchange data or synchronization messages. The general purpose processing elements are of different skills, built around 40 MHz i860 Intel RISC processors for high level processing and pipelined systolic array processors based on PLAs or FPGAs for low-level processing.
Solving the Cauchy-Riemann equations on parallel computers

NASA Technical Reports Server (NTRS)

Fatoohi, Raad A.; Grosch, Chester E.

1987-01-01

Discussed is the implementation of a single algorithm on three parallel-vector computers. The algorithm is a relaxation scheme for the solution of the Cauchy-Riemann equations; a set of coupled first order partial differential equations. The computers were chosen so as to encompass a variety of architectures. They are: the MPP, and SIMD machine with 16K bit serial processors; FLEX/32, an MIMD machine with 20 processors; and CRAY/2, an MIMD machine with four vector processors. The machine architectures are briefly described. The implementation of the algorithm is discussed in relation to these architectures and measures of the performance on each machine are given. Simple performance models are used to describe the performance. These models highlight the bottlenecks and limiting factors for this algorithm on these architectures. Conclusions are presented.
Hardware description ADSP-21020 40-bit floating point DSP as designed in a remotely controlled digital CW Doppler radar

NASA Astrophysics Data System (ADS)

Morrison, R. E.; Robinson, S. H.

A continuous wave Doppler radar system has been designed which is portable, easily deployed, and remotely controlled. The heart of this system is a DSP/control board using Analog Devices ADSP-21020 40-bit floating point digital signal processor (DSP) microprocessor. Two 18-bit audio A/D converters provide digital input to the DSP/controller board for near real time target detection. Program memory for the DSP is dual ported with an Intel 87C51 microcontroller allowing DSP code to be up-loaded or down-loaded from a central controlling computer. The 87C51 provides overall system control for the remote radar and includes a time-of-day/day-of-year real time clock, system identification (ID) switches, and input/output (I/O) expansion by an Intel 82C55 I/O expander.
Parallelization of MRCI based on hole-particle symmetry.

PubMed

Suo, Bing; Zhai, Gaohong; Wang, Yubin; Wen, Zhenyi; Hu, Xiangqian; Li, Lemin

2005-01-15

The parallel implementation of multireference configuration interaction program based on the hole-particle symmetry is described. The platform to implement the parallelization is an Intel-Architectural cluster consisting of 12 nodes, each of which is equipped with two 2.4-G XEON processors, 3-GB memory, and 36-GB disk, and are connected by a Gigabit Ethernet Switch. The dependence of speedup on molecular symmetries and task granularities is discussed. Test calculations show that the scaling with the number of nodes is about 1.9 (for C1 and Cs), 1.65 (for C2v), and 1.55 (for D2h) when the number of nodes is doubled. The largest calculation performed on this cluster involves 5.6 x 10(8) CSFs.
DOE Office of Scientific and Technical Information (OSTI.GOV)

Reed, D.A.; Grunwald, D.C.

The spectrum of parallel processor designs can be divided into three sections according to the number and complexity of the processors. At one end there are simple, bit-serial processors. Any one of thee processors is of little value, but when it is coupled with many others, the aggregate computing power can be large. This approach to parallel processing can be likened to a colony of termites devouring a log. The most notable examples of this approach are the NASA/Goodyear Massively Parallel Processor, which has 16K one-bit processors, and the Thinking Machines Connection Machine, which has 64K one-bit processors. At themore » other end of the spectrum, a small number of processors, each built using the fastest available technology and the most sophisticated architecture, are combined. An example of this approach is the Cray X-MP. This type of parallel processing is akin to four woodmen attacking the log with chainsaws.« less
Achieving high performance on the Intel Paragon

DOE Office of Scientific and Technical Information (OSTI.GOV)

Greenberg, D.S.; Maccabe, B.; Riesen, R.

1993-11-01

When presented with a new supercomputer most users will first ask {open_quotes}How much faster will my applications run?{close_quotes} and then add a fearful {open_quotes}How much effort will it take me to convert to the new machine?{close_quotes} This paper describes some lessons learned at Sandia while asking these questions about the new 1800+ node Intel Paragon. The authors conclude that the operating system is crucial to both achieving high performance and allowing easy conversion from previous parallel implementations to a new machine. Using the Sandia/UNM Operating System (SUNMOS) they were able to port a LU factorization of dense matrices from themore » nCUBE2 to the Paragon and achieve 92% scaled speed-up on 1024 nodes. Thus on a 44,000 by 44,000 matrix which had required over 10 hours on the previous machine, they completed in less than 1/2 hour at a rate of over 40 GFLOPS. Two keys to achieving such high performance were the small size of SUNMOS (less than 256 kbytes) and the ability to send large messages with very low overhead.« less
Low latency messages on distributed memory multiprocessors

NASA Technical Reports Server (NTRS)

Rosing, Matthew; Saltz, Joel

1993-01-01

Many of the issues in developing an efficient interface for communication on distributed memory machines are described and a portable interface is proposed. Although the hardware component of message latency is less than one microsecond on many distributed memory machines, the software latency associated with sending and receiving typed messages is on the order of 50 microseconds. The reason for this imbalance is that the software interface does not match the hardware. By changing the interface to match the hardware more closely, applications with fine grained communication can be put on these machines. Based on several tests that were run on the iPSC/860, an interface that will better match current distributed memory machines is proposed. The model used in the proposed interface consists of a computation processor and a communication processor on each node. Communication between these processors and other nodes in the system is done through a buffered network. Information that is transmitted is either data or procedures to be executed on the remote processor. The dual processor system is better suited for efficiently handling asynchronous communications compared to a single processor system. The ability to send data or procedure is very flexible for minimizing message latency, based on the type of communication being performed. The test performed and the proposed interface are described.
Fast generation of computer-generated hologram by graphics processing unit

NASA Astrophysics Data System (ADS)

Matsuda, Sho; Fujii, Tomohiko; Yamaguchi, Takeshi; Yoshikawa, Hiroshi

2009-02-01

A cylindrical hologram is well known to be viewable in 360 deg. This hologram depends high pixel resolution.Therefore, Computer-Generated Cylindrical Hologram (CGCH) requires huge calculation amount.In our previous research, we used look-up table method for fast calculation with Intel Pentium4 2.8 GHz.It took 480 hours to calculate high resolution CGCH (504,000 x 63,000 pixels and the average number of object points are 27,000).To improve quality of CGCH reconstructed image, fringe pattern requires higher spatial frequency and resolution.Therefore, to increase the calculation speed, we have to change the calculation method. In this paper, to reduce the calculation time of CGCH (912,000 x 108,000 pixels), we employ Graphics Processing Unit (GPU).It took 4,406 hours to calculate high resolution CGCH on Xeon 3.4 GHz.Since GPU has many streaming processors and a parallel processing structure, GPU works as the high performance parallel processor.In addition, GPU gives max performance to 2 dimensional data and streaming data.Recently, GPU can be utilized for the general purpose (GPGPU).For example, NVIDIA's GeForce7 series became a programmable processor with Cg programming language.Next GeForce8 series have CUDA as software development kit made by NVIDIA.Theoretically, calculation ability of GPU is announced as 500 GFLOPS. From the experimental result, we have achieved that 47 times faster calculation compared with our previous work which used CPU.Therefore, CGCH can be generated in 95 hours.So, total time is 110 hours to calculate and print the CGCH.
QCD thermodynamics with two flavors of quarks[1

NASA Astrophysics Data System (ADS)

MIMD lattice Computations (MILC) Collaboration

We present results of numerical simulations of quantum chromodynamics at finite temperature on the Intel iPSC/860 parallel processor. We performed calculations with two flavors of Kogut-Susskind quarks and of Wilson quarks on 6 × 12 3 lattices in order to study the crossover from the low temperature hadronic regime to the high temperature regime. We investigate the properties of the objects whose exchange gives static screening lengths be reconstructing their correlated quark-antiquark structure.
Kalman Filter Tracking on Parallel Architectures

NASA Astrophysics Data System (ADS)

Cerati, Giuseppe; Elmer, Peter; Krutelyov, Slava; Lantz, Steven; Lefebvre, Matthieu; McDermott, Kevin; Riley, Daniel; Tadel, Matevž; Wittich, Peter; Würthwein, Frank; Yagil, Avi

2016-11-01

Power density constraints are limiting the performance improvements of modern CPUs. To address this we have seen the introduction of lower-power, multi-core processors such as GPGPU, ARM and Intel MIC. In order to achieve the theoretical performance gains of these processors, it will be necessary to parallelize algorithms to exploit larger numbers of lightweight cores and specialized functions like large vector units. Track finding and fitting is one of the most computationally challenging problems for event reconstruction in particle physics. At the High-Luminosity Large Hadron Collider (HL-LHC), for example, this will be by far the dominant problem. The need for greater parallelism has driven investigations of very different track finding techniques such as Cellular Automata or Hough Transforms. The most common track finding techniques in use today, however, are those based on a Kalman filter approach. Significant experience has been accumulated with these techniques on real tracking detector systems, both in the trigger and offline. They are known to provide high physics performance, are robust, and are in use today at the LHC. Given the utility of the Kalman filter in track finding, we have begun to port these algorithms to parallel architectures, namely Intel Xeon and Xeon Phi. We report here on our progress towards an end-to-end track reconstruction algorithm fully exploiting vectorization and parallelization techniques in a simplified experimental environment.
Summary of Documentation for DYNA3D-ParaDyn's Software Quality Assurance Regression Test Problems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Zywicz, Edward

The Software Quality Assurance (SQA) regression test suite for DYNA3D (Zywicz and Lin, 2015) and ParaDyn (DeGroot, et al., 2015) currently contains approximately 600 problems divided into 21 suites, and is a required component of ParaDyn’s SQA plan (Ferencz and Oliver, 2013). The regression suite allows developers to ensure that software modifications do not unintentionally alter the code response. The entire regression suite is run prior to permanently incorporating any software modification or addition. When code modifications alter test problem results, the specific cause must be determined and fully understood before the software changes and revised test answers can bemore » incorporated. The regression suite is executed on LLNL platforms using a Python script and an associated data file. The user specifies the DYNA3D or ParaDyn executable, number of processors to use, test problems to run, and other options to the script. The data file details how each problem and its answer extraction scripts are executed. For each problem in the regression suite there exists an input deck, an eight-processor partition file, an answer file, and various extraction scripts. These scripts assemble a temporary answer file in a specific format from the simulation results. The temporary and stored answer files are compared to a specific level of numerical precision, and when differences are detected the test problem is flagged as failed. Presently, numerical results are stored and compared to 16 digits. At this accuracy level different processor types, compilers, number of partitions, etc. impact the results to various degrees. Thus, for consistency purposes the regression suite is run with ParaDyn using 8 processors on machines with a specific processor type (currently the Intel Xeon E5530 processor). For non-parallel regression problems, i.e., the two XFEM problems, DYNA3D is used instead. When environments or platforms change, executables using the current source code and the new resource are created and the regression suite is run. If differences in answers arise, the new answers are retained provided that the differences are inconsequential. This bootstrap approach allows the test suite answers to evolve in a controlled manner with a high level of confidence. Developers also run the entire regression suite with (serial) DYNA3D. While these results normally differ from the stored (parallel) answers, abnormal termination or wildly different values are strong indicators of potential issues.« less
Underwater Threat Source Localization: Processing Sensor Network TDOAs with a Terascale Optical Core Device

DOE Office of Scientific and Technical Information (OSTI.GOV)

Barhen, Jacob; Imam, Neena

2007-01-01

Revolutionary computing technologies are defined in terms of technological breakthroughs, which leapfrog over near-term projected advances in conventional hardware and software to produce paradigm shifts in computational science. For underwater threat source localization using information provided by a dynamical sensor network, one of the most promising computational advances builds upon the emergence of digital optical-core devices. In this article, we present initial results of sensor network calculations that focus on the concept of signal wavefront time-difference-of-arrival (TDOA). The corresponding algorithms are implemented on the EnLight processing platform recently introduced by Lenslet Laboratories. This tera-scale digital optical core processor is optimizedmore » for array operations, which it performs in a fixed-point-arithmetic architecture. Our results (i) illustrate the ability to reach the required accuracy in the TDOA computation, and (ii) demonstrate that a considerable speed-up can be achieved when using the EnLight 64a prototype processor as compared to a dual Intel XeonTM processor.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)

Newman, G.A.; Commer, M.

Three-dimensional (3D) geophysical imaging is now receiving considerable attention for electrical conductivity mapping of potential offshore oil and gas reservoirs. The imaging technology employs controlled source electromagnetic (CSEM) and magnetotelluric (MT) fields and treats geological media exhibiting transverse anisotropy. Moreover when combined with established seismic methods, direct imaging of reservoir fluids is possible. Because of the size of the 3D conductivity imaging problem, strategies are required exploiting computational parallelism and optimal meshing. The algorithm thus developed has been shown to scale to tens of thousands of processors. In one imaging experiment, 32,768 tasks/processors on the IBM Watson Research Blue Gene/Lmore » supercomputer were successfully utilized. Over a 24 hour period we were able to image a large scale field data set that previously required over four months of processing time on distributed clusters based on Intel or AMD processors utilizing 1024 tasks on an InfiniBand fabric. Electrical conductivity imaging using massively parallel computational resources produces results that cannot be obtained otherwise and are consistent with timeframes required for practical exploration problems.« less
ms2: A molecular simulation tool for thermodynamic properties

NASA Astrophysics Data System (ADS)

Deublein, Stephan; Eckl, Bernhard; Stoll, Jürgen; Lishchuk, Sergey V.; Guevara-Carrion, Gabriela; Glass, Colin W.; Merker, Thorsten; Bernreuther, Martin; Hasse, Hans; Vrabec, Jadran

2011-11-01

This work presents the molecular simulation program ms2 that is designed for the calculation of thermodynamic properties of bulk fluids in equilibrium consisting of small electro-neutral molecules. ms2 features the two main molecular simulation techniques, molecular dynamics (MD) and Monte-Carlo. It supports the calculation of vapor-liquid equilibria of pure fluids and multi-component mixtures described by rigid molecular models on the basis of the grand equilibrium method. Furthermore, it is capable of sampling various classical ensembles and yields numerous thermodynamic properties. To evaluate the chemical potential, Widom's test molecule method and gradual insertion are implemented. Transport properties are determined by equilibrium MD simulations following the Green-Kubo formalism. ms2 is designed to meet the requirements of academia and industry, particularly achieving short response times and straightforward handling. It is written in Fortran90 and optimized for a fast execution on a broad range of computer architectures, spanning from single processor PCs over PC-clusters and vector computers to high-end parallel machines. The standard Message Passing Interface (MPI) is used for parallelization and ms2 is therefore easily portable to different computing platforms. Feature tools facilitate the interaction with the code and the interpretation of input and output files. The accuracy and reliability of ms2 has been shown for a large variety of fluids in preceding work. Program summaryProgram title:ms2 Catalogue identifier: AEJF_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEJF_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Special Licence supplied by the authors No. of lines in distributed program, including test data, etc.: 82 794 No. of bytes in distributed program, including test data, etc.: 793 705 Distribution format: tar.gz Programming language: Fortran90 Computer: The simulation tool ms2 is usable on a wide variety of platforms, from single processor machines over PC-clusters and vector computers to vector-parallel architectures. (Tested with Fortran compilers: gfortran, Intel, PathScale, Portland Group and Sun Studio.) Operating system: Unix/Linux, Windows Has the code been vectorized or parallelized?: Yes. Message Passing Interface (MPI) protocol Scalability. Excellent scalability up to 16 processors for molecular dynamics and >512 processors for Monte-Carlo simulations. RAM:ms2 runs on single processors with 512 MB RAM. The memory demand rises with increasing number of processors used per node and increasing number of molecules. Classification: 7.7, 7.9, 12 External routines: Message Passing Interface (MPI) Nature of problem: Calculation of application oriented thermodynamic properties for rigid electro-neutral molecules: vapor-liquid equilibria, thermal and caloric data as well as transport properties of pure fluids and multi-component mixtures. Solution method: Molecular dynamics, Monte-Carlo, various classical ensembles, grand equilibrium method, Green-Kubo formalism. Restrictions: No. The system size is user-defined. Typical problems addressed by ms2 can be solved by simulating systems containing typically 2000 molecules or less. Unusual features: Feature tools are available for creating input files, analyzing simulation results and visualizing molecular trajectories. Additional comments: Sample makefiles for multiple operation platforms are provided. Documentation is provided with the installation package and is available at http://www.ms-2.de. Running time: The running time of ms2 depends on the problem set, the system size and the number of processes used in the simulation. Running four processes on a "Nehalem" processor, simulations calculating VLE data take between two and twelve hours, calculating transport properties between six and 24 hours.

High Order Accuracy Methods for Supersonic Reactive Flows

DTIC Science & Technology

2008-06-25

k = 0, · · · , N and N is the polynomial order used. The positive constant M is chosen such that σ(N) becomes machine zero. Typically M ∼ 32. Table...function used in this study is the Exponential filter given by σ(η) = exp(−αηp), (38) where α = − ln() and is the machine zero. The spectral...respectively. All numerical experiments were run 49 on a 667 MHz Compaq Alpha machine with 1GB memory and with an Alpha internal floating point processor. 9.1
Dynamically programmable cache

NASA Astrophysics Data System (ADS)

Nakkar, Mouna; Harding, John A.; Schwartz, David A.; Franzon, Paul D.; Conte, Thomas

1998-10-01

Reconfigurable machines have recently been used as co- processors to accelerate the execution of certain algorithms or program subroutines. The problems with the above approach include high reconfiguration time and limited partial reconfiguration. By far the most critical problems are: (1) the small on-chip memory which results in slower execution time, and (2) small FPGA areas that cannot implement large subroutines. Dynamically Programmable Cache (DPC) is a novel architecture for embedded processors which offers solutions to the above problems. To solve memory access problems, DPC processors merge reconfigurable arrays with the data cache at various cache levels to create a multi-level reconfigurable machines. As a result DPC machines have both higher data accessibility and FPGA memory bandwidth. To solve the limited FPGA resource problem, DPC processors implemented multi-context switching (Virtualization) concept. Virtualization allows implementation of large subroutines with fewer FPGA cells. Additionally, DPC processors can parallelize the execution of several operations resulting in faster execution time. In this paper, the speedup improvement for DPC machines are shown to be 5X faster than an Altera FLEX10K FPGA chip and 2X faster than a Sun Ultral SPARC station for two different algorithms (convolution and motion estimation).
Comparison of neuronal spike exchange methods on a Blue Gene/P supercomputer.

PubMed

Hines, Michael; Kumar, Sameer; Schürmann, Felix

2011-01-01

For neural network simulations on parallel machines, interprocessor spike communication can be a significant portion of the total simulation time. The performance of several spike exchange methods using a Blue Gene/P (BG/P) supercomputer has been tested with 8-128 K cores using randomly connected networks of up to 32 M cells with 1 k connections per cell and 4 M cells with 10 k connections per cell, i.e., on the order of 4·10(10) connections (K is 1024, M is 1024(2), and k is 1000). The spike exchange methods used are the standard Message Passing Interface (MPI) collective, MPI_Allgather, and several variants of the non-blocking Multisend method either implemented via non-blocking MPI_Isend, or exploiting the possibility of very low overhead direct memory access (DMA) communication available on the BG/P. In all cases, the worst performing method was that using MPI_Isend due to the high overhead of initiating a spike communication. The two best performing methods-the persistent Multisend method using the Record-Replay feature of the Deep Computing Messaging Framework DCMF_Multicast; and a two-phase multisend in which a DCMF_Multicast is used to first send to a subset of phase one destination cores, which then pass it on to their subset of phase two destination cores-had similar performance with very low overhead for the initiation of spike communication. Departure from ideal scaling for the Multisend methods is almost completely due to load imbalance caused by the large variation in number of cells that fire on each processor in the interval between synchronization. Spike exchange time itself is negligible since transmission overlaps with computation and is handled by a DMA controller. We conclude that ideal performance scaling will be ultimately limited by imbalance between incoming processor spikes between synchronization intervals. Thus, counterintuitively, maximization of load balance requires that the distribution of cells on processors should not reflect neural net architecture but be randomly distributed so that sets of cells which are burst firing together should be on different processors with their targets on as large a set of processors as possible.
Computational multicore on two-layer 1D shallow water equations for erodible dambreak

NASA Astrophysics Data System (ADS)

Simanjuntak, C. A.; Bagustara, B. A. R. H.; Gunawan, P. H.

2018-03-01

The simulation of erodible dambreak using two-layer shallow water equations and SCHR scheme are elaborated in this paper. The results show that the two-layer SWE model in a good agreement with the data experiment which is performed by Louvain-la-Neuve Université Catholique de Louvain. Moreover, the parallel algorithm with multicore architecture are given in the results. The results show that Computer I with processor Intel(R) Core(TM) i5-2500 CPU Quad-Core has the best performance to accelerate the computational time. Moreover, Computer III with processor AMD A6-5200 APU Quad-Core is observed has higher speedup and efficiency. The speedup and efficiency of Computer III with number of grids 3200 are 3.716050530 times and 92.9% respectively.
Static analysis of the hull plate using the finite element method

NASA Astrophysics Data System (ADS)

Ion, A.

2015-11-01

This paper aims at presenting the static analysis for two levels of a container ship's construction as follows: the first level is at the girder / hull plate and the second level is conducted at the entire strength hull of the vessel. This article will describe the work for the static analysis of a hull plate. We shall use the software package ANSYS Mechanical 14.5. The program is run on a computer with four Intel Xeon X5260 CPU processors at 3.33 GHz, 32 GB memory installed. In terms of software, the shared memory parallel version of ANSYS refers to running ANSYS across multiple cores on a SMP system. The distributed memory parallel version of ANSYS (Distributed ANSYS) refers to running ANSYS across multiple processors on SMP systems or DMP systems.
A Parallel Pipelined Renderer for the Time-Varying Volume Data

NASA Technical Reports Server (NTRS)

Chiueh, Tzi-Cker; Ma, Kwan-Liu

1997-01-01

This paper presents a strategy for efficiently rendering time-varying volume data sets on a distributed-memory parallel computer. Time-varying volume data take large storage space and visualizing them requires reading large files continuously or periodically throughout the course of the visualization process. Instead of using all the processors to collectively render one volume at a time, a pipelined rendering process is formed by partitioning processors into groups to render multiple volumes concurrently. In this way, the overall rendering time may be greatly reduced because the pipelined rendering tasks are overlapped with the I/O required to load each volume into a group of processors; moreover, parallelization overhead may be reduced as a result of partitioning the processors. We modify an existing parallel volume renderer to exploit various levels of rendering parallelism and to study how the partitioning of processors may lead to optimal rendering performance. Two factors which are important to the overall execution time are re-source utilization efficiency and pipeline startup latency. The optimal partitioning configuration is the one that balances these two factors. Tests on Intel Paragon computers show that in general optimal partitionings do exist for a given rendering task and result in 40-50% saving in overall rendering time.
Interactive high-resolution isosurface ray casting on multicore processors.

PubMed

Wang, Qin; JaJa, Joseph

2008-01-01

We present a new method for the interactive rendering of isosurfaces using ray casting on multi-core processors. This method consists of a combination of an object-order traversal that coarsely identifies possible candidate 3D data blocks for each small set of contiguous pixels, and an isosurface ray casting strategy tailored for the resulting limited-size lists of candidate 3D data blocks. While static screen partitioning is widely used in the literature, our scheme performs dynamic allocation of groups of ray casting tasks to ensure almost equal loads among the different threads running on multi-cores while maintaining spatial locality. We also make careful use of memory management environment commonly present in multi-core processors. We test our system on a two-processor Clovertown platform, each consisting of a Quad-Core 1.86-GHz Intel Xeon Processor, for a number of widely different benchmarks. The detailed experimental results show that our system is efficient and scalable, and achieves high cache performance and excellent load balancing, resulting in an overall performance that is superior to any of the previous algorithms. In fact, we achieve an interactive isosurface rendering on a 1024(2) screen for all the datasets tested up to the maximum size of the main memory of our platform.
Real-time trajectory optimization on parallel processors

NASA Technical Reports Server (NTRS)

Psiaki, Mark L.

1993-01-01

A parallel algorithm has been developed for rapidly solving trajectory optimization problems. The goal of the work has been to develop an algorithm that is suitable to do real-time, on-line optimal guidance through repeated solution of a trajectory optimization problem. The algorithm has been developed on an INTEL iPSC/860 message passing parallel processor. It uses a zero-order-hold discretization of a continuous-time problem and solves the resulting nonlinear programming problem using a custom-designed augmented Lagrangian nonlinear programming algorithm. The algorithm achieves parallelism of function, derivative, and search direction calculations through the principle of domain decomposition applied along the time axis. It has been encoded and tested on 3 example problems, the Goddard problem, the acceleration-limited, planar minimum-time to the origin problem, and a National Aerospace Plane minimum-fuel ascent guidance problem. Execution times as fast as 118 sec of wall clock time have been achieved for a 128-stage Goddard problem solved on 32 processors. A 32-stage minimum-time problem has been solved in 151 sec on 32 processors. A 32-stage National Aerospace Plane problem required 2 hours when solved on 32 processors. A speed-up factor of 7.2 has been achieved by using 32-nodes instead of 1-node to solve a 64-stage Goddard problem.
Proceedings of the Seminar on the DoD Computer Security Initiative Program, National Bureau of Standards, Gaithersburg, Maryland, July 17-18, 1979.

DTIC Science & Technology

1979-01-01

specifications have been prepared for a DoD communications processor on an IBM minicomputer, a minicomputer time sharing system for the DEC PDP-11 and...the Honeywell Level 6. a virtual machine monitor for the IBM 370, and Multics [10] for the Honeywell Level 68. MECHANISMS FOR KERNEL IMPLEMENTATION...HOL INA ZJO : ANERIONS g PROCESSORn , c ...THEOREMS 1 ITP I-THEOREMS PROOF EVIDENCE - p II KV./370 FORMAL DESIGN PROCESS M4ODULAR DECOMPOSITION * NON
MSTor: A program for calculating partition functions, free energies, enthalpies, entropies, and heat capacities of complex molecules including torsional anharmonicity

NASA Astrophysics Data System (ADS)

Zheng, Jingjing; Mielke, Steven L.; Clarkson, Kenneth L.; Truhlar, Donald G.

2012-08-01

We present a Fortran program package, MSTor, which calculates partition functions and thermodynamic functions of complex molecules involving multiple torsional motions by the recently proposed MS-T method. This method interpolates between the local harmonic approximation in the low-temperature limit, and the limit of free internal rotation of all torsions at high temperature. The program can also carry out calculations in the multiple-structure local harmonic approximation. The program package also includes six utility codes that can be used as stand-alone programs to calculate reduced moment of inertia matrices by the method of Kilpatrick and Pitzer, to generate conformational structures, to calculate, either analytically or by Monte Carlo sampling, volumes for torsional subdomains defined by Voronoi tessellation of the conformational subspace, to generate template input files, and to calculate one-dimensional torsional partition functions using the torsional eigenvalue summation method. Catalogue identifier: AEMF_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEMF_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 77 434 No. of bytes in distributed program, including test data, etc.: 3 264 737 Distribution format: tar.gz Programming language: Fortran 90, C, and Perl Computer: Itasca (HP Linux cluster, each node has two-socket, quad-core 2.8 GHz Intel Xeon X5560 “Nehalem EP” processors), Calhoun (SGI Altix XE 1300 cluster, each node containing two quad-core 2.66 GHz Intel Xeon “Clovertown”-class processors sharing 16 GB of main memory), Koronis (Altix UV 1000 server with 190 6-core Intel Xeon X7542 “Westmere” processors at 2.66 GHz), Elmo (Sun Fire X4600 Linux cluster with AMD Opteron cores), and Mac Pro (two 2.8 GHz Quad-core Intel Xeon processors) Operating system: Linux/Unix/Mac OS RAM: 2 Mbytes Classification: 16.3, 16.12, 23 Nature of problem: Calculation of the partition functions and thermodynamic functions (standard-state energy, enthalpy, entropy, and free energy as functions of temperatures) of complex molecules involving multiple torsional motions. Solution method: The multi-structural approximation with torsional anharmonicity (MS-T). The program also provides results for the multi-structural local harmonic approximation [1]. Restrictions: There is no limit on the number of torsions that can be included in either the Voronoi calculation or the full MS-T calculation. In practice, the range of problems that can be addressed with the present method consists of all multi-torsional problems for which one can afford to calculate all the conformations and their frequencies. Unusual features: The method can be applied to transition states as well as stable molecules. The program package also includes the hull program for the calculation of Voronoi volumes and six utility codes that can be used as stand-alone programs to calculate reduced moment-of-inertia matrices by the method of Kilpatrick and Pitzer, to generate conformational structures, to calculate, either analytically or by Monte Carlo sampling, volumes for torsional subdomain defined by Voronoi tessellation of the conformational subspace, to generate template input files, and to calculate one-dimensional torsional partition functions using the torsional eigenvalue summation method. Additional comments: The program package includes a manual, installation script, and input and output files for a test suite. Running time: There are 24 test runs. The running time of the test runs on a single processor of the Itasca computer is less than 2 seconds. J. Zheng, T. Yu, E. Papajak, I.M. Alecu, S.L. Mielke, D.G. Truhlar, Practical methods for including torsional anharmonicity in thermochemical calculations of complex molecules: The internal-coordinate multi-structural approximation, Phys. Chem. Chem. Phys. 13 (2011) 10885-10907.
Cedar-a large scale multiprocessor

DOE Office of Scientific and Technical Information (OSTI.GOV)

Gajski, D.; Kuck, D.; Lawrie, D.

1983-01-01

This paper presents an overview of Cedar, a large scale multiprocessor being designed at the University of Illinois. This machine is designed to accommodate several thousand high performance processors which are capable of working together on a single job, or they can be partitioned into groups of processors where each group of one or more processors can work on separate jobs. Various aspects of the machine are described including the control methodology, communication network, optimizing compiler and plans for construction. 13 references.
Performance Evaluation of Synthetic Benchmarks and Image Processing (IP) Kernels on Intel and PowerPC Processors

DTIC Science & Technology

2013-08-01

2006 Linux Q1 2005 Pentium D (830) 3 2/2 2511 1148 3617 Windows Vista Q2 2005 Pentium D (830) 3 2/2 2938 1155 3556 Windows XP Q2 2005 PowerPC 970MP 2...1 3734 3439 1304 Cell Broadband Engine 3.2 1/1 0.207 2006 239 441 Pentium D (830) 3 2/2 2 3617 2511 1148 Pentium D (830) 3 2/2 2 3556 2938 1155
Multivariate statistical analysis of low-voltage EDS spectrum images

DOE Office of Scientific and Technical Information (OSTI.GOV)

Anderson, I.M.

1998-03-01

Whereas energy-dispersive X-ray spectrometry (EDS) has been used for compositional analysis in the scanning electron microscope for 30 years, the benefits of using low operating voltages for such analyses have been explored only during the last few years. This paper couples low-voltage EDS with two other emerging areas of characterization: spectrum imaging and multivariate statistical analysis. The specimen analyzed for this study was a finished Intel Pentium processor, with the polyimide protective coating stripped off to expose the final active layers.
Investigating the Naval Logistics Role in Humanitarian Assistance Activities

DTIC Science & Technology

2015-03-01

transportation means. E. BASE CASE RESULTS The computations were executed on a MacBook Pro , 3 GHz Intel Core i7-4578U processor with 8 GB. The...MacBook Pro was partitioned to also contain a Windows 7, 64-bit operating system. The computations were run in the Windows 7 operating system using the...it impacts the types of metamodels that can be developed as a result of data farming (Lucas et al., 2015). Using a metamodel, one can closely
Command/response protocols and concurrent software

NASA Technical Reports Server (NTRS)

Bynum, W. L.

1987-01-01

A version of the program to control the parallel jaw gripper is documented. The parallel jaw end-effector hardware and the Intel 8031 processor that is used to control the end-effector are briefly described. A general overview of the controller program is given and a complete description of the program's structure and design are contained. There are three appendices: a memory map of the on-chip RAM, a cross-reference listing of the self-scheduling routines, and a summary of the top-level and monitor commands.
SC'11 Poster: A Highly Efficient MGPT Implementation for LAMMPS; with Strong Scaling

DOE Office of Scientific and Technical Information (OSTI.GOV)

Oppelstrup, T; Stukowski, A; Marian, J

2011-12-07

The MGPT potential has been implemented as a drop in package to the general molecular dynamics code LAMMPS. We implement an improved communication scheme that shrinks the communication layer thickness, and increases the load balancing. This results in unprecedented strong scaling, and speedup continuing beyond 1/8 atom/core. In addition, we have optimized the small matrix linear algebra with generic blocking (for all processors) and specific SIMD intrinsics for vectorization on Intel, AMD, and BlueGene CPUs.
Investigating the Importance of Stereo Displays for Helicopter Landing Simulation

DTIC Science & Technology

2016-08-11

visualization. The two instances of X Plane® were implemented using two separate PCs, each incorporating Intel i7 processors and Nvidia Quadro K4200... Nvidia GeForce GTX 680 graphics card was used to administer the stereo acuity and fusion range tests. The tests were displayed on an Asus VG278HE 3D...monitor with 1920x1080 pixels that was compatible with Nvidia 3D Vision2 and that used active shutter glasses. At a 1-m viewing distance, the
Parallel algorithms for simulating continuous time Markov chains

NASA Technical Reports Server (NTRS)

Nicol, David M.; Heidelberger, Philip

1992-01-01

We have previously shown that the mathematical technique of uniformization can serve as the basis of synchronization for the parallel simulation of continuous-time Markov chains. This paper reviews the basic method and compares five different methods based on uniformization, evaluating their strengths and weaknesses as a function of problem characteristics. The methods vary in their use of optimism, logical aggregation, communication management, and adaptivity. Performance evaluation is conducted on the Intel Touchstone Delta multiprocessor, using up to 256 processors.
Pentium Pro inside. 1; A treecode at 430 Gigaflops on ASCI Red

NASA Technical Reports Server (NTRS)

Warren, M. S.; Becker, D. J.; Sterling, T.; Salmon, J. K.; Goda, M. P.

1997-01-01

As an entry for the 1997 Gordon Bell performance prize, we present results from two methods of solving the gravitational N-body problem on the Intel Teraflops system at Sandia National Laboratory (ASCI Red). The first method, an O(N2) algorithm, obtained 635 Gigaflops for a 1 million particle problem on 6800 Pentium Pro processors. The second solution method, a tree-code which scales as O(N log N), sustained 170 Gigaflops over a continuous 9.4 hour period on 4096 processors, integrating the motion of 322 million mutually interacting particles in a cosmology simulation, while saving over 100 Gigabytes of raw data. Additionally, the tree-code sustained 430 Gigaflops on 6800 processors for the first 5 time-steps of that simulation. This tree-code solution is approximately 105 times more efficient than the O(N2) algorithm for this problem. As an entry for the 1997 Gordon Bell price/performance prize, we present two calculations from the disciplines of astrophysics and fluid dynamics. The simulations were performed on two 16 Pentium Pro processor Beowulf-class computers (Loki and Hyglac) constructed entirely from commodity personal computer technology, at a cost of roughly $50k each in September, 1996. The price of an equivalent system in August 1997 is less than $30. At Los Alamos, Loki performed a gravitational tree-code N-body simulation of galaxy formation using 9.75 million particles, which sustained an average of 879 Mflops over a ten day period, and produced roughly 10 Gbytes of raw data.
Time-efficient simulations of tight-binding electronic structures with Intel Xeon PhiTM many-core processors

NASA Astrophysics Data System (ADS)

Ryu, Hoon; Jeong, Yosang; Kang, Ji-Hoon; Cho, Kyu Nam

2016-12-01

Modelling of multi-million atomic semiconductor structures is important as it not only predicts properties of physically realizable novel materials, but can accelerate advanced device designs. This work elaborates a new Technology-Computer-Aided-Design (TCAD) tool for nanoelectronics modelling, which uses a sp3d5s∗ tight-binding approach to describe multi-million atomic structures, and simulate electronic structures with high performance computing (HPC), including atomic effects such as alloy and dopant disorders. Being named as Quantum simulation tool for Advanced Nanoscale Devices (Q-AND), the tool shows nice scalability on traditional multi-core HPC clusters implying the strong capability of large-scale electronic structure simulations, particularly with remarkable performance enhancement on latest clusters of Intel Xeon PhiTM coprocessors. A review of the recent modelling study conducted to understand an experimental work of highly phosphorus-doped silicon nanowires, is presented to demonstrate the utility of Q-AND. Having been developed via Intel Parallel Computing Center project, Q-AND will be open to public to establish a sound framework of nanoelectronics modelling with advanced HPC clusters of a many-core base. With details of the development methodology and exemplary study of dopant electronics, this work will present a practical guideline for TCAD development to researchers in the field of computational nanoelectronics.

WATERLOPP V2/64: A highly parallel machine for numerical computation

NASA Astrophysics Data System (ADS)

Ostlund, Neil S.

1985-07-01

Current technological trends suggest that the high performance scientific machines of the future are very likely to consist of a large number (greater than 1024) of processors connected and communicating with each other in some as yet undetermined manner. Such an assembly of processors should behave as a single machine in obtaining numerical solutions to scientific problems. However, the appropriate way of organizing both the hardware and software of such an assembly of processors is an unsolved and active area of research. It is particularly important to minimize the organizational overhead of interprocessor comunication, global synchronization, and contention for shared resources if the performance of a large number ( n) of processors is to be anything like the desirable n times the performance of a single processor. In many situations, adding a processor actually decreases the performance of the overall system since the extra organizational overhead is larger than the extra processing power added. The systolic loop architecture is a new multiple processor architecture which attemps at a solution to the problem of how to organize a large number of asynchronous processors into an effective computational system while minimizing the organizational overhead. This paper gives a brief overview of the basic systolic loop architecture, systolic loop algorithms for numerical computation, and a 64-processor implementation of the architecture, WATERLOOP V2/64, that is being used as a testbed for exploring the hardware, software, and algorithmic aspects of the architecture.
A Qualitative Security Analysis of a New Class of 3-D Integrated Crypto Co-processors

DTIC Science & Technology

2012-01-01

and mobile phones, lottery ticket vending machines , and various electronic payment systems. The main reason for their use in such applications is that...military applications such as secure communication links. However, the proliferation of Automated Teller Machines (ATMs) in the ’80s introduced them to...commercial applications. Today many popular consumer devices have cryptographic processors in them, for example, smart- cards for pay-TV access machines
OpenMP GNU and Intel Fortran programs for solving the time-dependent Gross-Pitaevskii equation

NASA Astrophysics Data System (ADS)

Young-S., Luis E.; Muruganandam, Paulsamy; Adhikari, Sadhan K.; Lončar, Vladimir; Vudragović, Dušan; Balaž, Antun

2017-11-01

We present Open Multi-Processing (OpenMP) version of Fortran 90 programs for solving the Gross-Pitaevskii (GP) equation for a Bose-Einstein condensate in one, two, and three spatial dimensions, optimized for use with GNU and Intel compilers. We use the split-step Crank-Nicolson algorithm for imaginary- and real-time propagation, which enables efficient calculation of stationary and non-stationary solutions, respectively. The present OpenMP programs are designed for computers with multi-core processors and optimized for compiling with both commercially-licensed Intel Fortran and popular free open-source GNU Fortran compiler. The programs are easy to use and are elaborated with helpful comments for the users. All input parameters are listed at the beginning of each program. Different output files provide physical quantities such as energy, chemical potential, root-mean-square sizes, densities, etc. We also present speedup test results for new versions of the programs. Program files doi:http://dx.doi.org/10.17632/y8zk3jgn84.2 Licensing provisions: Apache License 2.0 Programming language: OpenMP GNU and Intel Fortran 90. Computer: Any multi-core personal computer or workstation with the appropriate OpenMP-capable Fortran compiler installed. Number of processors used: All available CPU cores on the executing computer. Journal reference of previous version: Comput. Phys. Commun. 180 (2009) 1888; ibid.204 (2016) 209. Does the new version supersede the previous version?: Not completely. It does supersede previous Fortran programs from both references above, but not OpenMP C programs from Comput. Phys. Commun. 204 (2016) 209. Nature of problem: The present Open Multi-Processing (OpenMP) Fortran programs, optimized for use with commercially-licensed Intel Fortran and free open-source GNU Fortran compilers, solve the time-dependent nonlinear partial differential (GP) equation for a trapped Bose-Einstein condensate in one (1d), two (2d), and three (3d) spatial dimensions for six different trap symmetries: axially and radially symmetric traps in 3d, circularly symmetric traps in 2d, fully isotropic (spherically symmetric) and fully anisotropic traps in 2d and 3d, as well as 1d traps, where no spatial symmetry is considered. Solution method: We employ the split-step Crank-Nicolson algorithm to discretize the time-dependent GP equation in space and time. The discretized equation is then solved by imaginary- or real-time propagation, employing adequately small space and time steps, to yield the solution of stationary and non-stationary problems, respectively. Reasons for the new version: Previously published Fortran programs [1,2] have now become popular tools [3] for solving the GP equation. These programs have been translated to the C programming language [4] and later extended to the more complex scenario of dipolar atoms [5]. Now virtually all computers have multi-core processors and some have motherboards with more than one physical computer processing unit (CPU), which may increase the number of available CPU cores on a single computer to several tens. The C programs have been adopted to be very fast on such multi-core modern computers using general-purpose graphic processing units (GPGPU) with Nvidia CUDA and computer clusters using Message Passing Interface (MPI) [6]. Nevertheless, previously developed Fortran programs are also commonly used for scientific computation and most of them use a single CPU core at a time in modern multi-core laptops, desktops, and workstations. Unless the Fortran programs are made aware and capable of making efficient use of the available CPU cores, the solution of even a realistic dynamical 1d problem, not to mention the more complicated 2d and 3d problems, could be time consuming using the Fortran programs. Previously, we published auto-parallel Fortran programs [2] suitable for Intel (but not GNU) compiler for solving the GP equation. Hence, a need for the full OpenMP version of the Fortran programs to reduce the execution time cannot be overemphasized. To address this issue, we provide here such OpenMP Fortran programs, optimized for both Intel and GNU Fortran compilers and capable of using all available CPU cores, which can significantly reduce the execution time. Summary of revisions: Previous Fortran programs [1] for solving the time-dependent GP equation in 1d, 2d, and 3d with different trap symmetries have been parallelized using the OpenMP interface to reduce the execution time on multi-core processors. There are six different trap symmetries considered, resulting in six programs for imaginary-time propagation and six for real-time propagation, totaling to 12 programs included in BEC-GP-OMP-FOR software package. All input data (number of atoms, scattering length, harmonic oscillator trap length, trap anisotropy, etc.) are conveniently placed at the beginning of each program, as before [2]. Present programs introduce a new input parameter, which is designated by Number_of_Threads and defines the number of CPU cores of the processor to be used in the calculation. If one sets the value 0 for this parameter, all available CPU cores will be used. For the most efficient calculation it is advisable to leave one CPU core unused for the background system's jobs. For example, on a machine with 20 CPU cores such that we used for testing, it is advisable to use up to 19 CPU cores. However, the total number of used CPU cores can be divided into more than one job. For instance, one can run three simulations simultaneously using 10, 4, and 5 CPU cores, respectively, thus totaling to 19 used CPU cores on a 20-core computer. The Fortran source programs are located in the directory src, and can be compiled by the make command using the makefile in the root directory BEC-GP-OMP-FOR of the software package. The examples of produced output files can be found in the directory output, although some large density files are omitted, to save space. The programs calculate the values of actually used dimensionless nonlinearities from the physical input parameters, where the input parameters correspond to the identical nonlinearity values as in the previously published programs [1], so that the output files of the old and new programs can be directly compared. The output files are conveniently named such that their contents can be easily identified, following the naming convention introduced in Ref. [2]. For example, a file named -out.txt, where is a name of the individual program, represents the general output file containing input data, time and space steps, nonlinearity, energy and chemical potential, and was named fort.7 in the old Fortran version of programs [1]. A file named -den.txt is the output file with the condensate density, which had the names fort.3 and fort.4 in the old Fortran version [1] for imaginary- and real-time propagation programs, respectively. Other possible density outputs, such as the initial density, are commented out in the programs to have a simpler set of output files, but users can uncomment and re-enable them, if needed. In addition, there are output files for reduced (integrated) 1d and 2d densities for different programs. In the real-time programs there is also an output file reporting the dynamics of evolution of root-mean-square sizes after a perturbation is introduced. The supplied real-time programs solve the stationary GP equation, and then calculate the dynamics. As the imaginary-time programs are more accurate than the real-time programs for the solution of a stationary problem, one can first solve the stationary problem using the imaginary-time programs, adapt the real-time programs to read the pre-calculated wave function and then study the dynamics. In that case the parameter NSTP in the real-time programs should be set to zero and the space mesh and nonlinearity parameters should be identical in both programs. The reader is advised to consult our previous publication where a complete description of the output files is given [2]. A readme.txt file, included in the root directory, explains the procedure to compile and run the programs. We tested our programs on a workstation with two 10-core Intel Xeon E5-2650 v3 CPUs. The parameters used for testing are given in sample input files, provided in the corresponding directory together with the programs. In Table 1 we present wall-clock execution times for runs on 1, 6, and 19 CPU cores for programs compiled using Intel and GNU Fortran compilers. The corresponding columns "Intel speedup" and "GNU speedup" give the ratio of wall-clock execution times of runs on 1 and 19 CPU cores, and denote the actual measured speedup for 19 CPU cores. In all cases and for all numbers of CPU cores, although the GNU Fortran compiler gives excellent results, the Intel Fortran compiler turns out to be slightly faster. Note that during these tests we always ran only a single simulation on a workstation at a time, to avoid any possible interference issues. Therefore, the obtained wall-clock times are more reliable than the ones that could be measured with two or more jobs running simultaneously. We also studied the speedup of the programs as a function of the number of CPU cores used. The performance of the Intel and GNU Fortran compilers is illustrated in Fig. 1, where we plot the speedup and actual wall-clock times as functions of the number of CPU cores for 2d and 3d programs. We see that the speedup increases monotonically with the number of CPU cores in all cases and has large values (between 10 and 14 for 3d programs) for the maximal number of cores. This fully justifies the development of OpenMP programs, which enable much faster and more efficient solving of the GP equation. However, a slow saturation in the speedup with the further increase in the number of CPU cores is observed in all cases, as expected. The speedup tends to increase for programs in higher dimensions, as they become more complex and have to process more data. This is why the speedups of the supplied 2d and 3d programs are larger than those of 1d programs. Also, for a single program the speedup increases with the size of the spatial grid, i.e., with the number of spatial discretization points, since this increases the amount of calculations performed by the program. To demonstrate this, we tested the supplied real2d-th program and varied the number of spatial discretization points NX=NY from 20 to 1000. The measured speedup obtained when running this program on 19 CPU cores as a function of the number of discretization points is shown in Fig. 2. The speedup first increases rapidly with the number of discretization points and eventually saturates. Additional comments: Example inputs provided with the programs take less than 30 minutes to run on a workstation with two Intel Xeon E5-2650 v3 processors (2 QPI links, 10 CPU cores, 25 MB cache, 2.3 GHz).



      
      Implementation and performance of parallel Prolog interpreter
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Wei, S.; Kale, L.V.; Balkrishna, R.
         1988-01-01
         In this paper, the authors discuss the implementation of a parallel Prolog interpreter on different parallel machines. The implementation is based on the REDUCE--OR process model which exploits both AND and OR parallelism in logic programs. It is machine independent as it runs on top of the chare-kernel--a machine-independent parallel programming system. The authors also give the performance of the interpreter running a diverse set of benchmark pargrams on parallel machines including shared memory systems: an Alliant FX/8, Sequent and a MultiMax, and a non-shared memory systems: Intel iPSC/32 hypercube, in addition to its performance on a multiprocessor simulation system.
      

      
      A Parallel Rendering Algorithm for MIMD Architectures
      NASA Technical Reports Server (NTRS)
      Crockett, Thomas W.; Orloff, Tobias
         1991-01-01
         Applications such as animation and scientific visualization demand high performance rendering of complex three dimensional scenes. To deliver the necessary rendering rates, highly parallel hardware architectures are required. The challenge is then to design algorithms and software which effectively use the hardware parallelism. A rendering algorithm targeted to distributed memory MIMD architectures is described. For maximum performance, the algorithm exploits both object-level and pixel-level parallelism. The behavior of the algorithm is examined both analytically and experimentally. Its performance for large numbers of processors is found to be limited primarily by communication overheads. An experimental implementation for the Intel iPSC/860 shows increasing performance from 1 to 128 processors across a wide range of scene complexities. It is shown that minimal modifications to the algorithm will adapt it for use on shared memory architectures as well.
      

      
      Evaluation of the Xeon phi processor as a technology for the acceleration of real-time control in high-order adaptive optics systems
      NASA Astrophysics Data System (ADS)
      Barr, David; Basden, Alastair; Dipper, Nigel; Schwartz, Noah; Vick, Andy; Schnetler, Hermine
         2014-08-01
         We present wavefront reconstruction acceleration of high-order AO systems using an Intel Xeon Phi processor. The Xeon Phi is a coprocessor providing many integrated cores and designed for accelerating compute intensive, numerical codes. Unlike other accelerator technologies, it allows virtually unchanged C/C++ to be recompiled to run on the Xeon Phi, giving the potential of making development, upgrade and maintenance faster and less complex. We benchmark the Xeon Phi in the context of AO real-time control by running a matrix vector multiply (MVM) algorithm. We investigate variability in execution time and demonstrate a substantial speed-up in loop frequency. We examine the integration of a Xeon Phi into an existing RTC system and show that performance improvements can be achieved with limited development effort.
      

      
      Architectures for reasoning in parallel
      NASA Technical Reports Server (NTRS)
      Hall, Lawrence O.
         1989-01-01
         The research conducted has dealt with rule-based expert systems. The algorithms that may lead to effective parallelization of them were investigated. Both the forward and backward chained control paradigms were investigated in the course of this work. The best computer architecture for the developed and investigated algorithms has been researched. Two experimental vehicles were developed to facilitate this research. They are Backpac, a parallel backward chained rule-based reasoning system and Datapac, a parallel forward chained rule-based reasoning system. Both systems have been written in Multilisp, a version of Lisp which contains the parallel construct, future. Applying the future function to a function causes the function to become a task parallel to the spawning task. Additionally, Backpac and Datapac have been run on several disparate parallel processors. The machines are an Encore Multimax with 10 processors, the Concert Multiprocessor with 64 processors, and a 32 processor BBN GP1000. Both the Concert and the GP1000 are switch-based machines. The Multimax has all its processors hung off a common bus. All are shared memory machines, but have different schemes for sharing the memory and different locales for the shared memory. The main results of the investigations come from experiments on the 10 processor Encore and the Concert with partitions of 32 or less processors. Additionally, experiments have been run with a stripped down version of EMYCIN.
      

      
      Evaluating the networking characteristics of the Cray XC-40 Intel Knights Landing-based Cori supercomputer at NERSC
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Doerfler, Douglas; Austin, Brian; Cook, Brandon
         
         There are many potential issues associated with deploying the Intel Xeon Phi™ (code named Knights Landing [KNL]) manycore processor in a large-scale supercomputer. One in particular is the ability to fully utilize the high-speed communications network, given that the serial performance of a Xeon Phi TM core is a fraction of a Xeon®core. In this paper, we take a look at the trade-offs associated with allocating enough cores to fully utilize the Aries high-speed network versus cores dedicated to computation, e.g., the trade-off between MPI and OpenMP. In addition, we evaluate new features of Cray MPI in support of KNL,more » such as internode optimizations. We also evaluate one-sided programming models such as Unified Parallel C. We quantify the impact of the above trade-offs and features using a suite of National Energy Research Scientific Computing Center applications.« less
      

      
      Implementation of an ADI method on parallel computers
      NASA Technical Reports Server (NTRS)
      Fatoohi, Raad A.; Grosch, Chester E.
         1987-01-01
         The implementation of an ADI method for solving the diffusion equation on three parallel/vector computers is discussed. The computers were chosen so as to encompass a variety of architectures. They are: the MPP, an SIMD machine with 16K bit serial processors; FLEX/32, an MIMD machine with 20 processors; and CRAY/2, an MIMD machine with four vector processors. The Gaussian elimination algorithm is used to solve a set of tridiagonal systems on the FLEX/32 and CRAY/2 while the cyclic elimination algorithm is used to solve these systems on the MPP. The implementation of the method is discussed in relation to these architectures and measures of the performance on each machine are given. Simple performance models are used to describe the performance. These models highlight the bottlenecks and limiting factors for this algorithm on these architectures. Finally, conclusions are presented.
      

      
      Implementation of an ADI method on parallel computers
      NASA Technical Reports Server (NTRS)
      Fatoohi, Raad A.; Grosch, Chester E.
         1987-01-01
         In this paper the implementation of an ADI method for solving the diffusion equation on three parallel/vector computers is discussed. The computers were chosen so as to encompass a variety of architectures. They are the MPP, an SIMD machine with 16-Kbit serial processors; Flex/32, an MIMD machine with 20 processors; and Cray/2, an MIMD machine with four vector processors. The Gaussian elimination algorithm is used to solve a set of tridiagonal systems on the Flex/32 and Cray/2 while the cyclic elimination algorithm is used to solve these systems on the MPP. The implementation of the method is discussed in relation to these architectures and measures of the performance on each machine are given. Simple performance models are used to describe the performance. These models highlight the bottlenecks and limiting factors for this algorithm on these architectures. Finally conclusions are presented.
      

      
      Matrix Multiplication Algorithm Selection with Support Vector Machines
      DTIC Science & Technology
      
         2015-05-01
         libraries that could intelligently choose the optimal algorithm for a particular set of inputs. Users would be oblivious to the underlying algorithmic...SAT.” J. Artif . Intell. Res.(JAIR), vol. 32, pp. 565–606, 2008. [9] M. G. Lagoudakis and M. L. Littman, “Algorithm selection using reinforcement...Artificial Intelligence , vol. 21, no. 05, pp. 961–976, 2007. [15] C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” ACM
      

      
      MBASIC batch processor architectural overview
      NASA Technical Reports Server (NTRS)
      Reynolds, S. M.
         1978-01-01
         The MBASIC (TM) batch processor, a language translator designed to operate in the MBASIC (TM) environment is described. Features include: (1) a CONVERT TO BATCH command, usable from the ready mode; and (2) translation of the users program in stages through several levels of intermediate language and optimization. The processor is to be designed and implemented in both machine-independent and machine-dependent sections. The architecture is planned so that optimization processes are transparent to the rest of the system and need not be included in the first design implementation cycle.
      

      
      Parallel computing on Unix workstation arrays
      NASA Astrophysics Data System (ADS)
      Reale, F.; Bocchino, F.; Sciortino, S.
         1994-12-01
         We have tested arrays of general-purpose Unix workstations used as MIMD systems for massive parallel computations. In particular we have solved numerically a demanding test problem with a 2D hydrodynamic code, generally developed to study astrophysical flows, by exucuting it on arrays either of DECstations 5000/200 on Ethernet LAN, or of DECstations 3000/400, equipped with powerful Alpha processors, on FDDI LAN. The code is appropriate for data-domain decomposition, and we have used a library for parallelization previously developed in our Institute, and easily extended to work on Unix workstation arrays by using the PVM software toolset. We have compared the parallel efficiencies obtained on arrays of several processors to those obtained on a dedicated MIMD parallel system, namely a Meiko Computing Surface (CS-1), equipped with Intel i860 processors. We discuss the feasibility of using non-dedicated parallel systems and conclude that the convenience depends essentially on the size of the computational domain as compared to the relative processor power and network bandwidth. We point out that for future perspectives a parallel development of processor and network technology is important, and that the software still offers great opportunities of improvement, especially in terms of latency times in the message-passing protocols. In conditions of significant gain in terms of speedup, such workstation arrays represent a cost-effective approach to massive parallel computations.
      

      
      A multi-satellite orbit determination problem in a parallel processing environment
      NASA Technical Reports Server (NTRS)
      Deakyne, M. S.; Anderle, R. J.
         1988-01-01
         The Engineering Orbit Analysis Unit at GE Valley Forge used an Intel Hypercube Parallel Processor to investigate the performance and gain experience of parallel processors with a multi-satellite orbit determination problem. A general study was selected in which major blocks of computation for the multi-satellite orbit computations were used as units to be assigned to the various processors on the Hypercube. Problems encountered or successes achieved in addressing the orbit determination problem would be more likely to be transferable to other parallel processors. The prime objective was to study the algorithm to allow processing of observations later in time than those employed in the state update. Expertise in ephemeris determination was exploited in addressing these problems and the facility used to bring a realism to the study which would highlight the problems which may not otherwise be anticipated. Secondary objectives were to gain experience of a non-trivial problem in a parallel processor environment, to explore the necessary interplay of serial and parallel sections of the algorithm in terms of timing studies, to explore the granularity (coarse vs. fine grain) to discover the granularity limit above which there would be a risk of starvation where the majority of nodes would be idle or under the limit where the overhead associated with splitting the problem may require more work and communication time than is useful.
      

      
      Investigating the Use of the Intel Xeon Phi for Event Reconstruction
      NASA Astrophysics Data System (ADS)
      Sherman, Keegan; Gilfoyle, Gerard
         2014-09-01
         The physics goal of Jefferson Lab is to understand how quarks and gluons form nuclei and it is being upgraded to a higher, 12-GeV beam energy. The new CLAS12 detector in Hall B will collect 5-10 terabytes of data per day and will require considerable computing resources. We are investigating tools, such as the Intel Xeon Phi, to speed up the event reconstruction. The Kalman Filter is one of the methods being studied. It is a linear algebra algorithm that estimates the state of a system by combining existing data and predictions of those measurements. The tools required to apply this technique (i.e. matrix multiplication, matrix inversion) are being written using C++ intrinsics for Intel's Xeon Phi Coprocessor, which uses the Many Integrated Cores (MIC) architecture. The Intel MIC is a new high-performance chip that connects to a host machine through the PCIe bus and is built to run highly vectorized and parallelized code making it a well-suited device for applications such as the Kalman Filter. Our tests of the MIC optimized algorithms needed for the filter show significant increases in speed. For example, matrix multiplication of 5x5 matrices on the MIC was able to run up to 69 times faster than the host core. The physics goal of Jefferson Lab is to understand how quarks and gluons form nuclei and it is being upgraded to a higher, 12-GeV beam energy. The new CLAS12 detector in Hall B will collect 5-10 terabytes of data per day and will require considerable computing resources. We are investigating tools, such as the Intel Xeon Phi, to speed up the event reconstruction. The Kalman Filter is one of the methods being studied. It is a linear algebra algorithm that estimates the state of a system by combining existing data and predictions of those measurements. The tools required to apply this technique (i.e. matrix multiplication, matrix inversion) are being written using C++ intrinsics for Intel's Xeon Phi Coprocessor, which uses the Many Integrated Cores (MIC) architecture. The Intel MIC is a new high-performance chip that connects to a host machine through the PCIe bus and is built to run highly vectorized and parallelized code making it a well-suited device for applications such as the Kalman Filter. Our tests of the MIC optimized algorithms needed for the filter show significant increases in speed. For example, matrix multiplication of 5x5 matrices on the MIC was able to run up to 69 times faster than the host core. Work supported by the University of Richmond and the US Department of Energy.
      

      
      Computational experience with a parallel algorithm for tetrangle inequality bound smoothing.
      PubMed
      Rajan, K; Deo, N
         1999-09-01
         Determining molecular structure from interatomic distances is an important and challenging problem. Given a molecule with n atoms, lower and upper bounds on interatomic distances can usually be obtained only for a small subset of the 2(n(n-1)) atom pairs, using NMR. Given the bounds so obtained on the distances between some of the atom pairs, it is often useful to compute tighter bounds on all the 2(n(n-1)) pairwise distances. This process is referred to as bound smoothing. The initial lower and upper bounds for the pairwise distances not measured are usually assumed to be 0 and infinity. One method for bound smoothing is to use the limits imposed by the triangle inequality. The distance bounds so obtained can often be tightened further by applying the tetrangle inequality--the limits imposed on the six pairwise distances among a set of four atoms (instead of three for the triangle inequalities). The tetrangle inequality is expressed by the Cayley-Menger determinants. For every quadruple of atoms, each pass of the tetrangle inequality bound smoothing procedure finds upper and lower limits on each of the six distances in the quadruple. Applying the tetrangle inequalities to each of the (4n) quadruples requires O(n4) time. Here, we propose a parallel algorithm for bound smoothing employing the tetrangle inequality. Each pass of our algorithm requires O(n3 log n) time on a REW PRAM (Concurrent Read Exclusive Write Parallel Random Access Machine) with O(log(n)n) processors. An implementation of this parallel algorithm on the Intel Paragon XP/S and its performance are also discussed.
      

      
      Massively parallel electrical conductivity imaging of the subsurface: Applications to hydrocarbon exploration
      NASA Astrophysics Data System (ADS)
      Newman, Gregory A.; Commer, Michael
         2009-07-01
         Three-dimensional (3D) geophysical imaging is now receiving considerable attention for electrical conductivity mapping of potential offshore oil and gas reservoirs. The imaging technology employs controlled source electromagnetic (CSEM) and magnetotelluric (MT) fields and treats geological media exhibiting transverse anisotropy. Moreover when combined with established seismic methods, direct imaging of reservoir fluids is possible. Because of the size of the 3D conductivity imaging problem, strategies are required exploiting computational parallelism and optimal meshing. The algorithm thus developed has been shown to scale to tens of thousands of processors. In one imaging experiment, 32,768 tasks/processors on the IBM Watson Research Blue Gene/L supercomputer were successfully utilized. Over a 24 hour period we were able to image a large scale field data set that previously required over four months of processing time on distributed clusters based on Intel or AMD processors utilizing 1024 tasks on an InfiniBand fabric. Electrical conductivity imaging using massively parallel computational resources produces results that cannot be obtained otherwise and are consistent with timeframes required for practical exploration problems.
      

      
      A general purpose subroutine for fast fourier transform on a distributed memory parallel machine
      NASA Technical Reports Server (NTRS)
      Dubey, A.; Zubair, M.; Grosch, C. E.
         1992-01-01
         One issue which is central in developing a general purpose Fast Fourier Transform (FFT) subroutine on a distributed memory parallel machine is the data distribution. It is possible that different users would like to use the FFT routine with different data distributions. Thus, there is a need to design FFT schemes on distributed memory parallel machines which can support a variety of data distributions. An FFT implementation on a distributed memory parallel machine which works for a number of data distributions commonly encountered in scientific applications is presented. The problem of rearranging the data after computing the FFT is also addressed. The performance of the implementation on a distributed memory parallel machine Intel iPSC/860 is evaluated.
      

      
      Intel Many Integrated Core (MIC) architecture optimization strategies for a memory-bound Weather Research and Forecasting (WRF) Goddard microphysics scheme
      NASA Astrophysics Data System (ADS)
      Mielikainen, Jarno; Huang, Bormin; Huang, Allen H.
         2014-10-01
         The Goddard cloud microphysics scheme is a sophisticated cloud microphysics scheme in the Weather Research and Forecasting (WRF) model. The WRF is a widely used weather prediction system in the world. It development is a done in collaborative around the globe. The Goddard microphysics scheme is very suitable for massively parallel computation as there are no interactions among horizontal grid points. Compared to the earlier microphysics schemes, the Goddard scheme incorporates a large number of improvements. Thus, we have optimized the code of this important part of WRF. In this paper, we present our results of optimizing the Goddard microphysics scheme on Intel Many Integrated Core Architecture (MIC) hardware. The Intel Xeon Phi coprocessor is the first product based on Intel MIC architecture, and it consists of up to 61 cores connected by a high performance on-die bidirectional interconnect. The Intel MIC is capable of executing a full operating system and entire programs rather than just kernels as the GPU do. The MIC coprocessor supports all important Intel development tools. Thus, the development environment is familiar one to a vast number of CPU developers. Although, getting a maximum performance out of MICs will require using some novel optimization techniques. Those optimization techniques are discusses in this paper. The results show that the optimizations improved performance of the original code on Xeon Phi 7120P by a factor of 4.7x. Furthermore, the same optimizations improved performance on a dual socket Intel Xeon E5-2670 system by a factor of 2.8x compared to the original code.
      

      
      Noiseless coding for the Gamma Ray spectrometer
      NASA Technical Reports Server (NTRS)
      Rice, R.; Lee, J. J.
         1985-01-01
         The payload of several future unmanned space missions will include a sophisticated gamma ray spectrometer. Severely constrained data rates during certain portions of these missions could limit the possible science return from this instrument. This report investigates the application of universal noiseless coding techniques to represent gamma ray spectrometer data more efficiently without any loss in data integrity. Performance results demonstrate compression factors from 2.5:1 to 20:1 in comparison to a standard representation. Feasibility was also demonstrated by implementing a microprocessor breadboard coder/decoder using an Intel 8086 processor.


   
       
            
              
          

«

12
      13
      14
   15
      16
      »

          
        

           
           
             
               
      
      Energy-efficient Data-intensive Computing with a Fast Array of Wimpy Nodes
      DTIC Science & Technology
      
         2011-10-01
         sleep states provided by the Intel Atom chipset (between 2– 4 W) to turn off machines and migrate workloads during idle periods and low utilization...are generated. 81 0 200 400 600 800 1000 1200 IO PS in T ho us an ds Threads 1 2 4 8 16 32 64 Solid = Multi, Dashed = Single QD/T = 1 QD/T = 2...600 700 800 900 1000 L a te n c y ( in u s ) K Lookups/sec R1G8 R2G8 R4G8 R8G8 R16G8 R32G8 R64G8 (a) Multiget=2 (b) Multiget= 4 (c) Multiget=8 0
      

      
      Evaluation of the OpenCL AES Kernel using the Intel FPGA SDK for OpenCL
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Jin, Zheming; Yoshii, Kazutomo; Finkel, Hal
         
         The OpenCL standard is an open programming model for accelerating algorithms on heterogeneous computing system. OpenCL extends the C-based programming language for developing portable codes on different platforms such as CPU, Graphics processing units (GPUs), Digital Signal Processors (DSPs) and Field Programmable Gate Arrays (FPGAs). The Intel FPGA SDK for OpenCL is a suite of tools that allows developers to abstract away the complex FPGA-based development flow for a high-level software development flow. Users can focus on the design of hardware-accelerated kernel functions in OpenCL and then direct the tools to generate the low-level FPGA implementations. The approach makes themore » FPGA-based development more accessible to software users as the needs for hybrid computing using CPUs and FPGAs are increasing. It can also significantly reduce the hardware development time as users can evaluate different ideas with high-level language without deep FPGA domain knowledge. In this report, we evaluate the performance of the kernel using the Intel FPGA SDK for OpenCL and Nallatech 385A FPGA board. Compared to the M506 module, the board provides more hardware resources for a larger design exploration space. The kernel performance is measured with the compute kernel throughput, an upper bound to the FPGA throughput. The report presents the experimental results in details. The Appendix lists the kernel source code.« less
      

      
      One dimensional Linescan x-ray detection of pits in fresh cherries
      USDA-ARS?s Scientific Manuscript database
      
         
         The presence of pits in processed cherries is a concern for both processors and consumers, in many cases causing injury and potential lawsuits. While machines used for pitting cherries are extremely efficient, if one or more plungers in a pitting head become misaligned, a large number of pits may p...
      

      
      Dynamic Load Balancing for Grid Partitioning on a SP-2 Multiprocessor: A Framework
      NASA Technical Reports Server (NTRS)
      Sohn, Andrew; Simon, Horst; Lasinski, T. A. (Technical Monitor)
         1994-01-01
         Computational requirements of full scale computational fluid dynamics change as computation progresses on a parallel machine. The change in computational intensity causes workload imbalance of processors, which in turn requires a large amount of data movement at runtime. If parallel CFD is to be successful on a parallel or massively parallel machine, balancing of the runtime load is indispensable. Here a framework is presented for dynamic load balancing for CFD applications, called Jove. One processor is designated as a decision maker Jove while others are assigned to computational fluid dynamics. Processors running CFD send flags to Jove in a predetermined number of iterations to initiate load balancing. Jove starts working on load balancing while other processors continue working with the current data and load distribution. Jove goes through several steps to decide if the new data should be taken, including preliminary evaluate, partition, processor reassignment, cost evaluation, and decision. Jove running on a single EBM SP2 node has been completely implemented. Preliminary experimental results show that the Jove approach to dynamic load balancing can be effective for full scale grid partitioning on the target machine IBM SP2.
      

      
      Dynamic Load Balancing For Grid Partitioning on a SP-2 Multiprocessor: A Framework
      NASA Technical Reports Server (NTRS)
      Sohn, Andrew; Simon, Horst; Lasinski, T. A. (Technical Monitor)
         1994-01-01
         Computational requirements of full scale computational fluid dynamics change as computation progresses on a parallel machine. The change in computational intensity causes workload imbalance of processors, which in turn requires a large amount of data movement at runtime. If parallel CFD is to be successful on a parallel or massively parallel machine, balancing of the runtime load is indispensable. Here a framework is presented for dynamic load balancing for CFD applications, called Jove. One processor is designated as a decision maker Jove while others are assigned to computational fluid dynamics. Processors running CFD send flags to Jove in a predetermined number of iterations to initiate load balancing. Jove starts working on load balancing while other processors continue working with the current data and load distribution. Jove goes through several steps to decide if the new data should be taken, including preliminary evaluate, partition, processor reassignment, cost evaluation, and decision. Jove running on a single IBM SP2 node has been completely implemented. Preliminary experimental results show that the Jove approach to dynamic load balancing can be effective for full scale grid partitioning on the target machine IBM SP2.
      

      
      Face classification using electronic synapses
      NASA Astrophysics Data System (ADS)
      Yao, Peng; Wu, Huaqiang; Gao, Bin; Eryilmaz, Sukru Burc; Huang, Xueyao; Zhang, Wenqiang; Zhang, Qingtian; Deng, Ning; Shi, Luping; Wong, H.-S. Philip; Qian, He
         2017-05-01
         Conventional hardware platforms consume huge amount of energy for cognitive learning due to the data movement between the processor and the off-chip memory. Brain-inspired device technologies using analogue weight storage allow to complete cognitive tasks more efficiently. Here we present an analogue non-volatile resistive memory (an electronic synapse) with foundry friendly materials. The device shows bidirectional continuous weight modulation behaviour. Grey-scale face classification is experimentally demonstrated using an integrated 1024-cell array with parallel online training. The energy consumption within the analogue synapses for each iteration is 1,000 × (20 ×) lower compared to an implementation using Intel Xeon Phi processor with off-chip memory (with hypothetical on-chip digital resistive random access memory). The accuracy on test sets is close to the result using a central processing unit. These experimental results consolidate the feasibility of analogue synaptic array and pave the way toward building an energy efficient and large-scale neuromorphic system.
      

      
      Face classification using electronic synapses.
      PubMed
      Yao, Peng; Wu, Huaqiang; Gao, Bin; Eryilmaz, Sukru Burc; Huang, Xueyao; Zhang, Wenqiang; Zhang, Qingtian; Deng, Ning; Shi, Luping; Wong, H-S Philip; Qian, He
         2017-05-12
         Conventional hardware platforms consume huge amount of energy for cognitive learning due to the data movement between the processor and the off-chip memory. Brain-inspired device technologies using analogue weight storage allow to complete cognitive tasks more efficiently. Here we present an analogue non-volatile resistive memory (an electronic synapse) with foundry friendly materials. The device shows bidirectional continuous weight modulation behaviour. Grey-scale face classification is experimentally demonstrated using an integrated 1024-cell array with parallel online training. The energy consumption within the analogue synapses for each iteration is 1,000 × (20 ×) lower compared to an implementation using Intel Xeon Phi processor with off-chip memory (with hypothetical on-chip digital resistive random access memory). The accuracy on test sets is close to the result using a central processing unit. These experimental results consolidate the feasibility of analogue synaptic array and pave the way toward building an energy efficient and large-scale neuromorphic system.
      

      
      Parallelization of the preconditioned IDR solver for modern multicore computer systems
      NASA Astrophysics Data System (ADS)
      Bessonov, O. A.; Fedoseyev, A. I.
         2012-10-01
         This paper present the analysis, parallelization and optimization approach for the large sparse matrix solver CNSPACK for modern multicore microprocessors. CNSPACK is an advanced solver successfully used for coupled solution of stiff problems arising in multiphysics applications such as CFD, semiconductor transport, kinetic and quantum problems. It employs iterative IDR algorithm with ILU preconditioning (user chosen ILU preconditioning order). CNSPACK has been successfully used during last decade for solving problems in several application areas, including fluid dynamics and semiconductor device simulation. However, there was a dramatic change in processor architectures and computer system organization in recent years. Due to this, performance criteria and methods have been revisited, together with involving the parallelization of the solver and preconditioner using Open MP environment. Results of the successful implementation for efficient parallelization are presented for the most advances computer system (Intel Core i7-9xx or two-processor Xeon 55xx/56xx).
      

      
      Software design and documentation language, revision 1
      NASA Technical Reports Server (NTRS)
      Kleine, H.
         1979-01-01
         The Software Design and Documentation Language (SDDL) developed to provide an effective communications medium to support the design and documentation of complex software applications is described. Features of the system include: (1) a processor which can convert design specifications into an intelligible, informative machine-reproducible document; (2) a design and documentation language with forms and syntax that are simple, unrestrictive, and communicative; and (3) methodology for effective use of the language and processor. The SDDL processor is written in the SIMSCRIPT II programming language and is implemented on the UNIVAC 1108, the IBM 360/370, and Control Data machines.
      

      
      SIGPROC: Pulsar Signal Processing Programs
      NASA Astrophysics Data System (ADS)
      Lorimer, D. R.
         2011-07-01
         SIGPROC is a package designed to standardize the initial analysis of the many types of fast-sampled pulsar data. Currently recognized machines are the Wide Band Arecibo Pulsar Processor (WAPP), the Penn State Pulsar Machine (PSPM), the Arecibo Observatory Fourier Transform Machine (AOFTM), the Berkeley Pulsar Processors (BPP), the Parkes/Jodrell 1-bit filterbanks (SCAMP) and the filterbank at the Ooty radio telescope (OOTY). The SIGPROC tools should help users look at their data quickly, without the need to write (yet) another routine to read data or worry about big/little endian compatibility (byte swapping is handled automatically).
      

      
      Ada Compiler Validation Summary Report: Certificate Number: 940325S1. 11341 DDC-I, DACS Sun SPARC/SunOS to 80186 Bare Ada Cross Compiler System, Version 4.6.4 Sun SPARCstation IPX = Intel iSBC 186/100 (Bare Machine)
      DTIC Science & Technology
      
         1994-03-25
         declarations requiring more digits than SYSTEM.MAXDIGITS: C24113L..Y (14 tests) C35705L..Y (14 tests) C35706L..Y (14 tests) C35707L..Y (14 tests) 2-1...before the DACS Run-Time System (RMs) libary normally searches for na-4ime routin;, in this way one can replace the standard DACS RTS routines with...SHORTINTEGER is range -128 .. 127; type INTEGER is range -32_768 .. 32_767; type LONGINTEGER is range -2147_483_648 .. 2_147_483_647; type FLOAT is digits 6
      

      
      Massively Parallel Simulations of Diffusion in Dense Polymeric Structures
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Faulon, Jean-Loup, Wilcox, R.T.
         1997-11-01
         An original computational technique to generate close-to-equilibrium dense polymeric structures is proposed. Diffusion of small gases are studied on the equilibrated structures using massively parallel molecular dynamics simulations running on the Intel Teraflops (9216 Pentium Pro processors) and Intel Paragon(1840 processors). Compared to the current state-of-the-art equilibration methods this new technique appears to be faster by some orders of magnitude.The main advantage of the technique is that one can circumvent the bottlenecks in configuration space that inhibit relaxation in molecular dynamics simulations. The technique is based on the fact that tetravalent atoms (such as carbon and silicon) fit in themore » center of a regular tetrahedron and that regular tetrahedrons can be used to mesh the three-dimensional space. Thus, the problem of polymer equilibration described by continuous equations in molecular dynamics is reduced to a discrete problem where solutions are approximated by simple algorithms. Practical modeling applications include the constructing of butyl rubber and ethylene-propylene-dimer-monomer (EPDM) models for oxygen and water diffusion calculations. Butyl and EPDM are used in O-ring systems and serve as sealing joints in many manufactured objects. Diffusion coefficients of small gases have been measured experimentally on both polymeric systems, and in general the diffusion coefficients in EPDM are an order of magnitude larger than in butyl. In order to better understand the diffusion phenomena, 10, 000 atoms models were generated and equilibrated for butyl and EPDM. The models were submitted to a massively parallel molecular dynamics simulation to monitor the trajectories of the diffusing species.« less
      

      
      Efficiently modeling neural networks on massively parallel computers
      NASA Technical Reports Server (NTRS)
      Farber, Robert M.
         1993-01-01
         Neural networks are a very useful tool for analyzing and modeling complex real world systems. Applying neural network simulations to real world problems generally involves large amounts of data and massive amounts of computation. To efficiently handle the computational requirements of large problems, we have implemented at Los Alamos a highly efficient neural network compiler for serial computers, vector computers, vector parallel computers, and fine grain SIMD computers such as the CM-2 connection machine. This paper describes the mapping used by the compiler to implement feed-forward backpropagation neural networks for a SIMD (Single Instruction Multiple Data) architecture parallel computer. Thinking Machines Corporation has benchmarked our code at 1.3 billion interconnects per second (approximately 3 gigaflops) on a 64,000 processor CM-2 connection machine (Singer 1990). This mapping is applicable to other SIMD computers and can be implemented on MIMD computers such as the CM-5 connection machine. Our mapping has virtually no communications overhead with the exception of the communications required for a global summation across the processors (which has a sub-linear runtime growth on the order of O(log(number of processors)). We can efficiently model very large neural networks which have many neurons and interconnects and our mapping can extend to arbitrarily large networks (within memory limitations) by merging the memory space of separate processors with fast adjacent processor interprocessor communications. This paper will consider the simulation of only feed forward neural network although this method is extendable to recurrent networks.
      

      
      3D Navier-Stokes Flow Analysis for a Large-Array Multiprocessor
      DTIC Science & Technology
      
         1989-04-17
         computer, Alliant’s FX /8, Intel’s Hypercube, and Encore’s Multimax. Unfortunately, the current algorithms have been developed pri- marily for SISD machines...Reversing and Thrust-Vectoring Nozzle Flows," Ph.D. Dissertation in the Dept. of Aero. and Astro ., Univ. of Wash., Washington, 1986. [11] Anderson
      

      
      Traileka Glacier X-Stack. Final Scientific/Technical Report
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Borkar, Shekhar
         2015-09-01
         The XStack Traleika Glacier (XSTG) project was a three-year research award for exploring a revolutionary exascale-class machine software framework. The XSTG program, including Intel, UC San Diego, Pacific Northwest National Lab, UIUC, Rice University, Reservoir Labs, ET International, and U. Delaware, had major accomplishments, insights, and products resulting from this three-year effort.
      

      
      List-mode PET image reconstruction for motion correction using the Intel XEON PHI co-processor
      NASA Astrophysics Data System (ADS)
      Ryder, W. J.; Angelis, G. I.; Bashar, R.; Gillam, J. E.; Fulton, R.; Meikle, S.
         2014-03-01
         List-mode image reconstruction with motion correction is computationally expensive, as it requires projection of hundreds of millions of rays through a 3D array. To decrease reconstruction time it is possible to use symmetric multiprocessing computers or graphics processing units. The former can have high financial costs, while the latter can require refactoring of algorithms. The Xeon Phi is a new co-processor card with a Many Integrated Core architecture that can run 4 multiple-instruction, multiple data threads per core with each thread having a 512-bit single instruction, multiple data vector register. Thus, it is possible to run in the region of 220 threads simultaneously. The aim of this study was to investigate whether the Xeon Phi co-processor card is a viable alternative to an x86 Linux server for accelerating List-mode PET image reconstruction for motion correction. An existing list-mode image reconstruction algorithm with motion correction was ported to run on the Xeon Phi coprocessor with the multi-threading implemented using pthreads. There were no differences between images reconstructed using the Phi co-processor card and images reconstructed using the same algorithm run on a Linux server. However, it was found that the reconstruction runtimes were 3 times greater for the Phi than the server. A new version of the image reconstruction algorithm was developed in C++ using OpenMP for mutli-threading and the Phi runtimes decreased to 1.67 times that of the host Linux server. Data transfer from the host to co-processor card was found to be a rate-limiting step; this needs to be carefully considered in order to maximize runtime speeds. When considering the purchase price of a Linux workstation with Xeon Phi co-processor card and top of the range Linux server, the former is a cost-effective computation resource for list-mode image reconstruction. A multi-Phi workstation could be a viable alternative to cluster computers at a lower cost for medical imaging applications.
      

      
      A study of workstation computational performance for real-time flight simulation
      NASA Technical Reports Server (NTRS)
      Maddalon, Jeffrey M.; Cleveland, Jeff I., II
         1995-01-01
         With recent advances in microprocessor technology, some have suggested that modern workstations provide enough computational power to properly operate a real-time simulation. This paper presents the results of a computational benchmark, based on actual real-time flight simulation code used at Langley Research Center, which was executed on various workstation-class machines. The benchmark was executed on different machines from several companies including: CONVEX Computer Corporation, Cray Research, Digital Equipment Corporation, Hewlett-Packard, Intel, International Business Machines, Silicon Graphics, and Sun Microsystems. The machines are compared by their execution speed, computational accuracy, and porting effort. The results of this study show that the raw computational power needed for real-time simulation is now offered by workstations.
      

      
      Robust, High-Speed Network Design for Large-Scale Multiprocessing
      DTIC Science & Technology
      
         1993-09-01
         3.17 Left: Non-expansive Wiring of Processors to First Stage Routing Elements . ... 38 3.18 Right: Expansive Wiring of Processors to First Stage...162 8.2 RNI Micro -architecture ........ .............................. 163 8.3 Packaged RN I IC...169 11.1 MLUNK Message Formats ........ .............................. 173 12.1 Routing Board Arrangement for 64- processor Machine
      

      
      Efficient irregular wavefront propagation algorithms on Intel® Xeon Phi™
      PubMed Central
      Gomes, Jeremias M.; Teodoro, George; de Melo, Alba; Kong, Jun; Kurc, Tahsin; Saltz, Joel H.
         2016-01-01
         We investigate the execution of the Irregular Wavefront Propagation Pattern (IWPP), a fundamental computing structure used in several image analysis operations, on the Intel® Xeon Phi™ co-processor. An efficient implementation of IWPP on the Xeon Phi is a challenging problem because of IWPP’s irregularity and the use of atomic instructions in the original IWPP algorithm to resolve race conditions. On the Xeon Phi, the use of SIMD and vectorization instructions is critical to attain high performance. However, SIMD atomic instructions are not supported. Therefore, we propose a new IWPP algorithm that can take advantage of the supported SIMD instruction set. We also evaluate an alternate storage container (priority queue) to track active elements in the wavefront in an effort to improve the parallel algorithm efficiency. The new IWPP algorithm is evaluated with Morphological Reconstruction and Imfill operations as use cases. Our results show performance improvements of up to 5.63× on top of the original IWPP due to vectorization. Moreover, the new IWPP achieves speedups of 45.7× and 1.62×, respectively, as compared to efficient CPU and GPU implementations. PMID:27298591
      

      
      Efficient irregular wavefront propagation algorithms on Intel® Xeon Phi™.
      PubMed
      Gomes, Jeremias M; Teodoro, George; de Melo, Alba; Kong, Jun; Kurc, Tahsin; Saltz, Joel H
         2015-10-01
         We investigate the execution of the Irregular Wavefront Propagation Pattern (IWPP), a fundamental computing structure used in several image analysis operations, on the Intel ® Xeon Phi ™ co-processor. An efficient implementation of IWPP on the Xeon Phi is a challenging problem because of IWPP's irregularity and the use of atomic instructions in the original IWPP algorithm to resolve race conditions. On the Xeon Phi, the use of SIMD and vectorization instructions is critical to attain high performance. However, SIMD atomic instructions are not supported. Therefore, we propose a new IWPP algorithm that can take advantage of the supported SIMD instruction set. We also evaluate an alternate storage container (priority queue) to track active elements in the wavefront in an effort to improve the parallel algorithm efficiency. The new IWPP algorithm is evaluated with Morphological Reconstruction and Imfill operations as use cases. Our results show performance improvements of up to 5.63 × on top of the original IWPP due to vectorization. Moreover, the new IWPP achieves speedups of 45.7 × and 1.62 × , respectively, as compared to efficient CPU and GPU implementations.
      

        
       
          

«

12
      13
      14
   15
      16
      »

          
        

     

   

   
       
            
              
          

«

13
      14
      15
   16
      17
      »

          
        

           
           
             
               
      
      Real-time operating system for selected Intel processors
      NASA Technical Reports Server (NTRS)
      Pool, W. R.
         1980-01-01
         The rationale for system development is given along with reasons for not using vendor supplied operating systems. Although many system design and performance goals were dictated by problems with vendor supplied systems, other goals surfaced as a result of a design for a custom system able to span multiple projects. System development and management problems and areas that required redesign or major code changes for system implementation are examined as well as the relative successes of the initial projects. A generic description of the actual project is provided and the ongoing support requirements and future plans are discussed.
      

      
      Computer Simulation Performed for Columbia Project Cooling System
      NASA Technical Reports Server (NTRS)
      Ahmad, Jasim
         2005-01-01
         This demo shows a high-fidelity simulation of the air flow in the main computer room housing the Columbia (10,024 intel titanium processors) system. The simulation asseses the performance of the cooling system and identified deficiencies, and recommended modifications to eliminate them. It used two in house software packages on NAS supercomputers: Chimera Grid tools to generate a geometric model of the computer room, OVERFLOW-2 code for fluid and thermal simulation. This state-of-the-art technology can be easily extended to provide a general capability for air flow analyses on any modern computer room. Columbia_CFD_black.tiff
      

      
      Scaling Deep Learning on GPU and Knights Landing clusters
      DOE PAGES
      You, Yang; Buluc, Aydin; Demmel, James
         2017-09-26
         The speed of deep neural networks training has become a big bottleneck of deep learning research and development. For example, training GoogleNet by ImageNet dataset on one Nvidia K20 GPU needs 21 days. To speed up the training process, the current deep learning systems heavily rely on the hardware accelerators. However, these accelerators have limited on-chip memory compared with CPUs. To handle large datasets, they need to fetch data from either CPU memory or remote processors. We use both self-hosted Intel Knights Landing (KNL) clusters and multi-GPU clusters as our target platforms. From an algorithm aspect, current distributed machine learningmore » systems are mainly designed for cloud systems. These methods are asynchronous because of the slow network and high fault-tolerance requirement on cloud systems. We focus on Elastic Averaging SGD (EASGD) to design algorithms for HPC clusters. Original EASGD used round-robin method for communication and updating. The communication is ordered by the machine rank ID, which is inefficient on HPC clusters. First, we redesign four efficient algorithms for HPC systems to improve EASGD's poor scaling on clusters. Async EASGD, Async MEASGD, and Hogwild EASGD are faster \\textcolor{black}{than} their existing counterparts (Async SGD, Async MSGD, and Hogwild SGD, resp.) in all the comparisons. Finally, we design Sync EASGD, which ties for the best performance among all the methods while being deterministic. In addition to the algorithmic improvements, we use some system-algorithm codesign techniques to scale up the algorithms. By reducing the percentage of communication from 87% to 14%, our Sync EASGD achieves 5.3x speedup over original EASGD on the same platform. We get 91.5% weak scaling efficiency on 4253 KNL cores, which is higher than the state-of-the-art implementation.« less
      

      
      Scaling Deep Learning on GPU and Knights Landing clusters
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      You, Yang; Buluc, Aydin; Demmel, James
         
         The speed of deep neural networks training has become a big bottleneck of deep learning research and development. For example, training GoogleNet by ImageNet dataset on one Nvidia K20 GPU needs 21 days. To speed up the training process, the current deep learning systems heavily rely on the hardware accelerators. However, these accelerators have limited on-chip memory compared with CPUs. To handle large datasets, they need to fetch data from either CPU memory or remote processors. We use both self-hosted Intel Knights Landing (KNL) clusters and multi-GPU clusters as our target platforms. From an algorithm aspect, current distributed machine learningmore » systems are mainly designed for cloud systems. These methods are asynchronous because of the slow network and high fault-tolerance requirement on cloud systems. We focus on Elastic Averaging SGD (EASGD) to design algorithms for HPC clusters. Original EASGD used round-robin method for communication and updating. The communication is ordered by the machine rank ID, which is inefficient on HPC clusters. First, we redesign four efficient algorithms for HPC systems to improve EASGD's poor scaling on clusters. Async EASGD, Async MEASGD, and Hogwild EASGD are faster \\textcolor{black}{than} their existing counterparts (Async SGD, Async MSGD, and Hogwild SGD, resp.) in all the comparisons. Finally, we design Sync EASGD, which ties for the best performance among all the methods while being deterministic. In addition to the algorithmic improvements, we use some system-algorithm codesign techniques to scale up the algorithms. By reducing the percentage of communication from 87% to 14%, our Sync EASGD achieves 5.3x speedup over original EASGD on the same platform. We get 91.5% weak scaling efficiency on 4253 KNL cores, which is higher than the state-of-the-art implementation.« less
      

      
      Efficient Parallel Kernel Solvers for Computational Fluid Dynamics Applications
      NASA Technical Reports Server (NTRS)
      Sun, Xian-He
         1997-01-01
         Distributed-memory parallel computers dominate today's parallel computing arena. These machines, such as Intel Paragon, IBM SP2, and Cray Origin2OO, have successfully delivered high performance computing power for solving some of the so-called "grand-challenge" problems. Despite initial success, parallel machines have not been widely accepted in production engineering environments due to the complexity of parallel programming. On a parallel computing system, a task has to be partitioned and distributed appropriately among processors to reduce communication cost and to attain load balance. More importantly, even with careful partitioning and mapping, the performance of an algorithm may still be unsatisfactory, since conventional sequential algorithms may be serial in nature and may not be implemented efficiently on parallel machines. In many cases, new algorithms have to be introduced to increase parallel performance. In order to achieve optimal performance, in addition to partitioning and mapping, a careful performance study should be conducted for a given application to find a good algorithm-machine combination. This process, however, is usually painful and elusive. The goal of this project is to design and develop efficient parallel algorithms for highly accurate Computational Fluid Dynamics (CFD) simulations and other engineering applications. The work plan is 1) developing highly accurate parallel numerical algorithms, 2) conduct preliminary testing to verify the effectiveness and potential of these algorithms, 3) incorporate newly developed algorithms into actual simulation packages. The work plan has well achieved. Two highly accurate, efficient Poisson solvers have been developed and tested based on two different approaches: (1) Adopting a mathematical geometry which has a better capacity to describe the fluid, (2) Using compact scheme to gain high order accuracy in numerical discretization. The previously developed Parallel Diagonal Dominant (PDD) algorithm and Reduced Parallel Diagonal Dominant (RPDD) algorithm have been carefully studied on different parallel platforms for different applications, and a NASA simulation code developed by Man M. Rai and his colleagues has been parallelized and implemented based on data dependency analysis. These achievements are addressed in detail in the paper.
      

      
      Enterprise Cloud Architecture for Chinese Ministry of Railway
      NASA Astrophysics Data System (ADS)
      Shan, Xumei; Liu, Hefeng
         
         Enterprise like PRC Ministry of Railways (MOR), is facing various challenges ranging from highly distributed computing environment and low legacy system utilization, Cloud Computing is increasingly regarded as one workable solution to address this. This article describes full scale cloud solution with Intel Tashi as virtual machine infrastructure layer, Hadoop HDFS as computing platform, and self developed SaaS interface, gluing virtual machine and HDFS with Xen hypervisor. As a result, on demand computing task application and deployment have been tackled per MOR real working scenarios at the end of article.
      

      
      Intel Xeon Phi accelerated Weather Research and Forecasting (WRF) Goddard microphysics scheme
      NASA Astrophysics Data System (ADS)
      Mielikainen, J.; Huang, B.; Huang, A. H.-L.
         2014-12-01
         The Weather Research and Forecasting (WRF) model is a numerical weather prediction system designed to serve both atmospheric research and operational forecasting needs. The WRF development is a done in collaboration around the globe. Furthermore, the WRF is used by academic atmospheric scientists, weather forecasters at the operational centers and so on. The WRF contains several physics components. The most time consuming one is the microphysics. One microphysics scheme is the Goddard cloud microphysics scheme. It is a sophisticated cloud microphysics scheme in the Weather Research and Forecasting (WRF) model. The Goddard microphysics scheme is very suitable for massively parallel computation as there are no interactions among horizontal grid points. Compared to the earlier microphysics schemes, the Goddard scheme incorporates a large number of improvements. Thus, we have optimized the Goddard scheme code. In this paper, we present our results of optimizing the Goddard microphysics scheme on Intel Many Integrated Core Architecture (MIC) hardware. The Intel Xeon Phi coprocessor is the first product based on Intel MIC architecture, and it consists of up to 61 cores connected by a high performance on-die bidirectional interconnect. The Intel MIC is capable of executing a full operating system and entire programs rather than just kernels as the GPU does. The MIC coprocessor supports all important Intel development tools. Thus, the development environment is one familiar to a vast number of CPU developers. Although, getting a maximum performance out of MICs will require using some novel optimization techniques. Those optimization techniques are discussed in this paper. The results show that the optimizations improved performance of Goddard microphysics scheme on Xeon Phi 7120P by a factor of 4.7×. In addition, the optimizations reduced the Goddard microphysics scheme's share of the total WRF processing time from 20.0 to 7.5%. Furthermore, the same optimizations improved performance on Intel Xeon E5-2670 by a factor of 2.8× compared to the original code.
      

      
      Automated Creation of Labeled Pointcloud Datasets in Support of Machine-Learning Based Perception
      DTIC Science & Technology
      
         2017-12-01
         computationally intensive 3D vector math and took more than ten seconds to segment a single LIDAR frame from the HDL-32e with the Dell XPS15 9650’s Intel...Core i7 CPU. Depth Clustering avoids the computationally intensive 3D vector math of Euclidean Clustering-based DON segmentation and, instead
      

      
      NAS Parallel Benchmark. Results 11-96: Performance Comparison of HPF and MPI Based NAS Parallel Benchmarks. 1.0
      NASA Technical Reports Server (NTRS)
      Saini, Subash; Bailey, David; Chancellor, Marisa K. (Technical Monitor)
         1997-01-01
         High Performance Fortran (HPF), the high-level language for parallel Fortran programming, is based on Fortran 90. HALF was defined by an informal standards committee known as the High Performance Fortran Forum (HPFF) in 1993, and modeled on TMC's CM Fortran language. Several HPF features have since been incorporated into the draft ANSI/ISO Fortran 95, the next formal revision of the Fortran standard. HPF allows users to write a single parallel program that can execute on a serial machine, a shared-memory parallel machine, or a distributed-memory parallel machine. HPF eliminates the complex, error-prone task of explicitly specifying how, where, and when to pass messages between processors on distributed-memory machines, or when to synchronize processors on shared-memory machines. HPF is designed in a way that allows the programmer to code an application at a high level, and then selectively optimize portions of the code by dropping into message-passing or calling tuned library routines as 'extrinsics'. Compilers supporting High Performance Fortran features first appeared in late 1994 and early 1995 from Applied Parallel Research (APR) Digital Equipment Corporation, and The Portland Group (PGI). IBM introduced an HPF compiler for the IBM RS/6000 SP/2 in April of 1996. Over the past two years, these implementations have shown steady improvement in terms of both features and performance. The performance of various hardware/ programming model (HPF and MPI (message passing interface)) combinations will be compared, based on latest NAS (NASA Advanced Supercomputing) Parallel Benchmark (NPB) results, thus providing a cross-machine and cross-model comparison. Specifically, HPF based NPB results will be compared with MPI based NPB results to provide perspective on performance currently obtainable using HPF versus MPI or versus hand-tuned implementations such as those supplied by the hardware vendors. In addition we would also present NPB (Version 1.0) performance results for the following systems: DEC Alpha Server 8400 5/440, Fujitsu VPP Series (VX, VPP300, and VPP700), HP/Convex Exemplar SPP2000, IBM RS/6000 SP P2SC node (120 MHz) NEC SX-4/32, SGI/CRAY T3E, SGI Origin2000.
      

      
      Line-drawing algorithms for parallel machines
      NASA Technical Reports Server (NTRS)
      Pang, Alex T.
         1990-01-01
         The fact that conventional line-drawing algorithms, when applied directly on parallel machines, can lead to very inefficient codes is addressed. It is suggested that instead of modifying an existing algorithm for a parallel machine, a more efficient implementation can be produced by going back to the invariants in the definition. Popular line-drawing algorithms are compared with two alternatives; distance to a line (a point is on the line if sufficiently close to it) and intersection with a line (a point on the line if an intersection point). For massively parallel single-instruction-multiple-data (SIMD) machines (with thousands of processors and up), the alternatives provide viable line-drawing algorithms. Because of the pixel-per-processor mapping, their performance is independent of the line length and orientation.
      

      
      Vascular system modeling in parallel environment - distributed and shared memory approaches
      PubMed Central
      Jurczuk, Krzysztof; Kretowski, Marek; Bezy-Wendling, Johanne
         2011-01-01
         The paper presents two approaches in parallel modeling of vascular system development in internal organs. In the first approach, new parts of tissue are distributed among processors and each processor is responsible for perfusing its assigned parts of tissue to all vascular trees. Communication between processors is accomplished by passing messages and therefore this algorithm is perfectly suited for distributed memory architectures. The second approach is designed for shared memory machines. It parallelizes the perfusion process during which individual processing units perform calculations concerning different vascular trees. The experimental results, performed on a computing cluster and multi-core machines, show that both algorithms provide a significant speedup. PMID:21550891
      

      
      Parallel matrix multiplication on the Connection Machine
      NASA Technical Reports Server (NTRS)
      Tichy, Walter F.
         1988-01-01
         Matrix multiplication is a computation and communication intensive problem. Six parallel algorithms for matrix multiplication on the Connection Machine are presented and compared with respect to their performance and processor usage. For n by n matrices, the algorithms have theoretical running times of O(n to the 2nd power log n), O(n log n), O(n), and O(log n), and require n, n to the 2nd power, n to the 2nd power, and n to the 3rd power processors, respectively. With careful attention to communication patterns, the theoretically predicted runtimes can indeed be achieved in practice. The parallel algorithms illustrate the tradeoffs between performance, communication cost, and processor usage.
      

      
      The Intelcities Community of Practice: The Capacity-Building, Co-Design, Evaluation, and Monitoring of E-Government Services
      ERIC Educational Resources Information Center
      Deakin, Mark; Lombardi, Patrizia; Cooper, Ian
         2011-01-01
         The paper examines the IntelCities Community of Practice (CoP) supporting the development of the organization's capacity-building, co-design, monitoring, and evaluation of e-government services. It begins by outlining the IntelCities CoP and goes on to set out the integrated model of electronically enhanced government (e-government) services…
      

      
      Parallel grid generation algorithm for distributed memory computers
      NASA Technical Reports Server (NTRS)
      Moitra, Stuti; Moitra, Anutosh
         1994-01-01
         A parallel grid-generation algorithm and its implementation on the Intel iPSC/860 computer are described. The grid-generation scheme is based on an algebraic formulation of homotopic relations. Methods for utilizing the inherent parallelism of the grid-generation scheme are described, and implementation of multiple levELs of parallelism on multiple instruction multiple data machines are indicated. The algorithm is capable of providing near orthogonality and spacing control at solid boundaries while requiring minimal interprocessor communications. Results obtained on the Intel hypercube for a blended wing-body configuration are used to demonstrate the effectiveness of the algorithm. Fortran implementations bAsed on the native programming model of the iPSC/860 computer and the Express system of software tools are reported. Computational gains in execution time speed-up ratios are given.
      

      
      Parallel Computational Fluid Dynamics: Current Status and Future Requirements
      NASA Technical Reports Server (NTRS)
      Simon, Horst D.; VanDalsem, William R.; Dagum, Leonardo; Kutler, Paul (Technical Monitor)
         1994-01-01
         One or the key objectives of the Applied Research Branch in the Numerical Aerodynamic Simulation (NAS) Systems Division at NASA Allies Research Center is the accelerated introduction of highly parallel machines into a full operational environment. In this report we discuss the performance results obtained from the implementation of some computational fluid dynamics (CFD) applications on the Connection Machine CM-2 and the Intel iPSC/860. We summarize some of the experiences made so far with the parallel testbed machines at the NAS Applied Research Branch. Then we discuss the long term computational requirements for accomplishing some of the grand challenge problems in computational aerosciences. We argue that only massively parallel machines will be able to meet these grand challenge requirements, and we outline the computer science and algorithm research challenges ahead.
      

      
      Massive parallelization of a 3D finite difference electromagnetic forward solution using domain decomposition methods on multiple CUDA enabled GPUs
      NASA Astrophysics Data System (ADS)
      Schultz, A.
         2010-12-01
         3D forward solvers lie at the core of inverse formulations used to image the variation of electrical conductivity within the Earth's interior. This property is associated with variations in temperature, composition, phase, presence of volatiles, and in specific settings, the presence of groundwater, geothermal resources, oil/gas or minerals. The high cost of 3D solutions has been a stumbling block to wider adoption of 3D methods. Parallel algorithms for modeling frequency domain 3D EM problems have not achieved wide scale adoption, with emphasis on fairly coarse grained parallelism using MPI and similar approaches. The communications bandwidth as well as the latency required to send and receive network communication packets is a limiting factor in implementing fine grained parallel strategies, inhibiting wide adoption of these algorithms. Leading Graphics Processor Unit (GPU) companies now produce GPUs with hundreds of GPU processor cores per die. The footprint, in silicon, of the GPU's restricted instruction set is much smaller than the general purpose instruction set required of a CPU. Consequently, the density of processor cores on a GPU can be much greater than on a CPU. GPUs also have local memory, registers and high speed communication with host CPUs, usually through PCIe type interconnects. The extremely low cost and high computational power of GPUs provides the EM geophysics community with an opportunity to achieve fine grained (i.e. massive) parallelization of codes on low cost hardware. The current generation of GPUs (e.g. NVidia Fermi) provides 3 billion transistors per chip die, with nearly 500 processor cores and up to 6 GB of fast (DDR5) GPU memory. This latest generation of GPU supports fast hardware double precision (64 bit) floating point operations of the type required for frequency domain EM forward solutions. Each Fermi GPU board can sustain nearly 1 TFLOP in double precision, and multiple boards can be installed in the host computer system. We describe our ongoing efforts to achieve massive parallelization on a novel hybrid GPU testbed machine currently configured with 12 Intel Westmere Xeon CPU cores (or 24 parallel computational threads) with 96 GB DDR3 system memory, 4 GPU subsystems which in aggregate contain 960 NVidia Tesla GPU cores with 16 GB dedicated DDR3 GPU memory, and a second interleved bank of 4 GPU subsystems containing in aggregate 1792 NVidia Fermi GPU cores with 12 GB dedicated DDR5 GPU memory. We are applying domain decomposition methods to a modified version of Weiss' (2001) 3D frequency domain full physics EM finite difference code, an open source GPL licensed f90 code available for download from www.OpenEM.org. This will be the core of a new hybrid 3D inversion that parallelizes frequencies across CPUs and individual forward solutions across GPUs. We describe progress made in modifying the code to use direct solvers in GPU cores dedicated to each small subdomain, iteratively improving the solution by matching adjacent subdomain boundary solutions, rather than iterative Krylov space sparse solvers as currently applied to the whole domain.
      

      
      Interconnect Performance Evaluation of SGI Altix 3700 BX2, Cray X1, Cray Opteron Cluster, and Dell PowerEdge
      NASA Technical Reports Server (NTRS)
      Fatoohi, Rod; Saini, Subbash; Ciotti, Robert
         2006-01-01
         We study the performance of inter-process communication on four high-speed multiprocessor systems using a set of communication benchmarks. The goal is to identify certain limiting factors and bottlenecks with the interconnect of these systems as well as to compare these interconnects. We measured network bandwidth using different number of communicating processors and communication patterns, such as point-to-point communication, collective communication, and dense communication patterns. The four platforms are: a 512-processor SGI Altix 3700 BX2 shared-memory machine with 3.2 GB/s links; a 64-processor (single-streaming) Cray XI shared-memory machine with 32 1.6 GB/s links; a 128-processor Cray Opteron cluster using a Myrinet network; and a 1280-node Dell PowerEdge cluster with an InfiniBand network. Our, results show the impact of the network bandwidth and topology on the overall performance of each interconnect.
      

      
      NASA Center for Climate Simulation (NCCS) Presentation
      NASA Technical Reports Server (NTRS)
      Webster, William P.
         2012-01-01
         The NASA Center for Climate Simulation (NCCS) offers integrated supercomputing, visualization, and data interaction technologies to enhance NASA's weather and climate prediction capabilities. It serves hundreds of users at NASA Goddard Space Flight Center, as well as other NASA centers, laboratories, and universities across the US. Over the past year, NCCS has continued expanding its data-centric computing environment to meet the increasingly data-intensive challenges of climate science. We doubled our Discover supercomputer's peak performance to more than 800 teraflops by adding 7,680 Intel Xeon Sandy Bridge processor-cores and most recently 240 Intel Xeon Phi Many Integrated Core (MIG) co-processors. A supercomputing-class analysis system named Dali gives users rapid access to their data on Discover and high-performance software including the Ultra-scale Visualization Climate Data Analysis Tools (UV-CDAT), with interfaces from user desktops and a 17- by 6-foot visualization wall. NCCS also is exploring highly efficient climate data services and management with a new MapReduce/Hadoop cluster while augmenting its data distribution to the science community. Using NCCS resources, NASA completed its modeling contributions to the Intergovernmental Panel on Climate Change (IPCG) Fifth Assessment Report this summer as part of the ongoing Coupled Modellntercomparison Project Phase 5 (CMIP5). Ensembles of simulations run on Discover reached back to the year 1000 to test model accuracy and projected climate change through the year 2300 based on four different scenarios of greenhouse gases, aerosols, and land use. The data resulting from several thousand IPCC/CMIP5 simulations, as well as a variety of other simulation, reanalysis, and observationdatasets, are available to scientists and decision makers through an enhanced NCCS Earth System Grid Federation Gateway. Worldwide downloads have totaled over 110 terabytes of data.
      

      
      SoAx: A generic C++ Structure of Arrays for handling particles in HPC codes
      NASA Astrophysics Data System (ADS)
      Homann, Holger; Laenen, Francois
         2018-03-01
         The numerical study of physical problems often require integrating the dynamics of a large number of particles evolving according to a given set of equations. Particles are characterized by the information they are carrying such as an identity, a position other. There are generally speaking two different possibilities for handling particles in high performance computing (HPC) codes. The concept of an Array of Structures (AoS) is in the spirit of the object-oriented programming (OOP) paradigm in that the particle information is implemented as a structure. Here, an object (realization of the structure) represents one particle and a set of many particles is stored in an array. In contrast, using the concept of a Structure of Arrays (SoA), a single structure holds several arrays each representing one property (such as the identity) of the whole set of particles. The AoS approach is often implemented in HPC codes due to its handiness and flexibility. For a class of problems, however, it is known that the performance of SoA is much better than that of AoS. We confirm this observation for our particle problem. Using a benchmark we show that on modern Intel Xeon processors the SoA implementation is typically several times faster than the AoS one. On Intel's MIC co-processors the performance gap even attains a factor of ten. The same is true for GPU computing, using both computational and multi-purpose GPUs. Combining performance and handiness, we present the library SoAx that has optimal performance (on CPUs, MICs, and GPUs) while providing the same handiness as AoS. For this, SoAx uses modern C++ design techniques such template meta programming that allows to automatically generate code for user defined heterogeneous data structures.
      

      
      Highly fault-tolerant parallel computation
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Spielman, D.A.
         
         We re-introduce the coded model of fault-tolerant computation in which the input and output of a computational device are treated as words in an error-correcting code. A computational device correctly computes a function in the coded model if its input and output, once decoded, are a valid input and output of the function. In the coded model, it is reasonable to hope to simulate all computational devices by devices whose size is greater by a constant factor but which are exponentially reliable even if each of their components can fail with some constant probability. We consider fine-grained parallel computations inmore » which each processor has a constant probability of producing the wrong output at each time step. We show that any parallel computation that runs for time t on w processors can be performed reliably on a faulty machine in the coded model using w log{sup O(l)} w processors and time t log{sup O(l)} w. The failure probability of the computation will be at most t {center_dot} exp(-w{sup 1/4}). The codes used to communicate with our fault-tolerant machines are generalized Reed-Solomon codes and can thus be encoded and decoded in O(n log{sup O(1)} n) sequential time and are independent of the machine they are used to communicate with. We also show how coded computation can be used to self-correct many linear functions in parallel with arbitrarily small overhead.« less
      

        
       
          

«

13
      14
      15
   16
      17
      »

          
        

     

   

   
       
            
              
          

«

14
      15
      16
   17
      18
      »

          
        

           
           
             
               
      
      Methodologies and Tools for Tuning Parallel Programs: 80% Art, 20% Science, and 10% Luck
      NASA Technical Reports Server (NTRS)
      Yan, Jerry C.; Bailey, David (Technical Monitor)
         1996-01-01
         The need for computing power has forced a migration from serial computation on a single processor to parallel processing on multiprocessors. However, without effective means to monitor (and analyze) program execution, tuning the performance of parallel programs becomes exponentially difficult as program complexity and machine size increase. In the past few years, the ubiquitous introduction of performance tuning tools from various supercomputer vendors (Intel's ParAide, TMC's PRISM, CRI's Apprentice, and Convex's CXtrace) seems to indicate the maturity of performance instrumentation/monitor/tuning technologies and vendors'/customers' recognition of their importance. However, a few important questions remain: What kind of performance bottlenecks can these tools detect (or correct)? How time consuming is the performance tuning process? What are some important technical issues that remain to be tackled in this area? This workshop reviews the fundamental concepts involved in analyzing and improving the performance of parallel and heterogeneous message-passing programs. Several alternative strategies will be contrasted, and for each we will describe how currently available tuning tools (e.g. AIMS, ParAide, PRISM, Apprentice, CXtrace, ATExpert, Pablo, IPS-2) can be used to facilitate the process. We will characterize the effectiveness of the tools and methodologies based on actual user experiences at NASA Ames Research Center. Finally, we will discuss their limitations and outline recent approaches taken by vendors and the research community to address them.
      

      
      The Tera Multithreaded Architecture and Unstructured Meshes
      NASA Technical Reports Server (NTRS)
      Bokhari, Shahid H.; Mavriplis, Dimitri J.
         1998-01-01
         The Tera Multithreaded Architecture (MTA) is a new parallel supercomputer currently being installed at San Diego Supercomputing Center (SDSC). This machine has an architecture quite different from contemporary parallel machines. The computational processor is a custom design and the machine uses hardware to support very fine grained multithreading. The main memory is shared, hardware randomized and flat. These features make the machine highly suited to the execution of unstructured mesh problems, which are difficult to parallelize on other architectures. We report the results of a study carried out during July-August 1998 to evaluate the execution of EUL3D, a code that solves the Euler equations on an unstructured mesh, on the 2 processor Tera MTA at SDSC. Our investigation shows that parallelization of an unstructured code is extremely easy on the Tera. We were able to get an existing parallel code (designed for a shared memory machine), running on the Tera by changing only the compiler directives. Furthermore, a serial version of this code was compiled to run in parallel on the Tera by judicious use of directives to invoke the "full/empty" tag bits of the machine to obtain synchronization. This version achieves 212 and 406 Mflop/s on one and two processors respectively, and requires no attention to partitioning or placement of data issues that would be of paramount importance in other parallel architectures.
      

      
      PFLOTRAN User Manual: A Massively Parallel Reactive Flow and Transport Model for Describing Surface and Subsurface Processes
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Lichtner, Peter C.; Hammond, Glenn E.; Lu, Chuan
         
         PFLOTRAN solves a system of generally nonlinear partial differential equations describing multi-phase, multicomponent and multiscale reactive flow and transport in porous materials. The code is designed to run on massively parallel computing architectures as well as workstations and laptops (e.g. Hammond et al., 2011). Parallelization is achieved through domain decomposition using the PETSc (Portable Extensible Toolkit for Scientific Computation) libraries for the parallelization framework (Balay et al., 1997). PFLOTRAN has been developed from the ground up for parallel scalability and has been run on up to 218 processor cores with problem sizes up to 2 billion degrees of freedom. Writtenmore » in object oriented Fortran 90, the code requires the latest compilers compatible with Fortran 2003. At the time of this writing this requires gcc 4.7.x, Intel 12.1.x and PGC compilers. As a requirement of running problems with a large number of degrees of freedom, PFLOTRAN allows reading input data that is too large to fit into memory allotted to a single processor core. The current limitation to the problem size PFLOTRAN can handle is the limitation of the HDF5 file format used for parallel IO to 32 bit integers. Noting that 2 32 = 4; 294; 967; 296, this gives an estimate of the maximum problem size that can be currently run with PFLOTRAN. Hopefully this limitation will be remedied in the near future.« less
      

      
      Initial results on computational performance of Intel Many Integrated Core (MIC) architecture: implementation of the Weather and Research Forecasting (WRF) Purdue-Lin microphysics scheme
      NASA Astrophysics Data System (ADS)
      Mielikainen, Jarno; Huang, Bormin; Huang, Allen H.
         2014-10-01
         Purdue-Lin scheme is a relatively sophisticated microphysics scheme in the Weather Research and Forecasting (WRF) model. The scheme includes six classes of hydro meteors: water vapor, cloud water, raid, cloud ice, snow and graupel. The scheme is very suitable for massively parallel computation as there are no interactions among horizontal grid points. In this paper, we accelerate the Purdue Lin scheme using Intel Many Integrated Core Architecture (MIC) hardware. The Intel Xeon Phi is a high performance coprocessor consists of up to 61 cores. The Xeon Phi is connected to a CPU via the PCI Express (PICe) bus. In this paper, we will discuss in detail the code optimization issues encountered while tuning the Purdue-Lin microphysics Fortran code for Xeon Phi. In particularly, getting a good performance required utilizing multiple cores, the wide vector operations and make efficient use of memory. The results show that the optimizations improved performance of the original code on Xeon Phi 5110P by a factor of 4.2x. Furthermore, the same optimizations improved performance on Intel Xeon E5-2603 CPU by a factor of 1.2x compared to the original code.
      

      
      Concurrent and Accurate Short Read Mapping on Multicore Processors.
      PubMed
      Martínez, Héctor; Tárraga, Joaquín; Medina, Ignacio; Barrachina, Sergio; Castillo, Maribel; Dopazo, Joaquín; Quintana-Ortí, Enrique S
         2015-01-01
         We introduce a parallel aligner with a work-flow organization for fast and accurate mapping of RNA sequences on servers equipped with multicore processors. Our software, HPG Aligner SA (HPG Aligner SA is an open-source application. The software is available at http://www.opencb.org, exploits a suffix array to rapidly map a large fraction of the RNA fragments (reads), as well as leverages the accuracy of the Smith-Waterman algorithm to deal with conflictive reads. The aligner is enhanced with a careful strategy to detect splice junctions based on an adaptive division of RNA reads into small segments (or seeds), which are then mapped onto a number of candidate alignment locations, providing crucial information for the successful alignment of the complete reads. The experimental results on a platform with Intel multicore technology report the parallel performance of HPG Aligner SA, on RNA reads of 100-400 nucleotides, which excels in execution time/sensitivity to state-of-the-art aligners such as TopHat 2+Bowtie 2, MapSplice, and STAR.
      

      
      Real-time autocorrelator for fluorescence correlation spectroscopy based on graphical-processor-unit architecture: method, implementation, and comparative studies
      NASA Astrophysics Data System (ADS)
      Laracuente, Nicholas; Grossman, Carl
         2013-03-01
         We developed an algorithm and software to calculate autocorrelation functions from real-time photon-counting data using the fast, parallel capabilities of graphical processor units (GPUs). Recent developments in hardware and software have allowed for general purpose computing with inexpensive GPU hardware. These devices are more suited for emulating hardware autocorrelators than traditional CPU-based software applications by emphasizing parallel throughput over sequential speed. Incoming data are binned in a standard multi-tau scheme with configurable points-per-bin size and are mapped into a GPU memory pattern to reduce time-expensive memory access. Applications include dynamic light scattering (DLS) and fluorescence correlation spectroscopy (FCS) experiments. We ran the software on a 64-core graphics pci card in a 3.2 GHz Intel i5 CPU based computer running Linux. FCS measurements were made on Alexa-546 and Texas Red dyes in a standard buffer (PBS). Software correlations were compared to hardware correlator measurements on the same signals. Supported by HHMI and Swarthmore College
      

      
      High performance in silico virtual drug screening on many-core processors.
      PubMed
      McIntosh-Smith, Simon; Price, James; Sessions, Richard B; Ibarra, Amaurys A
         2015-05-01
         Drug screening is an important part of the drug development pipeline for the pharmaceutical industry. Traditional, lab-based methods are increasingly being augmented with computational methods, ranging from simple molecular similarity searches through more complex pharmacophore matching to more computationally intensive approaches, such as molecular docking. The latter simulates the binding of drug molecules to their targets, typically protein molecules. In this work, we describe BUDE, the Bristol University Docking Engine, which has been ported to the OpenCL industry standard parallel programming language in order to exploit the performance of modern many-core processors. Our highly optimized OpenCL implementation of BUDE sustains 1.43 TFLOP/s on a single Nvidia GTX 680 GPU, or 46% of peak performance. BUDE also exploits OpenCL to deliver effective performance portability across a broad spectrum of different computer architectures from different vendors, including GPUs from Nvidia and AMD, Intel's Xeon Phi and multi-core CPUs with SIMD instruction sets.
      

      
      The cost of conservative synchronization in parallel discrete event simulations
      NASA Technical Reports Server (NTRS)
      Nicol, David M.
         1990-01-01
         The performance of a synchronous conservative parallel discrete-event simulation protocol is analyzed. The class of simulation models considered is oriented around a physical domain and possesses a limited ability to predict future behavior. A stochastic model is used to show that as the volume of simulation activity in the model increases relative to a fixed architecture, the complexity of the average per-event overhead due to synchronization, event list manipulation, lookahead calculations, and processor idle time approach the complexity of the average per-event overhead of a serial simulation. The method is therefore within a constant factor of optimal. The analysis demonstrates that on large problems--those for which parallel processing is ideally suited--there is often enough parallel workload so that processors are not usually idle. The viability of the method is also demonstrated empirically, showing how good performance is achieved on large problems using a thirty-two node Intel iPSC/2 distributed memory multiprocessor.
      

      
      Fast multipurpose Monte Carlo simulation for proton therapy using multi- and many-core CPU architectures.
      PubMed
      Souris, Kevin; Lee, John Aldo; Sterpin, Edmond
         2016-04-01
         Accuracy in proton therapy treatment planning can be improved using Monte Carlo (MC) simulations. However the long computation time of such methods hinders their use in clinical routine. This work aims to develop a fast multipurpose Monte Carlo simulation tool for proton therapy using massively parallel central processing unit (CPU) architectures. A new Monte Carlo, called MCsquare (many-core Monte Carlo), has been designed and optimized for the last generation of Intel Xeon processors and Intel Xeon Phi coprocessors. These massively parallel architectures offer the flexibility and the computational power suitable to MC methods. The class-II condensed history algorithm of MCsquare provides a fast and yet accurate method of simulating heavy charged particles such as protons, deuterons, and alphas inside voxelized geometries. Hard ionizations, with energy losses above a user-specified threshold, are simulated individually while soft events are regrouped in a multiple scattering theory. Elastic and inelastic nuclear interactions are sampled from ICRU 63 differential cross sections, thereby allowing for the computation of prompt gamma emission profiles. MCsquare has been benchmarked with the gate/geant4 Monte Carlo application for homogeneous and heterogeneous geometries. Comparisons with gate/geant4 for various geometries show deviations within 2%-1 mm. In spite of the limited memory bandwidth of the coprocessor simulation time is below 25 s for 10(7) primary 200 MeV protons in average soft tissues using all Xeon Phi and CPU resources embedded in a single desktop unit. MCsquare exploits the flexibility of CPU architectures to provide a multipurpose MC simulation tool. Optimized code enables the use of accurate MC calculation within a reasonable computation time, adequate for clinical practice. MCsquare also simulates prompt gamma emission and can thus be used also for in vivo range verification.
      

      
      Second International Workshop on Software Engineering and Code Design in Parallel Meteorological and Oceanographic Applications
      NASA Technical Reports Server (NTRS)
      OKeefe, Matthew (Editor); Kerr, Christopher L. (Editor)
         1998-01-01
         This report contains the abstracts and technical papers from the Second International Workshop on Software Engineering and Code Design in Parallel Meteorological and Oceanographic Applications, held June 15-18, 1998, in Scottsdale, Arizona. The purpose of the workshop is to bring together software developers in meteorology and oceanography to discuss software engineering and code design issues for parallel architectures, including Massively Parallel Processors (MPP's), Parallel Vector Processors (PVP's), Symmetric Multi-Processors (SMP's), Distributed Shared Memory (DSM) multi-processors, and clusters. Issues to be discussed include: (1) code architectures for current parallel models, including basic data structures, storage allocation, variable naming conventions, coding rules and styles, i/o and pre/post-processing of data; (2) designing modular code; (3) load balancing and domain decomposition; (4) techniques that exploit parallelism efficiently yet hide the machine-related details from the programmer; (5) tools for making the programmer more productive; and (6) the proliferation of programming models (F--, OpenMP, MPI, and HPF).
      

      
      Hypercluster - Parallel processing for computational mechanics
      NASA Technical Reports Server (NTRS)
      Blech, Richard A.
         1988-01-01
         An account is given of the development status, performance capabilities and implications for further development of NASA-Lewis' testbed 'hypercluster' parallel computer network, in which multiple processors communicate through a shared memory. Processors have local as well as shared memory; the hypercluster is expanded in the same manner as the hypercube, with processor clusters replacing the normal single processor node. The NASA-Lewis machine has three nodes with a vector personality and one node with a scalar personality. Each of the vector nodes uses four board-level vector processors, while the scalar node uses four general-purpose microcomputer boards.
      

      
      Efficient implementation of a 3-dimensional ADI method on the iPSC/860
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Van der Wijngaart, R.F.
         1993-12-31
         A comparison is made between several domain decomposition strategies for the solution of three-dimensional partial differential equations on a MIMD distributed memory parallel computer. The grids used are structured, and the numerical algorithm is ADI. Important implementation issues regarding load balancing, storage requirements, network latency, and overlap of computations and communications are discussed. Results of the solution of the three-dimensional heat equation on the Intel iPSC/860 are presented for the three most viable methods. It is found that the Bruno-Cappello decomposition delivers optimal computational speed through an almost complete elimination of processor idle time, while providing good memory efficiency.
      

      
      Kalman filter tracking on parallel architectures
      NASA Astrophysics Data System (ADS)
      Cerati, G.; Elmer, P.; Krutelyov, S.; Lantz, S.; Lefebvre, M.; McDermott, K.; Riley, D.; Tadel, M.; Wittich, P.; Wurthwein, F.; Yagil, A.
         2017-10-01
         We report on the progress of our studies towards a Kalman filter track reconstruction algorithm with optimal performance on manycore architectures. The combinatorial structure of these algorithms is not immediately compatible with an efficient SIMD (or SIMT) implementation; the challenge for us is to recast the existing software so it can readily generate hundreds of shared-memory threads that exploit the underlying instruction set of modern processors. We show how the data and associated tasks can be organized in a way that is conducive to both multithreading and vectorization. We demonstrate very good performance on Intel Xeon and Xeon Phi architectures, as well as promising first results on Nvidia GPUs.
      

      
      A low-cost vector processor boosting compute-intensive image processing operations
      NASA Technical Reports Server (NTRS)
      Adorf, Hans-Martin
         1992-01-01
         Low-cost vector processing (VP) is within reach of everyone seriously engaged in scientific computing. The advent of affordable add-on VP-boards for standard workstations complemented by mathematical/statistical libraries is beginning to impact compute-intensive tasks such as image processing. A case in point in the restoration of distorted images from the Hubble Space Telescope. A low-cost implementation is presented of the standard Tarasko-Richardson-Lucy restoration algorithm on an Intel i860-based VP-board which is seamlessly interfaced to a commercial, interactive image processing system. First experience is reported (including some benchmarks for standalone FFT's) and some conclusions are drawn.
      

      
      A parallel algorithm for switch-level timing simulation on a hypercube multiprocessor
      NASA Technical Reports Server (NTRS)
      Rao, Hariprasad Nannapaneni
         1989-01-01
         The parallel approach to speeding up simulation is studied, specifically the simulation of digital LSI MOS circuitry on the Intel iPSC/2 hypercube. The simulation algorithm is based on RSIM, an event driven switch-level simulator that incorporates a linear transistor model for simulating digital MOS circuits. Parallel processing techniques based on the concepts of Virtual Time and rollback are utilized so that portions of the circuit may be simulated on separate processors, in parallel for as large an increase in speed as possible. A partitioning algorithm is also developed in order to subdivide the circuit for parallel processing.
      

      
      I.T.S. camera deployment and systems integration P.I.N. 4ITV.09.121 : evaluation
      DOT National Transportation Integrated Search
      
         2006-08-01
         The Monroe County Department of Transportation (MCDOT) has built upon the success of its first computerized traffic signal system. The system has been operational since 1985. In the late 1990s the Department initiated a project to integrate Intell...
      

      
      Ordered fast fourier transforms on a massively parallel hypercube multiprocessor
      NASA Technical Reports Server (NTRS)
      Tong, Charles; Swarztrauber, Paul N.
         1989-01-01
         Design alternatives for ordered Fast Fourier Transformation (FFT) algorithms were examined on massively parallel hypercube multiprocessors such as the Connection Machine. Particular emphasis is placed on reducing communication which is known to dominate the overall computing time. To this end, the order and computational phases of the FFT were combined, and the sequence to processor maps that reduce communication were used. The class of ordered transforms is expanded to include any FFT in which the order of the transform is the same as that of the input sequence. Two such orderings are examined, namely, standard-order and A-order which can be implemented with equal ease on the Connection Machine where orderings are determined by geometries and priorities. If the sequence has N = 2 exp r elements and the hypercube has P = 2 exp d processors, then a standard-order FFT can be implemented with d + r/2 + 1 parallel transmissions. An A-order sequence can be transformed with 2d - r/2 parallel transmissions which is r - d + 1 fewer than the standard order. A parallel method for computing the trigonometric coefficients is presented that does not use trigonometric functions or interprocessor communication. A performance of 0.9 GFLOPS was obtained for an A-order transform on the Connection Machine.
      

      
      Ada Compiler Validation Summary Report: Certificate Number: 910510510W1. 11148 Irvine Compiler Corporation, ICC Ada v7.0.0, Vaxstation 3100 Model M38, VMS 5.3-1 (Host) to Intel i80960MC (bare machine) (Target).
      DTIC Science & Technology
      
         1991-08-01
         ADA2 0 54 ENTATION PAGE_ IO ft 070401) s . 1215 JftvnDveHIy Sk 24 Arknscd. VA 22202-4M2. ao o d sOicei al kiio.so and PRagulmoy Aftirs Oceo i Ai.trNY...AUTHOR( S ) Wright- Patterson AFB, Dayton, OH USA 7. PERFORMING ORGANIZATION NAME( S ) AND ADDRESS(ES) 8. PERFORMING ORGANIZATION Ada Validation Facility...AGENCY NAM E( S ) AND ADDRESS(ES) 10. SPONSORINGMONITORING AGENCY Ada Joint Program Off ice REPORT NUMBER United States Department of Defense Pentagon, Rm
      

      
      Debugging Fortran on a shared memory machine
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Allen, T.R.; Padua, D.A.
         1987-01-01
         Debugging on a parallel processor is more difficult than debugging on a serial machine because errors in a parallel program may introduce nondeterminism. The approach to parallel debugging presented here attempts to reduce the problem of debugging on a parallel machine to that of debugging on a serial machine by automatically detecting nondeterminism. 20 refs., 6 figs.
      

      
      Accelerated Monte Carlo Simulation on the Chemical Stage in Water Radiolysis using GPU
      PubMed Central
      Tian, Zhen; Jiang, Steve B.; Jia, Xun
         2018-01-01
         The accurate simulation of water radiolysis is an important step to understand the mechanisms of radiobiology and quantitatively test some hypotheses regarding radiobiological effects. However, the simulation of water radiolysis is highly time consuming, taking hours or even days to be completed by a conventional CPU processor. This time limitation hinders cell-level simulations for a number of research studies. We recently initiated efforts to develop gMicroMC, a GPU-based fast microscopic MC simulation package for water radiolysis. The first step of this project focused on accelerating the simulation of the chemical stage, the most time consuming stage in the entire water radiolysis process. A GPU-friendly parallelization strategy was designed to address the highly correlated many-body simulation problem caused by the mutual competitive chemical reactions between the radiolytic molecules. Two cases were tested, using a 750 keV electron and a 5 MeV proton incident in pure water, respectively. The time-dependent yields of all the radiolytic species during the chemical stage were used to evaluate the accuracy of the simulation. The relative differences between our simulation and the Geant4-DNA simulation were on average 5.3% and 4.4% for the two cases. Our package, executed on an Nvidia Titan black GPU card, successfully completed the chemical stage simulation of the two cases within 599.2 s and 489.0 s. As compared with Geant4-DNA that was executed on an Intel i7-5500U CPU processor and needed 28.6 h and 26.8 h for the two cases using a single CPU core, our package achieved a speed-up factor of 171.1-197.2. PMID:28323637
      

        
       
          

«

14
      15
      16
   17
      18
      »

          
        

     

   

   
       
            
              
          

«

15
      16
      17
   18
      19
      »

          
        

           
           
             
               
      
      Accelerated Monte Carlo simulation on the chemical stage in water radiolysis using GPU
      NASA Astrophysics Data System (ADS)
      Tian, Zhen; Jiang, Steve B.; Jia, Xun
         2017-04-01
         The accurate simulation of water radiolysis is an important step to understand the mechanisms of radiobiology and quantitatively test some hypotheses regarding radiobiological effects. However, the simulation of water radiolysis is highly time consuming, taking hours or even days to be completed by a conventional CPU processor. This time limitation hinders cell-level simulations for a number of research studies. We recently initiated efforts to develop gMicroMC, a GPU-based fast microscopic MC simulation package for water radiolysis. The first step of this project focused on accelerating the simulation of the chemical stage, the most time consuming stage in the entire water radiolysis process. A GPU-friendly parallelization strategy was designed to address the highly correlated many-body simulation problem caused by the mutual competitive chemical reactions between the radiolytic molecules. Two cases were tested, using a 750 keV electron and a 5 MeV proton incident in pure water, respectively. The time-dependent yields of all the radiolytic species during the chemical stage were used to evaluate the accuracy of the simulation. The relative differences between our simulation and the Geant4-DNA simulation were on average 5.3% and 4.4% for the two cases. Our package, executed on an Nvidia Titan black GPU card, successfully completed the chemical stage simulation of the two cases within 599.2 s and 489.0 s. As compared with Geant4-DNA that was executed on an Intel i7-5500U CPU processor and needed 28.6 h and 26.8 h for the two cases using a single CPU core, our package achieved a speed-up factor of 171.1-197.2.
      

      
      Accelerated Monte Carlo simulation on the chemical stage in water radiolysis using GPU.
      PubMed
      Tian, Zhen; Jiang, Steve B; Jia, Xun
         2017-04-21
         The accurate simulation of water radiolysis is an important step to understand the mechanisms of radiobiology and quantitatively test some hypotheses regarding radiobiological effects. However, the simulation of water radiolysis is highly time consuming, taking hours or even days to be completed by a conventional CPU processor. This time limitation hinders cell-level simulations for a number of research studies. We recently initiated efforts to develop gMicroMC, a GPU-based fast microscopic MC simulation package for water radiolysis. The first step of this project focused on accelerating the simulation of the chemical stage, the most time consuming stage in the entire water radiolysis process. A GPU-friendly parallelization strategy was designed to address the highly correlated many-body simulation problem caused by the mutual competitive chemical reactions between the radiolytic molecules. Two cases were tested, using a 750 keV electron and a 5 MeV proton incident in pure water, respectively. The time-dependent yields of all the radiolytic species during the chemical stage were used to evaluate the accuracy of the simulation. The relative differences between our simulation and the Geant4-DNA simulation were on average 5.3% and 4.4% for the two cases. Our package, executed on an Nvidia Titan black GPU card, successfully completed the chemical stage simulation of the two cases within 599.2 s and 489.0 s. As compared with Geant4-DNA that was executed on an Intel i7-5500U CPU processor and needed 28.6 h and 26.8 h for the two cases using a single CPU core, our package achieved a speed-up factor of 171.1-197.2.
      

      
      Task Parallel Incomplete Cholesky Factorization using 2D Partitioned-Block Layout
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Kim, Kyungjoo; Rajamanickam, Sivasankaran; Stelle, George Widgery
         
         We introduce a task-parallel algorithm for sparse incomplete Cholesky factorization that utilizes a 2D sparse partitioned-block layout of a matrix. Our factorization algorithm follows the idea of algorithms-by-blocks by using the block layout. The algorithm-byblocks approach induces a task graph for the factorization. These tasks are inter-related to each other through their data dependences in the factorization algorithm. To process the tasks on various manycore architectures in a portable manner, we also present a portable tasking API that incorporates different tasking backends and device-specific features using an open-source framework for manycore platforms i.e., Kokkos. A performance evaluation is presented onmore » both Intel Sandybridge and Xeon Phi platforms for matrices from the University of Florida sparse matrix collection to illustrate merits of the proposed task-based factorization. Experimental results demonstrate that our task-parallel implementation delivers about 26.6x speedup (geometric mean) over single-threaded incomplete Choleskyby- blocks and 19.2x speedup over serial Cholesky performance which does not carry tasking overhead using 56 threads on the Intel Xeon Phi processor for sparse matrices arising from various application problems.« less
      

      
      Performance prediction: A case study using a multi-ring KSR-1 machine
      NASA Technical Reports Server (NTRS)
      Sun, Xian-He; Zhu, Jianping
         1995-01-01
         While computers with tens of thousands of processors have successfully delivered high performance power for solving some of the so-called 'grand-challenge' applications, the notion of scalability is becoming an important metric in the evaluation of parallel machine architectures and algorithms. In this study, the prediction of scalability and its application are carefully investigated. A simple formula is presented to show the relation between scalability, single processor computing power, and degradation of parallelism. A case study is conducted on a multi-ring KSR1 shared virtual memory machine. Experimental and theoretical results show that the influence of topology variation of an architecture is predictable. Therefore, the performance of an algorithm on a sophisticated, heirarchical architecture can be predicted and the best algorithm-machine combination can be selected for a given application.
      

      
      The Event Based Language and Its Multiple Processor Implementations.
      DTIC Science & Technology
      
         1980-01-01
         10 6.1 "Recursive" Linear Fibonacci ................................................ 105 6.2 The Readers Writers Problem...kinds. Examples of such systems are: C.mmp [Wu-72], Pluribus [He-73], Data Flow [ De -75], the boolean n-cube parallel machine [Su-77], and the MuNet [Wa...concurrency within programs; therefore, we hate concentrated on two types of systems which seem suitable: a processor network, and a data flow processor [ De -77
      

      
      JPRS Report, Science & Technology, China, High-Performance Computer Systems
      DTIC Science & Technology
      
         1992-10-28
         microprocessor array The microprocessor array in the AP85 system is com- posed of 16 completely identical array element micro - processors . Each array element...microprocessors and capable of host machine reading and writing. The memory capacity of the array element micro - processors as a whole can be expanded...transmission functions to carry out data transmission from array element micro - processor to array element microprocessor, from array element
      

      
      Impact of Load Balancing on Unstructured Adaptive Grid Computations for Distributed-Memory Multiprocessors
      NASA Technical Reports Server (NTRS)
      Sohn, Andrew; Biswas, Rupak; Simon, Horst D.
         1996-01-01
         The computational requirements for an adaptive solution of unsteady problems change as the simulation progresses. This causes workload imbalance among processors on a parallel machine which, in turn, requires significant data movement at runtime. We present a new dynamic load-balancing framework, called JOVE, that balances the workload across all processors with a global view. Whenever the computational mesh is adapted, JOVE is activated to eliminate the load imbalance. JOVE has been implemented on an IBM SP2 distributed-memory machine in MPI for portability. Experimental results for two model meshes demonstrate that mesh adaption with load balancing gives more than a sixfold improvement over one without load balancing. We also show that JOVE gives a 24-fold speedup on 64 processors compared to sequential execution.
      

      
      Parallel eigenanalysis of finite element models in a completely connected architecture
      NASA Technical Reports Server (NTRS)
      Akl, F. A.; Morel, M. R.
         1989-01-01
         A parallel algorithm is presented for the solution of the generalized eigenproblem in linear elastic finite element analysis, (K)(phi) = (M)(phi)(omega), where (K) and (M) are of order N, and (omega) is order of q. The concurrent solution of the eigenproblem is based on the multifrontal/modified subspace method and is achieved in a completely connected parallel architecture in which each processor is allowed to communicate with all other processors. The algorithm was successfully implemented on a tightly coupled multiple-instruction multiple-data parallel processing machine, Cray X-MP. A finite element model is divided into m domains each of which is assumed to process n elements. Each domain is then assigned to a processor or to a logical processor (task) if the number of domains exceeds the number of physical processors. The macrotasking library routines are used in mapping each domain to a user task. Computational speed-up and efficiency are used to determine the effectiveness of the algorithm. The effect of the number of domains, the number of degrees-of-freedom located along the global fronts and the dimension of the subspace on the performance of the algorithm are investigated. A parallel finite element dynamic analysis program, p-feda, is documented and the performance of its subroutines in parallel environment is analyzed.
      

      
      Machine Vision Giving Eyes to Robots. Resources in Technology.
      ERIC Educational Resources Information Center
      Technology Teacher, 1990
         1990-01-01
         This module introduces machine vision, which can be used for inspection, robot guidance and part sorting. The future for machine vision will include new technology and will bring vision systems closer to the ultimate vision processor, the human eye. Includes a student quiz, outcomes, and activities. (JOW)
      

      
      Particle-in-Cell laser-plasma simulation on Xeon Phi coprocessors
      NASA Astrophysics Data System (ADS)
      Surmin, I. A.; Bastrakov, S. I.; Efimenko, E. S.; Gonoskov, A. A.; Korzhimanov, A. V.; Meyerov, I. B.
         2016-05-01
         This paper concerns the development of a high-performance implementation of the Particle-in-Cell method for plasma simulation on Intel Xeon Phi coprocessors. We discuss the suitability of the method for Xeon Phi architecture and present our experience in the porting and optimization of the existing parallel Particle-in-Cell code PICADOR. Direct porting without code modification gives performance on Xeon Phi close to that of an 8-core CPU on a benchmark problem with 50 particles per cell. We demonstrate step-by-step optimization techniques, such as improving data locality, enhancing parallelization efficiency and vectorization leading to an overall 4.2 × speedup on CPU and 7.5 × on Xeon Phi compared to the baseline version. The optimized version achieves 16.9 ns per particle update on an Intel Xeon E5-2660 CPU and 9.3 ns per particle update on an Intel Xeon Phi 5110P. For a real problem of laser ion acceleration in targets with surface grating, where a large number of macroparticles per cell is required, the speedup of Xeon Phi compared to CPU is 1.6 ×.
      

      
      LTE-Enhanced Cognitive Radio Network Testbed (LTE-CORNET)
      DTIC Science & Technology
      
         2016-11-01
         4 PERCENT_SUPPORTEDNAME FTE Equivalent: Total Number: Sub Contractors (DD882) Names of Personnel receiving masters degrees Names of personnel...Turbo, HT , 15M, 140W) Intel Core i7-3770 (3.4 GHz Quad Core, 77W) Dual Intel Xeon E5-2695 v4 (18C, 2.1GHz, 3.3GHz Turbo, 2400MHz, 45MB, 120W
      

      
      Computer-aided design/computer-aided manufacturing skull base drill.
      PubMed
      Couldwell, William T; MacDonald, Joel D; Thomas, Charles L; Hansen, Bradley C; Lapalikar, Aniruddha; Thakkar, Bharat; Balaji, Alagar K
         2017-05-01
         The authors have developed a simple device for computer-aided design/computer-aided manufacturing (CAD-CAM) that uses an image-guided system to define a cutting tool path that is shared with a surgical machining system for drilling bone. Information from 2D images (obtained via CT and MRI) is transmitted to a processor that produces a 3D image. The processor generates code defining an optimized cutting tool path, which is sent to a surgical machining system that can drill the desired portion of bone. This tool has applications for bone removal in both cranial and spine neurosurgical approaches. Such applications have the potential to reduce surgical time and associated complications such as infection or blood loss. The device enables rapid removal of bone within 1 mm of vital structures. The validity of such a machining tool is exemplified in the rapid (< 3 minutes machining time) and accurate removal of bone for transtemporal (for example, translabyrinthine) approaches.
      

      
      SeqBox: RNAseq/ChIPseq reproducible analysis on a consumer game computer.
      PubMed
      Beccuti, Marco; Cordero, Francesca; Arigoni, Maddalena; Panero, Riccardo; Amparore, Elvio G; Donatelli, Susanna; Calogero, Raffaele A
         2018-03-01
         Short reads sequencing technology has been used for more than a decade now. However, the analysis of RNAseq and ChIPseq data is still computational demanding and the simple access to raw data does not guarantee results reproducibility between laboratories. To address these two aspects, we developed SeqBox, a cheap, efficient and reproducible RNAseq/ChIPseq hardware/software solution based on NUC6I7KYK mini-PC (an Intel consumer game computer with a fast processor and a high performance SSD disk), and Docker container platform. In SeqBox the analysis of RNAseq and ChIPseq data is supported by a friendly GUI. This allows access to fast and reproducible analysis also to scientists with/without scripting experience. Docker container images, docker4seq package and the GUI are available at http://www.bioinformatica.unito.it/reproducibile.bioinformatics.html. beccuti@di.unito.it. Supplementary data are available at Bioinformatics online. © The Author(s) 2017. Published by Oxford University Press.
      

      
      Ada Compiler Validation Summary Report: Certificate Number: 900121S1. 10251 Computer Sciences Corporation MC Ada V1.2.Beta/Concurrent Computer Corporation Concurrent/Masscomp 5600 Host To Concurrent/Masscomp 5600 (Dual 68020 Processor Configuration) Target
      DTIC Science & Technology
      
         1990-04-23
         developed Ada Real - Time Operating System (ARTOS) for bare machine environments(Target), ACW 1.1I0. " ; - -M.UIECTTERMS Ada programming language, Ada...configuration) Operating System: CSC developed Ada Real - Time Operating System (ARTOS) for bare machine environments Memory Size: 4MB 2.2...Test Method Testing of the MC Ado V1.2.beta/ Concurrent Computer Corporation compiler and the CSC developed Ada Real - Time Operating System (ARTOS) for
      

      
      PANDA: A distributed multiprocessor operating system
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Chubb, P.
         1989-01-01
         PANDA is a design for a distributed multiprocessor and an operating system. PANDA is designed to allow easy expansion of both hardware and software. As such, the PANDA kernel provides only message passing and memory and process management. The other features needed for the system (device drivers, secondary storage management, etc.) are provided as replaceable user tasks. The thesis presents PANDA's design and implementation, both hardware and software. PANDA uses multiple 68010 processors sharing memory on a VME bus, each such node potentially connected to others via a high speed network. The machine is completely homogeneous: there are no differencesmore » between processors that are detectable by programs running on the machine. A single two-processor node has been constructed. Each processor contains memory management circuits designed to allow processors to share page tables safely. PANDA presents a programmers' model similar to the hardware model: a job is divided into multiple tasks, each having its own address space. Within each task, multiple processes share code and data. Tasks can send messages to each other, and set up virtual circuits between themselves. Peripheral devices such as disc drives are represented within PANDA by tasks. PANDA divides secondary storage into volumes, each volume being accessed by a volume access task, or VAT. All knowledge about the way that data is stored on a disc is kept in its volume's VAT. The design is such that PANDA should provide a useful testbed for file systems and device drivers, as these can be installed without recompiling PANDA itself, and without rebooting the machine.« less
      

      
      Intelligent Vision On The SM9O Mini-Computer Basis And Applications
      NASA Astrophysics Data System (ADS)
      Hawryszkiw, J.
         1985-02-01
         Distinction has to be made between image processing and vision Image processing finds its roots in the strong tradition of linear signal processing and promotes geometrical transform techniques, such as fi I tering , compression, and restoration. Its purpose is to transform an image for a human observer to easily extract from that image information significant for him. For example edges after a gradient operator, or a specific direction after a directional filtering operation. Image processing consists in fact in a set of local or global space-time transforms. The interpretation of the final image is done by the human observer. The purpose of vision is to extract the semantic content of the image. The machine can then understand that content, and run a process of decision, which turns into an action. Thus, intel I i gent vision depends on - Image processing - Pattern recognition - Artificial intel I igence
      

      
      RASSP signal processing architectures
      NASA Astrophysics Data System (ADS)
      Shirley, Fred; Bassett, Bob; Letellier, J. P.
         1995-06-01
         The rapid prototyping of application specific signal processors (RASSP) program is an ARPA/tri-service effort to dramatically improve the process by which complex digital systems, particularly embedded signal processors, are specified, designed, documented, manufactured, and supported. The domain of embedded signal processing was chosen because it is important to a variety of military and commercial applications as well as for the challenge it presents in terms of complexity and performance demands. The principal effort is being performed by two major contractors, Lockheed Sanders (Nashua, NH) and Martin Marietta (Camden, NJ). For both, improvements in methodology are to be exercised and refined through the performance of individual 'Demonstration' efforts. The Lockheed Sanders' Demonstration effort is to develop an infrared search and track (IRST) processor. In addition, both contractors' results are being measured by a series of externally administered (by Lincoln Labs) six-month Benchmark programs that measure process improvement as a function of time. The first two Benchmark programs are designing and implementing a synthetic aperture radar (SAR) processor. Our demonstration team is using commercially available VME modules from Mercury Computer to assemble a multiprocessor system scalable from one to hundreds of Intel i860 microprocessors. Custom modules for the sensor interface and display driver are also being developed. This system implements either proprietary or Navy owned algorithms to perform the compute-intensive IRST function in real time in an avionics environment. Our Benchmark team is designing custom modules using commercially available processor ship sets, communication submodules, and reconfigurable logic devices. One of the modules contains multiple vector processors optimized for fast Fourier transform processing. Another module is a fiberoptic interface that accepts high-rate input data from the sensors and provides video-rate output data to a display. This paper discusses the impact of simulation on choosing signal processing algorithms and architectures, drawing from the experiences of the Demonstration and Benchmark inter-company teams at Lockhhed Sanders, Motorola, Hughes, and ISX.
      

      
      A new nonlinear conjugate gradient coefficient under strong Wolfe-Powell line search
      NASA Astrophysics Data System (ADS)
      Mohamed, Nur Syarafina; Mamat, Mustafa; Rivaie, Mohd
         2017-08-01
         A nonlinear conjugate gradient method (CG) plays an important role in solving a large-scale unconstrained optimization problem. This method is widely used due to its simplicity. The method is known to possess sufficient descend condition and global convergence properties. In this paper, a new nonlinear of CG coefficient βk is presented by employing the Strong Wolfe-Powell inexact line search. The new βk performance is tested based on number of iterations and central processing unit (CPU) time by using MATLAB software with Intel Core i7-3470 CPU processor. Numerical experimental results show that the new βk converge rapidly compared to other classical CG method.
      

      
      A microprocessor based anti-aliasing filter for a PCM system
      NASA Technical Reports Server (NTRS)
      Morrow, D. C.; Sandlin, D. R.
         1984-01-01
         Described is the design and evaluation of a microprocessor based digital filter. The filter was made to investigate the feasibility of a digital replacement for the analog pre-sampling filters used in telemetry systems at the NASA Ames-Dryden Flight Research Facility (DFRF). The digital filter will utilize an Intel 2920 Analog Signal Processor (ASP) chip. Testing includes measurements of: (1) the filter frequency response and, (2) the filter signal resolution. The evaluation of the digital filter was made on the basis of circuit size, projected environmental stability and filter resolution. The 2920 based digital filter was found to meet or exceed the pre-sampling filter specifications for limited signal resolution applications.
      

      
      Noiseless coding for the magnetometer
      NASA Technical Reports Server (NTRS)
      Rice, Robert F.; Lee, Jun-Ji
         1987-01-01
         Future unmanned space missions will continue to seek a full understanding of magnetic fields throughout the solar system. Severely constrained data rates during certain portions of these missions could limit the possible science return. This publication investigates the application of universal noiseless coding techniques to more efficiently represent magnetometer data without any loss in data integrity. Performance results indicated that compression factors of 2:1 to 6:1 can be expected. Feasibility for general deep space application was demonstrated by implementing a microprocessor breadboard coder/decoder using the Intel 8086 processor. The Comet Rendezvous Asteroid Flyby mission will incorporate these techniques in a buffer feedback, rate-controlled configuration. The characteristics of this system are discussed.
      

        
       
          

«

15
      16
      17
   18
      19
      »

          
        

     

   

   
       
            
              
          

«

16
      17
      18
   19
      20
      »

          
        

           
           
             
               
      
      Electromagnetic Physics Models for Parallel Computing Architectures
      NASA Astrophysics Data System (ADS)
      Amadio, G.; Ananya, A.; Apostolakis, J.; Aurora, A.; Bandieramonte, M.; Bhattacharyya, A.; Bianchini, C.; Brun, R.; Canal, P.; Carminati, F.; Duhem, L.; Elvira, D.; Gheata, A.; Gheata, M.; Goulas, I.; Iope, R.; Jun, S. Y.; Lima, G.; Mohanty, A.; Nikitina, T.; Novak, M.; Pokorski, W.; Ribon, A.; Seghal, R.; Shadura, O.; Vallecorsa, S.; Wenzel, S.; Zhang, Y.
         2016-10-01
         The recent emergence of hardware architectures characterized by many-core or accelerated processors has opened new opportunities for concurrent programming models taking advantage of both SIMD and SIMT architectures. GeantV, a next generation detector simulation, has been designed to exploit both the vector capability of mainstream CPUs and multi-threading capabilities of coprocessors including NVidia GPUs and Intel Xeon Phi. The characteristics of these architectures are very different in terms of the vectorization depth and type of parallelization needed to achieve optimal performance. In this paper we describe implementation of electromagnetic physics models developed for parallel computing architectures as a part of the GeantV project. Results of preliminary performance evaluation and physics validation are presented as well.
      

      
      Integral Fast Reactor fuel pin processor
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Levinskas, D.
         1993-01-01
         This report discusses the pin processor which receives metal alloy pins cast from recycled Integral Fast Reactor (IFR) fuel and prepares them for assembly into new IFR fuel elements. Either full length as-cast or precut pins are fed to the machine from a magazine, cut if necessary, and measured for length, weight, diameter and deviation from straightness. Accepted pins are loaded into cladding jackets located in a magazine, while rejects and cutting scraps are separated into trays. The magazines, trays, and the individual modules that perform the different machine functions are assembled and removed using remote manipulators and master-slaves.
      

      
      Integral Fast Reactor fuel pin processor
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Levinskas, D.
         1993-03-01
         This report discusses the pin processor which receives metal alloy pins cast from recycled Integral Fast Reactor (IFR) fuel and prepares them for assembly into new IFR fuel elements. Either full length as-cast or precut pins are fed to the machine from a magazine, cut if necessary, and measured for length, weight, diameter and deviation from straightness. Accepted pins are loaded into cladding jackets located in a magazine, while rejects and cutting scraps are separated into trays. The magazines, trays, and the individual modules that perform the different machine functions are assembled and removed using remote manipulators and master-slaves.
      

      
      High performance ultrasonic field simulation on complex geometries
      NASA Astrophysics Data System (ADS)
      Chouh, H.; Rougeron, G.; Chatillon, S.; Iehl, J. C.; Farrugia, J. P.; Ostromoukhov, V.
         2016-02-01
         Ultrasonic field simulation is a key ingredient for the design of new testing methods as well as a crucial step for NDT inspection simulation. As presented in a previous paper [1], CEA-LIST has worked on the acceleration of these simulations focusing on simple geometries (planar interfaces, isotropic materials). In this context, significant accelerations were achieved on multicore processors and GPUs (Graphics Processing Units), bringing the execution time of realistic computations in the 0.1 s range. In this paper, we present recent works that aim at similar performances on a wider range of configurations. We adapted the physical model used by the CIVA platform to design and implement a new algorithm providing a fast ultrasonic field simulation that yields nearly interactive results for complex cases. The improvements over the CIVA pencil-tracing method include adaptive strategies for pencil subdivisions to achieve a good refinement of the sensor geometry while keeping a reasonable number of ray-tracing operations. Also, interpolation of the times of flight was used to avoid time consuming computations in the impulse response reconstruction stage. To achieve the best performance, our algorithm runs on multi-core superscalar CPUs and uses high performance specialized libraries such as Intel Embree for ray-tracing, Intel MKL for signal processing and Intel TBB for parallelization. We validated the simulation results by comparing them to the ones produced by CIVA on identical test configurations including mono-element and multiple-element transducers, homogeneous, meshed 3D CAD specimens, isotropic and anisotropic materials and wave paths that can involve several interactions with interfaces. We show performance results on complete simulations that achieve computation times in the 1s range.
      

      
      Missileborne Artificial Vision System (MAVIS)
      NASA Technical Reports Server (NTRS)
      Andes, David K.; Witham, James C.; Miles, Michael D.
         1994-01-01
         Several years ago when INTEL and China Lake designed the ETANN chip, analog VLSI appeared to be the only way to do high density neural computing. In the last five years, however, digital parallel processing chips capable of performing neural computation functions have evolved to the point of rough equality with analog chips in system level computational density. The Naval Air Warfare Center, China Lake, has developed a real time, hardware and software system designed to implement and evaluate biologically inspired retinal and cortical models. The hardware is based on the Adaptive Solutions Inc. massively parallel CNAPS system COHO boards. Each COHO board is a standard size 6U VME card featuring 256 fixed point, RISC processors running at 20 MHz in a SIMD configuration. Each COHO board has a companion board built to support a real time VSB interface to an imaging seeker, a NTSC camera, and to other COHO boards. The system is designed to have multiple SIMD machines each performing different corticomorphic functions. The system level software has been developed which allows a high level description of corticomorphic structures to be translated into the native microcode of the CNAPS chips. Corticomorphic structures are those neural structures with a form similar to that of the retina, the lateral geniculate nucleus, or the visual cortex. This real time hardware system is designed to be shrunk into a volume compatible with air launched tactical missiles. Initial versions of the software and hardware have been completed and are in the early stages of integration with a missile seeker.
      

      
      Input-independent, Scalable and Fast String Matching on the Cray XMT
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Villa, Oreste; Chavarría-Miranda, Daniel; Maschhoff, Kristyn J
         2009-05-25
         String searching is at the core of many security and network applications like search engines, intrusion detection systems, virus scanners and spam filters. The growing size of on-line content and the increasing wire speeds push the need for fast, and often real- time, string searching solutions. For these conditions, many software implementations (if not all) targeting conventional cache-based microprocessors do not perform well. They either exhibit overall low performance or exhibit highly variable performance depending on the types of inputs. For this reason, real-time state of the art solutions rely on the use of either custom hardware or Field-Programmable Gatemore » Arrays (FPGAs) at the expense of overall system flexibility and programmability. This paper presents a software based implementation of the Aho-Corasick string searching algorithm on the Cray XMT multithreaded shared memory machine. Our so- lution relies on the particular features of the XMT architecture and on several algorith- mic strategies: it is fast, scalable and its performance is virtually content-independent. On a 128-processor Cray XMT, it reaches a scanning speed of ≈ 28 Gbps with a performance variability below 10 %. In the 10 Gbps performance range, variability is below 2.5%. By comparison, an Intel dual-socket, 8-core system running at 2.66 GHz achieves a peak performance which varies from 500 Mbps to 10 Gbps depending on the type of input and dictionary size.« less
      

      
      An accelerated framework for the classification of biological targets from solid-state micropore data.
      PubMed
      Hanif, Madiha; Hafeez, Abdul; Suleman, Yusuf; Mustafa Rafique, M; Butt, Ali R; Iqbal, Samir M
         2016-10-01
         Micro- and nanoscale systems have provided means to detect biological targets, such as DNA, proteins, and human cells, at ultrahigh sensitivity. However, these devices suffer from noise in the raw data, which continues to be significant as newer and devices that are more sensitive produce an increasing amount of data that needs to be analyzed. An important dimension that is often discounted in these systems is the ability to quickly process the measured data for an instant feedback. Realizing and developing algorithms for the accurate detection and classification of biological targets in realtime is vital. Toward this end, we describe a supervised machine-learning approach that records single cell events (pulses), computes useful pulse features, and classifies the future patterns into their respective types, such as cancerous/non-cancerous cells based on the training data. The approach detects cells with an accuracy of 70% from the raw data followed by an accurate classification when larger training sets are employed. The parallel implementation of the algorithm on graphics processing unit (GPU) demonstrates a speedup of three to four folds as compared to a serial implementation on an Intel Core i7 processor. This incredibly efficient GPU system is an effort to streamline the analysis of pulse data in an academic setting. This paper presents for the first time ever, a non-commercial technique using a GPU system for realtime analysis, paired with biological cluster targeting analysis. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
      

      
      Design of a modified adaptive neuro fuzzy inference system classifier for medical diagnosis of Pima Indians Diabetes
      NASA Astrophysics Data System (ADS)
      Sagir, Abdu Masanawa; Sathasivam, Saratha
         2017-08-01
         Medical diagnosis is the process of determining which disease or medical condition explains a person's determinable signs and symptoms. Diagnosis of most of the diseases is very expensive as many tests are required for predictions. This paper aims to introduce an improved hybrid approach for training the adaptive network based fuzzy inference system with Modified Levenberg-Marquardt algorithm using analytical derivation scheme for computation of Jacobian matrix. The goal is to investigate how certain diseases are affected by patient's characteristics and measurement such as abnormalities or a decision about presence or absence of a disease. To achieve an accurate diagnosis at this complex stage of symptom analysis, the physician may need efficient diagnosis system to classify and predict patient condition by using an adaptive neuro fuzzy inference system (ANFIS) pre-processed by grid partitioning. The proposed hybridised intelligent system was tested with Pima Indian Diabetes dataset obtained from the University of California at Irvine's (UCI) machine learning repository. The proposed method's performance was evaluated based on training and test datasets. In addition, an attempt was done to specify the effectiveness of the performance measuring total accuracy, sensitivity and specificity. In comparison, the proposed method achieves superior performance when compared to conventional ANFIS based gradient descent algorithm and some related existing methods. The software used for the implementation is MATLAB R2014a (version 8.3) and executed in PC Intel Pentium IV E7400 processor with 2.80 GHz speed and 2.0 GB of RAM.
      

      
      An efficient and portable SIMD algorithm for charge/current deposition in Particle-In-Cell codes
      DOE PAGES
      Vincenti, H.; Lobet, M.; Lehe, R.; ...
         2016-09-19
         In current computer architectures, data movement (from die to network) is by far the most energy consuming part of an algorithm (≈20pJ/word on-die to ≈10,000 pJ/word on the network). To increase memory locality at the hardware level and reduce energy consumption related to data movement, future exascale computers tend to use many-core processors on each compute nodes that will have a reduced clock speed to allow for efficient cooling. To compensate for frequency decrease, machine vendors are making use of long SIMD instruction registers that are able to process multiple data with one arithmetic operator in one clock cycle. SIMD registermore » length is expected to double every four years. As a consequence, Particle-In-Cell (PIC) codes will have to achieve good vectorization to fully take advantage of these upcoming architectures. In this paper, we present a new algorithm that allows for efficient and portable SIMD vectorization of current/charge deposition routines that are, along with the field gathering routines, among the most time consuming parts of the PIC algorithm. Our new algorithm uses a particular data structure that takes into account memory alignment constraints and avoids gather/scat;ter instructions that can significantly affect vectorization performances on current CPUs. The new algorithm was successfully implemented in the 3D skeleton PIC code PICSAR and tested on Haswell Xeon processors (AVX2-256 bits wide data registers). Results show a factor of ×2 to ×2.5 speed-up in double precision for particle shape factor of orders 1–3. The new algorithm can be applied as is on future KNL (Knights Landing) architectures that will include AVX-512 instruction sets with 512 bits register lengths (8 doubles/16 singles). Program summary Program Title: vec_deposition Program Files doi:http://dx.doi.org/10.17632/nh77fv9k8c.1 Licensing provisions: BSD 3-Clause Programming language: Fortran 90 External routines/libraries:  OpenMP > 4.0 Nature of problem: Exascale architectures will have many-core processors per node with long vector data registers capable of performing one single instruction on multiple data during one clock cycle. Data register lengths are expected to double every four years and this pushes for new portable solutions for efficiently vectorizing Particle-In-Cell codes on these future many-core architectures. One of the main hotspot routines of the PIC algorithm is the current/charge deposition for which there is no efficient and portable vector algorithm. Solution method: Here we provide an efficient and portable vector algorithm of current/charge deposition routines that uses a new data structure, which significantly reduces gather/scatter operations. Vectorization is controlled using OpenMP 4.0 compiler directives for vectorization which ensures portability across different architectures. Restrictions: Here we do not provide the full PIC algorithm with an executable but only vector routines for current/charge deposition. These scalar/vector routines can be used as library routines in your 3D Particle-In-Cell code. However, to get the best performances out of vector routines you have to satisfy the two following requirements: (1) Your code should implement particle tiling (as explained in the manuscript) to allow for maximized cache reuse and reduce memory accesses that can hinder vector performances. The routines can be used directly on each particle tile. (2) You should compile your code with a Fortran 90 compiler (e.g Intel, gnu or cray) and provide proper alignment flags and compiler alignment directives (more details in README file).« less
      

      
      An efficient and portable SIMD algorithm for charge/current deposition in Particle-In-Cell codes
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Vincenti, H.; Lobet, M.; Lehe, R.
         
         In current computer architectures, data movement (from die to network) is by far the most energy consuming part of an algorithm (≈20pJ/word on-die to ≈10,000 pJ/word on the network). To increase memory locality at the hardware level and reduce energy consumption related to data movement, future exascale computers tend to use many-core processors on each compute nodes that will have a reduced clock speed to allow for efficient cooling. To compensate for frequency decrease, machine vendors are making use of long SIMD instruction registers that are able to process multiple data with one arithmetic operator in one clock cycle. SIMD registermore » length is expected to double every four years. As a consequence, Particle-In-Cell (PIC) codes will have to achieve good vectorization to fully take advantage of these upcoming architectures. In this paper, we present a new algorithm that allows for efficient and portable SIMD vectorization of current/charge deposition routines that are, along with the field gathering routines, among the most time consuming parts of the PIC algorithm. Our new algorithm uses a particular data structure that takes into account memory alignment constraints and avoids gather/scat;ter instructions that can significantly affect vectorization performances on current CPUs. The new algorithm was successfully implemented in the 3D skeleton PIC code PICSAR and tested on Haswell Xeon processors (AVX2-256 bits wide data registers). Results show a factor of ×2 to ×2.5 speed-up in double precision for particle shape factor of orders 1–3. The new algorithm can be applied as is on future KNL (Knights Landing) architectures that will include AVX-512 instruction sets with 512 bits register lengths (8 doubles/16 singles). Program summary Program Title: vec_deposition Program Files doi:http://dx.doi.org/10.17632/nh77fv9k8c.1 Licensing provisions: BSD 3-Clause Programming language: Fortran 90 External routines/libraries:  OpenMP > 4.0 Nature of problem: Exascale architectures will have many-core processors per node with long vector data registers capable of performing one single instruction on multiple data during one clock cycle. Data register lengths are expected to double every four years and this pushes for new portable solutions for efficiently vectorizing Particle-In-Cell codes on these future many-core architectures. One of the main hotspot routines of the PIC algorithm is the current/charge deposition for which there is no efficient and portable vector algorithm. Solution method: Here we provide an efficient and portable vector algorithm of current/charge deposition routines that uses a new data structure, which significantly reduces gather/scatter operations. Vectorization is controlled using OpenMP 4.0 compiler directives for vectorization which ensures portability across different architectures. Restrictions: Here we do not provide the full PIC algorithm with an executable but only vector routines for current/charge deposition. These scalar/vector routines can be used as library routines in your 3D Particle-In-Cell code. However, to get the best performances out of vector routines you have to satisfy the two following requirements: (1) Your code should implement particle tiling (as explained in the manuscript) to allow for maximized cache reuse and reduce memory accesses that can hinder vector performances. The routines can be used directly on each particle tile. (2) You should compile your code with a Fortran 90 compiler (e.g Intel, gnu or cray) and provide proper alignment flags and compiler alignment directives (more details in README file).« less
      

      
      Modeling & Analysis of Multicore Architectures for Embedded SIGINT Applications
      DTIC Science & Technology
      
         2015-03-01
         NVIDIA Kepler K20 [7][8] 2496e 706 225 3520 15.6 Intel Xeon Phi 5110P [9] 60 1050 225 1010 4.5 Adapteva Epiphany [10] 16 – 4K 800 0.270 19 70.4...Cortex A15 and a Kepler GPU with 192 “CUDA” cores, and is more comparable as an HPEEC platform than Tesla series GPUs, such as the NVIDIA C2075 and K20
      

      
      
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Trędak, Przemysław, E-mail: przemyslaw.tredak@fuw.edu.pl; Rudnicki, Witold R.; Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, ul. Pawińskiego 5a, 02-106 Warsaw
         
         The second generation Reactive Bond Order (REBO) empirical potential is commonly used to accurately model a wide range hydrocarbon materials. It is also extensible to other atom types and interactions. REBO potential assumes complex multi-body interaction model, that is difficult to represent efficiently in the SIMD or SIMT programming model. Hence, despite its importance, no efficient GPGPU implementation has been developed for this potential. Here we present a detailed description of a highly efficient GPGPU implementation of molecular dynamics algorithm using REBO potential. The presented algorithm takes advantage of rarely used properties of the SIMT architecture of a modern GPUmore » to solve difficult synchronizations issues that arise in computations of multi-body potential. Techniques developed for this problem may be also used to achieve efficient solutions of different problems. The performance of proposed algorithm is assessed using a range of model systems. It is compared to highly optimized CPU implementation (both single core and OpenMP) available in LAMMPS package. These experiments show up to 6x improvement in forces computation time using single processor of the NVIDIA Tesla K80 compared to high end 16-core Intel Xeon processor.« less
      

      
      Parallel halftoning technique using dot diffusion optimization
      NASA Astrophysics Data System (ADS)
      Molina-Garcia, Javier; Ponomaryov, Volodymyr I.; Reyes-Reyes, Rogelio; Cruz-Ramos, Clara
         2017-05-01
         In this paper, a novel approach for halftone images is proposed and implemented for images that are obtained by the Dot Diffusion (DD) method. Designed technique is based on an optimization of the so-called class matrix used in DD algorithm and it consists of generation new versions of class matrix, which has no baron and near-baron in order to minimize inconsistencies during the distribution of the error. Proposed class matrix has different properties and each is designed for two different applications: applications where the inverse-halftoning is necessary, and applications where this method is not required. The proposed method has been implemented in GPU (NVIDIA GeForce GTX 750 Ti), multicore processors (AMD FX(tm)-6300 Six-Core Processor and in Intel core i5-4200U), using CUDA and OpenCV over a PC with linux. Experimental results have shown that novel framework generates a good quality of the halftone images and the inverse halftone images obtained. The simulation results using parallel architectures have demonstrated the efficiency of the novel technique when it is implemented in real-time processing.
      

      
      Cache Coherence Protocols for Large-Scale Multiprocessors
      DTIC Science & Technology
      
         1990-09-01
         and is compared with the other protocols for large-scale machines. In later analysis, this coherence method is designated by the acronym OCPD , which...private read misses 2 6 6 ( OCPD ) private write misses 2 6 6 Table 4.2: Transaction Types and Costs. the performance of the memory system. These...methodologies. Figure 4-2 shows the processor utiliza- tions of the Weather program, with special code in the dyn-nic post-mortem sched- 94 OCPD DlrINB
      

      
      Sentient networks
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Chapline, G.
         1998-03-01
         The engineering problems of constructing autonomous networks of sensors and data processors that can provide alerts for dangerous situations provide a new context for debating the question whether man-made systems can emulate the cognitive capabilities of the mammalian brain. In this paper we consider the question whether a distributed network of sensors and data processors can form ``perceptions`` based on sensory data. Because sensory data can have exponentially many explanations, the use of a central data processor to analyze the outputs from a large ensemble of sensors will in general introduce unacceptable latencies for responding to dangerous situations. A bettermore » idea is to use a distributed ``Helmholtz machine`` architecture in which the sensors are connected to a network of simple processors, and the collective state of the network as a whole provides an explanation for the sensory data. In general communication within such a network will require time division multiplexing, which opens the door to the possibility that with certain refinements to the Helmholtz machine architecture it may be possible to build sensor networks that exhibit a form of artificial consciousness.« less
      

      
      Sociable Machines: Expressive Social Exchange between Humans and Robots
      DTIC Science & Technology
      
         2000-05-01
         many occasions about theories on emotion. I’ve cornered Robert Irie’s again and again about auditory processing. I’ve bugged Matto Marjanovic throughout...development at the MIT Artificial Intel- ligence Lab (Brooks, Breazeal, Marjanovic , Scassellati & Williamson 1999). Cog is a general purpose humanoid...RA-2, 253-262. Brooks, R. A., Breazeal, C., Marjanovic , M., Scassellati, B. & Williamson, M. M. (1999), The Cog Project: Building a Humanoid Robot, in
      

      
      Fast adaptive composite grid methods on distributed parallel architectures
      NASA Technical Reports Server (NTRS)
      Lemke, Max; Quinlan, Daniel
         1992-01-01
         The fast adaptive composite (FAC) grid method is compared with the adaptive composite method (AFAC) under variety of conditions including vectorization and parallelization. Results are given for distributed memory multiprocessor architectures (SUPRENUM, Intel iPSC/2 and iPSC/860). It is shown that the good performance of AFAC and its superiority over FAC in a parallel environment is a property of the algorithm and not dependent on peculiarities of any machine.
      

      
      Development of seismic tomography software for hybrid supercomputers
      NASA Astrophysics Data System (ADS)
      Nikitin, Alexandr; Serdyukov, Alexandr; Duchkov, Anton
         2015-04-01
         Seismic tomography is a technique used for computing velocity model of geologic structure from first arrival travel times of seismic waves. The technique is used in processing of regional and global seismic data, in seismic exploration for prospecting and exploration of mineral and hydrocarbon deposits, and in seismic engineering for monitoring the condition of engineering structures and the surrounding host medium. As a consequence of development of seismic monitoring systems and increasing volume of seismic data, there is a growing need for new, more effective computational algorithms for use in seismic tomography applications with improved performance, accuracy and resolution. To achieve this goal, it is necessary to use modern high performance computing systems, such as supercomputers with hybrid architecture that use not only CPUs, but also accelerators and co-processors for computation. The goal of this research is the development of parallel seismic tomography algorithms and software package for such systems, to be used in processing of large volumes of seismic data (hundreds of gigabytes and more). These algorithms and software package will be optimized for the most common computing devices used in modern hybrid supercomputers, such as Intel Xeon CPUs, NVIDIA Tesla accelerators and Intel Xeon Phi co-processors. In this work, the following general scheme of seismic tomography is utilized. Using the eikonal equation solver, arrival times of seismic waves are computed based on assumed velocity model of geologic structure being analyzed. In order to solve the linearized inverse problem, tomographic matrix is computed that connects model adjustments with travel time residuals, and the resulting system of linear equations is regularized and solved to adjust the model. The effectiveness of parallel implementations of existing algorithms on target architectures is considered. During the first stage of this work, algorithms were developed for execution on supercomputers using multicore CPUs only, with preliminary performance tests showing good parallel efficiency on large numerical grids. Porting of the algorithms to hybrid supercomputers is currently ongoing.
      

      
      DeepX: Deep Learning Accelerator for Restricted Boltzmann Machine Artificial Neural Networks.
      PubMed
      Kim, Lok-Won
         2018-05-01
         Although there have been many decades of research and commercial presence on high performance general purpose processors, there are still many applications that require fully customized hardware architectures for further computational acceleration. Recently, deep learning has been successfully used to learn in a wide variety of applications, but their heavy computation demand has considerably limited their practical applications. This paper proposes a fully pipelined acceleration architecture to alleviate high computational demand of an artificial neural network (ANN) which is restricted Boltzmann machine (RBM) ANNs. The implemented RBM ANN accelerator (integrating network size, using 128 input cases per batch, and running at a 303-MHz clock frequency) integrated in a state-of-the art field-programmable gate array (FPGA) (Xilinx Virtex 7 XC7V-2000T) provides a computational performance of 301-billion connection-updates-per-second and about 193 times higher performance than a software solution running on general purpose processors. Most importantly, the architecture enables over 4 times (12 times in batch learning) higher performance compared with a previous work when both are implemented in an FPGA device (XC2VP70).
      

      
      Using Coarrays to Parallelize Legacy Fortran Applications: Strategy and Case Study
      DOE PAGES
      Radhakrishnan, Hari; Rouson, Damian W. I.; Morris, Karla; ...
         2015-01-01
         This paper summarizes a strategy for parallelizing a legacy Fortran 77 program using the object-oriented (OO) and coarray features that entered Fortran in the 2003 and 2008 standards, respectively. OO programming (OOP) facilitates the construction of an extensible suite of model-verification and performance tests that drive the development. Coarray parallel programming facilitates a rapid evolution from a serial application to a parallel application capable of running on multicore processors and many-core accelerators in shared and distributed memory. We delineate 17 code modernization steps used to refactor and parallelize the program and study the resulting performance. Our initial studies were donemore » using the Intel Fortran compiler on a 32-core shared memory server. Scaling behavior was very poor, and profile analysis using TAU showed that the bottleneck in the performance was due to our implementation of a collective, sequential summation procedure. We were able to improve the scalability and achieve nearly linear speedup by replacing the sequential summation with a parallel, binary tree algorithm. We also tested the Cray compiler, which provides its own collective summation procedure. Intel provides no collective reductions. With Cray, the program shows linear speedup even in distributed-memory execution. We anticipate similar results with other compilers once they support the new collective procedures proposed for Fortran 2015.« less
      

        
       
          

«

16
      17
      18
   19
      20
      »

          
        

     

   

   
       
            
              
          

«

17
      18
      19
   20
      21
      »

          
        

           
           
             
               
      
      Modeling a Million-Node Slim Fly Network Using Parallel Discrete-Event Simulation
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Wolfe, Noah; Carothers, Christopher; Mubarak, Misbah
         
         As supercomputers close in on exascale performance, the increased number of processors and processing power translates to an increased demand on the underlying network interconnect. The Slim Fly network topology, a new lowdiameter and low-latency interconnection network, is gaining interest as one possible solution for next-generation supercomputing interconnect systems. In this paper, we present a high-fidelity Slim Fly it-level model leveraging the Rensselaer Optimistic Simulation System (ROSS) and Co-Design of Exascale Storage (CODES) frameworks. We validate our Slim Fly model with the Kathareios et al. Slim Fly model results provided at moderately sized network scales. We further scale the modelmore » size up to n unprecedented 1 million compute nodes; and through visualization of network simulation metrics such as link bandwidth, packet latency, and port occupancy, we get an insight into the network behavior at the million-node scale. We also show linear strong scaling of the Slim Fly model on an Intel cluster achieving a peak event rate of 36 million events per second using 128 MPI tasks to process 7 billion events. Detailed analysis of the underlying discrete-event simulation performance shows that a million-node Slim Fly model simulation can execute in 198 seconds on the Intel cluster.« less
      

      
      Reducing adaptive optics latency using Xeon Phi many-core processors
      NASA Astrophysics Data System (ADS)
      Barr, David; Basden, Alastair; Dipper, Nigel; Schwartz, Noah
         2015-11-01
         The next generation of Extremely Large Telescopes (ELTs) for astronomy will rely heavily on the performance of their adaptive optics (AO) systems. Real-time control is at the heart of the critical technologies that will enable telescopes to deliver the best possible science and will require a very significant extrapolation from current AO hardware existing for 4-10 m telescopes. Investigating novel real-time computing architectures and testing their eligibility against anticipated challenges is one of the main priorities of technology development for the ELTs. This paper investigates the suitability of the Intel Xeon Phi, which is a commercial off-the-shelf hardware accelerator. We focus on wavefront reconstruction performance, implementing a straightforward matrix-vector multiplication (MVM) algorithm. We present benchmarking results of the Xeon Phi on a real-time Linux platform, both as a standalone processor and integrated into an existing real-time controller (RTC). Performance of single and multiple Xeon Phis are investigated. We show that this technology has the potential of greatly reducing the mean latency and variations in execution time (jitter) of large AO systems. We present both a detailed performance analysis of the Xeon Phi for a typical E-ELT first-light instrument along with a more general approach that enables us to extend to any AO system size. We show that systematic and detailed performance analysis is an essential part of testing novel real-time control hardware to guarantee optimal science results.
      

      
      Revisiting Intel Xeon Phi optimization of Thompson cloud microphysics scheme in Weather Research and Forecasting (WRF) model
      NASA Astrophysics Data System (ADS)
      Mielikainen, Jarno; Huang, Bormin; Huang, Allen
         2015-10-01
         The Thompson cloud microphysics scheme is a sophisticated cloud microphysics scheme in the Weather Research and Forecasting (WRF) model. The scheme is very suitable for massively parallel computation as there are no interactions among horizontal grid points. Compared to the earlier microphysics schemes, the Thompson scheme incorporates a large number of improvements. Thus, we have optimized the speed of this important part of WRF. Intel Many Integrated Core (MIC) ushers in a new era of supercomputing speed, performance, and compatibility. It allows the developers to run code at trillions of calculations per second using the familiar programming model. In this paper, we present our results of optimizing the Thompson microphysics scheme on Intel Many Integrated Core Architecture (MIC) hardware. The Intel Xeon Phi coprocessor is the first product based on Intel MIC architecture, and it consists of up to 61 cores connected by a high performance on-die bidirectional interconnect. The coprocessor supports all important Intel development tools. Thus, the development environment is familiar one to a vast number of CPU developers. Although, getting a maximum performance out of MICs will require using some novel optimization techniques. New optimizations for an updated Thompson scheme are discusses in this paper. The optimizations improved the performance of the original Thompson code on Xeon Phi 7120P by a factor of 1.8x. Furthermore, the same optimizations improved the performance of the Thompson on a dual socket configuration of eight core Intel Xeon E5-2670 CPUs by a factor of 1.8x compared to the original Thompson code.
      

      
      Optimizing the Betts-Miller-Janjic cumulus parameterization with Intel Many Integrated Core (MIC) architecture
      NASA Astrophysics Data System (ADS)
      Huang, Melin; Huang, Bormin; Huang, Allen H.-L.
         2015-10-01
         The schemes of cumulus parameterization are responsible for the sub-grid-scale effects of convective and/or shallow clouds, and intended to represent vertical fluxes due to unresolved updrafts and downdrafts and compensating motion outside the clouds. Some schemes additionally provide cloud and precipitation field tendencies in the convective column, and momentum tendencies due to convective transport of momentum. The schemes all provide the convective component of surface rainfall. Betts-Miller-Janjic (BMJ) is one scheme to fulfill such purposes in the weather research and forecast (WRF) model. National Centers for Environmental Prediction (NCEP) has tried to optimize the BMJ scheme for operational application. As there are no interactions among horizontal grid points, this scheme is very suitable for parallel computation. With the advantage of Intel Xeon Phi Many Integrated Core (MIC) architecture, efficient parallelization and vectorization essentials, it allows us to optimize the BMJ scheme. If compared to the original code respectively running on one CPU socket (eight cores) and on one CPU core with Intel Xeon E5-2670, the MIC-based optimization of this scheme running on Xeon Phi coprocessor 7120P improves the performance by 2.4x and 17.0x, respectively.
      

      
      Optimizing zonal advection of the Advanced Research WRF (ARW) dynamics for Intel MIC
      NASA Astrophysics Data System (ADS)
      Mielikainen, Jarno; Huang, Bormin; Huang, Allen H.
         2014-10-01
         The Weather Research and Forecast (WRF) model is the most widely used community weather forecast and research model in the world. There are two distinct varieties of WRF. The Advanced Research WRF (ARW) is an experimental, advanced research version featuring very high resolution. The WRF Nonhydrostatic Mesoscale Model (WRF-NMM) has been designed for forecasting operations. WRF consists of dynamics code and several physics modules. The WRF-ARW core is based on an Eulerian solver for the fully compressible nonhydrostatic equations. In the paper, we will use Intel Intel Many Integrated Core (MIC) architecture to substantially increase the performance of a zonal advection subroutine for optimization. It is of the most time consuming routines in the ARW dynamics core. Advection advances the explicit perturbation horizontal momentum equations by adding in the large-timestep tendency along with the small timestep pressure gradient tendency. We will describe the challenges we met during the development of a high-speed dynamics code subroutine for MIC architecture. Furthermore, lessons learned from the code optimization process will be discussed. The results show that the optimizations improved performance of the original code on Xeon Phi 5110P by a factor of 2.4x.
      

      
      Connecting Effective Instruction and Technology. Intel-elebration: Safari.
      ERIC Educational Resources Information Center
      Burton, Larry D.; Prest, Sharon
         
         Intel-ebration is an attempt to integrate the following research-based instructional frameworks and strategies: (1) dimensions of learning; (2) multiple intelligences; (3) thematic instruction; (4) cooperative learning; (5) project-based learning; and (6) instructional technology. This paper presents a thematic unit on safari, using the…
      

      
      Evaluation of the Intel Xeon Phi Co-processor to accelerate the sensitivity map calculation for PET imaging
      NASA Astrophysics Data System (ADS)
      Dey, T.; Rodrigue, P.
         2015-07-01
         We aim to evaluate the Intel Xeon Phi coprocessor for acceleration of 3D Positron Emission Tomography (PET) image reconstruction. We focus on the sensitivity map calculation as one computational intensive part of PET image reconstruction, since it is a promising candidate for acceleration with the Many Integrated Core (MIC) architecture of the Xeon Phi. The computation of the voxels in the field of view (FoV) can be done in parallel and the 103 to 104 samples needed to calculate the detection probability of each voxel can take advantage of vectorization. We use the ray tracing kernels of the Embree project to calculate the hit points of the sample rays with the detector and in a second step the sum of the radiological path taking into account attenuation is determined. The core components are implemented using the Intel single instruction multiple data compiler (ISPC) to enable a portable implementation showing efficient vectorization either on the Xeon Phi and the Host platform. On the Xeon Phi, the calculation of the radiological path is also implemented in hardware specific intrinsic instructions (so-called `intrinsics') to allow manually-optimized vectorization. For parallelization either OpenMP and ISPC tasking (based on pthreads) are evaluated.Our implementation achieved a scalability factor of 0.90 on the Xeon Phi coprocessor (model 5110P) with 60 cores at 1 GHz. Only minor differences were found between parallelization with OpenMP and the ISPC tasking feature. The implementation using intrinsics was found to be about 12% faster than the portable ISPC version. With this version, a speedup of 1.43 was achieved on the Xeon Phi coprocessor compared to the host system (HP SL250s Gen8) equipped with two Xeon (E5-2670) CPUs, with 8 cores at 2.6 to 3.3 GHz each. Using a second Xeon Phi card the speedup could be further increased to 2.77. No significant differences were found between the results of the different Xeon Phi and the Host implementations. The examination showed that a reasonable speedup of sensitivity map calculation could be achieved on the Xeon Phi either by a portable or a hardware specific implementation.
      

      
      QCDOC: A 10-teraflops scale computer for lattice QCD
      NASA Astrophysics Data System (ADS)
      Chen, D.; Christ, N. H.; Cristian, C.; Dong, Z.; Gara, A.; Garg, K.; Joo, B.; Kim, C.; Levkova, L.; Liao, X.; Mawhinney, R. D.; Ohta, S.; Wettig, T.
         2001-03-01
         The architecture of a new class of computers, optimized for lattice QCD calculations, is described. An individual node is based on a single integrated circuit containing a PowerPC 32-bit integer processor with a 1 Gflops 64-bit IEEE floating point unit, 4 Mbyte of memory, 8 Gbit/sec nearest-neighbor communications and additional control and diagnostic circuitry. The machine's name, QCDOC, derives from "QCD On a Chip".
      

      
      (M-CAT) Minor Caliber Weapons Trainer MK-19, 40mm Machine Gun
      DTIC Science & Technology
      
         1989-07-24
         microprocessor chip with an Intel 387 math coprocessor. The Nova 620 is a digital time base corrector. It is used to time base correct the video data...the circuit. After filtering, the horizontal and vertical position signals are converted to digital values by the Data Translation (DTX-311) analog...from the computer. Each frame of the video disk is individually digitized as to target size, location, and range. The guns azimuth and elevation are
      

      
      Ada Compiler Validation Summary Report: Certificate Number 900726W1. 11022 Verdix Corporation, VADS VAX/VMS= Intel 386, VMS 5.2, VAda-110-03315, Version 6.0, MicroVAX 3100 = iSBC 386/32
      DTIC Science & Technology
      
         1991-01-22
         iSBC 386/32, bare machine Customer Agreement Number: 90-05-29- VRX See Section 3.1 for any additional information about the testing environment. As a...Defense Avail and/or Washington DC 20301 -Dist Special ----- ---- i AVF Control Number: AVF-VSR-379.0191 22 January 1991 90-05-29- VRX Ada COMPILER
      

      
      Performance Measurements for the Microsoft Kinect Skeleton
      DTIC Science & Technology
      
         2012-03-01
         Information Inter- faces and Presentation]: User Interfaces—Input devices and strate- gies; 1 INTRODUCTION The Microsoft Kinect for Xbox 360 (“Kinect...these values. 2 MEASUREMENTS We conducted our tests on a machine configured with Windows 7 Ultimate (Service Pack 1) equipped with two Intel Core2...test. We tested with one , two, and three users present, although only two skeletons may be tracked. 2.1 Range We need to know how close and how far a
      

      
      Terascale spectral element algorithms and implementations.
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Fischer, P. F.; Tufo, H. M.
         1999-08-17
         We describe the development and implementation of an efficient spectral element code for multimillion gridpoint simulations of incompressible flows in general two- and three-dimensional domains. We review basic and recently developed algorithmic underpinnings that have resulted in good parallel and vector performance on a broad range of architectures, including the terascale computing systems now coming online at the DOE labs. Sustained performance of 219 GFLOPS has been recently achieved on 2048 nodes of the Intel ASCI-Red machine at Sandia.
      

      
      Geometric, Statistical, and Topological Modeling of Intrinsic Data Manifolds: Application to 3D Shapes
      DTIC Science & Technology
      
         2009-01-01
         representation to a simple curve in 3D by using the Whitney embedding theorem. In a very ludic way, we propose to combine phases one and two to...elimination principle which takes advantage of the designed parametrization. To further refine discrimination among objects, we introduce a post...packing numbers and design of principal curves. IEEE transactions on Pattern Analysis and Machine Intel- ligence, 22(3):281-297, 2000. [68] M. H. Yang, Face
      

      
      Accelerated Application Development: The ORNL Titan Experience
      DOE PAGES
      Joubert, Wayne; Archibald, Richard K.; Berrill, Mark A.; ...
         2015-05-09
         The use of computational accelerators such as NVIDIA GPUs and Intel Xeon Phi processors is now widespread in the high performance computing community, with many applications delivering impressive performance gains. However, programming these systems for high performance, performance portability and software maintainability has been a challenge. In this paper we discuss experiences porting applications to the Titan system. Titan, which began planning in 2009 and was deployed for general use in 2013, was the first multi-petaflop system based on accelerator hardware. To ready applications for accelerated computing, a preparedness effort was undertaken prior to delivery of Titan. In this papermore » we report experiences and lessons learned from this process and describe how users are currently making use of computational accelerators on Titan.« less
      

      
      Accelerated application development: The ORNL Titan experience
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Joubert, Wayne; Archibald, Rick; Berrill, Mark
         2015-08-01
         The use of computational accelerators such as NVIDIA GPUs and Intel Xeon Phi processors is now widespread in the high performance computing community, with many applications delivering impressive performance gains. However, programming these systems for high performance, performance portability and software maintainability has been a challenge. In this paper we discuss experiences porting applications to the Titan system. Titan, which began planning in 2009 and was deployed for general use in 2013, was the first multi-petaflop system based on accelerator hardware. To ready applications for accelerated computing, a preparedness effort was undertaken prior to delivery of Titan. In this papermore » we report experiences and lessons learned from this process and describe how users are currently making use of computational accelerators on Titan.« less
      

      
      Electromagnetic physics models for parallel computing architectures
      DOE PAGES
      Amadio, G.; Ananya, A.; Apostolakis, J.; ...
         2016-11-21
         The recent emergence of hardware architectures characterized by many-core or accelerated processors has opened new opportunities for concurrent programming models taking advantage of both SIMD and SIMT architectures. GeantV, a next generation detector simulation, has been designed to exploit both the vector capability of mainstream CPUs and multi-threading capabilities of coprocessors including NVidia GPUs and Intel Xeon Phi. The characteristics of these architectures are very different in terms of the vectorization depth and type of parallelization needed to achieve optimal performance. In this paper we describe implementation of electromagnetic physics models developed for parallel computing architectures as a part ofmore » the GeantV project. Finally, the results of preliminary performance evaluation and physics validation are presented as well.« less
      

      
      First experience of vectorizing electromagnetic physics models for detector simulation
      NASA Astrophysics Data System (ADS)
      Amadio, G.; Apostolakis, J.; Bandieramonte, M.; Bianchini, C.; Bitzes, G.; Brun, R.; Canal, P.; Carminati, F.; de Fine Licht, J.; Duhem, L.; Elvira, D.; Gheata, A.; Jun, S. Y.; Lima, G.; Novak, M.; Presbyterian, M.; Shadura, O.; Seghal, R.; Wenzel, S.
         2015-12-01
         The recent emergence of hardware architectures characterized by many-core or accelerated processors has opened new opportunities for concurrent programming models taking advantage of both SIMD and SIMT architectures. The GeantV vector prototype for detector simulations has been designed to exploit both the vector capability of mainstream CPUs and multi-threading capabilities of coprocessors including NVidia GPUs and Intel Xeon Phi. The characteristics of these architectures are very different in terms of the vectorization depth, parallelization needed to achieve optimal performance or memory access latency and speed. An additional challenge is to avoid the code duplication often inherent to supporting heterogeneous platforms. In this paper we present the first experience of vectorizing electromagnetic physics models developed for the GeantV project.
      

      
      Fast multipurpose Monte Carlo simulation for proton therapy using multi- and many-core CPU architectures
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Souris, Kevin, E-mail: kevin.souris@uclouvain.be; Lee, John Aldo; Sterpin, Edmond
         2016-04-15
         Purpose: Accuracy in proton therapy treatment planning can be improved using Monte Carlo (MC) simulations. However the long computation time of such methods hinders their use in clinical routine. This work aims to develop a fast multipurpose Monte Carlo simulation tool for proton therapy using massively parallel central processing unit (CPU) architectures. Methods: A new Monte Carlo, called MCsquare (many-core Monte Carlo), has been designed and optimized for the last generation of Intel Xeon processors and Intel Xeon Phi coprocessors. These massively parallel architectures offer the flexibility and the computational power suitable to MC methods. The class-II condensed history algorithmmore » of MCsquare provides a fast and yet accurate method of simulating heavy charged particles such as protons, deuterons, and alphas inside voxelized geometries. Hard ionizations, with energy losses above a user-specified threshold, are simulated individually while soft events are regrouped in a multiple scattering theory. Elastic and inelastic nuclear interactions are sampled from ICRU 63 differential cross sections, thereby allowing for the computation of prompt gamma emission profiles. MCsquare has been benchmarked with the GATE/GEANT4 Monte Carlo application for homogeneous and heterogeneous geometries. Results: Comparisons with GATE/GEANT4 for various geometries show deviations within 2%–1 mm. In spite of the limited memory bandwidth of the coprocessor simulation time is below 25 s for 10{sup 7} primary 200 MeV protons in average soft tissues using all Xeon Phi and CPU resources embedded in a single desktop unit. Conclusions: MCsquare exploits the flexibility of CPU architectures to provide a multipurpose MC simulation tool. Optimized code enables the use of accurate MC calculation within a reasonable computation time, adequate for clinical practice. MCsquare also simulates prompt gamma emission and can thus be used also for in vivo range verification.« less
      

      
      Saving time and energy with oversubscription and semi-direct Møller-Plesset second order perturbation methods.
      PubMed
      Fought, Ellie L; Sundriyal, Vaibhav; Sosonkina, Masha; Windus, Theresa L
         2017-04-30
         In this work, the effect of oversubscription is evaluated, via calling 2n, 3n, or 4n processes for n physical cores, on semi-direct MP2 energy and gradient calculations and RI-MP2 energy calculations with the cc-pVTZ basis using NWChem. Results indicate that on both Intel and AMD platforms, oversubscription reduces total time to solution on average for semi-direct MP2 energy calculations by 25-45% and reduces total energy consumed by the CPU and DRAM on average by 10-15% on the Intel platform. Semi-direct gradient time to solution is shortened on average by 8-15% and energy consumption is decreased by 5-10%. Linear regression analysis shows a strong correlation between time to solution and total energy consumed. Oversubscribing during RI-MP2 calculations results in performance degradations of 30-50% at the 4n level. © 2017 Wiley Periodicals, Inc. © 2017 Wiley Periodicals, Inc.
      

      
      Development and analysis of the Software Implemented Fault-Tolerance (SIFT) computer
      NASA Technical Reports Server (NTRS)
      Goldberg, J.; Kautz, W. H.; Melliar-Smith, P. M.; Green, M. W.; Levitt, K. N.; Schwartz, R. L.; Weinstock, C. B.
         1984-01-01
         SIFT (Software Implemented Fault Tolerance) is an experimental, fault-tolerant computer system designed to meet the extreme reliability requirements for safety-critical functions in advanced aircraft. Errors are masked by performing a majority voting operation over the results of identical computations, and faulty processors are removed from service by reassigning computations to the nonfaulty processors. This scheme has been implemented in a special architecture using a set of standard Bendix BDX930 processors, augmented by a special asynchronous-broadcast communication interface that provides direct, processor to processor communication among all processors. Fault isolation is accomplished in hardware; all other fault-tolerance functions, together with scheduling and synchronization are implemented exclusively by executive system software. The system reliability is predicted by a Markov model. Mathematical consistency of the system software with respect to the reliability model has been partially verified, using recently developed tools for machine-aided proof of program correctness.
      

        
       
          

«

17
      18
      19
   20
      21
      »

          
        

     

   

   
       
            
              
          

«

18
      19
      20
   21
      22
      »

          
        

           
           
             
               
      
      Performance Measurement, Visualization and Modeling of Parallel and Distributed Programs
      NASA Technical Reports Server (NTRS)
      Yan, Jerry C.; Sarukkai, Sekhar R.; Mehra, Pankaj; Lum, Henry, Jr. (Technical Monitor)
         1994-01-01
         This paper presents a methodology for debugging the performance of message-passing programs on both tightly coupled and loosely coupled distributed-memory machines. The AIMS (Automated Instrumentation and Monitoring System) toolkit, a suite of software tools for measurement and analysis of performance, is introduced and its application illustrated using several benchmark programs drawn from the field of computational fluid dynamics. AIMS includes (i) Xinstrument, a powerful source-code instrumentor, which supports both Fortran77 and C as well as a number of different message-passing libraries including Intel's NX Thinking Machines' CMMD, and PVM; (ii) Monitor, a library of timestamping and trace -collection routines that run on supercomputers (such as Intel's iPSC/860, Delta, and Paragon and Thinking Machines' CM5) as well as on networks of workstations (including Convex Cluster and SparcStations connected by a LAN); (iii) Visualization Kernel, a trace-animation facility that supports source-code clickback, simultaneous visualization of computation and communication patterns, as well as analysis of data movements; (iv) Statistics Kernel, an advanced profiling facility, that associates a variety of performance data with various syntactic components of a parallel program; (v) Index Kernel, a diagnostic tool that helps pinpoint performance bottlenecks through the use of abstract indices; (vi) Modeling Kernel, a facility for automated modeling of message-passing programs that supports both simulation -based and analytical approaches to performance prediction and scalability analysis; (vii) Intrusion Compensator, a utility for recovering true performance from observed performance by removing the overheads of monitoring and their effects on the communication pattern of the program; and (viii) Compatibility Tools, that convert AIMS-generated traces into formats used by other performance-visualization tools, such as ParaGraph, Pablo, and certain AVS/Explorer modules.
      

      
      JPRS Report, Science & Technology, USSR: Computers, Control Systems and Machines
      DTIC Science & Technology
      
         1989-03-14
         optimizatsii slozhnykh sistem (Coding Theory and Complex System Optimization ). Alma-Ata, Nauka Press, 1977, pp. 8-16. 11. Author’s certificate number...Interpreter Specifics [0. I. Amvrosova] ............................................. 141 Creation of Modern Computer Systems for Complex Ecological...processor can be designed to decrease degradation upon failure and assure more reliable processor operation, without requiring more complex software or
      

      
      Software design and documentation language
      NASA Technical Reports Server (NTRS)
      Kleine, H.
         1977-01-01
         A communications medium to support the design and documentation of complex software applications is studied. The medium also provides the following: (1) a processor which can convert design specifications into an intelligible, informative machine reproducible document; (2) a design and documentation language with forms and syntax that are simple, unrestrictive, and communicative; and (3) methodology for effective use of the language and processor.
      

      
      Using all of your CPU's in HIPE
      NASA Astrophysics Data System (ADS)
      Jacobson, J. D.; Fadda, D.
         2012-09-01
         Modern computer architectures increasingly feature multi-core CPU's. For example, the MacbookPro features the Intel quad-core i7 processors. Through the use of hyper-threading, where each core can execute two threads simultaneously, the quad-core i7 can support eight simultaneous processing threads. All this on your laptop! This CPU power can now be put into service by scientists to perform data reduction tasks, but only if the software has been designed to take advantage of the multiple processor architectures. Up to now, software written for Herschel data reduction (HIPE), written in Jython and JAVA, is single-threaded and can only utilize a single processor. Users of HIPE do not get any advantage from the additional processors. Why not put all of the CPU resources to work reducing your data? We present a multi-threaded software application that corrects long-term transients in the signal from the PACS unchopped spectroscopy line scan mode. In this poster, we present a multi-threaded software framework to achieve performance improvements from parallel execution. We will show how a task to correct transients in the PACS Spectroscopy Pipeline for the un-chopped line scan mode, has been threaded. This computation-intensive task uses either a one-parameter or a three parameter exponential function, to characterize the transient. The task uses a JAVA implementation of Minpack, translated from the C (Moshier) and IDL (Markwardt) by the authors, to optimize the correction parameters. We also explain how to determine if a task can benefit from threading (Amdahl's Law), and if it is safe to thread. The design and implementation, using the JAVA concurrency package completions service is described. Pitfalls, timing bugs, thread safety, resource control, testing and performance improvements are described and plotted.
      

      
      FPGA-based multiprocessor system for injection molding control.
      PubMed
      Muñoz-Barron, Benigno; Morales-Velazquez, Luis; Romero-Troncoso, Rene J; Rodriguez-Donate, Carlos; Trejo-Hernandez, Miguel; Benitez-Rangel, Juan P; Osornio-Rios, Roque A
         2012-10-18
         The plastic industry is a very important manufacturing sector and injection molding is a widely used forming method in that industry. The contribution of this work is the development of a strategy to retrofit control of an injection molding machine based on an embedded system microprocessors sensor network on a field programmable gate array (FPGA) device. Six types of embedded processors are included in the system: a smart-sensor processor, a micro fuzzy logic controller, a programmable logic controller, a system manager, an IO processor and a communication processor. Temperature, pressure and position are controlled by the proposed system and experimentation results show its feasibility and robustness. As validation of the present work, a particular sample was successfully injected.
      

      
      MPI implementation of PHOENICS: A general purpose computational fluid dynamics code
      NASA Astrophysics Data System (ADS)
      Simunovic, S.; Zacharia, T.; Baltas, N.; Spalding, D. B.
         1995-03-01
         PHOENICS is a suite of computational analysis programs that are used for simulation of fluid flow, heat transfer, and dynamical reaction processes. The parallel version of the solver EARTH for the Computational Fluid Dynamics (CFD) program PHOENICS has been implemented using Message Passing Interface (MPI) standard. Implementation of MPI version of PHOENICS makes this computational tool portable to a wide range of parallel machines and enables the use of high performance computing for large scale computational simulations. MPI libraries are available on several parallel architectures making the program usable across different architectures as well as on heterogeneous computer networks. The Intel Paragon NX and MPI versions of the program have been developed and tested on massively parallel supercomputers Intel Paragon XP/S 5, XP/S 35, and Kendall Square Research, and on the multiprocessor SGI Onyx computer at Oak Ridge National Laboratory. The preliminary testing results of the developed program have shown scalable performance for reasonably sized computational domains.
      

      
      MPI implementation of PHOENICS: A general purpose computational fluid dynamics code
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Simunovic, S.; Zacharia, T.; Baltas, N.
         1995-04-01
         PHOENICS is a suite of computational analysis programs that are used for simulation of fluid flow, heat transfer, and dynamical reaction processes. The parallel version of the solver EARTH for the Computational Fluid Dynamics (CFD) program PHOENICS has been implemented using Message Passing Interface (MPI) standard. Implementation of MPI version of PHOENICS makes this computational tool portable to a wide range of parallel machines and enables the use of high performance computing for large scale computational simulations. MPI libraries are available on several parallel architectures making the program usable across different architectures as well as on heterogeneous computer networks. Themore » Intel Paragon NX and MPI versions of the program have been developed and tested on massively parallel supercomputers Intel Paragon XP/S 5, XP/S 35, and Kendall Square Research, and on the multiprocessor SGI Onyx computer at Oak Ridge National Laboratory. The preliminary testing results of the developed program have shown scalable performance for reasonably sized computational domains.« less
      

      
      The design and implementation of a parallel unstructured Euler solver using software primitives
      NASA Technical Reports Server (NTRS)
      Das, R.; Mavriplis, D. J.; Saltz, J.; Gupta, S.; Ponnusamy, R.
         1992-01-01
         This paper is concerned with the implementation of a three-dimensional unstructured grid Euler-solver on massively parallel distributed-memory computer architectures. The goal is to minimize solution time by achieving high computational rates with a numerically efficient algorithm. An unstructured multigrid algorithm with an edge-based data structure has been adopted, and a number of optimizations have been devised and implemented in order to accelerate the parallel communication rates. The implementation is carried out by creating a set of software tools, which provide an interface between the parallelization issues and the sequential code, while providing a basis for future automatic run-time compilation support. Large practical unstructured grid problems are solved on the Intel iPSC/860 hypercube and Intel Touchstone Delta machine. The quantitative effect of the various optimizations are demonstrated, and we show that the combined effect of these optimizations leads to roughly a factor of three performance improvement. The overall solution efficiency is compared with that obtained on the CRAY-YMP vector supercomputer.
      

      
      A high-accuracy optical linear algebra processor for finite element applications
      NASA Technical Reports Server (NTRS)
      Casasent, D.; Taylor, B. K.
         1984-01-01
         Optical linear processors are computationally efficient computers for solving matrix-matrix and matrix-vector oriented problems. Optical system errors limit their dynamic range to 30-40 dB, which limits their accuray to 9-12 bits. Large problems, such as the finite element problem in structural mechanics (with tens or hundreds of thousands of variables) which can exploit the speed of optical processors, require the 32 bit accuracy obtainable from digital machines. To obtain this required 32 bit accuracy with an optical processor, the data can be digitally encoded, thereby reducing the dynamic range requirements of the optical system (i.e., decreasing the effect of optical errors on the data) while providing increased accuracy. This report describes a new digitally encoded optical linear algebra processor architecture for solving finite element and banded matrix-vector problems. A linear static plate bending case study is described which quantities the processor requirements. Multiplication by digital convolution is explained, and the digitally encoded optical processor architecture is advanced.
      

      
      Dynamic programming on a shared-memory multiprocessor
      NASA Technical Reports Server (NTRS)
      Edmonds, Phil; Chu, Eleanor; George, Alan
         1993-01-01
         Three new algorithms for solving dynamic programming problems on a shared-memory parallel computer are described. All three algorithms attempt to balance work load, while keeping synchronization cost low. In particular, for a multiprocessor having p processors, an analysis of the best algorithm shows that the arithmetic cost is O(n-cubed/6p) and that the synchronization cost is O(absolute value of log sub C n) if p much less than n, where C = (2p-1)/(2p + 1) and n is the size of the problem. The low synchronization cost is important for machines where synchronization is expensive. Analysis and experiments show that the best algorithm is effective in balancing the work load and producing high efficiency.
      

      
      Close to real life. [solving for transonic flow about lifting airfoils using supercomputers
      NASA Technical Reports Server (NTRS)
      Peterson, Victor L.; Bailey, F. Ron
         1988-01-01
         NASA's Numerical Aerodynamic Simulation (NAS) facility for CFD modeling of highly complex aerodynamic flows employs as its basic hardware two Cray-2s, an ETA-10 Model Q, an Amdahl 5880 mainframe computer that furnishes both support processing and access to 300 Gbytes of disk storage, several minicomputers and superminicomputers, and a Thinking Machines 16,000-device 'connection machine' processor. NAS, which was the first supercomputer facility to standardize operating-system and communication software on all processors, has done important Space Shuttle aerodynamics simulations and will be critical to the configurational refinement of the National Aerospace Plane and its intergrated powerplant, which will involve complex, high temperature reactive gasdynamic computations.
      

      
      A sparse matrix algorithm on the Boolean vector machine
      NASA Technical Reports Server (NTRS)
      Wagner, Robert A.; Patrick, Merrell L.
         1988-01-01
         VLSI technology is being used to implement a prototype Boolean Vector Machine (BVM), which is a large network of very small processors with equally small memories that operate in SIMD mode; these use bit-serial arithmetic, and communicate via cube-connected cycles network. The BVM's bit-serial arithmetic and the small memories of individual processors are noted to compromise the system's effectiveness in large numerical problem applications. Attention is presently given to the implementation of a basic matrix-vector iteration algorithm for space matrices of the BVM, in order to generate over 1 billion useful floating-point operations/sec for this iteration algorithm. The algorithm is expressed in a novel language designated 'BVM'.
      

      
      Performance Comparison of a Set of Periodic and Non-Periodic Tridiagonal Solvers on SP2 and Paragon Parallel Computers
      NASA Technical Reports Server (NTRS)
      Sun, Xian-He; Moitra, Stuti
         1996-01-01
         Various tridiagonal solvers have been proposed in recent years for different parallel platforms. In this paper, the performance of three tridiagonal solvers, namely, the parallel partition LU algorithm, the parallel diagonal dominant algorithm, and the reduced diagonal dominant algorithm, is studied. These algorithms are designed for distributed-memory machines and are tested on an Intel Paragon and an IBM SP2 machines. Measured results are reported in terms of execution time and speedup. Analytical study are conducted for different communication topologies and for different tridiagonal systems. The measured results match the analytical results closely. In addition to address implementation issues, performance considerations such as problem sizes and models of speedup are also discussed.
      

      
      Fast and Accurate Simulation of the Cray XMT Multithreaded Supercomputer
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Villa, Oreste; Tumeo, Antonino; Secchi, Simone
         
         Irregular applications, such as data mining and analysis or graph-based computations, show unpredictable memory/network access patterns and control structures. Highly multithreaded architectures with large processor counts, like the Cray MTA-1, MTA-2 and XMT, appear to address their requirements better than commodity clusters. However, the research on highly multithreaded systems is currently limited by the lack of adequate architectural simulation infrastructures due to issues such as size of the machines, memory footprint, simulation speed, accuracy and customization. At the same time, Shared-memory MultiProcessors (SMPs) with multi-core processors have become an attractive platform to simulate large scale machines. In this paper, wemore » introduce a cycle-level simulator of the highly multithreaded Cray XMT supercomputer. The simulator runs unmodified XMT applications. We discuss how we tackled the challenges posed by its development, detailing the techniques introduced to make the simulation as fast as possible while maintaining a high accuracy. By mapping XMT processors (ThreadStorm with 128 hardware threads) to host computing cores, the simulation speed remains constant as the number of simulated processors increases, up to the number of available host cores. The simulator supports zero-overhead switching among different accuracy levels at run-time and includes a network model that takes into account contention. On a modern 48-core SMP host, our infrastructure simulates a large set of irregular applications 500 to 2000 times slower than real time when compared to a 128-processor XMT, while remaining within 10\\% of accuracy. Emulation is only from 25 to 200 times slower than real time.« less
      

      
      Vector processing efficiency of plasma MHD codes by use of the FACOM 230-75 APU
      NASA Astrophysics Data System (ADS)
      Matsuura, T.; Tanaka, Y.; Naraoka, K.; Takizuka, T.; Tsunematsu, T.; Tokuda, S.; Azumi, M.; Kurita, G.; Takeda, T.
         1982-06-01
         In the framework of pipelined vector architecture, the efficiency of vector processing is assessed with respect to plasma MHD codes in nuclear fusion research. By using a vector processor, the FACOM 230-75 APU, the limit of the enhancement factor due to parallelism of current vector machines is examined for three numerical codes based on a fluid model. Reasonable speed-up factors of approximately 6,6 and 4 times faster than the highly optimized scalar version are obtained for ERATO (linear stability code), AEOLUS-R1 (nonlinear stability code) and APOLLO (1-1/2D transport code), respectively. Problems of the pipelined vector processors are discussed from the viewpoint of restructuring, optimization and choice of algorithms. In conclusion, the important concept of "concurrency within pipelined parallelism" is emphasized.
      

      
      APT - NASA ENHANCED VERSION OF AUTOMATICALLY PROGRAMMED TOOL SOFTWARE - STAND-ALONE VERSION
      NASA Technical Reports Server (NTRS)
      Premo, D. A.
         1994-01-01
         The APT code is one of the most widely used software tools for complex numerically controlled (N/C) machining. APT is an acronym for Automatically Programmed Tools and is used to denote both a language and the computer software that processes that language. Development of the APT language and software system was begun over twenty years ago as a U. S. government sponsored industry and university research effort. APT is a "problem oriented" language that was developed for the explicit purpose of aiding the N/C machine tools. Machine-tool instructions and geometry definitions are written in the APT language to constitute a "part program." The APT part program is processed by the APT software to produce a cutter location (CL) file. This CL file may then be processed by user supplied post processors to convert the CL data into a form suitable for a particular N/C machine tool. This June, 1989 offering of the APT system represents an adaptation, with enhancements, of the public domain version of APT IV/SSX8 to the DEC VAX-11/780 for use by the Engineering Services Division of the NASA Goddard Space Flight Center. Enhancements include the super pocket feature which allows concave and convex polygon shapes of up to 40 points including shapes that overlap, that leave islands of material within the pocket, and that have one or more arcs as part of the pocket boundary. Recent modifications to APT include a rework of the POCKET subroutine and correction of an error that prevented the use within a macro of a macro variable cutter move statement combined with macro variable double check surfaces. Former modifications included the expansion of array and buffer sizes to accommodate larger part programs, and the insertion of a few user friendly error messages. The APT system software on the DEC VAX-11/780 is organized into two separate programs: the load complex and the APT processor. The load complex handles the table initiation phase and is usually only run when changes to the APT processor capabilities are made. This phase initializes character recognition and syntax tables for the APT processor by creating FORTRAN block data programs. The APT processor consists of four components: the translator, the execution complex, the subroutine library, and the CL editor. The translator examines each APT statement in the part program for recognizable structure and generates a new statement, or series of statements, in an intermediate language. The execution complex processes all of the definition, motion, and related statements to generate cutter location coordinates. The subroutine library contains routines defining the algorithms required to process the sequenced list of intermediate language commands generated by the translator. The CL editor re-processes the cutter location coordinates according to user supplied commands to generate a final CL file. A sample post processor is also included which translates a CL file into a form for use with a Wales Strippit Fabramatic Model 30/30 sheet metal punch. The user should be able to readily develop post processors for other N/C machine tools. The APT language is a statement oriented, sequence dependent language. With the exception of such programming techniques as looping and macros, statements in an APT program are executed in a strict first-to-last sequence. In order to provide programming capability for the broadest possible range of parts and of machine tools, APT input (and output) is generalized, as represented by 3-dimensional geometry and tools, and arbitrarily uniform, as represented by the moving tool concept and output data in absolute coordinates. A command procedure allows the user to select the desired part program, ask for a graphics file of cutter motions in IGES format, and submit the procedure as a batch job, if desired. The APT system software is written in FORTRAN 77 for batch and interactive execution and has been implemented on a DEC VAX series computer under VMS 4.4. The enhancements for this version of APT were last updated in June, 1989. The NASA adaptation, with enhancements, of the public domain version of the APT IV/SSX8 software to the DEC VAX-11/780 is available by license for a period of ten (10) years to approved licensees. The licensed program product delivered includes the APT IV/SSX8 system source code, object code, executable images, and command procedures and one set of supporting documentation. Additional copies of the supporting documentation may be purchased at any time at the price indicated below.
      

      
      A derivation and scalable implementation of the synchronous parallel kinetic Monte Carlo method for simulating long-time dynamics
      NASA Astrophysics Data System (ADS)
      Byun, Hye Suk; El-Naggar, Mohamed Y.; Kalia, Rajiv K.; Nakano, Aiichiro; Vashishta, Priya
         2017-10-01
         Kinetic Monte Carlo (KMC) simulations are used to study long-time dynamics of a wide variety of systems. Unfortunately, the conventional KMC algorithm is not scalable to larger systems, since its time scale is inversely proportional to the simulated system size. A promising approach to resolving this issue is the synchronous parallel KMC (SPKMC) algorithm, which makes the time scale size-independent. This paper introduces a formal derivation of the SPKMC algorithm based on local transition-state and time-dependent Hartree approximations, as well as its scalable parallel implementation based on a dual linked-list cell method. The resulting algorithm has achieved a weak-scaling parallel efficiency of 0.935 on 1024 Intel Xeon processors for simulating biological electron transfer dynamics in a 4.2 billion-heme system, as well as decent strong-scaling parallel efficiency. The parallel code has been used to simulate a lattice of cytochrome complexes on a bacterial-membrane nanowire, and it is broadly applicable to other problems such as computational synthesis of new materials.
      

      
      Effective correlator for RadioAstron project
      NASA Astrophysics Data System (ADS)
      Sergeev, Sergey
         
         This paper presents the implementation of programme FX-correlator for Very Long Baseline Interferometry, adapted for the project "RadioAstron". Software correlator implemented for heterogeneous computing systems using graphics accelerators. It is shown that for the task interferometry implementation of the graphics hardware has a high efficiency. The host processor of heterogeneous computing system, performs the function of forming the data flow for graphics accelerators, the number of which corresponds to the number of frequency channels. So, for the Radioastron project, such channels is seven. Each accelerator is perform correlation matrix for all bases for a single frequency channel. Initial data is converted to the floating-point format, is correction for the corresponding delay function and computes the entire correlation matrix simultaneously. Calculation of the correlation matrix is performed using the sliding Fourier transform. Thus, thanks to the compliance of a solved problem for architecture graphics accelerators, managed to get a performance for one processor platform Kepler, which corresponds to the performance of this task, the computing cluster platforms Intel on four nodes. This task successfully scaled not only on a large number of graphics accelerators, but also on a large number of nodes with multiple accelerators.
      

      
      Optimization of the coherence function estimation for multi-core central processing unit
      NASA Astrophysics Data System (ADS)
      Cheremnov, A. G.; Faerman, V. A.; Avramchuk, V. S.
         2017-02-01
         The paper considers use of parallel processing on multi-core central processing unit for optimization of the coherence function evaluation arising in digital signal processing. Coherence function along with other methods of spectral analysis is commonly used for vibration diagnosis of rotating machinery and its particular nodes. An algorithm is given for the function evaluation for signals represented with digital samples. The algorithm is analyzed for its software implementation and computational problems. Optimization measures are described, including algorithmic, architecture and compiler optimization, their results are assessed for multi-core processors from different manufacturers. Thus, speeding-up of the parallel execution with respect to sequential execution was studied and results are presented for Intel Core i7-4720HQ и AMD FX-9590 processors. The results show comparatively high efficiency of the optimization measures taken. In particular, acceleration indicators and average CPU utilization have been significantly improved, showing high degree of parallelism of the constructed calculating functions. The developed software underwent state registration and will be used as a part of a software and hardware solution for rotating machinery fault diagnosis and pipeline leak location with acoustic correlation method.
      

      
      Efficient implementation of the many-body Reactive Bond Order (REBO) potential on GPU
      NASA Astrophysics Data System (ADS)
      Trędak, Przemysław; Rudnicki, Witold R.; Majewski, Jacek A.
         2016-09-01
         The second generation Reactive Bond Order (REBO) empirical potential is commonly used to accurately model a wide range hydrocarbon materials. It is also extensible to other atom types and interactions. REBO potential assumes complex multi-body interaction model, that is difficult to represent efficiently in the SIMD or SIMT programming model. Hence, despite its importance, no efficient GPGPU implementation has been developed for this potential. Here we present a detailed description of a highly efficient GPGPU implementation of molecular dynamics algorithm using REBO potential. The presented algorithm takes advantage of rarely used properties of the SIMT architecture of a modern GPU to solve difficult synchronizations issues that arise in computations of multi-body potential. Techniques developed for this problem may be also used to achieve efficient solutions of different problems. The performance of proposed algorithm is assessed using a range of model systems. It is compared to highly optimized CPU implementation (both single core and OpenMP) available in LAMMPS package. These experiments show up to 6x improvement in forces computation time using single processor of the NVIDIA Tesla K80 compared to high end 16-core Intel Xeon processor.
      

        
       
          

«

18
      19
      20
   21
      22
      »

          
        

     

   

   
       
            
              
          

«

19
      20
      21
   22
      23
      »

          
        

           
           
             
               
      
      The Impact of IBM Cell Technology on the Programming Paradigm in the Context of Computer Systems for Climate and Weather Models
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Zhou, Shujia; Duffy, Daniel; Clune, Thomas
         
         The call for ever-increasing model resolutions and physical processes in climate and weather models demands a continual increase in computing power. The IBM Cell processor's order-of-magnitude peak performance increase over conventional processors makes it very attractive to fulfill this requirement. However, the Cell's characteristics, 256KB local memory per SPE and the new low-level communication mechanism, make it very challenging to port an application. As a trial, we selected the solar radiation component of the NASA GEOS-5 climate model, which: (1) is representative of column physics components (half the total computational time), (2) has an extremely high computational intensity: the ratiomore » of computational load to main memory transfers, and (3) exhibits embarrassingly parallel column computations. In this paper, we converted the baseline code (single-precision Fortran) to C and ported it to an IBM BladeCenter QS20. For performance, we manually SIMDize four independent columns and include several unrolling optimizations. Our results show that when compared with the baseline implementation running on one core of Intel's Xeon Woodcrest, Dempsey, and Itanium2, the Cell is approximately 8.8x, 11.6x, and 12.8x faster, respectively. Our preliminary analysis shows that the Cell can also accelerate the dynamics component (~;;25percent total computational time). We believe these dramatic performance improvements make the Cell processor very competitive as an accelerator.« less
      

      
      Intelligence system based classification approach for medical disease diagnosis
      NASA Astrophysics Data System (ADS)
      Sagir, Abdu Masanawa; Sathasivam, Saratha
         2017-08-01
         The prediction of breast cancer in women who have no signs or symptoms of the disease as well as survivability after undergone certain surgery has been a challenging problem for medical researchers. The decision about presence or absence of diseases depends on the physician's intuition, experience and skill for comparing current indicators with previous one than on knowledge rich data hidden in a database. This measure is a very crucial and challenging task. The goal is to predict patient condition by using an adaptive neuro fuzzy inference system (ANFIS) pre-processed by grid partitioning. To achieve an accurate diagnosis at this complex stage of symptom analysis, the physician may need efficient diagnosis system. A framework describes methodology for designing and evaluation of classification performances of two discrete ANFIS systems of hybrid learning algorithms least square estimates with Modified Levenberg-Marquardt and Gradient descent algorithms that can be used by physicians to accelerate diagnosis process. The proposed method's performance was evaluated based on training and test datasets with mammographic mass and Haberman's survival Datasets obtained from benchmarked datasets of University of California at Irvine's (UCI) machine learning repository. The robustness of the performance measuring total accuracy, sensitivity and specificity is examined. In comparison, the proposed method achieves superior performance when compared to conventional ANFIS based gradient descent algorithm and some related existing methods. The software used for the implementation is MATLAB R2014a (version 8.3) and executed in PC Intel Pentium IV E7400 processor with 2.80 GHz speed and 2.0 GB of RAM.
      

      
      Independent backup mode transfer and mechanism for digital control computers
      NASA Technical Reports Server (NTRS)
      Tulpule, Bhalchandra R. (Inventor); Oscarson, Edward M. (Inventor)
         1992-01-01
         An interrupt is provided to a signal processor having a non-maskable interrupt input, in response to the detection of a request for transfer to backup software. The signal processor provides a transfer signal to a transfer mechanism only after completion of the present machine cycle. Transfer to the backup software is initiated by the transfer mechanism only upon reception of the transfer signal.
      

      
      Characterization of UMT2013 Performance on Advanced Architectures
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Howell, Louis
         2014-12-31
         This paper presents part of a larger effort to make detailed assessments of several proxy applications on various advanced architectures, with the eventual goal of extending these assessments to codes of programmatic interest running more realistic simulations. The focus here is on UMT2013, a proxy implementation of deterministic transport for unstructured meshes. I present weak and strong MPI scaling results and studies of OpenMP efficiency on the Sequoia BG/Q system at LLNL, with comparison against similar tests on an Intel Sandy Bridge TLCC2 system. The hardware counters on BG/Q provide detailed information on many aspects of on-node performance, while informationmore » from the mpiP tool gives insight into the reasons for the differing scaling behavior on these two different architectures. Preliminary tests that exploit NVRAM as extended memory on an Ivy Bridge machine designed for “Big Data” applications are also included.« less
      

      
      Rapid prototyping and evaluation of programmable SIMD SDR processors in LISA
      NASA Astrophysics Data System (ADS)
      Chen, Ting; Liu, Hengzhu; Zhang, Botao; Liu, Dongpei
         2013-03-01
         With the development of international wireless communication standards, there is an increase in computational requirement for baseband signal processors. Time-to-market pressure makes it impossible to completely redesign new processors for the evolving standards. Due to its high flexibility and low power, software defined radio (SDR) digital signal processors have been proposed as promising technology to replace traditional ASIC and FPGA fashions. In addition, there are large numbers of parallel data processed in computation-intensive functions, which fosters the development of single instruction multiple data (SIMD) architecture in SDR platform. So a new way must be found to prototype the SDR processors efficiently. In this paper we present a bit-and-cycle accurate model of programmable SIMD SDR processors in a machine description language LISA. LISA is a language for instruction set architecture which can gain rapid model at architectural level. In order to evaluate the availability of our proposed processor, three common baseband functions, FFT, FIR digital filter and matrix multiplication have been mapped on the SDR platform. Analytical results showed that the SDR processor achieved the maximum of 47.1% performance boost relative to the opponent processor.
      

      
      Application of Intel Many Integrated Core (MIC) architecture to the Yonsei University planetary boundary layer scheme in Weather Research and Forecasting model
      NASA Astrophysics Data System (ADS)
      Huang, Melin; Huang, Bormin; Huang, Allen H.
         2014-10-01
         The Weather Research and Forecasting (WRF) model provided operational services worldwide in many areas and has linked to our daily activity, in particular during severe weather events. The scheme of Yonsei University (YSU) is one of planetary boundary layer (PBL) models in WRF. The PBL is responsible for vertical sub-grid-scale fluxes due to eddy transports in the whole atmospheric column, determines the flux profiles within the well-mixed boundary layer and the stable layer, and thus provide atmospheric tendencies of temperature, moisture (including clouds), and horizontal momentum in the entire atmospheric column. The YSU scheme is very suitable for massively parallel computation as there are no interactions among horizontal grid points. To accelerate the computation process of the YSU scheme, we employ Intel Many Integrated Core (MIC) Architecture as it is a multiprocessor computer structure with merits of efficient parallelization and vectorization essentials. Our results show that the MIC-based optimization improved the performance of the first version of multi-threaded code on Xeon Phi 5110P by a factor of 2.4x. Furthermore, the same CPU-based optimizations improved the performance on Intel Xeon E5-2603 by a factor of 1.6x as compared to the first version of multi-threaded code.
      

      
      Massively Multithreaded Maxflow for Image Segmentation on the Cray XMT-2
      PubMed Central
      Bokhari, Shahid H.; Çatalyürek, Ümit V.; Gurcan, Metin N.
         2014-01-01
         SUMMARY Image segmentation is a very important step in the computerized analysis of digital images. The maxflow mincut approach has been successfully used to obtain minimum energy segmentations of images in many fields. Classical algorithms for maxflow in networks do not directly lend themselves to efficient parallel implementations on contemporary parallel processors. We present the results of an implementation of Goldberg-Tarjan preflow-push algorithm on the Cray XMT-2 massively multithreaded supercomputer. This machine has hardware support for 128 threads in each physical processor, a uniformly accessible shared memory of up to 4 TB and hardware synchronization for each 64 bit word. It is thus well-suited to the parallelization of graph theoretic algorithms, such as preflow-push. We describe the implementation of the preflow-push code on the XMT-2 and present the results of timing experiments on a series of synthetically generated as well as real images. Our results indicate very good performance on large images and pave the way for practical applications of this machine architecture for image analysis in a production setting. The largest images we have run are 320002 pixels in size, which are well beyond the largest previously reported in the literature. PMID:25598745
      

      
      FPGA-Based Multiprocessor System for Injection Molding Control
      PubMed Central
      Muñoz-Barron, Benigno; Morales-Velazquez, Luis; Romero-Troncoso, Rene J.; Rodriguez-Donate, Carlos; Trejo-Hernandez, Miguel; Benitez-Rangel, Juan P.; Osornio-Rios, Roque A.
         2012-01-01
         The plastic industry is a very important manufacturing sector and injection molding is a widely used forming method in that industry. The contribution of this work is the development of a strategy to retrofit control of an injection molding machine based on an embedded system microprocessors sensor network on a field programmable gate array (FPGA) device. Six types of embedded processors are included in the system: a smart-sensor processor, a micro fuzzy logic controller, a programmable logic controller, a system manager, an IO processor and a communication processor. Temperature, pressure and position are controlled by the proposed system and experimentation results show its feasibility and robustness. As validation of the present work, a particular sample was successfully injected. PMID:23202036
      

      
      Set processing in a network environment. [data bases and magnetic disks and tapes
      NASA Technical Reports Server (NTRS)
      Hardgrave, W. T.
         1975-01-01
         A combination of a local network, a mass storage system, and an autonomous set processor serving as a data/storage management machine is described. Its characteristics include: content-accessible data bases usable from all connected devices; efficient storage/access of large data bases; simple and direct programming with data manipulation and storage management handled by the set processor; simple data base design and entry from source representation to set processor representation with no predefinition necessary; capability available for user sort/order specification; significant reduction in tape/disk pack storage and mounts; flexible environment that allows upgrading hardware/software configuration without causing major interruptions in service; minimal traffic on data communications network; and improved central memory usage on large processors.
      

      
      Cache Energy Optimization Techniques For Modern Processors
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Mittal, Sparsh
         2013-01-01
         Modern multicore processors are employing large last-level caches, for example Intel's E7-8800 processor uses 24MB L3 cache. Further, with each CMOS technology generation, leakage energy has been dramatically increasing and hence, leakage energy is expected to become a major source of energy dissipation, especially in last-level caches (LLCs). The conventional schemes of cache energy saving either aim at saving dynamic energy or are based on properties specific to first-level caches, and thus these schemes have limited utility for last-level caches. Further, several other techniques require offline profiling or per-application tuning and hence are not suitable for product systems. In thismore » book, we present novel cache leakage energy saving schemes for single-core and multicore systems; desktop, QoS, real-time and server systems. Also, we present cache energy saving techniques for caches designed with both conventional SRAM devices and emerging non-volatile devices such as STT-RAM (spin-torque transfer RAM). We present software-controlled, hardware-assisted techniques which use dynamic cache reconfiguration to configure the cache to the most energy efficient configuration while keeping the performance loss bounded. To profile and test a large number of potential configurations, we utilize low-overhead, micro-architecture components, which can be easily integrated into modern processor chips. We adopt a system-wide approach to save energy to ensure that cache reconfiguration does not increase energy consumption of other components of the processor. We have compared our techniques with state-of-the-art techniques and have found that our techniques outperform them in terms of energy efficiency and other relevant metrics. The techniques presented in this book have important applications in improving energy-efficiency of higher-end embedded, desktop, QoS, real-time, server processors and multitasking systems. This book is intended to be a valuable guide for both newcomers and veterans in the field of cache power management. It will help graduate students, CAD tool developers and designers in understanding the need of energy efficiency in modern computing systems. Further, it will be useful for researchers in gaining insights into algorithms and techniques for micro-architectural and system-level energy optimization using dynamic cache reconfiguration. We sincerely believe that the ``food for thought'' presented in this book will inspire the readers to develop even better ideas for designing ``green'' processors of tomorrow.« less
      

      
      Performance optimization of Qbox and WEST on Intel Knights Landing
      NASA Astrophysics Data System (ADS)
      Zheng, Huihuo; Knight, Christopher; Galli, Giulia; Govoni, Marco; Gygi, Francois
         
         We present the optimization of electronic structure codes Qbox and WEST targeting the Intel®Xeon Phi™processor, codenamed Knights Landing (KNL). Qbox is an ab-initio molecular dynamics code based on plane wave density functional theory (DFT) and WEST is a post-DFT code for excited state calculations within many-body perturbation theory. Both Qbox and WEST employ highly scalable algorithms which enable accurate large-scale electronic structure calculations on leadership class supercomputer platforms beyond 100,000 cores, such as Mira and Theta at the Argonne Leadership Computing Facility. In this work, features of the KNL architecture (e.g. hierarchical memory) are explored to achieve higher performance in key algorithms of the Qbox and WEST codes and to develop a road-map for further development targeting next-generation computing architectures. In particular, the optimizations of the Qbox and WEST codes on the KNL platform will target efficient large-scale electronic structure calculations of nanostructured materials exhibiting complex structures and prediction of their electronic and thermal properties for use in solar and thermal energy conversion device. This work was supported by MICCoM, as part of Comp. Mats. Sci. Program funded by the U.S. DOE, Office of Sci., BES, MSE Division. This research used resources of the ALCF, which is a DOE Office of Sci. User Facility under Contract DE-AC02-06CH11357.
      

      
      Conversion of Mass Storage Hierarchy in an IBM Computer Network
      DTIC Science & Technology
      
         1989-03-01
         storage devices GUIDE IBM users’ group for DOS operating systems IBM International Business Machines IBM 370/145 CPU introduced in 1970 IBM 370/168 CPU...February 12, 1985, Information Systems Group, International Business Machines Corporation. "IBM 3090 Processor Complex" and 񓼪 Mass Storage System...34 Mainframe Journal, pp. 15-26, 64-65, Dallas, Texas, September-October 1987. 3. International Business Machines Corporation, Introduction to IBM 3S80 Storage
      

      
      Large Scale GW Calculations on the Cori System
      NASA Astrophysics Data System (ADS)
      Deslippe, Jack; Del Ben, Mauro; da Jornada, Felipe; Canning, Andrew; Louie, Steven
         
         The NERSC Cori system, powered by 9000+ Intel Xeon-Phi processors, represents one of the largest HPC systems for open-science in the United States and the world. We discuss the optimization of the GW methodology for this system, including both node level and system-scale optimizations. We highlight multiple large scale (thousands of atoms) case studies and discuss both absolute application performance and comparison to calculations on more traditional HPC architectures. We find that the GW method is particularly well suited for many-core architectures due to the ability to exploit a large amount of parallelism across many layers of the system. This work was supported by the U.S. Department of Energy, Office of Science, Basic Energy Sciences, Materials Sciences and Engineering Division, as part of the Computational Materials Sciences Program.
      

      
      QCD thermodynamics with two flavors at Nt=6
      NASA Astrophysics Data System (ADS)
      Bernard, Claude; Ogilvie, Michael C.; Degrand, Thomas A.; Detar, Carleton; Gottlieb, Steven; Krasnitz, Alex; Sugar, R. L.; Toussaint, D.
         1992-05-01
         The first results of numerical simulations of quantum chromodynamics on the Intel iPSC/860 parallel processor are presented. We performed calculations with two flavors of Kogut-Susskind quarks at Nt=6 with masses of 0.15T and 0.075T (0.025 and 0.0125 in lattice units) in order to locate the crossover from the low-temperature regime of ordinary hadronic matter to the high-temperature chirally symmetric regime. As with other recent two-flavor simulations, these calculations are insufficient to distinguish between a rapid crossover and a true phase transition. The phase transition is either absent or feeble at this quark mass. An improved estimate of the crossover temperature in physical units is given and results are presented for the hadronic screening lengths in both the high- and low-temperature regimes.
      

      
      Best kept secrets ... Source Data Systems, Inc. (SDS).
      PubMed
      Andrew, W F
         1991-03-01
         The SDS/MEDNET system is a cost-effective option for small- to medium-size hospitals (up to 400 beds). The parameter-driven system lets users control operations with only occasional SDS assistance. A full application set, available for modular selection to reduce upfront costs while facilitating steady growth and protecting client investment, is adaptable to multi-facility environments. The industry-standard, Intel-based multi-user processors, network communications and protocols assure high efficiency, low-cost solutions independent of any one hardware vendor. Sustained growth in both client base and product offerings point to a high level of responsiveness and healthcare industry commitment. Corporate emphasis on user involvement and open systems integration assures clients of leading-edge capabilities. SDS/MEDNET will be a strong contender in selected marketing environments.
      

      
      Towards developing robust algorithms for solving partial differential equations on MIMD machines
      NASA Technical Reports Server (NTRS)
      Saltz, Joel H.; Naik, Vijay K.
         1988-01-01
         Methods for efficient computation of numerical algorithms on a wide variety of MIMD machines are proposed. These techniques reorganize the data dependency patterns to improve the processor utilization. The model problem finds the time-accurate solution to a parabolic partial differential equation discretized in space and implicitly marched forward in time. The algorithms are extensions of Jacobi and SOR. The extensions consist of iterating over a window of several timesteps, allowing efficient overlap of computation with communication. The methods increase the degree to which work can be performed while data are communicated between processors. The effect of the window size and of domain partitioning on the system performance is examined both by implementing the algorithm on a simulated multiprocessor system.
      

      
      Architecture and data processing alternatives for Tse computer. Volume 1: Tse logic design concepts and the development of image processing machine architectures
      NASA Technical Reports Server (NTRS)
      Rickard, D. A.; Bodenheimer, R. E.
         1976-01-01
         Digital computer components which perform two dimensional array logic operations (Tse logic) on binary data arrays are described. The properties of Golay transforms which make them useful in image processing are reviewed, and several architectures for Golay transform processors are presented with emphasis on the skeletonizing algorithm. Conventional logic control units developed for the Golay transform processors are described. One is a unique microprogrammable control unit that uses a microprocessor to control the Tse computer. The remaining control units are based on programmable logic arrays. Performance criteria are established and utilized to compare the various Golay transform machines developed. A critique of Tse logic is presented, and recommendations for additional research are included.
      

      
      Towards developing robust algorithms for solving partial differential equations on MIMD machines
      NASA Technical Reports Server (NTRS)
      Saltz, J. H.; Naik, V. K.
         1985-01-01
         Methods for efficient computation of numerical algorithms on a wide variety of MIMD machines are proposed. These techniques reorganize the data dependency patterns to improve the processor utilization. The model problem finds the time-accurate solution to a parabolic partial differential equation discretized in space and implicitly marched forward in time. The algorithms are extensions of Jacobi and SOR. The extensions consist of iterating over a window of several timesteps, allowing efficient overlap of computation with communication. The methods increase the degree to which work can be performed while data are communicated between processors. The effect of the window size and of domain partitioning on the system performance is examined both by implementing the algorithm on a simulated multiprocessor system.
      

      
      Implementing ISO/IEEE 11073: proposal of two different strategic approaches.
      PubMed
      Martínez-Espronceda, M; Serrano, L; Martínez, I; Escayola, J; Led, S; Trigo, J; García, J
         2008-01-01
         This paper explains the challenges encountered during the ISO/IEEE 11073 standard implementation process. The complexity of the standard and the consequent heavy requirements, which have not encouraged software engineers to adopt the standard. The developing complexity evaluation drives us to propose two possible implementation strategies that cover almost all possible use cases and eases handling the standard by non-expert users. The first one is focused on medical devices (MD) and proposes a low-memory and low-processor usage technique. It is based on message patterns that allow simple functions to generate ISO/IEEE 11073 messages and to process them easily. In this way a framework for MDs can be obtained. Second one is focused on more powerful machines such as data loggers or gateways (aka. computer engines (CE)), which do not have the MDs' memory and processor usage constraints. For CEs a more intelligent and adaptative Plug&Play (P&P) solution is provided. It consists on a general platform that can access to any device supported by the standard. Combining both strategies will cut developing time for applications based on ISO/EEE 11073.
      

      
      Evaluation of CHO Benchmarks on the Arria 10 FPGA using Intel FPGA SDK for OpenCL
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Jin, Zheming; Yoshii, Kazutomo; Finkel, Hal
         
         The OpenCL standard is an open programming model for accelerating algorithms on heterogeneous computing system. OpenCL extends the C-based programming language for developing portable codes on different platforms such as CPU, Graphics processing units (GPUs), Digital Signal Processors (DSPs) and Field Programmable Gate Arrays (FPGAs). The Intel FPGA SDK for OpenCL is a suite of tools that allows developers to abstract away the complex FPGA-based development flow for a high-level software development flow. Users can focus on the design of hardware-accelerated kernel functions in OpenCL and then direct the tools to generate the low-level FPGA implementations. The approach makes themore » FPGA-based development more accessible to software users as the needs for hybrid computing using CPUs and FPGAs are increasing. It can also significantly reduce the hardware development time as users can evaluate different ideas with high-level language without deep FPGA domain knowledge. Benchmarking of OpenCL-based framework is an effective way for analyzing the performance of system by studying the execution of the benchmark applications. CHO is a suite of benchmark applications that provides support for OpenCL [1]. The authors presented CHO as an OpenCL port of the CHStone benchmark. Using Altera OpenCL (AOCL) compiler to synthesize the benchmark applications, they listed the resource usage and performance of each kernel that can be successfully synthesized by the compiler. In this report, we evaluate the resource usage and performance of the CHO benchmark applications using the Intel FPGA SDK for OpenCL and Nallatech 385A FPGA board that features an Arria 10 FPGA device. The focus of the report is to have a better understanding of the resource usage and performance of the kernel implementations using Arria-10 FPGA devices compared to Stratix-5 FPGA devices. In addition, we also gain knowledge about the limitations of the current compiler when it fails to synthesize a benchmark application.« less
      

        
       
          

«

19
      20
      21
   22
      23
      »

          
        

     

   

   
       
            
              
          

«

20
      21
      22
   23
      24
      »

          
        

           
           
             
               
      
      Tactical Operations Analysis Support Facility.
      DTIC Science & Technology
      
         1983-07-01
         are stored in nonvolatile RAM (NVR). Communication with a host processor via a UART (75-19.2K bps) in full duplex mode. An advanced video option...hardware/firmware "machines." Smart terminals, I/O con- * trollers, and unique peripheral processors are examples of this process. Briton Lee, Inc...the relational data base for symbol attributes and data retrievals. * Generates a grid system for precise cursor positioning for lines, charts, and
      

      
      An executable specification for the message processor in a simple combining network
      NASA Technical Reports Server (NTRS)
      Middleton, David
         1995-01-01
         While the primary function of the network in a parallel computer is to communicate data between processors, it is often useful if the network can also perform rudimentary calculations. That is, some simple processing ability in the network itself, particularly for performing parallel prefix computations, can reduce both the volume of data being communicated and the computational load on the processors proper. Unfortunately, typical implementations of such networks require a large fraction of the hardware budget, and so combining networks are viewed as being impractical. The FFP Machine has such a combining network, and various characteristics of the machine allow a good deal of simplification in the network design. Despite being simple in construction however, the network relies on many subtle details to work correctly. This paper describes an executable model of the network which will serve several purposes. It provides a complete and detailed description of the network which can substantiate its ability to support necessary functions. It provides an environment in which algorithms to be run on the network can be designed and debugged more easily than they would on physical hardware. Finally, it provides the foundation for exploring the design of the message receiving facility which connects the network to the individual processors.
      

      
      Compact VLSI neural computer integrated with active pixel sensor for real-time ATR applications
      NASA Astrophysics Data System (ADS)
      Fang, Wai-Chi; Udomkesmalee, Gabriel; Alkalai, Leon
         1997-04-01
         A compact VLSI neural computer integrated with an active pixel sensor has been under development to mimic what is inherent in biological vision systems. This electronic eye- brain computer is targeted for real-time machine vision applications which require both high-bandwidth communication and high-performance computing for data sensing, synergy of multiple types of sensory information, feature extraction, target detection, target recognition, and control functions. The neural computer is based on a composite structure which combines Annealing Cellular Neural Network (ACNN) and Hierarchical Self-Organization Neural Network (HSONN). The ACNN architecture is a programmable and scalable multi- dimensional array of annealing neurons which are locally connected with their local neurons. Meanwhile, the HSONN adopts a hierarchical structure with nonlinear basis functions. The ACNN+HSONN neural computer is effectively designed to perform programmable functions for machine vision processing in all levels with its embedded host processor. It provides a two order-of-magnitude increase in computation power over the state-of-the-art microcomputer and DSP microelectronics. A compact current-mode VLSI design feasibility of the ACNN+HSONN neural computer is demonstrated by a 3D 16X8X9-cube neural processor chip design in a 2-micrometers CMOS technology. Integration of this neural computer as one slice of a 4'X4' multichip module into the 3D MCM based avionics architecture for NASA's New Millennium Program is also described.
      

      
      Parallel machine architecture for production rule systems
      DOEpatents
      Allen, Jr., John D.; Butler, Philip L.
         1989-01-01
         A parallel processing system for production rule programs utilizes a host processor for storing production rule right hand sides (RHS) and a plurality of rule processors for storing left hand sides (LHS). The rule processors operate in parallel in the recognize phase of the system recognize -Act Cycle to match their respective LHS's against a stored list of working memory elements (WME) in order to find a self consistent set of WME's. The list of WME is dynamically varied during the Act phase of the system in which the host executes or fires rule RHS's for those rules for which a self-consistent set has been found by the rule processors. The host transmits instructions for creating or deleting working memory elements as dictated by the rule firings until the rule processors are unable to find any further self-consistent working memory element sets at which time the production rule system is halted.
      

      
      Solving very large, sparse linear systems on mesh-connected parallel computers
      NASA Technical Reports Server (NTRS)
      Opsahl, Torstein; Reif, John
         1987-01-01
         The implementation of Pan and Reif's Parallel Nested Dissection (PND) algorithm on mesh connected parallel computers is described. This is the first known algorithm that allows very large, sparse linear systems of equations to be solved efficiently in polylog time using a small number of processors. How the processor bound of PND can be matched to the number of processors available on a given parallel computer by slowing down the algorithm by constant factors is described. Also, for the important class of problems where G(A) is a grid graph, a unique memory mapping that reduces the inter-processor communication requirements of PND to those that can be executed on mesh connected parallel machines is detailed. A description of an implementation on the Goodyear Massively Parallel Processor (MPP), located at Goddard is given. Also, a detailed discussion of data mappings and performance issues is given.
      

      
      Scaling Deep Learning Workloads: NVIDIA DGX-1/Pascal and Intel Knights Landing
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Gawande, Nitin A.; Landwehr, Joshua B.; Daily, Jeffrey A.
         
         Deep Learning (DL) algorithms have become ubiquitous in data analytics. As a result, major computing vendors --- including NVIDIA, Intel, AMD and IBM --- have architectural road-maps influenced by DL workloads. Furthermore, several vendors have recently advertised new computing products as accelerating DL workloads. Unfortunately, it is difficult for data scientists to quantify the potential of these different products. This paper provides a performance and power analysis of important DL workloads on two major parallel architectures: NVIDIA DGX-1 (eight Pascal P100 GPUs interconnected with NVLink) and Intel Knights Landing (KNL) CPUs interconnected with Intel Omni-Path. Our evaluation consists of amore » cross section of convolutional neural net workloads: CifarNet, CaffeNet, AlexNet and GoogleNet topologies using the Cifar10 and ImageNet datasets. The workloads are vendor optimized for each architecture. GPUs provide the highest overall raw performance. Our analysis indicates that although GPUs provide the highest overall performance, the gap can close for some convolutional networks; and KNL can be competitive when considering performance/watt. Furthermore, NVLink is critical to GPU scaling.« less
      

      
      Phantom-GRAPE: Numerical software library to accelerate collisionless N-body simulation with SIMD instruction set on x86 architecture
      NASA Astrophysics Data System (ADS)
      Tanikawa, Ataru; Yoshikawa, Kohji; Nitadori, Keigo; Okamoto, Takashi
         2013-02-01
         We have developed a numerical software library for collisionless N-body simulations named "Phantom-GRAPE" which highly accelerates force calculations among particles by use of a new SIMD instruction set extension to the x86 architecture, Advanced Vector eXtensions (AVX), an enhanced version of the Streaming SIMD Extensions (SSE). In our library, not only the Newton's forces, but also central forces with an arbitrary shape f(r), which has a finite cutoff radius rcut (i.e. f(r)=0 at r>rcut), can be quickly computed. In computing such central forces with an arbitrary force shape f(r), we refer to a pre-calculated look-up table. We also present a new scheme to create the look-up table whose binning is optimal to keep good accuracy in computing forces and whose size is small enough to avoid cache misses. Using an Intel Core i7-2600 processor, we measure the performance of our library for both of the Newton's forces and the arbitrarily shaped central forces. In the case of Newton's forces, we achieve 2×109 interactions per second with one processor core (or 75 GFLOPS if we count 38 operations per interaction), which is 20 times higher than the performance of an implementation without any explicit use of SIMD instructions, and 2 times than that with the SSE instructions. With four processor cores, we obtain the performance of 8×109 interactions per second (or 300 GFLOPS). In the case of the arbitrarily shaped central forces, we can calculate 1×109 and 4×109 interactions per second with one and four processor cores, respectively. The performance with one processor core is 6 times and 2 times higher than those of the implementations without any use of SIMD instructions and with the SSE instructions. These performances depend only weakly on the number of particles, irrespective of the force shape. It is good contrast with the fact that the performance of force calculations accelerated by graphics processing units (GPUs) depends strongly on the number of particles. Substantially weak dependence of the performance on the number of particles is suitable to collisionless N-body simulations, since these simulations are usually performed with sophisticated N-body solvers such as Tree- and TreePM-methods combined with an individual timestep scheme. We conclude that collisionless N-body simulations accelerated with our library have significant advantage over those accelerated by GPUs, especially on massively parallel environments.
      

      
      Tensorial Basis Spline Collocation Method for Poisson's Equation
      NASA Astrophysics Data System (ADS)
      Plagne, Laurent; Berthou, Jean-Yves
         2000-01-01
         This paper aims to describe the tensorial basis spline collocation method applied to Poisson's equation. In the case of a localized 3D charge distribution in vacuum, this direct method based on a tensorial decomposition of the differential operator is shown to be competitive with both iterative BSCM and FFT-based methods. We emphasize the O(h4) and O(h6) convergence of TBSCM for cubic and quintic splines, respectively. We describe the implementation of this method on a distributed memory parallel machine. Performance measurements on a Cray T3E are reported. Our code exhibits high performance and good scalability: As an example, a 27 Gflops performance is obtained when solving Poisson's equation on a 2563 non-uniform 3D Cartesian mesh by using 128 T3E-750 processors. This represents 215 Mflops per processors.
      

      
      Development Of A Three-Dimensional Circuit Integration Technology And Computer Architecture
      NASA Astrophysics Data System (ADS)
      Etchells, R. D.; Grinberg, J.; Nudd, G. R.
         1981-12-01
         This paper is the first of a series 1,2,3 describing a range of efforts at Hughes Research Laboratories, which are collectively referred to as "Three-Dimensional Microelectronics." The technology being developed is a combination of a unique circuit fabrication/packaging technology and a novel processing architecture. The packaging technology greatly reduces the parasitic impedances associated with signal-routing in complex VLSI structures, while simultaneously allowing circuit densities orders of magnitude higher than the current state-of-the-art. When combined with the 3-D processor architecture, the resulting machine exhibits a one- to two-order of magnitude simultaneous improvement over current state-of-the-art machines in the three areas of processing speed, power consumption, and physical volume. The 3-D architecture is essentially that commonly referred to as a "cellular array", with the ultimate implementation having as many as 512 x 512 processors working in parallel. The three-dimensional nature of the assembled machine arises from the fact that the chips containing the active circuitry of the processor are stacked on top of each other. In this structure, electrical signals are passed vertically through the chips via thermomigrated aluminum feedthroughs. Signals are passed between adjacent chips by micro-interconnects. This discussion presents a broad view of the total effort, as well as a more detailed treatment of the fabrication and packaging technologies themselves. The results of performance simulations of the completed 3-D processor executing a variety of algorithms are also presented. Of particular pertinence to the interests of the focal-plane array community is the simulation of the UNICORNS nonuniformity correction algorithms as executed by the 3-D architecture.
      

      
      Satellite antenna management system and method
      NASA Technical Reports Server (NTRS)
      Leath, Timothy T (Inventor); Azzolini, John D (Inventor)
         1999-01-01
         The antenna management system and method allow a satellite to communicate with a ground station either directly or by an intermediary of a second satellite, thus permitting communication even when the satellite is not within range of the ground station. The system and method employ five major software components, which are the control and initialization module, the command and telemetry handler module, the contact schedule processor module, the contact state machining module, and the telemetry state machine module. The control and initialization module initializes the system and operates the main control cycle, in which the other modules are called. The command and telemetry handler module handles communication to and from the ground station. The contact scheduler processor module handles the contact entry schedules to allow scheduling of contacts with the second satellite. The contact and telemetry state machine modules handle the various states of the satellite in beginning, maintaining and ending contact with the second satellite and in beginning, maintaining and ending communication with the satellite.
      

      
      Highly parallel sparse Cholesky factorization
      NASA Technical Reports Server (NTRS)
      Gilbert, John R.; Schreiber, Robert
         1990-01-01
         Several fine grained parallel algorithms were developed and compared to compute the Cholesky factorization of a sparse matrix. The experimental implementations are on the Connection Machine, a distributed memory SIMD machine whose programming model conceptually supplies one processor per data element. In contrast to special purpose algorithms in which the matrix structure conforms to the connection structure of the machine, the focus is on matrices with arbitrary sparsity structure. The most promising algorithm is one whose inner loop performs several dense factorizations simultaneously on a 2-D grid of processors. Virtually any massively parallel dense factorization algorithm can be used as the key subroutine. The sparse code attains execution rates comparable to those of the dense subroutine. Although at present architectural limitations prevent the dense factorization from realizing its potential efficiency, it is concluded that a regular data parallel architecture can be used efficiently to solve arbitrarily structured sparse problems. A performance model is also presented and it is used to analyze the algorithms.
      

      
      WARP: Weight Associative Rule Processor. A dedicated VLSI fuzzy logic megacell
      NASA Technical Reports Server (NTRS)
      Pagni, A.; Poluzzi, R.; Rizzotto, G. G.
         1992-01-01
         During the last five years Fuzzy Logic has gained enormous popularity in the academic and industrial worlds. The success of this new methodology has led the microelectronics industry to create a new class of machines, called Fuzzy Machines, to overcome the limitations of traditional computing systems when utilized as Fuzzy Systems. This paper gives an overview of the methods by which Fuzzy Logic data structures are represented in the machines (each with its own advantages and inefficiencies). Next, the paper introduces WARP (Weight Associative Rule Processor) which is a dedicated VLSI megacell allowing the realization of a fuzzy controller suitable for a wide range of applications. WARP represents an innovative approach to VLSI Fuzzy controllers by utilizing different types of data structures for characterizing the membership functions during the various stages of the Fuzzy processing. WARP dedicated architecture has been designed in order to achieve high performance by exploiting the computational advantages offered by the different data representations.
      

      
      Concurrent computation of attribute filters on shared memory parallel machines.
      PubMed
      Wilkinson, Michael H F; Gao, Hui; Hesselink, Wim H; Jonker, Jan-Eppo; Meijster, Arnold
         2008-10-01
         Morphological attribute filters have not previously been parallelized, mainly because they are both global and non-separable. We propose a parallel algorithm that achieves efficient parallelism for a large class of attribute filters, including attribute openings, closings, thinnings and thickenings, based on Salembier's Max-Trees and Min-trees. The image or volume is first partitioned in multiple slices. We then compute the Max-trees of each slice using any sequential Max-Tree algorithm. Subsequently, the Max-trees of the slices can be merged to obtain the Max-tree of the image. A C-implementation yielded good speed-ups on both a 16-processor MIPS 14000 parallel machine, and a dual-core Opteron-based machine. It is shown that the speed-up of the parallel algorithm is a direct measure of the gain with respect to the sequential algorithm used. Furthermore, the concurrent algorithm shows a speed gain of up to 72 percent on a single-core processor, due to reduced cache thrashing.
      

      
      Efficient Load Balancing and Data Remapping for Adaptive Grid Calculations
      NASA Technical Reports Server (NTRS)
      Oliker, Leonid; Biswas, Rupak
         1997-01-01
         Mesh adaption is a powerful tool for efficient unstructured- grid computations but causes load imbalance among processors on a parallel machine. We present a novel method to dynamically balance the processor workloads with a global view. This paper presents, for the first time, the implementation and integration of all major components within our dynamic load balancing strategy for adaptive grid calculations. Mesh adaption, repartitioning, processor assignment, and remapping are critical components of the framework that must be accomplished rapidly and efficiently so as not to cause a significant overhead to the numerical simulation. Previous results indicated that mesh repartitioning and data remapping are potential bottlenecks for performing large-scale scientific calculations. We resolve these issues and demonstrate that our framework remains viable on a large number of processors.
      

      
      Extended Logic Intelligent Processing System for a Sensor Fusion Processor Hardware
      NASA Technical Reports Server (NTRS)
      Stoica, Adrian; Thomas, Tyson; Li, Wei-Te; Daud, Taher; Fabunmi, James
         2000-01-01
         The paper presents the hardware implementation and initial tests from a low-power, highspeed reconfigurable sensor fusion processor. The Extended Logic Intelligent Processing System (ELIPS) is described, which combines rule-based systems, fuzzy logic, and neural networks to achieve parallel fusion of sensor signals in compact low power VLSI. The development of the ELIPS concept is being done to demonstrate the interceptor functionality which particularly underlines the high speed and low power requirements. The hardware programmability allows the processor to reconfigure into different machines, taking the most efficient hardware implementation during each phase of information processing. Processing speeds of microseconds have been demonstrated using our test hardware.
      

      
      Latency-Information Theory: The Mathematical-Physical Theory of Communication-Observation
      DTIC Science & Technology
      
         2010-01-01
         Werner Heisenberg of quantum mechanics; 3) the source-entropy and channel-capacity lossless performance bounds of Claude Shannon that guide...through noisy intel-space channels, and where the physical time-dislocations of intel-space exhibit a passing of time Heisenberg information...life-space sensor, and where the physical time- dislocations of life-space exhibit a passing of time Heisenberg information-uncertainty; and 4
      

      
      Performance tuning Weather Research and Forecasting (WRF) Goddard longwave radiative transfer scheme on Intel Xeon Phi
      NASA Astrophysics Data System (ADS)
      Mielikainen, Jarno; Huang, Bormin; Huang, Allen H.
         2015-10-01
         Next-generation mesoscale numerical weather prediction system, the Weather Research and Forecasting (WRF) model, is a designed for dual use for forecasting and research. WRF offers multiple physics options that can be combined in any way. One of the physics options is radiance computation. The major source for energy for the earth's climate is solar radiation. Thus, it is imperative to accurately model horizontal and vertical distribution of the heating. Goddard solar radiative transfer model includes the absorption duo to water vapor,ozone, ozygen, carbon dioxide, clouds and aerosols. The model computes the interactions among the absorption and scattering by clouds, aerosols, molecules and surface. Finally, fluxes are integrated over the entire longwave spectrum.In this paper, we present our results of optimizing the Goddard longwave radiative transfer scheme on Intel Many Integrated Core Architecture (MIC) hardware. The Intel Xeon Phi coprocessor is the first product based on Intel MIC architecture, and it consists of up to 61 cores connected by a high performance on-die bidirectional interconnect. The coprocessor supports all important Intel development tools. Thus, the development environment is familiar one to a vast number of CPU developers. Although, getting a maximum performance out of MICs will require using some novel optimization techniques. Those optimization techniques are discusses in this paper. The optimizations improved the performance of the original Goddard longwave radiative transfer scheme on Xeon Phi 7120P by a factor of 2.2x. Furthermore, the same optimizations improved the performance of the Goddard longwave radiative transfer scheme on a dual socket configuration of eight core Intel Xeon E5-2670 CPUs by a factor of 2.1x compared to the original Goddard longwave radiative transfer scheme code.
      

      
      Multimedia systems in ultrasound image boundary detection and measurements
      NASA Astrophysics Data System (ADS)
      Pathak, Sayan D.; Chalana, Vikram; Kim, Yongmin
         1997-05-01
         Ultrasound as a medical imaging modality offers the clinician a real-time of the anatomy of the internal organs/tissues, their movement, and flow noninvasively. One of the applications of ultrasound is to monitor fetal growth by measuring biparietal diameter (BPD) and head circumference (HC). We have been working on automatic detection of fetal head boundaries in ultrasound images. These detected boundaries are used to measure BPD and HC. The boundary detection algorithm is based on active contour models and takes 32 seconds on an external high-end workstation, SUN SparcStation 20/71. Our goal has been to make this tool available within an ultrasound machine and at the same time significantly improve its performance utilizing multimedia technology. With the advent of high- performance programmable digital signal processors (DSP), the software solution within an ultrasound machine instead of the traditional hardwired approach or requiring an external computer is now possible. We have integrated our boundary detection algorithm into a programmable ultrasound image processor (PUIP) that fits into a commercial ultrasound machine. The PUIP provides both the high computing power and flexibility needed to support computationally-intensive image processing algorithms within an ultrasound machine. According to our data analysis, BPD/HC measurements made on PUIP lie within the interobserver variability. Hence, the errors in the automated BPD/HC measurements using the algorithm are on the same order as the average interobserver differences. On PUIP, it takes 360 ms to measure the values of BPD/HC on one head image. When processing multiple head images in sequence, it takes 185 ms per image, thus enabling 5.4 BPD/HC measurements per second. Reduction in the overall execution time from 32 seconds to a fraction of a second and making this multimedia system available within an ultrasound machine will help this image processing algorithm and other computer-intensive imaging applications become a practical tool for the sonographers in the feature.
      

      
      A scalable PC-based parallel computer for lattice QCD
      NASA Astrophysics Data System (ADS)
      Fodor, Z.; Katz, S. D.; Pappa, G.
         2003-05-01
         A PC-based parallel computer for medium/large scale lattice QCD simulations is suggested. The Eo¨tvo¨s Univ., Inst. Theor. Phys. cluster consists of 137 Intel P4-1.7GHz nodes. Gigabit Ethernet cards are used for nearest neighbor communication in a two-dimensional mesh. The sustained performance for dynamical staggered (wilson) quarks on large lattices is around 70(110) GFlops. The exceptional price/performance ratio is below $1/Mflop.
      

      
      Fine-grained parallelism accelerating for RNA secondary structure prediction with pseudoknots based on FPGA.
      PubMed
      Xia, Fei; Jin, Guoqing
         2014-06-01
         PKNOTS is a most famous benchmark program and has been widely used to predict RNA secondary structure including pseudoknots. It adopts the standard four-dimensional (4D) dynamic programming (DP) method and is the basis of many variants and improved algorithms. Unfortunately, the O(N(6)) computing requirements and complicated data dependency greatly limits the usefulness of PKNOTS package with the explosion in gene database size. In this paper, we present a fine-grained parallel PKNOTS package and prototype system for accelerating RNA folding application based on FPGA chip. We adopted a series of storage optimization strategies to resolve the "Memory Wall" problem. We aggressively exploit parallel computing strategies to improve computational efficiency. We also propose several methods that collectively reduce the storage requirements for FPGA on-chip memory. To the best of our knowledge, our design is the first FPGA implementation for accelerating 4D DP problem for RNA folding application including pseudoknots. The experimental results show a factor of more than 50x average speedup over the PKNOTS-1.08 software running on a PC platform with Intel Core2 Q9400 Quad CPU for input RNA sequences. However, the power consumption of our FPGA accelerator is only about 50% of the general-purpose micro-processors.
      

        
       
          

«

20
      21
      22
   23
      24
      »

          
        

     

   

   
       
            
              
          

«

21
      22
      23
   24
      25
      »

          
        

           
           
             
               
      
      Electroacoustic dewatering of food and other suspensions
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Kim, B.C.; Zelinski, M.S.; Criner, C.L.
         1989-05-31
         The food processing industry is a large user of energy for evaporative drying due to limited effectiveness of conventional mechanical dewatering machines. Battelle's Electroacoustic Dewatering (EAD) process improves the performance of mechanical dewatering machines by superimposing electric and ultrasonic fields. A two phase development program to demonstrate the benefits of EAD was carried out in cooperation with the food processing industry, the National Food Processors Association (NFPA) and two equipment vendors. In Phase I, laboratory scale studies were carried out on a variety of food suspensions. The process was scaled up to small commercial scale in Phase II. The technicalmore » feasibility of EAD for a variety of food materials, without adversely affecting the food properties, was successfully demonstrated during this phase, which is the subject of this report. Two Process Research Units (PRUs) were designed and built through joint efforts between Battelle and two equipment vendors. A 0.5-meter wide belt press was tested on apple mash, corn fiber, and corn gluten at sites provided by two food processors. A high speed citrus juice finisher (a hybrid form of screw press and centrifuge) was tested on orange pulp. These tests were carried out jointly by Battelle, equipment vendors, NFPA, and food processors. The apple and citrus juice products were analyzed by food processors and NFPA. 26 figs., 30 tabs.« less
      

      
      A distributed microcomputer-controlled system for data acquisition and power spectral analysis of EEG.
      PubMed
      Vo, T D; Dwyer, G; Szeto, H H
         1986-04-01
         A relatively powerful and inexpensive microcomputer-based system for the spectral analysis of the EEG is presented. High resolution and speed is achieved with the use of recently available large-scale integrated circuit technology with enhanced functionality (INTEL Math co-processors 8087) which can perform transcendental functions rapidly. The versatility of the system is achieved with a hardware organization that has distributed data acquisition capability performed by the use of a microprocessor-based analog to digital converter with large resident memory (Cyborg ISAAC-2000). Compiled BASIC programs and assembly language subroutines perform on-line or off-line the fast Fourier transform and spectral analysis of the EEG which is stored as soft as well as hard copy. Some results obtained from test application of the entire system in animal studies are presented.
      

      
      A comparative study of serial and parallel aeroelastic computations of wings
      NASA Technical Reports Server (NTRS)
      Byun, Chansup; Guruswamy, Guru P.
         1994-01-01
         A procedure for computing the aeroelasticity of wings on parallel multiple-instruction, multiple-data (MIMD) computers is presented. In this procedure, fluids are modeled using Euler equations, and structures are modeled using modal or finite element equations. The procedure is designed in such a way that each discipline can be developed and maintained independently by using a domain decomposition approach. In the present parallel procedure, each computational domain is scalable. A parallel integration scheme is used to compute aeroelastic responses by solving fluid and structural equations concurrently. The computational efficiency issues of parallel integration of both fluid and structural equations are investigated in detail. This approach, which reduces the total computational time by a factor of almost 2, is demonstrated for a typical aeroelastic wing by using various numbers of processors on the Intel iPSC/860.
      

      
      Stanford Hardware Development Program
      NASA Technical Reports Server (NTRS)
      Peterson, A.; Linscott, I.; Burr, J.
         1986-01-01
         Architectures for high performance, digital signal processing, particularly for high resolution, wide band spectrum analysis were developed. These developments are intended to provide instrumentation for NASA's Search for Extraterrestrial Intelligence (SETI) program. The real time signal processing is both formal and experimental. The efficient organization and optimal scheduling of signal processing algorithms were investigated. The work is complemented by efforts in processor architecture design and implementation. A high resolution, multichannel spectrometer that incorporates special purpose microcoded signal processors is being tested. A general purpose signal processor for the data from the multichannel spectrometer was designed to function as the processing element in a highly concurrent machine. The processor performance required for the spectrometer is in the range of 1000 to 10,000 million instructions per second (MIPS). Multiple node processor configurations, where each node performs at 100 MIPS, are sought. The nodes are microprogrammable and are interconnected through a network with high bandwidth for neighboring nodes, and medium bandwidth for nodes at larger distance. The implementation of both the current mutlichannel spectrometer and the signal processor as Very Large Scale Integration CMOS chip sets was commenced.
      

      
      Parallelising a molecular dynamics algorithm on a multi-processor workstation
      NASA Astrophysics Data System (ADS)
      Müller-Plathe, Florian
         1990-12-01
         The Verlet neighbour-list algorithm is parallelised for a multi-processor Hewlett-Packard/Apollo DN10000 workstation. The implementation makes use of memory shared between the processors. It is a genuine master-slave approach by which most of the computational tasks are kept in the master process and the slaves are only called to do part of the nonbonded forces calculation. The implementation features elements of both fine-grain and coarse-grain parallelism. Apart from three calls to library routines, two of which are standard UNIX calls, and two machine-specific language extensions, the whole code is written in standard Fortran 77. Hence, it may be expected that this parallelisation concept can be transfered in parts or as a whole to other multi-processor shared-memory computers. The parallel code is routinely used in production work.
      

      
      Instrumentation, performance visualization, and debugging tools for multiprocessors
      NASA Technical Reports Server (NTRS)
      Yan, Jerry C.; Fineman, Charles E.; Hontalas, Philip J.
         1991-01-01
         The need for computing power has forced a migration from serial computation on a single processor to parallel processing on multiprocessor architectures. However, without effective means to monitor (and visualize) program execution, debugging, and tuning parallel programs becomes intractably difficult as program complexity increases with the number of processors. Research on performance evaluation tools for multiprocessors is being carried out at ARC. Besides investigating new techniques for instrumenting, monitoring, and presenting the state of parallel program execution in a coherent and user-friendly manner, prototypes of software tools are being incorporated into the run-time environments of various hardware testbeds to evaluate their impact on user productivity. Our current tool set, the Ames Instrumentation Systems (AIMS), incorporates features from various software systems developed in academia and industry. The execution of FORTRAN programs on the Intel iPSC/860 can be automatically instrumented and monitored. Performance data collected in this manner can be displayed graphically on workstations supporting X-Windows. We have successfully compared various parallel algorithms for computational fluid dynamics (CFD) applications in collaboration with scientists from the Numerical Aerodynamic Simulation Systems Division. By performing these comparisons, we show that performance monitors and debuggers such as AIMS are practical and can illuminate the complex dynamics that occur within parallel programs.
      

      
      Efficient Approximation Algorithms for Weighted $b$-Matching
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Khan, Arif; Pothen, Alex; Mostofa Ali Patwary, Md.
         2016-01-01
         We describe a half-approximation algorithm, b-Suitor, for computing a b-Matching of maximum weight in a graph with weights on the edges. b-Matching is a generalization of the well-known Matching problem in graphs, where the objective is to choose a subset of M edges in the graph such that at most a specified number b(v) of edges in M are incident on each vertex v. Subject to this restriction we maximize the sum of the weights of the edges in M. We prove that the b-Suitor algorithm computes the same b-Matching as the one obtained by the greedy algorithm for themore » problem. We implement the algorithm on serial and shared-memory parallel processors, and compare its performance against a collection of approximation algorithms that have been proposed for the Matching problem. Our results show that the b-Suitor algorithm outperforms the Greedy and Locally Dominant edge algorithms by one to two orders of magnitude on a serial processor. The b-Suitor algorithm has a high degree of concurrency, and it scales well up to 240 threads on a shared memory multiprocessor. The b-Suitor algorithm outperforms the Locally Dominant edge algorithm by a factor of fourteen on 16 cores of an Intel Xeon multiprocessor.« less
      

      
      A customizable system for real-time image processing using the Blackfin DSProcessor and the MicroC/OS-II real-time kernel
      NASA Astrophysics Data System (ADS)
      Coffey, Stephen; Connell, Joseph
         2005-06-01
         This paper presents a development platform for real-time image processing based on the ADSP-BF533 Blackfin processor and the MicroC/OS-II real-time operating system (RTOS). MicroC/OS-II is a completely portable, ROMable, pre-emptive, real-time kernel. The Blackfin Digital Signal Processors (DSPs), incorporating the Analog Devices/Intel Micro Signal Architecture (MSA), are a broad family of 16-bit fixed-point products with a dual Multiply Accumulate (MAC) core. In addition, they have a rich instruction set with variable instruction length and both DSP and MCU functionality thus making them ideal for media based applications. Using the MicroC/OS-II for task scheduling and management, the proposed system can capture and process raw RGB data from any standard 8-bit greyscale image sensor in soft real-time and then display the processed result using a simple PC graphical user interface (GUI). Additionally, the GUI allows configuration of the image capture rate and the system and core DSP clock rates thereby allowing connectivity to a selection of image sensors and memory devices. The GUI also allows selection from a set of image processing algorithms based in the embedded operating system.
      

      
      Multiprocessing the Sieve of Eratosthenes
      NASA Technical Reports Server (NTRS)
      Bokhari, S.
         1986-01-01
         The Sieve of Eratosthenes for finding prime numbers in recent years has seen much use as a benchmark algorithm for serial computers while its intrinsically parallel nature has gone largely unnoticed. The implementation of a parallel version of this algorithm for a real parallel computer, the Flex/32, is described and its performance discussed. It is shown that the algorithm is sensitive to several fundamental performance parameters of parallel machines, such as spawning time, signaling time, memory access, and overhead of process switching. Because of the nature of the algorithm, it is impossible to get any speedup beyond 4 or 5 processors unless some form of dynamic load balancing is employed. We describe the performance of our algorithm with and without load balancing and compare it with theoretical lower bounds and simulated results. It is straightforward to understand this algorithm and to check the final results. However, its efficient implementation on a real parallel machine requires thoughtful design, especially if dynamic load balancing is desired. The fundamental operations required by the algorithm are very simple: this means that the slightest overhead appears prominently in performance data. The Sieve thus serves not only as a very severe test of the capabilities of a parallel processor but is also an interesting challenge for the programmer.
      

      
      Parallel definition of tear film maps on distributed-memory clusters for the support of dry eye diagnosis.
      PubMed
      González-Domínguez, Jorge; Remeseiro, Beatriz; Martín, María J
         2017-02-01
         The analysis of the interference patterns on the tear film lipid layer is a useful clinical test to diagnose dry eye syndrome. This task can be automated with a high degree of accuracy by means of the use of tear film maps. However, the time required by the existing applications to generate them prevents a wider acceptance of this method by medical experts. Multithreading has been previously successfully employed by the authors to accelerate the tear film map definition on multicore single-node machines. In this work, we propose a hybrid message-passing and multithreading parallel approach that further accelerates the generation of tear film maps by exploiting the computational capabilities of distributed-memory systems such as multicore clusters and supercomputers. The algorithm for drawing tear film maps is parallelized using Message Passing Interface (MPI) for inter-node communications and the multithreading support available in the C++11 standard for intra-node parallelization. The original algorithm is modified to reduce the communications and increase the scalability. The hybrid method has been tested on 32 nodes of an Intel cluster (with two 12-core Haswell 2680v3 processors per node) using 50 representative images. Results show that maximum runtime is reduced from almost two minutes using the previous only-multithreaded approach to less than ten seconds using the hybrid method. The hybrid MPI/multithreaded implementation can be used by medical experts to obtain tear film maps in only a few seconds, which will significantly accelerate and facilitate the diagnosis of the dry eye syndrome. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
      

      
      Kalman Filter Tracking on Parallel Architectures
      NASA Astrophysics Data System (ADS)
      Cerati, Giuseppe; Elmer, Peter; Lantz, Steven; McDermott, Kevin; Riley, Dan; Tadel, Matevž; Wittich, Peter; Würthwein, Frank; Yagil, Avi
         2015-12-01
         Power density constraints are limiting the performance improvements of modern CPUs. To address this we have seen the introduction of lower-power, multi-core processors, but the future will be even more exciting. In order to stay within the power density limits but still obtain Moore's Law performance/price gains, it will be necessary to parallelize algorithms to exploit larger numbers of lightweight cores and specialized functions like large vector units. Example technologies today include Intel's Xeon Phi and GPGPUs. Track finding and fitting is one of the most computationally challenging problems for event reconstruction in particle physics. At the High Luminosity LHC, for example, this will be by far the dominant problem. The need for greater parallelism has driven investigations of very different track finding techniques including Cellular Automata or returning to Hough Transform. The most common track finding techniques in use today are however those based on the Kalman Filter [2]. Significant experience has been accumulated with these techniques on real tracking detector systems, both in the trigger and offline. They are known to provide high physics performance, are robust and are exactly those being used today for the design of the tracking system for HL-LHC. Our previous investigations showed that, using optimized data structures, track fitting with Kalman Filter can achieve large speedup both with Intel Xeon and Xeon Phi. We report here our further progress towards an end-to-end track reconstruction algorithm fully exploiting vectorization and parallelization techniques in a realistic simulation setup.
      

      
      Computing Maximum Cardinality Matchings in Parallel on Bipartite Graphs via Tree-Grafting
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Azad, Ariful; Buluc, Aydn; Pothen, Alex
         
         It is difficult to obtain high performance when computing matchings on parallel processors because matching algorithms explicitly or implicitly search for paths in the graph, and when these paths become long, there is little concurrency. In spite of this limitation, we present a new algorithm and its shared-memory parallelization that achieves good performance and scalability in computing maximum cardinality matchings in bipartite graphs. This algorithm searches for augmenting paths via specialized breadth-first searches (BFS) from multiple source vertices, hence creating more parallelism than single source algorithms. Algorithms that employ multiple-source searches cannot discard a search tree once no augmenting pathmore » is discovered from the tree, unlike algorithms that rely on single-source searches. We describe a novel tree-grafting method that eliminates most of the redundant edge traversals resulting from this property of multiple-source searches. We also employ the recent direction-optimizing BFS algorithm as a subroutine to discover augmenting paths faster. Our algorithm compares favorably with the current best algorithms in terms of the number of edges traversed, the average augmenting path length, and the number of iterations. Here, we provide a proof of correctness for our algorithm. Our NUMA-aware implementation is scalable to 80 threads of an Intel multiprocessor and to 240 threads on an Intel Knights Corner coprocessor. On average, our parallel algorithm runs an order of magnitude faster than the fastest algorithms available. The performance improvement is more significant on graphs with small matching number.« less
      

      
      Computing Maximum Cardinality Matchings in Parallel on Bipartite Graphs via Tree-Grafting
      DOE PAGES
      Azad, Ariful; Buluc, Aydn; Pothen, Alex
         2016-03-24
         It is difficult to obtain high performance when computing matchings on parallel processors because matching algorithms explicitly or implicitly search for paths in the graph, and when these paths become long, there is little concurrency. In spite of this limitation, we present a new algorithm and its shared-memory parallelization that achieves good performance and scalability in computing maximum cardinality matchings in bipartite graphs. This algorithm searches for augmenting paths via specialized breadth-first searches (BFS) from multiple source vertices, hence creating more parallelism than single source algorithms. Algorithms that employ multiple-source searches cannot discard a search tree once no augmenting pathmore » is discovered from the tree, unlike algorithms that rely on single-source searches. We describe a novel tree-grafting method that eliminates most of the redundant edge traversals resulting from this property of multiple-source searches. We also employ the recent direction-optimizing BFS algorithm as a subroutine to discover augmenting paths faster. Our algorithm compares favorably with the current best algorithms in terms of the number of edges traversed, the average augmenting path length, and the number of iterations. Here, we provide a proof of correctness for our algorithm. Our NUMA-aware implementation is scalable to 80 threads of an Intel multiprocessor and to 240 threads on an Intel Knights Corner coprocessor. On average, our parallel algorithm runs an order of magnitude faster than the fastest algorithms available. The performance improvement is more significant on graphs with small matching number.« less
      

      
      Load Balancing Unstructured Adaptive Grids for CFD Problems
      NASA Technical Reports Server (NTRS)
      Biswas, Rupak; Oliker, Leonid
         1996-01-01
         Mesh adaption is a powerful tool for efficient unstructured-grid computations but causes load imbalance among processors on a parallel machine. A dynamic load balancing method is presented that balances the workload across all processors with a global view. After each parallel tetrahedral mesh adaption, the method first determines if the new mesh is sufficiently unbalanced to warrant a repartitioning. If so, the adapted mesh is repartitioned, with new partitions assigned to processors so that the redistribution cost is minimized. The new partitions are accepted only if the remapping cost is compensated by the improved load balance. Results indicate that this strategy is effective for large-scale scientific computations on distributed-memory multiprocessors.
      

      
      Parasail: SIMD C library for global, semi-global, and local pairwise sequence alignments.
      PubMed
      Daily, Jeff
         2016-02-10
         Sequence alignment algorithms are a key component of many bioinformatics applications. Though various fast Smith-Waterman local sequence alignment implementations have been developed for x86 CPUs, most are embedded into larger database search tools. In addition, fast implementations of Needleman-Wunsch global sequence alignment and its semi-global variants are not as widespread. This article presents the first software library for local, global, and semi-global pairwise intra-sequence alignments and improves the performance of previous intra-sequence implementations. A faster intra-sequence local pairwise alignment implementation is described and benchmarked, including new global and semi-global variants. Using a 375 residue query sequence a speed of 136 billion cell updates per second (GCUPS) was achieved on a dual Intel Xeon E5-2670 24-core processor system, the highest reported for an implementation based on Farrar's 'striped' approach. Rognes's SWIPE optimal database search application is still generally the fastest available at 1.2 to at best 2.4 times faster than Parasail for sequences shorter than 500 amino acids. However, Parasail was faster for longer sequences. For global alignments, Parasail's prefix scan implementation is generally the fastest, faster even than Farrar's 'striped' approach, however the opal library is faster for single-threaded applications. The software library is designed for 64 bit Linux, OS X, or Windows on processors with SSE2, SSE41, or AVX2. Source code is available from https://github.com/jeffdaily/parasail under the Battelle BSD-style license. Applications that require optimal alignment scores could benefit from the improved performance. For the first time, SIMD global, semi-global, and local alignments are available in a stand-alone C library.
      

      
      High-speed assembly language (80386/80387) programming for laser spectra scan control and data acquisition providing improved resolution water vapor spectroscopy
      NASA Technical Reports Server (NTRS)
      Allen, Robert J.
         1988-01-01
         An assembly language program using the Intel 80386 CPU and 80387 math co-processor chips was written to increase the speed of data gathering and processing, and provide control of a scanning CW ring dye laser system. This laser system is used in high resolution (better than 0.001 cm-1) water vapor spectroscopy experiments. Laser beam power is sensed at the input and output of white cells and the output of a Fabry-Perot. The assembly language subroutine is called from Basic, acquires the data and performs various calculations at rates greater than 150 faster than could be performed by the higher level language. The width of output control pulses generated in assembly language are 3 to 4 microsecs as compared to 2 to 3.7 millisecs for those generated in Basic (about 500 to 1000 times faster). Included are a block diagram and brief description of the spectroscopy experiment, a flow diagram of the Basic and assembly language programs, listing of the programs, scope photographs of the computer generated 5-volt pulses used for control and timing analysis, and representative water spectrum curves obtained using these programs.
      

      
      Protein structure database search and evolutionary classification.
      PubMed
      Yang, Jinn-Moon; Tung, Chi-Hua
         2006-01-01
         As more protein structures become available and structural genomics efforts provide structural models in a genome-wide strategy, there is a growing need for fast and accurate methods for discovering homologous proteins and evolutionary classifications of newly determined structures. We have developed 3D-BLAST, in part, to address these issues. 3D-BLAST is as fast as BLAST and calculates the statistical significance (E-value) of an alignment to indicate the reliability of the prediction. Using this method, we first identified 23 states of the structural alphabet that represent pattern profiles of the backbone fragments and then used them to represent protein structure databases as structural alphabet sequence databases (SADB). Our method enhanced BLAST as a search method, using a new structural alphabet substitution matrix (SASM) to find the longest common substructures with high-scoring structured segment pairs from an SADB database. Using personal computers with Intel Pentium4 (2.8 GHz) processors, our method searched more than 10 000 protein structures in 1.3 s and achieved a good agreement with search results from detailed structure alignment methods. [3D-BLAST is available at http://3d-blast.life.nctu.edu.tw].
      

      
      Interactive collision detection for deformable models using streaming AABBs.
      PubMed
      Zhang, Xinyu; Kim, Young J
         2007-01-01
         We present an interactive and accurate collision detection algorithm for deformable, polygonal objects based on the streaming computational model. Our algorithm can detect all possible pairwise primitive-level intersections between two severely deforming models at highly interactive rates. In our streaming computational model, we consider a set of axis aligned bounding boxes (AABBs) that bound each of the given deformable objects as an input stream and perform massively-parallel pairwise, overlapping tests onto the incoming streams. As a result, we are able to prevent performance stalls in the streaming pipeline that can be caused by expensive indexing mechanism required by bounding volume hierarchy-based streaming algorithms. At runtime, as the underlying models deform over time, we employ a novel, streaming algorithm to update the geometric changes in the AABB streams. Moreover, in order to get only the computed result (i.e., collision results between AABBs) without reading back the entire output streams, we propose a streaming en/decoding strategy that can be performed in a hierarchical fashion. After determining overlapped AABBs, we perform a primitive-level (e.g., triangle) intersection checking on a serial computational model such as CPUs. We implemented the entire pipeline of our algorithm using off-the-shelf graphics processors (GPUs), such as nVIDIA GeForce 7800 GTX, for streaming computations, and Intel Dual Core 3.4G processors for serial computations. We benchmarked our algorithm with different models of varying complexities, ranging from 15K up to 50K triangles, under various deformation motions, and the timings were obtained as 30 approximately 100 FPS depending on the complexity of models and their relative configurations. Finally, we made comparisons with a well-known GPU-based collision detection algorithm, CULLIDE [4] and observed about three times performance improvement over the earlier approach. We also made comparisons with a SW-based AABB culling algorithm [2] and observed about two times improvement.
      

      
      Implementation and analysis of a Navier-Stokes algorithm on parallel computers
      NASA Technical Reports Server (NTRS)
      Fatoohi, Raad A.; Grosch, Chester E.
         1988-01-01
         The results of the implementation of a Navier-Stokes algorithm on three parallel/vector computers are presented. The object of this research is to determine how well, or poorly, a single numerical algorithm would map onto three different architectures. The algorithm is a compact difference scheme for the solution of the incompressible, two-dimensional, time-dependent Navier-Stokes equations. The computers were chosen so as to encompass a variety of architectures. They are the following: the MPP, an SIMD machine with 16K bit serial processors; Flex/32, an MIMD machine with 20 processors; and Cray/2. The implementation of the algorithm is discussed in relation to these architectures and measures of the performance on each machine are given. The basic comparison is among SIMD instruction parallelism on the MPP, MIMD process parallelism on the Flex/32, and vectorization of a serial code on the Cray/2. Simple performance models are used to describe the performance. These models highlight the bottlenecks and limiting factors for this algorithm on these architectures. Finally, conclusions are presented.
      

      
      A Navier-Strokes Chimera Code on the Connection Machine CM-5: Design and Performance
      NASA Technical Reports Server (NTRS)
      Jespersen, Dennis C.; Levit, Creon; Kwak, Dochan (Technical Monitor)
         1994-01-01
         We have implemented a three-dimensional compressible Navier-Stokes code on the Connection Machine CM-5. The code is set up for implicit time-stepping on single or multiple structured grids. For multiple grids and geometrically complex problems, we follow the 'chimera' approach, where flow data on one zone is interpolated onto another in the region of overlap. We will describe our design philosophy and give some timing results for the current code. A parallel machine like the CM-5 is well-suited for finite-difference methods on structured grids. The regular pattern of connections of a structured mesh maps well onto the architecture of the machine. So the first design choice, finite differences on a structured mesh, is natural. We use centered differences in space, with added artificial dissipation terms. When numerically solving the Navier-Stokes equations, there are liable to be some mesh cells near a solid body that are small in at least one direction. This mesh cell geometry can impose a very severe CFL (Courant-Friedrichs-Lewy) condition on the time step for explicit time-stepping methods. Thus, though explicit time-stepping is well-suited to the architecture of the machine, we have adopted implicit time-stepping. We have further taken the approximate factorization approach. This creates the need to solve large banded linear systems and creates the first possible barrier to an efficient algorithm. To overcome this first possible barrier we have considered two options. The first is just to solve the banded linear systems with data spread over the whole machine, using whatever fast method is available. This option is adequate for solving scalar tridiagonal systems, but for scalar pentadiagonal or block tridiagonal systems it is somewhat slower than desired. The second option is to 'transpose' the flow and geometry variables as part of the time-stepping process: Start with x-lines of data in-processor. Form explicit terms in x, then transpose so y-lines of data are in-processor. Form explicit terms in y, then transpose so z-lines are in processor. Form explicit terms in z, then solve linear systems in the z-direction. Transpose to the y-direction, then solve linear systems in the y-direction. Finally transpose to the x direction and solve linear systems in the x-direction. This strategy avoids inter-processor communication when differencing and solving linear systems, but requires a large amount of communication when doing the transposes. The transpose method is more efficient than the non-transpose strategy when dealing with scalar pentadiagonal or block tridiagonal systems. For handling geometrically complex problems the chimera strategy was adopted. For multiple zone cases we compute on each zone sequentially (using the whole parallel machine), then send the chimera interpolation data to a distributed data structure (array) laid out over the whole machine. This information transfer implies an irregular communication pattern, and is the second possible barrier to an efficient algorithm. We have implemented these ideas on the CM-5 using CMF (Connection Machine Fortran), a data parallel language which combines elements of Fortran 90 and certain extensions, and which bears a strong similarity to High Performance Fortran. We make use of the Connection Machine Scientific Software Library (CMSSL) for the linear solver and array transpose operations.
      

        
       
          

«

21
      22
      23
   24
      25
      »

          
        

     

   

   
       
            
              
          

«

21
      22
      23
      24
   25
      »

          
        

           
           
             
               
      
      The VERCE platform: Enabling Computational Seismology via Streaming Workflows and Science Gateways
      NASA Astrophysics Data System (ADS)
      Spinuso, Alessandro; Filgueira, Rosa; Krause, Amrey; Matser, Jonas; Casarotti, Emanuele; Magnoni, Federica; Gemund, Andre; Frobert, Laurent; Krischer, Lion; Atkinson, Malcolm
         2015-04-01
         The VERCE project is creating an e-Science platform to facilitate innovative data analysis and coding methods that fully exploit the wealth of data in global seismology. One of the technologies developed within the project is the Dispel4Py python library, which allows to describe abstract stream-based workflows for data-intensive applications and to execute them in a distributed environment. At runtime Dispel4Py is able to map workflow descriptions dynamically onto a number of computational resources (Apache Storm clusters, MPI powered clusters, and shared-memory multi-core machines, single-core machines), setting it apart from other workflow frameworks. Therefore, Dispel4Py enables scientists to focus on their computation instead of being distracted by details of the computing infrastructure they use. Among the workflows developed with Dispel4Py in VERCE, we mention here those for Seismic Ambient Noise Cross-Correlation and MISFIT calculation, which address two data-intensive problems that are common in computational seismology. The former, also called Passive Imaging, allows the detection of relative seismic-wave velocity variations during the time of recording, to be associated with the stress-field changes that occurred in the test area. The MISFIT instead, takes as input the synthetic seismograms generated from HPC simulations for a certain Earth model and earthquake and, after a preprocessing stage, compares them with real observations in order to foster subsequent model updates and improvement (Inversion). The VERCE Science Gateway exposes the MISFIT calculation workflow as a service, in combination with the simulation phase. Both phases can be configured, controlled and monitored by the user via a rich user interface which is integrated within the gUSE Science Gateway framework, hiding the complexity of accessing third parties data services, security mechanisms and enactment on the target resources. Thanks to a modular extension to the Dispel4Py framework, the system collects provenance data adopting the W3C-PROV data model. Provenance recordings can be explored and analysed at run time for rapid diagnostic and workflow steering, or later for further validation and comparisons across runs. We will illustrate the interactive services of the gateway and the capabilities of the produced metadata, coupled with the VERCE data management layer based on iRODS. The Cross-Correlation workflow was evaluated on SuperMUC, a supercomputing cluster at the Leibniz Supercomputing Centre in Munich, with 155,656 processor cores in 9400 compute nodes. SuperMUC is based on the Intel Xeon architecture consisting of 18 Thin Node Islands and one Fat Node Island. This work has only had access to the Thin Node Islands, which contain Sandy Bridge nodes, each having 16 cores and 32 GB of memory. In the evaluations we used 1000 stations, and we applied two types of methods (whiten and non-whiten) for pre-processing the data. The workflow was tested on a varying number of cores (16, 32, 64, 128, and 256 cores) using the MPI mapping of Dispel4Py. The results show that Dispel4Py is able to improve the performance by increasing the number of cores without changing the description of the workflow.
      

      
      Global Load Balancing with Parallel Mesh Adaption on Distributed-Memory Systems
      NASA Technical Reports Server (NTRS)
      Biswas, Rupak; Oliker, Leonid; Sohn, Andrew
         1996-01-01
         Dynamic mesh adaption on unstructured grids is a powerful tool for efficiently computing unsteady problems to resolve solution features of interest. Unfortunately, this causes load imbalance among processors on a parallel machine. This paper describes the parallel implementation of a tetrahedral mesh adaption scheme and a new global load balancing method. A heuristic remapping algorithm is presented that assigns partitions to processors such that the redistribution cost is minimized. Results indicate that the parallel performance of the mesh adaption code depends on the nature of the adaption region and show a 35.5X speedup on 64 processors of an SP2 when 35% of the mesh is randomly adapted. For large-scale scientific computations, our load balancing strategy gives almost a sixfold reduction in solver execution times over non-balanced loads. Furthermore, our heuristic remapper yields processor assignments that are less than 3% off the optimal solutions but requires only 1% of the computational time.
      

      
      PLUM: Parallel Load Balancing for Adaptive Unstructured Meshes
      NASA Technical Reports Server (NTRS)
      Oliker, Leonid; Biswas, Rupak; Saini, Subhash (Technical Monitor)
         1998-01-01
         Mesh adaption is a powerful tool for efficient unstructured-grid computations but causes load imbalance among processors on a parallel machine. We present a novel method called PLUM to dynamically balance the processor workloads with a global view. This paper presents the implementation and integration of all major components within our dynamic load balancing strategy for adaptive grid calculations. Mesh adaption, repartitioning, processor assignment, and remapping are critical components of the framework that must be accomplished rapidly and efficiently so as not to cause a significant overhead to the numerical simulation. A data redistribution model is also presented that predicts the remapping cost on the SP2. This model is required to determine whether the gain from a balanced workload distribution offsets the cost of data movement. Results presented in this paper demonstrate that PLUM is an effective dynamic load balancing strategy which remains viable on a large number of processors.
      

      
      System Architecture For High Speed Sorting Of Potatoes
      NASA Astrophysics Data System (ADS)
      Marchant, J. A.; Onyango, C. M.; Street, M. J.
         1989-03-01
         This paper illustrates an industrial application of vision processing in which potatoes are sorted according to their size and shape at speeds of up to 40 objects per second. The result is a multi-processing approach built around the VME bus. A hardware unit has been designed and constructed to encode the boundary of the potatoes, to reducing the amount of data to be processed. A master 68000 processor is used to control this unit and to handle data transfers along the bus. Boundary data is passed to one of three 68010 slave processors each responsible for a line of potatoes across a conveyor belt. The slave processors calculate attributes such as shape, size and estimated weight of each potato and the master processor uses this data to operate the sorting mechanism. The system has been interfaced with a commercial grading machine and performance trials are now in progress.
      

      
      System software for the finite element machine
      NASA Technical Reports Server (NTRS)
      Crockett, T. W.; Knott, J. D.
         1985-01-01
         The Finite Element Machine is an experimental parallel computer developed at Langley Research Center to investigate the application of concurrent processing to structural engineering analysis. This report describes system-level software which has been developed to facilitate use of the machine by applications researchers. The overall software design is outlined, and several important parallel processing issues are discussed in detail, including processor management, communication, synchronization, and input/output. Based on experience using the system, the hardware architecture and software design are critiqued, and areas for further work are suggested.
      

      
      Integrated Inverter For Driving Multiple Electric Machines
      DOEpatents
      Su, Gui-Jia [Knoxville, TN; Hsu, John S [Oak Ridge, TN
         2006-04-04
         An electric machine drive (50) has a plurality of inverters (50a, 50b) for controlling respective electric machines (57, 62), which may include a three-phase main traction machine (57) and two-phase accessory machines (62) in a hybrid or electric vehicle. The drive (50) has a common control section (53, 54) for controlling the plurality of inverters (50a, 50b) with only one microelectronic processor (54) for controlling the plurality of inverters (50a, 50b), only one gate driver circuit (53) for controlling conduction of semiconductor switches (S1-S10) in the plurality of inverters (50a, 50b), and also includes a common dc bus (70), a common dc bus filtering capacitor (C1) and a common dc bus voltage sensor (67). The electric machines (57, 62) may be synchronous machines, induction machines, or PM machines and may be operated in a motoring mode or a generating mode.
      

      
      75 FR 21353 - Intel Corporation, Fab 20 Division, Including On-Site Leased Workers From Volt Technical...
      Federal Register 2010, 2011, 2012, 2013, 2014
      
         2010-04-23
         ... DEPARTMENT OF LABOR Employment and Training Administration [TA-W-73,642] Intel Corporation, Fab 20... of Intel Corporation, Fab 20 Division, including on-site leased workers of Volt Technical Resources... Precision, Inc. were employed on-site at the Hillsboro, Oregon location of Intel Corporation, Fab 20...
      

      
      How Managers' everyday decisions create or destroy your company's strategy.
      PubMed
      Bower, Joseph L; Gilbert, Clark G
         2007-02-01
         Senior executives have long been frustrated by the disconnection between the plans and strategies they devise and the actual behavior of the managers throughout the company. This article approaches the problem from the ground up, recognizing that every time a manager allocates resources, that decision moves the company either into or out of alignment with its announced strategy. A well-known story--Intel's exit from the memory business--illustrates this point. When discussing what businesses Intel should be in, Andy Grove asked Gordon Moore what they would do if Intel were a company that they had just acquired. When Moore answered, "Get out of memory," they decided to do just that. It turned out, though, that Intel's revenues from memory were by this time only 4% of total sales. Intel's lower-level managers had already exited the business. What Intel hadn't done was to shut down the flow of research funding into memory (which was still eating up one-third of all research expenditures); nor had the company announced its exit to the outside world. Because divisional and operating managers-as well as customers and capital markets-have such a powerful impact on the realized strategy of the firm, senior management might consider focusing less on the company's formal strategy and more on the processes by which the company allocates resources. Top managers must know the track record of the people who are making resource allocation proposals; recognize the strategic issues at stake; reach down to operational managers to work across division lines; frame resource questions to reflect the corporate perspective, especially when large sums of money are involved and conditions are highly uncertain; and create a new context that allows top executives to circumvent the regular resource allocation process when necessary.
      

      
      mm_par2.0: An object-oriented molecular dynamics simulation program parallelized using a hierarchical scheme with MPI and OPENMP
      NASA Astrophysics Data System (ADS)
      Oh, Kwang Jin; Kang, Ji Hoon; Myung, Hun Joo
         2012-02-01
         We have revised a general purpose parallel molecular dynamics simulation program mm_par using the object-oriented programming. We parallelized the revised version using a hierarchical scheme in order to utilize more processors for a given system size. The benchmark result will be presented here. New version program summaryProgram title: mm_par2.0 Catalogue identifier: ADXP_v2_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/ADXP_v2_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC license, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 2 390 858 No. of bytes in distributed program, including test data, etc.: 25 068 310 Distribution format: tar.gz Programming language: C++ Computer: Any system operated by Linux or Unix Operating system: Linux Classification: 7.7 External routines: We provide wrappers for FFTW [1], Intel MKL library [2] FFT routine, and Numerical recipes [3] FFT, random number generator, and eigenvalue solver routines, SPRNG [4] random number generator, Mersenne Twister [5] random number generator, space filling curve routine. Catalogue identifier of previous version: ADXP_v1_0 Journal reference of previous version: Comput. Phys. Comm. 174 (2006) 560 Does the new version supersede the previous version?: Yes Nature of problem: Structural, thermodynamic, and dynamical properties of fluids and solids from microscopic scales to mesoscopic scales. Solution method: Molecular dynamics simulation in NVE, NVT, and NPT ensemble, Langevin dynamics simulation, dissipative particle dynamics simulation. Reasons for new version: First, object-oriented programming has been used, which is known to be open for extension and closed for modification. It is also known to be better for maintenance. Second, version 1.0 was based on atom decomposition and domain decomposition scheme [6] for parallelization. However, atom decomposition is not popular due to its poor scalability. On the other hand, domain decomposition scheme is better for scalability. It still has a limitation in utilizing a large number of cores on recent petascale computers due to the requirement that the domain size is larger than the potential cutoff distance. To go beyond such a limitation, a hierarchical parallelization scheme has been adopted in this new version and implemented using MPI [7] and OPENMP [8]. Summary of revisions: (1) Object-oriented programming has been used. (2) A hierarchical parallelization scheme has been adopted. (3) SPME routine has been fully parallelized with parallel 3D FFT using volumetric decomposition scheme [9]. K.J.O. thanks Mr. Seung Min Lee for useful discussion on programming and debugging. Running time: Running time depends on system size and methods used. For test system containing a protein (PDB id: 5DHFR) with CHARMM22 force field [10] and 7023 TIP3P [11] waters in simulation box having dimension 62.23 Å×62.23 Å×62.23 Å, the benchmark results are given in Fig. 1. Here the potential cutoff distance was set to 12 Å and the switching function was applied from 10 Å for the force calculation in real space. For the SPME [12] calculation, K, K, and K were set to 64 and the interpolation order was set to 4. To do the fast Fourier transform, we used Intel MKL library. All bonds including hydrogen atoms were constrained using SHAKE/RATTLE algorithms [13,14]. The code was compiled using Intel compiler version 11.1 and mvapich2 version 1.5. Fig. 2 shows performance gains from using CUDA-enabled version [15] of mm_par for 5DHFR simulation in water on Intel Core2Quad 2.83 GHz and GeForce GTX 580. Even though mm_par2.0 is not ported yet for GPU, its performance data would be useful to expect mm_par2.0 performance on GPU. Timing results for 1000 MD steps. 1, 2, 4, and 8 in the figure mean the number of OPENMP threads. Timing results for 1000 MD steps from double precision simulation on CPU, single precision simulation on GPU, and double precision simulation on GPU.
      

      
      Can High Bandwidth and Latency Justify Large Cache Blocks in Scalable Multiprocessors?
      DTIC Science & Technology
      
         1994-01-01
         400 MB/second. 4 Dubnicki’s work used trace-driven simulation, with traces collected on an 8-processor machine. We would expect such small-scale...312 1 6 32 64 of odk Sb* Bad64.M Figure 17: Miss rate of Ind Blocked LU. Figure 18: MCPR of Ind Blocked LU. overall miss rate of TGauss is a factor of...easily. 17 (’his approach assunics that the model paramelers we collect from simulations with infinite band- width (such as the miss rate and the
      

      
      Using the Intel Math Kernel Library on Peregrine | High-Performance
      Science.gov Websites
      
         
         Computing | NREL the Intel Math Kernel Library on Peregrine Using the Intel Math Kernel Library on Peregrine Learn how to use the Intel Math Kernel Library (MKL) with Peregrine system software. MKL architectures. Core math functions in MKL include BLAS, LAPACK, ScaLAPACK, sparse solvers, fast Fourier
      

      
      High coherence plane breaking packaging for superconducting qubits.
      PubMed
      Bronn, Nicholas T; Adiga, Vivekananda P; Olivadese, Salvatore B; Wu, Xian; Chow, Jerry M; Pappas, David P
         2018-04-01
         We demonstrate a pogo pin package for a superconducting quantum processor specifically designed with a nontrivial layout topology (e.g., a center qubit that cannot be accessed from the sides of the chip). Two experiments on two nominally identical superconducting quantum processors in pogo packages, which use commercially available parts and require modest machining tolerances, are performed at low temperature (10 mK) in a dilution refrigerator and both found to behave comparably to processors in standard planar packages with wirebonds where control and readout signals come in from the edges. Single- and two-qubit gate errors are also characterized via randomized benchmarking, exhibiting similar error rates as in standard packages, opening the possibility of integrating pogo pin packaging with extensible qubit architectures.
      

      
      Hardware accuracy counters for application precision and quality feedback
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      de Paula Rosa Piga, Leonardo; Majumdar, Abhinandan; Paul, Indrani
         
         Methods, devices, and systems for capturing an accuracy of an instruction executing on a processor. An instruction may be executed on the processor, and the accuracy of the instruction may be captured using a hardware counter circuit. The accuracy of the instruction may be captured by analyzing bits of at least one value of the instruction to determine a minimum or maximum precision datatype for representing the field, and determining whether to adjust a value of the hardware counter circuit accordingly. The representation may be output to a debugger or logfile for use by a developer, or may be outputmore » to a runtime or virtual machine to automatically adjust instruction precision or gating of portions of the processor datapath.« less
      

      
      High coherence plane breaking packaging for superconducting qubits
      NASA Astrophysics Data System (ADS)
      Bronn, Nicholas T.; Adiga, Vivekananda P.; Olivadese, Salvatore B.; Wu, Xian; Chow, Jerry M.; Pappas, David P.
         2018-04-01
         We demonstrate a pogo pin package for a superconducting quantum processor specifically designed with a nontrivial layout topology (e.g., a center qubit that cannot be accessed from the sides of the chip). Two experiments on two nominally identical superconducting quantum processors in pogo packages, which use commercially available parts and require modest machining tolerances, are performed at low temperature (10 mK) in a dilution refrigerator and both found to behave comparably to processors in standard planar packages with wirebonds where control and readout signals come in from the edges. Single- and two-qubit gate errors are also characterized via randomized benchmarking, exhibiting similar error rates as in standard packages, opening the possibility of integrating pogo pin packaging with extensible qubit architectures.
      

      
      Enhanced tactical radar correlator (ETRAC): true interoperability of the 1990s
      NASA Astrophysics Data System (ADS)
      Guillen, Frank J.
         1994-10-01
         The enhanced tactical radar correlator (ETRAC) system is under development at Westinghouse Electric Corporation for the Army Space Program Office (ASPO). ETRAC is a real-time synthetic aperture radar (SAR) processing system that provides tactical IMINT to the corps commander. It features an open architecture comprised of ruggedized commercial-off-the-shelf (COTS), UNIX based workstations and processors. The architecture features the DoD common SAR processor (CSP), a multisensor computing platform to accommodate a variety of current and future imaging needs. ETRAC's principal functions include: (1) Mission planning and control -- ETRAC provides mission planning and control for the U-2R and ASARS-2 sensor, including capability for auto replanning, retasking, and immediate spot. (2) Image formation -- the image formation processor (IFP) provides the CPU intensive processing capability to produce real-time imagery for all ASARS imaging modes of operation. (3) Image exploitation -- two exploitation workstations are provided for first-phase image exploitation, manipulation, and annotation. Products include INTEL reports, annotated NITF SID imagery, high resolution hard copy prints and targeting data. ETRAC is transportable via two C-130 aircraft, with autonomous drive on/off capability for high mobility. Other autonomous capabilities include rapid setup/tear down, extended stand-alone support, internal environmental control units (ECUs) and power generation. ETRAC's mission is to provide the Army field commander with accurate, reliable, and timely imagery intelligence derived from collections made by the ASARS-2 sensor, located on-board the U-2R aircraft. To accomplish this mission, ETRAC receives video phase history (VPH) directly from the U-2R aircraft and converts it in real time into soft copy imagery for immediate exploitation and dissemination to the tactical users.
      

      
      N-body simulation for self-gravitating collisional systems with a new SIMD instruction set extension to the x86 architecture, Advanced Vector eXtensions
      NASA Astrophysics Data System (ADS)
      Tanikawa, Ataru; Yoshikawa, Kohji; Okamoto, Takashi; Nitadori, Keigo
         2012-02-01
         We present a high-performance N-body code for self-gravitating collisional systems accelerated with the aid of a new SIMD instruction set extension of the x86 architecture: Advanced Vector eXtensions (AVX), an enhanced version of the Streaming SIMD Extensions (SSE). With one processor core of Intel Core i7-2600 processor (8 MB cache and 3.40 GHz) based on Sandy Bridge micro-architecture, we implemented a fourth-order Hermite scheme with individual timestep scheme ( Makino and Aarseth, 1992), and achieved the performance of ˜20 giga floating point number operations per second (GFLOPS) for double-precision accuracy, which is two times and five times higher than that of the previously developed code implemented with the SSE instructions ( Nitadori et al., 2006b), and that of a code implemented without any explicit use of SIMD instructions with the same processor core, respectively. We have parallelized the code by using so-called NINJA scheme ( Nitadori et al., 2006a), and achieved ˜90 GFLOPS for a system containing more than N = 8192 particles with 8 MPI processes on four cores. We expect to achieve about 10 tera FLOPS (TFLOPS) for a self-gravitating collisional system with N ˜ 10 5 on massively parallel systems with at most 800 cores with Sandy Bridge micro-architecture. This performance will be comparable to that of Graphic Processing Unit (GPU) cluster systems, such as the one with about 200 Tesla C1070 GPUs ( Spurzem et al., 2010). This paper offers an alternative to collisional N-body simulations with GRAPEs and GPUs.
      

      
      Scaling deep learning workloads: NVIDIA DGX-1/Pascal and Intel Knights Landing
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Gawande, Nitin A.; Landwehr, Joshua B.; Daily, Jeffrey A.
         
         Deep Learning (DL) algorithms have become ubiquitous in data analytics. As a result, major computing vendors --- including NVIDIA, Intel, AMD, and IBM --- have architectural road-maps influenced by DL workloads. Furthermore, several vendors have recently advertised new computing products as accelerating large DL workloads. Unfortunately, it is difficult for data scientists to quantify the potential of these different products. This paper provides a performance and power analysis of important DL workloads on two major parallel architectures: NVIDIA DGX-1 (eight Pascal P100 GPUs interconnected with NVLink) and Intel Knights Landing (KNL) CPUs interconnected with Intel Omni-Path or Cray Aries. Ourmore » evaluation consists of a cross section of convolutional neural net workloads: CifarNet, AlexNet, GoogLeNet, and ResNet50 topologies using the Cifar10 and ImageNet datasets. The workloads are vendor-optimized for each architecture. Our analysis indicates that although GPUs provide the highest overall performance, the gap can close for some convolutional networks; and the KNL can be competitive in performance/watt. We find that NVLink facilitates scaling efficiency on GPUs. However, its importance is heavily dependent on neural network architecture. Furthermore, for weak-scaling --- sometimes encouraged by restricted GPU memory --- NVLink is less important.« less
      

      
      Eliminating livelock by assigning the same priority state to each message that is input into a flushable routing system during N time intervals
      DOEpatents
      Faber, V.
         1994-11-29
         Livelock-free message routing is provided in a network of interconnected nodes that is flushable in time T. An input message processor generates sequences of at least N time intervals, each of duration T. An input register provides for receiving and holding each input message, where the message is assigned a priority state p during an nth one of the N time intervals. At each of the network nodes a message processor reads the assigned priority state and awards priority to messages with priority state (p-1) during an nth time interval and to messages with priority state p during an (n+1) th time interval. The messages that are awarded priority are output on an output path toward the addressed output message processor. Thus, no message remains in the network for a time longer than T. 4 figures.
      

      
      A real-time robot arm collision avoidance system
      NASA Technical Reports Server (NTRS)
      Shaffer, Clifford A.; Herb, Gregory M.
         1992-01-01
         A data structure and update algorithm are presented for a prototype real-time collision avoidance safety system simulating a multirobot workspace. The data structure is a variant of the octree, which serves as a spatial index. An octree recursively decomposes 3D space into eight equal cubic octants until each octant meets some decomposition criteria. The N-objects octree, which indexes a collection of 3D primitive solids is used. These primitives make up the two (seven-degrees-of-freedom) robot arms and workspace modeled by the system. As robot arms move, the octree is updated to reflect their changed positions. During most update cycles, any given primitive does not change which octree nodes it is in. Thus, modification to the octree is rarely required. Cycle time for interpreting current arm joint angles, updating the octree to reflect new positions, and detecting/reporting imminent collisions averages 30 ms on an Intel 80386 processor running at 20 MHz.
      

      
      (U) Status of Trinity and Crossroads Systems
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Archer, Billy Joe; Lujan, James Westley; Hemmert, K. S.
         2017-01-10
         (U) This paper provides a general overview of current and future plans for the Advanced Simulation and Computing (ASC) Advanced Technology (AT) systems fielded by the New Mexico Alliance for Computing at Extreme Scale (ACES), a collaboration between Los Alamos Laboratory and Sandia National Laboratories. Additionally, this paper touches on research of technology beyond traditional CMOS. The status of Trinity, ASCs first AT system, and Crossroads, anticipated to succeed Trinity as the third AT system in 2020 will be presented, along with initial performance studies of the Intel Knights Landing Xeon Phi processors, introduced on Trinity. The challenges and opportunitiesmore » for our production simulation codes on AT systems will also be discussed. Trinity and Crossroads are a joint procurement by ACES and Lawrence Berkeley Laboratory as part of the Alliance for application Performance at EXtreme scale (APEX) http://apex.lanl.gov.« less
      

        
       
          

«

21
      22
      23
      24
   25
      »

          
        

     

   

   
       
            
              
          

«

21
      22
      23
      24
      25
   »

          
        

           
           
             
               
      
      Microcomputer control soft tube measuring-testing instrument
      NASA Astrophysics Data System (ADS)
      Zhou, Yanzhou; Jiang, Xiu-Zhen; Wang, Wen-Yi
         1993-09-01
         Soft tube are key and easily spoiled parts used by the vehicles in the transportation with large numbers. Measuring and testing of the tubes were made by hands for a long time. Cooperating with Harbin Railway Bureau recently we have developed a new kind of automatical measuring and testing instrument In the paper the instrument structure property and measuring principle are presented in details. Centre of the system is a singlechip processor INTEL 80C31 . It can collect deal with data and display the results on LED. Furthermore it brings electromagnetic valves and motors under control. Five soft tubes are measured and tested in the same time all the process is finished automatically. On the hardware and software counter-electromagnetic disturbance methods is adopted efficiently so the performance of the instrument is improved significantly. In the long run the instrument is reliable and practical It solves a quite difficult problem in the railway transportation.
      

      
      Fabrication of Circuit QED Quantum Processors, Part 2: Advanced Semiconductor Manufacturing Perspectives
      NASA Astrophysics Data System (ADS)
      Michalak, D. J.; Bruno, A.; Caudillo, R.; Elsherbini, A. A.; Falcon, J. A.; Nam, Y. S.; Poletto, S.; Roberts, J.; Thomas, N. K.; Yoscovits, Z. R.; Dicarlo, L.; Clarke, J. S.
         
         Experimental quantum computing is rapidly approaching the integration of sufficient numbers of quantum bits for interesting applications, but many challenges still remain. These challenges include: realization of an extensible design for large array scale up, sufficient material process control, and discovery of integration schemes compatible with industrial 300 mm fabrication. We present recent developments in extensible circuits with vertical delivery. Toward the goal of developing a high-volume manufacturing process, we will present recent results on a new Josephson junction process that is compatible with current tooling. We will then present the improvements in NbTiN material uniformity that typical 300 mm fabrication tooling can provide. While initial results on few-qubit systems are encouraging, advanced processing control is expected to deliver the improvements in qubit uniformity, coherence time, and control required for larger systems. Research funded by Intel Corporation.
      

      
      Automatic generation of efficient array redistribution routines for distributed memory multicomputers
      NASA Technical Reports Server (NTRS)
      Ramaswamy, Shankar; Banerjee, Prithviraj
         1994-01-01
         Appropriate data distribution has been found to be critical for obtaining good performance on Distributed Memory Multicomputers like the CM-5, Intel Paragon and IBM SP-1. It has also been found that some programs need to change their distributions during execution for better performance (redistribution). This work focuses on automatically generating efficient routines for redistribution. We present a new mathematical representation for regular distributions called PITFALLS and then discuss algorithms for redistribution based on this representation. One of the significant contributions of this work is being able to handle arbitrary source and target processor sets while performing redistribution. Another important contribution is the ability to handle an arbitrary number of dimensions for the array involved in the redistribution in a scalable manner. Our implementation of these techniques is based on an MPI-like communication library. The results presented show the low overheads for our redistribution algorithm as compared to naive runtime methods.
      

      
      A Parallel Multigrid Solver for Viscous Flows on Anisotropic Structured Grids
      NASA Technical Reports Server (NTRS)
      Prieto, Manuel; Montero, Ruben S.; Llorente, Ignacio M.; Bushnell, Dennis M. (Technical Monitor)
         2001-01-01
         This paper presents an efficient parallel multigrid solver for speeding up the computation of a 3-D model that treats the flow of a viscous fluid over a flat plate. The main interest of this simulation lies in exhibiting some basic difficulties that prevent optimal multigrid efficiencies from being achieved. As the computing platform, we have used Coral, a Beowulf-class system based on Intel Pentium processors and equipped with GigaNet cLAN and switched Fast Ethernet networks. Our study not only examines the scalability of the solver but also includes a performance evaluation of Coral where the investigated solver has been used to compare several of its design choices, namely, the interconnection network (GigaNet versus switched Fast-Ethernet) and the node configuration (dual nodes versus single nodes). As a reference, the performance results have been compared with those obtained with the NAS-MG benchmark.
      

      
      A Commodity Computing Cluster
      NASA Astrophysics Data System (ADS)
      Teuben, P. J.; Wolfire, M. G.; Pound, M. W.; Mundy, L. G.
         
         We have assembled a cluster of Intel-Pentium based PCs running Linux to compute a large set of Photodissociation Region (PDR) and Dust Continuum models. For various reasons the cluster is heterogeneous, currently ranging from a single Pentium-II 333 MHz to dual Pentium-III 450 MHz CPU machines. Although this will be sufficient for our ``embarrassingly parallelizable problem'' it may present some challenges for as yet unplanned future use. In addition the cluster was used to construct a MIRIAD benchmark, and compared to equivalent Ultra-Sparc based workstations. Currently the cluster consists of 8 machines, 14 CPUs, 50GB of disk-space, and a total peak speed of 5.83 GHz, or about 1.5 Gflops. The total cost of this cluster has been about $12,000, including all cabling, networking equipment, rack, and a CD-R backup system. The URL for this project is http://dustem.astro.umd.edu.
      

      
      Optimizing NEURON Simulation Environment Using Remote Memory Access with Recursive Doubling on Distributed Memory Systems.
      PubMed
      Shehzad, Danish; Bozkuş, Zeki
         2016-01-01
         Increase in complexity of neuronal network models escalated the efforts to make NEURON simulation environment efficient. The computational neuroscientists divided the equations into subnets amongst multiple processors for achieving better hardware performance. On parallel machines for neuronal networks, interprocessor spikes exchange consumes large section of overall simulation time. In NEURON for communication between processors Message Passing Interface (MPI) is used. MPI_Allgather collective is exercised for spikes exchange after each interval across distributed memory systems. The increase in number of processors though results in achieving concurrency and better performance but it inversely affects MPI_Allgather which increases communication time between processors. This necessitates improving communication methodology to decrease the spikes exchange time over distributed memory systems. This work has improved MPI_Allgather method using Remote Memory Access (RMA) by moving two-sided communication to one-sided communication, and use of recursive doubling mechanism facilitates achieving efficient communication between the processors in precise steps. This approach enhanced communication concurrency and has improved overall runtime making NEURON more efficient for simulation of large neuronal network models.
      

      
      Optimizing NEURON Simulation Environment Using Remote Memory Access with Recursive Doubling on Distributed Memory Systems
      PubMed Central
      Bozkuş, Zeki
         2016-01-01
         Increase in complexity of neuronal network models escalated the efforts to make NEURON simulation environment efficient. The computational neuroscientists divided the equations into subnets amongst multiple processors for achieving better hardware performance. On parallel machines for neuronal networks, interprocessor spikes exchange consumes large section of overall simulation time. In NEURON for communication between processors Message Passing Interface (MPI) is used. MPI_Allgather collective is exercised for spikes exchange after each interval across distributed memory systems. The increase in number of processors though results in achieving concurrency and better performance but it inversely affects MPI_Allgather which increases communication time between processors. This necessitates improving communication methodology to decrease the spikes exchange time over distributed memory systems. This work has improved MPI_Allgather method using Remote Memory Access (RMA) by moving two-sided communication to one-sided communication, and use of recursive doubling mechanism facilitates achieving efficient communication between the processors in precise steps. This approach enhanced communication concurrency and has improved overall runtime making NEURON more efficient for simulation of large neuronal network models. PMID:27413363
      

      
      Experience with a Genetic Algorithm Implemented on a Multiprocessor Computer
      NASA Technical Reports Server (NTRS)
      Plassman, Gerald E.; Sobieszczanski-Sobieski, Jaroslaw
         2000-01-01
         Numerical experiments were conducted to find out the extent to which a Genetic Algorithm (GA) may benefit from a multiprocessor implementation, considering, on one hand, that analyses of individual designs in a population are independent of each other so that they may be executed concurrently on separate processors, and, on the other hand, that there are some operations in a GA that cannot be so distributed. The algorithm experimented with was based on a gaussian distribution rather than bit exchange in the GA reproductive mechanism, and the test case was a hub frame structure of up to 1080 design variables. The experimentation engaging up to 128 processors confirmed expectations of radical elapsed time reductions comparing to a conventional single processor implementation. It also demonstrated that the time spent in the non-distributable parts of the algorithm and the attendant cross-processor communication may have a very detrimental effect on the efficient utilization of the multiprocessor machine and on the number of processors that can be used effectively in a concurrent manner. Three techniques were devised and tested to mitigate that effect, resulting in efficiency increasing to exceed 99 percent.
      

      
      Analysis OpenMP performance of AMD and Intel architecture for breaking waves simulation using MPS
      NASA Astrophysics Data System (ADS)
      Alamsyah, M. N. A.; Utomo, A.; Gunawan, P. H.
         2018-03-01
         Simulation of breaking waves by using Navier-Stokes equation via moving particle semi-implicit method (MPS) over close domain is given. The results show the parallel computing on multicore architecture using OpenMP platform can reduce the computational time almost half of the serial time. Here, the comparison using two computer architectures (AMD and Intel) are performed. The results using Intel architecture is shown better than AMD architecture in CPU time. However, in efficiency, the computer with AMD architecture gives slightly higher than the Intel. For the simulation by 1512 number of particles, the CPU time using Intel and AMD are 12662.47 and 28282.30 respectively. Moreover, the efficiency using similar number of particles, AMD obtains 50.09 % and Intel up to 49.42 %.
      

      
      40 CFR 721.3760 - Fluorene-containing diaromatic amines.
      Code of Federal Regulations, 2010 CFR
      
         2010-07-01
         ... amines (PMN P-88-998 and P-88-999) are subject to reporting under this section for the significant new... water. Requirements as specified in § 721.90 (a)(4), (b)(4), and (c)(4) (where n = 1). (ii) [Reserved...), (c), and (k) are applicable to manufacturers, importers, and processors of this substance. (2...
      

      
      40 CFR 721.6070 - Alkyl phosphonate ammonium salts.
      Code of Federal Regulations, 2010 CFR
      
         2010-07-01
         ... salts (PMNs P-93-725 and P-93-726) are subject to reporting under this section for the significant new... water. Requirements as specified in § 721.90 (a)(4), (b)(4), and (c)(4) (where N = 400 ppb). (b...), and (k) are applicable to manufacturers, importers, and processors of this substance. (2) Limitations...
      

      
      Editing of EIA coded, numerically controlled, machine tool tapes
      NASA Technical Reports Server (NTRS)
      Weiner, J. M.
         1975-01-01
         Editing of numerically controlled (N/C) machine tool tapes (8-level paper tape) using an interactive graphic display processor is described. A rapid technique required for correcting production errors in N/C tapes was developed using the interactive text editor on the IMLAC PDS-ID graphic display system and two special programs resident on disk. The correction technique and special programs for processing N/C tapes coded to EIA specifications are discussed.
      

      
      A new measuring machine in Paris
      NASA Technical Reports Server (NTRS)
      Guibert, J.; Charvin, P.
         1984-01-01
         A new photographic measuring machine is under construction at the Paris Observatory. The amount of transmitted light is measured by a linear array of 1024 photodiodes. Carriage control, data acquisition and on line processing are performed by microprocessors, a S.E.L. 32/27 computer, and an AP 120-B Array Processor. It is expected that a Schmidt telescope plate of size 360 mm square will be scanned in one hour with pixel size of ten microns.
      

      
      Bit-parallel arithmetic in a massively-parallel associative processor
      NASA Technical Reports Server (NTRS)
      Scherson, Isaac D.; Kramer, David A.; Alleyne, Brian D.
         1992-01-01
         A simple but powerful new architecture based on a classical associative processor model is presented. Algorithms for performing the four basic arithmetic operations both for integer and floating point operands are described. For m-bit operands, the proposed architecture makes it possible to execute complex operations in O(m) cycles as opposed to O(m exp 2) for bit-serial machines. A word-parallel, bit-parallel, massively-parallel computing system can be constructed using this architecture with VLSI technology. The operation of this system is demonstrated for the fast Fourier transform and matrix multiplication.
      

      
      RAMA: A file system for massively parallel computers
      NASA Technical Reports Server (NTRS)
      Miller, Ethan L.; Katz, Randy H.
         1993-01-01
         This paper describes a file system design for massively parallel computers which makes very efficient use of a few disks per processor. This overcomes the traditional I/O bottleneck of massively parallel machines by storing the data on disks within the high-speed interconnection network. In addition, the file system, called RAMA, requires little inter-node synchronization, removing another common bottleneck in parallel processor file systems. Support for a large tertiary storage system can easily be integrated in lo the file system; in fact, RAMA runs most efficiently when tertiary storage is used.
      

      
      Dynamically allocating sets of fine-grained processors to running computations
      NASA Technical Reports Server (NTRS)
      Middleton, David
         1988-01-01
         Researchers explore an approach to using general purpose parallel computers which involves mapping hardware resources onto computations instead of mapping computations onto hardware. Problems such as processor allocation, task scheduling and load balancing, which have traditionally proven to be challenging, change significantly under this approach and may become amenable to new attacks. Researchers describe the implementation of this approach used by the FFP Machine whose computation and communication resources are repeatedly partitioned into disjoint groups that match the needs of available tasks from moment to moment. Several consequences of this system are examined.
      

      
      Effective Vectorization with OpenMP 4.5
      DOE Office of Scientific and Technical Information (OSTI.GOV)
      Huber, Joseph N.; Hernandez, Oscar R.; Lopez, Matthew Graham
         
         This paper describes how the Single Instruction Multiple Data (SIMD) model and its extensions in OpenMP work, and how these are implemented in different compilers. Modern processors are highly parallel computational machines which often include multiple processors capable of executing several instructions in parallel. Understanding SIMD and executing instructions in parallel allows the processor to achieve higher performance without increasing the power required to run it. SIMD instructions can significantly reduce the runtime of code by executing a single operation on large groups of data. The SIMD model is so integral to the processor s potential performance that, if SIMDmore » is not utilized, less than half of the processor is ever actually used. Unfortunately, using SIMD instructions is a challenge in higher level languages because most programming languages do not have a way to describe them. Most compilers are capable of vectorizing code by using the SIMD instructions, but there are many code features important for SIMD vectorization that the compiler cannot determine at compile time. OpenMP attempts to solve this by extending the C++/C and Fortran programming languages with compiler directives that express SIMD parallelism. OpenMP is used to pass hints to the compiler about the code to be executed in SIMD. This is a key resource for making optimized code, but it does not change whether or not the code can use SIMD operations. However, in many cases critical functions are limited by a poor understanding of how SIMD instructions are actually implemented, as SIMD can be implemented through vector instructions or simultaneous multi-threading (SMT). We have found that it is often the case that code cannot be vectorized, or is vectorized poorly, because the programmer does not have sufficient knowledge of how SIMD instructions work.« less
      

      
      Mission and data operations IBM 360 user's guide
      NASA Technical Reports Server (NTRS)
      Balakirsky, J.
         1973-01-01
         The M and DO computer systems are introduced and supplemented. The hardware and software status is discussed, along with standard processors and user libraries. Data management techniques are presented, as well as machine independence, debugging facilities, and overlay considerations.
      

      
      Evaluation of the Intel Xeon Phi 7120 and NVIDIA K80 as accelerators for two-dimensional panel codes
      PubMed Central
      
         2017-01-01
         To optimize the geometry of airfoils for a specific application is an important engineering problem. In this context genetic algorithms have enjoyed some success as they are able to explore the search space without getting stuck in local optima. However, these algorithms require the computation of aerodynamic properties for a significant number of airfoil geometries. Consequently, for low-speed aerodynamics, panel methods are most often used as the inner solver. In this paper we evaluate the performance of such an optimization algorithm on modern accelerators (more specifically, the Intel Xeon Phi 7120 and the NVIDIA K80). For that purpose, we have implemented an optimized version of the algorithm on the CPU and Xeon Phi (based on OpenMP, vectorization, and the Intel MKL library) and on the GPU (based on CUDA and the MAGMA library). We present timing results for all codes and discuss the similarities and differences between the three implementations. Overall, we observe a speedup of approximately 2.5 for adding an Intel Xeon Phi 7120 to a dual socket workstation and a speedup between 3.4 and 3.8 for adding a NVIDIA K80 to a dual socket workstation. PMID:28582389
      

      
      Evaluation of the Intel Xeon Phi 7120 and NVIDIA K80 as accelerators for two-dimensional panel codes.
      PubMed
      Einkemmer, Lukas
         2017-01-01
         To optimize the geometry of airfoils for a specific application is an important engineering problem. In this context genetic algorithms have enjoyed some success as they are able to explore the search space without getting stuck in local optima. However, these algorithms require the computation of aerodynamic properties for a significant number of airfoil geometries. Consequently, for low-speed aerodynamics, panel methods are most often used as the inner solver. In this paper we evaluate the performance of such an optimization algorithm on modern accelerators (more specifically, the Intel Xeon Phi 7120 and the NVIDIA K80). For that purpose, we have implemented an optimized version of the algorithm on the CPU and Xeon Phi (based on OpenMP, vectorization, and the Intel MKL library) and on the GPU (based on CUDA and the MAGMA library). We present timing results for all codes and discuss the similarities and differences between the three implementations. Overall, we observe a speedup of approximately 2.5 for adding an Intel Xeon Phi 7120 to a dual socket workstation and a speedup between 3.4 and 3.8 for adding a NVIDIA K80 to a dual socket workstation.
      

        
       
          

«

21
      22
      23
      24
      25
   »

          
        

     

   

   Some links on this page may take you to non-federal websites. Their policies may differ from this site.