Architectural Specialization for Inter-Iteration Loop Dependence Patterns
2015-10-01
Architectural Specialization for Inter-Iteration Loop Dependence Patterns. Christopher Batten, Computer Systems Laboratory, School of Electrical and... [Figures: "Trends in Computer Architecture" (transistors in thousands, frequency in MHz, typical power in W; MIPS R2K, DEC Alpha 21264, Intel P4) and energy efficiency (tasks per joule) versus design performance, spanning simple processor designs, embedded architectures, and high-performance architectures under a power constraint.]
Performance Analysis of Cloud Computing Architectures Using Discrete Event Simulation
NASA Technical Reports Server (NTRS)
Stocker, John C.; Golomb, Andrew M.
2011-01-01
Cloud computing offers the economic benefit of on-demand resource allocation to meet changing enterprise computing needs. However, cloud computing is at a disadvantage compared to traditional hosting in providing predictable application and service performance. Cloud computing relies on resource scheduling in a virtualized network-centric server environment, which makes static performance analysis infeasible. We developed a discrete event simulation model to evaluate the overall effectiveness of organizations in executing their workflow in traditional and cloud computing architectures. The two-part model framework characterizes both the demand, using a probability distribution for each type of service request, and the enterprise computing resource constraints. Our simulations provide quantitative analysis to design and provision computing architectures that maximize overall mission effectiveness. We share our analysis of key resource constraints in cloud computing architectures and findings on the appropriateness of cloud computing in various applications.
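As a rough, self-contained illustration of the kind of discrete event simulation described above, the sketch below models requests arriving according to a probability distribution and competing for a fixed pool of servers, then reports the mean sojourn time. All parameters (arrival rate, service rate, server count) are hypothetical placeholders; this is not the authors' two-part model.

```python
import heapq
import random

def simulate(arrival_rate, service_rate, servers, horizon, seed=0):
    """Minimal discrete event simulation with exponential demand (assumed) and
    a fixed server pool as the resource constraint. Returns mean sojourn time."""
    rng = random.Random(seed)
    free, queue, events = servers, [], []
    heapq.heappush(events, (rng.expovariate(arrival_rate), "arrival"))
    waits, arrivals, next_id = [], {}, 0
    while events:
        t, kind, *payload = heapq.heappop(events)
        if t > horizon:
            break
        if kind == "arrival":
            arrivals[next_id] = t
            queue.append(next_id)
            next_id += 1
            heapq.heappush(events, (t + rng.expovariate(arrival_rate), "arrival"))
        else:  # departure: a server frees up, record the request's sojourn time
            free += 1
            waits.append(t - arrivals.pop(payload[0]))
        while free and queue:  # start service for queued requests if servers are free
            free -= 1
            job = queue.pop(0)
            heapq.heappush(events, (t + rng.expovariate(service_rate), "departure", job))
    return sum(waits) / len(waits) if waits else 0.0

# Hypothetical workload: 8 requests per time unit, 10 servers, unit mean service time
print("mean sojourn time:", simulate(arrival_rate=8.0, service_rate=1.0, servers=10, horizon=1_000.0))
```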
Distributed Computing Architecture for Image-Based Wavefront Sensing and 2-D FFTs
NASA Technical Reports Server (NTRS)
Smith, Jeffrey S.; Dean, Bruce H.; Haghani, Shadan
2006-01-01
Image-based wavefront sensing (WFS) provides significant advantages over interferometric wavefront sensors, such as optical design simplicity and stability. However, the image-based approach is computationally intensive, and therefore specialized high-performance computing architectures are required in applications utilizing the image-based approach. The development and testing of these high-performance computing architectures are essential to such missions as the James Webb Space Telescope (JWST), Terrestrial Planet Finder-Coronagraph (TPF-C and CorSpec), and Spherical Primary Optical Telescope (SPOT). The development of these specialized computing architectures requires numerous two-dimensional Fourier transforms, which necessitate an all-to-all communication when applied on a distributed computational architecture. Several solutions for distributed computing are presented, with an emphasis on a 64-node cluster of DSPs, multiple DSP FPGAs, and an application of low-diameter graph theory. Timing results and performance analysis are presented. The solutions offered could be applied to other all-to-all communication and scientifically computationally complex problems.
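To see why the 2-D FFT forces an all-to-all exchange, the sketch below computes a 2-D FFT by the row-column method in NumPy; the transpose between the two 1-D passes is the step that becomes an all-to-all communication on a distributed machine. This is an illustrative sketch only, not the DSP/FPGA cluster implementation described in the abstract.

```python
import numpy as np

def fft2_row_column(a):
    """2D FFT by row-column decomposition.

    On a distributed machine the transpose between the two 1D passes is the
    all-to-all exchange discussed above; here it is just an in-memory transpose.
    """
    rows_done = np.fft.fft(a, axis=1)   # 1D FFT along each row
    transposed = rows_done.T            # the all-to-all step when distributed
    cols_done = np.fft.fft(transposed, axis=1)
    return cols_done.T

a = np.random.default_rng(0).standard_normal((64, 64))
assert np.allclose(fft2_row_column(a), np.fft.fft2(a))
print("row-column FFT matches np.fft.fft2")
```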
Efficient Numeric and Geometric Computations using Heterogeneous Shared Memory Architectures
2017-10-04
Report: Efficient Numeric and Geometric Computations using Heterogeneous Shared Memory Architectures (...Chapel Hill). ...algorithms for scientific and geometric computing by exploiting the power and performance efficiency of heterogeneous shared memory architectures. These...
Analysis OpenMP performance of AMD and Intel architecture for breaking waves simulation using MPS
NASA Astrophysics Data System (ADS)
Alamsyah, M. N. A.; Utomo, A.; Gunawan, P. H.
2018-03-01
A simulation of breaking waves using the Navier-Stokes equations via the moving particle semi-implicit (MPS) method over a closed domain is given. The results show that parallel computing on a multicore architecture using the OpenMP platform can reduce the computational time to almost half of the serial time. Here, a comparison of two computer architectures (AMD and Intel) is performed. The Intel architecture shows better CPU time than the AMD architecture; however, in efficiency, the AMD architecture is slightly higher than the Intel. For the simulation with 1512 particles, the CPU times using Intel and AMD are 12662.47 and 28282.30, respectively. Moreover, at the same number of particles, AMD obtains an efficiency of 50.09% and Intel up to 49.42%.
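For reference, speed-up and parallel efficiency as used in studies like this one are simple ratios; the sketch below computes both. The timings and core count are made-up placeholders, not the paper's measurements.

```python
def speedup_and_efficiency(t_serial, t_parallel, cores):
    """Speed-up = T_serial / T_parallel; efficiency = speed-up / number of cores."""
    s = t_serial / t_parallel
    return s, s / cores

# Hypothetical numbers for illustration only: 100 s serial, 55 s on 2 cores
s, e = speedup_and_efficiency(100.0, 55.0, 2)
print(f"speedup = {s:.2f}x, efficiency = {e:.1%}")   # ~1.82x, ~90.9%
```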
Architecture Adaptive Computing Environment
NASA Technical Reports Server (NTRS)
Dorband, John E.
2006-01-01
Architecture Adaptive Computing Environment (aCe) is a software system that includes a language, compiler, and run-time library for parallel computing. aCe was developed to enable programmers to write programs, more easily than was previously possible, for a variety of parallel computing architectures. Heretofore, it has been perceived to be difficult to write parallel programs for parallel computers and more difficult to port the programs to different parallel computing architectures. In contrast, aCe is supportable on all high-performance computing architectures. Currently, it is supported on LINUX clusters. aCe uses parallel programming constructs that facilitate writing of parallel programs. Such constructs were used in single-instruction/multiple-data (SIMD) programming languages of the 1980s, including Parallel Pascal, Parallel Forth, C*, *LISP, and MasPar MPL. In aCe, these constructs are extended and implemented for both SIMD and multiple-instruction/multiple-data (MIMD) architectures. Two new constructs incorporated in aCe are those of (1) scalar and virtual variables and (2) pre-computed paths. The scalar-and-virtual-variables construct increases flexibility in optimizing memory utilization in various architectures. The pre-computed-paths construct enables the compiler to pre-compute part of a communication operation once, rather than computing it every time the communication operation is performed.
A new software-based architecture for quantum computer
NASA Astrophysics Data System (ADS)
Wu, Nan; Song, FangMin; Li, Xiangdong
2010-04-01
In this paper, we study a reliable architecture for a quantum computer and a new instruction set and machine language for the architecture, which can improve the performance and reduce the cost of quantum computing. We also try to address in detail some key issues of software-driven universal quantum computers.
Yokohama, Noriya
2013-07-01
This report aimed to structure the design of architectures and to study the performance of a parallel computing environment that runs a Monte Carlo simulation for particle therapy on a high performance computing (HPC) instance within a public cloud-computing infrastructure. Performance measurements showed approximately 28 times faster execution than the single-thread architecture, combined with improved stability. A study of methods for optimizing the system operations also indicated lower cost.
SU(2) lattice gauge theory simulations on Fermi GPUs
NASA Astrophysics Data System (ADS)
Cardoso, Nuno; Bicudo, Pedro
2011-05-01
In this work we explore the performance of CUDA in quenched lattice SU(2) simulations. CUDA, NVIDIA Compute Unified Device Architecture, is a hardware and software architecture developed by NVIDIA for computing on the GPU. We present an analysis and performance comparison between the GPU and CPU in single and double precision. Analyses with multiple GPUs and two different architectures (G200 and Fermi) are also presented. In order to obtain high performance, the code must be optimized for the GPU architecture, i.e., an implementation that exploits the memory hierarchy of the CUDA programming model. We produce codes for the Monte Carlo generation of SU(2) lattice gauge configurations, for the mean plaquette, for the Polyakov loop at finite T, and for the Wilson loop. We also present results for the potential using many configurations (50,000) without smearing and almost 2000 configurations with APE smearing. With two Fermi GPUs we have achieved an excellent performance of 200× the speed of one CPU, in single precision, around 110 Gflops/s. We also find that, using the Fermi architecture, double precision computations for the static quark-antiquark potential are not much slower (less than 2× slower) than single precision computations.
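To make the mean-plaquette observable mentioned above concrete, here is a minimal NumPy sketch on a 2-D toy lattice with randomly drawn ("hot") SU(2) links. It only illustrates the observable; it is not the authors' CUDA implementation, and the 2-D lattice and sizes are assumptions chosen for brevity.

```python
import numpy as np

# Pauli matrices and the identity
sx = np.array([[0, 1], [1, 0]], dtype=complex)
sy = np.array([[0, -1j], [1j, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)
ident = np.eye(2, dtype=complex)

def random_su2(rng):
    """Random SU(2) element from a uniform point on S^3 (quaternion form)."""
    a = rng.normal(size=4)
    a /= np.linalg.norm(a)
    return a[0] * ident + 1j * (a[1] * sx + a[2] * sy + a[3] * sz)

def mean_plaquette(U):
    """(1/2) Re Tr of the elementary plaquette, averaged over a 2D periodic lattice.

    U has shape (L, L, 2, 2, 2): site (x, y), link direction mu, then the 2x2 matrix.
    """
    L = U.shape[0]
    total = 0.0
    for x in range(L):
        for y in range(L):
            u0 = U[x, y, 0]                      # link in direction 0 at (x, y)
            u1 = U[(x + 1) % L, y, 1]            # direction-1 link at (x+1, y)
            u2 = U[x, (y + 1) % L, 0].conj().T   # reversed direction-0 link at (x, y+1)
            u3 = U[x, y, 1].conj().T             # reversed direction-1 link at (x, y)
            plaq = u0 @ u1 @ u2 @ u3
            total += 0.5 * plaq.trace().real
    return total / (L * L)

rng = np.random.default_rng(0)
L = 8
U = np.empty((L, L, 2, 2, 2), dtype=complex)
for x in range(L):
    for y in range(L):
        for mu in range(2):
            U[x, y, mu] = random_su2(rng)

print("mean plaquette (hot start, near 0 expected):", mean_plaquette(U))
```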
NASA Astrophysics Data System (ADS)
Liu, Chen; Han, Runze; Zhou, Zheng; Huang, Peng; Liu, Lifeng; Liu, Xiaoyan; Kang, Jinfeng
2018-04-01
In this work we present a novel convolution computing architecture based on metal oxide resistive random access memory (RRAM) to process the image data stored in the RRAM arrays. The proposed image-storage architecture shows better speed and device-consumption efficiency than the previous kernel-storage architecture. Further, we improve the architecture for high-accuracy, low-power computing by utilizing binary storage and a series resistor. For a 28 × 28 image and 10 kernels with a size of 3 × 3, compared with the previous kernel-storage approach, the newly proposed architecture shows excellent performance, including: 1) almost 100% accuracy within 20% LRS variation and 90% HRS variation; 2) more than a 67 times speed boost; 3) 71.4% energy saving.
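The dot products an RRAM crossbar evaluates in the analog domain can be mimicked digitally by unrolling image patches and multiplying by a kernel matrix (im2col). The sketch below does this for a 28 × 28 image and ten 3 × 3 kernels as in the abstract; it is a purely digital stand-in on random data for intuition, not a model of the RRAM device behavior.

```python
import numpy as np

def im2col_conv(image, kernels):
    """Sliding-window dot products of one image with a stack of kernels via a
    single matrix product: the digital analogue of what an RRAM crossbar computes,
    where the kernel matrix would be programmed as conductances.
    """
    H, W = image.shape
    n_k, kh, kw = kernels.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    # Unroll every kh x kw patch into a column
    cols = np.empty((kh * kw, out_h * out_w))
    idx = 0
    for i in range(out_h):
        for j in range(out_w):
            cols[:, idx] = image[i:i + kh, j:j + kw].ravel()
            idx += 1
    weights = kernels.reshape(n_k, kh * kw)      # one kernel per crossbar row
    return (weights @ cols).reshape(n_k, out_h, out_w)

rng = np.random.default_rng(0)
image = rng.random((28, 28))       # a 28 x 28 image, as in the abstract
kernels = rng.random((10, 3, 3))   # 10 kernels of size 3 x 3
out = im2col_conv(image, kernels)
print(out.shape)                   # (10, 26, 26)
```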
Hypercluster Parallel Processor
NASA Technical Reports Server (NTRS)
Blech, Richard A.; Cole, Gary L.; Milner, Edward J.; Quealy, Angela
1992-01-01
Hypercluster computer system includes multiple digital processors, operation of which is coordinated through specialized software. Configurable according to various parallel-computing architectures of shared-memory or distributed-memory class, including scalar computer, vector computer, reduced-instruction-set computer, and complex-instruction-set computer. Designed as flexible, relatively inexpensive system that provides single programming and operating environment within which one can investigate effects of various parallel-computing architectures and combinations on performance in solution of complicated problems like those of three-dimensional flows in turbomachines. Hypercluster software and architectural concepts are in public domain.
Electro-Optic Computing Architectures. Volume I
1998-02-01
The objective of the Electro-Optic Computing Architecture (EOCA) program was to develop multi-function electro-optic interfaces and optical...interconnect units to enhance the performance of parallel processor systems and form the building blocks for future electro-optic computing architectures...Specifically, three multi-function interface modules were targeted for development: an Electro-Optic Interface (EOI), an Optical Interconnection Unit (OW
Progress in a novel architecture for high performance processing
NASA Astrophysics Data System (ADS)
Zhang, Zhiwei; Liu, Meng; Liu, Zijun; Du, Xueliang; Xie, Shaolin; Ma, Hong; Ding, Guangxin; Ren, Weili; Zhou, Fabiao; Sun, Wenqin; Wang, Huijuan; Wang, Donglin
2018-04-01
High performance processing (HPP) is an innovative architecture that targets high-performance computing with excellent power efficiency and computing performance. It is suitable for data-intensive applications like supercomputing, machine learning and wireless communication. An example chip with four application-specific integrated circuit (ASIC) cores, which is the first generation of HPP cores, has been taped out successfully in the Taiwan Semiconductor Manufacturing Company (TSMC) 40 nm low-power process. The innovative architecture shows great energy efficiency over the traditional central processing unit (CPU) and general-purpose computing on graphics processing units (GPGPU). Compared with MaPU, HPP has made great improvements in architecture. A chip with 32 HPP cores is being developed in the TSMC 16 nm (FFC) technology process and is planned for commercial use. The peak performance of this chip can reach 4.3 teraFLOPS (TFLOPS) and its power efficiency is up to 89.5 gigaFLOPS per watt (GFLOPS/W).
Collaborative Working Architecture for IoT-Based Applications.
Mora, Higinio; Signes-Pont, María Teresa; Gil, David; Johnsson, Magnus
2018-05-23
The new sensing applications need enhanced computing capabilities to handle the requirements of complex and huge data processing. The Internet of Things (IoT) concept brings processing and communication features to devices. In addition, the Cloud Computing paradigm provides resources and infrastructures for performing the computations and outsourcing the work from the IoT devices. This scenario opens new opportunities for designing advanced IoT-based applications; however, there is still much research to be done to properly gear all the systems for working together. This work proposes a collaborative model and an architecture to take advantage of the available computing resources. The resulting architecture involves a novel network design with different levels which combines sensing and processing capabilities based on the Mobile Cloud Computing (MCC) paradigm. An experiment is included to demonstrate that this approach can be used in diverse real applications. The results show the flexibility of the architecture to perform complex computational tasks of advanced applications.
Innovative architectures for dense multi-microprocessor computers
NASA Technical Reports Server (NTRS)
Donaldson, Thomas; Doty, Karl; Engle, Steven W.; Larson, Robert E.; O'Reilly, John G.
1988-01-01
The results of a Phase I Small Business Innovative Research (SBIR) project performed for the NASA Langley Computational Structural Mechanics Group are described. The project resulted in the identification of a family of chordal-ring interconnection architectures with excellent potential to serve as the basis for new multimicroprocessor (MMP) computers. The paper presents examples of how computational algorithms from structural mechanics can be efficiently implemented on the chordal-ring architecture.
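As a concrete reminder of what a chordal ring is, the sketch below builds a generic chordal-ring adjacency (ring links plus one chord length) and measures its diameter by breadth-first search. The node count and chord length are arbitrary illustrations, not the specific family identified in the SBIR study.

```python
from collections import deque

def chordal_ring(n, chord):
    """Adjacency of a chordal ring: an n-node ring plus chords of length `chord`."""
    return {v: {(v - 1) % n, (v + 1) % n, (v + chord) % n, (v - chord) % n}
            for v in range(n)}

def diameter(adj):
    """Graph diameter by BFS from every node (fine for the small sizes sketched here)."""
    worst = 0
    for src in adj:
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        worst = max(worst, max(dist.values()))
    return worst

print(diameter(chordal_ring(16, 5)))   # e.g. a 16-node ring with chord length 5
```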
Generic Divide and Conquer Internet-Based Computing
NASA Technical Reports Server (NTRS)
Radenski, Atanas; Follen, Gregory J. (Technical Monitor)
2001-01-01
The rapid growth of internet-based applications and the proliferation of networking technologies have been transforming traditional commercial application areas as well as computer and computational sciences and engineering. This growth stimulates the exploration of new, internet-oriented software technologies that can open new research and application opportunities not only for the commercial world, but also for the scientific and high-performance computing applications community. The general goal of this research project is to contribute to better understanding of the transition to internet-based high-performance computing and to develop solutions for some of the difficulties of this transition. More specifically, our goal is to design an architecture for generic divide and conquer internet-based computing, to develop a portable implementation of this architecture, to create an example library of high-performance divide-and-conquer computing agents that run on top of this architecture, and to evaluate the performance of these agents. We have been designing an architecture that incorporates a master task-pool server and utilizes satellite computational servers that operate on the Internet in a dynamically changing large configuration of lower-end nodes provided by volunteer contributors. Our designed architecture is intended to be complementary to and accessible from computational grids such as Globus, Legion, and Condor. Grids provide remote access to existing high-end computing resources; in contrast, our goal is to utilize idle processor time of lower-end internet nodes. Our project is focused on a generic divide-and-conquer paradigm and its applications that operate on a loose and ever changing pool of lower-end internet nodes.
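For intuition about the paradigm, here is a minimal divide-and-conquer skeleton in which a coordinator splits work and farms the small pieces out to a worker pool; the thread pool merely stands in for the satellite computational servers and volunteer internet nodes described above. The names and thresholds are illustrative assumptions, not part of the project's design.

```python
from concurrent.futures import ThreadPoolExecutor

def divide(problem, threshold):
    """Recursively split until pieces are small enough to ship to a worker."""
    if len(problem) <= threshold:
        return [problem]
    mid = len(problem) // 2
    return divide(problem[:mid], threshold) + divide(problem[mid:], threshold)

def divide_and_conquer(problem, solve, combine, threshold, workers=4):
    """The master splits the work; the pool stands in for the satellite servers
    that would solve sub-problems on volunteer nodes in the architecture above."""
    pieces = divide(problem, threshold)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(solve, pieces))
    result = partials[0]
    for p in partials[1:]:
        result = combine(result, p)
    return result

# Example: summing a list by recursive halving
data = list(range(100_000))
total = divide_and_conquer(data, solve=sum, combine=lambda a, b: a + b, threshold=4096)
print(total, total == sum(data))
```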
The new landscape of parallel computer architecture
NASA Astrophysics Data System (ADS)
Shalf, John
2007-07-01
The past few years have seen a sea change in computer architecture that will impact every facet of our society as every electronic device from cell phone to supercomputer will need to confront parallelism of unprecedented scale. Whereas the conventional multicore approach (2, 4, and even 8 cores) adopted by the computing industry will eventually hit a performance plateau, the highest performance per watt and per chip area is achieved using manycore technology (hundreds or even thousands of cores). However, fully unleashing the potential of the manycore approach to ensure future advances in sustained computational performance will require fundamental advances in computer architecture and programming models that are nothing short of reinventing computing. In this paper we examine the reasons behind the movement to exponentially increasing parallelism, and its ramifications for system design, applications and programming models.
Resource Efficient Hardware Architecture for Fast Computation of Running Max/Min Filters
Torres-Huitzil, Cesar
2013-01-01
Running max/min filters on rectangular kernels are widely used in many digital signal and image processing applications. Filtering with a k × k kernel requires k² − 1 comparisons per sample for a direct implementation; thus, performance scales expensively with the kernel size k. Faster computations can be achieved by kernel decomposition and using constant-time one-dimensional algorithms on custom hardware. This paper presents a hardware architecture for real-time computation of running max/min filters based on the van Herk/Gil-Werman (HGW) algorithm. The proposed architecture design uses less computation and memory resources than previously reported architectures when targeted to Field Programmable Gate Array (FPGA) devices. Implementation results show that the architecture is able to compute max/min filters, on 1024 × 1024 images with up to 255 × 255 kernels, in around 8.4 milliseconds, 120 frames per second, at a clock frequency of 250 MHz. The implementation is highly scalable with the kernel size, with a good performance/area tradeoff suitable for embedded applications. The applicability of the architecture is shown for local adaptive image thresholding. PMID: 24288456
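The van Herk/Gil-Werman idea is easy to see in one dimension: two block-restarted scans (a prefix max and a suffix max) plus one merge give the running max in roughly three comparisons per sample, independent of the window size k. Below is a small software sketch of that 1-D algorithm, checked against the naive definition; it is illustrative only and says nothing about the paper's FPGA design.

```python
import numpy as np

def running_max_hgw(f, k):
    """van Herk/Gil-Werman running max over windows f[i:i+k].

    Uses about three comparisons per sample regardless of k: one block-wise
    forward (prefix) scan, one block-wise backward (suffix) scan, one merge.
    """
    f = np.asarray(f)
    n = len(f)
    g = np.empty_like(f)  # prefix max, restarted at block boundaries
    h = np.empty_like(f)  # suffix max, restarted at block boundaries
    for i in range(n):
        g[i] = f[i] if i % k == 0 else max(g[i - 1], f[i])
    for i in range(n - 1, -1, -1):
        if i == n - 1 or (i + 1) % k == 0:
            h[i] = f[i]
        else:
            h[i] = max(h[i + 1], f[i])
    # window [i, i+k-1] spans at most two blocks; merge the two partial maxima
    return np.array([max(h[i], g[i + k - 1]) for i in range(n - k + 1)])

# sanity check against the naive definition
x = np.random.default_rng(1).integers(0, 255, size=64)
k = 7
naive = np.array([x[i:i + k].max() for i in range(len(x) - k + 1)])
assert np.array_equal(running_max_hgw(x, k), naive)
print("HGW running max matches the naive filter")
```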
State-of-the-art in Heterogeneous Computing
Brodtkorb, Andre R.; Dyken, Christopher; Hagen, Trond R.; ...
2010-01-01
Node level heterogeneous architectures have become attractive during the last decade for several reasons: compared to traditional symmetric CPUs, they offer high peak performance and are energy and/or cost efficient. With the increase of fine-grained parallelism in high-performance computing, as well as the introduction of parallelism in workstations, there is an acute need for a good overview and understanding of these architectures. We give an overview of the state-of-the-art in heterogeneous computing, focusing on three commonly found architectures: the Cell Broadband Engine Architecture, graphics processing units (GPUs), and field programmable gate arrays (FPGAs). We present a review of hardware, available software tools, and an overview of state-of-the-art techniques and algorithms. Furthermore, we present a qualitative and quantitative comparison of the architectures, and give our view on the future of heterogeneous computing.
Solving the Cauchy-Riemann equations on parallel computers
NASA Technical Reports Server (NTRS)
Fatoohi, Raad A.; Grosch, Chester E.
1987-01-01
Discussed is the implementation of a single algorithm on three parallel-vector computers. The algorithm is a relaxation scheme for the solution of the Cauchy-Riemann equations, a set of coupled first-order partial differential equations. The computers were chosen so as to encompass a variety of architectures. They are: the MPP, an SIMD machine with 16K bit-serial processors; the FLEX/32, an MIMD machine with 20 processors; and the CRAY/2, an MIMD machine with four vector processors. The machine architectures are briefly described. The implementation of the algorithm is discussed in relation to these architectures and measures of the performance on each machine are given. Simple performance models are used to describe the performance. These models highlight the bottlenecks and limiting factors for this algorithm on these architectures. Conclusions are presented.
Innovative architectures for dense multi-microprocessor computers
NASA Technical Reports Server (NTRS)
Larson, Robert E.
1989-01-01
The purpose is to summarize a Phase 1 SBIR project performed for the NASA/Langley Computational Structural Mechanics Group. The project was performed from February to August 1987. The main objectives of the project were to: (1) expand upon previous research into the application of chordal ring architectures to the general problem of designing multi-microcomputer architectures, (2) attempt to identify a family of chordal rings such that each chordal ring can be simply expanded to produce the next member of the family, (3) perform a preliminary, high-level design of an expandable multi-microprocessor computer based upon chordal rings, (4) analyze the potential use of chordal ring based multi-microprocessors for sparse matrix problems and other applications arising in computational structural mechanics.
Importance of balanced architectures in the design of high-performance imaging systems
NASA Astrophysics Data System (ADS)
Sgro, Joseph A.; Stanton, Paul C.
1999-03-01
Imaging systems employed in demanding military and industrial applications, such as automatic target recognition and computer vision, typically require real-time high-performance computing resources. While high-performance computing systems have traditionally relied on proprietary architectures and custom components, recent advances in high-performance general-purpose microprocessor technology have produced an abundance of low-cost components suitable for use in high-performance computing systems. A common pitfall in the design of high-performance imaging systems, particularly systems employing scalable multiprocessor architectures, is the failure to balance computational and memory bandwidth. The performance of standard cluster designs, for example, in which several processors share a common memory bus, is typically constrained by memory bandwidth. The symptom characteristic of this problem is the failure of system performance to scale as more processors are added. The problem becomes exacerbated if I/O and memory functions share the same bus. The recent introduction of microprocessors with large internal caches and high-performance external memory interfaces makes it practical to design high-performance imaging systems with balanced computational and memory bandwidth. Real-world examples of such designs are presented, along with a discussion of adapting algorithm design to best utilize available memory bandwidth.
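The balance argument can be captured by a simple roofline-style bound: attainable throughput is the minimum of the compute peak and the product of memory bandwidth and arithmetic intensity, so adding processors to a shared bus raises only the first term. The numbers below are hypothetical, chosen only to illustrate the bound.

```python
def attainable_gflops(peak_gflops, mem_bw_gb_s, flops_per_byte):
    """Roofline bound: performance is capped by either compute or memory traffic."""
    return min(peak_gflops, mem_bw_gb_s * flops_per_byte)

# Hypothetical node: 200 GFLOP/s aggregate peak, 25 GB/s shared memory bus.
for intensity in (0.25, 1.0, 4.0, 16.0):   # flops per byte moved
    bound = attainable_gflops(200.0, 25.0, intensity)
    print(f"intensity {intensity:5.2f} flop/byte -> at most {bound:6.1f} GFLOP/s")
```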
A Case Study on Neural Inspired Dynamic Memory Management Strategies for High Performance Computing.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Vineyard, Craig Michael; Verzi, Stephen Joseph
As high performance computing architectures pursue more computational power there is a need for increased memory capacity and bandwidth as well. A multi-level memory (MLM) architecture addresses this need by combining multiple memory types with different characteristics as varying levels of the same architecture. How to efficiently utilize this memory infrastructure is an unknown challenge, and in this research we sought to investigate whether neural inspired approaches can meaningfully help with memory management. In particular we explored neurogenesis-inspired resource allocation, and were able to show that a neural inspired mixed controller policy can beneficially impact how MLM architectures utilize memory.
NASA Technical Reports Server (NTRS)
Carroll, Chester C.; Youngblood, John N.; Saha, Aindam
1987-01-01
Improvements and advances in the development of computer architecture now provide innovative technology for the recasting of traditional sequential solutions into high-performance, low-cost, parallel systems to increase system performance. Research conducted in the development of a specialized computer architecture for the algorithmic execution of an avionics guidance and control problem in real time is described. A comprehensive treatment of both the hardware and software structures of a customized computer which performs real-time computation of guidance commands with updated estimates of target motion and time-to-go is presented. An optimal, real-time allocation algorithm was developed which maps the algorithmic tasks onto the processing elements. This allocation is based on critical path analysis. The final stage is the design and development of the hardware structures suitable for the efficient execution of the allocated task graph. The processing element is designed for rapid execution of the allocated tasks. Fault tolerance is a key feature of the overall architecture. Parallel numerical integration techniques, task definitions, and allocation algorithms are discussed. The parallel implementation is analytically verified and the experimental results are presented. The design of the data-driven computer architecture, customized for the execution of the particular algorithm, is discussed.
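Critical path analysis, on which the allocation above is based, amounts to finding the longest-cost chain of dependent tasks in the task graph. The sketch below computes earliest finish times on a tiny, made-up guidance-style DAG; the task names and costs are illustrative assumptions, not the paper's task graph.

```python
def critical_path(tasks):
    """Longest (critical) path length through a task DAG.

    `tasks` maps task -> (cost, list of predecessor tasks). Returns the earliest
    finish time of every task; the maximum is the critical-path length.
    """
    finish = {}

    def earliest_finish(t):
        if t not in finish:
            cost, preds = tasks[t]
            finish[t] = cost + max((earliest_finish(p) for p in preds), default=0)
        return finish[t]

    for t in tasks:
        earliest_finish(t)
    return finish

# Hypothetical guidance-style task graph: sense, integrate state, estimate target, command
graph = {
    "sense":     (2, []),
    "integrate": (5, ["sense"]),
    "estimate":  (4, ["sense"]),
    "command":   (3, ["integrate", "estimate"]),
}
ef = critical_path(graph)
print(ef, "critical path =", max(ef.values()))   # 2 + 5 + 3 = 10
```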
Electro-Optic Computing Architectures: Volume II. Components and System Design and Analysis
1998-02-01
The objective of the Electro-Optic Computing Architecture (EOCA) program was to develop multi-function electro-optic interfaces and optical...interconnect units to enhance the performance of parallel processor systems and form the building blocks for future electro-optic computing architectures...Specifically, three multi-function interface modules were targeted for development: an Electro-Optic Interface (EOI), an Optical Interconnection Unit
A Study of Complex Deep Learning Networks on High Performance, Neuromorphic, and Quantum Computers
DOE Office of Scientific and Technical Information (OSTI.GOV)
Potok, Thomas E; Schuman, Catherine D; Young, Steven R
Current Deep Learning models use highly optimized convolutional neural networks (CNN) trained on large graphical processing units (GPU)-based computers with a fairly simple layered network topology, i.e., highly connected layers, without intra-layer connections. Complex topologies have been proposed, but are intractable to train on current systems. Building the topologies of the deep learning network requires hand tuning, and implementing the network in hardware is expensive in both cost and power. In this paper, we evaluate deep learning models using three different computing architectures to address these problems: quantum computing to train complex topologies, high performance computing (HPC) to automatically determine network topology, and neuromorphic computing for a low-power hardware implementation. Due to input size limitations of current quantum computers we use the MNIST dataset for our evaluation. The results show the possibility of using the three architectures in tandem to explore complex deep learning networks that are untrainable using a von Neumann architecture. We show that a quantum computer can find high quality values of intra-layer connections and weights, while yielding a tractable time result as the complexity of the network increases; a high performance computer can find optimal layer-based topologies; and a neuromorphic computer can represent the complex topology and weights derived from the other architectures in low power memristive hardware. This represents a new capability that is not feasible with current von Neumann architecture. It potentially enables the ability to solve very complicated problems unsolvable with current computing technologies.
NASA Astrophysics Data System (ADS)
Dave, Gaurav P.; Sureshkumar, N.; Blessy Trencia Lincy, S. S.
2017-11-01
The current trend in processor manufacturing focuses on multi-core architectures rather than increasing the clock speed for performance improvement. Graphics processors have become commodity hardware for providing fast co-processing in computer systems. Developments in IoT, social networking web applications, and big data have created huge demand for data processing activities, and such throughput-intensive applications inherently contain data-level parallelism, which is well suited to SIMD-architecture-based GPUs. This paper reviews the architectural aspects of multi/many-core processors and graphics processors. Different case studies are taken to compare the performance of throughput computing applications using shared-memory programming in OpenMP and CUDA API based programming.
Design of a massively parallel computer using bit serial processing elements
NASA Technical Reports Server (NTRS)
Aburdene, Maurice F.; Khouri, Kamal S.; Piatt, Jason E.; Zheng, Jianqing
1995-01-01
A 1-bit serial processor designed for a parallel computer architecture is described. This processor is used to develop a massively parallel computational engine, with a single instruction-multiple data (SIMD) architecture. The computer is simulated and tested to verify its operation and to measure its performance for further development.
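To picture what a 1-bit serial processing element does, the sketch below adds two numbers one bit per step with a software full adder, least significant bit first. It is a conceptual illustration only and is unrelated to the actual processing element design in the paper.

```python
def bit_serial_add(a_bits, b_bits):
    """Add two numbers presented LSB-first, one bit per 'cycle', as a 1-bit
    serial processing element would."""
    carry, out = 0, []
    for a, b in zip(a_bits, b_bits):
        s = a ^ b ^ carry                      # sum bit from a full adder
        carry = (a & b) | (carry & (a ^ b))    # carry out to the next cycle
        out.append(s)
    out.append(carry)
    return out

def to_bits(x, width):
    return [(x >> i) & 1 for i in range(width)]

def from_bits(bits):
    return sum(b << i for i, b in enumerate(bits))

a, b = 46, 86
print(from_bits(bit_serial_add(to_bits(a, 8), to_bits(b, 8))) == a + b)  # True
```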
Evaluation of Cache-based Superscalar and Cacheless Vector Architectures for Scientific Computations
NASA Technical Reports Server (NTRS)
Oliker, Leonid; Carter, Jonathan; Shalf, John; Skinner, David; Ethier, Stephane; Biswas, Rupak; Djomehri, Jahed; VanderWijngaart, Rob
2003-01-01
The growing gap between sustained and peak performance for scientific applications has become a well-known problem in high performance computing. The recent development of parallel vector systems offers the potential to bridge this gap for a significant number of computational science codes and deliver a substantial increase in computing capabilities. This paper examines the intranode performance of the NEC SX6 vector processor and the cache-based IBM Power3/4 superscalar architectures across a number of key scientific computing areas. First, we present the performance of a microbenchmark suite that examines a full spectrum of low-level machine characteristics. Next, we study the behavior of the NAS Parallel Benchmarks using some simple optimizations. Finally, we evaluate the performance of several numerical codes from key scientific computing domains. Overall results demonstrate that the SX6 achieves high performance on a large fraction of our application suite and in many cases significantly outperforms the RISC-based architectures. However, certain classes of applications are not easily amenable to vectorization and would likely require extensive reengineering of both algorithm and implementation to utilize the SX6 effectively.
Optimizing Engineering Tools Using Modern Ground Architectures
2017-12-01
...scientific community. Traditional computing architectures were not capable of processing the data efficiently, or in some cases, could not process the... ...thesis investigates how these modern computing architectures could be leveraged by industry and academia to improve the performance and capabilities of...
Modelling parallel programs and multiprocessor architectures with AXE
NASA Technical Reports Server (NTRS)
Yan, Jerry C.; Fineman, Charles E.
1991-01-01
AXE, An Experimental Environment for Parallel Systems, was designed to model and simulate parallel systems at the process level. It provides an integrated environment for specifying computation models, multiprocessor architectures, data collection, and performance visualization. AXE is being used at NASA-Ames for developing resource management strategies, parallel problem formulation, multiprocessor architectures, and operating system issues related to the High Performance Computing and Communications Program. AXE's simple, structured user interface enables the user to model parallel programs and machines precisely and efficiently. Its quick turn-around time keeps the user interested and productive. AXE models multicomputers. The user may easily modify various architectural parameters including the number of sites, connection topologies, and overhead for operating system activities. Parallel computations in AXE are represented as collections of autonomous computing objects known as players. Their use and behavior are described. Performance data of the multiprocessor model can be observed on a color screen. These include CPU and message routing bottlenecks, and the dynamic status of the software.
NASA Astrophysics Data System (ADS)
Mills, R. T.; Rupp, K.; Smith, B. F.; Brown, J.; Knepley, M.; Zhang, H.; Adams, M.; Hammond, G. E.
2017-12-01
As the high-performance computing community pushes towards the exascale horizon, power and heat considerations have driven the increasing importance and prevalence of fine-grained parallelism in new computer architectures. High-performance computing centers have become increasingly reliant on GPGPU accelerators and "manycore" processors such as the Intel Xeon Phi line, and 512-bit SIMD registers have even been introduced in the latest generation of Intel's mainstream Xeon server processors. The high degree of fine-grained parallelism and more complicated memory hierarchy considerations of such "manycore" processors present several challenges to existing scientific software. Here, we consider how the massively parallel, open-source hydrologic flow and reactive transport code PFLOTRAN - and the underlying Portable, Extensible Toolkit for Scientific Computation (PETSc) library on which it is built - can best take advantage of such architectures. We will discuss some key features of these novel architectures and our code optimizations and algorithmic developments targeted at them, and present experiences drawn from working with a wide range of PFLOTRAN benchmark problems on these architectures.
Computer architecture evaluation for structural dynamics computations: Project summary
NASA Technical Reports Server (NTRS)
Standley, Hilda M.
1989-01-01
The intent of the proposed effort is the examination of the impact of the elements of parallel architectures on the performance realized in a parallel computation. To this end, three major projects are developed: a language for the expression of high level parallelism, a statistical technique for the synthesis of multicomputer interconnection networks based upon performance prediction, and a queueing model for the analysis of shared memory hierarchies.
Advanced Architectures for Astrophysical Supercomputing
NASA Astrophysics Data System (ADS)
Barsdell, B. R.; Barnes, D. G.; Fluke, C. J.
2010-12-01
Astronomers have come to rely on the increasing performance of computers to reduce, analyze, simulate and visualize their data. In this environment, faster computation can mean more science outcomes or the opening up of new parameter spaces for investigation. If we are to avoid major issues when implementing codes on advanced architectures, it is important that we have a solid understanding of our algorithms. A recent addition to the high-performance computing scene that highlights this point is the graphics processing unit (GPU). The hardware originally designed for speeding-up graphics rendering in video games is now achieving speed-ups of O(100×) in general-purpose computation - performance that cannot be ignored. We are using a generalized approach, based on the analysis of astronomy algorithms, to identify the optimal problem-types and techniques for taking advantage of both current GPU hardware and future developments in computing architectures.
Electromagnetic Physics Models for Parallel Computing Architectures
NASA Astrophysics Data System (ADS)
Amadio, G.; Ananya, A.; Apostolakis, J.; Aurora, A.; Bandieramonte, M.; Bhattacharyya, A.; Bianchini, C.; Brun, R.; Canal, P.; Carminati, F.; Duhem, L.; Elvira, D.; Gheata, A.; Gheata, M.; Goulas, I.; Iope, R.; Jun, S. Y.; Lima, G.; Mohanty, A.; Nikitina, T.; Novak, M.; Pokorski, W.; Ribon, A.; Seghal, R.; Shadura, O.; Vallecorsa, S.; Wenzel, S.; Zhang, Y.
2016-10-01
The recent emergence of hardware architectures characterized by many-core or accelerated processors has opened new opportunities for concurrent programming models taking advantage of both SIMD and SIMT architectures. GeantV, a next generation detector simulation, has been designed to exploit both the vector capability of mainstream CPUs and multi-threading capabilities of coprocessors including NVidia GPUs and Intel Xeon Phi. The characteristics of these architectures are very different in terms of the vectorization depth and type of parallelization needed to achieve optimal performance. In this paper we describe implementation of electromagnetic physics models developed for parallel computing architectures as a part of the GeantV project. Results of preliminary performance evaluation and physics validation are presented as well.
Manyscale Computing for Sensor Processing in Support of Space Situational Awareness
NASA Astrophysics Data System (ADS)
Schmalz, M.; Chapman, W.; Hayden, E.; Sahni, S.; Ranka, S.
2014-09-01
Increasing image and signal data burden associated with sensor data processing in support of space situational awareness implies continuing computational throughput growth beyond the petascale regime. In addition to growing applications data burden and diversity, the breadth, diversity and scalability of high performance computing architectures and their various organizations challenge the development of a single, unifying, practicable model of parallel computation. Therefore, models for scalable parallel processing have exploited architectural and structural idiosyncrasies, yielding potential misapplications when legacy programs are ported among such architectures. In response to this challenge, we have developed a concise, efficient computational paradigm and software called Manyscale Computing to facilitate efficient mapping of annotated application codes to heterogeneous parallel architectures. Our theory, algorithms, software, and experimental results support partitioning and scheduling of application codes for envisioned parallel architectures, in terms of work atoms that are mapped (for example) to threads or thread blocks on computational hardware. Because of the rigor, completeness, conciseness, and layered design of our manyscale approach, application-to-architecture mapping is feasible and scalable for architectures at petascales, exascales, and above. Further, our methodology is simple, relying primarily on a small set of primitive mapping operations and support routines that are readily implemented on modern parallel processors such as graphics processing units (GPUs) and hybrid multi-processors (HMPs). In this paper, we overview the opportunities and challenges of manyscale computing for image and signal processing in support of space situational awareness applications. We discuss applications in terms of a layered hardware architecture (laboratory > supercomputer > rack > processor > component hierarchy). Demonstration applications include performance analysis and results in terms of execution time as well as storage, power, and energy consumption for bus-connected and/or networked architectures. The feasibility of the manyscale paradigm is demonstrated by addressing four principal challenges: (1) architectural/structural diversity, parallelism, and locality, (2) masking of I/O and memory latencies, (3) scalability of design as well as implementation, and (4) efficient representation/expression of parallel applications. Examples will demonstrate how manyscale computing helps solve these challenges efficiently on real-world computing systems.
Benchmarking high performance computing architectures with CMS’ skeleton framework
NASA Astrophysics Data System (ADS)
Sexton-Kennedy, E.; Gartung, P.; Jones, C. D.
2017-10-01
In 2012 CMS evaluated which underlying concurrency technology would be the best to use for its multi-threaded framework. The available technologies were evaluated on the high throughput computing systems dominating the resources in use at that time. A skeleton framework benchmarking suite that emulates the tasks performed within a CMSSW application was used to select Intel’s Thread Building Block library, based on the measured overheads in both memory and CPU on the different technologies benchmarked. In 2016 CMS will get access to high performance computing resources that use new many-core architectures: machines such as Cori Phase 1 & 2, Theta, and Mira. Because of this we have revived the 2012 benchmark to test its performance and conclusions on these new architectures. This talk will discuss the results of this exercise.
Sabne, Amit J.; Sakdhnagool, Putt; Lee, Seyong; ...
2015-07-13
Accelerator-based heterogeneous computing is gaining momentum in the high-performance computing arena. However, the increased complexity of heterogeneous architectures demands more generic, high-level programming models. OpenACC is one such attempt to tackle this problem. Although the abstraction provided by OpenACC offers productivity, it raises questions concerning both functional and performance portability. In this article, the authors propose HeteroIR, a high-level, architecture-independent intermediate representation, to map high-level programming models, such as OpenACC, to heterogeneous architectures. They present a compiler approach that translates OpenACC programs into HeteroIR and accelerator kernels to obtain OpenACC functional portability. They then evaluate the performance portability obtained by OpenACC with their approach on 12 OpenACC programs on Nvidia CUDA, AMD GCN, and Intel Xeon Phi architectures. They study the effects of various compiler optimizations and OpenACC program settings on these architectures to provide insights into the achieved performance portability.
An Architecture for Cross-Cloud System Management
NASA Astrophysics Data System (ADS)
Dodda, Ravi Teja; Smith, Chris; van Moorsel, Aad
The emergence of the cloud computing paradigm promises flexibility and adaptability through on-demand provisioning of compute resources. As the utilization of cloud resources extends beyond a single provider, for business as well as technical reasons, the issue of effectively managing such resources comes to the fore. Different providers expose different interfaces to their compute resources utilizing varied architectures and implementation technologies. This heterogeneity poses a significant system management problem, and can limit the extent to which the benefits of cross-cloud resource utilization can be realized. We address this problem through the definition of an architecture to facilitate the management of compute resources from different cloud providers in a homogeneous manner. This preserves the flexibility and adaptability promised by the cloud computing paradigm, whilst enabling the benefits of cross-cloud resource utilization to be realized. The practical efficacy of the architecture is demonstrated through an implementation utilizing compute resources managed through different interfaces on the Amazon Elastic Compute Cloud (EC2) service. Additionally, we provide empirical results highlighting the performance differential of these different interfaces, and discuss the impact of this performance differential on efficiency and profitability.
NASA Astrophysics Data System (ADS)
Nishiura, Daisuke; Furuichi, Mikito; Sakaguchi, Hide
2015-09-01
The computational performance of a smoothed particle hydrodynamics (SPH) simulation is investigated for three types of current shared-memory parallel computer devices: many integrated core (MIC) processors, graphics processing units (GPUs), and multi-core CPUs. We are especially interested in efficient shared-memory allocation methods for each chipset, because the efficient data access patterns differ between compute unified device architecture (CUDA) programming for GPUs and OpenMP programming for MIC processors and multi-core CPUs. We first introduce several parallel implementation techniques for the SPH code, and then examine these on our target computer architectures to determine the most effective algorithms for each processor unit. In addition, we evaluate the effective computing performance and power efficiency of the SPH simulation on each architecture, as these are critical metrics for overall performance in a multi-device environment. In our benchmark test, the GPU is found to produce the best arithmetic performance as a standalone device unit, and gives the most efficient power consumption. The multi-core CPU obtains the most effective computing performance. The computational speed of the MIC processor on Xeon Phi approached that of two Xeon CPUs. This indicates that using MICs is an attractive choice for existing SPH codes on multi-core CPUs parallelized by OpenMP, as it gains computational acceleration without the need for significant changes to the source code.
ERIC Educational Resources Information Center
Amenyo, John-Thones
2012-01-01
Carefully engineered playable games can serve as vehicles for students and practitioners to learn and explore the programming of advanced computer architectures to execute applications, such as high performance computing (HPC) and complex, inter-networked, distributed systems. The article presents families of playable games that are grounded in…
Three-Dimensional Nanobiocomputing Architectures With Neuronal Hypercells
2007-06-01
...Neumann architectures, and CMOS fabrication. Novel solutions of massive parallel distributed computing and processing (pipelined due to systolic...) and processing platforms utilizing molecular hardware within an enabling organization and architecture. The design technology is based on utilizing a... ...Microsystems and Nanotechnologies investigated a novel 3D3 (Hardware Software Nanotechnology) technology to design super-high performance computing...
NASA Astrophysics Data System (ADS)
Tramm, John R.; Gunow, Geoffrey; He, Tim; Smith, Kord S.; Forget, Benoit; Siegel, Andrew R.
2016-05-01
In this study we present and analyze a formulation of the 3D Method of Characteristics (MOC) technique applied to the simulation of full core nuclear reactors. Key features of the algorithm include a task-based parallelism model that allows independent MOC tracks to be assigned to threads dynamically, ensuring load balancing, and a wide vectorizable inner loop that takes advantage of modern SIMD computer architectures. The algorithm is implemented in a set of highly optimized proxy applications in order to investigate its performance characteristics on CPU, GPU, and Intel Xeon Phi architectures. Speed, power, and hardware cost efficiencies are compared. Additionally, performance bottlenecks are identified for each architecture in order to determine the prospects for continued scalability of the algorithm on next generation HPC architectures.
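The "wide vectorizable inner loop" refers to applying the same attenuation update to every energy group of a track segment at once. The sketch below shows a standard flat-source MOC segment update vectorized over groups with NumPy; the variable names and sizes are illustrative assumptions, and this is not the proxy applications' code.

```python
import numpy as np

def sweep_segment(psi_in, sigma_t, source, length):
    """Attenuate angular flux along one characteristic segment, vectorized over
    energy groups: the kind of wide, SIMD-friendly inner loop described above.
    Standard flat-source MOC update; names are illustrative.
    """
    tau = sigma_t * length                               # optical thickness per group
    delta = (psi_in - source / sigma_t) * (1.0 - np.exp(-tau))
    return psi_in - delta                                # outgoing angular flux per group

groups = 64
rng = np.random.default_rng(0)
psi = rng.random(groups)           # incoming angular flux
sigma_t = 0.5 + rng.random(groups) # total cross sections
q = rng.random(groups)             # flat source in the region
print(sweep_segment(psi, sigma_t, q, length=0.3)[:4])
```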
Performance Analysis of Distributed Object-Oriented Applications
NASA Technical Reports Server (NTRS)
Schoeffler, James D.
1998-01-01
The purpose of this research was to evaluate the efficiency of a distributed simulation architecture which creates individual modules which are made self-scheduling through the use of a message-based communication system used for requesting input data from another module which is the source of that data. To make the architecture as general as possible, the message-based communication architecture was implemented using standard remote object architectures (Common Object Request Broker Architecture (CORBA) and/or Distributed Component Object Model (DCOM)). A series of experiments were run in which different systems are distributed in a variety of ways across multiple computers and the performance evaluated. The experiments were duplicated in each case so that the overhead due to message communication and data transmission can be separated from the time required to actually perform the computational update of a module each iteration. The software used to distribute the modules across multiple computers was developed in the first year of the current grant and was modified considerably to add a message-based communication scheme supported by the DCOM distributed object architecture. The resulting performance was analyzed using a model created during the first year of this grant which predicts the overhead due to CORBA and DCOM remote procedure calls and includes the effects of data passed to and from the remote objects. A report covering the distributed simulation software and the results of the performance experiments has been submitted separately. The above report also discusses possible future work to apply the methodology to dynamically distribute the simulation modules so as to minimize overall computation time.
Generic Divide and Conquer Internet-Based Computing
NASA Technical Reports Server (NTRS)
Follen, Gregory J. (Technical Monitor); Radenski, Atanas
2003-01-01
The growth of Internet-based applications and the proliferation of networking technologies have been transforming traditional commercial application areas as well as computer and computational sciences and engineering. This growth stimulates the exploration of Peer to Peer (P2P) software technologies that can open new research and application opportunities not only for the commercial world, but also for the scientific and high-performance computing applications community. The general goal of this project is to achieve better understanding of the transition to Internet-based high-performance computing and to develop solutions for some of the technical challenges of this transition. In particular, we are interested in creating long-term motivation for end users to provide their idle processor time to support computationally intensive tasks. We believe that a practical P2P architecture should provide useful service to both clients with high-performance computing needs and contributors of lower-end computing resources. To achieve this, we are designing a dual-service architecture for P2P high-performance divide-and-conquer computing; we are also experimenting with a prototype implementation. Our proposed architecture incorporates a master server, utilizes dual satellite servers, and operates on the Internet in a dynamically changing large configuration of lower-end nodes provided by volunteer contributors. A dual satellite server comprises a high-performance computing engine and a lower-end contributor service engine. The computing engine provides generic support for divide and conquer computations. The service engine is intended to provide free useful HTTP-based services to contributors of lower-end computing resources. Our proposed architecture is complementary to and accessible from computational grids, such as Globus, Legion, and Condor. Grids provide remote access to existing higher-end computing resources; in contrast, our goal is to utilize idle processor time of lower-end Internet nodes. Our project is focused on a generic divide-and-conquer paradigm and on mobile applications of this paradigm that can operate on a loose and ever changing pool of lower-end Internet nodes.
Efficient architecture for spike sorting in reconfigurable hardware.
Hwang, Wen-Jyi; Lee, Wei-Hao; Lin, Shiow-Jyu; Lai, Sheng-Ying
2013-11-01
This paper presents a novel hardware architecture for fast spike sorting. The architecture is able to perform both the feature extraction and clustering in hardware. The generalized Hebbian algorithm (GHA) and fuzzy C-means (FCM) algorithm are used for feature extraction and clustering, respectively. The employment of GHA allows efficient computation of principal components for subsequent clustering operations. The FCM is able to achieve near optimal clustering for spike sorting. Its performance is insensitive to the selection of initial cluster centers. The hardware implementations of GHA and FCM feature low area costs and high throughput. In the GHA architecture, the computation of different weight vectors shares the same circuit to lower the area costs. Moreover, in the FCM hardware implementation, the usual iterative operations for updating the membership matrix and cluster centroid are merged into one single updating process to avoid the large storage requirement. To show the effectiveness of the circuit, the proposed architecture is physically implemented on a field programmable gate array (FPGA). It is embedded in a System-on-Chip (SOC) platform for performance measurement. Experimental results show that the proposed architecture is an efficient spike sorting design attaining a high classification correct rate and high-speed computation.
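For readers unfamiliar with GHA, it is an online Hebbian update (Sanger's rule) whose weight rows converge toward the leading principal components, which is why it is a natural front end for the clustering step. Below is a small NumPy sketch of the rule on synthetic, made-up "waveforms"; it illustrates only the algorithm, not the paper's fixed-point hardware or its FCM stage.

```python
import numpy as np

def gha_train(X, n_components, lr=0.01, epochs=50, seed=0):
    """Generalized Hebbian algorithm (Sanger's rule) for online PCA.

    X: (n_samples, n_features) array of zero-mean waveforms. Returns W of shape
    (n_components, n_features) approximating the leading principal directions;
    the projected features would then be clustered (by FCM in the paper).
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(n_components, X.shape[1]))
    for _ in range(epochs):
        for x in X:
            y = W @ x
            # Sanger's rule: Hebbian term minus lower-triangular decorrelation term
            W += lr * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)
    return W

# Synthetic "waveforms": two latent directions plus noise (hypothetical data)
rng = np.random.default_rng(1)
latent = rng.normal(size=(500, 2)) * np.array([3.0, 1.5])
basis = rng.normal(size=(2, 32)) / np.sqrt(32)
X = latent @ basis + 0.1 * rng.normal(size=(500, 32))
X -= X.mean(axis=0)

W = gha_train(X, n_components=2)
features = X @ W.T          # 2D features per waveform, ready for clustering
print(features.shape)       # (500, 2)
```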
Computing architecture for autonomous microgrids
Goldsmith, Steven Y.
2015-09-29
A computing architecture that facilitates autonomously controlling operations of a microgrid is described herein. A microgrid network includes numerous computing devices that execute intelligent agents, each of which is assigned to a particular entity (load, source, storage device, or switch) in the microgrid. The intelligent agents can execute in accordance with predefined protocols to collectively perform computations that facilitate uninterrupted control of the microgrid.
Electromagnetic physics models for parallel computing architectures
Amadio, G.; Ananya, A.; Apostolakis, J.; ...
2016-11-21
The recent emergence of hardware architectures characterized by many-core or accelerated processors has opened new opportunities for concurrent programming models taking advantage of both SIMD and SIMT architectures. GeantV, a next generation detector simulation, has been designed to exploit both the vector capability of mainstream CPUs and the multi-threading capabilities of coprocessors including NVidia GPUs and Intel Xeon Phi. The characteristics of these architectures are very different in terms of the vectorization depth and type of parallelization needed to achieve optimal performance. In this paper we describe the implementation of electromagnetic physics models developed for parallel computing architectures as a part of the GeantV project. Finally, the results of preliminary performance evaluation and physics validation are presented as well.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zhang, Yao; Balaprakash, Prasanna; Meng, Jiayuan
We present Raexplore, a performance modeling framework for architecture exploration. Raexplore enables rapid, automated, and systematic search of the architecture design space by combining hardware counter-based performance characterization and analytical performance modeling. We demonstrate Raexplore for two recent manycore processors, the IBM Blue Gene/Q compute chip and the Intel Xeon Phi, targeting a set of scientific applications. Our framework is able to capture complex interactions between architectural components including the instruction pipeline, cache, and memory, and to achieve a 3–22% error for same-architecture and cross-architecture performance predictions. Furthermore, we apply our framework to assess the two processors, and discover and evaluate a list of architectural scaling options for future processor designs.
Benchmarking high performance computing architectures with CMS’ skeleton framework
Sexton-Kennedy, E.; Gartung, P.; Jones, C. D.
2017-11-23
Here, in 2012 CMS evaluated which underlying concurrency technology would be the best to use for its multi-threaded framework. The available technologies were evaluated on the high throughput computing systems dominating the resources in use at that time. A skeleton framework benchmarking suite that emulates the tasks performed within a CMSSW application was used to select Intel’s Thread Building Block library, based on the measured overheads in both memory and CPU on the different technologies benchmarked. In 2016 CMS will get access to high performance computing resources that use new many-core architectures: machines such as Cori Phase 1 and 2, Theta, and Mira. Because of this we have revived the 2012 benchmark to test its performance and conclusions on these new architectures. This talk will discuss the results of this exercise.
NASA Astrophysics Data System (ADS)
Jiang, Yuning; Kang, Jinfeng; Wang, Xinan
2017-03-01
Resistive switching memory (RRAM) is considered as one of the most promising devices for parallel computing solutions that may overcome the von Neumann bottleneck of today’s electronic systems. However, the existing RRAM-based parallel computing architectures suffer from practical problems such as device variations and extra computing circuits. In this work, we propose a novel parallel computing architecture for pattern recognition by implementing k-nearest neighbor classification on metal-oxide RRAM crossbar arrays. Metal-oxide RRAM with gradual RESET behaviors is chosen as both the storage and computing components. The proposed architecture is tested by the MNIST database. High speed (~100 ns per example) and high recognition accuracy (97.05%) are obtained. The influence of several non-ideal device properties is also discussed, and it turns out that the proposed architecture shows great tolerance to device variations. This work paves a new way to achieve RRAM-based parallel computing hardware systems with high performance.
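As a rough software analogue, the crossbar can be modeled as an analog dot-product engine: stored patterns become conductance rows and a query is applied as input voltages. The sketch below is a hedged illustration under that assumption (dot-product similarity plus majority vote); it does not reproduce the paper's gradual-RESET programming scheme or peripheral circuits.

```python
import numpy as np

def crossbar_knn_predict(G, labels, v, k=5):
    """k-NN classification with a crossbar modeled as a dot-product engine.

    G      : (num_stored, num_inputs) conductance matrix, one row per stored example
    labels : (num_stored,) nonnegative integer class labels of the stored examples
    v      : (num_inputs,) input voltages encoding the query pattern
    """
    currents = G @ v                          # analog multiply-accumulate: one current per stored row
    nearest = np.argsort(currents)[-k:]       # k largest responses ~ k most similar patterns
    votes = np.bincount(labels[nearest])
    return int(np.argmax(votes))

# toy usage: 100 stored 16-dimensional patterns from 3 classes
rng = np.random.default_rng(0)
G = rng.random((100, 16))
labels = rng.integers(0, 3, size=100)
print(crossbar_knn_predict(G, labels, rng.random(16), k=7))
```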
Distributed computing environments for future space control systems
NASA Technical Reports Server (NTRS)
Viallefont, Pierre
1993-01-01
The aim of this paper is to present the results of a CNES research project on distributed computing systems. The purpose of this research was to study the impact of the use of new computer technologies in the design and development of future space applications. The first part of this study was a state-of-the-art review of distributed computing systems. One of the interesting ideas arising from this review is the concept of a 'virtual computer' allowing the distributed hardware architecture to be hidden from a software application. The 'virtual computer' can improve system performance by adapting the best architecture (addition of computers) to the software application without having to modify its source code. This concept can also decrease the cost and obsolescence of the hardware architecture. In order to verify the feasibility of the 'virtual computer' concept, a prototype representative of a distributed space application is being developed independently of the hardware architecture.
An Evaluation of Architectural Platforms for Parallel Navier-Stokes Computations
NASA Technical Reports Server (NTRS)
Jayasimha, D. N.; Hayder, M. E.; Pillay, S. K.
1996-01-01
We study the computational, communication, and scalability characteristics of a computational fluid dynamics application, which solves the time accurate flow field of a jet using the compressible Navier-Stokes equations, on a variety of parallel architecture platforms. The platforms chosen for this study are a cluster of workstations (the LACE experimental testbed at NASA Lewis), a shared memory multiprocessor (the Cray YMP), and distributed memory multiprocessors with different topologies - the IBM SP and the Cray T3D. We investigate the impact of various networks connecting the cluster of workstations on the performance of the application and the overheads induced by popular message passing libraries used for parallelization. The work also highlights the importance of matching the memory bandwidth to the processor speed for good single processor performance. By studying the performance of an application on a variety of architectures, we are able to point out the strengths and weaknesses of each of the example computing platforms.
Parallelizing Navier-Stokes Computations on a Variety of Architectural Platforms
NASA Technical Reports Server (NTRS)
Jayasimha, D. N.; Hayder, M. E.; Pillay, S. K.
1997-01-01
We study the computational, communication, and scalability characteristics of a Computational Fluid Dynamics application, which solves the time accurate flow field of a jet using the compressible Navier-Stokes equations, on a variety of parallel architectural platforms. The platforms chosen for this study are a cluster of workstations (the LACE experimental testbed at NASA Lewis), a shared memory multiprocessor (the Cray YMP), distributed memory multiprocessors with different topologies-the IBM SP and the Cray T3D. We investigate the impact of various networks, connecting the cluster of workstations, on the performance of the application and the overheads induced by popular message passing libraries used for parallelization. The work also highlights the importance of matching the memory bandwidth to the processor speed for good single processor performance. By studying the performance of an application on a variety of architectures, we are able to point out the strengths and weaknesses of each of the example computing platforms.
Developing Information Power Grid Based Algorithms and Software
NASA Technical Reports Server (NTRS)
Dongarra, Jack
1998-01-01
This exploratory study initiated our effort to understand performance modeling on parallel systems. The basic goal of performance modeling is to understand and predict the performance of a computer program or set of programs on a computer system. Performance modeling has numerous applications, including evaluation of algorithms, optimization of code implementations, parallel library development, comparison of system architectures, parallel system design, and procurement of new systems. Our work lays the basis for the construction of parallel libraries that allow for the reconstruction of application codes on several distinct architectures so as to assure performance portability. Following our strategy, once the requirements of applications are well understood, one can then construct a library in a layered fashion. The top level of this library will consist of architecture-independent geometric, numerical, and symbolic algorithms that are needed by the sample of applications. These routines should be written in a language that is portable across the targeted architectures.
Exploring Gigabyte Datasets in Real Time: Architectures, Interfaces and Time-Critical Design
NASA Technical Reports Server (NTRS)
Bryson, Steve; Gerald-Yamasaki, Michael (Technical Monitor)
1998-01-01
Architectures and Interfaces: The implications of real-time interaction on software architecture design: decoupling of interaction/graphics and computation into asynchronous processes. The performance requirements of graphics and computation for interaction. Time management in such an architecture. Examples of how visualization algorithms must be modified for high performance. Brief survey of interaction techniques and design, including direct manipulation and manipulation via widgets. The talk discusses how human factors considerations drove the design and implementation of the virtual wind tunnel. Time-Critical Design: A survey of time-critical techniques for both computation and rendering. Emphasis on the assignment of a time budget to both the overall visualization environment and to each individual visualization technique in the environment. The estimation of the benefit and cost of an individual technique. Examples of the modification of visualization algorithms to allow time-critical control.
Design and optimization of a portable LQCD Monte Carlo code using OpenACC
NASA Astrophysics Data System (ADS)
Bonati, Claudio; Coscetti, Simone; D'Elia, Massimo; Mesiti, Michele; Negro, Francesco; Calore, Enrico; Schifano, Sebastiano Fabio; Silvi, Giorgio; Tripiccione, Raffaele
The present panorama of HPC architectures is extremely heterogeneous, ranging from traditional multi-core CPU processors, supporting a wide class of applications but delivering moderate computing performance, to many-core Graphics Processing Units (GPUs), exploiting aggressive data-parallelism and delivering higher performance for streaming computing applications. In this scenario, code portability (and performance portability) becomes necessary for easy maintainability of applications; this is very relevant in scientific computing, where code changes are very frequent, making it tedious and error-prone to keep different code versions aligned. In this work, we present the design and optimization of a state-of-the-art production-level LQCD Monte Carlo application, using the directive-based OpenACC programming model. OpenACC abstracts parallel programming to a descriptive level, relieving programmers from specifying how codes should be mapped onto the target architecture. We describe the implementation of a code fully written in OpenACC, and show that we are able to target several different architectures, including state-of-the-art traditional CPUs and GPUs, with the same code. We also measure performance, evaluating the computing efficiency of our OpenACC code on several architectures, comparing with GPU-specific implementations and showing that a good level of performance portability can be reached.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Not Available
An account of the Caltech Concurrent Computation Program (C³P), a five-year project that focused on answering the question: Can parallel computers be used to do large-scale scientific computations? As the title indicates, the question is answered in the affirmative, by implementing numerous scientific applications on real parallel computers and doing computations that produced new scientific results. In the process of doing so, C³P helped design and build several new computers, designed and implemented basic system software, developed algorithms for frequently used mathematical computations on massively parallel machines, devised performance models and measured the performance of many computers, and created a high performance computing facility based exclusively on parallel computers. While the initial focus of C³P was the hypercube architecture developed by C. Seitz, many of the methods developed and lessons learned have been applied successfully on other massively parallel architectures.
Technology advances and market forces: Their impact on high performance architectures
NASA Technical Reports Server (NTRS)
Best, D. R.
1978-01-01
Reasonable projections into future supercomputer architectures and technology require an analysis of the computer industry market environment, the current capabilities and trends within the component industry, and the research activities on computer architecture in the industrial and academic communities. Management, programmer, architect, and user must cooperate to increase the efficiency of supercomputer development efforts. Care must be taken to match the funding, compiler, architecture and application with greater attention to testability, maintainability, reliability, and usability than supercomputer development programs of the past.
FPGA Implementation of Generalized Hebbian Algorithm for Texture Classification
Lin, Shiow-Jyu; Hwang, Wen-Jyi; Lee, Wei-Hao
2012-01-01
This paper presents a novel hardware architecture for principal component analysis. The architecture is based on the Generalized Hebbian Algorithm (GHA) because of its simplicity and effectiveness. The architecture is separated into three portions: the weight vector updating unit, the principal computation unit and the memory unit. In the weight vector updating unit, the computation of different synaptic weight vectors shares the same circuit to reduce the area costs. To show the effectiveness of the circuit, a texture classification system based on the proposed architecture is physically implemented by Field Programmable Gate Array (FPGA). It is embedded in a System-On-Programmable-Chip (SOPC) platform for performance measurement. Experimental results show that the proposed architecture is an efficient design for attaining both high speed performance and low area costs. PMID:22778640
1983-12-30
Architecture, Design, and System Performance Assessment and Development Methodology (NSWC TR 83-324), Naval Surface Weapons Center, Silver Spring, MD.
Efficient Phase Unwrapping Architecture for Digital Holographic Microscopy
Hwang, Wen-Jyi; Cheng, Shih-Chang; Cheng, Chau-Jern
2011-01-01
This paper presents a novel phase unwrapping architecture for accelerating the computational speed of digital holographic microscopy (DHM). A fast Fourier transform (FFT) based phase unwrapping algorithm providing a minimum squared error solution is adopted for hardware implementation because of its simplicity and robustness to noise. The proposed architecture is realized in a pipeline fashion to maximize throughput of the computation. Moreover, the number of hardware multipliers and dividers are minimized to reduce the hardware costs. The proposed architecture is used as a custom user logic in a system on programmable chip (SOPC) for physical performance measurement. Experimental results reveal that the proposed architecture is effective for expediting the computational speed while consuming low hardware resources for designing an embedded DHM system. PMID:22163688
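A common FFT/DCT formulation of minimum-squared-error (least-squares) phase unwrapping is the Ghiglia–Romero algorithm; the NumPy/SciPy sketch below follows that standard formulation as a software reference. It is only assumed to match the adopted algorithm in spirit and does not reproduce the pipelined hardware described above.

```python
import numpy as np
from scipy.fft import dctn, idctn

def wrap(p):
    """Wrap values into (-pi, pi]."""
    return (p + np.pi) % (2 * np.pi) - np.pi

def unwrap_lsq(psi):
    """Unweighted least-squares phase unwrapping of a wrapped phase map psi (M x N)."""
    M, N = psi.shape
    # wrapped forward differences
    dx = np.zeros_like(psi); dx[:-1, :] = wrap(np.diff(psi, axis=0))
    dy = np.zeros_like(psi); dy[:, :-1] = wrap(np.diff(psi, axis=1))
    # divergence of the wrapped gradient (right-hand side of the Poisson equation)
    rho = (dx - np.vstack([np.zeros((1, N)), dx[:-1, :]])
           + dy - np.hstack([np.zeros((M, 1)), dy[:, :-1]]))
    # solve the Neumann-boundary Poisson problem in the DCT domain
    m = np.arange(M)[:, None]; n = np.arange(N)[None, :]
    denom = 2 * np.cos(np.pi * m / M) + 2 * np.cos(np.pi * n / N) - 4
    phi_hat = dctn(rho, norm='ortho')
    phi_hat[0, 0] = 0.0
    denom[0, 0] = 1.0            # avoid division by zero at the DC term
    return idctn(phi_hat / denom, norm='ortho')
```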
Prospective Architectures for Onboard vs Cloud-Based Decision Making for Unmanned Aerial Systems
NASA Technical Reports Server (NTRS)
Sankararaman, Shankar; Teubert, Christopher
2017-01-01
This paper investigates prospective architectures for decision-making in unmanned aerial systems. When these unmanned vehicles operate in urban environments, there are several sources of uncertainty that affect their behavior, and decision-making algorithms need to be robust to account for these different sources of uncertainty. It is also important to account for the several risk factors that affect the flight of these unmanned systems and to facilitate decision-making that takes these risk factors into consideration. In addition, there are several technical challenges related to autonomous flight of unmanned aerial systems; these challenges include sensing, obstacle detection, path planning and navigation, trajectory generation and selection, etc. Many of these activities require significant computational power, and in many situations all of these activities need to be performed in real-time. In order to efficiently integrate these activities, it is important to develop a systematic architecture that can facilitate real-time decision-making. Four prospective architectures are discussed in this paper; on one end of the spectrum, the first architecture considers all activities/computations being performed onboard the vehicle, whereas on the other end of the spectrum, the fourth and final architecture considers all activities/computations being performed in the cloud, using a new service known as Prognostics as a Service that is being developed at NASA Ames Research Center. The four different architectures are compared, their advantages and disadvantages are explained, and conclusions are presented.
Job Superscheduler Architecture and Performance in Computational Grid Environments
NASA Technical Reports Server (NTRS)
Shan, Hongzhang; Oliker, Leonid; Biswas, Rupak
2003-01-01
Computational grids hold great promise in utilizing geographically separated heterogeneous resources to solve large-scale complex scientific problems. However, a number of major technical hurdles, including distributed resource management and effective job scheduling, stand in the way of realizing these gains. In this paper, we propose a novel grid superscheduler architecture and three distributed job migration algorithms. We also model the critical interaction between the superscheduler and autonomous local schedulers. Extensive performance comparisons with ideal, central, and local schemes using real workloads from leading computational centers are conducted in a simulation environment. Additionally, synthetic workloads are used to perform a detailed sensitivity analysis of our superscheduler. Several key metrics demonstrate that substantial performance gains can be achieved via smart superscheduling in distributed computational grids.
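The abstract does not detail the three job migration algorithms, so the sketch below shows only one plausible cost-model heuristic a superscheduler might apply per job (estimated transfer time plus queue wait plus scaled runtime). All names and parameters are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Site:
    name: str
    queue_wait: float      # estimated local queue wait (s)
    runtime_scale: float   # relative runtime factor vs. a reference machine
    transfer_cost: float   # estimated input-staging time from the submit site (s)

def choose_site(sites, base_runtime, submit_site):
    """Pick the site minimizing estimated turnaround for one job.

    Local execution avoids the transfer cost; remote sites pay it. This is a
    simple illustrative heuristic, not one of the paper's three algorithms.
    """
    def turnaround(s):
        transfer = 0.0 if s is submit_site else s.transfer_cost
        return transfer + s.queue_wait + base_runtime * s.runtime_scale
    return min(sites, key=turnaround)

# usage: migrate only if the remote site beats the local one despite staging costs
local = Site("local", queue_wait=3600, runtime_scale=1.0, transfer_cost=0.0)
remote = Site("remote", queue_wait=300, runtime_scale=1.2, transfer_cost=600)
best = choose_site([local, remote], base_runtime=1800, submit_site=local)
```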
NASA Technical Reports Server (NTRS)
Fijany, Amir (Inventor); Bejczy, Antal K. (Inventor)
1993-01-01
This is a real-time robotic controller and simulator (RRCS): a MIMD-SIMD parallel architecture for interfacing with an external host computer and providing a high degree of parallelism in computations for robotic control and simulation. It includes a host processor for receiving instructions from the external host computer and for transmitting answers to the external host computer. There are a plurality of SIMD microprocessors, each SIMD processor being a SIMD parallel processor capable of exploiting fine grain parallelism and further being able to operate asynchronously to form a MIMD architecture. Each SIMD processor comprises a SIMD architecture capable of performing two matrix-vector operations in parallel while fully exploiting parallelism in each operation. There is a system bus connecting the host processor to the plurality of SIMD microprocessors and a common clock providing a continuous sequence of clock pulses. There is also a ring structure interconnecting the plurality of SIMD microprocessors and connected to the clock for providing the clock pulses to the SIMD microprocessors and for providing a path for the flow of data and instructions between the SIMD microprocessors. The host processor includes logic for controlling the RRCS by interpreting instructions sent by the external host computer, decomposing the instructions into a series of computations to be performed by the SIMD microprocessors, using the system bus to distribute associated data among the SIMD microprocessors, and initiating activity of the SIMD microprocessors to perform the computations on the data by procedure call.
Multiprocessor architecture: Synthesis and evaluation
NASA Technical Reports Server (NTRS)
Standley, Hilda M.
1990-01-01
Multiprocessor computer architecture evaluation for structural computations is the focus of the research effort described. Results obtained are expected to lead to more efficient use of existing architectures and to suggest designs for new, application-specific architectures. The brief descriptions given outline a number of related efforts directed toward this purpose. The difficulty in analyzing an existing architecture or in designing a new computer architecture lies in the fact that the performance of a particular architecture, within the context of a given application, is determined by a number of factors. These include, but are not limited to, the efficiency of the computation algorithm, the programming language and support environment, the quality of the program written in the programming language, the multiplicity of the processing elements, the characteristics of the individual processing elements, the interconnection network connecting processors and non-local memories, and the shared memory organization, covering the spectrum from no shared memory (all local memory) to one global access memory. These performance determiners may be loosely classified as being software or hardware related, although this distinction is not clear or even appropriate in many cases. The effect of the choice of algorithm is ignored by assuming that the algorithm is specified as given. Effort directed toward removing the effect of the programming language and program resulted in the design of a high-level parallel programming language. Two characteristics of the fundamental structure of the architecture (memory organization and interconnection network) are examined.
Performances of multiprocessor multidisk architectures for continuous media storage
NASA Astrophysics Data System (ADS)
Gennart, Benoit A.; Messerli, Vincent; Hersch, Roger D.
1996-03-01
Multimedia interfaces increase the need for large image databases capable of storing and reading streams of data with strict synchronicity and isochronicity requirements. In order to fulfill these requirements, we consider a parallel image server architecture which relies on arrays of intelligent disk nodes, each disk node being composed of one processor and one or more disks. This contribution analyzes, through bottleneck performance evaluation and simulation, the behavior of two multi-processor multi-disk architectures: a point-to-point architecture and a shared-bus architecture similar to current multiprocessor workstation architectures. We compare the two architectures on the basis of two multimedia algorithms: the compute-bound frame resizing by resampling and the data-bound disk-to-client stream transfer. The results suggest that the shared bus is a potential bottleneck despite its very high hardware throughput (400 Mbytes/s) and that an architecture with addressable local memories located close to their respective processors could partially remove this bottleneck. The point-to-point architecture is scalable and able to sustain high throughputs for simultaneous compute-bound and data-bound operations.
OS friendly microprocessor architecture: Hardware level computer security
NASA Astrophysics Data System (ADS)
Jungwirth, Patrick; La Fratta, Patrick
2016-05-01
We present an introduction to the patented OS Friendly Microprocessor Architecture (OSFA) and hardware level computer security. Conventional microprocessors have not tried to balance hardware performance and OS performance at the same time. Conventional microprocessors have depended on the Operating System for computer security and information assurance. The goal of the OS Friendly Architecture is to provide a high performance and secure microprocessor and OS system. We are interested in cyber security, information technology (IT), and SCADA control professionals reviewing the hardware level security features. The OS Friendly Architecture is a switched set of cache memory banks in a pipeline configuration. For light-weight threads, the memory pipeline configuration provides near instantaneous context switching times. The pipelining and parallelism provided by the cache memory pipeline provides for background cache read and write operations while the microprocessor's execution pipeline is running instructions. The cache bank selection controllers provide arbitration to prevent the memory pipeline and microprocessor's execution pipeline from accessing the same cache bank at the same time. This separation allows the cache memory pages to transfer to and from level 1 (L1) caching while the microprocessor pipeline is executing instructions. Computer security operations are implemented in hardware. By extending Unix file permissions bits to each cache memory bank and memory address, the OSFA provides hardware level computer security.
An S N Algorithm for Modern Architectures
DOE Office of Scientific and Technical Information (OSTI.GOV)
Baker, Randal Scott
2016-08-29
LANL discrete ordinates transport packages are required to perform large, computationally intensive time-dependent calculations on massively parallel architectures, where even a single such calculation may need many months to complete. While KBA methods scale out well to very large numbers of compute nodes, we are limited by practical constraints on the number of such nodes we can actually apply to any given calculation. Instead, we describe a modified KBA algorithm that allows realization of the reductions in solution time offered by both the current, and future, architectural changes within a compute node.
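To illustrate the wavefront dependence that KBA-style sweeps exploit, here is a minimal serial NumPy sketch of a single-angle 2-D discrete-ordinates sweep using first-order upwind (step) differencing. The closure, coefficients, and boundary treatment are illustrative assumptions; this is not the LANL production algorithm.

```python
import numpy as np

def sweep_2d(source, sigma_t, dx=1.0, dy=1.0, mu=0.5, eta=0.5):
    """Single-angle 2-D discrete-ordinates sweep with step differencing.

    Cells on the same anti-diagonal (i + j = const) are mutually independent, so
    each diagonal can be processed in parallel; successive diagonals form the
    KBA-style wavefront. Zero inflow boundary conditions are assumed.
    """
    nx, ny = source.shape
    psi = np.zeros((nx + 1, ny + 1))          # psi[i+1, j+1] holds the flux for cell (i, j)
    for d in range(nx + ny - 1):              # wavefront (anti-diagonal) index
        i = np.arange(max(0, d - ny + 1), min(nx, d + 1))
        j = d - i                             # all cells with i + j == d
        num = source[i, j] + (mu / dx) * psi[i, j + 1] + (eta / dy) * psi[i + 1, j]
        den = sigma_t[i, j] + mu / dx + eta / dy
        psi[i + 1, j + 1] = num / den
    return psi[1:, 1:]

# toy usage on a 64x64 grid with a uniform source and cross section
flux = sweep_2d(np.ones((64, 64)), np.full((64, 64), 0.5))
```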
NASA Technical Reports Server (NTRS)
Rickard, D. A.; Bodenheimer, R. E.
1976-01-01
Digital computer components which perform two dimensional array logic operations (Tse logic) on binary data arrays are described. The properties of Golay transforms which make them useful in image processing are reviewed, and several architectures for Golay transform processors are presented with emphasis on the skeletonizing algorithm. Conventional logic control units developed for the Golay transform processors are described. One is a unique microprogrammable control unit that uses a microprocessor to control the Tse computer. The remaining control units are based on programmable logic arrays. Performance criteria are established and utilized to compare the various Golay transform machines developed. A critique of Tse logic is presented, and recommendations for additional research are included.
Power and Performance Trade-offs for Space Time Adaptive Processing
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gawande, Nitin A.; Manzano Franco, Joseph B.; Tumeo, Antonino
Computational efficiency – performance relative to power or energy – is one of the most important concerns when designing RADAR processing systems. This paper analyzes power and performance trade-offs for a typical Space Time Adaptive Processing (STAP) application. We study STAP implementations for CUDA and OpenMP on two computationally efficient architectures, the Intel Haswell Core i7-4770TE and the NVIDIA Kayla platform with a GK208 GPU. We analyze the power and performance of STAP's computationally intensive kernels across the two hardware testbeds. We also show the impact and trade-offs of GPU optimization techniques. We show that data parallelism can be exploited for efficient implementation on the Haswell CPU architecture. The GPU architecture is able to process large data sets without an increase in power requirement. The use of shared memory has a significant impact on the power requirement for the GPU. A balance between the use of shared memory and main memory access leads to improved performance in a typical STAP application.
Chen, Qingkui; Zhao, Deyu; Wang, Jingjuan
2017-01-01
This paper aims to develop a low-cost, high-performance and high-reliability computing system to process large-scale data using common data mining algorithms in the Internet of Things (IoT) computing environment. Considering the characteristics of IoT data processing, similar to mainstream high performance computing, we use a GPU (Graphics Processing Unit) cluster to achieve better IoT services. Firstly, we present an energy consumption calculation method (ECCM) based on WSNs. Then, using the CUDA (Compute Unified Device Architecture) Programming model, we propose a Two-level Parallel Optimization Model (TLPOM) which exploits reasonable resource planning and common compiler optimization techniques to obtain the best blocks and threads configuration considering the resource constraints of each node. The key to this part is dynamic coupling Thread-Level Parallelism (TLP) and Instruction-Level Parallelism (ILP) to improve the performance of the algorithms without additional energy consumption. Finally, combining the ECCM and the TLPOM, we use the Reliable GPU Cluster Architecture (RGCA) to obtain a high-reliability computing system considering the nodes’ diversity, algorithm characteristics, etc. The results show that the performance of the algorithms significantly increased by 34.1%, 33.96% and 24.07% for Fermi, Kepler and Maxwell on average with TLPOM and the RGCA ensures that our IoT computing system provides low-cost and high-reliability services. PMID:28777325
Fang, Yuling; Chen, Qingkui; Xiong, Neal N; Zhao, Deyu; Wang, Jingjuan
2017-08-04
This paper aims to develop a low-cost, high-performance and high-reliability computing system to process large-scale data using common data mining algorithms in the Internet of Things (IoT) computing environment. Considering the characteristics of IoT data processing, similar to mainstream high performance computing, we use a GPU (Graphics Processing Unit) cluster to achieve better IoT services. Firstly, we present an energy consumption calculation method (ECCM) based on WSNs. Then, using the CUDA (Compute Unified Device Architecture) Programming model, we propose a Two-level Parallel Optimization Model (TLPOM) which exploits reasonable resource planning and common compiler optimization techniques to obtain the best blocks and threads configuration considering the resource constraints of each node. The key to this part is dynamic coupling Thread-Level Parallelism (TLP) and Instruction-Level Parallelism (ILP) to improve the performance of the algorithms without additional energy consumption. Finally, combining the ECCM and the TLPOM, we use the Reliable GPU Cluster Architecture (RGCA) to obtain a high-reliability computing system considering the nodes' diversity, algorithm characteristics, etc. The results show that the performance of the algorithms significantly increased by 34.1%, 33.96% and 24.07% for Fermi, Kepler and Maxwell on average with TLPOM and the RGCA ensures that our IoT computing system provides low-cost and high-reliability services.
Hybrid parallel computing architecture for multiview phase shifting
NASA Astrophysics Data System (ADS)
Zhong, Kai; Li, Zhongwei; Zhou, Xiaohui; Shi, Yusheng; Wang, Congjun
2014-11-01
The multiview phase-shifting method shows its powerful capability in achieving high resolution three-dimensional (3-D) shape measurement. Unfortunately, this ability results in very high computation costs, and 3-D computations have to be processed offline. To realize real-time 3-D shape measurement, a hybrid parallel computing architecture is proposed for multiview phase shifting. In this architecture, the central processing unit can cooperate with the graphics processing unit (GPU) to achieve hybrid parallel computing. The high computation cost procedures, including lens distortion rectification, phase computation, correspondence, and 3-D reconstruction, are implemented on the GPU, and a three-layer kernel function model is designed to simultaneously realize coarse-grained and fine-grained parallel computing. Experimental results verify that the developed system can perform 50 fps (frames per second) real-time 3-D measurement with 260 K 3-D points per frame. A speedup of up to 180 times is obtained for the proposed technique using an NVIDIA GT560Ti graphics card rather than a sequential C implementation on a 3.4 GHz Intel Core i7 3770.
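The per-pixel wrapped-phase computation is the most naturally data-parallel stage of such a pipeline. A hedged NumPy sketch for N equally spaced phase steps is shown below; the sign convention and step spacing are assumptions, and the rectification, correspondence, and reconstruction stages are not reproduced.

```python
import numpy as np

def wrapped_phase(images):
    """Per-pixel wrapped phase from N equally spaced phase-shifted fringe images.

    images : (N, H, W) array, fringe images captured with phase shifts 2*pi*k/N.
    Returns the wrapped phase; sign conventions vary with the projector setup.
    """
    N = images.shape[0]
    k = np.arange(N).reshape(-1, 1, 1)
    num = np.sum(images * np.sin(2 * np.pi * k / N), axis=0)
    den = np.sum(images * np.cos(2 * np.pi * k / N), axis=0)
    return np.arctan2(-num, den)   # each pixel is independent -> maps directly to GPU threads
```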
NASA Astrophysics Data System (ADS)
Shi, X.
2015-12-01
As NSF has indicated, "Theory and experimentation have for centuries been regarded as two fundamental pillars of science. It is now widely recognized that computational and data-enabled science forms a critical third pillar." Geocomputation is that third pillar for GIScience and the geosciences. With the exponential growth of geodata, the challenge of scalable and high performance computing for big data analytics becomes urgent, because many research activities are constrained by software or tools that cannot complete the computation process. Heterogeneous geodata integration and analytics obviously magnify the complexity and the operational time frame. Many large-scale geospatial problems may not be processable at all if the computer system does not have sufficient memory or computational power. Emerging computer architectures, such as Intel's Many Integrated Core (MIC) architecture and the Graphics Processing Unit (GPU), and advanced computing technologies provide promising solutions that employ massive parallelism and hardware resources to achieve scalability and high performance for data intensive computing over large spatiotemporal and social media data. Exploring novel algorithms and deploying the solutions in massively parallel computing environments to achieve scalable data processing and analytics over large-scale, complex, and heterogeneous geodata with consistent quality and high performance has been the central theme of our research team in the Department of Geosciences at the University of Arkansas (UARK). New multi-core architectures combined with application accelerators hold the promise of achieving scalability and high performance by exploiting task- and data-level parallelism that is not supported by conventional computing systems. Such a parallel or distributed computing environment is particularly suitable for large-scale geocomputation over big data, as demonstrated by our prior work, while the potential of such advanced infrastructure remains unexplored in this domain. Within this presentation, our prior and ongoing initiatives are summarized to exemplify how we exploit multicore CPUs, GPUs, and MICs, and clusters of CPUs, GPUs, and MICs, to accelerate geocomputation in different applications.
Programming for 1.6 Million cores: Early experiences with IBM's BG/Q SMP architecture
NASA Astrophysics Data System (ADS)
Glosli, James
2013-03-01
With the stall in clock cycle improvements a decade ago, the drive for computational performance has continued along a path of increasing core counts per processor. The multi-core evolution has been expressed both in symmetric multiprocessor (SMP) architectures and in CPU/GPU architectures. Debates rage in the high performance computing (HPC) community over which architecture best serves HPC. In this talk I will not attempt to resolve that debate but perhaps fuel it. I will discuss the experience of exploiting Sequoia, a 98304-node IBM Blue Gene/Q SMP at Lawrence Livermore National Laboratory. The advantages and challenges of leveraging the computational power of BG/Q will be detailed through the discussion of two applications. The first application is a Molecular Dynamics code called ddcMD, developed over the last decade at LLNL and ported to BG/Q. The second application is a cardiac modeling code called Cardioid, recently designed and developed at LLNL to exploit the fine scale parallelism of BG/Q's SMP architecture. Through the lens of these efforts I'll illustrate the need to rethink how we express and implement our computational approaches. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.
Performance Analysis of Multilevel Parallel Applications on Shared Memory Architectures
NASA Technical Reports Server (NTRS)
Biegel, Bryan A. (Technical Monitor); Jost, G.; Jin, H.; Labarta, J.; Gimenez, J.; Caubet, J.
2003-01-01
Parallel programming paradigms include process level parallelism, thread level parallelization, and multilevel parallelism. This viewgraph presentation describes a detailed performance analysis of these paradigms for Shared Memory Architecture (SMA). This analysis uses the Paraver Performance Analysis System. The presentation includes diagrams of a flow of useful computations.
Liu, Kui; Wei, Sixiao; Chen, Zhijiang; Jia, Bin; Chen, Genshe; Ling, Haibin; Sheaff, Carolyn; Blasch, Erik
2017-01-01
This paper presents the first attempt at combining Cloud with Graphic Processing Units (GPUs) in a complementary manner within the framework of a real-time high performance computation architecture for the application of detecting and tracking multiple moving targets based on Wide Area Motion Imagery (WAMI). More specifically, the GPU and Cloud Moving Target Tracking (GC-MTT) system applied a front-end web based server to perform the interaction with Hadoop and highly parallelized computation functions based on the Compute Unified Device Architecture (CUDA©). The introduced multiple moving target detection and tracking method can be extended to other applications such as pedestrian tracking, group tracking, and Patterns of Life (PoL) analysis. The cloud and GPUs based computing provides an efficient real-time target recognition and tracking approach as compared to methods when the work flow is applied using only central processing units (CPUs). The simultaneous tracking and recognition results demonstrate that a GC-MTT based approach provides drastically improved tracking with low frame rates over realistic conditions. PMID:28208684
Liu, Kui; Wei, Sixiao; Chen, Zhijiang; Jia, Bin; Chen, Genshe; Ling, Haibin; Sheaff, Carolyn; Blasch, Erik
2017-02-12
This paper presents the first attempt at combining Cloud with Graphic Processing Units (GPUs) in a complementary manner within the framework of a real-time high performance computation architecture for the application of detecting and tracking multiple moving targets based on Wide Area Motion Imagery (WAMI). More specifically, the GPU and Cloud Moving Target Tracking (GC-MTT) system applied a front-end web based server to perform the interaction with Hadoop and highly parallelized computation functions based on the Compute Unified Device Architecture (CUDA©). The introduced multiple moving target detection and tracking method can be extended to other applications such as pedestrian tracking, group tracking, and Patterns of Life (PoL) analysis. The cloud and GPUs based computing provides an efficient real-time target recognition and tracking approach as compared to methods when the work flow is applied using only central processing units (CPUs). The simultaneous tracking and recognition results demonstrate that a GC-MTT based approach provides drastically improved tracking with low frame rates over realistic conditions.
DeepX: Deep Learning Accelerator for Restricted Boltzmann Machine Artificial Neural Networks.
Kim, Lok-Won
2018-05-01
Although there have been many decades of research and commercial presence on high performance general purpose processors, there are still many applications that require fully customized hardware architectures for further computational acceleration. Recently, deep learning has been used successfully in a wide variety of applications, but its heavy computation demand has considerably limited its practical applications. This paper proposes a fully pipelined acceleration architecture to alleviate the high computational demand of one class of artificial neural network (ANN), the restricted Boltzmann machine (RBM). The implemented RBM ANN accelerator (integrating network size, using 128 input cases per batch, and running at a 303-MHz clock frequency) integrated in a state-of-the-art field-programmable gate array (FPGA) (Xilinx Virtex 7 XC7V-2000T) provides a computational performance of 301 billion connection-updates-per-second and about 193 times higher performance than a software solution running on general purpose processors. Most importantly, the architecture enables over 4 times (12 times in batch learning) higher performance compared with a previous work when both are implemented in an FPGA device (XC2VP70).
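The computation being accelerated is RBM training; a minimal NumPy sketch of one contrastive-divergence (CD-1) mini-batch update is given below as a software reference point. The learning rate and sampling details are illustrative assumptions, not the FPGA design.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, b_v, b_h, v0, lr=0.01):
    """One CD-1 update for a binary RBM on a mini-batch v0 of shape (B, n_visible)."""
    h0_prob = sigmoid(v0 @ W + b_h)                        # positive phase
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)
    v1_prob = sigmoid(h0 @ W.T + b_v)                      # reconstruction
    h1_prob = sigmoid(v1_prob @ W + b_h)                   # negative phase
    B = v0.shape[0]
    W += lr * (v0.T @ h0_prob - v1_prob.T @ h1_prob) / B   # connection updates
    b_v += lr * (v0 - v1_prob).mean(axis=0)
    b_h += lr * (h0_prob - h1_prob).mean(axis=0)
    return W, b_v, b_h

# toy usage: 784 visible units, 256 hidden units, batch of 128
n_v, n_h = 784, 256
W = 0.01 * rng.standard_normal((n_v, n_h))
b_v, b_h = np.zeros(n_v), np.zeros(n_h)
batch = (rng.random((128, n_v)) < 0.5).astype(float)
W, b_v, b_h = cd1_update(W, b_v, b_h, batch)
```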
Implementation of an ADI method on parallel computers
NASA Technical Reports Server (NTRS)
Fatoohi, Raad A.; Grosch, Chester E.
1987-01-01
The implementation of an ADI method for solving the diffusion equation on three parallel/vector computers is discussed. The computers were chosen so as to encompass a variety of architectures. They are: the MPP, an SIMD machine with 16K bit serial processors; FLEX/32, an MIMD machine with 20 processors; and CRAY/2, an MIMD machine with four vector processors. The Gaussian elimination algorithm is used to solve a set of tridiagonal systems on the FLEX/32 and CRAY/2 while the cyclic elimination algorithm is used to solve these systems on the MPP. The implementation of the method is discussed in relation to these architectures and measures of the performance on each machine are given. Simple performance models are used to describe the performance. These models highlight the bottlenecks and limiting factors for this algorithm on these architectures. Finally, conclusions are presented.
Implementation of an ADI method on parallel computers
NASA Technical Reports Server (NTRS)
Fatoohi, Raad A.; Grosch, Chester E.
1987-01-01
In this paper the implementation of an ADI method for solving the diffusion equation on three parallel/vector computers is discussed. The computers were chosen so as to encompass a variety of architectures. They are the MPP, an SIMD machine with 16-Kbit serial processors; Flex/32, an MIMD machine with 20 processors; and Cray/2, an MIMD machine with four vector processors. The Gaussian elimination algorithm is used to solve a set of tridiagonal systems on the Flex/32 and Cray/2 while the cyclic elimination algorithm is used to solve these systems on the MPP. The implementation of the method is discussed in relation to these architectures and measures of the performance on each machine are given. Simple performance models are used to describe the performance. These models highlight the bottlenecks and limiting factors for this algorithm on these architectures. Finally conclusions are presented.
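For reference, a serial NumPy sketch of one Peaceman-Rachford ADI step for the 2-D diffusion equation is shown below, using the Thomas (Gaussian elimination) tridiagonal solver mentioned for the Flex/32 and Cray/2. Homogeneous Dirichlet boundaries and equal grid spacing are assumed; the MPP's cyclic elimination variant is not shown.

```python
import numpy as np

def thomas(a, b, c, d):
    """Solve a tridiagonal system (a: sub-diagonal, b: diagonal, c: super-diagonal, d: rhs)."""
    n = len(d)
    cp, dp = np.empty(n), np.empty(n)
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = np.empty(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def adi_step(u, alpha, dt, dx):
    """One ADI step for u_t = alpha*(u_xx + u_yy) with zero Dirichlet boundaries."""
    r = alpha * dt / (2 * dx * dx)
    n = u.shape[0]
    a = np.full(n - 2, -r); b = np.full(n - 2, 1 + 2 * r); c = np.full(n - 2, -r)
    half = u.copy()
    for j in range(1, n - 1):                 # implicit in x, explicit in y
        rhs = u[1:-1, j] + r * (u[1:-1, j - 1] - 2 * u[1:-1, j] + u[1:-1, j + 1])
        half[1:-1, j] = thomas(a, b, c, rhs)
    new = half.copy()
    for i in range(1, n - 1):                 # implicit in y, explicit in x
        rhs = half[i, 1:-1] + r * (half[i - 1, 1:-1] - 2 * half[i, 1:-1] + half[i + 1, 1:-1])
        new[i, 1:-1] = thomas(a, b, c, rhs)
    return new

# toy usage: diffuse a hot spot on a 65x65 grid
n = 65
u = np.zeros((n, n)); u[n // 2, n // 2] = 1.0
for _ in range(10):
    u = adi_step(u, alpha=1.0, dt=0.1, dx=1.0)
```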
A performance analysis of advanced I/O architectures for PC-based network file servers
NASA Astrophysics Data System (ADS)
Huynh, K. D.; Khoshgoftaar, T. M.
1994-12-01
In the personal computing and workstation environments, more and more I/O adapters are becoming complete functional subsystems that are intelligent enough to handle I/O operations on their own without much intervention from the host processor. The IBM Subsystem Control Block (SCB) architecture has been defined to enhance the potential of these intelligent adapters by defining services and conventions that deliver command information and data to and from the adapters. In recent years, a new storage architecture, the Redundant Array of Independent Disks (RAID), has been quickly gaining acceptance in the world of computing. In this paper, we would like to discuss critical system design issues that are important to the performance of a network file server. We then present a performance analysis of the SCB architecture and disk array technology in typical network file server environments based on personal computers (PCs). One of the key issues investigated in this paper is whether a disk array can outperform a group of disks (of same type, same data capacity, and same cost) operating independently, not in parallel as in a disk array.
Implementation and analysis of a Navier-Stokes algorithm on parallel computers
NASA Technical Reports Server (NTRS)
Fatoohi, Raad A.; Grosch, Chester E.
1988-01-01
The results of the implementation of a Navier-Stokes algorithm on three parallel/vector computers are presented. The object of this research is to determine how well, or poorly, a single numerical algorithm would map onto three different architectures. The algorithm is a compact difference scheme for the solution of the incompressible, two-dimensional, time-dependent Navier-Stokes equations. The computers were chosen so as to encompass a variety of architectures. They are the following: the MPP, an SIMD machine with 16K bit serial processors; Flex/32, an MIMD machine with 20 processors; and Cray/2. The implementation of the algorithm is discussed in relation to these architectures and measures of the performance on each machine are given. The basic comparison is among SIMD instruction parallelism on the MPP, MIMD process parallelism on the Flex/32, and vectorization of a serial code on the Cray/2. Simple performance models are used to describe the performance. These models highlight the bottlenecks and limiting factors for this algorithm on these architectures. Finally, conclusions are presented.
CUDA Optimization Strategies for Compute- and Memory-Bound Neuroimaging Algorithms
Lee, Daren; Dinov, Ivo; Dong, Bin; Gutman, Boris; Yanovsky, Igor; Toga, Arthur W.
2011-01-01
As neuroimaging algorithms and technology continue to grow faster than CPU performance in complexity and image resolution, data-parallel computing methods will be increasingly important. The high performance, data-parallel architecture of modern graphical processing units (GPUs) can reduce computational times by orders of magnitude. However, its massively threaded architecture introduces challenges when GPU resources are exceeded. This paper presents optimization strategies for compute- and memory-bound algorithms for the CUDA architecture. For compute-bound algorithms, the registers are reduced through variable reuse via shared memory and the data throughput is increased through heavier thread workloads and maximizing the thread configuration for a single thread block per multiprocessor. For memory-bound algorithms, fitting the data into the fast but limited GPU resources is achieved through reorganizing the data into self-contained structures and employing a multi-pass approach. Memory latencies are reduced by selecting memory resources whose cache performance are optimized for the algorithm's access patterns. We demonstrate the strategies on two computationally expensive algorithms and achieve optimized GPU implementations that perform up to 6× faster than unoptimized ones. Compared to CPU implementations, we achieve peak GPU speedups of 129× for the 3D unbiased nonlinear image registration technique and 93× for the non-local means surface denoising algorithm. PMID:21159404
CUDA optimization strategies for compute- and memory-bound neuroimaging algorithms.
Lee, Daren; Dinov, Ivo; Dong, Bin; Gutman, Boris; Yanovsky, Igor; Toga, Arthur W
2012-06-01
As neuroimaging algorithms and technology continue to grow faster than CPU performance in complexity and image resolution, data-parallel computing methods will be increasingly important. The high performance, data-parallel architecture of modern graphical processing units (GPUs) can reduce computational times by orders of magnitude. However, its massively threaded architecture introduces challenges when GPU resources are exceeded. This paper presents optimization strategies for compute- and memory-bound algorithms for the CUDA architecture. For compute-bound algorithms, the registers are reduced through variable reuse via shared memory and the data throughput is increased through heavier thread workloads and maximizing the thread configuration for a single thread block per multiprocessor. For memory-bound algorithms, fitting the data into the fast but limited GPU resources is achieved through reorganizing the data into self-contained structures and employing a multi-pass approach. Memory latencies are reduced by selecting memory resources whose cache performance are optimized for the algorithm's access patterns. We demonstrate the strategies on two computationally expensive algorithms and achieve optimized GPU implementations that perform up to 6× faster than unoptimized ones. Compared to CPU implementations, we achieve peak GPU speedups of 129× for the 3D unbiased nonlinear image registration technique and 93× for the non-local means surface denoising algorithm. Copyright © 2010 Elsevier Ireland Ltd. All rights reserved.
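The register-reduction-via-shared-memory idea can be illustrated with a small Numba CUDA kernel (requires an NVIDIA GPU and the numba package): each block stages a tile plus halo in shared memory so neighboring reads hit on-chip storage. This is a generic 1-D stencil example, not the registration or denoising kernels discussed above.

```python
import numpy as np
from numba import cuda, float32

TPB = 128           # threads per block
TILE = TPB + 2      # tile plus a one-element halo on each side

@cuda.jit
def smooth3(src, dst):
    # Stage this block's tile (plus halo) in shared memory so the three reads
    # per output element are served from on-chip storage, not global memory.
    tile = cuda.shared.array(TILE, dtype=float32)
    n = src.shape[0]
    i = cuda.grid(1)
    t = cuda.threadIdx.x
    if i < n:
        tile[t + 1] = src[i]
        if t == 0 and i > 0:
            tile[0] = src[i - 1]
        if t == TPB - 1 and i + 1 < n:
            tile[TPB + 1] = src[i + 1]
    cuda.syncthreads()
    if i > 0 and i < n - 1:
        dst[i] = (tile[t] + tile[t + 1] + tile[t + 2]) / 3.0

x = np.random.rand(1 << 20).astype(np.float32)
y = np.zeros_like(x)
blocks = (x.size + TPB - 1) // TPB
smooth3[blocks, TPB](x, y)   # Numba copies the NumPy arrays to and from the device
```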
SNAVA-A real-time multi-FPGA multi-model spiking neural network simulation architecture.
Sripad, Athul; Sanchez, Giovanny; Zapata, Mireya; Pirrone, Vito; Dorta, Taho; Cambria, Salvatore; Marti, Albert; Krishnamourthy, Karthikeyan; Madrenas, Jordi
2018-01-01
Spiking Neural Networks (SNN) for Versatile Applications (SNAVA) simulation platform is a scalable and programmable parallel architecture that supports real-time, large-scale, multi-model SNN computation. This parallel architecture is implemented in modern Field-Programmable Gate Arrays (FPGAs) devices to provide high performance execution and flexibility to support large-scale SNN models. Flexibility is defined in terms of programmability, which allows easy synapse and neuron implementation. This has been achieved by using a special-purpose Processing Elements (PEs) for computing SNNs, and analyzing and customizing the instruction set according to the processing needs to achieve maximum performance with minimum resources. The parallel architecture is interfaced with customized Graphical User Interfaces (GUIs) to configure the SNN's connectivity, to compile the neuron-synapse model and to monitor SNN's activity. Our contribution intends to provide a tool that allows to prototype SNNs faster than on CPU/GPU architectures but significantly cheaper than fabricating a customized neuromorphic chip. This could be potentially valuable to the computational neuroscience and neuromorphic engineering communities. Copyright © 2017 Elsevier Ltd. All rights reserved.
The RISC (Reduced Instruction Set Computer) Architecture and Computer Performance Evaluation.
1986-03-01
time where the main emphasis of the evaluation process is put on the software. The model is intended to provide a tool for computer architects to use...program, or 3) Was to be implemented in random logic more effectively than the equivalent sequence of software instructions. Both data and address...definition is the IEEE standard 729-1983 stating Computer Architecture as: "The process of defining a collection of hardware and software components and
A supportive architecture for CFD-based design optimisation
NASA Astrophysics Data System (ADS)
Li, Ni; Su, Zeya; Bi, Zhuming; Tian, Chao; Ren, Zhiming; Gong, Guanghong
2014-03-01
Multi-disciplinary design optimisation (MDO) is one of the critical methodologies for the implementation of enterprise systems (ES). MDO that requires fluid dynamics analysis raises a special challenge due to its extremely intensive computation. The rapid development of computational fluid dynamics (CFD) techniques has led to a rise in their applications across various fields. For vehicle exterior design in particular, CFD has become one of the three main design tools, alongside analytical approaches and wind tunnel experiments. CFD-based design optimisation is an effective way to achieve the desired performance under the given constraints. However, due to the complexity of CFD, integrating CFD analysis into an intelligent optimisation algorithm is not straightforward. It is a challenge to solve a CFD-based design problem, which usually involves high dimensionality and multiple objectives and constraints. It is desirable to have an integrated architecture for CFD-based design optimisation. However, our review of existing work found that very few researchers have studied assistive tools that facilitate CFD-based design optimisation. In this paper, a multi-layer architecture and a general procedure are proposed to integrate different CFD toolsets with intelligent optimisation algorithms, parallel computing techniques and other techniques for efficient computation. In the proposed architecture, the integration is performed either at the code level or the data level to fully utilise the capabilities of different assistive tools. Two intelligent algorithms are developed and embedded with parallel computing. These algorithms, together with the supportive architecture, lay a solid foundation for various applications of CFD-based design optimisation. To illustrate the effectiveness of the proposed architecture and algorithms, case studies on the aerodynamic shape design of a hypersonic cruising vehicle are provided, and the results show that the proposed architecture and developed algorithms perform successfully and efficiently on a design optimisation with over 200 design variables.
Computer Electromagnetics and Supercomputer Architecture
NASA Technical Reports Server (NTRS)
Cwik, Tom
1993-01-01
The dramatic increase in performance over the last decade for microprocessor computations is compared with that for supercomputer computations. This performance, the projected performance, and a number of other issues such as cost and the inherent physical limitations of current supercomputer technology have naturally led to parallel supercomputers and ensembles of interconnected microprocessors.
Neural simulations on multi-core architectures.
Eichner, Hubert; Klug, Tobias; Borst, Alexander
2009-01-01
Neuroscience is witnessing increasing knowledge about the anatomy and electrophysiological properties of neurons and their connectivity, leading to an ever increasing computational complexity of neural simulations. At the same time, a rather radical change in personal computer technology emerges with the establishment of multi-cores: high-density, explicitly parallel processor architectures for both high-performance and standard desktop computers. This work introduces strategies for the parallelization of biophysically realistic neural simulations based on the compartmental modeling technique and results of such an implementation, with a strong focus on multi-core architectures and automation, i.e., user-transparent load balancing.
Neural Simulations on Multi-Core Architectures
Eichner, Hubert; Klug, Tobias; Borst, Alexander
2009-01-01
Neuroscience is witnessing increasing knowledge about the anatomy and electrophysiological properties of neurons and their connectivity, leading to an ever increasing computational complexity of neural simulations. At the same time, a rather radical change in personal computer technology emerges with the establishment of multi-cores: high-density, explicitly parallel processor architectures for both high-performance and standard desktop computers. This work introduces strategies for the parallelization of biophysically realistic neural simulations based on the compartmental modeling technique and results of such an implementation, with a strong focus on multi-core architectures and automation, i.e., user-transparent load balancing. PMID:19636393
GPU and APU computations of Finite Time Lyapunov Exponent fields
NASA Astrophysics Data System (ADS)
Conti, Christian; Rossinelli, Diego; Koumoutsakos, Petros
2012-03-01
We present GPU and APU accelerated computations of Finite-Time Lyapunov Exponent (FTLE) fields. The calculation of FTLEs is computationally intensive, as obtaining the sharp ridges associated with the Lagrangian Coherent Structures requires an extensive resampling of the flow field. The computational performance of this resampling is limited by the memory bandwidth of the underlying computer architecture. The present technique harnesses data-parallel execution of many-core architectures and relies on fast and accurate evaluations of moment-conserving functions for the mesh-to-particle interpolations. We demonstrate how the computation of FTLEs can be efficiently performed on a GPU and on an APU through OpenCL, and we report over an order of magnitude improvement over multi-threaded executions in FTLE computations of bluff body flows.
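For reference, the quantity being accelerated reduces to the largest eigenvalue of the Cauchy-Green deformation tensor of the flow map; a plain serial NumPy sketch of that arithmetic is given below. It mirrors the definition only and makes no attempt to model the OpenCL data-parallel kernels or the moment-conserving mesh-particle interpolation described above.

    import numpy as np

    def ftle_field(flow_map_x, flow_map_y, dx, dy, T):
        """2D FTLE field from the flow map (x0, y0) -> (x(T), y(T)).
        flow_map_x, flow_map_y: advected particle positions on a regular grid."""
        # Gradient of the flow map with respect to the initial positions
        dxdX, dxdY = np.gradient(flow_map_x, dx, dy)
        dydX, dydY = np.gradient(flow_map_y, dx, dy)
        ftle = np.zeros_like(flow_map_x)
        for i in range(flow_map_x.shape[0]):
            for j in range(flow_map_x.shape[1]):
                F = np.array([[dxdX[i, j], dxdY[i, j]],
                              [dydX[i, j], dydY[i, j]]])
                C = F.T @ F                          # Cauchy-Green deformation tensor
                lam_max = np.linalg.eigvalsh(C)[-1]  # largest eigenvalue
                ftle[i, j] = np.log(np.sqrt(lam_max)) / abs(T)
        return ftle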
DOE Office of Scientific and Technical Information (OSTI.GOV)
Luo, Y.; Cameron, K.W.
1998-11-24
Workload characterization has proven to be an essential tool for architecture design and performance evaluation in both scientific and commercial computing. Traditional workload characterization techniques include FLOPS rate, cache miss ratios, CPI (cycles per instruction, or equivalently IPC, instructions per cycle), etc. With the complexity of sophisticated modern superscalar microprocessors, these traditional techniques are not powerful enough to pinpoint the performance bottleneck of an application on a specific microprocessor, nor can they immediately demonstrate the potential performance benefit of an architectural or functional improvement in a new processor design. To address these problems, many people rely on simulators, which have substantial constraints, especially for large-scale scientific computing applications. This paper presents a new technique for characterizing applications at the instruction level using hardware performance counters. It has the advantage of collecting instruction-level characteristics in a few runs with virtually no overhead or slowdown. A variety of instruction counts can be used to calculate average abstract workload parameters corresponding to microprocessor pipelines or functional units. Based on the microprocessor's architectural constraints and these calculated abstract parameters, the architectural performance bottleneck for a specific application can be estimated. In particular, the analysis results can provide some insight into why only a small percentage of processor peak performance is achieved even for many very cache-friendly codes. Meanwhile, the bottleneck estimation can suggest viable architectural or functional improvements for certain workloads. Eventually, these abstract parameters can lead to the creation of an analytical microprocessor pipeline model and memory hierarchy model.
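To convey the flavor of counter-based characterization, a small hypothetical post-processing step is sketched below; the counter names and the four-wide issue width are illustrative assumptions, not the events or parameters of any particular microprocessor.

    def characterize(counters, issue_width=4):
        """Derive abstract workload parameters from raw hardware counter values."""
        cycles = counters["cycles"]
        instrs = counters["instructions"]
        cpi = cycles / instrs                           # cycles per instruction
        ipc = instrs / cycles                           # instructions per cycle
        return {
            "CPI": cpi,
            "IPC": ipc,
            "pipeline utilization": ipc / issue_width,  # fraction of peak issue rate
            "L1 miss ratio": counters["l1_misses"] / counters["loads"],
            "FLOP/cycle": counters["fp_ops"] / cycles,
        }

    # Example: even a cache-friendly code may reach only a fraction of the peak issue width
    print(characterize({"cycles": 8.0e9, "instructions": 6.0e9,
                        "loads": 2.0e9, "l1_misses": 4.0e7, "fp_ops": 1.5e9}))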
Selecting a Benchmark Suite to Profile High-Performance Computing (HPC) Machines
2014-11-01
architectures. Machines now contain central processing units (CPUs), graphics processing units (GPUs), and many integrated core (MIC) architectures all... evaluate the feasibility and applicability of a new architecture just released to the market. Researchers are often unsure how available resources will... architectures. Having a suite of programs running on different architectures, such as GPUs, MICs, and CPUs, adds complexity and technical challenges
NASA Astrophysics Data System (ADS)
Elkurdi, Yousef; Fernández, David; Souleimanov, Evgueni; Giannacopoulos, Dennis; Gross, Warren J.
2008-04-01
The Finite Element Method (FEM) is a computationally intensive scientific and engineering analysis tool that has diverse applications ranging from structural engineering to electromagnetic simulation. The trends in floating-point performance are moving in favor of Field-Programmable Gate Arrays (FPGAs), hence increasing interest has grown in the scientific community to exploit this technology. We present an architecture and implementation of an FPGA-based sparse matrix-vector multiplier (SMVM) for use in the iterative solution of large, sparse systems of equations arising from FEM applications. FEM matrices display specific sparsity patterns that can be exploited to improve the efficiency of hardware designs. Our architecture exploits FEM matrix sparsity structure to achieve a balance between performance and hardware resource requirements by relying on external SDRAM for data storage while utilizing the FPGA's computational resources in a stream-through systolic approach. The architecture is based on a pipelined linear array of processing elements (PEs) coupled with a hardware-oriented matrix striping algorithm and a partitioning scheme which enables it to process arbitrarily large matrices without changing the number of PEs in the architecture. Therefore, this architecture is only limited by the amount of external RAM available to the FPGA. The implemented SMVM-pipeline prototype contains 8 PEs and is clocked at 110 MHz, obtaining a peak performance of 1.76 GFLOPS. For 8 GB/s of memory bandwidth typical of recent FPGA systems, this architecture can achieve 1.5 GFLOPS sustained performance. Using multiple instances of the pipeline, linear scaling of the peak and sustained performance can be achieved. Our stream-through architecture provides the added advantage of enabling an iterative implementation of the SMVM computation required by iterative solution techniques such as the conjugate gradient method, avoiding initialization time due to data loading and setup inside the FPGA internal memory.
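The computation streamed through the pipeline is ordinary sparse matrix-vector multiplication; a compressed sparse row (CSR) software reference is sketched below. It reflects the arithmetic only, not the striping and partitioning scheme or the systolic array of PEs.

    def spmv_csr(values, col_idx, row_ptr, x):
        """y = A @ x for a matrix stored in compressed sparse row (CSR) form."""
        n_rows = len(row_ptr) - 1
        y = [0.0] * n_rows
        for i in range(n_rows):
            acc = 0.0
            for k in range(row_ptr[i], row_ptr[i + 1]):
                acc += values[k] * x[col_idx[k]]   # multiply-accumulate, one nonzero at a time
            y[i] = acc
        return y

    # 2x2 example: [[4, 1], [0, 3]] @ [1, 2]
    print(spmv_csr([4.0, 1.0, 3.0], [0, 1, 1], [0, 2, 3], [1.0, 2.0]))  # -> [6.0, 6.0]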
Kazakis, Georgios; Kanellopoulos, Ioannis; Sotiropoulos, Stefanos; Lagaros, Nikos D
2017-10-01
The construction industry has a major impact on the environment in which we spend most of our lives. It is therefore important that the outcome of architectural intuition performs well and complies with the design requirements. Architects usually describe as an "optimal design" their choice among a rather limited set of design alternatives, dictated by their experience and intuition. However, modern design of structures requires accounting for a great number of criteria derived from multiple disciplines, often of conflicting nature; such criteria derive from structural engineering, eco-design, bioclimatic and acoustic performance. The resulting vast number of alternatives enhances the need for computer-aided architecture in order to increase the possibility of arriving at a more preferable solution. Therefore, the incorporation of smart, automatic tools in the design process, able to further guide the designer's intuition, becomes even more indispensable. The principal aim of this study is to present possibilities for integrating automatic computational techniques related to topology optimization into the intuition phase of civil structure design as part of computer-aided architectural design. In this direction, different aspects of a new computer-aided architectural era related to the interpretation of the optimized designs, difficulties resulting from the increased computational effort, and 3D printing capabilities are covered herein.
Benchmarking hardware architecture candidates for the NFIRAOS real-time controller
NASA Astrophysics Data System (ADS)
Smith, Malcolm; Kerley, Dan; Herriot, Glen; Véran, Jean-Pierre
2014-07-01
As a part of the trade study for the Narrow Field Infrared Adaptive Optics System, the adaptive optics system for the Thirty Meter Telescope, we investigated the feasibility of performing real-time control computation using a Linux operating system and Intel Xeon E5 CPUs. We also investigated a Xeon Phi based architecture which allows higher levels of parallelism. This paper summarizes both the CPU based real-time controller architecture and the Xeon Phi based RTC. The Intel Xeon E5 CPU solution meets the requirements and performs the computation for one AO cycle in an average of 767 microseconds. The Xeon Phi solution did not meet the 1200 microsecond time requirement and also suffered from unpredictable execution times. More detailed benchmark results are reported for both architectures.
Architecture Framework for Trapped-Ion Quantum Computer based on Performance Simulation Tool
NASA Astrophysics Data System (ADS)
Ahsan, Muhammad
The challenge of building a scalable quantum computer lies in striking an appropriate balance between designing a reliable system architecture from a large number of faulty computational resources and improving the physical quality of system components. Detailed investigation of how performance varies with the physics of the components and the system architecture requires an adequate performance simulation tool. In this thesis we demonstrate a software tool capable of (1) mapping and scheduling a quantum circuit onto a realistic quantum hardware architecture with physical resource constraints, (2) evaluating performance metrics such as the execution time and the success probability of the algorithm execution, and (3) analyzing the constituents of these metrics and visualizing resource utilization to identify the system components that crucially determine overall performance. Using this versatile tool, we explore the vast design space for a modular quantum computer architecture based on trapped ions. We find that while the success probability is uniformly determined by the fidelity of the physical quantum operations, the execution time is a function of the system resources invested at various layers of the design hierarchy. At the physical level, the number of lasers performing quantum gates impacts the latency of fault-tolerant circuit block execution. When these blocks are used to construct meaningful arithmetic circuits such as quantum adders, the number of ancilla qubits for complicated non-Clifford gates and the entanglement resources needed to establish long-distance communication channels become the major performance-limiting factors. Next, in order to factorize large integers, these adders are assembled into the modular exponentiation circuit that comprises the bulk of Shor's algorithm. At this stage, the overall scaling of resource-constrained performance with problem size describes the effectiveness of the chosen design. By matching the resource investment with the pace of advancement in hardware technology, we find optimal designs for different types of quantum adders. Finally, we show that 2,048-bit Shor's algorithm can be reliably executed within a resource budget of 1.5 million qubits.
Computational Particle Dynamic Simulations on Multicore Processors (CPDMu) Final Report Phase I
DOE Office of Scientific and Technical Information (OSTI.GOV)
Schmalz, Mark S
2011-07-24
Statement of Problem - Department of Energy has many legacy codes for simulation of computational particle dynamics and computational fluid dynamics applications that are designed to run on sequential processors and are not easily parallelized. Emerging high-performance computing architectures employ massively parallel multicore architectures (e.g., graphics processing units) to increase throughput. Parallelization of legacy simulation codes is a high priority, to achieve compatibility, efficiency, accuracy, and extensibility. General Statement of Solution - A legacy simulation application designed for implementation on mainly-sequential processors has been represented as a graph G. Mathematical transformations, applied to G, produce a graph representation G' for a high-performance architecture. Key computational and data movement kernels of the application were analyzed/optimized for parallel execution using the mapping G → G', which can be performed semi-automatically. This approach is widely applicable to many types of high-performance computing systems, such as graphics processing units or clusters comprised of nodes that contain one or more such units. Phase I Accomplishments - Phase I research decomposed/profiled computational particle dynamics simulation code for rocket fuel combustion into low and high computational cost regions (respectively, mainly sequential and mainly parallel kernels), with analysis of space and time complexity. Using the research team's expertise in algorithm-to-architecture mappings, the high-cost kernels were transformed, parallelized, and implemented on Nvidia Fermi GPUs. Measured speedups (GPU with respect to single-core CPU) were approximately 20-32X for realistic model parameters, without final optimization. Error analysis showed no loss of computational accuracy. Commercial Applications and Other Benefits - The proposed research will constitute a breakthrough in the solution of problems related to efficient parallel computation of particle and fluid dynamics simulations. These problems occur throughout DOE, military and commercial sectors: the potential payoff is high. We plan to license or sell the solution to contractors for military and domestic applications such as disaster simulation (aerodynamic and hydrodynamic), Government agencies (hydrological and environmental simulations), and medical applications (e.g., in tomographic image reconstruction). Keywords - High-performance Computing, Graphic Processing Unit, Fluid/Particle Simulation. Summary for Members of Congress - Department of Energy has many simulation codes that must compute faster, to be effective. The Phase I research parallelized particle/fluid simulations for rocket combustion, for high-performance computing systems.
Architecture independent environment for developing engineering software on MIMD computers
NASA Technical Reports Server (NTRS)
Valimohamed, Karim A.; Lopez, L. A.
1990-01-01
Engineers are constantly faced with solving problems of increasing complexity and detail. Multiple Instruction stream Multiple Data stream (MIMD) computers have been developed to overcome the performance limitations of serial computers. The hardware architectures of MIMD computers vary considerably and are much more sophisticated than serial computers. Developing large scale software for a variety of MIMD computers is difficult and expensive. There is a need to provide tools that facilitate programming these machines. First, the issues that must be considered to develop those tools are examined. The two main areas of concern were architecture independence and data management. Architecture independent software facilitates software portability and improves the longevity and utility of the software product. It provides some form of insurance for the investment of time and effort that goes into developing the software. The management of data is a crucial aspect of solving large engineering problems. It must be considered in light of the new hardware organizations that are available. Second, the functional design and implementation of a software environment that facilitates developing architecture independent software for large engineering applications are described. The topics of discussion include: a description of the model that supports the development of architecture independent software; identifying and exploiting concurrency within the application program; data coherence; engineering data base and memory management.
Parallel Architectures and Parallel Algorithms for Integrated Vision Systems. Ph.D. Thesis
NASA Technical Reports Server (NTRS)
Choudhary, Alok Nidhi
1989-01-01
Computer vision is regarded as one of the most complex and computationally intensive problems. An integrated vision system (IVS) is a system that uses vision algorithms from all levels of processing for a high-level application (e.g., object recognition). An IVS normally involves algorithms from low-level, intermediate-level, and high-level vision. Designing parallel architectures for vision systems is of tremendous interest to researchers. Several issues in parallel architectures and parallel algorithms for integrated vision systems are addressed.
Using SPEEDES to simulate the blue gene interconnect network
NASA Technical Reports Server (NTRS)
Springer, P.; Upchurch, E.
2003-01-01
JPL and the Center for Advanced Computer Architecture (CACR) are conducting application and simulation analyses of BG/L in order to establish a range of effectiveness for the Blue Gene/L MPP architecture in performing important classes of computations and to determine the design sensitivity of the global interconnect network in support of real-world ASCI application execution.
NASA Technical Reports Server (NTRS)
Mavriplis, D. J.; Das, Raja; Saltz, Joel; Vermeland, R. E.
1992-01-01
An efficient three dimensional unstructured Euler solver is parallelized on a Cray Y-MP C90 shared memory computer and on an Intel Touchstone Delta distributed memory computer. This paper relates the experiences gained and describes the software tools and hardware used in this study. Performance comparisons between two differing architectures are made.
Multi-threaded Sparse Matrix Sparse Matrix Multiplication for Many-Core and GPU Architectures.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Deveci, Mehmet; Trott, Christian Robert; Rajamanickam, Sivasankaran
Sparse matrix-matrix multiplication is a key kernel that has applications in several domains such as scientific computing and graph analysis. Several algorithms have been studied in the past for this foundational kernel. In this paper, we develop parallel algorithms for sparse matrix-matrix multiplication with a focus on performance portability across different high performance computing architectures. The performance of these algorithms depends on the data structures used in them. We compare different types of accumulators in these algorithms and demonstrate the performance difference between these data structures. Furthermore, we develop a meta-algorithm, kkSpGEMM, to choose the right algorithm and data structure based on the characteristics of the problem. We show performance comparisons on three architectures and demonstrate the need for the community to develop two-phase sparse matrix-matrix multiplication implementations for efficient reuse of the data structures involved.
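The accumulator choice discussed above can be illustrated with a row-wise (Gustavson-style) software sketch; the dense accumulator below is just one of the options typically compared (hash-map and sorted-list accumulators are common alternatives), and the code ignores the threading and portability machinery of kkSpGEMM.

    def spgemm_csr(A_vals, A_cols, A_ptr, B_vals, B_cols, B_ptr, n_cols_B):
        """C = A @ B for CSR matrices, accumulating each row of C in a dense array."""
        C_vals, C_cols, C_ptr = [], [], [0]
        acc = [0.0] * n_cols_B            # dense accumulator, reused row after row
        flags = [False] * n_cols_B        # marks columns touched in the current row
        for i in range(len(A_ptr) - 1):
            touched = []
            for ka in range(A_ptr[i], A_ptr[i + 1]):
                a, j = A_vals[ka], A_cols[ka]
                for kb in range(B_ptr[j], B_ptr[j + 1]):
                    col = B_cols[kb]
                    if not flags[col]:
                        flags[col] = True
                        touched.append(col)
                    acc[col] += a * B_vals[kb]
            for col in sorted(touched):   # emit row i of C in column order
                C_vals.append(acc[col])
                C_cols.append(col)
                acc[col] = 0.0
                flags[col] = False
            C_ptr.append(len(C_vals))
        return C_vals, C_cols, C_ptr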
DOE Office of Scientific and Technical Information (OSTI.GOV)
Karthik, Rajasekar
2014-01-01
In this paper, an architecture for building a Scalable And Mobile Environment For High-Performance Computing with spatial capabilities, called SAME4HPC, is described using cutting-edge technologies and standards such as Node.js, HTML5, ECMAScript 6, and PostgreSQL 9.4. Mobile devices are increasingly becoming powerful enough to run high-performance apps. At the same time, there exist a significant number of low-end and older devices that rely heavily on the server or the cloud infrastructure to do the heavy lifting. Our architecture aims to support both types of devices to provide high performance and a rich user experience. A cloud infrastructure consisting of OpenStack with Ubuntu and GeoServer, together with high-performance JavaScript frameworks, is among the key open-source and industry-standard technologies adopted in this architecture.
A survey of compiler optimization techniques
NASA Technical Reports Server (NTRS)
Schneck, P. B.
1972-01-01
Major optimization techniques of compilers are described and grouped into three categories: machine dependent, architecture dependent, and architecture independent. Machine-dependent optimizations tend to be local and are performed upon short spans of generated code by using particular properties of an instruction set to reduce the time or space required by a program. Architecture-dependent optimizations are global and are performed while generating code. These optimizations consider the structure of a computer, but not its detailed instruction set. Architecture-independent optimizations are also global but are based on analysis of the program flow graph and the dependencies among statements of the source program. A conceptual review of a universal optimizer that performs architecture-independent optimizations at the source-code level is also presented.
A novel strategy for load balancing of distributed medical applications.
Logeswaran, Rajasvaran; Chen, Li-Choo
2012-04-01
Current trends in medicine, specifically in the electronic handling of medical applications ranging from digital imaging, paperless hospital administration and electronic medical records to telemedicine and computer-aided diagnosis, create a burden on the network. Distributed Service Architectures, such as the Intelligent Network (IN), the Telecommunication Information Networking Architecture (TINA) and Open Service Access (OSA), are able to meet this new challenge. Distribution enables computational tasks to be spread among multiple processors; hence, performance is an important issue. This paper proposes a novel approach to load balancing, the Random Sender Initiated Algorithm, for the distribution of tasks among several nodes sharing the same computational object (CO) instances in Distributed Service Architectures. Simulations illustrate that the proposed algorithm produces better network performance than the benchmark load balancing algorithms, the Random Node Selection Algorithm and the Shortest Queue Algorithm, especially under medium and heavily loaded conditions.
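A deliberately simplified sketch of sender-initiated random task placement is shown below to convey the general idea; the queue threshold, node count, and arrival pattern are illustrative assumptions and do not reproduce the paper's Random Sender Initiated Algorithm or its Distributed Service Architecture setting.

    import random
    from collections import deque

    class Node:
        def __init__(self, name, threshold=5):
            self.name, self.queue, self.threshold = name, deque(), threshold

        def submit(self, task, peers):
            # Sender-initiated: an overloaded node pushes new work to a random peer
            if len(self.queue) >= self.threshold and peers:
                random.choice(peers).queue.append(task)
            else:
                self.queue.append(task)

    nodes = [Node(f"n{i}") for i in range(4)]
    for t in range(100):
        target = nodes[0] if t % 2 else random.choice(nodes)   # skewed arrival pattern
        target.submit(f"task{t}", [n for n in nodes if n is not target])
    print({n.name: len(n.queue) for n in nodes})               # queue lengths after balancing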
NASA Astrophysics Data System (ADS)
Takeda, Shuntaro; Furusawa, Akira
2017-09-01
We propose a scalable scheme for optical quantum computing using measurement-induced continuous-variable quantum gates in a loop-based architecture. Here, time-bin-encoded quantum information in a single spatial mode is deterministically processed in a nested loop by an electrically programmable gate sequence. This architecture can process any input state and an arbitrary number of modes with almost minimum resources, and offers a universal gate set for both qubits and continuous variables. Furthermore, quantum computing can be performed fault tolerantly by a known scheme for encoding a qubit in an infinite-dimensional Hilbert space of a single light mode.
Integrating Computer Architectures into the Design of High-Performance Controllers
NASA Technical Reports Server (NTRS)
Jacklin, Stephen A.; Leyland, Jane A.; Warmbrodt, William
1986-01-01
Modern control systems must typically perform real-time identification and control, as well as coordinate a host of other activities related to user interaction, on-line graphics, and file management. This paper discusses five global design considerations that are useful to integrate array processor, multimicroprocessor, and host computer system architecture into versatile, high-speed controllers. Such controllers are capable of very high control throughput, and can maintain constant interaction with the non-real-time or user environment. As an application example, the architecture of a high-speed, closed-loop controller used to actively control helicopter vibration will be briefly discussed. Although this system has been designed for use as the controller for real-time rotorcraft dynamics and control studies in a wind-tunnel environment, the control architecture can generally be applied to a wide range of automatic control applications.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Potok, Thomas; Schuman, Catherine; Patton, Robert
The White House and Department of Energy have been instrumental in driving the development of a neuromorphic computing program to help the United States continue its lead in basic research into (1) Beyond Exascale—high performance computing beyond Moore’s Law and von Neumann architectures, (2) Scientific Discovery—new paradigms for understanding increasingly large and complex scientific data, and (3) Emerging Architectures—assessing the potential of neuromorphic and quantum architectures. Neuromorphic computing spans a broad range of scientific disciplines from materials science to devices, to computer science, to neuroscience, all of which are required to solve the neuromorphic computing grand challenge. In our workshop we focus on the computer science aspects, specifically from a neuromorphic device through an application. Neuromorphic devices present a very different paradigm to the computer science community from traditional von Neumann architectures, which raises six major questions about building a neuromorphic application from the device level. We used these fundamental questions to organize the workshop program and to direct the workshop panels and discussions. From the white papers, presentations, panels, and discussions, there emerged several recommendations on how to proceed.
Design and Training of Limited-Interconnect Architectures
1991-07-16
and signal processing. Neuromorphic (brain-like) models allow an alternative for achieving real-time operation for such tasks, while having a... compact and robust architecture. Neuromorphic models consist of interconnections of simple computational nodes. In this approach, each node computes a... operational performance. II. Research Objectives The research objectives were: 1. Development of on-chip local training rules specifically designed for
WATERLOOP V2/64: A highly parallel machine for numerical computation
NASA Astrophysics Data System (ADS)
Ostlund, Neil S.
1985-07-01
Current technological trends suggest that the high performance scientific machines of the future are very likely to consist of a large number (greater than 1024) of processors connected and communicating with each other in some as yet undetermined manner. Such an assembly of processors should behave as a single machine in obtaining numerical solutions to scientific problems. However, the appropriate way of organizing both the hardware and software of such an assembly of processors is an unsolved and active area of research. It is particularly important to minimize the organizational overhead of interprocessor communication, global synchronization, and contention for shared resources if the performance of a large number (n) of processors is to be anything like the desirable n times the performance of a single processor. In many situations, adding a processor actually decreases the performance of the overall system, since the extra organizational overhead is larger than the extra processing power added. The systolic loop architecture is a new multiple-processor architecture which attempts a solution to the problem of how to organize a large number of asynchronous processors into an effective computational system while minimizing the organizational overhead. This paper gives a brief overview of the basic systolic loop architecture, systolic loop algorithms for numerical computation, and a 64-processor implementation of the architecture, WATERLOOP V2/64, which is being used as a testbed for exploring the hardware, software, and algorithmic aspects of the architecture.
Real-time FPGA architectures for computer vision
NASA Astrophysics Data System (ADS)
Arias-Estrada, Miguel; Torres-Huitzil, Cesar
2000-03-01
This paper presents an architecture for real-time generic convolution of a mask and an image. The architecture is intended for fast low-level image processing. The FPGA-based architecture takes advantage of the availability of registers in FPGAs to implement an efficient and compact module to process the convolutions. The architecture is designed to minimize the number of accesses to the image memory and is based on parallel modules with internal pipeline operation in order to improve its performance. The architecture is prototyped in an FPGA, but it can be implemented in dedicated VLSI to reach higher clock frequencies. Complexity issues, FPGA resource utilization, FPGA limitations, and real-time performance are discussed. Some results are presented and discussed.
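The operation mapped to hardware is the standard mask-image convolution; a direct software reference is sketched below. The FPGA design performs the same multiply-accumulate window, but with pipelined parallel modules and minimized image-memory accesses, which this sketch does not model.

    import numpy as np

    def convolve2d(image, mask):
        """Direct 2D convolution of a grayscale image with a small mask (valid region only)."""
        mh, mw = mask.shape
        out = np.zeros((image.shape[0] - mh + 1, image.shape[1] - mw + 1))
        flipped = mask[::-1, ::-1]                     # convolution flips the mask
        for r in range(out.shape[0]):
            for c in range(out.shape[1]):
                window = image[r:r + mh, c:c + mw]
                out[r, c] = np.sum(window * flipped)   # multiply-accumulate over the window
        return out

    # 3x3 Laplacian-style mask applied to a random 8-bit image
    img = np.random.randint(0, 256, (64, 64)).astype(float)
    lap = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=float)
    edges = convolve2d(img, lap)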
NASA Astrophysics Data System (ADS)
Tekin, Tolga; Töpper, Michael; Reichl, Herbert
2009-05-01
Technological frontiers between semiconductor technology, packaging, and system design are disappearing. Scaling down geometries [1] alone no longer provides improved performance, lower power, smaller size, and lower cost. It will require "More than Moore" [2] through the tighter integration of system-level components at the package level. System-in-Package (SiP) will deliver the efficient use of three dimensions (3D) through innovation in packaging and interconnect technology. A key bottleneck to the implementation of high-performance microelectronic systems, including SiP, is the lack of low-latency, high-bandwidth, and high-density off-chip interconnects. The challenges in achieving high-bandwidth chip-to-chip communication using electrical interconnects include high losses in the substrate dielectric, reflections and impedance discontinuities, and susceptibility to crosstalk [3]. The incentive for using photonics to overcome these challenges and provide low-latency, high-bandwidth communication will enable the vision of optical computing within next-generation architectures. Supercomputers today offer sustained performance of more than a petaflop, which can be increased further by utilizing optical interconnects. Next-generation computing architectures are needed that combine ultra-low power consumption with ultra-high performance through novel interconnection technologies. In this paper we discuss a CMOS-compatible underlying technology to enable next-generation optical computing architectures. By introducing a new optical layer within the 3D SiP, the development of converged microsystems for deployment in next-generation optical computing architectures will be leveraged.
Performance study of a data flow architecture
NASA Technical Reports Server (NTRS)
Adams, George
1985-01-01
Teams of scientists studied data flow concepts, static data flow machine architecture, and the VAL language. Each team mapped its application onto the machine and coded it in VAL. The principal findings of the study were: (1) Five of the seven applications used the full power of the target machine. The galactic simulation and multigrid fluid flow teams found that a significantly smaller version of the machine (16 processing elements) would suffice. (2) A number of machine design parameters including processing element (PE) function unit numbers, array memory size and bandwidth, and routing network capability were found to be crucial for optimal machine performance. (3) The study participants readily acquired VAL programming skills. (4) Participants learned that application-based performance evaluation is a sound method of evaluating new computer architectures, even those that are not fully specified. During the course of the study, participants developed models for using computers to solve numerical problems and for evaluating new architectures. These models form the bases for future evaluation studies.
NASA Technical Reports Server (NTRS)
Rutishauser, David
2006-01-01
The motivation for this work comes from an observation that amidst the push for Massively Parallel (MP) solutions to high-end computing problems such as numerical physical simulations, large amounts of legacy code exist that are highly optimized for vector supercomputers. Because re-hosting legacy code often requires a complete re-write of the original code, which can be a very long and expensive effort, this work examines the potential to exploit reconfigurable computing machines in place of a vector supercomputer to implement an essentially unmodified legacy source code. Custom and reconfigurable computing resources could be used to emulate an original application's target platform to the extent required to achieve high performance. To arrive at an architecture that delivers the desired performance subject to limited resources involves solving a multi-variable optimization problem with constraints. Prior research in the area of reconfigurable computing has demonstrated that designing an optimum hardware implementation of a given application under hardware resource constraints is an NP-complete problem. The premise of the approach is that the general issue of applying reconfigurable computing resources to the implementation of an application, maximizing the performance of the computation subject to physical resource constraints, can be made a tractable problem by assuming a computational paradigm, such as vector processing. This research contributes a formulation of the problem and a methodology to design a reconfigurable vector processing implementation of a given application that satisfies a performance metric. A generic, parametric, architectural framework for vector processing implemented in reconfigurable logic is developed as a target for a scheduling/mapping algorithm that maps an input computation to a given instance of the architecture. This algorithm is integrated with an optimization framework to arrive at a specification of the architecture parameters that attempts to minimize execution time, while staying within resource constraints. The flexibility of using a custom reconfigurable implementation is exploited in a unique manner to leverage the lessons learned in vector supercomputer development. The vector processing framework is tailored to the application, with variable parameters that are fixed in traditional vector processing. Benchmark data that demonstrates the functionality and utility of the approach is presented. The benchmark data includes an identified bottleneck in a real case study example vector code, the NASA Langley Terminal Area Simulation System (TASS) application.
Shared Memory Parallelization of an Implicit ADI-type CFD Code
NASA Technical Reports Server (NTRS)
Hauser, Th.; Huang, P. G.
1999-01-01
A parallelization study designed for ADI-type algorithms is presented using the OpenMP specification for shared-memory multiprocessor programming. Details of optimizations specifically addressed to cache-based computer architectures are described, and performance measurements for the single- and multiprocessor implementation are summarized. The paper demonstrates that optimization of memory access on a cache-based computer architecture controls the performance of the computational algorithm. A hybrid MPI/OpenMP approach is proposed for clusters of shared-memory machines to further enhance the parallel performance. The method is applied to develop a new LES/DNS code, named LESTool. A preliminary DNS calculation of a fully developed channel flow at a Reynolds number of Re_tau = 180 has shown good agreement with existing data.
Blueprint for a microwave trapped ion quantum computer.
Lekitsch, Bjoern; Weidt, Sebastian; Fowler, Austin G; Mølmer, Klaus; Devitt, Simon J; Wunderlich, Christof; Hensinger, Winfried K
2017-02-01
The availability of a universal quantum computer may have a fundamental impact on a vast number of research fields and on society as a whole. An increasingly large scientific and industrial community is working toward the realization of such a device. An arbitrarily large quantum computer may best be constructed using a modular approach. We present a blueprint for a trapped ion-based scalable quantum computer module, making it possible to create a scalable quantum computer architecture based on long-wavelength radiation quantum gates. The modules control all operations as stand-alone units, are constructed using silicon microfabrication techniques, and are within reach of current technology. To perform the required quantum computations, the modules make use of long-wavelength radiation-based quantum gate technology. To scale this microwave quantum computer architecture to a large size, we present a fully scalable design that makes use of ion transport between different modules, thereby allowing arbitrarily many modules to be connected to construct a large-scale device. A high error-threshold surface error correction code can be implemented in the proposed architecture to execute fault-tolerant operations. With appropriate adjustments, the proposed modules are also suitable for alternative trapped ion quantum computer architectures, such as schemes using photonic interconnects.
NASA Astrophysics Data System (ADS)
Yang, Wei; Hall, Trevor J.
2013-12-01
The Internet is entering an era of cloud computing to provide more cost effective, eco-friendly and reliable services to consumer and business users. As a consequence, the nature of the Internet traffic has been fundamentally transformed from a pure packet-based pattern to today's predominantly flow-based pattern. Cloud computing has also brought about an unprecedented growth in the Internet traffic. In this paper, a hybrid optical switch architecture is presented to deal with the flow-based Internet traffic, aiming to offer flexible and intelligent bandwidth on demand to improve fiber capacity utilization. The hybrid optical switch is capable of integrating IP into optical networks for cloud-based traffic with predictable performance, for which the delay performance of the electronic module in the hybrid optical switch architecture is evaluated through simulation.
Evolutionary Telemetry and Command Processor (TCP) architecture
NASA Technical Reports Server (NTRS)
Schneider, John R.
1992-01-01
A low cost, modular, high performance, and compact Telemetry and Command Processor (TCP) is being built as the foundation of command and data handling subsystems for the next generation of satellites. The TCP product line will support command and telemetry requirements for small to large spacecraft and from low to high rate data transmission. It is compatible with the latest TDRSS, STDN and SGLS transponders and provides CCSDS protocol communications in addition to standard TDM formats. Its high performance computer provides computing resources for hosted flight software. Layered and modular software provides common services using standardized interfaces to applications thereby enhancing software re-use, transportability, and interoperability. The TCP architecture is based on existing standards, distributed networking, distributed and open system computing, and packet technology. The first TCP application is planned for the 94 SDIO SPAS 3 mission. The architecture enhances rapid tailoring of functions thereby reducing costs and schedules developed for individual spacecraft missions.
Essential issues in multiprocessor systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gajski, D.D.; Peir, J.K.
1985-06-01
During the past several years, a great number of proposals have been made with the objective to increase supercomputer performance by an order of magnitude on the basis of a utilization of new computer architectures. The present paper is concerned with a suitable classification scheme for comparing these architectures. It is pointed out that there are basically four schools of thought as to the most important factor for an enhancement of computer performance. According to one school, the development of faster circuits will make it possible to retain present architectures, except, possibly, for a mechanism providing synchronization of parallel processes. A second school assigns priority to the optimization and vectorization of compilers, which will detect parallelism and help users to write better parallel programs. A third school believes in the predominant importance of new parallel algorithms, while the fourth school supports new models of computation. The merits of the four approaches are critically evaluated. 50 references.
Dual-scale topology optoelectronic processor.
Marsden, G C; Krishnamoorthy, A V; Esener, S C; Lee, S H
1991-12-15
The dual-scale topology optoelectronic processor (D-STOP) is a parallel optoelectronic architecture for matrix algebraic processing. The architecture can be used for matrix-vector multiplication and two types of vector outer product. The computations are performed electronically, which allows multiplication and summation concepts in linear algebra to be generalized to various nonlinear or symbolic operations. This generalization permits the application of D-STOP to many computational problems. The architecture uses a minimum number of optical transmitters, which thereby reduces fabrication requirements while maintaining area-efficient electronics. The necessary optical interconnections are space invariant, minimizing space-bandwidth requirements.
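The generalization mentioned above, replacing multiplication and summation with other operations, can be illustrated in software: the same matrix-vector skeleton yields ordinary linear algebra or a min-plus (shortest-path style) algebra depending on the operators passed in. This is an algebraic illustration only, not a model of the optoelectronic hardware.

    def generalized_matvec(A, x, mul=lambda a, b: a * b, add=lambda a, b: a + b, identity=0):
        """y_i = add-reduction over j of mul(A[i][j], x[j]); the operators are pluggable."""
        y = []
        for row in A:
            acc = identity
            for a_ij, x_j in zip(row, x):
                acc = add(acc, mul(a_ij, x_j))
            y.append(acc)
        return y

    A = [[1, 4], [2, 3]]
    x = [5, 6]
    print(generalized_matvec(A, x))                                   # ordinary product: [29, 28]
    print(generalized_matvec(A, x, mul=lambda a, b: a + b,            # min-plus ("tropical") algebra
                             add=min, identity=float("inf")))         # -> [6, 7]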
Specialized computer architectures for computational aerodynamics
NASA Technical Reports Server (NTRS)
Stevenson, D. K.
1978-01-01
In recent years, computational fluid dynamics has made significant progress in modelling aerodynamic phenomena. Currently, one of the major barriers to future development lies in the compute-intensive nature of the numerical formulations and the relatively high cost of performing these computations on commercially available general purpose computers, a cost that is high in terms of both dollar expenditure and elapsed time. Today's computing technology will support a program designed to create specialized computing facilities dedicated to the important problems of computational aerodynamics. One of the still unresolved questions is the organization of the computing components in such a facility. The characteristics of fluid dynamic problems which will have significant impact on the choice of computer architecture for a specialized facility are reviewed.
NASA Technical Reports Server (NTRS)
Smith, Paul H.
1988-01-01
The Computer Science Program provides advanced concepts, techniques, system architectures, algorithms, and software for both space and aeronautics information sciences and computer systems. The overall goal is to provide the technical foundation within NASA for the advancement of computing technology in aerospace applications. The research program is improving the state of knowledge of fundamental aerospace computing principles and advancing computing technology in space applications such as software engineering and information extraction from data collected by scientific instruments in space. The program includes the development of special algorithms and techniques to exploit the computing power provided by high performance parallel processors and special purpose architectures. Research is being conducted in the fundamentals of data base logic and improvement techniques for producing reliable computing systems.
Performance analysis of parallel branch and bound search with the hypercube architecture
NASA Technical Reports Server (NTRS)
Mraz, Richard T.
1987-01-01
With the availability of commercial parallel computers, researchers are examining new classes of problems which might benefit from parallel computing. This paper presents the results of an investigation of the class of search-intensive problems. The specific problem discussed is the Least-Cost Branch and Bound search method for deadline job scheduling. An object-oriented design methodology was used to map the problem onto a parallel solution. While the initial design was good for a prototype, the best performance resulted from fine-tuning the algorithm for a specific computer. The experiments analyze the computation time, the speedup over a VAX 11/785, and the load balance of the problem when using a loosely coupled multiprocessor system based on the hypercube architecture.
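A compact best-first (least-cost) branch-and-bound sketch for one deadline-scheduling formulation is given below; the total-tardiness cost model and the simple lower bound are illustrative assumptions, and the single priority queue stands in for the work that the study distributes across hypercube nodes.

    import heapq

    def least_cost_schedule(jobs):
        """jobs: list of (processing_time, deadline). Returns (total tardiness, job order).
        Best-first branch and bound over partial schedules."""
        n = len(jobs)
        heap = [(0, 0, ())]            # node = (lower bound, elapsed time, scheduled jobs)
        best = (float("inf"), None)
        while heap:
            bound, elapsed, order = heapq.heappop(heap)
            if bound >= best[0]:
                continue               # prune: cannot beat the incumbent schedule
            if len(order) == n:
                best = (bound, order)
                continue
            for j in range(n):
                if j in order:
                    continue
                p, d = jobs[j]
                finish = elapsed + p
                child_bound = bound + max(0, finish - d)   # tardiness accrued so far
                heapq.heappush(heap, (child_bound, finish, order + (j,)))
        return best

    print(least_cost_schedule([(3, 3), (2, 6), (4, 5)]))   # -> (4, (0, 1, 2))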
HyperForest: A high performance multi-processor architecture for real-time intelligent systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Garcia, P. Jr.; Rebeil, J.P.; Pollard, H.
1997-04-01
Intelligent Systems are characterized by the intensive use of computer power. The computer revolution of the last few years is what has made possible the development of the first generation of Intelligent Systems. Software for second generation Intelligent Systems will be more complex and will require more powerful computing engines in order to meet real-time constraints imposed by new robots, sensors, and applications. A multiprocessor architecture was developed that merges the advantages of message-passing and shared-memory structures: expandability and real-time compliance. The HyperForest architecture will provide an expandable real-time computing platform for computationally intensive Intelligent Systems and open the doors for the application of these systems to more complex tasks in environmental restoration and cleanup projects, flexible manufacturing systems, and DOE's own production and disassembly activities.
Performance prediction: A case study using a multi-ring KSR-1 machine
NASA Technical Reports Server (NTRS)
Sun, Xian-He; Zhu, Jianping
1995-01-01
While computers with tens of thousands of processors have successfully delivered high performance for solving some of the so-called 'grand-challenge' applications, the notion of scalability is becoming an important metric in the evaluation of parallel machine architectures and algorithms. In this study, the prediction of scalability and its application are carefully investigated. A simple formula is presented to show the relation between scalability, single-processor computing power, and degradation of parallelism. A case study is conducted on a multi-ring KSR-1 shared virtual memory machine. Experimental and theoretical results show that the influence of topology variation of an architecture is predictable. Therefore, the performance of an algorithm on a sophisticated, hierarchical architecture can be predicted, and the best algorithm-machine combination can be selected for a given application.
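The abstract states the formula only qualitatively; as a hedged sketch, assuming the relation follows the isospeed-style scalability approach associated with this line of work, the quantities can be connected as below (the symbols W, W', p, p' and \psi are this sketch's notation, not a quotation from the paper):

    % Fix the achieved average unit speed W / (p T(p)).  Scaling the machine from
    % p to p' processors requires the problem size to grow from W to W' in order
    % to sustain that speed; the scalability of the algorithm-machine combination
    % is then
    \psi(p, p') \;=\; \frac{W/p}{\,W'/p'\,} \;=\; \frac{p'\,W}{p\,W'}, \qquad 0 < \psi(p, p') \le 1 .

A value of 1 means the required work grows only linearly with the processor count (no degradation of parallelism); smaller values indicate that communication, synchronization, or topology effects, such as crossing ring boundaries on a multi-ring machine, force the work to grow faster in order to sustain the same single-processor computing power.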
SpaceCubeX: A Framework for Evaluating Hybrid Multi-Core CPU FPGA DSP Architectures
NASA Technical Reports Server (NTRS)
Schmidt, Andrew G.; Weisz, Gabriel; French, Matthew; Flatley, Thomas; Villalpando, Carlos Y.
2017-01-01
The SpaceCubeX project is motivated by the need for high-performance, modular, and scalable on-board processing to help scientists answer critical 21st century questions about global climate change, air quality, ocean health, and ecosystem dynamics, while adding new capabilities such as low-latency data products for extreme event warnings. These goals translate into on-board processing throughput requirements that are on the order of 100-1,000 times those of previous Earth Science missions for standard processing, compression, storage, and downlink operations. To study possible future architectures that achieve these performance requirements, the SpaceCubeX project provides an evolvable testbed and framework that enables a focused design space exploration of candidate hybrid CPU/FPGA/DSP processing architectures. The framework includes ArchGen, an architecture generator tool populated with candidate architecture components, performance models, and IP cores, which allows an end user to specify the type, number, and connectivity of a hybrid architecture. The framework requires minimal extensions to integrate new processors, such as the anticipated High Performance Spaceflight Computer (HPSC), reducing the time to initiate benchmarking by months. To evaluate the framework, we leverage a wide suite of high-performance embedded computing benchmarks and Earth science scenarios to ensure robust architecture characterization. We report on the project's Year 1 efforts and demonstrate its capabilities across four simulation testbed models: a baseline SpaceCube 2.0 system, a dual ARM A9 processor system, a hybrid quad ARM A53 and FPGA system, and a hybrid quad ARM A53 and DSP system.
High-performance computing — an overview
NASA Astrophysics Data System (ADS)
Marksteiner, Peter
1996-08-01
An overview of high-performance computing (HPC) is given. Different types of computer architectures used in HPC are discussed: vector supercomputers, high-performance RISC processors, various parallel computers like symmetric multiprocessors, workstation clusters, massively parallel processors. Software tools and programming techniques used in HPC are reviewed: vectorizing compilers, optimization and vector tuning, optimization for RISC processors; parallel programming techniques like shared-memory parallelism, message passing and data parallelism; and numerical libraries.
A high performance parallel computing architecture for robust image features
NASA Astrophysics Data System (ADS)
Zhou, Renyan; Liu, Leibo; Wei, Shaojun
2014-03-01
A design of a parallel architecture for image feature detection and description is proposed in this article. The major component of this architecture is a 2D cellular network composed of simple reprogrammable processors, enabling the Hessian Blob Detector and Haar Response Calculation, which are the most computing-intensive stages of the Speeded Up Robust Features (SURF) algorithm. Combining this 2D cellular network with dedicated hardware for SURF descriptors, the architecture achieves real-time image feature detection with minimal software in the host processor. A prototype FPGA implementation of the proposed architecture achieves 1318.9 GOPS of general pixel processing at a 100 MHz clock and up to 118 fps in VGA (640 × 480) image feature detection. The proposed architecture is stand-alone and scalable, so it can easily be migrated to a VLSI implementation.
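The Haar responses at the heart of the detector reduce to box sums over an integral image, which is what makes them cheap to evaluate; a scalar software sketch is given below. The cellular network evaluates many such boxes in parallel, which this reference does not attempt to model.

    import numpy as np

    def integral_image(img):
        """Summed-area table: ii[r, c] = sum of img[0:r+1, 0:c+1]."""
        return np.cumsum(np.cumsum(img, axis=0), axis=1)

    def box_sum(ii, r0, c0, r1, c1):
        """Sum of img[r0:r1, c0:c1] in O(1) using four integral-image lookups."""
        total = ii[r1 - 1, c1 - 1]
        if r0 > 0: total -= ii[r0 - 1, c1 - 1]
        if c0 > 0: total -= ii[r1 - 1, c0 - 1]
        if r0 > 0 and c0 > 0: total += ii[r0 - 1, c0 - 1]
        return total

    def haar_x_response(ii, r, c, size):
        """Horizontal Haar-like response: right half-box minus left half-box."""
        half = size // 2
        left = box_sum(ii, r, c, r + size, c + half)
        right = box_sum(ii, r, c + half, r + size, c + size)
        return right - left

    img = np.random.rand(64, 64)
    ii = integral_image(img)
    print(haar_x_response(ii, 10, 10, 8))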
Local Alignment Tool Based on Hadoop Framework and GPU Architecture
Hung, Che-Lun; Hua, Guan-Jie
2014-01-01
With the rapid growth of next generation sequencing technologies, such as Slex, more and more data have been discovered and published. To analyze such huge data sets, computational performance is an important issue. Recently, many tools, such as SOAP, have been implemented on Hadoop and GPU parallel computing architectures. BLASTP is an important tool, implemented on GPU architectures, for biologists to compare protein sequences. To deal with big biology data, it is hard to rely on a single GPU. Therefore, we implement a distributed BLASTP by combining Hadoop and multiple GPUs. The experimental results show that the proposed method improves the performance of BLASTP compared with a single GPU, and that it also achieves high availability and fault tolerance. PMID:24955362
Traffic Simulations on Parallel Computers Using Domain Decomposition Techniques
DOT National Transportation Integrated Search
1995-01-01
Large-scale simulations of Intelligent Transportation Systems (ITS) can only be achieved by using the computing resources offered by parallel computing architectures. Domain decomposition techniques are proposed which allow the performance of traffic...
Engineering and Computing Portal to Solve Environmental Problems
NASA Astrophysics Data System (ADS)
Gudov, A. M.; Zavozkin, S. Y.; Sotnikov, I. Y.
2018-01-01
This paper describes architecture and services of the Engineering and Computing Portal, which is considered to be a complex solution that provides access to high-performance computing resources, enables to carry out computational experiments, teach parallel technologies and solve computing tasks, including technogenic safety ones.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Venkata, Manjunath Gorentla; Aderholdt, William F
The pre-exascale systems are expected to have a significant amount of hierarchical and heterogeneous on-node memory, and this trend in system architecture is expected to continue from extreme-scale systems into the exascale era. Along with hierarchical-heterogeneous memory, the system typically has a high-performing network and a compute accelerator. This system architecture is not only effective for running traditional High Performance Computing (HPC) applications (Big-Compute), but also for running data-intensive HPC applications and Big-Data applications. As a consequence, there is a growing desire to have a single system serve the needs of both Big-Compute and Big-Data applications. Though the system architecture supports the convergence of Big-Compute and Big-Data, the programming models and software layer have yet to evolve to support either hierarchical-heterogeneous memory systems or the convergence. This work presents a programming abstraction to address this problem. The programming abstraction is implemented as a software library and runs on pre-exascale and exascale systems supporting current and emerging system architectures. Using distributed data structures as a central concept, it provides (1) a simple, usable, and portable abstraction for hierarchical-heterogeneous memory and (2) a unified programming abstraction for Big-Compute and Big-Data applications.
High-performance computing with quantum processing units
Britt, Keith A.; Oak Ridge National Lab.; Humble, Travis S.; ...
2017-03-01
The prospects of quantum computing have driven efforts to realize fully functional quantum processing units (QPUs). Recent success in developing proof-of-principle QPUs has prompted the question of how to integrate these emerging processors into modern high-performance computing (HPC) systems. We examine how QPUs can be integrated into current and future HPC system architectures by accounting for functional and physical design requirements. We identify two integration pathways that are differentiated by infrastructure constraints on the QPU and the use cases expected for the HPC system. This includes a tight integration that assumes infrastructure bottlenecks can be overcome as well as a loose integration that assumes they cannot. We find that the performance of both approaches is likely to depend on the quantum interconnect that serves to entangle multiple QPUs. As a result, we also identify several challenges in assessing QPU performance for HPC, and we consider new metrics that capture the interplay between system architecture and the quantum parallelism underlying computational performance.
Scout: high-performance heterogeneous computing made simple
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jablin, James; Mc Cormick, Patrick; Herlihy, Maurice
2011-01-26
Researchers must often write their own simulation and analysis software. During this process they simultaneously confront both computational and scientific problems. Current strategies for aiding the generation of performance-oriented programs do not abstract the software development from the science. Furthermore, the problem is becoming increasingly complex and pressing with the continued development of many-core and heterogeneous (CPU-GPU) architectures. To achieve high performance, scientists must expertly navigate both software and hardware. Co-design between computer scientists and research scientists can alleviate but not solve this problem. The science community requires better tools for developing, optimizing, and future-proofing codes, allowing scientists to focus on their research while still achieving high computational performance. Scout is a parallel programming language and extensible compiler framework targeting heterogeneous architectures. It provides the abstraction required to buffer scientists from the constantly-shifting details of hardware while still realizing high performance by encapsulating software and hardware optimization within a compiler framework.
NETRA: A parallel architecture for integrated vision systems. 1: Architecture and organization
NASA Technical Reports Server (NTRS)
Choudhary, Alok N.; Patel, Janak H.; Ahuja, Narendra
1989-01-01
Computer vision is regarded as one of the most complex and computationally intensive problems. An integrated vision system (IVS) is considered to be a system that uses vision algorithms from all levels of processing for a high level application (such as object recognition). A model of computation is presented for parallel processing for an IVS. Using the model, desired features and capabilities of a parallel architecture suitable for IVSs are derived. Then a multiprocessor architecture (called NETRA) is presented. This architecture is highly flexible without the use of complex interconnection schemes. The topology of NETRA is recursively defined and hence is easily scalable from small to large systems. Homogeneity of NETRA permits fault tolerance and graceful degradation under faults. It is a recursively defined tree-type hierarchical architecture where each of the leaf nodes consists of a cluster of processors connected with a programmable crossbar with selective broadcast capability to provide for desired flexibility. A qualitative evaluation of NETRA is presented. Then general schemes are described to map parallel algorithms onto NETRA. Algorithms are classified according to their communication requirements for parallel processing. An extensive analysis of inter-cluster communication strategies in NETRA is presented, and parameters affecting performance of parallel algorithms when mapped on NETRA are discussed. Finally, a methodology to evaluate performance of algorithms on NETRA is described.
Strategies for concurrent processing of complex algorithms in data driven architectures
NASA Technical Reports Server (NTRS)
Som, Sukhamoy; Stoughton, John W.; Mielke, Roland R.
1990-01-01
Performance modeling and performance enhancement for periodic execution of large-grain, decision-free algorithms in data flow architectures are discussed. Applications include real-time implementation of control and signal processing algorithms where performance is required to be highly predictable. The mapping of algorithms onto the specified class of data flow architectures is realized by a marked graph model called algorithm to architecture mapping model (ATAMM). Performance measures and bounds are established. Algorithm transformation techniques are identified for performance enhancement and reduction of resource (computing element) requirements. A systematic design procedure is described for generating operating conditions for predictable performance both with and without resource constraints. An ATAMM simulator is used to test and validate the performance prediction by the design procedure. Experiments on a three resource testbed provide verification of the ATAMM model and the design procedure.
The architecture of the High Performance Storage System (HPSS)
NASA Technical Reports Server (NTRS)
Teaff, Danny; Watson, Dick; Coyne, Bob
1994-01-01
The rapid growth in the size of datasets has caused a serious imbalance in I/O and storage system performance and functionality relative to application requirements and the capabilities of other system components. The High Performance Storage System (HPSS) is a scalable, next-generation storage system that will meet the functionality and performance requirements of large-scale scientific and commercial computing environments. Our goal is to improve the performance and capacity of storage by two orders of magnitude or more over what is available in the general or mass marketplace today. We are also providing corresponding improvements in architecture and functionality. This paper describes the architecture and functionality of HPSS.
HTMT-class Latency Tolerant Parallel Architecture for Petaflops Scale Computation
NASA Technical Reports Server (NTRS)
Sterling, Thomas; Bergman, Larry
2000-01-01
Computational Aero Sciences and other numeric intensive computation disciplines demand computing throughputs substantially greater than the Teraflops scale systems only now becoming available. The related fields of fluids, structures, thermal, combustion, and dynamic controls are among the interdisciplinary areas that in combination with sufficient resolution and advanced adaptive techniques may force performance requirements towards Petaflops. This will be especially true for compute intensive models such as Navier-Stokes, or when such system models are only part of a larger design optimization computation involving many design points. Yet recent experience with conventional MPP configurations comprising commodity processing and memory components has shown that larger scale frequently results in higher programming difficulty and lower system efficiency. While important advances in system software and algorithm techniques have had some impact on efficiency and programmability for certain classes of problems, in general it is unlikely that software alone will resolve the challenges to higher scalability. As in the past, future generations of high-end computers may require a combination of hardware architecture and system software advances to enable efficient operation at a Petaflops level. The NASA-led HTMT project has engaged the talents of a broad interdisciplinary team to develop a new strategy in high-end system architecture to deliver petaflops scale computing in the 2004/5 timeframe. The Hybrid-Technology MultiThreaded parallel computer architecture incorporates several advanced technologies in combination with an innovative dynamic adaptive scheduling mechanism to provide unprecedented performance and efficiency within practical constraints of cost, complexity, and power consumption. The emerging superconductor Rapid Single Flux Quantum electronics can operate at 100 GHz (the record is 770 GHz) at one percent of the power required by conventional semiconductor logic. Wave Division Multiplexing optical communications can approach a peak per fiber bandwidth of 1 Tbps, and the new Data Vortex network topology employing this technology can connect tens of thousands of ports, providing a bisection bandwidth on the order of a Petabyte per second with latencies well below 100 nanoseconds, even under heavy loads. Processor-in-Memory (PIM) technology combines logic and memory on the same chip, exposing the internal bandwidth of the memory row buffers at low latency. And holographic photorefractive storage technologies provide high-density memory with access a thousand times faster than conventional disk technologies. Together these technologies enable a new class of shared memory system architecture with a peak performance in the range of a Petaflops but size and power requirements comparable to today's largest Teraflops scale systems. To achieve high sustained performance, HTMT combines an advanced multithreading processor architecture with a memory-driven coarse-grained latency management strategy called "percolation", yielding high efficiency while reducing much of the parallel programming burden. This paper will present the basic system architecture characteristics made possible through this series of advanced technologies and then give a detailed description of the new percolation approach to runtime latency management.
MPI implementation of PHOENICS: A general purpose computational fluid dynamics code
NASA Astrophysics Data System (ADS)
Simunovic, S.; Zacharia, T.; Baltas, N.; Spalding, D. B.
1995-03-01
PHOENICS is a suite of computational analysis programs that are used for simulation of fluid flow, heat transfer, and dynamical reaction processes. The parallel version of the solver EARTH for the Computational Fluid Dynamics (CFD) program PHOENICS has been implemented using the Message Passing Interface (MPI) standard. Implementation of the MPI version of PHOENICS makes this computational tool portable to a wide range of parallel machines and enables the use of high performance computing for large scale computational simulations. MPI libraries are available on several parallel architectures, making the program usable across different architectures as well as on heterogeneous computer networks. The Intel Paragon NX and MPI versions of the program have been developed and tested on the massively parallel supercomputers Intel Paragon XP/S 5, XP/S 35, and Kendall Square Research, and on the multiprocessor SGI Onyx computer at Oak Ridge National Laboratory. The preliminary testing results of the developed program have shown scalable performance for reasonably sized computational domains.
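To make the MPI parallelization pattern concrete, the sketch below shows a minimal one-dimensional domain decomposition with ghost-cell (halo) exchange, the kind of communication structure a distributed CFD solver relies on. It uses mpi4py and NumPy purely for illustration; the decomposition, array names, and Jacobi-like update are assumptions of this sketch and are not taken from the PHOENICS/EARTH implementation.

```python
# Minimal sketch: 1-D domain decomposition with halo exchange, in the spirit of
# an MPI-parallel CFD solver. Run with, e.g.: mpirun -n 4 python halo_demo.py
# (the decomposition, field name, and update rule are illustrative assumptions)
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

nx_local = 100                         # interior cells owned by this rank
u = np.zeros(nx_local + 2)             # +2 ghost cells for neighbor data
u[1:-1] = rank                         # some initial data per subdomain

left = rank - 1 if rank > 0 else MPI.PROC_NULL
right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

for step in range(50):
    # Exchange ghost cells: send right edge / receive left ghost, and vice versa
    comm.Sendrecv(sendbuf=u[-2:-1], dest=right, recvbuf=u[0:1], source=left)
    comm.Sendrecv(sendbuf=u[1:2], dest=left, recvbuf=u[-1:], source=right)
    # Jacobi-like relaxation on the interior using the freshly received halos
    u[1:-1] = 0.5 * (u[:-2] + u[2:])

print(f"rank {rank}: mean interior value {u[1:-1].mean():.4f}")
```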
MPI implementation of PHOENICS: A general purpose computational fluid dynamics code
DOE Office of Scientific and Technical Information (OSTI.GOV)
Simunovic, S.; Zacharia, T.; Baltas, N.
1995-04-01
PHOENICS is a suite of computational analysis programs that are used for simulation of fluid flow, heat transfer, and dynamical reaction processes. The parallel version of the solver EARTH for the Computational Fluid Dynamics (CFD) program PHOENICS has been implemented using the Message Passing Interface (MPI) standard. Implementation of the MPI version of PHOENICS makes this computational tool portable to a wide range of parallel machines and enables the use of high performance computing for large scale computational simulations. MPI libraries are available on several parallel architectures, making the program usable across different architectures as well as on heterogeneous computer networks. The Intel Paragon NX and MPI versions of the program have been developed and tested on the massively parallel supercomputers Intel Paragon XP/S 5, XP/S 35, and Kendall Square Research, and on the multiprocessor SGI Onyx computer at Oak Ridge National Laboratory. The preliminary testing results of the developed program have shown scalable performance for reasonably sized computational domains.
Performance issues for domain-oriented time-driven distributed simulations
NASA Technical Reports Server (NTRS)
Nicol, David M.
1987-01-01
It has long been recognized that simulations form an interesting and important class of computations that may benefit from distributed or parallel processing. Since the point of parallel processing is improved performance, the recent proliferation of multiprocessors requires that we consider the performance issues that naturally arise when attempting to implement a distributed simulation. Three such issues are: (1) the problem of mapping the simulation onto the architecture, (2) the possibilities for performing redundant computation in order to reduce communication, and (3) the avoidance of deadlock due to distributed contention for message-buffer space. These issues are discussed in the context of a battlefield simulation implemented on a medium-scale multiprocessor message-passing architecture.
Performance of GeantV EM Physics Models
NASA Astrophysics Data System (ADS)
Amadio, G.; Ananya, A.; Apostolakis, J.; Aurora, A.; Bandieramonte, M.; Bhattacharyya, A.; Bianchini, C.; Brun, R.; Canal, P.; Carminati, F.; Cosmo, G.; Duhem, L.; Elvira, D.; Folger, G.; Gheata, A.; Gheata, M.; Goulas, I.; Iope, R.; Jun, S. Y.; Lima, G.; Mohanty, A.; Nikitina, T.; Novak, M.; Pokorski, W.; Ribon, A.; Seghal, R.; Shadura, O.; Vallecorsa, S.; Wenzel, S.; Zhang, Y.
2017-10-01
The recent progress in parallel hardware architectures with deeper vector pipelines or many-core technologies brings opportunities for HEP experiments to take advantage of SIMD and SIMT computing models. Launched in 2013, the GeantV project studies performance gains in propagating multiple particles in parallel, improving instruction throughput and data locality in HEP event simulation on modern parallel hardware architectures. Due to the complexity of geometry description and physics algorithms of a typical HEP application, performance analysis is indispensable in identifying factors limiting parallel execution. In this report, we will present design considerations and preliminary computing performance of GeantV physics models on coprocessors (Intel Xeon Phi and NVidia GPUs) as well as on mainstream CPUs.
DFT algorithms for bit-serial GaAs array processor architectures
NASA Technical Reports Server (NTRS)
Mcmillan, Gary B.
1988-01-01
Systems and Processes Engineering Corporation (SPEC) has developed an innovative array processor architecture for computing Fourier transforms and other commonly used signal processing algorithms. This architecture is designed to extract the highest possible array performance from state-of-the-art GaAs technology. SPEC's architectural design includes a high performance RISC processor implemented in GaAs, along with a Floating Point Coprocessor and a unique Array Communications Coprocessor, also implemented in GaAs technology. Together, these data processors represent the latest in technology, both from an architectural and implementation viewpoint. SPEC has examined numerous algorithms and parallel processing architectures to determine the optimum array processor architecture. SPEC has developed an array processor architecture with integral communications ability to provide maximum node connectivity. The Array Communications Coprocessor embeds communications operations directly in the core of the processor architecture. A Floating Point Coprocessor architecture has been defined that utilizes Bit-Serial arithmetic units, operating at very high frequency, to perform floating point operations. These Bit-Serial devices reduce the device integration level and complexity to a level compatible with state-of-the-art GaAs device technology.
Neuromorphic Computing – From Materials Research to Systems Architecture Roundtable
DOE Office of Scientific and Technical Information (OSTI.GOV)
Schuller, Ivan K.; Stevens, Rick; Pino, Robinson
2015-10-29
Computation in its many forms is the engine that fuels our modern civilization. Modern computation—based on the von Neumann architecture—has allowed, until now, the development of continuous improvements, as predicted by Moore’s law. However, computation using current architectures and materials will inevitably—within the next 10 years—reach a limit because of fundamental scientific reasons. DOE convened a roundtable of experts in neuromorphic computing systems, materials science, and computer science in Washington on October 29-30, 2015 to address the following basic questions: Can brain-like (“neuromorphic”) computing devices based on new material concepts and systems be developed to dramatically outperform conventional CMOS-based technology? If so, what are the basic research challenges for materials science and computing? The overarching answer that emerged was: The development of novel functional materials and devices incorporated into unique architectures will allow a revolutionary technological leap toward the implementation of a fully “neuromorphic” computer. To address this challenge, the following issues were considered: (1) the main differences between neuromorphic and conventional computing as related to signaling models, timing/clock, non-volatile memory, architecture, fault tolerance, integrated memory and compute, noise tolerance, analog vs. digital, and in situ learning; (2) new neuromorphic architectures needed to produce lower energy consumption, potential novel nanostructured materials, and enhanced computation; (3) device and materials properties needed to implement functions such as hysteresis, stability, and fault tolerance; and (4) comparisons of different implementations (spin torque, memristors, resistive switching, phase change, and optical schemes) for enhanced breakthroughs in performance, cost, fault tolerance, and/or manufacturability.
Optimal nonlinear information processing capacity in delay-based reservoir computers
NASA Astrophysics Data System (ADS)
Grigoryeva, Lyudmila; Henriques, Julie; Larger, Laurent; Ortega, Juan-Pablo
2015-09-01
Reservoir computing is a recently introduced brain-inspired machine learning paradigm capable of excellent performances in the processing of empirical data. We focus on a particular kind of time-delay based reservoir computers that have been physically implemented using optical and electronic systems and have shown unprecedented data processing rates. Reservoir computing is well-known for the ease of the associated training scheme but also for the problematic sensitivity of its performance to architecture parameters. This article addresses the reservoir design problem, which remains the biggest challenge in the applicability of this information processing scheme. More specifically, we use the information available regarding the optimal reservoir working regimes to construct a functional link between the reservoir parameters and its performance. This function is used to explore various properties of the device and to choose the optimal reservoir architecture, thus replacing the tedious and time-consuming parameter scans used so far in the literature.
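The two properties the abstract highlights, easy linear training of the readout versus sensitive dependence on reservoir parameters, can be illustrated with a minimal echo state network in NumPy. This is a generic software reservoir rather than the time-delay photonic implementation studied in the paper, and all parameter values (reservoir size, spectral radius, task) are illustrative assumptions.

```python
# Minimal echo state network sketch (generic software reservoir, not the
# delay-based photonic device): linear readout training vs. parameter sensitivity.
import numpy as np

rng = np.random.default_rng(0)

def run_reservoir(u, n_res=200, spectral_radius=0.9, leak=1.0):
    """Drive a random tanh reservoir with input sequence u; return the state matrix."""
    W_in = rng.uniform(-0.5, 0.5, size=(n_res, 1))
    W = rng.uniform(-0.5, 0.5, size=(n_res, n_res))
    W *= spectral_radius / max(abs(np.linalg.eigvals(W)))   # rescale spectral radius
    x = np.zeros(n_res)
    states = np.empty((len(u), n_res))
    for t, u_t in enumerate(u):
        x = (1 - leak) * x + leak * np.tanh(W @ x + W_in[:, 0] * u_t)
        states[t] = x
    return states

# Illustrative task: predict u(t+1) from the reservoir state driven by u(t).
u = np.sin(0.2 * np.arange(2001)) + 0.1 * rng.standard_normal(2001)
X, y = run_reservoir(u[:-1]), u[1:]

# Readout training is just ridge regression -- the "ease of training" noted in the abstract.
ridge = 1e-6
W_out = np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ y)
nmse = np.mean((X @ W_out - y) ** 2) / np.var(y)
print(f"one-step-ahead NMSE: {nmse:.3e}")   # varies strongly with spectral_radius, leak, etc.
```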
Job Scheduling in a Heterogeneous Grid Environment
NASA Technical Reports Server (NTRS)
Shan, Hong-Zhang; Smith, Warren; Oliker, Leonid; Biswas, Rupak
2004-01-01
Computational grids have the potential for solving large-scale scientific problems using heterogeneous and geographically distributed resources. However, a number of major technical hurdles must be overcome before this potential can be realized. One problem that is critical to effective utilization of computational grids is the efficient scheduling of jobs. This work addresses this problem by describing and evaluating a grid scheduling architecture and three job migration algorithms. The architecture is scalable and does not assume control of local site resources. The job migration policies use the availability and performance of computer systems, the network bandwidth available between systems, and the volume of input and output data associated with each job. An extensive performance comparison is presented using real workloads from leading computational centers. The results, based on several key metrics, demonstrate that the performance of our distributed migration algorithms is significantly greater than that of a local scheduling framework and comparable to a non-scalable global scheduling approach.
Optimal nonlinear information processing capacity in delay-based reservoir computers.
Grigoryeva, Lyudmila; Henriques, Julie; Larger, Laurent; Ortega, Juan-Pablo
2015-09-11
Reservoir computing is a recently introduced brain-inspired machine learning paradigm capable of excellent performances in the processing of empirical data. We focus on a particular kind of time-delay based reservoir computers that have been physically implemented using optical and electronic systems and have shown unprecedented data processing rates. Reservoir computing is well-known for the ease of the associated training scheme but also for the problematic sensitivity of its performance to architecture parameters. This article addresses the reservoir design problem, which remains the biggest challenge in the applicability of this information processing scheme. More specifically, we use the information available regarding the optimal reservoir working regimes to construct a functional link between the reservoir parameters and its performance. This function is used to explore various properties of the device and to choose the optimal reservoir architecture, thus replacing the tedious and time-consuming parameter scans used so far in the literature.
Optimal nonlinear information processing capacity in delay-based reservoir computers
Grigoryeva, Lyudmila; Henriques, Julie; Larger, Laurent; Ortega, Juan-Pablo
2015-01-01
Reservoir computing is a recently introduced brain-inspired machine learning paradigm capable of excellent performances in the processing of empirical data. We focus on a particular kind of time-delay based reservoir computers that have been physically implemented using optical and electronic systems and have shown unprecedented data processing rates. Reservoir computing is well-known for the ease of the associated training scheme but also for the problematic sensitivity of its performance to architecture parameters. This article addresses the reservoir design problem, which remains the biggest challenge in the applicability of this information processing scheme. More specifically, we use the information available regarding the optimal reservoir working regimes to construct a functional link between the reservoir parameters and its performance. This function is used to explore various properties of the device and to choose the optimal reservoir architecture, thus replacing the tedious and time-consuming parameter scans used so far in the literature. PMID:26358528
NASA Technical Reports Server (NTRS)
Chow, Edward T.; Schatzel, Donald V.; Whitaker, William D.; Sterling, Thomas
2008-01-01
A Spaceborne Processor Array in Multifunctional Structure (SPAMS) can lower the total mass of the electronic and structural overhead of spacecraft, resulting in reduced launch costs, while increasing the science return through dynamic onboard computing. SPAMS integrates the multifunctional structure (MFS) and the Gilgamesh Memory, Intelligence, and Network Device (MIND) multi-core in-memory computer architecture into a single-system super-architecture. This transforms every inch of a spacecraft into a sharable, interconnected, smart computing element to increase computing performance while simultaneously reducing mass. The MIND in-memory architecture provides a foundation for high-performance, low-power, and fault-tolerant computing. The MIND chip has an internal structure that includes memory, processing, and communication functionality. The Gilgamesh is a scalable system comprising multiple MIND chips interconnected to operate as a single, tightly coupled, parallel computer. The array of MIND components shares a global, virtual name space for program variables and tasks that are allocated at run time to the distributed physical memory and processing resources. Individual processor- memory nodes can be activated or powered down at run time to provide active power management and to configure around faults. A SPAMS system is comprised of a distributed Gilgamesh array built into MFS, interfaces into instrument and communication subsystems, a mass storage interface, and a radiation-hardened flight computer.
Analog Computation by DNA Strand Displacement Circuits.
Song, Tianqi; Garg, Sudhanshu; Mokhtar, Reem; Bui, Hieu; Reif, John
2016-08-19
DNA circuits have been widely used to develop biological computing devices because of their high programmability and versatility. Here, we propose an architecture for the systematic construction of DNA circuits for analog computation based on DNA strand displacement. The elementary gates in our architecture include addition, subtraction, and multiplication gates. The input and output of these gates are analog, which means that they are directly represented by the concentrations of the input and output DNA strands, respectively, without requiring a threshold for converting to Boolean signals. We provide detailed domain designs and kinetic simulations of the gates to demonstrate their expected performance. On the basis of these gates, we describe how DNA circuits to compute polynomial functions of inputs can be built. Using Taylor Series and Newton Iteration methods, functions beyond the scope of polynomials can also be computed by DNA circuits built upon our architecture.
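The analog encoding described here can be illustrated at the level of idealized mass-action kinetics: an addition gate can be abstracted as two reactions that each convert an input species into a common output species, so the equilibrium output concentration equals the sum of the initial input concentrations. The reaction scheme, rate constants, and species names below are illustrative abstractions of such a gate, not the paper's actual DNA strand displacement domain designs.

```python
# Idealized mass-action sketch of an analog "addition gate":
#   X1 -> Y  and  X2 -> Y  imply  [Y](t -> infinity) = [X1]_0 + [X2]_0.
# This abstracts the concentration-encoded behavior described in the abstract;
# rate constants and species are illustrative, not the actual DNA domain design.
import numpy as np
from scipy.integrate import solve_ivp

k1, k2 = 0.05, 0.05           # 1/s, assumed rate constants

def rhs(t, z):
    x1, x2, y = z
    return [-k1 * x1, -k2 * x2, k1 * x1 + k2 * x2]

x1_0, x2_0 = 30.0, 12.5       # nM, input concentrations encode the operands
sol = solve_ivp(rhs, (0.0, 600.0), [x1_0, x2_0, 0.0], rtol=1e-8)

print(f"output [Y] at t=600 s: {sol.y[2, -1]:.3f} nM (expected {x1_0 + x2_0} nM)")
```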
Proactive Fault Tolerance Using Preemptive Migration
DOE Office of Scientific and Technical Information (OSTI.GOV)
Engelmann, Christian; Vallee, Geoffroy R; Naughton, III, Thomas J
2009-01-01
Proactive fault tolerance (FT) in high-performance computing is a concept that prevents compute node failures from impacting running parallel applications by preemptively migrating application parts away from nodes that are about to fail. This paper provides a foundation for proactive FT by defining its architecture and classifying implementation options. This paper further relates prior work to the presented architecture and classification, and discusses the challenges ahead for needed supporting technologies.
Implementation of Virtualization Oriented Architecture: A Healthcare Industry Case Study
NASA Astrophysics Data System (ADS)
Rao, G. Subrahmanya Vrk; Parthasarathi, Jinka; Karthik, Sundararaman; Rao, Gvn Appa; Ganesan, Suresh
This paper presents a Virtualization Oriented Architecture (VOA) and an implementation of VOA for Hridaya, a telemedicine initiative. A Hadoop compute cloud was established at our labs, jobs that require massive computing capability, such as ECG signal analysis, were submitted to it, and the resulting study is presented in this paper. VOA takes advantage of inexpensive community PCs and provides added advantages such as Fault Tolerance, Scalability, Performance, and High Availability.
Strategies for concurrent processing of complex algorithms in data driven architectures
NASA Technical Reports Server (NTRS)
Stoughton, John W.; Mielke, Roland R.; Som, Sukhamoy
1990-01-01
The performance modeling and enhancement for periodic execution of large-grain, decision-free algorithms in data flow architectures is examined. Applications include real-time implementation of control and signal processing algorithms where performance is required to be highly predictable. The mapping of algorithms onto the specified class of data flow architectures is realized by a marked graph model called ATAMM (Algorithm To Architecture Mapping Model). Performance measures and bounds are established. Algorithm transformation techniques are identified for performance enhancement and reduction of resource (computing element) requirements. A systematic design procedure is described for generating operating conditions for predictable performance both with and without resource constraints. An ATAMM simulator is used to test and validate the performance prediction by the design procedure. Experiments on a three resource testbed provide verification of the ATAMM model and the design procedure.
Satellite on-board processing for earth resources data
NASA Technical Reports Server (NTRS)
Bodenheimer, R. E.; Gonzalez, R. C.; Gupta, J. N.; Hwang, K.; Rochelle, R. W.; Wilson, J. B.; Wintz, P. A.
1975-01-01
Results of a survey of earth resources user applications and their data requirements, earth resources multispectral scanner sensor technology, and preprocessing algorithms for correcting the sensor outputs and for data bulk reduction are presented along with a candidate data format. The computational requirements to implement the data analysis algorithms are included along with a review of computer architectures and organizations. Computer architectures capable of handling the algorithm computational requirements are suggested and the environmental effects of an on-board processor discussed. By relating performance parameters to the system requirements of each user, the feasibility of on-board processing is determined for each user. A tradeoff analysis is performed to determine the sensitivity of results to each of the system parameters. Significant results and conclusions are discussed, and recommendations are presented.
VTK-m: Accelerating the Visualization Toolkit for Massively Threaded Architectures
Moreland, Kenneth; Sewell, Christopher; Usher, William; ...
2016-05-09
Here, one of the most critical challenges for high-performance computing (HPC) scientific visualization is execution on massively threaded processors. Of the many fundamental changes we are seeing in HPC systems, one of the most profound is a reliance on new processor types optimized for execution bandwidth over latency hiding. Our current production scientific visualization software is not designed for these new types of architectures. To address this issue, the VTK-m framework serves as a container for algorithms, provides flexible data representation, and simplifies the design of visualization algorithms on new and future computer architectures.
VTK-m: Accelerating the Visualization Toolkit for Massively Threaded Architectures
Moreland, Kenneth; Sewell, Christopher; Usher, William; ...
2016-05-09
Execution on massively threaded processors is one of the most critical challenges for high-performance computing (HPC) scientific visualization. Of the many fundamental changes we are seeing in HPC systems, one of the most profound is a reliance on new processor types optimized for execution bandwidth over latency hiding. Moreover, our current production scientific visualization software is not designed for these new types of architectures. In order to address this issue, the VTK-m framework serves as a container for algorithms, provides flexible data representation, and simplifies the design of visualization algorithms on new and future computer architectures.
On the impact of approximate computation in an analog DeSTIN architecture.
Young, Steven; Lu, Junjie; Holleman, Jeremy; Arel, Itamar
2014-05-01
Deep machine learning (DML) holds the potential to revolutionize machine learning by automating rich feature extraction, which has become the primary bottleneck of human engineering in pattern recognition systems. However, the heavy computational burden renders DML systems implemented on conventional digital processors impractical for large-scale problems. The highly parallel computations required to implement large-scale deep learning systems are well suited to custom hardware. Analog computation has demonstrated power efficiency advantages of multiple orders of magnitude relative to digital systems while performing nonideal computations. In this paper, we investigate typical error sources introduced by analog computational elements and their impact on system-level performance in DeSTIN--a compositional deep learning architecture. These inaccuracies are evaluated on a pattern classification benchmark, clearly demonstrating the robustness of the underlying algorithm to the errors introduced by analog computational elements. A clear understanding of the impacts of nonideal computations is necessary to fully exploit the efficiency of analog circuits.
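The evaluation strategy described here, quantifying how nonideal analog computation degrades classification performance, can be mimicked in software by perturbing computed features with noise and measuring accuracy as a function of the noise level. The sketch below does this for a deliberately simple nearest-centroid classifier on synthetic data; the classifier, the data, and the additive Gaussian error model are assumptions of this illustration and are far simpler than the DeSTIN pipeline evaluated in the paper.

```python
# Sketch: measure robustness of a simple classifier to injected "analog" errors.
# The nearest-centroid model, synthetic data, and additive Gaussian error model
# are illustrative stand-ins for the DeSTIN pipeline evaluated in the paper.
import numpy as np

rng = np.random.default_rng(1)

# Synthetic 2-class data in a 16-dimensional feature space
n, d = 2000, 16
labels = rng.integers(0, 2, size=n)
centers = np.stack([np.zeros(d), 1.5 * np.ones(d)])
features = centers[labels] + rng.standard_normal((n, d))

# "Train": class centroids from a clean half of the data
train, test = slice(0, n // 2), slice(n // 2, n)
centroids = np.stack([features[train][labels[train] == c].mean(axis=0) for c in (0, 1)])

def accuracy(noise_std):
    # Inject additive Gaussian error into the computed features (the analog nonideality model)
    noisy = features[test] + noise_std * rng.standard_normal(features[test].shape)
    pred = np.argmin(((noisy[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
    return (pred == labels[test]).mean()

for sigma in (0.0, 0.5, 1.0, 2.0, 4.0):
    print(f"error std {sigma:3.1f}: accuracy {accuracy(sigma):.3f}")
```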
GPU-computing in econophysics and statistical physics
NASA Astrophysics Data System (ADS)
Preis, T.
2011-03-01
A recent trend in computer science and related fields is general purpose computing on graphics processing units (GPUs), which can yield impressive performance. With multiple cores connected by high memory bandwidth, today's GPUs offer resources for non-graphics parallel processing. This article provides a brief introduction into the field of GPU computing and includes examples. In particular computationally expensive analyses employed in financial market context are coded on a graphics card architecture which leads to a significant reduction of computing time. In order to demonstrate the wide range of possible applications, a standard model in statistical physics - the Ising model - is ported to a graphics card architecture as well, resulting in large speedup values.
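As a concrete example of the kind of data-parallel workload the article ports to GPUs, the sketch below implements a checkerboard Metropolis update of the 2D Ising model with NumPy array operations; the same lattice-wide, site-independent update pattern is what maps naturally onto GPU threads. The lattice size, temperature, and the use of NumPy vectorization (rather than an actual GPU kernel) are choices of this illustration.

```python
# Checkerboard Metropolis sweep of the 2D Ising model, vectorized with NumPy.
# The site-parallel update on each sublattice is the pattern that maps to GPU
# threads; lattice size and temperature below are illustrative.
import numpy as np

rng = np.random.default_rng(2)
L, T, J = 64, 2.27, 1.0                      # lattice size, temperature, coupling
spins = rng.choice([-1, 1], size=(L, L))

ii, jj = np.indices((L, L))
color = (ii + jj) % 2                        # checkerboard sublattices 0 and 1

def sweep(s):
    for c in (0, 1):
        # Sum of the four nearest neighbors (periodic boundaries)
        nn = (np.roll(s, 1, 0) + np.roll(s, -1, 0) +
              np.roll(s, 1, 1) + np.roll(s, -1, 1))
        dE = 2.0 * J * s * nn                # energy cost of flipping each spin
        accept = rng.random((L, L)) < np.exp(-dE / T)
        s = np.where((color == c) & accept, -s, s)   # flip accepted spins of this color only
    return s

for _ in range(200):
    spins = sweep(spins)

print(f"magnetization per spin after 200 sweeps: {abs(spins.mean()):.3f}")
```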
Strategies for concurrent processing of complex algorithms in data driven architectures
NASA Technical Reports Server (NTRS)
Stoughton, John W.; Mielke, Roland R.
1988-01-01
The purpose is to document research to develop strategies for concurrent processing of complex algorithms in data driven architectures. The problem domain consists of decision-free algorithms having large-grained, computationally complex primitive operations. Such algorithms are often found in signal processing and control applications. The anticipated multiprocessor environment is a data flow architecture containing between two and twenty computing elements. Each computing element is a processor having local program memory, and which communicates with a common global data memory. A new graph theoretic model called ATAMM which establishes rules for relating a decomposed algorithm to its execution in a data flow architecture is presented. The ATAMM model is used to determine strategies to achieve optimum time performance and to develop a system diagnostic software tool. In addition, preliminary work on a new multiprocessor operating system based on the ATAMM specifications is described.
Blueprint for a microwave trapped ion quantum computer
Lekitsch, Bjoern; Weidt, Sebastian; Fowler, Austin G.; Mølmer, Klaus; Devitt, Simon J.; Wunderlich, Christof; Hensinger, Winfried K.
2017-01-01
The availability of a universal quantum computer may have a fundamental impact on a vast number of research fields and on society as a whole. An increasingly large scientific and industrial community is working toward the realization of such a device. An arbitrarily large quantum computer may best be constructed using a modular approach. We present a blueprint for a trapped ion–based scalable quantum computer module, making it possible to create a scalable quantum computer architecture based on long-wavelength radiation quantum gates. The modules control all operations as stand-alone units, are constructed using silicon microfabrication techniques, and are within reach of current technology. To perform the required quantum computations, the modules make use of long-wavelength radiation–based quantum gate technology. To scale this microwave quantum computer architecture to a large size, we present a fully scalable design that makes use of ion transport between different modules, thereby allowing arbitrarily many modules to be connected to construct a large-scale device. A high error–threshold surface error correction code can be implemented in the proposed architecture to execute fault-tolerant operations. With appropriate adjustments, the proposed modules are also suitable for alternative trapped ion quantum computer architectures, such as schemes using photonic interconnects. PMID:28164154
Building a Terabyte Memory Bandwidth Compute Node with Four Consumer Electronics GPUs
NASA Astrophysics Data System (ADS)
Omlin, Samuel; Räss, Ludovic; Podladchikov, Yuri
2014-05-01
GPUs released for consumer electronics are generally built with the same chip architectures as the GPUs released for professional usage. With regard to scientific computing, there are no obvious important differences in functionality or performance between the two types of releases, yet the price can differ by up to one order of magnitude. For example, the consumer electronics release of the most recent NVIDIA Kepler architecture (GK110), named GeForce GTX TITAN, performed equally well in conducted memory bandwidth tests as the professional release, named Tesla K20; the consumer electronics release costs about one third of the professional release. We explain how to design and assemble a well-adjusted computer with four high-end consumer electronics GPUs (GeForce GTX TITAN) combining more than 1 terabyte/s of memory bandwidth. We compare the system's performance and precision with those of hardware released for professional usage. The system can be used as a powerful workstation for scientific computing or as a compute node in a home-built GPU cluster.
Song, Tianqi; Garg, Sudhanshu; Mokhtar, Reem; Bui, Hieu; Reif, John
2018-01-19
A main goal in DNA computing is to build DNA circuits to compute designated functions using a minimal number of DNA strands. Here, we propose a novel architecture to build compact DNA strand displacement circuits to compute a broad scope of functions in an analog fashion. A circuit by this architecture is composed of three autocatalytic amplifiers, and the amplifiers interact to perform computation. We show DNA circuits to compute functions sqrt(x), ln(x) and exp(x) for x in tunable ranges with simulation results. A key innovation in our architecture, inspired by Napier's use of logarithm transforms to compute square roots on a slide rule, is to make use of autocatalytic amplifiers to do logarithmic and exponential transforms in concentration and time. In particular, we convert from the input that is encoded by the initial concentration of the input DNA strand, to time, and then back again to the output encoded by the concentration of the output DNA strand at equilibrium. This combined use of strand-concentration and time encoding of computational values may have impact on other forms of molecular computation.
FPGA-based real-time phase measuring profilometry algorithm design and implementation
NASA Astrophysics Data System (ADS)
Zhan, Guomin; Tang, Hongwei; Zhong, Kai; Li, Zhongwei; Shi, Yusheng
2016-11-01
Phase measuring profilometry (PMP) has been widely used in many fields, such as Computer Aided Verification (CAV) and Flexible Manufacturing Systems (FMS). High frame-rate (HFR) real-time vision-based feedback control will be a common demand in the near future. However, the instruction time delay in the computer caused by numerous repetitive operations greatly limits the efficiency of data processing. FPGAs have the advantages of a pipelined architecture and parallel execution, and they are well suited to the PMP algorithm. In this paper, we design a fully pipelined hardware architecture for PMP. The hardware architecture implements rectification, phase calculation, phase shifting, and stereo matching. Experiments verified the performance of this method, and the factors that may influence computation accuracy were analyzed.
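The phase-calculation stage mapped onto the FPGA pipeline is, at its core, a per-pixel arctangent over a set of phase-shifted fringe images; for the common four-step variant the wrapped phase is phi = atan2(I4 - I2, I1 - I3). The NumPy sketch below generates synthetic four-step fringe images and recovers the wrapped phase. The four-step formula and the synthetic data are illustrative assumptions, since the abstract does not state which N-step variant the hardware implements.

```python
# Four-step phase-shifting phase calculation (the per-pixel core of PMP),
# demonstrated on synthetic fringe images. The four-step variant is an
# illustrative assumption; the paper's FPGA pipeline may use another N.
import numpy as np

h, w = 480, 640
x = np.arange(w)
true_phase = 2 * np.pi * 0.02 * x + 0.5 * np.sin(2 * np.pi * x / w)   # synthetic surface phase
true_phase = np.broadcast_to(true_phase, (h, w))

A, B = 128.0, 100.0                                   # background and modulation intensity
shifts = [0, np.pi / 2, np.pi, 3 * np.pi / 2]
frames = [A + B * np.cos(true_phase + d) for d in shifts]   # I1..I4

I1, I2, I3, I4 = frames
wrapped = np.arctan2(I4 - I2, I1 - I3)                # wrapped phase in (-pi, pi]

# Check against the wrapped ground truth
ref = np.angle(np.exp(1j * true_phase))
print(f"max wrapped-phase error: {np.max(np.abs(np.angle(np.exp(1j * (wrapped - ref))))):.2e} rad")
```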
NASA Astrophysics Data System (ADS)
Ford, Eric B.; Dindar, Saleh; Peters, Jorg
2015-08-01
The realism of astrophysical simulations and statistical analyses of astronomical data are set by the available computational resources. Thus, astronomers and astrophysicists are constantly pushing the limits of computational capabilities. For decades, astronomers benefited from massive improvements in computational power that were driven primarily by increasing clock speeds and required relatively little attention to the details of the computational hardware. For nearly a decade, increases in computational capabilities have come primarily from increasing the degree of parallelism, rather than increasing clock speeds. Further increases in computational capabilities will likely be led by many-core architectures such as Graphical Processing Units (GPUs) and Intel Xeon Phi. Successfully harnessing these new architectures requires significantly more understanding of the hardware architecture, cache hierarchy, compiler capabilities, and network characteristics. I will provide an astronomer's overview of the opportunities and challenges provided by modern many-core architectures and elastic cloud computing. The primary goal is to help an astronomical audience understand what types of problems are likely to yield more than an order of magnitude speed-up and which problems are unlikely to parallelize sufficiently efficiently to be worth the development time and/or costs. I will draw on my experience leading a team in developing the Swarm-NG library for parallel integration of large ensembles of small n-body systems on GPUs, as well as several smaller software projects. I will share lessons learned from collaborating with computer scientists, including both technical and soft skills. Finally, I will discuss the challenges of training the next generation of astronomers to be proficient in this new era of high-performance computing, drawing on experience teaching a graduate class on High-Performance Scientific Computing for Astrophysics and organizing a 2014 advanced summer school on Bayesian Computing for Astronomical Data Analysis with support of the Penn State Center for Astrostatistics and Institute for CyberScience.
Eichenberger, Alexandre E; Gschwind, Michael K; Gunnels, John A
2013-11-05
Mechanisms for performing matrix multiplication operations with data pre-conditioning in a high performance computing architecture are provided. A vector load operation is performed to load a first vector operand of the matrix multiplication operation to a first target vector register. A load and splat operation is performed to load an element of a second vector operand and replicating the element to each of a plurality of elements of a second target vector register. A multiply add operation is performed on elements of the first target vector register and elements of the second target vector register to generate a partial product of the matrix multiplication operation. The partial product of the matrix multiplication operation is accumulated with other partial products of the matrix multiplication operation.
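The vector-load plus load-and-splat plus multiply-add pattern described above amounts to computing the product one rank-1 style update at a time: a column vector of the first operand is loaded, one element of the second operand is replicated ("splat") across a vector register, and a multiply-add accumulates into the result. The NumPy sketch below mirrors that structure in software purely for illustration; register widths, blocking, and the data pre-conditioning step itself are abstracted away.

```python
# Matrix multiply expressed as "vector load + load-and-splat + multiply-add"
# accumulation, mirroring the pattern described in the abstract. Register
# widths, blocking, and data pre-conditioning are abstracted away here.
import numpy as np

rng = np.random.default_rng(3)
M, K, N = 8, 5, 6
A = rng.standard_normal((M, K))
B = rng.standard_normal((K, N))
C = np.zeros((M, N))

for j in range(N):                      # one column of C at a time
    acc = np.zeros(M)                   # accumulator "vector register"
    for k in range(K):
        a_vec = A[:, k]                 # vector load: a column of A
        b_splat = np.full(M, B[k, j])   # load-and-splat: replicate one element of B
        acc += a_vec * b_splat          # multiply-add into the accumulator
    C[:, j] = acc

print("matches np.matmul:", np.allclose(C, A @ B))
```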
Chemical calculations on Cray computers
NASA Technical Reports Server (NTRS)
Taylor, Peter R.; Bauschlicher, Charles W., Jr.; Schwenke, David W.
1989-01-01
The influence of recent developments in supercomputing on computational chemistry is discussed with particular reference to Cray computers and their pipelined vector/limited parallel architectures. After reviewing Cray hardware and software, the performance of different elementary program structures is examined, and effective methods for improving program performance are outlined. The computational strategies appropriate for obtaining optimum performance in applications to quantum chemistry and dynamics are discussed. Finally, some discussion is given of new developments and future hardware and software improvements.
Measurements of the LHCb software stack on the ARM architecture
NASA Astrophysics Data System (ADS)
Vijay Kartik, S.; Couturier, Ben; Clemencic, Marco; Neufeld, Niko
2014-06-01
The ARM architecture is a power-efficient design that is used in most processors in mobile devices all around the world today since they provide reasonable compute performance per watt. The current LHCb software stack is designed (and thus expected) to build and run on machines with the x86/x86_64 architecture. This paper outlines the process of measuring the performance of the LHCb software stack on the ARM architecture - specifically, the ARMv7 architecture on Cortex-A9 processors from NVIDIA and on full-fledged ARM servers with chipsets from Calxeda - and makes comparisons with the performance on x86_64 architectures on the Intel Xeon L5520/X5650 and AMD Opteron 6272. The paper emphasises the aspects of performance per core with respect to the power drawn by the compute nodes for the given performance - this ensures a fair real-world comparison with much more 'powerful' Intel/AMD processors. The comparisons of these real workloads in the context of LHCb are also complemented with the standard synthetic benchmarks HEPSPEC and Coremark. The pitfalls and solutions for the non-trivial task of porting the source code to build for the ARMv7 instruction set are presented. The specific changes in the build process needed for ARM-specific portions of the software stack are described, to serve as pointers for further attempts taken up by other groups in this direction. Cases where architecture-specific tweaks at the assembler level (both in ROOT and the LHCb software stack) were needed for a successful compile are detailed - these cases are good indicators of where/how the software stack as well as the build system can be made more portable and multi-arch friendly. The experience gained from the tasks described in this paper is intended to i) assist in making an informed choice about ARM-based server solutions as a feasible low-power alternative to the current compute nodes, and ii) revisit the software design and build system for portability and generic improvements.
Parallel Evolutionary Optimization for Neuromorphic Network Training
DOE Office of Scientific and Technical Information (OSTI.GOV)
Schuman, Catherine D; Disney, Adam; Singh, Susheela
One of the key impediments to the success of current neuromorphic computing architectures is the issue of how best to program them. Evolutionary optimization (EO) is one promising programming technique; in particular, its wide applicability makes it especially attractive for neuromorphic architectures, which can have many different characteristics. In this paper, we explore different facets of EO on a spiking neuromorphic computing model called DANNA. We focus on the performance of EO in the design of our DANNA simulator, and on how to structure EO on both multicore and massively parallel computing systems. We evaluate how our parallel methods impact the performance of EO on Titan, the U.S.'s largest open science supercomputer, and BOB, a Beowulf-style cluster of Raspberry Pi's. We also focus on how to improve the EO by evaluating commonality in higher performing neural networks, and present the result of a study that evaluates the EO performed by Titan.
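The structure of the evolutionary optimization described here, evaluate a population in parallel, select the fittest, and mutate them to form the next generation, can be sketched independently of the DANNA model. The toy fitness function, mutation scheme, and use of a process pool below are assumptions of this illustration; the actual EO operates on spiking network parameters and runs at much larger scale on systems like Titan.

```python
# Generic parallel evolutionary-optimization loop: evaluate a population in
# parallel, keep the fittest, mutate to form the next generation. The fitness
# function and mutation scheme are toy stand-ins for neuromorphic network training.
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def fitness(genome):
    # Toy objective (higher is better); stands in for simulated network performance
    return -float(np.sum((genome - 0.7) ** 2))

def evolve(pop_size=64, genome_len=32, generations=50, elite=8, sigma=0.05, seed=4):
    rng = np.random.default_rng(seed)
    population = rng.random((pop_size, genome_len))
    with ProcessPoolExecutor() as pool:
        for gen in range(generations):
            # Fitness evaluations are independent, so they parallelize naturally
            scores = np.fromiter(pool.map(fitness, population), dtype=float, count=pop_size)
            order = np.argsort(scores)[::-1]
            parents = population[order[:elite]]                              # selection
            children = parents[rng.integers(0, elite, pop_size - elite)]
            children = children + sigma * rng.standard_normal(children.shape)  # mutation
            population = np.vstack([parents, children])
    return population[np.argmax([fitness(g) for g in population])]

if __name__ == "__main__":
    best = evolve()
    print(f"best fitness after evolution: {fitness(best):.4f}")
```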
Real-time field programmable gate array architecture for computer vision
NASA Astrophysics Data System (ADS)
Arias-Estrada, Miguel; Torres-Huitzil, Cesar
2001-01-01
This paper presents an architecture for real-time generic convolution of a mask and an image. The architecture is intended for fast low-level image processing. The field programmable gate array (FPGA)-based architecture takes advantage of the availability of registers in FPGAs to implement an efficient and compact module to process the convolutions. The architecture is designed to minimize the number of accesses to the image memory and it is based on parallel modules with internal pipeline operation in order to improve its performance. The architecture is prototyped in a FPGA, but it can be implemented on dedicated very- large-scale-integrated devices to reach higher clock frequencies. Complexity issues, FPGA resources utilization, FPGA limitations, and real-time performance are discussed. Some results are presented and discussed.
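Functionally, the architecture computes a generic 2D convolution of a mask with an image while buffering data so each pixel is read from memory only once. The NumPy sketch below expresses the same computation as an accumulation of shifted images, one partial product per mask coefficient, which is the data-movement structure a register-based pipeline exploits; the image size, mask, and zero-padded boundaries are illustrative choices rather than details of the FPGA design.

```python
# Generic mask/image convolution written as an accumulation of shifted images,
# one partial product per mask coefficient -- the structure a register-based
# FPGA pipeline exploits. Mask, image, and zero-padded boundaries are illustrative.
# (Correlation form; flip the mask for strict convolution.)
import numpy as np

def convolve2d_shifted(image, mask):
    kh, kw = mask.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)))
    out = np.zeros_like(image, dtype=float)
    for di in range(kh):
        for dj in range(kw):
            # Each mask coefficient multiplies a shifted copy of the image (a partial product)
            out += mask[di, dj] * padded[di:di + image.shape[0], dj:dj + image.shape[1]]
    return out

rng = np.random.default_rng(5)
img = rng.random((128, 128))
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
edges = convolve2d_shifted(img, sobel_x)
print(f"output range: [{edges.min():.3f}, {edges.max():.3f}]")
```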
Gigaflop architecture, a hardware perspective
NASA Technical Reports Server (NTRS)
Feierbach, G. F.
1978-01-01
Any super computer built in the early 1980s will use components that are available by fall 1978. The architecture of such a system cannot depart radically from current super computers if the software experience painfully acquired from these computers in the 70's is to apply. Given the above constraints, 10 billion floating point operations per second (BFLOPS) are attainable and a problem memory of 512 million (64 bit) words could be supported by the technology of the time. In contrast to this, industry is likely to respond with commercially available machines with a performance of less than 150 MFLOPS. This is due to self-imposed constraints on the manufacturers to provide upward compatible architectures (same instruction set) and systems which can be sold in significant volumes. Since this computing speed is inadequate to meet the demands of computational fluid dynamics, a special processor is required. Issues which are felt to be significant in the pursuit of maximum compute capability in this special processor are discussed.
NASA Astrophysics Data System (ADS)
Zaveri, Mazad Shaheriar
The semiconductor/computer industry has been following Moore's law for several decades and has reaped the benefits in speed and density of the resultant scaling. Transistor density has reached almost one billion per chip, and transistor delays are in picoseconds. However, scaling has slowed down, and the semiconductor industry is now facing several challenges. Hybrid CMOS/nano technologies, such as CMOL, are considered as an interim solution to some of the challenges. Another potential architectural solution includes specialized architectures for applications/models in the intelligent computing domain, one aspect of which includes abstract computational models inspired from the neuro/cognitive sciences. Consequently in this dissertation, we focus on the hardware implementations of Bayesian Memory (BM), which is a (Bayesian) Biologically Inspired Computational Model (BICM). This model is a simplified version of George and Hawkins' model of the visual cortex, which includes an inference framework based on Judea Pearl's belief propagation. We then present a "hardware design space exploration" methodology for implementing and analyzing the (digital and mixed-signal) hardware for the BM. This particular methodology involves: analyzing the computational/operational cost and the related micro-architecture, exploring candidate hardware components, proposing various custom hardware architectures using both traditional CMOS and hybrid nanotechnology - CMOL, and investigating the baseline performance/price of these architectures. The results suggest that CMOL is a promising candidate for implementing a BM. Such implementations can utilize the very high density storage/computation benefits of these new nano-scale technologies much more efficiently; for example, the throughput per 858 mm2 (TPM) obtained for CMOL based architectures is 32 to 40 times better than the TPM for a CMOS based multiprocessor/multi-FPGA system, and almost 2000 times better than the TPM for a PC implementation. We later use this methodology to investigate the hardware implementations of cortex-scale spiking neural system, which is an approximate neural equivalent of BICM based cortex-scale system. The results of this investigation also suggest that CMOL is a promising candidate to implement such large-scale neuromorphic systems. In general, the assessment of such hypothetical baseline hardware architectures provides the prospects for building large-scale (mammalian cortex-scale) implementations of neuromorphic/Bayesian/intelligent systems using state-of-the-art and beyond state-of-the-art silicon structures.
Patterns and Practices for Future Architectures
2014-08-01
[Recoverable report metadata] Subject terms: computing architecture, graph algorithms, high-performance computing, big data, GPU. Figure-list topics: Graph500 breadth-first search (BFS) reference implementation (single CPU, list-based), data structures created by Kernel 1 and used by Kernel 2, and data structures for a sequential CSR algorithm.
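From the recoverable figure titles, the report evidently covers Graph500-style breadth-first search over a compressed sparse row (CSR) graph representation. The sketch below shows a minimal sequential CSR BFS purely to illustrate those two ingredients; it is not the report's reference implementation, and the small example graph is invented.

```python
# Minimal sequential BFS over a CSR graph representation, illustrating the data
# structures named in the report's figure list. Not the Graph500 reference code;
# the example graph below is invented for illustration.
import numpy as np
from collections import deque

# CSR arrays for a small undirected graph: row_ptr[i]:row_ptr[i+1] indexes
# the neighbors of vertex i inside col_idx.
row_ptr = np.array([0, 2, 5, 7, 9, 10])
col_idx = np.array([1, 2, 0, 2, 3, 0, 1, 1, 4, 3])

def bfs_csr(row_ptr, col_idx, source):
    n = len(row_ptr) - 1
    parent = np.full(n, -1)
    parent[source] = source
    frontier = deque([source])
    while frontier:
        u = frontier.popleft()
        for v in col_idx[row_ptr[u]:row_ptr[u + 1]]:
            if parent[v] == -1:        # unvisited
                parent[v] = u
                frontier.append(v)
    return parent

print("BFS parent array from vertex 0:", bfs_csr(row_ptr, col_idx, 0))
```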
Automated problem scheduling and reduction of synchronization delay effects
NASA Technical Reports Server (NTRS)
Saltz, Joel H.
1987-01-01
It is anticipated that in order to make effective use of many future high performance architectures, programs will have to exhibit at least a medium grained parallelism. A framework is presented for partitioning very sparse triangular systems of linear equations that is designed to produce favorable performance results in a wide variety of parallel architectures. Efficient methods for solving these systems are of interest because: (1) they provide a useful model problem for use in exploring heuristics for the aggregation, mapping and scheduling of relatively fine grained computations whose data dependencies are specified by directed acyclic graphs, and (2) because such efficient methods can find direct application in the development of parallel algorithms for scientific computation. Simple expressions are derived that describe how to schedule computational work with varying degrees of granularity. The Encore Multimax was used as a hardware simulator to investigate the performance effects of using the partitioning techniques presented in shared memory architectures with varying relative synchronization costs.
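One standard way to expose the medium-grained parallelism this abstract targets is level scheduling: group the unknowns of a sparse lower-triangular system into levels such that every unknown depends only on unknowns in earlier levels, then solve each level's unknowns concurrently. The sketch below computes such levels and performs the solve level by level (sequentially, for clarity); the row-wise sparse storage and the small example matrix are illustrative, and this is a generic technique rather than the paper's specific partitioning heuristics.

```python
# Level scheduling of a sparse lower-triangular solve: rows in the same level
# have no dependencies on each other and could be solved concurrently.
# Generic technique for illustration, not the paper's specific heuristics.
import numpy as np

# Row-wise sparse storage of a lower-triangular matrix L (diagonal nonzero):
# rows[i] is a list of (column, value) pairs with column <= i.
rows = [
    [(0, 2.0)],
    [(1, 3.0)],
    [(0, 1.0), (2, 4.0)],
    [(1, -1.0), (2, 2.0), (3, 5.0)],
    [(0, 0.5), (4, 1.0)],
]
b = np.array([2.0, 3.0, 6.0, 10.0, 2.0])

# Level of each row: 0 if it depends only on itself,
# otherwise 1 + max level of the rows it depends on.
n = len(rows)
level = np.zeros(n, dtype=int)
for i in range(n):
    deps = [j for j, _ in rows[i] if j != i]
    level[i] = (1 + max(level[deps])) if deps else 0

x = np.zeros(n)
for lvl in range(level.max() + 1):
    # All rows in this level are independent and could be solved in parallel.
    for i in np.flatnonzero(level == lvl):
        diag = dict(rows[i])[i]
        x[i] = (b[i] - sum(v * x[j] for j, v in rows[i] if j != i)) / diag

print("solution x:", x)
```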
Efficient parallel implementation of active appearance model fitting algorithm on GPU.
Wang, Jinwei; Ma, Xirong; Zhu, Yuanping; Sun, Jizhou
2014-01-01
The active appearance model (AAM) is one of the most powerful model-based object detecting and tracking methods which has been widely used in various situations. However, the high-dimensional texture representation causes very time-consuming computations, which makes the AAM difficult to apply to real-time systems. The emergence of modern graphics processing units (GPUs) that feature a many-core, fine-grained parallel architecture provides new and promising solutions to overcome the computational challenge. In this paper, we propose an efficient parallel implementation of the AAM fitting algorithm on GPUs. Our design idea is fine-grained parallelism in which we distribute the texture data of the AAM, pixel by pixel, to thousands of parallel GPU threads for processing, which makes the algorithm fit better into the GPU architecture. We implement our algorithm using the compute unified device architecture (CUDA) on Nvidia's GTX 650 GPU, which has the latest Kepler architecture. To compare the performance of our algorithm with different data sizes, we built sixteen face AAM models of different dimensional textures. The experiment results show that our parallel AAM fitting algorithm can achieve real-time performance for videos even on very high-dimensional textures.
Efficient Parallel Implementation of Active Appearance Model Fitting Algorithm on GPU
Wang, Jinwei; Ma, Xirong; Zhu, Yuanping; Sun, Jizhou
2014-01-01
The active appearance model (AAM) is one of the most powerful model-based object detecting and tracking methods which has been widely used in various situations. However, the high-dimensional texture representation causes very time-consuming computations, which makes the AAM difficult to apply to real-time systems. The emergence of modern graphics processing units (GPUs) that feature a many-core, fine-grained parallel architecture provides new and promising solutions to overcome the computational challenge. In this paper, we propose an efficient parallel implementation of the AAM fitting algorithm on GPUs. Our design idea is fine-grained parallelism in which we distribute the texture data of the AAM, pixel by pixel, to thousands of parallel GPU threads for processing, which makes the algorithm fit better into the GPU architecture. We implement our algorithm using the compute unified device architecture (CUDA) on Nvidia's GTX 650 GPU, which has the latest Kepler architecture. To compare the performance of our algorithm with different data sizes, we built sixteen face AAM models of different dimensional textures. The experiment results show that our parallel AAM fitting algorithm can achieve real-time performance for videos even on very high-dimensional textures. PMID:24723812
High Performance GPU-Based Fourier Volume Rendering.
Abdellah, Marwan; Eldeib, Ayman; Sharawi, Amr
2015-01-01
Fourier volume rendering (FVR) is a significant visualization technique that has been used widely in digital radiography. As a result of its O(N^2 log N) time complexity, it provides a faster alternative to spatial domain volume rendering algorithms that are O(N^3) computationally complex. Relying on the Fourier projection-slice theorem, this technique operates on the spectral representation of a 3D volume instead of processing its spatial representation to generate attenuation-only projections that look like X-ray radiographs. Due to the rapid evolution of its underlying architecture, the graphics processing unit (GPU) became an attractive competent platform that can deliver giant computational raw power compared to the central processing unit (CPU) on a per-dollar-basis. The introduction of the compute unified device architecture (CUDA) technology enables embarrassingly-parallel algorithms to run efficiently on CUDA-capable GPU architectures. In this work, a high performance GPU-accelerated implementation of the FVR pipeline on CUDA-enabled GPUs is presented. This proposed implementation can achieve a speed-up of 117x compared to a single-threaded hybrid implementation that uses the CPU and GPU together by taking advantage of executing the rendering pipeline entirely on recent GPU architectures.
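The projection-slice theorem that FVR relies on states that the 2D Fourier transform of a parallel projection of a volume equals a central 2D slice of the volume's 3D Fourier transform; extracting a slice and inverse-transforming it is what makes each view O(N^2 log N). The NumPy sketch below verifies this for the axis-aligned case; the synthetic volume and the unrotated zero-frequency slice are illustrative simplifications of the resampled, rotated slice a renderer would extract.

```python
# Fourier projection-slice check: the kz = 0 slice of the 3-D FFT of a volume,
# inverse 2-D transformed, equals the parallel projection (sum) along z.
# Axis-aligned case only; a renderer resamples a rotated central slice.
import numpy as np

rng = np.random.default_rng(6)
N = 64
vol = rng.random((N, N, N))

spectrum = np.fft.fftn(vol)              # O(N^3 log N), done once per volume
central_slice = spectrum[:, :, 0]        # kz = 0 plane of the 3-D spectrum
projection_fvr = np.fft.ifft2(central_slice).real   # O(N^2 log N) per view

projection_direct = vol.sum(axis=2)      # brute-force spatial-domain projection

print("slice theorem holds:", np.allclose(projection_fvr, projection_direct))
```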
Colt: an experiment in wormhole run-time reconfiguration
NASA Astrophysics Data System (ADS)
Bittner, Ray; Athanas, Peter M.; Musgrove, Mark
1996-10-01
Wormhole run-time reconfiguration (RTR) is an attempt to create a refined computing paradigm for high performance computational tasks. By combining concepts from field programmable gate array (FPGA) technologies with data flow computing, the Colt/Stallion architecture achieves high utilization of hardware resources, and facilitates rapid run-time reconfiguration. Targeted mainly at DSP-type operations, the Colt integrated circuit -- a prototype wormhole RTR device -- compares favorably to contemporary DSP alternatives in terms of silicon area consumed per unit computation and in computing performance. Although emphasis has been placed on signal processing applications, general purpose computation has not been overlooked. Colt is a prototype that defines an architecture not only at the chip level but also in terms of an overall system design. As this system is realized, the concept of wormhole RTR will be applied to numerical computation and DSP applications including those common to image processing, communications systems, digital filters, acoustic processing, real-time control systems and simulation acceleration.
Applications of an architecture design and assessment system (ADAS)
NASA Technical Reports Server (NTRS)
Gray, F. Gail; Debrunner, Linda S.; White, Tennis S.
1988-01-01
A new Architecture Design and Assessment System (ADAS) tool package is introduced, and a range of possible applications is illustrated. ADAS was used to evaluate the performance of an advanced fault-tolerant computer architecture in a modern flight control application. Bottlenecks were identified and possible solutions suggested. The tool was also used to inject faults into the architecture and evaluate the synchronization algorithm, and improvements are suggested. Finally, ADAS was used as a front end research tool to aid in the design of reconfiguration algorithms in a distributed array architecture.
PIMS: Memristor-Based Processing-in-Memory-and-Storage.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Cook, Jeanine
Continued progress in computing has augmented the quest for higher performance with a new quest for higher energy efficiency. This has led to the re-emergence of Processing-In-Memory (PIM) architectures that offer higher density and performance with some boost in energy efficiency. Past PIM work either integrated a standard CPU with a conventional DRAM to improve the CPU-memory link, or used a bit-level processor with Single Instruction Multiple Data (SIMD) control, but neither matched the energy consumption of the memory to the computation. We originally proposed to develop a new architecture derived from PIM that more effectively addressed energy efficiency for high performance scientific, data analytics, and neuromorphic applications. We also originally planned to implement a von Neumann architecture with arithmetic/logic units (ALUs) that matched the power consumption of an advanced storage array to maximize energy efficiency. Implementing this architecture in storage was our original idea, since by augmenting storage (instead of memory), the system could address both in-memory computation and applications that access larger data sets directly from storage, hence Processing-in-Memory-and-Storage (PIMS). However, as our research matured, we discovered several things that changed our original direction, the most important being that a PIM that implements a standard von Neumann-type architecture results in significant energy efficiency improvement, but only about a O(10) performance improvement. In addition to this, the emergence of new memory technologies moved us to proposing a non-von Neumann architecture, called Superstrider, implemented not in storage, but in a new DRAM technology called High Bandwidth Memory (HBM). HBM is a stacked DRAM technology that includes a logic layer where an architecture such as Superstrider could potentially be implemented.
NASA Astrophysics Data System (ADS)
Hadade, Ioan; di Mare, Luca
2016-08-01
Modern multicore and manycore processors exhibit multiple levels of parallelism through a wide range of architectural features such as SIMD for data parallel execution or threads for core parallelism. The exploitation of multi-level parallelism is therefore crucial for achieving superior performance on current and future processors. This paper presents the performance tuning of a multiblock CFD solver on Intel SandyBridge and Haswell multicore CPUs and the Intel Xeon Phi Knights Corner coprocessor. Code optimisations have been applied on two computational kernels exhibiting different computational patterns: the update of flow variables and the evaluation of the Roe numerical fluxes. We discuss at great length the code transformations required for achieving efficient SIMD computations for both kernels across the selected devices including SIMD shuffles and transpositions for flux stencil computations and global memory transformations. Core parallelism is expressed through threading based on a number of domain decomposition techniques together with optimisations pertaining to alleviating NUMA effects found in multi-socket compute nodes. Results are correlated with the Roofline performance model in order to assert their efficiency for each distinct architecture. We report significant speedups for single thread execution across both kernels: 2-5X on the multicore CPUs and 14-23X on the Xeon Phi coprocessor. Computations at full node and chip concurrency deliver a factor of three speedup on the multicore processors and up to 24X on the Xeon Phi manycore coprocessor.
ERIC Educational Resources Information Center
Mills, Kim; Fox, Geoffrey
1994-01-01
Describes the InfoMall, a program led by the Northeast Parallel Architectures Center (NPAC) at Syracuse University (New York). The InfoMall features a partnership of approximately 24 organizations offering linked programs in High Performance Computing and Communications (HPCC) technology integration, software development, marketing, education and…
Elastic Cloud Computing Architecture and System for Heterogeneous Spatiotemporal Computing
NASA Astrophysics Data System (ADS)
Shi, X.
2017-10-01
Spatiotemporal computation implements a variety of different algorithms. When big data are involved, a desktop computer or standalone application may not be able to complete the computation task due to limited memory and computing power. Now that a variety of hardware accelerators and computing platforms are available to improve the performance of geocomputation, different algorithms may behave differently on different computing infrastructures and platforms. Some are perfect for implementation on a cluster of graphics processing units (GPUs), while GPUs may not be useful for certain kinds of spatiotemporal computation. The same holds for utilizing a cluster of Intel's many-integrated-core (MIC) processors, or Xeon Phi, as well as Hadoop or Spark platforms, to handle big spatiotemporal data. Furthermore, considering the energy efficiency requirement in general computation, a Field Programmable Gate Array (FPGA) may be a better solution when its computational performance is similar to or better than that of GPUs and MICs. It is expected that an elastic cloud computing architecture and system that integrates GPUs, MICs, and FPGAs could be developed and deployed to support spatiotemporal computing over heterogeneous data types and computational problems.
NASA Astrophysics Data System (ADS)
Rucinski, Marek; Coates, Adam; Montano, Giuseppe; Allouis, Elie; Jameux, David
2015-09-01
The Lightweight Advanced Robotic Arm Demonstrator (LARAD) is a state-of-the-art, two-meter long robotic arm for planetary surface exploration currently being developed by a UK consortium led by Airbus Defence and Space Ltd under contract to the UK Space Agency (CREST-2 programme). LARAD has a modular design, which allows for experimentation with different electronics and control software. The control system architecture includes the on-board computer, control software and firmware, and the communication infrastructure (e.g. data links, switches) connecting on-board computer(s), sensors, actuators and the end-effector. The purpose of the control system is to operate the arm according to pre-defined performance requirements, monitoring its behaviour in real-time and performing safing/recovery actions in case of faults. This paper reports on the results of a recent study about the feasibility of the development and integration of a novel control system architecture for LARAD fully based on the SpaceWire protocol. The current control system architecture is based on the combination of two communication protocols, Ethernet and CAN. The new SpaceWire-based control system will allow for improved monitoring and telecommanding performance thanks to higher communication data rate, allowing for the adoption of advanced control schemes, potentially based on multiple vision sensors, and for the handling of sophisticated end-effectors that require fine control, such as science payloads or robotic hands.
GW Calculations of Materials on the Intel Xeon-Phi Architecture
NASA Astrophysics Data System (ADS)
Deslippe, Jack; da Jornada, Felipe H.; Vigil-Fowler, Derek; Biller, Ariel; Chelikowsky, James R.; Louie, Steven G.
Intel Xeon-Phi processors are expected to power a large number of High-Performance Computing (HPC) systems around the United States and the world in the near future. We evaluate the ability of GW and pre-requisite Density Functional Theory (DFT) calculations for materials to utilize the Xeon-Phi architecture. We describe the optimization process and performance improvements achieved. We find that the GW method, like other higher level Many-Body methods beyond standard local/semilocal approximations to Kohn-Sham DFT, is particularly well suited for many-core architectures due to the ability to exploit a large amount of parallelism over plane-waves, band-pairs and frequencies. Support provided by the SCIDAC program, Department of Energy, Office of Science, Advanced Scientific Computing Research and Basic Energy Sciences. Grant Numbers DE-SC0008877 (Austin) and DE-AC02-05CH11231 (LBNL).
NASA Astrophysics Data System (ADS)
Pruhs, Kirk
A particularly important emergent technology is heterogeneous processors (or cores), which many computer architects believe will be the dominant architectural design in the future. The main advantage of a heterogeneous architecture, relative to an architecture of identical processors, is that it allows for the inclusion of processors whose design is specialized for particular types of jobs, and for jobs to be assigned to the processor best suited for them. Most notably, it is envisioned that these heterogeneous architectures will consist of a small number of high-power, high-performance processors for critical jobs, and a larger number of lower-power, lower-performance processors for less critical jobs. Naturally, the lower-power processors would be more energy efficient in terms of the computation performed per unit of energy expended, and would generate less heat per unit of computation. For a given area and power budget, heterogeneous designs can give significantly better performance for standard workloads. Moreover, even processors that were designed to be homogeneous are increasingly likely to be heterogeneous at run time: the dominant underlying cause is the increasing variability in the fabrication process as the feature size is scaled down (although run-time faults will also play a role). Since manufacturing yields would be unacceptably low if every processor/core were required to be perfect, and since there would be significant performance loss from derating the entire chip to the functioning of the least functional processor (which is what would be required in order to attain processor homogeneity), some processor heterogeneity seems inevitable in chips with many processors/cores.
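A toy sketch of the assignment idea described above (not drawn from the text): critical jobs are placed on the few high-performance cores and the remaining jobs on the slower but more energy-efficient cores. Core counts, names, and the round-robin policy are assumptions made purely for illustration.

    def assign_jobs(jobs, n_fast=2, n_slow=8):
        """jobs: list of (name, critical: bool). Returns {core_id: [job names]}.
        Fast cores are 'F0..', slow cores 'S0..'; simple round-robin placement."""
        fast = [f"F{i}" for i in range(n_fast)]
        slow = [f"S{i}" for i in range(n_slow)]
        plan = {c: [] for c in fast + slow}
        fi = si = 0
        for name, critical in jobs:
            if critical:
                plan[fast[fi % n_fast]].append(name); fi += 1
            else:
                plan[slow[si % n_slow]].append(name); si += 1
        return plan

    print(assign_jobs([("render", True), ("logging", False), ("gc", False)]))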
Power Efficient Hardware Architecture of SHA-1 Algorithm for Trusted Mobile Computing
NASA Astrophysics Data System (ADS)
Kim, Mooseop; Ryou, Jaecheol
The Trusted Mobile Platform (TMP) is developed and promoted by the Trusted Computing Group (TCG), an industry standards body working to enhance the security of the mobile computing environment. The built-in SHA-1 engine in TMP is one of the most important circuit blocks and contributes to the performance of the whole platform because it is used as a key primitive supporting platform integrity and command authentication. Mobile platforms have very stringent limitations with respect to available power, physical circuit area, and cost. Therefore, special architectures and design methods for a low power SHA-1 circuit are required. In this paper, we present a novel and efficient hardware architecture of a low power SHA-1 design for TMP. Our low power SHA-1 hardware can compute a 512-bit data block using fewer than 7,000 gates and draws about 1.1 mA on a 0.25μm CMOS process.
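For context only, a software reference point (not the hardware design itself): the digest that a SHA-1 core must produce for a given message can be checked against a library implementation. The 64-byte message below is arbitrary; note that the library applies the standard padding, so this is SHA-1 of a 64-byte message rather than of a single raw unpadded block.

    import hashlib

    message = bytes(range(64))                    # arbitrary 64-byte test message
    digest = hashlib.sha1(message).hexdigest()    # reference result a hardware core must match
    print(digest)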
DOE Office of Scientific and Technical Information (OSTI.GOV)
Armstrong, Robert C.; Ray, Jaideep; Malony, A.
2003-11-01
We present a case study of performance measurement and modeling of a CCA (Common Component Architecture) component-based application in a high performance computing environment. We explore issues peculiar to component-based HPC applications and propose a performance measurement infrastructure for HPC based loosely on recent work done for Grid environments. A prototypical implementation of the infrastructure is used to collect data for three components in a scientific application and to construct performance models for two of them. Both computational and message-passing performance are addressed.
Design and Analysis of a Neuromemristive Reservoir Computing Architecture for Biosignal Processing
Kudithipudi, Dhireesha; Saleh, Qutaiba; Merkel, Cory; Thesing, James; Wysocki, Bryant
2016-01-01
Reservoir computing (RC) is gaining traction in several signal processing domains, owing to its non-linear stateful computation, spatiotemporal encoding, and reduced training complexity over recurrent neural networks (RNNs). Previous studies have shown the effectiveness of software-based RCs for a wide spectrum of applications. A parallel body of work indicates that realizing RNN architectures using custom integrated circuits and reconfigurable hardware platforms yields significant improvements in power and latency. In this research, we propose a neuromemristive RC architecture, with a doubly twisted toroidal structure, that is validated for biosignal processing applications. We exploit device mismatch to implement the random weight distributions within the reservoir and propose mixed-signal subthreshold circuits for energy efficiency. A comprehensive analysis is performed to compare the efficiency of the neuromemristive RC architecture in both digital (reconfigurable) and subthreshold mixed-signal realizations. Both electroencephalogram (EEG) and electromyogram (EMG) biosignal benchmarks are used for validating the RC designs. The proposed RC architecture demonstrated an accuracy of 90% and 84% for epileptic seizure detection and EMG prosthetic finger control, respectively. PMID:26869876
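A generic software sketch of the reservoir-computing principle (fixed random recurrent weights, trained linear readout), not the neuromemristive design of the paper. The toy input, target, reservoir size, and spectral-radius scaling are all assumptions.

    import numpy as np

    rng = np.random.default_rng(1)
    n_in, n_res, T = 1, 100, 500

    W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
    W = rng.uniform(-0.5, 0.5, (n_res, n_res))
    W *= 0.9 / max(abs(np.linalg.eigvals(W)))       # keep spectral radius below 1

    u = rng.standard_normal((T, n_in))              # toy input signal
    y = np.roll(u[:, 0], 1)                         # toy target: previous input sample

    x = np.zeros(n_res); states = []
    for t in range(T):
        x = np.tanh(W_in @ u[t] + W @ x)            # reservoir update (weights stay fixed)
        states.append(x.copy())
    S = np.array(states)

    W_out, *_ = np.linalg.lstsq(S, y, rcond=None)   # train only the linear readout
    print(np.mean((S @ W_out - y) ** 2))            # readout training error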
Craciun, Stefan; Brockmeier, Austin J; George, Alan D; Lam, Herman; Príncipe, José C
2011-01-01
Methods for decoding movements from neural spike counts using adaptive filters often rely on minimizing the mean-squared error. However, for non-Gaussian distribution of errors, this approach is not optimal for performance. Therefore, rather than using probabilistic modeling, we propose an alternate non-parametric approach. In order to extract more structure from the input signal (neuronal spike counts) we propose using minimum error entropy (MEE), an information-theoretic approach that minimizes the error entropy as part of an iterative cost function. However, the disadvantage of using MEE as the cost function for adaptive filters is the increase in computational complexity. In this paper we present a comparison between the decoding performance of the analytic Wiener filter and a linear filter trained with MEE, which is then mapped to a parallel architecture in reconfigurable hardware tailored to the computational needs of the MEE filter. We observe considerable speedup from the hardware design. The adaptation of filter weights for the multiple-input, multiple-output linear filters, necessary in motor decoding, is a highly parallelizable algorithm. It can be decomposed into many independent computational blocks with a parallel architecture readily mapped to a field-programmable gate array (FPGA) and scales to large numbers of neurons. By pipelining and parallelizing independent computations in the algorithm, the proposed parallel architecture has sublinear increases in execution time with respect to both window size and filter order.
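A minimal sketch of the analytic Wiener (least-squares) baseline referred to above, fitted on synthetic spike-count data; the MEE cost function and its FPGA mapping are not shown, and all array shapes and noise levels are assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    T, n_neurons, n_outputs = 1000, 30, 2

    X = rng.poisson(3.0, (T, n_neurons)).astype(float)           # spike counts per bin
    true_W = rng.standard_normal((n_neurons, n_outputs))
    Y = X @ true_W + 0.1 * rng.standard_normal((T, n_outputs))   # synthetic kinematics

    X1 = np.hstack([X, np.ones((T, 1))])                         # append bias column
    W_wiener, *_ = np.linalg.lstsq(X1, Y, rcond=None)            # closed-form least-squares fit
    print(np.mean((X1 @ W_wiener - Y) ** 2))                     # mean-squared decoding error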
FPGA-based architecture for motion recovering in real-time
NASA Astrophysics Data System (ADS)
Arias-Estrada, Miguel; Maya-Rueda, Selene E.; Torres-Huitzil, Cesar
2002-03-01
A key problem in the computer vision field is the measurement of object motion in a scene. The main goal is to compute an approximation of the 3D motion from the analysis of an image sequence. Once computed, this information can be used as a basis to reach higher level goals in different applications. Motion estimation algorithms pose a significant computational load for sequential processors, limiting their use in practical applications. In this work we propose a hardware architecture for real-time motion estimation based on FPGA technology. The technique used for motion estimation is optical flow, chosen for its accuracy and the density of its velocity estimates, although other techniques are being explored. The architecture is composed of parallel modules working in a pipeline scheme to reach high throughput rates approaching gigaflops. The modules are organized in a regular structure to provide a high degree of flexibility to cover different applications. Some results are presented, and the real-time performance is discussed and analyzed. The architecture is prototyped on an FPGA board with a Virtex device interfaced to a digital imager.
Strategies for concurrent processing of complex algorithms in data driven architectures
NASA Technical Reports Server (NTRS)
Stoughton, John W.; Mielke, Roland R.
1987-01-01
The results of ongoing research directed at developing a graph theoretical model for describing data and control flow associated with the execution of large-grained algorithms in a spatially distributed computer environment are presented. This model is identified by the acronym ATAMM (Algorithm/Architecture Mapping Model). The purpose of such a model is to provide a basis for establishing rules for relating an algorithm to its execution in a multiprocessor environment. Specifications derived from the model lead directly to the description of a data flow architecture which is a consequence of the inherent behavior of the data and control flow described by the model. The purpose of the ATAMM based architecture is to optimize computational concurrency in the multiprocessor environment and to provide an analytical basis for performance evaluation. The ATAMM model and architecture specifications are demonstrated on a prototype system for concept validation.
GaAs Supercomputing: Architecture, Language, And Algorithms For Image Processing
NASA Astrophysics Data System (ADS)
Johl, John T.; Baker, Nick C.
1988-10-01
The application of high-speed GaAs processors in a parallel system matches the demanding computational requirements of image processing. The architecture of the McDonnell Douglas Astronautics Company (MDAC) vector processor is described along with the algorithms and language translator. Most image and signal processing algorithms can utilize parallel processing and show a significant performance improvement over sequential versions. The parallelization performed by this system is within each vector instruction. Since each vector has many elements, each requiring some computation, useful concurrent arithmetic operations can easily be performed. Balancing the memory bandwidth with the computation rate of the processors is an important design consideration for high efficiency and utilization. The architecture features a bus-based execution unit consisting of four to eight 32-bit GaAs RISC microprocessors running at a 200 MHz clock rate for a peak performance of 1.6 BOPS. The execution unit is connected to a vector memory with three buses capable of transferring two input words and one output word every 10 nsec. The address generators inside the vector memory perform different vector addressing modes and feed the data to the execution unit. The functions discussed in this paper include basic MATRIX OPERATIONS, 2-D SPATIAL CONVOLUTION, HISTOGRAM, and FFT. For each of these algorithms, assembly language programs were run on a behavioral model of the system to obtain performance figures.
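As a plain software counterpart of the 2-D SPATIAL CONVOLUTION kernel listed above (a scalar reference, with none of the vector-instruction parallelism the paper describes), a direct "valid"-region convolution can be written as follows; image and kernel sizes are arbitrary.

    import numpy as np

    def conv2d_valid(image, kernel):
        """Direct 2-D spatial convolution over the 'valid' output region."""
        kh, kw = kernel.shape
        k = kernel[::-1, ::-1]                   # flip kernel for true convolution
        out_h = image.shape[0] - kh + 1
        out_w = image.shape[1] - kw + 1
        out = np.zeros((out_h, out_w))
        for i in range(out_h):
            for j in range(out_w):
                out[i, j] = np.sum(image[i:i+kh, j:j+kw] * k)
        return out

    img = np.random.rand(64, 64)
    kern = np.ones((3, 3)) / 9.0                 # 3x3 box blur
    print(conv2d_valid(img, kern).shape)         # (62, 62)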
On implementation of DCTCP on three-tier and fat-tree data center network topologies.
Zafar, Saima; Bashir, Abeer; Chaudhry, Shafique Ahmad
2016-01-01
A data center is a facility for housing computational and storage systems interconnected through a communication network called the data center network (DCN). Due to tremendous growth in computational power, storage capacity, and the number of interconnected servers, the DCN faces challenges concerning efficiency, reliability, and scalability. Although the transmission control protocol (TCP) is a time-tested transport protocol in the Internet, DCN challenges such as inadequate buffer space in switches and bandwidth limitations have prompted researchers to propose techniques to improve TCP performance or to design new transport protocols for the DCN. Data center TCP (DCTCP) has emerged as one of the most promising solutions in this domain; it employs the explicit congestion notification feature of TCP to enhance the TCP congestion control algorithm. While DCTCP has been analyzed for a two-tier tree-based DCN topology with traffic between servers in the same rack, which is common in cloud applications, it remains oblivious to the traffic patterns common in university and private enterprise networks, which traverse the complete network interconnect spanning the upper tier layers. We also recognize that DCTCP performance cannot remain unaffected by the underlying DCN architecture, hence there is a need to test and compare DCTCP performance when implemented over diverse DCN architectures. Some of the most notable DCN architectures are the legacy three-tier, fat-tree, BCube, DCell, VL2, and CamCube. In this research, we simulate the two switch-centric DCN architectures, the widely deployed legacy three-tier architecture and the promising fat-tree architecture, using a network simulator and analyze the performance of DCTCP in terms of throughput and delay for realistic traffic patterns. We also examine how DCTCP prevents incast and outcast congestion when realistic DCN traffic patterns are employed in the above-mentioned topologies. Our results show that the underlying DCN architecture significantly impacts DCTCP performance. We find that DCTCP gives optimal performance in the fat-tree topology and is most suitable for large networks.
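A sketch of the standard published DCTCP window rule that such simulations exercise (parameter values and the marking trace below are illustrative, not taken from this study): the sender keeps a running estimate alpha of the fraction of ECN-marked packets and cuts its congestion window in proportion to it.

    def dctcp_update(cwnd, alpha, marked, total, g=1.0 / 16):
        """One round-trip of DCTCP congestion control.
        marked/total : ECN-marked and total acknowledged packets in this window
        alpha        : running estimate of the marking fraction
        """
        F = marked / max(total, 1)
        alpha = (1 - g) * alpha + g * F               # EWMA of the marking fraction
        if marked:
            cwnd = max(cwnd * (1 - alpha / 2), 1.0)   # backoff proportional to congestion
        else:
            cwnd += 1.0                               # standard additive increase
        return cwnd, alpha

    cwnd, alpha = 10.0, 0.0
    for marked in [0, 0, 4, 8, 0]:                    # hypothetical per-RTT marking counts
        cwnd, alpha = dctcp_update(cwnd, alpha, marked, total=10)
        print(round(cwnd, 2), round(alpha, 3))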
VASSAR: Value assessment of system architectures using rules
NASA Astrophysics Data System (ADS)
Selva, D.; Crawley, E. F.
A key step of the mission development process is the selection of a system architecture, i.e., the layout of the major high-level system design decisions. This step typically involves the identification of a set of candidate architectures and a cost-benefit analysis to compare them. Computational tools have been used in the past to bring rigor and consistency into this process. These tools can automatically generate architectures by enumerating different combinations of decisions and options. They can also evaluate these architectures by applying cost models and simplified performance models. Current performance models are purely quantitative tools that are best suited to evaluating the technical performance of a mission design. However, assessing the relative merit of a system architecture is a much more holistic task than evaluating the performance of a mission design. Indeed, the merit of a system architecture comes from satisfying a variety of stakeholder needs, some of which are easy to quantify, and some of which are harder to quantify (e.g., elegance, scientific value, political robustness, flexibility). Moreover, assessing the merit of a system architecture at these very early stages of design often requires dealing with a mix of quantitative and semi-qualitative data, and of objective and subjective information. Current computational tools are poorly suited for these purposes. In this paper, we propose a general methodology that can be used to assess the relative merit of several candidate system architectures in the presence of objective, subjective, quantitative, and qualitative stakeholder needs. The methodology is called VASSAR (Value ASsessment for System Architectures using Rules). The major underlying assumption of the VASSAR methodology is that the merit of a system architecture can be assessed by comparing the capabilities of the architecture with the stakeholder requirements. Hence, for example, a candidate architecture that fully satisfies all critical stakeholder requirements is a good architecture. The assessment process is thus fundamentally seen as a pattern matching process where capabilities match requirements, which motivates the use of rule-based expert systems (RBES). This paper describes the VASSAR methodology and shows how it can be applied to a large complex space system, namely an Earth observation satellite system. Companion papers show its applicability to the NASA space communications and navigation program and the joint NOAA-DoD NPOESS program.
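A toy pattern-matching sketch of the capability-versus-requirement idea behind this kind of assessment; it is not the VASSAR rule-based expert system, and the requirement names and weights are invented for illustration.

    def score_architecture(capabilities, requirements):
        """Weighted fraction of stakeholder requirements satisfied by an architecture.
        capabilities : set of capability strings the architecture provides
        requirements : dict mapping requirement -> stakeholder weight
        """
        satisfied = sum(w for req, w in requirements.items() if req in capabilities)
        return satisfied / sum(requirements.values())

    reqs = {"soil_moisture_daily": 3.0, "global_coverage": 2.0, "low_data_latency": 1.0}
    arch_a = {"soil_moisture_daily", "global_coverage"}
    arch_b = {"global_coverage", "low_data_latency"}
    print(score_architecture(arch_a, reqs), score_architecture(arch_b, reqs))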
A visual programming environment for the Navier-Stokes computer
NASA Technical Reports Server (NTRS)
Tomboulian, Sherryl; Crockett, Thomas W.; Middleton, David
1988-01-01
The Navier-Stokes computer is a high-performance, reconfigurable, pipelined machine designed to solve large computational fluid dynamics problems. Due to the complexity of the architecture, development of effective, high-level language compilers for the system appears to be a very difficult task. Consequently, a visual programming methodology has been developed which allows users to program the system at an architectural level by constructing diagrams of the pipeline configuration. These schematic program representations can then be checked for validity and automatically translated into machine code. The visual environment is illustrated by using a prototype graphical editor to program an example problem.
Improving Quantum Gate Simulation using a GPU
NASA Astrophysics Data System (ADS)
Gutierrez, Eladio; Romero, Sergio; Trenas, Maria A.; Zapata, Emilio L.
2008-11-01
Due to the increasing computing power of graphics processing units (GPUs), they are becoming more and more popular for solving general purpose algorithms. Since the simulation of quantum computers is a problem of exponential complexity, it is advisable to perform the computation in parallel, such as on the SIMD multiprocessors present in recent GPUs. In this paper, we focus on an important quantum algorithm, the quantum Fourier transform (QFT), in order to evaluate different parallelization strategies on a novel GPU architecture. Our implementation makes use of the new CUDA software/hardware architecture developed recently by NVIDIA.
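For reference, the dense-matrix form of the QFT that such a simulator must apply to a state vector of length 2^n (shown here in NumPy, without any of the GPU parallelization the paper studies); the 3-qubit example state is arbitrary.

    import numpy as np

    def qft_matrix(n_qubits):
        """Unitary matrix of the quantum Fourier transform on n_qubits qubits."""
        N = 2 ** n_qubits
        omega = np.exp(2j * np.pi / N)
        j, k = np.meshgrid(np.arange(N), np.arange(N))
        return omega ** (j * k) / np.sqrt(N)

    n = 3
    state = np.zeros(2 ** n, dtype=complex)
    state[1] = 1.0                            # basis state |001>
    out = qft_matrix(n) @ state               # exponentially large matrix-vector product
    print(np.round(out, 3))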
Program optimizations: The interplay between power, performance, and energy
Leon, Edgar A.; Karlin, Ian; Grant, Ryan E.; ...
2016-05-16
Practical considerations for future supercomputer designs will impose limits on both instantaneous power consumption and total energy consumption. Working within these constraints while providing the maximum possible performance, application developers will need to optimize their code for speed alongside power and energy concerns. This paper analyzes the effectiveness of several code optimizations including loop fusion, data structure transformations, and global allocations. A per-component measurement and analysis of different architectures is performed, enabling the examination of code optimizations on different compute subsystems. Using an explicit hydrodynamics proxy application from the U.S. Department of Energy, LULESH, we show how code optimizations impact different computational phases of the simulation. This provides insight for simulation developers into the best optimizations to use during particular simulation compute phases when optimizing code for future supercomputing platforms. Here, we examine and contrast both x86 and Blue Gene architectures with respect to these optimizations.
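Loop fusion, one of the optimizations listed, can be shown in a few lines (a generic sketch, not LULESH code): fusing two sweeps over the same array removes one full pass through memory and improves locality.

    import numpy as np

    x = np.random.rand(100_000)
    y = np.empty_like(x)
    z = np.empty_like(x)

    # Unfused: two separate sweeps, so x is streamed from memory twice.
    for i in range(x.size):
        y[i] = 2.0 * x[i]
    for i in range(x.size):
        z[i] = x[i] + 1.0

    # Fused: one sweep produces both results, cutting memory traffic roughly in half.
    for i in range(x.size):
        xi = x[i]
        y[i] = 2.0 * xi
        z[i] = xi + 1.0

    print(y[0], z[0])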
NASA Astrophysics Data System (ADS)
Menzel, R.; Paynter, D.; Jones, A. L.
2017-12-01
Due to their relatively low computational cost, radiative transfer models in global climate models (GCMs) run on traditional CPU architectures generally consist of shortwave and longwave parameterizations over a small number of wavelength bands. With the rise of newer GPU and MIC architectures, however, the performance of high resolution line-by-line radiative transfer models may soon approach that of the physical parameterizations currently employed in GCMs. Here we present an analysis of the current performance of a new line-by-line radiative transfer model under development at GFDL. Although originally designed specifically to exploit GPU architectures through the use of CUDA, the radiative transfer model has recently been extended to include OpenMP in an effort to also effectively target MIC architectures such as Intel's Xeon Phi. Using input data provided by the upcoming Radiative Forcing Model Intercomparison Project (RFMIP, as part of CMIP 6), we compare model results and performance data for various model configurations and spectral resolutions run on both GPU and Intel Knights Landing architectures to analogous runs of the standard Oxford Reference Forward Model on traditional CPUs.
Multi-threaded ATLAS simulation on Intel Knights Landing processors
NASA Astrophysics Data System (ADS)
Farrell, Steven; Calafiura, Paolo; Leggett, Charles; Tsulaia, Vakhtang; Dotti, Andrea; ATLAS Collaboration
2017-10-01
The Knights Landing (KNL) release of the Intel Many Integrated Core (MIC) Xeon Phi line of processors is a potential game changer for HEP computing. With 72 cores and deep vector registers, the KNL cards promise significant performance benefits for highly-parallel, compute-heavy applications. Cori, the newest supercomputer at the National Energy Research Scientific Computing Center (NERSC), was delivered to its users in two phases with the first phase online at the end of 2015 and the second phase now online at the end of 2016. Cori Phase 2 is based on the KNL architecture and contains over 9000 compute nodes with 96GB DDR4 memory. ATLAS simulation with the multithreaded Athena Framework (AthenaMT) is a good potential use-case for the KNL architecture and supercomputers like Cori. ATLAS simulation jobs have a high ratio of CPU computation to disk I/O and have been shown to scale well in multi-threading and across many nodes. In this paper we will give an overview of the ATLAS simulation application with details on its multi-threaded design. Then, we will present a performance analysis of the application on KNL devices and compare it to a traditional x86 platform to demonstrate the capabilities of the architecture and evaluate the benefits of utilizing KNL platforms like Cori for ATLAS production.
Space Debris Detection on the HPDP, a Coarse-Grained Reconfigurable Array Architecture for Space
NASA Astrophysics Data System (ADS)
Suarez, Diego Andres; Bretz, Daniel; Helfers, Tim; Weidendorfer, Josef; Utzmann, Jens
2016-08-01
Stream processing, widely used in communications and digital signal processing applications, requires high-throughput data processing that is achieved in most cases using Application-Specific Integrated Circuit (ASIC) designs. Lack of programmability is an issue especially in space applications, which use on-board components with long life-cycles requiring application updates. To this end, the High Performance Data Processor (HPDP) architecture integrates an array of coarse-grained reconfigurable elements to provide both flexible and efficient computational power suitable for stream-based data processing applications in space. In this work the capabilities of the HPDP architecture are demonstrated with the implementation of a real-time image processing algorithm for space debris detection in a space-based space surveillance system. The implementation challenges and alternatives are described, making trade-offs to improve performance at the expense of negligible degradation of detection accuracy. The proposed implementation uses over 99% of the available computational resources. Performance estimations based on simulations show that the HPDP can amply match the application requirements.
NASA Astrophysics Data System (ADS)
Liyanagedera, Chamika M.; Sengupta, Abhronil; Jaiswal, Akhilesh; Roy, Kaushik
2017-12-01
Stochastic spiking neural networks based on nanoelectronic spin devices can be a possible pathway to achieving "brainlike" compact and energy-efficient cognitive intelligence. The computational model attempts to exploit the intrinsic device stochasticity of nanoelectronic synaptic or neural components to perform learning or inference. However, there has been limited analysis of the scaling effect of stochastic spin devices and its impact on the operation of such stochastic networks at the system level. This work attempts to explore the design space and analyze the performance of nanomagnet-based stochastic neuromorphic computing architectures for magnets with different barrier heights. We illustrate how the underlying network architecture must be modified to account for the random telegraphic switching behavior displayed by magnets with low barrier heights as they are scaled into the superparamagnetic regime. We perform a device-to-system-level analysis on a deep neural-network architecture for a digit-recognition problem on the MNIST data set.
Parallel compression/decompression-based datapath architecture for multibeam mask writers
NASA Astrophysics Data System (ADS)
Chaudhary, Narendra; Savari, Serap A.
2017-10-01
Multibeam electron beam systems will be used in the future for mask writing and for complementary lithography. The major challenges of the multibeam systems are in meeting throughput requirements and in handling the large data volumes associated with writing grayscale data on the wafer. In terms of future communications and computational requirements, Amdahl's law suggests that a simple increase of computation power and parallelism may not be a sustainable solution. We propose a parallel data compression algorithm to exploit the sparsity of mask data and a grayscale video-like representation of data. To improve the communication and computational efficiency of these systems at the write time, we propose an alternate datapath architecture partly motivated by multibeam direct-write lithography and partly motivated by the circuit testing literature, where parallel decompression reduces clock cycles. We explain a deflection plate architecture inspired by NuFlare Technology's multibeam mask writing system and how our datapath architecture can be easily added to it to improve performance.
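A toy run-length encoder for a mostly-empty grayscale row, illustrating the kind of redundancy in sparse mask data that such a compression-based datapath could exploit; it is not the authors' compression algorithm, and the sample row is invented.

    def rle_encode(row):
        """Run-length encode one row of grayscale dose values as [(value, count), ...]."""
        runs = []
        for v in row:
            if runs and runs[-1][0] == v:
                runs[-1][1] += 1            # extend the current run
            else:
                runs.append([v, 1])         # start a new run
        return [tuple(r) for r in runs]

    row = [0] * 50 + [128, 255, 255, 64] + [0] * 40    # mostly-empty mask row
    encoded = rle_encode(row)
    print(len(row), "pixels ->", len(encoded), "runs:", encoded)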
Chessa, Manuela; Bianchi, Valentina; Zampetti, Massimo; Sabatini, Silvio P; Solari, Fabio
2012-01-01
The intrinsic parallelism of visual neural architectures based on distributed hierarchical layers is well suited to implementation on the multi-core architectures of modern graphics cards. Design strategies that allow us to take optimal advantage of such parallelism, in order to efficiently map the hierarchy of layers and the canonical neural computations onto the GPU, are proposed. Specifically, the advantages of a cortical map-like representation of the data are exploited. Moreover, a GPU implementation of a novel neural architecture for the computation of binocular disparity from stereo image pairs, based on populations of binocular energy neurons, is presented. The implemented neural model achieves good performance in terms of the reliability of the disparity estimates and near real-time execution speed, thus demonstrating the effectiveness of the devised design strategies. The proposed approach is valid in general, since the neural building blocks we implemented are a common basis for the modeling of visual neural functionalities.
AltiVec performance increases for autonomous robotics for the MARSSCAPE architecture program
NASA Astrophysics Data System (ADS)
Gothard, Benny M.
2002-02-01
One of the main tall poles that must be overcome to develop a fully autonomous vehicle is the inability of the computer to understand its surrounding environment to the level required for the intended task. The military mission scenario requires a robot to interact in a complex, unstructured, dynamic environment (see "A High Fidelity Multi-Sensor Scene Understanding System for Autonomous Navigation"). The Mobile Autonomous Robot Software Self Composing Adaptive Programming Environment (MarsScape) perception research addresses three aspects of the problem: sensor system design, processing architectures, and algorithm enhancements. A prototype perception system has been demonstrated on robotic High Mobility Multi-purpose Wheeled Vehicle and All Terrain Vehicle testbeds. This paper addresses the tall pole of processing requirements and the performance improvements based on the selected MarsScape processing architecture. The processor chosen is the Motorola AltiVec-G4 PowerPC (PPC), a highly parallelized commercial Single Instruction Multiple Data processor. Both derived perception benchmarks and actual perception subsystem code are benchmarked and compared against previous Demo II Semi-autonomous Surrogate Vehicle processing architectures, along with desktop personal computers (PCs). Performance gains are highlighted along with progress to date, and lessons learned and future directions are described.
Irregular Applications: Architectures & Algorithms
DOE Office of Scientific and Technical Information (OSTI.GOV)
Feo, John T.; Villa, Oreste; Tumeo, Antonino
Irregular applications are characterized by irregular data structures and irregular control and communication patterns. Novel irregular high-performance applications that deal with large data sets have recently appeared. Unfortunately, current high-performance systems and software infrastructures execute irregular algorithms poorly. Only coordinated efforts by end users, domain specialists, and computer scientists that consider both the architecture and the software stack may be able to provide solutions to the challenges of modern irregular applications.
NASA Astrophysics Data System (ADS)
Yarovyi, Andrii A.; Timchenko, Leonid I.; Kozhemiako, Volodymyr P.; Kokriatskaia, Nataliya I.; Hamdi, Rami R.; Savchuk, Tamara O.; Kulyk, Oleksandr O.; Surtel, Wojciech; Amirgaliyev, Yedilkhan; Kashaganova, Gulzhan
2017-08-01
The paper deals with the problem of the insufficient productivity of existing computing hardware for large image processing, which does not meet the modern requirements posed by the resource-intensive computing tasks of laser beam profiling. The research concentrated on one of the profiling problems, namely real-time processing of spot images of the laser beam profile. Development of a theory of parallel-hierarchic transformation made it possible to produce models for high-performance parallel-hierarchical processes, as well as algorithms and software for their implementation based on a GPU-oriented architecture using GPGPU technologies. The analyzed performance of the suggested computerized tools for processing and classification of laser beam profile images allows real-time processing of dynamic images of various sizes.
High Performance Semantic Factoring of Giga-Scale Semantic Graph Databases
DOE Office of Scientific and Technical Information (OSTI.GOV)
Joslyn, Cliff A.; Adolf, Robert D.; Al-Saffar, Sinan
2010-10-04
As semantic graph database technology grows to address components ranging from extant large triple stores to SPARQL endpoints over SQL-structured relational databases, it will become increasingly important to be able to bring high performance computational resources to bear on their analysis, interpretation, and visualization, especially with respect to their innate semantic structure. Our research group built a novel high performance hybrid system comprising computational capability for semantic graph database processing utilizing the large multithreaded architecture of the Cray XMT platform, conventional clusters, and large data stores. In this paper we describe that architecture and present the results of deploying it for the analysis of the Billion Triple dataset with respect to its semantic factors.
NASA Astrophysics Data System (ADS)
Dutta, Sandeep; Gros, Eric
2018-03-01
Deep Learning (DL) has been successfully applied in numerous fields, fueled by increasing computational power and access to data. However, for medical imaging tasks, limited training set size is a common challenge when applying DL. This paper explores the applicability of DL to the task of classifying a single axial slice from a CT exam into one of six anatomy regions. A total of 29,000 images selected from 223 CT exams were manually labeled for ground truth. An additional 54 exams were labeled and used as an independent test set. The network architecture developed for this application is composed of 6 convolutional layers and 2 fully connected layers with ReLU non-linear activations between each layer. Max-pooling was used after every second convolutional layer, and a softmax layer was used at the end. Given this base architecture, the effect of including network architecture components such as Dropout and Batch Normalization on network performance and training is explored. The network performance as a function of training and validation set size is characterized by training each network architecture variation using 5, 10, 20, 40, 50, and 100% of the available training data. The performance comparison of the various network architectures was done for anatomy classification as well as two computer vision datasets. The anatomy classifier accuracy varied from 74.1% to 92.3% in this study, depending on the training size and network layout used. Dropout layers improved the model accuracy for all training sizes.
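A sketch in PyTorch of the layer layout described above (6 conv + 2 FC layers, ReLU activations, max-pooling after every second conv layer, softmax output over 6 regions, with the Dropout and Batch Normalization variants included); the input size of 1x128x128 and the channel widths are assumptions, not the paper's actual configuration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AnatomyNet(nn.Module):
        """Sketch of the described layout; input size and channel widths are assumed."""
        def __init__(self, n_classes=6, p_drop=0.5):
            super().__init__()
            chans = [1, 16, 16, 32, 32, 64, 64]
            self.convs = nn.ModuleList(
                nn.Conv2d(chans[i], chans[i + 1], kernel_size=3, padding=1)
                for i in range(6))
            self.bns = nn.ModuleList(nn.BatchNorm2d(c) for c in chans[1:])
            self.drop = nn.Dropout(p_drop)
            self.fc1 = nn.Linear(64 * 16 * 16, 256)
            self.fc2 = nn.Linear(256, n_classes)

        def forward(self, x):
            for i, (conv, bn) in enumerate(zip(self.convs, self.bns)):
                x = F.relu(bn(conv(x)))
                if i % 2 == 1:                       # pool after every second conv layer
                    x = F.max_pool2d(x, 2)
            x = self.drop(torch.flatten(x, 1))
            x = F.relu(self.fc1(x))
            return F.softmax(self.fc2(x), dim=1)     # class probabilities over 6 regions

    print(AnatomyNet()(torch.randn(2, 1, 128, 128)).shape)   # torch.Size([2, 6])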
Convolutional neural network architectures for predicting DNA–protein binding
Zeng, Haoyang; Edwards, Matthew D.; Liu, Ge; Gifford, David K.
2016-01-01
Motivation: Convolutional neural networks (CNN) have outperformed conventional methods in modeling the sequence specificity of DNA–protein binding. Yet inappropriate CNN architectures can yield poorer performance than simpler models. Thus an in-depth understanding of how to match CNN architecture to a given task is needed to fully harness the power of CNNs for computational biology applications. Results: We present a systematic exploration of CNN architectures for predicting DNA sequence binding using a large compendium of transcription factor datasets. We identify the best-performing architectures by varying CNN width, depth and pooling designs. We find that adding convolutional kernels to a network is important for motif-based tasks. We show the benefits of CNNs in learning rich higher-order sequence features, such as secondary motifs and local sequence context, by comparing network performance on multiple modeling tasks ranging in difficulty. We also demonstrate how careful construction of sequence benchmark datasets, using approaches that control potentially confounding effects like positional or motif strength bias, is critical in making fair comparisons between competing methods. We explore how to establish the sufficiency of training data for these learning tasks, and we have created a flexible cloud-based framework that permits the rapid exploration of alternative neural network architectures for problems in computational biology. Availability and Implementation: All the models analyzed are available at http://cnn.csail.mit.edu. Contact: gifford@mit.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:27307608
NASA Technical Reports Server (NTRS)
Lee, J.; Kim, K.
1991-01-01
A Very Large Scale Integration (VLSI) architecture for robot direct kinematic computation suitable for industrial robot manipulators was investigated. The Denavit-Hartenberg transformations are reviewed to identify a suitable processing element, namely an augmented CORDIC. Two distinct implementations are elaborated on: bit-serial and parallel. The performance of each scheme is analyzed with respect to the time to compute one location of the end-effector of a 6-link manipulator, and the number of transistors required.
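The CORDIC primitive mentioned above rotates a vector using only shift-and-add style iterations; a brief floating-point sketch of the rotation mode follows (the bit-serial hardware realization and the kinematics-specific augmentation are not shown, and the iteration count is arbitrary).

    import math

    def cordic_rotate(x, y, angle, n_iter=32):
        """Rotate (x, y) by 'angle' radians using CORDIC rotation-mode iterations."""
        K = 1.0
        for i in range(n_iter):                 # accumulate the CORDIC gain correction
            K *= 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * i))
        z = angle
        for i in range(n_iter):
            d = 1.0 if z >= 0 else -1.0         # rotate toward the residual angle
            x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
            z -= d * math.atan(2.0 ** -i)
        return x * K, y * K

    print(cordic_rotate(1.0, 0.0, math.pi / 3))   # approx (cos 60 deg, sin 60 deg) = (0.5, 0.866)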
Topical perspective on massive threading and parallelism.
Farber, Robert M
2011-09-01
Unquestionably, computer architectures have undergone a recent and noteworthy paradigm shift that now delivers multi- and many-core systems with tens to many thousands of concurrent hardware processing elements per workstation or supercomputer node. GPGPU (General Purpose Graphics Processor Unit) technology in particular has attracted significant attention as new software development capabilities, namely CUDA (Compute Unified Device Architecture) and OpenCL™, have made it possible for students as well as small and large research organizations to achieve excellent speedup for many applications over more conventional computing architectures. The current scientific literature reflects this shift with numerous examples of GPGPU applications that have achieved one, two, and in some special cases, three orders of magnitude increases in computational performance through the use of massive threading to exploit parallelism. Multi-core architectures are also evolving quickly to exploit both massive threading and massive parallelism, such as the 1.3-million-thread Blue Waters supercomputer. The challenge confronting scientists in planning future experimental and theoretical research efforts--be they individual efforts with one computer or collaborative efforts proposing to use the largest supercomputers in the world--is how to capitalize on these new massively threaded computational architectures, especially as not all computational problems will scale to massive parallelism. In particular, the costs associated with restructuring software (and potentially redesigning algorithms) to exploit the parallelism of these multi- and many-threaded machines must be considered along with application scalability and lifespan. This perspective is an overview of the current state of threading and parallelism with some insight into the future. Published by Elsevier Inc.
Long-range interactions and parallel scalability in molecular simulations
NASA Astrophysics Data System (ADS)
Patra, Michael; Hyvönen, Marja T.; Falck, Emma; Sabouri-Ghomi, Mohsen; Vattulainen, Ilpo; Karttunen, Mikko
2007-01-01
Typical biomolecular systems such as cellular membranes, DNA, and protein complexes are highly charged. Thus, efficient and accurate treatment of electrostatic interactions is of great importance in computational modeling of such systems. We have employed the GROMACS simulation package to perform extensive benchmarking of different commonly used electrostatic schemes on a range of computer architectures (Pentium-4, IBM Power 4, and Apple/IBM G5) for single processor and parallel performance up to 8 nodes—we have also tested the scalability on four different networks, namely Infiniband, GigaBit Ethernet, Fast Ethernet, and nearly uniform memory architecture, i.e. communication between CPUs is possible by directly reading from or writing to other CPUs' local memory. It turns out that the particle-mesh Ewald method (PME) performs surprisingly well and offers competitive performance unless parallel runs on PC hardware with older network infrastructure are needed. Lipid bilayers of sizes 128, 512 and 2048 lipid molecules were used as the test systems representing typical cases encountered in biomolecular simulations. Our results enable an accurate prediction of computational speed on most current computing systems, both for serial and parallel runs. These results should be helpful in, for example, choosing the most suitable configuration for a small departmental computer cluster.
Computers in Academic Architecture Libraries.
ERIC Educational Resources Information Center
Willis, Alfred; And Others
1992-01-01
Computers are widely used in architectural research and teaching in U.S. schools of architecture. A survey of libraries serving these schools sought information on the emphasis placed on computers by the architectural curriculum, accessibility of computers to library staff, and accessibility of computers to library patrons. Survey results and…
Computers for symbolic processing
NASA Technical Reports Server (NTRS)
Wah, Benjamin W.; Lowrie, Matthew B.; Li, Guo-Jie
1989-01-01
A detailed survey on the motivations, design, applications, current status, and limitations of computers designed for symbolic processing is provided. Symbolic processing computations are performed at the word, relation, or meaning levels, and the knowledge used in symbolic applications may be fuzzy, uncertain, indeterminate, and ill represented. Various techniques for knowledge representation and processing are discussed from both the designers' and users' points of view. The design and choice of a suitable language for symbolic processing and the mapping of applications into a software architecture are then considered. The process of refining the application requirements into hardware and software architectures is treated, and state-of-the-art sequential and parallel computers designed for symbolic processing are discussed.
Vectorization, threading, and cache-blocking considerations for hydrocodes on emerging architectures
Fung, J.; Aulwes, R. T.; Bement, M. T.; ...
2015-07-14
This work reports on considerations for improving computational performance in preparation for current and expected changes to computer architecture. The algorithms studied will include increasingly complex prototypes for radiation hydrodynamics codes, such as gradient routines and diffusion matrix assembly (e.g., in [1-6]). The meshes considered for the algorithms are structured or unstructured meshes. The considerations applied for performance improvements are meant to be general in terms of architecture (not specific to graphics processing units (GPUs) or multi-core machines, for example) and include techniques for vectorization, threading, tiling, and cache blocking. Out of a survey of optimization techniques on applications such as diffusion and hydrodynamics, we make general recommendations with a view toward making these techniques conceptually accessible to the applications code developer. Published 2015. This article is a U.S. Government work and is in the public domain in the USA.
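Of the techniques listed, cache blocking (tiling) is easy to show in isolation (a generic sketch, not drawn from the radiation-hydrodynamics codes): a 2-D sweep is broken into tiles so that each block of the array is reused while it is still cache-resident. The tile size and the transpose example are assumptions.

    import numpy as np

    def blocked_transpose(a, tile=64):
        """Cache-blocked transpose: process the matrix tile by tile so the source
        and destination tiles stay cache-resident during the inner loops."""
        n, m = a.shape
        out = np.empty((m, n), dtype=a.dtype)
        for i0 in range(0, n, tile):
            for j0 in range(0, m, tile):
                block = a[i0:i0 + tile, j0:j0 + tile]
                out[j0:j0 + tile, i0:i0 + tile] = block.T
        return out

    a = np.arange(512 * 512).reshape(512, 512)
    print(np.array_equal(blocked_transpose(a), a.T))   # True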
Perspectives in numerical astrophysics:
NASA Astrophysics Data System (ADS)
Reverdy, V.
2016-12-01
In this discussion paper, we investigate the current and future status of numerical astrophysics and highlight key questions concerning the transition to the exascale era. We first discuss the fact that one of the main motivations behind high performance simulations should not be the reproduction of observational or experimental data, but the understanding of the emergence of complexity from fundamental laws. This motivation is put into perspective regarding the quest for more computational power, and we argue that extra computational resources can be used to gain in abstraction. Then, the readiness level of present-day simulation codes with regard to upcoming exascale architectures is examined, and two major challenges are raised concerning both the central role of data movement for performance and the growing complexity of codes. Software architecture is finally presented as a key component to make the most of upcoming architectures while solving original physics problems.
Scaling Deep Learning Workloads: NVIDIA DGX-1/Pascal and Intel Knights Landing
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gawande, Nitin A.; Landwehr, Joshua B.; Daily, Jeffrey A.
Deep Learning (DL) algorithms have become ubiquitous in data analytics. As a result, major computing vendors --- including NVIDIA, Intel, AMD and IBM --- have architectural road-maps influenced by DL workloads. Furthermore, several vendors have recently advertised new computing products as accelerating DL workloads. Unfortunately, it is difficult for data scientists to quantify the potential of these different products. This paper provides a performance and power analysis of important DL workloads on two major parallel architectures: NVIDIA DGX-1 (eight Pascal P100 GPUs interconnected with NVLink) and Intel Knights Landing (KNL) CPUs interconnected with Intel Omni-Path. Our evaluation consists of a cross section of convolutional neural net workloads: CifarNet, CaffeNet, AlexNet and GoogleNet topologies using the Cifar10 and ImageNet datasets. The workloads are vendor optimized for each architecture. Our analysis indicates that although GPUs provide the highest overall raw performance, the gap can close for some convolutional networks, and KNL can be competitive when considering performance per watt. Furthermore, NVLink is critical to GPU scaling.
A High Performance COTS Based Computer Architecture
NASA Astrophysics Data System (ADS)
Patte, Mathieu; Grimoldi, Raoul; Trautner, Roland
2014-08-01
Using Commercial Off The Shelf (COTS) electronic components for space applications is a long-standing idea. Indeed, the difference in processing performance and energy efficiency between radiation-hardened components and COTS components is so large that COTS components are very attractive for use in mass- and power-constrained systems. However, using COTS components in space is not straightforward, as one must account for the effects of the space environment on the behavior of the COTS components. In the frame of the ESA-funded activity called High Performance COTS Based Computer, Airbus Defence and Space and its subcontractor OHB CGS have developed and prototyped a versatile COTS-based architecture for high performance processing. The rest of the paper is organized as follows: in the first section we recapitulate the interests and constraints of using COTS components for space applications; then we briefly describe existing fault mitigation architectures and present our solution for fault mitigation based on a component called the SmartIO; in the last part of the paper we describe the prototyping activities executed during the HiP CBC project.
Design of a real-time wind turbine simulator using a custom parallel architecture
NASA Technical Reports Server (NTRS)
Hoffman, John A.; Gluck, R.; Sridhar, S.
1995-01-01
The design of a new parallel-processing digital simulator is described. The new simulator has been developed specifically for analysis of wind energy systems in real time. The new processor has been named the Wind Energy System Time-domain simulator, version 3 (WEST-3). Like previous WEST versions, WEST-3 performs many computations in parallel. The modules in WEST-3 are pure digital processors, however. These digital processors can be programmed individually and operated in concert to achieve real-time simulation of wind turbine systems. Because of this programmability, WEST-3 is much more flexible and general than its two predecessors. The design features of WEST-3 are described to show how the system produces high-speed solutions of nonlinear time-domain equations. WEST-3 has two very fast Computational Units (CUs) that use minicomputer technology plus special architectural features that make them many times faster than a microcomputer. These CUs are needed to perform the complex computations associated with the wind turbine rotor system in real time. The parallel architecture of the CU allows several tasks to be done in each cycle, including an I/O operation and a combined multiply, add, and store. The WEST-3 simulator can be expanded at any time for additional computational power. This is possible because the CUs are interfaced to each other and to other portions of the simulation using special serial buses. These buses can be 'patched' together in essentially any configuration (in a manner very similar to the programming methods used in analog computation) to balance the input/output requirements. CUs can be added in any number to share a given computational load. This flexible bus feature is very different from many other parallel processors, which usually have a throughput limit because of a rigid bus architecture.
Heterogeneous computing architecture for fast detection of SNP-SNP interactions.
Sluga, Davor; Curk, Tomaz; Zupan, Blaz; Lotric, Uros
2014-06-25
The extent of data in a typical genome-wide association study (GWAS) poses considerable computational challenges to software tools for gene-gene interaction discovery. Exhaustive evaluation of all interactions among hundreds of thousands to millions of single nucleotide polymorphisms (SNPs) may require weeks or even months of computation. Massively parallel hardware within a modern Graphic Processing Unit (GPU) and Many Integrated Core (MIC) coprocessors can shorten the run time considerably. While the utility of GPU-based implementations in bioinformatics has been well studied, the MIC architecture has been introduced only recently and may provide a number of comparative advantages that have yet to be explored and tested. We have developed a heterogeneous, GPU and Intel MIC-accelerated software module for SNP-SNP interaction discovery to replace the previously single-threaded computational core in the interactive web-based data exploration program SNPsyn. We report on differences between these two modern massively parallel architectures and their software environments. Their use resulted in an order of magnitude shorter execution times when compared to the single-threaded CPU implementation. The GPU implementation on a single Nvidia Tesla K20 runs twice as fast as that for the MIC architecture-based Xeon Phi P5110 coprocessor, but also requires considerably more programming effort. General-purpose GPUs are a mature platform with large amounts of computing power capable of tackling inherently parallel problems, but can prove demanding for the programmer. On the other hand, the new MIC architecture, albeit lacking in peak performance, reduces the programming effort and makes up for it with a more general architecture suitable for a wider range of problems.
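As a concrete illustration of the exhaustive pairwise scan such tools perform, the sketch below scores every SNP pair against a phenotype with a simple information-gain statistic. The scoring function, data shapes, and variable names are illustrative stand-ins, not the actual statistic or data layout used by SNPsyn.

```python
# Minimal sketch of an exhaustive pairwise SNP-SNP scan.  The scoring function
# (information gain of the joint genotype over the phenotype) is only a
# stand-in for whatever statistic a real tool uses; data are random toys.
import itertools
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def pair_score(snp_a, snp_b, phenotype):
    # Information gain of the joint genotype (9 combined states) w.r.t. the phenotype.
    joint = snp_a * 3 + snp_b               # genotypes coded 0/1/2
    cond = 0.0
    for g in np.unique(joint):
        mask = joint == g
        cond += mask.mean() * entropy(phenotype[mask])
    return entropy(phenotype) - cond

rng = np.random.default_rng(0)
snps = rng.integers(0, 3, size=(50, 200))    # 50 SNPs x 200 samples (toy sizes)
phen = rng.integers(0, 2, size=200)          # binary case/control status

scores = {(i, j): pair_score(snps[i], snps[j], phen)
          for i, j in itertools.combinations(range(len(snps)), 2)}
best = max(scores, key=scores.get)
print(best, scores[best])
```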
The path toward HEP High Performance Computing
NASA Astrophysics Data System (ADS)
Apostolakis, John; Brun, René; Carminati, Federico; Gheata, Andrei; Wenzel, Sandro
2014-06-01
High Energy Physics code has been known for making poor use of high-performance computing architectures. Efforts to optimise HEP code on vector and RISC architectures have yielded limited results, and recent studies have shown that, on modern architectures, it achieves between 10% and 50% of peak performance. Although several successful attempts have been made to port selected codes to GPUs, no major HEP code suite has a "High Performance" implementation. With the LHC undergoing a major upgrade and a number of challenging experiments on the drawing board, HEP can no longer neglect the less-than-optimal performance of its code and has to try to make the best use of the hardware. This activity is one of the foci of the SFT group at CERN, which hosts, among others, the ROOT and Geant4 projects. The activity of the experiments is shared and coordinated via a Concurrency Forum, where the experience in optimising HEP code is presented and discussed. Another activity is the Geant-V project, centred on the development of a high-performance prototype for particle transport. Achieving a good concurrency level on the emerging parallel architectures without a complete redesign of the framework can only be done by parallelizing at event level, or with a much larger effort at track level. Apart from the shareable data structures, this typically implies a multiplication factor in memory consumption compared to the single-threaded version, together with sub-optimal handling of event-processing tails. Besides this, the low-level instruction pipelining of modern processors cannot be used efficiently to speed up the program. We have implemented a framework that allows scheduling vectors of particles to an arbitrary number of computing resources in a fine-grain parallel approach. The talk will review the current optimisation activities within the SFT group, with a particular emphasis on the development perspectives towards a simulation framework able to profit best from the recent technology evolution in computing.
Parallel Signal Processing and System Simulation using aCe
NASA Technical Reports Server (NTRS)
Dorband, John E.; Aburdene, Maurice F.
2003-01-01
Recently, networked and cluster computation have become very popular for both signal processing and system simulation. A new language is ideally suited for parallel signal processing applications and system simulation, since it allows the programmer to explicitly express the computations that can be performed concurrently. In addition, the new C-based parallel language (aCe) for architecture-adaptive programming allows programmers to implement algorithms and system simulation applications on parallel architectures by providing them with the assurance that future parallel architectures will be able to run their applications with a minimum of modification. In this paper, we focus on some fundamental features of aCe and present a signal processing application (FFT).
Parallel language constructs for tensor product computations on loosely coupled architectures
NASA Technical Reports Server (NTRS)
Mehrotra, Piyush; Vanrosendale, John
1989-01-01
Distributed memory architectures offer high levels of performance and flexibility, but have proven awkward to program. Current languages for nonshared-memory architectures provide a relatively low-level programming environment and are poorly suited to modular programming and to the construction of libraries. A set of language primitives designed to allow the specification of parallel numerical algorithms at a higher level is described. The focus is on tensor product array computations, a simple but important class of numerical algorithms. The problem of programming 1-D kernel routines, such as parallel tridiagonal solvers, is addressed first, and then it is examined how such parallel kernels can be combined to form parallel tensor product algorithms.
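Although the paper's language primitives are not reproduced here, the linear-algebra identity underlying tensor product algorithms can be sketched directly: applying a Kronecker (tensor) product operator to a vector never requires forming the full operator. The NumPy sketch below illustrates that identity only; the matrix sizes are arbitrary and nothing here reflects the proposed language constructs.

```python
# Applying a tensor (Kronecker) product operator without forming it:
# (A kron B) vec(X) = vec(B X A^T), with column-major vec.
import numpy as np

def kron_apply(A, B, X):
    """Apply (A kron B) to vec(X) using only small matrix products."""
    return (B @ X @ A.T).flatten(order="F")

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((4, 4))
X = rng.standard_normal((4, 3))          # rows match B, columns match A

direct = np.kron(A, B) @ X.flatten(order="F")   # expensive reference result
assert np.allclose(direct, kron_apply(A, B, X))
print(kron_apply(A, B, X).shape)                 # (12,)
```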
NASA Technical Reports Server (NTRS)
Smith, T. B., III; Lala, J. H.
1984-01-01
The FTMP architecture is a high-reliability computer concept modeled after a homogeneous multiprocessor architecture. Elements of the FTMP are operated in tight synchronism with one another, and hardware fault detection and fault masking are provided in a way that is transparent to the software. Operating system design and user software design are thus greatly simplified. Performance of the FTMP is also comparable to that of a simplex equivalent due to the efficiency of the fault-handling hardware. The FTMP project constructed an engineering module of the FTMP, programmed the machine, and extensively tested the architecture through fault injection and other stress testing. This testing confirmed the soundness of the FTMP concepts.
A highly efficient 3D level-set grain growth algorithm tailored for ccNUMA architecture
NASA Astrophysics Data System (ADS)
Mießen, C.; Velinov, N.; Gottstein, G.; Barrales-Mora, L. A.
2017-12-01
A highly efficient simulation model for 2D and 3D grain growth was developed based on the level-set method. The model introduces modern computational concepts to achieve excellent performance on parallel computer architectures. Strong scalability was measured on cache-coherent non-uniform memory access (ccNUMA) architectures. To achieve this, the proposed approach considers the application of local level-set functions at the grain level. Ideal and non-ideal grain growth was simulated in 3D with the objective to study the evolution of statistical representative volume elements in polycrystals. In addition, microstructure evolution in an anisotropic magnetic material affected by an external magnetic field was simulated.
Two-dimensional nonsteady viscous flow simulation on the Navier-Stokes computer miniNode
NASA Technical Reports Server (NTRS)
Nosenchuck, Daniel M.; Littman, Michael G.; Flannery, William
1986-01-01
The needs of large-scale scientific computation are outpacing the growth in performance of mainframe supercomputers. In particular, problems in fluid mechanics involving complex flow simulations require far more speed and capacity than that provided by current and proposed Class VI supercomputers. To address this concern, the Navier-Stokes Computer (NSC) was developed. The NSC is a parallel-processing machine, comprised of individual Nodes, each comparable in performance to current supercomputers. The global architecture is that of a hypercube, and a 128-Node NSC has been designed. New architectural features, such as a reconfigurable many-function ALU pipeline and a multifunction memory-ALU switch, have provided the capability to efficiently implement a wide range of algorithms. Efficient algorithms typically involve numerically intensive tasks, which often include conditional operations. These operations may be efficiently implemented on the NSC without, in general, sacrificing vector-processing speed. To illustrate the architecture, programming, and several of the capabilities of the NSC, the simulation of two-dimensional, nonsteady viscous flows on a prototype Node, called the miniNode, is presented.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Muller, U.A.; Baumle, B.; Kohler, P.
1992-10-01
Music, a DSP-based system with a parallel distributed-memory architecture, provides enormous computing power yet retains the flexibility of a general-purpose computer. Reaching a peak performance of 2.7 Gflops at a significantly lower cost, power consumption, and space requirement than conventional supercomputers, Music is well suited to computationally intensive applications such as neural network simulation. 12 refs., 9 figs., 2 tabs.
Radio Synthesis Imaging - A High Performance Computing and Communications Project
NASA Astrophysics Data System (ADS)
Crutcher, Richard M.
The National Science Foundation has funded a five-year High Performance Computing and Communications project at the National Center for Supercomputing Applications (NCSA) for the direct implementation of several of the computing recommendations of the Astronomy and Astrophysics Survey Committee (the "Bahcall report"). This paper is a summary of the project goals and a progress report. The project will implement a prototype of the next generation of astronomical telescope systems - remotely located telescopes connected by high-speed networks to very high performance, scalable architecture computers and on-line data archives, which are accessed by astronomers over Gbit/sec networks. Specifically, a data link has been installed between the BIMA millimeter-wave synthesis array at Hat Creek, California and NCSA at Urbana, Illinois for real-time transmission of data to NCSA. Data are automatically archived, and may be browsed and retrieved by astronomers using the NCSA Mosaic software. In addition, an on-line digital library of processed images will be established. BIMA data will be processed on a very high performance distributed computing system, with I/O, user interface, and most of the software system running on the NCSA Convex C3880 supercomputer or Silicon Graphics Onyx workstations connected by HiPPI to the high performance, massively parallel Thinking Machines Corporation CM-5. The very computationally intensive algorithms for calibration and imaging of radio synthesis array observations will be optimized for the CM-5 and new algorithms which utilize the massively parallel architecture will be developed. Code running simultaneously on the distributed computers will communicate using the Data Transport Mechanism developed by NCSA. The project will also use the BLANCA Gbit/s testbed network between Urbana and Madison, Wisconsin to connect an Onyx workstation in the University of Wisconsin Astronomy Department to the NCSA CM-5, for development of long-distance distributed computing. Finally, the project is developing 2D and 3D visualization software as part of the international AIPS++ project. This research and development project is being carried out by a team of experts in radio astronomy, algorithm development for massively parallel architectures, high-speed networking, database management, and Thinking Machines Corporation personnel. The development of this complete software, distributed computing, and data archive and library solution to the radio astronomy computing problem will advance our expertise in high performance computing and communications technology and the application of these techniques to astronomical data processing.
Rodríguez, Alfonso; Valverde, Juan; Portilla, Jorge; Otero, Andrés; Riesgo, Teresa; de la Torre, Eduardo
2018-06-08
Cyber-Physical Systems are experiencing a paradigm shift in which processing has been relocated to the distributed sensing layer and is no longer performed in a centralized manner. This approach, usually referred to as Edge Computing, demands the use of hardware platforms that are able to manage the steadily increasing requirements in computing performance, while keeping the energy efficiency and adaptability imposed by the interaction with the physical world. In this context, SRAM-based FPGAs and their inherent run-time reconfigurability, when coupled with smart power management strategies, are a suitable solution. However, they usually fall short in user accessibility and ease of development. In this paper, an integrated framework to develop FPGA-based high-performance embedded systems for Edge Computing in Cyber-Physical Systems is presented. This framework provides a hardware-based processing architecture, an automated toolchain, and a runtime to transparently generate and manage reconfigurable systems from high-level system descriptions without additional user intervention. Moreover, it provides users with support for dynamically adapting the available computing resources to switch the working point of the architecture in a solution space defined by computing performance, energy consumption and fault tolerance. Results show that it is indeed possible to explore this solution space at run time and prove that the proposed framework is a competitive alternative to software-based edge computing platforms, being able to provide not only faster solutions, but also higher energy efficiency for computing-intensive algorithms with significant levels of data-level parallelism.
NASA Technical Reports Server (NTRS)
Pineda, Evan Jorge; Bednarcyk, Brett A.; Arnold, Steven M.
2014-01-01
Integrated computational materials engineering (ICME) is a useful approach for tailoring the performance of a material. For fiber-reinforced composites, not only do the properties of the constituents of the composite affect the performance, but so does the architecture (or microstructure) of the constituents. The generalized method of cells is demonstrated to be a viable micromechanics tool for determining the effects of the microstructure on the performance of laminates. The micromechanics is used to predict the inputs for a macroscale model for a variety of fiber volume fractions and fiber architectures. Using this technique, the material performance can be tailored for specific applications by judicious selection of constituents, volume fraction, and architectural arrangement, given a particular manufacturing scenario.
Architectural Techniques For Managing Non-volatile Caches
DOE Office of Scientific and Technical Information (OSTI.GOV)
Mittal, Sparsh
As chip power dissipation becomes a critical challenge in scaling processor performance, computer architects are forced to fundamentally rethink the design of modern processors and, hence, the chip-design industry is now at a major inflection point in its hardware roadmap. The high leakage power and low density of SRAM pose serious obstacles to its use for designing large on-chip caches, and for this reason researchers are exploring non-volatile memory (NVM) devices, such as spin torque transfer RAM, phase change RAM and resistive RAM. However, since NVMs are not strictly superior to SRAM, effective architectural techniques are required for making them a universal memory solution. This book discusses techniques for designing processor caches using NVM devices. It presents algorithms and architectures for improving their energy efficiency, performance and lifetime. It also provides both qualitative and quantitative evaluation to help the reader gain insights and motivate them to explore further. This book will be highly useful for beginners as well as veterans in computer architecture, chip designers, product managers and technical marketing professionals.
NASA Technical Reports Server (NTRS)
Abada, Christopher H.; Farley, Gary L.; Hyer, Michael W.
2006-01-01
A computer-based parametric study of the effect of reinforcement architectures on fracture response of aluminum compact-tension (CT) specimens is performed. Eleven different reinforcement architectures consisting of rectangular and triangular cross-section reinforcements were evaluated. Reinforced specimens produced between 13 and 28 percent higher fracture load than achieved with the non-reinforced case. Reinforcements with blunt leading edges (rectangular reinforcements) exhibited superior performance relative to the triangular reinforcements with sharp leading edges. Relative to the rectangular reinforcements, the most important architectural feature was reinforcement thickness. At failure, the reinforcements carried between 58 and 85 percent of the load applied to the specimen, suggesting that there is considerable load transfer between the base material and the reinforcement.
Roads towards fault-tolerant universal quantum computation
NASA Astrophysics Data System (ADS)
Campbell, Earl T.; Terhal, Barbara M.; Vuillot, Christophe
2017-09-01
A practical quantum computer must not merely store information, but also process it. To prevent errors introduced by noise from multiplying and spreading, a fault-tolerant computational architecture is required. Current experiments are taking the first steps toward noise-resilient logical qubits. But to convert these quantum devices from memories to processors, it is necessary to specify how a universal set of gates is performed on them. The leading proposals for doing so, such as magic-state distillation and colour-code techniques, have high resource demands. Alternative schemes, such as those that use high-dimensional quantum codes in a modular architecture, have potential benefits, but need to be explored further.
Boolean and brain-inspired computing using spin-transfer torque devices
NASA Astrophysics Data System (ADS)
Fan, Deliang
Several completely new approaches (such as spintronics, carbon nanotubes, graphene, TFETs, etc.) to information processing and data storage technologies are emerging to address the time frame beyond the current Complementary Metal-Oxide-Semiconductor (CMOS) roadmap. The high-speed magnetization switching of a nano-magnet due to current-induced spin-transfer torque (STT) has been demonstrated in recent experiments. Such STT devices can be explored in compact, low-power memory and logic design. In order to truly leverage STT-device-based computing, researchers must rethink circuits, architectures, and computing models, since STT devices are unlikely to be drop-in replacements for CMOS. The potential of STT-device-based computing will be best realized by considering new computing models that are inherently suited to the characteristics of STT devices, and new applications that are enabled by their unique capabilities, thereby attaining performance that CMOS cannot achieve. The goal of this research is to conduct synergistic exploration at the architecture, circuit and device levels for Boolean and brain-inspired computing using nanoscale STT devices. Specifically, we first show that non-volatile STT devices can be used in designing configurable Boolean logic blocks. We propose a spin-memristor threshold logic (SMTL) gate design, where a memristive crossbar array is used to perform current-mode summation of binary inputs and the low-power current-mode spintronic threshold device carries out the energy-efficient threshold operation. Next, for brain-inspired computing, we have exploited different spin-transfer torque device structures that can implement the hard-limiting and soft-limiting artificial neuron transfer functions, respectively. We apply such STT-based neurons (or 'spin-neurons') in various neural network architectures, such as hierarchical temporal memory and feed-forward neural networks, for performing "human-like" cognitive computing, which shows more than two orders of magnitude lower energy consumption compared to state-of-the-art CMOS implementations. Finally, we show that the dynamics of an injection-locked Spin Hall Effect Spin-Torque Oscillator (SHE-STO) cluster can be exploited as a robust multi-dimensional distance metric for associative computing, image/video analysis, etc. Our simulation results show that the proposed system architecture with injection-locked SHE-STOs and the associated CMOS interface circuits can be suitable for robust and energy-efficient associative computing and pattern matching.
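The threshold-logic operation performed by the proposed SMTL gate can be modeled behaviorally as a weighted (current-mode) sum of binary inputs followed by a hard threshold, as sketched below. The weights, thresholds, and example functions are illustrative choices and are not device parameters from this work.

```python
# Behavioral sketch of a threshold-logic gate / hard-limiting neuron: a
# weighted sum of binary inputs compared against a threshold.  Weights and
# thresholds are illustrative, not measured device values.

def threshold_gate(inputs, weights, threshold):
    """Fire (output 1) if the weighted input sum reaches the threshold."""
    return int(sum(w * x for w, x in zip(weights, inputs)) >= threshold)

# Any linearly separable Boolean function is a single gate, e.g. 3-input majority:
majority = lambda a, b, c: threshold_gate((a, b, c), weights=(1, 1, 1), threshold=2)
assert majority(1, 1, 0) == 1 and majority(1, 0, 0) == 0

# The same primitive doubles as the hard-limiting artificial neuron mentioned above.
neuron_out = threshold_gate((1, 0, 1, 1), weights=(0.5, -1.0, 0.25, 0.75), threshold=1.0)
print(neuron_out)   # 1, since 0.5 + 0.25 + 0.75 = 1.5 >= 1.0
```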
NASA Astrophysics Data System (ADS)
Suarez, Hernan; Zhang, Yan R.
2015-05-01
New radar applications need to perform complex algorithms and process large quantities of data to generate useful information for the users. This situation has motivated the search for better processing solutions that include low-power high-performance processors, efficient algorithms, and high-speed interfaces. In this work, a hardware implementation of adaptive pulse compression for real-time transceiver optimization is presented, based on a System-on-Chip architecture for Xilinx devices. This study also evaluates the performance of dedicated coprocessors as hardware accelerator units to speed up and improve the computation of computing-intensive tasks such as matrix multiplication and matrix inversion, which are essential units for solving the covariance matrix. The tradeoffs between latency and hardware utilization are also presented. Moreover, the system architecture takes advantage of the embedded processor, which is interconnected with the logic resources through high-performance AXI buses, to perform floating-point operations, control the processing blocks, and communicate with an external PC through a customized software interface. The overall system functionality is demonstrated and tested for real-time operations using a Ku-band testbed together with a low-cost channel emulator for different types of waveforms.
Wilson Dslash Kernel From Lattice QCD Optimization
DOE Office of Scientific and Technical Information (OSTI.GOV)
Joo, Balint; Smelyanskiy, Mikhail; Kalamkar, Dhiraj D.
2015-07-01
Lattice Quantum Chromodynamics (LQCD) is a numerical technique used for calculations in Theoretical Nuclear and High Energy Physics. LQCD is traditionally one of the first applications ported to many new high performance computing architectures, and indeed LQCD practitioners have been known to design and build custom LQCD computers. Lattice QCD kernels are frequently used as benchmarks (e.g. 168.wupwise in the SPEC suite) and are generally well understood, and as such are ideal to illustrate several optimization techniques. In this chapter we detail our work in optimizing the Wilson-Dslash kernels for the Intel Xeon Phi; however, as we will show, the technique gives excellent performance on the regular Xeon architecture as well.
1988-05-01
Institute for Advanced Computer Studies and Department of Computer Science, University of Maryland, College Park, MD 20742. We discuss some aspects of ... We study the performance of CG and PCG by examining their performance for u in (0,1), for solving the two model problems with an accuracy ...
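For reference, a minimal (unpreconditioned) conjugate gradient iteration of the kind studied in such reports is sketched below; PCG differs only in applying a preconditioner solve to the residual at each step. The 1D Laplacian test system is an illustrative stand-in for the report's model problems, which are not specified in the recovered text.

```python
# Minimal conjugate gradient (CG) sketch for a symmetric positive definite system.
import numpy as np

def cg(A, b, tol=1e-8, max_iter=1000):
    x = np.zeros_like(b)
    r = b - A @ x                     # initial residual
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol * np.linalg.norm(b):
            break
        p = r + (rs_new / rs) * p     # new search direction
        rs = rs_new
    return x

# Toy SPD system (1D Laplacian), standing in for the report's model problems.
n = 100
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
x = cg(A, b)
print(np.linalg.norm(A @ x - b))      # residual norm, should be tiny
```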
Fault tolerant computer control for a Maglev transportation system
NASA Technical Reports Server (NTRS)
Lala, Jaynarayan H.; Nagle, Gail A.; Anagnostopoulos, George
1994-01-01
Magnetically levitated (Maglev) vehicles operating on dedicated guideways at speeds of 500 km/hr are an emerging transportation alternative to short-haul air and high-speed rail. They have the potential to offer a service significantly more dependable than air and with less operating cost than both air and high-speed rail. Maglev transportation derives these benefits by using magnetic forces to suspend a vehicle 8 to 200 mm above the guideway. Magnetic forces are also used for propulsion and guidance. The combination of high speed, short headways, stringent ride quality requirements, and a distributed offboard propulsion system necessitates high levels of automation for the Maglev control and operation. Very high levels of safety and availability will be required for the Maglev control system. This paper describes the mission scenario, functional requirements, and dependability and performance requirements of the Maglev command, control, and communications system. A distributed hierarchical architecture consisting of vehicle on-board computers, wayside zone computers, a central computer facility, and communication links between these entities was synthesized to meet the functional and dependability requirements on the maglev. Two variations of the basic architecture are described: the Smart Vehicle Architecture (SVA) and the Zone Control Architecture (ZCA). Preliminary dependability modeling results are also presented.
All-optical reservoir computing.
Duport, François; Schneider, Bendix; Smerieri, Anteo; Haelterman, Marc; Massar, Serge
2012-09-24
Reservoir Computing is a novel computing paradigm that uses a nonlinear recurrent dynamical system to carry out information processing. Recent electronic and optoelectronic Reservoir Computers based on an architecture with a single nonlinear node and a delay loop have shown performance on standardized tasks comparable to state-of-the-art digital implementations. Here we report an all-optical implementation of a Reservoir Computer, made of off-the-shelf components for optical telecommunications. It uses the saturation of a semiconductor optical amplifier as nonlinearity. The present work shows that, within the Reservoir Computing paradigm, all-optical computing with state-of-the-art performance is possible.
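The reservoir computing paradigm itself is easy to sketch in software: a fixed random recurrent nonlinearity whose states are read out by a trained linear map. The echo-state-style example below is only a behavioral illustration; the all-optical system described above replaces the simulated reservoir with physical dynamics, and the task, sizes, and parameters here are arbitrary.

```python
# Software echo-state sketch of the Reservoir Computing paradigm: a fixed
# random recurrent reservoir plus a linear readout trained by ridge regression.
import numpy as np

rng = np.random.default_rng(0)
n_res, n_steps = 200, 2000
W = rng.standard_normal((n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # scale spectral radius below 1
w_in = rng.standard_normal(n_res)

u = rng.uniform(-0.5, 0.5, n_steps)               # input signal
target = np.roll(u, 3)                            # toy task: recall the input 3 steps back

x = np.zeros(n_res)
states = np.empty((n_steps, n_res))
for t in range(n_steps):
    x = np.tanh(W @ x + w_in * u[t])              # reservoir update
    states[t] = x

ridge = 1e-6                                      # readout weights via ridge regression
w_out = np.linalg.solve(states.T @ states + ridge * np.eye(n_res),
                        states.T @ target)
pred = states @ w_out
print(np.sqrt(np.mean((pred[100:] - target[100:]) ** 2)))   # RMSE after washout
```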
Scalable quantum computer architecture with coupled donor-quantum dot qubits
Schenkel, Thomas; Lo, Cheuk Chi; Weis, Christoph; Lyon, Stephen; Tyryshkin, Alexei; Bokor, Jeffrey
2014-08-26
A quantum bit computing architecture includes a plurality of single spin memory donor atoms embedded in a semiconductor layer, a plurality of quantum dots arranged with the semiconductor layer and aligned with the donor atoms, wherein a first voltage applied across at least one pair of the aligned quantum dot and donor atom controls a donor-quantum dot coupling. A method of performing quantum computing in a scalable architecture quantum computing apparatus includes arranging a pattern of single spin memory donor atoms in a semiconductor layer, forming a plurality of quantum dots arranged with the semiconductor layer and aligned with the donor atoms, applying a first voltage across at least one aligned pair of a quantum dot and donor atom to control a donor-quantum dot coupling, and applying a second voltage between one or more quantum dots to control a Heisenberg exchange J coupling between quantum dots and to cause transport of a single spin polarized electron between quantum dots.
High performance semantic factoring of giga-scale semantic graph databases.
DOE Office of Scientific and Technical Information (OSTI.GOV)
al-Saffar, Sinan; Adolf, Bob; Haglin, David
2010-10-01
As semantic graph database technology grows to address components ranging from extant large triple stores to SPARQL endpoints over SQL-structured relational databases, it will become increasingly important to be able to bring high performance computational resources to bear on their analysis, interpretation, and visualization, especially with respect to their innate semantic structure. Our research group built a novel high performance hybrid system comprising computational capability for semantic graph database processing utilizing the large multithreaded architecture of the Cray XMT platform, conventional clusters, and large data stores. In this paper we describe that architecture, and present the results of deploying it for the analysis of the Billion Triple dataset with respect to its semantic factors, including basic properties, connected components, namespace interaction, and typed paths.
A distributed parallel storage architecture and its potential application within EOSDIS
NASA Technical Reports Server (NTRS)
Johnston, William E.; Tierney, Brian; Feuquay, Jay; Butzer, Tony
1994-01-01
We describe the architecture, implementation, and use of a scalable, high-performance, distributed-parallel data storage system developed in the ARPA-funded MAGIC gigabit testbed. A collection of wide-area distributed disk servers operate in parallel to provide logical block-level access to large data sets. Operated primarily as a network-based cache, the architecture supports cooperation among independently owned resources to provide fast, large-scale, on-demand storage to support data handling, simulation, and computation.
Towards fully analog hardware reservoir computing for speech recognition
NASA Astrophysics Data System (ADS)
Smerieri, Anteo; Duport, François; Paquot, Yvan; Haelterman, Marc; Schrauwen, Benjamin; Massar, Serge
2012-09-01
Reservoir computing is a very recent, neural-network-inspired unconventional computation technique, where a recurrent nonlinear system is used in conjunction with a linear readout to perform complex calculations, leveraging its inherent internal dynamics. In this paper we show the operation of an optoelectronic reservoir computer in which both the nonlinear recurrent part and the readout layer are implemented in hardware for a speech recognition application. The performance obtained is close to that of state-of-the-art digital reservoirs, while the analog architecture opens the way to ultrafast computation.
A High Performance Computer Architecture for Embedded And/Or Multi-Computer Applications
1990-09-01
... a commercially available, real-time operating system. CHOICES and ARTS are real-time operating systems developed at the University of Illinois and CMU, respectively. Selection of a real-time operating system will be made in the next phase of the project.
Architectural Principles and Experimentation of Distributed High Performance Virtual Clusters
ERIC Educational Resources Information Center
Younge, Andrew J.
2016-01-01
With the advent of virtualization and Infrastructure-as-a-Service (IaaS), the broader scientific computing community is considering the use of clouds for their scientific computing needs. This is due to the relative scalability, ease of use, advanced user environment customization abilities, and the many novel computing paradigms available for…
How to Teach Residue Number System to Computer Scientists and Engineers
ERIC Educational Resources Information Center
Navi, K.; Molahosseini, A. S.; Esmaeildoust, M.
2011-01-01
The residue number system (RNS) has been an important research field in computer arithmetic for many decades, mainly because of its carry-free nature, which can provide high-performance computing architectures with superior delay specifications. Recently, research on RNS has found new directions that have resulted in the introduction of efficient…
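The carry-free property mentioned above is easy to demonstrate: with pairwise coprime moduli, addition and multiplication decompose into independent per-channel operations, and the result is recovered with the Chinese Remainder Theorem. The moduli and operands in the sketch below are arbitrary illustrative choices.

```python
# Residue number system (RNS) sketch: per-channel, carry-free arithmetic
# followed by Chinese Remainder Theorem reconstruction.
from math import prod

MODULI = (7, 11, 13, 15)          # pairwise coprime; dynamic range = 15015

def to_rns(value):
    return tuple(value % m for m in MODULI)

def rns_op(a, b, op):
    # Each channel operates independently: no carries propagate between channels.
    return tuple(op(ra, rb) % m for ra, rb, m in zip(a, b, MODULI))

def from_rns(residues):
    # Chinese Remainder Theorem reconstruction.
    M = prod(MODULI)
    x = 0
    for r, m in zip(residues, MODULI):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)  # modular inverse (Python 3.8+)
    return x % M

a, b = 1234, 567
s = rns_op(to_rns(a), to_rns(b), lambda x, y: x + y)
m = rns_op(to_rns(a), to_rns(b), lambda x, y: x * y)
print(from_rns(s), (a + b) % prod(MODULI))   # both 1801
print(from_rns(m), (a * b) % prod(MODULI))   # both equal (a * b) mod 15015
```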
Feasibility of a special-purpose computer to solve the Navier-Stokes equations
NASA Technical Reports Server (NTRS)
Gritton, E. C.; King, W. S.; Sutherland, I.; Gaines, R. S.; Gazley, C., Jr.; Grosch, C.; Juncosa, M.; Petersen, H.
1978-01-01
Orders-of-magnitude improvements in computer performance can be realized with a parallel array of thousands of fast microprocessors. In this architecture, wiring congestion is minimized by limiting processor communication to nearest neighbors. When certain standard algorithms are applied to a viscous flow problem and existing LSI technology is used, performance estimates of this conceptual design show a dramatic decrease in computational time when compared to the CDC 7600.
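The reason nearest-neighbor-only communication suffices for such explicit grid methods can be sketched in a few lines: each processor owns a slab of the grid and exchanges only one ghost value per step with its left and right neighbors. The decomposition and relaxation below are a generic illustration, not a model of the proposed processor array or its flow algorithms.

```python
# 1D Jacobi relaxation split across "processors" that only exchange ghost
# cells with their immediate neighbors each step.  Sizes are illustrative.
import numpy as np

n_proc, local_n, steps = 4, 25, 200
# Each processor owns local_n interior points plus two ghost cells.
chunks = [np.zeros(local_n + 2) for _ in range(n_proc)]
chunks[0][0] = 1.0        # left global boundary held at 1.0 (right boundary stays 0.0)

for _ in range(steps):
    # Nearest-neighbor exchange: edge values are copied into neighbors' ghost cells.
    for p in range(n_proc - 1):
        chunks[p][-1] = chunks[p + 1][1]     # my right ghost <- neighbor's first interior point
        chunks[p + 1][0] = chunks[p][-2]     # neighbor's left ghost <- my last interior point
    # Purely local Jacobi update on each chunk's interior points.
    for p in range(n_proc):
        chunks[p][1:-1] = 0.5 * (chunks[p][:-2] + chunks[p][2:])

solution = np.concatenate([c[1:-1] for c in chunks])
print(solution[:5])       # values diffusing in from the left boundary
```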
A Cloud-Based Simulation Architecture for Pandemic Influenza Simulation
Eriksson, Henrik; Raciti, Massimiliano; Basile, Maurizio; Cunsolo, Alessandro; Fröberg, Anders; Leifler, Ola; Ekberg, Joakim; Timpka, Toomas
2011-01-01
High-fidelity simulations of pandemic outbreaks are resource consuming. Cluster-based solutions have been suggested for executing such complex computations. We present a cloud-based simulation architecture that utilizes computing resources both locally available and dynamically rented online. The approach uses the Condor framework for job distribution and management of the Amazon Elastic Computing Cloud (EC2) as well as local resources. The architecture has a web-based user interface that allows users to monitor and control simulation execution. In a benchmark test, the best cost-adjusted performance was recorded for the EC2 H-CPU Medium instance, while a field trial showed that the job configuration had significant influence on the execution time and that the network capacity of the master node could become a bottleneck. We conclude that it is possible to develop a scalable simulation environment that uses cloud-based solutions, while providing an easy-to-use graphical user interface. PMID:22195089
The MasPar MP-1 As a Computer Arithmetic Laboratory
Anuta, Michael A.; Lozier, Daniel W.; Turner, Peter R.
1996-01-01
This paper is a blueprint for the use of a massively parallel SIMD computer architecture for the simulation of various forms of computer arithmetic. The particular system used is a DEC/MasPar MP-1 with 4096 processors in a square array. This architecture has many advantages for such simulations due largely to the simplicity of the individual processors. Arithmetic operations can be spread across the processor array to simulate a hardware chip. Alternatively they may be performed on individual processors to allow simulation of a massively parallel implementation of the arithmetic. Compromises between these extremes permit speed-area tradeoffs to be examined. The paper includes a description of the architecture and its features. It then summarizes some of the arithmetic systems which have been, or are to be, implemented. The implementation of the level-index and symmetric level-index, LI and SLI, systems is described in some detail. An extensive bibliography is included. PMID:27805123
Redundancy management for efficient fault recovery in NASA's distributed computing system
NASA Technical Reports Server (NTRS)
Malek, Miroslaw; Pandya, Mihir; Yau, Kitty
1991-01-01
The management of redundancy in computer systems was studied and guidelines were provided for the development of NASA's fault-tolerant distributed systems. Fault recovery and reconfiguration mechanisms were examined. A theoretical foundation was laid for redundancy management by efficient reconfiguration methods and algorithmic diversity. Algorithms were developed to optimize the resources for embedding of computational graphs of tasks in the system architecture and reconfiguration of these tasks after a failure has occurred. The computational structure represented by a path and the complete binary tree was considered and the mesh and hypercube architectures were targeted for their embeddings. The innovative concept of Hybrid Algorithm Technique was introduced. This new technique provides a mechanism for obtaining fault tolerance while exhibiting improved performance.
Porting plasma physics simulation codes to modern computing architectures using the
NASA Astrophysics Data System (ADS)
Germaschewski, Kai; Abbott, Stephen
2015-11-01
Available computing power has continued to grow exponentially even after single-core performance saturated in the last decade. The increase has since been driven by more parallelism, both using more cores and having more parallelism in each core, e.g. in GPUs and Intel Xeon Phi. Adapting existing plasma physics codes is challenging, in particular as there is no single programming model that covers current and future architectures. We will introduce the open-source
Problems Related to Parallelization of CFD Algorithms on GPU, Multi-GPU and Hybrid Architectures
NASA Astrophysics Data System (ADS)
Biazewicz, Marek; Kurowski, Krzysztof; Ludwiczak, Bogdan; Napieraia, Krystyna
2010-09-01
Computational Fluid Dynamics (CFD) is one of the branches of fluid mechanics which uses numerical methods and algorithms to solve and analyze fluid flows. CFD is used in various domains, such as oil and gas reservoir uncertainty analysis, aerodynamic body shape optimization (e.g. planes, cars, ships, sport helmets, skis), natural phenomena analysis, numerical simulation for weather forecasting, or realistic visualizations. The CFD problem is very complex and needs a lot of computational power to obtain results in a reasonable time. We have implemented a parallel application for two-dimensional CFD simulation with a free surface approximation (MAC method) using new hardware architectures, in particular multi-GPU and hybrid computing environments. For this purpose we decided to use NVIDIA graphics cards with the CUDA environment due to its simplicity of programming and good computational performance. We used finite difference discretization of the Navier-Stokes equations, where the fluid is propagated over an Eulerian grid. In this model, the behavior of the fluid inside a cell depends only on the properties of the local, surrounding cells, therefore it is well suited for the GPU-based architecture. In this paper we demonstrate how to use the computing power of GPUs efficiently for CFD. Additionally, we present some best practices to help users analyze and improve the performance of CFD applications executed on GPUs. Finally, we discuss various challenges around the multi-GPU implementation using the example of matrix multiplication.
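The data locality that makes such solvers a good fit for GPUs can be illustrated with a single explicit update in which each cell reads only its four immediate neighbors, so one GPU thread can own one cell. The plain diffusion step below is only a stand-in for the full MAC velocity and pressure update; grid sizes and coefficients are arbitrary.

```python
# Explicit finite-difference step on a 2D grid: each cell's new value depends
# only on its four neighbors, the locality GPUs exploit.  Not the full MAC scheme.
import numpy as np

def diffuse_step(u, nu, dt, dx):
    """One explicit step of du/dt = nu * laplacian(u) on a 2D grid."""
    lap = (u[:-2, 1:-1] + u[2:, 1:-1] + u[1:-1, :-2] + u[1:-1, 2:]
           - 4.0 * u[1:-1, 1:-1]) / dx**2
    u_new = u.copy()
    u_new[1:-1, 1:-1] += nu * dt * lap
    return u_new

u = np.zeros((128, 128))
u[60:68, 60:68] = 1.0                               # initial blob
for _ in range(100):
    u = diffuse_step(u, nu=0.1, dt=0.1, dx=1.0)     # stable since nu*dt/dx**2 <= 0.25
print(u.max())
```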
2010-06-01
EMERGING NEUROMORPHIC COMPUTING ARCHITECTURES AND ENABLING... (dates covered: April 2009 - January 2010). The highly cross-disciplinary emerging field of neuromorphic computing architectures for cognitive information processing applications ... belief systems, software, computer engineering, etc. In our effort to develop cognitive systems atop a neuromorphic computing architecture, we explored
A resource management architecture based on complex network theory in cloud computing federation
NASA Astrophysics Data System (ADS)
Zhang, Zehua; Zhang, Xuejie
2011-10-01
Cloud Computing Federation is a main trend of Cloud Computing. Resource management has a significant effect on the design, realization, and efficiency of Cloud Computing Federation. Cloud Computing Federation has the typical characteristics of a complex system; therefore, we propose a resource management architecture based on complex network theory for Cloud Computing Federation (abbreviated as RMABC) in this paper, with a detailed design of the resource discovery and resource announcement mechanisms. Compared with existing resource management mechanisms in distributed computing systems, a Task Manager in RMABC can use historical information and current state data obtained from other Task Managers for the evolution of the complex network composed of Task Managers, and thus has advantages in resource discovery speed, fault tolerance, and adaptive ability. The results of the model experiment confirmed the advantage of RMABC in resource discovery performance.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Moreland, Kenneth; Sewell, Christopher; Usher, William
One of the most critical challenges for high-performance computing (HPC) scientific visualization is execution on massively threaded processors. Of the many fundamental changes we are seeing in HPC systems, one of the most profound is a reliance on new processor types optimized for execution bandwidth over latency hiding. Our current production scientific visualization software is not designed for these new types of architectures. To address this issue, the VTK-m framework serves as a container for algorithms, provides flexible data representation, and simplifies the design of visualization algorithms on new and future computer architectures.
The Design of Fault Tolerant Quantum Dot Cellular Automata Based Logic
NASA Technical Reports Server (NTRS)
Armstrong, C. Duane; Humphreys, William M.; Fijany, Amir
2002-01-01
As transistor geometries are reduced, quantum effects begin to dominate device performance. At some point, transistors cease to have the properties that make them useful computational components. New computing elements must be developed in order to keep pace with Moore's Law. Quantum dot cellular automata (QCA) represent an alternative paradigm to transistor-based logic. QCA architectures that are robust to manufacturing tolerances and defects must be developed. We are developing software that allows the exploration of fault-tolerant QCA gate architectures by automating the specification, simulation, analysis, and documentation processes.
Eisenbach, Markus
2017-01-01
A major impediment to deploying next-generation high-performance computational systems is the required electrical power, often measured in units of megawatts. The solution to this problem is driving the introduction of novel machine architectures, such as those employing many-core processors and specialized accelerators. In this article, we describe the use of a hybrid accelerated architecture to achieve both reduced time to solution and the associated reduction in the electrical cost for a state-of-the-art materials science computation.
GPU Accelerated Vector Median Filter
NASA Technical Reports Server (NTRS)
Aras, Rifat; Shen, Yuzhong
2011-01-01
Noise reduction is an important step for most image processing tasks. For three-channel color images, a widely used technique is the vector median filter, in which the color values of pixels are treated as 3-component vectors. Vector median filters are computationally expensive; for a window size of n x n, each of the n^2 vectors has to be compared with the other n^2 - 1 vectors in distance. General-purpose computation on graphics processing units (GPUs) is the paradigm of utilizing high-performance many-core GPU architectures for computation tasks that are normally handled by CPUs. In this work, NVIDIA's Compute Unified Device Architecture (CUDA) paradigm is used to accelerate vector median filtering, which has, to the best of our knowledge, never been done before. The performance of the GPU-accelerated vector median filter is compared to that of the CPU and MPI-based versions for different image and window sizes. Initial findings of the study showed a 100x performance improvement of the vector median filter implementation on GPUs over CPU implementations, and further speed-up is expected after more extensive optimizations of the GPU algorithm.
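A naive reference implementation of the vector median filter makes the cost structure clear: each window requires all pairwise distances among its n^2 color vectors, and the output pixel is the vector whose summed distance to the others is smallest. The NumPy version below is the slow CPU baseline that a GPU implementation would parallelize; image size and window size are arbitrary.

```python
# Naive vector median filter: per window, pick the color vector minimizing the
# sum of distances to all other vectors in the window.  CPU reference only.
import numpy as np

def vector_median_filter(img, window=3):
    pad = window // 2
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.empty_like(img)
    h, w, _ = img.shape
    for y in range(h):
        for x in range(w):
            block = padded[y:y + window, x:x + window].reshape(-1, 3).astype(float)
            dists = np.linalg.norm(block[:, None, :] - block[None, :, :], axis=2)
            out[y, x] = block[dists.sum(axis=1).argmin()]   # vector median of the window
    return out

noisy = np.random.default_rng(0).integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
clean = vector_median_filter(noisy)
print(clean.shape)
```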
Examining the architecture of cellular computing through a comparative study with a computer
Wang, Degeng; Gribskov, Michael
2005-01-01
The computer and the cell both use information embedded in simple coding, the binary software code and the quadruple genomic code, respectively, to support system operations. A comparative examination of their system architecture as well as their information storage and utilization schemes is performed. On top of the code, both systems display a modular, multi-layered architecture, which, in the case of a computer, arises from human engineering efforts through a combination of hardware implementation and software abstraction. Using the computer as a reference system, a simplistic mapping of the architectural components between the two is easily detected. This comparison also reveals that a cell abolishes the software–hardware barrier through genomic encoding for the constituents of the biochemical network, a cell's ‘hardware’ equivalent to the computer central processing unit (CPU). The information loading (gene expression) process acts as a major determinant of the encoded constituent's abundance, which, in turn, often determines the ‘bandwidth’ of a biochemical pathway. Cellular processes are implemented in biochemical pathways in parallel manners. In a computer, on the other hand, the software provides only instructions and data for the CPU. A process represents just sequentially ordered actions by the CPU and only virtual parallelism can be implemented through CPU time-sharing. Whereas process management in a computer may simply mean job scheduling, coordinating pathway bandwidth through the gene expression machinery represents a major process management scheme in a cell. In summary, a cell can be viewed as a super-parallel computer, which computes through controlled hardware composition. While we have, at best, a very fragmented understanding of cellular operation, we have a thorough understanding of the computer throughout the engineering process. The potential utilization of this knowledge to the benefit of systems biology is discussed. PMID:16849179
Memristor-Based Synapse Design and Training Scheme for Neuromorphic Computing Architecture
2012-06-01
... system level built upon the conventional Von Neumann computer architecture [2][3]. Developing the neuromorphic architecture at chip level by ... creation of memristor-based neuromorphic computing architecture. Rather than the existing crossbar-based neuron network designs, we focus on memristor
Multiplier Architecture for Coding Circuits
NASA Technical Reports Server (NTRS)
Wang, C. C.; Truong, T. K.; Shao, H. M.; Deutsch, L. J.
1986-01-01
Multipliers based on a new algorithm for Galois-field (GF) arithmetic are regular and expandable. Pipeline structures are used for computing both multiplications and inverses. The designs are suitable for implementation in very-large-scale integrated (VLSI) circuits. This general type of inverter and multiplier architecture is especially useful in performing the finite-field arithmetic of Reed-Solomon error-correcting codes and of some cryptographic algorithms.
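A minimal software model of the arithmetic such hardware implements is shown below: multiplication in GF(2^8) as a carry-less multiply followed by reduction modulo an irreducible polynomial. The polynomial 0x11D is a common Reed-Solomon choice used here purely for illustration; it is not taken from the article, and the bit-serial loop stands in for what the hardware pipelines.

```python
# Software model of GF(2^8) multiplication: carry-less multiply with
# reduction modulo an irreducible field polynomial (0x11D, illustrative).

POLY = 0x11D   # x^8 + x^4 + x^3 + x^2 + 1

def gf_mul(a, b):
    """Multiply two GF(2^8) elements."""
    result = 0
    while b:
        if b & 1:
            result ^= a        # add (XOR) the current shifted copy of a
        b >>= 1
        a <<= 1
        if a & 0x100:          # reduce modulo the field polynomial
            a ^= POLY
    return result

def gf_inv(a):
    """Inverse by exhaustive search (fine for a 256-element field)."""
    return next(x for x in range(1, 256) if gf_mul(a, x) == 1)

x = 0x57
assert gf_mul(x, gf_inv(x)) == 1      # every nonzero element has an inverse
print(hex(gf_mul(0x53, 0xCA)))
```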
Interaction with Machine Improvisation
NASA Astrophysics Data System (ADS)
Assayag, Gerard; Bloch, George; Cont, Arshia; Dubnov, Shlomo
We describe two multi-agent architectures for improvisation-oriented musician-machine interaction systems that learn in real time from human performers. The improvisation kernel is based on sequence modeling and statistical learning. We present two frameworks of interaction with this kernel. In the first, the stylistic interaction is guided by a human operator in front of an interactive computer environment. In the second framework, the stylistic interaction is delegated to machine intelligence and therefore, knowledge propagation and decision are taken care of by the computer alone. The first framework involves a hybrid architecture using two popular composition/performance environments, Max and OpenMusic, that are put to work and communicate together, each one handling the process at a different time/memory scale. The second framework shares the same representational schemes with the first but uses an Active Learning architecture based on collaborative, competitive and memory-based learning to handle stylistic interactions. Both systems are capable of processing real-time audio/video as well as MIDI. After discussing the general cognitive background of improvisation practices, the statistical modelling tools and the concurrent agent architecture are presented. Then, an Active Learning scheme is described and considered in terms of using different improvisation regimes for improvisation planning. Finally, we provide more details about the different system implementations and describe several performances with the system.
Scaling deep learning workloads: NVIDIA DGX-1/Pascal and Intel Knights Landing
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gawande, Nitin A.; Landwehr, Joshua B.; Daily, Jeffrey A.
Deep Learning (DL) algorithms have become ubiquitous in data analytics. As a result, major computing vendors --- including NVIDIA, Intel, AMD, and IBM --- have architectural road-maps influenced by DL workloads. Furthermore, several vendors have recently advertised new computing products as accelerating large DL workloads. Unfortunately, it is difficult for data scientists to quantify the potential of these different products. This paper provides a performance and power analysis of important DL workloads on two major parallel architectures: NVIDIA DGX-1 (eight Pascal P100 GPUs interconnected with NVLink) and Intel Knights Landing (KNL) CPUs interconnected with Intel Omni-Path or Cray Aries. Our evaluation consists of a cross section of convolutional neural net workloads: CifarNet, AlexNet, GoogLeNet, and ResNet50 topologies using the Cifar10 and ImageNet datasets. The workloads are vendor-optimized for each architecture. Our analysis indicates that although GPUs provide the highest overall performance, the gap can close for some convolutional networks, and the KNL can be competitive in performance/watt. We find that NVLink facilitates scaling efficiency on GPUs. However, its importance is heavily dependent on neural network architecture. Furthermore, for weak-scaling --- sometimes encouraged by restricted GPU memory --- NVLink is less important.
Thrifty: An Exascale Architecture for Energy Proportional Computing
DOE Office of Scientific and Technical Information (OSTI.GOV)
Torrellas, Josep
2014-12-23
The objective of this project is to design different aspects of a novel exascale architecture called Thrifty. Our goal is to focus on the challenges of power/energy efficiency, performance, and resiliency in exascale systems. The project includes work on computer architecture (Josep Torrellas from University of Illinois), compilation (Daniel Quinlan from Lawrence Livermore National Laboratory), runtime and applications (Laura Carrington from University of California San Diego), and circuits (Wilfred Pinfold from Intel Corporation). In this report, we focus on the progress at the University of Illinois during the last year of the grant (September 1, 2013 to August 31, 2014). We also point to the progress in the other collaborating institutions when needed.
NASA Technical Reports Server (NTRS)
Jacklin, S. A.; Leyland, J. A.; Warmbrodt, W.
1985-01-01
Modern control systems must typically perform real-time identification and control, as well as coordinate a host of other activities related to user interaction, online graphics, and file management. This paper discusses five global design considerations which are useful to integrate array processor, multimicroprocessor, and host computer system architectures into versatile, high-speed controllers. Such controllers are capable of very high control throughput, and can maintain constant interaction with the nonreal-time or user environment. As an application example, the architecture of a high-speed, closed-loop controller used to actively control helicopter vibration is briefly discussed. Although this system has been designed for use as the controller for real-time rotorcraft dynamics and control studies in a wind tunnel environment, the controller architecture can generally be applied to a wide range of automatic control applications.
NASA Astrophysics Data System (ADS)
Makino, Junichiro
2002-12-01
We overview our GRAvity PipE (GRAPE) project to develop special-purpose computers for astrophysical N-body simulations. The basic idea of GRAPE is to attach a custom-built computer dedicated to the calculation of gravitational interactions between particles to a general-purpose programmable computer. With this hybrid architecture, we can achieve both a wide range of applications and very high peak performance. Our newest machine, GRAPE-6, achieved a peak speed of 32 Tflops and a sustained performance of 11.55 Tflops, for a total budget of about 4 million USD. We also discuss the relative advantages of special-purpose and general-purpose computers and the future of high-performance computing for science and technology.
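The GRAPE pipelines evaluate the all-pairs gravitational force sum; the hardware is not reproducible in a few lines, but the computation it implements is. Below is a plain NumPy sketch of the O(N^2) pairwise acceleration with Plummer softening (the softening length and unit system are arbitrary choices, not GRAPE parameters).

```python
import numpy as np

def pairwise_accelerations(pos: np.ndarray, mass: np.ndarray,
                           eps: float = 1e-3, G: float = 1.0) -> np.ndarray:
    """Direct-summation gravitational accelerations: the O(N^2) kernel
    that GRAPE-class hardware evaluates in custom pipelines."""
    diff = pos[None, :, :] - pos[:, None, :]       # r_j - r_i, shape (N, N, 3)
    dist2 = np.sum(diff**2, axis=-1) + eps**2      # softened squared distance
    inv_r3 = dist2**-1.5
    np.fill_diagonal(inv_r3, 0.0)                  # no self-interaction
    return G * np.einsum('ij,ijk,j->ik', inv_r3, diff, mass)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pos = rng.normal(size=(256, 3))
    mass = np.full(256, 1.0 / 256)
    print(pairwise_accelerations(pos, mass).shape)
```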
NASA Technical Reports Server (NTRS)
Fischer, James R.; Grosch, Chester; Mcanulty, Michael; Odonnell, John; Storey, Owen
1987-01-01
NASA's Office of Space Science and Applications (OSSA) gave a select group of scientists the opportunity to test and implement their computational algorithms on the Massively Parallel Processor (MPP) located at Goddard Space Flight Center, beginning in late 1985. One year later, the Working Group presented its report, which addressed the following: algorithms, programming languages, architecture, programming environments, the relation to theory, and measured performance. The findings point to a number of demonstrated computational techniques for which the MPP architecture is ideally suited. For example, besides executing much faster on the MPP than on conventional computers, systolic VLSI simulation (where distances are short), lattice simulation, neural network simulation, and image problems were found to be easier to program on the MPP's architecture than on a CYBER 205 or even a VAX. The report also makes technical recommendations covering all aspects of MPP use, and recommendations concerning the future of the MPP and machines based on similar architectures, expansion of the Working Group, and study of the role of future parallel processors for space station, EOS, and the Great Observatories era.
NASA Astrophysics Data System (ADS)
Fabien-Ouellet, Gabriel; Gloaguen, Erwan; Giroux, Bernard
2017-03-01
Full Waveform Inversion (FWI) aims at recovering the elastic parameters of the Earth by matching recordings of the ground motion with the direct solution of the wave equation. Modeling the wave propagation for realistic scenarios is computationally intensive, which limits the applicability of FWI. The current hardware evolution brings increasing parallel computing power that can speed up the computations in FWI. However, to take advantage of the diversity of parallel architectures presently available, new programming approaches are required. In this work, we explore the use of OpenCL to develop a portable code that can take advantage of the many parallel processor architectures now available. We present a program called SeisCL for 2D and 3D viscoelastic FWI in the time domain. The code computes the forward and adjoint wavefields using finite differences and outputs the gradient of the misfit function given by the adjoint state method. To demonstrate the code portability on different architectures, the performance of SeisCL is tested on three different devices: Intel CPUs, NVidia GPUs and Intel Xeon PHI. Results show that the use of GPUs with OpenCL can speed up the computations by nearly two orders of magnitude over a single-threaded application on the CPU. Although OpenCL allows code portability, we show that some device-specific optimization is still required to get the best performance out of a specific architecture. Using OpenCL in conjunction with MPI allows the domain decomposition of large models on several devices located on different nodes of a cluster. For large enough models, the speedup of the domain decomposition varies quasi-linearly with the number of devices. Finally, we investigate two different approaches to compute the gradient by the adjoint state method and show the significant advantages of using OpenCL for FWI.
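The forward solver in such codes is a finite-difference stencil applied over many time steps. As a highly simplified illustration of the kind of kernel that dominates the run time (a 1D acoustic, second-order scheme in NumPy, not SeisCL's 2D/3D viscoelastic OpenCL kernels), consider:

```python
import numpy as np

def acoustic_1d(velocity: np.ndarray, src: np.ndarray, dx: float, dt: float) -> np.ndarray:
    """Leapfrog time stepping of the 1D acoustic wave equation;
    a toy stand-in for the FD kernels that dominate FWI run time."""
    n, nt = velocity.size, src.size
    c2 = (velocity * dt / dx) ** 2
    p_prev, p_curr = np.zeros(n), np.zeros(n)
    wavefield = np.zeros((nt, n))
    for it in range(nt):
        lap = np.zeros(n)
        lap[1:-1] = p_curr[2:] - 2.0 * p_curr[1:-1] + p_curr[:-2]   # discrete Laplacian
        p_next = 2.0 * p_curr - p_prev + c2 * lap
        p_next[n // 2] += src[it]                                   # inject source mid-model
        p_prev, p_curr = p_curr, p_next
        wavefield[it] = p_curr
    return wavefield

if __name__ == "__main__":
    nt, dt, dx = 500, 1e-3, 5.0
    t = np.arange(nt) * dt
    ricker = (1 - 2 * (np.pi * 15 * (t - 0.05))**2) * np.exp(-(np.pi * 15 * (t - 0.05))**2)
    print(acoustic_1d(np.full(400, 2000.0), ricker, dx, dt).shape)
```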
Three-dimensional integration of nanotechnologies for computing and data storage on a single chip
NASA Astrophysics Data System (ADS)
Shulaker, Max M.; Hills, Gage; Park, Rebecca S.; Howe, Roger T.; Saraswat, Krishna; Wong, H.-S. Philip; Mitra, Subhasish
2017-07-01
The computing demands of future data-intensive applications will greatly exceed the capabilities of current electronics, and are unlikely to be met by isolated improvements in transistors, data storage technologies or integrated circuit architectures alone. Instead, transformative nanosystems, which use new nanotechnologies to simultaneously realize improved devices and new integrated circuit architectures, are required. Here we present a prototype of such a transformative nanosystem. It consists of more than one million resistive random-access memory cells and more than two million carbon-nanotube field-effect transistors—promising new nanotechnologies for use in energy-efficient digital logic circuits and for dense data storage—fabricated on vertically stacked layers in a single chip. Unlike conventional integrated circuit architectures, the layered fabrication realizes a three-dimensional integrated circuit architecture with fine-grained and dense vertical connectivity between layers of computing, data storage, and input and output (in this instance, sensing). As a result, our nanosystem can capture massive amounts of data every second, store it directly on-chip, perform in situ processing of the captured data, and produce ‘highly processed’ information. As a working prototype, our nanosystem senses and classifies ambient gases. Furthermore, because the layers are fabricated on top of silicon logic circuitry, our nanosystem is compatible with existing infrastructure for silicon-based technologies. Such complex nano-electronic systems will be essential for future high-performance and highly energy-efficient electronic systems.
High Performance Descriptive Semantic Analysis of Semantic Graph Databases
DOE Office of Scientific and Technical Information (OSTI.GOV)
Joslyn, Cliff A.; Adolf, Robert D.; al-Saffar, Sinan
As semantic graph database technology grows to address components ranging from extant large triple stores to SPARQL endpoints over SQL-structured relational databases, it will become increasingly important to be able to understand their inherent semantic structure, whether codified in explicit ontologies or not. Our group is researching novel methods for what we call descriptive semantic analysis of RDF triplestores, to serve purposes of analysis, interpretation, visualization, and optimization. But data size and computational complexity make it increasingly necessary to bring high performance computational resources to bear on this task. Our research group built a novel high performance hybrid system comprising computational capability for semantic graph database processing utilizing the large multi-threaded architecture of the Cray XMT platform, conventional servers, and large data stores. In this paper we describe that architecture and our methods, and present the results of our analyses of basic properties, connected components, namespace interaction, and typed paths for the Billion Triple Challenge 2010 dataset.
Performance of the Wavelet Decomposition on Massively Parallel Architectures
NASA Technical Reports Server (NTRS)
El-Ghazawi, Tarek A.; LeMoigne, Jacqueline; Zukor, Dorothy (Technical Monitor)
2001-01-01
Traditionally, Fourier Transforms have been utilized for performing signal analysis and representation. But although it is straightforward to reconstruct a signal from its Fourier transform, no local description of the signal is included in its Fourier representation. To alleviate this problem, Windowed Fourier transforms and then wavelet transforms have been introduced, and it has been proven that wavelets give a better localization than traditional Fourier transforms, as well as a better division of the time- or space-frequency plane than Windowed Fourier transforms. Because of these properties and after the development of several fast algorithms for computing the wavelet representation of any signal, in particular the Multi-Resolution Analysis (MRA) developed by Mallat, wavelet transforms have increasingly been applied to signal analysis problems, especially real-life problems, in which speed is critical. In this paper we present and compare efficient wavelet decomposition algorithms on different parallel architectures. We report and analyze experimental measurements, using NASA remotely sensed images. Results show that our algorithms achieve significant performance gains on current high performance parallel systems, and meet scientific applications and multimedia requirements. The extensive performance measurements collected over a number of high-performance computer systems have revealed important architectural characteristics of these systems, in relation to the processing demands of the wavelet decomposition of digital images.
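As a concrete reference for the decomposition being parallelized, here is a plain single-threaded Haar multi-resolution analysis (an illustration of MRA only, not the authors' parallel implementations); it assumes the signal length is divisible by 2 to the power of the number of levels.

```python
import numpy as np

def haar_decompose(signal: np.ndarray, levels: int):
    """Multi-resolution analysis with the Haar wavelet: at each level the
    approximation splits into a coarser approximation and detail coefficients."""
    approx = signal.astype(float)
    details = []
    for _ in range(levels):
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2))   # detail (high-pass) coefficients
        approx = (even + odd) / np.sqrt(2)          # approximation (low-pass)
    return approx, details

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = np.sin(np.linspace(0, 8 * np.pi, 1024)) + 0.1 * rng.normal(size=1024)
    a, d = haar_decompose(x, levels=3)
    print(a.shape, [c.shape for c in d])
```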
2012-09-30
Figure 10 (SoS Agent Architecture): system agents exchange capabilities with the SoS architecture, and a Fuzzy Inference Engine (FAM) takes Affordability, Flexibility, Performance, and Robustness as inputs to produce an architecture fitness, or objective, function.
Dense and Sparse Matrix Operations on the Cell Processor
DOE Office of Scientific and Technical Information (OSTI.GOV)
Williams, Samuel W.; Shalf, John; Oliker, Leonid
2005-05-01
The slowing pace of commodity microprocessor performance improvements combined with ever-increasing chip power demands has become of utmost concern to computational scientists. Therefore, the high performance computing community is examining alternative architectures that address the limitations of modern superscalar designs. In this work, we examine STI's forthcoming Cell processor: a novel, low-power architecture that combines a PowerPC core with eight independent SIMD processing units coupled with a software-controlled memory to offer high FLOP/s/Watt. Since neither Cell hardware nor cycle-accurate simulators are currently publicly available, we develop an analytic framework to predict Cell performance on dense and sparse matrix operations, using a variety of algorithmic approaches. Results demonstrate Cell's potential to deliver more than an order of magnitude better GFLOP/s per watt performance, when compared with the Intel Itanium2 and Cray X1 processors.
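Analytic performance prediction of this kind typically bounds a kernel by either peak compute or memory bandwidth times its arithmetic intensity. The sketch below shows a generic roofline-style bound of that flavor, not the paper's Cell-specific model; the peak numbers and intensities are placeholders.

```python
def attainable_gflops(peak_gflops: float, peak_gbytes_per_s: float,
                      arithmetic_intensity: float) -> float:
    """Roofline-style bound: a kernel is limited either by peak compute
    or by memory bandwidth times its flops-per-byte ratio."""
    return min(peak_gflops, peak_gbytes_per_s * arithmetic_intensity)

if __name__ == "__main__":
    # Placeholder machine numbers, not measured Cell figures.
    peak, bw = 200.0, 25.0
    # Sparse matrix-vector multiply is roughly 0.25 flops/byte;
    # blocked dense matrix-matrix multiply can exceed 10 flops/byte.
    for name, ai in [("SpMV", 0.25), ("GEMM", 16.0)]:
        print(name, attainable_gflops(peak, bw, ai), "GFLOP/s bound")
```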
Portable multi-node LQCD Monte Carlo simulations using OpenACC
NASA Astrophysics Data System (ADS)
Bonati, Claudio; Calore, Enrico; D'Elia, Massimo; Mesiti, Michele; Negro, Francesco; Sanfilippo, Francesco; Schifano, Sebastiano Fabio; Silvi, Giorgio; Tripiccione, Raffaele
This paper describes a state-of-the-art parallel Lattice QCD Monte Carlo code for staggered fermions, purposely designed to be portable across different computer architectures, including GPUs and commodity CPUs. Portability is achieved using the OpenACC parallel programming model, used to develop a code that can be compiled for several processor architectures. The paper focuses on parallelization on multiple computing nodes, using OpenACC to manage parallelism within the node and OpenMPI to manage parallelism among the nodes. We first discuss the strategies adopted to maximize performance, then describe selected relevant details of the code, and finally measure the level of performance and scaling that we are able to achieve. The work focuses mainly on GPUs, which offer a significantly higher level of performance for this application, but also compares with results measured on other processors.
Creating executable architectures using Visual Simulation Objects (VSO)
NASA Astrophysics Data System (ADS)
Woodring, John W.; Comiskey, John B.; Petrov, Orlin M.; Woodring, Brian L.
2005-05-01
Investigations have been performed to identify a methodology for creating executable models of architectures and simulations of architecture that lead to an understanding of their dynamic properties. Colored Petri Nets (CPNs) are used to describe architecture because of their strong mathematical foundations, the existence of techniques for their verification, and graph theory's well-established history of success in modern science. CPNs have been extended to interoperate with legacy simulations via a High Level Architecture (HLA) compliant interface. It has also been demonstrated that an architecture created as a CPN can be integrated with Department of Defense Architecture Framework products to ensure consistency between static and dynamic descriptions. A computer-aided tool, Visual Simulation Objects (VSO), which aids analysts in specifying, composing and executing architectures, has been developed to verify the methodology and as a prototype commercial product.
A Standard Platform for Testing and Comparison of MDAO Architectures
NASA Technical Reports Server (NTRS)
Gray, Justin S.; Moore, Kenneth T.; Hearn, Tristan A.; Naylor, Bret A.
2012-01-01
The Multidisciplinary Design Analysis and Optimization (MDAO) community has developed a multitude of algorithms and techniques, called architectures, for performing optimizations on complex engineering systems which involve coupling between multiple discipline analyses. These architectures seek to efficiently handle optimizations with computationally expensive analyses including multiple disciplines. We propose a new testing procedure that can provide a quantitative and qualitative means of comparison among architectures. The proposed test procedure is implemented within the open source framework, OpenMDAO, and comparative results are presented for five well-known architectures: MDF, IDF, CO, BLISS, and BLISS-2000. We also demonstrate how using open source software development methods can allow the MDAO community to submit new problems and architectures to keep the test suite relevant.
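To illustrate what an architecture such as MDF does, the sketch below converges a coupled multidisciplinary analysis inside every objective evaluation. This is a generic two-discipline toy problem in plain Python with SciPy, not the OpenMDAO test-suite implementation; the discipline equations and objective are made up for illustration.

```python
from scipy.optimize import minimize

def coupled_analysis(x: float, tol: float = 1e-10):
    """Gauss-Seidel fixed point between two toy 'disciplines':
    y1 = x**2 + y2 and y2 = 0.5 * y1.  MDF converges this loop at every
    optimizer iteration before evaluating the objective."""
    y1, y2 = 0.0, 0.0
    for _ in range(100):
        y1_new = x**2 + y2
        y2_new = 0.5 * y1_new
        if abs(y1_new - y1) < tol and abs(y2_new - y2) < tol:
            break
        y1, y2 = y1_new, y2_new
    return y1_new, y2_new

def objective(xv):
    x = xv[0]
    y1, y2 = coupled_analysis(x)       # multidisciplinary feasibility enforced here
    return (x - 3.0) ** 2 + y1 + y2    # arbitrary smooth objective

if __name__ == "__main__":
    res = minimize(objective, x0=[1.0], method="SLSQP")
    print(res.x, res.fun)              # analytic optimum is x = 0.75
```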
Convolutional networks for fast, energy-efficient neuromorphic computing
Esser, Steven K.; Merolla, Paul A.; Arthur, John V.; Cassidy, Andrew S.; Appuswamy, Rathinakumar; Andreopoulos, Alexander; Berg, David J.; McKinstry, Jeffrey L.; Melano, Timothy; Barch, Davis R.; di Nolfo, Carmelo; Datta, Pallab; Amir, Arnon; Taba, Brian; Flickner, Myron D.; Modha, Dharmendra S.
2016-01-01
Deep networks are now able to achieve human-level performance on a broad spectrum of recognition tasks. Independently, neuromorphic computing has now demonstrated unprecedented energy-efficiency through a new chip architecture based on spiking neurons, low precision synapses, and a scalable communication network. Here, we demonstrate that neuromorphic computing, despite its novel architectural primitives, can implement deep convolution networks that (i) approach state-of-the-art classification accuracy across eight standard datasets encompassing vision and speech, (ii) perform inference while preserving the hardware’s underlying energy-efficiency and high throughput, running on the aforementioned datasets at between 1,200 and 2,600 frames/s and using between 25 and 275 mW (effectively >6,000 frames/s per Watt), and (iii) can be specified and trained using backpropagation with the same ease-of-use as contemporary deep learning. This approach allows the algorithmic power of deep learning to be merged with the efficiency of neuromorphic processors, bringing the promise of embedded, intelligent, brain-inspired computing one step closer. PMID:27651489
Sequoia: A fault-tolerant tightly coupled multiprocessor for transaction processing
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bernstein, P.A.
1988-02-01
The Sequoia computer is a tightly coupled multiprocessor, and thus attains the performance advantages of this style of architecture. It avoids most of the fault-tolerance disadvantages of tight coupling by using a new fault-tolerance design. The Sequoia architecture is similar to other multimicroprocessor architectures, such as those of Encore and Sequent, in that it gives dozens of microprocessors shared access to a large main memory. It resembles the Stratus architecture in its extensive use of hardware fault-detection techniques. It resembles Stratus and Auragen in its ability to quickly recover all processes after a single point failure, transparently to the user. However, Sequoia is unique in its combination of a large-scale tightly coupled architecture with a hardware approach to fault tolerance. This article gives an overview of how the hardware architecture and operating systems (OS) work together to provide a high degree of fault tolerance with good system performance.
Matrix-vector multiplication using digital partitioning for more accurate optical computing
NASA Technical Reports Server (NTRS)
Gary, C. K.
1992-01-01
Digital partitioning offers a flexible means of increasing the accuracy of an optical matrix-vector processor. This algorithm can be implemented with the same architecture required for a purely analog processor, which gives optical matrix-vector processors the ability to perform high-accuracy calculations at speeds comparable with or greater than electronic computers as well as the ability to perform analog operations at a much greater speed. Digital partitioning is compared with digital multiplication by analog convolution, residue number systems, and redundant number representation in terms of the size and the speed required for an equivalent throughput as well as in terms of the hardware requirements. Digital partitioning and digital multiplication by analog convolution are found to be the most efficient algorithms if coding time and hardware are considered, and the architecture for digital partitioning permits the use of analog computations to provide the greatest throughput for a single processor.
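A software analogue of the partitioning arithmetic (a numerical sketch only, not a model of the optical hardware): each digit slice of the input vector needs only a low-precision product, as an analog stage would provide, and the partial results are recombined digitally with radix weights. The base and digit count below are arbitrary choices.

```python
import numpy as np

def partitioned_matvec(A: np.ndarray, x: np.ndarray, base: int = 16, digits: int = 4) -> np.ndarray:
    """Matrix-vector product computed digit-by-digit: low-precision partial
    products per digit slice, exact digital recombination with radix weights."""
    x = x.astype(np.int64)
    result = np.zeros(A.shape[0], dtype=np.int64)
    for d in range(digits):
        digit_slice = (x // base**d) % base   # d-th base-`base` digit of each entry
        partial = A @ digit_slice             # low-precision partial product
        result += partial * base**d           # weight and accumulate digitally
    return result

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.integers(0, 10, size=(4, 6))
    x = rng.integers(0, 16**4, size=6)        # four hex digits per vector entry
    assert np.array_equal(partitioned_matvec(A, x), A @ x)
    print(partitioned_matvec(A, x))
```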
On the Usage of GPUs for Efficient Motion Estimation in Medical Image Sequences
Thiyagalingam, Jeyarajan; Goodman, Daniel; Schnabel, Julia A.; Trefethen, Anne; Grau, Vicente
2011-01-01
Images are ubiquitous in biomedical applications from basic research to clinical practice. With the rapid increase in resolution, dimensionality of the images and the need for real-time performance in many applications, computational requirements demand proper exploitation of multicore architectures. Towards this, GPU-specific implementations of image analysis algorithms are particularly promising. In this paper, we investigate the mapping of an enhanced motion estimation algorithm to novel GPU-specific architectures, the resulting challenges and benefits therein. Using a database of three-dimensional image sequences, we show that the mapping leads to substantial performance gains, up to a factor of 60, and can provide near-real-time experience. We also show how architectural peculiarities of these devices can be best exploited in the benefit of algorithms, most specifically for addressing the challenges related to their access patterns and different memory configurations. Finally, we evaluate the performance of the algorithm on three different GPU architectures and perform a comprehensive analysis of the results. PMID:21869880
EOS MLS Science Data Processing System: A Description of Architecture and Capabilities
NASA Technical Reports Server (NTRS)
Cuddy, David T.; Echeverri, Mark D.; Wagner, Paul A.; Hanzel, Audrey T.; Fuller, Ryan A.
2006-01-01
This paper describes the architecture and capabilities of the Science Data Processing System (SDPS) for the EOS MLS. The SDPS consists of two major components--the Science Computing Facility and the Science Investigator-led Processing System. The Science Computing Facility provides the facilities for the EOS MLS Science Team to perform the functions of scientific algorithm development, processing software development, quality control of data products, and scientific analyses. The Science Investigator-led Processing System processes and reprocesses the science data for the entire mission and delivers the data products to the Science Computing Facility and to the Goddard Space Flight Center Earth Science Distributed Active Archive Center, which archives and distributes the standard science products.
Model of a programmable quantum processing unit based on a quantum transistor effect
NASA Astrophysics Data System (ADS)
Ablayev, Farid; Andrianov, Sergey; Fetisov, Danila; Moiseev, Sergey; Terentyev, Alexandr; Urmanchev, Andrey; Vasiliev, Alexander
2018-02-01
In this paper we propose a model of a programmable quantum processing device realizable with existing nano-photonic technologies. It can be viewed as a basis for new high-performance hardware architectures. Protocols for the physical implementation of the device, based on controlled photon transfer and atomic transitions, are presented. These protocols are designed for executing basic single-qubit and multi-qubit gates forming a universal set. We analyze the possible operation of this quantum computer scheme. Then we formalize the physical architecture by a mathematical model of a Quantum Processing Unit (QPU), which we use as a basis for the Quantum Programming Framework. This framework makes it possible to perform universal quantum computations in a multitasking environment.
Evaluation of the Intel iWarp parallel processor for space flight applications
NASA Technical Reports Server (NTRS)
Hine, Butler P., III; Fong, Terrence W.
1993-01-01
The potential of a DARPA-sponsored advanced processor, the Intel iWarp, for use in future SSF Data Management Systems (DMS) upgrades is evaluated through integration into the Ames DMS testbed and applications testing. The iWarp is a distributed, parallel computing system well suited for high performance computing applications such as matrix operations and image processing. The system architecture is modular, supports systolic and message-based computation, and is capable of providing massive computational power in a low-cost, low-power package. As a consequence, the iWarp offers significant potential for advanced space-based computing. This research seeks to determine the iWarp's suitability as a processing device for space missions. In particular, the project focuses on evaluating the ease of integrating the iWarp into the SSF DMS baseline architecture and the iWarp's ability to support computationally stressing applications representative of SSF tasks.
A Computer for Low Context-Switch Time
1990-03-01
Results: To find out how an implementation performs, we use a set of programs that make up a simulation system. These programs compile C language programs ... have worse relative context-switch performance: the time needed to switch contexts has not decreased as much as the time to run programs. Much of ... this study is: How seriously is throughput performance impaired by this approach to computer architecture? Reasonable estimates are possible only
Multidisciplinary Shape Optimization of a Composite Blended Wing Body Aircraft
NASA Astrophysics Data System (ADS)
Boozer, Charles Maxwell
A multidisciplinary shape optimization tool coupling aerodynamics, structure, and performance was developed for battery-powered aircraft. Utilizing high-fidelity computational fluid dynamics analysis tools and a structural wing weight tool, coupled through the multidisciplinary feasible optimization architecture, the aircraft geometry is modified in the optimization of the aircraft's range or endurance. The developed tool is applied to three geometries: a hybrid blended-wing-body delta-wing UAS, the ONERA M6 wing, and a modified ONERA M6 wing. First, the optimization problem is presented with the objective function, constraints, and design vector. Next, the tool's architecture and the analysis tools that are utilized are described. Finally, various optimizations are described and their results analyzed for all test subjects. Results show that less computationally expensive inviscid optimizations yield positive performance improvements using planform, airfoil, and three-dimensional degrees of freedom. From the results obtained through a series of optimizations, it is concluded that the newly developed tool is both effective at improving performance and serves as a platform ready to receive additional performance modules, further improving its computational design support potential.
Fuzzy Behavior Modulation with Threshold Activation for Autonomous Vehicle Navigation
NASA Technical Reports Server (NTRS)
Tunstel, Edward
2000-01-01
This paper describes fuzzy logic techniques used in a hierarchical behavior-based architecture for robot navigation. An architectural feature for threshold activation of fuzzy-behaviors is emphasized, which is potentially useful for tuning navigation performance in real world applications. The target application is autonomous local navigation of a small planetary rover. Threshold activation of low-level navigation behaviors is the primary focus. A preliminary assessment of its impact on local navigation performance is provided based on computer simulations.
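A minimal sketch of the threshold-activation idea: each fuzzy behavior contributes to the blended command only when its applicability exceeds a threshold. The membership functions, behaviors, and threshold values below are made up for illustration and are not the rover implementation.

```python
def falling(x: float, full: float, zero: float) -> float:
    """Membership that is 1 up to `full`, decreases linearly, and is 0 beyond `zero`."""
    if x <= full:
        return 1.0
    if x >= zero:
        return 0.0
    return (zero - x) / (zero - full)

def rising(x: float, zero: float, full: float) -> float:
    """Membership that is 0 up to `zero` and rises linearly to 1 at `full`."""
    if x <= zero:
        return 0.0
    if x >= full:
        return 1.0
    return (x - zero) / (full - zero)

def blended_steering(obstacle_dist_m: float, goal_bearing_deg: float,
                     activation_threshold: float = 0.2) -> float:
    """Weight each behavior's steering recommendation by its fuzzy applicability,
    keeping only behaviors whose applicability exceeds the activation threshold."""
    behaviors = [
        # (applicability,                              recommended steering in degrees)
        (falling(obstacle_dist_m, 0.3, 1.5),           45.0),              # avoid-obstacle
        (rising(abs(goal_bearing_deg), 2.0, 60.0),     goal_bearing_deg),  # seek-goal
    ]
    active = [(w, s) for w, s in behaviors if w >= activation_threshold]
    if not active:
        return 0.0                                    # no behavior active: drive straight
    total = sum(w for w, _ in active)
    return sum(w * s for w, s in active) / total      # weighted (defuzzified) command

if __name__ == "__main__":
    print(blended_steering(obstacle_dist_m=0.6, goal_bearing_deg=-30.0))
```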
Interfacing Computer Aided Parallelization and Performance Analysis
NASA Technical Reports Server (NTRS)
Jost, Gabriele; Jin, Haoqiang; Labarta, Jesus; Gimenez, Judit; Biegel, Bryan A. (Technical Monitor)
2003-01-01
When porting sequential applications to parallel computer architectures, the program developer will typically go through several cycles of source code optimization and performance analysis. We have started a project to develop an environment where the user can jointly navigate through program structure and performance data information in order to make efficient optimization decisions. In a prototype implementation we have interfaced the CAPO computer aided parallelization tool with the Paraver performance analysis tool. We describe both tools and their interface and give an example for how the interface helps within the program development cycle of a benchmark code.
Multiprocessor architectural study
NASA Technical Reports Server (NTRS)
Kosmala, A. L.; Stanten, S. F.; Vandever, W. H.
1972-01-01
An architectural design study was made of a multiprocessor computing system intended to meet functional and performance specifications appropriate to a manned space station application. Intermetrics' previous experience and accumulated knowledge of the multiprocessor field are used to generate a baseline philosophy for the design of a future SUMC* multiprocessor. Interrupts are defined and the crucial questions of interrupt structure, such as processor selection and response time, are discussed. Memory hierarchy and performance are discussed extensively, with particular attention to the design approach which utilizes a cache memory associated with each processor. The ability of an individual processor to approach its theoretical maximum performance is then analyzed in terms of a hit ratio. Memory management is envisioned as a virtual memory system implemented either through segmentation or paging. Addressing is discussed in terms of the various register designs adopted by current computers and those of advanced design.
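The hit-ratio analysis mentioned above reduces to an expected-access-time model: the closer the hit ratio is to one, the closer the processor gets to its all-hits maximum. The latencies in the sketch are placeholders, not values from the study.

```python
def effective_access_time(hit_ratio: float, t_cache: float, t_main: float) -> float:
    """Expected memory access time for a processor with a private cache:
    hits cost t_cache, misses cost the main-memory access time t_main."""
    return hit_ratio * t_cache + (1.0 - hit_ratio) * t_main

def relative_performance(hit_ratio: float, t_cache: float, t_main: float) -> float:
    """Fraction of the theoretical maximum (all-hits) performance achieved."""
    return t_cache / effective_access_time(hit_ratio, t_cache, t_main)

if __name__ == "__main__":
    # Placeholder latencies in processor cycles.
    for h in (0.80, 0.95, 0.99):
        print(h, f"{relative_performance(h, t_cache=1.0, t_main=20.0):.2f}")
```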
A novel parallel architecture for local histogram equalization
NASA Astrophysics Data System (ADS)
Ohannessian, Mesrob I.; Choueiter, Ghinwa F.; Diab, Hassan
2005-07-01
Local histogram equalization is an image enhancement algorithm that has found wide application in the pre-processing stage of areas such as computer vision, pattern recognition and medical imaging. The computationally intensive nature of the procedure, however, is a main limitation when real time interactive applications are in question. This work explores the possibility of performing parallel local histogram equalization, using an array of special purpose elementary processors, through an HDL implementation that targets FPGA or ASIC platforms. A novel parallelization scheme is presented and the corresponding architecture is derived. The algorithm is reduced to pixel-level operations. Processing elements are assigned image blocks, to maintain a reasonable performance-cost ratio. To further simplify both processor and memory organizations, a bit-serial access scheme is used. A brief performance assessment is provided to illustrate and quantify the merit of the approach.
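A serial reference for the operation being parallelized: histogram equalization applied independently per image block, the per-processing-element work unit in a block-partitioned design. The block size is arbitrary, and practical local-HE variants interpolate between neighboring blocks, which is omitted here.

```python
import numpy as np

def equalize_block(block: np.ndarray) -> np.ndarray:
    """Histogram-equalize one 8-bit block via its cumulative histogram."""
    hist = np.bincount(block.ravel(), minlength=256)
    cdf = np.cumsum(hist)
    cdf = (cdf - cdf.min()) / max(cdf.max() - cdf.min(), 1)  # normalize to [0, 1]
    return (cdf[block] * 255).astype(np.uint8)

def local_histogram_equalization(image: np.ndarray, block: int = 32) -> np.ndarray:
    """Apply equalization independently per block; each block maps naturally
    to one processing element in a parallel implementation."""
    out = np.empty_like(image)
    h, w = image.shape
    for r in range(0, h, block):
        for c in range(0, w, block):
            out[r:r+block, c:c+block] = equalize_block(image[r:r+block, c:c+block])
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.integers(0, 256, size=(128, 128), dtype=np.uint8)
    print(local_histogram_equalization(img).dtype)
```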
Developing a Distributed Computing Architecture at Arizona State University.
ERIC Educational Resources Information Center
Armann, Neil; And Others
1994-01-01
Development of Arizona State University's computing architecture, designed to ensure that all new distributed computing pieces will work together, is described. Aspects discussed include the business rationale, the general architectural approach, characteristics and objectives of the architecture, specific services, and impact on the university…
Reprogrammable logic in memristive crossbar for in-memory computing
NASA Astrophysics Data System (ADS)
Cheng, Long; Zhang, Mei-Yun; Li, Yi; Zhou, Ya-Xiong; Wang, Zhuo-Rui; Hu, Si-Yu; Long, Shi-Bing; Liu, Ming; Miao, Xiang-Shui
2017-12-01
Memristive stateful logic has emerged as a promising next-generation in-memory computing paradigm to address escalating computing-performance pressures in traditional von Neumann architecture. Here, we present a nonvolatile reprogrammable logic method that can process data between different rows and columns in a memristive crossbar array based on material implication (IMP) logic. Arbitrary Boolean logic can be executed with a reprogrammable cell containing four memristors in a crossbar array. In the fabricated Ti/HfO2/W memristive array, some fundamental functions, such as universal NAND logic and data transfer, were experimentally implemented. Moreover, using eight memristors in a 2 × 4 array, a one-bit full adder was theoretically designed and verified by simulation to exhibit the feasibility of our method to accomplish complex computing tasks. In addition, some critical logic-related performances were further discussed, such as the flexibility of data processing, cascading problem and bit error rate. Such a method could be a step forward in developing IMP-based memristive nonvolatile logic for large-scale in-memory computing architecture.
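As a behavioral check of why IMP is universal (a truth-table simulation only, no device physics or crossbar addressing), note that NAND can be built from two IMP steps and one cleared working cell, since q IMP (p IMP 0) equals NOT q OR NOT p.

```python
def imp(p: int, q: int) -> int:
    """Material implication: p IMP q = (NOT p) OR q, the primitive a
    memristive stateful-logic step computes in place on the target cell."""
    return int((not p) or q)

def nand_via_imp(p: int, q: int) -> int:
    """NAND from two IMP steps on a working cell s initialized to 0:
    s <- p IMP s gives NOT p; s <- q IMP s gives (NOT q) OR (NOT p) = NAND(p, q)."""
    s = 0            # cleared working memristor
    s = imp(p, s)
    s = imp(q, s)
    return s

if __name__ == "__main__":
    for p in (0, 1):
        for q in (0, 1):
            assert nand_via_imp(p, q) == int(not (p and q))
    print("NAND truth table reproduced from IMP steps")
```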
2013-03-29
Figure 31 (Fuzzy Assessor for the SoS Agent for Assessment of SoS Architecture): a fuzzy rule base takes Affordability, Flexibility, Performance, and Robustness as inputs and produces an Architecture Quality output; this fuzzy assessor resides in the SoS agent.
An object-oriented software approach for a distributed human tracking motion system
NASA Astrophysics Data System (ADS)
Micucci, Daniela L.
2003-06-01
Tracking is a composite job involving the co-operation of autonomous activities which exploit a complex information model and rely on a distributed architecture. Both information and activities must be classified and related in several dimensions: abstraction levels (what is modelled and how information is processed); topology (where the modelled entities are); time (when entities exist); strategy (why something happens); responsibilities (who is in charge of processing the information). A proper Object-Oriented analysis and design approach leads to a modular architecture where information about conceptual entities is modelled at each abstraction level via classes and intra-level associations, whereas inter-level associations between classes model the abstraction process. Both information and computation are partitioned according to level-specific topological models. They are also placed in a temporal framework modelled by suitable abstractions. Domain-specific strategies control the execution of the computations. Computational components perform both intra-level processing and intra-level information conversion. The paper overviews the phases of the analysis and design process, presents major concepts at each abstraction level, and shows how the resulting design turns into a modular, flexible and adaptive architecture. Finally, the paper sketches how the conceptual architecture can be deployed into a concrete distributed architecture by relying on an experimental framework.
Frances: A Tool for Understanding Computer Architecture and Assembly Language
ERIC Educational Resources Information Center
Sondag, Tyler; Pokorny, Kian L.; Rajan, Hridesh
2012-01-01
Students in all areas of computing require knowledge of the computing device including software implementation at the machine level. Several courses in computer science curricula address these low-level details such as computer architecture and assembly languages. For such courses, there are advantages to studying real architectures instead of…
Concepts and Relations in Neurally Inspired In Situ Concept-Based Computing
van der Velde, Frank
2016-01-01
In situ concept-based computing is based on the notion that conceptual representations in the human brain are “in situ.” In this way, they are grounded in perception and action. Examples are neuronal assemblies, whose connection structures develop over time and are distributed over different brain areas. In situ concepts representations cannot be copied or duplicated because that will disrupt their connection structure, and thus the meaning of these concepts. Higher-level cognitive processes, as found in language and reasoning, can be performed with in situ concepts by embedding them in specialized neurally inspired “blackboards.” The interactions between the in situ concepts and the blackboards form the basis for in situ concept computing architectures. In these architectures, memory (concepts) and processing are interwoven, in contrast with the separation between memory and processing found in Von Neumann architectures. Because the further development of Von Neumann computing (more, faster, yet power limited) is questionable, in situ concept computing might be an alternative for concept-based computing. In situ concept computing will be illustrated with a recently developed BABI reasoning task. Neurorobotics can play an important role in the development of in situ concept computing because of the development of in situ concept representations derived in scenarios as needed for reasoning tasks. Neurorobotics would also benefit from power limited and in situ concept computing. PMID:27242504
Tutorial: Computer architecture
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gajski, D.D.; Milutinovic, V.M.; Siegel, H.J.
1986-01-01
This book presents the state-of-the-art in advanced computer architecture. It deals with the concepts underlying current architectures and covers approaches and techniques being used in the design of advanced computer systems.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Brun, B.
1997-07-01
Computer technology has improved tremendously during the last few years, with larger media capacity, more memory, and more computational power. Visual computing, with high-performance graphical interfaces and desktop computational power, has changed the way engineers accomplish everyday tasks, development work, and safety-study analyses. The emergence of parallel computing will permit simulation over larger domains. In addition, new development methods, languages, and tools have appeared in the last several years.
Micromagnetics on high-performance workstation and mobile computational platforms
NASA Astrophysics Data System (ADS)
Fu, S.; Chang, R.; Couture, S.; Menarini, M.; Escobar, M. A.; Kuteifan, M.; Lubarda, M.; Gabay, D.; Lomakin, V.
2015-05-01
The feasibility of using high-performance desktop and embedded mobile computational platforms is presented, including multi-core Intel central processing units, Nvidia desktop graphics processing units, and the Nvidia Jetson TK1 platform. The FastMag finite-element-method-based micromagnetic simulator is used as a testbed, showing high efficiency on all the platforms. Optimization aspects of improving the performance of the mobile systems are discussed. The high performance, low cost, low power consumption, and rapid performance increase of the embedded mobile systems make them a promising candidate for micromagnetic simulations. Such architectures can be used as standalone systems or can be built into low-power computing clusters.
Outline of a novel architecture for cortical computation.
Majumdar, Kaushik
2008-03-01
In this paper a novel architecture for cortical computation has been proposed. This architecture is composed of computing paths consisting of neurons and synapses. These paths have been decomposed into lateral, longitudinal and vertical components. Cortical computation has then been decomposed into lateral computation (LaC), longitudinal computation (LoC) and vertical computation (VeC). It has been shown that various loop structures in the cortical circuit play important roles in cortical computation as well as in memory storage and retrieval, keeping in conformity with the molecular basis of short and long term memory. A new learning scheme for the brain has also been proposed and how it is implemented within the proposed architecture has been explained. A few mathematical results about the architecture have been proposed, some of which are without proof.
Fracture Response Enhancement Of Aluminum Using In-Situ Selective Reinforcement
NASA Technical Reports Server (NTRS)
Abada, Christopher H.; Farley, Gary L.; Hyer, Michael W.
2006-01-01
A computer-based parametric study of the effect of reinforcement architectures on fracture response of aluminum compact-tension (CT) specimens is performed. Eleven different reinforcement architectures consisting of rectangular and triangular cross-section reinforcements were evaluated. Reinforced specimens produced between 13 and 28 percent higher fracture load than achieved with the unreinforced case. Reinforcements with blunt leading edges (rectangular reinforcements) exhibited superior performance relative to the triangular reinforcements with sharp leading edges. Relative to the rectangular reinforcements, the most important architectural feature was reinforcement thickness. At failure, the reinforcements carried between 58 and 85 percent of the load applied to the specimen, suggesting that there is considerable load transfer between the base material and the reinforcement.
NASA Technical Reports Server (NTRS)
Gorospe, George E., Jr.; Daigle, Matthew J.; Sankararaman, Shankar; Kulkarni, Chetan S.; Ng, Eley
2017-01-01
Prognostic methods enable operators and maintainers to predict the future performance for critical systems. However, these methods can be computationally expensive and may need to be performed each time new information about the system becomes available. In light of these computational requirements, we have investigated the application of graphics processing units (GPUs) as a computational platform for real-time prognostics. Recent advances in GPU technology have reduced cost and increased the computational capability of these highly parallel processing units, making them more attractive for the deployment of prognostic software. We present a survey of model-based prognostic algorithms with considerations for leveraging the parallel architecture of the GPU and a case study of GPU-accelerated battery prognostics with computational performance results.
NASA Astrophysics Data System (ADS)
Acernese, Fausto; Barone, Fabrizio; De Rosa, Rosario; Eleuteri, Antonio; Milano, Leopoldo; Pardi, Silvio; Ricciardi, Iolanda; Russo, Guido
2004-09-01
One of the main requirements of a digital system for the control of interferometric detectors of gravitational waves is computing power, a direct consequence of the increasing complexity of the digital algorithms necessary for control-signal generation. For this specific task many specialized, non-standard real-time architectures have been developed, often very expensive and difficult to upgrade. On the other hand, such computing power is generally fully available for off-line applications on standard PC-based systems. Therefore, a possible and obvious solution may be provided by the integration of the real-time and off-line architectures, resulting in a hybrid control system architecture based on standard available components, combining the perfect data synchronization provided by real-time systems with the large computing power available on PC-based systems. Such integration may be provided by linking the two architectures through the standard Ethernet network, whose data transfer speed has been increasing rapidly in recent years, using the TCP/IP, UDP and raw Ethernet protocols. In this paper we describe the architecture of a hybrid Ethernet-based real-time control system prototype we implemented in Napoli, discussing its characteristics and performance. Finally we discuss a possible application to the real-time control of a suspended mass of the mode cleaner of the 3m prototype optical interferometer for gravitational wave detection (IDGW-3P) operational in Napoli.
Real-time control system for adaptive resonator
DOE Office of Scientific and Technical Information (OSTI.GOV)
Flath, L; An, J; Brase, J
2000-07-24
Sustained operation of high average power solid-state lasers currently requires an adaptive resonator to produce the optimal beam quality. We describe the architecture of a real-time adaptive control system for correcting intra-cavity aberrations in a heat capacity laser. Image data collected from a wavefront sensor are processed and used to control phase with a high-spatial-resolution deformable mirror. Our controller takes advantage of recent developments in low-cost, high-performance processor technology. A desktop-based computational engine and object-oriented software architecture replace the high-cost rack-mount embedded computers of previous systems.
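A common way to turn wavefront-sensor measurements into deformable-mirror commands is a least-squares reconstructor built from a calibrated influence matrix inside a simple integrator loop. The sketch below illustrates that generic control math under those assumptions; it is not the laser system's actual controller, and the matrix sizes, gain, and random "calibration" are placeholders.

```python
import numpy as np

def build_reconstructor(influence: np.ndarray, rcond: float = 1e-3) -> np.ndarray:
    """Pseudo-inverse of the actuator-to-sensor influence matrix: the
    least-squares map from measured slopes to actuator commands."""
    return np.linalg.pinv(influence, rcond=rcond)

def control_step(commands, slopes, reconstructor, gain: float = 0.3):
    """One closed-loop integrator update that pushes the actuators to
    cancel the measured residual wavefront."""
    return commands - gain * (reconstructor @ slopes)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_sensors, n_actuators = 80, 32
    influence = rng.normal(size=(n_sensors, n_actuators))   # calibrated offline in practice
    R = build_reconstructor(influence)
    true_aberration = rng.normal(size=n_actuators)
    commands = np.zeros(n_actuators)
    for _ in range(20):
        slopes = influence @ (true_aberration + commands)   # sensor sees the residual
        commands = control_step(commands, slopes, R)
    print(f"residual rms: {np.linalg.norm(true_aberration + commands):.2e}")
```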
High performance network and channel-based storage
NASA Technical Reports Server (NTRS)
Katz, Randy H.
1991-01-01
In the traditional mainframe-centered view of a computer system, storage devices are coupled to the system through complex hardware subsystems called input/output (I/O) channels. With the dramatic shift towards workstation-based computing, and its associated client/server model of computation, storage facilities are now found attached to file servers and distributed throughout the network. We discuss the underlying technology trends that are leading to high performance network-based storage, namely advances in networks, storage devices, and I/O controller and server architectures. We review several commercial systems and research prototypes that are leading to a new approach to high performance computing based on network-attached storage.
Performance Analysis of a Hybrid Overset Multi-Block Application on Multiple Architectures
NASA Technical Reports Server (NTRS)
Djomehri, M. Jahed; Biswas, Rupak
2003-01-01
This paper presents a detailed performance analysis of a multi-block overset grid computational fluid dynamics application on multiple state-of-the-art computer architectures. The application is implemented using a hybrid MPI+OpenMP programming paradigm that exploits both coarse and fine-grain parallelism; the former via MPI message passing and the latter via OpenMP directives. The hybrid model also extends the applicability of multi-block programs to large clusters of SMP nodes by overcoming the restriction that the number of processors be less than the number of grid blocks. A key kernel of the application, namely the LU-SGS linear solver, had to be modified to enhance the performance of the hybrid approach on the target machines. Investigations were conducted on cacheless Cray SX6 vector processors, cache-based IBM Power3 and Power4 architectures, and single system image SGI Origin3000 platforms. Overall results for complex vortex dynamics simulations demonstrate that the SX6 achieves the highest performance and outperforms the RISC-based architectures; however, the best scaling performance was achieved on the Power3.
ERIC Educational Resources Information Center
La Brecque, Mort
1984-01-01
To break the bottleneck inherent in today's linear computer architectures, parallel schemes (which allow computers to perform multiple tasks at one time) are being devised. Several of these schemes are described. Dataflow devices, parallel number-crunchers, programing languages, and a device based on a neurological model are among the areas…
Phipps, Eric T.; D'Elia, Marta; Edwards, Harold C.; ...
2017-04-18
In this study, quantifying simulation uncertainties is a critical component of rigorous predictive simulation. A key component of this is forward propagation of uncertainties in simulation input data to output quantities of interest. Typical approaches involve repeated sampling of the simulation over the uncertain input data, and can require numerous samples when accurately propagating uncertainties from large numbers of sources. Often simulation processes from sample to sample are similar and much of the data generated from each sample evaluation could be reused. We explore a new method for implementing sampling methods that simultaneously propagates groups of samples together in an embedded fashion, which we call embedded ensemble propagation. We show how this approach takes advantage of properties of modern computer architectures to improve performance by enabling reuse between samples, reducing memory bandwidth requirements, improving memory access patterns, improving opportunities for fine-grained parallelization, and reducing communication costs. We describe a software technique for implementing embedded ensemble propagation based on the use of C++ templates and describe its integration with various scientific computing libraries within Trilinos. We demonstrate improved performance, portability and scalability for the approach applied to the simulation of partial differential equations on a variety of CPU, GPU, and accelerator architectures, including up to 131,072 cores on a Cray XK7 (Titan).
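The core idea, propagating a block of samples together so per-sample work is vectorized and operator data are reused, can be mimicked in a few lines. The sketch below is a NumPy caricature of the ensemble layout applied to a toy 1D diffusion problem, not the Trilinos C++-template machinery; the problem and coefficient distribution are made up.

```python
import numpy as np

def assemble_tridiagonal(n: int) -> np.ndarray:
    """1D diffusion stiffness matrix, shared by every sample in the ensemble."""
    A = np.zeros((n, n))
    idx = np.arange(n)
    A[idx, idx] = 2.0
    A[idx[:-1], idx[:-1] + 1] = -1.0
    A[idx[1:], idx[1:] - 1] = -1.0
    return A

def solve_ensemble(diffusivity_samples: np.ndarray, n: int = 64) -> np.ndarray:
    """Solve kappa_s * A u_s = f for all samples s at once.  Storing the
    ensemble as a trailing axis lets one factorization/solve call amortize
    the matrix data across samples, the reuse embedded propagation exploits."""
    A = assemble_tridiagonal(n)
    f = np.ones((n, diffusivity_samples.size))       # same load for every sample
    rhs = f / diffusivity_samples[None, :]           # fold uncertain coefficient into rhs
    return np.linalg.solve(A, rhs)                   # one call, all samples

if __name__ == "__main__":
    samples = np.random.default_rng(0).uniform(0.5, 2.0, size=128)
    print(solve_ensemble(samples).shape)             # (64, 128): grid points x samples
```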
DOE Office of Scientific and Technical Information (OSTI.GOV)
Madduri, Kamesh; Im, Eun-Jin; Ibrahim, Khaled Z.
The next decade of high-performance computing (HPC) systems will see a rapid evolution and divergence of multi- and manycore architectures as power and cooling constraints limit increases in microprocessor clock speeds. Understanding efficient optimization methodologies on diverse multicore designs in the context of demanding numerical methods is one of the greatest challenges faced today by the HPC community. In this paper, we examine the efficient multicore optimization of GTC, a petascale gyrokinetic toroidal fusion code for studying plasma microturbulence in tokamak devices. For GTC's key computational components (charge deposition and particle push), we explore efficient parallelization strategies across a broad range of emerging multicore designs, including the recently-released Intel Nehalem-EX, the AMD Opteron Istanbul, and the highly multithreaded Sun UltraSparc T2+. We also present the first study on tuning gyrokinetic particle-in-cell (PIC) algorithms for graphics processors, using the NVIDIA C2050 (Fermi). Our work discusses several novel optimization approaches for gyrokinetic PIC, including mixed-precision computation, particle binning and decomposition strategies, grid replication, SIMDized atomic floating-point operations, and effective GPU texture memory utilization. Overall, we achieve significant performance improvements of 1.3–4.7× on these complex PIC kernels, despite the inherent challenges of data dependency and locality. Finally, our work also points to several architectural and programming features that could significantly enhance PIC performance and productivity on next-generation architectures.
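Charge deposition is the scatter step that motivates atomics and grid replication, since many particles update the same grid cell. A serial 1D linear-weighting (cloud-in-cell) version makes the data dependency visible; this is a generic PIC sketch, not GTC's gyrokinetic four-point average.

```python
import numpy as np

def deposit_charge(positions: np.ndarray, weights: np.ndarray,
                   n_grid: int, dx: float) -> np.ndarray:
    """Linear (cloud-in-cell) charge deposition on a periodic 1D grid.
    The scatter-add into shared grid cells is the operation that needs
    atomics or per-thread grid replication on multicore/GPU hardware."""
    rho = np.zeros(n_grid)
    cell = np.floor(positions / dx).astype(int)
    frac = positions / dx - cell                   # fractional offset within the cell
    left = cell % n_grid
    right = (cell + 1) % n_grid
    np.add.at(rho, left, weights * (1.0 - frac))   # scatter to the two nearest points
    np.add.at(rho, right, weights * frac)
    return rho / dx

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_grid, dx = 64, 1.0
    pos = rng.uniform(0, n_grid * dx, size=10_000)
    w = np.full(pos.size, 1.0 / pos.size)
    rho = deposit_charge(pos, w, n_grid, dx)
    print(rho.sum() * dx)                          # total charge is conserved (about 1.0)
```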
Stamatakis, Alexandros; Ott, Michael
2008-12-27
The continuous accumulation of sequence data, for example, due to novel wet-laboratory techniques such as pyrosequencing, coupled with the increasing popularity of multi-gene phylogenies and emerging multi-core processor architectures that face problems of cache congestion, poses new challenges with respect to the efficient computation of the phylogenetic maximum-likelihood (ML) function. Here, we propose two approaches that can significantly speed up likelihood computations that typically represent over 95 per cent of the computational effort conducted by current ML or Bayesian inference programs. Initially, we present a method and an appropriate data structure to efficiently compute the likelihood score on 'gappy' multi-gene alignments. By 'gappy' we denote sampling-induced gaps owing to missing sequences in individual genes (partitions), i.e. not real alignment gaps. A first proof-of-concept implementation in RAXML indicates that this approach can accelerate inferences on large and gappy alignments by approximately one order of magnitude. Moreover, we present insights and initial performance results on multi-core architectures obtained during the transition from an OpenMP-based to a Pthreads-based fine-grained parallelization of the ML function.
On-chip visual perception of motion: a bio-inspired connectionist model on FPGA.
Torres-Huitzil, César; Girau, Bernard; Castellanos-Sánchez, Claudio
2005-01-01
Visual motion provides useful information for understanding the dynamics of a scene and allows intelligent systems to interact with their environment. Motion computation is usually restricted by real-time requirements that demand the design and implementation of specific hardware architectures. In this paper, the design of a hardware architecture for a bio-inspired neural model for motion estimation is presented. The motion estimation is based on a strongly localized bio-inspired connectionist model with a particular adaptation of spatio-temporal Gabor-like filtering. The architecture consists of three main modules that perform spatial, temporal, and excitatory-inhibitory connectionist processing. The biomimetic architecture is modeled, simulated and validated in VHDL. The synthesis results on a Field Programmable Gate Array (FPGA) device show the potential achievement of real-time performance at an affordable silicon area.
System and Propagation Availability Analysis for NASA's Advanced Air Transportation Technologies
NASA Technical Reports Server (NTRS)
Ugweje, Okechukwu C.
2000-01-01
This report summarizes the research on the System and Propagation Availability Analysis for NASA's project on Advanced Air Transportation Technologies (AATT). The objectives of the project were to determine the communication systems requirements and architecture, and to investigate the effect of propagation on the transmission of space information. In this report, results from the first year investigation are presented and limitations are highlighted. To study the propagation links, an understanding of the total system architecture is necessary since the links form the major component of the overall architecture. This study was conducted by way of analysis, modeling and simulation on the system communication links. The overall goal was to develop an understanding of the space communication requirements relevant to the AATT project, and then analyze the links taking into consideration system availability under adverse atmospheric weather conditions. This project began with a preliminary study of the end-to-end system architecture by modeling a representative communication system in MATLAB SIMULINK. Based on the defining concepts, the possibility of computer modeling was determined. The investigations continued with parametric studies of the communication system architecture. These studies were also carried out with SIMULINK modeling and simulation. After a series of modifications, two end-to-end communication links were identified as the most probable models for the communication architecture. Link budget calculations were then performed in MATHCAD and MATLAB for the identified communication scenarios. A remarkable outcome of this project is the development of a graphical user interface (GUI) program for the computation of the link budget parameters in real time. Using this program, one can interactively compute the link budget requirements after supplying a few necessary parameters. It provides a framework for the eventual automation of several computations required in many experimental NASA missions. For the first year of this project, most of the stated objectives were accomplished. We were able to identify probable communication systems architectures, model and analyze several communication links, perform numerous simulations on different system models, and then develop a program for the link budget analysis. However, most of the work is still unfinished. The effect of propagation on the transmission of information in the identified communication channels has not yet been studied. Propagation effects cannot be studied until the system under consideration is identified and characterized. To study the propagation links, an understanding of the total communications architecture is necessary. It is important to mention that the original project was intended for two years and the results presented here are only for the first year of research. It is prudent therefore that these efforts be continued in order to obtain a complete picture of the system and propagation availability requirements.
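The link-budget arithmetic that such a GUI automates is conventional; a minimal sketch is shown below, assuming placeholder values for EIRP, G/T, frequency, distance, and losses rather than figures from the AATT study.

```python
# Minimal sketch of a carrier-to-noise-density link budget (placeholder values).
import math

def free_space_path_loss_db(distance_m, freq_hz):
    c = 299_792_458.0
    return 20 * math.log10(4 * math.pi * distance_m * freq_hz / c)

def received_cn0_dbhz(eirp_dbw, gt_dbk, distance_m, freq_hz, extra_losses_db):
    """C/N0 = EIRP + G/T - FSPL - losses - 10*log10(k), with k Boltzmann's constant."""
    boltzmann_dbw = -228.6  # 10*log10(k) in dBW/(Hz*K)
    fspl = free_space_path_loss_db(distance_m, freq_hz)
    return eirp_dbw + gt_dbk - fspl - extra_losses_db - boltzmann_dbw

cn0 = received_cn0_dbhz(eirp_dbw=50.0, gt_dbk=30.0,
                        distance_m=36_000e3, freq_hz=20e9,
                        extra_losses_db=4.0)
print(f"C/N0 = {cn0:.1f} dB-Hz")
```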
An 81.6 μW FastICA processor for epileptic seizure detection.
Yang, Chia-Hsiang; Shih, Yi-Hsin; Chiueh, Herming
2015-02-01
To improve the performance of epileptic seizure detection, independent component analysis (ICA) is applied to multi-channel signals to separate artifacts and signals of interest. FastICA is an efficient algorithm to compute ICA. To reduce the energy dissipation, eigenvalue decomposition (EVD) is utilized in the preprocessing stage to reduce the convergence time of the iterative calculation of ICA components. EVD is computed efficiently through an array structure of processing elements running in parallel. An area-efficient EVD architecture is realized by leveraging the approximate Jacobi algorithm, leading to a 77.2% area reduction. By choosing a proper memory element and reduced wordlength, the power and area of the storage memory are reduced by 95.6% and 51.7%, respectively. The chip area is minimized through fixed-point implementation and architectural transformations. Given a latency constraint of 0.1 s, an 86.5% area reduction is achieved compared to the direct-mapped architecture. Fabricated in 90 nm CMOS, the core area of the chip is 0.40 mm². The FastICA processor, part of an integrated epileptic control SoC, dissipates 81.6 μW at 0.32 V. The computation delay of a frame of 256 samples for 8 channels is 84.2 ms. Compared to prior work, 0.5% power dissipation, 26.7% silicon area, and a 3.4× computation speedup are achieved. The performance of the chip was verified with a human dataset.
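A minimal floating-point sketch of the algorithmic flow described above (EVD-based whitening followed by a one-unit FastICA iteration with the tanh contrast) is given below. It ignores the chip's fixed-point wordlengths, approximate Jacobi EVD, and processing-element array; the signal mix and parameters are illustrative assumptions.

```python
# Minimal sketch: EVD whitening + one-unit FastICA (illustrative only).
import numpy as np

def whiten(X):
    """X: (channels, samples), zero-mean. EVD of the covariance gives the
    whitening transform that shortens FastICA's convergence."""
    cov = X @ X.T / X.shape[1]
    d, E = np.linalg.eigh(cov)
    return (E @ np.diag(1.0 / np.sqrt(d)) @ E.T) @ X

def fastica_one_unit(Z, n_iter=100, seed=0):
    """Extract one independent component from whitened data Z (tanh contrast)."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(Z.shape[0])
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        g = np.tanh(w @ Z)
        w_new = (Z * g).mean(axis=1) - (1.0 - g**2).mean() * w
        w_new /= np.linalg.norm(w_new)
        if abs(abs(w_new @ w) - 1.0) < 1e-8:
            w = w_new
            break
        w = w_new
    return w @ Z

rng = np.random.default_rng(1)
S = np.vstack([np.sign(np.sin(np.linspace(0, 40, 2048))),   # toy "artifact"
               rng.standard_normal(2048)])                   # toy "signal"
X = np.array([[1.0, 0.6], [0.4, 1.0]]) @ S                   # mixed channels
X -= X.mean(axis=1, keepdims=True)
component = fastica_one_unit(whiten(X))
```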
Innovative HPC architectures for the study of planetary plasma environments
NASA Astrophysics Data System (ADS)
Amaya, Jorge; Wolf, Anna; Lembège, Bertrand; Zitz, Anke; Alvarez, Damian; Lapenta, Giovanni
2016-04-01
DEEP-ER is a European Commission-funded project that develops a new type of High Performance Computer architecture. The revolutionary system is currently used by KU Leuven to study the effects of the solar wind on the global environments of the Earth and Mercury. The new architecture combines the versatility of Intel Xeon computing nodes with the power of the upcoming Intel Xeon Phi accelerators. Contrary to classical heterogeneous HPC architectures, where it is customary to find CPUs and accelerators in the same computing nodes, in the DEEP-ER system CPU nodes are grouped together (Cluster) independently from the accelerator nodes (Booster). The system is equipped with a state-of-the-art interconnection network, highly scalable and fast I/O, and a fail-recovery resiliency system. The final objective of the project is to introduce a scalable system that can be used to create the next generation of exascale supercomputers. The code iPic3D from KU Leuven is being adapted to this new architecture. This particle-in-cell code can now perform the computation of the electromagnetic fields on the Cluster while the particles are moved on the Booster side. Using fast and scalable Xeon Phi accelerators in the Booster, we can introduce many more particles per cell in the simulation than is possible in the current generation of HPC systems, allowing fully kinetic plasmas to be computed with very low interpolation noise. The system will be used to perform fully kinetic, low-noise, 3D simulations of the interaction of the solar wind with the magnetospheres of the Earth and Mercury. Preliminary simulations have been performed in other HPC centers in order to compare the results across different systems. In this presentation we show the complexity of the plasma flow around the planets, including the development of hydrodynamic instabilities at the flanks, the presence of the collisionless shock, the magnetosheath, the magnetopause, reconnection zones, the formation of the plasma sheet and the magnetotail, and the variation of ion/electron plasma flows when crossing these frontiers. The simulations also give access to detailed information about the particle dynamics and their velocity distributions at locations that can be used for comparison with satellite data.
Bilinearity, Rules, and Prefrontal Cortex
Dayan, Peter
2007-01-01
Humans can be instructed verbally to perform computationally complex cognitive tasks; their performance then improves relatively slowly over the course of practice. Many skills underlie these abilities; in this paper, we focus on the particular question of a uniform architecture for the instantiation of habitual performance and the storage, recall, and execution of simple rules. Our account builds on models of gated working memory, and involves a bilinear architecture for representing conditional input-output maps and for matching rules to the state of the input and working memory. We demonstrate the performance of our model on two paradigmatic tasks used to investigate prefrontal and basal ganglia function. PMID:18946523
Scaling Deep Learning workloads: NVIDIA DGX-1/Pascal and Intel Knights Landing
Gawande, Nitin A.; Daily, Jeff A.; Siegel, Charles; ...
2018-05-05
Deep Learning (DL) algorithms have become ubiquitous in data analytics. As a result, major computing vendors—including NVIDIA, Intel, AMD, and IBM—have architectural road maps influenced by DL workloads. Furthermore, several vendors have recently advertised new computing products as accelerating large DL workloads. Unfortunately, it is difficult for data scientists to quantify the potential of these different products. Here, this article provides a performance and power analysis of important DL workloads on two major parallel architectures: NVIDIA DGX-1 (eight Pascal P100 GPUs interconnected with NVLink) and Intel Knights Landing (KNL) CPUs interconnected with Intel Omni-Path or Cray Aries. Our evaluation consists of a cross section of convolutional neural net workloads: CifarNet, AlexNet, GoogLeNet, and ResNet50 topologies using the Cifar10 and ImageNet datasets. The workloads are vendor-optimized for each architecture. We use sequentially equivalent implementations to maintain iso-accuracy between parallel and sequential DL models. Our analysis indicates that although GPUs provide the highest overall performance, the gap can close for some convolutional networks; and the KNL can be competitive in performance/watt. We find that NVLink facilitates scaling efficiency on GPUs. However, its importance is heavily dependent on neural network architecture. Furthermore, for weak-scaling—sometimes encouraged by restricted GPU memory—NVLink is less important.
NASA Astrophysics Data System (ADS)
Lhamon, Michael Earl
A pattern recognition system which uses complex correlation filter banks requires proportionally more computational effort than single real-valued filters. This increases the computational burden but also exposes a higher level of parallelism that common computing platforms fail to exploit. As a result, we consider algorithm mapping to both optical and digital processors. For digital implementation, we develop computationally efficient pattern recognition algorithms, referred to as vector inner product operators, that require less computational effort than traditional fast Fourier methods. These algorithms do not need correlation and they map readily onto parallel digital architectures, which imply new architectures for optical processors. These filters exploit circulant-symmetric matrix structures of the training set data representing a variety of distortions. By using the same mathematical basis as with the vector inner product operations, we are able to extend the capabilities of more traditional correlation filtering to what we refer to as "Super Images". These "Super Images" are used to morphologically transform a complicated input scene into a predetermined dot pattern. The orientation of the dot pattern is related to the rotational distortion of the object of interest. The optical implementation of "Super Images" yields the feature reduction necessary for using other techniques, such as artificial neural networks. We propose a parallel digital signal processor architecture based on specific pattern recognition algorithms but general enough to be applicable to other similar problems. Such an architecture is classified as a data flow architecture. Instead of mapping an algorithm to an architecture, we propose mapping the DSP architecture to a class of pattern recognition algorithms. Today's optical processing systems have difficulties implementing full complex filter structures. Typically, optical systems (like the 4f correlators) are limited to phase-only implementation with lower detection performance than full complex electronic systems. Our study includes pseudo-random pixel encoding techniques for approximating full complex filtering. Optical filter bank implementation is possible and has the advantage of time-averaging the entire filter bank at real-time rates. Time-averaged optical filtering is computationally comparable to billions of digital operations per second. For this reason, we believe future trends in high speed pattern recognition will involve hybrid architectures of both optical and DSP elements.
High End Computing Technologies for Earth Science Applications: Trends, Challenges, and Innovations
NASA Technical Reports Server (NTRS)
Parks, John (Technical Monitor); Biswas, Rupak; Yan, Jerry C.; Brooks, Walter F.; Sterling, Thomas L.
2003-01-01
Earth science applications of the future will stress the capabilities of even the highest performance supercomputers in the areas of raw compute power, mass storage management, and software environments. These NASA mission critical problems demand usable multi-petaflops and exabyte-scale systems to fully realize their science goals. With an exciting vision of the technologies needed, NASA has established a comprehensive program of advanced research in computer architecture, software tools, and device technology to ensure that, in partnership with US industry, it can meet these demanding requirements with reliable, cost effective, and usable ultra-scale systems. NASA will exploit, explore, and influence emerging high end computing architectures and technologies to accelerate the next generation of engineering, operations, and discovery processes for NASA Enterprises. This article captures this vision and describes the concepts, accomplishments, and the potential payoff of the key thrusts that will help meet the computational challenges in Earth science applications.
Deep learning with coherent nanophotonic circuits
NASA Astrophysics Data System (ADS)
Shen, Yichen; Harris, Nicholas C.; Skirlo, Scott; Prabhu, Mihika; Baehr-Jones, Tom; Hochberg, Michael; Sun, Xin; Zhao, Shijie; Larochelle, Hugo; Englund, Dirk; Soljačić, Marin
2017-07-01
Artificial neural networks are computational network models inspired by signal processing in the brain. These models have dramatically improved performance for many machine-learning tasks, including speech and image recognition. However, today's computing hardware is inefficient at implementing neural networks, in large part because much of it was designed for von Neumann computing schemes. Significant effort has been made towards developing electronic architectures tuned to implement artificial neural networks that exhibit improved computational speed and accuracy. Here, we propose a new architecture for a fully optical neural network that, in principle, could offer an enhancement in computational speed and power efficiency over state-of-the-art electronics for conventional inference tasks. We experimentally demonstrate the essential part of the concept using a programmable nanophotonic processor featuring a cascaded array of 56 programmable Mach-Zehnder interferometers in a silicon photonic integrated circuit and show its utility for vowel recognition.
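The linear-algebra principle behind such a photonic layer can be stated compactly: any weight matrix factors by SVD into two unitary matrices (realizable as interferometer meshes) and a diagonal of singular values (per-channel attenuation or gain). The tiny sketch below checks only this mathematical identity; it is not a model of the fabricated nanophotonic processor.

```python
# Minimal sketch: SVD factorization underlying "interferometer, attenuators, interferometer".
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))          # a small fully connected layer (illustrative)
U, s, Vt = np.linalg.svd(W)              # W = U @ diag(s) @ Vt

x = rng.standard_normal(4)               # input amplitudes (illustrative)
y_direct = W @ x
y_photonic_style = U @ (s * (Vt @ x))    # unitary mesh, per-channel gain, unitary mesh

assert np.allclose(y_direct, y_photonic_style)
```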
Improving Conceptual Design for Launch Vehicles
NASA Technical Reports Server (NTRS)
Olds, John R.
1998-01-01
This report summarizes activities performed during the second year of a three year cooperative agreement between NASA - Langley Research Center and Georgia Tech. Year 1 of the project resulted in the creation of a new Cost and Business Assessment Model (CABAM) for estimating the economic performance of advanced reusable launch vehicles including non-recurring costs, recurring costs, and revenue. The current year (second year) activities were focused on the evaluation of automated, collaborative design frameworks (computation architectures or computational frameworks) for automating the design process in advanced space vehicle design. Consistent with NASA's new thrust area in developing and understanding Intelligent Synthesis Environments (ISE), the goals of this year's research efforts were to develop and apply computer integration techniques and near-term computational frameworks for conducting advanced space vehicle design. NASA - Langley (VAB) has taken a lead role in developing a web-based computing architecture within which the designer can interact with disciplinary analysis tools through a flexible web interface. The advantages of this approach are, 1) flexible access to the designer interface through a simple web browser (e.g. Netscape Navigator), 2) ability to include existing 'legacy' codes, and 3) ability to include distributed analysis tools running on remote computers. To date, VAB's internal emphasis has been on developing this test system for the planetary entry mission under the joint Integrated Design System (IDS) program with NASA - Ames and JPL. Georgia Tech's complementary goals this year were to: 1) Examine an alternate 'custom' computational architecture for the three-discipline IDS planetary entry problem to assess the advantages and disadvantages relative to the web-based approach, and 2) Develop and examine a web-based interface and framework for a typical launch vehicle design problem.
Bit-parallel arithmetic in a massively-parallel associative processor
NASA Technical Reports Server (NTRS)
Scherson, Isaac D.; Kramer, David A.; Alleyne, Brian D.
1992-01-01
A simple but powerful new architecture based on a classical associative processor model is presented. Algorithms for performing the four basic arithmetic operations both for integer and floating point operands are described. For m-bit operands, the proposed architecture makes it possible to execute complex operations in O(m) cycles as opposed to O(m^2) for bit-serial machines. A word-parallel, bit-parallel, massively-parallel computing system can be constructed using this architecture with VLSI technology. The operation of this system is demonstrated for the fast Fourier transform and matrix multiplication.
Mathematical modelling of Bit-Level Architecture using Reciprocal Quantum Logic
NASA Astrophysics Data System (ADS)
Narendran, S.; Selvakumar, J.
2018-04-01
Efficiency of high-performance computing is in high demand, in terms of both speed and energy. Reciprocal Quantum Logic (RQL) is one technology that offers high speed and zero static power dissipation. RQL uses an AC power supply as input rather than a DC input. RQL has a set of three basic gates. Series of reciprocal transmission lines are placed between the gates to avoid loss of power and to achieve high speed. An analytical model of a bit-level architecture is developed using RQL. The major drawback of Reciprocal Quantum Logic is area: to achieve a proper power supply, splitters must be used, and these occupy a large area. Distributed arithmetic uses vector-vector multiplication in which one operand is constant and the other is a signed variable; each word is treated as a binary number, and the terms are rearranged and combined to form a distributed system. Distributed arithmetic is widely used in convolution and in high-performance computational devices.
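For readers unfamiliar with distributed arithmetic, the sketch below shows the standard bit-serial formulation for a fixed-coefficient inner product: a lookup table indexed by one bit-slice of all inputs per step, followed by shift-and-accumulate. The coefficients, inputs, and word length are illustrative assumptions, not values from the RQL design.

```python
# Minimal sketch of distributed arithmetic for a fixed-coefficient inner product.
def da_inner_product(coeffs, xs, n_bits=8):
    """Compute sum(c_k * x_k) for unsigned n_bits-wide inputs xs, one bit-slice per step."""
    n = len(coeffs)
    # LUT: entry m holds the sum of coefficients whose corresponding input bit is 1.
    lut = [sum(c for c, bit in zip(coeffs, format(m, f"0{n}b")[::-1]) if bit == "1")
           for m in range(2 ** n)]
    acc = 0
    for b in range(n_bits):                                    # one bit position per step
        address = sum(((x >> b) & 1) << k for k, x in enumerate(xs))
        acc += lut[address] << b                               # shift-accumulate
    return acc

coeffs = [3, -5, 7, 2]
xs = [17, 4, 250, 99]
assert da_inner_product(coeffs, xs) == sum(c * x for c, x in zip(coeffs, xs))
```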
Cognitive architectures and language acquisition: a case study in pronoun comprehension.
van Rij, Jacolien; van Rijn, Hedderik; Hendriks, Petra
2010-06-01
In this paper we discuss a computational cognitive model of children's poor performance on pronoun interpretation (the so-called Delay of Principle B Effect, or DPBE). This cognitive model is based on a theoretical account that attributes the DPBE to children's inability as hearers to also take into account the speaker's perspective. The cognitive model predicts that child hearers are unable to do so because their speed of linguistic processing is too limited to perform this second step in interpretation. We tested this hypothesis empirically in a psycholinguistic study, in which we slowed down the speech rate to give children more time for interpretation, and in a computational simulation study. The results of the two studies confirm the predictions of our model. Moreover, these studies show that embedding a theory of linguistic competence in a cognitive architecture allows for the generation of detailed and testable predictions with respect to linguistic performance.
Exploring the architectural trade space of NASA's Space Communication and Navigation Program
NASA Astrophysics Data System (ADS)
Sanchez, M.; Selva, D.; Cameron, B.; Crawley, E.; Seas, A.; Seery, B.
NASA's Space Communication and Navigation (SCaN) Program is responsible for providing communication and navigation services to space missions and other users in and beyond low Earth orbit. The current SCaN architecture consists of three independent networks: the Space Network (SN), which contains the TDRS relay satellites in GEO; the Near Earth Network (NEN), which consists of several NASA owned and commercially operated ground stations; and the Deep Space Network (DSN), with three ground stations in Goldstone, Madrid, and Canberra. The first task of this study is the stakeholder analysis. The goal of the stakeholder analysis is to identify the main stakeholders of the SCaN system and their needs. Twenty-one main groups of stakeholders have been identified and put on a stakeholder map. Their needs are currently being elicited by means of interviews and an extensive literature review. The data will then be analyzed by applying Cameron and Crawley's stakeholder analysis theory, with a view to highlighting dominant needs and conflicting needs. The second task of this study is the architectural tradespace exploration of the next generation TDRSS. The space of possible architectures for SCaN is represented by a set of architectural decisions, each of which has a discrete set of options. A computational tool is used to automatically synthesize a very large number of possible architectures by enumerating different combinations of decisions and options. The same tool contains models to evaluate the architectures in terms of performance and cost. The performance model uses the stakeholder needs and requirements identified in the previous steps as inputs, and it is based on the VASSAR methodology presented in a companion paper. This paper summarizes the current status of the MIT SCaN architecture study. It starts by motivating the need to perform tradespace exploration studies in the context of relay data systems through a description of the history of NASA's space communication networks. It then presents the generalities of possible architectures for future space communication and navigation networks. Finally, it describes the tools and methods being developed, clearly indicating the architectural decisions that have been taken into account as well as the systematic approach followed to model them. The purpose of this study is to explore the SCaN architectural tradespace by means of a computational tool. This paper describes the tool, while the tradespace exploration is underway.
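The enumerate-and-evaluate pattern described above can be sketched in a few lines: architectural decisions with discrete options are expanded combinatorially and each resulting architecture is scored. The decision names, options, and scoring rules below are placeholders, not SCaN's actual decisions or the VASSAR performance model.

```python
# Minimal sketch of tradespace enumeration and scoring (placeholder decisions and models).
from itertools import product

decisions = {
    "relay_orbit": ["GEO", "MEO"],
    "ground_stations": [3, 6, 9],
    "optical_terminal": [False, True],
}

def evaluate(arch):
    # Placeholder scoring; a real study would call performance and cost models.
    performance = arch["ground_stations"] \
        + (5 if arch["optical_terminal"] else 0) \
        + (2 if arch["relay_orbit"] == "GEO" else 0)
    cost = 10 * arch["ground_stations"] \
        + (40 if arch["optical_terminal"] else 0) \
        + (30 if arch["relay_orbit"] == "GEO" else 20)
    return performance, cost

architectures = [dict(zip(decisions, combo)) for combo in product(*decisions.values())]
scored = [(evaluate(a), a) for a in architectures]
# Print the three best architectures by performance, breaking ties on cost.
for (perf, cost), arch in sorted(scored, key=lambda s: (-s[0][0], s[0][1]))[:3]:
    print(perf, cost, arch)
```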
A Roadmap for caGrid, an Enterprise Grid Architecture for Biomedical Research
Saltz, Joel; Hastings, Shannon; Langella, Stephen; Oster, Scott; Kurc, Tahsin; Payne, Philip; Ferreira, Renato; Plale, Beth; Goble, Carole; Ervin, David; Sharma, Ashish; Pan, Tony; Permar, Justin; Brezany, Peter; Siebenlist, Frank; Madduri, Ravi; Foster, Ian; Shanbhag, Krishnakant; Mead, Charlie; Hong, Neil Chue
2012-01-01
caGrid is a middleware system which combines the Grid computing, the service oriented architecture, and the model driven architecture paradigms to support development of interoperable data and analytical resources and federation of such resources in a Grid environment. The functionality provided by caGrid is an essential and integral component of the cancer Biomedical Informatics Grid (caBIG™) program. This program is established by the National Cancer Institute as a nationwide effort to develop enabling informatics technologies for collaborative, multi-institutional biomedical research with the overarching goal of accelerating translational cancer research. Although the main application domain for caGrid is cancer research, the infrastructure provides a generic framework that can be employed in other biomedical research and healthcare domains. The development of caGrid is an ongoing effort, adding new functionality and improvements based on feedback and use cases from the community. This paper provides an overview of potential future architecture and tooling directions and areas of improvement for caGrid and caGrid-like systems. This summary is based on discussions at a roadmap workshop held in February with participants from biomedical research, Grid computing, and high performance computing communities. PMID:18560123
NASA Astrophysics Data System (ADS)
Gallagher, J. H. R.; Jelenak, A.; Potter, N.; Fulker, D. W.; Habermann, T.
2017-12-01
Providing data services based on cloud computing technology that are equivalent to those developed for traditional computing and storage systems is critical for successful migration to cloud-based architectures for data production, scientific analysis and storage. OPeNDAP Web-service capabilities (comprising the Data Access Protocol (DAP) specification plus open-source software for realizing DAP in servers and clients) are among the most widely deployed means for achieving data-as-service functionality in the Earth sciences. OPeNDAP services are especially common in traditional data center environments where servers offer access to datasets stored in (very large) file systems, and a preponderance of the source data for these services is stored in the Hierarchical Data Format Version 5 (HDF5). Three candidate architectures for serving NASA satellite Earth Science HDF5 data via Hyrax running on Amazon Web Services (AWS) were developed and their performance examined for a set of representative use cases. Performance was assessed in terms of both runtime and incurred cost. The three architectures differ in how HDF5 files are stored in the Amazon Simple Storage Service (S3) and how the Hyrax server (as an EC2 instance) retrieves their data. Results for both serial and parallel access to HDF5 data in S3 will be presented. While the study focused on HDF5 data, OPeNDAP and the Hyrax data server, the architectures are generic and the analysis can be extrapolated to many different data formats, web APIs, and data servers.
From photons to big-data applications: terminating terabits.
Zilberman, Noa; Moore, Andrew W; Crowcroft, Jon A
2016-03-06
Computer architectures have entered a watershed as the quantity of network data generated by user applications exceeds the data-processing capacity of any individual computer end-system. It will become impossible to scale existing computer systems while a gap grows between the quantity of networked data and the capacity for per system data processing. Despite this, the growth in demand in both task variety and task complexity continues unabated. Networked computer systems provide a fertile environment in which new applications develop. As networked computer systems become akin to infrastructure, any limitation upon the growth in capacity and capabilities becomes an important constraint of concern to all computer users. Considering a networked computer system capable of processing terabits per second, as a benchmark for scalability, we critique the state of the art in commodity computing, and propose a wholesale reconsideration in the design of computer architectures and their attendant ecosystem. Our proposal seeks to reduce costs, save power and increase performance in a multi-scale approach that has potential application from nanoscale to data-centre-scale computers. © 2016 The Authors.
NASA Astrophysics Data System (ADS)
Megherbi, Dalila B.; Yan, Yin; Tanmay, Parikh; Khoury, Jed; Woods, C. L.
2004-11-01
Surveillance and Automatic Target Recognition (ATR) applications are increasing as the cost of the computing power needed to process the massive amount of information continues to fall. This computing power has been made possible partly by the latest advances in FPGAs and SOPCs. In particular, to design and implement state-of-the-art electro-optical imaging systems that provide advanced surveillance capabilities, there is a need to integrate several technologies (e.g. telescope, precise optics, cameras, image/computer vision algorithms, which can be geographically distributed or sharing distributed resources) into a programmable system and DSP systems. Additionally, pattern recognition techniques and fast information retrieval are often important components of intelligent systems. The aim of this work is to use an embedded FPGA as a fast, configurable and synthesizable search engine for fast image pattern recognition/retrieval in a distributed hardware/software co-design environment. In particular, we propose and demonstrate a low cost Content Addressable Memory (CAM)-based distributed embedded FPGA hardware architecture solution with real-time recognition capabilities and computing for pattern look-up, pattern recognition, and image retrieval. We show how the distributed CAM-based architecture offers an order-of-magnitude performance advantage over RAM-based (Random Access Memory) search for implementing high speed pattern recognition for image retrieval. The methods of designing, implementing, and analyzing the proposed CAM-based embedded architecture are described here. Other SOPC solutions/design issues are covered. Finally, experimental results, hardware verification, and performance evaluations using both the Xilinx Virtex-II and the Altera Apex20k are provided to show the potential and power of the proposed method for low cost reconfigurable fast image pattern recognition/retrieval at the hardware/software co-design level.
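The CAM-versus-RAM contrast above can be illustrated in software: an associative (content-addressed) lookup returns the matching address in one query, whereas a RAM-style search scans entries sequentially. The hash-map stand-in and toy patterns below are assumptions for illustration; they do not model the FPGA CAM blocks or their cycle counts.

```python
# Minimal sketch: content-addressed lookup vs. linear RAM-style scan (illustrative only).
import time

patterns = [f"{i:08x}" for i in range(200_000)]   # toy stored patterns
query = patterns[150_000]

# RAM-style: linear scan over memory words.
t0 = time.perf_counter()
ram_hits = [addr for addr, word in enumerate(patterns) if word == query]
t_ram = time.perf_counter() - t0

# CAM-style: content is the key; one associative lookup yields the address.
cam_index = {word: addr for addr, word in enumerate(patterns)}
t0 = time.perf_counter()
cam_hit = cam_index[query]
t_cam = time.perf_counter() - t0

print(ram_hits[0] == cam_hit,
      f"scan {t_ram * 1e3:.2f} ms vs lookup {t_cam * 1e6:.1f} us")
```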
Platform Architecture for Decentralized Positioning Systems.
Kasmi, Zakaria; Norrdine, Abdelmoumen; Blankenbach, Jörg
2017-04-26
A platform architecture for positioning systems is essential for the realization of a flexible localization system, which interacts with other systems and supports various positioning technologies and algorithms. The decentralized processing of a position enables pushing the application-level knowledge into a mobile station and avoids the communication with a central unit such as a server or a base station. In addition, the calculation of the position on low-cost and resource-constrained devices presents a challenge due to the limited computing, storage capacity, as well as power supply. Therefore, we propose a platform architecture that enables the design of a system with the reusability of the components, extensibility (e.g., with other positioning technologies) and interoperability. Furthermore, the position is computed on a low-cost device such as a microcontroller, which simultaneously performs additional tasks such as data collecting or preprocessing based on an operating system. The platform architecture is designed, implemented and evaluated on the basis of two positioning systems: a field strength system and a time of arrival-based positioning system.
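As a sketch of the kind of computation such a resource-constrained node might run for the time-of-arrival case, the following Gauss-Newton least-squares position solve uses assumed anchor coordinates and noise levels; it is not the paper's platform architecture or its embedded implementation.

```python
# Minimal sketch: 2-D position solve from range (time-of-arrival) measurements.
import numpy as np

def toa_least_squares(anchors, ranges, x0, n_iter=10):
    """Gauss-Newton refinement of position x from measured anchor ranges."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        diffs = x - anchors                      # (n_anchors, 2)
        pred = np.linalg.norm(diffs, axis=1)     # predicted ranges
        J = diffs / pred[:, None]                # Jacobian of range w.r.t. position
        residual = ranges - pred
        dx, *_ = np.linalg.lstsq(J, residual, rcond=None)
        x = x + dx
    return x

anchors = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
true_pos = np.array([3.0, 7.0])
rng = np.random.default_rng(0)
ranges = np.linalg.norm(anchors - true_pos, axis=1) + rng.normal(0, 0.05, 4)
print(toa_least_squares(anchors, ranges, x0=[5.0, 5.0]))
```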
A Web Centric Architecture for Deploying Multi-Disciplinary Engineering Design Processes
NASA Technical Reports Server (NTRS)
Woyak, Scott; Kim, Hongman; Mullins, James; Sobieszczanski-Sobieski, Jaroslaw
2004-01-01
There are continuous needs for engineering organizations to improve their design process. Current state-of-the-art techniques use computational simulations to predict design performance and optimize it through advanced design methods. These tools have been used mostly by individual engineers. This paper presents an architecture for achieving results at an organizational level, beyond the individual level. The next set of gains in process improvement will come from improving the effective use of computers and software within a whole organization, not just for an individual. The architecture takes advantage of state-of-the-art capabilities to produce a Web-based system to carry engineering design into the future. To illustrate deployment of the architecture, a case study for implementing advanced multidisciplinary design optimization processes such as Bi-Level Integrated System Synthesis is discussed. Another example, rolling out a design process for Design for Six Sigma, is also described. Each example explains how an organization can effectively infuse engineering practice with new design methods and retain the knowledge over time.
Supporting Undergraduate Computer Architecture Students Using a Visual MIPS64 CPU Simulator
ERIC Educational Resources Information Center
Patti, D.; Spadaccini, A.; Palesi, M.; Fazzino, F.; Catania, V.
2012-01-01
The topics of computer architecture are always taught using an Assembly dialect as an example. The most commonly used textbooks in this field use the MIPS64 Instruction Set Architecture (ISA) to help students in learning the fundamentals of computer architecture because of its orthogonality and its suitability for real-world applications. This…
Memristor-Based Computing Architecture: Design Methodologies and Circuit Techniques
2013-03-01
Technical report, Polytechnic Institute of New York University, covering October 2010 – October 2012. …schemes for a memristor-based reconfigurable architecture design have not been fully explored yet. Therefore, in this project, we investigated
Evolution of a designless nanoparticle network into reconfigurable Boolean logic
NASA Astrophysics Data System (ADS)
Bose, S. K.; Lawrence, C. P.; Liu, Z.; Makarenko, K. S.; van Damme, R. M. J.; Broersma, H. J.; van der Wiel, W. G.
2015-12-01
Natural computers exploit the emergent properties and massive parallelism of interconnected networks of locally active components. Evolution has resulted in systems that compute quickly and that use energy efficiently, utilizing whatever physical properties are exploitable. Man-made computers, on the other hand, are based on circuits of functional units that follow given design rules. Hence, potentially exploitable physical processes, such as capacitive crosstalk, to solve a problem are left out. Until now, designless nanoscale networks of inanimate matter that exhibit robust computational functionality had not been realized. Here we artificially evolve the electrical properties of a disordered nanomaterials system (by optimizing the values of control voltages using a genetic algorithm) to perform computational tasks reconfigurably. We exploit the rich behaviour that emerges from interconnected metal nanoparticles, which act as strongly nonlinear single-electron transistors, and find that this nanoscale architecture can be configured in situ into any Boolean logic gate. This universal, reconfigurable gate would require about ten transistors in a conventional circuit. Our system meets the criteria for the physical realization of (cellular) neural networks: universality (arbitrary Boolean functions), compactness, robustness and evolvability, which implies scalability to perform more advanced tasks. Our evolutionary approach works around device-to-device variations and the accompanying uncertainties in performance. Moreover, it bears a great potential for more energy-efficient computation, and for solving problems that are very hard to tackle in conventional architectures.
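A minimal sketch of the evolutionary loop described above is given below: a genetic algorithm tunes a handful of control voltages so that a black-box response matches an XOR truth table. The 'device' here is an arbitrary stand-in nonlinear function, and the GA settings are assumptions; nothing below models the physical nanoparticle network.

```python
# Minimal sketch: genetic algorithm configuring a black-box device into an XOR gate.
import numpy as np

rng = np.random.default_rng(0)
INPUTS = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
TARGET_XOR = np.array([0, 1, 1, 0], dtype=float)

def device_response(controls, ab):
    """Stand-in for the physical network: a fixed nonlinear map from
    (2 logic inputs + 5 control voltages) to one output value."""
    w = np.sin(np.arange(1, 8))                     # frozen 'device physics'
    z = np.concatenate([ab, controls])
    return np.tanh(np.dot(w, z) + np.prod(z[:2] + z[2:4]))

def fitness(controls):
    out = np.array([device_response(controls, ab) for ab in INPUTS])
    out = (out - out.min()) / (out.max() - out.min() + 1e-9)   # normalize to [0, 1]
    return -np.mean((out - TARGET_XOR) ** 2)                    # higher is better

pop = rng.uniform(-1, 1, size=(60, 5))                          # candidate voltage sets
for gen in range(200):
    scores = np.array([fitness(ind) for ind in pop])
    parents = pop[np.argsort(scores)[-20:]]                     # truncation selection
    children = parents[rng.integers(0, 20, 40)] + rng.normal(0, 0.1, (40, 5))
    pop = np.vstack([parents, children])

best = pop[np.argmax([fitness(ind) for ind in pop])]
print("best fitness:", fitness(best))
```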
NASA Technical Reports Server (NTRS)
Simon, Donald L.
2010-01-01
Aircraft engine performance trend monitoring and gas path fault diagnostics are closely related technologies that assist operators in managing the health of their gas turbine engine assets. Trend monitoring is the process of monitoring the gradual performance change that an aircraft engine will naturally incur over time due to turbomachinery deterioration, while gas path diagnostics is the process of detecting and isolating the occurrence of any faults impacting engine flow-path performance. Today, performance trend monitoring and gas path fault diagnostic functions are performed by a combination of on-board and off-board strategies. On-board engine control computers contain logic that monitors for anomalous engine operation in real time. Off-board ground stations are used to conduct fleet-wide engine trend monitoring and fault diagnostics based on data collected from each engine each flight. Continuing advances in avionics are enabling the migration of portions of the ground-based functionality on-board, giving rise to more sophisticated on-board engine health management capabilities. This paper reviews the conventional engine performance trend monitoring and gas path fault diagnostic architecture commonly applied today, and presents a proposed enhanced on-board architecture for future applications. The enhanced architecture gains real-time access to an expanded quantity of engine parameters, and provides advanced on-board model-based estimation capabilities. The benefits of the enhanced architecture include the real-time continuous monitoring of engine health, the early diagnosis of fault conditions, and the estimation of unmeasured engine performance parameters. A future vision to advance the enhanced architecture is also presented and discussed.
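To separate the two functions described above in the simplest possible terms, the sketch below tracks a synthetic health parameter with an exponentially weighted baseline (trend monitoring) and flags a sudden residual jump (fault detection). The signal, smoothing constant, and threshold are illustrative assumptions, not the proposed on-board model-based architecture.

```python
# Minimal sketch: trend monitoring (slow drift) vs. fault detection (step change).
import numpy as np

rng = np.random.default_rng(0)
flights = 300
# Synthetic exhaust-gas-temperature margin: slow deterioration plus sensor noise.
egt_margin = 40.0 - 0.02 * np.arange(flights) + rng.normal(0, 0.5, flights)
egt_margin[220:] -= 4.0        # injected step change standing in for a gas path fault

alpha, fault_threshold = 0.1, 3.0
baseline, flagged = egt_margin[0], []
for k in range(1, flights):
    residual = egt_margin[k] - baseline
    if abs(residual) > fault_threshold:
        flagged.append(k)                                        # candidate fault flight
    baseline = (1 - alpha) * baseline + alpha * egt_margin[k]    # slow trend estimate

print("first flagged flight:", flagged[0] if flagged else None)
print(f"smoothed deterioration over {flights} flights: {baseline - egt_margin[0]:+.1f}")
```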
Multicore Architecture-aware Scientific Applications
DOE Office of Scientific and Technical Information (OSTI.GOV)
Srinivasa, Avinash
Modern high performance systems are becoming increasingly complex and powerful due to advancements in processor and memory architecture. In order to keep up with this increasing complexity, applications have to be augmented with certain capabilities to fully exploit such systems. These may be at the application level, such as static or dynamic adaptations, or at the system level, like having strategies in place to override some of the default operating system policies, the main objective being to improve the computational performance of the application. The current work proposes two such capabilities with respect to multi-threaded scientific applications, in particular a large-scale physics application computing ab-initio nuclear structure. The first involves using a middleware tool to invoke dynamic adaptations in the application, so as to be able to adjust to the changing computational resource availability at run-time. The second involves a strategy for effective placement of data in main memory, to optimize memory access latencies and bandwidth. These capabilities, when included, were found to have a significant impact on the application performance, resulting in average speedups of as much as two to four times.
Exascale computing and big data
Reed, Daniel A.; Dongarra, Jack
2015-06-25
Scientific discovery and engineering innovation require unifying traditionally separated high-performance computing and big data analytics. The tools and cultures of high-performance computing and big data analytics have diverged, to the detriment of both; unification is essential to address a spectrum of major research domains. The challenges of scale tax our ability to transmit data, compute complicated functions on that data, or store a substantial part of it; new approaches are required to meet these challenges. Finally, the international nature of science demands further development of advanced computer architectures and global standards for processing data, even as international competition complicates the openness of the scientific process.
Performance of VPIC on Sequoia
NASA Astrophysics Data System (ADS)
Nystrom, William
2014-10-01
Sequoia is a major DOE computing resource which is characteristic of future resources in that it has many threads per compute node, 64, and the individual processor cores are simpler and less powerful than cores on previous processors like Intel's Sandy Bridge or AMD's Opteron. An effort is in progress to port VPIC to the Blue Gene Q architecture of Sequoia and evaluate its performance. Results of this work will be presented on single node performance of VPIC as well as multi-node scaling.
Heterogeneous High Throughput Scientific Computing with APM X-Gene and Intel Xeon Phi
NASA Astrophysics Data System (ADS)
Abdurachmanov, David; Bockelman, Brian; Elmer, Peter; Eulisse, Giulio; Knight, Robert; Muzaffar, Shahzad
2015-05-01
Electrical power requirements will be a constraint on the future growth of Distributed High Throughput Computing (DHTC) as used by High Energy Physics. Performance-per-watt is a critical metric for the evaluation of computer architectures for cost-efficient computing. Additionally, future performance growth will come from heterogeneous, many-core, and high computing density platforms with specialized processors. In this paper, we examine the Intel Xeon Phi Many Integrated Cores (MIC) co-processor and Applied Micro X-Gene ARMv8 64-bit low-power server system-on-a-chip (SoC) solutions for scientific computing applications. We report our experience on software porting, performance and energy efficiency and evaluate the potential for use of such technologies in the context of distributed computing systems such as the Worldwide LHC Computing Grid (WLCG).
A PC-Based Controller for Dextrous Arms
NASA Technical Reports Server (NTRS)
Fiorini, Paolo; Seraji, Homayoun; Long, Mark
1996-01-01
This paper describes the architecture and performance of a PC-based controller for 7-DOF dextrous manipulators. The computing platform is a 486-based personal computer equipped with a bus extender to access the robot Multibus controller, together with a single board computer as the graphical engine, and with a parallel I/O board to interface with a force-torque sensor mounted on the manipulator wrist.
Investigating the impact of the Cielo Cray XE6 architecture on scientific application codes.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Rajan, Mahesh; Barrett, Richard; Pedretti, Kevin Thomas Tauke
2010-12-01
Cielo, a Cray XE6, is the Department of Energy NNSA Advanced Simulation and Computing (ASC) campaign's newest capability machine. Rated at 1.37 PFLOPS, it consists of 8,944 dual-socket oct-core AMD Magny-Cours compute nodes, linked using Cray's Gemini interconnect. Its primary mission objective is to enable a suite of the ASC applications implemented using MPI to scale to tens of thousands of cores. Cielo is an evolutionary improvement to a successful architecture previously available to many of our codes, thus enabling a basis for understanding the capabilities of this new architecture. Using three codes strategically important to the ASC campaign, and supplemented with some micro-benchmarks that expose the fundamental capabilities of the XE6, we report on the performance characteristics and capabilities of Cielo.
Brain architecture: a design for natural computation.
Kaiser, Marcus
2007-12-15
Fifty years ago, John von Neumann compared the architecture of the brain with that of the computers he invented and which are still in use today. In those days, the organization of computers was based on concepts of brain organization. Here, we give an update on current results on the global organization of neural systems. For neural systems, we outline how the spatial and topological architecture of neuronal and cortical networks facilitates robustness against failures, fast processing and balanced network activation. Finally, we discuss mechanisms of self-organization for such architectures. After all, the organization of the brain might again inspire computer architecture.
Architecture and Programming Models for High Performance Intensive Computation
2016-06-29
DOE Office of Scientific and Technical Information (OSTI.GOV)
Arumugam, Kamesh
Efficient parallel implementation of scientific applications on multi-core CPUs with accelerators such as GPUs and Xeon Phis is challenging. It requires exploiting the data-parallel architecture of the accelerator along with the vector pipelines of modern x86 CPU architectures, load balancing, and efficient memory transfer between different devices. It is relatively easy to meet these requirements for highly structured scientific applications. In contrast, a number of scientific and engineering applications are unstructured. Getting performance on accelerators for these applications is extremely challenging because many of these applications employ irregular algorithms which exhibit data-dependent control-flow and irregular memory accesses. Furthermore, these applications are often iterative with dependency between steps, making it hard to parallelize across steps. As a result, parallelism in these applications is often limited to a single step. Numerical simulation of charged-particle beam dynamics is one such application where the distribution of work and the memory access pattern at each time step are irregular. Applications with these properties tend to present significant branch and memory divergence, load imbalance between different processor cores, and poor compute and memory utilization. Prior research on parallelizing such irregular applications has focused on optimizing the irregular, data-dependent memory accesses and control-flow during a single step of the application independent of the other steps, with the assumption that these patterns are completely unpredictable. We observed that the structure of computation leading to control-flow divergence and irregular memory accesses in one step is similar to that in the next step. It is possible to predict this structure in the current step by observing the computation structure of previous steps. In this dissertation, we present novel machine learning based optimization techniques to address the parallel implementation challenges of such irregular applications on different HPC architectures. In particular, we use supervised learning to predict the computation structure and use it to address the control-flow and memory access irregularities in the parallel implementation of such applications on GPUs, Xeon Phis, and heterogeneous architectures composed of multi-core CPUs with GPUs or Xeon Phis. We use numerical simulation of charged-particle beam dynamics as a motivating example throughout the dissertation to present our new approach, though the techniques should be equally applicable to a wide range of irregular applications. The machine learning approach presented here uses predictive analytics and forecasting techniques to adaptively model and track the irregular memory access pattern at each time step of the simulation to anticipate the future memory access pattern. Access pattern forecasts can then be used to formulate optimization decisions during application execution which improve the performance of the application at a future time step based on the observations from earlier time steps. In heterogeneous architectures, forecasts can also be used to improve the memory performance and resource utilization of all the processing units to deliver a good aggregate performance.
We used these optimization techniques and the anticipation strategy to design a cache-aware, memory-efficient parallel algorithm to address the irregularities in the parallel implementation of charged-particle beam dynamics simulation on different HPC architectures. Experimental results using a diverse mix of HPC architectures show that our approach of using an anticipation strategy is effective in maximizing data reuse, ensuring workload balance, minimizing branch and memory divergence, and improving resource utilization.
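A toy version of the forecasting idea is sketched below: per-cell particle counts from previous steps feed a simple regression that predicts the next step's counts, and the forecast drives a greedy load-balancing assignment before the step executes. The drifting-particle model, the two-lag predictor, and the two-worker partition are stand-in assumptions, not the dissertation's supervised-learning models.

```python
# Minimal sketch: forecast per-cell occupancy from past steps, then pre-balance work.
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_particles, n_steps = 32, 20_000, 60
x = rng.uniform(0, 1, n_particles)
history = []

def occupancy(x):
    return np.bincount((x * n_cells).astype(int) % n_cells, minlength=n_cells)

for step in range(n_steps):
    history.append(occupancy(x))
    if len(history) >= 3:
        # Fit c_t ~ a*c_{t-1} + b*c_{t-2} across cells, then forecast the next step.
        A = np.column_stack([history[-2], history[-3]])
        coef, *_ = np.linalg.lstsq(A, history[-1], rcond=None)
        forecast = np.column_stack([history[-1], history[-2]]) @ coef
        # Use the forecast to pre-assign cells to two workers with roughly equal load.
        order = np.argsort(forecast)[::-1]
        loads, assign = [0.0, 0.0], {}
        for c in order:
            w = int(loads[1] < loads[0])      # give the cell to the lighter worker
            assign[c] = w
            loads[w] += forecast[c]
    x = (x + 0.003 + rng.normal(0, 0.001, n_particles)) % 1.0   # drift particles

print("predicted worker loads at final step:", [round(v, 1) for v in loads])
```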
NASA Astrophysics Data System (ADS)
Gel, Aytekin; Hu, Jonathan; Ould-Ahmed-Vall, ElMoustapha; Kalinkin, Alexander A.
2017-02-01
Legacy codes remain a crucial element of today's simulation-based engineering ecosystem due to the extensive validation process and investment in such software. The rapid evolution of high-performance computing architectures necessitates the modernization of these codes. One approach to modernization is a complete overhaul of the code. However, this could require extensive investments, such as rewriting in modern languages, new data constructs, etc., which will necessitate systematic verification and validation to re-establish the credibility of the computational models. The current study advocates using a more incremental approach and is a culmination of several modernization efforts on the legacy code MFIX, which is an open-source computational fluid dynamics code that has evolved over several decades, widely used in multiphase flows and still being developed by the National Energy Technology Laboratory. Two different modernization approaches, 'bottom-up' and 'top-down', are illustrated. Preliminary results show up to an 8.5x improvement at the selected kernel level with the first approach, and up to a 50% improvement in total simulated time with the latter, for the demonstration cases and target HPC systems employed.
Compact VLSI neural computer integrated with active pixel sensor for real-time ATR applications
NASA Astrophysics Data System (ADS)
Fang, Wai-Chi; Udomkesmalee, Gabriel; Alkalai, Leon
1997-04-01
A compact VLSI neural computer integrated with an active pixel sensor has been under development to mimic what is inherent in biological vision systems. This electronic eye-brain computer is targeted for real-time machine vision applications which require both high-bandwidth communication and high-performance computing for data sensing, synergy of multiple types of sensory information, feature extraction, target detection, target recognition, and control functions. The neural computer is based on a composite structure which combines the Annealing Cellular Neural Network (ACNN) and the Hierarchical Self-Organization Neural Network (HSONN). The ACNN architecture is a programmable and scalable multi-dimensional array of annealing neurons which are locally connected with their local neurons. Meanwhile, the HSONN adopts a hierarchical structure with nonlinear basis functions. The ACNN+HSONN neural computer is effectively designed to perform programmable functions for machine vision processing at all levels with its embedded host processor. It provides a two order-of-magnitude increase in computation power over the state-of-the-art microcomputer and DSP microelectronics. A compact current-mode VLSI design feasibility of the ACNN+HSONN neural computer is demonstrated by a 3D 16X8X9-cube neural processor chip design in a 2-micrometer CMOS technology. Integration of this neural computer as one slice of a 4'X4' multichip module into the 3D MCM based avionics architecture for NASA's New Millennium Program is also described.
Application-specific coarse-grained reconfigurable array: architecture and design methodology
NASA Astrophysics Data System (ADS)
Zhou, Li; Liu, Dongpei; Zhang, Jianfeng; Liu, Hengzhu
2015-06-01
Coarse-grained reconfigurable arrays (CGRAs) have shown potential for application in embedded systems in recent years. Numerous reconfigurable processing elements (PEs) in CGRAs provide flexibility while maintaining high performance by exploiting different levels of parallelism. However, a gap remains between CGRAs and application-specific integrated circuits (ASICs). Some application domains, such as software-defined radios (SDRs), require flexibility as performance demands increase, so more effective CGRA architectures need to be developed. Customising a CGRA for its application can improve performance and efficiency. This study proposes an application-specific CGRA architecture template composed of generic PEs (GPEs) and special PEs (SPEs). The hardware of the SPE can be customised to accelerate specific computational patterns. An automatic design methodology that includes pattern identification and application-specific function unit generation is also presented, along with a mapping algorithm based on ant colony optimisation. Experimental results on the SDR target domain show that, compared with other ordinary and application-specific reconfigurable architectures, the CGRA generated by the proposed method performs more efficiently for the given applications.
Editorial: Cognitive Architectures, Model Comparison and AGI
NASA Astrophysics Data System (ADS)
Lebiere, Christian; Gonzalez, Cleotilde; Warwick, Walter
2010-12-01
Cognitive Science and Artificial Intelligence share compatible goals of understanding and possibly generating broadly intelligent behavior. In order to determine if progress is made, it is essential to be able to evaluate the behavior of complex computational models, especially those built on general cognitive architectures, and compare it to benchmarks of intelligent behavior such as human performance. Significant methodological challenges arise, however, when trying to extend approaches used to compare model and human performance from tightly controlled laboratory tasks to complex tasks involving more open-ended behavior. This paper describes a model comparison challenge built around a dynamic control task, the Dynamic Stocks and Flows. We present and discuss distinct approaches to evaluating performance and comparing models. Lessons drawn from this challenge are discussed in light of the challenge of using cognitive architectures to achieve Artificial General Intelligence.
Eastman, Peter; Friedrichs, Mark S; Chodera, John D; Radmer, Randall J; Bruns, Christopher M; Ku, Joy P; Beauchamp, Kyle A; Lane, Thomas J; Wang, Lee-Ping; Shukla, Diwakar; Tye, Tony; Houston, Mike; Stich, Timo; Klein, Christoph; Shirts, Michael R; Pande, Vijay S
2013-01-08
OpenMM is a software toolkit for performing molecular simulations on a range of high performance computing architectures. It is based on a layered architecture: the lower layers function as a reusable library that can be invoked by any application, while the upper layers form a complete environment for running molecular simulations. The library API hides all hardware-specific dependencies and optimizations from the users and developers of simulation programs: they can be run without modification on any hardware on which the API has been implemented. The current implementations of OpenMM include support for graphics processing units using the OpenCL and CUDA frameworks. In addition, OpenMM was designed to be extensible, so new hardware architectures can be accommodated and new functionality (e.g., energy terms and integrators) can be easily added.
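To illustrate the layered design described above, the following sketch sets up a short simulation through OpenMM's Python application layer, with the hardware-specific details confined to the Platform object. The input file and parameters are placeholders, and the module layout follows recent OpenMM releases (older versions used the simtk.openmm namespace), so treat this as an approximate sketch rather than a definitive API reference.

```python
# Sketch: a short MD run through OpenMM's layered API. Only the Platform object
# names the hardware (CUDA/OpenCL/CPU); everything else is hardware-independent.
# File name and simulation parameters are illustrative placeholders.
from openmm import LangevinMiddleIntegrator, Platform, unit
from openmm.app import PDBFile, ForceField, PME, Simulation

pdb = PDBFile("input.pdb")                       # placeholder structure file
forcefield = ForceField("amber14-all.xml", "amber14/tip3pfb.xml")
system = forcefield.createSystem(pdb.topology, nonbondedMethod=PME)

integrator = LangevinMiddleIntegrator(300 * unit.kelvin,
                                      1.0 / unit.picosecond,
                                      0.002 * unit.picoseconds)

# Swap "CUDA" for "OpenCL" or "CPU" without touching the rest of the script.
platform = Platform.getPlatformByName("CUDA")
sim = Simulation(pdb.topology, system, integrator, platform)
sim.context.setPositions(pdb.positions)
sim.minimizeEnergy()
sim.step(1000)                                   # run 1000 time steps
```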
New Developments in Modeling MHD Systems on High Performance Computing Architectures
NASA Astrophysics Data System (ADS)
Germaschewski, K.; Raeder, J.; Larson, D. J.; Bhattacharjee, A.
2009-04-01
Modeling the wide range of time and length scales present even in fluid models of plasmas like MHD and X-MHD (Extended MHD, including two-fluid effects such as the Hall term, electron inertia, and the electron pressure gradient) is challenging even on state-of-the-art supercomputers. In recent years, HPC capacity has continued to grow exponentially, but at the expense of making computer systems more and more difficult to program for maximum performance. In this paper, we present a new approach to managing the complexity caused by the need to write efficient codes: separating the numerical description of the problem, in our case a discretized right-hand side (r.h.s.), from the implementation that evaluates it efficiently. The r.h.s. is described in a quasi-symbolic form, and an automatic code generator handles the translation into efficient, parallelized code. We implemented this approach for OpenGGCM (Open General Geospace Circulation Model), a model of the Earth's magnetosphere, which was accelerated by a factor of three on regular x86 architectures and by a factor of 25 on the Cell BE architecture (commonly known for its deployment in Sony's PlayStation 3).
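To make the separation concrete, the toy sketch below writes a discretized r.h.s. once in symbolic form and lets an off-the-shelf generator (SymPy's lambdify) emit the vectorized evaluation code. This is only an analogy under stated assumptions: the stencil, variable names, and generator are illustrative and are not the OpenGGCM code generator described in the abstract.

```python
# Sketch: describe a discretized r.h.s. symbolically, then generate code for it.
# This mimics the "write the r.h.s. once, let a generator produce the fast loop"
# idea; the actual OpenGGCM generator is far more elaborate.
import sympy as sp
import numpy as np

u_im1, u_ip1, dx, v = sp.symbols("u_im1 u_ip1 dx v")

# Quasi-symbolic description: central-difference discretization of -v * du/dx.
rhs_expr = -v * (u_ip1 - u_im1) / (2 * dx)

# "Code generation" step: turn the symbolic stencil into a vectorized kernel.
rhs_kernel = sp.lambdify((u_im1, u_ip1, dx, v), rhs_expr, modules="numpy")

u = np.sin(np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False))
dudt = rhs_kernel(np.roll(u, 1), np.roll(u, -1), dx=2.0 * np.pi / 64, v=1.0)
print(dudt[:4])
```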
NASA Technical Reports Server (NTRS)
Clement, Bradley J.; Estlin, Tara A.; Bornstein, Benjamin J.
2013-01-01
The Mobile Thread Task Manager (MTTM) is being applied to parallelizing existing flight software to understand the benefits and to develop new techniques and architectural concepts for adapting software to multicore architectures. It allocates and load-balances tasks for a group of threads that migrate across processors to improve cache performance. In order to balance load across threads, the MTTM augments a basic map-reduce strategy to draw jobs from a global queue. In a multicore processor, memory may be "homed" to the cache of a specific processor and must be accessed from that processor. The MTTM architecture wraps access to data with thread management to move threads to the home processor for that data, so that the computation follows the data in an attempt to avoid L2 cache misses. Cache homing is also handled by a memory manager that translates identifiers to the processor IDs where the data will be homed (according to rules defined by the user). The user can also specify the number of threads and processors separately, which is important for tuning performance for different patterns of computation and memory access. MTTM efficiently processes tasks in parallel on a multiprocessor computer. It also provides an interface to make it easier to adapt existing software to a multiprocessor environment.
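The global-queue load-balancing pattern described above can be sketched, in a much-reduced form, as a pool of worker threads draining a shared task queue; the thread-migration and cache-homing machinery requires OS and hardware support and is omitted here. All names and the square-and-sum workload are illustrative, not part of the MTTM code.

```python
# Sketch: a global task queue drained by a pool of worker threads, roughly the
# load-balancing pattern the MTTM abstract describes (without thread migration
# or cache homing). Workload and thread count are arbitrary.
import queue
import threading

def worker(tasks: "queue.Queue", results: list, lock: threading.Lock) -> None:
    while True:
        try:
            item = tasks.get_nowait()      # pull work from the shared queue
        except queue.Empty:
            return
        partial = item * item              # stand-in for the real "map" step
        with lock:
            results.append(partial)        # "reduce" step guarded by a lock

tasks: "queue.Queue" = queue.Queue()
for i in range(1000):
    tasks.put(i)

results: list = []
lock = threading.Lock()
threads = [threading.Thread(target=worker, args=(tasks, results, lock))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sum(results))
```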
FPGA design for constrained energy minimization
NASA Astrophysics Data System (ADS)
Wang, Jianwei; Chang, Chein-I.; Cao, Mang
2004-02-01
The Constrained Energy Minimization (CEM) approach has been widely used for hyperspectral detection and classification. The feasibility of implementing the CEM as a real-time processing algorithm in systolic arrays has also been demonstrated. The main challenge of realizing the CEM in hardware lies in the computation of the inverse of the data correlation matrix, which requires a complete set of data samples. In order to cope with this problem, the data correlation matrix must be calculated in a causal manner, using only the data samples up to the sample being processed. This paper presents a Field Programmable Gate Array (FPGA) design of such a causal CEM. The main feature of the proposed FPGA design is the use of the COordinate Rotation DIgital Computer (CORDIC) algorithm, which converts a Givens rotation of a vector into a set of shift-add operations. As a result, the CORDIC algorithm can be easily implemented in hardware, and therefore in an FPGA. Since the computation of the inverse of the data correlation matrix involves a series of Givens rotations, the use of the CORDIC algorithm allows the causal CEM to perform real-time processing in an FPGA. In this paper, an FPGA implementation of the causal CEM is studied and its detailed architecture is described.
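The shift-add property that makes CORDIC attractive for FPGAs can be seen in a few lines of reference code. The sketch below is a plain floating-point model of CORDIC rotation mode (the hardware version would use fixed-point registers and hard-wired shifts); the iteration count and test angle are arbitrary.

```python
# Sketch: CORDIC rotation mode, showing how a Givens rotation reduces to
# shift-and-add operations (the property the FPGA design above exploits).
# Pure-Python floats stand in for the fixed-point hardware datapath.
import math

def cordic_rotate(x: float, y: float, angle: float, iterations: int = 24):
    atans = [math.atan(2.0 ** -i) for i in range(iterations)]
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))   # accumulated CORDIC gain
    z = angle
    for i in range(iterations):
        d = 1.0 if z >= 0.0 else -1.0
        # In hardware, multiplying by 2**-i is just a right shift.
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * atans[i]
    return x / gain, y / gain                      # compensate the gain

# Rotating (1, 0) by 30 degrees should give approximately (cos 30, sin 30).
print(cordic_rotate(1.0, 0.0, math.radians(30)))
```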
A novel VLSI processor architecture for supercomputing arrays
NASA Technical Reports Server (NTRS)
Venkateswaran, N.; Pattabiraman, S.; Devanathan, R.; Ahmed, Ashaf; Venkataraman, S.; Ganesh, N.
1993-01-01
Design of the processor element for general-purpose massively parallel supercomputing arrays is highly complex and cost-ineffective. To overcome this, the architecture and organization of the functional units of the processor element should suit the diverse computational structures and simplify the mapping of the complex communication structures of different classes of algorithms. This demands that the computation and communication structures of different classes of algorithms be unified. While unifying the different communication structures is a difficult process, analysis of a wide class of algorithms reveals that their computation structures can be expressed in terms of basic IP, IP, OP, CM, R, SM, and MAA operations. The execution of these operations is unified on the PAcube macro-cell array. Based on this PAcube macro-cell array, we present a novel processor element called the GIPOP processor, which has dedicated functional units to perform the above operations. The architecture and organization of these functional units satisfy the two important criteria mentioned above. The structure of the macro-cell and the unification process have led to a very regular and simpler design of the GIPOP processor. The production cost of the GIPOP processor is drastically reduced as it is designed on high-performance mask-programmable PAcube arrays.
High Performance Radiation Transport Simulations on TITAN
DOE Office of Scientific and Technical Information (OSTI.GOV)
Baker, Christopher G; Davidson, Gregory G; Evans, Thomas M
2012-01-01
In this paper we describe the Denovo code system. Denovo solves the six-dimensional, steady-state, linear Boltzmann transport equation, of central importance to nuclear technology applications such as reactor core analysis (neutronics), radiation shielding, nuclear forensics and radiation detection. The code features multiple spatial differencing schemes, state-of-the-art linear solvers, the Koch-Baker-Alcouffe (KBA) parallel-wavefront sweep algorithm for inverting the transport operator, a new multilevel energy decomposition method scaling to hundreds of thousands of processing cores, and a modern, novel code architecture that supports straightforward integration of new features. In this paper we discuss the performance of Denovo on the 10-20 petaflop ORNL GPU-based system, Titan. We describe algorithms and techniques used to exploit the capabilities of Titan's heterogeneous compute node architecture and the challenges of obtaining good parallel performance for this sparse hyperbolic PDE solver containing inherently sequential computations. Numerical results demonstrating Denovo performance on early Titan hardware are presented.
PIC codes for plasma accelerators on emerging computer architectures (GPUS, Multicore/Manycore CPUS)
NASA Astrophysics Data System (ADS)
Vincenti, Henri
2016-03-01
The advent of exascale computers will enable 3D simulations of new laser-plasma interaction regimes that are out of reach of current petascale computers. However, the paradigm used to write current PIC codes will have to change in order to fully exploit the potential of these new computing architectures. Indeed, achieving exascale computing facilities in the next decade will be a great challenge in terms of energy consumption and will imply hardware developments that directly impact the way we implement PIC codes. As data movement (from die to network) is by far the most energy-consuming part of an algorithm, future computers will tend to increase memory locality at the hardware level and reduce the energy consumption related to data movement by using more and more cores on each compute node ('fat nodes') with a reduced clock speed to allow for efficient cooling. To compensate for the frequency decrease, CPU vendors are making use of long SIMD instruction registers that are able to process multiple data with one arithmetic operator in one clock cycle. SIMD register length is expected to double every four years. GPUs also have a reduced clock speed per core and can process Multiple Instructions on Multiple Data (MIMD). At the software level, Particle-In-Cell (PIC) codes will thus have to achieve both good memory locality and vectorization (for multicore/manycore CPUs) to fully take advantage of these upcoming architectures. In this talk, we present the portable solutions we implemented in our high-performance skeleton PIC code PICSAR to achieve good memory locality and cache reuse as well as good vectorization on SIMD architectures. We also present the portable solutions used to parallelize the pseudo-spectral quasi-cylindrical code FBPIC on GPUs using the Numba Python compiler.
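Since the abstract mentions that FBPIC's GPU port relies on the Numba Python compiler, the sketch below shows what a minimal particle-push kernel looks like in that style. It is a generic illustration, not code from PICSAR or FBPIC: the real codes push particles in electromagnetic fields, whereas here only a constant force is applied, and running it requires a CUDA-capable GPU.

```python
# Sketch: a trivially simple particle "push" written as a Numba CUDA kernel,
# one thread per particle. Sizes, force and time step are arbitrary.
import numpy as np
from numba import cuda

@cuda.jit
def push(x, v, force, dt):
    i = cuda.grid(1)                  # global thread index = particle index
    if i < x.size:
        v[i] += force * dt            # update velocity (unit mass assumed)
        x[i] += v[i] * dt             # update position

n = 1 << 20
x = cuda.to_device(np.zeros(n, dtype=np.float64))
v = cuda.to_device(np.ones(n, dtype=np.float64))

threads = 256
blocks = (n + threads - 1) // threads
push[blocks, threads](x, v, 0.5, 1.0e-3)
print(x.copy_to_host()[:3])
```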
Basu, Protonu; Williams, Samuel; Van Straalen, Brian; ...
2017-04-05
GPUs, with their high bandwidths and computational capabilities, are an increasingly popular target for scientific computing. Unfortunately, to date, harnessing the power of the GPU has required use of a GPU-specific programming model like CUDA, OpenCL, or OpenACC. Thus, in order to deliver portability across CPU-based and GPU-accelerated supercomputers, programmers are forced to write and maintain two versions of their applications or frameworks. In this paper, we explore the use of a compiler-based autotuning framework based on CUDA-CHiLL to deliver not only portability, but also performance portability across CPU- and GPU-accelerated platforms for the geometric multigrid linear solvers found in many scientific applications. We also show that with autotuning we can attain near-Roofline (a performance bound for a computation and target architecture) performance across the key operations in the miniGMG benchmark for both CPU- and GPU-based architectures as well as for multiple stencil discretizations and smoothers. We show that our technology is readily interoperable with MPI, resulting in performance at scale equal to that obtained via a hand-optimized MPI+CUDA implementation.
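The Roofline bound invoked in this abstract reduces to a one-line formula: attainable throughput is the lesser of the machine's peak compute rate and its memory bandwidth times the kernel's arithmetic intensity. The sketch below encodes that bound; the peak numbers are placeholders, not measurements from the paper.

```python
# Sketch: the Roofline performance bound. Attainable GF/s is limited either by
# peak compute or by peak bandwidth times arithmetic intensity (flop/byte).
def roofline_gflops(arithmetic_intensity_flop_per_byte: float,
                    peak_gflops: float,
                    peak_bandwidth_gbs: float) -> float:
    return min(peak_gflops,
               arithmetic_intensity_flop_per_byte * peak_bandwidth_gbs)

# Example: a stencil sweep with ~0.5 flop/byte on a node with 1000 GF/s peak
# compute and 200 GB/s memory bandwidth is bandwidth-bound at ~100 GF/s.
print(roofline_gflops(0.5, peak_gflops=1000.0, peak_bandwidth_gbs=200.0))
```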
DOE Office of Scientific and Technical Information (OSTI.GOV)
Draeger, Erik W.
This report documents the fact that the work in creating a strategic plan and beginning customer engagements has been completed. The description of the milestone is: The newly formed advanced architecture and portability specialists (AAPS) team will develop a strategic plan to meet the goals of 1) sharing knowledge and experience with code teams to ensure that ASC codes run well on new architectures, and 2) supplying skilled computational scientists to put the strategy into practice. The plan will be delivered to ASC management in the first quarter. By the fourth quarter, the team will identify their first customers within PEM and IC, perform an initial assessment of scalability and performance bottlenecks for next-generation architectures, and embed AAPS team members with customer code teams to assist with initial portability development within standalone kernels or proxy applications.
Padhi, Radhakant; Unnikrishnan, Nishant; Wang, Xiaohua; Balakrishnan, S N
2006-12-01
Even though dynamic programming offers an optimal control solution in a state feedback form, the method is overwhelmed by computational and storage requirements. Approximate dynamic programming implemented with an Adaptive Critic (AC) neural network structure has evolved as a powerful alternative technique that obviates the need for excessive computations and storage requirements in solving optimal control problems. In this paper, an improvement to the AC architecture, called the "Single Network Adaptive Critic (SNAC)" is presented. This approach is applicable to a wide class of nonlinear systems where the optimal control (stationary) equation can be explicitly expressed in terms of the state and costate variables. The selection of this terminology is guided by the fact that it eliminates the use of one neural network (namely the action network) that is part of a typical dual network AC setup. As a consequence, the SNAC architecture offers three potential advantages: a simpler architecture, lesser computational load and elimination of the approximation error associated with the eliminated network. In order to demonstrate these benefits and the control synthesis technique using SNAC, two problems have been solved with the AC and SNAC approaches and their computational performances are compared. One of these problems is a real-life Micro-Electro-Mechanical-system (MEMS) problem, which demonstrates that the SNAC technique is applicable to complex engineering systems.
Developing a Complete and Effective ACT-R Architecture
2008-01-01
...of computational primitives, as contrasted with the predominant "one-off" and "grab-bag" cognitive models in the field. These architectures have ... transport/semaphore protocols connected via a glue script. Both protocols rely on the fact that file rename and file remove operations are atomic ... the Trial Log file until just prior to processing the next input request. Thus, to perform synchronous identifications it is necessary to run an ...
Reverse time migration: A seismic processing application on the connection machine
NASA Technical Reports Server (NTRS)
Fiebrich, Rolf-Dieter
1987-01-01
The implementation of a reverse time migration algorithm on the Connection Machine, a massively parallel computer, is described. Essential architectural features of this machine as well as programming concepts are presented. The data structures and parallel operations for the implementation of the reverse time migration algorithm are described. The algorithm matches the Connection Machine architecture closely and executes at nearly the peak performance of this machine.
Wolinski, Christophe Czeslaw [Los Alamos, NM; Gokhale, Maya B [Los Alamos, NM; McCabe, Kevin Peter [Los Alamos, NM
2011-01-18
Fabric-based computing systems and methods are disclosed. A fabric-based computing system can include a polymorphous computing fabric that can be customized on a per application basis and a host processor in communication with said polymorphous computing fabric. The polymorphous computing fabric includes a cellular architecture that can be highly parameterized to enable a customized synthesis of fabric instances for a variety of enhanced application performances thereof. A global memory concept can also be included that provides the host processor random access to all variables and instructions associated with the polymorphous computing fabric.
Calculating Reuse Distance from Source Code
DOE Office of Scientific and Technical Information (OSTI.GOV)
Narayanan, Sri Hari Krishna; Hovland, Paul
The efficient use of a system is of paramount importance in high-performance computing. Applications need to be engineered for future systems even before the architecture of such a system is clearly known. Static performance analysis that generates performance bounds is one way to approach the task of understanding application behavior. Performance bounds provide an upper limit on the performance of an application on a given architecture. Predicting cache hierarchy behavior and accesses to main memory is a requirement for accurate performance bounds. This work presents our static reuse distance algorithm to generate reuse distance histograms. We then use these histograms to predict cache miss rates. Experimental results for kernels studied show that the approach is accurate.
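The relationship the static analysis exploits is that, for a fully associative LRU cache, a reference misses exactly when its reuse distance is at least the cache capacity. The sketch below measures reuse distances from a toy address trace and turns the histogram into a predicted miss rate; the trace and cache sizes are illustrative, and the paper derives the histograms statically rather than from a trace.

```python
# Sketch: reuse-distance histogram -> predicted miss rate for a fully
# associative LRU cache of a given capacity (in lines).
from collections import Counter

def reuse_distances(trace):
    stack = []                              # LRU stack, most recent first
    for addr in trace:
        if addr in stack:
            depth = stack.index(addr)       # distinct addresses since last use
            stack.remove(addr)
            yield depth
        else:
            yield float("inf")              # cold (compulsory) miss
        stack.insert(0, addr)

def predicted_miss_rate(trace, cache_lines):
    hist = Counter(reuse_distances(trace))
    misses = sum(count for dist, count in hist.items() if dist >= cache_lines)
    return misses / sum(hist.values())

trace = [0, 1, 2, 0, 1, 3, 0, 1, 2, 3] * 100
print(predicted_miss_rate(trace, cache_lines=2))
print(predicted_miss_rate(trace, cache_lines=4))
```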
Efficient self-organizing multilayer neural network for nonlinear system modeling.
Han, Hong-Gui; Wang, Li-Dan; Qiao, Jun-Fei
2013-07-01
It has been shown extensively that the dynamic behaviors of a neural system are strongly influenced by the network architecture and learning process. To establish an artificial neural network (ANN) with a self-organizing architecture and a suitable learning algorithm for nonlinear system modeling, an automatic axon-neural network (AANN) is investigated in the following respects. First, the network architecture is constructed automatically, changing both the number of hidden neurons and the topology of the neural network during the training process. The adaptive connecting-and-pruning (ACP) algorithm introduced here is a type of mixed-mode operation, equivalent to pruning or adding connections between neurons as well as inserting required neurons directly. Secondly, the weights are adjusted using a feedforward computation (FC) to obtain the gradient information during learning. Unlike most previous studies, the AANN is able to self-organize its architecture and weights and to improve network performance. The proposed AANN has been tested on a number of benchmark problems, ranging from nonlinear function approximation to nonlinear system modeling. The experimental results show that the AANN performs better than some existing neural networks.
Hypercube matrix computation task
NASA Technical Reports Server (NTRS)
Calalo, R.; Imbriale, W.; Liewer, P.; Lyons, J.; Manshadi, F.; Patterson, J.
1987-01-01
The Hypercube Matrix Computation (Year 1986-1987) task investigated the applicability of a parallel computing architecture to the solution of large-scale electromagnetic scattering problems. Two existing electromagnetic scattering codes were selected for conversion to the Mark III Hypercube concurrent computing environment. They were selected so that the underlying numerical algorithms utilized would be different, thereby providing a more thorough evaluation of the appropriateness of the parallel environment for these types of problems. The first code was a frequency domain method of moments solution, NEC-2, developed at Lawrence Livermore National Laboratory. The second code was a time domain finite difference solution of Maxwell's equations to solve for the scattered fields. Once the codes were implemented on the hypercube and verified to obtain correct solutions by comparing the results with those from sequential runs, several measures were used to evaluate the performance of the two codes. First, the problem size possible on the hypercube, with 128 megabytes of memory in a 32-node configuration, was compared with that available in a typical sequential user environment of 4 to 8 megabytes. Then, the performance of the codes was analyzed for the computational speedup attained by the parallel architecture.
Meta assembler enhancements and generalized linkage editor
NASA Technical Reports Server (NTRS)
1979-01-01
A Meta Assembler for NASA was developed. The initial development of the Meta Assembler for the SUMC was performed. The capabilities included assembly for both main and micro level programs. A period of checkout and utilization to verify the performance of the Meta Assembler was undertaken. Additional enhancements were made to the Meta Assembler which expanded the target computer family to include architectures represented by the PDP-11, MODCOMP 2, and Raytheon 706 computers.
Oelerich, Jan Oliver; Duschek, Lennart; Belz, Jürgen; Beyer, Andreas; Baranovskii, Sergei D; Volz, Kerstin
2017-06-01
We present a new multislice code for the computer simulation of scanning transmission electron microscope (STEM) images based on the frozen lattice approximation. Unlike existing software packages, the code is optimized to perform well on highly parallelized computing clusters, combining distributed and shared memory architectures. This enables efficient calculation of large lateral scanning areas of the specimen within the frozen lattice approximation and fine-grained sweeps of parameter space.
Recent Trends in Spintronics-Based Nanomagnetic Logic
NASA Astrophysics Data System (ADS)
Das, Jayita; Alam, Syed M.; Bhanja, Sanjukta
2014-09-01
With the growing concerns about standby power in sub-100-nm CMOS technologies, alternative computing techniques and memory technologies are being explored. Spin transfer torque magnetoresistive RAM (STT-MRAM) is one such nonvolatile memory, relying on magnetic tunnel junctions (MTJs) to store information. It uses spin transfer torque to write information and magnetoresistance to read information. In 2012, Everspin Technologies, Inc. commercialized the first 64 Mbit Spin Torque MRAM. On the computing end, nanomagnetic logic (NML) is a promising technique with zero leakage and high data retention. In 2000, Cowburn and Welland first demonstrated its potential in logic and information propagation through magnetostatic interaction in a chain of single-domain circular nanomagnetic dots of Supermalloy (Ni80Fe14Mo5X1, where X is other metals). In 2006, Imre et al. demonstrated wires and majority gates, followed by a demonstration of coplanar cross-wire systems by Pulecio et al. in 2010. Since 2004, researchers have also investigated the potential of MTJs in logic. More recently, with dipolar coupling between MTJs demonstrated in 2012, logic-in-memory architectures with STT-MRAM have been investigated. The architecture borrows its computing concept from NML and its read and write style from MRAM, and can switch its operation between logic and memory modes with the clock as classifier. Further, through logic partitioning between the MTJ and CMOS planes, a significant performance boost has been observed in basic computing blocks within the architecture. In this work, we explore the developments in NML and MTJs, and the more recent developments in the hybrid MTJ/CMOS logic-in-memory architecture and its unique logic partitioning capability.
Exploiting the chaotic behaviour of atmospheric models with reconfigurable architectures
NASA Astrophysics Data System (ADS)
Russell, Francis P.; Düben, Peter D.; Niu, Xinyu; Luk, Wayne; Palmer, T. N.
2017-12-01
Reconfigurable architectures are becoming mainstream: Amazon, Microsoft and IBM are supporting such architectures in their data centres. The computationally intensive nature of atmospheric modelling is an attractive target for hardware acceleration using reconfigurable computing. Performance of hardware designs can be improved through the use of reduced-precision arithmetic, but maintaining appropriate accuracy is essential. We explore reduced-precision optimisation for simulating chaotic systems, targeting atmospheric modelling, in which even minor changes in arithmetic behaviour will cause simulations to diverge quickly. The possibility of equally valid simulations having differing outcomes means that standard techniques for comparing numerical accuracy are inappropriate. We use the Hellinger distance to compare statistical behaviour between reduced-precision CPU implementations to guide reconfigurable designs of a chaotic system, then analyse accuracy, performance and power efficiency of the resulting implementations. Our results show that with only a limited loss in accuracy corresponding to less than 10% uncertainty in input parameters, the throughput and energy efficiency of a single-precision chaotic system implemented on a Xilinx Virtex-6 SX475T Field Programmable Gate Array (FPGA) can be more than doubled.
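The Hellinger distance used above to compare the statistics of reduced-precision and reference runs has a simple closed form for two normalized histograms over the same bins. The sketch below computes it; the histogram values are placeholders, not data from the paper.

```python
# Sketch: Hellinger distance between two histograms over the same bins,
# H(P, Q) = sqrt(0.5 * sum((sqrt(p_i) - sqrt(q_i))**2)).
import numpy as np

def hellinger(p: np.ndarray, q: np.ndarray) -> float:
    p = p / p.sum()                      # normalize to probability vectors
    q = q / q.sum()
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

reference = np.array([0.10, 0.40, 0.30, 0.20])   # placeholder statistics
reduced   = np.array([0.12, 0.38, 0.29, 0.21])
print(hellinger(reference, reduced))    # 0 = identical, 1 = disjoint support
```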
NASA Astrophysics Data System (ADS)
Luo, Aiwen; An, Fengwei; Zhang, Xiangyu; Chen, Lei; Huang, Zunkai; Jürgen Mattausch, Hans
2018-04-01
Feature extraction techniques are a cornerstone of object detection in computer-vision-based applications. The detection performance of vision-based detection systems is often degraded by, e.g., changes in the illumination intensity of the light source, foreground-background contrast variations or automatic gain control from the camera. In order to avoid such degradation effects, we present a block-based L1-norm-circuit architecture which is configurable for different image-cell sizes, cell-based feature descriptors and image resolutions according to customization parameters from the circuit input. The incorporated flexibility in both the image resolution and the cell size for multi-scale image pyramids leads to lower computational complexity and power consumption. Additionally, an object-detection prototype for performance evaluation in 65 nm CMOS implements the proposed L1-norm circuit together with a histogram of oriented gradients (HOG) descriptor and a support vector machine (SVM) classifier. The proposed parallel architecture with high hardware efficiency enables real-time processing, high detection robustness, small chip-core area as well as low power consumption for multi-scale object detection.
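For reference, the block-level L1 normalization that such a circuit computes over HOG cell histograms can be written in a few lines of NumPy. The block size, histogram shape and epsilon below are illustrative; the paper's configurable cell sizes and fixed-point datapath are not modelled.

```python
# Sketch: block-wise L1 normalization of HOG cell histograms, the operation the
# configurable L1-norm circuit above implements in hardware.
import numpy as np

def l1_normalize_blocks(cell_hists: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """cell_hists: (rows, cols, bins) cell histograms; blocks are 2x2 cells."""
    rows, cols, _ = cell_hists.shape
    blocks = []
    for r in range(rows - 1):
        for c in range(cols - 1):
            block = cell_hists[r:r + 2, c:c + 2, :].ravel()
            blocks.append(block / (np.abs(block).sum() + eps))   # L1 norm
    return np.stack(blocks)

cells = np.random.rand(8, 8, 9)          # 8x8 cells, 9 orientation bins
print(l1_normalize_blocks(cells).shape)  # (49, 36)
```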
A smart sensor architecture based on emergent computation in an array of outer-totalistic cells
NASA Astrophysics Data System (ADS)
Dogaru, Radu; Dogaru, Ioana; Glesner, Manfred
2005-06-01
A novel smart-sensor architecture is proposed that can segment and recognize characters in a monochrome image. It provides a list of ASCII codes representing the characters recognized in the monochrome visual field, and can operate as an aid for the blind or in industrial applications. A bio-inspired cellular model with simple linear neurons was found to be the best at performing the nontrivial task of cropping isolated compact objects such as handwritten digits or characters. By attaching a simple outer-totalistic cell to each pixel sensor, emergent computation in the resulting cellular automata lattice provides a straightforward and compact solution to the otherwise computationally intensive problem of character segmentation. A simple and robust recognition algorithm is built into a compact sequential controller accessing the array of cells, so that the integrated device can directly provide a list of codes of the recognized characters. Preliminary simulation tests indicate good performance and robustness to various distortions of the visual field.
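An outer-totalistic cell updates its state from its own value and the sum of its neighbours' values only, which is what makes a per-pixel hardware implementation compact. The sketch below performs one such update step on a binary image; the specific rule (Conway's Life) is merely a familiar stand-in, not the segmentation rule used in the paper.

```python
# Sketch: one update step of an outer-totalistic cellular automaton on a binary
# image. The new state depends only on (own state, neighbour sum).
import numpy as np

def outer_totalistic_step(grid: np.ndarray) -> np.ndarray:
    # Sum of the 8 neighbours of every cell (toroidal boundary for simplicity).
    neigh = sum(np.roll(np.roll(grid, dr, 0), dc, 1)
                for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0))
    # Stand-in rule (Conway's Life): born on 3 neighbours, survive on 2 or 3.
    born = (grid == 0) & (neigh == 3)
    survive = (grid == 1) & ((neigh == 2) | (neigh == 3))
    return (born | survive).astype(grid.dtype)

grid = (np.random.rand(32, 32) > 0.7).astype(np.uint8)
print(outer_totalistic_step(grid).sum())
```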
Low latency network and distributed storage for next generation HPC systems: the ExaNeSt project
NASA Astrophysics Data System (ADS)
Ammendola, R.; Biagioni, A.; Cretaro, P.; Frezza, O.; Lo Cicero, F.; Lonardo, A.; Martinelli, M.; Paolucci, P. S.; Pastorelli, E.; Pisani, F.; Simula, F.; Vicini, P.; Navaridas, J.; Chaix, F.; Chrysos, N.; Katevenis, M.; Papaeustathiou, V.
2017-10-01
With processor architecture evolution, the HPC market has undergone a paradigm shift. The adoption of low-cost, Linux-based clusters extended the reach of HPC from its roots in modelling and simulation of complex physical systems to a broader range of industries, from biotechnology, cloud computing, computer analytics and big data challenges to manufacturing sectors. In this perspective, near-future HPC systems can be envisioned as composed of millions of low-power computing cores, densely packed (and therefore cooled by appropriate technology), with a tightly interconnected, low-latency, high-performance network and a distributed storage architecture. Each of these features (dense packing, distributed storage and high-performance interconnect) represents a challenge, made all the harder by the need to solve them at the same time. These challenges lie as stumbling blocks along the road towards Exascale-class systems; the ExaNeSt project acknowledges them and tasks itself with investigating ways around them.
Architectures for single-chip image computing
NASA Astrophysics Data System (ADS)
Gove, Robert J.
1992-04-01
This paper will focus on the architectures of VLSI programmable processing components for image computing applications. TI, the maker of industry-leading RISC, DSP, and graphics components, has developed an architecture for a new generation of image processors capable of implementing a plurality of image, graphics, video, and audio computing functions. We will show that the use of a single-chip heterogeneous MIMD parallel architecture best suits this class of processors, those which will dominate the desktop multimedia, document imaging, computer graphics, and visualization systems of this decade.
Transitioning ISR architecture into the cloud
NASA Astrophysics Data System (ADS)
Lash, Thomas D.
2012-06-01
Emerging cloud computing platforms offer an ideal opportunity for Intelligence, Surveillance, and Reconnaissance (ISR) intelligence analysis. Cloud computing platforms help overcome challenges and limitations of traditional ISR architectures. Modern ISR architectures can benefit from examining commercial cloud applications, especially as they relate to user experience, usage profiling, and transformational business models. This paper outlines legacy ISR architectures and their limitations, presents an overview of cloud technologies and their applications to the ISR intelligence mission, and presents an idealized ISR architecture implemented with cloud computing.
Architecture-Adaptive Computing Environment: A Tool for Teaching Parallel Programming
NASA Technical Reports Server (NTRS)
Dorband, John E.; Aburdene, Maurice F.
2002-01-01
Recently, networked and cluster computation have become very popular. This paper is an introduction to a new C based parallel language for architecture-adaptive programming, aCe C. The primary purpose of aCe (Architecture-adaptive Computing Environment) is to encourage programmers to implement applications on parallel architectures by providing them the assurance that future architectures will be able to run their applications with a minimum of modification. A secondary purpose is to encourage computer architects to develop new types of architectures by providing an easily implemented software development environment and a library of test applications. This new language should be an ideal tool to teach parallel programming. In this paper, we will focus on some fundamental features of aCe C.
Toward a Fault Tolerant Architecture for Vital Medical-Based Wearable Computing.
Abdali-Mohammadi, Fardin; Bajalan, Vahid; Fathi, Abdolhossein
2015-12-01
Advancements in computers and electronic technologies have led to the emergence of a new generation of efficient small intelligent systems. The products of such technologies include smartphones and wearable devices, which have attracted the attention of medical applications. These products are used less in critical medical applications because of their resource constraints and failure sensitivity, since without safety considerations small integrated hardware can endanger patients' lives. Therefore, some principles are required for constructing wearable systems in healthcare so that the existing concerns are dealt with. Accordingly, this paper proposes an architecture for constructing wearable systems in critical medical applications. The proposed architecture is a three-tier one, supporting data flow from body sensors to the cloud. The tiers of this architecture include wearable computers, mobile computing, and mobile cloud computing. One of the features of this architecture is its potential for high fault tolerance due to the nature of its components. Moreover, the required protocols are presented to coordinate the components of this architecture. Finally, the reliability of this architecture is assessed by simulating the architecture and its components, and other aspects of the proposed architecture are discussed.
GPU-completeness: theory and implications
NASA Astrophysics Data System (ADS)
Lin, I.-Jong
2011-01-01
This paper formalizes a major insight into a class of algorithms that relate parallelism and performance. The purpose of this paper is to define a class of algorithms that trades off parallelism for quality of result (e.g. visual quality, compression rate), and we propose a similar method for algorithmic classification based on NP-Completeness techniques, applied toward parallel acceleration. We define this class of algorithms as "GPU-Complete" and postulate the necessary properties of algorithms for admission into this class. We also formally relate this algorithmic space to the space of imaging algorithms. This concept is based upon our experience in the print production area, where GPUs (Graphics Processing Units) have shown a substantial cost/performance advantage within the context of HP-delivered enterprise services and commercial printing infrastructure. While CPUs and GPUs are converging in their underlying hardware and functional blocks, their system behaviors are clearly distinct in many ways: memory system design, programming paradigms, and massively parallel SIMD architecture. There are applications that are clearly suited to each architecture: for the CPU, language compilation, word processing, operating systems, and other applications that are highly sequential in nature; for the GPU, video rendering, particle simulation, pixel color conversion, and other problems clearly amenable to massive parallelization. While GPUs are establishing themselves as a second computing architecture distinct from CPUs, their end-to-end system cost/performance advantage in certain parts of computation informs the structure of algorithms and their efficient parallel implementations. While GPUs are merely one type of architecture for parallelization, we show that their introduction into the design space of printing systems demonstrates the trade-offs against competing multi-core, FPGA, and ASIC architectures. While each architecture has its own optimal application, we believe that the selection of architecture can be defined in terms of properties of GPU-Completeness. For a well-defined subset of algorithms, GPU-Completeness is intended to connect parallelism, algorithms and efficient architectures into a unified framework to show that multiple layers of parallel implementation are guided by the same underlying trade-off.
Advanced computer architecture specification for automated weld systems
NASA Technical Reports Server (NTRS)
Katsinis, Constantine
1994-01-01
This report describes the requirements for an advanced automated weld system and the associated computer architecture, and defines the overall system specification from a broad perspective. According to the requirements of welding procedures as they relate to an integrated multiaxis motion control and sensor architecture, the computer system requirements are developed based on a proven multiple-processor architecture with an expandable, distributed-memory, single global bus architecture, containing individual processors which are assigned to specific tasks that support sensor or control processes. The specified architecture is sufficiently flexible to integrate previously developed equipment, be upgradable and allow on-site modifications.
Quantum Computing Architectural Design
NASA Astrophysics Data System (ADS)
West, Jacob; Simms, Geoffrey; Gyure, Mark
2006-03-01
Large scale quantum computers will invariably require scalable architectures in addition to high fidelity gate operations. Quantum computing architectural design (QCAD) addresses the problems of actually implementing fault-tolerant algorithms given physical and architectural constraints beyond those of basic gate-level fidelity. Here we introduce a unified framework for QCAD that enables the scientist to study the impact of varying error correction schemes, architectural parameters including layout and scheduling, and physical operations native to a given architecture. Our software package, aptly named QCAD, provides compilation, manipulation/transformation, multi-paradigm simulation, and visualization tools. We demonstrate various features of the QCAD software package through several examples.
Recursive computer architecture for VLSI
DOE Office of Scientific and Technical Information (OSTI.GOV)
Treleaven, P.C.; Hopkins, R.P.
1982-01-01
A general-purpose computer architecture based on the concept of recursion and suitable for VLSI computer systems built from replicated (lego-like) computing elements is presented. The recursive computer architecture is defined by presenting a program organisation, a machine organisation and an experimental machine implementation oriented to VLSI. The experimental implementation is restricted to simple, identical microcomputers, each containing a memory, a processor and a communications capability. This future generation of lego-like computer systems is termed fifth-generation computers by the Japanese.
Design and Development of a Run-Time Monitor for Multi-Core Architectures in Cloud Computing
Kang, Mikyung; Kang, Dong-In; Crago, Stephen P.; Park, Gyung-Leen; Lee, Junghoon
2011-01-01
Cloud computing is a new information technology trend that moves computing and data away from desktops and portable PCs into large data centers. The basic principle of cloud computing is to deliver applications as services over the Internet as well as infrastructure. A cloud is a type of parallel and distributed system consisting of a collection of inter-connected and virtualized computers that are dynamically provisioned and presented as one or more unified computing resources. The large-scale distributed applications on a cloud require adaptive service-based software, which has the capability of monitoring system status changes, analyzing the monitored information, and adapting its service configuration while considering tradeoffs among multiple QoS features simultaneously. In this paper, we design and develop a Run-Time Monitor (RTM) which is a system software to monitor the application behavior at run-time, analyze the collected information, and optimize cloud computing resources for multi-core architectures. RTM monitors application software through library instrumentation as well as underlying hardware through a performance counter optimizing its computing configuration based on the analyzed data. PMID:22163811
Analytical Cost Metrics : Days of Future Past
DOE Office of Scientific and Technical Information (OSTI.GOV)
Prajapati, Nirmal; Rajopadhye, Sanjay; Djidjev, Hristo Nikolov
As we move towards the exascale era, new architectures must be capable of running massive computational problems efficiently. Scientists and researchers are continuously investing in tuning the performance of extreme-scale computational problems. These problems arise in almost all areas of computing, ranging from big data analytics, artificial intelligence, search, machine learning, virtual/augmented reality, computer vision, and image/signal processing to computational science and bioinformatics. With Moore's law driving the evolution of hardware platforms towards exascale, the dominant performance metric (time efficiency) has now expanded to also incorporate power/energy efficiency. Therefore, the major challenge that we face in computing systems research is: "how to solve massive-scale computational problems in the most time/power/energy efficient manner?"
Achieving realistic performance and decision-making capabilities in computer-generated air forces
NASA Astrophysics Data System (ADS)
Banks, Sheila B.; Stytz, Martin R.; Santos, Eugene, Jr.; Zurita, Vincent B.; Benslay, James L., Jr.
1997-07-01
For a computer-generated force (CGF) system to be useful in training environments, it must be able to operate at multiple skill levels, exhibit competency at assigned missions, and comply with current doctrine. Because of the rapid rate of change in distributed interactive simulation (DIS) and the expanding set of performance objectives for any computer- generated force, the system must also be modifiable at reasonable cost and incorporate mechanisms for learning. Therefore, CGF applications must have adaptable decision mechanisms and behaviors and perform automated incorporation of past reasoning and experience into its decision process. The CGF must also possess multiple skill levels for classes of entities, gracefully degrade its reasoning capability in response to system stress, possess an expandable modular knowledge structure, and perform adaptive mission planning. Furthermore, correctly performing individual entity behaviors is not sufficient. Issues related to complex inter-entity behavioral interactions, such as the need to maintain formation and share information, must also be considered. The CGF must also be able to acceptably respond to unforeseen circumstances and be able to make decisions in spite of uncertain information. Because of the need for increased complexity in the virtual battlespace, the CGF should exhibit complex, realistic behavior patterns within the battlespace. To achieve these necessary capabilities, an extensible software architecture, an expandable knowledge base, and an adaptable decision making mechanism are required. Our lab has addressed these issues in detail. The resulting DIS-compliant system is called the automated wingman (AW). The AW is based on fuzzy logic, the common object database (CODB) software architecture, and a hierarchical knowledge structure. We describe the techniques we used to enable us to make progress toward a CGF entity that satisfies the requirements presented above. We present our design and implementation of an adaptable decision making mechanism that uses multi-layered, fuzzy logic controlled situational analysis. Because our research indicates that fuzzy logic can perform poorly under certain circumstances, we combine fuzzy logic inferencing with adversarial game tree techniques for decision making in strategic and tactical engagements. We describe the approach we employed to achieve this fusion. We also describe the automated wingman's system architecture and knowledge base architecture.
NASA Technical Reports Server (NTRS)
Gentzsch, W.
1982-01-01
Problems which can arise with vector and parallel computers are discussed in a user-oriented context. Emphasis is placed on the algorithms used and the programming techniques adopted. Three recently developed supercomputers are examined and typical application examples are given in CRAY FORTRAN, CYBER 205 FORTRAN and DAP (distributed array processor) FORTRAN. The systems' performance is compared. The addition of parts of two N x N arrays is considered. The influence of the architecture on the algorithms and programming language is demonstrated. Numerical analysis of magnetohydrodynamic differential equations by an explicit difference method is illustrated, showing very good results for all three systems. The prognosis for supercomputer development is assessed.
DOE Office of Scientific and Technical Information (OSTI.GOV)
You, Yang; Fu, Haohuan; Song, Shuaiwen
2014-07-18
Wave propagation forward modeling is a widely used computational method in oil and gas exploration. The iterative stencil loops in such problems have broad applications in scientific computing. However, executing such loops can be highly time-consuming, which greatly limits application performance and power efficiency. In this paper, we accelerate the forward modeling technique on the latest multi-core and many-core architectures such as Intel Sandy Bridge CPUs, the NVIDIA Fermi C2070 GPU, the NVIDIA Kepler K20x GPU, and the Intel Xeon Phi co-processor. For the GPU platforms, we propose two parallel strategies to explore the performance optimization opportunities for our stencil kernels. For Sandy Bridge CPUs and MIC, we also employ various optimization techniques in order to achieve the best performance.
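As a point of reference for the stencil loops discussed above, the sketch below performs a few time steps of a second-order 2-D acoustic wave update in plain NumPy. Grid size, velocity, and time step are placeholder values chosen only to keep the update stable; the tuned CPU/GPU kernels in the paper are far more elaborate.

```python
# Sketch: an iterative stencil loop of the forward-modeling kind, here a
# second-order finite-difference update of the 2-D acoustic wave equation.
import numpy as np

nx, nz, nt = 200, 200, 50
c, dx, dt = 1500.0, 5.0, 1e-3          # velocity (m/s), spacing (m), step (s)
prev = np.zeros((nz, nx))
curr = np.zeros((nz, nx))
curr[nz // 2, nx // 2] = 1.0           # point source at the centre

for _ in range(nt):
    # Five-point Laplacian stencil (periodic boundaries for brevity).
    lap = (np.roll(curr, 1, 0) + np.roll(curr, -1, 0) +
           np.roll(curr, 1, 1) + np.roll(curr, -1, 1) - 4.0 * curr) / dx**2
    nxt = 2.0 * curr - prev + (c * dt) ** 2 * lap
    prev, curr = curr, nxt

print(curr.max(), curr.min())
```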
Gyrokinetic particle-in-cell optimization on emerging multi- and manycore platforms
Madduri, Kamesh; Im, Eun-Jin; Ibrahim, Khaled Z.; ...
2011-03-02
The next decade of high-performance computing (HPC) systems will see a rapid evolution and divergence of multi- and manycore architectures as power and cooling constraints limit increases in microprocessor clock speeds. Understanding efficient optimization methodologies on diverse multicore designs in the context of demanding numerical methods is one of the greatest challenges faced today by the HPC community. In this paper, we examine the efficient multicore optimization of GTC, a petascale gyrokinetic toroidal fusion code for studying plasma microturbulence in tokamak devices. For GTC's key computational components (charge deposition and particle push), we explore efficient parallelization strategies across a broad range of emerging multicore designs, including the recently-released Intel Nehalem-EX, the AMD Opteron Istanbul, and the highly multithreaded Sun UltraSparc T2+. We also present the first study on tuning gyrokinetic particle-in-cell (PIC) algorithms for graphics processors, using the NVIDIA C2050 (Fermi). Our work discusses several novel optimization approaches for gyrokinetic PIC, including mixed-precision computation, particle binning and decomposition strategies, grid replication, SIMDized atomic floating-point operations, and effective GPU texture memory utilization. Overall, we achieve significant performance improvements of 1.3–4.7× on these complex PIC kernels, despite the inherent challenges of data dependency and locality. Finally, our work also points to several architectural and programming features that could significantly enhance PIC performance and productivity on next-generation architectures.
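Of the optimizations listed above, particle binning is easy to illustrate: sorting particles by the grid cell they occupy makes charge deposition touch memory contiguously. The NumPy sketch below shows nearest-grid-point deposition with such a binning step; it is a serial illustration of the data layout, not the tuned GTC kernels, and all sizes are arbitrary.

```python
# Sketch: nearest-grid-point charge deposition with particles pre-sorted
# ("binned") by cell, so per-cell accesses become contiguous.
import numpy as np

ncells, nparticles = 64, 100_000
pos = np.random.rand(nparticles)           # positions in [0, 1)
charge = np.full(nparticles, 1.0 / nparticles)

cell = np.minimum((pos * ncells).astype(np.int64), ncells - 1)
order = np.argsort(cell)                   # binning: sort particles by cell
cell, charge = cell[order], charge[order]  # contiguous memory per cell

rho = np.bincount(cell, weights=charge, minlength=ncells)   # deposition
print(rho.sum())                           # total charge is conserved (~1.0)
```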
Software Systems for High-performance Quantum Computing
DOE Office of Scientific and Technical Information (OSTI.GOV)
Humble, Travis S; Britt, Keith A
Quantum computing promises new opportunities for solving hard computational problems, but harnessing this novelty requires breakthrough concepts in the design, operation, and application of computing systems. We define some of the challenges facing the development of quantum computing systems as well as software-based approaches that can be used to overcome these challenges. Following a brief overview of the state of the art, we present models for the quantum programming and execution models, the development of architectures for hybrid high-performance computing systems, and the realization of software stacks for quantum networking. This leads to a discussion of the role that conventionalmore » computing plays in the quantum paradigm and how some of the current challenges for exascale computing overlap with those facing quantum computing.« less
Peer-to-peer architectures for exascale computing : LDRD final report.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Vorobeychik, Yevgeniy; Mayo, Jackson R.; Minnich, Ronald G.
2010-09-01
The goal of this research was to investigate the potential for employing dynamic, decentralized software architectures to achieve reliability in future high-performance computing platforms. These architectures, inspired by peer-to-peer networks such as botnets that already scale to millions of unreliable nodes, hold promise for enabling scientific applications to run usefully on next-generation exascale platforms ({approx} 10{sup 18} operations per second). Traditional parallel programming techniques suffer rapid deterioration of performance scaling with growing platform size, as the work of coping with increasingly frequent failures dominates over useful computation. Our studies suggest that new architectures, in which failures are treated as ubiquitousmore » and their effects are considered as simply another controllable source of error in a scientific computation, can remove such obstacles to exascale computing for certain applications. We have developed a simulation framework, as well as a preliminary implementation in a large-scale emulation environment, for exploration of these 'fault-oblivious computing' approaches. High-performance computing (HPC) faces a fundamental problem of increasing total component failure rates due to increasing system sizes, which threaten to degrade system reliability to an unusable level by the time the exascale range is reached ({approx} 10{sup 18} operations per second, requiring of order millions of processors). As computer scientists seek a way to scale system software for next-generation exascale machines, it is worth considering peer-to-peer (P2P) architectures that are already capable of supporting 10{sup 6}-10{sup 7} unreliable nodes. Exascale platforms will require a different way of looking at systems and software because the machine will likely not be available in its entirety for a meaningful execution time. Realistic estimates of failure rates range from a few times per day to more than once per hour for these platforms. P2P architectures give us a starting point for crafting applications and system software for exascale. In the context of the Internet, P2P applications (e.g., file sharing, botnets) have already solved this problem for 10{sup 6}-10{sup 7} nodes. Usually based on a fractal distributed hash table structure, these systems have proven robust in practice to constant and unpredictable outages, failures, and even subversion. For example, a recent estimate of botnet turnover (i.e., the number of machines leaving and joining) is about 11% per week. Nonetheless, P2P networks remain effective despite these failures: The Conficker botnet has grown to {approx} 5 x 10{sup 6} peers. Unlike today's system software and applications, those for next-generation exascale machines cannot assume a static structure and, to be scalable over millions of nodes, must be decentralized. P2P architectures achieve both, and provide a promising model for 'fault-oblivious computing'. This project aimed to study the dynamics of P2P networks in the context of a design for exascale systems and applications. Having no single point of failure, the most successful P2P architectures are adaptive and self-organizing. While there has been some previous work applying P2P to message passing, little attention has been previously paid to the tightly coupled exascale domain. Typically, the per-node footprint of P2P systems is small, making them ideal for HPC use. 
The implementation on each peer node cooperates en masse to 'heal' disruptions rather than relying on a controlling 'master' node. Understanding this cooperative behavior from a complex systems viewpoint is essential to predicting useful environments for the inextricably unreliable exascale platforms of the future. We sought to obtain theoretical insight into the stability and large-scale behavior of candidate architectures, and to work toward leveraging Sandia's Emulytics platform to test promising candidates in a realistic (ultimately ≥10^7 nodes) setting. Our primary example applications are drawn from linear algebra: a Jacobi relaxation solver for the heat equation, and the closely related technique of value iteration in optimization. We aimed to apply P2P concepts in designing implementations capable of surviving an unreliable machine of 10^6 nodes.
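To make the 'fault-oblivious computing' idea concrete, the following minimal Python sketch (an illustration only, not the project's simulation framework) runs a Jacobi relaxation for the 1-D heat equation in which each node randomly skips its update on a given sweep; the stale values act as just another controllable source of error and the iteration still converges.

```python
# Toy model (not the project's code): Jacobi relaxation for the steady 1-D heat
# equation where each interior "node" skips its update with probability fail_prob,
# mimicking transient failures. Stale values are simply another error source.
import numpy as np

rng = np.random.default_rng(0)
n, sweeps, fail_prob = 64, 5000, 0.1

u = np.zeros(n)
u[0], u[-1] = 1.0, 0.0                             # fixed boundary temperatures

for _ in range(sweeps):
    new_interior = 0.5 * (u[:-2] + u[2:])          # standard Jacobi update
    alive = rng.random(n - 2) > fail_prob          # failed nodes keep stale values
    u[1:-1] = np.where(alive, new_interior, u[1:-1])

exact = np.linspace(1.0, 0.0, n)                   # linear steady-state profile
print("max error with 10% failures per sweep:", np.abs(u - exact).max())
```

Raising fail_prob slows convergence but does not change the fixed point, which is the behavior the abstract argues makes such solvers attractive for unreliable exascale platforms.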
Connectionist Models for Intelligent Computation
1989-07-26
Personal authors: H.H. Chen and Y.C. Lee. Project title: Connectionist Models for Intelligent Computation. Contract/Grant No.: AFOSR-87-0388. Contract/Grant period of performance: Sept. 1... The objective is to study the underlying principles, architectures and applications of artificial neural networks for intelligent computation. Approach: both numerical...
Random noise effects in pulse-mode digital multilayer neural networks.
Kim, Y C; Shanblatt, M A
1995-01-01
A pulse-mode digital multilayer neural network (DMNN) based on stochastic computing techniques is implemented with simple logic gates as basic computing elements. The pulse-mode signal representation and the use of simple logic gates for neural operations lead to a massively parallel yet compact and flexible network architecture, well suited for VLSI implementation. Algebraic neural operations are replaced by stochastic processes using pseudorandom pulse sequences. The distributions of the results from the stochastic processes are approximated using the hypergeometric distribution. Synaptic weights and neuron states are represented as probabilities and estimated as average pulse occurrence rates in corresponding pulse sequences. A statistical model of the noise (error) is developed to estimate the relative accuracy associated with stochastic computing in terms of mean and variance. Computational differences are then explained by comparison to deterministic neural computations. DMNN feedforward architectures are modeled in VHDL using character recognition problems as testbeds. Computational accuracy is analyzed, and the results of the statistical model are compared with the actual simulation results. Experiments show that the calculations performed in the DMNN are more accurate than those anticipated when Bernoulli sequences are assumed, as is common in the literature. Furthermore, the statistical model successfully predicts the accuracy of the operations performed in the DMNN.
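As a concrete illustration of the stochastic computing principle described above, the short Python sketch below (a toy example, not the DMNN hardware) encodes two values as pseudorandom Bernoulli pulse sequences and multiplies them with a bitwise AND; the estimate's variance shrinks as the sequence length grows, mirroring the accuracy trade-off analyzed in the paper's statistical model.

```python
# Minimal sketch of unipolar stochastic computing: values in [0, 1] are encoded
# as the '1'-rate of a pseudorandom pulse sequence, and a bitwise AND of two
# independent sequences estimates their product.
import numpy as np

rng = np.random.default_rng(42)

def encode(p, length):
    """Bernoulli pulse stream whose average pulse rate is p."""
    return rng.random(length) < p

a, b, length = 0.8, 0.6, 10_000
prod_stream = encode(a, length) & encode(b, length)   # AND gate acts as a multiplier
estimate = prod_stream.mean()                          # average pulse occurrence rate

print(f"exact {a*b:.3f}, stochastic estimate {estimate:.3f}")
```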
Neuromorphic Kalman filter implementation in IBM’s TrueNorth
NASA Astrophysics Data System (ADS)
Carney, R.; Bouchard, K.; Calafiura, P.; Clark, D.; Donofrio, D.; Garcia-Sciveres, M.; Livezey, J.
2017-10-01
Following the advent of a post-Moore’s law field of computation, novel architectures continue to emerge. With composite, multi-million connection neuromorphic chips like IBM’s TrueNorth, neural engineering has now become a feasible technology in this novel computing paradigm. High Energy Physics experiments are continuously exploring new methods of computation and data handling, including neuromorphic, to support the growing challenges of the field and be prepared for future commodity computing trends. This work details the first instance of a Kalman filter implementation in IBM’s neuromorphic architecture, TrueNorth, for both parallel and serial spike trains. The implementation is tested on multiple simulated systems and its performance is evaluated with respect to an equivalent non-spiking Kalman filter. The limits of the implementation are explored whilst varying the size of weight and threshold registers, the number of spikes used to encode a state, size of neuron block for spatial encoding, and neuron potential reset schemes.
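For reference, the non-spiking baseline such a neuromorphic implementation is compared against is the standard predict/update recursion; the Python sketch below is a generic textbook Kalman step with placeholder system matrices, not the TrueNorth spike-train implementation.

```python
# Generic (non-spiking) Kalman filter step, the kind of reference baseline a
# neuromorphic implementation would be evaluated against. Matrices F, H, Q, R
# are placeholders for whatever simulated system is being tracked.
import numpy as np

def kalman_step(x, P, z, F, H, Q, R):
    # Predict
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update
    S = H @ P_pred @ H.T + R                  # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)       # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```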
Design and deployment of an elastic network test-bed in IHEP data center based on SDN
NASA Astrophysics Data System (ADS)
Zeng, Shan; Qi, Fazhi; Chen, Gang
2017-10-01
High energy physics experiments produce huge amounts of raw data, but because network resources are shared, there is no guaranteed bandwidth for each experiment, which can lead to link congestion. At the same time, with the development of cloud computing technologies, IHEP has established a cloud platform based on OpenStack that provides flexible computing and storage resources, and more and more computing applications have been deployed on virtual machines created by OpenStack. Under the traditional network architecture, however, network capacity cannot be provisioned elastically, which becomes a bottleneck restricting the flexible use of cloud computing. To solve these problems, we propose an elastic cloud data center network architecture based on SDN, and we design a high-performance controller cluster based on OpenDaylight. Finally, we present our current test results.
The science of visual analysis at extreme scale
NASA Astrophysics Data System (ADS)
Nowell, Lucy T.
2011-01-01
Driven by market forces and spanning the full spectrum of computational devices, computer architectures are changing in ways that present tremendous opportunities and challenges for data analysis and visual analytic technologies. Leadership-class high performance computing systems will have as many as a million cores by 2020 and support 10 billion-way concurrency, while laptop computers are expected to have as many as 1,000 cores by 2015. At the same time, data of all types are increasing exponentially and automated analytic methods are essential for all disciplines. Many existing analytic technologies do not scale to make full use of current platforms and fewer still are likely to scale to the systems that will be operational by the end of this decade. Furthermore, on the new architectures and for data at extreme scales, validating the accuracy and effectiveness of analytic methods, including visual analysis, will be increasingly important.
NASA Astrophysics Data System (ADS)
Bird, Robert; Nystrom, David; Albright, Brian
2017-10-01
The ability of scientific simulations to effectively deliver performant computation is increasingly being challenged by successive generations of high-performance computing architectures. Code development to support efficient computation on these modern architectures is both expensive and highly complex; if it is approached without due care, it may also not be directly transferable between subsequent hardware generations. Previous works have discussed techniques to support the process of adapting a legacy code for modern hardware generations, but despite the breakthroughs in the areas of mini-app development, performance portability, and cache-oblivious algorithms, the problem still remains largely unsolved. In this work we demonstrate how a focus on platform-agnostic modern code development can be applied to Particle-in-Cell (PIC) simulations to facilitate effective scientific delivery. This work builds directly on our previous work optimizing VPIC, in which we replaced intrinsics-based vectorization with compiler-generated auto-vectorization to improve the performance and portability of VPIC. In this work we present the use of a specialized SIMD queue for processing some particle operations, and also preview a GPU-capable OpenMP variant of VPIC. Finally, we include lessons learned. Work performed under the auspices of the U.S. Dept. of Energy by the Los Alamos National Security, LLC Los Alamos National Laboratory under contract DE-AC52-06NA25396 and supported by the LANL LDRD program.
A Bandwidth-Optimized Multi-Core Architecture for Irregular Applications
DOE Office of Scientific and Technical Information (OSTI.GOV)
Secchi, Simone; Tumeo, Antonino; Villa, Oreste
This paper presents an architecture template for next-generation high performance computing systems specifically targeted to irregular applications. We start our work by considering that future generation interconnection and memory bandwidth full-system numbers are expected to grow by a factor of 10. In order to keep up with such a communication capacity, while still resorting to fine-grained multithreading as the main way to tolerate unpredictable memory access latencies of irregular applications, we show how overall performance scaling can benefit from the multi-core paradigm. At the same time, we also show how such an architecture template must be coupled with specific techniques in order to optimize bandwidth utilization and achieve the maximum scalability. We propose a technique based on memory references aggregation, together with the related hardware implementation, as one of such optimization techniques. We explore the proposed architecture template by focusing on the Cray XMT architecture and, using a dedicated simulation infrastructure, validate the performance of our template with two typical irregular applications. Our experimental results prove the benefits provided by both the multi-core approach and the bandwidth optimization reference aggregation technique.
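The memory-reference aggregation idea can be illustrated in software: the hypothetical sketch below groups fine-grained word addresses by memory block so that one wide request replaces many narrow ones. The block width and interface are assumptions, not the paper's hardware design.

```python
# Software analogue (an assumption, not the paper's hardware) of aggregating
# fine-grained memory references: word addresses issued by many threads are
# grouped by memory block so that one wide request replaces many narrow ones.
from collections import defaultdict

BLOCK_WORDS = 8   # hypothetical block width

def aggregate(addresses):
    blocks = defaultdict(set)
    for a in addresses:
        blocks[a // BLOCK_WORDS].add(a % BLOCK_WORDS)
    return blocks  # one request per block, carrying a word mask

refs = [3, 5, 11, 12, 13, 64, 65, 7]
reqs = aggregate(refs)
print(f"{len(refs)} references -> {len(reqs)} block requests:", dict(reqs))
```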
Optimization of atmospheric transport models on HPC platforms
NASA Astrophysics Data System (ADS)
de la Cruz, Raúl; Folch, Arnau; Farré, Pau; Cabezas, Javier; Navarro, Nacho; Cela, José María
2016-12-01
The performance and scalability of atmospheric transport models on high performance computing environments is often far from optimal for multiple reasons including, for example, sequential input and output, synchronous communications, work unbalance, memory access latency or lack of task overlapping. We investigate how different software optimizations and porting to non general-purpose hardware architectures improve code scalability and execution times considering, as an example, the FALL3D volcanic ash transport model. To this purpose, we implement the FALL3D model equations in the WARIS framework, a software designed from scratch to solve in a parallel and efficient way different geoscience problems on a wide variety of architectures. In addition, we consider further improvements in WARIS such as hybrid MPI-OMP parallelization, spatial blocking, auto-tuning and thread affinity. Considering all these aspects together, the FALL3D execution times for a realistic test case running on general-purpose cluster architectures (Intel Sandy Bridge) decrease by a factor between 7 and 40 depending on the grid resolution. Finally, we port the application to Intel Xeon Phi (MIC) and NVIDIA GPUs (CUDA) accelerator-based architectures and compare performance, cost and power consumption on all the architectures. Implications on time-constrained operational model configurations are discussed.
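Spatial blocking, one of the WARIS optimizations listed above, can be sketched generically: the Python fragment below tiles a 2-D five-point stencil sweep so each tile's working set stays cache resident. The block sizes are hypothetical tuning parameters, not FALL3D/WARIS values.

```python
# Generic illustration of spatial (cache) blocking for a 2-D 5-point stencil:
# the sweep is tiled so each tile's working set stays cache resident.
import numpy as np

def blocked_stencil(u, bi=64, bj=64):
    n, m = u.shape
    out = u.copy()
    for i0 in range(1, n - 1, bi):
        for j0 in range(1, m - 1, bj):
            i1, j1 = min(i0 + bi, n - 1), min(j0 + bj, m - 1)
            out[i0:i1, j0:j1] = 0.25 * (u[i0-1:i1-1, j0:j1] + u[i0+1:i1+1, j0:j1]
                                        + u[i0:i1, j0-1:j1-1] + u[i0:i1, j0+1:j1+1])
    return out

u = np.random.rand(512, 512)
v = blocked_stencil(u)
```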
Heterogeneous high throughput scientific computing with APM X-Gene and Intel Xeon Phi
Abdurachmanov, David; Bockelman, Brian; Elmer, Peter; ...
2015-05-22
Electrical power requirements will be a constraint on the future growth of Distributed High Throughput Computing (DHTC) as used by High Energy Physics. Performance-per-watt is a critical metric for the evaluation of computer architectures for cost-efficient computing. Additionally, future performance growth will come from heterogeneous, many-core, and high computing density platforms with specialized processors. In this paper, we examine the Intel Xeon Phi Many Integrated Cores (MIC) co-processor and Applied Micro X-Gene ARMv8 64-bit low-power server system-on-a-chip (SoC) solutions for scientific computing applications. As a result, we report our experience on software porting, performance and energy efficiency and evaluate the potential for use of such technologies in the context of distributed computing systems such as the Worldwide LHC Computing Grid (WLCG).
Multi-processor including data flow accelerator module
Davidson, George S.; Pierce, Paul E.
1990-01-01
An accelerator module for a data flow computer includes an intelligent memory. The module is added to a multiprocessor arrangement and uses a shared tagged memory architecture in the data flow computer. The intelligent memory module assigns locations for holding data values in correspondence with arcs leading to a node in a data dependency graph. Each primitive computation is associated with a corresponding memory cell, including a number of slots for operands needed to execute a primitive computation, a primitive identifying pointer, and linking slots for distributing the result of the cell computation to other cells requiring that result as an operand. Circuitry is provided for utilizing tag bits to determine automatically when all operands required by a processor are available and for scheduling the primitive for execution in a queue. Each memory cell of the module may be associated with any of the primitives, and the particular primitive to be executed by the processor associated with the cell is identified by providing an index, such as the cell number for the primitive, to the primitive lookup table of starting addresses. The module thus serves to perform functions previously performed by a number of sections of data flow architectures and coexists with conventional shared memory therein. A multiprocessing system including the module operates in a hybrid mode, wherein the same processing modules are used to perform some processing in a sequential mode, under immediate control of an operating system, while performing other processing in a data flow mode.
Multicore Challenges and Benefits for High Performance Scientific Computing
Nielsen, Ida M. B.; Janssen, Curtis L.
2008-01-01
Until recently, performance gains in processors were achieved largely by improvements in clock speeds and instruction level parallelism. Thus, applications could obtain performance increases with relatively minor changes by upgrading to the latest generation of computing hardware. Currently, however, processor performance improvements are realized by using multicore technology and hardware support for multiple threads within each core, and taking full advantage of this technology to improve the performance of applications requires exposure of extreme levels of software parallelism. We will here discuss the architecture of parallel computers constructed from many multicore chips as well as techniques for managing the complexity of programming such computers, including the hybrid message-passing/multi-threading programming model. We will illustrate these ideas with a hybrid distributed memory matrix multiply and a quantum chemistry algorithm for energy computation using Møller–Plesset perturbation theory.
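The hybrid message-passing/multi-threading model can be pictured with a block-distributed matrix multiply: in the schematic Python sketch below, block rows stand in for data owned by different message-passing ranks and a thread pool stands in for intra-node threading. A real implementation would use MPI plus OpenMP or a threaded BLAS; both levels run in one process here purely for illustration.

```python
# Schematic of the hybrid idea: the matrix is split into block rows that would
# live on different message-passing ranks, and each "rank" multiplies its block
# with a pool of threads.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def node_multiply(a_block, b):
    return a_block @ b              # intra-node work (threaded BLAS in practice)

a = np.random.rand(1024, 256)
b = np.random.rand(256, 512)
blocks = np.array_split(a, 4)       # 4 "nodes", i.e. message-passing ranks

with ThreadPoolExecutor(max_workers=4) as pool:
    c = np.vstack(list(pool.map(node_multiply, blocks, [b] * 4)))

assert np.allclose(c, a @ b)
```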
Reducing Communication in Algebraic Multigrid Using Additive Variants
DOE Office of Scientific and Technical Information (OSTI.GOV)
Vassilevski, Panayot S.; Yang, Ulrike Meier
Algebraic multigrid (AMG) has proven to be an effective scalable solver on many high performance computers. However, its increasing communication complexity on coarser levels has been shown to seriously impact its performance on computers with high communication cost. Moreover, while additive AMG variants provide increased parallelism and fewer messages per cycle, they generally exhibit slower convergence. Here we present several new additive variants with convergence rates that are significantly improved compared to the classical additive algebraic multigrid method and investigate their potential for decreased communication and improved communication-computation overlap, features that are essential for good performance on future exascale architectures.
Metal oxide resistive random access memory based synaptic devices for brain-inspired computing
NASA Astrophysics Data System (ADS)
Gao, Bin; Kang, Jinfeng; Zhou, Zheng; Chen, Zhe; Huang, Peng; Liu, Lifeng; Liu, Xiaoyan
2016-04-01
The traditional Boolean computing paradigm based on the von Neumann architecture is facing great challenges for future information technology applications such as big data, the Internet of Things (IoT), and wearable devices, due to limited processing capabilities such as binary data storage and computing, non-parallel data processing, and the bus requirement between memory units and logic units. The brain-inspired neuromorphic computing paradigm is believed to be one of the promising solutions for realizing more complex functions at a lower cost. To perform such brain-inspired computing at low cost and low power consumption, novel devices for use as electronic synapses are needed. Metal oxide resistive random access memory (ReRAM) devices have emerged as the leading candidate for electronic synapses. This paper comprehensively addresses the recent work on the design and optimization of metal oxide ReRAM-based synaptic devices. A performance enhancement methodology and optimized operation scheme to achieve analog resistive switching and low-energy training behavior are provided. A three-dimensional vertical synapse network architecture is proposed for high-density integration and low-cost fabrication. The impacts of the ReRAM synaptic device features on the performance of neuromorphic systems are also discussed on the basis of a constructed neuromorphic visual system with a pattern recognition function. Possible solutions to achieve the high recognition accuracy and efficiency of neuromorphic systems are presented.
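The way a ReRAM crossbar acts as a synaptic array can be summarized by an idealized model: input voltages on the rows multiply programmed conductances, and the column currents sum the products. The Python sketch below shows that analog multiply-accumulate in the ideal case; device non-idealities, which the paper optimizes for, are deliberately omitted, and all values are made-up placeholders.

```python
# Idealized model of a ReRAM crossbar used as a synaptic array: input voltages
# drive the rows, programmed conductances store the weights, and the summed
# column currents give a vector-matrix product (Ohm's and Kirchhoff's laws).
import numpy as np

rng = np.random.default_rng(1)
G = rng.uniform(1e-6, 1e-4, size=(64, 10))   # conductances (S), 64 inputs x 10 neurons
v = rng.uniform(0.0, 0.2, size=64)           # input voltages (V)

i_col = v @ G                                # column currents (A) = analog MAC
print("per-neuron currents (uA):", np.round(i_col * 1e6, 2))
```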
Geospace simulations using modern accelerator processor technology
NASA Astrophysics Data System (ADS)
Germaschewski, K.; Raeder, J.; Larson, D. J.
2009-12-01
OpenGGCM (Open Geospace General Circulation Model) is a well-established numerical code simulating the Earth's space environment. The most compute-intensive part is the MHD (magnetohydrodynamics) solver that models the plasma surrounding Earth and its interaction with Earth's magnetic field and the solar wind flowing in from the sun. Like other global magnetosphere codes, OpenGGCM's realism is currently limited by computational constraints on grid resolution. OpenGGCM has been ported to make use of the added computational power of modern accelerator-based processor architectures, in particular the Cell processor. The Cell architecture is a novel inhomogeneous multicore architecture capable of achieving up to 230 GFLOPS on a single chip. The University of New Hampshire recently acquired a PowerXCell 8i-based computing cluster, and here we will report initial performance results of OpenGGCM. Realizing the high theoretical performance of the Cell processor is a programming challenge, though. We implemented the MHD solver using a multi-level parallelization approach: On the coarsest level, the problem is distributed to processors based upon the usual domain decomposition approach. Then, on each processor, the problem is divided into 3D columns, each of which is handled by the memory-limited SPEs (synergistic processing elements) slice by slice. Finally, SIMD instructions are used to fully exploit the SIMD FPUs in each SPE. Memory management needs to be handled explicitly by the code, using DMA to move data from main memory to the per-SPE local store and vice versa. We use a modern technique, automatic code generation, which shields the application programmer from having to deal with all of the implementation details just described, keeping the code much more easily maintainable. Our preliminary results indicate excellent performance, a speed-up of a factor of 30 compared to the unoptimized version.
NASA Technical Reports Server (NTRS)
1985-01-01
The second task in the Space Station Data System (SSDS) Analysis/Architecture Study is the development of an information base that will support the conduct of trade studies and provide sufficient data to make key design/programmatic decisions. This volume identifies the preferred options in the technology category and characterizes these options with respect to performance attributes, constraints, cost, and risk. The technology category includes advanced materials, processes, and techniques that can be used to enhance the implementation of SSDS design structures. The specific areas discussed are mass storage, including space and ground on-line storage and off-line storage; man/machine interface; data processing hardware, including flight computers and advanced/fault tolerant computer architectures; and software, including data compression algorithms, on-board high level languages, and software tools. Also discussed are artificial intelligence applications and hard-wired communications.
Efficient parallel architecture for highly coupled real-time linear system applications
NASA Technical Reports Server (NTRS)
Carroll, Chester C.; Homaifar, Abdollah; Barua, Soumavo
1988-01-01
A systematic procedure is developed for exploiting the parallel constructs of computation in a highly coupled, linear system application. An overall top-down design approach is adopted. Differential equations governing the application under consideration are partitioned into subtasks on the basis of a data flow analysis. The interconnected task units constitute a task graph which has to be computed in every update interval. Multiprocessing concepts utilizing parallel integration algorithms are then applied for efficient task graph execution. A simple scheduling routine is developed to handle task allocation while in the multiprocessor mode. Results of simulation and scheduling are compared on the basis of standard performance indices. Processor timing diagrams are developed on the basis of program output accruing to an optimal set of processors. Basic architectural attributes for implementing the system are discussed together with suggestions for processing element design. Emphasis is placed on flexible architectures capable of accommodating widely varying application specifics.
Diamond Eye: a distributed architecture for image data mining
NASA Astrophysics Data System (ADS)
Burl, Michael C.; Fowlkes, Charless; Roden, Joe; Stechert, Andre; Mukhtar, Saleem
1999-02-01
Diamond Eye is a distributed software architecture, which enables users (scientists) to analyze large image collections by interacting with one or more custom data mining servers via a Java applet interface. Each server is coupled with an object-oriented database and a computational engine, such as a network of high-performance workstations. The database provides persistent storage and supports querying of the 'mined' information. The computational engine provides parallel execution of expensive image processing, object recognition, and query-by-content operations. Key benefits of the Diamond Eye architecture are: (1) the design promotes trial evaluation of advanced data mining and machine learning techniques by potential new users (all that is required is to point a web browser to the appropriate URL), (2) software infrastructure that is common across a range of science mining applications is factored out and reused, and (3) the system facilitates closer collaborations between algorithm developers and domain experts.
1981-06-01
This ICAM report addresses the design of manufacturing systems, validation and verification of ICAM modules, integration of ICAM modules, and the orderly transition of ICAM modules into production use. Its volumes include a composite function model of "Manufacture Product" (MFG0), a composite function model of "Design Product" (DESIGN0), a composite information model, user interface requirements, and the Architecture of Design. The work was performed during the period of 29 September 1978 through 10...
2006-08-01
This report describes how changes in architectural parameters in the Adaptive Control of Thought - Rational (ACT-R) can be used to understand... Cognitive architectures like the Adaptive Control of Thought - Rational (ACT-R) and Soar provide an alternative to... The approach has been used by Belavkin (2001) to simulate the role of emotion in decision making, and by Ritter and colleagues (2004) to simulate pre-task appraisal and anxiety.
2010-07-22
...dependent, providing a natural bandwidth match between compute cores and the memory subsystem. High Bandwidth Density: waveguides crossing the chip... We simulate this memory access architecture on a 256-core chip with a concentrated 64-node network using detailed traces of high-performance embedded... memory modules, we place memory access points (MAPs) around the periphery of the chip connected to the network. These MAPs, shown in Figure 4, contain
Optimal cube-connected cube multiprocessors
NASA Technical Reports Server (NTRS)
Sun, Xian-He; Wu, Jie
1993-01-01
Many CFD (computational fluid dynamics) and other scientific applications can be partitioned into subproblems. However, in general the partitioned subproblems are very large. They demand high performance computing power themselves, and the solutions of the subproblems have to be combined at each time step. The cube-connected cube (CCCube) architecture is studied. The CCCube architecture is an extended hypercube structure with each node represented as a cube. It requires fewer physical links between nodes than the hypercube, and provides the same communication support as the hypercube does on many applications. The reduced physical links can be used to enhance the bandwidth of the remaining links and, therefore, enhance the overall performance. The concept and the method to obtain optimal CCCubes, which are the CCCubes with a minimum number of links under a given total number of nodes, are proposed. The superiority of optimal CCCubes over standard hypercubes was also shown in terms of the link usage in the embedding of a binomial tree. A useful computation structure based on a semi-binomial tree for divide-and-conquer-type parallel algorithms was identified. It was shown that this structure can be implemented in optimal CCCubes without performance degradation compared with regular hypercubes. The results presented should provide a useful approach to the design of scientific parallel computers.
NASA Astrophysics Data System (ADS)
Yang, Chen; Liu, LeiBo; Yin, ShouYi; Wei, ShaoJun
2014-12-01
The computational capability of a coarse-grained reconfigurable array (CGRA) can be significantly restrained due to data and context memory bandwidth bottlenecks. Traditionally, two methods have been used to resolve this problem. One method loads the context into the CGRA at run time. This method occupies very small on-chip memory but induces very large latency, which leads to low computational efficiency. The other method adopts a multi-context structure. This method loads the context into the on-chip context memory at the boot phase. Broadcasting the pointer of a set of contexts changes the hardware configuration on a cycle-by-cycle basis. The size of the context memory induces a large area overhead in multi-context structures, which results in major restrictions on application complexity. This paper proposes a Predictable Context Cache (PCC) architecture to address the above context issues by buffering the context inside a CGRA. In this architecture, context is dynamically transferred into the CGRA. Utilizing a PCC significantly reduces the on-chip context memory and the complexity of the applications running on the CGRA is no longer restricted by the size of the on-chip context memory. Data preloading is the most frequently used approach to hide input data latency and speed up the data transmission process for the data bandwidth issue. Rather than fundamentally reducing the amount of input data, the transferred data and computations are processed in parallel. However, the data preloading method cannot work efficiently because data transmission becomes the critical path as the reconfigurable array scale increases. This paper also presents a Hierarchical Data Memory (HDM) architecture as a solution to the efficiency problem. In this architecture, high internal bandwidth is provided to buffer both reused input data and intermediate data. The HDM architecture relieves the external memory from the data transfer burden so that the performance is significantly improved. As a result of using PCC and HDM, experiments running mainstream video decoding programs achieved performance improvements of 13.57%-19.48% when there was a reasonable memory size. Therefore, 1080p@35.7fps for H.264 high profile video decoding can be achieved on PCC and HDM architecture when utilizing a 200 MHz working frequency. Further, the size of the on-chip context memory no longer restricted complex applications, which were efficiently executed on the PCC and HDM architecture.
A synchronized computational architecture for generalized bilateral control of robot arms
NASA Technical Reports Server (NTRS)
Bejczy, Antal K.; Szakaly, Zoltan
1987-01-01
This paper describes a computational architecture for an interconnected high-speed distributed computing system for generalized bilateral control of robot arms. The key method of the architecture is the use of fully synchronized, interrupt-driven software. Since an objective of the development is to utilize the processing resources efficiently, the synchronization is done at the hardware level to reduce system software overhead. The architecture also achieves a balanced load on the communication channel. The paper also describes some architectural relations to trading or sharing manual and automatic control.
Stability and performance tradeoffs in bi-lateral telemanipulation
NASA Technical Reports Server (NTRS)
Hannaford, Blake
1989-01-01
Kinesthetic force feedback provides measurable increase in remote manipulation system performance. Intensive computation time requirements or operation under conditions of time delay can cause serious stability problems in control-system design. Here, a simplified linear analysis of this stability problem is presented for the forward-flow generalized architecture, applying the hybrid two-port representation to express the loop gain of the traditional master-slave architecture, which can be subjected to similar analysis. The hybrid two-port representation is also used to express the effects on the fidelity of manipulation or feel of one design approach used to stabilize the forward-flow architecture. The results suggest that, when local force feedback at the slave side is used to reduce manipulator stability problems, a price is paid in terms of telemanipulation fidelity.
HRST architecture modeling and assessments
NASA Astrophysics Data System (ADS)
Comstock, Douglas A.
1997-01-01
This paper presents work supporting the assessment of advanced concept options for the Highly Reusable Space Transportation (HRST) study. It describes the development of computer models as the basis for creating an integrated capability to evaluate the economic feasibility and sustainability of a variety of system architectures. It summarizes modeling capabilities for use on the HRST study to perform sensitivity analysis of alternative architectures (consisting of different combinations of highly reusable vehicles, launch assist systems, and alternative operations and support concepts) in terms of cost, schedule, performance, and demand. In addition, the identification and preliminary assessment of alternative market segments for HRST applications, such as space manufacturing, space tourism, etc., is described. Finally, the development of an initial prototype model that can begin to be used for modeling alternative HRST concepts at the system level is presented.
NASA Astrophysics Data System (ADS)
Clay, M. P.; Buaria, D.; Yeung, P. K.; Gotoh, T.
2018-07-01
This paper reports on the successful implementation of a massively parallel GPU-accelerated algorithm for the direct numerical simulation of turbulent mixing at high Schmidt number. The work stems from a recent development (Comput. Phys. Commun., vol. 219, 2017, 313-328), in which a low-communication algorithm was shown to attain high degrees of scalability on the Cray XE6 architecture when overlapping communication and computation via dedicated communication threads. An even higher level of performance has now been achieved using OpenMP 4.5 on the Cray XK7 architecture, where on each node the 16 integer cores of an AMD Interlagos processor share a single Nvidia K20X GPU accelerator. In the new algorithm, data movements are minimized by performing virtually all of the intensive scalar field computations in the form of combined compact finite difference (CCD) operations on the GPUs. A memory layout in departure from usual practices is found to provide much better performance for a specific kernel required to apply the CCD scheme. Asynchronous execution enabled by adding the OpenMP 4.5 NOWAIT clause to TARGET constructs improves scalability when used to overlap computation on the GPUs with computation and communication on the CPUs. On the 27-petaflops supercomputer Titan at Oak Ridge National Laboratory, USA, a GPU-to-CPU speedup factor of approximately 5 is consistently observed at the largest problem size of 8192^3 grid points for the scalar field computed with 8192 XK7 nodes.
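The combined compact finite difference (CCD) kernels at the heart of this computation belong to the Padé family of schemes, in which a banded system couples the unknown derivatives. The Python sketch below shows the classical fourth-order compact first derivative on a periodic grid (a dense solve keeps the demo short); it illustrates the family of stencils, not the authors' GPU kernel.

```python
# Fourth-order compact (Pade) first derivative on a periodic grid:
# (1/4) f'_{i-1} + f'_i + (1/4) f'_{i+1} = (3/4) (f_{i+1} - f_{i-1}) / h
import numpy as np

n = 64
h = 2 * np.pi / n
x = np.arange(n) * h
f = np.sin(x)

A = np.eye(n)                       # coupling matrix for the unknown derivatives
rhs = np.zeros(n)
for i in range(n):
    A[i, (i - 1) % n] = A[i, (i + 1) % n] = 0.25
    rhs[i] = 0.75 * (f[(i + 1) % n] - f[(i - 1) % n]) / h

df = np.linalg.solve(A, rhs)
print("max error vs cos(x):", np.abs(df - np.cos(x)).max())
```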
Exploring quantum computing application to satellite data assimilation
NASA Astrophysics Data System (ADS)
Cheung, S.; Zhang, S. Q.
2015-12-01
This is an exploratory work on the potential application of quantum computing to a scientific data optimization problem. On classical computational platforms, the physical domain of a satellite data assimilation problem is represented by a discrete variable transform, and classical minimization algorithms are employed to find the optimal solution of the analysis cost function. The computation becomes intensive and time-consuming when the problem involves a large number of variables and a large amount of data. The new quantum computer opens a very different approach, both in conceptual programming and in hardware architecture, to solving optimization problems. In order to explore whether we can utilize the quantum computing machine architecture, we formulate a satellite data assimilation experimental case as a quadratic programming optimization problem. We find a transformation of the problem that maps it into the Quadratic Unconstrained Binary Optimization (QUBO) framework. A Binary Wavelet Transform (BWT) will be applied to the data assimilation variables for its invertible decomposition, and all calculations in the BWT are performed by Boolean operations. The transformed problem will then be solved as QUBO instances defined on the Chimera graphs of the quantum computer.
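The mapping into QUBO form can be illustrated with a toy example (a made-up objective, not the authors' assimilation cost function): a least-squares objective over a binary-encoded variable expands directly into the matrix of the "minimize x^T Q x" form expected by a quantum annealer.

```python
# Toy illustration of casting a least-squares objective over binary-encoded
# variables into QUBO form: minimize x^T Q x over x in {0,1}^n.
import itertools
import numpy as np

c = np.array([4.0, 2.0, 1.0])       # binary weights encoding one variable w = c . x
t = 5.0                             # target value; objective is (w - t)^2

# Expanding (c.x - t)^2 and using x_i^2 = x_i gives the QUBO matrix below
# (the constant t^2 is dropped since it does not affect the argmin).
Q = np.outer(c, c) + np.diag(-2.0 * t * c)

best = min(itertools.product([0, 1], repeat=len(c)),
           key=lambda x: np.array(x) @ Q @ np.array(x))
print("best bits:", best, "encoded value:", c @ np.array(best))   # -> 5.0
```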
NASA Astrophysics Data System (ADS)
Erez, Mattan; Dally, William J.
Stream processors, like other multicore architectures, partition their functional units and storage into multiple processing elements. In contrast to typical architectures, which contain symmetric general-purpose cores and a cache hierarchy, stream processors have a significantly leaner design. Stream processors are specifically designed for the stream execution model, in which applications have large amounts of explicit parallel computation, structured and predictable control, and memory accesses that can be performed at a coarse granularity. Applications in the streaming model are expressed in a gather-compute-scatter form, yielding programs with explicit control over transferring data to and from on-chip memory. Relying on these characteristics, which are common to many media processing and scientific computing applications, stream architectures redefine the boundary between software and hardware responsibilities, with software bearing much of the complexity required to manage concurrency, locality, and latency tolerance. Thus, stream processors have minimal control consisting of fetching medium- and coarse-grained instructions and executing them directly on the many ALUs. Moreover, the on-chip storage hierarchy of stream processors is under explicit software control, as is all communication, eliminating the need for complex reactive hardware mechanisms.
Comparing the OpenMP, MPI, and Hybrid Programming Paradigm on an SMP Cluster
NASA Technical Reports Server (NTRS)
Jost, Gabriele; Jin, Haoqiang; anMey, Dieter; Hatay, Ferhat F.
2003-01-01
With the advent of parallel hardware and software technologies users are faced with the challenge to choose a programming paradigm best suited for the underlying computer architecture. With the current trend in parallel computer architectures towards clusters of shared memory symmetric multi-processors (SMP), parallel programming techniques have evolved to support parallelism beyond a single level. Which programming paradigm is the best will depend on the nature of the given problem, the hardware architecture, and the available software. In this study we will compare different programming paradigms for the parallelization of a selected benchmark application on a cluster of SMP nodes. We compare the timings of different implementations of the same CFD benchmark application employing the same numerical algorithm on a cluster of Sun Fire SMP nodes. The rest of the paper is structured as follows: In section 2 we briefly discuss the programming models under consideration. We describe our compute platform in section 3. The different implementations of our benchmark code are described in section 4 and the performance results are presented in section 5. We conclude our study in section 6.
Parallel Computing Strategies for Irregular Algorithms
NASA Technical Reports Server (NTRS)
Biswas, Rupak; Oliker, Leonid; Shan, Hongzhang; Biegel, Bryan (Technical Monitor)
2002-01-01
Parallel computing promises several orders of magnitude increase in our ability to solve realistic computationally-intensive problems, but relies on their efficient mapping and execution on large-scale multiprocessor architectures. Unfortunately, many important applications are irregular and dynamic in nature, making their effective parallel implementation a daunting task. Moreover, with the proliferation of parallel architectures and programming paradigms, the typical scientist is faced with a plethora of questions that must be answered in order to obtain an acceptable parallel implementation of the solution algorithm. In this paper, we consider three representative irregular applications: unstructured remeshing, sparse matrix computations, and N-body problems, and parallelize them using various popular programming paradigms on a wide spectrum of computer platforms ranging from state-of-the-art supercomputers to PC clusters. We present the underlying problems, the solution algorithms, and the parallel implementation strategies. Smart load-balancing, partitioning, and ordering techniques are used to enhance parallel performance. Overall results demonstrate the complexity of efficiently parallelizing irregular algorithms.
Performance/price estimates for cortex-scale hardware: a design space exploration.
Zaveri, Mazad S; Hammerstrom, Dan
2011-04-01
In this paper, we revisit the concept of virtualization. Virtualization is useful for understanding and investigating the performance/price and other trade-offs related to the hardware design space. Moreover, it is perhaps the most important aspect of a hardware design space exploration. Such a design space exploration is a necessary part of the study of hardware architectures for large-scale computational models for intelligent computing, including AI, Bayesian, bio-inspired and neural models. A methodical exploration is needed to identify potentially interesting regions in the design space, and to assess the relative performance/price points of these implementations. As an example, in this paper we investigate the performance/price of (digital and mixed-signal) CMOS and hypothetical CMOL (nanogrid) technology based hardware implementations of human cortex-scale spiking neural systems. Through this analysis, and the resulting performance/price points, we demonstrate, in general, the importance of virtualization, and of doing these kinds of design space explorations. The specific results suggest that hybrid nanotechnology such as CMOL is a promising candidate to implement very large-scale spiking neural systems, providing a more efficient utilization of the density and storage benefits of emerging nano-scale technologies. In general, we believe that the study of such hypothetical designs/architectures will guide the neuromorphic hardware community towards building large-scale systems, and help guide research trends in intelligent computing, and computer engineering.
Cognitive Architectures and Human-Computer Interaction. Introduction to Special Issue.
ERIC Educational Resources Information Center
Gray, Wayne D.; Young, Richard M.; Kirschenbaum, Susan S.
1997-01-01
In this introduction to a special issue on cognitive architectures and human-computer interaction (HCI), editors and contributors provide a brief overview of cognitive architectures. The following four architectures represented by articles in this issue are: Soar; LICAI (linked model of comprehension-based action planning and instruction taking);…
Hybrid techniques for the digital control of mechanical and optical systems
NASA Astrophysics Data System (ADS)
Acernese, Fausto; Barone, Fabrizio; De Rosa, Rosario; Eleuteri, Antonio; Milano, Leopoldo; Pardi, Silvio; Ricciardi, Iolanda; Russo, Guido
2004-07-01
One of the main requirements of a digital system for the control of interferometric detectors of gravitational waves is computing power, a direct consequence of the increasing complexity of the digital algorithms necessary for control signal generation. For this specific task many specialised, non-standard real-time architectures have been developed, often very expensive and difficult to upgrade. On the other hand, such computing power is generally fully available for off-line applications on standard PC-based systems. A possible and obvious solution may therefore be to integrate the real-time and off-line architectures into a hybrid control system architecture based on standard, available components, combining the precise data synchronization provided by real-time systems with the large computing power available on PC-based systems. Such integration may be achieved by linking the two architectures through the standard Ethernet network, whose data transfer speed has been increasing rapidly in recent years, using the TCP/IP and UDP protocols. In this paper we describe the architecture of a hybrid Ethernet-based real-time control system prototype we implemented in Napoli, discussing its characteristics and performance. Finally we discuss a possible application to the real-time control of a suspended mass of the mode cleaner of the 3m prototype optical interferometer for gravitational wave detection (IDGW-3P) operational in Napoli.
Biomimetic design processes in architecture: morphogenetic and evolutionary computational design.
Menges, Achim
2012-03-01
Design computation has profound impact on architectural design methods. This paper explains how computational design enables the development of biomimetic design processes specific to architecture, and how they need to be significantly different from established biomimetic processes in engineering disciplines. The paper first explains the fundamental difference between computer-aided and computational design in architecture, as the understanding of this distinction is of critical importance for the research presented. Thereafter, the conceptual relation and possible transfer of principles from natural morphogenesis to design computation are introduced and the related developments of generative, feature-based, constraint-based, process-based and feedback-based computational design methods are presented. This morphogenetic design research is then related to exploratory evolutionary computation, followed by the presentation of two case studies focusing on the exemplary development of spatial envelope morphologies and urban block morphologies.
Rapid indirect trajectory optimization on highly parallel computing architectures
NASA Astrophysics Data System (ADS)
Antony, Thomas
Trajectory optimization is a field which can benefit greatly from the advantages offered by parallel computing. The current state-of-the-art in trajectory optimization focuses on the use of direct optimization methods, such as the pseudo-spectral method. These methods are favored due to their ease of implementation and large convergence regions, while indirect methods have largely been ignored in the literature in the past decade except for specific applications in astrodynamics. It has been shown that the shortcomings conventionally associated with indirect methods can be overcome by the use of a continuation method in which complex trajectory solutions are obtained by solving a sequence of progressively more difficult optimization problems. High performance computing hardware is trending towards more parallel architectures as opposed to powerful single-core processors. Graphics Processing Units (GPU), which were originally developed for 3D graphics rendering, have gained popularity in the past decade as high-performance, programmable parallel processors. The Compute Unified Device Architecture (CUDA) framework, a parallel computing architecture and programming model developed by NVIDIA, is one of the most widely used platforms in GPU computing. GPUs have been applied to a wide range of fields that require the solution of complex, computationally demanding problems. A GPU-accelerated indirect trajectory optimization methodology which uses the multiple shooting method and continuation is developed using the CUDA platform. The various algorithmic optimizations used to exploit the parallelism inherent in the indirect shooting method are described. The resulting rapid optimal control framework enables the construction of high quality optimal trajectories that satisfy problem-specific constraints and fully satisfy the necessary conditions of optimality. The benefits of the framework are highlighted by construction of maximum terminal velocity trajectories for a hypothetical long range weapon system. The techniques used to construct an initial guess from an analytic near-ballistic trajectory and the methods used to formulate the necessary conditions of optimality in a manner that is transparent to the designer are discussed. Various hypothetical mission scenarios that enforce different combinations of initial, terminal, interior point and path constraints demonstrate the rapid construction of complex trajectories without requiring any a priori insight into the structure of the solutions. Trajectory problems of this kind were previously considered impractical to solve using indirect methods. The performance of the GPU-accelerated solver is found to be 2x-4x faster than MATLAB's bvp4c, even while running on GPU hardware that is five years behind the state-of-the-art.
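The continuation strategy this work relies on can be sketched on a small boundary value problem: solve an easy instance first, then march a parameter toward the hard instance while warm-starting the unknown initial slope (in the indirect method, the costates). The Python sketch below uses single shooting on a Troesch-like problem as a stand-in; it illustrates the idea only, not the GPU multiple-shooting solver.

```python
# Single shooting with parameter continuation on y'' = lam*sinh(y),
# y(0) = 0, y(1) = 1: start from the trivial lam = 0 case and harden it.
import numpy as np

def integrate(s, lam, n=200):
    """RK4 for y'' = lam*sinh(y), y(0)=0, y'(0)=s; returns y(1)."""
    h, y = 1.0 / n, np.array([0.0, s])
    f = lambda y: np.array([y[1], lam * np.sinh(y[0])])
    for _ in range(n):
        k1 = f(y); k2 = f(y + 0.5*h*k1); k3 = f(y + 0.5*h*k2); k4 = f(y + h*k3)
        y = y + (h / 6.0) * (k1 + 2*k2 + 2*k3 + k4)
    return y[0]

def shoot(lam, s0):
    """Secant iteration on the shooting residual y(1) - 1."""
    s1 = s0 + 1e-3
    r0, r1 = integrate(s0, lam) - 1.0, integrate(s1, lam) - 1.0
    for _ in range(50):
        s2 = s1 - r1 * (s1 - s0) / (r1 - r0)
        s0, r0, s1 = s1, r1, s2
        r1 = integrate(s1, lam) - 1.0
        if abs(r1) < 1e-10:
            break
    return s1

s = 1.0                                    # exact slope for the trivial lam = 0 case
for lam in np.linspace(0.5, 5.0, 10):      # continuation: gradually harden the problem
    s = shoot(lam, s)                      # warm-start from the previous solution
print("converged y'(0) at lam = 5:", s)
```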
The RTE inversion on FPGA aboard the solar orbiter PHI instrument
NASA Astrophysics Data System (ADS)
Cobos Carrascosa, J. P.; Aparicio del Moral, B.; Ramos Mas, J. L.; Balaguer, M.; López Jiménez, A. C.; del Toro Iniesta, J. C.
2016-07-01
In this work we propose a multiprocessor architecture to reach high performance in floating point operations by using radiation-tolerant FPGA devices under narrow time and power constraints. This architecture is used in the PHI instrument that carries out the scientific analysis aboard ESA's Solar Orbiter mission. The proposed architecture, in a SIMD flavor, is intended to serve as an accelerator within the Data Processing Unit (composed of a main Leon processor and two FPGAs) for carrying out the RTE inversion on board the spacecraft using a relatively slow FPGA device, the Xilinx XQR4VSX55. The proposed architecture squeezes the FPGA resources in order to meet the computational requirements and improves on the performance of ground-based systems built on commercial CPUs in terms of time and power consumption. In this work we demonstrate the feasibility of using these FPGA devices embedded in the SO/PHI instrument. With that goal in mind, we perform tests to evaluate the scientific results and to measure the processing time and power consumption for carrying out the RTE inversion.
FPGA cluster for high-performance AO real-time control system
NASA Astrophysics Data System (ADS)
Geng, Deli; Goodsell, Stephen J.; Basden, Alastair G.; Dipper, Nigel A.; Myers, Richard M.; Saunter, Chris D.
2006-06-01
Whilst the high throughput and low latency requirements for next-generation AO real-time control systems have posed a significant challenge to von Neumann architecture processor systems, the Field Programmable Gate Array (FPGA) has emerged as a long-term solution with high throughput performance and excellent latency predictability. Moreover, FPGA devices have highly capable programmable interfacing, which leads to a more highly integrated system. Nevertheless, a single FPGA is still not enough: multiple FPGA devices need to be clustered to perform the required subaperture processing and the reconstruction computation. In an AO real-time control system, the memory bandwidth is often the bottleneck of the system, simply because a vast amount of supporting data, e.g. pixel calibration maps and the reconstruction matrix, needs to be accessed within a short period. The cluster, as a general computing architecture, has excellent scalability in processing throughput, memory bandwidth, memory capacity, and communication bandwidth. Problems such as task distribution, node communication, and system verification are discussed.
By Hand or Not By-Hand: A Case Study of Alternative Approaches to Parallelize CFD Applications
NASA Technical Reports Server (NTRS)
Yan, Jerry C.; Bailey, David (Technical Monitor)
1997-01-01
While parallel processing promises to speed up applications by several orders of magnitude, the performance achieved still depends upon several factors, including the multiprocessor architecture, system software, data distribution and alignment, as well as the methods used for partitioning the application and mapping its components onto the architecture. The existence of the Gordon Bell Prize given out at Supercomputing every year suggests that while good performance can be attained for real applications on general purpose multiprocessors, the large investment in man-power and time still has to be repeated for each application-machine combination. As applications and machine architectures become more complex, the cost and time-delays for obtaining performance by hand will become prohibitive. Computer users today can turn to three possible avenues for help: parallel libraries, parallel languages and compilers, and interactive parallelization tools. The success of these methodologies, in turn, depends on proper application of data dependency analysis, program structure recognition and transformation, performance prediction, as well as exploitation of user-supplied knowledge. NASA has been developing multidisciplinary applications on highly parallel architectures under the High Performance Computing and Communications Program. Over the past six years, transitions in the underlying hardware and system software have forced the scientists to spend a large effort migrating and recoding their applications. Various attempts to exploit software tools to automate the parallelization process have not produced favorable results. In this paper, we report our most recent experience with CAPTOOL, a package developed at Greenwich University. We have chosen CAPTOOL for three reasons: 1. CAPTOOL accepts a FORTRAN 77 program as input, which suggests its potential applicability to a large collection of legacy codes currently in use. 2. CAPTOOL employs domain decomposition to obtain parallelism; although the fact that not all kinds of parallelism are handled may seem unappealing, many NASA applications in computational aerosciences as well as earth and space sciences are amenable to domain decomposition. 3. CAPTOOL generates code for a large variety of environments employed across NASA centers, from MPI/PVM on networks of workstations to the IBM SP2 and Cray T3D.
Hardware implementation of CMAC neural network with reduced storage requirement.
Ker, J S; Kuo, Y H; Wen, R C; Liu, B D
1997-01-01
The cerebellar model articulation controller (CMAC) neural network has the advantages of fast convergence speed and low computation complexity. However, it suffers from a low storage space utilization rate on weight memory. In this paper, we propose a direct weight address mapping approach, which can reduce the required weight memory size with a utilization rate near 100%. Based on such an address mapping approach, we developed a pipeline architecture to efficiently perform the addressing operations. The proposed direct weight address mapping approach also speeds up the computation for the generation of weight addresses. Besides, a CMAC hardware prototype used for color calibration has been implemented to confirm the proposed approach and architecture.
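For context, a generic CMAC lookup works roughly as sketched below: an input is quantized, each of C staggered tilings activates one weight address, and the output is the sum of the activated weights. The address formula here is a common textbook choice, not the paper's direct weight address mapping, and the sizes are made-up parameters.

```python
# Generic CMAC addressing sketch: a scalar input activates one weight per tiling
# layer, and the network output is the sum of the activated weights.
import numpy as np

C, R = 8, 256                          # number of tilings and input quantization levels
CELLS = R // C + 1                     # cells needed per tiling
rng = np.random.default_rng(0)
weights = rng.normal(0, 0.1, size=(C, CELLS))

def active_addresses(x, x_min=0.0, x_max=1.0):
    """Map input x to one (layer, cell) address per staggered tiling."""
    q = int((x - x_min) / (x_max - x_min) * (R - 1))   # quantized input
    return [(c, (q + c) // C) for c in range(C)]

def cmac_output(x):
    return sum(weights[c, a] for c, a in active_addresses(x))

print(cmac_output(0.37))
```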
A Programmable Five Qubit Quantum Computer Using Trapped Atomic Ions
NASA Astrophysics Data System (ADS)
Debnath, Shantanu
Quantum computers can solve certain problems more efficiently than conventional classical methods. In the endeavor to build a quantum computer, several competing platforms have emerged that can implement certain quantum algorithms using a few qubits. However, the demonstrations so far have usually been done by tailoring the hardware to meet the requirements of a particular algorithm implemented for a limited number of instances. Although such proof-of-principle implementations are important to verify the working of algorithms on a physical system, they further need to have the potential to serve as a general-purpose quantum computer allowing the flexibility required for running multiple algorithms and be scaled up to host more qubits. Here we demonstrate a small programmable quantum computer based on five trapped atomic ions, each of which serves as a qubit. By optically resolving each ion we can individually address them in order to perform a complete set of single-qubit and fully connected two-qubit quantum gates and also perform efficient individual-qubit measurements. We implement a computation architecture that accepts an algorithm from a user interface in the form of a standard logic gate sequence and decomposes it into fundamental quantum operations that are native to the hardware using a set of compilation instructions that are defined within the software. These operations are then effected through a pattern of laser pulses that perform coherent rotations on targeted qubits in the chain. The architecture implemented in the experiment therefore gives us unprecedented flexibility in the programming of any quantum algorithm while staying blind to the underlying hardware. As a demonstration we implement the Deutsch-Jozsa and Bernstein-Vazirani algorithms on the five-qubit processor and achieve average success rates of 95 and 90 percent, respectively. We also implement a five-qubit coherent quantum Fourier transform and examine its performance in the period finding and phase estimation protocol. We find fidelities of 84 and 62 percent, respectively. While maintaining the same computation architecture the system can be scaled to more ions using resources that scale favorably (O(N^2)) with the number of qubits N.
The flight telerobotic servicer: From functional architecture to computer architecture
NASA Technical Reports Server (NTRS)
Lumia, Ronald; Fiala, John
1989-01-01
After a brief tutorial on the NASA/National Bureau of Standards Standard Reference Model for Telerobot Control System Architecture (NASREM) functional architecture, the approach to its implementation is shown. First, interfaces must be defined which are capable of supporting the known algorithms. This is illustrated by considering the interfaces required for the SERVO level of the NASREM functional architecture. After interface definition, the specific computer architecture for the implementation must be determined. This choice is obviously technology dependent. An example illustrating one possible mapping of the NASREM functional architecture to a particular set of computers which implements it is shown. The result of choosing the NASREM functional architecture is that it provides a technology independent paradigm which can be mapped into a technology dependent implementation capable of evolving with technology in the laboratory and in space.
Evaluation of fault-tolerant parallel-processor architectures over long space missions
NASA Technical Reports Server (NTRS)
Johnson, Sally C.
1989-01-01
The impact of a five-year space mission environment on fault-tolerant parallel processor architectures is examined. The target application is a Strategic Defense Initiative (SDI) satellite requiring 256 parallel processors to provide the computation throughput. The reliability requirements are that the system still be operational after five years with 0.99 probability and that the probability of system failure during one-half hour of full operation be less than 10⁻⁷. The fault tolerance features an architecture must possess to meet these reliability requirements are presented, many potential architectures are briefly evaluated, and one candidate architecture, the Charles Stark Draper Laboratory's Fault-Tolerant Parallel Processor (FTPP), is evaluated in detail. A methodology for designing a preliminary system configuration to meet the reliability and performance requirements of the mission is then presented and demonstrated by designing an FTPP configuration.
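The FTPP evaluation relies on detailed reliability models; as a rough, hedged illustration of the sizing question, the snippet below computes the probability that at least 256 of n processors survive a five-year mission under independent exponential failures, with an assumed (not paper-supplied) per-processor failure rate.

```python
# Back-of-envelope reliability sketch (not the paper's analysis): probability
# that at least 256 of n processors survive a 5-year mission, assuming
# independent exponential failures. The per-processor failure rate is an
# assumed placeholder, not a figure from the paper.
import math

LAMBDA = 1e-6          # assumed failures per processor-hour (placeholder)
HOURS = 5 * 365 * 24   # five-year mission
p = math.exp(-LAMBDA * HOURS)   # single-processor survival probability

def prob_at_least(k, n, p):
    """P(at least k of n independent units survive), binomial tail."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

for n in (256, 272, 288, 304):    # effect of adding spare processors
    print(n, "processors ->", round(prob_at_least(256, n, p), 4))
```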
Strategies for concurrent processing of complex algorithms in data driven architectures
NASA Technical Reports Server (NTRS)
Stoughton, John W.; Mielke, Roland R.
1988-01-01
Research directed at developing a graph theoretical model for describing data and control flow associated with the execution of large grained algorithms in a special distributed computer environment is presented. This model is identified by the acronym ATAMM which represents Algorithms To Architecture Mapping Model. The purpose of such a model is to provide a basis for establishing rules for relating an algorithm to its execution in a multiprocessor environment. Specifications derived from the model lead directly to the description of a data flow architecture which is a consequence of the inherent behavior of the data and control flow described by the model. The purpose of the ATAMM based architecture is to provide an analytical basis for performance evaluation. The ATAMM model and architecture specifications are demonstrated on a prototype system for concept validation.
Jiang, Guangli; Liu, Leibo; Zhu, Wenping; Yin, Shouyi; Wei, Shaojun
2015-09-04
This paper proposes a real-time feature extraction VLSI architecture for high-resolution images based on the accelerated KAZE algorithm. Firstly, a new system architecture is proposed. It increases the system throughput, provides flexibility in image resolution, and offers trade-offs between speed and scaling robustness. The architecture consists of a two-dimensional pipeline array that fully utilizes computational similarities in octaves. Secondly, a substructure (block-serial discrete-time cellular neural network) that can realize a nonlinear filter is proposed. This structure decreases the memory demand through the removal of data dependency. Thirdly, a hardware-friendly descriptor is introduced in order to overcome the hardware design bottleneck through the polar sample pattern; a simplified method to realize rotation invariance is also presented. Finally, the proposed architecture is designed in TSMC 65 nm CMOS technology. The experimental results show a performance of 127 fps in full HD resolution at 200 MHz frequency. The peak performance reaches 181 GOPS and the throughput is double the speed of other state-of-the-art architectures.
NASA Technical Reports Server (NTRS)
1983-01-01
Missions to be performed, station operations and functions to be carried out, and technologies anticipated during the time frame of the space station were examined in order to determine the scope of the overall information management system for the space station. This system comprises: (1) the data management system which includes onboard computer related hardware and software required to assume and exercise control of all activities performed on the station; (2) the communication system for both internal and external communications; and (3) the ground segment. Techniques used to examine the information system from a functional and performance point of view are described as well as the analyses performed to derive the architecture of both the onboard data management system and the system for internal and external communications. These architectures are then used to generate a conceptual design of the onboard elements in order to determine the physical parameters (size/weight/power) of the hardware and software. The ground segment elements are summarized.
NASA Astrophysics Data System (ADS)
Yan, Hui; Wang, K. G.; Jones, Jim E.
2016-06-01
A parallel algorithm for large-scale three-dimensional phase-field simulations of phase coarsening is developed and implemented on high-performance architectures. From the large-scale simulations, new coarsening kinetics are found in the ultrahigh-volume-fraction regime. The parallel implementation is capable of harnessing the greater computing power available from high-performance architectures. The parallelized code enables an increase in three-dimensional simulation system size up to a 512³ grid cube. Through the parallelized code, practical runtimes can be achieved for three-dimensional large-scale simulations, and the statistical significance of the results from these high-resolution parallel simulations is greatly improved over that obtainable from serial simulations. A detailed performance analysis of speed-up and scalability is presented, showing good scalability that improves with increasing problem size. In addition, a model for prediction of runtime is developed, which shows good agreement with actual run times from numerical tests.
Fast semivariogram computation using FPGA architectures
NASA Astrophysics Data System (ADS)
Lagadapati, Yamuna; Shirvaikar, Mukul; Dong, Xuanliang
2015-02-01
The semivariogram is a statistical measure of the spatial distribution of data and is based on Markov Random Fields (MRFs). Semivariogram analysis is a computationally intensive algorithm that has typically seen applications in the geosciences and remote sensing areas. Recently, applications in the area of medical imaging have been investigated, resulting in the need for efficient real-time implementation of the algorithm. The semivariogram is a plot of semivariances for different lag distances between pixels. A semivariance, γ(h), is defined as half of the expected squared difference of pixel values between any two data locations separated by a lag distance h. Due to the need to examine each pair of pixels in the image or sub-image being processed, the base algorithm complexity for an image window with n pixels is O(n²). Field Programmable Gate Arrays (FPGAs) are an attractive solution for such demanding applications due to their parallel processing capability. FPGAs tend to operate at relatively modest clock rates, measured in a few hundreds of megahertz, but they can perform tens of thousands of calculations per clock cycle while operating at low power. This paper presents a technique for the fast computation of the semivariogram using two custom FPGA architectures. The design consists of several modules dedicated to the constituent computational tasks. A modular architecture approach is chosen to allow for replication of processing units. This allows for high throughput due to concurrent processing of pixel pairs. The current implementation is focused on isotropic semivariogram computations only. An anisotropic semivariogram implementation is anticipated to be an extension of the current architecture, based on refinements to the current modules. The algorithm is benchmarked in VHDL on a Xilinx XUPV5-LX110T development kit, which utilizes the Virtex-5 FPGA. Medical image data from MRI scans are used for the experiments. Computational speedup is measured with respect to a MATLAB implementation on a personal computer with an Intel i7 multi-core processor. Preliminary simulation results indicate that a significant advantage in speed can be attained by the architectures, making the algorithm viable for implementation in medical devices.
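As a point of reference for the hardware design, the following software sketch computes the isotropic semivariogram in its O(n²) pairwise form; the lag binning and window size are illustrative choices, not taken from the paper.

```python
# Reference (software) isotropic semivariogram, the O(n^2) pairwise form the
# FPGA architectures accelerate. Lag binning and window size are illustrative.
import numpy as np

def semivariogram(img, max_lag):
    """gamma(h) = 0.5 * E[(z(s) - z(s+h))^2] over all pixel pairs at lag h."""
    rows, cols = img.shape
    ys, xs = np.mgrid[0:rows, 0:cols]
    coords = np.column_stack([ys.ravel(), xs.ravel()])
    values = img.ravel().astype(float)

    sums = np.zeros(max_lag + 1)
    counts = np.zeros(max_lag + 1)
    for i in range(len(values)):                 # examine every pixel pair once
        d = coords[i + 1:] - coords[i]
        lags = np.rint(np.hypot(d[:, 0], d[:, 1])).astype(int)
        sq = (values[i + 1:] - values[i]) ** 2
        m = lags <= max_lag
        np.add.at(sums, lags[m], sq[m])
        np.add.at(counts, lags[m], 1)
    return 0.5 * sums / np.maximum(counts, 1)

window = np.random.default_rng(0).random((32, 32))   # stand-in for an MRI sub-image
print(semivariogram(window, max_lag=10)[:5])
```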
DOE Office of Scientific and Technical Information (OSTI.GOV)
Krueger, Jens; Micikevicius, Paulius; Williams, Samuel
Reverse Time Migration (RTM) is one of the main approaches in the seismic processing industry for imaging the subsurface structure of the Earth. While RTM provides qualitative advantages over its predecessors, it has a high computational cost warranting implementation on HPC architectures. We focus on three progressively more complex kernels extracted from RTM: for isotropic (ISO), vertical transverse isotropic (VTI) and tilted transverse isotropic (TTI) media. In this work, we examine performance optimization of forward wave modeling, which describes the computational kernels used in RTM, on emerging multi- and manycore processors and introduce a novel common subexpression elimination optimization for TTI kernels. We compare attained performance and energy efficiency in both the single-node and distributed memory environments in order to satisfy industry's demands for fidelity, performance, and energy efficiency. Moreover, we discuss the interplay between architecture (chip and system) and optimizations (both on-node computation) highlighting the importance of NUMA-aware approaches to MPI communication. Ultimately, our results show we can improve CPU energy efficiency by more than 10× on Magny Cours nodes while acceleration via multiple GPUs can surpass the energy-efficient Intel Sandy Bridge by as much as 3.6×.
High performance cellular level agent-based simulation with FLAME for the GPU.
Richmond, Paul; Walker, Dawn; Coakley, Simon; Romano, Daniela
2010-05-01
Driven by the availability of experimental data and ability to simulate a biological scale which is of immediate interest, the cellular scale is fast emerging as an ideal candidate for middle-out modelling. As with 'bottom-up' simulation approaches, cellular level simulations demand a high degree of computational power, which in large-scale simulations can only be achieved through parallel computing. The flexible large-scale agent modelling environment (FLAME) is a template driven framework for agent-based modelling (ABM) on parallel architectures ideally suited to the simulation of cellular systems. It is available for both high performance computing clusters (www.flame.ac.uk) and GPU hardware (www.flamegpu.com) and uses a formal specification technique that acts as a universal modelling format. This not only creates an abstraction from the underlying hardware architectures, but avoids the steep learning curve associated with programming them. In benchmarking tests and simulations of advanced cellular systems, FLAME GPU has reported massive improvement in performance over more traditional ABM frameworks. This allows the time spent in the development and testing stages of modelling to be drastically reduced and creates the possibility of real-time visualisation for simple visual face-validation.
An operating system for future aerospace vehicle computer systems
NASA Technical Reports Server (NTRS)
Foudriat, E. C.; Berman, W. J.; Will, R. W.; Bynum, W. L.
1984-01-01
The requirements for future aerospace vehicle computer operating systems are examined in this paper. The computer architecture is assumed to be distributed, with a local area network connecting the nodes. Each node is assumed to provide a specific functionality. The network provides for communication so that the overall tasks of the vehicle are accomplished. The O/S structure is based upon the concept of objects. The mechanisms for integrating node-unique objects with node-common objects, in order to implement both autonomy and cooperation between nodes, are developed. The requirements for time-critical performance, reliability, and recovery are discussed. Time-critical performance impacts all parts of the distributed operating system; e.g., its structure, the functional design of its objects, the language structure, etc. Throughout the paper the tradeoffs - concurrency, language structure, object recovery, binding, file structure, communication protocol, programmer freedom, etc. - are considered to arrive at a feasible, maximum-performance design. Reliability of the network system is considered. A parallel multipath bus structure is proposed for the control of delivery time for time-critical messages. The architecture also supports immediate recovery of the time-critical message system after a communication failure.
NASA Astrophysics Data System (ADS)
Mills, R. T.
2014-12-01
As the high performance computing (HPC) community pushes towards the exascale horizon, the importance and prevalence of fine-grained parallelism in new computer architectures is increasing. This is perhaps most apparent in the proliferation of so-called "accelerators" such as the Intel Xeon Phi or NVIDIA GPGPUs, but the trend also holds for CPUs, where serial performance has grown slowly and effective use of hardware threads and vector units are becoming increasingly important to realizing high performance. This has significant implications for weather, climate, and Earth system modeling codes, many of which display impressive scalability across MPI ranks but take relatively little advantage of threading and vector processing. In addition to increasing parallelism, next generation codes will also need to address increasingly deep hierarchies for data movement: NUMA/cache levels, on node vs. off node, local vs. wide neighborhoods on the interconnect, and even in the I/O system. We will discuss some approaches (grounded in experiences with the Intel Xeon Phi architecture) for restructuring Earth science codes to maximize concurrency across multiple levels (vectors, threads, MPI ranks), and also discuss some novel approaches for minimizing expensive data movement/communication.
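As a toy illustration of the restructuring being advocated, the snippet below contrasts a scalar per-element loop with an equivalent whole-array expression that maps naturally onto vector units; the kernel and its constants are made up for the example and are not taken from any Earth system model.

```python
# Toy illustration of the restructuring idea: expressing a pointwise update so
# it maps onto vector units (here via NumPy) instead of a scalar loop. The
# kernel is illustrative, not taken from any weather or climate code.
import numpy as np
import time

n = 200_000
t = np.random.default_rng(1).random(n)

def saturation_scalar(t):
    out = np.empty_like(t)
    for i in range(t.size):                       # one element per iteration
        out[i] = 0.61 * np.exp(17.3 * t[i] / (t[i] + 0.24))
    return out

def saturation_vector(t):
    return 0.61 * np.exp(17.3 * t / (t + 0.24))   # whole-array, vectorisable

start = time.perf_counter(); a = saturation_scalar(t); t_scalar = time.perf_counter() - start
start = time.perf_counter(); b = saturation_vector(t); t_vector = time.perf_counter() - start
assert np.allclose(a, b)
print(f"scalar {t_scalar:.3f}s vs vectorised {t_vector:.4f}s")
```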
Local storage federation through XRootD architecture for interactive distributed analysis
NASA Astrophysics Data System (ADS)
Colamaria, F.; Colella, D.; Donvito, G.; Elia, D.; Franco, A.; Luparello, G.; Maggi, G.; Miniello, G.; Vallero, S.; Vino, G.
2015-12-01
A cloud-based Virtual Analysis Facility (VAF) for the ALICE experiment at the LHC has been deployed in Bari. Similar facilities are currently running in other Italian sites with the aim to create a federation of interoperating farms able to provide their computing resources for interactive distributed analysis. The use of cloud technology, along with elastic provisioning of computing resources as an alternative to the grid for running data intensive analyses, is the main challenge of these facilities. One of the crucial aspects of the user-driven analysis execution is the data access. A local storage facility has the disadvantage that the stored data can be accessed only locally, i.e. from within the single VAF. To overcome such a limitation a federated infrastructure, which provides full access to all the data belonging to the federation independently from the site where they are stored, has been set up. The federation architecture exploits both cloud computing and XRootD technologies, in order to provide a dynamic, easy-to-use and well performing solution for data handling. It should allow the users to store the files and efficiently retrieve the data, since it implements a dynamic distributed cache among many datacenters in Italy connected to one another through the high-bandwidth national network. Details on the preliminary architecture implementation and performance studies are discussed.
NASA Astrophysics Data System (ADS)
Jia, Xin; Huang, Zhengxiang; Zu, Xudong; Gu, Xiaohui; Xiao, Qiangqiang
2013-12-01
In this study, an optimal finite element model of Kevlar woven fabric that is more computationally efficient than existing models was developed to simulate ballistic impact on fabric. The Kevlar woven fabric was modeled at the yarn-level architecture by using hybrid elements analysis (HEA), which uses solid elements to model the yarns at the impact region and shell elements to model the yarns away from the impact region. Three HEA configurations were constructed, in which the solid element region was set to about one, two, and three times the projectile's diameter, with impact velocities of 30 m/s (non-perforation case) and 200 m/s (perforation case), to determine the optimal ratio between the solid element region and the shell element region. To further reduce computational time while maintaining the necessary accuracy, three multiscale models were also presented. These multiscale models combine a local region with yarn-level architecture, using the HEA approach, and a global region with homogeneous-level architecture. The effect of varying ratios of the local and global areas on the ballistic performance of the fabric was discussed. The deformation and damage mechanisms of the fabric were analyzed and compared among the numerical models. Simulation results indicate that the multiscale model based on HEA accurately reproduces the baseline results and markedly decreases computational time.
NASA Astrophysics Data System (ADS)
Tolba, Khaled Ibrahim; Morgenthal, Guido
2018-01-01
This paper presents an analysis of the scalability and efficiency of a simulation framework based on the vortex particle method. The code is applied to the numerical aerodynamic analysis of line-like structures. The numerical code runs on multicore CPU and GPU architectures using the OpenCL framework. The focus of this paper is the analysis of the parallel efficiency and scalability of the method as applied to an engineering test case, specifically the aeroelastic response of a long-span bridge girder at the construction stage. The target is to assess the optimal configuration and the required computer architecture, such that it becomes feasible to efficiently utilise the method within the computational resources available to a regular engineering office. The simulations and the scalability analysis are performed on a regular gaming-type computer.
NASA Technical Reports Server (NTRS)
Benyo, Theresa L.
2002-01-01
Integration of a supersonic inlet simulation with a computer-aided design (CAD) system is demonstrated. The integration is performed using the Project Integration Architecture (PIA). PIA provides a common environment for wrapping many types of applications. Accessing geometry data from CAD files is accomplished by incorporating appropriate function calls from the Computational Analysis Programming Interface (CAPRI). CAPRI is a CAD-vendor-neutral programming interface that aids in acquiring geometry data directly from CAD files. The benefits of wrapping a supersonic inlet simulation into PIA using CAPRI are: direct access to geometry data, accurate capture of geometry data, automatic conversion of data units, CAD-vendor-neutral operation, and on-line interactive history capture. This paper describes the PIA and the CAPRI wrapper and details the supersonic inlet simulation demonstration.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Shipman, Galen M.
These are the slides for a presentation on programming models in HPC, at the Los Alamos National Laboratory's Parallel Computing Summer School. The following topics are covered: Flynn's Taxonomy of computer architectures; single instruction single data; single instruction multiple data; multiple instruction multiple data; address space organization; definition of Trinity (Intel Xeon-Phi is a MIMD architecture); single program multiple data; multiple program multiple data; ExMatEx workflow overview; definition of a programming model, programming languages, runtime systems; programming model and environments; MPI (Message Passing Interface); OpenMP; Kokkos (Performance Portable Thread-Parallel Programming Model); Kokkos abstractions, patterns, policies, and spaces; RAJA, a systematic approach to node-level portability and tuning; overview of the Legion Programming Model; mapping tasks and data to hardware resources; interoperability: supporting task-level models; Legion S3D execution and performance details; workflow, integration of external resources into the programming model.
2014-06-01
...in large-scale datasets such as might be obtained by monitoring a corporate network or social network. Identifying guilty actors, rather than payload-carrying objects, is entirely novel in steganalysis. ...implementation using Compute Unified Device Architecture (CUDA) on NVIDIA graphics cards. The key to good performance is to combine computations so that...
A High Performance VLSI Computer Architecture For Computer Graphics
NASA Astrophysics Data System (ADS)
Chin, Chi-Yuan; Lin, Wen-Tai
1988-10-01
A VLSI computer architecture consisting of multiple processors is presented in this paper to satisfy the demands of modern computer graphics, e.g., high resolution, realistic animation, and real-time display. All processors share a global memory, which is partitioned into multiple banks. Through a crossbar network, data from one memory bank can be broadcast to many processors. Processors are physically interconnected through a hyper-crossbar network (a crossbar-like network). By programming the network, the topology of communication links among processors can be reconfigured to satisfy the specific dataflows of different applications. Each processor consists of a controller, arithmetic operators, local memory, a local crossbar network, and I/O ports to communicate with other processors, memory banks, and a system controller. Operations in each processor are characterized into two modes, i.e., object domain and space domain, to fully utilize the data-independency characteristics of graphics processing. Special graphics features such as 3D-to-2D conversion, shadow generation, texturing, and reflection can be easily handled. With current high-density interconnection (MI) technology, it is feasible to implement a 64-processor system achieving 2.5 billion operations per second, a performance needed in most advanced graphics applications.
Gel, Aytekin; Hu, Jonathan; Ould-Ahmed-Vall, ElMoustapha; ...
2017-03-20
Legacy codes remain a crucial element of today's simulation-based engineering ecosystem due to the extensive validation process and investment in such software. The rapid evolution of high-performance computing architectures necessitates the modernization of these codes. One approach to modernization is a complete overhaul of the code. However, this could require extensive investments, such as rewriting in modern languages, new data constructs, etc., which will necessitate systematic verification and validation to re-establish the credibility of the computational models. The current study advocates using a more incremental approach and is a culmination of several modernization efforts of the legacy code MFIX, which is an open-source computational fluid dynamics code that has evolved over several decades, widely used in multiphase flows and still being developed by the National Energy Technology Laboratory. Two different modernization approaches, 'bottom-up' and 'top-down', are illustrated. Preliminary results show that up to 8.5x improvement at the selected kernel level was achieved with the first approach, and up to 50% improvement in total simulated time with the latter, for the demonstration cases and target HPC systems employed.
On some Aitken-like acceleration of the Schwarz method
NASA Astrophysics Data System (ADS)
Garbey, M.; Tromeur-Dervout, D.
2002-12-01
In this paper we present a family of domain decomposition methods based on Aitken-like acceleration of the Schwarz method seen as an iterative procedure with a linear rate of convergence. We first present the so-called Aitken-Schwarz procedure for linear differential operators. The solver can be a direct solver when applied to the Helmholtz problem with a five-point finite difference scheme on regular grids. We then introduce the Steffensen-Schwarz variant, which is an iterative domain decomposition solver that can be applied to linear and nonlinear problems. We show that these solvers have reasonable numerical efficiency compared to classical fast solvers for the Poisson problem or multigrid methods for more general linear and nonlinear elliptic problems. However, the salient feature of our method is that the algorithm has high tolerance to slow networks in the context of distributed parallel computing and is attractive, generally speaking, for use with computer architectures whose performance is limited by memory bandwidth rather than by the flop performance of the CPU. This is nowadays the case for most parallel computers using RISC processor architectures. We illustrate this highly desirable property of our algorithm with large-scale computing experiments.
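The paper applies the acceleration to the interface traces of the Schwarz iteration; the following scalar sketch only shows the underlying Aitken delta-squared idea on a linearly converging fixed-point iteration.

```python
# Aitken delta-squared acceleration shown on a scalar fixed-point iteration
# with linear convergence (the paper applies the same idea to the interface
# traces of the Schwarz iteration; this is only the scalar illustration).
import math

def aitken(x0, x1, x2):
    """Extrapolate a linearly converging sequence from three iterates."""
    return x2 - (x2 - x1) ** 2 / (x2 - 2 * x1 + x0)

g = math.cos                 # fixed point x* ~ 0.739085, linear convergence
x0 = 1.0
x1 = g(x0)
x2 = g(x1)

ref = x0
for _ in range(200):         # well-converged reference value
    ref = g(ref)

print("third plain iterate :", x2)
print("Aitken extrapolation:", aitken(x0, x1, x2))
print("converged reference :", ref)
```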
A computer architecture for intelligent machines
NASA Technical Reports Server (NTRS)
Lefebvre, D. R.; Saridis, G. N.
1992-01-01
The theory of intelligent machines proposes a hierarchical organization for the functions of an autonomous robot based on the principle of increasing precision with decreasing intelligence. An analytic formulation of this theory using information-theoretic measures of uncertainty for each level of the intelligent machine has been developed. The authors present a computer architecture that implements the lower two levels of the intelligent machine. The architecture supports an event-driven programming paradigm that is independent of the underlying computer architecture and operating system. Execution-level controllers for motion and vision systems are briefly addressed, as well as the Petri net transducer software used to implement coordination-level functions. A case study illustrates how this computer architecture integrates real-time and higher-level control of manipulator and vision systems.
An Object Oriented Extensible Architecture for Affordable Aerospace Propulsion Systems
NASA Technical Reports Server (NTRS)
Follen, Gregory J.; Lytle, John K. (Technical Monitor)
2002-01-01
Driven by a need to explore and develop propulsion systems whose simulation exceeded current computing capabilities, NASA Glenn embarked on a novel strategy leading to the development of an architecture that enables propulsion simulations never thought possible before. Full-engine, three-dimensional computational fluid dynamics propulsion system simulations were deemed impossible due to the impracticality of the hardware and software computing systems required. However, with a software paradigm shift and an embracing of parallel and distributed processing, an architecture was designed to meet the needs of future propulsion system modeling. The author suggests that the architecture designed at the NASA Glenn Research Center for propulsion system modeling has potential for impacting the direction of development of affordable weapons systems currently under consideration by the Applied Vehicle Technology Panel (AVT). This paper discusses the salient features of the NPSS architecture, including its interface layer, object layer, implementation for accessing legacy codes, numerical zooming infrastructure, and computing layer. The computing layer focuses on the use and deployment of these propulsion simulations on parallel and distributed computing platforms, which has been the focus of NASA Ames. Additional features of the object-oriented architecture that support multidisciplinary (MD) coupling, computer-aided design (CAD) access, and MD coupling objects will be discussed. Included will be a discussion of the successes, challenges, and benefits of implementing this architecture.
Lu, Xiaofeng; Song, Li; Shen, Sumin; He, Kang; Yu, Songyu; Ling, Nam
2013-07-17
Hough Transform has been widely used for straight line detection in low-definition and still images, but it suffers from execution time and resource requirements. Field Programmable Gate Arrays (FPGA) provide a competitive alternative for hardware acceleration to reap tremendous computing performance. In this paper, we propose a novel parallel Hough Transform (PHT) and FPGA architecture-associated framework for real-time straight line detection in high-definition videos. A resource-optimized Canny edge detection method with enhanced non-maximum suppression conditions is presented to suppress most possible false edges and obtain more accurate candidate edge pixels for subsequent accelerated computation. Then, a novel PHT algorithm exploiting spatial angle-level parallelism is proposed to upgrade computational accuracy by improving the minimum computational step. Moreover, the FPGA based multi-level pipelined PHT architecture optimized by spatial parallelism ensures real-time computation for 1,024 × 768 resolution videos without any off-chip memory consumption. This framework is evaluated on ALTERA DE2-115 FPGA evaluation platform at a maximum frequency of 200 MHz, and it can calculate straight line parameters in 15.59 ms on the average for one frame. Qualitative and quantitative evaluation results have validated the system performance regarding data throughput, memory bandwidth, resource, speed and robustness.
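A plain software reference of the straight-line Hough transform is sketched below, with the accumulator filled one angle at a time over all edge pixels; this loosely mirrors the angle-level parallelism the FPGA pipeline exploits, and all parameters are illustrative.

```python
# Software reference of the straight-line Hough transform with the accumulator
# built angle-by-angle (vectorised over all edge pixels). Parameters are
# illustrative and this is not the paper's FPGA implementation.
import numpy as np

def hough_lines(edge_img, n_theta=180, threshold=80):
    ys, xs = np.nonzero(edge_img)                      # candidate edge pixels
    diag = int(np.ceil(np.hypot(*edge_img.shape)))
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    acc = np.zeros((2 * diag, n_theta), dtype=np.int32)
    for j, th in enumerate(thetas):                    # each angle is independent
        rho = np.rint(xs * np.cos(th) + ys * np.sin(th)).astype(int) + diag
        np.add.at(acc[:, j], rho, 1)
    peaks = np.argwhere(acc >= threshold)
    return [(r - diag, thetas[j]) for r, j in peaks]   # (rho, theta) line parameters

# toy image with one horizontal line of edge pixels
img = np.zeros((100, 100), dtype=np.uint8)
img[40, 10:90] = 1
print(hough_lines(img, threshold=60))
```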
Some Problems and Solutions in Transferring Ecosystem Simulation Codes to Supercomputers
NASA Technical Reports Server (NTRS)
Skiles, J. W.; Schulbach, C. H.
1994-01-01
Many computer codes for the simulation of ecological systems have been developed in the last twenty-five years. This development took place initially on main-frame computers, then mini-computers, and more recently, on micro-computers and workstations. Recent recognition of ecosystem science as a High Performance Computing and Communications Program Grand Challenge area emphasizes supercomputers (both parallel and distributed systems) as the next set of tools for ecological simulation. Transferring ecosystem simulation codes to such systems is not a matter of simply compiling and executing existing code on the supercomputer since there are significant differences in the system architectures of sequential, scalar computers and parallel and/or vector supercomputers. To more appropriately match the application to the architecture (necessary to achieve reasonable performance), the parallelism (if it exists) of the original application must be exploited. We discuss our work in transferring a general grassland simulation model (developed on a VAX in the FORTRAN computer programming language) to a Cray Y-MP. We show the Cray shared-memory vector-architecture, and discuss our rationale for selecting the Cray. We describe porting the model to the Cray and executing and verifying a baseline version, and we discuss the changes we made to exploit the parallelism in the application and to improve code execution. As a result, the Cray executed the model 30 times faster than the VAX 11/785 and 10 times faster than a Sun 4 workstation. We achieved an additional speed-up of approximately 30 percent over the original Cray run by using the compiler's vectorizing capabilities and the machine's ability to put subroutines and functions "in-line" in the code. With the modifications, the code still runs at only about 5% of the Cray's peak speed because it makes ineffective use of the vector processing capabilities of the Cray. We conclude with a discussion and future plans.
Analytical Performance Modeling and Validation of Intel’s Xeon Phi Architecture
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chunduri, Sudheer; Balaprakash, Prasanna; Morozov, Vitali
Modeling the performance of scientific applications on emerging hardware plays a central role in achieving extreme-scale computing goals. Analytical models that capture the interaction between applications and hardware characteristics are attractive because even a reasonably accurate model can be useful for performance tuning before the hardware is made available. In this paper, we develop a hardware model for Intel's second-generation Xeon Phi architecture code-named Knights Landing (KNL) for the SKOPE framework. We validate the KNL hardware model by projecting the performance of mini-benchmarks and application kernels. The results show that our KNL model can project the performance with prediction errors of 10% to 20%. The hardware model also provides informative recommendations for code transformations and tuning.
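The SKOPE-based KNL model itself is not shown in the abstract; as a much simpler stand-in for the idea of analytical projection, the sketch below applies a roofline-style bound with placeholder (not measured) hardware numbers.

```python
# A far simpler analytical projection in the same spirit: a roofline-style
# bound from peak FLOP rate, memory bandwidth and a kernel's arithmetic
# intensity. The hardware numbers are placeholders, not measured KNL
# parameters, and this is not the SKOPE model itself.
PEAK_GFLOPS = 3000.0        # assumed double-precision peak (GFLOP/s)
BANDWIDTH_GBS = 400.0       # assumed high-bandwidth memory stream rate (GB/s)

def projected_gflops(flops, bytes_moved):
    """Roofline bound: min(peak, bandwidth * arithmetic intensity)."""
    intensity = flops / bytes_moved            # FLOP per byte
    return min(PEAK_GFLOPS, BANDWIDTH_GBS * intensity)

# example kernel: 7-point stencil, ~8 flops and ~16 bytes per grid point (assumed)
print(f"projected: {projected_gflops(8.0, 16.0):.0f} GFLOP/s")
```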
A Disciplined Architectural Approach to Scaling Data Analysis for Massive, Scientific Data
NASA Astrophysics Data System (ADS)
Crichton, D. J.; Braverman, A. J.; Cinquini, L.; Turmon, M.; Lee, H.; Law, E.
2014-12-01
Data collections across remote sensing and ground-based instruments in astronomy, Earth science, and planetary science are outpacing scientists' ability to analyze them. Furthermore, the distribution, structure, and heterogeneity of the measurements themselves pose challenges that limit the scalability of data analysis using traditional approaches. Methods for developing science data processing pipelines, distributing scientific datasets, and performing analysis will require innovative approaches that integrate cyber-infrastructure, algorithms, and data into more systematic approaches that can more efficiently compute and reduce data, particularly distributed data. This requires the integration of computer science, machine learning, statistics, and domain expertise to identify scalable architectures for data analysis. The size of data returned from Earth science observing satellites and the magnitude of data from climate model output are predicted to grow into the tens of petabytes, challenging current data analysis paradigms. This same kind of growth is present in astronomy and planetary science data. One of the major challenges in data science and related disciplines is defining new approaches to scaling systems and analysis in order to increase scientific productivity and yield. Specific needs include: 1) identification of optimized system architectures for analyzing massive, distributed data sets; 2) algorithms for systematic analysis of massive data sets in distributed environments; and 3) the development of software infrastructures that are capable of performing massive, distributed data analysis across a comprehensive data science framework. NASA/JPL has begun an initiative in data science to address these challenges. Our goal is to evaluate how scientific productivity can be improved through optimized architectural topologies that identify how to deploy and manage the access, distribution, computation, and reduction of massive, distributed data, while managing the uncertainties of scientific conclusions derived from such capabilities. This talk will provide an overview of JPL's efforts in developing a comprehensive architectural approach to data science.
Information processing using a single dynamical node as complex system
Appeltant, L.; Soriano, M.C.; Van der Sande, G.; Danckaert, J.; Massar, S.; Dambre, J.; Schrauwen, B.; Mirasso, C.R.; Fischer, I.
2011-01-01
Novel methods for information processing are highly desired in our information-driven society. Inspired by the brain's ability to process information, the recently introduced paradigm known as 'reservoir computing' shows that complex networks can efficiently perform computation. Here we introduce a novel architecture that reduces the usually required large number of elements to a single nonlinear node with delayed feedback. Through an electronic implementation, we experimentally and numerically demonstrate excellent performance in a speech recognition benchmark. Complementary numerical studies also show excellent performance for a time series prediction benchmark. These results prove that delay-dynamical systems, even in their simplest manifestation, can perform efficient information processing. This finding paves the way to feasible and resource-efficient technological implementations of reservoir computing. PMID:21915110
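A minimal numerical sketch of the delay-based reservoir idea is given below: a binary mask time-multiplexes the input across virtual nodes along the delay line, a single nonlinear map provides the dynamics, and a ridge-regression readout is trained on a toy memory task. All parameters are illustrative and this is not the paper's electronic implementation.

```python
# Minimal numerical sketch of a delay-based reservoir: one nonlinear map,
# N "virtual nodes" time-multiplexed along the delay line, ridge readout.
# Task, coupling constants and sizes are toy choices for illustration only.
import numpy as np

rng = np.random.default_rng(0)
N, T = 50, 2000                        # virtual nodes, time steps
mask = rng.choice([-0.5, 0.5], N)      # binary input mask over the delay line
u = rng.uniform(-0.8, 0.8, T)          # scalar input sequence
target = np.roll(u, 3)                 # toy task: recall the input 3 steps back

states = np.zeros((T, N))
x = np.zeros(N)                        # virtual-node states from the last period
for t in range(T):
    prev = x.copy()
    for i in range(N):
        carry = x[i - 1] if i > 0 else prev[N - 1]   # node inertia couples neighbours
        x[i] = np.tanh(0.6 * prev[i] + 0.4 * carry + mask[i] * u[t])
    states[t] = x

# ridge-regression readout: train on the first half, test on the second
A, y = states[10:T // 2], target[10:T // 2]
w = np.linalg.solve(A.T @ A + 1e-6 * np.eye(N), A.T @ y)
pred = states[T // 2:] @ w
nrmse = np.sqrt(np.mean((pred - target[T // 2:]) ** 2)) / np.std(target)
print("normalised test error:", round(float(nrmse), 3))
```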
NASA Astrophysics Data System (ADS)
Neff, John A.
1989-12-01
Experiments originating from Gestalt psychology have shown that representing information in a symbolic form provides a more effective means to understanding. Computer scientists have been struggling for the last two decades to determine how best to create, manipulate, and store collections of symbolic structures. In the past, much of this struggling led to software innovations because that was the path of least resistance. For example, the development of heuristics for organizing the searching through knowledge bases was much less expensive than building massively parallel machines that could search in parallel. That is now beginning to change with the emergence of parallel architectures which are showing the potential for handling symbolic structures. This paper will review the relationships between symbolic computing and parallel computing architectures, and will identify opportunities for optics to significantly impact the performance of such computing machines. Although neural networks are an exciting subset of massively parallel computing structures, this paper will not touch on this area since it is receiving a great deal of attention in the literature. That is, the concepts presented herein do not consider the distributed representation of knowledge.
Low-complexity nonlinear adaptive filter based on a pipelined bilinear recurrent neural network.
Zhao, Haiquan; Zeng, Xiangping; He, Zhengyou
2011-09-01
To reduce the computational complexity of the bilinear recurrent neural network (BLRNN), a novel low-complexity nonlinear adaptive filter with a pipelined bilinear recurrent neural network (PBLRNN) is presented in this paper. The PBLRNN, inheriting the modular architecture of the pipelined RNN proposed by Haykin and Li, comprises a number of BLRNN modules that are cascaded in a chained form. Each module is implemented by a small-scale BLRNN with internal dynamics. Since the modules of the PBLRNN can be executed simultaneously in a pipelined fashion, computational efficiency improves significantly. Moreover, because the modules are nested, the performance of the PBLRNN can be further improved. To suit the modular architecture, a modified adaptive-amplitude real-time recurrent learning algorithm is derived using the gradient descent approach. Extensive simulations are carried out to evaluate the performance of the PBLRNN on nonlinear system identification, nonlinear channel equalization, and chaotic time series prediction. Experimental results show that the PBLRNN provides considerably better performance than the single BLRNN and RNN models.
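The sketch below only illustrates the bilinear recurrence at the heart of a BLRNN module and a chained cascade of such modules, run serially here (the pipelined scheme lets the modules operate concurrently); the coefficients are random placeholders and a bounded nonlinearity is added to keep the toy cascade stable.

```python
# Sketch of the bilinear recurrence inside a BLRNN module and a chained
# cascade of such modules (run serially here; hardware pipelining would let
# the modules operate concurrently). Coefficients are random placeholders.
import numpy as np

rng = np.random.default_rng(2)

def blrnn_module(x, a, b, c):
    """y[n] = f( sum_i a_i y[n-1-i] + sum_j b_j x[n-j] + sum_ij c_ij y[n-1-i] x[n-j] )."""
    p, q = len(a), len(b)
    y = np.zeros(len(x))
    for n in range(len(x)):
        y_hist = np.array([y[n - 1 - i] if n - 1 - i >= 0 else 0.0 for i in range(p)])
        x_hist = np.array([x[n - j] if n - j >= 0 else 0.0 for j in range(q)])
        y[n] = np.tanh(a @ y_hist + b @ x_hist + y_hist @ c @ x_hist)
    return y

x = rng.standard_normal(200)
signal = x
for _ in range(3):                       # three cascaded small modules
    a = 0.1 * rng.standard_normal(2)
    b = 0.5 * rng.standard_normal(2)
    c = 0.05 * rng.standard_normal((2, 2))
    signal = blrnn_module(signal, a, b, c)
print(signal[:5])
```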
Advanced flight control system study
NASA Technical Reports Server (NTRS)
Hartmann, G. L.; Wall, J. E., Jr.; Rang, E. R.; Lee, H. P.; Schulte, R. W.; Ng, W. K.
1982-01-01
A fly by wire flight control system architecture designed for high reliability includes spare sensor and computer elements to permit safe dispatch with failed elements, thereby reducing unscheduled maintenance. A methodology capable of demonstrating that the architecture does achieve the predicted performance characteristics consists of a hierarchy of activities ranging from analytical calculations of system reliability and formal methods of software verification to iron bird testing followed by flight evaluation. Interfacing this architecture to the Lockheed S-3A aircraft for flight test is discussed. This testbed vehicle can be expanded to support flight experiments in advanced aerodynamics, electromechanical actuators, secondary power systems, flight management, new displays, and air traffic control concepts.
Canbay, Ferhat; Levent, Vecdi Emre; Serbes, Gorkem; Ugurdag, H Fatih; Goren, Sezer; Aydin, Nizamettin
2016-09-01
The authors aimed to develop an application for producing different architectures to implement the dual-tree complex wavelet transform (DTCWT), which has a near shift-invariance property. To obtain a low-cost and portable solution for implementing the DTCWT in multi-channel real-time applications, various embedded-system approaches are realised. For comparison, the DTCWT was implemented in C on a personal computer and on a PIC microcontroller; however, the former approach lacks portability and the latter cannot achieve the desired speed. Hence, implementation of the DTCWT on a reconfigurable platform such as a field programmable gate array, which provides portable, low-cost, low-power, and high-performance computing, is considered the most feasible solution. At first, the authors used Xilinx's System Generator DSP design tool for algorithm design; however, designs implemented with such tools are not optimised in terms of area and power. To overcome these drawbacks, they implemented the DTCWT algorithm in the Verilog Hardware Description Language, which has its own difficulties. To overcome these difficulties and to simplify the use of the proposed algorithms and the adaptation procedures, a code generator program that can produce different architectures is proposed.
Image-Processing Software For A Hypercube Computer
NASA Technical Reports Server (NTRS)
Lee, Meemong; Mazer, Alan S.; Groom, Steven L.; Williams, Winifred I.
1992-01-01
The Concurrent Image Processing Executive (CIPE) is a software system intended for developing and using image-processing application programs in a concurrent computing environment. Designed to shield the programmer from the complexities of the concurrent-system architecture, it provides an interactive image-processing environment for the end user. CIPE utilizes the architectural characteristics of a particular concurrent system to maximize efficiency while preserving architectural independence from the user and programmer. CIPE runs on a Mark-IIIfp 8-node hypercube computer and an associated SUN-4 host computer.
Experimental Comparison of Two Quantum Computing Architectures
2017-03-28
Experimental comparison of two quantum computing architectures. Norbert M. Linke, Dmitri ... the vast computing power a universal quantum computer could offer, several candidate systems are being explored. They have allowed experimental ... existing systems and the role of architecture in quantum computer design. These will be crucial for the realization of more advanced future incarna...
NASA Astrophysics Data System (ADS)
Petit, C.; Le Louarn, M.; Fusco, T.; Madec, P.-Y.
2011-09-01
Various tomographic control solutions have been proposed during the last decades to ensure efficient or even optimal closed-loop correction for tomographic Adaptive Optics (AO) concepts such as Laser Tomographic AO (LTAO) and Multi-Conjugate AO (MCAO). The optimal solution, based on the Linear Quadratic Gaussian (LQG) approach, as well as suboptimal but efficient solutions such as Pseudo-Open Loop Control (POLC), require multiple Matrix Vector Multiplications (MVM). Regardless of their respective performance, these efficient control solutions thus exhibit a strong increase in on-line complexity, and their implementation may become difficult in demanding cases. Among them, two cases are of particular interest. First, the system's Real-Time Computer architecture and implementation may be derived from past or present solutions and may not support multiple MVMs. This is the case of the AO Facility, whose RTC architecture is derived from the SPARTA platform and inherits its simple MVM architecture, which does not fit LTAO control solutions, for instance. Second, considering future systems such as Extremely Large Telescopes, the number of degrees of freedom is twenty to one hundred times larger than in present systems. In these conditions, tomographic control solutions can hardly be used in their standard form, and optimized implementations must be considered. Single-MVM tomographic control solutions represent a potential answer, and straightforward solutions such as Virtual Deformable Mirrors have already been proposed for LTAO, but with tuning issues. We investigate in this paper the possibility of deriving from tomographic control solutions, such as POLC or LQG, simplified control solutions that ensure a single-MVM architecture and could thus be implemented on today's systems or on future complex systems. We theoretically derive various solutions and analyze their respective performance on various systems through numerical simulation. We discuss the optimization of their performance and stability issues with respect to classic control solutions. We finally discuss off-line computation and implementation constraints.
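The tomographic reconstructors discussed are not reproduced here; the sketch below only shows the real-time pattern they must fit on such RTCs, namely a single matrix-vector multiply per frame applied to the slope vector followed by an integrator on the mirror commands, with a random placeholder control matrix and illustrative dimensions.

```python
# The real-time pattern the paper targets: one matrix-vector multiply per
# frame applied to the wavefront-sensor slope vector, followed by a simple
# leaky integrator on the deformable-mirror commands. The control matrix is
# a random placeholder, not a POLC/LQG reconstructor.
import numpy as np

rng = np.random.default_rng(3)
n_slopes, n_actuators = 2400, 1200          # illustrative system dimensions
R = rng.standard_normal((n_actuators, n_slopes)) * 1e-3   # placeholder control matrix

gain, leak = 0.4, 0.99
commands = np.zeros(n_actuators)

def rtc_frame(slopes):
    """One real-time frame: a single MVM plus an integrator update."""
    global commands
    commands = leak * commands - gain * (R @ slopes)      # the single MVM
    return commands

for _ in range(5):                          # a few simulated frames
    s = rng.standard_normal(n_slopes)       # stand-in measured slopes
    rtc_frame(s)
print(commands[:4])
```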
Lai, Chin-Feng; Chen, Min; Pan, Jeng-Shyang; Youn, Chan-Hyun; Chao, Han-Chieh
2014-03-01
As cloud computing and wireless body sensor network technologies gradually mature, ubiquitous healthcare services can prevent accidents instantly and effectively, as well as provide relevant information to reduce related processing time and cost. This study proposes a co-processing intermediary framework that integrates cloud and wireless body sensor networks, applied mainly to fall detection and 3-D motion reconstruction. The main focuses include distributed computing and resource allocation for processing sensing data over the computing architecture, network conditions, and performance evaluation. Through this framework, the transmission and computing time of sensing data are reduced to enhance overall performance for fall event detection and 3-D motion reconstruction services.
Architectural Strategies for Enabling Data-Driven Science at Scale
NASA Astrophysics Data System (ADS)
Crichton, D. J.; Law, E. S.; Doyle, R. J.; Little, M. M.
2017-12-01
The analysis of large data collections from NASA or other agencies is often executed through traditional computational and data analysis approaches, which require users to bring data to their desktops and perform local data analysis. Alternatively, data are hauled to large computational environments that provide centralized data analysis via traditional High Performance Computing (HPC). Scientific data archives, however, are not only growing massive, but are also becoming highly distributed. Neither traditional approach provides a good solution for optimizing analysis into the future. Longstanding assumptions across the NASA mission and science data lifecycle, namely that all data can be collected, transmitted, processed, and archived, will not scale as more capable instruments stress legacy-based systems. New paradigms are needed to increase the productivity and effectiveness of scientific data analysis. Such a paradigm must recognize that architectural and analytical choices are interrelated, and must be carefully coordinated in any system that aims to allow efficient, interactive scientific exploration and discovery to exploit massive data collections, from point of collection (e.g., onboard) to analysis and decision support. The most effective approach to analyzing a distributed set of massive data may involve some exploration and iteration, putting a premium on the flexibility afforded by the architectural framework. The framework should enable scientist users to assemble workflows efficiently, manage the uncertainties related to data analysis and inference, and optimize deep-dive analytics to enhance scalability. In many cases, this "data ecosystem" needs to be able to integrate multiple observing assets, ground environments, archives, and analytics, evolving from stewardship of measurements to using computational methodologies to better derive insight from data that may be fused with other data sets. This presentation will discuss architectural strategies, including a 2015-2016 NASA AIST Study on Big Data, for evolving scientific research towards massively distributed data-driven discovery. It will include example use cases across earth science, planetary science, and other disciplines.
An ICAI architecture for troubleshooting in complex, dynamic systems
NASA Technical Reports Server (NTRS)
Fath, Janet L.; Mitchell, Christine M.; Govindaraj, T.
1990-01-01
Ahab, an intelligent computer-aided instruction (ICAI) program, illustrates an architecture for simulator-based ICAI programs to teach troubleshooting in complex, dynamic environments. The architecture posits three elements of a computerized instructor: the task model, the student model, and the instructional module. The task model is a prescriptive model of expert performance that uses symptomatic and topographic search strategies to provide students with directed problem-solving aids. The student model is a descriptive model of student performance in the context of the task model. This student model compares the student and task models, critiques student performance, and provides interactive performance feedback. The instructional module coordinates information presented by the instructional media, the task model, and the student model so that each student receives individualized instruction. Concept and metaconcept knowledge that supports these elements is contained in frames and production rules, respectively. The results of an experimental evaluation are discussed. They support the hypothesis that training with an adaptive online system built using the Ahab architecture produces better performance than training using simulator practice alone, at least with unfamiliar problems. It is not sufficient to develop an expert strategy and present it to students using offline materials. The training is most effective if it adapts to individual student needs.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Amadio, G.; et al.
An intensive R&D and programming effort is required to accomplish new challenges posed by future experimental high-energy particle physics (HEP) programs. The GeantV project aims to narrow the gap between the performance of the existing HEP detector simulation software and the ideal performance achievable, exploiting latest advances in computing technology. The project has developed a particle detector simulation prototype capable of transporting in parallel particles in complex geometries exploiting instruction level microparallelism (SIMD and SIMT), task-level parallelism (multithreading) and high-level parallelism (MPI), leveraging both the multi-core and the many-core opportunities. We present preliminary verification results concerning the electromagnetic (EM) physics models developed for parallel computing architectures within the GeantV project. In order to exploit the potential of vectorization and accelerators and to make the physics model effectively parallelizable, advanced sampling techniques have been implemented and tested. In this paper we introduce a set of automated statistical tests in order to verify the vectorized models by checking their consistency with the corresponding Geant4 models and to validate them against experimental data.
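The GeantV test suite is not reproduced in the abstract; the snippet below, with placeholder samplers standing in for the vectorized and reference physics models, shows the flavour of such an automated consistency check using a two-sample Kolmogorov-Smirnov test from SciPy.

```python
# The flavour of the automated consistency checks described: draw samples from
# a "vectorised" sampler and a reference sampler and require that a two-sample
# Kolmogorov-Smirnov test does not reject their agreement. The samplers below
# are placeholders, not GeantV or Geant4 physics models.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(4)

def reference_sampler(n):                 # stand-in for the reference model
    return rng.exponential(scale=1.0, size=n)

def vectorised_sampler(n):                # stand-in for the vectorised model
    return -np.log(rng.random(n))         # same exponential law, different code path

def check_consistency(n=100_000, alpha=0.01):
    stat, p_value = ks_2samp(reference_sampler(n), vectorised_sampler(n))
    return p_value > alpha                # pass if we cannot reject agreement

print("consistency check passed:", check_consistency())
```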
A Programming Framework for Scientific Applications on CPU-GPU Systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Owens, John
2013-03-24
At a high level, my research interests center on designing, programming, and evaluating computer systems that use new approaches to solve interesting problems. The rapid change of technology allows a variety of architectural approaches to computationally difficult problems, and a constantly shifting set of constraints and trends makes the solutions to these problems both challenging and interesting. One of the most important recent trends in computing has been the move to commodity parallel architectures. This sea change is motivated by the industry's inability to continue profitably increasing single-processor performance, which has pushed it toward multiple parallel processors. In the period of review, my most significant work has been leading a research group investigating the use of the graphics processing unit (GPU) as a general-purpose processor. GPUs can potentially deliver better performance than their CPU counterparts on a broad range of problems, but effectively mapping complex applications to a parallel programming model in an emerging programming environment is a significant and important research problem.
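As a minimal illustration of mapping a data-parallel computation onto a GPU, the following Python sketch offloads an elementwise kernel with CuPy and checks the result against a CPU reference. It assumes the optional CuPy package and a CUDA-capable device are available, and it is unrelated to the specific framework reported in this work.

```python
# Minimal sketch of offloading a data-parallel kernel to a GPU (not the reported
# framework). Assumes CuPy and a CUDA-capable device; the computation is a trivial
# stand-in for a real application kernel.
import numpy as np
import cupy as cp

x_cpu = np.random.rand(1_000_000).astype(np.float32)

x_gpu = cp.asarray(x_cpu)              # copy host array to device memory
y_gpu = cp.sqrt(x_gpu) * 2.0 + 1.0     # elementwise kernel runs on the GPU
y_cpu = cp.asnumpy(y_gpu)              # copy result back to the host

# Check against a CPU reference computation.
np.testing.assert_allclose(y_cpu, np.sqrt(x_cpu) * 2.0 + 1.0, rtol=1e-5)
print("GPU and CPU results agree")
```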
An Evaluation of an Ada Implementation of the Rete Algorithm for Embedded Flight Processors
1990-12-01
computers was desired. The VAX VMS operating system has many built-in methods for determining program performance (including VAX PCA), but these methods... overview of the target environment -- the MIL-STD-1750A VHSIC Avionic Modular Processor (VAMP), running under the Ada Avionics Real-Time Software (AARTS)... computers. MIL-STD-1750A, the Air Force's standard flight computer architecture, however, places severe constraints on applications software processing
High-performance computing in image registration
NASA Astrophysics Data System (ADS)
Zanin, Michele; Remondino, Fabio; Dalla Mura, Mauro
2012-10-01
Thanks to recent technological advances, a large variety of image data with varying geometric, radiometric and temporal resolution is at our disposal. In many applications, processing such images requires high-performance computing techniques in order to deliver timely responses, e.g. for rapid decisions or real-time actions. Thus, parallel or distributed computing methods, Digital Signal Processor (DSP) architectures, Graphics Processing Unit (GPU) programming and Field-Programmable Gate Array (FPGA) devices have become essential tools for the challenging task of processing large amounts of geo-data. The article focuses on the processing and registration of large datasets of terrestrial and aerial images for 3D reconstruction, diagnostic purposes and environmental monitoring. For the image alignment procedure, sets of corresponding feature points need to be extracted automatically in order to subsequently compute the geometric transformation that aligns the data. Feature extraction and matching are among the most computationally demanding operations in the processing chain; thus, a high degree of automation and speed is mandatory. The details of the implemented operations (named LARES), which exploit parallel architectures and the GPU, are presented. The innovative aspects of the implementation are (i) its effectiveness on a large variety of unorganized and complex datasets, (ii) its capability to work with high-resolution images and (iii) the speed of the computations. Examples and comparisons with standard CPU processing are also reported and discussed.
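A minimal, CPU-only sketch of the feature-based alignment pipeline the abstract describes (feature extraction, matching, and robust transformation estimation) is given below using OpenCV. It is not the parallel LARES implementation, and the image file names are placeholders.

```python
# Minimal sketch of the feature-based alignment pipeline described above: detect
# feature points, match them, and estimate the aligning transformation. Plain
# CPU/OpenCV illustration, not the parallel LARES implementation; the image file
# names are placeholders.
import cv2
import numpy as np

ref = cv2.imread("reference.tif", cv2.IMREAD_GRAYSCALE)   # placeholder paths
mov = cv2.imread("moving.tif", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=5000)                       # feature extraction
kp_ref, des_ref = orb.detectAndCompute(ref, None)
kp_mov, des_mov = orb.detectAndCompute(mov, None)

matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True) # feature matching
matches = sorted(matcher.match(des_mov, des_ref), key=lambda m: m.distance)[:500]

src = np.float32([kp_mov[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([kp_ref[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

# Robustly estimate the geometric transformation (here a homography) with RANSAC,
# then warp the moving image onto the reference frame.
H, inliers = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
aligned = cv2.warpPerspective(mov, H, (ref.shape[1], ref.shape[0]))
cv2.imwrite("aligned.tif", aligned)
```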